neuralmonkey.trainers.rl_trainer module
Training objectives for reinforcement learning.

neuralmonkey.trainers.rl_trainer.rl_objective(decoder: neuralmonkey.decoders.decoder.Decoder, reward_function: Callable[[numpy.ndarray, numpy.ndarray], numpy.ndarray], subtract_baseline: bool = False, normalize: bool = False, temperature: float = 1.0, ce_smoothing: float = 0.0, alpha: float = 1.0, sample_size: int = 1) → neuralmonkey.trainers.generic_trainer.Objective
Construct RL objective for training with sentence-level feedback.
Depending on the options, the objective corresponds to:
1) sample_size = 1, normalize = False, ce_smoothing = 0.0
   Bandit objective (Eq. 2) described in ‘Bandit Structured Prediction for Neural Sequence-to-Sequence Learning’ (http://www.aclweb.org/anthology/P17-1138). It’s recommended to set subtract_baseline = True.
2) sample_size > 1, normalize = True, ce_smoothing = 0.0
   Minimum Risk Training as described in ‘Minimum Risk Training for Neural Machine Translation’ (http://www.aclweb.org/anthology/P16-1159) (Eq. 12).
3) sample_size > 1, normalize = False, ce_smoothing = 0.0
   The Google ‘Reinforce’ objective as proposed in ‘Google’s NMT System: Bridging the Gap between Human and Machine Translation’ (https://arxiv.org/pdf/1609.08144.pdf) (Eq. 8).
4) sample_size > 1, normalize = False, ce_smoothing > 0.0
   Google’s ‘Mixed’ objective in the above paper (Eq. 9), where ce_smoothing plays the role of the paper’s alpha (the weight of the cross-entropy term), not of this function’s alpha parameter.

Note that the ‘alpha’ parameter controls the sharpness of the normalized distribution, while ‘temperature’ controls the sharpness during sampling.
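The option combinations above can be illustrated with a small pure-Python sketch for a single source sentence. It is one plausible reading of the documented options, not the library's implementation: sketch_rl_loss, its baseline argument (standing in for the average reward that subtract_baseline would subtract) and its ce_loss argument (a cross-entropy loss computed elsewhere) are hypothetical names introduced for illustration only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sketch_rl_loss(sent_logprobs, rewards, normalize=False, alpha=1.0,
                   baseline=0.0, ce_loss=0.0, ce_smoothing=0.0):
    """Illustrative loss for one source sentence, one entry per sample.

    Assumption: without normalization, each sample is scored by its
    sentence log-probability; with normalization, by its probability
    renormalized over the sample set.
    """
    if normalize:
        # alpha scales the log-probabilities before renormalizing,
        # so larger values give a sharper distribution over samples.
        scores = softmax([alpha * lp for lp in sent_logprobs])
    else:
        scores = list(sent_logprobs)
    # Negative reward-weighted score, summed over the samples.
    loss = sum(-(r - baseline) * s for r, s in zip(rewards, scores))
    # Optional cross-entropy term, weighted by ce_smoothing.
    return loss + ce_smoothing * ce_loss


logprobs = [-1.0, -2.0]   # sentence log-probabilities of two samples
rewards = [1.0, 0.0]      # sentence-level rewards of the same samples

reinforce_like = sketch_rl_loss(logprobs, rewards)
mrt_like = sketch_rl_loss(logprobs, rewards, normalize=True)
mixed_like = sketch_rl_loss(logprobs, rewards, ce_loss=2.0, ce_smoothing=0.5)
```

In this sketch, alpha only has an effect when normalize is set, which mirrors the note that alpha shapes the normalized distribution whereas temperature acts earlier, at sampling time.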
Parameters:  decoder – a recurrent decoder to sample from
 reward_function – any evaluator object; per the signature, it takes two numpy arrays and returns a numpy array of rewards
 subtract_baseline – if True, the average reward is subtracted from the obtained reward
 normalize – if True, the probabilities of the samples are renormalized
 sample_size – number of samples to obtain feedback for
 ce_smoothing – coefficient with which the cross-entropy loss is added to the loss
 alpha – determines the shape of the normalized distribution
 temperature – the softmax temperature for sampling
Returns: Objective object to be used in the generic trainer
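As a rough illustration of the reward_function type in the signature, the sketch below defines a callable that maps two numpy arrays to an array of per-sentence rewards. The function name position_match_reward, the (batch, time) layout and the use of integer token IDs are assumptions made for this example; the documentation only states the type signature and that any evaluator object can be used.

```python
import numpy as np


def position_match_reward(references: np.ndarray,
                          hypotheses: np.ndarray) -> np.ndarray:
    """Fraction of positions where hypothesis and reference agree.

    Assumes both inputs are integer arrays of identical shape
    (batch, time); this layout is illustrative only.
    """
    matches = (references == hypotheses).astype(np.float32)
    # One reward per sentence: average agreement over the time axis.
    return matches.mean(axis=1)


references = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
hypotheses = np.array([[1, 2, 0, 0], [5, 6, 7, 8]])
example_rewards = position_match_reward(references, hypotheses)
```

A callable of this shape would be passed as the reward_function argument, for example together with subtract_baseline = True when sample_size = 1, as recommended for the bandit objective.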