neuralmonkey.trainers.rl_trainer module

Training objectives for reinforcement learning.

class neuralmonkey.trainers.rl_trainer.ReinforceObjective(decoder: neuralmonkey.decoders.decoder.Decoder, reward_function: Callable[[numpy.ndarray, numpy.ndarray], numpy.ndarray], subtract_baseline: bool = False, normalize: bool = False, temperature: float = 1.0, ce_smoothing: float = 0.0, alpha: float = 1.0, sample_size: int = 1) → None

Bases: neuralmonkey.trainers.objective.Objective

__init__(decoder: neuralmonkey.decoders.decoder.Decoder, reward_function: Callable[[numpy.ndarray, numpy.ndarray], numpy.ndarray], subtract_baseline: bool = False, normalize: bool = False, temperature: float = 1.0, ce_smoothing: float = 0.0, alpha: float = 1.0, sample_size: int = 1) → None

Construct an RL objective for training with sentence-level feedback.

Depending on the options, the objective corresponds to one of the following:

1) sample_size = 1, normalize = False, ce_smoothing = 0.0:
   the bandit objective (Eq. 2) described in ‘Bandit Structured Prediction for Neural Sequence-to-Sequence Learning’ (http://www.aclweb.org/anthology/P17-1138). It is recommended to set subtract_baseline = True.
2) sample_size > 1, normalize = True, ce_smoothing = 0.0:
   Minimum Risk Training as described in ‘Minimum Risk Training for Neural Machine Translation’ (http://www.aclweb.org/anthology/P16-1159), Eq. 12.
3) sample_size > 1, normalize = False, ce_smoothing = 0.0:
   the Google ‘Reinforce’ objective as proposed in ‘Google’s NMT System: Bridging the Gap between Human and Machine Translation’ (https://arxiv.org/pdf/1609.08144.pdf), Eq. 8.
4) sample_size > 1, normalize = False, ce_smoothing > 0.0:
   Google’s ‘Mixed’ objective from the same paper (Eq. 9), where ce_smoothing implements alpha.

Note that ‘alpha’ controls the sharpness of the normalized distribution while ‘temperature’ controls the sharpness during sampling.
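The following is a minimal sketch of the four configurations listed above. It assumes a recurrent decoder instance (here called decoder) and a sentence-level reward function (here called bleu_reward) are already available; both names, as well as the sample sizes and coefficient values, are illustrative only and not prescribed by the module:

    from neuralmonkey.trainers.rl_trainer import ReinforceObjective

    # 1) Bandit objective (Eq. 2); subtracting the baseline is recommended
    bandit = ReinforceObjective(
        decoder=decoder, reward_function=bleu_reward,
        sample_size=1, normalize=False, ce_smoothing=0.0,
        subtract_baseline=True)

    # 2) Minimum Risk Training (Eq. 12)
    mrt = ReinforceObjective(
        decoder=decoder, reward_function=bleu_reward,
        sample_size=5, normalize=True, ce_smoothing=0.0)

    # 3) Google's 'Reinforce' objective (Eq. 8)
    reinforce = ReinforceObjective(
        decoder=decoder, reward_function=bleu_reward,
        sample_size=5, normalize=False, ce_smoothing=0.0)

    # 4) Google's 'Mixed' objective (Eq. 9); ce_smoothing plays the role of alpha
    mixed = ReinforceObjective(
        decoder=decoder, reward_function=bleu_reward,
        sample_size=5, normalize=False, ce_smoothing=0.25)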

Parameters:
  • decoder – a recurrent decoder to sample from
  • reward_function – any evaluator object
  • subtract_baseline – the average reward is subtracted from the obtained reward
  • normalize – the probabilities of the samples are re-normalized
  • sample_size – number of samples to obtain feedback for
  • ce_smoothing – add cross-entropy with this coefficient to loss
  • alpha – determines the shape of the normalized distribution
  • temperature – the softmax temperature for sampling
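Per the type annotation, the reward function is a callable taking two numpy arrays (references and sampled hypotheses) and returning a per-sentence reward array. The sketch below is a toy illustration only, assuming both inputs are integer token arrays of the same shape (batch, time) and that the argument order is (references, hypotheses); the token-overlap metric is not the evaluator used in practice:

    import numpy as np

    def token_overlap_reward(references: np.ndarray,
                             hypotheses: np.ndarray) -> np.ndarray:
        # Fraction of positions where the sampled token matches the reference,
        # computed per sentence; returns one reward per batch item.
        matches = (references == hypotheses).astype(np.float32)
        return matches.mean(axis=1)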
loss

Return the loss tensor fetched by the trainer.

neuralmonkey.trainers.rl_trainer.rl_objective(*args, **kwargs) → neuralmonkey.trainers.rl_trainer.ReinforceObjective
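
Judging from its signature, rl_objective appears to be a convenience wrapper that forwards its arguments to the ReinforceObjective constructor. A hypothetical call, reusing the illustrative decoder and bleu_reward names from the sketch above:

    from neuralmonkey.trainers.rl_trainer import rl_objective

    objective = rl_objective(decoder=decoder, reward_function=bleu_reward,
                             sample_size=5, normalize=True)
    loss_tensor = objective.loss  # the loss tensor fetched by the trainer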