neuralmonkey.attention.scaled_dot_product module¶
The scaled dot-product attention mechanism defined in Vaswani et al. (2017).
The attention energies are computed as dot products between the query vector and the key vector. The query vector is scaled down by the square root of its dimensionality. This attention function has no trainable parameters.
See arxiv.org/abs/1706.03762
-
class
neuralmonkey.attention.scaled_dot_product.
MultiHeadAttention
(name: str, n_heads: int, keys_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful], values_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful] = None, dropout_keep_prob: float = 1.0, reuse: neuralmonkey.model.model_part.ModelPart = None, save_checkpoint: str = None, load_checkpoint: str = None, initializers: List[Tuple[str, Callable]] = None) → None¶ Bases:
neuralmonkey.attention.base_attention.BaseAttention
-
__init__
(name: str, n_heads: int, keys_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful], values_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful] = None, dropout_keep_prob: float = 1.0, reuse: neuralmonkey.model.model_part.ModelPart = None, save_checkpoint: str = None, load_checkpoint: str = None, initializers: List[Tuple[str, Callable]] = None) → None¶ Create a new
BaseAttention
object.
-
attention
(query: tensorflow.python.framework.ops.Tensor, decoder_prev_state: tensorflow.python.framework.ops.Tensor, decoder_input: tensorflow.python.framework.ops.Tensor, loop_state: neuralmonkey.attention.namedtuples.MultiHeadLoopState) → Tuple[tensorflow.python.framework.ops.Tensor, neuralmonkey.attention.namedtuples.MultiHeadLoopState]¶ Run a multi-head attention getting context vector for a given query.
This method is an API wrapper for the global function ‘attention’ defined in this module. It transforms a query of shape (batch, query_size) to shape (batch, 1, query_size) and applies the attention function. The output context has shape (batch, 1, value_size) and the weights have shape (batch, n_heads, 1, time(k)). The output is then processed to produce the vector of contexts and the next attention loop state.
Parameters: - query – Input query for the current decoding step of shape(batch, query_size).
- decoder_prev_state – Previous state of the decoder.
- decoder_input – Input to the RNN cell of the decoder.
- loop_state – Attention loop state.
Returns: Vector of contexts and the following attention loop state.
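The shape handling described above can be illustrated with a minimal NumPy sketch. It is not the Neural Monkey implementation: the sizes are arbitrary, the zero-filled `context_3d` array merely stands in for the output of the attention function, and squeezing the time axis afterwards is an assumption about how the output is "processed".

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
batch, query_size, value_size = 4, 16, 32

# Query for a single decoding step: (batch, query_size).
query = np.zeros((batch, query_size))

# Add a time axis of length 1: (batch, 1, query_size).
query_3d = np.expand_dims(query, axis=1)

# Stand-in for the attention output of shape (batch, 1, value_size).
context_3d = np.zeros((batch, 1, value_size))

# Assumed post-processing: drop the time axis again, giving (batch, value_size).
context = np.squeeze(context_3d, axis=1)

print(query_3d.shape, context.shape)
```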
-
attention_keys
-
attention_mask
-
attention_values
-
context_vector_size
¶ Return the static size of the context vector.
Returns: An integer specifying the context vector dimension.
-
finalize_loop
(key: str, last_loop_state: neuralmonkey.attention.namedtuples.MultiHeadLoopState) → None¶ Store the attention histories from loop state under a given key.
Parameters: - key – The key to the histories dictionary to store the data in.
- last_loop_state – The loop state object from the last state of the decoding loop.
-
initial_loop_state
() → neuralmonkey.attention.namedtuples.MultiHeadLoopState¶ Get initial loop state for the attention object.
Returns: The newly created initial loop state object.
-
visualize_attention
(key: str, max_outputs: int = 16) → None¶ Include the attention histories under a given key into a summary.
Parameters: - key – The key to the attention histories dictionary.
- max_outputs – Maximum number of images to save.
-
-
class
neuralmonkey.attention.scaled_dot_product.
ScaledDotProdAttention
(name: str, keys_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful], values_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful] = None, dropout_keep_prob: float = 1.0, reuse: neuralmonkey.model.model_part.ModelPart = None, save_checkpoint: str = None, load_checkpoint: str = None, initializers: List[Tuple[str, Callable]] = None) → None¶ Bases:
neuralmonkey.attention.scaled_dot_product.MultiHeadAttention
-
__init__
(name: str, keys_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful], values_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful] = None, dropout_keep_prob: float = 1.0, reuse: neuralmonkey.model.model_part.ModelPart = None, save_checkpoint: str = None, load_checkpoint: str = None, initializers: List[Tuple[str, Callable]] = None) → None¶ Create a new
BaseAttention
object.
-
-
neuralmonkey.attention.scaled_dot_product.
attention
(queries: tensorflow.python.framework.ops.Tensor, keys: tensorflow.python.framework.ops.Tensor, values: tensorflow.python.framework.ops.Tensor, keys_mask: tensorflow.python.framework.ops.Tensor, num_heads: int, dropout_callback: Callable[[tensorflow.python.framework.ops.Tensor], tensorflow.python.framework.ops.Tensor], masked: bool = False, use_bias: bool = False) → tensorflow.python.framework.ops.Tensor¶ Run multi-head scaled dot-product attention.
See arxiv.org/abs/1706.03762
When performing multi-head attention, the query, key and value vectors are first split into sets of smaller vectors, one for each attention head. Next, they are transformed using a linear layer and a separate attention (from a corresponding head) is applied on each transformed triple of query, key and value. The resulting contexts from each head are then concatenated and a linear layer is applied on this concatenated output. This can be summarized by the following equations:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_o
head_i = Attention(Q * W_Q_i, K * W_K_i, V * W_V_i)
The scaled dot-product attention computes a dot product between the query and the transposed key vector. The result is then scaled down by the square root of the key dimension d_k and a softmax layer is applied. Finally, the output of the softmax layer is multiplied by the value vector. See the following equation:
Attention(Q, K, V) = softmax(Q * K^T / √(d_k)) * V
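The equation above can be sketched for a single head in NumPy. This is an illustrative sketch rather than the TensorFlow code of this module; the function names `softmax` and `scaled_dot_product` are hypothetical, and masking and dropout are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def scaled_dot_product(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for batched 3D inputs.

    queries: (batch, time_q, d_k)
    keys:    (batch, time_k, d_k)
    values:  (batch, time_k, d_v)
    """
    d_k = keys.shape[-1]
    # Energies have shape (batch, time_q, time_k).
    energies = queries @ np.swapaxes(keys, -1, -2) / np.sqrt(d_k)
    # Each row of weights sums to one over the key positions.
    weights = softmax(energies, axis=-1)
    # Contexts have shape (batch, time_q, d_v).
    return weights @ values, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 8))
k = rng.normal(size=(2, 5, 8))
v = rng.normal(size=(2, 5, 4))
contexts, weights = scaled_dot_product(q, k, v)
print(contexts.shape, weights.shape)
```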
Parameters: - queries – Input queries of shape (batch, time(q), k_channels).
- keys – Input keys of shape (batch, time(k), k_channels).
- values – Input values of shape (batch, time(k), v_channels).
- keys_mask – A float Tensor for masking sequences in keys.
- num_heads – Number of attention heads.
- dropout_callback – Callable function implementing dropout.
- masked – Boolean indicating whether we want to mask future energies.
- use_bias – If True, enable bias in the attention head projections (for all queries, keys and values).
Returns: Contexts of shape (batch, time(q), v_channels) and weights of shape (batch, time(q), time(k)).
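The multi-head equations can be sketched in NumPy as follows. This sketch rests on several assumptions and is not the module's implementation: it follows the equations by projecting with one matrix per input and slicing the result into heads (equivalent to using per-head matrices W_Q_i, W_K_i, W_V_i), it omits the key mask, future masking, dropout and bias, and it returns per-head weights of shape (batch, n_heads, time(q), time(k)). All names (`split_heads`, `multi_head_attention`, `w_q`, `w_k`, `w_v`, `w_o`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def split_heads(x, n_heads):
    """Reshape (batch, time, dim) to (batch, n_heads, time, dim // n_heads)."""
    batch, time, dim = x.shape
    return x.reshape(batch, time, n_heads, dim // n_heads).transpose(0, 2, 1, 3)


def multi_head_attention(queries, keys, values, w_q, w_k, w_v, w_o, n_heads):
    # Linear projections, then one slice of the projection per head.
    q = split_heads(queries @ w_q, n_heads)
    k = split_heads(keys @ w_k, n_heads)
    v = split_heads(values @ w_v, n_heads)

    # Scaled dot-product attention, applied to all heads at once.
    d_k = q.shape[-1]
    energies = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(energies, axis=-1)  # (batch, n_heads, time_q, time_k)
    heads = weights @ v                   # (batch, n_heads, time_q, head_dim)

    # Concatenate the heads and apply the output projection W_o.
    batch, _, time_q, _ = heads.shape
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, time_q, -1)
    return concat @ w_o, weights


rng = np.random.default_rng(0)
dim, n_heads = 8, 2
queries = rng.normal(size=(2, 3, dim))
keys = rng.normal(size=(2, 5, dim))
values = rng.normal(size=(2, 5, dim))
w_q, w_k, w_v, w_o = (rng.normal(size=(dim, dim)) for _ in range(4))

contexts, weights = multi_head_attention(
    queries, keys, values, w_q, w_k, w_v, w_o, n_heads)
print(contexts.shape, weights.shape)
```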
-
neuralmonkey.attention.scaled_dot_product.
empty_multi_head_loop_state
(batch_size: Union[int, tensorflow.python.framework.ops.Tensor], num_heads: Union[int, tensorflow.python.framework.ops.Tensor], length: Union[int, tensorflow.python.framework.ops.Tensor], dimension: Union[int, tensorflow.python.framework.ops.Tensor]) → neuralmonkey.attention.namedtuples.MultiHeadLoopState¶
-
neuralmonkey.attention.scaled_dot_product.
mask_energies
(energies_4d: tensorflow.python.framework.ops.Tensor, mask: tensorflow.python.framework.ops.Tensor, mask_value=-1000000000.0) → tensorflow.python.framework.ops.Tensor¶ Apply mask to the attention energies before passing to softmax.
Parameters: - energies_4d – Energies of shape (batch, n_heads, time(q), time(k)).
- mask – Float Tensor of zeros and ones of shape (batch, time(k)), specifying valid positions in the energies tensor.
- mask_value – Value used to mask energies. The default value is taken from tensor2tensor.
Returns: Energies (logits) of valid positions. Same shape as energies_4d.
Note: We do not use mask_value=-np.inf to avoid potential underflow.
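One way to apply such a mask is sketched below in NumPy. The formula (keep valid energies, replace the rest with `mask_value`) is an assumption about how the masking could be done, not a copy of this module's code, and `mask_energies_sketch` is a hypothetical name.

```python
import numpy as np


def mask_energies_sketch(energies_4d, mask, mask_value=-1e9):
    """Replace energies at invalid key positions with mask_value.

    energies_4d: (batch, n_heads, time_q, time_k)
    mask:        (batch, time_k), ones for valid positions, zeros for padding
    """
    # Broadcast the mask over heads and query positions:
    # (batch, time_k) -> (batch, 1, 1, time_k).
    mask_4d = mask[:, np.newaxis, np.newaxis, :]
    return energies_4d * mask_4d + (1.0 - mask_4d) * mask_value


energies = np.zeros((1, 2, 3, 4))
mask = np.array([[1.0, 1.0, 0.0, 0.0]])
masked = mask_energies_sketch(energies, mask)

# After a softmax over the key axis, masked positions get zero weight.
exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
print(weights[0, 0, 0])
```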
-
neuralmonkey.attention.scaled_dot_product.
mask_future
(energies: tensorflow.python.framework.ops.Tensor, mask_value=-1000000000.0) → tensorflow.python.framework.ops.Tensor¶ Mask energies of keys using lower triangular matrix.
The mask simulates autoregressive decoding by preventing the attention from looking at positions that have not yet been decoded. The mask is not necessary during training when true output values are used instead of the decoded ones.
Parameters: - energies – A tensor to mask.
- mask_value – Value used to mask energies.
Returns: Masked energies tensor.
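A lower triangular mask of this kind can be sketched in NumPy as follows. The sketch assumes that the last two axes of the energies are (time(q), time(k)) and that masked positions are set to `mask_value`; `mask_future_sketch` is a hypothetical name, not this module's function.

```python
import numpy as np


def mask_future_sketch(energies, mask_value=-1e9):
    """Mask energies where the key position lies after the query position."""
    time_q, time_k = energies.shape[-2:]
    # Ones on and below the diagonal, zeros above it.
    lower = np.tril(np.ones((time_q, time_k)))
    # The 2D mask broadcasts over any leading batch and head axes.
    return energies * lower + (1.0 - lower) * mask_value


energies = np.zeros((1, 1, 3, 3))
masked = mask_future_sketch(energies)
print(masked[0, 0])
```

In the printed matrix, row i keeps the energies for key positions 0 to i and holds `mask_value` everywhere to the right of the diagonal.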
-
neuralmonkey.attention.scaled_dot_product.
split_for_heads
(x: tensorflow.python.framework.ops.Tensor, n_heads: int, head_dim: int) → tensorflow.python.framework.ops.Tensor¶ Split a tensor for multi-head attention.
Split the last dimension of a 3D tensor of shape (batch, time, dim) and return a 4D tensor of shape (batch, n_heads, time, dim/n_heads).
Parameters: - x – Input Tensor of shape (batch, time, dim).
- n_heads – Number of attention heads.
- head_dim – Dimension of the attention heads.
Returns: A 4D Tensor of shape (batch, n_heads, time, head_dim), where head_dim = dim/n_heads.
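The split can be sketched in NumPy as a reshape followed by a transpose. This is an illustrative sketch under the assumption that dim equals n_heads times head_dim; `split_for_heads_sketch` is a hypothetical name and the code is not taken from this module.

```python
import numpy as np


def split_for_heads_sketch(x, n_heads, head_dim):
    """Reshape (batch, time, dim) to (batch, n_heads, time, head_dim)."""
    batch, time, dim = x.shape
    assert dim == n_heads * head_dim, "dim must equal n_heads * head_dim"
    # Split the last axis into (n_heads, head_dim) ...
    x_4d = x.reshape(batch, time, n_heads, head_dim)
    # ... and move the head axis in front of the time axis.
    return np.transpose(x_4d, (0, 2, 1, 3))


x = np.arange(2 * 3 * 8, dtype=np.float32).reshape(2, 3, 8)
heads = split_for_heads_sketch(x, n_heads=2, head_dim=4)
print(heads.shape)
```

Each head receives a contiguous slice of the last dimension: head 0 holds channels 0 to 3 of every time step, head 1 holds channels 4 to 7.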