neuralmonkey.attention.scaled_dot_product module

The scaled dot-product attention mechanism defined in Vaswani et al. (2017).

The attention energies are computed as dot products between the query vector and the key vectors. The query vector is scaled down by the square root of its dimensionality, which has the same effect as scaling the dot products themselves. This attention function has no trainable parameters.

See arxiv.org/abs/1706.03762
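
As a reading aid (not the module's TensorFlow code), a minimal NumPy sketch of the energy computation for a single query:

    import numpy as np

    def scaled_dot_energies(query, keys):
        """Dot products between a query and each key, scaled by sqrt(d_k)."""
        d_k = query.shape[-1]
        # Scaling the query by 1/sqrt(d_k) has the same effect as scaling
        # the dot products; there are no trainable parameters here.
        return keys @ (query / np.sqrt(d_k))

    query = np.random.rand(64)        # a single query vector, d_k = 64
    keys = np.random.rand(10, 64)     # 10 key vectors
    energies = scaled_dot_energies(query, keys)   # shape (10,)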

class neuralmonkey.attention.scaled_dot_product.MultiHeadAttention(name: str, n_heads: int, keys_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful], values_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful] = None, dropout_keep_prob: float = 1.0, save_checkpoint: str = None, load_checkpoint: str = None, initializers: List[Tuple[str, Callable]] = None) → None

Bases: neuralmonkey.attention.base_attention.BaseAttention

__init__(name: str, n_heads: int, keys_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful], values_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful] = None, dropout_keep_prob: float = 1.0, save_checkpoint: str = None, load_checkpoint: str = None, initializers: List[Tuple[str, Callable]] = None) → None

Create a new MultiHeadAttention object.
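
A hypothetical construction sketch based on the signature above. The encoder variable is an assumption: it stands for an already-built TemporalStateful (or SpatialStateful) encoder and is not defined here.

    from neuralmonkey.attention.scaled_dot_product import MultiHeadAttention

    # `encoder` is assumed to exist and to implement TemporalStateful or
    # SpatialStateful. values_encoder is left at its default (None), which
    # presumably falls back to the keys encoder.
    attention = MultiHeadAttention(
        name="attention_multi_head",
        n_heads=8,
        keys_encoder=encoder,
        dropout_keep_prob=0.9)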

attention(query: tensorflow.python.framework.ops.Tensor, decoder_prev_state: tensorflow.python.framework.ops.Tensor, decoder_input: tensorflow.python.framework.ops.Tensor, loop_state: neuralmonkey.attention.namedtuples.MultiHeadLoopState) → Tuple[tensorflow.python.framework.ops.Tensor, neuralmonkey.attention.namedtuples.MultiHeadLoopState]

Run multi-head attention and get a context vector for a given query.

This method is an API wrapper for the module-level function attention defined in this module. It transforms a query of shape (batch, query_size) to shape (batch, 1, query_size) and applies the attention function. The output context has shape (batch, 1, value_size) and the weights have shape (batch, n_heads, 1, time(k)). The output is then processed to produce the vector of contexts and the following attention loop state (see the shape sketch after the parameter list below).

Parameters:
  • query – Input query for the current decoding step, of shape (batch, query_size).
  • decoder_prev_state – Previous state of the decoder.
  • decoder_input – Input to the RNN cell of the decoder.
  • loop_state – Attention loop state.
Returns:

Vector of contexts and the following attention loop state.
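
A shape-only sketch of the wrapping described above; the arrays are zero placeholders standing in for the real TensorFlow tensors, and the loop-state bookkeeping is omitted:

    import numpy as np

    batch, query_size, value_size, n_heads, time_k = 2, 8, 16, 4, 5

    query = np.zeros((batch, query_size))
    query_3d = query[:, np.newaxis, :]               # (batch, 1, query_size)

    # Per the docstring, the module-level attention function then yields:
    context_3d = np.zeros((batch, 1, value_size))    # (batch, 1, value_size)
    weights = np.zeros((batch, n_heads, 1, time_k))  # (batch, n_heads, 1, time(k))

    context = np.squeeze(context_3d, axis=1)         # (batch, value_size)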

context_vector_size

Return the static size of the context vector.

Returns: An integer specifying the context vector dimension.
finalize_loop(key: str, last_loop_state: neuralmonkey.attention.namedtuples.MultiHeadLoopState) → None

Store the attention histories from loop state under a given key.

Parameters:
  • key – The key to the histories dictionary to store the data in.
  • last_loop_state – The loop state object from the last state of the decoding loop.
initial_loop_state() → neuralmonkey.attention.namedtuples.MultiHeadLoopState

Get initial loop state for the attention object.

Returns: The newly created initial loop state object.
visualize_attention(key: str, max_outputs: int = 16) → None

Include the attention histories under a given key into a summary.

Parameters:
  • key – The key to the attention histories dictionary.
  • max_outputs – Maximum number of images to save.
class neuralmonkey.attention.scaled_dot_product.ScaledDotProdAttention(name: str, keys_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful], values_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful] = None, dropout_keep_prob: float = 1.0, save_checkpoint: str = None, load_checkpoint: str = None, initializers: List[Tuple[str, Callable]] = None) → None

Bases: neuralmonkey.attention.scaled_dot_product.MultiHeadAttention

__init__(name: str, keys_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful], values_encoder: Union[neuralmonkey.model.stateful.TemporalStateful, neuralmonkey.model.stateful.SpatialStateful] = None, dropout_keep_prob: float = 1.0, save_checkpoint: str = None, load_checkpoint: str = None, initializers: List[Tuple[str, Callable]] = None) → None

Create a new ScaledDotProdAttention object.
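
Judging by the signature, which has no n_heads argument, this class is the single-head variant of MultiHeadAttention. A hypothetical construction sketch; as above, encoder is an assumed, already-built encoder:

    from neuralmonkey.attention.scaled_dot_product import ScaledDotProdAttention

    # `encoder` is assumed to implement TemporalStateful or SpatialStateful.
    attention = ScaledDotProdAttention(
        name="attention_sdp",
        keys_encoder=encoder,
        dropout_keep_prob=0.9)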

neuralmonkey.attention.scaled_dot_product.attention(queries: tensorflow.python.framework.ops.Tensor, keys: tensorflow.python.framework.ops.Tensor, values: tensorflow.python.framework.ops.Tensor, keys_mask: tensorflow.python.framework.ops.Tensor, num_heads: int, dropout_callback: Callable[[tensorflow.python.framework.ops.Tensor], tensorflow.python.framework.ops.Tensor], masked: bool = False, use_bias: bool = False) → tensorflow.python.framework.ops.Tensor

Run multi-head scaled dot-product attention.

See arxiv.org/abs/1706.03762

When performing multi-head attention, the query, key and value vectors are first split into sets of smaller vectors, one for each attention head. Next, they are transformed using a linear layer, and a separate attention (in the corresponding head) is applied to each transformed triple of query, key and value. The resulting contexts from the heads are then concatenated, and a linear layer is applied to the concatenated output. This can be summarized by the following equations:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_o
head_i = Attention(Q * W_Q_i, K * W_K_i, V * W_V_i)

The scaled dot-product attention is a simple dot product between the query and a transposed key vector. The result is then scaled by the square root of the vector dimension and a softmax layer is applied. Finally, the output of the softmax layer is multiplied by the value vector. See the following equation (a NumPy sketch of both equations follows the parameter list below):

Attention(Q, K, V) = softmax(Q * K^T / √(d_k)) * V
Parameters:
  • queries – Input queries of shape (batch, time(q), k_channels).
  • keys – Input keys of shape (batch, time(k), k_channels).
  • values – Input values of shape (batch, time(k), v_channels).
  • keys_mask – A float Tensor for masking sequences in keys.
  • num_heads – Number of attention heads.
  • dropout_callback – Callable function implementing dropout.
  • masked – Boolean indicating whether we want to mask future energies.
  • use_bias – Whether to add a bias term to the linear projections (inferred from the signature; defaults to False).
Returns:

Contexts of shape (batch, time(q), v_channels) and weights of shape (batch, time(q), time(k)).
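
A NumPy re-implementation of the two equations above, meant as a reading aid rather than a mirror of the TensorFlow code in this module; the projection matrices are random stand-ins for the trained parameters, and masking and dropout are omitted:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def single_head_attention(q, k, v):
        """Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V."""
        d_k = q.shape[-1]
        weights = softmax(q @ k.T / np.sqrt(d_k))        # (time_q, time_k)
        return weights @ v                               # (time_q, head_dim)

    def multi_head(q, k, v, n_heads, head_dim):
        d_model = n_heads * head_dim
        rng = np.random.default_rng(0)
        # Random stand-ins for the trained projection matrices W_Q_i, W_K_i,
        # W_V_i and W_o.
        w_q = rng.normal(size=(n_heads, q.shape[-1], head_dim))
        w_k = rng.normal(size=(n_heads, k.shape[-1], head_dim))
        w_v = rng.normal(size=(n_heads, v.shape[-1], head_dim))
        w_o = rng.normal(size=(d_model, d_model))
        heads = [single_head_attention(q @ w_q[i], k @ w_k[i], v @ w_v[i])
                 for i in range(n_heads)]
        return np.concatenate(heads, axis=-1) @ w_o      # (time_q, d_model)

    q = np.random.rand(7, 64)    # (time_q, k_channels)
    k = np.random.rand(9, 64)    # (time_k, k_channels)
    v = np.random.rand(9, 64)    # (time_k, v_channels)
    out = multi_head(q, k, v, n_heads=8, head_dim=8)     # shape (7, 64)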

neuralmonkey.attention.scaled_dot_product.empty_multi_head_loop_state(batch_size: Union[int, tensorflow.python.framework.ops.Tensor], num_heads: Union[int, tensorflow.python.framework.ops.Tensor], length: Union[int, tensorflow.python.framework.ops.Tensor], dimension: Union[int, tensorflow.python.framework.ops.Tensor]) → neuralmonkey.attention.namedtuples.MultiHeadLoopState
neuralmonkey.attention.scaled_dot_product.mask_energies(energies_4d: tensorflow.python.framework.ops.Tensor, mask: tensorflow.python.framework.ops.Tensor, mask_value=-1000000000.0) → tensorflow.python.framework.ops.Tensor

Apply mask to the attention energies before passing to softmax.

Parameters:
  • energies_4d – Energies of shape (batch, n_heads, time(q), time(k)).
  • mask – Float Tensor of zeros and ones of shape (batch, time(k)), specifies valid positions in the energies tensor.
  • mask_value – Value used to mask the energies. The default value is taken from tensor2tensor.
Returns:

Energies (logits) of valid positions. Same shape as energies_4d.

Note

We do not use mask_value=-np.inf to avoid potential underflow.
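
A minimal NumPy sketch of this kind of masking (not the module's TensorFlow code): the (batch, time(k)) mask is broadcast against the 4D energies, and invalid positions are pushed to a large negative value before the softmax:

    import numpy as np

    def mask_energies(energies_4d, mask, mask_value=-1e9):
        """energies_4d: (batch, n_heads, time_q, time_k); mask: (batch, time_k)."""
        mask_4d = mask[:, np.newaxis, np.newaxis, :]     # (batch, 1, 1, time_k)
        # Keep energies at valid positions, replace the rest with mask_value.
        return energies_4d * mask_4d + (1.0 - mask_4d) * mask_value

    energies = np.random.rand(2, 4, 3, 5)
    mask = np.array([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]], dtype=float)
    masked = mask_energies(energies, mask)               # same shape as energies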

neuralmonkey.attention.scaled_dot_product.mask_future(energies: tensorflow.python.framework.ops.Tensor, mask_value=-1000000000.0) → tensorflow.python.framework.ops.Tensor

Mask the energies of the keys using a lower triangular matrix.

The mask simulates autoregressive decoding: it prevents the attention from looking at positions that have not yet been decoded. The mask is not necessary during training, when the true output values are used instead of the decoded ones.

Parameters:
  • energies – A tensor to mask.
  • mask_value – Value used to mask energies.
Returns:

Masked energies tensor.
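
A NumPy sketch of future masking with a lower triangular matrix (illustrative only; the module works on TensorFlow tensors):

    import numpy as np

    def mask_future(energies, mask_value=-1e9):
        """energies: (..., time_q, time_k) with time_q == time_k."""
        length = energies.shape[-1]
        lower = np.tril(np.ones((length, length)))   # ones on and below the diagonal
        # Position (q, k) is kept only if k <= q, i.e. already decoded.
        return energies * lower + (1.0 - lower) * mask_value

    energies = np.random.rand(2, 4, 5, 5)   # (batch, n_heads, time, time)
    masked = mask_future(energies)          # same shape as energies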

neuralmonkey.attention.scaled_dot_product.split_for_heads(x: tensorflow.python.framework.ops.Tensor, n_heads: int, head_dim: int) → tensorflow.python.framework.ops.Tensor

Split a tensor for multi-head attention.

Split the last dimension of a 3D tensor of shape (batch, time, dim) and return a 4D tensor of shape (batch, n_heads, time, dim/n_heads).

Parameters:
  • x – Input Tensor of shape (batch, time, dim).
  • n_heads – Number of attention heads.
  • head_dim – Dimension of the attention heads.
Returns:

A 4D Tensor of shape (batch, n_heads, time, head_dim), where head_dim equals dim divided by n_heads.
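
A NumPy sketch of the split (the module does the equivalent with TensorFlow reshape and transpose ops); note that the resulting last dimension is head_dim = dim / n_heads:

    import numpy as np

    def split_for_heads(x, n_heads, head_dim):
        """(batch, time, n_heads * head_dim) -> (batch, n_heads, time, head_dim)."""
        batch, time, _ = x.shape
        x = x.reshape(batch, time, n_heads, head_dim)
        return x.transpose(0, 2, 1, 3)

    x = np.random.rand(2, 7, 64)
    y = split_for_heads(x, n_heads=8, head_dim=8)   # shape (2, 8, 7, 8)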