Attention and Augmented Recurrent Neural Networks

RNNs can be used to boil a sequence down into a high-level understanding, to annotate sequences, and even to generate new sequences from scratch!

### Neural Turing Machines

Neural Turing Machines combine an RNN with an external memory bank. Since vectors are the natural language of neural networks, the memory is an array of vectors.

NTMs take a very clever approach: at every step, they read and write everywhere in memory at once, just to different extents at different locations.

$r \leftarrow \sum_ia_iM_i$

$M_i \leftarrow a_iw+(1-a_i)M_i$
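These two equations can be sketched in NumPy as follows (the function names and toy memory shapes are my own illustration; the real NTM also splits the write into separate erase and add operations, which this simplified version folds into a single interpolation):

```python
import numpy as np

def ntm_read(memory, attention):
    # r <- sum_i a_i M_i : a weighted sum over every memory location.
    # memory: (N, M) array of N vectors; attention: (N,) distribution.
    return attention @ memory

def ntm_write(memory, attention, write_vector):
    # M_i <- a_i w + (1 - a_i) M_i : each location moves toward the
    # write vector in proportion to how much we attend to it.
    a = attention[:, None]
    return a * write_vector + (1 - a) * memory

memory = np.zeros((4, 3))                   # 4 slots, 3-dim vectors
attention = np.array([0.7, 0.2, 0.1, 0.0])  # focused mostly on slot 0
w = np.ones(3)                              # vector to write

memory = ntm_write(memory, attention, w)
r = ntm_read(memory, attention)
```

Because reading and writing are everywhere-at-once weighted sums, both operations are differentiable, so the whole thing can be trained with gradient descent.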

The Neural GPU [4] overcomes the NTM's inability to add and multiply numbers.
Zaremba & Sutskever [5] train NTMs using reinforcement learning instead of the differentiable read/writes used by the original.
Neural Random Access Machines [6] work based on pointers.
Some papers have explored differentiable data structures, like stacks and queues [7, 8].
Memory networks [9, 10] are another approach to attacking similar problems.

Code
- Neural Turing Machine: Taehoon Kim's implementation (TensorFlow)
- Neural GPU: the publication's implementation, in the TensorFlow Models repository
- Memory Networks: Taehoon Kim's implementation (TensorFlow)

### Attention Interfaces

The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query. The scores are fed into a softmax to create the attention distribution.
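As a concrete sketch of that scoring step (plain dot-product attention; the names here are illustrative, not from any particular paper):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

def attend(query, items):
    # Score each item by its dot product with the query,
    # then squash the scores into an attention distribution.
    scores = items @ query        # one score per item
    weights = softmax(scores)     # non-negative, sums to 1
    # The result is the attention-weighted sum of the items.
    return weights @ items, weights

items = np.eye(3)                   # three orthogonal items
query = np.array([10.0, 0.0, 0.0])  # strongly matches item 0
read, weights = attend(query, items)
```

Because every step is differentiable, the attending RNN can learn where to focus by gradient descent, just like the blurry memory reads above.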

- parsing trees: Grammar as a Foreign Language
- conversational modeling: A Neural Conversational Model
- image captioning: the attention interface can also sit between a CNN and an RNN, allowing the RNN to attend to different parts of the image at each time step as it generates text.

Then an RNN runs, generating a description of the image. As it generates each word in the description, the RNN focuses on the conv net’s interpretation of the relevant parts of the image. We can explicitly visualize this:

More broadly, attentional interfaces can be used whenever one wants to interface with a neural network that has a repeating structure in its output.

Standard RNNs do the same amount of computation for each time step. This seems unintuitive. Surely, one should think more when things are hard? It also limits RNNs to doing O(n) operations for a list of length n.

Adaptive Computation Time [15] is a way for RNNs to do different amounts of computation each step. The big picture idea is simple: allow the RNN to do multiple steps of computation for each time step.

In order for the network to learn how many steps to do, we want the number of steps to be differentiable. We achieve this with the same trick we used before: instead of deciding to run for a discrete number of steps, we have an attention distribution over the number of steps to run. The output is a weighted combination of the outputs of each step.

There are a few more details, which were left out in the previous diagram. Here’s a complete diagram of a time step with three computation steps.

That’s a bit complicated, so let’s work through it step by step. At a high-level, we’re still running the RNN and outputting a weighted combination of the states:

Here, S denotes the hidden state of the RNN.

The weight for each step is determined by a “halting neuron.” It’s a sigmoid neuron that looks at the RNN state and gives a halting weight, which we can think of as the probability that we should stop at that step.

We have a total budget for the halting weights of 1, so we track that budget along the top. When it gets to less than epsilon, we stop.

When we stop, we might have some leftover halting budget, because we stop as soon as it drops below epsilon. What should we do with it? Technically, it belongs to future steps, but we don't want to compute those, so we attribute it to the last step.
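Putting these pieces together, a toy version of the halting loop might look like this (the step and halting functions below are stand-ins for the RNN cell and the halting neuron; the epsilon and budget bookkeeping follow the description above):

```python
import numpy as np

def adaptive_steps(state, step_fn, halt_fn, eps=0.01, max_steps=20):
    # Output is a weighted combination of the states: each step's weight
    # is its halting probability, and the leftover budget goes to the
    # last step.
    output = np.zeros_like(state)
    budget = 1.0
    for _ in range(max_steps):
        state = step_fn(state)
        p = halt_fn(state)             # halting neuron: weight for this step
        if budget - p < eps:
            output += budget * state   # give the remainder to the last step
            break
        output += p * state
        budget -= p
    return output

# Toy example: each "computation step" adds 1; the halting weight is a
# constant 0.5 instead of a learned sigmoid of the state.
result = adaptive_steps(np.array([0.0]),
                        step_fn=lambda s: s + 1.0,
                        halt_fn=lambda s: 0.5)
```

With a constant halting weight of 0.5, the budget runs out after two steps, so the output is 0.5 times the first state plus the remaining 0.5 times the second.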

When training Adaptive Computation Time models, one adds a “ponder cost” term to the cost function. This penalizes the model for the amount of computation it uses. The bigger you make this term, the more it will trade-off performance for lowering compute time.
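The ponder cost can be sketched as a penalty on the number of steps taken plus the leftover remainder, scaled by a hyperparameter (called tau here; the name is my own, not from the paper):

```python
def total_loss(task_loss, steps_taken, remainder, tau=0.01):
    # Penalize computation: more steps and a larger remainder cost more.
    # Increasing tau trades task performance for less compute time.
    ponder_cost = steps_taken + remainder
    return task_loss + tau * ponder_cost

loss = total_loss(task_loss=1.0, steps_taken=3, remainder=0.2, tau=0.1)
```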

Code:

The only open source implementation of Adaptive Computation Time at the moment seems to be Mark Neumann’s (TensorFlow).

References: