# chapter27-Question Answering

Speech and Language Processing, chapter 27: Question Answering

# Paper Notes - Attention Survey

Attention and Augmented Recurrent Neural Networks

They can be used to boil a sequence down into a high-level understanding, to annotate sequences, and even to generate new sequences from scratch!

### Neural Turing Machines

Neural Turing Machines combine RNNs with an external memory bank. Vectors represent natural language inside the neural network, and the memory is an array of vectors.

NTMs take a very clever approach: at every step they read and write everywhere in memory, just to different extents at different positions.

$$r \leftarrow \sum_ia_iM_i$$

$$M_i \leftarrow a_iw+(1-a_i)M_i$$

The Neural GPU [4] overcomes the NTM's inability to add and multiply numbers.

Zaremba & Sutskever [5] train NTMs using reinforcement learning instead of the differentiable read/writes used by the original.

Neural Random Access Machines [6] work based on pointers.

Some papers have explored differentiable data structures, like stacks and queues [7, 8].

Memory networks [9, 10] are another approach to attacking similar problems.


### Attention Interfaces

The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query. The scores are fed into a softmax to create the attention distribution.
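A minimal NumPy sketch of that dot-product-and-softmax step. The item vectors and query here are invented; in a real model both are produced by neural networks.

```python
import numpy as np

def attention(query, items):
    """Dot-product attention: score each item against the query,
    softmax the scores, and return the weighted sum of items."""
    scores = items @ query                           # one score per item
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # attention distribution
    return weights @ items, weights

# Three toy 2-d items; the query points along the first dimension.
items = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.0])
context, weights = attention(query, items)
```

Items whose dot product with the query is highest receive the most attention mass, and the returned context vector is dominated by them.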

Then an RNN runs, generating a description of the image. As it generates each word in the description, the RNN focuses on the conv net’s interpretation of the relevant parts of the image. We can explicitly visualize this:

More broadly, attentional interfaces can be used whenever one wants to interface with a neural network that has a repeating structure in its output.

### Adaptive Computation Time

Standard RNNs do the same amount of computation for each time step. This seems unintuitive. Surely, one should think more when things are hard? It also limits RNNs to doing O(n) operations for a list of length n.

Adaptive Computation Time [15] is a way for RNNs to do different amounts of computation each step. The big picture idea is simple: allow the RNN to do multiple steps of computation for each time step.

In order for the network to learn how many steps to do, we want the number of steps to be differentiable. We achieve this with the same trick we used before: instead of deciding to run for a discrete number of steps, we have an attention distribution over the number of steps to run. The output is a weighted combination of the outputs of each step.

There are a few more details, which were left out in the previous diagram. Here’s a complete diagram of a time step with three computation steps.

That’s a bit complicated, so let’s work through it step by step. At a high-level, we’re still running the RNN and outputting a weighted combination of the states:

Here S denotes the RNN's hidden states.

The weight for each step is determined by a “halting neuron.” It’s a sigmoid neuron that looks at the RNN state and gives a halting weight, which we can think of as the probability that we should stop at that step.

We have a total budget of 1 for the halting weights, so we track that budget along the top. When it drops below epsilon, we stop.

When we stop, we might have some leftover halting budget, because we stop once it drops below epsilon. What should we do with it? Technically it belongs to future steps, but we don't want to compute those, so we attribute it to the last step.
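The budget-tracking loop can be sketched as follows. This is a toy, stand-alone version that assumes the per-step halting-neuron outputs are already computed; a real implementation derives them from the RNN state and trains through them.

```python
def act_step_weights(halt_probs, eps=0.01):
    """Turn halting-neuron outputs into ACT step weights.
    Spend the budget of 1 step by step until it would drop below eps;
    the leftover budget is attributed to the last step taken."""
    weights, budget = [], 1.0
    for p in halt_probs:
        if budget - p < eps:           # stopping condition reached
            weights.append(budget)     # give the remainder to this step
            break
        weights.append(p)
        budget -= p
    return weights

# Halting outputs 0.3, 0.5, 0.9 yield step weights of roughly 0.3, 0.5, 0.2.
```

Because the weights always sum to 1, the time step's output is a proper weighted combination of the per-step states, and the number of steps taken adapts to the inputs.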

When training Adaptive Computation Time models, one adds a “ponder cost” term to the cost function. This penalizes the model for the amount of computation it uses. The bigger you make this term, the more it will trade-off performance for lowering compute time.

Code:

The only open source implementation of Adaptive Computation Time at the moment seems to be Mark Neumann’s (TensorFlow).


# chapter14 Dependency Parsing 1

CS224d Lecture 6: Dependency Parsing

# chapter13 Lexicalized PCFG

• Lexicalized PCFGs

• Lexicalization of a treebank: adding heads to the treebank

• Lexicalized probabilistic context-free grammars: the parameters of a lexicalized PCFG, i.e., the probabilities of its rules

• Parsing with lexicalized PCFGs: using CKY dynamic programming to find the optimal parse tree for each sentence in the test set

• Parameter estimation in lexicalized probabilistic context-free grammars: estimating the parameters of a lexicalized PCFG from the training set, i.e., the corpus

• Accuracy of lexicalized PCFGs: computing accuracy on the test set

• Related work

# chapter12-Syntactic Parsing

• ambiguity and disambiguation

• PCFGs

• How to estimate PCFGs from a corpus: maximum likelihood estimation, i.e., computing relative frequencies

• Chomsky normal form

• CKY, a bottom-up dynamic programming algorithm
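The CKY bullet can be sketched as a recognizer over a toy grammar in Chomsky normal form. The grammar, lexicon, and sentence below are invented for the example; a probabilistic version would store the best-scoring rule per span instead of a set of nonterminals.

```python
from itertools import product

# Toy CNF grammar: binary rules A -> B C and lexical entries word -> A.
binary = {("NP", "VP"): "S", ("Det", "N"): "NP"}
lexicon = {"the": "Det", "dog": "N", "barks": "VP"}

def cky_recognize(words):
    """Bottom-up CKY recognition: chart[i][j] holds the nonterminals
    that can span words[i:j]; accept if S covers the whole sentence."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i + 1].add(lexicon[w])
    for span in range(2, n + 1):            # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):       # try every split point
                for B, C in product(chart[i][k], chart[k][j]):
                    if (B, C) in binary:
                        chart[i][j].add(binary[(B, C)])
    return "S" in chart[0][n]
```

The triple loop over span, start, and split point is what gives CKY its cubic running time in sentence length.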

# chapter11-Context-Free Grammars (CFG)

• Context-free grammar (CFG), also called phrase-structure grammar

• The various rules of English grammar:

• Sentence-level structure, of four kinds

• Noun phrases: the head word and the modifiers before and after it

• Verb phrases

• Coordination

• Treebanks: parsed corpora containing both syntactic and semantic information

• Treebanks as Grammars

• Heads and Head Finding

• Grammar equivalence and normal forms

• Lexicalized Grammars

# chapter10-Part-of-Speech Tagging

• POS classes: closed classes, open classes

• Tagset

• HMM tagging: a generative model, $p(t_i|w_i) \propto p(w_i|t_i)p(t_i|t_{i-1})$, decoded with either the Viterbi or the greedy algorithm

• MEMM: a discriminative model $p(t_i|w_i,t_{i-1},\dots,f;w)$, also decoded with Viterbi or greedy, updating the parameters $w$

• Bidirectional model: CRF, covered in chapter 20

• Embeddings
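The HMM-tagging bullet above can be illustrated with a minimal Viterbi decoder that maximizes the product of emission and transition terms in log space. The two-tag model and its probability tables are invented for the example.

```python
import math

def viterbi(words, tags, log_init, log_trans, log_emit):
    """Viterbi decoding for an HMM tagger: find the tag sequence
    maximizing prod_i p(w_i|t_i) p(t_i|t_{i-1}), in log space."""
    V = [{t: log_init[t] + log_emit[t].get(words[0], -math.inf) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: V[-1][s] + log_trans[s][t])
            col[t] = V[-1][prev] + log_trans[prev][t] + log_emit[t].get(w, -math.inf)
            ptr[t] = prev
        V.append(col)
        back.append(ptr)
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]                       # follow back-pointers to recover the path
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Invented two-tag toy model.
tags = ["N", "V"]
log_init = {"N": math.log(0.7), "V": math.log(0.3)}
log_trans = {"N": {"N": math.log(0.2), "V": math.log(0.8)},
             "V": {"N": math.log(0.6), "V": math.log(0.4)}}
log_emit = {"N": {"dogs": math.log(0.9), "run": math.log(0.1)},
            "V": {"dogs": math.log(0.1), "run": math.log(0.9)}}
path = viterbi(["dogs", "run"], tags, log_init, log_trans, log_emit)
```

A greedy decoder would instead pick the best tag at each position independently of future words, which is faster but can miss the globally best sequence.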

# chapter7-Logistic Regression

• The logistic regression model $p(y|x,w)$

• Feature engineering for language models: $f_i(c,x)$

• Training the model

• Regularization

• Feature selection: information gain

• Classifier selection: the bias-variance trade-off
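The training and regularization bullets can be illustrated with a pure-Python sketch of batch gradient descent for logistic regression with an L2 penalty. The data and hyperparameters are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, reg=0.01, epochs=2000):
    """Batch gradient descent for p(y=1|x,w) = sigmoid(w.x),
    with L2 regularization (a toy sketch, not an optimized trainer)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [reg * wi for wi in w]            # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            err = sigmoid(sum(a * b for a, b in zip(w, xi))) - yi
            for j, xij in enumerate(xi):
                grad[j] += err * xij
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

# Tiny toy data: label 1 iff the first feature is large; last column is a bias.
X = [[0.0, 1.0], [0.2, 1.0], [0.8, 1.0], [1.0, 1.0]]
y = [0, 0, 1, 1]
w = train_logreg(X, y)
```

The regularization term shrinks the weights toward zero, trading a little training fit for better generalization, which is exactly the bias-variance trade-off named in the last bullet.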

# chapter6-Naive Bayes and Sentiment Classification

• The Naive Bayes model $p(x|y)$

• Training: maximum likelihood estimation (frequencies in place of probabilities) of the class prior and the per-class likelihoods

• Optimization: various preprocessing and feature-extraction steps

• Evaluating the model: Precision, Recall, F-measure

• Handling multi-class problems

• Cross-validation

• Comparing classifiers: statistical significance testing

# chapter4: Language Models and N-grams

• N-grams: the Markov chain is a special case

• Evaluating N-gram models: perplexity

• Preprocessing: generalization and the problem of zeros

• Smoothing: for handling zeros

• Perplexity and cross-entropy
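The estimation and perplexity bullets can be illustrated with a tiny bigram model. The corpus is made up, and there is no smoothing, so an unseen bigram or history would raise a math or division error, which is exactly the zeros problem smoothing addresses.

```python
import math
from collections import Counter

def bigram_mle(corpus):
    """Maximum-likelihood bigram probabilities p(w | prev),
    estimated as count ratios (no smoothing)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda w, prev: bigrams[(prev, w)] / unigrams[prev]

def perplexity(model, sent):
    """Perplexity = exp of the average negative log-probability per token."""
    tokens = ["<s>"] + sent + ["</s>"]
    logp = sum(math.log(model(w, prev)) for prev, w in zip(tokens, tokens[1:]))
    return math.exp(-logp / (len(tokens) - 1))

corpus = [["a", "b"], ["a", "b"]]
model = bigram_mle(corpus)
```

On this degenerate corpus every bigram has probability 1, so the perplexity of "a b" is exactly 1, the best possible value; real corpora give much larger perplexities.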

# chapter3-Finite-State Automata

• Finite-state automata (FSA)

• Deterministic recognizers: DFSA

• Nondeterministic recognizers: NFSA, searched depth-first or breadth-first

• Formal languages

# chapter2: Regular Expressions, Text Normalization, and Edit Distance

• Various regular expressions: disjunction, grouping, and precedence

• Text normalization: various preprocessing methods

• Edit distance: dynamic programming
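The edit-distance bullet can be sketched as the classic dynamic program. Unit costs are used here; some textbook variants charge 2 for substitutions.

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming:
    d[i][j] = minimum edits to turn s[:i] into t[:j]."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i                      # delete everything
    for j in range(len(t) + 1):
        d[0][j] = j                      # insert everything
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(s)][len(t)]
```

Each cell depends only on its three neighbors, so the table fills in O(|s| × |t|) time, and tracing back through the minima recovers an optimal alignment.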

# chapter9 Hidden Markov Models

chapter6: Hidden Markov Models, a Stanford tutorial; arguably the best tutorial on HMMs I have seen. If only I could read the original English as easily as Chinese. Given the cultural gap, even when I know every word, understanding can still be difficult, so all I can do is summarize it again myself.