Paper Notes: Unsupervised Machine Translation

Motivation

Back-translation builds a pseudo-parallel corpus by generating a pseudo source sentence from a pure target sentence, and then using the pure target sentence as the label for supervised training (the target side is guaranteed to be a clean sentence, while the source side is allowed to be somewhat noisy). This is essentially a reconstruction loss. Its drawback is that the quality of the pseudo source sentences cannot be guaranteed, which leads to error accumulation (the pseudo source sentences are never updated, so their errors are never corrected).

Selecting Monolingual Corpora

neural-based methods aim to select potential parallel sentences from monolingual corpora in the same domain. However, these neural models need to be trained on a large parallel dataset first, which is not applicable to language pairs with limited supervision.

• Parallel sentence extraction from comparable corpora with neural network features, LERC 2016

• Bilingual word embeddings with bucketed cnn for parallel sentence extraction, ACL 2017

• Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation, COLING 2018

Fully Unsupervised Machine Translation

The main technical protocol of these approaches can be summarized as three steps:

• Initialization

• Language Modeling

• Back-Translation

Initialization

Given the ill-posed nature of the unsupervised NMT task, a suitable initialization method can help model the natural priors over the mapping of two language spaces we expect to reach.

There are two main initialization methods:

• Bilingual dictionary inference

• Word translation without parallel data. Conneau, et al. ICLR 2018

• Unsupervised neural machine translation, ICLR 2018

• Unsupervised machine translation using monolingual corpora only, ICLR 2018a

• BPE

• Phrase-Based & Neural Unsupervised Machine Translation, Lample et al. EMNLP 2018b

Language Modeling

Train language models on both source and target languages. These models express a data-driven prior about the composition of sentences in each language.

In NMT, language modeling is accomplished via denoising autoencoding, by minimizing:
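
$$\mathcal{L}^{lm}=\mathbb{E}_{x\sim S}[-\log P_{s\to s}(x|C(x))]+\mathbb{E}_{y\sim T}[-\log P_{t\to t}(y|C(y))]$$

This is the denoising objective as written in Lample et al. (2018b): $C(\cdot)$ is a noise function (random word dropping and local shuffling), and $P_{s\to s}$, $P_{t\to t}$ denote corrupting a sentence and reconstructing the original in the same language.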

Back-Translation

• Dual learning for machine translation, NIPS 2016

• Improving neural machine translation models with monolingual data. ACL 2016

Extract-Edit

• Extract: using the sentence representations obtained in the first two steps, select from the target language space the sentences closest to the source sentence (by similarity).

• Edit: then edit the selected sentences.

Edit

employ a maxpooling layer to reserve the more significant features between the source sentence embedding $e_s$ and the extracted sentence embedding $e_t$ ($t\in M$), and then decode it into a new sentence $t’$.

$e_s$: [es_length, encoder_size]

$e_t$: [et_length, encoder_size]

Evaluate

$$r_s=f(W_2f(W_1e_s+b_1)+b_2)$$

$$r_t=f(W_2f(W_1e_{t'}+b_1)+b_2)$$
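
A minimal numpy sketch of this evaluation network, assuming mean-pooled sentence embeddings of size `encoder_size` and a hypothetical hidden size of 512; the shared two-layer MLP mirrors the two equations above, and the cosine similarity used for ranking is added for illustration:

```python
import numpy as np

def mlp(e, W1, b1, W2, b2):
    """Shared two-layer perceptron r = f(W2 f(W1 e + b1) + b2), with f = tanh."""
    return np.tanh(W2 @ np.tanh(W1 @ e + b1) + b2)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

encoder_size, hidden = 256, 512                      # hypothetical sizes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(hidden, encoder_size)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, hidden)), np.zeros(hidden)

e_s = rng.normal(size=(7, encoder_size)).mean(axis=0)   # mean-pooled source embedding
e_t = rng.normal(size=(9, encoder_size)).mean(axis=0)   # mean-pooled edited-target embedding

r_s, r_t = mlp(e_s, W1, b1, W2, b2), mlp(e_t, W1, b1, W2, b2)
print("similarity score:", cosine(r_s, r_t))
```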

learning

Comparative Translation

A larger cosine similarity means the sentences are closer, so a smaller -logP is better. The parameters involved are $\theta_{enc}$ and $\theta_R$.

Basically, the translation model is trying to minimize the relative distance of the translated sentence t* to the source sentence s compared to the top-k extracted-and-edited sentences in the target language space. Intuitively, we view the top-k extracted-and-edited sentences as the anchor points to locate a probable region in the target language space, and iteratively improve the source-to-target mapping via the comparative learning scheme.

we can view our translation system as a “generator” that learns to generate a good translation with a higher similarity score than the extracted-and-edited sentences, and the evaluation network R as a “discriminator” that learns to rank the extracted- and-edited sentences (real sentences in the target language space) higher than the translated sentences.

Model selection

Basically, we choose the hyper-parameters with the maximum expectation of the ranking scores of all translated sentences.

Implementation details

Initialization

Cross-lingual BPE embeddings; the number of BPE codes is set to 60,000.

Model structure

all encoder parameters are shared across two languages. Similarly, we share all decoder parameters across two languages.

The λ for calculating ranking scores is 0.5. As for the evaluation network R, we use a multilayer perceptron with two hidden layers of size 512.

Paper Notes: Dialogue Systems

paper

A Survey on Dialogue Systems: Recent Advances and New Frontiers, Chen et al. 2018

motivation

• sequence-to-sequence models

• retrieval-based methods

• pipeline methods

• end-to-end methods

pipeline methods

• language understanding

• dialogue state tracking

• policy learning

• natural language generation

language understanding

Given an utterance, natural language understanding maps it into semantic slots. The slots are pre-defined according to different scenarios.

The slots are pre-defined for each specific scenario.

• intent detection: an utterance-level classification task

• domain classification: also a classification task

• slot filling: a word-level task that can be framed as sequence labeling; the input is an utterance and the output is a semantic label for each word.

• CRF baseline

• DBNs:

• Deep belief network based semantic taggers for spoken language understanding

• Use of kernel deep convex networks and end-to-end learning for spoken language understanding

• RNN:

• Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding, 2013

• Deep belief nets for natural language call-routing. 2011

• Recurrent neural networks for language understanding, 2013

• Spoken language understanding using long short-term memory neural networks. 2014

Dialogue State Tracking

• [26] Deep neural network approach for the dialog state tracking challenge. 2013

• [58] Multi-domain dialog state tracking using recurrent neural networks. 2015

• [59] Neural belief tracker: Data-driven dialogue state tracking, 2017

Policy learning

• Supervised learning with rule-based states (the states need to be defined by rules):

• [111] Building task-oriented dialogue systems for online shopping

• Deep reinforcement learning:

• [14] Strategic dialogue management via deep reinforcement learning, 2015

natural language generation

• [123] Context-aware natural language generation for spoken dialogue systems. Zhou et al, 2016 COLING

adopted an encoder-decoder LSTM-based structure to incorporate the question information, semantic slot values, and dialogue act type to generate correct answers. It used the attention mechanism to attend to the key information conditioned on the current decoding state of the decoder. Encoding the dialogue act type embedding, the neural network-based model is able to generate variant answers in response to different act types.

end-to-end model

• User feedback is hard to propagate back to each individual module

• The modules depend on each other (process interdependence)

• Network-based end-to-end models require large amounts of annotated data

• Learning end-to-end goal-oriented dialog, Bordes et al, 2017 ICLR

• A network-based end-to-end trainable task-oriented dialogue system, 2017 ACL

• End-to-end reinforcement learning in dialogue management jointly trains state tracking and policy learning, making the model more robust.

• Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning, 2016 ACL

• Task-completion neural dialogue systems, whose goal is to complete a task.

• End-to-end task-completion neural dialogue systems, 2017

• The retrieval results carry no information about the uncertainty of the semantic parse.

• The retrieval process is non-differentiable, so semantic parsing and the dialogue policy can only be trained separately, which makes an online end-to-end model hard to deploy.

Neural Generative models

• response diversity

• modeling topics and personalities

• leveraging outside knowledge base

• the interactive learning

• evaluation

Response Diversity

A challenging problem in current sequence-to-sequence dialogue systems is that they tend to generate trivial or non-committal, universally relevant responses with little meaning, which often involve high-frequency phrases along the lines of "I don't know" or "I'm OK".

1. MMI and IDF

1. beam-search

1. re-ranking

[38][77][72] incorporate global features and perform a re-ranking step to avoid generating dull or generic responses.

1. PMI

[57] conjectures that the problem lies not only in decoding and response frequency: the message itself may also lack enough information. It proposes using pointwise mutual information (PMI) to predict nouns as keywords that reflect the gist of the reply, and then generates a reply containing the given keyword.

1. latent variable

Topic and Personality

1. Topic aware neural response generation, AAAI 2017: observing that people often associate their conversations with topic-related concepts and reply based on those concepts, the authors use a Twitter LDA model to obtain the topic of the input, feed the topic information and the input representation into a joint attention module, and generate topic-related responses.
1. Multiresolution recurrent neural networks: An application to dialogue response generation. AAAI 2017: jointly models a coarse-grained token sequence and dialogue generation; the coarse-grained tokens, typically named entities or nouns, are mainly used to capture high-level semantic information.
1. Emotional chatting machine: Emotional conversation generation with internal and external memory incorporates emotion embeddings into dialogue generation. Affective neural response generation, 2017 enhances the emotion of responses in three ways:
• incorporating cognitive engineered affective word embeddings

• augmenting the loss objective with an affect-constrained objective function

• injecting affective dissimilarity in diverse beam-search inference procedure

1. Assigning personality/identity to a chatting machine for coherent conversation generation personalizes the dialogue and keeps it consistent. Neural personalized response generation as domain adaptation proposes a two-stage training approach: initialize the model with large-scale data, then fine-tune it to generate personalized responses.
1. Personalizing a dialogue system with transfer reinforcement learning uses reinforcement learning to remove inconsistencies across a dialogue.

Evaluation

• BLEU, METEOR, and ROUGE, i.e., directly computing word overlap between the ground truth and the generated reply. Since one message may have many valid replies, BLEU is in some respects ill-suited to dialogue evaluation.

• Embedding-based distances, computed in three ways: directly averaging the embeddings, averaging after taking absolute values, or greedy matching.

• A Turing-test style evaluation, using a retrieval-based discriminator to judge the generated replies.

Retrieval-based Methods

single-turn response match

$$match(x,y)=x^TAy$$

Convolutional neural network architectures for matching natural language sentences, 2014: improves the model with a deep convolutional architecture that either learns representations of the message and the response, or directly learns an interaction representation of the two sentences, and then uses a multilayer perceptron to compute the matching score.

Outlook

• Swift warm-up. In new domains, collecting domain-specific dialogue data and building a dialogue system is difficult. A future trend is dialogue models that can actively learn from interactions with humans.
• Deep understanding. Current neural dialogue systems rely heavily on large amounts of annotated data, structured knowledge bases, and dialogue corpora. The generated replies still lack diversity in some sense and are sometimes not very meaningful, so dialogue systems must be able to understand language and the real world more deeply and effectively.
• Privacy protection. Widely deployed dialogue systems serve more and more people, and it is important to notice that we are all using the same dialogue assistant. Through its ability to interact, understand and reason, an assistant may inadvertently store rather sensitive information, so protecting user privacy is essential when building better dialogue systems.

Paper Notes: Using Monolingual Data in Machine Translation

Why Monolingual data enhancement

• Large-scale source-side data: enhances the encoder network to obtain a high-quality context-vector representation of the source sentence.

• Large-scale target-side data: boosts the fluency of the translation when decoding.

The methods of using monolingual data

Target-side language model: Integrating Language Model into the Decoder

shallow fusion

both an NMT model (on parallel corpora) as well as a recurrent neural network language model (RNNLM, on larger monolingual corpora) have been pre-trained separately before being integrated.

Shallow fusion: rescore the probability of the candidate words.
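
Concretely, shallow fusion is usually written as a log-linear interpolation of the two models at each decoding step; a common form, with a hand-tuned weight $\beta$, is:

$$\log p(y_t|y_{<t},x)=\log p_{NMT}(y_t|y_{<t},x)+\beta\,\log p_{LM}(y_t|y_{<t})$$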

deep fusion

Using Target-side Monolingual Data for Neural Machine Translation through Multi-task Learning, EMNLP, 2017

The $\sigma$ parameters are updated when training both tasks, whereas the $\theta$ parameters are updated only when training the translation model.

auto-encoder

Semi-Supervised Learning for Neural Machine Translation, ACL, 2016

Back-translation

What is back-translation?

Synthetic pseudo-parallel data is created from target-side monolingual data using a reverse translation model.

why back-translation and motivation?

It mitigates overfitting and improves fluency by exploiting additional data in the target language.

Different aspects of the BT which influence the performance of translation:

• Size of the Synthetic Data

• Direction of Back-Translation

• Quality of the Synthetic Data

Dummy source sentence

Pseudo-parallel data: a dummy (empty) source sentence paired with each target-side monolingual sentence.

The downside:

the network ‘unlearns’ its conditioning on the source context if the ratio of monolingual training instances is too high.

Improving Neural Machine Translation Models with Monolingual Data, Sennrich et al, ACL 2016

Self-learning

Synthetic target sentences from source-side mono-data (a minimal sketch follows this list):

• Build a baseline machine translation (MT) system on parallel data

• Translate source-side mono-data into target sentences

• Real parallel data + pseudo parallel data
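
A minimal sketch of these three steps, with hypothetical `train_nmt` and `translate` stand-ins for an NMT toolkit; the only substantive point is that the synthetic pairs are simply concatenated with the real bitext:

```python
def self_learning(parallel_data, source_mono, train_nmt, translate):
    """parallel_data: list of (src, tgt) pairs; source_mono: list of source sentences.
    train_nmt / translate are hypothetical stand-ins for an NMT toolkit."""
    baseline = train_nmt(parallel_data)                          # 1. baseline MT system on parallel data
    pseudo = [(s, translate(baseline, s)) for s in source_mono]  # 2. forward-translate source mono-data
    return train_nmt(parallel_data + pseudo)                     # 3. retrain on real + pseudo pairs
```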

reference

1. Improving Neural Machine Translation Models with Monolingual Data, Sennrich et al, ACL 2016

2. Using Monolingual Data in Neural Machine Translation: a Systematic Study, Burlot et al. ACL 2018

3. Copied Monolingual Data Improves Low-Resource Neural Machine Translation, Currey et al. 2017 In Proceedings of the Second Conference on Machine Translation

4. Semi-Supervised Learning for Neural Machine Translation, Cheng et al. ACL 2016

5. Exploiting Source-side Monolingual Data in Neural Machine Translation, Zhang et al. EMNLP 2016

6. Using Target-side Monolingual Data for Neural Machine Translation through Multi-task Learning, Domhan et al. EMNLP 2017

7. On Using Monolingual Corpora in Neural Machine Translation, Gulcehre et al. 2015

8. Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation, EMNLP 2018

9. Understanding Back-Translation at Scale, Edunov et al. EMNLP 2018

Paper Notes: Sentence Embedding

supervised learning

a structured self-attentive sentence embedding

Model Architecture:

word embedding: $S\in R^{n\times d}$, where d is the word-embedding dimension

$$S=(w_1,w_2,…,w_n)$$

bidirectional LSTM: $H\in R^{n\times 2u}$, where u is the hidden-state dimension

$$H=(h_1,h_2,…,h_n)$$

single self-attention: $a\in R^n$, the attention weights over the positions of the sentence. A weighted sum with the encoded sentence gives the attention vector $m\in R^{2u}$.

$$a=softmax(w_{s2}tanh(W_{s1}H^T))$$

r-dim self-attention: there are r such attention vectors, stacked into a matrix $A\in R^{r\times n}$; a weighted sum with the encoded sentence representation H gives the embedding matrix $M\in R^{r\times 2u}$

$$A=softmax(W_{s2}tanh(W_{s1}H^T))$$

$$M=AH$$

penalization term

The best way to evaluate the diversity is definitely the Kullback Leibler divergence between any 2 of the summation weight vectors. However, we found that not very stable in our case. We conjecture it is because we are maximizing a set of KL divergence (instead of minimizing only one, which is the usual case), we are optimizing the annotation matrix A to have a lot of sufficiently small or even zero values at different softmax output units, and these vast amount of zeros is making the training unstable. There is another feature that KL doesn’t provide but we want, which is, we want each individual row to focus on a single aspect of semantics, so we want the probability mass in the annotation softmax output to be more focused. but with KL penalty we cant encourage that.

$$P=||(AA^T-I)||^2_{F}$$

$AA^T$ acts like a covariance matrix: the diagonal entries are inner products of a vector with itself, and the off-diagonal entries are inner products of different vectors. Adding it as a penalty to the original loss encourages the inner products between different attention vectors to be as small as possible (smaller inner product means greater diversity) and the norm of each vector to be as large as possible (the probability mass concentrates on just one or two words).
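
A small numpy sketch of the pieces above (hypothetical sizes n=10, 2u=8, d_a=6, r=4), computing $A$, $M=AH$, and the penalty $P=||AA^T-I||_F^2$:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, two_u, d_a, r = 10, 8, 6, 4               # hypothetical sizes
rng = np.random.default_rng(0)
H = rng.normal(size=(n, two_u))              # BiLSTM states, one row per token
W_s1 = rng.normal(size=(d_a, two_u))
W_s2 = rng.normal(size=(r, d_a))

A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=1)      # (r, n): r attention distributions over tokens
M = A @ H                                            # (r, 2u): sentence embedding matrix
P = np.linalg.norm(A @ A.T - np.eye(r), "fro") ** 2  # penalization term added to the loss
print(M.shape, P)
```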

training

3 different datasets:

• the Age dataset

• the Yelp dataset

• the Stanford Natural Language Inference (SNLI) Corpus

Paper Notes: Explicit Semantic Analysis

paper:

Motivation

Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts.

Traditional approaches:

1. Bag of words: treats the text as an unordered bag of words, with each word as one feature dimension. This fails to address two key NLP problems: polysemy and synonymy.
1. Latent Semantic Analysis (LSA)

LSA is a purely statistical technique, which leverages word co-occurrence information from a large unlabeled corpus of text. LSA does not use any explicit human-organized knowledge; rather, it “learns” its representation by applying Singular Value Decomposition (SVD) to the words-by-documents co-occurrence matrix. LSA is essentially a dimensionality reduction technique that identifies a number of most prominent dimensions in the data, which are assumed to correspond to “latent concepts”. Meanings of words and documents are then represented in the space defined by these concepts.

In short, LSA learns a "latent concept" space purely from word-document co-occurrence statistics via SVD, without any explicit human-organized knowledge, and then represents the meanings of words and documents in that space.

1. Lexical databases, e.g., WordNet.

However, lexical resources offer little information about the different word senses, thus making word sense disambiguation nearly impossible to achieve. Another drawback of such approaches is that creation of lexical resources requires lexicographic expertise as well as a lot of time and effort, and consequently such resources cover only a small fragment of the language lexicon. Specifically, such resources contain few proper names, neologisms, slang, and domain-specific technical terms. Furthermore, these resources have strong lexical orientation in that they predominantly contain information about individual words, but little world knowledge in general.

Definition of a concept

Observe that an encyclopedia consists of a large collection of articles, each of which provides a comprehensive exposition focused on a single topic. Thus, we view an encyclopedia as a collection of concepts (corresponding to articles), each accompanied with a large body of text (the article contents).

example:

Ben Bernanke, Federal Reserve, Chairman of the Federal Reserve, Alan Greenspan (Bernanke’s predecessor), Monetarism (an economic theory of money supply and central banking), inflation and deflation.

ESA represents a text as a weighted combination of all concepts in Wikipedia; for ease of presentation, only some of the most related concepts are listed here.

ESA(explicit semantic analysis)

1. the set of basic concepts

2. the algorithm that maps text fragments into interpretation vectors

How to build the concept set

1. Using Wikipedia as a repository of basic concepts

2. Building a semantic interpreter

$$T[i,j]=tf(t_i, d_j)\cdot log\dfrac{n}{df_i}$$

TF is the frequency of word $t_i$ in document $d_j$:

$$tf(t_i, d_j)=\begin{cases} 1 + \log\ \text{count}(t_i, d_j), &\text{if } \text{count}(t_i, d_j) > 0 \\ 0, &\text{otherwise} \end{cases}$$

IDF is the inverse document frequency: the more documents a word appears in (e.g., the preposition "to"), the lower its IDF should be; conversely, a word that appears in few documents should have a high IDF.

$$IDF=log\dfrac{n}{df_i}$$

$df_i=|\{d_k:t_i\in d_k\}|$ is the number of documents containing the word, and n is the total number of documents.

$$T[i,j]\leftarrow \dfrac{T[i,j]}{\sqrt{\sum_{l=1}^r T[l,j]^2}}$$

where r is the total number of words; that is, each entry is divided by the L2 norm of its concept (column) vector.
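
A minimal sketch of the semantic interpreter defined by the formulas above, using a toy, hypothetical set of "concept" documents: build the TF-IDF matrix T, length-normalize each concept column, and represent a text fragment as the weighted sum of its words' concept vectors:

```python
import math
from collections import Counter

docs = {                                    # hypothetical "concepts" (Wikipedia articles)
    "Federal Reserve": "central bank interest rate policy inflation",
    "Inflation": "inflation price level money supply",
    "Cat": "cat feline pet animal",
}
concepts = list(docs)
tokenized = {c: docs[c].split() for c in concepts}
vocab = sorted({w for toks in tokenized.values() for w in toks})
n = len(concepts)
df = {w: sum(w in toks for toks in tokenized.values()) for w in vocab}

# T[i][j] = tf(t_i, d_j) * log(n / df_i)
T = {w: [(1 + math.log(Counter(tokenized[c])[w])) * math.log(n / df[w])
         if w in tokenized[c] else 0.0 for c in concepts] for w in vocab}

# length-normalize each concept (column) vector
for j in range(n):
    norm = math.sqrt(sum(T[w][j] ** 2 for w in vocab)) or 1.0
    for w in vocab:
        T[w][j] /= norm

def interpret(text):
    """Interpretation vector: a weighted combination of concepts for the input text."""
    vec = [0.0] * n
    for w in text.split():
        for j, v in enumerate(T.get(w, [0.0] * n)):
            vec[j] += v
    return dict(zip(concepts, vec))

print(interpret("inflation policy"))
```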

Paper Notes: Revisiting Capsules, and Capsules for Text Classification

Revisiting Capsules

Computing a capsule

• affine transform

• weighting and sum

• squash

Dynamic routing

$$c_{ij}=\dfrac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$

Shared version vs. fully-connected version

Fully-connected version

$$v_j = \text{squash}\left(\sum_i\dfrac{e^{<\hat u_{j|i}, v_j>}}{\sum_k e^{<\hat u_{k|i}, v_k>}}\hat u_{j|i}\right), \quad \hat u_{j|i}=W_{ij}u_i$$

Shared version

• Input: [batch, 6, 6, 32, 8] = [batch, 1152, 8]; there are 6x6x32 = 1152 input capsules.

• Output: [batch, 16, 10]; there are 10 output capsules.

Capsules for text classification

Model Architecture

N-gram Convolutional Layer

• kernel: [k1, embed_size, 1, B]

• B is the number of convolution kernels

• k1 is the sliding-window size along the sentence-length dimension

Primary Capsule Layer

• d is the dimension of each capsule

• This is still an ordinary convolution; the difference is that instead of mapping B channels to C channels, each of the C channels now carries d components. These are the initial capsules.

• kernel: [1, 1, B, Cd]; in the implementation, Cd channels are produced first and then split.

Convolutional Capsule Layer

• The dimension of the output capsules is still d

• The number of capsules changes; in Hinton's paper this is done through a fully-connected transformation, whereas here the change in the number of capsules is implemented with convolution.

• shared: $W\in R^{N\times d\times d}$, where N is the number of capsules

• non-shared: $W\in R^{H\times N\times d\times d}$, where H is the number of lower-level capsules.

Dynamic Routing Between Capsules

Part 1, Intuition

How do CNNs work?

• Shallow convolutional layers detect simple features such as edges and color gradients.

• Higher layers use convolutions to combine these simple features, with learned weights, into more complex features.

• Finally, the layers at the top of the network combine these very high-level features to make the classification prediction.

Problems with CNNs

Hinton: “The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.”

Internal data representation of a convolutional neural network does not take into account important spatial hierarchies between simple and complex objects.

• A CNN only cares whether the target components are present, not about their positions and relative spatial relations. For example, to decide that an image is a face, a CNN only needs to detect two eyes, two eyebrows, a nose and a mouth; if we swap the right eye and the mouth, the CNN still thinks it is a person.

• CNNs are not invariant to rotation and do not learn 3D spatial information. For example, having seen only one photo of the Statue of Liberty, we can still recognize it from angles we have never seen: our internal representation of the image does not depend on the viewpoint. This is hard for a CNN because it cannot understand 3D space.

• In addition, neural networks generally need tens of thousands of examples. A human can learn to distinguish digits from a few dozen, at most a few hundred, examples, whereas a CNN needs vast amounts of data and brute-force learning. The rotation problem above can be alleviated with data augmentation, but that again requires a large dataset.

Hardcoding 3D World into a Neural Net: Inverse Graphics Approach

Capsules were designed to address these problems. The inspiration comes from rendering in computer graphics.

Computer graphics deals with constructing a visual image from some internal hierarchical representation of geometric data.

Inspired by this idea, Hinton argues that brains, in fact, do the opposite of rendering. He calls it inverse graphics: from visual information received by eyes, they deconstruct a hierarchical representation of the world around us and try to match it with already learned patterns and relationships stored in the brain. This is how recognition happens. And the key idea is that representation of objects in the brain does not depend on view angle.

Hinton argues that the brain performs the reverse of rendering: it receives visual information, deconstructs this hierarchical representation, and matches it against patterns it already knows. The key point is that the brain's representation of an object does not depend on a single viewpoint.

So at this point the question is: how do we model these hierarchical relationships inside of a neural network?

Capsules

Capsules introduce a new building block that can be used in deep learning to better model hierarchical relationships inside of internal knowledge representation of a neural network.

• Capsules can learn positional relationships between parts, e.g., that the eyes are below the eyebrows and the mouth is below the nose, which alleviates the scrambled-components problem described earlier.

• Capsules can explicitly model 3D spatial relationships: a capsule can learn that two images taken from different viewpoints belong to the same class. Capsules better model hierarchical relationships inside the network's internal knowledge representation.

• Capsules can reach good results with only a small fraction of the data a CNN needs, which is closer to how the human brain learns: efficient, and yielding better object representations.

Part 2, How Capsules Work

What is a Capsule?

CNNs handle viewpoint invariance through max pooling: taking the maximum over a region yields invariance of activities. Invariance means that slightly changing a small part of the input leaves the output unchanged, and that a target can still be detected after it is moved around in the image.

Capsules encapsulate all important information about the state of the feature they are detecting in vector form.

• A capsule is a vector of neurons (an activity vector)

• The length of this vector represents the probability that some entity exists; an entity can be, e.g., a nose, an eye, or a class. Because the length represents a probability, the vector is squashed so that its length stays below 1 without changing its orientation, so the encoded properties are preserved.

• The direction of the vector represents the entity's properties (orientation), i.e., the instantiation parameters beyond its length, such as position, size, angle, deformation, velocity, reflectance, color, texture, and so on.

How does a capsule work?

• compute a weight $w_i$ for each input scalar $x_i$

• take a weighted sum of the input scalars $x_i$

• apply a scalar-to-scalar nonlinearity to obtain a new scalar $h_j$

A capsule's forward computation consists of:

• matrix multiplication of the input vectors $u_i$ by matrices W

• scalar weighting $c_i$ of the resulting vectors $\hat u_i$

• sum of the weighted input vectors

• vector-to-vector nonlinearity

Matrix multiplication of the input vectors $u_i$ by W

Affine transform:

$$\hat u_{j|i}=W_{ij}u_i$$

• The length of a vector is the probability that the lower-level capsule detects its entity, and its direction encodes the internal state of the detected object. We can think of the lower-level capsules $u_i$ as representing low-level features such as the eyes, mouth and nose, and the higher-level capsule $u_j$ as detecting the high-level feature "face".

• The matrix W encodes the important spatial and other relationships between low-level features, or between low-level and high-level features.

• Multiplying $u_i$ by the corresponding weight matrix W gives the prediction vector (the figure only shows one prediction vector, $\hat u_i$, because there is only one output capsule; if the next layer has j capsules, $u_i$ produces j prediction vectors).

Scalar weighting $c_i$ of the input vectors $\hat u_i$

weighting:

$$s_{j|i} = c_{ij}\hat u_{j|i}$$

$$s_{k|i} = c_{ik}\hat u_{k|i}$$

$c_{ij}$ are coupling coefficients, determined by the iterative dynamic routing procedure. In the figure above there are two higher-level capsules, J and K; capsule i obtains $\hat u_{j|i}, \hat u_{k|i}$ from the matrices W in the previous step, their weights are $c_{ij}$ and $c_{ik}$, and $c_{ij} + c_{ik} = 1$.

• In this picture, a lower-level capsule that has been activated produces one prediction vector for each higher-level capsule, so it has two prediction vectors, which are weighted and sent to the higher-level capsules J and K.

• Each higher-level capsule has already received many input vectors from the other lower-level capsules; these are the points in the figure, and the red cluster indicates that the predictions of those lower-level capsules are close to one another.

• A lower-level capsule wants to find the right higher-level capsule to entrust more of its vector to; it adjusts the weights c through dynamic routing.

Sum of the weighted input vectors

sum:

$$s_j = \sum_i c_{ij}\hat u_{j|i}$$

Vector-to-vector nonlinearity

Squashing is a new activation function applied to the vector computed above: it compresses the vector's length to below 1 without changing its direction, so that the length can be used as a probability while the encoded properties are preserved.
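
For reference, the squashing function from Sabour et al. (2017), which shrinks the norm into [0, 1) without changing the direction, is:

$$v_j=\dfrac{||s_j||^2}{1+||s_j||^2}\cdot\dfrac{s_j}{||s_j||}$$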

• Invariance: the representation does not change under a transformation; e.g., spatial invariance means insensitivity to translation (an object's position does not affect its recognition)

• Equivariance: after applying a transformation matrix, the representation transforms correspondingly, so the content of the transformation is preserved in the representation

Part 3, Dynamic Routing Between Capsules

Lower level capsule will send its input to the higher level capsule that “agrees” with its input. This is the essence of the dynamic routing algorithm.

• $c_{ij}$ is a non-negative scalar

• For each low-level capsule i, the weights $c_{ij}$ over all high-level capsules j sum to 1

• For each low-level capsule, the number of weights equals the number of high-level capsules

• The weights are determined by the dynamic routing algorithm

1. The input to the procedure is the prediction vectors $\hat u$ of the layer-$l$ capsules after the matrix transformation, and the number of iterations r.

2. Initialize $b_{ij}$ to 0; $b_{ij}$ is used to compute $c_{ij}$.

3. Repeat the following three steps (lines 4-6) r times to compute the output of capsule j in layer $l+1$.

4. In layer $l$, each capsule i applies a softmax to $b_{ij}$ to get $c_{ij}$, the weight that capsule i in layer $l$ assigns to capsule j in layer $l+1$. In the first iteration all $b_{ij}$ are initialized to 0, so the resulting $c_{ij}$ are uniform; after several iterations the $c_{ij}$ get updated.

5. In layer $l+1$, each capsule j takes the weighted sum of the $\hat u_{j|i}$ with weights $c_{ij}$ to obtain the output vector $s_j$.

6. In layer $l+1$, the squash activation rescales $s_j$ so that its length is at most 1.

7. Update the parameters: for every capsule i in layer $l$ and capsule j in layer $l+1$, set $b_{ij}$ to the old $b_{ij}$ plus the dot product of capsule j's input and output. This dot product measures the agreement between input and output, so a low-level capsule routes its output to the high-level capsules it agrees with (a code sketch of this procedure follows the list).
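
A compact numpy sketch of steps 1-7, assuming hypothetical sizes (1152 low-level capsules, 10 high-level capsules, prediction vectors of dimension 16):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Scale the norm into [0, 1) without changing the direction."""
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, r=3):
    """u_hat: prediction vectors, shape (num_low, num_high, dim)."""
    num_low, num_high, _ = u_hat.shape
    b = np.zeros((num_low, num_high))                 # step 2: initialize logits b_ij = 0
    for _ in range(r):                                # step 3: iterate r times
        c = softmax(b, axis=1)                        # step 4: c_ij = softmax of b_ij over high-level capsules
        s = (c[..., None] * u_hat).sum(axis=0)        # step 5: weighted sum per high-level capsule
        v = squash(s)                                 # step 6: squash
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)  # step 7: update by agreement <u_hat, v>
    return v

u_hat = np.random.default_rng(0).normal(size=(1152, 10, 16))
print(dynamic_routing(u_hat).shape)                   # (10, 16)
```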

A diagram drawn by susht; it does not show $b_{ij}$, whose number of parameters matches the number of $c_{ij}$ and of prediction vectors $\hat u_{j|i}$.

• Suppose there are two high-level capsules; the purple vectors v1 and v2 are their outputs, the orange vector is the input coming from one particular low-level capsule, and the black vectors are the inputs from the other low-level capsules.

• For the left capsule, the orange u_hat points in the opposite direction of v1: the two vectors are dissimilar, their dot product is negative, and the update decreases the corresponding c_11. For the right capsule, the orange u_hat is close in direction to v2, so the update increases the corresponding c_12.

• After several iterations, all routing weights c are updated to the point where the outputs of the low-level capsules are best matched to the outputs of the high-level capsules.

CapsNet Architecture

• ReLU Conv1: an ordinary convolution layer with 256 kernels of size $9\times 9$; with a $28\times 28\times 1$ input, the output is $20\times 20\times 256$.
• PrimaryCaps: 32 channels of capsules are built here, each capsule an 8-dimensional vector. This is still convolution: each channel uses 8 convolution kernels of size $9\times 9$, so the total number of parameters is $9\times 9\times 256 \times 32\times 8 + 32 \times 8= 5308672$. Each channel corresponds to one feature map, here $6\times 6$ 8-dimensional capsules, so in total there are $6\times 6\times 32=1152$ capsules.
• DigitCaps: propagates and routes over the previous 1152 capsules; the input is 1152 capsules and the output is 10 capsules, whose dimension grows from 8 to 16, representing the 10 digit classes; these 10 capsules are used for classification. The total number of parameters is $1152\times 8\times 16+1152+1152=149760$, where the two trailing 1152s are for $b_{ij}$ and $c_{ij}$.

loss function

$$L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$

• The difference is that here the correct class contributes more loss the further its probability falls below $m^+=0.9$, while a negative class contributes more loss the further its probability rises above $m^-=0.1$ (the full margin loss is written out after these bullets).

• The $\lambda$ parameter is set to 0.5 to prevent the negative classes from shrinking the lengths of all capsule vectors early in training.
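
For reference, the margin loss from Sabour et al. (2017) that these bullets describe, with $T_k = 1$ iff class k is present, $m^+=0.9$ and $m^-=0.1$:

$$L_k=T_k\max(0,\ m^+ - ||v_k||)^2 + \lambda\,(1-T_k)\max(0,\ ||v_k|| - m^-)^2$$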

Advantages of capsules for classification

Paper Notes: Pretrained Language Models 2 - ULMFiT

Motivation

Comparison with previous models

concatenate embeddings: ELMo

Recent approaches that concatenate embeddings derived from other tasks with the input at different layers (Peters et al., 2017; McCann et al., 2017; Peters et al., 2018) still train the main task model from scratch and treat pretrained embeddings as fixed parameters, limiting their usefulness.

ELMo consists of the following steps:

• pretrain with a language-modeling objective

• then fine-tune the LM on the target-domain corpus

• finally, concatenate the pretrained embeddings with the task input and train the task model

pretraining LM:

In light of the benefits of pretraining (Erhan et al., 2010), we should be able to do better than randomly initializing the remaining parameters of our models. However, inductive transfer via finetuning has been unsuccessful for NLP (Mou et al., 2016). Dai and Le (2015) first proposed finetuning a language model (LM) but require millions of in-domain documents to achieve good performance, which severely limits its applicability.

ULMFiT

We show that not the idea of LM fine-tuning but our lack of knowledge of how to train them effectively has been hindering wider adoption. LMs overfit to small datasets and suffered catastrophic forgetting when fine-tuned with a classifier. Compared to CV, NLP models are typically more shallow and thus require different fine-tuning methods.

• universal language model fine-tuning

• discriminative fine-tuning, slanted triangular learning rates

Universal Language Model Fine-tuning

• General-domain LM pretraining

• Target task LM fine-tuning

• Target task classifier fine-tuning

General-domain LM pretraining

Wikitext-103 (Merity et al., 2017b) consisting of 28,595 preprocessed Wikipedia articles and 103 million words.

Target task LM fine-tuning

discriminative fine-tuning

As different layers capture different types of information (Yosinski et al., 2014), they should be fine-tuned to different extents.

$$\{\theta^1,\theta^2, \dots, \theta^L\}$$

$$\{\eta^1,\eta^2, \dots, \eta^L\}$$

Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with different learning rates.

$$\theta_t = \theta_{t-1}-\eta\cdot\nabla_{\theta}J(\theta)$$

$$\theta_t^l = \theta_{t-1}^l-\eta^l\cdot\nabla_{\theta^l}J(\theta)$$
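
A minimal PyTorch-style sketch of discriminative fine-tuning via optimizer parameter groups, assuming a hypothetical 3-layer model and the 2.6 decay factor suggested in the paper:

```python
import torch

model = torch.nn.Sequential(               # hypothetical 3-"layer" model
    torch.nn.Linear(100, 100),
    torch.nn.Linear(100, 100),
    torch.nn.Linear(100, 10),
)
base_lr, decay = 1e-3, 2.6                  # eta^{l-1} = eta^l / 2.6, as suggested in the paper
groups = [
    {"params": layer.parameters(), "lr": base_lr / (decay ** (len(model) - 1 - i))}
    for i, layer in enumerate(model)
]
optimizer = torch.optim.Adam(groups)        # top layer uses the largest lr, lower layers smaller ones
```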

Slanted triangular learning rates

• T is the total number of training iterations, i.e., $epochs \times \text{number of updates per epoch}$

• cut_frac is the fraction of iterations during which the learning rate increases

• cut is the iteration at which the learning rate switches from increasing to decreasing

• p is a piecewise quantity that first increases and then decreases

• ratio specifies how much smaller the lowest learning rate is than the maximum; for example, at t=0 we have p=0 and $\eta_0=\dfrac{\eta_{max}}{ratio}$ (the full schedule is written out below)
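
Putting the pieces together, the slanted triangular schedule that these bullets describe is, as given in the paper:

$$cut=\lfloor T\cdot cut\_frac\rfloor$$

$$p=\begin{cases} t/cut, & t<cut \\ 1-\dfrac{t-cut}{cut\cdot(1/cut\_frac-1)}, & \text{otherwise}\end{cases}$$

$$\eta_t=\eta_{max}\cdot\dfrac{1+p\cdot(ratio-1)}{ratio}$$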

Target task classifier fine-tuning

concat pooling

We first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. We then unfreeze the next lower frozen layer and repeat, until we finetune all layers until convergence at the last iteration.

BPTT for Text Classification

backpropagation through time(BPTT)

We divide the document into fixed length batches of size b. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences (Merity et al., 2017a).

experiment

ablations

“from scratch”: no fine-tuning

“supervised”: fine-tuned only on the labeled examples

“semi-supervised”: also fine-tuned on the unlabeled examples

Analysis of the tricks

“full” :fine-tuning the full model

“discr”: discriminative fine-tuning

“stlr”: slanted triangular learning rates

Why study GANs?

• High-dimensional probability distributions: generative models trained on, and sampling from, high-dimensional distributions have a strong ability to represent them.
• Reinforcement learning: GANs can be combined with RL.
• Missing data: generative models can make effective use of unlabeled data, i.e., semi-supervised learning.

How do generative models work?

Maximum likelihood estimation

$$\theta^*=\arg\max_{\theta}\sum_{i=1}^m \log p_{\text{model}}(x^{(i)}; \theta)$$

where m is the number of samples.

Relative entropy (KL divergence)

p is the true distribution and q is the model (approximating) distribution.

$$H(p,q) = \sum_ip(i)log{\dfrac{1}{q(i)}}$$

$$H(p) = \sum_ip(i)log{\dfrac{1}{p(i)}}$$

$$D(p||q) = H(p,q)-H(p) = \sum_ip(i)log{\dfrac{p(i)}{q(i)}}$$
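
A tiny numeric check of the three quantities above, with hypothetical discrete distributions p and q:

```python
import math

p = [0.5, 0.4, 0.1]                 # "true" distribution (hypothetical)
q = [0.4, 0.4, 0.2]                 # model distribution (hypothetical)

H_pq = sum(pi * math.log(1 / qi) for pi, qi in zip(p, q))   # cross-entropy H(p, q)
H_p  = sum(pi * math.log(1 / pi) for pi in p)               # entropy H(p)
D_pq = H_pq - H_p                                           # KL divergence D(p || q) >= 0
print(round(H_pq, 4), round(H_p, 4), round(D_pq, 4))
```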

How do GANs work?

The GAN framework

• discriminator

• generator

containing latent variables z and observed variables x.

z are the latent variables to be learned and x are the observed variables.

Generator

A differentiable function G; in practice, a neural network. z is drawn from a simple prior, and G(z) produces samples x from $p_{\text{model}}$. In practice z does not have to be fed only into the first layer; it can also be injected into the second layer, and so on. The design of the generator is very flexible.

cost function

The discriminator’s cost, J (D)

$$\dfrac{p_{data}(x)}{p_{\text{model}}(x)}$$

GANs estimate this ratio with supervised learning, which is what distinguishes them from variational autoencoders and Boltzmann machines.

Minimax, zero-sum game

$$J^{(G)} = -J^{(D)}$$

$$V(\theta^{(D)}, \theta^{(G)})=-J^{(D)}(\theta^{(D)}, \theta^{(G)})$$

The outer loop minimizes over $\theta^{(G)}$ and the inner loop maximizes over $\theta^{(D)}$.

In practice, the players are represented with deep neural nets and updates are made in parameter space, so these results, which depend on convexity, do not apply

Heuristic, non-saturating game

Minimizing the cross-entropy between a target class and a classifier’s predicted distribution is highly effective because the cost never saturates when the classifier has the wrong output.

$$J^{(G)}(\theta^{(D)}, \theta^{(G)})=-\dfrac{1}{2}E_z\log D(G(z))-\dfrac{1}{2}E_{x\sim p_{data}}\log(1-D(x))$$

$$J^{(G)}(\theta^{(D)}, \theta^{(G)})=-\dfrac{1}{2}E_z\log D(G(z))$$

In the minimax game, the generator minimizes the log-probability of the discriminator being correct. In this game, the generator maximizes the logprobability of the discriminator being mistaken.

Maximum likelihood game

$$J^{(G)}=-\dfrac{1}{2}E_zexp(\sigma^{-1}(D(G(z))))$$

in practice, both stochastic gradient descent on the KL divergence and the GAN training procedure will have some variance around the true expected gradient due to the use of sampling (of x for maximum likelihood and z for GANs) to construct the estimated gradient.

Is the choice of divergence a distinguishing feature of GANs?

Jensen-Shannon divergence, reverse KL

KL divergence is not symmetric: $D_{KL}(p_{data}||p_{model})$ and $D_{KL}(p_{model}||p_{data})$ are different. Maximum likelihood estimation corresponds to the former, while minimizing the Jensen-Shannon divergence behaves more like the latter.

f-GAN showed that the KL divergence can also produce sharp samples and likewise picks only a small number of modes, indicating that the Jensen-Shannon divergence is not what distinguishes GANs from other models.

GANs usually generate samples from only a small number of modes, fewer than the model's capacity would allow, whereas reverse KL prefers to cover as many modes of the data distribution as the model's capacity permits and does not generally choose fewer. This suggests that mode collapse is not caused by the choice of divergence.

Altogether, this suggests that GANs choose to generate a small number of modes due to a defect in the training procedure, rather than due to the divergence they aim to minimize.

Comparison of cost functions

$D(G(z))$ is the probability the discriminator assigns to a generated sample being real.

Maximum likelihood also suffers from the problem that nearly all of the gradient comes from the right end of the curve, meaning that a very small number of samples dominate the gradient computation for each minibatch. This suggests that variance reduction techniques could be an important research area for improving the performance of GANs, especially GANs based on maximum likelihood.

The DCGAN architecture.

Comparison of GAN, NCE, and MLE

• Minimax GAN and NCE share the same cost function

• The update strategies differ: a GAN updates its generator by gradient descent, MLE copies the density model learned inside the discriminator and converts it into a sampler to be used as the generator, and NCE never updates the generator; it is just a fixed source of noise.

Tips and Tricks

How to train a GAN: https://github.com/soumith/ganhacks

Paper Notes: Contextual Augmentation

paper 1

Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations

$$P(\cdot\,|\,y, S\backslash\{w_i\})^{1/\tau}$$

paper 2

Compared with an LSTM, BERT captures deeper contextual meaning.

When the task-specific dataset is with more than two different labels,we should re-train a label size compatible label embeddings layer instead of directly fine-tuning the pre-trained one.

Paper Notes: A Baseline for OOD Detection

paper: A baseline for detecting misclassified and out-of-distribution examples

Motivation


• What is the baseline method, i.e., how do we evaluate a model's ability to automatically detect OOD inputs?

• The standard tasks and datasets the authors provide

Baseline method

• error and success prediction: can the model tell whether it classified a sample correctly?

• in- and out-of-distribution detection: can the model correctly detect OOD samples?

metric1: AUROC

The ROC curve is a threshold-independent performance metric. Since in this kind of imbalanced problem we care about the positive label, the metrics of interest are the true positive rate (TPR) and the false positive rate (FPR).

• TPR (true positive rate): an instance that is positive and is predicted as positive is a true positive; TPR is the fraction of all positive instances that the classifier identifies as positive, i.e., the recall of the positive class: TPR = TP / (TP + FN)

• FPR (false positive rate): the fraction of all negative instances that the classifier mistakes for positive: FPR = FP / (FP + TN)

metric2: AUPR

Area Under the Precision-Recall curve (AUPR)

The PR curve plots precision (tp/(tp+fp)) against recall (tp/(tp+fn)). For the PR curve, the choice of which class is treated as positive matters a great deal.
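
A minimal sketch of the baseline evaluation, assuming the maximum softmax probability is used as the in-distribution score and that scikit-learn is available; `probs_in` and `probs_out` are hypothetical softmax outputs for in-distribution and OOD inputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
probs_in = rng.dirichlet(np.ones(10) * 0.1, size=500)   # hypothetical: peaked in-distribution softmax
probs_out = rng.dirichlet(np.ones(10), size=500)        # hypothetical: flatter OOD softmax

scores = np.concatenate([probs_in.max(axis=1), probs_out.max(axis=1)])
labels = np.concatenate([np.ones(len(probs_in)), np.zeros(len(probs_out))])  # 1 = in-distribution

print("AUROC:", roc_auc_score(labels, scores))
print("AUPR (in-distribution as positive):", average_precision_score(labels, scores))
```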

Paper Notes: Deep Transition Architecture

paper 1

paper: Self-Attention: A Better Building Block for Sentiment Analysis Neural Network Classifiers, 2018 WASSA@EMNLP

• Sinusoidal Position Encoding

• Learned Position Encoding

• Relative Position Representations

Sinusoidal encoding is what the Transformer uses; its advantage is that position encodings can be computed even when a test sentence is longer than every sentence in the training set.

Relative position representations work best. The authors overlap with the Transformer authors, and the paper is worth reading: Self-attention with relative position representations.

For this method, the self-attention mechanism is modified to explicitly learn the relative positional information between every two sequence positions. As a result, the input sequence is modeled as a labeled, directed, fully-connected graph, where the labels represent positional information. A tunable parameter k is also introduced that limits the maximum distance considered between two sequence positions. [Shaw et al., 2018] hypothesized that this will allow the model to generalize to longer sequences at test time.

paper 2

• encoder transition

• query transition

• decoder transition

GRU

$$h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t$$

candidate state:

$$\tilde h_t = tanh(W_{xh}x_t + r_t\odot (W_{hh}h_{t-1}))$$

reset gate:

$$r_t = \sigma(W_{xr}x_t+W_{hr}h_{t-1})$$

update gate:

$$z_t=\sigma(W_{xz}x_t+W_{hz}h_{t-1})$$

T-GRU (transition GRU)

$$h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t$$

candidate state:

$$\tilde h_t = tanh(r_t\odot (W_{hh}h_{t-1}))$$

reset gate:

$$r_t = \sigma(W_{hr}h_{t-1})$$

update gate:

$$z_t=\sigma(W_{hz}h_{t-1})$$
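
A small numpy sketch of one T-GRU step following the four equations above (the hidden size is hypothetical); unlike the vanilla GRU, no input $x_t$ appears:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def t_gru_step(h_prev, W_hh, W_hr, W_hz):
    """One transition-GRU step: all gates depend only on the previous hidden state."""
    r = sigmoid(W_hr @ h_prev)                  # reset gate
    z = sigmoid(W_hz @ h_prev)                  # update gate
    h_tilde = np.tanh(r * (W_hh @ h_prev))      # candidate state
    return (1 - z) * h_prev + z * h_tilde       # new hidden state

d = 8                                           # hypothetical hidden size
rng = np.random.default_rng(0)
h = rng.normal(size=d)
W_hh, W_hr, W_hz = (rng.normal(size=(d, d)) for _ in range(3))
print(t_gru_step(h, W_hh, W_hr, W_hz).shape)
```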

L-GRU( Linear Transformation enhanced GRU)

$$h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t$$

candidate state:

$$\tilde h_t = tanh(W_{xh}x_t + r_t\odot (W_{hh}h_{t-1}))+ l_t\odot H(x_t)$$

$$H(x_t)=W_xx_t$$

$$l_t=\sigma(W_{xl}x_t+W_{hl}h_{t-1})$$

$\tilde h_t = tanh(W_{xh}x_t +W_{hh2}h_{t-1} +l_t\odot W_{xh2}x_t+ r_t\odot (W_{hh}h_{t-1}))$

DTMT

Encoder

$L_s$ is the depth of the encoder transition, and $j$ is the current time step.

$$\overrightarrow h_{j,0}=\text{L-GRU}(x_j, \overrightarrow h_{j-1,L_s})$$

$$\overrightarrow h_{j,k}=\text{T-GRU}(\overrightarrow h_{j, k-1}),\quad\text{for } 1\le k\le L_s$$

Decoder

• query transition: depth $L_q$

• decoder transition: depth $L_d$

Paper Notes: BERT

BERT(Bidirectional Encoder Representations from Transformers.)

Why Bidirectional?

$$P(S)=p(w_1,w_2,\dots,w_m)=p(w_1)p(w_2|w_1)\cdots p(w_m|w_1\dots w_{m-1})=\prod_{i=1}^m p(w_i|w_1\dots w_{i-1})$$

$$P(S)=\prod_{i=1}^m p(w_i|w_1…w_{i-1})$$

$$P(S)=\prod_{i=1}^m p(w_i|w_{i+1}…w_{m})$$

ELMo is this kind of bidirectional language model (BiLM).

Input representation

• position embedding: similar to the Transformer

• segment (sentence) embedding: all tokens of the same sentence share the same embedding, $E_A$ or $E_B$, to indicate that different sentences play different roles

• For sentence-pair tasks such as [Question, Answer], a [SEP] token is appended at the end of each sentence.

• For single-sentence tasks such as text classification, only [CLS] is added and the segment embedding is always $E_A$.

• “My dog is <MASK>”: 80% of the time the chosen token is replaced with <MASK>

• “My dog is apple”: 10% of the time it is replaced with a random token

• “My dog is hairy”: 10% of the time it is left unchanged

Why not replace every chosen token with <MASK>?

If the model had been trained on only predicting ‘<MASK>’ tokens and then never saw this token during fine-tuning, it would have thought that there was no need to predict anything and this would have hampered performance. Furthermore, the model would have only learned a contextual representation of the ‘<MASK>’ token and this would have made it learn slowly (since only 15% of the input tokens are masked). By sometimes asking it to predict a word in a position that did not have a ‘<MASK>’ token, the model needed to learn a contextual representation of all the words in the input sentence, just in case it was asked to predict them afterwards.

Are random tokens alone not enough? Why also keep 10% of the sentences intact?

Well, ideally we want the model’s representation of the masked token to be better than random. By sometimes keeping the sentence intact (while still asking the model to predict the chosen token) the authors biased the model to learn a meaningful representation of the masked tokens.

next sentence prediction

50% of the time the second sentence is the true next sentence, and 50% of the time it is a randomly chosen sentence. Pretraining reaches 97-98% accuracy on this task, and it noticeably improves QA and NLI tasks.

pre-training procedure

• batch_size 256.

• each sentence pair: 512 tokens

• 40 epochs

• Adam lr=1e-4, $\beta_1=0.9$, $\beta_2=0.999$, L2 weight decay 0.01

• learning rate warmup 10000 steps

• 0.1 dropout

• gelu instead of relu

Fine-tune procedure

• For sentence-pair tasks such as Quora Question Pairs (QQP), predict whether the two sentences have the same meaning, as in figure (a).

• For single-sentence classification such as the Stanford Sentiment Treebank (SST-2) and the Corpus of Linguistic Acceptability (CoLA), see figure (b).

The fine-tuning hyperparameters are largely the same as in pre-training, but training is much faster.

• Batch size: 16, 32

• Learning rate (Adam): 5e-5, 3e-5, 2e-5

• Number of epochs: 3, 4

How to use BERT

Text classification

https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_classifier.py

• data preprocessing

• loading the pre-trained model

Data processing

• input_ids are obtained by mapping tokens through the vocabulary

• input_mask: during fine-tuning, all real tokens are 1 and padding positions are 0

• segment_ids are 0 for text_a, 1 for text_b, and 0 for padding (a toy illustration follows these bullets)
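
A toy illustration of how a text_a/text_b pair becomes these three inputs, using a hypothetical word-level vocabulary instead of the real WordPiece tokenizer; only the id/mask/segment layout is the point:

```python
# hypothetical vocabulary; real BERT uses a WordPiece vocab
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "how": 1, "are": 2, "you": 3, "fine": 4}

def convert(text_a, text_b, max_len=10):
    tokens = ["[CLS]"] + text_a.split() + ["[SEP]"] + text_b.split() + ["[SEP]"]
    segment_ids = [0] * (len(text_a.split()) + 2) + [1] * (len(text_b.split()) + 1)
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)                  # pad everything with 0
    return input_ids + [0] * pad, input_mask + [0] * pad, segment_ids + [0] * pad

print(convert("how are you", "fine"))
# ids [101, 1, 2, 3, 102, 4, 102, 0, 0, 0]; mask has seven 1s then 0s; segments [0,0,0,0,0,1,1,0,0,0]
```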

Loading the pre-trained model

missing_keys are the parts whose parameters are not taken from the pre-trained model, i.e., the classifier layer ['classifier.weight', 'classifier.bias'], since this layer is specific to the classification task.

unexpected_keys are parameters that the classification task does not need but that exist in the pre-trained language model. Looking at the BertForMaskedLM model, the cls layer belongs to the language model and has to be dropped for downstream tasks.

Rejection Series 3: OpenMax

paper: Towards Open Set Deep Networks. CVPR

Motivation

By its nature, closed-set recognition must pick one of the known classes as the prediction. In real-world settings, however, a recognition system must learn to reject unknown/unseen classes at test time.

A key element of estimating the unknown probability is adapting Meta-Recognition concepts to the activation patterns in the penultimate layer of the network.

Introduction

probability/confidence scores, e.g., adversarial images produced by adversarial learning. As the authors note later, thresholding actually rejects uncertain predictions rather than unknowns.

OpenMax incorporates likelihood of the recognition system failure. This likelihood is used to estimate the probability for a given input belonging to an unknown class. For this estimation, we adapt the concept of Meta-Recognition[22, 32, 9] to deep networks. We use the scores from the penultimate layer of deep networks (the fully connected layer before SoftMax, e.g., FC8) to estimate if the input is “far” from known training data. We call scores in that layer the activation vector(AV).

A key insight in our opening deep networks is noting that “open space risk” should be measured in feature space rather than in pixel space.

We show that an extreme-value meta-recognition inspired distance normalization process on the overall activation patterns of the penultimate network layer provides a rejection probability for OpenMax normalization for unknown images, fooling images and even for many adversarial images.

Open set deep networks

Building on the concepts of open space risk, we seek to choose a layer (feature space) in which we can build a compact abating probability model that can be thresholded to limit open space risk.

Multi-class meta-recognition

Prior work on meta-recognition used the final system scores, analyzed their distribution based on Extreme Value Theory (EVT) and found these distributions follow a Weibull distribution.

from wikipedia:

It seeks to assess, from a given ordered sample of a given random variable, the probability of events that are more extreme than any previously observed.

We take the approach that the network values from penultimate layer (hereafter the Activation Vector (AV)), are not an independent per-class score estimate, but rather they provide a distribution of what classes are “related.”

Rejection Series 1: Overview

Motivation

In real-world recognition/classification tasks, limited by various objective factors, it is usually difficult to collect training samples to exhaust all classes when training a recognizer or classifier. A more realistic scenario is open set recognition (OSR), where incomplete knowledge of the world exists at training time, and unknown classes can be submitted to an algorithm during testing, requiring the classifiers not only to accurately classify the seen classes, but also to effectively deal with the unseen ones.

This paper provides a comprehensive survey of existing open set recognition techniques covering various aspects ranging from related definitions, representations of models, datasets, experiment setup and evaluation metrics. Furthermore, we briefly analyze the relationships between OSR and its related tasks including zero-shot, one-shot (few-shot) recognition/learning techniques, classification with reject option, and so forth. Additionally, we also overview the open world recognition which can be seen as a natural extension of OSR. Importantly, we highlight the limitations of existing approaches and point out some promising subsequent research directions in this field.

Introduction

a more realistic scenario is usually open and non-stationary such as driverless, fault/medical diagnosis, etc., where unseen situations can emerge unexpectedly, which drastically weakens the robustness of these existing methods.

To meet this challenge, several related research directions actually have been explored including lifelong learning [1], [2], transfer learning [3]–[5], domain adaptation [6], zero-shot [7]–[9], one-shot (few-shot) [10]–[16] recognition/learning and open set recognition/classification [17]–[19], and so forth.

recognition should consider four basic categories of classes as follows:

• known known: labeled samples in train/dev, including positive and negative classes, with associated semantic information.

• known unknown: labeled negative samples in train/dev with no associated semantic information.

• unknown known: test samples that did not appear in training but that have associated semantic information; e.g., training contains cats and testing contains another feline, so semantics such as "animal" are still meaningful(?).

• unknown unknown: test samples that did not appear in training and have no associated semantic information.

Unlike the traditional classification, zero-shot learning (ZSL) can identify unseen classes which have no available observations in training. However, the available semantic/attribute information shared among all classes including seen and unseen ones are needed.

Zero-shot learning targets the unknown known case, i.e., semantic information is available.

The ZSL mainly focuses on the recognition of the unknown known classes defined above. Actually, such a setting is rather restrictive and impractical, since we usually know nothing about the testing samples which may come from known known classes or not.

The unknown known setting is restrictive and impractical, because it is hard to know whether a test sample comes with semantic information, i.e., whether it is unknown known or unknown unknown.

comparision between open set recognition and traditional classification

Via these decision boundaries, samples from an unknown unknown class are labeled as ”unknown” or rejected rather than misclassified as known known classes.

$L(x, y, f(x)) \ge 0$ is the loss function, and P(x,y) is the probability of the sample (x, y). This joint distribution is usually unknown, because we cannot determine the true distribution over the sample/label space.

The definition of the risk function from Li Hang's machine learning textbook:

Therefore, traditional recognition/classification approaches minimize the empirical risk instead of the ideal risk RI by using other knowledge, such as assuming that the label space is at least locally smooth and regularizing the empirical risk minimization.

Note that traditional recognition problem is usually performed under the closed set assumption. When the assumption switches to open environment/set scenario with the open space, other things should be added since intuitively there is some risk in labeling sample in the open space as any known known classes. This gives such an insight for OSR that we do know something else: we do know where known known classes exist, and we know that in open space we do not have a good basis for assigning labels for the unknown unknown classes.

open space risk

To improve the overall open set recognition error, our 1-vs-set formulation balances the unknown classes by obtaining a core margin around the decision boundary A from the base SVM, specializing the resulting half-space by adding another plane $\Omega$ and then generalizing or specializing the two planes (shown in Fig. 2) to optimize empirical and open space risk. This process uses the open set training data and the risk model to define a new “open set margin.” The second plane $\Omega$ allows the 1-vs-set machine to avoid the overgeneralization that would misclassify the raccoon in Fig. 2. The overall optimization can also adjust the original margin with respect to A to reduce open space risk, which can avoid negatives such as the owl.

While we do not know the joint distribution $P(x, y)$, one way to look at the open space risk is as a weak assumption: Far from the known data the Principle of Indifference [8] suggests that if there is no known reason to assign a probability, alternatives should be given equal probability. In our case, this means that at all points in open space, all labels (both known and unknown) are equally likely, and risk should be computed accordingly. However, we cannot have constant value probabilities over infinite spaces—the distribution must be integrable and integrate to 1. We must formalize open space differently (e.g., by ensuring the problem is well posed and then assuming the probability is proportional to relative Lebesgue measure [9]). Thus, we can consider the measure of the open space to the full space, and define our risk penalty proportional to such a ratio.

where open space risk is considered to be the fraction (in terms of Lebesgue measure) of positively labeled open space compared to the overall measure of positively labeled space (which includes the space near the positive examples).

Open space risk is the ratio (in terms of Lebesgue measure) of positively labeled open space to the overall measure of positively labeled space.

openness

Openness characterizes how open the dataset is:

• $C_{TR}$ is the number of classes in the training set; the larger it is, the lower the openness.

• $C_{TE}$ is the number of classes in the test set.

The Open Set Recognition Problem

our goal is to balance the risk of the unknown in open space with the empirical (known) risk. In this sense, we formally define the open set recognition problem as follows:

a categorization of OSR techniques

• Deep Network-based

• EVT-based

• Dirichlet Process-based OSR models

Deep Neural Network-based OSR Models

the OpenMax effectively addressed the challenge of the recognition for fooling/rubbish and unrelated open set images. However, as discussed in [71], the OpenMax fails to recognize the adversarial images which are visually indistinguishable from training samples but are designed to make deep networks produce high confidence but incorrect answers [96], [98].

OpenMax effectively handles fooling/rubbish and unrelated open-set images, but it cannot effectively detect adversarially generated samples.

Actually, the authors in [72] have indicated that the OpenMax is susceptible to the adversarial generation techniques directly working on deep representations. Therefore, the adversarial samples are still a serious challenge for open set recognition. Furthermore, using the distance from MAV, the cross entropy loss function in OpenMax does not directly incentivize projecting class samples around the MAV. In addition to that, the distance function used in testing is not used in training, possibly resulting in inaccurate measurement in that space [73]. To address this limitation, Hassen and Chan [73] learned a neural network based representation for open set recognition, which is similar in spirit to the Fisher Discriminant, where samples from the same class are closed to each other while the ones from different classes are further apart, leading to larger space among known known classes for unknown unknown classes’ samples to occupy.

• OpenMax to text classification

• Deep Open classifier

• tWiSARD

• hidden unknown unknown classes

Adversarial Learning-based OSR Models

Note that the main challenge for open set recognition is the incomplete class knowledge existing in training, leading to the open space risk when classifiers encounter unknown unknown classes during testing. Fortunately, the adversarial learning technique can account for open space to some extent by adversarially generating the unknown unknown class data according to the known known class knowledge, which undoubtedly provides another way to tackle the challenging multiclass OSR problem.

The biggest challenge of open set recognition is the incomplete knowledge available during training; encountering unknown unknown classes at test time then incurs open space risk.

EVT-based OSR Models

As a powerful tool to increase the classification performance, the statistical Extreme Value Theory (EVT) has recently achieved great success due to the fact that EVT can effectively model the tails of the distribution of distances between training observations using the asymptotic theory[100].

Remark: As mentioned above, almost all existing OSR methods adopt the threshold-based classification scheme, where recognizers in decision either reject or categorize the input samples to some known known class using empirically set threshold. Thus the threshold plays a key role. However, the selection for it usually depends on the knowledge of known known classes, inevitably incurring risks due to lacking available information from unknown unknown classes [57]. This indicates the threshold-based OSR methods still face serious challenges.

Dirichlet Process-based OSR Models (生成模型)

Dirichlet process (DP) [104]–[108] considered as a distribution over distributions is a stochastic process, which has been widely applied in clustering and density estimation problems as a nonparametric prior defined over the number of mixture components. Furthermore, this model does not overly depend on training samples and can achieve adaptive change as the data changes, making it naturally adapt to the open set recognition scenario. In fact, researchers have begun the related research

As a nonparametric method based on mixture models, the Dirichlet process is widely used for clustering and parameter estimation. It does not depend heavily on the training samples and can adapt as the data change, which makes it naturally suited to the open set scenario.

Remark: Instead of addressing the OSR problem from the discriminative model perspective, CD-OSR actually reconsiders this problem from the generative model perspective due to the use of HDP, which provides another research direction for open set recognition. Furthermore, the collective decision strategy for OSR is also worth further exploring since it not only takes the correlations among the testing samples into account but also provides a possibility for new class discovery, whereas single-sample decision strategy2 adopted by other existing OSR methods can not do such a work since it can not directly tell whether the single rejected sample is an outlier or from new class.

Beyond open set Recognition

open world recognition (OWR), where a recognition system should perform four tasks:

• detecting unknown unknown classes

• choosing which samples to label for addition to the model

• labelling those samples

• updating the classifier

Remark: As a natural extension of OSR, the OWR faces more serious challenges which require it not only to have the ability to handle the OSR task, but also to have minimal downtime, even to continuously learn, which seems to have the flavor of lifelong learning to some extent. Besides, although some progress regarding the OWR has been made, there is still a long way to go.

Dataset and evalution metrics

dataset

Experiment Setup: In open set recognition, most existing experiments are carried out on a variety of recast multi-class benchmark datasets. Specifically, taking the USPS dataset as an example, when it is used for the OSR problem, one can randomly choose S distinct labels as the known known classes and vary openness by adding a subset of the remaining labels.

Evaluation Metrics for Open Set Recognition

• TP： true positive

• FP: false positive

• TN: true negative

• FN: false negative

• TU: true unknown

• FU: false unknown

accuracy

$$\text{accuracy}=\dfrac{TP+TN}{TP+TN+FP+FN}$$

$$\text{accuracy}_O=\dfrac{(TP+TN)+TU}{(TP+TN+FP+FN)+(TU+FU)}$$

$0\le \lambda \le 1$ is a regularization constant.

F-measure

F1:

$$F1=\dfrac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$$

$$precision=\dfrac{TP}{TP+FP}$$

$$recall=\dfrac{TP}{TP+FN}$$

Instead, the computations of Precision and Recall in it are only for available known known classes. Additionally, the work [67] has indicated that although the computations of Precision and Recall are only for available known known classes, the FN and FP also consider the false unknown unknown classes and false known known classes by taking into account the false negative and the false positive, and we refer the reader to [67] for more details.

Note that the Precision contains the macro-Precision and micro-Precision while Recall includes the macro-Recall and micro-Recall, which leads to the corresponding macro-F-measure and micro-F-measure. Nevertheless, whether it is macro-F-measure or micro-F-measure, the higher their values, the better the performance of the corresponding OSR model.

Youden’s index for OSR

$$J= \text{Recall}+S-1$$

where S denotes the specificity (true negative rate).

future research directions

• Most existing work is based on discriminative models; only a small part is based on generative models, which may leave more room for exploration.

• The main challenge of OSR is that traditional classifiers are trained under the closed-set assumption; once an unknown unknown class sample falls into the space assigned to a known class, it can never be classified correctly.

Open set + ‘sth’

As open set scenario is a more practical assumption for the real-world classification/recognition tasks, it can naturally be combined with various fields involving classification/recognition such as semi-supervised learning, domain adaptation, active learning, multi-task learning, multi-label learning, multi-view learning, and so forth. For example, [124]–[126] recently introduced this scenario into domain adaptation, while [127] explored the open set classification in active learning field. Therefore, many interesting works are worth looking forward to.

Generalized Open Set Recognition

Appending semantic/attribute information

In fact, a lot of semantic/attribute information is shared between the known known and the unknown unknown classes. Therefore, we can fully utilize this kind of information to 'cognize' the unknown unknown classes, or at least to provide a rough semantic/attribute description for the corresponding unknown unknown classes instead of simply rejecting them.

The $\text{side-information}^1$ in ZSL denotes the semantic/attribute information shared among all classes including known known and unknown known classes.

where the $\text{side-information}^4$ denotes the available semantic/attribute information only for known known classes

Using other available side-information

**The main reason for open space risk is that the traditional classifiers trained under closed set scenario usually divide over-occupied space for known known classes, thus inevitably resulting in misclassifications once the unknown unknown class samples fall into the space divided for some known known class.** From this perspective, the open space risk will be reduced as the space divided for those known known classes decreases by using other side-information like universum [135], [136] to shrink their regions as much as possible.

Knowledge Integration for Open Set Recognition

In fact, the incomplete knowledge of the world is universal, especially for the single individuals: something you know does not mean I also know.

how to integrate the classifiers trained on each sub-knowledge set to further reduce the open space risk will be an interesting yet challenging topic in the future work, especially for such a situation: we can only obtain the classifiers trained on corresponding sub-knowledge sets, yet these sub-knowledge sets are not available due to the privacy protection of data.

Paper Notes: CoQA

Motivation

We introduce CoQA, a novel dataset for building Conversational Question Answering systems.1 Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains.

CoQA is a conversational reading-comprehension dataset: 127k question-answer pairs collected from 8k conversations over passages from seven diverse domains.

The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage.

We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.

The challenges CoQA poses differ from those of traditional RC datasets, chiefly coreference and pragmatic reasoning.

We ask other people a question to either seek or test their knowledge about a subject. Depending on their answer, we follow up with another question and their answer builds on what has already been discussed. This incremental aspect makes human conversations succinct. An inability to build up and maintain common ground in this way is part of why virtual assistants usually don’t seem like competent conversational partners.

Introduction

In CoQA, a machine has to understand a text passage and answer a series of questions that appear in a conversation. We develop CoQA with three main goals in mind.

The first concerns the nature of questions in a human conversation. Posing short questions is an effective human conversation strategy, but such questions are a pain in the neck for machines.

The second goal of CoQA is to ensure the naturalness of answers in a conversation. Many existing QA datasets restrict answers to a contiguous span in a given passage, also known as extractive answers (Table 1). Such answers are not always natural, for example, there is no extractive answer for Q4 (How many?) in Figure 1. In CoQA, we propose that the answers can be free-form text (abstractive answers), while the extractive spans act as rationales for the actual answers. Therefore, the answer for Q4 is simply Three while its rationale is spanned across multiple sentences.

The third goal of CoQA is to enable building QA systems that perform robustly across domains. The current QA datasets mainly focus on a single domain which makes it hard to test the generalization ability of existing models.

Dataset collection

1. It consists of 127k conversation turns collected from 8k conversations over text passages (approximately one conversation per passage). The average conversation length is 15 turns, and each turn consists of a question and an answer.

2. It contains free-form answers. Each answer has an extractive rationale highlighted in the passage.

3. Its text passages are collected from seven diverse domains — five are used for in-domain evaluation and two are used for out-of-domain evaluation.

Almost half of CoQA questions refer back to conversational history using coreferences, and a large portion requires pragmatic reasoning making it challenging for models that rely on lexical cues alone.

The best-performing system, a reading comprehension model that predicts extractive rationales which are further fed into a sequence-to-sequence model that generates final answers, achieves an F1 score of 65.1%. In contrast, humans achieve 88.8% F1, a gap of 23.7% F1, indicating that there is a lot of headroom for improvement.

The baseline converts an extractive reading-comprehension model into a seq2seq setup: the predicted rationales are fed to a seq2seq model that produces the final answer, reaching 65.1% F1.

question and answer collection

We want questioners to avoid using exact words in the passage in order to increase lexical diversity. When they type a word that is already present in the passage, we alert them to paraphrase the question if possible.

Questioners should avoid words that already appear in the passage as much as possible, which increases lexical diversity.

For the answers, we want answerers to stick to the vocabulary in the passage in order to limit the number of possible answers. We encourage this by automatically copying the highlighted text into the answer box and allowing them to edit copied text in order to generate a natural answer. We found 78% of the answers have at least one edit such as changing a word’s case or adding a punctuation.

passage collection

Not all passages in these domains are equally good for generating interesting conversations. A passage with just one entity often results in questions that entirely focus on that entity. Therefore, we select passages with multiple entities, events and pronominal references using Stanford CoreNLP (Manning et al., 2014). We truncate long articles to the first few paragraphs that result in around 200 words.

Table 2 shows the distribution of domains. We reserve the Science and Reddit domains for out-ofdomain evaluation. For each in-domain dataset, we split the data such that there are 100 passages in the development set, 100 passages in the test set, and the rest in the training set. For each out-of-domain dataset, we just have 100 passages in the test set.

The in-domain sources are Children's stories, Literature, Mid/High school exams, News, and Wikipedia; for each, 100 passages go to the development set, 100 to the test set, and the rest to the training set. The out-of-domain sources, Science and Reddit, each contribute 100 passages to the test set only.

test dataset:

Some questions in CoQA may have multiple valid answers. For example, another answer for Q4 in Figure 2 is A Republican candidate. In order to account for answer variations, we collect three additional answers for all questions in the development and test data.

In the previous example, if the original answer was A Republican Candidate, then the following question Which party does he belong to? would not have occurred in the first place. When we show questions from an existing conversation to new answerers, it is likely they will deviate from the original answers which makes the conversation incoherent. It is thus important to bring them to a common ground with the original answer.

We achieve this by turning the answer collection task into a game of predicting original answers. First, we show a question to a new answerer, and when she answers it, we show the original answer and ask her to verify if her answer matches the original. For the next question, we ask her to guess the original answer and verify again. We repeat this process until the conversation is complete. In our pilot experiment, the human F1 score is increased by 5.4% when we use this verification setup.

Dataset Analysis

What makes the CoQA dataset conversational compared to existing reading comprehension datasets like SQuAD? How does the conversation flow from one turn to the other? What linguistic phenomena do the questions in CoQA exhibit? We answer these questions below.

1. Pronouns (he, him, she, it, they) appear far more frequently, whereas they are almost absent in SQuAD.

2. In SQuAD, what questions account for nearly half; CoQA's question types are more diverse, with did, was, is, and does occurring frequently.

3. CoQA questions are much shorter (see Figure 3).

4. About 33% of the answers are abstractive. Since extractive answers are obviously easier for annotators to write, this is higher than the authors expected. Yes/no answers also account for a noticeable share.

Conversation Flow

A coherent conversation must have smooth transitions between turns.

Linguistic Phenomena

Relationship between a question and its passage：

• Lexical match: the question shares at least one word with the passage.

• Paraphrasing: the question shares no word with the passage but paraphrases the rationale, i.e., restates it in different words; this typically involves synonymy, antonymy, hypernymy, hyponymy, and negation.

• Pragmatics: requires reasoning beyond the text.

Relationship between a question and its conversation history：

• No coref

• Explicit coref.

• Implicit coref.

Paper Notes - DropBlock

paper:

DropBlock targets CNNs; the latter two papers concern regularization for RNNs.

DropBlock

Motivation

Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that activation units in convolutional layers are spatially correlated so information can still flow through convolutional networks despite dropout.

Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropBlock in skip connections in addition to the convolution layers increases the accuracy. Also, gradually increasing the number of dropped units during training leads to better accuracy and more robustness to hyperparameter choices.

dropblock

In this paper, we introduce DropBlock, a structured form of dropout, that is particularly effective to regularize convolutional networks. In DropBlock, features in a block, i.e., a contiguous region of a feature map, are dropped together. As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data (see Figure 1).

• block_size is the size of the block to be dropped

• $\gamma$ controls how many activation units to drop.

We experimented with a shared DropBlock mask across different feature channels or each feature channel has its DropBlock mask. Algorithm 1 corresponds to the latter, which tends to work better in our experiments.

Similar to dropout we do not apply DropBlock during inference. This is interpreted as evaluating an averaged prediction across the exponentially-sized ensemble of sub-networks. These sub-networks include a special subset of sub-networks covered by dropout where each network does not see contiguous parts of feature maps.

block_size:

In our implementation, we set a constant block_size for all feature maps, regardless of the resolution of the feature map. DropBlock resembles dropout [1] when block_size = 1 and resembles SpatialDropout [20] when block_size covers the full feature map.

With block_size = 1, DropBlock behaves like dropout; when block_size covers the whole feature map, it resembles SpatialDropout.

setting the value of $\gamma$:

In practice, we do not explicitly set $\gamma$. As stated earlier, $\gamma$ controls the number of features to drop. Suppose that we want to keep every activation unit with the probability of keep_prob; in dropout [1] the binary mask will be sampled with the Bernoulli distribution with mean 1 − keep_prob. However, to account for the fact that every zero entry in the mask will be expanded by block_size² and the blocks will be fully contained in the feature map, we need to adjust $\gamma$ accordingly when we sample the initial binary mask. In our implementation, $\gamma$ can be computed as

$$\gamma=\dfrac{1-\text{keep\_prob}}{\text{block\_size}^2}\cdot\dfrac{\text{feat\_size}^2}{(\text{feat\_size}-\text{block\_size}+1)^2}$$

• keep_prob is the conventional dropout keep probability, usually set to 0.75–0.9.

• feat_size is the size of the feature map.

• (feat_size - block_size + 1) is the valid region for placing the center of a dropped block.

The main nuance of DropBlock is that there will be some overlap among the dropped blocks, so the above equation is only an approximation.
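A minimal PyTorch sketch of this procedure (my own illustration, not the authors' code): $\gamma$ is computed from keep_prob, block_size and the feature-map size as above, block centers are sampled inside the valid region, each center is expanded into a square block via max pooling, and the surviving activations are renormalized. It assumes an odd block_size and an input of shape [N, C, H, W].

```python
import torch
import torch.nn.functional as F

def dropblock(x, keep_prob=0.9, block_size=7, training=True):
    """DropBlock sketch for a feature map x of shape [N, C, H, W] (odd block_size)."""
    if not training or keep_prob == 1.0:
        return x
    n, c, h, w = x.shape
    # gamma = (1 - keep_prob) / block_size^2 * feat^2 / (feat - block + 1)^2
    gamma = ((1.0 - keep_prob) / block_size ** 2) * \
            (h * w) / ((h - block_size + 1) * (w - block_size + 1))
    # sample block centers only where a full block fits inside the map
    valid = torch.zeros_like(x)
    valid[:, :, block_size // 2:h - block_size // 2,
                block_size // 2:w - block_size // 2] = 1.0
    centers = (torch.rand_like(x) < gamma).float() * valid
    # expand every center into a block_size x block_size square of zeros
    block_mask = 1.0 - F.max_pool2d(centers, kernel_size=block_size,
                                    stride=1, padding=block_size // 2)
    # rescale so the expected magnitude of the activations is unchanged
    return x * block_mask * block_mask.numel() / block_mask.sum()
```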

Scheduled DropBlock:

We found that DropBlock with a fixed keep_prob during training does not work well. Applying a small value of keep_prob hurts learning at the beginning. Instead, gradually decreasing keep_prob over time from 1 to the target value is more robust and adds improvement for most values of keep_prob.

Experiments

In the following experiments, we study where to apply DropBlock in residual networks. We experimented with applying DropBlock only after convolution layers or applying DropBlock after both convolution layers and skip connections. To study the performance of DropBlock applying to different feature groups, we experimented with applying DropBlock to Group 4 or to both Groups 3 and 4.

Deep Learning - Optimization Algorithms

Batch gradient descent computes the gradient of the cost function with respect to the parameters $\theta$ for the entire training dataset.

$$\theta= \theta - \eta\cdot\nabla_{\theta}J(\theta)$$

Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example x(i) and label y(i):

$$\theta= \theta - \eta\cdot\nabla_{\theta}J(\theta; x^{(i)}; y^{(i)})$$

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online. SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily.

While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting. However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples.

$$\theta= \theta - \eta\cdot\nabla_{\theta}J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$$

• reduces the variance of the parameter updates, which can lead to more stable convergence;

• can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient mini-batch very efficient.
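A minimal sketch of a mini-batch gradient descent loop under these definitions; `grad_fn(theta, X_batch, y_batch)` is a hypothetical function returning $\nabla_\theta J$, and X, y are assumed to be NumPy arrays.

```python
import numpy as np

def minibatch_gd(theta, grad_fn, X, y, lr=0.01, batch_size=32, epochs=10):
    """theta <- theta - eta * grad(J; batch) for every mini-batch of examples."""
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta
```

Setting batch_size = 1 recovers SGD, while batch_size = n recovers batch gradient descent.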

Challenges

• Choosing a proper learning rate.

• Learning rate schedules. try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset’s characteristics.

• the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.

• Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [5] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

Gradient descent optimization algorithms

Momentum [17] is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in Figure 2b. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.

Momentum

paper: [Neural networks: the official journal of the International Neural Network Society]()

without Momentum:

$$\theta += -lr * \nabla_{\theta}J(\theta)$$

with Momentum:

$$v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta)$$

$$\theta=\theta-v_t$$

The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions.

$\gamma$ can be viewed as a friction coefficient and is usually set to 0.9; $\eta$ is the learning rate.

paper: [Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2).]()

We would like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again. Nesterov accelerated gradient (NAG) [14] is a way to give our momentum term this kind of prescience.

$$v_t=\gamma v_{t-1}+\eta \nabla_{\theta}J(\theta-\gamma v_{t-1})$$

$$\theta=\theta-v_t$$

$$\phi = \theta-\gamma v_{t-1}$$
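The two updates side by side as a small sketch (my own illustration): classical momentum evaluates the gradient at $\theta$, while NAG evaluates it at the look-ahead point $\phi=\theta-\gamma v_{t-1}$. `grad_fn` is a placeholder for $\nabla_{\theta}J$.

```python
def momentum_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """v_t = gamma * v_{t-1} + eta * grad(theta); theta <- theta - v_t."""
    v = gamma * v + lr * grad_fn(theta)
    return theta - v, v

def nag_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """Nesterov: the gradient is taken at the look-ahead point theta - gamma * v."""
    v = gamma * v + lr * grad_fn(theta - gamma * v)
    return theta - v, v
```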

paper: [Adaptive Subgradient Methods for Online Learning and Stochastic Optimization]()

Adagrad [8] is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

$$g_{t,i}=\nabla_{\theta_t}J(\theta_{t,i})$$

$$\theta_{t+1,i}=\theta_{t,i}-\eta \cdot g_{t,i}$$

$$\theta_{t+1,i}=\theta_{t,i}-\dfrac{\eta}{\sqrt{G_{t,ii}+\epsilon}}\, g_{t,i}$$

$G_t$ is a diagonal matrix whose diagonal element $G_{t,ii}$ is the sum of the squares of the past gradients with respect to $\theta_i$.
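A one-step Adagrad sketch matching the formula above: the running sum G of squared gradients (the diagonal of $G_t$) gives each parameter its own effective learning rate.

```python
import numpy as np

def adagrad_step(theta, G, g, lr=0.01, eps=1e-8):
    """G accumulates g^2 element-wise; frequently updated parameters get smaller steps."""
    G = G + g * g
    theta = theta - lr * g / np.sqrt(G + eps)
    return theta, G
```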

RMSprop

Geoff Hinton Lecture 6e

In Adagrad the accumulated cache keeps growing, so the effective update eventually shrinks towards zero. RMSprop improves on this by giving the cache a decay rate, which amounts to considering only recent gradients; gradients from long ago are decayed away and have little influence.

$$E[g^2]_t=0.9E[g^2]_{t-1}+0.1g^2_t$$

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t$$

Adam: a Method for Stochastic Optimization.

In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum:

similar to momentum:

$$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t$$

$$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$

$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As $m_t$ and $v_t$ are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β1 and β2 are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:

$$\hat m_t=\dfrac{m_t}{1-\beta^t_1}$$

$$\hat v_t=\dfrac{v_t}{1-\beta^t_2}$$

They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule:

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{\sqrt{\hat v_t}+ \epsilon}\hat m_t$$

• $m_t$ is analogous to the parameter update in Momentum and is a function of the gradient. $\beta_1$ plays the role of the friction coefficient and is usually set to 0.9.

• $v_t$ is analogous to the cache in RMSprop and is used to adaptively rescale the gradient of each parameter.

• $\beta_2$ is the decay rate of the cache, usually set to 0.999.
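A compact sketch of one Adam step following the equations above (illustrative only): first and second moment estimates, bias correction, then the update. `t` is the 1-based step counter.

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)           # bias correction, t = 1, 2, ...
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```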

AdaMax (also from Adam: a Method for Stochastic Optimization)

$$v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$

Norms for large p values generally become numerically unstable, which is why $l_1$ and $l_2$ norms are most common in practice. However, $l_{\infty}$ also generally exhibits stable behavior. For this reason, the authors propose AdaMax [10] and show that $v_t$ with $l_{\infty}$ converges to the following more stable value. To avoid confusion with Adam, we use $u_t$ to denote the infinity norm-constrained $v_t$:

$$u_t=\beta_2^{\infty}v_{t-1}+(1-\beta_2^{\infty})|g_t|^{\infty}$$

$$=\max(\beta_2\cdot v_{t-1}, |g_t|)$$

$$\theta_{t+1}=\theta_t-\dfrac{\eta}{u_t}\hat m_t$$

Note that as $u_t$ relies on the max operation, it is not as suggestible to a bias towards zero as $m_t$ and $v_t$ in Adam, which is why we do not need to compute a bias correction for $u_t$. Good default values are again:

$$\eta = 0.002, \beta_1 = 0.9, \beta_2 = 0.999.$$

Visualization of algorithms

In the visualization we see the paths the optimizers took on the contours of a loss surface (the Beale function). All started at the same point and took different paths to reach the minimum. Note that Adagrad, Adadelta, and RMSprop headed off immediately in the right direction and converged similarly fast, while Momentum and NAG were led off-track, evoking the image of a ball rolling down the hill. NAG, however, was able to correct its course sooner due to its increased responsiveness by looking ahead and headed to the minimum.

A second visualization shows the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope while the other dimension has a negative slope, which poses a difficulty for SGD as we mentioned before. Notice here that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope, with Adadelta leading the charge.

example

model

TestNet(

(linear1): Linear(in_features=10, out_features=5, bias=True)

(linear2): Linear(in_features=5, out_features=1, bias=True)

(loss): BCELoss()

)

[('linear1.weight', Parameter containing:

tensor([[ 0.2901, -0.0022, -0.1515, -0.1064, -0.0475, -0.0324,  0.0404,  0.0266,

-0.2358, -0.0433],

[-0.1588, -0.1917,  0.0995,  0.0651, -0.2948, -0.1830,  0.2356,  0.1060,

0.2172, -0.0367],

[-0.0173,  0.2129,  0.3123,  0.0663,  0.2633, -0.2838,  0.3019, -0.2087,

-0.0886,  0.0515],

[ 0.1641, -0.2123, -0.0759,  0.1198,  0.0408, -0.0212,  0.3117, -0.2534,

-0.1196, -0.3154],

[ 0.2187,  0.1547, -0.0653, -0.2246, -0.0137,  0.2676,  0.1777,  0.0536,

('linear1.bias', Parameter containing:

tensor([ 0.1216,  0.2846, -0.2002, -0.1236,  0.2806], requires_grad=True)),

('linear2.weight', Parameter containing:

tensor([[-0.1652,  0.3056,  0.0749, -0.3633,  0.0692]], requires_grad=True)),

('linear2.bias', Parameter containing:



add model parameters to optimizer

<bound method Optimizer.state_dict of Adam (

Parameter Group 0

betas: (0.8, 0.999)

eps: 1e-08

lr: 0.001

weight_decay: 3e-07

)>


Setting different hyperparameters for different parameter groups

<bound method Optimizer.state_dict of Adam (

Parameter Group 0

betas: (0.8, 0.999)

eps: 1e-08

lr: 0.001

weight_decay: 3e-07

Parameter Group 1

betas: (0.8, 0.999)

eps: 1e-08

lr: 0.0003

weight_decay: 3e-07

)>


The helper class lr_scheduler

lr_scheduler is used to adjust the learning rate flexibly during training according to the epoch. There are many ways to adjust the learning rate, but they are all used in roughly the same way: wrap the original Optimizer with a scheduler, pass in the relevant arguments, and then call step() on the scheduler.
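A minimal PyTorch usage sketch of this pattern (model and hyperparameters are arbitrary placeholders); `StepLR` simply multiplies the learning rate by `gamma` every `step_size` epochs.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# wrap the optimizer: decay the learning rate by 0.1 every 30 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... forward / backward / optimizer.step() for each mini-batch ...
    scheduler.step()   # advance the schedule once per epoch
```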

Paper Notes - Multi-cast Attention Networks

paper: [Multi-Cast Attention Networks for Retrieval-based Question Answering and Response Prediction](https://arxiv.org/abs/1806.00778)

Motivation

Our approach performs a series of soft attention operations, each time casting a scalar feature upon the inner word embeddings. The key idea is to provide a real-valued hint (feature) to a subsequent encoder layer and is targeted at improving the representation learning process.

The key idea of attention is to extract only the most relevant information that is useful for prediction. In the context of textual data, attention learns to weight words and sub-phrases within documents based on how important they are. In the same vein, co-attention mechanisms [5, 28, 50, 54] are a form of attention mechanisms that learn joint pairwise attentions, with respect to both document and query.

The key idea of attention is to extract only the information most relevant for prediction; for text, it learns to weight the words and sub-phrases of a document by their importance.

Attention is traditionally used and commonly imagined as a feature extractor. Its behavior can be thought of as a dynamic form of pooling as it learns to select and compose different words to form the final document representation.

This paper re-imagines attention as a form of feature augmentation method. Attention is casted with the purpose of not compositional learning or pooling but to provide hints for subsequent layers. To the best of our knowledge, this is a new way to exploit attention in neural ranking models.

An obvious drawback which applies to many existing models is that they are generally restricted to one attention variant. In the case where one or more attention calls are used (e.g., co-attention and intra-attention, etc.), concatenation is generally used to fuse representations [20, 28]. Unfortunately, this incurs cost in subsequent layers by doubling the representation size per call.

The rationale for desiring more than one attention call is intuitive. In [20, 28], Co-Attention and Intra-Attention are both used because each provides a different view of the document pair, learning high quality representations that could be used for prediction. Hence, this can significantly improve performance.

Network for Sentence Pair Modeling](https://aclanthology.info/pdf/D/D17/D17-1122.pdf)

Moreover, Co-Attention also comes in different flavors and can either be used with extractive max-mean pooling [5, 54] or alignment-based pooling [3, 20, 28]. Each co-attention type produces different document representations. In max-pooling, signals are extracted based on a word’s largest contribution to the other text sequence. Mean-pooling calculates its contribution to the overall sentence. Alignment-pooling is another flavor of co-attention, which aligns semantically similar sub-phrases together.

Co-attention can be used with extractive max-/mean-pooling or with alignment-based pooling, and each variant produces a different document representation: max-pooling scores a word by its largest contribution to the other text sequence; mean-pooling scores it by its contribution to the other sentence as a whole; alignment-based pooling instead aligns semantically similar sub-phrases. The different pooling operations thus provide different views of the sentence pair.

• [20] [A Decomposable Attention Model for Natural Language Inference](https://arxiv.org/abs/1606.01933)

Our approach is targeted at serving two important purposes: (1) It removes the need for architectural engineering of this component by enabling attention to be called for an arbitrary k times with hardly any consequence and (2) concurrently it improves performance by modeling multiple views via multiple attention calls. As such, our method is in similar spirit to multi-headed attention, albeit efficient. To this end, we introduce Multi-Cast Attention Networks (MCAN), a new deep learning architecture for a potpourri of tasks in the question answering and conversation modeling domains.

(1) It removes the need to engineer the architecture around this component: attention can be called an arbitrary k times with hardly any consequence.

In our approach, attention is casted, in contrast to the most other works that use it as a pooling operation. We cast co-attention multiple times, each time returning a compressed scalar feature that is re-attached to the original word representations. The key intuition is that compression enables scalable casting of multiple attention calls, aiming to provide subsequent layers with a hint of not only global knowledge but also cross sentence knowledge.

Model Architecture: Multi-cast Attention Networks

Figure 1: Illustration of our proposed Multi-Cast Attention Networks (Best viewed in color). MCAN is a wide multi-headed attention architecture that utilizes compression functions and attention as features.

Input Encoder

Highway Encoder:

Highway Networks make it possible to optimize networks of arbitrary depth via a gating mechanism that controls the flow of information through the network. The gates open paths along which information can pass without loss; such paths are called information highways. In other words, highway networks mainly address the difficulty of training networks whose growing depth hinders gradient flow.

highway encoders can be interpreted as data-driven word filters. As such, we can imagine them to parametrically learn which words have an inclination to be important and not important to the task at hand. For example, filtering stop words and words that usually do not contribute much to the prediction. Similar to recurrent models that are gated in nature, this highway encoder layer controls how much information (of each word) is flowed to the subsequent layers.

$$y=H(x,W_H)\cdot T(x,W_T) + (1-T(x,W_T))\cdot x$$
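A minimal sketch of one such highway layer (my own illustration; the exact activation in the paper may differ): $H$ is the transform branch, $T$ the sigmoid gate, and the carry path passes $x$ through unchanged.

```python
import torch
from torch import nn

class Highway(nn.Module):
    """y = H(x, W_H) * T(x, W_T) + (1 - T(x, W_T)) * x"""
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)   # transform branch
        self.T = nn.Linear(dim, dim)   # gate branch
    def forward(self, x):
        t = torch.sigmoid(self.T(x))   # how much transformed signal to let through
        return torch.relu(self.H(x)) * t + x * (1.0 - t)
```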

co-attention

Co-Attention [50] is a pairwise attention mechanism that enables attending to text sequence pairs jointly. In this section, we introduce four variants of attention, i.e., (1) max-pooling, (2) mean-pooling, (3) alignment-pooling, and finally (4) intra-attention (or self attention).

1. Affinity/similarity matrix

$$s_{ij}=F(q_i)^TF(d_j)$$

$$s_{ij}=q_i^TMd_j, s_{ij}=F[q_i;d_j]$$

2. Extractive pooling

max-pooling

$$q’=soft(max_{col}(s))^Tq, d’=soft(max_{row}(s))^Td$$

soft(·) is the softmax function; $q'$ and $d'$ are the co-attentive representations of q and d respectively.

mean-pooling

$$q’=soft(mean_{col}(s))^Tq, d’=soft(mean_{row}(s))^Td$$

each pooling operator has different impacts and can be intuitively understood as follows: max-pooling selects each word based on its maximum importance of all words in the other text. Mean-pooling is a more wholesome comparison, paying attention to a word based on its overall influence on the other text. This is usually dataset-dependent, regarded as a hyperparameter and is tuned to see which performs best on the held out set.
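A small sketch of the extractive co-attention described above (an illustration under my own naming; `proj` stands in for the $F(\cdot)$ projection). Swapping `max` for `mean` gives the mean-pooling variant.

```python
import torch
import torch.nn.functional as F

def extractive_coattention(q, d, proj):
    """q: [l_q, dim], d: [l_d, dim]; returns the co-attentive q' and d'."""
    s = proj(q) @ proj(d).t()                     # affinity matrix, [l_q, l_d]
    a_q = F.softmax(s.max(dim=1).values, dim=0)   # weight of each query word
    a_d = F.softmax(s.max(dim=0).values, dim=0)   # weight of each document word
    q_prime = a_q @ q                             # q' = soft(max_col(s))^T q
    d_prime = a_d @ d                             # d' = soft(max_row(s))^T d
    return q_prime, d_prime
```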

3. Alignment-Pooling

$$d_i':=\sum_{j=1}^{l_q}\dfrac{\exp(s_{ij})}{\sum_{k=1}^{l_q}\exp(s_{ik})}q_j$$

$$q_i':=\sum_{j=1}^{l_d}\dfrac{\exp(s_{ij})}{\sum_{k=1}^{l_d}\exp(s_{ik})}d_j$$

$q_i'$ is the soft alignment of $q_i$ with d; that is, $q_i'$ is a weighted sum of $\{d_j\}_{j=1}^{l_d}$ with weights determined by $q_i$.

4. intra-Attention

$$x_i':=\sum_{j=1}^{l}\dfrac{\exp(s_{ij})}{\sum_{k=1}^{l}\exp(s_{ik})}x_j$$

Multi-Cast Attention

Casted Attention

$$f_c=F_c[\overline x, x]$$

$$f_m=F_m[\overline x \circ x]$$

$$f_s=F_m[\overline x-x]$$

Intuitively, what is achieved here is that we are modeling the influence of co-attention by comparing representations before and after co-attention. For soft-attention alignment, a critical note here is that x and $\overline x$ (though of equal lengths) have ‘exchanged’ semantics. In other words, in the case of q, $\overline q$ actually contains the aligned representation of d.

Compression Function

The rationale for compression is simple and intuitive - we do not want to bloat subsequent layers with a high dimensional vector which consequently incurs parameter costs in subsequent layers. We investigate the usage of three compression functions, which are capable of reducing an n-dimensional vector to a scalar.

• Sum

The Sum (SM) function is a non-parametric function that sums the entire vector and outputs a scalar:

$$F(x)=\sum_i^nx_i, x_i\in x$$

• Neural networks

$$F(x)=RELU(W_cx+b)$$

• Factorization Machines

Factorization Machines are expressive models that capture pairwise interactions between features using factorized parameters; k is the number of factors of the FM model.
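A sketch of the three compression functions, each mapping an n-dimensional vector to a scalar (illustrative shapes and initialization; not the authors' code).

```python
import torch
from torch import nn

def sum_compress(x):                      # SM: non-parametric sum
    return x.sum()

class NNCompress(nn.Module):              # FC: F(x) = ReLU(W_c x + b)
    def __init__(self, n):
        super().__init__()
        self.fc = nn.Linear(n, 1)
    def forward(self, x):
        return torch.relu(self.fc(x))

class FMCompress(nn.Module):              # FM with k factors
    def __init__(self, n, k=4):
        super().__init__()
        self.w0 = nn.Parameter(torch.zeros(1))
        self.w = nn.Linear(n, 1, bias=False)
        self.V = nn.Parameter(torch.randn(n, k) * 0.01)
    def forward(self, x):                 # x: [n]
        linear = self.w0 + self.w(x)
        # pairwise terms: 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]
        pair = 0.5 * ((x @ self.V).pow(2) - (x.pow(2) @ self.V.pow(2))).sum()
        return linear + pair
```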

Long Short-Term Memory Encoder

As such, the key idea behind casting attention as features right before this layer is that it provides the LSTM encoder with hints that provide information such as (1) longterm and global sentence knowledge and (2) knowledge between sentence pairs (document and query).

The LSTM shares its weights between the document and the query.

• Pooling Operation

$$h=MeanMax[h_1,…,h_l]$$

Prediction Layer and Optimization

$$y_{out} = H_2(H_1([x_q; x_d ; x_q \circ x_d ; x_q − x_d ]))$$

$$y_{pred} = softmax(W_F · y_{out} + b_F )$$

Paper Notes - CNNs and Natural Language Processing

• Embedding: use pre-trained Chinese word vectors.

• Encoder: encode the passage, query, and alternatives with a Bi-GRU.

• Attention: compute a trilinear similarity matrix (with masking), then apply BiDAF-style bi-attention flow to obtain the attended passage.

• Contextual: encode the attended passage with a Bi-GRU to obtain the fusion representation.

• Match: use attention pooling to turn the fusion and enc_answer into single vectors, then use cosine similarity to pick the most similar answer.

• One could pre-train word vectors on the training set with ELMo or word2vec.

• The attention layer could use richer forms, as many papers suggest; hand-crafted features could even be added, e.g. those mentioned in Su Jianlin's blog.

• Another important part is the match step: could attention pooling be replaced by something better?

ConvS2S

This Facebook paper breaks with that conventional wisdom: CNNs are used not only to encode global information but also to decode.

Motivation

Multi-layer convolutional neural networks create hierarchical representations over the input sequence in which nearby input elements interact at lower layers while distant elements interact at higher layers.

Hierarchical structure provides a shorter path to capture long-range dependencies compared to the chain structure modeled by recurrent networks, e.g. we can obtain a feature representation capturing relationships within a window of n words by applying only O(n/k) convolutional operations for kernels of width k, compared to a linear number O(n) for recurrent neural networks.

Inputs to a convolutional network are fed through a constant number of kernels and non-linearities, whereas recurrent networks apply up to n operations and non-linearities to the first word and only a single set of operations to the last word. Fixing the number of nonlinearities applied to the inputs also eases learning.

Model Architecture

• position embedding

• convolution block structure

• Multi-step attention

convolution blocks

$$h_l=(XW+b)\otimes \sigma(XV+c)$$

The output of each layer is a linear projection X ∗ W + b modulated by the gates σ(X ∗ V + c). Similar to LSTMs, these gates multiply each element of the matrix X ∗ W + b and control the information passed on in the hierarchy.

$$h_i^l=tanh(XW+b)\otimes \sigma(XV+c)$$

Residual connections: to train deeper convolutional networks, the authors add residual connections.

$$h_i^l=v(W^l[h_{i-k/2}^{l-1},…,h_{i+k/2}^{l-1}]+b_w^l)+h_i^{l-1}$$

For instance, stacking 6 blocks with k = 5 results in an input field of 25 elements, i.e. each output depends on 25 inputs. Non-linearities allow the networks to exploit the full input field, or to focus on fewer elements if needed.
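A toy encoder-side block in this spirit (a sketch, not the fairseq implementation; it omits details such as weight normalization and residual scaling): a 1-D convolution produces 2d channels, a gated linear unit keeps d of them, and a residual connection adds the input back.

```python
import torch
from torch import nn
import torch.nn.functional as F

class ConvGLUBlock(nn.Module):
    def __init__(self, d, kernel_size=5):
        super().__init__()
        # 2*d output channels: one half is the linear part, the other half the gate
        self.conv = nn.Conv1d(d, 2 * d, kernel_size, padding=kernel_size // 2)
    def forward(self, x):                       # x: [batch, d, seq_len]
        return F.glu(self.conv(x), dim=1) + x   # GLU(A, B) = A * sigmoid(B), plus residual
```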

• ConvS2S uses 1-D convolutions: the kernel slides only along the time dimension, with a fixed stride of 1. This is because language is not scalable the way images are: uniformly downsampling an image does not change its content, but taking every other word of a sentence changes its meaning a lot.

• In vision, a convolutional layer usually has many filters to capture the different patterns of an image, whereas in ConvS2S each layer has a single filter. A sentence enters the filter as a tensor of shape [1, n, d], where n is the sentence length along which the filter convolves, and d is the word-vector dimension, which can be viewed as channels, analogous to the RGB channels of a colour image.

Facebook's design thus does not follow the common vision practice of many filters per layer; each layer has only one filter. One stated reason is to simplify the model and speed up convergence; another is the belief that a sentence's patterns are much simpler than an image's, so stacking one filter per layer is enough to capture them all. The former is the more likely reason, since the success of multi-head attention in the Transformer suggests that a sentence does contain multiple patterns.

For encoder networks we ensure that the output of the convolutional layers matches the input length by padding the input at each layer. However, for decoder networks we have to take care that no future information is available to the decoder (Oord et al., 2016a). Specifically, we pad the input by k − 1 elements on both the left and right side by zero vectors, and then remove k elements from the end of the convolution output.

Multi-step Attention

$$d_i^l=W_d^lh_i^l+b_d^l+g_i$$

$$a_{ij}^l=\dfrac{\exp(d_i^l\cdot z_j^u)}{\sum_{t=1}^m \exp(d_i^l\cdot z_t^u)}$$

$$c_i^l=\sum_{j=1}^ma_{ij}^l(z_j^u+e_j)$$

• During training, teacher forcing is used: the convolution kernel $W_d^l$ slides over the target-side states $h^l$, giving $(W_d^lh_i^l + b_d^l)$, which plays the role of the hidden state in an RNN decoder; adding the embedding of the previous word $g_i$ yields $d_i^l$.

• This interacts with the source-sentence states produced by the encoder, and a softmax gives the attention weights $a_{ij}^l$.

• The resulting attention vector differs slightly from an RNN decoder's: the input element embeddings $e_j$ are added.

We found adding e_j to be beneficial and it resembles key-value memory networks where the keys are the z_j^u and the values are the z^u_j + e_j (Miller et al., 2016). Encoder outputs z_j^u represent potentially large input contexts and e_j provides point information about a specific input element that is useful when making a prediction. Once c^l_i has been computed, it is simply added to the output of the corresponding decoder layer h^l_i.

$z_j^u$ carries richer contextual information about the input, while $e_j$ points more specifically at the input element that is useful for the prediction; in practice you have to try it to appreciate the difference.

This can be seen as attention with multiple ’hops’ (Sukhbaatar et al., 2015) compared to single step attention (Bahdanau et al., 2014; Luong et al., 2015; Zhou et al., 2016; Wu et al., 2016). In particular, the attention of the first layer determines a useful source context which is then fed to the second layer that takes this information into account when computing attention etc. The decoder also has immediate access to the attention history of the k − 1 previous time steps because the conditional inputs $c^{l-1}_{i-k}, \dots, c^{l-1}_{i}$ are part of $h^{l-1}_{i-k}, \dots, h^{l-1}_{i}$ which are input to $h^l_i$. This makes it easier for the model to take into account which previous inputs have been attended to already compared to recurrent nets where this information is in the recurrent state and needs to survive several non-linearities. Overall, our attention mechanism considers which words we previously attended to (Yang et al., 2016) and performs multiple attention ’hops’ per time step. In Appendix §C, we plot attention scores for a deep decoder and show that at different layers, different portions of the source are attended to.

FAST READING COMPREHENSION WITH CONVNETS

Gated Linear Dilated Residual Network (GLDR):

a combination of residual networks (He et al., 2016), dilated convolutions (Yu & Koltun, 2016) and gated linear units (Dauphin et al., 2017).

text understanding with dilated convolution

kernel:$k=[k_{-l},k_{-l+1},…,k_l]$, size=$2l+1$

input: $x=[x_1,x_2,…,x_n]$

dilation: d

$$(k*x)_t=\sum_{i=-l}^{l}k_i\cdot x_{t + d\cdot i}$$

Repeated dilated convolution (Yu & Koltun, 2016) increases the receptive region of ConvNet outputs exponentially with respect to the network depth, which results in drastically shortened computation paths.
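A tiny sketch of the dilated convolution defined above, plus the receptive-field arithmetic behind that claim (my own illustration): with kernel size 3 and dilations 1, 2, 4, 8, 16 the receptive field already spans 63 positions.

```python
import numpy as np

def dilated_conv1d(x, k, d):
    """(k * x)_t = sum_i k_i * x_{t + d*i}; out-of-range positions are treated as zero."""
    n, l = len(x), len(k) // 2
    out = np.zeros(n)
    for t in range(n):
        for i in range(-l, l + 1):
            j = t + d * i
            if 0 <= j < n:
                out[t] += k[i + l] * x[j]
    return out

def receptive_field(kernel_size, dilations):
    """Receptive field after stacking one conv per dilation rate."""
    return 1 + (kernel_size - 1) * sum(dilations)

print(receptive_field(3, [1, 2, 4, 8, 16]))   # -> 63, exponential growth with depth
```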

model Architecture

The receptive field of this convolutional network grows exponentially with depth and soon encompasses a long sequence, essentially enabling it to capture similar long-term dependencies as an actual sequential model.

Convolutional BiDAF. In our convolutional version of BiDAF, we replaced all bidirectional LSTMs with GLDRs. We have two 5-layer GLDRs in the contextual layer whose weights are un-tied. In the modeling layer, a 17-layer GLDR with dilation 1, 2, 4, 8, 16 in the first 5 residual blocks is used, which results in a receptive region of 65 words. A 3-layer GLDR replaces the bidirectional LSTM in the output layer. For simplicity, we use same-padding and kernel size 3 for all convolutions unless specified. The hidden size of all GLDRs is 100 which is the same as the LSTMs in BiDAF.

Paper Notes - Batch, Layer, and Weight Normalization

paper:

Layer Normalization

Motivation

batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case.

Batch Normalization normalizes the summed (linear) inputs to each neuron.

batch normalization requires running averages of the summed input statistics. In feed-forward networks with fixed depth, it is straightforward to store the statistics separately for each hidden layer. However, the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence so applying batch normalization to RNNs appears to require different statistics for different time-steps.

BN is not used with RNNs because the sentences in a batch have different lengths. If we view each time step as its own feature dimension and normalize over the batch at that dimension as BN does, it clearly breaks down for RNNs: at the last time step of the longest sequence in the batch, the mean would be computed from that single value alone, which amounts to running BN on a single sample.

In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case.

Layer Normalization

Layer normalization computes the mean and variance over the hidden units rather than over the samples of a batch.

Differences between BN and LN:

Layer normalization computes the mean and variance over a single sample, so its behaviour is consistent between training and test time.

Layer normalized recurrent neural networks

In NLP tasks it is common for an RNN to see different sentence lengths across training cases. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such a problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps.

$a^t=W_{hh}h^{t-1}+W_{xh}x^t$

Using layer normalization in an LSTM:

TensorFlow implementation

layer normalization

layer_norm_mine produces the same result as the library implementation. When computing the mean and variance, tf.nn.moments is called with axes=[1:-1] (in tf.nn.moments, axes are the dimensions over which the mean and variance are computed), so the resulting statistics indeed have shape [batch,]. Only when rescaling to the beta and gamma distribution is the transformation still applied along the last dimension. Interestingly, the final effect should therefore be consistent with batch normalization; whether it suits the characteristics of images or text is another matter.
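A minimal layer-norm sketch consistent with that reading (illustrative, written in PyTorch for brevity): statistics are computed per sample over all non-batch axes, while the gain and bias are applied along the last (hidden) dimension.

```python
import torch
from torch import nn

class LayerNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(hidden_size))   # per-unit gain
        self.beta = nn.Parameter(torch.zeros(hidden_size))   # per-unit bias
        self.eps = eps
    def forward(self, x):                        # x: [batch, ..., hidden_size]
        dims = tuple(range(1, x.dim()))          # every axis except the batch axis
        mean = x.mean(dim=dims, keepdim=True)
        var = ((x - mean) ** 2).mean(dim=dims, keepdim=True)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```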

character embedding

Motivation

A language model is formalized as a probability distribution over a sequence of strings (words), and traditional methods usually involve making an n-th order Markov assumption and estimating n-gram probabilities via counting and subsequent smoothing (Chen and Goodman 1998). The count-based models are simple to train, but probabilities of rare n-grams can be poorly estimated due to data sparsity (despite smoothing techniques).

While NLMs have been shown to outperform count-based n-gram language models (Mikolov et al. 2011), they are blind to subword information (e.g. morphemes). For example, they do not know, a priori, that eventful, eventfully, uneventful, and uneventfully should have structurally related embeddings in the vector space. Embeddings of rare words can thus be poorly estimated, leading to high perplexities for rare words (and words surrounding them). This is especially problematic in morphologically rich languages with long-tailed frequency distributions or domains with dynamic vocabularies (e.g. social media).

Neural language models embed words into low-dimensional vectors so that semantically similar words end up close in the vector space. However, approaches like Mikolov's word2vec cannot effectively exploit subword information, e.g. the different forms of a word, nor do they recognize prefixes. This inevitably makes the vectors of rare words poorly estimated and gives high perplexity on rare words. It is a hard problem for morphologically rich languages, and likewise for domains with dynamic vocabularies (e.g. social media).

Recurrent Neural Network Language Model

$$Pr(w_{t+1}=j|w_{1:t})=\dfrac{exp(h_t\cdot p^j+q^j)}{\sum_{j’\in V}exp(h_t\cdot p^{j’}+q^{j’})}$$

$$NLL=-\sum_{t=1}^{T}\log Pr(w_t|w_{1:t-1})$$

Character-level Convolutional Neural Network

$$f^k[i]=\tanh(\langle C^k[*, i:i+w-1], H\rangle +b)$$

$\langle\cdot,\cdot\rangle$ denotes the convolution (a Frobenius inner product), followed by a bias and the tanh non-linearity.

Highway Network

A Highway Network combines two kinds of layers:

• one layer of an MLP applies an affine transformation:

$$z=g(Wy+b)$$

• the other layer resembles the gate mechanism of an LSTM:

$$z=t\circ g(W_Hy+b_H)+(1-t)\circ y$$

ELMo

ELMo

ELMo is a task specific combination of the intermediate layer representations in the biLM.

Like BERT, ELMo is really just an intermediate layer for downstream tasks. The difference is that each ELMo layer captures different information: the lower layers capture more syntactic information and are better suited to tasks like part-of-speech tagging, while the higher layers capture more contextual information and suit tasks like word sense disambiguation. Different tasks therefore make different use of the representations from different layers.

Model architecture

The final model uses L = 2 biLSTM layers with 4096 units and 512 dimension projections and a residual connection from the first to second layer. The context insensitive type representation uses 2048 character n-gram convolutional filters followed by two highway layers and a linear projection down to a 512 representation.

Paper Notes - QANet

paper:

Motivation

Its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. On the SQuAD dataset, our model is 3x to 13x faster in training and 4x to 9x faster in inference, while achieving equivalent accuracy to recurrent models.

The encoder consists only of convolution and self-attention; with no RNNs it is simply fast.

The key motivation behind the design of our model is the following: convolution captures the local structure of the text, while the self-attention learns the global interaction between each pair of words.

we propose a complementary data augmentation technique to enhance the training data. This technique paraphrases the examples by translating the original sentences from English to another language and then back to English, which not only enhances the number of training instances but also diversifies the phrasing.

Model

• an embedding layer

• an embedding encoder layer

• a context-query attention layer

• a model encoder layer

• an output layer.

the combination of convolutions and self-attention is novel, and is significantly better than self-attention alone and gives 2.7 F1 gain in our experiments. The use of convolutions also allows us to take advantage of common regularization methods in ConvNets such as stochastic depth (layer dropout) (Huang et al., 2016), which gives an additional gain of 0.2 F1 in our experiments.

Combining CNNs with self-attention works better than self-attention alone. Using CNNs also makes common ConvNet regularization such as stochastic depth (layer dropout) applicable, which brings a small extra gain.

Input embedding layer

obtain the embedding of each word w by concatenating its word embedding and character embedding.

Each character is represented as a trainable vector of dimension p2 = 200, meaning each word can be viewed as the concatenation of the embedding vectors for each of its characters. The length of each word is either truncated or padded to 16. We take maximum value of each row of this matrix to get a fixed-size vector representation of each word.

Embedding encoding layer

The encoder layer is a stack of the following basic building block: [convolution-layer × # + self-attention-layer + feed-forward-layer]

• Convolution: depthwise separable convolutions are used instead of standard convolutions because the authors found them more memory-efficient and better at generalizing (see the original paper for the details). The kernel size is 7, the number of filters is d = 128.

Each of these basic operations (conv/self-attention/ffn) is placed inside a residual block, shown lower-right in Figure 1. For an input x and a given operation f, the output is f(layernorm(x))+x.

Context-Query Attention Layer

content: $C=\{c_1, c_2,\dots,c_n\}$

query: $Q=\{q_1,q_2,\dots,q_m\}$.

• content: [batch, content_n, embed_size]

• query: [batch, query_m, embed_size]

sim_matrix: [batch, content_n, query_m]

The similarity function used here is the trilinear function (Seo et al., 2016). $f(q,c)=W_0[q,c,q\circ c]$.

content-to-query

$A = \tilde SQ^T$, shape = [batch, content_n, embed_size]

query_to_content

Empirically, we find that, the DCN attention can provide a little benefit over simply applying context-to-query attention, so we adopt this strategy.

$\tilde S$.shape=[batch, content_n, query_m]

$\overline S^T$.shape=[batch, query_m, content_n]

$C^T$.shape=[batch, query_m, embed_size]
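A shape-level sketch of this attention layer (my own illustration; `w` is an nn.Linear(3d, 1) standing in for the trilinear $W_0[q, c, q\circ c]$):

```python
import torch
import torch.nn.functional as F

def context_query_attention(C, Q, w):
    """C: [batch, n, d] content/context, Q: [batch, m, d] query; returns A (C2Q) and B (Q2C)."""
    n, m = C.size(1), Q.size(1)
    c_exp = C.unsqueeze(2).expand(-1, -1, m, -1)                     # [batch, n, m, d]
    q_exp = Q.unsqueeze(1).expand(-1, n, -1, -1)                     # [batch, n, m, d]
    S = w(torch.cat([q_exp, c_exp, q_exp * c_exp], -1)).squeeze(-1)  # [batch, n, m]
    S_row = F.softmax(S, dim=2)                  # normalized over query words
    S_col = F.softmax(S, dim=1)                  # normalized over context words
    A = S_row @ Q                                # context-to-query, [batch, n, d]
    B = S_row @ S_col.transpose(1, 2) @ C        # DCN-style query-to-context, [batch, n, d]
    return A, B
```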

Output layer

$$p^1=softmax(W_1[M_0;M_1]), p^2=softmax(W_2[M_0;M_2])$$

$$L(\theta)=-\dfrac{1}{N}\sum_i^N[log(p^1_{y^1})+log(p^2_{y^2})]$$

What makes QANet good, and where do the gains come from?

• Separable convolutions not only have fewer parameters and run faster, they also work better: replacing them with standard convolutions lowers F1 by 0.7.

• Removing the CNNs lowers F1 by 2.7.

• Removing self-attention lowers F1 by 1.3.

• layer normalization

• residual connections

• L2 regularization

Paper Notes - Pointer Networks and the Copy Mechanism

paper:

Pointer Network

Motivation

We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence.

Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence [1] and Neural Turing Machines [2], because the number of target classes in each step of the output depends on the length of the input, which is variable.

Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class.

It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output.

We show Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems – finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem – using training examples alone.

Ptr-Net can be used to learn approximate solutions to these three geometric problems.

Ptr-Nets not only improve over sequence-to-sequence with input attention, but also allow us to generalize to variable size output dictionaries. We show that the learnt models generalize beyond the maximum lengths they were trained on.

Ptr-Net not only improves over seq2seq with attention but also generalizes to variable-size output dictionaries.

• First, simple copying is hard to achieve with traditional methods, whereas Ptr-Net generates the output sequence directly from the input sequence.

• Second, it handles the case where the output dictionary varies. A plain seq2seq model has a fixed-size output dictionary, which is unfriendly when the output contains input words (especially OOV and rare words). On one hand, the embeddings of words rarely seen in training are of low quality and hard to predict at decoding time; on the other hand, even with good embeddings, named entities such as person names have very similar embeddings, making it hard to reproduce exactly the word mentioned in the input. Pointer Networks, and the copy mechanism of the follow-up CopyNet, handle this well: at each decoding time step the decoder learns to copy key words that appear in the input directly.

Model Architecture

sequence-to-sequence Model

$$p(C^P|P;\theta)=\prod_{i=1}^{m(P)}p_{\theta}(C_i|C_1,\dots,C_{i-1},P;\theta)$$

$$\theta^* = \underset{\theta}{\arg\max}\sum_{P,C^P}\log p(C^P|P;\theta)$$

In this sequence-to-sequence model, the output dictionary size for all symbols $C_i$ is fixed and equal to n, since the outputs are chosen from the input. Thus, we need to train a separate model for each n. This prevents us from learning solutions to problems that have an output dictionary with a size that depends on the input sequence length.

Content Based Input Attention

This model performs significantly better than the sequence-to-sequence model on the convex hull problem, but it is not applicable to problems where the output dictionary size depends on the input.

Nevertheless, a very simple extension (or rather reduction) of the model allows us to do this easily.

Ptr-Net

A seq2seq model produces each output word by a softmax over a fixed dictionary and picks the most probable word to form the output sequence. Here, however, the output dictionary size depends on the length of the input sequence, so the authors propose a new, and in fact very simple, model.

$$u_j^i=v^T\tanh(W_1e_j+W_2d_i), \quad j\in(1,\dots,n)$$

$$p(C_i|C_1,…,C_{i-1},P)=softmax(u^i)$$

Here i indexes the decoder time step and j indexes positions in the input sequence, so $e_j$ is an encoder hidden vector and $d_i$ is the decoder hidden vector at step i. This is essentially ordinary attention, except that the resulting softmax probabilities are used directly as a pointer over the input sequence.
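A sketch of this pointer scoring in PyTorch (illustrative module and names): the softmax is taken over the n input positions, so the output vocabulary automatically has the size of the input.

```python
import torch
from torch import nn

class PointerAttention(nn.Module):
    """u_j = v^T tanh(W1 e_j + W2 d_i); p(C_i = j) = softmax_j(u_j)."""
    def __init__(self, hidden):
        super().__init__()
        self.W1 = nn.Linear(hidden, hidden, bias=False)   # applied to encoder states e_j
        self.W2 = nn.Linear(hidden, hidden, bias=False)   # applied to the decoder state d_i
        self.v = nn.Linear(hidden, 1, bias=False)
    def forward(self, enc_states, dec_state):
        # enc_states: [batch, n, hidden], dec_state: [batch, hidden]
        u = self.v(torch.tanh(self.W1(enc_states) +
                              self.W2(dec_state).unsqueeze(1))).squeeze(-1)  # [batch, n]
        return torch.softmax(u, dim=-1)          # distribution over input positions
```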

CopyNet

Motivation

We address an important problem in sequence-to-sequence (Seq2Seq) learning referred to as copying, in which certain segments in the input sequence are selectively replicated in the output sequence. A similar phenomenon is observable in human language communication. For example, humans tend to repeat entity names or even long phrases in conversation.

The challenge with regard to copying in Seq2Seq is that new machinery is needed to decide when to perform the operation.

For example:

• What to copy: which parts of the input should be copied?

• Where to paste: where in the output should this information be placed?

Model Architecture

• From a cognitive perspective, the copying mechanism is related to rote memorization, requiring less understanding but ensuring high literal fidelity.

• From a modeling perspective, the copying operations are more rigid and symbolic, which makes them harder than soft attention to integrate into a fully differentiable neural model.

Encoder:

An LSTM converts the source sequence into the hidden states M(emory) $h_1,\dots,h_{T_S}$.

Decoder:

• Prediction: COPYNET predicts words based on a mixed probabilistic model of two modes, namely the generate-mode and the copy-mode, where the latter picks words from the source sequence (as in Ptr-Net above).
• State Update: the predicted word at time t−1 is used in updating the state at t, but COPYNET uses not only its word embedding but also its corresponding location-specific hidden state in M (if any).
• Reading M: in addition to the attentive read to M, COPYNET also has a "selective read" of M, which leads to a powerful hybrid of content-based addressing and location-based addressing. Deciding when to copy, when to rely on understanding, and how to mix the two modes is the crucial part.

Prediction with Copying and Generation:$s_t\rightarrow y_t$

$$p(y_t|s_t,y_{t-1},c_t,M)=p(y_t,g|s_t,y_{t-1},c_t,M) + p(y_t,c|s_t,y_{t-1},c_t,M)$$

• Content-base

Attentive read from word-embedding

• location-base

Selective read from location-specific hidden units

$$p(y_t,g|\cdot)=\begin{cases} \dfrac{1}{Z}e^{\psi_g(y_t)}, & y_t\in V \\ 0, & y_t\in X \cap \overline V \\ \dfrac{1}{Z}e^{\psi_g(\text{UNK})}, & y_t\notin V\cup X \end{cases}$$

$$p(y_t,c|\cdot)=\begin{cases}\dfrac{1}{Z}\sum_{j:x_j=y_t}e^{\psi_c(x_j)}, & y_t\in X \\ 0, & \text{otherwise}\end{cases}$$

Z is the normalization term shared by the two modes: $Z=\sum_{v\in V\cup\{\text{UNK}\}}e^{\psi_g(v)}+\sum_{x\in X}e^{\psi_c(x)}$.

Generate-Mode:

$$\psi_g(y_t=v_i)=\nu_i^TW_os_t, v_i\in V\cup UNK$$

• $W_o\in R^{(N+1)\times d_s}$

• $\nu_i$ is the one-hot vector of $v_i$; the result is the score of that word.

The generate-mode score $\psi_g(y_t=v_i)$ is the same as in a plain encoder-decoder: a fully connected layer followed by a softmax.

copy-mode:

$$\psi_c(y_t=x_j)=\sigma(h_j^TW_c)s_t,\quad x_j\in X$$

• $h_j$ is an encoder hidden state; j indexes the position in the input sequence.

• $W_c\in R^{d_h\times d_s}$ maps $h_j$ into the same semantic space as $s_t$.

• The authors found that the tanh non-linearity works better here. Also, since the word $y_t$ may occur several times in the input, the scores of all input positions equal to $y_t$ must be summed.
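A sketch of how the two score sets can share one normalizer and how the copy mass is folded back onto vocabulary ids (a simplified, unbatched illustration; `src_ids` must be a LongTensor of the source words' vocabulary ids).

```python
import torch

def copynet_output_dist(s_t, h_src, src_ids, W_o, W_c, vocab_size):
    """s_t: [d_s] decoder state, h_src: [T, d_h] encoder states,
    W_o: [vocab, d_s], W_c: [d_h, d_s]; returns p(y_t) over the vocabulary."""
    score_g = W_o @ s_t                            # generate scores, [vocab]
    score_c = torch.tanh(h_src @ W_c) @ s_t        # copy score per source position, [T]
    # a single softmax over both score sets realizes the shared normalizer Z
    probs = torch.softmax(torch.cat([score_g, score_c]), dim=0)
    p_vocab, p_copy = probs[:vocab_size], probs[vocab_size:]
    # sum the copy mass of every source position onto its vocabulary id
    return p_vocab.index_add(0, src_ids, p_copy)
```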

state update

$$c_t=\sum_{\tau=1}^{T_S}\alpha_{t\tau}h_{\tau}$$

$$\alpha_{t\tau}=\dfrac{e^{\eta(s_{t-1},h_{\tau})}}{\sum_{\tau’}e^{\eta(s_{t-1},h_{\tau’})}}$$

The $y_{t-1}$ used by CopyNet differs here: it involves not only the word embedding but also the location-specific hidden state in M. That is, the representation of $y_{t-1}$ consists of two parts, $[e(y_{t-1});\zeta(y_{t-1})]$, where $e(y_{t-1})$ is the word embedding and the extra term $\zeta(y_{t-1})$ is called the selective read, intended to allow longer phrases to be copied contiguously. Like attention, it is a weighted sum of the hidden states in M.

$$\zeta(y_{t-1})=\sum_{\tau=1}^{T_S}\rho_{t\tau}h_{\tau}$$

$$\rho_{t\tau}=\begin{cases}\dfrac{1}{K}p(x_{\tau},c|s_{t-1},M), & x_{\tau}=y_{t-1} \\ 0, & \text{otherwise} \end{cases}$$

• When $y_{t-1}$ does not appear in the source sentence, $\zeta(y_{t-1})=0$.

• Here $K=\sum_{\tau':x_{\tau'}=y_{t-1}}p(x_{\tau'},c|s_{t-1},M)$ is a normalizing sum, again because the current word may occur at several positions in the input, and each occurrence has a different encoder hidden state and hence a different weight.

• The paper does not explain this p; presumably it is the same as the copy score computed above.

• Intuitively, $\zeta(y_{t-1})$ can be viewed as a selective read of M: compute the weights of all input positions equal to $y_{t-1}$, then take the weighted sum.

Hybrid Adressing of M

$$\zeta(y_{t-1}) \xrightarrow{\text{update}} s_t \xrightarrow{\text{predict}} y_t \xrightarrow{\text{sel. read}} \zeta(y_t)$$

Learning

$$L=-\dfrac{1}{N}\sum_{k=1}^N\sum_{t=1}^Tlog[p(y_t^{(k)}|y_{<t}^{(k)}, X^{(k)})]$$

N is the batch size and T is the length of the target sentence.

Paper Notes - Match-LSTM

Motivation

In SQuAD the answers do not come from a small set of candidate answers and they have variable lengths. We propose an end-to-end neural architecture for the task.

The architecture is based on match-LSTM, a model we proposed previously for textual entailment, and Pointer Net, a sequence-to-sequence model proposed by Vinyals et al. (2015) to constrain the output tokens to be from the input sequences.

• MCTest: A challenge dataset for the open-domain machine comprehension of text.

• Teaching machines to read and comprehend.

• The Goldilocks principle: Reading children’s books with explicit memory representations.

• Towards AI-complete question answering: A set of prerequisite toy tasks.

• SQuAD: 100,000+ questions for machine comprehension of text.

Traditional solutions to this kind of question answering tasks rely on NLP pipelines that involve multiple steps of linguistic analyses and feature engineering, including syntactic parsing, named entity recognition, question classification, semantic parsing, etc. Recently, with the advances of applying neural network models in NLP, there has been much interest in building end-to-end neural architectures for various NLP tasks, including several pieces of work on machine comprehension.

End-to-end model architecture:

• Teaching machines to read and comprehend.

• The Goldilocks principle: Reading children’s books with explicit memory representations.

• Attention-based convolutional neural network for machine comprehension

• Text understanding with the attention sum reader network.

• Consensus attention-based neural networks for chinese reading comprehension.

However, given the properties of previous machine comprehension datasets, existing end-to-end neural architectures for the task either rely on the candidate answers (Hill et al., 2016; Yin et al., 2016) or assume that the answer is a single token (Hermann et al., 2015; Kadlec et al., 2016; Cui et al., 2016), which make these methods unsuitable for the SQuAD dataset.

We propose two ways to apply the Ptr-Net model for our task: a sequence model and a boundary model. We also further extend the boundary model with a search mechanism.

Model Architecture

Pointer Network

Pointer Network (Ptr-Net) model : to solve a special kind of problems where we want to generate an output sequence whose tokens must come from the input sequence. Instead of picking an output token from a fixed vocabulary, Ptr-Net uses attention mechanism as a pointer to select a position from the input sequence as an output symbol.

MATCH-LSTM AND ANSWER POINTER

• An LSTM preprocessing layer that preprocesses the passage and the question using LSTMs.

• A match-LSTM layer that tries to match the passage against the question.

• An Answer Pointer (Ans-Ptr) layer that uses Ptr-Net to select a set of tokens from the passage as the answer. The difference between the two models only lies in this third layer.

LSTM preprocessing Layer

$$H^p=\overrightarrow {LSTM}(P), H^q=\overrightarrow {LSTM}(Q)$$

Match-LSTM Layer

$$\overrightarrow G_i=tanh(W^qH^q+(W^pH_i^p+W^r\overrightarrow {h^r}_{i-1}+b^p)\otimes e_Q)\in R^{l\times Q}$$

$$\overrightarrow \alpha_i=softmax(w^T\overrightarrow G_i + b\otimes e_Q)\in R^{1\times Q}$$

the resulting attention weight $\overrightarrow α_{i,j}$ above indicates the degree of matching between the $i^{th}$ token in the passage and the $j^{th}$ token in the question.

$$\overrightarrow z_i=\begin{bmatrix} h_i^p \\ H^q\overrightarrow {\alpha_i}^T \end{bmatrix}$$

$$h^r=\overrightarrow{LSTM}(\overrightarrow{z_i},\overrightarrow{h^r_{i-1}})$$

$$\overleftarrow G_i=tanh(W^qH^q+(W^pH_i^p+W^r\overleftarrow {h^r}_{i-1}+b^p)\otimes e_Q)$$

$$\overleftarrow \alpha_i=softmax(w^T\overleftarrow G_i + b\otimes e_Q)$$

• $\overrightarrow {H^r}\in R^{l\times P}$ denotes the hidden states $[\overrightarrow {h^r_1}, \overrightarrow {h^r_2},\dots,\overrightarrow {h^r_P}]$.

• $\overleftarrow {H^r}\in R^{l\times P}$ denotes the hidden states $[\overleftarrow {h^r_1}, \overleftarrow {h^r_2},\dots,\overleftarrow {h^r_P}]$.

The Sequence Model

The answer is represented by a sequence of integers $a=(a_1,a_2,…)$ indicating the positions of the selected tokens in the original passage.

$$F_k=tanh(V\tilde {H^r}+(W^ah^a_{k-1}+b^a)\otimes e_{P+1})\in R^{l\times P+1}$$

$$\beta_k=softmax(v^TF_k+c\otimes e_{P+1}) \in R^{1\times (P+1)}$$

$$h_k^a=\overrightarrow{LSTM}(\tilde {H^r}\beta_k^T, h^a_{k-1})$$

$$p(a|H^r)=\prod_k p(a_k|a_1,a_2,…,a_{k-1}, H^r)$$

$$p(a_k=j|a_1,a_2,…,a_{k-1})=\beta_{k,j}$$

$$-\sum_{n=1}^N logp(a_n|P_n,Q_n)$$

The Boundary Model

So the main difference from the sequence model above is that in the boundary model we do not need to add the zero padding to Hr, and the probability of generating an answer is simply modeled as:

$$p(a|H^r)=p(a_s|H^r)p(a_e|a_s, H^r)$$
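As a small illustration, a NumPy sketch of the boundary-model negative log-likelihood, assuming the Ans-Ptr layer has already produced start and end pointer distributions (the names `beta_start`/`beta_end` are placeholders):

```python
import numpy as np

def boundary_nll(beta_start, beta_end, spans):
    """Negative log-likelihood of gold spans under the boundary model.
    beta_start, beta_end: (N, P) pointer distributions over passage positions,
    spans: list of (a_s, a_e) gold start/end indices."""
    loss = 0.0
    for n, (a_s, a_e) in enumerate(spans):
        # log p(a|H^r) = log p(a_s|H^r) + log p(a_e|a_s, H^r)
        loss -= np.log(beta_start[n, a_s]) + np.log(beta_end[n, a_e])
    return loss / len(spans)
```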

Search mechanism, and bi-directional Ans-Ptr.

Training

Dataset

SQuAD: Passages in SQuAD come from 536 Wikipedia articles covering a wide range of topics. Each passage is a single paragraph from a Wikipedia article, and each passage has around 5 questions associated with it. In total, there are 23,215 passages and 107,785 questions. The data has been split into a training set (87,599 question-answer pairs), a development set (10,570 question-answer pairs) and a hidden test set.

configuration

• dimension l of the hidden layers is set to 150 or 300.

• Adamax: $\beta_1=0.9, \beta_2=0.999$

• minibatch size = 30

• no L2 regularization.

Paper Notes - QA: BiDAF

paper:

Motivation

Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query.

Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention.

In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bidirectional attention flow mechanism to obtain a query-aware context representation without early summarization.

Introduction

Attention mechanisms in previous works typically have one or more of the following characteristics. First, the computed attention weights are often used to extract the most relevant information from the context for answering the question by summarizing the context into a fixed-size vector. Second, in the text domain, they are often temporally dynamic, whereby the attention weights at the current time step are a function of the attended vector at the previous time step. Third, they are usually uni-directional, wherein the query attends on the context paragraph or the image.

• 1. The attention weights are used to extract the most relevant information from the context for answering the question, summarizing the context into a fixed-size vector.

• 2. In the text domain, attention is temporally dynamic: the attention weights at the current time step are a function of the attended vector at the previous time step.

• 3. Attention is usually uni-directional: the query attends over the context paragraph or the image.

Model Architecture

Compared with the traditional way attention is applied to MC, BiDAF makes the following improvements:

• First, our attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization.

1) The context is not summarized into a fixed-size vector. Instead, the attended vector computed at each time step is allowed to flow onward (the modeling layer uses a biLSTM), which reduces the information loss caused by early summarization.

• Second, we use a memory-less attention mechanism. That is, while we iteratively compute attention through time as in Bahdanau et al. (2015), the attention at each time step is a function of only the query and the context paragraph at the current time step and does not directly depend on the attention at the previous time step.

2) Memory-less: at each time step, attention is computed only from the query and the current context paragraph, and does not directly depend on the attention at the previous time step.

We hypothesize that this simplification leads to the division of labor between the attention layer and the modeling layer. It forces the attention layer to focus on learning the attention between the query and the context, and enables the modeling layer to focus on learning the interaction within the query-aware context representation (the output of the attention layer). It also allows the attention at each time step to be unaffected from incorrect attendances at previous time steps.

• Third, we use attention mechanisms in both directions, query-to-context and context-to-query, which provide complementary information to each other.

Character Embedding Layer and Word Embedding Layer -> Contextual Embedding Layer -> Attention Flow Layer -> Modeling Layer -> Output Layer

Character Embedding Layer and Word Embedding Layer

• character embedding of each word using a CNN. The outputs of the CNN are max-pooled over the entire width to obtain a fixed-size vector for each word.

• pre-trained word vectors, GloVe

• the concatenation of the two embeddings above is passed to a two-layer highway network.

context -> $X\in R^{d\times T}$

query -> $Q\in R^{d\times J}$

contextual embedding layer

model the temporal interactions between words using biLSTM.

context -> $H\in R^{2d\times T}$

query -> $U\in R^{2d\times J}$

attention flow layer

the attention flow layer is not used to summarize the query and context into single feature vectors. Instead, the attention vector at each time step, along with the embeddings from previous layers, are allowed to flow through to the subsequent modeling layer.

$$S_{tj}=\alpha(H_{:t},U_{:j})\in R$$

Context-to-query Attention:

$$a_t=softmax(S_{t:})\in R^J$$

$$\tilde U_{:t}=\sum_j a_{tj}U_{:j}\in R^{2d}$$

Query-to-context Attention:

$$b=softmax(max_{col}(S))\in R^T$$

$$\tilde h = \sum_tb_tH_{:t}\in R^{2d}$$

$$G_{:t}=\beta (H_{:t},\tilde U_{:t}, \tilde H_{:t})\in R^{d_G}$$

The function $\beta$ can be a multi-layer perceptron. In the authors' experiments:

$$\beta(h,\tilde u,\tilde h)=[h;\tilde u;h\circ \tilde u;h\circ \tilde h]\in R^{8d}$$

so $d_G=8d$, and the full matrix $G\in R^{8d\times T}$.
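A NumPy sketch of the attention flow layer for a single example, assuming the trainable similarity function used in the paper, $\alpha(h,u)=w_S^T[h;u;h\circ u]$; all names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_flow(H, U, w_s):
    """BiDAF attention flow layer (single example).
    H: (2d, T) context encodings, U: (2d, J) query encodings,
    w_s: (6d,) weights of the similarity function alpha(h,u)=w_s^T[h;u;h∘u]."""
    T, J = H.shape[1], U.shape[1]
    # similarity matrix S_tj = alpha(H[:,t], U[:,j])
    S = np.zeros((T, J))
    for t in range(T):
        for j in range(J):
            S[t, j] = w_s @ np.concatenate([H[:, t], U[:, j], H[:, t] * U[:, j]])
    # context-to-query: a_t = softmax(S[t,:]),  U~[:,t] = sum_j a_tj U[:,j]
    A = softmax(S, axis=1)                  # (T, J)
    U_tilde = U @ A.T                       # (2d, T)
    # query-to-context: b = softmax(max_col(S)),  h~ = sum_t b_t H[:,t], tiled over T
    b = softmax(S.max(axis=1))              # (T,)
    H_tilde = np.tile((H @ b)[:, None], (1, T))
    # G[:,t] = [h; u~; h∘u~; h∘h~]  -> (8d, T)
    return np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)
```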

Modeling Layer

captures the interaction among the context words conditioned on the query.

Output Layer

$$p^1=softmax(w^T_{(p^1)}[G;M])$$

$$p^2=softmax(w^T_{(p^2)}[G;M^2])$$

Training

$$L(\theta)=-{1 \over N} \sum^N_i[log(p^1_{y_i^1})+log(p^2_{y_i^2})]$$

$\theta$ includes the parameters:

• the weights of CNN filters and LSTM cells

• $w_{S}$,$w_{p^1},w_{p^2}$

$y_i^1,y_i^2$ are the indices in the context of the answer's start and end positions for the $i$-th example.

$p^1,p^2\in R^T$ are the probabilities produced by the softmax. Treating the ground truth as a one-hot vector [0,0,…,1,0,0,0], the cross-entropy for a single example is:

$$- log(p^1_{y_i^1})-log(p^2_{y_i^2})$$

Test

The answer span $(k; l)$ where $k \le l$ with the maximum value of $p^1_kp^2_l$ is chosen, which can be computed in linear time with dynamic programming.
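A small sketch of that linear-time search (function name illustrative): scan the end position while tracking the best start position seen so far.

```python
import numpy as np

def best_span(p1, p2):
    """Pick the answer span (k, l) with k <= l maximizing p1[k] * p2[l]."""
    best_k, best, best_val = 0, (0, 0), -1.0
    for l in range(len(p1)):
        if p1[l] > p1[best_k]:
            best_k = l                       # best start position so far
        if p1[best_k] * p2[l] > best_val:
            best_val = p1[best_k] * p2[l]
            best = (best_k, l)
    return best, best_val
```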

Deep Learning - Batch Normalization

Motivation

Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift.

stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate and the initial parameter values. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.

The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer.

Therefore, the input distribution properties that aid the network generalization – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such it is advantageous for the distribution of x to remain fixed over time.

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the subnetwork, as well.

$z=g(Wu+b)$

• ReLU

• Xavier initialization

• train with a relatively small learning rate

If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs.

Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.

Besides addressing internal covariate shift, BN also reduces the dependence of the gradients on the learning rate and on the parameter initialization. This lets us use larger learning rates, regularizes the model (reducing the need for dropout), and still allows the network to use saturating nonlinear activation functions.

Towards Reducing Internal Covariate Shift

Whitening

It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated.

However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.

$x=u+b \rightarrow \hat x = x-E[x] \rightarrow loss$

$\dfrac{\partial l}{\partial b}=\dfrac{\partial l}{\partial \hat x}\dfrac{\partial \hat x}{\partial b} = \dfrac{\partial l}{\partial \hat x}$

$u+(b+\Delta b)-E[u+(b+\Delta b)]=u+b-E[u+b]$, i.e. the update to $b$ leaves the layer output (and hence the loss) unchanged, so $b$ can grow without bound while the loss stays fixed.

This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.

Whitening while accounting for the normalization during optimization

The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters Θ.

$\hat x = Norm(x, \chi)$

$$\frac{\partial{Norm(x,\chi)}}{\partial{x}}\text{ and }\frac{\partial{Norm(x,\chi)}}{\partial{\chi}}$$

Normalization via Mini-Batch Statistics

Two simplifications relative to full whitening

Since the full whitening of each layer’s inputs is costly, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have zero mean and unit variance.

$$\hat x^{(k)} = \dfrac{x^{(k)}-E[x^{(k)}]}{\sqrt {Var[x^{(k)}]}}$$

where the expectation and variance are computed over the training data set.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.

$$y^{(k)} = \gamma^{(k)}\hat x^{(k)} + \beta^{(k)}$$

$\gamma^{(k)}$ and $\beta^{(k)}$ are learnable parameters that restore the representational power of the network after normalization. In particular, setting $\gamma^{(k)}=\sqrt {Var[x^{(k)}]}$ and $\beta^{(k)}=E[x^{(k)}]$ recovers the original activations (the identity transform), if that were the optimal thing to do.

In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.

Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.

The core BN procedure

For a mini-batch of size m, consider a single feature dimension $x^{(k)}$ (k indexes the feature). For the m values of this feature in the batch:

$$B=\{x_{1,…,m}\}$$

$$BN_{\gamma, \beta}:x_{1,..,m}\rightarrow y_{1,..,m}$$

• for this dimension of the input mini-batch, compute the mean and variance

• normalize (with an epsilon to avoid division by zero)

• scale and shift with the two learnable parameters

Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training.

BN is differentiable, so the model remains trainable; the layers can keep learning on input distributions that exhibit less internal covariate shift, which accelerates training.
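A minimal NumPy sketch of the training-time BN transform described above (names and shapes are illustrative):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch-norm forward pass for a mini-batch x of shape (m, d):
    per-dimension mean/variance -> normalize -> scale and shift."""
    mu = x.mean(axis=0)                    # (d,) mini-batch mean
    var = x.var(axis=0)                    # (d,) mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each dimension
    y = gamma * x_hat + beta               # learnable scale and shift
    return y, mu, var
```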

Training and Inference with Batch-Normalized Networks

The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization

$\hat x = \dfrac{x-E[x]}{\sqrt{Var[x]+\epsilon}}$

using the population, rather than mini-batch, statistics.

Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation.
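A sketch of the inference-time transform and of tracking moving averages during training. Using an exponential moving average is a common implementation choice; the paper's Algorithm 2 instead estimates the population statistics over many mini-batches.

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """At inference BN is a fixed linear transform per activation,
    using population (or moving-average) statistics instead of batch statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

def update_running_stats(running_mean, running_var, mu, var, momentum=0.99):
    """Exponential moving averages of the batch statistics, updated each step."""
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mean, running_var
```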

Batch-Normalized Convolutional Networks

• Steps 1-5 follow Algorithm 1: normalize each dimension, giving $N_{BN}^{tr}$.

• Steps 6-7 optimize the training parameters $\theta \cup \{\gamma^{(k)}, \beta^{(k)}\}$; at test time these parameters are frozen.

• Steps 8-12 convert the statistics gathered during training into statistics of the whole training set, because at inference time the model uses these stored population statistics. This relies on unbiased estimation of the population mean and variance from samples: the sample mean is an unbiased estimator of the population mean, whereas the sample variance needs the $\frac{m}{m-1}$ correction to be unbiased. See the discussion at https://www.zhihu.com/question/20099757

Batch Normalization enables higher learning rates

In traditional deep networks, too high a learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima.

By normalizing activations throughout the network, it prevents small changes in layer parameters from amplifying as the data propagates through a deep network.

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters.

BN makes training more resilient to the parameter scale. Normally, an overly large learning rate increases the scale of the network parameters, which amplifies the gradients during backpropagation and can lead to gradient explosion; with BN, backpropagation through a layer is unaffected by the scale of its parameters.

Code implementation

TensorFlow already ships a BN layer that can be called directly via tf.contrib.layers.batch_norm(). If you want to know how the function is implemented under the hood and deepen your understanding of the BN layer, see the article Implementing Batch Normalization in Tensorflow.
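A minimal TF 1.x usage sketch; the layer sizes and placeholder names are made up, and default arguments may differ between TensorFlow versions.

```python
import tensorflow as tf  # TF 1.x

is_training = tf.placeholder(tf.bool, name="is_training")
x = tf.placeholder(tf.float32, [None, 128])
h = tf.layers.dense(x, 64, use_bias=False)
# BN before the nonlinearity; updates_collections=None applies the
# moving-average updates in place instead of via UPDATE_OPS.
h_bn = tf.contrib.layers.batch_norm(h, is_training=is_training,
                                    center=True, scale=True,
                                    updates_collections=None)
h_out = tf.nn.relu(h_bn)
```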

Machine Learning - Overfitting

Remedies

Get more data

• Collect more data at the source: the obvious option, e.g. for object classification, simply take more photos. However, in many cases it is not easy to increase the data substantially, and it is unclear how much data would be enough.

• Estimate the parameters of the data distribution from the current dataset and sample more data from that distribution: rarely used, since the parameter estimation itself introduces sampling error.

• Data augmentation: expand the data with fixed rules. In object classification, for example, the object's position, pose and scale in the image, or the overall image brightness, do not affect the label, so the dataset can be multiplied several-fold by translating, flipping, scaling and cropping the images.

Use an appropriate model

(PS: if the model complexity can be pinned down through physical or mathematical modeling, that is the best approach, which is why, even now that deep learning is so popular, I still insist that beginners should master traditional modeling methods.)

Add noise to the weights

Graves, Alex, et al. “A novel connectionist system for unconstrained handwriting recognition.” IEEE transactions on pattern analysis and machine intelligence 31.5 (2009): 855-868.

• It may work better, especially in recurrent networks (Hinton)

Paper Notes - Attention Survey

Attention and Augmented Recurrent Neural Networks

They can be used to boil a sequence down into a high-level understanding, to annotate sequences, and even to generate new sequences from scratch!

Neural Turing Machines

Neural Turing Machines combine RNNs with an external memory bank. Since vectors are the natural "language" of neural networks, the memory is an array of vectors.

NTMs take a clever approach: every read and write touches all memory locations, just to different extents at different positions.

$$r \leftarrow \sum_ia_iM_i$$

$$M_i \leftarrow a_iw+(1-a_i)M_i$$
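A NumPy sketch of these attention-weighted read and write operations. The memory size and attention values are made up, and the full NTM write additionally uses erase/add vectors, which the simplified formulas above omit.

```python
import numpy as np

def ntm_read(M, a):
    """Attention-weighted read: r = Σ_i a_i M_i."""
    return a @ M

def ntm_write(M, a, w):
    """Attention-weighted write: M_i <- a_i * w + (1 - a_i) * M_i."""
    return a[:, None] * w[None, :] + (1 - a)[:, None] * M

# M: memory of 8 slots of size 4; a: attention over slots (sums to 1)
M = np.random.randn(8, 4)
a = np.full(8, 1 / 8)
w = np.zeros(4)
r = ntm_read(M, a)
M = ntm_write(M, a, w)
```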

The Neural GPU [4] overcomes the NTM's inability to add and multiply numbers.

Zaremba & Sutskever [5] train NTMs using reinforcement learning instead of the differentiable read/writes used by the original.

Neural Random Access Machines [6] work based on pointers.

Some papers have explored differentiable data structures, like stacks and queues [7, 8].

memory networks [9, 10] are another approach to attacking similar problems.

Code

Attention Interfaces

The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query. The scores are fed into a softmax to create the attention distribution.

Then an RNN runs, generating a description of the image. As it generates each word in the description, the RNN focuses on the conv net’s interpretation of the relevant parts of the image. We can explicitly visualize this:

More broadly, attentional interfaces can be used whenever one wants to interface with a neural network that has a repeating structure in its output.
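A tiny NumPy sketch of this attending interface: dot-product scores, a softmax, and an attention-weighted summary (names are illustrative).

```python
import numpy as np

def attend(query, items):
    """Dot-product attention: score each item against the query,
    softmax the scores, and return the attention-weighted summary."""
    scores = items @ query                       # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # attention distribution
    return weights @ items, weights              # summary, distribution
```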

Standard RNNs do the same amount of computation for each time step. This seems unintuitive. Surely, one should think more when things are hard? It also limits RNNs to doing O(n) operations for a list of length n.

Adaptive Computation Time [15] is a way for RNNs to do different amounts of computation each step. The big picture idea is simple: allow the RNN to do multiple steps of computation for each time step.

In order for the network to learn how many steps to do, we want the number of steps to be differentiable. We achieve this with the same trick we used before: instead of deciding to run for a discrete number of steps, we have an attention distribution over the number of steps to run. The output is a weighted combination of the outputs of each step.

There are a few more details, which were left out in the previous diagram. Here’s a complete diagram of a time step with three computation steps.

That’s a bit complicated, so let’s work through it step by step. At a high-level, we’re still running the RNN and outputting a weighted combination of the states:

S denotes the hidden states of the RNN.

The weight for each step is determined by a “halting neuron.” It’s a sigmoid neuron that looks at the RNN state and gives a halting weight, which we can think of as the probability that we should stop at that step.

We have a total budget for the halting weights of 1, so we track that budget along the top. When it gets to less than epsilon, we stop.

When we stop, we might have some leftover halting budget, because we stop as soon as it drops below epsilon. What should we do with it? Technically, it's being given to future steps, but we don't want to compute those, so we attribute it to the last step.

When training Adaptive Computation Time models, one adds a “ponder cost” term to the cost function. This penalizes the model for the amount of computation it uses. The bigger you make this term, the more it will trade-off performance for lowering compute time.
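A minimal NumPy sketch of the halting loop described above; the RNN cell and halting unit are toy stand-ins for illustration, not the actual ACT implementation.

```python
import numpy as np

def act_step(rnn_cell, halt_unit, state, x, eps=0.01, max_steps=10):
    """Run several inner computation steps for one input x, weight the states
    by halting probabilities, stop when the remaining budget drops below eps,
    and attribute the leftover budget to the last step."""
    states, weights = [], []
    budget = 1.0
    for n in range(max_steps):
        state = rnn_cell(state, x)
        p = halt_unit(state)                 # sigmoid halting weight in (0, 1)
        if budget - p < eps or n == max_steps - 1:
            states.append(state)
            weights.append(budget)           # leftover budget goes to the last step
            break
        states.append(state)
        weights.append(p)
        budget -= p
    weights = np.array(weights)
    # output state is the weighted combination of the inner-step states
    return (weights[:, None] * np.array(states)).sum(axis=0), weights

# toy stand-ins for the RNN cell and the sigmoid halting neuron
rng = np.random.default_rng(0)
W, U, w_h = rng.normal(size=(16, 16)), rng.normal(size=(16, 8)), rng.normal(size=16)
cell = lambda s, x: np.tanh(W @ s + U @ x)
halt = lambda s: 1.0 / (1.0 + np.exp(-(w_h @ s)))
out_state, halting_weights = act_step(cell, halt, np.zeros(16), rng.normal(size=8))
```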

Code：

The only open source implementation of Adaptive Computation Time at the moment seems to be Mark Neumann’s (TensorFlow).

reference:

Paper Notes - Memory Networks

Notes on papers related to Memory Networks.

• Memory Network with strong supervision

• End-to-End Memory Network

• Dynamic Memory Network

Paper reading 1: Memory Networks, Jason Weston

Motivation

RNNs compress information into a final state, which limits how much they can remember; memory networks were proposed to improve on exactly this.

However, their memory (encoded by hidden states and weights) is typically too small, and is not compartmentalized enough to accurately remember facts from the past (knowledge is compressed into dense vectors). RNNs are known to have difficulty in performing memorization.

The basic motivation behind Memory Networks is that we need a long-term memory to hold QA knowledge or conversational context, and existing RNNs do not perform that well as long-term memory.

Memory Networks

four components:

• I:(input feature map)

• G:(generalization)

• O:(output feature map)

• R:(response)

Detailed walk-through

1. I component: encodes the input text into an internal feature representation.

2. G component: generalization updates the memories by combining the old memories with the new input: $m_i=G(m_i, I(x),m), ∀i$

3. O component: reads from the memories and performs inference, computing which memories are relevant for producing a good response.

$$o_1=O_1(q,m)=argmax_{i=1,2,..,N}s_O(q,m_i)$$

$$o_2=O_2(q,m)=argmax_{i=1,2,..,N}s_O([q,o_1],m_i)$$

The output $[q,o_1, o_2]$ is also the input of module R.

$s_O$ is a function that scores the match between the pair of sentences x and $m_i$; it measures how relevant the memory $m_i$ is to the question x.

$$s_O=qUU^Tm$$

$s_O$ measures the relevance between the question q and the current memory m.

U: the bilinear scoring parameters; the score $qUU^Tm_{true}$ of a relevant fact should be higher than the score $qUU^Tm_{random}$ of an irrelevant one.

4. R component: decodes the output features o into the final response: r=R(o)

$$r=argmax_{w\in W}s_R([q,m_{o_1},m_{o_2}],w)$$

W is the vocabulary; $s_R$ scores each word against the output features o, and the highest-scoring word is returned as the response.

$s_R$ has the same form as $s_O$:

$$s(x,y)=xUU^Ty$$

The huge-memory problem

• Memories can be organized by entity or topic, so that G does not need to operate over all memories.

• If the memory is full, a forgetting mechanism can replace the least useful memory: a function H scores each memory and the lowest-scoring one gets overwritten.

• Words can also be hashed, or word embeddings clustered; in short, the input I(x) is placed into one or more buckets, and scores are only computed against the memories in the same bucket(s).

Loss function

minimize: $L_i = \sum_{j\ne y_i}max(0,s_j - s_{y_i}+\Delta)$
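A NumPy sketch of the bilinear scorer $s(x,y)=xUU^Ty$ and one term of the margin loss above (the full objective sums such terms over the supporting facts and the response; names are illustrative):

```python
import numpy as np

def score(x, y, U):
    """Bilinear match score s(x, y) = x U U^T y (higher = better match)."""
    return (U @ x) @ (U @ y)

def margin_loss(q, memories, y_true, U, delta=0.1):
    """Hinge loss pushing the true supporting memory above all others by delta."""
    s_true = score(q, memories[y_true], U)
    loss = 0.0
    for j, m in enumerate(memories):
        if j != y_true:
            loss += max(0.0, score(q, m, U) - s_true + delta)
    return loss
```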

QA example:

(6) whether the correct first supporting sentence is picked out

(7) given the correct first sentence, whether the correct second sentence is picked out

(6)+(7) together ask whether the correct supporting context is selected, and are used to train the attention parameters

(8) given the correct supporting facts as input, whether the correct answer is picked out, used to train the response parameters

Paper reading 2 End-To-End Memory Networks

motivation

The model in that work was not easy to train via backpropagation, and required supervision at each layer of the network.

Our model can also be seen as a version of RNNsearch with multiple computational steps (which we term “hops”) per output symbol.

Model architecture

Single layer

• input: $x_1,…,x_i$

• query: q

1. Map the input and the query into the feature space

• memory vector {$m_i$}: ${x_i}\stackrel A\longrightarrow {m_i}$

• internal state u: $q\stackrel B \longrightarrow u$

2. Compute attention, i.e. the match between the query representation u and each sentence representation $m_i$ of the input, by taking the inner product followed by a softmax.

$$p_i=softmax(u^Tm_i)$$

p is a probability vector over the inputs.

3. Obtain the context (response) vector

• output vector: ${x_i}\stackrel C\longrightarrow {c_i}$

The response vector from the memory o is then a sum over the transformed inputs ci, weighted by the probability vector from the input:

$$o = \sum_ip_ic_i$$

4. Predict the final answer, usually a single word

$$\hat a =softmax(Wu^{k+1})= softmax(W(o^k+u^k))$$

W can be viewed as a reverse embedding, with W.shape = [embed_size, V]

5. Decode $\hat a$ to obtain the natural-language response

$$\hat a \stackrel C \longrightarrow a$$

A: input embedding matrix

C: output embedding matrix

W: answer prediction matrix

B: question embedding matrix
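A NumPy sketch of a single hop with bag-of-words sentence representations, using the matrices A, B, C, W listed above (shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memn2n_hop(sentences_bow, question_bow, A, B, C, W):
    """One MemN2N hop. sentences_bow: (n_sent, V), question_bow: (V,),
    A, B, C: (d, V) embedding matrices, W: (V_ans, d) prediction matrix."""
    m = sentences_bow @ A.T          # memory vectors m_i           (n_sent, d)
    c = sentences_bow @ C.T          # output vectors c_i           (n_sent, d)
    u = B @ question_bow             # internal state u             (d,)
    p = softmax(m @ u)               # attention p_i = softmax(u^T m_i)
    o = p @ c                        # response vector o = Σ p_i c_i
    a_hat = softmax(W @ (o + u))     # answer distribution softmax(W(o + u))
    return a_hat, o + u              # o + u is the next hop's internal state
```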

Multiple Layers/ Multiple hops

$$u^{k+1}=u^k+o^k$$

Understanding it by comparison with the previous paper:

• Input components: map the query and the sentences into the feature space.

• Generalization components: update the memory. The memory changes here as well, $\{m_i\}=AX$, but the embedding matrix A varies from layer to layer.

• Output components: attention computes the match between memory and query with an inner product followed by a softmax, and then updates the input, i.e. $[u^k,o^k]$, which can be combined by addition/concatenation or with an RNN. The difference is that the previous paper takes an argmax, $o_2=O_2(q,m)=argmax_{i=1,2,..,N}s_O([q,o_1],m_i)$, i.e. it selects the single best-matching memory $m_i$, whereas this paper takes a weighted sum over all memories.

• Response components: analogous to the output components. The previous paper matches against every word in the vocabulary and picks the most similar one, $r=argmax_{w\in W}s_R([q,m_{o_1},m_{o_2}],w)$, whereas this paper uses $\hat a=softmax(Wu^{k+1})=softmax(W(u^k+o^k))$ and learns the answer prediction matrix W by minimizing the cross-entropy loss.

Overall, it is similar to the Memory Network model in [23], except that the hard max operations within each layer have been replaced with a continuous weighting from the softmax.

Some technical details (weight tying)

• The output embedding matrix of one layer is the input embedding matrix of the next layer, i.e. $A^{k+1}=C^k$

• The output embedding of the last layer can be used as the prediction embedding matrix, i.e. $W^T=C^K$

• The question embedding matrix equals the input embedding matrix of the first layer, $B=A^1$

1. Layer-wise (RNN-like)
• $A^1=A^2=…=A^k, C^1=C^2=…=C^k$

• $u^{k+1} = Hu^k+o^k$

Experiments

Model details

Sentence representations

1. Bag-of-words (BoW) representation

$$m_i=\sum_jAx_{ij}$$

$$c_i=\sum_jCx_{ij}$$

$$u=\sum_jBq_j$$

2. Position encoding: encodes the position of words within the sentence (takes word order into account)

$$m_i=\sum_jl_j\cdot Ax_{ij}$$

Here i indexes the sentence and j indexes the word within that sentence.

$$l_{kj}=(1-j/J)-(k/d)(1-2j/J)$$

$$l_{kj} = 1+4(k- (d+1)/2)(j-(J+1)/2)/d/J$$

(WolframAlpha plots of the two position-encoding formulas above)

Position encoding: code implementation
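A minimal NumPy sketch of the paper's position-encoding formula $l_{kj}=(1-j/J)-(k/d)(1-2j/J)$ (function names are illustrative):

```python
import numpy as np

def position_encoding(J, d):
    """Position-encoding weights l_kj for j = 1..J (word) and k = 1..d (dim)."""
    j = np.arange(1, J + 1)[None, :]                 # (1, J)
    k = np.arange(1, d + 1)[:, None]                 # (d, 1)
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)   # (d, J)

def encode_sentence(word_embeddings):
    """m_i = Σ_j l_j ∘ A x_ij : element-wise weight each word embedding, then sum.
    word_embeddings: (d, J), columns are A x_ij for the words of sentence i."""
    d, J = word_embeddings.shape
    return (position_encoding(J, d) * word_embeddings).sum(axis=1)
```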

Temporal Encoding

$$m_i=\sum_jAx_{ij}+T_A(i)$$

$$c_i=\sum_jCx_{ij}+T_C(i)$$

Learning time invariance by injecting random noise

we have found it helpful to add “dummy” memories to regularize TA.

Training Details

1.learning rate decay

3.linear start training

Full code implementation:

https://github.com/PanXiebit/text-classification/blob/master/06-memory%20networks/memn2n_model.py

Paper reading 3 Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

Motivation

Most tasks in natural language processing can be cast into question answering (QA) problems over language input.

Model Architecture

• Input Module: encodes the input text into distributed representations

• Question Module: encodes the question into a distributed representation

• Episodic Memory Module: uses an attention mechanism to focus on certain parts of the input text, then produces a memory vector representation

• Answer Module: generates the answer from the final memory vector

Detailed visualization:

Input Module

1. If the input is a single sentence, the input module outputs the hidden states computed by the RNN, with $T_C= T_I$, where $T_I$ is the number of words in the sentence.

2. If the input is a list of sentences, an end-of-sentence token is inserted after each sentence, and the final hidden state of each sentence serves as that sentence's representation. The input module then outputs $T_C$ states, where $T_C$ equals the number of sentences in the sequence.

Question Module

$$q_t=GRU(L[w_t^Q],q_{t-1})$$

L denotes the embedding matrix.

$$q=q_{T_Q}$$

$T_Q$ is the number of words in the question.

Episodic Memory Module

1. Need for multiple episodes: iterating over the input gives the model the ability to perform transitive inference.

2. Attention mechanism: a gating function serves as the attention mechanism. Whereas attention in the end-to-end MemNN is a linear scoring function, i.e. a softmax over inner products, here a two-layer feed-forward network G is used.

$$g_t^i=G(c_t,m^{i-1},q)$$

$c_t$ is the candidate fact, $m^{i-1}$ is the previous memory, and q is the question; t indexes the time step within the sequence, and i indexes the episodic pass.

$$z_t^i=[c_t, m^{i-1},q, c_t\circ q,c_t\circ m^{i-1},|c_t-q|,|c_t-m^{i-1}|, c_t^TW^{(b)}q, c_t^TW^{(b)}m^{i-1}]$$

$$G = \sigma(W^{(2)}tanh(W^{(1)}z_t^i+b^{(1)})+b^{(2)})$$
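A NumPy sketch of this gate for one candidate fact; all parameter names and shapes are placeholders.

```python
import numpy as np

def dmn_gate(c_t, m_prev, q, Wb, W1, b1, W2, b2):
    """DMN attention gate g_t^i = G(c_t, m^{i-1}, q).
    c_t, m_prev, q: (d,) vectors; Wb: (d, d); W1: (h, 7d+2); b1: (h,);
    W2: (h,); b2: scalar. Returns a sigmoid gate value in (0, 1)."""
    # feature vector z_t^i combining the fact, memory and question
    z = np.concatenate([
        c_t, m_prev, q,
        c_t * q, c_t * m_prev,
        np.abs(c_t - q), np.abs(c_t - m_prev),
        [c_t @ Wb @ q], [c_t @ Wb @ m_prev],
    ])
    # two-layer feed-forward scorer: sigmoid(W2 tanh(W1 z + b1) + b2)
    h = np.tanh(W1 @ z + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))
```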