paper: CoQA: A Conversational Question Answering Challenge


We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains.
CoQA is a conversational reading comprehension dataset: 127k question-answer pairs drawn from 8k conversations across 7 different domains.

The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage.

We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.
CoQA poses different challenges from traditional RC datasets, mainly coreference and reasoning.

We ask other people a question to either seek or test their knowledge about a subject. Depending on their answer, we follow up with another question and their answer builds on what has already been discussed. This incremental aspect makes human conversations succinct. An inability to build up and maintain common ground in this way is part of why virtual assistants usually don’t seem like competent conversational partners.

CoQA is designed to test exactly this ability.


In CoQA, a machine has to understand a text passage and answer a series of questions that appear in a conversation. We develop CoQA with three main goals in mind.

The first concerns the nature of questions in a human conversation. Posing short questions is an effective human conversation strategy, but such questions are a pain in the neck for machines.
First: people ask very short questions in conversation, but these are hard for machines, e.g. Q5 "Who?".

The second goal of CoQA is to ensure the naturalness of answers in a conversation. Many existing QA datasets restrict answers to a contiguous span in a given passage, also known as extractive answers (Table 1). Such answers are not always natural, for example, there is no extractive answer for Q4 (How many?) in Figure 1. In CoQA, we propose that the answers can be free-form text (abstractive answers), while the extractive spans act as rationales for the actual answers. Therefore, the answer for Q4 is simply Three while its rationale is spanned across multiple sentences.
Second: answers are not extractive but abstractive, free-form text, as in Q4. This is really hard!

The third goal of CoQA is to enable building QA systems that perform robustly across domains. The current QA datasets mainly focus on a single domain which makes it hard to test the generalization ability of existing models.
Third: the data comes from multiple domains, which makes it possible to test and improve generalization.

Dataset collection


  1. It consists of 127k conversation turns collected from 8k conversations over text passages (approximately one conversation per passage). The average conversation length is 15 turns, and each turn consists of a question and an answer.

  2. It contains free-form answers. Each answer has an extractive rationale highlighted in the passage.

  3. Its text passages are collected from seven diverse domains — five are used for in-domain evaluation and two are used for out-of-domain evaluation.
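
To make the three properties above concrete, here is a minimal sketch of how one conversation could be represented in memory. The class and field names (`Turn`, `Conversation`, `rationale_start`, and so on) and the toy passage are assumptions made for illustration; this is not the official CoQA file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    question: str
    answer: str           # free-form answer text
    rationale_start: int  # character offset where the evidence span starts
    rationale_end: int    # exclusive end offset of the evidence span


@dataclass
class Conversation:
    passage: str
    domain: str
    turns: List[Turn] = field(default_factory=list)

    def add_turn(self, question: str, answer: str, rationale: str) -> None:
        """Append a turn, locating the rationale span inside the passage."""
        start = self.passage.index(rationale)  # raises ValueError if absent
        self.turns.append(Turn(question, answer, start, start + len(rationale)))

    def rationale(self, i: int) -> str:
        """Return the passage span that supports the answer of turn i."""
        t = self.turns[i]
        return self.passage[t.rationale_start:t.rationale_end]


# Toy passage invented for this example.
conv = Conversation(
    passage="Anna has three dogs. She walks them every morning.",
    domain="children",
)
conv.add_turn("How many dogs does Anna have?", "Three", "three dogs")
conv.add_turn("When does she walk them?", "Every morning", "every morning")
```

Keeping the rationale as offsets rather than a copied string makes it easy to check that every answer stays tied to evidence in the passage.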

Almost half of CoQA questions refer back to conversational history using coreferences, and a large portion requires pragmatic reasoning making it challenging for models that rely on lexical cues alone.

The best-performing system, a reading comprehension model that predicts extractive rationales which are further fed into a sequence-to-sequence model that generates final answers, achieves an F1 score of 65.1%. In contrast, humans achieve 88.8% F1, a superiority of 23.7% F1, indicating that there is a lot of headroom for improvement.
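The baseline chains an extractive reading comprehension model with a seq2seq model: the former predicts the rationale and the latter generates the answer from it, reaching 65.1% F1.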
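
The F1 numbers above are based on word overlap between a predicted answer and a reference answer. The following is a minimal sketch of such a score, assuming SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the official CoQA evaluation script may differ in its details, and the function names `normalize` and `token_f1` are my own.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


exact = token_f1("Three", "three.")
partial = token_f1("a Republican candidate", "Republican")
```

In the second call the prediction has two tokens after normalization and one of them matches, so precision is 0.5 and recall is 1.0.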

Question and answer collection

We want questioners to avoid using exact words in the passage in order to increase lexical diversity. When they type a word that is already present in the passage, we alert them to paraphrase the question if possible.
Questioners should avoid words that already appear in the passage wherever possible, which increases lexical diversity.
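
The paraphrase alert could be approximated with a simple vocabulary check. This is only a sketch under my own assumptions: the tokenizer, the function name `overlapping_words`, and the optional `ignore` set (for example, for stop words) are hypothetical, and the notes do not say how the actual annotation interface decides when to alert.

```python
import re


def words(text: str) -> list:
    """Very rough tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9']+", text.lower())


def overlapping_words(question: str, passage: str, ignore=frozenset()) -> list:
    """Return question words that also occur in the passage, in order."""
    passage_vocab = set(words(passage))
    found = []
    for w in words(question):
        if w in passage_vocab and w not in ignore and w not in found:
            found.append(w)
    return found


passage = "Anna has three dogs. She walks them every morning."
flagged = overlapping_words("Who owns the dogs?", passage)
if flagged:
    print("Consider paraphrasing; already in the passage:", ", ".join(flagged))
```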

For the answers, we want answerers to stick to the vocabulary in the passage in order to limit the number of possible answers. We encourage this by automatically copying the highlighted text into the answer box and allowing them to edit copied text in order to generate a natural answer. We found 78% of the answers have at least one edit such as changing a word’s case or adding punctuation.
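Answers, in contrast, should reuse the passage's vocabulary as much as possible, which limits the number of possible answers. The authors copy the highlighted text (i.e. the rationale) into the answer box and let the answerer edit it into a natural answer. 78% of the answers contain at least one edit, such as changing a word's case or adding punctuation.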
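
One way to measure how often answerers changed the copied text is to compare each final answer with the highlighted span. The sketch below is an illustration only; the three labels and the function name `edit_type` are my own and are not categories used by the paper.

```python
import string

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def _canonical(text: str) -> list:
    """Lowercase and remove punctuation so only the words remain."""
    return text.lower().translate(_STRIP_PUNCT).split()


def edit_type(copied: str, answer: str) -> str:
    """Classify how the final answer differs from the copied rationale."""
    if answer == copied:
        return "unchanged"
    if _canonical(answer) == _canonical(copied):
        return "case_or_punctuation"
    return "reworded"


pairs = [
    ("three dogs", "three dogs"),
    ("three dogs", "Three dogs."),
    ("three dogs", "Three"),
]
labels = [edit_type(copied, answer) for copied, answer in pairs]
edited_share = sum(label != "unchanged" for label in labels) / len(labels)
```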

Passage collection

Not all passages in these domains are equally good for generating interesting conversations. A passage with just one entity often results in questions that entirely focus on that entity. Therefore, we select passages with multiple entities, events and pronominal references using Stanford CoreNLP (Manning et al., 2014). We truncate long articles to the first few paragraphs that result in around 200 words.
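If a passage contains only one entity, every conversation generated from it will revolve around that entity, which is clearly not what this dataset wants. The authors therefore analyze passages with Stanford CoreNLP and keep those with multiple entities and events.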
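
The truncation step can be sketched as keeping whole leading paragraphs until the running word count reaches the target. The stopping rule here (stop once the count is at or above `target_words`) is my assumption; the paper only says "the first few paragraphs that result in around 200 words". The entity-based filtering with Stanford CoreNLP is not shown.

```python
def truncate_passage(paragraphs, target_words=200):
    """Keep leading paragraphs until about `target_words` words are collected."""
    kept, count = [], 0
    for para in paragraphs:
        if count >= target_words:
            break
        kept.append(para)
        count += len(para.split())  # whitespace-separated word count
    return "\n\n".join(kept)


# Three synthetic paragraphs of 120 words each.
article = [" ".join(["word"] * 120) for _ in range(3)]
short = truncate_passage(article)
```

With three 120-word paragraphs, the first two are kept (240 words) and the third is dropped, because whole paragraphs are never split.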

Table 2 shows the distribution of domains. We reserve the Science and Reddit domains for out-of-domain evaluation. For each in-domain dataset, we split the data such that there are 100 passages in the development set, 100 passages in the test set, and the rest in the training set. For each out-of-domain dataset, we just have 100 passages in the test set.
The in-domain sets are Children, Literature, Mid/High school, News, and Wikipedia. Each contributes 100 passages to the development set (dev) and 100 to the test set, with the rest going to the training set (train). The out-of-domain sets, Science and Reddit, each contribute 100 passages to the test set only.
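
The split rule can be written down in a few lines. This is a sketch, not the authors' procedure: the notes do not say how passages were assigned to each set, so the seeded random shuffle and the function name `split_domain` are assumptions.

```python
import random


def split_domain(passage_ids, in_domain, n_dev=100, n_test=100, seed=0):
    """Split one domain's passages into train/dev/test following the rule above."""
    ids = list(passage_ids)
    random.Random(seed).shuffle(ids)  # assumed: random assignment
    if not in_domain:
        # Out-of-domain data is used for testing only.
        return {"train": [], "dev": [], "test": ids[:n_test]}
    return {
        "dev": ids[:n_dev],
        "test": ids[n_dev:n_dev + n_test],
        "train": ids[n_dev + n_test:],
    }


news = split_domain(range(1000), in_domain=True)
reddit = split_domain(range(100), in_domain=False)
```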

Collecting multiple answers

Some questions in CoQA may have multiple valid answers. For example, another answer for Q4 in Figure 2 is A Republican candidate. In order to account for answer variations, we collect three additional answers for all questions in the development and test data.
A question can have several valid answers, so every question in the dev and test sets comes with three additional reference answers.

In the previous example, if the original answer was A Republican Candidate, then the following question Which party does he belong to? would not have occurred in the first place. When we show questions from an existing conversation to new answerers, it is likely they will deviate from the original answers which makes the conversation incoherent. It is thus important to bring them to a common ground with the original answer.
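Take Q4 in Figure 2: if the answer had been A Republican candidate, the later questions, which build on the original answer, would no longer fit and the whole conversation would become incoherent.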

We achieve this by turning the answer collection task into a game of predicting original answers. First, we show a question to a new answerer, and when she answers it, we show the original answer and ask her to verify if her answer matches the original. For the next question, we ask her to guess the original answer and verify again. We repeat this process until the conversation is complete. In our pilot experiment, the human F1 score is increased by 5.4% when we use this verification setup.
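A model is compared against the original answer during training, and human answerers need the same grounding to avoid the incoherence described above. After giving an answer, the answerer is shown the original answer and checks whether theirs matches it; this repeats until the conversation is complete.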

Dataset Analysis

What makes the CoQA dataset conversational compared to existing reading comprehension datasets like SQuAD? How does the conversation flow from one turn to the other? What linguistic phenomena do the questions in CoQA exhibit? We answer these questions below.

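In the questions:

  1. Pronouns (he, him, she, it, they) appear much more often; SQuAD has almost none.
  2. In SQuAD, "what" questions make up almost half of the total. CoQA question types are more varied; for example, did, was, is, and does are all frequent.
  3. CoQA questions are shorter. See Figure 3.
  4. 33% of the answers are abstractive. Since extractive answers are clearly easier for annotators to write, this is higher than the authors expected. Yes/no answers also make up a notable share.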
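
Statistics like the first two points above can be computed with simple counting. The sketch below uses a rough regex tokenizer and the pronoun list from point 1; the function names and the four sample questions are made up for illustration and are not taken from either dataset.

```python
import re
from collections import Counter

PRONOUNS = {"he", "him", "she", "it", "they"}


def tokens(question: str) -> list:
    """Lowercase alphabetic tokens, ignoring punctuation."""
    return re.findall(r"[a-z']+", question.lower())


def first_word_distribution(questions) -> dict:
    """Share of questions starting with each word."""
    firsts = Counter(tokens(q)[0] for q in questions if tokens(q))
    total = sum(firsts.values())
    return {word: count / total for word, count in firsts.items()}


def pronoun_rate(questions) -> float:
    """Fraction of questions that contain at least one listed pronoun."""
    if not questions:
        return 0.0
    hits = sum(1 for q in questions if PRONOUNS & set(tokens(q)))
    return hits / len(questions)


sample = ["What did she do?", "Who?", "Was it red?", "What color?"]
dist = first_word_distribution(sample)
rate = pronoun_rate(sample)
```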

Conversation Flow

A coherent conversation must have smooth transitions between turns.
A good conversation leads somewhere: each turn digs further into the information in the passage.

The authors split each passage evenly into 10 chunks and analyze how the chunk a turn focuses on changes as the conversation progresses.
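
The chunk analysis amounts to mapping each turn's rationale position to one of 10 equal-sized bins. Here is a minimal sketch, assuming chunks are defined over character offsets and that a turn is assigned by where its rationale starts; the paper may define chunks differently (for example over tokens), and the function names are hypothetical.

```python
def chunk_index(rationale_start: int, passage_length: int, n_chunks: int = 10) -> int:
    """Map a character offset to the index of the chunk that contains it."""
    if passage_length <= 0:
        raise ValueError("passage_length must be positive")
    idx = rationale_start * n_chunks // passage_length
    return min(idx, n_chunks - 1)  # clamp offsets at the very end


def conversation_flow(rationale_starts, passage_length, n_chunks=10):
    """Chunk visited at each turn, in conversation order."""
    return [chunk_index(s, passage_length, n_chunks) for s in rationale_starts]


# Offsets of four turns in a hypothetical 1000-character passage.
flow = conversation_flow([0, 150, 480, 990], passage_length=1000)
```

A conversation that moves steadily through the passage produces a non-decreasing sequence like this one, while jumps back to earlier chunks show up as decreases.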

Linguistic Phenomena

Relationship between a question and its passage:
- Lexical match: the question and the passage share at least one word.
- Paraphrasing: the question shares no words with the passage but restates the rationale in other words and poses it as a question. This typically involves synonymy, antonymy, hypernymy, hyponymy, and negation.
- Pragmatics: answering requires reasoning.

Relationship between a question and its conversation history:
- No coref
- Explicit coref.
- Implicit coref.