Challenge Description
Motivation
A fundamental phenomenon of natural language is the variability of semantic expression, where the same meaning can be expressed by or inferred from different texts. Many natural language processing applications, such as Question Answering (QA), Information Retrieval (IR), Information Extraction (IE), and (multi) document summarization need to model this variability in order to recognize that a particular target meaning can be inferred from different text variants. Even though many applications face similar underlying semantic problems, these problems are usually addressed in an application-oriented manner. Consequently it is difficult to compare, under a generic evaluation framework, semantic methods that were developed within different applications. The PASCAL RTE Challenge introduces textual entailment as a common task and evaluation framework, covering a broad range of semantic-oriented inferences needed for practical applications. This task is therefore suitable for evaluating and comparing semantic-oriented models in a generic manner. Eventually, work on textual entailment may promote the development of general semantic "engines", which will be used across multiple applications.
Textual Entailment
Textual entailment recognition is the task of deciding, given two text fragments, whether the meaning of one text is entailed (can be inferred) from another text (see the Instructions tab for the specific operational definition of textual entailment assumed in the challenge). This task captures generically a broad range of inferences that are relevant for multiple applications. For example, a Question Answering (QA) system has to identify texts that entail the expected answer. Given the question "Who killed Kennedy?", the text "the assassination of Kennedy by Oswald" entails the expected answer "Oswald killed Kennedy". Similarly, in Information Retrieval (IR) the concept denoted by a query expression should be entailed from relevant retrieved documents. In multi-document summarization a redundant sentence or expression, to be omitted from the summary, should be entailed from other expressions in the summary. In Information Extraction (IE) entailment holds between different text variants that express the same target relation. And in Machine Translation evaluation a correct translation should be semantically equivalent to the gold standard translation, and thus both translations have to entail each other. Thus, modeling textual entailment may consolidate and promote broad research on applied semantic inference.
Task Definition
Participants in the evaluation exercise are provided with pairs of small text snippets (one or more sentences in English), which we term Text-Hypothesis (T-H) pairs. Examples were manually tagged for entailment (i.e. whether T entails H or not) by human annotators and will be divided into a Development Set, containing 800 pairs, and a Test Set, containing 800 pairs. Participating systems will have to decide for each T-H pair whether T indeed entails H or not, and results will be compared to the manual gold standard.
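As an illustration of the decision task, the following minimal sketch represents T-H pairs and compares system judgments with the gold standard. It assumes plain accuracy as the comparison; the `Pair` class, its field names, and the `accuracy` function are hypothetical and not part of any official challenge tooling. The two example pairs are taken from elsewhere in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One Text-Hypothesis pair with its gold annotation (hypothetical layout)."""
    pair_id: int
    text: str
    hypothesis: str
    entails: bool  # gold standard: True if T entails H


def accuracy(gold, predictions):
    """Fraction of pairs whose predicted judgment matches the gold annotation.

    `predictions` maps pair_id -> bool (True means "T entails H").
    """
    if not gold:
        raise ValueError("gold standard is empty")
    correct = sum(1 for pair in gold if predictions[pair.pair_id] == pair.entails)
    return correct / len(gold)


gold = [
    Pair(1, "the assassination of Kennedy by Oswald",
         "Oswald killed Kennedy", True),
    Pair(2, "An Afghan interpreter, employed by the United States, was also wounded.",
         "An interpreter worked for Afghanistan.", False),
]

# A toy system that answers YES for every pair.
predictions = {1: True, 2: True}
score = accuracy(gold, predictions)
print(f"accuracy = {score:.2f}")
```

On a set with equal numbers of YES and NO pairs, a system that always answers YES scores 0.5 under this measure, which gives a natural point of comparison.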
The goal of the RTE challenges is to provide opportunities for presenting and comparing possible approaches for modeling textual entailment. In this spirit, we aim at an explorative rather than a competitive setting. While participant results will be reported, there will not be an official ranking of systems. A development set is released first to provide typical examples of the different types of test examples. The test set will be released three weeks prior to the result submission date. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection from corpora or the Web) specifically for the linguistic constructs that are present in the test data, as long as the methodology is general and fully automated, and the cost of running the learning/acquisition procedure at full scale can be reasonably estimated.
Dataset Collection and Application Settings
The dataset of Text-Hypothesis pairs was collected by human annotators. It consists of four subsets, which correspond to typical success and failure settings in different applications (as listed below). Within each application setting the annotators selected both positive entailment examples (annotated YES), where T does entail H, as well as negative examples (annotated NO), where entailment does not hold (50%-50% split). Some T-H examples appear in the Instructions section. H is a (usually short) single sentence, and T consists of one or more sentences, up to a short paragraph, to simulate an even more realistic scenario.
Information Retrieval (IR)
In this application setting, the hypotheses are propositional IR queries, which specify some statement, e.g. "Alzheimer's disease is treated using drugs". The hypotheses were adapted and simplified from standard IR evaluation datasets (TREC and CLEF). Texts (T) that do or do not entail the hypothesis were selected from documents retrieved by different search engines (e.g. Google, Yahoo and Microsoft) for each hypothesis. In this application setting it is assumed that relevant documents (from an IR perspective) must necessarily entail the hypothesis.
Multi-document Summarization (SUM)
In this setting T and H are sentences taken from a news document cluster, a collection of news articles that describe the same news item. Annotators were given output of multi-document summarization systems, including the document clusters and the summary generated for each cluster. The annotators picked sentence pairs with high lexical overlap, preferably where at least one of the sentences was taken from the summary (this sentence usually played the role of T). For positive examples, the hypothesis was simplified by removing sentence parts, until it was fully entailed by T. Negative examples were simplified in the same manner. This process simulates the need of a summarization system to identify information redundancy, which should be avoided in the summary.
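The notion of "high lexical overlap" used when picking sentence pairs can be made concrete in many ways. The sketch below shows one simple possibility: the share of word types in one sentence that also occur in the other. The tokenization, the function names, and the choice of measure are all assumptions for illustration; the annotators' actual selection criterion is not specified here. The example pair is borrowed from the Textual Entailment section above.

```python
import re


def word_types(sentence):
    """Lowercased set of alphanumeric tokens in a sentence (crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def overlap(hypothesis, text):
    """Fraction of the hypothesis word types that also appear in the text."""
    h_types = word_types(hypothesis)
    if not h_types:
        return 0.0
    return len(h_types & word_types(text)) / len(h_types)


score = overlap("Oswald killed Kennedy",
                "the assassination of Kennedy by Oswald")
print(f"overlap = {score:.2f}")
```

High overlap does not by itself indicate entailment, which is why pairs selected this way can yield both positive and negative examples.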
Question Answering (QA)
Annotators were given questions and the corresponding answers returned by QA systems. Transforming a question-answer pair into a text-hypothesis pair consisted of the following stages. First, the annotators picked from the answer passage an answer term of the expected answer type, either a correct or an incorrect one. Then, the annotators turned the question into an affirmative sentence with the answer term "plugged in". These affirmative sentences serve as the hypotheses (H), and the original answer passage serves as the text (T). For example, given the question "Who is Ariel Sharon?" and the answer text "Israel's prime Minister, Ariel Sharon, visited Prague" (T), the question is turned into the statement "Ariel Sharon is the Israeli Prime Minister" (H), producing a positive entailment pair. This process simulates the need of a QA system to verify that the retrieved passage text indeed entails the provided answer.
Information Extraction (IE)
This task is inspired by the Information Extraction (and Relation Extraction) application, adapting the setting to having pairs of texts rather than a text and a structured template. The pairs were generated using different approaches. In the first approach, ACE-2006 relations (the relations proposed in the ACE-2007 RDR task) were taken as templates for hypotheses. Relevant news articles were then collected as texts (T) and corresponding hypotheses were generated manually based on the ACE templates and slot fillers taken from the text. For example, given the ACE relation 'X work for Y' and the text "An Afghan interpreter, employed by the United States, was also wounded." (T), a hypothesis "An interpreter worked for Afghanistan." is created, producing a non-entailing (negative) pair. In the second approach, the MUC-4 annotated dataset was similarly used to create entailing pairs. In the third approach, the outputs of actual IE systems were used to generate entailing and non-entailing pairs. In the fourth approach, new types of hypotheses that may correspond to typical IE relations were manually generated for different sentences in the collected news articles. These processes simulate the need of IE systems to recognize that the given text indeed entails the semantic relation that is expected to hold between the candidate template slot fillers.
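The 50%-50% split between YES and NO examples within each application setting can be checked with a few lines of code. The sketch below is only illustrative: the subset labels ("IR", "SUM", "QA", "IE"), the tuple layout, and the function name are assumptions, and the toy data is invented rather than taken from the released dataset.

```python
from collections import Counter


def yes_fraction_by_task(examples):
    """Return {task: fraction of examples annotated YES} for (task, label) tuples."""
    totals = Counter()
    yes_counts = Counter()
    for task, label in examples:
        totals[task] += 1
        if label == "YES":
            yes_counts[task] += 1
    return {task: yes_counts[task] / totals[task] for task in totals}


# Invented toy annotations: one YES and one NO pair per application setting.
examples = [
    ("IR", "YES"), ("IR", "NO"),
    ("SUM", "YES"), ("SUM", "NO"),
    ("QA", "YES"), ("QA", "NO"),
    ("IE", "YES"), ("IE", "NO"),
]

fractions = yes_fraction_by_task(examples)
for task, fraction in fractions.items():
    print(f"{task}: {fraction:.0%} YES")
```

The same grouping can be reused to report system results per application setting rather than only over the whole set.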
The following sources were used in the preparation of the data:
- Cicero information extraction system, from Language Computer Corporation, provided by Sanda Harabagiu, Andrew Hickl, John Lehman and Paul Aarseth.
- PowerAnswer question answering system, from Language Computer Corporation, provided by Dan Moldovan and Marta Tatu.
- NewsInEssence online multi-document summarization system.
- AnswerBus online question answering system.
- New York University's information extraction system, provided by Ralph Grishman, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University.
- DUC 2005 multi-document summarization dataset, annotated with the Pyramid method, from Columbia University.
- MUC-4 information extraction dataset, from the National Institute of Standards and Technology (NIST).
- TREC-QA question collections, from the National Institute of Standards and Technology (NIST).
- CLEF-QA question collections, from DELOS Network of Excellence for Digital Libraries.
We would like to thank the people and organizations that made these sources available for the challenge.
We would also like to acknowledge the people and organizations involved in creating and annotating the data: Pamela Forner, Cameron Fordyce and Errol Hayman from CELCT; Anselmo Peñas from UNED; Courtenay Hendricks, Annika Hämäläinen and Adam P. Savel from the Butler Hill Group; and Microsoft Research.
Special thanks to Roy Bar-Haim and Idan Szpektor (Bar-Ilan University) for their advice and support throughout the campaign preparation.
Thanks to Valentina Bruseghini for the technical support.
Textual Entailment at NLPZone.org - resources and bibliography