Challenge Description
Motivation
A fundamental
phenomenon of natural language is the variability of semantic
expression, where
the same meaning can be expressed by or inferred from different texts.
Many
natural language processing applications, such as Question Answering
(QA),
Information Retrieval (IR), Information Extraction (IE), and (multi)
document
summarization need to model this variability in order to recognize that
a
particular target meaning can be inferred from different text variants.
Even
though many applications face similar underlying semantic problems,
these
problems are usually addressed in an application-oriented manner.
Consequently, it is difficult to compare, under a generic evaluation framework, semantic methods that were developed within different applications. The PASCAL RTE
Challenge introduces textual entailment as a common task and evaluation
framework, covering a broad range of semantic-oriented inferences
needed for
practical applications. This task is therefore suitable for evaluating
and
comparing semantic-oriented models in a generic manner. Eventually,
work on
textual entailment may promote the development of general semantic
"engines", which will be used across multiple applications.
Textual Entailment
Textual entailment
recognition is the task of deciding, given two text fragments, whether
the
meaning of one text is entailed (can be inferred) from another text
(see the Instructions tab for the specific
operational definition of
textual entailment assumed in the challenge). This task generically captures a
broad range of inferences that are relevant for multiple applications.
For
example, a Question Answering (QA) system has to identify texts that
entail the
expected answer. Given the question "Who killed Kennedy?", the text
"the assassination of Kennedy by Oswald" entails the expected answer
"Oswald killed Kennedy". Similarly, in Information Retrieval (IR) the
concept denoted by a query expression should be entailed from relevant
retrieved documents. In multi-document summarization a redundant
sentence or
expression, to be omitted from the summary, should be entailed from
other
expressions in the summary. In Information Extraction (IE) entailment
holds
between different text variants that express the same target relation.
And in
Machine Translation evaluation a correct translation should be
semantically
equivalent to the gold standard translation, and thus both translations
have to
entail each other. Thus, modeling textual entailment may consolidate
and
promote broad research on applied semantic inference.
Task Definition
Participants in the evaluation exercise are provided with pairs of small text snippets (one or more sentences in English), which we term Text-Hypothesis (T-H) pairs. The examples were manually tagged for entailment (i.e. whether T entails H or not) by human annotators and will be divided into a Development Set, containing 800 pairs, and a Test Set, containing 800 pairs. Participating systems will have to decide for each T-H pair whether T indeed entails H or not, and the results will be compared to the manual gold standard.
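The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration and not the challenge's official scorer: the `Pair` class, its field names, and the `accuracy` helper are assumptions introduced here, and the `always_yes` baseline is purely hypothetical. The two sample pairs are taken from examples on this page.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Pair:
    """One Text-Hypothesis pair with its gold annotation (hypothetical layout)."""
    text: str         # T: one or more sentences
    hypothesis: str   # H: the statement that may or may not be entailed by T
    entails: bool     # gold standard: True for YES, False for NO


def accuracy(pairs: Iterable[Pair], judge: Callable[[str, str], bool]) -> float:
    """Fraction of pairs on which the system's decision matches the gold label."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(judge(p.text, p.hypothesis) == p.entails for p in pairs)
    return correct / len(pairs)


def always_yes(text: str, hypothesis: str) -> bool:
    """Trivial baseline (an assumption for illustration): always answer YES."""
    return True


sample = [
    Pair("the assassination of Kennedy by Oswald", "Oswald killed Kennedy", True),
    Pair("An Afghan interpreter, employed by the United States, was also wounded.",
         "An interpreter worked for Afghanistan.", False),
]
score = accuracy(sample, always_yes)  # 0.5: one of the two decisions is correct
```

Any real system would replace `always_yes` with its own decision function taking T and H and returning a YES/NO judgment.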
The goal of the RTE
challenges is to provide opportunities for presenting and comparing
possible
approaches for modeling textual entailment. In this spirit, we aim at
an
explorative rather than a competitive setting. While participant results will be reported, there will not be an official ranking of systems. A
development set
is released first to provide typical examples of the different types of
test
examples. The test set will be released three weeks prior to the result
submission date. We regard it as acceptable to run automatic knowledge
acquisition methods (such as synonym collection from corpora or the
Web) specifically
for the linguistic constructs that are present in the test data, as
long as the
methodology is general and fully automated, and the cost of running the
learning/acquisition procedure at full scale can be reasonably
estimated.
Dataset Collection and Application Settings
The dataset of
Text-Hypothesis pairs was collected by human annotators. It consists of
four
subsets, which correspond to typical success and failure settings in
different
applications (as listed below). Within each application setting the
annotators
selected both positive entailment examples (annotated YES), where T
does
entail H, as well as negative examples (annotated NO), where
entailment
does not hold (50%-50% split). Some T-H examples appear in the
Instructions section. H is a (usually short) single sentence,
and T
consists of one or two sentences.
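As a rough sketch, the 50%-50% split within each application setting could be checked as follows. The `(setting, label)` tuple layout and the short setting names used in the toy data are assumptions made for illustration; the actual distribution format is not described here.

```python
from collections import Counter, defaultdict


def label_counts(examples):
    """Count YES/NO labels per application setting.

    `examples` is an iterable of (setting, label) tuples (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for setting, label in examples:
        counts[setting][label] += 1
    return counts


def unbalanced_settings(counts):
    """Return the settings whose YES and NO counts differ, sorted by name."""
    return sorted(s for s, c in counts.items() if c["YES"] != c["NO"])


# Toy data, not taken from the real dataset.
toy = [("IR", "YES"), ("IR", "NO"), ("QA", "YES"), ("QA", "NO"), ("QA", "YES")]
counts = label_counts(toy)
skewed = unbalanced_settings(counts)  # ["QA"]: two YES against one NO
```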
One of the main goals
for the RTE-2 dataset is to provide more realistic text-hypothesis
examples,
originating from actual applications. Therefore, the examples are based
mostly
on outputs of existing web-based systems (see Acknowledgments below).
We allowed only minimal correction of texts extracted from the web, e.g. fixing spelling and punctuation but not style. As a result, the English of some of the pairs is less than perfect.
In the Information Retrieval (IR) application setting, the hypotheses are propositional IR queries, which specify some statement, e.g. "Alzheimer's disease is treated using drugs". The
hypotheses were adapted and simplified from standard IR evaluation
datasets
(TREC and CLEF). Texts (T) that do or do not entail the
hypothesis were
selected from documents retrieved by different search engines (e.g.
Google,
Yahoo and Microsoft) for each hypothesis. In this application setting
it is
assumed that relevant documents (from an IR perspective) must
necessarily
entail the hypothesis.
In the multi-document summarization setting, T and H are sentences taken from a news document cluster, a collection of news articles that describe the same news item. Annotators were given the output of
multi-document summarization systems, including the document clusters
and the
summary generated for each cluster. The annotators picked sentence
pairs with
high lexical overlap, preferably where at least one of the sentences
was taken
from the summary (this sentence usually played the role of T).
For
positive examples, the hypothesis was simplified by removing sentence
parts,
until it was fully entailed by T. Negative examples were
simplified in
the same manner. This process simulates the need of a summarization
system to
identify information redundancy, which should be avoided in the summary.
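The annotators selected high-overlap sentence pairs by hand, and the page does not say how overlap was judged. As a rough automated analogue, the sketch below ranks candidate pairs by Jaccard overlap of their word sets. The tokenizer, the Jaccard measure, the `threshold` value, and all function names are assumptions chosen for illustration.

```python
import re


def tokens(sentence):
    """Lowercased set of word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def word_overlap(a, b):
    """Jaccard overlap between the word sets of two sentences, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def high_overlap_pairs(summary_sentences, cluster_sentences, threshold=0.5):
    """Pair each summary sentence with the cluster sentences it overlaps with.

    Returns (summary_sentence, cluster_sentence, overlap) tuples whose overlap
    is at least `threshold`, highest overlap first. Identical sentences are
    skipped because they make trivial pairs.
    """
    pairs = []
    for s in summary_sentences:
        for c in cluster_sentences:
            if s == c:
                continue
            overlap = word_overlap(s, c)
            if overlap >= threshold:
                pairs.append((s, c, overlap))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

A ranking like this would only propose candidates. Deciding whether one sentence actually entails the other, and simplifying the hypothesis, remains the manual annotation step described above.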
In the Question Answering (QA) setting, annotators were given questions and the corresponding answers returned by QA systems. Transforming a question-answer pair into a text-hypothesis pair consisted of the following stages:
First, the annotators picked from the answer passage an answer term of
the
expected answer type, either a correct or an incorrect one. Then, the
annotators turned the question into an affirmative sentence with the
answer
term "plugged in". These affirmative sentences serve as the
hypotheses (H), and the original answer passage serves as the
text (T).
For example, given the question, "Who is Ariel Sharon?" and an
answer text "Israel's prime Minister, Ariel Sharon, visited Prague"
(T), the question is turned into the statement "Ariel
Sharon is the Israeli Prime Minister" (H), producing a positive
entailment pair. This process simulates the need of a QA system to
verify that
the retrieved passage text indeed entails the provided answer.
This setting is inspired by the Information Extraction (and Relation Extraction) application, adapted to use pairs of texts rather than a text and a structured template. The pairs were generated using four different approaches. In the first approach, ACE-2004 relations
(the
relations tested in the ACE-2004 RDR task) were taken as templates for
hypotheses. Relevant news articles were then collected as texts (T)
and
corresponding hypotheses were generated manually based on the ACE
templates and
slot fillers taken from the text. For example, given the ACE relation 'X
work for Y' and the text "An Afghan interpreter, employed by
the
United States, was also wounded." (T), a hypothesis
"An interpreter worked for Afghanistan." is created, producing
a non-entailing (negative) pair. In the second approach, the MUC-4
annotated
dataset was similarly used to create entailing pairs. In the third
approach,
the outputs of actual IE systems, for both the MUC-4 dataset and the
news
articles collected for the ACE relations, were used to generate
entailing and
non-entailing pairs. In the fourth approach, new types of hypotheses
that may
correspond to typical IE relations were manually generated for
different
sentences in the collected news articles. These processes simulate the
need of
IE systems to recognize that the given text indeed entails the semantic
relation that is expected to hold between the candidate template slot
fillers.
Bar-Ilan University, Israel (Coordinator): Ido Dagan, Roy Bar-Haim, Idan Szpektor
Microsoft Research, USA: Bill Dolan
MITRE, USA: Lisa Ferro
CELCT, Trento - Italy: Bernardo Magnini, Danilo Giampiccolo
Acknowledgments
The following sources were used in the preparation of the data:
· AnswerBus question answering system, provided by Zhiping Zheng, Computational Linguistics Department, Saarland University.
· PowerAnswer question answering system, from Language Computer Corporation, provided by Dan Moldovan, Abraham Fowler, Christine Clark, Arthur Dexter and Justin Larrabee.
· Columbia NewsBlaster multi-document summarization system, from the Natural Language Processing group at Columbia University's Department of Computer Science.
· NewsInEssence multi-document summarization system, provided by Dragomir R. Radev and Jahna Otterbacher from the Computational Linguistics And Information Retrieval research group, University of Michigan.
· IBM's information extraction system, provided by Salim Roukos and Nanda Kambhatla, I.B.M. T.J. Watson Research Center.
· New York University's information extraction system, provided by Ralph Grishman, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University.
· ITC-irst's information extraction system, provided by Lorenza Romano, Cognitive and Communication Technologies (TCC) division, ITC-irst, Trento, Italy.
· MUC-4 information extraction dataset, from the National Institute of Standards and Technology (NIST).
· TREC-QA question collections, from the National Institute of Standards and Technology (NIST).
· CLEF-QA question collections, from DELOS Network of Excellence for Digital Libraries.
We
would like to thank the people and organizations that made these
sources available for the challenge. In addition, we thank Oren
Glickman and
Dan Roth for their assistance and advice.
We would also like to
acknowledge the people and organizations involved in creating and
annotating
the data: Malky Rabinowitz, Dana Mills, Ruthie Mandel, Errol Hayman,
Vanessa
Sandrini, Allesandro Valin, Elizabeth Lima, Jeff Stevenson, Amy Muia
and the
Butler Hill Group.
Textual Entailment at NLPZone.org - resources and bibliography