PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Second Recognising Textual Entailment Challenge

1 October 2005 - 10 April 2006.

Note

This section has been updated with information on the submission of system results and reports.

Task Definition

We consider an applied notion of textual entailment, defined as a directional relation between two text fragments, termed T - the entailing text, and H - the entailed text. We say that T entails H if, typically, a human reading T would infer that H is most likely true. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge. Textual entailment recognition is the task of deciding, given T and H, whether T entails H.    

| ID | TEXT | HYPOTHESIS | TASK | ENTAILMENT |
| --- | --- | --- | --- | --- |
| 1 | The drugs that slow down or halt Alzheimer's disease work best the earlier you administer them. | Alzheimer's disease is treated using drugs. | IR | YES |
| 2 | Drew Walker, NHS Tayside's public health director, said: "It is important to stress that this is not a confirmed case of rabies." | A case of rabies was confirmed. | IR | NO |
| 3 | Yoko Ono unveiled a bronze statue of her late husband, John Lennon, to complete the official renaming of England's Liverpool Airport as Liverpool John Lennon Airport. | Yoko Ono is John Lennon's widow. | QA | YES |
| 4 | Arabic, for example, is used densely across North Africa and from the Eastern Mediterranean to the Philippines, as the key language of the Arab world and the primary vehicle of Islam. | Arabic is the primary language of the Philippines. | QA | NO |
| 5 | About two weeks before the trial started, I was in Shapiro's office in Century City. | Shapiro works in Century City. | QA | YES |
| 6 | Meanwhile, in his first interview to a Western print publication since his election as president of Iran earlier this year, Ahmadinejad attacked the "threat" to bring the issue of Iran's nuclear activity to the UN Security Council by the US, France, Britain and Germany. | Ahmadinejad is a citizen of Iran. | IE | YES |

Table 1: Example T-H pairs

Some additional judgment criteria and guidelines are listed below (examples are taken from Table 1):

·         Entailment is a directional relation. The hypothesis must be entailed by the given text, but the text need not be entailed by the hypothesis.

·         The hypothesis must be fully entailed by the text. Judgment would be NO if the hypothesis includes parts that cannot be inferred from the text.

·         Cases in which inference is very probable (but not completely certain) are judged as YES. In example #5 one could claim that although Shapiro's office is in Century City, he never actually goes to his office and works elsewhere. However, this interpretation of T is very unlikely, and so the entailment holds with high probability. On the other hand, annotators were guided to avoid vague examples for which the inference has some positive probability that is not clearly very high.

·         Our definition of entailment allows presupposition of common knowledge, such as: a company has a CEO, a CEO is an employee of the company, an employee is a person, etc. For instance, in example #6, the entailment depends on knowing that the president of a country is also a citizen of that country.   

Data Sets and Format

Both Development and Test sets are formatted as XML files, as follows:

<pair id="id_num" entailment="YES|NO" task="task_acronym">
   <t> the text... </t>
   <h> the hypothesis... </h>
</pair>

Where:

§    each T-H pair appears within a single <pair> element.

§    the element <pair> has the following attributes:

·          id, a unique numeric identifier of the T-H pair.

·          task, the acronym of the application setting from which the pair has been generated (see introduction): 'IR', 'IE', 'QA' or 'SUM'.

·          entailment (in the development set only), the gold standard entailment annotation, being either 'YES' or 'NO'.

§    the element <t> (text) has no attributes, and it may be made up of one or more sentences.

§    the element <h> (hypothesis) has no attributes, and it usually contains one simple sentence.
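The format above can be read with a standard XML parser. The following is a minimal sketch in Python, not an official tool: the `Pair` class, the `parse_pairs` function and the wrapping `<corpus>` root element are assumptions made for illustration (the description above only specifies the `<pair>` element and its children). The `entailment` attribute is treated as optional because it is present in the development set only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pair:
    pair_id: str
    task: str
    text: str
    hypothesis: str
    entailment: Optional[str]  # "YES" or "NO"; None when the attribute is absent


def parse_pairs(xml_string: str) -> List[Pair]:
    """Collect every <pair> element, whatever the enclosing root is called."""
    root = ET.fromstring(xml_string)
    pairs = []
    for node in root.iter("pair"):
        pairs.append(
            Pair(
                pair_id=node.get("id"),
                task=node.get("task"),
                text=(node.findtext("t") or "").strip(),
                hypothesis=(node.findtext("h") or "").strip(),
                entailment=node.get("entailment"),
            )
        )
    return pairs


# Hypothetical sample: the root element name <corpus> is an assumption.
# The first pair is annotated as in the development set; the second has
# no entailment attribute, as in the test set.
SAMPLE = """<corpus>
<pair id="5" entailment="YES" task="QA">
   <t>About two weeks before the trial started, I was in Shapiro's office in Century City.</t>
   <h>Shapiro works in Century City.</h>
</pair>
<pair id="2" task="IR">
   <t>Drew Walker, NHS Tayside's public health director, said: "It is important to stress that this is not a confirmed case of rabies."</t>
   <h>A case of rabies was confirmed.</h>
</pair>
</corpus>"""

if __name__ == "__main__":
    for pair in parse_pairs(SAMPLE):
        print(pair.pair_id, pair.task, pair.entailment, pair.hypothesis)
```

Because the sketch iterates over `<pair>` elements rather than relying on the root name, it should work unchanged if the released files use a different enclosing element.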

The data is split into a development set and a test set, which are released separately. The development set is intended to guide the development and tuning of participating systems. Note that, since the task is unsupervised in nature, the development set is not expected to serve as a main resource for supervised training, given its anecdotal coverage. Rather, it is typically assumed that systems will use generic techniques and resources.

Data preprocessing

Following requests made in the first RTE challenge, this year we preprocessed the text and hypothesis of each pair. The preprocessing includes sentence splitting using MXTERMINATOR (Reynar and Ratnaparkhi, 1997) and dependency parsing using MINIPAR (Lin, 1998). See the "Datasets" tab for further details. Using the preprocessed data is optional, and participants may of course use alternative tools for preprocessing. Note that since the preprocessing is done automatically, it contains some errors. We provide this data as-is, and give no warranty on the quality of the preprocessed data.

Submission

Systems should tag each T-H pair as either YES, predicting that entailment does hold for the pair, or as NO otherwise. As indicated originally, this year partial submissions are not allowed - the submission should cover the whole dataset. 

Systems should be developed based on the development data set. Analyses of the test set (either manual or automatic) should not impact in any way the design and tuning of systems that publish their results on the RTE-2 test set. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection) specifically for the lexical and syntactic constructs that will be present in the test set, as long as the methodology and procedures are general and not tuned specifically for the test data. In any case, participants are asked to report any process that was performed specifically for the test set.

Evaluation Measures

The evaluation of all submitted runs will be automatic. The judgments (classifications) returned by the system will be compared to those manually assigned by the human annotators (the Gold Standard). The percentage of matching judgments will provide the accuracy of the run, i.e. the fraction of correct responses.
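The accuracy measure described above can be sketched as follows. This is an illustration in Python rather than the official scorer: the function name `accuracy` and the dictionary representation (pair id mapped to "YES" or "NO") are assumptions, and the `gold` and `run` values below are made-up illustrative data.

```python
from typing import Dict


def accuracy(gold: Dict[str, str], predicted: Dict[str, str]) -> float:
    """Fraction of pairs whose predicted judgment matches the gold standard."""
    if not gold:
        raise ValueError("the gold standard is empty")
    if set(gold) != set(predicted):
        raise ValueError("the run must cover exactly the pairs in the gold standard")
    matches = sum(1 for pair_id, label in gold.items() if predicted[pair_id] == label)
    return matches / len(gold)


# Illustrative data only: six pairs, three of which are judged correctly.
gold = {"1": "YES", "2": "NO", "3": "YES", "4": "NO", "5": "YES", "6": "YES"}
run = {"1": "NO", "2": "NO", "3": "YES", "4": "YES", "5": "NO", "6": "YES"}

if __name__ == "__main__":
    print(accuracy(gold, run))  # 0.5
```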

As a second measure, an Average Precision measure will be computed. This measure evaluates the ability of systems to rank all the T-H pairs in the test set according to their entailment confidence (in decreasing order from the most certain entailment to the least certain). The more the system is confident that T entails H, the higher the ranking is. A perfect ranking would place all the positive pairs (for which the entailment holds) before all the negative pairs. Average precision is a common evaluation measure for system rankings, and is computed as the average of the system's precision values at all points in the ranked list in which recall increases, that is at all points in the ranked list for which the gold standard annotation is YES (Voorhees and Harman, 1999). More formally, it can be written as follows:

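\[ \text{Average Precision} = \frac{1}{R} \sum_{i=1}^{n} E(i) \cdot \frac{\#\text{correct-up-to-pair-}i}{i} \]

where n is the number of pairs in the test set, R is the total number of positive pairs in the test set, E(i) is 1 if the i-th pair is positive and 0 otherwise, #correct-up-to-pair-i is the number of positive pairs among the first i pairs of the ranked list, and i ranges over the pairs, ordered by their ranking.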

Note the difference between this measure and the Confidence Weighted Score used in the first challenge.

This score will be computed for systems that provide as output a confidence-ranked list of all test examples (in addition to the YES/NO output for each example).
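The Average Precision measure defined above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the official scorer: the function name `average_precision` is hypothetical, and the input is assumed to be the gold standard annotation of each pair (True for YES, False for NO), listed in the order in which the system ranked the pairs.

```python
from typing import Sequence


def average_precision(ranked_gold: Sequence[bool]) -> float:
    """Average of the precision values at every rank that holds a positive pair.

    ranked_gold[i - 1] is True when the pair ranked i-th by the system is
    positive in the gold standard.
    """
    total_positive = sum(ranked_gold)  # R in the formula
    if total_positive == 0:
        raise ValueError("average precision is undefined without positive pairs")
    positives_so_far = 0  # number of positive pairs among the first i pairs
    precision_sum = 0.0
    for i, is_positive in enumerate(ranked_gold, start=1):
        if is_positive:  # E(i) = 1
            positives_so_far += 1
            precision_sum += positives_so_far / i
    return precision_sum / total_positive


if __name__ == "__main__":
    # A perfect ranking places all positive pairs first.
    print(average_precision([True, True, False]))  # 1.0
    # Positives at ranks 1 and 3: (1/1 + 2/3) / 2 = 5/6
    print(average_precision([True, False, True, False]))
```

Note that only the ranking and the gold annotation enter the computation; the system's own YES/NO judgments affect accuracy but not this score.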

Results Submission Format

Results will be submitted in a file with one line for each T-H pair in the test set, in the following format:

 pair_id<blank space>judgment

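where pair_id is the identifier of the pair (the id attribute in the data set) and judgment is either YES or NO.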

The first line in the file should look as follows:

ranked:<blank space>yes/no

The first line indicates whether the submission includes confidence ranking of the pairs (see evaluation measures above). Average precision will be computed only for systems that specify "ranked: yes" in the first line. If the submission includes confidence ranking, the pairs in the file should be listed in order of decreasing entailment confidence: the first pair should be the one for which the entailment is most certain, and the last pair should be the one for which the entailment is least likely (i.e. the one for which the judgment "NO" is the most certain). Thus, in a ranked list all the positively classified pairs are expected to appear before all those that were classified as negative.

For example, suppose that the pair identifiers in the test set are 1...6. A valid submission file that includes ranking may look as follows:

ranked: yes
4 YES
3 YES
6 YES
1 NO
5 NO
2 NO
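A submission file in this format can be checked before it is sent. The sketch below is one possible Python check, not an official validator: the function names `parse_submission` and `check_coverage` are hypothetical, and accepting the header case-insensitively and skipping blank lines are lenient assumptions rather than stated rules.

```python
from typing import Iterable, List, Tuple


def parse_submission(content: str) -> Tuple[bool, List[Tuple[str, str]]]:
    """Return (ranked, [(pair_id, judgment), ...]) in file order."""
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty submission")
    header = " ".join(lines[0].lower().split())
    if header not in ("ranked: yes", "ranked: no"):
        raise ValueError("first line must be 'ranked: yes' or 'ranked: no'")
    judgments = []
    for line in lines[1:]:
        parts = line.split()
        if len(parts) != 2 or parts[1] not in ("YES", "NO"):
            raise ValueError(f"malformed line: {line!r}")
        judgments.append((parts[0], parts[1]))
    return header == "ranked: yes", judgments


def check_coverage(judgments: List[Tuple[str, str]], expected_ids: Iterable[str]) -> None:
    """Partial submissions are not allowed: every pair must appear exactly once."""
    ids = [pair_id for pair_id, _ in judgments]
    if len(set(ids)) != len(ids):
        raise ValueError("a pair id appears more than once")
    if set(ids) != set(expected_ids):
        raise ValueError("the submission does not cover the whole data set")


EXAMPLE = """ranked: yes
4 YES
3 YES
6 YES
1 NO
5 NO
2 NO
"""

if __name__ == "__main__":
    ranked, judgments = parse_submission(EXAMPLE)
    check_coverage(judgments, [str(i) for i in range(1, 7)])
    print(ranked, judgments[0])  # True ('4', 'YES')
```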

Participating teams will be allowed to submit results of up to 2 systems. The corresponding result files should be named run1.txt and, if a second run is submitted, run2.txt.

The results files should be zipped and submitted via the submit form.

System Reports

Participants are requested to submit a full system report by February 21 (the schedule is tight this year because of the EACL 2006 conference in Trento, scheduled right before the RTE-2 Workshop). Reports should be up to 6 double-column pages, using the ACL style files and guidelines. The reports should include a system description, quantitative results on the test set, and qualitative and quantitative analysis of the results and system behavior. Reviews with comments for the camera-ready version and decisions about presentation at the workshop will be sent back to the authors in early March.

Reports are to be sent by email to Bernardo Magnini (magnini@itc.it) with the following subject line: "RTE2 report submission".

We aim to have an interactive and informative workshop setting, which will enable exploring the space of the entailment phenomenon and alternative methods to approach it. Therefore we encourage, apart from the straightforward system description, some analysis of results and general observations you might have about the entailment recognition problem. We strongly believe that to understand and interpret the results of a system the plain score is not sufficiently informative. In particular, we advocate including:

·         General observations and analysis of the entailment phenomena, data types, problems, etc.

·         Analysis of system performance - analysis and characterization of success and failure cases, identifying inherent difficulties and limitations of the system vs. its strengths.

·         Description of the types of features used by the system; getting a feeling (with examples) for the concrete features that actually made an impact.

·         Noting whether there is a difference in performance between the development and test sets, and identifying the reasons: was there overfitting?

·         If your system (as described in the report) is complex - identifying which parts of the system were eventually effective vs. parts that were not crucial or even introduced noise, illustrating the different behaviors through examples.

·         In case the development set was used for learning detailed information (like entailment rules) - discussing whether the method is scalable.

·         Providing illustrative examples for system behavior.

·         Providing results also for the development set.

 

Camera ready report:

The camera-ready version of the report, to be included in the workshop proceedings, should be submitted in PDF format (with no page numbers) by March 14.

References

D. Lin. 1998. Dependency-based evaluation of MINIPAR. In Proceedings of the Workshop on Evaluation of Parsing Systems at LREC 1998. Granada, Spain.

J. C. Reynar and A. Ratnaparkhi. 1997. A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, March 31-April 3. Washington, D.C.

E. M. Voorhees and D. Harman. 1999. Overview of the Seventh Text Retrieval Conference. In Proceedings of the Seventh Text Retrieval Conference (TREC-7). NIST Special Publication.