We consider an applied notion of textual entailment, defined as a directional relation between two text fragments, termed T - the entailing text, and H - the entailed text. We say that T entails H if, typically, a human reading T would infer that H is most likely true. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge. Textual entailment recognition is the task of deciding, given T and H, whether T entails H.
| ID | TEXT | HYPOTHESIS | TASK | ENTAILMENT |
| 1 | The drugs that slow down or halt Alzheimer's disease work best the earlier you administer them. | Alzheimer's disease is treated using drugs. | IR | YES |
| 2 | Drew Walker, NHS Tayside's public health director, said: "It is important to stress that this is not a confirmed case of rabies." | A case of rabies was confirmed. | IR | NO |
| 3 | Yoko Ono unveiled a bronze statue of her late husband, John Lennon, to complete the official renaming of England's Liverpool Airport as Liverpool John Lennon Airport. | Yoko Ono is John Lennon's widow. | QA | YES |
| 4 | Arabic, for example, is used densely across North Africa and from the Eastern Mediterranean to the Philippines, as the key language of the Arab world and the primary vehicle of Islam. | Arabic is the primary language of the Philippines. | QA | NO |
| 5 | About two weeks before the trial started, I was in Shapiro's office in Century City. | Shapiro works in Century City. | QA | YES |
| 6 | Meanwhile, in his first interview to a Western print publication since his election as president of Iran earlier this year, Ahmadinejad attacked the "threat" to bring the issue of Iran's nuclear activity to the UN Security Council by the US, France, Britain and Germany. | Ahmadinejad is a citizen of Iran. | IE | YES |
| 7 | The flights begin at San Diego's Lindbergh Field in April, 2002 and follow the Lone Eagle's 1927 flight plan to St. Louis, New York, and Paris | Lindbergh began his flight from Paris to New York in 2002. | QA | NO |
| 8 | The world will never forget the epic flight of Charles Lindbergh across the Atlantic from New York to Paris in May 1927, a feat still regarded as one of the greatest in aviation history. | Lindbergh began his flight from New York to Paris in 1927. | QA | YES |
| 9 | Medical science indicates increased risks of tumors, cancer, genetic damage and other health problems from the use of cell phones. | Cell phones pose health risks. | IR | YES |
| 10 | The available scientific reports do not show that any health problems are associated with the use of wireless phones. | Cell phones pose health risks. | IR | NO |
Table 1: Example T-H pairs
Some additional judgment criteria and guidelines are listed below (examples are taken from Table 1):
Both Development and Test sets are formatted as XML files, as follows:
<pair id="id_num" entailment="YES|NO" task="IE|IR|QA|SUM" length="short|long">
<t>the text...</t>
<h>the hypothesis...</h>
</pair>
Where:
The data is split into a development set and a test set, to be released separately. The goal of the development set is to guide the development and tuning of participating systems. Note that, because the task is unsupervised in nature, the development set is not expected to serve as a main resource for supervised training, given its anecdotal coverage. Rather, it is typically assumed that systems will use generic techniques and resources.
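As a minimal sketch of how the XML pair format shown above might be read, the following Python snippet parses `<pair>` elements with the standard library. The root element name `pairs`, the `length` values, and the function name `read_pairs` are assumptions made for illustration; only the `<pair>`, `<t>` and `<h>` elements and their attributes come from the format description, and the two sample pairs are taken from Table 1.

```python
import xml.etree.ElementTree as ET

# Illustrative sample. The root element name "pairs" and the length values
# are assumptions; the pair/t/h structure follows the format shown above.
SAMPLE = """<pairs>
<pair id="1" entailment="YES" task="IR" length="short">
<t>The drugs that slow down or halt Alzheimer's disease work best the earlier you administer them.</t>
<h>Alzheimer's disease is treated using drugs.</h>
</pair>
<pair id="2" entailment="NO" task="IR" length="short">
<t>Drew Walker, NHS Tayside's public health director, said: "It is important to stress that this is not a confirmed case of rabies."</t>
<h>A case of rabies was confirmed.</h>
</pair>
</pairs>"""


def read_pairs(xml_text):
    """Return one dict per <pair> element found anywhere under the root."""
    root = ET.fromstring(xml_text)
    pairs = []
    for pair in root.iter("pair"):
        pairs.append({
            "id": pair.get("id"),
            "entailment": pair.get("entailment"),  # None if the attribute is absent
            "task": pair.get("task"),
            "length": pair.get("length"),
            "text": (pair.findtext("t") or "").strip(),
            "hypothesis": (pair.findtext("h") or "").strip(),
        })
    return pairs


pairs = read_pairs(SAMPLE)
print(len(pairs), pairs[0]["task"], pairs[1]["entailment"])
```

Using `root.iter("pair")` avoids depending on the name of the enclosing element, which is not specified here.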
Systems must tag each T-H pair as either "YES", predicting that entailment does hold for the pair, or as "NO" otherwise. No partial submissions are allowed, i.e. the submission must cover the whole dataset.
Systems should be developed based on the development data set. Analyses of the test set (either manual or automatic) should not influence in any way the design and tuning of systems that publish their results on the RTE-3 test set. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection) specifically for the lexical and syntactic constructs that will be present in the test set, as long as the methodology and procedures are general and not tuned specifically for the test data. In any case, participants are asked to report any process that was performed specifically for the test set.
The evaluation of all submitted runs will be automatic. The judgments (classifications) returned by the system will be compared to those manually assigned by the human annotators (the Gold Standard). The percentage of matching judgments will provide the accuracy of the run, i.e. the fraction of correct responses.
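The accuracy measure described above can be sketched in a few lines of Python. This is an illustration and not the official scorer: the function name `accuracy` and the dictionary layout are assumptions, and the system run below is hypothetical (the gold labels are those of pairs 1 to 4 in Table 1).

```python
def accuracy(gold, predicted):
    """Fraction of pairs whose predicted label matches the gold label.

    gold, predicted: dicts mapping pair id -> "YES" or "NO".
    """
    # Partial submissions are not allowed, so both sides must cover the same pairs.
    if set(gold) != set(predicted):
        raise ValueError("the run must cover exactly the pairs in the dataset")
    correct = sum(1 for pair_id in gold if gold[pair_id] == predicted[pair_id])
    return correct / len(gold)


gold = {"1": "YES", "2": "NO", "3": "YES", "4": "NO"}
run = {"1": "YES", "2": "YES", "3": "YES", "4": "NO"}  # hypothetical system output
print(accuracy(gold, run))  # 0.75
```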
As a second measure, Average Precision will be computed. This measure evaluates the ability of systems to rank all the T-H pairs in the test set according to their entailment confidence (in decreasing order from the most certain entailment to the least certain). The more confident the system is that T entails H, the higher the pair is ranked. A perfect ranking would place all the positive pairs (for which the entailment holds) before all the negative pairs. Average precision is a common evaluation measure for system rankings, and is computed as the average of the system's precision values at all points in the ranked list at which recall increases, that is, at all points in the ranked list for which the gold standard annotation is YES. More formally, it can be written as follows:
$$\frac{1}{R} \sum_{i=1}^{n} E(i) \cdot \frac{\#\text{-entailing-up-to-pair-}i}{i}$$
where n is the number of pairs in the test set, R is the total number of positive pairs in the test set, E(i) is 1 if the i-th pair is positive and 0 otherwise, and i ranges over the pairs, ordered by their ranking. This score will be computed for systems that provide as output a confidence-ranked list of all test examples (in addition to the YES/NO output for each example).
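The formula above can be sketched in Python as follows. The function name `average_precision` and the choice to return 0.0 when there are no positive pairs are assumptions made for this illustration; the official scorer may handle that case differently.

```python
def average_precision(ranked_gold):
    """Average precision of a ranked list.

    ranked_gold: gold labels ("YES"/"NO") listed in the system's ranking,
    from the most confident entailment to the least confident one.
    """
    # R: total number of positive pairs in the test set.
    R = sum(1 for label in ranked_gold if label == "YES")
    if R == 0:
        return 0.0  # assumption: the measure is undefined without positive pairs

    entailing_so_far = 0  # number of positive pairs among the first i pairs
    total = 0.0
    for i, label in enumerate(ranked_gold, start=1):
        if label == "YES":  # E(i) = 1, so this position contributes precision at i
            entailing_so_far += 1
            total += entailing_so_far / i
    return total / R


# A perfect ranking places every positive pair before every negative pair.
print(average_precision(["YES", "YES", "NO"]))  # 1.0
print(average_precision(["YES", "NO", "YES", "YES", "NO"]))
```

In the second call, precision is taken at ranks 1, 3 and 4, giving (1/1 + 2/3 + 3/4) / 3.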
Results will be submitted in a file with one line for each T-H pair in the test set, in the following format:
pair_id<blank space>judgment
where
The first line in the file should look as follows:
ranked:<blank space>"YES|NO"
The first line indicates whether the submission includes confidence ranking of the pairs (see evaluation measures above). Average precision will be computed only for systems that specify "ranked: YES" in the first line. If the submission includes confidence ranking, the pairs in the file should be listed in order of decreasing entailment confidence: the first pair should be the one for which the entailment is most certain, and the last pair should be the one for which the entailment is least likely (i.e. the one for which the "NO" judgment is the most certain). Thus, in a ranked list all the positively classified pairs are expected to appear before all those that are classified as negative.
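A ranked result file in the format described above could be produced along the following lines. This is a sketch under stated assumptions: the function name `format_ranked_run`, the use of a numeric confidence score per pair, and the 0.5 decision threshold are all hypothetical and are not prescribed by these guidelines.

```python
def format_ranked_run(scored_pairs, threshold=0.5):
    """Build the contents of a ranked result file.

    scored_pairs: list of (pair_id, entailment_confidence) tuples, where a
    higher score means the system is more confident that T entails H.
    threshold: hypothetical cut-off; scores at or above it are tagged YES.
    """
    # Most certain entailment first, least likely entailment last.
    ordered = sorted(scored_pairs, key=lambda item: item[1], reverse=True)

    lines = ["ranked: YES"]  # header line announcing a confidence ranking
    for pair_id, confidence in ordered:
        judgment = "YES" if confidence >= threshold else "NO"
        lines.append(f"{pair_id} {judgment}")
    return "\n".join(lines) + "\n"


scores = [("1", 0.9), ("2", 0.1), ("3", 0.7), ("4", 0.3)]  # hypothetical scores
print(format_ranked_run(scores), end="")
```

Because the judgment is derived from the same score used for sorting, every pair tagged YES necessarily appears before every pair tagged NO.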
Each submitted run must be a plain text file, with a filename composed of a unique 5-6 element alpha-numeric string and the run number, separated by "-", e.g. CLCT07-run1.txt. Participating teams will be allowed to submit 2 result files per system. The corresponding result files should be named XXXXX-run1.txt, and XXXXX-run2.txt if a second run is submitted for the same system.
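A file name can be checked against this naming convention before submission. The sketch below assumes that "5-6 element alpha-numeric string" means 5 or 6 ASCII letters or digits and that the run number is 1 or 2; both readings, and the function name `is_valid_run_name`, are assumptions made for illustration.

```python
import re

# Assumed pattern: 5 or 6 alphanumeric characters, "-run", run number 1 or 2, ".txt".
RUN_NAME = re.compile(r"[A-Za-z0-9]{5,6}-run[12]\.txt")


def is_valid_run_name(filename):
    """Return True if the file name matches the assumed naming pattern."""
    return RUN_NAME.fullmatch(filename) is not None


print(is_valid_run_name("CLCT07-run1.txt"))  # True
print(is_valid_run_name("CLCT07-run3.txt"))  # False
```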
The results files should be zipped and submitted via email to infocelct at itc . it, with the subject line "[RTE3 RESULT SUBMISSION]".
Participants are requested to submit a full system report by March 26, 2007. Note that the paper submission deadline was moved from April 2 to March 26, 2007, because the RTE workshop was merged with the Paraphrasing Workshop; the merger unifies the reviewing process for the two types of papers (system and technical) and makes it more competitive for RTE system reports. As the schedule is quite tight, we suggest preparing a draft of the paper before receiving the results of the system evaluation. Report submissions will follow the same procedure as article submissions for the main workshop (using the START system). Reports must be uploaded by filling out the submission form at the following URL: www.softconf.com/acl07/ACL07-WS9/submit.html. Please remember to select the "RTE3 paper" option in the Submission category field.
Reports should be up to 6 double-column pages long and should use the ACL style files and guidelines. As the reports presented at the workshop are expected to be very informative, and in order to further explore entailment phenomena and alternative methods of approaching them, we suggest including an analysis of results and any general observations you might have about the entailment recognition problem, in addition to the straightforward system description.
Due to workshop time limitations this year, not all system reports will be presented orally. The reviewing of RTE3 system reports will be blind, and the papers should not include the authors' names and affiliations. The best papers will be presented orally, while the remaining papers that reach a sufficient level of quality will be presented as posters. Reviews with comments for the camera-ready version and decisions about presentation in the workshop will be sent back to the authors by April 23, 2007.
The camera ready version of each report, to be included in the workshop proceedings, should be submitted in pdf format (with no page numbers) by May 6, 2007.