Grid@CLEF 2009 - New CLEF 2009 Pilot Track

Posted by Rebecca Martin on Wed, 04/02/2009 - 01:00

Grid@CLEF is an activity of the Cross-Language Evaluation Forum (CLEF), which is launching a new pilot track in the CLEF 2009 campaign. Information about the objectives, the task, the organization, and the subscription procedure follows; for more information and updates, please visit the Grid@CLEF Web site at:


Multilingual information access (MLIA) is increasingly part of many complex systems, such as digital libraries, intranet and enterprise portals, and Web search engines.

The Cross-Language Evaluation Forum (CLEF) research community has been outstanding and very active in designing, developing, and testing MLIA methods and techniques, constantly improving the performance of such components. But is this enough? Do we really know how MLIA components (stop lists, stemmers, IR models, relevance feedback, translation techniques, etc.) behave with respect to languages? Do we have a deep comprehension of how these components interact when the language changes?
Unfortunately, today's picture is quite fragmentary: researchers have mainly focused on specific aspects of multilinguality, and a comprehensive and unifying view is still missing. This situation hinders both the adoption of MLIA techniques and technology transfer to the relevant application and developer communities. Indeed, it is often difficult for people outside the IR community to extract from the specialised scientific literature indications about the most promising approaches and solutions.

We are thus launching a cooperative effort where a series of large-scale and systematic grid experiments will allow us to improve our comprehension of MLIA systems and gain an exhaustive picture of their behaviour with respect to languages. In this way, we can exploit the valuable resources and experimental collections made available by CLEF over the years to gain more insight into the effectiveness of the various weighting schemes and retrieval techniques with respect to the languages, and to disseminate this knowledge to the relevant application and developer communities.


This first-year task focuses on *monolingual retrieval*, i.e. querying topics against documents in the same language as the topics, *in five European languages*:

* Dutch;
* English;
* French;
* German;
* Italian.

The selected languages will allow participants to test both Romance and Germanic languages, as well as languages with word compounding issues. Moreover, these languages have been extensively studied in the MLIA field and, therefore, it will be possible to compare and assess the outcomes of the first-year experiments with respect to the existing literature.

The reference scenario for Grid@CLEF 2009 concerns an IR system which consists of:

- a tokenizer component for processing the input document collection and producing a stream of tokens;
- an optional stop list component for removing stop words from the stream of tokens;
- an optional word decompounder component for splitting compound words in the stream of tokens;
- an optional stemmer component for stemming words in the stream of tokens;
- a weighting/scoring engine component for scoring documents against queries and producing an output ranked list.

Instead of directly feeding the next component, as usually happens in a monolithic IR system, the Grid@CLEF task requires each component to read its input from, and write its output to, XML files in a well-defined format. This choice allows participants to exchange these XML files and to build a whole experiment by chaining components that may belong to different IR systems.
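As a rough illustration of this idea, the sketch below chains three toy components that communicate only through XML messages. The message layout (a `tokens` element with a `lang` attribute and `token` children), the function names, and the naive suffix-stripping "stemmer" are all assumptions made for the example; the actual Grid@CLEF XML format is defined in the track's own specification, not here. For brevity the messages are passed as strings rather than written to files.

```python
import xml.etree.ElementTree as ET

# Hypothetical message format, invented for this sketch:
#   <tokens lang="en"><token>grid</token>...</tokens>
# It is NOT the official Grid@CLEF schema.


def tokens_to_xml(tokens, lang):
    """Serialise a token stream into the (assumed) XML message format."""
    root = ET.Element("tokens", attrib={"lang": lang})
    for tok in tokens:
        ET.SubElement(root, "token").text = tok
    return ET.tostring(root, encoding="unicode")


def xml_to_tokens(xml_text):
    """Parse an XML message back into (token list, language code)."""
    root = ET.fromstring(xml_text)
    return [el.text or "" for el in root.findall("token")], root.get("lang")


def tokenizer(text, lang):
    """Tokenizer component: raw text in, XML token stream out."""
    return tokens_to_xml(text.lower().split(), lang)


def stop_list(xml_text, stop_words):
    """Stop list component: XML in, XML out, stop words removed."""
    tokens, lang = xml_to_tokens(xml_text)
    return tokens_to_xml([t for t in tokens if t not in stop_words], lang)


def stemmer(xml_text):
    """Stemmer component: a toy rule that strips a trailing 's'.

    A placeholder for a real stemmer, used only to show the chaining.
    """
    tokens, lang = xml_to_tokens(xml_text)
    stemmed = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens_to_xml(stemmed, lang)


def run_chain(text, lang, stop_words):
    """Chain the components; each step sees only the previous XML message."""
    message = tokenizer(text, lang)
    message = stop_list(message, stop_words)
    message = stemmer(message)
    return message


if __name__ == "__main__":
    print(run_chain("The grids of experiments", "en", {"the", "of"}))
```

Because every component only consumes and produces a message, any one of them could in principle be swapped for another participant's implementation without touching the rest of the chain.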

Therefore, the Grid@CLEF 2009 track has a twofold goal:

1. to prepare participants' systems to work according to this new framework based on the exchange of well-defined XML messages;
2. to conduct as many experiments as possible, i.e. to put as many dots as possible on the grid, according to this new framework.
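One way to picture "dots on the grid" is to enumerate every combination of language and optional component. The sketch below does this under the assumption that a grid point is a language plus the set of enabled optional components; a real grid would presumably also vary the implementation of each component and the weighting/scoring model, which are omitted here. The names `LANGUAGES`, `OPTIONAL_COMPONENTS`, and `grid_points` are hypothetical.

```python
from itertools import product

# The five first-year languages and the optional components of the
# reference scenario. Tokenizer and scoring engine are always present.
LANGUAGES = ["Dutch", "English", "French", "German", "Italian"]
OPTIONAL_COMPONENTS = ["stop_list", "decompounder", "stemmer"]


def grid_points(languages, optional_components):
    """List every (language, enabled components) pair as one grid point."""
    points = []
    for lang in languages:
        # Each optional component is either switched off or on.
        for flags in product([False, True], repeat=len(optional_components)):
            enabled = tuple(
                name for name, on in zip(optional_components, flags) if on
            )
            points.append((lang, enabled))
    return points


if __name__ == "__main__":
    points = grid_points(LANGUAGES, OPTIONAL_COMPONENTS)
    print(len(points), "grid points")
    for lang, enabled in points[:3]:
        print(lang, enabled)
```

Even this simplified view yields 5 x 2^3 = 40 configurations, which suggests why a cooperative effort is needed to cover the grid.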

To facilitate participation in this first-year task, participants are required to take part in what we call the *island mode*, where all the components that constitute the IR system of the reference scenario are developed and run by the same participant. Each participant is only requested to implement the XML messaging format for each of their own components and to publish all the intermediate results of these components on the online XML messaging exchange system.

*Participating in the Grid@CLEF 2009 pilot track is easy: you only need to join the island mode and produce as many experiments as possible.*


The tentative schedule for the Grid@CLEF 2009 track is as follows:

* Topics and collections release: early March 2009;
* XML messaging framework specification release: early April 2009;
* XML messaging exchange online system release: early May 2009;
* Experiment submission: mid June 2009;
* Results computation: early July 2009;
* Working note papers: mid August 2009;
* CLEF 2009 Workshop: from 30 September to 2 October 2009 in Corfu.

*Track Coordinators*

* Nicola Ferro, University of Padua, Italy - ferro (at)
* Donna Harman, National Institute of Standards and Technology (NIST), USA - donna.harman (at)

*Advisory Committee*

* Chris Buckley, Sabir Research, USA;
* Fredric Gey, University of California at Berkeley, USA;
* Kalervo Järvelin, University of Tampere, Finland;
* Noriko Kando, National Institute of Informatics (NII), Japan;
* Craig Macdonald, University of Glasgow, UK;
* Prasenjit Majumder, Indian Statistical Institute, Kolkata, India;
* Paul McNamee, Johns Hopkins University, USA;
* Teruko Mitamura, Carnegie Mellon University, USA;
* Mandar Mitra, Indian Statistical Institute, Kolkata, India;
* Stephen Robertson, Microsoft Research Cambridge and City University London, UK;
* Jacques Savoy, University of Neuchâtel, Switzerland.


Registration for CLEF 2009 and subscription to the Grid@CLEF 2009 pilot track open *4 February*. You can find more information on the main CLEF Web site at:

under "CLEF 2009".