2nd Workshop on Applications of Pattern Analysis
19 - 21 October 2011, CIEM

Schedule

Wednesday 19 October 2011

10h: Opening and announcements

Open text analysis

10h15 - 11h: Tom Diethe: Medical Text Mining

11h - 11h30: Break

11h30 - 12h15: Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Ricard Gavalda: Detecting Sentiment Change in Twitter Streaming Data

12h15 - 13h: Jose M. Carmona-Cejudo, Manuel Baena-Garcia, Jose del Campo-Avila, Rafael Morales-Bueno, Joao Gama, Albert Bifet: Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification

13h - 15h30: Lunch break

Classification

15h30 - 16h15: Jesse Read, Albert Bifet, Geoff Holmes, Bernhard Pfahringer: Streaming Multi-label Classification

16h15 - 17h: Diego Garcia-Saiz, Marta Zorrilla: Comparing classification methods for predicting distance students' performance

17h - 17h30: Break

17h30 - 18h30: Open problems and informal discussions

Thursday 20 October 2011

Images and Vision

10h - 10h45: Benjamin X. Hall, John Shawe-Taylor and Alan Johnston: Employing The Complete Face in AVSR to Recover from Facial Occlusions

10h45 - 11h30: Vassilios Stathopoulos, Joemon M. Jose: Bayesian Probabilistic Models for Image Retrieval

11h30 - 12h: Break

12h - 13h: Invited Talk - Shai Ben-David: Utilizing unlabeled and weakly labeled samples in classification learning tasks

13h - 15h: Lunch break

Harvest

15h - 15h15: Jose Luis Balcazar: Presentation of Pascal-2 Harvest Session

15h15 - 15h45: (Harvest session: Stark) Tobias Koetter: Intro to KNIME

15h45 - 16h30: (Harvest session: Stark) Jose Luis Balcazar: The yacaree approach to association rules

16h30 - 17h: Break

17h - 17h45: (Harvest session: Stark) Javier de la Dehesa: Implementing yacaree on KNIME

17h45 - 18h30: (Harvest session: Freeling) Xavier Carreras: Treeler: Open-source Structured Prediction for NLP

Friday 21 October 2011

Change tracking

10h - 10h45: Indre Zliobaite, Albert Bifet, Geoff Holmes, Bernhard Pfahringer: MOA Concept Drift Active Learning Strategies for Streaming Data

10h45 - 11h15: Break

11h15 - 12h: David Sanchez, Lluis A. Belanche, Anicet R. Blanch: A Software System for the Microbial Source Tracking Problem

12h - 12h45: Saatviga Sudhahar, Roberto Franzosi, Nello Cristianini: Automating Quantitative Narrative Analysis of News Data

Contributed Talks:

Automating Quantitative Narrative Analysis of News Data

We present a working system for large-scale quantitative narrative analysis (QNA) of news corpora, which combines various recent ideas from text mining and pattern analysis to solve a problem arising in the computational social sciences. The task is to identify the key actors in a body of news, and the actions they perform, so that further analysis can be carried out. This step is normally performed by hand and is very labour intensive. We then characterise the actors by: studying their position in the overall network of actors and actions; studying the time series associated with some of their properties; generating scatter plots describing the subject/object bias of each actor; and investigating the types of actions each actor is most associated with. The system is demonstrated on a set of 100,000 articles about crime that appeared in the New York Times between 1987 and 2007. As an example, we find that Men were most commonly responsible for crimes against the person, while Women and Children were most often victims of those crimes.
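As a rough illustration of the subject/object bias mentioned above, the following Python sketch counts how often each actor appears as the subject or the object of an action. The triplet format, the function name `subject_object_bias`, and the bias formula (s - o) / (s + o) are assumptions made for this example, not the definitions used by the authors, and the triplets are made-up toy data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical input format: (subject, action, object) triplets.
Triplet = Tuple[str, str, str]


def subject_object_bias(triplets: Iterable[Triplet]) -> Dict[str, float]:
    """Return a score in [-1, 1] per actor.

    +1 means the actor only ever appears as a subject,
    -1 means the actor only ever appears as an object.
    The formula (s - o) / (s + o) is an assumption for illustration.
    """
    as_subject: Dict[str, int] = defaultdict(int)
    as_object: Dict[str, int] = defaultdict(int)
    for subject, _action, obj in triplets:
        as_subject[subject] += 1
        as_object[obj] += 1

    bias: Dict[str, float] = {}
    for actor in set(as_subject) | set(as_object):
        s = as_subject.get(actor, 0)
        o = as_object.get(actor, 0)
        bias[actor] = (s - o) / (s + o)
    return bias


# Toy, made-up triplets (not taken from the corpus in the abstract).
toy_triplets = [
    ("men", "attack", "women"),
    ("men", "rob", "children"),
    ("police", "arrest", "men"),
]
toy_bias = subject_object_bias(toy_triplets)
```

Plotting subject counts against object counts for each actor would give one possible form of the scatter plot the abstract refers to.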

Comparing classification methods for predicting distance students' performance

Virtual teaching is constantly growing and, with it, the need for instructors to predict the performance of their students. Different machine learning techniques can be used to meet this need. Although many benchmarks compare their performance and accuracy, very few experiments have been carried out on educational datasets, whose special features distinguish them from other datasets. Therefore, in this work we compare the performance and the interpretability of the output of different classification techniques applied to educational datasets, and we propose a meta-algorithm to preprocess the datasets and improve the accuracy of the model, which virtual instructors will use for decision making through the ElWM tool.

Bayesian Probabilistic Models for Image Retrieval

In this paper we present new probabilistic ranking functions for content-based image retrieval. Our methodology generalises previous approaches and is based on the predictive densities of generative probabilistic models of the image features. We evaluate the proposed methodology and compare it against two state-of-the-art image retrieval systems using a well-known image collection.
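The general idea of ranking images by a density over their features can be sketched in Python as follows. This is only an illustration under strong assumptions: it fits a diagonal Gaussian to each image's feature vectors and uses a plug-in maximum-likelihood density, not a full Bayesian predictive density, and it is not the ranking function proposed in the paper. The names `fit_gaussian`, `log_density` and `rank_images` and the toy feature values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def fit_gaussian(features: Sequence[Vector],
                 min_var: float = 1e-3) -> Tuple[List[float], List[float]]:
    """Fit a diagonal Gaussian (per-dimension mean and variance)."""
    n = len(features)
    dim = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    variances = [
        # Floor the variance so the log-density stays finite.
        max(sum((f[d] - means[d]) ** 2 for f in features) / n, min_var)
        for d in range(dim)
    ]
    return means, variances


def log_density(x: Vector, means: Vector, variances: Vector) -> float:
    """Log-density of x under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def rank_images(query_features: Sequence[Vector],
                collection: Dict[str, Sequence[Vector]]) -> List[str]:
    """Rank images by the log-likelihood of the query features
    under each image's fitted model, best match first."""
    scores: Dict[str, float] = {}
    for name, feats in collection.items():
        means, variances = fit_gaussian(feats)
        scores[name] = sum(log_density(x, means, variances)
                           for x in query_features)
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Toy 2-D feature vectors, invented for this example.
toy_collection = {
    "image_a": [[0.0, 0.0], [0.2, 0.1], [-0.1, 0.1]],
    "image_b": [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]],
}
toy_ranking = rank_images([[0.1, 0.0]], toy_collection)
```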

Detecting Sentiment Change in Twitter Streaming Data

MOA-TweetReader is a system that reads tweets in real time, detects changes, and finds the terms whose frequency has changed. Twitter is a micro-blogging service built to discover what is happening at any moment in time, anywhere in the world. Twitter messages are short and generated constantly, which makes them well suited for knowledge discovery using data stream mining. MOA-TweetReader is a software extension to the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA-TweetReader is released under the GNU GPL license.
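A minimal Python sketch of finding terms whose frequency changed is given below. It simply compares relative term frequencies between an older and a newer window of messages; this is an assumption chosen for illustration and is not the change detector used by MOA-TweetReader. The function name `changed_terms`, the `min_delta` threshold, and the sample messages are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def term_counts(messages: Iterable[str]) -> Counter:
    """Count lower-cased, whitespace-separated terms."""
    return Counter(word for msg in messages for word in msg.lower().split())


def changed_terms(reference: Iterable[str],
                  current: Iterable[str],
                  min_delta: float = 0.1) -> Dict[str, float]:
    """Return terms whose relative frequency differs between the two
    windows by at least min_delta (positive means more frequent now)."""
    ref = term_counts(reference)
    cur = term_counts(current)
    ref_total = sum(ref.values()) or 1  # avoid division by zero
    cur_total = sum(cur.values()) or 1

    deltas: Dict[str, float] = {}
    for term in set(ref) | set(cur):
        delta = cur[term] / cur_total - ref[term] / ref_total
        if abs(delta) >= min_delta:
            deltas[term] = delta
    return deltas


# Invented sample messages for the two windows.
old_window = ["good day", "good game"]
new_window = ["bad day", "bad game"]
toy_changes = changed_terms(old_window, new_window)
```

In a real stream the two windows would slide over time, and a statistical test would normally replace the fixed threshold.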

Streaming Multi-label Classification

This paper presents a new experimental framework for studying multi-label evolving stream classification, with efficient methods that combine the best practices in streaming scenarios with the best practices in multi-label classification. Many real-world problems involve data that can be considered multi-label data streams. Efficient methods exist for multi-label classification in non-streaming scenarios. However, learning in evolving streaming scenarios is more challenging, as the learners must be able to adapt to change using limited time and memory. We present new experimental software that extends the MOA framework. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. It is released under the GNU GPL license.

MOA Concept Drift Active Learning Strategies for Streaming Data

We present a framework for active learning on evolving data streams, as an extension to the MOA system. In learning to classify streaming data, obtaining the true labels may require major effort and may incur excessive cost. Active learning focuses on learning an accurate model with as few labels as possible. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and classifiers need to adapt. Conventional active learning strategies concentrate on querying the most uncertain instances, which are typically concentrated around the decision boundary. If changes do not occur close to the boundary, they will be missed and classifiers will fail to adapt. We propose a software system that implements active learning strategies, extending the MOA framework. This software is released under the GNU GPL license.
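The limitation of pure uncertainty sampling described above can be illustrated with a short Python sketch. The function `should_query`, the fixed `threshold`, and the `random_rate` parameter are assumptions introduced for this example; the abstract does not specify which strategies the framework implements, and this is not the MOA API.

```python
import random
from typing import Optional, Sequence


def should_query(posteriors: Sequence[float],
                 threshold: float = 0.6,
                 random_rate: float = 0.0,
                 rng: Optional[random.Random] = None) -> bool:
    """Decide whether to request the true label of an instance.

    posteriors: class probabilities estimated by the current classifier.
    threshold: query when the top probability is below this value
               (conventional uncertainty sampling).
    random_rate: probability of querying regardless of uncertainty,
                 an assumed remedy so that changes far from the
                 decision boundary can still be observed.
    """
    rng = rng or random.Random(0)
    if rng.random() < random_rate:
        return True
    return max(posteriors) < threshold


# An instance near the decision boundary is queried ...
near_boundary = should_query([0.55, 0.45])
# ... but a confidently classified one never is, even if the concept
# has drifted and the confident prediction is now wrong.
far_from_boundary = should_query([0.95, 0.05])
# With random querying enabled, it can still be selected.
forced = should_query([0.95, 0.05], random_rate=1.0)
```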

Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification

Real-time classification of emails is a challenging task because of its online nature, and also because email streams are subject to concept drift. Identifying email spam, where only two different labels or classes are defined (spam or not spam), has received great attention in the literature. We are nevertheless interested in a more specific classification where multiple folders exist, which is an additional source of complexity: the class can have a very large number of different values. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluation in data stream contexts, which is why other metrics, like the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using recently proposed mechanisms such as fading factors. In this paper, we present GNUsmail, an open-source extensible framework for email classification, and we focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different data stream mining methods, using state-of-the-art online evaluation metrics. Besides describing the framework, characterized by two overlapping phases, we show how it can be used to compare different algorithms in order to find the most appropriate one. The GNUsmail source code includes a tool for launching replicable experiments.
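The prequential error with a fading factor can be sketched in Python as below. The sketch follows the commonly used recursive form S_i = L_i + alpha * S_(i-1), N_i = 1 + alpha * N_(i-1), E_i = S_i / N_i, where L_i is the loss on the i-th example, measured before the model is trained on it. The function name and the default `alpha` are assumptions for illustration, not GNUsmail code.

```python
from typing import Iterable, List


def prequential_error(losses: Iterable[float],
                      alpha: float = 0.995) -> List[float]:
    """Return the faded prequential error after each example.

    losses: per-example losses (for instance 0/1 errors), each computed
            on a prediction made before the example was used for training.
    alpha: fading factor in (0, 1]; alpha = 1 gives the plain running
           mean, smaller values forget old errors faster.
    """
    faded_sum = 0.0    # S_i: discounted sum of losses
    faded_count = 0.0  # N_i: discounted number of examples
    errors: List[float] = []
    for loss in losses:
        faded_sum = loss + alpha * faded_sum
        faded_count = 1.0 + alpha * faded_count
        errors.append(faded_sum / faded_count)
    return errors


# A classifier that is wrong early on and correct afterwards:
# the faded estimate drops faster than the plain running mean.
stream_losses = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
plain = prequential_error(stream_losses, alpha=1.0)
faded = prequential_error(stream_losses, alpha=0.5)
```

With `alpha = 1.0` early mistakes weigh on the estimate forever, which is one of the problems of the plain prequential error that fading factors are meant to alleviate.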

A Software System for the Microbial Source Tracking Problem

The aim of this paper is to report on Ichnaea, a fully computer-based prediction system that is able to make fairly accurate predictions for Microbial Source Tracking studies. The system accepts examples showing different concentration levels, uses indicators (variables) with different environmental persistence, and can be applied in different geographical or climatic areas. We describe the inner workings of the system and report on the specific problems and challenges that arose from the machine learning point of view and how they have been addressed.

Employing The Complete Face in AVSR to Recover from Facial Occlusions

Existing Audio-Visual Speech Recognition (AVSR) systems focus visually on a small region of the face, centred on the immediate mouth area. This is a poor design for real-world situations for two reasons: any occlusion of this small area nullifies all visual advantage, and it is well known that humans use the complete face to speechread. We demonstrate a new application of a novel visual algorithm, the Multi-Channel Gradient Model (MCGM), that deploys information from the complete face to perform AVSR. Our MCGM model performs close to Discrete Cosine Transforms (DCTs) when a small region of interest around the lips is used, and for an occluded face it achieves nearly 70% of the performance that DCTs achieve in their best case, the lip-centric approach.