19 - 21 October 2011, CIEM
Schedule
Wednesday 19 October 2011
10h: Opening and announcements
Open text analysis
10h15 - 11h: Tom Diethe: Medical Text Mining
11h - 11h30: Break
11h30 - 12h15: Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Ricard Gavalda: Detecting Sentiment Change in Twitter Streaming Data
12h15 - 13h: Jose M. Carmona-Cejudo, Manuel Baena-Garcia, Jose del Campo-Avila, Rafael Morales-Bueno, Joao Gama, Albert Bifet: Using GNUsmail to Compare Data Stream Mining Methods for On-line Email Classification
13h - 15h30: Lunch break
Classification
15h30 - 16h15: Jesse Read, Albert Bifet, Geoff Holmes, Bernhard Pfahringer: Streaming Multi-label Classification
16h15 - 17h: Diego Garcia-Saiz, Marta Zorrilla: Comparing classification methods for predicting distance students' performance
17h - 17h30: Break
17h30 - 18h30: Open problems and informal discussions
Thursday 20 October 2011
Images and Vision
10h - 10h45: Benjamin X. Hall, John Shawe-Taylor and Alan Johnston: Employing The Complete Face in AVSR to Recover from Facial Occlusions
10h45 - 11h30: Vassilios Stathopoulos, Joemon M. Jose: Bayesian Probabilistic Models for Image Retrieval
11h30 - 12h: Break
12h - 13h: Invited Talk - Shai Ben-David: Utilizing unlabeled and weakly labeled samples in classification learning tasks
13h - 15h: Lunch break
Harvest
15h - 15h15: Jose Luis Balcazar: Presentation of Pascal-2 Harvest Session
15h - 15h45: (Harvest session: Stark) Tobias Koetter: intro to KNIME
15h45 - 16h30: (Harvest session: Stark) Jose Luis Balcazar: The yacaree approach to association rules
16h30 - 17h: Break
17h - 17h45: (Harvest session: Stark) Javier de la Dehesa: Implementing yacaree on KNIME
17h45 - 18h30: (Harvest session: Freeling) Xavier Carreras: Treeler: Open-source Structured Prediction for NLP
Friday 21 October 2011
Change tracking
10h - 10h45: Indre Zliobaite, Albert Bifet, Geoff Holmes, Bernhard Pfahringer: MOA Concept Drift Active Learning Strategies for Streaming Data
10h45 - 11h15: Break
11h15 - 12h: David Sanchez, Lluis A. Belanche, Anicet R. Blanch: A Software System for the Microbial Source Tracking Problem
12h - 12h45: Saatviga Sudhahar, Roberto Franzosi, Nello Cristianini: Automating Quantitative Narrative Analysis of News Data
Contributed Talks:
Automating Quantitative Narrative Analysis of News Data
We present a working system for large scale quantitative narrative
analysis (QNA) of news corpora, which includes various recent ideas
from text mining and pattern analysis in order to solve a problem
arising in computational social sciences. The task is that of
identifying the key actors in a body of news, and the actions they
perform, so that further analysis can be carried out. This step is
normally performed by hand and is very labour intensive. We then
characterise the actors by: studying their position in the overall
network of actors and actions; studying the time series associated
with some of their properties; generating scatter plots describing the
subject/object bias of each actor; and investigating the types of
actions each actor is most associated with. The system is demonstrated
on a set of 100,000 articles about crime that appeared in the New York
Times between 1987 and 2007. As an example, we find that Men were most
commonly responsible for crimes against the person, while Women and
Children were most often victims of those crimes.
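The subject/object bias mentioned above can be illustrated with a minimal sketch. It assumes that actors and actions have already been extracted as (subject, action, object) triplets, and it uses a hypothetical bias score, (s - o) / (s + o), where s and o count how often an actor appears as subject and as object. The abstract does not give the system's actual definition, and the triplets in the usage note are toy data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Triplet = Tuple[str, str, str]  # (subject, action, object)


def subject_object_bias(triplets: Iterable[Triplet]) -> Dict[str, float]:
    """Score each actor in [-1, 1]: +1 = always subject, -1 = always object.

    The score (s - o) / (s + o) is an assumption made for illustration.
    """
    as_subject: Counter = Counter()
    as_object: Counter = Counter()
    for subject, _action, obj in triplets:
        as_subject[subject] += 1
        as_object[obj] += 1

    bias = {}
    for actor in set(as_subject) | set(as_object):
        s, o = as_subject[actor], as_object[actor]
        bias[actor] = (s - o) / (s + o)  # s + o >= 1 for every listed actor
    return bias
```

For example, with the toy triplets ("police", "arrest", "suspect"), ("suspect", "rob", "bank") and ("police", "question", "suspect"), the actor "police" only ever acts and "bank" is only ever acted upon, so they sit at opposite ends of the scale.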
Comparing classification methods for predicting distance students'
performance
Virtual teaching is constantly growing and, with it, the need for
instructors to predict the performance of their students. In response
to this need, different machine learning techniques can be
used. Although many benchmarks compare their performance and
accuracy, very few experiments have been carried out on educational
datasets, whose special features set them apart from other
datasets. Therefore, in this work we
compare the performance and interpretation level of the output of the
different classification techniques applied on educational datasets
and propose a meta-algorithm to preprocess the datasets and improve
the accuracy of the model, which will be used by virtual instructors
for their decision making through the ElWM tool.
Bayesian Probabilistic Models for Image Retrieval
In this paper we present new probabilistic ranking functions for
content based image retrieval. Our methodology generalises previous
approaches and is based on the predictive densities of generative
probabilistic models modelling the density of image features. We
evaluate the proposed methodology and compare it against two state of
the art image retrieval systems using a well known image collection.
Detecting Sentiment Change in Twitter Streaming Data
MOA-TweetReader is a system that reads tweets in real time, detects
changes, and finds the terms whose frequency has changed. Twitter
is a micro-blogging service built to discover what is happening at any
moment in time, anywhere in the world. Twitter messages are short,
generated constantly, and well suited for knowledge discovery using
data stream mining. MOA-TweetReader is a software extension to the MOA
framework. Massive Online Analysis (MOA) is a software environment for
implementing algorithms and running experiments for online learning
from evolving data streams. MOA-TweetReader is released under the GNU
GPL license.
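As a rough illustration of finding terms whose frequency changed, the sketch below compares relative term frequencies between an older and a newer window of messages. This is an assumption made for illustration only: the abstract does not describe how MOA-TweetReader detects change, and the function names, the whitespace tokenisation, and the `min_delta` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def relative_frequencies(messages: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each whitespace-separated, lower-cased term."""
    counts = Counter(word for msg in messages for word in msg.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def changed_terms(old_window: Iterable[str],
                  new_window: Iterable[str],
                  min_delta: float = 0.05) -> List[str]:
    """Terms whose relative frequency moved by at least min_delta,
    ordered from the largest absolute change to the smallest."""
    old = relative_frequencies(old_window)
    new = relative_frequencies(new_window)
    deltas = {t: new.get(t, 0.0) - old.get(t, 0.0) for t in set(old) | set(new)}
    flagged = [t for t, d in deltas.items() if abs(d) >= min_delta]
    return sorted(flagged, key=lambda t: -abs(deltas[t]))
```

A real stream would need bounded memory (for instance, sliding windows or approximate counters) rather than two fully stored windows; the sketch only shows the comparison step.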
Streaming Multi-label Classification
This paper presents a new experimental framework for studying
multi-label evolving stream classification, with efficient methods that
combine the best practices in streaming scenarios with the best
practices in multi-label classification. Many real world problems
involve data which can be considered as multi-label data
streams. Efficient methods exist for multi-label classification in
non-streaming scenarios. However, learning in evolving streaming scenarios
is more challenging, as the learners must be able to adapt to change
using limited time and memory. We present new experimental software
that extends the MOA framework. Massive Online Analysis (MOA) is a
software environment for implementing algorithms and running
experiments for online learning from evolving data streams. It is
released under the GNU GPL license.
MOA Concept Drift Active Learning Strategies for Streaming Data
We present a framework for active learning on evolving data streams,
as an extension to the MOA system. In learning to classify streaming
data, obtaining the true labels may require major effort and may incur
excessive cost. Active learning focuses on learning an accurate model
with as few labels as possible. Streaming data poses additional
challenges for active learning, since the data distribution may change
over time (concept drift) and classifiers need to adapt. Conventional
active learning strategies concentrate on querying the most uncertain
instances, which are typically concentrated around the decision
boundary. If changes do not occur close to the boundary, they will be
missed and classifiers will fail to adapt. We propose a software system
that implements active learning strategies, extending the MOA
framework. This software is released under the GNU GPL license.
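The limitation described above can be made concrete with a minimal sketch of a query rule. It combines the conventional uncertainty criterion with a small share of random queries, which is one possible way to observe changes far from the decision boundary. It is not taken from the MOA code base: the function name and the `threshold` and `random_rate` parameters are assumptions made for illustration.

```python
import random
from typing import Callable, Sequence


def should_query(posteriors: Sequence[float],
                 threshold: float = 0.6,
                 random_rate: float = 0.1,
                 rng: Callable[[], float] = random.random) -> bool:
    """Decide whether to request the true label of one incoming instance.

    posteriors:  class probabilities predicted by the current classifier.
    threshold:   query when the highest posterior falls below this value.
    random_rate: probability of querying regardless of confidence.
    rng:         source of uniform random numbers in [0, 1).
    """
    if rng() < random_rate:
        return True                      # random query: anywhere in input space
    return max(posteriors) < threshold   # uncertainty query: near the boundary
```

With `random_rate=0.0` the rule reduces to pure uncertainty sampling, so a confidently but wrongly classified region would never be queried after a drift.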
Using GNUsmail to Compare Data Stream Mining Methods for On-line
Email Classification
Real-time classification of emails is a challenging task because of its
online nature, and also because email streams are subject to concept
drift. Identifying email spam, where only two different labels or
classes are defined (spam or not spam), has received great attention in
the literature. We are nevertheless interested in a more specific
classification where multiple folders exist, which is an additional
source of complexity: the class can have a very large number of
different values. Moreover, neither cross-validation nor other sampling
procedures are suitable for evaluation in data stream contexts, which
is why other metrics, like the prequential error, have been
proposed. However, the prequential error poses some problems, which
can be alleviated by using recently proposed mechanisms such as fading
factors. In this paper, we present GNUsmail, an open-source extensible
framework for email classification, and we focus on its ability to
perform online evaluation. GNUsmail's architecture supports incremental
and online learning, and it can be used to compare different data
stream mining methods, using state-of-the-art online evaluation
metrics. Besides describing the framework, characterized by two
overlapping phases, we show how it can be used to compare different
algorithms in order to find the most appropriate one. The GNUsmail
source code includes a tool for launching replicable experiments.
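The prequential error with a fading factor can be sketched as follows. This uses a common formulation, in which a faded loss sum S_i = L_i + alpha * S_(i-1) is divided by a faded count N_i = 1 + alpha * N_(i-1). It is not taken from the GNUsmail source code, and the function name and default value of `alpha` are assumptions.

```python
from typing import Iterable, List


def faded_prequential_error(losses: Iterable[float],
                            alpha: float = 0.995) -> List[float]:
    """Return the faded prequential error after each test-then-train step.

    losses: per-example loss (e.g. 1.0 for a misfiled email, 0.0 otherwise),
            measured BEFORE the model is updated with that example.
    alpha:  fading factor in (0, 1]; alpha = 1 gives the plain prequential
            error, smaller values forget old examples faster.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    faded_sum = 0.0    # S_i = L_i + alpha * S_(i-1)
    faded_count = 0.0  # N_i = 1 + alpha * N_(i-1)
    errors = []
    for loss in losses:
        faded_sum = loss + alpha * faded_sum
        faded_count = 1.0 + alpha * faded_count
        errors.append(faded_sum / faded_count)
    return errors
```

For the loss sequence 1, 1, 0, 0 the plain prequential error (alpha = 1) ends at 0.5, whereas alpha = 0.5 ends at 0.2: the early mistakes are discounted, so the estimate tracks recent performance more closely.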
A Software System for the Microbial Source Tracking Problem
The aim of this paper is to report on Ichnaea, a fully
computer-based prediction system that is able to make fairly accurate
predictions for Microbial Source Tracking studies. The system accepts
examples showing different concentration levels, uses indicators
(variables) with different environmental persistence, and can be
applied in different geographical or climatic areas. We describe the
inner workings of the system and report on the specific problems and
challenges that arose from the machine learning point of view and how
they have been addressed.
Employing The Complete Face in AVSR to Recover from Facial Occlusions
Existing Audio-Visual Speech Recognition (AVSR) systems visually focus
intensely on a small region of the face, centred on the immediate
mouth area. This is poor design for a variety of reasons. In real world
situations, any occlusion of this small area renders all visual
advantage null and void. It is also poor by design because it is well
known that humans use the complete face to speechread. We demonstrate
a new application of a novel visual algorithm, the Multi-Channel
Gradient Model (MCGM), that deploys information from the complete face
to perform AVSR. Our MCGM model performs close to Discrete Cosine
Transforms (DCTs) when a small region of interest around the lips is
used, but in the case of an occluded face we can achieve results that
match nearly 70% of the performance that DCTs achieve in the DCT best
case, the lip-centric approach.


