# Funded PhD position in machine learning at the University of Technology of Compiègne (France)

The Heudiasyc Laboratory, CNRS & University of Technology of Compiègne (France), invites applications for a fully funded three-year PhD position.

Topic: "Model Selection in Block-Clustering"

Starting date: October 2009

Block clustering aims at simultaneously partitioning the rows and columns of a data matrix. This problem, which has been investigated in data analysis since the early 1980s, has recently attracted interest in the machine learning community, thanks to applications such as text mining, marketing or recommender systems (see, e.g., the Netflix challenge). This thesis aims at developing new model selection tools for block clustering, from their theoretical foundations to their empirical evaluation. It will be carried out within the ClasSel project, funded by the French National Research Agency.

Clustering, and block clustering in particular, is highly contingent on the choice of the fitting criterion and of the number of classes. Viewing the clustering problem as a density estimation problem provides answers to these questions, thanks to the general-purpose estimation and model selection tools developed in the statistical framework. In this context, model selection is usually performed via penalized maximum likelihood scores, which trade off goodness of fit against model complexity. The main approaches are AIC (Akaike Information Criterion), which builds on the Kullback-Leibler divergence, and BIC (Bayesian Information Criterion), which selects the model that approximately maximizes the posterior probability. However, these criteria do not take the classification objective into account when density models are used for clustering. A third criterion, ICL (Integrated Completed Likelihood), accounts for this aspect. Finally, resampling methods, though scarcely used in clustering, may also provide estimates of the optimism of complex models.
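For readers less familiar with these scores, the following minimal sketch shows how they are commonly computed in the "lower is better" convention. The function and argument names are illustrative assumptions, not part of the project. The ICL shown is one common BIC-like approximation (BIC plus twice the entropy of the posterior class memberships) and may differ from the variants studied in the thesis.

```python
import math


def aic(log_lik, n_params):
    """Akaike Information Criterion (lower is better)."""
    return 2.0 * n_params - 2.0 * log_lik


def bic(log_lik, n_params, n_samples):
    """Bayesian Information Criterion (lower is better)."""
    return n_params * math.log(n_samples) - 2.0 * log_lik


def classification_entropy(posteriors):
    """Entropy of the soft class assignments.

    `posteriors` is a list of rows, one per sample, each row holding the
    posterior probabilities of the classes for that sample.
    """
    return -sum(t * math.log(t) for row in posteriors for t in row if t > 0.0)


def icl(log_lik, n_params, n_samples, posteriors):
    """BIC-like approximation of ICL (assumed form): BIC + 2 * entropy.

    The entropy term penalizes models whose classes overlap strongly.
    """
    return bic(log_lik, n_params, n_samples) + 2.0 * classification_entropy(posteriors)
```

With well-separated classes the posteriors are close to 0 or 1, the entropy term vanishes, and this ICL approximation coincides with BIC.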

All these criteria depend on the sample size. Adapting them to block clustering is therefore problematic, since problem size is characterized here by both the number of rows and the number of columns of the data matrix. The notion of sample size cannot be transposed directly, and the first attempts to apply information criteria in this framework are not convincing. This project aims at proposing suitable criteria, either by revising the standard criteria or by starting from new theoretical grounds.
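The ambiguity can be made concrete with a small sketch. It evaluates the BIC-style penalty term, the number of parameters times the logarithm of the sample size, under three candidate notions of "sample size" for a data matrix: the number of rows, the number of columns, and the number of cells. These three candidates and the function name are illustrative assumptions only. They are not the criteria the project intends to propose.

```python
import math


def bic_penalties(n_rows, n_cols, n_params):
    """BIC-style penalty n_params * log(size) for several candidate sizes."""
    candidates = {
        "rows": n_rows,
        "columns": n_cols,
        "cells": n_rows * n_cols,
    }
    return {name: n_params * math.log(size) for name, size in candidates.items()}


if __name__ == "__main__":
    # Hypothetical matrix of 1000 rows by 50 columns, model with 20 parameters.
    for name, penalty in bic_penalties(1000, 50, 20).items():
        print(f"{name:>8}: {penalty:.1f}")
```

The penalties differ substantially across the three choices, so the selected number of blocks can change with the convention adopted.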

Applicants should complete their Master's degree in 2009, and have interest and expertise in machine learning, statistics, computer science and/or applied mathematics.

Expressions of interest, with a short CV including reference(s), should be sent to: Gerard.Govaert (at) utc.fr or Yves.Grandvalet (at) utc.fr

Some links:

- Heudiasyc Lab. http://www2.hds.utc.fr/index.php?id=1&L=1

- University of Technology of Compiègne http://www.utc.fr/the_university/index.php

- Compiègne is a small town of 70,000 inhabitants, a 45-minute commute from Paris; see http://en.wikipedia.org/wiki/Compi%C3%A8gne