News



In collaboration with researchers from Demokritos, Athens, and partly sponsored by the PASCAL2 Network of Excellence, we are organizing a challenge on Large Scale Hierarchical Text Classification.

Presentation of the LASCAR project


Several categorization problems involve category systems comprising a very large number of categories. Patent offices, for example, are in charge of assigning to every patent application a code from the IPC (International Patent Classification), which contains roughly 70,000 subdivisions. DMOZ, which is often considered the largest Web directory, contains roughly 600,000 categories, into which new pages are categorized. A similar situation is encountered with the semantic annotation of documents (e.g. journal abstracts) in the medical domain, where concepts of MeSH (which contains more than 150,000 concepts) are used as index terms.

Although several works have addressed the problem of scaling up classifiers, they have mainly dealt with collections where the number of examples is large, the dimension of the underlying vector space is high, or both (as in the recent PASCAL challenge on large-scale learning). Very few works have directly addressed the problem of developing and deploying classifiers on a very large number of categories (a situation which, in general, also entails a large number of examples and a high-dimensional vector space). In particular, no theoretical results have been established regarding the best strategy for deploying a given classifier on large category systems.

The LASCAR project aims directly at developing theoretical results and practical algorithms for deploying classifiers in large category systems, as well as new classification technologies that can be used on such systems. To do so, the project brings together researchers from combinatorial optimization, machine learning, information retrieval, statistics and distributed systems.