Ricardo Baeza-Yates (Yahoo! Research, Spain),
Kevin Bretonnel Cohen (University of Colorado School of Medicine),
Walter Daelemans (University of Antwerp),
Mirella Lapata (University of Edinburgh),
Shalom Lappin (King’s College, London),
Massimo Poesio (University of Trento and University of Essex)
ABSTRACTS:
Towards Semantic Search
Ricardo Baeza-Yates,
Yahoo! Research, Spain
Semantic search remains an elusive and fuzzy target for IR, Semantic
Web, and NLP researchers. One reason is that the challenge lies at the
intersection of these fields, which implies a broad scope of issues and
technologies that must be mastered. In this talk we survey the work of
Yahoo! Research Barcelona on this problem. Our research aims to produce
a virtuous feedback cycle: machine learning is used to capture
semantics, and the captured semantics is used, ultimately, to improve
search.
Paradigms for evaluation in natural language processing
Kevin Bretonnel Cohen,
University of Colorado School of Medicine
The NLP community has a solid history of black-box evaluation of its
products. White-box evaluation, however, has generally been ignored. This
talk will review the history of evaluation in NLP, sketch the consequences
of the focus on black-box evaluation, and propose an evaluation paradigm
grounded in software testing and in descriptive linguistics.
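To make the contrast concrete, here is a minimal Python sketch (not
from the talk): a single corpus-level score in the black-box style
versus software-testing-style checks that each isolate one linguistic
phenomenon. The tokenize function and the test cases are invented for
illustration.

def tokenize(text):
    # Stand-in system under evaluation (deliberately naive).
    return text.split()

# Black-box evaluation: one aggregate score over a held-out corpus.
def accuracy(system, gold_pairs):
    correct = sum(1 for text, gold in gold_pairs if system(text) == gold)
    return correct / len(gold_pairs)

# White-box evaluation in the software-testing style: each check targets
# one linguistically motivated phenomenon, so a failure names its cause.
def test_clitic_contraction():
    assert tokenize("don't stop") == ["do", "n't", "stop"]

def test_sentence_final_punctuation():
    assert tokenize("Stop.") == ["Stop", "."]

if __name__ == "__main__":
    corpus = [("a b", ["a", "b"]), ("don't stop", ["do", "n't", "stop"])]
    print("black-box accuracy:", accuracy(tokenize, corpus))
    for test in (test_clitic_contraction, test_sentence_final_punctuation):
        try:
            test()
            print(test.__name__, "passed")
        except AssertionError:
            print(test.__name__, "FAILED")  # the naive splitter fails both

The aggregate score says only how often the system is right; the failed
checks say which constructions it gets wrong.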
Kevin Bretonnel Cohen is the Biomedical Text Mining Group Lead at the Center
for Computational Pharmacology in the University of Colorado School of
Medicine. He is the founding chairman of the ACL SIGBIOMED special interest
group on BioNLP and a co-organizer of the 2007 Medical NLP Challenge on
ICD-9-CM classification and the BioCreative III shared task.
Robust features for Computational Stylometry
Walter Daelemans,
University of Antwerp
Computational stylometry is the automatic assignment of author properties (e.g., identity, gender, personality, region, age, period, ideology, ...) to a text. Applications range from forensic use to literary scholarship. The currently most successful methodology is based on the well-known approach to text categorization, using training data in the form of texts with known classes. The approach works by extracting text features, selecting the best ones with statistical methods, representing each text as a vector of these features, and applying machine learning methods to the resulting vectors with their associated classes. The main difference from the original text categorization approach is that the extracted text features may be complex and linguistically motivated (e.g., syntactic features).
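A minimal sketch of this pipeline, assuming scikit-learn as the toolkit
(the talk does not name one); the texts, labels, and character n-gram
features below are invented stand-ins for the richer, linguistically
motivated features used in practice.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Training data: texts with known classes (here, invented authors).
texts = ["I utterly adore long subordinate clauses.",
         "short. blunt. sentences.",
         "One does enjoy an elaborate turn of phrase.",
         "no caps no commas no time"]
authors = ["A", "B", "A", "B"]

pipeline = Pipeline([
    ("extract", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
    ("select", SelectKBest(chi2, k=20)),   # keep the best features
    ("learn", LinearSVC()),                # classify the feature vectors
])
pipeline.fit(texts, authors)
print(pipeline.predict(["one rather likes an ornate clause, does one not?"]))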
I will describe some recent applications at the University of Antwerp using this methodology: personality detection, author assignment with many authors and short texts, scribe detection in medieval texts, provenance detection in Kenyan news articles, etc.
I will then focus on an empirical comparison of the robustness of different feature types in different situations.
Vector-based Models of Semantic Composition
Mirella Lapata, University of Edinburgh
Vector-based models of word meaning have become increasingly popular
in natural language processing and cognitive science. The appeal of
these models lies in their ability to represent meaning simply by
using distributional information under the assumption that words
occurring within similar contexts are semantically similar. Despite
their widespread use, vector-based models are typically directed at
representing words in isolation, and methods for constructing
representations for phrases or sentences have received little
attention in the literature.
In this talk we propose a framework for representing the
meaning of word combinations in vector space. Central to our approach
is vector composition which we operationalize in terms of additive
and multiplicative functions. Under this framework, we
introduce a wide range of composition models which we evaluate empirically
on a phrase similarity task. We also propose a novel statistical
language model that is based on vector composition and can capture
long-range semantic dependencies.
Joint work with Jeff Mitchell.
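To make the two composition operations concrete, here is a toy Python
sketch with invented four-dimensional vectors; p = u + v is the
additive model and p = u * v (component-wise) the multiplicative one.

import numpy as np

# Invented co-occurrence-style vectors for two words.
practical = np.array([2.0, 0.0, 1.0, 3.0])
difficulty = np.array([1.0, 4.0, 0.0, 2.0])

additive = practical + difficulty        # p = u + v
multiplicative = practical * difficulty  # p = u * v, component-wise

def cosine(u, v):
    # Standard similarity measure between distributional vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Compare the composed phrase vectors against a third (invented) word.
problem = np.array([1.0, 3.0, 1.0, 2.0])
print("additive:      ", additive, cosine(additive, problem))
print("multiplicative:", multiplicative, cosine(multiplicative, problem))

Note how the multiplicative model zeroes out components on which the
two words do not overlap, while the additive model keeps them all.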
Restricting Probability Distributions to Expand the Class of
Learnable Languages
Shalom Lappin,
King’s College, London
Joint work with Alex Clark, Royal Holloway College, London.
Classical computational learning models like Gold's (1967)
identification in the limit paradigm and PAC learning require learning
under all possible probability distributions over data samples. By
modifying PAC learning to restrict the set of distributions, it is
possible to show that different classes of languages are learnable than
in the distribution free framwork. I will explore some recent work on
distribution based language learning. Clark and Thollard (2004) prove
that when the set of distributions is limited to those produced by
Probabilistic Deterministic Finite State Automata, then the class of
regular languages is PAC learnable from positive evidence only. Clark
(2006) demonstrates a similar result for an interesting subclass of
context-free languages when the set of distributions corresponds to a
specified set of Probabilistic Context-Free Grammars. I will then
discuss how this approach might be used to increase the range of
evidence available for learning. Clark and Lappin (2009) suggest that an
appropriate restriction on the set of possible distributions for PAC
learning could provide the basis for a stochastic model of indirect
negative evidence.
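As a rough rendering of this modification in standard PAC notation (my
formulation, not the authors'): the learner A must meet the usual
(epsilon, delta) criterion only for sample distributions drawn from a
fixed class of distributions, rather than for all of them.

\forall L \in \mathcal{L},\; \forall D \in \mathcal{D},\;
\forall \varepsilon, \delta \in (0,1):\quad
\Pr_{S \sim D^{\,m(\varepsilon,\delta)}}
  \bigl[\, \mathrm{err}_D(A(S), L) \le \varepsilon \,\bigr] \ge 1 - \delta,
\qquad \text{where } \mathrm{err}_D(H, L) = \Pr_{x \sim D}[\, H(x) \neq L(x) \,].

In the classical setting \mathcal{D} is the set of all distributions
over the sample space; in Clark and Thollard (2004) it is the set of
distributions generated by Probabilistic Deterministic Finite State
Automata.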
Conceptual Knowledge: Evidence from Corpora and the Brain
Massimo Poesio,
Università di Trento, Center for Mind/Brain Sciences and DISI
Evidence about conceptual knowledge derived by extracting information
from corpora could potentially have a big impact on research on the
lexicon, ontologies, and commonsense knowledge, e.g., by providing
enormous amounts of information about which attributes appear to have
the strongest associations with certain concepts, or for which types
of concepts our current repertoire of attributes is less
appropriate. It is however very difficult to evaluate such research
precisely, as the most commonly used gold standards (WordNet, existing
ontologies) have been developed on the basis of subjective human
judgments as opposed to objective empirical evidence. The most obvious
alternatives to such resources are the evidence about conceptual
attributes contained in feature norms produced by psycholinguists and
the evidence about category distinctions extracted from brain imaging
data, but the existing databases have serious limitations as well. In
this talk I will discuss ongoing work on collecting evidence about
conceptual knowledge through these techniques (in particular, brain
data through EEG) and using this information to evaluate
corpus-extracted models of conceptual knowledge.
Joint work with Marco Baroni, Brian Murphy, Eduard Barbu, and
Abdulrahman Almuhareb.
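As one concrete illustration of the evaluation step (not from the
talk), here is a toy Python sketch scoring corpus-extracted attributes
against a feature norm; the concept and attribute sets are invented.

# Attributes extracted from a corpus vs. attributes in a feature norm.
corpus_attributes = {"banana": {"yellow", "sweet", "fruit", "republic"}}
feature_norms = {"banana": {"yellow", "sweet", "fruit", "peel", "long"}}

def precision_against_norms(concept):
    extracted = corpus_attributes[concept]
    return len(extracted & feature_norms[concept]) / len(extracted)

# 3 of the 4 extracted attributes appear in the norm -> 0.75.
print(precision_against_norms("banana"))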