Ricardo Baeza-Yates (Yahoo! Research, Spain),
Kevin Bretonnel Cohen (University of Colorado School of Medicine),
Walter Daelemans (University of Antwerp),
Mirella Lapata (University of Edinburgh),
Shalom Lappin (King’s College, London),
Massimo Poesio (University of Trento and University of Essex)
ABSTRACTS:
Towards Semantic Search
Ricardo Baeza-Yates,
Yahoo! Research, Spain
Semantic search remains an elusive and fuzzy target for IR, Semantic
Web, and NLP researchers. One reason is that the challenge lies at the
intersection of these fields, which implies a broad scope of issues and
technologies that must be mastered. In this talk we survey the work of
Yahoo! Research Barcelona on this problem. Our research aims to produce
a virtuous feedback cycle: machine learning is used to capture
semantics, and the captured semantics is used, ultimately, to improve
search.
Paradigms for evaluation in natural language processing
Kevin Bretonnel Cohen,
University of Colorado School of Medicine
The NLP community has a solid history of black-box evaluation of its
products. White-box evaluation, however, has generally been ignored. This
talk will review the history of evaluation in NLP, sketch the consequences
of the focus on black-box evaluation, and propose an evaluation paradigm
grounded in software testing and in descriptive linguistics.
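To make the contrast concrete, here is a minimal Python sketch (not
from the talk): a single corpus-level score in the black-box style
versus software-testing-style checks that each isolate one linguistic
phenomenon. The tokenize function and the test cases are invented for
illustration.

def tokenize(text):
    # Stand-in system under evaluation (deliberately naive).
    return text.split()

# Black-box evaluation: one aggregate score over a held-out corpus.
def accuracy(system, gold_pairs):
    correct = sum(1 for text, gold in gold_pairs if system(text) == gold)
    return correct / len(gold_pairs)

# White-box evaluation in the software-testing style: each check targets
# one linguistically motivated phenomenon, so a failure names its cause.
def test_clitic_contraction():
    assert tokenize("don't stop") == ["do", "n't", "stop"]

def test_sentence_final_punctuation():
    assert tokenize("Stop.") == ["Stop", "."]

if __name__ == "__main__":
    corpus = [("a b", ["a", "b"]), ("don't stop", ["do", "n't", "stop"])]
    print("black-box accuracy:", accuracy(tokenize, corpus))
    for test in (test_clitic_contraction, test_sentence_final_punctuation):
        try:
            test()
            print(test.__name__, "passed")
        except AssertionError:
            print(test.__name__, "FAILED")  # the naive splitter fails both

The aggregate score says only how often the system is right; the failed
checks say which constructions it gets wrong.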
Kevin Bretonnel Cohen is the Biomedical Text Mining Group Lead at the Center
for Computational Pharmacology in the University of Colorado School of
Medicine. He is the founding chairman of the ACL SIGBIOMED special interest
group on BioNLP and a co-organizer of the 2007 Medical NLP Challenge on
ICD-9-CM classification and the BioCreative III shared task.
Robust features for Computational Stylometry
Walter Daelemans,
University of Antwerp
Computational stylometry is the automatic assignment of author properties (e.g., identity, gender, personality, region, age, period, ideology, ...) to a text. Applications range from forensic use to literary scholarship. The currently most successful methodology is based on the well-known approach to text categorization, using training data in the form of texts with known classes. The approach works by extracting text features, selecting the best ones with statistical methods, representing each text as a vector of these features, and applying machine learning methods to the resulting vectors with their associated classes. The main difference from the original text categorization approach is that the extracted text features may be complex and linguistically motivated (e.g., syntactic features).
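A minimal sketch of this pipeline, assuming scikit-learn as the toolkit
(the talk does not name one); the texts, labels, and character n-gram
features below are invented stand-ins for the richer, linguistically
motivated features used in practice.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Training data: texts with known classes (here, invented authors).
texts = ["I utterly adore long subordinate clauses.",
         "short. blunt. sentences.",
         "One does enjoy an elaborate turn of phrase.",
         "no caps no commas no time"]
authors = ["A", "B", "A", "B"]

pipeline = Pipeline([
    ("extract", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
    ("select", SelectKBest(chi2, k=20)),   # keep the best features
    ("learn", LinearSVC()),                # classify the feature vectors
])
pipeline.fit(texts, authors)
print(pipeline.predict(["one rather likes an ornate clause, does one not?"]))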
I will describe some recent applications at the University of Antwerp using this methodology: personality detection, author assignment with many authors and short texts, scribe detection in medieval texts, provenance detection in Kenyan news articles, etc.
I will then focus on an empirical comparison of the robustness of different feature types in different situations.
Vector-based Models of Semantic Composition
Mirella Lapata, University of Edinburgh
Vector-based models of word meaning have become increasingly popular
in natural language processing and cognitive science. The appeal of
these models lies in their ability to represent meaning simply by
using distributional information under the assumption that words
occurring within similar contexts are semantically similar. Despite
their widespread use, vector-based models are typically directed at
representing words in isolation, and methods for constructing
representations for phrases or sentences have received little
attention in the literature.
In this talk we propose a framework for representing the
meaning of word combinations in vector space. Central to our approach
is vector composition which we operationalize in terms of additive
and multiplicative functions. Under this framework, we
introduce a wide range of composition models which we evaluate empirically
on a phrase similarity task. We also propose a novel statistical
language model that is based on vector composition and can capture
long-range semantic dependencies.
Joint work with Jeff Mitchell.
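To make the two composition operations concrete, here is a toy Python
sketch with invented four-dimensional vectors; p = u + v is the
additive model and p = u * v (component-wise) the multiplicative one.

import numpy as np

# Invented co-occurrence-style vectors for two words.
practical = np.array([2.0, 0.0, 1.0, 3.0])
difficulty = np.array([1.0, 4.0, 0.0, 2.0])

additive = practical + difficulty        # p = u + v
multiplicative = practical * difficulty  # p = u * v, component-wise

def cosine(u, v):
    # Standard similarity measure between distributional vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Compare the composed phrase vectors against a third (invented) word.
problem = np.array([1.0, 3.0, 1.0, 2.0])
print("additive:      ", additive, cosine(additive, problem))
print("multiplicative:", multiplicative, cosine(multiplicative, problem))

Note how the multiplicative model zeroes out components on which the
two words do not overlap, while the additive model keeps them all.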
Restricting Probability Distributions to Expand the Class of
Learnable Languages
Shalom Lappin,
King’s College, London
Joint work with Alex Clark, Royal Holloway College, London.
Classical computational learning models like Gold's (1967)
identification in the limit paradigm and PAC learning require learning
under all possible probability distributions over data samples. By
modifying PAC learning to restrict the set of distributions, it is
possible to show that different classes of languages are learnable than
in the distribution free framwork. I will explore some recent work on
distribution based language learning. Clark and Thollard (2004) prove
that when the set of distributions is limited to those produced by
Probabilistic Deterministic Finite State Automata, then the class of
regular languages is PAC learnable from positive evidence only. Clark
(2006) demonstrates a similar result for an interesting subclass of
context-free languages when the set of distributions corresponds to a
specified set of Probabilistic Context-Free Grammars. I will then
discuss how this approach might be used to increase the range of
evidence available for learning. Clark and Lappin (2009) suggest that an
appropriate restriction on the set of possible distributions for PAC
learning could provide the basis for a stochastic model of indirect
negative evidence.
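As a rough rendering of this modification in standard PAC notation (my
formulation, not the authors'): the learner A must meet the usual
(epsilon, delta) criterion only for sample distributions drawn from a
fixed class of distributions, rather than for all of them.

\forall L \in \mathcal{L},\; \forall D \in \mathcal{D},\;
\forall \varepsilon, \delta \in (0,1):\quad
\Pr_{S \sim D^{\,m(\varepsilon,\delta)}}
  \bigl[\, \mathrm{err}_D(A(S), L) \le \varepsilon \,\bigr] \ge 1 - \delta,
\qquad \text{where } \mathrm{err}_D(H, L) = \Pr_{x \sim D}[\, H(x) \neq L(x) \,].

In the classical setting \mathcal{D} is the set of all distributions
over the sample space; in Clark and Thollard (2004) it is the set of
distributions generated by Probabilistic Deterministic Finite State
Automata.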
Conceptual Knowledge: Evidence from Corpora and the Brain
Massimo Poesio,
Università di Trento, Center for Mind/Brain Sciences and DISI
Evidence about conceptual knowledge derived by extracting information
from corpora could potentially have a big impact on research on the
lexicon, ontologies, and commonsense knowledge, e.g., by providing
enormous amounts of information about which attributes appear to have
the strongest associations with certain concepts, or for which types
of concepts our current repertoire of attributes is less
appropriate. It is however very difficult to evaluate such research
precisely, as the most commonly used gold standards (WordNet, existing
ontologies) have been developed on the basis of subjective human
judgments as opposed to objective empirical evidence. The most obvious
alternatives to such resources are the evidence about conceptual
attributes contained in feature norms produced by psycholinguists and
the evidence about category distinctions extracted from brain imaging
data, but the existing databases have serious limitations as well. In
this talk I will discuss ongoing work on collecting evidence about
conceptual knowledge through these techniques (in particular, brain
data through EEG) and using this information to evaluate
corpus-extracted models of conceptual knowledge.
Joint work with Marco Baroni, Brian Murphy, Eduard Barbu, and
Abdulrahman Almuhareb.
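As one concrete illustration of the evaluation step (not from the
talk), here is a toy Python sketch scoring corpus-extracted attributes
against a feature norm; the concept and attribute sets are invented.

# Attributes extracted from a corpus vs. attributes in a feature norm.
corpus_attributes = {"banana": {"yellow", "sweet", "fruit", "republic"}}
feature_norms = {"banana": {"yellow", "sweet", "fruit", "peel", "long"}}

def precision_against_norms(concept):
    extracted = corpus_attributes[concept]
    return len(extracted & feature_norms[concept]) / len(extracted)

# 3 of the 4 extracted attributes appear in the norm -> 0.75.
print(precision_against_norms("banana"))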