The main RANLP conference will be preceded by two days of tutorials delivered by distinguished lecturers. We plan 4 half-day tutorials, each with a duration of 220 minutes, distributed as follows: 60 min talk + 20 min break + 60 min talk + 20 min break + 60 min talk.
Tutorial speakers:
Kevin Bretonnel Cohen (University of Colorado School of Medicine)
Roberto Navigli (Sapienza University of Rome)
Constantin Orasan (University of Wolverhampton)
Kiril Simov & Petya Osenova (Bulgarian Academy of Sciences)
Abstracts
Biomedical Natural Language Processing: BioNLP
Kevin Bretonnel Cohen - University of Colorado School of Medicine
BioNLP is the application of natural language processing and text mining
techniques to biomedical data. This type of data presents special
challenges on all levels, ranging from the most basic tokenization to
questions of semantic representation. The tutorial will present an overview
of BioNLP problems and solutions, including the following areas:
Named entity recognition: Named entity recognition in the biomedical domain
has focused on gene and protein name recognition, and performance on this
task has lagged behind that in the newswire domain. We will review reasons
for these performance differences and discuss typical solutions.
Named entity normalization: Named entity recognition by itself is typically
not useful in the biomedical domain; rather, it is usually necessary to
ground mentions to entities in a database. We will discuss past results and
the latest challenges in this area.
Corpus construction: Corpus construction in this domain presents unique
challenges related to the need for both linguistic and domain expertise. We
will discuss practical approaches to the task.
Crucial resources: We will discuss the ten crucial resources that will
enable anyone to start doing research in BioNLP.
Semantics and discourse: Semantic representations in biomedicine differ in
crucial ways from those for newswire text. We will show the differences and
demonstrate how biomedical semantic resources can be evaluated.
Kevin Bretonnel Cohen is the Biomedical Text Mining Group Lead at the Center
for Computational Pharmacology in the University of Colorado School of
Medicine. He is the founding chairman of the ACL SIGBIOMED special interest
group on BioNLP and a co-organizer of the 2007 Medical NLP Challenge on
ICD-9-CM classification and the BioCreative III shared task.
Graph-Based Word Sense Disambiguation and Discrimination
Roberto Navigli - Sapienza University of Rome
Word Sense Disambiguation (WSD), the ability to identify the
intended meanings of words (senses) in context, is a key
problem in Natural Language Processing (NLP). The most successful
approaches (i.e. supervised methods) typically need large amounts of
manually sense-tagged data. However, this requirement constitutes an
obstacle to the development of wide-coverage systems, due to the high
cost of human annotations. This issue (called the knowledge acquisition
bottleneck) can be overcome by resorting to different kinds of
approaches, which rely on large amounts of unannotated data
(fully unsupervised approaches) or wide-coverage lexical resources
(knowledge-based methods). The former methods do not use a predefined
inventory and aim to induce the inventory of senses for a target word
(Word Sense Discrimination), which can be used later to sense tag words
in context. In contrast, knowledge-based approaches exploit
machine-readable lexical resources to perform WSD. Both approaches can be
implemented with the aid of graphs: graph nodes are used to represent
words or senses, whereas edges encode relations between pairs of nodes.
This tutorial covers graph-based approaches to WSD. These methods
need not be trained on manually annotated data, as they can exploit
the graph structure of lexical knowledge resources to perform the
disambiguation/discrimination task. The tutorial is divided into
three parts:
1. Introduction to Word Sense Disambiguation and Discrimination.
Word Sense Disambiguation is introduced, an overview of the current range
of techniques is provided and the main issues of WSD are discussed. This
part gives the attendees a basic but clear knowledge of the field of WSD,
as well as the motivation for using graphs to perform disambiguation. We
also introduce the topic of unsupervised WSD, that is, Word Sense
Discrimination.
2. Graph-based methods for Word Sense Disambiguation and Discrimination.
In the second part of the tutorial, we present graphs and illustrate
graph-based methods for Word Sense Disambiguation and Discrimination. This
is the tutorial core, which will provide the attendee with a firm
knowledge of graph methods for Word Sense Disambiguation and
Discrimination.
3. Evaluation of graph-based approaches.
In the last part, we present the main experimental successes and failures
of graph-based WSD methods in standard international evaluation
competitions (Senseval and SemEval). We also illustrate evaluation
strategies for discrimination methods. Finally, we discuss future trends
and uses of graphs in the field.
Target audience
This tutorial targets both computer scientists and computational
linguists.
The tutorial is self-contained, so no background knowledge is required.
The objective is to give all attendees a clear understanding of how graphs
are used in a wide range of techniques for WSD.
Roberto Navigli is an assistant professor in the Department of Computer
Science at the University of Rome "La Sapienza". He received a PhD in
Computer Science from "La Sapienza" in 2007 and won the Marco Cadoli 2007
Italian national prize for the best PhD thesis in Artificial Intelligence.
His research interests include Word Sense Disambiguation, knowledge
acquisition, ontology learning and population, and the Semantic Web and
its applications. He is a board member of the ACL Special Interest Group
on the Lexicon (SIGLEX) and a co-organizer of two tasks on coarse-grained
all-words WSD and lexical substitution at the SemEval-2007 semantic
evaluation campaign.
Automatic summarisation in the Information Age
Constantin Orasan - University of Wolverhampton, UK
Most of the methods proposed in the past in automatic summarisation were
designed to work on newswire and scientific texts. However, the
explosion of information available on the Internet prompted researchers
to turn their attention to online texts as well. The purpose of this
tutorial is to give an overview of past and current developments in the
field of automatic summarisation. The tutorial will be divided into
three parts:
1. Introduction to automatic summarisation
The first hour of the tutorial will introduce the basic concepts in
automatic summarisation and present the field in relation to other
fields such as information retrieval, information extraction, question
answering, etc. This part of the tutorial will also present resources
that can be used in the development and evaluation of automatic
summarisation systems.
2. Important methods in automatic summarisation
The second part of the tutorial will briefly present the most important
methods used in single and multi-document automatic summarisation
including, but not limited to, techniques based on superficial features,
term occurrence, lexical and discourse information, etc. Particular
attention will be paid to methods based on machine learning.
3. Automatic summarisation for online documents
The recent expansion of the Internet brought about the convergence of
automatic summarisation with other fields. The third part of the
tutorial will focus on emerging directions in automatic summarisation,
with special emphasis on summarisation of online documents such as
emails, fora and webpages. We will also discuss the summarisation of
reviews and the production of summaries as the output of a question
answering system.
Target audience
This tutorial is designed for students and researchers in Computer
Science and Computational Linguistics. No prior knowledge of automatic
summarisation is assumed, but at least basic knowledge of computational
linguistics will be required.
Constantin Orasan is a senior lecturer in computational linguistics at the
Research Group in Computational Linguistics, University of
Wolverhampton, UK. His main interests are automatic summarisation,
question answering, anaphora and coreference resolution and corpus
linguistics.
In the NLP world of Knowledge Nets
Kiril Simov & Petya Osenova - Bulgarian Academy of Sciences
A number of active initiatives now structure linguistic knowledge in hierarchical lexicons called ‘nets’. The oldest and most popular ‘net’ is WordNet, which is still being expanded and developed. It has been followed by PhraseNet, FrameNet, VerbNet, etc. At the same time, our world knowledge is modelled via ontologies: upper (abstract) and domain (specific) ones. Some bridges between these two types of knowledge bases have already been established, such as the construction of WordNets aligned to a top ontology in EuroWordNet, and the mapping of WordNet to SUMO (Suggested Upper Merged Ontology) and to DOLCE (the OntoWordNet).
The purpose of this tutorial is to discuss the properties and utility of both the rich linguistic knowledge lexicons, on the one hand, and the ontologies, on the other. The relations between the two types of resources will be outlined with respect to various NLP applications (eLearning, Information Retrieval, etc.).
The tutorial will focus on the following topics:
1. Facing Linguistic Knowledge Nets.
A contrastive overview will be presented of the most popular nets and their design properties. These resources will be considered with respect to multilinguality, domains, visualization and potential for various NLP applications.
2. Getting the ontologies into the picture.
Linguistic knowledge is part of our world knowledge. Thus, we will focus first on the concept ‘ontology’. Our interpretation will be NLP-oriented rather than philosophical, although the philosophical justification will be kept. Then the peculiarities of the upper as well as the domain ontologies will be presented. The inheritance and control mechanisms that go from the upper ontology to the domain one will be featured. Also, the role of the nets from the ontological point of view will be shown.
3. The “war for domination” between the nets and the ontologies.
Very often in the NLP world, the distinction between a linguistic knowledge net and an ontology is blurred due to terminological ambiguity and lack of clarity. We will discuss the criteria for deciding which resource is a linguistic net and which is an ontology. On the other hand, the nets serve as a ‘middle’ layer between the upper and the domain ontologies. This interaction, however, is not trivial. Thus, some mapping procedures in this respect will be outlined as well.
This tutorial is aimed at a broad audience. It is meant to be suitable for all students and researchers who work on or are interested in creating, implementing, and using rich knowledge resources in an NLP environment.
Kiril Simov is a senior researcher at the Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences. His main interests are developing and using knowledge-based tools and approaches to linguistic information and tasks, language resources, semantic dictionaries and ontologies.
Petya Osenova is a senior researcher at the Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences, and an associate professor in Morphology and Syntax at Sofia University “St. Kl. Ohridski”. Her research interests are in the areas of morphology, syntax, lexical knowledge bases, formal grammars, corpus linguistics, question answering and HPSG.