International Conference
RANLP - 2009

/Recent Advances in Natural Language Processing/
September 14-16, 2009, Borovets, Bulgaria

Tutorials - September 12-13

 

The main RANLP conference will be preceded by two days of tutorials delivered by distinguished lecturers. We plan four half-day tutorials, each lasting 220 minutes, distributed as follows: 60 min talk + 20 min break + 60 min talk + 20 min break + 60 min talk.

Tutorial speakers:

Kevin Bretonnel Cohen (University of Colorado School of Medicine)
Roberto Navigli (Sapienza University of Rome)
Constantin Orasan (University of Wolverhampton)
Kiril Simov & Petya Osenova (Bulgarian Academy of Sciences)


Abstracts

 

Biomedical Natural Language Processing: BioNLP
Kevin Bretonnel Cohen - University of Colorado School of Medicine

BioNLP is the application of natural language processing and text mining techniques to biomedical data. This type of data presents special challenges on all levels, ranging from the most basic tokenization to questions of semantic representation. The tutorial will present an overview of BioNLP problems and solutions, including the following areas:

Named entity recognition: Named entity recognition in the biomedical domain has focused on gene and protein name recognition, and performance on this task has lagged behind that in the newswire domain. We will review reasons for these performance differences and discuss typical solutions.

Named entity normalization: Named entity recognition by itself is typically not useful in the biomedical domain; rather, it is usually necessary to ground mentions to entities in a database. We will discuss past results and the latest challenges in this area.
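In its simplest form, grounding a mention to a database entry can be sketched as a synonym-dictionary lookup. The lexicon and identifiers below are invented for illustration; real normalizers must also handle ambiguity and orthographic variation.

```python
from typing import Optional

# Toy synonym lexicon mapping surface forms of gene names to
# (invented) database identifiers. Real systems use resources such
# as Entrez Gene, with far richer synonym lists.
GENE_LEXICON = {
    "p53": "EntrezGene:7157",
    "tp53": "EntrezGene:7157",
    "tumor protein p53": "EntrezGene:7157",
    "brca1": "EntrezGene:672",
}

def normalize(mention: str) -> Optional[str]:
    """Ground a recognized gene mention to a database ID, if known."""
    return GENE_LEXICON.get(mention.lower().strip())

print(normalize("TP53"))
print(normalize("BRCA1"))
```

Case folding alone already resolves some variation ("TP53" vs. "tp53"); the hard cases, such as one symbol naming different genes in different species, are exactly the challenges the tutorial addresses.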

Corpus construction: Corpus construction in this domain presents unique challenges related to the need for both linguistic and domain expertise. We will discuss practical approaches to the task.

Crucial resources: We will discuss the ten crucial resources that will enable anyone to start doing research in BioNLP.

Semantics and discourse: Semantic representations in biomedicine differ in crucial ways from those for newswire text. We will show the differences and demonstrate how biomedical semantic resources can be evaluated.

Kevin Bretonnel Cohen is the Biomedical Text Mining Group Lead at the Center for Computational Pharmacology in the University of Colorado School of Medicine. He is the founding chairman of the ACL SIGBIOMED special interest group on BioNLP and a co-organizer of the 2007 Medical NLP Challenge on ICD-9-CM classification and the BioCreative III shared task.

 

Graph-Based Word Sense Disambiguation and Discrimination
Roberto Navigli - Sapienza University of Rome

Word Sense Disambiguation (WSD), the ability to identify the intended meanings of words (senses) in context, is a key problem in Natural Language Processing (NLP). The most successful approaches (i.e. supervised methods) typically need large amounts of manually sense-tagged data. However, this requirement constitutes an obstacle to the development of wide-coverage systems, due to the high cost of human annotations. This issue (called the knowledge acquisition bottleneck) can be overcome by resorting to different kinds of approaches, which rely on large amounts of unannotated data (fully unsupervised approaches) or wide-coverage lexical resources (knowledge-based methods). The former methods do not use a predefined inventory and aim to induce the inventory of senses for a target word (Word Sense Discrimination), which can be used later to sense tag words in context. In contrast, knowledge-based approaches exploit machine-readable lexical resources to perform WSD. Both approaches can be implemented with the aid of graphs: graph nodes are used to represent words or senses, whereas edges encode relations between pairs of nodes.
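The graph representation described above can be illustrated with a minimal sketch: nodes are candidate senses, edges come from a lexical resource, and each word is assigned its most connected sense in the context. The sense inventory and relations below are invented for illustration; real systems derive them from resources such as WordNet and use richer connectivity measures than node degree.

```python
from itertools import combinations

# Toy sense inventory (invented): each word maps to candidate senses.
SENSES = {
    "bank": ["bank#finance", "bank#river"],
    "money": ["money#currency"],
    "deposit": ["deposit#payment", "deposit#sediment"],
}

# Toy relations between senses, standing in for edges extracted
# from a machine-readable lexical resource.
RELATIONS = {
    ("bank#finance", "money#currency"),
    ("money#currency", "deposit#payment"),
    ("bank#finance", "deposit#payment"),
    ("bank#river", "deposit#sediment"),
}

def disambiguate(words):
    """For each word, pick the candidate sense with the highest degree
    in the subgraph induced by all candidate senses in the context."""
    candidates = {s for w in words for s in SENSES[w]}
    degree = {s: 0 for s in candidates}
    for a, b in combinations(candidates, 2):
        if (a, b) in RELATIONS or (b, a) in RELATIONS:
            degree[a] += 1
            degree[b] += 1
    return {w: max(SENSES[w], key=degree.__getitem__) for w in words}

print(disambiguate(["bank", "money", "deposit"]))
```

In this toy context, the financial senses form a densely connected cluster and win; graph algorithms covered in the tutorial (e.g. random walks) generalize this intuition beyond simple degree counting.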

This tutorial covers graph-based approaches to WSD. These methods need not be trained on manually annotated data, as they can exploit the graph structure of lexical knowledge resources to perform the disambiguation/discrimination task. The tutorial is divided into three parts:

1. Introduction to Word Sense Disambiguation and Discrimination.
Word Sense Disambiguation is introduced, an overview of the current range of techniques is provided, and the main issues of WSD are discussed. This part gives the attendees a basic but clear knowledge of the field of WSD, as well as the motivation for using graphs to perform disambiguation. We also introduce the topic of unsupervised WSD, that is, Word Sense Discrimination.

2. Graph-based methods for Word Sense Disambiguation and Discrimination.
In the second part of the tutorial, we present graphs and illustrate graph-based methods for Word Sense Disambiguation and Discrimination. This is the tutorial core, which will provide the attendee with a firm knowledge of graph methods for Word Sense Disambiguation and Discrimination.

3. Evaluation of graph-based approaches.
We present the main experimental successes and failures of graph-based WSD methods in standard international evaluation competitions (Senseval and Semeval). We also illustrate evaluation strategies for discrimination methods. Finally, we discuss future trends and uses of graphs in the field.

Target audience
This tutorial targets both computer scientists and computational linguists. The tutorial is self-contained, so no background knowledge is required. The objective is to give all attendees a clear understanding of how graphs are used in a wide range of techniques for WSD.

Roberto Navigli is an assistant professor in the Department of Computer Science at the University of Roma "La Sapienza". He received a PhD in Computer Science from "La Sapienza" in 2007; his thesis won the Marco Cadoli 2007 Italian national prize for the best PhD thesis in Artificial Intelligence. His research interests include Word Sense Disambiguation, knowledge acquisition, ontology learning and population, and the Semantic Web and its applications. He is a board member of the ACL Special Interest Group on the Lexicon (SIGLEX) and a co-organizer of two tasks on coarse-grained all-words WSD and lexical substitution at the Semeval-2007 semantic evaluation campaign.

 

Automatic summarisation in the Information Age
Constantin Orasan – University of Wolverhampton, UK

Most of the methods proposed in automatic summarisation in the past were designed to work on newswire and scientific texts. However, the explosion of information available on the Internet has prompted researchers to turn their attention to online texts as well. The purpose of this tutorial is to give an overview of past and current developments in the field of automatic summarisation. The tutorial will be divided into three parts:

1. Introduction to automatic summarisation
The first hour of the tutorial will introduce the basic concepts in automatic summarisation and present the field in relation to other fields such as information retrieval, information extraction, question answering, etc. This part of the tutorial will also present resources that can be used in the development and evaluation of automatic summarisation systems.

2. Important methods in automatic summarisation
The second part of the tutorial will briefly present the most important methods used in single and multi-document automatic summarisation including, but not limited to, techniques based on superficial features, term occurrence, lexical and discourse information, etc. Particular attention will be paid to methods based on machine learning.
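The term-occurrence family of methods mentioned above can be sketched as a toy extractive summariser that scores each sentence by the document-wide frequency of its content words and keeps the top-ranked sentences. The stopword list and example text are purely illustrative.

```python
import re
from collections import Counter

# Minimal illustrative stopword list; real systems use larger ones.
STOPWORDS = {"the", "a", "of", "in", "is", "and", "to", "on"}

def summarise(text, n_sentences=1):
    """Extractive summary: rank sentences by the frequency of their
    content words in the whole document, keep the top n in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)
    def score(s):
        return sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())
                   if w not in STOPWORDS)
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)

doc = ("Summarisation condenses text. Graphs are pretty. "
       "Automatic summarisation selects the most important text.")
print(summarise(doc))
```

Sentences whose words recur throughout the document score highest, which is the core intuition behind term-occurrence approaches; machine-learning methods covered in the tutorial learn to combine such features rather than hand-weighting them.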

3. Automatic summarisation for online documents
The recent expansion of the Internet has brought about the convergence of automatic summarisation with other fields. The third part of the tutorial will focus on emerging directions in automatic summarisation, with special emphasis on the summarisation of online documents such as emails, fora and webpages. The production of summaries as the output of question answering systems and the summarisation of reviews will also be discussed.

Target audience
This tutorial is designed for students and researchers in Computer Science and Computational Linguistics. No prior knowledge of automatic summarisation is assumed, but at least basic knowledge of computational linguistics will be required.

Constantin Orasan is a senior lecturer in computational linguistics at the Research Group in Computational Linguistics, University of Wolverhampton, UK. His main interests are automatic summarisation, question answering, anaphora and coreference resolution, and corpus linguistics.

 

In the NLP world of Knowledge Nets
Kiril Simov & Petya Osenova - Bulgarian Academy of Sciences

A number of initiatives are currently active in structuring linguistic knowledge into hierarchical lexicons called ‘nets’. Needless to say, the oldest and most popular ‘net’ is WordNet, which is still being expanded and developed. It is followed by PhraseNet, FrameNet, VerbNet, etc. At the same time, our world knowledge is modelled via ontologies – upper (abstract) and domain (specific) ones. Some bridges between these two types of knowledge bases have already been established, such as the construction of WordNets aligned to a top ontology in EuroWordNet, and the mapping of WordNet to SUMO (Suggested Upper Merged Ontology) and to DOLCE (OntoWordNet).

The purpose of this tutorial is to discuss the properties and utility of both rich linguistic knowledge lexicons, on the one hand, and ontologies, on the other. The relations between the two types of resources will be outlined with respect to various NLP applications (eLearning, Information Retrieval, etc.).

The tutorial will focus on the following topics:

1. Facing Linguistic Knowledge Nets.
A contrastive overview will be presented of the most popular nets and their design properties. These resources will be considered with respect to multilinguality, domains, visualization and potential for various NLP applications.

2. Getting the ontologies into the picture.
Linguistic knowledge is part of our world knowledge. Thus, we will first focus on the concept ‘ontology’. Our interpretation will be more NLP-oriented than philosophical, although the philosophical justification will be preserved. Then the peculiarities of upper as well as domain ontologies will be presented. The inheritance and control mechanisms that lead from the upper to the domain ontologies will be featured. Also, the role of the nets from the ontological point of view will be shown.

3. The “war for domination” between the nets and the ontologies.
Very often in the NLP world, the distinction between a linguistic knowledge net and an ontology is blurred due to terminological ambiguity and lack of clarity. We will discuss criteria for deciding which resource is a linguistic net and which is an ontology. On the other hand, the nets serve as a ‘middle’ layer between the upper and the domain ontologies. This interaction, however, is not trivial, so some mapping procedures will be outlined as well.

This tutorial is aimed at a broad audience. It is meant to be suitable for all students and researchers who work on, or are interested in, creating, implementing, and using rich knowledge resources in an NLP environment.

Kiril Simov is a senior researcher at the Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences. His main interests are the development and use of knowledge-based tools and approaches for linguistic information and tasks, language resources, semantic dictionaries, and ontologies.

Petya Osenova is a senior researcher at the Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences, and an associate professor in Morphology and Syntax at the Sofia University “St. Kl. Ohridski”. Her research interests are in the areas of morphology, syntax, lexical knowledge bases, formal grammars, corpus linguistics, question answering, and HPSG.