The main RANLP conference will be preceded by three days of tutorials delivered
by distinguished lecturers.
We plan six half-day tutorials, each lasting 180 minutes, structured
as follows:
60 min talk + 20 min break + 60 min talk + 20 min break + 60 min talk.
              | Sunday 7/09                   | Monday 8/09               | Tuesday 9/09
9:00 - 12:40  | Irion Technologies BV:        | Sheffield University:     | Bar Ilan University:
              | Wordnet, EuroWordNet and      | Named Entity Recognition  | Learning in NLP: When can we
              | Global Wordnet                |                           | reduce or avoid annotation cost?
12:40 - 14:00 | Lunch break                   | Lunch break               | Lunch break
14:00 - 17:40 | University of Iasi:           | IBM T.J. Watson:          | Georgetown University:
              | Discourse theories and        | Question Answering        | Automatic Summarization
              | technologies
Abstracts
Dan Cristea, University of Iasi, Discourse theories and technologies
What is discourse? The difference between text and discourse.
Coherence and cohesion.
Attentional state theory (Grosz and Sidner): the three components of the theory; plusses and minuses of the model.
Rhetorical structure theory (Mann and Thompson): discourse unit, relation scheme, hypo- and hyper-tactic relations, nuclearity, examples of rhetorical relations.
Centering theory (Grosz, Joshi and Weinstein): locality in centering, backward- and forward-looking centers, transitions, rules, what the theory explains, how to apply it.
Veins theory (Cristea, Ide and Romary): intuitions, head and vein expressions, referential accessibility, conjectures, validation.
Segmenting and parsing discourse: Marcu's parser, VT parser.
Related issues on discourse: anaphora, summarisation, information extraction, text understanding.
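The centering transitions listed in the outline can be sketched as a small classifier following the standard definitions. This is an illustrative sketch only; the function and argument names are invented:

```python
def classify_transition(cb_prev, cb_cur, cp_cur):
    """Classify the centering transition between two consecutive utterances.

    cb_prev -- backward-looking center (Cb) of the previous utterance, or None
    cb_cur  -- backward-looking center (Cb) of the current utterance
    cp_cur  -- preferred center (Cp): the highest-ranked forward-looking
               center of the current utterance
    """
    # Cb is "continued" if it matches the previous Cb (or no previous Cb exists)
    same_cb = cb_prev is None or cb_cur == cb_prev
    if same_cb:
        return "CONTINUE" if cb_cur == cp_cur else "RETAIN"
    return "SMOOTH-SHIFT" if cb_cur == cp_cur else "ROUGH-SHIFT"
```

For example, if two successive utterances keep talking about the same entity and it remains the most salient one, the transition is a CONTINUE; switching the Cb while promoting a new entity to Cp yields a shift.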
The tutorial slides can be downloaded from Dan Cristea's home page.
Piek Vossen, Irion Technologies BV, Wordnet, EuroWordNet and Global Wordnet
In this tutorial I will describe the motivation and structure of the EuroWordNet database. In this database, wordnets in 8 European languages are stored as independent conceptual structures but are also connected via a separate Inter-Lingual-Index. Many choices made in EuroWordNet had a practical background. Still, the database also raises some fundamental questions that need to be addressed. How language-dependent are conceptual structures? How universal are lexicalizations? Is it possible to achieve a high connectivity between wordnets? How should the Inter-Lingual-Index be structured? The EuroWordNet project ended in 1999. Since then, many more wordnets have been developed, and the framework to coordinate and connect these wordnets has been continued by the Global Wordnet Association. Some of these projects will be presented briefly, showing that further consolidation and standardization of wordnet resources is needed in the future.
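The architecture described above can be sketched in miniature: each language's wordnet is an independent set of synsets, and equivalence links into a shared Inter-Lingual-Index (ILI) make synsets comparable across languages. This is a hypothetical toy model, not the actual EuroWordNet database format, and all entries are invented:

```python
# A shared Inter-Lingual-Index: language-neutral concept records.
ili = {"ILI-dog": "canine mammal, Canis familiaris"}

# Independent monolingual wordnets, each synset carrying an equivalence
# link ("eq_synonym") into the ILI rather than into other languages directly.
wordnets = {
    "en": {"dog.n.01":   {"lemmas": ["dog"],   "eq_synonym": "ILI-dog"}},
    "nl": {"hond.n.01":  {"lemmas": ["hond"],  "eq_synonym": "ILI-dog"}},
    "es": {"perro.n.01": {"lemmas": ["perro"], "eq_synonym": "ILI-dog"}},
}

def translations(lang, synset_id, target_lang):
    """Find target-language synsets linked to the same ILI record."""
    ili_id = wordnets[lang][synset_id]["eq_synonym"]
    return [sid for sid, s in wordnets[target_lang].items()
            if s["eq_synonym"] == ili_id]
```

The design point this illustrates: with n languages, linking every wordnet to a single index requires n sets of equivalence links rather than n*(n-1) pairwise mappings.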
Hamish Cunningham, Sheffield University, Named Entity Recognition
This tutorial will introduce research on Named Entity (NE) recognition, which is an important first step in Information Extraction. NE recognition is the process of identifying proper nouns in text and classifying them into a given set of categories, e.g., Person, Organization, Location. NE recognition as a task in its own right began as part of the Message Understanding Conferences (MUC), and the tutorial will start by introducing the task as defined in MUC, what applications it has, and how it is evaluated. Then we will survey recent approaches to NE recognition, which broadly fall into two categories: rule-based and machine learning-based. The tutorial will then discuss the most recent work on multilingual NE recognition, some approaches, and their performance. Finally, we conclude by outlining future directions and challenges in NE recognition.
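As a toy illustration of the rule-based category, NE recognition can be approximated with gazetteer lookup plus a few contextual patterns. This is a sketch only; the gazetteer entries and the single title rule below are invented, not drawn from any real system:

```python
import re

# Tiny illustrative gazetteers (hypothetical entries).
gazetteer = {
    "Person": {"Hamish Cunningham", "John Prager"},
    "Organization": {"Sheffield University", "IBM"},
    "Location": {"Bulgaria", "Borovets"},
}

# One contextual rule: a title followed by a capitalized word is a Person,
# even if the name is not in any gazetteer.
TITLE = re.compile(r"\b(?:Mr|Dr|Prof)\.\s+[A-Z][a-z]+\b")

def tag_entities(text):
    """Return (span_text, category) pairs found by lookup and the title rule."""
    found = []
    for category, names in gazetteer.items():
        for name in names:
            if name in text:
                found.append((name, category))
    found += [(m.group(0), "Person") for m in TITLE.finditer(text)]
    return found
```

Real rule-based systems combine many such resources and patterns, and resolve overlapping matches; machine learning-based systems instead induce the decision criteria from annotated corpora.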
John Prager, IBM T.J. Watson Research Center, Question Answering
Question Answering is an area that is attracting increasing attention in
both the commercial and scientific worlds. Partly due to its inter-disciplinary
nature, it is giving rise to many interesting and challenging technical problems,
which are very different from those of plain search, or document retrieval.
In this tutorial we will look at both past and present activities in the QA
field, and also project where the field is going. We will spend the bulk of
the time concentrating on features of state-of-the-art systems: we will look
at some individual systems and infer some general principles. We will examine
approaches that are primarily based on information retrieval, on natural language
processing, on pattern-matching (both on and off the Web), on statistical modelling
and on knowledge and inferencing, and also see how these different techniques
can be combined.
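As a toy illustration of the pattern-matching approach mentioned above: for a question type such as "When was X born?", a handful of surface patterns are matched against retrieved text. The patterns and corpus lines below are invented for illustration:

```python
import re

# A miniature "retrieved document" collection (invented examples).
corpus = [
    "Mozart (1756-1791) was a prolific composer.",
    "Gauss was born in 1777 in Brunswick.",
]

def answer_birth_year(name):
    """Answer 'When was <name> born?' via surface patterns over the corpus."""
    patterns = [
        re.compile(re.escape(name) + r"\s*\((\d{4})-"),          # "X (YYYY-"
        re.compile(re.escape(name) + r"\s+was born in\s+(\d{4})"),
    ]
    for line in corpus:
        for pat in patterns:
            m = pat.search(line)
            if m:
                return m.group(1)
    return None
```

The appeal of this approach is that such patterns can be learned from the Web with minimal linguistic machinery; its limitation, which motivates combining it with NLP- and knowledge-based techniques, is that it handles only question types with predictable surface realizations.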
Some of the topics covered will include:
If time allows, we will look beyond QA on English text. This tutorial has
no special prerequisites, except for a minimal understanding of NLP and IR concepts.
There will be a little mathematics, but only a little.
Ido Dagan, Bar Ilan University, Learning in NLP: When can we reduce or avoid annotation cost?
In the past decade, statistical and machine learning approaches have become prominent in NLP research. Often, the Achilles heel of such approaches is that they require substantial manual annotation of training data, from which the system learns the underlying language concepts and structures defined by humans. Indeed, many of the published results in this area rely on standard training materials assembled within dedicated projects and evaluation frameworks, such as the Penn Treebank, the Brown Corpus, SemCor, the TREC, MUC and Senseval evaluations, and the shared tasks of the CoNLL conferences.
On the other hand, a variety of methods and approaches have been investigated that
aim to avoid manual annotation or to reduce substantially the amount of annotation
required. This tutorial will review a broad range of approaches that tackle
this general goal from quite different perspectives. The various methods will be
presented in the context of the different learning paradigms within which they
were applied, including:
The overall goal of the tutorial will be to provide a "condensed" perspective on attempts to reduce annotation cost, and to draw attention to the growing need to pursue such directions further in order to increase the applicability of empirical NLP in research and industry.
Comments: The tutorial material has substantial overlap with some of the basic material in statistical NLP, such as that covered in the book by Manning and Schütze. We plan to get into the algorithmic details of several methods, as time permits.
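One family of methods in this area, self-training (bootstrapping), can be sketched in a few lines: start from a handful of labelled seeds, train a classifier, label the unlabelled pool, and move the most confident prediction into the training set. The word-overlap classifier and all data below are invented purely for illustration:

```python
from collections import Counter

# A few labelled seed examples and a pool of unlabelled text (all invented).
seeds = [("the striker scored a goal", "sport"),
         ("shares fell on the market", "finance")]
unlabelled = ["the keeper saved the goal",
              "the market rallied and shares rose",
              "a late goal won the match"]

def train(examples):
    """Build a word-frequency profile per label."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(text.split())
    return profiles

def predict(profiles, text):
    """Score each label by word overlap; return best label and its score."""
    words = set(text.split())
    scores = {lab: sum(c[w] for w in words) for lab, c in profiles.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]

labelled = list(seeds)
for _ in range(2):                       # a couple of self-training rounds
    profiles = train(labelled)
    scored = [(predict(profiles, t), t) for t in unlabelled]
    scored.sort(key=lambda x: -x[0][1])  # most confident first
    (label, _), text = scored[0]         # promote the single best item
    labelled.append((text, label))
    unlabelled.remove(text)
```

The hoped-for effect is that a small seed set propagates labels through the pool; the well-known risk, discussed under this paradigm, is that early mislabellings reinforce themselves in later rounds.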
Download tutorial and bibliography
Inderjeet Mani, Georgetown University, Automatic Summarization
With the explosion in the quantity of on-line text and multimedia information
in recent years, demand for automatic text summarization technology is growing.
The goal of automatic text summarization is to take a source document or documents,
extract the information content, and present the most important content
in a condensed form, in a manner sensitive to the needs of the user and task.
This tutorial provides an overview of the latest developments in automatic summarization,
including methods for producing extracts and abstracts, evaluation strategies,
and new problem and application areas. Human abstracting, as well as automatic
linguistic and statistical methods will be described, with an emphasis on empirical
results. The tutorial concludes with a discussion of existing evaluation methods
and results.
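A minimal frequency-based extract, in the spirit of classic statistical summarizers, can be sketched as follows. This is an illustrative toy, not the tutorial's own method; the stoplist and scoring are arbitrary choices:

```python
import re
from collections import Counter

STOP = {"the", "a", "of", "in", "and", "is", "to", "it"}

def summarize(text, n_sentences=1):
    """Extract the n highest-scoring sentences, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Term weights: corpus-wide frequency of non-stopword tokens.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    freq = Counter(words)

    def score(sent):
        toks = [w for w in re.findall(r"[a-z']+", sent.lower()) if w not in STOP]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in ranked)
```

Such sentence extraction is only one end of the spectrum the tutorial covers; producing true abstracts additionally requires compressing and regenerating text rather than copying sentences verbatim.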
Inderjeet Mani is an Associate Professor of Linguistics at Georgetown University. He has published two books on summarization: Automatic Summarization (Benjamins 2001) and (co-edited) Advances in Automatic Text Summarization (MIT Press 1999), and is also co-editing the forthcoming book The Language of Time: A Reader (Oxford University Press 2004). He has led projects in summarization, temporal information extraction, and information retrieval funded by ARDA, DARPA, MITRE, and NSF.