The main RANLP conference will be preceded by three days of tutorials delivered
by distinguished lecturers.
We plan six half-day tutorials, each lasting 180 minutes, structured
as follows:
60 min talk + 20 min break + 60 min talk + 20 min break + 60 min talk.
              | Sunday 7/09                   | Monday 8/09               | Tuesday 9/09
9:00 - 12:40  | Irion Technologies BV:        | Sheffield University:     | Bar Ilan University:
              | Wordnet, EuroWordNet and      | Named Entity Recognition  | Learning in NLP: When can we
              | Global Wordnet                |                           | reduce or avoid annotation cost?
12:40 - 14:00 | Lunch break                   | Lunch break               | Lunch break
14:00 - 17:40 | University of Iasi:           | IBM T.J. Watson:          | Georgetown University:
              | Discourse theories and        | Question Answering        | Automatic Summarization
              | technologies
Abstracts
Dan Cristea, University of Iasi, Discourse theories and technologies
What is discourse? The difference between text and discourse.
Coherence and cohesion.
Attentional state theory (Grosz and Sidner): the three components of the theory; plusses and minuses of the model.
Rhetorical structure theory (Mann and Thompson): discourse unit, relation scheme, hypo- and hyper-tactic relations, nuclearity, examples of rhetorical relations.
Centering theory (Grosz, Joshi and Weinstein): locality in centering, backward- and forward-looking centers, transitions, rules, what the theory explains, how to apply it.
Veins theory (Cristea, Ide and Romary): intuitions, head and vein expressions, referential accessibility, conjectures, validation.
Segmenting and parsing discourse: Marcu's parser, VT parser.
Related issues on discourse: anaphora, summarisation, information extraction, text understanding.
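The centering transitions listed in the outline can be sketched as a small classifier following the standard definitions. This is an illustrative sketch only; the function and argument names are invented:

```python
def classify_transition(cb_prev, cb_cur, cp_cur):
    """Classify the centering transition between two consecutive utterances.

    cb_prev -- backward-looking center (Cb) of the previous utterance, or None
    cb_cur  -- backward-looking center (Cb) of the current utterance
    cp_cur  -- preferred center (Cp): the highest-ranked forward-looking
               center of the current utterance
    """
    # Cb is "continued" if it matches the previous Cb (or no previous Cb exists)
    same_cb = cb_prev is None or cb_cur == cb_prev
    if same_cb:
        return "CONTINUE" if cb_cur == cp_cur else "RETAIN"
    return "SMOOTH-SHIFT" if cb_cur == cp_cur else "ROUGH-SHIFT"
```

For example, if two successive utterances keep talking about the same entity and it remains the most salient one, the transition is a CONTINUE; switching the Cb while promoting a new entity to Cp yields a shift.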
The tutorial slides can be downloaded from Dan Cristea's home page.
Piek Vossen, Irion Technologies BV, Wordnet, EuroWordNet and Global Wordnet
In this tutorial I will describe the motivation and structure of the EuroWordNet database. In this database, wordnets in 8 European languages are stored as independent conceptual structures but are also connected via a separate Inter-Lingual-Index. Many choices made in EuroWordNet had a practical background. Still, the database also raises some fundamental questions that need to be addressed. How language-dependent are conceptual structures? How universal are lexicalizations? Is it possible to achieve a high connectivity between wordnets? How should the Inter-Lingual-Index be structured? The EuroWordNet project ended in 1999. Since then, many more wordnets have been developed, and the framework to coordinate and connect these wordnets has been continued by the Global Wordnet Association. Some of these projects will be presented briefly, showing that further consolidation and standardization of wordnet resources is needed in the future.
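The architecture described above can be sketched in miniature: each language's wordnet is an independent set of synsets, and equivalence links into a shared Inter-Lingual-Index (ILI) make synsets comparable across languages. This is a hypothetical toy model, not the actual EuroWordNet database format, and all entries are invented:

```python
# A shared Inter-Lingual-Index: language-neutral concept records.
ili = {"ILI-dog": "canine mammal, Canis familiaris"}

# Independent monolingual wordnets, each synset carrying an equivalence
# link ("eq_synonym") into the ILI rather than into other languages directly.
wordnets = {
    "en": {"dog.n.01":   {"lemmas": ["dog"],   "eq_synonym": "ILI-dog"}},
    "nl": {"hond.n.01":  {"lemmas": ["hond"],  "eq_synonym": "ILI-dog"}},
    "es": {"perro.n.01": {"lemmas": ["perro"], "eq_synonym": "ILI-dog"}},
}

def translations(lang, synset_id, target_lang):
    """Find target-language synsets linked to the same ILI record."""
    ili_id = wordnets[lang][synset_id]["eq_synonym"]
    return [sid for sid, s in wordnets[target_lang].items()
            if s["eq_synonym"] == ili_id]
```

The design point this illustrates: with n languages, linking every wordnet to a single index requires n sets of equivalence links rather than n*(n-1) pairwise mappings.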
Hamish Cunningham, Sheffield University, Named Entity Recognition
This tutorial will introduce research on Named Entity (NE) recognition, which is an important first step in Information Extraction. NE recognition is the process of identifying proper nouns in text and classifying them into a given set of categories, e.g., Person, Organization, Location. NE recognition as a task in its own right began as part of the Message Understanding Conferences (MUC), and the tutorial will start by introducing the task as defined in MUC, what applications it has, and how it is evaluated. Then we will survey recent approaches to NE recognition, which broadly fall into two categories: rule-based and machine learning-based. The tutorial will then discuss the most recent work on multilingual NE recognition, some approaches, and their performance. Finally, we conclude by outlining future directions and challenges in NE recognition.
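As a toy illustration of the rule-based category, NE recognition can be approximated with gazetteer lookup plus a few contextual patterns. This is a sketch only; the gazetteer entries and the single title rule below are invented, not drawn from any real system:

```python
import re

# Tiny illustrative gazetteers (hypothetical entries).
gazetteer = {
    "Person": {"Hamish Cunningham", "John Prager"},
    "Organization": {"Sheffield University", "IBM"},
    "Location": {"Bulgaria", "Borovets"},
}

# One contextual rule: a title followed by a capitalized word is a Person,
# even if the name is not in any gazetteer.
TITLE = re.compile(r"\b(?:Mr|Dr|Prof)\.\s+[A-Z][a-z]+\b")

def tag_entities(text):
    """Return (span_text, category) pairs found by lookup and the title rule."""
    found = []
    for category, names in gazetteer.items():
        for name in names:
            if name in text:
                found.append((name, category))
    found += [(m.group(0), "Person") for m in TITLE.finditer(text)]
    return found
```

Real rule-based systems combine many such resources and patterns, and resolve overlapping matches; machine learning-based systems instead induce the decision criteria from annotated corpora.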
John Prager, IBM T.J. Watson Research Center, Question Answering
Question Answering is an area that is attracting increasing attention in
both the commercial and scientific worlds. Partly due to its inter-disciplinary
nature, it is giving rise to many interesting and challenging technical problems,
which are very different from those of plain search, or document retrieval.
In this tutorial we will look at both past and present activities in the QA
field, and also project where the field is going. We will spend the bulk of
the time concentrating on features of state-of-the-art systems: we will look
at some individual systems and infer some general principles. We will examine
approaches that are primarily based on information retrieval, on natural language
processing, on pattern-matching (both on and off the Web), on statistical modelling
and on knowledge and inferencing, and also see how these different techniques
can be combined.
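As a toy illustration of the pattern-matching approach mentioned above: for a question type such as "When was X born?", a handful of surface patterns are matched against retrieved text. The patterns and corpus lines below are invented for illustration:

```python
import re

# A miniature "retrieved document" collection (invented examples).
corpus = [
    "Mozart (1756-1791) was a prolific composer.",
    "Gauss was born in 1777 in Brunswick.",
]

def answer_birth_year(name):
    """Answer 'When was <name> born?' via surface patterns over the corpus."""
    patterns = [
        re.compile(re.escape(name) + r"\s*\((\d{4})-"),          # "X (YYYY-"
        re.compile(re.escape(name) + r"\s+was born in\s+(\d{4})"),
    ]
    for line in corpus:
        for pat in patterns:
            m = pat.search(line)
            if m:
                return m.group(1)
    return None
```

The appeal of this approach is that such patterns can be learned from the Web with minimal linguistic machinery; its limitation, which motivates combining it with NLP- and knowledge-based techniques, is that it handles only question types with predictable surface realizations.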
Some of the topics covered will include:
If time allows, we will look beyond QA on English text. This tutorial has
no special prerequisites, except for a minimal understanding of NLP and IR concepts.
There will be a little mathematics, but only a little.
Ido Dagan, Bar Ilan University, Learning in NLP: When can we reduce or avoid annotation cost?
In the past decade, statistical and machine learning approaches have become prominent in NLP research. Often, the Achilles heel of such approaches is that they require substantial manual annotation of training data, from which the system learns the underlying language concepts and structures defined by humans. Indeed, many of the published results in this area rely on standard training materials assembled within dedicated projects and evaluation frameworks, such as the Penn Treebank, the Brown Corpus, SemCor, the TREC, MUC and Senseval evaluations, and the shared tasks of the CoNLL conferences.
On the other hand, a variety of methods and approaches have been investigated that
aim to avoid manual annotation or to reduce substantially the amount of annotation
required. This tutorial will review a broad range of approaches that tackle
this general goal from quite different perspectives. The various methods will be
presented in the context of the different learning paradigms within which they
were applied, including:
The overall goal of the tutorial will be to provide a "condensed" perspective on attempts to reduce annotation cost, and to draw attention to the growing need to pursue such directions further in order to increase the applicability of empirical NLP in research and industry.
Comments: The tutorial material has substantial overlap with some of the basic material in statistical NLP, such as that covered in the book by Manning and Schütze. We plan to get into the algorithmic details of several methods, as time permits.
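One family of methods in this area, self-training (bootstrapping), can be sketched in a few lines: start from a handful of labelled seeds, train a classifier, label the unlabelled pool, and move the most confident prediction into the training set. The word-overlap classifier and all data below are invented purely for illustration:

```python
from collections import Counter

# A few labelled seed examples and a pool of unlabelled text (all invented).
seeds = [("the striker scored a goal", "sport"),
         ("shares fell on the market", "finance")]
unlabelled = ["the keeper saved the goal",
              "the market rallied and shares rose",
              "a late goal won the match"]

def train(examples):
    """Build a word-frequency profile per label."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(text.split())
    return profiles

def predict(profiles, text):
    """Score each label by word overlap; return best label and its score."""
    words = set(text.split())
    scores = {lab: sum(c[w] for w in words) for lab, c in profiles.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]

labelled = list(seeds)
for _ in range(2):                       # a couple of self-training rounds
    profiles = train(labelled)
    scored = [(predict(profiles, t), t) for t in unlabelled]
    scored.sort(key=lambda x: -x[0][1])  # most confident first
    (label, _), text = scored[0]         # promote the single best item
    labelled.append((text, label))
    unlabelled.remove(text)
```

The hoped-for effect is that a small seed set propagates labels through the pool; the well-known risk, discussed under this paradigm, is that early mislabellings reinforce themselves in later rounds.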
Download tutorial and bibliography
Inderjeet Mani, Georgetown University, Automatic Summarization
With the explosion in the quantity of on-line text and multimedia information
in recent years, demand for automatic text summarization technology is growing.
The goal of automatic text summarization is to take a source document or documents,
extract the information content, and present the most important content
in a condensed form, in a manner sensitive to the needs of the user and task.
This tutorial provides an overview of the latest developments in automatic summarization,
including methods for producing extracts and abstracts, evaluation strategies,
and new problem and application areas. Human abstracting, as well as automatic
linguistic and statistical methods will be described, with an emphasis on empirical
results. The tutorial concludes with a discussion of existing evaluation methods
and results.
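A minimal frequency-based extract, in the spirit of classic statistical summarizers, can be sketched as follows. This is an illustrative toy, not the tutorial's own method; the stoplist and scoring are arbitrary choices:

```python
import re
from collections import Counter

STOP = {"the", "a", "of", "in", "and", "is", "to", "it"}

def summarize(text, n_sentences=1):
    """Extract the n highest-scoring sentences, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Term weights: corpus-wide frequency of non-stopword tokens.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    freq = Counter(words)

    def score(sent):
        toks = [w for w in re.findall(r"[a-z']+", sent.lower()) if w not in STOP]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in ranked)
```

Such sentence extraction is only one end of the spectrum the tutorial covers; producing true abstracts additionally requires compressing and regenerating text rather than copying sentences verbatim.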
Inderjeet Mani is an Associate Professor of Linguistics at Georgetown University. He has published two books on summarization: Automatic Summarization (Benjamins 2001) and (co-edited) Advances in Automatic Text Summarization (MIT Press 1999), and is also co-editing the forthcoming book The Language of Time: A Reader (Oxford University Press 2004). He has led projects in summarization, temporal information extraction, and information retrieval funded by ARDA, DARPA, MITRE, and NSF.