TUTORIALS
18 - 20 September 2005
The main RANLP conference will be preceded by three days of tutorials delivered by distinguished lecturers. We plan six half-day tutorials, each lasting 180 minutes and structured as follows: 60 min talk + 20 min break + 60 min talk + 20 min break + 60 min talk.
Tutorial lecturers:
Jan Hajic (Charles University, Prague)
Bernardo Magnini (IRST, Trento)
Rada Mihalcea (University of North Texas)
Dragomir Radev (University of Michigan)
John Tait (University of Sunderland)
Zdenka Uresova (Charles University, Prague)
Michael Zock (CNRS)
The tutorial timetable is as follows (please note that the lecture schedule will be announced later):
Time          | September 18, 2005 (Sunday)                               | September 19, 2005 (Monday)             | September 20, 2005 (Tuesday)
9.00 - 12.40  | Bernardo Magnini (IRST, Trento)                           | Dragomir Radev (University of Michigan) | Rada Mihalcea (University of North Texas)
12.40 - 14.00 | Lunch break                                               | Lunch break                             | Lunch break
14.00 - 17.40 | Jan Hajic and Zdenka Uresova (Charles University, Prague) | John Tait (University of Sunderland)    | Michael Zock (CNRS)
Abstracts
Bernardo Magnini, IRST, Trento
Open Domain Question Answering: Techniques, Systems and Evaluation
Open Domain Question Answering (QA) systems accept natural language questions (as opposed to keywords) and return exact answers to the user (as opposed to a list of scored documents). Spurred by the QA track of the TREC evaluation campaign, QA has become both a hot research topic in Computational Linguistics and a challenging prospect for real-world applications. This tutorial will report on current techniques and resources for Question Answering, including recent work on Cross-Language Question Answering, where questions are expressed in a source language and the answer is searched for in a document collection in a different language. Finally, special attention will be paid to evaluation methodologies, a necessary step for measuring the progress of Question Answering systems.
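As a concrete illustration of such evaluation, here is a minimal sketch, in Python, of the mean reciprocal rank (MRR) measure used in the early TREC QA tracks; the questions and answers below are invented for illustration.

    # Mean reciprocal rank (MRR): each question scores 1/rank of the first
    # correct answer among the system's ranked candidates (0 if none is
    # correct); the final score is the average over all questions.

    def mean_reciprocal_rank(ranked_answers, gold_answers):
        """ranked_answers: one ranked candidate list per question.
        gold_answers: one set of acceptable answers per question."""
        total = 0.0
        for candidates, gold in zip(ranked_answers, gold_answers):
            for rank, answer in enumerate(candidates, start=1):
                if answer in gold:
                    total += 1.0 / rank
                    break
        return total / len(ranked_answers)

    # Toy run: the first question is answered at rank 1, the second at rank 3.
    system_output = [["Paris", "Lyon"], ["1969", "1972", "1968"]]
    gold = [{"Paris"}, {"1968"}]
    print(mean_reciprocal_rank(system_output, gold))  # (1/1 + 1/3) / 2 = 0.67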
Jan Hajic and Zdenka Uresova, Charles University, Prague, Czech Republic
The Prague Dependency Treebank and Valency Annotation
The tutorial will introduce the Prague Dependency Treebank project, which aims at the complex manual annotation of a substantial amount of naturally occurring Czech sentences. The Prague Dependency Treebank has three levels of annotation: morphological, analytical (describing surface syntax in a dependency fashion) and tectogrammatical, which combines syntax and semantics into a language meaning representation, keeping the dependency structure as the core of the annotation but adding coreference, topic/focus annotation, and a detailed semantic labeling of every sentence unit. Special attention and space will be devoted to the notion of valency (both verbal and nominal), which is an important part of the annotation of the Prague Dependency Treebank at the tectogrammatical level. The PDT-Vallex valency dictionary and its relation to the morphological and syntactic annotation of the corpus will be described in detail. A demonstration of the annotation process will be part of the tutorial. Examples of valency dictionary entries and annotated structures (taken from Czech and English, at least) will be presented throughout the tutorial. To maximize the benefit of the tutorial, it is recommended that prospective participants acquaint themselves with the Prague Dependency Treebank or obtain version 1.0 before the tutorial (LDC publication Catalog No. LDC2001T10, or directly at http://ufal.mff.cuni.cz, where a preview of version 2.0 can also be found); however, the tutorial will be self-contained even for those who do not have the chance to do so.
Tentative Syllabus:
Lecture 1 (60 min.) The Prague Dependency Treebank
- The Prague Dependency Treebank introduction
- Morphological tagset and annotation
- Analytical (surface syntax) dependency annotation
- PDT version 1.0 characteristics, tools available
- Morphological and dependency annotation of other languages using the PDT 1.0 scheme
Lecture 2 (60 min.) Tectogrammatical Annotation of the Prague Dependency Treebank
- Tectogrammatical annotation: an overview
- Structure and semantics: annotation units, dependency and functions
- Additional dimensions: coreference and topic/focus annotation
- Valency and the valency dictionary: introduction
- Linking annotated data and the dictionary: adding lexical sense
Lecture 3 (60 min.) Valency: Combining Syntax and Semantics
- Valency dictionary principles in detail
- Valency dictionary entry structure: valency slots, labels and form expressions (illustrated in the sketch after this syllabus)
- Linking function and form, regular form metamorphosis
- Using the valency dictionary for text generation
- Demonstration of the valency dictionary and the annotation process
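To make the Lecture 3 topics concrete, here is a hypothetical, much simplified sketch of a valency frame as a data structure. The functor labels (ACT for actor, PAT for patient, ADDR for addressee) follow tectogrammatical practice, but the entry and its form constraints are invented and do not reproduce the actual PDT-Vallex format.

    # A simplified, hypothetical valency frame: a lemma plus labeled slots.

    from dataclasses import dataclass

    @dataclass
    class ValencySlot:
        functor: str      # semantic label, e.g. "ACT" (actor), "PAT" (patient)
        obligatory: bool  # must the slot be filled?
        forms: list       # admissible surface forms for this slot

    @dataclass
    class ValencyFrame:
        lemma: str
        slots: list

    # "give" in the sense of giving something to somebody: ACT + PAT + ADDR.
    give = ValencyFrame(
        lemma="give",
        slots=[
            ValencySlot("ACT", True, ["subject"]),
            ValencySlot("PAT", True, ["direct object"]),
            ValencySlot("ADDR", True, ["indirect object", "to + NP"]),
        ],
    )

    for slot in give.slots:
        status = "obligatory" if slot.obligatory else "optional"
        print(f"{give.lemma}: {slot.functor} ({status}) -> {slot.forms}")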
Dragomir Radev, University of Michigan (http://tangra.si.umich.edu/~radev)
Information Retrieval
Information is everywhere. We encounter it in our everyday
lives in the form of email, newspapers, television, the Web, and even in conversations
with each other. Information is hidden in a variety of media - text, images,
sounds, videos. While casual information consumers can simply enjoy its abundance
and appreciate the existence of search engines that can help them find what
they want, information professionals are responsible for building the underlying
technology that search engines use.
Given the time constraints, most of this tutorial will cover classic concepts
of (text-based) Information Retrieval as they are related to the creation of
a search engine. A portion of the time will be reserved for some of the issues
where Natural Language Processing and Information Retrieval intersect.
Syllabus:
Introduction
Documents and Queries
Indexing and Search
Stemming
Word distributions
Retrieval evaluation
The vector model
Document similarity (see the sketch after this syllabus)
Text classification and clustering
Latent semantic indexing
Natural language IR: question answering, word-sense disambiguation, cross-lingual IR
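As a pointer to the material behind "The vector model" and "Document similarity" above, here is a minimal sketch of TF-IDF weighting with cosine similarity; the three toy documents are invented for illustration.

    # Minimal vector-space model: TF-IDF weighting plus cosine similarity.

    import math
    from collections import Counter

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "information retrieval with the vector model"]
    tokenized = [d.split() for d in docs]
    vocab = sorted(set(w for doc in tokenized for w in doc))

    def tf_idf_vector(doc):
        counts = Counter(doc)
        vec = []
        for word in vocab:
            tf = counts[word] / len(doc)
            df = sum(1 for d in tokenized if word in d)
            vec.append(tf * math.log(len(tokenized) / df))
        return vec

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    vectors = [tf_idf_vector(d) for d in tokenized]
    print(cosine(vectors[0], vectors[1]))  # some overlap ("sat", "on")
    print(cosine(vectors[0], vectors[2]))  # nothing shared beyond "the" (IDF 0)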
John Tait, University of Sunderland, UK
Image Retrieval
The ever-increasing volumes of non-textual digital information (like digital
photographs) are a huge problem in the modern world. How many of us have digital
photographs on hard discs or servers which we know are there but can no longer
find?
Traditionally the means of dealing with this problem has been to assign
textual descriptors of one sort or another to the images and then to use techniques
from the much better developed field of text information retrieval to index
and retrieve the image data. However, in all but the highest-value professional
applications manual descriptor assignment is no longer acceptable in terms of
the time and effort required.
One possible solution is to use automatic means of assigning key word descriptors
to images.
The aims of the tutorial will be to acquaint attendees with the state of
the art in assigning key words to images and to outline the current research
agenda.
The tutorial will begin with a general review of image retrieval and the
problems it poses, assuming no prior knowledge. It will relate image retrieval
to work in language and especially lexical approaches to language.
It will then move on to review the progress made in the last few years on the
task of assigning descriptive terms to still images. A particular focus will be
on the contrast between the generative approach originated by Barnard and others
working with David Forsyth at the University of California, Berkeley, and the
categorical approach pioneered at the University of Sunderland.
The generative approach has adapted the language modeling techniques which have
received wide acceptance in speech understanding, statistical machine translation
and information retrieval to the image annotation problem. Essentially, it seeks
to generate a description of the image (with a degree of likelihood of correctness)
based on learned features of previously seen images.
The categorical approach, on the other hand, seeks to identify the class of
images to which a particular descriptive term applies. Again, it is based on
supervised and semi-supervised machine learning.
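To make the categorical framing concrete, here is a minimal sketch of one supervised decision per descriptive term, assuming invented four-dimensional feature vectors as stand-ins for real image descriptors (e.g. colour histograms). It does not reproduce the Sunderland system; it only illustrates the classification idea.

    # One binary decision per descriptive term: a nearest-centroid rule over
    # image feature vectors. The "features" below are invented stand-ins for
    # real descriptors such as colour histograms or texture statistics.

    import numpy as np

    def centroid_classifier(positives, negatives):
        """Train by averaging the feature vectors of each class."""
        pos_c = positives.mean(axis=0)
        neg_c = negatives.mean(axis=0)
        def applies(features):
            # The term applies if the image is closer to the positive centroid.
            return np.linalg.norm(features - pos_c) < np.linalg.norm(features - neg_c)
        return applies

    # Toy training images for the term "sunset": warm-toned vs. others.
    sunset_images = np.array([[0.9, 0.4, 0.1, 0.2], [0.8, 0.5, 0.2, 0.1]])
    other_images = np.array([[0.1, 0.3, 0.8, 0.7], [0.2, 0.2, 0.9, 0.6]])

    is_sunset = centroid_classifier(sunset_images, other_images)
    print(is_sunset(np.array([0.85, 0.45, 0.15, 0.2])))  # True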
Prospects for extending these current approaches to vocabularies of thousands
of words will be discussed.
Recent developments have provided some interesting insights into the relationship
between (human) language and vision, some of which resonate with recent work in
language learning and with work from the 1970s in psychology.
Rada Mihalcea, University of North Texas
Graph-based Algorithms for Information Retrieval and Natural Language Processing
Graph theory is a well-studied discipline, and so are the fields of natural language processing and information retrieval. However, most of the time they are perceived as different disciplines, with different algorithms, different applications, and different potential end-users.
The goal of this tutorial is to provide an overview of methods and applications in natural language processing and information retrieval that rely on graph-based algorithms. This will include techniques for graph traversal, minimum path length, min-cut algorithms, importance of nodes in graphs, etc., and their application to information retrieval and Web search, text understanding (word sense disambiguation), text summarization, keyword extraction, text clustering, and others.
The exact set of topics covered is subject to minor changes.
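As one concrete instance of node importance applied to keyword extraction, the sketch below runs a PageRank-style iteration over a small word co-occurrence graph; the graph and the damping factor are illustrative choices, not a description of any particular system covered in the tutorial.

    # PageRank-style importance scores on an undirected word co-occurrence
    # graph, the basic device behind graph-based keyword extraction.

    def node_scores(graph, d=0.85, iterations=30):
        """graph: dict mapping each node to the list of its neighbours."""
        scores = {node: 1.0 for node in graph}
        for _ in range(iterations):
            scores = {
                node: (1 - d) + d * sum(scores[nb] / len(graph[nb])
                                        for nb in graph[node])
                for node in graph
            }
        return scores

    # Toy graph: edges link words that co-occur within a small window.
    graph = {
        "graph": ["algorithms", "theory", "nodes"],
        "algorithms": ["graph", "retrieval"],
        "theory": ["graph"],
        "nodes": ["graph", "retrieval"],
        "retrieval": ["algorithms", "nodes"],
    }

    for word, score in sorted(node_scores(graph).items(), key=lambda x: -x[1]):
        print(f"{word}: {score:.3f}")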
Michael Zock, CNRS
Natural Language Generation: snapshot of a fast-evolving discipline
BACKGROUND AND OBJECTIVE
Answering a question, giving a talk, translating a document, and writing a letter or a summary are tasks
that most of us perform quite regularly. Natural as they may seem, they have taken us many years to learn.
A great many researchers have tried over the last three decades to explain the underlying processes and
to mimic them, building tools that could simulate or support them, perform them semi-automatically
(interactive generation) or entirely on their own (fully automated generation).
The importance of these contributions grows (a) with the amount of information to be processed (the need
for automatic summaries, translations, business letters, etc.), and (b) with the extent to which we
believe that we should access the knowledge put into machines in natural language (natural language
front-ends).
The goal of this tutorial is twofold: (a) to make the non-specialist aware of the problems, potential and
achievements of this discipline, and (b) to help bridge the gap that still exists between the experts of
the different disciplines involved (e.g., linguists, psychologists, computer scientists).
The organization of the tutorial will parallel the typical order in which people proceed when producing
language: starting from a situation that yields one or several discourse goals, we discuss the choices
people make at the various levels (conceptual, linguistic) in order to reach their objectives. Thus,
given some goal (e.g. explaining something to somebody), we select and structure messages (deep
generation), for which we then try to find the corresponding linguistic forms, i.e. words and sentence
patterns (surface generation).
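The following toy sketch illustrates this two-stage view; all goals, messages and templates are invented. Deep generation selects and structures message content; surface generation maps each message onto words and a sentence pattern.

    # A toy two-stage generator: deep generation selects and structures
    # messages for a goal; surface generation maps them onto sentence
    # patterns. All goals, messages and templates are invented.

    KNOWLEDGE = {
        "explain-rain": [
            ("step", "warm, moist air", "rises and cools"),
            ("step", "water vapour", "condenses into droplets"),
            ("result", "rain", "falls"),
        ],
    }

    TEMPLATES = {
        "step": "{subject} {predicate}.",
        "result": "as a result, {subject} {predicate}.",
    }

    def deep_generation(goal):
        """Select and structure the messages relevant to the discourse goal."""
        return KNOWLEDGE[goal]

    def surface_generation(messages):
        """Find linguistic forms: words and sentence patterns for each message."""
        sentences = []
        for kind, subject, predicate in messages:
            s = TEMPLATES[kind].format(subject=subject, predicate=predicate)
            sentences.append(s[0].upper() + s[1:])
        return " ".join(sentences)

    print(surface_generation(deep_generation("explain-rain")))
    # Warm, moist air rises and cools. Water vapour condenses into droplets.
    # As a result, rain falls.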
This tutorial will show the evolution of a fascinating discipline which, beyond its obvious industrial
benefits (the sale of NLP software), has the potential to extend our minds by freeing them from routine
tasks. Simulating the entire process may also shed some light on the underlying mental processes, that is
to say, the way we drive our thoughts or use our minds.