TUTORIALS
18 - 20 September 2005
The main RANLP conference will be preceded by three days of tutorials delivered by distinguished lecturers. We plan six half-day tutorials, each lasting 180 minutes and structured as follows: 60 min talk + 20 min break + 60 min talk + 20 min break + 60 min talk.
Tutorial lecturers:
Jan Hajic (Charles University, Prague)
Bernardo Magnini (IRST, Trento)
Rada Mihalcea (University of North Texas)
Dragomir Radev (University of Michigan)
John Tait (University of Sunderland)
Zdenka Uresova (Charles University, Prague)
Michael Zock (CNRS)
The tutorial timetable is as follows (please note that the lecture schedule will be announced later):
Time          | September 18, 2005 (Sunday)                               | September 19, 2005 (Monday)             | September 20, 2005 (Tuesday)
9.00 - 12.40  | Bernardo Magnini (IRST, Trento)                           | Dragomir Radev (University of Michigan) | Rada Mihalcea (University of North Texas)
12.40 - 14.00 | Lunch break                                               | Lunch break                             | Lunch break
14.00 - 17.40 | Jan Hajic and Zdenka Uresova (Charles University, Prague) | John Tait (University of Sunderland)    | Michael Zock (CNRS)
Abstracts
Bernardo Magnini, IRST, Trento
Open Domain Question Answering: Techniques, Systems and Evaluation
Open Domain Question Answering (QA) systems accept natural language questions (as opposed to keywords) and return exact answers to the user (as opposed to a list of scored documents). Spurred by the QA track of the TREC evaluation campaign, QA has become both a hot research topic in Computational Linguistics and a challenging prospect for real-world applications. This tutorial will report on current techniques and resources for Question Answering, including recent work on Cross-Language Question Answering, where questions are expressed in a source language and the answer is searched for in a document collection in a different language. Finally, special attention will be paid to evaluation methodologies, a necessary step for measuring the progress of Question Answering systems.
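As a concrete illustration of such evaluation, here is a minimal sketch, in Python, of the mean reciprocal rank (MRR) measure used in the early TREC QA tracks; the questions and answers below are invented for illustration.

    # Mean reciprocal rank (MRR): each question scores 1/rank of the first
    # correct answer among the system's ranked candidates (0 if none is
    # correct); the final score is the average over all questions.

    def mean_reciprocal_rank(ranked_answers, gold_answers):
        """ranked_answers: one ranked candidate list per question.
        gold_answers: one set of acceptable answers per question."""
        total = 0.0
        for candidates, gold in zip(ranked_answers, gold_answers):
            for rank, answer in enumerate(candidates, start=1):
                if answer in gold:
                    total += 1.0 / rank
                    break
        return total / len(ranked_answers)

    # Toy run: the first question is answered at rank 1, the second at rank 3.
    system_output = [["Paris", "Lyon"], ["1969", "1972", "1968"]]
    gold = [{"Paris"}, {"1968"}]
    print(mean_reciprocal_rank(system_output, gold))  # (1/1 + 1/3) / 2 = 0.67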
Jan Hajic and Zdenka Uresova, Charles University, Prague, Czech Republic
The Prague Dependency Treebank and Valency Annotation
The tutorial will introduce the Prague Dependency Treebank project, which aims at the complex manual annotation of a substantial amount of naturally occurring Czech sentences. The Prague Dependency Treebank has three levels of annotation: morphological, analytical (describing surface syntax in a dependency fashion) and tectogrammatical, which combines syntax and semantics into a language meaning representation, keeping the dependency structure as the core of the annotation but adding coreference, topic/focus annotation, and a detailed semantic labeling of every sentence unit. Special attention and space will be devoted to the notion of valency (both verbal and nominal), which is an important part of the annotation of the Prague Dependency Treebank at the tectogrammatical level. The PDT-Vallex valency dictionary and its relation to the morphological and syntactic annotation of the corpus will be described in detail. A demonstration of the annotation process will be part of the tutorial. Examples of valency dictionary entries and annotated structures (taken from Czech and English, at least) will be presented throughout the tutorial. To maximize the benefit of the tutorial, it is recommended that prospective participants acquaint themselves with the Prague Dependency Treebank or obtain version 1.0 before the tutorial (LDC publication Catalog No. LDC2001T10, or directly at http://ufal.mff.cuni.cz, where a preview of version 2.0 can also be found); however, the tutorial will be self-contained even for those who do not have the chance to do so.
Tentative Syllabus:
Lecture 1 (60 min.) The Prague Dependency Treebank
- The Prague Dependency Treebank introduction
- Morphological tagset and annotation
- Analytical (surface syntax) dependency annotation
- PDT version 1.0 characteristics, tools available
- Morphological and dependency annotation of other languages using the PDT 1.0 scheme
Lecture 2 (60 min.) Tectogrammatical Annotation of the Prague Dependency Treebank
- Tectogrammatical annotation: an overview
- Structure and semantics: annotation units, dependency and functions
- Additional dimensions: coreference and topic/focus annotation
- Valency and the valency dictionary: introduction
- Linking annotated data and the dictionary: adding lexical sense
Lecture 3 (60 min.) Valency: Combining Syntax and Semantics
- Valency dictionary principles in detail
- Valency dictionary entry structure: valency slots, labels and form expressions (illustrated in the sketch after this syllabus)
- Linking function and form, regular form metamorphosis
- Using the valency dictionary for text generation
- Demonstration of the valency dictionary and the annotation process
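To make the Lecture 3 topics concrete, here is a hypothetical, much simplified sketch of a valency frame as a data structure. The functor labels (ACT for actor, PAT for patient, ADDR for addressee) follow tectogrammatical practice, but the entry and its form constraints are invented and do not reproduce the actual PDT-Vallex format.

    # A simplified, hypothetical valency frame: a lemma plus labeled slots.

    from dataclasses import dataclass

    @dataclass
    class ValencySlot:
        functor: str      # semantic label, e.g. "ACT" (actor), "PAT" (patient)
        obligatory: bool  # must the slot be filled?
        forms: list       # admissible surface forms for this slot

    @dataclass
    class ValencyFrame:
        lemma: str
        slots: list

    # "give" in the sense of giving something to somebody: ACT + PAT + ADDR.
    give = ValencyFrame(
        lemma="give",
        slots=[
            ValencySlot("ACT", True, ["subject"]),
            ValencySlot("PAT", True, ["direct object"]),
            ValencySlot("ADDR", True, ["indirect object", "to + NP"]),
        ],
    )

    for slot in give.slots:
        status = "obligatory" if slot.obligatory else "optional"
        print(f"{give.lemma}: {slot.functor} ({status}) -> {slot.forms}")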
Dragomir Radev, University of Michigan (http://tangra.si.umich.edu/~radev)
Information Retrieval
Information is everywhere. We encounter it in our everyday
lives in the form of email, newspapers, television, the Web, and even in conversations
with each other. Information is hidden in a variety of media - text, images,
sounds, videos. While casual information consumers can simply enjoy its abundance
and appreciate the existence of search engines that can help them find what
they want, information professionals are responsible for building the underlying
technology that search engines use.
Given the time constraints, most of this tutorial will cover classic concepts
of (text-based) Information Retrieval as they are related to the creation of
a search engine. A portion of the time will be reserved for some of the issues
where Natural Language Processing and Information Retrieval intersect.
Syllabus:
Introduction
Documents and Queries
Indexing and Search
Stemming
Word distributions
Retrieval evaluation
The vector model
Document similarity (see the sketch after this syllabus)
Text classification and clustering
Latent semantic indexing
Natural language IR: question answering, word-sense disambiguation, cross-lingual IR
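As a pointer to the material behind "The vector model" and "Document similarity" above, here is a minimal sketch of TF-IDF weighting with cosine similarity; the three toy documents are invented for illustration.

    # Minimal vector-space model: TF-IDF weighting plus cosine similarity.

    import math
    from collections import Counter

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "information retrieval with the vector model"]
    tokenized = [d.split() for d in docs]
    vocab = sorted(set(w for doc in tokenized for w in doc))

    def tf_idf_vector(doc):
        counts = Counter(doc)
        vec = []
        for word in vocab:
            tf = counts[word] / len(doc)
            df = sum(1 for d in tokenized if word in d)
            vec.append(tf * math.log(len(tokenized) / df))
        return vec

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    vectors = [tf_idf_vector(d) for d in tokenized]
    print(cosine(vectors[0], vectors[1]))  # some overlap ("sat", "on")
    print(cosine(vectors[0], vectors[2]))  # nothing shared beyond "the" (IDF 0)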
John Tait, University of Sunderland, UK
Image Retrieval
The ever-increasing volumes of non-textual digital information (like digital
photographs) are a huge problem in the modern world. How many of us have digital
photographs on hard discs or servers which we know are there but can no longer
find?
Traditionally the means of dealing with this problem has been to assign
textual descriptors of one sort or another to the images and then to use techniques
from the much better developed field of text information retrieval to index
and retrieve the image data. However, in all but the highest-value professional
applications manual descriptor assignment is no longer acceptable in terms of
the time and effort required.
One possible solution is to use automatic means of assigning key word descriptors
to images.
The aims of the tutorial will be to acquaint attendees with the state of
the art in assigning key words to images and to outline the current research
agenda.
The tutorial will begin with a general review of image retrieval and the
problems it poses, assuming no prior knowledge. It will relate image retrieval
to work in language and especially lexical approaches to language.
It will then move on to review the progress made in the last few years on the
task of assigning descriptive terms to still images. A particular focus will be
on the contrast between the generative approach originated by Barnard and others
working with David Forsyth at the University of California, Berkeley, and the
categorical approach pioneered at the University of Sunderland.
The generative approach has adapted the language modeling techniques which have
received wide acceptance in speech understanding, statistical machine translation
and information retrieval to the image annotation problem. Essentially, it seeks
to generate a description of the image (with a degree of likelihood of correctness)
based on learned features of previously seen images.
The categorical approach, on the other hand, seeks to identify the class of
images to which a particular descriptive term applies. Again, it is based on
supervised and semi-supervised machine learning.
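To make the categorical framing concrete, here is a minimal sketch of one supervised decision per descriptive term, assuming invented four-dimensional feature vectors as stand-ins for real image descriptors (e.g. colour histograms). It does not reproduce the Sunderland system; it only illustrates the classification idea.

    # One binary decision per descriptive term: a nearest-centroid rule over
    # image feature vectors. The "features" below are invented stand-ins for
    # real descriptors such as colour histograms or texture statistics.

    import numpy as np

    def centroid_classifier(positives, negatives):
        """Train by averaging the feature vectors of each class."""
        pos_c = positives.mean(axis=0)
        neg_c = negatives.mean(axis=0)
        def applies(features):
            # The term applies if the image is closer to the positive centroid.
            return np.linalg.norm(features - pos_c) < np.linalg.norm(features - neg_c)
        return applies

    # Toy training images for the term "sunset": warm-toned vs. others.
    sunset_images = np.array([[0.9, 0.4, 0.1, 0.2], [0.8, 0.5, 0.2, 0.1]])
    other_images = np.array([[0.1, 0.3, 0.8, 0.7], [0.2, 0.2, 0.9, 0.6]])

    is_sunset = centroid_classifier(sunset_images, other_images)
    print(is_sunset(np.array([0.85, 0.45, 0.15, 0.2])))  # True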
Prospects for extending these current approaches to vocabularies of thousands
of words will be discussed.
Recent developments have provided some interesting insights into the relationship
between (human) language and vision, some of which resonate with recent work in
language learning and with work from the 1970s in psychology.
Rada Mihalcea, University of North Texas
Graph-based Algorithms for Information Retrieval and Natural Language Processing
Graph theory is a well-studied discipline, and so are the fields of natural language processing and information retrieval. However, most of the time they are perceived as different disciplines, with different algorithms, different applications, and different potential end-users.
The goal of this tutorial is to provide an overview of methods and applications in natural language processing and information retrieval that rely on graph-based algorithms. This will include techniques for graph traversal, minimum path length, min-cut algorithms, importance of nodes in graphs, etc., and their application to information retrieval and Web search, text understanding (word sense disambiguation), text summarization, keyword extraction, text clustering, and others.
The exact set of topics covered is subject to minor changes.
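As one concrete instance of node importance applied to keyword extraction, the sketch below runs a PageRank-style iteration over a small word co-occurrence graph; the graph and the damping factor are illustrative choices, not a description of any particular system covered in the tutorial.

    # PageRank-style importance scores on an undirected word co-occurrence
    # graph, the basic device behind graph-based keyword extraction.

    def node_scores(graph, d=0.85, iterations=30):
        """graph: dict mapping each node to the list of its neighbours."""
        scores = {node: 1.0 for node in graph}
        for _ in range(iterations):
            scores = {
                node: (1 - d) + d * sum(scores[nb] / len(graph[nb])
                                        for nb in graph[node])
                for node in graph
            }
        return scores

    # Toy graph: edges link words that co-occur within a small window.
    graph = {
        "graph": ["algorithms", "theory", "nodes"],
        "algorithms": ["graph", "retrieval"],
        "theory": ["graph"],
        "nodes": ["graph", "retrieval"],
        "retrieval": ["algorithms", "nodes"],
    }

    for word, score in sorted(node_scores(graph).items(), key=lambda x: -x[1]):
        print(f"{word}: {score:.3f}")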
Michael Zock, CNRS
Natural Language Generation: snapshot of a fast-evolving discipline
BACKGROUND AND OBJECTIVE
Answering a question, giving a talk, translating a document, and writing a letter or a summary are tasks
that most of us perform quite regularly. Natural as they may seem, they have taken us many years to learn.
A great many researchers have tried over the last three decades to explain the underlying processes and
to mimic them, building tools that could simulate or support them, perform them semi-automatically
(interactive generation) or entirely on their own (fully automated generation).
The importance of these contributions grows (a) with the amount of information to be processed (the need
for automatic summaries, translations, business letters, etc.), and (b) with the extent to which we
believe that we should access the knowledge put into machines in natural language (natural language
front-ends).
The goal of this tutorial is twofold: (a) to make the non-specialist aware of the problems, potential and
achievements of this discipline, and (b) to help bridge the gap that still exists between the experts of
the different disciplines involved (e.g., linguists, psychologists, computer scientists).
The organization of the tutorial will parallel the typical order in which people proceed when producing
language: starting from a situation that yields one or several discourse goals, we discuss the choices
people make at the various levels (conceptual, linguistic) in order to reach their objectives. Thus,
given some goal (e.g. explaining something to somebody), we select and structure messages (deep
generation), for which we then try to find the corresponding linguistic forms, i.e. words and sentence
patterns (surface generation).
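The following toy sketch illustrates this two-stage view; all goals, messages and templates are invented. Deep generation selects and structures message content; surface generation maps each message onto words and a sentence pattern.

    # A toy two-stage generator: deep generation selects and structures
    # messages for a goal; surface generation maps them onto sentence
    # patterns. All goals, messages and templates are invented.

    KNOWLEDGE = {
        "explain-rain": [
            ("step", "warm, moist air", "rises and cools"),
            ("step", "water vapour", "condenses into droplets"),
            ("result", "rain", "falls"),
        ],
    }

    TEMPLATES = {
        "step": "{subject} {predicate}.",
        "result": "as a result, {subject} {predicate}.",
    }

    def deep_generation(goal):
        """Select and structure the messages relevant to the discourse goal."""
        return KNOWLEDGE[goal]

    def surface_generation(messages):
        """Find linguistic forms: words and sentence patterns for each message."""
        sentences = []
        for kind, subject, predicate in messages:
            s = TEMPLATES[kind].format(subject=subject, predicate=predicate)
            sentences.append(s[0].upper() + s[1:])
        return " ".join(sentences)

    print(surface_generation(deep_generation("explain-rain")))
    # Warm, moist air rises and cools. Water vapour condenses into droplets.
    # As a result, rain falls.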
This tutorial will show the evolution of a fascinating discipline which, beyond its obvious industrial
benefits (the sale of NLP software), has the potential to extend our minds by freeing them from routine
tasks. Simulating the entire process may also shed some light on the underlying mental processes, that is
to say, the way we drive our thoughts or use our minds.