Memory-Based Language Processing

Walter Daelemans, University of Antwerp / Tilburg University

One of the goals of the discipline of 'machine learning of natural language' should be to achieve more insight into which properties of language match which properties of learning algorithms. I will introduce memory-based learning and show how its properties (similarity-based, local, classification-based, lazy learning) fit the properties of natural language processing tasks (sparse data, information source fusion, interaction of regularities and exceptions). I will also provide a brief overview of the results of applying this learning method to tasks such as word sense disambiguation, speech synthesis, and shallow parsing.
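
As a rough illustration of the "similarity-based, lazy" properties named above, the following minimal sketch stores training examples verbatim and classifies a new instance by majority vote among its nearest stored neighbours. The overlap distance, the function names, and the toy tagging data are assumptions made for illustration only; they are not taken from the talk or from any particular memory-based learning package.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def classify(memory, instance, k=1):
    """Label `instance` by majority vote among its k nearest stored examples."""
    # Lazy learning: no model is built in advance; all work happens at query time.
    nearest = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (previous tag, word, next tag) -> tag of the word.
memory = [
    (("DET", "book", "VERB"), "NOUN"),
    (("PRON", "book", "DET"), "VERB"),
    (("DET", "walk", "PREP"), "NOUN"),
    (("PRON", "walk", "PREP"), "VERB"),
]

if __name__ == "__main__":
    print(classify(memory, ("DET", "book", "PREP"), k=1))
    print(classify(memory, ("PRON", "book", "PREP"), k=1))
```

Because every example is kept, exceptions remain available as neighbours instead of being abstracted away, which is one way to read the "interaction of regularities and exceptions" mentioned in the abstract.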

Measures of Semantic Relatedness and the Detection and Correction of Real-Word Spelling Errors

Graeme Hirst and Alexander Budanitsky, University of Toronto

Lexical semantic relatedness in text reflects cohesion in the text. Lexical semantic relationships include synonymy, hypernymy, and meronymy. But what degree of closeness by such links counts as semantic relatedness, and how can it be measured? We experimentally compared five different proposed measures of similarity or semantic relatedness in WordNet by examining their performance in a real-word spelling correction system. We will discuss what these results mean for applications in NLP, and also outline a method of real-word spelling correction that approaches practical usefulness.
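
To make "degree of closeness by such links" concrete, here is a minimal sketch of one simple idea: counting the links on the shortest path between two words in a hypernym hierarchy and turning that count into a score. The toy taxonomy, the function names, and the 1 / (1 + distance) scoring are assumptions for illustration; this is not claimed to be one of the five measures compared in the talk, and it does not use WordNet itself.

```python
from collections import deque

# Hypothetical toy taxonomy: child -> parent (hypernym) links.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def neighbours(word):
    """Return the words one hypernym/hyponym link away from `word`."""
    up = [HYPERNYM[word]] if word in HYPERNYM else []
    down = [child for child, parent in HYPERNYM.items() if parent == word]
    return up + down


def path_length(a, b):
    """Breadth-first search for the fewest links between a and b, or None."""
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def relatedness(a, b):
    """Map path length to a score in (0, 1]; unconnected words score 0."""
    dist = path_length(a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)


if __name__ == "__main__":
    print(relatedness("dog", "wolf"), relatedness("dog", "cat"))
```

In a real-word spelling corrector of the kind the abstract describes, a score like this could be used to ask whether a word is related to any of its neighbours in the text, but the thresholds and the choice of measure are exactly what the talk's experiments address.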

Bootstrapping Morphological Analyzers

Kemal Oflazer, Sabanci University
(Joint work with Sergei Nirenburg and Marjorie McShane)

This talk presents a semi-automatic technique for developing broad-coverage finite-state morphological analyzers for use in natural language processing applications. It consists of three components -- elicitation of linguistic information from humans, a machine learning bootstrapping scheme, and a testing environment. The three components are applied iteratively until a threshold of output quality is attained. This elicit-build-test technique compiles lexical and inflectional information elicited from a human into a finite-state transducer lexicon and combines this with a sequence of morphographemic rewrite rules that is induced using transformation-based learning from the elicited examples. The resulting morphological analyzer is then tested against a test suite, and any corrections are fed back into the learning procedure, which builds an improved analyzer.

Ambiguity and Nondeterminism

Martin Kay, Xerox PARC

Linguistic ambiguity and computational nondeterminism are generally thought of as two sides of the same coin. Constraint-based linguistic theories, for example, typically have nothing to say about ambiguity, on the assumption that it will be handled through general mechanisms, like charts. This can result in a greatly increased computational burden, particularly in cases of so-called "early binding". This occurs when a decision is made in the course of a search that has consequences only later, so that the process continues for some time along parallel tracks. Constraint-based linguistic theories lead to early binding in many ways, most egregiously in their reliance on lexical rules to provide for passivization, dative shift, and the like. I will suggest that other mechanisms, requiring greater involvement by the grammar writer, might lead to dramatic savings in processing time.

Learning, Collecting, and Using Ontological Knowledge for NLP

Eduard Hovy, University of Southern California

People have long talked about having NLP systems employ semantic knowledge on a large scale. However, no one has yet built a large ontology that is practically useful for tasks such as question answering, machine translation, and information retrieval. Work on WordNet, the major contender, shows that it requires more content to realize its full potential, while efforts to use CYC show how hard it is to build general-purpose ontologies that can support NLP applications. In this talk I outline some recent efforts to automatically acquire knowledge that may be placed into terminological ontologies and used by NLP systems. Discussing topic signatures and instantial information, I focus on methods of acquisition, purity measures, and uses.

Reconsidering the Frame Problem in Linguistic Reasoning

James Pustejovsky, Brandeis University

In this talk, I discuss some modifications and enhancements to subevental models of event structure, based on data that prove difficult to handle under current event-based theories. These data mostly involve "contradictions of change", which are descriptions that, by virtue of the events they participate in, no longer hold without contradiction. To solve these cases, I will outline an algorithm for computing the maximally coherent event description associated with a sentence. This results in a semantic representation I call the 'event persistence structure', computed as an extension of the event structure. I argue that this is a natural manifestation of the linguistically motivated entailments regarding change and persistence in a sentence, and can be derived compositionally from sentential interpretation. One of the consequences of this analysis is that the chain of states associated with an object in discourse is initially projected from the lexical and compositional semantic properties of expressions in the sentence and represented structurally in the event persistence structure. I view this level of representation as the starting point from which discourse inference is subsequently computed.