7-13 September
Hissar, Bulgaria
ranlp2013@lml.bas.bg
Nicoletta Calzolari (Institute of Computational Linguistics "Antonio Zampolli", Pisa)
"Policy issues for Language Resources in the Data era – The challenges of openness, interoperability, collaboration"
Summary: Language Technology is a data-intensive field and major breakthroughs have stemmed from a larger use of Language Resources (LR). Technical/scientific advances are important, but infrastructural/policy issues play a major role in the LR field. The challenges ahead depend on a coherent strategy involving all the dimensions relevant for LRs, around which we organised the FLaReNet recommendations: a) Infrastructure, b) Documentation, c) Development, d) Interoperability, e) Coverage, Quality and Adequacy, f) Availability, Sharing and Distribution, g) Sustainability, h) Recognition and i) International cooperation. Taken together, as a coherent system, these directions contribute to a sustainable LR ecosystem.
In the paradigm of open, distributed language infrastructures based on sharing LRs, services and tools, the only way for our field to achieve the status of a mature science lies in initiatives that enable us to join forces both in the creation of large LR pools and in big collaborative experiments using these LRs. This will better serve the needs of language applications, making it possible to build on each other's achievements, integrate results (also with Linked Data), and make them accessible to various systems, thus coping with the need for more “knowledge intensive” LRs for effective multilingual content processing. This also requires an effort to push towards a culture of “service to the community” where everyone has to contribute. This “cultural change” is not a minor issue.
Iryna Gurevych (Technical University Darmstadt)
"Chasing the crowd: automatically assessing the quality of content in Wikipedia"
Summary: Automatically assessing the quality of content is a fundamental task to be addressed for using the vast amount of knowledge on the Web. In this talk, I will present experimental work to assess the quality of content in Wikipedia, the largest online encyclopedia with more than 23 million articles in 285 languages. We experiment with modeling the quality in Wikipedia according to several experimental setups based on different types of user-generated quality judgments. These include automatically classifying the articles as being of distinguished quality or suffering from specific quality flaws, and learning the quality model for distinct quality dimensions, such as objectivity, trustworthiness, completeness, and readability. We analyze the usefulness of lexical, network- and structure-based, or readability features as well as features derived from the Wikipedia revisions and discussions. This way, we gain better insight into collaborative writing processes and derive recommendations for the design of future writing assistance systems on the Web.
Horacio Saggion (University Pompeu Fabra, Barcelona)
"Automatic Text Simplification: What for?"
Summary: Automatic text simplification (ATS) is a complex task which encompasses a number of operations applied to a text at different linguistic levels. The aim is to turn “complex” textual input into a simplified variant, taking into consideration the specific needs of a particular target user or task. ATS can serve as a preprocessing tool for other NLP applications but, most importantly, it can have a social function, making content accessible to different types of users. One could argue, naively maybe, that ATS has the potential to make electronic textual content equally accessible to everyone. ATS has been on the NLP research agenda for a number of years and, although some progress has been made in different aspects of the text simplification problem, there are still issues to be resolved. In this presentation, I will discuss the problem of text simplification and give an overview of the different NLP paradigms applied to solve it. I will take the opportunity to report on a number of interesting developments at our laboratory to make textual content in Spanish more accessible.
Violeta Seretan (University of Geneva)
"Collocation Extraction Based on Syntactic Criteria"
Summary: Collocations – typical combinations of words like “to meet a need”, “to break a record”, “to believe deeply”, or “heavy rain” – are pervasive in language. Due to their encoding idiomaticity, they are of paramount importance for natural language processing tasks dealing with text production, such as machine translation. At the same time, they are important for language analysis tasks, where they can act as lexical and structural disambiguators. Collocations are the most numerous among all types of multi-word expressions. Arguably, they are also the most difficult to process, because they show high morphosyntactic flexibility, which makes them more similar to regular word combinations. The problem of morphosyntactic variation is even more acute in a multilingual setting, when dealing with languages with a richer morphology and a freer word order than English. In this talk, I will present a methodological framework for collocation identification based on syntactic criteria, and will discuss results obtained from text corpora in several languages. Furthermore, I will discuss evaluation experiments comparing these results against those of standard methods based on linear proximity constraints. I will also show that the syntax-based approach provides a better opportunity for scalability, as the strong filter applied on candidate data represents a convenient way to get around the combinatorial explosion problem affecting standard approaches.
Mark Stevenson (University of Sheffield)
"Large Scale Word Sense Disambiguation for the Biomedical Domain"
Summary: The amount of research literature available in the biomedical domain is vast and growing at an exponential rate. However, access to the information in these documents is hampered by the fact that they contain a range of lexical ambiguities. For example, the term “fit” can describe a patient who is well (“fit and well”) or refer to a seizure. This talk will describe work on Word Sense Disambiguation for the biomedical domain that has been carried out at Sheffield University. It will introduce some of the lexical ambiguities that are found in biomedical documents, such as polysemous terms and abbreviations with multiple expansions. Various methods for resolving these ambiguities will be introduced, including supervised and unsupervised approaches. The talk will also describe how these approaches are used to create Word Sense Disambiguation systems that can be applied on a large scale to disambiguate all ambiguous terms in biomedical documents.
Dekai Wu (Hong Kong University of Science & Technology)
"Re-Architecting The Core: What SMT Should Be Teaching Us About Machine Learning"
Summary: Learning to translate between languages is one of the fascinating grand challenges of science precisely because it encompasses so many of the cornerstone capabilities of machine learning and language acquisition - capabilities like grammar induction, unsupervised learning, category formation, chunking, relational abstraction, transduction acquisition, contextual disambiguation, inductive bias, and semantic generalization. Yet 25 years into statistical machine translation research, an overwhelming proportion of SMT work still fails to attack these fundamental problems. Instead, we are increasingly in danger of becoming mired in a plateau, trapped by ever more complex “stacks-of-hacks” architectures that typically combine long chains of often-mismatched heuristic modules to memorize ever larger corpora. Such systems require far more computational and data resources than should be necessary, because they fail to learn meaningful cross-lingual generalizations. What will it take to surmount the current obstacles? In this talk we consider what key steps are needed to shift to a semantic SMT paradigm: the machine learning principles we must stop ignoring, the right things to measure empirically, and the types of models and algorithms that would address current SMT deficiencies.