Marcello Federico (Fondazione Bruno Kessler, Trento)
"When machine translation meets human translators"
Summary: My main research interest is the symbiotic integration of human and machine translation (MT). We all know that human translation is slow and expensive, while MT is fast and cheap, but far from publication quality. However, machine and human translation can strongly benefit from each other, in multiple ways. There has been increasing evidence that MT can boost the productivity of human translators by providing them with draft translations to post-edit. On the other hand, we recently proved that human post-editing can be leveraged to dynamically adapt MT. These results paved the way to interesting application scenarios and new research challenges for MT, such as learning and adapting from human feedback, optimizing MT performance towards minimum post-editing effort, and reliably measuring the impact of MT on human productivity. In the first part of my talk, I will review the main results and lessons learned from the recently concluded MateCat EU project, in which we addressed the smooth integration of human and machine translation and developed an innovative translation workbench, also called MateCat, now used by thousands of professional translators. In the second part of my talk, I will outline some of the application barriers that still prevent the wide adoption of MT by the translation industry and, finally, present the new open-source MT software we are currently developing in the MMT EU project, which aims to overcome these barriers and thus become the ideal companion for MateCat.
Khalil Sima'an (University of Amsterdam)
"Reordering Grammar and Hidden Treebanks"
Summary: Word order differences between sentences constitute a major challenge for machine translation. The formal paradigm of syntax-directed transduction, inspired by compositional semantics, works by modifying a (syntactic) tree of the source sentence only locally before generating the target translation. However, the majority of sentence pairs found in parallel corpora turn out to be difficult to explain under this paradigm without very complex and non-local modifications to the source tree. Therefore, it has been rather difficult to bring syntactic knowledge to bear upon statistical machine translation. Yet the general idea of “syntax-directed transduction” has attractive properties, and in this talk I will dwell on the challenge of finding a statistical Reordering Grammar hidden in translation data. I will present Reordering Grammar (EMNLP 2015), a synchronous grammar obtained by factorizing word alignments down to their prime components. Reordering Grammar (RG) has a range of novel and attractive properties, e.g., (a) it develops a novel view of a word-aligned parallel corpus as a “hidden treebank”, (b) it is induced efficiently from a word-aligned parallel corpus in a way similar to inducing monolingual parsers from treebanks, and (c) it allows n-ary reordering beyond the binary Inversion Transduction Grammar, showing a better statistical fit to the training data.
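To make the factorization concrete, here is a minimal sketch of the underlying idea (a toy illustration of my own, not the EMNLP 2015 implementation; the names factorize and split_into_blocks are mine): a word-alignment permutation is recursively decomposed into a tree whose node labels are prime permutations over contiguous blocks.

    def split_into_blocks(perm, k):
        """Split perm into exactly k source-side blocks, each covering a
        contiguous range of target positions; return None if impossible."""
        if k == 1:
            return [perm] if max(perm) - min(perm) + 1 == len(perm) else None
        for i in range(1, len(perm) - k + 2):
            head, rest = perm[:i], perm[i:]
            if max(head) - min(head) + 1 == len(head):
                tail = split_into_blocks(rest, k - 1)
                if tail is not None:
                    return [head] + tail
        return None

    def factorize(perm):
        """Decompose a permutation into a tree whose node labels are the
        permutations of the smallest possible number of contiguous blocks."""
        if len(perm) == 1:
            return perm[0]
        for k in range(2, len(perm) + 1):
            blocks = split_into_blocks(perm, k)
            if blocks is not None:
                # Node label: how the k blocks are reordered on the target side.
                target_order = sorted(range(k), key=lambda j: min(blocks[j]))
                label = [0] * k
                for rank, j in enumerate(target_order):
                    label[j] = rank
                return (tuple(label), [factorize(b) for b in blocks])

    print(factorize([1, 0, 2]))     # nested binary nodes, as in ITG
    print(factorize([1, 3, 0, 2]))  # one prime 4-ary node: beyond binary ITG

The second permutation cannot be binarized; this is precisely the case where n-ary nodes are needed, and it is such tree fragments, read off the word-aligned corpus, that play the role of the hidden treebank.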
Idan Szpektor (Yahoo Research, Haifa)
"Natural Language Processing for Community Question Answering"
Summary: Community Question Answering (CQA) sites, in which people ask questions and other people answer them, have become quite popular in recent years. These sites contain hundreds of millions of questions and answers in different languages, and their content often appears on search engine result pages. The types of questions on CQA sites range from opinion and recommendation seeking to technical and factual questions. Still, most questions expect a “human touch” in the answer. In this talk, I will introduce the CQA ecosystem and its relationship with Web search. I will then demonstrate how NLP techniques can help address fundamental CQA tasks. These include Latent Dirichlet Allocation for personalized diversification in question recommendation, Explicit Semantic Analysis for handling language differences between questions and answers in automatic question answering, and syntactic analysis of questions for improving Web search over CQA sites. As CQA is a user-centric world, some of the experiments I’ll present were performed on live users, and their results shed light on how users react to the output of recommendation algorithms.
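To give a rough idea of how Explicit Semantic Analysis bridges the vocabulary gap between questions and answers, here is a minimal sketch; the vocabulary, concepts and weights below are invented for the example, whereas a real ESA matrix is derived from Wikipedia articles.

    import numpy as np

    # Toy term-by-concept matrix standing in for the Wikipedia-derived
    # ESA matrix (illustrative numbers only).
    vocab = ["laptop", "notebook", "battery", "charge", "recipe"]
    concepts = ["Computing", "Electric power", "Cooking"]
    T = np.array([
        [0.9, 0.1, 0.0],   # laptop
        [0.8, 0.1, 0.1],   # notebook
        [0.3, 0.9, 0.0],   # battery
        [0.1, 0.9, 0.0],   # charge
        [0.0, 0.0, 1.0],   # recipe
    ])

    def esa_vector(text):
        """Map a bag of words onto the concept space by summing term vectors."""
        idx = [vocab.index(w) for w in text.lower().split() if w in vocab]
        return T[idx].sum(axis=0) if idx else np.zeros(T.shape[1])

    def cosine(u, v):
        norm = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / norm) if norm else 0.0

    question = "notebook battery"
    answer = "charge laptop"   # shares no word with the question
    print(cosine(esa_vector(question), esa_vector(answer)))

Although the question and the answer have no words in common, their concept vectors are close, which is the property exploited when matching and ranking answers.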
Piek Vossen (VU University Amsterdam)
"From mentions in text to instances in RDF: cross-lingual interpretation of unstructured news in the NewsReader project"
Summary: We monitor the news to learn about changes in the world. However, every working day millions of news articles are published, and many more news messages appear in social media. How can we handle this massive bombardment of information, while our world is becoming more and more global and connected? How can we avoid being selective and biased in our view of the world? In the NewsReader project, we developed programs that read these massive streams of daily news across four languages (English, Dutch, Spanish and Italian) to extract what happened, when and where, and who was involved. By recording the changes day by day, we build up a knowledge store that records history over longer spans of time. Our technology interprets natural language text to build a formal representation of these changes over which computers can reason. You can ask the computer to provide the history of individual persons, companies, places and regions, find connections, derive social networks, and detect trends and long-term developments across all types of events. So far, we could only measure how much news there is on a given day. Now we can start asking how much the world changed yesterday according to the news. We processed millions of articles on various topics related to the financial and economic domain, coming from thousands of different sources. This reveals many of the stories that took place during the financial-economic crisis of the last ten years, but it also shows how these sources differ from each other: who tells which part of the story, and where sources disagree or diverge. In this way, we learn not only about the changes in the world but also about the media that report on them.
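To give a flavour of what "from mentions in text to instances in RDF" amounts to, here is a minimal sketch with rdflib; the namespace, property names and example event are placeholders of my own, while the project itself relies on established event and provenance vocabularies.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/news#")   # placeholder vocabulary

    g = Graph()
    g.bind("ex", EX)

    # One event instance, abstracted away from its many textual mentions.
    event = EX["acquisition_42"]
    g.add((event, RDF.type, EX.AcquisitionEvent))
    g.add((event, EX.hasActor, EX["Company_A"]))
    g.add((event, EX.hasPatient, EX["Company_B"]))
    g.add((event, EX.hasTime, Literal("2008-09-15", datatype=XSD.date)))
    g.add((event, EX.reportedBy, EX["newspaper_X_article_123"]))

    # Ask the knowledge store for everything Company_A was involved in, and when.
    q = """
    SELECT ?event ?date WHERE {
        ?event ex:hasActor ex:Company_A ;
               ex:hasTime  ?date .
    }
    """
    for row in g.query(q, initNs={"ex": EX}):
        print(row.event, row.date)

Once events extracted from millions of articles are stored this way, questions such as the history of a single company over the last ten years become queries over the knowledge store.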
Bonnie Webber (University of Edinburgh, Scotland)
"Towards improving the discourse coherence of SMT output"
Summary: Statistical Machine Translation (SMT) is currently limited by two forms of locality: (1) the single sentence, which limits how much is translated at one time, with SMT systems standardly processing sentences independently of one another and without regard to order; (2) the N-gram locality of the SMT Language Model, which limits how much of an output translation can be simultaneously assessed as a good sub-string in the target language.
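For concreteness, locality (2) stems from the standard n-gram factorization of the language model (a textbook formula, not specific to this talk), in which each word is scored given only the previous n-1 words, so that no dependency spanning more than n tokens is ever assessed:

    P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})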
Neither of these localities provides a sufficient view of an output translation to ensure that it is syntactically correct, semantically adequate for expressing the source message, or discourse-coherent with its position in the target text. If an output translation ends up satisfying any of these criteria, it is more a matter of luck and of frequency in the training data than of linguistically informed choices.
In this talk, I will describe efforts to improve aspects of discourse coherence in SMT.
Michael Zock (LIF-CNRS, Aix-Marseille University)
"Roget, WordNet and beyond"
Summary: Whenever we read a book, write a letter, or launch a query on Google, we always use words, the shorthand labels for more or less well specified thoughts. The problem is that words may refuse to come to mind when we need them most, at the very moment of speaking or writing. This is when we tend to reach for a dictionary. Yet even dictionaries may fail to reveal the target word, although they contain it. This is not only a problem of input (a poor query word), but also a problem of design: the way words are represented and organized, i.e. the kind of information associated with each one of them.
While words in books and words in the brain are not at all the same (in dictionaries they exist as holistic entities, while in our brain they are decomposed), we have good reasons to believe that, functionally speaking, we can achieve the same goal: finding the elusive word. Whereas humans activate words in their brain, in an external resource (a dictionary) they search for and eventually find them. What such a resource needs to look like, and how to search it, will be a major part of this talk.
Yet, before doing so, I will consider three major resources, the mental lexicon, Roget's thesaurus and WordNet, discussing their relative strengths and weaknesses with respect to word access. I will then present the roadmap of an approach meant to help authors (speakers/writers) overcome the tip-of-the-tongue problem even in cases where the above-mentioned resources may fail.