A partial list of publications by Preslav Nakov and co-authors.

 Publications at the Linguistic Modelling Department, IPP-BAS

      2008

    1. Solving Relational Similarity Problems Using the Web as a Corpus. Preslav Nakov and Marti Hearst. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL'08). Columbus, OH, USA. 2008.

      Abstract. We present a simple linguistically-motivated method for characterizing the semantic relations that hold between two nouns. The approach leverages the vast size of the Web in order to build lexically-specific features. The main idea is to look for verbs, prepositions, and coordinating conjunctions that can help make explicit the hidden relations between the target nouns. Using these features in instance-based classifiers, we demonstrate state-of-the-art results on various relational similarity problems, including mapping noun-modifier pairs to abstract relations like TIME, LOCATION and CONTAINER, characterizing noun-noun compounds in terms of abstract linguistic predicates like CAUSE, USE, and FROM, classifying the relations between nominals in context, and solving SAT verbal analogy problems. In essence, the approach puts together some existing ideas, showing that they apply generally to various semantic tasks, finding that verbs are especially useful features.
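
      The following Python sketch illustrates the core feature idea on a toy scale: a noun pair is represented by the counts of the words observed between the two nouns, and pairs are compared by cosine similarity, as an instance-based classifier would do. The tiny corpora and the crude extraction pattern are invented stand-ins for the paper's Web-scale, lexically-specific queries.

```python
import re
from collections import Counter
from math import sqrt

def connector_features(noun1, noun2, sentences):
    """Count the words occurring between noun1 and noun2: a crude stand-in
    for extracting verbs, prepositions, and coordinating conjunctions."""
    feats = Counter()
    pattern = re.compile(rf"\b{noun1}\b(.+?)\b{noun2}\b", re.IGNORECASE)
    for sent in sentences:
        for match in pattern.finditer(sent):
            for token in match.group(1).split():
                feats[token.lower()] += 1
    return feats

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

pair1 = connector_features("mosquito", "malaria", [
    "The mosquito carries malaria to humans.",   # toy stand-ins for Web hits
    "A mosquito can transmit malaria.",
])
pair2 = connector_features("tick", "lyme", [
    "The tick carries lyme disease.",
])
print(pair1, pair2, cosine(pair1, pair2))  # shared "carries" -> nonzero similarity
```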

    2. Improved Statistical Machine Translation Using Monolingual Paraphrases. Preslav Nakov. In Proceedings of the European Conference on Artificial Intelligence (ECAI'08), Patras, Greece, 2008.

      Abstract. We propose a novel monolingual sentence paraphrasing method for augmenting the training data for statistical machine translation systems "for free": by creating it from data that is already available rather than having to create more aligned data. Starting with a syntactic tree, we recursively generate new sentence variants where noun compounds are paraphrased using suitable prepositions, and vice versa: preposition-containing noun phrases are turned into noun compounds. The evaluation shows an improvement equivalent to 33%-50% of that of doubling the amount of training data.
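
      The sketch below illustrates the two rewrite directions on flat strings. It is only a toy rendering of the idea: the actual method operates on syntactic trees and uses corpus statistics to select plausible prepositions, whereas the preposition list and examples here are invented.

```python
import re

def pp_to_compound(phrase):
    """'N2 prep N1' -> 'N1 N2', e.g. 'rate of inflation' -> 'inflation rate'."""
    m = re.fullmatch(r"(\w+) (?:of|for|in|from) (\w+)", phrase)
    return f"{m.group(2)} {m.group(1)}" if m else phrase

def compound_to_pp(compound, prep="of"):
    """'N1 N2' -> 'N2 prep N1', e.g. 'cancer treatment' -> 'treatment of cancer'."""
    n1, n2 = compound.split()
    return f"{n2} {prep} {n1}"

print(pp_to_compound("rate of inflation"))  # -> inflation rate
print(compound_to_pp("cancer treatment"))   # -> treatment of cancer
```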

    3. Noun Compound Interpretation Using Paraphrasing Verbs: Feasibility Study. Preslav Nakov. In Proceedings of the 13th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA'08), 2008.

      Abstract. The paper addresses an important challenge for the automatic processing of English written text: understanding noun compounds' semantics. Following Downing (1977), we define noun compounds as sequences of nouns acting as a single noun, e.g., bee honey, apple cake, stem cell, etc. In our view, they are best characterised by the set of all possible paraphrasing verbs that can connect the target nouns, with associated weights, e.g., malaria mosquito can be represented as follows: carry (23), spread (16), cause (12), transmit (9), etc. These verbs are directly usable as paraphrases, and using several of them simultaneously yields an appealing fine-grained semantic representation. In the present paper, we describe the process of constructing such representations for 250 noun-noun compounds previously proposed in the linguistic literature by Levi (1978). In particular, using human subjects recruited through the Amazon Mechanical Turk Web service, we create a valuable manually-annotated resource for noun compound interpretation, which we make publicly available with the hope of inspiring further research in paraphrase-based noun compound interpretation. We further perform a number of experiments, including a comparison to automatically generated weight vectors, in order to assess the dataset quality and the feasibility of the idea of using paraphrasing verbs to characterise noun compounds' semantics; the results are quite promising.

    4. Paraphrasing Verbs for Noun Compound Interpretation. Preslav Nakov. In Proceedings of the Workshop on Multiword Expressions (MWE'08), in conjunction with the Language Resources and Evaluation conference, Marrakech, Morocco, 2008.

      Abstract. An important challenge for the automatic analysis of English written text is the abundance of noun compounds: sequences of nouns acting as a single noun. In our view, their semantics is best characterized by the set of all possible paraphrasing verbs, with associated weights, e.g., malaria mosquito: carry (23), spread (16), cause (12), transmit (9), etc. Using Amazon's Mechanical Turk, we collect paraphrasing verbs for 250 noun-noun compounds previously proposed in the linguistic literature, thus creating a valuable resource for noun compound interpretation. Using these verbs, we further construct a dataset of pairs of sentences representing a special kind of textual entailment task, where a binary decision is to be made about whether an expression involving a verb and two nouns can be transformed into a noun compound, while preserving the sentence meaning.

    5. Improving English-Spanish Statistical Machine Translation: Experiments with Domain Adaptation, Sentence-Level Paraphrasing, Tokenization, and Recasing. Preslav Nakov. In Proceedings of the Third Workshop on Statistical Machine Translation (WMT'08), in conjunction with ACL'2008.

      Abstract. We describe the experiments of the UC Berkeley team on improving English-Spanish machine translation of news text, as part of the WMT'08 Shared Translation Task. We experiment with domain adaptation, combining a small in-domain news bi-text and a large out-of-domain one from the Europarl corpus, building two separate phrase translation models and two separate language models. We further add a third phrase translation model trained on a version of the news bi-text augmented with monolingual sentence-level syntactic paraphrases on the source-language side, and we combine all models in a log-linear model using minimum error rate training. Finally, we experiment with different tokenization and recasing rules, achieving 35.09% Bleu score on the WMT'07 news test data when translating from English to Spanish, which is a sizable improvement over the highest Bleu score achieved on that dataset at WMT'07: 33.10% (in fact, by our system). On the WMT'08 English to Spanish news translation, we achieve 21.92%, which makes our team the second best on Bleu score.

    6. Overview of BioCreative II Gene Mention Recognition. Larry Smith, Lorraine K. Tanabe, Rie Johnson nee Ando, Cheng-Ju Kuo, I-Fang Chung, Chun-Nan Hsu, Yu-Shi Lin, Roman Klinger, Christoph M. Friedrich, Kuzman Ganchev, Manabu Torii, Hongfang Liu, Barry Haddow, Craig A. Struble, Richard J. Povinelli, Andreas Vlachos, William A. Baumgartner Jr., Lawrence Hunter, Bob Carpenter, Richard Tzong-Han Tsai, Hong-Jie Dai, Feng Liu, Yifei Chen, Chengjie Sun, Sophia Katrenko, Pieter Adriaans, Christian Blaschke, Rafael Torres Perez, Mariana Neves, Preslav Nakov, Anna Divoli, Manuel Maña, Jacinto Mata-Vazquez, and W. John Wilbur. In Genome Biology, 2008. (accepted)

    7. Classification of Semantic Relations between Nominals. Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, Deniz Yuret. Language Resources and Evaluation, Special Issue on "Computational Semantic Analysis of Language: SemEval-2007 and Beyond", 2008. (accepted)



 Publications at UC Berkeley
      2007

    1. Ph.D. thesis: Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics. Nakov, Preslav Ivanov. EECS Department, University of California, Berkeley. Technical Report No. UCB/EECS-2007-173. December 20, 2007.

      Abstract. An important characteristic of English written text is the abundance of noun compounds - sequences of nouns acting as a single noun, e.g., colon cancer tumor suppressor protein. While eventually mastered by domain experts, their interpretation poses a major challenge for automated analysis. Understanding noun compound syntax and semantics is important for many natural language applications, including question answering, machine translation, information retrieval, and information extraction. For example, a question answering system might need to know whether "protein acting as a tumor suppressor" is an acceptable paraphrase of the noun compound tumor suppressor protein, and an information extraction system might need to decide if the terms neck vein thrombosis and neck thrombosis can possibly co-refer when used in the same document. Similarly, a phrase-based machine translation system facing the unknown phrase WTO Geneva headquarters could benefit from being able to paraphrase it as Geneva headquarters of the WTO or WTO headquarters located in Geneva. Given a query like migraine treatment, an information retrieval system could use paraphrasing verbs like relieve and prevent for page ranking and query refinement. I address the problem of noun compound syntax by means of novel, highly accurate unsupervised and lightly supervised algorithms using the Web as a corpus and search engines as interfaces to that corpus. Traditionally, the Web has been viewed as a source of page hit counts, used as an estimate for n-gram word frequencies. I extend this approach by introducing novel surface features and paraphrases, which yield state-of-the-art results for the task of noun compound bracketing. I also show how these kinds of features can be applied to other structural ambiguity problems, like prepositional phrase attachment and noun phrase coordination. I address noun compound semantics by automatically generating paraphrasing verbs and prepositions that make explicit the hidden semantic relations between the nouns in a noun compound. I also demonstrate how these paraphrasing verbs can be used to solve various relational similarity problems, and how paraphrasing noun compounds can improve machine translation.

    2. BioText Search Engine: beyond abstract search. Marti A. Hearst, Anna Divoli, Harendra Guturu, Alex Ksikes, Preslav Nakov, Michael A. Wooldridge, and Jerry Ye. In Bioinformatics.

      Abstract. The BioText Search Engine is a freely available Web-based application that provides biologists with new ways to access the scientific literature. One novel feature is the ability to search and browse article figures and their captions. A grid view juxtaposes many different figures associated with the same keywords, providing new insight into the literature. An abstract/title search and list view shows at a glance many of the figures associated with each article. The interface is carefully designed according to usability principles and techniques. The search engine is a work in progress, and more functionality will be added over time.

    3. SemEval-2007 Task 04: Classification of Semantic Relations between Nominals. Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, Deniz Yuret. In Proceedings of the SemEval-2007 Workshop, co-located with ACL-2007, Prague, June 23-24, 2007.

      Abstract. The NLP community has shown a renewed interest in deeper semantic analyses, among them automatic recognition of relations between pairs of words in a text. We present an evaluation task designed to provide a framework for comparing different approaches to classifying semantic relations between nominals in a sentence. This is part of SemEval, the 4th edition of the semantic evaluation event previously known as SensEval. We define the task, describe the training/test data and their creation, list the participating systems and discuss their results. There were 14 teams who submitted 15 systems.

    4. UCB: System Description for SemEval Task #4. Preslav Nakov and Marti Hearst. In Proceedings of the SemEval-2007 Workshop, co-located with ACL-2007, Prague, June 23-24, 2007.

      Abstract. The UC Berkeley team participated in the SemEval 2007 Task #4, with an approach that leverages the vast size of the Web in order to build lexically-specific features. The idea is to determine which verbs, prepositions, and conjunctions are used in sentences containing a target word pair, and to compare those to features extracted for other word pairs in order to determine which are most similar. By combining these Web features with words from the sentence context, our team was able to achieve the best results for systems of category C, and close to the best results for systems of category A.

    5. UCB System Description for the WMT 2007 Shared Task. Preslav Nakov and Marti Hearst. In Proceedings of the Second Workshop on Statistical Machine Translation, co-located with ACL-2007, Prague, June 23, 2007.

      Abstract. For the WMT 2007 shared task, the UC Berkeley team employed three techniques of interest. First, we used monolingual syntactic paraphrases to provide syntactic variety to the source training set sentences. Second, we trained two language models: a small in-domain model and a large out-of-domain model. Finally, we made use of results from prior research that shows that cognate pairs can improve word alignments. We contributed runs translating English to Spanish, French, and German using various combinations of these techniques.

    6. BioText Report for the Second BioCreAtIvE Challenge. Nakov, P., and Divoli, A. In Proceedings of the BioCreAtIvE II Workshop, pp. 297-306, Madrid, Spain, April 23-25, 2007.

      Abstract. This report describes the BioText team participation in the Second BioCreAtIvE Challenge. We focused on the Interaction-Article (IAS) and the Interaction-Pair (IPS) Sub-Tasks, which ask for the identification of protein interaction information in abstracts, and the extraction of interacting protein pairs from full text documents, respectively. We identified and normalized protein names and then used an ensemble of Naive Bayes classifiers in order to decide whether protein interaction information is present in a given abstract (for IAS) or whether a pair of co-occurring genes interact (for IPS). Since the recognition and normalization of genes and proteins were critical components of our approach, we participated in the Gene Mention (GM) and Gene Normalization (GN) tasks as well, in order to evaluate the performance of these components in isolation. For these tasks we used a previously developed in-house tool, based on database-derived gazetteers and approximate string matching, which we augmented with document-centered ambiguity resolution, but did not train or tune on the training data for GN and GM.

    7. Improved Word Alignments Using the Web as a Corpus. Preslav Nakov, Svetlin Nakov and Elena Paskaleva. In Proceedings of RANLP'2007, pp. 400-405, Borovetz, Bulgaria, September 27-29, 2007.

      Abstract. We propose a novel method for improving word alignments in a parallel sentence-aligned bilingual corpus based on the idea that if two words are translations of each other then so should be many words in their local contexts. The idea is formalised using the Web as a corpus, a glossary of known word translations (dynamically augmented from the Web using bootstrapping), the vector space model, linguistically motivated weighted minimum edit distance, competitive linking, and the IBM models. Evaluation results on a Bulgarian-Russian corpus show a sizable improvement both in word alignment and in translation quality.
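
      Of the ingredients listed above, competitive linking is easy to show in miniature. The Python sketch below greedily links the highest-scoring word pair, removes both words from further consideration, and repeats, yielding a one-to-one alignment. The similarity scores here are invented; the paper derives its scores from Web contexts and a weighted edit distance.

```python
def competitive_linking(src, tgt, sim):
    """sim[i][j] scores src[i] against tgt[j]; returns one-to-one links."""
    pairs = sorted(
        ((sim[i][j], i, j) for i in range(len(src)) for j in range(len(tgt))),
        reverse=True,
    )
    used_i, used_j, links = set(), set(), []
    for score, i, j in pairs:
        if i not in used_i and j not in used_j:  # both words still unlinked
            links.append((src[i], tgt[j], score))
            used_i.add(i)
            used_j.add(j)
    return links

src = ["котка", "спи"]    # Bulgarian: "cat sleeps"
tgt = ["кошка", "спит"]   # Russian counterpart
sim = [[0.9, 0.1],        # invented similarity scores
       [0.2, 0.8]]
print(competitive_linking(src, tgt, sim))
```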

    8. Cognate or False Friend? Ask the Web!, Svetlin Nakov, Preslav Nakov and Elena Paskaleva. In Proceedings of the Workshop on Acquisition and Management of Multilingual Lexicons, held in conjunction with RANLP'2007, pp. 55-62, Borovetz, Bulgaria, September 30, 2007.

      Abstract. We propose a novel unsupervised semantic method for distinguishing cognates from false friends. The basic intuition is that if two words are cognates, then most of the words in their respective local contexts should be translations of each other. The idea is formalised using the Web as a corpus, a glossary of known word translations used as cross-linguistic "bridges", and the vector space model. Unlike traditional orthographic similarity measures, our method can easily handle words with identical spelling. The evaluation on 200 Bulgarian-Russian word pairs shows this is a very promising approach.
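
      A toy Python rendering of the basic intuition: collect the words co-occurring with each candidate, map one side into the other language through a small glossary of known translations (the "bridges"), and compare the resulting bags of words by cosine similarity. The glossary and context lists are invented stand-ins for Web-harvested data.

```python
from collections import Counter
from math import sqrt

# Invented glossary of known translations (the cross-linguistic "bridges").
glossary = {"mountain": "berg", "high": "hoch", "forest": "wald"}

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def context_similarity(ctx_l1, ctx_l2):
    """Translate L1 context words through the glossary, then compare bags."""
    translated = Counter()
    for word, count in Counter(ctx_l1).items():
        if word in glossary:              # only bridge words survive
            translated[glossary[word]] += count
    return cosine(translated, Counter(ctx_l2))

# Toy contexts around each candidate word (harvested from the Web in the paper).
print(context_similarity(["mountain", "high", "snow"], ["berg", "hoch", "schnee"]))
# A high value suggests cognates; a low value suggests false friends.
```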

    9. Extracting Translation Lexicons from Bilingual Corpora: Application to South-Slavonic Languages, Preslav Nakov, Veno Pacovski and Elena Paskaleva. In Proceedings of the Workshop on Common Natural Language Processing Paradigm For Balkan Languages, held in conjunction with RANLP'2007, pp. 23-31, Borovetz, Bulgaria, September 26, 2007.

      Abstract. The paper presents a novel approach for automatic translation lexicon extraction from a parallel sentence-aligned corpus. This is a five-step process, which includes cognate extraction, word alignment, phrase extraction, statistical phrase filtering, and linguistic phrase filtering. Unlike other approaches whose objective is to extract word or phrase pairs to be used in machine translation, we try to induce meaningful linguistic units (pairs of words or phrases) that could potentially be included as entries in a bilingual dictionary. Structural and content analysis of the extracted phrases of length up to seven words shows that over 90% of them are correctly translated, which suggests that this is a very promising approach.

      2006

    1. BioText Team Report for the TREC 2006 Genomics Track. Anna Divoli, Marti A. Hearst, Preslav I. Nakov, Ariel Schwartz, Alex Ksikes. In Proceedings of TREC 2006, Gaithersburg, MD, 2006.

      Abstract. The paper reports on the work conducted by the BioText team at UC Berkeley for the TREC 2006 Genomics track. Our approach had three main focal points: First, based on our successful results in the TREC 2003 Genomics track [1], we emphasized gene name recall. Second, given the structured nature of the Generic Topic Types (GTTs), we attempted to design queries that covered every part of the topics, including synonym expansion. Third, inspired by having access to the full text of documents, we experimented with identifying and weighting information depending on which section (Introduction, Results, etc.) it appeared in. Our emphasis on covering the different pieces of the query may have helped with the aspects ranking portion of the task, as we performed best on that evaluation measure. We submitted three runs: Biotext1, BiotextWeb, and Biotext3. All runs were fully automatic. The Biotext1 run performed best, achieving MAP scores of .24 on aspects, .35 on documents, and .035 on passages.

    2. Using Verbs to Characterize Noun-Noun Relations. Nakov, P., and Hearst, M. In Proceedings of AIMSA 2006, Bulgaria, September 2006.

      Abstract. We present a novel, simple, unsupervised method for characterizing the semantic relations that hold between nouns in noun-noun compounds. The main idea is to discover predicates that make explicit the hidden relations between the nouns. This is accomplished by writing Web search engine queries that restate the noun compound as a relative clause containing a wildcard character to be filled in with a verb. A comparison to results from the literature suggests this is a promising approach.
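
      A minimal Python sketch of the query idea: restate the compound "N1 N2" as a relative clause with a wildcard ("N2 that * N1") and collect the verbs that fill the star in the returned snippets. The snippet list is an imagined stand-in for real search-engine results; a real system would also handle inflection and lemmatisation.

```python
import re
from collections import Counter

def paraphrase_queries(n1, n2):
    """Restate the compound 'n1 n2' as relative clauses with a wildcard."""
    return [f'"{n2} that * {n1}"', f'"{n2} which * {n1}"']

def extract_predicates(n1, n2, snippets):
    """Pull the verb filling the wildcard position from returned snippets."""
    verbs = Counter()
    pat = re.compile(rf"\b{n2}(?:e?s)?\b (?:that|which) (\w+)\b.*?\b{n1}\b", re.I)
    for s in snippets:
        m = pat.search(s)
        if m:
            verbs[m.group(1).lower()] += 1
    return verbs

snippets = [  # imagined search hits for the compound "malaria mosquito"
    "mosquitoes that carry malaria are common in the tropics",
    "a mosquito that transmits malaria was found near the lake",
]
print(paraphrase_queries("malaria", "mosquito"))
print(extract_predicates("malaria", "mosquito", snippets))  # carry:1, transmits:1
```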

      2005

    1. Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing. Nakov, P., and Hearst, M. In Proceedings of CoNLL-2005, the Ninth Conference on Computational Natural Language Learning, Ann Arbor, MI, June 2005.

      Abstract. In order to achieve the long-range goal of semantic interpretation of noun compounds, it is often necessary to first determine their syntactic structure. This paper describes an unsupervised method for noun compound bracketing which extracts statistics from Web search engines using a Chi^2 measure, a new set of surface features, and paraphrases. On a gold standard, the system achieves results of 89.34% (baseline 66.80%), which is a sizable improvement over the state of the art (80.70%).
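
      A worked toy example of the scoring: the standard chi-square association score over a 2x2 co-occurrence table, comparing the (w1, w2) association against (w1, w3) in the spirit of the dependency model. All counts below are invented; the paper estimates them from Web search engine statistics.

```python
def chi2(n11, n12, n21, n22):
    """Chi-square statistic for a 2x2 co-occurrence contingency table."""
    n = n11 + n12 + n21 + n22
    num = n * (n11 * n22 - n12 * n21) ** 2
    den = (n11 + n12) * (n11 + n21) * (n12 + n22) * (n21 + n22)
    return num / den if den else 0.0

# Invented counts: #(x y), #(x not-y), #(not-x y), #(not-x not-y).
assoc_w1_w2 = chi2(60, 500, 700, 1_000_000)   # e.g., "liver" with "cell"
assoc_w1_w3 = chi2(5, 555, 1395, 1_000_000)   # e.g., "liver" with "antibody"

# Stronger w1-w2 association -> left bracketing [[w1 w2] w3], else right.
print("left" if assoc_w1_w2 > assoc_w1_w3 else "right")
```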

    2. Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution. Nakov, P., and Hearst, M. In Proceedings of HLT-NAACL'05, Vancouver, 2005.

      Abstract. Recent work has shown that very large corpora can act as training data for NLP algorithms even without explicit labels. In this paper we show how the use of surface features and paraphrases in queries against search engines can be used to infer labels for structural ambiguity resolution tasks. Using unsupervised algorithms, we achieve 84% precision on PP-attachment and 80% on noun compound coordination.

    3. A Study of Using Search Engine Page Hits as a Proxy for n-gram Frequencies. Nakov, P., and Hearst, M. In Proceedings of RANLP'05, Borovets, Bulgaria, 2005.

      Abstract. The idea of using the Web as a corpus for linguistic research is getting increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significantly different for the task examined.

    4. Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing. Nakov, P., Schwartz, A., Wolf, B., and Hearst, M. In ACL/ISMB BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005.

      Abstract. We describe the use of the Layered Query Language and architecture to acquire statistics for natural language processing applications. We illustrate the system's use on the problem of noun compound bracketing using MEDLINE.

    5. Supporting Annotation Layers for Natural Language Processing. Nakov, P., Schwartz, A., Wolf, B., and Hearst, M. In the ACL 2005 Poster/Demo Track, Ann Arbor, MI, June 2005.

      Abstract. We demonstrate a system for flexible querying against text that has been annotated with the results of NLP processing. The system supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, flexibility in the format of returned results, and tight integration with SQL. We present a query language and its use on examples taken from the NLP literature.

      2004


    1. Citances: Citation Sentences for Semantic Analysis of Bioscience Text. Nakov, P., A. Schwartz, M. Hearst. In the Workshop on Search and Discovery in Bioinformatics at SIGIR'04, Sheffield, UK, July 2004.

      Abstract. We propose the use of the text of the sentences surrounding citations as an important tool for semantic interpretation of bioscience text. We hypothesize several different uses of citation sentences (which we call citances), including the creation of training and testing data for semantic analysis (especially for entity and relation recognition), synonym set creation, database curation, document summarization, and information retrieval generally. We illustrate some of these ideas, showing that citations to one document in particular align well with what a hand-built curator extracted. We also show preliminary results on the problem of normalizing the different ways that the same concepts are expressed within a set of citances, using and improving on existing techniques in automatic paraphrase generation.

    2. BioText Team Report for the TREC 2004 Genomics Track. Preslav Nakov, Ariel Schwartz, Emilia Stoica, Marti Hearst. In Proceedings of the Text REtrieval Conference (TREC'04), Gaithersburg, MD, USA, 2004.

      Abstract. The BioText group participated in the two main tasks of the TREC 2004 Genomics track. Our approach to the ad hoc task was similar to the one used in the 2003 Genomics track, but due to the lack of training data, we did not achieve the high scores of the previous year. The most novel aspect of our submission for the categorization task centers around our method for assigning Gene Ontology (GO) codes to articles marked for curation. This approach compares the text surrounding a target gene to text that has been found to be associated with GO codes assigned to homologous genes for organisms with genomes similar to mice (namely, humans and rats). We applied the same method to GO codes that have been assigned to MGI entries in years prior to the test set. In addition, we filtered out proposed GO codes based on their previously observed likelihood to co-occur with one another.

      2003

    1. Category-based Pseudowords, Preslav Nakov and Marti Hearst, in the Companion Volume of the Proceedings of HLT-NAACL'03, pp. 67-69, Edmonton, Canada, May 2003.

      Abstract. A pseudoword is a composite comprised of two or more words chosen at random; the individual occurrences of the original words within a text are replaced by their conflation. Pseudowords are a useful mechanism for evaluating the impact of word sense ambiguity in many NLP applications. However, the standard method for constructing pseudowords has some drawbacks. Because the constituent words are chosen at random, the word contexts that surround pseudowords do not necessarily reflect the contexts that real ambiguous words occur in. This in turn leads to an optimistic upper bound on algorithm performance. To address these drawbacks, we propose the use of lexical categories to create more realistic pseudowords, and evaluate the results of different variations of this idea against the standard approach.
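
      A small Python sketch contrasting the two constructions, under one plausible reading of the proposal: a standard pseudoword conflates words drawn completely at random, while a category-based pseudoword conflates two whole lexical categories, so each category behaves like one coherent sense. The word lists and categories are toy stand-ins for a real lexical resource.

```python
import random

# Toy lexical categories standing in for a real resource (e.g., WordNet or MeSH).
categories = {
    "fruit": ["banana", "mango", "peach"],
    "instrument": ["guitar", "violin", "cello"],
}
vocab = [w for ws in categories.values() for w in ws]

def standard_pseudoword(k=2):
    """Classic construction: conflate k words chosen completely at random."""
    return "_".join(random.sample(vocab, k))

def category_pseudoword(cat_a, cat_b):
    """Category-based construction: conflate two whole categories, so each
    category behaves like one 'sense' of the resulting pseudoword."""
    return f"{cat_a}_{cat_b}", categories[cat_a] + categories[cat_b]

def conflate(text, members, pseudo):
    for w in members:
        text = text.replace(w, pseudo)
    return text

print(standard_pseudoword())  # e.g., "peach_cello"
pseudo, members = category_pseudoword("fruit", "instrument")
print(conflate("the mango was ripe and the violin sounded sweet", members, pseudo))
```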

    2. BioText Team Report for the TREC 2003 Genomics Track, Gaurav Bhalotia, Preslav Nakov, Ariel S. Schwartz, Marti A. Hearst. In Proceedings of TREC'03, pp. 612-621, Gaithersburg, MD, USA, 2003.

      Abstract. The BioText project team participated in both tasks of the TREC 2003 genomics track. Key to our approach in the primary task was the use of an organism-name recognition module, a module for recognizing gene name variants, and MeSH descriptors. Text classification improved the results slightly. In the secondary task, the key insight was casting it as a classification problem of choosing between the title and the last sentence of the abstract, although MeSH descriptors helped somewhat in this task as well. These approaches yielded results within the top three groups in both tasks.




     Other publications (not at UC Berkeley)

    1. M.Sc. thesis: Nakov P. Recognition and Morphological Classification of Unknown Words for German. Sofia University, Faculty of Mathematics and Informatics, Department of Information Technologies. Sofia, July 2001.

      Abstract. A system for recognition and morphological classification of unknown words for German is described. The system takes raw text as input and outputs a list of the unknown nouns together with hypotheses about their possible morphological class and stem. The morphological classes used uniquely identify the word's gender and the inflection endings it takes when it changes by case and number. The system exploits both global information (ending-guessing rules, maximum likelihood estimations, word frequency statistics) and local information (surrounding context), as well as morphological properties (compounding, inflection, affixes) and external knowledge (specially designed lexicons, German grammar information, etc.). The problem is solved as a sequence of subtasks: unknown word identification, noun identification, recognition and grouping of inflected forms of the same word (they must share the same stem), compound splitting, morphological stem analysis, stem hypotheses for each group of inflected forms, and finally production of a ranked list of hypotheses about the possible morphological class for each group of words. The system is a kind of tool for lexical acquisition: it identifies unknown words in raw text, derives some of their properties, and classifies them. Only nouns are currently considered, but the approach can be successfully applied to other parts of speech as well as to other inflexional languages.

    2. Robust Ending Guessing Rules with Application to Slavonic Languages. Preslav Nakov, Elena Paskaleva. In Proceedings of the 3rd workshop on RObust Methods in Analysis of Natural Language Data (ROMAND), an International Workshop in Association with COLING'04, pp. 76-85, Geneva, August 29, 2004.

      Abstract. The paper studies the automatic extraction of diagnostic word endings for Slavonic languages, aimed at determining some grammatical, morphological, and semantic properties of the underlying word. In particular, ending-guessing rules are learned from a large morphological dictionary of Bulgarian in order to predict POS, gender, number, article, and semantics. A simple exact high-accuracy algorithm is developed and compared to an approximate one, which uses a scoring function previously proposed by Mikheev for POS guessing.
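
      A minimal Python sketch of learning ending-guessing rules from a morphological dictionary: count how often each word ending co-occurs with each tag, and keep the endings that predict a tag with high relative frequency. The six-word "dictionary" and the plain relative-frequency score are illustrative simplifications; the paper contrasts an exact algorithm with Mikheev's smoothed score.

```python
from collections import defaultdict

dictionary = {  # toy (word -> tag) pairs standing in for a large lexicon
    "workers": "noun-pl", "tables": "noun-pl", "cities": "noun-pl",
    "running": "verb-ger", "eating": "verb-ger", "talking": "verb-ger",
}

def learn_rules(lexicon, max_len=4, min_conf=0.9):
    counts = defaultdict(lambda: defaultdict(int))
    for word, tag in lexicon.items():
        for i in range(1, min(max_len, len(word)) + 1):
            counts[word[-i:]][tag] += 1      # every suffix up to max_len
    rules = {}
    for ending, tags in counts.items():
        total = sum(tags.values())
        tag, freq = max(tags.items(), key=lambda kv: kv[1])
        if freq / total >= min_conf:         # keep only confident endings
            rules[ending] = (tag, freq / total)
    return rules

rules = learn_rules(dictionary)
print(rules.get("ing"), rules.get("s"))  # ('verb-ger', 1.0) ('noun-pl', 1.0)
```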

    3. Towards Deeper Understanding and Personalisation in CALL. Galia Angelova, Albena Strupchanska, Ognyan Kalaydjiev, Milena Yankova, Svetla Boytcheva, Irena Vitanova, Preslav Nakov. In Proceedings of "eLearning for Computational Linguistics and Computational Linguistics for eLearning", an International Workshop in Association with COLING'04, pp. 45-52, Geneva, August 28, 2004.

      Abstract. We consider in depth the semantic analysis in learning systems, as well as some information retrieval techniques applied for measuring document similarity in eLearning. These results were obtained in a CALL project, which ended with extensive user evaluation. After several years spent in the development of CALL modules and prototypes, we think that much closer cooperation with real teaching experts is necessary in order to find the proper learning niches and suitable wrappings of the language technologies, which could give birth to useful eLearning solutions.

    4. Non-Parametric SPAM Filtering based on kNN and LSA. Preslav Nakov, Panayot Dobrikov. In Proceedings of the 33rd National Spring Conference of the Bulgarian Mathematicians Union, Borovets, Bulgaria, April 1-4, 2004.

      Abstract. The paper proposes a non-parametric approach to the filtering of unsolicited commercial e-mail messages, also known as spam. The text of each email message is represented as an LSA vector, which is then fed into a kNN classifier. The method shows high accuracy on a collection of recent personal email messages. Tests on the standard LINGSPAM collection achieve an accuracy of over 99.65%, which is an improvement over the best published results to date.
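
      A sketch of the pipeline using scikit-learn as a modern stand-in for the original implementation: bag-of-words weighting, LSA via a truncated SVD of the term-document matrix, then kNN in the reduced space. The four messages and the tiny dimensionality are toy values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

mails = [
    "win a free prize now, click here",
    "cheap meds, limited offer, buy now",
    "meeting moved to tuesday afternoon",
    "please review the attached draft paper",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(
    TfidfVectorizer(),              # weighted term-document matrix
    TruncatedSVD(n_components=2),   # the LSA projection
    KNeighborsClassifier(n_neighbors=1),
)
model.fit(mails, labels)
print(model.predict(["free offer, click now", "draft of the paper attached"]))
```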

    5. Towards Deeper Understanding of LSA Performance. Nakov P., E. Valchanova, G. Angelova. In Proceedings of Recent Advances in Natural Language Processing (RANLP’03). pp. 311-318. Borovetz, Bulgaria, September 10-12, 2003.

      Abstract. The paper presents ongoing work towards a deeper understanding of the factors influencing the performance of Latent Semantic Analysis (LSA). Unlike previous attempts that concentrate on problems such as matrix element weighting, space dimensionality selection, similarity measure, etc., we primarily study the impact of another, often neglected, but fundamental element of LSA (and of any text processing technique): the definition of "word". For this purpose, a balanced corpus of Bulgarian newspaper texts was carefully created to allow for in-depth observations of LSA performance, and a series of experiments was performed in order to understand and compare (with respect to the task of text categorisation) six possible inputs with different levels of linguistic quality, including: the graphemic form as encountered in the text, stem, lemma, phrase, lemma&phrase, and part-of-speech annotation. In addition to LSA, we made comparisons to the standard vector-space model, without any dimensionality reduction. The results show that while linguistic processing has a substantial influence on LSA performance, the traditional factors are even more important, and therefore we could not show that linguistic pre-processing substantially improves text categorisation.

    6. Guessing Morphological Classes of Unknown German Nouns. Nakov P., Bonev Y., G. Angelova, E. Gius, W. von Hahn. In Proceedings of Recent Advances in Natural Language Processing (RANLP’03). pp. 319-326. Borovetz, Bulgaria, September 10-12, 2003.

      Abstract. A system for recognition and morphological classification of unknown German words is described. Given raw texts it outputs a list of the unknown nouns together with hypotheses about their possible stems and morphological class(es). The system exploits both global and local information as well as morphological properties and external linguistic knowledge sources. It learns and applies ending-guessing rules similar to the ones originally proposed for POS guessing. The paper presents the system design and implementation and discusses its performance by extensive evaluation. Similar ideas for ending-guessing rules have been applied to Bulgarian as well but the performance is worse due to the difficulties of noun recognition as well as to the highly inflexional morphology with numerous ambiguous endings.

    7. BulStem: Design and Evaluation of Inflectional Stemmer for Bulgarian. Nakov P. In Proceedings of the Workshop on Balkan Language Resources and Tools (1st Balkan Conference in Informatics), Thessaloniki, Greece, November 2003.

      Abstract. The paper starts with an overview of some important approaches to stemming for English and other languages. Then, the design, implementation, and evaluation of the BulStem inflectional stemmer for Bulgarian are presented. The problem is addressed from a machine-learning perspective using a large morphological dictionary. A detailed automatic evaluation in terms of under-stemming, over-stemming, and coverage is provided. In addition, the effect of stemming and of BulStem's parameter settings is demonstrated on a particular task: text categorisation using kNN+LSA.

    8. ArtsSemNet: from Bilingual Dictionary to Bilingual Semantic Network. Atanassova I., Nakov S., Nakov P. In Proceedings of the Workshop on Balkan Language Resources and Tools (1st Balkan Conference in Informatics), Thessaloniki, Greece, November 2003.

      Abstract. The paper presents two bilingual lexicographical resources for the terminology of fine arts: the ArtsDict electronic dictionary and the ArtsSemNet semantic network, and describes the process of transformation of the former into the latter. ArtsDict combines a broad range of information sources and is currently the most complete dictionary of fine arts terminology for both Bulgarian and Russian, not only among electronic dictionaries but in general. It contains 2,900 Bulgarian and 2,644 Russian terms, each annotated with complete dictionary definitions. These are further augmented with various terminological relations (polysemy, synonymy, homonymy, antonymy and hyponymy) and organised into a bilingual semantic network similar to WordNet. In addition, a specialised hypertext browser is implemented in order to enable intuitive querying of and navigation through the network.

    9. Adaptivity in Web-Based CALL. Angelova G., S. Boytcheva, O. Kalaydjiev, S. Trausan-Matu, P. Nakov and A. Strupchanska. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI'02), pp. 445-449. Lyon, France, July 21-26, 2002.

      Abstract. This paper presents the design, implementation, and some original features of a Web-based learning environment: STyLE (Scientific Terminology Learning Environment). STyLE supports adaptive learning of English terminology with a target user group of non-native speakers. It attempts to improve Computer-Aided Language Learning (CALL) by intelligent integration of Natural Language Processing (NLP) and personalised Information Retrieval (IR) into a single coherent system.

    10. Automatic Recognition and Morphological Classification of Unknown German Nouns. Nakov, P., G. Angelova, W. von Hahn. FBI-HH-B-243/02, Bericht 243, Fachbereich Informatik, Universitaet Hamburg, September 2002.

      Abstract. A system for recognition and morphological classification of unknown words for German is described. The MorphoClass system takes raw text as input and outputs a list of the unknown nouns together with hypotheses about their morphological class and stem. The morphological classes used uniquely identify the word's gender and the inflection endings it takes for changes in case and number. MorphoClass exploits both global information (ending-guessing rules, maximum likelihood estimations, word frequency statistics) and local information (adjacent context), as well as morphological properties (compounding, inflection, affixes) and external linguistic knowledge (specially designed lexicons, German grammar information, etc.). The task is solved by a sequence of subtasks: unknown word identification, noun identification, recognition and grouping of inflected forms of the same word (they must share the same stem), compound splitting, morphological stem analysis, stem hypotheses for each group of inflected forms, and finally production of a ranked list of hypotheses about a possible morphological class for each group of words. MorphoClass is a kind of tool for lexical acquisition: it identifies unknown words in raw text, derives their properties, and classifies them. Currently, only nouns are processed, but the approach can be successfully applied to other parts of speech (especially when the PoS of the unknown word is already determined) as well as to other inflexional languages.

    11. Latent Semantic Analysis for Notional Structures Investigation. Nakov P., S. Terzieva. In Proceedings of the Annual Congress of the European Society for Philosophy and Psychology (ESPP'02). Lyon, France, July 10-13, 2002.

      Abstract. Research on the effects of study is hindered by the limitations of the techniques and methods for registering, measuring, and assessing the actually formed knowledge. The problem has been addressed using latent semantic analysis for the comparison and assessment of scientific texts and knowledge expressed in the form of free verbal statements. Education at higher schools has the specific objective of developing knowledge and experience, both of which have two fundamental dimensions: the first is expertise training in a well-defined occupational or disciplinary domain, and the second is learning strategies and skills for being an effective learner. Various approaches for the stimulation of deep learning, transferring into practice the achievements of cognitive psychology, have been developed during the last decade. Here we present a study of the cognitive activity of university students and its results in the dimension of declarative knowledge. In practice, a comparative analysis is made between the input system of notions from the learning texts and the mental structures formed by the students. The research includes a sequence of actions and procedures for: facilitating the formation of stable concept structures (preparation of learning materials, their content, structure, and visual presentation, organisation of learning, etc.); obtaining feedback on the retention of knowledge of a certain number of key notions; and assessment of the manifested knowledge. The data used is verbal: learning texts and linguistic descriptions of the notions contained in them, all rendered in an open format by the subjects while posing indirect questions. The nature of the processed material (input stimuli and preserved knowledge) motivated the application of Latent Semantic Analysis (LSA) as a research method on the data. This statistical technique permitted the formation of a model of the semantic connections between the researched notions in the output and a general representation of the results.

    12. Latent Semantic Analysis for German literature investigation. Nakov P. In Proceedings of the 7th Fuzzy Days'01, International Conference on Computational Intelligence. B. Reusch (Ed.): LNCS 2206, pp. 834-841. Dortmund, Germany, October 1-3, 2001.

      Abstract. The paper presents the results of experiments using LSA for the analysis of textual data. The method is explained in brief, and special attention is paid to its potential for the comparison and investigation of German literary texts. Two hypotheses are tested: 1) texts by the same author are alike and can be distinguished from those by a different author; 2) prose and poetry can be automatically distinguished.

    13. Weight functions impact on LSA performance. Nakov P., Popova A., Mateev P. In Proceedings of the EuroConference Recent Advances in Natural Language Processing (RANLP'01). pp. 187-193. Tzigov Chark, Bulgaria, September 5-7, 2001.

      Abstract. This paper presents experimental results of using LSA for the analysis of English literary texts. Several preliminary transformations of the frequency text-document matrix with different weight functions are tested on the basis of control subsets. Additional clustering based on the correlation matrix is applied in order to reveal the latent structure. The algorithm creates a shaded-form matrix via singular values and vectors. The results are interpreted in terms of the quality of the transformations and compared to the control set tests.

    14. Research in the Notional Structures of the Declarative Memory of Students using Latent Semantic Analysis. Terzieva S., Nakov P. Bulgarian Journal of Psychology. vol. 1-2/2001. Sofia, Bulgaria. 2001. (in Bulgarian).

      Abstract. Education at higher schools has the specific objective of developing knowledge and experience, both of which have two fundamental dimensions: the first is expertise training in a well-defined occupational or disciplinary domain, and the second is learning strategies and skills for being an effective learner. Various approaches for the stimulation of deep learning have been developed over the past decade; conceptually, they transfer into practice the achievements of cognitive psychology. Here we present a study of the cognitive activity of university students and its results in the dimension of declarative knowledge. In practice, a comparative analysis is made between the input system of notions from the learning texts and the mental structures formed by the students. The research includes a sequence of actions and procedures for: facilitating the formation of stable concept structures (preparation of learning materials, their content, structure, and visual presentation, organization of learning, etc.); obtaining feedback on the retention of knowledge of a certain number of key notions; and assessment of the manifested knowledge. The data used is verbal: learning texts and linguistic descriptions of the notions contained in them, all rendered in an open format by the subjects while posing indirect questions. The nature of the processed material (input stimuli and preserved knowledge) motivated the application of Latent Semantic Analysis (LSA) as a research method on the data. This statistical technique permitted the formation of a model of the semantic connections between the researched notions in the output space, against whose background an assessment of the individual achievements and a general representation of the results are made.

    15. Investigating the Degree of Adequacy of the Relations in the Concept Structure of Students using the Method of Latent Semantic Analysis. Terzieva S., Nakov P., Handjieva S. In Proceedings of the Bulgarian Computer Science Conference on Computer Systems and Technologies (CompSysTech'01). Sofia, Bulgaria. 2001.

      Abstract. Research on the effects of study is hindered by the limitations of the techniques and methods for registering, measuring, and assessing the actually formed knowledge as information represented in memory with the appropriate correlations among its units. The problem has been addressed by using latent semantic analysis for the comparison and assessment of scientific texts and knowledge expressed by the students in the form of free verbal statements.

    16. Getting Better Results with Latent Semantic Indexing. Nakov P. In Proceedings of the Students Presentations at the European Summer School in Logic Language and Information (ESSLLI'00). pp. 156-166. Birmingham, UK. August 2000.

      Abstract. The paper presents an overview of some important factors influencing the quality of the results obtained when using Latent Semantic Indexing. The factors are separated into 5 major groups and analyzed both separately and as a whole. A new class of extended Boolean operations such as OR, AND, and NOT (AND-NOT) and their combinations is proposed and evaluated on a corpus of religious and sacred texts.

    17. Web Personalization Using Extended Boolean Operations with Latent Semantic Indexing. Nakov P. In Lecture Notes in Artificial Intelligence - 1904 (Springer). Artificial Intelligence: Methodology, Systems and Applications. 9th International Conference (AIMSA'00), pp. 189-198. Varna, Bulgaria, September 2000.

      Abstract. The paper discusses the potential of using extended Boolean operations for personalized information delivery on the Internet, based on semantic vector representation models. The final goal is the design of an e-commerce portal that tracks users' clickstream activity and purchase history in order to offer them personalized information. The emphasis is put on the introduction of dynamic composite user profiles constructed by means of extended Boolean operations. The basic binary Boolean operations such as OR, AND, and NOT (AND-NOT) and their combinations have been introduced and implemented in a variety of ways. An evaluation is presented based on the classic Latent Semantic Indexing method for information retrieval, using a text corpus of religious and sacred texts.
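
      The sketch below shows one natural reading of such operations over LSI vectors: OR as a normalised sum, and AND-NOT as removing the projection onto the unwanted concept. The exact operator definitions in the paper may differ, and the three-dimensional "concept space" vectors are invented.

```python
import numpy as np

def unit(v):
    n = np.linalg.norm(v)
    return v / n if n else v

def op_or(a, b):
    """Interest in either concept: normalised vector sum."""
    return unit(a + b)

def op_and_not(a, b):
    """Keep a's content, minus its component along the unwanted concept b."""
    b_hat = unit(b)
    return unit(a - (a @ b_hat) * b_hat)

sports = np.array([1.0, 0.2, 0.0])    # invented LSI topic directions
finance = np.array([0.1, 1.0, 0.3])

profile = op_and_not(op_or(sports, finance), finance)
doc = np.array([0.9, 0.1, 0.0])       # a document vector in the same space
print(float(unit(doc) @ profile))     # cosine relevance to the composite profile
```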

    18. Latent Semantic Analysis of Textual Data. Nakov P. In Proceedings of the International Conference on Computer Systems and Technologies (CompSysTech'00). pp. V.3-1-V.3-5. Sofia, Bulgaria. June 2000.

      Abstract. The paper presents an overview of the usage of LSA for the analysis of textual data. The mathematical apparatus is explained in brief, and special attention is paid to the key parameters that influence the quality of the results obtained. The potential of LSA is demonstrated on a selected corpus of religious and sacred texts. The results of an experimental application of LSA for educational purposes are also presented.