BULSTEM: INFLECTIONAL STEMMER FOR BULGARIAN

Abstract: The paper starts with an overview of some important approaches to stemming for English and other languages. It then presents the design, implementation and evaluation of BulStem, an inflectional stemmer for Bulgarian. The problem is addressed from a machine-learning perspective using a large morphological dictionary. A detailed automatic evaluation in terms of under-stemming, over-stemming and coverage is provided. In addition, the effect of stemming and of BulStem parameter settings is demonstrated on a particular task: text categorisation using kNN+LSA.

Keywords: Stemming, lemmatisation, text categorisation, k-nearest-neighbour, vector-space model, latent semantic analysis, information retrieval.