This paper presents an accurate and highly efficient rule-based part-of-speech tagger for Bulgarian. All four stages (tokenization, dictionary application, unknown-word guessing, and contextual part-of-speech disambiguation) are implemented as a pipeline of a few deterministic finite-state bimachines and transducers. We describe the Bulgarian ambiguity classes and give a detailed evaluation and error analysis of the tagger. The overall precision of the tagger is over 98.4\% for full disambiguation, and the processing speed is over 34K words/sec on a personal computer. The same methodology has also been applied to English. The implementation presented here meets the specific demands of the Semantic Web. This work was funded by a grant from VolkswagenStiftung.