About

About the project

In enterprises, organizations and state administration, a large number of documents are available only on paper. Before modern document management techniques can be applied, these documents have to be converted to electronic form.

Despite the many commercial systems for Optical Character Recognition (OCR) that are available, the adoption and success of OCR technology remain rather limited because of the unacceptable error rate that is often observed in the recognition process. The principal concern of this project is to improve methods for post-correction of OCR results. On the application side, special emphasis is given to the conversion of documents written in Cyrillic or in a mix of Cyrillic and Latin alphabets and languages. From a methodological point of view, the use of large-scale mono- and multilingual electronic dictionary systems for OCR correction is a central concern. As a starting point, existing linguistic resources, namely Bulgarian, Russian, German and English dictionaries, will be enriched with statistical information on word frequencies, word collocations and confusion probabilities for entries. Subsequently, a new method for string correction with large electronic dictionaries, developed only recently by the authors, will be further extended and optimized. As a synthesis of the statistical and algorithmic work, the project aims to create a robust and efficient software system for OCR correction in word context, where statistical and linguistic knowledge is used to select the most appropriate correction candidates for misspelled words.
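To make the idea of candidate selection more concrete, the following minimal Python sketch ranks dictionary entries for a misrecognized token. It is an illustration only and not the project's string correction method: it uses a plain weighted Levenshtein distance and a linear scan over a toy word list. The lexicon, the frequencies, the confusion weights and the names weighted_distance and rank_candidates are all assumptions invented for this example.

```python
# Hypothetical toy lexicon: word -> corpus frequency (invented numbers).
LEXICON = {"bank": 50, "book": 80, "back": 60, "hook": 5}

# Hypothetical confusion weights: (intended char, recognized char) -> cost.
# A low cost means the OCR engine is assumed to confuse the pair often.
CONFUSIONS = {("o", "0"): 0.2, ("l", "1"): 0.2}


def weighted_distance(observed, candidate, confusions):
    """Levenshtein distance in which a substitution listed in
    `confusions` costs less than the default cost of 1.0."""
    n = len(candidate)
    prev = [float(j) for j in range(n + 1)]
    for i in range(1, len(observed) + 1):
        cur = [float(i)] + [0.0] * n
        for j in range(1, n + 1):
            o, c = observed[i - 1], candidate[j - 1]
            sub = 0.0 if o == c else confusions.get((c, o), 1.0)
            cur[j] = min(
                prev[j] + 1.0,      # extra character in the OCR output
                cur[j - 1] + 1.0,   # character missing from the OCR output
                prev[j - 1] + sub,  # match or (possibly cheap) substitution
            )
        prev = cur
    return prev[n]


def rank_candidates(observed, lexicon, confusions, max_distance=1.5):
    """Return (word, distance) pairs within `max_distance`, ordered by
    distance first and by descending word frequency second."""
    scored = []
    for word, freq in lexicon.items():
        dist = weighted_distance(observed, word, confusions)
        if dist <= max_distance:
            scored.append((dist, -freq, word))
    scored.sort()
    return [(word, dist) for dist, _, word in scored]
```

Calling rank_candidates("b0ok", LEXICON, CONFUSIONS) places "book" ahead of "hook", because replacing "0" with "o" is treated as a cheap, likely OCR confusion. A system of the kind described above would need a far more efficient dictionary lookup than this linear scan, and could additionally use word collocations to choose among candidates in context.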

