|
About the project

In enterprises, organizations and state administration, a large number of documents are available only on paper. Before modern document management techniques can be applied, these documents have to be converted to electronic form. Although many commercial systems for Optical Character Recognition (OCR) are available, the adoption and success of OCR technology remain limited because the error rate observed in the recognition process is often unacceptably high.

The principal concern of this project is to improve methods for post-correction of OCR results. On the application side, special emphasis is placed on the conversion of documents written in Cyrillic, or in a mixture of Cyrillic and Latin alphabets and languages. From a methodological point of view, the central concern is the use of large-scale monolingual and multilingual electronic dictionary systems for OCR correction.

As a starting point, existing linguistic resources, namely Bulgarian, Russian, German and English dictionaries, will be enriched with statistical information on word frequencies, word collocations and confusion probabilities for entries. Subsequently, a new method for string correction with large electronic dictionaries, developed only recently by the authors, will be further extended and optimized. As a synthesis of the statistical and algorithmic work, the project aims to create a robust and efficient software system for OCR correction in word context, in which statistical and linguistic knowledge is used to select the most appropriate correction candidates for misspelled words.
|
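To make the idea of dictionary-based post-correction more concrete, the following is a minimal sketch of how correction candidates for an OCR token might be ranked using edit distance and word frequency. It is not the authors' string-correction method, which is not specified here. It substitutes plain Levenshtein distance with a linear scan over the lexicon, and the names `levenshtein`, `rank_candidates`, `TOY_LEXICON`, the frequency counts and the `max_distance` threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_candidates(token: str,
                    lexicon: Dict[str, int],
                    max_distance: int = 2) -> List[Tuple[str, int, int]]:
    """Return (word, distance, frequency) for lexicon entries close to token.

    Candidates are ordered by smaller edit distance first, then by higher
    frequency. This is a hypothetical scoring rule, chosen only to illustrate
    how statistical information can help select among candidates.
    """
    scored = []
    for word, freq in lexicon.items():
        distance = levenshtein(token, word)
        if distance <= max_distance:
            scored.append((word, distance, freq))
    scored.sort(key=lambda item: (item[1], -item[2], item[0]))
    return scored


# Hypothetical toy lexicon with made-up frequency counts.
TOY_LEXICON = {"document": 120, "documents": 95, "monument": 8, "dictionary": 40}

# "rn" read in place of "m" is a typical OCR confusion.
candidates = rank_candidates("docurnent", TOY_LEXICON)
best = candidates[0][0] if candidates else None
print(best)  # document
```

A real system of the kind described above would replace the linear scan with an efficient lookup structure for large dictionaries, weight individual edits by OCR confusion probabilities instead of counting every edit as 1, and take neighbouring words into account when choosing among candidates.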