|
Project Objectives
The main objective of this project is to extend and develop methods, resources and software systems for improving the OCR correction of Bulgarian and
multilingual (Bulgarian, Russian, English and German) documents. The high-level achievements targeted by the project are given below:
Word context based OCR correction
- Further development of the word context based correction method built on Levenshtein automata. This method can be refined in two directions. First, one can use the probabilities of symbol-dependent recognition
errors to rank the correction candidates more precisely. To implement this option we can extend the concept of Levenshtein automata by using weighted automata, which is expected to keep the correction
efficient. Second, we can order the correction candidates by word frequency.
- Extension of the Bulgarian, Russian, German and English electronic dictionaries with OCR-aiding data, which makes the further correction methods possible. This includes adding information about word
frequencies and recognition error risk values. The lexical resources will be formatted for efficient access. For the correction of multilingual documents, a very large consolidated
Bulgarian-Russian-German-English dictionary will be constructed.
- Test series for estimating the probabilities of symbol-dependent recognition errors for Cyrillic and Latin fonts. These series will provide the font-dependent data for building the weighted Levenshtein automata. In
this way the list of correction candidates can be sorted by recognition error probability.
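The candidate ranking described in the items above can be illustrated with a minimal sketch. It uses a plain dynamic-programming weighted edit distance, not a weighted Levenshtein automaton, so it shows only the ordering criterion and not the efficient retrieval the project aims at. The confusion probabilities, the default probability, the word frequencies and all function names are hypothetical values chosen for illustration.

```python
import math

# Hypothetical values for illustration only:
# probability that the OCR outputs `seen` when the true symbol is `true`.
CONFUSION_PROB = {
    ("l", "1"): 0.05,
    ("o", "0"): 0.04,
    ("e", "c"): 0.02,
}
# Assumed probability for any unlisted substitution, insertion or deletion.
DEFAULT_PROB = 0.001


def cost(prob):
    """Turn a probability into an additive cost (negative log)."""
    return -math.log(prob)


def sub_cost(true_ch, seen_ch):
    """Cost of the OCR reading `true_ch` as `seen_ch`."""
    if true_ch == seen_ch:
        return 0.0
    return cost(CONFUSION_PROB.get((true_ch, seen_ch), DEFAULT_PROB))


def weighted_distance(candidate, observed):
    """Weighted edit distance from a dictionary word to the OCR token."""
    indel = cost(DEFAULT_PROB)
    prev = [j * indel for j in range(len(observed) + 1)]
    for i, c in enumerate(candidate, start=1):
        curr = [i * indel]
        for j, o in enumerate(observed, start=1):
            curr.append(min(
                prev[j] + indel,               # candidate symbol dropped by the OCR
                curr[j - 1] + indel,           # spurious symbol inserted by the OCR
                prev[j - 1] + sub_cost(c, o),  # match or symbol-dependent substitution
            ))
        prev = curr
    return prev[-1]


def rank_candidates(observed, lexicon):
    """Sort words by error cost; ties are broken by higher word frequency."""
    return sorted(
        lexicon,
        key=lambda w: (weighted_distance(w, observed), -lexicon[w]),
    )


# Toy lexicon mapping each word to an assumed corpus frequency.
toy_lexicon = {"clear": 120, "cheer": 80}
ranking = rank_candidates("c1ear", toy_lexicon)
```

Here "clear" is ranked before "cheer" because the l/1 confusion is listed as a likely recognition error, while "cheer" needs two unlisted substitutions. In a real system the probabilities would come from the font-dependent test series rather than from a hand-written table.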
Sentence context based OCR correction
- Analysis of large corpora to extract a word collocation table for Bulgarian, to be used for OCR correction based on word collocation techniques.
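As a rough sketch of what such a collocation table could look like, the example below counts adjacent word pairs in a corpus and uses the counts to choose between correction candidates given the preceding word. This is an assumption about one simple form of collocation technique (adjacent bigram counts); the function names are hypothetical, and the toy corpus is in English only to keep the example readable.

```python
from collections import Counter


def build_collocation_table(sentences):
    """Count adjacent word pairs over whitespace-tokenized sentences."""
    table = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        table.update(zip(tokens, tokens[1:]))
    return table


def pick_by_context(left_word, candidates, table):
    """Choose the candidate that most often follows `left_word` in the corpus."""
    return max(candidates, key=lambda w: table[(left_word, w)])


# Toy corpus standing in for a large collection of correct text.
toy_corpus = ["the clear sky", "a clear sky", "the dear friend"]
collocations = build_collocation_table(toy_corpus)
best = pick_by_context("clear", ["shy", "sky"], collocations)
```

With this toy corpus the pair ("clear", "sky") occurs twice and ("clear", "shy") never, so "sky" is preferred. A practical table would need smoothing for unseen pairs and could also use the right-hand neighbour.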
Implementation
- Implementation of a robust and highly efficient correction system based on the Levenshtein automata framework and sentence context correction. We plan to implement our approach in order to test it and compare
it against traditional methods. The implementation can also demonstrate the achievements of the project and help attract industrial applications.
|