Scientific Project Results
In the project we covered a wide range of theoretical and practical issues of OCR postcorrection, both for general alphabets and
specifically for Cyrillic-Latin alphabets. The table below gives an overview of the topics investigated in the project.
On the theory side, the most important achievements of the project
are new and extremely fast methods for approximate search in large dictionaries. We developed a new correction methodology based on the concepts of Levenshtein automata and universal Levenshtein automata. Additionally,
we proposed a combined forward and backward dictionary traversal, which proved significantly more efficient than other methods. The results have been published in leading journals of
the field [2,5].
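To illustrate what approximate search in a dictionary means, here is a minimal Python sketch. It is not the project's method: instead of Levenshtein automata or the forward-backward traversal, it uses the simpler classic technique of walking a trie while maintaining one row of the edit-distance table, and pruning branches that can no longer stay within the distance bound. All names (`TrieNode`, `build_trie`, `approximate_search`) are assumptions made for this example.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One node of a character trie storing the dictionary."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: List[str]) -> TrieNode:
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def approximate_search(root: TrieNode, query: str, max_dist: int) -> List[Tuple[str, int]]:
    """Return all dictionary words within Levenshtein distance max_dist of query."""
    results: List[Tuple[str, int]] = []

    def walk(node: TrieNode, prefix: str, prev_row: List[int]) -> None:
        # prev_row[i] is the edit distance between prefix and query[:i].
        if node.is_word and prev_row[-1] <= max_dist:
            results.append((prefix, prev_row[-1]))
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,          # skip a query character
                               prev_row[i] + 1,         # extra dictionary character
                               prev_row[i - 1] + cost))  # match or substitution
            # The row minimum never decreases along a path, so pruning is safe.
            if min(row) <= max_dist:
                walk(child, prefix + ch, row)

    walk(root, "", list(range(len(query) + 1)))
    return sorted(results, key=lambda r: (r[1], r[0]))
```

For example, searching a small trie built from `["house", "mouse", "horse", "hose", "houses"]` for the OCR token `"hause"` with `max_dist=1` yields only `("house", 1)`. An automaton-based method precomputes the work that this sketch repeats for every trie edge.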
Another theoretical achievement is a new method for dictionary rewriting using subsequential transducers. We developed an efficient method for constructing, given a
rewrite dictionary, a subsequential transducer that accepts a text as input and outputs the intended rewriting result under the so-called ``leftmost longest match'' replacement with skips. The main application in the
project context is error dictionaries, which can be used to
correct typical errors automatically. The resulting rewriting mechanism is very efficient, since its running time is linear in
the text size. This result has been submitted to the JNLE.
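The replacement strategy itself can be shown with a short Python sketch. This is only an illustration of ``leftmost longest match'' semantics using a plain trie scan, not the subsequential transducer construction, and unlike the transducer it does not guarantee linear running time in the worst case. The function name and the sample rules are assumptions made for this example.

```python
from typing import Dict

_END = None  # trie key that marks "a rule ends here"; never equal to a character


def rewrite_leftmost_longest(text: str, rules: Dict[str, str]) -> str:
    """Apply rules left to right, always preferring the longest match."""
    trie: dict = {}
    for source, target in rules.items():
        if not source:
            continue  # an empty left-hand side can never be matched sensibly
        node = trie
        for ch in source:
            node = node.setdefault(ch, {})
        node[_END] = target

    out = []
    pos = 0
    while pos < len(text):
        node = trie
        best_end = -1
        best_target = ""
        i = pos
        # Follow the trie as far as the text allows, remembering the longest rule seen.
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if _END in node:
                best_end = i
                best_target = node[_END]
        if best_end >= 0:
            out.append(best_target)  # replace the matched span
            pos = best_end
        else:
            out.append(text[pos])    # no rule starts here: skip one character
            pos += 1
    return "".join(out)
```

With the hypothetical error dictionary `{"tbe": "the", "tb": "th", "vv": "w"}`, the input `"tbe vvorld tbat"` is rewritten to `"the world that"`: at the first position the longer rule `tbe` wins over `tb`, while in `tbat` only `tb` applies.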
From the practical point of view, the most important achievements are the preparation and evaluation of representative OCR
corpora, a large series of experiments with different correction strategies, and the realization of a flexible and effective software system for OCR correction.
We created a Bulgarian OCR corpus (2304 documents) and
a German OCR corpus (349 documents). To be representative, both corpora follow a predefined structure and contain only real-life documents with a wide variety of styles, layouts, formatting, fonts, font sizes and printing quality.
The analysis of the OCR errors in the corpora revealed a new error class in mixed-alphabet documents: letters of one alphabet are wrongly replaced by similarly shaped letters of the other alphabet.
This error class had not been considered in the literature before and proved to be very frequent in mixed Cyrillic-Latin texts. We reported the problem and a method for its correction in .
A further publication, to be submitted to a major international conference, is in preparation.
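The mixed-alphabet error class can be made concrete with a small Python sketch. It is not the correction method reported by the project; it is a simple majority-script heuristic assumed for illustration. The look-alike table is a partial, hand-picked list, written with Unicode escapes so that the Latin and Cyrillic letters cannot be confused in the source code.

```python
# Latin letters and visually similar Cyrillic letters (partial, illustrative list).
LATIN_TO_CYRILLIC = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}


def script_of(ch: str) -> str:
    """Classify a character as 'latin', 'cyrillic' or 'other'."""
    if ch.isascii() and ch.isalpha():
        return "latin"
    if "\u0400" <= ch <= "\u04ff":
        return "cyrillic"
    return "other"


def normalize_mixed_token(token: str) -> str:
    """Map look-alike letters of the minority script to the majority script."""
    scripts = [script_of(ch) for ch in token]
    n_latin = scripts.count("latin")
    n_cyrillic = scripts.count("cyrillic")
    # Single-script tokens and ties are left untouched: there is no clear majority.
    if n_latin == 0 or n_cyrillic == 0 or n_latin == n_cyrillic:
        return token
    table = LATIN_TO_CYRILLIC if n_cyrillic > n_latin else CYRILLIC_TO_LATIN
    return "".join(table.get(ch, ch) for ch in token)
```

A Bulgarian word recognized with a single Latin `o` among Cyrillic letters is mapped back to pure Cyrillic, and `computer` recognized with a Cyrillic `\u043e` is mapped back to pure Latin. A real system would additionally check the result against a dictionary, since a token can legitimately mix scripts.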
We ran many large-scale correction experiments with different correction dictionaries. These experiments showed that
the use of domain-specific and crawled ``Web Dictionaries'' significantly improves the correction results [3,6].
Another topic of interest was the combination of different ranking
strategies, based on word and collocation frequencies, weighted edit distance, etc., to achieve better correction results. We implemented an optimization technique that automatically finds the optimal linear combination
of the rankings and thereby optimizes the resulting correction.
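The idea of a linear combination of rankings can be sketched as follows. The text does not say which optimization technique the project used, so this Python example substitutes a plain grid search over weights that sum to 1. The score layout (one tuple of scores per candidate, higher is better) and all function names are assumptions made for this illustration.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Candidates = Dict[str, Sequence[float]]          # candidate -> one score per ranking
TrainingItem = Tuple[Candidates, str]            # (candidates, correct word)


def best_candidate(cands: Candidates, weights: Sequence[float]) -> str:
    """Pick the candidate with the highest weighted sum of its scores."""
    def combined(c: str) -> float:
        return sum(w * s for w, s in zip(weights, cands[c]))
    # Sorting first makes ties deterministic (alphabetically first candidate wins).
    return max(sorted(cands), key=combined)


def accuracy(data: List[TrainingItem], weights: Sequence[float]) -> float:
    hits = sum(1 for cands, gold in data if best_candidate(cands, weights) == gold)
    return hits / len(data)


def grid_search_weights(data: List[TrainingItem], n_scores: int,
                        steps: int = 10) -> Tuple[Tuple[float, ...], float]:
    """Try all weight vectors on a grid with sum 1 and keep the most accurate."""
    best_w: Tuple[float, ...] = ()
    best_acc = -1.0
    for combo in product(range(steps + 1), repeat=n_scores):
        if sum(combo) != steps:
            continue
        weights = tuple(c / steps for c in combo)
        acc = accuracy(data, weights)
        if acc > best_acc:
            best_w, best_acc = weights, acc
    return best_w, best_acc
```

On training data where one token is decided correctly only by the frequency score and another only by the similarity score, neither ranking alone reaches full accuracy, while a mixed weighting can. Grid search scales poorly with the number of rankings, so a larger system would need a more economical optimizer.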
Finally, we developed a very flexible architecture for an OCR postcorrection software system. We constructed a pipeline
in which data in a uniform XML format is processed by a chain of specialized tools. Initially, the XML data is derived from the OCR-ed text; at each subsequent step, one of the tools enriches the data with additional elements
such as correction candidates and various rankings. At the end, the data is evaluated and the corresponding correction result is given as output.
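The pipeline architecture described above might look roughly like the following Python sketch, built on the standard `xml.etree.ElementTree` module. The element and attribute names (`document`, `token`, `candidate`, `ocr`, `form`, `freq`) and the two stages are hypothetical; the project's actual XML format and tools are not specified here.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

Stage = Callable[[ET.Element], None]  # a tool that enriches the XML tree in place


def ocr_text_to_xml(text: str) -> ET.Element:
    """Derive the initial XML document: one <token> element per OCR word."""
    root = ET.Element("document")
    for word in text.split():
        ET.SubElement(root, "token", {"ocr": word})
    return root


def make_candidate_stage(lexicon: Dict[str, List[str]]) -> Stage:
    """Stage that attaches <candidate> elements to each token."""
    def stage(root: ET.Element) -> None:
        for token in list(root.iter("token")):
            for form in lexicon.get(token.get("ocr", ""), []):
                ET.SubElement(token, "candidate", {"form": form})
    return stage


def make_frequency_stage(frequencies: Dict[str, int]) -> Stage:
    """Stage that adds a frequency-based ranking to every candidate."""
    def stage(root: ET.Element) -> None:
        for cand in root.iter("candidate"):
            cand.set("freq", str(frequencies.get(cand.get("form", ""), 0)))
    return stage


def run_pipeline(text: str, stages: List[Stage]) -> ET.Element:
    root = ocr_text_to_xml(text)
    for stage in stages:
        stage(root)
    return root


def select_corrections(root: ET.Element) -> str:
    """Final step: output the best-ranked candidate, or the OCR form if none exists."""
    words = []
    for token in root.iter("token"):
        cands = token.findall("candidate")
        if cands:
            best = max(cands, key=lambda c: int(c.get("freq", "0")))
            words.append(best.get("form", ""))
        else:
            words.append(token.get("ocr", ""))
    return " ".join(words)
```

Because every stage reads and writes the same tree, tools can be added, removed or reordered without touching the others, which is the flexibility a uniform exchange format is meant to provide.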
The Final Project report can be found here.