CHARACTER SHAPE PRECLASSIFICATION IN MIXED SCRIPT OCR FOR MACEDONIAN LANGUAGE
Dejan Gorgevik, Ljupco Josifovski, Dragan Mihajlov
Abstract: Despite the presence of many commercial OCR programs on the market, recognition accuracy remains a difficult problem, especially for multilingual documents. In the absence of an OCR program that can recognize mixed-script documents concurrently, we have developed an OCR program for recognizing predominantly Macedonian text that contains words in Latin script. This paper presents an adaptation of preclassification based on character shape to Macedonian characters alongside standard Latin characters. Preclassification uses text-line parameters such as the baseline and upper baseline, together with the dimensions of the character boxes and their horizontal position within the text line.
Keywords: OCR, character preclassification, mixed script text recognition
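
To make the idea in the abstract concrete, the following is a minimal sketch of shape preclassification from the vertical extent of a character box relative to the upper baseline and the baseline. It is an illustration under stated assumptions, not the implementation described in this paper: the names Box and shape_class, the four class labels, and the tolerance tol are hypothetical, and the sketch ignores box width and horizontal position, which the abstract also lists as inputs.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Vertical extent of a character bounding box (hypothetical type).

    Image coordinates are assumed: y grows downward, so a smaller
    value of `top` means the box reaches higher on the page.
    """
    top: int
    bottom: int


def shape_class(box: Box, upper_baseline: int, baseline: int, tol: int = 2) -> str:
    """Return a coarse shape class for one character box.

    `tol` is an assumed pixel tolerance that absorbs small
    segmentation noise around the two reference lines.
    """
    rises_above = box.top < upper_baseline - tol   # extends above the upper baseline
    drops_below = box.bottom > baseline + tol      # extends below the baseline
    if rises_above and drops_below:
        return "full"
    if rises_above:
        return "ascender"
    if drops_below:
        return "descender"
    return "x-height"


# Example text line with the upper baseline at y=20 and the baseline at y=40.
line = {"upper_baseline": 20, "baseline": 40}
classes = [
    shape_class(Box(20, 40), **line),  # box fits between the two lines
    shape_class(Box(8, 40), **line),   # box rises above the upper baseline
    shape_class(Box(20, 50), **line),  # box drops below the baseline
    shape_class(Box(8, 50), **line),   # box does both
]
```

A later recognition stage could then compare a character only against the candidates that share its coarse class, regardless of whether they are Cyrillic or Latin, which is the usual motivation for a preclassification step.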