Accession Number : ADA325444

Title : Foreign Language Optical Character Recognition, Phase II: Arabic and Persian Training and Test Data Sets.

Descriptive Note : Final report

Corporate Author : SCIENCE APPLICATIONS INTERNATIONAL CORP MCLEAN VA

Personal Author(s) : Davidson, Robert B. ; Hopely, Richard L.

Report Date : MAY 1997

Pagination or Media Count : 11

Abstract : This report describes the creation of large data sets consisting of bit-mapped images of real-world printed secular Arabic-alphabet text (in Arabic and Persian), accompanied by corresponding high-fidelity coded transcriptions (text ground truth), that have been systematically chosen, prepared, and documented. Each data set is divided into a training set, which is made available to developers, and a carefully matched equal-sized set of closely analogous samples, which is reserved for testing of the developers' products. The samples were systematically chosen to represent current vocabulary, usage, typography, and publication practices in major newspapers and news magazines, and in recent books and journals dealing with politics, economics, and commercial and military matters. Lexicons and character-frequency tables have been compiled for each data set and for the Arabic collection as a whole.

Descriptors : *DATA BASES, *OPTICAL CHARACTER RECOGNITION, *ARABIC LANGUAGE, IMAGE PROCESSING, VOCABULARY, COMPUTER FILES, COMPUTATIONAL LINGUISTICS, NATURAL LANGUAGE, TEXT PROCESSING, WORD RECOGNITION, LEXICOGRAPHY, TYPOGRAPHY.

Subject Categories : Linguistics ; Cybernetics

Distribution Statement : APPROVED FOR PUBLIC RELEASE