(a) Top left: A book of poetry in Ainu, an endangered language of northern Japan
(b) Top right: A children's book in the Yakkha language from Nepal
(c) Bottom: A book of folktales in the Griko language, spoken in southern Italy
Textual data in endangered languages is often found in formats that are not machine-readable, including scanned images of paper books such as those in the image above. Extracting the text is challenging because endangered languages are data-scarce: there is typically no annotated data with which to train an OCR system.
Instead, we focus on post-correcting the OCR output from a general-purpose OCR system.
In this paper, we present:
- A benchmark dataset for OCR and OCR post-correction
  - The dataset contains documents in three critically endangered languages: Ainu, Griko, Yakkha.
- An extensive analysis of existing OCR systems
  - We show that these systems are not robust to the data-scarce setting of endangered languages.
- An OCR post-correction method adapted to data-scarce settings
  - Our method reduces the word error rate by 34%, on average, over a state-of-the-art OCR system.
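
To make the word error rate figure concrete, here is a minimal sketch of how the metric and a relative reduction are commonly computed. It assumes the standard definition (word-level edit distance divided by the number of reference words) and whitespace tokenization; the paper's exact evaluation setup may differ. The function names and the toy strings are invented for illustration and are not taken from the dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def relative_reduction(baseline_wer, corrected_wer):
    """Fraction of the baseline error removed by post-correction."""
    return (baseline_wer - corrected_wer) / baseline_wer


# Toy example (invented strings, not real system output).
reference = "the quick brown fox jumps"
ocr_output = "the qu1ck brown f0x jumps"      # two wrong words
post_corrected = "the quick brown f0x jumps"  # one wrong word

baseline = word_error_rate(reference, ocr_output)
corrected = word_error_rate(reference, post_corrected)
print(f"baseline WER:       {baseline:.2f}")
print(f"corrected WER:      {corrected:.2f}")
print(f"relative reduction: {relative_reduction(baseline, corrected):.0%}")
```

Under this reading, a 34% reduction means the post-corrected output has about two thirds of the word errors of the first-pass OCR output, averaged over the three languages.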
For all the details, check out the paper!