OCR Post Correction for Endangered Language Texts
Shruti Rijhwani, Antonios Anastasopoulos, Graham Neubig

[Paper]     [BibTex]     [arXiv]     [Dataset]     [Code]     [Slides]     [Talk]

Do you have documents in an endangered language that require digitization? We can build OCR models for your data! Let us know here.

Come to our live Q&A session at EMNLP 2020 in Gather Session 4: Machine Translation and Multilinguality! Details here.

(a) Top left: A book of poetry in Ainu, an endangered language of northern Japan
(b) Top right: A children's book in the Yakkha language from Nepal
(c) Bottom: A book of folktales in the Griko language, spoken in southern Italy

Textual data in endangered languages is often found in formats that are not machine-readable, including scanned images of paper books such as those in the image above. Extracting the text is challenging because endangered languages are a data-scarce setting: there is typically no annotated data with which to train an OCR system.

Instead, we focus on post-correcting the OCR output from a general-purpose OCR system.

In this paper, we present:

For all the details, check out the paper!