OCR for Endangered Language Texts
Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos, Graham Neubig

Thousands of books and documents contain text in endangered languages: language learning textbooks, cultural texts, and other materials created by documentary linguists and language education programs. However, the vast majority are not widely accessible because they exist only as printed books and handwritten notes that have never been digitized.

We’re building natural language processing (NLP) models to improve the accuracy of Optical Character Recognition (OCR) systems on low-resource and endangered languages. This lets us extract text from non-digitized documents, converting them into a machine-readable format and making them accessible and searchable online.

📌 Our software is available on GitHub.

📌 We’ve also created a benchmark dataset for OCR on endangered languages.

Do you have documents that require digitization? We can build OCR models for your data! Let us know.


Figure: OCR post-correction on a scanned document that contains text in the endangered language Kwak’wala. The goal of post-correction is to fix the recognition errors made by the first-pass OCR system.

Our research focuses on improving the results of existing OCR systems through OCR post-correction: we design models that automatically correct the recognition errors in OCR transcriptions.
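
As a concrete illustration of the post-correction setup, the toy sketch below learns character-span confusions from a handful of (first-pass OCR, gold transcription) pairs and applies them to a new OCR line. The real system is a neural sequence-to-sequence model, and the example data here is made up; this sketch only shows the input/output contract of post-correction.

```python
import difflib
from collections import Counter

def learn_confusions(pairs):
    """Collect character-span substitutions observed in (OCR, gold) pairs."""
    confusions = Counter()
    for ocr, gold in pairs:
        matcher = difflib.SequenceMatcher(None, ocr, gold)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "replace":
                confusions[(ocr[i1:i2], gold[j1:j2])] += 1
    return confusions

def post_correct(line, confusions):
    """Apply the learned substitutions to a first-pass OCR line."""
    for (src, tgt), _count in confusions.most_common():
        if src:
            line = line.replace(src, tgt)
    return line

# Made-up training pairs: (first-pass OCR output, gold transcription).
pairs = [
    ("Tlie quick brown fox", "The quick brown fox"),
    ("tlie end of tlie story", "the end of the story"),
]
confusions = learn_confusions(pairs)
print(post_correct("Tlie lazy dog", confusions))  # -> The lazy dog
```

In practice, the correction model learns over whole character sequences jointly rather than applying independent string replacements, which lets it use context to decide when a substitution is appropriate.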

In our EMNLP 2020 paper, we present:

- a benchmark dataset of transcribed scanned documents in three critically endangered languages (the dataset linked above), and
- an OCR post-correction method designed to adapt effectively to this low-resource setting.

Our recent TACL 2021 paper improves over our previous work with a semi-supervised OCR post-correction method. The semi-supervised method has two key components:

- self-training, where the model’s own corrections on unannotated pages are used as additional pseudo-labeled training data (sketched in the code below), and
- lexically aware decoding, which uses a count-based language model built from the corrected text to steer the decoder toward plausible words.

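To make the first component concrete, here is a schematic self-training loop. The `train`, `decode`, and `confidence` callables are hypothetical stand-ins for the paper’s supervised training, lexically aware decoding, and pseudo-label filtering; the fixed confidence threshold is an illustrative assumption rather than the paper’s exact criterion.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (first-pass OCR line, corrected line)

def self_train(
    labeled: List[Pair],
    unlabeled: List[str],
    train: Callable[[List[Pair]], object],            # stand-in: supervised training
    decode: Callable[[object, str], str],             # stand-in: (lexically aware) decoding
    confidence: Callable[[object, str, str], float],  # stand-in: pseudo-label scoring
    rounds: int = 3,
    threshold: float = 0.9,  # illustrative assumption, not the paper's criterion
):
    """Warm-start on labeled pages, then repeatedly pseudo-label the
    unannotated pages with the current model and retrain on the union."""
    model = train(labeled)
    for _ in range(rounds):
        pseudo = []
        for line in unlabeled:
            hypothesis = decode(model, line)
            if confidence(model, line, hypothesis) >= threshold:
                pseudo.append((line, hypothesis))
        model = train(labeled + pseudo)
    return model
```
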
In experiments on four endangered languages, our method improves digitization accuracy over our previous model, with relative error reductions of 15-29%.
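
As a reminder of what the metric means, relative error reduction is the fraction of the baseline model’s errors that the new model eliminates. A minimal sketch with illustrative numbers (not results from the paper):

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline model's errors eliminated by the new model."""
    return (baseline_error - new_error) / baseline_error

# Illustrative numbers only: an error rate dropping from 10% to 7.5%
# corresponds to a 25% relative error reduction.
print(relative_error_reduction(0.10, 0.075))  # 0.25
```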

More details: