I am a Ph.D. candidate in the Language Technologies Institute at Carnegie Mellon University, advised by Graham Neubig.
My main research interest is natural language processing (NLP), and I have worked on NLP tasks across a broad range of domains and languages. My current work focuses on developing models for multilingual and low-resource NLP.
I was recently named to the Forbes 30 Under 30 in Science list for my work on NLP for endangered languages!
In the past, I’ve worked as a research intern at Bloomberg AI and as a research fellow at Microsoft Research.
More information about my work experience, publications, and academic service can be found in my CV.
I am best reached by email at email@example.com. Feel free to reach out about my research or anything else I might be able to help with. I’m always happy to answer questions about getting started with NLP research and applying to Ph.D. programs.
A full list of my publications can be found here.
OCR-EL: optical character recognition for low-resource and endangered languages. [Webpage] [Software] [Papers: 1, 2]
Temporally-aware NER: measuring the effect of temporal drift on named entity recognition. [Dataset] [Paper]
Low-resource entity extraction and linking: using cross-lingual transfer, multilingual knowledge bases, and phonological representations. [Entity linking software] [NER software] [Papers: 1, 2, 3, 4, 5]
Print and probability: OCR models to discover the printers of a 375-year-old document, John Milton’s Areopagitica, one of the most significant documents in the history of the freedom of the press. [Press coverage] [CMU blog coverage] [SoFCB Essay Prize 2021] [Paper]
- Workshop on Computational Methods for Endangered Languages at ACL 2022 [link]
- Student Research Workshop at ACL 2020 [link]
- CMU SCS Graduate Application Support Program, 2020
- CMU LTI Diversity, Equity, and Inclusion Committee [link]
- Diversity and Inclusion Committee at NAACL 2019 [link]
- CMU Language Technologies Mentoring Program (for new graduate students; 2021)
- CMU Graduate Application Support Mentor (2020, 2021)
- CMU AI Mentoring Program (for undergraduates; 2019, 2020, 2021)
- AAAI 2022, AAAI 2021, ARR 2021, EACL 2021, NAACL 2021, ACL 2021, AmericasNLP 2021, AAAI 2020, HAMLETS 2020, LREC 2020, EMNLP 2020, *SEM 2020, AACL SRW 2020, AfricaNLP 2020, TALLIP 2019, CALCS 2018
Lexically-Aware Semi-Supervised Learning for OCR Post-Correction
S. Rijhwani, D. Rosenblum, A. Anastasopoulos, G. Neubig
MasakhaNER: Named Entity Recognition for African Languages
D. I. Adelani et al., including S. Rijhwani
Evaluating the Morphosyntactic Well-formedness of Generated Texts
A. Pratapa, A. Anastasopoulos, S. Rijhwani, A. Chaudhary et al.
Dependency Induction Through the Lens of Visual Perception
R. Su, S. Rijhwani, H. Zhu, J. He, X. Wang, Y. Bisk, G. Neubig
OCR Post-Correction for Endangered Language Texts
S. Rijhwani, A. Anastasopoulos, G. Neubig
Soft Gazetteers for Low-Resource Named Entity Recognition
S. Rijhwani, S. Zhou, G. Neubig, J. Carbonell
Temporally-Informed Analysis of Named Entity Recognition
S. Rijhwani and D. Preotiuc-Pietro
Improving Candidate Generation for Low-resource Cross-lingual Entity Linking
S. Zhou, S. Rijhwani, J. Wieting, J. Carbonell, G. Neubig
AlloVera: A Multilingual Allophone Database
D. R. Mortensen, X. Li, P. Littell, A. Michaud, S. Rijhwani et al.
Damaged Type and Areopagitica’s Clandestine Printers
C. N. Warren, P. Williams, S. Rijhwani, M. G’Sell
Milton Studies, 2020
A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
G. Neubig, S. Rijhwani, A. Palmer, J. MacKenzie, H. Cruz, X. Li, M. Lee et al.
First Joint SLTU and CCURL Workshop, 2020
Practical Comparable Data Collection for Low-Resource Languages via Images
A. Madaan, S. Rijhwani, A. Anastasopoulos, Y. Yang, G. Neubig
Practical Machine Learning for Developing Countries Workshop, 2020
Zero-shot Neural Transfer for Cross-lingual Entity Linking
S. Rijhwani, J. Xie, G. Neubig, J. Carbonell
Choosing Transfer Languages for Cross-Lingual Learning
Y. Lin, C. Chen, J. Lee, Z. Li, Y. Zhang, M. Xia, S. Rijhwani, J. He et al.
Towards Zero-resource Cross-lingual Entity Linking
S. Zhou, S. Rijhwani, G. Neubig
Workshop on Deep Learning Approaches for Low-Resource NLP, 2019
Parser Combinators for Tigrinya and Oromo Morphology
P. Littell, T. McCoy, N. Han, S. Rijhwani, Z. Sheikh, D. Mortensen, T. Mitamura, L. Levin
Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique
S. Rijhwani, R. Sequiera, M. Choudhury, K. Bali, C. S. Maddila
Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology-Based Representations
P. Michel*, A. Ravichander*, S. Rijhwani*
Second Workshop on Representation Learning for NLP, 2017
Code-Switching as a Social Act: The Case of Arabic Wikipedia Talk Pages
M. Yoder, S. Rijhwani, C. Rosé, L. Levin
Second Workshop on NLP and Computational Social Science, 2017
Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?
K. Rudra, S. Rijhwani, R. Begum, K. Bali, M. Choudhury, N. Ganguly
Experiments with Cross-lingual Systems for Synthesis of Code-Mixed Text
S. Sitaram, S. K. Rallabandi, S. Rijhwani, A. W. Black
Ninth ISCA Speech Synthesis Workshop (SSW), 2016