About

Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros was a two-year, multi-university effort to develop tools for the automatic transcription of early modern printed books. The focus of the project was the Primeros Libros de las Américas collection, which seeks to produce digital facsimiles of all books printed before 1601 in the Americas. A collaboration between the LLILAS Benson Latin American Studies and Collections at UT Austin and the Early Modern OCR Project at Texas A&M University, Reading the First Books sought to develop corpora and advanced tools for the transcription of these historical documents, including materials specifically tailored for the transcription of indigenous-language texts. It also set out to produce OCR transcriptions of the books in the Primeros Libros collection. The project was funded by a National Endowment for the Humanities Digital Implementation Grant.

The Reading the First Books project was completed in December 2017. This website has been modified to reflect the completion of the project. The website as it existed upon completion has been preserved through the Internet Archive’s Wayback Machine.

Project Description

Digital facsimile collections of early modern printed books (books printed on hand presses in the 15th-17th centuries) greatly improve access to these historical documents for scholars, students, and the general public. The utility and accessibility of these digital collections, however, have been limited by the challenges of transcribing early modern printed books: linguistic complexity, unstable orthography (spelling and punctuation), and uneven typesetting and inking make these books difficult to read for humans and machines alike.

Three images: three lines from a facsimile edition of a 16th-century book in Spanish and Nahuatl; the same lines color-coded to reveal language switching; and Ocular's output with color-coded transcriptions.

Multilingual Ocular output identifies language switches and produces machine-readable transcriptions. Image shows digital facsimile (top) and automatic transcription in Spanish and Nahuatl (bottom).

To address this challenge, this project continues the development of Ocular, an Optical Character Recognition (OCR) tool first developed by Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein in 2013. Ocular transcribes historical documents by modeling the behavior of a hand press, including uneven printing and inking. Our project extends Ocular by focusing on the problems posed by variable orthographies and multilingual documents. In a paper by Dan Garrette, Hannah Alpert-Abrams, Taylor Berg-Kirkpatrick, and Dan Klein published by the North American Chapter of the Association for Computational Linguistics in 2015, we proposed a modified system that uses multilingual language models and ran experiments using a trilingual model for Spanish, Latin, and Nahuatl (the dominant indigenous language of central Mexico) to produce multilingual transcriptions. This system can be used on documents in any language, from anywhere in the world. A subsequent paper adds features to automatically learn historical orthography and produce “normalized” transcriptions.

The Reading the First Books project will implement this new, open-access digital tool for use in humanities research. The project is managed by Dr. Sergio Romero (Assistant Professor, LLILAS and Spanish & Portuguese, UT Austin); Albert A. Palacios (Digital Scholarship Coordinator, LLILAS Benson); Hannah Alpert-Abrams (PhD Student, comparative literature, UT Austin); and María Victoria Fernández. In collaboration with the Initiative for Digital Humanities, Media, and Culture at Texas A&M University, we will work to incorporate Ocular into their Early Modern OCR Project (eMOP) workflow. eMOP, which has been used to transcribe 45 million pages, leverages and produces cutting-edge tools for analyzing and transcribing facsimiles of texts printed in the 15th-18th centuries. This integration will open the way for the automatic transcription of new collections, particularly those that extend digital scholarship beyond monolingual corpora.

The project will also use the eMOP workflow to automatically transcribe the digital facsimiles in the Primeros Libros collection, producing a new corpus of machine-readable texts in Huastec, Latin, Mixtec, Nahuatl, Otomi, Spanish, Tarascan (Purépecha), and Zapotec. Primeros Libros is a unique reflection of the range of textual production in early colonial America by both European and indigenous intellectuals. This includes, for example, the only Nahuatl grammar from the period written by a native speaker of the language. Our newly transcribed corpora will create new possibilities for scholarship and increase discoverability for users such as the approximately 1.5 million Nahuatl speakers living in Mexico and the United States today. This corpus will also carry with it new, implicitly learned statistical information about the documents, including patterns in language use as well as patterns in printing processes.

More Information

Learn more about using Ocular: Ocular FAQs.

Follow the Reading the First Books website for updates as we continue to develop this project.

For more information about the project, contact María Victoria Fernández (maria.victoria.fernandez@utexas.edu) or Hannah Alpert-Abrams (halperta@gmail.com).
For information about implementing Ocular, contact Taylor Berg-Kirkpatrick (tberg@cs.cmu.edu) or Dan Garrette (dhgarrette@gmail.com).

 

***

Any views, findings, conclusions, or recommendations expressed in this web site do not necessarily represent those of the National Endowment for the Humanities.