The University of Texas at Austin is one of six recipients of a Digital Humanities Implementation Grant from the National Endowment for the Humanities (NEH). The grant of $215,000 will fund “Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros,” a project to extend the capabilities of current open-source optical character recognition (OCR) technology for use in the transcription of sixteenth-century texts. LLILAS Benson Latin American Studies and Collections will administer the grant as part of its new Digital Scholarship program.
The tool developed under the project will be used to produce transcriptions of the digitized books in the Primeros Libros de las Américas collection, which currently includes over 330 copies of books printed in the Americas before 1601. Books in the collection include text in Spanish, Latin, and several indigenous Latin American languages, including Nahuatl, once spoken by the Aztecs and still spoken by some 1.5 million people. UT Libraries and the Benson Latin American Collection are founding members of the Primeros Libros consortium, along with Texas A&M University and the Biblioteca José María Lafragua at the Benemérita Universidad Autónoma de Puebla. The consortium currently has over 20 member libraries from throughout the Americas and Europe, including the John Carter Brown Library, Monterrey Institute of Technology and Higher Education (ITESM), and the Universidad Complutense in Madrid.
The ability of scholars and students to work with early modern texts in digital form has been limited by the challenges of transcribing them: printed centuries ago, these books contain variable typefaces, typesetting, and spelling, as well as multilingual text, which conventional OCR software does not recognize. The goal of this project is to develop and implement new methods for the automatic transcription of early modern printed books. This will help scholars to shine a light on a period of history that saw a transition away from oral culture, the rise of literacy, and the birth of the scientific method.
The two-year project, which begins Sept. 1, 2015, will be overseen by Sergio Romero, assistant professor at the Teresa Lozano Long Institute of Latin American Studies (LLILAS) and the Department of Spanish and Portuguese, and by Kent Norsworthy, LLILAS Benson digital scholarship coordinator. The project further develops a prototype of Ocular, a new OCR tool developed by Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein at UC Berkeley and adapted for Primeros Libros by comparative literature PhD student Hannah Alpert-Abrams and computer scientist Dan Garrette (U. Washington). The tool will be integrated into the Early Modern OCR Project by a team at Texas A&M University, who are partners in the grant. UT Libraries will incorporate the transcriptions produced under the project into the existing Primeros Libros website.
Alpert-Abrams, who is also a LLILAS Benson Digital Scholarship graduate research assistant, stresses the importance of collaboration across disciplines and across universities, as well as the implications for broader use of the new technology: “The NEH grant is exciting because it gives us an opportunity to conduct research and build tools with scholars from multiple disciplines and universities. The ultimate goal is to produce a tool that will be useful for anyone interested in producing digital collections of historical documents, across regions and languages.”
Nahuatl scholar Kelly McDonough, assistant professor in the UT Department of Spanish and Portuguese, sees great promise in this technology for the classroom and beyond. She says that as a result of the successful extension of OCR technology, “scholars and students will be able to rapidly search multiple corpora of multilingual texts—a task that is extraordinarily, often prohibitively, time-consuming without this technology.” In her own work, which includes the study of female indigenous leaders in colonial Mexico, she will be able to search for rarely used terms and “variants of terminology utilized by indigenous scribes over a long period of time and a large geographic area. In short, we will be able to ask questions of massive amounts of data that we simply couldn’t ask before.”