
Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros

The University of Texas at Austin

August 24, 2016, Filed Under: Research

Reading the First Books joins the Early Modern OCR Project

Ocular, the automatic transcription tool used by the Reading the First Books project, was designed as a prototype to test experimental models. One goal of our project was to make Ocular easier to use by integrating it into an OCR workflow with a friendlier interface, simpler interaction with the tool, and clearer visualization of results.

To accomplish this goal, we partnered with the Early Modern OCR Project (eMOP) at Texas A&M University. eMOP, originally funded by a Mellon Foundation grant in 2012, brought together tools for transcribing, evaluating, and improving the automatic transcription of early modern books, ultimately transcribing some 45 million pages of data from Early English Books Online, Eighteenth Century Collections Online, and elsewhere.

[Figure: Primeros Libros files uploaded into the eMOP dashboard, which shows options for languages and fonts alongside a list of early Mexican books.]

One outcome of the eMOP project was an open-source workflow for automatic transcription, which combines Optical Character Recognition (OCR) tools with tools for pre-processing, post-processing, and evaluating text, many of which were created by eMOP. These tools come together in the user-friendly eMOP dashboard. By integrating Ocular into the eMOP dashboard and workflow, we could make the tool more accessible while also gaining access to eMOP's post-processing and evaluation tools.
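
To make the division of labor concrete, here is a minimal sketch of the kind of staged workflow the dashboard coordinates. Every function and parameter name below is a hypothetical placeholder for illustration, not eMOP's actual API.

    # A minimal sketch of a staged OCR workflow; all names are hypothetical
    # placeholders, not eMOP's actual API.

    def preprocess(page_image):
        """Clean the page image (e.g., binarize, remove noise and skew)."""
        ...

    def transcribe(page_image, font_model, language_models):
        """Run the OCR engine (Ocular, in our case) over the cleaned image."""
        ...

    def postprocess(raw_output):
        """Correct and enrich the raw transcription."""
        ...

    def evaluate(transcription, ground_truth=None):
        """Score the result, e.g., character error rate against a hand-made key."""
        ...

    def run_workflow(page_image, font_model, language_models, ground_truth=None):
        cleaned = preprocess(page_image)
        raw = transcribe(cleaned, font_model, language_models)
        final = postprocess(raw)
        return final, evaluate(final, ground_truth)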

Over a three-month period, the eMOP team (Matt Christy, Trey Dockendorf, and Liz Grumbach) worked with Dan Garrette (an Ocular developer) and me to make this integration possible. This involved restructuring Ocular to make it more intuitive to use, for example by changing the output format so that results are easier to find and interpret. It also involved significant changes to the eMOP infrastructure, such as adding the ability to work with multiple fonts and languages, and adding a training stage prior to transcription.

[Figure: Automatic transcriptions of a sixteenth-century Mexican book, produced in XML using eMOP.]

In May, we were able to produce the first transcriptions using Ocular with the eMOP interface. We tested the tool on six pages from a book written in Spanish, Latin, and Nahuatl. The system outputs the transcriptions in plain text and in XML (using the ALTO schema maintained by the Library of Congress) to preserve information like the location of each character on the page and the language of each word. It produces both a “diplomatic” and a “normalized” (or modernized) version of the transcription. [See our discussion of modernized transcriptions below.]
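
For a sense of how that layout and language information can be used downstream, here is a minimal sketch that reads word positions back out of an ALTO file. CONTENT, HPOS, and VPOS are standard ALTO attributes; the namespace version, the per-word LANG attribute, and the file name are assumptions made for illustration rather than details of eMOP's actual output.

    # A minimal sketch of reading word positions and languages out of an ALTO
    # XML transcription. CONTENT/HPOS/VPOS are standard ALTO attributes; the
    # namespace version and the per-word LANG attribute are assumptions.
    import xml.etree.ElementTree as ET

    ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}  # version may differ

    def words_with_layout(alto_path):
        """Yield (text, x, y, language) for each <String> element on the page."""
        root = ET.parse(alto_path).getroot()
        for string in root.iterfind(".//alto:String", namespaces=ALTO_NS):
            yield (
                string.get("CONTENT"),
                float(string.get("HPOS")),
                float(string.get("VPOS")),
                string.get("LANG", "unknown"),  # assumed per-word language tag
            )

    for text, x, y, lang in words_with_layout("page_001.xml"):  # hypothetical file
        print(f"{text!r} at ({x}, {y}) [{lang}]")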

Though we were thrilled to have produced successful transcriptions, we still have a lot of work to do before we can begin transcribing entire books. Over the course of the testing process, we discovered new incompatibilities between eMOP and Ocular, and new bugs in the code. We have been working with Bryan Tarpley, a new member of the eMOP team, to resolve these challenges. We will also need to fine-tune the parameters for the documents in the Primeros Libros collection and continue modifying the language data and orthographies for the seven languages represented in our corpus.
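
As a rough sketch of what that per-language tuning might involve, here is a hypothetical configuration; every field name and path is invented for illustration, and only the three languages named above are listed.

    # Hypothetical per-language settings; field names and paths are invented
    # and do not reflect eMOP's or Ocular's actual formats.
    LANGUAGE_DATA = {
        "spanish": {"modern_corpus": "texts/spanish.txt", "allow_historical_spellings": True},
        "latin":   {"modern_corpus": "texts/latin.txt",   "allow_historical_spellings": True},
        "nahuatl": {"modern_corpus": "texts/nahuatl.txt", "allow_historical_spellings": True},
        # the remaining four corpus languages would be configured the same way
    }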

Both Ocular and the eMOP dashboard are available via GitHub, though installing your own version does require some technical skill. We hope that future projects will also be able to partner with eMOP to take advantage of these tools. Ultimately, this should improve both the availability and the accuracy of automatic transcriptions for early modern books from Colonial America.

March 25, 2016, Filed Under: Research

New Tools for Modernized Transcription

The goal of the “Reading the First Books” project is to design and implement tools for the transcription of the Primeros Libros collection of books printed in New Spain in the sixteenth century. One of the first tasks we faced when we began this project was to ask historians: what kind of transcriptions would you like to see?

We offered two choices. The first, a diplomatic transcription, would be one that preserved all the orthographic oddities of the original documents: obsolete characters like the “long s” (ſ), inconsistent spelling, missing accents, historical shorthand, and typographical errors. The second, a normalized transcription, would be one that rewrote the original documents according to modern conventions by expanding shorthand, replacing obsolete characters with their modern equivalents, and standardizing spelling.

About half the scholars we spoke to wanted to see a diplomatic transcription: “because spelling variation can tell us things about the original compositor!”

The other half of our informal pool wanted to see a normalized transcription: “because searching, reading, and processing are easier when text is normalized.”

An informal Twitter survey confirmed our results: 64% of respondents wanted both diplomatic and normalized transcriptions. But this was easier said than done. Optical Character Recognition (OCR), which we use to automatically transcribe historical printed documents, can only produce diplomatic transcriptions: it moves sequentially across the page, seeking to match each printed glyph to the characters it knows. Tools that have been designed to modernize transcribed text, on the other hand, depend on hand-crafted dictionaries that exist almost exclusively for historical English. No tool existed to easily modernize historical Spanish, never mind Nahuatl.

[Figure: A digital facsimile of a sixteenth-century document alongside sample diplomatic and normalized transcriptions, produced automatically and simultaneously with Ocular. The diplomatic transcription preserves historical variation; the normalized transcription follows modern standards.]

So we set out to address the problem ourselves by modifying Ocular, our favorite OCR tool, to automatically (and simultaneously) produce both diplomatic and modernized transcriptions. Our first attempt at this challenge, which will appear in the proceedings of NAACL 2016, works by automatically discovering patterns of orthographic variation in a given text. First, we provide the system with samples of modernized text, like documents from Project Gutenberg. Then it compares the language used in those documents with the characters it sees on the page. When it finds inconsistencies, it learns them as substitution patterns, outputting both the printed letters and their modern equivalents. [Read a technical description.] [Download the software.]
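
As a rough illustration of what discovering orthographic variation means, the toy sketch below counts character-level disagreements between historical and modern spellings of the same words. The word pairs are invented, and Ocular learns these patterns jointly during transcription rather than from pre-aligned pairs, so this is a simplification of the idea, not the model itself.

    # A toy illustration (not Ocular's model): count character-level
    # disagreements between invented historical/modern spelling pairs.
    from collections import Counter
    from difflib import SequenceMatcher

    pairs = [("cõ", "con"), ("quãdo", "cuando"),
             ("vezes", "veces"), ("aprouecha", "aprovecha")]

    subs = Counter()
    for hist, modern in pairs:
        for op, i1, i2, j1, j2 in SequenceMatcher(None, hist, modern).get_opcodes():
            if op != "equal":  # a span where the two spellings disagree
                subs[(hist[i1:i2], modern[j1:j2])] += 1

    # Report each variant pattern with its raw and relative frequency.
    total = sum(subs.values())
    for (old, new), n in subs.most_common():
        print(f"{old!r} -> {new!r}: {n}/{total}")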

As the figure above shows, the result is a tool that automatically transcribes historical documents, preserving historical variation while simultaneously producing a modernized text. In the example shown here, the system has learned that a tilde over a vowel signifies an elided m or n; that a q may be used in place of a c, or a u in place of a v; and that the words apro and uecha are two parts of a whole (despite the missing hyphen). We can also see where the tool has made mistakes: misreading a u as a p, for example, in the diplomatic version.
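
To make those patterns concrete, the sketch below applies them as hand-written regular-expression rules. Ocular learns such substitutions statistically; these rules are only an illustration of the patterns it finds, not its implementation.

    # Hand-written rules illustrating the patterns described above; Ocular
    # learns these statistically rather than from a rule list like this one.
    import re
    import unicodedata

    RULES = [
        (r"([aeiou])\u0303", r"\1n"),     # tilde over a vowel: an elided n (or m)
        (r"q(?=ua)", "c"),                # q in place of c, e.g. "quando" -> "cuando"
        (r"(?<=[aeo])u(?=[aeio])", "v"),  # u in place of v between vowels
    ]

    def normalize(diplomatic: str) -> str:
        # Decompose precomposed letters so a tilde becomes a combining mark.
        modern = unicodedata.normalize("NFD", diplomatic)
        for pattern, replacement in RULES:
            modern = re.sub(pattern, replacement, modern)
        return unicodedata.normalize("NFC", modern)

    print(normalize("apro" + "uecha"))  # -> "aprovecha", rejoining the split word
    print(normalize("quãdo"))           # -> "cuando"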

[Figure: Orthographic patterns in texts from New Spain, identified by Ocular. Each row shows a modern character, its historical variant, its frequency, and the substitution probability learned by the model.]

Despite these errors, we see a lot of potential for this tool. Producing both kinds of transcription at once actually improves the accuracy of both versions: the diplomatic version benefits from our knowledge of modern language, while the normalized version is no longer tied to the accuracy of a previously produced diplomatic variant. It also means that, without significantly increasing our use of resources, we can better meet the needs of our users: documents can be searched, parsed, tagged, or analyzed for historical orthography.

As it analyzes documents to produce transcriptions, our modified OCR tool learns patterns of orthographic variation. If we preserve this information, we can acquire new knowledge about our corpus. OCR is often thought of as a necessary bottleneck on the way to corpus analytics. But with tools like this, transcription can be simultaneously an act of analysis, and a stage in the production of more accessible, discoverable resources.
