Ocular, the automatic transcription tool used by the Reading the First Books project, was designed as a prototype to test experimental models. One of the goals of the project was to make Ocular easier to use by integrating it into an OCR workflow with a friendlier interface, simpler interaction with the tool, and clearer visualization of results.
To accomplish this goal, we partnered with the Early Modern OCR Project (eMOP) at Texas A&M University. eMOP, which was originally funded by a Mellon Foundation Grant in 2012, aimed to bring together tools for assessing, transcribing, and evaluating the automatic transcription of early modern books, ultimately transcribing some 45 million pages of data from Early English Books Online, Eighteenth Century Collections Online, and elsewhere.
One outcome of the eMOP project was an open source workflow for automatic transcription, which brings together Optical Character Recognition (OCR) tools for transcription with tools for pre-processing, post-processing, and evaluating text, many of which were created by eMOP. These tools come together through the user-friendly eMOP dashboard. By integrating Ocular into the eMOP dashboard and workflow, we would make the tool more accessible, while also gaining access to the tools for post-processing and evaluation.
Over a three-month period, the eMOP team (Matt Christy, Trey Dockendorf, and Liz Grumbach) worked with Dan Garrette (an Ocular developer) and me to make this integration possible. This involved restructuring Ocular to make it more intuitive, for example by changing the output so it was easier to find and interpret. It also involved making significant changes to the eMOP infrastructure, such as adding the ability to work with multiple fonts and languages and adding a training stage prior to transcription.
In May, we were able to produce the first transcriptions using Ocular with the eMOP interface. We tested the tool on six pages from a book written in Spanish, Latin, and Nahuatl. The system displays the transcriptions in plain text and in XML (using the ALTO schema maintained by the Library of Congress) to preserve information like the location of each character on the page and the language of each word. It produces both a “diplomatic” and a “normalized” (or modernized) version of the transcription. [See our discussion of modernized transcriptions (en español)].
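For readers curious about what the XML output makes possible, here is a minimal sketch in Python of reading word positions and language tags from an ALTO file. The sample document is made up for illustration and is not actual Ocular or eMOP output; the words, coordinates, language codes, and the helper name `words_with_language` are all assumptions. It relies only on the standard ALTO `String` element and its `CONTENT`, `HPOS`, `VPOS`, and `LANG` attributes, and it works at the word level rather than the character level.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment written for this example (not real Ocular/eMOP output).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1" WIDTH="1000" HEIGHT="1500">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="oracion" HPOS="100" VPOS="200" WIDTH="150" HEIGHT="40" LANG="es"/>
            <SP/>
            <String CONTENT="Pater" HPOS="270" VPOS="200" WIDTH="110" HEIGHT="40" LANG="la"/>
            <SP/>
            <String CONTENT="tlatolli" HPOS="400" VPOS="200" WIDTH="160" HEIGHT="40" LANG="nah"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def words_with_language(alto_xml):
    """Return one dict per ALTO String element: text, position, and language tag."""
    root = ET.fromstring(alto_xml)
    # Read the namespace off the root tag so the sketch is not tied to one ALTO version.
    prefix = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    words = []
    for string_el in root.iter(prefix + "String"):
        words.append(
            {
                "text": string_el.get("CONTENT"),
                "hpos": float(string_el.get("HPOS")),
                "vpos": float(string_el.get("VPOS")),
                # LANG is optional in ALTO, so this may be None for real files.
                "lang": string_el.get("LANG"),
            }
        )
    return words


for word in words_with_language(SAMPLE_ALTO):
    print(word["lang"], word["text"], (word["hpos"], word["vpos"]))
```

Keeping the language tag alongside each word's coordinates is what would allow later tools to, say, pull out only the Nahuatl passages of a page or highlight them on the page image.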
Though we were thrilled to have produced successful transcriptions, we still have a lot of work to do before we can begin transcribing the full corpus. Over the course of the testing process, we discovered new incompatibilities between eMOP and Ocular, and new bugs in the code. We have been working with Bryan Tarpley, a new member of the eMOP team, to resolve these challenges. We will also need to fine-tune the parameters for the documents in the Primeros Libros collection, and continue modifying language data and orthographies for the seven languages represented in our corpus.
Both Ocular and the eMOP dashboard are available via GitHub, though installing your own version does require some technical skill. We hope that future projects will also be able to partner with eMOP to take advantage of these tools. Ultimately, this should improve both the availability and the accuracy of automatic transcriptions for early modern books from Colonial America.