The goal of the “Reading the First Books” project is to design and implement tools for the transcription of the Primeros Libros collection of books printed in New Spain in the sixteenth century. One of the first tasks we faced when we began this project was to ask historians: what kind of transcriptions would you like to see?
We offered two choices. The first, a diplomatic transcription, would be one that preserved all the orthographic oddities of the original documents: obsolete characters like the “long s” (ſ), inconsistent spelling, missing accents, historical shorthand, and typographical errors. The second, a normalized transcription, would be one that rewrote the original documents according to modern conventions by expanding shorthand, replacing obsolete characters with their modern equivalents, and standardizing spelling.
About half the scholars we spoke to wanted to see a diplomatic transcription: “because spelling variation can tell us things about the original compositor!”
The other half of our informal pool wanted to see a normalized transcription: “because searching, reading, and processing are easier when text is normalized.”
An informal Twitter survey confirmed our results: 64% of respondents wanted both diplomatic and normalized transcriptions. But this was easier said than done. Optical Character Recognition (OCR), which we use to automatically transcribe historical printed documents, can only produce diplomatic transcriptions: it moves sequentially through a string of characters, seeking to match them to images in its database. Tools that have been designed to modernize transcribed text, on the other hand, depend on hand-crafted dictionaries that exist almost exclusively for historical English. No tool existed to easily modernize historical Spanish, never mind Nahuatl.
So we set out to address the problem ourselves by modifying Ocular, our favorite OCR tool, to automatically (and simultaneously) produce both diplomatic and modernized transcriptions. Our first attempt at this challenge, which will appear in the proceedings of NAACL 2016, works by automatically discovering patterns of orthographic variation in a given text. First, we provide it with samples of modernized text, like documents from Project Gutenberg. Then it compares the usage of language in those documents with the characters that it sees on the page. When it finds inconsistencies, it treats them as orthographic variation and outputs both the printed letters and their modern equivalents. [Read a technical description.] [Download the software.]
As the image to the left shows, the result is a tool that automatically transcribes historical documents, preserving historical variation while simultaneously producing a modernized text. In the example shown here, the system has learned that a tilde over a vowel signifies an elided m or n; that a q may be used in place of a c, or a u in place of a v; and that the words apro and uecha are two parts of a whole (despite the missing hyphen). We can also see where the tool has made mistakes: misreading a u as a p, for example, in the diplomatic version.
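The correspondences described above can be illustrated with a small Python sketch. It is not how Ocular works internally: Ocular learns such patterns statistically while it transcribes, whereas here the rule table `RULES`, the tiny word list `MODERN_VOCAB`, and the function names are all hypothetical, hand-written stand-ins. The sketch only shows how one diplomatic token can map to several modern candidates, and how knowledge of modern language can choose among them.

```python
import unicodedata
from itertools import product

# Hypothetical substitution table: each printed character maps to the
# modern strings it might stand for (including itself where plausible).
RULES = {
    "ſ": ["s"],        # long s
    "u": ["u", "v"],   # u printed in place of v
    "q": ["q", "c"],   # q printed in place of c
}

# Tiny stand-in for knowledge of modern Spanish (an assumption for this demo).
MODERN_VOCAB = {"aprovecha", "cuando", "tiempo", "servir"}

VOWELS = "aeiou"
TILDE = "\u0303"  # combining tilde


def options(ch):
    """Return the modern strings a single printed character might stand for."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    # A tilde over a vowel marks an elided n or m; ñ is left untouched.
    if TILDE in decomposed[1:] and base in VOWELS:
        return [base + "n", base + "m"]
    return RULES.get(ch, [ch])


def candidates(token):
    """Enumerate every modern spelling the rules allow for a token."""
    token = unicodedata.normalize("NFC", token)
    per_char = [options(ch) for ch in token]
    return ["".join(parts) for parts in product(*per_char)]


def modernize(token, next_token=None):
    """Return (modern_form, joined_with_next).

    Tries the token alone first; if no candidate is a known modern word,
    tries joining it with the next token (a word split across a line break).
    Falls back to the printed form unchanged.
    """
    for cand in candidates(token):
        if cand in MODERN_VOCAB:
            return cand, False
    if next_token is not None:
        for cand in candidates(token + next_token):
            if cand in MODERN_VOCAB:
                return cand, True
    return token, False


print(modernize("apro", "uecha"))  # ('aprovecha', True)
print(modernize("tiẽpo"))          # ('tiempo', False)
```

A fixed rule table like this is exactly the kind of hand-crafted resource that does not scale to new languages, which is why learning the correspondences from the text itself is attractive.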
Despite these errors, we see a lot of potential for this tool. The simultaneous production of both kinds of transcription improves the accuracy of both versions: the diplomatic version benefits from our knowledge of modern language, while the normalized version is no longer tied to the accuracy of a previously produced diplomatic transcription. The simultaneous production of these two kinds of transcriptions, furthermore, means that without significantly increasing our use of resources we can better meet the needs of our users: documents can be searched, parsed, tagged, or analyzed for historical orthography.
As it analyzes documents to produce transcriptions, our modified OCR tool learns patterns of orthographic variation. If we preserve this information, we can acquire new knowledge about our corpus. OCR is often thought of as a necessary bottleneck on the way to corpus analytics. But with tools like this, transcription can be simultaneously an act of analysis, and a stage in the production of more accessible, discoverable resources.
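As a rough illustration of what preserving this information could look like, the sketch below tallies character-level differences between paired diplomatic and normalized tokens. It assumes the pairs are already available as plain strings, and it uses Python's standard `difflib` alignment as a stand-in; this is not the mechanism Ocular uses, and the function name `variation_patterns` is hypothetical.

```python
import unicodedata
from collections import Counter
from difflib import SequenceMatcher


def variation_patterns(pairs):
    """Count (printed, modern) substitutions across token pairs.

    `pairs` is an iterable of (diplomatic, normalized) strings. Each
    non-matching span in the character alignment is counted as one pattern.
    """
    counts = Counter()
    for diplomatic, normalized in pairs:
        # Compose accents so that a vowel plus tilde is a single character.
        d = unicodedata.normalize("NFC", diplomatic)
        n = unicodedata.normalize("NFC", normalized)
        matcher = SequenceMatcher(None, d, n, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                counts[(d[i1:i2], n[j1:j2])] += 1
    return counts


pairs = [
    ("aprouecha", "aprovecha"),
    ("ſeruir", "servir"),
    ("tiẽpo", "tiempo"),
]
patterns = variation_patterns(pairs)
print(patterns.most_common(1))  # [(('u', 'v'), 2)]
```

Run over a whole corpus, a table like this would show which printers' conventions dominate which books, which is the kind of analysis the transcription step makes available as a by-product.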