Normalized transcriptions, eMOP integration, XML, and indigenous language transcription: here’s a report on where we’re at one year into the Reading the First Books project!
Tools for automatically transcribing printed books generally produce “diplomatic transcriptions”: that is, transcriptions which preserve the spelling and punctuation of the original text.
But many users of digital texts prefer a normalized transcription, where the spelling is more consistent, and shorthands have been expanded. Normalization helps with searching, and with reading. It also aids more complex forms of digital analysis like lemmatization and parsing, which in turn help with topic modeling and visualization.
So we built an extension of Ocular that jointly models normalized and diplomatic transcriptions. Every time it transcribes a text, it transcribes it both in modern and historical form. And it does this by learning the differences on its own, and then applying them to its transcription.
That means Ocular can “normalize” text in any language for which we have data. And the computational linguists behind Ocular – that’s Taylor Berg-Kirkpatrick and Dan Garrette – are working to make the “normalization” system better, so historians can access better normalized versions of historical texts.
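To make the idea of "learning the differences" a little more concrete, here is a toy sketch in Python. It is not Ocular's actual model, which learns these correspondences as part of a joint statistical model during transcription. The sketch only assumes we already have a handful of paired diplomatic and normalized spellings, counts which character tends to replace which, and applies the most frequent replacement. The function names and the word pairs are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict

def learn_substitutions(pairs):
    """Count character-level substitutions from (diplomatic, normalized) pairs."""
    counts = defaultdict(Counter)
    for diplomatic, normalized in pairs:
        if len(diplomatic) != len(normalized):
            continue  # toy simplification: ignore pairs that need insertions or deletions
        for d, n in zip(diplomatic, normalized):
            counts[d][n] += 1
    return counts

def normalize(word, counts):
    """Replace each character with its most frequently observed normalized form."""
    out = []
    for ch in word:
        if ch in counts:
            out.append(counts[ch].most_common(1)[0][0])
        else:
            out.append(ch)  # never seen in training: leave unchanged
    return "".join(out)

# Hypothetical training pairs: historical Spanish spelling -> modern spelling.
pairs = [("dixo", "dijo"), ("baxo", "bajo"), ("ſer", "ser"), ("eſta", "esta")]
table = learn_substitutions(pairs)

print(normalize("dexo", table))   # dejo
print(normalize("ſanto", table))  # santo
```

A real system has to handle context (the same letter is not always normalized the same way), expansions of shorthand, and uncertainty in the transcription itself, which is why the differences are learned jointly rather than applied as a fixed lookup table.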
eMOP Integration and Primeros Libros Transcription
The Ocular prototype, the tool we use to transcribe texts, works best on a few books at a time – not so great for the Primeros Libros corpus and other large collections of texts.
So we partnered with Texas A&M University to manage our corpus and integrate it into the Early Modern OCR Project (eMOP).
With the support of Anton DuPlessis, one of the leaders of the Primeros Libros project (and a librarian at TAMU), we were able to process many of the books in the collection and put them into a structured database.
We then worked with our collaborators at the IDHMC (Bryan Tarpley, Matt Christy, Trey Dockendorf, and Liz Grumbach) to integrate Ocular into eMOP.
eMOP was designed to transcribe large collections of texts, like Early English Books Online, but it was initially built to be used with Tesseract, Google’s system for automatic transcription.
Now we can use Ocular through the eMOP interface, meaning that we can more easily manage and transcribe large amounts of data. And you can too! The whole thing (eMOP and Ocular) is available via GitHub, though a better approach is to collaborate with the IDHMC directly.
Ocular was first designed to output plain text transcriptions: just the words, and nothing but the words.
With the help of Melanie Cofield, metadata expert at UT Libraries, we’ve extended Ocular so it outputs XML using the Library of Congress’s ALTO schema. ALTO was designed specifically for automatic transcriptions, and it will make it easier for us to display the transcribed text alongside scanned images on the Primeros Libros website. It also preserves information about the transcription, like system parameters.
But don’t worry! Ocular still outputs plain-text transcriptions, and we’ll be making those available for download. So if you want to do data analysis, you won’t have to strip metadata from our transcribed files.
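If you would rather work from the XML, pulling the words back out is not hard. The sketch below uses Python's standard library to read a small ALTO-style fragment and rebuild the text line by line. The element and attribute names (TextLine, String, CONTENT, SP) come from the ALTO schema, but the fragment itself is hand-written for illustration: it is not actual Ocular output, and it is not a complete, schema-valid ALTO file.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO-style fragment (illustrative only).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Doctrina" HPOS="10" VPOS="20"/><SP/>
      <String CONTENT="christiana" HPOS="90" VPOS="20"/>
    </TextLine>
    <TextLine>
      <String CONTENT="en" HPOS="10" VPOS="50"/><SP/>
      <String CONTENT="lengua" HPOS="40" VPOS="50"/><SP/>
      <String CONTENT="mexicana" HPOS="100" VPOS="50"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]

def alto_to_lines(xml_text):
    """Return one plain-text string per TextLine element."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines

print(alto_to_lines(SAMPLE))
# ['Doctrina christiana', 'en lengua mexicana']
```

The position attributes (HPOS, VPOS) are ignored here, but they are what makes it possible to line up each word with its place on the scanned page.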
Indigenous Language Transcription
When we started this project, indigenous language transcription was one of our top priorities. In fact, it was Ocular’s failure to transcribe Nahuatl well that inspired the project.
But we’ve found indigenous language transcription to be difficult, especially for languages that aren’t widely available online. To transcribe texts we need to build language models, and language models require examples of what language is “supposed” to look like.
With the help of Stephanie Wood, we were able to build a collection of Nahuatl transcriptions. Historians sent us archival materials and books that they had painstakingly transcribed into Word documents or text files. We keep the transcriptions private, but the statistical analysis of the transcriptions feeds our system, enabling us to transcribe books in Nahuatl.
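What does it mean for a language model to learn what a language is "supposed" to look like? Here is a minimal sketch, assuming a very simple character bigram model with add-one smoothing. It is far simpler than the models used in practice, and the four-word training sample and the function names are hypothetical. The model counts which letters follow which in the example texts, then uses those counts to score new strings.

```python
import math
from collections import Counter

def train_bigrams(texts):
    """Count character pairs; '^' marks the start of a text and '$' the end."""
    pair_counts = Counter()
    context_counts = Counter()
    alphabet = set()
    for text in texts:
        padded = "^" + text + "$"
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts, len(alphabet)

def log_prob(text, model):
    """Log-probability of a string under the bigram model (add-one smoothing)."""
    pair_counts, context_counts, vocab_size = model
    padded = "^" + text + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((pair_counts[(a, b)] + 1) / (context_counts[a] + vocab_size))
    return total

# Tiny illustrative sample; a real model needs far more text than this.
model = train_bigrams(["tlatoani", "tlacatl", "nahuatl", "tlalli"])

# A plausible letter sequence scores higher than an implausible one.
print(log_prob("tlatl", model) > log_prob("tqatl", model))  # True
```

When the transcription system is unsure whether a smudged letter is an "l" or a "q", scores like these tip the balance toward the reading that looks like the language. With only a few example texts the counts are unreliable, which is why small corpora make transcription so much harder.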
But what about the other languages we’re working with? We have struggled to build collections of Zapotec, Mixtec, Purépecha, and Huastec, meaning that our system is still ineffective when it comes to transcribing documents in those languages.
Our new research assistant, María Victoria Fernández, will be leading us through the next stage of the First Books project. By the end of the coming year, we expect to have transcribed the entirety of the Primeros Libros corpus, integrated those transcriptions into the website, and made them available for analysis. Stay tuned to the First Books project website to follow our progress.
We’ll also be hosting a workshop to discuss the future of automatic transcription for historical texts, and the significance of our project for colonial Latin American research. Keep an eye out for more information!
Want to get involved in the First Books project? There’s so much you can do:
- Help us build our indigenous language corpora!
Send us anything you’ve ever written in Nahuatl, Zapotec, Mixtec, Huastec, or Purépecha. This could be modern poetry or prose, or transcriptions of archival records. We can’t use PDFs, but Word, RTF, TXT, or XML documents all help! Remember that we will never share these documents with anyone, and we won’t read them ourselves. It’s okay if they’re not perfect. They’re data for our models, and they’ll help us make indigenous language transcription better.
- Help us make Ocular better!
Are you a computer scientist, programmer, digital humanist, or computational linguist? Ocular is freely available on GitHub, and it could use your help! We’d love for Ocular to be able to output TEI-encoded text. We’d love to see if it works on languages written in non-Latin scripts, like Arabic. And we’d love to see someone improve the line-extraction system, a pre-processing program that cuts the page into individual lines.
- Try out Ocular on your own texts!
Are you a historian with scanned copies of historical books? We can work with you to automatically transcribe scanned documents from any era (pre-20th century is best) and in any language that uses Latin script (Ocular should work on all character-based languages, but we haven’t tried it yet). Get started on GitHub or contact us for guidance (firstname.lastname@example.org).