Ocular is open-source software for automatic transcription, available through a GitHub repository. Instructions on how to use Ocular are also found on GitHub.
Is Ocular the right OCR software for my project?
Ocular works best on documents printed using a hand press, including those written in multiple languages. It is run from the command line.
If your documents are from the 20th century, we recommend that you use Tesseract (Google’s open-source software) or ABBYY FineReader (proprietary software). If your documents are very old or written by hand, you might try using Transkribus (which can also be used for printed documents) or crowdsourcing with FromThePage. (Note that these are suggestions, not endorsements.)
If you have a large corpus of documents to transcribe, you might consider partnering with the Early Modern OCR Project.
How does Ocular work?
Ocular has three stages.
First, you “initialize” the system by familiarizing it with the fonts and languages in which your documents are written. To do this, you feed sample text in those languages into a “language model.” You then use the language model to initialize the “font model,” which uses fonts on your computer to approximate an understanding of the fonts in your documents.
Second, you “train” the system on your specific corpus. To do this, you feed a subset of pages from your document into the “training model.” The system revises its very basic understanding of fonts and languages using the information it gathers from your specific document. We have found that 10 to 20 pages are sufficient for training. While you can use any set of pages, it is best to select pages with few or no images.
Third, you “transcribe” the entire document by feeding the trained model into the transcription system. Ocular will output plain text transcriptions, and can optionally output ALTO XML, comparisons, and evaluations.
How much data do I need?
For the language initialization model, we have found that 750,000 to 1,000,000 characters of sample text are sufficient. There are parameters that you can manipulate if you don’t have enough data. If you have too much data, the system sometimes breaks down; in that case, you can set a maximum threshold using the “-lmCharCount” parameter.
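Before initializing the language model, it can help to check how much sample text you actually have. The following is a minimal sketch in Python, assuming your sample text is stored as UTF-8 plain text files. The function names are illustrative and are not part of Ocular; the thresholds are simply the range quoted above.

```python
from pathlib import Path

# Range reported above as sufficient for language-model initialization.
MIN_CHARS = 750_000
MAX_CHARS = 1_000_000


def count_characters(paths):
    """Return the total number of characters across the given UTF-8 text files."""
    total = 0
    for path in paths:
        total += len(Path(path).read_text(encoding="utf-8"))
    return total


def assess_corpus_size(char_count, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Classify a character count against the suggested range."""
    if char_count < min_chars:
        return "too small"
    if char_count > max_chars:
        return "above range"
    return "within range"
```

For example, assess_corpus_size(count_characters(["spanish.txt", "latin.txt"])) reports where a two-file corpus falls (the file names here are hypothetical). A result of "above range" suggests setting “-lmCharCount”, and "too small" suggests gathering more text or adjusting the other parameters.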
For training, 10 to 20 sample pages are sufficient. You can make a list of pages that you feed into the system using the “-inputDocListPath” parameter, or copy the training pages into a separate folder.
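If you go the list route, a short script can draft the page list for you. The sketch below is a starting point only, and it rests on assumptions: that the file given to “-inputDocListPath” is plain text with one image path per line (check the instructions on GitHub for the exact format), and that a random sample is an acceptable first draft. The function names are illustrative and are not part of Ocular.

```python
import random
from pathlib import Path


def choose_pages(page_paths, count=15, seed=0):
    """Pick `count` pages at random (reproducibly) and return them in sorted order."""
    pages = sorted(page_paths)
    if len(pages) <= count:
        return pages
    return sorted(random.Random(seed).sample(pages, count))


def format_page_list(pages):
    """Render pages as one path per line (an assumed list-file format)."""
    return "".join(f"{page}\n" for page in pages)


def write_page_list(image_dir, list_path, count=15, seed=0):
    """Sample pages from image_dir and write the list to list_path."""
    # Every regular file in the folder is treated as a page image.
    paths = [str(p) for p in Path(image_dir).iterdir() if p.is_file()]
    chosen = choose_pages(paths, count=count, seed=seed)
    Path(list_path).write_text(format_page_list(chosen), encoding="utf-8")
    return chosen
```

Because pages with few or no images train best, review the generated list by hand and swap out any page dominated by illustrations.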
Will Ocular work on my computer?
Ocular works on Macs, and we think it works on Linux. It doesn’t seem to work on Windows.
Ocular takes a significant amount of computing power, so be prepared for your computer to be tied up for a while. Its speed depends on both the settings and your computer. The “-emissionEngine openCL” parameter speeds up the process.
The best way to use Ocular for larger projects is through a server.
What if my document has pictures or columns?
Right now, Ocular has no way to handle pictures or columns: it merges multiple columns into one, and it reads the pictures as text. As a workaround, you can crop each column into a separate image before running Ocular. We invite you to participate in the effort to add these features to the line extractor!
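If you decide to crop columns yourself, the arithmetic for the crop regions is simple when the columns are roughly equal in width. The sketch below makes exactly that assumption, which will not hold for every page; the function name and the optional overlap margin are illustrative and not part of Ocular. Boxes are returned as (left, upper, right, lower) pixel tuples, the form accepted by the crop method of Pillow's Image class, so they can be passed to an image library of your choice.

```python
def column_boxes(width, height, columns, overlap=0):
    """Return one (left, upper, right, lower) crop box per equal-width column.

    `overlap` widens each box by that many pixels on both sides, clamped to
    the page, to tolerate columns that are not perfectly aligned.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    boxes = []
    for i in range(columns):
        left = max(0, (i * width) // columns - overlap)
        right = min(width, ((i + 1) * width) // columns + overlap)
        boxes.append((left, 0, right, height))
    return boxes
```

For a two-column page that is 1000 pixels wide and 1500 pixels tall, column_boxes(1000, 1500, 2) gives (0, 0, 500, 1500) and (500, 0, 1000, 1500). Pages with uneven columns or a wide gutter will need boxes adjusted by hand.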