Transcribing Century-Old Newspapers with NLP
A venerable financial newspaper wanted to do something special for their 130th anniversary: create a special-edition, digitized broadsheet of stories from their entire history. While their archives had been well preserved and imaged, extracting the text from old (in some cases century-old) pages proved tricky. The newspaper approached Tolstoy, an NLP startup based in San Francisco, to help them extract the text accurately, and on deadline.
This article covers how we used the latest techniques in image processing and optical character recognition to digitize and extract the text from the newspaper's broadsheets.
Primer on OCR
Optical character recognition, more commonly known as OCR, is one of the foundational computer vision tasks: the technology that converts images of typed or handwritten text into machine-encoded text (i.e. strings).
It was one of the first big machine learning problems researchers tackled, chiefly because it didn't require deep learning to achieve good performance. While OCR is largely seen as a solved problem these days, especially in "domestic" settings like PDFs and scanned documents, there is a very long tail of cases where even paid, state-of-the-art options such as Google Cloud Vision and Amazon Textract (not to mention the very good open-source Tesseract) fail.
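For a sense of the baseline, here is roughly what a single OCR pass looks like using the open-source Tesseract engine through its pytesseract wrapper; a minimal sketch, with the filename standing in for any scanned page:

```python
# Baseline OCR pass: hand the whole page to Tesseract and accept
# whatever reading order it infers.
from PIL import Image
import pytesseract

page = Image.open("broadsheet.png")  # hypothetical scan of a full page
print(pytesseract.image_to_string(page))
```

On a single-column document, that one call is often all you need; on a multi-column broadsheet, lines from adjacent columns tend to come back interleaved or out of order.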
You see, most OCR engines read text left-to-right and top-to-bottom, in that order, by default. While this suits most essays, novels, contracts, and official documents (which largely comprised the training sets of early OCR), it isn't so conducive to many other real-world documents (newspapers, journal articles, reports, advertisements, posters, etc.), which may have text all over the place. As the example above shows, even columns, which intuitively seem regular enough to be picked up, can confound the best OCR models.
Can we do better — and if so, how?
The issue all these models share is not the accuracy of the recognized text; rather, it is determining an accurate reading order. What we have to do, then, is define a set of pre-processing operations that create and arrange image blocks so that, when fed through any OCR model, the text comes back in the correct order.
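Concretely, the target pipeline looks something like the sketch below; a minimal outline, assuming pytesseract as the downstream engine, with `order_blocks` as a hypothetical placeholder for the layout logic developed in the rest of this article:

```python
from PIL import Image
import pytesseract

def order_blocks(boxes):
    # Placeholder ordering: top-to-bottom, then left-to-right.
    # The real pipeline replaces this with layout-aware logic.
    return sorted(boxes, key=lambda b: (b[1], b[0]))

def transcribe_page(path, boxes):
    """OCR a page given pre-computed (x, y, w, h) text-block boxes."""
    page = Image.open(path)
    pieces = []
    for (x, y, w, h) in order_blocks(boxes):
        crop = page.crop((x, y, x + w, y + h))
        pieces.append(pytesseract.image_to_string(crop))
    return "\n\n".join(pieces)
```

The point is that any OCR engine can slot into that loop: once the crops arrive in the right order, the reading-order problem disappears.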
Grouping text blocks
The first step is to recognize groups of text. Remember, since we only need "blobs" (encoded as bounding boxes) that are likely to contain text, we need not worry so much about readability, merely distinguishability. We can always map the generated coordinates back onto the original image before handing each crop to the OCR engine. One common way to find such blobs is sketched below.
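The idea is morphological dilation: binarize the page, smear neighboring characters together until each block of text fuses into one connected region, and take the bounding box of each region. A sketch using OpenCV (the kernel size and area threshold are tunable assumptions, not values from our production pipeline):

```python
import cv2

def find_text_blocks(path, kernel_size=(25, 25), min_area=500):
    """Return bounding boxes (x, y, w, h) of regions likely to contain text."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Invert so ink is white on black, as contour detection expects.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # Dilate so characters in the same block merge into a single blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, kernel_size)
    dilated = cv2.dilate(binary, kernel, iterations=1)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Drop tiny specks; keep boxes big enough to plausibly hold text.
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]
```

These (x, y, w, h) boxes are exactly what the `transcribe_page` sketch above consumes, and because only the coordinates matter at this stage, the detection can even run on a downscaled copy of the page.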
To this end,...