Zir - GUI software for supervised, structured OCR of printed source material.

Summary: The software introduced in this paper was developed as part of the project to build NedHisFirm, a comprehensive database of Dutch stock exchange and corporate data from 1796 through 1980. The paper outlines the functionality and performance of the GUI software prototype Zir, together with the experiences from developing and using it. Zir is intended for human supervision of layout detection and OCR of the various historical printed source materials relevant to the project.

Challenges in database building from historical sources

The use of data from printed material in historical studies is riddled with challenges along the whole data flow, from locating and scanning the sources to digitising the data and converting it into workable formats. Previous and ongoing approaches to digitising large amounts of data from printed sources focus on adapting existing OCR technology and optical layout detection, so that the data extracted from the physical source material is not only read correctly but also structured purposefully.

The problems with these methodologies appear to fall along a scale. At one end, heavy reliance on manual steps increases output quality but decreases speed. At the other end, autonomous machine solutions are fast but may produce less structured output or rely on layout detection that is less robust across different sources. The latter in turn pushes the work of structuring the data downstream, to the post-processing of the scanned data.

A midway solution using Zir

The software prototype Zir is a component-based GUI application framework that aims to find a middle way between these extremes. It uses source-specific, scriptable presets to present human observers with adjustable suggestions for recognised document structures before the OCR pass. Because it requires a preview of every scan, it is slower than fully automated OCR systems. In return, by ensuring that blocks of content are correctly defined, it solves important data structuring problems and facilitates supervision, while the scriptable presets keep layout detection adequate for each source.

It is developed to be put in the hands of the project’s student assistants. It currently shows strengths in ease of adjustment, speed, and the structure of its output, while still showing shortcomings in the technical integration of previously developed libraries that solve similar problems. The application is written in Python with tkinter, following an MVC approach, and runs pytesseract on layout blocks identified with OpenCV.
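The supervised workflow described above (preset-driven block suggestions, human adjustment, then OCR of each confirmed block) might be sketched roughly as follows. This is an illustration under stated assumptions and not Zir's actual code: the names `Preset`, `Block`, `suggest_blocks`, `adjust`, and `run_ocr` are hypothetical, and the OCR engine is passed in as a plain callable. In the real application, that role would presumably be played by pytesseract, with the candidate boxes coming from OpenCV-based layout detection. A toy text "page" and a fake OCR function keep the sketch free of external dependencies.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


@dataclass(frozen=True)
class Block:
    """A labelled region of the page that will be OCR-processed on its own."""
    label: str
    box: Box


@dataclass(frozen=True)
class Preset:
    """Hypothetical source-specific preset: the columns expected on a page."""
    name: str
    column_labels: Tuple[str, ...]


def suggest_blocks(preset: Preset, boxes: Sequence[Box]) -> List[Block]:
    """Pair detected boxes, ordered left to right, with the preset's labels."""
    ordered = sorted(boxes, key=lambda b: b[0])
    return [Block(label, box) for label, box in zip(preset.column_labels, ordered)]


def adjust(blocks: Sequence[Block], label: str, new_box: Box) -> List[Block]:
    """Apply a human correction to the block carrying the given label."""
    return [replace(b, box=new_box) if b.label == label else b for b in blocks]


def run_ocr(page, blocks: Sequence[Block],
            ocr: Callable[[object, Box], str]) -> Dict[str, str]:
    """Run the OCR callable once per confirmed block, keyed by block label."""
    return {b.label: ocr(page, b.box) for b in blocks}


def fake_ocr(page: Sequence[str], box: Box) -> str:
    """Stand-in for a real OCR engine: crops characters from a text 'page'."""
    x, y, w, h = box
    return "\n".join(row[x:x + w].strip() for row in page[y:y + h])


# Toy page: two rows with a name column and a price column.
page = ["Company A  100", "Company B  250"]
preset = Preset("stock_list", ("name", "price"))

# Assume the detector proposed a price box that is one character too narrow.
suggested = suggest_blocks(preset, [(11, 0, 2, 2), (0, 0, 9, 2)])
# The human supervisor widens it before OCR is run.
confirmed = adjust(suggested, "price", (11, 0, 3, 2))
result = run_ocr(page, confirmed, fake_ocr)
```

Keeping the OCR step behind a callable means the same structure would work with any engine. The output is keyed by block label, which is one way the structuring could be settled before OCR rather than in post-processing.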

The paper outlines the application’s functionality and its performance in terms of correctly structured and OCR-processed material. It also accounts for early users’ experience of working with the software. Finally, it places the use of the software prototype in the larger perspective of the entire digitisation process in large-scale database building from historical sources, drawing on the particular experiences of the NedHisFirm project.

Josef Lilljegren is a business historian, computer scientist, and postdoctoral researcher at Groningen University (Netherlands). His research interests include the use of computational approaches in the humanities, particularly in financial history, where he has previously studied the organisation of firms through intercorporate networks.