The central dogma of handwritten character recognition remains inextricably linked to optical character recognition methods for print media. Alongside their reliance on proprietary data and lack of open-access software, the applicability of these optical character recognition methods to handwritten characters from low-quality documents (e.g., those that are damaged) remains unknown. In this paper, we compare and contrast the performance of state-of-the-art optical character recognition tools for print with learning models engineered using state-of-the-art machine learning toolkits and trained on handwritten inputs. Using Tesseract OCR as a baseline, we build, optimize, and evaluate three types of convolutional neural networks trained on the AL-ALL and AL-PUB datasets, a collection of images of handwritten ancient Greek characters that were labeled by volunteers through the Ancient Lives online citizen science project. We find our best-performing machine learning model to be 92.57% accurate, compared to Tesseract OCR's 11.15%. Following our analysis, we present a brief examination of our models' shortcomings, introduce the publicly available AL-PUB dataset, and describe Theia, a web-based tool that democratizes our machine learning models for public use. We conclude by discussing the promise of our findings for advancing research at the intersection of machine learning, manuscript transcription, and the digital humanities.
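The abstract's classifiers are convolutional neural networks. As a hedged illustration only (not the authors' architecture or code), the core primitive such networks stack is a 2D cross-correlation over an image patch followed by a nonlinearity; the sketch below implements that primitive in NumPy on a toy 5×5 patch with a hypothetical 3×3 filter:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation -- the 'convolution' layer primitive
    that CNN character classifiers stack to extract stroke features."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value is the dot product of the filter with
            # the image window it currently overlaps.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy inputs (hypothetical values, for illustration only).
patch = np.ones((5, 5))       # stand-in for a grayscale glyph crop
kernel = np.ones((3, 3))      # stand-in for a learned filter
feature_map = np.maximum(conv2d(patch, kernel), 0.0)  # ReLU activation
```

In a real model, many such filters are learned per layer and the resulting feature maps are pooled and fed to a final classification layer; deep learning toolkits provide optimized versions of this operation.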
Title of host publication: Proceedings - IEEE 17th International Conference on eScience, eScience 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Published: Sep 2021
Conference: 17th IEEE International Conference on eScience, eScience 2021 - Virtual, Online, Austria
Duration: Sep 20 2021 → Sep 23 2021
Bibliographical note (Funding Information):
ACKNOWLEDGMENT: This research is made possible by the thousands of Zooniverse volunteers who participated in the Ancient Lives project over the past decade. We recognize these volunteers and thank them for their efforts in spurring advances not only across the humanities, but also, now, the sciences. We also thank the Imaging Papyri Project at the University of Oxford for providing access to the digitized manuscript images, as well as the Egypt Exploration Society for providing access to the Oxyrhynchus Papyri. This research was partially funded by the Andrew W. Mellon Foundation and The Chellgren Center for Undergraduate Excellence.
© 2021 IEEE.
- Ancient Greek
- Character transcription
- Citizen science
- Machine learning