Abstract
Extracting text from historical maps using Optical Character Recognition (OCR) engines often results in partially or incorrectly recognized words due to complex map content. Previous work utilizes lexical-based approaches with linguistic context or applies language models to correct OCR results for documents. However, these post-OCR methods cannot directly consider the spatial relations of map text during correction. For example, "Mississippi" and "River" constitute the place phrase "Mississippi River" (linguistic relation), and near a "highway" there is likely to be an intersecting "road" that enters the "highway" (spatial relation). This paper presents a novel approach that exploits the spatial arrangement of map text using a contextual language model, BART [6], for post-processing of OCR map text. The approach first structures word-level map text into sentences based on its spatial arrangement, preserving the spatial location of the words that constitute a place name, and then corrects imperfect OCR text using neighboring information. To train BART to capture spatial relations in map text, we automatically generate large numbers of synthetic maps and fine-tune BART with location names and their spatial context. We conduct experiments on synthetic and real-world historical maps of various map styles and scales and show that the proposed method can achieve significant improvement over the commonly used lexical approach.
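The step of turning 2D word-level map text into 1D sentences can be illustrated with a minimal sketch. It is not the authors' implementation: the `OcrWord` type, the single-linkage distance threshold `max_dist`, the `line_height` used for reading order, and the sample coordinates are all assumptions made for illustration. The abstract does not say which clustering method or ordering rule is used.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class OcrWord:
    """Hypothetical container for one OCR word and its box centre (pixels)."""
    text: str
    x: float
    y: float


def cluster_words(words, max_dist):
    """Group words by single-linkage: any two words within max_dist are linked."""
    parent = list(range(len(words)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            dist = hypot(words[i].x - words[j].x, words[i].y - words[j].y)
            if dist <= max_dist:
                parent[find(i)] = find(j)

    clusters = {}
    for i, word in enumerate(words):
        clusters.setdefault(find(i), []).append(word)
    return list(clusters.values())


def to_pseudo_sentences(words, max_dist=120.0, line_height=20.0):
    """Serialize each spatial cluster into one sentence (top-to-bottom, left-to-right)."""
    sentences = []
    for cluster in cluster_words(words, max_dist):
        ordered = sorted(cluster, key=lambda w: (round(w.y / line_height), w.x))
        sentences.append(" ".join(w.text for w in ordered))
    return sorted(sentences)


if __name__ == "__main__":
    sample = [
        OcrWord("River", 190, 102),
        OcrWord("highway", 800, 500),
        OcrWord("Mississippi", 100, 100),
        OcrWord("road", 860, 540),
    ]
    for sentence in to_pseudo_sentences(sample):
        print(sentence)
```

In a pipeline like the one described, each resulting sentence would then be passed to the fine-tuned sequence-to-sequence model for correction, so that a misrecognized word is corrected with its spatial neighbors as context. That model step is omitted here.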
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the 5th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, GeoAI 2022 |
Editors | Bruno Martins, Dalton Lunga, Song Gao, Shawn Newsam, Lexie Yang, Xueqing Deng, Gengchen Mai |
Publisher | Association for Computing Machinery, Inc |
Pages | 14-17 |
Number of pages | 4 |
ISBN (Electronic) | 9781450395328 |
State | Published - Nov 1 2022 |
Event | 5th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, GeoAI 2022 - Seattle, United States Duration: Nov 1 2022 → … |
Conference
Conference | 5th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, GeoAI 2022 |
---|---|
Country/Territory | United States |
City | Seattle |
Period | 11/1/22 → … |
Bibliographical note
Funding Information:
5 CONCLUSION
This paper presented a novel approach that exploits the spatial locations of map text with BART for post-OCR processing on maps. The main contribution is a method for incorporating spatial context into BART: the spatial locations of OCRed words are clustered to convert 2D map text into 1D pseudo sentences, in both training and inference. Overall, the presented method significantly improves the F1 score on the synthetic maps compared to the lexical approach by correcting and predicting even unidentified metadata. However, because historical maps contain many types of short words, the proposed method tends to remove short, unseen words during post-OCR processing. In the future, we will include short words (e.g., abbreviations) and incorporate geographic word variations in the training data in order to handle the many variations of place names in map post-OCR processing.
ACKNOWLEDGMENTS
This material is based upon work supported in part by NVIDIA Corporation, the National Endowment for the Humanities under Award No. HC-278125-21 and Council Reference AH/V009400/1, and the University of Minnesota, Computer Science & Engineering Faculty startup funds. We thank Jina Kim and Yijun Lin for developing the synthetic maps.
Publisher Copyright:
© 2022 ACM.
Keywords
- BART
- information retrieval
- neural networks
- post-OCR processing