ITALLIC: A tool for identifying and correcting errors in location based plant breeding data

Getiria Onsongo, Samantha Fritsche, Thy Nguyen, Ayoub Belemlih, Jeffery Thompson, Kevin A.T. Silverstein

Research output: Contribution to journal › Article › peer-review

Abstract

Advances in big data technologies are making it possible to analyze large amounts of data in near real time. These technologies offer great promise in the area of data-driven plant breeding. To fully realize this promise, disparate sources of genotype, environment, management, and socioeconomic data need to be integrated. Collectively, this data could be used to inform genetic predictive models for maize, wheat, and other crops. Some of the primary challenges to analyzing these disparate sources collectively are errors in location data, which include flipped latitude and longitude values, missing negative signs, and, in some cases, missing data. To address these challenges, we have developed an Integrated Tool for AgData Lat Long Imputation and Cleaning (ITALLIC), which detects and corrects errors in location data and imputes missing values for location-dependent data, such as region name. Location information is considered valid if a multipolygon bounding its coordinates corresponds to the country label. This validation step easily detects common errors, such as missing negative signs or flipped latitude and longitude data. To suggest corrections, combinations of alternative latitude and longitude values are generated, and a query is used to determine the country for each of these possible coordinate pairs. If one of the coordinate pairs corresponds to a country, those coordinates are suggested as the putatively correct latitude and longitude values for that data entry. If this approach fails to correct the error, an open-source API is used to geocode the location. In addition to identifying and correcting potential errors, ITALLIC includes a visualization tool that makes it easy for users to validate results. Illustratively, when used to analyze data from over 1,400 plant breeding stations around the world, ITALLIC enabled us to validate or correct errors in over 90% of the data. 
Being able to examine suggested corrections visually made the validation process seamless and convenient. In a few instances, latitude and longitude values were flipped, resulting in a plant breeding station being listed as located in the middle of the ocean. The visualization tool was able to plot both the location in the middle of the ocean and the station's suggested correct location with a line connecting them. Being able to visualize both the erroneous data point and its suggested location helped us quickly identify and correct errors. ITALLIC is freely available for installation via the publicly accessible Anaconda package management system and the source code has been made available on GitHub. ITALLIC is under active development and has been integrated into the GEMS™ Agroinformatics platform.
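
The validate-then-suggest procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration and not ITALLIC's actual code: the names `toy_country_of`, `candidates`, and `suggest` are hypothetical, and the country lookup uses a few coarse bounding boxes as a stand-in for the multipolygon query the abstract mentions. A real lookup would test the point against country boundary polygons.

```python
from itertools import product
from typing import Callable, Iterator, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)

# ASSUMPTION: coarse bounding boxes (min_lat, max_lat, min_lon, max_lon) used
# only for illustration; they are not real country boundaries.
TOY_BOXES = {
    "Kenya": (-4.7, 5.0, 33.9, 41.9),
    "Brazil": (-33.8, 5.3, -74.0, -34.8),
}


def toy_country_of(lat: float, lon: float) -> Optional[str]:
    """Hypothetical lookup: return the first toy box containing the point."""
    for name, (lat0, lat1, lon0, lon1) in TOY_BOXES.items():
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return name
    return None


def candidates(lat: float, lon: float) -> Iterator[Coord]:
    """Yield alternative pairs from sign changes and a lat/long swap.

    The original pair and out-of-range pairs are skipped.
    """
    seen = {(lat, lon)}
    for a, b in ((lat, lon), (lon, lat)):
        for sign_a, sign_b in product((1, -1), repeat=2):
            pair = (sign_a * a, sign_b * b)
            in_range = abs(pair[0]) <= 90 and abs(pair[1]) <= 180
            if pair not in seen and in_range:
                seen.add(pair)
                yield pair


def suggest(
    lat: float,
    lon: float,
    country: str,
    country_of: Callable[[float, float], Optional[str]] = toy_country_of,
) -> Tuple[str, Optional[Coord]]:
    """Validate a record against its country label, else propose a correction."""
    if country_of(lat, lon) == country:
        return ("valid", (lat, lon))
    for pair in candidates(lat, lon):
        if country_of(*pair) == country:
            return ("corrected", pair)
    # No candidate matched; the abstract describes geocoding as the fallback.
    return ("unresolved", None)


if __name__ == "__main__":
    print(suggest(36.82, -1.29, "Kenya"))    # flipped latitude and longitude
    print(suggest(15.79, 47.88, "Brazil"))   # missing negative signs
```

In this sketch the first matching candidate is returned. A production tool would likely need to handle cases where several candidates fall inside the labeled country, which is one reason visual confirmation of each suggestion is useful.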

Original language: English (US)
Article number: 106947
Journal: Computers and Electronics in Agriculture
Volume: 197
State: Published - Jun 2022

Bibliographical note

Funding Information:
The authors would like to thank Daniel Maseda for proofreading the manuscript.

Publisher Copyright:
© 2022 Elsevier B.V.

Keywords

  • Automated data cleaning
  • Data driven plant breeding
  • Data visualization
  • Location data
