TY - JOUR
T1 - ITALLIC
T2 - A tool for identifying and correcting errors in location based plant breeding data
AU - Onsongo, Getiria
AU - Fritsche, Samantha
AU - Nguyen, Thy
AU - Belemlih, Ayoub
AU - Thompson, Jeffery
AU - Silverstein, Kevin A.T.
N1 - Funding Information:
The authors would like to thank Daniel Maseda for proofreading the manuscript.
Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2022/6
Y1 - 2022/6
N2 - Advances in big data technologies are making it possible to analyze large amounts of data in near real time. These technologies offer great promise in the area of data-driven plant breeding. To fully realize this promise, disparate sources of genotype, environment, management, and socioeconomic data need to be integrated. Collectively, this data could be used to inform genetic predictive models for maize, wheat, and other crops. Some of the primary challenges to analyzing these disparate sources collectively are errors in location data, which include flipped latitude and longitude values, missing negative signs, and, in some cases, missing data. To address these challenges, we have developed an Integrated Tool for AgData Lat Long Imputation and Cleaning (ITALLIC), which detects and corrects errors in location data and imputes missing values for location-dependent data, such as region name. Location information is considered valid if a multipolygon bounding its coordinates corresponds to the country label. This validation step easily detects common errors, such as missing negative signs or flipped latitude and longitude data. To suggest corrections, combinations of alternative latitude and longitude values are generated, and a query is used to determine the country for each of these possible coordinate pairs. If one of the coordinate pairs corresponds to a country, those coordinates are suggested as the putatively correct latitude and longitude values for that data entry. If this approach fails to correct the error, an open-source API is used to geocode the location. In addition to identifying and correcting potential errors, ITALLIC includes a visualization tool that makes it easy for users to validate results. Illustratively, when used to analyze data from over 1,400 plant breeding stations around the world, ITALLIC enabled us to validate or correct errors in over 90% of the data. Being able to examine suggested corrections visually made the validation process seamless and convenient. In a few instances, latitude and longitude values were flipped, resulting in a plant breeding station being listed as located in the middle of the ocean. The visualization tool was able to plot both the location in the middle of the ocean and the station's suggested correct location with a line connecting them. Being able to visualize both the erroneous data point and its suggested location helped us quickly identify and correct errors. ITALLIC is freely available for installation via the publicly accessible Anaconda package management system and the source code has been made available on GitHub. ITALLIC is under active development and has been integrated into the GEMS™ Agroinformatics platform.
KW - Automated data cleaning
KW - Data driven plant breeding
KW - Data visualization
KW - Location data
UR - http://www.scopus.com/inward/record.url?scp=85129558530&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85129558530&partnerID=8YFLogxK
U2 - 10.1016/j.compag.2022.106947
DO - 10.1016/j.compag.2022.106947
M3 - Article
AN - SCOPUS:85129558530
SN - 0168-1699
VL - 197
JO - Computers and Electronics in Agriculture
JF - Computers and Electronics in Agriculture
M1 - 106947
ER -