TY - JOUR
T1 - Advancing plant metabolic research by using large language models to expand databases and extract labeled data
AU - Knapp, Rachel
AU - Johnson, Braidon
AU - Busta, Lucas
N1 - Publisher Copyright: © 2025 The Author(s). Applications in Plant Sciences published by Wiley Periodicals LLC on behalf of Botanical Society of America.
PY - 2025
Y1 - 2025
AB - Premise: Recently, plant science has seen transformative advances in scalable data collection for sequence and chemical data. These large datasets, combined with machine learning, have demonstrated that conducting plant metabolic research on large scales yields remarkable insights. A key next step in increasing scale has been revealed with the advent of accessible large language models, which, even in their early stages, can distill structured data from the literature. This brings us closer to creating specialized databases that consolidate virtually all published knowledge on a topic. Methods: Here, we first test different combinations of prompt engineering techniques and language models in the identification of validated enzyme–product pairs. Next, we evaluate the application of automated prompt engineering and retrieval-augmented generation to identify compound–species associations. Finally, we build and determine the accuracy of a multimodal language model–based pipeline that transcribes images of tables into machine-readable formats. Results: When tuned for each specific task, these methods perform with high (80–90%) or modest (50%) accuracies for enzyme–product pair identification and table image transcription, but with lower false-negative rates than previous methods (decreasing from 55% to 40%) for compound–species pair identification. Discussion: We enumerate several suggestions for researchers working with language models, among which is the importance of the user's domain-specific expertise and knowledge.
KW - artificial intelligence
KW - plant metabolism
KW - structured data extraction
UR - http://www.scopus.com/inward/record.url?scp=105005215905&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105005215905&partnerID=8YFLogxK
U2 - 10.1002/aps3.70007
DO - 10.1002/aps3.70007
M3 - Article
AN - SCOPUS:105005215905
SN - 2168-0450
JO - Applications in Plant Sciences
JF - Applications in Plant Sciences
ER -