Small language models enable rapid and accurate extraction of structured data from unstructured text: An example with plants and their specialized metabolites

Research output: Contribution to journal, Article, peer-review

Abstract

Transformer-based large language models are receiving considerable attention because of their ability to analyse scientific literature. Small language models (SLMs), however, also have potential in this area as they have smaller compute footprints and allow users to keep data in-house. Here, we quantitatively evaluate the ability of SLMs to: (i) score references according to project-specific relevance and (ii) extract and structure data from unstructured sources (scientific abstracts). By comparing SLMs’ outputs against those of a human on hundreds of abstracts, we found that (i) SLMs can effectively filter literature and extract structured information relatively accurately (error rates as low as 10%), but not with perfect yield (as low as 50% in some cases), (ii) there are tradeoffs between accuracy, model size and computing requirements, and (iii) clearly written abstracts are needed to support accurate data extraction. We recommend advanced prompt engineering techniques, full-text resources and model distillation as future directions.

Original language: English (US)
Article number: e26
Journal: Quantitative Plant Biology
Volume: 6
State: Published - Jul 25 2025

Bibliographical note

Publisher Copyright:
© The Author(s), 2025. Published by Cambridge University Press in association with John Innes Centre.

Keywords

  • language models
  • language processing
  • literature mining
  • natural products
  • plant chemistry

PubMed: MeSH publication types

  • Journal Article
