How to make more from exposure data? An integrated machine learning pipeline to predict pathogen exposure

Nicholas M. Fountain-Jones, Gustavo Machado, Scott Carver, Craig Packer, Mariana Recamonde-Mendoza, Meggan E. Craft

Research output: Contribution to journal › Article › peer-review

27 Scopus citations


Predicting infectious disease dynamics is a central challenge in disease ecology. Models that can assess which individuals are most at risk of being exposed to a pathogen not only provide valuable insights into disease transmission and dynamics but can also guide management interventions. Constructing such models for wild animal populations, however, is particularly challenging: often only serological data are available, and only for a subset of individuals, and nonlinear relationships between variables are common. Here we provide a guide to the latest advances in statistical machine learning to construct pathogen-risk models that automatically incorporate complex nonlinear relationships, with minimal statistical assumptions, from ecological data that contain missing values. Our approach compares multiple machine learning algorithms in a unified environment to find the model with the best predictive performance and uses game theory to better interpret results. We apply this framework to two major pathogens that infect African lions: canine distemper virus (CDV) and feline parvovirus. Our modelling approach provided enhanced predictive performance compared to more traditional approaches, as well as new insights into disease risks in a wild population. We were able to efficiently capture and visualize strong nonlinear patterns, as well as model complex interactions between variables in shaping risk of exposure to CDV and feline parvovirus. For example, we found that lions were more likely to be exposed to CDV at a young age but only in low rainfall years. When combined with our data calibration approach, our framework helped us to answer questions about risk of pathogen exposure that are difficult to address with previous methods. Our framework not only has the potential to aid in predicting disease risk in animal populations, but also can be used to build robust predictive models suitable for other ecological applications such as modelling species distribution or diversity patterns.
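The abstract mentions using game theory to interpret model output but does not name the method. A minimal sketch of one common game-theoretic attribution, exact Shapley values, is given below under stated assumptions: the `toy_risk` function, its thresholds, the feature names, and the two example lions are all hypothetical and chosen only to mimic the "young age but only in low rainfall years" interaction. It is not the authors' pipeline or fitted model.

```python
from itertools import combinations
from math import factorial


def toy_risk(age_years, rainfall_mm, pride_size):
    """Hypothetical exposure-risk score (assumption, not a fitted model):
    risk is elevated only for young lions in low-rainfall years."""
    risk = 0.1
    if age_years < 2 and rainfall_mm < 500:
        risk += 0.6
    return risk


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `instance` relative to `baseline`.

    Both arguments are dicts keyed by feature name. Exact enumeration is
    only practical for a handful of features (cost grows as 2^n)."""
    features = list(instance)
    n = len(features)

    def value(coalition):
        # Features in the coalition take the instance's value;
        # all other features stay at the baseline value.
        mixed = {f: (instance[f] if f in coalition else baseline[f])
                 for f in features}
        return model(**mixed)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


# Hypothetical individuals: a young lion in a dry year vs. an adult in a wet year.
young_dry = {"age_years": 1, "rainfall_mm": 400, "pride_size": 8}
adult_wet = {"age_years": 6, "rainfall_mm": 800, "pride_size": 8}

phi = shapley_values(toy_risk, young_dry, adult_wet)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.3f}")
```

In this toy setting the 0.6 increase in risk is split equally between age and rainfall, and pride size receives zero, because neither age nor rainfall raises the score on its own. The attributions sum to the difference between the two predictions, which is the property that makes Shapley values attractive for explaining interactions in otherwise opaque models.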

Original language: English (US)
Pages (from-to): 1447-1461
Number of pages: 15
Journal: Journal of Animal Ecology
Issue number: 10
State: Published - Oct 1 2019

Bibliographical note

Funding Information:
N.M.F-J. and M.E.C. were funded by National Science Foundation (DEB-1413925) and M.E.C. was funded by National Science Foundation (DEB-1654609) and CVM Research Office UMN Ag Experiment Station General Ag Research Funds.

Publisher Copyright:
© 2019 The Authors. Journal of Animal Ecology © 2019 British Ecological Society


Keywords

  • boosted regression trees
  • disease ecology
  • gradient boosting models
  • machine learning
  • model-agnostic methods
  • random forests
  • serology
  • support vector machines

