Determining Risk Factors for Long COVID Using Positive Unlabeled Learning on Electronic Health Records Data from NIH N3C

Saurav Sengupta, Johanna Loomba, Suchetha Sharma, Scott A. Chapman, Donald E. Brown

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Post-acute sequelae of SARS-CoV-2 infection (PASC), also known as Long COVID, is an emerging medical condition in the aftermath of the COVID-19 pandemic. Research on this disease is limited by its newness and the lack of reliable controls, which can hinder model development. The National COVID Cohort Collaborative (N3C; https://ncats.nih.gov/n3c) contains Electronic Health Record (EHR) data for 7 million COVID-positive patients from 76 sites across the United States, of whom fifty thousand are Long COVID patients. For this study, we model our risk factor analysis as a Positive Unlabeled (PU) problem, in which we treat Long COVID patients as the positive sample and the rest of the COVID-positive patients as unlabeled data. We first curate reliable controls using a PU modeling technique called bagging. We then use this cohort of positive samples and curated negative samples to model risk factors for Long COVID. We apply an attention-based deep learning approach using Long Short-Term Memory (LSTM) networks to historical diagnosis data recorded prior to COVID-19 infection, first to predict Long COVID and then to extract the model's attention values to score the input diagnoses for each patient. Using this process, we achieve an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.93 (0.88 F1 score) for the prediction task, significantly outperforming the same model trained on randomly selected controls. We then use a scoring process, based on attention values extracted from the trained model, to rank the input diagnoses for each correctly classified patient and to find the temporal distribution of the top diagnosis codes. Represented graphically, this distribution becomes a helpful tool for physicians to investigate diagnosis patterns that affect Long COVID and to evaluate model trustworthiness.
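The bagging step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which base classifier, features, or selection rule were used. The sketch assumes a plain NumPy logistic regression as the base learner, and the names `fit_logistic`, `pu_bagging_scores`, and `select_reliable_negatives` are hypothetical. Each round treats a bootstrap sample of unlabeled rows as provisional negatives, trains against the known positives, and scores only the unlabeled rows left out of that round. Rows with the lowest average out-of-bag score are taken as candidate reliable controls.

```python
import numpy as np


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Gradient-descent logistic regression; returns (weights, bias).

    Assumption: stands in for whatever base classifier a real pipeline uses.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def pu_bagging_scores(X_pos, X_unl, n_rounds=50, seed=0):
    """Average out-of-bag positive-class probability for each unlabeled row."""
    rng = np.random.default_rng(seed)
    n_pos, n_unl = len(X_pos), len(X_unl)
    score_sum = np.zeros(n_unl)
    oob_count = np.zeros(n_unl)
    # Labels for one round: known positives first, provisional negatives second.
    y = np.concatenate([np.ones(n_pos), np.zeros(n_pos)])
    for _ in range(n_rounds):
        # Bootstrap as many unlabeled rows as there are positives.
        idx = rng.choice(n_unl, size=n_pos, replace=True)
        w, b = fit_logistic(np.vstack([X_pos, X_unl[idx]]), y)
        # Score only the rows that were not drawn in this round.
        oob = np.ones(n_unl, dtype=bool)
        oob[idx] = False
        score_sum[oob] += 1.0 / (1.0 + np.exp(-(X_unl[oob] @ w + b)))
        oob_count[oob] += 1
    # Rows that were never out-of-bag get NaN rather than a misleading 0.
    return np.divide(score_sum, oob_count,
                     out=np.full(n_unl, np.nan), where=oob_count > 0)


def select_reliable_negatives(scores, n_select):
    """Indices of the n_select lowest-scoring rows (NaN scores sort last)."""
    return np.argsort(scores)[:n_select]
```

In a setting like the one described, `X_pos` would hold feature vectors for labeled Long COVID patients and `X_unl` those for the remaining COVID-positive patients. The rows returned by `select_reliable_negatives` would then serve as controls for the downstream classifier. How many controls to keep is a design choice that the abstract does not specify.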

Original language: English (US)
Title of host publication: Proceedings - 22nd IEEE International Conference on Machine Learning and Applications, ICMLA 2023
Editors: M. Arif Wani, Mihai Boicu, Moamar Sayed-Mouchaweh, Pedro Henriques Abreu, Joao Gama
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 430-436
Number of pages: 7
ISBN (Electronic): 9798350345346
DOIs
State: Published - 2023
Event: 22nd IEEE International Conference on Machine Learning and Applications, ICMLA 2023 - Jacksonville, United States
Duration: Dec 15 2023 → Dec 17 2023

Publication series

Name: Proceedings - 22nd IEEE International Conference on Machine Learning and Applications, ICMLA 2023

Conference

Conference: 22nd IEEE International Conference on Machine Learning and Applications, ICMLA 2023
Country/Territory: United States
City: Jacksonville
Period: 12/15/23 → 12/17/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Keywords

  • Deep learning
  • positive-unlabeled learning
  • self-supervised learning
