Classifying the lifestyle status for Alzheimer’s disease from clinical notes using deep learning with weak supervision

Zitao Shen, Dalton J Schutte, Yoonkwon Yi, Anu Bompelli, Fang Yu, Yanshan Wang, Rui Zhang

Research output: Contribution to journal › Article › peer-review

3 Scopus citations


Background: Since no effective therapies exist for Alzheimer’s disease (AD), prevention through lifestyle changes and interventions has become more critical. Analyzing electronic health records (EHRs) of patients with AD can help us better understand the effect of lifestyle on AD. However, lifestyle information is typically stored in clinical narratives. Thus, the objective of the study was to compare different natural language processing (NLP) models for classifying lifestyle statuses (e.g., physical activity and excessive diet) from clinical texts in English.

Methods: Based on the collected concept unique identifiers (CUIs) associated with lifestyle status, we extracted all related EHRs for patients with AD from the Clinical Data Repository (CDR) of the University of Minnesota (UMN). We automatically generated labels for the training data using a rule-based NLP algorithm. We then trained pre-trained Bidirectional Encoder Representations from Transformers (BERT) models, along with three traditional machine learning models as baselines, on the weakly labeled training corpus. These models include the BERT base model, PubMedBERT (abstracts + full text), PubMedBERT (only abstracts), Unified Medical Language System (UMLS) BERT, Bio BERT, Bio-clinical BERT, logistic regression, support vector machine, and random forest. We performed two case studies, physical activity and excessive diet, to validate the effectiveness of BERT models in classifying lifestyle status. All models were evaluated and compared on the developed Gold Standard Corpus (GSC) for the two case studies. The rule-based model used for weak supervision was also tested on the GSC for comparison.

Results: The UMLS BERT model achieved the best performance for classifying the status of physical activity, with precision, recall, and F-1 scores of 0.93, 0.93, and 0.92, respectively. For classifying excessive diet, the Bio-clinical BERT model showed the best performance, with precision, recall, and F-1 scores of 0.93, 0.93, and 0.93, respectively.

Conclusion: The proposed approach leveraging weak supervision could significantly increase the sample size, which is required for training deep learning models. Through comparison with the traditional machine learning models, the study also demonstrates the high performance of BERT models for classifying lifestyle status for Alzheimer’s disease in clinical notes.
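The weak-labeling and evaluation steps described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the keyword rules, the label names ("active", "inactive", "unknown"), the example sentences, and the helper names `weak_label` and `prf1` are hypothetical and are not the study's actual rule-based algorithm, data, or evaluation code. It only shows the general idea of assigning labels with rules and then scoring predictions against a small hand-labeled gold set.

```python
import re

# Hypothetical keyword rules; the abstract does not list the study's real rules.
# "inactive" is checked first so that negative mentions take priority.
RULES = [
    ("inactive", re.compile(r"\b(sedentary|no exercise|does not exercise|inactive)\b", re.I)),
    ("active", re.compile(r"\b(walks|exercises|jogs|swims)\b", re.I)),
]


def weak_label(sentence):
    """Return the label of the first matching rule, or "unknown" if none match."""
    for label, pattern in RULES:
        if pattern.search(sentence):
            return label
    return "unknown"


def prf1(gold, pred, label):
    """Compute precision, recall, and F1 for a single label."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up sentences standing in for clinical text, with hand-assigned gold labels.
notes = [
    "Patient walks daily for 30 minutes.",
    "Sedentary lifestyle, does not exercise.",
    "She swims twice a week.",
    "Reports no exercise since surgery.",
    "Enjoys gardening on weekends.",  # active, but no rule covers it
]
gold = ["active", "inactive", "active", "inactive", "active"]

# Weak labels produced by the rules; in a weak-supervision setup these would
# serve as training labels for a downstream classifier.
weak = [weak_label(n) for n in notes]

precision, recall, f1 = prf1(gold, weak, "active")
print(f"active: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy example the rules miss the gardening sentence, which lowers recall for the "active" label. That is the kind of gap a classifier trained on many weakly labeled examples might generalize past, which is one motivation for comparing the rule-based model with the trained models on the same gold set.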

Original language: English (US)
Article number: 88
Journal: BMC medical informatics and decision making
Issue number: Suppl 1
State: Published - Jul 2022

Bibliographical note

Funding Information:
YY thanks the University of Minnesota’s Undergraduate Research Opportunities Program (UROP). This article has been published as part of BMC Medical Informatics and Decision Making Volume 22 Supplement 1, 2022: Selected articles from the Third International Workshop on Health Natural Language Processing (HealthNLP 2020). The full contents of the supplement are available online at

Funding Information:
This work was partially supported by the National Institutes of Health’s National Center for Complementary and Integrative Health (NCCIH), the Office of Dietary Supplements (ODS) and National Institute on Aging (NIA) grant number R01AT009457 (PI: Zhang) and Clinical and Translational Science Award (CTSA) program grant number UL1TR002494 (PI: Blazar). The content is solely the responsibility of the authors and does not represent the official views of the NCCIH, ODS or NIA. The funding bodies provided financial support for data analysis, interpretation of data, and writing of the manuscript. Publication costs are funded by these NIH grants.

Publisher Copyright:
© 2022, The Author(s).


  • Alzheimer’s disease
  • Clinical text classification
  • Deep learning
  • Electronic health records
  • Machine learning
  • Natural language processing
  • Unified Medical Language System
  • Life Style
  • Humans
  • Natural Language Processing
  • Alzheimer Disease
  • Deep Learning

PubMed: MeSH publication types

  • Journal Article
  • Research Support, N.I.H., Extramural

