TY - JOUR
T1 - Using word embeddings to expand terminology of dietary supplements on clinical notes
AU - Fan, Yadan
AU - Pakhomov, Serguei
AU - McEwan, Reed
AU - Zhao, Wendi
AU - Lindemann, Elizabeth
AU - Zhang, Rui
N1 - Publisher Copyright: © The Author(s) 2019.
PY - 2019
Y1 - 2019
N2 - Objective: The objective of this study is to demonstrate the feasibility of applying word embeddings to expand the terminology of dietary supplements (DS) using over 26 million clinical notes. Methods: Word embedding models (ie, word2vec and GloVe) trained on clinical notes were used to predefine a list of the top 40 semantically related terms for each of 14 commonly used DS. Each list was further evaluated by experts to identify semantically similar terms. We investigated the effect of corpus size and other settings (ie, vector size and window size), as well as the choice between the 2 word embedding models, on performance for DS term expansion. We compared the number of clinical notes (and the patients they represent) retrieved using the word embedding expanded terms with the numbers retrieved using both the baseline terms and the terms expanded from external DS sources. Results: Using the word embedding models trained on clinical notes, we could identify 1-12 semantically similar terms for each DS. Using the word embedding expanded terms, we were able to retrieve, on average, 8.39% more clinical notes and 11.68% more patients for each DS compared with the 2 other sets of terms. Increasing the corpus size yielded more misspellings, but not more semantic variants or brand names. The word2vec model was also found to be more capable of detecting semantically similar terms than GloVe. Conclusion: Our study demonstrates the utility of word embeddings trained on clinical notes for terminology expansion on 14 DS. We propose that this method could be applied to create a DS vocabulary for downstream applications, such as information extraction.
KW - Clinical notes
KW - Dietary supplements
KW - Natural language processing
KW - Terminology expansion
KW - Word embeddings
UR - http://www.scopus.com/inward/record.url?scp=85069740667&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85069740667&partnerID=8YFLogxK
U2 - 10.1093/jamiaopen/ooz007
DO - 10.1093/jamiaopen/ooz007
M3 - Article
AN - SCOPUS:85069740667
SN - 2574-2531
VL - 2
SP - 246
EP - 253
JO - JAMIA Open
JF - JAMIA Open
IS - 2
ER -