TY - JOUR
T1 - Automated vs. manual coding of neuroimaging reports via natural language processing, using the international classification of diseases, tenth revision
AU - McKinney, Alexander M.
AU - Moore, Jessica A.
AU - Campbell, Kevin
AU - Braga, Thiago A.
AU - Rykken, Jeffrey B.
AU - Jagadeesan, Bharathi D.
AU - McKinney, Zeke J.
N1 - Publisher Copyright:
© 2024 The Authors
PY - 2024/5/30
Y1 - 2024/5/30
N2 - Objective: Natural language processing (NLP) can generate diagnosis codes from imaging reports. International Classification of Diseases, tenth revision (ICD-10) codes are the United States' standard for billing and coding, and they enable tracking of disease burden and outcomes. This cross-sectional study aimed to test the feasibility of an NLP algorithm by comparing its performance to manual coding by radiologists and a non-radiologist physician. Methods: Three neuroradiologist reviewers and one non-radiologist physician reviewer manually coded a randomly selected sample of 200 craniospinal CT and MRI reports drawn from a pool of >10,000. The NLP algorithm (Radnosis, VEEV, Inc., Minneapolis, MN) subdivided each report's Impression into “phrases”, with multiple ICD-10 matches for each phrase. Viewing only the Impression, the physician reviewers selected the single best ICD-10 code for each phrase. Codes selected by the physicians and by the algorithm were compared for agreement. Results: The algorithm extracted the reports' Impressions into 645 phrases, each with ranked ICD-10 matches. Pairwise agreement among the reviewers' selected codes was unreliable (Krippendorff α = 0.39–0.63). Using unanimous reviewer agreement as “ground truth”, the algorithm's sensitivity/specificity/F2 was 0.88/0.80/0.83 for the top 5 codes and 0.67/0.82/0.67 for the single best code. The engine tabulated “pertinent negatives” as negative codes for explicitly stated findings (e.g., “no intracranial hemorrhage”). The engine's matching was more specific for shorter ICD-10 codes than for full-length codes (p = 0.00582 × 10−3). Conclusions: Manual coding by physician reviewers showed significant variability and was time-consuming, whereas the NLP algorithm's top 5 diagnosis codes were relatively accurate. This preliminary work demonstrates the feasibility of, and potential for, generating diagnosis codes reliably and consistently. Future work may include correlating diagnosis codes with clinical encounter codes to evaluate imaging's impact on, and relevance to, care.
AB - Objective: Natural language processing (NLP) can generate diagnosis codes from imaging reports. International Classification of Diseases, tenth revision (ICD-10) codes are the United States' standard for billing and coding, and they enable tracking of disease burden and outcomes. This cross-sectional study aimed to test the feasibility of an NLP algorithm by comparing its performance to manual coding by radiologists and a non-radiologist physician. Methods: Three neuroradiologist reviewers and one non-radiologist physician reviewer manually coded a randomly selected sample of 200 craniospinal CT and MRI reports drawn from a pool of >10,000. The NLP algorithm (Radnosis, VEEV, Inc., Minneapolis, MN) subdivided each report's Impression into “phrases”, with multiple ICD-10 matches for each phrase. Viewing only the Impression, the physician reviewers selected the single best ICD-10 code for each phrase. Codes selected by the physicians and by the algorithm were compared for agreement. Results: The algorithm extracted the reports' Impressions into 645 phrases, each with ranked ICD-10 matches. Pairwise agreement among the reviewers' selected codes was unreliable (Krippendorff α = 0.39–0.63). Using unanimous reviewer agreement as “ground truth”, the algorithm's sensitivity/specificity/F2 was 0.88/0.80/0.83 for the top 5 codes and 0.67/0.82/0.67 for the single best code. The engine tabulated “pertinent negatives” as negative codes for explicitly stated findings (e.g., “no intracranial hemorrhage”). The engine's matching was more specific for shorter ICD-10 codes than for full-length codes (p = 0.00582 × 10−3). Conclusions: Manual coding by physician reviewers showed significant variability and was time-consuming, whereas the NLP algorithm's top 5 diagnosis codes were relatively accurate. This preliminary work demonstrates the feasibility of, and potential for, generating diagnosis codes reliably and consistently. Future work may include correlating diagnosis codes with clinical encounter codes to evaluate imaging's impact on, and relevance to, care.
KW - Coding
KW - ICD-10
KW - Impact
KW - Neuroradiology
KW - Relevance
UR - http://www.scopus.com/inward/record.url?scp=85192737325&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85192737325&partnerID=8YFLogxK
U2 - 10.1016/j.heliyon.2024.e30106
DO - 10.1016/j.heliyon.2024.e30106
M3 - Article
C2 - 38799748
AN - SCOPUS:85192737325
SN - 2405-8440
VL - 10
JO - Heliyon
JF - Heliyon
IS - 10
M1 - e30106
ER -