Using compound codes for automatic classification of clinical diagnoses

Serguei V. Pakhomov, James D. Buntrock, Christopher G. Chute

Research output: Contribution to journal › Article › peer-review

6 Scopus citations


Classification of diagnoses (a.k.a. coding) is the central part of current concept-based medical IR systems. Some classification systems contain over 30,000 distinct codes, which makes classifying clinical documents a time-consuming, labor-intensive, and error-prone process. This paper presents a simple methodology for cleaning up and reusing existing manually coded diagnostic statements, mainly extracted from clinical notes, to build predictive models using a sparse-feature implementation of a Naïve Bayes classifier. One of the problems addressed is that diagnostic statements often contain several diagnoses and are assigned several codes, resulting in a 'many-to-many' mapping problem. We investigate one possible way of solving this problem by introducing compound (multiple-code) categories. We present experimental results of classifying more than 16,000 randomly selected diagnostic strings into 19 top-level categories. A small improvement (3%) from using compound categories over simple categories indicates that multiple-code categories are a promising solution, although clearly in need of further research and refinement.
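The compound-category idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `SparseNaiveBayes`, the helper `compound_label`, the category codes `CIRC` and `ENDO`, and the training strings are all hypothetical, and whitespace tokenization with add-one smoothing is assumed. Each set of codes assigned to a diagnostic string is collapsed into a single compound label, so a multi-code statement becomes an ordinary single-label training example for a multinomial Naïve Bayes classifier backed by sparse (dictionary) counts.

```python
import math
from collections import Counter, defaultdict


def compound_label(codes):
    """Collapse a set of codes into one compound category label.

    Sorting makes the label independent of the order the codes were assigned in.
    """
    return "+".join(sorted(set(codes)))


class SparseNaiveBayes:
    """Multinomial Naive Bayes that stores only the word counts it has observed."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # additive smoothing constant
        self.class_counts = Counter()           # label -> number of training strings
        self.word_counts = defaultdict(Counter) # label -> sparse word counts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence and are skipped.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(self.class_counts[label] / total_docs)
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical diagnostic strings with hypothetical top-level codes.
training = [
    ("essential hypertension", ["CIRC"]),
    ("hypertension uncontrolled", ["CIRC"]),
    ("diabetes mellitus type 2", ["ENDO"]),
    ("diabetes mellitus", ["ENDO"]),
    ("hypertension and diabetes mellitus", ["CIRC", "ENDO"]),
    ("diabetes with hypertension", ["ENDO", "CIRC"]),
]
texts = [text for text, _ in training]
labels = [compound_label(codes) for _, codes in training]
model = SparseNaiveBayes().fit(texts, labels)

# A compound prediction is split back into its individual codes.
predicted_codes = model.predict("hypertension and diabetes").split("+")
print(predicted_codes)
```

One trade-off of this design is that every distinct combination of codes becomes its own class, so rare combinations have few training examples; the abstract's own caveat about further research and refinement applies.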

Original language: English (US)
Pages (from-to): 411-415
Number of pages: 5
Journal: Studies in health technology and informatics
State: Published - 2004

Bibliographical note

Funding Information:
We'd like to thank Barbara Abbot, Deborah Albrecht and Pauline Funk for sharing their HICDA coding expertise as well as Ted Pedersen for his advice on training automatic classifiers.


  • Automatic classification
  • clinical diagnoses
  • concept indexing

