Explainable hierarchical fine-grained ICD code assignment using large language models: a dissertation in Engineering and Applied Science
Dissertation   Open access

Joshua Carberry
Doctor of Philosophy (PHD), University of Massachusetts Dartmouth
2024
DOI: https://doi.org/10.62791/1994

Abstract

In healthcare, structured and standardized records are essential to ensure that crucial medical information is well documented and accessible. Doctor’s notes encode important information about patient diagnoses and treatments, but they are written in unstructured natural language. An important step in the healthcare process is the use of standards such as the International Classification of Diseases (ICD) to annotate doctor’s notes with precise medical codes, providing a common language for accurate communication between healthcare organizations and other sectors such as insurance. Because healthcare involves countless unique diagnoses that are often difficult to distinguish from one another, assigning ICD codes is an enormous amount of work that requires skilled experts. In this thesis, we introduce several novel ICD auto-coding techniques that leverage knowledge representations and large language models (LLMs) to achieve high performance and explainable results. The presented auto-coding system is based on a fine-grained approach that reduces the complexity of classification and improves human comprehensibility by locating and labeling one diagnosis at a time rather than processing all notes at once. For each selected diagnosis, ICD code predictions draw not only on the diagnosis itself but also on semantically related text elsewhere in the notes. This additional evidence simplifies classification and provides a basis for understanding the results of automated coding. We explored related sentence extraction using two approaches: one based on an ontology (a formal knowledge representation) and one using the LLM GPT-4. Once semantically related text has been extracted, LLM-based classifiers are able to predict the correct ICD codes. To improve scalability, we introduce a hierarchical classifier forming a tree of fine-tuned LLMs to handle the large label space and complex classification inherent in the ICD coding task. This hierarchical approach decomposes the ICD coding task into smaller, more manageable subclassification tasks, thereby improving tractability and addressing the challenges posed by the high number of unique labels associated with ICD coding.
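
The hierarchical decomposition described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the `Node` structure, the `keyword_classifier` stubs (standing in for fine-tuned LLMs), and the three-code toy tree are all assumptions made for illustration. Each internal node only chooses among its own children, so no single classifier faces the full label space, and the returned path doubles as a simple trace of how the code was reached.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A node classifier maps text to the label of one child node.
# Hypothetical stand-in: the abstract describes fine-tuned LLMs here.
Classifier = Callable[[str], str]


@dataclass
class Node:
    label: str
    classify: Optional[Classifier] = None  # None marks a leaf (a final code)
    children: Dict[str, "Node"] = field(default_factory=dict)


def keyword_classifier(rules: Dict[str, str], default: str) -> Classifier:
    """Build a toy classifier: first matching keyword wins, else default."""
    def classify(text: str) -> str:
        lowered = text.lower()
        for keyword, label in rules.items():
            if keyword in lowered:
                return label
        return default
    return classify


def predict(root: Node, text: str) -> List[str]:
    """Walk from the root to a leaf, one small decision per level."""
    path = [root.label]
    node = root
    while node.classify is not None:
        node = node.children[node.classify(text)]
        path.append(node.label)
    return path


# Toy tree with ICD-10-CM-style labels (illustrative subset only).
endocrine = Node(
    "E00-E89",
    keyword_classifier({"type 1": "E10"}, default="E11"),
    {"E10": Node("E10"), "E11": Node("E11")},
)
circulatory = Node(
    "I00-I99",
    keyword_classifier({}, default="I10"),
    {"I10": Node("I10")},
)
root = Node(
    "ICD",
    keyword_classifier({"diabetes": "E00-E89"}, default="I00-I99"),
    {"E00-E89": endocrine, "I00-I99": circulatory},
)

print(predict(root, "Patient has type 2 diabetes mellitus."))
```

In a realistic setting each `classify` callable would wrap a model call, and the input text would be the selected diagnosis together with its extracted related sentences rather than a single raw sentence.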
PDF: Carberry J. COE PhD Dissertation 2024 (1.65 MB)
CC BY-NC-ND V4.0 Open Access
