Enhancing the Accuracy of Large Language Models in Medical Coding through Retrieval-Based Approaches

Keith KWAN; Hao CHEN; Ho Hung Billy CHEUNG

Authors

Keith KWAN, AI Native Health, Unit D, 1/F, Sunshine Plaza, 17 Sung On Street, Hung Hom, Kowloon, Hong Kong
Hao CHEN, Department of Computer Science and Engineering, Faculty of Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Ho Hung Billy CHEUNG, Department of Surgery, School of Clinical Medicine, LKS Faculty of Medicine, The University of Hong Kong. ORCID: https://orcid.org/0000-0002-8843-7893

Keywords:

Natural Language Processing, Retrieve-Rank system, Automated diagnosis coding, Machine learning in healthcare, Medical informatics, ICD-10-CM coding

Abstract

Purpose: Medical coding is essential for healthcare administration and research, but it requires significant expertise and resources. Large Language Models (LLMs) have shown promise in automating this task, yet recent studies have highlighted their limitations, with even advanced models such as GPT-4 achieving only moderate accuracy. This study presents a novel Retrieve-Rank system that combines a ColBERT-V2 retriever with GPT-3.5-turbo to automate medical coding.

Methods: We compared our Retrieve-Rank system with a Vanilla LLM approach on a dataset of 100 single-term medical conditions and their corresponding International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes. ICD-10-CM is the latest version of the standardized system used in the United States to code diseases and medical conditions. The system works in two steps: it first retrieves the 15 most relevant codes using ColBERT-V2, then uses GPT-3.5-turbo to rerank them and select the most appropriate code. The experiment was conducted on 1 June 2024. Performance was measured as top-one accuracy on normalized ICD-10-CM codes.

Results: Our Retrieve-Rank system identified the correct code in 100% of cases, compared with 6% for the Vanilla LLM approach. The improvement is particularly noteworthy because it was achieved with GPT-3.5, a more accessible model than GPT-4. It indicates that LLMs equipped with an appropriate retrieval mechanism can overcome their inherent limitations in medical coding tasks.

Conclusions: Although our study was limited to single-term conditions, the results suggest significant potential for broader applications in healthcare administration. This research helps bridge the gap between AI capabilities and clinical implementation, offering a promising approach to automating medical coding while maintaining high accuracy. Future research should validate these findings on more complex, real-world medical cases and unstructured clinical notes.
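
The two-step retrieve-then-rerank process and the top-one accuracy metric described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The retriever below ranks by simple token overlap as a stand-in for ColBERT-V2, and the `choose` callback is a placeholder for the GPT-3.5-turbo reranking call. The four-entry catalogue, the function names, and the normalization rule (removing dots and upper-casing) are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

# Illustrative four-entry catalogue (assumption); the real ICD-10-CM code set is far larger.
CATALOGUE: Dict[str, str] = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
    "J45.909": "Unspecified asthma, uncomplicated",
    "N39.0": "Urinary tract infection, site not specified",
}

Candidate = Tuple[str, str]  # (code, description)


def tokens(text: str) -> set:
    """Lower-case alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 15) -> List[Candidate]:
    """Step 1: return the top-k candidates.

    Stand-in retriever (assumption): scores by token overlap, not ColBERT-V2.
    """
    q = tokens(query)
    ranked = sorted(
        CATALOGUE.items(),
        key=lambda item: len(q & tokens(item[1])),
        reverse=True,
    )
    return ranked[:k]


def rerank(query: str, candidates: List[Candidate],
           choose: Callable[[str, List[Candidate]], str]) -> str:
    """Step 2: delegate the final pick to `choose` (an LLM call in the study)."""
    return choose(query, candidates)


def first_candidate(query: str, candidates: List[Candidate]) -> str:
    """Hypothetical placeholder for the LLM: keeps the retriever's top hit."""
    return candidates[0][0]


def normalize(code: str) -> str:
    """Assumed normalization: strip whitespace, drop dots, upper-case."""
    return code.strip().replace(".", "").upper()


def top1_accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions that match the gold code after normalization."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return hits / len(gold)


if __name__ == "__main__":
    queries = ["asthma", "hypertension"]
    gold = ["J45909", "I10"]
    predictions = [rerank(q, retrieve(q), first_candidate) for q in queries]
    print(predictions, top1_accuracy(predictions, gold))
```

The design point is that the second stage only ever chooses among codes that exist in the catalogue, so it cannot output an invented code. The abstract's comparison with the Vanilla LLM approach suggests this constraint is a main source of the gain, although the sketch itself does not demonstrate that.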
