ETRO-VUB Department of Electronics and Informatics

Master theses

Current and past ideas and concepts for Master Theses.


	Automated ICD Classification Using NLP and Deep Learning Techniques: A Comparative Study on the MIMIC-III Dataset

Subject

The International Classification of Diseases (ICD) is a system used to classify diseases and medical procedures. Accurate ICD coding is essential for billing, epidemiological studies, and clinical decision-making. With the advent of electronic health records (EHRs), there is growing interest in using natural language processing (NLP) and deep learning (DL) techniques to automate the ICD coding process. This study aims to compare the performance of different DL and NLP techniques for ICD classification using the MIMIC-III (Medical Information Mart for Intensive Care III) dataset [1,2,3,4,5,6].

Kind of work

1. Objectives:
• To compare the performance of different DL and NLP techniques (e.g., convolutional neural networks, recurrent neural networks, transformers) for ICD classification using the MIMIC-III dataset.
• To evaluate the impact of different pre-processing and feature engineering techniques on the performance of the DL models.
• To investigate the interpretability and explainability of the DL models for ICD classification.

Framework of the Thesis

2. Methodology:
The study will be conducted in the following steps:
1. Data Preprocessing: The MIMIC-III dataset will be preprocessed to extract the text relevant for NLP analysis.
2. Text Preprocessing: Different text pre-processing techniques (e.g., tokenization, stemming, stop word removal) will be applied to prepare the text for NLP analysis.
3. Feature Engineering: Different feature engineering techniques (e.g., word embeddings, contextual embeddings) will be applied to represent the text for the DL models.
4. DL Model Selection and Training: Several DL models (e.g., convolutional neural networks, recurrent neural networks, transformers) will be trained using the preprocessed data and selected features.
5. Model Evaluation: The performance of the DL models will be evaluated using standard metrics (e.g., accuracy, F1-score) and compared to identify the best-performing technique.
6. Interpretability and Explainability: The selected DL models will be investigated for interpretability and explainability to understand their decision-making process.

3. Expected Outcomes:
The study is expected to produce the following outcomes:
• A comparison of different DL for NLP techniques for ICD classification using the MIMIC-III dataset [4].
• An evaluation of the impact of different pre-processing and feature engineering techniques on the performance of the DL models.
• A set of recommendations for selecting the most appropriate DL for NLP technique for ICD classification based on the dataset characteristics.
• A deeper understanding of the interpretability and explainability of the DL models for ICD classification.

References and further reading

1. C. Yan, X. Fu, X. Liu, Y. Zhang, Y. Gao, J. Wu, and Q. Li, “A survey of automated International Classification of Diseases coding: development, challenges, and applications,” Intelligent Medicine, vol. 2, pp. 161–173, Aug. 2022.
2. F. Teng, Y. Liu, T. Li, Y. Zhang, S. Li, and Y. Zhao, “A review on deep neural networks for ICD coding,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2022.
3. J. Xu, X. Xi, J. Chen, V. S. Sheng, J. Ma, and Z. Cui, “A Survey of Deep Learning for Electronic Health Records,” Applied Sciences, vol. 12, p. 11709, Nov. 2022.
4. A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark, “MIMIC-III, a freely accessible critical care database,” Scientific Data, vol. 3, p. 160035, May 2016.
5. J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, and J. Eisenstein, “Explainable Prediction of Medical Codes from Clinical Text,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), (New Orleans, Louisiana), pp. 1101–1111, Association for Computational Linguistics, 2018.
6. Z. Zhang, J. Liu, and N. Razavian, “BERT-XML: Large Scale Automated ICD Coding Using BERT Pretraining,” in Proceedings of the 3rd Clinical Natural Language Processing Workshop, (Online), pp. 24–34, Association for Computational Linguistics, 2020.

Expected Student Profile

Requirements:
• Strong programming skills are required for this project, particularly in Python.
• Experience with deep learning frameworks, especially PyTorch, is a plus.
• Familiarity with natural language processing is also desirable.
• Good communication skills and the ability to document code and results are helpful.
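Illustrative sketch

To give prospective students a feel for the text preprocessing and model evaluation steps listed in the methodology, the following is a minimal sketch in plain Python. It is not part of the proposal and does not represent the method the thesis will use. The function names, the tiny stop-word list, the sample sentence, and the example code sets are all hypothetical. It also assumes that ICD coding is treated as a multi-label problem scored with micro- and macro-averaged F1, which is one common reading of "F1-score" but is not stated in the proposal.

```python
import re
from typing import List, Set

# Hypothetical, deliberately tiny stop-word list; a real study would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "with", "was", "to", "in", "for"}


def preprocess(note: str) -> List[str]:
    """Lowercase the note, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", note.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives over all notes."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Macro-averaged F1: unweighted mean of the per-code F1 scores."""
    codes = set().union(*gold, *pred)
    if not codes:
        return 0.0
    scores = []
    for code in sorted(codes):
        tp = sum(1 for g, p in zip(gold, pred) if code in g and code in p)
        fp = sum(1 for g, p in zip(gold, pred) if code not in g and code in p)
        fn = sum(1 for g, p in zip(gold, pred) if code in g and code not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up example: two notes with gold and predicted code sets (illustrative only).
gold_codes = [{"401.9", "428.0"}, {"250.00"}]
pred_codes = [{"401.9"}, {"250.00", "428.0"}]

print(preprocess("Patient admitted with CHF and hypertension."))
print(round(micro_f1(gold_codes, pred_codes), 3))
print(round(macro_f1(gold_codes, pred_codes), 3))
```

Micro-averaging weights frequent codes more heavily, while macro-averaging treats every code equally. Since ICD code frequencies tend to be heavily skewed, reporting both is a reasonable design choice when comparing models.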

Promotors

Prof. Dr. Ir. Nikos Deligiannis

+32 (0)2 629 1683

ndeligia@etrovub.be


Prof. Hichem Sahli

+32 (0)2 629 2916

hsahli@etrovub.be


Supervisor

Miss Soha Sadat Mahdi

+32 (0)2 629 2930

smahdi@etrovub.be



Contact ETRO Department: info@etro.vub.ac.be • Tel: +32 2 629 29 30


©2025 • Vrije Universiteit Brussel • ETRO Dept. • Pleinlaan 2 • 1050 Brussels • Tel: +32 2 629 2930 (secretariat) • Fax: +32 2 629 2883