Clinical coding is common practice in numerous healthcare settings. Providers are required to index clinical and administrative data from each patient episode using classification schemes such as the International Classification of Diseases (ICD). However, this process is highly resource-intensive, repetitive and error-prone, and represents a significant burden on healthcare organizations. With the increasing adoption of information technologies by health organizations, in particular Electronic Health Record (EHR) systems, multiple researchers have been motivated to develop systems that (partially) automate the clinical coding process. The baseline methodology consists of employing natural language processing (NLP) techniques to extract concepts from medical narratives and using these data to build models that predict the set of ICD codes to assign to each episode. The major drawback of these approaches lies in the NLP techniques themselves, which are highly dependent on the language of the medical texts, as well as on their structure and ambiguity and on the availability of NLP resources such as annotated texts.

To overcome the issues associated with NLP techniques, this project builds on recent trends in EHR technology, wherein data are increasingly available in structured format, thereby avoiding the need for NLP tools to extract concepts from narratives. However, extracting structured data from EHR databases and mapping them to a format upon which prediction models can be built is not straightforward, considering that these data are stored in different formats and documented with different frequencies. Therefore, this project encompasses two stages: (1) the design of a data model whereby structured information stored in the EHR database is mapped to a data matrix, and (2) the development of prediction models under a supervised learning paradigm, using data from past episodes and the corresponding sets of ICD codes to model the relationship between EHR data and the resulting ICD codes. This project is carried out using data produced in real-world settings.
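The two stages can be illustrated with a toy sketch. It is not the project's implementation: the record layout, variable names, ICD code assignments, mean aggregation and nearest-neighbour rule below are all assumptions chosen only to show how records documented at different frequencies might be reduced to one row per episode and then linked to code sets from past episodes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format EHR extract: (episode_id, variable, value).
# Variables are documented at different frequencies in each episode.
records = [
    ("ep1", "heart_rate", 80.0),
    ("ep1", "heart_rate", 90.0),
    ("ep1", "glucose", 110.0),
    ("ep2", "heart_rate", 70.0),
    ("ep3", "glucose", 200.0),
    ("ep3", "glucose", 220.0),
]

# Hypothetical ICD code sets already assigned to past episodes (the labels).
icd_codes = {"ep1": {"I10"}, "ep3": {"E11", "I10"}}


def build_matrix(records):
    """Stage 1: aggregate repeated measurements into one row per episode."""
    grouped = defaultdict(lambda: defaultdict(list))
    for episode, variable, value in records:
        grouped[episode][variable].append(value)
    variables = sorted({variable for _, variable, _ in records})
    # Mean per variable; None marks a variable never documented in the episode.
    matrix = {
        episode: [mean(vals[v]) if v in vals else None for v in variables]
        for episode, vals in grouped.items()
    }
    return variables, matrix


def distance(a, b):
    """Mean squared difference over the features present in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sum((x - y) ** 2 for x, y in shared) / len(shared)


def predict_codes(row, matrix, icd_codes):
    """Stage 2 (toy): reuse the code set of the nearest labelled past episode."""
    nearest = min(icd_codes, key=lambda episode: distance(row, matrix[episode]))
    return icd_codes[nearest]


variables, matrix = build_matrix(records)
predicted = predict_codes(matrix["ep2"], matrix, icd_codes)
```

Here the unlabelled episode `ep2` shares only `heart_rate` with the labelled episodes, so it inherits the code set of `ep1`. A real system would need a principled treatment of missing values and a proper multi-label learner in place of this nearest-neighbour stand-in.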



José Ferrão



Health Big Data and Decision Support Systems.