Hypotheses engine (HypE): exploring structured biomedical datasets in search for predictive patterns

School of Science | Master's thesis
Machine Learning, Data Science and Artificial Intelligence
Degree programme
Master’s Programme in Computer, Communication and Information Sciences
Healthcare facilities now continuously collect immense amounts of data through their daily management systems, including diverse types of information such as patient admission details, drugs administered, and clinical examination results. Even though medical research has traditionally been condition-oriented, researchers often use similar analysis methodologies with very little context customization, which makes the analyses computationally redundant. This project proposes an analysis pipeline capable of automatically mining large and diverse biomedical datasets and identifying potentially interesting patterns in the data, regardless of the medical conditions the data might relate to. Such a system is called a hypotheses engine, as its purpose is to output patterns that seem to be medically predictive, which we call hypotheses. HypE’s novelty is two-fold: on the one hand, a tailored data processing method was developed for analyzing inconsistent and chaotic temporal data (i.e. a patient has laboratory measurements that are usually only partially repeated over time); on the other hand, the hypotheses found are to be output in a physician-friendly way, to allow fast understanding of the patterns found in case medical intervention is recommended. Given HypE’s functionality, results cannot be straightforwardly classified as good or bad, as certain data subsets might not contain any patterns at all. Methodologically, however, it is to be expected that some of the hypotheses found will be known medical patterns. Thus, HypE’s outputs are presented and discussed at a high level, since no manual check of their medical validity was performed by medical experts. The implemented prototype was run on MIMIC-III data, and the results exceeded initial expectations, as they did include common medical scenarios.
Kaski, Samuel
Thesis advisor
Mamitsuka, Hiroshi
Edgren, Henrik
data mining, electronic health records, temporal data, machine learning, support vector machines