Significance of Patterns in Data Visualisations
Sähkötekniikan korkeakoulu (School of Electrical Engineering)
Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for Your own personal use. Commercial use is prohibited.
Authors
Savvides, Rafael
Date
2019-08-19
Major/Subject
Machine Learning, Data Science and Artificial Intelligence
Mcode
SCI3044
Degree programme
CCIS - Master’s Programme in Computer, Communication and Information Sciences (TS2013)
Language
en
Pages
62+1
Abstract
When a data analyst explores data visually and observes a pattern, how can he or she determine whether the pattern is real or just a random artefact of the data? This thesis addresses the problem of evaluating visual patterns observed during visual data exploration by developing a statistical significance testing framework for visual patterns. Traditionally, patterns observed during data exploration are not evaluated with statistical testing. The reason is that any hypotheses to be tested about the data must be formulated prior to viewing the data; otherwise there is a risk of false discoveries (Type I errors). A naive solution for combining visual exploration with statistical testing involves pre-specifying all possible hypotheses about observable patterns and then applying a multiple testing correction. However, the sheer number of potential patterns makes the multiple testing correction overly strict, which leads to low statistical power. This means that true patterns in the data may fail to be discovered, i.e., there is a risk of false negatives (Type II errors). The framework proposed in this thesis is a principled statistical significance testing procedure that controls Type I errors and is not overly conservative. The framework is based on improving statistical power by leveraging the data analyst's knowledge and by utilising multiple testing corrections that are suitable for visual exploration. An empirical investigation of the framework is performed on real and synthetic tabular data and time series, using different test statistics and null distributions. The investigation shows that the proposed framework allows the significance of visual patterns to be determined during exploratory analysis.
Description
Supervisor
Gionis, Aristides
Thesis advisor
Puolamäki, Kai
Keywords
data, visualisation, patterns, statistical, significance