Quality of analytics management of data pipelines for retail forecasting
School of Science |
Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for Your own personal use. Commercial use is prohibited.
Authors
Date
2019-08-19
Department
Major/Subject
Data science
Mcode
SCI3095
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
54+3
Series
Abstract
This thesis presents a framework for managing quality of analytics in data pipelines. The main research question concerns managing the trade-off between cost, time and data quality in retail forecasting; in data analytics this trade-off is generally referred to as quality of analytics. The challenge is addressed by introducing a proof-of-concept framework that collects real-time metrics on data quality, resource consumption and other relevant properties of the tasks within a data pipeline. The data pipelines in the framework are built with Apache Airflow, which orchestrates Dockerized tasks. Metrics from each task are monitored and stored in Elasticsearch. Cross-task communication is enabled by an event-driven architecture that uses RabbitMQ as the message queue together with custom consumer images written in Python. With the help of these consumers, the system can control the results with respect to quality of analytics. Empirical testing of the final system with retail datasets showed that this approach can help data science teams provide better services on demand with bounded resources, especially when dealing with big data.
Description
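The trade-off check described in the abstract can be illustrated with a minimal, stdlib-only Python sketch. This is not code from the thesis; the metric schema (`TaskMetrics`) and the bound values are hypothetical, standing in for the real-time metrics that a consumer would read from the message queue before deciding whether a task still satisfies its quality-of-analytics constraints.

```python
from dataclasses import dataclass


@dataclass
class TaskMetrics:
    """Metrics reported by one pipeline task (hypothetical schema)."""
    task_id: str
    cost: float          # resource cost consumed so far (arbitrary units)
    runtime_s: float     # elapsed wall-clock time in seconds
    data_quality: float  # e.g. fraction of valid records, in [0, 1]


def within_qoa_bounds(m: TaskMetrics, max_cost: float,
                      max_runtime: float, min_quality: float) -> bool:
    """Return True if the task still meets the quality-of-analytics
    contract: bounded cost and time, with a minimum data quality."""
    return (m.cost <= max_cost
            and m.runtime_s <= max_runtime
            and m.data_quality >= min_quality)


# Example: a consumer deciding whether the pipeline may continue.
m = TaskMetrics("forecast_etl", cost=2.5, runtime_s=120.0, data_quality=0.97)
print(within_qoa_bounds(m, max_cost=5.0, max_runtime=300.0, min_quality=0.95))
```

In the actual system such a predicate would run inside a RabbitMQ consumer, reacting to metric events emitted by the Airflow-orchestrated tasks rather than to in-process values.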
Supervisor
Truong, Hong-Linh
Thesis advisor
Ervasti, Mikko
Luukkonen, Teppo
Keywords
machine learning, offline learning, data pipelines, quality of analytics, apache airflow