Quality of analytics management of data pipelines for retail forecasting

School of Science | Master's thesis

Date

2019-08-19

Major/Subject

Data science

Mcode

SCI3095

Degree programme

Master's Programme in ICT Innovation

Language

en

Pages

54+3

Abstract

This thesis presents a framework for managing the quality of analytics in data pipelines. The main research question concerns managing the trade-off between cost, time, and data quality in retail forecasting; in data analytics this trade-off is generally referred to as quality of analytics. The challenge is addressed by introducing a proof-of-concept framework that collects real-time metrics on data quality, resource consumption, and other relevant properties from tasks within a data pipeline. The data pipelines within the framework are built with Apache Airflow, which orchestrates Dockerized tasks. The metrics of each task are monitored and stored in Elasticsearch. Cross-task communication is enabled by an event-driven architecture that uses RabbitMQ as the message queue and custom consumer images written in Python. With the help of these consumers, the system can control the result with respect to quality of analytics. Empirical testing of the final system with retail datasets showed that this approach can help data science teams provide better services on demand with bounded resources, especially when dealing with big data.
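The quality-of-analytics trade-off described above can be sketched as a simple control rule: each pipeline task reports its metrics, and a policy decides whether downstream processing accepts the result, degrades it (e.g. by sampling), or rejects it. The class names, fields, and thresholds below are illustrative assumptions for this sketch, not the thesis's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class TaskMetrics:
    """Metrics a pipeline task could report (hypothetical schema)."""
    task_id: str
    data_quality: float  # fraction of complete, valid records (0..1)
    runtime_s: float     # wall-clock time of the task in seconds
    cost_units: float    # abstract resource-consumption units


@dataclass
class QoAPolicy:
    """Bounds expressing the cost/time/quality trade-off."""
    min_quality: float
    max_runtime_s: float
    max_cost_units: float


def evaluate(metrics: TaskMetrics, policy: QoAPolicy) -> str:
    """Return a control decision for the downstream pipeline stage."""
    if metrics.data_quality < policy.min_quality:
        return "reject"  # quality bound violated: do not propagate
    if metrics.runtime_s > policy.max_runtime_s or metrics.cost_units > policy.max_cost_units:
        return "degrade"  # over budget: e.g. sample data, skip enrichment
    return "accept"
```

In the architecture the abstract describes, such a decision would be taken by the RabbitMQ consumers on the basis of metrics read back from Elasticsearch; here the logic is condensed into a single pure function for clarity.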

Supervisor

Truong, Hong-Linh

Thesis advisor

Ervasti, Mikko
Luukkonen, Teppo

Keywords

machine learning, offline learning, data pipelines, quality of analytics, apache airflow
