Evaluation of cloud based approaches to data quality management

School of Science | Master's thesis

Date

2016-02-15

Major/Subject

Service Design and Engineering

Mcode

IL3005

Degree programme

Master's Programme in Service Design and Engineering (SDE)

Language

en

Pages

83

Abstract

Quality of data is critical for making data-driven business decisions. Enhancing data quality enables companies to make better decisions and prevent business losses. Systems similar to Extract, Transform, and Load (ETL) are often used to clean data and improve its quality. Currently, businesses tend to collect massive amounts of customer data, store it in the cloud, and analyze it to draw statistical inferences about their products, services, and customers. Cheaper storage and the constantly improving approaches to data privacy and security offered by cloud vendors such as Microsoft Azure and Amazon Web Services seem to be the key driving forces behind this process. This thesis implements an Azure Data Factory-based ETL system for data quality management on the Microsoft Azure cloud platform. In addition to Azure Data Factory, the system has four other key components: (1) Azure Storage for storing raw and semi-cleaned data; (2) HDInsight for processing raw and semi-cleaned data using Hadoop clusters and Hive queries; (3) Azure ML Studio for processing raw and semi-cleaned data using R scripts and machine learning algorithms; (4) Azure SQL Database for storing the cleaned data. The thesis shows that using Azure Data Factory as the core component offers many benefits because it helps in scheduling jobs and monitoring the whole data transformation process. Thus, it makes the data intake process more timely, guarantees data reliability, and simplifies data auditing. The developed system was tested and validated using sample raw data.

Supervisor

Nurminen, Jukka

Thesis advisor

Moloney, Seamus

Keywords

data quality management, ETL, data cleaning, Hive, Hadoop, Microsoft Azure
