Evaluation of cloud based approaches to data quality management

School of Science | Master's thesis
Date
2016-02-15
Major/Subject
Service Design and Engineering
Mcode
IL3005
Degree programme
Master's Programme in Service Design and Engineering (SDE)
Language
en
Pages
83
Abstract
Quality of data is critical for making data-driven business decisions. Enhancing the quality of data enables companies to make better decisions and prevent business losses. Systems similar to Extract, Transform, and Load (ETL) are often used to clean and improve the quality of data. Currently, businesses tend to collect massive amounts of customer data, store it in the cloud, and analyze it to draw statistical inferences about their products, services, and customers. Cheaper storage and the constantly improving approaches to data privacy and security provided by cloud vendors, such as Microsoft Azure and Amazon Web Services, seem to be the key driving forces behind this process. This thesis implements an Azure Data Factory-based ETL system that serves the purpose of data quality management on the Microsoft Azure cloud platform. In addition to Azure Data Factory, there are four other key components in the system: (1) Azure Storage for storing raw and semi-cleaned data; (2) HDInsight for processing raw and semi-cleaned data using Hadoop clusters and Hive queries; (3) Azure ML Studio for processing raw and semi-cleaned data using R scripts and machine learning algorithms; (4) Azure SQL Database for storing the cleaned data. This thesis shows that using Azure Data Factory as the core component offers many benefits because it helps in scheduling jobs and monitoring the whole data transformation process. Thus, it makes the data intake process more timely, guarantees data reliability, and simplifies data auditing. The developed system was tested and validated using sample raw data.
Supervisor
Nurminen, Jukka
Thesis advisor
Moloney, Seamus
Keywords
data quality management, ETL, data cleaning, Hive, Hadoop, Microsoft Azure