SQL Engines for Big Data Analytics

School of Science | Master's thesis
Date
2015
Major/Subject
Software Technology
Mcode
T3001
Degree programme
Degree Programme in Computer Science and Engineering
Language
en
Pages
56+7
Abstract
Traditional relational database systems cannot meet the need to analyze data of large volume and varied formats, i.e., Big Data. Apache Hadoop, as the first generation of open-source Big Data solutions, provides a stable distributed data storage and resource management system. However, because Hadoop is a MapReduce framework, the only way to use its parallel computing power is through its API: given a problem, one has to write a corresponding MapReduce program in Java, which is time-consuming. Moreover, Hadoop focuses on high throughput rather than low latency, so it can be a poor fit for interactive data processing. For instance, more and more genomic DNA sequence data is being generated, and processing genomic sequences on a single standalone system is next to impossible. Genomic researchers, however, usually specialize in their own field rather than in programming, and they do not expect a long wait before getting the data they are interested in. The demand for interactive Big Data processing necessitated decoupling data storage from analysis. The simple SQL queries of traditional relational database systems are still the most practical analysis tool, one that people without a programming background can also benefit from. As a result, Big Data SQL engines have emerged in the Hadoop ecosystem. This thesis first discusses the variety of Big Data storage formats and introduces Hadoop as the necessary background. Chapter three then introduces three Hadoop-based SQL engines, i.e., Hive, Spark, and Impala, and focuses on the first two, currently the most popular ones. To gain a deeper understanding of these SQL engines, an SQL benchmark experiment on Hive and Spark was run with BAM data, a binary genomic data format, as input, and is presented in this thesis. Finally, conclusions about Hadoop-based SQL engines are given.
Supervisor
Heljanko, Keijo
Thesis advisor
Heljanko, Keijo
Keywords
Hadoop, SQL, interactive analysis, Hive, Spark, Spark SQL