SQL Engines for Big Data Analytics

dc.contributor Aalto-yliopisto fi
dc.contributor Aalto University en
dc.contributor.advisor Heljanko, Keijo
dc.contributor.author Xue, Rui
dc.date.accessioned 2015-12-16T07:57:39Z
dc.date.available 2015-12-16T07:57:39Z
dc.date.issued 2015
dc.identifier.uri https://aaltodoc.aalto.fi/handle/123456789/19201
dc.description.abstract Traditional relational database systems cannot accommodate the need to analyze data of large volume and varied formats, i.e., Big Data. Apache Hadoop, as the first generation of open-source Big Data solutions, provided a stable distributed data storage and resource management system. However, because Hadoop is a MapReduce framework, the only way to use its parallel computing power is through its API. Given a problem, one has to code a corresponding MapReduce program in Java, which is time-consuming. Moreover, Hadoop focuses on high throughput rather than low latency. Therefore, Hadoop can be a poor fit for interactive data processing. For instance, more and more genomic DNA sequence data is being generated, and processing the genomic sequences on a single standalone system is next to impossible. Genomic researchers, however, usually specialize in their own field rather than in programming, and they do not expect a long wait before they get the data they are interested in. The demand for interactive Big Data processing necessitated the decoupling of data storage from analysis. The simple SQL queries of traditional relational database systems are still the most practical analysis tool, one that people without a programming background can also benefit from. As a result, Big Data SQL engines have been spun off in the Hadoop ecosystem. This thesis first discusses the variety of Big Data storage formats and introduces Hadoop as the necessary background knowledge. Chapter three then introduces three Hadoop-based SQL engines, i.e., Hive, Spark, and Impala, and focuses on the first two, currently the most popular ones. To gain a deeper understanding of these SQL engines, an SQL benchmark experiment was run on Hive and Spark with BAM data, a binary genomic data format, as input, and it is presented in this thesis. Finally, conclusions about Hadoop-based SQL engines are given. en
dc.format.extent 56+7
dc.format.mimetype application/pdf en
dc.language.iso en en
dc.title SQL Engines for Big Data Analytics en
dc.title SQL hakukone isoa datan analyysia varten fi
dc.type G2 Pro gradu, diplomityö en
dc.contributor.school Perustieteiden korkeakoulu fi
dc.subject.keyword hadoop en
dc.subject.keyword SQL en
dc.subject.keyword interactive analysis en
dc.subject.keyword hive en
dc.subject.keyword spark en
dc.subject.keyword spark SQL en
dc.identifier.urn URN:NBN:fi:aalto-201512165719
dc.programme.major Ohjelmistotekniikka fi
dc.programme.mcode T3001 fi
dc.type.ontasot Master's thesis en
dc.type.ontasot Diplomityö fi
dc.contributor.supervisor Heljanko, Keijo
dc.programme Tietotekniikan koulutusohjelma fi

