Comparative Analysis of Big Data Stream Processing Systems

School of Science (Perustieteiden korkeakoulu) | Master's thesis
Date
2016-07-29
Major/Subject
Mobile Computing, Services and Security
Mcode
SCI3045
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
12+77
Abstract
In recent years, Big Data has become a prominent paradigm in the field of distributed systems. These systems distribute data storage and processing power across a cluster of computers, and they need methodologies to store and process Big Data in a distributed manner. There are two models for Big Data processing: batch processing and stream processing. The batch processing model produces accurate results but with high latency. Many systems, such as billing systems, must process Big Data with low latency because of real-time constraints, so the batch processing model cannot fulfill their requirements. The stream processing model addresses this limitation by producing results with low latency. Unlike the batch processing model, it processes only recent data rather than all data produced so far, in order to meet the time limits of real-time systems. To do so, the stream processing model divides a stream of records into data windows. Each data window contains a group of records to be processed together. Records can be grouped by time of arrival, time of creation, or user session. However, in some systems, processing the recent data depends on data that has already been processed. Many frameworks try to process Big Data in real time, such as Apache Spark, Apache Flink, and Apache Beam. The main purpose of this research is to give a clear and fair comparison of these frameworks from different perspectives, such as latency, processing guarantees, accuracy of results, fault tolerance, and the functionality each framework offers.
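The windowing idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not taken from the thesis and does not use the API of Spark, Flink, or Beam. The record shape, the function names `tumbling_windows` and `session_windows`, and the sample data are all assumptions made for illustration. The timestamp may stand for either arrival time or creation time; the grouping logic is the same in both cases.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical record shape: (timestamp in seconds, payload).
Record = Tuple[float, str]


def tumbling_windows(records: Iterable[Record], size: float) -> Dict[float, List[str]]:
    """Group records into fixed, non-overlapping time windows.

    Each window is keyed by its start time, so a record with
    timestamp ts lands in the window starting at floor(ts / size) * size.
    """
    windows: Dict[float, List[str]] = defaultdict(list)
    for ts, payload in records:
        start = (ts // size) * size
        windows[start].append(payload)
    return dict(windows)


def session_windows(records: Iterable[Record], gap: float) -> List[List[str]]:
    """Group records into sessions separated by more than `gap` seconds of inactivity."""
    sessions: List[List[str]] = []
    last_ts: Optional[float] = None
    for ts, payload in sorted(records):
        # Start a new session on the first record or after a long pause.
        if last_ts is None or ts - last_ts > gap:
            sessions.append([])
        sessions[-1].append(payload)
        last_ts = ts
    return sessions


# Illustrative sample data (assumed, not from the thesis).
events = [(0.5, "a"), (1.2, "b"), (4.9, "c"), (5.1, "d"), (12.0, "e")]
by_time = tumbling_windows(events, size=5.0)
by_session = session_windows(events, gap=3.0)
```

With this sample data, the time-based grouping puts "c" and "d" in different windows because they fall on either side of the 5-second boundary, while the session-based grouping keeps them together because they arrive only 0.2 seconds apart. Real frameworks additionally have to handle late and out-of-order records, which this sketch sidesteps by sorting a finite list.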
Supervisor
Heljanko, Keijo
Thesis advisor
Latif, Khalid
Keywords
big data, stream processing frameworks, Apache Spark, Apache Flink, Apache Beam, lambda architecture