This book aims to describe how data analytics works for big data and how it is used in business. It gives an overview of existing technologies and approaches to building data analytics infrastructures. It also defines the points that should be taken into consideration when choosing the most suitable software solution for a particular use case.

The research is done by studying the architectural principles of big data systems and investigating the market for data analytics software. The result of this work is a composite report that includes a comparison of several technologies and a list of the criteria considered. The final report can be used as a guideline for choosing the most suitable technology for implementing an analytical platform in a broad variety of organizations.

As the amount of generated data keeps growing, changing and evolving, the concept of big data has become incredibly popular in recent years. It provides a set of new approaches and techniques that make it possible to work efficiently with huge volumes of records.

Nowadays, information is one of the most important resources; it can help with decision making and the optimization of business processes. However, to gain actual insights and unlock the potential of data, it is necessary to process the data and discover the information hidden inside it, which is the goal of data analytics. Data analytics platforms make it possible to manipulate raw data in order to find out what exactly it contains. These systems are complex and include multiple components; therefore, designing them requires a comprehensive analysis of the available options.

Excerpt

Preface

Introduction

List of engines to compare

Conclusion

Objectives and Topics

This work aims to provide a comprehensive overview of big data analytics, specifically focusing on the architectural principles and technologies used for processing large volumes of data. The research explores the market of data analytics software to offer a guideline for selecting the most suitable SQL engine for organizational needs, balancing performance, scalability, and specific use-case requirements.

Architectural principles of big data systems
Comparison of open-source SQL engines (Hive, Impala, Drill, Presto)
Criteria for selecting analytical software solutions
Evaluation of deployment models and support options
Practical use case examples for diverse business scenarios

Excerpt from the Book

1.1.3 Big data architecture and workflow

As big data differs from conventional data, processing it requires not only new technologies but also a special architectural design tailored to its specifics.

Each of the three Vs of big data is thus reflected in a different aspect of its processing. High volume implies the need for powerful batch processing mechanisms to deal with large amounts of data that have previously been saved to some storage and now need to be processed. To handle velocity, the system should be capable of executing interactive queries that give a result based on all available information, including the most recent pieces of it. The wide variety of data sets requires an extensible storage system along with advanced techniques for accessing them and then integrating them into the whole processing pipeline.

Big data architecture is a model of how big data and other information assets will be captured, stored, managed and made accessible to various user groups and applications[8]. In other words, it describes the way big data works as it flows through all components, from the raw data extracted from the data sources to the insights derived by end users using various analytical applications.

Summary of Chapters

Preface: Introduces the concept of big data and the goal of the book, which is to provide a guide for choosing data analytics infrastructure components.

Introduction: Discusses the growing importance of data in business decision-making and the necessity of effective SQL engines within analytical pipelines.

List of engines to compare: Outlines the selection of Apache Hive, Apache Impala, Apache Drill, and Presto as the primary technologies for analysis.

Conclusion: Summarizes the challenges of choosing analytical platform components and emphasizes the importance of aligning infrastructure with specific business requirements and workloads.

Keywords

Big Data, Data Analytics, SQL Engines, Apache Hive, Apache Impala, Apache Drill, Presto, Data Architecture, Business Intelligence, Scalability, ETL, Hadoop, Distributed Systems, Data Storage, Performance

Frequently Asked Questions

What is the primary focus of this book?

The book focuses on how big data analytics works in a business context and provides an overview of existing technologies for building data analytics infrastructures.

What are the central thematic fields covered?

The work covers big data architecture, SQL engine market exploration, technology comparison, and guidelines for infrastructure selection.

What is the research goal of this document?

The goal is to serve as a guideline for choosing the most suitable SQL engine for implementing an analytical platform across various organizations.

Which scientific or research methods were applied?

The research was conducted by studying the architectural principles of big data systems and investigating the current market of data analytics software.

What topics are discussed in the main body?

The main body covers the theoretical background of big data (3Vs), the comparison of specific SQL engines, and their practical application in different business scenarios.

Which terms best characterize this work?

Key terms include Big Data, SQL Engines, Data Analytics, Architecture, and Performance.

How does the author define the "Big Data" concept?

The author defines it through the "3V" attributes—Volume, Velocity, and Variety—which describe the challenges and characteristics of handling modern, massive data sets.

What specific role does the Apache Hive metastore play in these systems?

It acts as a central repository for storing metadata, such as table structures and partition details, which is essential for various SQL engines to process and manipulate data.
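
The idea of a shared metadata repository can be illustrated with a small sketch. This is not the Hive metastore API; `TableMetadata`, `ToyMetastore`, `plan_scan`, the table name and the paths are all hypothetical names chosen for illustration. It only shows how two different engines could look up the same table structure and partition list from one central place.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical metadata record: column types, storage location, partitions."""
    columns: dict
    location: str
    partitions: list = field(default_factory=list)


class ToyMetastore:
    """A toy stand-in for a central metadata repository (not the real Hive API)."""

    def __init__(self):
        self._tables = {}

    def register(self, name, metadata):
        self._tables[name] = metadata

    def describe(self, name):
        return self._tables[name]


def plan_scan(engine_name, table, store):
    """Build a list of partition scans using only metadata from the store."""
    meta = store.describe(table)
    return [f"{engine_name}: scan {meta.location}/{p}" for p in meta.partitions]


metastore = ToyMetastore()
metastore.register(
    "sales",
    TableMetadata(
        columns={"order_id": "bigint", "amount": "double"},
        location="/warehouse/sales",  # assumed path, for illustration only
        partitions=["year=2017", "year=2018"],
    ),
)

# Two hypothetical engines consult the same metadata and agree on the layout.
plan_a = plan_scan("engine_a", "sales", metastore)
plan_b = plan_scan("engine_b", "sales", metastore)
```

Because both plans are derived from the same record, a change to the registered partitions would be visible to every engine that reads from the store.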

Why is "Schema-on-read" considered a significant feature in the context of Apache Drill?

It allows the engine to process data without requiring predefined schemas, making it highly flexible for handling unstructured or evolving data sources without expensive ETL pre-processing.
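
A minimal sketch of the schema-on-read idea, assuming raw records stored as JSON lines. It does not reproduce how Apache Drill works internally; `RAW_LINES` and `read_with_schema_on_read` are hypothetical names. The point is only that the column set is discovered while reading, rather than declared before the data is loaded.

```python
import json

# Hypothetical raw records, stored as-is with no schema declared up front.
RAW_LINES = [
    '{"id": 1, "name": "alice"}',
    '{"id": 2, "name": "bob", "country": "DE"}',
    '{"id": 3, "clicks": 7}',
]


def read_with_schema_on_read(lines):
    """Parse records and discover the set of columns while reading."""
    records = [json.loads(line) for line in lines]

    # Collect column names in the order they are first seen.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    # Fill missing fields with None so that every row has the same shape.
    rows = [{col: record.get(col) for col in columns} for record in records]
    return columns, rows


columns, rows = read_with_schema_on_read(RAW_LINES)
```

Records with new fields, such as `country` or `clicks` here, are accepted without any prior table definition or transformation step.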

How does Presto differentiate itself from batch-processing engines like Hive?

Presto is designed for interactive, low-latency ad-hoc queries, allowing users to analyze data across multiple sources in real time rather than waiting for batch-processed results.
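
Querying several sources with one ad-hoc SQL statement can be sketched with Python's built-in `sqlite3` module. SQLite is used here purely as a stand-in and says nothing about Presto's own syntax, connectors or performance; the database alias `crm` and the tables `orders` and `customers` are hypothetical.

```python
import sqlite3

# One connection, two separate in-memory databases acting as two "sources".
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")

conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)

# A single ad-hoc query joins data that lives in both databases.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

The result is returned directly to the caller, which mirrors the interactive style of use described above, as opposed to scheduling a batch job and reading its output later.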


Details

Title: SQL Engines for Big Data Analytics
Course: Master of Computer Application
Grade: 8
Author: Ajit Singh (Author)
Publication Year: 2018
Pages: 60
Catalog Number: V489838
ISBN (eBook): 9783346079091
Language: English
Tags: Big data big data analytics SQL engine data analytics platform technologies comparison
Product Safety: GRIN Publishing GmbH

Quote paper: Ajit Singh (Author), 2018, SQL Engines for Big Data Analytics, Munich, GRIN Verlag, https://www.grin.com/document/489838
