Hadoop and MapReduce now handle huge amounts of data and are becoming ubiquitous for big data storage and processing. This makes it essential to evaluate and characterize the Hadoop file system and its deployment through extensive benchmarking. Other widely available benchmarking tools can analyze the performance of a Hadoop system, but they are either designed to run on a single-node system or built to assess the attached storage device and its basic characteristics, such as top speed and other hardware-related or manufacturer details. The tool used here is therefore HiBench, a comprehensive benchmark suite for Hadoop that consists of a complete set of Hadoop applications, including micro-benchmarks and real-world applications, for benchmarking the performance of Hadoop on the available type of storage device (i.e. HDD and SSD) and machine configuration. This helps to optimize performance and to better address the limitations of the Hadoop system.
In this research work we analyze and characterize the performance of the external sorting algorithm in Hadoop (MapReduce) with SSDs and HDDs connected through various interconnect technologies: 10GigE, IPoIB and RDMA-IB. In addition, we demonstrate that traditional servers and older cloud systems can be brought on par with modern technologies through software and hardware upgrades, using modern storage devices and interconnect networking systems, without heavy spending on wholesale upgrades or a complete replacement of the system. This in turn reduces power consumption drastically and allows large-scale servers to run smoothly with low latency and high throughput, so that the full power of the processors is available for the big data flowing through the network.
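
For readers unfamiliar with the term, external sorting means sorting data that does not fit in memory by writing sorted runs to storage and then merging them, which is why storage and interconnect speed matter for this workload. The following is a minimal single-machine sketch in Python, not Hadoop's implementation and not code from the dissertation; the function name external_sort and the chunk_size parameter are hypothetical and chosen only for illustration.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(records, chunk_size=1000):
    """Sort an iterable of newline-free strings via sorted runs on disk.

    Hypothetical helper for illustration only: chunk_size stands in for
    the in-memory sort buffer, and each sorted chunk is spilled to a
    temporary "run" file before the final merge.
    """
    run_paths = []
    it = iter(records)
    try:
        # Phase 1: read at most chunk_size records, sort them in memory,
        # and spill the sorted run to a temporary file.
        while True:
            chunk = sorted(islice(it, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as f:
                f.writelines(rec + "\n" for rec in chunk)
            run_paths.append(path)

        # Phase 2: k-way merge of the sorted runs, reading them lazily.
        files = [open(p) for p in run_paths]
        try:
            runs = [(line.rstrip("\n") for line in f) for f in files]
            # Returned as a list for brevity; a real external sort would
            # stream the merged records to an output file instead.
            return list(heapq.merge(*runs))
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary run files whether or not the merge succeeded.
        for p in run_paths:
            os.remove(p)
```

Both phases are dominated by sequential writes and reads of the run files, so in a sketch like this the storage device largely determines the running time once the data exceeds the in-memory buffer.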

Excerpt

Chapter 1 – INTRODUCTION

Chapter 2 – LITERATURE REVIEW

2.1 Advantages of using Hadoop

2.2 Big Data

2.3 Project Architecture

2.4 Goal: HDFS-HMFGR

2.5 INTERCONNECT TECHNOLOGIES (Hardware Solution)

2.5.1 Traditional Interconnect 10GigE Network

2.5.2 Infiniband Technology

2.5.2.1 IPoIB Interconnect Technology

2.5.2.2 RDMA-IB Interconnect Technology

2.6 Memory Allocation with MemCached (Software Solution)

Chapter 3 – Experimental Testbed System

3.1 An Insight into SSD and HDD

Chapter 4 – Installation, Designing and Implementation of System

4.1 Set a LAN and System service of Network

4.2 Hadoop Installation Guide

4.2.1 Steps On each Machine

4.2.1.1 Install prerequisites

4.2.1.2 Adding a dedicated Hadoop system user

4.2.1.3 Setup hostname

4.2.2 Steps On Master

4.2.2.1 Install Hadoop

4.2.2.2 Configuration ssh

4.2.2.3 Install NFS

4.2.2.4 Configuration

4.2.3 Run Hadoop

4.2.3.1 Run HDFS

4.2.3.2 Run MapReduce Job

4.2.3.3 Run All Daemon

4.2.3.4 Hadoop Web Interfaces

4.3 Enable Ethernet and InfiniBand (SR-IOV)

4.3.1 Enable Ethernet SR-IOV

4.3.2 Enable InfiniBand SR-IOV

4.4 Steps to setup IPoIB on Quanta machines

4.5 Install MemCached on CentOS

Chapter 5 – Benchmarking

5.1 ATTO Disk Benchmarking

5.2 HD Tune Pro

5.3 Linux Disk Utilities

5.4 HiBench (Hadoop Benchmarking Suite)

5.4.1 Micro-Benchmarks

5.4.1.1 Sort

5.4.1.2 Word Count

5.4.1.3 TeraSort

5.4.1.4 Enhanced DFSIO

5.4.2 Web Search

5.4.2.1 Nutch Indexing

5.4.2.2 Page Ranking

5.4.3 Machine Learning

5.4.3.1 Bayesian Classification

5.4.3.2 K-means Clustering

5.4.4 Analytical Query

5.4.4.1 Hive Join

5.4.4.2 Hive Aggregation

Chapter 6 – Performance Evaluation of SSD and HDD on Hadoop using 10GigE

6.1 Sort Work Load

6.2 Word Count Work Load

6.3 Tera Sort Work Load

Chapter 7 – Performance Evaluation of SSD and HDD on Hadoop using IPoIB

7.1 Sort Work Load

7.2 Word Count Work Load

7.3 Tera Sort Work Load

Chapter 8 – Performance Evaluation of SSD and HDD on Hadoop by RDMA-IB

8.1 Sort Work Load

8.2 Word Count Work Load

8.3 Tera Sort Work Load

Chapter 9 – Performance Comparison between 10GigE and IPoIB

9.1 Performance Comparison of Sort Workload

9.1.1 Performance Comparison of SSD

9.1.2 Performance Comparison of HDD

9.2 Performance Comparison of WordCount Workload

9.2.1 Performance Comparison of SSD

9.2.2 Performance Comparison of HDD

9.3 Performance Comparison of TeraSort Workload

9.3.1 Performance Comparison of SSD

9.3.2 Performance Comparison of HDD

Chapter 10 – Performance Comparison between IPoIB and RDMA-IB

10.1 Performance Comparison of Sort Workload

10.1.1 Performance Comparison of SSD

10.1.2 Performance Comparison of HDD

10.2 Performance Comparison of WordCount Workload

10.2.1 Performance Comparison of SSD

10.2.2 Performance Comparison of HDD

10.3 Performance Comparison of TeraSort Workload

10.3.1 Performance Comparison of SSD

10.3.2 Performance Comparison of HDD

Chapter 11 – Performance Comparison between 10GigE and RDMA-IB

11.1 Performance Comparison of Sort Workload

11.1.1 Performance Comparison of SSD

11.1.2 Performance Comparison of HDD

11.2 Performance Comparison of WordCount Workload

11.2.1 Performance Comparison of SSD

11.2.2 Performance Comparison of HDD

11.3 Performance Comparison of TeraSort Workload

11.3.1 Performance Comparison of SSD

11.3.2 Performance Comparison of HDD

Chapter 12 – Overall comparison of 10GigE, IPoIB and RDMA-IB

Chapter 13 – Conclusion

Chapter 14 – Future Scope

Research Objectives and Topics

The primary research objective is to analyze and characterize the performance of external sorting algorithms within the Hadoop MapReduce framework, specifically evaluating the impact of storage devices (SSD versus HDD) when connected via different interconnect technologies like 10GigE, IPoIB, and RDMA-IB to optimize big data processing efficiency.

Benchmarking Hadoop performance using HiBench with diverse storage media.
Evaluation of network interconnect impact on Hadoop I/O throughput and latency.
Comparative analysis of traditional 10GigE versus modern InfiniBand-based interconnects.
Optimization of server performance for large-scale data processing workloads.
Assessment of performance bottlenecks in heterogeneous storage environments.

Excerpt from the book

3.1 An Insight into SSD and HDD

A solid-state drive (SSD) (also known as a solid-state disk or electronic disk) (Figure 2) is a data storage device that uses integrated circuit assemblies as memory to store data persistently. SSDs use electronic interfaces that are compatible with conventional block input/output (I/O) HDDs, which allows them to be substituted easily in common applications. SSDs use NAND-based flash memory, which retains data without power [12][39].

A hard disk drive (HDD) (Figure 3) is a storage device used for storing and retrieving digital data by means of rapidly rotating disks coated with magnetic material. An HDD is non-volatile, i.e. it retains its data even after power is switched off. Data is accessed in a random-access manner, which means that individual blocks of data can be stored or retrieved in any order. An HDD consists of one or more rigid, rapidly rotating disks with magnetic heads arranged on a moving actuator arm to read and write data on the surfaces [12].

Solid-state drives offer a number of benefits over conventional hard drives, such as:

1.) SSDs are more durable: SSDs have a non-mechanical design, with NAND flash mounted on circuit boards, and are shock resistant. Hard disks consist of various moving parts, which makes them susceptible to shock and damage.

2.) SSDs are faster: SSDs deliver higher throughput, near-instant data access, faster start-up, quicker file transfers, and generally faster overall computing speed than hard disks. An HDD can only access data faster the closer it is to the read/write heads, whereas all parts of an SSD can be accessed at the same speed.

3.) SSDs consume less power: SSDs use considerably less power at peak load than hard drives. Their energy efficiency can make systems more cost effective and deliver longer battery life, less power strain on the system, and a cooler computing environment. (Figure 3.4)

Summary of Chapters

Chapter 1 – INTRODUCTION: Outlines the challenges of big data processing in the digital age and introduces Hadoop and MapReduce as essential architectural solutions.

Chapter 2 – LITERATURE REVIEW: Reviews the advantages of Hadoop, characteristics of big data, project architecture, and various interconnect technologies including Ethernet and InfiniBand.

Chapter 3 – Experimental Testbed System: Details the hardware and software specifications of the 4-node Quanta server stack used for experimental evaluation.

Chapter 4 – Installation, Designing and Implementation of System: Provides a comprehensive guide for setting up LAN, NFS, NIS, and Hadoop on the experimental cluster.

Chapter 5 – Benchmarking: Describes the methodology for disk benchmarking using tools like ATTO, HD Tune, and HiBench to evaluate storage device performance.

Chapter 6 – Performance Evaluation of SSD and HDD on Hadoop using 10GigE: Analyzes the execution performance of Sort, Word Count, and TeraSort workloads on SSD and HDD using 10GigE.

Chapter 7 – Performance Evaluation of SSD and HDD on Hadoop using IPoIB: Presents the performance results of various Hadoop workloads using the IPoIB interconnect technology.

Chapter 8 – Performance Evaluation of SSD and HDD on Hadoop by RDMA-IB: Investigates the performance improvements of Hadoop workloads when utilizing the RDMA-IB interconnect.

Chapter 9 – Performance Comparison between 10GigE and IPoIB: Compares the results obtained from 10GigE and IPoIB to demonstrate performance gains in map and reduce phases.

Chapter 10 – Performance Comparison between IPoIB and RDMA-IB: Evaluates the performance differences between IPoIB and RDMA-IB across different workloads and storage types.

Chapter 11 – Performance Comparison between 10GigE and RDMA-IB: Conducts a comparative performance analysis of the traditional 10GigE against the high-performance RDMA-IB interconnect.

Chapter 12 – Overall comparison of 10GigE, IPoIB and RDMA-IB: Synthesizes all performance data into a comprehensive comparison to identify the most effective storage and interconnect combinations.

Chapter 13 – Conclusion: Summarizes research findings, stating that modern interconnects significantly outperform traditional ones for big data tasks.

Chapter 14 – Future Scope: Suggests future research directions, including the implementation of dynamic shared memory models using InfiniBand.
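
The comparison chapters (9 to 11) summarized above report gains in map and reduce phase times between interconnects. One common way to express such a gain is the relative reduction in execution time; the sketch below assumes that definition, which is not confirmed by this excerpt, and uses made-up timings rather than measured results. The function name percent_improvement is hypothetical.

```python
def percent_improvement(baseline_seconds, improved_seconds):
    """Relative reduction in execution time, as a percentage of the baseline.

    Assumed definition for illustration; positive values mean the
    second configuration finished faster than the baseline.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (baseline_seconds - improved_seconds) / baseline_seconds


# Made-up example timings, NOT results from the dissertation:
# a phase that takes 200 s on the baseline and 150 s after an upgrade.
example_gain = percent_improvement(200.0, 150.0)  # 25.0
```

Expressing each SSD and HDD result relative to the same baseline interconnect keeps the figures for different workloads comparable.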

Keywords

Big Data, Hadoop, MapReduce, SSD, HDD, HiBench, 10GigE, InfiniBand, IPoIB, RDMA-IB, Performance Evaluation, Benchmarking, Storage Systems, Cluster Computing, Latency

Frequently Asked Questions

What is the core focus of this research paper?

The paper focuses on evaluating and characterizing the performance of Hadoop storage systems by testing Solid State Drives (SSD) and Hard Disk Drives (HDD) across various network interconnect technologies such as 10GigE, IPoIB, and RDMA-IB.

Which central topics are discussed in the work?

Key topics include big data storage, Hadoop performance optimization, benchmark suites like HiBench, hardware vs. software interconnect solutions, and cluster administration for research environments.

What is the primary goal of the dissertation?

The primary goal is to study how modern high-performance storage and network interconnects can overcome I/O bottlenecks in Hadoop clusters to improve throughput and reduce latency for data-intensive applications.

What scientific methods are utilized for this evaluation?

The author utilizes a practical experimental approach, constructing a 4-node cluster using Quanta servers and applying standardized workload benchmarks (Sort, Word Count, TeraSort) provided by the HiBench suite to collect empirical performance data.

What is the thematic structure of the main section?

The main sections cover the technical design of the Hadoop cluster, the setup of network configurations, a detailed benchmarking phase, and a systematic comparative evaluation of different interconnect technologies on workload execution times.

Which keywords best characterize this research?

Keywords include Big Data, Hadoop, MapReduce, SSD, HDD, HiBench, 10GigE, InfiniBand, IPoIB, RDMA-IB, Performance Evaluation, and Cluster Computing.

Why are SSDs considered superior to HDDs in this cluster setup?

The experimental results demonstrate that SSDs consistently provide higher throughput, lower latency, and faster completion times for MapReduce workloads compared to HDDs due to their non-mechanical nature.

How does RDMA-IB compare to 10GigE?

The research concludes that RDMA-IB significantly outperforms 10GigE, showing drastic improvements in both map and reduce phase times, making it a highly effective solution for real-time big data processing.

What recommendation does the author give for cloud environments?

The author recommends that cloud systems with high-importance, real-time requirements should upgrade to RDMA-IB, while those without strict real-time requirements can achieve significant performance boosts by upgrading to IPoIB.


Details

Title: High-Performance Persistent Storage System for BigData Analysis
Course: M.Tech CS&E
Grade: 82.00
Author: Piyush Saxena (Author)
Publication Year: 2014
Pages: 104
Catalog Number: V278725
ISBN (eBook): 9783656721611
ISBN (Book): 9783656722847
Language: English
Tags: High Speed Cloud Computing
Product Safety: GRIN Publishing GmbH

Quote paper: Piyush Saxena (Author), 2014, High-Performance Persistent Storage System for BigData Analysis, Munich, GRIN Verlag, https://www.grin.com/document/278725
