Hadoop and MapReduce now handle very large volumes of data and are becoming ubiquitous for big data storage and processing. This makes it essential to evaluate and characterize the Hadoop file system and its deployment through extensive benchmarking. Other widely available benchmarking tools can analyze the performance of a Hadoop system, but they are designed either to run on a single-node system or to assess the attached storage device and its basic characteristics, such as top speed and other hardware-related or manufacturer details. For this work, the tool used is HiBench, a comprehensive benchmark suite for Hadoop that consists of a complete set of Hadoop applications, including micro-benchmarks and real-world applications, for benchmarking the performance of Hadoop on the available type of storage device (i.e., HDD and SSD) and machine configuration. This helps to optimize performance and to address the limitations of the Hadoop system.
In this research work, we analyze and characterize the performance of the external sorting algorithm in Hadoop (MapReduce) with SSDs and HDDs connected through various interconnect technologies such as 10GigE, IPoIB, and RDMA-IB. In addition, we demonstrate that traditional servers and older cloud systems can be upgraded through software and hardware changes, using modern storage devices and interconnect networking systems, to perform on par with modern technologies and handle these loads without excessive spending on upgrades or a complete replacement of the system. This in turn reduces power consumption drastically and allows large-scale servers to run more smoothly, with low latency and high throughput, so that the full power of the processors is available for the big data flowing through the network.
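As a rough illustration of what "external sorting" means here, the following minimal Python sketch sorts data in two phases: sorted runs are spilled to temporary files, then merged in a streaming k-way merge. It is not the dissertation's implementation or Hadoop's; the function name `external_sort`, the `chunk_size` parameter, and the one-integer-per-line run format are assumptions made only for this example.

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Sort integers while holding at most chunk_size of them per run.

    Hypothetical helper for illustration only.
    Phase 1: sort fixed-size chunks in memory and spill each to a file.
    Phase 2: k-way merge of the sorted run files.
    """
    run_paths = []

    def spill(run):
        # Write one sorted run to disk, one value per line.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            for v in sorted(run):
                f.write(f"{v}\n")
        run_paths.append(path)

    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    def read_run(path):
        # Stream a run back from disk lazily.
        with open(path) as f:
            for line in f:
                yield int(line)

    try:
        # heapq.merge lazily merges already-sorted iterators.
        # The result is collected into a list only to keep the demo short;
        # a real external sort would stream it to an output file instead.
        return list(heapq.merge(*(read_run(p) for p in run_paths)))
    finally:
        for p in run_paths:
            os.remove(p)
```

Because both phases are dominated by sequential writes and reads of the run files, the speed of the storage device has a direct effect on this kind of workload, which is why it is a natural candidate for comparing SSDs and HDDs.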
Table of Contents
- Chapter 1 - INTRODUCTION
- Chapter 2 - LITERATURE REVIEW
- 2.1 Advantages of using Hadoop
- 2.2 Big Data
- 2.3 Project Architecture
- 2.4 Goal: HDFS-HMFGR
- 2.5 INTERCONNECT TECHNOLOGIES (Hardware Solution)
- 2.5.1 Traditional Interconnect 10GigE Network
- 2.5.2 Infiniband Technology
- 2.5.2.1 IPOIB Interconnect Technology
- 2.5.2.2 RDMA-IB Interconnect Technology
- 2.6 Memory Allocation with MemCached (Software Solution)
- Chapter 3 - Experimental Testbed System
- 3.1 An Insight into SSD and HDD
- Chapter 4 - Installation, Designing and Implementation of System
- 4.1 Set a LAN and System service of Network
- 4.2 Hadoop Installation Guide
- 4.2.1 Steps On each Machine
- 4.2.1.1 Install prerequisites
- 4.2.1.2 Adding a dedicated Hadoop system user
- 4.2.1.3 Setup hostname
- 4.2.2 Steps On Master
- 4.2.2.1 Install Hadoop
- 4.2.2.2 Configuration ssh
- 4.2.2.3 Install NFS
- 4.2.2.4 Configuration
- 4.2.3 Run Hadoop
- 4.2.3.1 Run HDFS
- 4.2.3.2 Run MapReduce Job
- 4.2.3.3 Run All Daemon
- 4.2.3.4 Hadoop Web Interfaces
- 4.3 Enable Ethernet and InfiniBand (SR-IOV)
- 4.3.1 Enable Ethernet SR-IOV
- 4.3.2 Enable InfiniBand SR-IOV
- 4.4 Steps to setup IPOIB on Quanta machines
- 4.5 Install MemCached on CentOS
- Chapter 5 - Benchmarking
- 5.1 ATTO Disk Benchmarking
- 5.2 HD Tune Pro
- 5.3 Linux Disk Utilities
- 5.4 HiBench (Hadoop Benchmarking Suite)
- 5.4.1 Micro-Benchmarks
- 5.4.1.1 Sort
- 5.4.1.2 Word Count
- 5.4.1.3 TeraSort
- 5.4.1.4 Enhanced DFSIO
- 5.4.2 Web Search
- 5.4.2.1 Nutch Indexing
- 5.4.2.2 Page Ranking
- 5.4.3 Machine Learning
- 5.4.3.1 Bayesian Classification
- 5.4.3.2 K-means clustering
- 5.4.4 Analytical Query
- 5.4.4.1 Hive Join
- 5.4.4.2 Hive Aggregation
- Chapter 6 - Performance Evaluation of SSD and HDD on Hadoop using 10 GigE
- 6.1 Sort Work Load
- 6.2 Word Count Work Load
- 6.3 Tera Sort Work Load
- Chapter 7 – Performance Evaluation of SSD and HDD on Hadoop using IPOIB
Objectives and Key Themes
The dissertation aims to analyze and characterize the performance of external sorting algorithms within the Hadoop (MapReduce) framework using Solid State Drives (SSDs) and Hard Disk Drives (HDDs) connected through various interconnect technologies. The research explores the potential for upgrading traditional servers and cloud systems to handle big data workloads effectively by leveraging modern storage devices and networking systems. The dissertation seeks to demonstrate that such upgrades can enhance performance without significant costs, reducing power consumption and latency while improving throughput. Key themes include:
- Performance evaluation of Hadoop with SSD and HDD
- Impact of interconnect technologies on Hadoop performance
- Optimization of Hadoop system using modern storage devices and networking systems
- Cost-effective performance enhancement for traditional servers and cloud systems
- Reducing power consumption and latency while increasing throughput
Chapter Summaries
- Chapter 1: INTRODUCTION
- Chapter 2: LITERATURE REVIEW
This chapter presents a comprehensive overview of Hadoop's advantages, the concept of big data, and the project's architecture. It also covers the interconnect technologies 10GigE, IPOIB, and RDMA-IB, as well as the role of MemCached for memory allocation.
- Chapter 3: Experimental Testbed System
Chapter 3 provides an in-depth analysis of Solid State Drives (SSDs) and Hard Disk Drives (HDDs) as storage devices.
- Chapter 4: Installation, Designing and Implementation of System
This chapter describes the detailed process of installing and configuring the Hadoop system, including network setup, Hadoop installation steps on individual machines and the master node, and enabling Ethernet and InfiniBand connectivity. It also covers the setup of IPOIB on Quanta machines and MemCached installation on CentOS.
- Chapter 5: Benchmarking
This chapter outlines the methodology employed for benchmarking the system, using tools such as ATTO Disk Benchmarking, HD Tune Pro, Linux Disk Utilities, and HiBench (Hadoop Benchmarking Suite). It describes the micro-benchmarks, web search applications, machine learning algorithms, and analytical queries used for evaluating the system's performance.
- Chapter 6: Performance Evaluation of SSD and HDD on Hadoop using 10 GigE
This chapter presents the performance evaluation of SSD and HDD in the Hadoop environment using 10GigE connectivity, analyzing the results of the Sort, Word Count, and Tera Sort workloads.
- Chapter 7: Performance Evaluation of SSD and HDD on Hadoop using IPOIB
This chapter investigates the performance of SSD and HDD in the Hadoop environment using IPOIB technology, providing insights into the performance gains achieved through this interconnect method.
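For readers unfamiliar with the workloads named in these summaries, the sketch below shows the general map, shuffle, and reduce pattern behind a word count job, written in plain Python on a single machine. It is only an illustration of the pattern, not HiBench's or Hadoop's code; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every whitespace-separated word.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all emitted values by key, as the framework would
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    # Sum the partial counts for one word.
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
```

For example, `word_count(["big data", "Big storage"])` counts "big" twice and the other two words once each. In a real cluster the shuffle step moves intermediate data across the network and through local disks, which is where the choice of storage device and interconnect becomes relevant.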
Keywords
The dissertation primarily focuses on the performance of Hadoop with SSD and HDD utilizing various interconnect technologies, including 10GigE, IPOIB, and RDMA-IB. The key concepts revolve around optimizing Hadoop systems through modern storage devices and networking solutions, aiming to enhance performance, reduce power consumption, and improve throughput. The research explores the possibility of upgrading traditional servers and cloud systems to handle big data workloads efficiently without significant cost implications.
- Quote paper
- Piyush Saxena (Author), 2014, High-Performance Persistent Storage System for BigData Analysis, Munich, GRIN Verlag, https://www.grin.com/document/278725