
Thursday, 14 June 2018

IMPROVING PERFORMANCE OF DATA IN HADOOP CLUSTERS USING DYNAMIC DATA REPLICA PLACEMENT: A SURVEY

ABSTRACT:-

Big data refers to various forms of large information sets that require special computational platforms in order to be analyzed. Research on big data emerged in the 1970s but has seen an explosion of publications since 2008. The Apache Hadoop software library is a framework that allows the distributed processing of large datasets across clusters of computers using simple programming models. In this paper, we discuss the architecture of Hadoop, survey various data replica placement strategies, propose an approach for improving data replica placement, and suggest an implementation of the proposed algorithm with various MapReduce applications. The goal is to improve the performance of data in Hadoop clusters with respect to execution time and the number of nodes in the Hadoop platform.

KEYWORDS: Apache Hadoop, HDFS, MapReduce, Data Replication Placement, MapReduce applications

INTRODUCTION

Hadoop, formally Apache Hadoop, is an open-source software platform for processing large amounts of data. It supports scalable, distributed computing over large volumes of data. It provides rapid, high-performance and cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise, and it is widely used across departments and sectors today. Hadoop includes a distributed file system, HDFS, which stores and manages massive amounts of data across the machines of a cluster and handles data redundancy through replication. The primary benefit is that, since data is stored on several nodes, it can be processed in a distributed manner: each node processes the data stored on it instead of spending time moving it over the network.
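The idea that each node should process the data it already stores can be illustrated with a small scheduling sketch. This is only a toy model under stated assumptions, not Hadoop's actual scheduler: the function name `assign_tasks`, the block and node identifiers, and the "free slots" bookkeeping are all hypothetical. It prefers a node that holds a replica of the block and falls back to a remote node only when no replica holder has capacity.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each block to a node, preferring a node that already holds a replica.

    block_locations: dict of block id -> list of nodes holding a replica (hypothetical layout)
    free_slots: dict of node -> number of tasks the node can still accept
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, replicas in block_locations.items():
        # Prefer a replica holder with a free slot: no network transfer is needed.
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: slots[n])
            is_local = True
        else:
            # Fall back to any node with capacity; the block must be read remotely.
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free task slots in the cluster")
            node = max(candidates, key=lambda n: slots[n])
            is_local = False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1"], "b2": ["n2"], "b3": ["n1"]}
    slots = {"n1": 1, "n2": 1, "n3": 1}
    for block, (node, is_local) in assign_tasks(blocks, slots).items():
        print(block, node, "local" if is_local else "remote")
```

In this example the third block has to be read remotely because its only replica holder is already busy, which is exactly the situation that a better replica placement tries to make rare.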

The performance of Hadoop depends on various factors, such as the number and clock frequency of CPU cores, RAM capacity, storage throughput, data flow intensity and network bandwidth [1]. Hadoop is a popular cloud computing platform based on HDFS and MapReduce.

Hadoop Architecture:

• Hadoop Common – The Java libraries and utilities used by the other Hadoop components, including the Java files needed to start Hadoop.
• TaskTracker – A node that accepts tasks, such as map, reduce and shuffle operations, from the JobTracker.
• JobTracker – The service that runs MapReduce jobs on the cluster by handing their tasks to TaskTrackers.
• NameNode – The node where the Hadoop distributed file system (HDFS) keeps all file location information, that is, where the data is stored. Files and directories are represented on the NameNode by inodes.
• DataNode – A node that stores the actual data in HDFS. Each block replica on a DataNode is represented by two files in the local file system: one holds the data itself and the other holds the block's metadata, such as checksums. The NameNode does not call DataNodes directly; it uses its replies to their heartbeats to send them instructions.
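The heartbeat mechanism described in the last item can be sketched as a toy model. This is a minimal illustration, not HDFS code: the class `NameNode` and its methods `register`, `block_report`, `schedule_replication` and `heartbeat`, as well as the command tuples, are hypothetical names chosen for this example. The point it shows is the direction of communication: the NameNode only queues instructions, and a DataNode receives them as the reply to a heartbeat it sent itself.

```python
class NameNode:
    """Toy NameNode: tracks block locations and answers heartbeats with commands."""

    def __init__(self, replication=3):
        self.replication = replication  # target number of replicas per block
        self.block_map = {}             # block id -> set of DataNode ids holding it
        self.pending = {}               # DataNode id -> list of queued commands

    def register(self, node_id):
        """Make a DataNode known to the NameNode."""
        self.pending.setdefault(node_id, [])

    def block_report(self, node_id, blocks):
        """Record which blocks a DataNode says it stores."""
        self.register(node_id)
        for block in blocks:
            self.block_map.setdefault(block, set()).add(node_id)

    def schedule_replication(self):
        """Queue a copy command for every under-replicated block.

        Simplification: calling this twice queues duplicate commands.
        """
        for block, holders in self.block_map.items():
            missing = max(self.replication - len(holders), 0)
            targets = [n for n in sorted(self.pending) if n not in holders][:missing]
            if targets:
                source = sorted(holders)[0]
                self.pending[source].append(("replicate", block, tuple(targets)))

    def heartbeat(self, node_id):
        """Called by the DataNode; the reply carries any queued commands."""
        commands, self.pending[node_id] = self.pending.get(node_id, []), []
        return commands


if __name__ == "__main__":
    nn = NameNode(replication=2)
    for dn in ("dn1", "dn2", "dn3"):
        nn.register(dn)
    nn.block_report("dn1", ["blk_1"])
    nn.schedule_replication()
    print(nn.heartbeat("dn2"))  # nothing queued for dn2
    print(nn.heartbeat("dn1"))  # dn1 is told to copy blk_1 to another node
```

A dynamic replica placement strategy of the kind surveyed here would change how the target nodes are chosen in `schedule_replication`; the delivery of the resulting commands through heartbeat replies would stay the same.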
