HADOOP MAPREDUCE ON TRANSPORTATION NETWORK

Hadoop is a top-level Apache project that began in 2006 and is written in Java. It emphasizes discovery from a scalability perspective, as well as analysis aimed at comprehending features that would otherwise be nearly impossible to examine. Hadoop was developed by Doug Cutting as a collection of open-source projects on which Google's MapReduce programming model can be applied within a distributed system. Presently, Hadoop is used on massive amounts of data.

Hadoop MapReduce is a software framework for easily writing applications that process extremely large volumes of data (multi-terabyte data sets) in parallel, on large clusters of thousands of commodity-hardware nodes, in a reliable and fault-tolerant manner. There is no formal definition of the MapReduce model (Ricky). In terms of the Hadoop implementation, it can be thought of as a "distributed sort engine." In general, the processing flow is as follows:

- The input data is split across multiple mapper processes that execute in parallel.
- The mapper output is partitioned by key and sorted locally.
- Mapper results for the same key land on the same reducer and are consolidated there.
- Merge sorting happens at the reducer, so every key that arrives at a given reducer is sorted.

Characteristically, the storage and compute nodes are the same, meaning that the MapReduce framework and the Hadoop distributed file system run on the same set of nodes. Such a configuration makes it possible for the framework to schedule tasks effectively on the nodes where the data is already present, leading to quite high aggregate bandwidth across the cluster. The MapReduce framework consists of one master ResourceManager, a single slave NodeManager per cluster node, and an MRAppMaster per application (Ricky).

Minimally, applications specify input and output locations and supply map and reduce functions through implementations of the appropriate abstract classes and/or interfaces. These, together with other job parameters, make up the job configuration. The Hadoop job client then submits the job and its configuration to the ResourceManager, which assumes responsibility for distributing the software and configuration to the slaves, scheduling and monitoring tasks, and providing status and diagnostic information to the job client (Ricky). A minimal sketch of such a job is shown below.
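To make the processing flow and the job-submission path described above concrete, the sketch below shows a minimal MapReduce job written against the standard Hadoop Java API: a mapper that emits (key, 1) pairs, a reducer that sums them, and a driver that builds the job configuration and submits it. The class names, input/output paths, and the word-count-style counting logic are illustrative assumptions for this sketch, not the transportation-network job discussed in the report.

```java
// Minimal Hadoop MapReduce job sketch (word-count style).
// Class names, paths and the counting logic are illustrative assumptions,
// not the report's actual transportation-network job.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SampleJob {

  // Mapper: emits (token, 1) for every token in an input line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);   // partitioned by key and sorted by the framework
      }
    }
  }

  // Reducer: all values for the same key arrive together and are summed.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: builds the job configuration and submits it to the ResourceManager.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "sample-count");
    job.setJarByClass(SampleJob.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the framework itself handles the partitioning by key, the local sort, and the merge sort at the reducer; the application supplies only the map and reduce functions and the job configuration, as described above.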
How the Algorithm Works

In most real-world applications, the data that is generated is of great concern to stakeholders because it provides meaningful information or knowledge that assists in predictive analysis. This knowledge helps in adjusting certain decision parameters of the application, which changes the overall outcome of the business process. The volume of data generated by the process, collectively known as data sets, tends to be very large. The collected data sets may come from heterogeneous sources, and the data may be structured or unstructured. Processing such data can reveal useful patterns from which knowledge can be extracted (Anjan, Sharma and Kiran).

Data mining is the process of finding patterns or correlations among the fields within huge data sets, and of building up a knowledge base subject to given constraints. Its overall goal is to extract knowledge from existing data sets and convert it into a structure that humans can understand. This process is called Knowledge Discovery in Data sets, or KDD, and it has revolutionized the approach to solving complex real-world problems.

Within a distributed computing environment, there is a set of loosely coupled processing nodes connected by a network, each of which contributes to data execution or to distribution/replication (Anjan, Sharma and Kiran). These may be described as cluster nodes. Cluster frameworks set up a cluster, and a good example is Hadoop MapReduce. Other approaches involve setting up the cluster nodes on an ad-hoc basis, without being bound to a rigid framework; these methods basically consist of a set of API calls for remote method invocation (RMI) as part of inter-process communication. Good examples include the Message Passing Interface (MPI) and the MPI variant known as MPJ Express (Anjan, Sharma and Kiran). The cluster set-up method depends on the data densities and on the scenarios listed below:

- The data is generated at various locations and therefore needs to be accessed locally most of the time for processing.
- The data and the processing are spread across the machines in the cluster so as to reduce the impact of any particular machine becoming overloaded, which would harm that machine's processing (Anjan, Sharma and Kiran).

Algorithm Design and Analysis

The algorithm discussed here produces all subsets that can be created from a given item set. The subsets are then searched against the data sets and their frequencies are recorded. Since many items of data need to be searched simultaneously in order to reduce the search time, this is where the Hadoop architecture comes in: a map function is spawned for each subset of items. The maps can run on any node within the distributed environment configured under the Hadoop configuration. The distribution of the job is taken care of by the Hadoop system, and the required data set files are placed into HDFS. Within every map function, the item set is the value. The entire data set is scanned in an effort to identify entries matching the value item set, and the frequency is noted. This is given as output to the reduce function in the reducer class defined within the Hadoop core package (Anjan, Sharma and Kiran). Within the reducer function, the output of every map is collected and written to the required file together with its frequency. The algorithm, stated in natural language, is as follows (a Java sketch of the corresponding map and reduce functions appears after the list):

1. Read the file of subsets.
2. For every subset within the file, reset the count to zero and invoke the map function.
3. Read the database.
4. For every entry within the database, compare it with the subset; if they are equal, increment the count by one.
5. Write the output counts into the intermediate file.
6. Call the reduce function to aggregate the subset counts, then report the output to the file (Anjan, Sharma and Kiran).
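One possible rendering of this map and reduce pair in Hadoop's Java API is sketched below. It is a minimal illustration under stated assumptions, not the implementation of Anjan, Sharma and Kiran: each map call is assumed to receive one candidate item subset per input line, the transaction database is assumed to be available at an HDFS path passed through a hypothetical configuration key "dataset.path", and an entry is counted when it equals the subset, mirroring the simple comparison in the natural-language steps.

```java
// Hedged sketch of the subset-frequency algorithm described above.
// Assumptions (not from the report): one candidate subset per input line,
// the full database at the HDFS path given by the "dataset.path" config key,
// and a match defined as line equality.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SubsetFrequency {

  // Map: value = one candidate subset; scan the database and count matches.
  public static class SubsetMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String subset = value.toString().trim();
      int count = 0;                                   // "reset count to zero"
      Path dataset = new Path(context.getConfiguration().get("dataset.path"));
      FileSystem fs = FileSystem.get(context.getConfiguration());
      try (BufferedReader reader =
               new BufferedReader(new InputStreamReader(fs.open(dataset)))) {
        String entry;
        while ((entry = reader.readLine()) != null) {  // "for every entry in the database"
          if (entry.trim().equals(subset)) {           // "run comparison with subset"
            count++;                                   // "increment count by one"
          }
        }
      }
      context.write(new Text(subset), new IntWritable(count));
    }
  }

  // Reduce: consolidate the per-map counts for each subset and emit the total.
  public static class SubsetReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable partial : values) {
        total += partial.get();
      }
      context.write(key, new IntWritable(total));      // subset and its frequency
    }
  }
}
```

Rescanning the whole database inside every map call follows the natural-language steps literally; in practice, the data set would more commonly be distributed to the nodes (for example via the distributed cache) or the comparison restructured so that each mapper reads only its own input split.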
In an experimental set-up tested rigorously against a Hadoop pseudo-distributed configuration and against a standalone PC, with varying data intensity and transaction volumes, a fully configured multi-node Hadoop cluster whose nodes have differing system configurations could take comparatively longer to process the data than a fully configured multi-node cluster of similar nodes. The similarity here refers to the system configuration, from the computer architecture up to the operating system that runs on it.

The performance can be expressed as a lateral comparison between FHDSC and FHSSC as follows:

η = FHDSC / FHSSC, with FHSSC = log_e N,

where N represents the number of nodes installed in the cluster (Anjan, Sharma and Kiran).

Conclusion

This paper presents a novel approach for the cluster environment, applicable in scenarios that require data-intensive computation. Such a set-up provides a broad avenue for research and investigation in data mining. Given the demand for this kind of algorithm, there is clearly an urgent need to focus on and further explore clustered environments, especially for this domain.

References

Anjan, et al. "MapReduce Design and Implementation of Apriori Algorithm for Handling Voluminous Data Sets." Advanced Computing 3 (2012): 11.

Ricky, Ho. "Designing Algorithms for Map Reduce." Pragmatic Programming Techniques (2010): 6.