Hadoop’s Programming Model

Hadoop is a Java implementation of Map-Reduce. Map-Reduce is a software architecture used to process large amounts of data, also known as “big data”, in a distributed fashion. It is based upon the idea of mapping data items into key and value pairs, which are grouped by key and then reduced into a single value per key. From a service perspective, Hadoop allows an application to map, group, and reduce data across a distributed cloud of machines, thus permitting applications to process an enormous amount of data.

A common Hadoop application is the processing of data located in web sites. For example, let’s consider an application that counts the number of occurrences of a word in web pages. In other words, if we had 10 web pages, where each uses the word “hello” twice, then we expect the result to include the key and value pair: {“hello” -> 20}. This word counting application can be easily hosted in Hadoop by using Hadoop’s map and reduce services in the following form:

1. Map: generate a map entry of word token to word occurrences for each word present in each input web page. For example, if a page includes the word “hello” only once, we generate the map entry {“hello” -> 1}.
2. Reduce: combine all map entries that Hadoop has grouped together under the same key into a single map entry, where the key is the word token and the value is the sum of all values in that group. In other words, Hadoop collects all map entries that have the same key, that is, the same word token, groups them together, and provides the application with their values. The application is then responsible for reducing all values into a single item. For example, if step one generated the entries {“hello” -> 1} and {“hello” -> 2}, then we reduce these to the single entry {“hello” -> 3}.
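
The two steps above can be sketched in plain Java, without any Hadoop classes. This is only an illustration: the class name WordCountSteps and the methods mapPage and reduce are made up for this example, and words are assumed to be separated by whitespace.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the two steps; these are NOT Hadoop classes.
final class WordCountSteps {

    // Step 1 (map): turn one page into {word -> occurrences in that page}.
    // Assumption: words are separated by whitespace.
    static Map<String, Integer> mapPage(String page) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : page.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Step 2 (reduce): collapse all values grouped under one key into their sum.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(mapPage("hello world hello"));         // {hello=2, world=1}
        System.out.println(reduce("hello", Arrays.asList(1, 2))); // 3
    }
}
```

In this sketch the grouping that Hadoop performs between the two steps is left out; reduce simply receives the values that were already collected for one word.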

The following walk-through illustrates these steps with a simple scenario:

Load each web-page as input.

web-page-1: “first web-page”

web-page-2: “and the second and final web-page”

Map each input (i.e., each page) into a collection of sequences (word, occurrences).

{(first, 1), (web-page, 1)},

{(and, 2), (the, 1), (second, 1), (final, 1), (web-page, 1)}

Group all sequences by ‘word’. Thus, the output will be collections in which all member sequences have the same ‘word’.

{(first, 1)},

{(web-page, 1), (web-page, 1)},

{(and, 2)},

{(the, 1)},

{(second, 1)},

{(final, 1)}

For each group, reduce to a single sequence by summing together all word occurrences.

{(first, 1)},

{(web-page, 2)},

{(and, 2)},

{(the, 1)},

{(second, 1)},

{(final, 1)}

Store each sequence.
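
The walk-through can be reproduced with a small in-memory simulation. The sketch below is plain Java rather than Hadoop code; the class WordCountWalkthrough and its methods are hypothetical names, and the final "store" step is replaced by returning (and printing) the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory simulation of the walk-through; not Hadoop code.
final class WordCountWalkthrough {

    // Map: one page -> (word, occurrences) sequences for that page.
    static Map<String, Integer> map(String page) {
        Map<String, Integer> pairs = new LinkedHashMap<>();
        for (String word : page.split(" ")) {
            pairs.merge(word, 1, Integer::sum);
        }
        return pairs;
    }

    // Group: collect the values of all sequences that share the same word.
    static Map<String, List<Integer>> group(List<Map<String, Integer>> mapped) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map<String, Integer> pairs : mapped) {
            for (Map.Entry<String, Integer> pair : pairs.entrySet()) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        return groups;
    }

    // Reduce: sum each group down to a single (word, total) sequence.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int value : g.getValue()) {
                sum += value;
            }
            totals.put(g.getKey(), sum);
        }
        return totals;
    }

    // Load -> map -> group -> reduce, in the same order as the walk-through.
    static Map<String, Integer> run(List<String> pages) {
        List<Map<String, Integer>> mapped = new ArrayList<>();
        for (String page : pages) {
            mapped.add(map(page));
        }
        return reduce(group(mapped));
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "first web-page",
                "and the second and final web-page");
        // prints {first=1, web-page=2, and=2, the=1, second=1, final=1}
        System.out.println(run(pages));
    }
}
```

Everything here runs in a single process; in Hadoop the same three phases are spread across many machines, and the grouping is done by the framework rather than by application code.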

We have described a word-counting Hadoop application; the next task is to implement it using Hadoop’s programming model. Hadoop provides two basic programming models. The first is a collection of Java classes centered on the Mapper and Reducer interfaces. The application extends a base class called MapReduceBase and implements the Mapper and Reducer interfaces, specifying the data types of the input and output data. The application then registers its Mapper and Reducer classes with a Hadoop job, together with the distributed locations of the input and output, and submits the job to the framework. The framework reads the data from the input location, calls back the application’s Mapper and Reducer classes as needed in a concurrent and distributed fashion, and writes the result to the output location.
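
To give a feel for this callback style without depending on the Hadoop libraries, here is a minimal single-process sketch. SimpleMapper, SimpleReducer, and MiniJob are hypothetical stand-ins invented for this example and are much simpler than Hadoop’s real Mapper, Reducer, and job classes; the point is only that the application supplies two callbacks and the "framework" decides when to call them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical, simplified stand-ins; NOT Hadoop's real interfaces.
interface SimpleMapper {
    void map(String record, BiConsumer<String, Integer> output);
}

interface SimpleReducer {
    void reduce(String key, List<Integer> values, BiConsumer<String, Integer> output);
}

// Plays the framework's role: reads the input, calls back the mapper,
// groups emitted values by key, calls back the reducer, returns the output.
final class MiniJob {
    static Map<String, Integer> run(List<String> input,
                                    SimpleMapper mapper,
                                    SimpleReducer reducer) {
        // TreeMap keeps keys sorted; Hadoop also hands keys to reducers in sorted order.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String record : input) {
            mapper.map(record,
                    (k, v) -> groups.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            reducer.reduce(e.getKey(), e.getValue(), result::put);
        }
        return result;
    }
}

// The application side: only the two callbacks are written by the developer.
final class WordCountJob {
    // Emits (word, 1) for every occurrence; the reducer does all the summing.
    static final SimpleMapper MAPPER = (page, out) -> {
        for (String word : page.split("\\s+")) {
            if (!word.isEmpty()) {
                out.accept(word, 1);
            }
        }
    };

    static final SimpleReducer REDUCER = (word, values, out) -> {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        out.accept(word, sum);
    };

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "first web-page",
                "and the second and final web-page");
        // prints {and=2, final=1, first=1, second=1, the=1, web-page=2}
        System.out.println(MiniJob.run(pages, MAPPER, REDUCER));
    }
}
```

Unlike the walk-through, this mapper emits one (word, 1) pair per occurrence instead of pre-counting within a page; the reducer’s sum comes out the same either way.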

The second option is to use a domain-specific language called Pig. Pig defines keywords such as FOREACH, GROUP, and GENERATE, which fit naturally into the map, group, and reduce actions. Using Pig, a developer can write a Hadoop application in just a few lines of code, almost as if writing a SQL query, although Pig is more imperative, whereas SQL is declarative. For example:

-- count_word_occurrences and sum_word_occurrences stand for user-defined functions.
map_result = FOREACH webpage GENERATE FLATTEN(count_word_occurrences(*));
key_groups = GROUP map_result BY $0;
word_counts = FOREACH key_groups GENERATE sum_word_occurrences(*);

Hadoop is configured through XML configuration files. Much of Hadoop’s work consists of distributing jobs across machines that share a distributed file system; hence a large part of Hadoop’s configuration is concerned with the servers in the processing cloud.

Hadoop is an excellent example of a newly created application framework targeted at the emerging problem presented by the web, where we need to deal with mammoth amounts of data in a soft real-time fashion. As new problems arise, they will be accompanied by new solutions, some of which will certainly take the form of new development platforms, as Hadoop does.

This entry was posted on Sunday, December 12th, 2010 at 6:48 pm and is filed under Hadoop, Java.

2 Responses to Hadoop’s Programming Model

Christoph says:

December 13, 2010 at 12:42 pm

Hi Alex,

Thanks for this very nice article, which describes the behavior of Hadoop very well. In my daily working life, I have found that the problem Hadoop addresses comes up quite often when people start talking about event processing and creating “continuous” event streams. Unfortunately, people tend not to be very accurate in describing the behavior they need, and quite often they select an event processing platform like Esper for exactly the use case where Hadoop fits 100%.
So we might want to force people to be as accurate in the requirements specification as you’ve been in describing the actual solution.
Many thanks and kind regards,
Christoph

kiran says:

July 5, 2011 at 1:48 am

hi

I have gone through the post, and it really helped me correct my mistakes in Hadoop programming. Thanks for sharing this valuable information; it should really help other people learn Hadoop programming. Likewise, while crawling search engines I found some interesting information on the same topic.

Oracle Fusion Apps
