Pinela's Garden: [EN] - Big Data Helper - Part 5

sexta-feira, 18 de maio de 2012

[EN] - Big Data Helper - Part 5 - MapReduce

And this post will (I promise) show you how to run a simple mapreduce job.
I've mentioned mapreduce a couple of time.
But what is a mapreduce job? Well, a mapreduce job (for this I won't resort to the documentation, hope it is clear and correct) is a process that consists of two parts: the Map, and the Reduce.

The Map part, takes the data (in a way that you specify) and breaks it into individual pieces (more on this later).
The Reduce part, takes all the individual pieces and groups them into meaningful information / results.
Unclear? Well, to simplify (for those familiar with SQL), the reduce basically works like a Group By clause, and the Reducer code can do SUM, AVG, COUNT and so forth, as well as some greater complexity operations.

Specific example (Wordcount) (with another file):

Well, on the upper picture, you first have a file with some words. Then, the Mapper (Map program) breaks the file into words, and associates a 1 with the word and creates several touples.
After that, the Reducer (Reduce program) checks all the individual touples, and groups them by the word, summing all the 1s.
This for a simple WordCount MapReduce task.

To run the WordCount example (that is shipped with hadoop as a sample) just run (we will use the file discussed on http://pinelasgarden.blogspot.pt/2012/05/en-big-data-helper-part-3-loading-data.html):

step 1 and 2 (click to enlarge)

cd /usr/lib/hadoop-0.20 (to find the examples.jar)
hadoop jar hadoop-0.20.2-cdh3u0-examples.jar wordcount sapwc sap.wc.out (to run the mapreduce job)
hadoop dfs -getmerge sap.wc.out sapOUT (to get the file out)

step 3 (click to enlarge)

When checking the output file you'll find the expected output.

Remember, that the description of the work of the Mapper and Reducer are specific for the Wordcount application. With other goals/purposes, the mapper/reducer would be responsible for different outputs.

Thank you.

p.s. - to sort this output use
$ sort -rnk 2 sapOUT

All done! congrats! your first map reduce job.
Lesson 6 (I think it's 6) will show you how to write your own mapreduce functions (a basic example).

Thank you.

-- ====================

Other Tutorial Links (to remove any doubts):

http://pinelasgarden.blogspot.pt/2012/04/en-big-data-helper-part-1-concepts.html
http://pinelasgarden.blogspot.pt/2012/04/en-big-data-helper-part-2-getting.html
http://pinelasgarden.blogspot.pt/2012/05/en-big-data-helper-part-3-loading-data.html
http://pinelasgarden.blogspot.pt/2012/05/en-big-data-helper-part-4-pig.html

Pinela's Garden

sexta-feira, 18 de maio de 2012

[EN] - Big Data Helper - Part 5 - MapReduce

Sem comentários:

Enviar um comentário