Tuesday, May 15, 2012

[EN] - Big Data Helper - Part 4 - Pig

Before running any MapReduce jobs, let me talk a little about Apache Pig.
Apache Pig (http://pig.apache.org/) is a:
"platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs"


With Pig, you work from a CLI and write commands in Pig Latin, Pig's own language, to analyse the data. The language reference:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html

Specific example:
We are going to use a CSV file with some random data about car makers.


Now that we have the file, we are going to load it into HDFS (as we did in part 3: http://pinelasgarden.blogspot.pt/2012/05/en-big-data-helper-part-3-loading-data.html) and then import it with Pig. The next image shows just that.

After loading the file into HDFS, we load its content into a Pig variable, filter out the header row, and create a "table".
As the next screenshot shows, the "table" is indeed in HDFS, along with the base file (auto.csv).
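
If the load / filter / store flow is hard to picture without the screenshots, here is a rough sketch of the same idea in plain Python. This is not Pig and it does not touch HDFS; it only mimics, in memory, what the three steps do to the data. The sample rows, the column names (maker, country) and the function names are all made up for illustration, since the real contents of auto.csv are not shown in this post.

```python
import csv
import io

# Hypothetical stand-in for auto.csv; the real file's columns are not shown in the post.
SAMPLE_CSV = "maker,country\nFiat,Italy\nVolvo,Sweden\nToyota,Japan\n"


def load(text):
    """Rough analogue of Pig's LOAD with a comma delimiter: one tuple per line."""
    return [tuple(row) for row in csv.reader(io.StringIO(text))]


def drop_header(rows, header_first_field):
    """Rough analogue of Pig's FILTER: keep rows whose first field is not the header label."""
    return [row for row in rows if row[0] != header_first_field]


def store(rows):
    """Rough analogue of Pig's STORE: write the rows back out as comma-separated text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


raw = load(SAMPLE_CSV)            # header line is still in here
cars = drop_header(raw, "maker")  # header line removed
stored = store(cars)              # the "table" that would end up next to auto.csv
print(stored)
```

In Pig the same three steps are single statements typed at the CLI, and the stored result lands in HDFS instead of a Python string, but the shape of the pipeline is the same: read delimited lines, drop the header row, write the rest back out.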



I hope that was clear enough.

Thanks.




-- ====================

Other Tutorial Links



http://pinelasgarden.blogspot.pt/2012/04/en-big-data-helper-part-1-concepts.html
http://pinelasgarden.blogspot.pt/2012/04/en-big-data-helper-part-2-getting.html
http://pinelasgarden.blogspot.pt/2012/05/en-big-data-helper-part-3-loading-data.html
http://pinelasgarden.blogspot.pt/2012/05/en-big-data-helper-part-5-mapreduce.html
