Well, after something like a year of hunting around for information about big data, I thought it was time to share some useful information that might help those curious about this "technology" (not quite the right term, but bear with me).
The image to the left should be extremely familiar to you. If not, I'll shed some light on it pretty soon.
First of all, let's define Big Data.
Big Data is (commonly) defined by three dimensions:
- Velocity
- Volume
- Variety
Velocity refers to the speed at which data is generated, Volume to the amount of data generated per (small) unit of time, and finally Variety to the different types of source data (e.g. files, audio, video, geospatial, structured, unstructured).
Hadoop (http://hadoop.apache.org/):
"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model."
So, basically, Hadoop gives you a tool to analyze your Big Data.
Now the Hadoop framework is composed of several components.
The basic ones:
- Hadoop Common (standard libraries);
- Hadoop Distributed File System (HDFS) - a distributed file system designed for high-throughput access to data;
- Hadoop MapReduce - a framework that lets you distribute the data processing load across several nodes of a cluster (see the sketch right after this list).
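To make the MapReduce idea concrete, here is the classic word-count job written against the Hadoop Java API. This is just a minimal sketch, not part of the original post: the WordCount class name and the input/output paths are illustrative, and it assumes a working Hadoop installation with the MapReduce libraries on the classpath.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each input line, emit a (word, 1) pair per word.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, you would run it with something like "hadoop jar wordcount.jar WordCount /user/me/input /user/me/output", where both paths live on HDFS (the paths here are just placeholders). The framework splits the input across the cluster, runs the mapper on each split in parallel, and the reducer receives all counts for a given word, which is exactly the "distribute the load" idea described above.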
Aside from those, there are some useful additions:
- Avro
- Cassandra
- Chukwa
- HBase
- Hive
- Mahout
- Pig
- Zookeeper
- Lucene
Some of these concepts will be described in more detail in the next posts.
This was a simple intro.
Hope it piqued your curiosity, or helped you learn something.
Stay tuned.
Thanks.
-- ====================
Other Tutorial Links
http://pinelasgarden.blogspot.pt/2012/04/en-big-data-helper-part-2-getting.html
http://pinelasgarden.blogspot.pt/2012/05/en-big-data-helper-part-3-loading-data.html
http://pinelasgarden.blogspot.pt/2012/05/en-big-data-helper-part-4-pig.html
http://pinelasgarden.blogspot.pt/2012/05/en-big-data-helper-part-5-mapreduce.html