Hadoop Tutorial – Part 4

January 9, 2010

In the previous part we reviewed some basic concepts related to Hadoop’s Map/Reduce. We are now ready to run a small example called Grep. Grep is one of many examples that come bundled with the distribution. It is a very simple program that receives a regular expression as input, scans a bunch of input files, finds any matching strings while counting the occurrences of each match, and finally writes the result to an output file.

The Map/Reduce operation is done in two steps:

Step 1: First a Mapper reads the input files and scans for matches against the given regular expression. For each match, a key (the matched word) and a value (count of matches = 1) pair is written to an intermediate file. Please note that multiple Mappers can run in parallel in order to scan the files more efficiently. Once the scanning is completed the Reducers combine the results. The combining operation is actually an aggregation of the matched strings. So for instance, if the key is ‘dog’ and there are 5 occurrences, the Reducer will combine all 5 instances into a single key (‘dog’) and a single value (5) pair written to another intermediate file. Please note that multiple Reducers can work in parallel in order to perform the aggregation more efficiently. The Map/Reduce framework will always make sure that each Reducer gets partitioned data (potentially fetched from many distributed nodes), meaning that all values belonging to a single key will be processed in the same place.

Step 2: Now that we have all the data aggregated there are two more tasks left to complete the processing: first we need to sort the results so that the most popular match is at the top of the list and the rest of the results follow in decreasing order. Second, we need to combine all the results into a single file. In order to sort the list we will use a common technique in Map/Reduce – we will inverse the keys and the values (done by the Mappers) and finally merge all the pairs into a single file by instructing the framework to use a single Reducer task (which sorts the pairs decreasingly into a single output file).

That’s it! You can review the above description in the following diagram:

In order to run the example you should follow the instructions as provided in the quick start guide. You should first create the input directory (don’t worry about the output directory – Grep will create it):

# mkdir input

Now we will copy a bunch of XML files as the input source for the Grep example:

# cp conf/*.xml input

We are now ready to run the example:

# bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

Some remarks about the above command line: the first parameter tells Hadoop to run the main method out of a JAR. The second argument is the actual JAR to be used. The hadoop-*-examples.jar expression resolves to a single JAR file. This is easier than writing the explicit name of the JAR file, because the file name contains the actual Hadoop version in it. If you review the JAR file’s manifest you will find that the main class is:

Main-Class: org/apache/hadoop/examples/ExampleDriver

The ExampleDriver class is just a simple dispatcher for all the sample programs that come with the Hadoop distribution. It parses the command line arguments, dispatches the relevant example according to the given keyword (in this case ‘grep’) and passes all other command line options on to the example. The ‘grep’ keyword will launch the org.apache.hadoop.examples.Grep class. We will dive into that class’s code in the next part.
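Before moving on, here is a rough, illustrative sketch of the key/value inversion trick mentioned in Step 2, written against the classic org.apache.hadoop.mapred API. This is not the actual Grep source (we will review that in the next part), and the class name is one I made up for the sketch:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Takes the (word, count) pairs produced in Step 1 and swaps them into
// (count, word) pairs, so that the framework's sort-by-key phase ends up
// ordering the matches by their number of occurrences.
public class SwapKeyValueMapper extends MapReduceBase
    implements Mapper<Text, LongWritable, LongWritable, Text> {

  public void map(Text word, LongWritable count,
      OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(count, word);   // key and value simply change places
  }
}

A single Reducer task configured with a decreasing comparator on the key then produces one output file ordered from the most popular match downwards. Hadoop even ships a generic InverseMapper (in org.apache.hadoop.mapred.lib) that performs exactly this swap.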
Back to the command line: the remaining parameters are the input folder (containing the XML files to be scanned), the output folder for the resulting list and finally the regular expression.

Tip: the regular expression shown above will result in only a single match, so you may want to change it into something more interesting to get more results. Don’t forget to delete the output directory before re-testing.

Up Next: Reviewing the Grep Source Code. Stay Tuned!

Hadoop Tutorial – Part 3

January 8, 2010

In the previous part we reviewed a basic installation of Hadoop. We will now review some basic concepts related to Hadoop’s architecture before we can dive into code.

In any Map Reduce framework there are two key elements –

A Mapper, which applies some processing function to a given key/value pair and produces a list of key/value pairs. The Map operation’s output is partitioned.

The second element is a Reducer which:

1. Fetches a key and the list of all the values that were created for that key by all the Mappers

2. Sorts the fetched key/values into buckets

3. Reduces the list by processing the key/values and outputs new key/value pairs (see the minimal sketch below)
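To make the two elements a bit more concrete, here is a rough sketch of a Mapper and a Reducer using the classic org.apache.hadoop.mapred API – the usual word count example, where the Mapper emits a (word, 1) pair for every word it sees and the Reducer sums the counts per word. The class names WordCountMapper and SumReducer are just names I made up for the sketch (in practice each public class would live in its own .java file):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: for every word in a line of text, emit the pair (word, 1).
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE);   // key = the word, value = 1
    }
  }
}

// Reducer: receives a word together with the list of all the 1s emitted
// for it by the Mappers, and outputs the pair (word, total count).
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text word, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    output.collect(word, new IntWritable(sum));
  }
}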

In Hadoop you can chain multiple Map-Reduce pairs to create a processing pipeline that performs data manipulation in incremental steps. Please note that the reducer is optional.
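As a rough sketch of how such a pipeline could be wired together, the driver below runs two jobs back to back, reusing the WordCountMapper and SumReducer from above and writing the first job’s output into an intermediate directory that feeds the second job. The PipelineDriver class and the directory names are made up for illustration; the second job deliberately leaves its Mapper and Reducer unset, so Hadoop falls back to the identity implementations:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class PipelineDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path("input");
    Path intermediate = new Path("intermediate");
    Path output = new Path("output");

    // First job: count words and write the result to an intermediate directory.
    JobConf countJob = new JobConf(PipelineDriver.class);
    countJob.setJobName("count");
    countJob.setMapperClass(WordCountMapper.class);
    countJob.setReducerClass(SumReducer.class);
    countJob.setOutputKeyClass(Text.class);
    countJob.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(countJob, input);
    FileOutputFormat.setOutputPath(countJob, intermediate);
    JobClient.runJob(countJob);   // blocks until the first job completes

    // Second job: a second pass over the intermediate data. A real pipeline
    // would configure its own Mapper/Reducer here (for example a mapper that
    // swaps keys and values in order to sort by count).
    JobConf secondJob = new JobConf(PipelineDriver.class);
    secondJob.setJobName("second-pass");
    FileInputFormat.setInputPaths(secondJob, intermediate);
    FileOutputFormat.setOutputPath(secondJob, output);
    secondJob.setNumReduceTasks(1);   // a single reducer yields a single output file
    JobClient.runJob(secondJob);
  }
}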

There is also a special intermediate reducer (called a Combiner) which combines duplicate keys produced by an individual Map task. The Combiner runs on the same node as its corresponding Map task. It is mainly used to avoid sending large lists to the Reducers by performing intermediate (local) reduce operations. The Combiner is also optional.
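For example, in the countJob sketch above the Reducer can double as the Combiner with one extra configuration line, since summing counts is associative and commutative:

// Run SumReducer locally on each map node's output before it is sent over
// the network, so a node ships (word, partial sum) pairs instead of a long
// list of (word, 1) pairs.
countJob.setCombinerClass(SumReducer.class);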

Additionally Hadoop provides a distributed block-level file system called HDFS. The main purpose of HDFS is abstracting the underlying distributed storage (based on commodity hardware clusters) from the application level. HDFS provides data durability by replicating blocks on multiple nodes as depicted in this diagram.
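To illustrate that abstraction, here is a small sketch using Hadoop’s FileSystem API: the exact same code works whether fs.default.name points at the local file system or at an HDFS cluster. The file paths and the FsDemo class name are made up for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsDemo {
  public static void main(String[] args) throws Exception {
    // Picks up whatever file system is configured in fs.default.name
    // (the local file system by default, HDFS on a real cluster).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into the configured file system; on HDFS the file's
    // blocks are replicated across data nodes behind the scenes.
    fs.copyFromLocalFile(new Path("/tmp/access.log"), new Path("/logs/access.log"));
    System.out.println("default replication factor: " + fs.getDefaultReplication());
  }
}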

The following diagram gives a very high level description of how data flows in a default Map/Reduce application. You can see that the data is pulled from the file system (either directly from the OS file system or from an HDFS file system). The Mappers perform the initial processing: mapping values to keys and storing the results in intermediate files (usually in a serialized binary format). The Reducers then fetch the key/value pairs (with the help of partitioners, so that all value sets belonging to the same key are processed by a single Reducer). Finally, the Reducers write the end result to the file system after performing aggregation, sorting or any other logical operation needed to converge to the final output.
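To make the partitioning step concrete, here is a sketch of a partitioner written against the classic org.apache.hadoop.mapred API – essentially what the default hash partitioner does. The WordPartitioner name is made up for the sketch:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes every occurrence of the same key to the same reducer by hashing the
// key, regardless of which mapper (or which node) emitted it.
public class WordPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // same key -> same hash -> same reduce partition
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}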

Up Next: Reviewing a Grep Example.

Hadoop Tutorial – Part 2

January 4, 2010

In Part 1 I reviewed the prerequisites for running Hadoop on a Linux machine.

In this part we will install Hadoop and get ready to run it for the first time!

First you will need to download a stable release of Hadoop – I selected the 0.20.1 version.

After downloading you will need to unpack it.

We now move on to some configuration tasks before we can run the basic sample that comes bundled with the distribution.

We need to make sure that Hadoop is using the preferred JVM (multiple versions may be installed on your machine). In order to do that we are required to point Hadoop’s JAVA_HOME parameter to the root location of the JVM.

If you are not sure where your JVM is located you can run:

# locate /rt.jar

Please note that Sun’s JVM (1.6.x) is the one recommended by Hadoop’s documentation. If it is not installed on your machine – please install it.

Hadoop’s JAVA_HOME variable is set in the conf/hadoop-env.sh file. After the modification, on my machine it looks like this:

export JAVA_HOME=/usr/lib/jvm/java-6-sun

Now – we are ready to test the installation by running

# bin/hadoop

You should see a usage message listing Hadoop’s command line options.

That’s it – Hadoop is installed and ready to run!

Up Next: Basic Hadoop Map/Reduce concepts.

Hadoop Tutorial – Part 1

January 4, 2010

In the prologue I have discussed the motivation behind the Hadoop Tutorial and about selecting the core environment.

In this part we will go over the prerequisites required to run Hadoop. According to the Hadoop Common Quick Start documentation the prerequisites are a 1.6.x JVM, ssh and sshd.

After setting up the Linux environment the next step is setting up the JVM. The Hadoop documentation recommends Sun’s JVM 1.6.x.

On Ubuntu you might need to change the default JVM (OpenJDK) to the Sun JDK. You can find more details on how to do that over here.

You can check that the default version is set up correctly by running the following from the console:

# java -version

ssh: In Ubuntu the OpenSSH client (command line) is installed by default. You can also install PuTTY if you want a friendlier GUI.

Finally, sshd is also required. On my Ubuntu machine the OpenSSH server was not installed by default.

From the command line:

# sudo apt-get install openssh-server

After the installation completes you can verify it by starting an ssh session:

# ssh <user>@<machine name or ip>

Please note that you will need to repeat the above instructions for each of your Hadoop nodes.

Finally – the Quick Start guide shows how to install rsync. I’m not sure if it’s part of the prerequisites – just in case it is, I made sure it is installed on my Ubuntu machine.

Up next is installing Hadoop on my machine.

Hadoop Tutorial – Prologue

December 31, 2009

In the following posts I will try to summarize as much as possible of my experience with Hadoop (mainly, but not only, its Map/Reduce part).

I am not a Hadoop expert (although I strive to gain experience along the way) and although I consider this a tutorial, it is far from being a primary source or unique in any way. You can (and should) consider reading a more robust tutorial like the one provided on Hadoop’s project site.

Having said that – I think this tutorial provides a better ground for the community to share, exchange and provide insight during the process, using the comments at the end of each part. I will try to maintain and update the tutorial as much as possible with the community feedback I get.

So what’s the motivation behind this initiative?

I am working in a research group and one of the tasks on my plate involves analyzing a bunch of HTTP server log files (more on that later). On the other end of my desk there is another research activity related to Hadoop (which always gets pushed back by more urgent issues), so I figured: why not combine these two tasks?

Don’t get me wrong – there are plenty of other useful ways to analyze log files – but I figured this is a good way to kill two birds with one stone: getting my job done and experimenting with distributed computing at the same time…

Apparently it worked for me, and now my boss approved this approach so here we go!

At a very high level – I plan to do some Map/Reduce processing on the logs and maybe some dynamic querying over HBase, but I’m not sure about that yet.

So first things first – I need an environment…

Hadoop is divided into various sub-projects – for now I’m focusing on the Hadoop Common sub-project.

According to the Hadoop Common quick start guide, GNU/Linux is the preferred environment for production, so I figured I should use it for development as well.

I guess that many flavors of Linux can play nicely with Hadoop. My plan is to run one instance on my laptop (dual boot) and more instances as virtual nodes (VMware) if I get lucky with my IT guys 🙂.

For my laptop I will use Ubuntu 8.04 (Hardy Heron).

For the VMs I picked Fedora, mainly because it supports RPMs (which may come in handy in the future in case I want to use the Cloudera distribution of Hadoop). I will use two virtual machines (VMware) for now.

Up next: Installing Hadoop’s pre-requisites

Stay tuned!

Modified Cairngorm Store Uploaded

December 2, 2006

After downloading the Cairngorm Store I realized that I needed Flex Data Services and a J2EE (web) container. Reading some of the feedback from the community convinced me to modify the original example to support Business Delegate mockups, which means you no longer need any backend!

The install size is less than 1MB and you can extract, import and run it without any additional downloads or setup. Due to security issues the file type is .doc – please rename it to .zip, extract it and read the readme.txt file.

Enjoy!

Update (22 Feb. 2008): Please check out Douglas McCarrol’s update to the Modified Cairngorm, which includes support for Flex 3!

Happy coding,

Chen Bekor.