Getting Started with Hadoop

发布于 2015-09-14 15:14:20 | 163 次阅读 | 评论: 0 | 来源: 网络整理

MongoDB and Hadoop are a powerful combination and can be used together to deliver complex analytics and data processing for data stored in MongoDB. The following guide shows how you can start working with the MongoDB-Hadoop adapter. Once you become familiar with the adapter, you can use it to pull your MongoDB data into Hadoop Map-Reduce jobs, process the data and return results back to a MongoDB collection.

Prerequisites¶

Hadoop¶

In order to use the following guide, you should already have Hadoop up and running. This can range from a deployed cluster containing multiple nodes or a single node pseudo-distributed Hadoop installation running locally. As long as you are able to run any of the examples on your Hadoop installation, you should be all set. The following versions of Hadoop are currently supported:

0.20/0.20.x
1.0/1.0.x
0.21/0.21.x
CDH3
CDH4

MongoDB¶

The latest version of MongoDB should be installed and running. In addition, the MongoDB commands should be in your $PATH.

Miscellaneous¶

In addition to Hadoop, you should also have git and JDK 1.6 installed.

Building MongoDB Adapter¶

The MongoDB-Hadoop adapter source is available on github. First, clone the repository and get the release-1.0 branch:

git clone https://github.com/mongodb/mongo-hadoop.git
git checkout release-1.0

Now, edit build.sbt and update the build target in hadoopRelease in ThisBuild. In this example, we’re using the CDH3 Hadoop distribution from Cloudera so I’ll set it as follows:

hadoopRelease in ThisBuild := "cdh3"

To build the adapter, use the self-bootstrapping version of sbt that ships with the MongoDB-Hadoop adapter:

./sbt package

Once the adapter is built, you will need to copy it and the latest stable version of the MongoDB Java driver to your $HADOOP_HOME/lib directory. For example, if you have Hadoop installed in /usr/lib/hadoop:

wget --no-check-certificate https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar
cp mongo-2.7.3.jar /usr/lib/hadoop/lib/
cp core/target/mongo-hadoop-core_cdh3u3-1.0.0.jar /usr/lib/hadoop/lib/

Examples¶

Load Sample Data¶

The MongoDB-Hadoop adapter ships with a few examples of how to use the adapter in your own setup. In this guide, we’ll focus on the UFO Sightings and Treasury Yield examples. To get started, first load the sample data for these examples:

./sbt load-sample-data

To confirm that the sample data was loaded, start the mongo client and look for the mongo_hadoop database and be sure that it contains the ufo_sightings.in and yield_historical.in collections:

$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> show dbs
mongo_hadoop    0.453125GB
> use mongo_hadoop
switched to db mongo_hadoop
> show collections
system.indexes
ufo_sightings.in
yield_historical.in

Treasury Yield¶

To build the Treasury Yield example, we’ll need to first edit one of the configuration files uses by the example code :

emacs examples/treasury_yield/src/main/resources/mongo-treasury_yield.xml

and set the MongoDB location for the input (mongo.input.uri) and output (mongo.output.uri ) collections (in this example, Hadoop is running on a single node alongside MongoDB):

...
  <property>
    <!-- If you are reading from mongo, the URI -->
    <name>mongo.input.uri</name>
    <value>mongodb://127.0.0.1/mongo_hadoop.yield_historical.in</value>
  </property>
  <property>
    <!-- If you are writing to mongo, the URI -->
    <name>mongo.output.uri</name>
    <value>mongodb://127.0.0.1/mongo_hadoop.yield_historical.out</value>
  </property>
...

Next, edit the main class that we’ll use for our MapReduce job (TreasuryYieldXMLConfig.java):

emacs examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfig.java

and update the class definition as follows:

...
public class TreasuryYieldXMLConfig extends MongoTool {

    static{
        // Load the XML config defined in hadoop-local.xml
        // Configuration.addDefaultResource( "hadoop-local.xml" );
        Configuration.addDefaultResource( "mongo-defaults.xml" );
        Configuration.addDefaultResource( "mongo-treasury_yield.xml" );
    }

    public static void main( final String[] pArgs ) throws Exception{
        System.exit( ToolRunner.run( new TreasuryYieldXMLConfig(), pArgs ) );
    }
}
...

Now let’s build the Treasury Yield example:

./sbt treasury-example/package

Once the example is done building we can submit our MapReduce job:

hadoop jar examples/treasury_yield/target/treasury-example_cdh3u3-1.0.0.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig

This job should only take a few moments as it’s a relatively small amount of data. Now check the output collection data in MongoDB to confirm that the MapReduce job was successful:

$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> use mongo_hadoop
switched to db mongo_hadoop
> db.yield_historical.out.find()
{ "_id" : 1990, "value" : 8.552400000000002 }
{ "_id" : 1991, "value" : 7.8623600000000025 }
{ "_id" : 1992, "value" : 7.008844621513946 }
{ "_id" : 1993, "value" : 5.866279999999999 }
{ "_id" : 1994, "value" : 7.085180722891565 }
{ "_id" : 1995, "value" : 6.573920000000002 }
{ "_id" : 1996, "value" : 6.443531746031742 }
{ "_id" : 1997, "value" : 6.353959999999992 }
{ "_id" : 1998, "value" : 5.262879999999994 }
{ "_id" : 1999, "value" : 5.646135458167332 }
{ "_id" : 2000, "value" : 6.030278884462145 }
{ "_id" : 2001, "value" : 5.02068548387097 }
{ "_id" : 2002, "value" : 4.61308 }
{ "_id" : 2003, "value" : 4.013879999999999 }
{ "_id" : 2004, "value" : 4.271320000000004 }
{ "_id" : 2005, "value" : 4.288880000000001 }
{ "_id" : 2006, "value" : 4.7949999999999955 }
{ "_id" : 2007, "value" : 4.634661354581674 }
{ "_id" : 2008, "value" : 3.6642629482071714 }
{ "_id" : 2009, "value" : 3.2641200000000037 }
has more
>

UFO Sightings¶

This will follow much of the same process as with the Treasury Yield example with one extra step; we’ll need to add an entry into the build file to compile this example. First, open the file for editing:

emacs project/MongoHadoopBuild.scala

Next, add the following lines starting at line 72 in the build file:

...
  lazy val ufoExample = Project( id = "ufo-sightings",
                                base = file("examples/ufo_sightings"),
                                settings = exampleSettings ) dependsOn ( core )
...

Now edit the UFO Sightings config file:

emacs examples/ufo_sightings/src/main/resources/mongo-ufo_sightings.xml

and update the mongo.input.uri and mongo.output.uri properties:

...
  <property>
    <!-- If you are reading from mongo, the URI -->
    <name>mongo.input.uri</name>
    <value>mongodb://127.0.0.1/mongo_hadoop.ufo_sightings.in</value>
  </property>
  <property>
    <!-- If you are writing to mongo, the URI -->
    <name>mongo.output.uri</name>
    <value>mongodb://127.0.0.1/mongo_hadoop.ufo_sightings.out</value>
  </property>
...

Next edit the main class for the MapReduce job in UfoSightingsXMLConfig.java to use the configuration file:

emacs examples/ufo_sightings/src/main/java/com/mongodb/hadoop/examples/ufos/UfoSightingsXMLConfig.java

...
public class UfoSightingsXMLConfig extends MongoTool {

    static{
        // Load the XML config defined in hadoop-local.xml
        // Configuration.addDefaultResource( "hadoop-local.xml" );
        Configuration.addDefaultResource( "mongo-defaults.xml" );
        Configuration.addDefaultResource( "mongo-ufo_sightings.xml" );
    }

    public static void main( final String[] pArgs ) throws Exception{
        System.exit( ToolRunner.run( new UfoSightingsXMLConfig(), pArgs ) );
    }
}
...

Now build the UFO Sightings example:

./sbt ufo-sightings/package

Once the example is built, execute the MapReduce job:

hadoop jar examples/ufo_sightings/target/ufo-sightings_cdh3u3-1.0.0.jar com.mongodb.hadoop.examples.UfoSightingsXMLConfig

This MapReduce job will take just a bit longer than the Treasury Yield example. Once it’s complete, check the output collection in MongoDB to see that the job was successful:

$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> use mongo_hadoop
switched to db mongo_hadoop
> db.ufo_sightings.out.find().count()
21850

Prerequisites¶

Hadoop¶

MongoDB¶

Miscellaneous¶

Building MongoDB Adapter¶

Examples¶

Load Sample Data¶

Treasury Yield¶

UFO Sightings¶

后端技术

前端技术

数据库

热门框架

常用IDE

其他