发布于 2015-09-14 15:14:20 | 152 次阅读 | 评论: 0 | 来源: 网络整理
MongoDB and Hadoop are a powerful combination and can be used together to deliver complex analytics and data processing for data stored in MongoDB. The following guide shows how you can start working with the MongoDB-Hadoop adapter. Once you become familiar with the adapter, you can use it to pull your MongoDB data into Hadoop Map-Reduce jobs, process the data and return results back to a MongoDB collection.
In order to use the following guide, you should already have Hadoop up and running. This can range from a deployed cluster containing multiple nodes or a single node pseudo-distributed Hadoop installation running locally. As long as you are able to run any of the examples on your Hadoop installation, you should be all set. The following versions of Hadoop are currently supported:
The latest version of MongoDB should be installed and running. In addition, the MongoDB commands should be in your $PATH.
In addition to Hadoop, you should also have git and JDK 1.6 installed.
The MongoDB-Hadoop adapter source is available on github. First, clone the repository and get the release-1.0 branch:
git clone https://github.com/mongodb/mongo-hadoop.git
git checkout release-1.0
Now, edit build.sbt and update the build target in hadoopRelease in ThisBuild. In this example, we’re using the CDH3 Hadoop distribution from Cloudera so I’ll set it as follows:
hadoopRelease in ThisBuild := "cdh3"
To build the adapter, use the self-bootstrapping version of sbt that ships with the MongoDB-Hadoop adapter:
./sbt package
Once the adapter is built, you will need to copy it and the latest stable version of the MongoDB Java driver to your $HADOOP_HOME/lib directory. For example, if you have Hadoop installed in /usr/lib/hadoop:
wget --no-check-certificate https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar
cp mongo-2.7.3.jar /usr/lib/hadoop/lib/
cp core/target/mongo-hadoop-core_cdh3u3-1.0.0.jar /usr/lib/hadoop/lib/
The MongoDB-Hadoop adapter ships with a few examples of how to use the adapter in your own setup. In this guide, we’ll focus on the UFO Sightings and Treasury Yield examples. To get started, first load the sample data for these examples:
./sbt load-sample-data
To confirm that the sample data was loaded, start the mongo client and look for the mongo_hadoop database and be sure that it contains the ufo_sightings.in and yield_historical.in collections:
$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> show dbs
mongo_hadoop 0.453125GB
> use mongo_hadoop
switched to db mongo_hadoop
> show collections
system.indexes
ufo_sightings.in
yield_historical.in
To build the Treasury Yield example, we’ll need to first edit one of the configuration files uses by the example code :
emacs examples/treasury_yield/src/main/resources/mongo-treasury_yield.xml
and set the MongoDB location for the input (mongo.input.uri) and output (mongo.output.uri ) collections (in this example, Hadoop is running on a single node alongside MongoDB):
...
<property>
<!-- If you are reading from mongo, the URI -->
<name>mongo.input.uri</name>
<value>mongodb://127.0.0.1/mongo_hadoop.yield_historical.in</value>
</property>
<property>
<!-- If you are writing to mongo, the URI -->
<name>mongo.output.uri</name>
<value>mongodb://127.0.0.1/mongo_hadoop.yield_historical.out</value>
</property>
...
Next, edit the main class that we’ll use for our MapReduce job (TreasuryYieldXMLConfig.java):
emacs examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/TreasuryYieldXMLConfig.java
and update the class definition as follows:
...
public class TreasuryYieldXMLConfig extends MongoTool {
static{
// Load the XML config defined in hadoop-local.xml
// Configuration.addDefaultResource( "hadoop-local.xml" );
Configuration.addDefaultResource( "mongo-defaults.xml" );
Configuration.addDefaultResource( "mongo-treasury_yield.xml" );
}
public static void main( final String[] pArgs ) throws Exception{
System.exit( ToolRunner.run( new TreasuryYieldXMLConfig(), pArgs ) );
}
}
...
Now let’s build the Treasury Yield example:
./sbt treasury-example/package
Once the example is done building we can submit our MapReduce job:
hadoop jar examples/treasury_yield/target/treasury-example_cdh3u3-1.0.0.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig
This job should only take a few moments as it’s a relatively small amount of data. Now check the output collection data in MongoDB to confirm that the MapReduce job was successful:
$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> use mongo_hadoop
switched to db mongo_hadoop
> db.yield_historical.out.find()
{ "_id" : 1990, "value" : 8.552400000000002 }
{ "_id" : 1991, "value" : 7.8623600000000025 }
{ "_id" : 1992, "value" : 7.008844621513946 }
{ "_id" : 1993, "value" : 5.866279999999999 }
{ "_id" : 1994, "value" : 7.085180722891565 }
{ "_id" : 1995, "value" : 6.573920000000002 }
{ "_id" : 1996, "value" : 6.443531746031742 }
{ "_id" : 1997, "value" : 6.353959999999992 }
{ "_id" : 1998, "value" : 5.262879999999994 }
{ "_id" : 1999, "value" : 5.646135458167332 }
{ "_id" : 2000, "value" : 6.030278884462145 }
{ "_id" : 2001, "value" : 5.02068548387097 }
{ "_id" : 2002, "value" : 4.61308 }
{ "_id" : 2003, "value" : 4.013879999999999 }
{ "_id" : 2004, "value" : 4.271320000000004 }
{ "_id" : 2005, "value" : 4.288880000000001 }
{ "_id" : 2006, "value" : 4.7949999999999955 }
{ "_id" : 2007, "value" : 4.634661354581674 }
{ "_id" : 2008, "value" : 3.6642629482071714 }
{ "_id" : 2009, "value" : 3.2641200000000037 }
has more
>
This will follow much of the same process as with the Treasury Yield example with one extra step; we’ll need to add an entry into the build file to compile this example. First, open the file for editing:
emacs project/MongoHadoopBuild.scala
Next, add the following lines starting at line 72 in the build file:
...
lazy val ufoExample = Project( id = "ufo-sightings",
base = file("examples/ufo_sightings"),
settings = exampleSettings ) dependsOn ( core )
...
Now edit the UFO Sightings config file:
emacs examples/ufo_sightings/src/main/resources/mongo-ufo_sightings.xml
and update the mongo.input.uri and mongo.output.uri properties:
...
<property>
<!-- If you are reading from mongo, the URI -->
<name>mongo.input.uri</name>
<value>mongodb://127.0.0.1/mongo_hadoop.ufo_sightings.in</value>
</property>
<property>
<!-- If you are writing to mongo, the URI -->
<name>mongo.output.uri</name>
<value>mongodb://127.0.0.1/mongo_hadoop.ufo_sightings.out</value>
</property>
...
Next edit the main class for the MapReduce job in UfoSightingsXMLConfig.java to use the configuration file:
emacs examples/ufo_sightings/src/main/java/com/mongodb/hadoop/examples/ufos/UfoSightingsXMLConfig.java
...
public class UfoSightingsXMLConfig extends MongoTool {
static{
// Load the XML config defined in hadoop-local.xml
// Configuration.addDefaultResource( "hadoop-local.xml" );
Configuration.addDefaultResource( "mongo-defaults.xml" );
Configuration.addDefaultResource( "mongo-ufo_sightings.xml" );
}
public static void main( final String[] pArgs ) throws Exception{
System.exit( ToolRunner.run( new UfoSightingsXMLConfig(), pArgs ) );
}
}
...
Now build the UFO Sightings example:
./sbt ufo-sightings/package
Once the example is built, execute the MapReduce job:
hadoop jar examples/ufo_sightings/target/ufo-sightings_cdh3u3-1.0.0.jar com.mongodb.hadoop.examples.UfoSightingsXMLConfig
This MapReduce job will take just a bit longer than the Treasury Yield example. Once it’s complete, check the output collection in MongoDB to see that the job was successful:
$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> use mongo_hadoop
switched to db mongo_hadoop
> db.ufo_sightings.out.find().count()
21850