I am trying to run a MapReduce job: I pull from MongoDB and then write to HDFS, but I cannot seem to get the job to run. I could not find an example of this, and the issue I am having is that if I set an input path for Mongo, it also looks for the output path in Mongo. Now I am getting an authentication error, even though my MongoDB instance does not have authentication enabled.
final Configuration conf = getConf();
final Job job = new Job(conf, "sort");
MongoConfig config = new MongoConfig(conf);
MongoConfigUtil.setInputFormat(getConf(), MongoInputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/trythisdir"));
MongoConfigUtil.setInputURI(conf,"mongodb://localhost:27017/fake_data.file");
//conf.set("mongo.output.uri", "mongodb://localhost:27017/fake_data.file");
job.setJarByClass(imageExtractor.class);
job.setMapperClass(imageExtractorMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass( MongoInputFormat.class );
// Execute job and return status
return job.waitForCompletion(true) ? 0 : 1;
Edit: This is the current error I am having:
Exception in thread "main" java.lang.IllegalArgumentException: Couldn't connect and authenticate to get collection
at com.mongodb.hadoop.util.MongoConfigUtil.getCollection(MongoConfigUtil.java:353)
at com.mongodb.hadoop.splitter.MongoSplitterFactory.getSplitterByStats(MongoSplitterFactory.java:71)
at com.mongodb.hadoop.splitter.MongoSplitterFactory.getSplitter(MongoSplitterFactory.java:107)
at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:56)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1079)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1096)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:177)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:995)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:948)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:948)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)
at com.orbis.image.extractor.mongo.imageExtractor.run(imageExtractor.java:103)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.orbis.image.extractor.mongo.imageExtractor.main(imageExtractor.java:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.NullPointerException
at com.mongodb.MongoURI.<init>(MongoURI.java:148)
at com.mongodb.MongoClient.<init>(MongoClient.java:268)
at com.mongodb.hadoop.util.MongoConfigUtil.getCollection(MongoConfigUtil.java:351)
... 22 more
Late answer, but it may be helpful for others. I ran into the same problem while playing with Apache Spark.
You need to correctly set mongo.input.uri and mongo.output.uri, which are used by Hadoop, and also set the input and output formats.
/* Correct input and output URI settings on Spark (Hadoop) */
conf.set("mongo.input.uri", "mongodb://localhost:27017/dbName.inputColName");
conf.set("mongo.output.uri", "mongodb://localhost:27017/dbName.outputColName");

/* Set input and output formats */
job.setInputFormatClass( MongoInputFormat.class );
job.setOutputFormatClass( MongoOutputFormat.class );
By the way, a typo in the "mongo.input.uri" or "mongo.output.uri" string causes the same error.
Replace:
MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/fake_data.file");
by:
MongoConfigUtil.setInputURI(job.getConfiguration(), "mongodb://localhost:27017/fake_data.file");
The conf object has already been 'consumed' by your job, so you need to set the URI directly on the job's configuration.
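Putting the two answers together, a minimal sketch of the driver for "pull from Mongo, write to HDFS" could look like the following. It reuses the class, mapper, and paths from the question, assumes the same imports plus org.apache.hadoop.mapreduce.lib.output.TextOutputFormat, and is only an illustration rather than tested code:

// Sketch: configure everything on the job's own Configuration.
final Configuration conf = getConf();
final Job job = new Job(conf, "sort");
final Configuration jobConf = job.getConfiguration();

// Input comes from MongoDB.
MongoConfigUtil.setInputURI(jobConf, "mongodb://localhost:27017/fake_data.file");
job.setInputFormatClass(MongoInputFormat.class);

// Output goes to a plain HDFS directory, so no mongo.output.uri is needed.
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/trythisdir"));

job.setJarByClass(imageExtractor.class);
job.setMapperClass(imageExtractorMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

return job.waitForCompletion(true) ? 0 : 1;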
You haven't shared the complete code so it's hard to tell, but what you've got there does not look consistent with typical usage of the MongoDB Connector for Hadoop.
I would suggest that you start with the examples on GitHub.
Related
I'm trying to make my Spark Streaming application read its input from an S3 directory, but I keep getting this exception after launching it with the spark-submit script:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy6.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.spark.streaming.StreamingContext.checkpoint(StreamingContext.scala:195)
at MainClass$.main(MainClass.scala:1190)
at MainClass.main(MainClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I'm setting those properties through this block of code, as suggested here http://spark.apache.org/docs/latest/ec2-scripts.html (bottom of the page):
val ssc = new org.apache.spark.streaming.StreamingContext(
  conf,
  Seconds(60))
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",args(2))
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",args(3))
args(2) and args(3) are, of course, my AWS Access Key ID and Secret Access Key.
Why does it keep saying they are not set?
EDIT: I also tried this way, but I get the same exception:
val lines = ssc.textFileStream("s3n://"+ args(2) +":"+ args(3) + "@<mybucket>/path/")
Odd. Try also doing a .set on the sparkContext. Also try exporting the environment variables before you start the application:
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>
^^this is how we do it.
UPDATE: According to @tribbloid the above broke in 1.3.0; now you have to faff around for ages and ages with hdfs-site.xml, or you can do this (and it works in a spark-shell):
val hadoopConf = sc.hadoopConfiguration;
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
The following configuration works for me; make sure you also set "fs.s3.impl":
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val hadoopConf=sc.hadoopConfiguration;
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId",myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey",mySecretKey)
On AWS EMR the above suggestions did not work. Instead, I updated the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties in conf/core-site.xml with my S3 credentials.
For those using EMR, use the Spark build described at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark and just reference S3 with the s3:// URI. There is no need to set the S3 implementation or any additional configuration, as credentials are handled by IAM roles.
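For example, a minimal sketch in Java (assuming the usual org.apache.spark imports, an EMR cluster whose IAM role grants read access, and a hypothetical bucket and prefix):

// No access keys are configured anywhere; the EMR filesystem picks up the IAM role credentials.
SparkConf sparkConf = new SparkConf().setAppName("s3-reader");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = sc.textFile("s3://my-bucket/some/prefix/"); // hypothetical bucket/prefix
System.out.println("line count: " + lines.count());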
I wanted to put the credentials more securely in a config file on one of my encrypted partitions. So I did export HADOOP_CONF_DIR=~/Private/.aws/hadoop_conf before running my spark application, and put a file in that directory (encrypted via ecryptfs) called core-site.xml containing the credentials like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>my_aws_access_key_id_here</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>my_aws_secret_access_key_here</value>
</property>
</configuration>
HADOOP_CONF_DIR can also be set in conf/spark-env.sh.
Latest EMR releases (tested on 4.6.0) require the following configuration:
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
In most cases the out-of-the-box configuration should work; this is only needed if you have S3 credentials different from the ones the cluster was launched with.
In Java, these are the relevant lines of code. You have to add the AWS credentials to the SparkContext only, not to the SparkSession.
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);
This works for me in the 1.4.1 shell:
val conf = sc.getConf
conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
conf.set("spark.hadoop.fs.s3.awsAccessKeyId", <your access key>)
conf.set("spark.hadoop.fs.s3.awsSecretAccessKey", <your secret key>)
SparkHadoopUtil.get.conf.addResource(SparkHadoopUtil.get.newConfiguration(conf))
...
sqlContext.read.parquet("s3://...")
Augmenting #nealmcb's answer, the most straightforward way to do this is to define
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
in conf/spark-env.sh or export that env variable in ~/.bashrc or ~/.bash_profile.
That will work as long as you can access s3 through hadoop. For instance, if you can run
hadoop fs -ls s3n://path/
then hadoop can see the s3 path.
If hadoop can't see the path, follow the advice in How can I access S3/S3n from a local Hadoop 2.6 installation?
I am trying to create a workflow using the Oozie dashboard provided by the Hue interface. Taking it step by step, my workflow has only one Java step. The relevant piece of code of this Java step is as follows:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class InputPathsCalculator {

    private static final Logger LOGGER = LoggerFactory.getLogger(InputPathsCalculator.class);

    public static void main(String[] args) throws IOException {
        System.out.println("sout-ing");
        LOGGER.info("putting something in the log");

        JobConf jobConf = new JobConf();
        jobConf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        Path outputPath = new Path(args[1]);
        List<Path> inputPaths = calculateInputPaths(args[0], jobConf);
        FileUtil.copy(fileSystem,
                      inputPaths.toArray(new Path[0]),
                      fileSystem,
                      outputPath,
                      false,
                      true,
                      jobConf);
    }
}
calculateInputPaths(...) is a method that has been tested in isolation and works just fine. The arguments that I pass to the method are a config file, and a String with the value /usr/myUser/outputs/.
I have two problems here:
1. I can't see anything in any of the logs: neither what I print to the console nor what I write to the logger.
2. The outputs directory exists, but I get the following stack trace:
org.apache.oozie.action.hadoop.JavaMainException: java.io.IOException: `/user/eliasg/outputs/output': specified destination directory does not exist
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:58)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:39)
at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: `/user/eliasg/outputs/output': specified destination directory does not exist
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:306)
at com.ig.hadoop.jsonextractor.InputPathsCalculator.main(InputPathsCalculator.java:37)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:55)
... 15 more
I have the feeling that, for point 2, my jobConf is missing something that would let it work with HDFS, but I don't know what. As for point 1, I am completely lost.
I was looking in the wrong place. The logs were there, but apparently I needed to look at the map task container logs instead of the application log list.
I have discovered that adding hdfs-site.xml to the jobConf is not enough. There is a property that is not set by default and needs to be set explicitly. That property is:
jobConf.set("fs.default.name", String.format("hdfs://%1$s", jobConf.get("dfs.nameservices")));
I know that I can submit a Cascading job by packaging it into a JAR, as detailed in the Cascading user guide. That job will then run on my cluster if I manually submit it using the hadoop jar CLI command.
However, with the original Hadoop 1 version of Cascading, it was possible to submit a job to the cluster by setting certain properties on the Hadoop JobConf. Setting fs.defaultFS and mapred.job.tracker caused the local Hadoop library to automatically attempt to submit the job to the Hadoop 1 JobTracker. But setting these properties does not seem to work in the newer version. Submitting to a CDH 5.2.1 Hadoop cluster using Cascading version 2.5.3 (which lists CDH5 as a supported platform) leads to an IPC exception when negotiating with the server, as detailed below.
I believe that this platform combination -- Cascading 2.5.6, Hadoop 2, CDH 5, YARN, and the MR1 API for submission -- is a supported combination based on the compatibility table (see under "Prior Releases" heading). And submitting the job using hadoop jar works fine on this same cluster. Port 8031 is open between the submitting host and the ResourceManager. An error with the same message is found in the ResourceManager logs on the server side.
I am using the cascading-hadoop2-mr1 library.
Exception in thread "main" cascading.flow.FlowException: unhandled exception
at cascading.flow.BaseFlow.complete(BaseFlow.java:894)
at WordCount.main(WordCount.java:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): Unknown rpc kind in rpc headerRPC_WRITABLE
at org.apache.hadoop.ipc.Client.call(Client.java:1411)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
at org.apache.hadoop.mapred.$Proxy11.getStagingAreaDir(Unknown Source)
at org.apache.hadoop.mapred.JobClient.getStagingAreaDir(JobClient.java:1368)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:102)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:982)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:976)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:976)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:950)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Demo code is below, which is basically identical to the WordCount sample from the Cascading user guide.
public class WordCount {

    public static void main(String[] args) {
        String inputPath = "/user/vagrant/wordcount/input";
        String outputPath = "/user/vagrant/wordcount/output";

        Scheme sourceScheme = new TextLine( new Fields( "line" ) );
        Tap source = new Hfs( sourceScheme, inputPath );
        Scheme sinkScheme = new TextDelimited( new Fields( "word", "count" ) );
        Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

        Pipe assembly = new Pipe( "wordcount" );
        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
        Function function = new RegexGenerator( new Fields( "word" ), regex );
        assembly = new Each( assembly, new Fields( "line" ), function );
        assembly = new GroupBy( assembly, new Fields( "word" ) );
        Aggregator count = new Count( new Fields( "count" ) );
        assembly = new Every( assembly, count );

        Properties properties = AppProps.appProps()
                .setName( "word-count-application" )
                .setJarClass( WordCount.class )
                .buildProperties();
        properties.put("fs.defaultFS", "hdfs://192.168.30.101");
        properties.put("mapred.job.tracker", "192.168.30.101:8032");

        FlowConnector flowConnector = new HadoopFlowConnector( properties );
        Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
        flow.complete();
    }
}
I've also tried setting a bunch of other properties to try to get it working:
mapreduce.jobtracker.address
mapreduce.framework.name
yarn.resourcemanager.address
yarn.resourcemanager.host
yarn.resourcemanager.hostname
yarn.resourcemanager.resourcetracker.address
None of these worked; they just caused the job to run in local mode (unless mapred.job.tracker was also set).
I've now resolved this problem. It comes from trying to use the older Hadoop classes that Cloudera distributes, particularly JobClient. This will happen if you use hadoop-core with the provided 2.5.0-mr1-cdh5.2.1 version, or the hadoop-client dependency with this same version number. Although this claims to be the MR1 version, and we are using the MR1 API to submit, this version actually ONLY supports submission to the Hadoop1 JobTracker, and it does not support YARN.
In order to allow submitting to YARN, you must use the hadoop-client dependency with the non-MR1 2.5.0-cdh5.2.1 version, which still supports submission of MR1 jobs to YARN.
I'm doing a large-scale HBase import using a MapReduce job I set up like so:
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
job.setMapperClass(BulkMapper.class);
job.setOutputFormatClass(HFileOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
HFileOutputFormat.configureIncrementalLoad(job, hTable); //This creates a text file that will be full of put statements, should take 10 minutes or so
boolean suc = job.waitForCompletion(true);
It uses a mapper that I wrote myself, and HFileOutputFormat.configureIncrementalLoad sets up the reducer. I've done proofs of concept with this setup before; however, when I ran it on a large dataset it died in the reducer with this error:
Error: java.io.IOException: Non-increasing Bloom keys: BLMX2014-02-03nullAdded after BLMX2014-02-03nullRemoved
at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.appendGeneralBloomfilter(StoreFile.java:934)
at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:970)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:576)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:78)
at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:43)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:645)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:405)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
I thought Hadoop was supposed to guarantee sorted input to the reducer. If so, why am I having this issue, and is there anything I can do to avoid it?
I'm deeply annoyed that this worked: the problem was in the way I was keying my map output. I replaced what I used to have for output with this:
ImmutableBytesWritable HKey = new ImmutableBytesWritable(put.getRow());
context.write(HKey, put);
Basically, the key I was using and the row key of the Put were slightly different, which caused the reducer to receive the Put statements out of order.
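In other words, the map output key has to be built from the exact row key of the Put being emitted. A minimal sketch of such a mapper (buildPut, which turns an input line into a Put, is a hypothetical stand-in for the real parsing logic):

public static class BulkMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Put put = buildPut(value); // hypothetical: build the Put from the input line
        // Key the output by the Put's own row key so the sorted shuffle order matches the Puts.
        ImmutableBytesWritable hKey = new ImmutableBytesWritable(put.getRow());
        context.write(hKey, put);
    }
}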
I have a question. I need two files as input to a MapReduce program.
@Override
public int run(String[] args) throws Exception {
    (argument skip)

    Job job1 = new Job();
    job1.setJarByClass(CFRecommenderDriver.class);
    job1.setMapperClass(CFRecommenderMapper.class);
    //job1.setReducerClass(CFRecommenderReducer.class);

    job1.setMapOutputKeyClass(Text.class);
    job1.setMapOutputValueClass(TextDoublePairWritableComparable.class);
    //job1.setOutputKeyClass(TextTwoWritableComparable.class);
    //job1.setOutputValueClass(TextDoubleTwoPairsWritableComparable.class);

    MultipleInputs.addInputPath(job1, new Path(args[0]), FileInputFormat.class);
    MultipleInputs.addInputPath(job1, new Path(args[1]), FileInputFormat.class);

    job1.setNumReduceTasks(0);

    boolean step1 = job1.waitForCompletion(true);
    if(!(step1)) return -1;
If I run the program with the following command:
hadoop jar mapreduce-0.1.jar cf /input/cf-re/data1 /input/cf-re/data2 /output/cf-r/data1
I get the following error:
2013-07-01 13:13:44.822 java[45783:1603] Unable to load realm info from SCDynamicStore
13/07/01 13:13:45 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/01 13:13:45 INFO mapred.JobClient: Cleaning up the staging area hdfs://127.0.0.1:9000/tmp/hadoop-suhyunjeon/mapred/staging/suhyunjeon/.staging/job_201306191432_0218
java.lang.RuntimeException: java.lang.InstantiationException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org.apache.hadoop.mapreduce.lib.input.MultipleInputs.getInputFormatMap(MultipleInputs.java:109)
at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:58)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1024)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1041)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:959)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:912)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:912)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
at org.ankus.hadoop.mapreduce.algorithm.cf.recommender.CFRecommenderDriver.run(CFRecommenderDriver.java:86)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.ankus.hadoop.mapreduce.algorithm.cf.recommender.CFRecommenderDriver.main(CFRecommenderDriver.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.ankus.hadoop.mapreduce.MapReduceDriver.main(MapReduceDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.InstantiationException
at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:30)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
... 29 more
I don't know exactly what the problem is. Please help me.
You cannot use the abstract class FileInputFormat directly. If your inputs are text, you can use org.apache.hadoop.mapreduce.lib.input.TextInputFormat. For example:
MultipleInputs.addInputPath(job1, new Path(args[0]), TextInputFormat.class);
MultipleInputs.addInputPath(job1, new Path(args[1]), TextInputFormat.class);
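If the two inputs also need different mappers, MultipleInputs provides an overload that additionally takes a mapper class; a sketch (the two mapper class names here are hypothetical):

MultipleInputs.addInputPath(job1, new Path(args[0]), TextInputFormat.class, FirstDataMapper.class);
MultipleInputs.addInputPath(job1, new Path(args[1]), TextInputFormat.class, SecondDataMapper.class);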
This is how you can use multiple input files in your job:
job1.setInputFormatClass(TextInputFormat.class);
FileInputFormat.setInputPaths(job1, input_1 + "," + input_2);
I think it's worth mentioning another method for adding multiple input paths, which in my opinion is the prettiest and simplest: FileInputFormat.setInputPaths(Job job, Path... inputPaths)
The Path... signature tells you that you can give any number of Path objects to this call. Example:
FileInputFormat.setInputPaths(job, new Path(args[0]), new Path(args[1]), new Path(args[2]));