Combiner function in Apache Hadoop with Gora - java

I have a simple Hadoop, Nutch 2.x, HBase cluster. I have to write an MR job that will compute some statistics. It is a two-step job, i.e., I think I also need a combiner function. In plain Hadoop jobs this is not a big problem, as plenty of guides are available, e.g., this one. But I could not find any option for using a combiner with Gora. My statistics will be added to pages in HBase, which is why I cannot avoid Gora (I think). The following is the code snippet where I expect to add the combiner:
GoraMapper.initMapperJob(job, query, pageStore, Text.class, WebPage.class,
        My_Mapper.class, null, true);
job.setNumReduceTasks(1);

// === Reduce ===
DataStore<String, WebPage> hostStore = StorageUtils.createWebStore(
        job.getConfiguration(), String.class, WebPage.class);
GoraReducer.initReducerJob(job, hostStore, My_Reducer.class);

I have never used a combiner with Gora, but does the following work (or what error does it show)?
GoraReducer.initReducerJob(job, hostStore, My_Reducer.class);
job.setCombinerClass(My_Reducer.class);
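If it does work, a minimal sketch of the wiring might look like the following. MyCombiner is a hypothetical class written for this example, and a combiner's input and output types must match the map output types (Text, WebPage) declared in initMapperJob:
public static class MyCombiner extends Reducer<Text, WebPage, Text, WebPage> {
    @Override
    protected void reduce(Text key, Iterable<WebPage> values, Context context)
            throws IOException, InterruptedException {
        // Pre-aggregate the per-key statistics here so less data reaches My_Reducer.
        for (WebPage page : values) {
            context.write(key, page); // placeholder pass-through
        }
    }
}

// Registered after initMapperJob / initReducerJob:
job.setCombinerClass(MyCombiner.class);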
Edit: Created an issue at Apache's Jira about the Combiner.

Related

Partition not working in mongodb spark read in java connector

I was trying to read data using the MongoDB Spark connector, and I want to partition the dataset on a key, reading from a mongod standalone instance. I was looking at the MongoDB Spark docs, which mention various partitioner classes. I tried the MongoSamplePartitioner class, but it reads into just 1 partition. The MongoPaginateByCountPartitioner class likewise partitions into a fixed 66 partitions. This happens even when I configure "samplesPerPartition" and "numberOfPartitions" in these two cases, respectively. I need to use a ReadConfig created via a map. My code:
SparkSession sparkSession = SparkSession.builder().appName("sampleRecords")
        .config("spark.driver.memory", "2g")
        .config("spark.driver.host", "127.0.0.1")
        .master("local[4]").getOrCreate();
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("numberOfPartitions", "16");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
ReadConfig readConfig = ReadConfig.create(readOverrides);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();
System.out.println(dataset.count()); //24576
System.out.println(dataset.rdd().getNumPartitions()); //66
Using the sample partitioner returns 1 partition every time.
Am I missing something here? Please help.
PS: I am reading 24576 records; mongod version v4.0.10, Mongo Spark connector 2.3.1, Java 8.
Edit:
I got it to work; I needed to give the properties with the partitionerOptions. prefix (e.g., partitionerOptions.samplesPerPartition) in the map. But I am still facing an issue: even with partitionerOptions.samplesPerPartition: "1000", the MongoSamplePartitioner still returns only 1 partition. Any suggestions?
The number of partitions can be configured for MongoPaginateByCountPartitioner.
Supposing that we need to configure the target number of partitions to 16, add partitionerOptions.numberOfPartitions -> 16 to the properties rather than only numberOfPartitions -> 16.
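A sketch of the corrected overrides map, reusing the connection details from the question above (partitioner-specific settings must carry the partitionerOptions. prefix):
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
// The prefix tells the connector this option belongs to the chosen partitioner.
readOverrides.put("partitionerOptions.numberOfPartitions", "16");
ReadConfig readConfig = ReadConfig.create(readOverrides);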

command in hadoop source code for map time and reduce time

We know the map side has two parts, 'chunk and combine', and the reduce side has three parts, 'shuffle, sort and reduce'.
In the Hadoop source code, what is the command for obtaining the time taken by each of these parts?
There is a JobTracker API for submitting and tracking MR jobs in a network environment.
Check this for more details:
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapred/JobTracker.html
TaskReport[] maps = jobtracker.getMapTaskReports("job_id");
for (TaskReport rpt : maps) {
    System.out.println(rpt.getStartTime());
    System.out.println(rpt.getFinishTime());
}
TaskReport[] reduces = jobtracker.getReduceTaskReports("job_id");
for (TaskReport rpt : reduces) {
    System.out.println(rpt.getStartTime());
    System.out.println(rpt.getFinishTime());
}
Or, if you are using Hadoop 2.x, the ResourceManager and JobHistory Server REST APIs are provided:
https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html
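As a rough sketch (the host, port and job id below are placeholders; the JSON layout is described in the HistoryServerRest page linked above), per-task start and finish times can be fetched from the History Server like this:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class TaskTimesFetcher {
    public static void main(String[] args) throws Exception {
        // Placeholder host and job id; 19888 is the default History Server web port.
        String jobId = "job_1400000000000_0001";
        URL url = new URL("http://historyserver-host:19888/ws/v1/history/mapreduce/jobs/"
                + jobId + "/tasks");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            // Each task entry carries "type" (MAP or REDUCE), "startTime", "finishTime"
            // and "elapsedTime"; parse the JSON with whatever library you prefer.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}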

Compress map output result exception in hadoop program

In a Hadoop program I tried to compress the map output, so I wrote the following code:
conf.setBoolean("mapred.compress.map.output",true);
conf.setClass("mapred.map.output.compression.codec",GzipCodec.class,CompressionCodec.class);
When I ran it, I got the exception below. Does anybody know the reason?
WARN mapred.LocalJobRunner: job_local1149103367_0001
java.io.IOException: not a gzip file
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.processBasicHeader(BuiltInGzipDecompressor.java:495)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeHeaderState(BuiltInGzipDecompressor.java:256)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:185)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:72)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
at org.apache.hadoop.mapred.IFile$Reader.positionToNextRecord(IFile.java:400)
at org.apache.hadoop.mapred.IFile$Reader.nextRawKey(IFile.java:425)
at org.apache.hadoop.mapred.Merger$Segment.nextRawKey(Merger.java:323)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:613)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:558)
at org.apache.hadoop.mapred.Merger.merge(Merger.java:70)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:385)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)
Today I tested it again. I found that if I put the two lines before the Job object is created,
Job job = new Job(conf, "MyCounter");
the error happens, but if I put them after it, no error occurs. Why does this happen?
Are you using MRv1 or MRv2? If you are using MRv2, then use the following job config.
config.setBoolean("mapreduce.output.fileoutputformat.compress", true);
config.setClass("mapreduce.output.fileoutputformat.compress.codec",GzipCodec.class,CompressionCodec.class);
Additionally, you can set
config.set("mapreduce.output.fileoutputformat.compress.type", CompressionType.NONE.toString());
NONE, RECORD, and BLOCK are the three compression types.
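Note that the two properties above control the final job output; for the intermediate map output that the question started with, the MRv2 names are mapreduce.map.output.compress and mapreduce.map.output.compress.codec. A sketch, with the settings made before the Job is constructed (Job copies the Configuration at construction time, which is why setting them afterwards has no effect):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Compress intermediate map output (MRv2 equivalents of mapred.compress.map.output
// and mapred.map.output.compression.codec):
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class, CompressionCodec.class);
// Create the Job only after the compression settings are in place.
Job job = Job.getInstance(conf, "MyCounter");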

Attempting an hbase bulkloading job, the reducer uses a bloom filter that complains about unordered input

I'm doing a large-scale HBase import using a map-reduce job that I set up like so:
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
job.setMapperClass(BulkMapper.class);
job.setOutputFormatClass(HFileOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
HFileOutputFormat.configureIncrementalLoad(job, hTable); //This creates a text file that will be full of put statements, should take 10 minutes or so
boolean suc = job.waitForCompletion(true);
It uses a mapper that I wrote myself, and HFileOutputFormat.configureIncrementalLoad sets up the reducer. I've done proofs of concept with this setup before; however, when I ran it on a large dataset, it died in the reducer with this error:
Error: java.io.IOException: Non-increasing Bloom keys: BLMX2014-02-03nullAdded after BLMX2014-02-03nullRemoved
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.appendGeneralBloomfilter(StoreFile.java:934)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:970)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:576)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
    at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:78)
    at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:43)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:645)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:405)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
I thought Hadoop was supposed to guarantee sorted input to the reducer. If so, why am I having this issue, and is there anything I can do to avoid it?
I'm deeply annoyed that this worked: the problem was in the way I was keying my map output. I replaced what I used to have for the output with this:
ImmutableBytesWritable HKey = new ImmutableBytesWritable(put.getRow());
context.write(HKey, put);
Basically, the key I was emitting and the Put's own row key were slightly different, which caused the reducer to receive the Put statements out of order.
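For illustration, a hedged sketch of how the mapper's map() method might look with that keying; the input format and the buildPut helper are assumptions made for this example, not the original code:
public static class BulkMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Put put = buildPut(line);
        // Key the map output with the Put's own row key so the shuffle sort order
        // matches the row order that HFileOutputFormat's reducer expects.
        context.write(new ImmutableBytesWritable(put.getRow()), put);
    }

    // Hypothetical parser assuming tab-separated "rowkey<TAB>family:qualifier<TAB>value" lines.
    private static Put buildPut(Text line) {
        String[] parts = line.toString().split("\t");
        String[] fq = parts[1].split(":");
        Put put = new Put(Bytes.toBytes(parts[0]));
        put.add(Bytes.toBytes(fq[0]), Bytes.toBytes(fq[1]), Bytes.toBytes(parts[2])); // addColumn in newer HBase
        return put;
    }
}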

With Hadoop, how to change the number of mappers for a given job?

So, I have two jobs, Job A and Job B. For Job A, I would like to have a maximum of 6 mappers per node. However, Job B is a little different. For Job B, I can only run one mapper per node. The reason for this isn't important -- let's just say this requirement is non-negotiable. I would like to tell Hadoop, "For Job A, schedule a maximum of 6 mappers per node. But for Job B, schedule a maximum of 1 mapper per node." Is this possible at all?
The only solution I can think of is:
1) Have two folders off the main hadoop folder, conf.JobA and conf.JobB. Each folder has its own copy of mapred-site.xml. conf.JobA/mapred-site.xml has a value of 6 for mapred.tasktracker.map.tasks.maximum. conf.JobB/mapred-site.xml has a value of 1 for mapred.tasktracker.map.tasks.maximum.
2) Before I run Job A :
2a) Shut down my tasktrackers
2b) Copy conf.JobA/mapred-site.xml into Hadoop's conf folder, replacing the mapred-site.xml that was already in there
2c) Restart my tasktrackers
2d) Wait for the tasktrackers to finish starting
3) Run Job A
and then do a similar thing when I need to run Job B.
I really don't like this solution; it seems kludgey and failure-prone. Is there a better way to do what I need to do?
In the Java code of your custom jar itself, you could set the configuration property mapred.tasktracker.map.tasks.maximum for each of your jobs.
Do something like this:
Configuration conf = getConf();
// maximum number of concurrent map tasks per tasktracker
conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
Job job = new Job(conf);
job.setJarByClass(MyMapRed.class);
job.setJobName(JOB_NAME);
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(MapJob.class);
job.setMapOutputKeyClass(Text.class);
job.setReducerClass(ReduceJob.class);
job.setMapOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, args[0]);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
EDIT:
You also need to set the property mapred.map.tasks to the value derived from the following formula: (mapred.tasktracker.map.tasks.maximum * number of tasktracker nodes in your cluster).
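For example, a sketch under the assumption of a hypothetical 10-node tasktracker cluster and Job A's limit of 6 mappers per node:
int maxMapsPerNode = 6;       // Job A's per-node limit from the question
int taskTrackerNodes = 10;    // hypothetical cluster size
conf.setInt("mapred.tasktracker.map.tasks.maximum", maxMapsPerNode);
// Formula from the edit above: per-node maximum * number of tasktracker nodes.
conf.setInt("mapred.map.tasks", maxMapsPerNode * taskTrackerNodes);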
