In a Hadoop program I tried to compress the map output, so I wrote the following code:
conf.setBoolean("mapred.compress.map.output",true);
conf.setClass("mapred.map.output.compression.codec",GzipCodec.class,CompressionCodec.class);
When I ran it, I got the exception below. Does anybody know the reason?
WARN mapred.LocalJobRunner: job_local1149103367_0001
java.io.IOException: not a gzip file
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.processBasicHeader(BuiltInGzipDecompressor.java:495)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeHeaderState(BuiltInGzipDecompressor.java:256)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:185)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:72)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
at org.apache.hadoop.mapred.IFile$Reader.positionToNextRecord(IFile.java:400)
at org.apache.hadoop.mapred.IFile$Reader.nextRawKey(IFile.java:425)
at org.apache.hadoop.mapred.Merger$Segment.nextRawKey(Merger.java:323)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:613)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:558)
at org.apache.hadoop.mapred.Merger.merge(Merger.java:70)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:385)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)
Today I tested it again and found that if I put the two lines before the job object is created,
Job job = new Job(conf, "MyCounter");
the error happens; if I put them after that, no error occurs. Why does this happen?
Are you using MRv1 or MRv2? If you are using MRv2, then use the following job config.
config.setBoolean("mapreduce.output.fileoutputformat.compress", true);
config.setClass("mapreduce.output.fileoutputformat.compress.codec",GzipCodec.class,CompressionCodec.class);
Additionally you can set
config.set("mapreduce.output.fileoutputformat.compress.type",CompressionType.NONE.toString());
BLOCK, NONE and RECORD are the three compression types.
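For example, a minimal MRv2-style setup might look like the sketch below (the job name is taken from the question, Gzip is just one possible codec, and nothing here is a drop-in fix). Note that the properties should be set on the Configuration before the Job is created, because the Job works on a copy of the configuration taken at construction time:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressionConfigSketch {
    public static void main(String[] args) throws IOException {
        Configuration config = new Configuration();
        // MRv2 property names for compressing the job's final output.
        config.setBoolean("mapreduce.output.fileoutputformat.compress", true);
        config.setClass("mapreduce.output.fileoutputformat.compress.codec",
                GzipCodec.class, CompressionCodec.class);
        config.set("mapreduce.output.fileoutputformat.compress.type",
                CompressionType.BLOCK.toString());

        // Create the Job only after the configuration is complete; settings
        // applied to `config` after this point are not seen by the job,
        // because Job copies the configuration when it is constructed.
        Job job = Job.getInstance(config, "MyCounter");
    }
}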
I'm quite new to Java and log4j, and I have been asked to create a new log on top of the existing ones.
The situation is: batchLog takes care of BATCH, wesLog takes care of WES, both at INFO level. That's easy. My task is to create an errorLog that gathers ERROR messages from BOTH logs, and this is where it gets weird. Here are the main lines of my configuration:
log4j.rootLogger=INFO, stdout, errorLog
log4j.logger.batchLog=INFO, batchLog
log4j.logger.wesLog=INFO, wesLog
log4j.appender.stdout.Threshold=INFO
log4j.appender.wesLog.File=/opt/apache-tomcat-8.0.18/logs/ECL_WES.log
log4j.additivity.wesLog=false
log4j.appender.errorLog.File=/opt/apache-tomcat-8.0.18/logs/ECL_ERROR.log
log4j.additivity.errorLog=false
log4j.appender.batchLog.File=/opt/apache-tomcat-8.0.18/logs/ECL_BATCH.log
log4j.additivity.batchLog=false
And I ran into some issues: WES writes to wesLog (as intended), but BATCH writes to both batchLog AND wesLog (errorLog works fine).
I have been tempted to give each of the first two logs its own ERROR appender, with both writing to the same file, but I have heard that works badly.
Help would be greatly appreciated!
Alex
PS: in the main program, they keep referring to batchLog and wesLog as batchLogger and wesLogger (note the extra "ger"), and it seems to work fine. I don't get why, since to me that refers to another, non-existing, undescribed object. Any idea?
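For clarity, here is a rough, hypothetical Java sketch of how those two names would be looked up (the actual main program is not shown above, so the field names are assumptions):

import org.apache.log4j.Logger;

public class LoggerNamesSketch {
    // This name matches the "log4j.logger.batchLog" entry in log4j.properties,
    // so this logger uses the batchLog file appender.
    private static final Logger batchLog = Logger.getLogger("batchLog");

    // "batchLogger" is a different logger name; as far as I can tell it does
    // not match any "log4j.logger.*" entry above, so it inherits the root
    // logger's appenders (stdout and errorLog) instead.
    private static final Logger batchLogger = Logger.getLogger("batchLogger");

    public static void main(String[] args) {
        batchLog.info("handled by the batchLog configuration");
        batchLogger.info("handled by the root logger configuration");
    }
}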
I am using com.cloudera.crunch version: '0.3.0-3-cdh-5.2.1'.
I have a small program that reads some Avro files and filters out invalid data based on some criteria. I am using pipeline.write(PCollection, AvroFileTarget) to write the invalid-data output. It works fine in a production run.
For unit testing this piece of code, I use MemPipeline instance.
But it fails while writing the output in that case.
I get this error:
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(II[BI[BIILjava/lang/String;JZ)V
at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native Method)
at org.apache.hadoop.util.NativeCrc32.calculateChunkedSumsByteArray(NativeCrc32.java:86)
at org.apache.hadoop.util.DataChecksum.calculateChunkedSums(DataChecksum.java:428)
at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:197)
at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:163)
at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:144)
at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:78)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:50)
at java.io.DataOutputStream.writeBytes(DataOutputStream.java:276)
at com.cloudera.crunch.impl.mem.MemPipeline.write(MemPipeline.java:159)
Any idea what's wrong?
The Hadoop environment variables should be configured properly, along with hadoop.dll and winutils.exe.
Also pass the following JVM argument when executing the MR job/application:
-Djava.library.path=HADOOP_HOME/lib/native
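As a hedged complement to the above (the path below is a placeholder, not taken from the question), the same hint can also be given from inside the unit tests by setting hadoop.home.dir before any Hadoop class is loaded; hadoop.dll itself still has to be reachable through java.library.path or PATH:

import org.junit.BeforeClass;

public abstract class HadoopNativeTestBase {
    // Assumption: C:/hadoop/bin contains winutils.exe and a hadoop.dll that
    // matches the Hadoop version on the test classpath. hadoop.home.dir tells
    // Hadoop's Shell utility where to look for them.
    @BeforeClass
    public static void configureHadoopHome() {
        System.setProperty("hadoop.home.dir", "C:/hadoop");
    }
}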
I'm trying the quickstart from here: http://datafu.incubator.apache.org/docs/datafu/getting-started.html
I tried nearly everything, but I'm sure it must be my fault somewhere. I have already tried:
exporting PIG_HOME, CLASSPATH, PIG_CLASSPATH
starting Pig with -cp datafu-pig-incubating-1.3.0.jar
registering datafu-pig-incubating-1.3.0.jar locally and in HDFS => both successful (at least no error shown)
Nothing helped.
Trying this in Pig:
register datafu-pig-incubating-1.3.0.jar
DEFINE Median datafu.pig.stats.StreamingMedian();
data = load '/user/hduser/numbers.txt' using PigStorage() as (val:int);
data2 = FOREACH (GROUP data ALL) GENERATE Median(data);
or directly
data2 = FOREACH (GROUP data ALL) GENERATE datafu.pig.stats.StreamingMedian(data);
I get this name-resolution error:
2016-06-04 17:22:22,734 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1070: Could not resolve datafu.pig.stats.StreamingMedian using imports: [, java.lang., org.apache.pig.builtin.,
org.apache.pig.impl.builtin.] Details at logfile:
/home/hadoop/pig_1465053680252.log
When I look into datafu-pig-incubating-1.3.0.jar it looks OK, everything is in place. I also tried some Bag functions and got the same error.
I think it's some kind of noob error that I just don't see (I did not find specific answers for DataFu on SO or Google), so thanks in advance for shedding some light on this.
The Pig script is fine; the only thing that could break is that, while registering DataFu, some class dependencies could not be met.
Try running locally (pig -x local) and look at the detailed log.
Also check the version of Pig: it should be newer than 0.14.0.
I'm doing a large-scale HBase import using a MapReduce job I set up like so:
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
job.setMapperClass(BulkMapper.class);
job.setOutputFormatClass(HFileOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
HFileOutputFormat.configureIncrementalLoad(job, hTable); //This creates a text file that will be full of put statements, should take 10 minutes or so
boolean suc = job.waitForCompletion(true);
It uses a mapper that I wrote myself, and HFileOutputFormat.configureIncrementalLoad sets up the reducer. I've done proofs of concept with this setup before; however, when I ran it on a large dataset it died in the reducer with this error:
Error: java.io.IOException: Non-increasing Bloom keys: BLMX2014-02-03nullAdded after BLMX2014-02-03nullRemoved
at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.appendGeneralBloomfilter(StoreFile.java:934)
at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:970)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:576)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:78)
at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:43)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:645)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:405)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
I thought Hadoop was supposed to guarantee sorted input to the reducer. If so, why am I having this issue, and is there anything I can do to avoid it?
I'm deeply annoyed that this worked: the problem was in the way I was keying my map output. I replaced what I used to write as output with this:
ImmutableBytesWritable HKey = new ImmutableBytesWritable(put.getRow());
context.write(HKey, put);
Basically, the key I was using and the row key of the Put were slightly different, which caused the reducer to receive the Put statements out of order.
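For context, the surrounding mapper would look roughly like the sketch below; the question doesn't show BulkMapper's input types or parsing logic, so LongWritable/Text, the column family/qualifier, and the parsing helper are all assumptions:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BulkMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical parsing step: turn one input line into a Put.
        Put put = parseLineIntoPut(value.toString());

        // Key the map output with the Put's own row key so that the
        // shuffle/sort order matches what HFileOutputFormat's reducer expects.
        ImmutableBytesWritable hKey = new ImmutableBytesWritable(put.getRow());
        context.write(hKey, put);
    }

    // Placeholder parser; the real logic depends on the input data. The
    // column family and qualifier below are made up for the example.
    private Put parseLineIntoPut(String line) {
        String[] fields = line.split("\t");
        Put put = new Put(Bytes.toBytes(fields[0]));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(fields[1]));
        return put;
    }
}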
I am stuck on one part.
I was trying to execute the example code at https://github.com/stormprocessor/storm-state/blob/master/src/jvm/storm/state/example/MapExample.java from github.com/stormprocessor/storm-state. It uses HDFS.
But it is giving a NullPointerException:
java.lang.NullPointerException
at storm.state.hdfs.HDFSStore.getMeta(HDFSStore.java:37)
at storm.state.PartitionedState.getState(PartitionedState.java:11)
at storm.state.bolt.StatefulBoltExecutor.prepare(StatefulBoltExecutor.java:36)
at backtype.storm.daemon.executor$fn__4052$fn__4061.invoke(executor.clj:610)
at backtype.storm.util$async_loop$fn__465.invoke(util.clj:375)
at clojure.lang.AFn.run(AFn.java:24)
In the code at the above link I have changed
builder.setBolt("counter", new StatefulBoltExecutor(new WordCount(), new HDFSStore("hdfs://ip-10-202-7-99.ec2.internal:8020/tmp/data")), 8)
.fieldsGrouping("spout", new Fields("word"));
to
builder.setBolt("counter", new StatefulBoltExecutor(new WordCount(), new HDFSStore("hdfs://localhost:9000/home/mohit/hadoop/tmp/dfs/data")), 8)
which is my HDFS path.
The code referenced in the error log is at
https://github.com/stormprocessor/storm-state/blob/master/src/jvm/storm/state
Sorry for so few links; as a student who is still learning, I have very little reputation.
Please help. Thanks in advance!
I think the error you are getting is because your namenode is not listening on port 9000, which is what you have configured in your code. Verify the value of the property fs.default.name in your core-site.xml file and check which port your namenode is actually using; I think it could be 8020. If it is, say, 8020, then your code would be something like the following:
builder.setBolt("counter", new StatefulBoltExecutor(new WordCount(), new HDFSStore("hdfs://localhost:8020/home/mohit/hadoop/tmp/dfs/data")), 8)
I hope this solves your problem.
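If you want to double-check which address the client actually resolves, a quick hedged sketch (assuming core-site.xml is on the classpath) is to print both the old and the current property names:

import org.apache.hadoop.conf.Configuration;

public class PrintNamenodeAddress {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "fs.default.name" is the deprecated property name; "fs.defaultFS"
        // is the current one. Whichever is set shows the namenode host:port.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
    }
}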