I have the following job, and many others like it, launched like this:
String jobName = MR.class.getSimpleName();

Configuration config = HBaseConfiguration.create();
config.set("hbase.master", HBASE_MASTER);
config.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPERS);
config.set("hbase.zookeeper.property.clientPort", ZK_CLIENT_PORT);

Job job = Job.getInstance(config, jobName);
job.setJarByClass(MR.class);

Scan scan = new Scan();
// gets the raw band data cells
scan.addColumn("gc".getBytes(), "s".getBytes());

System.out.println("Job=" + jobName);
System.out.println("\tHBASE_MASTER=" + HBASE_MASTER);
System.out.println("\tHBASE_ZOOKEEPERS=" + HBASE_ZOOKEEPERS);
System.out.println("\tINPUT_TABLE=" + INPUT_TABLE);
System.out.println("\tOUTPUT_TABLE= " + OUTPUT_TABLE);
System.out.println("\tCACHING_LEVEL= " + CACHING_LEVEL);

// TODO: make caching size configurable
scan.setCaching(CACHING_LEVEL);
scan.setCacheBlocks(false);

// null for output key/value since we're not sending anything to reduce
TableMapReduceUtil.initTableMapperJob(INPUT_TABLE, scan,
        MR.MapClass.class,
        null,
        null,
        job);
TableMapReduceUtil.initTableReducerJob(
        "PIXELS", // output table
        null,     // reducer class
        job);
job.setNumReduceTasks(0);

// at least one, adjust as required
boolean b = job.waitForCompletion(true);
but I keep getting this error and can't even read the data into the mapper:
Error: java.io.BufferedReader.lines()Ljava/util/stream/Stream;
14/09/22 22:11:13 INFO mapreduce.Job: Task Id : attempt_1410880772411_0045_m_000009_2, Status : FAILED
Error: java.io.BufferedReader.lines()Ljava/util/stream/Stream;
14/09/22 22:11:13 INFO mapreduce.Job: Task Id : attempt_1410880772411_0045_m_000018_2, Status : FAILED
Error: java.io.BufferedReader.lines()Ljava/util/stream/Stream;
14/09/22 22:11:13 INFO mapreduce.Job: Task Id : attempt_1410880772411_0045_m_000002_2, Status : FAILED
Never seen this before, and I've been using HBase a while...
Here is what I've tried:
I can scan the table via the HBase shell, no problem, in the same way this job does
I can use the Java API to scan in the same way, no problem
I built and rebuilt my jar thinking some files were corrupt... not the problem
I have at least 5 other jobs set up the same way
I disabled, enabled, compacted, rebuilt everything you can imagine with the table
I am using the Maven Shade plugin to build an uber jar
I am running HBase 0.98.1-cdh5.1.0
Any help or ideas greatly appreciated.
Looks like I should have waited a bit before asking this question. The problem was that I actually was getting to the data, but one of my external jars had been built with Java 8, and BufferedReader.lines() does not exist in Java 7, which is what I'm running, hence the error. Nothing to do with HBase or MapReduce.
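For anyone who hits the same NoSuchMethodError: if rebuilding the offending external jar against Java 7 isn't an option, the Java 8-only BufferedReader.lines() call can be replaced with a plain readLine() loop. A minimal sketch, with the class name and file-reading setup purely illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Java7Lines {
    // Reads all lines without BufferedReader.lines(), which only exists in Java 8.
    public static List<String> readLines(String path) throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            reader.close();
        }
        return lines;
    }
}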
Related
We know the map phase has two parts ('chunk' and 'combine') and the reduce phase has three ('shuffle', 'sort' and 'reduce').
In the Hadoop source code, what is the command/API for getting the time taken by each part?
There is a JobTracker API for submitting and tracking MR jobs in a network environment.
Check this for more details:
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapred/JobTracker.html
TaskReport[] maps = jobtracker.getMapTaskReports("job_id"); // "job_id" is a placeholder for the actual job ID
for (TaskReport rpt : maps) {
    // start and finish timestamps (ms since epoch) for each map task
    System.out.println(rpt.getStartTime());
    System.out.println(rpt.getFinishTime());
}
TaskReport[] reduces = jobtracker.getReduceTaskReports("job_id");
for (TaskReport rpt : reduces) {
    System.out.println(rpt.getStartTime());
    System.out.println(rpt.getFinishTime());
}
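Note that JobTracker is the server-side daemon; from client code the same reports can be fetched through JobClient. A rough sketch along those lines (old mapred API, with "job_id" still a placeholder for the real job ID):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskReport;

public class TaskTimes {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster configured on the classpath (mapred-site.xml etc.).
        JobClient client = new JobClient(new JobConf());
        JobID id = JobID.forName("job_id"); // replace with the actual job ID string
        for (TaskReport rpt : client.getMapTaskReports(id)) {
            System.out.println("map " + rpt.getTaskID() + ": "
                    + (rpt.getFinishTime() - rpt.getStartTime()) + " ms");
        }
        for (TaskReport rpt : client.getReduceTaskReports(id)) {
            System.out.println("reduce " + rpt.getTaskID() + ": "
                    + (rpt.getFinishTime() - rpt.getStartTime()) + " ms");
        }
    }
}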
Or, if you are using Hadoop 2.x, ResourceManager and History Server REST APIs are provided:
https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html
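If you prefer the REST route, the History Server's tasks endpoint returns the start, finish and elapsed time of every task as JSON. A rough sketch in Java; the host, port and job ID below are placeholders, and the path follows the HistoryServerRest documentation linked above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HistoryTaskTimes {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port and job ID; adjust for your cluster.
        String endpoint = "http://historyserver:19888/ws/v1/history/mapreduce/jobs/job_id/tasks";
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestProperty("Accept", "application/json");
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        // Each task entry in the JSON contains type (MAP/REDUCE), startTime, finishTime and elapsedTime.
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}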
In a Hadoop program I tried to compress the map output, so I wrote the following code:
conf.setBoolean("mapred.compress.map.output",true);
conf.setClass("mapred.map.output.compression.codec",GzipCodec.class,CompressionCodec.class);
When I ran it, I got the exception below. Does anybody know the reason?
WARN mapred.LocalJobRunner: job_local1149103367_0001
java.io.IOException: not a gzip file
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.processBasicHeader(BuiltInGzipDecompressor.java:495)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeHeaderState(BuiltInGzipDecompressor.java:256)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:185)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:91)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:72)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
at org.apache.hadoop.mapred.IFile$Reader.positionToNextRecord(IFile.java:400)
at org.apache.hadoop.mapred.IFile$Reader.nextRawKey(IFile.java:425)
at org.apache.hadoop.mapred.Merger$Segment.nextRawKey(Merger.java:323)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:613)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:558)
at org.apache.hadoop.mapred.Merger.merge(Merger.java:70)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:385)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)
Today I tested it again, and I found that if I put the two lines before the Job object is created,
Job job = new Job(conf, "MyCounter");
the error happens; if I put them after it, no error occurs. Why does this happen?
Are you using MRv1 or MRv2? If you are using MRv2, then use the following job config.
config.setBoolean("mapreduce.output.fileoutputformat.compress", true);
config.setClass("mapreduce.output.fileoutputformat.compress.codec",GzipCodec.class,CompressionCodec.class);
Additionally, you can set:
config.set("mapreduce.output.fileoutputformat.compress.type",CompressionType.NONE.toString());
BLOCK | NONE | RECORD are the three compression types.
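For the intermediate map-output compression the question actually asks about (as opposed to the final job output), the Hadoop 2 property names are mapreduce.map.output.compress and mapreduce.map.output.compress.codec. They have to be set on the Configuration before the Job object is constructed, because Job copies the Configuration at creation time, which is most likely why the order you observed matters. A minimal sketch, assuming the default zlib codec:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedMapOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 2 (MRv2) names for compressing the intermediate map output.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                DefaultCodec.class, CompressionCodec.class); // GzipCodec.class works here too
        // These settings must be in place before the Job is created:
        // Job copies the Configuration here, so later changes are ignored.
        Job job = Job.getInstance(conf, "MyCounter");
        // ... set mapper, reducer, input and output as usual, then job.waitForCompletion(true)
    }
}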
I am taking my first steps with Spark and Zeppelin and don't understand why this code sample isn't working.
First Block:
%dep
z.reset() // clean up
z.load("/data/extraJarFiles/postgresql-9.4.1208.jar") // load a jdbc driver for postgresql
Second Block:
%spark
// This code loads some data from a PostgreSQL DB with the help of a JDBC driver.
// The JDBC driver is stored on the Zeppelin server; the necessary code is transferred to the Spark workers and the workers build the connection to the DB.
//
// The connection between table and data source is "lazy", so the data will only be loaded when an action needs it.
// With the current script this means the DB is queried twice. ==> Q: How can I keep an RDD in memory or on disk?
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.JdbcRDD
import java.sql.Connection
import java.sql.DriverManager
import java.sql.ResultSet
import org.apache.spark.sql.hive._
import org.apache.spark.sql._
val url = "jdbc:postgresql://10.222.22.222:5432/myDatabase"
val username = "postgres"
val pw = "geheim"
Class.forName("org.postgresql.Driver").newInstance // activating the jdbc driver. The jar file was loaded inside of the %dep block
case class RowClass(Id:Integer, Col1:String , Col2:String) // create a class with possible values
val myRDD = new JdbcRDD(sc, // SparkContext sc
() => DriverManager.getConnection(url,username,pw), // scala.Function0<java.sql.Connection> getConnection
"select * from tab1 where \"Id\">=? and \"Id\" <=? ", // String sql Important: we need here two '?' for the lower/upper Bounds vlaues
0, // long lowerBound = start value
10000, // long upperBound, = end value that is still included
1, // int numPartitions = the range is split into x sub-commands.
// e.g. 0,1000,2 => first cmd from 0 ... 499, second cmd from 500..1000
row => RowClass(row.getInt("Id"),
row.getString("Col1"),
row.getString("Col2"))
)
myRDD.toDF().registerTempTable("Tab1")
// --- improved method (not working at the moment) ---
val prop = new java.util.Properties
prop.setProperty("user",username)
prop.setProperty("password",pw)
val tab1b = sqlContext.read.jdbc(url,"tab1",prop) // <-- not working
tab1b.show
So what is the problem?
I want to connect to an external PostgreSQL DB.
The first block adds the necessary JAR file for the DB, and the first lines of the second block already use the JAR and are able to get some data out of the DB.
But that first way is ugly, because you have to convert the data into a table yourself, so I want to use the easier method at the end of the script.
But I am getting the error message:
java.sql.SQLException: No suitable driver found for
jdbc:postgresql://10.222.22.222:5432/myDatabase
But it is the same URL / login / password as in the code above.
Why is this not working?
Maybe somebody has a helpful hint for me.
---- Update: 24.3. 12:15 ---
I don't think the problem is the loading of the JAR. I added an extra val db = DriverManager.getConnection(url, username, pw); for testing (the call that fails in the exception), and this works fine.
Another interesting detail: if I remove the %dep block and the Class.forName line, the first block produces a very similar error. Same error message, same failing function and line number, but the stack of calls is a bit different.
I have found the source code here: http://code.metager.de/source/xref/openjdk/jdk8/jdk/src/share/classes/java/sql/DriverManager.java
My problem is in line 689. So if all parameters are OK, maybe it comes from the isDriverAllowed() check?
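To dig into that isDriverAllowed() suspicion, a quick diagnostic is to list the drivers DriverManager actually sees from the calling code, since it filters by the caller's classloader. A small sketch in plain Java (the equivalent calls can be made from a Scala paragraph):

import java.sql.Driver;
import java.sql.DriverManager;
import java.util.Enumeration;

public class ListJdbcDrivers {
    public static void main(String[] args) {
        // DriverManager filters drivers by the caller's classloader (the isDriverAllowed() check),
        // so a driver registered from one classloader may be invisible to code loaded by another.
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver d = drivers.nextElement();
            System.out.println(d.getClass().getName()
                    + " " + d.getMajorVersion() + "." + d.getMinorVersion());
        }
    }
}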
I've had the same problem with dependencies in Zeppelin, and I had to add my jars to SPARK_SUBMIT_OPTIONS in zeppelin-env.sh to have them included in all notebooks and paragraphs.
So in zeppelin-env.sh you modify SPARK_SUBMIT_OPTIONS to be:
export SPARK_SUBMIT_OPTIONS="--jars /data/extraJarFiles/postgresql-9.4.1208.jar"
Then you have to restart your Zeppelin instance.
In my case, while executing Spark/Scala code, I received the same error. I had previously set SPARK_CLASSPATH in my spark-env.sh conf file; it was pointing to a jar file. I removed/commented out that line in spark-env.sh and restarted Zeppelin. This got rid of the error.
I'm doing a large-scale HBase import using a map-reduce job I set up like so:
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
job.setMapperClass(BulkMapper.class);
job.setOutputFormatClass(HFileOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
HFileOutputFormat.configureIncrementalLoad(job, hTable); //This creates a text file that will be full of put statements, should take 10 minutes or so
boolean suc = job.waitForCompletion(true);
It uses a mapper that I wrote myself, and HFileOutputFormat.configureIncrementalLoad sets up the reducer. I've done proofs of concept with this setup before; however, when I ran it on a large dataset it died in the reducer with this error:
Error: java.io.IOException: Non-increasing Bloom keys: BLMX2014-02-03nullAdded after BLMX2014-02-03nullRemoved
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.appendGeneralBloomfilter(StoreFile.java:934)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:970)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:168)
    at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:576)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
    at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:78)
    at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:43)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:645)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:405)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
I thought Hadoop was supposed to guarantee sorted input to the reducer. If so, why am I having this issue, and is there anything I can do to avoid it?
I'm deeply annoyed that this worked. The problem was in the way I was keying my map output. I replaced what I used to have for output with this:
ImmutableBytesWritable HKey = new ImmutableBytesWritable(put.getRow());
context.write(HKey, put);
Basically, the key I was using and the row key of the Put were slightly different, which caused the reducer to receive the Put statements out of order.
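In other words, the reducer only sees the Puts in row order if the map output key is built from the same bytes as the Put's row key. For anyone reading along, here is a stripped-down sketch of such a mapper (not my actual class; the column family, qualifier and line parsing are placeholders):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BulkMapperSketch extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Placeholder parsing: first field is the row key, second is the cell value.
        String[] fields = value.toString().split("\t");
        Put put = new Put(Bytes.toBytes(fields[0]));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(fields[1])); // addColumn() on newer HBase
        // The map output key must come from the Put's own row bytes,
        // otherwise PutSortReducer can see rows arrive out of order.
        context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
}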
So, I have two jobs, Job A and Job B. For Job A, I would like to have a maximum of 6 mappers per node. However, Job B is a little different. For Job B, I can only run one mapper per node. The reason for this isn't important -- let's just say this requirement is non-negotiable. I would like to tell Hadoop, "For Job A, schedule a maximum of 6 mappers per node. But for Job B, schedule a maximum of 1 mapper per node." Is this possible at all?
The only solution I can think of is:
1) Have two folders off the main hadoop folder, conf.JobA and conf.JobB. Each folder has its own copy of mapred-site.xml. conf.JobA/mapred-site.xml has a value of 6 for mapred.tasktracker.map.tasks.maximum. conf.JobB/mapred-site.xml has a value of 1 for mapred.tasktracker.map.tasks.maximum.
2) Before I run Job A:
2a) Shut down my tasktrackers
2b) Copy conf.JobA/mapred-site.xml into Hadoop's conf folder, replacing the mapred-site.xml that was already in there
2c) Restart my tasktrackers
2d) Wait for the tasktrackers to finish starting
3) Run Job A
and then do a similar thing when I need to run Job B.
I really don't like this solution; it seems kludgey and failure-prone. Is there a better way to do what I need to do?
In the Java code for the custom jar itself, you could set the configuration property mapred.tasktracker.map.tasks.maximum for both of your jobs.
Do something like this:
Configuration conf = getConf();
// maximum number of simultaneous map tasks per tasktracker
conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
Job job = new Job(conf);
job.setJarByClass(MyMapRed.class);
job.setJobName(JOB_NAME);
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(MapJob.class);
job.setMapOutputKeyClass(Text.class);
job.setReducerClass(ReduceJob.class);
job.setMapOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, args[0]);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
EDIT:
You also need to set the property mapred.map.tasks to the value derived from the following formula: mapred.tasktracker.map.tasks.maximum * number of tasktracker nodes in your cluster.
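To make the formula concrete, a small sketch; the per-node limit of 6 is Job A's requirement from the question, while the cluster size is a made-up value:

import org.apache.hadoop.conf.Configuration;

public class MapSlotMath {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        int maxMapsPerNode = 6;    // Job A's per-node limit
        int numTaskTrackers = 10;  // made-up cluster size, adjust to yours
        conf.setInt("mapred.tasktracker.map.tasks.maximum", maxMapsPerNode);
        // Hint the total number of map tasks: maximum per node * number of tasktracker nodes.
        conf.setInt("mapred.map.tasks", maxMapsPerNode * numTaskTrackers); // 60 in this example
    }
}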