I'm running a Spark job to generate HFiles for my HBase data store.
It used to work fine on our Cloudera cluster, but when we switched to an EMR cluster it fails with the following stack trace:
Serialization stack:
- object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable, value: 50 31 36 31 32 37 30 33 34 5f 49 36 35 38 34 31 35 38 35); not retrying
Serialization stack:
- object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable, value: 50 31 36 31 32 37 30 33 34 5f 49 36 35 38 34 31 35 38 35)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1493)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1492)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1492)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1720)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:629)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1158)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1005)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:996)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:996)
at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopFile(JavaPairRDD.scala:823)
My questions:
What could cause the difference between the two runs? A version difference between the two clusters?
I did some research and found this post: then I added the Kryo parameters to my spark-submit command, which now looks like this:
spark-submit --conf spark.kryo.classesToRegister=org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.KeyValue --master yarn --deploy-mode client --driver-memory 16G --executor-memory 18G ...
But I still get the same error.
Here's my Java code:
protected void generateHFilesUsingSpark(JavaRDD<Row> rdd) throws Exception {
    JavaPairRDD<ImmutableBytesWritable, KeyValue> javaPairRdd = rdd.mapToPair(
        new PairFunction<Row, ImmutableBytesWritable, KeyValue>() {
            public Tuple2<ImmutableBytesWritable, KeyValue> call(Row row) throws Exception {
                String key = (String) row.get(0);
                String value = (String) row.get(1);

                ImmutableBytesWritable rowKey = new ImmutableBytesWritable();
                byte[] rowKeyBytes = Bytes.toBytes(key);
                rowKey.set(rowKeyBytes);

                KeyValue keyValue = new KeyValue(rowKeyBytes,
                    Bytes.toBytes("COL"),
                    Bytes.toBytes("FM"),
                    ProductJoin.newBuilder()
                        .setId(key)
                        .setSolrJson(value)
                        .build().toByteArray());

                return new Tuple2<ImmutableBytesWritable, KeyValue>(rowKey, keyValue);
            }
        });

    Configuration baseConf = HBaseConfiguration.create();
    Configuration conf = new Configuration();
    conf.set(HBASE_ZOOKEEPER_QUORUM, "xxx.xxx.xx.xx");
    Job job = new Job(baseConf, "APP-NAME");
    HTable table = new HTable(conf, "hbaseTargetTable");
    Partitioner partitioner = new IntPartitioner(importerParams.shards);
    JavaPairRDD<ImmutableBytesWritable, KeyValue> repartitionedRdd =
        javaPairRdd.repartitionAndSortWithinPartitions(partitioner);
    HFileOutputFormat2.configureIncrementalLoad(job, table);
    System.out.println("Done configuring incremental load....");
    Configuration config = job.getConfiguration();

    repartitionedRdd.saveAsNewAPIHadoopFile(
        "hfilesOutputPath",
        ImmutableBytesWritable.class,
        KeyValue.class,
        HFileOutputFormat2.class,
        config);
    System.out.println("Saved HFiles to: " + importerParams.hfilesOutputPath);
}
All right, problem solved. The trick is to use the KryoSerializer; I added this to my Java code to register ImmutableBytesWritable:
SparkSession.Builder builder = SparkSession.builder().appName("AWESOME");
builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
SparkConf conf = new SparkConf().setAppName("AWESOME");
Class<?>[] classes = new Class[]{org.apache.hadoop.hbase.io.ImmutableBytesWritable.class};
conf.registerKryoClasses(classes);
builder.config(conf);
SparkSession spark = builder.getOrCreate();
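For reference, a minimal sketch of the same setup expressed on a single SparkConf. Registering KeyValue as well (as the spark-submit command above already does) is optional, since Kryo can also serialize unregistered classes, but it keeps the serialized form compact:
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

SparkConf conf = new SparkConf()
        .setAppName("AWESOME")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// Register both sides of the pair RDD (ImmutableBytesWritable keys, KeyValue values).
conf.registerKryoClasses(new Class<?>[]{
        org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
        org.apache.hadoop.hbase.KeyValue.class
});
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();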
I would like to avoid using null values for a field of a class used in a Dataset. I tried to use Scala Option and Java Optional, but it fails:
@AllArgsConstructor // lombok
@NoArgsConstructor  // mutable type is required in java :(
@Data               // see https://stackoverflow.com/q/59609933/1206998
public static class TestClass {
    String id;
    Option<Integer> optionalInt;
}
@Test
public void testDatasetWithOptionField() {
    Dataset<TestClass> ds = spark.createDataset(Arrays.asList(
            new TestClass("item 1", Option.apply(1)),
            new TestClass("item .", Option.empty())
    ), Encoders.bean(TestClass.class));

    ds.collectAsList().forEach(x -> System.out.println("Found " + x));
}
This fails at runtime with the message: File 'generated.java', Line 77, Column 47: Cannot instantiate abstract "scala.Option"
Question: Is there a way to encode optional fields without null in a Dataset, using Java?
Subsidiary question: by the way, I haven't used Datasets much in Scala either; can you confirm that it is actually possible in Scala to encode a case class containing Option fields?
Note: this is used in an intermediate dataset, i.e. something that is neither read nor written (it only goes through Spark's internal serialization).
This is fairly simple to do in Scala.
Scala Implementation
import org.apache.spark.sql.{Encoders, SparkSession}

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Stack-scala")
      .master("local[2]")
      .getOrCreate()

    val ds = spark.createDataset(Seq(
      TestClass("Item 1", Some(1)),
      TestClass("Item 2", None)
    ))(Encoders.product[TestClass])

    ds.collectAsList().forEach(println)

    spark.stop()
  }

  case class TestClass(
    id: String,
    optionalInt: Option[Int])
}
Java
There are various Option classes available in Java. However, none of them work out-of-the-box.
java.util.Optional -> not serializable
scala.Option -> Serializable but abstract, so when CodeGenerator generates the following code, it fails!
/* 081 */ // initializejavabean(newInstance(class scala.Option))
/* 082 */ final scala.Option value_9 = false ?
/* 083 */ null : new scala.Option(); // ---> Such initialization is not possible for abstract classes
/* 084 */ scala.Option javaBean_1 = value_9;
org.apache.spark.api.java.Optional -> Spark's implementation of Optional, which is serializable but has private constructors. So it fails with the error: No applicable constructor/method found for zero actual parameters. Since this is a final class, it is not possible to extend it.
/* 081 */ // initializejavabean(newInstance(class org.apache.spark.api.java.Optional))
/* 082 */ final org.apache.spark.api.java.Optional value_9 = false ?
/* 083 */ null : new org.apache.spark.api.java.Optional();
/* 084 */ org.apache.spark.api.java.Optional javaBean_1 = value_9;
/* 085 */ if (!false) {
One option is to use a normal java.util.Optional in the data class and then use Kryo as the serializer.
Encoder<TestClass> en = Encoders.kryo(TestClass.class);

Dataset<TestClass> ds = spark.createDataset(Arrays.asList(
        new TestClass("item 1", Optional.of(1)),
        new TestClass("item .", Optional.empty())
), en);

ds.collectAsList().forEach(x -> System.out.println("Found " + x));
Output:
Found TestClass(id=item 1, optionalInt=Optional[1])
Found TestClass(id=item ., optionalInt=Optional.empty)
There is a downside when using Kryo: this encoder encodes in a binary format:
ds.printSchema();
ds.show(false);
prints
root
|-- value: binary (nullable = true)
+-------------------------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------------------------+
|[01 00 4A 61 76 61 53 74 61 72 74 65 72 24 54 65 73 74 43 6C 61 73 F3 01 01 69 74 65 6D 20 B1 01 02 02]|
|[01 00 4A 61 76 61 53 74 61 72 74 65 72 24 54 65 73 74 43 6C 61 73 F3 01 01 69 74 65 6D 20 AE 01 00] |
+-------------------------------------------------------------------------------------------------------+
This answer describes a UDF-based solution for getting normal output columns back out of a dataset encoded with Kryo.
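If you only need readable columns for inspection and don't want to go through a UDF, one workaround (a hedged sketch, assuming the Kryo-backed TestClass above with a java.util.Optional field and its Lombok getters) is to map the objects onto a tuple encoder before showing them:
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// Map each Kryo-encoded bean to a (String, Integer) tuple; empty Optionals become null.
Dataset<Row> readable = ds.map(
        (MapFunction<TestClass, Tuple2<String, Integer>>) t ->
                new Tuple2<>(t.getId(), t.getOptionalInt().orElse(null)),
        Encoders.tuple(Encoders.STRING(), Encoders.INT())
).toDF("id", "optionalInt");

readable.show(false);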
Maybe a bit off-topic, but a probable starting point for a long-term solution is to look at the code of JavaTypeInference. The methods serializerFor and deserializerFor are used by ExpressionEncoder.javaBean to create the serializer and deserializer parts of the encoder for Java beans.
In this pattern matching block
typeToken.getRawType match {
case c if c == classOf[String] => createSerializerForString(inputObject)
case c if c == classOf[java.time.Instant] => createSerializerForJavaInstant(inputObject)
case c if c == classOf[java.sql.Timestamp] => createSerializerForSqlTimestamp(inputObject)
case c if c == classOf[java.time.LocalDate] => createSerializerForJavaLocalDate(inputObject)
case c if c == classOf[java.sql.Date] => createSerializerForSqlDate(inputObject)
[...]
the handling for java.util.Optional is missing. It could probably be added there, as well as in the corresponding deserializer method. This would allow Java beans to have properties of type Optional.
I would like to write some simple Spark SQL code that reads a file called u.data containing movie ratings, creates a Dataset of Rows, and then prints the first rows of the Dataset.
My starting point was to read the file into a JavaRDD and map the RDD to a ratings object (the object has two fields, movieID and rating). So I just want to print the first rows of this Dataset.
I'm using Java language and Spark SQL.
public static void main(String[] args) {
    App obj = new App();
    SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example").getOrCreate();

    Map<Integer, String> movieNames = obj.loadMovieNames();

    JavaRDD<String> lines = spark.read().textFile("hdfs:///ml-100k/u.data").javaRDD();

    JavaRDD<MovieRatings> movies = lines.map(line -> {
        String[] parts = line.split(" ");
        MovieRatings ratingsObject = new MovieRatings();
        ratingsObject.setMovieID(Integer.parseInt(parts[1].trim()));
        ratingsObject.setRating(Integer.parseInt(parts[2].trim()));
        return ratingsObject;
    });

    Dataset<Row> movieDataset = spark.createDataFrame(movies, MovieRatings.class);

    Encoder<Integer> intEncoder = Encoders.INT();
    Dataset<Integer> HUE = movieDataset.map(
        new MapFunction<Row, Integer>() {
            private static final long serialVersionUID = -5982149277350252630L;

            @Override
            public Integer call(Row row) throws Exception {
                return row.getInt(0);
            }
        }, intEncoder
    );

    HUE.show();

    // stop the session
    spark.stop();
}
I've tried a lot of possible solutions that I found, but all of them got the same error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, localhost, executor 1): java.lang.ArrayIndexOutOfBoundsException: 1
at com.ericsson.SparkMovieRatings.App.lambda$main$1e634467$1(App.java:63)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
And here is a sample of the u.data file:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
where the first column represents the UserID, the second the MovieID, the third the Rating, and the last one the Timestamp.
As mentioned before, your data is not separated by single spaces, which is why line.split(" ") throws the ArrayIndexOutOfBoundsException.
I'll show you two possible solutions: the first based on RDDs and the second based on Spark SQL, which is, in general, the better choice in terms of performance.
RDD (you should use built-in types to reduce the overhead):
public class SparkDriver {

    public static void main(String[] args) {

        // Create a configuration object and set the name of the application
        SparkConf conf = new SparkConf().setAppName("application_name");

        // Create a Spark context object
        JavaSparkContext context = new JavaSparkContext(conf);

        // Create the final pair RDD (suppose you have a text file);
        // the columns are userID, movieID, rating, timestamp, separated by whitespace
        JavaPairRDD<Integer, Integer> movieRatingRDD =
            context.textFile("u.data.txt")
                   .mapToPair(line -> {
                       String[] tokens = line.split("\\s+");
                       int movieID = Integer.parseInt(tokens[1]);
                       int rating = Integer.parseInt(tokens[2]);
                       return new Tuple2<>(movieID, rating);
                   });

        // Keep in mind that the take operation returns the first n elements
        // in the order of the file.
        List<Tuple2<Integer, Integer>> list = movieRatingRDD.take(10);

        System.out.println("MovieID\tRating");
        for (Tuple2<Integer, Integer> tuple : list) {
            System.out.println(tuple._1 + "\t" + tuple._2);
        }

        context.close();
    }
}
SQL
public class SparkDriver {

    public static void main(String[] args) {

        // Create the Spark session
        SparkSession session = SparkSession.builder().appName("[Spark app sql version]").getOrCreate();

        // u.data is whitespace (tab) separated: userID, movieID, rating, timestamp
        Dataset<MovieRatings> movieRatings = session.read()
                .format("csv")
                .option("header", false)
                .option("inferSchema", true)
                .option("delimiter", "\t")
                .load("u.data.txt")
                .map((MapFunction<Row, MovieRatings>) row -> {
                    int movieID = row.getInt(1);
                    int rating = row.getInt(2);
                    return new MovieRatings(movieID, rating);
                }, Encoders.bean(MovieRatings.class));

        // Print the first rows
        movieRatings.show(10);

        // Stop the session
        session.stop();
    }
}
my widgets.proto:
option java_package = "example";
option java_outer_classname = "WidgetsProtoc";
message Widget {
required string name = 1;
required int32 id = 2;
}
message WidgetList {
repeated Widget widget = 1;
}
My REST resource (path: /widgets):
@GET
@Produces("application/x-protobuf")
public WidgetsProtoc.WidgetList getAllWidgets() {
    Widget widget1 = Widget.newBuilder().setId(1).setName("testing").build();
    Widget widget2 = Widget.newBuilder().setId(100).setName("widget 2").build();
    System.err.println("widgets1: " + widget1);
    System.err.println("widgets2: " + widget2);
    WidgetsProtoc.WidgetList list = WidgetsProtoc.WidgetList.newBuilder()
            .addWidget(widget1)
            .addWidget(widget2)
            .build();
    System.err.println("list: " + list.toByteArray());
    return list;
}
And when I use Postman I get this response:
(emptyline)
(emptyline)
testing
(emptyline)
widget 2d
Is this normal? I don't think so... In my MessageBodyWriter class I override writeTo like this:
@Override
public void writeTo(WidgetsProtoc.WidgetList widgetList, Class<?> type, Type genericType,
                    Annotation[] annotations, MediaType mediaType,
                    MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream)
        throws IOException, WebApplicationException {
    entityStream.write(widgetList.toByteArray());
}
I thought it would be fine to send a byte array... but it is a bit weird that nothing (or maybe just the id) seems to get serialized... Thanks for the help :)
It already is in binary.
Have you noticed that your message consists of a name and an id, but only the name gets displayed?
This is what I get when I just save the response as a "test" file from the browser's GET call:
$ cat test
testing
widget 2d%
You can see that cat is barely able to display it, but there is more content in it.
Now if I open it in bless (a hex editor on Ubuntu), I can get more out of it:
0A 0B 0A 07 74 65 73 74 69 6E 67 10 01 0A 0C 0A 08 77 69 64 67 65 74 20 32 10 64
In general I wouldn't trust Postman to display binary data. You need to know how to decode it, and apparently Postman doesn't.
And here's the final proof:
$ cat test | protoc --decode_raw
1 {
1: "testing"
2: 1
}
1 {
1: "widget 2"
2: 100
}
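If you prefer to check it from Java rather than on the command line, the generated classes can parse the saved response directly. A small sketch, assuming the "test" file saved above and the classes generated from widgets.proto:
import java.nio.file.Files;
import java.nio.file.Paths;

// Parse the raw protobuf bytes saved from the browser and print both fields.
byte[] body = Files.readAllBytes(Paths.get("test"));
WidgetsProtoc.WidgetList list = WidgetsProtoc.WidgetList.parseFrom(body);
for (WidgetsProtoc.Widget w : list.getWidgetList()) {
    System.out.println(w.getId() + " -> " + w.getName()); // 1 -> testing, 100 -> widget 2
}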
I am trying to save data from a MapReduce job into HBase. We wrote a script which worked great on an older version of Hadoop (CDH3u4). Now we upgraded to the newest version (CDH 5.0.2) and the script is not working.
When I run the program on the newest version, I get the following error:
Exception in thread "main" java.lang.RuntimeException: java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.setConf(TableOutputFormat.java:211)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:455)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
at com.nrholding.t0_mr.main.ELogHBaseImport.main(ELogHBaseImport.java:89)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:389)
at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:366)
at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:247)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:188)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.setConf(TableOutputFormat.java:206)
... 17 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:387)
... 22 more
Caused by: java.lang.NoClassDefFoundError: org/cloudera/htrace/Trace
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:195)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:801)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:633)
... 27 more
Caused by: java.lang.ClassNotFoundException: org.cloudera.htrace.Trace
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 33 more
It seems that the problem is in TableOutputFormat. So I checked that:
The proper library is in the libpath.
The ZooKeeper quorum and ZooKeeper port are set in hbase-site.xml.
The table wp_json exists in HBase.
Here is the code which causes the problem:
public static void main(String args[]) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hbase.zookeeper.quorum", "zookeeper_server1,zookeeper_server2,zookeeper_server3");
    conf.set(TableOutputFormat.OUTPUT_TABLE, "wp_json");

    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    String input = otherArgs[0];

    Job job = Job.getInstance(conf, "ELogHBaseImport");

    // Input is just text files in HDFS
    FileInputFormat.addInputPath(job, new Path(input));

    job.setJarByClass(ELogHBaseImport.class);
    job.setMapperClass(Map.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(TableOutputFormat.class);

    job.waitForCompletion(true);
}
When I use NullOutputFormat, everything works great but nothing is written to hbase.
The part of TableOutputFormat responsible for the error is here:
163 /**
164  * Returns the output committer.
165  *
166  * @param context The current context.
167  * @return The committer.
168  * @throws IOException When creating the committer fails.
169  * @throws InterruptedException When the job is aborted.
170  * @see org.apache.hadoop.mapreduce.OutputFormat#getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext)
171  */
172 @Override
173 public OutputCommitter getOutputCommitter(TaskAttemptContext context)
174 throws IOException, InterruptedException {
175   return new TableOutputCommitter();
176 }
177
178 public Configuration getConf() {
179   return conf;
180 }
181
182 @Override
183 public void setConf(Configuration otherConf) {
184   this.conf = HBaseConfiguration.create(otherConf);
185
186   String tableName = this.conf.get(OUTPUT_TABLE);
187   if (tableName == null || tableName.length() <= 0) {
188     throw new IllegalArgumentException("Must specify table name");
189   }
190
191   String address = this.conf.get(QUORUM_ADDRESS);
192   int zkClientPort = this.conf.getInt(QUORUM_PORT, 0);
193   String serverClass = this.conf.get(REGION_SERVER_CLASS);
194   String serverImpl = this.conf.get(REGION_SERVER_IMPL);
195
196   try {
197     if (address != null) {
198       ZKUtil.applyClusterKeyToConf(this.conf, address);
199     }
200     if (serverClass != null) {
201       this.conf.set(HConstants.REGION_SERVER_IMPL, serverImpl);
202     }
203     if (zkClientPort != 0) {
204       this.conf.setInt(HConstants.ZOOKEEPER_CLIENT_PORT, zkClientPort);
205     }
206     this.table = new HTable(this.conf, tableName);
207     this.table.setAutoFlush(false, true);
208     LOG.info("Created table instance for " + tableName);
209   } catch (IOException e) {
210     LOG.error(e);
211     throw new RuntimeException(e);
212   }
213 }
The error is actually caused by this message:
Caused by: java.lang.ClassNotFoundException: org.cloudera.htrace.Trace
You are probably missing a jar on the classpath. The class mentioned above may be referred to indirectly from your code. Try to put the jar containing this class on the classpath.
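One job-side way to do that is HBase's TableMapReduceUtil, which ships HBase's dependency jars with the job via the distributed cache (on CDH 5 this should include the htrace-core jar containing org.cloudera.htrace.Trace, though I haven't verified the exact jar list). For the submit-time failure shown above, the jar also has to be on the local client classpath, e.g. via HADOOP_CLASSPATH. A hedged sketch of the job-side part:
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

// After configuring the job, ship HBase's dependency jars along with it.
Job job = Job.getInstance(conf, "ELogHBaseImport");
job.setOutputFormatClass(TableOutputFormat.class);
TableMapReduceUtil.addDependencyJars(job);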
Hope this helps!!!
Why doesn't this work? I'm trying to test that an empty database is the same before doing nothing as after doing nothing. In other words, this is the simplest dbunit test with a database that I can think of. And it doesn't work. The test methods are practically lifted from http://www.dbunit.org/howto.html
The error message I'm getting for comparing empty database is:
java.lang.AssertionError: expected:
org.dbunit.dataset.xml.FlatXmlDataSet<AbstractDataSet[_orderedTableNameMap=null]>
but was:
org.dbunit.database.DatabaseDataSet<AbstractDataSet[_orderedTableNameMap=null]>
The error message I'm getting for comparing empty table is:
java.lang.AssertionError: expected:
<org.dbunit.dataset.DefaultTable[_metaData=tableName=test, columns=[], keys=[], _rowList.size()=0]>
but was:
<org.dbunit.database.CachedResultSetTable[_metaData=table=test, cols=[(id, DOUBLE, noNulls), (txt, VARCHAR, nullable)], pk=[(id, DOUBLE, noNulls)], _rowList.size()=0]>
(I've added newlines for readability)
I can edit in the full stack trace (or anything else) if it'll be useful. Or you can browse through the public git repo: https://bitbucket.org/djeikyb/simple_dbunit
Do I need to somehow convert my actual IDataSet to xml then back to IDataSet to properly compare? What am I doing/expecting wrong?
public class TestCase
{

    private IDatabaseTester database_tester;

    @Before
    public void setUp() throws Exception
    {
        database_tester = new JdbcDatabaseTester("com.mysql.jdbc.Driver",
                                                 "jdbc:mysql://localhost/cal",
                                                 "cal",
                                                 "cal");

        IDataSet data_set = new FlatXmlDataSetBuilder().build(
                new FileInputStream("src/simple_dbunit/dataset.xml"));
        database_tester.setDataSet(data_set);

        database_tester.onSetup();
    }

    @Test
    public void testDbNoChanges() throws Exception
    {
        // expected
        IDataSet expected_data_set = new FlatXmlDataSetBuilder().build(
                new FileInputStream("src/simple_dbunit/dataset.xml"));

        // actual
        IDatabaseConnection connection = database_tester.getConnection();
        IDataSet actual_data_set = connection.createDataSet();

        // test
        assertEquals(expected_data_set, actual_data_set);
    }

    @Test
    public void testTableNoChanges() throws Exception
    {
        // expected
        IDataSet expected_data_set = new FlatXmlDataSetBuilder().build(
                new FileInputStream("src/simple_dbunit/dataset.xml"));
        ITable expected_table = expected_data_set.getTable("test");

        // actual
        IDatabaseConnection connection = database_tester.getConnection();
        IDataSet actual_data_set = connection.createDataSet();
        ITable actual_table = actual_data_set.getTable("test");

        // test
        assertEquals(expected_table, actual_table);
    }

}
When you compare IDataSet and other DbUnit components, you have to use the assert methods provided by DbUnit.
If you use the assert methods provided by JUnit, the objects will only be compared via the equals method of Object.
That's why you get the error complaining about different object types.
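Concretely, a minimal sketch of the fix for the two tests above: replace JUnit's assertEquals with org.dbunit.Assertion, which compares data sets and tables by content.
import org.dbunit.Assertion;

// Compares column by column and row by row instead of relying on Object.equals
// of two different IDataSet/ITable implementations.
Assertion.assertEquals(expected_data_set, actual_data_set); // IDataSet vs IDataSet
Assertion.assertEquals(expected_table, actual_table);       // ITable vs ITable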