I am trying to read JSON records produced by Kafka in Spark using SQLContext.read(). Every time, a NullPointerException is thrown.
SparkConf conf = new SparkConf()
.setAppName("kafka-sandbox")
.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));
Set<String> topics = Collections.singleton(topicString);
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", servers);
JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(
ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
kafkaParams, topics);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
directKafkaStream
.map(message -> message._2)
.foreachRDD(rdd -> {
rdd.foreach(record -> {
Dataset<Row> ds = sqlContext.read().json(rdd);
});
});
ssc.start();
ssc.awaitTermination();
Here is a log:
java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:112)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:110)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:535)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:595)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:504)
at SparkJSONConsumer$1.lambda$2(SparkJSONConsumer.java:73)
at SparkJSONConsumer$1$$Lambda$8/1821075039.call(Unknown Source)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:350)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:875)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:875)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I assume the problem is due to the foreachRDD clause, but I can't figure it out. Any suggestions would be great.
Also, I am using sqlContext because I plan to serialize the records to Avro format afterwards ("com.databricks.spark.avro"). If there is a way to serialize a string containing a JSON structure to Avro without defining the schema, you are very welcome to share it!
Thanks in advance.
As mentioned in the Spark documentation:
You have to create a SparkSession using the SparkContext that the StreamingContext is using. Furthermore, this has to be done such that it can be restarted on driver failures. This is done by creating a lazily instantiated singleton instance of SparkSession.
Refer:
http://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#dataframe-and-sql-operations
Solution:
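Create the SQLContext as below, directly inside foreachRDD and just before reading the JSON. The read has to happen in foreachRDD itself, which runs on the driver, and not inside rdd.foreach, which runs on the executors (that is where the stack trace shows the NullPointerException being thrown).
SQLContext sqlContext = SparkSession.builder().config(rdd.context().getConf()).getOrCreate().sqlContext();
Dataset<Row> ds = sqlContext.read().json(rdd);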
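The "lazily instantiated singleton" wording in the quoted documentation can be illustrated without any Spark types. The sketch below is plain Java under stated assumptions: `SessionSingleton` and `FakeSession` are hypothetical names, and `FakeSession` merely stands in for a real `SparkSession`. It only shows the shape of the pattern, where the instance is built on first use and every later batch gets the same object back.

```java
/** Hypothetical illustration of a lazily instantiated singleton (no Spark dependency). */
final class SessionSingleton {

    /** Hypothetical stand-in for a session object; in Spark this role is played by SparkSession. */
    static final class FakeSession {
        final String appName;

        FakeSession(String appName) {
            this.appName = appName;
        }
    }

    private static FakeSession instance; // stays null until first requested
    private static int buildCount = 0;   // counts how often the session was actually built

    private SessionSingleton() {
    }

    /** Returns the shared session, creating it on the first call only. */
    static synchronized FakeSession getInstance(String appName) {
        if (instance == null) {
            instance = new FakeSession(appName);
            buildCount++;
        }
        return instance;
    }

    static synchronized int buildCount() {
        return buildCount;
    }

    public static void main(String[] args) {
        // Each "batch" asks for the session; only the first request builds it.
        for (int batch = 0; batch < 3; batch++) {
            FakeSession session = getInstance("kafka-sandbox");
            System.out.println("batch " + batch + " uses session for " + session.appName);
        }
        System.out.println("sessions built: " + buildCount());
    }
}
```

In the streaming job, the equivalent of `getInstance` would be called at the top of the foreachRDD body, so a driver restarted from a checkpoint rebuilds the session on its first batch instead of relying on an object captured before the restart.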
Related
Hi guys, I am looking for help deleting an index in ES7 using Spark. I could not find any example of this on Google. If you know of one, please help me.
SparkConf conf = new SparkConf().setAppName("es7").setMaster("local[*]");
conf.set("es.index.auto.create", "true");
conf.set("es.index.read.missing.as.empty", "false");
conf.set("es.resource", "employeeindex/_doc");
conf.set("es.query", "?q=me*");
JavaSparkContext jsc = new JavaSparkContext(conf);
Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");
JavaRDD<Map<String, ?>> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "employeeindex/_doc");
JavaEsSpark.saveToEs(javaRDD, "employeeindex/_doc", ImmutableMap.of("es.mapping.id", "id"));
JavaPairRDD<String, Map<String, Object>> esRDD =
JavaEsSpark.esRDD(jsc, "employeeindex/_doc");
esRDD.collect();
I'm wondering if we can perform Batch Write/Update with low-level API for DynamoDB for java.
Thanks in advance!
Yes. Something like this:
Map<String, List<WriteRequest>> writeRequestItems = new HashMap<String, List<WriteRequest>>();
Map<String, AttributeValue> userItem1 = new HashMap<String, AttributeValue>();
userItem1.put("userId", new AttributeValue().withS("1"));
userItem1.put("name", new AttributeValue().withS("Alex"));
Map<String, AttributeValue> userItem2 = new HashMap<String,AttributeValue>();
userItem2.put("userId", new AttributeValue().withS("2"));
userItem2.put("name", new AttributeValue().withS("Jonh"));
List<WriteRequest> userList = new ArrayList<WriteRequest>();
userList.add(new WriteRequest().withPutRequest(new PutRequest().withItem(userItem1)));
userList.add(new WriteRequest().withPutRequest(new PutRequest().withItem(userItem2)));
writeRequestItems.put("User", userList);
BatchWriteItemRequest batchWriteItemRequest = new BatchWriteItemRequest(writeRequestItems);
BatchWriteItemResult batchWriteItemResult = dynamoDBClient.batchWriteItem(batchWriteItemRequest);
Yes. You can use AmazonDynamoDB class to perform these operations.
Check http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/dynamodbv2/AmazonDynamoDB.html#batchWriteItem-com.amazonaws.services.dynamodbv2.model.BatchWriteItemRequest-
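Two limits are worth keeping in mind with batchWriteItem: a single call accepts at most 25 put or delete requests, and it has no update operation (updates still go through updateItem one item at a time). A minimal sketch of splitting a larger list into batches of 25 follows; it is plain Java with no AWS dependency, and `BatchChunks` and `partition` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a list of pending writes into fixed-size batches. */
final class BatchChunks {

    /** Assumed to match the BatchWriteItem cap of 25 write requests per call. */
    static final int MAX_BATCH_SIZE = 25;

    private BatchChunks() {
    }

    /** Splits items into consecutive chunks of at most size elements, preserving order. */
    static <T> List<List<T>> partition(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += size) {
            int end = Math.min(start + size, items.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> userIds = new ArrayList<>();
        for (int i = 1; i <= 60; i++) {
            userIds.add("user-" + i);
        }
        for (List<String> chunk : partition(userIds, MAX_BATCH_SIZE)) {
            // In real code each chunk would become the WriteRequest list of one BatchWriteItemRequest.
            System.out.println("would send " + chunk.size() + " put requests");
        }
    }
}
```

Each chunk would then be turned into a list of WriteRequest objects as in the answer above. The result of every call should also be checked for unprocessed items, which need to be resubmitted.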
I'm trying to use Spark (Java API) to take an in-memory Map (that potentially contains other nested Maps as its values) and convert it into a dataframe. I think I need something along these lines:
Map myMap = getSomehow();
RDD myRDD = sparkContext.makeRDD(myMap); // ???
DataFrame df = sparkContext.read(myRDD); // ???
But I'm having a tough time seeing the forest for the trees here... any ideas? Again, this might be a Map<String,String> or a Map<String,Map>, where there could be several nested layers of maps-inside-of-maps-inside-of-maps, etc.
So I tried something. I'm not sure this is the most efficient way to do it, but I do not see any other option right now.
SparkConf sf = new SparkConf().setAppName("name").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(sf);
SQLContext sqlCon = new SQLContext(sc);
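// Build the inner map first so it exists before it is put into the outer map.
HashMap<String, String> putMap = new HashMap<String, String>();
putMap.put("1", "test");
Map<String, HashMap<String, String>> map = new HashMap<String, HashMap<String, String>>();
map.put("test1", putMap);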
List<Tuple2<String, HashMap>> list = new ArrayList<Tuple2<String, HashMap>>();
Set<String> allKeys = map.keySet();
for (String key : allKeys) {
list.add(new Tuple2<String, HashMap>(key, (HashMap) map.get(key)));
}
JavaRDD<Tuple2<String, HashMap>> rdd = sc.parallelize(list);
System.out.println(rdd.first());
List<StructField> fields = new ArrayList<>();
StructField field1 = DataTypes.createStructField("String", DataTypes.StringType, true);
StructField field2 = DataTypes.createStructField("Map",
DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType), true);
fields.add(field1);
fields.add(field2);
StructType struct = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = rdd.map(new Function<Tuple2<String, HashMap>, Row>() {
@Override
public Row call(Tuple2<String, HashMap> arg0) throws Exception {
return RowFactory.create(arg0._1, arg0._2);
}
});
DataFrame df = sqlCon.createDataFrame(rowRDD, struct);
df.show();
In this scenario I assumed that the Map in the DataFrame is of type (String, String). Hope this helps!
Edit: Obviously you can delete all the prints. I did this for visualization purposes!
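The schema above covers one level of String-to-String maps. For the arbitrarily nested case from the question, one possible approach (an assumption on my part, not something the code above does) is to flatten the nested maps into dotted keys first, so that every value fits the (String, String) map type. `MapFlattener` and the "." separator are hypothetical choices made only for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that flattens maps-inside-of-maps into a single String-to-String map. */
final class MapFlattener {

    private MapFlattener() {
    }

    /** Returns a flat map whose keys are the dot-joined paths to each non-map value. */
    static Map<String, String> flatten(Map<String, ?> nested) {
        Map<String, String> flat = new LinkedHashMap<>();
        flattenInto("", nested, flat);
        return flat;
    }

    private static void flattenInto(String prefix, Map<?, ?> source, Map<String, String> target) {
        for (Map.Entry<?, ?> entry : source.entrySet()) {
            String key = prefix.isEmpty()
                    ? String.valueOf(entry.getKey())
                    : prefix + "." + entry.getKey();
            Object value = entry.getValue();
            if (value instanceof Map) {
                // Recurse into the nested map, extending the key path.
                flattenInto(key, (Map<?, ?>) value, target);
            } else {
                // Leaf value: stored as text (a null value becomes the string "null").
                target.put(key, String.valueOf(value));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put("1", "test");
        Map<String, Object> outer = new LinkedHashMap<>();
        outer.put("test1", inner);
        outer.put("plain", "value");
        System.out.println(flatten(outer));
    }
}
```

The trade-off is that the nesting survives only in the key names, and empty nested maps disappear entirely. If the structure itself has to be queryable, a schema with nested struct or map types would be needed instead.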
I am trying a simple example but I am unable to use the Graph API to generate a graph using the following code:
SparkConf conf = new SparkConf().setAppName("GGenerate").setMaster("local").set("spark.cores.max", "10");
JavaSparkContext context = new JavaSparkContext(conf);
List<Tuple2<Long,String>> l1=new ArrayList<Tuple2<Long, String>>();
l1.add(new Tuple2<Long, String>(1L,"Alice"));
l1.add(new Tuple2<Long, String>(2L, "Bob"));
l1.add(new Tuple2<Long, String>(3L, "Charlie"));
JavaRDD<Tuple2<Long,String>> vert=context.parallelize(l1);
List<Tuple3<Long,Long,String>> rd=new ArrayList<Tuple3<Long,Long,String>>();
rd.add(new Tuple3(1L,2L,"worker"));
rd.add(new Tuple3(2L, 3L, "friend"));
JavaRDD<Tuple3<Long, Long, String>> edge=context.parallelize(rd);
As part of my project, I have to create a SQL query interface for a very large Cassandra dataset. I have therefore been looking at different ways of executing SQL queries on Cassandra column families using Spark, and I have come up with 3 different methods:
using Spark SQLContext with a statically defined schema
// statically defined in the application
public static class TableTuple implements Serializable {
private int id;
private String line;
TableTuple (int i, String l) {
id = i;
line = l;
}
// getters and setters
...
}
and I consume the definition as:
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<CassandraRow> rowrdd = javaFunctions(sc).cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
JavaRDD<TableTuple> rdd = rowrdd.map(row -> new TableTuple(row.getInt(0), row.getString(1)));
DataFrame dataFrame = sqlContext.createDataFrame(rdd, TableTuple.class);
dataFrame.registerTempTable("lines");
DataFrame resultsFrame = sqlContext.sql("Select line from lines where id=1");
System.out.println(Arrays.asList(resultsFrame.collect()));
using Spark SQLContext with a dynamically defined schema
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
SQLContext sqlContext = new SQLContext(sc);
JavaRDD<CassandraRow> cassandraRdd = javaFunctions(sc).cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
JavaRDD<Row> rdd = cassandraRdd.map(row -> RowFactory.create(row.getInt(0), row.getString(1)));
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
fields.add(DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
DataFrame dataFrame = sqlContext.createDataFrame(rdd, schema);
dataFrame.registerTempTable("lines");
DataFrame resultDataFrame = sqlContext.sql("select line from lines where id = 1");
System.out.println(Arrays.asList(resultDataFrame.collect()));
using CassandraSQLContext from the spark-cassandra-connector
SparkConf conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CASSANDRA_HOST)
.setJars(jars);
SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
CassandraSQLContext sqlContext = new CassandraSQLContext(sc);
DataFrame resultsFrame = sqlContext.sql("Select line from " + CASSANDRA_KEYSPACE + "." + CASSANDRA_COLUMN_FAMILY + " where id = 1");
System.out.println(Arrays.asList(resultsFrame.collect()));
I would like to know the advantages and disadvantages of each method compared with the others. Also, for the CassandraSQLContext method, are queries limited to CQL, or is it fully compatible with Spark SQL? I would also like an analysis pertaining to my specific use case: I have a Cassandra column family with ~17.6 million tuples and 62 columns. For querying such a large database, which method is most suitable?