I am trying a simple example but I am unable to use the Graph API to generate a graph using the following code:
SparkConf conf = new SparkConf().setAppName("GGenerate").setMaster("local").set("spark.cores.max", "10");
JavaSparkContext context = new JavaSparkContext(conf);
List<Tuple2<Long,String>> l1=new ArrayList<Tuple2<Long, String>>();
l1.add(new Tuple2<Long, String>(1L,"Alice"));
l1.add(new Tuple2<Long, String>(2L, "Bob"));
l1.add(new Tuple2<Long, String>(3L, "Charlie"));
JavaRDD<Tuple2<Long,String>> vert=context.parallelize(l1);
List<Tuple3<Long,Long,String>> rd=new ArrayList<Tuple3<Long,Long,String>>();
rd.add(new Tuple3<Long, Long, String>(1L, 2L, "worker"));
rd.add(new Tuple3<Long, Long, String>(2L, 3L, "friend"));
JavaRDD<Tuple3<Long, Long, String>> edge=context.parallelize(rd);
Hi all, I am looking for a way to delete an index in Elasticsearch 7 (ES7) from Spark. I could not find any example of this online. If you know of one, please help me.
SparkConf conf = new SparkConf().setAppName("es7").setMaster("local[*]");
conf.set("es.index.auto.create", "true");
conf.set("es.index.read.missing.as.empty", "false");
conf.set("es.resource", "employeeindex/_doc");
conf.set("es.query", "?q=me*");
JavaSparkContext jsc = new JavaSparkContext(conf);
Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");
JavaRDD<Map<String, ?>> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "employeeindex/_doc");
JavaEsSpark.saveToEs(javaRDD, "employeeindex/_doc", ImmutableMap.of("es.mapping.id", "id"));
JavaPairRDD<String, Map<String, Object>> esRDD =
JavaEsSpark.esRDD(jsc, "employeeindex/_doc");
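One possible approach, offered as a sketch rather than a confirmed answer: JavaEsSpark is used above for reading and writing documents, and I am not aware of an index-management call in it, so the driver program could call Elasticsearch's delete index REST endpoint (DELETE /<index>) directly. The example below uses the Java 11 java.net.http client. The class name EsIndexDeleter and the URL http://localhost:9200 are assumptions for illustration, and an unsecured cluster is assumed.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; not part of elasticsearch-hadoop.
class EsIndexDeleter {

    // Builds the request for the delete index API: DELETE /<index>.
    static HttpRequest buildDeleteRequest(String esBaseUrl, String indexName) {
        return HttpRequest.newBuilder()
                .uri(URI.create(esBaseUrl + "/" + indexName))
                .DELETE()
                .build();
    }

    // Sends the request from the driver and returns the HTTP status code.
    static int deleteIndex(String esBaseUrl, String indexName) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                buildDeleteRequest(esBaseUrl, indexName),
                HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

Calling EsIndexDeleter.deleteIndex("http://localhost:9200", "employeeindex") on the driver, outside any RDD operation, would then remove the index used in the question. A secured cluster would additionally need authentication headers.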
I'm wondering if we can perform batch write/update operations with the low-level DynamoDB API for Java.
Thanks in advance!
Yes. Something like this:
Map<String, List<WriteRequest>> writeRequestItems = new HashMap<String, List<WriteRequest>>();
Map<String, AttributeValue> userItem1 = new HashMap<String, AttributeValue>();
userItem1.put("userId", new AttributeValue().withS("1"));
userItem1.put("name", new AttributeValue().withS("Alex"));
Map<String, AttributeValue> userItem2 = new HashMap<String,AttributeValue>();
userItem2.put("userId", new AttributeValue().withS("2"));
userItem2.put("name", new AttributeValue().withS("John"));
List<WriteRequest> userList = new ArrayList<WriteRequest>();
userList.add(new WriteRequest().withPutRequest(new PutRequest().withItem(userItem1)));
userList.add(new WriteRequest().withPutRequest(new PutRequest().withItem(userItem2)));
writeRequestItems.put("User", userList);
BatchWriteItemRequest batchWriteItemRequest = new BatchWriteItemRequest(writeRequestItems);
BatchWriteItemResult batchWriteItemResult = dynamoDBClient.batchWriteItem(batchWriteItemRequest);
Yes. You can use AmazonDynamoDB class to perform these operations.
Check http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/dynamodbv2/AmazonDynamoDB.html#batchWriteItem-com.amazonaws.services.dynamodbv2.model.BatchWriteItemRequest-
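Two points worth keeping in mind with batchWriteItem: a single call accepts at most 25 put or delete requests, and it only puts or deletes items (an in-place update still needs UpdateItem per item). Requests that were not processed come back in the result's getUnprocessedItems() and should be retried. Below is a minimal sketch of the chunking step only, in plain Java so it runs without the SDK; BatchChunker and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for splitting write requests into DynamoDB-sized batches.
class BatchChunker {

    // BatchWriteItem accepts at most 25 put/delete requests per call.
    static final int MAX_BATCH_SIZE = 25;

    // Splits items into consecutive sublists of at most `size` elements.
    static <T> List<List<T>> chunk(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += size) {
            int end = Math.min(start + size, items.size());
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Integers stand in for WriteRequest objects here.
        List<Integer> requests = new ArrayList<>();
        for (int i = 0; i < 60; i++) {
            requests.add(i);
        }
        for (List<Integer> batch : chunk(requests, MAX_BATCH_SIZE)) {
            System.out.println("would send a batch of " + batch.size());
        }
    }
}
```

With the SDK, each chunk of WriteRequest objects would go into its own BatchWriteItemRequest, as in the answer above.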
I'm trying to use Spark (Java API) to take an in-memory Map (that potentially contains other nested Maps as its values) and convert it into a dataframe. I think I need something along these lines:
Map myMap = getSomehow();
RDD myRDD = sparkContext.makeRDD(myMap); // ???
DataFrame df = sparkContext.read(myRDD); // ???
But I'm having a tough time seeing the forest for the trees here... any ideas? Again, this might be a Map<String,String> or a Map<String,Map>, where there could be several nested layers of maps inside of maps inside of maps, etc.
So I tried something, not sure if this is the most efficient option to do it, but I do not see any other right now.
SparkConf sf = new SparkConf().setAppName("name").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(sf);
SQLContext sqlCon = new SQLContext(sc);
// Build the inner map first so it exists before it is put into the outer map
HashMap putMap = new HashMap<String, String>();
putMap.put("1", "test");
Map map = new HashMap<String, Map<String, String>>();
map.put("test1", putMap);
// Turn every map entry into a (key, value) tuple
List<Tuple2<String, HashMap>> list = new ArrayList<Tuple2<String, HashMap>>();
Set<String> allKeys = map.keySet();
for (String key : allKeys) {
    list.add(new Tuple2<String, HashMap>(key, (HashMap) map.get(key)));
}
JavaRDD<Tuple2<String, HashMap>> rdd = sc.parallelize(list);
// Schema: one String column and one Map<String, String> column
List<StructField> fields = new ArrayList<>();
StructField field1 = DataTypes.createStructField("String", DataTypes.StringType, true);
StructField field2 = DataTypes.createStructField("Map",
        DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType), true);
fields.add(field1);
fields.add(field2);
StructType struct = DataTypes.createStructType(fields);
// Convert each tuple into a Row matching the schema
JavaRDD<Row> rowRDD = rdd.map(new Function<Tuple2<String, HashMap>, Row>() {
    public Row call(Tuple2<String, HashMap> arg0) throws Exception {
        return RowFactory.create(arg0._1, arg0._2);
    }
});
DataFrame df = sqlCon.createDataFrame(rowRDD, struct);
In this scenario I assumed that the Map in the DataFrame is of type (String, String). Hope this helps!
Edit: Obviously you can delete all the prints. I did this for visualization purposes!
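The answer above assumes a (String, String) map. For the deeper nesting mentioned in the question, one option (an assumption on my part, not something the answer states) is to flatten the nested maps into dotted keys first, so that every value fits a single Map<String, String> column. The sketch below is plain Java; MapFlattener is a made-up name, and non-map values are simply converted with String.valueOf.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that flattens nested maps into "a.b.c" style keys.
class MapFlattener {

    static Map<String, String> flatten(Map<String, ?> source) {
        Map<String, String> out = new LinkedHashMap<>();
        flattenInto("", source, out);
        return out;
    }

    // Recurses into map values; every other value is stored as a String.
    private static void flattenInto(String prefix, Map<?, ?> source, Map<String, String> out) {
        for (Map.Entry<?, ?> entry : source.entrySet()) {
            String key = prefix.isEmpty()
                    ? String.valueOf(entry.getKey())
                    : prefix + "." + entry.getKey();
            Object value = entry.getValue();
            if (value instanceof Map) {
                flattenInto(key, (Map<?, ?>) value, out);
            } else {
                out.put(key, String.valueOf(value));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put("city", "Paris");
        Map<String, Object> outer = new LinkedHashMap<>();
        outer.put("name", "Alice");
        outer.put("address", inner);
        System.out.println(flatten(outer));
    }
}
```

The flattened map could then be used as the map column value in the Row. The trade-off is that the nesting survives only in the key names, so this suits cases where a flat lookup is enough.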
I am trying to read JSON records, produced by Kafka, in Spark using SQLContext.read(). Every time, a NullPointerException is thrown.
SparkConf conf = new SparkConf()
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));
Set<String> topics = Collections.singleton(topicString);
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", servers);
JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(
ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
kafkaParams, topics);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
directKafkaStream
    .map(message -> message._2)
    .foreachRDD(rdd -> {
        rdd.foreach(record -> {
            Dataset<Row> ds = sqlContext.read().json(rdd);
        });
    });
Here is a log:
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:112)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:110)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:535)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:595)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:504)
at SparkJSONConsumer$1.lambda$2(SparkJSONConsumer.java:73)
at SparkJSONConsumer$1$$Lambda$8/1821075039.call(Unknown Source)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:350)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:875)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:875)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I assume the problem is due to the foreachRDD clause, but I can't figure it out. So any suggestions would be great.
Also, I am using sqlContext because I plan to serialize the records in Avro format afterwards ("com.databricks.spark.avro"). If there is a way to serialize a string containing a JSON structure to Avro format without defining the schema, you are very welcome to share it!
Thanks in advance.
As mentioned in the Spark documentation -
You have to create a SparkSession using the SparkContext that the StreamingContext is using. Furthermore, this has to be done such that it can be restarted on driver failures. This is done by creating a lazily instantiated singleton instance of SparkSession.
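Create the SQLContext as shown below, inside foreachRDD and just before reading the JSON. Do not use it inside rdd.foreach: that lambda runs on the executors, where the session state is not available, which is where the NullPointerException comes from.
SQLContext sqlContext = SparkSession.builder().config(rdd.context().getConf()).getOrCreate().sqlContext();
Dataset<Row> ds = sqlContext.read().json(rdd);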
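To illustrate what "lazily instantiated singleton" means in the quoted documentation, here is a minimal sketch of the pattern in plain Java. FakeSession is a hypothetical stand-in so the example runs without Spark; in a real job the creation line would be the SparkSession.builder().config(conf).getOrCreate() call, and SessionSingleton is likewise a made-up name.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for SparkSession so the sketch runs without Spark.
class FakeSession {
    // Counts how many sessions were ever constructed.
    static final AtomicInteger created = new AtomicInteger();

    FakeSession() {
        created.incrementAndGet();
    }
}

class SessionSingleton {
    private static FakeSession instance;

    // Creates the session on first use and returns the same object afterwards.
    static synchronized FakeSession getInstance() {
        if (instance == null) {
            instance = new FakeSession();
        }
        return instance;
    }

    public static void main(String[] args) {
        FakeSession first = getInstance();
        FakeSession second = getInstance();
        System.out.println("same instance: " + (first == second));
    }
}
```

Because the instance is created on demand rather than captured in a closure, it can be re-created after a driver restart, which is the property the documentation asks for.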
I need help using the MongoDB aggregation framework with the Java driver.
I don't understand how to write my request, even with this documentation.
I want to get the 200 oldest views from all items in my collection. Here is my mongo query (which works like I want in console mode):
{$unwind : "$views"},
{$match : {"views.isActive" : true}},
{$sort : {"views.date" : 1}},
{$limit : 200},
{$project : {"_id" : 0, "url" : "$views.url", "date" : "$views.date"}}
Items in this collection have one or many views.
My question is not about the query result; I want to know the Java syntax.
Finally found the solution; I get the same result as with the original request.
Mongo Driver 3 :
Aggregate doc
MongoCollection<Document> collection = database.getCollection("myCollection");
AggregateIterable<Document> output = collection.aggregate(Arrays.asList(
        new Document("$unwind", "$views"),
        new Document("$match", new Document("views.isActive", true)),
        new Document("$sort", new Document("views.date", 1)),
        new Document("$limit", 200),
        new Document("$project", new Document("_id", 0)
                .append("url", "$views.url")
                .append("date", "$views.date"))
));
// Print for demo
for (Document dbObject : output) {
    System.out.println(dbObject);
}
You can make it more readable with a static import:
import static com.mongodb.client.model.Aggregates.*;
See koulini's answer for a complete example.
Mongo Driver 2 :
Aggregate doc
Iterable<DBObject> output = collection.aggregate(Arrays.asList(
        (DBObject) new BasicDBObject("$unwind", "$views"),
        (DBObject) new BasicDBObject("$match", new BasicDBObject("views.isActive", true)),
        (DBObject) new BasicDBObject("$sort", new BasicDBObject("views.date", 1)),
        (DBObject) new BasicDBObject("$limit", 200),
        (DBObject) new BasicDBObject("$project", new BasicDBObject("_id", 0)
                .append("url", "$views.url")
                .append("date", "$views.date"))
)).results();
// Print for demo
for (DBObject dbObject : output) {
    System.out.println(dbObject);
}
Query conversion logic:
Thanks to this link
It is worth pointing out that you can greatly improve the code shown in the answers here by using the Java aggregation helper methods for MongoDB.
Let's take the OP's answer to their own question as an example.
AggregateIterable<Document> output = collection.aggregate(Arrays.asList(
        new Document("$unwind", "$views"),
        new Document("$match", new Document("views.isActive", true)),
        new Document("$sort", new Document("views.date", 1)),
        new Document("$limit", 200),
        new Document("$project", new Document("_id", 0)
                .append("url", "$views.url")
                .append("date", "$views.date"))
));
We can rewrite the above code as follows:
import static com.mongodb.client.model.Aggregates.*;
AggregateIterable<Document> output = collection.aggregate(Arrays.asList(
        unwind("$views"),
        match(new Document("views.isActive", true)),
        sort(new Document("views.date", 1)),
        limit(200),
        project(new Document("_id", 0)
                .append("url", "$views.url")
                .append("date", "$views.date"))
));
Obviously, you will need the corresponding static import but beyond that, the code in the second example is cleaner, safer (as you don't have to type the operators yourself every time), more readable and more beautiful IMO.
Using the previous example as a guide, here's how to do it with Mongo driver 3 and up:
MongoCollection<Document> collection = database.getCollection("myCollection");
AggregateIterable<Document> output = collection.aggregate(Arrays.asList(
        new Document("$unwind", "$views"),
        new Document("$match", new Document("views.isActive", true))
));
for (Document doc : output) {
    // handle each result document here
}
Here is a simple way to count employees by departmentId.
Details at: Aggregation using Java API
Map<Long, Integer> empCountMap = new HashMap<>();
AggregateIterable<Document> iterable = getMongoCollection().aggregate(Arrays.asList(
new Document("$match",
new Document("active", Boolean.TRUE)
.append("region", "India")),
new Document("$group",
new Document("_id", "$" + "deptId").append("count", new Document("$sum", 1)))));
iterable.forEach(new Block<Document>() {
    @Override
    public void apply(final Document document) {
        empCountMap.put((Long) document.get("_id"), (Integer) document.get("count"));
    }
});
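For readers new to $match and $group, it may help to see what the pipeline above computes, expressed over an in-memory list with Java streams. This is only an illustration of the semantics and not MongoDB code: the Employee class and its fields are hypothetical stand-ins for the documents in the collection.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory stand-in for an employee document.
class Employee {
    final long deptId;
    final String region;
    final boolean active;

    Employee(long deptId, String region, boolean active) {
        this.deptId = deptId;
        this.region = region;
        this.active = active;
    }
}

class DeptCounter {

    // filter(...) plays the role of $match; groupingBy + counting plays the
    // role of $group with {$sum: 1}.
    static Map<Long, Long> countActiveByDept(List<Employee> employees, String region) {
        return employees.stream()
                .filter(e -> e.active && region.equals(e.region))
                .collect(Collectors.groupingBy(e -> e.deptId, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Employee> employees = Arrays.asList(
                new Employee(10L, "India", true),
                new Employee(10L, "India", true),
                new Employee(20L, "India", false),
                new Employee(20L, "India", true),
                new Employee(30L, "France", true));
        System.out.println(countActiveByDept(employees, "India"));
    }
}
```

The aggregation pipeline does the same filtering and counting on the server, so only one small document per department travels back to the client.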