Spark : dataframe - how to properly parallelize the execution

I have a Spark Streaming application that reads from Kafka with a batch interval of 5 minutes.
My application stores the input in a DataFrame and is configured to execute aggregation queries on it (~1000 queries).
Current solution: from the driver, I run a for loop over the list of 1000 queries and execute them one by one.
Problem: executing that many queries sequentially takes a huge amount of time.
Is there a way to make this process faster?
SparkConf sparkConf = new SparkConf();
messagesFromKafka.foreachRDD((VoidFunction2<JavaRDD<String>, Time>) (rdd, time) -> {
//...
SparkSession spark = JavaSparkSessionSingleton.getInstance(sparkConf);
// Convert RDD[String]
JavaRDD<Bean> rddRow = rdd.map((Function<String, Bean>) line -> {
Bean row = new Bean();
row.setFiled1(line.split(";")[0]);
row.setFiled2(line.split(";")[1]);
//..
return row;
});
// ...
Dataset<Row> ds = spark.createDataFrame(rddRow, Bean.class);
// Prepare List of query ...
List<String> listQuery = new ArrayList<>();
listQuery.add("select sum(..) group by filed1...");
listQuery.add("select avg (..) group by field2...");
// perform aggregation with key query
for (String query : listQuery) {
Dataset<Row> dsResult = spark.sql(query);
}
});
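One commonly suggested direction is to submit the queries from several driver threads instead of a sequential loop, since Spark's scheduler accepts jobs from multiple threads. The block below is only a minimal sketch of that idea in plain Java: `ConcurrentQueryRunner`, `runAll`, and the `runner` parameter are hypothetical names, and `runner` stands in for something like `q -> spark.sql(q).collectAsList()`. Whether this actually helps depends on cluster resources and on whether the input DataFrame is cached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper (not a Spark API): runs every query on a fixed-size thread pool.
final class ConcurrentQueryRunner {

    // `runner` stands in for something like q -> spark.sql(q).collectAsList().
    // The action has to run inside the task, because spark.sql(...) alone is lazy.
    static <R> List<R> runAll(List<String> queries, Function<String, R> runner, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String query : queries) {
                Callable<R> task = () -> runner.apply(query);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks; results keep submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a query", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Inside `foreachRDD`, the loop could then become a single call such as `ConcurrentQueryRunner.runAll(listQuery, q -> spark.sql(q).collectAsList(), 8)`, where the pool size of 8 is an arbitrary assumption to tune.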

Related

Multiple select queries in single job on flink table API

If I run two different select queries on a Flink table created from a DataStream, the Blink planner runs them as two different jobs. Is there a way to combine them and run them as a single job?
Example code :
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
System.out.println("Running credit scores : ");
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
DataStream<String> recordsStream =
env.readTextFile("src/main/resources/credit_trial.csv");
DataStream<CreditRecord> creditStream = recordsStream
.filter((FilterFunction<String>) line -> !line.contains(
"Loan ID,Customer ID,Loan Status,Current Loan Amount,Term,Credit Score,Annual Income,Years in current job" +
",Home Ownership,Purpose,Monthly Debt,Years of Credit History,Months since last delinquent,Number of Open Accounts," +
"Number of Credit Problems,Current Credit Balance,Maximum Open Credit,Bankruptcies,Tax Liens"))
.map(new MapFunction<String, CreditRecord>() {
@Override
public CreditRecord map(String s) throws Exception {
String[] fields = s.split(",");
return new CreditRecord(fields[0], fields[2], Double.parseDouble(fields[3]),
fields[4], fields[5].trim().equals("")?0.0: Double.parseDouble(fields[5]),
fields[6].trim().equals("")?0.0:Double.parseDouble(fields[6]),
fields[8], Double.parseDouble(fields[15]));
}
});
tableEnv.createTemporaryView("CreditDetails", creditStream);
Table creditDetailsTable = tableEnv.from("CreditDetails");
Table resultsTable = creditDetailsTable.select($("*"))
.filter($("loanStatus").isEqual("Charged Off"));
TableResult result = resultsTable.execute();
result.print();
Table resultsTable2 = creditDetailsTable.select($("*"))
.filter($("loanStatus").isEqual("Fully Paid"));
TableResult result2 = resultsTable2.execute();
result2.print();
The code above creates two separate jobs, which I want to avoid. Is there a way around this?

Apache Spark -- Java , Group Live Stream data

I am trying to get live JSON data from RabbitMQ into Apache Spark using Java and run some real-time analytics on it.
I am able to get the data and run some basic SQL queries on it, but I cannot figure out the grouping part.
Below is the JSON I have:
{"DeviceId":"MAC-101","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-101","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-102","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-102","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
I would like to group the records by device id, so that I can run and gather analytics for individual devices. Below is the sample code snippet I am trying:
public static void main(String[] args) {
try {
mconf = new SparkConf();
mconf.setAppName("RabbitMqReceiver");
mconf.setMaster("local[*]");
jssc = new JavaStreamingContext(mconf,Durations.seconds(10));
SparkSession spksess = SparkSession
.builder()
.master("local[*]")
.appName("RabbitMqReceiver2")
.getOrCreate();
SQLContext sqlctxt = new SQLContext(spksess);
JavaReceiverInputDStream<String> jsonData = jssc.receiverStream(
new mqreceiver(StorageLevel.MEMORY_AND_DISK_2()));
//jsonData.print();
JavaDStream<String> machineData = jsonData.window(Durations.minutes(1), Durations.seconds(20));
machineData.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> rdd) {
if(!rdd.isEmpty()){
Dataset<Row> data = sqlctxt.read().json(rdd);
//Dataset<Row> data = spksess.read().json(rdd).select("*");
data.createOrReplaceTempView("DeviceData");
data.printSchema();
//data.show(false);
// The below select query works
//Dataset<Row> groupedData = sqlctxt.sql("select * from DeviceData where DeviceId='MAC-101'");
// The below sql fails...
Dataset<Row> groupedData = sqlctxt.sql("select * from DeviceData GROUP BY DeviceId");
groupedData.show();
}
}
});
jssc.start();
jssc.awaitTermination();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
What I am looking to do with the streamed data is to push the incoming records into individual buckets.
Let's say we have the incoming data from RabbitMQ shown above.
What I want is either a single key/value collection with the device id as the key and a List as the value,
or some kind of individual dynamic collection for each device id.
Can we do something like the code below (from http://backtobazics.com/big-data/spark/apache-spark-groupby-example/)?
public class GroupByExample {
public static void main(String[] args) throws Exception {
JavaSparkContext sc = new JavaSparkContext();
// Parallelized with 3 partitions
JavaRDD<String> rddX = sc.parallelize(
Arrays.asList("Joseph", "Jimmy", "Tina",
"Thomas", "James", "Cory",
"Christine", "Jackeline", "Juan"), 3);
JavaPairRDD<Character, Iterable<String>> rddY = rddX.groupBy(word -> word.charAt(0));
System.out.println(rddY.collect());
}
}
So in our case we would need to pass a grouping function based on DeviceId.
Working code:
JavaDStream<String> strmData = jssc.receiverStream(
new mqreceiver(StorageLevel.MEMORY_AND_DISK_2()));
//This is just a sliding window i have kept
JavaDStream<String> machineData = strmData.window(Durations.minutes(1), Durations.seconds(10));
machineData.print();
JavaPairDStream<String, String> pairedData = machineData.mapToPair(s -> new Tuple2<String, String>(s.substring(5, 10) , new String(s)));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
groupedData.print();
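As a hedged illustration of the "device id as key, list as value" bucket described in the question, here is a minimal plain-Java sketch. It assumes every line contains a `"DeviceId":"..."` pair as in the sample JSON; `DeviceBuckets`, `deviceIdOf`, and `groupByDevice` are hypothetical names, and a real JSON parser would be more robust than the regular expression used here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: buckets raw JSON lines by their DeviceId value.
final class DeviceBuckets {

    // Assumes the id appears as "DeviceId":"<value>", as in the sample records.
    private static final Pattern DEVICE_ID =
            Pattern.compile("\"DeviceId\"\\s*:\\s*\"([^\"]*)\"");

    // Returns the DeviceId of one JSON line, or an empty string if none is found.
    static String deviceIdOf(String json) {
        Matcher m = DEVICE_ID.matcher(json);
        return m.find() ? m.group(1) : "";
    }

    // Groups lines by device id, keeping the order in which ids first appear.
    static Map<String, List<String>> groupByDevice(List<String> lines) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String line : lines) {
            buckets.computeIfAbsent(deviceIdOf(line), k -> new ArrayList<>()).add(line);
        }
        return buckets;
    }
}
```

In the streaming code, the same key extraction could presumably be used as `machineData.mapToPair(s -> new Tuple2<>(DeviceBuckets.deviceIdOf(s), s))`. Note that on the sample lines shown earlier, `s.substring(5, 10)` evaluates to `iceId` for every record, so all records would land under one key.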

This is because, in a query with GROUP BY, the select list can contain only:
columns listed in the GROUP BY clause
aggregations of any column
If you use "*", every column ends up in the select list, which is why the query fails. Change the query to, for example:
select DeviceId, count(distinct DeviceType) as deviceTypeCount from DeviceData group by DeviceId
and it will work, because it uses only the grouped column and columns inside aggregate functions.
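To make the rule concrete, here is a small plain-Java sketch of what the corrected query computes: each group collapses to a single output row, so only the grouping key and an aggregate over the group can be reported. `GroupByDemo` and `distinctTypesPerDevice` are hypothetical names, and each row is assumed to be a two-element array of `{DeviceId, DeviceType}`.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of:
// select DeviceId, count(distinct DeviceType) from DeviceData group by DeviceId
final class GroupByDemo {

    // Each row is assumed to be {DeviceId, DeviceType}.
    static Map<String, Integer> distinctTypesPerDevice(List<String[]> rows) {
        // Step 1: collect the distinct device types seen for every device id.
        Map<String, Set<String>> typesByDevice = new TreeMap<>();
        for (String[] row : rows) {
            typesByDevice.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
        }
        // Step 2: reduce every group to one value, the number of distinct types.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : typesByDevice.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }
}
```

A non-aggregated column such as the timestamp has many values per group, which is why it cannot appear in the output without an aggregate function.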

The GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns.

Spark - Java UDF returning multiple columns

I'm using Spark SQL 1.6.2 (Java API) and I have to process the following DataFrame, which has a list of values in each of 2 columns:
ID AttributeName AttributeValue
0 [an1,an2,an3] [av1,av2,av3]
1 [bn1,bn2] [bv1,bv2]
The desired table is:
ID AttributeName AttributeValue
0 an1 av1
0 an2 av2
0 an3 av3
1 bn1 bv1
1 bn2 bv2
I think I have to use a combination of the explode function and a custom UDF function.
I found the following resources:
Explode (transpose?) multiple columns in Spark SQL table
How do I call a UDF on a Spark DataFrame using JAVA?
and I can successfully run an example that reads the two columns and returns the concatenation of the first two strings in a column:
UDF2 combineUDF = new UDF2<Seq<String>, Seq<String>, String>() {
public String call(final Seq<String> col1, final Seq<String> col2) throws Exception {
return col1.apply(0) + col2.apply(0);
}
};
context.udf().register("combineUDF", combineUDF, DataTypes.StringType);
The problem is writing the signature of a UDF that returns two columns (in Java).
As far as I understand, I must define a new StructType like the one shown below and set it as the return type, but so far I haven't managed to get the final code working:
StructType retSchema = new StructType(new StructField[]{
new StructField("#AttName", DataTypes.StringType, true, Metadata.empty()),
new StructField("#AttValue", DataTypes.StringType, true, Metadata.empty()),
}
);
context.udf().register("combineUDF", combineUDF, retSchema);
Any help will be really appreciated.
UPDATE: I'm trying to implement first the zip(AttributeName,AttributeValue) so then I will need just to apply the standard explode function in sparkSql:
ID AttName_AttValue
0 [[an1,av1],[an2,av2],[an3,av3]]
1 [[bn1,bv1],[bn2,bv2]]
I built the following UDF:
UDF2 combineColumns = new UDF2<Seq<String>, Seq<String>, List<List<String>>>() {
public List<List<String>> call(final Seq<String> col1, final Seq<String> col2) throws Exception {
List<List<String>> zipped = new LinkedList<>();
for (int i = 0, listSize = col1.size(); i < listSize; i++) {
List<String> subRow = Arrays.asList(col1.apply(i), col2.apply(i));
zipped.add(subRow);
}
return zipped;
}
};
But when I run the code
myDF.select(callUDF("combineColumns", col("AttributeName"), col("AttributeValue"))).show(10);
I got the following error message:
scala.MatchError: [[an1,av1],[an1,av2],[an3,av3]] (of class java.util.LinkedList)
and it looks like the combining has been performed correctly but then the return type is not the expected one in Scala.
Any Help?

Finally I managed to get the result I was looking for, though probably not in the most efficient way.
Basically there are 2 steps:
Zip the two lists
Explode the zipped list into rows
For the first step I defined the following UDF:
UDF2 concatItems = new UDF2<Seq<String>, Seq<String>, Seq<String>>() {
public Seq<String> call(final Seq<String> col1, final Seq<String> col2) throws Exception {
ArrayList<String> zipped = new ArrayList<>();
for (int i = 0, listSize = col1.size(); i < listSize; i++) {
String subRow = col1.apply(i) + ";" + col2.apply(i);
zipped.add(subRow);
}
return scala.collection.JavaConversions.asScalaBuffer(zipped);
}
};
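The function also has to be registered (this step was missing above). Since the UDF returns a sequence of strings, the return type is an array of strings:
sparkSession.udf().register("concatItems", concatItems, DataTypes.createArrayType(DataTypes.StringType));
Then I called it with the following code: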
DataFrame df2 = df.select(col("ID"), callUDF("concatItems", col("AttributeName"), col("AttributeValue")).alias("AttName_AttValue"));
At this stage the df2 looks like that:
ID AttName_AttValue
0 [[an1,av1],[an2,av2],[an3,av3]]
1 [[bn1,bv1],[bn2,bv2]]
Then I used the explode function to turn the list into rows:
DataFrame df3 = df2.select(col("ID"),explode(col("AttName_AttValue")).alias("AttName_AttValue_row"));
At this stage the df3 looks like that:
ID AttName_AttValue
0 [an1,av1]
0 [an2,av2]
0 [an3,av3]
1 [bn1,bv1]
1 [bn2,bv2]
Finally to split the attribute name and value into two different columns, I converted the DataFrame into a JavaRDD in order to use the map function:
JavaRDD<Row> df3RDD = df3.toJavaRDD().map(
(Function<Row, Row>) myRow -> {
// Items were joined with ";" in concatItems, so split on the same separator
String[] info = String.valueOf(myRow.get(1)).split(";");
return RowFactory.create(myRow.get(0), info[0], info[1]);
}).cache();
If anybody has a better solution feel free to comment.
I hope it helps.
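For readers who want to see the two steps (zip, then explode) without the Spark machinery, here is a minimal plain-Java sketch of the same transformation on one input row. `ZipExplode` and `explode` are hypothetical names, and the sketch assumes both lists have the same length, which the answer's UDF also relies on implicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "zip two lists, then emit one row per pair".
final class ZipExplode {

    // Returns one {id, name, value} row for every position in the two lists.
    static List<String[]> explode(String id, List<String> names, List<String> values) {
        if (names.size() != values.size()) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            rows.add(new String[] {id, names.get(i), values.get(i)});
        }
        return rows;
    }
}
```

Applied to the first input row of the question (ID 0 with three names and three values), this yields three rows, one per attribute pair.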

MongoDB - merge collection and map - can performance be improved

The function below merges the contents of a map into a MongoDB word collection, like this:
Collection:
cat 3,
dog 5
Map:
dog 2,
zebra 1
Collection after merge:
cat 3,
dog 7,
zebra 1
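The intended merge semantics (increment the count if the word exists, insert it otherwise) can be sketched in memory as follows. This is only an illustration of the expected result, not of how MongoDB executes the upserts; `WordCountMerge` and `mergeInto` are hypothetical names.

```java
import java.util.Map;

// Hypothetical in-memory analogue of an upsert that increments "count".
final class WordCountMerge {

    // Adds every count from `incoming` into `stored`, inserting missing words.
    static void mergeInto(Map<String, Integer> stored, Map<String, Integer> incoming) {
        for (Map.Entry<String, Integer> e : incoming.entrySet()) {
            stored.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }
}
```

With the collection and map from the example above, `mergeInto` leaves cat at 3, raises dog from 5 to 7, and adds zebra with 1.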
We start with an empty collection and a map with about 14000 elements.
An Oracle PL/SQL procedure using a single MERGE statement, running on a 15k RPM hard disk, does this in less than a second.
MongoDB on an SSD needs about 53 seconds.
It looks like Oracle prepares an in-memory image of the file operation
and saves the result in one I/O operation.
MongoDB probably performs 14000 I/O operations, about 4 ms per insert, which corresponds to the performance of the SSD.
If I do just 14000 inserts, without checking whether the documents already exist as the merge does, everything is also fast: less than a second.
My questions:
Can the code be improved?
Is it perhaps necessary to change something in the MongoDB configuration?
Function code:
public void addBookInfo(String bookTitle, HashMap<String, Integer> bookInfo)
{
// insert information to the book collection
Document d = new Document();
d.append("book_title", bookTitle);
book.insertOne(d);
// insert information to the word collection
// prepare collection of word info and book_word info documents
List<Document> wordInfoToInsert = new ArrayList<Document>();
List<Document> book_wordInfoToInsert = new ArrayList<Document>();
for (String key : bookInfo.keySet())
{
Document d1 = new Document();
Document d2 = new Document();
d1.append("word", key);
d1.append("count", bookInfo.get(key));
wordInfoToInsert.add(d1);
d2.append("book_title", bookTitle);
d2.append("word", key);
d2.append("count", bookInfo.get(key));
book_wordInfoToInsert.add(d2);
}
// this is collection of insert/update DB operations
List<WriteModel<Document>> updates = new ArrayList<WriteModel<Document>>();
// iterator for collection of words
ListIterator<Document> listIterator = wordInfoToInsert.listIterator();
// generate list of insert/update operations
while (listIterator.hasNext())
{
d = listIterator.next();
String wordToUpdate = d.getString("word");
int countToAdd = d.getInteger("count").intValue();
updates.add(
new UpdateOneModel<Document>(
new Document("word", wordToUpdate),
new Document("$inc",new Document("count", countToAdd)),
new UpdateOptions().upsert(true)
)
);
}
// perform bulk operation
// this is the slow part
BulkWriteResult bulkWriteResult = word.bulkWrite(updates);
boolean acknowledge = bulkWriteResult.wasAcknowledged();
if (acknowledge)
System.out.println("Write acknowledged.");
else
System.out.println("Write was not acknowledged.");
boolean countInfo = bulkWriteResult.isModifiedCountAvailable();
if (countInfo)
System.out.println("Change counters available.");
else
System.out.println("Change counters not available.");
int inserted = bulkWriteResult.getInsertedCount();
int modified = bulkWriteResult.getModifiedCount();
System.out.println("inserted: " + inserted);
System.out.println("modified: " + modified);
// insert information to the book_word collection
// this is very fast
book_word.insertMany(book_wordInfoToInsert);
}

neo4j - batch insertion using neo4j rest graph db

I'm using version 2.0.1.
I have hundreds of thousands of nodes that need to be inserted. My Neo4j graph database is on a standalone server, and I'm using the REST API through the neo4j rest graph db library to achieve this.
However, performance is slow. I've chopped my queries into batches, sending 500 Cypher statements in a single HTTP call. The result I'm getting looks like this:
10:38:10.984 INFO commit
10:38:13.161 INFO commit
10:38:13.277 INFO commit
10:38:15.132 INFO commit
10:38:15.218 INFO commit
10:38:17.288 INFO commit
10:38:19.488 INFO commit
10:38:22.020 INFO commit
10:38:24.806 INFO commit
10:38:27.848 INFO commit
10:38:31.172 INFO commit
10:38:34.767 INFO commit
10:38:38.661 INFO commit
And so on.
The query that I'm using is as follows:
MERGE (a{main:{val1},prop2:{val2}}) MERGE (b{main:{val3}}) CREATE UNIQUE (a)-[r:relationshipname]-(b);
My code is this:
private RestAPI restAPI;
private RestCypherQueryEngine engine;
private GraphDatabaseService graphDB = new RestGraphDatabase("http://localdomain.com:7474/db/data/");
...
restAPI = ((RestGraphDatabase) graphDB).getRestAPI();
engine = new RestCypherQueryEngine(restAPI);
...
Transaction tx = graphDB.getRestAPI().beginTx();
try {
int ctr = 0;
while (isExists) {
ctr++;
// execute query here through engine.query()
if (ctr % 500 == 0) {
tx.success();
tx.close();
tx = graphDB.getRestAPI().beginTx();
LOGGER.info("commit");
}
}
tx.success();
} catch (FileNotFoundException | NumberFormatException | ArrayIndexOutOfBoundsException e) {
tx.failure();
} finally {
tx.close();
}
Thanks!
UPDATED BENCHMARK.
Sorry for the confusion: the benchmark I posted isn't accurate and is not for 500 queries. My ctr variable doesn't actually refer to the number of Cypher queries.
Now I'm getting about 500 queries per 3 seconds, and those 3 seconds keep increasing as well. It's still far slower than embedded Neo4j.

If you have the ability to use Neo4j 2.1.0-M01 (don't use it in prod yet!!), you could benefit from new features. If you create/generate a CSV file like this:
val1,val2,val3
a_value,another_value,yet_another_value
a,b,c
....
you'd only need to launch the following code:
final GraphDatabaseService graphDB = new RestGraphDatabase("http://server:7474/db/data/");
final RestAPI restAPI = ((RestGraphDatabase) graphDB).getRestAPI();
final RestCypherQueryEngine engine = new RestCypherQueryEngine(restAPI);
final String filePath = "file://C:/your_file_path.csv";
engine.query("USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM '" + filePath
+ "' AS csv MERGE (a{main:csv.val1,prop2:csv.val2}) MERGE (b{main:csv.val3})"
+ " CREATE UNIQUE (a)-[r:relationshipname]->(b);", null);
You'd have to make sure that the file can be accessed from the machine where your server is installed.
Take a look at my server plugin, which does this for you on the server. If you build it and put it in the plugins folder, you could use the plugin in Java as follows:
final RestAPI restAPI = new RestAPIFacade("http://server:7474/db/data");
final RequestResult result = restAPI.execute(RequestType.POST, "ext/CSVBatchImport/graphdb/csv_batch_import",
new HashMap<String, Object>() {
{
put("path", "file://C:/.../neo4j.csv");
}
});
EDIT:
You can also use a BatchCallback in the Java REST wrapper to boost performance; it removes the transactional boilerplate code as well. You could write your script similar to:
final RestAPI restAPI = new RestAPIFacade("http://server:7474/db/data");
int counter = 0;
List<Map<String, Object>> statements = new ArrayList<>();
while (isExists) {
statements.add(new HashMap<String, Object>() {
{
put("val1", "abc");
put("val2", "abc");
put("val3", "abc");
}
});
if (++counter % 500 == 0) {
restAPI.executeBatch(new Process(statements));
statements = new ArrayList<>();
}
}
static class Process implements BatchCallback<Object> {
private static final String QUERY = "MERGE (a{main:{val1},prop2:{val2}}) MERGE (b{main:{val3}}) CREATE UNIQUE (a)-[r:relationshipname]-(b);";
private List<Map<String, Object>> params;
Process(final List<Map<String, Object>> params) {
this.params = params;
}
@Override
public Object recordBatch(final RestAPI restApi) {
for (final Map<String, Object> param : params) {
restApi.query(QUERY, param);
}
return null;
}
}
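One detail worth watching in a loop like the one above: parameter sets left over when the total is not a multiple of 500 are only sent if the last partial batch is flushed after the loop. A minimal plain-Java sketch of that batching pattern follows; `Batcher` is a hypothetical helper, and its `sink` stands in for a call such as `restAPI.executeBatch(new Process(batch))`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: collects items and hands them to `sink` in fixed-size batches.
final class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private List<T> pending = new ArrayList<>();

    Batcher(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Queues one item and sends a batch as soon as it is full.
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is still queued; call once after the loop ends.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(pending);
        pending = new ArrayList<>(); // start a fresh list for the next batch
    }
}
```

Usage would be `add(...)` inside the loop and a single `flush()` afterwards, so the final partial batch is not lost.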
