Apache Spark -- Java, Group Live Stream Data

I am trying to get live JSON data from RabbitMQ into Apache Spark using Java and do some real-time analytics on it.
I am able to get the data and run some basic SQL queries on it, but I am not able to figure out the grouping part.
Below is the JSON I have:
{"DeviceId":"MAC-101","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-101","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-102","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-102","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
I would like to group the records by device ID, so that I can run and gather analytics against individual devices. Below is the sample code snippet that I am trying:
public static void main(String[] args) {
    try {
        SparkConf mconf = new SparkConf();
        mconf.setAppName("RabbitMqReceiver");
        mconf.setMaster("local[*]");
        JavaStreamingContext jssc = new JavaStreamingContext(mconf, Durations.seconds(10));
        SparkSession spksess = SparkSession
                .builder()
                .master("local[*]")
                .appName("RabbitMqReceiver2")
                .getOrCreate();
        SQLContext sqlctxt = new SQLContext(spksess);
        JavaReceiverInputDStream<String> jsonData = jssc.receiverStream(
                new mqreceiver(StorageLevel.MEMORY_AND_DISK_2()));
        //jsonData.print();
        JavaDStream<String> machineData = jsonData.window(Durations.minutes(1), Durations.seconds(20));
        machineData.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> rdd) {
                if (!rdd.isEmpty()) {
                    Dataset<Row> data = sqlctxt.read().json(rdd);
                    //Dataset<Row> data = spksess.read().json(rdd).select("*");
                    data.createOrReplaceTempView("DeviceData");
                    data.printSchema();
                    //data.show(false);
                    // The below select query works
                    //Dataset<Row> groupedData = sqlctxt.sql("select * from DeviceData where DeviceId='MAC-101'");
                    // The below sql fails...
                    Dataset<Row> groupedData = sqlctxt.sql("select * from DeviceData GROUP BY DeviceId");
                    groupedData.show();
                }
            }
        });
        jssc.start();
        jssc.awaitTermination();
    } catch (InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
What I am looking to do with the streamed data is to push the incoming records into individual buckets.
Let's say we have the incoming data from RabbitMQ shown above.
What I want is either a single key/value collection that uses the device ID as the key and a List of records as the value, or some kind of separate dynamic collection per device ID.
Can we do something like the code below (from http://backtobazics.com/big-data/spark/apache-spark-groupby-example/)?
public class GroupByExample {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext();
        // Parallelized into 3 partitions
        JavaRDD<String> rddX = sc.parallelize(
                Arrays.asList("Joseph", "Jimmy", "Tina",
                        "Thomas", "James", "Cory",
                        "Christine", "Jackeline", "Juan"), 3);
        // Group the names by their first character
        JavaPairRDD<Character, Iterable<String>> rddY = rddX.groupBy(word -> word.charAt(0));
        System.out.println(rddY.collect());
    }
}
So in our case we would need to group by the DeviceId of each record instead of the first character.
Working code:
JavaDStream<String> strmData = jssc.receiverStream(
        new mqreceiver(StorageLevel.MEMORY_AND_DISK_2()));
// Sliding window: 1-minute window, sliding every 10 seconds
JavaDStream<String> machineData = strmData.window(Durations.minutes(1), Durations.seconds(10));
machineData.print();
// Key each record by a fixed substring (positions 5-10) of the raw JSON, then group by that key
JavaPairDStream<String, String> pairedData = machineData.mapToPair(
        s -> new Tuple2<String, String>(s.substring(5, 10), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
groupedData.print();
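One caveat with the working code: the substring(5, 10) key assumes the DeviceId always sits at a fixed offset in the raw JSON string. A more robust sketch, under the assumption that a JSON library such as Gson 2.8.6+ is on the classpath (the parsing below is my illustration, not part of the original code), keys each record by the parsed DeviceId field:
// Hedged sketch: key by the parsed "DeviceId" field instead of a fixed substring.
// Assumes com.google.gson is available (JsonParser.parseString exists from Gson 2.8.6).
JavaPairDStream<String, String> pairedByDeviceId = machineData.mapToPair(s -> {
    String deviceId = com.google.gson.JsonParser.parseString(s)
            .getAsJsonObject()
            .get("DeviceId")
            .getAsString();
    return new Tuple2<>(deviceId, s);
});
JavaPairDStream<String, Iterable<String>> byDevice = pairedByDeviceId.groupByKey();
byDevice.print();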

It's because in a query with GROUP BY, only the following columns can appear in the SELECT list:
columns listed in the GROUP BY clause
aggregations of any column
If you use "*", all columns end up in the SELECT list, and that's why the query fails. Change the query to, for example:
select DeviceId, count(distinct DeviceType) as deviceTypeCount from DeviceData group by DeviceId
and it will work, because it uses only a column from the GROUP BY clause and columns inside aggregate functions.

The GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns.
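The same aggregation can also be expressed with the Dataset API inside the foreachRDD block. A minimal sketch (the alias name is illustrative, and it assumes a static import of countDistinct from org.apache.spark.sql.functions):
// Equivalent of the SQL above, using the Dataset API on the streamed batch;
// assumes: import static org.apache.spark.sql.functions.countDistinct;
Dataset<Row> grouped = data.groupBy("DeviceId")
        .agg(countDistinct("DeviceType").alias("deviceTypeCount"));
grouped.show();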

Related

Spark: DataFrame - how to properly parallelize the execution

I have a Spark Streaming application that reads from Kafka with a batch interval of 5 minutes.
My application stores the input in a DataFrame and is configured to execute aggregation queries on that DataFrame (~1000 queries).
Current solution: from the driver I run a for loop over the list of queries and execute them one by one.
Problem: executing these queries takes a huge amount of time for my application.
Is there a way to make this process faster?
SparkConf sparkConf = new SparkConf();
messagesFromKafka.foreachRDD((VoidFunction2<JavaRDD<String>, Time>) (rdd, time) -> {
    //...
    SparkSession spark = JavaSparkSessionSingleton.getInstance(sparkConf);
    // Convert RDD[String] into an RDD of beans
    JavaRDD<Bean> rddRow = rdd.map((Function<String, Bean>) line -> {
        Bean row = new Bean();
        row.setFiled1(line.split(";")[0]);
        row.setFiled2(line.split(";")[1]);
        //..
        return row;
    });
    // ...
    Dataset<Row> ds = spark.createDataFrame(rddRow, Bean.class);
    // Prepare the list of queries...
    List<String> listQuery = new ArrayList<>();
    listQuery.add("select sum(..) group by filed1...");
    listQuery.add("select avg (..) group by field2...");
    // Perform the aggregations, one query at a time
    for (String query : listQuery) {
        Dataset<Row> dsResult = spark.sql(query);
    }
});

How to Iterate through a GlobalKTable or KTable in Kafka Streaming App

I have a Kafka Streams app with 2 data sources: Events and Users.
I have 4 topics: Events, Users, Users2, and User-Events.
Users2 is the same as Users and is used to demonstrate the GlobalKTable.
The Events topic uses timestamp mode, so when the timestamp-field date is reached, the KStream receives the Event record.
At this point, I want to create a User-Event record for every User-ID in the Users KTable along with the new Event-ID, but I don't know how to iterate through either the GlobalKTable or the KTable to achieve this.
public Topology createTopology() {
    final Serde<String> serde = Serdes.String();
    final StreamsBuilder builder = new StreamsBuilder();
    final GlobalKTable<String, String> gktUsers =
            builder.globalTable("Users",
                    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("user-store")
                            .withKeySerde(Serdes.String()).withValueSerde(serde));
    final KTable<String, String> ktUsers = builder.table("Users2");
    builder.stream("Events", Consumed.with(Serdes.String(), serde))
            .peek((k, v) -> {
                // This is called when a new Event record becomes current.
                // How do I iterate through gktUsers at this point
                // and output a User-ID and an Event-ID to the User-Events topic?
                // This type of iteration doesn't work either.
                ktUsers.toStream().foreach(new ForeachAction<String, String>() {
                    @Override
                    public void apply(String s, String s2) {
                        log.info("{} {}", s, s2);
                    }
                });
            });
    return builder.build();
}
You need to get the state store in order to iterate over it:
ReadOnlyKeyValueStore<String, String> keyValueStore =
        streams.store("user-store", QueryableStoreTypes.keyValueStore());
String value = keyValueStore.get("key");
KeyValueIterator<String, String> range = keyValueStore.all();
https://docs.confluent.io/platform/current/streams/developer-guide/interactive-queries.html
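A minimal sketch of walking that store (assuming the KafkaStreams instance has started and "user-store" matches the Materialized name above; the log format is illustrative, and the iterator should be closed when you are done):
// Iterate all users in the global store; KeyValueIterator is Closeable, so use try-with-resources.
// Assumes: import org.apache.kafka.streams.KeyValue;
try (KeyValueIterator<String, String> all = keyValueStore.all()) {
    while (all.hasNext()) {
        KeyValue<String, String> user = all.next();
        log.info("User-ID {} -> {}", user.key, user.value);
    }
}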
Alternatively, you might be better off doing a stream-table join between users and events, as sketched below.
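For the join route, a rough standalone sketch of a KStream-GlobalKTable join (the key selector and value joiner are illustrative assumptions; extractUserId is a hypothetical helper standing in for however the event value carries the user id):
// Hedged sketch: join each Event with the matching user from the GlobalKTable
// and write the combined record to the User-Events topic.
KStream<String, String> events = builder.stream("Events", Consumed.with(Serdes.String(), serde));
KStream<String, String> userEvents = events.join(
        gktUsers,
        (eventId, eventValue) -> extractUserId(eventValue),       // map each event to a user key (hypothetical helper)
        (eventValue, userValue) -> userValue + ":" + eventValue); // build the User-Event value
userEvents.to("User-Events", Produced.with(Serdes.String(), Serdes.String()));
Note that a join pairs each event with one user by key; if you really need one record per user for every event, the state-store iteration above is the closer fit.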

Multiple select queries in single job on flink table API

If I want to run two different select queries on a Flink table created from a DataStream, the Blink planner runs them as two different jobs. Is there a way to combine them and run them as a single job?
Example code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
System.out.println("Running credit scores : ");
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
DataStream<String> recordsStream =
        env.readTextFile("src/main/resources/credit_trial.csv");
DataStream<CreditRecord> creditStream = recordsStream
        .filter((FilterFunction<String>) line -> !line.contains(
                "Loan ID,Customer ID,Loan Status,Current Loan Amount,Term,Credit Score,Annual Income,Years in current job" +
                ",Home Ownership,Purpose,Monthly Debt,Years of Credit History,Months since last delinquent,Number of Open Accounts," +
                "Number of Credit Problems,Current Credit Balance,Maximum Open Credit,Bankruptcies,Tax Liens"))
        .map(new MapFunction<String, CreditRecord>() {
            @Override
            public CreditRecord map(String s) throws Exception {
                String[] fields = s.split(",");
                return new CreditRecord(fields[0], fields[2], Double.parseDouble(fields[3]),
                        fields[4], fields[5].trim().equals("") ? 0.0 : Double.parseDouble(fields[5]),
                        fields[6].trim().equals("") ? 0.0 : Double.parseDouble(fields[6]),
                        fields[8], Double.parseDouble(fields[15]));
            }
        });
tableEnv.createTemporaryView("CreditDetails", creditStream);
Table creditDetailsTable = tableEnv.from("CreditDetails");
Table resultsTable = creditDetailsTable.select($("*"))
        .filter($("loanStatus").isEqual("Charged Off"));
TableResult result = resultsTable.execute();
result.print();
Table resultsTable2 = creditDetailsTable.select($("*"))
        .filter($("loanStatus").isEqual("Fully Paid"));
TableResult result2 = resultsTable2.execute();
result2.print();
The above code creates 2 different jobs, but I don't want that. Is there any way around it?
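One option worth trying, sketched under assumptions I am adding here (Flink 1.11+, the illustrative 'print' connector, sink schemas matching the selected columns, and a CreditRecord POJO that exposes loanId and loanStatus fields), is to group both inserts into a single StatementSet, which submits them as one job:
// Hedged sketch: define two illustrative print sinks, then submit both pipelines together.
tableEnv.executeSql(
        "CREATE TABLE charged_off_sink (loanId STRING, loanStatus STRING) WITH ('connector' = 'print')");
tableEnv.executeSql(
        "CREATE TABLE fully_paid_sink (loanId STRING, loanStatus STRING) WITH ('connector' = 'print')");
StatementSet statementSet = tableEnv.createStatementSet();
statementSet.addInsert("charged_off_sink",
        creditDetailsTable.select($("loanId"), $("loanStatus"))
                .filter($("loanStatus").isEqual("Charged Off")));
statementSet.addInsert("fully_paid_sink",
        creditDetailsTable.select($("loanId"), $("loanStatus"))
                .filter($("loanStatus").isEqual("Fully Paid")));
statementSet.execute(); // both pipelines are submitted as a single job
Note that the print connector writes to the TaskManager logs rather than the client console, so the output shows up in a different place than TableResult.print().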

Java BigQuery API to list table data

I am trying to list table data from BigQuery using Java. However, I am not able to find out how to configure the API to limit the maximum number of rows per call.
public class QuickstartSample {
    public static void main(String... args) throws Exception {
        GoogleCredentials credentials;
        File credentialsPath = new File("/Users/gaurang.shah/Downloads/fb3735b731b9.json"); // TODO: update to your key path.
        FileInputStream serviceAccountStream = new FileInputStream(credentialsPath);
        credentials = ServiceAccountCredentials.fromStream(serviceAccountStream);
        BigQuery bigquery = BigQueryOptions.newBuilder().
                setCredentials(credentials).
                setProjectId("bigquery-public-data").
                build().
                getService();
        Dataset hacker_news = bigquery.getDataset("hacker_news");
        Table comments = hacker_news.get("comments");
        TableResult result = comments.list(); // how do I limit the number of rows returned per call?
        for (FieldValueList row : result.iterateAll()) {
            // do something with the row
            System.out.println(row);
        }
    }
}
To limit the number of rows per call, you can use the listTableData method with the TableDataListOption.pageSize(n) parameter.
The following example fetches the table data 100 rows per page:
String datasetName = "my_dataset_name";
String tableName = "my_table_name";
TableId tableIdObject = TableId.of(datasetName, tableName);
TableResult tableData =
        bigquery.listTableData(tableIdObject, TableDataListOption.pageSize(100));
for (FieldValueList row : tableData.iterateAll()) {
    // do something with the row
}
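If I remember correctly, the same page-size option can also be passed straight to the Table handle used in the question, so the original code needs only a small change (a hedged sketch; verify against your client library version):
// Hedged: Table.list(...) accepts the same TableDataListOption as listTableData.
TableResult result = comments.list(TableDataListOption.pageSize(100));
for (FieldValueList row : result.iterateAll()) {
    System.out.println(row); // iterateAll() still walks every page, fetching 100 rows per request
}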

Merge Multiple JavaRDD

I've attempted to merge multiple JavaRDDs, but only 2 get merged; can someone kindly help? I've been struggling with this for a while. Overall, I would like to obtain multiple collections, use SQLContext to create a group, and print out all results.
Here is my code:
JavaRDD<AppLog> logs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppa_logs").union(
        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.fav_logs").union(
                mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.pps_logs").union(
                        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.dd_logs").union(
                                mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppt_logs")
                        )
                )
        )
);
public JavaRDD<AppLog> mapCollection(JavaSparkContext sc, String uri) {
    Configuration mongodbConfig = new Configuration();
    mongodbConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
    mongodbConfig.set("mongo.input.uri", uri);
    JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
            mongodbConfig,          // Configuration
            MongoInputFormat.class, // InputFormat: read from a live cluster.
            Object.class,           // Key class
            BSONObject.class        // Value class
    );
    return documents.map(
            new Function<Tuple2<Object, BSONObject>, AppLog>() {
                public AppLog call(final Tuple2<Object, BSONObject> tuple) {
                    AppLog log = new AppLog();
                    BSONObject header = (BSONObject) tuple._2();
                    log.setTarget((String) header.get("target"));
                    log.setAction((String) header.get("action"));
                    return log;
                }
            }
    );
}
// printing the collections
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
DataFrame logsSchema = sqlContext.createDataFrame(logs, AppLog.class);
logsSchema.registerTempTable("logs");
DataFrame groupedMessages = sqlContext.sql(
        "select * from logs");
// "select target, action, Count(*) from logs group by target, action");
// "SELECT to, body FROM messages WHERE to = \"eric.bass@enron.com\"");
groupedMessages.show();
logsSchema.printSchema();
If you want to merge multiple JavaRDDs, simply use sc.union(rdd1, rdd2, ...) instead of rdd1.union(rdd2).union(rdd3).
Also check this comparison: RDD.union vs SparkContext.union
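A minimal sketch of the flattened version (the variable names are mine; note that depending on the Spark version, JavaSparkContext.union takes either varargs or a first RDD plus a List of the rest, so adjust the call to your version):
// Hedged sketch: build each per-collection RDD, then union them in a single call on the context.
JavaRDD<AppLog> ppaLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppa_logs");
JavaRDD<AppLog> favLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.fav_logs");
JavaRDD<AppLog> ppsLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.pps_logs");
JavaRDD<AppLog> ddLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.dd_logs");
JavaRDD<AppLog> pptLogs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppt_logs");
// Single union on the context instead of chaining RDD.union
JavaRDD<AppLog> logs = sc.union(ppaLogs, favLogs, ppsLogs, ddLogs, pptLogs); // varargs form (newer Spark)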
