I am writing a Spark Streaming job in Java which takes input records from Kafka.
The records are available in a JavaDStream as a custom Java object.
A sample record is:
TimeSeriesData: {tenant_id='581dd636b5e2ca009328b42b', asset_id='5820870be4b082f136653884', bucket='2016', parameter_id='58218d81e4b082f13665388b', timestamp=Mon Aug 22 14:50:01 IST 2016, window=null, value='11.30168'}
Now I want to aggregate this data by the minute, hour, day and week of the field "timestamp".
My question is: how do I aggregate JavaDStream records based on such a time window? Sample code would be helpful.
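Not an authoritative answer, but a common pattern is to key each record by its timestamp truncated to the desired granularity and then reduce by key, because DStream window() operations are based on processing time rather than on the record's own timestamp field. A minimal sketch, assuming the stream is a JavaDStream<TimeSeriesData> named stream and that TimeSeriesData exposes getParameterId(), getTimestamp() (a java.util.Date) and getValue() getters:

import java.text.SimpleDateFormat;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Bucket by hour; use "yyyy-MM-dd HH:mm" for minute buckets, "yyyy-MM-dd" for day
// buckets, or "yyyy-ww" for week-of-year buckets.
JavaPairDStream<String, Double> hourlySums = stream.mapToPair(record -> {
    // SimpleDateFormat is not thread-safe, so create it inside the lambda
    SimpleDateFormat hourBucket = new SimpleDateFormat("yyyy-MM-dd HH");
    String key = record.getParameterId() + "|" + hourBucket.format(record.getTimestamp());
    return new Tuple2<>(key, Double.parseDouble(record.getValue()));
}).reduceByKey((a, b) -> a + b);   // replace the sum with min/max/avg logic as needed

hourlySums.print();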
How can I read a list of exception dates for a schedule from an Excel file without having to enter each date from the file separately? I am trying to set up a shift plan which takes holidays etc. into account over the next 5 years. For this I have created an Excel table containing a list of holiday dates, which I would now like to use in my AnyLogic simulation. I tried the exceptions section of my schedule object, but didn't find a way to connect this section to my Excel file. The only option I get there is to manually enter each date... Since this would be extremely tedious, I am looking for a workaround (Java code?). Can someone help?
Suppose you have a schedule object of type integer and a database table full of exception dates for which you want the integer value of the schedule to be 0 for the entire day.
You can then programmatically add exceptions to your schedule with the following code:
List<Tuple> rows = selectFrom(db_table).list();
for (Tuple row : rows) {
    Date exceptionDate = row.get(db_table.db_column);
    // java.util.Date getters are deprecated but usable here; note that getDate() returns
    // the day of the month (getDay() would return the day of the week instead)
    schedule.addException(
        exceptionDate.getYear(), exceptionDate.getMonth(), exceptionDate.getDate(),
        exceptionDate.getHours(), exceptionDate.getMinutes(), exceptionDate.getSeconds(),
        exceptionDate.getYear(), exceptionDate.getMonth(), exceptionDate.getDate() + 1,
        exceptionDate.getHours(), exceptionDate.getMinutes(), exceptionDate.getSeconds(),
        0, false); // value 0 for the whole day, not repeated annually
}
The format for adding exceptions is a bit cumbersome but it is:
schedule.addException(startYear, startMonth, startDay, startHour, startMinute, startSecond, endYear, endMonth, endDay, endHour, endMinute, endSecond, value, annually);
The value parameter is the value (in your schedule's type) that the schedule takes during the exception.
True, there is no built-in way for this. You need to create your schedule from code, which allows adding exceptions programmatically; see https://anylogic.help/anylogic/data/schedule.html#creating-and-initializing-schedule-from-code-on-model-startup
I have a Kafka stream that I am loading into Spark. Messages from the Kafka topic have the following attributes: bl_iban, blacklisted, timestamp. So there are IBANs, a flag indicating whether that IBAN is blacklisted (Y/N), and a timestamp for the record.
The thing is that there can be multiple records for one IBAN, because over time an IBAN can get blacklisted or "removed". What I am trying to achieve is to know the current status of each IBAN. However, I have started with an even simpler goal: listing the latest timestamp for each IBAN (after that I would like to add the blacklisted status as well). So I have produced the following code (where blackList represents the Dataset that I have loaded from Kafka):
blackList = blackList.groupBy("bl_iban")
.agg(col("bl_iban"), max("timestamp"));
And after that I tried to print it to the console using the following code:
StreamingQuery query = blackList.writeStream()
.format("console")
.outputMode(OutputMode.Append())
.start();
I ran my code and got the following error:
Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark
So I added a watermark to my Dataset like so:
blackList = blackList.withWatermark("timestamp", "2 seconds")
.groupBy("bl_iban")
.agg(col("bl_iban"), max("timestamp"));
And got the same error after that.
Any ideas how I can approach this problem?
Update:
With the help of mike I have managed to get rid of that error. But the problem is that I still cannot get my blacklist working. I can see how data is loaded from Kafka, but after that my group operation produces two empty batches and that is it.
Printed data from Kafka:
+-----------------------+-----------+-----------------------+
|bl_iban |blacklisted|timestamp |
+-----------------------+-----------+-----------------------+
|SK047047595122709025789|N |2020-04-10 17:26:58.208|
|SK341492788657560898224|N |2020-04-10 17:26:58.214|
|SK118866580129485701645|N |2020-04-10 17:26:58.215|
+-----------------------+-----------+-----------------------+
This is how I obtain the blackList that is printed above:
blackList = blackList.selectExpr("split(cast(value as string),',') as value", "cast(timestamp as timestamp) timestamp")
.selectExpr("value[0] as bl_iban", "value[1] as blacklisted", "timestamp");
And this is my group operation:
Dataset<Row> blackListCurrent = blackList.withWatermark("timestamp", "20 minutes")
.groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("bl_iban"))
.agg(col("bl_iban"), max("timestamp"));
Link to source file: Spark Blacklist
When you use watermarking in Spark you need to ensure that your aggregation knows about the window. The Spark documentation provides some more background.
In your case the code should look something like this:
blackList = blackList.withWatermark("timestamp", "2 seconds")
.groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("bl_iban"))
.agg(col("bl_iban"), max("timestamp"));
It is important that the attribute timestamp has the data type timestamp!
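Regarding the empty batches mentioned in the update: with Append output mode a windowed aggregate is only emitted once the watermark has passed the end of its window, so with a 20-minute watermark and only a handful of records nothing may ever show up. One option while testing (a sketch, not the only approach) is to switch the sink to Update output mode so changed results are printed on every trigger:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

Dataset<Row> latestPerIban = blackList
        .withWatermark("timestamp", "20 minutes")
        .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("bl_iban"))
        .agg(max("timestamp").as("latest_ts"));

// Update mode emits changed rows on every trigger instead of waiting for the
// watermark to close each window, which is handy for console debugging.
StreamingQuery query = latestPerIban.writeStream()
        .format("console")
        .option("truncate", false)
        .outputMode(OutputMode.Update())
        .start();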
In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the dataframe to a Cassandra DB. The data must be written with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra from Spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
        {
            put("keyspace", "key_space_name");
            put("table", "table_name");
            // constant TTL for now -- should instead depend on the bucket_timestamp column
            put("spark.cassandra.output.ttl", Long.toString(CONST_TTL));
        }
    }).mode(SaveMode.Overwrite).save();
One possible way I thought about is, for each possible bucket_timestamp, to filter the data by that timestamp, calculate the TTL and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?
The solution should work with Java and Dataset<Row>: I came across some solutions for doing this with RDDs in Scala, but didn't find one for Java and dataframes.
Thanks!
From the Spark-Cassandra connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as a:
constant value (withConstantTTL)
automatically resolved value (withAutoTTL)
column-based value (withPerRowTTL)
In your case you could try the last option and compute the TTL as a new column of the starting Dataset with the rule you provided in the question.
For a usage example, see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612
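For illustration only, a rough Java sketch of that approach; the BucketRow bean is hypothetical (its fields must match the DataFrame/table columns plus the ttl field), and withPerRowTTL is the builder method listed above, so check the exact signature against your connector version:

import static org.apache.spark.sql.functions.*;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// 1) compute the per-row TTL in seconds: CONST_TTL - (now - bucket_timestamp)
Dataset<Row> withTtl = df.withColumn("ttl",
        lit(CONST_TTL).minus(
            unix_timestamp(current_timestamp()).minus(unix_timestamp(col("bucket_timestamp")))));

// 2) write through the RDD API and let the connector take the TTL from the "ttl" column;
//    BucketRow is a hypothetical bean mirroring the table columns plus ttl
javaFunctions(withTtl.as(Encoders.bean(BucketRow.class)).javaRDD())
        .writerBuilder("key_space_name", "table_name", mapToRow(BucketRow.class))
        .withPerRowTTL("ttl")
        .saveToCassandra();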
For the DataFrame API there is no support for such functionality yet... There is a JIRA for it - https://datastax-oss.atlassian.net/browse/SPARKC-416 - you can watch it to get notified when it's implemented...
So the only choice you have is to use the RDD API as described in #bartosz25's answer...
I'm using Fluentd (v0.12, the last stable version) to send messages to Kafka. But Fluentd is using an old KafkaProducer, so the record timestamp is always set to -1.
Thus I have to use the WallclockTimestampExtractor to set the timestamp of the record to the point in time when the message arrives in Kafka.
Is there a Kafka Streams-specific solution?
The timestamp I'm really interested in is sent by fluentd within the message:
"timestamp":"1507885936","host":"V.X.Y.Z."
Record representation in Kafka:
offset = 0, timestamp= - 1, key = null, value = {"timestamp":"1507885936","host":"V.X.Y.Z."}
I would like to have a record like this in Kafka:
offset = 0, timestamp= 1507885936, key = null, value = {"timestamp":"1507885936","host":"V.X.Y.Z."}
My workaround would look like this:
write a consumer to extract the timestamp (https://kafka.apache.org/0110/javadoc/org/apache/kafka/streams/processor/TimestampExtractor.html)
write a producer to produce a new record with the timestamp set (ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value))
I would prefer a Kafka Streams solution, if there is one.
You can write a very simple Kafka Streams Application like:
KStreamBuilder builder = new KStreamBuilder();
builder.stream("input-topic").to("output-topic");
and configure the application with a custom TimestampExtractor that extracts the timestamp from the record and returns it.
Kafka Streams will use the returned timestamps when writing the records back to Kafka.
Note: if you have out-of-order data -- i.e., timestamps that are not strictly ordered -- the result will contain out-of-order timestamps, too. Kafka Streams uses the returned timestamps when writing back to Kafka (i.e., whatever the extractor returns is used as the record metadata timestamp). Note that on write, the timestamp of the currently processed input record is used for all generated output records -- this holds for version 1.0 but might change in future releases.
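For example, a sketch of such an extractor (the class name and the regex-based JSON parsing are my own; the config key also differs between releases, "timestamp.extractor" in older Kafka Streams versions and "default.timestamp.extractor" in newer ones):

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Pulls the epoch-seconds "timestamp" field out of the JSON value sent by fluentd and
// returns it in milliseconds, so Kafka Streams uses it as the output record timestamp.
public class PayloadTimestampExtractor implements TimestampExtractor {

    private static final Pattern TS_FIELD = Pattern.compile("\"timestamp\"\\s*:\\s*\"(\\d+)\"");

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        if (record.value() != null) {
            Matcher m = TS_FIELD.matcher(record.value().toString());
            if (m.find()) {
                return Long.parseLong(m.group(1)) * 1000L; // seconds -> milliseconds
            }
        }
        return System.currentTimeMillis(); // wall-clock fallback if the field is missing
    }
}

// Registration, e.g.:
// props.put("default.timestamp.extractor", PayloadTimestampExtractor.class.getName());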
Update:
In general, you can modify timestamps via the Processor API. When calling context.forward() you can set the output record timestamp by passing To.all().withTimestamp(...) as a parameter to forward().
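A minimal sketch of that route (class and helper names are mine, and it assumes a Kafka Streams version where To.all().withTimestamp(...) exists, i.e. 2.0+, with the raw JSON string as the value):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.To;

// Forwards each record unchanged but stamps it with the timestamp taken from the payload.
public class TimestampRewriter implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        context.forward(key, value, To.all().withTimestamp(extractEpochMillis(value)));
        return null; // the record has already been forwarded explicitly
    }

    @Override
    public void close() { }

    // hypothetical helper: parse the epoch-seconds "timestamp" field, fall back to the input timestamp
    private long extractEpochMillis(String json) {
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("\"timestamp\"\\s*:\\s*\"(\\d+)\"").matcher(json);
        return m.find() ? Long.parseLong(m.group(1)) * 1000L : context.timestamp();
    }
}

// Usage: builder.stream("input-topic").transform(TimestampRewriter::new).to("output-topic");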
I am a newbie to Spark. I have around 15 TB of data in Mongo:
ApplicationName  Name   IPCategory  Success  Fail  CreatedDate
abc              a.com  cd          3        1     25-12-2015 00:00:00
def              d.com  ty          2        2     25-12-2015 01:20:00
abc              b.com  cd          5        0     01-01-2015 06:40:40
For a given ApplicationName, I am looking to group by (Name, IPCategory) over one week of data. I am able to fetch data from Mongo and save the output back to Mongo. I am working on it using Java.
NOTE: From one month of data I need only the last week, and it should be grouped by (Name, IPCategory).
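A rough sketch of the filtering and grouping once the collection is loaded into a Dataset<Row> named df (column names are taken from the sample above; the date format, the 7-day cut-off relative to the current time, the hard-coded application name and the choice to sum Success/Fail are assumptions):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// parse CreatedDate (sample looks like "dd-MM-yyyy HH:mm:ss") and keep only the last 7 days
Dataset<Row> lastWeek = df
        .withColumn("created_ts",
            unix_timestamp(col("CreatedDate"), "dd-MM-yyyy HH:mm:ss").cast("timestamp"))
        .filter(col("ApplicationName").equalTo("abc"))                 // requested application
        .filter(col("created_ts").geq(date_sub(current_timestamp(), 7)));

// aggregate per (Name, IPCategory); summing Success/Fail is just one possible aggregation
Dataset<Row> weekly = lastWeek
        .groupBy(col("Name"), col("IPCategory"))
        .agg(sum("Success").alias("success"), sum("Fail").alias("fail"));

weekly.show();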