I use org.apache.kafka:kafka-streams:0.10.0.1
I'm attempting to work with a time-series-based stream that doesn't seem to be triggering KStream.process() (that is, punctuate() is never called). (See here for reference.)
In a KafkaStreams config I'm passing in this param (among others):
config.put(
    StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
    EventTimeExtractor.class.getName());
Here, EventTimeExtractor is a custom timestamp extractor (that implements org.apache.kafka.streams.processor.TimestampExtractor) to extract the timestamp information from JSON data.
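For context, a minimal sketch of what such an extractor could look like, assuming the single-argument extract() signature of the 0.10.0 API; the Jackson parsing and the "eventTime" field name are illustrative assumptions, not the actual extractor:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class EventTimeExtractor implements TimestampExtractor {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public long extract(ConsumerRecord<Object, Object> record) {
        try {
            // Assumes the record value is a JSON string carrying an epoch-millis "eventTime" field.
            JsonNode json = MAPPER.readTree((String) record.value());
            return json.get("eventTime").asLong();
        } catch (Exception e) {
            // Fall back to the record's built-in timestamp if parsing fails.
            return record.timestamp();
        }
    }
}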
I would expect this to call my object (derived from TimestampExtractor) as each new record is pulled in. The stream in question carries 2 * 10^6 records per minute. I have punctuate() set to 60 seconds and it never fires. I know the data crosses this span very frequently, since it's pulling old values to catch up.
In fact it never gets called at all.
Is this the wrong approach to setting timestamps on KStream records?
Is this the wrong way to declare this configuration?
Update Nov 2017: Kafka Streams in Kafka 1.0 now supports punctuate() with both stream-time and processing-time (wall-clock time) behavior, so you can pick whichever behavior you prefer.
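As a rough sketch of the 1.0 API (the class name and the 60-second interval are placeholders mirroring the question), both punctuation types are registered from a Processor's init():

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

public class PunctuateExampleProcessor extends AbstractProcessor<String, String> {

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        // Fires on stream-time: only advances as new records (and their timestamps) arrive.
        context.schedule(60_000L, PunctuationType.STREAM_TIME,
                timestamp -> System.out.println("stream-time punctuate at " + timestamp));
        // Fires on wall-clock time, whether or not new records arrive.
        context.schedule(60_000L, PunctuationType.WALL_CLOCK_TIME,
                timestamp -> System.out.println("wall-clock punctuate at " + timestamp));
    }

    @Override
    public void process(String key, String value) {
        // regular record processing goes here
    }
}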
Your setup seems correct to me.
What you need to be aware of: as of Kafka 0.10.0, the punctuate() method operates on stream-time (by default, i.e. with the default timestamp extractor, stream-time means event-time). Stream-time is only advanced when new data records come in, and how much it advances is determined by the timestamps associated with those new records.
For example:
Let's assume you have set punctuate() to be called every 1 minute = 60 * 1000 (note: 1 minute of stream-time). Now, if it happens that no data is being received for the next 5 minutes, punctuate() will not be called at all -- even though you might expect it to be called 5 times. Why? Again, because punctuate() depends on stream-time, and the stream-time is only advanced based on newly received data records.
Might this be causing the behavior you are seeing?
Looking ahead: there's already an ongoing discussion in the Kafka project on how to make punctuate() more flexible, e.g. to trigger it not only based on stream-time (which defaults to event-time) but also based on processing-time.
Your approach seems to be correct. Compare the paragraph "Timestamp Extractor (timestamp.extractor)" in http://docs.confluent.io/3.0.1/streams/developer-guide.html#optional-configuration-parameters
Not sure why your custom timestamp extractor is not used. Have a look into org.apache.kafka.streams.processor.internals.StreamTask. In the constructor there should be something like
TimestampExtractor timestampExtractor1 = (TimestampExtractor)config.getConfiguredInstance("timestamp.extractor", TimestampExtractor.class);
Check if your custom extractor is picked up there or not...
I think this is another case of issues at the broker level. I went and rebuilt the cluster using instances with more CPU and RAM. Now I'm getting the results I expected.
Note to distant observer(s): if your KStream app is behaving strangely take a look at your brokers and make sure they aren't stuck in GC and have plenty of 'headroom' for file handles, RAM, etc.
We have identified across our user base that since the last Google Fit app update there has been a dramatic drop in data, and since it began we have tried to identify the issue in our code. Given the timing, we thought the version we were using (18.0 at the time) was the problem.
Upgrading to SDK 20.0 did not improve the results, but it did stop the data from stalling. Currently we can assume 50-60% of the users connected to Google Fit through the SDK are no longer correctly retrieving data according to the (previously working) implementation. They are not lost, and they still send some bits here and there, but it's no longer what it used to be.
This graph showcases the timeline of events that led us to the conclusion that one of the sides must be doing something wrong.
The code examples below have been stripped of most data processing code for readability, but it is there.
Our Fitness client requests FitnessOptions.ACCESS_READ for all the types mentioned below, plus others depending on the App, every time it's initialised, either in foreground or background, making sure we only request those accepted by the user.
We can confirm the following data types no longer return any value when requesting the daily total or the local-device daily total, but do return data chunks for the same period when requested in a non-aggregated read:
DataType.TYPE_STEP_COUNT_DELTA
DataType.TYPE_CALORIES_EXPENDED
DataType.TYPE_HEART_RATE_BPM
We also tried changing those that allow it to their aggregate counterparts, to no avail:
DataType.AGGREGATE_CALORIES_EXPENDED
DataType.AGGREGATE_STEP_COUNT_DELTA
This is our current getDailyTotal implementation, which worked before the update and is written exactly as the examples on the developer site show:
Fitness.getHistoryClient(context, account)
    .readDailyTotal(type)
    .addOnSuccessListener {
        Logger.i("${type.name}::DailyTotal::Success")
        onResponse(it)
    }
This currently returns 0 no matter the time of the day it's asked.
Then we have our complementary code, which emulates what getDailyTotal does under the hood, also as per the developer site examples:
from: day start at 00:00:00, UTC+1
to: day end at 23:59:59, UTC+1
type: any DataType.
val readRequest = DataReadRequest.Builder()
    .enableServerQueries()
    .aggregate(type)
    .bucketByTime(1, TimeUnit.DAYS)
    .setTimeRange(from.time, to.time, TimeUnit.MILLISECONDS)
    .build()

val account = GoogleSignIn
    .getAccountForExtension(context, fitnessOptions!!)

GFitClient.request(context, account, readRequest) {
    if (it == null) {
        aggregatedRequestError(type)
    } else {
        Logger.i(TAG, "Aggregated ${type.name} received.")
    }
}
The common result here is either (1) a null or empty result, (2) actually getting the result (in the case of DataType.TYPE_STEP_COUNT_DELTA it sometimes happens), or (3) an ApiException with code 5012: this data type can't be aggregated.
We are using the single-argument aggregate(type), since the two-argument form that could be called with (type, type.aggregate) has been deprecated for a couple of versions already, although some developer-site examples still use it.
Using .enableServerQueries() or not does not change the final result.
Finally, we assume the worst, request everything for that day regardless, and then aggregate manually. This usually returns results where the others did not; sadly, those results are never conclusive enough to feel comfortable with.
val readRequest = DataReadRequest.Builder()
    .enableServerQueries()
    .read(type)
    .bucketByTime(1, TimeUnit.DAYS)
    .setTimeRange(from.time, to.time, TimeUnit.MILLISECONDS)
    .build()

val account = GoogleSignIn
    .getAccountForExtension(context, fitnessOptions!!)
This tends to work but the manual processing of the data is complex given the intricate nested nature of datasets, buckets and the overall dataset structure.
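The manual walk is essentially nested loops over buckets, data sets and data points; a simplified sketch of the idea (written in Java rather than our actual Kotlin code, counting only Field.FIELD_STEPS as an illustrative case, class name made up) looks like this:

import com.google.android.gms.fitness.data.Bucket;
import com.google.android.gms.fitness.data.DataPoint;
import com.google.android.gms.fitness.data.DataSet;
import com.google.android.gms.fitness.data.Field;
import com.google.android.gms.fitness.result.DataReadResponse;

public final class StepTotals {

    // Walks every bucket -> data set -> data point and sums the step field.
    public static int sumSteps(DataReadResponse response) {
        int total = 0;
        for (Bucket bucket : response.getBuckets()) {
            for (DataSet dataSet : bucket.getDataSets()) {
                for (DataPoint point : dataSet.getDataPoints()) {
                    total += point.getValue(Field.FIELD_STEPS).asInt();
                }
            }
        }
        return total;
    }
}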
We have also noticed issues when retrieving data that is clearly visible in the Fit app but doesn't appear through the SDK; for example, Huawei Health activities appearing in the app while the SDK returns only a subset of them. It also happens the other way around: the SDK returns data (for example, a whole night's worth of sleep sessions: light, REM, deep...) while the Fit app shows that same sleep as a single sleep block without any sessions.
Sleep session as shown in a third party app, with the same data the SDK returns us:
The same sleep session shown in the Google fit app:
As far as the documentation says:
For the Android APIs, read by data type and the Fit platform will return the merged stream by default. This automatically includes all data available to your app, including data written by other apps. You won't be able to see a list of which apps or devices the data came from with the Android APIs.
We believe the merged stream is not behaving properly: not just in real time (which could be explained by a delay between the app showing data straight from the backend and that data not yet being written where the SDK can read it), but also when the difference is a matter of minutes or hours; sometimes the data never shows up at all.
To explain how we retrieve this data: we have a background WorkManager CoroutineWorker that runs periodically (whenever the system allows, given Doze-mode restrictions; what we ask for via the WorkManager configuration is once every hour or couple of hours, to keep the data up to date with what the Fit app displays). It requests data from the last update up to the end of the previous day, and/or requests today's daily total (or the total up to the current time, depending on how far down the "doesn't work" funnel we go and on the date of the last update).
Is there anything wrong with our implementation?
Has Google Fit changed the way it reports its data to connected apps?
Can we somehow get more truthful data?
Is there any way to request the same data differently, more efficiently? We are mostly interested in getting daily summaries, totals and averages, rather than time buckets / sessions. We request both, but they go to different data funnels covering different use cases.
There is no answer yet. Our solution has ended up being an unwieldy succession of checks for the data; on every failure we try a different way.
I am seeing some funny behaviour with a KeyValueStore, and I have an assumption that may explain it; maybe you can tell me whether I am right or wrong...
I configured a state store like the following
Map<String, String> storeConfig = new HashMap<>();
storeConfig.put(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(TimeUnit.DAYS.toMillis(30)));
storeConfig.put(TopicConfig.CLEANUP_POLICY_CONFIG, "compact,delete");

StoreBuilder<KeyValueStore<String, String>> store1 = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("STORE1"),
        Serdes.String(),
        Serdes.String()
);

streamsBuilder.addStateStore(store1.withLoggingEnabled(storeConfig));
With this configuration I expect data older than 30 days to disappear, but I am observing something completely different.
When I look at the RocksDB directory of the store, it rolls the file every 14451 bytes, and I see the following structure in the directory:
14451 1. Oct 19:00 LOG
14181 30. Sep 15:59 LOG.old.1569854012833395
14451 30. Sep 17:40 LOG.old.1569918431235734
14451 1. Oct 11:05 LOG.old.1569949239434224
It seems that instead of applying the configured 30-day retention, it is also applying retention based on file size.
I found on the internet that there is also the parameter TopicConfig.RETENTION_BYTES_CONFIG ('retention.bytes'). Do I also have to configure this parameter so that my data stays visible for the whole retention period and is not deleted because of the file size? (I know I have a value for my key, but I can't access it after this phenomenon occurs.)
Thx for answers..
Internally, KeyValueStores use RocksDB, and RocksDB internally uses a so-called LSM-tree (Log-Structured Merge-Tree), which creates many smaller segments that are later combined into larger segments. After this "compaction" step, the smaller segment files can be deleted, because the data has been copied into larger segment files. Hence, there is nothing to worry about.
Furthermore, TopicConfig.RETENTION_MS_CONFIG is a topic configuration and is not related to the local store of a Kafka Streams application. In addition, a KeyValueStore retains data forever, until it is explicitly deleted via a "tombstone" message. Hence, if you set a retention time for the underlying changelog topic, the data might be deleted in the topic, but not in the local store.
If you want to apply a retention time to a local store, you cannot use a KeyValueStore, but you could use a WindowStore, which supports a retention time.
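For illustration, a rough sketch of a windowed store with a 30-day retention, assuming a Kafka Streams version that has the Duration-based Stores overloads (2.1+); the store name, the window size and the wrapper class are placeholders:

import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;

public class WindowedStoreExample {

    public static void addStore(StreamsBuilder streamsBuilder) {
        StoreBuilder<WindowStore<String, String>> store1 = Stores.windowStoreBuilder(
                Stores.persistentWindowStore(
                        "STORE1",
                        Duration.ofDays(30),   // retention: how long entries are kept locally
                        Duration.ofDays(1),    // window size
                        false),                // do not retain duplicates
                Serdes.String(),
                Serdes.String());

        streamsBuilder.addStateStore(store1);
    }
}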
In the example below, time is wall-clock (CPU) time. What I am struggling with: when I replay a time series for back-testing purposes, the data arrives in order but much faster, so the subsequent logic based on the timed window is not correct.
My question:
- The ideal solution for me would be to drive the Siddhi time from the timestamp of the arriving time-series event. Is it possible to do so?
- If not, what is the suggested way to fix this issue?
from fooStream#window.timeBatch(10 sec)
select count() as count
insert into barStream;
You can use the externalTime window [1] as previously mentioned. However, what you are looking for is playback [2].
In Siddhi there are internally two TimestampGenerators, namely EventTimeBasedMillisTimestampGenerator and SystemCurrentTimeMillisTimestampGenerator. By default, SystemCurrentTimeMillisTimestampGenerator is used by the Siddhi CEP engine. But if you use the playback annotation, it changes to EventTimeBasedMillisTimestampGenerator. With this, Siddhi uses the timestamp of the arriving time-series event as the CEP engine's time.
[1] https://wso2.github.io/siddhi/api/latest/#externaltime-window
[2] https://wso2.github.io/siddhi/documentation/siddhi-4.0/#appplayback
[3] https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/test/java/org/wso2/siddhi/core/managment/PlaybackTestCase.java
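As a rough sketch of the playback approach (Siddhi 4.x Java API, as exercised in the test case linked in [3]; the stream definition, timestamps and optional annotation parameters such as idle.time/increment are omitted or made up here), the @app:playback annotation plus sending events with explicit timestamps makes the engine clock follow event time:

import org.wso2.siddhi.core.SiddhiAppRuntime;
import org.wso2.siddhi.core.SiddhiManager;
import org.wso2.siddhi.core.stream.input.InputHandler;

public class PlaybackExample {

    public static void main(String[] args) throws InterruptedException {
        // @app:playback switches the engine to event time: the CEP clock advances
        // with the timestamps of the events you send, not with the system clock.
        String siddhiApp = "@app:playback "
                + "define stream fooStream (symbol string); "
                + "from fooStream#window.timeBatch(10 sec) "
                + "select count() as count "
                + "insert into barStream;";

        SiddhiManager siddhiManager = new SiddhiManager();
        SiddhiAppRuntime runtime = siddhiManager.createSiddhiAppRuntime(siddhiApp);
        runtime.start();

        InputHandler fooStream = runtime.getInputHandler("fooStream");
        long t = 1500000000000L;                       // first event's timestamp (epoch millis)
        fooStream.send(t, new Object[]{"A"});          // drives the playback clock
        fooStream.send(t + 11_000, new Object[]{"B"}); // 11s later in event time -> closes the 10s batch

        runtime.shutdown();
        siddhiManager.shutdown();
    }
}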
Perhaps you can use the externalTime window functionality of Siddhi for this.
See the documentation:
https://wso2.github.io/siddhi/api/latest/#externaltime-window
You can use the Siddhi externalTime window [1] for your requirement.
With the externalTime window, you provide your own timestamp, and the window time is calculated from the timestamp you have provided.
[1] https://wso2.github.io/siddhi/api/latest/#externaltime-window
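As a sketch of what the rewritten query could look like (assuming a hypothetical long attribute ts carrying the event timestamp; run it with the same SiddhiManager boilerplate shown in the playback sketch above):

public class ExternalTimeExample {

    // Same query as in the question, but windowed on the event's own "ts" attribute
    // instead of wall-clock time.
    static final String SIDDHI_APP =
              "define stream fooStream (symbol string, ts long); "
            + "from fooStream#window.externalTime(ts, 10 sec) "
            + "select count() as count "
            + "insert into barStream;";
}

Note that externalTime is a sliding window; if you want the batch semantics of the original timeBatch, the externalTimeBatch window is the closer equivalent.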
For Spark Streaming, are there ways to maintain state only for the current window? I understand updateStateByKey works, but that maintains the state forever unless we purge it. Is it possible to store and reset the state per window?
To give more context: I'm trying to convert one type of object into another within a windowed stream. However, the conversion works as follows:
Object 1 is either an invocation or a response.
Object 2 is not considered complete until we see both an invocation and a response.
However, since the response for an object could be in a separate batch, I need to maintain state across batches.
But I only wish to maintain the state for the current window. Is there any way I could achieve this with Spark?
thank you!
You can use the mapWithState transformation instead of updateStateByKey, and set a timeout on the StateSpec equal to your batch interval. That way you keep the state only for the last batch each time. But this only works if your invocation and response depend only on the last batch; otherwise, when you try to update a key that has been removed, it will throw an exception.
mapWithState also performs better than updateStateByKey.
You can find a sample code snippet below.
import org.apache.spark.streaming._

// updateUserEvents is your own state-update function passed to mapWithState
val stateSpec =
  StateSpec
    .function(updateUserEvents _)
    .timeout(Minutes(5))  // state entries not updated within the timeout are removed

// applied to a keyed DStream: keyedStream.mapWithState(stateSpec)
I am using an S3 lifecycle rule to move objects to Glacier. Since objects will be moved to Glacier storage, I need to make sure my application's RDS is also updated with similar details.
As per my discussion in this thread, AWS Lambda for objects moved to Glacier, there is currently no way to generate an SQS notification to get notified about an object being moved to Glacier.
Also, as per my understanding, the lifecycle rule is currently evaluated once a day, but there is no specific time of day when this happens. If there were, I was planning to have a scheduler run after that and update the status of archived objects in RDS.
Is there a way you can suggest that would come close enough to keep these status changes in sync between AWS and RDS?
Let me know your feedback, or if you need more information to understand the use case.
=== My current approach is as follows.
Below is the exact flow I have implemented; please review it and let me know if anything could have been done in a better way.
When an object is uploaded to the system, I mark it with status Tagged and also capture its creation date. My lifecycle rule is configured with 30 days from creation. So I have a scheduler that calculates the difference between today's date and the object creation date for all objects with status Tagged, and checks whether the difference is greater than or equal to 30. If so, it updates the status to Archived.
If a user performs any operation on an object with status Archived, we explicitly check in S3 whether the object has actually been moved to Glacier or not. If not, we perform the requested operation. If it has been moved to Glacier, we initiate the restore process and wait for the restore to finish before performing the requested operation.
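For the "check whether it is actually in Glacier, and restore if so" step, a rough sketch with the AWS SDK for Java v1 could look like the following (the class name, the 7-day restore window and the omitted error handling are placeholders):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.RestoreObjectRequest;

public class GlacierStatusChecker {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    /** Returns true if the object can be read right away; otherwise starts a restore. */
    public boolean ensureReadable(String bucket, String key) {
        ObjectMetadata metadata = s3.getObjectMetadata(bucket, key);

        // x-amz-storage-class is only reported for non-STANDARD storage classes.
        if (!"GLACIER".equals(metadata.getStorageClass())) {
            return true; // not archived -> perform the requested operation directly
        }
        if (metadata.getRestoreExpirationTime() != null) {
            return true; // a temporarily restored copy already exists
        }

        // Archived and not restored: kick off a restore (7 days here) and let the caller wait/poll.
        s3.restoreObject(new RestoreObjectRequest(bucket, key, 7));
        return false;
    }
}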
I appreciate your thoughts and would like to hear your input on the approach I have taken above.
Regards.
If I wanted to implement this, I would set the storage class of the object inside my database as "Glacier/Archived" at the beginning of the day it is supposed to transition.
You already know your lifecycle policies, and, as part of object metadata, you also know the creation time of each object. Then it becomes a simple query, which can be scheduled to run every night at 12:00 AM.
You could further enhance your application with an algorithm that checks whether an object was supposed to transition to Glacier today; at the moment object access is requested, it would go and explicitly check whether it has actually transitioned or not. Once it has been marked as Glacier/Archive for more than a day, this check is no longer required.
Of course, if for any reason, the above solution doesn't work for you, it is possible to write a scanner application to continuously check the status of those objects that are supposed to transition at "DateTime.Today" and are not marked as Glacier/Archive yet.