I want to batch messages with the KStream interface.
I have a stream of keys/values.
I tried to collect them in a tumbling window and then process the complete window at once.
builder.stream(longSerde, updateEventSerde, CONSUME_TOPIC)
.aggregateByKey(
HashMap::new,
(aggKey, value, aggregate) -> {
aggregate.put(value.getUuid(), value);
return aggregate;
},
TimeWindows.of("intentWindow", 100),
longSerde, mapSerde)
.foreach((wk, values) -> {
The thing is that foreach gets called on each update to the KTable.
I would like to process the whole window once it is complete, i.e. collect the data from the 100 ms window and then process it all at once inside foreach.
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 294
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 295
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 296
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 297
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 298
16:** - windows from 2016-08-23T10:56:26 to 2016-08-23T10:56:27, key 2016-07-21T14:38:16.288, value count: 299
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 1
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 2
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 3
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 4
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 5
16:** - windows from 2016-08-23T10:56:27 to 2016-08-23T10:56:28, key 2016-07-21T14:38:16.288, value count: 6
At some point the new window starts with 1 entry in the map,
so I don't even know when the window is full.
Any hints on how to batch process in Kafka Streams?
My actual task is to push updates from the stream to Redis, but I don't want to read / update / write individually even though Redis is fast.
My solution for now is to use KStream.process() to supply a processor which adds to a queue in process and actually processes the queue in punctuate.
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;

public class BatchedProcessor extends AbstractProcessor<Long, IntentUpdateEvent> {

    private final Writer writer;
    private final long schedulePeriodic;

    BatchedProcessor(Writer writer, long schedulePeriodic) {
        this.writer = writer;
        this.schedulePeriodic = schedulePeriodic;
    }

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        // schedule punctuate() every schedulePeriodic milliseconds
        context.schedule(schedulePeriodic);
    }

    @Override
    public void punctuate(long timestamp) {
        super.punctuate(timestamp);
        writer.processQueue();   // write the whole batch in one go
        context().commit();
    }

    @Override
    public void process(Long key, IntentUpdateEvent intentUpdateEvent) {
        writer.addToQueue(intentUpdateEvent);   // only enqueue here; the batch is handled in punctuate
    }
}
I still have to test it, but it solves the problem I had. One could easily write such a processor in a very generic way. The API is very neat and clean, but a processBatched((List batchedMessages) -> ..., timeInterval OR countInterval) that just uses punctuate to process the batch, commits at that point, and collects KeyValues in a store might be a useful addition.
But maybe it was intended that this be solved with a Processor, keeping the API purely focused on one-message-at-a-time, low-latency processing.
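For reference, wiring such a processor into the topology is short. A sketch against the same 0.10.0-era API, reusing the writer, serdes and topic from my snippets above:
builder.stream(longSerde, updateEventSerde, CONSUME_TOPIC)
       .process(() -> new BatchedProcessor(writer, 100));   // batches are flushed from punctuate()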
Right now (as of Kafka 0.10.0.0 / 0.10.0.1): The windowing behavior you are describing is "working as expected". That is, if you are getting 1,000 incoming messages, you will (currently) always see 1,000 updates going downstream with the latest versions of Kafka / Kafka Streams.
Looking ahead: The Kafka community is working on new features to make this update-rate behavior more flexible (e.g. to allow for what you described above as your desired behavior). See KIP-63: Unify store and downstream caching in streams for more details.
====== Update ======
On further testing, this does not work.
The correct approach is to use a processor as outlined by @friedrich-nietzsche. I am down-voting my own answer.... grrrr.
===================
I am still wrestling with this API (but I love it, so it's time well spent :)), and I am not sure what you're trying to accomplish downstream from where your code sample ended, but it looks similar to what I got working. High level is:
An object is read from the source. It represents a key and 1:∞ events, and I want to publish the total number of events per key every 5 seconds (or TP5s, transactions per 5 seconds). The beginning of the code looks the same, but I use:
KStreamBuilder.stream
reduceByKey
to a window(5000)
to a new stream which gets the accumulated value for each key every 5 secs.
map that stream to a new KeyValue per key
to the sink topic.
In my case, in each window period I can reduce all events to one event per key, so this works. If you want to retain all the individual events per window, I assume you could use reduce to map each instance to a collection of instances (possibly with the same key, or you might need a new key), and at the end of each window period the downstream stream will get a bunch of collections of your events (or maybe just one collection of all the events), all in one go. It looks like this, sanitized and Java 7-ish:
builder.stream(STRING_SERDE, EVENT_SERDE, SOURCE_TOPICS)
.reduceByKey(eventReducer, TimeWindows.of("EventMeterAccumulator", 5000), STRING_SERDE, EVENT_SERDE)
.toStream()
.map(new KeyValueMapper<Windowed<String>, Event, KeyValue<String,Event>>() {
public KeyValue<String, Event> apply(final Windowed<String> key, final Event finalEvent) {
return new KeyValue<String, Event>(key.key(), new Event(key.window().end(), finalEvent.getCount()));
}
}).to(STRING_SERDE, EVENT_SERDE, SINK_TOPIC);
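If you need every raw event per window rather than one reduced value, a variant using aggregateByKey (as in the question) could collect the events into a list. This is an untested sketch; eventListSerde is a hypothetical Serde for the list, and the foreach is still invoked on every update, as discussed above:
builder.stream(STRING_SERDE, EVENT_SERDE, SOURCE_TOPICS)
    .aggregateByKey(
        ArrayList::new,
        (key, event, list) -> { list.add(event); return list; },
        TimeWindows.of("EventCollector", 5000),
        STRING_SERDE, eventListSerde)   // eventListSerde: hypothetical Serde<ArrayList<Event>>
    .foreach((windowedKey, events) -> {
        // "events" holds everything accumulated so far for this window
    });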
Related
I have a function that uses Lettuce to talk to a Redis cluster.
In this function, I insert data into a stream data structure.
import io.lettuce.core.cluster.SlotHash;
...
public void addData(Map<String, String> dataMap) {
var sync = SlotHash.getSlot(key).sync();
sync.xadd(key, dataMap);
}
I also want to set the TTL when I insert a record for the first time, because part of the user requirement is to expire the structure after a fixed length of time, in this case 10 hours.
Unfortunately the XADD command does not accept an extra parameter to set the TTL like the SET command does.
So now I am setting the TTL this way:
public void addData(Map<String, String> dataMap) {
    var sync = SlotHash.getSlot(key).sync();
    sync.xadd(key, dataMap);
    sync.expire(key, 36000 /* 10 hours */);
}
What is the best way to ensure that I set the expiry time only once (i.e. when the stream structure is first created)? I should not set the TTL on every call, because every call to xadd would be followed by a call to expire, which effectively postpones the expiry indefinitely.
I could always check the number of items in the stream first, but that is extra overhead. I also don't want to keep flags on the Java application side, because the app could be restarted and that information would be lost.
You may want to try a Lua script. The sample script below sets the expiry only if it is not already set for the key, and it works with any type of key in Redis.
eval "local ttl = redis.call('ttl', KEYS[1]); if ttl == -1 then redis.call('expire', KEYS[1], ARGV[1]); end; return ttl;" 1 mykey 12
The script also returns the actual expiry time left in seconds.
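If you prefer to keep it on the Java side, the same script can be run through Lettuce. A sketch, where sync is the same synchronous command handle as in the question and key / 36000 are placeholders:
import io.lettuce.core.ScriptOutputType;

String script =
    "local ttl = redis.call('ttl', KEYS[1]); " +
    "if ttl == -1 then redis.call('expire', KEYS[1], ARGV[1]); end; " +
    "return ttl;";
// Returns the TTL that was in place before the call (-1 means it has just been set).
Long previousTtl = sync.eval(script, ScriptOutputType.INTEGER,
        new String[] { key }, String.valueOf(36000) /* 10 hours */);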
I use @Cacheable(name = "rates", key = "#car.name")
Can I set up a TTL for this cache? And can the TTL be keyed by car.name?
For example, I want to set name = "rates" with a TTL of 60 secs.
Running the Java:
time: 0 car.name = 1, return "11"
time: 30 car.name = 2, return "22"
time: 60 car.name = 1 key should be gone.
time: 90 car.name = 2 key should be gone.
And I want to set multiple TTLs for multiple names, e.g.
name = "rates2" TTL 90 secs.
You can't: @Cacheable is static configuration, and what you want is more on the dynamic side. Keep in mind that Spring just provides an abstraction that is supposed to fit all providers. You should either specify different regions for the different entries, or run a background process that invalidates the keys that need invalidation.
The time-to-live setting is on a per-region basis when statically configured.
If you walk away from static configuration, you can set the expiration while inserting an entry, but then you are getting away from Spring (one size to fit them all, remember) and entering the territory of the caching provider, which can be anything: Redis, Hazelcast, Ehcache, Infinispan. Each will have a different contract.
Here is an example contract of the IMap interface from Hazelcast:
IMap::put(Key, Value, TTL, TimeUnit)
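A usage sketch of that contract, with plain Hazelcast outside Spring (the map and variable names here are only illustrative):
// Per-entry TTL straight on the Hazelcast IMap, bypassing the Spring cache abstraction.
IMap<String, Rate> rates = hazelcastInstance.getMap("rates");
rates.put(car.getName(), rate, 60, TimeUnit.SECONDS);   // this one entry expires after 60 seconds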
But this has nothing to do with Spring.
With Spring, that means you can do the following:
@Cacheable(name = "floatingRates")
List<Rate> floatingRates;
@Cacheable(name = "fixedRates")
List<Rate> fixedRates;
and then define TTL for each.
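For example, if the provider happens to be Redis through Spring Data Redis (2.2+), the per-cache TTLs could be declared like this. A sketch only, since the question does not say which provider is in use:
import java.time.Duration;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;

@Configuration
public class CacheConfig {

    @Bean
    public RedisCacheManager cacheManager(RedisConnectionFactory connectionFactory) {
        // One TTL per cache name; entries in "rates" live 60s, entries in "rates2" live 90s.
        return RedisCacheManager.builder(connectionFactory)
                .withCacheConfiguration("rates",
                        RedisCacheConfiguration.defaultCacheConfig().entryTtl(Duration.ofSeconds(60)))
                .withCacheConfiguration("rates2",
                        RedisCacheConfiguration.defaultCacheConfig().entryTtl(Duration.ofSeconds(90)))
                .build();
    }
}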
I have a couchbase key K which stores a JsonLongDocument V.
Whenever I see an event E at time T, I increment V by 1 with an updated expiry of T+n seconds, using the following Java client function:
bucket.counter(K, 1, 1, n)
I also occasionally have to get the value V using key K by calling the following Java client function:
bucket.get(K, classOf[JsonLongDocument])
But whenever I call a simple get, Couchbase is changing the expiry of the document and setting it to 0, which means persist forever.
How can I still do the 'get' on my value without changing its expiry?
Getting a document by its key does not change the document's expiry.
You must use bucket.touch(K, n) to update the expiry after incrementing the counter. The expiry passed to bucket.counter(K, 1, 1, n) is only used if the counter document does not exist.
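In code that might look like the following (a sketch against the 2.x Java SDK, with K and n as in the question):
// The expiry on counter() only applies when the document is first created,
// so refresh it explicitly after every increment.
bucket.counter(K, 1, 1, n);   // increment; creates the doc with expiry n if it does not exist
bucket.touch(K, n);           // reset the expiry to n seconds from now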
The expiry returned in the JsonLongDocument is either the expiry value passed to the method or zero. It does not reflect the actual expiry timestamp stored in Couchbase. I'm not sure why the SDKs behave this way, but it is expected behavior.
To see the real expiry timestamp you can use N1QL as described here or you can inspect the counter's data file.
Steps to get a document's expiry from a data file:
Determine in which vbucket and server the document resides, vbucket 8 and localhost in the example below (requires libcouchbase, the Couchbase C SDK)
> cbc-hash test-counter-1
test-counter-1: [vBucket=8, Index=0]
Server: localhost:11210, CouchAPI: http://localhost:8092/default
Replica #0: Index=-1, Host=N/A
Extract the counter document information from the vbucket data file, 8.couch.1 in this example
> couch_dbdump 8.couch.1 | grep -B 2 -A 6 test-counter-1
Dumping "8.couch.1":
Doc seq: 63
id: test-counter-1
rev: 63
content_meta: 128
size (on disk): 11
cas: 1501205424364060672, expiry: 1501205459, flags: 0, datatype: 1
size: 1
data: (snappy) 2
See File Locations or the Disk Storage section of your server node information to locate the 'data' directory where data files are stored. The vbucket files will be in a subdirectory of 'data' named after your bucket.
I am using [cqlsh 5.0.1 | Cassandra 2.2.1 | CQL spec 3.3.0 | Native protocol v4]. I have a 2-node Cassandra cluster with a replication factor of 2.
$ nodetool status test_keyspace
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.xxx.4.xxx 85.32 GB 256 100.0% xxxx-xx-xx-xx-xx rack1
UN 10.xxx.4.xxx 80.99 GB 256 100.0% x-xx-xx-xx-xx rack1
[I have replaced numbers with x]
This is the keyspace definition.
cqlsh> describe test_keyspace;
CREATE KEYSPACE test_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'} AND durable_writes = true;
CREATE TABLE test_keyspace.test_table (
id text PRIMARY KEY,
listids map<int, timestamp>
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX list_index ON test_keyspace.test_table (keys(listids));
The id values are unique, and the listids map keys have a cardinality close to 1000. I have millions of records in this keyspace.
I want to get the count of records with a specific key, and also the list of those records. I tried this query from cqlsh:
select count(1) from test_table where listids contains key 12;
Got this error after a few seconds:
ReadTimeout: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
I have already modified timeout parameters in cqlshrc and cassandra.yaml.
cat /etc/cassandra/conf/cassandra.yaml | grep read_request_timeout_in_ms
#read_request_timeout_in_ms: 5000
read_request_timeout_in_ms: 300000
cat ~/.cassandra/cqlshrc
[connection]
timeout = 36000
request_timeout = 36000
client_timeout = 36000
When I checked /var/log/cassandra/system.log I got only this:
WARN [SharedPool-Worker-157] 2016-07-25 11:56:22,010 SelectStatement.java:253 - Aggregation query used without partition key
I am using the Java client from my code. The Java client is also getting a lot of read timeouts. One solution might be remodeling my data, but that will take more time (although I am not sure about it). Can someone suggest a quick solution to this problem?
Adding stats :
$ nodetool cfstats test_keyspace
Keyspace: test_keyspace
Read Count: 5928987886
Read Latency: 3.468279416568199 ms.
Write Count: 1590771056
Write Latency: 0.02020026287239664 ms.
Pending Flushes: 0
Table (index): test_table.list_index
SSTable count: 9
Space used (live): 9664953448
Space used (total): 9664953448
Space used by snapshots (total): 4749
Off heap memory used (total): 1417400
SSTable Compression Ratio: 0.822577888909709
Number of keys (estimate): 108
Memtable cell count: 672265
Memtable data size: 30854168
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 1718274
Local read latency: 63.356 ms
Local write count: 1031719451
Local write latency: 0.015 ms
Pending flushes: 0
Bloom filter false positives: 369
Bloom filter false ratio: 0.00060
Bloom filter space used: 592
Bloom filter off heap memory used: 520
Index summary off heap memory used: 144
Compression metadata off heap memory used: 1416736
Compacted partition minimum bytes: 73
Compacted partition maximum bytes: 2874382626
Compacted partition mean bytes: 36905317
Average live cells per slice (last five minutes): 5389.0
Maximum live cells per slice (last five minutes): 51012
Average tombstones per slice (last five minutes): 2.0
Maximum tombstones per slice (last five minutes): 2759
Table: test_table
SSTable count: 559
Space used (live): 62368820540
Space used (total): 62368820540
Space used by snapshots (total): 4794
Off heap memory used (total): 817427277
SSTable Compression Ratio: 0.4856571513639344
Number of keys (estimate): 96692796
Memtable cell count: 2587248
Memtable data size: 27398085
Memtable off heap memory used: 0
Memtable switch count: 558
Local read count: 5927272991
Local read latency: 3.788 ms
Local write count: 559051606
Local write latency: 0.037 ms
Pending flushes: 0
Bloom filter false positives: 4905594
Bloom filter false ratio: 0.00023
Bloom filter space used: 612245816
Bloom filter off heap memory used: 612241344
Index summary off heap memory used: 196239565
Compression metadata off heap memory used: 8946368
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 1916
Compacted partition mean bytes: 173
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
You can either redesign your tables, or split your query into multiple smaller queries.
You are selecting using a secondary index without using the partition key (that's what the warning tells you). Doing that, you essentially perform a full table scan. Your nodes have to look into every partition in order to fulfill your request.
A solution without changing the data model would be to iterate over all partitions and run your query once per partition.
select count(*) from test_table where id = 'somePartitionId' and listids contains key 12;
This way, your nodes know in which partition you are looking for this information. You would then have to aggregate the results of these queries on the client side.
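With the DataStax Java driver, that per-partition loop could look roughly like this. A sketch only: it assumes you can enumerate the partition ids (knownIds here) from somewhere else, e.g. another table or an external list:
// Run the indexed query once per known partition and sum the counts on the client side.
PreparedStatement stmt = session.prepare(
        "SELECT count(*) FROM test_table WHERE id = ? AND listids CONTAINS KEY 12");
long total = 0;
for (String id : knownIds) {
    Row row = session.execute(stmt.bind(id)).one();
    total += row.getLong(0);
}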
I faced the same issue. I tried:
1) Setting client_timeout = None in cqlshrc in ~/.cassandra (it can be set to None to disable the timeout). That did not help.
2) Increasing the various *_timeout_in_ms settings in cassandra.yaml. That did not help either.
Finally I settled on running the select from my Java code and looping over the result to get the count. 12 million rows gave me the count in 7 seconds. It is fast.
Cluster cluster = Cluster.builder()
        .addContactPoints(serverIp)
        .build();
Session session = cluster.connect(keyspace);

String cqlStatement = "SELECT count(*) FROM imadmin.device_appclass_attributes";
//String cqlStatement = "SELECT * FROM system_schema.keyspaces";
for (Row row : session.execute(cqlStatement)) {
    System.out.println(row.toString());
}
In Hive, how do I apply the lower() UDF to an array of strings?
Or any UDF in general? I don't know how to apply a "map" operation in a SELECT query.
If your use case is that you are transforming an array in isolation (not as part of a table), then the combination of explode, lower, and collect_list should do the trick. For example (please pardon the horrible execution times, I'm running on an underpowered VM):
hive> SELECT collect_list(lower(val))
> FROM (SELECT explode(array('AN', 'EXAMPLE', 'ARRAY')) AS val) t;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 4 seconds 10 msec
Ended Job = job_1422453239049_0017
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 4.01 sec HDFS Read: 283 HDFS Write: 17 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 10 msec
OK
["an","example","array"]
Time taken: 33.05 seconds, Fetched: 1 row(s)
(Note: Replace array('AN', 'EXAMPLE', 'ARRAY') in the above query with whichever expression you are using to generate the array.)
If instead your use case is such that your arrays are stored in a column of a Hive table and you need to apply the lowercase transformation to them, to my knowledge you have two principal options:
Approach #1: Use the combination of explode and LATERAL VIEW to separate the array. Use lower to transform the individual elements, and then collect_list to glue them back together. A simple example with silly made-up data:
hive> DESCRIBE foo;
OK
id int
data array<string>
Time taken: 0.774 seconds, Fetched: 2 row(s)
hive> SELECT * FROM foo;
OK
1001 ["ONE","TWO","THREE"]
1002 ["FOUR","FIVE","SIX","SEVEN"]
Time taken: 0.434 seconds, Fetched: 2 row(s)
hive> SELECT
> id, collect_list(lower(exploded))
> FROM
> foo LATERAL VIEW explode(data) exploded_table AS exploded
> GROUP BY id;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 3 seconds 310 msec
Ended Job = job_1422453239049_0014
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 3.31 sec HDFS Read: 358 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
1001 ["one","two","three"]
1002 ["four","five","six","seven"]
Time taken: 34.268 seconds, Fetched: 2 row(s)
Approach #2: Write a simple UDF to apply the transformation. Something like:
package my.package_name;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class LowerArray extends UDF {
public List<Text> evaluate(List<Text> input) {
List<Text> output = new ArrayList<Text>();
for (Text element : input) {
output.add(new Text(element.toString().toLowerCase()));
}
return output;
}
}
And then invoke the UDF directly on the data:
hive> ADD JAR my_jar.jar;
Added my_jar.jar to class path
Added resource: my_jar.jar
hive> CREATE TEMPORARY FUNCTION lower_array AS 'my.package_name.LowerArray';
OK
Time taken: 2.803 seconds
hive> SELECT id, lower_array(data) FROM foo;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 2 seconds 760 msec
Ended Job = job_1422453239049_0015
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.76 sec HDFS Read: 358 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
OK
1001 ["one","two","three"]
1002 ["four","five","six","seven"]
Time taken: 27.243 seconds, Fetched: 2 row(s)
There are some trade-offs between the two approaches. #2 will probably be more efficient at runtime in general than #1, since the GROUP BY clause in #1 forces a reduction stage while the UDF approach does not. However, #1 does everything in HiveQL and is a bit more easily generalized (you can replace lower with some other kind of string transformation in the query if you needed to). With the UDF approach of #2, you potentially have to write a new UDF for each different kind of transformation you want to apply.