I am trying to wrap my head around Kafka Streams and have some fundamental questions that I can't seem to figure out on my own. I understand the concept of a KTable and Kafka state stores, but I am having trouble deciding how to approach this. I am also using Spring Cloud Stream, which adds another level of complexity on top of this.
My use case:
I have a rule engine that reads in a Kafka event, processes the event, returns a list of rules that matched, and writes it to another topic. This is what I have so far:
@Bean
public Function<KStream<String, ProcessNode>, KStream<String, List<IndicatorEvaluation>>> process() {
    return input -> input
        .mapValues(this::analyze)
        .filter((host, evaluation) -> evaluation != null);
}

public List<IndicatorEvaluation> analyze(final String host, final ProcessNode process) {
    // Does stuff
}
Some of the stateful rules look like:
[some condition] REPEATS 5 TIMES WITHIN 1 MINUTE
[some condition] FOLLOWEDBY [some condition] WITHIN 1 MINUTE
[rule A exists and rule B exists]
My current implementation stores all this information in memory in order to perform the analysis. For obvious reasons, it does not scale easily, so I figured I would persist it in a Kafka state store.
I am unsure of the best way to go about it. I know it is possible to create custom state stores, which allow for a higher level of flexibility, but I'm not sure whether the Kafka Streams DSL supports them.
Still new to Kafka Streams and wouldn't mind hearing a variety of suggestions.
From the description you have given, I believe this use case can still be implemented using the DSL in Kafka Streams. The code you have shown above does not track any state. In your topology, you need to add state by tracking the counts of the rules and storing them in a state store. Then you only need to send the output rules when that count hits a threshold. Here is the general idea as pseudo-code; obviously, you have to tweak it to satisfy the particular specifications of your use case.
@Bean
public Function<KStream<String, ProcessNode>, KStream<String, List<IndicatorEvaluation>>> process() {
    return input -> input
        .mapValues(this::analyze)
        .filter((host, evaluation) -> evaluation != null)
        ...
        .groupByKey(...)
        .windowedBy(TimeWindows.of(Duration.ofHours(1)))
        .count(Materialized.as("rules"))
        .filter((key, value) -> value > 4)
        .toStream()
        ....
}
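Outside of the Spring Cloud Stream function wrapper, a minimal runnable sketch of that windowed-count idea might look like the following (the topic names rule-events and matched-rules, the one-minute window, and the five-hit threshold are assumptions for illustration, not the question's actual topology):

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> events = builder.stream("rule-events"); // hypothetical input topic

events
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))   // one window per minute per key
    .count(Materialized.as("rule-hits"))                 // backed by a local state store
    .toStream()
    .filter((windowedKey, count) -> count >= 5)          // e.g. "REPEATS 5 TIMES WITHIN 1 MINUTE"
    .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), count))
    .to("matched-rules", Produced.with(Serdes.String(), Serdes.Long()));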
I'm new to Stack Overflow, so forgive me if the question is badly asked. Any help/inspiration is much appreciated!
I'm using Kafka Streams to filter incoming data to my database. The incoming messages look like {"ID":"X","time":"HH:MM"} plus a few other parameters that are irrelevant in this case. I managed to get a Java application running that reads from a topic and prints out the incoming messages. Now what I want to do is use KTables(?) to group incoming messages with the same ID and then use a session window to group the table into time slots. I want a time window of X minutes continuously running along the time axis.
The first thing is of course to get a KTable running to count incoming messages with the same ID. What I would like to do should result in something like this:
ID Count
X 1
Y 3
Z 1
that keeps getting updated continuously, so that messages with an outdated timestamp are removed from the table.
I'm not a hundred percent sure, but I think what I want is KTables and not KStreams, am I right? And how do I achieve the sliding window, if this is the proper way of achieving my desired results?
This is the code I use right now. It only reads from a topic and prints the incoming messages.
private static List<String> printEvent(String o) {
    System.out.println(o);
    return Arrays.asList(o);
}

final StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream(srcTopic)
       .flatMapValues(value -> printEvent(value));
I would like to know what I have to add to achieve my desired output stated above, and where I put it in my code.
Thanks in advance for the help!
Yes, you need a KTable and a sliding window. I also recommend you look at how late-arriving messages are handled (the "watermark" concept in other stream processors; in Kafka Streams this corresponds to the window retention settings).
Example
KTable<Windowed<Key>, Value> oneMinuteWindowed = yourKStream
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
    .reduce(/* your adder */, Materialized.as("store1m"));
// where your adder can be as simple as (val, agg) -> agg + val
// for primitive types, or as complex as you need
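If all you need is the running count per ID shown in the table above, count() is simpler than reduce(). A minimal sketch, assuming the message ID is already the record key and srcTopic is the source topic from the question:

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

final StreamsBuilder builder = new StreamsBuilder();

KTable<Windowed<String>, Long> counts = builder.<String, String>stream(srcTopic)
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))   // one-minute windows
    .count(Materialized.as("counts-1m"));                // queryable state store

// print every update; counts are scoped to their window, so old
// windows simply stop being updated as time advances
counts.toStream()
      .foreach((windowedId, count) -> System.out.println(windowedId.key() + " " + count));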
(Working in RxKotlin and RxJava, but using metacode for simplicity)
Many Reactive Extensions guides begin by creating an Observable from already available data. From The introduction to Reactive Programming you've been missing, it's created from a single string
var sourceStream = Rx.Observable.just('https://api.github.com/users');
Similarly, on the front page of RxKotlin, one is created from a populated list
val list = listOf(1,2,3,4,5)
list.toObservable()
Now consider a simple filter that yields an outStream,
var outStream = sourceStream.filter({x > 3})
In both guides the source events are declared a priori, which means the timeline of events has the form
source: ----1,2,3,4,5-------
out: --------------4,5---
How can I modify sourceStream to act more like a pipeline, where no input data is available when sourceStream is created, and each source event is processed by out as soon as it becomes available:
source: ---1--2--3-4---5-------
out: ------------4---5-------
I expected to find an Observable.add() for dynamic updates
var sourceStream = Observable.empty()
var outStream = sourceStream.filter({x > 3})

// print each element as it's added
sourceStream.subscribe({println(it)})
outStream.subscribe({println(it)})

for i in range(5):
    sourceStream.add(i)
Is this possible?
I'm new, but how could I solve my problem without a subject? If I'm testing an application, and I want it to "pop" an update every 5 seconds, how else can I do it other than this publish/subscribe business? Can someone post an answer to this question that doesn't involve a Subscriber?
If you want to pop an update every five seconds, then create an Observable with the interval operator; don't use a Subject. There are a few dozen operators for constructing Observables, so you rarely need a Subject.
That said, sometimes you do need one, and they come in very handy when testing code. I use them extensively in unit tests.
To Use Subject Or Not To Use Subject? is an excellent article on the subject of Subjects.
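To make the question's pseudo-code concrete: a PublishSubject is exactly the "Observable you can add to". A minimal sketch in RxJava 2 (the integer values and the > 3 filter mirror the question's example):

import io.reactivex.Observable;
import io.reactivex.subjects.PublishSubject;

PublishSubject<Integer> source = PublishSubject.create();
Observable<Integer> out = source.filter(x -> x > 3);

// subscribe first; a PublishSubject does not replay earlier events
source.subscribe(x -> System.out.println("source: " + x));
out.subscribe(x -> System.out.println("out:    " + x));

// events arrive after subscription and flow through the pipeline immediately
for (int i = 1; i <= 5; i++) {
    source.onNext(i);
}
source.onComplete();

And for the quoted follow-up ("pop an update every 5 seconds"), the interval operator needs no Subject at all: Observable.interval(5, TimeUnit.SECONDS) emits an increasing counter on that schedule.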
I have written a Spring Boot microservice using RxJava (an aggregated service) to implement the following simplified use case. The big picture: when an instructor uploads a course content document, a set of questions should be generated and saved.
User uploads a document to the system.
The system calls a Document Service to convert the document into text.
Then it calls another question-generating service to generate a set of questions from the above text content.
Finally these questions are posted to a basic CRUD microservice to be saved.
When a user uploads a document, lots of questions are created from it (maybe hundreds or so). The problem is that I am posting questions one at a time, sequentially, for the CRUD service to save. This slows down the operation drastically due to the IO-intensive network calls; it takes around 20 seconds to complete the entire process. Here is the current code, assuming all the questions have been formulated.
questions.flatMapIterable(list -> list).flatMap(q -> createQuestion(q)).toList();
private Observable<QuestionDTO> createQuestion(QuestionDTO question) {
    return Observable.<QuestionDTO> create(sub -> {
        QuestionDTO questionCreated = restTemplate.postForEntity(QUESTIONSERVICE_API,
                new org.springframework.http.HttpEntity<QuestionDTO>(question), QuestionDTO.class).getBody();
        sub.onNext(questionCreated);
        sub.onCompleted();
    }).doOnNext(s -> log.debug("Question was created successfully."))
      .doOnError(e -> log.error("An ERROR occurred while creating a question: " + e.getMessage()));
}
Now my requirement is to post all the questions to the CRUD service in parallel and merge the results on completion. Also note that the CRUD service will accept only one question object at a time, and that cannot be changed. I know that I can use the Observable.zip operator for this purpose, but I have no idea how to apply it in this context since the actual number of questions is not predetermined. How can I change the code in the first snippet so that I can improve the performance of the application? Any help is appreciated.
By default the observables in flatMap operate on the same scheduler that you subscribed on. In order to run your createQuestion observables in parallel, you have to subscribe to them on a computation scheduler.
questions.flatMapIterable(list -> list)
.flatMap(q -> createQuestion(q).subscribeOn(Schedulers.computation()))
.toList();
Check this article for a full explanation.
I'm new to Hazelcast, and I'm trying to use it to store data in a map that is too large to fit on a single machine.
One of the processes I need to implement is to go over each of the values in the map and do something with them - no accumulation or aggregation, and I don't need to see all the data at once, so there is no memory concern there.
My trivial implementation would be to use IMap.keySet() and then to iterate over all the keys to get each stored value in turn (and allow the value to be GCed after processing), but my concern is that there is going to be so much data in the system that even just getting the list of keys will be large enough to put undue stress on the system.
I was hoping that there was a streaming API that I can stream keys (or even full entries) in such a way that the local node will not have to cache the entire set locally - but failed to find anything that seemed relevant to me in the documentation.
I would appreciate any suggestions that you may come up with. Thanks.
Hazelcast Jet provides a distributed version of java.util.stream and adds streaming capabilities to IMap.
It allows execution of the Java Streams API on the Hazelcast cluster.
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.stream.IStreamMap;
import com.hazelcast.jet.stream.IStreamList;
import static com.hazelcast.jet.stream.DistributedCollectors.toIList;

final IStreamMap<String, Integer> streamMap = instance1.getMap("source");

// stream of entries, you can grab keys from it
IStreamList<String> counts = streamMap.stream()
        .map(entry -> entry.getKey().toLowerCase())
        .filter(key -> key.length() >= 5)
        .sorted()
        // this will store the result on the cluster as well,
        // so there is no data movement between client and cluster
        .collect(toIList());
Please find more info about Jet here and more examples here.
Cheers,
Vik
While the Hazelcast Jet stream implementation looks impressive, I didn't have a lot of time to invest in looking at upgrading to Hazelcast Jet (in our pretty much bog-standard vert.x setup). Instead I used IMap.executeOnEntries, which seems to do about the same thing as detailed for Hazelcast Jet by @Vik Gamov, except with a more annoying syntax.
My example:
myMap.executeOnEntries(new EntryProcessor<String, MyEntity>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Object process(Entry<String, MyEntity> entry) {
        // do something with the value; note that in-place mutations are
        // only written back to the map if you call entry.setValue(...)
        entry.getValue().fondle();
        return null;
    }

    @Override
    public EntryBackupProcessor<String, MyEntity> getBackupProcessor() {
        return null; // declare that we don't process backup copies
    }
});
As you can see, the syntax is quite annoying:
We need to create an actual object that can be serialized to the cluster - no fancy lambdas here (don't use my serial ID if you copy & paste this - it's broken by design).
One reason it cannot be a lambda is that the interface is not functional - you need another method to handle backup copies (or at least to declare that you don't want to handle them, as I do). While I acknowledge its importance, it isn't important all of the time, and I would guess it only matters in rare cases.
Obviously you can't return data from the process (or at least it's not trivial) - which is not important in my case, but still.
Is there functionality built into Kafka Streams that allows for dynamically connecting a single input stream into multiple output streams? KStream.branch allows branching based on true/false predicates, but this isn't quite what I want. I'd like each incoming log to determine the topic it will be streamed to at runtime, e.g., a log {"date": "2017-01-01"} will be streamed to the topic topic-2017-01-01 and a log {"date": "2017-01-02"} will be streamed to the topic topic-2017-01-02.
I could call forEach on the stream, then write to a Kafka producer, but that doesn't seem very elegant. Is there a better way to do this within the Streams framework?
If you want to create topics dynamically based on your data, you do not get any support within Kafka's Streaming API at the moment (v0.10.2 and earlier). You will need to create a KafkaProducer and implement your dynamic "routing" by yourself (for example using KStream#foreach() or KStream#process()). Note that you need to do synchronous writes to avoid data loss (which are unfortunately not very performant). There are plans to extend the Streaming API with dynamic topic routing, but there is no concrete timeline for this feature right now.
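A minimal sketch of that manual routing (the stream type, the extractDate helper, and the producer configuration are assumptions for illustration):

// assumes: a KStream<String, String> named stream carrying the JSON logs, and a
// pre-configured KafkaProducer<String, String> producer (StringSerializer for both)
stream.foreach((key, value) -> {
    // extractDate is a hypothetical helper that pulls the "date" field out of the JSON value
    String topic = "topic-" + extractDate(value);
    try {
        // block on the future, i.e. write synchronously, to avoid silent data loss
        producer.send(new ProducerRecord<>(topic, key, value)).get();
    } catch (Exception e) {
        throw new RuntimeException("Failed to write record to " + topic, e);
    }
});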
There is one more consideration you should take into account. If you do not know your destination topic(s) ahead of time and just rely on the so-called "topic auto creation" feature, you should make sure that those topics are being created with the desired configuration settings (e.g., number of partitions or replication factor).
As an alternative to "topic auto creation" you can also use Admin Client (available since v0.10.1) to create topics with correct configuration. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations
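For reference, a minimal sketch with the Java AdminClient that shipped from the KIP-4 work (the topic name, partition count, and replication factor are placeholders):

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
    // 4 partitions, replication factor 2 - size these for your workload
    NewTopic topic = new NewTopic("topic-2017-01-01", 4, (short) 2);
    admin.createTopics(Collections.singletonList(topic)).all().get();
} catch (InterruptedException | ExecutionException e) {
    throw new RuntimeException("Topic creation failed", e);
}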