Say I have a Flink SourceFunction<String> called RequestsSource.
On each request coming in from that source, I would like to subscribe to an external data source (for the purposes of an example, it could start a separate thread and start producing data on that thread).
The output data would be combined into a single DataStream. For example:
Input Requests: A, B
Data produced:
A1
B1
A2
A3
B2
...
... and so on, with new elements being added to the DataStream forever.
How do I write a Flink Operator that can do this? Can I use e.g. FlatMapFunction?
You'd typically want to use an AsyncFunction, which can (asynchronously) take one input element, call some external service, and emit a collection of results.
See also Apache Flink Training - Async IO.
-- Ken
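A rough sketch of that approach, assuming a RichAsyncFunction whose external call is faked with CompletableFuture.supplyAsync; all names and the wiring are illustrative, and this only fits cases where each request yields a finite collection of results:

import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class SubscriptionAsyncFunction extends RichAsyncFunction<String, String> {

    @Override
    public void asyncInvoke(String request, ResultFuture<String> resultFuture) {
        // Call the external service off the operator thread and complete the
        // future with a finite collection of results for this one request.
        CompletableFuture
            .supplyAsync(() -> Arrays.asList(request + "1", request + "2"))
            .thenAccept(resultFuture::complete);
    }
}

// Wiring it in, given the RequestsSource from the question:
// DataStream<String> requests = env.addSource(new RequestsSource());
// DataStream<String> results =
//     AsyncDataStream.unorderedWait(requests, new SubscriptionAsyncFunction(), 30, TimeUnit.SECONDS, 100);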
It sounds like you are asking about an operator that can emit one or more unbounded streams of data, based on a connection to an external service, after receiving subscription events. The only clean way I can see to do this is to do all the work in the SourceFunction, or in a custom operator.
I don't believe async I/O can emit an unbounded stream of results from a single input event. A ProcessFunction can do that, but only via its onTimer method.
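For comparison, a rough sketch of the "do all the work in the SourceFunction" route: the subscription requests are faked as a fixed array, and each one gets its own thread that keeps emitting elements forever. Everything here (class name, sleep intervals, the A1/A2/B1 element format) is an illustrative assumption, not a definitive implementation:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

public class SubscribingSource extends RichSourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        String[] requests = {"A", "B"};            // stand-in for real incoming subscription requests

        for (String request : requests) {
            pool.submit(() -> {
                long counter = 1;
                while (running) {
                    // SourceContext is not thread-safe, so emissions must be
                    // synchronized on the checkpoint lock.
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(request + counter++);
                    }
                    try {
                        Thread.sleep(1000);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }

        while (running) {
            Thread.sleep(100);                     // keep the source alive
        }
        pool.shutdownNow();
    }

    @Override
    public void cancel() {
        running = false;
    }
}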
In a Flink (1.14.0) application I have two streams to consume (all sources are Kafka topics): one with raw events and a second with configurations. Configurations are stored in ListState<>. When there is no configuration under a specific key, raw data events are ignored. There is a small time window during which configurations are still being processed on the second stream while the first stream is already consuming events, so I'm losing potentially valid data. Configurations are stored in a compressed Kafka topic and are loaded at the beginning of the job run.
I tried a custom operator implementation using InputSelectable: after some time N, a timer switches from the second stream to all streams, but no data is pushed on the first one. I used the approach from this example,
but it looks like this 'switch' needs to be performed while there is still data in the second stream.
Is there a way to handle this scenario without implementing custom window aggregators to act as a time buffer for the first stream?
I am trying to achieve exactly-once consumption with a Kafka consumer.
My requirement is:
Read data from Topic
Process the data [which involves calling another API]
Write the response back to Kafka
I wanted to know if exactly-once is possible in this scenario.
I know that this use case is covered by the Kafka Streams API, but I wanted to know about the plain producer/consumer API. Also, let's say that after processing the data the consumer fails for some reason (and the processing should be done only once); what would be the best way to handle such cases? Can there be any continuation/checkpoint for such cases?
I understand that the Kafka Streams API is consume-process-produce transactional. Here too, if the consumer crashes after calling the API, the flow would start from the very beginning, right?
Yes; Spring for Apache Kafka supports exactly once semantics in the same way as Kafka Streams.
See
https://docs.spring.io/spring-kafka/docs/current/reference/html/#exactly-once
and
https://docs.spring.io/spring-kafka/docs/current/reference/html/#transactions
Bear in mind that "exactly once" means that the entire successful
consume -> process -> produce
is performed once. But, if the produce step fails (rolling back the transaction), then the consume -> process part is "at least once".
Therefore, you need to make the process part idempotent.
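As a rough illustration only (not the definitive setup): a consume -> process -> produce listener with Spring for Apache Kafka. The topic names and ExternalApiClient are made-up placeholders, and exactly-once additionally requires a transactional producer (transaction-id-prefix) and a transactionally configured listener container, which are omitted here.

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class EnrichingListener {

    private final KafkaTemplate<String, String> template;
    private final ExternalApiClient apiClient; // hypothetical client for the external API call

    public EnrichingListener(KafkaTemplate<String, String> template, ExternalApiClient apiClient) {
        this.template = template;
        this.apiClient = apiClient;
    }

    @KafkaListener(topics = "input-topic") // placeholder topic name
    public void onMessage(String value) {
        // "process": may run again if the transaction is rolled back, so keep it idempotent
        String response = apiClient.call(value);

        // "produce": joins the Kafka transaction started by the listener container, so the
        // offset commit and this record are committed atomically
        template.send("output-topic", response);
    }
}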
Hi, I am working with Akka Streams along with akka-stream-kafka. I am setting up a stream with the setup below:
Source (Kafka) --> | Akka Actor Flow | --> Sink (MongoDB)
The actor flow basically consists of actors that will process the data; below is the hierarchy:
                        System
                          |
                     Master Actor
                    /             \
        URLTypeHandler         SerializedTypeHandler
          /         \                   |
  Type1Handler   Type2Handler     SomeOtherHandler
So Kafka has the message; I write up the consumer, run it in the atMostOnceSource configuration, and use:
Consumer.Control control =
        Consumer.atMostOnceSource(consumerSettings, Subscriptions.topics(TOPIC))
                .mapAsyncUnordered(10, record -> processAccessLog(rootHandler, record.value()))
                .to(Sink.foreach(it -> System.out.println("FinalReturnedString--> " + it)))
                .run(materializer);
I've used a print as the sink initially, just to get the flow running. processAccessLog is defined as:
private static CompletionStage<String> processAccessLog(ActorRef handler, byte[] value) {
    handler.tell(value, ActorRef.noSender());
    return CompletableFuture.completedFuture("");
}
Now, by definition ask must be used when an actor expects a response, which makes sense in this case since I want the returned values to be written to the sink.
But everyone (including the docs) says to avoid ask and to use tell and forward instead; an amazing blog post has been written on it: Don't Ask, Tell.
In the blog post the author mentions that, in the case of nested actors, you should use tell for the first message, then use forward so the message reaches the destination, and after processing send the message directly back to the root actor.
Now here is the problem:
How do I send the message from D back to A, such that I can still use the sink?
Is it good practice to have open-ended streams? E.g. streams where the sink doesn't matter because the actors have already done the job. (I don't think this is recommended; it seems flawed.)
ask is Still the Right Pattern
From the linked blog article, one "drawback" of ask is:
blocking an actor itself, which cannot pick any new messages until the
response arrives and processing finishes.
However, in akka-stream this is the exact feature we are looking for, a.k.a. "back-pressure". If the Flow or Sink is taking a long time to process data, then we want the Source to slow down.
As a side note, I think the claim in the blog post that the additional listener Actor results in an implementation that is "dozens times heavier" is an exaggeration. Obviously an intermediate Actor adds some latency overhead but not 12x more.
Elimination of Back-Pressure
Any implementation of what you are looking for would effectively eliminate back-pressure. An intermediate Flow that only used tell would continuously propagate demand back to the Source regardless of whether or not your processing logic, within the handler Actors, was completing its calculations at the same speed that the Source is generating data.
Consider an extreme example: what if your Source could produce 1 million messages per second but the Actor receiving those messages via tell could only process 1 message per second. What would happen to that Actor's mailbox?
By using the ask pattern in an intermediate Flow you are purposefully linking the speed of the handlers and the speed with which your Source produces data.
If you are willing to remove back-pressure signaling, from the Sink to the Source, then you might as well not use akka-stream in the first place. You can have either back-pressure or non-blocking messaging, but not both.
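As a rough sketch (Java, reusing the names from the question; the Akka 2.6-style Patterns.ask overload, the timeout and the cast are illustrative assumptions), the intermediate stage could look like this, so demand only flows upstream as the handlers actually reply:

import java.time.Duration;
import akka.pattern.Patterns;

Consumer.Control control =
        Consumer.atMostOnceSource(consumerSettings, Subscriptions.topics(TOPIC))
                .mapAsyncUnordered(10, record ->
                        // ask returns a CompletionStage that completes when the handler replies,
                        // so the stream only pulls more Kafka records as replies come back
                        Patterns.ask(rootHandler, record.value(), Duration.ofSeconds(5))
                                .thenApply(reply -> (String) reply))
                .to(Sink.foreach(it -> System.out.println("FinalReturnedString--> " + it)))
                .run(materializer);

For this to work, the deepest handler has to reply to the original sender, e.g. by forwarding the message down the hierarchy so that getSender() still points at the temporary actor created by ask.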
Ramon J Romero y Vigil is right but I will try to extend the response.
1) I think that the "Don't ask, tell" dogma is mostly about actor-system architecture. Here you need to return a Future so the stream can resolve the processed result; you have two options:
Use ask
Create an actor per event and pass it a Promise, so that a Future is completed when this actor receives the data (you can use the getSender method so D can send the response back to A). There is no way to send a Promise or Future in a message (they are not serializable), so the creation of these short-lived actors cannot be avoided.
In the end you are doing mostly the same thing...
2) It's perfectly fine to use an empty Sink to finalise the stream (indeed Akka provides the Sink.ignore() method to do exactly that).
It seems like you are missing the reason for using streams in the first place: they are a nice abstraction that provides composability, concurrency and back-pressure. Actors, on the other hand, cannot be composed, and handling back-pressure with them is hard. If you don't need these features and your actors can get the work done easily, you shouldn't use akka-streams in the first place.
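For completeness, a tiny sketch of point 2 (again reusing the question's names as assumptions): if the actors fully handle the results themselves, the stream can simply end in Sink.ignore():

Consumer.Control control =
        Consumer.atMostOnceSource(consumerSettings, Subscriptions.topics(TOPIC))
                .mapAsyncUnordered(10, record -> processAccessLog(rootHandler, record.value()))
                .to(Sink.ignore())   // the actors already did the work; nothing left to collect
                .run(materializer);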
I'm writing an app for Android that processes real-time data.
My app reads binary data from a data bus (CAN), parses it and displays it on the screen.
The app reads data in a background thread. I need to transfer data rapidly from one thread to the other, and the displayed data should be as up-to-date as possible.
I've found a nice Java queue that almost implements the required behavior: LinkedBlockingQueue. I plan to set a hard limit for this queue (about 100 messages).
The consumer thread would read data from the queue with the take() method. But the producer thread can't wait for the consumer, so it can't use the standard put() method (because it blocks).
So, I plan to put messages into my queue using the following construction:
while (!messageQueue.offer(message)) {
    messageQueue.poll();
}
That is, the oldest message is removed from the queue to make room for the new, more recent data.
Is this good practice? Or have I missed some important details?
I can't see anything wrong with it. You know what you are doing (losing the head record). This doesn't map to any named practice; it's your call to use the API the way you want. I personally prefer ArrayBlockingQueue, though (fewer temporary objects).
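A small sketch of that drop-oldest strategy wrapped around an ArrayBlockingQueue; the class and method names here are made up for illustration:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class LatestDataBuffer<T> {

    private final BlockingQueue<T> queue = new ArrayBlockingQueue<>(100);

    // Producer thread: never blocks; evicts the oldest entry until the new one fits.
    public void publish(T message) {
        while (!queue.offer(message)) {
            queue.poll(); // drop the stalest message to make room
        }
    }

    // Consumer (display) thread: blocks until data is available.
    public T take() throws InterruptedException {
        return queue.take();
    }
}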
This should be what you're looking for: Size-limited queue that holds last N elements in Java
The top answer there refers to an Apache library queue which drops the oldest elements when it is full.
The input stream consists of data in the form of a JSON array of objects.
Each object has one field/key named State by which we need to separate the input stream; see the example below:
Object1 -> "State":"Active"
Object2 -> "State":"Idle"
Object3 -> "State":"Blocked"
Object4 -> "State":"Active"
We have to start processing (a thread) as soon as we receive a particular state and keep consuming the data; if a new state is the same as the previous state, let the previous thread handle it, otherwise start a new thread for the new state. Also, each thread is required to run for a finite time, and all the threads should run in parallel.
Please suggest how I can do this in Apache Flink. Pseudo code and links would be helpful.
This can be done with Flink's DataStream API. Each JSON object can be treated as a tuple, which can be processed with any of the Flink operators.
                     /----- * *   | Active
------ (KeyBy) -----+------ *     | Idle
                     \----- *     | Blocked
Now, you can split the single data stream into multiple streams using the keyBy operator. This operator groups together all the tuples with a particular key (State in your case) into a KeyedStream, which is processed in parallel. Internally, this is implemented with hash partitioning.
Any new keys (states) are handled dynamically, as new keyed streams are created for them.
Explore the documentation for the implementation details.
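A minimal sketch of that keyBy approach with the DataStream API; the Tuple2<String, String> (state, payload) shape and the inline sample elements stand in for the parsed JSON objects and are purely illustrative:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateKeyByExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-ins for the parsed JSON objects: (State, payload)
        DataStream<Tuple2<String, String>> events = env.fromElements(
                Tuple2.of("Active", "Object1"),
                Tuple2.of("Idle", "Object2"),
                Tuple2.of("Blocked", "Object3"),
                Tuple2.of("Active", "Object4"));

        events
                .keyBy(value -> value.f0)   // all objects with the same State end up in the same key group
                .map(value -> value.f0 + " handled " + value.f1)
                .print();

        env.execute("keyBy-by-state sketch");
    }
}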
From your description, I believe you'd need to first have an operator with a parallelism of 1, that "chunks" events by the state, and adds a "chunk id" to the output record. Whenever you get an event with a new state, you'd increment the chunk id.
Then key by the chunk id, which will parallelize downstream processing. Add a custom function which is keyed by the chunk id, and has a window duration of 10 minutes. This is where the bulk of your data processing will occur.
And as #narush noted above, you should read through the documentation that he linked to, so you understand how windows work in Flink.
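A rough sketch of this chunking idea, continuing from the Tuple2<String, String> (state, payload) events stream used in the earlier sketch; the chunk counter below is not checkpointed, so treat it as an illustration rather than a fault-tolerant implementation:

// Additional imports needed:
// org.apache.flink.api.common.functions.RichMapFunction, org.apache.flink.api.java.tuple.Tuple3
DataStream<Tuple3<Long, String, String>> chunked = events
        .map(new RichMapFunction<Tuple2<String, String>, Tuple3<Long, String, String>>() {
            private long chunkId;
            private String lastState;

            @Override
            public Tuple3<Long, String, String> map(Tuple2<String, String> value) {
                if (!value.f0.equals(lastState)) {
                    chunkId++;               // new state => new chunk
                    lastState = value.f0;
                }
                return Tuple3.of(chunkId, value.f0, value.f1);
            }
        })
        .setParallelism(1);                  // must be 1 so the chunk counter stays consistent

chunked
        .keyBy(value -> value.f0)            // parallelize downstream work per chunk id
        .map(value -> "chunk " + value.f0 + " handled " + value.f2)   // placeholder for the real keyed/windowed processing
        .print();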