I am investigating using Flink with a Kinesis stream as a source. I would like to use Event Time watermarking.
Planning on running this on AWS managed Flink (Kinesis Analytics) platform.
Looking at the AWS documentation and indeed Flink documentation it is recommended to use the FlinkKinesisConsumer.
To enable EventTime on this consumer I see that the recommendation is to use a custom AssignerWithPeriodicWatermarks() and set it on the KinesisConsumer with setPeriodicWatermarkAssigner.
However, I also read on the Flink documentation that this API is deprecated and it advised to use WatermarkStrategies.
My questions:
is it possible to use the WatermarkStrategy on the kinesis consumer or must it be applied after a non-source operation on the DataStream itself (discouraged in flink docs)?
if not possible and must be used after a non-source operation what does this mean? Why is it discouraged? how does it will performance of the workload
Or is it recommended to continue to use a deprecated API?
or is there another kinesis flink consumer than can be recommended
Thanks in advance for any suggestions
Alexis
Related
When introducing parallel processing to an application where multiple save entity calls are being made, I see prior dev has chosen to do it via Spring Integration using split().channel(MessageChannels.executor(Executors.newfixedThreadPool(10))).handle("saveClass","saveMethod").aggregate().get() - where this method is mapped to a requestChannel using #Gateway annotation. My question is this task seems to be simpler to do using the parallelStream() and forEach() method. Does IntergrationFlow provide any benefit in this case?
If you really do a plain in-memory data processing where Java's Stream API is enough, then indeed you don't need the whole messaging solution like Spring Integration. But if you deal with distributed requirements to process data from different systems, like from HTTP to Apache Kafka or DB, then it is better to use some tool which allows you smoothly to connect everything together. Also: no one stops you to use Stream API in the Spring Integration application. In the end all your code is Java anyway. Please, learn more what is EIP and why we would need a special framework to implement this messaging-based solutions: https://www.enterpriseintegrationpatterns.com/
I'm new to the streaming community
I'm trying to create a continuous query using kafka topics and flink but I haven't found any examples so I can get an idea of how to get started
can you help me with some examples?
thank you.
For your use case, I'm guessing you want to use kafka as source for continuous data. In this case you can use kafka-source-connector(linked below) and if you want to slice it with time you can use flink's Window Processing Function. This will group your kafka messages streamed in a particular timeframe like a list/map.
Flink Kafka source connector
Flink Window Processing Function
While creating Kafka Streams using Kafka Streams DSL
https://kafka.apache.org/0110/documentation/streams/developer-guide
we have encountered a scenario where we need to update the Kafka Streams with new topology definition.
For example:
When we started, we have a topology defined to read from one topic (Source) and a destination topic (Sink).
However, on a configuration change we now need to read from 2 different topics (2 sources if you will) and write to a single destination topic.
From what we have built right now, the topology definition is hard coded, something like defined in processor topology.
Questions:
Is it possible to define topology in a declarative way (say in a Json or something else), which doesn't require a codification of the topology?
Is it possible to reload an existing Kafka Stream to use a new definition of the Kafka Streams Topology?
For #2 mentioned above, does Kafka Streams DSL provide a way to "reload" new topology definitions by way of an external trigger or system call?
We are using JDK 1.8 and Kafka DSL 2.2.0
Thanks,
Ayusman
Is it possible to define topology in a declarative way (say in a Json or something else), which doesn't require a codification of the topology?
The KStreams DSL is declarative, but I assume you mean something other than the DSL?
If so, the answer is No. You may want to look at KSQL, however.
Is it possible to reload an existing Kafka Stream to use a new definition of the Kafka Streams Topology?
You mean if an existing Kafka Streams application can reload a new definition of a processing topology? If so, the answer is No. In such cases, you'd deploy a new version of your application.
Depending on how the old/new topologies are defined, a simple rolling upgrade of your application may suffice (roughly: if the topology change was minimal), but probably you will need to deploy the new application separately and then, once the new one is vetted, decommission your old application.
Note: KStreams is a Java library and, by design, does not include functionality to operate/manage the Java applications that use the KStreams library.
For #2 mentioned above, does Kafka Streams DSL provide a way to "reload" new topology definitions by way of an external trigger or system call?
No.
I'm using the mvn dependency google-cloud-dataflow-java-sdk-all version 2.1.0 and I'm trying to add a custom Sink for my pipeline.
In the pipeline, I'm retrieving Pubsub messages and am eventually transforming these to a PCollection of Strings.
This is a simplified version of the pipeline I've set up:
Pipeline pipeline = Pipeline.create(options);
pipeline.apply(PubsubIO.readMessages())
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
//transformations
.apply(//Write to custom sink)
The reason I need a custom Sink is because someone else on the team has already written the code to write out this data to BigQuery and provided a REST API for this. So, my Sink would be calling this REST API with the relevant data. I'm not keen on using BigQueryIO since that would involve duplicating parts of the code that was already written.
The problem is that I can not find any documentation on the Apache Beam website about writing custom Sinks using the Java SDK, so if someone could give me a nod in the right direction, it'd be much appreciated.
I've also considered just using a ParDo to send the data to the REST API, but then I technically would not have a Sink anymore and I wouldn't be doing it the "Dataflow way".
For unbounded sinks, there is no sink specific API in Beam. All the IO transforms essentially implement a ParDo. There are a few techniques to provide specific guarantees (e.g. using a GroupByKey to provide a checkpoint barrier in Dataflow) it depends on your interaction with external system (REST API in this case). Looks like writing a ParDo is the way to go in your case.
Can MongoDB be used as a datasource to Apache Flink for processing the Streaming Data?
What is the native implementation of Apache Flink to use No-SQL Database as data source?
Currently, Flink does not have a dedicated connector to read from MongoDB. What you can do is the following:
Use StreamExecutionEnvironment.createInput and provide a Hadoop input format for MongoDB using Flink's wrapper input format
Implement your own MongoDB source via implementing SourceFunction/ParallelSourceFunction
The former should give you at-least-once processing guarantees since the MongoDB collection is completely re-read in case of a recovery. Depending on the functionality of the MongoDB client, you might be able to implement exactly-once processing guarantees with the latter approach.