I'm new to the streaming community.
I'm trying to create a continuous query using Kafka topics and Flink, but I haven't found any examples to give me an idea of how to get started.
Can you help me with some examples?
Thank you.
For your use case, I'm guessing you want to use Kafka as the source of continuous data. In that case you can use the Flink Kafka source connector (linked below), and if you want to slice the stream by time you can use Flink's window processing functions. These will group the Kafka messages streamed within a particular timeframe into a collection such as a list or map.
Flink Kafka source connector
Flink Window Processing Function
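A minimal sketch of that combination, assuming Flink 1.14+ with the flink-connector-kafka dependency; the topic name, bootstrap servers, consumer group, and the 10-second window size are all placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class KafkaWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Continuously read raw strings from a Kafka topic (placeholder names).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("input-topic")
                .setGroupId("flink-demo")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> messages =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

        // Group everything that arrives in each 10-second window into a list.
        messages
                .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .process(new ProcessAllWindowFunction<String, List<String>, TimeWindow>() {
                    @Override
                    public void process(Context ctx, Iterable<String> elements, Collector<List<String>> out) {
                        List<String> batch = new ArrayList<>();
                        elements.forEach(batch::add);
                        out.collect(batch);
                    }
                })
                .print();

        env.execute("kafka-window-sketch");
    }
}

The same pattern works with keyed windows (keyBy before window) if you want one group per key rather than one group per timeframe.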
I am investigating using Flink with a Kinesis stream as a source. I would like to use event-time watermarking.
I am planning to run this on the AWS managed Flink (Kinesis Analytics) platform.
Looking at the AWS documentation, and indeed the Flink documentation, the recommendation is to use the FlinkKinesisConsumer.
To enable event time on this consumer, I see that the recommendation is to use a custom AssignerWithPeriodicWatermarks() and set it on the KinesisConsumer with setPeriodicWatermarkAssigner.
However, I also read in the Flink documentation that this API is deprecated and that it is advised to use WatermarkStrategy instead.
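For concreteness, the two approaches I am comparing look roughly like the sketch below (assuming Flink 1.11+ APIs, a simple String stream, and placeholder stream name, region, and timestamp parsing):

import java.time.Duration;
import java.util.Properties;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;

public class KinesisWatermarkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("aws.region", "eu-west-1");   // placeholder region; credentials omitted

        FlinkKinesisConsumer<String> consumer =
                new FlinkKinesisConsumer<>("my-stream", new SimpleStringSchema(), props);

        // Option A (deprecated route): write a custom AssignerWithPeriodicWatermarks
        // and attach it to the source via consumer.setPeriodicWatermarkAssigner(...).

        // Option B (WatermarkStrategy): apply the strategy to the DataStream
        // immediately after the source operator.
        DataStream<String> events = env
                .addSource(consumer)
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((record, recordTs) ->
                                        Long.parseLong(record.split(",")[0])));   // placeholder: event time in first CSV field

        events.print();
        env.execute("kinesis-watermark-sketch");
    }
}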
My questions:
Is it possible to use a WatermarkStrategy on the Kinesis consumer, or must it be applied after a non-source operation on the DataStream itself (which the Flink docs discourage)?
If it is not possible and the strategy must be applied after a non-source operation, what does this mean? Why is it discouraged, and how will it affect the performance of the workload?
Or is it recommended to continue using the deprecated API?
Or is there another Kinesis Flink consumer that can be recommended?
Thanks in advance for any suggestions
Alexis
While creating Kafka Streams applications using the Kafka Streams DSL
https://kafka.apache.org/0110/documentation/streams/developer-guide
we have encountered a scenario where we need to update the Kafka Streams application with a new topology definition.
For example:
When we started, we had a topology defined to read from one topic (source) and write to a destination topic (sink).
However, after a configuration change we now need to read from 2 different topics (2 sources, if you will) and write to a single destination topic.
From what we have built right now, the topology definition is hard-coded, along the lines of the processor topology described in the developer guide.
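As a rough illustration (topic names and configuration are placeholders), the hard-coded topology looks something like the sketch below; the configuration change would effectively mean merging a second source stream into the same sink:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class HardCodedTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "copy-app");             // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Current topology: one source topic -> one sink topic.
        KStream<String, String> source = builder.stream("source-topic");
        source.to("sink-topic");

        // After the configuration change we would instead need something like:
        // KStream<String, String> second = builder.stream("second-source-topic");
        // source.merge(second).to("sink-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}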
Questions:
Is it possible to define topology in a declarative way (say in a Json or something else), which doesn't require a codification of the topology?
Is it possible to reload an existing Kafka Stream to use a new definition of the Kafka Streams Topology?
For #2 mentioned above, does Kafka Streams DSL provide a way to "reload" new topology definitions by way of an external trigger or system call?
We are using JDK 1.8 and Kafka DSL 2.2.0
Thanks,
Ayusman
Is it possible to define topology in a declarative way (say in a Json or something else), which doesn't require a codification of the topology?
The KStreams DSL is declarative, but I assume you mean something other than the DSL?
If so, the answer is No. You may want to look at KSQL, however.
Is it possible to reload an existing Kafka Stream to use a new definition of the Kafka Streams Topology?
You mean if an existing Kafka Streams application can reload a new definition of a processing topology? If so, the answer is No. In such cases, you'd deploy a new version of your application.
Depending on how the old/new topologies are defined, a simple rolling upgrade of your application may suffice (roughly: if the topology change was minimal), but probably you will need to deploy the new application separately and then, once the new one is vetted, decommission your old application.
Note: KStreams is a Java library and, by design, does not include functionality to operate/manage the Java applications that use the KStreams library.
For #2 mentioned above, does Kafka Streams DSL provide a way to "reload" new topology definitions by way of an external trigger or system call?
No.
I'm using the Maven dependency google-cloud-dataflow-java-sdk-all version 2.1.0, and I'm trying to add a custom Sink to my pipeline.
In the pipeline, I'm retrieving Pubsub messages and am eventually transforming these to a PCollection of Strings.
This is a simplified version of the pipeline I've set up:
Pipeline pipeline = Pipeline.create(options);
pipeline.apply(PubsubIO.readMessages())
        .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
        // ...transformations...
        .apply(/* write to custom sink */);
The reason I need a custom Sink is that someone else on the team has already written the code to write this data out to BigQuery and provided a REST API for it. So my Sink would be calling this REST API with the relevant data. I'm not keen on using BigQueryIO, since that would involve duplicating parts of the code that have already been written.
The problem is that I cannot find any documentation on the Apache Beam website about writing custom Sinks using the Java SDK, so if someone could give me a nod in the right direction, it'd be much appreciated.
I've also considered just using a ParDo to send the data to the REST API, but then I technically would not have a Sink anymore and I wouldn't be doing it the "Dataflow way".
For unbounded sinks, there is no sink-specific API in Beam. All the IO transforms essentially implement a ParDo. There are a few techniques to provide specific guarantees (e.g. using a GroupByKey to provide a checkpoint barrier in Dataflow), but what is appropriate depends on how you interact with the external system (the REST API in this case). It looks like writing a ParDo is the way to go in your case.
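A minimal sketch of that ParDo approach; the endpoint URL, payload format, and use of HttpURLConnection are placeholders for however your team's REST API is actually called:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.DoFn;

public class RestSinkFn extends DoFn<String, Void> {
    private static final String ENDPOINT = "https://example.internal/api/rows";   // placeholder

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        // POST each element to the existing REST API that writes to BigQuery.
        HttpURLConnection conn = (HttpURLConnection) new URL(ENDPOINT).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(c.element().getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() >= 300) {
            throw new RuntimeException("REST write failed with HTTP " + conn.getResponseCode());
        }
        conn.disconnect();
    }
}

// In the pipeline from the question, after the transformations:
// stringPCollection.apply("WriteToRestApi", ParDo.of(new RestSinkFn()));

If you need batching or stronger delivery guarantees, the GroupByKey technique mentioned above can be added in front of this DoFn.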
I am creating a Storm-based project where messages will be filtered by Storm. My aim is to allow a user to adapt the filtering performed at runtime by sending configuration information to a ZooKeeper znode.
I believe this is possible by setting up a ZooKeeper watcher within Storm, but I am struggling to achieve this. I would be grateful for some guidance or a simple example of how to perform this.
I have looked at the Java docs, and I'm afraid the way to perform this does not seem obvious.
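What I have in mind is something like the following sketch (assuming Storm 2.x and the plain ZooKeeper client; the znode path, connect string, and tuple field name are placeholders), but I am not sure this is the right approach:

import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConfigurableFilterBolt extends BaseRichBolt {
    private static final String CONFIG_ZNODE = "/filter-config";            // placeholder znode path
    private final AtomicReference<String> filterPattern = new AtomicReference<>(".*");
    private transient ZooKeeper zk;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });       // placeholder connect string
            watchConfig();
        } catch (Exception e) {
            throw new RuntimeException("Could not set up ZooKeeper watch", e);
        }
    }

    // ZooKeeper watches are one-shot, so re-read the znode and re-register
    // the watch every time the configuration changes.
    private void watchConfig() throws Exception {
        Watcher watcher = event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                try {
                    watchConfig();
                } catch (Exception ignored) {
                }
            }
        };
        byte[] data = zk.getData(CONFIG_ZNODE, watcher, null);
        filterPattern.set(new String(data, StandardCharsets.UTF_8));
    }

    @Override
    public void execute(Tuple input) {
        String message = input.getStringByField("message");                 // placeholder field name
        if (message.matches(filterPattern.get())) {
            collector.emit(input, new Values(message));
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}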
I have an application where multiple users are able to specify Spark workflows, which are then sent to the driver and executed on the cluster.
The workflows should now be extended to also support streamed data-sources. A possible workflow could involve:
Stream tweets with a specific hashtag
Transform each tweet
Do analysis on a windowed frame and visualization
This works if only a single stream is started at once, but otherwise it gives the "Only one StreamingContext may be started in this JVM." error.
I tried different known approaches, but none of them worked for me ("spark.driver.allowMultipleContexts = true", increasing "spark.streaming.concurrentJobs", trying to run each streaming context in a different pool, etc.).
Can anybody tell me what the current best practice regarding parallel streams with Spark streaming is?
Thx in advance!
I assume you're starting your Spark Streaming jobs programmatically within an existing application, hence the error from the JVM. Spark is specifically not designed to run in the scope of a different application, even though this is feasible in standalone mode. If you want to start Spark Streaming jobs programmatically on a cluster, you will want to use the SparkLauncher, which looks like this:
import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  val spark = new SparkLauncher()
    .setSparkHome("...")
    .setAppResource("..path to your jar...")
    .setMainClass("..your app...")
    .setMaster("yarn")
    .launch()

  spark.waitFor()
}
There's a blog post with some examples:
https://blog.knoldus.com/2015/06/26/startdeploy-apache-spark-application-programmatically-using-spark-launcher/
The API docs are here:
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/launcher/SparkLauncher.html