While building Kafka Streams applications with the Kafka Streams DSL
https://kafka.apache.org/0110/documentation/streams/developer-guide
we have encountered a scenario where we need to update a running Kafka Streams application with a new topology definition.
For example:
When we started, we had a topology defined to read from one source topic and write to one destination (sink) topic.
However, after a configuration change we now need to read from 2 different topics (2 sources, if you will) and write to a single destination topic.
In what we have built so far, the topology definition is hard coded in Java, much like the processor topology examples in the documentation.
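For illustration, the hard-coded topology for the two-source case would look roughly like this (topic names and configuration values are placeholders):

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class HardCodedTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Both source topics and the sink topic are hard coded here.
        KStream<String, String> sourceA = builder.stream("source-topic-a");
        KStream<String, String> sourceB = builder.stream("source-topic-b");
        sourceA.merge(sourceB).to("sink-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}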
Questions:
Is it possible to define topology in a declarative way (say in a Json or something else), which doesn't require a codification of the topology?
Is it possible to reload an existing Kafka Stream to use a new definition of the Kafka Streams Topology?
For #2 mentioned above, does Kafka Streams DSL provide a way to "reload" new topology definitions by way of an external trigger or system call?
We are using JDK 1.8 and the Kafka Streams DSL 2.2.0.
Thanks,
Ayusman
Is it possible to define topology in a declarative way (say in a Json or something else), which doesn't require a codification of the topology?
The KStreams DSL is declarative, but I assume you mean something other than the DSL?
If so, the answer is No. You may want to look at KSQL, however.
Is it possible to reload an existing Kafka Stream to use a new definition of the Kafka Streams Topology?
You mean if an existing Kafka Streams application can reload a new definition of a processing topology? If so, the answer is No. In such cases, you'd deploy a new version of your application.
Depending on how the old/new topologies are defined, a simple rolling upgrade of your application may suffice (roughly: if the topology change was minimal), but probably you will need to deploy the new application separately and then, once the new one is vetted, decommission your old application.
Note: KStreams is a Java library and, by design, does not include functionality to operate/manage the Java applications that use the KStreams library.
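One pragmatic middle ground, sketched below, is to keep building the topology in code but drive the topic names from external configuration, so that "reloading" amounts to redeploying/restarting the application with a changed file. The file name and property keys here are made up for illustration:

import java.io.FileInputStream;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ConfigDrivenTopology {
    public static void main(String[] args) throws Exception {
        // Hypothetical external file, e.g. source.topics=topic-a,topic-b and sink.topic=sink-topic
        Properties config = new Properties();
        config.load(new FileInputStream("streams-app.properties"));

        List<String> sourceTopics = Arrays.asList(config.getProperty("source.topics").split(","));
        String sinkTopic = config.getProperty("sink.topic");

        Properties streamsProps = new Properties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, config.getProperty("application.id"));
        streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, config.getProperty("bootstrap.servers"));

        StreamsBuilder builder = new StreamsBuilder();
        // stream(Collection) subscribes to all listed topics with one source node; with the
        // default byte-array serdes this simply forwards records unchanged to the sink topic.
        builder.stream(sourceTopics).to(sinkTopic);

        new KafkaStreams(builder.build(), streamsProps).start();
    }
}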
For #2 mentioned above, does Kafka Streams DSL provide a way to "reload" new topology definitions by way of an external trigger or system call?
No.
Related
I'm new to the streaming community.
I'm trying to create a continuous query using Kafka topics and Flink, but I haven't found any examples that would give me an idea of how to get started.
Can you help me with some examples?
Thank you.
For your use case, I'm guessing you want to use Kafka as the source of continuous data. In that case you can use the Flink Kafka source connector (linked below), and if you want to slice the stream by time you can use Flink's window processing functions. A window groups the Kafka messages that arrive within a particular timeframe into something like a list/map; a rough sketch follows the links below.
Flink Kafka source connector
Flink Window Processing Function
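A minimal sketch of this combination, assuming the classic FlinkKafkaConsumer connector (depending on your Flink version this may instead be FlinkKafkaConsumer011 or the newer KafkaSource builder), a hypothetical topic named "events", and processing-time tumbling windows:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

public class KafkaContinuousQuery {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "continuous-query");        // placeholder

        // Continuous source: every new record on the "events" topic enters the stream.
        DataStream<String> messages = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        messages
                // Group by the message itself; replace with a real key extractor.
                .keyBy(value -> value)
                // Slice the stream into 10-second processing-time windows.
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                // Count how many messages per key arrived in each window.
                .process(new ProcessWindowFunction<String, String, String, TimeWindow>() {
                    @Override
                    public void process(String key, Context ctx,
                                        Iterable<String> elements, Collector<String> out) {
                        long count = 0;
                        for (String ignored : elements) {
                            count++;
                        }
                        out.collect(key + " -> " + count + " messages in window ending at "
                                + ctx.window().getEnd());
                    }
                })
                .print();

        env.execute("Continuous query over a Kafka topic");
    }
}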
I'm using the Maven dependency google-cloud-dataflow-java-sdk-all version 2.1.0 and I'm trying to add a custom Sink for my pipeline.
In the pipeline, I'm retrieving Pubsub messages and am eventually transforming these to a PCollection of Strings.
This is a simplified version of the pipeline I've set up:
Pipeline pipeline = Pipeline.create(options);
pipeline.apply(PubsubIO.readMessages())
        .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
        // ...transformations...
        .apply(/* write to custom sink */);
The reason I need a custom Sink is because someone else on the team has already written the code to write out this data to BigQuery and provided a REST API for this. So, my Sink would be calling this REST API with the relevant data. I'm not keen on using BigQueryIO since that would involve duplicating parts of the code that was already written.
The problem is that I cannot find any documentation on the Apache Beam website about writing custom Sinks using the Java SDK, so if someone could give me a nudge in the right direction, it would be much appreciated.
I've also considered just using a ParDo to send the data to the REST API, but then I technically would not have a Sink anymore and I wouldn't be doing it the "Dataflow way".
For unbounded sinks, there is no sink-specific API in Beam. All the IO transforms essentially implement a ParDo. There are a few techniques to provide specific guarantees (e.g. using a GroupByKey to provide a checkpoint barrier in Dataflow), but what you need depends on your interaction with the external system (the REST API in this case). It looks like writing a ParDo is the way to go in your case; a rough sketch is below.
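A minimal sketch of that ParDo, assuming a hypothetical endpoint https://example.internal/ingest that accepts one JSON record per POST; transformed stands in for your PCollection of Strings:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.transforms.DoFn;

public class RestApiWriteFn extends DoFn<String, Void> {

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("https://example.internal/ingest").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(c.element().getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        if (status >= 300) {
            // Throwing makes the runner retry the bundle, which is how at-least-once
            // delivery is usually achieved in a "sink" ParDo.
            throw new RuntimeException("REST call failed with HTTP " + status);
        }
    }
}

// Usage in the pipeline, in place of the custom Sink:
//   transformed.apply("WriteToRestApi", ParDo.of(new RestApiWriteFn()));

For throughput you would typically buffer elements and flush them in a @FinishBundle method rather than issuing one request per element.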
We use TIBCO EMS as our messaging system and have used Apache Camel to write our application. In our application, messages are written to a queue. A component, with concurrentConsumers set to 8, reads from the queue, processes the message, then writes to another queue. Another component, again with concurrentConsumers set to 8, then reads from this new queue, and so on.

Up until now, maintaining message order has not been important, but a new requirement means that it now is. Looking at the Camel documentation, it is suggested that JMSXGroupID is used to maintain ordering. Unfortunately, this functionality is not available with TIBCO EMS. Are there any other ways of maintaining ordering in Camel in a multithreaded application? I have looked at sticky load balancing, but this seems to be applicable to endpoint load balancing only.
Thanks
Bruce
In the enterprise integration world, we generally use the Resequencer design pattern to solve this kind of problem, where you need to ensure ordering of messages.
Apache Camel covers a broad range of the Enterprise Integration Patterns, including the Resequencer, and it has out-of-the-box implementations for those patterns. So what you are looking for should be this:
http://camel.apache.org/resequencer.html
In your specific case, all you would need to do is add a custom message header like myMessageNo, carrying a sequential number that specifies the ordering, to the messages going out to TIBCO EMS. Then, on the consumer side, use the Resequencer EIP to restore the ordering of the incoming messages from TIBCO EMS.
As you can see, however, it's not as easy as just dropping the Resequencer EIP into your Camel routes. (Asynchronous solutions are always hard to build correctly.) For the resequencer, you need to consider the sad paths, e.g. when some messages get lost and never arrive. To make sure your routes still behave well in those exceptional cases, you have two options to choose from: maximum batch size or timeout. Depending on the condition chosen, the resequencer will flush messages when the batch reaches the maximum size or when it times out waiting for a missing message. A rough sketch of such a route is below.
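This sketch assumes the JMS component is configured against TIBCO EMS and uses placeholder queue names; the header name myMessageNo matches the suggestion above:

import org.apache.camel.builder.RouteBuilder;

public class ResequencingRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("jms:queue:inboundQueue")                // placeholder: EMS-backed JMS endpoint
            // Restore ordering based on the custom sequence-number header.
            .resequence(header("myMessageNo")).batch()
                .size(100)      // flush once 100 messages have been collected...
                .timeout(5000)  // ...or after 5 seconds of waiting for a missing message
            .to("jms:queue:outboundQueue");           // placeholder: the next EMS queue
    }
}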
I am creating a Storm-based project where messages will be filtered by Storm. My aim is to allow a user to adapt the filtering performed at runtime by sending configuration information to a ZooKeeper znode.
I believe this is possible by setting up a ZooKeeper watcher within Storm, but I am struggling to achieve this. I would be grateful for some guidance or a simple example of how to do this.
I have looked at the Javadocs, and I'm afraid the way to do this does not seem obvious.
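In code, the kind of thing I have in mind would be roughly the following, assuming Apache Curator as the ZooKeeper client; the connect string, znode path, and tuple field names are placeholders:

import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ConfigurableFilterBolt extends BaseRichBolt {

    private transient OutputCollector collector;
    private transient CuratorFramework zkClient;
    private transient NodeCache configCache;
    // Latest filter term pushed to the znode; empty means "pass everything".
    private final AtomicReference<String> filterTerm = new AtomicReference<>("");

    @Override
    public void prepare(Map topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        zkClient = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3)); // placeholder
        zkClient.start();
        try {
            // NodeCache keeps a watch on the znode and re-registers it after every change.
            configCache = new NodeCache(zkClient, "/filters/my-topology"); // placeholder path
            configCache.getListenable().addListener(() -> {
                byte[] data = configCache.getCurrentData() == null
                        ? new byte[0] : configCache.getCurrentData().getData();
                filterTerm.set(new String(data, StandardCharsets.UTF_8));
            });
            configCache.start(true);
        } catch (Exception e) {
            throw new RuntimeException("Could not start ZooKeeper config watcher", e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        String message = tuple.getStringByField("message"); // assumes an upstream "message" field
        String term = filterTerm.get();
        if (term.isEmpty() || message.contains(term)) {
            collector.emit(tuple, new Values(message));
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}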
We have existing Spring Batch Application, that we want to make scalable to run on multiple nodes.
The scalability docs for Spring Batch involve code changes and configuration changes.
I am just wondering if this can be achieved with configuration changes alone (adding new classes and wiring them in configuration is fine; I just want to avoid code changes to existing classes).
Thanks a lot for the help in advance.
It really depends on your situation. Specifically, why do you want to run on multiple nodes? What is the bottleneck you're attempting to overcome? The two typical scenarios that Spring Batch handles out of the box for scaling across multiple nodes are remote chunking and remote partitioning. Both are master/slave configurations, but each has a different use case.
Remote chunking is used when the processor in a step is the bottleneck. In this case, the master node reads the input and sends it via a Spring Integration channel to remote nodes for processing. Once the item has been processed, the result is returned to the master for writing. In this case, reading and writing are done locally on the master. While this helps parallelize processing, it takes an I/O hit because every item is being sent over the wire (and requires guaranteed delivery, via JMS for example).
Remote partitioning is the other scenario. In this case, the master generates a description of the input to be processed for each slave and only that description is sent over the wire. For example, if you're processing records in a database, the master may send a range of row ids to each slave (1-100, 101-200, etc). Reading and writing occur local to the slaves and guaranteed delivery is not required (although useful in certain situations).
Both of these options can be done with minimal (or no) new classes depending on your use case. There are a couple different places to look for information on these capabilities:
Spring Batch Integration Github repository - Spring Batch Integration is the project that supports the above use cases. You can read more about it here: https://github.com/spring-projects/spring-batch-admin/tree/master/spring-batch-integration
My remote partitioning example - This talk walks through remote partitioning and provides a working example to run on CloudFoundry (currently it only works on CF v1, but updates for CF v2 are coming in a couple of days). The configuration is almost the same; only the connection pool for Rabbit is different: https://github.com/mminella/Spring-Batch-Talk-2.0 The video of this presentation can be found on YouTube here: http://www.youtube.com/watch?v=CYTj5YT7CZU
Gunnar Hillert's presentation on Spring Batch and Spring Integration: This was presented at SpringOne2GX 2013 and contains a number of examples: https://github.com/ghillert/spring-batch-integration-sample
In any of these cases, remote chunking should be accomplishable with zero new classes. Remote partitioning typically requires you to implement one new class (the Partitioner).
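A minimal sketch of such a Partitioner, assuming the input is a database table with contiguous row ids 1 to 1000 and hypothetical context keys minId/maxId that each slave step's reader would use to bound its query:

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class RowRangePartitioner implements Partitioner {

    private static final long MIN_ID = 1;
    private static final long MAX_ID = 1000;

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long rangeSize = (MAX_ID - MIN_ID + 1) / gridSize;
        long start = MIN_ID;
        for (int i = 0; i < gridSize; i++) {
            long end = (i == gridSize - 1) ? MAX_ID : start + rangeSize - 1;
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", start); // each slave's reader selects rows in [minId, maxId]
            context.putLong("maxId", end);
            partitions.put("partition" + i, context);
            start = end + 1;
        }
        return partitions;
    }
}

Each ExecutionContext above becomes the step execution context of one slave step, so the bounds can be fed into that slave's reader via step-scoped configuration.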