How do I create a Kafka stream that runs at a specific time every day, reads messages from a topic, does some transformations, and writes the messages back to a different topic?
For instance, a stream that runs at 9 pm every day, fetches all the messages pushed to a topic, and writes them to another topic.
I tried windowing, but all the examples I found pertain to aggregation only; I don't need to aggregate.
I am using the Java DSL.
Write your Java code to do what you want and configure crontab to run it at the time you want.
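A minimal sketch of that approach with the plain Kafka clients (topic names, broker address, and the transformation are placeholders): the job drains whatever has accumulated since the last run, because the consumer group's committed offsets remember where the previous run stopped, forwards each transformed record, and then exits so cron can launch it again the next evening.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class NightlyCopyJob {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "nightly-copy"); // committed offsets mark where the last run stopped
        consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(Collections.singletonList("source-topic"));

            // Drain everything accumulated since the last run, then exit so cron can schedule the next run.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                if (records.isEmpty()) {
                    break;
                }
                for (ConsumerRecord<String, String> record : records) {
                    String transformed = record.value().toUpperCase(); // placeholder transformation
                    producer.send(new ProducerRecord<>("target-topic", record.key(), transformed));
                }
                consumer.commitSync();
            }
            producer.flush();
        }
    }
}
```

A crontab entry such as `0 21 * * * java -jar nightly-copy.jar` would then start it at 9 pm every day.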
I'm trying to black-box test a Spring Boot application which is using Spring Cloud Stream Kafka. The expected results (in the DB) may differ based on the message processing order. How can I reliably tell that one message has been processed, so that I can send in the next? One important factor is that one message from the test can generate multiple events (messages) within the application.
I tried the following approaches:
Wait a fixed amount of time: this usually works, but if someone's PC is hot and throttling, it can become flaky, and to be honest it is just ugly.
Create an aspect that counts method invocations, expose the count through a controller, query it several times, and send the next message once the count has "settled": the timing of the queries matters, so it is unreliable.
Periodically check the Kafka consumer lag, either from code or by querying the actuator, taking multiple samples: this is a mixture of the above two, sometimes slower than the first but more reliable.
Is there any official way of doing this?
Configure the container to emit ListenerContainerIdleEvents.
See https://docs.spring.io/spring-kafka/docs/current/reference/html/#idle-containers
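A minimal sketch, assuming Spring Boot and a ConcurrentKafkaListenerContainerFactory (the bean name and the 5-second interval are just placeholders): the container publishes a ListenerContainerIdleEvent once no records have arrived for the configured interval, and the test can wait for that event before sending the next message.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.event.EventListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.event.ListenerContainerIdleEvent;

@Configuration
public class IdleEventConfig {

    // Ask the listener container to emit an idle event after 5 seconds without records.
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.getContainerProperties().setIdleEventInterval(5000L);
        return factory;
    }

    // A test-only listener could count down a latch here instead of logging,
    // so the test knows the application has gone quiet.
    @EventListener
    public void onIdle(ListenerContainerIdleEvent event) {
        System.out.println("Container " + event.getListenerId() + " is idle");
    }
}
```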
The source for our Beam pipeline is Kafka. Apache Beam's Kafka IO connector supports advancing the watermark (in the case of the Flink runner) even if a partition is idle. Applications that want to process records based on a timestamp included in the payload would use "CustomTimestampPolicyWithLimitedDelay". We use one-minute FIXED WINDOWS for aggregation, which depends on the notion of time, so if time does not advance properly the aggregation function is not called and data is missed.
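Roughly, the source is wired like this (broker, topic, max delay, and the timestamp extractor below are illustrative placeholders, not our exact code):

```java
import org.apache.beam.sdk.io.kafka.CustomTimestampPolicyWithLimitedDelay;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class KafkaSource {

    // Placeholder: parse the event time carried inside the payload.
    static Instant timestampFromPayload(KafkaRecord<String, String> record) {
        return Instant.parse(record.getKV().getValue().substring(0, 24));
    }

    static KafkaIO.Read<String, String> readWithPayloadTime() {
        return KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092")
                .withTopic("topic-a")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                // The watermark is driven by the payload timestamps, with a bounded delay;
                // idle partitions are supposed to still let the watermark advance.
                .withTimestampPolicyFactory((tp, previousWatermark) ->
                        new CustomTimestampPolicyWithLimitedDelay<>(
                                KafkaSource::timestampFromPayload,
                                Duration.standardSeconds(5),
                                previousWatermark));
    }
}
```

The aggregation downstream uses Window.into(FixedWindows.of(Duration.standardMinutes(1))), and the temporary workaround mentioned at the end is simply .withLogAppendTime() in place of the custom policy.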
This API has a functional issue. When the application is initialized, suppose for example that topic A is used as the source, with three partitions. These steps reproduce the issue:
Pump data to only one partition, at a frequency of some x seconds: the aggregation function is not called even after several minutes.
Now pump data to all partitions: the aggregation function is called at the end of the minute, as expected.
Now pump data to only one partition again, stopping just before the end of the minute so that we create an idle-partition scenario: it now works as expected.
So, in summary, there is an initialization issue with this API where it does not advance time, but after step 2 it stabilizes and works as expected.
This is easily reproducible, and I would ask Apache Beam to fix it.
For now the temporary fix we have adopted is LogAppendTime, which works flawlessly, but for various application reasons we do not want to process records on broker time.
I'm creating a Kafka stream to replicate information from one application to another. The destination API has maintenance windows during which I must not send data, otherwise I can cause issues on it.
I have an API that tells me when there is a maintenance period, so that part is not an issue; what I would like to know is how to disable the stream for a given period of time and start it again once the maintenance window is over.
I'm writing my code in Java.
You could manage the Kafka Streams lifecycle (starting/stopping) as your use case requires. For that you need to keep a collection of your KafkaStreams instances in memory and, during maintenance, stop them (either all or some of them) by calling
kafkaStreams.close() and, optionally, kafkaStreams.cleanUp() on each required instance. Note that a closed KafkaStreams instance cannot be started again: when the maintenance is completed, build a fresh KafkaStreams instance from the same topology and call start() on it.
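A minimal sketch of such a wrapper (the class and method names are made up for illustration):

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

public class MaintenanceAwareStreams {

    private final Topology topology;
    private final Properties config;
    private KafkaStreams streams;

    public MaintenanceAwareStreams(Topology topology, Properties config) {
        this.topology = topology;
        this.config = config;
    }

    // Called when the maintenance window starts.
    public synchronized void stopForMaintenance() {
        if (streams != null) {
            streams.close();      // shuts down all stream threads
            // streams.cleanUp(); // optional: wipe local state, forcing a full restore on restart
            streams = null;
        }
    }

    // Called when the maintenance window ends. A closed KafkaStreams instance
    // cannot be started again, so a fresh one is built from the same topology.
    public synchronized void startAfterMaintenance() {
        if (streams == null) {
            streams = new KafkaStreams(topology, config);
            streams.start();
        }
    }
}
```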
Listening for maintenance could be done in multiple ways, for example:
By scheduling (e.g. using the Quartz library). If you have multiple instances of the app, the scheduler should be triggered on each node.
By a Kafka topic such as maintenance_operations (e.g. a message with status MAINTENANCE_STARTED or MAINTENANCE_COMPLETED). Your app would always listen to this topic and start/stop the required streams based on the event; a minimal listener is sketched below. If you have multiple instances of the app, each node should have a unique consumer group for the maintenance_operations topic.
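A bare-bones listener for the second option could look like this (it reuses the hypothetical MaintenanceAwareStreams wrapper from the sketch above; the topic name and statuses are the ones mentioned, everything else is a placeholder):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MaintenanceListener implements Runnable {

    private final MaintenanceAwareStreams streams; // hypothetical wrapper from the sketch above

    public MaintenanceListener(MaintenanceAwareStreams streams) {
        this.streams = streams;
    }

    @Override
    public void run() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Unique group id per node, so every instance sees every maintenance event.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "maintenance-" + UUID.randomUUID());
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("maintenance_operations"));
            while (!Thread.currentThread().isInterrupted()) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    if ("MAINTENANCE_STARTED".equals(record.value())) {
                        streams.stopForMaintenance();
                    } else if ("MAINTENANCE_COMPLETED".equals(record.value())) {
                        streams.startAfterMaintenance();
                    }
                }
            }
        }
    }
}
```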
I am new to Hadoop and Kafka. I inherited code for a Kafka consumer that runs on a desktop Windows machine, receives the HDFS location of new XML data available on a remote cluster, downloads the data for processing, and writes the result back out to the HDFS cluster.
It seems to me that the consumer should run on the cluster, because that's where the data is, but all the sample Kafka consumer code I see suggests that producers/consumers run on regular desktop machines. What is the typical target platform for a Kafka consumer?
Producers and consumers can run anywhere. The examples you see imply desktop execution because that code is much simpler than, say, code running within a Storm topology, and examples tend to be overly simple. The only reason for a desktop environment would be the presence of a UI for the application.
If the application is headless, then it does make a lot of sense to move the execution as close to the data (both Kafka and HDFS) as possible.
I need some design and development input on reading messages from a queue. I have the following requirements and constraints:
I need to read messages from a queue and insert them into a DB.
Messages can arrive at any interval (hundreds at the same time, or one by one with a gap of a few minutes).
I don't have any MDB container to host this in (just a plain Tomcat server).
I need to write a Java application to perform the above.
So I am not very sure how to put this simple application together.
If I use the Quartz scheduler to trigger a job that reads all the messages in the queue, I'm worried that the next scheduled run might start before the previous one completes and cause problems.
Please suggest any inputs.
This is basically a utility, so I don't want to spend too much time or too many resources on it.
thanks & regards
LR
Using an ESB like Mule or Camel would simplify your development a lot. You'd find ready-made components (called endpoints) for reading from a queue and writing into a DB, as well as for scheduling jobs with Quartz.
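For example, a minimal Camel route in the Java DSL (the queue name and DataSource name are placeholders; the JMS component and the DataSource still have to be configured separately):

```java
import org.apache.camel.builder.RouteBuilder;

// Consume from a JMS queue and insert each message into a DB.
public class QueueToDbRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("jms:queue:incoming.messages")
            // The jdbc endpoint executes the SQL found in the message body, so the
            // INSERT is built here. This is illustrative only; in real code use the
            // sql component with parameters to avoid SQL injection.
            .setBody(simple("insert into messages (payload) values ('${body}')"))
            .to("jdbc:myDataSource");
    }
}
```

Camel can run embedded in a plain Tomcat or Spring application, so no MDB container is needed.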