How to safely unsubscribe from a topic in Kafka - Java

I have a simple Java program, dockerized and deployed in Kubernetes (as a pod).
This Java program is just a normal Java project that listens to and consumes from a specific topic, e.g. SAMPLE-SAFE-TOPIC.
I have to unsubscribe from this topic safely, meaning no data will be lost even if I delete this pod (the Java consumer).
This is the code that I saw from searching:
public static void unsubscribeSafelyFromKafka() {
    logger.debug("Safely unsubscribing from topic..");
    if (myKafkaConsumer != null) {
        myKafkaConsumer.unsubscribe();  // leave the subscription / consumer group
        myKafkaConsumer.close();        // commit final offsets and release resources
    }
}
I need to run this via the command line, but the Java program already has an existing static main() method.
My questions are:
Does the code above guarantee that no records will be lost?
How can I trigger the code above via the command line when there is already an existing static main()?
Note: I am running the Java project via the command line, e.g. java -jar MyKafkaConsumer.jar, as this is the requirement.
Please help

If I understand question 1 correctly, you are concerned that after unsubscribing via one thread (triggered by a console command), the polling consumer might still be processing a batch of records that could be lost if the pod is killed?
If you have other pods consuming as part of the same consumer group, or if this or any pod subscribes again with the same group ID, then the last committed offset will ensure that no records are lost (though some could be processed more than once), as that is where the consumer that takes over will start from.
If you use auto-commit, that is safest, as each commit happens in a subsequent poll, so you cannot possibly commit records that haven't been processed (as long as you don't spawn additional threads to do the processing). Manual commit leaves it to you to decide when records have been dealt with, and hence when it is safe to commit.
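A minimal sketch of the manual-commit pattern described above (the consumer is assumed to be configured with enable.auto.commit=false and already subscribed; the process() step is a placeholder): offsets are committed only after the whole polled batch has been handled, so an unprocessed record can never be marked as consumed.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;

public class ManualCommitLoop {
    static void pollAndCommit(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record);    // hypothetical per-record handling
        }
        consumer.commitSync();  // commit only after everything above succeeded
    }

    private static void process(ConsumerRecord<String, String> record) {
        // e.g. write the record to a database
    }
}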
However, calling close after unsubscribe is a good idea and should ensure a clean completion of the current polled batch and commit of the final offsets as long as that all happens within a timeout period.
Re question 2: if you need to manually unsubscribe, then I think you'd need JMX, an exposed API, or something similar to call a method on the running JVM. However, if you are just trying to ensure a safe shutdown when the pod terminates, you could unsubscribe in a shutdown hook, or simply not worry, given the safety provided by offset commits.
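As a rough illustration of the shutdown-hook approach (a minimal sketch, assuming a single consumer thread and placeholder connection settings): the hook calls consumer.wakeup(), which makes the blocked poll() throw a WakeupException so the loop can exit and close the consumer cleanly, committing the final offsets before the JVM (and the pod) goes away.

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SafeShutdownConsumer {
    public static void main(String[] args) {
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(buildProps());
        consumer.subscribe(Collections.singletonList("SAMPLE-SAFE-TOPIC"));

        final Thread mainThread = Thread.currentThread();
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            consumer.wakeup();      // interrupt the blocked poll()
            try {
                mainThread.join();  // wait for the poll loop to finish cleanly
            } catch (InterruptedException ignored) {
            }
        }));

        try {
            while (true) {
                consumer.poll(Duration.ofMillis(500))
                        .forEach(record -> { /* process the record */ });
            }
        } catch (WakeupException e) {
            // expected on shutdown, nothing to do
        } finally {
            consumer.close();       // commits final offsets and leaves the group
        }
    }

    private static Properties buildProps() {
        Properties props = new Properties();              // placeholder configuration
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "sample-safe-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}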

Related

How to remove execution data from Camunda using workflows

I have a BPMN process that, once started, continues its execution forever based on a timer cycle event. There is no end event for it.
I recently made a few changes to the workflow and redeployed it to Camunda. Since the existing processes are already running, I need an option to stop them, which I am finding difficult to do through the workflow.
How can I stop an existing execution when a new workflow starts its execution? Can we achieve that using the workflow itself? REST / Java coding cannot be used to achieve this.
I have another question regarding an order-by query in Camunda.
From the above scenario, I ended up seeing quite a few similar variables in the variable table. How can I get the latest variable out of it? orderByActivityInstanceId is the only option I saw, which I feel is not reliable.
You can use other events (conditional, message or signal) to react to the situation in which you want to stop the looping process. You can, for instance, add an event sub-process with an interrupting message start event to your process model.
To your second point: https://docs.camunda.org/manual/7.15/reference/rest/history/activity-instance/get-activity-instance-query/
sortBy: Sort the results by a given criterion. Valid values are activityInstanceId, instanceId, executionId, activityId, activityName, activityType, startTime, endTime, duration, definitionId, occurrence and tenantId. Must be used in conjunction with the sortOrder parameter.
https://docs.camunda.org/manual/7.15/reference/rest/variable-instance/get/
is another option
To stop all the active process instances in Camunda, you can either call the Camunda REST API or use Java code.
Using REST API
Activate/Suspend Process Instance By Id
Using Java
Suspend Process Instances
If you would like to suspend all process instances of a given process definition, you can use the method suspendProcessDefinitionById(...) of the RepositoryService and specify the suspendProcessInstances option.
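As a rough sketch in Java (the process definition id below is a made-up placeholder), assuming a Camunda 7 engine is available as the default engine:

import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.ProcessEngines;
import org.camunda.bpm.engine.RepositoryService;

public class SuspendExample {
    public static void main(String[] args) {
        ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
        RepositoryService repositoryService = engine.getRepositoryService();
        repositoryService.suspendProcessDefinitionById(
                "myProcess:1:abc123",  // hypothetical process definition id
                true,                  // also suspend the running process instances
                null);                 // suspend immediately (no scheduled date)
    }
}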
Thanks a lot, I appreciate your responses #amine & #rob.
I got it resolved using a signal event. Every time a new process is deployed, it triggers a signal event that stops the recursion.
To sort the data there are options within Camunda, but I did it differently.
If there is more than one variable, I fetch them using the versionTag from the process definition table.

Can I have local state in a Kafka Processor?

I've been reading a bit about the Kafka concurrency model, but I still struggle to understand whether I can have local state in a Kafka Processor, or whether that will fail in bad ways?
My use case is: I have a topic of updates, I want to insert these updates into a database, but I want to batch them up first. I batch them inside a Java ArrayList inside the Processor, and send them and commit them in the punctuate call.
Will this fail in bad ways? Am I guaranteed that the ArrayList will not be accessed concurrently?
I realize that there will be multiple Processors and multiple ArrayLists, depending on the number of threads and partitions, but I don't really care about that.
I also realize I will lose the ArrayList if the application crashes, but I don't care if some events are inserted twice into the database.
This works fine in my simple tests, but is it correct? If not, why?
Whatever you use for local state in your Kafka consumer application is up to you, and you can guarantee that only the current thread/consumer will be able to access the local state data in your ArrayList. If you have multiple threads, one per Kafka consumer, each thread can have its own private ArrayList or HashMap to store state in. You could also use something like a local RocksDB database for persistent local state.
A few things to look out for:
If you're batching updates together to send to the DB, are those updates in any way related, say, because they're part of a transaction? If not, you might run into problems. An easy way to ensure this is the case is to set a key for your messages with a transaction ID, or some other unique identifier for the transaction; that way all the updates with that transaction ID will end up in one specific partition, so whoever consumes them is sure to always have the complete set of updates for that transaction.
How are you validating that you got ALL the transactions before your batch update? Again, this is important if you're dealing with database updates inside transactions. You could simply wait for a pre-determined amount of time to ensure you have all the updates (say, maybe 30 seconds is enough in your case). Or maybe you send an "EndOfTransaction" message that details how many messages you should have gotten, as well as maybe a CRC or hash of the messages themselves. That way, when you get it, you can either use it to validate you have all the messages already, or you can keep waiting for the ones that you haven't gotten yet.
Make sure you're not committing to Kafka the messages you're keeping in memory until after you've batched and sent them to the database, and you have confirmed that the updates went through successfully. This way, if your application dies, the next time it comes back up, it will get again the messages you haven't committed in Kafka yet.
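To make the batching idea concrete, here is a rough sketch using the newer Kafka Streams Processor API (the class name, the 30-second interval, and the database call are all assumptions, not a definitive implementation). Each stream thread gets its own processor instance, so the ArrayList is never touched concurrently, and offsets are only allowed to be committed after a successful flush.

import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class BatchingProcessor implements Processor<String, String, Void, Void> {

    private final List<String> buffer = new ArrayList<>();
    private ProcessorContext<Void, Void> context;

    @Override
    public void init(ProcessorContext<Void, Void> context) {
        this.context = context;
        // Flush the buffer every 30 seconds of wall-clock time.
        context.schedule(Duration.ofSeconds(30),
                PunctuationType.WALL_CLOCK_TIME,
                timestamp -> flush());
    }

    @Override
    public void process(Record<String, String> record) {
        buffer.add(record.value());   // buffer the update locally
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        writeBatchToDatabase(buffer); // hypothetical batched DB insert
        buffer.clear();
        context.commit();             // only now allow offsets to be committed
    }

    private void writeBatchToDatabase(List<String> batch) {
        // e.g. a single batched INSERT via JDBC
    }

    @Override
    public void close() {
        flush();                      // best-effort flush on clean shutdown
    }
}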

Event processed confirmation in Kafka

I'm trying to achieve some kind of event processing in Kafka. I've got some producers which post events to a Kafka topic. I've also got consumers which get an event, process it, and save the processed data in a DB. However, I need to be sure that EVERY event has been processed and finished. What if something crashes unexpectedly while processing an event after taking it from the queue? How can I inform Kafka that this particular event is still not processed? Are there any known patterns?
Kafka Streams version 0.10.* by design has "at least once" semantics. Since you are writing to a DB, if every event has its own key you will also get "exactly once" semantics, because writing to the same key produces no duplicates.
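As a rough illustration of that idempotent-write idea (the table name, schema, and JDBC URL below are assumptions): re-processing the same event after a crash simply overwrites the same row, so at-least-once delivery from Kafka still yields exactly-once results in the database.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IdempotentWriter {
    // Upsert keyed by the event id: duplicates overwrite, they never accumulate.
    public static void upsert(String eventId, String payload) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/events");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO processed_events (event_id, payload) VALUES (?, ?) "
                     + "ON CONFLICT (event_id) DO UPDATE SET payload = EXCLUDED.payload")) {
            ps.setString(1, eventId);
            ps.setString(2, payload);
            ps.executeUpdate();
        }
    }
}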
If you want to make sure that this is correct:
Start Kafka.
Generate data.
Start the DB.
Start your stream.
Make sure the data is getting there.
Now stop your DB.
Kill the stream while it is getting errors.
Start the DB again.
You will see that Kafka reproduces the data into your DB again.
For further reading you can go here

How to implement a light weight database based FIFO queue in java

We have a Servlet-based application which uploads files.
Uploaded files are saved on the server, after which they are delegated for file processing and later inserted into our MongoDB.
The files are large (greater than 50 MB) and processing takes 30 minutes to 1 hour depending upon server load.
The problem happens when multiple files get processed at the same time, in separate threads of course; this eventually slows down the system, and finally one of the threads gets aborted, which we can never trace.
So we are now planning a multiple-producer / single-consumer approach, where file jobs are queued one by one and the consumer consumes them from the queue one by one, sequentially.
We also need clustering capability to be implemented in the application later on.
For this approach we are planning to implement the process below.
When a file job comes in, we will put it in a Mongo collection with status NEW.
Next it will call the consumer thread immediately.
The consumer will check if there is already a running task with status RUNNING.
If there is no task with running status, it will start the task.
Upon completion, before ending, the consumer will check the collection again for any tasks with status NEW; if there are any, it will take the next task in FIFO order by checking the timestamp, and the process continues.
If there is a currently running task, it will simply insert the new task into the DB; since there is an already-running consumer, that thread will take care of the new job inserted into the DB while the current process ends.
This way, we can also ensure that it will run smoothly in a clustered environment without any additional configuration.
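For the claim step, a rough sketch of what we have in mind (the collection and field names are placeholders): findOneAndUpdate is atomic, so even with several application nodes only one of them can move a given job from NEW to RUNNING.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.Sorts;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class JobClaimer {
    // Atomically pick the oldest NEW job and mark it RUNNING (FIFO by createdAt).
    public static Document claimNextJob(MongoCollection<Document> jobs) {
        return jobs.findOneAndUpdate(
                Filters.eq("status", "NEW"),
                Updates.set("status", "RUNNING"),
                new FindOneAndUpdateOptions().sort(Sorts.ascending("createdAt")));
    }

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> jobs =
                    client.getDatabase("fileProcessing").getCollection("fileJobs");
            Document job = claimNextJob(jobs);  // null if nothing is pending
            System.out.println(job);
        }
    }
}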
There are message-queue-based solutions with RabbitMQ or ActiveMQ, but we need to minimize additional component configuration.
Let me know if our approach is correct, or whether there is a better solution out there.
Thanks,

How does the Storm handle nextTuple in the Bolt

I am a newbie to Storm and have created a program to read incrementing numbers for a certain time. I have used a counter in the Spout, and in the nextTuple() method the counter is emitted and incremented:
_collector.emit(new Values(new Integer(currentNumber++)));
/* how is this method being continuously called...? */
and the Bolt's execute() method, which receives the Tuple, has
public void execute(Tuple input) {
    int number = input.getInteger(0);
    logger.info("This number is (" + number + ")");
    _outputCollector.ack(input);
}
/* this part I am clear on, as the Bolt receives the input from the Spout */
In my Main class execution I have the following code
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("NumberSpout", new NumberSpout());
builder.setBolt("NumberBolt", new PrimeNumberBolt())
.shuffleGrouping("NumberSpout");
Config config = new Config();
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("NumberTest", config, builder.createTopology());
Utils.sleep(10000);
localCluster.killTopology("NumberTest");
localCluster.shutdown();
The program works perfectly fine. What I am currently looking at here is how the Storm framework internally calls the nextTuple() method continuously. I am sure that my understanding is missing something here, and due to this gap I am unable to connect to the internal logic of this framework.
Can anyone help me understand this portion clearly? It would be a great help, as I will have to implement this concept in my project. If I am conceptually clear here, I can make significant progress. I'd appreciate it if anyone could quickly assist me here. Awaiting responses...
how does the Storm framework internally call the nextTuple() method continuously
I believe this actually involves a fairly detailed discussion of the entire life cycle of a Storm topology, as well as a clear understanding of different entities like workers, executors, tasks, etc. The actual submission of a topology is carried out by the StormSubmitter class with its submitTopology method.
The very first thing it does is start uploading the jar using Nimbus's Thrift interface, and then it calls submitTopology, which eventually submits the topology to Nimbus. Nimbus then starts by normalizing the topology (from the docs: the main purpose of normalization is to ensure that every single task will have the same serialization registrations, which is critical for getting serialization working correctly), followed by serialization, ZooKeeper handshaking, supervisor and worker process startup, and so on. It's too broad to discuss here, but if you really want to dig deeper you can go through the life cycle of a Storm topology, which nicely explains the step-by-step actions performed during the entire process. (A quick note from the documentation:)
First a couple of important notes about topologies:
The actual topology that runs is different than the topology the user specifies. The actual topology has implicit streams and an implicit "acker" bolt added to manage the acking framework (used to guarantee data processing).
The implicit topology is created via the system-topology! function. system-topology! is used in two places:
- when Nimbus is creating tasks for the topology
- in the worker so it knows where it needs to route messages to
Now, here are a few clues I can try to share...
Spouts and Bolts are actually the components which do the real processing (the logic). In Storm terminology they execute as many tasks across the cluster.
From the doc page: Each task corresponds to one thread of execution.
Now, among many others, one typical responsibility of a worker process (read here) in Storm is to monitor whether a topology is active or not and to store that particular state in a variable named storm-active-atom. This variable is used by the tasks to determine whether or not to call the nextTuple method. So as long as your topology is live (you haven't posted your spout code, but I'm assuming) and your timer is active (as you said, for a certain time), it will keep calling the nextTuple method. You can dig even further into Storm's acking framework implementation to understand how it recognizes and acknowledges that a tuple has been successfully processed; see Guaranteeing Message Processing.
I am sure that my understanding is missing something here and due to this gap I am unable to connect to the internal logic of this framework
Having said this, I think it's more important at this early stage to get a clear understanding of how to work with Storm rather than how Storm works internally. For example, instead of learning the internal mechanism of Storm, it's important to realize that if we set a spout to read a file line by line, it keeps emitting each line using the _collector.emit method until it reaches EOF, and the bolt connected to it receives the same in its execute(Tuple input) method.
Hope this helps; share more with us in the future.
Ordinary Spouts
There is a loop in Storm's executor daemon that repeatedly calls nextTuple (as well as ack and fail when appropriate) on the corresponding spout instance.
There is no waiting for tuples to be processed. The spout simply receives a fail for tuples that did not manage to be processed within the given timeout.
This can be easily simulated with a topology of a fast spout and a slow processing bolt: the spout will receive a lot of fail calls.
See also the ISpout javadoc:
nextTuple, ack, and fail are all called in a tight loop in a single thread in the spout task. When there are no tuples to emit, it is courteous to have nextTuple sleep for a short amount of time (like a single millisecond) so as not to waste too much CPU.
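For illustration, a minimal sketch of a spout along the lines of the question's NumberSpout (written against the Storm 2.x org.apache.storm API; the field name and sleep interval are arbitrary). The executor's loop calls nextTuple() over and over; the spout only emits one value per call and sleeps briefly so an idle loop does not spin the CPU.

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.Map;

public class NumberSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private int currentNumber = 0;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by the executor thread; emit exactly one tuple per call.
        collector.emit(new Values(currentNumber++));
        Utils.sleep(1); // be courteous when there is little work to do
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("number"));
    }
}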
Trident Spouts
The situation is completely different for Trident-spouts:
By default, Trident processes a single batch at a time, waiting for
the batch to succeed or fail before trying another batch. You can get
significantly higher throughput – and lower latency of processing of
each batch – by pipelining the batches. You configure the maximum
amount of batches to be processed simultaneously with the
topology.max.spout.pending property.
Even while processing multiple batches simultaneously, Trident will order any state updates taking place in the topology among batches.
