Can I have local state in a Kafka Processor?

I've been reading a bit about the Kafka concurrency model, but I still struggle to understand whether I can have local state in a Kafka Processor, or whether that will fail in bad ways?
My use case is: I have a topic of updates, I want to insert these updates into a database, but I want to batch them up first. I batch them inside a Java ArrayList inside the Processor, and send them and commit them in the punctuate call.
Will this fail in bad ways? Am I guaranteed that the ArrayList will not be accessed concurrently?
I realize that there will be multiple Processors and multiple ArrayLists, depending on the number of threads and partitions, but I don't really care about that.
I also realize I will lose the ArrayList if the application crashes, but I don't care if some events are inserted twice into the database.
This works fine in my simple tests, but is it correct? If not, why?

Whatever you use for local state in your Kafka consumer application is up to you, so you can guarantee that only the current thread/consumer will be able to access the local state data in your ArrayList. If you have multiple threads, one per Kafka consumer, each thread can have its own private ArrayList or HashMap to store state in. You could also have something like a local RocksDB database for persistent local state.
A few things to look out for:
If you're batching updates together to send to the DB, are those updates in any way related, say, because they're part of a transaction? If not, you might run into problems. An easy way to ensure this is the case is to set a key for your messages with a transaction ID, or some other unique identifier for the transaction, and that way all the updates with that transaction ID will end up in one specific partition, so whoever consumes them is sure to always have the
How are you validating that you got ALL the transactions before your batch update? Again, this is important if you're dealing with database updates inside transactions. You could simply wait for a pre-determined amount of time to ensure you have all the updates (say, maybe 30 seconds is enough in your case). Or maybe you send an "EndOfTransaction" message that details how many messages you should have gotten, as well as maybe a CRC or hash of the messages themselves. That way, when you get it, you can either use it to validate you have all the messages already, or you can keep waiting for the ones that you haven't gotten yet.
Make sure you're not committing to Kafka the messages you're keeping in memory until after you've batched and sent them to the database, and you have confirmed that the updates went through successfully. This way, if your application dies, the next time it comes back up, it will get again the messages you haven't committed in Kafka yet.
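
The last point (write the batch first, commit to Kafka only afterwards) can be sketched without any Kafka dependency. This is a minimal illustration, not the Kafka API: `UpdateBatcher`, `databaseWriter` and `offsetCommitter` are hypothetical names standing in for the processor's buffer, the database insert and whatever commit call your client exposes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Per-processor batch buffer. Not thread-safe on purpose: exactly one thread owns it. */
class UpdateBatcher<T> {
    private final List<T> pending = new ArrayList<>();
    private final Consumer<List<T>> databaseWriter; // hypothetical: writes one batch to the DB
    private final Runnable offsetCommitter;         // hypothetical: commits the consumed offsets

    UpdateBatcher(Consumer<List<T>> databaseWriter, Runnable offsetCommitter) {
        this.databaseWriter = databaseWriter;
        this.offsetCommitter = offsetCommitter;
    }

    /** Called for every incoming record. */
    void add(T update) {
        pending.add(update);
    }

    /** Called periodically (for example from punctuate). Returns how many updates were flushed. */
    int flush() {
        if (pending.isEmpty()) {
            return 0;
        }
        // If the write throws, the commit below is skipped and the batch stays buffered,
        // so nothing is committed for records that never reached the database.
        databaseWriter.accept(new ArrayList<>(pending));
        offsetCommitter.run();
        int flushed = pending.size();
        pending.clear();
        return flushed;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Because the commit happens after the write, a crash between the two steps leads to the batch being inserted again on restart, which matches the "duplicates are acceptable" requirement in the question.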

Related

Getting usernames from database that are not being used by a thread

I have a multi threaded Java program where each thread gets one username for some processing which takes about 10 minutes or so.
Right now it's getting the usernames by a sql query that returns one username randomly and the problem is that the same username can be given to more than one thread at a time.
I don't want a username that is being processed by a thread, to be fetched again by another thread. What is a simple and easy way to achieve this goal?

Step-by-step solution:
Create a threads table where you store the threads' state. Among other columns, you need to store the owner user's id there as well.
When a thread is associated to a user, create a record, storing the owner, along with all other juicy stuff.
When a thread is no longer associated to a user, set its owner to null.
When a thread finishes its job, remove its record.
When you randomize your user for threads, filter out all the users who are already associated to at least a thread. This way you know any users at the end of randomization are threadless.
Make sure everything is in place. If, while working on the feature some thread records were created and should be removed or disposed from its owner, then do so.

There are a lot of ways to do this. I can think of a few solutions to this problem:
1) A singleton class with an array that contains all the users already in use. Be sure that access to the array is synchronized and that you remove the users that are no longer in use from it.
2) A flag in the user table that contains a unique ID referencing the thread that is using it. You then have to manage when to remove the flag from the table.
-> As an alternative, why not check whether a pool of connections shared by all the threads could be the solution to your problem?

You could do one batch query that returns all of the usernames you want from the database and store them in a List (or some type of collection).
Then ensure synchronised access to this list to prevent two threads taking the same username at the same time. Use a synchronised list or a synchronised method to access the list and remove the username from the list.
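
A minimal sketch of that idea, assuming the usernames have already been loaded by one batch query. `UsernamePool`, `claim` and `release` are hypothetical names; a `ConcurrentLinkedQueue` is used here in place of a synchronised list because its `poll()` already removes and returns an element atomically.

```java
import java.util.Collection;
import java.util.Optional;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hands each preloaded username to at most one thread at a time. */
class UsernamePool {
    private final ConcurrentLinkedQueue<String> available;

    UsernamePool(Collection<String> usernames) {
        this.available = new ConcurrentLinkedQueue<>(usernames);
    }

    /** Atomically removes and returns one username, or empty if none are left. */
    Optional<String> claim() {
        return Optional.ofNullable(available.poll());
    }

    /** Puts a username back, for example when processing failed and should be retried. */
    void release(String username) {
        available.offer(username);
    }
}
```

Each worker thread would call `claim()` before starting its ten-minute job. Note that this only coordinates threads inside one JVM; it does nothing for a second process reading the same table.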

One way to do it is to add another column to your users table. This column is a simple flag that shows whether a user has an assigned thread or not.
But when you query the db, you have to wrap it in a transaction.
You begin the transaction, first select a user that doesn't have a thread, then update the flag column, and then commit or roll back.
Since the queries are wrapped in a transaction, the db handles the issues that arise in scenarios like this.
With this solution there is no need to implement synchronization mechanisms in your code, since the database will do it for you.
If you still have problems after doing this, I think you have to configure the isolation level of your db server.

You appear to want a work queue system. Don't reinvent the wheel - use a well established existing work queue.
Robust, reliable concurrent work queuing is unfortunately tricky with relational databases. Most "solutions" land up:
Failing to cope with work items not being completed due to a worker restart or crash;
Actually land up serializing all work on a lock, so all but one worker are just waiting; and/or
Allowing a work item to be processed more than once
PostgreSQL 9.5's new FOR UPDATE SKIP LOCKED feature will make it easier to do what you want in the database. For now, use a canned reliable task/work/message queue engine.
If you must do this yourself, you'll want to have a table of active work items where you record the active process ID / thread ID of the worker that's processing a row. You will need a cleanup process that runs periodically, on thread crash, and on program startup that removes entries for failed jobs (where the worker process no longer exists) so they can be re-tried.
Note that unless the work the workers do is committed to the database in the same transaction that marks the work queue item as done, you will have timing issues where the work can be completed then the DB entry for it isn't marked as done, leading to work being repeated. To absolutely prevent that requires that you commit the work to the DB in the same transaction as the change that marks the work as done, or that you use two-phase commit and an external transaction manager.

Updating integer atomically over multiple JVMs for every key

We have a requirement that can be narrowed down as follows.
There are multiple keys and each key maps to an integer.
When a key is received on a JVM, you need to retrieve the int value from the shared memory, increment it and then put the incremented value back on the shared memory.
So when two JVMs or two threads read the same value, the update of one of them should fail consistently, so that you do not lose any increment done by any thread on any JVM.
Once an update fails, you read again from the shared memory, increment it and then update again till the update is successful or you have exhausted some 'N' number of retries.
Right now we are using Infinispan with optimistic locking, but the behavior is not consistent. Please find the link to that thread.
https://developer.jboss.org/message/914490
Is there any other technology that would fit this requirement well?

Synchronizing between threads is easy, but between JVMs is extremely hard, especially if you need to support multiple platforms. I would suggest centralising the update code using one of the following methods, both of which "contract out" the data update task:
Publish a trivial REST API from a single process that knows how to do the update task, and serialize the requests.
Use a relational database to hold the counts, and make sure the client code correctly rolls back transactions when they don't succeed.
Probably not what you wanted to hear, but either method will work well.
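
For reference, the read, increment, conditional write and retry loop described in the question looks like this inside a single JVM. This is only a sketch of the pattern: `RetryingCounter` is a hypothetical name, and a `ConcurrentHashMap` stands in for the shared store. Across JVMs, the store (or the central service suggested above) would have to offer an equivalent "replace only if the value is still what I read" operation.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Optimistic per-key counter: an update fails if another writer got in first. */
class RetryingCounter {
    // Stand-in for the shared memory; only the conditional operations below are relied on.
    private final ConcurrentMap<String, Integer> store = new ConcurrentHashMap<>();

    /** Tries up to maxRetries times; returns true if the increment was applied. */
    boolean increment(String key, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            Integer current = store.get(key);
            boolean updated = (current == null)
                    ? store.putIfAbsent(key, 1) == null          // first writer creates the key
                    : store.replace(key, current, current + 1);  // succeeds only if unchanged
            if (updated) {
                return true;
            }
            // Someone else changed the value between our read and write: read again and retry.
        }
        return false;
    }

    int get(String key) {
        return store.getOrDefault(key, 0);
    }
}
```

No increment is lost because a write is only accepted when the value is still the one that was read; a caller that runs out of retries learns about it from the `false` return value.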

How to iterate over db records correctly with hibernate

I want to iterate over records in the database and update them. However, since the updating both takes some time and is prone to errors, I need to a) avoid keeping the db waiting (as e.g. with a ScrollableResults) and b) commit after each update.
Second thing is that this is done in multiple threads, so I need to ensure that if thread A is taking care of a record, thread B is getting another one.
How can I implement this sensibly with hibernate?
To give a better idea, the following code would be executed by several threads, where all threads share a single instance of the RecordIterator:
Iterator<Record> iter = db.getRecordIterator();
while(iter.hasNext()){
Record rec = iter.next();
// do something lengthy here
db.save(rec);
}
So my question is how to implement the RecordIterator. If on every next() I perform a query, how do I ensure that I don't return the same record twice? If I don't, which query should I use to return detached objects? Is there a flaw in the general approach (e.g. should I use one RecordIterator per thread and let the db somehow handle synchronization)? Additional info: there are way too many records to keep them locally (e.g. in a set of treated records).
Update: Because the overall process takes some time, it can happen that the status of Records changes. Due to that the ordering of the result of a query can change. I guess to solve this problem I have to mark records in the database once I return them for processing...

Hmmm, what about pushing your objects from a reader thread in some bounded blocking queue, and let your updater threads read from that queue.
In your reader, do some paging with setFirstResult/setMaxResults. E.g. if you have 1000 elements maximum in your queue, fill them up 500 at a time. When the queue is full, the next push will automatically wait until the updaters take the next elements.
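
A minimal sketch of that reader/updater arrangement, using plain integers as stand-ins for records so that it runs without Hibernate. `QueuedUpdater`, `run` and the `POISON` sentinel are hypothetical; in real code the inner loop of the reader would be a paged query with setFirstResult/setMaxResults and the worker body would be the lengthy update plus commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

class QueuedUpdater {
    private static final Integer POISON = Integer.MIN_VALUE; // sentinel telling a worker to stop

    /** Feeds ids 0..recordCount-1 through a bounded queue; returns the ids that were processed. */
    static Set<Integer> run(int recordCount, int pageSize, int workerCount) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2 * pageSize);
        Set<Integer> processed = ConcurrentHashMap.newKeySet();
        List<Thread> workers = new ArrayList<>();
        for (int w = 0; w < workerCount; w++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        Integer id = queue.take();   // each queued id is taken by exactly one worker
                        if (id.equals(POISON)) {
                            return;
                        }
                        processed.add(id);           // stand-in for the lengthy update and commit
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            workers.add(t);
        }
        try {
            // Reader: one "page" at a time; put() blocks while the queue is full.
            for (int first = 0; first < recordCount; first += pageSize) {
                int last = Math.min(first + pageSize, recordCount);
                for (int id = first; id < last; id++) {
                    queue.put(id);
                }
            }
            for (int w = 0; w < workerCount; w++) {
                queue.put(POISON);
            }
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return processed;
    }
}
```

Since only the single reader thread queries the database, no two updaters can receive the same record, and the bounded queue keeps memory use fixed however many records there are.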

My suggestion, since you're sharing an instance of the master iterator, would be to run all of your threads using a shared Hibernate transaction, with one load at the beginning and a big save at the end. You load all of your data into a single 'Set' which you can iterate over using your threads (be careful of locking, so you might want to split off a section for each thread, or somehow manage the shared resource so that you don't overlap).
The beauty of the Hibernate solution is that the records aren't immediately saved to the database, since you're using a transaction, and are stored in hibernate's cache. Then at the end they'd all be written back to the database at once. This would save on those expensive database writes you're worried about, plus it gives you an actual object to work with on each iteration, instead of just a database row.
I see in your update that the status of the records may change during processing, and this could always cause a problem. If this is a constantly running process or long running, then my advice using a hibernate solution would be to work in smaller sets, and yes, add a flag to mark records that have been updated, so that when you move to the next set you can pick up ones that haven't been touched.

How to remove messages from a topic

I am trying to write an application that uses the JMS publish/subscribe model. However, I have run into a setback: I want the publisher to be able to delete messages from the topic. The use case is that I have durable subscribers. The active ones will get the messages (since delivery is more or less instant), but if there are inactive ones and the publisher decides the message is wrong, I want the publisher to be able to delete the message so that the subscribers won't receive it once they become active.
Problem is, I don't know how/if this can be done.
For a provider I settled on glassfish's implementation, but if other alternatives offer this functionality, I can switch.
Thank you.

JMS is a form of asynchronous messaging and as such the publishers and subscribers are decoupled by design. This means that there is no mechanism to do what you are asking. Subscribers who are active at the time of publication will consume the message with no chance of receiving the delete message in time to act on it. A subscriber that is offline would receive the delete message in time, but async messages are supposed to be atomic. If you proceed with the design in the other respondent's answer (create a delete message and require reconnecting consumers to read the entire queue looking for delete messages), then you will create a situation in which the behavior of the system differs based on whether or not a subscriber was online at the time a specific message/delete combination was published. There is also a race condition in which the subscriber completes reading of the retained messages just before the publisher sends out the delete message. This means you must put significant logic into subscribers to reconcile these conditions and even more to reconcile the race condition.
The accepted method of doing this is what are called "compensating transactions." In any system where the producer and consumer do not share a single unit of work or share common state (such as using the same DB to store state) then backing out or correcting a previous transaction requires a second transaction that reverses the first. The consumer must of course be able to apply the compensating transaction correctly. When this pattern is used the result is that all subscribers exhibit the same behavior regardless of whether the messages are consumed in real time or in a batch after the consumer has restarted.
Note that a compensating transaction differs from a "delete message." The delete message as proposed in the other respondent's answer is a form of command and control that affects the message stream itself. On the other hand, compensating transactions affect the state of the system through transactional updates of the system state.
As a general rule, you never want to manage state of the system by manipulating the message stream with command and control functions. This is fragile, susceptible to attack and very hard to audit or debug. Instead, design the system to deliver every message subject to its quality of service constraints and to process all messages. Handle state changes (including reversing a prior action) entirely in the application.
As an example, in banking where transactions trigger secondary effects such as overdraft fees, a common procedure is to "memo post" the transactions during the day, then sort and apply them in a batch after the bank has closed. This allows a mistake to be reconciled before it causes overdraft fees. More recently, the transactions are applied in real time but the triggers are withheld until the day's books close and this achieves the same result.
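
To make the pattern concrete, here is a small sketch of a consumer-side ledger in which a mistaken entry is corrected by posting a second, reversing entry rather than by removing the first. The names (`Ledger`, `post`, `compensate`) and the use of cents are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Append-only ledger: mistakes are reversed by new entries, never deleted. */
class Ledger {
    /** One posted entry; amountCents is positive for credits and negative for debits. */
    static final class Entry {
        final String id;
        final long amountCents;
        final String reversesId; // null unless this entry compensates an earlier one

        Entry(String id, long amountCents, String reversesId) {
            this.id = id;
            this.amountCents = amountCents;
            this.reversesId = reversesId;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void post(String id, long amountCents) {
        entries.add(new Entry(id, amountCents, null));
    }

    /** Posts a new entry that cancels an earlier one; the original stays in the history. */
    void compensate(String newId, String originalId) {
        Entry original = null;
        for (Entry e : entries) {
            if (e.id.equals(originalId)) {
                original = e;
                break;
            }
        }
        if (original == null) {
            throw new IllegalArgumentException("unknown entry: " + originalId);
        }
        entries.add(new Entry(newId, -original.amountCents, originalId));
    }

    long balanceCents() {
        long sum = 0;
        for (Entry e : entries) {
            sum += e.amountCents;
        }
        return sum;
    }

    int entryCount() {
        return entries.size();
    }
}
```

A subscriber that was online and one that replays the same messages after a restart both end up with the same balance and the same audit trail, which is the property the answer is arguing for.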

The JMS API does not allow removing messages from any destination (either queue or topic), although I believe that specific JMS providers offer their own proprietary tools to manage their state, for example using JMX. Check what your JMS provider offers, but be careful: even if you find a solution, it will not be portable between different JMS providers.
One legal way to "remove" a message is using its time-to-live:
publish(Topic topic, Message message, int deliveryMode, int priority, long timeToLive). Probably it is good enough for you.
If it is not applicable for your application, solve the problem at the application level. For example, attach a unique ID to each message and publish a special "delete" message with higher priority that acts as a kind of command to delete the "real" message with the same ID.

You can have the producer send a delete message; the consumer then needs to read all messages before starting to process them.
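
As a rough sketch of what "read all messages before processing" would involve on the consumer side, assuming each message carries a unique ID and a flag marking it as a delete command. `RetainedMessageFilter` and `Msg` are hypothetical types, not part of JMS, and the caveats raised in the first answer (online subscribers, the race with late delete messages) still apply.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RetainedMessageFilter {
    /** Hypothetical application-level message: a payload or a delete command for an ID. */
    static final class Msg {
        final String id;
        final String body;
        final boolean deleteMarker;

        Msg(String id, String body, boolean deleteMarker) {
            this.id = id;
            this.body = body;
            this.deleteMarker = deleteMarker;
        }
    }

    /** Returns the backlog without delete markers and without the messages they target. */
    static List<Msg> deliverable(List<Msg> backlog) {
        // First pass: collect every ID that some delete marker refers to.
        Set<String> deletedIds = new HashSet<>();
        for (Msg m : backlog) {
            if (m.deleteMarker) {
                deletedIds.add(m.id);
            }
        }
        // Second pass: keep real messages whose ID was not deleted, in original order.
        List<Msg> result = new ArrayList<>();
        for (Msg m : backlog) {
            if (!m.deleteMarker && !deletedIds.contains(m.id)) {
                result.add(m);
            }
        }
        return result;
    }
}
```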

Java: Do something on event in SQL Database?

I'm building an application with distributed parts.
Meaning, while one part (writer) may be inserting or updating information in a database, the other part (reader) is reading and acting on that information.
Now, I wish to trigger an action event in the reader and reload information from the DB whenever I insert something from the writer.
Is there a simple way about this?
Would this be a good idea? :
// READER
while(true) {
connect();
// reload info from DB
executeQuery("select * from foo");
disconnect();
}
EDIT : More info
This is a restaurant Point-of-sale system.
Whenever the cashier punches an order into the db, the application in the kitchen gets the notification. This is the kind of system you would see at McDonald's.
The applications needn't be connected in any way, but each part will connect to a single MySQL server.
And I believe I'd expect immediate notifications.

You might consider setting up an embedded JMS server in your application, I would recommend ActiveMQ as it is super easy to embed.
For what you want to do, a JMS Topic is a perfect fit. When the cashier punches in an order, the order is not written to the database but to a message on the Topic; let's name it newOrders.
On the topic there are 2 subscribers: NewOrderPersister and KitchenNotifier. These will each have an onMessage(Message msg) method that receives the details of the order. One saves it to the database; the other adds it to a screen or yells it through the kitchen with text-to-speech, whatever.
The nice part of this is that the poster does not need to know which and how many subscribers are there waiting for the messages. So if you want a NewOrderCounter in the back office to keep an online count of how much money the owner has made today, or want to add a "FrenchFriesOrderListener" to have a special display near the deep frying pan, nothing has to change in the rest of the application. They just subscribe to the topic.
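
The decoupling described here can be illustrated with a tiny in-memory stand-in for the topic. This is not ActiveMQ or the JMS API; `OrderTopic`, `subscribe` and `publish` are hypothetical names, and orders are plain strings to keep the sketch short.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** In-memory stand-in for the newOrders topic: every subscriber sees every order. */
class OrderTopic {
    private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();

    /** Registers a subscriber; the publisher never learns who or how many there are. */
    void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    /** Delivers the order to all current subscribers. */
    void publish(String order) {
        for (Consumer<String> subscriber : subscribers) {
            subscriber.accept(order);
        }
    }
}
```

Adding a counter or a special display later is just one more `subscribe` call; the cashier side that calls `publish` stays unchanged. A real broker adds what this sketch lacks, such as delivery across processes and durable subscriptions.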

The idea you are talking about is called "polling". As Graphain pointed out you must add a delay in the loop. The amount of delay should be decided based on factors like how quickly you want your reader to detect any changes in database and how fast the writer is expected to insert/update data.
The next improvement to your solution could be to have a change indicator within the database. Your algorithm will look something like:
// READER
while(true) {
connect();
// check the change counter before reloading anything
change_count=executeQuery("select change_count from change_counters where counter=foo");
if(change_count> last_change_count){
last_change_count=change_count;
reload();
}
disconnect();
sleep(poll_interval); // the delay mentioned above; poll_interval is a placeholder name
}
The above change will ensure that you do not reload data unnecessarily.
You can further tune the solution to keep a row level change count so that you can reload only the updated rows.
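
The same polling logic as a small Java sketch. Everything named here is hypothetical: `ChangePoller` takes the change-count query as a `LongSupplier` and the reload as a `Runnable`, so the database access itself is left out and would have to be supplied by your own JDBC code.

```java
import java.util.function.LongSupplier;

/** Reloads only when the change counter has advanced since the last poll. */
class ChangePoller {
    private final LongSupplier changeCountQuery; // stand-in for "select change_count ..."
    private final Runnable reload;               // stand-in for re-reading the data
    private long lastChangeCount;

    ChangePoller(LongSupplier changeCountQuery, Runnable reload, long initialChangeCount) {
        this.changeCountQuery = changeCountQuery;
        this.reload = reload;
        this.lastChangeCount = initialChangeCount;
    }

    /** One polling step; returns true if a reload was triggered. */
    boolean pollOnce() {
        long current = changeCountQuery.getAsLong();
        if (current > lastChangeCount) {
            lastChangeCount = current;
            reload.run();
            return true;
        }
        return false;
    }

    /** Polls until the thread is interrupted, sleeping between polls. */
    void pollForever(long delayMillis) throws InterruptedException {
        while (true) {
            pollOnce();
            Thread.sleep(delayMillis); // the delay keeps the loop from hammering the database
        }
    }
}
```

The delay is the trade-off knob: a shorter value means the kitchen sees orders sooner, at the cost of more queries against the server.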

I don't think it's a good idea to use a database to synchronize processes. The parties using the database should synchronize directly, i.e., the writer should write its orders and then notify the kitchen that there is a new order. Then again the notification could be the order itself (or some ID for the database). The notifications can be sent via a message broker.
It's more or less like in a real restaurant. The kitchen rings a bell when meals are finished and the waiters fetch them. They don't poll unnecessarily.
If you really want to use the database for synchronization, you should look into triggers and stored procedures. I'm fairly sure that most RDBMS allow the creation of stored procedures in Java or C that can do arbitrary things like opening a Socket and communicating with another Computer. While this is possible and not as bad as polling I still don't think of it as a very good idea.

Well to start with you'd want some kind of wait timer in there or it is literally going to poll every instance of time it can which would be a pretty bad idea unless you want to simulate what it would be like if Google was hosted on one database.
What kind of environment do the apps run in? Are we talking same machine notification, cross-network, over the net?
How frequently do updates occur and how soon does the reader need to know about them?

I have done something similar before using JGroups. I don't remember the exact details, as it was quite a few years ago, but I had a listener on the "writer" end which would then use JGroups to send out a notification of the change, which would cause the receivers to respond accordingly.
