I have a web service that receives multiple XML files at a time, each containing student data.
I need to process each file and store its values in a database.
For that I use a JMS queue: I create an object message and push it to the queue.
But while the queue is processing one message, other messages become available for processing, and because of that my database table gets locked.
Suppose I have a list that contains 5000 values, and in a for loop I iterate over the list and process JMS messages.
That is exactly my scenario. The problem is that while one message is being processed my table gets locked, and the rest of the files remain in the queue.
Please suggest a solution.
Make sure you use the right lock strategy (see table-level locking and row-level locking).
See if you can process your messages one at a time (via the JMS consumer configuration). This way, the first message releases its lock before the second one starts, and so on.
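A minimal sketch of the "one message at a time" idea, using only the JDK rather than a real JMS provider. A single-threaded executor stands in for a consumer limited to one concurrent session, so each unit of work finishes (and would release its database lock) before the next one starts. The class `SequentialConsumer` and its methods are hypothetical names for illustration, and `handle` is only a placeholder for the XML parsing and database write.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a JMS consumer restricted to one concurrent session.
class SequentialConsumer {
    // One worker thread: queued payloads are handled strictly one after another.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final List<String> processed = Collections.synchronizedList(new ArrayList<>());
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger maxInFlight = new AtomicInteger();

    // Called by the producing side (for example the web service) for every payload.
    void submit(String payload) {
        worker.submit(() -> handle(payload));
    }

    // Placeholder for "parse the XML and write to the database".
    private void handle(String payload) {
        int now = inFlight.incrementAndGet();
        maxInFlight.accumulateAndGet(now, Math::max); // track peak concurrency
        processed.add(payload);
        inFlight.decrementAndGet();
    }

    // Stops accepting work and waits for the queued payloads to finish.
    boolean drain(long timeoutSeconds) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    List<String> processed() { return new ArrayList<>(processed); }

    int maxConcurrent() { return maxInFlight.get(); }
}
```

The trade-off is throughput: with a single worker nothing runs in parallel, which is why row-level locking is usually worth trying first.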
If I understand you correctly, the database handling is in the listener that's taking messages off the queue.
You have to worry about database isolation and table/row locking, because each listener runs in its own thread.
You'll either have to lock rows or set the isolation level on your database to SERIALIZABLE, so that concurrent INSERTs and UPDATEs on the table behave as if they ran one at a time.
I've been reading a bit about the Kafka concurrency model, but I still struggle to understand whether I can have local state in a Kafka Processor, or whether that will fail in bad ways?
My use case is: I have a topic of updates, I want to insert these updates into a database, but I want to batch them up first. I batch them inside a Java ArrayList inside the Processor, and send them and commit them in the punctuate call.
Will this fail in bad ways? Am I guaranteed that the ArrayList will not be accessed concurrently?
I realize that there will be multiple Processors and multiple ArrayLists, depending on the number of threads and partitions, but I don't really care about that.
I also realize I will lose the ArrayList if the application crashes, but I don't care if some events are inserted twice into the database.
This works fine in my simple tests, but is it correct? If not, why?
Whatever you use for local state in your Kafka consumer application is up to you. So, you can guarantee that only the current thread/consumer will be able to access the local state data in your ArrayList. If you have multiple threads, one per Kafka consumer, each thread can have its own private ArrayList or HashMap to store state in. You could also have something like a local RocksDB database for persistent local state.
A few things to look out for:
If you're batching updates together to send to the DB, are those updates in any way related, say, because they're part of a transaction? If not, you might run into problems. An easy way to ensure this is the case is to set a key for your messages with a transaction ID, or some other unique identifier for the transaction, and that way all the updates with that transaction ID will end up in one specific partition, so whoever consumes them is sure to always have the
How are you validating that you got ALL the transactions before your batch update? Again, this is important if you're dealing with database updates inside transactions. You could simply wait for a pre-determined amount of time to ensure you have all the updates (say, maybe 30 seconds is enough in your case). Or maybe you send an "EndOfTransaction" message that details how many messages you should have gotten, as well as maybe a CRC or hash of the messages themselves. That way, when you get it, you can either use it to validate you have all the messages already, or you can keep waiting for the ones that you haven't gotten yet.
Make sure you don't commit to Kafka the messages you're keeping in memory until after you've batched and sent them to the database and confirmed that the updates went through successfully. This way, if your application dies, the next time it comes back up it will receive the messages you haven't committed in Kafka yet again.
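The last point can be sketched without any Kafka dependency. The hypothetical `BatchingBuffer` below keeps records in a private list, hands the batch to a sink (standing in for the database batch insert), and advances its "committed" offset only when the sink returns normally. None of these names come from the Kafka API. In a real processor, `flush` would be called from punctuate and the commit would go through whatever mechanism your Kafka client provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers records and advances the "committed" offset
// only after the sink (standing in for a database batch insert) succeeds.
class BatchingBuffer {
    private final List<String> pending = new ArrayList<>();
    private final Consumer<List<String>> sink;
    private long lastBufferedOffset = -1;
    private long committedOffset = -1;

    BatchingBuffer(Consumer<List<String>> sink) {
        this.sink = sink;
    }

    // Called once per record, from the single thread that owns this buffer.
    void add(long offset, String record) {
        pending.add(record);
        lastBufferedOffset = offset;
    }

    // Called periodically (the role punctuate plays in the question).
    // Returns true if the batch was written and the offset advanced.
    boolean flush() {
        if (pending.isEmpty()) {
            return true;
        }
        try {
            sink.accept(new ArrayList<>(pending)); // write the batch first
        } catch (RuntimeException e) {
            return false; // keep the batch in memory; nothing is committed
        }
        committedOffset = lastBufferedOffset; // commit only after success
        pending.clear();
        return true;
    }

    long committedOffset() { return committedOffset; }

    int pendingCount() { return pending.size(); }
}
```

Because the offset moves only after a successful write, a crash between the write and the commit leads to re-delivery and duplicate inserts rather than lost updates, which matches the trade-off the question already accepts.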
I have a multi threaded Java program where each thread gets one username for some processing which takes about 10 minutes or so.
Right now it gets the usernames through a SQL query that returns one random username, and the problem is that the same username can be given to more than one thread at a time.
I don't want a username that is being processed by a thread, to be fetched again by another thread. What is a simple and easy way to achieve this goal?
Step-by-step solution:
Create a threads table where you store the threads' state. Among other columns, you need to store the owner user's id there as well.
When a thread is associated to a user, create a record, storing the owner, along with all other juicy stuff.
When a thread is no longer associated to a user, set its owner to null.
When a thread finishes its job, remove its record.
When you randomize your user for threads, filter out all the users who are already associated to at least a thread. This way you know any users at the end of randomization are threadless.
Make sure everything is in place. If, while working on the feature some thread records were created and should be removed or disposed from its owner, then do so.
There are a lot of ways to do this... I can think of three solutions to this problem:
1) A singleton class with an array that contains all the users already in use. Be sure that access to the array is synchronized and that you remove users from it once they are no longer in use.
2) A flag in the user table that contains a unique ID referencing the thread that is using the user. You then have to manage when the flag is removed from the table.
-> As an alternative, why not check whether a pool of connections shared by all the threads could be the solution to your problem?
You could do one batch query that returns all of the usernames you want from the database and store them in a List (or some type of collection).
Then ensure synchronised access to this list to prevent two threads taking the same username at the same time. Use a synchronised list or a synchronised method to access the list and remove the username from the list.
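A minimal sketch of this approach, assuming the usernames have already been loaded by one batch query. The class name `UsernamePool` and the method `claim` are hypothetical. A synchronized method removes and returns the next name, so no two threads can receive the same one.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.Optional;

// Hypothetical holder for usernames loaded once with a single batch query.
class UsernamePool {
    private final Deque<String> available;

    UsernamePool(Collection<String> usernames) {
        this.available = new ArrayDeque<>(usernames);
    }

    // Removes and returns the next username; synchronized so that the
    // remove-and-return step is atomic across worker threads.
    synchronized Optional<String> claim() {
        return Optional.ofNullable(available.pollFirst());
    }

    synchronized int remaining() {
        return available.size();
    }
}
```

Each worker thread calls `pool.claim()` once and stops when it gets an empty `Optional`. This only coordinates threads inside one JVM; several application instances would still need coordination in the database.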
One way to do it is to add another column to your users table. This column is a simple flag that shows whether a user has an assigned thread or not.
When you query the DB, though, you have to wrap the queries in a transaction.
You begin the transaction, select a user that doesn't have a thread, update the flag column, and then commit or roll back.
Since the queries are wrapped in a transaction, the DB handles the concurrency issues that come up in scenarios like this.
With this solution there is no need to implement synchronization mechanisms in your code, since the database will do it for you.
If you still have problems after doing this, I think you have to configure the isolation level of your DB server.
You appear to want a work queue system. Don't reinvent the wheel - use a well established existing work queue.
Robust, reliable concurrent work queuing is unfortunately tricky with relational databases. Most "solutions" end up:
Failing to cope with work items not being completed due to a worker restart or crash;
Serializing all work on a lock, so all but one worker are just waiting; and/or
Allowing a work item to be processed more than once
PostgreSQL 9.5's new FOR UPDATE SKIP LOCKED feature will make it easier to do what you want in the database. For now, use a canned reliable task/work/message queue engine.
If you must do this yourself, you'll want to have a table of active work items where you record the active process ID / thread ID of the worker that's processing a row. You will need a cleanup process that runs periodically, on thread crash, and on program startup that removes entries for failed jobs (where the worker process no longer exists) so they can be re-tried.
Note that unless the work the workers do is committed to the database in the same transaction that marks the work queue item as done, you will have timing issues where the work can be completed then the DB entry for it isn't marked as done, leading to work being repeated. To absolutely prevent that requires that you commit the work to the DB in the same transaction as the change that marks the work as done, or that you use two-phase commit and an external transaction manager.
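To make the bookkeeping concrete, here is an in-memory sketch of the "table of active work items" with its cleanup pass. It is not a database implementation and does not address the transactional caveat above. `ActiveWorkTable`, `claim`, `complete`, and `requeueOrphans` are hypothetical names, and how the set of live worker IDs is determined is left as an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// In-memory stand-in for the "table of active work items" described above.
class ActiveWorkTable {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Map<String, String> activeByItem = new HashMap<>(); // item -> worker ID

    synchronized void enqueue(String item) {
        pending.addLast(item);
    }

    // Records which worker took the item, like writing the worker ID into the row.
    synchronized Optional<String> claim(String workerId) {
        String item = pending.pollFirst();
        if (item != null) {
            activeByItem.put(item, workerId);
        }
        return Optional.ofNullable(item);
    }

    // Marks the item done; returns false if this worker does not hold it.
    synchronized boolean complete(String item, String workerId) {
        return activeByItem.remove(item, workerId);
    }

    // Cleanup pass: re-queue items whose worker is no longer alive,
    // so that they can be retried. Returns the number of re-queued items.
    synchronized int requeueOrphans(Set<String> liveWorkerIds) {
        int requeued = 0;
        Iterator<Map.Entry<String, String>> it = activeByItem.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (!liveWorkerIds.contains(entry.getValue())) {
                pending.addLast(entry.getKey());
                it.remove();
                requeued++;
            }
        }
        return requeued;
    }
}
```

Running `requeueOrphans` periodically and at startup corresponds to the cleanup process mentioned above. An item that was finished but not yet marked complete when its worker died will be processed again, which is exactly the repeat-work window the answer warns about.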
My requirement is as follows
Maintain a pool of records in a table (MySQL DB).
A job acts as a producer and fills up this pool if the number of entries goes below a certain threshold. The job runs every 15 mins.
There can be multiple consumers with each consumer picking up just one record each. Two consumers coming in at the same time should get two different records.
The producer should not block the consumer. So while the producer job is running consumers should be able to pick up any available rows.
The producer / consumer is part of the application code, which in turn is a JBoss application.
In order to ensure that each consumer picks a distinct record (in case of concurrency) we do the following
We use an integer column as an index.
A consumer will first update the record with the lowest index value with its own name.
It will then select and pick up that record and proceed with that.
This approach ensures that two consumers do not end up with the same record.
One problem we are seeing is that when the producer is filling up the pool, consumers get blocked. Since the producer can take some time to complete, all consumers in that period are blocked as the update by the consumer waits for the insert by the producer to complete.
Is there any way to resolve this scenario? Any other approach to design this is also welcome.
Is it a hard requirement that you use a relational database as a queue? This seems like a bad approach to the problem, especially since the problem has already been addressed by message queues. You could use MySQL to persist the state of your queue, but it won't make a good queue itself.
Take a look at ActiveMQ or JBoss Messaging (given that you are using JBoss)
I am using a JBoss 5.1.2 MDB to consume entity messages placed on a queue. The message producer can produce multiple messages for the same entity. These message entities are identified by their entity number (msg_entity_no). I must insert a record into an existing entity table only if the entity does not exist; otherwise I must update the existing record.
The existing entity table is not unique on this entity number, as it has another internal key, i.e. it contains duplicate msg_entity_no values.
The problem I am experiencing is that when multiple messages are produced, multiple instances of the MDB query the entity table for existence at the same time.
At that time the entity does not exist for either instance, so the process inserts a record for both messages, as opposed to inserting once for the non-existent entity and then updating the record for subsequent messages.
I want to get away from using the annotation @ActivationConfigProperty(propertyName = "maxSession", propertyValue = "1") and deploying to the deploy-hasingleton folder, which allows only one instance of the MDB, as this is not scalable.
The condition you are seeing is due to either DUPLICATE or SAME DATA contained within messages that are placed in the queue in quick succession. There are a couple of solutions for this.
1) DEQUEUE on one JBoss instance with only one MDB. This means you will have ONE MDB running on one JBoss SERVER in the cluster, so messages will essentially be processed in sequence.
2) Create a locking mechanism whereby you create a table and write the message contents to that table with a PRIMARY KEY. You then filter out, or create an ordering of, the data to be processed based on the contents. This will slow down execution, but you will have better auditing for your data. You will in essence have two ASYNC process jobs: one to populate your data from the QUEUE and another to PROCESS the data. You could do this one minute later.
3) Some QUEUE implementations such as ORACLE AQ have a dequeue condition that can be set. Messages in the queue are evaluated against the condition, and messages that satisfy the given condition are returned. TIBCO has locking strategies that protect the thread of execution when multiple threads run in an agent.
References
http://tech.chococloud.com/archives/2152
Not understanding the business process, I would suggest you try to prevent the "DUPLICATE" messages at the source.
I have an application that listens on a port for UDP datagrams. I use a UDP inbound channel adapter to listen on this port. My UDP channel adapter is configured to use a ThreadPoolTaskExecutor to dispatch incoming UDP datagrams. After the UDP channel adapter I use a direct channel. My channel has only one subscriber i.e. a service activator.
The service adds the incoming messages to a synchronized list stored in memory. Then, I have a single thread that retrieves the content of the list every 5 seconds and does a batch update to a MySQL database.
My problem:
A first burst of messages arrives. The threads of my ThreadPoolExecutor take the incoming messages from the UDP channel adapter and add them to the synchronized list. Let's say 10000 messages have been received and inserted.
The background thread retrieves the 10000 messages and does a batch update (JdbcTemplate.update(String[])).
At this point, the background thread waits for the response from the database. But because the database takes time to execute the 10000 INSERTs, 20000 messages have been received and are present in the list.
The background thread receives a response from the database. Then, it retrieves the 20000 messages and does a batch update (JdbcTemplate.update(String[])).
The database takes even more time to execute the INSERTs, and during this time 35000 messages have been received and stored in the list.
The heap size grows constantly and after a certain time causes an out-of-memory error.
I'm trying to find solution to improve the performance of my application.
Thanks
Storing 10,000 records every 5 seconds is quite a lot for any database to sustain.
You need to consider other options
use a different data store, e.g. a NoSQL data store or a flat file.
ensure you have good write performance on your disks, e.g. by using a write cache.
use a disk sub-system with multiple disks or an SSD drive.
Suggestions
a. Do you really need a single synchronized list? Can't you have a group of lists, and let's say divide the work between these lists , let's say by running hashCode on a key of the data?
b. Can you use a thread pool of threads that read information from the list (I would use a queue here, by the way) , this way, when one thread is "stuck" due to heavy batch insertion, other threads can still read "jobs" from the queue and perform them?
c. Is your database co-hosted on the same machine as the application? This can improve performance
d. Can you post your insert query? maybe someone can offer you a way to optimize it?
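Suggestion (b) mentions using a queue instead of a list. A minimal sketch of that idea with a bounded `LinkedBlockingQueue` follows. The capacity caps heap usage, and `drainTo` limits how many messages go into one batch insert. `BoundedMessageBuffer` and its methods are hypothetical names. Rejecting messages when the buffer is full is an assumption about acceptable behaviour, not something the question states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical buffer between the UDP service activator and the database writer.
class BoundedMessageBuffer {
    private final BlockingQueue<String> queue;
    private final int maxBatchSize;

    BoundedMessageBuffer(int capacity, int maxBatchSize) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.maxBatchSize = maxBatchSize;
    }

    // Non-blocking: returns false when the buffer is full, so memory stays capped.
    boolean accept(String message) {
        return queue.offer(message);
    }

    // Removes up to maxBatchSize messages for one batch insert.
    List<String> nextBatch() {
        List<String> batch = new ArrayList<>(maxBatchSize);
        queue.drainTo(batch, maxBatchSize);
        return batch;
    }

    int size() {
        return queue.size();
    }
}
```

Several writer threads can call `nextBatch()` concurrently, each with its own pooled connection, because `drainTo` hands every message to exactly one caller.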
Use a Database Connection pool so that you don't have to wait on the commit on any one thread. Just grab the next available connection and do parallel inserts.
I get 5,000 inserts per second sustained on a SQL Server table, but that required quite a few optimizations. I did not use all of the tips below, but some might be of use to you.
Check the MySQL Insert Speed documentation tips in
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
Parallelize the insert process
Aggregate messages if possible. Instead of storing all messages insert a row with information about received messages in a timeframe, of a certain type etc.
Change the table to have no Indexes or Foreign keys except for the primary key
Switch to writing to a text file (and import it during the night with a LOAD DATA bulk file insert if you really want it in the database)
Use a separate database instance to serve only your table
...