Is multi threaded delete from mongodb blocking? - java

I am working on a task where I would need to delete some very large records from mongodb. sometimes records are between 2M and 3M. I am trying to make that as fast as it could be.
My idea was to use some kind of thread pool and divide this number into some like 20 threads that each delete a part of the collection. Before I go further in this approach I would like to know if that is a good(promising) approach or not. My main concern is that if maybe this is not possible in mongo and I will have a blocking behaviour in the db and basically the threads will wait for each other to finish deleting.
also I would be happy if any other approaches/solutions are suggested.
the project language is Java/Spring.

Before making anything "as fast as it could be" you need to understand where the bottleneck is (typically CPU, memory or disk) so that your changes actually make a difference.
When it comes to deletes, there is some overhead in the delete operation (client has to send the command to the server, server has to parse it, etc.).
Assuming you have a large number of deletes, using 2 application threads for deleting may be a good idea to reduce this overhead when measuring wallclock time.
The size of documents being deleted doesn't matter.
If you are assuming that the server will be I/O bound due to document size, then sending more requests to it concurrently wouldn't help at all (in fact that would be counterproductive).

Related

Reading parallel record in java from database

I have one database and in this we have millions of records. We are reading the record one by one using java and inserting those record to another system on daily basis after end of day. We have been told to make it faster.
I told them we will create multiple thread using thread pool and these thread will read data parallelly and inject into another system but I dont know how can we stop our thread to read same data again. how can make it faster and achieve data consistency as well. I mean how can we make this process faster using multithreading in java or is there any other way ,other than multithreading to achieve it?
One possible solution for your task would be taking the ids of records in your database, splitting them into chunks (e.g. with size 1000 each) and calling JpaRepository.findAllById(Iterable<ID>) within Runnables passed to ExecutorService.submit().
If you don't want to do it manually then you could have a look into Spring Batch. It is designed particularly for bulk transformation of large amounts of data.
I think you should identify the slowest part in this flow and try to optimize it step by step.
In the described flow you could:
Try to reduce the number of "roundtrips" between the java application (in coming from the driver driver) and the database: Stop reading records one by one and move to bulk reading. Namely, read, say, 2000 records at once from the db into memory and process the whole bulk. Consider even larger numbers (like 5000) but you should measure this really, it depends on the memory of the java application and other factors. Anyway, if there is an issue - discard the bulk.
The data itself might not be organized correctly: when you read the bulk of data you might need to order it by some criteria, so make sure it doesnt make a full table scan, define indices properly etc
If applicable, talk to your DBA, he/she might provide additional insights about data management itself: partitioning, storage related optimizations, etc.
If all this fails and reading from the db is still a bottleneck, consider the flow redesign (for instance - throw messages to kafka if you have one), these might be naturally partitioned so you could scale out the whole process, but this might be beyond the scope of this question.

Distribute database records evenly across multiple processes

I have a database table with 3 million records. A java thread reads 10,000 records from table and processes it. After processing it jumps to next 10,000 and so on. In order to speed up, i have 25 threads doing the same task (reading + processing), and then I have 4 physical servers running the same java program. So effectively i have 100 thread doing the same work (reading + processing).
I strategy i have used is to have a sql procedure which does the work of grabbing next 10,000 records and marking them as being processed by a particular thread. However, i have noticed that the threads seems to be waiting for a some time trying to invoke the procedure and getting a response back. What other strategy i can use to speed up this process of data selection.
My database server is mysql and programming language is java
The idiomatic way of handling such scenario is producer-consumer design pattern. And in idiomatic way of implementing it in Java land is by using jms.
Essentially you need one master server reading records and pushing them to JMS queue. Then you'll have arbitrary number of consumers reading from that queue and competing with each other. It is up to you how you want to implement this in detail: do you want to send a message with whole record or only ID? All 10000 records in one message or record per message?
Another approach is map-reduce, check out hadoop. But the learning curve is a bit steeper.
Sounds like a job for Hadoop to me.
I would suspect that you are majorly database IO bound with this scheme. If you are trying to increase performance of your system, I would suggest partitioning your data across multiple database servers if you can do so. MySQL has some partitioning modes that I have no experience with. If you do partition yourself, it can add a lot of complexity to a database schema and you'd have to add some sort of routing layer using a hash mechanism to divide up your records across the multiple partitions somehow. But I suspect you'd get a significant speed increase and your threads would not be waiting nearly as much.
If you cannot partition your data, then moving your database to a SSD memory drive would be a huge win I suspect -- anything to increase the IO rates on those partitions. Stay away from RAID5 because of the inherent performance issues. If you need a reliable file system then mirroring or RAID10 would have much better performance with RAID50 also being an option for a large partition.
Lastly, you might find that your application performs better with less threads if you are thrashing your database IO bus. This depends on a number of factors including concurrent queries, database layout, etc.. You might try dialing down the per-client thread count to see if that makes a different. The effect may be minimal however.

HBase Multithreaded Scan is really slow

I'm using HBase to store some time series data. Using the suggestion in the O'Reilly HBase book I am using a row key that is the timestamp of the data with a salted prefix. To query this data I am spawning multiple threads which implement a scan over a range of timestamps with each thread handling a particular prefix. The results are then placed into a concurrent hashmap.
Trouble occurs when the threads attmept to perform their scan. A query that normally takes approximately 5600 ms when done serially takes between 40000 and 80000 ms when 6 threads are spawned (corresponding to 6 salts/region servers).
I've tried to use HTablePools to get around what I thought was an issue with HTable being not thread-safe, but this did not result in any better performance.
in particular I am noticing a significant slow down when I hit this portion of my code:
for(Result res : rowScanner){
//add Result To HashMap
Through logging I noticed that everytime through the conditional of the loop I experienced delays of many seconds. These delays do not occur if I force the threads to execute serially.
I assume that there is some kind of issue with resource locking but I just can't see it.
Make sure that you are setting the BatchSize and Caching on your Scan objects (the object that you use to create the Scanner). These control how many rows are transferred over the network at once, and how many are kept in memory for fast retrieval on the RegionServer itself. By default they are both way too low to be efficient. BatchSize in particular will dramatically increase your performance.
EDIT: Based on the comments, it sounds like you might be swapping either on the server or on the client, or that the RegionServer may not have enough space in the BlockCache to satisfy your scanners. How much heap have you given to the RegionServer? Have you checked to see whether it is swapping? See How to find out which processes are swapping in linux?.
Also, you may want to reduce the number of parallel scans, and make each scanner read more rows. I have found that on my cluster, parallel scanning gives me almost no improvement over serial scanning, because I am network-bound. If you are maxing out your network, parallel scanning will actually make things worse.
Have you considered using MapReduce, with perhaps just a mapper to easily split your scan across the region servers? It's easier than worrying about threading and synchronization in the HBase client libs. The Result class is not threadsafe. TableMapReduceUtil makes it easy to set up jobs.

One reader thread, one writer thread, n worker threads

I am trying to develop a piece of code in Java, that will be able to process large amounts of data fetched by JDBC driver from SQL database and then persisted back to DB.
I thought of creating a manager containing one reader thread, one writer thread and customizable number of worker threads processing data. The reader thread would read data to DTOs and pass them to a Queue labled 'ready for processing'. Worker threads would process DTOs and put processed objects to another queue labeld 'ready for persistence'. The writer thread would persist data back to DB. Is such an approach optimal? Or perhaps I should allow more readers for fetching data? Are there any ready libraries in Java for doing this sort of thing I am not aware of?
Whether or not your proposed approach is optimal depends crucially on how expensive it is to process the data in relation to how expensive it is to get it from the DB and to write the results back into the DB. If the processing is relatively expensive, this may work well; if it isn't, you may be introducing a fair amount of complexity for little benefit (you still get pipeline parallelism which may or may not be significant to the overall throughput.)
The only way to be sure is to benchmark the three stages separately, and then deside on the optimal design.
Provided the multithreaded approach is the way to go, your design with two queues sounds reasonable. One additional thing you may want to consider is having a limit on the size of each queue.
I hear echoes from my past and I'd like to offer a different approach just in case you are about to repeat my mistake. It may or may not be applicable to your situation.
You wrote that you need to fetch a large amount of data out of the database, and then persist back to the database.
Would it be possible to temporarily insert any external data you need to work with into the database, and perform all the processing inside the database? This would offer the following advantages:
It eliminates the need to extract large amounts of data
It eliminates the need to persist large amounts of data
It enables set-based processing (which outperforms procedural)
If your database supports it, you can make use of parallel execution
It gives you a framework (Tables and SQL) to make reports on any errors you encounter during the process.
To give an example. A long time ago I implemented a (java) program whose purpose was to load purchases, payments and related customer data from files into a central database. At that time (and I regret it deeply), I designed the load to process the transactions one-by-one , and for each piece of data, perform several database lookups (sql) and finally a number of inserts into appropriate tables. Naturally this did not scale once the volume increased.
Then I made another misstake. I deemed that it was the database which was the problem (because I had heard that the SELECT is slow), so I decided to pull out all data from the database and do ALL processing in Java. And then finally persist back all data to the database. I implemented all kinds of layers with callback mechanisms to easily extend the load process, but I just couldn't get it to perform well.
Looking in the rear mirror, what I should have done was to insert the (laughably small amount of) 100,000 rows temporarily in a table, and process them from there. What took nearly half a day to process would have taken a few minutes at most if I played to the strength of all technologies I had at my disposal.
An alternative to using an explicit queue is to have an ExecutorService and add tasks to it. This way you let Java manager the pool of threads.
You're describing writing something similar to the functionality that Spring Batch provides. I'd check that out if I were you. I've had great luck doing operations similar to what you're describing using it. Parallel and multithreaded processing, and several different database readers/writers and whole bunch of other stuff are provided.
Use Spring Batch! That is exactly what you need

Is a single Java thread better than multiple threading in my scenario?

Our company is running a Java application (on a single CPU Windows server) to read data from a TCP/IP socket and check for specific criteria (using regular expressions) and if a match is found, then store the data in a MySQL database. The data is huge and is read at a rate of 800 records/second and about 70% of the records will be matching records, so there is a lot of database writes involved. The program is using a LinkedBlockingQueue to handle the data. The producer class just reads the record and puts it into the queue, and a consumer class removes from the queue and does the processing.
So the question is: will it help if I use multiple consumer threads instead of a single thread? Is threading really helpful in the above scenario (since I am using single CPU)? I am looking for suggestions on how to speed up (without changing hardware).
Any suggestions would be really appreciated. Thanks
Simple: Try it and see.
This is one of those questions where you argue several points on either side of the argument. But it sounds like you already have most of the infastructure set up. Just create another consumer thread and see if the helps.
But the first question you need to ask yourself:
What is better?
How do you measure better?
Answer those two questions then try it.
Can the single thread keep up with the incoming data? Can the database keep up with the outgoing data?
In other words, where is the bottleneck? If you need to go multithreaded then look into the Executor concept in the concurrent utilities (There are plenty to choose from in the Executors helper class), as this will handle all the tedious details with threading that you are not particularly interested in doing yourself.
My personal gut feeling is that the bottleneck is the database. Here indexing, and RAM helps a lot, but that is a different question.
It is very likely multi-threading will help, but it is easy to test. Make it a configurable parameter. Find out how many you can do per second with 1 thread, 2 threads, 4 threads, 8 threads, etc.
First of all:
It is wise to create your application using the java 5 concurrent api
If your application is created around the ExecutorService it is fairly easy to change the number of threads used. For example: you could create a threadpool where the number of threads is specified by configuration. So if ever you want to change the number of threads, you only have to change some properties.
About your question:
- About the reading of your socket: as far as i know, it is not usefull (if possible at all) to have two threads read data from one socket. Just use one thread that reads the socket, but make the actions in that thread as few as possible (for example read socket - put data in queue -read socket - etc).
- About the consuming of the queue: It is wise to construct this part as pointed out above, that way it is easy to change number of consuming threads.
- Note: you cannot really predict what is better, there might be another part that is the bottleneck, etcetera. Only monitor / profiling gives you a real view of your situation. But if your application is constructed as above, it is really easy to test with different number of threads.
So in short:
- Producer part: one thread that only reads from socket and puts in queue
- Consumer part: created around the ExecutorService so it is easy to adapt the number of consuming threads
Then use profiling do define the bottlenecks, and use A-B testing to define the optimal numbers of consuming threads for your system
As an update on my earlier question:
We did run some comparison tests between single consumer thread and multiple threads (adding 5, 10, 15 and so on) and monitoring the que size of yet-to-be processed records. The difference was minimal and what more.. the que size was getting slightly bigger after the number of threads was crossing 25 (as compared to running 5 threads). Leads me to the conclusion that the overhead of maintaining the threads was more than the processing benefits got. Maybe this could be particular to our scenario but just mentioning my observations.
And of course (as pointed out by others) the bottleneck is the database. That was handled by using the multiple-insert statement in mySQL instead of single inserts. If we did not have that to start with, we could not have handled this load.
End result: I am still not convinced on how multi-threading will give benefit on processing time. Maybe it has other benefits... but I am looking only from a processing-time factor. If any of you have experience to the contrary, do let us hear about it.
And again thanks for all your input.
In your scenario where a) the processing is minimal b) there is only one CPU c) data goes straight into the database, it is not very likely that adding more threads will help. In other words, the front and the backend threads are I/O bound, with minimal processing int the middle. That's why you don't see much improvement.
What you can do is to try to have three stages: 1st is a single thread pulling data from the socket. 2nd is the thread pool that does processing. 3rd is a single threads that serves the DB output. This may produce better CPU utilization if the input rate varies, at the expense of temporarily growth of the output queue. If not, the throughput will be limited by how fast you can write to the database, no matter how many threads you have, and then you can get away with just a single read-process-write thread.

Categories

Resources