Performance in Spring MVC app - Java

I've got an app that adds a file's content row by row into a database. If the file is not too big (smaller than 100 kB) it works well, but I can't say the same about big files. I found out that an INSERT query takes about 1 ms, so 50k INSERTs take 50 seconds. I find that very slow. This is my plan:
if the file is big enough, do the INSERTs in another thread
if not, do them synchronously
So every user will start a new thread if the file is big. I mean that I can't reuse one instance of this thread; every user will start a new one. Is this a good idea or not? How would you do it?

Two points:
Why don't you use batch updates? I mean sending several inserts to the database at once. Network round trips cost a lot of time, so batching them can increase performance significantly.
Performing the update asynchronously is a good idea. But it doesn't actually mean you need to create a new thread per user. It can be a fixed pool of threads (say, 5) doing the job for all users. A sketch of both ideas follows below.
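A minimal sketch of both ideas, assuming Spring's JdbcTemplate; the file_rows table, its columns, the batch size and the pool size are placeholders, not part of the question:

```java
import org.springframework.jdbc.core.JdbcTemplate;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FileImportService {

    private final JdbcTemplate jdbcTemplate;
    // One shared pool for all users instead of one new thread per upload.
    private final ExecutorService importPool = Executors.newFixedThreadPool(5);

    public FileImportService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    /** Sends the rows to the database in batches instead of one round trip per row. */
    public void insertRows(List<String[]> rows) {
        jdbcTemplate.batchUpdate(
                "INSERT INTO file_rows (col_a, col_b) VALUES (?, ?)",
                rows,
                1000, // rows sent per round trip
                (ps, row) -> {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                });
    }

    /** For big files: run the batched insert on the shared pool and return to the caller immediately. */
    public void insertRowsAsync(List<String[]> rows) {
        importPool.submit(() -> insertRows(rows));
    }
}
```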

Related

Reading records in parallel from a database in Java

I have one database with millions of records. We read the records one by one using Java and insert them into another system on a daily basis, after end of day. We have been told to make it faster.
I told them we would create multiple threads using a thread pool, and these threads would read the data in parallel and push it into the other system, but I don't know how we can stop our threads from reading the same data again. How can we make it faster and achieve data consistency as well? In other words, how can we make this process faster using multithreading in Java, or is there any way other than multithreading to achieve it?
One possible solution for your task would be to take the ids of the records in your database, split them into chunks (e.g. 1000 ids each) and call JpaRepository.findAllById(Iterable<ID>) within Runnables passed to ExecutorService.submit().
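A rough sketch of that approach; the chunk size, the thread count and the one-hour wait are illustrative assumptions:

```java
import org.springframework.data.jpa.repository.JpaRepository;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ChunkedExporter {

    /**
     * Loads the given ids in chunks on a fixed thread pool and hands every loaded chunk
     * to the sink (e.g. the client of the other system). Each task owns a disjoint set
     * of ids, so no record is read twice.
     */
    public static <T> void export(List<Long> ids,
                                  JpaRepository<T, Long> repository,
                                  Consumer<List<T>> sink,
                                  int chunkSize,
                                  int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int from = 0; from < ids.size(); from += chunkSize) {
            List<Long> chunk = ids.subList(from, Math.min(from + chunkSize, ids.size()));
            pool.submit(() -> sink.accept(repository.findAllById(chunk)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // illustrative upper bound
    }
}
```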
If you don't want to do it manually then you could have a look into Spring Batch. It is designed particularly for bulk transformation of large amounts of data.
I think you should identify the slowest part in this flow and try to optimize it step by step.
In the described flow you could:
Try to reduce the number of round trips between the Java application (via the JDBC driver) and the database: stop reading records one by one and move to bulk reading. Namely, read, say, 2000 records at once from the DB into memory and process the whole bulk (a sketch follows after this list). Consider even larger numbers (like 5000), but you should really measure this; it depends on the memory available to the Java application and other factors. Either way, if there is an issue, discard the bulk.
The data itself might not be organized correctly: when you read a bulk of data you might need to order it by some criterion, so make sure the query doesn't do a full table scan, define indices properly, etc.
If applicable, talk to your DBA; he/she might provide additional insights about the data management itself: partitioning, storage-related optimizations, etc.
If all this fails and reading from the DB is still the bottleneck, consider redesigning the flow (for instance, publish messages to Kafka if you have it); these can be naturally partitioned, so you could scale out the whole process, but this might be beyond the scope of this question.
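To illustrate the bulk-reading point above, here is a plain-JDBC sketch using keyset pagination; the records(id, payload) schema is made up and the LIMIT syntax differs between databases:

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BulkReader {

    private static final int BULK_SIZE = 2000; // measure: 2000-5000 depending on available memory

    /** Reads the table in bulks ordered by an indexed id column ("keyset pagination"). */
    public void readAll(DataSource dataSource) throws SQLException {
        String sql = "SELECT id, payload FROM records WHERE id > ? ORDER BY id LIMIT ?";
        long lastSeenId = 0;
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            while (true) {
                ps.setLong(1, lastSeenId);
                ps.setInt(2, BULK_SIZE);
                int rowsInBulk = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastSeenId = rs.getLong("id");
                        handle(rs.getString("payload")); // process the row while the bulk is in memory
                        rowsInBulk++;
                    }
                }
                if (rowsInBulk < BULK_SIZE) {
                    return; // last, partially filled bulk reached
                }
            }
        }
    }

    private void handle(String payload) {
        // placeholder for the actual per-record work
    }
}
```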

Is a multi-threaded delete from MongoDB blocking?

I am working on a task where I need to delete a very large number of records from MongoDB; sometimes it is between 2 and 3 million records. I am trying to make that as fast as it can be.
My idea was to use some kind of thread pool and divide this number across something like 20 threads that each delete a part of the collection. Before I go further with this approach I would like to know whether it is a good (promising) approach or not. My main concern is that maybe this is not possible in Mongo and I will get blocking behaviour in the DB, with the threads basically waiting for each other to finish deleting.
I would also be happy to hear about any other approaches/solutions.
The project language is Java/Spring.
Before making anything "as fast as it could be" you need to understand where the bottleneck is (typically CPU, memory or disk) so that your changes actually make a difference.
When it comes to deletes, there is some overhead in the delete operation (client has to send the command to the server, server has to parse it, etc.).
Assuming you have a large number of deletes, using 2 application threads for the deletes may be a good idea to reduce this overhead, as measured in wall-clock time.
The size of documents being deleted doesn't matter.
If you are assuming that the server will be I/O bound due to document size, then sending more requests to it concurrently wouldn't help at all (in fact that would be counterproductive).
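For what it's worth, here is a sketch of the few-threads idea with the synchronous MongoDB Java driver; the createdAt field, the range boundaries and the pool size of 2 are assumptions for illustration:

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionedDelete {

    /**
     * Deletes the matching documents using a small number of threads, each issuing
     * deleteMany for a disjoint sub-range, so the server receives a few large commands
     * instead of millions of single deletes.
     */
    public void deleteInParallel(MongoCollection<Document> collection,
                                 List<Long> rangeBoundaries) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2); // a few threads; more rarely helps
        for (int i = 0; i < rangeBoundaries.size() - 1; i++) {
            Bson filter = Filters.and(
                    Filters.gte("createdAt", rangeBoundaries.get(i)),
                    Filters.lt("createdAt", rangeBoundaries.get(i + 1)));
            pool.submit(() -> collection.deleteMany(filter));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```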

How to process a large log file in Java

I have 4 files, each one 200 MB. I have created 4 threads running in parallel, and each thread processes one file and adds the parsed records to an ArrayBlockingQueue.
Another thread takes records from the ArrayBlockingQueue and adds them to a batch. The batch size is 5000; the batch is executed and the records are inserted into the database. But completing all 4 files still takes around 6 minutes.
How can I increase performance in this case?
1) Make sure you have enough memory for the queue + processor buffers + DB buffers.
2) A batch size of 5k is a bit more than needed; in general you reach full speed at around 100, though it doesn't make much difference here.
3) You can push data into Oracle from multiple threads. If you fetch the sequence values for the ID fields ahead of time, you will be able to insert into one table in parallel, provided you don't have many indexes. Otherwise consider disabling/rebuilding the indexes, or insert into a temporary table and then move everything into the main one. A sketch of multi-threaded batched inserts follows after this list.
4) Take a look at the Oracle settings with a fellow DB admin. Things like extent size and extent increase can change performance.
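As an illustration of point 3, here is a sketch of several threads batch-inserting into the same table over plain JDBC; the log_records table, the log_seq sequence, the batch size and the thread count are made-up names and values, not your schema:

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelBatchWriter {

    private static final int BATCH_SIZE = 100; // ~100 is usually enough to amortize round trips

    /** Each chunk of parsed lines is written as a JDBC batch on its own pool thread. */
    public void writeChunks(DataSource dataSource, List<List<String>> chunks) {
        ExecutorService writers = Executors.newFixedThreadPool(4); // several writers into the same table
        for (List<String> chunk : chunks) {
            writers.submit(() -> {
                // "log_records" and "log_seq" are illustrative; in Oracle the ids could also be
                // fetched from the sequence ahead of time to avoid per-row sequence calls.
                try (Connection conn = dataSource.getConnection();
                     PreparedStatement ps = conn.prepareStatement(
                             "INSERT INTO log_records (id, line) VALUES (log_seq.NEXTVAL, ?)")) {
                    int pending = 0;
                    for (String line : chunk) {
                        ps.setString(1, line);
                        ps.addBatch();
                        if (++pending == BATCH_SIZE) {
                            ps.executeBatch();
                            pending = 0;
                        }
                    }
                    if (pending > 0) {
                        ps.executeBatch();
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
        writers.shutdown();
    }
}
```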

REST API Design - Fetching multiple (1000) records

I have a web application and have to fetch 1000 records using a REST API. Each record is around 500 bytes.
What is the best way to do it from the following and why? Is there another better way to do it?
1> Fetch one record at a time. Trigger 1000 calls in parallel.
2> Fetch in groups of 20. Trigger 50 calls in parallel.
3> Fetch in groups of 100. Trigger 10 calls in parallel.
4> Fetch all 1000 records together.
As @Dima said in the comments, it really depends on what you are trying to do.
How are the records being consumed?
Is it back-end, program-to-program communication? If so, then it depends on how hard the processing is once the client receives the data. Is it going to take a long time to process each record? 1 ms per record, or 100 ms per record? This choice depends entirely on the likely processing time per record.
Is there a front end consuming this for human users? If so, batched requests would be good, for reasons like paginating the results. In such cases I would personally go with option 2 or 3.
In general, though, given the sheer volume of records, I would recommend batching the requests (by triggering fewer calls). Heuristically speaking, you are likely to get better overall network throughput that way; see the sketch below.
If you add more specifics, I'll happily update my answer, but until then, general will have to do!
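If it helps, here is a rough sketch of the option 2/3 style of fetching with Java 11's HttpClient; the URL shape, the offset/limit query parameters and the group size are purely illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class BatchedFetcher {

    private final HttpClient client = HttpClient.newHttpClient();

    /** Fetches `total` records in groups of `groupSize`, with all group requests in flight at once. */
    public List<String> fetchInGroups(String baseUrl, int total, int groupSize) {
        List<CompletableFuture<String>> inFlight = new ArrayList<>();
        for (int offset = 0; offset < total; offset += groupSize) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(baseUrl + "?offset=" + offset + "&limit=" + groupSize))
                    .GET()
                    .build();
            inFlight.add(client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                    .thenApply(HttpResponse::body));
        }
        List<String> bodies = new ArrayList<>();
        for (CompletableFuture<String> future : inFlight) {
            bodies.add(future.join()); // wait for each group; one response body per group
        }
        return bodies;
    }
}
```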
Best for what case? What are you trying to optimize?
I did some tests a while back on a similar situation, with slightly larger payloads (images), where my goal was to utilize network efficiently on a high-latency setup (across continents).
My results were that after a minimal amount of parallelism (like 3-4 threads), the network was almost perfectly saturated. We compared it to specific (proprietary) UDP-based transfer protocols, and there was no measurable difference.
Anyway, it may not be what you are looking for, but sometimes having a "dumb" HTTP endpoint is good enough.

One reader thread, one writer thread, n worker threads

I am trying to develop a piece of code in Java that will be able to process large amounts of data fetched via a JDBC driver from an SQL database and then persist it back to the DB.
I thought of creating a manager containing one reader thread, one writer thread and a configurable number of worker threads processing the data. The reader thread would read data into DTOs and pass them to a queue labelled 'ready for processing'. Worker threads would process the DTOs and put the processed objects onto another queue labelled 'ready for persistence'. The writer thread would persist the data back to the DB. Is such an approach optimal? Or perhaps I should allow more readers for fetching the data? Are there any ready-made libraries in Java for doing this sort of thing that I am not aware of?
Whether or not your proposed approach is optimal depends crucially on how expensive it is to process the data in relation to how expensive it is to get it from the DB and to write the results back into the DB. If the processing is relatively expensive, this may work well; if it isn't, you may be introducing a fair amount of complexity for little benefit (you still get pipeline parallelism which may or may not be significant to the overall throughput.)
The only way to be sure is to benchmark the three stages separately and then decide on the optimal design.
Provided the multithreaded approach is the way to go, your design with two queues sounds reasonable. One additional thing you may want to consider is having a limit on the size of each queue.
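A skeleton of that design with bounded queues (ArrayBlockingQueue), so a fast reader cannot run arbitrarily far ahead of slow workers; the String DTO type, the capacities and the thread names are placeholders:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Pipeline {

    // Bounded queues: put() blocks when a stage gets too far ahead, capping memory use.
    private final BlockingQueue<String> readyForProcessing = new ArrayBlockingQueue<>(10_000);
    private final BlockingQueue<String> readyForPersistence = new ArrayBlockingQueue<>(10_000);

    public void start(int workerCount) {
        new Thread(this::readLoop, "reader").start();
        for (int i = 0; i < workerCount; i++) {
            new Thread(this::workLoop, "worker-" + i).start();
        }
        new Thread(this::writeLoop, "writer").start();
    }

    private void readLoop() {
        // fetch rows via JDBC and readyForProcessing.put(dto) for each one
    }

    private void workLoop() {
        try {
            while (true) {
                String dto = readyForProcessing.take();   // blocks until something is available
                readyForPersistence.put(transform(dto));  // blocks if the writer falls behind
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void writeLoop() {
        // take() from readyForPersistence and batch the writes back to the DB
    }

    private String transform(String dto) {
        return dto; // placeholder for the real processing
    }
}
```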
I hear echoes from my past and I'd like to offer a different approach just in case you are about to repeat my mistake. It may or may not be applicable to your situation.
You wrote that you need to fetch a large amount of data out of the database, and then persist back to the database.
Would it be possible to temporarily insert any external data you need to work with into the database, and perform all the processing inside the database? This would offer the following advantages:
It eliminates the need to extract large amounts of data
It eliminates the need to persist large amounts of data
It enables set-based processing (which outperforms procedural)
If your database supports it, you can make use of parallel execution
It gives you a framework (Tables and SQL) to make reports on any errors you encounter during the process.
To give an example: a long time ago I implemented a (Java) program whose purpose was to load purchases, payments and related customer data from files into a central database. At the time (and I regret it deeply), I designed the load to process the transactions one by one, and for each piece of data to perform several database lookups (SQL) and finally a number of inserts into the appropriate tables. Naturally this did not scale once the volume increased.
Then I made another mistake. I decided that the database was the problem (because I had heard that SELECT is slow), so I pulled all the data out of the database and did ALL the processing in Java, and then finally persisted everything back to the database. I implemented all kinds of layers with callback mechanisms to make the load process easy to extend, but I just couldn't get it to perform well.
Looking in the rear-view mirror, what I should have done was to insert the (laughably small amount of) 100,000 rows temporarily into a table and process them from there. What took nearly half a day to process would have taken a few minutes at most if I had played to the strengths of all the technologies at my disposal.
An alternative to using an explicit queue is to have an ExecutorService and add tasks to it. This way you let Java manage the pool of threads.
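For example, a minimal sketch where each read-process-write chain is submitted as asynchronous tasks and the executors' own work queues replace the explicit ones; the stage methods and pool sizes are placeholders:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class ExecutorPipeline {

    private final ExecutorService readers = Executors.newFixedThreadPool(1);
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final ExecutorService writers = Executors.newFixedThreadPool(1);

    /** One asynchronous read-process-write chain per chunk; the pools' internal queues do the buffering. */
    public void run(List<Long> chunkIds) {
        List<CompletableFuture<Void>> pending = chunkIds.stream()
                .map(id -> CompletableFuture
                        .supplyAsync(() -> read(id), readers)
                        .thenApplyAsync(this::process, workers)
                        .thenAcceptAsync(this::write, writers))
                .collect(Collectors.toList());
        pending.forEach(CompletableFuture::join); // wait until everything is persisted
    }

    private String read(Long chunkId)   { return "rows for chunk " + chunkId; } // placeholder JDBC read
    private String process(String rows) { return rows; }                        // placeholder transformation
    private void write(String rows)     { /* placeholder JDBC write */ }
}
```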
You're describing writing something similar to the functionality that Spring Batch provides. I'd check that out if I were you. I've had great luck doing operations similar to what you're describing using it. Parallel and multithreaded processing, and several different database readers/writers and whole bunch of other stuff are provided.
Use Spring Batch! That is exactly what you need.
