How to make a Java Application faster?

How to make a Java Application faster? - java

I have a billing daemon that must process hundred thousands of data in a very fast manner. I implemented ExecutorSerivce for parallel processing. It did increase the speed but not very much. It takes approx 2.5-3 hours to process 1,00,000 records. How can I make it more faster like processing those data within half an hour?
I have written the following for execution setting:
-Xms2048M -Xmx2048M -XX:MaxPermSize=256m
I tried to implement a Producer Consumer model with 1 producer and 4 consumers. Each list can contain 10,000 records.
ArrayBlockingQueue<BillableList> list =new ArrayBlockingQueue<BillableList>(10);
ExecutorService threadPool = Executors.newFixedThreadPool(5);
threadPool.execute(new Consumer("pool1", list));
threadPool.execute(new Consumer("pool2", list));
threadPool.execute(new Consumer("pool3", list));
threadPool.execute(new Consumer("pool4", list));
Future producerStatus = threadPool.submit(new Producer("Producer", list));
producerStatus.get();
threadPool.shutdown();
I also get a lot of "database lock wait timeout exceeded" exceptions while updating records to the database. Is it due to different consumers trying for the same user at the same time? How can I make different consumers take different data from ArrayBlockingQueue's list?

The only possible answer to this is "Use a profiler and find out why it's slow". You can't do anything about a problem when you don't know where the problem is. What are you going to, pick a random function and micro-optimize it? Profiler data or nothing will ever, ever happen.

How can I make it more faster like processing those data within half an hour?
If adding threads did not help then chances are you are being limited not my CPU but by some other factor. Most likely disk or network IO. As mentioned, profiling your code should show you the culprit.
I also get a lot of "database lock wait timeout exceeded" exceptions while updating records to the database.
And there's your big clue. Regardless of how many threads are working on a the job, if they are all waiting for the database then adding threads is not making it faster.
Here are some ideas:
Increase the physical speed of your database box. SSDs can provide wondrous improvements for IO intensive operations. Increasing the memory can also give a big win because of disk cache.
Consider sharding your data and writing to multiple database instances. This may not be possible given your schema.
Consider turning off auto-commit and manually committing after every ~100 or so operations.
Watch out for indexes. If you are doing some sort of bulk load, often if you turn off indexes your inserts will run faster. Adding the indexes at the end takes a while but still is a win.
Also, if you are doing queries, make sure you have good indexes where needed. Check your database logs to see which queries are taking too long to see if you are missing some indexes in key places.

Related

Is multi threaded delete from mongodb blocking?

I am working on a task where I would need to delete some very large records from mongodb. sometimes records are between 2M and 3M. I am trying to make that as fast as it could be.
My idea was to use some kind of thread pool and divide this number into some like 20 threads that each delete a part of the collection. Before I go further in this approach I would like to know if that is a good(promising) approach or not. My main concern is that if maybe this is not possible in mongo and I will have a blocking behaviour in the db and basically the threads will wait for each other to finish deleting.
also I would be happy if any other approaches/solutions are suggested.
the project language is Java/Spring.

Before making anything "as fast as it could be" you need to understand where the bottleneck is (typically CPU, memory or disk) so that your changes actually make a difference.
When it comes to deletes, there is some overhead in the delete operation (client has to send the command to the server, server has to parse it, etc.).
Assuming you have a large number of deletes, using 2 application threads for deleting may be a good idea to reduce this overhead when measuring wallclock time.
The size of documents being deleted doesn't matter.
If you are assuming that the server will be I/O bound due to document size, then sending more requests to it concurrently wouldn't help at all (in fact that would be counterproductive).

Executor Service and Huge IO

I have a service which calls a database and performs a callback on each result.
ExecutorService service = Executors.newFixedThreadPool(10);
service.exectute(runnable(segmentID, callback)); // database is segmented
Runnable is:
call database - collect all the rows for the segment keep in memory
perform callback(segment);
Now the issue is I get a huge rows returned by database and my understanding is executor service will schedule threads whenever they are idle in I/O. So I go into Out of Memory.
Is there any way to restrict only 10 threads are running at a time and no executor service scheduling happens?
For some reason I have to keep all the rows of a segment in memory.
How can I prevent going OOM by doing this. Is Executor service newFixedThreadPool solution for this?
Please let me know if I missed anything.
Thanks

You must use a fixed thread pool. There's a rule that you should only spawn N threads where N should be in the same order of magnitude than the number of cores in the CPU. There's a debate on the size of N, and you can read more about it here. For a normal CPU we could be talking 4,8, 16 threads.
But even if you were running your program in a cluster, which I think you are not, you can't just fetch 20k rows from a DB and pretend to spawn 20k threads. If you do, the performance of your app is going to degrade big time, because most of the CPU cycles would be consumed in context switching.
Now even with fixed thread pool, you might run into OOM exceptions anyway if the data fetched is stored in memory at the same time. I think the only solution to this is to fetch smaller chunks of data, or write the data to a file as it gets downloaded.

Define Parallel Processing Thread Pool Count and Sleep time

I need to update 550 000 records in a table with the JBOSS Server is Starting up. I need to make this update as a backgroundt process with multiple threads and parallel processing. Application is Spring, so I can use initializing bean for this.
To perform the parallal processing I am planning to use Java executor framework.
ThreadPoolExecutor executor=(ThreadPoolExecutor)Executors.newFixedThreadPool(50); G
How to decide the thread pool count?
I think this is depends on hardware My hardware. it is 16 GB Ram and Co-i 3 processor.
Is it a good practice to Thread.sleep(20);while processing this big update as background.

I don't know much about Spring processing specifically, but your questions seem general enough that I can still provide a possibly inadequate answer.
Generally there's a lot of factors that go into how many threads you want. You definitely don't want multiple threads on a core, as that'll slow things down as threads start contending for CPU time instead of working, so probably your core count would be your ceiling, or maybe core count - 1 to allow one core for all other tasks to run on (so in your case maybe 3 or 4 cores, tops, if I remember core counts for i3 processors right). However, in this case I'd guess you're more likely to run into I/O and/or memory/cache bottlenecks, since when those are involved, those are more likely to slow down your program than insufficient parallelization. In addition, the tasks that your threads are doing would affect the number of threads you can use; if you have one thread to pull data in and one thread to dump data back out after processing, it might be possible for those threads to share a core.
I'm not sure why this would be a good idea... What use do you see for Thread.sleep() while processing? I'd guess it'd actually slow down your processing, because all you're doing is putting threads to sleep when they could be working.
In any case, I'd be wary of parallelizing what is likely to be an I/O bound task. You'll definitely need to profile to see where your bottlenecks are, even before you start parallelizing, to make sure that multiple cores will actually help you.
If it is the CPU that is adding extra time to complete your task, then you can start parallelizing. Even then, be careful about cache issues; try to make sure each thread works on a totally separate chunk of data (e.g. through ThreadLocal) so cache/memory issues don't limit any performance increases. One way this could work is by having a reader thread dump data into a Queue which the worker threads can then read into a ThreadLocal structure, process, etc.
I hope this helped. I'll keep updating as the mistakes I certainly made are pointed out.

How to determine optimal number of threads for high latency network requests?

I am writing a utility that must make thousands of network requests. Each request receives only a single, small packet in response (similar to ping), but may take upwards of several seconds to complete. Processing each response completes in one (simple) line of code.
The net effect of this is that the computer is not IO-bound, file-system-bound, or CPU-bound, it is only bound by the latency of the responses.
This is similar to, but not the same as There is a way to determine the ideal number of threads? and Java best way to determine the optimal number of threads [duplicate]... the primary difference is that I am only bound by latency.
I am using an ExecutorService object to run the threads and a Queue<Future<Integer>> to track threads that need to have results retrieved:
ExecutorService executorService = Executors.newFixedThreadPool(threadPoolSize);
Queue<Future<Integer>> futures = new LinkedList<Future<Integer>>();
for (int quad3 = 0 ; quad3 < 256 ; ++quad3) {
for (int quad4 = 0 ; quad4 < 256 ; ++quad4) {
byte[] quads = { quad1, quad2, (byte)quad3, (byte)quad4 };
futures.add(executorService.submit(new RetrieverCallable(quads)));
}
}
... I then dequeue all the elements in the queue and put the results in the required data structure:
int[] result = int[65536]
while(!futures.isEmpty()) {
try {
results[i] = futures.remove().get();
} catch (Exception e) {
addresses[i] = -1;
}
}
My first question is: Is this a reasonable way to track all the threads? If thread X takes a while to complete, many other threads might finish before X does. Will the thread pool exhaust itself waiting for open slots, or will the ExecutorService object manage the pool in such a way that threads that have completed but not yet been processed be moved out of available slots so that other threads my begin?
My second question is what guidelines can I use for finding the optimal number of threads to make these calls? I don't even know order-of-magnitude guidance here. I know it works pretty well with 256 threads, but seems to take roughly the same overall time with 1024 threads. CPU utilization is hovering around 5%, so that doesn't appear to be an issue. With that large a number of threads, what are all the metrics I should be looking at to compare different numbers? Obviously overall time to process the batch, average time per thread... what else? Is memory an issue here?

It will shock you, but you do not need any threads for I/O (quantitatively, this means 0 threads). It is good that you have studied that multithreading does not multiply your network bandwidth. Now, it is time to know that threads do computation. They are not doing the (high-latency) communication. The communication is performed by a network adapter, which is another process, running really in parallel with with CPU. It is stupid to allocate a thread (see which resources allocated are listed by this gentlemen who claims that you need 1 thread) just to sleep until network adapter finishes its job. You need no threads for I/O = you need 0 threads.
It makes sense to allocate the threads for computation to make in parallel with I/O request(s). The amount of threads will depend on the computation-to-communication ratio and limited by the number of cores in your CPU.
Sorry, I had to say that despite you have certainly implied the commitment to blocking I/O, so many people do not understand this basic thing. Take the advise, use asynchronous I/O and you'll see that the issue does not exist.

As mentioned in one of the linked answers you refer to, Brian Goetz has covered this well in his article.
He seems to imply that in your situation you would be advised to gather metrics before committing to a thread count.
Tuning the pool size
Tuning the size of a thread pool is largely a matter of avoiding two mistakes: having too few threads or too many threads. ...
The optimum size of a thread pool depends on the number of processors available and the nature of the tasks on the work queue. ...
For tasks that may wait for I/O to complete -- for example, a task that reads an HTTP request from a socket -- you will want to increase the pool size beyond the number of available processors, because not all threads will be working at all times. Using profiling, you can estimate the ratio of waiting time (WT) to service time (ST) for a typical request. If we call this ratio WT/ST, for an N-processor system, you'll want to have approximately N*(1+WT/ST) threads to keep the processors fully utilized.
My emphasis.

Have you considered using Actors?
Best practises.
Actors should be like nice co-workers: do their job efficiently
without bothering everyone else needlessly and avoid hogging
resources. Translated to programming this means to process events and
generate responses (or more requests) in an event-driven manner.
Actors should not block (i.e. passively wait while occupying a Thread)
on some external entity—which might be a lock, a network socket,
etc.—unless it is unavoidable; in the latter case see below.
Sorry, I can't elaborate, because haven't much used this.
UPDATE
Answer in Good use case for Akka might be helpful.
Scala: Why are Actors lightweight?

Pretty sure in the described circumstances, the optimal number of threads is 1. In fact, that is surprisingly often the answer to any quesion of the form 'how many threads should I use'?
Each additonal thread adds extra overhead in terms of stack (and associated GC roots), context switching and locking. This may or not be measurable: the effor to meaningfully measure it in all target envoronments is non-trivial. In return, there is little scope to provide any benifit, as processing is neither cpu nor io-bound.
So less is always better, if only for reasons of risk reduction. And you cant have less than 1.

I assume the desired optimization is the time to process all requests. You said the number of requests is "thousands". Evidently, the fastest way is to issue all requests at once, but this may overflow the network layer. You should determine how many simultaneous connections can network layer bear, and make this number a parameter for your program.
Then, spending a thread for each request require a lot of memory. You can avoid this using non-blocking sockets. In Java, there are 2 options: NIO1 with selectors, and NIO2 with asynchronous channels. NIO1 is complex, so better find a ready-made library and reuse it. NIO2 is simple but available only since JDK1.7.
Processing the responses should be done on a thread pool. I don't think the number of threads in the thread pool greatly affects the overall performance in your case. Just make tuning for thread pool size from 1 to the number of available processors.

In our high-performance systems, we use the actor model as described by #Andrey Chaschev.
The no. of optimal threads in your actor model differ with your CPU structure and how many processes (JVMs) do you run per box. Our finding is
If you have 1 process only, use total CPU cores - 2.
If you have multiple process, check your CPU structure. We found its good to have no. of threads = no. of cores in a single CPU - e.g. if you have a 4 CPU server each server having 4 cores, then using 4 threads per JVM gives you best performance. After that, always leave at least 1 core to your OS.

An partial answer, but I hope it helps. Yes, memory can be an issue: Java reserves 1 MB of thread stack by default (at least on Linux amd64). So with a few GB of RAM in your box, that limits your thread count to a few thousand.
You can tune this with a flag like -XX:ThreadStackSize=64. That would give you 64 kB, which is plenty in most situations.
You could also move away from threading entirely and use epoll to respond to incoming responses. This is far more scalable but I have no practical experience with doing this in Java.

HBase Multithreaded Scan is really slow

I'm using HBase to store some time series data. Using the suggestion in the O'Reilly HBase book I am using a row key that is the timestamp of the data with a salted prefix. To query this data I am spawning multiple threads which implement a scan over a range of timestamps with each thread handling a particular prefix. The results are then placed into a concurrent hashmap.
Trouble occurs when the threads attmept to perform their scan. A query that normally takes approximately 5600 ms when done serially takes between 40000 and 80000 ms when 6 threads are spawned (corresponding to 6 salts/region servers).
I've tried to use HTablePools to get around what I thought was an issue with HTable being not thread-safe, but this did not result in any better performance.
in particular I am noticing a significant slow down when I hit this portion of my code:
for(Result res : rowScanner){
//add Result To HashMap
Through logging I noticed that everytime through the conditional of the loop I experienced delays of many seconds. These delays do not occur if I force the threads to execute serially.
I assume that there is some kind of issue with resource locking but I just can't see it.

Make sure that you are setting the BatchSize and Caching on your Scan objects (the object that you use to create the Scanner). These control how many rows are transferred over the network at once, and how many are kept in memory for fast retrieval on the RegionServer itself. By default they are both way too low to be efficient. BatchSize in particular will dramatically increase your performance.
EDIT: Based on the comments, it sounds like you might be swapping either on the server or on the client, or that the RegionServer may not have enough space in the BlockCache to satisfy your scanners. How much heap have you given to the RegionServer? Have you checked to see whether it is swapping? See How to find out which processes are swapping in linux?.
Also, you may want to reduce the number of parallel scans, and make each scanner read more rows. I have found that on my cluster, parallel scanning gives me almost no improvement over serial scanning, because I am network-bound. If you are maxing out your network, parallel scanning will actually make things worse.

Have you considered using MapReduce, with perhaps just a mapper to easily split your scan across the region servers? It's easier than worrying about threading and synchronization in the HBase client libs. The Result class is not threadsafe. TableMapReduceUtil makes it easy to set up jobs.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.