Optimizing number of threads dynamically - java

I have two IO intensive processes that don't do much computing: one is getting and parsing a webpage and the other is storing some data obtained with the parsing in a database. This is going to repeat while the crawling of the web continues.
Is there a method for adding and subtracting the number of threads that are working on each task dynamically so the performance is optimal for the machine where the whole system is running? The method should not involve benchmarking because it's going to be distributed to a number of machines I cannot access beforehand.
Please guide me to some sources or information.

Instead of using threads directly, create a thread pool and submit Runnables to it that do the actual work. From your description, a cached thread pool (Executors.newCachedThreadPool()) might be suitable. See http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html for guidance on how to implement this.
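A minimal sketch of that suggestion, assuming each unit of work can be expressed as a task submitted to an ExecutorService. The class name CrawlPoolSketch and the fetchAndParse helper are hypothetical stand-ins for the real crawling code, not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CrawlPoolSketch {

    // Hypothetical stand-in for "fetch and parse a page"; real code would do network I/O here.
    static int fetchAndParse(String url) {
        return url.length();
    }

    // Submits one task per URL to a cached thread pool and collects the results in order.
    static List<Integer> crawl(List<String> urls) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<Integer> task = () -> fetchAndParse(url);
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                try {
                    results.add(future.get()); // blocks until that task has finished
                } catch (Exception e) {
                    throw new IllegalStateException("task failed", e);
                }
            }
            return results;
        } finally {
            pool.shutdown(); // lets idle worker threads exit
        }
    }

    public static void main(String[] args) {
        System.out.println(crawl(List.of("http://example.com/a", "http://example.com/bb")));
    }
}
```

A cached pool creates threads on demand and reuses idle ones, so it grows and shrinks with the load, but it places no upper bound on the thread count.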

Well, dynamically adjusting the thread count should be no problem (using ThreadPoolExecutor, for example).
But it looks to me that the optimal number of threads is limited by two factors:
The network bandwidth for your "downloading threads"
The maximum number of allowed database connections for your "database threads"
I'm not sure if the downloading part should be multithreaded at all, because each thread will just steal bandwidth from the others unless the pages are really small.
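As a rough sketch of what adjusting the thread count at runtime could look like with ThreadPoolExecutor: the class and method names below are hypothetical, and deciding when to resize (for example, based on observed throughput) is left out entirely.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ResizablePoolSketch {

    // Creates a pool with the same core and maximum size; both can be changed later.
    static ThreadPoolExecutor newPool(int threads) {
        return new ThreadPoolExecutor(threads, threads, 30L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
    }

    // Resizes a running pool. The order of the two setters matters because
    // the core size must never exceed the maximum size.
    static void resize(ThreadPoolExecutor pool, int threads) {
        if (threads > pool.getMaximumPoolSize()) {
            pool.setMaximumPoolSize(threads); // raise the ceiling first
            pool.setCorePoolSize(threads);
        } else {
            pool.setCorePoolSize(threads);    // lower the floor first
            pool.setMaximumPoolSize(threads);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor downloads = newPool(2);
        resize(downloads, 8);
        System.out.println("core size is now " + downloads.getCorePoolSize());
        downloads.shutdown();
    }
}
```

The "database threads" pool would be capped at the number of allowed database connections in the same way.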

Related

File synchronizer architecture

I have to build a file synchronizer: an application that synchronizes a large number of data files around the clock (24/7) from many external systems to my local system, mainly over FTP, SFTP and NFS.
There are more than twenty streams; the logic is slightly different for each of them, and it must be configurable.
One requirement is that if one of the streams goes down for some reason, it must be possible to bring it back up without restarting the entire system.
Another requirement is that the transfer rate is balanced. In other words, it must not happen that some streams are fully synchronized while another stream is 10 hours behind.
I am unsure about the architecture: if I build a single multithreaded system, I would have a very high thread count (more than 100, I would say), and meeting the two requirements above would make it complicated.
I was thinking of building several processes, or running several instances of the same process, even if it seems a little "ugly". That way some load balancing would be done by the operating system, and it would be simpler to kill or start a stream. Perhaps performance might even be better, as several processes could use much more RAM. Does anyone have any tips or advice? Thanks a lot, and sorry for my poor English. Gian
As @kayaman said, 100 threads is not a lot. If that means 100 threads per unit of work, and you will have many units of work, implying an increase in threads of several orders of magnitude, I would suggest having a look at fibers.
As long as you don't block the fibers, you can have 100000+ fibers running over a couple (typically number of CPU cores) of threads. Each fiber would then just wait for a callback from the process before continuing.
To access your endpoints and handle them in similar ways, have a look at Apache Camel - it will allow you to stream the FTP, SFTP, etc and handle each as just another endpoint (in theory you should be able to plug email in as well and stream packets that are emailed to the endpoint)
Regarding balancing the streams, this is business logic you need to implement. If one stream is receiving packets faster than another, you should be able to limit its rate by not requesting more packets under certain conditions. I would need more information on how you retrieve the packets and which libraries you are using in order to be of better assistance here.
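One possible shape for that balancing rule, as a sketch only: it assumes each stream can report how many seconds it lags behind its source, and it pauses any stream that is too far ahead of the slowest one. The class name, the lag measure and the tolerance parameter are all hypothetical.

```java
class StreamBalancerSketch {

    // Returns true if a stream with the given lag may request more data.
    // A stream is paused when the most delayed stream is more than
    // toleranceSeconds further behind than this one.
    static boolean mayRequestMore(long lagSeconds, long[] allLagsSeconds, long toleranceSeconds) {
        long worstLag = 0;
        for (long lag : allLagsSeconds) {
            worstLag = Math.max(worstLag, lag);
        }
        return worstLag - lagSeconds <= toleranceSeconds;
    }

    public static void main(String[] args) {
        long[] lags = {0, 120, 36_000};  // one stream is 10 hours behind
        long tolerance = 600;            // allow at most 10 minutes of spread
        for (long lag : lags) {
            System.out.println("lag " + lag + "s may request more: "
                    + mayRequestMore(lag, lags, tolerance));
        }
    }
}
```

With a rule like this, the up-to-date streams stop competing for bandwidth until the late stream has caught up to within the tolerance.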

Does downloading with multiple threads actually speed things up?

So, I was starting up Minecraft a few days ago and opened up its developer console to see what it was doing while it was updating itself. I noticed one of the lines said the following:
Downloading 32 files. (16 threads)
Now, the first thing that came to mind was: the processor can still only do one thing at a time, all threads do is split each of their tasks up and distribute the CPU power between them, so what would the purpose be of downloading multiple files on multiple threads if each thread is still only being run on a single processor?
Then, in the process of deciding whether or not I should ask this question on SO, I remembered that multiple cores can reside on one processor. For example, my processor is quad-core. So, you can actually accomplish 4 downloads truly simultaneously. Now that sounds like it makes sense. Except for the fact that there are 16 threads being used for Minecraft's download. So, basically my question is:
Does increasing the number of threads during a download help the speed at all? (Assuming a multi-core processor, and the thread count is less than the core count.)
And
If you increase the number of threads to past the number of cores, does speed still increase? (It sounds to me like the downloads would be max-speed after 4 threads, on a quad-core processor.)
Downloads are network-bound, not CPU-bound. So theoretically, using multiple threads will not make it faster.
On the one hand, if your program downloads using synchronous (blocking) I/O, then multiple threads simply let the waits overlap, so less time is lost to blocking. On the other hand, it is in general more sensible to just use a single thread with asynchronous I/O.
On the gripping hand, asynchronous I/O is trickier to code correctly than synchronous I/O (which is straightforward). So the developers may have just decided to favour ease of programming over pure performance. (Or they may favour compatibility with older Java platforms: truly asynchronous channel I/O only arrived with NIO.2 in Java 7, although selector-based non-blocking I/O has been available since Java 1.4.)
When one thread downloads one file, it will spend some time waiting. When one thread downloads N files, one after another, it will spend, on average, N times as much total wait time.
When N threads each download one file, each of those threads will spend some time waiting, but some of those waits will be overlapped (e.g., thread A and thread B are both waiting at the same time.) The end result is that it may take less wall-clock time to get all N of the files.
On the other hand, if the threads are waiting for files from the same server, each thread's individual wait time may be longer.
Whether or not there is an overall performance benefit depends on the client, on the server, and on the available network bandwidth. If the network can't carry bytes as fast as the server can pump them out, then multi-threading the client probably won't save any time. If the server is single-threaded, then multi-threading the client definitely won't help. But if the conditions are right (e.g., if you have a fast internet connection, and especially if the files are coming from a server farm instead of a single machine), then multi-threading can potentially speed things up.
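The overlapping of waits described above can be illustrated with a small simulation. This is a sketch under a strong assumption: each "download" is replaced by Thread.sleep, so it models only the waiting and ignores bandwidth and server limits. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class OverlappedWaitSketch {

    // Simulates one download that spends waitMillis blocked on the network.
    static void fakeDownload(long waitMillis) {
        try {
            Thread.sleep(waitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // One thread downloads all files one after another: the waits add up.
    static long sequentialMillis(int files, long waitMillis) {
        long start = System.nanoTime();
        for (int i = 0; i < files; i++) {
            fakeDownload(waitMillis);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // One thread per file: the waits overlap.
    static long parallelMillis(int files, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(files);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int i = 0; i < files; i++) {
                tasks.add(() -> {
                    fakeDownload(waitMillis);
                    return null;
                });
            }
            long start = System.nanoTime();
            pool.invokeAll(tasks); // returns once every task has completed
            return (System.nanoTime() - start) / 1_000_000;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("sequential: " + sequentialMillis(4, 200) + " ms");
        System.out.println("parallel:   " + parallelMillis(4, 200) + " ms");
    }
}
```

In a real download the parallel case only wins while the network and the server still have spare capacity, which is exactly the caveat made above.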
Normally it will not be faster, but there are always exceptions.
Assuming you open a new connection for each download thread, multiple threads can help if:
The network (either your own network or the target system) limits the download speed per connection, or
You are downloading from multiple servers, etc.
Or if the "download" is not a plain download, but downloads something and then does CPU-intensive processing on it.
In such cases you may see downloads become faster with multiple threads.

How many threads is it advisable to have running at the same time in Java?

I am new to multithreading in Java. After looking at "Java virtual machine - maximum number of threads", it would appear there isn't a limit to how many threads a Java/Android app can run. However, is there an advisable limit? What I mean by this is, is there a number of threads beyond which it becomes unwise to go, because you are unable to determine what thread does what at what time? I hope my question makes sense.
There are some advisable limits, however they don't really have anything to do with keeping track of them.
Most multithreading comes with locking. If you are using central data storage or global mutable state then the more threads you have, the more lock contention you will get. This is app-specific and depends on how much of said state you have and how often threads read and write it.
There are no limits in desktop JVMs by default, but there are OS limits. It should be in the tens of thousands for modern Windows machines, but don't rely on the ability to create many more than that.
Running multiple tasks in parallel is great, but the hardware can only cope with so much. If you are using small threads that get fired up sometimes and spend most of their time idle, that's no biggie (Java servers were written like this for years). However, if your threads are very intensive, making more of them than the number of cores you have is not likely to give you any benefit. (I believe the standard practice is twice the number of cores if you anticipate threads going idle sometimes.)
Threads have a cost to them. Whenever you switch threads you switch context, and while it isn't that expensive, doing it constantly will hurt performance. It's not a good idea to create a thread to sum up two integers and write back a result.
If threads need visibility of each other's state, then they are greatly slowed down, since a lot of their writes have to be written back to main memory. Threads are best used for standalone tasks that require little interaction with each other.
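The "twice the number of cores" rule of thumb above can be read as a special case of a commonly cited sizing heuristic, threads = cores * (1 + wait time / compute time). The sketch below uses that heuristic as an assumption, not as something the answer states; the class and parameter names are hypothetical.

```java
class PoolSizeSketch {

    // Heuristic: the more time a task spends waiting relative to computing,
    // the more threads each core can keep busy.
    static int suggestedThreads(int cores, double waitMillis, double computeMillis) {
        if (cores < 1 || waitMillis < 0 || computeMillis <= 0) {
            throw new IllegalArgumentException("cores >= 1, wait >= 0 and compute > 0 are required");
        }
        return (int) Math.max(1, Math.round(cores * (1.0 + waitMillis / computeMillis)));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("CPU-bound tasks:        " + suggestedThreads(cores, 0, 10));
        System.out.println("half waiting, half CPU: " + suggestedThreads(cores, 10, 10));
        System.out.println("mostly waiting on I/O:  " + suggestedThreads(cores, 90, 10));
    }
}
```

Purely CPU-bound tasks give one thread per core, and tasks that wait as long as they compute give twice the core count, which matches the advice above. The wait and compute times have to be measured or estimated for the workload.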
TL;DR
Depends on OS and Hardware: on servers creating thousands of threads is fine, on desktop machines you should limit yourself to 50-200 and choose carefully what you do with them.
Note: Android's default and suggested "UI multithread helper", the AsyncTask, is not actually a thread. It's a task invoked from a thread pool, and as such there is no limit or penalty to using it. The pool has an upper limit on the number of threads it spawns and reuses them rather than creating new ones. Most Android apps should use it instead of spawning their own threads. In general, thread pools are fairly widespread and are a great choice unless you are forced into blocking operations.

JVM running out of connections resulting into high CPU utilization and OutOfMemoryException

We have a 64-bit Linux machine, and we make multiple HTTP connections to other services; the Drools Guvnor website (a rule engine, if you don't know it) is one of them. In Drools, we create a knowledge base per rule being fired, and creating a knowledge base makes an HTTP connection to the Guvnor website.
All other threads are blocked and CPU utilization goes up to ~100%, resulting in an OOM. We could change this to compile the rules every 15-20 minutes, but I want to be sure of the problem first, in case someone has already faced it.
I ran "cat /proc/sys/kernel/threads-max" and it shows 27000 threads. Could that be a reason?
I have a couple of questions:
When do we know that we are running over capacity?
How many threads can be spawned internally (any rough estimate or formula relating diff parameters will work)?
Has anyone else seen similar issues with Drools? Concurrent access to Guvnor website is basically causing the issue.
Thanks,
I am basing my answer on the assumption that you are creating a knowledge base for each request, and that this knowledge base creation includes downloading the latest rule sources from Guvnor. Please correct me if I am mistaken.
I suspect that the build/compilation of packages is taking time and hogging your system.
Instead of compiling packages on each and every request, you can download pre-built packages from Guvnor, and you can also cache these packages locally if your rules do not change much. The only restriction is that you need to use the same version of Drools both on Guvnor and in your application.
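The local caching idea can be sketched generically, without relying on any particular Drools API. Here the expensive build step is passed in as a function, and every name (BuildOnceCacheSketch, the "pricing" key, the builder lambda) is hypothetical; in a real application the builder would download or load the pre-built package.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class BuildOnceCacheSketch<K, V> {

    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> builder;

    // builder is the expensive step, e.g. fetching and building a knowledge base.
    BuildOnceCacheSketch(Function<K, V> builder) {
        this.builder = builder;
    }

    // Builds the value on first use and returns the cached copy afterwards.
    V get(K key) {
        return cache.computeIfAbsent(key, builder);
    }

    // Drops a cached value so it is rebuilt on the next request, e.g. after the rules change.
    void invalidate(K key) {
        cache.remove(key);
    }

    public static void main(String[] args) {
        AtomicInteger builds = new AtomicInteger();
        BuildOnceCacheSketch<String, String> kbCache = new BuildOnceCacheSketch<>(name -> {
            builds.incrementAndGet(); // stands in for the slow download and build
            return "knowledge-base:" + name;
        });
        kbCache.get("pricing");
        kbCache.get("pricing");
        System.out.println("number of builds: " + builds.get());
    }
}
```

Calling invalidate on a timer, or when the rules are known to have changed, gives a periodic refresh instead of a rebuild on every request.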
I checked for "cat /proc/sys/kernel/threads-max" and it shows 27000 threads, Can it be a reason?
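Note that threads-max is the system-wide limit on the number of threads, not the number currently running, so by itself it does not tell us how many threads belong to your Java app. Create a Java thread dump to find out. Depending on the JVM version, the thread dump may also show the CPU time taken by each thread.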
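Besides an external thread dump, the same information can be read from inside the JVM through the standard java.lang.management API. The following is a sketch; the class name and the report format are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class ThreadReportSketch {

    // Number of live threads in this JVM (not system-wide).
    static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    // One line per live thread: name, state and CPU time in ms (-1 if unavailable).
    static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuTimeAvailable = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            long cpuNanos = cpuTimeAvailable ? mx.getThreadCpuTime(info.getThreadId()) : -1;
            long cpuMillis = cpuNanos < 0 ? -1 : cpuNanos / 1_000_000;
            sb.append(String.format("%-40s %-15s cpu=%d ms%n",
                    info.getThreadName(), info.getThreadState(), cpuMillis));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("live threads: " + liveThreadCount());
        System.out.print(report());
    }
}
```

Logging this periodically shows whether the thread count of the Java process itself is growing, and which threads are consuming the CPU.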
When do we know that we are running over capacity?
You have 100% CPU and an OOM error. You are over capacity :) Jokes aside, you should monitor your HTTP connection queue to determine what is going wrong. Your post says nothing about how you are handling the HTTP connections (presumably through some sort of pooling mechanism backed by a queue?). I've seen containers and programs queue requests without bound, causing them to crash with a big bang. Plot the following graphs to isolate your problem:
The number of blocking threads over time
Time taken for each thread
Number of threads per thread pool and how they increase / decrease with time (pool size)
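If the connections are handled by a ThreadPoolExecutor (an assumption, since the post does not say), the numbers for such graphs can be sampled directly from the pool. The sketch below also bounds the queue so that requests cannot pile up indefinitely. The class name, sizes and output format are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolMetricsSketch {

    // A pool with a bounded queue: when the queue is full, the submitting
    // thread runs the task itself, which slows producers down instead of
    // letting the backlog grow until memory runs out.
    static ThreadPoolExecutor boundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // One sample of the pool's state, suitable for logging at a fixed interval.
    static String snapshot(ThreadPoolExecutor pool) {
        return String.format("poolSize=%d active=%d queued=%d completed=%d",
                pool.getPoolSize(), pool.getActiveCount(),
                pool.getQueue().size(), pool.getCompletedTaskCount());
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = boundedPool(2, 10);
        for (int i = 0; i < 5; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(100); // stands in for an HTTP call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        System.out.println(snapshot(pool));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(snapshot(pool));
    }
}
```

A steadily rising "queued" value while "active" stays at the pool size is the signature of the unbounded queueing described above.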
How many threads can be spawned internally (any rough estimate or formula relating diff parameters will work)?
Only a load test can answer this question. Load your server and determine the number of concurrent users it can support at 60-70% capacity. Note the number of threads spawned internally at this point. That is your peak capacity (allowing room for unexpected traffic)
Has anyone else seen similar issues with Drools? Concurrent access to Guvnor website is basically causing the issue
I can't help there, since I've not accessed Drools this way. Sorry.

Is there a way to determine the ideal number of threads? [duplicate]

This question already has answers here:
How to find out the optimal amount of threads?
(5 answers)
Closed 6 years ago.
I am writing a web crawler and using threads to download pages.
The first limiting factor for the performance of my program is the bandwidth: I can never download more pages than the bandwidth allows.
The second factor is the one I am interested in. I am using threads to download many pages at the same time, but as I create more threads, they have to share the processor more. Is there some metric, method, or class of tests to determine the ideal number of threads, or to find out whether performance stops improving or decreases beyond a certain number?
We've developed a multithreaded parallel web crawler. Benchmarking throughput is the best way to get an idea of how the beast will handle its job. For a dedicated Java server, one thread per core is a good starting point; then I/O comes into play and changes the picture.
Performance does decrease beyond a certain number of threads, but that number also depends on the site you crawl, on the OS you use, etc. Try to find a site with a nearly constant response time for your first benchmarks (like Google, but use different services).
With slow websites, a higher number of threads tends to compensate for I/O blocking.
Have a look at my answer in this thread
How to find out the optimal amount of threads?
Your example will likely be CPU bound, so you need a way to work out the contention to be able to work out the right number of threads on your box to use and be able to keep them all busy. Profiling will help there but remember it'll depend on the number of cores (as well as the network latency already mentioned etc) so use the runtime to get the number of cores when wiring up your thread pool size.
No quick answer, I'm afraid; there will be an element of test, measure, adjust, repeat!
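Reading the core count from the runtime when wiring up the pool might look like the sketch below. The threadsPerCore parameter is a hypothetical tuning knob, standing for the value you would arrive at through the test, measure, adjust cycle.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CoreSizedPoolSketch {

    // Sizes the pool from the number of cores the JVM reports at startup,
    // so the same code adapts to whatever machine it is deployed on.
    static ThreadPoolExecutor newPool(int threadsPerCore) {
        int cores = Runtime.getRuntime().availableProcessors();
        int threads = Math.max(1, cores * threadsPerCore);
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newPool(2);
        System.out.println("pool size: " + pool.getCorePoolSize());
        pool.shutdown();
    }
}
```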
The ideal number of threads should be close to the number of cores (virtual cores) your hardware provides. This is to avoid thread context switching and thread scheduling overhead. If you're doing heavy IO operations with many blocking reads (your thread blocks on a socket read), I suggest you redesign your code to use non-blocking IO APIs. Typically this will involve one "selector" thread that monitors the activity of thousands of sockets and a small number of worker threads that do the processing. If your code is in Java, the APIs are NIO. The only blocking call will be selector.select(), and it will only block if there is nothing to be processed on any of the thousands of sockets. Event-driven frameworks such as netty.io use this model and have proven to be very scalable and to make the best use of the hardware resources of the system.
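A heavily simplified sketch of the selector model described above: one loop, one thread, and a loopback echo server standing in for the many crawler sockets. The class name and the echo behaviour are assumptions for illustration. A real crawler would register many client channels and hand the bytes read to worker threads, and this sketch is only suitable for small messages.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

class SelectorEchoSketch {

    // Sends message to a non-blocking echo server on loopback and returns the echo.
    static String echoOnce(String message) {
        byte[] payload = message.getBytes(StandardCharsets.UTF_8);
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // port 0: any free port
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                client.write(ByteBuffer.wrap(payload)); // blocking client, small payload

                int echoed = 0;
                while (echoed < payload.length) {
                    selector.select(); // the only blocking call in the loop
                    Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                    while (keys.hasNext()) {
                        SelectionKey key = keys.next();
                        keys.remove();
                        if (key.isAcceptable()) {
                            SocketChannel peer = server.accept();
                            if (peer != null) {
                                peer.configureBlocking(false);
                                peer.register(selector, SelectionKey.OP_READ);
                            }
                        } else if (key.isReadable()) {
                            SocketChannel peer = (SocketChannel) key.channel();
                            ByteBuffer buffer = ByteBuffer.allocate(1024);
                            int n = peer.read(buffer);
                            if (n < 0) {
                                peer.close(); // the other side closed the connection
                                continue;
                            }
                            buffer.flip();
                            while (buffer.hasRemaining()) {
                                peer.write(buffer); // acceptable only for tiny payloads
                            }
                            echoed += n;
                        }
                    }
                }

                // Read the echo back on the client side.
                ByteBuffer reply = ByteBuffer.allocate(payload.length);
                while (reply.hasRemaining() && client.read(reply) >= 0) {
                    // keep reading until the whole echo has arrived
                }
                for (SelectionKey key : selector.keys()) {
                    key.channel().close(); // release the accepted connection
                }
                return new String(reply.array(), 0, reply.position(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce("ping"));
    }
}
```

The point of the design is that a single thread services every registered channel, and it sleeps in select() only while no channel has anything to do.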
I'd say use something like Akka to manage the threads for you. Use the Jersey HTTP client library with non-blocking IO, which works with callbacks if I remember correctly. That is possibly the ideal setup for this type of task.
