I am using Java's ExecutorService to run my tasks. The database can absorb writes at 200 QPS, but this program only reaches 20 QPS with 15 threads. I tried 5, 10, 20, and 30 threads, and they were even slower than 15. Here is the code:
ExecutorService executor = Executors.newFixedThreadPool(15);
List<Callable<Object>> todos = new ArrayList<>();
for (final int id : ids) {
    todos.add(Executors.callable(() -> {
        try {
            TestObject test = testServiceClient.callRemoteService();
            SaveToDatabase();
        } catch (Exception ex) {}
    }));
}
try {
    executor.invokeAll(todos);
} catch (InterruptedException ex) {}
executor.shutdown();
1) I checked the CPU usage of the Linux server this program runs on, and the usage was 90% and 60% (it has 4 CPUs). Memory usage was only 20%. So CPU and memory were still fine. The database server's CPU usage was low (around 20%). What could prevent the program from reaching 200 QPS? Maybe this service call: testServiceClient.callRemoteService()? I checked the server configuration for that call and it allows a high number of calls per second.
2) If the number of ids is more than 50000, is it a good idea to use invokeAll? Should we split them into smaller batches, say 5000 per batch?
There is nothing in this code which prevents this query rate, except that creating and destroying a thread pool repeatedly is very expensive. I suggest using the Streams API, which is not only simpler but reuses a built-in thread pool:
int[] ids = ....
IntStream.of(ids).parallel()
         .forEach(id -> testServiceClient.callRemoteService(id));
Here is a benchmark using a trivial service. The main overhead is the latency in creating the connection.
public static void main(String[] args) throws IOException {
    ServerSocket ss = new ServerSocket(0);
    Thread service = new Thread(() -> {
        try {
            for (;;) {
                try (Socket s = ss.accept()) {
                    s.getOutputStream().write(s.getInputStream().read());
                }
            }
        } catch (Throwable t) {
            t.printStackTrace();
        }
    });
    service.setDaemon(true);
    service.start();
    for (int t = 0; t < 5; t++) {
        long start = System.nanoTime();
        int[] ids = new int[5000];
        IntStream.of(ids).parallel().forEach(id -> {
            // try-with-resources so each client socket is closed again
            try (Socket s = new Socket("localhost", ss.getLocalPort())) {
                s.getOutputStream().write(id);
                s.getInputStream().read();
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        long time = System.nanoTime() - start;
        System.out.println("Throughput " + (int) (ids.length * 1e9 / time) + " connects/sec");
    }
}
prints
Throughput 12491 connects/sec
Throughput 13138 connects/sec
Throughput 15148 connects/sec
Throughput 14602 connects/sec
Throughput 15807 connects/sec
Using an ExecutorService would be better, as @grzegorz-piwowarek mentions:
ExecutorService es = Executors.newFixedThreadPool(8);
for (int t = 0; t < 5; t++) {
    long start = System.nanoTime();
    int[] ids = new int[5000];
    List<Future<?>> futures = new ArrayList<>(ids.length);
    for (int id : ids) {
        futures.add(es.submit(() -> {
            try (Socket s = new Socket("localhost", ss.getLocalPort())) {
                s.getOutputStream().write(id);
                s.getInputStream().read();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }));
    }
    for (Future<?> future : futures) {
        future.get(); // throws InterruptedException/ExecutionException; declare or handle them
    }
    long time = System.nanoTime() - start;
    System.out.println("Throughput " + (int) (ids.length * 1e9 / time) + " connects/sec");
}
es.shutdown();
In this case it produces much the same results.
Why do you restrict yourself to such a low number of threads?
You're missing performance opportunities this way. It seems that your tasks are really not CPU-bound. The network operations (remote service + database query) may take up the majority of time for each task to finish. During these times, where a single task/thread needs to wait for some event (network,...), another thread can use the CPU. The more threads you make available to the system, the more threads may be waiting for their network I/O to complete while still having some threads use the CPU at the same time.
I suggest you drastically ramp up the number of threads for the executor. As you say that both remote servers are rather under-utilized, I assume the host your program runs at is the bottleneck at the moment. Try to increase (double?) the number of threads until either your CPU utilization approaches 100% or memory or the remote side become the bottleneck.
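To put a number on "drastically ramp up", the classic sizing heuristic from Java Concurrency in Practice is threads ≈ cores × (1 + waitTime / computeTime). A minimal sketch, where the 95 ms / 5 ms split is an assumed, illustrative latency profile (not a measurement from the question):

```java
// Sizing heuristic for I/O-bound pools (Java Concurrency in Practice):
// threads ≈ cores * (1 + waitTime / computeTime)
public class PoolSize {
    public static void main(String[] args) {
        int cores = 4; // the 4-CPU server from the question
        // Assumed, illustrative per-task latencies: 95 ms network wait, 5 ms CPU work
        double waitMs = 95.0, computeMs = 5.0;
        int threads = (int) (cores * (1 + waitMs / computeMs));
        System.out.println("Suggested pool size: " + threads);
    }
}
```

With tasks that are ~95% I/O-bound, the heuristic lands far above 15 threads, which is why 15 cannot keep the CPU (or the database) busy.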
By the way, you shut down the executor, but do you actually wait for the tasks to terminate? How do you measure the "QPS"?
One more thing comes to my mind: How are DB connections handled? I.e. how are SaveToDatabase()s synchronized? Do all threads share (and compete for) a single connection? Or, worse, will each thread create a new connection to the DB, do its thing, and then close the connection again? This may be a serious bottleneck because establishing a TCP connection and doing the authentication handshake may take up as much time as running a simple SQL statement.
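If per-task connections turn out to be the problem, the usual fix is a small pool of pre-opened connections that threads borrow and return. A minimal sketch of that pattern with a plain BlockingQueue; the String objects below are stand-ins for real java.sql.Connection objects:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the share-a-few-connections pattern: threads borrow a pooled
// "connection" instead of opening (and authenticating) a fresh one per task.
public class TinyPool {
    private final BlockingQueue<String> pool;

    TinyPool(int size) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) pool.add("conn-" + i); // pre-open once, up front
    }

    String borrow() throws InterruptedException { return pool.take(); } // blocks if all in use
    void giveBack(String c) { pool.add(c); }

    public static void main(String[] args) throws Exception {
        TinyPool p = new TinyPool(2);
        String c = p.borrow();
        System.out.println("Borrowed " + c);
        p.giveBack(c);
    }
}
```

In production code a ready-made pool (e.g. a pooling DataSource) would replace this, but the borrow/return discipline is the same.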
If the count of id in ids is more than 50000, is it a good idea to use invokeAll? Should we split it to smaller batches, such as 5000 each batch?
As @Vaclav Stengl already wrote, the Executors have internal queues in which they enqueue and from which they process the tasks. So no need to worry about that one. You can also just call submit for each single task as soon as you have created it. This allows the first tasks to start executing while you're still creating/preparing later tasks, which makes sense especially when each task creation takes comparatively long, but won't hurt in other cases. Think of invokeAll as a convenience method for cases where you already have a collection of tasks. If you create the tasks successively yourself and you already have access to the ExecutorService to run them on, just submit() them a.s.a.p.
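A minimal sketch of that submit-as-you-go idea; the squared-integer task is a stand-in for the real remote call, and an ExecutorCompletionService additionally hands results back in completion order rather than submission order:

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: submit() each task as soon as it exists; early tasks can finish
// while later ones are still being created.
public class SubmitAsap {
    public static void main(String[] args) throws Exception {
        ExecutorService es = Executors.newFixedThreadPool(4);
        CompletionService<Integer> cs = new ExecutorCompletionService<>(es);
        int n = 8;
        for (int id = 0; id < n; id++) {
            final int taskId = id;
            cs.submit(() -> taskId * taskId); // stand-in for the real remote call
        }
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += cs.take().get(); // results arrive in completion order
        }
        System.out.println("Sum of squares: " + sum);
        es.shutdown();
    }
}
```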
About batch splitting:
ExecutorService has an internal queue for storing tasks. In your case ExecutorService executor = Executors.newFixedThreadPool(15); has 15 threads, so at most 15 tasks run concurrently and the others are stored in the queue. The queue size can be parametrized; by default it scales up to Integer.MAX_VALUE. invokeAll calls execute internally, and execute places tasks into the queue when all threads are busy.
Imho there are 2 possible reasons why the CPU is not at 100%:
1) the thread pool is too small, so try to enlarge it
2) each thread is waiting for testServiceClient.callRemoteService() to complete, and meanwhile the CPU is starving
The QPS problem may also be a bandwidth limit, or transaction execution (it will lock the table or row), so just increasing the pool size will not work. Additionally, you can try the producer-consumer pattern.
Related
I have some Java code that reads an email inbox.
Once this service is turned on, it spawns a parent thread that checks every 5 seconds whether new mail has arrived; if there is none, it sleeps for 5 seconds. If new mail has arrived, it spawns, depending on the mail burst, up to a maximum of 10 worker threads to parse those emails. Once all the mails are parsed and no new mail has arrived, the worker threads are killed after 5 seconds of inactivity. The parent thread keeps on pinging ALWAYS.
There are 7-8 such services, reading different inboxes, that keep running on my AWS machine, which has a 4-core CPU. These services eat up to 350% of my CPU, which I can see via the "top" command.
I want to know if there is a way to stop these threads from eating CPU all the time. This is slowing down all other processes, which are starved of CPU because of the contention.
This is the code in parent thread
@Override
public void run() {
    try {
        while (!this.isThreadKillRequested()) {
            if (this.getMessageCount() > 0) {
                WorkerThread worker = getWorkerThread();
                if (worker != null && !worker.isAlive()) {
                    worker.start();
                }
            } else {
                if (this.isAllThreadIdle()) {
                    //ModelUtil.printOutput("all idle. nothing to do");
                }
            }
            Thread.sleep(MESSAGE_PROCESSOR_SLEEP_TIME);
        }
    } catch (InterruptedException e) {
        ErrorLogger.logError("InterruptedException exception in monitoring thread",
                GlobalConfig.msgException + e.getMessage());
        e.printStackTrace();
    }
}
private WorkerThread getWorkerThread() {
    WorkerThread worker = null;
    for (Map.Entry<String, ThreadPerformance> entry : this.threadPool.entrySet()) {
        ThreadPerformance p = entry.getValue();
        if (p.isThreadIdle()) {
            worker = p.getThisThread();
            break;
        }
    }
    if (worker == null && this.threadPool.size() < MAX_POOL_SIZE) {
        double overallThroughput = 0.00;
        //some logic to calculate throughput
        if (overallThroughput < MIN_THROUGHPUT) {
            worker = new WorkerThread(this, this.getUniqueThreadId());
            //add in pool
        }
        System.out.println("Overall Throughput - " + overallThroughput);
        System.out.println("Pool Size - " + this.threadPool.size());
    }
    return worker;
}
Limiting the threads' CPU consumption is easiest done by slowing down the refresh cycle and condensing threads: try checking every 20 seconds and limiting the number of worker threads to 4. Unless there's a genuine need for all of the threads and a fast refresh rate, it's unnecessary pressure on the CPU, especially when 8 mailboxes are being checked.
The probability of all mailboxes receiving a large amount of new mail, at the same time, is low. Thus, a total thread count can be kept and the number of threads per mailbox can be distributed based on the percentage of new items per mailbox. This will increase the throughput for the mailbox that needs it the most while limiting the number of threads to a manageable amount.
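One way to condense the always-on parent threads is a single ScheduledExecutorService that polls every inbox on a fixed delay, instead of 7-8 hand-rolled sleep loops. A sketch, where checkInbox and the 50 ms delay are illustrative stand-ins for the real IMAP check and the real 5-to-20-second interval:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: one small scheduler polls all inboxes; no thread burns CPU between checks.
public class InboxPoller {
    static final AtomicInteger polls = new AtomicInteger();

    static void checkInbox(String name) {
        polls.incrementAndGet(); // real code would fetch and dispatch new mail here
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService ses = Executors.newScheduledThreadPool(1);
        String[] inboxes = {"sales", "support"}; // hypothetical inbox names
        for (String inbox : inboxes) {
            // fixed *delay*: the next check starts 50 ms after the previous one finished
            ses.scheduleWithFixedDelay(() -> checkInbox(inbox), 0, 50, TimeUnit.MILLISECONDS);
        }
        Thread.sleep(200);
        ses.shutdown();
        System.out.println("Polled at least twice per inbox: " + (polls.get() >= 4));
    }
}
```

scheduleWithFixedDelay (rather than scheduleAtFixedRate) guarantees checks never pile up if one inbox is slow to answer.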
Let's say I am using ExecutorService to spawn threads that perform millions of actions, like updating an individual thread-specific counter (no race conditions). I want to print to the console the current action rate from all the combined threads, once per second. How can I do this? My main problem is that I don't know how to gather each thread's statistics once per second in a reliable way.
If it helps, the actual application is blockchain hashing, and I want to print the combined hashrate, between all the threads.
So for example (pseudo-code):
Runnable hash = () -> {
    try {
        for (int i = 0; i < 1000000; i++) {
            hashStuff();
            reportHashRateInfo(i, otherStuff); // How do I do this?
        }
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
};
ExecutorService executor = Executors.newFixedThreadPool(4);
executor.submit(hash);
printCombinedHashRateInfo(); // and this
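One possible answer (a sketch, not the only design): have every worker bump a shared LongAdder, which is built for exactly this write-heavy, many-threads pattern, and let a scheduled reporter thread read the counter once per second. hashStuff and the counts here are stand-ins for the real hashing code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class HashRate {
    static final LongAdder hashes = new LongAdder();

    static void hashStuff() {
        hashes.increment(); // real code would compute a hash here
    }

    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            workers.submit(() -> {
                for (int i = 0; i < 250_000; i++) hashStuff();
            });
        }
        // Reporter: for a per-second *rate* (not a running total),
        // use hashes.sumThenReset() instead of sum().
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(
                () -> System.out.println("Hashes so far: " + hashes.sum()),
                1, 1, TimeUnit.SECONDS);
        workers.shutdown();
        workers.awaitTermination(10, TimeUnit.SECONDS);
        reporter.shutdownNow();
        System.out.println("Total hashes: " + hashes.sum());
    }
}
```

Workers never block each other: LongAdder stripes the counter across cells under contention, so the once-per-second read costs almost nothing.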
I am currently developing a Java benchmark to evaluate some use cases (inserts, updates, deletes, etc.) with an Apache Derby database.
My implementation is the following:
After having warmed up the JVM, I execute a series (for loop: 100k to 1M iterations) of, let's say, INSERTs into a single table (at the moment) of a database. As it is Apache Derby, for those who know it, I test every mode (In memory/Embedded, In memory/Network, Persistent/Embedded, Persistent/Network).
The execution of the process may be single-threaded or multi-threaded (using Executors.newFixedThreadPool(poolSize)).
Well, here goes my problem:
When I execute the benchmark with only 1 thread, I get pretty realistic results:
In memory/embedded[Simple Integer Insert] : 35K inserts/second (1 thread)
Then, I decide to execute with 1 and then 2 (concurrent) threads sequentially.
Now, I have the following results :
In memory/embedded[Simple Integer Insert] : 21K inserts/second (1 thread)
In memory/embedded[Simple Integer Insert] : 20K inserts/second (2 thread)
Why do the results for 1 thread change so much ?
Basically, I start and end the timer before and after the loop :
// Processing
long start = System.nanoTime();
for (int i = 0; i < loopSize; i++) {
    process();
}
// end timer
long absTime = System.nanoTime() - start;
double absTimeMilli = absTime * 1e-6;
and the process() method :
private void process() throws SQLException {
    PreparedStatement ps = clientConn.prepareStatement(query);
    ps.setObject(1, val);
    ps.execute();
    clientConn.commit();
    ps.close();
}
As the executions are processed sequentially, the rest of my code (data handling) should not alter the benchmark, should it?
The results get worse as the number of threads grows (1, 2, 4, 8 for example).
I am sorry in advance if this is confusing. If needed, I'll provide more information or re-explain it!
Thank you for your help :)
EDIT :
Here is the method (from the Usecase class) calling the aforementioned execution:
@Override
public ArrayList<ContextBean> bench(int loopSize, int poolSize) throws InterruptedException, ExecutionException {
    Future<ContextBean> t = null;
    ArrayList<ContextBean> cbl = new ArrayList<ContextBean>();
    try {
        ExecutorService es = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < poolSize; i++) {
            BenchExecutor be = new BenchExecutor(eds, insertStatement, loopSize, poolSize, "test-varchar");
            t = es.submit(be);
            cbl.add(t.get());
        }
        es.shutdown();
        es.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
        e.printStackTrace();
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return cbl;
}
On simple operations, every database behaves as you described.
The reason is that all the threads you spawn try to operate on the same table (or set of tables), so the database must serialize the access.
In this situation every thread works a little slower, but the overall result is a (small) gain. (21K+20K=41K against a 35K of the single threaded version).
The gain decreases (usually exponentially) with the number of threads, and eventually you may experience a loss, due to lock escalation (see https://dba.stackexchange.com/questions/12864/what-is-lock-escalation).
Generally, a multithreaded solution gains most when its performance is not bound by a single resource but by multiple factors (e.g. calculations, selects on multiple tables, inserts into different tables).
I am trying out the executor service in Java, and wrote the following code to run Fibonacci (yes, the massively recursive version, just to stress out the executor service).
Surprisingly, it will run faster if I set the nThreads to 1. It might be related to the fact that the size of each "task" submitted to the executor service is really small. But still it must be the same number also if I set nThreads to 1.
To see if the access to the shared Atomic variables can cause this issue, I commented out the three lines with the comment "see text", and looked at the system monitor to see how long the execution takes. But the results are the same.
Any idea why this is happening?
BTW, I wanted to compare it with the similar implementation with Fork/Join. It turns out to be way slower than the F/J implementation.
public class MainSimpler {
    static int N = 35;
    static AtomicInteger result = new AtomicInteger(0), pendingTasks = new AtomicInteger(1);
    static ExecutorService executor;

    public static void main(String[] args) {
        int nThreads = 2;
        System.out.println("Number of threads = " + nThreads);
        executor = Executors.newFixedThreadPool(nThreads);
        Executable.inQueue = new AtomicInteger(nThreads);
        long before = System.currentTimeMillis();
        System.out.println("Fibonacci " + N + " is ... ");
        executor.submit(new FibSimpler(N));
        waitToFinish();
        System.out.println(result.get());
        long after = System.currentTimeMillis();
        System.out.println("Duration: " + (after - before) + " milliseconds\n");
    }

    private static void waitToFinish() {
        while (0 < pendingTasks.get()) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        executor.shutdown();
    }
}
class FibSimpler implements Runnable {
    int N;

    FibSimpler(int n) { N = n; }

    @Override
    public void run() {
        compute();
        MainSimpler.pendingTasks.decrementAndGet(); // see text
    }

    void compute() {
        int n = N;
        if (n <= 1) {
            MainSimpler.result.addAndGet(n); // see text
            return;
        }
        MainSimpler.executor.submit(new FibSimpler(n - 1));
        MainSimpler.pendingTasks.incrementAndGet(); // see text
        N = n - 2;
        compute(); // similar to the F/J counterpart
    }
}
Runtime (approximately):
1 thread : 11 seconds
2 threads: 19 seconds
4 threads: 19 seconds
Update:
I noticed that even if I use one thread inside the executor service, the whole program uses all four cores of my machine (each core around 80% usage on average). This could explain why using more threads inside the executor service slows down the whole process, but then why does this program use 4 cores if only one thread is active inside the executor service?
It might be related to the fact that the size of each "task" submitted to the executor service is really small.
This is certainly the case, and as a result you are mainly measuring the overhead of context switching. When nThreads == 1, there is no context switching between worker threads and thus the performance is better.
But still it must be the same number also if I set nThreads to 1.
I'm guessing you meant 'to higher than 1' here.
You are also running into heavy contention on the shared counters. With multiple threads, the AtomicInteger result is contended all the time: each thread's compare-and-swap can fail and be retried, and the cache line holding the counter bounces between cores, so the threads slow each other down. With a single thread there is no contention at all, and the JIT can keep the atomic updates essentially free.
You may get better performance if you don't divide the problem into N tasks, but rather divide it into N/nThreads tasks, which can be handled simultaneously by the threads (assuming you choose nThreads to be at most the number of physical cores/threads available). Each thread then does its own work, calculating its own total and only adding that to a grand total when the thread is done. Even then, for fib(35) I expect the costs of thread management to outweigh the benefits. Perhaps try fib(1000).
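The "divide into nThreads chunks, keep a private total per task, combine once at the end" idea can be sketched on a simple summation; the per-chunk loop is a stand-in for the real per-chunk work:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: each task owns its own range and its own local total;
// no shared counter is touched inside the hot loop.
public class ChunkedSum {
    public static void main(String[] args) throws Exception {
        int n = 1_000_000, nThreads = 4;
        ExecutorService es = Executors.newFixedThreadPool(nThreads);
        List<Future<Long>> parts = new ArrayList<>();
        int chunk = n / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = (t == nThreads - 1) ? n : lo + chunk;
            parts.add(es.submit(() -> {
                long local = 0;
                for (int i = lo; i < hi; i++) local += i; // no contention here
                return local;
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get(); // combine once, at the end
        es.shutdown();
        System.out.println("Total: " + total);
    }
}
```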
I am writing code for my homework, and I am not so familiar with writing multi-threaded applications. I learned how to open a thread and start it. I'd better just show the code.
for (int i = 0; i < a.length; i++) {
    download(host, port, a[i]);
    scan.next();
}
My code above connects to a server and opens multiple parallel requests; in other words, on each iteration download opens several connections to get the same content. However, I want the iteration for i = 1 to start only when all the threads that download opened for i = 0 have completed. I did it with scan.next() to stop it by hand, but obviously that is not a nice solution. How can I do that?
Edit:
public static long download(String host, int port) {
    new java.io.File("Folder_" + N).mkdir();
    N--;
    int totalLength = length(host, port);
    long result = 0;
    ArrayList<HTTPThread> list = new ArrayList<HTTPThread>();
    for (int i = 0; i < totalLength; i = i + N + 1) {
        HTTPThread t;
        if (i + N > totalLength) {
            t = new HTTPThread(host, port, i, totalLength - 1);
        } else {
            t = new HTTPThread(host, port, i, i + N);
        }
        list.add(t);
    }
    for (HTTPThread t : list) {
        t.start();
    }
    return result;
}
And in my HTTPThread:
public void run() {
    init(host, port);
    downloadData(low, high);
    close();
}
Note: Our test web server is a modified web server; it accepts Range: i-j, and the response contains the contents of files i through j.
You will need to call the join() method of each thread that is doing the downloading. This causes the current thread to wait until the download thread has finished.
If you'd like to post your download method, you will probably get a more complete solution.
EDIT:
Ok, so after you start your threads you will need to join them like so:
for (HTTPThread t : list) {
    t.start();
}
for (HTTPThread t : list) {
    t.join(); // throws InterruptedException; declare or handle it
}
This will stop the method from returning until all HTTPThreads have completed.
It's probably not a great idea to create an unbounded number of threads to do an unbounded number of parallel HTTP requests. (Both network sockets and threads are operating-system resources, require some bookkeeping overhead, and are therefore subject to quotas in many operating systems. In addition, the web server you are reading from might not like thousands of concurrent connections, because its network sockets are finite, too!)
You can easily control the number of concurrent connections using an ExecutorService:
List<DownloadTask> tasks = new ArrayList<DownloadTask>();
for (int i = 0; i < length; i++) {
    tasks.add(new DownloadTask(i));
}
ExecutorService executor = Executors.newFixedThreadPool(N);
executor.invokeAll(tasks); // DownloadTask must implement Callable; invokeAll throws InterruptedException
executor.shutdown();
This is both shorter and better than your homegrown concurrency limit, because your limit delays starting the next batch until all threads from the current batch have completed. With an ExecutorService, a new task is begun whenever an old task completes (as long as tasks remain). That is, your solution will have 1 to N concurrent requests until all tasks have been started, whereas the ExecutorService will always have N concurrent requests.
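For completeness, a self-contained sketch of that pattern: since invokeAll() takes Callables, DownloadTask implements Callable, and its body is a stub standing in for the real range request:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: at most N requests in flight at once; invokeAll blocks until all are done.
public class BoundedDownloads {
    static class DownloadTask implements Callable<Integer> {
        final int i;
        DownloadTask(int i) { this.i = i; }
        public Integer call() {
            return i; // real code: fetch range i and return e.g. bytes read
        }
    }

    public static void main(String[] args) throws Exception {
        int length = 20, N = 4;
        List<DownloadTask> tasks = new ArrayList<>();
        for (int i = 0; i < length; i++) tasks.add(new DownloadTask(i));
        ExecutorService executor = Executors.newFixedThreadPool(N);
        List<Future<Integer>> results = executor.invokeAll(tasks);
        executor.shutdown();
        int sum = 0;
        for (Future<Integer> f : results) sum += f.get();
        System.out.println("Completed " + results.size() + " tasks, sum " + sum);
    }
}
```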