Database Benchmark : Weird results when testing concurrency (ExecutorService)


I am currently developing a Java benchmark to evaluate some use cases (inserts, updates, deletes, etc.) with an Apache Derby database.
My implementation is the following:
After warming up the JVM, I execute a series (a for loop of 100k to 1M iterations) of, let's say, INSERTs into a single table (for the moment) of a database. As it is Apache Derby, for those who know it, I test every mode (In memory/Embedded, In memory/Network, Persistent/Embedded, Persistent/Network).
The process may run single-threaded or multi-threaded (using Executors.newFixedThreadPool(poolSize)).
Well, here is my problem:
When I execute the benchmark with only 1 thread, I get pretty realistic results:
In memory/embedded[Simple Integer Insert] : 35K inserts/second (1 thread)
Then I decide to execute with 1 and then 2 (concurrent) threads, one run after the other.
Now I get the following results:
In memory/embedded[Simple Integer Insert] : 21K inserts/second (1 thread)
In memory/embedded[Simple Integer Insert] : 20K inserts/second (2 threads)
Why do the results for 1 thread change so much?
Basically, I start and stop the timer before and after the loop:
// Processing
long start = System.nanoTime();
for (int i = 0; i < loopSize; i++) {
    process();
}
// end timer
long absTime = System.nanoTime() - start;
double absTimeMilli = absTime * 1e-6;
and the process() method :
private void process() throws SQLException {
    PreparedStatement ps = clientConn.prepareStatement(query);
    ps.setObject(1, val);
    ps.execute();
    clientConn.commit();
    ps.close();
}
As the executions are processed sequentially, the rest of my code (data handling) should not alter the benchmark, should it?
The results get worse as the number of sequential threads grows (1, 2, 4, 8 for example).
I am sorry in advance if this is confusing. If needed, I'll provide more information or re-explain it!
Thank you for your help :)
EDIT:
Here is the method (from the Usecase class) calling the aforementioned execution:
@Override
public ArrayList<ContextBean> bench(int loopSize, int poolSize) throws InterruptedException, ExecutionException {
    Future<ContextBean> t = null;
    ArrayList<ContextBean> cbl = new ArrayList<ContextBean>();
    try {
        ExecutorService es = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < poolSize; i++) {
            BenchExecutor be = new BenchExecutor(eds, insertStatement, loopSize, poolSize, "test-varchar");
            t = es.submit(be);
            cbl.add(t.get());
        }
        es.shutdown();
        es.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
        e.printStackTrace();
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return cbl;
}

On simple operations, every database behaves as you described.
The reason is that all the threads you spawn operate on the same table (or set of tables), so the database must serialize access.
In this situation every thread works a little slower, but the overall result is a (small) gain (21K + 20K = 41K against 35K for the single-threaded version).
The gain decreases (usually exponentially) as the number of threads grows, and eventually you may see a loss due to lock escalation (see https://dba.stackexchange.com/questions/12864/what-is-lock-escalation).
Generally, a multithreaded solution gains the most when its performance is bound not by a single resource but by multiple factors (e.g. calculations, selects on multiple tables, inserts into different tables).
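One detail in the bench method from the question may also matter: calling t.get() right after es.submit(be) blocks until that task has finished, so the tasks run one after another even when the pool has several threads. Below is a minimal sketch of submitting every task first and collecting the results afterwards. The class name ConcurrentBench and the sleeping workload are assumptions made for illustration; they stand in for BenchExecutor and are not the asker's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrentBench {

    // Runs poolSize copies of the task at the same time and returns their results.
    static <T> List<T> runAll(Callable<T> task, int poolSize)
            throws InterruptedException, ExecutionException {
        ExecutorService es = Executors.newFixedThreadPool(poolSize);
        try {
            // Submit everything first so the tasks can overlap.
            List<Future<T>> futures = new ArrayList<>();
            for (int i = 0; i < poolSize; i++) {
                futures.add(es.submit(task));
            }
            // Only now block on the results.
            List<T> results = new ArrayList<>();
            for (Future<T> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            es.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical workload: sleeps instead of talking to a database.
        Callable<Long> work = () -> {
            long start = System.nanoTime();
            Thread.sleep(200);
            return System.nanoTime() - start;
        };
        long start = System.nanoTime();
        List<Long> times = runAll(work, 4);
        long wallMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(times.size() + " tasks finished in about " + wallMs + " ms");
    }
}
```

With this shape the wall-clock time for four sleeping tasks should be close to one task's duration rather than four times that, which is a quick way to check that the tasks really overlap.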

Related

Performance of executorService multithreading pool

I am using Java's concurrency library ExecutorService to run my tasks. The threshold for writing to the database is 200 QPS, however, this program can only reach 20 QPS with 15 threads. I tried 5, 10, 20, 30 threads, and they were even slower than 15 threads. Here is the code:
ExecutorService executor = Executors.newFixedThreadPool(15);
List<Callable<Object>> todos = new ArrayList<>();
for (final int id : ids) {
    todos.add(Executors.callable(() -> {
        try {
            TestObject test = testServiceClient.callRemoteService();
            SaveToDatabase();
        } catch (Exception ex) {}
    }));
}
try {
    executor.invokeAll(todos);
} catch (InterruptedException ex) {}
executor.shutdown();
1) I checked the CPU usage of the Linux server on which this program is running, and the usage was 90% and 60% (it has 4 CPUs). The memory usage was only 20%. So the CPU and memory were still fine. The database server's CPU usage was low (around 20%). What could prevent the speed from reaching 200 QPS? Maybe this service call: testServiceClient.callRemoteService()? I checked the server configuration for that call and it allows a high number of calls per second.
2) If the count of id in ids is more than 50000, is it a good idea to use invokeAll? Should we split it to smaller batches, such as 5000 each batch?

Nothing in this code prevents that query rate, except that creating and destroying a thread pool repeatedly is very expensive. I suggest using the Streams API, which is not only simpler but also reuses a built-in thread pool:
int[] ids = ....
IntStream.of(ids).parallel()
.forEach(id -> testServiceClient.callRemoteService(id));
Here is a benchmark using a trivial service. The main overhead is the latency in creating the connection.
public static void main(String[] args) throws IOException {
ServerSocket ss = new ServerSocket(0);
Thread service = new Thread(() -> {
try {
for (; ; ) {
try (Socket s = ss.accept()) {
s.getOutputStream().write(s.getInputStream().read());
}
}
} catch (Throwable t) {
t.printStackTrace();
}
});
service.setDaemon(true);
service.start();
for (int t = 0; t < 5; t++) {
long start = System.nanoTime();
int[] ids = new int[5000];
IntStream.of(ids).parallel().forEach(id -> {
try {
Socket s = new Socket("localhost", ss.getLocalPort());
s.getOutputStream().write(id);
s.getInputStream().read();
} catch (IOException e) {
e.printStackTrace();
}
});
long time = System.nanoTime() - start;
System.out.println("Throughput " + (int) (ids.length * 1e9 / time) + " connects/sec");
}
}
prints
Throughput 12491 connects/sec
Throughput 13138 connects/sec
Throughput 15148 connects/sec
Throughput 14602 connects/sec
Throughput 15807 connects/sec
Using an ExecutorService would be better, as @grzegorz-piwowarek mentions.
ExecutorService es = Executors.newFixedThreadPool(8);
for (int t = 0; t < 5; t++) {
long start = System.nanoTime();
int[] ids = new int[5000];
List<Future> futures = new ArrayList<>(ids.length);
for (int id : ids) {
futures.add(es.submit(() -> {
try {
Socket s = new Socket("localhost", ss.getLocalPort());
s.getOutputStream().write(id);
s.getInputStream().read();
} catch (IOException e) {
e.printStackTrace();
}
}));
}
for (Future future : futures) {
future.get();
}
long time = System.nanoTime() - start;
System.out.println("Throughput " + (int) (ids.length * 1e9 / time) + " connects/sec");
}
es.shutdown();
In this case it produces much the same results.

Why do you restrict yourself to such a low number of threads?
You're missing performance opportunities this way. It seems that your tasks are really not CPU-bound. The network operations (remote service + database query) may take up the majority of time for each task to finish. During these times, where a single task/thread needs to wait for some event (network,...), another thread can use the CPU. The more threads you make available to the system, the more threads may be waiting for their network I/O to complete while still having some threads use the CPU at the same time.
I suggest you drastically ramp up the number of threads for the executor. As you say that both remote servers are rather under-utilized, I assume the host your program runs on is the bottleneck at the moment. Try to increase (double?) the number of threads until either your CPU utilization approaches 100% or memory or the remote side becomes the bottleneck.
By the way, you shut down the executor, but do you actually wait for the tasks to terminate? How do you measure the "QPS"?
One more thing comes to my mind: How are DB connections handled? I.e. how are SaveToDatabase()s synchronized? Do all threads share (and compete for) a single connection? Or, worse, will each thread create a new connection to the DB, do its thing, and then close the connection again? This may be a serious bottleneck because establishing a TCP connection and doing the authentication handshake may take up as much time as running a simple SQL statement.
If the count of id in ids is more than 50000, is it a good idea to use
invokeAll? Should we split it to smaller batches, such as 5000 each
batch?
As @Vaclav Stengl already wrote, the Executors have internal queues in which they enqueue and from which they process the tasks. So no need to worry about that one. You can also just call submit for each single task as soon as you have created it. This allows the first tasks to already start executing while you're still creating/preparing later tasks, which makes sense especially when each task creation takes comparatively long, but won't hurt in all other cases. Think about invokeAll as a convenience method for cases where you already have a collection of tasks. If you create the tasks successively yourself and you already have access to the ExecutorService to run them on, just submit() them a.s.a.p.

About batch splitting:
ExecutorService has an internal queue for storing tasks. In your case, ExecutorService executor = Executors.newFixedThreadPool(15); has 15 threads, so at most 15 tasks run concurrently and the others wait in the queue. The size of the queue can be configured; by default it can grow up to Integer.MAX_VALUE. invokeAll calls execute internally, and that method places tasks into the queue when all threads are busy.
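As a rough illustration of configuring the queue size, here is a sketch that builds a fixed-size pool directly with ThreadPoolExecutor and a bounded ArrayBlockingQueue. The class name BoundedPool, the capacity of 1,000 and the choice of CallerRunsPolicy are assumptions for the example, not something the question prescribes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedPool {

    // Fixed-size pool whose queue holds at most queueCapacity waiting tasks.
    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                // When the queue is full, the submitting thread runs the task itself,
                // which slows the producer down instead of rejecting work.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Submits taskCount trivial tasks and returns how many actually ran.
    static int runTasks(int taskCount, int threads, int queueCapacity)
            throws InterruptedException {
        ThreadPoolExecutor pool = create(threads, queueCapacity);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runTasks(50_000, 15, 1_000) + " tasks completed");
    }
}
```

A bounded queue mostly matters for memory: with 50,000 or more ids the default unbounded queue simply holds all pending tasks, which is usually fine for small task objects.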
IMHO there are 2 possible reasons why the CPU is not at 100%:
the thread pool is too small (try enlarging it)
threads are waiting for testServiceClient.callRemoteService() to complete, and meanwhile the CPU is starving

The QPS problem may be caused by a bandwidth limit or by transaction execution (which locks the table or row), so just increasing the pool size will not help. Additionally, you can try the producer-consumer pattern.

Java: How to handle multiple threads reporting on the same search space?

Let's say I am using ExecutorService to spawn threads that perform millions of actions, like updating an individual, thread-specific counter (no race conditions). I want to print to the console the current action rate of all the threads combined, once per second. How can I do this? My main problem is that I don't know how to gather each thread's statistics once per second in a reliable way.
If it helps, the actual application is blockchain hashing, and I want to print the combined hashrate across all the threads.
So for example (pseudo-code):
Runnable hash = () -> {
    try {
        for (int i = 0; i < 1000000; i++) {
            hashStuff();
            reportHashRateInfo(i, otherStuff); // How do I do this?
        }
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
};
ExecutorService executor = Executors.newFixedThreadPool(4);
executor.submit(hash);
printCombinedHashRateInfo(); // and this
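One possible approach, sketched here under assumptions, is to let every worker increment a shared java.util.concurrent.atomic.LongAdder and have a separate scheduled task read the sum at a fixed period and print the difference since the last reading. The class name HashRateReporter and the empty loop body standing in for hashStuff() are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

class HashRateReporter {

    // Runs `workers` threads doing `perWorker` units of work each, printing the
    // combined rate every `periodMillis`. Returns the total number of units done.
    static long run(int workers, int perWorker, long periodMillis)
            throws InterruptedException {
        // LongAdder keeps contention low when many threads increment it.
        LongAdder hashes = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();

        // Only the reporter thread touches `last`, so a plain array cell is enough.
        long[] last = {0};
        reporter.scheduleAtFixedRate(() -> {
            long now = hashes.sum();
            System.out.println((now - last[0]) + " hashes in the last " + periodMillis + " ms");
            last[0] = now;
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);

        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                for (int i = 0; i < perWorker; i++) {
                    // hashStuff() from the question would go here.
                    hashes.increment();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        reporter.shutdownNow();
        return hashes.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        long total = run(4, 50_000_000, 1000);
        System.out.println("Total: " + total);
    }
}
```

The workers never block on reporting; the printed figure is approximate because sum() is not an atomic snapshot, which is normally acceptable for a rate display.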

How can I properly block a thread until timeout starts?

I would like to run several tasks in parallel until a certain amount of time has passed. Let us suppose those threads are CPU-heavy and/or may block indefinitely. After the timeout, the threads should be interrupted immediately, and the main thread should continue execution regardless of unfinished or still running tasks.
I've seen a lot of questions asking this, and the answers were always similar, often along the lines of "create thread pool for tasks, start it, join it on timeout"
The problem is between the "start" and "join" parts. As soon as the pool is allowed to run, it may grab CPU and the timeout will not even start until I get it back.
I have tried Executor.invokeAll, and found that it did not fully meet the requirements. Example:
long dt = System.nanoTime ();
ExecutorService pool = Executors.newFixedThreadPool (4);
List <Callable <String>> list = new ArrayList <> ();
for (int i = 0; i < 10; i++) {
list.add (new Callable <String> () {
@Override
public String call () throws Exception {
while (true) {
}
}
});
}
System.out.println ("Start at " + (System.nanoTime () - dt) / 1000000 + "ms");
try {
pool.invokeAll (list, 3000, TimeUnit.MILLISECONDS);
}
catch (InterruptedException e) {
}
System.out.println ("End at " + (System.nanoTime () - dt) / 1000000 + "ms");
Start at 1ms
End at 3028ms
This (27 ms delay) may not seem too bad, but an infinite loop is rather easy to break out of - the actual program easily experiences ten times that. My expectation is that a timeout request is met with very high accuracy even under heavy load (I'm thinking along the lines of a hardware interrupt, which should always work).
This is a major pain in my particular program, as it needs to heed certain timeouts rather accurately (for instance, around 100 ms, if possible better). However, starting the pool often takes as long as 400 ms until I get control back, pushing past the deadline.
I'm a bit confused why this problem is almost never mentioned. Most of the answers I have seen definitely suffer from this. I suppose it may be acceptable usually, but in my case it's not.
Is there a clean and tested way to go ahead with this issue?
Edited to add:
My program puts some load on the GC, even though not on a large scale. For testing purposes, I rewrote the above example and found that the results given are very inconsistent, but on average noticeably worse than before.
long dt = System.nanoTime ();
ExecutorService pool = Executors.newFixedThreadPool (40);
List <Callable <String>> list = new ArrayList <> ();
for (int i = 0; i < 10; i++) {
list.add (new Callable <String> () {
@Override
public String call () throws Exception {
String s = "";
while (true) {
s += Long.toString (System.nanoTime ());
if (s.length () > 1000000) {
s = "";
}
}
}
});
}
System.out.println ("Start at " + (System.nanoTime () - dt) / 1000000 + "ms");
try {
pool.invokeAll (list, 1000, TimeUnit.MILLISECONDS);
}
catch (InterruptedException e) {
}
System.out.println ("End at " + (System.nanoTime () - dt) / 1000000 + "ms");
Start at 1ms
End at 1189ms

invokeAll should work just fine. However, it is vital that you write your tasks to properly respond to interrupts. When catching InterruptedException, they should exit immediately. If your code is catching IOException, each such catch-block should be preceded with something like:
} catch (InterruptedIOException e) {
logger.log(Level.FINE, "Interrupted; exiting", e);
return;
}
If you are using Channels, you will want to handle ClosedByInterruptException the same way.
If you perform time-consuming operations that don't catch the above exceptions, you need to check Thread.interrupted() periodically. Obviously, checking more often is better, though there will be a point of diminishing returns. (Meaning, checking it after every single statement in your task probably isn't useful.)
if (Thread.interrupted()) {
logger.fine("Interrupted; exiting");
return;
}
In your example code, your Callable is not checking the interrupt status at all, so my guess is that it never exits. An interrupt does not actually stop a thread; it just signals the thread that it should terminate itself on its own terms.
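To make the point about interrupts concrete, here is a minimal sketch of a busy-looping task that polls its interrupt flag, run through invokeAll with a timeout. The class and method names are made up for illustration, and the sketch does not address the GC pauses mentioned in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class InterruptibleTasks {

    // Busy loop that polls the interrupt flag so it can stop when cancelled.
    static Callable<String> spinner() {
        return () -> {
            long iterations = 0;
            while (!Thread.currentThread().isInterrupted()) {
                iterations++;
            }
            return "stopped after " + iterations + " iterations";
        };
    }

    // Runs `count` spinners with a timeout and returns how many were cancelled.
    static int runWithTimeout(int count, int threads, long timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Callable<String>> tasks = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            tasks.add(spinner());
        }
        // invokeAll cancels (and interrupts) every task that has not finished in time.
        List<Future<String>> futures =
                pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
        pool.shutdown();
        // Because the tasks honour the interrupt, the worker threads can end promptly.
        pool.awaitTermination(5, TimeUnit.SECONDS);

        int cancelled = 0;
        for (Future<String> f : futures) {
            if (f.isCancelled()) {
                cancelled++;
            }
        }
        return cancelled;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        int cancelled = runWithTimeout(10, 4, 3000);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(cancelled + " tasks cancelled, returned after " + ms + " ms");
    }
}
```

Without the isInterrupted() check the loop would keep running after the timeout, and the pool's threads would never finish.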

Using the VM option -XX:+PrintGCDetails, I found that the GC runs more rarely, but with a far larger time delay than expected. That delay just so happens to coincide with the spikes I experienced.
A mundane and sad explanation for the observed behavior.

How to test task performance using multithreading?

I have some exercises, and one of them is about concurrency. This topic is new to me; I spent 6 hours on it and finally solved my problem. But my knowledge of the corresponding API is poor, so I need advice: is my solution correct, or is there a more appropriate way?
So, I have to implement the following interface:
public interface PerformanceTester {
/**
* Runs a performance test of the given task.
* @param task which task to do performance tests on
* @param executionCount how many times the task should be executed in total
* @param threadPoolSize how many threads to use
*/
public PerformanceTestResult runPerformanceTest(
Runnable task,
int executionCount,
int threadPoolSize) throws InterruptedException;
}
where PerformanceTestResult contains total time (how long the whole performance test took in total), minimum time (how long the shortest single execution took) and maximum time (how long the longest single execution took).
So, I learned many new things today - about thread pools, types Executors, ExecutorService, Future, CompletionService etc.
If I had a Callable task, I could do the following:
Return the current time at the end of the call() method.
Create some data structure (maybe a Map) to store the start time and the Future object returned by fixedThreadPool.submit(task) (do this executionCount times, in a loop).
After execution I could just subtract the start time from the end time for every Future.
(Is this the right way in the case of a Callable task?)
But I have only a Runnable task, so I continued looking. I even created a FutureListener implements Callable<Long> that has to return the time when Future.isDone(), but it seems a little crazy to me (I have to double the thread count).
So, eventually I noticed the CompletionService type with its interesting method take(), which "retrieves and removes the Future representing the next completed task, waiting if none are yet present", and a very nice example of using ExecutorCompletionService. And here is my solution.
public class PerformanceTesterImpl implements PerformanceTester {
@Override
public PerformanceTestResult runPerformanceTest(Runnable task,
int executionCount, int threadPoolSize) throws InterruptedException {
long totalTime = 0;
long[] times = new long[executionCount];
ExecutorService pool = Executors.newFixedThreadPool(threadPoolSize);
//create list of executionCount tasks
ArrayList<Runnable> solvers = new ArrayList<Runnable>();
for (int i = 0; i < executionCount; i++) {
solvers.add(task);
}
CompletionService<Long> ecs = new ExecutorCompletionService<Long>(pool);
//submit tasks and save time of execution start
for (Runnable s : solvers)
ecs.submit(s, System.currentTimeMillis());
//take Futures one by one in order of completing
for (int i = 0; i < executionCount; ++i) {
long r = 0;
try {
//this is saved time of execution start
r = ecs.take().get();
} catch (ExecutionException e) {
e.printStackTrace();
return null;
}
//put into array difference between current time and start time
times[i] = System.currentTimeMillis() - r;
//calculate sum in array
totalTime += times[i];
}
pool.shutdown();
//sort array to define min and max
Arrays.sort(times);
PerformanceTestResult performanceTestResult = new PerformanceTestResult(
totalTime, times[0], times[executionCount - 1]);
return performanceTestResult;
}
}
So, what can you say? Thanks for replies.

I would use System.nanoTime() for higher resolution timings. You might want to ignore the first 10,000 tests to ensure the JVM has warmed up.
I wouldn't bother creating a List of Runnable and adding it to the Executor. I would instead just submit the tasks to the executor directly.
Using Runnable is not a problem as you get a Future<?> back.
Note: Timing how long the task spends in the queue can make a big difference to the timing. Instead of taking the time from when the task was created you can have the task time itself and return a Long for the time in nano-seconds. How the timing is done should reflect the use case you have in mind.
A simple way to convert a Runnable task into one which times itself.
final Runnable run = ...
ecs.submit(new Callable<Long>() {
    public Long call() {
        long start = System.nanoTime();
        run.run();
        return System.nanoTime() - start;
    }
});
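Putting the self-timing idea together, a sketch of the whole measurement could look like the following. It returns a plain long[] instead of the exercise's PerformanceTestResult (whose constructor is only known from the question), and the class name SelfTimingTester is an assumption; it also assumes executionCount is at least 1.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SelfTimingTester {

    // Returns {wallClockNanos, minNanos, maxNanos} for executionCount runs of task.
    static long[] measure(Runnable task, int executionCount, int threadPoolSize)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threadPoolSize);
        CompletionService<Long> ecs = new ExecutorCompletionService<>(pool);
        long wallStart = System.nanoTime();
        try {
            for (int i = 0; i < executionCount; i++) {
                // Each task times only its own run, so time spent waiting in the
                // executor's queue is excluded from the per-task figure.
                ecs.submit(() -> {
                    long start = System.nanoTime();
                    task.run();
                    return System.nanoTime() - start;
                });
            }
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = 0; i < executionCount; i++) {
                long t = ecs.take().get();
                min = Math.min(min, t);
                max = Math.max(max, t);
            }
            // The total is wall-clock time for the whole test, not a sum of task times.
            return new long[] {System.nanoTime() - wallStart, min, max};
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] r = measure(() -> Math.sqrt(12345.678), 10_000, 4);
        System.out.println("total=" + r[0] + "ns min=" + r[1] + "ns max=" + r[2] + "ns");
    }
}
```

Whether the total should be wall-clock time or the sum of the individual times depends on what the exercise means by "in total"; the sketch picks wall-clock time.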

There are many intricacies when writing performance tests in the JVM. You probably aren't worried about them as this is an exercise, but if you are this question might have more information:
How do I write a correct micro-benchmark in Java?
That said, there don't seem to be any glaring bugs in your code. You might want to ask this on the lower traffic code-review site if you want a full review of your code:
http://codereview.stackexchange.com

Fibonacci on Java ExecutorService runs faster sequentially than in parallel

I am trying out the executor service in Java, and wrote the following code to run Fibonacci (yes, the massively recursive version, just to stress out the executor service).
Surprisingly, it will run faster if I set the nThreads to 1. It might be related to the fact that the size of each "task" submitted to the executor service is really small. But still it must be the same number also if I set nThreads to 1.
To see if the access to the shared Atomic variables can cause this issue, I commented out the three lines with the comment "see text", and looked at the system monitor to see how long the execution takes. But the results are the same.
Any idea why this is happening?
BTW, I wanted to compare it with the similar implementation with Fork/Join. It turns out to be way slower than the F/J implementation.
public class MainSimpler {
static int N=35;
static AtomicInteger result = new AtomicInteger(0), pendingTasks = new AtomicInteger(1);
static ExecutorService executor;
public static void main(String[] args) {
int nThreads=2;
System.out.println("Number of threads = "+nThreads);
executor = Executors.newFixedThreadPool(nThreads);
Executable.inQueue = new AtomicInteger(nThreads);
long before = System.currentTimeMillis();
System.out.println("Fibonacci "+N+" is ... ");
executor.submit(new FibSimpler(N));
waitToFinish();
System.out.println(result.get());
long after = System.currentTimeMillis();
System.out.println("Duration: " + (after - before) + " milliseconds\n");
}
private static void waitToFinish() {
while (0 < pendingTasks.get()){
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
executor.shutdown();
}
}
class FibSimpler implements Runnable {
int N;
FibSimpler (int n) { N=n; }
@Override
public void run() {
compute();
MainSimpler.pendingTasks.decrementAndGet(); // see text
}
void compute() {
int n = N;
if (n <= 1) {
MainSimpler.result.addAndGet(n); // see text
return;
}
MainSimpler.executor.submit(new FibSimpler(n-1));
MainSimpler.pendingTasks.incrementAndGet(); // see text
N = n-2;
compute(); // similar to the F/J counterpart
}
}
Runtime (approximately):
1 thread : 11 seconds
2 threads: 19 seconds
4 threads: 19 seconds
Update:
I notice that even if I use one thread inside the executor service, the whole program will use all four cores of my machine (each core around 80% usage on average). This could explain why using more threads inside the executor service slows down the whole process, but now, why does this program use 4 cores if only one thread is active inside the executor service??

It might be related to the fact that the size of each "task" submitted
to the executor service is really small.
This is certainly the case and as a result you are mainly measuring the overhead of context switching. When nThreads == 1, there is no context switching and thus the performance is better.
But still it must be the same number also if I set nThreads to 1.
I'm guessing you meant 'to higher than 1' here.
You are running into the problem of heavy lock contention. When you have multiple threads, the lock on the result is contended all the time. Threads have to wait for each other before they can update the result and that slows them down. When there is only a single thread, the JVM probably detects that and performs lock elision, meaning it doesn't actually perform any locking at all.
You may get better performance if you don't divide the problem into N tasks, but rather divide it into N/nThreads tasks, which can be handled simultaneously by the threads (assuming you choose nThreads to be at most the number of physical cores/threads available). Each thread then does its own work, calculating its own total and only adding that to a grand total when the thread is done. Even then, for fib(35) I expect the costs of thread management to outweigh the benefits. Perhaps try fib(1000).
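A rough sketch of the coarser-grained approach: expand fib(n) only down to a cutoff, compute each remaining subproblem sequentially inside one task, and add up the per-task results once at the end, so there is no shared counter in the hot path. The class name CoarseFib and the cutoff parameter are assumptions for illustration; a good cutoff would have to be found by measurement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CoarseFib {

    // Plain recursive Fibonacci, executed entirely inside one task.
    static long fibSeq(int n) {
        return n <= 1 ? n : fibSeq(n - 1) + fibSeq(n - 2);
    }

    // Expands fib(n) into subproblems no larger than cutoff; their values sum to fib(n).
    static void split(int n, int cutoff, List<Integer> out) {
        if (n <= cutoff || n <= 1) {
            out.add(n);
            return;
        }
        split(n - 1, cutoff, out);
        split(n - 2, cutoff, out);
    }

    static long fibParallel(int n, int cutoff, int nThreads)
            throws InterruptedException, ExecutionException {
        List<Integer> parts = new ArrayList<>();
        split(n, cutoff, parts);
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int p : parts) {
                futures.add(pool.submit(() -> fibSeq(p)));
            }
            // Each task keeps its own total; results are combined only here.
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long before = System.currentTimeMillis();
        long result = fibParallel(35, 25, 4);
        long after = System.currentTimeMillis();
        System.out.println("fib(35) = " + result + " in " + (after - before) + " ms");
    }
}
```

With n = 35 and a cutoff of 25 this submits 144 tasks instead of millions, so the executor's queueing overhead becomes small relative to the work per task.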
