Java Async Profiler Flame Graph - java

In the scenario below, is Java async-profiler the right tool to see where time is spent when comparing the performance of ArrayBlockingQueue and LinkedBlockingQueue?
On my machine, the total execution time of ABQ is always 25% faster than LBQ when sharing 50M entries between a consumer and a producer. The flame graphs of both are "pretty much" the same, except that the LBQ one shows a handful of samples from JVM object allocation code, but this wouldn't justify a 25% increase. As expected, TLAB allocation in LBQ is much higher.
I was wondering, how can I see which activity (be it code or hardware) is taking the time?
Runner:
import java.util.*;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
public class Runner {
public static void main(String[] args) throws InterruptedException {
int size = 50_000_000;
BlockingQueue<Long> queue = new LinkedBlockingQueue<>(size);
Producer producer = new Producer(queue, size);
Thread t = new Thread(producer);
t.setName("ProducerItIs");
Consumer consumer = new Consumer(queue, size);
Thread t2 = new Thread(consumer);
t2.setName("ConsumerItIs");
t.start();
t2.start();
Thread.sleep(8000);
System.out.println("done");
queue.forEach(System.out::println);
System.out.println(queue.size());
}
}
Producer:
import java.util.Queue;
import java.util.Random;
import java.util.concurrent.BlockingQueue;
public class Producer implements Runnable {
public Producer(BlockingQueue<Long> blockingQueue, int size) {
this.queue = blockingQueue;
this.size = size;
}
private final BlockingQueue<Long> queue;
private final int size;
public void run() {
System.out.println("Started to produce...");
long nanos = System.nanoTime();
Long ii = (long) new Random().nextInt();
for (int j = 0; j < size; j++) {
queue.add(ii);
}
System.out.println("producer Time taken :" + ((System.nanoTime() - nanos) / 1e6));
}
}
Consumer:
import java.util.concurrent.BlockingQueue;
public class Consumer implements Runnable {
private final BlockingQueue<Long> blockingQueue;
private final int size;
private Long value;
public Consumer(BlockingQueue<Long> blockingQueue, int size) {
this.blockingQueue = blockingQueue;
this.size = size;
}
public void run() {
long nanos = System.nanoTime();
System.out.println("Starting to consume...");
int i = 1;
try {
while (true) {
value = blockingQueue.take();
i++;
if (i >= size) {
break;
}
}
System.out.println("Consumer Time taken :" + ((System.nanoTime() - nanos)/1e6));
} catch (Exception exp) {
System.out.println(exp);
}
}
public long getValue() {
return value;
}
}
With ArrayBlockingQueue: (flame graph image)
With LinkedBlockingQueue: (flame graph image; a black arrow marks the samples captured for allocations)
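Before digging into the flame graphs, it may help to time both queues under identical producer/consumer code and to wait with join() instead of a fixed sleep. The following is only a minimal sketch under stated assumptions: the class QueueTransferTimer and its transfer method are hypothetical names, the element count is reduced from the question's 50M, and put/take are used so that neither side fails on a full or empty queue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueTransferTimer {

    // Moves n elements through the queue with one producer and one consumer
    // and returns how many elements the consumer received.
    public static int transfer(BlockingQueue<Long> queue, int n) {
        final int[] received = {0};
        final Long payload = 42L; // one shared boxed value, as in the question
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) queue.put(payload);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "producer");
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) {
                    queue.take();
                    received[0]++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "consumer");
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join(); // join() makes received[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return received[0];
    }

    public static void main(String[] args) {
        int n = 1_000_000; // assumption: smaller than the question's 50M
        for (int round = 0; round < 3; round++) { // early rounds warm up the JIT
            long t0 = System.nanoTime();
            transfer(new ArrayBlockingQueue<>(n), n);
            long t1 = System.nanoTime();
            transfer(new LinkedBlockingQueue<>(n), n);
            long t2 = System.nanoTime();
            System.out.printf("ABQ %.1f ms, LBQ %.1f ms%n", (t1 - t0) / 1e6, (t2 - t1) / 1e6);
        }
    }
}
```

Because both runs share the same code path apart from the queue implementation, any stable difference between the two numbers can be attributed to the queue itself (locking scheme, per-element node allocation, GC activity) rather than to the harness.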

Related

Blocking queue - Is client side locking needed?

As mentioned by Java_author:
5.1.1. Problems with Synchronized Collections
The synchronized collections are thread-safe, but you may sometimes need to use additional client-side locking to guard compound actions.
Example - Multiple producer/consumer problem:
An algorithm that uses busy waiting for multiple producers and consumers working on a thread-unsafe buffer requires:
global RingBuffer queue; // A thread-unsafe ring-buffer of tasks.
global Lock queueLock; // A mutex for the ring-buffer of tasks.
But the code below runs a busy-wait (while(true){..}) algorithm on a thread-safe buffer (queue), without a lock:
/* NumbersProducer.java */
package responsive.blocking.prodcons;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadLocalRandom;
public class NumbersProducer implements Runnable{
private BlockingQueue<Integer> numbersQueue;
private final int poisonPill;
private final int poisonPillPerProducer;
public NumbersProducer(BlockingQueue<Integer> numbersQueue, int poisonPill, int poisonPillPerProducer) {
this.numbersQueue = numbersQueue;
this.poisonPill = poisonPill;
this.poisonPillPerProducer = poisonPillPerProducer;
}
@Override
public void run() {
try {
generateNumbers();
}catch(InterruptedException e) {
Thread.currentThread().interrupt();
}
}
private void generateNumbers() throws InterruptedException{
for(int i=0; i < 100; i++) {
numbersQueue.put(ThreadLocalRandom.current().nextInt(100));
}
for(int j=0; j < poisonPillPerProducer; j++) {
numbersQueue.put(poisonPill);
}
}
}
/* NumbersConsumer.java */
package responsive.blocking.prodcons;
import java.util.concurrent.BlockingQueue;
public class NumbersConsumer implements Runnable{
private BlockingQueue<Integer> queue;
private final int poisonPill;
public NumbersConsumer(BlockingQueue<Integer> queue, int poisonPill) {
this.queue = queue;
this.poisonPill = poisonPill;
}
public void run() {
try {
while(true) {
Integer number = queue.take();
if(number.equals(poisonPill)) {
return;
}
System.out.println(Thread.currentThread().getName() + " result: " + number);
}
}catch(InterruptedException e) {
Thread.currentThread().interrupt();
}
}
}
/* Driver.java */
package responsive.blocking.prodcons;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
public class Driver {
public static void main(String[] args) {
int BOUND = 10;
int nProducers = 4;
int nConsumers = Runtime.getRuntime().availableProcessors();
int poisonPill = Integer.MAX_VALUE;
int value = 1;
int poisonPillPerProducer = ((value = nConsumers / nProducers) < 1)?1:value;
BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(BOUND);
for(int i =0; i< nProducers; i++) {
new Thread(new NumbersProducer(queue, poisonPill, poisonPillPerProducer)).start();
}
for(int j=0;j < nConsumers; j++ ) {
new Thread(new NumbersConsumer(queue, poisonPill)).start();
}
}
}
Question:
In the above code, how do I assess the need for additional client-side locking? The key is compound actions...
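To illustrate what "compound action" means here: a single call such as put() or take() is atomic on its own, whereas a check-then-act sequence made of two calls is not, even when each call is individually thread-safe. The sketch below is a hypothetical example (CompoundActionSketch, addIfAbsent and race are illustrative names, not from the book or the question) that guards such a sequence with a client-side lock on a synchronized list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CompoundActionSketch {

    // contains() and add() are each thread-safe on a synchronized list, but the
    // pair is a compound action, so it runs under the list's own monitor.
    public static boolean addIfAbsent(List<Integer> syncList, Integer value) {
        synchronized (syncList) { // client-side lock spanning both calls
            if (!syncList.contains(value)) {
                return syncList.add(value);
            }
            return false;
        }
    }

    // Starts the given number of threads that all try to insert the same value
    // and returns the final size of the list.
    public static int race(int threads) {
        List<Integer> list = Collections.synchronizedList(new ArrayList<>());
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> addIfAbsent(list, 7));
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println(race(8)); // prints 1
    }
}
```

In the producer/consumer code above, each thread only ever issues one queue call at a time (put or take), so no compound action spans several calls and no additional lock appears to be needed.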

Printing numbers in sequence using 3 threads

I have a program where 3 threads are trying to print numbers in sequence from 1 to 10. I am using a CountDownLatch to keep a count.
But the program stops just after printing 1.
Note: I am aware that using AtomicInteger instead of Integer can work. But I am looking to find out the issue in the current code.
public class Worker implements Runnable {
private int id;
private volatile Integer count;
private CountDownLatch latch;
public Worker(int id, Integer count, CountDownLatch latch) {
this.id = id;
this.count = count;
this.latch = latch;
}
@Override
public void run() {
while (count <= 10) {
synchronized (latch) {
if (count % 3 == id) {
System.out.println("Thread: " + id + ":" + count);
count++;
latch.countDown();
}
}
}
}
}
Main program:
public class ThreadSequence {
private static CountDownLatch latch = new CountDownLatch(10);
private volatile static Integer count = 0;
public static void main(String[] args) {
Thread t1 = new Thread(new Worker(0, count, latch));
Thread t2 = new Thread(new Worker(1, count, latch));
Thread t3 = new Thread(new Worker(2, count, latch));
t1.start();
t2.start();
t3.start();
try {
latch.await();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
Edited program with AtomicInteger:
public class ThreadSequence {
private static AtomicInteger atomicInteger = new AtomicInteger(1);
public static void main(String[] args) throws InterruptedException {
Thread t1 = new Thread(new WorkerThread(0, atomicInteger));
Thread t2 = new Thread(new WorkerThread(1, atomicInteger));
Thread t3 = new Thread(new WorkerThread(2, atomicInteger));
t1.start();
t2.start();
t3.start();
t1.join();
t2.join();
t3.join();
System.out.println("Done with main");
}
}
public class WorkerThread implements Runnable {
private int id;
private AtomicInteger atomicInteger;
public WorkerThread(int id, AtomicInteger atomicInteger) {
this.id = id;
this.atomicInteger = atomicInteger;
}
@Override
public void run() {
while (atomicInteger.get() < 10) {
synchronized (atomicInteger) {
if (atomicInteger.get() % 3 == id) {
System.out.println("Thread:" + id + " = " + atomicInteger);
atomicInteger.incrementAndGet();
}
}
}
}
}
But the program stops just after printing 1.
No, this is not what happens. None of the threads terminate.
Each worker has its own count field. Other threads do not write to this field.
Therefore there is only one thread, where if (count % 3 == id) { yields true, which is the one with id = 0. Also this is the only thread that ever modifies the count field and modifying it causes (count % 3 == id) to yield false in subsequent loop iterations, causing an infinite loop in all 3 threads.
Change count to static to fix this.
Edit
In contrast to Integer, AtomicInteger is mutable. It is a class that holds an int value that can be modified. With Integer, every modification of the field replaces its value with a reference to a different Integer object; with AtomicInteger, you only modify the value inside the AtomicInteger object, and all 3 threads continue using the same AtomicInteger instance.
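The difference can be reproduced without any threads at all. The following is a small sketch with hypothetical holder classes (SharedCounterSketch, IntegerHolder and AtomicHolder are illustrative names standing in for the workers): two holders receive the same Integer or the same AtomicInteger, one of them increments, and the other is read afterwards.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedCounterSketch {

    // Stands in for a worker that keeps its own Integer field.
    static class IntegerHolder {
        Integer count;
        IntegerHolder(Integer count) { this.count = count; }
    }

    // Stands in for a worker that keeps a reference to a shared AtomicInteger.
    static class AtomicHolder {
        final AtomicInteger count;
        AtomicHolder(AtomicInteger count) { this.count = count; }
    }

    // Returns what the second holder sees after the first one increments.
    public static int integerSeenByOther() {
        Integer start = 0;
        IntegerHolder a = new IntegerHolder(start);
        IntegerHolder b = new IntegerHolder(start);
        a.count++; // unboxes, adds 1, stores a different Integer in a.count only
        return b.count;
    }

    public static int atomicSeenByOther() {
        AtomicInteger shared = new AtomicInteger(0);
        AtomicHolder a = new AtomicHolder(shared);
        AtomicHolder b = new AtomicHolder(shared);
        a.count.incrementAndGet(); // mutates the one shared object
        return b.count.get();
    }

    public static void main(String[] args) {
        System.out.println(integerSeenByOther()); // prints 0
        System.out.println(atomicSeenByOther());  // prints 1
    }
}
```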
Your "count" is a different variable for each thread, so changing it in one thread doesn't affect the rest, and so they are all waiting for it to change, without any one that can do it.
Keep count as a static member of the Worker class, so that it is shared by all instances of the class.
You can use below code to print sequential numbers using multiple threads -
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
public class ThreadCall extends Thread {
private BlockingQueue<Integer> bq = new ArrayBlockingQueue<Integer>(10);
private ThreadCall next;
public void setNext(ThreadCall t) {
this.next = t;
}
public void addElBQ(int a) {
this.bq.add(a);
}
public ThreadCall(String name) {
this.setName(name);
}
@Override
public void run() {
int x = 0;
while(true) {
try {
x = 0;
x = bq.take();
if (x!=0) {
System.out.println(Thread.currentThread().getName() + " =>" + x);
if (x >= 100) System.exit(0); // Need to stop all running threads
next.addElBQ(x+1);
}
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
public static void main(String[] args) {
int THREAD_COUNT = 10;
List<ThreadCall> listThread = new ArrayList<>();
for (int i=1; i<=THREAD_COUNT; i++) {
listThread.add(new ThreadCall("Thread " + i));
}
for (int i = 0; i < listThread.size(); i++) {
if (i == listThread.size()-1) {
listThread.get(i).setNext(listThread.get(0));
}
else listThread.get(i).setNext(listThread.get(i+1));
}
listThread.get(0).addElBQ(1);
for (int i = 0; i < listThread.size(); i++) {
listThread.get(i).start();
}
}
}
I hope this will resolve your problem

Why java thread behaves differently when counter is larger in my class?

Here is my code:
public class MyRunnableClass implements Runnable {
static int x = 25;
int y = 0;
private static final Object sharedLock = new Object();
@Override
public void run() {
while(x>0){
someMethod();
}
}
public synchronized void someMethod(){
synchronized (sharedLock){
x--;
y++;
}
}
}
and the test class:
public class MyRunnableClassTest {
public static void main(String[] args) throws InterruptedException {
MyRunnableClass aa = new MyRunnableClass();
MyRunnableClass bb = new MyRunnableClass();
Thread a = new Thread(aa);
Thread b = new Thread(bb);
a.start();
b.start();
a.join();
b.join();
System.out.println(aa.y + bb.y);
}
}
When I run this code as it is I see output 25 which is fine, but when x is 250, I see 251.. Why? Why not 250?
You have to extend the synchronized scope, so that it also covers the read operation on x:
@Override
public void run() {
for (;;) {
synchronized (sharedLock) {
if (x <= 0) break;
someMethod();
}
}
}
Coincidence. The same thing could happen with 25, just like any other number.
For example, during execution of
while(x>0){
someMethod();
}
which is not synchronized over, after a bunch of looping, let's take x to be 1. The first thread starts iterating (enters the body), then threads switch, the second thread sees x is 1, so enters the loop body as well. Both will increment their count and their sum will be equal to one more than the original x value.
This is a race condition and you just happen to see the consequences more easily with larger numbers.
When you are doing:
while(x>0){
someMethod();
}
Let's say x = 1 and:
Thread A evaluates x > 0 to true, and enters the loop.
Let's say Thread A gets interrupted before the next line executes.
Thread B will also evaluate x > 0 to true and enter the loop.
Both will decrement x one after the other and increment their y.
To solve this, the check for x > 0 must be in the lock as well.
Ex:
public class MyRunnableClass implements Runnable {
static int x = 25;
int y = 0;
private static final Object sharedLock = new Object();
@Override
public void run() {
while(x>0){
someMethod();
}
}
public synchronized void someMethod(){
synchronized (sharedLock){
if(x > 0){
x--;
y++;
}
}
}
}
Sometimes, both Thread a and Thread b can call someMethod() because x was 1. One thread locks the sharedLock, makes x equal to 0 and the total of the y values equal to 250, and then releases the sharedLock, at which point the other thread calls someMethod() and makes the total equal to 251 and x equal to -1.
You could also solve that problem with AtomicInteger:
import java.util.concurrent.atomic.AtomicInteger;
public class MyRunnableClass implements Runnable {
private static final AtomicInteger xHolder = new AtomicInteger(25);
int y = 0;
@Override
public void run() {
while (xHolder.decrementAndGet() >= 0) {
y++;
}
}
public static void main(String[] args) throws InterruptedException {
MyRunnableClass aa = new MyRunnableClass();
MyRunnableClass bb = new MyRunnableClass();
Thread a = new Thread(aa);
Thread b = new Thread(bb);
a.start();
b.start();
a.join();
b.join();
System.out.println(aa.y + bb.y);
}
}
Or with some extended parallelism to test:
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
public class MyRunnableClass implements Callable<Integer> {
private static final AtomicInteger xHolder = new AtomicInteger(250000);
int y = 0;
@Override
public Integer call() throws Exception {
while (xHolder.decrementAndGet() >= 0) {
y++;
}
System.out.println(Thread.currentThread().getName() + " returns " + y);
return y;
}
public static void main(String[] args) throws InterruptedException, ExecutionException {
ExecutorService executorService = Executors.newCachedThreadPool();
ExecutorCompletionService<Integer> completionService = new ExecutorCompletionService<Integer>(
executorService);
int parallelism = 5;
for (int i = 0; i < parallelism; ++i) {
completionService.submit(new MyRunnableClass());
} // for
int ySum = 0;
for (int j = 0; j < parallelism; ++j) {
Future<Integer> future = completionService.take();
ySum += future.get();
} // for
System.out.println(ySum);
executorService.shutdown();
}
}
Output:
pool-1-thread-3 returns 26619
pool-1-thread-5 returns 0
pool-1-thread-1 returns 104302
pool-1-thread-2 returns 95981
pool-1-thread-4 returns 23098
250000

Understanding the main loop in Streams API's ForEachTask

It seems that the centerpiece of Java Streams' parallelization is the ForEachTask. Understanding its logic appears to be essential to acquiring the mental model necessary to anticipate the concurrent behavior of client code written against the Streams API. Yet I find my anticipations contradicted by the actual behavior.
For reference, here is the key compute() method (java/util/stream/ForEachOps.java:253):
public void compute() {
Spliterator<S> rightSplit = spliterator, leftSplit;
long sizeEstimate = rightSplit.estimateSize(), sizeThreshold;
if ((sizeThreshold = targetSize) == 0L)
targetSize = sizeThreshold = AbstractTask.suggestTargetSize(sizeEstimate);
boolean isShortCircuit = StreamOpFlag.SHORT_CIRCUIT.isKnown(helper.getStreamAndOpFlags());
boolean forkRight = false;
Sink<S> taskSink = sink;
ForEachTask<S, T> task = this;
while (!isShortCircuit || !taskSink.cancellationRequested()) {
if (sizeEstimate <= sizeThreshold ||
(leftSplit = rightSplit.trySplit()) == null) {
task.helper.copyInto(taskSink, rightSplit);
break;
}
ForEachTask<S, T> leftTask = new ForEachTask<>(task, leftSplit);
task.addToPendingCount(1);
ForEachTask<S, T> taskToFork;
if (forkRight) {
forkRight = false;
rightSplit = leftSplit;
taskToFork = task;
task = leftTask;
}
else {
forkRight = true;
taskToFork = leftTask;
}
taskToFork.fork();
sizeEstimate = rightSplit.estimateSize();
}
task.spliterator = null;
task.propagateCompletion();
}
On a high level of description, the main loop keeps breaking down the spliterator, alternately forking off the processing of the chunk and processing it inline, until the spliterator refuses to split further or the remaining size is below the computed threshold.
Now consider the above algorithm in the case of unsized streams, where the whole is not being split into roughly equal halves; instead chunks of predetermined size are being repeatedly taken from the head of the stream. In this case the "suggested target size" of the chunk is abnormally large, which basically means that the chunks are never re-split into smaller ones.
The algorithm would therefore appear to alternately fork off one chunk, then process one inline. If each chunk takes the same time to process, this should result in no more than two cores being used. However, the actual behavior is that all four cores on my machine are occupied. Obviously, I am missing an important piece of the puzzle with that algorithm.
What is it that I'm missing?
Appendix: test code
Here is a piece of self-contained code which may be used to test the behavior which is the subject of this question:
package test;
import static java.util.concurrent.TimeUnit.NANOSECONDS;
import static java.util.concurrent.TimeUnit.SECONDS;
import static test.FixedBatchSpliteratorWrapper.withFixedSplits;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
public class Parallelization {
static final AtomicLong totalTime = new AtomicLong();
static final ExecutorService pool = Executors.newFixedThreadPool(4);
public static void main(String[] args) throws IOException {
final long start = System.nanoTime();
final Path inputPath = createInput();
System.out.println("Start processing");
try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(Paths.get("output.txt")))) {
withFixedSplits(Files.newBufferedReader(inputPath).lines(), 200).map(Parallelization::processLine)
.forEach(w::println);
}
final double cpuTime = totalTime.get(), realTime = System.nanoTime() - start;
final int cores = Runtime.getRuntime().availableProcessors();
System.out.println(" Cores: " + cores);
System.out.format(" CPU time: %.2f s\n", cpuTime / SECONDS.toNanos(1));
System.out.format(" Real time: %.2f s\n", realTime / SECONDS.toNanos(1));
System.out.format("CPU utilization: %.2f%%", 100.0 * cpuTime / realTime / cores);
}
private static String processLine(String line) {
final long localStart = System.nanoTime();
double ret = 0;
for (int i = 0; i < line.length(); i++)
for (int j = 0; j < line.length(); j++)
ret += Math.pow(line.charAt(i), line.charAt(j) / 32.0);
final long took = System.nanoTime() - localStart;
totalTime.getAndAdd(took);
return NANOSECONDS.toMillis(took) + " " + ret;
}
private static Path createInput() throws IOException {
final Path inputPath = Paths.get("input.txt");
try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(inputPath))) {
for (int i = 0; i < 6_000; i++) {
final String text = String.valueOf(System.nanoTime());
for (int j = 0; j < 20; j++)
w.print(text);
w.println();
}
}
return inputPath;
}
}
package test;
import static java.util.Spliterators.spliterator;
import static java.util.stream.StreamSupport.stream;
import java.util.Comparator;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.Stream;
public class FixedBatchSpliteratorWrapper<T> implements Spliterator<T> {
private final Spliterator<T> spliterator;
private final int batchSize;
private final int characteristics;
private long est;
public FixedBatchSpliteratorWrapper(Spliterator<T> toWrap, long est, int batchSize) {
final int c = toWrap.characteristics();
this.characteristics = (c & SIZED) != 0 ? c | SUBSIZED : c;
this.spliterator = toWrap;
this.batchSize = batchSize;
this.est = est;
}
public FixedBatchSpliteratorWrapper(Spliterator<T> toWrap, int batchSize) {
this(toWrap, toWrap.estimateSize(), batchSize);
}
public static <T> Stream<T> withFixedSplits(Stream<T> in, int batchSize) {
return stream(new FixedBatchSpliteratorWrapper<>(in.spliterator(), batchSize), true);
}
@Override public Spliterator<T> trySplit() {
final HoldingConsumer<T> holder = new HoldingConsumer<>();
if (!spliterator.tryAdvance(holder)) return null;
final Object[] a = new Object[batchSize];
int j = 0;
do a[j] = holder.value; while (++j < batchSize && tryAdvance(holder));
if (est != Long.MAX_VALUE) est -= j;
return spliterator(a, 0, j, characteristics());
}
@Override public boolean tryAdvance(Consumer<? super T> action) {
return spliterator.tryAdvance(action);
}
@Override public void forEachRemaining(Consumer<? super T> action) {
spliterator.forEachRemaining(action);
}
@Override public Comparator<? super T> getComparator() {
if (hasCharacteristics(SORTED)) return null;
throw new IllegalStateException();
}
@Override public long estimateSize() { return est; }
@Override public int characteristics() { return characteristics; }
static final class HoldingConsumer<T> implements Consumer<T> {
Object value;
@Override public void accept(T value) { this.value = value; }
}
}
Ironically, the answer is almost stated in the question: as the "left" and "right" tasks take turns at being forked vs. processed inline, half of the time the right task, represented by this, i.e. the complete rest of the stream, is being forked off. That means that the forking off of chunks is just slowed down a bit (happening every other time), but clearly it happens.
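The alternation described above can be imitated with a much simpler fork/join model. This is only a rough sketch, not the JDK implementation: it uses RecursiveAction with join() instead of CountedCompleter with pending counts, and AlternatingForkSketch, RestTask and run are hypothetical names. Each "rest" task forks one chunk, then forks the remainder of the source as a new rest task and processes a second chunk inline, so a new rest task (and hence more forked work) appears on every second chunk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.stream.IntStream;

public class AlternatingForkSketch {

    // Simplified model of the task that owns "the rest of the stream".
    static class RestTask extends RecursiveAction {
        private final Iterator<Integer> source;
        private final int batchSize;
        private final ConcurrentLinkedQueue<String> log;

        RestTask(Iterator<Integer> source, int batchSize, ConcurrentLinkedQueue<String> log) {
            this.source = source;
            this.batchSize = batchSize;
            this.log = log;
        }

        // Takes up to batchSize elements from the head of the source.
        private List<Integer> nextChunk() {
            List<Integer> chunk = new ArrayList<>(batchSize);
            while (chunk.size() < batchSize && source.hasNext()) chunk.add(source.next());
            return chunk;
        }

        private void process(List<Integer> chunk) {
            for (Integer i : chunk) log.add(Thread.currentThread().getName() + ":" + i);
        }

        @Override
        protected void compute() {
            List<Integer> first = nextChunk();
            if (first.isEmpty()) return;
            // Turn 1 (forkRight == false in the original): fork the chunk, keep the rest.
            RecursiveAction chunkTask = new RecursiveAction() {
                @Override protected void compute() { process(first); }
            };
            chunkTask.fork();
            List<Integer> second = nextChunk();
            RestTask rest = null;
            if (!second.isEmpty()) {
                // Turn 2 (forkRight == true): fork the rest, process this chunk inline.
                // After this fork, only the rest task touches the source.
                rest = new RestTask(source, batchSize, log);
                rest.fork();
                process(second);
            }
            if (rest != null) rest.join();
            chunkTask.join();
        }
    }

    // Processes the given number of elements and returns how many were logged.
    public static int run(int elements, int batchSize) {
        ConcurrentLinkedQueue<String> log = new ConcurrentLinkedQueue<>();
        Iterator<Integer> source = IntStream.range(0, elements).boxed().iterator();
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            pool.invoke(new RestTask(source, batchSize, log));
        } finally {
            pool.shutdown();
        }
        return log.size();
    }

    public static void main(String[] args) {
        System.out.println(run(6_000, 200)); // prints 6000
    }
}
```

Inspecting the thread names recorded in the log (not printed here) is one way to check how many pool workers actually took part in a given run.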

ConcurrentLinkedQueue fails to add data in a multithreaded environment

In the code below, in extremely rare cases (3 in 1 billion executions of the QueueThread object), execution reaches the if block shown below and queue.size() turns out to be 7999. What could be the reason for this?
if(q.size()<batchsize){
System.out.println("queue size" +q.size());
}
Basically it fails to execute queue.add statement but executes all other statements in the thread.
The code snippet is as below.
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
public class CLinkQueueTest {
public static final int itersize=100000;
public static final int batchsize=8000;
public static final int poolsize=100;
public static void main (String args[]) throws Exception{
int j= 0;
ExecutorService service = Executors.newFixedThreadPool(poolsize);
AtomicInteger counter = new AtomicInteger(poolsize);
ConcurrentLinkedQueue<String> q = new ConcurrentLinkedQueue<String>();
String s ="abc";
while(j<itersize){
int k=0;
while(k<batchsize){
counter.decrementAndGet();
service.submit(new QueueThread(counter, q, s));
if(counter.get()<=0){
Thread.sleep(5);
}
k++;
}
if(j%20 ==0){
System.out.println("Iteration no " + j);
}
while(counter.get() < poolsize){
//wait infinitely
}
if(q.size()<batchsize){
System.out.println("queue size" +q.size());
}
q.clear();
j++;
}
System.out.println("process complete");
}
}
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
public class QueueThread implements Callable<Boolean> {
private AtomicInteger ai;
private Queue<String> qu;
private String st;
public QueueThread(AtomicInteger i, Queue<String> q, String s){
ai = i;
qu = q;
st = s;
}
@Override
public Boolean call() {
try{
qu.add(st);
} catch(Throwable e){
e.printStackTrace();
}finally{
ai.incrementAndGet();
}
return true;
}
}
Could it be that the one time that it registers one too few entries in the queue it is because the Executor has not finished its processing?
It is clear that every time QueueThread.call() is called the queue is added to and the AtomicInteger is incremented. All I can think is that one call has not been performed.
Perhaps you could be a little kinder to the system by using something like:
while(counter.get() < poolsize){
//wait infinitely
Thread.sleep(5); // sleep is static, so call it on the class
}
but that's just my opinion.
The documentation for the ConcurrentLinkedQueue.size method says:
Beware that, unlike in most collections, this method is NOT a constant-time operation. Because of the asynchronous nature of these queues, determining the current number of elements requires an O(n) traversal. Additionally, if elements are added or removed during execution of this method, the returned result may be inaccurate. Thus, this method is typically not very useful in concurrent applications.
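Given that caveat, one option is to read size() only once no add can still be in flight. The sketch below does not diagnose why the original program occasionally saw 7999; it merely shows, under the assumption that each batch may be submitted as a whole, how ExecutorService.invokeAll can replace the hand-rolled counter. BatchSizeSketch and fillBatch are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchSizeSketch {

    // Adds batchSize elements from pool threads and returns the queue size
    // measured after every task has finished.
    public static int fillBatch(int batchSize, int poolSize) {
        ExecutorService service = Executors.newFixedThreadPool(poolSize);
        ConcurrentLinkedQueue<String> q = new ConcurrentLinkedQueue<>();
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (int k = 0; k < batchSize; k++) {
                tasks.add(() -> q.add("abc"));
            }
            // invokeAll returns only after all tasks have completed.
            for (Future<Boolean> f : service.invokeAll(tasks)) {
                f.get(); // rethrows any failure from the task
            }
            return q.size(); // no concurrent modification at this point
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            service.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(fillBatch(8000, 100)); // prints 8000
    }
}
```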
