I have implemented a 5-stage CPU instruction pipeline simulator in Java using multithreading.
Each stage is a thread that performs the three functions below; there is also a queue (of capacity 1) between every two stages:
1. Receive from the previous stage.
2. Process, i.e., perform its main responsibility.
3. Forward to the next stage.
@Override
public void run() {
    while (!latchQueue.isEmpty()) {
        fetch();
        process();
        forward();
    }
}
The simulation works fine. Here is where I'm stuck: I want to simulate only a specified number of clock cycles, so the simulator should stop/pause once it has reached that number of cycles.
As of now, I start all five threads and let them process all the instructions, rather than limiting the run by clock cycles.
How can I accomplish this? Do I need to pause the threads once the specified number of clock cycles has been reached? If so, how can I gracefully suspend/stop them? Please help me choose the best approach.
Thanks in advance :)
You are already using some concurrent queue to communicate between the threads (exactly how it works isn't clear because your code example is quite incomplete).
So you can count cycles at the first stage and use that same mechanism to communicate: shove a sentinel object, which represents "time to stop/pause this thread", onto the queue for the first stage. When processed, it pauses that stage (and is still forwarded to the next stage, so all stages progressively shut down). For example, you could extend the type of objects passed in your queue so that the hierarchy contains both real payload objects (e.g., decoded instructions) and "command objects" like this stop/pause sentinel.
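For example (a minimal sketch of the sentinel idea; LatchItem, Instruction, and StopSentinel are illustrative names, not from your code):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

interface LatchItem {}                        // everything passed between stages
class Instruction implements LatchItem {}     // real payload (a decoded instruction, etc.)
class StopSentinel implements LatchItem {}    // the "stop/pause" command object

class Stage implements Runnable {
    final BlockingQueue<LatchItem> in = new ArrayBlockingQueue<>(1);
    BlockingQueue<LatchItem> out;             // wired to the next stage's "in" queue

    @Override
    public void run() {
        try {
            while (true) {
                LatchItem item = in.take();   // receive from the previous stage
                if (item instanceof StopSentinel) {
                    out.put(item);            // forward so the next stage stops too
                    return;                   // or park here to implement "pause"
                }
                // process(item) ... then forward to the next stage:
                out.put(item);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}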
Another, asynchronous solution would be to Thread.interrupt() each thread and add an interrupt check in your processing loop - that's mostly for graceful shutdown, and not so much for supporting a "pause" feature.
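The interrupt check could be as simple as this (reusing the run() shape from your question):

@Override
public void run() {
    while (!Thread.currentThread().isInterrupted()) {
        fetch();
        process();
        forward();
    }
    // Note: if fetch() blocks on the queue, it should catch InterruptedException,
    // re-interrupt the thread, and return, so this loop condition is honored.
}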
Will the following work?
Share the following class, CyclesManager, between all the threads representing stages. It has a tryReserve method; getting true from it means the thread has been granted enough "clock cycles" for its next run, while getting false means there are not enough cycles left. The class is thread-safe.
After getting false, your thread should probably just stop (i.e., return from run()): there is no way it can get enough cycles (per your requirements, as I understood them) until the whole session is run again.
import java.util.concurrent.atomic.AtomicInteger;

class CyclesManager {
    private final AtomicInteger cycles;

    CyclesManager(int initialTotalCycles) {
        if (initialTotalCycles < 0)
            throw new IllegalArgumentException("Negative initial cycles: " + initialTotalCycles);
        cycles = new AtomicInteger(initialTotalCycles);
    }

    /**
     * Tries to reserve the given number of cycles from the available total; the total is decreased accordingly.
     * The method is thread-safe: the total stays consistent if called from several threads concurrently.
     *
     * @param cyclesToReserve how many cycles we want
     * @return {@code true} if the cycles are ours, {@code false} if not enough are left
     */
    boolean tryReserve(int cyclesToReserve) {
        for (;;) {
            int currentCycles = cycles.get();
            if (currentCycles < cyclesToReserve)
                return false;
            if (cycles.compareAndSet(currentCycles, currentCycles - cyclesToReserve))
                return true;
            // Another thread updated the count in between; re-read and retry,
            // otherwise a failed CAS would wrongly report "not enough cycles".
        }
    }
}
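Hypothetical use from each stage's run() loop, reserving one cycle per pass:

// cyclesManager is the shared CyclesManager instance
if (!cyclesManager.tryReserve(1))
    return;   // not enough cycles left for another pass: stop this stage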
Related
I am reading Java 8 in Action, chapter 11 (about CompletableFutures), and it got me thinking about my company's code base.
The book says that with code like the example below, you will only run 4 tasks at a time (on a 4-core computer). That means that if you want to perform, for example, 10 operations asynchronously, you will first run the first 4 CompletableFutures, then the next 4, and then the 2 remaining ones, because the default ForkJoinPool.commonPool() only provides a number of threads equal to Runtime.getRuntime().availableProcessors().
In my company's code base, there are @Service classes called AsyncHelpers that contain a load() method, which uses CompletableFutures to load information about a product asynchronously in separate chunks. I was wondering if they only use 4 threads at a time.
There are several such async helpers in my company's code base; for example, there's one for the product list page (PLP) and one for the product details page (PDP). A product details page is a page dedicated to a specific product, showing its detailed characteristics, cross-sell products, similar products, and many more things.
There was an architectural decision to load the details of the PDP page in chunks. The loading is supposed to happen asynchronously, and the current code uses CompletableFutures. Let's look at the pseudocode:
static PdpDto load(String productId) {
    CompletableFuture<Details> photoFuture =
        CompletableFuture.supplyAsync(() -> loadPhotoDetails(productId));
    CompletableFuture<Details> characteristicsFuture =
        CompletableFuture.supplyAsync(() -> loadCharacteristics(productId));
    CompletableFuture<Details> variations =
        CompletableFuture.supplyAsync(() -> loadVariations(productId));
    // ... many more futures
    try {
        // construct a DTO that combines all Details objects into one
        return new PdpDto(
            photoFuture.get(),
            characteristicsFuture.get(),
            variations.get(),
            // ... many more future.get()s
        );
    } catch (ExecutionException | InterruptedException e) {
        return new PdpDto(); // something went wrong, return an empty DTO
    }
}
As you can see, the code above uses no custom executors.
Does this mean that if the load method has 10 CompletableFutures and there are currently 2 people loading the PDP page, so we have 20 CompletableFutures to run in total, then those 20 CompletableFutures won't all be executed at once, but only 4 at a time?
My colleague told me that each user will get 4 threads, but I think the JavaDoc quite clearly states this:
public static ForkJoinPool commonPool()
Returns the common pool instance. This pool is statically constructed; its run state is unaffected by attempts to shutdown() or shutdownNow(). However this pool and any ongoing processing are automatically terminated upon program System.exit(int). Any program that relies on asynchronous task processing to complete before program termination should invoke commonPool().awaitQuiescence, before exit.
Which means that there's only 1 pool with 4 threads for all users of our website.
Yes, but it’s worse than that...
The default size of the common pool is 1 less than the number of processors/cores (or 1 if there’s only 1 processor), so you’re actually processing 3 at a time, not 4.
But your biggest performance hit is with parallel streams (if you use them), because they use the common pool too. Streams are meant to be used for super fast processing, so you don’t want them to share their resources with heavy tasks.
If you have tasks that are designed to be async (i.e., take more than a few milliseconds), then you should create a pool to run them in. Such a pool can be statically created and reused by all calling threads, which avoids the overhead of creating a pool per use. You should also tune the pool size by stress testing your code to find the optimum size that maximizes throughput and minimizes response time.
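For instance, a shared, statically created pool might be passed to supplyAsync like this (a sketch building on the question's pseudocode; the pool size of 32 is only a placeholder to be found by stress testing):

import java.util.concurrent.*;

class PdpLoader {
    // One pool shared by all callers; size it by measurement, not by core count.
    private static final ExecutorService PDP_POOL = Executors.newFixedThreadPool(32);

    static CompletableFuture<Details> loadPhotoAsync(String productId) {
        // Passing the executor explicitly keeps this work off the common ForkJoinPool.
        return CompletableFuture.supplyAsync(() -> loadPhotoDetails(productId), PDP_POOL);
    }
}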
In my company's code base, there are [...] classes [...] that contain a method load(), that uses CompletableFutures to load information [...]
So, are you saying that the load() method waits for I/O to complete?
If so, and if what @Bohemian says is true, then you should not be using the default thread pool.
@Bohemian says that the default pool has approximately the same number of threads as your host has CPUs. That's great if your application has a lot of compute-bound tasks to perform in the background. But it's not so great if your application has a lot of threads waiting for replies from different network services. That's a whole different story.
I am not an expert on the subject, and I don't know how (apart from experimenting) to find out what the best number of threads is, but whatever that number is, it will have little to do with how many CPUs your system has, and therefore you should not be using the default pool for that purpose.
I've been introduced to LMAX and the wonderful concept called the RingBuffer.
People say that writing to the ring buffer with only one thread performs far better than with multiple producers...
However, I don't find it possible for a typical application to use only one thread for writes to the ring buffer... I don't really understand how LMAX does that (if they do). For example, N different traders put orders on the exchange; those are all asynchronous requests that get transformed into orders and put into the ring buffer. How can they possibly write those using one thread?
Question 1. I might be missing something or misunderstanding some aspect, but if you have N concurrent producers, how is it possible to merge them into 1 without them locking each other?
Question 2. I recall RxJava Observables, where you can take N Observables and merge them into 1 using Observable.merge; I wonder whether that is blocking or maintains any lock in any way?
The impact of multi-threaded writing on a RingBuffer is slight, but under very heavy loads it can be significant.
A RingBuffer implementation holds a next node where the next addition will be made. If only one thread is writing to the ring, the process will always complete in the minimum time, i.e. buffer[head++] = newData.
To handle multi-threading while avoiding locks, you would generally do something like while (!buffer[head++].compareAndSet(null, newValue)) {}. This tight loop would continue to execute while other threads were interfering with the storing of the data, thus slowing down the throughput.
Note that I have used pseudo-code above; have a look at getFree in my implementation here for a real example.
// Find the next free element and mark it not free.
private Node<T> getFree() {
    Node<T> freeNode = head.get();
    int skipped = 0;
    // Stop when we hit the end of the list
    // ... or we successfully transit a node from free to not-free.
    // This is the loop that could cause delays under high thread activity.
    while (skipped < capacity && !freeNode.free.compareAndSet(true, false)) {
        skipped += 1;
        freeNode = freeNode.next;
    }
    // ...
}
Internally, RxJava's merge uses a serialization construct I call "emitter-loop", which uses synchronized and is blocking.
Our 'clients' use merge mostly in throughput- and latency-insensitive cases, or in completely single-threaded code, so blocking isn't really an issue there.
It is possible to write a non-blocking serializer, which I call "queue-drain", but merge can't be configured to use that instead.
You can also take a look at JCTools' MpscArrayQueue directly if you are willing to handle the producer and consumer threads manually.
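A hypothetical sketch of that handoff (Order and process() are stand-ins, and the capacity is arbitrary):

import org.jctools.queues.MpscArrayQueue;

MpscArrayQueue<Order> queue = new MpscArrayQueue<>(1 << 16);

// On any of the N producer threads:
while (!queue.offer(order))
    Thread.yield();   // queue full: back off, retry, or drop per your policy

// On the single consumer (drain) thread:
Order next;
while ((next = queue.poll()) != null)
    process(next);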
I'm implementing a thread pool for processing a high-volume market data feed and have a question about the strategy of reusing my worker instances, which implement Runnable and are submitted to the thread pool for execution. In my case I only have one type of worker: it takes a String and parses it to create a Quote object, which is then set on the correct Security. Given the amount of data coming off the feed, there can be upwards of 1,000 quotes to process per second, and I see two ways to create the workers that get submitted to the thread pool.
The first option is simply creating a new instance of a Worker every time a line is retrieved from the underlying socket and adding it to the thread pool, to be garbage collected after its run() method executes. But this got me thinking about performance: does it really make sense to instantiate 1,000 new instances of the Worker class every second? In the same spirit as a thread pool, is it a common pattern to have a Runnable pool or queue as well, so I can recycle my workers and avoid object creation and garbage collection? The way I see this being implemented is that, before returning from its run() method, the Worker adds itself back to a queue of available workers, which is then drawn from when processing new feed lines instead of creating new instances of Worker.
From a performance perspective, do I gain anything by going with the second approach, or does the first make more sense? Has anyone implemented this type of pattern before?
Thanks - Duncan
I use a library I wrote called Java Chronicle for this. It is designed to persist and queue one million quotes per second without producing any significant garbage.
I have a demo here where it sends quote-like objects with nanosecond timing information at a rate of one million messages per second, and it can send tens of millions in a JVM with a 32 MB heap without triggering even a minor collection. The round-trip latency is less than 0.6 microseconds 90% of the time on my ultrabook. ;)
from a performance perspective, do I gain anything by going with the second approach or does the first make more sense?
I strongly recommend not filling your CPU caches with garbage. In fact, I avoid any constructs which create significant garbage. You can build a system which creates less than one object per event, end to end. I have an Eden size which is larger than the amount of garbage I produce in a day, so there are no GCs, minor or full, to worry about.
Has anyone implemented this type of pattern before?
I wrote a profitable low-latency trading system in Java five years ago. At the time, 60 microseconds tick-to-trade was fast enough in Java, but you can do better than that these days.
If you want a low-latency market data processing system, this is the way I do it. You might find this presentation I gave at JavaOne interesting as well.
http://www.slideshare.net/PeterLawrey/writing-and-testing-high-frequency-trading-engines-in-java
EDIT: I have added this parsing example.
ByteBuffer wrap = ByteBuffer.allocate(1024);
ByteBufferBytes bufferBytes = new ByteBufferBytes(wrap);
byte[] bytes = "BAC,12.32,12.54,12.56,232443".getBytes();
int runs = 10000000;
long start = System.nanoTime();
for (int i = 0; i < runs; i++) {
    bufferBytes.reset();
    // read the next message.
    bufferBytes.write(bytes);
    bufferBytes.position(0);
    // decode message
    String word = bufferBytes.parseUTF(StopCharTesters.COMMA_STOP);
    double low = bufferBytes.parseDouble();
    double curr = bufferBytes.parseDouble();
    double high = bufferBytes.parseDouble();
    long sequence = bufferBytes.parseLong();
    if (i == 0) {
        assertEquals("BAC", word);
        assertEquals(12.32, low, 0.0);
        assertEquals(12.54, curr, 0.0);
        assertEquals(12.56, high, 0.0);
        assertEquals(232443, sequence);
    }
}
long time = System.nanoTime() - start;
System.out.println("Average time was " + time / runs + " nano-seconds");
When run with -verbose:gc -Xmx32m, it prints:
Average time was 226 nano-seconds
Note: no GCs are triggered.
I'd use the Executor from the concurrency package. I believe it handles all this for you.
does it really make sense to instantiate 1,000 new instances of the Worker class every second?
Not necessarily. However, you are going to have to put the Runnables into some sort of BlockingQueue to be able to reuse them, and the cost of the queue concurrency may outweigh the GC overhead. Using a profiler, or watching the GC numbers via JConsole, will tell you whether a lot of time is being spent in GC and whether this needs to be addressed.
If this does turn out to be a problem, a different approach would be to put your Strings into your own BlockingQueue and submit the Worker objects to the thread pool only once. Each of the Worker instances would dequeue from the queue of Strings and would never quit. Something like:
public void run() {
    while (!shutdown) {
        try {
            String value = myQueue.take();
            // ... parse value and update the Security ...
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
    }
}
This way you would not need to create thousands of Workers per second.
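A rough, self-contained sketch of that setup (the pool size and queue capacity are guesses to tune, and the comment stands in for your parsing logic):

import java.util.concurrent.*;

class FeedProcessor {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> lines = new LinkedBlockingQueue<>(10_000);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++)
            pool.execute(() -> {
                try {
                    while (true) {
                        String line = lines.take();   // blocks until a line arrives
                        // ... parse line into a Quote and set it on the Security ...
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        // The socket-reading thread just enqueues raw feed lines:
        lines.put("BAC,12.32,12.54,12.56,232443");
    }
}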
Yes, of course, something like this works: the OS and the JVM don't care what is running on a thread, so it is generally good practice to reuse a recyclable object.
I see two questions in your problem: one about thread pooling and another about object pooling. For your thread pooling issue, Java provides ExecutorService. Below is an example of using an ExecutorService:
Runnable r = new Runnable() {
    public void run() {
        // Do some work
    }
};

// Thread pool of size 2
ExecutorService executor = Executors.newFixedThreadPool(2);

// Add the runnable to the executor service
executor.execute(r);
The ExecutorService provides many different types of thread pools with different behaviors.
As far as object pooling is concerned (does it make sense to create 1,000 of your objects per second and then leave them for garbage collection?), it all depends on the statefulness and expense of your object. If you're worried about the state of your worker threads being compromised, you can look at using the flyweight pattern to encapsulate state outside of the worker. Additionally, if you follow the flyweight pattern, you can also look at how useful Future and Callable objects would be in your application architecture.
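A small, hypothetical example of the Callable/Future combination mentioned above (the split(",") stands in for real parsing):

import java.util.concurrent.*;

class CallableExample {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        String line = "BAC,12.32,12.54,12.56,232443";
        // Unlike Runnable, a Callable returns a result:
        Future<String[]> future = executor.submit(() -> line.split(","));
        String[] fields = future.get();   // blocks until the task completes
        System.out.println(fields[0]);    // prints "BAC"
        executor.shutdown();
    }
}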
I have a project that keeps track of state information for over 500k objects. The program receives 10k updates/second about these objects; the updates consist of new, update, or delete operations.
As part of the program, housekeeping must be performed on these objects roughly every five minutes. For this purpose I've placed them in a DelayQueue, implementing the Delayed interface, allowing the blocking functionality of the DelayQueue to control the housekeeping of these objects.
Upon new, an object is placed on the DelayQueue.
Upon update, the object is remove()'d from the DelayQueue, updated, and then reinserted at its new position, dictated by the updated information.
Upon delete, the object is remove()'d from the DelayQueue.
The problem I'm facing is that the remove() method becomes a prohibitively long operation once the queue grows past around 450k objects.
The program is multithreaded; one thread handles updates and another the housekeeping. Due to the remove() delay, we get nasty lock-contention issues, and eventually the update thread's buffer consumes all of the heap space.
I've managed to work around this by creating a DelayedWeakReference (extends WeakReference, implements Delayed), which allows me to leave "shadow" objects in the queue until they would expire normally.
This takes the performance issue away, but causes a significant increase in memory requirements: it results in around 5 DelayedWeakReferences for every object that actually needs to be in the queue.
Is anyone aware of a DelayQueue with additional tracking that permits fast remove() operations? Or does anyone have suggestions for better ways to handle this without consuming significantly more memory?
It took me some time to think about this, but after reading your interesting question for some minutes, here are my ideas:
A. If your objects have some sort of ID, use it to hash, and instead of one delay queue have N delay queues. This will reduce the locking factor by N. There will be a central data structure holding these N queues. Since N is preconfigured, you can create all N queues when the system starts.
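A minimal sketch of idea A, assuming your objects expose a numeric ID (all names here are illustrative):

import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;

class ShardedDelayQueue<T extends Delayed> {
    private final DelayQueue<T>[] shards;

    @SuppressWarnings("unchecked")
    ShardedDelayQueue(int n) {   // N is preconfigured at startup
        shards = new DelayQueue[n];
        for (int i = 0; i < n; i++)
            shards[i] = new DelayQueue<>();
    }

    private DelayQueue<T> shardFor(long id) {   // hash the object ID to pick a shard
        return shards[(int) Math.floorMod(id, (long) shards.length)];
    }

    void add(long id, T item)       { shardFor(id).add(item); }
    boolean remove(long id, T item) { return shardFor(id).remove(item); }
}

Housekeeping would then drain the N shards from N threads (or poll them round-robin), so no single lock is contended by every update.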
If you only need to perform housekeeping "roughly every five minutes", a DelayQueue is a lot of work to maintain.
What I would do is have a task which runs every minute (or less, as required) to check whether it has been five minutes since the last update. With this approach, there is no additional collection to maintain and no data structure is altered on an update. The overhead of scanning the components is increased, but it is constant. The overhead of performing an update becomes trivial (setting a field with the last time updated).
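Something like this, assuming each object records its last-updated timestamp (Tracked, allObjects, and houseKeep() are placeholders for your own types):

// Sweep all tracked objects once a minute instead of maintaining a DelayQueue.
ScheduledExecutorService sweeper = Executors.newSingleThreadScheduledExecutor();
sweeper.scheduleAtFixedRate(() -> {
    long cutoff = System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(5);
    for (Tracked t : allObjects)
        if (t.lastUpdatedMillis() < cutoff)
            houseKeep(t);
}, 1, 1, TimeUnit.MINUTES);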
If I understand your problem correctly, you want to do something to an object if it hasn't been touched for 5 minutes.
You can have a custom linked list; the tail is the most recently touched. Removing a node is fast.
The bookkeeping thread can simply wake up every second and remove heads that are 5 minutes old. However, if a 1-second delay is unacceptable, calculate the exact pause time:
// bookkeeping thread
// (FIVE_MIN, now(), and the list helpers are placeholders, as in the original pseudo-code)
public void run() {
    synchronized (list) {
        try {
            while (true) {
                if (head == null)
                    list.wait();                              // nothing to watch yet
                else if (head.time + FIVE_MIN > now())
                    list.wait(head.time + FIVE_MIN - now());  // the exact pause time
                else {
                    Node h = head;
                    removeHead();
                    process(h);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

// update thread
void add(Node node) {
    synchronized (list) {
        appendTail(node);
        if (size == 1)
            list.notify();   // wake the bookkeeping thread
    }
}

void remove(Node node) {
    synchronized (list) {
        unlink(node);
    }
}
I have an application that processes data stored in a number of files from an input directory and then produces some output depending on that data.
So far, the application works sequentially, i.e. it launches a "manager" thread that
1. Reads the contents of the input directory into a File[] array
2. Processes each file in sequence and stores results
3. Terminates when all files are processed
I would like to convert this into a multithreaded application, in which the "manager" thread
1. Reads the contents of the input directory into a File[] array
2. Launches a number of "processor" threads, each of which processes a single file, stores results and returns a summary report for that file to the "manager" thread
3. Terminates when all files have been processed
The number of "processor" threads would be at most equal to the number of files, since they would be recycled via a ThreadPoolExecutor.
Any solution avoiding the use of join() or wait()/notify() would be preferable.
Based on the above scenario:
What would be the best way to have those "processor" threads report back to the "manager" thread? Would an implementation based on Callable and Future make sense here?
How can the "manager" thread know when all "processor" threads are done, i.e. when all files have been processed?
Is there a way of "timing" a processor thread and terminating it if it takes "too long" (i.e., it hasn't returned a result despite the lapse of a pre-configured amount of time)?
Any pointers to, or examples of, (pseudo-)source code would be greatly appreciated.
You can definitely do this without using join() or wait()/notify() yourself.
You should take a look at java.util.concurrent.ExecutorCompletionService to start with.
The way I see it, you should write the following classes:
FileSummary - Simple value object that holds the result of processing a single file
FileProcessor implements Callable<FileSummary> - The strategy for converting a file into a FileSummary result
FileManager - The high-level manager that creates FileProcessor instances, submits them to a work queue, and then aggregates the results.
The FileManager would then look something like this:
class FileManager {
    private final CompletionService<FileSummary> cs =
        new ExecutorCompletionService<>(Executors.newFixedThreadPool(4)); // pool size: tune to taste

    public FinalResult processDir(File dir)
            throws InterruptedException, ExecutionException {
        int fileCount = 0;
        for (File f : dir.listFiles()) {
            cs.submit(new FileProcessor(f));
            fileCount++;
        }
        FinalResult result = new FinalResult();
        for (int i = 0; i < fileCount; i++) {
            FileSummary summary = cs.take().get();
            // aggregate summary into the final result
        }
        return result;
    }
}
If you want to implement a timeout you can use the poll() method on CompletionService instead of take().
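For example (the 30-second budget here is an assumption):

// Bound the wait for each result instead of blocking indefinitely:
Future<FileSummary> f = cs.poll(30, TimeUnit.SECONDS);
if (f == null) {
    // no summary arrived within 30 seconds; handle the slow file
} else {
    FileSummary summary = f.get();
}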
wait()/notify() are very low-level primitives, and you are right to want to avoid them.
The simplest solution would be to use a thread-safe queue (or stack, etc. -- it doesn't really matter in this case). Before starting the worker threads, your main thread can add all the Files to the thread-safe queue/stack. Then start the worker threads and let them all pull Files and process them until there are none left.
The worker threads can add results to another thread-safe queue/stack, where the main thread can get them from. The main thread knows how many Files there were, so when it has retrieved the same number of results, it will know that the job is finished.
Something like a java.util.concurrent.BlockingQueue would work, and there are other thread-safe collections in java.util.concurrent which would also be fine.
You also asked about terminating worker threads which are taking too long. I will tell you right up front: if you can make the code which runs on the worker threads robust enough that you can safely leave this feature out, you will make things a lot simpler.
If you do need this feature, the simplest and most reliable solution is to have a per-thread "terminate" flag, and make the worker task code check that flag frequently and exit if it is set. Make a custom class for workers, and include a volatile boolean field for this purpose. Also include a setter method (because of volatile, it doesn't need to be synchronized).
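A minimal sketch of such a worker, combined with the queue-draining approach above (the queue types and the process() body are assumptions):

import java.io.File;
import java.util.Queue;

class Worker implements Runnable {
    private volatile boolean terminate;
    private final Queue<File> work;        // a thread-safe queue, e.g. ConcurrentLinkedQueue
    private final Queue<String> results;   // the result type is a stand-in

    Worker(Queue<File> work, Queue<String> results) {
        this.work = work;
        this.results = results;
    }

    void setTerminate() { terminate = true; }   // volatile, so no synchronization needed

    @Override
    public void run() {
        File f;
        while (!terminate && (f = work.poll()) != null)
            results.add(process(f));            // process(): your per-file logic
    }

    private String process(File f) { return f.getName(); }   // placeholder
}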
If a worker discovers that its "terminate" flag is set, it could push its File object back on the work queue/stack so another thread can process it. Of course, if there is some problem which means the File cannot be successfully processed, this will lead to an infinite cycle.
The best approach is to make the worker code very simple and robust, so you don't need to worry about it "not terminating".
No need for them to report back. Just keep a count of the number of jobs remaining to be done, and have each thread decrement that count when it finishes a job.
When the count of jobs remaining reaches zero, all the "processor" threads are done.
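A tiny runnable sketch of that counter (the file names and pool size are placeholders; a CountDownLatch would work equally well):

import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

class CountingDemo {
    public static void main(String[] args) {
        String[] files = {"a.txt", "b.txt", "c.txt"};   // stand-ins for the real File array
        AtomicInteger remaining = new AtomicInteger(files.length);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (String f : files)
            pool.execute(() -> {
                // ... process file f ...
                if (remaining.decrementAndGet() == 0)
                    System.out.println("all files processed");
            });
        pool.shutdown();
    }
}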
Sure, just add that code to the thread: when it starts working, check the time and compute the stop time. Periodically (say, when you go to read more from the file), check whether it's past the stop time and, if so, stop.
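A sketch of that in-task check (the 30-second budget and line-by-line reading are assumptions):

// Requires java.io.* and java.util.concurrent.TimeUnit.
void processWithDeadline(File file) throws IOException {
    long stopAt = System.nanoTime() + TimeUnit.SECONDS.toNanos(30);
    try (BufferedReader in = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = in.readLine()) != null) {
            if (System.nanoTime() - stopAt > 0)
                return;   // past the stop time: stop working on this file
            // ... process line ...
        }
    }
}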