Reading from disk and processing in parallel

Reading from disk and processing in parallel - java

This is going to be the most basic and even may be stupid question here. When we talk about using multi threading for better resource utilization. For example, an application reads and processes files from the local file system. Lets say that reading of file from disk takes 5 seconds and processing it takes 2 seconds.
In above scenario, we say that using two threads one to read and other to process will save time. Because even when one thread is processing first file, other thread in parallel can start reading second file.
Question: Is this because of the way CPUs are designed. As in there is a different processing unit and different read/write unit so these two threads can work in parallel on even a single core machine as they are actually handled by different modules? Or this needs multiple core.
Sorry for being stupid. :)

On a single processor, multithreading is achieved through time slicing. One thread will do some work then it will switch to the other thread.
When a thread is waiting on some I/O, such as a file read, it will give up it's CPU time-slice prematurely allowing another thread to make use of the CPU.
The result is overall improved throughput compared to a single thread even on a single core.
Key for below:
= Doing work on CPU
- I/O
_ Idle
Single thread:
====--====--====--====--
Two threads:
====--__====--__====--__
____====--__====--__====
So you can see how more can get done in the same time as the CPU is kept busy where it would have been kept waiting before. The storage device is also being used more.

In theory yes. Single core has same parallelism. One thread waiting for read from file (I/O Wait), another thread is process file that already read before. First thread actually can not running state until I/O operations is completed. Rougly not use cpu resource at this state. Second thread consume CPU resource and complete task. Indeed, multi core CPU has better performance.

To start with, there is a difference between concurrency and parallelism. Theoretically, a single core machine does not support parallelism.
About the question on performance improvement as a result of concurrency (using threads), it is very implementation dependent. Take for instance, Android or Swing. Both of them have a main thread (or the UI thread). Doing large calculation on the main thread will block the UI and make in unresponsive. So from a layman perspective that would be a bad performance.
In your case(I am assuming there is no UI Thread) where you will benefit from delegating your processing to another thread depends on a lot of factors, specially the implementation of your threads. e.g. Synchronized threads would not be as good as the unsynchronized ones. Your problem statement reminds me of classic consumer producer problem. So use of threads should not really be the best thing for your work as you need synchronized threads. IMO It's better to do all the reading and processing in a single thread.
Multithreading will also have a context switching cost. It is not as big as Process's context switching, but it's still there. See this link.
[EDIT] You should preferably be using BlockingQueue for such producer consumer scenario.

Related

Can a blocked thread be rescheduled to do other work?

If I have a thread that is blocked waiting for a lock, can the OS reschedule that thread to do other work until the lock becomes available?
From my understanding, it cannot be rescheduled, it just sits idle until it can acquire the lock. But it just seems inefficient. If we have 100 tasks submitted to an ExecutorService, and 10 threads in the pool: if one of the threads holds a lock and the other 9 threads are waiting for that lock, then only the thread holding the lock can make progress. I would have imagined that the blocked threads could be temporarily rescheduled to run some of the other submitted tasks.

You said:
I would have imagined that the blocked threads could be temporarily rescheduled to run some of the other submitted tasks.
Project Loom
You happen to be describing the virtual threads (fibers) being developed as part of Project Loom for future versions of Java.
Currently the OpenJDK implementation of Java uses threads from the host OS as Java threads. So the scheduling of those threads is actually controlled by the OS rather than the JVM. And yes, as you describe, on all common OSes when Java code blocks, the code’s thread sits idle.
Project Loom layers virtual threads on top of the “real” platform/kernel threads. Many virtual threads may be mapped to each real thread. Running millions of threads is possible, on common hardware.
With Loom technology, the JVM detects blocking code. That blocked code’s virtual thread is “parked”, set aside, with another virtual thread assigned to that real thread to accomplish some execution time while the parked thread awaits a response. This parking-and-switching is quite rapid with little overhead. Blocking becomes extremely “cheap” under Loom technology.
Blocking is quite common in most run-of-the-mill business-oriented apps. Blocking occurs with file I/O, network I/O, database access, logging, console interactions, GUIs, and so on. Such apps using virtual threads are seeing huge performance gains with the experimental builds of Project Loom. These builds are available now, based on early-access Java 17. The Project Loom team seeks feedback.
Using virtual threads is utterly simple: Switch your choice of executor service.
ExecutorService executorService = Executors.newVirtualThreadExecutor() ;
Caveat: As commented by Michael, the virtual threads managed by the JVM depend on the platform/kernel threads managed by the host OS. Ultimately, execution is scheduled by the OS even under Loom. Virtual threads are useful for those times when a blocked Java thread would otherwise be sitting idle on a CPU core. If the host computer were overburdened, the Java threads might see little execution time, with or without virtual threads.
Virtual threads are not appropriate for tasks that rarely block, that are truly CPU-bound. For example, encoding video. Such tasks should continue using conventional threads.
For more info, see the enlightening presentations and interviews with Ron Pressler of Oracle, or with other members of the Loom team. Look for the most recent, as Loom has evolved.

I would have imagined that the blocked threads could be temporarily rescheduled to run some of the other submitted tasks.
That's what the other threads are for. If you create X threads and Y are blocked, you have the remaining X-Y threads to do other submitted tasks. Presumably, the number X was chosen specifically to get the number of concurrent tasks that the implementation and/or programmer thought was best.
You are asking why the implementation doesn't ignore this decision. The answer is because it makes more sense to choose the number of threads reasonably than have the implementation ignore that choice.

You are partially right.
In the executor service scenario that you described, all the 9 threads will be blocked and only one thread will make progress. true.
The part where you are not quite right is when you attempt to expect the behaviour of OS and Java combined. See, the concept of threads exists at both OS and at Java level. But they are two different things. So there are Java-Threads and there are OS-Threads. Java-Threads are implemented through OS-Threads.
Imagine it this way, JVM has (say) 10 Java-Threads in it, some running, some not. Java borrows some OS-Thread to implement the running Java-Threads. Now when a Java-Thread gets blocked (for whatever reason) then what we know for sure is that the Java-Thread has been blocked. We cannot easily observe what happened to the underlying OS-Thread.
The OS could reclaim the OS-Thread and use it for something else, or it can stay blocked - it depends. But even if the OS-Thread is reused, still Java-Thread will remain blocked. In your thread pool scenario, still the nine Java-Threads will be blocked, and only one Java-Thread will be working.

If I have a thread that is blocked waiting for a lock, can the OS reschedule that thread to do other work until the lock becomes available? From my understanding, it cannot be rescheduled, it just sits idle until it can acquire the lock. But it just seems inefficient.
I think you are thinking about this entirely wrong. Just because 10 of your 20 threads are "idle" doesn't mean that the operating system (or the JVM) is somehow consuming resources trying to manage these idle threads. Although in general we work on our applications to make sure that our threads are as unblocked as possible so we can achieve the highest throughput, there are tons of times that we write threads where we expect them to be idle most of the time.
If we have 100 tasks submitted to an ExecutorService, and 10 threads in the pool: if one of the threads holds a lock and the other 9 threads are waiting for that lock, then only the thread holding the lock can make progress. I would have imagined that the blocked threads could be temporarily rescheduled to run some of the other submitted tasks.
It is not the threads which are rescheduled, it is the CPU resources of the system. If 9 of your 10 threads are blocked in your thread-pool then other threads in your application (garbage collector) or other processes can be given more CPU resources on your server. This switching between jobs is what the modern operating systems are really good at and it happens many, many times a second. This is all quite normal.
Now, if your question is really "how do I improve the throughput of my application" then you are asking the right question. First off, you should make sure that your locks are as fine grained as possible to make sure that the thread holding the lock does so for a minimal amount of time. If the blocking happens too often then you should consider increasing the number of threads in the pool so that there is a higher likelihood that some of the jobs will run concurrently. This optimizing of the number of threads in a thread-pool is very application specific. See my post here for some more details: Concept behind putting wait(),notify() methods in Object class
Another thing you might consider is breaking your jobs up into pieces to separate the pieces that can run concurrently from the ones that need to be synchronized. You could have a pool of 10 threads doing the concurrent work and then a single thread doing the operations that require the locks. This is why the ExecutorCompletionService was written so that something downstream can take the results of a thread-pool and act on them as they complete. This will make your program more complicated and you'll need to worry about your queues if you are talking about some large number of jobs or large number of results but you can dramatically improve throughput if you do this right.
A good example of such refactoring is a situation where you have a processing job that has to write the results to a database. If at the end of each job, each thread in the pool needs to get a lock on the database connection then there will be a lot of contention for the lock and less concurrency. If, instead, the processing was done in a thread-pool and there was a single database update thread, it could turn off auto-commit and make updates from multiple jobs in a row between commits which could dramatically increase throughput. Then again, using multiple database connections managed by a connection pool might be a fine solution.

Thread execution on single and multi core

This is what I see in Oracle documentation and would like to confirm my understanding (source):
A computer system normally has many active processes and threads. This
is true even in systems that only have a single execution core, and
thus only have one thread actually executing at any given moment.
Processing time for a single core is shared among processes and
threads through an OS feature called time slicing.
Does it mean that in a single core machine only one thread can be executed at given moment?
And, does it mean that on multi core machine multiple threads can be executed at given moment?

one thread actually executing at any given moment
Imagine that this is game where 10 people try to sit on 9 chairs in a circle (I think you might know the game) - there isn't enough chairs for every one, but the entire group of people is moving, always. It's just that everyone sits on the chair for some amount of time (very simplified version of time slicing).
Thus multiple processes can run on the same core.
But even if you have multiple processors, it does not mean that a certain thread will run only on that processor during it's entire lifetime. There are tools to achieve that (even in java) and it's called thread affinity, where you would pin a thread only to some processor (this is quite handy in some situations). That thread can be moved (scheduled by the OS) to run on a different core, while running, this is called context switching and for some applications this switching to a different CPU is sometimes un-wanted.
At the same time, of course, multiple threads can run in parallel on different cores.

Does it mean that in a single core machine only one thread can be executed at given moment?
Nope, you can easily have more threads than processors assuming they're not doing CPU-bound work. For example, if you have two threads mostly waiting on IO (either from network or local storage) and another thread consuming the data fetched by the first two threads, you could certainly run that on a machine with a single core and obtain better performance than with a single thread.
And, does it mean that on multi core machine multiple threads can be executed at given moment?
Well yeah you can execute any number of threads on any number of cores, provided that you have enough memory to allocate a stack for each of them. Obviously if each thread makes intensive use of the CPU it will stop being efficient when the number of threads exceeds the number of cores.

Why is Vert.x called responsive even though it is single threaded

What I understood from Vert.x documentation (and a little bit of coding in it) is that Vert.x is single threaded and executes events in the event pool. It doesn't wait for I/O or any network operation(s) rather than giving time to another event (which was not before in any Java multi-threaded framework).
But I couldn't understand following:
How single thread is better than multi-threaded? What if there are millions of incoming HTTP requests? Won't it be slower than other multi-threaded frameworks?
Verticles depend on CPU cores. As many CPU cores you have, you can have that many verticles running in parallel. How come a language that works on a virtual machine can make use of CPU as needed? As far as I know, the Java VM (JVM) is an application that uses just another OS process for (here my understanding is less about OS and JVM hence my question might be naive).
If a single threaded, non-blocking concept is so effective then why can't we have the same non-blocking concept in a multi-threaded environemnt? Won't it be faster? Or again, is it because CPU can execute one thread at a time?

What I understood from Vert.x documentation (and a little bit of coding in it) is that Vert.x is single threaded and executes events in the event pool.
It is event-driven, callback-based. It isn't single-threaded:
Instead of a single event loop, each Vertx instance maintains several event loops. By default we choose the number based on the number of available cores on the machine, but this can be overridden.
It doesn't wait for I/O or any network operation(s)
It uses non-blocking or asynchronous I/O, it isn't clear which. Use of the Reactor pattern suggests non-blocking, but it may not be.
rather than giving time to another event (which was not before in any Java multi-threaded framework).
This is meaningless.
How single thread is better than multi-threaded?
It isn't.
What if there are millions of incoming HTTP requests? Won't it be slower than other multi-threaded frameworks?
Yes.
Verticles depend on CPU cores. As many CPU cores you have, you can have that many verticles running in parallel. How come a language that works on a virtual machine can make use of CPU as needed? As far as I know, the Java VM (JVM) is an application that uses just another OS process for (here my understanding is less about OS and JVM hence my question might be naive).
It uses a thread per core, as per the quotation above, or whatever you choose by overriding that.
If a single threaded, non-blocking concept is so effective then why can't we have the same non-blocking concept in a multi-threaded environemnt?
You can.
Won't it be faster?
Yes.
Or again, is it because CPU can execute one thread at a time?
A multi-core CPU can execute more than one thread at a time. I don't know what 'it' in 'is it because' refers to.

First of all, Vertx isn't single threaded by any means. It just doesn't spawn more threads that it needs.
Second, and this is not related to Vertx at all, JVM maps threads to native OS threads.
Third, we can have non-blocking behavior in multithreaded environment. It's not one thread per CPU, but one thread per core.
But then the question is: "what are those threads doing?". Because usually, to be useful, they need other resources. Network, DB, filesystem, memory. And here it becomes tricky. When you're single threaded, you don't have race conditions. The only one accessing the memory at any point of time is you. But if you're multi threaded, you need to concern yourself with mutexes, or any other way to keep you data consistent.

Q:
How single thread is better than multi-threaded? What if there are millions of incoming HTTP requests? Won't it be slower than other multi-threaded frameworks?
A:
Vert.x isn't a single threaded framework, it does make sure that a "verticle" which is something you deploy within you application and register with vert.x is mostly single threaded.
The reason for this is that concurrency with multiple threads over complicates concurrency with locks synchronisation and other concept that need to be taken care of with multi threaded communication.
While verticles are single threaded the do use something called an event loop which is the true power behind this paradigm called the reactor pattern or multi reactor pattern in Vert.x's case. Multiple verticles can be registered within one application, communication between these verticles run through an eventbus which empowers verticles to use an event based transfer protocol internally but this can also be distributed using some other technology to manage the clustering.
event loops handle events coming in on one thread but everything is async so computation gets handled by the loop and when it's done a signal notifies that a result can be used.
So all computation is either callback based or uses something like Reactive.X / fibers / coroutines / channels and the lot.
Due to the simpler communication model for concurrency and other nice features of Vert.x it can actually be faster than a lot of the Blocking and pure multi threaded models out there.
the numbers
Q:
If a single threaded, non-blocking concept is so effective then why can't we have the same non-blocking concept in a multi-threaded environemnt? Won't it be faster? Or again, is it because CPU can execute one thread at a time?
A:
Like a said with the first question it's not really single threaded. Actually when you know something is blocking you'll have to register computation with a method called executeBlocking which wil make it run multithreaded on an ExecutorService managed by Vert.x
The reason why Vert.x's model is mostly faster is also here because event loops make better use of cpu computation features and constraints. This is mostly powered by the Netty project.
the overhead of multi threading with it's locks and syncs imposes to much strain to outdo Vert.x with it's multi reactor pattern.

Multithreading - multiple users

When a single user is accessing an application, multiple threads can be used, and they can run parallel if multiple cores are present. If only one processor exists, then threads will run one after another.
When multiple users are accessing an application, how are the threads handled?

I can talk from Java perspective, so your question is "when multiple users are accessing an application, how are the threads handled?".
The answer is it all depends on how you programmed it, if you are using some web/app container they provide thread pool mechanism where you can have more than one threads to server user reuqests, Per user there is one request initiated and which in turn is handled by one thread, so if there are 10 simultaneous users there will be 10 threads to handle the 10 requests simultaneously, now we do have Non-blocking IO now a days where the request processing can be off loaded to other threads so allowing less than 10 threads to handle 10 users.
Now if you want to know how exactly thread scheduling done around CPU core, it again depends on the OS. One thing common though 'thread is the basic unit of allocation to a CPU'. Start with green threads here, and you will understand it better.

The incorrect assuption is
If only one processor exists, then threads will run one after another.
How threads are being executed is up to the runtime environment.
With java there are some definitions that certain parts of your code will not be causing synchronisation with other threads and thus will not cause (potential) rescheduling of threads.
In general, the OS will be in charge of scheduling units-of-execution. In former days mostly such entities have been processes. Now there may by processes and threads (some do scheduling only at thread level). For simplicity let ssume OS is dealing with threads only.
The OS then may allow a thread to run until it reaches a point where it can't continue, e.g. waiting for an I/O operation to cpmplete. This is good for the thread as it can use CPU for max. This is bad for all the other threads that want to get some CPU cycles on their own. (In general there always will be more threads than available CPUs.So, the problem is independent of number of CPUs.) To improve interactive behaviour an OS might use time slices that allow a thread to run for a certain time. After the time slice is expired the thread is forcible removed from the CPU and the OS selects a new thread for being run (could even be the one just interrupted).
This will allow each thread to make some progress (adding some overhead for scheduling). This way, even on a single processor system, threads my (seem) to run in parallel.
So for the OS it is not at all important whether a set of thread is resulting from a single user (or even from a single call to a web application) or has been created by a number of users and web calls.

You need understand about thread scheduler.
In fact, in a single core, CPU divides its time among multiple threads (the process is not exactly sequential). In a multiple core, two (or more) threads can run simultaneously.
Read thread article in wikipedia.
I recommend Tanenbaum's OS book.

Tomcat uses Java multi-threading support to serve http requests.
To serve an http request tomcat starts a thread from the thread pool. Pool is maintained for efficiency as creation of thread is expensive.
Refer to java documentation about concurrency to read more https://docs.oracle.com/javase/tutorial/essential/concurrency/
Please see tomcat thread pool configuration for more information https://tomcat.apache.org/tomcat-8.0-doc/config/executor.html

There are two points to answer to your question : Thread Scheduling & Thread Communication
Thread Scheduling implementation is specific to Operating System. Programmer does not have any control in this regard except setting priority for a Thread.
Thread Communication is driven by program/programmer.
Assume that you have multiple processors and multiple threads. Multiple threads can run in parallel with multiple processors. But how the data is shared and accessed is specific to program.
You can run your threads in parallel Or you can wait for threads to complete the execution before proceeding further (join, invokeAll, CountDownLatch etc.). Programmer has full control over Thread life cycle management.

There is no difference if you have one user or several. Threads work depending the logic of your program. The processor runs every thread for a certain ammount of time and then follows to the next one. The time is very short, so if there are not too much threads (or different processes) working, the user won't notice it. If the processor uses a 20 ms unit, and there are 1000 threads, then every thread will have to wait for two seconds for its next turn. Fortunately, current processors, even with just one core, have two process units which can be used for parallel threads.

In "classic" implementations, all web requests arriving to the same port are first serviced by the same single thread. However as soon as request is received (Socket.accept returns), almost all servers would immediately fork or reuse another thread to complete the request. Some specialized single user servers and also some advanced next generation servers like Netty may not.
The simple (and common) approach would be to pick or reuse a new thread for the whole duration of the single web request (GET, POST, etc). After the request has been served, the thread likely will be reused for another request that may belong to the same or different user.
However it is fully possible to write the custom code for the server that binds and then reuses particular thread to the web request of the logged in user, or IP address. This may be difficult to scale. I think standard simple servers like Tomcat typically do not do this.

Is it useful to use a Thread for prefetching from a file?

Using multiple threads for speeding IO may work, but I need to process a huge file (or directory tree) sequentially by a single thread. However I could imagine two possible ways how to speed up reading from a file:
Feeder
The main thread gets all it's data from a PipedInputStream (or alike) fed by the auxiliary thread, which is the only one accessing the file. The synchronization overhead is higher, but there's less communication to (the underlying library communicating with) the OS. This is straightforward for a single file, but very complicated for a directory tree.
Prefetcher
The main thread opens new FileInputStream(file) and reads it as if it was alone. The auxiliary thread opens it's own stream over the same file and reads ahead. The main Thread doesn't need to wait for the disk since it gets all it's data from the OS cache. There should be some trivial synchronization assuring that the auxiliary thread doesn't run too far ahead. This could work for directory trees without much additional effort.
The questions
Which idea (if any) would you recommend to try out?
Have you used something like this?
Any other idea?

I had an app that read multiple files, created xml out of it and sent it to a server.
In this situation having a dedicated "feeder" (reads file and put them in a queue) and a few "sender" (creates xml and send it to the server) helped.
If you are doing moderate to intensive CPU consuming work (like XML parsing), then having 2 threads (1 reads and 1 processes) is likely to help even on a single core machine. I won't be too concerned about synchronization overhead. When there is little contention, the gain by doing work while waiting for IO would be much bigger. If your thread wait for IO time to time, then there will be even more benefits.
I'd recommend to read this chapter from JCiP. It addresses this topic.

It depends! ... on your access patterns, on your hardware...
"Using multiple threads for speeding IO may work" - IF your IO subsystem (such as a large disk array) is capable of handling multiple IO requests at once.
On a single desktop drive, your gains will be limited; if you have several threads performing largely independent work (i.e. there are few synchronisation points) you can benefit from one thread reading data, while others are processing data previously read.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.