When to choose several processes over threads in Java? - java

For what reasons would one choose several processes over several threads to implement an application in Java?
I'm refactoring an older Java application which is currently divided into several smaller applications (processes) running on the same multi-core machine, communicating with each other via sockets.
I personally think this should be done using threads rather than processes, but what arguments would defend the original design?

I (and others, see attributions below) can think of a couple of reasons:
Historical Reasons
The design is from the days when only green threads were available and the original author/designer figured they wouldn't work for him.
Robustness and Fault Tolerance
You use components which are not thread safe, so you cannot parallelize without resorting to multiple processes.
Some components are buggy and you don't want them to be able to affect more than one process. Say, if a component has a memory or resource leak which eventually could force a process restart, then only the process using the component is affected.
Correct multithreading is still hard to do. Depending on your design, it can be harder than multiprocessing. The latter, however, is arguably also not too easy.
You can have a model where a watchdog process actively monitors (and eventually restarts) crashed worker processes. This may also include suspend/resume of processes, which is not safe with threads (thanks to @Jayan for pointing this out).
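As an illustration of the watchdog idea, here is a minimal sketch using only the standard ProcessBuilder API; the worker command ("java -jar worker.jar") is a placeholder, not part of the original design:

    import java.io.IOException;

    // Minimal watchdog sketch: launch a worker process and restart it
    // whenever it exits abnormally. "worker.jar" is a hypothetical worker app.
    public class Watchdog {
        public static void main(String[] args) throws IOException, InterruptedException {
            while (true) {
                Process worker = new ProcessBuilder("java", "-jar", "worker.jar")
                        .inheritIO()   // forward the worker's stdin/stdout/stderr
                        .start();
                int exitCode = worker.waitFor();
                if (exitCode == 0) {
                    break;             // clean shutdown, stop supervising
                }
                System.err.println("Worker exited with code " + exitCode + ", restarting...");
            }
        }
    }

A real watchdog would add back-off, logging, and health checks, but the process boundary is what provides the isolation discussed above.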
OS Resource Limits & Governance
If the process, using a single thread, is already using all of the available address space (e.g. 2 GB for 32-bit apps on Windows), you might need to distribute work amongst processes.
Limiting the use of resources (CPU, memory, etc.) is typically only possible on a per process basis (for example on Windows you could create "job" objects, which require a separate process).
Security Considerations
You can run different processes using different accounts (i.e. "users"), thus providing better isolation between them.
Compatibility Issues
Support multiple/different Java versions: Using different processes you can use different Java versions for your application parts (if required by 3rd party libraries).
Location Transparency
You could (potentially) distribute your application over multiple physical machines, thus further increasing scalability and/or robustness of the application (see @Qwe's answer for more details / the original idea).

If you decide to go with threads you will restrict your app to running on a single machine. This solution doesn't scale (or scales only to some extent) - there are always hardware limits.
And different processes communicating via sockets can be distributed between machines, so that you could add a virtually unlimited number of them. This scales better, at the cost of slower communication between processes (see the sketch below).
Deciding which approach is more suitable is itself a very interesting task. And once you make the decision, there's no guarantee that it won't look stupid to your successors in a couple of years when requirements change or new hardware becomes available.
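To make the socket-based design concrete, here is a minimal sketch of one worker process exposing a TCP endpoint; the port number and the one-line "protocol" are made up for illustration:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.ServerSocket;
    import java.net.Socket;

    // One worker process acts as a tiny TCP server; any other process (on the
    // same box or on another machine) can talk to it via new Socket(host, 9000).
    public class EchoWorker {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(9000)) {   // arbitrary port
                while (true) {
                    try (Socket peer = server.accept();
                         BufferedReader in = new BufferedReader(
                                 new InputStreamReader(peer.getInputStream()));
                         PrintWriter out = new PrintWriter(peer.getOutputStream(), true)) {
                        String request = in.readLine();            // one-line request
                        out.println("echo: " + request);           // one-line reply
                    }
                }
            }
        }
    }

The client side only needs new Socket("localhost", 9000); moving the two halves onto different machines changes nothing but the host name.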

Related

Multithread applications on MPP architecture

In short:
Is it worth the effort to add multithreading scalability (vertical scalability) to an application that will always run in an MPP infrastructure such as Tandem HPNS (horizontally scalable)?
Now, let me go deeper:
I've seen in many places that development under MPP (Massively Parallel Processing) using Java tends toward the assumption that, since it's Java, you can use everything Java provides (you know, write once, run anywhere!), including multithreading libraries (threads, Akka, thread pools, etc.) which can help a lot by speeding up performance through parallelism.
This forgets the fact that, if it's MPP, it is horizontally scalable, meaning that if you need a faster app you have to design it to run multiple copies of the application, each on a different processor.
On the other side we have SMP (Symmetric Multiprocessing) infrastructures (any Windows, Linux, or UNIX-like environment). There you don't have to worry about that: since the scalability is vertical, you can have more threads whose execution will be distributed over the different cores the OS has available (here I do agree on using multithreading libraries).
So, having this in mind, my question is: if there is a need to create an application that will perform a heavy data load with a lot of validations and other requirements, where the use of parallelism would help a lot to improve the load time, but it has to run under an MPP environment (such as Tandem HPNS),
should the developer invest time in adding multithreading libraries to gain parallelism and concurrency?
Just a couple of side notes:
1) I'm not saying SMP is better or MPP is better; they are just different infrastructures. My point is only about the use of multithreading libraries in MPP environments, given the fact that an application using multithreading on MPP will use just one CPU of the N CPUs the server may have.
2) I'm not saying the MPP server does not support multithreading libraries; you can have multiple threads running on HPNS, but even if you have 20 threads there is no real parallelism, since one thread blocks the others, unless you have the application distributed (several copies running) on different CPUs.
No, I don't think it makes sense to add multithreaded scalability to an application that will always run on Tandem, because Tandem does not provide kernel-level threads, so even if you write a multithreaded application it will not give any benefit.
Even though Tandem HPNS Java provides multithreading as per the Java spec, its performance is not comparable with Linux or any other OS which supports kernel-level threading.
The actual purpose of Tandem is high availability, because of its hardware redundancy.

Why have one JVM per application?

I read that each application runs in its own JVM. Why is it so? Why don't they make one JVM run 2 or more apps?
I read a SO post, but could not get the answers there.
Is there one JVM per Java application?
(I assume you are talking about applications launched via a public static void main(String[]) method ...)
In theory you can run multiple applications in a JVM. In practice, they can interfere with each other in various ways. For example:
The JVM has one set of System.in/out/err, one default encoding, one default locale, one set of system properties, and so on. If one application changes these, it affects all applications.
Any application that calls System.exit() will effectively kill all applications (see the sketch after this list).
If one application goes wild, and consumes too much CPU or memory it will affect the other applications too.
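The System.exit() problem is easy to demonstrate; in this minimal sketch (class and thread names are invented), two "applications" run as threads in one JVM and one of them takes the other down with it:

    // Two "applications" sharing one JVM: when app A calls System.exit(),
    // app B dies with it, because they share a single process.
    public class SharedJvmDemo {
        public static void main(String[] args) throws InterruptedException {
            Thread appB = new Thread(() -> {
                try {
                    Thread.sleep(5_000);
                    System.out.println("App B still running");   // never printed
                } catch (InterruptedException ignored) {
                }
            }, "app-B");
            appB.start();

            Thread appA = new Thread(() -> {
                System.out.println("App A shutting down");
                System.exit(1);                                   // kills the whole JVM
            }, "app-A");
            appA.start();

            appB.join();                                          // never finishes normally
        }
    }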
In short, there are lots of problems. People have tried hard to make this work, but they have never really succeeded. One example is the Echidna library, though that project has been quiet for ~10 years. JNode is another example, though they (actually we) "cheated" by hacking core Java classes (like java.lang.System) so that each application got what appeared to be independent versions of System.in/out/err, the System properties and so on1.
1 - This ("proclets") was supposed to be an interim hack, pending a proper solution using true "isolates". But isolates support stalled, primarily because the JNode architecture used a single address space with no obvious way to separate "system" and "user" stuff. So while we could create APIs that matched the isolate APIs, key isolate functionality (like cleanly killing an isolate) was virtually impossible to implement. Or at least, that was/is my view.
The reason to have one JVM per application is basically the same as having one OS process per application.
Here are a few reasons to have a process per application.
An application bug will not bring down or corrupt data in other applications sharing the same process.
System resources are accounted for per process, hence per application.
Terminating a process automatically releases all associated resources (an application may not clean up after itself, so sharing a process may produce resource leaks).
Some applications, such as Chrome, go even further, creating multiple processes to isolate different tabs and plugins.
Speaking of Java, there are a few more reasons not to share a JVM.
The heap-space maintenance penalty is higher with a large heap size. Multiple smaller independent heaps are easier to manage.
It is fairly hard to unload an "application" in a JVM (there are too many subtle reasons for it to stay in memory even if it is not running).
The JVM has a lot of tuning options which you may want to tailor to each application.
Though there are several cases where a JVM is actually shared between applications:
Application servers and servlet containers (e.g. Tomcat). Server-side Java specs are designed with a shared server JVM and dynamic loading/unloading of applications in mind.
There have been a few attempts to create a shared-JVM utility for CLI applications (e.g. Nailgun).
But in practice, even in server-side Java, it is usually better to use a JVM (or several) per application, for the reasons mentioned above.
For isolating execution contexts.
If one of the processes hangs, fails, or has its security compromised, the others are not affected.
I think having separate runtimes also helps GC, because each has fewer references to handle than if everything ran together.
Besides, why would you run them all in one JVM?
Java application servers, like JBoss, are designed to run many applications in one JVM.

Why do we use multiple application server instances on the same server

I guess there is a good reason, but I don't understand why we sometimes put, for example, 5 instances hosting the same web applications on the same physical server.
Does it have something to do with an optimisation for a multi-processor architecture?
The max allowed RAM limit for a JVM, or something else?
Hmmm... After a long time I am seeing this question again :)
Well, multiple JVM instances on a single machine solve a lot of issues. First off, let us face this: although JDK 1.7 is coming into the picture, a lot of legacy applications were developed using JDK 1.3, 1.4, or 1.5, and a major chunk of deployed applications is still divided among them.
Now to your question:
Historically, there are three primary issues that system architects have addressed by deploying multiple JVMs on a single box:
Garbage collection inefficiencies: As heap sizes grow, garbage collection cycles, especially for major collections, tend to introduce significant delays into processing, thanks to single-threaded GC. Multiple JVMs combat this by allowing smaller heap sizes in general and enabling some measure of concurrency during GC cycles (e.g., with four nodes, when one goes into GC, you still have three others actively processing).
Resource utilization: Older JVMs were unable to scale efficiently past four CPUs or so. The answer? Run a separate JVM for every 2 CPUs in the box (mileage may vary depending on the application, of course).
64-bit issues: Older JVMs were unable to allocate heap sizes beyond the 32-bit maximum. Again, multiple JVMs allow you to maximize your resource utilization.
Availability: One final reason that people sometimes run multiple JVMs on a single box is for availability. While it's true that this practice doesn't address hardware failures, it does address a failure in a single instance of an application server.
Taken from ( http://www.theserverside.com/discussions/thread.tss?thread_id=20044 )
I have mostly seen weblogic. Here is a link for further reading:
http://download.oracle.com/docs/cd/E13222_01/wls/docs92/perform/WLSTuning.html#wp1104298
Hope this will help you.
I guess you are referring to application clustering.
AFAIK, JVMs spawned with really large heap sizes have issues when it comes to garbage collection, though I'm sure that by playing around with the GC algorithm and parameters you can bring the damage down to a minimum. Plus, clustered applications don't have a single point of failure. If one node goes down, the remaining nodes can keep servicing clients. This is one of the reasons why "message-based architectures" are a good fit for scalability. Each request is mapped to a message which can then be picked up by any node in a cluster.
Another point would be servicing multiple requests simultaneously in case your application, unfortunately, makes heavy use of the synchronized keyword. We currently have a legacy application with a lot of shared state (unfortunately), and hence concurrent request handling is done by spawning around 20 JVM processes with a central dispatching unit which does all the dispatching work. ;-)
I would suggest you use at least one JVM per NUMA region. If a single JVM spans more than one NUMA region (often a single CPU), performance can degrade significantly, due to the much higher cost of accessing the main memory of another CPU.
Additionally, using multiple servers can allow you to
use different versions of Java or of your application server.
isolate different applications which could interfere (they shouldn't, but they might).
limit GC pause times between services.
EDIT: It could be historical. There may have been any number of reasons to have separate JVMs in the past but since you don't know what they were, you don't know if they still apply and it may be simpler to leave things as they are.
An additional reason to use multiple instances is serviceability.
For example, if you have multiple different applications for multiple customers, then having separate instances of the app server for each application can make life a little easier when you have to do an app server restart during a release.
Suppose you have an average-configuration host with a single instance of the web/app server installed. Now your application becomes more popular and the number of hits increases twofold. What do you do now?
Add one more physical server of the same configuration, install the application, and load-balance the two hosts.
This is not the end of it for your application. It will keep becoming more popular, and hence the need to scale it up. What's your strategy going to be?
keep adding more hosts of the same configuration
buy a more powerful machine where you can create more logical application servers
Which option will you go for?
You will do a cost analysis, which will involve factors like actual hardware cost, cost of managing these servers (power cost, space occupied in the data center), etc.
It turns out that the decision is not very easy. And in most cases it's more cost-effective to have a more powerful machine.

one high-end server with one Application Server or multiple Application Servers?

If I have a high-end server, for example with 1T memory and 8x4core CPU...
will it bring more performance if I run multiple App Server (on different JVM) rather than just one App Server?
On the app server I will run some services (an EAR with message-driven beans) which exchange messages with each other.
BTW, does 64-bit Java now have no memory limitation any more?
http://java.sun.com/products/hotspot/whitepaper.html#64
will it bring more performance if I run multiple App Server (on different JVM) rather than just one App Server?
There are several things to take into account:
A single app server means a single point of failure. For many applications, this is not an option and using horizontal and vertical scaling is a common configuration (i.e. multiple VMs per machine and multiple machines). And adding more machines is obviously easier/cheaper if they are small.
A large heap takes longer to fill so the application runs longer before a garbage collection occurs. However, a larger heap also takes longer to compact and causes garbage collection to take longer. Sizing the VM usually means finding a good compromise between frequency and duration (in other words, you don't always want to give as much RAM as possible to one VM)
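To find that compromise in practice you can observe GC frequency and duration from inside the VM; a minimal sketch using the standard java.lang.management API (nothing application-specific is assumed):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Prints, for each collector in this VM, how many collections have run
    // and how much total time they have consumed so far.
    public class GcStats {
        public static void main(String[] args) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }

Comparing these numbers under load for different heap sizes shows the frequency/duration trade-off directly.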
So, in my experience, running multiple machines hosting multiple JVMs is the usual choice (it is usually cheaper than one huge beast and gives you more flexibility).
There is automatically a performance hit when you need to do out-of-process communication, so the question is whether the application server scales poorly enough for that cost to pay off.
As a basic rule of thumb the JVM design allows the usage of any number of CPUs and any amount of RAM the operating system provides. The actual limits are JVM-implementation specific, and you need to read the specifications very carefully before choosing, to see if there are any limits relevant to you.
Given you have a JVM which can utilize the hardware, you then need an app server which can scale appropriately. A common bottleneck these days is the number of web requests that can be processed per second - a modern server should be able to process 10000 requests per second (see http://www.kegel.com/c10k.html) but not all do.
So, first of all identify your most pressing needs (connections per second? memory usage? network bandwidth?) and use that to identify the best platform + jvm + app server combination. If you have concrete needs, vendors will usually be happy to assist you to make a sale.
Most likely you will gain by running multiple JVMs with smaller heaps instead of a single large JVM. There are a couple of reasons for this:
Smaller heaps mean shorter garbage collections.
More JVMs mean less competition for internal resources inside each JVM, such as thread pools and other synchronized access.
How many JVMs you should fit into that box depends on what the application does. The best way to determine this is to set up a load test that simulates production load and observe how the number of requests the system can handle grows with the number of added JVMs. At some point you will see that adding more JVMs does not improve throughput. That's where you should stop.
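As one way to run such a load test, here is a minimal load-generation sketch; the endpoint URL, concurrency, and time window are placeholders, and it assumes Java 11+ for java.net.http:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    // Hammers one endpoint with a fixed number of simulated clients for a fixed
    // window and reports how many requests completed successfully.
    public class LoadProbe {
        public static void main(String[] args) throws InterruptedException {
            URI target = URI.create("http://localhost:8080/ping");   // placeholder endpoint
            int concurrency = 50;                                    // simulated clients
            Duration window = Duration.ofSeconds(30);

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(target).GET().build();
            LongAdder completed = new LongAdder();
            long deadline = System.nanoTime() + window.toNanos();

            ExecutorService pool = Executors.newFixedThreadPool(concurrency);
            for (int i = 0; i < concurrency; i++) {
                pool.submit(() -> {
                    while (System.nanoTime() < deadline) {
                        try {
                            HttpResponse<Void> response =
                                    client.send(request, HttpResponse.BodyHandlers.discarding());
                            if (response.statusCode() == 200) {
                                completed.increment();
                            }
                        } catch (Exception e) {
                            // a real test should count and report failures too
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(window.toSeconds() + 10, TimeUnit.SECONDS);

            System.out.println("Requests completed in " + window.toSeconds()
                    + "s window: " + completed.sum());
        }
    }

Rerun it while varying the number of JVMs on the box and stop adding JVMs when the completed count stops growing.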
Yet, there is another consideration. It is better to have multiple physical machines rather than a single big fat box. This is reliability. Should this box go offline for some reason, it will take with it all the app servers that are running inside it. The infrastructure running many separate smaller physical machines is going to be less affected by the failure of a single machine as compared to a single box.

How to force two Java threads to run on same processor/core?

I would like a solution that doesn't include critical sections or similar synchronization alternatives. I'm looking for something similar to the equivalent of Fibers (user-level threads) from Windows.
The OS manages which threads are run on which core. You would need to assign the threads to a single core in the OS.
For instance, on Windows, open Task Manager, go to the Processes tab, and right-click on the Java process... then assign it to a specific core.
That is the best you are going to get.
To my knowledge there is no way you can achieve that.
Simply because the OS manages running threads and distributes resources according to its scheduler.
Edit:
Since your goal is to have a "spare" core to run other processes on, I'd suggest you use a thread manager, get the number of cores on the system (x), and then spawn at most x-1 threads on that system. That way you'll have your spare core (a minimal sketch is shown below).
The former statements still apply: you cannot specify which cores to run threads on unless you specify it in the OS. But from Java, no.
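A minimal sketch of that x-1 idea using only standard APIs; note it does not pin anything to a particular core, the OS scheduler still decides where the threads actually run:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sizes a worker pool to the number of available cores minus one,
    // leaving one core's worth of capacity free for other processes.
    public class SpareCorePool {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            int workers = Math.max(1, cores - 1);

            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (int i = 0; i < workers; i++) {
                final int id = i;
                pool.submit(() -> System.out.println(
                        "worker " + id + " running on " + Thread.currentThread().getName()));
            }
            pool.shutdown();
        }
    }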
Short of assigning the entire JVM to a single core, I'm not sure how you'd be able to do this. In Linux, you can use taskset:
http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
I suppose you could run your JVM within a virtualized environment (e.g., VirtualBox/VMWare instance) with one processor allocated, but I'm not sure that that gets you what you want.
I read this as asking if a Java application can control the thread affinity itself. Java does not provide any way to control this. It is treated as the business of the host operating system.
If anything can do it, the OS can, and they typically can, though the tools you use for thread pinning will be OS specific. (But if the OS itself is virtualized, there are two levels of pinning. I don't know if that is going to work / be practical.)
There don't appear to be any relevant Hotspot JVM thread tuning options in modern JVMs.
If you were using a JRockit JVM you could choose between "native threads" (where there is a 1-1 mapping between Java and OS threads) and "thin threads" where multiple Java threads are multiplexed onto a small number of OS threads. But AFAIK, JRockit "thin threads" are only supported in 32-bit mode, and they don't allow you to tune the number of OS threads used.
This is really the kind of question that you should be asking under a Sun support contract. They have people who have spent years figuring out how to get the best performance out of big Java apps.
