How to define a program's requirements - Java

Is there any easy, cheap (i.e. not requiring testing the program on many hardware configurations) and painless method to define the hardware requirements (CPU, RAM, etc.) needed to run my own program? How should this be done?
I have a quite resource-hungry program written in Java and I don't know how to define a hardware specification that will be enough to run this application smoothly.

No, I don't think there is any generally applicable way to determine the minimum requirements that does not involve testing on some specified reference hardware.
You may be able to find some of the limitations by using Virtual Machines of some kind - it is easier to modify the parameters of some VM than modifying hardware. But there are artifacts generated by the interaction between host and VM that may influence your results.
It is also difficult to define the criteria for "acceptable performance" in general without knowing a lot about use cases.
Many programs will use more resources if they are available, but can also get along with less.
For example, consider a program using a thread pool whose size is based on the number of CPU cores. When running on a CPU with more cores, more work can be done in parallel, but at the same time the overhead due to thread creation, synchronisation and aggregation of results increases. The effects are non-linear in the number of CPUs and depend a lot on the actual program and data. Similarly, the effects of decreasing available memory range from potentially throwing OutOfMemoryErrors for some inputs (but possibly not for others) to just running GC a bit more frequently (and the effects of that depend on the GC strategy, ranging from noticeable freezes to just a bit more CPU load).
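To make that thread-pool example concrete, here is a minimal sketch of the pattern (my own illustration, not code from the question): the pool size follows Runtime.getRuntime().availableProcessors(), so the very same program behaves differently on a 4-core and a 32-core machine.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class CoreSizedPool {
        public static void main(String[] args) {
            // Pool size follows the hardware, so behaviour differs between machines.
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            for (int i = 0; i < 100; i++) {
                final int task = i;
                pool.submit(() -> doWork(task)); // more cores => more tasks truly in parallel
            }
            pool.shutdown();
        }

        // Placeholder for the application's real work.
        private static void doWork(int task) { /* ... */ }
    }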
All that is without even considering that programs don't usually live in isolation - they run on an operating system in parallel with other tasks that also consume resources.

Related

JVM performance based on CPU type

I don't know if there is a straightforward answer to this. I have the specifications of different CPU types. For example, two instances, A and B.
I want to run a simple Java console application on A and B. Based on their specifications, can I assume the runtime of B after knowing the runtime on A?
My second question is about the number of cores. Can I assume the runtime of a machine with i cores after knowing the result on the same machine with j cores?
Are there any approximations for this? The instances I am talking about are Amazon EMR instances.
Thank you.
Based on their specifications, can I [estimate] the runtime of B after knowing the runtime on A?
The answer is No. Or at least, not with any accuracy or confidence. The performance of an application often depends on complex interactions between the algorithm's memory access patterns, memory caches and virtual memory hardware. These are impossible to predict if you treat the application as a black box, and they can be difficult to model even if you understand what it is doing. GC can also have the same kind of unpredictable behavior.
Can I [estimate] the runtime of a machine with i cores after knowing the result on the same machine with j cores?
The answer is No. Application performance as you increase the number of cores is highly dependent on the way that you have designed and implemented your application. In the best case you could get linear speedup ... up to the limit of the platform's memory system. In the worst case, you could get no speedup at all.
The only practical solution is to make a guesstimate ... then try out the various alternative platforms and see how your application performs on them.
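If you do go down the try-it-and-see road, even a crude wall-clock probe run on each candidate platform tells you more than spec sheets. A sketch, where representativeWorkload() is a hypothetical stand-in for a slice of your real application:

    public class WallClockProbe {
        public static void main(String[] args) {
            // Warm up first so JIT compilation does not dominate the measurement.
            for (int i = 0; i < 5; i++) representativeWorkload();

            long start = System.nanoTime();
            representativeWorkload();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("workload took " + elapsedMs + " ms");
        }

        // Replace with a representative slice of the real application's work.
        private static void representativeWorkload() {
            long sum = 0;
            for (int i = 0; i < 50_000_000; i++) sum += i % 7;
            if (sum == 42) System.out.println(); // keeps the JIT from removing the loop
        }
    }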
I don't know if there is a straightforward answer to this. I have the specifications of different CPU types. For example, two instances, A and B. I want to run a simple Java console application on A and B. Based on their specifications, can I assume the runtime of B after knowing the runtime on A?
You can guesstimate it by looking at the relative performance using a benchmark like PassMark. However, you will see occasions where a system which is supposed to be faster is in fact slower. This is only a very rough estimation.
My second question is about the number of cores. Can I assume the runtime of a machine with i cores after knowing the result on the same machine with j cores?
It is highly unlikely that if you have double the number of cores you will get double the performance.
Your application may not scale with the number of cores, as your bottleneck might be another resource, e.g. the network or disk, or your program may have too much sequential code.
When you have more cores, those cores tend to run at lower sustained speeds because the socket can only dissipate so much heat.
Are there any approximations for this? The instances I am talking about are Amazon EMR instances.
If you are talking about virtual machines instead of real machines, it is even harder to estimate. You might find that two virtual machines with the same spec don't perform the same; e.g. since your application might be running on different physical hosts at different times, you may find its performance varies even though the machines should have much the same spec.

How long does a Java thread switch take?

I'm learning reactive programming techniques, with async I/O etc, and I just can't find decent authoritative comparative data about the benefits of not switching threads.
Apparently switching threads is "expensive" compared to computations. But what scale are we talking on?
The essential question is "How many processor cycles/instructions does it take to switch a java thread?" (I'm expecting a range)
Is it affected by OS?
I presume it's affected by number of threads, which is why async IO is so much better than blocking - the more threads, the further away the context has to be stored (presumably even out of the cache into main memory).
I've seen Approximate timings for various operations which although it's (way) out of date, is probably still useful for relating processor cycles (network would likely take more "instructions", SSD disk probably less).
I understand that reactive applications enable web apps to go from thousands to tens of thousands of requests per second (per server), but that's hard to verify too - comments welcome
NOTE - I know this is a bit of a vague, useless, fluffy question at the moment because I have little idea on the inputs that would affect the speed of a context switch. Perhaps statistical answers would help - as an example I'd guess >=60% of threads would take between 100-10000 processor cycles to switch.
Thread switching is done by the OS, so Java has little to do with it. Also, on Linux at least (and I presume on many other operating systems) the scheduling cost does not depend on the number of threads: Linux has used an O(1) scheduler since version 2.6.
The thread switch overhead on Linux is some 1.2 µs (article from 2018). Unfortunately the article doesn't list the clock speed at which that was measured, but the overhead should be some 1000-2000 clock cycles or thereabout. On a given machine and OS the thread switching overhead should be more or less constant, not a wide range.
Apart from this direct switching cost there's also the cost of changing workload: the new thread is most likely using a different set of instructions and data, which need to be loaded into the cache, but this cost doesn't differ between a thread switch or an asynchronous programming 'context switch'. And for completeness, switching to an entirely different process has the additional overhead of changing the memory address space, which is also significant.
By comparison, the switching overhead between goroutines in the Go programming language (which uses userspace threads, very similar to asynchronous programming techniques) was around 170 ns, about one seventh of a Linux thread switch.
Whether that is significant for you depends on your use case of course. But for most tasks, the time you spend doing computation will be far more than the context switching overhead. Unless you have many threads that do an absolutely tiny amount of work before switching.
Threading overhead has improved a lot since the early 2000s, and according to the linked article running 10,000 threads in production shouldn't be a problem on a recent server with a lot of memory. General claims of thread switching being slow are often based on yesteryear's computers, so take those with a grain of salt.
One remaining fundamental advantage of asynchronous programming is that the userspace scheduler has more knowledge about the tasks, and so can in principle make smarter scheduling decisions. It also doesn't have to deal with processes from different users doing wildly different things that still need to be scheduled fairly. But even that can be worked around, and with the right kernel extensions these Google engineers were able to reduce the thread switching overhead to the same range as goroutine switches (200 ns).
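For a ballpark figure on your own machine, a ping-pong microbenchmark like the sketch below (my own, not from the article) forces two thread handoffs per round. The number it prints includes the SynchronousQueue overhead, so treat it as an upper bound on the raw switch cost.

    import java.util.concurrent.SynchronousQueue;

    public class PingPong {
        public static void main(String[] args) throws InterruptedException {
            final SynchronousQueue<Integer> ping = new SynchronousQueue<>();
            final SynchronousQueue<Integer> pong = new SynchronousQueue<>();
            final int rounds = 100_000;

            Thread other = new Thread(() -> {
                try {
                    for (int i = 0; i < rounds; i++) {
                        ping.take();   // wait for the main thread
                        pong.put(i);   // hand control straight back
                    }
                } catch (InterruptedException ignored) { }
            });
            other.start();

            long start = System.nanoTime();
            for (int i = 0; i < rounds; i++) {
                ping.put(i);
                pong.take();
            }
            long elapsed = System.nanoTime() - start;
            other.join();

            // Each round forces at least two thread switches.
            System.out.println("approx. cost per switch: " + elapsed / (rounds * 2L) + " ns");
        }
    }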
Rugal has a point. In modern architectures theoretical turn-around times are usually far off from actual measurements because the hardware, as well as the software have become so much more complex. It also inherently depends on your application. Many web-applications for example are I/O-bound where the context switch time matters a lot less.
Also note that context switching (what you refer to as thread switching) is an OS thing and not a Java thing. There is no guarantee as to how "heavy" a context switch in your OS is. It used to take tens if not hundreds of thousands of CPU cycles to do a kernel-level switch, but there are also user-level switches, as well as experimental systems, where even kernel-level switches can take only a few hundred cycles.

Why do we use multiple application server instances on the same server

I guess there is a good reason, but I don't understand why we sometimes put, for example, 5 instances of the same web application on the same physical server.
Does it have something to do with optimisation for a multi-processor architecture?
The maximum RAM limit allowed for a JVM, or something else?
Hmmm... After a long time I am seeing this question again :)
Well, multiple JVM instances on a single machine solve a lot of issues. First of all, let us face this: although JDK 1.7 is coming into the picture, a lot of legacy applications were developed using JDK 1.3, 1.4 or 1.5, and a major chunk of JDK usage is still divided among them.
Now to your question:
Historically, there are three primary issues that system architects have addressed by deploying multiple JVMs on a single box:
Garbage collection inefficiencies: As heap sizes grow, garbage collection cycles--especially for major collections--tended to introduce significant delays into processing, thanks to the single-threaded GC. Multiple JVMs combat this by allowing smaller heap sizes in general and enabling some measure of concurrency during GC cycles (e.g., with four nodes, when one goes into GC, you still have three others actively processing).
Resource utilization: Older JVMs were unable to scale efficiently past four CPUs or so. The answer? Run a separate JVM for every 2 CPUs in the box (mileage may vary depending on the application, of course).
64-bit issues: Older JVMs were unable to allocate heap sizes beyond the 32-bit maximum (a 32-bit address space tops out at 4 GB, and in practice a 32-bit JVM heap is usually limited to roughly 1.5-2 GB). Again, multiple JVMs allow you to maximize your resource utilization.
Availability: One final reason that people sometimes run multiple JVMs on a single box is for availability. While it's true that this practice doesn't address hardware failures, it does address a failure in a single instance of an application server.
Taken from ( http://www.theserverside.com/discussions/thread.tss?thread_id=20044 )
I have mostly seen weblogic. Here is a link for further reading:
http://download.oracle.com/docs/cd/E13222_01/wls/docs92/perform/WLSTuning.html#wp1104298
Hope this will help you.
I guess you are referring to application clustering.
AFAIK, JVMs spawned with really large heap sizes have issues when it comes to garbage collection, though I'm sure that by playing around with the GC algorithm and parameters you can bring the damage down to a minimum. Plus, clustered applications don't have a single point of failure. If one node goes down, the remaining nodes can keep servicing the clients. This is one of the reasons why "message based architectures" are a good fit for scalability. Each request is mapped to a message which can then be picked up by any node in a cluster.
Another point would be to service multiple requests simultaneously in case your application, unfortunately, uses the synchronized keyword liberally. We currently have a legacy application which has a lot of shared state (unfortunately), and hence concurrent request handling is done by spawning around 20 JVM processes with a central dispatching unit which does all the dispatching work. ;-)
I would suggest you use at least one JVM per NUMA region. If a single JVM spans more than one NUMA region (often a single CPU socket), performance can degrade significantly due to the much higher cost of accessing main memory attached to another CPU.
Additionally, using multiple servers can allow you to:
use different versions of Java or of your application server.
isolate different applications which could interfere with each other (they shouldn't, but they might)
limit GC pause times between services.
EDIT: It could be historical. There may have been any number of reasons to have separate JVMs in the past but since you don't know what they were, you don't know if they still apply and it may be simpler to leave things as they are.
An additional reason to use multiple instances is serviceability.
For example, if you host multiple applications for multiple customers, then having separate instances of the app server for each application can make life a little easier when you have to restart an app server during a release.
Suppose you have an average-configuration host with a single instance of the web/app server installed. Now your application becomes more popular and the number of hits doubles. What do you do now?
Add one more physical server of the same configuration, install the application and load balance across the two hosts.
This is not the end of the story for your application. It will keep getting more popular and hence will need to be scaled up. What's your strategy going to be?
keep adding more hosts of the same configuration
buy a more powerful machine on which you can create more logical application servers
Which option will you go for?
You will do a cost analysis, which will involve factors like the actual hardware cost and the cost of managing these servers (power, space occupied in the data centre, etc.).
It turns out that the decision is not very easy, and in most cases it's more cost-effective to have a more powerful machine.

one high-end server with one Application Server or multiple Application Servers?

If I have a high-end server, for example with 1 TB of memory and 8 quad-core CPUs...
will it bring more performance if I run multiple App Servers (in separate JVMs) rather than just one App Server?
On the App Server I will run some services (EARs with message-driven beans) which exchange messages with each other.
By the way, does 64-bit Java now have no memory limitations any more?
http://java.sun.com/products/hotspot/whitepaper.html#64
will it bring more performance if I run multiple App Servers (in separate JVMs) rather than just one App Server?
There are several things to take into account:
A single app server means a single point of failure. For many applications, this is not an option and using horizontal and vertical scaling is a common configuration (i.e. multiple VMs per machine and multiple machines). And adding more machines is obviously easier/cheaper if they are small.
A large heap takes longer to fill, so the application runs longer before a garbage collection occurs. However, a larger heap also takes longer to compact and causes garbage collection to take longer. Sizing the VM usually means finding a good compromise between frequency and duration (in other words, you don't always want to give as much RAM as possible to one VM).
So, in my experience, running multiple machines hosting multiple JVMs is the usual choice (it is usually cheaper than one huge beast and gives you more flexibility).
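One way to see the frequency-versus-duration trade-off mentioned above for yourself is to run a toy allocation loop (a sketch of mine, not from the answer) under different -Xmx settings with GC logging enabled, and compare how often collections happen and how long they take.

    public class AllocChurn {
        // Run with different heap sizes and GC logging, e.g.:
        //   java -Xmx256m -verbose:gc AllocChurn
        //   java -Xmx2g   -verbose:gc AllocChurn
        public static void main(String[] args) {
            byte[][] retained = new byte[1024][];
            for (int i = 0; i < 10_000_000; i++) {
                // Mostly short-lived garbage, plus a little retained data so the heap is never empty.
                retained[i % retained.length] = new byte[1024];
            }
            System.out.println("max heap reported by the JVM: "
                    + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
        }
    }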
There is automatically a performance hit when you need to do out-of-process communication, so the question is whether a single application server scales badly enough for the split to pay off.
As a basic rule of thumb, the JVM design allows the use of any number of CPUs and any amount of RAM the operating system provides. The actual limits are JVM-implementation specific, and you need to read the specifications very carefully before choosing, to see if any limits are relevant to you.
Given a JVM which can utilize the hardware, you then need an app server which can scale appropriately. A common bottleneck these days is the number of web requests that can be processed per second - a modern server should be able to process 10,000 requests per second (see http://www.kegel.com/c10k.html) but not all do.
So, first of all identify your most pressing needs (connections per second? memory usage? network bandwidth?) and use that to identify the best platform + jvm + app server combination. If you have concrete needs, vendors will usually be happy to assist you to make a sale.
Most likely you will gain by running multiple JVMs with smaller heaps instead of a single large JVM. There are a couple of reasons for this:
Smaller heaps mean shorter garbage collections
More JVMs mean less competition for resources internal to each JVM, such as thread pools and other synchronized access.
How many JVMs you should fit into that box depends on what the application does. The best way to determine this is to set up a load test that simulates production load and observe how the number of requests the system can handle grows with the number of added JVMs. At some point you will see that adding more JVMs does not improve throughput. That's where you should stop.
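A client-side sketch of such a load test (my own illustration; sendRequest() is a hypothetical stand-in for a call to the system under test). Run it against one JVM, then two, then three, and watch where requests/second stops growing:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class ThroughputProbe {
        public static void main(String[] args) throws InterruptedException {
            int clients = 50;              // concurrent simulated users
            long durationMs = 60_000;      // measurement window
            AtomicLong completed = new AtomicLong();

            ExecutorService pool = Executors.newFixedThreadPool(clients);
            long end = System.currentTimeMillis() + durationMs;
            for (int i = 0; i < clients; i++) {
                pool.submit(() -> {
                    while (System.currentTimeMillis() < end) {
                        sendRequest();           // hypothetical call to the app server
                        completed.incrementAndGet();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(durationMs + 10_000, TimeUnit.MILLISECONDS);
            System.out.println("requests/second: " + completed.get() * 1000 / durationMs);
        }

        private static void sendRequest() { /* issue one request to the system under test */ }
    }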
Yet, there is another consideration. It is better to have multiple physical machines rather than a single big fat box. This is reliability. Should this box go offline for some reason, it will take with it all the app servers that are running inside it. The infrastructure running many separate smaller physical machines is going to be less affected by the failure of a single machine as compared to a single box.

Java performance Inconsistent

I have an interpreter written in Java. I am trying to test the performance results of various optimisations in the interpreter. To do this I parse the code and then repeatedly run the interpreter over it; this continues until I get 5 runs which differ by a very small margin (0.1 s in the times below), then the mean is taken and printed. No I/O or randomness happens in the interpreter. If I run the interpreter again I get different run times:
91.8s
95.7s
93.8s
97.6s
94.6s
94.6s
107.4s
I have tried, to no avail, the server and client VM, the serial and parallel GC, large pages, and Windows and Linux. This is on the 1.6.0_14 JVM. The computer has no processes running in the background. So I am asking: what may be causing these large variations, or how can I find out?
The actual issue was caused by the fact that the program had to iterate to a fixed-point solution and the values were stored in a HashSet. The hash values differed between runs, resulting in a different ordering, which in turn led to a change in the number of iterations needed to reach the solution.
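That effect is easy to reproduce. A HashSet of objects relying on the default identity hash code can iterate in a different order from one JVM run to the next, so any processing driven by that order is not repeatable (a sketch, not the asker's actual code):

    import java.util.HashSet;
    import java.util.Set;

    public class HashOrder {
        // No hashCode()/equals() override, so Object.hashCode() (the identity hash) is used,
        // and that value can differ from one JVM run to the next.
        static class Node {
            final String name;
            Node(String name) { this.name = name; }
            @Override public String toString() { return name; }
        }

        public static void main(String[] args) {
            Set<Node> work = new HashSet<>();
            for (String n : new String[] {"a", "b", "c", "d", "e"}) work.add(new Node(n));
            // Iteration order (and hence the order in which work items are processed)
            // may change between runs; a LinkedHashSet or TreeSet would be stable.
            System.out.println(work);
        }
    }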
"Wall clock time" is rarely a good measurement for benchmarking. A modern OS is extremely unlikely to "[have] no processes running in the background" -- for all you know, it could be writing dirty block buffers to disk, because it's decided that there's no other contention.
Instead, I recommend using ThreadMXBean to track actual CPU consumption.
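A minimal sketch of that approach, where runInterpreter() is a placeholder for the code being benchmarked:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    public class CpuTimeProbe {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            if (!threads.isCurrentThreadCpuTimeSupported()) {
                System.err.println("CPU time measurement not supported on this JVM");
                return;
            }

            long cpuStart = threads.getCurrentThreadCpuTime();  // nanoseconds of CPU actually used
            long wallStart = System.nanoTime();

            runInterpreter();                                    // the workload under test

            long cpuNs = threads.getCurrentThreadCpuTime() - cpuStart;
            long wallNs = System.nanoTime() - wallStart;
            System.out.printf("cpu: %.2f s, wall clock: %.2f s%n", cpuNs / 1e9, wallNs / 1e9);
        }

        private static void runInterpreter() { /* placeholder for the benchmarked code */ }
    }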
Your variations don't look that large. It's simply the nature of the beast that there are other things running outside of your direct control, both in the OS and the JVM, and you're not likely to get exact results.
Things that could affect runtime:
if your test runs are creating objects (may be invisible to you, within library calls, etc) then your repeats may trigger a GC
Different GC algorithms and configurations will react differently, with different thresholds for incremental GC. You could try to run a System.gc() before every run, although the JVM is not guaranteed to collect when you call it (although it always has when I've played with it). Depending on the size of your test and how many iterations you're running, this may be an unpleasantly (and nearly uselessly) slow thing to wait for (see the sketch after this list).
Are you doing any sort of randomization within your tests? e.g. if you're testing integers, values with absolute value below 128 may be handled slightly differently in memory (they hit the boxed Integer cache).
Ultimately I don't think it's possible to get an exact figure, probably the best you can do is an average figure around the cluster of results.
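Putting the warm-up and System.gc() suggestions from the list together, a measurement loop might look like this sketch (runOnce() is a placeholder for one pass of the code being measured):

    public class BenchmarkLoop {
        public static void main(String[] args) {
            int warmups = 10, measured = 5;

            // Warm-up runs let the JIT compile the hot paths before anything is timed.
            for (int i = 0; i < warmups; i++) runOnce();

            for (int i = 0; i < measured; i++) {
                System.gc();                     // a request, not a guarantee
                long start = System.nanoTime();
                runOnce();
                System.out.printf("run %d: %.3f s%n", i, (System.nanoTime() - start) / 1e9);
            }
        }

        private static void runOnce() { /* one full pass of the code being measured */ }
    }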
The garbage collection may be responsible. Even though your logic is the same, it may be that the GC logic is being scheduled on external clock/events.
But I don't know that much about the JVM's GC implementation.
This seems like a significant variation to me; I would try running with -verbose:gc.
You should be able to get the variation to much less than a second if your process has no significant I/O, output or network activity.
I suggest profiling your application; there are highly likely to be significant savings if you haven't done this already.
