Share core library between Java processes


Is there a way to share the core library between Java processes (or some other way to minimize the JVM's initial memory footprint)?
So here's my case. I'm playing with microservices, and I'm running quite a lot of them. I set their heap to 128M, as that's enough for them, but I've noticed that each Linux process consumes much more.
If I understand correctly from here
Max memory = [-Xmx] + [-XX:MaxPermSize] + number_of_threads * [-Xss]
I am using Java 8, though, so perm size is probably no longer an issue. Or is it?
There is an initial "core" JVM memory footprint, and I was wondering if you have heard of a way to somehow share that "core" memory between processes (as it's really the same in each), or any other way to deal with that extra cost when running many Java processes.

Conceptually you're asking if you can fork a JVM - since forking (generally) uses copy-on-write memory semantics this can be an effective space-saving measure. Unfortunately as discussed in this answer forking the JVM is not supported, and generally not practical. Non-Unix systems cannot fork efficiently, and there are numerous other side-effects that a forked JVM would have to resolve in messy ways. Theoretically you could probably fork a JVM process, but you'd be walking squarely into "undefined behavior" territory.
The "right" way to avoid JVM startup costs is to reduce the number of JVMs you need to start up in the first place. Java is a highly-concurrent language that supports shared access to common memory out of the box via its threading model. If you can refactor your code to run concurrently in the same JVM you'll see much better performance.
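To make the "same JVM" suggestion concrete, here is a minimal sketch of hosting several small services as tasks on one thread pool, so the JVM's fixed footprint is paid once rather than once per process. The class name, the `service` helper and the service names are all made up for illustration; a real service would start a listener or consume a queue instead of just recording its name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical host that runs several "services" inside a single JVM.
class SharedJvmHost {

    // Stand-in for a real service entry point: it only records that it ran.
    static Runnable service(String name, Set<String> finished) {
        return () -> finished.add(name);
    }

    // Runs every named service on a shared pool and returns how many finished.
    static int runAll(List<String> names) {
        Set<String> finished = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(names.size());
        for (String name : names) {
            pool.submit(service(name, finished));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        }
        return finished.size();
    }

    public static void main(String[] args) {
        int n = runAll(Arrays.asList("orders", "billing", "inventory"));
        System.out.println("services run in one JVM: " + n);
    }
}
```

The trade-off is isolation: services sharing a JVM also share its heap and its fate, so a leak or crash in one affects all of them.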

Related

What are the useful JVM options for a multithreaded application?

I am working on an application that creates a lot of threads and relies heavily on String manipulation.
The application works for a good 24 hrs at a time and needs to be always very responsive.
I am trying to keep the creation of objects to a minimum. The application is doing well without any configuration at the moment.
But I was wondering, for my own knowledge, whether there are any advantages (or disadvantages) in using a specific JVM configuration.
Please bear with me, I am pretty new to the subject of JVM/GC configuration:
I was wondering if there were any JVM options I should absolutely use while working with multithreads?
Should I configure the heap?
Should I also configure the GC?
Should I keep the Garbage Collection to a minimum?
I started reading: http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
Any tips on the subject would greatly be appreciated.
Thanks in advance,

Generally, the best initial advice concerning tweaking your JVM is: don't. Unless you are experiencing specific JVM-related problems with the default settings, leave them alone.
If you do need to fiddle around with the settings, I would recommend you set up a representative testcase and use an advanced profiler such as JProfiler.
Furthermore, you should really read the technical documentation regarding the HotSpot VM, specifically the Memory Management Whitepaper, all of which you may find here.

If it is working fine, then you should not do anything.
If your application is CPU-bound, you should not create a lot of threads.
The reason is that a lot of time is wasted in context switching.
If the String manipulation happens in memory, there should be only as many threads as are required:
Nthreads = NCPU * UCPU * (1 + W/C)
Where NCPU --> number of CPUs
UCPU --> target CPU utilization (between 0 and 1)
W --> wait time
C --> compute time
So for CPU-bound operations it should be at most (number of CPUs + 1) threads.
Also, there are a lot of test cases defined for concurrent applications in Java Concurrency in Practice. You may want to check those.
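As a quick worked example of the sizing rule from Java Concurrency in Practice (threads = number of CPUs × target utilization × (1 + W/C)), something like this could be used to pick a pool size. The class and method names are mine, and the 9:1 wait-to-compute ratio is just an assumed figure for an I/O-heavy task; you would have to measure your own.

```java
// Sketch of the pool-sizing rule: threads = cpus * utilization * (1 + wait / compute).
class PoolSizing {

    // Returns the suggested thread count, never less than one.
    static int threads(int cpus, double targetUtilization, double waitTime, double computeTime) {
        double n = cpus * targetUtilization * (1.0 + waitTime / computeTime);
        return Math.max(1, (int) Math.round(n));
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // CPU-bound work: negligible waiting, so roughly one thread per core.
        System.out.println("CPU-bound: " + threads(cpus, 1.0, 0.0, 1.0));
        // I/O-heavy work (assumed ratio): 9 units of waiting per unit of computing.
        System.out.println("I/O-heavy: " + threads(cpus, 1.0, 9.0, 1.0));
    }
}
```

With no waiting the formula collapses to the number of CPUs, which is where the "number of CPUs + 1" rule of thumb for CPU-bound work comes from (the extra thread covers the occasional stall).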

I was wondering if there were any JVM options I should absolutely use while working with multithreads?
All the best options will be on by default. If you look at HotSpot VM Options you can see quite a few are -XX:+ which means they are on by default.
Should I configure the heap?
Possibly. But I would leave the default setting if you can.
Should I also configure the GC?
Possibly. But I would leave the default setting if you can.
Should I keep the Garbage Collection to a minimum?
Reducing the amount of garbage created takes effort. It provides some benefit up to a point. You have to decide what is the best use of your time and how much time to spend reducing the amount of garbage created.
I would always start with a memory profiler and find where you are creating the most garbage. Start from the top of the list rather than trying to tune everything as this ensures you will get the most benefit for the least amount of effort.
BTW: I am an advocate of low garbage and off heap programs where it makes sense to do so. I have written trading systems which can run for a day without even a minor GC and programs which can load/use 500+ GB of data in off heap memory. However, you have to be able to demonstrate or quantify how much difference it will make to the end users or your business to determine whether it is really worth it.
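For anyone wondering what "off heap" looks like in practice, here is a minimal sketch using a direct ByteBuffer from the standard library. It is only an illustration of the idea, not the trading-system code mentioned above (which isn't shown); the class name and the fixed-size array-of-longs layout are assumptions of mine.

```java
import java.nio.ByteBuffer;

// Sketch: a fixed-size array of longs stored in native memory via a direct buffer,
// so the data itself does not live on the garbage-collected heap.
class OffHeapLongs {
    private final ByteBuffer buffer;

    OffHeapLongs(int capacity) {
        // allocateDirect reserves memory outside the Java heap.
        buffer = ByteBuffer.allocateDirect(capacity * Long.BYTES);
    }

    // Absolute put: writes the value without creating any objects.
    void set(int index, long value) {
        buffer.putLong(index * Long.BYTES, value);
    }

    // Absolute get: reads the value without creating any objects.
    long get(int index) {
        return buffer.getLong(index * Long.BYTES);
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }

    public static void main(String[] args) {
        OffHeapLongs prices = new OffHeapLongs(1_000);
        for (int i = 0; i < 1_000; i++) {
            prices.set(i, i * 10L);
        }
        System.out.println(prices.get(999));
    }
}
```

Reads and writes here work on primitives only, so the hot path allocates nothing. The cost is that you manage layout and bounds yourself.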

I was wondering if there were any JVM options I should absolutely use while working with multithreads?
No.
Should I configure the heap?
No, apart from setting the heap size to something reasonable (with -Xmx and -Xms)
Should I also configure the GC?
No, unless you have a particular need for "low-pause". The default throughput collector is the best option if you are currently meeting your "responsiveness" goals. If you are not meeting those goals, then you should consider CMS or G1 ... but beware that while they reduce pauses, they also reduce throughput.
Should I keep the Garbage Collection to a minimum?
No. That is not a sensible goal. Your aim is to maximize throughput, and minimizing GC won't necessarily achieve that. In a lot of cases, it is more efficient to generate garbage than to have the application do extra work to avoid generating garbage. (And as Peter Lawrey pointed out, you've also got the extra developer effort in writing and maintaining more complex code.)
I would advise you to use a profiler to see if your application is spending a lot of time (CPU time or elapsed time) in garbage collection relative to doing other productive work. If not, or if the application is already running fast enough, then don't fiddle with the JVM options.
If you are worried that your application won't cope with increased load in the future, then tweaking the GC doesn't scale. A better option is to investigate scaling up your hardware and/or figuring out how to do the work on multiple machines. In addition, tuning the GC to improve performance with current load may actually result in worse performance when the load increases. (Consider the problem that arises with CMS when it can't keep up and is forced to do a full stop-the-world collection to recover.)
Finally, it is generally speaking a bad idea to have lots of threads. It is better to use a small number of worker threads (roughly equal to the number of processors / cores) and feed them work via concurrent queues, etcetera.
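One rough way to check whether GC is taking a significant share of time, before reaching for a profiler, is to read the JVM's own collector counters through the standard java.lang.management API. This is only a sketch: the class and method names are mine, and the reported time is approximate (for concurrent collectors it may include work done alongside the application).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: estimate the share of JVM uptime spent in garbage collection.
class GcOverhead {

    // Total milliseconds the JVM reports for all collectors so far.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Approximate fraction of uptime spent in GC, clamped to the range 0.0 to 1.0.
    static double gcFraction() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        if (uptime <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) totalGcMillis() / uptime);
    }

    public static void main(String[] args) {
        System.out.printf("GC time: %d ms (%.1f%% of uptime)%n",
                totalGcMillis(), gcFraction() * 100.0);
    }
}
```

If the fraction stays at a few percent under realistic load, that supports leaving the GC settings alone.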

In the past, I have faced a similar server application: lots of String manipulation and String creation, and a need to be always very responsive. The app worked fine with the default configuration until it ran into high-stress situations. You need to enable -XX:+UseConcMarkSweepGC for low pauses, and fine-tune other parameters to ensure the app behaves the way you want. Here is the short list:
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=nn
-XX:CMSWaitDuration=300000
-XX:GCTimeRatio=nn
-XX:+DisableExplicitGC

Why does System.gc() seem to have no effect on some JVMs

I have been developing a small Java utility that uses two frameworks: Encog and Jetty to provide neural network functionality for a website.
The code is 'finished' in that it does everything it needs to do, but I have some problems with memory usage. When running on my development machine the memory usage seems to fluctuate between about 4MB and 13MB when the application is doing things (training neural networks) and at most it uses about 18MB. This is very good usage and I think it is due to the fact that I call System.gc() fairly regularly. I do this because the processing time doesn't matter for me, but the memory usage does.
So it all works fine on my machine, but as soon as I put it online on our server (shared unix hosting with memory limits) it uses about 19MB to start with and rises to hundreds of MB of memory usage when doing things. These are the same things that I have been doing in testing. The only way, I believe, to reduce the memory usage, is to quit the application and restart it.
The only difference that I can tell is the Java Virtual Machine that it is being run on. I do not know much about this, and I have tried to find the reason why it is acting this way, but a lot of the documentation assumes a great knowledge of Java and virtual machines. Could someone please help me with some reasons why this may be happening, and perhaps some things to try to stop it?
I have looked at using GCJ to compile the application, but I don't know if this is something I should be putting a lot of time into and whether it will actually help.
Thanks for the help!
UPDATE: Developing on Mac OS 10.6.3 and server is on a unix OS but I don't know what. (Server is from WebFaction)

I think it is due to the fact that I call System.gc() fairly regularly
You should not do that, it's almost never useful.
A garbage collector works most efficiently when it has lots of memory to play with, so it will tend to use a large part of what it can get. I think all you need to do is to set the max heap size to something like 32MB with an -Xmx32m command line parameter - the default depends on whether the JVM believes it's running on a "server class" system, in which case it assumes that you want the application to use as much memory as it can in order to give better throughput.
BTW, if you're running on a 64 bit JVM on the server, it will legitimately need more memory (usually about 30%) than on a 32bit JVM due to larger references.

Two points you might consider:
Calls to System.gc() can be disabled with a command-line parameter (-XX:+DisableExplicitGC). I think the behaviour also depends on the GC algorithm the VM uses. Normally, invoking the GC should be left to the JVM.
As long as there is enough memory available for the jvm I don't see anything wrong in using this memory to increase application and gc performance. As Michael Borgwardt said you can restrict the amount of memory the vm uses at the command line.

Also, you may want to look at which mode the JVM is started in when you deploy it online. My guess is that it's a server VM.
Take a look at the differences between the two right here on Stack Overflow. Also, see which garbage collector is actually running on the actual deployment. See if you can tweak the GC behaviour, or change the GC algorithm. See the -X options if it's a Sun JVM.

Basically the JVM takes the amount of memory it is allowed to as needed, in order to make the "new" operation as fast as possible (this is a science in itself).
So if you have a lot of objects being used, and then discarded, you will slowly and surely fill up the available memory. Then you can ask for garbage collection, but it is just a hint, and the JVM may choose not to listen.
So, you need another mechanism to keep memory usage down. The typical approach is to limit the amount of memory with the -X options (such as -Xmx), but be careful: the JVM you use on your PC may be very different from the one you deploy on, and the memory needs may therefore be different.
Is there a deliberate requirement for low memory usage? If not, then just let it run and see how the JVM behaves. Use jvisualvm to attach and monitor.
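If you want numbers from inside the application as well as from jvisualvm, a small sketch like the following prints what the JVM has actually used, committed, and is allowed to take, before and after a System.gc() request. The class name and the 16MB of throwaway allocations are arbitrary choices of mine. I haven't included expected output because the figures depend on the JVM, its mode and its -Xmx setting.

```java
// Sketch: report heap usage around a System.gc() request using java.lang.Runtime.
class HeapReport {

    // Bytes currently occupied by objects (live or not yet collected).
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // One-line summary: used heap, heap committed by the JVM, and the -Xmx ceiling.
    static String report(String label) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        return String.format("%s: used=%dMB committed=%dMB max=%dMB",
                label, usedBytes() / mb, rt.totalMemory() / mb, rt.maxMemory() / mb);
    }

    public static void main(String[] args) {
        // Create about 16MB of short-lived data.
        byte[][] garbage = new byte[64][];
        for (int i = 0; i < garbage.length; i++) {
            garbage[i] = new byte[256 * 1024];
        }
        System.out.println(report("after allocating"));

        garbage = null;
        System.gc(); // only a request; the JVM may ignore it
        System.out.println(report("after System.gc()"));
    }
}
```

Running it once with the default settings and once with something like -Xmx32m on both machines should show whether the difference is simply a larger default heap on the server.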

Perhaps the server uses more memory because there is a higher load on your app and so more threads are in use? Jetty will use a number of threads to spread out the load if there are a lot of requests. It's worth a look at the thread count on the server versus on your test machine.

RAM memory reallocation - Windows and Linux

I am working on a project involving optimizing energy consumption within a system. Part of that project consists of allocating RAM based on locality, that is, allocating memory segments for a program as close as possible to each other. Is there a way I can know exactly where the memory I allocate is physically located (on which memory chips)? I was also wondering whether it is possible to force allocation in a deterministic manner. I am interested in both Windows and Linux. Also, the project will be implemented in Java and .NET, so I am interested in managed APIs to achieve this.
[I am aware that this might not translate into direct energy consumption reduction but the project is supposed to be a proof of concept.]

You're working at the wrong level of abstraction.
Java (and presumably .NET) refers to objects using handles, rather than raw pointers. The underlying Java VM can move objects around in virtual memory at any time; the Java application doesn't see any difference.
Win32 and Linux applications (such as the Java VM) refer to memory using virtual addresses. There is a mapping from virtual address to a physical address on a RAM chip. The kernel can change this mapping at any time (e.g. if the data gets paged to disk then read back into a different memory location) and applications don't see any difference.
So if you're using Java and .NET, I wouldn't change your Java/.NET application to achieve this. Instead, I would change the underlying Linux kernel, or possibly the Java VM.
For a prototype, one approach might be to boot Linux with the mem= parameter to restrict the kernel's memory usage to less than the amount of memory you have, then look at whether you can mmap the spare memory (maybe by mapping /dev/mem as root?). You could then change all calls to malloc() in the Java VM to use your own special memory allocator, which allocates from that free space.
For a real implementation of this, you should do it by changing the kernel and keeping userspace compatibility. Look at the work that's been done on memory hotplug in Linux, e.g. http://lhms.sourceforge.net/

If you want to try this in a language with a big runtime you'd have to tweak the implementation of that runtime or write a DLL/shared object to do all the memory management for your sample application. At which point the overall system behaviour is unlikely to be much like the usual operation of those runtimes.
The simplest, cleanest test environment to detect the (probably small) advantages of locality of reference would be in C++ using custom allocators. This environment will remove several potential causes of noise in the runtime data (mainly the garbage collection). You will also lose any power overhead associated with starting the CLR/JVM or maintaining its operating state - which would presumably also be welcome in a project to minimise power consumption. You will naturally want to give the test app a processor core to itself to eliminate thread switching noise.
Writing a custom allocator to give you one of the preallocated chunks on your current page shouldn't be too tough, but given that to accomplish locality of reference in C/C++ you would ordinarily just use the stack it seems unlikely there will be one you can just find, download and use.

In C/C++, if you coerce a pointer to an int, this tells you the address. However, under Windows and Linux, this is a virtual address -- the operating system determines the mapping to physical memory, and the memory management unit in the processor carries it out.
So, if you care where your data is in physical memory, you'll have to ask the OS. If you just care if your data is in the same MMU block, then check the OS documentation to see what size blocks it's using (4KB is usual for x86, but I hear kids these days are playing around with 16M giant blocks?).
Java and .NET add a third layer to the mix, though I'm afraid I can't help you there.

Is pre-allocating in bigger chunks (than needed) an option at all? Will it defeat the original purpose?

I think that if you want such tight control over memory allocation, you are better off using a compiled language such as C. The JVM isolates the actual implementation of the language from the hardware, chip selection for data storage included.

The approach requires specialized hardware. In ordinary memory sticks, the chip and slot arrangements are designed to dissipate heat as evenly per chip as possible, for example by storing 1 bit of every bus word on each physical chip.

This is an interesting topic, although I think it is waaaaaaay beyond the capabilities of managed languages such as Java or .NET. One of the major principles of those languages is that you don't have to manage the memory, and consequently they abstract that away for you. C/C++ gives you better control in terms of actually allocating that memory, but even in that case, as referenced previously, the operating system can do some hand waving and indirection with memory allocation, making it difficult to determine how things are allocated together. Even then, you make reference to the actual chips; that's even harder, and I would imagine it would be hardware-dependent. I would seriously consider using a prototyping board where you can code at the assembly level and actually control every memory unit allocation explicitly, without any interference from compiler optimizations or operating system security practices. That would give you the most meaningful results, as it would give you the ability to control every aspect of the program and determine definitively that any power consumption improvements are due to your algorithm rather than some invisible optimization performed by the compiler or operating system. I imagine this is some sort of research project (very intriguing), so spending ~$100 on a prototyping board would definitely be worth it in my opinion.

In .NET there is a COM interface exposed for profiling .NET applications that can give you detailed address information. I think you will need to combine this with some calls to the OS to translate virtual addresses though.
As zztop alluded to, the .NET CLR compacts memory every time a garbage collection is done. Large objects, however, are not compacted. These are the objects on the large object heap. The large object heap can consist of many segments scattered around from OS calls to VirtualAlloc.
Here are a couple links on the profiling APIs:
http://msdn.microsoft.com/en-us/magazine/cc300553.aspx
David Broman's CLR Profiling API Blog

jvm design decision

Why does the jvm require around 10 MB of memory for a simple hello world when the clr doesn't? What is the trade-off here, i.e. what does the jvm gain by doing this?
Let me clarify a bit because I'm not conveying the question that is in my head. There is clearly an architectural difference between the jvm and clr runtimes. The jvm has a significantly higher memory footprint than the clr. I'm assuming there is some benefit to this overhead, otherwise why would it exist? I'm asking what the trade-offs are in these two designs. What benefit does the jvm gain from its memory overhead?

I guess one reason is that Java has to do everything itself (another aspect of platform independence). For instance, Swing draws its own components from scratch; it doesn't rely on the OS to draw them. That's all got to take place in memory. Lots of stuff that Windows may do, but Linux does not (or does differently), has to be fully contained in Java so that it works the same on both.
Java also always insists that its entire library is "linked" and available. Since it doesn't use DLLs (they wouldn't be available on every platform), everything has to be loaded and tracked by Java.
Java even does a lot of its own floating point, since FPUs often give different results, which has been deemed unacceptable.
So if you think about all the stuff C# can delegate to the OS it's tied to vs all the stuff Java has to do for the OS to compensate for others, the difference should be expected.
I've run java apps on 2 embedded platforms now. One was a spectrum analyzer where it actually drew the traces, the other is set-top cable boxes.
In both cases, this minimum memory footprint hasn't been an issue--there HAVE been Java specific issues, that just hasn't been one. The number of objects instantiated and Swing painting speed were bigger issues in these cases.

I don't know if initial memory footprint or a footprint of a Hello World application is important. A difference might be due to the number and sizes of the libraries that are loaded by the JVM / CLR. There can also be an amount of memory that is preallocated for garbage collection pools.
Every application that I know of uses a lot more than Hello World functionality. It will allocate and free memory thousands of times throughout the execution of the application. If you are interested in memory utilization differences between the JVM and the CLR, here are a couple of links with good information:
http://benpryor.com/blog/2006/05/04/jvm-vs-clr-memory-allocation/
Memory Management Case study (JVM & CLR)
Memory Management Case study is in Power Point. A very interesting presentation.

Seems like java is just using more virtual memory.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
amwise 20598 0.0 0.5 22052 5624 pts/3 Sl+ 14:59 0:00 mono Test.exe
amwise 20601 0.0 0.7 214312 7284 pts/2 Sl+ 15:00 0:00 java Program
I made a test program in C# and in Java that prints the string "test" and waits for input. I believe that the resident set size (RSS) value more accurately shows the memory usage. The virtual memory usage (VSZ) is less meaningful.
As I understand it applications can reserve a ton of virtual memory without actually using any real memory. For example you can ask the VirtualAlloc function on Windows to either reserve or commit virtual memory.
EDIT:
Here is a pretty picture from my windows box:
(Image: http://awise.us/images/mem.png)
Each app was a simple printf followed by a getchar.
Lots of virtual memory usage by Java and the CLR. The C version depends on just about nothing, so its memory usage is tiny by comparison.
I doubt it really matters either way. Just pick whichever platform you are more familiar with and then don't write terrible, memory-wasting code. I'm sure it will work out.
EDIT:
This VMMap tool from Microsoft might be useful in figuring out where memory is going.

The JVM counts all its shared libraries whether they use memory or not.
Task manager is rather unreliable when it comes to reporting the memory consumption of programs. You should take it as a guide.

The JVM loads lots of unnecessary core classes from rt.jar on each run. Unfortunately, the cross-dependencies between Java packages (java.lang <-> java.io) make it hard to do a partial runtime init. Not to mention that rt.jar itself is over 40MB and needs a lot of time for lookup and decompression.
Java 6u10 and later seem to load things a bit smarter (there is a jqs.exe, the Java Quick Starter service, which keeps necessary data in memory for a faster startup), and Java 7 is said to be better still.
The Process Explorer in Windows reports the Private Bytes correctly (Private bytes are those memory regions, which are not shared by any dll).
A slightly bigger annoyance is that after 10 years, JVM still defaults to 64MB memory usage. It is really annoying to use -Xmx almost every time and cannot run demanding programs in jars with a simple double click (unless I alter the file extension assignment's command).

The CLR is counted as part of the OS, so the task manager doesn't report its memory consumption under the application process.

Java performance with very large amounts of RAM

I'm exploring the possibility of running a Java app on a machine with very large amounts of RAM (anywhere from 300GB to 15TB, probably on an SGI Altix 4700 machine), and I'm curious as to how Java's GC is likely to perform in this scenario.
I've heard that IBM's or JRockit's JVMs may be better suited to this than Sun's. Does anyone know of any research or data on JVM performance in this situation?

On the Sun JVM, you can use the option -XX:+UseConcMarkSweepGC to turn on the Concurrent Mark Sweep (CMS) collector, which will avoid the "stop the world" phases of the default GC algorithm almost completely, at the cost of a little bit more overhead.
The advice to use more than one VM on such a machine is IMHO outdated.
In real world applications you often have enough shared data so that the performance with the CMS and one JVM is better.

The question is: do you want to run within a single process (JVM) or not? If you do, then you're going to have a problem. Refer to Tuning Java Virtual Machines, Oracle Coherence User Guide and similar documentation. The rule of thumb I've operated by is to try to avoid heaps larger than 1GB. Whereas a 512MB-1GB full GC might take less than a second, a 2-4GB full GC could potentially take 5 seconds or longer. Obviously this depends on many factors, but the moral of the story is that GC overhead does not scale linearly, and once you get into the one-second range, performance degrades rapidly.

Sun's JVM allows you to configure and optimize the heck out of garbage collection, but it's a science unto itself:
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
You might have to do some reading and research, but for that kind of machine, GC settings optimized for the machine and application probably make a big difference.

Since 5.0 the HotSpot JVM has used a concept known as Ergonomics to try to optimise memory usage. This is based on more than just the sheer amount of memory available, and it affects heap sizes, generation sizes and garbage collection algorithms.
Start by having a read of this, which explains Ergonomics and more:
https://www.oracle.com/technetwork/java/javase/memorymanagement-whitepaper-150215.pdf
There's also a guy called Brian Goetz that's written numerous articles about how Java allocates and uses memory, all of which and more can be found here:
http://www.briangoetz.com/pubs.html

This is not at all answering your question, but if you plan to deploy a huge Java app you might be interested in looking into Azul Systems appliances. They claim to be able to garbage-collect a single heap of up to 670 GB without creating a pause in the application.

You might want to consider running a virtual Terracotta cluster on this machine.

The only people who can really tell you are SGI. Super computers don't behave like regular servers only bigger.
However, I have found that Java performs best when memory is local to the processors accessing it. Note: the GC needs to be able to walk the whole memory end to end. This means it doesn't scale well if you have a design which is like lots of computers stuck together which may be the case here. The memory module size is 32 GB, so you may get better performance if you limit your JVM to comfortably fit into this size.

The accepted answer for this post is rather old and is now outdated. As of September 2014, if you are using Java 7, you should probably switch to the G1 collector. From the Java 7 update 4 release notes:
http://www.oracle.com/technetwork/java/javase/7u4-relnotes-1575007.html
"The G1 collector is targeted for applications that fully utilize the large amount of memory available in today's multiprocessor servers, while still keeping garbage collection latencies under control. Applications that require a large heap, have a big active data set, have bursty or non-uniform workloads or suffer from long Garbage Collection induced latencies should benefit from switching to G1."

Surely the answer as to how the GC's going to perform is "who cares?" ;-)
