I have a Java server that reads a large serialised file at startup. Deserialising it requires a large -Xss setting, but only for the main thread at startup; the threads that handle server requests need far less stack space. (The current -Xss is 20M.)
How will this (continually increasing) Xss value affect the server's memory usage?
The answer is complicated. You're also asking the wrong question - make sure you read the entire answer all the way to the end.
Answering your question: How bad is large -Xss?
The amount of RAM a JVM needs is, basically, heap+metaspace+MAX_THREADS*STACK_SIZE.
Heap is simple: That's what the -Xmx parameter is for. metaspace is a more or less constant (I'm oversimplifying) and not particularly large amount.
Furthermore, assume it's the usual server setup, where the JVM gets a static amount of memory: it's a server, it has a set amount of RAM, and the best option is usually to spend it all, giving every major process running on the system a locked-in, configured amount of RAM. If the JVM is the only major software running on there (e.g. there is a database involved, but it runs on another machine) and you have 8GB in the box, then give the JVM ~7GB. Why wouldn't you? Use -Xmx and -Xms, set to the same value, and make it large. If postgres is also running on it, give the JVM perhaps 3GB and postgres 4GB (depends on how db-heavy your app is, of course). Etcetera.
The point is, if you have both a large stacksize and a decently large max threads, let's say an -Xss of 20MB and max-threads of 100, then you lose 2GB of your allocated 7: On a box with 8GB installed and only the JVM as major consumer of resources, this setting:
java -Xmx7g -Xms7g -Xss20m
would be completely wrong and cause all sorts of trouble - that adds up to 9 GB, and I haven't even started accounting for metaspace yet, or the needs of the kernel. The box doesn't have that much! Instead you should be doing perhaps -Xmx5g -Xms5g -Xss20m.
And now you know what the performance cost is of this: The cost is having to reduce your -Xmx -Xms value from 7 to 5. It gets disastrously worse if you had to knock it down from 3 to 1 because it's a 4GB box - at that point what you're doing is basically impossible unless you first launch a new server with more ram in it.
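As a back-of-the-envelope sketch of that budget arithmetic (the 100-thread cap is this example's assumption, not something the JVM reports):

    long xssMb = 20;                          // -Xss20m
    long maxThreads = 100;                    // assumed thread cap of the server
    long stackBudgetMb = maxThreads * xssMb;  // 2000 MB, i.e. ~2 GB reserved for stacks
    // 8 GB box - ~2 GB stacks - metaspace/kernel headroom => roughly -Xmx5g -Xms5g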
Actually helping you solve your problem
Forget about all of the above, that's the wrong way to solve this problem. Keep your -Xss nice and low, or don't set it.
Instead, take your init code and isolate it, then run it in a separately set up thread, and just .join() on that thread to wait for it to complete; the join() establishes the happens-before relationship you need to safely read all the fields your init code modified. Use this Thread constructor:
Runnable initCode = () -> {
    // your init stuff goes here
};
ThreadGroup tg = Thread.currentThread().getThreadGroup();
// the 4th argument is the requested stack size: 20MB here
Thread initThread = new Thread(tg, initCode, "init", 20L * 1024L * 1024L);
initThread.start();
initThread.join(); // wait for init to finish; join() also gives us the visibility guarantees
But, do some research first. The API of Thread is horribly designed and makes all sorts of grave errors. In particular, the stack size number (20MB here) is just a hint, and the javadoc says any VM is free to just completely ignore it. Good API design would, of course, have specced that an exception is thrown if your requested stack size is not doable by the VM.
I've done a quick test; AdoptOpenJDK 11 on a Mac seems to have no problem with it.
Here's my test setup:
> cat Test.java
public class Test {
    public static void main(String[] args) throws Exception {
        Runnable r = () -> {
            System.out.println("Allowed stack depth: " + measure());
        };
        r.run();
        r.run();
        Thread t = new Thread(Thread.currentThread().getThreadGroup(), r, "init", 1024L * 1024L);
        t.start();
        t.join();
        r.run();
    }

    public static int measure() {
        // binary-search the deepest recursion that does not overflow
        int min = 1;
        int max = 50000;
        while (min < max) {
            int mid = (max + min) / 2;
            try {
                attempt(mid);
                if (min == mid) return min;
                min = mid;
            } catch (StackOverflowError e) {
                max = mid;
            }
        }
        return min;
    }

    public static void attempt(int depth) {
        if (depth == 0) return;
        attempt(depth - 1);
    }
}
> javac Test.java; java -Xss200k Test
Allowed stack depth: 2733
Allowed stack depth: 6549
Allowed stack depth: 49999
Allowed stack depth: 6549
You can't simply check the size of a stack trace, as the JVM has a hard limit and won't store more than 1024 stack trace elements; hence the binary search for the answer.
I can't quite explain why the value isn't constant (it hops from 2733 to 6549), or, even more of a What The Heck, why an -Xss of 150k produces higher numbers - I'll ask a question about that right after posting this answer. But the test does show that the thread made with a larger stack does indeed let you have a far deeper method callstack.
Run this test code on the target environment with the target JDK just to be sure it'll work, and then you have your actual solution :)
Related
When trying to benchmark a specific method, in regards to how many objects are created and how many bytes they occupy while that method is running, in Android it is possible to do this:
Debug.resetThreadAllocCount()
Debug.resetThreadAllocSize()
Debug.startAllocCounting()
benchmarkMethod()
Debug.stopAllocCounting()
var memoryAllocCount = Debug.getThreadAllocCount()
var memoryAllocSize = Debug.getThreadAllocSize()
I would now like to benchmark the same method in a normal desktop application, where these methods are not available. I have not found anything similar, and any other memory-benchmarking code I have tried does not provide consistent results, whereas the code above gives the exact same result every time the same benchmark runs.
Any suggestion, preferably just code, would be appreciated; however, I would also be open to trying some software if it can perform the task I am trying to do.
ThreadMXBean.getThreadAllocatedBytes can help:
// Note: this is the com.sun.management extension of the standard ThreadMXBean (HotSpot-based JDKs)
com.sun.management.ThreadMXBean bean =
        (com.sun.management.ThreadMXBean) java.lang.management.ManagementFactory.getThreadMXBean();

long currentThreadId = Thread.currentThread().getId();

long before = bean.getThreadAllocatedBytes(currentThreadId);
allocatingMethod();
long after = bean.getThreadAllocatedBytes(currentThreadId);

System.out.println("Allocated " + (after - before) + " bytes");
The method returns an approximation of the total allocated memory, but this approximation is usually quite precise.
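One caveat (worth verifying on your own setup): the very first call can report extra bytes caused by class loading and JIT activity, so a warm-up call before the measured run tends to make the numbers reproducible. A minimal sketch, reusing bean and currentThreadId from above:

    allocatingMethod(); // warm-up: take class-loading/JIT-related allocations out of the measurement
    long start = bean.getThreadAllocatedBytes(currentThreadId);
    allocatingMethod();
    System.out.println("Allocated " + (bean.getThreadAllocatedBytes(currentThreadId) - start) + " bytes");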
Also, async-profiler has a Java API for profiling allocations. It not only counts how many objects are allocated, but also shows the exact allocated objects with the stack traces of the allocation sites.
public static void main(String[] args) throws Exception {
    AsyncProfiler profiler = AsyncProfiler.getInstance();

    // Dry run to skip allocations caused by AsyncProfiler initialization
    profiler.start("alloc", 0);
    profiler.stop();

    // Real profiling session
    profiler.start("alloc", 0);
    allocatingMethod();
    profiler.stop();

    profiler.execute("file=alloc.svg"); // save the output to alloc.svg
}
How to run:
java -Djava.library.path=/path/to/async-profiler -XX:+UseG1GC -XX:-UseTLAB Main
-XX:+UseG1GC -XX:-UseTLAB options are needed to record all allocations. Otherwise, async-profiler will work in sampling mode, recording only a small portion of allocations.
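If recording every allocation is too heavy for your workload, you can keep TLABs enabled and let async-profiler sample instead. A sketch, assuming the start(event, interval) overload where the interval for the alloc event is in bytes:

    profiler.start("alloc", 1024 * 1024); // sample roughly one allocation site per MB allocated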
The output, alloc.svg, is a flame graph of the allocation sites.
I decided to try a few experiments to see what I could discover about the size of stack frames, and how far through the stack the currently executing code was. There are two interesting questions we might investigate here:
How many levels deep into the stack is the current code?
How many levels of recursion can the current method reach before it hits a StackOverflowError?
Stack depth of currently executing code
Here's the best I could come up with for this:
public static int levelsDeep() {
    try {
        throw new Exception(); // any throwable works; we only want its stack trace
    } catch (Exception e) {
        return e.getStackTrace().length;
    }
}
This seems a bit hacky. It generates and catches an exception, and then looks to see what the length of the stack trace is.
Unfortunately it also seems to have a fatal limitation, which is that the maximum length of the stack trace returned is 1024. Anything beyond that gets axed, so the maximum that this method can return is 1024.
Question: Is there a better way of doing this that isn't so hacky and doesn't have this limitation?
For what it's worth, my guess is that there isn't: Throwable.getStackTraceDepth() is a native call, which suggests (but doesn't prove) that it can't be done in pure Java.
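For what it's worth, on Java 9 and later the StackWalker API offers a less hacky route that is not subject to the 1024-element cap. A minimal sketch:

    public static long levelsDeep() {
        // StackWalker streams the live frames instead of materializing a capped array
        return StackWalker.getInstance().walk(frames -> frames.count());
    }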
Determining how much more recursion depth we have left
The number of levels we can reach will be determined by (a) size of stack frame, and (b) amount of stack left. Let's not worry about size of stack frame, and just see how many levels we can reach before we hit a StackOverflowError.
Here's my code for doing this:
public static int stackLeft() {
    try {
        return 1 + stackLeft();
    } catch (StackOverflowError e) {
        return 0;
    }
}
It does its job admirably, even if its running time is linear in the amount of stack remaining. But here is the very, very weird part. On 64-bit Java 7 (OpenJDK 1.7.0_65), the results are perfectly consistent: 9,923 on my machine (Ubuntu 14.04 64-bit). But Oracle's Java 8 (1.8.0_25) gives me non-deterministic results: I get a recorded depth of anywhere between about 18,500 and 20,700.
Now why on earth would it be non-deterministic? There's supposed to be a fixed stack size, isn't there? And all of the code looks deterministic to me.
I wondered whether it was something weird with the error trapping, so I tried this instead:
public static long badSum(int n) {
if (n==0)
return 0;
else
return 1+badSum(n-1);
}
Clearly this will either return the input it was given, or overflow.
Again, the results I get are non-deterministic on Java 8. If I call badSum(14500), it will give me a StackOverflowError about half the time and return 14500 the other half. But on OpenJDK 7 it's consistent: badSum(9160) completes fine, and badSum(9161) overflows.
Question: Why is the maximum recursion depth non-deterministic on Oracle's Java 8? And why is it deterministic on OpenJDK 7?
The observed behavior is affected by the HotSpot optimizer; however, it is not the only cause. When I run the following code
public static void main(String[] argv) {
    System.out.println(System.getProperty("java.version"));
    System.out.println(countDepth());
    System.out.println(countDepth());
    System.out.println(countDepth());
    System.out.println(countDepth());
    System.out.println(countDepth());
    System.out.println(countDepth());
    System.out.println(countDepth());
}

static int countDepth() {
    try { return 1 + countDepth(); }
    catch (StackOverflowError err) { return 0; }
}
with JIT enabled, I get results like:
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -cp build\classes X
1.8.0_40-ea
2097
4195
4195
4195
12587
12587
12587
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -cp build\classes X
1.8.0_40-ea
2095
4193
4193
4193
12579
12579
12579
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -cp build\classes X
1.8.0_40-ea
2087
4177
4177
12529
12529
12529
12529
Here, the effect of the JIT is clearly visible: the optimized code obviously needs less stack space, and the jumps show that tiered compilation is enabled (indeed, using -XX:-TieredCompilation shows a single jump if the program runs long enough).
In contrast, with disabled JIT I get the following results:
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -Xint -cp build\classes X
1.8.0_40-ea
2104
2104
2104
2104
2104
2104
2104
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -Xint -cp build\classes X
1.8.0_40-ea
2076
2076
2076
2076
2076
2076
2076
> f:\Software\jdk1.8.0_40beta02\bin\java -Xss68k -server -Xint -cp build\classes X
1.8.0_40-ea
2105
2105
2105
2105
2105
2105
2105
The values still vary, but not within a single run (only between runs), and with a smaller magnitude.
So, there is a (rather small) difference that becomes much larger if the optimizer can reduce the stack space required per method invocation, e.g. due to inlining.
What can cause such a difference? I don't know how this JVM does it, but one scenario could be that the way a stack limit is enforced requires a certain alignment of the stack end address (e.g. matching memory page sizes), while the memory allocation returns memory whose start address has a weaker alignment guarantee. Combine such a scenario with ASLR and there might always be a difference, within the size of the alignment requirement.
Why is the maximum recursion depth non-deterministic on Oracle's Java 8? And why is it deterministic on OpenJDK 7?
As for that: it possibly relates to changes in garbage collection; Java can choose a different GC mode each time. http://vaskoz.wordpress.com/2013/08/23/java-8-garbage-collectors/
It's deprecated, but you could try Thread.countStackFrames(), like:

    public static int levelsDeep() {
        return Thread.currentThread().countStackFrames();
    }
Per the Javadoc,
Deprecated. The definition of this call depends on suspend(), which is deprecated. Further, the results of this call were never well-defined.
Counts the number of stack frames in this thread. The thread must be suspended.
As for why you observe non-deterministic behaviour, I can only assume it is some combination of the JIT and garbage collector.
You don't need to catch the exception, just create one like this:
new Throwable().getStackTrace()
Or:
Thread.currentThread().getStackTrace()
It's still hacky, as the result is JVM-implementation specific, and the JVM may decide to trim the result for better performance.
Recently I noticed declaring an array containing 64 elements is a lot faster (>1000 fold) than declaring the same type of array with 65 elements.
Here is the code I used to test this:
public class Tests {
    public static void main(String args[]) {
        double start = System.nanoTime();
        int job = 100000000; // 100 million
        for (int i = 0; i < job; i++) {
            double[] test = new double[64];
        }
        double end = System.nanoTime();
        System.out.println("Total runtime = " + (end - start) / 1000000 + " ms");
    }
}
This runs in approximately 6 ms, if I replace new double[64] with new double[65] it takes approximately 7 seconds. This problem becomes exponentially more severe if the job is spread across more and more threads, which is where my problem originates from.
This problem also occurs with different types of arrays such as int[65] or String[65].
This problem does not occur with large strings: String test = "many characters";, but does start occurring when this is changed into String test = i + "";
I was wondering why this is the case and if it is possible to circumvent this problem.
You are observing behavior caused by the optimizations performed by the JIT compiler of your Java VM. This behavior is reproducible: it is triggered for scalar arrays of up to 64 elements, and is not triggered for arrays larger than 64.
Before going into details, let's take a closer look at the body of the loop:
double[] test = new double[64];
The body has no effect (no observable behavior). That means it makes no difference outside of the program execution whether this statement is executed or not. The same is true for the whole loop. So it might happen that the code optimizer translates the loop into something (or nothing) with the same functional behavior but different timing behavior.
For benchmarks you should at least adhere to the following two guidelines (a sketch applying both follows the list). If you had done so, the difference would have been significantly smaller.
Warm-up the JIT compiler (and optimizer) by executing the benchmark several times.
Use the result of every expression and print it at the end of the benchmark.
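Here is a minimal sketch applying both guidelines to the original benchmark (for serious measurements a harness such as JMH is still the better tool):

    public class Tests {
        static double run(int job) {
            double sum = 0;
            for (int i = 0; i < job; i++) {
                double[] test = new double[64];
                test[i % 64] = i;    // write to the array and...
                sum += test[i % 64]; // ...read it back, so the allocation stays observable
            }
            return sum;
        }

        public static void main(String[] args) {
            int job = 100000000;
            for (int w = 0; w < 5; w++) {
                run(job); // warm-up: let the JIT compile and optimize before timing
            }
            long start = System.nanoTime();
            double result = run(job);
            long end = System.nanoTime();
            // print the result so the whole computation cannot be eliminated
            System.out.println("Total runtime = " + (end - start) / 1000000 + " ms, result = " + result);
        }
    }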
Now let's go into details. Not surprisingly, there is an optimization that is triggered for scalar arrays not larger than 64 elements. The optimization is part of escape analysis: it puts small objects and small arrays onto the stack instead of allocating them on the heap - or, even better, optimizes them away entirely. You can find some information about it in the following article by Brian Goetz, written in 2005:
Urban performance legends, revisited: Allocation is faster than you think, and getting faster
The optimization can be disabled with the command line option -XX:-DoEscapeAnalysis. The magic value 64 for scalar arrays can also be changed on the command line. If you execute your program as follows, there will be no difference between arrays with 64 and 65 elements:
java -XX:EliminateAllocationArraySizeLimit=65 Tests
Having said that, I strongly discourage using such command line options. I doubt it makes a huge difference in a realistic application. I would only use it if I were absolutely convinced of the necessity - and not based on the results of some pseudo-benchmarks.
There are any number of ways that there can be a difference, based on the size of an object.
As nosid stated, the JITC may be (most likely is) allocating small "local" objects on the stack, and the size cutoff for "small" arrays may be at 64 elements.
Allocating on the stack is significantly faster than allocating in heap, and, more to the point, stack does not need to be garbage collected, so GC overhead is greatly reduced. (And for this test case GC overhead is likely 80-90% of the total execution time.)
Further, once the value is stack-allocated the JITC can perform "dead code elimination", determine that the result of the new is never used anywhere, and, after assuring there are no side-effects that would be lost, eliminate the entire new operation, and then the (now empty) loop itself.
Even if the JITC does not do stack allocation, it's entirely possible for objects smaller than a certain size to be allocated in a heap differently (eg, from a different "space") than larger objects. (Normally this would not produce quite so dramatic timing differences, though.)
This is the context of my program.
A function has 50% chance to do nothing, 50% to call itself twice.
What is the probability that the program will finish?
I wrote this piece of code, and apparently it works great. The answer, which may not be obvious to everyone, is that this program has a 100% chance to finish. But there is a StackOverflowError (how convenient ;) ) when I run it, occurring in Math.random(). Could someone point out to me where it comes from, and tell me if maybe my code is wrong?
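(For reference, the 100% claim follows from the extinction probability of this branching process: the probability q that the program finishes must satisfy q = 1/2 + q^2/2, which rearranges to (q - 1)^2 = 0, hence q = 1.)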
static int bestDepth = 0;
static int numberOfPrograms = 0;

@Test
public void testProba() {
    for (int i = 0; i < 1000; i++) {
        long time = System.currentTimeMillis();
        bestDepth = 0;
        numberOfPrograms = 0;
        loop(0);
        LOGGER.info("Best depth:" + bestDepth + " in " + (System.currentTimeMillis() - time) + "ms");
    }
}

public boolean loop(int depth) {
    numberOfPrograms++;
    if (depth > bestDepth) {
        bestDepth = depth;
    }
    if (proba()) {
        return true;
    } else {
        return loop(depth + 1) && loop(depth + 1);
    }
}

public boolean proba() {
    return Math.random() > 0.5;
}
java.lang.StackOverflowError
at java.util.Random.nextDouble(Random.java:394)
at java.lang.Math.random(Math.java:695)
I suspect the stack and the number of function calls in it are limited, but I don't really see the problem here.
Any advice or clue is obviously welcome.
Fabien
EDIT: Thanks for your answers, I ran it with java -Xss4m and it worked great.
Whenever a function is called or a local variable is created, the stack is used to place and reserve space for it.
Now, it seems that you are recursively calling the loop function. This places the arguments on the stack along with the return address, which means that a lot of information is being placed on the stack.
However, the stack is limited. The CPU has built-in mechanics that protect against issues where data pushed onto the stack eventually overwrites the code itself (as the stack grows down). This is called a General Protection Fault. When that general protection fault happens, the OS notifies the currently running task. Thus originates the StackOverflowError.
This seems to be happening in Math.random().
In order to handle your problem, I suggest increasing the stack size using the -Xss option of Java.
As you said, the loop function recursively calls itself. Now, tail-recursive calls can be rewritten into loops by a compiler, so that they occupy no stack space (this is called tail call optimization, TCO). Unfortunately, the Java compiler does not do that, and your loop is not tail-recursive anyway. Your options here are:
1. Increase the stack size, as suggested by the other answers. Note that this will just defer the problem: no matter how large your stack is, its size is still finite. You just need a longer chain of recursive calls to break out of the space limit.
2. Rewrite the function in terms of loops (a sketch follows this list).
3. Use a language which has a compiler that performs TCO.
3.1. You will still need to rewrite the function to be tail-recursive.
3.2. Or rewrite it with trampolines (only minor changes are needed). A good paper explaining trampolines and generalizing them further is called "Stackless Scala with Free Monads".
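To sketch option 2 in plain Java (reusing the question's fields and proba(); the explicit stack lives on the heap, so its depth is bounded by heap size rather than thread stack size):

    public boolean loopIterative() {
        java.util.Deque<Integer> pending = new java.util.ArrayDeque<>();
        pending.push(0);
        while (!pending.isEmpty()) {
            int depth = pending.pop();
            numberOfPrograms++;
            if (depth > bestDepth) {
                bestDepth = depth;
            }
            if (!proba()) {
                // "call itself twice": schedule both child calls instead of recursing
                pending.push(depth + 1);
                pending.push(depth + 1);
            }
        }
        return true; // every leaf returns true, so the original &&-chain is always true
    }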
To illustrate the point in 3.2, here's how the rewritten function would look:
def loop(depth: Int): Trampoline[Boolean] = {
  numberOfPrograms = numberOfPrograms + 1
  if (depth > bestDepth) {
    bestDepth = depth
  }
  if (proba()) done(true)
  else for {
    r1 <- loop(depth + 1)
    r2 <- loop(depth + 1)
  } yield r1 && r2
}
And the initial call would be loop(0).run.
Increasing the stack-size is a nice temporary fix. However, as proved by this post, though the loop() function is guaranteed to return eventually, the average stack-depth required by loop() is infinite. Thus, no matter how much you increase the stack by, your program will eventually run out of memory and crash.
There is nothing we can do to prevent this for certain; we always need to encode the stack in memory somehow, and we'll never have infinite memory. However, there is a way to reduce the amount of memory you're using by about 2 orders of magnitude. This should give your program a significantly higher chance of returning, rather than crashing.
We can do this by noticing that, at each layer in the stack, there's really only one piece of information we need to run your program: the piece that tells us if we need to call loop() again or not after returning. Thus, we can emulate the recursion using a stack of bits. Each emulated stack-frame will require only one bit of memory (right now it requires 64-96 times that, depending on whether you're running in 32- or 64-bit).
The code would look something like this (though I don't have a Java compiler right now so I can't test it):
// uses java.util.BitSet
static int bestDepth = 0;
static int numLoopCalls = 0;

public void emulateLoop() {
    // Our fake stack. We'll push a 1 when this point on the stack still needs a second
    // call to loop() made, a 0 if it doesn't.
    BitSet fakeStack = new BitSet();

    int currentDepth = 0; // int, not long: BitSet indices are ints
    numLoopCalls = 0;

    while (currentDepth >= 0) {
        numLoopCalls++;
        if (proba()) {
            // "return" from the current function, going up the callstack until we hit
            // a point where we need to "call loop()" a second time
            fakeStack.clear(currentDepth);
            while (!fakeStack.get(currentDepth)) {
                currentDepth--;
                if (currentDepth < 0) {
                    return;
                }
            }
            // At this point, we've hit a point where loop() needs to be called a second time.
            // Mark it as called, and call it.
            fakeStack.clear(currentDepth);
            currentDepth++;
        } else {
            // Need to call loop() twice, so we push a 1 and continue the while-loop
            fakeStack.set(currentDepth);
            currentDepth++;
            if (currentDepth > bestDepth) {
                bestDepth = currentDepth;
            }
        }
    }
}
This will probably be slightly slower, but it will use about 1/100th the memory. Note that the BitSet is stored on the heap, so there is no longer any need to increase the stack-size to run this. If anything, you'll want to increase the heap-size.
The downside of recursion is that it starts filling up your stack, which will eventually cause a stack overflow if your recursion is too deep. If you want to ensure that the test ends, you can increase your stack size using the answers given in the following Stack Overflow thread:
How to increase to Java stack size?
I wrote a piece of Java code to create 500K small files (averaging 40K each) on CentOS. The original code is like this:
package MyTest;

import java.io.*;

public class SimpleWriter {
    public static void main(String[] args) {
        String dir = args[0];
        int fileCount = Integer.parseInt(args[1]);
        String content = "##$% SDBSDGSDF ASGSDFFSAGDHFSDSAWE^#$^HNFSGQW%##&$%^J#%##^$#UHRGSDSDNDFE$T##$UERDFASGWQR!#%!#^$##YEGEQW%!#%!!GSDHWET!^";
        StringBuilder sb = new StringBuilder();
        int count = 40 * 1024 / content.length();
        int remainder = (40 * 1024) % content.length();
        for (int i = 0; i < count; i++) {
            sb.append(content);
        }
        if (remainder > 0) {
            sb.append(content.substring(0, remainder));
        }
        byte[] buf = sb.toString().getBytes();

        for (int j = 0; j < fileCount; j++) {
            String path = String.format("%s%sTestFile_%d.txt", dir, File.separator, j);
            try {
                BufferedOutputStream fs = new BufferedOutputStream(new FileOutputStream(path));
                fs.write(buf);
                fs.close();
            } catch (FileNotFoundException fe) {
                System.out.printf("Hit filenot found exception %s", fe.getMessage());
            } catch (IOException ie) {
                System.out.printf("Hit IO exception %s", ie.getMessage());
            }
        }
    }
}
You can run this by issuing the following command:
java -jar SimpleWriter.jar my_test_dir 500000
I thought this was simple code, but then I realized it was using up to 14G of memory. I know that because when I used free -m to check the memory, the free memory kept dropping until my 15G-memory VM had only 70 MB of free memory left. I compiled this using Eclipse, against JDK 1.6 and then JDK 1.7; the result is the same. The funny thing is that if I comment out fs.write(), just opening and closing the stream, the memory stabilizes at a certain point. Once I put fs.write() back, the memory allocation just goes wild. 500K 40KB files is about 20G. It seems Java's stream writer never deallocates its buffer during the operation.
I once thought the Java GC did not have time to clean up, but that makes no sense since I closed the file stream for every file. I even ported my code to C# and ran it under Windows; the same code producing 500K 40KB files kept memory stable at a certain point, not taking 14G as under CentOS. At least C#'s behavior is what I expected, but I could not believe Java performs this way. I asked colleagues who are experienced in Java. They could not see anything wrong in the code, but could not explain why this happened, and they admitted nobody had tried to create 500K files in a loop without stopping.
I also searched online, and everybody says the only thing to pay attention to is closing the stream, which I did.
Can anyone help me to figure out what's wrong?
Can anybody also try this and tell me what you see?
BTW, some people in this community tried the code on Windows and it seemed to work fine. I didn't try it on Windows; I only tried it on Linux, as I thought that is where people use Java. So it seems this issue happens on Linux.
I also did the following to limit the JVM heap, but it had no effect:
java -Xmx2048m -jar SimpleWriter.jar my_test_dir 500000
I tried to test your prog on Win XP, JDK 1.7.25, and immediately got OutOfMemoryErrors.
While debugging, with only 3000 as the file count (args[1]), the count variable from this code:
int count = 40 * 1024 * 1024 / content.length();
int remainder = (40 * 1024 * 1024) % content.length();
for (int i = 0; i < count; i++) {
    sb.append(content);
}
count is 355449. So the String you are trying to create will be 355449 times content long or, as you calculated, 40Mb long. I ran out of memory when i was 266587 and sb was 31457266 chars long; at that point each file I got was 30Mb.
The problem does not seem to be with memory or GC, but with the way you create the string.
Did you see files created, or was memory eaten up before any file was created?
I think your main problem is the line:
int count = 40 * 1024 * 1024 / content.length();
should be:
int count = 40 * 1024 / content.length();
to create 40K, not 40Mb files.
[Edit2: The original answer is left in italics at the end of this post]
After your clarifications in the comments, I have run your code on a Windows machine (Java 1.6) and here are my findings (numbers are from VisualVM, OS memory as seen from Task Manager):
Example with 40K size, writing to 500K files (no parameters to JVM):
Used Heap: ~4M, Total Heap: 16M, OS memory: ~16M
Example with 40M size, writing to 500 files (JVM parameters -Xms128m -Xmx512m; without them I get an OutOfMemory error when creating the StringBuilder):
Used Heap: ~265M, Heap size: ~365M, OS memory: ~365M
Especially from the second example you can see that my original explanation still stands. Yes, one would expect that most of the memory would be freed, since the byte[] of the BufferedOutputStream resides in the young generation space (short-lived objects), but this a) does not happen immediately and b) when the GC decides to kick in (it actually does in my case), yes, it will try to clear memory, but it can clear as much memory as it sees fit, not necessarily all of it. The GC does not provide any guarantees that you can count upon.
So, generally speaking, you should give the JVM as much memory as you feel comfortable with. If you need to keep memory low for special functionality, you should try a strategy like the code example I give below in my original answer, i.e. just don't create all those byte[] objects.
Now, in your case with CentOS, it does seem that the JVM behaves strangely. Perhaps we could talk about a buggy or bad implementation. To classify it as a leak/bug, though, you should try to use -Xmx to restrict the heap. Also, please try what Peter Lawrey suggested: don't create the BufferedOutputStream at all (in the small-file case), since you just write all the bytes at once.
If it still exceeds the memory limit, then you have encountered a leak and should probably file a bug. (You could still complain, and they may optimize it in the future.)
[Edit1: The answer below assumed that the OP's code performed as many read operations as write operations, so the memory usage was justifiable. The OP clarified this is not the case, so his question is not answered.
"...my 15G memory VM..."
If you give the JVM that much memory, why should it try to run GC? As far as the JVM is concerned, it is allowed to take as much memory from the system as it likes and to run GC only when it thinks it appropriate to do so.
Each instantiation of BufferedOutputStream will allocate a buffer of 8K size by default. The JVM will try to reclaim that memory only when it needs to. This is the expected behaviour.
Do not confuse the memory you see as free from the system's point of view with the JVM's point of view. As far as the system is concerned, the memory is allocated and will be released when the JVM shuts down. As far as the JVM is concerned, all the byte[] arrays allocated by BufferedOutputStream are no longer in use; it is "free" memory and will be reclaimed if needed.
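To see the two views side by side, the JVM's own numbers can be printed from within the process (a small sketch using the standard Runtime API):

    Runtime rt = Runtime.getRuntime();
    System.out.printf("max=%dM total=%dM free=%dM%n",
            rt.maxMemory() >> 20,    // ceiling the JVM may grow to (-Xmx)
            rt.totalMemory() >> 20,  // what the JVM has taken from the OS so far
            rt.freeMemory() >> 20);  // unused part of that, from the JVM's point of view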
If for some reason you don't desire this behaviour, you could try the following: extend the BufferedOutputStream class (e.g. create a ReusableBufferedOutputStream class) and add a new method, e.g. reUseWithStream(OutputStream os). This method would then clear the internal byte[], flush and close the previous stream, reset any variables used, and set the new stream. Your code would then become as below:
// initialize once
ReusableBufferedOutputStream fs = new ReusableBufferedOutputStream();

for (int i = 0; i < fileCount; i++) {
    String path = String.format("%s%sTestFile_%d.txt", dir, File.separator, i);
    // point the reusable stream at the new file
    fs.reUseWithStream(new FileOutputStream(path));
    fs.write(this._buf, 0, this._buf.length); // this._buf was allocated once, 40K long, containing text
}
fs.close(); // Close the stream after we are done
Using the above approach you will avoid creating many byte[] objects. However, I don't see any problem with the expected behaviour, nor do you mention any problem other than "I see it takes too much memory". You have configured it to use that much, after all.]
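For completeness, a minimal sketch of what such a ReusableBufferedOutputStream could look like (it relies on the protected out and count fields inherited from FilterOutputStream/BufferedOutputStream; OutputStream.nullOutputStream() needs Java 11+, older JDKs would use a small no-op stub instead):

    import java.io.BufferedOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    class ReusableBufferedOutputStream extends BufferedOutputStream {

        ReusableBufferedOutputStream() {
            // placeholder target; reUseWithStream() must be called before the first write
            super(OutputStream.nullOutputStream());
        }

        void reUseWithStream(OutputStream os) throws IOException {
            flush();      // push any buffered bytes to the previous stream
            out.close();  // close the previous stream...
            count = 0;    // ...reset the fill level of the internal byte[] (which is kept)
            out = os;     // ...and swap in the new target
        }
    }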