Performance and memory usage in Java arrays vs C++ arrays

I work at a small company where I'm building some banking software. I now have to build a data structure like:
Array [Int-Max] [2] // Large 2D array
save it to disk, and load it the next day for further work.
Since I only know Java (and a little bit of C), they keep insisting that I use C++ or C. Their reasoning:
They have seen that Array [Int-Max] [2] in Java takes nearly 1.5 times more memory than in C, and that C++ has a more reasonable memory footprint than Java.
C and C++ can handle arbitrarily large files, whereas Java can't.
In their view, as the database/data structure becomes large, Java simply becomes infeasible, and since we have to work with such large databases/data structures, C/C++ is always preferable.
Now my questions are:
1) Why is C or C++ always preferable to Java for large databases/data structures? C maybe, but C++ is also object-oriented, so how does it gain an advantage over Java?
2) Should I stay with Java, or would their suggestion (switching to C++) help in the future for large database/data-structure work? Any suggestions?
Sorry, I know very little about all of this and have only just started working on a project, so I'm really confused. Until now I have only built some school projects and have no idea what a relatively large project looks like.

Why is C/C++ always preferable to Java for large databases/data structures? C maybe, but C++ is also object-oriented, so how does it gain an advantage over Java?
Remember that a Java array (of objects)1 is actually an array of references. For simplicity, let's look at a 1D array:
java:
[ref1,ref2,ref3,...,refN]
ref1 -> object1
ref2 -> object2
...
refN -> objectN
c++:
[object1,object2,...,objectN]
In the C++ version, the array holds the objects themselves rather than references to them, so the per-element reference overhead is not needed. If the objects are small, this overhead can indeed be significant.
Also, as I already stated in the comments, there is another issue when allocating arrays of small objects in C++ vs Java. In C++ you allocate an array of objects and they are contiguous in memory, while in Java the objects themselves are not. In some cases this gives C++ much better performance, because it is much more cache-efficient than the Java program. I once addressed this issue in this thread.
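A minimal sketch of the difference (the Point class and the element count are illustrative assumptions; actual sizes vary by JVM):
// Illustrative only: the byte counts in the comments are assumptions, not guarantees.
class Point {            // two ints -> ~8 bytes of fields plus an object header
    int x, y;
}

public class LayoutSketch {
    public static void main(String[] args) {
        int n = 1_000_000;

        // Array of references: n slots of 4-8 bytes each,
        // plus n separate Point objects scattered on the heap.
        Point[] points = new Point[n];
        for (int i = 0; i < n; i++) {
            points[i] = new Point();   // each element is its own allocation
        }

        // Primitive arrays: values stored inline and contiguously,
        // roughly n * 4 bytes each plus a small array header.
        int[] xs = new int[n];
        int[] ys = new int[n];
    }
}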
2) Should I stay with Java, or would their suggestion (switching to C++) help in the future for large database/data-structure work? Any suggestions?
I don't believe we can answer that for you. You should be aware of all the pros and cons of each (memory efficiency, libraries you can use, development time, ...) for your purpose and make a decision. Don't be afraid to get advice from senior developers in your company, who have more information about the system than we do.
If there were a simple, easy and generic answer to questions like this, we engineers wouldn't be needed, would we?
You can also build a stub implementation with the expected array size before implementing the core, and profile it to see what the real difference is likely to be. (This assumes the array is indeed the expected main space consumer.)
1: The overhead described above is not relevant for arrays of primitives. In those cases the arrays hold the values themselves, not references, just as in C++, with only minor overhead for the array object itself (a length field, for example).

It sounds like you are an inexperienced programmer in a new job. The chances are that "they" have been in the business a long time, and know (or at least think they know) more about the domain and its programming requirements than you do.
My advice is to just do what they insist that you do. If they want the code in C or C++, just write it in C or C++. If you think you are going to have difficulties because you don't know much C / C++ ... warn them up front. If they still insist, they can wear the responsibility for any problems and delays their insistence causes. Just make sure that you do your best ... and try not to be a "squeaky wheel".
1) They have seen that Array [Int-Max] [Int-Max] in Java takes nearly 1.5 times more memory than in C, and that C++ has a more reasonable memory footprint than Java.
That is plausible, though it depends on what is in the arrays.
Java can represent large arrays of most primitive types using close to optimal amounts of memory.
On the other hand, arrays of objects in Java can take considerably more space than in C / C++. In C++ for example, you would typically allocate a large array using new Foo[largeNumber] so that all of the Foo instances are part of the array instance. In Java, new Foo[largeNumber] is actually equivalent to new Foo*[largeNumber]; i.e. an array of pointers, where each pointer typically refers to a different object / heap node. It is easy to see how this can take a lot more space.
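One common way to avoid that per-object overhead in Java is to store the data column-wise in primitive arrays rather than as an array of row objects. A rough sketch, assuming the two columns hold ints (the class and its methods are made up for illustration):
// Hypothetical sketch: an [N][2] structure held as two primitive arrays
// instead of N separate row objects.
public class TwoColumnTable {
    private final int[] col0;
    private final int[] col1;

    public TwoColumnTable(int rows) {
        this.col0 = new int[rows];
        this.col1 = new int[rows];
    }

    public void set(int row, int a, int b) {
        col0[row] = a;
        col1[row] = b;
    }

    public int get(int row, int column) {
        return column == 0 ? col0[row] : col1[row];
    }
}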
2) C/C++ can handle arbitrarily large files, whereas Java can't.
There is a hard limit to the number of elements in a single 1-D Java array ... 2^31. (You can work around this limit, but it will make your code more complicated.)
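A sketch of one such workaround: split the logical array into chunks and address it with a long index (the class name and chunk size are assumptions):
// Hypothetical sketch: a "big array" of longs addressed by a long index,
// backed by multiple ordinary arrays to get past the 2^31-element limit.
public class BigLongArray {
    private static final int CHUNK_BITS = 27;              // 128M elements per chunk (assumption)
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final long[][] chunks;

    public BigLongArray(long length) {
        int chunkCount = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new long[chunkCount][];
        for (int i = 0; i < chunkCount; i++) {
            chunks[i] = new long[CHUNK_SIZE];   // last chunk is over-allocated for simplicity
        }
    }

    public long get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    public void set(long index, long value) {
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }
}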
On the other hand if you are talking about simply reading and writing files, Java can handle individual files up to 2^63 bytes ... which is more than you could possibly ever want.
1) Why is C or C++ always preferable to Java for large databases/data structures? C maybe, but C++ is also object-oriented, so how does it gain an advantage over Java?
Because of the hard limit. The limit is part of the JLS and the JVM specification. It is nothing to do with OOP per se.
2) Should I stay with Java, or would their suggestion (switching to C++) help in the future for large database/data-structure work? Any suggestions?
Go with their suggestion. If you are dealing with in-memory datasets that are that large, then their concerns are valid. And even if their concerns are (hypothetically) a bit overblown it is not a good thing to be battling your superiors / seniors ...

1) They have seen that Array [Int-Max] [Int-Max] in Java takes nearly 1.5 times more memory than in C, and that C++ has a more reasonable memory footprint than Java.
That depends on the situation. If you create a new int[1] or new int[1000], there is almost no difference between Java and C++. If you allocate the data on the stack, there is a large relative difference, as Java doesn't use the stack for such data.
I would first make sure this is not just micro-tuning the application. It's worth remembering that one day of your time (even assuming you are on minimum wage) is worth about 2.5 GB of memory. So unless you are saving 2.5 GB per day by doing this, I suspect it's not worth chasing.
2) C/C++ can handle arbitrarily large files, whereas Java can't.
I have memory-mapped an 8 TB file in a pure Java program, so I have no idea what this is about.
There is a limit: you cannot map more than 2 GB per mapping, or have more than 2 billion elements in a single array. You can work around this by using more than one (e.g. up to 2 billion of them).
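A minimal sketch of that workaround for files, assuming read-only access through standard NIO (the class and method names are made up):
// Hypothetical sketch: mapping a file larger than 2 GB as a series of
// MappedByteBuffers, each at most Integer.MAX_VALUE bytes.
// A mapping stays valid even after the channel is closed.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class ChunkedMapping {
    public static List<MappedByteBuffer> mapWholeFile(String path) throws Exception {
        List<MappedByteBuffer> regions = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r");
             FileChannel channel = file.getChannel()) {
            long position = 0;
            long remaining = channel.size();
            while (remaining > 0) {
                long regionSize = Math.min(remaining, Integer.MAX_VALUE);
                regions.add(channel.map(FileChannel.MapMode.READ_ONLY, position, regionSize));
                position += regionSize;
                remaining -= regionSize;
            }
        }
        return regions;
    }
}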
As we have to work on such large databases/data structures, C/C++ is always preferable.
I regularly load 200 - 800 GB of data with over 5 billion entries into a single Java process (sometimes more than one at a time on the same machine).
1) Why is C/C++ always preferable to Java for large databases/data structures?
There is more experience on how to do this in C/C++ than there is in Java, and their experience of how to do this is only in C/C++.
C maybe, but C++ is also object-oriented, so how does it gain an advantage over Java?
When using large datasets, it's more common in the Java world to use a separate database (embedded databases are relatively rare).
Java just calls the same system calls you can in C, so there is no real difference in terms of what you can do.
2) Should I stay with Java, or would their suggestion (switching to C++) help in the future for large database/data-structure work? Any suggestions?
At the end of the day, they pay you and sometimes technical arguments are not really what matters. ;)

Related

Which is faster: Array list or looping through all data combinations?

I'm programming something in Java; for context see this question: Markov Model decision process in Java
I have two options:
byte[][] mypatterns = new byte[MAX][4];
or
ArrayList<byte[]> mypatterns = new ArrayList<>();
I can use a Java ArrayList and append new arrays whenever I create them, or use a static array by calculating all possible data combinations, then looping through to see which indexes are 'on or off'.
Essentially, I'm wondering if I should allocate a large block that may contain uninitialized values, or use the dynamic array.
This runs every frame, so looping through 200 elements each frame could be very slow, especially because I will have multiple instances of this loop.
Based on theory and what I have heard, dynamic arrays are very inefficient.
My question is: Would looping through an array of say, 200 elements be faster than appending an object to a dynamic array?
Edit>>>
More information:
I will know the max length of the array if it is static.
The items in the array will frequently change, but their sizes are constant, therefore I can easily change them.
Allocating it statically would essentially act as a memory pool.
Other instances may have more or less of the data initialized than others
You're right, really; I should use a profiler first, but I'm also just curious about the question 'in theory'.
The "theory" is too complicated. There are too many alternatives (different ways to implement this) to analyse. On top of that, the actual performance for each alternative will depend on the the hardware, JIT compiler, the dimensions of the data structure, and the access and update patterns in your (real) application on (real) inputs.
And the chances are that it really doesn't matter.
In short, nobody can give you an answer that is well founded in theory. The best we can give is recommendations that are based on intuition about performance, and / or based on software engineering common sense:
simpler code is easier to write and to maintain,
a compiler is a more consistent1 optimizer than a human being,
time spent on optimizing code that doesn't need to be optimized is wasted time.
1 - Certainly over a large code-base. Given enough time and patience, a human can do a better job for some problems, but that is not sustainable over a large code-base, and it doesn't take account of the facts that 1) compilers are always being improved, 2) optimal code can depend on things that a human cannot take into account, and 3) a compiler doesn't get tired and make mistakes.
The fastest way to iterate over bytes is as a single array. A faster way to process them is as int or long values, since processing 4-8 bytes at a time is faster than processing one byte at a time, though it rather depends on what you are doing. Note: a byte[4] is actually 24 bytes on a 64-bit JVM, which means you are not making efficient use of your CPU cache. If you don't know the exact size you need, you might be better off creating a buffer larger than necessary, even if you don't use all of it; i.e. in the case of the byte[][] you are already using 6x the memory you really need.
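A sketch of what "processing 8 bytes at a time" can look like, using a LongBuffer view over the same bytes (the folding operation is arbitrary stand-in work, and the two methods do not compute the same value; the point is only the access pattern):
// Illustrative sketch: walking a byte[] one byte at a time vs. eight bytes
// at a time through a LongBuffer view.
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

public class WideProcessing {
    static long foldPerByte(byte[] data) {
        long acc = 0;
        for (byte b : data) {
            acc ^= b;
        }
        return acc;
    }

    static long foldPerLong(byte[] data) {
        // Any trailing bytes (length not a multiple of 8) are ignored here.
        LongBuffer longs = ByteBuffer.wrap(data).asLongBuffer();
        long acc = 0;
        while (longs.hasRemaining()) {
            acc ^= longs.get();
        }
        return acc;
    }
}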
Any performance difference will not be visible when you set an initialCapacity on the ArrayList. You say that your collection's size can never change, but what if this logic changes?
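For example (a tiny sketch; 200 is just the figure from the question):
// Pre-sizing the backing array avoids the intermediate copies made as an ArrayList grows.
int maxPatterns = 200;   // assumed known maximum
ArrayList<byte[]> mypatterns = new ArrayList<>(maxPatterns);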
Using ArrayList you get access to a lot of methods such as contains.
As other people have said already, use ArrayList unless performance benchmarks say it is a bottleneck.

When performing mmap, would C or Java have any significant performance differences?

I have a 50GB file that is a sorted csv file.
Would it in theory make any difference if I were performing lookups on this file using memory-mapped access from C or Java?
I'm guessing since the file access is pushed down to the operating system level, it really shouldn't make much of a difference correct?
In theory, Java will be infinitesimally slower because of the need for additional indirections due to Java's object-oriented method invocation, and possibly due to the need to cross the Java/JNI boundary.
In practice, the Hotspot compiler optimizes direct ByteBuffer access, and the cost of page faults will far exceed the extra memory indirection.
To give a direct answer to the question:
C's mmap() and Java's FileChannel.map() are considered to be pretty much equivalents and won't have significant performance differences.
Java can only map 2 GB at a time. This is because ByteBuffer uses 32-bit integers for length, size, etc. So you'd need 25 mmaps for your 50 GB file. C can just create a single mmap, although it won't be portable to 1990s computers (if you care about that)

Why is there no sizeof in Java?

For what design reason is there no sizeof operator in Java? Knowing that it is very useful in C++ and C#, how can you get the size of a certain type if needed?
Because the size of primitive types is explicitly mandated by the Java language. There is no variance between JVM implementations.
Moreover, since allocation is done by the new operator depending on its argument there is no need to specify the amount of memory needed.
It would sure be convenient sometimes to know how much memory an object will take so you could estimate things like max heap size requirements but I suppose the Java Language/Platform designers did not think it was a critical aspect.
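If you just want those mandated sizes at run time, the wrapper-class constants expose them (a trivial sketch, assuming Java 8 or later):
// Primitive widths are fixed by the language specification, so these print
// the same values on every conforming JVM.
System.out.println(Byte.BYTES);      // 1
System.out.println(Short.BYTES);     // 2
System.out.println(Integer.BYTES);   // 4
System.out.println(Long.BYTES);      // 8
System.out.println(Float.BYTES);     // 4
System.out.println(Double.BYTES);    // 8
System.out.println(Character.BYTES); // 2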
In C it is useful only because you have to manually allocate and free memory. However, since Java has automatic garbage collection, this is not necessary.
In Java you don't work directly with memory, so sizeof is usually not needed; if you still want to determine the size of an object, check out this question.
Memory management is done by the VM in Java, perhaps this might help you: http://www.javamex.com/java_equivalents/memory_management.shtml
C needed sizeof because the size of ints and longs varies depending on the OS and compiler. Or at least it used to. :-) In Java all sizes and bit layouts (e.g. IEEE 754) are precisely defined.
EDIT - I see that #Maerics provided a link to the Java specs in his answer.
The sizeof operator exists in C/C++ because C/C++ are machine-dependent languages: different data types might have different sizes on different machines, so programmers need to know how big those types are when performing operations that are sensitive to size.
E.g. one machine might have 32-bit integers while another machine might have 16-bit integers.
But Java is a machine-independent language, and all the data types are the same size on all machines, so there is no need to find the size of data types; it is predefined in Java.

How to test how many bytes an object reference use in Java?

I would like to test how many bytes an object reference uses in the Java VM that I'm using. Do you guys know how to test this?
Thanks!
Taking the question literally: on 32-bit JVMs a reference takes 4 bytes; on 64-bit JVMs a reference takes 8 bytes, unless -XX:+UseCompressedOops is in effect, in which case it takes 4 bytes.
I assume you are asking how to tell how much space an Object occupies. You can use Instrumentation (not a simple matter), but this will only give you the shallow size. Java tends to break into many objects something that in C++ might be a single structure, so it is not as useful.
However, if you have a memory issue, I suggest using a memory profiler. This will give you the shallow and deep sizes of objects and a picture across the whole system. This is often more useful, because you can start with the biggest consumers and optimise those; even if you have been developing Java for ten-plus years, you will only be guessing where the best place to optimise is unless you have hard data.
Another way to get the object size, if you don't want to use a profiler, is to allocate a large array and see how much memory is consumed; you have to do this many times to get a good idea of what the average size is. I would set the young generation very high to avoid GCs confusing your results, e.g. -XX:NewSize=1g.
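A rough sketch of that measurement (the class is made up for illustration; results are approximate and depend on the JVM):
// Allocate many identical objects and divide the change in used heap by the
// count. Take several runs to smooth out noise, ideally with a large young
// generation (e.g. -XX:NewSize=1g).
public class RoughObjectSize {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        int count = 1_000_000;
        Object[] keepAlive = new Object[count];   // keeps the objects reachable

        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();

        for (int i = 0; i < count; i++) {
            keepAlive[i] = new Object();          // replace with the class you care about
        }

        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("approx. bytes per object: " + (double) (after - before) / count);
        System.out.println("allocated " + keepAlive.length + " objects");  // keep the array live
    }
}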
It can differ from JVM to JVM but "Sizeof for Java" says
You might recollect "Java Tip 130: Do You Know Your Data Size?" that described a technique based on creating a large number of identical class instances and carefully measuring the resulting increase in the JVM used heap size. When applicable, this idea works very well, and I will in fact use it to bootstrap the alternate approach in this article.
If you need to be fairly accurate, check out the Instrumentation framework.
This one is the one I use. Got to love those 16-byte references !
alphaworks.ibm.heapanalyzer

determining java memory usage

Hmmm. Is there a primer anywhere on memory usage in Java? I would have thought Sun or IBM would have had a good article on the subject but I can't find anything that looks really solid. I'm interested in knowing two things:
at runtime, figuring out how much memory the classes in my package are using at a given time
at design time, estimating general memory overhead requirements for various things like:
how much memory overhead is required for an empty object (in addition to the space required by its fields)
how much memory overhead is required when creating closures
how much memory overhead is required for collections like ArrayList
I may have hundreds of thousands of objects created and I want to be a "good neighbor" to not be overly wasteful of RAM. I mean I don't really care whether I'm using 10% more memory than the "optimal case" (whatever that is), but if I'm implementing something that uses 5x as much memory as I could if I made a simple change, I'd want to use less memory (or be able to create more objects for a fixed amount of memory available).
I found a few articles (Java Specialists' Newsletter and something from JavaWorld) and the built-in java.lang.instrument.Instrumentation.getObjectSize() method, which claims to measure an "approximation" (??) of memory use, but these all seem kind of vague...
(and yes I realize that a JVM running on two different OS's may be likely to use different amounts of memory for different objects)
I used JProfiler a number of years ago and it did a good job, and you could break down memory usage to a fairly granular level.
As of Java 5, on Hotspot and other VMs that support it, you can use the Instrumentation interface to ask the VM the memory usage of a given object. It's fiddly but you can do it.
In case you want to try this method, I've added a page to my web site on querying the memory size of a Java object using the Instrumentation framework.
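For reference, the fiddly part is that you can only obtain an Instrumentation instance through an agent. A minimal sketch (the class name is made up; the Premain-Class manifest attribute and the premain signature are the standard mechanism):
// ObjectSizeAgent.java - packaged in a jar whose manifest contains
//   Premain-Class: ObjectSizeAgent
// and loaded with: java -javaagent:sizeagent.jar YourApp
import java.lang.instrument.Instrumentation;

public class ObjectSizeAgent {
    private static volatile Instrumentation instrumentation;

    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    // Shallow size only: objects referenced by o are not included.
    public static long sizeOf(Object o) {
        return instrumentation.getObjectSize(o);
    }
}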
As a rough guide in Hotspot on 32-bit machines:
objects use 8 bytes for "housekeeping"
fields use what you'd expect given their bit length (though booleans tend to be allocated an entire byte)
object references use 4 bytes
overall object size has a granularity of 8 bytes (i.e. if you have an object with 1 boolean field it will use 16 bytes; if you have an object with 8 booleans it will also use 16 bytes)
There's nothing special about collections in terms of how the VM treats them. Their memory usage is the total of their internal fields plus -- if you're counting this -- the usage of each object they contain. You need to factor in things like the default array size of an ArrayList, and the fact that that size increases by 1.5 whenever the list gets full. But either asking the VM, or using the above metrics while looking at the source code of the collections and "working it through", will essentially get you to the answer.
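As a rough worked example of "working it through" (treating the 32-bit figures above as assumptions): an ArrayList<Integer> holding 1,000 values costs roughly 16 bytes of array header plus up to ~1,500 references of 4 bytes each for the backing Object[] (after 1.5x growth), plus ~16 bytes for each boxed Integer, i.e. on the order of 22 KB, versus about 4 KB for a plain int[1000].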
If by "closure" you mean something like a Runnable or Callable, well again it's just a boring old object like any other. (N.B. They aren't really closures!!)
You can use JMP, but it's only caught up to Java 1.5.
I've used the profiler that comes with newer versions of Netbeans a couple of times and it works very well, supplying you with a ton of information about memory usage and runtime of your programs. Definitely a good place to start.
If you are using a pre-1.5 VM, you can get the approximate size of objects by using serialization. Be warned though: this can require double the amount of memory for that object.
See if PerfAnal will give you what you are looking for.
This might not be the exact answer you are looking for, but the posts at the following link will give you very good pointers: Other Question about Memory
I believe the profiler included in NetBeans can monitor memory usage as well; you can try that.
