In a 64-bit VM, will using longs instead of ints perform any better, given that longs are 64 bits in Java and hence pulling and processing a 64-bit word may be faster than pulling a 32-bit word on a 64-bit system? (I am expecting a lot of NOs, but I was looking for a detailed explanation.)
EDIT: I am implying that "pulling and processing a 64-bit word may be faster than pulling a 32-bit word on a 64-bit system" because I am assuming that on a 64-bit system, pulling 32-bit data would require you to first get the 64-bit word and then mask off the top 32 bits.
Using long instead of int will probably slow you down in general.
Your immediate concern is whether int on a 64-bit CPU requires extra processing time. This is highly unlikely on a modern pipelined CPU. We can test this easily with a little program. The data it operates on should be small enough to fit in the L1 cache, so that we are testing this specific concern. On my machine (a 64-bit Intel Core 2 Quad) there's basically no difference.
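For instance, here is a minimal sketch of such a test (the class name, array size, and repetition count are my own illustrative choices, not from the original post):

public class IntVsLong {
    static final int N = 4 * 1024; // 16 KB of ints / 32 KB of longs: small enough to stay cache-resident

    static long sumInts(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    static long sumLongs(long[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    public static void main(String[] args) {
        int[] ints = new int[N];
        long[] longs = new long[N];
        long sink = 0;
        // warm-up, so the JIT compiles both methods before we time them
        for (int rep = 0; rep < 10_000; rep++) sink += sumInts(ints) + sumLongs(longs);
        long t0 = System.nanoTime();
        for (int rep = 0; rep < 1_000_000; rep++) sink += sumInts(ints);
        long t1 = System.nanoTime();
        for (int rep = 0; rep < 1_000_000; rep++) sink += sumLongs(longs);
        long t2 = System.nanoTime();
        System.out.println("int:  " + (t1 - t0) / 1e9 + " s");
        System.out.println("long: " + (t2 - t1) / 1e9 + " s");
        System.out.println(sink); // keep the JIT from eliminating the work
    }
}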
In a real app, most data can't reside in the CPU caches. We must worry about loading data from main memory into the cache, which is comparatively very slow and usually a bottleneck. Such loading works in units of "cache lines", which are 64 bytes or more, therefore loading a single long or int takes the same time.
However, using long will waste precious cache space, so cache misses will increase which are very expensive. Java's heap space is also stressed, so GC activity will increase.
We can demonstrate this by reading huge long[] and int[] arrays with the same number of elements, way bigger than the caches can contain. The long version takes 65% more time on my machine. The test is bound by memory-to-cache throughput, and the long[] occupies 100% more memory. (Why it doesn't take 100% more time is beyond me; obviously other factors are in play too.)
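A sketch of that comparison (the array length is my own choice; it just has to dwarf every cache level, and the heap must be sized to hold both arrays, e.g. -Xmx2g):

public class BigArrayScan {
    static final int N = 64 * 1024 * 1024; // 256 MB as int[], 512 MB as long[]

    public static void main(String[] args) {
        long sum = 0;
        int[] ints = new int[N];
        long t0 = System.nanoTime();
        for (int i = 0; i < N; i++) sum += ints[i];
        long t1 = System.nanoTime();
        ints = null; // allow the int[] to be collected before allocating the long[]
        long[] longs = new long[N];
        long t2 = System.nanoTime();
        for (int i = 0; i < N; i++) sum += longs[i];
        long t3 = System.nanoTime();
        System.out.println("int[]:  " + (t1 - t0) / 1e9 + " s");
        System.out.println("long[]: " + (t3 - t2) / 1e9 + " s");
        System.out.println(sum); // defeat dead-code elimination
    }
}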
I always go with using the right datatype based on your problem domain.
What I mean by this is: if you need a 64-bit long then use a 64-bit long, but if you don't need a 64-bit long then use an int.
Using 32 bits on a 64-bit platform is not expensive, and choosing between the two types on performance grounds makes no sense.
To me this doesn't look right:
for (long l = 0; l < 100; l++)
    SendToTheMoon(l);
And SendToTheMoon has nothing to do with it.
Might be faster, might be slower. Depends on the specific Java implementation, and how you use those variables. But in general it's probably not enough difference to worry about.
Related
I have the following function:
public void scanText(char[] T) {
    int q = 0;
    for (int i = 0; i < T.length; i++) {
        q = transFunc[preCompRow[q] + T[i]];
        if (q == pattern.length) {
            System.out.println("match found at position: " + (i - pattern.length + 2));
        }
    }
}
This function scans a char array searching for matches of a given pattern, which is stored as a finite automaton. The transition function of the automaton is stored in the variable called transFunc.
I am testing this function on a text with 8 million characters and using 800000 patterns. The thing is, the access to the array preCompRow[q] (which is an int[]) is very slow. Performance is greatly improved if I delete preCompRow[q] from the code. I think this might be because on every iteration the q variable has a different, non-sequential value (2, 56, 18, 9, ...).
Is there any better way to access an array in a non-sequential manner?
Thanks in advance!
One possible explanation is that your code is seeing poor memory performance due to poor locality in its memory access patterns.
The role of the memory caches in a modern computer is to deal with the speed mismatch between processor instruction times (less than 1 ns) and main memory (tens of nanoseconds). They work best when your code gets a cache hit most of the times it fetches from memory.
A modern Intel chipset caches memory in blocks of 64 bytes and loads from main memory in burst mode. (That corresponds to 16 int values.) The L1 data cache on (say) a Core i7 is 32KB per core, with larger L2/L3 caches behind it.
If your application is able to access the data in a large array (roughly) sequentially, then 7 out of 8 accesses will be cache hits. If the access pattern is non-sequential and the "working set" is a large multiple of the cache size, then you may end up with a cache miss on almost every memory access.
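A sketch that makes the effect visible (the sizes and the shuffled-index traversal are my own illustration, not the asker's code):

import java.util.Random;

public class LocalityDemo {
    public static void main(String[] args) {
        int n = 64 * 1024 * 1024;         // 256 MB of ints: far larger than the caches
        int[] data = new int[n];
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle of the visit order
            int j = rnd.nextInt(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        long sum = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[i];        // sequential: mostly cache hits
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[order[i]]; // random: mostly cache misses
        long t2 = System.nanoTime();
        System.out.println("sequential: " + (t1 - t0) / 1e6 + " ms");
        System.out.println("random:     " + (t2 - t1) / 1e6 + " ms");
        System.out.println(sum);
    }
}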
If memory access locality is the root of your problems, then your options are limited:
redesign your algorithm so that locality of memory references is better
buy hardware with larger caches
(maybe) redesign your algorithm to use GPUs or some other strategy to reduce the memory traffic
Recoding your existing code in C or C++ may give a performance improvement, but the same memory locality problems will bite you there as well.
I am not aware of any tools that can be used to measure cache performance in Java applications.
In every example I see of the -Xmx flag in the Java documentation (and in examples on the web), it is always a power of 2. For example '-Xmx512m', '-Xmx1024m', etc.
Is this a requirement, or is it a good idea for some reason?
I understand that memory in general comes in powers of 2; this question is specifically about the java memory flags.
It keeps things simple, but it is more for your benefit than anything else.
There is no particular reason to pick a power of 2 or a multiple of 50 MB (also common), e.g. -Xmx400m or -Xmx750m.
Note: the JVM doesn't follow this strictly. It uses the number to calculate the sizes of its different memory regions, which, if you add them up, tend to be slightly less than the number you provide. In my experience the heap is 1% - 2% less, but if you consider all the other memory regions the JVM uses, this doesn't make much difference.
Note: memory sizes for hardware are typically a power of two (on some PCs it was 3x a power of two). This might have got people into the habit of thinking of memory sizes as powers of two.
BTW: AFAIK, in most JVMs the actual size of each region is a multiple of the page size i.e. 4 KB.
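You can observe the "slightly less" effect yourself with a one-liner (a sketch; run it with different -Xmx values and compare against the raw byte count you asked for):

public class MaxHeapCheck {
    public static void main(String[] args) {
        // e.g. java -Xmx512m MaxHeapCheck  -> typically reports a bit under 512 MB
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("maxMemory: " + max + " bytes (" + max / (1024.0 * 1024) + " MB)");
    }
}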
I have a Java program that operates on a (large) graph. Thus, it uses a significant amount of heap space (~50GB, which is about 25% of the physical memory on the host machine). At one point, the program (repeatedly) picks one node from the graph and does some computation with it. For some nodes, this computation takes much longer than anticipated (30-60 minutes, instead of an expected few seconds). In order to profile these operations to find out what takes so much time, I have created a test program that creates only a very small part of the large graph and then runs the same operation on one of the nodes that took very long to compute in the original program. Thus, the test program obviously only uses very little heap space, compared to the original program.
It turns out that an operation that took 48 minutes in the original program can be done in 9 seconds in the test program. This really confuses me. The first thought might be that the larger program spends a lot of time on garbage collection. So I turned on the verbose mode of the VM's garbage collector. According to that, no full garbage collections are performed during the 48 minutes, and only about 20 collections in the young generation, which each take less than 1 second.
So my questions is what else could there be that explains such a huge difference in timing? I don't know much about how Java internally organizes the heap. Is there something that takes significantly longer for a large heap with a large number of live objects? Could it be that object allocation takes much longer in such a setting, because it takes longer to find an adequate place in the heap? Or does the VM do any internal reorganization of the heap that might take a lot of time (besides garbage collection, obviously).
I am using Oracle JDK 1.7, if that's of any importance.
While bigger memory might mean bigger problems, I'd say there's nothing (except the GC, which you've excluded) that could stretch 9 seconds into 48 minutes (a factor of 320).
A big heap can make spatial locality worse, but I don't think it matters here. I disagree with Tim's answer w.r.t. "having to leave the cache for everything".
There's also the TLB, which is a cache for virtual address translation and which can cause some problems with very large memory. But again, not a factor of 320.
I don't think there's anything in the JVM which could cause such problems.
The only reason I can imagine is that you have some swap space which gets used, despite the fact that you have enough physical memory. Even slight swapping can cause a huge slowdown. Make sure swapping is off (and possibly check the swappiness setting).
Even when things are in memory, you have multiple levels of data caching on modern CPUs. Every time you have to leave the cache to fetch data, things get slower. Having 50GB of RAM could well mean that it is having to leave the cache for everything.
The symptoms and differences you describe are just massive, though, and I don't see something as simple as cache effects making that much difference.
The best advice I can give you is to run a profiler against it both when it's running slow and when it's running fast, and compare the difference.
You need solid numbers and timings. "In this environment doing X took Y time". From that you can start narrowing things down.
I'm writing lots of stuff to the log in bursts, and optimizing the data path. I build the log text with StringBuilder. What would be the most efficient initial capacity, memory-management-wise, so that it works well regardless of the JVM? The goal is to avoid reallocation almost always, which should be covered by an initial capacity of around 80-100. But I also want to waste as few bytes as possible, since a StringBuilder instance may hang around in a buffer, and wasted bytes add up.
I realize this depends on the JVM, but there should be some value which wastes the fewest bytes no matter the JVM, a sort of "least common denominator". I am currently using 128-16, where 128 is a nice round number and the subtraction accounts for allocation overhead. Also, this might be considered a case of "premature optimization", but since the answer I am after is a "rule-of-thumb" number, knowing it would be useful in the future too.
I'm not expecting "my best guess" answers (my own answer above is already that), I hope someone has researched this already and can share a knowledge-based answer.
Don't try to be smart in this case.
I am currently using 128-16, where the 128 is a nice round number, and subtraction is for allocation overhead.
In Java, this is based on totally arbitrary assumptions about the inner workings of a JVM. Java is not C. Byte-alignment and the like are absolutely not an issue the programmer can or should try to exploit.
If you know the (probable) maximum length of your strings you may use that for the initial size. Apart from that, any optimization attempts are simply in vain.
If you really know that vast amounts of your StringBuilders will be around for very long periods (which does not quite fit the concept of logging), and you really feel the need to try to persuade the JVM to save some bytes of heap space, you may try using trimToSize() after the string is built completely. But, again, as long as your strings don't waste megabytes each, you really should go and focus on other problems in your application.
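For illustration, that would look like this (a sketch; the capacity and content are placeholders):

StringBuilder sb = new StringBuilder(4096); // deliberately generous initial capacity
sb.append("a finished log line");
String line = sb.toString();
sb.trimToSize(); // shrinks the internal buffer down to the current length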
Well, I ended up testing this briefly myself, and then testing some more after comments, to get this edited answer.
Using JDK 1.7.0_07 and a test app reporting VM name "Java HotSpot(TM) 64-Bit Server VM", the granularity of StringBuilder memory usage is 4 chars, increasing in steps of 4 chars.
Answer: any multiple of 4 is an equally good capacity for a StringBuilder from a memory allocation point of view, at least on this 64-bit JVM.
Tested by creating 1000000 StringBuilder objects with different initial capacities in different test program executions (to have the same initial heap state), and printing out ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() before and after.
Printing out heap sizes also confirmed that the amount actually allocated from the heap for each StringBuilder's buffer is an even multiple of 8 bytes, as expected since a Java char is 2 bytes long. In other words, allocating 1000000 instances with initial capacity 1..4 takes about 8 megabytes less memory (8 bytes per instance) than allocating the same number of instances with initial capacity 5..8.
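For reference, a sketch of the kind of test program described above (the class name and argument handling are mine; the measurement is noisy if a GC runs in between, hence one capacity per execution as noted):

import java.lang.management.ManagementFactory;

public class SbFootprint {
    public static void main(String[] args) {
        int capacity = Integer.parseInt(args[0]); // e.g. compare 4 vs 5, 8 vs 9 ...
        int count = 1_000_000;
        StringBuilder[] keep = new StringBuilder[count]; // keep instances reachable
        long before = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        for (int i = 0; i < count; i++) keep[i] = new StringBuilder(capacity);
        long after = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        System.out.println("capacity " + capacity + ": ~" + (after - before) / count + " bytes/instance");
        System.out.println(keep.length); // keep the array live until after measuring
    }
}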
I know that Java uses padding; objects have to be a multiple of 8 bytes. However, I don't see the purpose of it. What is it used for? What exactly is its main purpose?
Its purpose is alignment, which allows for faster memory access at the cost of some space. If data is unaligned, then the processor needs to do some shifts to access it after loading the memory.
Additionally, garbage collection is simplified (and sped up) the larger the size of the smallest allocation unit.
It's unlikely that Java has a requirement of 8 bytes (except on 64-bit systems), but since 32-bit architectures were the norm when Java was created it's possible that 4-byte alignment is required in the Java standard.
The accepted answer is speculation (but partially correct). Here is the real answer.
First off, to #U2EF1's credit, one of the benefits of 8-byte boundaries is that 8 bytes are the optimal access unit on most processors. However, there was more to the decision than that.
If you have 32-bit references, you can address up to 2^32 bytes, or 4 GB of memory (practically you get less, more like 3.5 GB). If you have 64-bit references, you can address 2^64 bytes, which is 16 exabytes of memory. However, with 64-bit references everything tends to slow down and take more space. This is due to the overhead of dealing with 64 bits on 32-bit processors, and on all processors more GC cycles, because there is less free space and more garbage collection.
So the creators took a middle ground and decided on 35-bit references, which can address 2^35 bytes, or 32 GB of memory, while taking up less space, giving the same performance benefits as 32-bit references. This is done by taking a 32-bit reference and shifting it left 3 bits when reading, and shifting it right 3 bits when storing. That means all objects must be aligned on 2^3-byte boundaries (8 bytes). These are called compressed ordinary object pointers, or compressed oops.
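Conceptually, the encode/decode arithmetic looks like this (a sketch of the idea only; the real thing happens inside the JVM, and heapBase here is a stand-in):

public class CompressedOops {
    // widen the 32-bit oop without sign extension, then multiply by 8
    static long decode(int oop, long heapBase) {
        return heapBase + ((oop & 0xFFFFFFFFL) << 3);
    }

    // only valid for 8-byte-aligned addresses, hence the alignment requirement
    static int encode(long address, long heapBase) {
        return (int) ((address - heapBase) >>> 3);
    }

    public static void main(String[] args) {
        long heapBase = 0;
        long addr = 32L * 1024 * 1024 * 1024 - 8; // last 8-byte slot of a 32 GB heap
        int oop = encode(addr, heapBase);
        System.out.println(decode(oop, heapBase) == addr); // true: 32 GB reachable via 32 bits
    }
}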
Why not 36-bit references for accessing 64 GB of memory? Well, it was a tradeoff. You'd waste a lot of space on 16-byte alignment, and as far as I know the vast majority of processors gain no speed benefit from 16-byte alignment as opposed to 8-byte alignment.
Note that the JVM doesn't bother using compressed oops unless the maximum heap size is set above 4 GB, which it is not by default. You can enable them explicitly with the -XX:+UseCompressedOops flag.
This was back in the day of 32-bit VMs to provide the extra available memory on 64-bit systems. As far as I know, there is no limitation with 64-bit VMs.
Source: Java Performance: The Definitive Guide, Chapter 8
Data type sizes in Java are multiples of 8 bits (not bytes) because word sizes in most modern processors are multiples of 8 bits: 16 bits, 32 bits, 64 bits. In this way a field in an object can be made to fit ("aligned") in a word or words and waste as little space as possible, taking advantage of the underlying processor's instructions for operating on word-sized data.
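If you want to see the actual layout and padding of your own classes, the OpenJDK JOL tool prints it (a sketch; assumes the org.openjdk.jol:jol-core dependency is on the classpath, and the class names are mine):

import org.openjdk.jol.info.ClassLayout;

public class LayoutDemo {
    static class Example {
        byte b;
        int i;
        long l;
    }

    public static void main(String[] args) {
        // prints each field's offset, any alignment gaps, and the padded total size
        System.out.println(ClassLayout.parseClass(Example.class).toPrintable());
    }
}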