I need to hash short strings using a fast hash function. At the moment I use the stringHash function from MurmurHash3, provided in the scala.util.hashing package.
I thought about using xxHash from the Zero-Allocation-Hashing library for Java/Scala, since it is stated to be much faster than MurmurHash3 while having the same quality:
https://github.com/OpenHFT/Zero-Allocation-Hashing
Is it significantly faster than MurmurHash3 in practice for strings?
At the moment I use 32-bit values, since that is what stringHash returns, and it suffices for my needs. The xxHash function produces 64-bit values. Space is not an issue here, but is the performance difference between working with 32-bit and 64-bit values significant?
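For concreteness, this is roughly the call I have in mind - a minimal sketch assuming the LongHashFunction API shown in that project's README, with the 64-bit result simply folded down to 32 bits:

    import net.openhft.hashing.LongHashFunction;

    public class XxHashSketch {
        // A single stateless instance; xx() is xxHash64 with seed 0.
        private static final LongHashFunction XX = LongHashFunction.xx();

        public static void main(String[] args) {
            String s = "example";

            // 64-bit hash computed directly over the string's chars.
            long h64 = XX.hashChars(s);

            // If 32 bits suffice, fold the two halves together (or just cast).
            int h32 = (int) (h64 ^ (h64 >>> 32));

            System.out.printf("xx64=%016x xx32=%08x%n", h64, h32);
        }
    }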
While the answer is surely scattered across the internet, I was not able to piece together a definitive one.
Thanks in advance!
Related
I'm trying to find a simple and accurate reference for the cost in bytes of Java objects on a 64-bit JVM, and I've not been able to find one. The primitives are clearly specified, but there are all these edge cases and exceptions that I am trying to figure out, like padding for an object, and the declared cost vs. the space objects actually take up on the heap, etc. From the gist of what I'm reading here: http://btoddb-java-sizing.blogspot.com/ those can actually be different?? :-/
If you turn off the TLAB, you will get accurate accounting and you can see exactly how much memory each object allocation uses.
The best way to see where your memory is being used is with a memory profiler. Worrying about a few bytes here and there is most likely a waste of time. It starts to matter when you have hundreds of MB, and the best way to see that is in a profiler.
BTW, most systems use 32-bit references (compressed oops), even in 64-bit JVMs. There is no such thing as a 64-bit object. Apart from the header, an object uses the same space whether it is on a 32-bit JVM or on a 64-bit JVM using 32-bit references.
You are essentially asking for a simple way to get an accurate prediction of object sizes in Java.
Unfortunately ... there isn't one!
The blog posting you found mentions a number of complicating factors. Another one is that the object sizing calculation can potentially change from one Java release to the next, or between different Java implementation vendors.
In practice, your options are:
Estimate the sizes based on what you know, and accept that your estimates may be wrong. (If you take account of enough factors, you should be able to get reasonable ballpark estimates, at least for a particular platform. But accurate predictions are inherently hard work.)
Write micro benchmarks using the TLAB technique to measure the size of the objects.
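To illustrate the second option, here is a minimal sketch of the kind of micro-benchmark I mean - run it with the TLAB turned off (-XX:-UseTLAB) so allocations are accounted individually; the class name and the object being measured are placeholders:

    public class SizeOfSketch {
        static Object[] keep;   // hold the objects so they are not collected during the measurement

        public static void main(String[] args) {
            final int count = 1_000_000;
            keep = new Object[count];

            Runtime rt = Runtime.getRuntime();
            System.gc();
            long before = rt.totalMemory() - rt.freeMemory();

            for (int i = 0; i < count; i++) {
                keep[i] = new Object();   // replace with the object you actually want to size
            }

            long after = rt.totalMemory() - rt.freeMemory();
            // Note: the Object[] itself contributes 4 or 8 bytes per slot, which you may want to subtract.
            System.out.println("approx bytes/object: " + (double) (after - before) / count);
        }
    }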
The other point is that in most cases it doesn't matter if your object size predictions are not entirely accurate. The recommended approach is to implement, measure and then optimize. This does not require accurate size information until you get to the optimization stage, and at that point you can measure the sizes ... if you need the information.
I am looking for an alternative to the Java BitSet implementation. I am implementing a high-performance algorithm, and it seems like using a BitSet object is killing its performance. Any ideas?
Someone here has compared boolean[] to BitSet and concluded:
BitSet is more memory efficient than boolean[] except for very small sizes. Each boolean in the array takes a byte. The numbers from runtime.freeMemory() are a bit muddled for BitSet, but less.
boolean[] is more CPU efficient except for very large sizes, where they are about even. E.g., for size 1 million boolean[] is about four times faster (e.g. 6ms vs 27ms), for ten and a hundred million they are about even.
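If you want to reproduce that kind of comparison on your own hardware, a naive timing sketch could look like the following (JMH would give more trustworthy numbers; treat this only as a rough check):

    import java.util.BitSet;

    public class BitSetVsBooleanArray {
        public static void main(String[] args) {
            final int n = 1_000_000;

            boolean[] flags = new boolean[n];
            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++) flags[i] = (i & 1) == 0;
            long booleanMs = (System.nanoTime() - t0) / 1_000_000;

            BitSet bits = new BitSet(n);
            long t1 = System.nanoTime();
            for (int i = 0; i < n; i++) bits.set(i, (i & 1) == 0);
            long bitSetMs = (System.nanoTime() - t1) / 1_000_000;

            System.out.println("boolean[]: " + booleanMs + " ms, BitSet: " + bitSetMs + " ms");
        }
    }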
If you Google, you can find some alternative implementations as well, like JavaEWAH, used by Apache Hive, Apache Spark and Eclipse JGit. It claims:
The goal of word-aligned compression is not to achieve the best compression, but rather to improve query processing time. Hence, we try to save CPU cycles, maybe at the expense of storage. However, the EWAH scheme we implemented is always more efficient storage-wise than an uncompressed bitmap (as implemented in the BitSet class). Unlike some alternatives, javaewah does not rely on a patented scheme.
While searching for an answer to my question single byte comparison vs multiple boolean comparison, I found OpenBitSet.
They claim to be faster than java.util.BitSet and to offer direct access to the array of words storing the bits.
I am definitely going to try it; see if it solves your purpose too.
Look at Javolution's FastBitSet:
A high-performance bitset integrated with the collection framework as a set of indices and obeying the collection semantic for methods such as FastSet.size() (cardinality) or FastCollection.equals(java.lang.Object) (same set of indices).
See also http://code.google.com/p/guava-libraries/issues/detail?id=724#c3.
If you really must squeeze the maximum performance out of this thing, and if memory does not matter, you can try storing each one of your flags in an integer whose bit size is equal to the width of the data bus of your CPU.
You are probably on a 64-bit data bus CPU, so try long integers.
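As a rough illustration of that word-per-flag layout (class and method names are mine), you trade a lot of memory for plain array loads and stores with no shifting or masking:

    public class WordPerFlag {
        // One 64-bit word per flag: wasteful on memory, but get/set are a single array access.
        private final long[] flags;

        public WordPerFlag(int n) { flags = new long[n]; }

        public void set(int i, boolean value) { flags[i] = value ? 1L : 0L; }
        public boolean get(int i)             { return flags[i] != 0L; }
    }

Whether this actually beats BitSet's packed long[] (which needs a shift and a mask per access, but touches far less memory and is friendlier to the cache) is something you should benchmark for your own access pattern.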
There are a number of compressed alternatives to the BitSet class. EWAH was already mentioned (https://github.com/lemire/javaewah). More recent additions include Roaring bitmaps (https://github.com/RoaringBitmap/RoaringBitmap), which are used by Apache Lucene, Apache Spark, Elasticsearch, and so forth.
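A minimal usage sketch, assuming the RoaringBitmap API as documented in the project's README:

    import org.roaringbitmap.RoaringBitmap;

    public class RoaringSketch {
        public static void main(String[] args) {
            RoaringBitmap a = RoaringBitmap.bitmapOf(1, 2, 3, 1_000_000);
            RoaringBitmap b = RoaringBitmap.bitmapOf(3, 4, 1_000_000);

            RoaringBitmap both = RoaringBitmap.and(a, b);   // intersection: {3, 1000000}

            System.out.println(both.contains(3));           // true
            System.out.println(both.getCardinality());      // 2
            System.out.println(a.getSizeInBytes());         // rough compressed size in bytes
        }
    }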
Roughly how long, and how much processing power is required to create SHA-1 hashes of data? Does this differ a lot depending on the original data size? Would generating the hash of a standard HTML file take significantly longer than the string "blah"? How would C++, Java, and PHP compare in speed?
You've asked a lot of questions, so hopefully I can try to answer each one in turn.
SHA-1, like many other hashes designed to be cryptographically strong, is based on the repeated application of an encryption-like routine to fixed-size blocks of data. Consequently, when computing the hash value of a long string, the algorithm takes proportionally more time than when computing the hash value of a short string. Mathematically, we say that the runtime to hash a string of length N with SHA-1 is O(N). Hashing an HTML document should therefore take longer than hashing the string "blah", but only proportionally so; it won't take dramatically longer.
As for comparing C++, Java, and PHP in terms of speed, this is dangerous territory and my answer is likely to get blasted, but generally speaking C++ is slightly faster than Java, which is slightly faster than PHP. A well-written hash implementation in one of those languages might also dramatically outperform a poorly written one in another. However, you shouldn't need to worry about this. It is generally considered a bad idea to implement your own hash functions, encryption routines, or decryption routines, because they are often vulnerable to side-channel attacks in which an attacker breaks your security by exploiting implementation bugs that are extremely difficult to anticipate. If you want a good hash function, use a prewritten version; it's likely to be faster, safer, and less error-prone than anything you write by hand.
Finally, I'd suggest not using SHA-1 at all. SHA-1 has known cryptographic weaknesses and you should consider using a strong hash algorithm instead, such as SHA-256.
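For example, in Java the prewritten route is simply java.security.MessageDigest - a minimal SHA-256 sketch:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class HashSketch {
        public static void main(String[] args) throws Exception {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest("blah".getBytes(StandardCharsets.UTF_8));

            // Print the digest as hex.
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            System.out.println(hex);   // 64 hex characters (256 bits)
        }
    }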
Hope this helps!
The "speed" of cryptographic hash functions is often measured in "clock cycles per byte". See this page for an admittedly outdated comparison - you can see how implementation and architecture influence the results. The results vary largely not only due to the algorithm being used, but they are also largely dependent on your processor architecture, the quality of the implementation and if the implementation uses the hardware efficiently. That's why some companies specialize in creating hardware especially well suited for the exact purpose of performing certain cryptographic algorithms as efficiently as possible.
A good example is SHA-512: because it works on larger data chunks than SHA-256, one might be inclined to think that it should generally perform slower than SHA-256 working on smaller input - but SHA-512 is especially well suited to 64-bit processors and sometimes even outperforms SHA-256 there.
All modern hash algorithms work on fixed-size blocks of data. They perform a fixed number of deterministic operations on a block, and do this for every block until you finally get the result. This also means that the longer your input, the longer the operation will take. From these characteristics we can deduce that the running time is directly proportional to the input size of the message. Mathematically, or computer-scientifically, speaking, we describe this as an O(n) operation, where n is the input size of the message, as templatetypedef already pointed out.
You should not let the speed of hashing influence your choice of programming language; all modern hash algorithms are really, really fast, regardless of the language. Although C-based implementations will do slightly better than Java, which again will probably be slightly faster than PHP, I bet that in practice you won't notice the difference.
SHA-1 processes the data by chunks of 64 bytes. The CPU time needed to hash a file of length n bytes is thus roughly equal to n/64 times the CPU time needed to process one chunk. For a short string, you must first convert the string to a sequence of bytes (SHA-1 works on bytes, not on characters); the string "blah" will become 4 or 8 bytes (if you use UTF-8 or UTF-16, respectively) so it will be hashed as a single chunk. Note that the conversion from characters to bytes may take more time than the hashing itself.
Using the pure Java SHA-1 implementation from sphlib, on my PC (x86 Core2, 2.4 GHz, 64-bit mode), I can hash long messages at a bandwidth of 132 MB/s (that's using a single CPU core). Note that this exceeds the speed of a common hard disk, so when hashing a big file, chances are that the disk will be the bottleneck, not the CPU: the time needed to hash the file will be the time needed to read the file from the disk.
(Also, using native code written in C, SHA-1 speed goes up to 330 MB/s.)
SHA-256 is widely considered to be more secure than SHA-1, and a pure Java implementation of SHA-256 reaches 85 MB/s on my PC, which is still quite fast. As of 2011, SHA-1 is not recommended.
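If you want a ballpark MB/s figure for your own machine, a crude timing sketch (no warm-up, so treat the number only as an estimate) could look like this:

    import java.security.MessageDigest;

    public class DigestThroughput {
        public static void main(String[] args) throws Exception {
            byte[] chunk = new byte[1 << 20];               // 1 MB buffer; contents don't matter for timing
            MessageDigest md = MessageDigest.getInstance("SHA-1");

            int megabytes = 256;
            long t0 = System.nanoTime();
            for (int i = 0; i < megabytes; i++) md.update(chunk);
            md.digest();
            double seconds = (System.nanoTime() - t0) / 1e9;

            System.out.printf("~%.0f MB/s%n", megabytes / seconds);
        }
    }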
I want to ask a complex question.
I have to code a heuristic for my thesis. I need the following:
Evaluate some integral functions
Minimize functions over an interval
Do this thousands and thousands of times.
So I need a fast programming language for these jobs. Which language do you suggest? I started with Java, but computing the integrals became a problem, and I'm not sure about its speed.
Connecting Java to other software like MATLAB may be a good idea, but since I'm not sure, I'd like to hear your opinions.
Thanks!
C, Java, ... are all Turing-complete languages; they can compute the same functions with the same precision.
If you want to achieve performance goals, use C, which is a compiled, high-performance language. It can decrease your computation time by avoiding the method calls and high-level features present in an interpreted language like Java.
Anyway, remember that your implementation may impact performance more than the language you choose, because as the input size grows it is the computational complexity that matters ( http://en.wikipedia.org/wiki/Computational_complexity_theory ).
It's not the programming language, it's probably your algorithm. Determine the big-O complexity of your algorithm. If you use nested loops where you could instead use a hash lookup in a Map, your algorithm can be made n times faster.
Note: modern JVMs (JDK 1.5 or 1.6) compile just-in-time to native code (as in, not interpreted) for the specific OS, OS version, and hardware architecture. You could try the -server flag to JIT even more aggressively (at the cost of an even longer initialization time).
Do this thousands and thousands of times.
Are you sure it's not more, something like 10^1000 instead? Try accurately calculating how many times you need to run that loop; it might surprise you. The type of problem on which heuristics are used tends to have a really big search space.
Before you start switching languages, I'd first try to do the following things:
Find the best available algorithms.
Find available implementations of those algorithms usable from your language.
There are e.g. scientific libraries for Java. Try to use these libraries.
If they are not fast enough, investigate whether there is anything to be done about it. Is your problem more specific than what the library assumes? Can you improve the algorithm based on that knowledge?
What is it that takes so much time/memory? Is this really related to your language? Take care not to measure JVM start-up time instead of the time it actually spent calculating.
Only then would I consider switching languages. But don't expect it to be easy to beat optimized third-party Java libraries with hand-written C.
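To give a sense of scale, basic numerical integration in plain Java is only a few lines - here is a Simpson's-rule sketch with a placeholder integrand (the names are mine) - so the integral evaluation itself is unlikely to be what forces a language switch:

    import java.util.function.DoubleUnaryOperator;

    public class IntegrateSketch {
        // Composite Simpson's rule over [a, b]; n must be even.
        static double simpson(DoubleUnaryOperator f, double a, double b, int n) {
            double h = (b - a) / n;
            double sum = f.applyAsDouble(a) + f.applyAsDouble(b);
            for (int i = 1; i < n; i++) {
                sum += (i % 2 == 0 ? 2 : 4) * f.applyAsDouble(a + i * h);
            }
            return sum * h / 3;
        }

        public static void main(String[] args) {
            // Integral of x^2 over [0, 1] should come out as ~0.3333.
            System.out.println(simpson(x -> x * x, 0.0, 1.0, 1000));
        }
    }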
Order of the algorithm
Typically, switching languages only reduces the time required by a constant factor. Say you can double the speed using C; if your algorithm is O(n^2), it will still take four times as long to process twice as much data, no matter the language.
And the JVM can optimize a lot of things, with good results.
Some possible optimizations in Java
If you have methods that are called many times, make them final, and do the same for entire classes. The compiler will know that it can inline the method code, avoiding the creation of method-call stack frames for those calls.
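A sketch of the pattern being described (note that a modern HotSpot JIT can often inline small, monomorphic non-final calls as well, so measure before relying on this):

    public final class HotMath {                     // final class: no subclass can override its methods
        double scale(double x) { return x * 2.5; }   // small, hot method - a good inlining candidate

        public static void main(String[] args) {
            HotMath m = new HotMath();
            double acc = 0;
            for (int i = 0; i < 100_000_000; i++) {
                acc += m.scale(i);                   // call in the hot loop
            }
            System.out.println(acc);                 // keep the result alive so the loop isn't removed
        }
    }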
This question already has answers here:
How to determine the size of an object in Java
I am working on calculating the size [memory used] of a Java object [a HashMap]. It contains elements of different data types [at runtime], so [no-of-elem * size-of-element] is not a good approach. The code right now does it with a series of checks like:
if (x)
    do something
else if (primitives)
    lookup size and calculate
However, this process is a CPU hog and inefficient.
I am thinking of the following two approaches instead:
Serialize the object to a buffer and get the size.
Look into java.lang.instrument to get the size
I am looking for anyone's experience with these approaches regarding performance, efficiency, scaling, etc., OR any better way you know of.
P.S:
This is a background utility that I am building, so the size need not be super accurate, though it should be roughly correct. I am willing to trade accuracy for performance.
I am not interested in the deep size [the sizes of objects referred to by this object will not be computed].
I am looking for performance comparisons and an understanding of how getObjectSize() works internally, so that I do not mess up something else while trying to improve performance.
Thanks
Use the getObjectSize() method of java.lang.instrument.Instrumentation.
Look here for implementation details:
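As a rough illustration (not the linked write-up), a minimal premain agent could look like this - the class and jar names are placeholders, and the jar's manifest needs a Premain-Class entry:

    import java.lang.instrument.Instrumentation;

    // Build into sizeagent.jar with "Premain-Class: ObjectSizeAgent" in the manifest,
    // then run your application with: java -javaagent:sizeagent.jar ...
    public class ObjectSizeAgent {
        private static volatile Instrumentation instrumentation;

        public static void premain(String agentArgs, Instrumentation inst) {
            instrumentation = inst;
        }

        // Shallow size of a single object; referenced objects are not followed.
        public static long sizeOf(Object o) {
            return instrumentation.getObjectSize(o);
        }
    }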
The serialized size is definitely not the way to go, for two reasons:
In standard Java serialization there can be quite a lot of overhead, which would add to the size.
It would not be any quicker than using the getObjectSize() method, which we can presume iterates over all the references and uses some kind of lookup to determine the sizes of an object's primitive values and references.
If you need better performance, then that will really depend on the distribution of your objects. One possibility would be to take a random sample of the values in your map, determine an average size, and calculate an estimate from that.
For advice on how to look up a random value in a hashmap, see this question.
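A rough sketch of that sampling idea (the Instrumentation instance would come from an agent as sketched in the other answer; this assumes the map is non-empty, and note that copying the values into a list is itself O(n) in references):

    import java.lang.instrument.Instrumentation;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class SampledSizeEstimate {
        // Estimate of the total shallow size of a map's values from a random sample.
        static long estimateValueBytes(Map<?, ?> map, Instrumentation inst, int samples) {
            List<Object> values = new ArrayList<>(map.values());
            Random rnd = new Random();
            long total = 0;
            for (int i = 0; i < samples; i++) {
                total += inst.getObjectSize(values.get(rnd.nextInt(values.size())));
            }
            return (total / samples) * map.size();
        }
    }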
You may be interested in an article I wrote a while ago on how to calculate the memory usage of a Java object. It is admittedly aimed primarily at 32-bit Hotspot, although much of it applies in essence to other environments.
You can also download a simple agent for measuring Java object size from the same site which will take some of the hard work out of it for you and should work in 64-bit environments.
Note, as others have mentioned, that the serialised form of an object isn't the same as its form in memory, so serialisation isn't suitable if you want to accurately measure the memory footprint.