n-body Simulation expected performance barnes hut

n-body Simulation expected performance barnes hut - java

I made a 2D n-body simulation using brute force at first, but then following http://arborjs.org/docs/barnes-hut this I've implemented a Barnes-Hut approximation algorithm. However this didn't give me the effect I was looking for.
Ex:
Barnes-Hut -> 2000 Bodies; frametime avg. 32 ms and 5000; 164 ms
Brute force -> 2000 Bodies; frametime avg. 31 ms and 5000; 195 ms
These values are with rendering turned off.
Am I correct to assume that I haven't correctly implemented the algorithm and am thus not getting a substantial increase in performance?
Theta is currently set to s/d < 0.5. Changing this value to e.g. 1 does increase performance, but it's quite obvious why this isn't preferred.
Single threaded
My code along general lines:
while(!close)
{
long newTime = System.currentTimeMillis();
long frameTime = newTime-currentTime;
System.out.println(frameTime);
currentTime = newTime;
update the bodies
}
Within the function that updates the bodies:
first insert all bodies into the quadtree with all its subnodes
for all bodies
{
compute the physics using Barnes-Hut which yields a net force per planet (doPhysics(body))
calculate instantaneous acceleration from net force
update the instantaneous velocity
}
The barneshut function:
doPhysics(body)
{
if(node is external (contains 1 body) and that body is not itself)
{
calculate the force between those two bodies
}else if(node is internal and s/d < 0.5)
{
create a pseudobody at the COM with the nodes total mass
calculate the force between the body and pseudobody
}else (if is it internal but s/d >= 0.5)
{
(this is where recursion comes in)
doPhysics on same body but on the NorthEast subnode
doPhysics on same body but on the NorthWest subnode
doPhysics on same body but on the SouthEast subnode
doPhysics on same body but on the SouthWest subnode
}
}
Actually calculating the force:
calculateforce(body, otherbody)
{
if(body is not at exactly the same position (avoid division by 0))
{
calculate force using newtons law of gravitation in vector form
add the force to the bodies' net force in this frame
}
}

Your code is still incomplete (read on SSCCEs ), and in-depth debugging of incomplete code is not the purpose of the site. However, this is how I would approach the next steps of figuring what, if anything, is wrong:
time only the function that you are worried about (let us call it barnes_hutt_update()); and not the whole update loop. Compare that to the equivalent, non-B-H code, and not to the whole update loop without B-H. This would result in a much more meaningful comparison.
you seem to have hard-coded s/d 0.5 into your algorithm. Leaving it as an argument, you should be able to notice speedups when it is set higher; and very similar performance to a naive, non-B-H implementation if set to 0. Speedup in B-H comes from evaluating less nodes (because far-away nodes are lumped together); do you know how many nodes you are managing to skip? No skipped nodes, no speedup. On the other hand, skipping nodes introduces small errors in the calculation - have you quantified those?
have a look at other implementations of B-H online. D3's force layout uses it internally, and is quite readable. There are multiple existing quadtree implementations. If you have built your own, they may be sub-optimal (or even buggy). Unless you are trying to learn-by-doing, it is always better to use a tested library instead of rolling your own.
slowdown may be due to the use of quadtrees, rather than from force addition itself. It would be useful to know how long building and updating the quadtree is taking, as compared to the B-H force aproximation itself -- because quadtrees are, in this case, pure overhead. B-H needs quadtrees, but the naive, non-B-H implementation does not. For small amounts of nodes, naive will be faster (but will get slower very fast as you add more and more). How does the performance scale as you add more and more bodies?
are you creating and discarding large amounts of objects? You can make your algorithm avoid the associated overhead (yes, lots of news + garbage collection can result in significant slowdowns) by using a memory pool.

Related

Java Micro-optimization: To cache or not to cache a System.currentTimeMillis() return value?

Simple question, which I've been wondering. Of the following two versions of code, which is better optimized? Assume that the time value resulting from the System.currentTimeMillis() call only needs to be pretty accurate, so caching should only be considered from a performance point of view.
This (with value caching):
long time = System.currentTimeMillis();
for (long timestamp : times) {
if (time - timestamp > 600000L) {
// Do something
}
}
Or this (no caching):
for (long timestamp : times) {
if (System.currentTimeMillis() - timestamp > 600000L) {
// Do something
}
}
I'm assuming System.currentTimeMillis() is already a very optimized and lightweight method call, but let's assume I'll be calling it many, many times in a short period.
How many values must the "times" collection/array contain to justify caching the return value of System.currentTimeMillis() in its own variable?
Is this better to do from a CPU or memory optimization point of view?

A long is basically free. A JVM with a JIT compiler can keep it in a register, and since it's a loop invariant can even optimize your loop condition to -timestamp < 600000L - time or timestamp > time - 600000L. i.e. the loop condition becomes a trivial compare between the iterator and a loop-invariant constant in a register.
So yes it's obviously more efficient to hoist a function call out of a loop and keep the result in a variable, especially when the optimizer can't do that for you, and especially when the result is a primitive type, not an Object.
Assuming your code is running on a JVM that JITs x86 machine code, System.currentTimeMillis() will probably include at least an rdtsc instruction and some scaling of that result1. So the cheapest it can possibly be (on Skylake for example) is a micro-coded 20-uop instruction with a throughput of one per 25 clock cycles (http://agner.org/optimize/).
If your // Do something is simple, like just a few memory accesses that usually hit in cache, or some simpler calculation, or anything else that out-of-order execution can do a good job with, that could be most of the cost of your loop. Unless each loop iterations typically takes multiple microseconds (i.e. time for thousands of instructions on a 4GHz superscalar CPU), hoisting System.currentTimeMillis() out of the loop can probably make a measurable difference. Small vs. huge will depend on how simple your loop body is.
If you can prove that hoisting it out of your loop won't cause correctness problems, then go for it.
Even with it inside your loop, your thread could still sleep for an unbounded length of time between calling it and doing the work for that iteration. But hoisting it out of the loop makes it more likely that you could actually observe this kind of effect in practice; running more iterations "too late".
Footnote 1: On modern x86, the time-stamp counter runs at a fixed rate, so it's useful as a low-overhead timesource, and less useful for cycle-accurate micro-benchmarking. (Use performance counters for that, or disable turbo / power saving so core clock = reference clock.)
IDK if a JVM would actually go to the trouble of implementing its own time function, though. It might just use an OS-provided time function. On Linux, gettimeofday and clock_gettime are implemented in user-space (with code + scale factor data exported by the kernel into user-space memory, in the VDSO region). So glibc's wrapper just calls that, instead of making an actual syscall.
So clock_gettime can be very cheap compared to an actual system call that switches to kernel mode and back. That can take at least 1800 clock cycles on Skylake, on a kernel with Spectre + Meltdown mitigation enabled.
So yes, it's hopefully safe to assume System.currentTimeMillis() is "very optimized and lightweight", but even rdtsc itself is expensive compared to some loop bodies.

In your case, method calls should always be hoisted out of loops.
System.currentTimeMillis() simply reads a value from OS memory, so it is very cheap (a few nanoseconds), as opposed to System.nanoTime(), which involves a call to hardware, and therefore can be orders of magnitude slower.

Distinguishing voiced/ unvoiced speech using zero-crossing rate

The zero-crossing rate is the rate of sign-changes along a signal, i.e., the rate at which the signal changes from positive to negative or back.
The zero-crossing rate Zn can be used to:
1-Distinguish voiced/unvoiced speech
2-Seperate unvoiced speech from static background noise.
It is a simple (yet effective) way to distinguish between
voiced and unvoiced speech regions:
• Voiced region: lower zero-crossing rate
• Unvoiced region: higher zero-crossing rate
and here is the code i am using:
public double evaluate(){
int numZC=0;
int size=signals.length;
for (int i=0; i<size-1; i++){
if((signals[i]>=0 && signals[i+1]<0) || (signals[i]<0 && signals[i+1]>=0)){
numZC++;
}
}
return numZC/lengthInSecond;
}
MY questions are:
1- My goal of using zero crossing is to eliminate the unvoiced part of the signal,,, and this code gives back the ZERO-CROSSING RATE. SO how will i do that?!
2- How will i know how much is a "low" zero-crossing rate and how much is a "high" zero-crossing rate???

The fundamental problem is that while you've found a way to calculate the zero crossing rate of a block of samples, you can't use that to distinguish sounds within that block because it only gives you one number that describes your entire block.
One potential solution is to divide your big block into small blocks, and then work on those. If you do that, you will soon find that your small blocks, which you made arbitrarily, don't fit into neat categories of voiced and unvoiced, and simply removing one block or setting a block's volume to zero will leave you with "choppy" sounds or even harsh clicking sounds, and won't divide the parts of speech as cleanly as you like.
This may be a worthwhile point to start with, because it's closer to your existing code, but it won't work out in the long run, unless you are just looking to do something rough (in which case, this might be good enough!).
To resolve this, you may want to consider calculating an "instantaneous zero crossing rate"1 that updates the Zr for each sample.
My goal of using zero crossing is to eliminate the unvoiced part of the signal,,, and this code gives back the ZERO-CROSSING RATE. SO how will i do that?! It's not clear what you want. What do you mean by "eliminate"? Do you want silence or do you want to skip those sections? For silence, simply replace the unwanted sections with zero. To skip, simply remove those samples. Of course, you will still end up with clicks and pops, but I assume you know how to get rid of that. If not, maybe you can read up on linear interpolation. Keep in mind that you will almost certainly have to apply some heuristics like "don't remove any sections that are smaller than n samples".
How will i know how much is a "low" zero-crossing rate and how much is a "high" zero-crossing rate??? I would guess a good threshold will be roughly around 400Hz, but speech is not my specialty. Moreover it will vary a bit by speaker and possibly by language and other factors. I suggest you make some samples and see for yourself.
1 this name is a bit misleading and you could say "there's no such thing as an instantaneous zero crossing rate". I'm not here to argue that; rather I want to use that phrase because it expresses what I mean and I hope you understand it. Suffice it to say you should do your best to update Zr as often as you can. eg. something like this:
int lastSign = 0;
int lastCrossing = 0;
float nextZeroCrossing( float newSample ) {
int thisSign = newSample > 0 ? 1 : -1 ;
if( thisSign != lastSign ) {
lastSign = thisSign;
//zero crossing has happened. Update our estimate of Zr using lastCrossing and return that
} else {
++lastCrossing;
//zero crossing has not happened. Return existing Zr
}
}
You may want to "smooth" the output of nextZeroCrossing(), as it will tend to jump around a lot. A simple exponential or moving average filter will work great.

checking a value for reset value before resetting it - performance impact?

I have a variable that gets read and updated thousands of times a second. It needs to be reset regularly. But "half" the time, the value is already the reset value. Is it a good idea to check the value first (to see if it needs resetting) before resetting (a write operaion), or I should just reset it regardless? The main goal is to optimize the code for performance.
To illustrate:
Random r = new Random();
int val = Integer.MAX_VALUE;
for (int i=0; i<100000000; i++) {
if (i % 2 == 0)
val = Integer.MAX_VALUE;
else
val = r.nextInt();
if (val != Integer.MAX_VALUE) //skip check?
val = Integer.MAX_VALUE;
}
I tried to use the above program to test the 2 scenarios (by un/commenting the 2nd "if" line), but any difference is masked by the natural variance of the run duration time.
Thanks.

Don't check it.
It's more execution steps = more cycles = more time.
As an aside, you are breaking one of the basic software golden rules: "Don't optimise early". Unless you have hard evidence that this piece if code is a performance problem, you shouldn't be looking at it. (Note that doesn't mean you code without performance in mind, you still follow normal best practice, but you don't add any special code whose only purpose is "performance related")

The check has no actual performance impact. We'd be talking about a single clock cycle or something, which is usually not relevant in a Java program (as hard-core number crunching usually isn't done in Java).
Instead, base the decision on readability. Think of the maintainer who's going to change this piece of code five years on.
In the case of your example, using my rationale, I would skip the check.

Most likely the JIT will optimise the code away because it doesn't do anything.
Rather than worrying about performance, it is usually better to worry about what it
simpler to understand
cleaner to implement
In both cases, you might remove the code as it doesn't do anything useful and it could make the code faster as well.
Even if it did make the code a little slower it would be very small compared to the cost of calling r.nextInt() which is not cheap.

What is the fastest way to compute an epsilon closure?

I am working on a program to convert Non-deterministic finite state automata (NFAs) to Deterministic finite state automata (DFAs). To do this, I have to compute the epsilon closure of every state in the NFA that has an epsilon transition. I have already figured out a way to do this, but I always assume that the first thing I think of is usually the least efficient way to do something.
Here is an example of how I would compute a simple epsilon closure:
Input strings for transition function: format is startState, symbol = endState
EPS is an epsilon transition
1, EPS = 2
Results in the new state { 12 }
Now obviously this is a very simple example. I would need to be able to compute any number of epsilon transitions from any number of states. To this end, my solution is a recursive function that computes the epsilon closure on the given state by looking at the state it has an epsilon transition into. If that state has (an) epsilon transition(s) then the function is called recursively within a for loop for as many epsilon transitions as it has. This will get the job done but probably isn't the fastest way. So my question is this: what is the fastest way to compute an epsilon closure in Java?

Depth first search (or breadth first search - doesn't really matter) over the graph whose edges are your epilson transitions. So in other words, your solution is optimal provided you efficiently track which states you've already added to the closure.

JFLAP does this. You can see their source - specifically ClosureTaker.java. It's a depth-first search (which is what Peter Taylor suggested), and since JFLAP uses it I assume that's the near-optimal solution.

Did you look into an algorithm book? But I doubt you'll find a significantly better approach. But the actual performance of this algorithm may very well depend on the concrete data structure you use to implement your graph. And you can share work, depending on the order you simplify your graph. Think about subgraphs which are epsilon connected and are referenced from two different nodes.
I am not sure whether this can be done in an optimal way, or whether you have to resort to some heuristics.
Scan the the literature on algorithms.

Just so that people looking only for the specific snippet of code referenced by #Xodarap 's answer don't find themselves in the need of downloading both the source code and an application to view the code of the jar file, I took the liberty to attach said snippet.
public static State[] getClosure(State state, Automaton automaton) {
List<State> list = new ArrayList<>();
list.add(state);
for (int i = 0; i < list.size(); i++) {
state = (State) list.get(i);
Transition transitions[] = automaton.getTransitionsFromState(state);
for (int k = 0; k < transitions.length; k++) {
Transition transition = transitions[k];
LambdaTransitionChecker checker = LambdaCheckerFactory
.getLambdaChecker(automaton);
/** if lambda transition */
if (checker.isLambdaTransition(transition)) {
State toState = transition.getToState();
if (!list.contains(toState)) {
list.add(toState);
}
}
}
}
return (State[]) list.toArray(new State[0]);
}
It goes without saying that all credit goes to #Xodarap and the JFLAP project.

How can I code Java to allow SSE use and bounds-check elimination (or other advanced optimizations)?

The Situation:
I'm optimizing a pure-java implementation of the LZF compression algorithm, which involves a lot of byte[] access and basic int mathematics for hashing and comparison. Performance really matters, because the goal of the compression is to reduce I/O requirements. I am not posting code because it isn't cleaned up yet, and may be restructured heavily.
The Questions:
How can I write my code to allow it to JIT-compile to a form using faster SSE operations?
How can I structure it so the compiler can easily eliminate array bounds checks?
Are there any broad references about the relative speed of specific math operations (how many increments/decrements does it take to equal a normal add/subtract, how fast is shift-or vs. an array access)?
How can I work on optimizing branching -- is it better to have numerous conditional statements with short bodies, or a few long ones, or short ones with nested conditions?
With current 1.6 JVM, how many elements must be copied before System.arraycopy beats a copying loop?
What I've already done:
Before I get attacked for premature optimization: the basic algorithm is already excellent, but the Java implementation is less than 2/3 the speed of equivalent C. I've already replaced copying loops with System.arraycopy, worked on optimizing loops and eliminated un-needed operations.
I make heavy use of bit twiddling and packing bytes into ints for performance, as well as shifting & masking.
For legal reasons, I can't look at implementations in similar libraries, and existing libraries have too restrictive license terms to use.
Requirements for a GOOD (accepted) answer:
Unacceptable answers: "this is faster" without an explanation of how much AND why, OR hasn't been tested with a JIT compiler.
Borderline answers: have not been tested with anything before Hotspot 1.4
Basic answers: will provide a general rule and explanation of why it is faster at the compiler level, and roughly how much faster
Good answers: include a couple of samples of code to demonstrate
Excellent answers: have benchmarks with both JRE 1.5 and 1.6
PERFECT answer: Is by someone who worked on the HotSpot compiler, and can fully explain or reference the conditions for an optimization to be used, and how much faster it typically is. Might include java code and sample assembly code generated by HotSpot.
Also: if anyone has links detailing the guts of Hotspot optimization and branching performance, those are welcome. I know enough about bytecode that a site analyzing performance at a bytecode rather than sourcecode level would be helpful.
(Edit) Partial Answer: Bounds-Check Ellimination:
This is taken from supplied link to the HotSpot internals wiki at: https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination
HotSpot will eliminate bounds checks in all for loops with the following conditions:
Array is loop invariant (not reallocated within the loop)
Index variable has a constant stride (increases/decreases by constant amount, in only one spot if possible)
Array is indexed by a linear function of the variable.
Example: int val = array[index*2 + 5]
OR: int val = array[index+9]
NOT: int val = array[Math.min(var,index)+7]
Early version of code:
This is a sample version. Do not steal it, because it is an unreleased version of code for the H2 database project. The final version will be open source. This is an optimization upon the code here: H2 CompressLZF code
Logically, this is identical to the development version, but that one uses a for(...) loop to step through input, and an if/else loop for different logic between literal and backreference modes. It reduces array access and checks between modes.
public int compressNewer(final byte[] in, final int inLen, final byte[] out, int outPos){
int inPos = 0;
// initialize the hash table
if (cachedHashTable == null) {
cachedHashTable = new int[HASH_SIZE];
} else {
System.arraycopy(EMPTY, 0, cachedHashTable, 0, HASH_SIZE);
}
int[] hashTab = cachedHashTable;
// number of literals in current run
int literals = 0;
int future = first(in, inPos);
final int endPos = inLen-4;
// Loop through data until all of it has been compressed
while (inPos < endPos) {
future = (future << 8) | in[inPos+2] & 255;
// hash = next(hash,in,inPos);
int off = hash(future);
// ref = possible index of matching group in data
int ref = hashTab[off];
hashTab[off] = inPos;
off = inPos - ref - 1; //dropped for speed
// has match if bytes at ref match bytes in future, etc
// note: using ref++ rather than ref+1, ref+2, etc is about 15% faster
boolean hasMatch = (ref > 0 && off <= MAX_OFF && (in[ref++] == (byte) (future >> 16) && in[ref++] == (byte)(future >> 8) && in[ref] == (byte)future));
ref -=2; // ...EVEN when I have to recover it
// write out literals, if max literals reached, OR has a match
if ((hasMatch && literals != 0) || (literals == MAX_LITERAL)) {
out[outPos++] = (byte) (literals - 1);
System.arraycopy(in, inPos - literals, out, outPos, literals);
outPos += literals;
literals = 0;
}
//literal copying split because this improved performance by 5%
if (hasMatch) { // grow match as much as possible
int maxLen = inLen - inPos - 2;
maxLen = maxLen > MAX_REF ? MAX_REF : maxLen;
int len = 3;
// grow match length as possible...
while (len < maxLen && in[ref + len] == in[inPos + len]) {
len++;
}
len -= 2;
// short matches write length to first byte, longer write to 2nd too
if (len < 7) {
out[outPos++] = (byte) ((off >> 8) + (len << 5));
} else {
out[outPos++] = (byte) ((off >> 8) + (7 << 5));
out[outPos++] = (byte) (len - 7);
}
out[outPos++] = (byte) off;
inPos += len;
//OPTIMIZATION: don't store hashtable entry for last byte of match and next byte
// rebuild neighborhood for hashing, but don't store location for this 3-byte group
// improves compress performance by ~10% or more, sacrificing ~2% compression...
future = ((in[inPos+1] & 255) << 16) | ((in[inPos + 2] & 255) << 8) | (in[inPos + 3] & 255);
inPos += 2;
} else { //grow literals
literals++;
inPos++;
}
}
// write out remaining literals
literals += inLen-inPos;
inPos = inLen-literals;
if(literals >= MAX_LITERAL){
out[outPos++] = (byte)(MAX_LITERAL-1);
System.arraycopy(in, inPos, out, outPos, MAX_LITERAL);
outPos += MAX_LITERAL;
inPos += MAX_LITERAL;
literals -= MAX_LITERAL;
}
if (literals != 0) {
out[outPos++] = (byte) (literals - 1);
System.arraycopy(in, inPos, out, outPos, literals);
outPos += literals;
}
return outPos;
}
Final edit:
I've marked the best answer so far as accepted, since the deadline is nearly up. Since I took so long before deciding to post code, I will continue to upvote and respond to comments where possible. Apologies if the code is messy: this represented code in development, not polished up for committing.

Not a full answer, I simply don't have time to do the detailed benchmarks your question needs but hopefully useful.
Know your enemy
You are targeting a combination of the JVM (in essence the JIT) and the underlying CPU/Memory subsystem. Thus "This is faster on JVM X" is not likely to be valid in all cases as you move into more aggressive optimisations.
If your target market/application will largely run on a particular architecture you should consider investing in tools specific to it.
* If your performance on x86 is the critical factor then intel's VTune is excellent for drilling down into the sort of jit output analysis you describe.
* The differences between 64 bit and 32 bit JITs can be considerable, especially on x86 platforms where calling conventions can change and enregistering opportunities are very different.
Get the right tools
You would likely want to get a sampling profiler. The overhead of instrumentation (and the associated knock on on things like inlining, cache pollution and code size inflation) for your specific needs would be far too great. The intel VTune analyser can actually be used for Java though the integration is not so tight as others.
If you are using the sun JVM and are happy only knowing what the latest/greatest version is doing then the options available to investigate the output of the JIT are considerable if you know a bit of assembly.
This article details some interesting analysis using this functionality
Learn from other implementations
The Change history change history indicates that previous inline assembly was in fact counter productive and that allowing the compiler to take total control of the output (with tweaks in code rather than directives in assembly) yielded better results.
Some specifics
Since LZF is, in an efficient unmanaged implementation on modern desktop CPUS, largely memory bandwidth limited (hence it being compered to the speed of an unoptimised memcpy) you will need you code to remain entirely within level 1 cache.
As such any static fields you cannot make into constants should be placed within the same class as these values will often be placed within the same area of memory devoted to the vtables and meta data associated with classes.
Object allocations which cannot be trapped by Escape Analysis (only in 1.6 onwards) will need to be avoided.
The c code makes aggressive use of loop unrolling. However the performance of this on older (1.4 era) VM's is heavily dependant on the mode the JVM is in. Apparently latter sun jvm versions are more aggressive at inlining and unrolling, especially in server mode.
The prefetch instrctions generated by the JIT can make all the difference on code like this which is near memory bound.
"It's coming straight for us"
Your target is moving, and will continue to. Again Marc Lehmann's previous experience:
default HLOG size is now 15 (cpu caches have increased)
Even minor updates to the jvm can involve significant compiler changes
6544668 Don't vecorized array operations that can't be aligned at runtime.
6536652 Implement some superword (SIMD) optimizations
6531696 don't use immediate 16-bits value store to memory on Intel cpus
6468290 Divide and allocate out of eden on a per cpu basis
Captain Obvious
Measure, Measure, Measure. IF you can get your library to include (in a separate dll) a simple and easy to execute benchmark that logs the relevant information (vm version, cpu, OS, command line switches etc) and makes this simple to send back to you you will increase your coverage, best of all you'll cover those people using it that care.

As far as bounds check elimination is concerned, I believe the new JDK will already include an improved algorithm that eliminates it, whenever it's possible. These are the two main papers on this subject:
V. Mikheev, S. Fedoseev, V. Sukharev, N. Lipsky. 2002
Effective Enhancement of Loop Versioning in Java. Link. This paper is from the guys at Excelsior, who implemented the technique in their Jet JVM.
Würthinger, Thomas, Christian Wimmer, and Hanspeter Mössenböck. 2007. Array Bounds Check Elimination for the Java HotSpot Client Compiler. PPPJ. Link. Slightly based on the above paper, this is the implementation that I believe will be included in the next JDK. The achieved speedups are also presented.
There is also this blog entry, which discusses one of the papers superficially, and also presents some benchmarking results, not only for arrays but also for arithmetic in the new JDK. The comments of the blog entry are also very interesting, since the authors of the above papers present some very interesting comments and discuss arguments. Also, there are some pointers to other similar blog posts on this subject.
Hope it helps.

It's rather unlikely that you need to help the JIT compiler much with optimizing a straightforward number crunching algorithm like LZW. ShuggyCoUk mentioned this, but I think it deserves extra attention:
The cache-friendliness of your code will be a big factor.
You have to reduce the size of your woking set and improve data access locality as much as possible. You mention "packing bytes into ints for performance". This sounds like using ints to hold byte values in order to have them word-aligned. Don't do that! The increased data set size will outweigh any gains (I once converted some ECC number crunching code from int[] to byte[] and got a 2x speed-up).
On the off chance that you don't know this: if you need to treat some data as both bytes and ints, you don't have to shift and |-mask it - use ByteBuffer.asIntBuffer() and related methods.
With current 1.6 JVM, how many
elements must be copied before
System.arraycopy beats a copying loop?
Better do the benchmark yourself. When I did it way back when in Java 1.3 times, it was somewhere around 2000 elements.

Lots of answers so far, but couple of additional things:
Measure, measure, measure. As much as most Java developers warn against micro-benchmarking, make sure you compare performance between changes. Optimizations that do not result in measurable improvements are generally not worth keeping (of course, sometimes it's combination of things, and that gets trickier)
Tight loops matter as much with Java as with C (and ditto with variable allocations -- although you don't directly control it, HotSpot will eventually have to do it). I manage to practically double the speed of UTF-8 decoding by rearranging tight loop for handling single-byte case (7-bit ascii) as tight(er) inner loop, leaving other cases out.
Do not underestimate cost of allocating and/or clearing large arrays -- if you want lzf encoding/decoding to be faster for small/medium chunks too (not just megabyte sized), keep in mind that ALL allocations of byte[]/int[] are somewhat costly; not because of GC, but because JVM MUST clear the space.
H2 implementation has also been optimized quite a bit (for example: it does not clear the hash array any more, this often makes sense); and I actually helped modify it for use in another Java project. My contribution was mostly just changing it do be more optimal for non-streaming case, but that did not touch the tight encode/decode loops.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.