I am wondering why allocating a 2D int array in one go (new int[50][2]) performs worse than allocating it separately, that is, executing new int[50][] first and then allocating each new int[2] one by one. Here is a non-professional benchmark:
import com.google.common.base.Stopwatch; // Guava's Stopwatch

public class AllocationSpeed {
    private static final int ITERATION_COUNT = 1000000;

    public static void main(String[] args) {
        new AllocationSpeed().run();
    }

    private void run() {
        measureSeparateAllocation();
        measureAllocationAtOnce();
    }

    private void measureAllocationAtOnce() {
        Stopwatch stopwatch = Stopwatch.createStarted();
        for (int i = 0; i < ITERATION_COUNT; i++) {
            allocateAtOnce();
        }
        stopwatch.stop();
        System.out.println("Allocate at once: " + stopwatch);
    }

    private int allocateAtOnce() {
        int[][] array = new int[50][2];
        return array[10][1];
    }

    private void measureSeparateAllocation() {
        Stopwatch stopwatch = Stopwatch.createStarted();
        for (int i = 0; i < ITERATION_COUNT; i++) {
            allocateSeparately();
        }
        stopwatch.stop();
        System.out.println("Separate allocation: " + stopwatch);
    }

    private int allocateSeparately() {
        int[][] array = new int[50][];
        for (int i = 0; i < array.length; i++) {
            array[i] = new int[2];
        }
        return array[10][1];
    }
}
I tested on 64-bit Linux; these are the results with different 64-bit Oracle Java versions:
1.6.0_45-b06:
Separate allocation: 401.0 ms
Allocate at once: 1.673 s
1.7.0_45-b18
Separate allocation: 408.7 ms
Allocate at once: 1.448 s
1.8.0-ea-b115
Separate allocation: 380.0 ms
Allocate at once: 1.251 s
Just out of curiosity, I tried it with OpenJDK 7 as well (where the difference is smaller):
Separate allocation: 424.3 ms
Allocate at once: 1.072 s
For me this is quite counter-intuitive; I would expect allocating at once to be faster.
Absolutely unbelievable. A benchmark source might suffer from optimizations, GC and JIT, but this?
Looking at the java byte code instruction set:
anewarray (+ 2 bytes indirect class index) for arrays of object classes (a = address)
newarray (+ 1 byte for the primitive type) for arrays of primitive types
multianewarray (+ 2 bytes indirect class index) for multidimensional arrays
This leads one to suspect that multianewarray is suboptimal for primitive types.
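For instance, disassembling the two allocation methods with javap -c makes the difference visible (output abridged; the constant-pool indices are illustrative and will differ in a real run):

// allocateAtOnce: one instruction allocates the outer array and all rows
bipush         50
iconst_2
multianewarray #4,  2        // class "[[I"

// allocateSeparately: the outer array, then one newarray per row in the loop
bipush         50
anewarray      #5            // class "[I"
...
newarray       int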
Before looking further, I hope someone knows where we are misled.
The latter code's inner loop (with a newarray) is hit more times than the former code's multianewarray, so it probably hits C2 and gets subjected to escape analysis sooner. (Once that happens, the rows created by the latter code are allocated on the stack, which is faster than the heap and reduces the workload for the garbage collector.)
It's also possible that these JDK versions didn't actually do escape analysis on rows from a multianewarray, since a multidimensional array is more likely to exceed the size limit for a stack array.
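One way to probe the escape-analysis hypothesis is to toggle it (a sketch; -XX:+/-DoEscapeAnalysis is a standard HotSpot flag, on by default in these versions):

java -XX:+DoEscapeAnalysis AllocationSpeed
java -XX:-DoEscapeAnalysis AllocationSpeed

If the separate-allocation advantage shrinks or disappears with escape analysis disabled, that supports the explanation above.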
Related
Can anyone explain why the following recursive method is faster than the iterative one (both do string concatenation)? Isn't the iterative approach supposed to beat the recursive one? Plus, each recursive call adds a new frame on top of the stack, which can be very space-inefficient.
private static void string_concat(StringBuilder sb, int count) {
    if (count >= 9999) return;
    string_concat(sb.append(count), count + 1);
}

public static void main(String[] arg) {
    long s = System.currentTimeMillis();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 9999; i++) {
        sb.append(i);
    }
    System.out.println(System.currentTimeMillis() - s);

    s = System.currentTimeMillis();
    string_concat(new StringBuilder(), 0);
    System.out.println(System.currentTimeMillis() - s);
}
I ran the program multiple times, and the recursive one always ends up 3-4 times faster than the iterative one. What could be the main reason that makes the iterative one slower?
See my comments.
Make sure you learn how to properly microbenchmark. You should be timing many iterations of both and averaging these for your times. Aside from that, you should make sure the VM isn't giving the second an unfair advantage by not compiling the first.
In fact, the default HotSpot compilation threshold (configurable via -XX:CompileThreshold) is 10,000 invocations, which might explain the results you see here. HotSpot doesn't really do any tail-call optimization, so it's quite strange that the recursive solution is faster. It's quite plausible that StringBuilder.append is compiled to native code primarily for the recursive solution.
I decided to rewrite the benchmark and see the results for myself.
public final class AppendMicrobenchmark {

    static void recursive(final StringBuilder builder, final int n) {
        if (n > 0) {
            recursive(builder.append(n), n - 1);
        }
    }

    static void iterative(final StringBuilder builder) {
        for (int i = 10000; i >= 0; --i) {
            builder.append(i);
        }
    }

    public static void main(final String[] argv) {
        /* warm-up */
        for (int i = 200000; i >= 0; --i) {
            new StringBuilder().append(i);
        }

        /* recursive benchmark */
        long start = System.nanoTime();
        for (int i = 1000; i >= 0; --i) {
            recursive(new StringBuilder(), 10000);
        }
        // total ns / 1e6 is roughly the average microseconds per trial (1000 trials)
        System.out.printf("recursive: %.2fus\n", (System.nanoTime() - start) / 1000000D);

        /* iterative benchmark */
        start = System.nanoTime();
        for (int i = 1000; i >= 0; --i) {
            iterative(new StringBuilder());
        }
        System.out.printf("iterative: %.2fus\n", (System.nanoTime() - start) / 1000000D);
    }
}
Here are my results...
C:\dev\scrap>java AppendMicrobenchmark
recursive: 405.41us
iterative: 313.20us
C:\dev\scrap>java -server AppendMicrobenchmark
recursive: 397.43us
iterative: 312.14us
These are times for each approach averaged over 1000 trials.
Essentially, the problems with your benchmark are that it doesn't average over many trials (law of large numbers) and that it is highly dependent on the ordering of the individual benchmarks. The original result I got for yours:
C:\dev\scrap>java StringBuilderBenchmark
80
41
This made very little sense to me. Recursion on the HotSpot VM is more than likely not going to be as fast as iteration, because as of yet it does not implement any sort of tail-call optimization of the kind used for functional languages.
Now, the funny thing that happens here is that the default HotSpot JIT compilation threshold is 10,000 invocations. Your iterative benchmark will more than likely be executing for the most part before append is compiled. On the other hand, your recursive approach should be comparatively fast, since it will more than likely enjoy a compiled append. To eliminate this influence on the results, I passed -XX:CompileThreshold=0 and found...
C:\dev\scrap>java -XX:CompileThreshold=0 StringBuilderBenchmark
8
8
So, when it comes down to it, they're both roughly equal in speed. Note, however, that the iterative one appears to be a bit faster if you average with higher precision. Order might still make a difference in my benchmark, too, as the latter benchmark has the advantage that the VM has collected more statistics for its dynamic optimizations.
My mini benchmark:
import java.math.*;
import java.util.*;
import java.io.*;

public class c {
    static Random rnd = new Random();

    public static String addDigits(String a, int n) {
        if (a == null) return null;
        if (n <= 0) return a;
        for (int i = 0; i < n; i++)
            a += rnd.nextInt(10);
        return a;
    }

    public static void main(String[] args) throws IOException {
        int n = 10000; // number of iterations
        int k = 10;    // number of digits added at each iteration
        BigInteger a;
        BigInteger b;
        String as = "";
        String bs = "";
        as += rnd.nextInt(9) + 1;
        bs += rnd.nextInt(9) + 1;
        a = new BigInteger(as);
        b = new BigInteger(bs);
        FileWriter fw = new FileWriter("c.txt");
        long t1 = System.nanoTime();
        a.multiply(b);
        long t2 = System.nanoTime();
        //fw.write("1," + (t2 - t1) + "\n");
        if (k > 0) {
            as = addDigits(as, k - 1);
            bs = addDigits(bs, k - 1); // was addDigits(as, ...), an apparent typo
        }
        for (int i = 0; i < n; i++) {
            a = new BigInteger(as);
            b = new BigInteger(bs);
            t1 = System.nanoTime();
            a.multiply(b);
            t2 = System.nanoTime();
            fw.write(((i + 1) * k) + "," + (t2 - t1) + "\n");
            if (i < n - 1) {
                as = addDigits(as, k);
                bs = addDigits(bs, k); // same typo fixed here
            }
            System.out.println((i + 1) * k);
        }
        fw.close();
    }
}
It measures the multiplication time of n-digit BigIntegers.
Result:
You can easily see the trend, but why is there so much noise above 50000 digits?
Is it because of the garbage collector, or is there something else that affects my results?
When performing the test, no other applications were running.
Result from a test with only odd digits. The test was shorter (n=1000, k=100).
Odd digits (n=10000, k=10).
As you can see, there is huge noise between 65000 and 70000. I wonder why...
Odd digits (n=10000, k=10), System.gc() every 1000 iterations.
This results in noise between 50000-70000.
I also suspect this is a JVM warmup effect. Not warmup involving classloading or the JIT compiler, but warmup of the heap.
Put a (java) loop around the whole benchmark, and run it a number of times. (If this gives you the same graphs as before ... you will have evidence that this is not a warmup effect. Currently you don't have any empirical evidence one way or the other.)
Another possibility is that the noise is caused by your benchmark's interactions with the OS and/or other stuff running on the machine.
You are writing your timing data to an unbuffered stream. That means LOTS of syscalls, and (potentially) lots of fine-grained disk writes. (A buffering sketch follows this list.)
You are making LOTS of calls to nanoTime(), and that might introduce noise.
If something else is running on your machine (e.g. you are web browsing) that will slow down your benchmark for a bit and introduce noise.
There could be competition over physical memory ... if you've got too much running on your machine for the amount of RAM.
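On the unbuffered-stream point above, a minimal sketch of the fix (java.io.BufferedWriter is standard; this is a drop-in change to the benchmark's FileWriter line, with the variable typed as Writer):

import java.io.*;

// Buffering batches the many small write() calls into far fewer syscalls.
Writer fw = new BufferedWriter(new FileWriter("c.txt"));
// ... the fw.write(...) calls in the loop stay unchanged ...
fw.close(); // close() flushes the remaining buffered data first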
Finally, a certain amount of noise is inevitable, because each of those multiply calls generates garbage, and the garbage collector is going to need to work to deal with it.
Finally finally, if you manually run the garbage collector (or increase the heap size) to "smooth out" the data points, what you are actually doing is concealing one of the costs of the multiply calls. The resulting graph looks nice, but it is misleading:
The noisiness reflects what will happen in real life.
The true cost of the multiply actually includes the amortized cost of running the GC to deal with the garbage generated by the call.
To get measurements that reflect the way BigInteger behaves in real life, you need to run the test a large number of times, calculate average times, and fit a curve to the average data points.
Remember, the real aim of the game is to get scientifically valid results ... not a smooth curve.
If you do a microbenchmark, you must "warm up" the JVM first to let the JIT optimize the code, and then you can measure the performance. Otherwise you are measuring the work done by the JIT and that can change the result on each run.
The "noise" happens probably because the cache of the CPU is exceeded and the performance starts degrading.
Is Java's System.arraycopy() efficient for small arrays, or does the fact that it's a native method make it likely to be substantially less efficient than a simple loop and a function call?
Do native methods incur additional performance overhead for crossing some kind of Java-system bridge?
Expanding a little on what Sid has written, it's very likely that System.arraycopy is just a JIT intrinsic; meaning that when code calls System.arraycopy, it will most probably be calling a JIT-specific implementation (once the JIT tags System.arraycopy as being "hot") that is not executed through the JNI interface, so it doesn't incur the normal overhead of native methods.
In general, executing native methods does have some overhead (going through the JNI interface, also some internal JVM operations cannot happen when native methods are being executed). But it's not because a method is marked as "native" that you're actually executing it using JNI. The JIT can do some crazy things.
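One way to observe this (a sketch; -XX:+PrintInlining is a HotSpot diagnostic flag, the exact output format varies by JVM version, and MyBenchmark is a placeholder class name):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MyBenchmark

Once the calling method is JIT-compiled, the inlining log contains a line similar to:

java.lang.System::arraycopy (0 bytes)   (intrinsic)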
Easiest way to check is, as has been suggested, writing a small benchmark, being careful with the normal caveats of Java microbenchmarks (warm up the code first, avoid code with no side-effects since the JIT just optimizes it as a no-op, etc).
Here is my benchmark code:
public void test(int copySize, int copyCount, int testRep) {
    System.out.println("Copy size = " + copySize);
    System.out.println("Copy count = " + copyCount);
    System.out.println();
    for (int i = testRep; i > 0; --i) {
        copy(copySize, copyCount);
        loop(copySize, copyCount);
    }
    System.out.println();
}

public void copy(int copySize, int copyCount) {
    int[] src = newSrc(copySize + 1);
    int[] dst = new int[copySize + 1];
    long begin = System.nanoTime();
    for (int count = copyCount; count > 0; --count) {
        System.arraycopy(src, 1, dst, 0, copySize);
        dst[copySize] = src[copySize] + 1;
        System.arraycopy(dst, 0, src, 0, copySize);
        src[copySize] = dst[copySize];
    }
    long end = System.nanoTime();
    System.out.println("Arraycopy: " + (end - begin) / 1e9 + " s");
}

public void loop(int copySize, int copyCount) {
    int[] src = newSrc(copySize + 1);
    int[] dst = new int[copySize + 1];
    long begin = System.nanoTime();
    for (int count = copyCount; count > 0; --count) {
        for (int i = copySize - 1; i >= 0; --i) {
            dst[i] = src[i + 1];
        }
        dst[copySize] = src[copySize] + 1;
        for (int i = copySize - 1; i >= 0; --i) {
            src[i] = dst[i];
        }
        src[copySize] = dst[copySize];
    }
    long end = System.nanoTime();
    System.out.println("Man. loop: " + (end - begin) / 1e9 + " s");
}

public int[] newSrc(int arraySize) {
    int[] src = new int[arraySize];
    for (int i = arraySize - 1; i >= 0; --i) {
        src[i] = i;
    }
    return src;
}
From my tests, calling test() with copyCount = 10000000 (1e7) or greater allows the warm-up to be achieved during the first copy/loop call, so testRep = 5 is enough. With copyCount = 1000000 (1e6), the warm-up needs at least 2 or 3 iterations, so testRep must be increased in order to obtain usable results.
With my configuration (CPU Intel Core 2 Duo E8500 @ 3.16GHz, Java SE 1.6.0_35-b10 and Eclipse 3.7.2) it appears from the benchmark that:
When copySize = 24, System.arraycopy() and the manual loop take almost the same time (sometimes one is very slightly faster than the other, other times it's the contrary).
When copySize < 24, the manual loop is faster than System.arraycopy() (slightly faster with copySize = 23, markedly faster with copySize < 5).
When copySize > 24, System.arraycopy() is faster than the manual loop (slightly faster with copySize = 25, the loop-time/arraycopy-time ratio increasing as copySize increases).
Note: I'm not a native English speaker; please excuse my grammar/vocabulary errors.
This is a valid concern. For example, in java.nio.DirectByteBuffer.put(byte[]), the author tries to avoid a JNI copy for a small number of elements:
// These numbers represent the point at which we have empirically
// determined that the average cost of a JNI call exceeds the expense
// of an element by element copy. These numbers may change over time.
static final int JNI_COPY_TO_ARRAY_THRESHOLD = 6;
static final int JNI_COPY_FROM_ARRAY_THRESHOLD = 6;
For System.arraycopy(), we can examine how the JDK itself uses it. For example, in ArrayList, System.arraycopy() is always used, never an element-by-element copy, regardless of length (even if it's 0). Since ArrayList is very performance-conscious, we can infer that System.arraycopy() is the most efficient way of copying an array regardless of length.
System.arraycopy uses a memmove operation for moving words and assembly for moving other primitive types in C behind the scenes, so it makes its best effort to copy as efficiently as it can.
Instead of relying on speculation and possibly outdated information, I ran some benchmarks using Caliper. In fact, Caliper comes with some examples, including a CopyArrayBenchmark that measures exactly this question! All you have to do is run:
mvn exec:java -Dexec.mainClass=com.google.caliper.runner.CaliperMain -Dexec.args=examples.CopyArrayBenchmark
My results are based on Oracle's Java HotSpot(TM) 64-Bit Server VM, 1.8.0_31-b13, running on a mid-2010 MacBook Pro (macOS 10.11.6 with an Intel Arrandale i7, 8 GiB RAM). I don't believe that it's useful to post the raw timing data. Rather, I'll summarize the conclusions with the supporting visualizations.
In summary:
Writing a manual for loop to copy each element into a newly instantiated array is never advantageous, even for arrays as short as 5 elements.
Arrays.copyOf(array, array.length) and array.clone() are both consistently fast. These two techniques are nearly identical in performance; which one you choose is a matter of taste.
System.arraycopy(src, 0, dest, 0, src.length) is almost as fast as Arrays.copyOf(array, array.length) and array.clone(), but not quite consistently so. (See the case for 50000 ints.) Because of that, and the verbosity of the call, I would recommend System.arraycopy() if you need fine control over which elements get copied where.
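For reference, a minimal sketch of the three techniques compared above (all standard JDK calls; the class name is illustrative):

import java.util.Arrays;

public class CopyDemo {
    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5};

        int[] a = Arrays.copyOf(src, src.length);   // consistently fast
        int[] b = src.clone();                      // nearly identical in performance
        int[] c = new int[src.length];
        System.arraycopy(src, 0, c, 0, src.length); // verbose, but offers fine control

        System.out.println(Arrays.equals(a, b) && Arrays.equals(b, c)); // prints: true
    }
}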
Here are the timing plots:
System.arraycopy() is executed natively anyway, so its performance is likely to be better than a loop's. A loop has to execute bytecodes, which incurs an overhead, while arraycopy should be a straight memcopy.
Native functions should be faster than JVM functions, since there is no VM overhead. However, for a lot of (>1000) very small (length < 10) arrays it might be slower.
I know there are other questions like this, but I'm a beginner and most of the code and questions were quite complicated. That's why I keep it as simple as possible. I come from an R background, but recently I wanted to learn more about Java threads. I went through several tutorials on the topic, and most of them boil down to the code I posted below. Note the code is not doing much, and I made it quite inefficient so the threads would run for a few seconds.
The main thing to notice is that on my machine the threads do not run much faster than the non-threaded run; with low values in the for loop in the run method, they are even sometimes slower. It could be because of my crappy hardware (only two cores), and with more cores one might see the threads go faster than the non-parallel version; I don't know. But what puzzles me most is that when I look at the system monitor while the program is running, both cores are used in both runs (parallel and non-parallel), but in the parallel version they run at nearly 100% while in the non-parallel one both run at 50-60%. Considering that both finish at the same time, the parallel version is a lot less efficient, because it uses more computing power to do the same job without even being faster.
To put it in a nutshell: what am I doing wrong? I thought I wrote the program not much differently than in the Java tutorial linked below. I run Linux Ubuntu with the Sun version of Java.
http://www.java2s.com/Tutorial/Java/0160__Thread/0020__Create-Thread.htm
import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        ArrayList<PermutateWord> words = new ArrayList<PermutateWord>();
        System.out.println(Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < Runtime.getRuntime().availableProcessors(); i++) {
            words.add(new PermutateWord("Christoph"));
        }

        System.out.println("Run as thread");
        long d = System.currentTimeMillis();
        for (PermutateWord w : words) {
            w.start();
        }
        for (PermutateWord w : words) {
            try {
                w.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        for (PermutateWord w : words) {
            System.out.println(w.getWord());
        }
        System.out.println(((double) (System.currentTimeMillis() - d)) / 1000 + "\n");

        System.out.println("No thread");
        d = System.currentTimeMillis();
        for (PermutateWord w : words) {
            w.run();
        }
        for (PermutateWord w : words) {
            System.out.println(w.getWord());
        }
        System.out.println(((double) (System.currentTimeMillis() - d)) / 1000 + "\n");
    }
}

class PermutateWord extends Thread {
    private String word;

    public PermutateWord(String word) {
        this.word = word;
    }

    public void run() {
        java.util.Random rand = new java.util.Random();
        for (int i = 0; i < 8000000; i++) {
            word = swap(word, rand.nextInt(word.length()), rand.nextInt(word.length()));
        }
    }

    private String swap(String word2, int r1, int r2) {
        char[] wordArray = word2.toCharArray();
        char c = wordArray[r1];
        wordArray[r1] = wordArray[r2];
        wordArray[r2] = c;
        return new String(wordArray);
    }

    public String getWord() {
        return word;
    }
}
Thanks in advance
Christoph
Most of the time is spent allocating and deallocating temporary strings, which has to be synchronized. The work that can be done in parallel is trivial, and multiple threads won't give you much gain.
Math.random() also has to be synchronized. You will get better results by creating a local java.util.Random for each thread:
java.util.Random rand = new java.util.Random();

public void run() {
    for (int i = 0; i < 8000000; i++) {
        word = swap(word, rand.nextInt(word.length()), rand.nextInt(word.length()));
    }
}
But you should really focus on optimizing the swap function. I'm not sure if it does what you want, but I'm sure it's very inefficient. + is an expensive operation on Strings: for every +, the JVM has to allocate a new String, which is slow and doesn't work well with multiple threads. If you just want to swap two characters, consider using a char[] instead of a String. It should be much easier and much faster.
edit:
private String swap(String word2, int r1, int r2) {
    char[] wordArray = word2.toCharArray();
    char c = wordArray[r1];
    wordArray[r1] = wordArray[r2];
    wordArray[r2] = c;
    return new String(wordArray);
}
This is much better. However, you are still doing two allocations: toCharArray() and new String both allocate memory. Because the rest of your program is very simple, those two allocations take 90% of your execution time. A sketch of the allocation-free char[] variant follows.
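A minimal sketch of that approach (a hypothetical rewrite, not from the original post): keep the word as a char[] for the whole loop and swap in place, so the hot loop performs no allocations at all.

class PermutateWordInPlace extends Thread {
    private final char[] word;

    PermutateWordInPlace(String w) {
        this.word = w.toCharArray();
    }

    public void run() {
        java.util.Random rand = new java.util.Random();
        for (int i = 0; i < 8000000; i++) {
            int r1 = rand.nextInt(word.length);
            int r2 = rand.nextInt(word.length);
            char c = word[r1];
            word[r1] = word[r2];
            word[r2] = c;
        }
    }

    public String getWord() {
        return new String(word); // one allocation, only when reading the result
    }
}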
I got a lot of mileage out of putting a Thread.sleep(1000) in the join loop. Empirically, java.util.Random.nextFloat() only bought me 10%. Even then, both parts ran in 16 seconds on an 8-core machine, suggesting it's serializing due to the synchronizations mentioned above. But, good grief, without the sleep it was running 10x slower.
Is there a way to create a 128-bit object in Java that can be bit-manipulated the same way as a long or an int? I want to do 32-bit shifts and I want to be able to do a bit OR operation on the whole 128-bit structure.
Here, I present to you... an old idea. Now it's awfully downgraded (no code enhancer, no nothing) to a simple 128-bit thingie that should nevertheless be super fast. What I truly want is a ByteBuffer-based array of C-like structs, fully usable in Java.
The main idea is allocating more than a single object at a time and using a pointer into the array. This greatly conserves memory, and the memory is allocated in a contiguous area, so there are fewer cache misses (always good).
I did some moderate testing (but the code is still mostly untested).
It allows basic operations like add, xor, or, and set/get on 128-bit numbers.
The standard rule applies, unfortunately: less documentation than expected.
Adding extra code for extra operations should be straightforward.
Here is the code; look at the main method for some usage. Cheers!
package bestsss.util;

import java.util.Random;

public class Bitz {
    final int[] array;

    private Bitz(int n) {
        array = new int[n << 2];
    }

    public int size() {
        return size(this.array);
    }

    private static int size(int[] array) {
        return array.length >> 2;
    }

    /**
     * Allocates n 128-bit elements. Use newIdx to create a pointer.
     * @param n
     * @return
     */
    public static Bitz allocate(int n) {
        return new Bitz(n);
    }

    /**
     * Main utility class - points to an index in the array.
     * @param idx
     * @return
     */
    public Idx newIdx(int idx) {
        return new Idx(array).set(idx);
    }

    public static class Idx {
        private static final long mask = 0xFFFFFFFFL;
        // don't make the fields final
        int idx;
        int[] array; // keep the ref. here, reduces the indirection

        Idx(int[] array) {
            this.array = array;
        }

        public Idx set(int idx) {
            if (Bitz.size(array) <= idx || idx < 0)
                throw new IndexOutOfBoundsException(String.valueOf(idx));
            this.idx = idx << 2;
            return this;
        }

        public int index() {
            return idx >> 2;
        }

        public Idx shl32() {
            final int[] array = this.array;
            int idx = this.idx;
            array[idx] = array[++idx];
            array[idx] = array[++idx];
            array[idx] = array[++idx];
            array[idx] = 0;
            return this;
        }

        public Idx shr32() {
            final int[] array = this.array;
            int idx = this.idx + 3;
            array[idx] = array[--idx];
            array[idx] = array[--idx];
            array[idx] = array[--idx];
            array[idx] = 0;
            return this;
        }

        public Idx or(Idx src) {
            final int[] array = this.array;
            int idx = this.idx;
            int idx2 = src.idx;
            final int[] array2 = src.array;
            array[idx++] |= array2[idx2++];
            array[idx++] |= array2[idx2++];
            array[idx++] |= array2[idx2++];
            array[idx++] |= array2[idx2++];
            return this;
        }

        public Idx xor(Idx src) {
            final int[] array = this.array;
            int idx = this.idx;
            int idx2 = src.idx;
            final int[] array2 = src.array;
            array[idx++] ^= array2[idx2++];
            array[idx++] ^= array2[idx2++];
            array[idx++] ^= array2[idx2++];
            array[idx++] ^= array2[idx2++];
            return this;
        }

        public Idx add(Idx src) {
            final int[] array = this.array;
            int idx = this.idx + 3;
            final int[] array2 = src.array;
            int idx2 = src.idx + 3;
            long l = 0;
            l += array[idx] & mask;
            l += array2[idx2--] & mask;
            array[idx--] = (int) (l & mask);
            l >>>= 32;
            l += array[idx] & mask;
            l += array2[idx2--] & mask;
            array[idx--] = (int) (l & mask);
            l >>>= 32;
            l += array[idx] & mask;
            l += array2[idx2--] & mask;
            array[idx--] = (int) (l & mask);
            l >>>= 32;
            l += array[idx] & mask;
            l += array2[idx2--];
            array[idx] = (int) (l & mask);
            // l >>>= 32;
            return this;
        }

        public Idx set(long high, long low) {
            final int[] array = this.array;
            int idx = this.idx;
            array[idx + 0] = (int) ((high >>> 32) & mask);
            array[idx + 1] = (int) ((high >>> 0) & mask);
            array[idx + 2] = (int) ((low >>> 32) & mask);
            array[idx + 3] = (int) ((low >>> 0) & mask);
            return this;
        }

        public long high() {
            final int[] array = this.array;
            int idx = this.idx;
            long res = (array[idx] & mask) << 32 | (array[idx + 1] & mask);
            return res;
        }

        public long low() {
            final int[] array = this.array;
            int idx = this.idx;
            long res = (array[idx + 2] & mask) << 32 | (array[idx + 3] & mask);
            return res;
        }

        // inefficient, but well
        public String toString() {
            return String.format("%016x-%016x", high(), low());
        }
    }

    public static void main(String[] args) {
        Bitz bitz = Bitz.allocate(256);
        Bitz.Idx idx = bitz.newIdx(0);
        Bitz.Idx idx2 = bitz.newIdx(2);
        System.out.println(idx.set(0, 0xf));
        System.out.println(idx2.set(0, Long.MIN_VALUE).xor(idx));
        System.out.println(idx.set(0, Long.MAX_VALUE).add(idx2.set(0, 1)));
        System.out.println("==");
        System.out.println(idx.add(idx)); // can add itself
        System.out.println(idx.shl32()); // left
        System.out.println(idx.shr32()); // and right
        System.out.println(idx.shl32()); // back left
        // w/ alloc
        System.out.println(idx.add(bitz.newIdx(4).set(0, Long.MAX_VALUE)));
        // self xor
        System.out.println(idx.xor(idx));
        // random xor
        System.out.println("===init random===");
        Random r = new Random(1112);
        for (int i = 0, s = bitz.size(); i < s; i++) {
            idx.set(i).set(r.nextLong(), r.nextLong());
            System.out.println(idx);
        }
        Idx theXor = bitz.newIdx(0);
        for (int i = 1, s = bitz.size(); i < s; i++) {
            theXor.xor(idx.set(i));
        }
        System.out.println("===XOR===");
        System.out.println(theXor);
    }
}
Three possibilities have been identified:
The BitSet class provides some of the operations that you need, but no "shift" method. To implement this missing method, you'd need to do something like this:
BitSet bits = new BitSet(128);
...
// shift left by 32 bits
for (int i = 0; i < 96; i++) {
    bits.set(i, bits.get(i + 32));
}
bits.set(96, 128, false); // toIndex is exclusive, so this clears bits 96..127
The BigInteger class provides all of the methods (more or less), but since BigInteger is immutable, it could result in an excessive object creation rate ... depending on how you use the bitsets. (There is also the issue that shiftLeft(32) won't chop off the leftmost bits ... but you can deal with this by using and() to mask out the bits at index 128 and higher, as sketched below.)
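A minimal sketch of that masking trick (standard java.math.BigInteger calls; the class and method names are illustrative):

import java.math.BigInteger;

public class Shift128 {
    // 2^128 - 1: keeps only the low 128 bits
    static final BigInteger MASK_128 =
            BigInteger.ONE.shiftLeft(128).subtract(BigInteger.ONE);

    static BigInteger shl32(BigInteger x) {
        // shiftLeft alone would let the value grow past 128 bits;
        // and() chops off everything at index 128 and higher
        return x.shiftLeft(32).and(MASK_128);
    }
}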
If performance is your key concern, implementing a custom class with 4 int or 2 long fields will probably give the best performance. (Which of the two is actually faster will depend on the hardware platform, the JVM, etc. I'd probably choose the long version because it will be simpler to code ... and only try to optimize further if profiling indicated that it was a potentially worthwhile activity.)
Furthermore, you can design the APIs to behave exactly as you require (modulo the constraints of the Java language). The downside is that you have to implement and test everything, and you will be hard-wiring the magic number 128 into your code base.
There is no data type longer than long (I have logged this as an RFE, along with a 128-bit floating point ;)
You can create an object with four 32-bit int values and support these operations fairly easily.
You can't define any new types to which you could apply Java's built-in bitwise operators.
However, could you just use java.math.BigInteger? BigInteger defines all of the bit-wise operations that are defined for integral types (as methods). This includes, for example, BigInteger.or(BigInteger).
No.
Sorry there isn't a better answer.
One approach may be to create a wrapper object for two long values and implement the required functionality while taking the signedness of the relevant operators into account; a sketch follows. There is also BigInteger [updated from rlibby's answer], but it doesn't provide the required support.
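A minimal sketch of that wrapper idea (Int128 is a hypothetical class; signed shifts and comparisons would need more care than shown here):

final class Int128 {
    long high, low; // treated together as one 128-bit value

    void or(Int128 other) {
        high |= other.high;
        low |= other.low;
    }

    void shl32() {
        // the top 32 bits of low carry over into the bottom of high
        high = (high << 32) | (low >>> 32);
        low <<= 32;
    }
}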
Happy coding.
Perhaps BitSet would be useful to you.
It has the logical operations, and I imagine shifting wouldn't be all that hard to implement given its utility methods.
Afaik, the JVM will just convert whatever you code into 32-bit chunks whatever you do. The JVM is 32-bit; I think even the 64-bit version of the JVM largely processes in 32-bit chunks. It certainly should, to conserve memory... You're just going to slow down your code as the JIT tries to optimise the mess you create. In C/C++ etc. there's no point doing this either, as you will still have impedance from the fact that the hardware you're most likely running on has 32- or 64-bit registers. Even the Intel Xeon Phi (which has 512-bit vector registers) is just bunches of 32- and 64-bit elements.
If you want to implement something like that, you could try to do it in GLSL or OpenCL if you have GPU hardware available. In 2015, Java Sumatra will be released as part of Java 9, at least that's the plan. Then you will have the ability to integrate Java with GPU code out of the box. That IS a big deal, hence the illustrious name!