Write "compressed" Array to increase IO performance? - java

I have an int array and a float array, each of length 220 million (fixed). I want to store those arrays to disk and load them back into memory. Currently, I am using Java NIO's FileChannel and MappedByteBuffer for this. It works fine, but it takes about 5 seconds (wall-clock time) to store or load an array. Now, I want to make it faster.
Here, I should mention that most of the array elements are 0 (nearly 52%),
like:
int arr1 [] = { 0 , 0 , 6 , 7 , 1, 0 , 0 ...}
Can anybody suggest a nice way to improve speed by not storing or loading those 0s? The missing zeroes can be restored with Arrays.fill(array, 0).

The following approach requires n / 8 + nz * 4 bytes on disk, where n is the size of the array, and nz the number of non-zero entries. The bitmap costs n / 8 bytes, i.e. about 3% of the raw 4n bytes, so for 52% zero entries you'd reduce storage size by 52% - 3% = 49%.
You could do:
void write(int[] array) {
    BitSet zeroes = new BitSet();
    for (int i = 0; i < array.length; i++)
        zeroes.set(i, array[i] == 0);
    write(zeroes); // one bit per index
    for (int i = 0; i < array.length; i++)
        if (array[i] != 0)
            write(array[i]);
}

int[] read() {
    BitSet zeroes = readBitSet();
    // note: BitSet.length() is highest set bit + 1, so store the array length
    // explicitly if the array can end in non-zero entries
    int[] array = new int[zeroes.length()];
    for (int i = 0; i < zeroes.length(); i++) {
        if (zeroes.get(i)) {
            // nothing to do (array[i] was initialized to 0)
        } else {
            array[i] = readInt();
        }
    }
    return array;
}
Edit: That you say this is slightly slower implies that the disk is not the bottleneck. You could tune the above approach by writing the bitset as you construct it, so you don't have to build the entire bitset in memory before writing it to disk. Also, by writing the bitset word by word, interspersed with the actual data, we can make a single pass over the array, reducing cache misses:
void write(int[] array) {
    writeInt(array.length);
    for (int i = 0; i < array.length; i += 32) {
        int ni = i + 32;
        int zeroesMap = 0;
        for (int j = ni - 1; j >= i; j--) {
            zeroesMap <<= 1;
            if (array[j] == 0) {
                zeroesMap |= 1;
            }
        }
        writeInt(zeroesMap);
        for (int j = i; j < ni; j++) {
            if (array[j] != 0) {
                writeInt(array[j]);
            }
        }
    }
}
int[] read() {
    int[] array = new int[readInt()];
    for (int i = 0; i < array.length; i += 32) {
        int ni = i + 32;
        int zeroesMap = readInt();
        for (int j = i; j < ni; j++) {
            if ((zeroesMap & 1) == 1) {
                // nothing to do (array[j] was initialized to 0)
            } else {
                array[j] = readInt();
            }
            zeroesMap >>>= 1;
        }
    }
    return array;
}
(The preceding code assumes array.length is a multiple of 32. If not, write the last slice of the array in whatever way you like.)
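If it helps, here is a minimal sketch for that tail (hedged: it reuses the writeInt primitive assumed above and simply stores the remaining entries uncompressed; the reader knows array.length, so it knows how many tail ints to expect):
void writeTail(int[] array) {
    // store the trailing (array.length % 32) entries verbatim, uncompressed
    for (int i = array.length - array.length % 32; i < array.length; i++) {
        writeInt(array[i]);
    }
}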
If that doesn't reduce processing time either, compression is not the way to go (I don't think any general-purpose compression algorithm will be faster than the above).

Depending upon the distribution, consider Run-length Encoding:
Run-length encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. This is most useful on data that contains many such runs.
It is simple ... which is good, and possibly bad, here ;-)
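For illustration, a minimal zero-run RLE sketch in the spirit of this answer (hedged: it reuses the writeInt/readInt primitives assumed in the first answer and only collapses runs of zeroes; non-zero values are stored verbatim):
void writeRle(int[] array) {
    writeInt(array.length);
    for (int i = 0; i < array.length; ) {
        if (array[i] == 0) {
            int run = 0;
            while (i < array.length && array[i] == 0) { run++; i++; }
            writeInt(0);   // marker: a zero run follows
            writeInt(run); // length of the run
        } else {
            writeInt(array[i++]);
        }
    }
}

int[] readRle() {
    int[] array = new int[readInt()];
    for (int i = 0; i < array.length; ) {
        int v = readInt();
        if (v == 0) {
            i += readInt(); // skip the run; the array is zero-initialized
        } else {
            array[i++] = v;
        }
    }
    return array;
}
Note that with ~52% zeroes this only pays off if the zeroes actually come in runs; an isolated zero costs 8 bytes here, which is why the distribution matters.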

In case you are willing to write the serialization/deserialization code yourself, instead of storing all the zeroes you can store a series of ranges that indicate where those zeroes are (with a special marker), together with the actual non-zero data.
So the array in your example: { 0 , 0 , 6 , 7 , 1, 0 , 0 ...}
can be stored as:
%0-1, 6, 7, 1 %5-6
When reading this data, if you hit % it means you have a range in front of you: you read the start and the end and fill in zeroes. Then you go on; a value without the % marker is an actual value.
In a sparse array that has large sequences of consecutive values this will yield great compression.
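To make the idea concrete, here is a decoding sketch for an int stream (hedged: readInt is assumed as before, and MARKER is a hypothetical sentinel standing in for %, so it must be a value that can never occur in the real data):
static final int MARKER = Integer.MIN_VALUE; // hypothetical stand-in for '%'

int[] readWithRanges(int length) {
    int[] array = new int[length]; // zero-initialized, so ranges need no filling
    int i = 0;
    while (i < length) {
        int v = readInt();
        if (v == MARKER) {
            int start = readInt(); // sanity: start == i
            int end = readInt();
            i = end + 1;           // cells start..end are already 0
        } else {
            array[i++] = v;
        }
    }
    return array;
}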

There are standard compression utils in Java: java.util.zip. It's a general-purpose library, but due to sheer availability it's an OK solution. Specialized compression and encoding schemes should be researched if the need arises; I rarely recommend zip as the solution of choice.
Here is a sample of how to handle zip via Deflater/Inflater.
Most people know ZipInput/OutputStream (and especially GZip). All of them have downsides in handling the copy from memory to zlib, and GZip in particular is a total disaster, since it computes a CRC32 through native code (calling native code removes the ability to optimize and introduces some more performance hits).
A few important notes: do not crank the zip compression level up high; that will kill any performance whatsoever. Of course, one can experiment to find the best ratio between CPU and disk activity.
The code also demonstrates one of the real shortcomings of java.util.zip: it doesn't support direct buffers. The support would be beyond trivial, yet no one bothered to do it. Direct buffers would save a few memory copies and reduce the memory footprint.
Last note: there is a Java version of zlib (jzlib), and it beats the native implementation on compression quite nicely.
package t1;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;
public class ZInt {
private static final int bucketSize = 1<<17;//in the real world this should not be a constant
static final int zipLevel = 2;//feel free to experiment; higher compression (5+) is likely a total waste
static void write(int[] a, File file, boolean sync) throws IOException{
byte[] bucket = new byte[Math.min(bucketSize, Math.max(1<<13, Integer.highestOneBit(a.length >>3)))];//128KB bucket
byte[] zipOut = new byte[bucket.length];
final FileOutputStream fout = new FileOutputStream(file);
FileChannel channel = fout.getChannel();
try{
ByteBuffer buf = ByteBuffer.wrap(bucket);
//unfortunately java.util.zip doesn't support Direct Buffer - that would be the perfect fit
ByteBuffer out = ByteBuffer.wrap(zipOut);
out.putInt(a.length);//write length aka header
if (a.length==0){
doWrite(channel, out, 0);
return;
}
Deflater deflater = new Deflater(zipLevel, false);
try{
for (int i=0;i<a.length;){
i = put(a, buf, i);
buf.flip();
deflater.setInput(bucket, buf.position(), buf.limit());
if (i==a.length)
deflater.finish();
//hacking and using bucket here is tempting since it's copied twice but well
for (int n; (n= deflater.deflate(zipOut, out.position(), out.remaining()))>0;){
doWrite(channel, out, n);
}
buf.clear();
}
}finally{
deflater.end();
}
}finally{
if (sync)
fout.getFD().sync();
channel.close();
}
}
static int[] read(File file) throws IOException, DataFormatException{
FileChannel channel = new FileInputStream(file).getChannel();
try{
byte[] in = new byte[(int)Math.min(bucketSize, channel.size())];
ByteBuffer buf = ByteBuffer.wrap(in);
channel.read(buf);
buf.flip();
int[] a = new int[buf.getInt()];
if (a.length==0)
return a;
int i=0;
byte[] inflated = new byte[Math.min(1<<17, a.length*4)];
ByteBuffer intBuffer = ByteBuffer.wrap(inflated);
Inflater inflater = new Inflater(false);
try{
do{
if (!buf.hasRemaining()){
buf.clear();
channel.read(buf);
buf.flip();
}
inflater.setInput(in, buf.position(), buf.remaining());
buf.position(buf.position()+buf.remaining());//simulate all read
for (;;){
int n = inflater.inflate(inflated,intBuffer.position(), intBuffer.remaining());
if (n==0)
break;
intBuffer.position(intBuffer.position()+n).flip();
for (;intBuffer.remaining()>3 && i<a.length;i++){//need at least 4 bytes to form an int
a[i] = intBuffer.getInt();
}
intBuffer.compact();
}
}while (channel.position()<channel.size() && i<a.length);
}finally{
inflater.end();
}
// System.out.printf("read ints: %d - channel.position:%d %n", i, channel.position());
return a;
}finally{
channel.close();
}
}
private static void doWrite(FileChannel channel, ByteBuffer out, int n) throws IOException {
out.position(out.position()+n).flip();
while (out.hasRemaining())
channel.write(out);
out.clear();
}
private static int put(int[] a, ByteBuffer buf, int i) {
for (;buf.hasRemaining() && i<a.length;){
buf.putInt(a[i++]);
}
return i;
}
private static int[] generateRandom(int len){
Random r = new Random(17);
int[] n = new int[len];
for (int i=0;i<len;i++){
n[i]= r.nextBoolean()?0: r.nextInt(1<<23);//limit bounds to have any sensible compression
}
return n;
}
public static void main(String[] args) throws Throwable{
File file = new File("xxx.xxx");
int[] n = generateRandom(3000000); //{0,2,4,1,2,3};
long start = System.nanoTime();
write(n, file, false);
long elapsed = System.nanoTime() - start;//elapsed will be fairer if the sync is true
System.out.printf("File length: %d, for %d ints, ratio %.2f in %.2fms %n", file.length(), n.length, ((double)file.length())/4/n.length, java.math.BigDecimal.valueOf(elapsed, 6) );
int[] m = read(file);
//compare, Arrays.equals doesn't return position, so it sucks/kinda
for (int i=0; i<n.length; i++){
if (m[i]!=n[i]){
System.err.printf("Failed at %d%n",i);
break;
}
}
System.out.printf("All done!");
}
}
Please note, the code is not a proper benchmark!
The delayed reply comes from the fact that it was quite boring to code yet another zip example; sorry.

Related

CPU Cache/Memory Access Time Anomalies

We are trying to optimize heavy memory operations in Java and ran into some anomalies. From our data, we formed the hypothesis that an array/memory block might be loaded into the CPU cache because of frequent accesses, but after cloning this array many times, the cache becomes full and moves the initial array back into RAM.
To test this, we set up a benchmark. It does the following:
Create an array with a given size
Write some data into the fields
Read/iterate it a million times (to push it into CPU cache)
Clone it once into a new array
Repeatedly clone the newest array into another new array, a given number of times
Additionally, after each of these steps the array is iterated three times and the needed time is measured for each iteration. Here is the code:
private static long[] read(byte[] array, int count, boolean logTimes) {
long[] times = null;
if (logTimes) {
times = new long[count];
}
int sum = 0;
for (int n = 0; n < count; n++) {
long start = System.nanoTime();
for (int i = 0; i < array.length; i++) {
sum += array[i];
}
if (logTimes) {
long time = System.nanoTime() - start;
times[n] = time;
}
}
System.out.println(sum);
return times;
}
public static void main(String[] args) {
int arraySize = Integer.parseInt(args[0]);
int clones = Integer.parseInt(args[1]);
byte[] array = new byte[arraySize];
long[] initialReadTimes = read(array, 3, true);
// Fill with some non-zero content
for (int i = 0; i < array.length; i++) {
array[i] = (byte) i;
}
long[] afterWriteTimes = read(array, 3, true);
// Make this array important, so it lands in CPU Cache
read(array, 1_000_000, false);
long[] afterReadTimes = read(array, 3, true);
long[] afterFirstCloneReadTimes = null;
byte[] copy = new byte[array.length];
System.arraycopy(array, 0, copy, 0, array.length);
for (int i = 1; i <= clones; i++) {
byte[] copy2 = new byte[copy.length];
System.arraycopy(copy, 0, copy2, 0, copy.length);
copy = copy2;
if (i == 1) {
afterFirstCloneReadTimes = read(array, 3, true);
}
}
long[] afterAllClonesReadTimes = read(array, 3, true);
// Write to CSV
...
System.out.println("Finished.");
}
We ran this benchmark with arraySize=10,000 and clones=10,000,000 on a 2nd-gen i5 with 16 GB RAM. There was quite a lot of variation, though: the 2nd and 3rd runs sometimes had different times, and there were peaks in the 2nd and 3rd runs of the last reading benchmark.
These results seem pretty confusing. I think they could show that upon array initialization the array is not immediately loaded into the CPU cache, because the initial read times are relatively high. After writing, nothing seems to have changed. Only after iterating a lot do the access times become faster, while the first run is always slower (because of the measuring overhead that runs between the readings?). Also, cloning/filling memory with new arrays does not seem to have an impact at all. Could anyone explain these results?
We assumed that some of this might stem from Java-specific memory management, so we tried to reimplement the benchmark in C++:
#include <chrono>
#include <cstring>
#include <iostream>
#include <vector>

void read(unsigned char array[], int length, int count, std::vector<long int> & logTimes) {
for (int c = 0; c < count; c++) {
int sum = 0;
std::chrono::high_resolution_clock::time_point t1;
if (count <= 3) {
t1 = std::chrono::high_resolution_clock::now();
}
for (int i = 0; i < length; i++) {
sum += array[i];
}
if (count <= 3) {
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
long int duration = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << duration << " ns\n";
logTimes.push_back(duration);
}
}
}
int main(int argc, char ** args)
{
const int ARRAYSIZE = 10000;
const int CLONES = 10000000;
std::vector<long int> initialTimes, afterWritingTimes, afterReadTimes, afterFirstCloneTimes, afterCloneTimes, null;
unsigned char array[ARRAYSIZE];
read(array, ARRAYSIZE, 3, initialTimes);
for (long long i = 0; i < ARRAYSIZE; i++) {
array[i] = i;
}
std::cout << "Reads after writing:\n";
read(array, ARRAYSIZE, 3, afterWritingTimes);
read(array, ARRAYSIZE, 1000000, null);
std::cout << "Reads after 1M Reads:\n";
read(array, ARRAYSIZE, 3, afterReadTimes);
unsigned char copy[ARRAYSIZE];
unsigned char * ptr_copy = copy;
std::memcpy(ptr_copy, array, ARRAYSIZE);
for (long long i = 0; i < CLONES; i++) {
unsigned char copy2[ARRAYSIZE];
std::memcpy(copy2, ptr_copy, ARRAYSIZE);
ptr_copy = copy2;
if (i == 0) {
read(array, ARRAYSIZE, 3, afterFirstCloneTimes);
}
}
std::cout << "Reads after cloning:\n";
read(array, ARRAYSIZE, 3, afterCloneTimes);
writeTimesToCSV(initialTimes, afterWritingTimes, afterReadTimes, afterFirstCloneTimes, afterCloneTimes); // CSV writer omitted, as in the Java version
std::cout << "Finished.\n";
}
Using the same parameters, we got the following results: in C++ the times are rather similar to each other, with some strange peaks in the 2nd run. This seems to show that the faster timings above were caused by Java optimizations (or rather by suboptimal handling in the first readings). Does this mean that the CPU cache is not involved at all?

Replace all inside bitset with a differently sized bitset

I'm currently dealing with a binary file that will later be written into a different binary file. This is very important and is the reason I'm hesitant to use ArrayLists and other lists, as they tend not to play nicely with writing directly into a file.
I've retrieved the bytes out of this binary file and separated them into bits using BitSet. I think I've figured out how to find the bit pattern I want to replace. Currently it looks something like this:
try {
InputStream inputStream = new FileInputStream(filepath);
OutputStream outputStream = new FileOutputStream("output.bin");
byte[] buffer = new byte[4096];
BitSet bitSet = new BitSet(4096 * 8);
BitSet bitString = new BitSet(search.length());
BitSet bitReplace = new BitSet(replace.length());
// Search String to bitset
for (int i = search.length() - 1; i >= 0; i--) {
if (search.charAt(i) == '1') {
bitString.set(search.length() - i - 1);
}
}
// Replace String to bitset
for (int i = replace.length() - 1; i >= 0; i--) {
if (replace.charAt(i) == '1') {
bitReplace.set(replace.length() - i - 1);
}
}
while (inputStream.read(buffer) != -1) {
bitSet = BitSet.valueOf(buffer);
bufferCount++;
// GET 4096 BYTES AT THE SAME TIME
// TURN THEM INTO BITS, WE END UP WITH 4096*8 bits
// COMPARE EVERY SEARCHSIZE BITS
for (int i = 0; i < bitSet.length(); i++) {
if (bitSet.get(i, i + bitString.length()).equals(bitString)) {
//TODO: Replace bitset with a different bit set
}
}
}
inputStream.close();
outputStream.close();
} catch (IOException e) {
System.out.println("IOException");
System.exit(1);
}
What I'm missing is how to overwrite bits in an existing BitSet, once the pattern of bits has been found, with a different BitSet (which could be differently sized).
So to illustrate:
Find: 01010 replace with: 001111
Would turn this sequence of bits:
00|01010|01000000000000010
into:
00|001111|010000000000000010
Abstractly I've thought of a solution, to be like this:
1. Find the pattern that matches the SEARCHed pattern
2. Replace a bitset with a completely different bitset (this is what I'm struggling with; I was thinking about just appending everything to the end of the file, but that would not be very efficient in terms of read/write)
3. Shift the other bits to the left or to the right based on the difference between the sizes of the searched pattern and the pattern we're replacing with.
4. Write into file.
You could define a function setBitsFromIndex(int i, BitSet source, BitSet dest):
private static void setBitsFromIndex(int i, BitSet source, BitSet dest) {
for (int j = 0; j < source.length(); j++) {
dest.set(i+j, source.get(j));
}
}
Then, in your code:
for (int i = 0; i < bitSet.length() - bitString.length(); i++) {
if (bitSet.get(i, i + bitString.length()).equals(bitString)) {
//Replace bitset with a different bit set
BitSet tempBitSet = bitSet.get(i + bitString.length(), bitSet.length());
setBitsFromIndex(i, bitReplace, bitSet);
setBitsFromIndex(i + bitReplace.length(), tempBitSet, bitSet);
// if bitReplace is shorter than bitString, we may need to clear trailing bits
if (bitReplace.length() < bitString.length()) {
bitSet.clear(i + bitReplace.length() + tempBitSet.length(), bitSet.length());
}
break;
}
}
BE WARNED: The length of a BitSet is NOT its capacity, or even the length it was prior to the last time you set a bit. It is the index + 1 of the HIGHEST SET (1) BIT, so your bitReplace, bitString, and bitSet BitSets might not be the length you think they are if they have 0s in the most significant positions. If you want to include leading zeros, you have to track the desired sizes of your bitReplace and bitString BitSets independently.
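A quick illustration of that pitfall:
BitSet bs = new BitSet(16); // 16 is only a sizing hint
bs.set(3);
System.out.println(bs.length()); // 4 (highest set bit + 1), not 16
System.out.println(bs.size());   // 64 (allocated storage in bits), not 16 either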

Remove first n bytes from a ByteBuffer

How can I remove the first n bytes from a ByteBuffer without changing or lowering the capacity? The result should be that the byte at index 0 is the byte that was previously at index n. Is there a better data type in Java for this kind of operation?
You could try something like this:
public void removeBytesFromStart(ByteBuffer bf, int n) {
int index = 0;
for(int i = n; i < bf.position(); i++) {
bf.put(index++, bf.get(i));
bf.put(i, (byte)0);
}
bf.position(index);
}
Or something like this:
public void removeBytesFromStart2(ByteBuffer bf, int n) {
int index = 0;
for(int i = n; i < bf.limit(); i++) {
bf.put(index++, bf.get(i));
bf.put(i, (byte)0);
}
bf.position(bf.position()-n);
}
This uses the absolute get and put methods of the ByteBuffer class and sets the position to the next write position.
Note that the absolute put method is optional, which means that a class that extends the abstract class ByteBuffer may not provide an implementation for it, for example it might throw a ReadOnlyBufferException.
Whether you choose to loop until position or until limit depends on how you use the buffer; for example, if you manually set the position, you might want to loop until limit. If you do not, looping until position is enough and more efficient.
Here are some tests:
@Test
public void removeBytesFromStart() {
ByteBuffer bf = ByteBuffer.allocate(16);
int expectedCapacity = bf.capacity();
bf.put("abcdefg".getBytes());
ByteBuffer expected = ByteBuffer.allocate(16);
expected.put("defg".getBytes());
removeBytesFromStart(bf, 3);
Assert.assertEquals(expectedCapacity, bf.capacity());
Assert.assertEquals(0, bf.compareTo(expected));
}
@Test
public void removeBytesFromStartInt() {
ByteBuffer bf = ByteBuffer.allocate(16);
int expectedCapacity = bf.capacity();
bf.putInt(1);
bf.putInt(2);
bf.putInt(3);
bf.putInt(4);
ByteBuffer expected = ByteBuffer.allocate(16);
expected.putInt(2);
expected.putInt(3);
expected.putInt(4);
removeBytesFromStart2(bf, 4);
Assert.assertEquals(expectedCapacity, bf.capacity());
Assert.assertEquals(0, bf.compareTo(expected));
}
I think the method you are looking for is the ByteBuffer's compact() method
Even though the documentation says:
"The bytes between the buffer's current position and its limit, if any, are copied to the beginning of the buffer. That is, the byte at index p = position() is copied to index zero, the byte at index p + 1 is copied to index one, and so forth until the byte at index limit() - 1 is copied to index n = limit() - 1 - p. The buffer's position is then set to n+1 and its limit is set to its capacity."
I am not sure that this method really does that, because when I debug it, it seems like the method just does buffer.limit = buffer.capacity.
Do you mean to shift all the elements to the beginning of the buffer? Like this:
int n = 4;
//allocate a buffer of capacity 10
ByteBuffer b = ByteBuffer.allocate(10);
// add data to buffer
for (int i = 0; i < b.limit(); i++) {
b.put((byte) i);
}
// print buffer
for (int i = 0; i < b.limit(); i++) {
System.out.print(b.get(i) + " ");
}
//shift left the elements from the buffer
//add zeros to the end
for (int i = n; i < b.limit() + n; i++) {
if (i < b.limit()) {
b.put(i - n, b.get(i));
} else {
b.put(i - n, (byte) 0);
}
}
//print buffer again
System.out.println();
for (int i = 0; i < b.limit(); i++) {
System.out.print(b.get(i) + " ");
}
For n=4 it will print:
0 1 2 3 4 5 6 7 8 9
4 5 6 7 8 9 0 0 0 0
Use the compact method for that, e.g.:
ByteBuffer b = ByteBuffer.allocate(32);
b.put("hello,world".getBytes());
b.position(6);
b.compact();
System.out.println(new String(b.array()));

Is it possible to read/write bits from a file using JAVA?

To read/write binary files, I am using DataInputStream/DataOutputStream. They have the methods writeByte()/readByte(), but what I want is to read/write bits. Is that possible?
I want to use it for a compression algorithm: when compressing, I want to write 3 bits at a time (for one number, and there are millions of such numbers in a file), and if I write a whole byte each time I need to write 3 bits, I will write loads of redundant data...
It's not possible to read/write individual bits directly, the smallest unit you can read/write is a byte.
You can use the standard bitwise operators to manipulate a byte though, so e.g. to get the lowest 2 bits of a byte, you'd do
byte b = in.readByte();
byte lowBits = (byte) (b & 0x3);
set the low 4 bits to 1, and write the byte:
b |= 0xf;
out.writeByte(b);
(Note, for the sake of efficiency you might want to read/write byte arrays and not single bytes)
There's no way to do it directly. The smallest unit computers can handle is a byte (even booleans take up a byte). However, you can create a custom stream class that packs a byte with the bits you want and then writes it. You can then make a wrapper for this class whose write function takes some integral type, checks that it's between 0 and 7 (or -4 and 3, or whatever), extracts the bits the same way the BitInputStream class (below) does, and makes the corresponding calls to the BitOutputStream's write method. You might be thinking that you could just make one set of I/O stream classes, but 3 doesn't go into 8 evenly; so if you want optimum storage efficiency and you don't want to work really hard, you're kind of stuck with two layers of abstraction. Below is a BitOutputStream class, a corresponding BitInputStream class, and a program that makes sure they work.
import java.io.IOException;
import java.io.OutputStream;
class BitOutputStream {
private OutputStream out;
private boolean[] buffer = new boolean[8];
private int count = 0;
public BitOutputStream(OutputStream out) {
this.out = out;
}
public void write(boolean x) throws IOException {
this.count++;
this.buffer[8-this.count] = x;
if (this.count == 8){
int num = 0;
for (int index = 0; index < 8; index++){
num = 2*num + (this.buffer[index] ? 1 : 0);
}
this.out.write(num - 128);
this.count = 0;
}
}
public void close() throws IOException {
if (this.count > 0) { // only flush when a partial byte is pending
int num = 0;
for (int index = 0; index < 8; index++){
num = 2*num + (this.buffer[index] ? 1 : 0);
}
this.out.write(num - 128);
}
this.out.close();
}
}
I'm sure there's a way to pack the int with bitwise operators and thus avoid having to reverse the input, but I don't want to think that hard.
Also, you probably noticed that there is no local way to detect that the last bit has been read in this implementation, but I really don't want to think that hard.
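For what it's worth, here is a sketch of that bitwise packing (hedged: a drop-in alternative for the write method above, accumulating bits LSB-first in an int; the "- 128" keeps it compatible with the BitInputStream below):
private int bits = 0;  // pending bits, least significant first
private int nbits = 0; // number of pending bits

public void write(boolean x) throws IOException {
    if (x) {
        bits |= 1 << nbits;
    }
    nbits++;
    if (nbits == 8) {
        this.out.write(bits - 128); // same "- 128" convention as the original
        bits = 0;
        nbits = 0;
    }
}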
import java.io.IOException;
import java.io.InputStream;
class BitInputStream {
private InputStream in;
private int num = 0;
private int count = 8;
public BitInputStream(InputStream in) {
this.in = in;
}
public boolean read() throws IOException {
if (this.count == 8){
this.num = this.in.read() + 128;
this.count = 0;
}
boolean x = (num%2 == 1);
num /= 2;
this.count++;
return x;
}
public void close() throws IOException {
this.in.close();
}
}
You probably know this, but you should put a BufferedStream in between your BitStream and FileStream or it'll take forever.
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Random;
class Test {
private static final int n = 1000000;
public static void main(String[] args) throws IOException {
Random random = new Random();
//Generate array
long startTime = System.nanoTime();
boolean[] outputArray = new boolean[n];
for (int index = 0; index < n; index++){
outputArray[index] = random.nextBoolean();
}
System.out.println("Array generated in " + (double)(System.nanoTime() - startTime)/1000/1000/1000 + " seconds.");
//Write to file
startTime = System.nanoTime();
BitOutputStream fout = new BitOutputStream(new BufferedOutputStream(new FileOutputStream("booleans.bin")));
for (int index = 0; index < n; index++){
fout.write(outputArray[index]);
}
fout.close();
System.out.println("Array written to file in " + (double)(System.nanoTime() - startTime)/1000/1000/1000 + " seconds.");
//Read from file
startTime = System.nanoTime();
BitInputStream fin = new BitInputStream(new BufferedInputStream(new FileInputStream("booleans.bin")));
boolean[] inputArray = new boolean[n];
for (int index = 0; index < n; index++){
inputArray[index] = fin.read();
}
fin.close();
System.out.println("Array read from file in " + (double)(System.nanoTime() - startTime)/1000/1000/1000 + " seconds.");
//Delete file
new File("booleans.bin").delete();
//Check equality
boolean equal = true;
for (int index = 0; index < n; index++){
if (outputArray[index] != inputArray[index]){
equal = false;
break;
}
}
System.out.println("Input " + (equal ? "equals " : "doesn't equal ") + "output.");
}
}
Please take a look at my bit-io library https://github.com/jinahya/bit-io, which can read and write non-octet-aligned values such as a 1-bit boolean or 17-bit unsigned integer.
<dependency>
<!-- resides in central repo -->
<groupId>com.googlecode.jinahya</groupId>
<artifactId>bit-io</artifactId>
<version>1.0-alpha-13</version>
</dependency>
This library reads and writes arbitrary-length bits.
final InputStream stream;
final BitInput input = new BitInput(new BitInput.StreamInput(stream));
final boolean b = input.readBoolean(); // reads a 1-bit boolean value
final int i = input.readUnsignedInt(3); // reads a 3-bit unsigned int
final long l = input.readLong(47); // reads a 47-bit signed long
input.align(1); // 8-bit byte align; padding
final WritableByteChannel channel;
final BitOutput output = new BitOutput(new BitOutput.ChannelOutput(channel));
output.writeBoolean(true); // writes a 1-bit boolean value
output.writeInt(17, 0x00); // writes a 17-bit signed int
output.writeUnsignedLong(54, 0x00L); // writes a 54-bit unsigned long
output.align(4); // 32-bit byte align; discarding
InputStreams and OutputStreams are streams of bytes.
To read a bit you'll need to read a byte and then use bit manipulation to inspect the bits you care about. Likewise, to write bits you'll need to write bytes containing the bits you want.
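For example, the masking this implies looks something like the following (in and out are any InputStream/OutputStream):
int b = in.read();                  // one byte as 0..255, or -1 at end of stream
boolean bit5 = ((b >> 5) & 1) == 1; // inspect bit 5
b |= 1 << 2;                        // set bit 2
b &= ~(1 << 0);                     // clear bit 0
out.write(b);                       // write() stores the low 8 bits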
Yes and no. On most modern computers, a byte is the smallest addressable unit of memory, so you can only read/write entire bytes at a time. However, you can always use bitwise operators to manipulate the bits within a byte.
Bits are packed in bytes, and apart from VHDL/Verilog I have seen no language that allows you to append individual bits to a stream. Cache up your bits and pack them into a byte for a write, using a buffer and bitmasking; do the reverse for reads, i.e. keep a pointer into your buffer and increment it as you return individually masked bits.
AFAIK there is no function for doing this in the Java API. However, you can of course read a byte and then use bit-manipulation functions. The same goes for writing.
If you are just writing bits to a file, Java's BitSet class might be worth a look at. From the javadoc:
This class implements a vector of bits that grows as needed. Each component of the bit set has a boolean value. The bits of a BitSet are indexed by nonnegative integers. Individual indexed bits can be examined, set, or cleared. One BitSet may be used to modify the contents of another BitSet through logical AND, logical inclusive OR, and logical exclusive OR operations.
You are able to convert BitSets to long[] and byte[] to save data to a file.
The below code should work
int[] mynumbers = {3,4};
BitSet compressedNumbers = new BitSet(mynumbers.length*3);
// let's say you encoded 3 as 101 and 4 as 010
String myNumbersAsBinaryString = "101010";
for (int i = 0; i < myNumbersAsBinaryString.length(); i++) {
if(myNumbersAsBinaryString.charAt(i) == '1')
compressedNumbers.set(i);
}
String path = Resources.getResource("myfile.out").getPath();
ObjectOutputStream outputStream = null;
try {
outputStream = new ObjectOutputStream(new FileOutputStream(path));
outputStream.writeObject(compressedNumbers);
} catch (IOException e) {
e.printStackTrace();
}

Bit manipulation and output in Java

If you have binary strings (literally String objects that contain only 1's and 0's), how would you output them as bits into a file?
This is for a text compressor I was working on; it's still bugging me, and it'd be nice to finally get it working. Thanks!
Easiest is to simply take 8 consecutive characters, turn them into a byte and output that byte. Pad with zeros at the end if you can recognize the end-of-stream, or add a header with length (in bits) at the beginning of the file.
The inner loop would look something like:
byte[] buffer = new byte[ ( string.length() + 7 ) / 8 ];
for ( int i = 0; i < buffer.length; ++i ) {
    byte current = 0;
    for ( int j = 7; j >= 0; --j )
        if ( i * 8 + j < string.length() && string.charAt( i * 8 + j ) == '1' )
            current |= 1 << j;
    output( current ); // output() stands for whatever byte sink you use
}
You'll need to make some adjustments, but that's the general idea.
If you're lucky, java.math.BigInteger may do everything for you.
String s = "11001010001010101110101001001110";
byte[] bytes = (new java.math.BigInteger(s, 2)).toByteArray();
This does depend on the byte order (big-endian) and right-aligning (if the number of bits is not a multiple of 8) being what you want but it may be simpler to modify the array afterwards than to do the character conversion yourself.
public class BitOutputStream extends FilterOutputStream
{
private int buffer = 0;
private int bitCount = 0;
public BitOutputStream(OutputStream out)
{
super(out);
}
public void writeBits(int value, int numBits) throws IOException
{
while(numBits>0)
{
numBits--;
int mix = ((value&1)<<bitCount++);
buffer|=mix;
value>>=1;
if(bitCount==8)
align8();
}
}
@Override
public void close() throws IOException
{
align8(); /* Flush any remaining partial bytes */
super.close();
}
public void align8() throws IOException
{
if(bitCount > 0)
{
bitCount=0;
write(buffer);
buffer=0;
}
}
}
And then...
if (nextChar == '0')
{
bos.writeBits(0, 1);
}
else
{
bos.writeBits(1, 1);
}
Assuming the String has a multiple of eight bits (you can pad it otherwise), take advantage of Java's built-in parsing in the Integer.parseInt method to do something like this:
String s = "11001010001010101110101001001110";
byte[] data = new byte[s.length() / 8];
for (int i = 0; i < data.length; i++) {
data[i] = (byte) Integer.parseInt(s.substring(i * 8, (i + 1) * 8), 2);
}
Then you should be able to write the bytes to a FileOutputStream pretty simply.
On the other hand, if you're looking for efficiency, you should consider not using a String to store the bits to begin with, but building up the bytes directly in your compressor.
