Is it possible to read/write bits from a file using JAVA? - java

To read/write binary files, I am using DataInputStream/DataOutputStream, they have this method writeByte()/readByte(), but what I want to do is read/write bits? Is it possible?
I want to use it for a compression algorithm, so when I am compressing I want to write 3 bits(for one number and there are millions of such numbers in a file) and if I write a byte at everytime I need to write 3 bits, I will write loads of redundant data...

It's not possible to read/write individual bits directly, the smallest unit you can read/write is a byte.
You can use the standard bitwise operators to manipulate a byte though, so e.g. to get the lowest 2 bits of a byte, you'd do
byte b = in.readByte();
byte lowBits = b&0x3;
set the low 4 bits to 1, and write the byte:
b |= 0xf;
out.writeByte(b);
(Note, for the sake of efficiency you might want to read/write byte arrays and not single bytes)

There's no way to do it directly. The smallest unit computers can handle is a byte (even booleans take up a byte). However you can create a custom stream class that packs a byte with the bits you want then writes it. You can then make a wrapper for this class who's write function takes some integral type, checks that it's between 0 and 7 (or -4 and 3 ... or whatever), extracts the bits in the same way the BitInputStream class (below) does, and makes the corresponding calls to the BitOutputStream's write method. You might be thinking that you could just make one set of IO stream classes, but 3 doesn't go into 8 evenly. So if you want optimum storage efficiency and you don't want to work really hard you're kind of stuck with two layers of abstraction. Below is a BitOutputStream class, a corresponding BitInputStream class, and a program that makes sure they work.
import java.io.IOException;
import java.io.OutputStream;
class BitOutputStream {
private OutputStream out;
private boolean[] buffer = new boolean[8];
private int count = 0;
public BitOutputStream(OutputStream out) {
this.out = out;
}
public void write(boolean x) throws IOException {
this.count++;
this.buffer[8-this.count] = x;
if (this.count == 8){
int num = 0;
for (int index = 0; index < 8; index++){
num = 2*num + (this.buffer[index] ? 1 : 0);
}
this.out.write(num - 128);
this.count = 0;
}
}
public void close() throws IOException {
int num = 0;
for (int index = 0; index < 8; index++){
num = 2*num + (this.buffer[index] ? 1 : 0);
}
this.out.write(num - 128);
this.out.close();
}
}
I'm sure there's a way to pack the int with bit-wise operators and thus avoid having to reverse the input, but I don't what to think that hard.
Also, you probably noticed that there is no local way to detect that the last bit has been read in this implementation, but I really don't want to think that hard.
import java.io.IOException;
import java.io.InputStream;
class BitInputStream {
private InputStream in;
private int num = 0;
private int count = 8;
public BitInputStream(InputStream in) {
this.in = in;
}
public boolean read() throws IOException {
if (this.count == 8){
this.num = this.in.read() + 128;
this.count = 0;
}
boolean x = (num%2 == 1);
num /= 2;
this.count++;
return x;
}
public void close() throws IOException {
this.in.close();
}
}
You probably know this, but you should put a BufferedStream in between your BitStream and FileStream or it'll take forever.
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Random;
class Test {
private static final int n = 1000000;
public static void main(String[] args) throws IOException {
Random random = new Random();
//Generate array
long startTime = System.nanoTime();
boolean[] outputArray = new boolean[n];
for (int index = 0; index < n; index++){
outputArray[index] = random.nextBoolean();
}
System.out.println("Array generated in " + (double)(System.nanoTime() - startTime)/1000/1000/1000 + " seconds.");
//Write to file
startTime = System.nanoTime();
BitOutputStream fout = new BitOutputStream(new BufferedOutputStream(new FileOutputStream("booleans.bin")));
for (int index = 0; index < n; index++){
fout.write(outputArray[index]);
}
fout.close();
System.out.println("Array written to file in " + (double)(System.nanoTime() - startTime)/1000/1000/1000 + " seconds.");
//Read from file
startTime = System.nanoTime();
BitInputStream fin = new BitInputStream(new BufferedInputStream(new FileInputStream("booleans.bin")));
boolean[] inputArray = new boolean[n];
for (int index = 0; index < n; index++){
inputArray[index] = fin.read();
}
fin.close();
System.out.println("Array read from file in " + (double)(System.nanoTime() - startTime)/1000/1000/1000 + " seconds.");
//Delete file
new File("booleans.bin").delete();
//Check equality
boolean equal = true;
for (int index = 0; index < n; index++){
if (outputArray[index] != inputArray[index]){
equal = false;
break;
}
}
System.out.println("Input " + (equal ? "equals " : "doesn't equal ") + "output.");
}
}

Please take a look at my bit-io library https://github.com/jinahya/bit-io, which can read and write non-octet-aligned values such as a 1-bit boolean or 17-bit unsigned integer.
<dependency>
<!-- resides in central repo -->
<groupId>com.googlecode.jinahya</groupId>
<artifactId>bit-io</artifactId>
<version>1.0-alpha-13</version>
</dependency>
This library reads and writes arbitrary-length bits.
final InputStream stream;
final BitInput input = new BitInput(new BitInput.StreamInput(stream));
final int b = input.readBoolean(); // reads a 1-bit boolean value
final int i = input.readUnsignedInt(3); // reads a 3-bit unsigned int
final long l = input.readLong(47); // reads a 47-bit signed long
input.align(1); // 8-bit byte align; padding
final WritableByteChannel channel;
final BitOutput output = new BitOutput(new BitOutput.ChannelOutput(channel));
output.writeBoolean(true); // writes a 1-bit boolean value
output.writeInt(17, 0x00); // writes a 17-bit signed int
output.writeUnsignedLong(54, 0x00L); // writes a 54-bit unsigned long
output.align(4); // 32-bit byte align; discarding

InputStreams and OutputStreams are streams of bytes.
To read a bit you'll need to read a byte and then use bit manipulation to inspect the bits you care about. Likewise, to write bits you'll need to write bytes containing the bits you want.

Yes and no. On most modern computers, a byte is the smallest addressable unit of memory, so you can only read/write entire bytes at a time. However, you can always use bitwise operators to manipulate the bits within a byte.

Bits are packaged in bytes and apart from VHDL/Verilog I have seen no language that allows you to append individual bits to a stream. Cache up your bits and pack them into a byte for a write using a buffer and bitmasking. Do the reverse for read, i.e. keep a pointer in your buffer and increment it as you return individually masked bits.

Afaik there is no function for doing this in the Java API. However you can of course read a byte and then use bit manipulation functions. Same goes for writing.

If you are just writing bits to a file, Java's BitSet class might be worth a look at. From the javadoc:
This class implements a vector of bits that grows as needed. Each component of the bit set has a boolean value. The bits of a BitSet are indexed by nonnegative integers. Individual indexed bits can be examined, set, or cleared. One BitSet may be used to modify the contents of another BitSet through logical AND, logical inclusive OR, and logical exclusive OR operations.
You are able to convert BitSets to long[] and byte[] to save data to a file.

The below code should work
int[] mynumbers = {3,4};
BitSet compressedNumbers = new BitSet(mynumbers.length*3);
// let's say you encoded 3 as 101 and 4 as 010
String myNumbersAsBinaryString = "101010";
for (int i = 0; i < myNumbersAsBinaryString.length(); i++) {
if(myNumbersAsBinaryString.charAt(i) == '1')
compressedNumbers.set(i);
}
String path = Resources.getResource("myfile.out").getPath();
ObjectOutputStream outputStream = null;
try {
outputStream = new ObjectOutputStream(new FileOutputStream(path));
outputStream.writeObject(compressedNumbers);
} catch (IOException e) {
e.printStackTrace();
}

Related

Replace all inside bitset with a differently sized bitset

I'm currently dealing with a binary file that will later on be written into a different binary file. This is very important and is the reason I'm hesitant to use ArrayLists and other lists, as they tend to not play nicely with me trying to write it into a file directly.
I've retrieved the bytes out of this binary and separated them into bits using BitSet. I think I've figured out how to find the bitset I want to replace. Currently this looks kinda like this:
try {
InputStream inputStream = new FileInputStream(filepath);
OutputStream outputStream = new FileOutputStream("output.bin");
byte[] buffer = new byte[4096];
BitSet bitSet = new BitSet(4096 * 8);
BitSet bitString = new BitSet(search.length());
BitSet bitReplace = new BitSet(replace.length());
// Search String to bitset
for (int i = search.length() - 1; i >= 0; i--) {
if (search.charAt(i) == '1') {
bitString.set(search.length() - i - 1);
}
}
// Replace String to bitset
for (int i = replace.length() - 1; i >= 0; i--) {
if (replace.charAt(i) == '1') {
bitReplace.set(replace.length() - i - 1);
}
}
while (inputStream.read(buffer) != -1) {
bitSet = BitSet.valueOf(buffer);
bufferCount++;
// GET 4096 BYTES AT THE SAME TIME
// TURN THEM INTO BITS, WE END UP WITH 4096*8 bits
// COMPARE EVERY SEARCHSIZE BITS
for (int i = 0; i < bitSet.length(); i++) {
if (bitSet.get(i, i + bitString.length()).equals(bitString)) {
//TODO: Replace bitset with a different bit set
}
}
}
inputStream.close();
outputStream.close();
} catch (IOException e) {
System.out.println("IOException");
System.exit(1);
}
What I'm missing is how to set an existing bitsets once the pattern of bits have been found with a different bitset(could be differently sized).
So to illustrate:
Find: 01010 replace with: 001111
Would turn this sequence of bits:
00|01010|01000000000000010
into:
00|001111|010000000000000010
Abstractly I've thought of a solution, to be like this:
1. Find the pattern that matches the SEARCHed pattern
2. Replace a bitset with a completely different bitset(this is what I'm struggling with, I was thinking about just appending everything to the end of the file, but that would not be very efficient in terms of read/write
3. Shift the other bits to the left or to the right based on the difference between the sizes of the searched pattern and the pattern we're replacing with.
4. Write into file.
You could define a function setBitsFromIndex(int i, BitSet source, BitSet dest):
private static void setBitsFromIndex(int i, BitSet source, BitSet dest) {
for (int j = 0; j < source.length(); j++) {
dest.set(i+j, source.get(j));
}
}
Then, in your code:
for (int i = 0; i < bitSet.length() - bitString.length(); i++) {
if (bitSet.get(i, i + bitString.length()).equals(bitString)) {
//Replace bitset with a different bit set
BitSet tempBitSet = bitSet.get(i + bitString.length(), bitSet.length());
setBitsFromIndex(i, bitReplace, bitSet);
setBitsFromIndex(i + bitReplace.length(), tempBitSet, bitSet);
// if bitReplace is shorter than bitString, we may need to clear trailing bits
if (bitReplace.length() < bitString.length()) {
bitSet.clear(i + bitReplace.length() + tempBitSet.length(), bitSet.length());
}
break;
}
}
BE WARNED: The length of a BitSet is NOT it's capacity, or even the length it was prior to the last time you set a bit. It is the index + 1 of the HIGHEST SET (1) BIT, so your bitReplace, bitString, and bitSet BitSets might not be the length you think they are if they have 0s in the most-significant bits. If you want to include leading zeros, you have to track the desired size of your bitReplace and bitString BitSets independently.

Ensuring that an XOR operation on strictly positive longs only yield strictly positive longs

I want to ensure that an XOR operation on strictly positive longs only yields strictly positive longs.
My question is base on the following java code:
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;
import java.util.Random;
public class Test {
static Random r = new Random();
static class Data {
long id = Math.abs(r.nextLong());
#Override
public String toString() {
return "Data {" + "id=" + id + '}';
}
}
public static void main(String[] args) {
List<Data> data = new ArrayList<>();
for (int i = 0; i < 10; i++) {
data.add(new Data());
}
final String password = "don't you ever tell them";
byte[] passwordBytes = password.getBytes();
long[] passwordLongs = new long[passwordBytes.length / 8];
for (int i = 0; i < passwordLongs.length; i++) {
ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES);
byte[] chunk = new byte[Long.BYTES];
System.arraycopy(passwordBytes, i * Long.BYTES, chunk, 0, Long.BYTES);
buffer.put(chunk);
buffer.flip();//need flip
passwordLongs[i] = buffer.getLong();
}
System.out.println(data);
ListIterator<Data> encryptIterator = data.listIterator();
while (encryptIterator.hasNext()) {
Data next = encryptIterator.next();
next.id = next.id ^ passwordLongs[(encryptIterator.nextIndex() - 1) % passwordLongs.length];//XOR here
}
System.out.println(data);
}
}
Can anyone please provide an answer possibly with some theory?
Invariant 1: a positive integer's most significant bit is zero.
Invariant 2: 0 XOR 0 = 0.
Conclusion: positive integer XOR positive integer = positive integer.
Given that longs in Java are always signed, you should make sure that every long from your "password" never toggles the sign bit.
That is, all your longs in your password should be like 0b0xxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx.
This weakens the "encryption" (exactly) a bit, but you should not worry much over this.
It should however be noted that unless you want to compare them being greater than 0, there is no reason doing this; the actual number will always be the same, that is, 0xff or 0b11111111 is always the same, only its decimal representation changes according to whether you are using unsigned or signed short integer to store it (255 and -1 respectively).

Java Record / Mix two audio streams

i have a java application that records audio from a mixer and store it on a byte array, or save it to a file.
What I need is to get audio from two mixers simultaneously, and save it to an audio file (i am trying with .wav).
The thing is that I can get the two byte arrays, but don't know how to merge them (by "merge" i don't mean concatenate).
To be specific, it is an application that handles conversations over an USB modem and I need to record them (the streams are the voices for each talking person, already maged to record them separately).
Any clue on how to do it?
Here is my code:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.Path;
public class FileMixer {
Path path1 = Paths.get("/file1.wav");
Path path2 = Paths.get("/file2.wav");
byte[] byte1 = Files.readAllBytes(path1);
byte[] byte2 = Files.readAllBytes(path2);
byte[] out = new byte[byte1.length];
public FileMixer() {
byte[] byte1 = Files.readAllBytes(path1);
byte[] byte2 = Files.readAllBytes(path2);
for (int i=0; i<byte1.Length; i++)
out[i] = (byte1[i] + byte2[i]) >> 1;
}
}
Thanks in advance
To mix sound waves digitally, you add each corresponding data point from the two files together.
for (int i=0; i<source1.length; i++)
result[i] = (source1[i] + source2[i]) >> 1;
In other words, you take item 0 from byte array 1, and item 0 from byte array two, add them together, and put the resulting number in item 0 of your result array. Repeat for the remaining values. To prevent overload, you may need to divide each resulting value by two.
Make sure to merge amplitude data and not just byte data. If your SampleRate is 8: one byte equals one amplitude data. But if it is 16 you need to add two bytes to one short and merge them.
Currently your loading your file like this
byte[] byte1 = Files.readAllBytes(path1);
This will also load your .wav file header into the byte array but you only want to merge actual audio data. Load it like this:
public static ByteBuffer loadFile(File file) throws IOException {
DataInputStream in = new DataInputStream(new FileInputStream(file));
byte[] sound = new byte[in.available() - 44];
in.skipNBytes(44); // skip the header
in.read(sound);
return ByteBuffer.wrap(sound);
}
You can then merge every byte of these Buffers or every two bytes depending on your sample size. I will use 16 as its more common.
public static ByteBuffer mergeAudio(ByteBuffer smaller, ByteBuffer larger) {
// When we merge we will get problems with LittleEndian/BigEndian
// Actually the amplitude data is stored reverse in the .wav fille
// When we extract the amplitude value we need to reverse it to get the actuall
// value
// We can then add up all the amplitude data and divide it by their amount to
// get the mean
// When we save the value we need to reverse it again
// The result will have the size of the larger audio file. In my case its file2
ByteBuffer result = ByteBuffer.allocate(larger.capacity());
while (larger.hasRemaining()) {
// getShort() for SampleSize 16bit get() for 8 bit.
// Reverse the short because of LittleEndian/BigEndian
short sum = Short.reverseBytes(larger.getShort());
int matches = 1;
// check if the smaller file still has content so it needs to merge
if (smaller.hasRemaining()) {
// getShort() for SampleSize 16bit get() for 8 bit
// Reverse the short because of LittleEndian/BigEndian
sum += Short.reverseBytes(smaller.getShort());
matches++;
}
// append the mean of all merged values
// reverse again
result.putShort(Short.reverseBytes((short) (sum / (float) matches)));
}
return result;
}
We now need to create our own .wav file header and append our merged data. Finally we can write the changes to the disk.
public static void saveToFile(File file, byte[] audioData) throws IOException {
int audioSize = audioData.length;
int fileSize = audioSize + 44;
// The stream that writes the audio file to the disk
DataOutputStream out = new DataOutputStream(new FileOutputStream(file));
// Write Header
out.writeBytes("RIFF");// 0-4 ChunkId always RIFF
out.writeInt(Integer.reverseBytes(fileSize));// 5-8 ChunkSize always audio-length +header-length(44)
out.writeBytes("WAVE");// 9-12 Format always WAVE
out.writeBytes("fmt ");// 13-16 Subchunk1 ID always "fmt " with trailing whitespace
out.writeInt(Integer.reverseBytes(16)); // 17-20 Subchunk1 Size always 16
out.writeShort(Short.reverseBytes(audioFormat));// 21-22 Audio-Format 1 for PCM PulseAudio
out.writeShort(Short.reverseBytes(channels));// 23-24 Num-Channels 1 for mono, 2 for stereo
out.writeInt(Integer.reverseBytes(sampleRate));// 25-28 Sample-Rate
out.writeInt(Integer.reverseBytes(byteRate));// 29-32 Byte Rate
out.writeShort(Short.reverseBytes(blockAlign));// 33-34 Block Align
out.writeShort(Short.reverseBytes(sampleSize));// 35-36 Bits-Per-Sample
out.writeBytes("data");// 37-40 Subchunk2 ID always data
out.writeInt(Integer.reverseBytes(audioSize));// 41-44 Subchunk 2 Size audio-length
out.write(audioData);// append the merged data
out.close();// close the stream properly
}
Its important that the two files you want to merge have the same
Channels, SampleSize, SampleRate, AudioFormat
This is how you calculate the header data:
private static short audioFormat = 1;
private static int sampleRate = 44100;
private static short sampleSize = 16;
private static short channels = 2;
private static short blockAlign = (short) (sampleSize * channels / 8);
private static int byteRate = sampleRate * sampleSize * channels / 8;
Here is your working example where I put everything together:
import static java.lang.Math.ceil;
import static java.lang.Math.round;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
public class AudioMerger {
private short audioFormat = 1;
private int sampleRate = 44100;
private short sampleSize = 16;
private short channels = 2;
private short blockAlign = (short) (sampleSize * channels / 8);
private int byteRate = sampleRate * sampleSize * channels / 8;
private ByteBuffer audioBuffer;
private ArrayList<MergeSound> sounds = new ArrayList<MergeSound>();
private ArrayList<Integer> offsets = new ArrayList<Integer>();
public void addSound(double offsetInSeconds, MergeSound sound) {
if (sound.getAudioFormat() != audioFormat)
new RuntimeException("Incompatible AudioFormat");
if (sound.getSampleRate() != sampleRate)
new RuntimeException("Incompatible SampleRate");
if (sound.getSampleSize() != sampleSize)
new RuntimeException("Incompatible SampleSize");
if (sound.getChannels() != channels)
new RuntimeException("Incompatible amount of Channels");
int offset = secondsToByte(offsetInSeconds);
offset = offset % 2 == 0 ? offset : offset + 1;// ensure we start at short when merging
sounds.add(sound);
offsets.add(secondsToByte(offsetInSeconds));
}
public void merge(double durationInSeconds) {
audioBuffer = ByteBuffer.allocate(secondsToByte(durationInSeconds));
for (int i = 0; i < sounds.size(); i++) {
ByteBuffer buffer = sounds.get(i).getBuffer();
int offset1 = offsets.get(i);
// iterate over all sound data to append it
while (buffer.hasRemaining()) {
int position = offset1 + buffer.position();// the global position in audioBuffer
// add the audio data to the vars
short sum = Short.reverseBytes(buffer.getShort());
int matches = 1;
// make sure later entries dont override the previsously merged
// continue only if theres empty audio data
if (audioBuffer.getShort(position) == 0) {
// iterate over the other sounds and check if the need to be merged
for (int j = i + 1; j < sounds.size(); j++) {// set j to i+1 to avoid all previous
ByteBuffer mergeBuffer = sounds.get(j).getBuffer();
int mergeOffset = offsets.get(j);
// check if this soundfile contains data that has to be merged
if (position >= mergeOffset && position < mergeOffset + mergeBuffer.capacity()) {
sum += Short.reverseBytes(mergeBuffer.getShort(position - mergeOffset));
matches++;
}
}
// make sure to cast to float 3/1=1 BUT round(3/1f)=2 for example
audioBuffer.putShort(position, Short.reverseBytes((short) round(sum / (float) matches)));
}
}
buffer.rewind();// So the sound can be added again
}
}
private int secondsToByte(double seconds) {
return (int) ceil(seconds * byteRate);
}
public void saveToFile(File file) throws IOException {
byte[] audioData = audioBuffer.array();
int audioSize = audioData.length;
int fileSize = audioSize + 44;
// The stream that writes the audio file to the disk
DataOutputStream out = new DataOutputStream(new FileOutputStream(file));
// Write Header
out.writeBytes("RIFF");// 0-4 ChunkId always RIFF
out.writeInt(Integer.reverseBytes(fileSize));// 5-8 ChunkSize always audio-length +header-length(44)
out.writeBytes("WAVE");// 9-12 Format always WAVE
out.writeBytes("fmt ");// 13-16 Subchunk1 ID always "fmt " with trailing whitespace
out.writeInt(Integer.reverseBytes(16)); // 17-20 Subchunk1 Size always 16
out.writeShort(Short.reverseBytes(audioFormat));// 21-22 Audio-Format 1 for PCM PulseAudio
out.writeShort(Short.reverseBytes(channels));// 23-24 Num-Channels 1 for mono, 2 for stereo
out.writeInt(Integer.reverseBytes(sampleRate));// 25-28 Sample-Rate
out.writeInt(Integer.reverseBytes(byteRate));// 29-32 Byte Rate
out.writeShort(Short.reverseBytes(blockAlign));// 33-34 Block Align
out.writeShort(Short.reverseBytes(sampleSize));// 35-36 Bits-Per-Sample
out.writeBytes("data");// 37-40 Subchunk2 ID always data
out.writeInt(Integer.reverseBytes(audioSize));// 41-44 Subchunk 2 Size audio-length
out.write(audioData);// append the merged data
out.close();// close the stream properly
}
}

Write "compressed" Array to increase IO performance?

I have an int and float array each of length 220 million (fixed). Now, I want to store/upload those arrays to/from memory and disk. Currently, I am using Java NIO's FileChannel and MappedByteBuffer to solve this. It works fine, but it takes near about 5 seconds (Wall Clock Time) for storing/uploading array to/from memory to disk. Now, I want to make it faster.
Here, I should mention most of those array elements are 0 ( nearly 52 %).
like:
int arr1 [] = { 0 , 0 , 6 , 7 , 1, 0 , 0 ...}
Can anybody help me, is there any nice way to improve speed by not storing or loading those 0's. This can compensated by using Arrays.fill (array , 0).
The following approach requires n / 8 + nz * 4 bytes on disk, where n is the size of the array, and nz the number of non-zero entries. For 52% zero entries, you'd reduce storage size by 52% - 3% = 49%.
You could do:
void write(int[] array) {
BitSet zeroes = new BitSet();
for (int i = 0; i < array.length; i++)
zeroes.set(i, array[i] == 0);
write(zeroes); // one bit per index
for (int i = 0; i < array.length; i++)
if (array[i] != 0)
write(array[y]);
}
int[] read() {
BitSet zeroes = readBitSet();
array = new int[zeroes.length];
for (int i = 0; i < zeroes.length; i++) {
if (zeroes.get(i)) {
// nothing to do (array[i] was initialized to 0)
} else {
array[i] = readInt();
}
}
}
Edit: That you say this is slightly slower implies that the disk is not the bottleneck. You could tune the above approach by writing the bitset as you construct it, so you don't have to write the bitset to memory before writing it to disk. Also, by writing the bitset word by word interspersed with the actual data we can do only a single pass over the array, reducing cache misses:
void write(int[] array) {
writeInt(array.length);
int ni;
for (int i = 0; i < array.length; i = ni) {
ni = i + 32;
int zeroesMap = 0;
for (j = i + 31; j >= i; j--) {
zeroesMap <<= 1;
if (array[j] == 0) {
zeroesMap |= 1;
}
}
writeInt(zeroesMap);
for (j = i; j < ni; j++)
if (array[j] != 0) {
writeInt(array[j]);
}
}
}
}
int[] read() {
int[] array = new int[readInt()];
int ni;
for (int i = 0; i < array.length; i = ni) {
ni = i + 32;
zeroesMap = readInt();
for (j = i; j < ni; j++) {
if (zeroesMap & 1 == 1) {
// nothing to do (array[i] was initialized to 0)
} else {
array[j] = readInt();
}
zeroesMap >>= 1;
}
}
return array;
}
(The preceeding code assumes array.length is a multiple of 32. If not, write the last slice of the array in whatever way you like)
If that doesn't reduce proceccing time either, compression is not the way to go (I don't think any general purpose compression algorithm will be faster than the above).
Depending upon the distribution, consider Run-length Encoding:
Run-length encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. This is most useful on data that contains many such runs.
It is simple ... which is good, and possibly bad, here ;-)
In case you are willing to write the serialization-desirialization code yourself, instead of storing all the zeroes you can store a series of ranges that indicate where those zeros are(with a special marker), together with the actual non-zero data.
So the array in your example: { 0 , 0 , 6 , 7 , 1, 0 , 0 ...}
can be stored as:
%0-1, 6, 7, 1 %5-6
when reading this data, if you hit % it means you have a range in from of you, you read the start and the end and fill an zeroes. Then you go on and see a non #, this means you hit an actual value.
In a sparse array that has large sequences of consecutive values this will yield great compression.
There is a standard compression utils in java: java.util.zip - it's general purpose library but due to sheer availability is an ok solution. Specialized compressions, encoding should be researched, if need arises and I rarely recommend zip as the soultion of choise.
Here is a sample how to handle zip via Deflater/Inflater.
Most people know ZipInput/Output Stream (and esp. Gzip). All of them have downsdes in handling the copy from mem->zlib and esp. GZip which is a total disaster as having CRC32 calling the native code (calling native code removes the ability to optimize and introduces some more performance hits).
Few important notes: do not boost zip compression high, that will kill any performance whatsoever - of course one can experiment and fit their best ratio between CPU and disk activity.
The code also demonstrates one of the real shortcomings of java.util.zip - it doesn't support direct buffers. The support is beyond trivial, yet no one bother to do it. Direct buffers will save few memory copies and reduces the memory footprint.
Last note: there is java version of (j)zlib and it beats the native impl. on compression quite nicely.
package t1;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;
public class ZInt {
private static final int bucketSize = 1<<17;//in real world should not be const, but we bored horribly
static final int zipLevel = 2;//feel free to experiement, higher compression (5+)is likely to be total waste
static void write(int[] a, File file, boolean sync) throws IOException{
byte[] bucket = new byte[Math.min(bucketSize, Math.max(1<<13, Integer.highestOneBit(a.length >>3)))];//128KB bucket
byte[] zipOut = new byte[bucket.length];
final FileOutputStream fout = new FileOutputStream(file);
FileChannel channel = fout.getChannel();
try{
ByteBuffer buf = ByteBuffer.wrap(bucket);
//unfortunately java.util.zip doesn't support Direct Buffer - that would be the perfect fit
ByteBuffer out = ByteBuffer.wrap(zipOut);
out.putInt(a.length);//write length aka header
if (a.length==0){
doWrite(channel, out, 0);
return;
}
Deflater deflater = new Deflater(zipLevel, false);
try{
for (int i=0;i<a.length;){
i = put(a, buf, i);
buf.flip();
deflater.setInput(bucket, buf.position(), buf.limit());
if (i==a.length)
deflater.finish();
//hacking and using bucket here is tempting since it's copied twice but well
for (int n; (n= deflater.deflate(zipOut, out.position(), out.remaining()))>0;){
doWrite(channel, out, n);
}
buf.clear();
}
}finally{
deflater.end();
}
}finally{
if (sync)
fout.getFD().sync();
channel.close();
}
}
static int[] read(File file) throws IOException, DataFormatException{
FileChannel channel = new FileInputStream(file).getChannel();
try{
byte[] in = new byte[(int)Math.min(bucketSize, channel.size())];
ByteBuffer buf = ByteBuffer.wrap(in);
channel.read(buf);
buf.flip();
int[] a = new int[buf.getInt()];
if (a.length==0)
return a;
int i=0;
byte[] inflated = new byte[Math.min(1<<17, a.length*4)];
ByteBuffer intBuffer = ByteBuffer.wrap(inflated);
Inflater inflater = new Inflater(false);
try{
do{
if (!buf.hasRemaining()){
buf.clear();
channel.read(buf);
buf.flip();
}
inflater.setInput(in, buf.position(), buf.remaining());
buf.position(buf.position()+buf.remaining());//simulate all read
for (;;){
int n = inflater.inflate(inflated,intBuffer.position(), intBuffer.remaining());
if (n==0)
break;
intBuffer.position(intBuffer.position()+n).flip();
for (;intBuffer.remaining()>3 && i<a.length;i++){//need at least 4 bytes to form an int
a[i] = intBuffer.getInt();
}
intBuffer.compact();
}
}while (channel.position()<channel.size() && i<a.length);
}finally{
inflater.end();
}
// System.out.printf("read ints: %d - channel.position:%d %n", i, channel.position());
return a;
}finally{
channel.close();
}
}
private static void doWrite(FileChannel channel, ByteBuffer out, int n) throws IOException {
out.position(out.position()+n).flip();
while (out.hasRemaining())
channel.write(out);
out.clear();
}
private static int put(int[] a, ByteBuffer buf, int i) {
for (;buf.hasRemaining() && i<a.length;){
buf.putInt(a[i++]);
}
return i;
}
private static int[] generateRandom(int len){
Random r = new Random(17);
int[] n = new int[len];
for (int i=0;i<len;i++){
n[i]= r.nextBoolean()?0: r.nextInt(1<<23);//limit bounds to have any sensible compression
}
return n;
}
public static void main(String[] args) throws Throwable{
File file = new File("xxx.xxx");
int[] n = generateRandom(3000000); //{0,2,4,1,2,3};
long start = System.nanoTime();
write(n, file, false);
long elapsed = System.nanoTime() - start;//elapsed will be fairer if the sync is true
System.out.printf("File length: %d, for %d ints, ratio %.2f in %.2fms %n", file.length(), n.length, ((double)file.length())/4/n.length, java.math.BigDecimal.valueOf(elapsed, 6) );
int[] m = read(file);
//compare, Arrays.equals doesn't return position, so it sucks/kinda
for (int i=0; i<n.length; i++){
if (m[i]!=n[i]){
System.err.printf("Failed at %d%n",i);
break;
}
}
System.out.printf("All done!");
};
}
Please note, the code is not a proper benchmark!
The delayed replies comes from the fact it was quite boring to code, yet another zip example, sorry

Bit manipulation and output in Java

If you have binary strings (literally String objects that contain only 1's and 0's), how would you output them as bits into a file?
This is for a text compressor I was working on; it's still bugging me, and it'd be nice to finally get it working. Thanks!
Easiest is to simply take 8 consecutive characters, turn them into a byte and output that byte. Pad with zeros at the end if you can recognize the end-of-stream, or add a header with length (in bits) at the beginning of the file.
The inner loop would look something like:
byte[] buffer = new byte[ ( string.length + 7 ) / 8 ];
for ( int i = 0; i < buffer.length; ++i ) {
byte current = 0;
for ( int j = 7; j >= 0; --j )
if ( string[ i * 8 + j ] == '1' )
current |= 1 << j;
output( current );
}
You'll need to make some adjustments, but that's the general idea.
If you're lucky, java.math.BigInteger may do everything for you.
String s = "11001010001010101110101001001110";
byte[] bytes = (new java.math.BigInteger(s, 2)).toByteArray();
This does depend on the byte order (big-endian) and right-aligning (if the number of bits is not a multiple of 8) being what you want but it may be simpler to modify the array afterwards than to do the character conversion yourself.
public class BitOutputStream extends FilterOutputStream
{
private int buffer = 0;
private int bitCount = 0;
public BitOutputStream(OutputStream out)
{
super(out);
}
public void writeBits(int value, int numBits) throws IOException
{
while(numBits>0)
{
numBits--;
int mix = ((value&1)<<bitCount++);
buffer|=mix;
value>>=1;
if(bitCount==8)
align8();
}
}
#Override
public void close() throws IOException
{
align8(); /* Flush any remaining partial bytes */
super.close();
}
public void align8() throws IOException
{
if(bitCount > 0)
{
bitCount=0;
write(buffer);
buffer=0;
}
}
}
And then...
if (nextChar == '0')
{
bos.writeBits(0, 1);
}
else
{
bos.writeBits(1, 1);
}
Assuming the String has a multiple of eight bits, (you can pad it otherwise), take advantage of Java's built in parsing in the Integer.valueOf method to do something like this:
String s = "11001010001010101110101001001110";
byte[] data = new byte[s.length() / 8];
for (int i = 0; i < data.length; i++) {
data[i] = (byte) Integer.parseInt(s.substring(i * 8, (i + 1) * 8), 2);
}
Then you should be able to write the bytes to a FileOutputStream pretty simply.
On the other hand, if you looking for effeciency, you should consider not using a String to store the bits to begin with, but build up the bytes directly in your compressor.

Categories

Resources