Fast ByteBuffer to CharBuffer or char[] - java

What is the fastest method to convert a java.nio.ByteBuffer a into a (newly created) CharBuffer b or char[] b?
It is important that a[i] == b[i]. That is, a[i] and a[i+1] should not be combined into one value b[j] (which is what getChar(i) would do); instead each byte should be "spread" into its own char.
byte a[] = { 1, 2, 3, 125, 126, 127, -128, -127, -126 }; // each a byte (which is signed)
char b[] = { 1, 2, 3, 125, 126, 127,  128,  129,  130 }; // each a char (which is unsigned)
Note that byte -128 has the same (lower 8) bits as char 128. Therefore I assume the "best" interpretation is the one noted above, because the bits are the same.
After that I also need the reverse translation: the most efficient way to get a char[] or java.nio.CharBuffer back into a java.nio.ByteBuffer.

So, what you want is to convert using the encoding ISO-8859-1.
I don't claim anything about efficiency, but at least it is quite short to write:
CharBuffer result = Charset.forName("ISO-8859-1").decode(byteBuffer);
The other direction would be:
ByteBuffer result = Charset.forName("ISO-8859-1").encode(charBuffer);
Please measure this against other solutions. (To be fair, the Charset.forName lookup should be excluded from the measurement and done only once, not again for each buffer.)
From Java 7 on there also is the StandardCharsets class with pre-instantiated Charset instances, so you can use
CharBuffer result = StandardCharsets.ISO_8859_1.decode(byteBuffer);
and
ByteBuffer result = StandardCharsets.ISO_8859_1.encode(charBuffer);
instead. (These lines do the same as the ones before; the lookup is just easier, there is no risk of mistyping the name, and there is no need to catch the impossible exceptions.)
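As a quick sanity check, here is a minimal round-trip sketch built on these one-liners; the class name SpreadDemo and the printed indices are illustrative additions, not from the original answers:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class SpreadDemo {
    public static void main(String[] args) {
        byte[] a = { 1, 2, 3, 125, 126, 127, -128, -127, -126 };

        // byte -> char: each byte becomes exactly one char, values 128..255 preserved
        CharBuffer chars = StandardCharsets.ISO_8859_1.decode(ByteBuffer.wrap(a));
        System.out.println((int) chars.get(6)); // absolute get: prints 128, not -128

        // char -> byte: the inverse mapping restores the original bytes
        ByteBuffer back = StandardCharsets.ISO_8859_1.encode(chars);
        System.out.println(back.get(6)); // prints -128 again
    }
}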

I would agree with @Ishtar's suggestion to avoid converting to a new structure at all and to convert only as you need it.
However, if you have a heap ByteBuffer, you can do:
ByteBuffer bb = ...
byte[] array = bb.array();
char[] chars = new char[bb.remaining()];
for (int i = 0; i < chars.length; i++)
    chars[i] = (char) (array[bb.arrayOffset() + bb.position() + i] & 0xFF); // mask to avoid sign extension
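The question also asks for the reverse direction; a minimal sketch of the manual equivalent (assuming every char holds a value in 0..255, as produced above) would be:
char[] chars = ...
byte[] bytes = new byte[chars.length];
for (int i = 0; i < chars.length; i++)
    bytes[i] = (byte) chars[i]; // narrowing conversion keeps the low 8 bits
ByteBuffer bb = ByteBuffer.wrap(bytes);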

Aside from deferring creation of CharBuffer, you may be able to get by without one.
If the code that uses the data as characters does not strictly need a CharBuffer or char[], just do a simple on-the-fly conversion: use ByteBuffer.get() (relative or absolute) and convert to char. Note: as pointed out, you MUST unfortunately mask explicitly, otherwise values 128-255 are sign-extended to incorrect values (0xFF80 - 0xFFFF); the mask is not needed for 7-bit ASCII.
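A minimal sketch of that on-the-fly style, where bb is the ByteBuffer and process(char) is a stand-in for whatever the consuming code actually does:
while (bb.hasRemaining()) {
    char c = (char) (bb.get() & 0xFF); // mask, or values 128..255 come out wrong
    process(c); // hypothetical consumer
}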

Related

Basic arithmetic on two byte arrays in Java without BigInteger

I have two byte arrays that represent unsigned 256-bit values, and I want to perform simple arithmetic operations on them like ADD, SUB, DIV, MUL and EXP. Is there a way to perform these directly on the byte arrays? Currently I convert these byte array values to a BigInteger and then perform the calculations, but I suspect this is costing me performance. How would you do this to get the fastest results?
For example, this is my current add-function:
// Both byte arrays are length 32 and represent unsigned 256-bit values
public void add(byte[] data1, byte[] data2) {
    BigInteger value1 = new BigInteger(1, data1);
    BigInteger value2 = new BigInteger(1, data2);
    BigInteger result = value1.add(value2);
    byte[] bytes = result.toByteArray();
    ByteBuffer buffer = ByteBuffer.allocate(32);
    System.arraycopy(bytes, 0, buffer.array(), 32 - bytes.length, bytes.length);
    this.buffer = buffer.array();
}
I don't think that there is much benefit from working on byte[] directly rather than using BigInteger, but to satisfy your curiosity here is an example of how to add two byte arrays of size 32:
public static byte[] add(byte[] data1, byte[] data2) {
    if (data1.length != 32 || data2.length != 32)
        throw new IllegalArgumentException();
    byte[] result = new byte[32];
    for (int i = 31, overflow = 0; i >= 0; i--) {
        int v = (data1[i] & 0xff) + (data2[i] & 0xff) + overflow; // unsigned byte addition
        result[i] = (byte) v;
        overflow = v >>> 8; // carry into the next more significant byte
    }
    return result;
}
Note that it is possible to use one of the input arrays as the target for the result. However, don't be surprised if such reuse even has a negative impact on performance. On today's systems there are no simple answers to "how do I speed this up" anymore…
For treating this as a big unsigned number, a byte[] isn't an ideal representation. Consider, for example, that to add two of these numbers you have to loop over the two arrays, adding each byte (plus the carry from the previous byte) and storing the resulting byte back somewhere.
BigInteger internally represents the value in a manner suitable for the operations it provides, so its operations will very likely be at least as good as you can do with byte[]. A slight drawback in terms of performance might be that BigInteger is immutable.
Performance-wise, a simple, mutable holder object consisting of 4 long members would probably do best:
class My256BitNumber {
    long l0;
    long l1;
    long l2;
    long l3;
    public void add(My256BitNumber arg) {
        //...
    }
}
That would allow you to bypass the overhead of object creation (due to being mutable), as well as any potential array access overhead (like array index bounds checks).
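For illustration, here is one hedged sketch of what the elided add could look like, assuming l0 holds the most significant 64 bits; the carryOut helper (a standard full-adder carry formula) is an addition for this sketch, not part of the original answer:
class My256BitNumber {
    long l0, l1, l2, l3; // l0 = most significant 64 bits

    // carry-out (0 or 1) of the unsigned addition a + b (+ carry-in), given its result sum
    private static long carryOut(long a, long b, long sum) {
        return ((a & b) | ((a | b) & ~sum)) >>> 63;
    }

    public void add(My256BitNumber arg) {
        long s3 = l3 + arg.l3;
        long c3 = carryOut(l3, arg.l3, s3);
        long s2 = l2 + arg.l2 + c3;
        long c2 = carryOut(l2, arg.l2, s2);
        long s1 = l1 + arg.l1 + c2;
        long c1 = carryOut(l1, arg.l1, s1);
        l0 += arg.l0 + c1; // overflow past 256 bits wraps, like the byte[] version
        l1 = s1; l2 = s2; l3 = s3;
    }
}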
But considering that none of the operations are trivial to implement, just make use of BigInteger. It combines reasonable performance, with reasonable simplicity of use, and most importantly - is a tested, working solution.
Whether rolling your own implementation is worth it depends on your use case. Considering you're asking whether one can get better performance than BigInteger: yes, you can, BUT at a severe expense in code complexity.

Efficiently convert Java string into null-terminated byte[] representing a C string? (ASCII)

I would like to transform a Java String str into byte[] b with the following characteristics:
b is a valid C string (i.e. it has b.length == str.length() + 1 and b[str.length()] == 0).
the characters in b are obtained by converting the characters in str to 8-bit ASCII characters.
What is the most efficient way to do this — preferably an existing library function? Sadly, str.getBytes("ISO-8859-1") doesn't meet my first requirement...
// do this once at setup
CharsetEncoder enc = Charset.forName("ISO-8859-1").newEncoder();
// for each string
enc.reset(); // needed when the encoder is reused for another string
int len = str.length();
byte[] b = new byte[len + 1];
ByteBuffer bbuf = ByteBuffer.wrap(b);
enc.encode(CharBuffer.wrap(str), bbuf, true);
// you might want to ensure that bbuf.position() == len
b[len] = 0;
This requires allocating a couple of wrapper objects, but does not copy the string characters twice.
You can use str.getBytes("ISO-8859-1") with a little trick at the end:
byte[] stringBytes = str.getBytes("ISO-8859-1");
byte[] ntBytes = new byte[stringBytes.length + 1];
System.arraycopy(stringBytes, 0, ntBytes, 0, stringBytes.length);
arraycopy is relatively fast, as it can use native tricks and optimizations in many cases. The new array is zero-filled everywhere we didn't overwrite it (basically just the last byte).
ntBytes is the array you need.

Left shift unsigned byte, a better way?

I have an array of bytes (because unsigned byte isn't an option) and need to pack 4 of them into a 32-bit int. I'm using this:
byte rdbuf[] = new byte[fileLen + 1];
int i = 0;
int val = (rdbuf[i++] & 0xff) | ((rdbuf[i++] << 8) & 0xff00) | ((rdbuf[i++] << 16) & 0xff0000) | ((rdbuf[i++] << 24) & 0xff000000);
If I don't do all the logical ANDs, it sign-extends the bytes, which is clearly not what I want.
In C this would be a no-brainer. Is there a better way in Java?
You do not have to do this, you can use a ByteBuffer:
int i = ByteBuffer.wrap(rdbuf).order(ByteOrder.LITTLE_ENDIAN).getInt();
If you have many ints to read, the code becomes:
ByteBuffer buf = ByteBuffer.wrap(rdbuf).order(ByteOrder.LITTLE_ENDIAN);
while (buf.remaining() >= 4) // at least four bytes left
    i = buf.getInt();
See the ByteBuffer Javadoc. Recommended in any situation where binary data has to be dealt with (whether you read or write such data). It can do little endian, big endian and even native ordering. (NOTE: big endian by default.)
(edit: @PeterLawrey rightly mentions that this looks like little endian data, fixed the code extract -- also, see his answer for how to read the contents of a file directly into a ByteBuffer)
NOTES:
ByteOrder has a static method called .nativeOrder(), which returns the byte order used by the underlying architecture;
a ByteBuffer has a builtin offset; the current offset can be queried using .position(), and modified using .position(int); .remaining() will return the number of bytes left to read from the current offset until the end;
there are relative methods, which read from/write at the buffer's current offset, and absolute methods, which read from/write at an offset you specify (a small sketch follows below).
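A small sketch of the difference between relative and absolute access, assuming a buffer holding two little-endian ints:
ByteBuffer buf = ByteBuffer.wrap(new byte[] { 1, 0, 0, 0, 2, 0, 0, 0 })
                           .order(ByteOrder.LITTLE_ENDIAN);
int first = buf.getInt();   // relative: reads at position 0, advances position to 4
int second = buf.getInt(4); // absolute: reads at index 4, position stays at 4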
Instead of reading into a byte[] which you have to wrap with a ByteBuffer that does the shift/mask for you, you can use a direct ByteBuffer, which avoids all this overhead.
FileChannel fc = new FileInputStream(filename).getChannel();
ByteBuffer bb = ByteBuffer.allocateDirect((int) fc.size()).order(ByteOrder.nativeOrder());
fc.read(bb);
bb.flip();
while (bb.remaining() > 0) {
    int n = bb.getInt();     // grab 32-bit from direct memory without shift/mask etc.
    short s = bb.getShort(); // grab 16-bit from direct memory without shift/mask etc.
    // get a String with an unsigned 16 bit length followed by ISO-8859-1 encoding.
    int len = bb.getShort() & 0xFFFF;
    StringBuilder sb = new StringBuilder(len);
    for (int i = 0; i < len; i++) sb.append((char) (bb.get() & 0xFF));
    String text = sb.toString();
}
fc.close();

How to convert a byte into bits?

I have one array of bytes. I want to access each of the bytes and get its equivalent binary value (of 8 bits) so as to carry out further operations on it. I've heard of BitSet, but is there any other way of dealing with this?
Thank you.
If you just need the String representation in binary, you can simply use Integer.toString() with the optional second parameter set to 2 for binary.
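For a byte you will want to mask off the sign extension first; a one-line sketch (the variable b is assumed to be one of your bytes):
String bits = Integer.toString(b & 0xFF, 2); // e.g. (byte) -126 -> "10000010"
Note this does not zero-pad small values; the String.format trick further down gives a fixed 8-bit width.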
To perform general bit twiddling on any integral type, you have to use logical and bitshift operators.
// tests if a bit is set in value
boolean isSet(byte value, int bit) {
    return (value & (1 << bit)) != 0;
}
// returns a byte with the required bit set
byte set(byte value, int bit) {
    return (byte) (value | (1 << bit)); // cast needed: | promotes the operands to int
}
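A quick usage sketch of these two helpers (the values are illustrative):
byte flags = 0;
flags = set(flags, 3);        // flags is now 0b0000_1000, i.e. 8
boolean on = isSet(flags, 3); // true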
You might find something along the lines of what you're looking for in the Guava Primitives package.
Alternatively, you might want to write something like
public boolean[] convert(byte... bs) {
    boolean[] result = new boolean[Byte.SIZE * bs.length];
    int offset = 0;
    for (byte b : bs) {
        for (int i = 0; i < Byte.SIZE; i++) result[i + offset] = (b >> i & 0x1) != 0x0;
        offset += Byte.SIZE;
    }
    return result;
}
That's not tested, but the idea is there. There are also easy modifications to the loops/assignment to return an array of something else (say, int or long).
BitSet.valueOf(byte[] bytes)
You may have to take a look at the source code to see how it's implemented if you are not using Java 7.
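A short usage sketch; note that BitSet.valueOf uses little-endian bit order, so bit 0 is the least significant bit of the first byte:
BitSet bits = BitSet.valueOf(new byte[] { 5 }); // 5 = 0b0000_0101
boolean b0 = bits.get(0); // true  (LSB)
boolean b1 = bits.get(1); // false
boolean b2 = bits.get(2); // true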
Java has bitwise operators. From the tutorial:
The Java programming language also provides operators that perform bitwise and bit shift operations on integral types. The operators discussed in this section are less commonly used. Therefore, their coverage is brief; the intent is to simply make you aware that these operators exist.
The unary bitwise complement operator "~" inverts a bit pattern; it can be applied to any of the integral types, making every "0" a "1" and every "1" a "0". For example, a byte contains 8 bits; applying this operator to a value whose bit pattern is "00000000" would change its pattern to "11111111".
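A one-line sketch of that (note that ~ promotes its operand to int, so a cast is needed to get a byte back):
byte b = 0;           // bit pattern 00000000
byte inv = (byte) ~b; // bit pattern 11111111, i.e. -1 as a signed byte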
A byte value IS integral; you can check bit state using masking operations.
The least significant bit corresponds to the mask 1 or 0x1, the next bit corresponds to 0x2, etc.
byte b = 3;
if ((b & 0x1) == 0x1) {
    // LSB is on
} else {
    // LSB is off
}
byte[] ar;
byte b = ar[0]; // an 8-bit value, if I understood your question correctly :)
Well, I think I get what you mean. A rather substantial limitation is that it doesn't work on negative numbers. However, assuming you're not using it to read file input, you might still be able to use it.
public static ArrayList<Boolean> toBitArr(byte[] bytearr) {
    ArrayList<Boolean> bitarr = new ArrayList<Boolean>();
    ArrayList<Boolean> temp = new ArrayList<Boolean>();
    for (byte b : bytearr) {
        // caveat from above: emits no leading zero bits, and does not
        // terminate for negative b, since >> sign-extends
        while (Math.abs(b) > 0) {
            temp.add((b % 2) == 1);
            b = (byte) (b >> 1);
        }
        Collections.reverse(temp);
        bitarr.addAll(temp);
        temp.clear();
    }
    return bitarr;
}
Here is a sample, I hope it is useful for you!
DatagramSocket socket = new DatagramSocket(6160, InetAddress.getByName("0.0.0.0"));
socket.setBroadcast(true);
while (true) {
    byte[] recvBuf = new byte[26];
    DatagramPacket packet = new DatagramPacket(recvBuf, recvBuf.length);
    socket.receive(packet);
    String bitArray = toBitArray(recvBuf);
    System.out.println(Integer.parseInt(bitArray.substring(0, 8), 2));  // convert first byte binary to decimal
    System.out.println(Integer.parseInt(bitArray.substring(8, 16), 2)); // convert second byte binary to decimal
}

public static String toBitArray(byte[] byteArray) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < byteArray.length; i++) {
        sb.append(String.format("%8s", Integer.toBinaryString(byteArray[i] & 0xFF)).replace(' ', '0'));
    }
    return sb.toString();
}

Efficient way to calculate byte length of a character, depending on the encoding

What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding would only be known at runtime. In UTF-8, for example, characters have a variable byte length, so each character needs to be determined individually. So far I've come up with this:
char c = getCharSomehow();
String encoding = getEncodingSomehow();
// ...
int length = new String(new char[] { c }).getBytes(encoding).length;
But this is clumsy and inefficient in a loop, since a new String needs to be created every time. I can't find any more efficient way in the Java API. There's String#valueOf(char), but according to its source it does basically the same as above. I imagine that this can be done with bitwise operations like bit shifting, but that's my weak point and I'm unsure how to take the encoding into account here :)
If you question the need for this, check this topic.
Update: the answer from @Bkkbrad is technically the most efficient:
char c = getCharSomehow();
String encoding = getEncodingSomehow();
CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
// ...
int length = encoder.encode(CharBuffer.wrap(new char[] { c })).limit();
However, as @Stephen C pointed out, there are more problems with this. There may for example be combined/surrogate characters which need to be taken into account as well. But that's another problem which needs to be solved in the step before this one.
Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.
On my system, the following code takes 25 seconds to encode 100 million single characters:
Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        int len = new String(array).getBytes(utf8).length;
    }
}
However, the following code does the same thing in under 4 seconds:
Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        output.clear();
        input.clear();
        encoder.encode(input, output, false);
        int len = output.position();
    }
}
Edit: Why do haters gotta hate?
Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:
Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = // allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);
int limit = input.limit();
while (input.position() < limit) {
    output.clear();
    input.mark();
    input.limit(Math.min(input.position() + 2, limit)); // look at most one surrogate pair ahead
    if (Character.isHighSurrogate(input.get()) && !Character.isLowSurrogate(input.get())) {
        // Malformed surrogate pair; do something!
    }
    input.limit(input.position());
    input.reset();
    encoder.encode(input, output, false);
    int encodedLen = output.position();
}
If you can guarantee that the input is well-formed UTF-8, then there's no reason to find code points at all. One of the strengths of UTF-8 is that you can detect the start of a code point from any position in the string. Simply search backwards until you find a byte such that (b & 0xc0) != 0x80, and you've found another character. Since a UTF-8 encoded code point is at most 4 bytes (6 in the obsolete original definition), you can copy the intermediate bytes into a fixed-length buffer.
Edit: I forgot to mention, even if you don't go with this strategy, it is not sufficient to use a Java "char" to store arbitrary code points since code point values can exceed 0xffff. You need to store code points in an "int".
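A minimal sketch of that backwards scan, assuming bytes holds well-formed UTF-8 and i is any index into it:
int start = i;
while ((bytes[start] & 0xC0) == 0x80) start--; // skip continuation bytes (10xxxxxx)
// bytes[start] is now the first byte of the code point containing index i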
It is possible that an encoding scheme could encode a given character as a variable number of bytes, depending on what comes before and after it in the character sequence. The byte length you get from encoding a single-character String is therefore not the whole answer.
(For example, you could theoretically receive baudot / teletype characters encoded as 4 characters every 3 bytes, or you could theoretically treat UTF-16 plus a stream compressor as an encoding scheme. Yes, it is all a bit implausible, but ...)
Try Charset.forName("UTF-8").encode("string").limit(); it might be a bit more efficient, maybe not.
