Convert a String of any length to a fixed 32 bytes in Java

I want to convert any length of String to byte32 in Java.
Code
String s="9c46267273a4999031c1d0f7e40b2a59233ce59427c4b9678d6c3a4de49b6052e71f6325296c4bddf71ea9e00da4e88c4d4fcbf241859d6aeb41e1714a0e";
//Convert into byte32

From the comments it became clear that you want to reduce the storage space of that string to 32 bytes.
The given string can easily be compressed from the 124 bytes to 62 bytes by doing a hexadecimal conversion.
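The hexadecimal conversion mentioned above can be sketched like this. The class and method names (`HexPacking`, `hexToBytes`) are made up for illustration; each pair of hex digits becomes one byte, so the 124-character string shrinks to 62 bytes.

```java
final class HexPacking {
    // Hypothetical helper: turns pairs of hex digits into single bytes.
    static byte[] hexToBytes(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "9c46267273a4999031c1d0f7e40b2a59233ce59427c4b9678d6c3a4de49b6052e71f6325296c4bddf71ea9e00da4e88c4d4fcbf241859d6aeb41e1714a0e";
        System.out.println(s.length() + " hex digits -> " + hexToBytes(s).length + " bytes");
    }
}
```

Even after this step the data is still almost twice the 32-byte target.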
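However, there is no algorithm, and there never will be one, that can losslessly compress arbitrary data to 32 bytes: there are only 2^256 distinct 32-byte values but unboundedly many possible inputs, so different inputs would have to share the same output. Imagine that it were possible: it would have been implemented, and you would be able to get ZIP files of just 32 bytes for any file you compress.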
So, unfortunately, the answer is: it's not possible.

You cannot convert a string of arbitrary length to a byte array of length 32 without losing data.
Java uses UTF-16 as its string encoding, so in order to store 100% of the string, 1:1, as a fixed-length byte array, you would at first glance be limited to 16 characters.
If you are willing to live with the limitation of 16 characters, byte[] bytes = s.getBytes(); gives you a variable-length byte array, but it uses the platform default charset, so it's best to specify an explicit encoding, e.g. byte[] array2 = str.getBytes(StandardCharsets.UTF_16BE);. Note that plain "UTF-16" prepends a 2-byte byte order mark, which would count against the 32 bytes.
This doesn't completely solve your problem. You will likely still have to check that the byte array doesn't exceed 32 bytes, and come up with strategies for padding and possibly null termination (which may eat into your character budget).
Now, if you don't need the entire UTF-16 string space that Java uses for strings by default, you can get away with longer strings by using other encodings.
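The length check and padding described above might look like the following sketch. It assumes UTF-8 as the encoding and right-padding with zero bytes; both choices, as well as the names `Fixed32` and `toBytes32`, are illustrative assumptions and not something a particular standard is known to require here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Fixed32 {
    static final int SIZE = 32;

    // Encodes s as UTF-8 and right-pads with zero bytes to exactly 32 bytes.
    // Strings whose encoding is longer than 32 bytes are rejected, not truncated.
    static byte[] toBytes32(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        if (encoded.length > SIZE) {
            throw new IllegalArgumentException(
                    "encoded form is " + encoded.length + " bytes, limit is " + SIZE);
        }
        return Arrays.copyOf(encoded, SIZE); // copyOf fills the tail with zeros
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toBytes32("hello")));
    }
}
```

With UTF-8, a plain ASCII string can use all 32 bytes for 32 characters, whereas UTF-16 would only fit 16.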
If this is to be used for any kind of other standard or something (I see references to Ethereum being thrown around), then you will need to follow their standards.
Unless you are writing your own library for dealing with it directly, I highly recommend using a library that already exists, and appears to be well tested, and used.

You can achieve this with the following call:
byte[] bytes = s.getBytes();

Related

Alternative representation of a String

During the run of my program I create a lot of Strings (1,000,000) of up to 700 characters each, and my program eats up a lot of memory. These Strings can contain only R, D, L, U as chars, so I thought that I could represent them differently. I thought about using BitSet, but I am not sure it is more memory efficient. Any ideas?
P.S.: I could also shrink the String by compressing runs of equal chars (RRRRRRDDDD -> R6D4), but I was hoping for a better solution.
As a first step, you could try to switch to char[]. A Java String takes approx. 40 bytes more than the sum of its characters (source), and char[] is considerably more convenient than bit arithmetic.
Even more economical is byte[], since one char requires two bytes, while a byte is, of course, one byte (and still has room for 256 distinct values).
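Going one step beyond byte[], four symbols fit in two bits each, which is the kind of bit packing the question hints at with BitSet. The sketch below is not taken from the answers; the class name `MovePacker` and the code assignment (R=0, D=1, L=2, U=3) are arbitrary assumptions, and the original length has to be kept separately because the last byte may be only partly used.

```java
final class MovePacker {
    // Position in this string is the 2-bit code of the symbol (arbitrary choice).
    private static final String SYMBOLS = "RDLU";

    // Packs four symbols into each byte, lowest bits first.
    static byte[] pack(String moves) {
        byte[] out = new byte[(moves.length() + 3) / 4];
        for (int i = 0; i < moves.length(); i++) {
            int code = SYMBOLS.indexOf(moves.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("unexpected char: " + moves.charAt(i));
            }
            out[i / 4] |= (byte) (code << ((i % 4) * 2));
        }
        return out;
    }

    // Reverses pack(); length is the number of symbols originally packed.
    static String unpack(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> ((i % 4) * 2)) & 0b11;
            sb.append(SYMBOLS.charAt(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String moves = "RRRRRRDDDDLU";
        byte[] packed = pack(moves);
        System.out.println(moves.length() + " chars -> " + packed.length + " bytes");
        System.out.println(unpack(packed, moves.length()));
    }
}
```

A 700-character string would need 175 bytes of payload this way, plus the array header, at the cost of the bit arithmetic the answer warns about.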

In ojdbc6, what does unmarshalCLR and unmarshalUB1 do?

In ojdbc6, an accessor can call the oracle.jdbc.driver.T4CMAREngine's unmarshalCLR method during unmarshaling of results from a database. Inside unmarshalCLR, there is also a call to the unmarshalUB1 method.
What do these two methods do?
It's an Oracle database specific thing relating to their TNS protocol.
A Google search turns up a spec, though I have no idea how accurate or up-to-date it is.
Mentioning CLRs:
A CLR is a byte array in 64-byte blocks. If its length <=64, it is just
length-byte-preceeded and written as native. Null arrays can be written as the
single bytes 0x0 or 0xff. If length >64, first a LNG byte (0xfe) is written,
then the array is written in length-byte-preceeded chunks of 64 bytes (although
the final chunk can be shorter), followed by a 0 byte. A chunk preceeded by a
length of 0xfe is ignored.
Looks like a CLR is an encoded byte array.
A UB1 is simply an unsigned byte (data type length of 1 byte).
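To make the quoted description more concrete, here is a sketch of an encoder that follows it literally. It is based only on the quoted text, not on the actual ojdbc6 driver code, so it may not match what the driver does on the wire; `ClrSketch` and `encode` are hypothetical names. It writes null as 0x0 and does not model the "chunk preceded by 0xfe is ignored" rule. Note that an empty array would come out as the single byte 0x0 as well, the same as null.

```java
import java.io.ByteArrayOutputStream;

final class ClrSketch {
    static final int CHUNK = 64;         // block size from the quoted description
    static final int LONG_MARKER = 0xfe; // the "LNG byte" from the quoted description

    // Encodes data following the quoted description (assumption, not driver code).
    static byte[] encode(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (data == null) {
            out.write(0x00); // null arrays: a single 0x0 byte
            return out.toByteArray();
        }
        if (data.length <= CHUNK) {
            out.write(data.length); // one length byte, then the raw bytes
            out.write(data, 0, data.length);
            return out.toByteArray();
        }
        out.write(LONG_MARKER);
        for (int off = 0; off < data.length; off += CHUNK) {
            int n = Math.min(CHUNK, data.length - off);
            out.write(n); // every chunk carries its own length byte
            out.write(data, off, n);
        }
        out.write(0); // terminating 0 byte
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode(new byte[100]).length + " bytes for a 100-byte array");
    }
}
```

Reading a UB1 back out of such a stream is then just a matter of masking a Java byte with 0xFF to get its unsigned value.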

Java : Char vs String byte size

I was surprised to find that the following code
System.out.println("Character size:"+Character.SIZE/8);
System.out.println("String size:"+"a".getBytes().length);
outputs this:
Character size:2
String size:1
I would assume that a single-character string should take up the same number of bytes as a single char, or more.
In particular, I am wondering the following.
If I have a Java bean with several fields in it, how will its size increase depending on the nature of the fields (Character, String, Boolean, Vector, etc.)? I'm assuming that all Java objects have some (probably minimal) footprint, and that one of the smallest of these footprints would be a single character. To test that basic assumption I started with the above code, and the results of the print statements seem counterintuitive.
Any insights into the way Java stores/serializes characters vs. strings by default would be very helpful.
getBytes() encodes the String using the platform default encoding (which depends on the system and the Java version; in ISO-8859-1 and UTF-8 alike, 'a' takes one byte), while a char always has 2 bytes. Java strings are defined in terms of 2-byte chars; if you want to know more about encoding, read the link by Oded in the question comments.
I would like to say what I think; correct me if I am wrong. You are finding the length of the string, which is correctly shown as 1, as you have only 1 character in the string. length shows the length, not the size. Length and size are two different things.
Check this link: you are finding the number of bytes occupied in the wrong way.
Well, you have found that 1 char in a char array has a size of 2 bytes and that your String is 1 character long, not that it has a size of 1 byte.
The String object in Java consists of:
private final char value[];
private final int offset;
private final int count;
private int hash;
This alone should assure you that the String object is bigger than the char array.
If you want to learn more about object sizes, you can also read about object headers and the multiplicity factor for char arrays. For example here or here.
I want to add some code first and then a bit of explanation:
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) {
        System.out.println("Character size: " + Character.SIZE / 8);
        // "UTF-16" (without BE/LE) prepends a 2-byte byte order mark when encoding.
        final byte[] bytes = "a".getBytes(StandardCharsets.UTF_16);
        System.out.println("String size: " + bytes.length);
        for (byte b : bytes) {
            printByteAsHex(b);
        }
    }

    // Integer.toHexString does not zero-pad, so the byte 0x00 prints as "0".
    static void printByteAsHex(byte b) {
        System.out.print(Integer.toHexString(b & 0xFF));
    }
}
And the output will be:
Character size: 2
String size: 4
feff061
What you are missing is that you are not passing any charset to the getBytes method, so it falls back to the platform default. Probably, you are getting the bytes of the UTF-8 representation of the character 'a'.
But why did we get 4 bytes when we asked for UTF-16? Java uses UTF-16 internally, so we should have gotten 2 bytes, right?
If you examine the output:
feff061
Java actually returned us a BOM: https://en.wikipedia.org/wiki/Byte_order_mark.
So the first 2 bytes, fe ff, are the byte order mark, signalling that the following bytes are UTF-16 big-endian. Please see the Wikipedia page for further information.
The remaining 2 bytes, 00 61 (printed as 061 because Integer.toHexString drops the leading zero), are the 2-byte representation of the character "a". This can be verified at: http://www.fileformat.info/info/unicode/char/0061/index.htm
So yes, a character in Java is 2 bytes, but when you ask for bytes without a specific encoding, you may not always get 2 bytes since different encodings will require different amount of bytes for various characters.
The SIZE of a Character is the storage needed for a char, which is 16 bits. The length of a string (also the length of the underlying char array or byte array) is the number of characters (or bytes), not a size in bits.
That's why you had to divide the size by 8, but not the length. To get the size of the chars in bytes, the string length needs to be multiplied by two.
Also note that you will get other lengths for the byte array if you specify a different encoding. In this case a transformation to a single-byte or variable-width encoding was performed when doing getBytes().
See: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#getBytes(java.nio.charset.Charset)
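As a quick way to see how the byte count depends on the charset, here is a small sketch using only standard charsets; the class name `EncodedSizes` and its helper `size` are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedSizes {
    // Number of bytes the string occupies once encoded with the given charset.
    static int size(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String a = "a";
        String euro = "\u20ac"; // the euro sign, outside ASCII
        System.out.println("a in ISO-8859-1: " + size(a, StandardCharsets.ISO_8859_1));
        System.out.println("a in UTF-8:      " + size(a, StandardCharsets.UTF_8));
        System.out.println("a in UTF-16:     " + size(a, StandardCharsets.UTF_16));   // includes the BOM
        System.out.println("a in UTF-16BE:   " + size(a, StandardCharsets.UTF_16BE)); // no BOM
        System.out.println("euro in UTF-8:   " + size(euro, StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly also removes the dependence on the platform default that made the original result surprising.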

Cassandra = Memory/Encoding-Footprint of Keys (Hash/Bytes[]=>Hex=>UTF16=>Bytes[])

I am trying to understand the implications of using an MD5 hash as a Cassandra key, in terms of memory/storage consumption:
The MD5 hash of my content (in Java) is a byte[] that is 16 bytes long. (16 bytes is from Wikipedia for generic MD5; I am not sure whether the Java implementation also returns 16 bytes.)
Hex-encode this value, to be able to print it in a human-readable format => 1 byte becomes 2 hex digits.
I have to represent every hex digit as a "character" in Java => the result is two string characters per byte (for example, "FF" is a string of length/size = 2).
Java uses UTF-16 => so every string character is encoded with two bytes. "FF" would require 2x2 bytes?
Conclusion => the MD5 hash in byte format is 16 bytes, but represented as a Java hex UTF-16 string it consumes 16x2x2 = 64 bytes (in memory)!? Is this correct?
What is the storage consumption in Cassandra when using this as a row key?
If I had directly used the byte array from the hash function, I would assume it consumes 16 bytes in Cassandra?
But if I use the hex string representation (as noted above), can Cassandra "compress" it to 16 bytes, or will it also take 64 bytes in Cassandra? I assume 64 bytes in Cassandra; is this correct?
What kind of keys do you use? Do you use the output of a hash function directly, or do you first encode it into a hex string and then use the string?
(In MySQL, whenever I used a hash key, I always used the hex string representation of it, so it is directly readable in the MySQL tools and in the whole application. But I now realize it wastes storage?)
Maybe my thinking is completely incorrect; in that case it would be kind of you to explain where I am wrong.
Thanks very much!
jens
Correct on both counts: byte[] would be 16 bytes, utf16-as-hex would be 64.
In 0.8, Cassandra has key metadata so you can tell it "this key is a byte[]" and it will display in hex in the cli.
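The arithmetic in the question can be checked with a few lines of Java. This sketch only measures the sizes of the encoded forms in Java; it says nothing about how Cassandra lays keys out on disk, and the class name `KeySizes` and its helpers are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class KeySizes {
    // MD5 digest of the UTF-8 bytes of the content.
    static byte[] md5(String content) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Lower-case hex string, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] digest = md5("some content");
        String hex = toHex(digest);
        System.out.println("raw digest:      " + digest.length + " bytes");
        System.out.println("hex characters:  " + hex.length());
        System.out.println("hex as UTF-16BE: " + hex.getBytes(StandardCharsets.UTF_16BE).length + " bytes");
        System.out.println("hex as UTF-8:    " + hex.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```

The 64-byte figure applies to the UTF-16 form; the same hex string encoded as UTF-8 takes one byte per digit.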

Reading a UTF-8 String from a ByteBuffer where the length is an unsigned int

I am trying to read a UTF-8 string via a java.nio.ByteBuffer. The size is an unsigned int, which, of course, Java doesn't have. I have read the value into a long so that I have the value.
The next issue I have is that I cannot create an array of bytes with the long, and casting the long back to an int will cause it to be signed.
I also tried using limit() on the buffer, but again it works with int, not long.
The specific thing I am doing is reading the UTF-8 strings out of a class file, so the buffer has more in it than just the UTF-8 string.
Any ideas on how to read a UTF-8 string that has a potential length of an unsigned int from a ByteBuffer?
EDIT:
Here is an example of the issue.
SourceDebugExtension_attribute {
    u2 attribute_name_index;
    u4 attribute_length;
    u1 debug_extension[attribute_length];
}
attribute_name_index
The value of the attribute_name_index item must be a valid index into the constant_pool table. The constant_pool entry at that index must be a CONSTANT_Utf8_info structure representing the string "SourceDebugExtension".
attribute_length
The value of the attribute_length item indicates the length of the attribute, excluding the initial six bytes. The value of the attribute_length item is thus the number of bytes in the debug_extension[] item.
debug_extension[]
The debug_extension array holds a string, which must be in UTF-8 format. There is no terminating zero byte.
The string in the debug_extension item will be interpreted as extended debugging information. The content of this string has no semantic effect on the Java Virtual Machine.
So, from a technical point of view, it is possible to have a string in the class file that is the full u4 (unsigned, 4 bytes) in length.
This won't be an issue if there is a limit to the size of a UTF-8 string (I am no UTF-8 expert, so perhaps there is such a limit).
I could just punt on it and go with the reality that there is not going to be a String that long...
Unless your array of bytes is more than 2GB (the largest positive value of a Java int), you won't have a problem with casting the long back into a signed int.
If your array of bytes needs to be more than 2GB in length, you're doing it wrong, not least because that's way more than the default maximum heapsize of the JVM...
Having a signed int won't be your main problem. Say you had a String which was 4 billion in length. You would need a ByteBuffer which is at least 4 GB and a byte[] which is at least 4 GB. When you convert this to a String, you need at least 8 GB (2 bytes per character) and a StringBuilder to build it (of at least 8 GB).
All up, you need 24 GB to process 1 String. Even if you have a lot of memory, you won't get many Strings of this size.
Another approach is to treat the length as signed and, if it only fits as unsigned, treat it as an error, as you won't have enough memory to process the String in any case. Even to handle a String which is 2 billion (2^31-1) in length you will need 12 GB to convert it to a String this way.
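The "treat an oversized length as an error" approach could be sketched as follows. The class and method names are hypothetical, and the sketch assumes the buffer is positioned at the u4 length field and uses the default big-endian byte order of ByteBuffer. It decodes with standard UTF-8, which is an assumption based on the quoted attribute description.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class U4String {
    // Reads a u4 length followed by that many UTF-8 bytes.
    static String readU4PrefixedUtf8(ByteBuffer buf) {
        // Mask to recover the unsigned value from Java's signed int.
        long length = buf.getInt() & 0xFFFFFFFFL;
        if (length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("length " + length + " exceeds what a Java array can hold");
        }
        if (length > buf.remaining()) {
            throw new IllegalArgumentException("buffer holds fewer than " + length + " bytes");
        }
        byte[] bytes = new byte[(int) length]; // safe: checked against Integer.MAX_VALUE above
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length).put(payload).flip();
        System.out.println(readU4PrefixedUtf8(buf));
    }
}
```

The buffer is left positioned just after the string, so reading the rest of the class file can continue from there.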
Java arrays use a (Java, i.e. signed) int for access as per the language spec, so it's impossible to have a String (which is backed by a char array) longer than Integer.MAX_VALUE.
But even that much is way too much to be processing in one chunk - it'll totally kill performance and make your program fail with an OutOfMemoryError on most machines if a sufficiently large String is ever encountered.
What you should do is process any string in chunks of a sensible size, say a few megs at a time. Then there's no practical limit on the size you can deal with.
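One way to process a large UTF-8 region in chunks is to drive a CharsetDecoder with a small, reusable CharBuffer. This is a sketch under assumptions: the names `ChunkedUtf8` and `decodeInChunks` are made up, the whole encoded region is assumed to be available in the ByteBuffer (for example a memory-mapped file), and malformed input is reported as an IllegalArgumentException.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

final class ChunkedUtf8 {
    // Decodes the remaining bytes of in, handing at most chunkChars chars at a time to sink.
    // The sink must consume the chunk before returning, because the buffer is reused.
    static void decodeInChunks(ByteBuffer in, int chunkChars, Consumer<CharBuffer> sink) {
        if (chunkChars < 2) {
            throw new IllegalArgumentException("chunk must hold at least one surrogate pair");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(chunkChars);
        try {
            while (true) {
                CoderResult result = decoder.decode(in, out, true);
                if (result.isError()) {
                    result.throwException(); // malformed or unmappable input
                }
                out.flip();
                if (out.hasRemaining()) {
                    sink.accept(out);
                }
                out.clear();
                if (result.isUnderflow()) {
                    break; // all input consumed
                }
            }
            decoder.flush(out); // no pending output for UTF-8, kept for completeness
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer in = ByteBuffer.wrap("a fairly short demo string".getBytes(StandardCharsets.UTF_8));
        decodeInChunks(in, 8, chunk -> System.out.println("[" + chunk + "]"));
    }
}
```

Only one chunk of characters is on the heap at any time, so the total length of the encoded region no longer matters for memory use.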
I guess you could implement CharSequence on top of a ByteBuffer. That would allow you to keep your "String" from turning up on the heap, although most utilities that deal with characters actually expect a String. And even then, there is actually a limit on CharSequence as well. It expects the size to be returned as an int.
(You could theoretically create a new version of CharSequence that returns the size as a long, but then there's nothing in Java that would help you in dealing with that CharSequence. Perhaps it would be useful if you would implement subSequence(...) to return an ordinary CharSequence.)
