Query on reading bytes from "UTF-8" world to Java "char"

With the below code snippet given in this link,
byte[] bytes = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
0x60, (byte)0xA8, 0x59, 0x7D, 0x00, 0x21}; // "Hi,您好!"
Charset charset = Charset.forName("UTF-8");
// Encode from UCS-2 to UTF-8
// Create a ByteBuffer by wrapping a byte array
ByteBuffer bb = ByteBuffer.wrap(bytes);
// Create a CharBuffer from a view of this ByteBuffer
CharBuffer cb = bb.asCharBuffer();
The documentation for wrap() says, "The new buffer will be backed by the given byte array." So no conversion from bytes to any other format happens here; the byte array is simply placed in a buffer.
Can you please help me understand what exactly bb.asCharBuffer() does in the above code? cb is similar to an array of characters. Because char is UTF-16 in Java, does asCharBuffer() treat every 2 bytes in bb as one char? Is this the right approach? If not, please help me with the right approach.
Edit:
I tried this program as recommended by Meisch below,
byte[] bytes = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
0x60, (byte)0xA8, 0x59, 0x7D, 0x00, 0x21}; // "Hi,您好!"
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer bb = ByteBuffer.wrap(bytes);
CharBuffer cb = decoder.decode(bb);
which gives exception
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at java.nio.charset.CharsetDecoder.decode(Unknown Source)
at TestCharSet.main(TestCharSet.java:16)
Please help me, I am stuck here!
Note: I am using Java 1.6.

You ask: “Because char is UTF-16 in Java, using asCharBuffer() method, are we considering every 2 bytes in bb as char?”
The answer to that question is yes. Your understanding is correct.
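As a rough illustration of that point, the sketch below compares the asCharBuffer() view with an explicit UTF-16BE decode of the same array. The class and method names are invented for this example; only the byte array comes from the question.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class AsCharBufferDemo {
    // The byte array from the question: two bytes per character, high byte first.
    static final byte[] BYTES = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
            0x60, (byte) 0xA8, 0x59, 0x7D, 0x00, 0x21};

    // View the bytes as chars: nothing is decoded, each pair of bytes becomes one char.
    public static String viaCharBufferView(byte[] bytes) {
        CharBuffer cb = ByteBuffer.wrap(bytes).asCharBuffer();
        return cb.toString();
    }

    // Decode the same bytes explicitly as UTF-16BE (works on Java 6).
    public static String viaUtf16Decode(byte[] bytes) {
        return new String(bytes, Charset.forName("UTF-16BE"));
    }

    public static void main(String[] args) {
        String viewed = viaCharBufferView(BYTES);
        String decoded = viaUtf16Decode(BYTES);
        System.out.println(viewed.equals(decoded)); // true
        System.out.println(viewed.length());        // 6
    }
}
```

Both paths give the same six chars for this input because a freshly wrapped ByteBuffer is big-endian and none of the characters needs a surrogate pair.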
Your next question is: “Is this the right approach?”
If you are just trying to demonstrate how the ByteBuffer, CharBuffer and Charset classes work, it's acceptable.
However, when you are coding an application, you will never write code like that. To begin with, there is no need for a byte array; you can represent the characters as a literal String:
String s = "Hi,\u60a8\u597d!";
If you want to convert the string to UTF-8 bytes, you can simply do this:
byte[] encodedBytes = s.getBytes(StandardCharsets.UTF_8);
If you're still using Java 6, you would do this instead:
byte[] encodedBytes = s.getBytes("UTF-8");
Update: Your byte array represents chars in the UTF-16BE (big-endian) encoding. Specifically, your array has exactly two bytes per character. That is not a valid UTF-8 encoded byte sequence, which is why you're getting the MalformedInputException.
When characters are encoded as UTF-8 bytes, each character is represented with 1 to 4 bytes. For your second code fragment to work, the array must be:
byte[] bytes = {
0x48, 0x69, 0x2c, // ASCII chars are 1 byte each
(byte) 0xe6, (byte) 0x82, (byte) 0xa8, // U+60A8
(byte) 0xe5, (byte) 0xa5, (byte) 0xbd, // U+597D
0x21
};
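To check those byte values yourself, a small sketch like the following encodes the string and prints the result in hex. The class name and the hex() helper are made up for this example.

```java
import java.nio.charset.Charset;

class Utf8EncodeDemo {
    // Encode a String as UTF-8 bytes (String.getBytes(Charset) exists since Java 6).
    public static byte[] toUtf8(String s) {
        return s.getBytes(Charset.forName("UTF-8"));
    }

    // Render bytes as lowercase hex, two digits per byte.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 48692ce682a8e5a5bd21
        System.out.println(hex(toUtf8("Hi,\u60a8\u597d!")));
    }
}
```

The four ASCII characters take one byte each and the two Chinese characters take three bytes each, for ten bytes in total.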
When converting from bytes to chars, my earlier statement still applies: You don't need ByteBuffer or CharBuffer or Charset or CharsetDecoder. You can use those classes, but usually it's more succinct to just create a String:
String s = new String(bytes, "UTF-8");
If you want a CharBuffer, just wrap the String:
CharBuffer cb = CharBuffer.wrap(s);
You may be wondering when it is appropriate to use a CharsetDecoder directly. You would do that if the bytes are coming from a source which is not under your control, and you have good reason to believe it may not contain properly UTF-8 encoded bytes. Using an explicit CharsetDecoder allows you to customize how invalid bytes will be handled.
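A minimal sketch of that kind of customization, assuming you want invalid bytes replaced with U+FFFD rather than reported as an exception. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class LenientUtf8Decode {
    // A decoder from newDecoder() reports malformed input by default,
    // so decode() throws when the bytes are not valid UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder strict = Charset.forName("UTF-8").newDecoder();
        try {
            strict.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Replace malformed or unmappable input with the decoder's replacement
    // string (U+FFFD for UTF-8) instead of throwing.
    public static String decodeLenient(byte[] bytes) {
        CharsetDecoder lenient = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return lenient.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but decode() declares the exception.
            throw new IllegalStateException(e);
        }
    }
}
```

With the question's original array, isValidUtf8 returns false because 0xA8 is a continuation byte with no lead byte, while decodeLenient returns a string containing U+FFFD at that position.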

I just had a look at the sources: it boils down to two bytes from the byte buffer being combined into one character. The order in which the two bytes are used depends on the buffer's endianness; the default is big-endian.
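The effect of the byte order can be seen with a two-byte buffer. This is only a sketch; the class and method names are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderDemo {
    // Combine the first two bytes into a char using the given byte order.
    public static char firstChar(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).asCharBuffer().get(0);
    }

    public static void main(String[] args) {
        byte[] pair = {0x00, 0x48};
        // A wrapped buffer starts out big-endian regardless of the platform.
        System.out.println(ByteBuffer.wrap(pair).order());                    // BIG_ENDIAN
        System.out.println((int) firstChar(pair, ByteOrder.BIG_ENDIAN));     // 72, i.e. 'H'
        System.out.println((int) firstChar(pair, ByteOrder.LITTLE_ENDIAN));  // 18432, i.e. U+4800
    }
}
```

Note that the order must be set on the ByteBuffer before asCharBuffer() is called, because the view takes its byte order from the buffer at the moment it is created.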
Another approach with the nio classes, different from what I wrote in the comments, would be to use the CharsetDecoder.decode() method.
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer bb = ByteBuffer.wrap(bytes);
CharBuffer cb = decoder.decode(bb);

Related

Why with BouncyCastle decrypted text is a bit different from input text?

I found on Google this code for encrypt/decrypt a string in Java:
Security.addProvider(new org.bouncycastle.jce.provider.BouncyCastleProvider());
byte[] input = "test".getBytes();
byte[] keyBytes = new byte[] { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09,
0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17 };
SecretKeySpec key = new SecretKeySpec(keyBytes, "AES");
Cipher cipher = Cipher.getInstance("AES/ECB/PKCS7Padding", "BC");
System.out.println(new String(input));
// encryption pass
cipher.init(Cipher.ENCRYPT_MODE, key);
byte[] cipherText = new byte[cipher.getOutputSize(input.length)];
int ctLength = cipher.update(input, 0, input.length, cipherText, 0);
ctLength += cipher.doFinal(cipherText, ctLength);
System.out.println(new String(cipherText));
System.out.println(ctLength);
// decryption pass
cipher.init(Cipher.DECRYPT_MODE, key);
byte[] plainText = new byte[cipher.getOutputSize(ctLength)];
int ptLength = cipher.update(cipherText, 0, ctLength, plainText, 0);
ptLength += cipher.doFinal(plainText, ptLength);
System.out.println(new String(plainText));
System.out.println(ptLength);
And this is the output (screenshot because I can't copy-paste some characters):
[output screenshot]
My question is:
Why is the first input "test" different from the second (decrypted) "test"?
I need this code to encrypt a password, save it to a TXT file, and later read the encrypted password back from the file and decrypt it.
But if these two outputs are different, I can't do this.
Second question:
Is it possible to exclude ";" from the encrypted text?
Can someone help me, please? Thanks!

If you read the documentation for getOutputSize() then you will see that it returns the maximum amount of plaintext to expect. The cipher instance cannot know how much padding is added, so it guesses high. You will have to resize the byte array when you are using ECB or CBC mode (or any other non-streaming mode).
System.out.println(ctLength);
As you can see, ctLength does have the correct size. Use Arrays.copyOf(plainText, ptLength) to get the right number of bytes, or use the four parameter String constructor (new String(plainText, 0, ptLength, StandardCharsets.UTF_8)) in case you're just interested in the string.
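A minimal sketch of that trimming step follows. It assumes the default JCE provider instead of Bouncy Castle and swaps the question's ECB mode for CBC with a random IV; the class and method names, the random key, and the choice of CBC are assumptions made for this example, not the asker's code.

```java
import java.nio.charset.Charset;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class TrimPlaintextDemo {
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    // Encrypts and then decrypts the text, keeping only the ptLength bytes
    // that the cipher actually wrote into the oversized output buffer.
    public static String roundTrip(String text) {
        try {
            SecureRandom random = new SecureRandom();
            byte[] keyBytes = new byte[16]; // throwaway 128-bit key for the demo
            byte[] iv = new byte[16];
            random.nextBytes(keyBytes);
            random.nextBytes(iv);
            SecretKeySpec key = new SecretKeySpec(keyBytes, "AES");
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");

            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] cipherText = cipher.doFinal(text.getBytes(UTF_8));

            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
            // getOutputSize() is an upper bound that still counts the padding.
            byte[] plainText = new byte[cipher.getOutputSize(cipherText.length)];
            int ptLength = cipher.update(cipherText, 0, cipherText.length, plainText, 0);
            ptLength += cipher.doFinal(plainText, ptLength);

            // Use only the first ptLength bytes; the rest of the buffer is unused.
            return new String(plainText, 0, ptLength, UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In a real application the key and the IV would have to be stored or derived so that a later run can decrypt; here they are discarded after the round trip.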
The ciphertext consists of random characters. It actually depends on your standard character set what you see on the screen. If you really need text, then you can base 64 encode the ciphertext.
ECB mode encryption is not suitable to encrypt strings. You should try and use a different mode that includes setting / storing an IV.
I'd use new String(bytes, StandardCharsets.UTF_8) and String#getBytes(StandardCharsets.UTF_8) to convert to and from strings. If you don't specify the character set, the system default character set is used, which means decrypting your passwords won't work on all systems. The defaults differ: Linux and Android default to UTF-8, while Java SE on Windows (still?) defaults to the Windows-1252 (extended Western-Latin) character set.
There is absolutely no need to use the Bouncy Castle provider for AES encryption (the compatible padding string is "PKCS5Padding").
Please don't grab random code samples from Google. You need to understand cryptography before you start implementing it. The chance that you grab a secure code sample is practically zero, unfortunately.

CharBuffer and ByteBuffer - charset encoding

Java stores characters in UCS-2 format.
byte[] bytes = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2c,
0x60, (byte)0xA8, 0x59, 0x7D, 0x00, 0x21};
// Print UCS-2 in hex codes
System.out.printf("%10s", "UCS-2");
for(int i=0; i<bytes.length; i++) {
System.out.printf("%02x", bytes[i]);
}
1)
In the below code,
Charset charset = Charset.forName("UTF-8");
// Encode from UCS-2 to UTF-8
// Create a ByteBuffer by wrapping a byte array
ByteBuffer bb = ByteBuffer.wrap(bytes);
What is the byte order used to store bytes in bb on wrap()? BigEndian or LittleEndian?
2)
In the below code,
// Create a CharBuffer from a view of this ByteBuffer
CharBuffer cb = bb.asCharBuffer();
ByteBuffer bbOut = charset.encode(cb);
What is the encoding format used to store bytes of bb as characters in cb on asCharBuffer()?

Conversion python to Java

I need to convert this Python code in Java but the '\x00' hexcode is a real problem.
How can I do the export data to bytes[] ?
message__FirstPart = '\x45\x55\x43\x45\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00'
# event_par = key number (5+ 0..9
message__event_par = chr(key+5)
message__filler = '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
# mit link state = 1
message__link_status = chr(int(GroupID))
# mit link state = 1
#message__link_status = '\x01'
# the sender’s mac-address
message__mac = '\x00\x00\x00\x00\x00\x00'
data = message__FirstPart + message__event_par + message__filler + message__link_status + message__mac
Thanks !

Strings in Java are made up of char, not byte. Though a char is really just two bytes, it is interpreted as UTF-16 and this can cause all sorts of trouble when you really want to talk about a series of bytes as it appears in your example.
You can express literal bytes as hexadecimal in Java as 0x00, 0x01, ..., 0xFF though these usually get interpreted as int and must be cast to byte:
byte[] arr = new byte[] { (byte) 0x01, (byte) 0x02 };
You can also look into the ByteBuffer class for assembling streams of bytes.
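As a rough sketch of how the Python message could be assembled in Java, the class below concatenates the parts with a ByteArrayOutputStream. The class name, the method signature, and the fillerLength parameter are assumptions for this example; the filler length has to match the number of \x00 bytes in the original Python string, which is passed in here rather than hard-coded.

```java
import java.io.ByteArrayOutputStream;

class MessageBuilder {
    // message__FirstPart from the Python code, 24 bytes.
    private static final byte[] FIRST_PART = {
        0x45, 0x55, 0x43, 0x45, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00
    };

    // key and groupId are written as single bytes, like chr(...) in Python.
    public static byte[] build(int key, int fillerLength, int groupId, byte[] mac) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(FIRST_PART, 0, FIRST_PART.length);
        out.write(key + 5);                                  // message__event_par
        out.write(new byte[fillerLength], 0, fillerLength);  // message__filler, all zeros
        out.write(groupId);                                  // message__link_status
        out.write(mac, 0, mac.length);                       // message__mac
        return out.toByteArray();
    }
}
```

For example, MessageBuilder.build(3, 10, 1, new byte[6]) returns a 42-byte array whose byte at index 24 is 8. Because a new byte[] is zero-filled, the '\x00' filler needs no literal at all.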

Tokenising binary data in java

I am writing some code to handle a stream of binary data. It is received in chunks represented by byte arrays. Combined, the byte arrays represent sequential stream of messages, each of which ends with the same constant terminator value (0xff in my case). However, the terminator value can appear at any point in any given chunk of data. A single chunk can contain part of a message, multiple messages and anything in between.
Here is a small sampling of what data handled by this might look like:
[0x00, 0x0a, 0xff, 0x01]
[0x01, 0x01]
[0xff, 0x01, 0xff]
This data should be converted into these messages:
[0x00, 0x0a, 0xff]
[0x01, 0x01, 0x01, 0xff]
[0x01, 0xff]
I have written a small class to handle this. It has a method to add some data in byte array format, which is then placed in a buffer array. When the terminator character is encountered, the byte array is cleared and the complete message is placed in a message queue, which can be accessed using hasNext() and next() methods (similar to an iterator).
This solution works fine, but as I finished it, I realized that there might already be some stable, performant and tested code in an established library that I could be using instead.
So my question is - do you know of a utility library that would have such a class, or maybe there is something in the standard Java 6 library that can do this already?

I don't think you need a framework as a custom parser is simple enough.
InputStream in = new ByteArrayInputStream(new byte[]{
0x00, 0x0a, (byte) 0xff,
0x01, 0x01, 0x01, (byte) 0xff,
0x01, (byte) 0xff});
ByteArrayOutputStream baos = new ByteArrayOutputStream();
for (int b; (b = in.read()) >= 0; ) {
baos.write(b);
if (b == 0xff) {
byte[] bytes = baos.toByteArray();
System.out.println(Arrays.toString(bytes));
baos = new ByteArrayOutputStream();
}
}
This prints the following (the terminator shows up as -1 because (byte) 0xFF == -1):
[0, 10, -1]
[1, 1, 1, -1]
[1, -1]
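For data that arrives in chunks, as described in the question, the same idea can be wrapped in a small class with add(), hasNext() and next(). This is only a sketch using standard Java 6 classes; the class name and the 0xff terminator constant are taken from the question's description, not from any library.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayDeque;
import java.util.Queue;

class MessageTokenizer {
    private static final byte TERMINATOR = (byte) 0xff;

    // Bytes of the message currently being assembled.
    private final ByteArrayOutputStream current = new ByteArrayOutputStream();
    // Completed messages, each including its terminator byte.
    private final Queue<byte[]> messages = new ArrayDeque<byte[]>();

    // Feed one chunk; a chunk may hold partial, whole, or several messages.
    public void add(byte[] chunk) {
        for (byte b : chunk) {
            current.write(b);
            if (b == TERMINATOR) {
                messages.add(current.toByteArray());
                current.reset(); // start collecting the next message
            }
        }
    }

    public boolean hasNext() {
        return !messages.isEmpty();
    }

    // Throws NoSuchElementException if no complete message is queued.
    public byte[] next() {
        return messages.remove();
    }
}
```

Feeding it the three sample chunks from the question yields the three expected messages in order, and any trailing bytes without a terminator stay buffered until a later chunk completes them.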

Can't decrypt file encrypted using openssl AES_ctr128_encrypt

I have a file encrypted using the following code in c:
unsigned char ckey[] = "0123456789ABCDEF";
unsigned char iv[8] = {0};
AES_set_encrypt_key(ckey, 128, &key);
AES_ctr128_encrypt(indata, outdata, 16, &key, aesstate.ivec, aesstate.ecount, &aesstate.num);
I have to decrypt this file using java so I was using the code below to do it:
private static final byte[] encryptionKey = new byte[]{ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F };
byte[] iv = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
IvParameterSpec ips = new IvParameterSpec(iv);
Cipher aesCipher = Cipher.getInstance("AES/CTR/NoPadding");
SecretKeySpec aeskeySpec = new SecretKeySpec(encryptionKey, "AES");
aesCipher.init(Cipher.DECRYPT_MODE, aeskeySpec, ips);
FileInputStream is = new FileInputStream(in);
CipherOutputStream os = new CipherOutputStream(new FileOutputStream(out), aesCipher);
copy(is, os);
os.close();
The Java code doesn't give me any error, but the output is not correct.
What am I doing wrong?
My main doubts are whether I'm using the correct padding (I also tried PKCS5Padding without success) and whether the key and IV are correct (I don't know what the function AES_set_encrypt_key really does...).
** EDIT **
I think I have an answer to my own question, but I still have some doubts.
CTR means counter mode. The function AES_ctr128_encrypt receives as parameters the actual counter (ecount) and the number of blocks used (num).
The file is being encrypted in blocks of 16 bytes, like this:
for(int i = 0; i < length; i+=16)
{
// .. buffer processing here
init_ctr(&aesstate, iv); //Counter call
AES_ctr128_encrypt(indata, outdata, 16, &key, aesstate.ivec, aesstate.ecount, &aesstate.num);
}
the function init_ctr does this:
int init_ctr(struct ctr_state *state, const unsigned char iv[8])
{
state->num = 0;
memset(state->ecount, 0, 16);
memset(state->ivec + 8, 0, 8);
memcpy(state->ivec, iv, 8);
return 0;
}
This means that before every encryption/decryption the C code is resetting the counter and the ivec.
I am trying to decrypt the file as a whole in java. This probably means Java is using the counter correctly but the C code is not as it is resetting the counter at each block.
Is my investigation correct?
I have absolutely NO CONTROL over the C code that is calling openssl. Is there a way of doing the same in JAVA, i.e. resetting the counter at each block of 16? (The API only requests the key, algorithm, mode and IV)
My only other option is to use openssl via JNI but I was trying to avoid it...
Thank you!

I did not try it, but you should be able to effectively emulate what is done there on the C side - decrypt each 16-byte (=128 bit) block separately, and reset the cipher between two calls.
Please note that using CTR mode for just one block, with a zero initialization vector and counter, defeats the goal of CTR mode - it is worse than ECB.
If I see this right, you could try to encrypt some blocks of zeros with your C function (or the equivalent Java version) - these should come out as the same block each time. XOR this block with any ciphertext to get your plaintext back.
This is equivalent to a Caesar cipher on a 128-bit alphabet (i.e. the 16-byte blocks); the block cipher adds no security here over a simple 128-bit XOR cipher. Guessing one block of plaintext (or more generally, guessing 128 bits at the right positions, not necessarily all in the same block) reveals the effective key, which in turn reveals all the remaining plaintext blocks.
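The per-block reset could be emulated in Java roughly as follows. This is an untested-against-the-C-code sketch: it assumes the key is the 16 ASCII bytes of the C string "0123456789ABCDEF", an all-zero 16-byte counter block, and a class name invented for this example.

```java
import java.nio.charset.Charset;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class PerBlockCtr {
    // Key bytes as the C code sees them: the ASCII codes of the characters.
    public static byte[] asciiKey() {
        return "0123456789ABCDEF".getBytes(Charset.forName("US-ASCII"));
    }

    // Re-initializes the cipher with a zero counter before every 16-byte block,
    // mirroring the init_ctr() call inside the C loop. CTR is symmetric, so
    // the same method both encrypts and decrypts.
    public static byte[] transform(byte[] key, byte[] data) {
        try {
            SecretKeySpec keySpec = new SecretKeySpec(key, "AES");
            IvParameterSpec zeroIv = new IvParameterSpec(new byte[16]);
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            byte[] out = new byte[data.length];
            for (int off = 0; off < data.length; off += 16) {
                int len = Math.min(16, data.length - off);
                cipher.init(Cipher.DECRYPT_MODE, keySpec, zeroIv); // counter reset
                cipher.doFinal(data, off, len, out, off);
            }
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Transforming a buffer of zeros shows the weakness directly: every 16-byte block of the output is the same keystream block.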

Your encryption keys are different.
The C code uses the ASCII character codes for '0' through 'F' as the key bytes, whereas the Java code uses the actual byte values 0x00 through 0x0F.

There are numerous serious problems with that C code:
As already noted, it is reinitialising the counter on every block. This makes the encryption completely insecure. This can be fixed by calling init_ctr() once only, prior to encrypting the first block.
It is setting the IV statically to zero. A fresh IV should be generated randomly, for example if (!RAND_bytes(iv, 8)) { /* handle error */ }.
The code appears to be directly using a password string as the key. Instead, a key should be generated from the password using a key derivation function like PBKDF2 (implemented in OpenSSL by PKCS5_PBKDF2_HMAC_SHA1()).
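If the Java side were ever free to choose the scheme, the last two points could be sketched with the standard JCE classes as below. The class name, the salt length, and the iteration count are assumptions for this example, and none of it is compatible with the existing C code.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

class KeyDerivationSketch {
    // Fresh random bytes, usable as a salt or as an IV.
    public static byte[] randomBytes(int length) {
        byte[] bytes = new byte[length];
        new SecureRandom().nextBytes(bytes);
        return bytes;
    }

    // Derive a 128-bit AES key from a password with PBKDF2 (HMAC-SHA1).
    public static byte[] deriveAesKey(char[] password, byte[] salt, int iterations) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password, salt, iterations, 128);
            SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
            return factory.generateSecret(spec).getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The salt and IV are not secret and would be stored next to the ciphertext, so the same key can be derived again for decryption.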
