I have a PHP script which is supposed to return a UTF-8 encoded string. However, in Java I can't seem to compare it with an internal Java string in any way.
If I print both "OK" and response, they appear identical in the console. However, if I check equality
if ( "OK".equals(response) ) {
the result is false. I printed out both in binary: response is 11101111 10111011 10111111 01001111 01001011, while Java's String "OK" is 01001111 01001011, which is clearly ASCII. I tried to convert it to UTF-8 in a few ways, but to no avail:
String result2 = new String("OK".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
and
String result2 = new String("OK".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
neither works; for some reason both still return the ASCII bytes. The same goes for:
byte[] result2 = "OK".getBytes(StandardCharsets.UTF_8);
System.out.print(new String(result2));
While this also prints the correct "OK", in binary it is still the plain ASCII bytes.
I've tried switching the communication to numbers instead, but 1 still does not equal 1: Integer.parseInt(response) fails with an error message that "1" is not a valid number, although in every other respect it is recognised as a normal String.
I'm looking for a solution, preferably one where "OK" is converted to UTF-8 rather than response to ASCII, since I need to communicate with a PHP script along with two databases, all set to UTF-8. Java is started with the switch -Dfile.encoding=UTF8 to ensure national characters are not broken.
In UTF-8, all characters with code points 127 or below are encoded as a single byte. Therefore "OK" in UTF-8 and in ASCII is the same two bytes.
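A quick check makes this concrete (a minimal snippet, assuming a standard JDK):
// requires imports: java.nio.charset.StandardCharsets, java.util.Arrays
// ASCII and UTF-8 agree for code points 0 to 127, so "OK" encodes
// to the same two bytes (0x4F 0x4B) under both charsets
byte[] ascii = "OK".getBytes(StandardCharsets.US_ASCII);
byte[] utf8 = "OK".getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.equals(ascii, utf8)); // prints: true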
The byte sequence 11101111 10111011 10111111 01001111 01001011 is not just a plain "OK"; it is
0xEF, 0xBB, 0xBF, "OK"
where 0xEF 0xBB 0xBF is a BOM (byte order mark).
These are characters that editors typically do not display but that are used to determine the encoding.
Those characters probably appeared in your PHP script before <?php.
You have to configure your editor to remove the BOM from the file.
Update:
If it is not possible to alter the php script, you can use a workaround:
// check if the first character of the response is a BOM
if (!response.isEmpty() && response.charAt(0) == '\uFEFF') {
    // remove the leading BOM character
    response = response.substring(1);
}
Related
I have a bunch of old AES-encrypted Strings encrypted roughly like this:
1. The String is converted to bytes with the ISO-8859-1 encoding
2. The bytes are encrypted with AES
3. The result is converted to a Base64-encoded char array
Now I would like to change the encoding to UTF-8 for new values (e.g. '€' cannot be represented in ISO-8859-1). This will of course cause problems if I try to decrypt the old ISO-8859-1 encoded values with the UTF-8 encoding:
org.junit.ComparisonFailure: expected:<!#[¤%&/()=?^*ÄÖÖÅ_:;>½§#${[]}<|'äöå-.,+´¨]'-Lorem ipsum dolor ...> but was:<!#[�%&/()=?^*����_:;>��#${[]}<|'���-.,+��]'-Lorem ipsum dolor ...>
I'm thinking of creating some automatic encoding fallback for this.
So the main question is: is it enough to inspect the decrypted char array for '�' characters to detect an encoding mismatch? And what is the 'correct' way to write that '�' symbol in the comparison?
if (new String(utf8decryptedCharArray).contains("�")) {
// Revert to doing the decrypting with ISO-8859-1
decryptAsISO...
}
When decrypting, you get back the original byte sequence (result of your step 1), and then you can only guess whether these bytes denote characters according to the ISO-8859-1 or the UTF-8 encoding.
From a byte sequence, there's no way to clearly tell how it is to be interpreted.
A few ideas:
You could migrate all the old encrypted strings (decrypt, decode to string using ISO-8859-1, encode to byte array using UTF-8, encrypt). Then the problem is solved once and forever.
You could try to decode the byte array both ways, see if one version is illegal, or if both versions are equal, and if it is still ambiguous, take the one with the higher probability according to the expected characters. I wouldn't recommend going that way, as it needs a lot of work and still leaves some probability of error.
For the new entries, you could prepend the string / byte sequence with some marker that doesn't appear in ISO-8859-1 text. E.g. some people follow the convention of prepending a byte order mark at the beginning of UTF-8 encoded files. Although the resulting bytes (EF BB BF) aren't strictly illegal in ISO-8859-1 (they would be read as ï»¿), they are highly unlikely. Then, when your decrypted bytes start with EF BB BF, decode to a string using UTF-8, otherwise using ISO-8859-1. Still, there's a non-zero probability of error.
If at all possible, I'd go for migrating the existing entries; a sketch follows below. Otherwise, you'll have to carry the "old-format compatibility stuff" in your code base forever, and still can't absolutely guarantee correct behaviour.
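For the migration route, here is a minimal sketch; decryptAes and encryptAes are hypothetical placeholders standing in for whatever AES helpers your code base already has:
// Re-encrypt one legacy entry: decrypt, decode as ISO-8859-1,
// re-encode as UTF-8, and encrypt again.
static String migrateEntry(String oldCiphertext) throws Exception {
    byte[] isoBytes = decryptAes(oldCiphertext);                     // hypothetical AES helper
    String text = new String(isoBytes, StandardCharsets.ISO_8859_1); // decode with the old charset
    byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);        // re-encode with the new charset
    return encryptAes(utf8Bytes);                                    // hypothetical AES helper
}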
When decoding bytes to text, don't rely on the � character to detect malformed input. Use a strict decoder. Here is a helper method for that:
// requires imports: java.nio.ByteBuffer, java.nio.charset.Charset,
// java.nio.charset.CharacterCodingException, java.nio.charset.CodingErrorAction
static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
    return charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
}
Here is the corresponding strict encoder helper method, in case you need it:
// additionally requires: java.nio.CharBuffer, java.util.Arrays
static byte[] encodeStrict(String str, Charset charset) throws CharacterCodingException {
    ByteBuffer buf = charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(CharBuffer.wrap(str));
    byte[] bytes = buf.array();
    if (bytes.length == buf.limit())
        return bytes;
    return Arrays.copyOfRange(bytes, 0, buf.limit());
}
Since every byte sequence is valid in ISO-8859-1, you can't use it to detect malformed input. UTF-8, however, is self-validating, so it is very likely to catch malformed input. That is not 100% guaranteed, but it's the best we can do.
So, try decoding using strict UTF-8, and then fall back to ISO-8859-1 if it fails:
static String decode(byte[] bytes) {
    try {
        // prefer strict UTF-8 decoding
        return decodeStrict(bytes, StandardCharsets.UTF_8);
    } catch (CharacterCodingException e) {
        // fall back to ISO-8859-1, which accepts any byte sequence
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
Test
System.out.println(decode("señor".getBytes(StandardCharsets.ISO_8859_1))); // prints: señor
System.out.println(decode("señor".getBytes(StandardCharsets.UTF_8))); // prints: señor
System.out.println(decode("€100".getBytes(StandardCharsets.UTF_8))); // prints: €100
Is there any method that fully decodes a String? For example, I have
monta%25C3%25B1a. If I use the URLDecoder.decode method ONCE, it returns monta%C3%B1a, and if I decode it AGAIN, it finally returns montaña (the fully decoded string). Is there any method or library in Java that achieves this result?
monta[%25]C3[%25]B1a (each %25 is the percent-encoded '%')
monta % C3 % B1a, which contains the UTF-8 multi-byte sequence %C3%B1
monta ñ a
It is important to decode with the same Charset that was used for encoding.
Evidently the string was URL-encoded twice: first the UTF-8 bytes were percent-encoded, and then the '%' characters of that result were percent-encoded once more.
The double encoding should really be repaired at its source; otherwise an incomprehensible patch remains:
s = URLDecoder.decode(s, StandardCharsets.UTF_8);
s = URLDecoder.decode(s, StandardCharsets.UTF_8);
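If you really need a catch-all, a loop that decodes until the string stops changing will unwind any depth of accidental re-encoding. Note this is only a heuristic sketch: it would also over-decode a legitimately once-encoded string that happens to contain %25, it throws IllegalArgumentException on a stray '%', and the URLDecoder.decode(String, Charset) overload requires Java 10+:
static String fullyDecode(String s) {
    // decode repeatedly until the result is stable
    String previous;
    do {
        previous = s;
        s = URLDecoder.decode(s, StandardCharsets.UTF_8);
    } while (!s.equals(previous));
    return s;
}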
JLine is a library for intercepting user input at a console before the user presses Enter. It uses JNA or similar wizardry.
I'm doing a few experiments with it and I'm getting encoding problems when I input more "exotic" Unicode characters. The OS here is Windows 10 and I'm using Cygwin. Also, this is Groovy, but it should be obvious to Java people.
def terminal = org.jline.terminal.TerminalBuilder.builder().jna( true ).system( true ).build()
terminal.enterRawMode()
// NB the Terminal I get is class org.jline.terminal.impl.PosixSysTerminal
def reader = terminal.reader()
def bytes = [] // NB class ArrayList
int readInt = -1
while( readInt != 13 && readInt != 10 ) {
readInt = reader.read()
byte convertedByte = (byte)readInt
// see what the binary looks like:
String binaryString = String.format("%8s", Integer.toBinaryString( convertedByte & 0xFF)).replace(' ', '0')
println "binary |$binaryString|"
bytes << (byte)readInt // NB means "append to list"
println ">>> read |$readInt| byte |$convertedByte|"
}
// strip final byte (13 or 10)
bytes = bytes[0..-2]
println "z bytes $bytes, class ${bytes.class.name}"
def response = new String( (byte[])bytes.toArray(), 'UTF-8' )
// to get proper out encoding for Cygwin I then need to do this (I have no idea why!)
def psOut = new PrintStream(System.out, true, 'UTF-8' )
psOut.print( "using PrintStream: |$response|" )
This works fine with one-byte Unicode, and letters like "é" (2 bytes in UTF-8) are handled fine. But it goes wrong with "ẃ":
ẃ --> Unicode U+1E83
UTF-8 HEX: 0xE1 0xBA 0x83 (e1ba83)
BINARY: 11100001:10111010:10000011
Actually, the binary it puts out when you enter "ẃ" is 11100001:10111010:10010010.
This translates to U+1E92, which is a different character, "Ẓ". And that is indeed what gets printed out in the response String.
Unfortunately the JLine package hands you this reader, which is class org.jline.utils.NonBlocking$NonBlockingInputStreamReader... So I don't really know what I can do to investigate its encoding (I presume UTF-8) or somehow modify it... Can anyone explain what the problem is?
As far as I can tell this relates to a Cygwin-specific problem, as asked and then answered by me a year ago.
There is a solution, in my answer to the question I asked directly after this one, which correctly deals with Unicode input, even outside the Basic Multilingual Plane, using JLine and a Cygwin console... hopefully.
I am trying to convert a protobuf stream to JSON object using the com.google.protobuf.util.JsonFormat class as below.
String jsonFormat = JsonFormat.printer().print(data);
As per the documentation https://developers.google.com/protocol-buffers/docs/proto3#json, I am getting the bytes fields as Base64 strings (example: "hashedStaEthMac": "QDOMIxG+tTIRi7wlMA9yGtOoJ1g=",). But I would like to get them as readable strings (example: "locAlgorithm": "ALGORITHM_ESTIMATION",). Below is a sample output. Is there a way to get the JSON object as plain text, or any workaround to get the actual values?
{
"seq": "71811887",
"timestamp": 1488640438,
"op": "OP_UPDATE",
"topicSeq": "9023777",
"sourceId": "xxxxxxxx",
"location": {
"staEthMac": {
"addr": "xxxxxx"
},
"staLocationX": 1148.1763,
"staLocationY": 980.3377,
"errorLevel": 588,
"associated": false,
"campusId": "n5THo6IINuOSVZ/cTidNVA==",
"buildingId": "7hY/jVh9NRqqxF6gbqT7Jw==",
"floorId": "LV/ZiQRQMS2wwKiKTvYNBQ==",
"hashedStaEthMac": "xxxxxxxxxxx",
"locAlgorithm": "ALGORITHM_ESTIMATION",
"unit": "FEET"
}
}
Expected format is as below.
seq: 85264233
timestamp: 1488655098
op: OP_UPDATE
topic_seq: 10955622
source_id: 00505698749E
location {
sta_eth_mac {
addr: xx:xx:xx:xx:xx:xx
}
sta_location_x: 916.003
sta_location_y: 580.115
error_level: 854
associated: false
campus_id: 9F94C7A3A20836E392559FDC4E274D54
building_id: EE163F8D587D351AAAC45EA06EA4FB27
floor_id: 83144E609EEE3A64BBD22C536A76FF5A
hashed_sta_eth_mac:
loc_algorithm: ALGORITHM_ESTIMATION
unit: FEET
}
Not easily, because the actual values are binary, which is why they're Base64-encoded in the first place.
Try to decode one of these values:
$ echo -n 'n5THo6IINuOSVZ/cTidNVA==' | base64 -D
??ǣ6?U??N'MT
In order to get more readable values, you have to understand what the binary data actually is, and then decide what format you want to use to display it.
The field called staEthMac.addr is 6 bytes and is probably an Ethernet MAC address. It's usually displayed as xx:xx:xx:xx:xx:xx where xx are the hexadecimal values of each byte. So you could decode the Base64 strings into a byte[] and then call a function to convert each byte to hex and delimit them with ':'.
The fields campusId, buildingId, and floorId are 16 bytes (128 bits) and are probably UUIDs. UUIDs are usually displayed as xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx where each x is a hex digit (4 bits). So you could (again) convert the Base64 string to byte[] and then print the hex digits, optionally adding the dashes.
Not sure about sourceId and hashedStaEthMac, but you could just follow the pattern of converting to byte[] and printing as hex. Essentially you're just doing a conversion from base 64 to base 16. You'll wind up with something like this:
$ echo -n 'n5THo6IINuOSVZ/cTidNVA==' | base64 -D | xxd -p
9f94c7a3a20836e392559fdc4e274d54
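Sketching that conversion in Java (a minimal example using java.util.Base64; the delimiter parameter only exists to cover the MAC-address case):
// Decodes a Base64 string and renders the bytes as lowercase hex,
// optionally separated by a delimiter such as ":" for MAC addresses.
static String base64ToHex(String base64, String delimiter) {
    byte[] bytes = Base64.getDecoder().decode(base64);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bytes.length; i++) {
        if (i > 0) sb.append(delimiter);
        sb.append(String.format("%02x", bytes[i])); // negative bytes print as unsigned hex
    }
    return sb.toString();
}
For example, base64ToHex("n5THo6IINuOSVZ/cTidNVA==", "") returns "9f94c7a3a20836e392559fdc4e274d54", matching the shell output above.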
A point that I'm not sure you are getting is that this is binary data. There is no "readable" version that makes sense the way "ALGORITHM_ESTIMATION" does; the best you can do is encode the binary data using letters and numbers so you can at least pronounce it.
Base64 (which encodes binary using 64 different characters) is pronounceable "N five T H lowercase-O six ..." but it's not real friendly because letter case is significant and because it uses letters like O and I that look like numbers. Hex (which encodes binary using just 16 characters) is a little easier to read.
I have a Java servlet that takes a parameter String (inputString) that may contain Greek letters from a web page marked up as utf-8. Before I send it to a database I have to convert it to a new String (utf8String) as follows:
String utf8String = new String(inputString.getBytes("8859_1"), "UTF-8");
This works, but, as I hope will be appreciated, I hate doing something I don't understand, even if it works.
From the method description in the Java doc, the getBytes() method "Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array", i.e. I am encoding it in 8859_1, ISO Latin-1. And from the constructor description: "Constructs a new String by decoding the specified array of bytes using the specified charset", i.e. it decodes the byte array as UTF-8.
Can someone explain to me why this is necessary?
My question is based on a misconception regarding the character set used for the HTTP request. I had assumed that because I marked up the web page from which the request was sent as UTF-8 the request would be sent as UTF-8, and so the Greek characters in the parameter sent to the servlet would be read as a UTF-8 String (‘inputString’ in my line of code) by the HttpRequest.getParameter() method. This is not the case.
HTTP requests are sent as ISO-8859-1 (POST) or ASCII (GET), which are generally the same. This is part of the URI Syntax specification — thanks to Andreas for pointing me to http://wiki.apache.org/tomcat/FAQ/CharacterEncoding where this is explained.
I had also forgotten that the encoding of Greek letters such as α for the request is URL encoding, which produces %CE%B1. The getParameter() method handles this by decoding them as two ISO-8859-1 characters, %CE and %B1, i.e. Î and ± (I checked this).
I now understand why this needs to be turned into a byte array and the bytes interpreted as UTF-8. 0xCE does not represent a one-byte character in UTF-8, so it is combined with the next byte, 0xB1, and interpreted as α. (For comparison, Î is 0xC3 0x8E and ± is 0xC2 0xB1 in UTF-8.)
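That round trip can be demonstrated directly (a small illustration; "Î±" is what getParameter() yields for the URL-encoded α):
// getParameter() decoded %CE and %B1 as two ISO-8859-1 characters: Î and ±
String misread = "Î±";
// recover the original bytes, then reinterpret them as UTF-8
String fixed = new String(misread.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
System.out.println(fixed); // prints: α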
When decoding, could you not create a class with a decoder method that takes the byte[] as a parameter and returns it as a String? Here is an example that I have used before.
public class Decoder
{
    public String decode(byte[] bytes)
    {
        // Turns the byte array into a String.
        // NOTE: new String(byte[]) uses the platform default charset;
        // pass an explicit Charset such as StandardCharsets.UTF_8 for
        // predictable behaviour.
        String decodedString = new String(bytes);
        return decodedString;
    }
}
Try using this instead of .getBytes(). Hope this works.