Trouble comparing Java strings (of different encoding) - java

I'm writing EXIF metadata to a JPEG using Apache Commons Imaging (Sanselan), and, at least in the 0.97 release of Sanselan, there were some bugs related to charset/encoding. The EXIF 2.2 standard requires that the encoding of fields of type UNDEFINED be prefixed with an 8-byte ASCII "signature", describing the encoding of the following content. The field/tag I'm writing to is the UserComment EXIF tag.
Windows expects the content to be encoded in UTF-16, so the bytes written to the JPEG must consist of the (single-byte) ASCII signature followed by (double-byte) UTF-16 characters. Furthermore, although UserComment doesn't seem to require it, I notice that the content is often null-padded to even length.
Here's the code I'm using to create and write the tag:
String textToSet = "Test";
byte[] ASCIIMarker = new byte[]{ 0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00 }; // spells out "UNICODE"
byte[] comment = textToSet.getBytes("UnicodeLittle");
// pad with \0 if (total) length is odd (or is \0 byte automatically added by arraycopy?)
int pad = (ASCIIMarker.length + comment.length) % 2;
byte[] bytesComment = new byte[ASCIIMarker.length + comment.length + pad];
System.arraycopy(ASCIIMarker, 0, bytesComment, 0, ASCIIMarker.length);
System.arraycopy(comment, 0, bytesComment, ASCIIMarker.length, comment.length);
if (pad > 0) bytesComment[bytesComment.length-1] = 0x00;
TiffOutputField exif_comment = new TiffOutputField(TiffConstants.EXIF_TAG_USER_COMMENT,
        TiffFieldTypeConstants.FIELD_TYPE_UNDEFINED, bytesComment.length - pad, bytesComment);
Then when I read the tag back from the JPEG, I do the following:
String textRead = null;
TiffField field = jpegMetadata.findEXIFValue(TiffConstants.EXIF_TAG_USER_COMMENT);
if (field != null) {
    textRead = new String(field.getByteArrayValue(), "UnicodeLittle");
}
What confuses me is this: The bytes written to the JPEG are prefixed with 8 ASCII bytes, which obviously need to be "stripped off" in order to compare what was written to what was read:
if (textRead != null) {
    if (textToSet.equals(textRead)) { // expecting this to FAIL
        System.out.println("Equal");
    } else {
        System.out.println("Not equal");
        if (textToSet.equals(textRead.substring(5))) { // this works
            System.out.println("Equal after all...");
        }
    }
}
But why substring(5), as opposed to... substring(8)? If it was 4, I might think that 4 double byte (UTF-16) symbols total 8 bytes, but it only works if I strip off 5 bytes. Is this an indication that I'm not creating the payload (byte array bytesComment) properly?
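For reference, here is a small diagnostic sketch (not part of the code above; the class name is illustrative) that dumps exactly what the "UnicodeLittle" encoder emits, so the actual byte count can be compared against the expected 8 signature bytes plus 8 content bytes:
public class DumpEncoding {
    public static void main(String[] args) throws Exception {
        String textToSet = "Test";
        // Dump exactly what the "UnicodeLittle" encoder emits; if it prepends a
        // byte order mark (FF FE), the content is 2 bytes longer than the 8 bytes
        // expected for "Test", which would shift everything that follows.
        byte[] comment = textToSet.getBytes("UnicodeLittle");
        System.out.println("encoded length: " + comment.length);
        StringBuilder hex = new StringBuilder();
        for (byte b : comment) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex);
        // For comparison: "UnicodeLittleUnmarked" (if the runtime supports this
        // historical alias) is UTF-16LE without a BOM.
        System.out.println("unmarked length: " + textToSet.getBytes("UnicodeLittleUnmarked").length);
    }
}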
PS! I will update to Apache Commons Imaging RC 1.0, which came out in 2016 and hopefully has fixed these bugs, but I'd still like to understand why this works once I've gotten this far with 0.97 :-)

Related

Decoding array of hexadecimal bytes to a specific codepage brings a wrong result when encoding afterwards

I created a simple app which looks like this:
String stringValue = new String(new byte[] { 0x00, 0x00, 0x00, 0x25 }, "273");
byte[] valueEncoded = Arrays.copyOfRange(stringValue.getBytes("273"), 0, 4);
int finalResult = ByteBuffer.wrap(valueEncoded).getInt();
System.out.println("Result: "+finalResult);
I expect the result to be 37, but the result is 21. How come? Am I missing something, or is my approach wrong and that is why this error pops up?
I tried many other numbers and all seem to work fine...
As you can see I am using codepage 273 (IBM).
Looks to me like linefeed 0x25 got mapped to newline 0x15.
I assume that it went like this:
bytes to string: EBCDIC 0x25 -> UTF-16 0x000A
string to bytes: UTF-16 0x000A -> EBCDIC 0x15.
Why that mapping is preferred I can only guess: 0x000A is the standard line terminator on many systems, but it is generally output as "move to beginning of next line". Converting to IBM 273 is probably done because the text is destined for output on a device that uses that code page, and perhaps such devices want NL rather than LF for starting a new line.
Hypothesis validated:
class D {
    public static void main(String... a) throws Exception {
        String lf = "\n";
        byte[] b = lf.getBytes("273");
        System.out.println((int) b[0]);
    }
}
Output is 21.

Can't send special characters using Open-Smpp library in multi sms

I am using open-smpp library to communicate with SMSC.
I am able to send both single and multi SMSs; however, I am having a problem with special characters (šđžć), which in the case of sending a multi message (sendMultiSMS) come through as '?'.
I read at https://en.wikipedia.org/wiki/Short_Message_Peer-to-Peer, that text in short_message field must match data_coding.
Please see below for the relevant parts of the two methods.
As per above wiki resource, I defined variable DATA_CODING which represents data_coding and I tried to encode text in short_message like this:
submitSM.setShortMessage(message.getMessage(), Data.ENC_UTF16); - single message
ed.appendString(messageAux, Data.ENC_UTF16); - multi message
So for a single message the combination below is fine (DATA_CODING = (byte) 0x08 and Data.ENC_UTF16) and the characters come through correctly, but for a multi-SMS the special characters come through as '?'.
I tried all combinations like:
DATA_CODING = (byte) 0x01 and Data.ENC_UTF16
DATA_CODING = (byte) 0x08 and Data.ENC_UTF16
DATA_CODING = (byte) 0x01 and Data.ENC_UTF8
DATA_CODING = (byte) 0x08 and Data.ENC_UTF8
etc., but without success.
private static final byte DATA_CODING = (byte) 0x08;

public void sendSMS(XMessage message) throws SmppException
{
    ...
    SubmitSM submitSM = new SubmitSM();
    setScheduleDate(message, submitSM);
    submitSM.setProtocolId(PROTOCOL_ID);
    submitSM.setDataCoding(DATA_CODING);
    submitSM.setSourceAddr(mSrcAddress);
    submitSM.setDestAddr(mDestAddress);
    submitSM.setSequenceNumber(message.getSequence());
    submitSM.setShortMessage(message.getMessage(), Data.ENC_UTF16);
    SubmitSMResp submitSMResp = mSession.submit(submitSM);
}
public void sendMultiSMS(XMessage message) throws SmppException
{
    ...
    submitSMMulti = new SubmitSM();
    submitSMMulti.setProtocolId(PROTOCOL_ID);
    submitSMMulti.setDataCoding(DATA_CODING);
    setScheduleDate(message, submitSMMulti);
    submitSMMulti.setSourceAddr(mSrcAddress);
    submitSMMulti.setDestAddr(mDestAddress);
    submitSMMulti.setEsmClass((byte) 0x40);
    messageArray = XSMSProcessUtil.getMultiMessages(message.getMessage(), numSegments);
    SubmitSMResp submitSMResp = null;
    for (int count = 0; count < messageArray.length; count++)
    {
        submitSMMulti.setSequenceNumber(message.getSequence() + count);
        messageAux = messageArray[count];
        // UDH for a concatenated message: length, IEI, IE length, reference, total parts, part number
        ByteBuffer ed = new ByteBuffer();
        ed.appendByte((byte) 5);
        ed.appendByte((byte) 0x00);
        ed.appendByte((byte) 3);
        ed.appendByte((byte) message.getSequence());
        ed.appendByte((byte) numSegments);
        ed.appendByte((byte) (count + 1));
        ed.appendString(messageAux, Data.ENC_UTF16);
        submitSMMulti.setShortMessageData(ed);
        submitSMResp = mSession.submit(submitSMMulti);
    }
}
I found a solution using the information from this URL.
Here is a short explanation:
The GSM character encoding uses 7 bits to represent each character; languages with non-Latin alphabets usually use phones supporting Unicode. The specific character encoding utilized by these phones is usually UTF-16 or UCS-2, both of which use 16 bits to represent each character. Standard SMS messages have a maximum payload of 140 bytes (1120 bits). For Unicode phones, which use a 16-bit character encoding, this allows a maximum of 70 characters per standard SMS message. The UDH takes up 6 bytes (48 bits) of a normal SMS message payload, so each individual concatenated SMS message can hold 67 characters: 1072 bits / (16 bits/character) = 67 characters.
I needed to lower the maximum segment size from 153 to 67 characters and use DATA_CODING = (byte) 0x08 together with Data.ENC_UTF16.
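A tiny sketch of the arithmetic behind those limits (the constants are the standard SMS values quoted above; nothing here is specific to open-smpp):
public class SegmentMath {
    public static void main(String[] args) {
        int payloadBytes = 140;   // maximum payload of one SMS
        int udhBytes = 6;         // user data header of a concatenated SMS
        int bytesPerChar = 2;     // UCS-2 / UTF-16 for BMP characters

        int singleSms = payloadBytes / bytesPerChar;                     // 70
        int concatenatedSegment = (payloadBytes - udhBytes) / bytesPerChar; // 67

        System.out.println("Max chars, single Unicode SMS: " + singleSms);
        System.out.println("Max chars per concatenated Unicode segment: " + concatenatedSegment);
    }
}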

How to tell the original encoding of a file

I have a bunch of plain text file which I downloaded from 3rd party servers.
Some of them are gibberish: the server reported ENCODING1 (e.g. UTF-8), but in reality the file was encoded with ENCODING2 (e.g. Windows-1252).
Is there a way to somehow correct these files?
I presume the files were mostly encoded in UTF-8, ISO-8859-2 or Windows-1252 (and I presume they were mostly saved with one of these encodings). I was thinking about re-encoding every file's content with
new String(string.getBytes(ENCODING1), ENCODING2)
for all possibilities of ENCODING1 and ENCODING2 (for 3 encodings that would be 9 options), then finding some way (for example: character frequency?) to tell which of the 9 results is the correct one.
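A minimal sketch of that brute-force idea (the charset list matches the three encodings above; how to score the nine candidates is deliberately left open, e.g. character or word frequency):
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class ReencodeCandidates {
    // ENCODING1 = the charset the text was (possibly wrongly) decoded with,
    // ENCODING2 = the charset the bytes may really have been in.
    static List<String> candidates(String garbled) {
        String[] encodings = { "UTF-8", "ISO-8859-2", "windows-1252" };
        List<String> results = new ArrayList<>();
        for (String encoding1 : encodings) {
            for (String encoding2 : encodings) {
                // Undo the suspected wrong decoding, then decode with another candidate.
                String candidate = new String(garbled.getBytes(Charset.forName(encoding1)),
                                              Charset.forName(encoding2));
                results.add(encoding1 + " -> " + encoding2 + ": " + candidate);
            }
        }
        return results; // 3 x 3 = 9 candidates; pick the best-looking one by frequency analysis
    }
}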
Are there any 3rd party libraries for this?
I tried JChardet and ICU4J, but as far as I know both of them are only capable of detecting the encoding of the file before the step with ENCODING1 took place
Thanks,
krisy
You can use the juniversalchardet library (a Java port of Mozilla's universal charset detector, hosted on Google Code) to detect the character set of a file; please see the following:
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector
{
    public static void main(String[] args) throws java.io.IOException
    {
        if (args.length != 1) {
            System.err.println("Usage: java TestDetector FILENAME");
            System.exit(1);
        }
        byte[] buf = new byte[4096];
        String fileName = args[0];
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1)
        UniversalDetector detector = new UniversalDetector(null);

        // (2)
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        // (3)
        detector.dataEnd();

        // (4)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5)
        detector.reset();
    }
}
Read more at the following URL.
You can also try jChardet by SourceForge; please see the following URL.
Cheers!
Inside the JVM, Strings are always Unicode (they are converted when read or created), so aStringVariable.getBytes(ENCODING1) only matters for output.
For a basic understanding you should read http://www.joelonsoftware.com/articles/Unicode.html.
As mentioned in this article, there is no way to know for sure which original encoding was used; according to the article, Internet Explorer, for example, guesses by the frequency of different bytes.
So the original files are in UTF8 (multibyte Unicode format), ISO-8859-2 (Latin-2) and Windows-1252 (MS Latin-1). You want to have them all in UTF-8.
First, the download should not do any conversion, so the contents stay intact. Otherwise you can only attempt to repair a wrong encoding, with no guarantee of success.
Java uses Unicode for text internally, so create a String only with the correct encoding; for the raw file contents use byte[].
The functionality available:
If the file is in 7-bit US-ASCII, then it is already valid UTF-8.
If the file contains only valid UTF-8 sequences, it most likely is UTF-8; this can be tested.
What remains is to distinguish between Latin-2 and MS Latin-1.
The latter can be done with some statistics; for instance, identifying the language by its 100 most frequent words works rather well.
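A small sketch of that frequent-word scoring (the word list is a placeholder, not part of the original answer; run it on each candidate decoding and keep the highest score):
import java.util.List;

public class LanguageScore {
    // Count how many very frequent words of the expected language appear in the
    // decoded text; a higher score suggests the decoding used the right charset.
    static long score(String text, List<String> frequentWords) {
        String padded = " " + text.toLowerCase() + " ";
        return frequentWords.stream()
                .filter(w -> padded.contains(" " + w + " "))
                .count();
    }
}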
I am aware of a couple of charset detectors. That one did not seem to work might also mean that the file is already corrupted; you could check it with Notepad++, JEdit, or some other encoding-converting editor.
// requires imports: java.io.IOException, java.nio.charset.*, java.nio.file.*, java.util.*
Charset detectCharset(Path path) throws IOException {
    byte[] content = Files.readAllBytes(path);
    boolean ascii = true;
    boolean utf8 = true;
    Map<Byte, Integer> specialCharFrequencies = new TreeMap<>();
    for (int i = 0; i < content.length; ++i) {
        byte b = content[i];
        if (b < 0) {
            ascii = false;
            // A UTF-8 continuation byte (10xxxxxx) must follow another non-ASCII byte.
            if ((b & 0xC0) == 0x80) {
                if (i == 0 || content[i - 1] >= 0) {
                    utf8 = false;
                }
            }
            specialCharFrequencies.merge(b, 1, Integer::sum);
        }
    }
    if (ascii || utf8) {
        return StandardCharsets.UTF_8;
    }
    // ... determine by frequencies
    Charset latin1 = Charset.forName("Windows-1252");
    Charset latin2 = Charset.forName("ISO-8859-2");
    System.out.println(" B Freq 1 2");
    specialCharFrequencies.entrySet().stream()
            .forEach(e -> System.out.printf("%02x %06d %s %s%n",
                    e.getKey() & 0xFF, e.getValue(),
                    new String(new byte[] { e.getKey() }, latin1),
                    new String(new byte[] { e.getKey() }, latin2)));
    return null;
}
Illegal UTF-8 can slip through this check, but it would be easy to use a Charset decoder.
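For example, a strict decoder rejects malformed input instead of silently replacing it, which makes the UTF-8 test reliable (a sketch, not from the original answer; the class and method names are illustrative):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the bytes form a valid UTF-8 sequence.
    static boolean isValidUtf8(byte[] content) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}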

Is there any way to get the size in bytes of a string in Java?

I need the size in bytes of each line in a file, so I can get a percentage of the file read. I already got the size of the file with file.length(), but how do I get each line's size?
final String hello_str = "Hello World";
hello_str.getBytes().length is the "byte size", i.e. the number of bytes (in the platform's default encoding).
You need to know the encoding - otherwise it's a meaningless question. For example, "foo" is 6 bytes in UTF-16, but 3 bytes in ASCII. Assuming you're reading a line at a time (given your question) you should know which encoding you're using as you should have specified it when you started to read.
You can call String.getBytes(charset) to get the encoded representation of a particular string.
Do not just call String.getBytes() as that will use the platform default encoding.
Note that all of this is somewhat make-work... you've read the bytes, decoded them to text, then you're re-encoding them into bytes...
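A short illustration of that encoding dependence, using the "foo" example above (note that Java's built-in "UTF-16" charset also prepends a 2-byte BOM, so it reports 8 rather than 6):
import java.nio.charset.StandardCharsets;

public class ByteCounts {
    public static void main(String[] args) {
        String s = "foo";
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 6
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 8, includes the BOM
    }
}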
You probably use something like the following to read the file:
FileInputStream fis = new FileInputStream(path);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
    /* process line */
    /* report percentage */
}
You need to specify the encoding right at the beginning. If you don't, you should get UTF-8 on Android; it is the default, but that can be changed. I would assume that no device actually does that, though.
To repeat what the other answers already stated: the character count is not always the same as the byte count. Especially the UTF encodings are tricky. There are currently 249,764 assigned Unicode characters, and potentially over a million (WP), and the UTF encodings use 1 to 4 bytes to be able to encode all of them. UTF-32 is the simplest case since it always uses 4 bytes; UTF-8 does that dynamically and uses 1 to 4 bytes, with simple ASCII characters using just 1 byte. (source: UTF & BOM FAQ)
To get the number of bytes you can use e.g. line.getBytes("UTF-8").length. One big disadvantage is that this is very inefficient since it creates a copy of the String's internal array each time and throws it away afterwards. That is point #1 addressed at Android | Performance Tips.
It is also not 100% accurate in terms of actual bytes read from the file for following reasons:
UTF-16 text files, for example, often start with a special 2-byte BOM (Byte Order Mark) to signal whether they have to be interpreted little or big endian. Those 2 (UTF-8: 3, UTF-32: 4) bytes are not reported when you just look at the String you get from your reader. So you are already some bytes off here.
Encoding every line of the file back to UTF-16 bytes will include those BOM bytes for each line, so getBytes will report 2 bytes too many for each line.
Line ending characters are not part of the resulting line-String. To make things worse you have different ways of signaling the end of a line. Usually the Unix-Style '\n' which is only 1 character or the Windows-Style '\r''\n' which is two characters. The BufferedReader will simply skip those. Here your calculation is missing a very variable amount of bytes. From 1 byte for Unix/UTF-8 to 8 bytes for Windows/UTF-32.
The last two reasons would negate each other if you have Unix/UTF-16, but that is probably not the typical case. The effect of the error also depends on line length: if you have an error of 4 bytes for each line that is in total only 10 bytes long, your progress will be considerably wrong (if my math is good, your progress would be at 140% or 60% after the last line, depending on whether your calculation assumes -4 or +4 bytes per line).
That means that regardless of what you do, you get no more than an approximation.
Getting the actual byte count could probably be done if you wrote your own byte-counting Reader, but that would be quite a lot of work.
An alternative would be to use a custom InputStream that counts how many bytes are actually read from the underlying stream. That's not too hard to do, and it does not care about encodings.
The big disadvantage is that it does not increase linearly with the lines you read, since BufferedReader will fill its internal buffer and read lines from there, then read the next chunk from the file, and so on. If the buffer is large enough you are at 100% at the first line already. But I assume your files are big enough, or you would not want to find out about the progress.
This, for example, would be such an implementation. It works, but I can't guarantee that it is perfect. It won't work if streams use mark() and reset(). File reading should not do that, though.
static class CountingInputStream extends FilterInputStream {
    private long bytesRead;

    protected CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int result = super.read();
        if (result != -1) bytesRead += 1;
        return result;
    }

    @Override
    public int read(byte[] b) throws IOException {
        int result = super.read(b);
        if (result != -1) bytesRead += result;
        return result;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int result = super.read(b, off, len);
        if (result != -1) bytesRead += result;
        return result;
    }

    @Override
    public long skip(long n) throws IOException {
        long result = super.skip(n);
        if (result != -1) bytesRead += result;
        return result;
    }

    public long getBytesRead() {
        return bytesRead;
    }
}
Using the following code
File file = new File("mytestfile.txt");
int linesRead = 0;
long progress = 0;
long fileLength = file.length();
String line;
CountingInputStream cis = new CountingInputStream(new FileInputStream(file));
BufferedReader br = new BufferedReader(new InputStreamReader(cis, "UTF-8"), 8192);
while ((line = br.readLine()) != null) {
    long newProgress = cis.getBytesRead();
    if (progress != newProgress) {
        progress = newProgress;
        int percent = (int) ((progress * 100) / fileLength);
        System.out.println(String.format("At line: %4d, bytes: %6d = %3d%%", linesRead, progress, percent));
    }
    linesRead++;
}
System.out.println("Total lines: " + linesRead);
System.out.println("Total bytes: " + fileLength);
br.close();
I get output like
At line: 0, bytes: 8192 = 5%
At line: 82, bytes: 16384 = 10%
At line: 178, bytes: 24576 = 15%
....
At line: 1621, bytes: 155648 = 97%
At line: 1687, bytes: 159805 = 100%
Total lines: 1756
Total bytes: 159805
or in case of the same file UTF-16 encoded
At line: 0, bytes: 24576 = 7%
At line: 82, bytes: 40960 = 12%
At line: 178, bytes: 57344 = 17%
.....
At line: 1529, bytes: 303104 = 94%
At line: 1621, bytes: 319488 = 99%
At line: 1687, bytes: 319612 = 100%
Total lines: 1756
Total bytes: 319612
Instead of printing that you could update your progress.
So, what is the best approach?
If you know that you have simple ASCII text in an encoding that uses only 1 byte for those characters: just use String#length() (and maybe add +1 or +2 for the line ending; see the sketch after this list).
String#length() is fast and simple, and as long as you know what files you have, you should have no problems.
If you have international text where the simple approach won't work:
for smaller files, where processing each line takes rather long: String#getBytes(charset); the longer processing one line takes, the lower the impact of the temporary arrays and their garbage collection. The inaccuracy should be within acceptable bounds. Just make sure not to crash if progress > 100% or < 100% at the end.
for larger files: the counting-stream approach above. The larger the file, the better it works. Updating progress in 0.001% steps just slows things down. Decreasing the reader's buffer size would increase the accuracy, but it also decreases read performance.
If you have enough time: write your own Reader that tells you the exact byte position. Maybe a combination of InputStreamReader and BufferedReader, since Reader already operates on characters. Android's implementation may help as a starting point.
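A tiny sketch of the two simple estimates from the list above (the separator length, 1 for "\n" or 2 for "\r\n", is an assumption about the input file; the method names are illustrative):
import java.nio.charset.StandardCharsets;

public class SimpleProgress {
    // Rough per-line byte count for single-byte encodings (ASCII, Latin-1, ...).
    static long asciiBytes(String line, int separatorLength) {
        return line.length() + separatorLength;
    }

    // Closer estimate for international text: encode the line, then add the separator.
    static long utf8Bytes(String line, int separatorLength) {
        return line.getBytes(StandardCharsets.UTF_8).length + separatorLength;
    }
}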
If the file is an ASCII file, then you can use String.length(); otherwise it gets more complex.
Consider you have a string variable called hello_str:
final String hello_str = "Hello World";

// Check character length
hello_str.length()   // output will be 11

// Check encoded sizes
final byte[] utf8Bytes = hello_str.getBytes("UTF-8");
utf8Bytes.length     // output will be 11

final byte[] utf16Bytes = hello_str.getBytes("UTF-16");
utf16Bytes.length    // output will be 24

final byte[] utf32Bytes = hello_str.getBytes("UTF-32");
utf32Bytes.length    // output will be 44

Confusion about Java conversion of bytes to String for comparison of "byte order marks"

I'm trying to recognize a BOM for UTF-8 when reading a file. Of course, Java likes to deal with 16-bit chars, while the BOM characters are 8-bit bytes.
My test code looks like:
public void testByteOrderMarks() {
    System.out.println("test byte order marks");
    byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b', (byte) 'c'};
    String test = new String(bytes, Charset.availableCharsets().get("UTF-8"));
    System.out.printf("test len: %s value %s\n", test.length(), test);
    String three = test.substring(0, 3);
    System.out.printf("len %d >%s<\n", three.length(), three);
    for (int i = 0; i < test.length(); i++) {
        byte b = bytes[i];
        char c = test.charAt(i);
        System.out.printf("b: %s %x c: %s %x\n", (char) b, b, c, (int) c);
    }
}
and the result is:
test byte order marks
test len: 4 value ?abc
len 3 >?ab<
b: ? ef c: ? feff
b: ? bb c: a 61
b: ? bf c: b 62
b: a 61 c: c 63
I can't figure out why the length of "test" is 4 and not 6.
I can't figure out why I don't pick up each 8 bit byte to do the comparison.
Thanks
Don't use characters when trying to figure out the BOM header. The BOM is two bytes (UTF-16) or three bytes (UTF-8), so you should open a (File)InputStream, read those first bytes and process them.
Incidentally, the XML header (<?xml version=... encoding=...>) is pure ASCII, so it's safe to load that as a byte stream, too (well, unless there is a BOM to indicate that the file is saved with 16-bit characters and not as UTF-8).
My solution (see DecentXML's XMLInputStreamReader) is to load the first few bytes of the file and analyze them. That gives me enough information to create a properly decoding Reader out of an InputStream.
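A minimal sketch of that byte-level approach (the class name and the UTF-8 fallback are assumptions, not taken from DecentXML), checking the well-known BOM signatures before building a Reader:
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Peeks at up to 3 bytes and returns the charset implied by a BOM, falling
    // back to UTF-8; the bytes are pushed back so nothing is lost for the Reader.
    // The stream must be built with a pushback buffer of at least 3 bytes,
    // e.g. new PushbackInputStream(fileIn, 3).
    static Charset sniff(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int n = in.read(head);
        if (n > 0) in.unread(head, 0, n);
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF)
            return StandardCharsets.UTF_8;      // UTF-8 BOM: EF BB BF
        if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF)
            return StandardCharsets.UTF_16BE;   // UTF-16 big endian BOM: FE FF
        if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE)
            return StandardCharsets.UTF_16LE;   // UTF-16 little endian BOM: FF FE
        return StandardCharsets.UTF_8;          // assumption: default when no BOM is present
    }
}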
A character is a character. The Byte Order Mark is the Unicode character U+FEFF. In Java it is the character '\uFEFF'. There is no need to delve into bytes. Just read the first character of the file, and if it matches '\uFEFF' it is the BOM. If it doesn't match then the file was written without a BOM.
private final static char BOM = '\uFEFF'; // Unicode Byte Order Mark

String firstLine = readFirstLineOfFile("filename.txt");
if (firstLine.charAt(0) == BOM) {
    // We have a BOM
} else {
    // No BOM present.
}
If you want to recognize the BOM (or, more generally, the encoding of a file), a better solution that works for me is to use Mozilla's encoding detector library: http://code.google.com/p/juniversalchardet/
That link describes how to use it:
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        String fileName = "testFile.";
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1)
        UniversalDetector detector = new UniversalDetector(null);

        // (2)
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        // (3)
        detector.dataEnd();

        // (4)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5)
        detector.reset();
    }
}
If you are using Maven, the dependency is:
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>
