Debugging if UTF-8 decoding is done correctly?

We have Java code that talks to an external system over TCP connections, exchanging XML messages encoded in UTF-8.
The messages received begin with '?', so the XML received is
?<begin>message</begin>
There is real doubt whether the first character is indeed '?'. At the moment, we cannot ask the external system what it actually sends.
The code snippet for reading the stream is as below.
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, Charset.forName("UTF-8")));
int readByte = reader.read();
if (readByte <= 0) {
    inputStream.close();
}
builder.append((char) readByte);
We are currently trying to log the raw bytes with int readByte = inputStream.read(). It will take a few days to receive the logs.
In the meantime, I was wondering how we could ascertain at our end whether it was truly a '?' and not a decoding issue.

I strongly suspect you have a byte-order mark (BOM) at the beginning of your document. It won't render as a visible character, and consequently could appear as a question mark. Can you dump the raw bytes and check for that sequence?

Your question seems to boil down to this:
Can we ascertain the real value of the first byte of the message without actually looking at it.
The answer is "No, you can't". (Obviously!)
...
However, if you could intercept the TCP/IP traffic from the external system with a packet sniffer (aka traffic monitoring tool), then dumping the first byte or bytes of the message would be simple ... requiring no code changes.
Is logging the int returned by inputStream.read() the correct way to analyse the bytes received? Or do the word length of the OS or other environment variables come into the picture?
The InputStream.read() method returns either a single (unsigned) byte of data (in the range 0 to 255 inclusive) or -1 to indicate "end of stream". It is not sensitive to the "word length" or anything else.
In short, provided you treat the results appropriately, calling read() should give you the data you need to see what the bytes in the stream really are.
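Until the raw-byte logs arrive, one thing that could be checked with the existing decoding path is the numeric value of the first decoded char, rather than its printed form. The sketch below is only an illustration: the class and method names are hypothetical, and the sample bytes are assumptions standing in for the real socket data. It relies on the fact that Java's UTF-8 decoder passes a leading byte-order mark (EF BB BF) through as the character U+FEFF, which many consoles and log encodings cannot display.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original code.
class FirstCharProbe {

    // Formats the value returned by Reader.read() so that an invisible
    // character such as U+FEFF cannot be mistaken for a literal '?'.
    static String describe(int codeUnit) {
        return String.format("U+%04X", codeUnit);
    }

    // Reads the first decoded char of a UTF-8 stream and describes it.
    static String firstChar(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        int c = reader.read();
        return c == -1 ? "end of stream" : describe(c);
    }

    public static void main(String[] args) throws IOException {
        // Sample data (an assumption): a UTF-8 BOM followed by XML.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'b', '>'};
        // Sample data (an assumption): a genuine question mark.
        byte[] withQuestionMark = "?<b>".getBytes(StandardCharsets.UTF_8);

        System.out.println(firstChar(new ByteArrayInputStream(withBom)));          // U+FEFF
        System.out.println(firstChar(new ByteArrayInputStream(withQuestionMark))); // U+003F
    }
}
```

If the logged value is U+003F, the sender really transmitted a question mark; U+FEFF would point to a BOM, and U+FFFD would point to bytes that are not valid UTF-8.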

Related

HTTP Webserver ignoring last line of POST request [duplicate]

Right now, I'm trying to write a GUI based Java tic-tac-toe game that functions over a network connection. It essentially works at this point, however I have an intermittent error in which several chars sent over the network connection are lost during gameplay. One case looked like this, when println statements were added to message sends/reads:
Player 1:
Just sent ROW 14 COLUMN 11 GAMEOVER true
Player 2:
Just received ROW 14 COLUMN 11 GAMEOV
I'm pretty sure the error is happening when I read over the network. The read takes place in its own thread, with a BufferedReader wrapped around the socket's InputStream, and looks like this:
try {
    int input;
    while ((input = dataIn.read()) != -1) {
        char msgChar = (char) input;
        String message = msgChar + "";
        while (dataIn.ready()) {
            msgChar = (char) dataIn.read();
            message += msgChar;
        }
        System.out.println("Just received " + message);
        this.processMessage(message);
    }
    this.sock.close();
}
My sendMessage method is pretty simple (just a write over a DataOutputStream wrapped around the socket's output stream), so I don't think the problem is happening there:
try {
    dataOut.writeBytes(message);
    System.out.println("Just sent " + message);
}
Any thoughts would be highly appreciated. Thanks!

As it turns out, the ready() method guarantees only that the next read WON'T block. Consequently, !ready() does not guarantee that the next read WILL block, just that it could.
I believe that the problem here had to do with the TCP stack itself. TCP is stream-oriented: it preserves the order of the bytes written to the socket, but makes no guarantees about how they are grouped into segments. I suspect that the TCP stack was breaking up the sent string in a way that made sense to it, so that ready() returned false when only the first part had arrived, even though the rest of the message was still on its way.
I refactored the code to add a newline character to every message send, then simply performed a readLine() instead. This allowed my network protocol to be dependent on the newline character as a message delimiter, rather than the ready() method. I'm happy to say this fixed the problem.
Thanks for all your input!
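The newline-delimited approach described above can be sketched roughly as follows. The class and method names are hypothetical, and a StringWriter/StringReader pair stands in for the socket streams so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating newline-delimited framing.
class LineProtocol {

    // Sender side: terminate every message with '\n' and flush.
    static void send(Writer out, String message) throws IOException {
        out.write(message);
        out.write('\n');
        out.flush();
    }

    // Receiver side: readLine() blocks until a full line (or end of stream)
    // has arrived, however the bytes happened to be split in transit.
    static List<String> receiveAll(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        List<String> messages = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            messages.add(line);
        }
        return messages;
    }

    public static void main(String[] args) throws IOException {
        StringWriter wire = new StringWriter(); // stands in for the socket
        send(wire, "ROW 14 COLUMN 11 GAMEOVER true");
        send(wire, "ROW 2 COLUMN 3 GAMEOVER false");
        System.out.println(receiveAll(new StringReader(wire.toString())));
    }
}
```

The delimiter only works as long as a message can never contain a newline itself; otherwise a length prefix is the usual alternative.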

Try flushing the OutputStream on the sender side. The last bytes might remain in some internal buffers.

The types of stream objects you use to handle the data really matter. It seems to me that this trouble is caused by the fact that you use DataOutputStream for sending but something else for receiving. Try to send and receive with DataOutputStream and DataInputStream respectively.
As a matter of fact, if you send something by calling dataOut.writeBoolean(b) but try to receive it by calling dataIn.readUTF(), you will not get back what you sent. DataInputStream and DataOutputStream are type-sensitive. Try to refactor your code with that in mind.
Moreover, some input streams return a single byte when read() is invoked. Here you convert that single byte into a char, while a Java char is two bytes wide.
msgChar = (char)dataIn.read();
Check whether this is the reason for the data loss.

Data is corrupted after sending from PHP to JAVA using tcp

I am trying to send data from a PHP TCP server to a Java TCP client.
I check my results by comparing hex values of the data.
The PHP script reads STDIN and sends it through the socket one byte at a time; Java reads it using DataInputStream.read(), converts it to hex and displays it.
If I manually type data into the script, it works OK.
If I use a file with data, it works OK.
But when I feed it /dev/urandom (even a few bytes), the data arrives corrupted on the Java side. The hex value efbfbd always appears in random places instead of the correct data.
Please help me with this issue.
PHP code:
$f = fopen( 'php://stdin', 'rb' );
while($line = fread($f, 1)){
    $length = 1;
    echo bin2hex($line)."\n";
    echo socket_write($client, $line, 1)."\n";
    $sent = socket_write($client, $line, $length);
    if ($sent === false) {
        break;
    }
    // Check if the entire message has been sent
    if ($sent < $length) {
        // The entire message was not sent.
        // Keep the part of the message that has not yet been sent
        $line = substr($line, $sent);
        // Get the length of the unsent part
        $length -= $sent;
    }
}
Java code:
in = new DataInputStream(clientSocket.getInputStream());
byte[] data = new byte[1];
int count = 0;
while(in.available() > 0){
    //System.out.println(in.available());
    in.read(data);
    String message = new String(data);
    System.out.println(message);
    //System.out.flush();
    System.out.println( toHex(message) );
    //in.flush();
    message = "";
}

You're stumbling over encoding. Calling new String(data) converts the byte array to a string using your platform's default encoding, whatever that may be (you can set it to UTF-8, for example, with java -Dfile.encoding=UTF-8).
The Java code you want would most likely look like the following:
in = new DataInputStream(clientSocket.getInputStream());
byte[] data = new byte[1];
int count = 0;
while (in.available() > 0) {
    // System.out.println(in.available());
    in.read(data);
    String hexMessage = Integer.toHexString(data[0] & 0xFF);
    String stringMessage = new String(data, "UTF-8"); // US-ASCII, ISO-8859-1, ...
    System.out.println(hexMessage);
}
Update: I missed the 32bit issue. The 8-bit byte, which is signed in Java, is sign-extended to a 32-bit int. To effectively undo this sign extension, one can mask the byte with 0xFF.
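A small illustration of the sign-extension point above; the class name is hypothetical and only standard JDK calls are used.

```java
// Hypothetical demo class showing why the 0xFF mask is needed.
class ByteHex {

    // Masking with 0xFF keeps only the low 8 bits, undoing sign extension.
    static String unsignedHex(byte b) {
        return Integer.toHexString(b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xEF;                       // -17 as a signed Java byte
        System.out.println(Integer.toHexString(b)); // prints "ffffffef"
        System.out.println(unsignedHex(b));         // prints "ef"
    }
}
```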

There are two main issues with your Java program.
First, the use of in.available(). It does not tell you how many bytes are still left in the message. It merely says how many bytes can be read from the stream right now without blocking. For example, if the server sends two packets of data over the socket, one has arrived, but one is still being sent over the Internet, and each packet has 200 bytes (this is just an example), then in the first call you'll get the answer 200. If you read 200 bytes, you're sure not to be blocked. But if the second packet has not arrived yet, your next check of in.available() will return 0. If you stop at this point, you only have half the data. Not what you wanted.
Typically you either have to read until you reach end-of-stream (InputStream.read() returns -1), and then you can't use the same stream anymore and you close the socket, or you have a specific protocol that tells you how many bytes to expect and you read that number of bytes.
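As a rough sketch of the first option (reading until end of stream), assuming the sender closes the connection when it is done; the class and method names are hypothetical, and a ByteArrayInputStream stands in for the socket stream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: drains a stream without consulting available().
class StreamDrain {

    // Reads until read() returns -1, i.e. until the peer closes the stream.
    static byte[] readToEnd(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            collected.write(chunk, 0, n); // copy only the n bytes actually read
        }
        return collected.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = new byte[10_000]; // stand-in for data from a socket
        System.out.println(readToEnd(new ByteArrayInputStream(sample)).length);
    }
}
```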
But that's not the reason for the strange values you see in the output of your program. The reason is that Java and PHP represent strings completely differently. In PHP, a string can contain any bytes at all, and their interpretation as characters is up to the programmer.
This basically means that a PHP string is the equivalent of a byte[] in Java.
But Java Strings are completely different. A String is a sequence of char values, and a char is always a two-byte UTF-16 code unit. When you convert the bytes you read into a Java String, this is always done by decoding the bytes using some character encoding, so that the appropriate characters are stored in the string.
For example, if your bytes are 44 4F 4C 4C, and the character encoding is ISO-8859-1, this will be interpreted as the characters \u0044, \u004F, \u004C, \u004C. It will be a string of four characters - "DOLL". But if your character encoding is UTF-16, the bytes will be interpreted as \u444F and \u4C4C. A string of only two characters, "䑏䱌".
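The "DOLL" example can be reproduced directly. This is just a demonstration with a hypothetical class name; UTF_16BE is used so that the byte order is explicit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the same four bytes decoded with two encodings.
class DollDemo {

    static final byte[] BYTES = {0x44, 0x4F, 0x4C, 0x4C};

    // One byte per character: four chars.
    static String asLatin1() {
        return new String(BYTES, StandardCharsets.ISO_8859_1);
    }

    // Two bytes per character: two chars, U+444F and U+4C4C.
    static String asUtf16() {
        return new String(BYTES, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(asLatin1() + " has " + asLatin1().length() + " chars");
        System.out.println("UTF-16 decoding has " + asUtf16().length() + " chars");
    }
}
```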
When you were reading from the console or from a file, the data was probably in the encoding that Java expects by default. This is usually the case when the file is written in pure English, with just English letters, spaces and punctuation. These are all 7-bit characters which are the same in ISO-8859-1 and UTF-8, which are the common defaults. But in /dev/urandom you'd have some bytes in the range 80 through FF, which may be treated differently when interpreted into a UTF-16 Java string.
Furthermore, you didn't show your toHex() method in Java. It probably reads bytes back from the string again, but using which encoding? If you read the bytes into the String using ISO-8859-1, and got them out in UTF-8, you'd get completely different bytes.
If you want to see exactly what PHP sent you, don't put the bytes in a String. Write a toHex method that works on byte arrays, and use the byte[] you read directly.
Also, always remember to check the number of bytes returned by read() and only interpret that number of bytes! read() does not always fill the entire array. So in your new toHex() method, you need to also pass the number of bytes read as a parameter, so that it doesn't display the parts of the array after them. In your case you just have a one-byte array, which is not recommended, but even in this case read() can return -1 at end of stream, in which case the array contains no new data and must not be interpreted.
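A toHex method along those lines might look like the sketch below (names are hypothetical). The main method also shows, under the assumption that the bytes were round-tripped through a UTF-8 String, where a value like efbfbd can come from: a byte that is not valid UTF-8 is replaced by U+FFFD during decoding, and U+FFFD encodes back to EF BF BD.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: hex dump that works on raw bytes, never on Strings.
class HexDump {

    // Formats only the first 'length' bytes, i.e. the count returned by read().
    static String toHex(byte[] data, int length) {
        StringBuilder sb = new StringBuilder(length * 2);
        for (int i = 0; i < length; i++) {
            sb.append(String.format("%02x", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x9C}; // not valid UTF-8 on its own
        System.out.println(toHex(raw, raw.length)); // prints "9c"

        // Round trip through a String: the invalid byte becomes U+FFFD.
        byte[] viaString = new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(viaString, viaString.length)); // prints "efbfbd"
    }
}
```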

As the comment above says, you might be having trouble with the string representation of the bytes in String message = new String(data);. To be certain, you should take the data bytes and encode them in Base64, for example. You can use a library such as Apache Commons or Java 8 to do that. You should be able to do something similar in PHP to compare.
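With Java 8 or later, the Base64 comparison could be sketched as follows. The class and method names are hypothetical; on the PHP side, base64_encode() applied to the same bytes should yield the same text, which makes the two ends easy to compare.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper: encodes exactly the bytes that were read.
class Base64Compare {

    // 'length' is the count returned by read(); the rest of the array is ignored.
    static String encode(byte[] data, int length) {
        return Base64.getEncoder().encodeToString(Arrays.copyOf(data, length));
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, (byte) 0xFF, (byte) 0x80}; // arbitrary binary data
        System.out.println(encode(sample, sample.length)); // prints "AP+A"
    }
}
```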

DataOutputStream.writeBytes adds zero-bytes

I have a small TCP server program and a corresponding client, and they communicate via ServerSocket and Socket classes and DataInputStream/DataOutputStream. And I have a problem with sending Strings to the server.
connection = new Socket("localhost", 2233);
outStream = new DataOutputStream(connection.getOutputStream());
outStream.writeBytes(fileName);
fileName is, at this point in time, a hard-coded String with the value "listener.jardesc". The server reads the string with the following code:
inStream = new DataInputStream(connection.getInputStream());
String fileName = inStream.readLine();
The string is received properly, but three zero-value bytes have been added to the end. Why is that and how can I stop it from happening? (I could, of course, trim the received string or somehow else stop this problem from mattering, but I'd rather prevent the problem completely)

I'm just going to throw this out there. You're using the readLine() method, which is deprecated in Java 5, 6 and 7. The API docs state quite clearly that this method "does not properly convert bytes to characters". I would read it as bytes or use a BufferedReader.
http://docs.oracle.com/javase/1.5.0/docs/api/java/io/DataInputStream.html#readLine%28%29

writeBytes() does not add extra bytes.
The code you've written is invalid, as you aren't writing a newline. Therefore it doesn't work, and blocks forever in readLine().
In trying to debug this you appear to have read the bytes some other way, probably with read(); and to have ignored the return value returned by read, and to have concluded that read() filled the buffer you provided, when it didn't, leaving three bytes in their initial state, which is zero.

Read all bytes from socket Stops at 52964 bytes

I'm making a server that receives packets of 64 KB.
int length = 65536;
byte[] bytes = new byte[length];
int pos = 0;
while(pos < length -1)
{
    System.out.println("Before read");
    pos += dis.read(bytes, pos, length-pos);
    System.out.println(""+pos+" >> "+ length);
}
This is the code I use to read all bytes from the socket. dis is an InputStream.
When I run the code, 1 out of n runs goes wrong: the code only receives 52964 bytes instead of 65536 bytes.
I also checked the C code, and it says it sends 65536 bytes.
Does someone know what I'm doing wrong?

This is yet another case where Jakarta Commons IOUtils is a better choice than writing it yourself. It's one line of code, and it's fully tested. I recommend IOUtils.readFully() in this case.
If it does not read the entire buffer, then you know that you're not sending all the content. Perhaps you're missing a flush on the server side.

InputStream.read() returns the number of bytes read or -1 if the end of the stream has been reached. You need to check for that error condition. Also, I suspect your while(..) loop is the problem. Why are you calling it pos as in position? You may be terminating prematurely. Also, ensure that your C code, whatever it is doing, is sending properly. You can examine the network traffic with a tool like Wireshark to be sure.

What do you mean it "goes wrong"? What is the output? It can't be exiting the loop before reading the full 64 KB, so what really happens?
Also, it's better to save the return value of the I/O call separately and inspect it before assuming the I/O was successful. If that's DataInputStream.read(), it returns -1 at end of stream.

Your code is incorrect as it doesn't check for -1.
This is a case for using DataInputStream.readFully() rather than coding it yourself and getting it wrong.
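Both suggestions from the answers can be sketched side by side: a manual loop that checks for -1, and the DataInputStream.readFully() shortcut. The class and method names are hypothetical, and a ByteArrayInputStream stands in for the socket stream.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reads exactly 'length' bytes or fails loudly.
class FixedSizeRead {

    // Manual loop: the return value is checked before it is added to pos.
    static byte[] readExactly(InputStream in, int length) throws IOException {
        byte[] buffer = new byte[length];
        int pos = 0;
        while (pos < length) {
            int n = in.read(buffer, pos, length - pos);
            if (n == -1) {
                throw new EOFException("stream ended after " + pos + " of " + length + " bytes");
            }
            pos += n;
        }
        return buffer;
    }

    // Same contract using the JDK: readFully() throws EOFException if short.
    static byte[] readExactlyWithReadFully(InputStream in, int length) throws IOException {
        byte[] buffer = new byte[length];
        new DataInputStream(in).readFully(buffer);
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[65536]);
        System.out.println(readExactly(in, 65536).length);
    }
}
```

In the original loop, a -1 return value is added to pos, so a stream that ends early never terminates the loop cleanly; throwing on -1 makes the short read visible.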

Extract first valid line of string from byte array

I am writing a utility in Java that reads a stream which may contain both text and binary data. I want to avoid I/O waits. To do that, I create a thread that keeps reading the data (and waiting for it) and puts it into a buffer, so clients can check availability and stop waiting whenever they want (by closing the input stream, which generates an IOException and ends the wait). This works very well as far as reading binary data out of it is concerned.
Now, I also want to make it easy for the client to read lines out of it, with something like '.hasNextLine()' and '.readLine()'. Without using an I/O-wait stream such as a buffered stream, (Q1) how can I check whether a binary (byte[]) contains a valid Unicode line (in the form of the length of the first line)? I looked around the String/Charset API but could not find it (or did I miss it?). (NOTE: If possible, I don't want to use a non-built-in library.)
Since I could not find one, I tried to create one. Without getting too complicated, here is my algorithm.
1). I look from the start of the byte array until I find '\n' or '\r' without '\n'.
2). Then, I cut the byte array from the start to that point and use it to create a string (with a Charset if specified) using 'new String(byte[])' or 'new String(byte[], Charset)'.
3). If that succeeds without an exception, we have found the first valid line and return it.
4). Otherwise, these bytes may not be a string, so I look further to another '\n' or '\r' w/o '\n'. and this process repeat.
5). If the search reaches the end of the available bytes, I stop and return null (no valid line found).
My question is (Q2): is the above algorithm adequate?
Just when I was about to implement it, I searched on Google and found that there are many other code points for a new line, for example U+2424, U+0085, U+000C, U+2028 and U+2029.
So my last question is (Q3): do I really need to detect these code points? If I do, will it increase the chance of a false alarm?
I am well aware that recognizing something in binary data is not absolute. I am just trying to find the best balance.
To sum up, I have an array of bytes and I want to extract the first valid string line from it, with or without a specific Charset. This must be done in Java without using any non-built-in library.
Thank you all in advance.

I am afraid your problem is not well-defined. You write that you want to extract the "first valid string line" from your data. But whether some byte sequence is a "valid string" depends on the encoding. So you must decide which encoding(s) you want to use in testing.
Sensible choices would be:
the platform default encoding (Java property "file.encoding")
UTF-8 (as it is most common)
a list of encodings you know your clients will use (such as several Russian or Chinese encodings)
What makes sense will depend on the data, there's no general answer.
Once you have your encodings, the problem of line termination should follow, as most encodings have rules on what terminates a line. In ASCII or Latin-1, LF,CR-LF and LF-CR would suffice. On Unicode, you need all the ones you listed above.
But again, there's no general answer, as new line codes are not strictly regulated. Again, it would depend on your data.

First of all, let me ask you a question: is the data you are trying to process legacy data? In other words, are you responsible for the format of the input stream that you are trying to consume here?
If you do control the input format, then you probably want to take the binary-vs-text decision out of the Q1 algorithm. For me, this algorithm has one troubling part.
`4). Otherwise, these bytes may not be a string, so I look further to
another '\n' or '\r' w/o '\n'. and this process repeat.`
Do you discard the input before the line terminator and take the bytes that start immediately after it, or do you re-evaluate the string, which now contains two line terminators? In the former case you may break the binary data interface; in the latter you may still not parse the text correctly.
I think having well defined markers for binary data and text data in your stream will simplify your algorithm a lot.
A couple of words on the String constructor: new String(byte[], Charset) will not throw any exception if the byte array is not valid in that particular Charset; instead it substitutes a replacement character for each invalid sequence, giving a string full of question marks or U+FFFD characters (probably not what you want). If you want an exception, you should use CharsetDecoder.
Also note that in Java 6 there are 2 constructors that take charset
String(byte[] bytes, String charsetName) and String(byte[] bytes, Charset charset). I did some simple performance test a while ago, and constructor with String charsetName is magnitudes faster than the one that takes Charset object ( Question to Sun: bug, feature? ).
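The CharsetDecoder approach mentioned above could be sketched like this. The class and method names are hypothetical; the decoder is configured to report errors instead of silently replacing them, so invalid input surfaces as a CharacterCodingException.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict decoding of a slice of a byte array.
class StrictDecode {

    // Returns the decoded text, or null if the bytes are not valid in 'charset'.
    static String decodeOrNull(byte[] data, int offset, int length, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data, offset, length))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // not a valid string in this encoding
        }
    }

    public static void main(String[] args) {
        byte[] text = "first line".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00};
        System.out.println(decodeOrNull(text, 0, text.length, StandardCharsets.UTF_8));
        System.out.println(decodeOrNull(binary, 0, binary.length, StandardCharsets.UTF_8));
    }
}
```

Note that this only tells you whether the bytes form a valid sequence in the chosen encoding; with a single-byte encoding such as ISO-8859-1 every byte sequence is valid, so the check is only informative for encodings like UTF-8.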

I would try this:
make the IO reader put strings/lines into a thread safe collection (for example some implementation of BlockingQueue)
the main code has only reference to the synced collection and checks for new data when needed, like queue.peek(). It doesn't need to know about the io thread nor the stream.
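Some Java code (minimal exception handling; System.in is only a placeholder for the real stream):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class IORunner extends Thread {
    private final BufferedReader reader;
    private final BlockingQueue<String> outputQueue;

    IORunner(InputStream in, BlockingQueue<String> outputQueue) {
        this.reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        this.outputQueue = outputQueue;
    }

    @Override
    public void run() {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                this.outputQueue.put(line); // blocks if the queue is bounded and full
            }
        } catch (IOException | InterruptedException e) {
            // Stream closed or thread interrupted: stop reading.
        }
    }
}

class Main {
    public static void main(String[] args) throws InterruptedException {
        // ...
        InputStream myStreamFromSomewhere = System.in; // placeholder (assumption)
        BlockingQueue<String> dataQueue = new LinkedBlockingQueue<>();
        new IORunner(myStreamFromSomewhere, dataQueue).start();
        while (true) {
            if (!dataQueue.isEmpty()) { // can also use .peek() != null
                System.out.println(dataQueue.take());
            }
            Thread.sleep(1000);
        }
    }
}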
The collection decouples the input(stream) more from the main code. You can also limit the number of lines stored/mem used by creating the queue with a limited capacity (see blockingqueue doc).
The BufferedReader handles the checking of new lines for you :) The InputStreamReader handles the charset (recommend setting one yourself since the default one changes depending on OS etc.).

The java.text namespace is designed for this sort of natural language operation. The BreakIterator.getLineInstance() static method returns an iterator that detects line breaks. You do need to know the locale and encoding for best results, though.

Q2: The method you use seems reasonable enough to work.
Q1: Can't think of something better than the algorithm that you are using
Q3: I believe it will be enough to test for \r and \n. The others are too exotic for usual text files.

I just solved this to get a test stub working for datagrams. I did byte[] varName = String.getBytes(); then final int len = varName.length; then sent the int through a DataOutputStream, followed by the byte array. On the receiving end, just do readInt() and then read that many bytes.
Not a lib, not hard to do either. Just read up on readUTF and do what they did for the bytes.
The string should be reconstructible from the byte array recovered that way; if not, you have other problems. If the string can be reconstructed, it can be buffered ... no?
You may be able to just use readUTF() / writeUTF() in the data streams - why not?
{ edit: per OP's request }
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

// Sending end ("destination" is an assumed OutputStream, e.g. the socket's output stream)
String data = "fdsfjal;sajssaafe8e88e88aa"; // fingers pounding keyboard
byte[] payload = data.getBytes(StandardCharsets.UTF_8);
DataOutputStream dataOutputStream = new DataOutputStream(destination);
dataOutputStream.writeInt(payload.length); // length in bytes, not in chars
dataOutputStream.write(payload);
dataOutputStream.flush();
dataOutputStream.close();
// Receiving end ("source" is the matching InputStream)
DataInputStream dataInputStream = new DataInputStream(source);
final int sizeToRead = dataInputStream.readInt();
byte[] datasink = new byte[sizeToRead];
dataInputStream.readFully(datasink); // blocks until all sizeToRead bytes have arrived
dataInputStream.close();
// constructor
// String(byte[] bytes, Charset charset)
final String result = new String(datasink, StandardCharsets.UTF_8);
// continue coding here
Do me a favor, keep the heat off of me. This was written very fast right in the posting tool, so the code probably contains substantial errors; it's faster for me just to explain it by writing Java. There will be others who can translate it to other languages, which you can do too if you want it in another codebase. You will need exception trapping and so on; just do a compile and start fixing errors. When you get a clean compile, start over from the beginning and look for blunders. (That's what a blunder is called in engineering - a blunder.)
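The writeUTF()/readUTF() alternative mentioned above could look roughly like the sketch below. The class and method names are hypothetical, and in-memory streams stand in for the network. Note that writeUTF() uses a two-byte length prefix, so a single string is limited to 65535 encoded bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper: length-prefixed strings via writeUTF()/readUTF().
class UtfRoundTrip {

    // Sending end: writes a 2-byte length followed by the encoded string.
    static byte[] encode(String text) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(wire)) {
            out.writeUTF(text);
        }
        return wire.toByteArray();
    }

    // Receiving end: readUTF() reads the length, then exactly that many bytes.
    static String decode(byte[] wire) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = encode("listener.jardesc");
        System.out.println(wire.length + " bytes on the wire");
        System.out.println(decode(wire));
    }
}
```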
