I'm trying to write a Perl client program to connect to a Java server application (JDuplicate). I see that the Java server uses the DataInput.readUTF and DataInput.writeUTF methods, which the JDuplicate website lists as "Java's modified UTF-8 protocol".
My test program is pretty simple. I'm trying to send client-type data, which should invoke a response from the server, but it just times out:
#!/usr/bin/perl
use strict;
use Encode;
use IO::Socket;

my $remote = IO::Socket::INET->new(
    Proto    => 'tcp',
    PeerAddr => 'localhost',
    PeerPort => '10421'
) or die "Cannot connect to server\n";

$|++;

$remote->send(encode_utf8("CLIENTTYPE|JDSC#0.5.9#0.2"));

while (<$remote>) {
    print $_, "\n";
}

close($remote);
exit(0);
I've tried $remote->send(pack("U","..."));, I've tried use utf8;, I've tried binmode($remote, ":utf8"), and I've tried sending just plain ASCII text; nothing ever gets a response.
I can see the data being sent with tcpdump, all in one packet, but the server itself does nothing with it (other than ACK the packet).
Is there something additional I need to do to satisfy Java's "modified" UTF-8 implementation?
Thanks.
You have to implement the protocol correctly:
First, the total number of bytes needed to represent all the characters of s is calculated. If this number is larger than 65535, then a UTFDataFormatException is thrown. Otherwise, this length is written to the output stream in exactly the manner of the writeShort method; after this, the one-, two-, or three-byte representation of each character in the string s is written.
As indicated in the docs for writeShort, it sends a 16-bit quantity in network order.
In Perl, that resembles
sub sendmsg {
    my($s, $msg) = @_;
    die "message too long" if length($msg) > 0xffff;
    my $sent = $s->send(
        pack(n => (length($msg) & 0xffff)) .
        $msg
    );
    die "send: $!" unless defined $sent;
    die "short write" unless $sent == length($msg) + 2;
}
sub readmsg {
    my($s) = @_;
    my $buf;
    my $nread;

    $nread = $s->read($buf, 2);
    die "read: $!" unless defined $nread;
    die "short read" unless $nread == 2;

    my $len = unpack n => $buf;
    $nread = $s->read($buf, $len);
    die "read: $!" unless defined $nread;
    die "short read" unless $nread == $len;

    $buf;
}
Although the code above doesn't perform modified UTF encoding, it elicits a response:
my $remote = IO::Socket::INET->new(
    Proto    => 'tcp',
    PeerAddr => 'localhost',
    PeerPort => '10421'
) or die "Cannot connect to server: $@\n";

my $msg = "CLIENTTYPE|JDSC#0.5.9#0.2";
sendmsg $remote, $msg;

my $buf = readmsg $remote;
print "[$buf]\n";
Output:
[SERVERTYPE|JDuplicate#0.5.9 beta (build 584)#0.2]
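For reference, this is roughly what the Java side of such an exchange looks like with DataInputStream.readUTF and DataOutputStream.writeUTF. This is a hypothetical sketch, not JDuplicate's actual code:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class UtfFrameServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(10421);
             Socket client = server.accept()) {
            DataInputStream in = new DataInputStream(client.getInputStream());
            DataOutputStream out = new DataOutputStream(client.getOutputStream());
            // readUTF first reads the 2-byte big-endian length, then that many bytes
            String msg = in.readUTF();
            if (msg.startsWith("CLIENTTYPE|")) {
                // writeUTF writes the same 2-byte length prefix before the payload
                out.writeUTF("SERVERTYPE|JDuplicate#0.5.9 beta (build 584)#0.2");
            }
        }
    }
}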
This is unrelated to the main part of your question, but I thought I would explain what the "Java's modified UTF-8" that the API expects actually is. It's UTF-8, except that characters outside the Basic Multilingual Plane are first broken into UTF-16 surrogate pairs, and each surrogate codepoint is then encoded in UTF-8 on its own, instead of the character the pair represents being encoded directly. For instance, take the character U+1D11E MUSICAL SYMBOL G CLEF.
In UTF-8 it's encoded as the four bytes F0 9D 84 9E.
In UTF-16, because it's beyond U+FFFF, it's encoded using the surrogate pair 0xD834 0xDD1E.
In "modified UTF-8", it's given the UTF-8 encoding of the surrogate pair codepoints: that is, you encode "\uD834\uDD1E" into UTF-8, giving ED A0 B4 ED B4 9E, which happens to be fully six bytes long.
When using this format, Java will also encode any embedded nulls using the (normally illegal) overlong form C0 80 instead of a single zero byte, ensuring that a "modified UTF-8" string never contains embedded null bytes.
If you're not sending any characters outside of the BMP or any nulls, though, there's no difference from the real thing ;)
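You can see both the length prefix and the surrogate-pair encoding by running writeUTF over that character; a quick demo (the class name is mine):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF("\uD834\uDD1E"); // U+1D11E MUSICAL SYMBOL G CLEF
        StringBuilder hex = new StringBuilder();
        for (byte b : bos.toByteArray()) {
            hex.append(String.format("%02X ", b));
        }
        // Prints the 2-byte length (00 06) followed by the six "modified UTF-8" bytes:
        // 00 06 ED A0 B4 ED B4 9E
        System.out.println(hex.toString().trim());
    }
}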
Here's some documentation courtesy of Sun.
Related
I have a PHP script which is supposed to return a UTF-8 encoded string. However, in Java I can't seem to compare it with its internal string in any way.
If I print "OK" and response, they appear the same in the console. However, if I check equality with
if ( "OK".equals(response) ) {
the result is false. I printed out both in binary: response is 11101111 10111011 10111111 01001111 01001011, while Java's String "OK" is 01001111 01001011, which is clearly ASCII. I tried to convert it to UTF-8 in a few ways, but to no avail:
String result2 = new String("OK".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
and
String result2 = new String("OK".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
Neither works; both still return ASCII codes for some reason.
byte[] result2 = "OK".getBytes(StandardCharsets.UTF_8);
System.out.print(new String(result2));
While this also gives the correct "OK" result, in binary it is still ASCII.
I've tried switching the communication to numbers instead, but 1 still does not equal 1: Integer.parseInt(response) fails with an error that "1" is not valid, although in every other respect the response is recognised as a normal String.
I'm looking for a solution preferably where "OK" is converted to UTF-8 and not response to ASCII, since I need to communicate with a PHP script along with 2 databases, all set to UTF-8. Java is started with the switch -Dfile.encoding=UTF8 to ensure national characters are not broken.
In UTF-8, all characters with codes 127 or less are encoded as a single byte. Therefore "OK" in UTF-8 and in ASCII is the same two bytes.
11101111 10111011 10111111 01001111 01001011 is not just a simple "OK"; it is
0xEF, 0xBB, 0xBF, "OK"
where 0xEF, 0xBB, 0xBF is a BOM (byte order mark).
These are bytes that editors do not display, but which can be used to determine the encoding.
Those bytes probably appeared in your PHP script before the <?php.
You have to configure your editor to remove the BOM from the file.
UPD
If it is not possible to alter the PHP script, you can use a workaround:
// check if the first character of the response is a BOM
if (!response.isEmpty() && (response.charAt(0) == 0xFEFF)) {
    // remove the first character
    response = response.substring(1);
}
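If you read the response as raw bytes instead, the same cleanup can be done at the byte level before decoding; a minimal sketch (decodeSkippingBom is a made-up helper name, not a library call):

import java.nio.charset.StandardCharsets;

public class BomStrip {
    static String decodeSkippingBom(byte[] raw) {
        int offset = 0;
        // 0xEF 0xBB 0xBF is the UTF-8 encoding of the BOM (U+FEFF)
        if (raw.length >= 3 && (raw[0] & 0xFF) == 0xEF
                            && (raw[1] & 0xFF) == 0xBB
                            && (raw[2] & 0xFF) == 0xBF) {
            offset = 3;
        }
        return new String(raw, offset, raw.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] response = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'O', 'K' };
        System.out.println("OK".equals(decodeSkippingBom(response))); // prints true
    }
}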
Jline is a module for intercepting user input at a console before the user presses Enter. It uses JNA or similar wizardry.
I'm doing a few experiments with it and I'm getting encoding problems when I input more "exotic" Unicode characters. The OS here is W10 and I'm using Cygwin. Also this is in Groovy but should be obvious to Java people.
def terminal = org.jline.terminal.TerminalBuilder.builder().jna( true ).system( true ).build()
terminal.enterRawMode()
// NB the Terminal I get is class org.jline.terminal.impl.PosixSysTerminal
def reader = terminal.reader()
def bytes = [] // NB class ArrayList
int readInt = -1
while( readInt != 13 && readInt != 10 ) {
    readInt = reader.read()
    byte convertedByte = (byte)readInt
    // see what the binary looks like:
    String binaryString = String.format("%8s", Integer.toBinaryString( convertedByte & 0xFF)).replace(' ', '0')
    println "binary |$binaryString|"
    bytes << (byte)readInt // NB means "append to list"
    println ">>> read |$readInt| byte |$convertedByte|"
}
// strip final byte (13 or 10)
bytes = bytes[0..-2]
println "z bytes $bytes, class ${bytes.class.name}"
def response = new String( (byte[])bytes.toArray(), 'UTF-8' )
// to get proper out encoding for Cygwin I then need to do this (I have no idea why!)
def psOut = new PrintStream(System.out, true, 'UTF-8' )
psOut.print( "using PrintStream: |$response|" )
This works fine with single-byte characters, and letters like "é" (two bytes in UTF-8) are also handled fine. But it goes wrong with "ẃ":
ẃ --> Unicode U+1E83
UTF-8 HEX: 0xE1 0xBA 0x83 (e1ba83)
BINARY: 11100001:10111010:10000011
Actually the binary it puts out when you enter "ẃ" is 11100001:10111010:10010010.
This translates to U+1E92, which is a different character, "Ẓ". And that is indeed what gets printed out in the response String.
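You can confirm that translation with a quick Java check, independent of JLine:

import java.nio.charset.StandardCharsets;

public class Utf8Check {
    public static void main(String[] args) {
        byte[] expected = { (byte) 0xE1, (byte) 0xBA, (byte) 0x83 };
        byte[] actual   = { (byte) 0xE1, (byte) 0xBA, (byte) 0x92 };
        // prints ẃ (U+1E83), what was typed
        System.out.println(new String(expected, StandardCharsets.UTF_8));
        // prints Ẓ (U+1E92), what actually came back
        System.out.println(new String(actual, StandardCharsets.UTF_8));
    }
}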
Unfortunately the JLine package hands you this reader, which is class org.jline.utils.NonBlocking$NonBlockingInputStreamReader... So I don't really know what I can do to investigate its encoding (I presume UTF-8) or somehow modify it... Can anyone explain what the problem is?
As far as I can tell this relates to a Cygwin-specific problem, as asked and then answered by me a year ago.
There is a solution, in my answer to the question I asked directly after this one... which correctly deals with Unicode input, even when outside the Basic Multilingual Plane, using JLine, ... and using a Cygwin console ... hopefully.
I am trying to convert a protobuf stream to a JSON object using the com.google.protobuf.util.JsonFormat class, as below:
String jsonFormat = JsonFormat.printer().print(data);
As per the documentation at https://developers.google.com/protocol-buffers/docs/proto3#json, bytes fields come back as Base64 strings (for example, "hashedStaEthMac": "QDOMIxG+tTIRi7wlMA9yGtOoJ1g=",). But I would like to get them as readable strings (like "locAlgorithm": "ALGORITHM_ESTIMATION",). Below is a sample output. Is there a way to get the JSON object as plain text, or any workaround to get the actual values?
{
    "seq": "71811887",
    "timestamp": 1488640438,
    "op": "OP_UPDATE",
    "topicSeq": "9023777",
    "sourceId": "xxxxxxxx",
    "location": {
        "staEthMac": {
            "addr": "xxxxxx"
        },
        "staLocationX": 1148.1763,
        "staLocationY": 980.3377,
        "errorLevel": 588,
        "associated": false,
        "campusId": "n5THo6IINuOSVZ/cTidNVA==",
        "buildingId": "7hY/jVh9NRqqxF6gbqT7Jw==",
        "floorId": "LV/ZiQRQMS2wwKiKTvYNBQ==",
        "hashedStaEthMac": "xxxxxxxxxxx",
        "locAlgorithm": "ALGORITHM_ESTIMATION",
        "unit": "FEET"
    }
}
Expected format is as below:
seq: 85264233
timestamp: 1488655098
op: OP_UPDATE
topic_seq: 10955622
source_id: 00505698749E
location {
    sta_eth_mac {
        addr: xx:xx:xx:xx:xx:xx
    }
    sta_location_x: 916.003
    sta_location_y: 580.115
    error_level: 854
    associated: false
    campus_id: 9F94C7A3A20836E392559FDC4E274D54
    building_id: EE163F8D587D351AAAC45EA06EA4FB27
    floor_id: 83144E609EEE3A64BBD22C536A76FF5A
    hashed_sta_eth_mac:
    loc_algorithm: ALGORITHM_ESTIMATION
    unit: FEET
}
Not easily, because the actual values are binary, which is why they're Base64-encoded in the first place.
Try to decode one of these values:
$ echo -n 'n5THo6IINuOSVZ/cTidNVA==' | base64 -D
??ǣ6?U??N'MT
In order to get more readable values, you have to understand what the binary data actually is, and then decide what format you want to use to display it.
The field called staEthMac.addr is 6 bytes and is probably an Ethernet MAC address. It's usually displayed as xx:xx:xx:xx:xx:xx where xx are the hexadecimal values of each byte. So you could decode the Base64 strings into a byte[] and then call a function to convert each byte to hex and delimit them with ':'.
The fields campusId, buildingId, and floorId are 16 bytes (128 bits) and are probably UUIDs. UUIDs are usually displayed as xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx where each x is a hex digit (4 bits). So you could (again) convert the Base64 string to byte[] and then print the hex digits, optionally adding the dashes.
Not sure about sourceId and hashedStaEthMac, but you could just follow the pattern of converting to byte[] and printing as hex. Essentially you're just doing a conversion from base 64 to base 16. You'll wind up with something like this:
$ echo -n 'n5THo6IINuOSVZ/cTidNVA==' | base64 -D | xxd -p
9f94c7a3a20836e392559fdc4e274d54
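In Java that conversion might look like the sketch below (toHex is a made-up helper; the sample values are the ones from the question):

import java.util.Base64;

public class Base64ToHex {
    static String toHex(byte[] bytes, String separator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(separator);
            sb.append(String.format("%02x", bytes[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] campusId = Base64.getDecoder().decode("n5THo6IINuOSVZ/cTidNVA==");
        System.out.println(toHex(campusId, ""));  // 9f94c7a3a20836e392559fdc4e274d54
        // a 6-byte MAC address would use ":" as the separator instead
        byte[] mac = { 0x00, 0x50, 0x56, (byte) 0x98, 0x74, (byte) 0x9E };
        System.out.println(toHex(mac, ":"));      // 00:50:56:98:74:9e
    }
}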
A point that I'm not sure you are getting is that it's binary data. There is no "readable" version that makes sense like "ALGORITHM_ESTIMATION" does; the best you can do is encode the binary data using letters and numbers so you can at least pronounce it.
Base64 (which encodes binary using 64 different characters) is pronounceable "N five T H lowercase-O six ..." but it's not real friendly because letter case is significant and because it uses letters like O and I that look like numbers. Hex (which encodes binary using just 16 characters) is a little easier to read.
I am using protocol buffers in an iOS application. The app consumes a web service written in Java, which spits back a base64 encoded string.
The base64 string is the same on both ends.
In the app however, whenever I try to convert the string to NSData, the number of bytes may or may not be the same on both ends. The result is a possible invalid protocol buffer exception, invalid end tag.
For example:
Source (bytes) | NSData | Diff
            93 |     93 |    0
          6739 |   6735 |   -4
          5745 |   5739 |   -6
The bytes are equal in the trivial case of an empty protocol buffer.
Here is the Java source:
import org.apache.commons.codec.binary.Base64;
....
public static String bytesToBase64(byte[] bytes) {
    return Base64.encodeBase64String(bytes);
}
On the iOS side, I have tried various algorithms from similar questions which all agree in byte size and content.
What could be causing this?
On closer inspection, the issue was my assumption that Base64 is Base64. I was using the URL-safe variant in the web service while the app's decoder was expecting the standard version.
I noticed underscores in the Base64, which I thought odd.
The value/character map on the Base64 Wikipedia page (http://en.wikipedia.org/wiki/Base64) shows no underscores, but later in the article it covers variants, which do use underscores.
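The mismatch is easy to reproduce in Java with java.util.Base64, which exposes both alphabets:

import java.util.Base64;

public class Base64Variants {
    public static void main(String[] args) {
        byte[] data = { (byte) 0xFF, (byte) 0xFF, (byte) 0xFF };
        System.out.println(Base64.getEncoder().encodeToString(data));    // //// (standard: '+' and '/')
        System.out.println(Base64.getUrlEncoder().encodeToString(data)); // ____ (URL-safe: '-' and '_')
        // Decoding URL-safe output with the standard decoder fails:
        try {
            Base64.getDecoder().decode("____");
        } catch (IllegalArgumentException e) {
            System.out.println("standard decoder rejects it: " + e.getMessage());
        }
    }
}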
I have a problem when reading special characters from an Oracle database (using the JDBC driver and GlassFish TopLink).
I store the name "GRØNLÅEN KJÆTIL" in the database through a web service, and in the database the data is stored correctly.
But when I read this String, print it to the log file, and convert it to a byte array with this code:
int pos = 0;
byte[] msg = new byte[1024];
String F = "F" + passenger.getName();
logger.debug("Add " + F + " " + F.length());
msg = addStringToArrayBytePlusSeparator(msg, F, pos);
..............
private byte[] addStringToArrayBytePlusSeparator(byte[] arrDest, String strToAdd, int destPosition)
{
    System.arraycopy(strToAdd.getBytes(Charset.forName("ISO-8859-1")), 0, arrDest, destPosition, strToAdd.getBytes().length);
    arrDest = addSeparator(arrDest, destPosition + strToAdd.getBytes().length, 1);
    return arrDest;
}
1) In the log file there is "Add FGRÃNLÃ " (the name isn't correct, and the F.length() value isn't printed).
2) The code throws:
java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at it.edea.ebooking.business.chi.control.VingCardImpl.addStringToArrayBytePlusSeparator(Test.java:225)
Any solution?
Thanks
You're calling strToAdd.getBytes() without specifying the character encoding, within the System.arraycopy call - that will be using the system default encoding, which may well not be ISO-8859-1. You should be consistent in which encoding you use. Frankly I'd also suggest that you use UTF-8 rather than ISO-8859-1 if you have the choice, but that's a different matter.
Why are you dealing with byte arrays anyway at this point? Why not just use strings?
Also note that your addStringToArrayBytePlusSeparator method doesn't give any indication of how many bytes it's copied, which means the caller won't have any idea what to do with it afterwards. If you must use byte arrays like this, I'd suggest making addStringToArrayBytePlusSeparator return either the new "end of logical array" or the number of bytes copied. For example:
private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");

/**
 * (Insert fuller description here.)
 * Returns the number of bytes written to the array.
 */
private static int addStringToArrayBytePlusSeparator(byte[] arrDest,
                                                     String strToAdd,
                                                     int destPosition)
{
    byte[] encodedText = strToAdd.getBytes(ISO_8859_1);
    // TODO: Verify that there's enough space in the array
    System.arraycopy(encodedText, 0, arrDest, destPosition, encodedText.length);
    return encodedText.length;
}
Encoding/decoding problems are hard: at every step of the process you have to use the correct encoding/decoding. So:
- Familiarize yourself with the difference between bytes (InputStreams) and characters (Readers, Strings).
- Choose the character encoding in which you want to store your data in the database, and the character encoding in which you want to expose your web service. Make sure that when you load initial data into the database, it's in the right encoding (see the sketch after this list).
- Connect with the right database properties. When using UTF-8, MySQL requires an addition to the connection URL: ?useUnicode=true&characterEncoding=UTF-8. I don't know about Oracle.
- If you print/debug at a certain step and it looks OK, you can't be sure you did it right. The logger can write with the wrong encoding (sometimes making something look OK when it's in fact broken). Your terminal might not handle unusual byte sequences correctly, and the same holds for command-line database clients: your data might be stored wrongly even though a misconfigured terminal shows it as correct.
- In XML, it's not only the stream encoding that matters, but also the encoding attribute in the XML declaration.
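To illustrate the first two points, here is a minimal round-trip sketch (the name is taken from the question above; this is not the original application code):

import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {
    public static void main(String[] args) {
        String name = "GRØNLÅEN KJÆTIL";
        // Encode and decode with the SAME explicit charset; never rely on the platform default.
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(name.equals(back)); // true
        // Decoding UTF-8 bytes as ISO-8859-1 produces mojibake like the "FGRÃNLÃ" in the log:
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));
    }
}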