Jline is a module for intercepting user input at a console before the user presses Enter. It uses JNA or similar wizardry.
I'm doing a few experiments with it and I'm getting encoding problems when I input more "exotic" Unicode characters. The OS here is W10 and I'm using Cygwin. Also this is in Groovy but should be obvious to Java people.
def terminal = org.jline.terminal.TerminalBuilder.builder().jna( true ).system( true ).build()
terminal.enterRawMode()
// NB the Terminal I get is class org.jline.terminal.impl.PosixSysTerminal
def reader = terminal.reader()
def bytes = [] // NB class ArrayList
int readInt = -1
while( readInt != 13 && readInt != 10 ) {
readInt = reader.read()
byte convertedByte = (byte)readInt
// see what the binary looks like:
String binaryString = String.format("%8s", Integer.toBinaryString( convertedByte & 0xFF)).replace(' ', '0')
println "binary |$binaryString|"
bytes << (byte)readInt // NB means "append to list"
println ">>> read |$readInt| byte |$convertedByte|"
}
// strip final byte (13 or 10)
bytes = bytes[0..-2]
println "z bytes $bytes, class ${bytes.class.name}"
def response = new String( (byte[])bytes.toArray(), 'UTF-8' )
// to get proper out encoding for Cygwin I then need to do this (I have no idea why!)
def psOut = new PrintStream(System.out, true, 'UTF-8' )
psOut.print( "using PrintStream: |$response|" )
This works fine with one-byte Unicode, and letters like "é" (2-bytes) get handled fine. But it goes wrong with "ẃ":
ẃ --> Unicode U+1E83
UTF-8 HEX: 0xE1 0xBA 0x83 (e1ba83)
BINARY: 11100001:10111010:10000011
Actually the binary it puts out when you enter "ẃ" is 11100001:10111010:10010010.
This translates to U+1E92, which is another Polish character, "Ẓ". And that is indeed what gets printed out in the response String.
Unfortunately the JLine package hands you this reader, which is class org.jline.utils.NonBlocking$NonBlockingInputStreamReader... So I don't really know what I can do to investigate its encoding (I presume UTF-8) or somehow modify it... Can anyone explain what the problem is?
As far as I can tell this relates to a Cygwin-specific problem, as asked and then answered by me a year ago.
There is a solution, in my answer to the question I asked directly after this one... which correctly deals with Unicode input, even when outside the Basic Multilingual Plane, using JLine, ... and using a Cygwin console ... hopefully.
Related
I have successfully used the ISO8859-13 character encoding before but this time it doesn't seem to be working.
Based on the web site https://en.wikipedia.org/wiki/ISO/IEC_8859-13 it is a valid character.
These are the 3 characters stored in the file.
äää
Here is the code being used.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
public class ReadFile
{
public static void main(String[] arguments)
{
try
{
File inFile = new File("C:\\Downloads\\MyFile.txt");
if (inFile.exists())
{
System.out.println("File found");
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream(inFile), "ISO8859-13"));
String line = null;
while ( (line = in.readLine()) != null )
{
System.out.println("Line Read: >" + line + "<");
}
}
else
{
System.out.println("File not found");
}
}
catch (IOException e)
{
}
}
}
The output on both Windows and Linux with and/or without Eclipse is the same which is.
Line Read: >?¤?¤?¤<
This previously worked for a number of other characters but why not for this?
There are many explanations possible for what you are observing. The two most likely ones, along with some code you can use to confirm that you've found the cause:
Option #1: Terminal issues
Maybe you are writing this to a terminal that either cannot render ä, or, there is a terminal transfer issue (terminals are, in the end, just a bunch of streams and pipes hooked together, they are bytes under the hood, so if one part of the process thinks all are agreed that all bytes are to be interpreted as UTF-8 encoded text, and another as ISO-8859-13 encoded, you get problems). Given that you see the exact same output on windows as on linux this is unlikely (it would be particularly likely if you are seeing this in the 'console' view in an IDE, or different outputs on different systems for the same code). If you want to test it, run instead: System.out.println("unicode codepoint of the first character: " + (int) line.charAt(0)); - this should print 228, which is the unicode codepoint for ä. If it doesn't, then you can be certain this isn't the (only) problem.
If this is it, the fix is to, well, use another terminal or mess with settings, I'd just ask another SO question and give plenty of detail on your setup (which OS, which terminal client, what does SET print, does the client have encoding options, etcetera).
Option #2: It's not actually ISO-8859-13
This, too, is simple to test: remark out your BufferedReader in = .... line and replace it with: System.out.println(new FileInputStream(file).read()); - this should print 228. If it prints anything else, your input file is not actually ISO-8859-13.
If this is it, find out what the encoding actually is and use that instead. For example, in UTF-8 encoding, ä would end up as 2 bytes in a file. That would already imply that your input file containing just äää and not even a newline afterwards is 6 bytes large (in ISO-8859-13, it would be 3), and that the raw bytes, as you read them with fileInputStream.read(), are, in order: 195 164 195 164 195 164. So, if you run the above code and it prints 195 instead of 228 - your input is probably in UTF-8; it's definitely not in ISO-8859-13.
I have a PHP script which is supposed to return an UTF-8 encoded string. However, in Java I can't seem to compare it with it's internal string in any way.
If I print "OK" and response, they appear the same in console. However, if I check equality
if ( "OK".equals(response) ) {
the result is false. I printed out both in binary, response is 11101111 10111011 10111111 01001111 01001011, the Java's String "OK" however is 01001111 01001011 which is cleary ASCII. I tried to convert it to UTF8 in a few ways, but no avail:
String result2 = new String("OK".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
and
String result2 = new String("OK".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
are both not working, still return ASCII codes for some reason.
byte[] result2 = "OK".getBytes(StandardCharsets.UTF_8); System.out.print(new String(result2));
While this also gives the correct "OK" result, in binary it still returns ASCII.
I've tried to change communication to numbers instead, but 1 still does not equal to 1, as Integer.parseInt(response) returns "1" is not a String error message, altough in every other aspect, it is recognised as a normal String.
I'm looking for a solution preferably where "OK" is converted to UTF-8 and not response to ASCII, since I need to communicate with a PHP script along with 2 databases, all set to UTF-8. Java is started with the switch -Dfile.encoding=UTF8 to ensure national characters are not broken.
in UTF-8 all characters with codes 127 or less are encoded by a single byte. Therefore "OK" in UTF-8 and ASCII is the same two bytes.
11101111 10111011 10111111 01001111 01001011 it is not just simple "OK" it is
0xEF, 0xBB, 0xBF, "OK"
where 0xEF, 0xBB, 0xBF are a BOM (Byte order mark)
It is symbols which are not displayed by editors but used to determine the encoding.
Probably those symbols appeared in you php script before <?php
You have to configure your editor to remove BOM from the file
UPD
If it is not possible to alter the php script, you can use a workaround:
// check if the first symbol of the response is BOM
if (!response.isEmpty() && (response.charAt(0) == 0xFEFF)) {
// removing the first symbol
response = response.substring(1);
}
I essentially have the exact opposite problem as
new-line-appending-on-my-encrypted-string
It seems like the old Java Base64 utility would always add new lines every 76 characters when returning a string, but using the following code, I don't get those breaks I need.
Path path = Paths.get(file);
byte[] data = Files.readAllBytes(path);
String txt= Base64.getEncoder().encodeToString(data);
Is there an easy way to tell the encoder to add the newlines?
I've tried implementing a stringbuilder to insert the newlines, But it ends up changing the entire output (I copy the text from java console to HxD editor, and compare against my known working 'BLOB' with newlines).
String txt= Base64.getEncoder().encodeToString(data);
//Byte code for newline
byte b1 = 0x0D;
byte b2 = 0x0A;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < txt.length(); i++) {
if (i > 0 && (i % 76 == 0)) {
sb.append((char)b1);
sb.append((char)b2);
}
sb.append(txt.charAt(i));
}
EDIT (in response to question):
It's not the easiest thing to explain, but when I don't use string builder, the output of the encode will start like this:
AAAAPAog4lBVgGJrT2b+mQVicHN3d////////3hhcDJiLWVtMjUwLWVtMjUwLWRldjA0NTUAAAAAAA
But I want it to look like this:
AAAAPAog4lBVgGJrT2b+mQVicHN3d////////3hhcDJiLWVtMjUwLWVtMjUwLWRldjA0NTUAAAAA..AA
As you can see, the ".." represents 0x0D and 0X0A or a newline, which is insterted at the 76th character (this is what the old base64 would output).
However, when I append the bytes b1 and b2 (newline) after the 76th character, the output becomes:
BPwAFHwA0CUFoG8AgDRCAAIlQgAAJUIAAhUfNEIAAiUkmw/0fADQFSInART/ADUlfADQFQE0fADQ..
So it looks like the ".." is in the right spot, but everything before it is different.
Thanks!
You want getMimeEncoder instead:
MIME
Uses the "The Base64 Alphabet" as specified in Table 1 of RFC 2045 for encoding and decoding operation. The encoded output must be represented in lines of no more than 76 characters each and uses a carriage return '\r' followed immediately by a linefeed '\n' as the line separator. No line separator is added to the end of the encoded output. All line separators or other characters not found in the base64 alphabet table are ignored in decoding operation.
(emphasis mine)
Note that the encoding scheme is otherwise the same as the basic encoder from getEncoder - they are both derived from RFC 2045.
Today I splitted Base64 representation of X509Certificate with the following code:
StringBuilder sb = new StringBuilder();
int chunksCount = str.length()/76;
for(int i=0;i<chunksCount;i++){
sb.append(str.substring(76*i,76*(i+1))).append("\r\n");
}
if(str.length() % 76 != 0) sb.append(str.substring(76*chunksCount)).append("\r\n");
I think, adding big parts better than iterating over each letter. Also, some libraries provide Base64 encoder with special parameter allowing to split with required part size but I had to use some library without such feature.
I'm punishing myself a bit by doing the python challenges series in Scala.
Now, one of the challenges is to read in a string that's been compressed using the bzip algorithm and output the result.
BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084
Now, after some digging it appears as if there isn't a standard java library for bzip processing, but there is something in the apache ant project, that this guy has kindly taken out for use as a separate library.
The thing is, I can't seem to get it to work with the following code, it just hangs in the scala REPL and the JVM maxes out at 100% CPU usage
This is the code I'm trying...
import java.io.{ByteArrayInputStream}
import org.apache.tools.bzip2.{CBZip2InputStream}
import org.apache.commons.io.{IOUtils}
object ChallengeEight extends Application {
val inputString = """BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084"""
val inputStream = new ByteArrayInputStream( inputString.getBytes("UTF-8") ) //convert string to inputstream
inputStream.skip(2) //skip the 'BZ' part at the start
val bzipInputStream = new CBZip2InputStream(inputStream) //hangs here....
val result = IOUtils.toString(bzipInputStream, "UTF-8");
println(result)
}
Anyone got any ideas? Or is the CBZip2InputStream class expecting some extra bytes that you might find in a file that has been zipped with bzip2?
Any help would be appreciated
EDIT For the record this is the python solution
import bz2
un = "BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!" \
"\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084"
print [bz2.decompress(elt) for elt in (un)]
To escape characters use a unicode escape sequence like \uXXXX syntax where XXXX is the hexadecimal sequence for the unicode character.
val un = "BZh91AY&SYA\u00af\u0082\r\u0000\u0000\u0001\u0001\u0080\u0002\u00c0\u0002\u0000 \u0000!\u009ah3M\u0007<]\u00c9\u0014\u00e1BA\u0006\u00be\u00084"
You are enclosing your string in triple quotes which means you will pass the literal characters to the algorithm rather than the control/binary characters they represent.
I'm trying to write a perl client program to connect to a Java server application (JDuplicate). I see that the java server uses The DataInput.readUTF and DataInput.writeUTF methods, which the JDuplicate website lists as "Java's modified UTF-8 protocol".
My test program is pretty simple, i'm trying to send client type data, which should invoke a response from the sever, however it just times out:
#!/usr/bin/perl
use strict;
use Encode;
use IO::Socket;
my $remote = IO::Socket::INET->new(
Proto => 'tcp',
PeerAddr => 'localhost',
PeerPort => '10421'
) or die "Cannot connect to server\n";
$|++;
$remote->send(encode_utf8("CLIENTTYPE|JDSC#0.5.9#0.2"));
while (<$remote>) {
print $_,"\n";
}
close($remote);
exit(0);
I've tried $remote->send(pack("U","..."));, I've tried "use utf8;", I've tried binmode($remote, ":utf8"), and I've tried sending just plain ASCII text, nothing ever gets responded to.
I can see the data being sent with tcpdump, all in one packet, but the server itself does nothing with it (other then ack the packet).
Is there something additional i need to do to satisfy the "modified" utf implementation of Java?
Thanks.
You have to implement the protocol correctly:
First, the total number of bytes needed to represent all the characters of s is calculated. If this number is larger than 65535, then a UTFDataFormatException is thrown. Otherwise, this length is written to the output stream in exactly the manner of the writeShort method; after this, the one-, two-, or three-byte representation of each character in the string s is written.
As indicated in the docs for writeShort, it sends a 16-bit quantity in network order.
In Perl, that resembles
sub sendmsg {
my($s,$msg) = #_;
die "message too long" if length($msg) > 0xffff;
my $sent = $s->send(
pack(n => (length($msg) & 0xffff)) .
$msg
);
die "send: $!" unless defined $sent;
die "short write" unless $sent == length($msg) + 2;
}
sub readmsg {
my($s) = #_;
my $buf;
my $nread;
$nread = $s->read($buf, 2);
die "read: $!" unless defined $nread;
die "short read" unless $nread == 2;
my $len = unpack n => $buf;
$nread = $s->read($buf, $len);
die "read: $!" unless defined $nread;
die "short read" unless $nread == $len;
$buf;
}
Although the code above doesn't perform modified UTF encoding, it elicits a response:
my $remote = IO::Socket::INET->new(
Proto => 'tcp',
PeerAddr => 'localhost',
PeerPort => '10421'
) or die "Cannot connect to server: $#\n";
my $msg = "CLIENTTYPE|JDSC#0.5.9#0.2";
sendmsg $remote, $msg;
my $buf = readmsg $remote;
print "[$buf]\n";
Output:
[SERVERTYPE|JDuplicate#0.5.9 beta (build 584)#0.2]
This is unrelated to the main part of your question, but I thought I would explain what the "Java's modified UTF-8" that the API expects is; it's UTF-8, except with UTF-16 surrogate pairs encoded as their own codepoints, instead of having the characters represented by the pairs encoded directly in UTF-8. For instance, take the character U+1D11E MUSICAL SYMBOL G CLEF.
In UTF-8 it's encoded as the four bytes F0 9D 84 9E.
In UTF-16, because it's beyond U+FFFF, it's encoded using the surrogate pair 0xD834 0xDD1E.
In "modified UTF-8", it's given the UTF-8 encoding of the surrogate pair codepoints: that is, you encode "\uD834\uDD1E" into UTF-8, giving ED A0 B4 ED B4 9E, which happens to be fully six bytes long.
When using this format, Java will also encode any embedded nulls using the illegal overlong form C0 80 instead of encoding them as nulls, ensuring that there are never any embedded nulls in a "modified UTF-8" string.
If you're not sending any characters outside of the BMP or any nulls, though, there's no difference from the real thing ;)
Here's some documentation courtesy of Sun.