So I've been trying to print out some lines of "-" characters. Why does the following not work?:
StringBuilder horizonRule = new StringBuilder();
for (int i = 0; i < 12; i++) {
    horizonRule.append("─");
    System.out.println(horizonRule.toString());
}
The correct output is several lines like
─
──
───
────
and so on, but the incorrect output is
â??
â??â??
â??â??â??
I'm guessing the string is not being properly decoded by println or something
The string in your code is not a hyphen but the Unicode box-drawing character U+2500 (─).
The terminal your application is printing to doesn't seem to expect UTF-8 content, so the issue is not inside your application.
Replace the character with a real hyphen (-), or make sure the tool that displays the output supports UTF-8.
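If you do control the output side, one workaround (a sketch, not from the original question) is to force the JVM to emit UTF-8 bytes regardless of the platform default encoding. The `PrintStream(OutputStream, boolean, Charset)` overload used here needs Java 10+; older versions accept the charset name `"UTF-8"` as a String instead:

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class BoxRule {
    public static void main(String[] args) {
        // Wrap stdout in a PrintStream that always emits UTF-8 bytes,
        // independent of the platform default charset.
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        StringBuilder horizonRule = new StringBuilder();
        for (int i = 0; i < 12; i++) {
            horizonRule.append('\u2500'); // BOX DRAWINGS LIGHT HORIZONTAL
            out.println(horizonRule);
        }
    }
}
```

The terminal must still be configured to interpret its input as UTF-8 for the glyphs to render.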
You say that the IDE wants to save the file as UTF-8, so you have probably saved it as UTF-8.
However, the compiler is likely to compile using whatever encoding your system uses by default.
If you save your source as UTF-8, make sure to compile it with the same encoding:
javac -encoding utf8 MyClass.java
I tried your code (literally just copy and paste) using BeanShell, and it worked perfectly. So there's nothing wrong with the code; it must be your environment.
stewart$ bsh
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true -Dapple.awt.UIElement=true
BeanShell 2.0b4 - by Pat Niemeyer (pat#pat.net)
bsh % StringBuilder horizonRule = new StringBuilder();
bsh % for(int i=0; i<12; i++) {
horizonRule.append("─");
System.out.println(horizonRule.toString());
}
─
──
───
────
─────
──────
───────
────────
─────────
──────────
───────────
────────────
bsh %
public class myTest1 {
    public static void main(String[] args) {
        StringBuilder horizonRule = new StringBuilder();
        for (int i = 0; i <= 13; i++) {
            horizonRule.append('_');
            System.out.println(horizonRule.toString());
        }
    }
}
is correct;
maybe you are using a different encoding? Try clearing your environment path.
Related
Using ProcessBuilder, I need to be able to send non-ASCII parameters to another Java program.
In this case, a program Abc needs to send e.g. Arabic characters to Def program through the parameters. I have control of Abc code, but not of Def.
Using ProcessBuilder the normal way, without any adjustment of the encoding, it is not possible (as mentioned here): Def receives question marks "?????".
However, I am able to get some results, though different encodings work in different scenarios.
E.g. I try every available encoding when sending to the recipient, and compare the result with what is expected.
Windows, IntelliJ console:
Default charset: UTF-8
Found charsets: windows-1252, windows-1254 and windows-1258
Windows, command prompt:
Default charset: windows-1252
Found charsets: CESU-8 and UTF-8
Ubuntu, command prompt:
Default charset: ISO-8859-1
Found charsets: ISO-2022-CN, ISO-2022-KR, ISO-8859-1, ISO-8859-15, ISO-8859-9, x-IBM1129, x-ISO-2022-CN-CNS and x-ISO-2022-CN-GB
My question is: how to programmatically know which correct encoding to use, since I need to have something universal?
In other words, what is the relation between the default charset and the found ones?
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Abc {
    private static final Path PATH = Paths.get("."); // With maven: ./target/classes

    public static void main(String[] args) throws Exception {
        var string = "hello أحمد";
        var bytes = string.getBytes();
        System.out.println("Original string: " + string);
        System.out.println("Default charset: " + Charset.defaultCharset());
        for (var c : Charset.availableCharsets().values()) {
            var newString = new String(bytes, c);
            var process = new ProcessBuilder().command("java", "-cp",
                    PATH.toAbsolutePath().toString(),
                    "Def", newString).start();
            process.waitFor();
            var output = asString(process.getInputStream());
            if (output.contains(string)) {
                System.out.println("Found " + c + " " + output);
            }
        }
    }

    private static String asString(InputStream is) throws IOException {
        try (var reader = new BufferedReader(new InputStreamReader(is))) {
            var builder = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                if (builder.length() != 0) {
                    builder.append(System.lineSeparator());
                }
                builder.append(line);
            }
            return builder.toString();
        }
    }
}
public class Def {
    public static void main(String[] args) {
        System.out.println(args[0]);
    }
}
Under the hood, what's actually being passed around is bytes, not chars. Normally, you'd expect the Java method that ends up turning characters into bytes to have an overload that lets you specify a charset but, for whatever reason, no such overload exists here.
Here is how it works:
You pass a string to ProcessBuilder
PB will turn that string into bytes using Charset.defaultCharset() (why? Because PB is all about making the OS do things, and the default charset reflects the OS's preferred charset).
These bytes are then fed to the process.
The process starts up. If it is java, and we're talking the args in psv main(String[] args), the same is done in reverse: Java takes the bytes and turns them back to characters via Charset.defaultCharset(), again.
This does show an immediate issue: If the default charset is not capable of representing a certain character, then in theory you are out of luck.
That would strongly suggest that using java to fire up java.exe should ordinarily mean you can pass whatever you want (unless the characters involved aren't representable in the system's charset).
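If you want to check up front whether a charset can represent a given string, `CharsetEncoder.canEncode` does exactly that. A small sketch (the sample text is the one from the question):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CanEncodeCheck {
    public static void main(String[] args) {
        String arg = "hello أحمد";
        // canEncode reports whether the charset can represent every character.
        CharsetEncoder def = Charset.defaultCharset().newEncoder();
        System.out.println("default can encode:    " + def.canEncode(arg));
        System.out.println("ISO-8859-1 can encode: "
                + StandardCharsets.ISO_8859_1.newEncoder().canEncode(arg)); // false: no Arabic
        System.out.println("UTF-8 can encode:      "
                + StandardCharsets.UTF_8.newEncoder().canEncode(arg));      // true
    }
}
```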
Your code is odd. In particular, this line is the problem:
var bytes = string.getBytes();
This is short for string.getBytes(Charset.defaultCharset()). So now you have your bytes in the provided charset.
var newString = new String(bytes, c);
and now you're taking those bytes and turning them into a string using a completely different charset. I'm not sure what you're trying to accomplish with this. Pure gobbledygook would come out.
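A minimal sketch of why that round trip produces gobbledygook; the charsets here are chosen just for illustration:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "أحمد";
        // Encode with one charset...
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        // ...and decode with a different one: the round trip is not reversible.
        String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(original.equals(garbled)); // false: mojibake
        // Decoding with the same charset is lossless:
        System.out.println(original.equals(
                new String(utf8Bytes, StandardCharsets.UTF_8))); // true
    }
}
```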
In other words, what is the relation between the default charset and the found ones?
What do you mean by 'found ones'? The string "Found charsets" appears nowhere in your code. If you mean: What Charset.availableCharsets() returns - there is no relationship at all. availableCharsets isn't relevant for ProcessBuilder.
One possibility is to convert your String to a string of Unicode escape sequences, pass that to the other process, and convert it back to a regular String there. A string of Unicode escape sequences will always contain ASCII characters only. Here is how it may look:
String encoded = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("hello أحمد");
The result will be that the String encoded will hold this value:
"\u0068\u0065\u006c\u006c\u006f\u0020\u0623\u062d\u0645\u062f"
This String you can safely pass to another process. In that other process, you can do the following:
String originalString = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(encodedString);
And the result will be that originalString will now hold this value:
"hello أحمد"
The class StringUnicodeEncoderDecoder can be found in an open-source library called MgntUtils. You can get the library as a Maven artifact or from GitHub (including source code and JavaDoc); the JavaDoc is also available online here.
This library, and this particular feature, is used and well tested by multiple users.
Disclaimer: this library is written by me.
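For reference, a rough JDK-only sketch of the same idea. These helper names are my own, not the library's API, and the decoder assumes every token is exactly a `\uXXXX` escape for a BMP character:

```java
public class UnicodeEscape {
    // Hypothetical stand-in for encodeStringToUnicodeSequence:
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c)); // ASCII-only output
        }
        return sb.toString();
    }

    // Hypothetical stand-in for decodeUnicodeSequenceToString:
    static String decode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i += 6) {
            // Each token is 6 chars: backslash, 'u', four hex digits.
            sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("hello أحمد");
        System.out.println(encoded);         // safe to pass as a process argument
        System.out.println(decode(encoded)); // round-trips back to the original
    }
}
```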
I can't find the right API for this. I tried this;
public static void main(String[] args) {
    for (int i = 2309; i < 3000; i++) {
        String hex = Integer.toHexString(i);
        System.out.println(hex + " = " + (char) i);
    }
}
In the Eclipse IDE, this code only prints output like this:
905 = ?
906 = ?
907 = ?
...
How can I make use of these decimal and hex values to get the Unicode characters?
It prints like that because all consoles use a monospaced font. Try it on a JLabel in a frame and it should display fine.
EDIT:
Try creating a Unicode PrintStream:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
And then print to it.
Here's the output in the CMD window.
I forgot to save the file in UTF-8 format; change it via
File > Properties > select the text file encoding.
This will properly print the right characters from the Eclipse console. The default is cp1252, which prints only ? for characters it does not understand.
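Putting the two suggestions together (a UTF-8 source file plus an explicit UTF-8 PrintStream), a runnable sketch might look like this, with the range shortened for brevity:

```java
import java.io.PrintStream;

public class DevanagariDemo {
    public static void main(String[] args) throws Exception {
        // An explicitly UTF-8 PrintStream; the console must also be set to UTF-8.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        for (int i = 2309; i < 2315; i++) {
            out.println(Integer.toHexString(i) + " = " + (char) i);
        }
    }
}
```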
When I run the following program:
public static void main(String args[]) throws Exception {
    byte str[] = {(byte) 0xEC, (byte) 0x96, (byte) 0xB4};
    String s = new String(str, "UTF-8");
}
on Linux and inspect the value of s in jdb, I correctly get:
s = "어"
on Windows, I incorrectly get:
s = "?"
My byte sequence is a valid UTF-8 encoding of a Korean character; why would it produce two very different results?
It correctly prints "어" on my computer (Ubuntu Linux), as described in Code Table Korean Hangul. Windows command prompt is known to have issues with encoding, don't bother.
Your code is fine.
It gives 어 for me. This means your console is probably not configured to display UTF-8 and it is a printing/display problem, rather than a problem with representation.
You get the correct string, it's Windows console that does not display the string correctly.
Here is a link to an article that discusses a way to make Java console produce correct Unicode output using JNI.
JDB is displaying the data incorrectly. The code works the same on both Windows and Linux. Try running this more definitive test:
public static void main(String[] args) throws Exception {
    byte str[] = {(byte) 0xEC, (byte) 0x96, (byte) 0xB4};
    String s = new String(str, "UTF-8");
    for (int i = 0; i < s.length(); i++) {
        System.out.println(BigInteger.valueOf((int) s.charAt(i)).toString(16));
    }
}
This prints out the hex value of every character in the string. This will correctly print out "c5b4" in both Windows and Linux.
I have a question specific to the Java Desktop API in Java 6, more specifically desktop.mail(URI uri).
I was wondering if there is a function one could use to ensure that the subject and body in, for example:
mailToURI = new URI("mailto", getToEmails() + "?SUBJECT=" + getEmailSubject()
+ "&BODY=" + getEmailBody(), null);
desktop.mail(mailToURI);
will be kept in accordance with RFC 2368 and still be displayed correctly in the email application?
Right now, examples of problematic text are the Scandinavian letters æøå / ÆØÅ, and complex URLs in the body containing ampersands (&) and such, e.g. http://www.whatever.com?a=b&c=d etc.
Is there a function in Java that ensures the integrity sought above is preserved when using the mailto: URI scheme with Java Desktop's mail(URI) function?
Would it be possible to make one?
At this point I have tried everything I can think of including:
MimeUtility.encodeText()
URLEncoder.encode(...)
A custom function encodeUnusualCharacters()
private static final Pattern SIMPLE_CHARS = Pattern.compile("[a-zA-Z0-9]");

private String encodeUnusualChars(String aText) {
    StringBuilder result = new StringBuilder();
    CharacterIterator iter = new StringCharacterIterator(aText);
    for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
        char[] chars = {c};
        String character = new String(chars);
        if (isSimpleCharacter(character)) {
            result.append(c);
        } else {
            //hexEncode(character, "UTF-8", result);
        }
    }
    return result.toString();
}

private boolean isSimpleCharacter(String aCharacter) {
    Matcher matcher = SIMPLE_CHARS.matcher(aCharacter);
    return matcher.matches();
}

/**
 * For the given character and encoding, appends one or more hex-encoded characters.
 * For double-byte characters, two hex-encoded items will be appended.
 */
private static void hexEncode(String aCharacter, String aEncoding, StringBuilder aOut) {
    try {
        String HEX_DIGITS = "0123456789ABCDEF";
        byte[] bytes = aCharacter.getBytes(aEncoding);
        for (int idx = 0; idx < bytes.length; idx++) {
            aOut.append('%');
            aOut.append(HEX_DIGITS.charAt((bytes[idx] & 0xf0) >> 4));
            aOut.append(HEX_DIGITS.charAt(bytes[idx] & 0xf));
        }
    } catch (UnsupportedEncodingException ex) {
        Logger.getLogger(LocalMail.class.getName()).log(Level.SEVERE, null, ex);
    }
}
And many more...
At the best I end up with the encoded text in the email that is opened up.
Providing no special encoding causes æøå or similar characters to stop further processing of the content.
I feel I am missing something crucial. Could anyone please enlighten me with a solution to this?
For line breaks I use String NL = System.getProperty("line.separator");
Perhaps there is some system-specific call that needs to be made for this to work?
By the way I am currently on Mac OS X 10.6.8 with Mail 4.5
marius$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-384-10M3425)
Java HotSpot(TM) Client VM (build 20.1-b02-384, mixed mode)
I really feel there must be a way; otherwise the subject and message parts of the desktop.mail(URI) function are so unreliable as to be useless.
Any help to point me in the right direction is greatly appreciated!!
Thanks Marius, it's a very useful line of code.
I modified it a bit for performance.
It's better to use replace instead of replaceAll when you are not using regular expressions.
This:
.replace("+", "%20")
is faster than:
.replaceAll("\\+", "%20")
Both replace ALL occurrences, but the first one does not have to do any regexp parsing.
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replace%28java.lang.CharSequence,%20java.lang.CharSequence%29
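A quick sketch confirming that both calls replace every occurrence, and that only the regex form needs the `+` escaped:

```java
public class ReplaceDemo {
    public static void main(String[] args) {
        String s = "a+b+c";
        // replace(CharSequence, CharSequence) treats its arguments literally...
        String viaReplace = s.replace("+", "%20");
        // ...while replaceAll(String, String) compiles a regex first,
        // so '+' (a quantifier) must be escaped.
        String viaReplaceAll = s.replaceAll("\\+", "%20");
        System.out.println(viaReplace);    // a%20b%20c
        System.out.println(viaReplaceAll); // a%20b%20c
    }
}
```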
Also, if the original string already has \r\n for new lines, the second replaceAll will double the \r. It's not a big issue, but I prefer to remove that one and provide a proper input string:
String result = java.net.URLEncoder.encode(src, "utf-8").replace("+", "%20")
Try this, hope it will work for you.
String result = java.net.URLEncoder.encode(src, "utf-8").replaceAll("\\+", "%20").replaceAll("\\%0A", "%0D%0A");
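For the mailto case specifically, a hedged sketch of how the encoded pieces might be assembled. The address is a placeholder, and `URLEncoder.encode(String, Charset)` needs Java 10+ (older versions take the charset name and throw a checked exception):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class MailtoDemo {
    public static void main(String[] args) throws Exception {
        // Percent-encode subject and body; '+' (form encoding for a space)
        // is rewritten to %20 because mailto: expects RFC 3986 encoding.
        String subject = URLEncoder.encode("Hei æøå", StandardCharsets.UTF_8)
                .replace("+", "%20");
        String body = URLEncoder.encode("See http://www.whatever.com?a=b&c=d",
                StandardCharsets.UTF_8).replace("+", "%20");
        // Build the raw mailto URI directly; the single-argument constructor
        // does not re-encode what is already percent-encoded.
        URI mailto = new URI("mailto:someone@example.com?subject=" + subject
                + "&body=" + body);
        System.out.println(mailto);
    }
}
```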
I am trying to decode some UTF-8 strings in Java.
These strings contain combining Unicode characters, such as CC 88 (the UTF-8 encoding of U+0308 COMBINING DIAERESIS).
The byte sequence seems OK, according to http://www.fileformat.info/info/unicode/char/0308/index.htm
But the output after conversion to String is invalid.
Any idea?
byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for (int i = 0; i < utf8.length; ++i) {
    int value = utf8[i] & 0xFF;
    System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
System.out.println(">" + new String(utf8, "UTF-8"));
Output:
{{69cc88}}
>i?
The console you're outputting to (e.g. on Windows) may not support Unicode, and may mangle the characters. The console output is not a good representation of the data.
Try writing the output to a file instead, making sure the encoding is correct on the FileWriter, then open the file in a unicode-friendly editor.
Alternatively, use a debugger to make sure the characters are what you expect. Just don't trust the console.
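A minimal sketch of that suggestion, writing through an OutputStreamWriter with an explicit charset rather than a plain FileWriter (the file name is arbitrary):

```java
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteUtf8 {
    public static void main(String[] args) throws Exception {
        byte[] utf8 = { 105, -52, -120 }; // "i" + COMBINING DIAERESIS
        String s = new String(utf8, StandardCharsets.UTF_8);
        // An OutputStreamWriter with an explicit charset bypasses the console
        // and the platform default encoding entirely.
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
            w.write(s);
        }
    }
}
```

Opening out.txt in a Unicode-aware editor shows the combining sequence intact.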
Here how I finally solved the problem, in Eclipse on Windows:
Click Run Configuration.
Click Arguments tab.
Add -Dfile.encoding=UTF-8
Click Common tab.
Set Console Encoding to UTF-8.
Modify the code:
byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for (int i = 0; i < utf8.length; ++i) {
    int value = utf8[i] & 0xFF;
    System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
PrintStream sysout = new PrintStream(System.out, true, "UTF-8");
sysout.print(">" + new String(utf8, "UTF-8"));
Output:
{{69cc88}}
> ï
The code is fine, but as skaffman said your console probably doesn't support the appropriate character.
To test for sure, you need to print out the unicode values of the character:
public class Test {
    public static void main(String[] args) throws Exception {
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, "UTF-8");
        for (int i = 0; i < text.length(); i++) {
            System.out.println(Integer.toHexString(text.charAt(i)));
        }
    }
}
This prints 69, 308 - which is correct (U+0069, U+0308).
Java, not unreasonably, encodes Unicode characters into native system encoded bytes before it writes them to stdout. Some operating systems, like many Linux distros, use UTF-8 as their default character set, which is nice.
Things are a bit different on Windows for a variety of backwards-compatibility reasons. The default system encoding will be one of the "ANSI" codepages and if you open the default command prompt (cmd.exe) it will be one of the old "OEM" DOS codepages (though it is possible to get ANSI and Unicode there with a bit of work).
Since U+0308 isn't in any of the "ANSI" character sets (probably 1252 in your case), it'll get encoded as an error character (usually a question mark).
An alternative to Unicode-enabling everything is to normalize the combining sequence U+0069 U+0308 to the single character U+00EF:
public static void emit(String foo) throws IOException {
    System.out.println("Literal: " + foo);
    System.out.print("Hex: ");
    for (char ch : foo.toCharArray()) {
        System.out.print(Integer.toHexString(ch & 0xFFFF) + " ");
    }
    System.out.println();
}

public static void main(String[] args) throws IOException {
    String foo = "\u0069\u0308";
    emit(foo);
    foo = Normalizer.normalize(foo, Normalizer.Form.NFC);
    emit(foo);
}
Under windows-1252, this code will emit:
Literal: i?
Hex: 69 308
Literal: ï
Hex: ef