Java, converting non latin UTF-8 characters to proper string

When I try to output a string in Java like this:
System.out.println("Привет");
Console output shows me this result:
Привет
I have a REST API method where I receive a string from an outside request. When I send the exact same string, Привет, encoded as UTF-8 and try to output it like this:
post("/check", (req, res) -> {
    receivedString = req.body();
});
System.out.println(receivedString);
It shows this:
������
What do I need to do to turn these question marks into a properly readable string?

You can try wrapping System.out in a PrintStream that writes UTF-8:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(receivedString);
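A slightly fuller sketch of the PrintStream suggestion above, assuming the request body really is UTF-8 and only the console charset is wrong. The class name Utf8Console and the helper decodeUtf8 are made up for illustration and are not part of any framework; the byte array stands in for whatever raw bytes the request delivers:

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class Utf8Console {
    // Hypothetical helper: decode raw request bytes explicitly as UTF-8
    // instead of relying on the platform default charset.
    static String decodeUtf8(byte[] rawBody) {
        return new String(rawBody, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Stand-in for the bytes of the incoming request body.
        byte[] rawBody = "Привет".getBytes(StandardCharsets.UTF_8);
        String receivedString = decodeUtf8(rawBody);

        // Wrap System.out so characters are written to the console as UTF-8 bytes.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(receivedString);
    }
}
```

If the console itself is not set to UTF-8, the output can still look wrong even though the String is correct, so it may be worth checking the terminal or IDE console encoding as well.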

Related

Which encoding for ProcessBuilder parameters

Using ProcessBuilder, I need to be able to send non-ASCII parameters to another Java program.
In this case, a program Abc needs to send, for example, Arabic characters to a program Def through its parameters. I have control over the code of Abc, but not of Def.
Using ProcessBuilder the normal way, without touching the encoding, does not work (as was mentioned elsewhere): Def receives question marks "?????".
However, I am able to get some results, although the encoding that works differs from one environment to another.
For example, I tried sending the argument in every available encoding and compared what Def printed with what was expected:
Windows, IntelliJ console:
Default charset: UTF-8
Found charsets: windows-1252, windows-1254 and windows-1258
Windows, command prompt:
Default charset: windows-1252
Found charsets: CESU-8 and UTF-8
Ubuntu, command prompt:
Default charset: ISO-8859-1
Found charsets: ISO-2022-CN, ISO-2022-KR, ISO-8859-1, ISO-8859-15, ISO-8859-9, x-IBM1129, x-ISO-2022-CN-CNS and x-ISO-2022-CN-GB
My question is: how can I programmatically determine the correct encoding to use, since I need something universal?
In other words, what is the relation between the default charset and the found ones?
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Abc {
    private static final Path PATH = Paths.get("."); // With maven: ./target/classes

    public static void main(String[] args) throws Exception {
        var string = "hello أحمد";
        var bytes = string.getBytes();
        System.out.println("Original string: " + string);
        System.out.println("Default charset: " + Charset.defaultCharset());
        for (var c : Charset.availableCharsets().values()) {
            var newString = new String(bytes, c);
            var process = new ProcessBuilder().command("java", "-cp",
                    PATH.toAbsolutePath().toString(),
                    "Def", newString).start();
            process.waitFor();
            var output = asString(process.getInputStream());
            if (output.contains(string)) {
                System.out.println("Found " + c + " " + output);
            }
        }
    }

    private static String asString(InputStream is) throws IOException {
        try (var reader = new BufferedReader(new InputStreamReader(is))) {
            var builder = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                if (builder.length() != 0) {
                    builder.append(System.lineSeparator());
                }
                builder.append(line);
            }
            return builder.toString();
        }
    }
}

public class Def {
    public static void main(String[] args) {
        System.out.println(args[0]);
    }
}

Under the hood, what's actually being passed around is bytes, not chars. Normally, you'd expect the java method that ends up turning characters into bytes to have an overload that lets you specify charset, but, for whatever reason, it does not exist here.
Here is how it should work:
1. You pass a string to ProcessBuilder.
2. ProcessBuilder turns that string into bytes using Charset.defaultCharset(). (Why? Because ProcessBuilder is all about making the OS do things, and the default charset reflects the OS's preferred charset.)
3. These bytes are then fed to the process.
4. The process starts up. If it is Java, and we're talking about the args in public static void main(String[] args), the same is done in reverse: Java takes the bytes and turns them back into characters via Charset.defaultCharset(), again.
This does show an immediate issue: If the default charset is not capable of representing a certain character, then in theory you are out of luck.
That would strongly suggest that using java to fire up java.exe should ordinarily mean you can pass whatever you want (unless the characters involved aren't representable in the system's charset).
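As a rough way to check the "representable in the system's charset" condition before launching the child process, something like the following could be used. It is only a sketch: the class and method names are invented for illustration, and it assumes the description above (arguments go through the default charset) holds on the platform in question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ArgEncodingCheck {
    // True if every character of the argument can be represented in the given charset.
    static boolean representable(String arg, Charset charset) {
        return charset.newEncoder().canEncode(arg);
    }

    public static void main(String[] args) {
        String arg = "hello أحمد";
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Representable in default charset: "
                + representable(arg, Charset.defaultCharset()));
        // US-ASCII cannot represent the Arabic letters, so this prints false.
        System.out.println("Representable in US-ASCII: "
                + representable(arg, StandardCharsets.US_ASCII));
    }
}
```

If the check fails for the default charset, passing the value through a file, standard input, or an ASCII-only encoding of the text is likely to be more reliable than a command-line argument.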
Your code is odd. In particular, this line is the problem:
var bytes = string.getBytes();
This is short for string.getBytes(Charset.defaultCharset()). So now you have your bytes in the platform default charset.
var newString = new String(bytes, c);
and now you're taking those bytes and turning them into a string using a completely different charset. I'm not sure what you're trying to accomplish with this. Pure gobbledygook would come out.
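To illustrate why mixing charsets produces garbage, here is a small sketch that encodes with one charset and decodes with another. UTF-8 and ISO-8859-1 are chosen purely for the demonstration (the question's code uses the default charset and every available charset instead), and the class name is hypothetical:

```java
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Encode with one charset, decode with another: the characters change.
    static String misdecode(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Reverse the mistake. This works here only because ISO-8859-1
    // maps every possible byte value to exactly one character.
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "أحمد";
        String garbled = misdecode(original);
        System.out.println(garbled.equals(original));         // false
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

With most other charset pairs the round trip is lossy, because bytes that are invalid in the decoding charset get replaced and cannot be recovered afterwards.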
In other words, what is the relation between the default charset and the found ones?
What do you mean by 'found ones'? The string "Found charsets" appears nowhere in your code. If you mean: What Charset.availableCharsets() returns - there is no relationship at all. availableCharsets isn't relevant for ProcessBuilder.

One possibility is to convert your String into a string of Unicode escape sequences, pass that to the other process, and convert it back to a regular String there. A string of Unicode sequences always contains ASCII characters only. Here is how it might look:
String encoded = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("hello أحمد");
As a result, the String encoded will hold this value:
"\u0068\u0065\u006c\u006c\u006f\u0020\u0623\u062d\u0645\u062f"
This String you can safely pass to another process. In that other process, you can do the following:
String originalString = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(encodedString);
And the result will be that originalString will now hold this value:
"hello أحمد"
The class StringUnicodeEncoderDecoder can be found in an open-source library called MgntUtils. You can get this library as a Maven artifact or on GitHub (including source code and Javadoc). The Javadoc is also available online.
This library, and this particular feature, is used and well tested by multiple users.
Disclaimer: I wrote this library.
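For readers who want to see the idea without the dependency, here is a minimal stand-in written with plain JDK calls. It is not the MgntUtils implementation, and the class and method names are hypothetical; it merely produces and parses the same kind of escape string as shown above:

```java
class UnicodeEscapes {
    // Turn every UTF-16 code unit into a six-character escape
    // (backslash, 'u', four lowercase hex digits), so the result is pure ASCII.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Reverse of encode. Assumes the input consists only of such six-character escapes.
    static String decode(String encoded) {
        StringBuilder sb = new StringBuilder(encoded.length() / 6);
        for (int i = 0; i + 6 <= encoded.length(); i += 6) {
            sb.append((char) Integer.parseInt(encoded.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("hello أحمد");
        System.out.println(encoded);
        System.out.println(decode(encoded).equals("hello أحمد")); // true
    }
}
```

Note that this only helps when the receiving program can be changed to decode the argument, which is not the case for Def in the question.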

From Angular Actual parameter value is "Ébénisterie" but in JAVA getting value "Ã?bÃ©nisterie"

From Angular I send one parameter whose value is Ébénisterie, but when I print the value of that variable in Java I get Ã?bÃ©nisterie. How can I convert it back to the original text, Ébénisterie? Which encoding or decoding do I have to apply?
I have tried the following:
new String(readable.getBytes("ISO-8859-15"), "UTF-8");
new String(readable.getBytes("UTF-8"), "ISO-8859-15");
but it's not working.
String readable ="Ã?bÃ©nisterie Distinction";
String test = null;
try {
test = new String(readable.getBytes("ISO-8859-15"), "UTF-8");
System.out.println("test"+test);
} catch (UnsupportedEncodingException e) {
}
Expected: Ébénisterie
Actual: Ã?bÃ©nisterie

After long research I didn't find anything.
So I came up with one solution: Base64 encoding and decoding. AngularJS now sends the encoded text, and on the Java side I decode it.
Here is the sample code.
AngularJS:
window.btoa("Ébénisterie")
Java:
String actualString= new String(Base64.getDecoder().decode("ENCODED STRING"));
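The Java line above decodes with the platform default charset, so its result can vary between machines. A more explicit sketch follows, assuming the client really used window.btoa, which treats each character of its input as a single byte (Latin-1). The class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Text {
    // Decode Base64 text produced from single-byte (Latin-1) characters.
    static String decodeLatin1(String base64) {
        byte[] bytes = Base64.getDecoder().decode(base64);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Stand-in for the value sent by the client.
        String sent = Base64.getEncoder()
                .encodeToString("Ébénisterie".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(decodeLatin1(sent));
    }
}
```

If the client instead encodes the text as UTF-8 bytes before Base64-encoding them, the decoding side would need StandardCharsets.UTF_8; the two sides simply have to agree.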

Android convert diamond question marks to UTF-8 Arabic string

I'm using an API that sends and receives raw bytes.
But I have a problem displaying the Arabic words that come over the API: they show up as diamond question marks "���".
I've tried converting the string from and to UTF-8.
This example returns question marks, but not inside black diamonds, "??? ???":
String str = new String(originalStr.getBytes("ISO-8859-1"), "UTF-8");
This one returns empty string :
String str = new String(originalStr.getBytes("WINDOWS-1256"), "UTF-8");
And this one also returns an empty string :
String str = new String(originalStr.getBytes("WINDOWS-1252"), "UTF-8");
I've succeeded in displaying the Arabic words in PHP by converting from cp1256 to UTF-8:
echo iconv('cp1256', 'utf-8', $string);
So the correct character encoding for this Arabic text is cp1256.
How can I achieve the same thing on Android?

java or node.js removes one character when sending or receiving a string

I'm trying to send a string from a java (android) app to a node.js server.
But one character disappears somewhere in the middle and I can't really figure out why.
To send I use a HttpUrlConnection (conn) and send the string like this:
try {
OutputStream os = conn.getOutputStream();
os.write(json.getBytes());
os.close();
} catch (Exception e) {
e.printStackTrace();
}
Here is the base64 encoded string when sent, and string when received:
khVGUBH2kNAR5PPRy7v5dO5iz48Rc7benYARu78\/9wY=\n
khVGUBH2kNAR5PPRy7v5dO5iz48Rc7benYARu78/9wY=\n
So one backslash has been removed.
In node I use this:
exports.getString = function(req, res) {
var string = req.body.thestring;
}
which outputs the latter of the two strings.
var express = require('express'),
http = require('http'),
stylus = require('stylus'),
nib = require('nib');
var app = express();
app.configure(function () {
app.use(express.logger('dev'));
//app.use(express.bodyParser());
app.use(express.json());
app.use(express.urlencoded());
app.use(app.router);
});
Any ideas of how I can get the missing character?

The missing backslash character is most probably disappearing on the node.js side.
As per the chosen answer on the following question:
Two part question on JavaScript forward slash
As far as JS is concerned / and \/ are identical inside a string
So a fix on the Java side might solve your problem: use String's replaceAll method to replace all occurrences of \/ with \\/:
os.write(json.replaceAll("\\/", "\\\\/").getBytes());
Note that replaceAll returns the new string and doesn't change the original string.

Making the base64 encoding url safe solved my problem.
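A sketch of what URL-safe Base64 looks like with the JDK's java.util.Base64. This is an assumption about the kind of fix meant here: the original app may well have used Android's own android.util.Base64 class instead, and the class name below is hypothetical. The URL-safe alphabet replaces '+' and '/' with '-' and '_', so there is no slash left for a JSON encoder to escape:

```java
import java.util.Base64;

class UrlSafeBase64 {
    // Encode using the URL-safe alphabet ('-' and '_' instead of '+' and '/').
    static String encodeUrlSafe(byte[] data) {
        return Base64.getUrlEncoder().encodeToString(data);
    }

    // Decode text that was produced with the URL-safe alphabet.
    static byte[] decodeUrlSafe(String text) {
        return Base64.getUrlDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xfb, (byte) 0xff};
        System.out.println(Base64.getEncoder().encodeToString(data)); // +/8=
        System.out.println(encodeUrlSafe(data));                      // -_8=
    }
}
```

The receiving side has to decode with the matching URL-safe alphabet as well.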

Java Output UTF-8 to Real Characters?

In Java, how can I output a UTF-8 string as real characters?
我们
\u6211\u4eec
String str = new String("\u6211\u4eec");
System.out.println(str); // still outputs \u6211\u4eec, but I expect 我们 as the output
-----
String tmp = request.getParameter("tag");
System.out.println("request:"+tmp);
System.out.println("character set :"+request.getCharacterEncoding());
String tmp1 = new String("\u6211\u4eec");
System.out.println("string equal:"+(tmp.equalsIgnoreCase(tmp1)));
String tag = new String(tmp);
System.out.println(tag);
request:\u6211\u4eec
character set :UTF-8
string equal:false
\u6211\u4eec
From the output, the value from the request is the same as the string value of tmp1, but why does equalsIgnoreCase output false?

Did you try to display just one of them? Like this:
String str = new String("\u6211");
System.out.println(str);
I bet there is a problem in how you create that string.

Java Strings are encoded in UTF-16. I do not see any problem in your code; I believe the problem comes from your console, which does not display the content of the String correctly.
If you are using Eclipse, change your console encoding to UTF-8 here:
Eclipse > Preferences > General > Workspace > Text file encoding
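Another possible explanation for the "string equal:false" output in the question is that the request parameter contains the literal characters backslash, u and four hex digits, rather than the two Chinese characters. The sketch below shows the difference and one way to decode such text. It is an illustration under that assumption, not a confirmed diagnosis, and the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeVsLiteral {
    // Matches a literal backslash, the letter u, and four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replace each literal escape sequence in the text with the character it names.
    static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String compiled = "\u6211\u4eec";   // the compiler turns this into 2 characters
        String received = "\\u6211\\u4eec"; // 12 characters, as they might arrive in a request
        System.out.println(compiled.length());                   // 2
        System.out.println(received.length());                   // 12
        System.out.println(compiled.equals(received));           // false
        System.out.println(compiled.equals(unescape(received))); // true
    }
}
```

In that case the console is printing exactly what the String contains, and changing the console encoding would not alter the output.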
