String encodedInputText = URLEncoder.encode("input=" + question, "UTF-8");
String urlStr = Parameters.getWebserviceURL();
URL url = new URL(urlStr + encodedInputText + "&sku=" + sku);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
String jsonOutput = in.readLine();
in.close();
The problem is that the returned JSON string contains Unicode escape sequences like
"question":"\u51e0\u5339\u7684",
not the actual Chinese characters. Shouldn't the "UTF-8" solve the problem? Why doesn't it?
EDIT:
ObjectMapper mapper = new ObjectMapper();
ResponseList responseList = mapper.readValue(jsonOutput, ResponseList.class);
This is not a problem of encoding; it is a problem with your data source. Encoding comes into play when you convert bytes into a string. You are expecting the encoding step to turn a string of the form \uXXXX into another string, and that is not going to happen.
The whole point is that the data source serializes its data this way, so your raw characters are gone, replaced with \uXXXX escape sequences.
You would now have to capture the \uXXXX sequences manually and convert them to the actual characters. (Note that a proper JSON parser such as Jackson already decodes these escapes when parsing string values, since they are part of the JSON spec, so they often need no special handling at all.)
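If you do need to convert such sequences by hand, here is a minimal sketch. The regex and helper name are my own; Apache Commons' StringEscapeUtils.unescapeJava does the same job.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {

    // Matches a literal backslash, 'u', and four hex digits, e.g. \u51e0.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every \uXXXX escape in the input with the character it encodes.
    static String unescape(String s) {
        Matcher m = UNICODE_ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escaped value from the question's JSON output.
        System.out.println(unescape("\"question\":\"\\u51e0\\u5339\\u7684\""));
    }
}
```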
Related
I'm trying to save content of a pdf file in a json and thought of saving the pdf as String value converted from byte[].
byte[] byteArray = feature.convertPdfToByteArray(Paths.get("path.pdf"));
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
The result of the above code is as follows:
true
false
421371 vs 760998
The two Strings are equal, while the two byte[]s are not. Why is that, and how do I correctly convert/save a PDF inside a JSON?
You are probably using the wrong charset when reading from the PDF file.
For example, the character é (e with acute) is encoded as the single byte 0xE9 in ISO-8859-1, which is not a valid byte sequence in UTF-8:
byte[] byteArray = "é".getBytes(StandardCharsets.ISO_8859_1);
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
Output :
true
false
1 vs 3
Why is that
If the byteArray indeed contains a PDF, it most likely is not valid UTF-8. Thus, wherever
String byteString = new String(byteArray, StandardCharsets.UTF_8);
stumbles over a byte sequence which is not valid UTF-8, it will replace it with the Unicode replacement character (U+FFFD). In other words, this line damages your data, most likely beyond repair. So the following
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
does not result in the original byte array but instead a damaged version of it.
The newByteArray, on the other hand, is the result of UTF-8 encoding a given string, byteString. Thus, newByteArray is valid UTF-8 and
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
does not need to replace anything; in particular, byteString and secondString are equal.
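A small self-contained illustration of this replacement behavior, using nothing beyond the standard library:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] invalid = { (byte) 0xFF };

        // Decoding replaces the malformed byte with U+FFFD,
        // the Unicode replacement character.
        String decoded = new String(invalid, StandardCharsets.UTF_8);
        System.out.println(decoded.charAt(0) == '\uFFFD'); // prints true

        // Re-encoding yields the three-byte UTF-8 form of U+FFFD (EF BF BD);
        // the original byte is gone for good.
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(reencoded.length); // prints 3
    }
}
```

This also suggests why the byte counts differ so much (421371 vs 760998): each damaged byte can grow into a three-byte replacement sequence.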
how to correctly convert/save a pdf inside a json?
As #mammago explained in his comment:
JSON is not the appropriate format for binary content (like files). You should probably use something like Base64 to create a string out of your PDF and store that in your JSON object.
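A minimal sketch of that Base64 approach; the in-memory sample bytes stand in for the result of the question's convertPdfToByteArray, and the field name "pdf" is my own:

```java
import java.util.Arrays;
import java.util.Base64;

public class PdfInJson {
    public static void main(String[] args) {
        // Stand-in for real PDF bytes; note 0x89 and 0xFF, which would
        // break a naive new String(..., UTF_8) round trip.
        byte[] pdfBytes = { 0x25, 0x50, 0x44, 0x46, (byte) 0x89, (byte) 0xFF };

        // Base64 maps arbitrary bytes to pure ASCII, which is safe inside JSON.
        String encoded = Base64.getEncoder().encodeToString(pdfBytes);
        String json = "{\"pdf\":\"" + encoded + "\"}";
        System.out.println(json);

        // Decoding restores the exact original bytes.
        byte[] restored = Base64.getDecoder().decode(encoded);
        System.out.println(Arrays.equals(pdfBytes, restored)); // prints true
    }
}
```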
I have a ja.json file which contains key-value pairs for a Selenium framework:
"key": [ "私はあなたを愛しています!",
I have saved the file in UTF-8 format. But when I try to read values from the JSON, I get the string as "?????".
I am using the code below:
Object obj = parser.parse(new FileReader(filePath));
JSONObject jsonObject = (JSONObject) obj;
String text= (String) jsonObject.get(key);
String expectedValue = new String(text.getBytes("UTF-8"),"UTF-8");
What else can I do to read Japanese characters from a JSON file (or any other format, if required) and send them?
You need to read the file with the correct charset, for example:
Object obj = parser.parse(new InputStreamReader(
new FileInputStream(filePath), StandardCharsets.UTF_8));
The FileReader will use the platform default encoding, whatever that happens to be on your system.
Any attempt to repair the encoding after reading the file with the wrong charset will fail. So your line
String expectedValue = new String(text.getBytes("UTF-8"), "UTF-8");
is useless: it encodes the string to UTF-8 bytes and immediately decodes them back, a no-op that cannot restore characters already lost to "?".
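As an alternative to the InputStreamReader construction, Files.newBufferedReader also takes the charset explicitly; a sketch using the file name from the question:

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadUtf8File {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("ja.json");

        // Unlike FileReader, this reader decodes with the charset you name,
        // regardless of the platform default.
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```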
I am trying to send a POST request from a C# program to my Java server.
I send the request together with a JSON object.
I receive the request on the server and can read what was sent using the following Java code:
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
OutputStream out = conn.getOutputStream();
String line = reader.readLine();
String contentLengthString = "Content-Length: ";
int contentLength = 0;
while (line.length() > 0) {
    if (line.startsWith(contentLengthString))
        contentLength = Integer.parseInt(line.substring(contentLengthString.length()));
    line = reader.readLine();
}
char[] temp = new char[contentLength];
reader.read(temp);
String s = new String(temp);
The string s is now the representation of the JSON object that I sent from the C# client. However, some characters are now messed up.
Original json object:
{"key1":"value1","key2":"value2","key3":"value3"}
Received string:
%7b%22key1%22%3a%22value1%22%2c%22key2%22%3a%22value2%22%2c%22key3%22%3a%22value3%22%%7d
So my question is: how do I convert the received string so it looks like the original one?
Seems like URL-encoded data, so why not use java.net.URLDecoder?
String s = java.net.URLDecoder.decode(new String(temp), StandardCharsets.UTF_8);
This assumes the charset is in fact UTF-8. (Note that the decode(String, Charset) overload requires Java 10 or later; on older versions, use decode(new String(temp), "UTF-8").)
Those appear to be URL-encoded, so I'd use URLDecoder, like so:
String in = "%7b%22key1%22%3a%22value1%22%2c%22key2"
+ "%22%3a%22value2%22%2c%22key3%22%3a%22value3%22%7d";
try {
    String out = URLDecoder.decode(in, "UTF-8");
    System.out.println(out);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
Note that you seem to have an extra percent sign in your example (the %% before the final 7d), because the above prints
{"key1":"value1","key2":"value2","key3":"value3"}
This doesn't convert to å:
String j_post = new String((byte[]) j_posts.getJSONObject(i).get("tagline").toString().getBytes("utf-8"), "utf-8");
but the following does
String j_post = new String((byte[]) "\u00e5".getBytes("utf-8"), "utf-8");
How do I fix this?
UPDATE: I have now tried fixing the encoding before casting it to a JSONObject, and it still doesn't work.
json = new JSONObject(new String((byte[]) jsonContent.getBytes("utf-8"), "utf-8"));
JSONArray j_posts = json.getJSONArray("posts");
for (int i = 0; i < j_posts.length(); i++) {
    String j_post = j_posts.getJSONObject(i).get("tagline").toString();
    post_data.add(new Post(j_post));
}
Please note that I am getting a string as the response from my web server.
This is because your JSON doesn't contain the character in the required form. Look into the code where the JSON is prepared and make sure UTF-8 encoding is applied there, when the JSON is formed.
String j_post = new String((byte[]) "\u00e5".getBytes("utf-8"), "utf-8");
is just a long-winded way of writing:
String j_post = "\u00e5";
And hence
String j_post = new String((byte[]) j_posts.getJSONObject(i).get("tagline")
.toString().getBytes("utf-8"), "utf-8");
is
String j_post = j_posts.getJSONObject(i).get("tagline").toString();
So #RJ is right: the data is mangled, either when it is received or when it is sent (wrong encoding).
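To see why wrapping a String in getBytes("utf-8") / new String(..., "utf-8") can never fix anything, note that the round trip is the identity for any well-formed String (the sample text is my own):

```java
import java.nio.charset.StandardCharsets;

public class RoundTripNoOp {
    public static void main(String[] args) {
        String tagline = "bl\u00e5b\u00e6r"; // "blåbær", sample non-ASCII text

        // Encoding to UTF-8 and decoding straight back returns an equal String,
        // so it cannot repair text that was already mangled upstream.
        String roundTripped =
                new String(tagline.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(roundTripped.equals(tagline)); // prints true
    }
}
```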
I'm reading some information from an external server to which I have no access, and I don't know the encoding; I'm having problems with characters like í. What I do is a POST request using the code below, and afterwards I parse the result.
String response = "";
URL url = new URL(pURL);
URLConnection uc = url.openConnection();
if (sid!=null) uc.setRequestProperty("Cookie", sid);
uc.setDoOutput(true);
OutputStreamWriter osw = new OutputStreamWriter(uc.getOutputStream());
osw.write(request);
osw.flush();
InputStreamReader isr = new InputStreamReader(uc.getInputStream(), "UTF8");
BufferedReader br = new BufferedReader(isr);
String content;
while ((content = br.readLine()) != null) {
    response += content;
}
br.close();
osw.close();
At this moment, if I print the string, it shows a backslash: for í, instead of \u00ed being decoded, \u00ed appears literally, and if I convert the response string to a char array, I can see that instead of being converted to one character it is split into 6 chars: \, u, 0, 0, e, d.
I've tried changing the encoding in the InputStreamReader, replacing characters, and some regexes, and none of them worked. Has anyone had this problem and can help me?
Thank you very much.
Not sure why the response is formatted that way, but you can convert strings containing \u00ed into í using StringEscapeUtils (org.apache.commons.text.StringEscapeUtils from Apache Commons Text) as follows:
String input = "\\u00ed";
String unescaped = StringEscapeUtils.unescapeJava(input);
System.out.println(unescaped);
Output:
í
(A call like response.replaceAll("\\", "\") does not work, by the way: "\" is not even a valid Java string literal, and replaceAll treats its first argument as a regular expression, so unescapeJava is the safer route.)