Reading Japanese text from JSON: wrong characters - Java

I have a ja.json file which contains key-value pairs for a Selenium framework:
"key": [ "私はあなたを愛しています!",
I have saved the file in UTF-8 format. But when I try to read values from the JSON, I get the string as "?????".
I am using the code below:
Object obj = parser.parse(new FileReader(filePath));
JSONObject jsonObject = (JSONObject) obj;
String text= (String) jsonObject.get(key);
String expectedValue = new String(text.getBytes("UTF-8"),"UTF-8");
What else can I do to get Japanese characters from a JSON file (or any other format if required) and send them?

You need to read the file with the correct charset, for example:
Object obj = parser.parse(new InputStreamReader(
new FileInputStream(filePath), StandardCharsets.UTF_8));
FileReader uses the platform default encoding, whatever that happens to be on your system.
Once the file has been decoded with the wrong charset and characters have been replaced, the original text cannot be recovered from the resulting string. So your line
String expectedValue = new String(text.getBytes("UTF-8"),"UTF-8");
is useless: encoding a string to UTF-8 and decoding it again as UTF-8 simply returns the same string.
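The difference can be checked without a JSON library. The following is a minimal sketch, not the original poster's code: the class and method names are made up for illustration, and an in-memory byte array stands in for the file so that the example is self-contained. US-ASCII plays the role of a platform default that cannot represent Japanese text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class CharsetReadSketch {
    // Decodes the whole stream with the charset passed in explicitly.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // The "repair" from the question: encode and decode with the same charset.
    static String reencode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Japanese characters, written as escapes so the example does not
        // depend on the encoding of this source file.
        String original = "\u79c1\u306f";
        byte[] fileBytes = original.getBytes(StandardCharsets.UTF_8);

        String good = readAll(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);
        String bad = readAll(new ByteArrayInputStream(fileBytes), StandardCharsets.US_ASCII);

        System.out.println(good.equals(original));      // true
        System.out.println(bad.equals(original));       // false
        System.out.println(reencode(bad).equals(bad));  // true: nothing was repaired
    }
}
```

For a real file, a FileInputStream (or Files.newBufferedReader with StandardCharsets.UTF_8) would take the place of the ByteArrayInputStream.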

Related

Java : read json file that is in "UCS-2 LE BOM" or "UTF-8 BOM" encoding format

I need to read a JSON file that is in "UCS-2 LE BOM" or "UTF-8 BOM" encoding format.
The code below reads JSON from UTF-8:
JSONParser parser = new JSONParser();
InputStream inputStream = new FileInputStream(inputJsonPath);
Reader fileReader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
Object obj = parser.parse(fileReader);
org.json.simple.JSONObject inputJsonObject = (org.json.simple.JSONObject) obj;
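One possible approach, sketched here under assumptions rather than taken from the thread, is to look at the first bytes of the file and pick the charset from the byte order mark. The sketch assumes that "UCS-2 LE BOM" means UTF-16LE with a leading FF FE mark and that a file without a recognised mark is plain UTF-8. The class and method names are hypothetical, and other marks (UTF-16BE, UTF-32) are not handled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class BomSketch {
    // Picks the charset from the byte order mark and returns the text without it.
    static String decode(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            // UTF-8 with BOM: skip the three mark bytes.
            return new String(data, 3, data.length - 3, StandardCharsets.UTF_8);
        }
        if (data.length >= 2
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xFE) {
            // UTF-16 little endian with BOM: skip the two mark bytes.
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16LE);
        }
        // Assumption: no recognised mark means plain UTF-8.
        return new String(data, StandardCharsets.UTF_8);
    }

    // Joins two byte arrays; used only to build sample input.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        String json = "{\"k\":\"\u611b\"}";
        byte[] utf16 = concat(new byte[] {(byte) 0xFF, (byte) 0xFE},
                json.getBytes(StandardCharsets.UTF_16LE));
        byte[] utf8 = concat(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF},
                json.getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(utf16).equals(json)); // true
        System.out.println(decode(utf8).equals(json));  // true
    }
}
```

The decoded string can then be handed to the parser, for example through a StringReader.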

Why does my code return unicode characters?

String encodedInputText = URLEncoder.encode("input=" + question, "UTF-8");
urlStr = Parameters.getWebserviceURL();
URL url = new URL(urlStr + encodedInputText + "&sku=" + sku);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
jsonOutput = in.readLine();
in.close();
The problem is that the returned JSON string contains Unicode escapes like
"question":"\u51e0\u5339\u7684",
and not the actual Chinese characters. The "UTF-8" should solve the problem. Why doesn't it?
EDIT:
ObjectMapper mapper = new ObjectMapper();
ResponseList responseList = mapper.readValue(jsonOutput, ResponseList.class);
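This is not a problem of encoding; it is a matter of how your data source produces the data. Encoding comes into play when you convert bytes into a string. You expect the charset to convert a string of the form \uxxxx into another string, which is not going to happen.
The point is that the data source serializes the data this way, so the raw characters are gone and have been replaced with \uxxxx escapes.
Note that \uxxxx is a standard JSON string escape, so a JSON parser (such as the ObjectMapper in the edit) decodes it when it parses the string. If you work with the raw string instead, you would have to manually capture the \uxxxx sequences and convert them to actual characters.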
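For the manual route, a regular expression is one way to do it. This is a rough sketch with a hypothetical class name. It assumes every backslash-u sequence in the input is a real escape, so it would mis-handle an escaped backslash followed by a literal "u" and four hex digits. A full JSON parser does not have that limitation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class UnicodeEscapeSketch {
    // A backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each escape with the UTF-16 code unit it names.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or a backslash as the result.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes produce literal backslash characters here.
        String raw = "\"question\":\"\\u51e0\\u5339\\u7684\"";
        String expected = "\"question\":\"\u51e0\u5339\u7684\"";
        System.out.println(unescape(raw).equals(expected)); // true
    }
}
```

Surrogate pairs work without special handling, because each half is appended as its own code unit.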

processing the contents of an XML email attachment as a String using MimeBodyPart

I am trying to process an email attachment (.xml) using MimeBodyPart.
attachment = part.getContent();
This returns a Java object of type StreamSource (and not a String).
How can I convert this into a String? I am using BufferedReader and StringBuilder to reconstruct the String from the InputStream, but the reconstructed String is incomplete:
StringBuilder sb = new StringBuilder();
InputStream inputStr = attachment.getInputStream();
br = new BufferedReader(new InputStreamReader(inputStr));
while ((line = br.readLine()) != null) {
sb.append(line);
}
If I process the email attachment as a .txt instead of a .xml, MimeBodyPart.getContent() returns the attachment as a complete String. I want the same functionality when the email attachment is a .xml.
Any ideas?
Try adding the "UTF-8" encoding as a parameter to your InputStreamReader.
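A minimal sketch of that suggestion follows, with a plain InputStream standing in for the attachment stream and a hypothetical class name. It also puts back a line break after each line, since readLine() drops line terminators. Whether either point explains the incomplete String in the question is an assumption.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class AttachmentTextSketch {
    // Reads the stream line by line as UTF-8 and re-adds a '\n' after each line.
    static String readLines(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] xml = "<a>\n<b>\u00e5</b>\n</a>".getBytes(StandardCharsets.UTF_8);
        String text = readLines(new ByteArrayInputStream(xml));
        System.out.println(text.equals("<a>\n<b>\u00e5</b>\n</a>\n")); // true
    }
}
```

If the XML declares a different encoding in its prolog, that charset would be the one to pass instead of UTF-8.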

How to convert UTF-8 to GBK string in java

I retrieved an HTML string from a target site, and within it there is a section
class="f9t" name="Óû§Ãû:ôâÈ»12"
I know it's in GBK encoding, as I can see from how the Firefox browser displays it. But I do not know how to convert that name string into a readable GBK string (such as 上海 or 北京).
I am using
String sname = new String(name.getBytes(), "UTF-8");
byte[] gbkbytes = sname.getBytes("gb2312");
String gbkStr = new String( gbkbytes );
System.out.println(gbkStr);
but it is not printed correctly as GBK text:
???¡ì??:????12
I have no clue how to proceed.
If you have already read the name with the wrong encoding and ended up with the garbled value "Óû§Ãû:ôâÈ»12", you can try this, as @Karol S suggested:
new String(name.getBytes("ISO-8859-1"), "GBK")
Or, if you read a GBK or GB2312 string from the internet or a file, use something like this to get the right string in the first place:
BufferedReader r = new BufferedReader(new InputStreamReader(is, "GBK"));
name = r.readLine();
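The ISO-8859-1 round trip can be checked in isolation. The sketch below uses hypothetical names and assumes the running JDK ships the GBK charset, which full JDK builds normally do. It garbles a known string in the way described and then repairs it. The repair works because ISO-8859-1 maps every byte value to exactly one character, so no bytes are lost. Text that was mis-decoded with a charset that has undefined bytes, such as windows-1252, may not be fully recoverable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class GbkRepairSketch {
    // Assumption: the runtime provides GBK; otherwise forName throws.
    static final Charset GBK = Charset.forName("GBK");

    // Simulates the bug: GBK bytes decoded as ISO-8859-1.
    static String misread(String text) {
        return new String(text.getBytes(GBK), StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the original bytes, then decode them as GBK.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    public static void main(String[] args) {
        String original = "\u4e0a\u6d77"; // the characters for "Shanghai"
        String garbled = misread(original);
        System.out.println(garbled.equals(original));         // false
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```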
Assuming that name.getBytes() returns GBK-encoded bytes, it is enough to create the string by specifying the encoding of the byte array:
new String(gbkString.getBytes(), "GBK");
According to the documentation, the name of the encoding should be GBK.
Sample code:
String gbkString = "Óû§Ãû:ôâÈ»12";
String utfString = new String(gbkString.getBytes(), "GBK");
System.out.println(utfString);
Result (not 100% sure that it's correct :) ):
脫脙禄搂脙没:么芒脠禄12

Converting to international character doesn't work with JSONObject.toString but works with string literal?

This doesn't convert to å:
String j_post = new String((byte[]) j_posts.getJSONObject(i).get("tagline").toString().getBytes("utf-8"), "utf-8");
but the following does
String j_post = new String((byte[]) "\u00e5".getBytes("utf-8"), "utf-8");
How do I fix this?
UPDATE: Now I tried fixing the encoding before I parse it as a JSONObject, and it still doesn't work.
json = new JSONObject(new String((byte[]) jsonContent.getBytes("utf-8"), "utf-8"));
JSONArray j_posts = json.getJSONArray("posts");
for (int i = 0; i<j_posts.length();i++){
String j_post =j_posts.getJSONObject(i).get("tagline").toString();
post_data.add(new Post(j_post));
}
Please note that I am getting a string as a response from my web server.
This is because your JSON doesn't have the character in the required format. Look into the code where the JSON is prepared, and apply the UTF-8 encoding there, when the JSON is formed.
String j_post = new String((byte[]) "\u00e5".getBytes("utf-8"), "utf-8");
is really just the following (which is better):
String j_post = "\u00e5";
And hence
String j_post = new String((byte[]) j_posts.getJSONObject(i).get("tagline")
.toString().getBytes("utf-8"), "utf-8");
is equivalent to
String j_post = j_posts.getJSONObject(i).get("tagline").toString();
So @RJ is right, and the data is mangled, either when it is received or when it is sent (wrong encoding).
