Process input json with illegal encoding [duplicate] - java

This question already has answers here:
JSON Invalid UTF-8 middle byte
(7 answers)
Closed 4 years ago.
My Java app has a consumer that receives JSON files from a server and then tries to convert them using Jackson. But ObjectMapper throws an exception:
com.fasterxml.jackson.databind.JsonMappingException: Invalid UTF-8 middle byte 0x2f
As far as I understand, this is due to an incorrect encoding.
Can I somehow recognize the encoding and process the server response?

I needed to recognize the encoding and process the data correctly.
For that I used UniversalDetector (org.mozilla.universalchardet.UniversalDetector):
private static final UniversalDetector DETECTOR = new UniversalDetector(null);

private static String getEncode(byte[] data) throws IOException {
    // the detector is stateful, so it has to be reset before every use
    // (which also means this method is not thread-safe)
    DETECTOR.reset();
    byte[] buf = new byte[data.length];
    InputStream is = new ByteArrayInputStream(data);
    int read;
    while ((read = is.read(buf)) > 0 && !DETECTOR.isDone()) {
        DETECTOR.handleData(buf, 0, read);
    }
    is.close();
    DETECTOR.dataEnd();
    // returns null if no charset could be detected
    return DETECTOR.getDetectedCharset();
}
And then I read the data with the correct encoding:
private static String readWithEncode(byte[] data, String encoding) throws IOException {
    BufferedReader br = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(data), encoding));
    StringBuilder result = new StringBuilder();
    String s;
    while ((s = br.readLine()) != null) {
        // readLine() strips line terminators, which is fine for JSON content
        result.append(s);
    }
    br.close();
    return result.toString();
}
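Putting the two helpers together before handing the text to Jackson might look like the sketch below (MyDto is a hypothetical target class, and UTF-8 is only a fallback for when detection fails):
private static final ObjectMapper MAPPER = new ObjectMapper();

private static MyDto parseServerResponse(byte[] data) throws IOException {
    // detect the charset first, then decode and parse
    String encoding = getEncode(data);
    String json = readWithEncode(data, encoding != null ? encoding : "UTF-8");
    return MAPPER.readValue(json, MyDto.class);
}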

Related

How to create a json_template?

I'm writing some code based on an example I saw here on this forum, and I'm having a hard time using GeoJSON.
It keeps giving an error on the raw resource because I did not add this json_template file:
private String getGeoString() throws IOException {
    InputStream is = getResources().openRawResource(R.raw.json_template);
    Writer writer = new StringWriter();
    char[] buffer = new char[1024];
    try {
        Reader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
    } finally {
        is.close();
    }
    String jsonString = writer.toString();
    return jsonString;
}
How can I solve this error?
If you would like to create a json file in your app, you can first create a json_template.json file in your app -> res -> raw folder.
In that file, you can put the entire json response received upon querying.
A useful tool for viewing the structure of the nested response is a JSON pretty printer such as JSON Pretty Print.
Then you can try:
public static String getGeoString(Context context) throws IOException {
    InputStream is = context.getResources().openRawResource(R.raw.json_template);
    Writer writer = new StringWriter();
    char[] buffer = new char[1024];
    try {
        Reader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
    } finally {
        is.close();
    }
    // The local variable 'jsonString' can be inlined as below
    return writer.toString();
}
It should work. Hope this is helpful.
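As a quick check, the returned string can be parsed with the org.json classes that ship with Android. This is only a sketch; it assumes your json_template.json holds a GeoJSON FeatureCollection with a top-level "features" array, which may differ from your actual file:
public static JSONArray readFeatures(Context context) throws IOException, JSONException {
    JSONObject root = new JSONObject(getGeoString(context));
    // "features" is the standard key in a GeoJSON FeatureCollection; adjust if your template differs
    return root.getJSONArray("features");
}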

â® characters getting converted to question marks while getting back

I am having a very weird issue.
I am putting and getting messages from Amazon AWS SQS.
While putting, I am compressing and encoding the messages like this:
String responseMessageBodyOriginal = gson.toJson(responseData);
String responseMessageBodyCompressed = compressToBase64String(responseMessageBodyOriginal);
AmazonSqsHelper.sendMessage(responseMessageBodyCompressed, queue, null);
The compression and encoding function looks like this:
public static String compressToBase64String(String data) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream(data.length());
    GZIPOutputStream gzip = new GZIPOutputStream(bos);
    gzip.write(data.getBytes());
    gzip.close();
    byte[] compressedBytes = bos.toByteArray();
    bos.close();
    return new String(Base64.encodeBase64(compressedBytes));
}
On the receiving side, this is the code:
List<Message> sqsMessageList = AmazonSqsHelper.receiveMessages(queueUrl, max_message_read_count,
        default_visibility_timeout);
int num_messages = sqsMessageList.size();
if (num_messages > 0) {
    for (Message m : sqsMessageList) {
        String responseMessageBodyCompressed = m.getBody();
        String responseMessageBodyOriginal = decompressFromBase64String(responseMessageBodyCompressed);
    }
}
And the function used for decoding and decompressing looks like this:
public static String decompressFromBase64String(String compressedString) throws IOException {
    byte[] compressedBytes = Base64.decodeBase64(compressedString);
    ByteArrayInputStream bis = new ByteArrayInputStream(compressedBytes);
    GZIPInputStream gis = new GZIPInputStream(bis);
    BufferedReader br = new BufferedReader(new InputStreamReader(gis, "UTF-8"));
    StringBuilder sb = new StringBuilder();
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line);
    }
    br.close();
    gis.close();
    bis.close();
    return sb.toString();
}
But the problem is that at times, if I pass characters like "â®", they come back as ???? when I print the message after decoding.
I am not able to figure out why the encoding and decoding behave this way. Any help would be appreciated.
The issue is that encoding is done using the platform's default charset (data.getBytes()), while decoding uses UTF-8.
In compressToBase64String, change data.getBytes() to data.getBytes(StandardCharsets.UTF_8).
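A corrected version might look like the sketch below; it keeps the original structure and only pins the charset (Base64 is assumed to be the Apache Commons Codec class already used above):
public static String compressToBase64String(String data) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream(data.length());
    GZIPOutputStream gzip = new GZIPOutputStream(bos);
    // encode with UTF-8 so it matches the UTF-8 used in decompressFromBase64String
    gzip.write(data.getBytes(StandardCharsets.UTF_8));
    gzip.close();
    byte[] compressedBytes = bos.toByteArray();
    bos.close();
    // Base64 output is plain ASCII, so the charset of this String is not an issue
    return new String(Base64.encodeBase64(compressedBytes), StandardCharsets.US_ASCII);
}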

How to choose the buffer size when reading from a URL

Aim: to read a URL that returns JSON.
Question: I found the code for reading a URL given below. I understand what the code does, but I have no idea why the size of the char array is 1024 rather than 2048 or something else. How do you decide what buffer size is appropriate when reading from a URL?
private static String readUrl(String urlString) throws Exception {
    BufferedReader reader = null;
    try {
        URL url = new URL(urlString);
        reader = new BufferedReader(new InputStreamReader(url.openStream()));
        StringBuffer buffer = new StringBuffer();
        int read;
        char[] chars = new char[1024]; // ??? why 1024?
        while ((read = reader.read(chars)) != -1)
            buffer.append(chars, 0, read);
        return buffer.toString();
    } finally {
        if (reader != null)
            reader.close();
    }
}
As the BufferedReader already has an internal buffer of a few thousand characters (the exact size is implementation-dependent), and as the socket already has a considerably larger receive buffer, it really doesn't make much difference what value you choose. The returns on buffering diminish geometrically with size.
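If you want the sizes to be explicit rather than magic numbers, you can pull them out into constants. This sketch is functionally the same as the code above; it just names the two buffers (the values shown are arbitrary choices, not recommendations):
private static final int READER_BUFFER_CHARS = 8192; // BufferedReader's internal buffer
private static final int COPY_BUFFER_CHARS = 1024;   // chunk size for each read() call

private static String readUrl(String urlString) throws Exception {
    URL url = new URL(urlString);
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(url.openStream()), READER_BUFFER_CHARS)) {
        StringBuilder buffer = new StringBuilder();
        char[] chars = new char[COPY_BUFFER_CHARS];
        int read;
        while ((read = reader.read(chars)) != -1) {
            buffer.append(chars, 0, read);
        }
        return buffer.toString();
    }
}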

InputStreamReader without limiting the returned length

I am working on learning Java and am going through the examples on the Android website. I am getting remote contents of an XML file. I am able to get the contents of the file, but then I need to convert the InputStream into a String.
public String readIt(InputStream stream, int len) throws IOException, UnsupportedEncodingException {
    InputStreamReader reader = null;
    reader = new InputStreamReader(stream, "UTF-8");
    char[] buffer = new char[len];
    reader.read(buffer);
    return new String(buffer);
}
The issue I am having is that I don't want the string to be limited by the len variable, but I don't know Java well enough to know how to change this.
How can I create the char array without a fixed length?
Generally speaking, it's bad practice not to put a maximum length on input like that, because of the possibility of running out of memory to store it.
That said, you can ignore the len variable and just loop on reader.read(...), appending the buffer to your result until you've read the entire InputStream, like so:
public String readIt(InputStream stream, int len) throws IOException, UnsupportedEncodingException {
    StringBuilder result = new StringBuilder();
    InputStreamReader reader = new InputStreamReader(stream, "UTF-8");
    char[] buffer = new char[len];
    int count;
    while ((count = reader.read(buffer)) != -1) {
        // append only the characters actually read, so a partial read doesn't add '\u0000' padding
        result.append(buffer, 0, count);
    }
    return result.toString();
}

Convert InputStream to String with encoding given in stream data

My input is an InputStream containing an XML document. The encoding used in the XML is unknown; it is declared in the first line of the XML document.
From this InputStream, I want to get the whole document into a String.
To do this, I use a BufferedInputStream to mark the beginning of the file and start reading the first line. I read this first line to get the encoding and then use an InputStreamReader to generate a String with the correct encoding.
This does not seem to be the best way to achieve this goal, because it produces an OutOfMemoryError.
Any idea how to do it?
public static String streamToString(final InputStream is) {
    String result = null;
    if (is != null) {
        BufferedInputStream bis = new BufferedInputStream(is);
        bis.mark(Integer.MAX_VALUE);
        final StringBuilder stringBuilder = new StringBuilder();
        try {
            // stream reader used only to find the declared encoding
            final InputStreamReader readerForEncoding = new InputStreamReader(bis, "UTF-8");
            final BufferedReader bufferedReaderForEncoding = new BufferedReader(readerForEncoding);
            String encoding = extractEncodingFromStream(bufferedReaderForEncoding);
            if (encoding == null) {
                encoding = DEFAULT_ENCODING;
            }
            // stream reader for the content, using the detected encoding
            bis.reset();
            final InputStreamReader readerForContent = new InputStreamReader(bis, encoding);
            final BufferedReader bufferedReaderForContent = new BufferedReader(readerForContent);
            String line = bufferedReaderForContent.readLine();
            while (line != null) {
                stringBuilder.append(line);
                line = bufferedReaderForContent.readLine();
            }
            bufferedReaderForContent.close();
            bufferedReaderForEncoding.close();
        } catch (IOException e) {
            // reset string builder
            stringBuilder.delete(0, stringBuilder.length());
        }
        result = stringBuilder.toString();
    } else {
        result = null;
    }
    return result;
}
The call to mark(Integer.MAX_VALUE) is causing the OutOfMemoryError, since it allows the BufferedInputStream to buffer up to 2GB of data.
You can solve this by using an iterative approach. Set the mark readLimit to a reasonable value, say 8K. In 99% of cases this will work, but in pathological cases, e.g. 16K of spaces between the attributes in the declaration, you will need to try again. So have a loop that tries to find the encoding and, if it doesn't find it within the given mark region, tries again, doubling the requested mark readLimit.
To be sure you don't advance the input stream past the mark limit, you should read the InputStream yourself, up to the mark limit, into a byte array. You then wrap the byte array in a ByteArrayInputStream and pass that to the constructor of the InputStreamReader assigned to 'readerForEncoding'.
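A minimal sketch of that idea, reusing the extractEncodingFromStream helper from the question (detectEncoding is just a hypothetical name for the wrapper, not code from the answer):
private static String detectEncoding(BufferedInputStream bis) throws IOException {
    int readLimit = 8 * 1024;
    while (true) {
        bis.mark(readLimit);
        byte[] prefix = new byte[readLimit];
        int total = 0, n;
        // read at most readLimit bytes so the mark stays valid
        while (total < prefix.length && (n = bis.read(prefix, total, prefix.length - total)) != -1) {
            total += n;
        }
        bis.reset();
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(prefix, 0, total), "UTF-8"));
        String encoding = extractEncodingFromStream(reader);
        if (encoding != null || total < prefix.length) {
            // found a declaration, or the whole stream fit into the prefix anyway
            return encoding;
        }
        readLimit *= 2; // declaration not found yet: retry with a larger window
    }
}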
You can use this method to convert an InputStream to a String; it might help you:
private String convertStreamToString(InputStream input) throws Exception {
    // note: this reader uses the platform default charset, so it ignores the encoding declared in the XML
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
    StringBuilder sb = new StringBuilder();
    String line = null;
    while ((line = reader.readLine()) != null) {
        sb.append(line);
    }
    input.close();
    return sb.toString();
}
