I'm pulling a file down from S3, and when I call the ObjectMapper with the content stream as a bare InputStream, decoding fails with a UTF-8 exception, but when I wrap the InputStream in a BufferedReader, it works fine.
If I download the object to a local file and then open it as a FileInputStream, that also works. I am perplexed. I'm hoping somebody has run into this before me, or has some insight into how Jackson handles encoding for a bare InputStream versus a BufferedReader.
This fails:
S3Object s3o = s3Client.getObject("my-bucket","my-key");
Object t = om.readValue(s3o.getObjectContent(), Object.class);
This works:
S3Object s3o = s3Client.getObject("my-bucket","my-key");
Object t = om.readValue(new BufferedReader(new InputStreamReader(s3o.getObjectContent())), Object.class);
with the error:
org.codehaus.jackson.JsonParseException: Invalid UTF-8 middle byte 0x5c
at [Source: org.apache.http.conn.EofSensorInputStream#6460029d; line: 1, column: 31611]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
at org.codehaus.jackson.impl.Utf8StreamParser._reportInvalidOther(Utf8StreamParser.java:2830)
at org.codehaus.jackson.impl.Utf8StreamParser._reportInvalidOther(Utf8StreamParser.java:2837)
at org.codehaus.jackson.impl.Utf8StreamParser._decodeUtf8_2(Utf8StreamParser.java:2625)
at org.codehaus.jackson.impl.Utf8StreamParser._finishString2(Utf8StreamParser.java:1952)
at org.codehaus.jackson.impl.Utf8StreamParser._finishString(Utf8StreamParser.java:1905)
at org.codehaus.jackson.impl.Utf8StreamParser.getText(Utf8StreamParser.java:276)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:59)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:218)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapArray(UntypedObjectDeserializer.java:165)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:51)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:218)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:196)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
at org.codehaus.jackson.map.ObjectMapper._readMapAndClose(ObjectMapper.java:2732)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1909)
Your content is not UTF-8, but something that is not valid for JSON, like ISO-8859-1 (Latin-1). Your use of BufferedReader is a bit wrong too -- you should specify the encoding, otherwise the platform-default encoding (which could be anything) is used -- but that is probably what converts from the actual encoding and avoids the error.
Nonetheless, it sounds like the content is not valid JSON, and whoever produces it should fix it to use one of the supported encodings (UTF-8 or UTF-16).
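If you cannot get the producer fixed right away, a workaround sketch (assuming the content really is ISO-8859-1, which may not match your data) is to do the character decoding yourself with an explicit charset and hand Jackson a Reader:
// Sketch: decode the S3 content explicitly as ISO-8859-1 (an assumption) instead of
// letting Jackson treat the raw bytes as UTF-8 or letting the Reader use the platform default.
S3Object s3o = s3Client.getObject("my-bucket", "my-key");
Reader reader = new BufferedReader(new InputStreamReader(s3o.getObjectContent(), "ISO-8859-1"));
Object t = om.readValue(reader, Object.class);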
Related
I have a scenario where I get my PKCS12 certificate content as a Base64-encoded string (Apache Commons Codec library). Now I have to decode that string and store it in a file. But when I decode it via a String, I get invalid certificate content.
When I write the bytes to a file directly, it works fine. Please find the snippets I have tried below.
For encode:
Base64.encodeBase64String(certcontentInBytes)
For decode:
new String(Base64.decodeBase64(certstringContent));
new String(bytes) actually does new String(bytes, defaultCharset), i.e. it converts the bytes to a Unicode string using the platform's default charset. That is non-portable and probably the wrong charset.
For binary data that will not work; String should not be used for binary data at all. I bet it was the raw bytes that were written to the file, which is why that case works.
Maybe you have to use new String(byte[], Charset charset) with the correct charset (probably UTF-8), because otherwise it will use the platform charset, which differs between Windows and Linux/Unix.
I wonder why you can't simply use the byte array as the parameter?
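For example, a sketch (certstringContent and the output path are placeholder names) that decodes straight to bytes and writes them without ever going through String:
// Sketch: decode the Base64 text back to the original bytes and write those bytes directly.
// certstringContent and "client-cert.p12" are placeholders.
byte[] certBytes = Base64.decodeBase64(certstringContent);
FileOutputStream fos = new FileOutputStream("client-cert.p12");
fos.write(certBytes);
fos.close();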
I have a process that parses an XML file with Java 5 on Apache Tomcat 6.
Since I compiled with Java 7 and run it on Apache Tomcat 7, I receive the following error:
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,60]
Message: Invalid encoding name "ISO8859-1".
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.setInputSource(XMLStreamReaderImpl.java:219)
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.<init>(XMLStreamReaderImpl.java:189)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.getXMLStreamReaderImpl(XMLInputFactoryImpl.java:262)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.createXMLStreamReader(XMLInputFactoryImpl.java:129)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.createXMLEventReader(XMLInputFactoryImpl.java:78)
at org.simpleframework.xml.stream.StreamProvider.provide(StreamProvider.java:66)
at org.simpleframework.xml.stream.NodeBuilder.read(NodeBuilder.java:58)
at org.simpleframework.xml.core.Persister.read(Persister.java:543)
at org.simpleframework.xml.core.Persister.read(Persister.java:444)
Here is the XML fragment used:
<?xml version="1.0" encoding="ISO8859-1" standalone="no" ?>
If I replace ISO8859-1 by UTF-8 the parsing process works but it's not an option for me.
The lib that I use is simple-xml-2.1.8.jar
As someone pointed out to me, ISO8859-1 is a wrong encoding name; ISO-8859-1 is the correct one. As I mentioned, it's difficult to ask the "producers" to correct their files, so I want to handle the problem in my application.
Get access to the Xerces XMLReader instance from Simple XML and set
reader.setFeature("http://apache.org/xml/features/allow-java-encodings", true)
before parsing the XML.
Since "ISO8859-1" is a valid encoding name in Java, this may just work.
The list of supported "features" of Xerces is available here
Alternatively, a good old regex replacement of encoding="ISO8859-1" in the XML, prior to processing it, should do the trick.
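For example, a rough sketch of that approach (assuming the document fits in memory and the declaration appears exactly as shown; the file name and MyType are placeholders):
// Sketch: read the raw XML as ISO-8859-1, repair the encoding name, then parse it with Simple.
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("input.xml"), "ISO-8859-1"));
StringBuilder sb = new StringBuilder();
String line;
while ((line = in.readLine()) != null) {
    sb.append(line).append('\n');
}
in.close();
String fixedXml = sb.toString().replaceFirst("encoding=\"ISO8859-1\"", "encoding=\"ISO-8859-1\"");
Persister persister = new Persister();
MyType obj = persister.read(MyType.class, new StringReader(fixedXml));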
If you know the file encoding up front (UTF-8, ISO-8859-1 or whatever) then you should create a suitable Reader configured for that encoding, then use the Persister.read method that takes a Reader instead of the one that takes a File or InputStream. That way you are in control of the byte-to-character decoding rather than relying on the XML reader to detect the encoding (and fail, as the file declared it wrongly). So instead of
File f = new File(....);
MyType obj = persister.read(MyType.class, f);
you would do something more like
File f = new File(....);
MyType obj = null;
try (FileInputStream fis = new FileInputStream(f);
     InputStreamReader reader = new InputStreamReader(fis, "ISO-8859-1")) { // or UTF-8, ...
    obj = persister.read(MyType.class, reader);
}
I am using JAX-RS and have a simple POST web service that takes an InputStream containing a MIME message (XML + file).
The MIME message is in UTF-8; the file contained as a body part is an email message in MIME RFC 822 format in ISO-8859-1 encoding, which I'm converting to PDF using Aspose.
When running as a web service, the resulting PDF has incorrect characters (ø, å etc.). But when I use the exact same input, reading it from a file instead and calling the method with a FileInputStream, the resulting PDF is OK.
Here is the simplified version of the code:
@POST
@Path(value = "/documents/convert/{flag}")
@Produces("text/plain")
public String convertFile(InputStream input, @PathParam("flag") String flag) throws WebApplicationException {
FileInfo info = convertToPdf(input);
return info.getResponse();
}
If I run this as a web service it produces a PDF with incorrectly encoded characters, with a "box" instead of some characters (such as ø, å etc.). When I run the same code with the same input by calling
FileInputStream fis = new FileInputStream(file);
convertFile(fis);
the resulting PDF has correct encoding (the WS is run on server, testing with file is done on my local machine).
Could this be incorrect setting of locale on the server?
Do you use an InputStreamReader to read the FileInputStream? If so, did you initialize it with the two-parameter constructor, passing Charset.forName("UTF-8") as the second argument (since, as you mentioned, the incoming stream is already in UTF-8)?
You might need to tell the container that it's UTF-8.
something like...
@Produces("text/plain; charset=utf-8")
Apparently your local file and your MIME message body are not encoded the same way.
Your post states that the file is encoded in ISO-8859-1.
If you are using an InputStreamReader (as Xavier Coulon is suggesting) you should pass the expected encoding to it. In this case
Charset.forName("ISO-8859-1")
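For example, a rough sketch (bodyPartStream is a placeholder name for the stream of the embedded RFC 822 part):
// Sketch: decode the body part explicitly as ISO-8859-1 instead of the platform default.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(bodyPartStream, Charset.forName("ISO-8859-1")));
StringBuilder body = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
    body.append(line).append('\n');
}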
If this does not help, could you please provide the content of the convertToPdf(InputStream is) method?
I have a Java servlet which gets RSS feeds and converts them to JSON. It works great on Windows, but it fails on CentOS.
The RSS feed contains Arabic and it shows unintelligible characters on CentOS. I am using these lines to encode the RSS feed:
byte[] utf8Bytes = Xml.getBytes("Cp1256");
// byte[] defaultBytes = Xml.getBytes();
String roundTrip = new String(utf8Bytes, "UTF-8");
I tried it on GlassFish and Tomcat. Both have the same problem: it works on Windows but fails on CentOS. What is causing this and how can I solve it?
byte[] utf8Bytes = Xml.getBytes("Cp1256");
String roundTrip = new String(utf8Bytes, "UTF-8");
This is an attempt to correct a badly-decoded string. At some point prior to this operation you have read in Xml using the default encoding, which on your Windows box is code page 1256 (Windows Arabic). Here you are encoding that string back to code page 1256 to retrieve its original bytes, then decoding it properly as the encoding you actually wanted, UTF-8.
On your Linux server, it fails, because the default encoding is something other than Cp1256; it would also fail on any Windows server not installed in an Arabic locale.
The commented-out line that uses the default encoding instead of explicitly Cp1256 is more likely to work on a Linux server. However, the real fix is to find where Xml is being read, and fix that operation to use the correct encoding(*) instead of the default. Allowing the default encoding to be used is almost always a mistake, as it makes applications dependent on configuration that varies between servers.
(*: for this feed, that's UTF-8, which is the most common encoding, but it may differ for others. Finding out the right encoding for a feed depends on the Content-Type header returned for the resource and the <?xml encoding declaration. By far the best way to cope with this is to fetch and parse the resource using a proper XML library that knows about this, for example with DocumentBuilder.parse(uri).)
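For example, a minimal sketch of that approach (the feed URL is a placeholder):
// Sketch: fetch and parse the feed with an XML parser, letting it work out the encoding
// from the Content-Type header / <?xml ?> declaration rather than decoding the bytes yourself.
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document feed = builder.parse("http://example.com/arabic-feed.xml");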
There are many places where wrong encoding can be used. Here is the complete list http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8
I have some strings in Java (originally from an Excel sheet) that I presume are in the Windows-1252 codepage. I want them converted to Java's own Unicode format. The Excel file was parsed using the JXL package, in case that matters.
I will clarify: apparently the strings obtained from the Excel file already look like some kind of Unicode.
WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
row = s.getRow(4);
String contents = row[0].getContents();
This is where contents seems to contain something Unicode: the åäö are multibyte characters, while the ASCII ones are normal single-byte characters. It is most definitely not Latin-1. If I print the "contents" string with println and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex (195 and 182 in decimal).
[edit]
I have tried the suggestions given below with different codepages, e.g. converting from Cp1252. Some kind of conversion did happen, because I would get a different kind of gibberish instead. As a reference I always printed an "ö" string hard-coded in the source code, to verify that there was nothing wrong with my terminal or typefaces; the manually typed "ö" always worked.
[edit]
I also tried WorkbookSettings as suggested in the comments, but I looked in the JXL code and characterSet seems to be ignored by the parsing code. I think the parsing code just looks at whatever encoding the XLS file is supposed to be in.
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1250");
Worked for me.
If none of the answers above solves the problem, the trick might be done like this:
String myOutput = new String(myInput, "UTF-8");
This decodes the incoming bytes as UTF-8; of course, it only helps if the bytes really are UTF-8.
When Java parses a file it uses some encoding to read the bytes on the disk and create bytes in memory. The default encoding varies from platform to platform. Java's internal String representation is Unicode already, so if it parses the file with the right encoding then you are already done; just write out the data in any encoding you want.
If your strings appear corrupted when you look at them in Java, it is probably because you are using the wrong encoding to read the data. Excel is probably using UTF-16 (little-endian, I think), but I'd expect a library like JXL to be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings; I imagine it auto-detects whatever encodings it needs to.
Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:
String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
FileOutputStream fos = new FileOutputStream("test.txt"); // Open file
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-16"); // Specify character encoding
PrintWriter pw = new PrintWriter(osw);
pw.print(text); // repeat as needed
pw.close(); // cleanup
osw.close();
fos.close();
If your problem is something else please edit your question and provide more details.
You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.
JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.
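A sketch of that, based on the question's own snippet (the file name is a placeholder, and Cp1252 is the assumed encoding):
// Sketch: tell JXL the encoding up front so cell contents are decoded correctly on read.
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("Cp1252"); // assumption: the sheet really uses Windows-1252
Workbook workbook = Workbook.getWorkbook(new File("sheet.xls"), ws);
String contents = workbook.getSheet(0).getRow(4)[0].getContents();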
"windows-1252"/"Cp1252" is not required to be supported by JREs, but is by Sun's (and presumably most others). See the "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.
FileInputStream fis = new FileInputStream(yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis,"CP1250"));
And do with the reader whatever you'd do directly with the file.
Your description indicates that the encoding is UTF-8 and indeed C3 B6 is the UTF-8 encoding for 'ö'.
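A quick way to verify this (a minimal sketch):
// Sketch: print the UTF-8 bytes of "ö"; this outputs C3 B6.
byte[] bytes = "ö".getBytes("UTF-8");
for (byte b : bytes) {
    System.out.printf("%02X ", b);
}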