JAX-RS and character encoding problems

JAX-RS and character encoding problems - java

I am using Jax RS and have simple POST WS, that takes InputStream, that contains MIME message (xml + file).
The MIME message is in UTF-8, file contained as a body part is an email message in MIME RFC 822 in ISO-8859-1 encoding, that I'm converting to PDF using Aspose.
When running as a webservice, the resulting PDF has incorrect characters (ø, å etc.). But when I tried to use the exact input, but reading it from file instead and call the method with FileInputStream, the resulting PDF is OK.
Here is the simplified version of the code:
#POST
#Path(value = "/documents/convert/{flag}")
#Produces("text/plain")
public String convertFile(InputStream input, #PathParam("flag") String flag) throws WebApplicationException {
FileInfo info = convertToPdf(input);
return info.getResponse();
}
If I run this as webservice it produces PDF with incorrectly encoded characters with "box" instead of some charcters (such as ø, å etc.). When I run the the same code with the same input by by calling
FileInputStream fis = new FileInputStream(file);
convertFile(fis);
the resulting PDF has correct encoding (the WS is run on server, testing with file is done on my local machine).
Could this be incorrect setting of locale on the server?

Do you use an InputStreamReader to read the FileInputStream ? If so, did you initialize it using the 2-parameters constructor, with CharSet.forName("UTF-8") as the second argument ? (as you mentionned the incoming stream is already in UTF-8) ?

You might need to tell the container that it's UTF-8.
something like...
#Produces("text/plain; charset=utf-8")

Apparently your local file and you MIME message body are not encoded the same way.
Your post states that the file is encoded in ISO-8859-1.
If you are using an InputStreamReader (as Xavier Coulon's is suggesting) you should pass the expected encoding to it. In this case
CharSet.forName("ISO-8859-1")
If this does not help, could you please provide the content of the convertToPdf(InputStream is) method

Related

Weird error with Jackson and file encoding

I'm pulling a file down from S3, and when I call object mapper using the content stream as a bare InputStream, decoding fails with a UTF-8 exception, but when I use a BufferedReader wrapping the InputStream, it works fine.
If I read the file down into a local file, then open it as a FileInputStream, that works fine also. I am perplexed. I'm hoping somebody has run into this before me, or has some insight around the workings of a bare InputStream versus a BufferedReader in terms of the encoding in Jackson.
This fails
S3Object s3o = s3Client.getObject("my-bucket","my-key");
Object t = om.readValue(s3o.getObjectContent(), Object.class);
This works
S3Object s3o = s3Client.getObject("my-bucket","my-key");
Object t = om.readValue(new BufferedReader(new InputStreamReader(s3o.getObjectContent())), Object.class);
with the error:
org.codehaus.jackson.JsonParseException: Invalid UTF-8 middle byte 0x5c
at [Source: org.apache.http.conn.EofSensorInputStream#6460029d; line: 1, column: 31611]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
at org.codehaus.jackson.impl.Utf8StreamParser._reportInvalidOther(Utf8StreamParser.java:2830)
at org.codehaus.jackson.impl.Utf8StreamParser._reportInvalidOther(Utf8StreamParser.java:2837)
at org.codehaus.jackson.impl.Utf8StreamParser._decodeUtf8_2(Utf8StreamParser.java:2625)
at org.codehaus.jackson.impl.Utf8StreamParser._finishString2(Utf8StreamParser.java:1952)
at org.codehaus.jackson.impl.Utf8StreamParser._finishString(Utf8StreamParser.java:1905)
at org.codehaus.jackson.impl.Utf8StreamParser.getText(Utf8StreamParser.java:276)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:59)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:218)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapArray(UntypedObjectDeserializer.java:165)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:51)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:218)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:196)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
at org.codehaus.jackson.map.ObjectMapper._readMapAndClose(ObjectMapper.java:2732)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1909)

Your content is not UTF-8, but something that is not valid for JSON like ISO-8859-1 (Latin-1). Your use of BufferedReader is bit wrong too -- you should specify encoding, otherwise platform-default encoding (which could be anything) is used -- but it probably converts from that encoding to avoid the error.
Nonetheless, it sounds like content is not valid JSON and whoever produces it should fix it to use one of supported encoding (UTF-8 or UTF-16).

How to show regional characters in Android

I get a sites source code by Java and assign it to a string. But when i see content of that string there ara ? instead of ç,ş,İ,ğ. Hope you can help me.

DataInputStream.readLine is capable of reading latin1-encoded text only. The characters you want are not in latin1 so the page must have some different encoding, such as UTF-8.
Assuming the page is encoded in UTF-8 you can read it if you substitute the part where you declare and initialize the variable in with the following:
Reader in = null;
try {
in = new BufferedReader(new InputStreamReader(u.getInputStream(), "UTF-8"));
If you don't know the page encoding beforehand you may be able to use the URLConnection.getContentEncoding() method to find out. This method returns the encoding declared i the HTTP header Content-Type. If the content type does not have the encoding you just have to guess.

Converting XML to JSON results in unknown characters when running on Centos instead of Windows

I have a Java servlet which gets RSS feeds converts them to JSON. It works great on Windows, but it fails on Centos.
The RSS feed contains Arabic and it shows unintelligible characters on Centos. I am using those lines to encode the RSS feed:
byte[] utf8Bytes = Xml.getBytes("Cp1256");
// byte[] defaultBytes = Xml.getBytes();
String roundTrip = new String(utf8Bytes, "UTF-8");
I tried it on Glassfish and Tomcat. Both have the same problem; it works on Windows, but fails on Centos. How is this caused and how can I solve it?

byte[] utf8Bytes = Xml.getBytes("Cp1256");
String roundTrip = new String(utf8Bytes, "UTF-8");
This is an attempt to correct a badly-decoded string. At some point prior to this operation you have read in Xml using the default encoding, which on your Windows box is code page 1256 (Windows Arabic). Here you are encoding that string back to code page 1256 to retrieve its original bytes, then decoding it properly as the encoding you actually wanted, UTF-8.
On your Linux server, it fails, because the default encoding is something other than Cp1256; it would also fail on any Windows server not installed in an Arabic locale.
The commented-out line that uses the default encoding instead of explicitly Cp1256 is more likely to work on a Linux server. However, the real fix is to find where Xml is being read, and fix that operation to use the correct encoding(*) instead of the default. Allowing the default encoding to be used is almost always a mistake, as it makes applications dependent on configuration that varies between servers.
(*: for this feed, that's UTF-8, which is the most common encoding, but it may differ for others. Finding out the right encoding for a feed depends on the Content-Type header returned for the resource and the <?xml encoding declaration. By far the best way to cope with this is to fetch and parse the resource using a proper XML library that knows about this, for example with DocumentBuilder.parse(uri).)

There are many places where wrong encoding can be used. Here is the complete list http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8

How to parse an XML file containing BOM?

I want to parse an XML file from URL using JDOM. But when trying this:
SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);
I get this exception:
Invalid byte 1 of 1-byte UTF-8 sequence.
I thought this might be the BOM issue. So I checked the source and saw the BOM in the beginning of the file. I tried reading from URL using aUrl.openStream() and removing the BOM with Commons IO BOMInputStream. But to my surprise it didn't detect any BOM.
I tried reading from the stream and writing to a local file and parse the local file. I set all the encodings for InputStreamReader and OutputStreamWriter to UTF8 but when I opened the file it had crazy characters.
I thought the problem is with the source URL encoding. But when I open the URL in browser and save the XML in a file and read that file through the process I described above, everything works fine.
I appreciate any help on the possible cause of this issue.

That HTTP server is sending the content in GZIPped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. For example:
builder.build(new GZIPInputStream(aUrl.openStream()));
Edited to add, based on the follow-up comment: If you don't know in advance whether the URL will be GZIPped, you can write something like this:
private InputStream openStream(final URL url) throws IOException
{
final URLConnection cxn = url.openConnection();
final String contentEncoding = cxn.getContentEncoding();
if(contentEncoding == null)
return cxn.getInputStream();
else if(contentEncoding.equalsIgnoreCase("gzip")
|| contentEncoding.equalsIgnoreCase("x-gzip"))
return new GZIPInputStream(cxn.getInputStream());
else
throw new IOException("Unexpected content-encoding: " + contentEncoding);
}
(warning: not tested) and then use:
builder.build(openStream(aUrl.openStream()));
. This is basically equivalent to the above — aUrl.openStream() is explicitly documented to be a shorthand for aUrl.openConnection().getInputStream() — except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.
See the documentation for java.net.URLConnection.

You might find you can avoid handling encoded responses by sending a blank Accept-Encoding header. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html: "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding.". That seems to be occurring here.

send arabic SMS on mobile in java

in my application there is both arabic and english language suport but i am facing a problem when the mobile receive arabic SMS it is displaied as ??? ???? (question marks) knowing that the monbile i am using for testing supports arabic and all the arabic in the application is working fine the problem is only when an arabic SMS is received by my mobile.
String ff = new String(smsContent.getBytes("UTF-8"), "UTF-8");
StringWriter stringBuffer = new StringWriter();
PrintWriter pOut = new PrintWriter(stringBuffer);
pOut.print("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
pOut.print("<!DOCTYPE MESSAGE SYSTEM \"http://127.0.0.1/psms/dtd/messagev12.dtd\" >");
pOut.print("<MESSAGE VER=\"1.2\"><USER USERNAME=\""+userName+"\" PASSWORD=\""+password+"\"/>");
pOut.print("<SMS UDH=\"0\" CODING=\"1\" TEXT=\""+ff+"\" PROPERTY=\"0\" ID=\"2\">");
pOut.print("<ADDRESS FROM=\""+fromNo+"\" TO=\""+toNO+"\" SEQ=\"1\" TAG=\"\" />");
pOut.print("</SMS>");
pOut.print("</MESSAGE>");
pOut.flush();
pOut.close();
URL url = new URL("url");
HttpURLConnection connection = (HttpURLConnection)url.openConnection();
connection.setDoOutput(true);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(connection.getOutputStream()));
out.write("data="+message+"&action=send");
out.flush();
SMS in english working file in my application.

First, new String(smsContent.getBytes("UTF-8"), "UTF-8") is a redundant roundtrip, equivalent to smsContent. First you encode the string as bytes via UTF-8, and then immediately decode it back from the bytes again.
Second, your method of puzzling together XML is completely broken. You can't just concatenate strings and hope to end up with well-formed XML. Just for example think about what happens if someone tries to send a "? Use an XML library.
Third, you're implicitly using the platform default encoding for your OutputStreamWriter instead of explicitly specifying one, which means your code only works on those machines which randomly happen to have the correct encoding as default. I'm guessing yours does not.
Fourth, your method of puzzling together POST parameters is broken. You haven't specified what the variable message is. I'm guessing it's the complete XML document, but then you're trying to send it as a POST parameter to some kind of HTTP service, in which case it needs to be escaped/url-encoded. Just for example, what happens if someone tries to send the message &data=<whatever>&? Please clarify.
See also Using java.net.URLConnection to fire and handle HTTP requests
Fifth, since you're sending to some HTTP service, there's probably some documentation for that service what encoding to send or how to specify it, possibly with a HTTP header (Probably "Content-type: application/x-www-form-urlencoded; charset=UTF-8"?). Point us to the documentation if you can't figure it out yourself.
Edit: Found the documentation: http://www.google.se/search?q=valuefirst+pace
It pretty clearly states that you need to url encode the XML document, so that's probably what you're missing, in which case the encoding for the OutputStreamWriter won't matter as long as it's ASCII-compatible.
However, the documentation does not specify which character encoding to use for url-encoding, which is pretty weak. UTF-8 is the most likely though.

From what I've read on some internet pages, SMS in arabic languages (and others too) are encoded with UCS-2 and not UTF-8. Changing the encoding is worth a try.

You are using your platform's default encoding for the request data, which may very well differ from UTF-8. Try specifying UTF-8 in the OutputStreamWriter:
... new OutputStreamWriter(connection.getOutputStream(), "UTF-8") ...
Another issue is of course that your hand-made XML document will fail as soon as any of your parameters contain characters, which have to be escaped in XML, but that's a different story. Why don't you use an XML library instead?
Just an additional information: The documentation Christoffer points to also explains that the request example you are using is only suitable for text messages with characters in the standard SMS character set. For Unicode character support, you have to use a different request.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.