How to parse an XML file containing BOM?


I want to parse an XML file from URL using JDOM. But when trying this:
SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);
I get this exception:
Invalid byte 1 of 1-byte UTF-8 sequence.
I thought this might be the BOM issue. So I checked the source and saw the BOM in the beginning of the file. I tried reading from URL using aUrl.openStream() and removing the BOM with Commons IO BOMInputStream. But to my surprise it didn't detect any BOM.
I tried reading from the stream and writing to a local file and parse the local file. I set all the encodings for InputStreamReader and OutputStreamWriter to UTF8 but when I opened the file it had crazy characters.
I thought the problem is with the source URL encoding. But when I open the URL in browser and save the XML in a file and read that file through the process I described above, everything works fine.
I appreciate any help on the possible cause of this issue.

That HTTP server is sending the content in GZIPped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. For example:
builder.build(new GZIPInputStream(aUrl.openStream()));
Edited to add, based on the follow-up comment: If you don't know in advance whether the URL will be GZIPped, you can write something like this:
private InputStream openStream(final URL url) throws IOException
{
final URLConnection cxn = url.openConnection();
final String contentEncoding = cxn.getContentEncoding();
if(contentEncoding == null)
return cxn.getInputStream();
else if(contentEncoding.equalsIgnoreCase("gzip")
|| contentEncoding.equalsIgnoreCase("x-gzip"))
return new GZIPInputStream(cxn.getInputStream());
else
throw new IOException("Unexpected content-encoding: " + contentEncoding);
}
(warning: not tested) and then use:
builder.build(openStream(aUrl));
This is basically equivalent to the above (aUrl.openStream() is explicitly documented to be a shorthand for aUrl.openConnection().getInputStream()), except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.
See the documentation for java.net.URLConnection.
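If the Content-Encoding header is missing or unreliable, another option is to look at the first two bytes of the stream, since gzip data always starts with 0x1f 0x8b. The following is only a sketch: GzipSniffer and maybeGunzip are hypothetical names that do not appear in the answer above, and the approach assumes the uncompressed payload never starts with those two bytes (an XML document will not).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;

final class GzipSniffer {
    // Hypothetical helper: unwraps gzip only when the stream starts with 0x1f 0x8b.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];
        int n = 0;
        // Read up to two bytes; a single read() call may return fewer.
        while (n < 2) {
            int r = in.read(head, n, 2 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        // Push the peeked bytes back so the consumer sees the whole stream.
        if (n > 0) {
            in.unread(head, 0, n);
        }
        boolean gzip = n == 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b;
        return gzip ? new GZIPInputStream(in) : in;
    }
}
```

It could then be used as builder.build(GzipSniffer.maybeGunzip(aUrl.openStream())).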

You might find you can avoid handling encoded responses by sending a blank Accept-Encoding header. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html: "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding.". That seems to be occurring here.
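A minimal sketch of that idea, assuming the server honours the header: it asks for the uncompressed ("identity") coding before the connection is used. IdentityRequest and openUncompressed are hypothetical names, and a server is still free to ignore the request.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

final class IdentityRequest {
    // Hypothetical helper: requests an uncompressed response body.
    static URLConnection openUncompressed(URL url) throws IOException {
        // openConnection() only creates the connection object; nothing is sent yet.
        URLConnection cxn = url.openConnection();
        cxn.setRequestProperty("Accept-Encoding", "identity");
        return cxn;
    }
}
```

The stream would then come from IdentityRequest.openUncompressed(aUrl).getInputStream(). Checking getContentEncoding() afterwards is still a good idea in case the server compresses anyway.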

Related

What encoding should I use for an HTTP servlet input stream if none was specified?

While reading a ServletInputStream my team was doing something like this:
br = new BufferedReader(new InputStreamReader(servletInputStream));
This unsurprisingly gave a red flag on my code analyzer as the encoding is not specified and so it will rely on whatever the default system encoding is.
The first step would be to try to get the encoding from the request:
String encoding = request.getCharacterEncoding();
if (encoding != null) {
br = new BufferedReader(new InputStreamReader(servletInputStream, encoding));
}
However, as this related answer told me, most browsers don't send the encoding, which will cause the encoding to be null above. In that case, how on earth am I supposed to know what the encoding is?
Do browsers not send the encoding because:
There is a universally-agreed default encoding for HTTP requests which is used if none was specified? (if so what is it and where is the standard that defines it should always be used), or,
There is some other way to determine what the encoding is? (if so what is it? Surely not just trying different encodings and seeing whether you get garbage or something parseable?)

Since it's a dynamic web application, you'd be expected to have some control over which encoding clients use when they post the request. Usually it's UTF-8.
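As a rough sketch of that policy: use the encoding the request declares, and fall back to UTF-8 when none is declared. The UTF-8 fallback is an assumption about what your own clients send, not something HTTP guarantees, and RequestReaders / readerFor are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RequestReaders {
    // Hypothetical helper: declaredEncoding would come from request.getCharacterEncoding().
    static BufferedReader readerFor(InputStream body, String declaredEncoding) {
        Charset charset = StandardCharsets.UTF_8; // assumed default for this application
        if (declaredEncoding != null) {
            try {
                charset = Charset.forName(declaredEncoding);
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: keep the UTF-8 fallback.
            }
        }
        return new BufferedReader(new InputStreamReader(body, charset));
    }
}
```

In the servlet this would be called as RequestReaders.readerFor(servletInputStream, request.getCharacterEncoding()).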

JAX-RS and character encoding problems

I am using Jax RS and have simple POST WS, that takes InputStream, that contains MIME message (xml + file).
The MIME message is in UTF-8, file contained as a body part is an email message in MIME RFC 822 in ISO-8859-1 encoding, that I'm converting to PDF using Aspose.
When running as a webservice, the resulting PDF has incorrect characters (ø, å etc.). But when I tried to use the exact input, but reading it from file instead and call the method with FileInputStream, the resulting PDF is OK.
Here is the simplified version of the code:
@POST
@Path(value = "/documents/convert/{flag}")
@Produces("text/plain")
public String convertFile(InputStream input, @PathParam("flag") String flag) throws WebApplicationException {
FileInfo info = convertToPdf(input);
return info.getResponse();
}
If I run this as a webservice, it produces a PDF with incorrectly encoded characters, with a "box" instead of some characters (such as ø, å etc.). When I run the same code with the same input by calling
FileInputStream fis = new FileInputStream(file);
convertFile(fis);
the resulting PDF has correct encoding (the WS is run on server, testing with file is done on my local machine).
Could this be incorrect setting of locale on the server?

Do you use an InputStreamReader to read the FileInputStream? If so, did you initialize it using the two-parameter constructor, with Charset.forName("UTF-8") as the second argument (as you mentioned, the incoming stream is already in UTF-8)?

You might need to tell the container that it's UTF-8.
something like...
@Produces("text/plain; charset=utf-8")

Apparently your local file and your MIME message body are not encoded the same way.
Your post states that the file is encoded in ISO-8859-1.
If you are using an InputStreamReader (as Xavier Coulon is suggesting) you should pass the expected encoding to it. In this case
Charset.forName("ISO-8859-1")
If this does not help, could you please provide the content of the convertToPdf(InputStream is) method

getContentType method returning always 'application/force-download'

I have an URL to file which I can download. It looks like this:
http://<server>/recruitment-mantis/plugin.php?page=BugSynchronizer/getfile&fileID=139&filehash=3e7a52a242f90c23539a17f6db094d86
How can I get the content type of this file? I have to admit that in this case a simple:
URL url = new URL(stringUrl);
URLConnection urlConnection = url.openConnection();
urlConnection.connect();
String urlContent = urlConnection.getContentType();
returns the application/force-download content type for every file (no matter whether it is a jpg or pdf file).
I want to do this because I want to set the extension of the downloaded file (which can vary). How can I get around this application/force-download content type? Thanks in advance for your help.

Check urlConnection.getHeaderField("Content-Disposition") for a filename. Usually that header is used for attachments in multipart content, but it doesn't hurt to check.
If that header is not present, you can save the URL to a temporary file, and use probeContentType to get a meaningful MIME type:
Path tempFile = Files.createTempFile(null, null);
try (InputStream urlStream = urlConnection.getInputStream()) {
Files.copy(urlStream, tempFile, StandardCopyOption.REPLACE_EXISTING);
}
String mimeType = Files.probeContentType(tempFile);
Be aware that probeContentType may return null if it can't determine the type of the file.
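For the Content-Disposition check, a small sketch of pulling the filename out of the header value might look like this. DispositionNames and fileName are hypothetical names; the pattern only handles the plain filename= parameter (quoted or unquoted), not the extended filename*= form, and it assumes the server actually sends the header.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DispositionNames {
    // Matches filename="quoted value" or filename=token (case-insensitive).
    private static final Pattern FILENAME = Pattern.compile(
            "filename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: header is urlConnection.getHeaderField("Content-Disposition").
    static Optional<String> fileName(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FILENAME.matcher(header);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) != null ? m.group(1) : m.group(2));
    }
}
```

The extension can then be taken from the returned name. Treat it as untrusted input before using it as a path on disk.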

How to 'get around' of this application/force-download content type?
I had the same problem with my uploaded content-type. Although you can trust the content-type from the URL, I chose to go looking for a content-type utility that determines the type from the byte content.
After trying 5 or so implementations I decided to reinvent the wheel and released my SimpleMagic package which makes use of the magic(5) Unix content-type files to implement the same functionality as the Unix file(1) command. It uses either internal config files or can read /etc/magic, /usr/share/file/magic, or other magic(5) files and determine file content from File, InputStream, or byte[].
Location of the github sources, javadocs, and some documentation are available from the home page.
With SimpleMagic, you do something like the following:
ContentInfoUtil util = new ContentInfoUtil();
ContentInfo info = util.findMatch(byteArray);
It works from the contents of the data (File, InputStream, or byte[]), not the file name.

I guess this content type is set by the server you are downloading from. Some servers use this kind of content type to force browsers to download the file instead of trying to open it. For example, when my server returns content type "application/pdf", Chrome will try to open it as a PDF, but when the server returns "application/force-download", the browser will save it to disk because it has no clue what to do with it.
So you need to change the server to return the correct content type or, better, try some other heuristic to get the correct file type, because the server can always lie to you by claiming a jpg but giving you an exe.
I see with Java 7 you can try this method:
http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType%28java.nio.file.Path%29
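If you would rather not write a temporary file, the JDK also has URLConnection.guessContentTypeFromStream, which inspects the first bytes of a stream. It recognises only a limited set of signatures (common image formats among them) and returns null for anything else, so the sketch below is a first-pass heuristic rather than a complete solution. TypeSniffer and sniff are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.URLConnection;

final class TypeSniffer {
    // Hypothetical helper: the stream must support mark/reset, which BufferedInputStream does.
    // The stream is reset afterwards, so the caller can still read it from the start.
    static String sniff(BufferedInputStream in) throws IOException {
        return URLConnection.guessContentTypeFromStream(in); // may be null
    }
}
```

Typical use would be to wrap urlConnection.getInputStream() in a BufferedInputStream, call TypeSniffer.sniff on it, and then save the file from the same wrapped stream.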

send arabic SMS on mobile in java

In my application there is both Arabic and English language support, but I am facing a problem: when the mobile receives an Arabic SMS it is displayed as ??? ???? (question marks). The mobile I am using for testing supports Arabic, and all the Arabic in the application is working fine; the problem occurs only when an Arabic SMS is received by my mobile.
String ff = new String(smsContent.getBytes("UTF-8"), "UTF-8");
StringWriter stringBuffer = new StringWriter();
PrintWriter pOut = new PrintWriter(stringBuffer);
pOut.print("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
pOut.print("<!DOCTYPE MESSAGE SYSTEM \"http://127.0.0.1/psms/dtd/messagev12.dtd\" >");
pOut.print("<MESSAGE VER=\"1.2\"><USER USERNAME=\""+userName+"\" PASSWORD=\""+password+"\"/>");
pOut.print("<SMS UDH=\"0\" CODING=\"1\" TEXT=\""+ff+"\" PROPERTY=\"0\" ID=\"2\">");
pOut.print("<ADDRESS FROM=\""+fromNo+"\" TO=\""+toNO+"\" SEQ=\"1\" TAG=\"\" />");
pOut.print("</SMS>");
pOut.print("</MESSAGE>");
pOut.flush();
pOut.close();
URL url = new URL("url");
HttpURLConnection connection = (HttpURLConnection)url.openConnection();
connection.setDoOutput(true);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(connection.getOutputStream()));
out.write("data="+message+"&action=send");
out.flush();
SMS in English works fine in my application.

First, new String(smsContent.getBytes("UTF-8"), "UTF-8") is a redundant roundtrip, equivalent to smsContent. First you encode the string as bytes via UTF-8, and then immediately decode it back from the bytes again.
Second, your method of puzzling together XML is completely broken. You can't just concatenate strings and hope to end up with well-formed XML. Just for example think about what happens if someone tries to send a "? Use an XML library.
Third, you're implicitly using the platform default encoding for your OutputStreamWriter instead of explicitly specifying one, which means your code only works on those machines which randomly happen to have the correct encoding as default. I'm guessing yours does not.
Fourth, your method of puzzling together POST parameters is broken. You haven't specified what the variable message is. I'm guessing it's the complete XML document, but then you're trying to send it as a POST parameter to some kind of HTTP service, in which case it needs to be escaped/url-encoded. Just for example, what happens if someone tries to send the message &data=<whatever>&? Please clarify.
See also Using java.net.URLConnection to fire and handle HTTP requests
Fifth, since you're sending to some HTTP service, there's probably some documentation for that service what encoding to send or how to specify it, possibly with a HTTP header (Probably "Content-type: application/x-www-form-urlencoded; charset=UTF-8"?). Point us to the documentation if you can't figure it out yourself.
Edit: Found the documentation: http://www.google.se/search?q=valuefirst+pace
It pretty clearly states that you need to url encode the XML document, so that's probably what you're missing, in which case the encoding for the OutputStreamWriter won't matter as long as it's ASCII-compatible.
However, the documentation does not specify which character encoding to use for url-encoding, which is pretty weak. UTF-8 is the most likely though.
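A minimal sketch of the URL-encoding step, assuming UTF-8 is the charset the service expects (the documentation does not say so). FormBody, build and writeTo are hypothetical names; the data and action parameter names are taken from the question.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class FormBody {
    // Percent-encodes the XML so that '&', '=', quotes and non-ASCII text survive as a form value.
    static String build(String xml) throws UnsupportedEncodingException {
        return "data=" + URLEncoder.encode(xml, "UTF-8") + "&action=send";
    }

    // The encoded body is pure ASCII, so the platform default encoding no longer matters.
    static void writeTo(OutputStream out, String xml) throws IOException {
        out.write(build(xml).getBytes(StandardCharsets.US_ASCII));
        out.flush();
    }
}
```

With the question's code this would be FormBody.writeTo(connection.getOutputStream(), xmlDocument), where xmlDocument stands for the complete XML string.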

From what I've read on some internet pages, SMS in arabic languages (and others too) are encoded with UCS-2 and not UTF-8. Changing the encoding is worth a try.

You are using your platform's default encoding for the request data, which may very well differ from UTF-8. Try specifying UTF-8 in the OutputStreamWriter:
... new OutputStreamWriter(connection.getOutputStream(), "UTF-8") ...
Another issue is of course that your hand-made XML document will fail as soon as any of your parameters contain characters, which have to be escaped in XML, but that's a different story. Why don't you use an XML library instead?
Just an additional information: The documentation Christoffer points to also explains that the request example you are using is only suitable for text messages with characters in the standard SMS character set. For Unicode character support, you have to use a different request.
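To illustrate the XML library suggestion, here is a sketch using the JDK's built-in XMLStreamWriter, which escapes attribute values for you. SmsXml and smsElement are hypothetical names; the SMS element and its TEXT and CODING attributes are copied from the question, and the CODING value needed for Unicode messages has to come from the provider's documentation.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class SmsXml {
    // Hypothetical helper: writes only the SMS element, escaping the message text.
    static String smsElement(String text) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartElement("SMS");
        xml.writeAttribute("CODING", "1"); // value copied from the question
        xml.writeAttribute("TEXT", text);  // quotes, '&' and '<' are escaped by the writer
        xml.writeEndElement();
        xml.close();
        return out.toString();
    }
}
```

A message containing a double quote or an ampersand then no longer breaks the document.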

URL encoding for latin characters in Java

I'm trying to read in an image URL. As mentioned in the java documentation, I tried converting the URL to URI by
String imageURL = "http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg";
URL url = new URL(imageURL);
url = new URI(url.getProtocol(), url.getHost(), url.getFile(), null).toURL();
URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();
I get a java.io.FileNotFoundException for the file
http://www.shefinds.com/files/Christian-Louboutin-DÃ©colletÃ©-100-pumps.jpg
What am I doing wrong and what is the right way to encode this URL?
Update:
I'm using Rome to read in RSS feeds. Taking suggestions from BalusC I have printed out the raw input from different stages and seems like that the ROME rss parser is using ISO-8859-1 instead of UTF-8.

Works fine here (returns a 403, it's at least not a 404):
URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
InputStream input = connection.getInputStream();
When I fix it so that it doesn't return a 403, the picture is correctly retrieved:
URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/4.0");
InputStream input = connection.getInputStream();
OutputStream output = new FileOutputStream("/pic.jpg");
for (int data = 0; (data = input.read()) != -1;) {
output.write(data);
}
So your problem lies somewhere else. Converting is actually not needed. The initial URL is valid.
Maybe you're obtaining the actual URL from some binary source using the wrong character encoding? The transition of é to Ã© suggests that the original source was UTF-8 encoded and that the code has incorrectly read it using ISO-8859-1 instead of UTF-8.
Update: or maybe you've actually hardcoded it in the Java source code and saving the source file itself using the wrong encoding. I've configured my editor (Eclipse) to save files using UTF-8 and the -Dfile.encoding is also defaulted to UTF-8, that would explain why it works at my machine ;)
Update 2: as per the comments, in a nutshell, everything should work fine if the encoding used to save the source file matches the default -Dfile.encoding of the runtime platform (and the character encoding in question supports the é). To avoid those unforeseen clashes whenever you like to distribute the code, it's indeed better to replace hardcoded non-ASCII chars by unicode escapes.
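The é to Ã© transition can be reproduced in a few lines. This is only an illustration of the mechanism (UTF-8 bytes decoded as ISO-8859-1), with the hypothetical names Mojibake, misdecode and repair; it uses Unicode escapes so that the encoding of the source file itself cannot interfere.

```java
import java.nio.charset.StandardCharsets;

final class Mojibake {
    // Encodes as UTF-8, then wrongly decodes the bytes as ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake, which works because ISO-8859-1 maps every byte to one char.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("D\u00e9collet\u00e9");
        // Compare against the escaped form of the garbled text rather than printing it.
        System.out.println(garbled.equals("D\u00c3\u00a9collet\u00c3\u00a9")); // true
        System.out.println(repair(garbled).equals("D\u00e9collet\u00e9"));     // true
    }
}
```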

I think the technical answer is "you can't." Non-ASCII characters can't be used in a URL according to the standard, and even some ASCII characters must be escaped with "%XX" syntax, where XX is the hexadecimal value of the character's byte.
If anything, you can escape 'é' with '%E9' but this relies on the server interpreting this as an encoding of the character according to ISO-8859-1. While this isn't technically allowed, I believe many servers will do it.
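For comparison, java.net.URI can produce the percent-escaped form itself, and it encodes non-ASCII characters as UTF-8 bytes (so é becomes %C3%A9 rather than %E9). Whether the server decodes the path as UTF-8 is an assumption to verify against the actual server. UrlEscaper and toAscii are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UrlEscaper {
    // Builds a URI from its parts and returns it with non-ASCII characters percent-encoded as UTF-8.
    static String toAscii(String scheme, String host, String path) throws URISyntaxException {
        return new URI(scheme, host, path, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // Prints http://www.example.com/files/D%C3%A9collet%C3%A9.jpg
        System.out.println(toAscii("http", "www.example.com", "/files/D\u00e9collet\u00e9.jpg"));
    }
}
```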

The encoding of your source file is to blame. Using your IDE, set it to UTF-8, and then repaste the URL.
