XML entity references with unicode paths - java

I'm caching XML entities so I don't have to fetch them from the server, which results in entity declarations like
<!ENTITY % xhtml-special-local SYSTEM "/Users/test/Library/Application Support/test/xhtml-special.ent" > %xhtml-special-local;
This works great unless the username contains öäå or similar non-ASCII characters. With these, I get the following parser error:
java.net.MalformedURLException: no protocol: /Users/test/Library/Application Support/testööö/xhtml-special.ent
How should the entity path be escaped to be accepted by the parser?

This can be fixed by prepending file:/// to the path, like so
<!ENTITY % xhtml-special-local SYSTEM "file:////Users/test/Library/Application Support/test/xhtml-special.ent" > %xhtml-special-local;

It should be a URI, not a filename. That means it should have "file://" at the start.
Secondly, URIs only allow ASCII characters. Some systems are more flexible and accept IRIs (an extension of URIs that permits non-ASCII characters), but there's nothing in the specs that endorses this. To be portable you need to use %XX escaping for non-ASCII characters. The easiest way to do this in Java is with File.toURI() together with URI.toASCIIString().
System.err.println(new File(
        "/Users/test/Library/Application Support/testööö/xhtml-special.ent"
).toURI().toASCIIString());
outputs
file:/Users/test/Library/Application%20Support/test%C3%B6%C3%B6%C3%B6/xhtml-special.ent
The sequence %C3%B6 is the hex representation of the two octets making up the UTF-8 encoding of the Unicode character ö.
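Putting this together, here is a minimal sketch (my own, not from the original answer) of how the cached entity declaration could be generated; the path and entity name are simply the ones from the question, so adjust them to your cache layout:

import java.io.File;

public class EntityPathExample {
    public static void main(String[] args) {
        // Hypothetical location of the cached entity file; adjust to your cache directory.
        File cached = new File("/Users/test/Library/Application Support/testööö/xhtml-special.ent");

        // toURI() produces a file: URI; toASCIIString() percent-encodes spaces
        // and non-ASCII characters so any conforming XML parser accepts it.
        String systemId = cached.toURI().toASCIIString();

        // Build the parameter entity declaration around the escaped system identifier.
        String declaration = "<!ENTITY % xhtml-special-local SYSTEM \"" + systemId
                + "\" > %xhtml-special-local;";
        System.out.println(declaration);
    }
}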

Related

Unzip files that contain Chinese characters

I have a zip file. It contains some files. The files contain Chinese characters, so I used
ZipInputStream zipStream = new ZipInputStream(
        new BufferedInputStream(new FileInputStream(zipFilePath), BUFFER_SIZE),
        Charset.forName("ISO-8859-1")
);
......
FileOutputStream fileOutput = new FileOutputStream(uncompressedFileName);
while (zipStream.available() > 0) {
    fileOutput.write(zipStream.read());
}
Extraction runs successfully. After that I want to use an encodingDetect method to find the encoding, but now the service does not work: it returns nomatch. If I send the files directly to the service, it works and finds the charset properly, e.g. UTF-8.
I guess that Charset.forName("ISO-8859-1") extracts the files but the format gets corrupted. Do you have any idea?
The problem is the charset of the file names in the zip. UTF-8 raises an error (the file names are evidently not in UTF-8), as UTF-8 requires a special format for multi-byte sequences, and evidently there are wrong "multi-byte" sequences.
ISO-8859-1 is a single-byte encoding, so it accepts any garbage.
What you should do is try the small number of Chinese charsets, so that the file name strings are filled correctly (as sketched below this answer). A Java String contains Unicode, so it can hold text from any charset. Help from someone who speaks Chinese would probably make sense.
And then try writing files with those names. If that is not successful on your PC, you must use artificial file names, maybe a transliteration from the Chinese.
A translation table from the original Chinese file names to the actual file names may be created as a UTF-8 text file, maybe with a BOM, '\uFEFF', at the beginning of the file.
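A minimal sketch of the "try the Chinese charsets" idea, assuming Java 7+ (ZipFile has accepted a Charset since then); the candidate list is just a guess at likely encodings, and a reader of Chinese still has to judge which listing looks right:

import java.io.File;
import java.nio.charset.Charset;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipNameCharsetProbe {
    public static void main(String[] args) throws Exception {
        String zipFilePath = args[0];  // the zip in question

        // Candidate encodings commonly used for Chinese file names.
        String[] candidates = { "GBK", "GB18030", "Big5" };

        for (String charsetName : candidates) {
            System.out.println("--- trying " + charsetName + " ---");
            try (ZipFile zip = new ZipFile(new File(zipFilePath), Charset.forName(charsetName))) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    System.out.println("  " + entries.nextElement().getName());
                }
            } catch (IllegalArgumentException e) {
                // Thrown when the entry-name bytes are not valid in this charset.
                System.out.println("  not decodable as " + charsetName);
            }
        }
    }
}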
The ISO-8859-1 charset most definitely does not support Chinese. Use UTF-8 instead of ISO-8859-1.

CharConversionException while transforming an XML file

I have a Java program which processes XML files. When transforming XML into another XML file based on a certain schema (XSD/XSL), it throws the following error.
This error is only thrown for one XML file, which has an XML tag like this:
<abc>xxx yyyy “ggggg vvvv” uuuu</abc>
But after removing or re-typing the two quotes, it doesn't throw the error.
Could anybody please assist me in resolving this issue?
java.io.CharConversionException: Character larger than 4 bytes are not supported: byte 0x93 implies a length of more than 4 bytes
at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><xyz xmlns="http://pqr.yy"><Header><abc> aaa “cccc” aaaaa vvv</abc></Header></xyz>
As others have reported in the comments, it has failed because the typographical quotation marks are encoded in the Windows-1252 encoding, not in UTF-8, so the parser hasn't managed to decode them.
The encoding declared in the XML declaration must match the actual encoding used for the characters.
To find out how this error arose, and to prevent it happening again, we would need to know where this (wannabe) XML file came from, and how it was created.
My guess would be that someone used a "smart" editor; Microsoft editors in particular are notorious for changing what you type into what Microsoft thinks you wanted to type. If you're editing XML by hand, it's best to use an XML-aware editor.
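If the file really is Windows-1252 throughout, one possible repair (a sketch of my own, not part of the answers above; the file names are made up, and it assumes the whole file shares one encoding) is to re-read it as Windows-1252 and write it back as genuine UTF-8 so it matches the declaration:

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReEncodeToUtf8 {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("input-windows-1252.xml");  // hypothetical input file
        Path target = Paths.get("input-utf8.xml");          // hypothetical output file

        // Decode the bytes as Windows-1252, where 0x93/0x94 are the curly double quotes...
        String content = new String(Files.readAllBytes(source), Charset.forName("windows-1252"));

        // ...and write the same characters back out as real UTF-8, matching the XML declaration.
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }
}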

Unable to parse trademark symbol in XML using XPath

I'm trying to parse an XML file in Java, and some lines contain the HTML character reference &#153;. Still, when I do
((String) myXPath.evaluate(node, STRING));
I get a square symbol instead of ™. My machine is Linux and the XML encoding is UTF-8. I can't understand how to properly encode this exact symbol. &#8482; is encoded perfectly well...
I create a Document instance in the following way:
File xmlFile = new File(path);
FileInputStream fileIS = new FileInputStream(xmlFile);
xmlDocument = builder.parse(fileIS);
The HTML entity &#153; represents the character with Unicode codepoint 153, which is an unprintable control character. It isn't a trademark symbol. 153 might be a trademark symbol in some Microsoft Windows character set, but that's irrelevant on the web. You need to use the Unicode codepoint, which is 8482: https://en.wikipedia.org/wiki/Trademark_symbol
Note that the numbers used in HTML entity references have nothing to do with the file encoding. In fact, that's the whole point of using them - they survive changes of encoding.
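A small self-contained check (my own sketch, not from the answer) showing that &#8482; parses to the trademark character and comes back from XPath as expected:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TrademarkReference {
    public static void main(String[] args) throws Exception {
        // &#8482; is the numeric character reference for U+2122 TRADE MARK SIGN.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><name>Acme&#8482;</name></root>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        String value = (String) XPathFactory.newInstance().newXPath()
                .evaluate("/root/name", doc, XPathConstants.STRING);

        System.out.println(value);  // prints Acme™ (on a console that can display it)
    }
}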

Different behavior when space is encoded as + and %20 in a URL

Pages with spaces in the URL don't get correctly translated:
i.e.
http://www.streetinsider.com/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html
or
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
Gives 404. Please note "Press Releases" is encoded as "Press%20Releases".
However, the following two versions work fine, where "Press Releases" is encoded as "Press+Releases".
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
The article parses fine whether the rest of the path uses plus signs or hex-encoded spaces (%20):
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
Both + and %20 represent spaces, so why this behavior?
And also, what could I use in Java to get the correctly encoded URL?
Both + and %20 represent spaces
Only in query strings. Elsewhere in a URL a plus is a plus, not a space. In this case the web server gives you the same content for the two different URLs
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
and
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
but the two URLs are distinct, they're not alternative representations of the same URL.
Officially + might only be used in the query string (after ?).
This is what URLEncoder is for:
"?x=" + URLEncoder.encode("Hello World", "UTF-8");
"?x=" + URLEncoder.encode("ŝi estas ĉarma", "UTF-8");
?x=Hello+World
?x=%C5%9Di+estas+%C4%89arma
The more universal class URI obeys the specification: spaces in the path are percent-encoded as %20.
URI uri = new URI("http", "www.streetinsider.com",
        "/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html",
        "?x=ŝi estas ĉarma");  // note: this fourth argument is the fragment, hence the "#" in the output
String u = uri.toString();
outputs
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html#?x=ŝi%20estas%20ĉarma
One sometimes encounters URI as a generalisation of File and other resources, and then has to be careful not to introduce %20 into actual file names.
So there is probably a partial remapping of + (or even %20, as it seems) on streetinsider's side, so that both forms reach the same page.
Your statement
Both + and %20 represent spaces.
is not exactly true in all cases.
Space characters may only be encoded as "+" in one context: application/x-www-form-urlencoded key-value pairs.
RFC 1866 (the HTML 2.0 specification), paragraph 8.2.1, subparagraph 1, says: "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped".
Here is an example of such a string in a URL where RFC 1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So only after the "?" can spaces be replaced by pluses (in other cases, spaces should be encoded as %20). This way of encoding form data is also given in later HTML specifications; for example, look for the relevant paragraphs about application/x-www-form-urlencoded in the HTML 4.01 Specification.
The URL that you have provided is not form data containing key/value pairs; it's just the path to a 9778767.html file:
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
So, it is illegal to use pluses here. The correct URL in this case should have been the following:
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
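To answer the "what could I use in Java" part, a minimal sketch (my own, using only standard JDK classes) of producing the correctly encoded URL: the multi-argument URI constructor percent-encodes the path, while URLEncoder is reserved for query-string data:

import java.net.URI;
import java.net.URLEncoder;

public class EncodeUrlExample {
    public static void main(String[] args) throws Exception {
        // The path is written with literal spaces; the URI constructor encodes them as %20, never +.
        URI pageUri = new URI("http", "www.streetinsider.com",
                "/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html",
                null);
        System.out.println(pageUri.toASCIIString());
        // -> http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html

        // URLEncoder is only for application/x-www-form-urlencoded data (query strings),
        // where spaces become "+".
        System.out.println("?q=" + URLEncoder.encode("Plano Housing Market", "UTF-8"));
        // -> ?q=Plano+Housing+Market
    }
}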

How to check encoding in Java?

I am facing a problem with encoding.
For example, I have a message in XML whose declared encoding is "UTF-8".
<message>
<product_name>apple</product_name>
<price>1.3</price>
<product_name>orange</product_name>
<price>1.2</price>
.......
</message>
Now, this message supports multiple languages:
Traditional Chinese (big5),
Simple Chinese (gb),
English (utf-8)
And it will only change the encoding in specific fields.
For example (Traditional Chinese),
<product_name>蘋果</product_name>
<price>1.3</price>
<product_name>橙</product_name>
<price>1.2</price>
.......
Only "蘋果" and "橙" are using big5, "<product_name>" and "</product_name>" are still using utf-8.
<price>1.3</price> and <price>1.2</price> are using utf-8.
How do I know which word is using different encoding?
It looks like whoever is providing the XML is providing incorrect XML. They should be using a consistent encoding.
http://sourceforge.net/projects/jchardet/files/ is a pretty good heuristic charset detector.
It's a port of the one used in Firefox to detect the encoding of pages that are missing a charset in content-type or a BOM.
You could use that to try and figure out the encoding for substrings in a malformed XML file if you can't get the provider to fix their output.
You should use only one encoding in one XML file. The Big5 characters all have counterparts in the UTF-8 encoding.
Because I cannot get the provider to fix the output, I have to handle it myself, and I cannot use an external library in this project.
I can only solve it like this,
String str = new String(big5String.getBytes("UTF-8"));
before displaying the message.
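A library-free way to at least check which encoding a chunk of bytes is in (a sketch of my own, not from the thread, using only java.nio): a CharsetDecoder set to REPORT refuses bytes that are invalid for its charset, which gives a rough test for Big5 versus UTF-8.

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class CharsetProbe {

    // Returns true if the bytes decode without error in the given charset.
    static boolean decodesAs(byte[] bytes, String charsetName) {
        try {
            Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the bytes of one product_name field; "Big5" is shipped
        // with standard JDK distributions as an extended charset.
        byte[] fieldBytes = "蘋果".getBytes("Big5");

        // A Big5-encoded field is usually not valid UTF-8, so this pair of checks
        // gives a rough (not foolproof) way to tell the two apart.
        System.out.println("valid UTF-8? " + decodesAs(fieldBytes, "UTF-8"));
        System.out.println("valid Big5?  " + decodesAs(fieldBytes, "Big5"));
    }
}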
