I'm trying to parse an XML file in Java, and some lines contain the HTML entity &#153;. When I do
String value = (String) myXPath.evaluate(node, XPathConstants.STRING);
I get a square symbol instead of the character I expect (™). My machine runs Linux and the XML encoding is UTF-8. I can't understand how to decode this particular symbol properly; &#8482; is decoded perfectly well...
I create a Document instance in the following way:
File xmlFile = new File(path);
FileInputStream fileIS = new FileInputStream(xmlFile);
xmlDocument = builder.parse(fileIS);
The HTML entity &#153; refers to the character at Unicode code point 153 (U+0099), which is an unprintable C1 control character. It isn't a trademark symbol. 153 is the trademark sign in the Windows-1252 character set, but that's irrelevant on the web. You need to use the Unicode code point, which is 8482 (&#8482;, or the named entity &trade;) - https://en.wikipedia.org/wiki/Trademark_symbol
Note that the numbers used in HTML entity references have nothing to do with the file encoding. In fact, that's the whole point of using them - they survive changes of encoding.
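You can verify the difference from Java itself; this is a minimal check of my own, not part of the original answer:

// Code point 153 (U+0099) is an unprintable C1 control character,
// while code point 8482 (U+2122) is the trademark sign.
System.out.println(Character.getType(153) == Character.CONTROL); // true
System.out.println(new String(Character.toChars(8482)));         // ™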
I'm caching XML entities so I don't have to fetch them from the server, resulting in XML header tags like
<!ENTITY % xhtml-special-local SYSTEM "/Users/test/Library/Application Support/test/xhtml-special.ent" > %xhtml-special-local;
This works great unless the username contains ö, ä, å or similar non-ASCII characters. With these, I get the following parser error:
java.net.MalformedURLException: no protocol: /Users/test/Library/Application Support/testööö/xhtml-special.ent
How should the entity path be escaped to be accepted by the parser?
This can be fixed by prepending file:/// to the path, like so:
<!ENTITY % xhtml-special-local SYSTEM "file:////Users/test/Library/Application Support/test/xhtml-special.ent" > %xhtml-special-local;
First, it should be a URI, not a filename. That means it should have "file://" at the start.
Secondly, URIs only allow ASCII characters. Some systems are more flexible and accept IRIs (an extension of URIs that permits non-ASCII characters), but there's nothing in the specs that endorses this. To be portable you need to use %XX escaping for non-ASCII characters. The easiest way to do this if you're in Java is with File.toURI().
System.err.println(new File(
        "/Users/test/Library/Application Support/testööö/xhtml-special.ent"
).toURI().toASCIIString());
outputs
file:/Users/test/Library/Application%20Support/test%C3%B6%C3%B6%C3%B6/xhtml-special.ent
The sequence %C3%B6 is the hex representation of the two octets making up the UTF-8 encoding of the Unicode character ö.
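So rather than hand-writing the escaped URL, you can let File.toURI() produce the SYSTEM identifier (a sketch; the path is the one from the question, the variable names are mine):

File ent = new File("/Users/test/Library/Application Support/testööö/xhtml-special.ent");
String systemId = ent.toURI().toASCIIString();
// systemId is now safe to paste into the declaration, yielding the
// file:/Users/... form shown above:
// <!ENTITY % xhtml-special-local SYSTEM "..." > %xhtml-special-local;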
Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks.
Any idea (not only for Tika) how to extract PDF text while converting ligatures to separate characters?
File file = new File("path/to/file.pdf");
String text = new Tika().parseToString(file);
Edit
My PDF file is UTF-8 encoded (that's what InputStreamReader.getEncoding() says), and my platform encoding is also UTF-8. Even with -Dfile.encoding=UTF8, it is not working.
For instance, I'm supposed to get:
"différentes implémentations"
...and that's what I actually get:
"di��erentes impl�ementations"
I have an XML file which gets its data from an Oracle table with a CLOB column. I write into the CLOB value using a character stream:
Writer value = clob.setCharacterStream(1L); // JDBC Clob positions are 1-based
value.write(strValue);
When I write non-ASCII characters, like Chinese, and then access the CLOB attribute using PL/SQL Developer, I see the characters showing up as they are. However, when I put the value in an XML file encoded in UTF-8 and open the XML file in IE, I get the error message
"an invalid character was found in text content. Error processing resource ...".
The other interesting thing is that when I write into the CLOB using an ASCII stream, like:
OutputStream value = clob.getAsciiOutputStream(); // oracle.sql.CLOB API
value.write(strValue.getBytes("UTF-8"));
then the characters appear correctly in the XML in the browser, but are garbled in the DB when accessed using PL/SQL Developer.
Is there any problem in converting Unicode characters to UTF-8? Any suggestions, please?
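What the ASCII-stream variant most likely does is store the raw UTF-8 bytes as if each byte were a single character, which is why the DB view looks garbled. A two-line sketch of that mechanism (my illustration, assuming a Latin-1-style database character set, not the original code):

byte[] utf8 = "中".getBytes(java.nio.charset.StandardCharsets.UTF_8); // 0xE4 0xB8 0xAD
System.out.println(new String(utf8, java.nio.charset.StandardCharsets.ISO_8859_1)); // three junk characters instead of one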
I have looked through a lot of posts regarding the same problem, but I can't figure it out. I'm trying to parse an XML file with umlauts in it. This is what I have now:
File file = new File(this.xmlConfig);
InputStream inputStream = new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handlerConfig);
But it won't read the umlauts properly: Ä, Ü and Ö come out as weird characters. The file is definitely UTF-8, and it is declared as such in the first line: <?xml version="1.0" encoding="utf-8"?>
What am I doing wrong?
First rule: Don't second-guess the encoding used in the XML document. Always use byte streams to parse XML documents:
InputStream inputStream= new FileInputStream(this.xmlConfig);
InputSource is = new InputSource(inputStream);
saxParser.parse(is, handlerConfig);
If that doesn't work, the <?xml version=".." encoding="UTF-8" ?> (or whatever) in the XML is wrong, and you have to take it from there.
Second rule: Make sure you inspect the result with a tool that supports the encoding used in the target, or result, document. Have you?
Third rule: Check the byte values in the source document. Bring up your favourite HEX editor/viewer and inspect the content. For example, the letter Ä should be the byte sequence 0xC3 0x84, if the encoding is UTF-8.
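If you don't have a hex editor handy, a few lines of Java can do the same check (the file name is a placeholder, and this needs to run where IOException is handled):

byte[] head = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("config.xml"));
for (int i = 0; i < Math.min(head.length, 64); i++) {
    System.out.printf("%02X ", head[i] & 0xFF); // mask to avoid sign extension
}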
Fourth rule: If it doesn't look correct, always suspect that the UTF-8 source is viewed, or interpreted, as an ISO-8859-1 source. Verify this by comparing the first and second byte from the UTF-8 source with the ISO 8859-1 code charts.
UPDATE:
The byte sequence for the Unicode letter ä (latin small letter a with diaeresis, U+00E4) is 0xC3 0xA4 in the UTF-8 encoding. If you use a viewing tool that only understands (or is configured to interpret the source as) the ISO-8859-1 encoding, the first byte, 0xC3, is the letter Ã, and the second byte, 0xA4, is the currency sign ¤ (U+00A4), which may look like a circle.
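You can reproduce that mix-up in two lines (my sketch, not from the original answer):

byte[] utf8 = "ä".getBytes(java.nio.charset.StandardCharsets.UTF_8); // 0xC3 0xA4
System.out.println(new String(utf8, java.nio.charset.StandardCharsets.ISO_8859_1)); // prints "Ã¤"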
Hence, the "TextView" thingy in Android is interpreting your input as an ISO-8859-1 stream. I have no idea whether it is possible to change that. But if you have your parsing result as a String or a byte array, you could convert it to an ISO-8859-1 stream (or byte array) and then feed it to "TextView".
I'm trying to index Wikipedia dumps. My SAX parser makes Article objects for the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.
I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except those with matching prefixes.
In French, the prefixes with no accent also work (i.e. they filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:), but I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the incriminated pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file but not in the original…
Any idea?
Alternatives I tried:
Getting the file (commented lines were tried without success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"Catégorie:", "Modèle:", "Wikipédia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad: that one I tried does work, I had tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(fis, handler);
Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
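An alternative that sidesteps the source-file encoding entirely (my suggestion, not part of the original answer) is to write the accented prefixes with \u escapes:

String[] ignoredPrefix = { "Cat\u00E9gorie:", "Mod\u00E8le:", "Wikip\u00E9dia:" };
// \u00E9 is é and \u00E8 is è, so the .java file's encoding no longer matters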
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings since they are byte-based, not character-based. So here we create a Reader from a FileInputStream - the latter (the stream) doesn't know about encodings, but the former (the reader) does, because we give the encoding in the constructor.
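Putting the pieces together for SAX (a sketch, mirroring the erratum above):

Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
parser.parse(new InputSource(r), handler);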