Parse XML file containing umlaute using SAX parser

Parse XML file containing umlaute using SAX parser - java

I have looked through a lot of posts regarding the same problem, but i can't figure it out. I trying to parse a XML file with umlauts in it. This is what i have now:
File file = new File(this.xmlConfig);
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handlerConfig);
But it won't get umlauts properly. Ä,Ü and Ö will be only weird characters. The file is definitely in utf-8 and it is declared as such with the first line like this: <?xml version="1.0" encoding="utf-8"?>
What I'm doing wrong?

First rule: Don't second guess the encoding used in the XML document. Always use byte streams to parse XML documents:
InputStream inputStream= new FileInputStream(this.xmlConfig);
InputSource is = new InputSource(inputStream);
saxParser.parse(is, handlerConfig);
If that doesn't work, the <?xml version=".." encoding="UTF-8" ?> (or whatever) in the XML is wrong, and you have to take it from there.
Second rule: Make sure you inspect the the result with a tool that supports the encoding used in the target, or result, document. Have you?
Third rule: Check the byte values in the source document. Bring up your favourite HEX editor/viewer and inspect the content. For example, the letter Ä should be the byte sequence 0xC3 0x84, if the encoding is UTF-8.
Forth rule: If it doesn't look correct, always suspect that the UTf-8 source is viewed, or interpreted, as an ISO-8859-1 source. Verify this by comparing the first and second byte from the UTF-8 source with the ISO 8859-1 code charts.
UPDATE:
The byte sequence for the UNICODE letter ä (latin small letter a with diaresis, U+00E4) is 0xC3 0xA4 in the UTF-8 encoding. If you use a viewing tool that only understands (or is configured to interpret the source as) ISO-8859-1 encoding, the first byte, 0xC3is the letter Ã, and the second byte is the letter ¤, or currency sign (Unicode U+00A4), which may look like a circle.
Hence, the "TextView" thingy in Android is interpreting your input as an ISO-8859-1 stream. I have no idea if it is possible to change that or not. But if you have your parsing result as a String or a byte array, you could convert that to a ISO-8859-1 stream (or byte array), and then feed it to "TextView".

Related

Unzip files that contain chinese characters

I have a zip file.It contains some files.Files contain chinese characters so I used
ZipInputStream zipStream = new ZipInputStream(
new BufferedInputStream(new FileInputStream(zipFilePath), BUFFER_SIZE),
Charset.forName("ISO-8859-1")
);
......
FileOutputStream fileOutput = new FileOutputStream(uncompressedFileName);
while (zipStream.available() > 0) {
fileOutput.write(zipStream.read());
}
Extraction runs succesfully.After that I want to use encodingDetect method to find encoding but now service is not running.It returns nomatch. If I send files directly to service,The service is running.It find charset properly like UTF-8.
I guess that Charset.forName("ISO-8859-1")extract files but format is corrupted.Do you have any idea?

The problem is the Charset of the file names in the zip. UTF-8 raises an error (the file names are evidently not in UTF-8), as UTF-8 requires as special format for the multi-byte sequences, and evidently there are wrong "multibyte" sequences.
ISO-8859-1 is a single byte enconding, accepting garbage.
What you should do is to try the small number of Chinese Charsets, so the file name strings are filled correctly. Java String contains Unicode, so can hold any Charset. The help from someone talking Chinese probably would make sense.
And then try writing files with those names. If not successful on your PC, you must use artificial file names, maybe transliteration from Chinese.
A translation table from original Chinese file name to actual file name may be created
as UTF-8 text file, maybe with a BOM, '\uFEFF` at the begin-of-file.

ISO-8859-1 charset most definitely does not support Chinese language. Use UTF-8 instead of ISO-8859-1

Decoding a base64 XML cuts off the last part

I have a base64 encoded string, which represents an XML Schema (xsd). I decode this using Apache's Base64 utilities, put the resulting byte array into an intputsource and let an XMLSchemaCollection read this inputSource:
String base64String = ......
byte[] decoded = Base64.decodeBase64(base64String);
InputSource inputSource = new InputSource(new ByteArrayInputStream(decoded));
xmlSchemaCollection.read(inputSource, new ValidationEventHandler());
This gives an error:
XML document structure must start and end within the same entity
Which usually means the XML structure isn't valid. I performed two tests to see what the base64 actually holds. First is printing it out to the console:
System.out.println(new String(decoded,"UTF-8"));
In eclipse, I see my xml is suddenly cut off, like part of it is missing. However, if I use any online website, such as https://www.base64decode.org/, and I copy/paste my base64, I see the complete full xml. If I validate this xml, the validation succeeds. So I'm a bit confused as to why eclipse seemingly cuts off my xml after decoding?

Errors like this are usually indicative of a badly formatted document:
XML document structures must start and end within the same entity...
A few things you can do to debug this:
1. Print out the XML document to a log and run it through some sort of XML validator.
2. Check to make sure that there are no invalid characters (ex UTF-16 characters in a UTF-8 document)

How to create a utf-8 encoded file in java such that it shows as UTF-8 encoded when opened in notepad++/notepad or any other text editor

I have tried to create UTF-8 file using java using different readers.But after creating when I open the file it is not read as being UTF-8 encoded(I opened it in notepad++ and it was UTF-8 without BOM).
File fileDir = new File("c:\\temp\\test.txt");
Writer out1 = new BufferedWriter(
new OutputStreamWriter(
new FileOutputStream(fileDir),
Charset.forName("UTF-8").newEncoder())
);
Writer out = new OutputStreamWriter(
new FileOutputStream(fileDir),
Charset.forName("UTF-8")
);
out.append("Website UTF-8").append("\r\n");
out.append("?? UTF-8").append("\r\n");
out.append("??????? UTF-8").append("\r\n");
out.flush();
out.close();

You are correctly writing a file in the UTF-8 encoding. (Note that you're not using out1 and it's unnecessary).
Notepad++ tells you that the file is "UTF-8 without BOM". Why do you think this is not UTF-8?
BOM stands for byte order mark. It's a special Unicode character to indicate if the bytes in a file are in little-endian or big-endian order. But for UTF-8 it has no meaning and its use is not recommended. From the Wikipedia article:
The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
The Unicode Standard permits the BOM in UTF-8, but does not require nor recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may also appear when UTF-8 data is converted from other encodings that use a BOM.
Is there a special reason why you need a BOM to be included? If not, then don't worry about it. Some Java XML parsers cannot deal with an UTF-8 BOM properly and will give an error when you try to parse an XML document encoded in UTF-8 when it starts with a BOM.

XML Document read in as Latin1 but half converted to UTF-8

I'm hitting my head off a brick wall with a bizarre problem that I know there will be an obvious answer to, but I can't see if for the life of me. It's all to do with encoding. Before the code, a simple description: I want to take in an XML document which is Latin1 (ISO-8859-1) encoded, and then send the thing completely unchanged over an HttpURLConnection. I have a small test class and the raw XML which shows my problem. The XML file contains a Latin1 character 0xa2 (a cent character), which is invalid UTF-8 - I'm deliberately using this as my test case. The XML declaration is ISO-8859-1. I can read it in no bother, but then when I want to convert the org.w3c.dom.Document to a byte[] array to send down the HttpURLConnection, the 0xa2 character gets converted to the UTF-8 encoded cent character (0xc2 0xa2), and the declaration stays as ISO-8859-1. In other words, it's converted to two characters - totally wrong.
The code which does this:
FileInputStream input = new FileInputStream( "input-file" );
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse( input );
Source source = new DOMSource( document );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Result result = new StreamResult( baos );
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform( source, result );
byte[] bytes = baos.toByteArray();
FileOutputStream fos = new FileOutputStream( "output-file" );
fos.write( bytes );
I'm just writing it to a file at the moment while I figure out what on earth is converting this character. The input-file has 0xa2, the output-file contains 0xc2 0xa2. One way to fix this is to put this line in the 2nd last block:
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
However, not all XML documents that I'll be dealing with will be Latin1; most, indeed, will be UTF-8 when they come in. I'm assuming I shouldn't have to be working out what the encoding is such that I feed that in to the transformer though? I mean, surely it should be working this out for itself, and I'm just doing something else wrong?
A thought had occurred to me that I could just query the document to find out the encoding and thus the extra line could just do the trick:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getInputEncoding());
However, I then determined that this wasn't the answer, as document.getInputEncoding() returns a different String if I run it in a terminal on the linux box in comparison to when I run it within Eclipse on my Mac.
Any hints would be appreciated. I fully accept I'm missing out on something obvious.

yes, by default, xml documents are written as utf-8, so you need to explicitly tell the Transformer to use a different encoding. your last edit is the "trick" to doing this such that it always matches the input xml encoding:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getXmlEncoding());
the only question is, do you really need to maintain the input encoding?

Why not just open it with a normal FileInputStream and stream the bytes to the output stream directly from that? Why do you need to load it into DOM format in memory if you are just sending it byte for byte over an HttpURLConnection?
Edit: According to javadoc for Document, you should probably be using document.getXmlEncoding() to get what matches the encoding in the XML prolog.

This may be helpful - it's too long for a comment, but not really an answer. From the spec:
The encoding attribute specifies the preferred encoding to use for
outputting the result tree. XSLT processors are required to respect
values of UTF-8 and UTF-16. For other values, if the XSLT processor
does not support the specified encoding it may signal an error; if it
does not signal an error it should use UTF-8 or UTF-16 instead.
You may want to test with "encoding=junk", as it were, to see what it does.
The valid values for Java are described here. See also IANA charsets.

Filtering Wikipedia's XML dump: error on some accents

I'm trying to index Wikpedia dumps. My SAX parser make Article objects for the XML with only the fields I care about, then send it to my ArticleSink, which produces Lucene Documents.
I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle.startsWith(prefix). In English, everything works fine, I get a Lucene index with all the pages except for the matching prefixes.
In French, the prefixes with no accent also work (i.e. filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:) but I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like a correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the incriminated pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I rectreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file, not in the original…
Any idea ?
Alternatives I tried:
Getting the file (commented lines were tried wihtout success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UFT-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"CatÃ©gorie:", "ModÃ¨le:", "WikipÃ©dia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad, that one I tried work, I tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(fis, handler);

Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings since they are byte-based, not character-based. So, here we create a Reader from an FileInputStream - the latter (stream) doesn't know about encodings, but the former (reader) does, because we give the encoding in the constructor.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.