Here is the starting point -- I have an HTML file on disk. When I open it with a regular web browser it displays as it should, i.e. no matter what encoding is used, I see the correct national characters.
Then comes my task: load the same file, parse it, and print some pieces to the screen (console) -- let's say, the text of all <hX> elements. Of course I would like to see only correct characters, not some mumbo-jumbo. The last step is changing some of the text and saving the file.
So the parser has to handle the encoding in both directions, reading and writing. So far I am not aware of a parser that is even capable of loading the data correctly.
Question
What parser would you recommend?
Details
An HTML page generally declares its encoding in the header (in a meta tag), so the parser should use it. A scenario where I have to inspect the file in advance, determine the encoding, and then manually set the encoding in code is a no-go. For example, this is taken from the JSoup tutorials:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
I cannot do such a thing; the parser has to handle encoding detection by itself.
In C# I faced a similar problem when loading HTML. I used HTMLAgilityPack: first I ran its encoding detection, then decoded the data stream with the detected encoding, and after that I parsed the data. So I did both steps explicitly, but since the library provides both methods that is fine with me.
Such an explicit separation might even be better, because when the header is missing it would be possible to fall back on a probabilistic encoding-detection method.
The Jsoup API reference says for that parse method that if you provide null as the second argument (the encoding one), it'll use the http-equiv meta-tag to determine the encoding. So it looks like it already does the "parse a bit, determine encoding, re-parse with proper encoding" routine. Normally such parsers should be capable of resolving the encoding themselves using any means available to them. I know that SAX parsers in Java are supposed to use byte-order marks and the XML declaration to try and establish an encoding.
Apparently Jsoup will default to UTF-8 if no proper meta-tag is found. As they say in the documentation, this is "usually safe" since UTF-8 is compatible with a host of common encodings for the lower code points. But I take it that "usually safe" might not really be good enough in this case.
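As a concrete illustration, here is a minimal sketch of that null-charset overload covering the whole load / print headings / modify / save cycle from the question. It assumes a recent Jsoup on the classpath; the file path and base URI are placeholders.

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HeadingDump {
    public static void main(String[] args) throws IOException {
        File input = new File("/tmp/input.html"); // placeholder path

        // null charset: Jsoup reads the charset meta tag and falls back to UTF-8
        Document doc = Jsoup.parse(input, null, "http://example.com/");

        // print all heading texts, decoded with the detected charset
        for (Element h : doc.select("h1, h2, h3, h4, h5, h6")) {
            System.out.println(h.tagName() + ": " + h.text());
        }

        // write the (possibly modified) document back using the same charset
        Charset cs = doc.outputSettings().charset();
        Files.write(input.toPath(), doc.outerHtml().getBytes(cs));
    }
}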
If you don't sufficiently trust Jsoup to detect the encoding, I see two alternatives:
If you can somehow be certain that the HTML is in fact always XHTML, then an XML parser might prove a better fit. But that would only work if the input is definitely XML compliant.
Do a heuristic encoding detection yourself by trying to use byte-order marks, parsing a portion using common encodings and finding a meta-tag, detecting the encoding by byte patterns you'd expect in header tags and finally, all else failing, use a default.
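For the byte-order-mark part of that heuristic, a minimal sketch could look like this (the class and method names are mine); the caller would wrap the raw stream as new PushbackInputStream(raw, 3) so the peeked bytes can be pushed back:

import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class BomSniffer {

    // Returns the charset implied by a BOM, or null if none is found.
    // The peeked bytes are pushed back so the parser still sees them.
    public static Charset sniff(PushbackInputStream in) throws IOException {
        byte[] bom = new byte[3];
        int n = in.read(bom, 0, 3);
        Charset found = null;
        if (n >= 3 && (bom[0] & 0xFF) == 0xEF && (bom[1] & 0xFF) == 0xBB && (bom[2] & 0xFF) == 0xBF) {
            found = StandardCharsets.UTF_8;
        } else if (n >= 2 && (bom[0] & 0xFF) == 0xFE && (bom[1] & 0xFF) == 0xFF) {
            found = StandardCharsets.UTF_16BE;
        } else if (n >= 2 && (bom[0] & 0xFF) == 0xFF && (bom[1] & 0xFF) == 0xFE) {
            found = StandardCharsets.UTF_16LE;
        }
        if (n > 0) {
            in.unread(bom, 0, n);
        }
        return found;
    }
}

If sniff() returns null you would go on to the meta-tag and byte-pattern checks, and finally the default.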
Related
I am using a Java SAX parser to parse XML data sent from a third-party source that is around 3 GB. I am getting an error because the XML document is not well formed: The processing instruction target matching "[xX][mM][lL]" is not allowed.
As far as I understand, this is normally due to a character being somewhere it should not be.
Main problem: Cannot manually edit these files due to their very large size.
I was wondering if there is a workaround for files this large, which cannot be opened and edited manually, and whether there is a way to code it so that any problematic characters are removed automatically.
I would think the most likely explanation is that the file contains a concatenation of several XML documents, or perhaps an embedded XML document: either way, an XML declaration that isn't at the start of the file.
A lot now depends on your relationship with the supplier of the bad data. If they sent you faulty equipment or buggy software, you would presumably complain and ask them to fix it. But if you don't have a service relationship with the third party, you either have to change supplier or do the best you can with the faulty input, which means repairing the fault yourself. In general, you can't repair faulty XML unless you know what kind of fault you are looking for, and that can be very difficult to determine if the files are large (or if the failures are very rare).
The data isn't XML, so don't try to use XML tools to process it. Use text processing tools such as sed or awk. The first step is to search the file for occurrences of <?xml and see if that gives any hints.
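Since the file is far too big for an editor, a streaming scan is enough for that first step. Here is a rough sketch that prints the byte offset of every <?xml occurrence; the path is taken from the command line.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class XmlDeclFinder {
    public static void main(String[] args) throws IOException {
        byte[] needle = "<?xml".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            long offset = 0;
            int matched = 0;
            int b;
            while ((b = in.read()) != -1) {
                if (b == needle[matched]) {
                    matched++;
                    if (matched == needle.length) {
                        System.out.println("<?xml at byte offset " + (offset - needle.length + 1));
                        matched = 0;
                    }
                } else {
                    matched = (b == needle[0]) ? 1 : 0;
                }
                offset++;
            }
        }
    }
}

A hit at offset 0 is expected; any later hit marks the spot where the input stops being a single well-formed document.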
This error occurs if the XML declaration appears anywhere but at the very beginning of the document. The reason might be:
Whitespace before the XML declaration
Any hidden character before the XML declaration
The XML declaration appears anywhere else in the document
You should start by checking case #2; see here: http://www.w3.org/International/questions/qa-byte-order-mark#remove
If that doesn't help, you should remove leading whitespace from the document. You could do that by wrapping the original InputStream with another InputStream and using that to strip the whitespace (a sketch of such a wrapper follows below).
The same can be done if you are facing case #3, but the implementation would be a bit more complex.
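Here is a minimal sketch of such a wrapping stream for cases #1 and #2. It simply drops everything before the first '<', so it assumes an ASCII-compatible encoding such as UTF-8; case #3 would need real filtering, as noted above.

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public final class LeadingJunkSkipper {

    // Drops a UTF-8 BOM, whitespace, or other stray bytes that appear before
    // the first '<', then returns a stream the XML parser can read normally.
    public static InputStream skipToFirstTag(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 1);
        int b;
        while ((b = in.read()) != -1 && b != '<') {
            // discard junk before the document starts
        }
        if (b == '<') {
            in.unread(b); // give the '<' back to the parser
        }
        return in;
    }
}

You would then hand skipToFirstTag(originalStream) to your SAX or DOM parser instead of the raw stream.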
I was able to use this question as a starting point in parsing an "mht" file, but the "3D" in the anchor tags (e.g. <a href=3D"…">anchor text</a>) breaks all the internal links and embedded images. I can have the parser replace "=3D" with just "=" (e.g. <a href="…">anchor text</a>) and it appears to work fine, but I want to understand the purpose of that "meta markup".
Why does exporting from ".docx" to ".mht" add "3D" to the right-hand sides of most (if not all) of the HTML attributes? Is there a better way to handle them, or a better regex to use when replacing them?
The =3D is a result of quoted-printable encoding. It shouldn't be too hard to find a Java library for decoding quoted-printable data.
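Apache Commons Codec's QuotedPrintableCodec is one such library (JavaMail's MimeUtility is another). A small sketch, with a made-up sample string:

import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.net.QuotedPrintableCodec;

public class QpDemo {
    public static void main(String[] args) throws DecoderException {
        String qp = "<a href=3D\"http://example.com/\">anchor text</a>"; // made-up sample
        QuotedPrintableCodec codec = new QuotedPrintableCodec();
        // =3D is the quoted-printable escape for '=', so decoding restores the original markup
        System.out.println(codec.decode(qp));
    }
}

A proper quoted-printable decode is safer than blindly replacing "=3D", since the encoding escapes other bytes as =XX too.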
I have a problem writing an XML file with UTF-8 in Java.
Problem: I have a file whose filename contains an interpunct (middot, ·). When I try to write the filename inside an XML tag using Java code, I get some junk number in the output instead of the ·.
OutputStreamWriter osw = new OutputStreamWriter(file_output_stream, "UTF8");
Above is the Java code I used to write the XML file. Can anybody help me understand and sort out the problem? Thanks in advance.
Java strings are UTF-16 internally, but your source file is read in whatever encoding the compiler assumes (the platform default, unless you say otherwise).
If your character does not survive that encoding, use a Unicode escape:
String a = "\u00b7";
Or tell the compiler that the source is UTF-8 (javac -encoding UTF-8) and simply write the character in the code as-is.
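For illustration, a small sketch combining the escape with an explicitly UTF-8 writer, which is the other half of the original problem; the file name and output path are made up:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class MiddotWrite {
    public static void main(String[] args) throws IOException {
        // \u00b7 is the interpunct; as an escape it survives any source-file encoding
        String fileName = "report\u00b7notes.txt"; // made-up file name

        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("/tmp/out.txt"), StandardCharsets.UTF_8)) {
            out.write(fileName); // written as proper UTF-8 bytes
        }
    }
}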
That character is code point 183 (decimal), which is outside ASCII, so you need to escape it as &#183;. Here is a demonstration: if I type "&#183;" into this answer, I get "·".
The browser is printing your character because this web page is XML.
There are utility methods that can do this for you, such as apache commons-lang library's StringEscapeUtils.escapeXml() method, which will correctly and safely escape the entire input.
In general it is a good idea to use UTF-8 everywhere.
The editor has to know that the source is in UTF-8. You could use the free programmer's editor jEdit, which can deal with many encodings.
The javac compiler has to know that the Java source is in UTF-8. For that you can use the solution of #OndraŽižka above.
This makes for two settings in your IDE.
Don't try to create XML by hand. Use a library for the purpose. You are just scratching the surface of the heap of special cases that will break a hand-made solution.
One way, using core Java classes, is to create a DOM and then serialize it using a no-op XSL transform that writes to a StreamResult. (If your document is large, you can do something similar by driving a SAX event handler.)
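A sketch of that core-Java route: a Transformer created with no stylesheet is an identity transform, and it takes care of the XML declaration, attribute escaping, and output encoding for you. The element names and output path here are placeholders.

import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class WriteXml {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();

        Element root = doc.createElement("file");
        root.setAttribute("name", "report\u00b7notes.txt"); // the interpunct from the question
        doc.appendChild(root);

        // No stylesheet = identity transform: DOM in, serialized XML out
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        t.transform(new DOMSource(doc), new StreamResult(new File("/tmp/out.xml")));
    }
}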
There are many third party libraries that will help you do the same thing very easily.
First I would like to say thank you for the help in advance.
I am currently writing a web crawler that parses HTML content, strips HTML tags, and then spell checks the text which is retrieved from the parsing.
Stripping HTML tags and spell checking have not caused any problems; I am using JSoup and the Google Spell Check API.
I am able to pull down content from a URL, pass it into a byte[] and then ultimately a String so that it can be stripped and spell checked. However, I am running into a problem with character encoding.
For example when parsing http://www.testwareinc.com/...
Original Text: We’ve expanded our Mobile Web and Mobile App testing services.
... the page is using ISO-8859-1 according to meta tag...
ISO-8859-1 Parse: Weve expanded our Mobile Web and Mobile App testing services.
... then trying using UTF-8...
UTF-8 Parse: We�ve expanded our Mobile Web and Mobile App testing services.
Question
Is it possible that the HTML of a web page includes a mix of encodings? And if so, how can that be detected?
It looks like the apostrophe is coded as a 0x92 byte, which according to Wikipedia is an unassigned/private code point.
From there on, it looks like the browser falls back by treating it as a raw one-byte Unicode code point, U+0092 (Private Use Two), which happens to be rendered as an apostrophe. No, wait: if it's one byte, it's more probably cp1252. Browsers have a fallback strategy for the advertised charset, such as treating ISO-8859-1 as CP1252.
So there is no mix of encodings here but, as others said, a broken document -- combined with a browser fallback heuristic that will sometimes help and sometimes not.
If you're curious enough, you may want to dive into FF or Chrome's source code to see exactly what they do in such a case.
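You can also reproduce the two readings of that byte straight from Java with a tiny check, nothing browser-specific:

import java.nio.charset.Charset;

public class Cp1252Check {
    public static void main(String[] args) {
        byte[] raw = { (byte) 0x92 };
        // 0x92 is an unused C1 control in ISO-8859-1, but the right single quote in windows-1252
        System.out.println(new String(raw, Charset.forName("windows-1252"))); // prints ’
        System.out.println(new String(raw, Charset.forName("ISO-8859-1")));   // an invisible control character
    }
}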
Having more than 1 encoding in a document isn't a mixed document, it is a broken document.
Unfortunately there are a lot of web pages that use an encoding that doesn't match the document definition, or contains some data that is valid in the given encoding and some content that is invalid.
There is no good way to handle this. It is possible to try and guess the encoding of a document, but it is difficult and not 100% reliable. In cases like yours, the simplest solution is just to ignore parts of the document that can't be decoded.
Apache Tika has an encoding detector. There are also commercial alternatives if you need, say, something in C++ and are in a position to spend money.
I can pretty much guarantee that each web page is in one encoding, but it's easy to be mistaken about which one.
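For the detection part, here is a sketch using ICU4J's CharsetDetector (Apache Tika's detector is derived from the same code and exposes an almost identical API); the file path comes from the command line:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectCharset {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);

        // detect() returns the best guess; detectAll() returns every candidate with a confidence score
        CharsetMatch match = detector.detect();
        System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");
    }
}

Remember that the result is a statistical guess, so treat a low confidence value as "unknown" rather than as an answer.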
Seems like an issue with special characters. Check whether StringEscapeUtils.escapeHtml helps, or any other method in that class.
Edit: added this code since the OP was not able to get theirs working.
import org.apache.commons.lang.StringEscapeUtils;

public static void main(String[] args) {
    String asd = "’";
    System.out.println(StringEscapeUtils.escapeXml(asd));  // XML-escaped form (a numeric reference with commons-lang 2.x)
    System.out.println(StringEscapeUtils.escapeHtml(asd)); // HTML-escaped form, e.g. &rsquo;
}
One of our providers sometimes sends XML feeds that are tagged as UTF-8 encoded documents but contain byte sequences that are not valid UTF-8. This causes the parser to throw an exception and stop building the DOM object when it hits them:
DocumentBuilder.parse(ByteArrayInputStream bais)
throws the following exception:
org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.
Is there a way to "capture" these problems early and avoid the exception (i.e. finding and removing those characters from the stream)? What I'm looking for is a "best effort" type of fallback for wrongly encoded documents. The correct solution would obviously be to attack the problem at the source and make sure that only correct documents are delivered, but what is a good approach when that is not possible?
If the problem truly is a wrong encoding (as opposed to mixed encodings), you don't need to re-encode the document to parse it. Just parse it from a Reader instead of an InputStream, and the DOM parser will ignore the encoding declared in the header:
DocumentBuilder.parse(new InputSource(new InputStreamReader(inputStream, "<real encoding>")));
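If you also want the "best effort" fallback asked about above, one option (a sketch, not a substitute for fixing the feed at the source) is to decode leniently yourself: a CharsetDecoder configured to replace malformed byte sequences with U+FFFD, wrapped in the Reader you hand to the parser.

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LenientParse {
    public static Document parseLeniently(InputStream in) throws Exception {
        // Replace malformed byte sequences with U+FFFD instead of failing
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        Reader reader = new InputStreamReader(in, decoder);
        // Handing the parser a Reader means it will not try to decode the bytes itself
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new InputSource(reader));
    }
}

The replacement character is a legal XML character, so the rest of the document survives the parse; as the other answers note, recoding the feed properly (or getting the sender to fix it) is still the better solution.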
You should manually take a look at the invalid documents and see what problem they have in common. It's quite probable that they are in fact in another encoding (most likely windows-1252), and the best solution then would be to take every document from the broken system and recode it to UTF-8 before parsing.
Another possible cause is mixed encodings (the content of some elements is in one encoding and the content of other elements is in another encoding). That would be harder to fix.
You would also need a way to know when the broken system gets fixed so you can stop using your workaround.
You should tell them to send you correct UTF-8. Failing that, any solution should re-encode the bad characters as valid UTF-8 and then pass the result to the parser. The reason is that if the bad bytes are preserved, different programs might interpret the output in different ways, which can lead to security holes.