Good Evening everyone!
I'm trying to parse a HTML-page in Java with JDOM2, to access some information from it.
My code looks like this: (Just added the packages for this codeblock, don't have them in my real source)
//Here goes the reading of the site into my String "string" (using NekoHTML)
org.xml.sax.InputSource is = new InputSource();
is.setCharacterStream(new StringReader(string));
org.cyberneko.html.parsers.DOMParser parser = new DOMParser();
parser.parse(is);
org.jdom2.input.DOMBuilder builder = new DOMBuilder();
org.jdom2.Document doc = builder.build(parser.getDocument());
This works fine for everything except some special case: When the site contains quotation-Marks within an element. Here is an example of what I mean:
Der "realismo mágico" und die Phantastische...
So, after that wonderful Tag I get the following error-trace:
SEVERE: org.jdom2.IllegalNameException: The name "literatur"" is not legal for JDOM/XML attributes: XML name 'literatur"' cannot contain the character """.
So, now my question is: What are my options to take care of this error? Is there maybe a feature in NekoHTML I can use for this (using the "setFeature()"), or something within JDOM I could use?
If no: Are there other libaries that are suitable for scraping websites that can take such a thing as the quotation mark within the tag?
Thanks for your time!
Okay, so I solved the problem like following:
Since there wasn't any dependency on NekoHTML I switched to jTidy as parser which does the job in this case.
Question answered.
Related
I am using getClassLoader().getResources to find the path for Jsoup to parse.
String path = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
Document document = Jsoup.parse(new File(path), "utf-8");
Elements names = document.getElementsByTag("name");
System.out.println(names.size());
My student.xml has been placed under the src folder in my module "day11_xml" and this code snippet comes from the class JsoupDemo1 in the package cn.itcast.xml.jsoup under the same module of "day11_xml". The error messages reads as follows:
java.io.FileNotFoundException:/Users/dingshun/Downloads/New%20Java%20Projects/demo/out/production/day11_xml/student.xml (No such file or directory)
I don't get it, as I can find the exact file in the given path. I'm confused, but could you guys help me out? Also, I'm new to both Java programming and this forum and if this question sounds silly or my question format is not right, please let me know.
What you're doing looks good. Maybe use the stream version JSoup.parse.
URL url = JsoupDemo1.class.getClassLoader().getResource("student.xml");
InputStream stream = JsoupDemo1.class.getClassLoader().getResourceAsStream("student.xml");
document = Jsoup.parse(stream, "utf-8", url.toURI()toString());
The documentation linked seems to imply it will work with html not xml, so maybe you need to use the other argument which provides a parser?
Actually, it turned out that Jsoup could not find my file because the path name "New%20Java%20Projects" has spaces between them. When I reload the file in a folder which has no spaces in its name, it works out just fine. So it can parse xml using parse(File in, String charsetName) method. It seems it cannot parse path name which has spaces in it.
Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to cleanup the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovering and repair capabilities credit: RomanPerekhrest
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more
suggestions for dealing with not-well-formed markup in Python,
including especially lxml's recover=True option.
See also this answer for how to use codecs.EncodedFile() to cleanup illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
.NET:
XmlReaderSettings.CheckCharacters can
be disabled to get past illegal XML character problems.
#jdweng notes that XmlReaderSettings.ConformanceLevel can be set to
ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
#jdweng also reports that XmlReader.ReadToFollowing() can sometimes
be used to work-around XML syntactical issues, but note
rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by #chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text manually using a text editor or
programmatically using character/string functions. Doing this
programmatically can range from tricky to impossible as
what appears to be
predictable often is not -- rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &: credit: blhsin, demo
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA
sections into account.
A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-wellformed and/or DTD-invalid XML can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as SGML empty element and then use eg. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.
IMO these cases should be solved by using JSoup.
Below is a not-really answer for this specific case, but found this on the web (thanks to inuyasha82 on Coderwall). This code bit did inspire me for another similar problem while dealing with malformed XMLs, so I share it here.
Please do not edit what is below, as it is as it on the original website.
The XML format, requires to be valid a unique root element declared in the document.
So for example a valid xml is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered a malformed XML, so many xml parsers just throw an Exception complaining about no root element. Etc.
In this example there is a solution on how to solve that problem and succesfully parse the malformed xml above.
Basically what we will do is to add programmatically a root element.
So first of all you have to open the resource that contains your "malformed" xml (i. e. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.
Now we create a list of InputStream objects with three lements:
A ByteIputStream element that contains the string: <root>
Our FileInputStream
A ByteInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(str));
Now we can use any XML Parser library, on the cntr, and it will be parsed without any problem. (Checked with Stax library);
I really need help to extract Mircodata which is embedded in HTML5. My purpose is to get structured data from a webpage just like this tool of google: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but there is no possible solution.
Currently, I use the any23 library but I can’t find any documentation, just only javadocs which dont provide enough information for me.
I use any23's Microdata Extractor but getting stuck at the third parameter: "org.w3c.dom.Document in". I can't parse a HTML content to be a w3cDom. I have used JTidy as well as JSoup but the DOM objects in these library are not fixed with the Extractor constructor. In addition, I also doubt about the 2nd parameter of the Microdata Extractor.
I hope that anyone can help me to do with any23 or suggest another library can solve this extraction issues.
Edit: I found solution myself by using the same way as any23 command line tool did. Here is the snippet of code:
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputInputStream = doc.openInputStream();
TagSoupParser tagSoupParser = new TagSoupParser(documentInputInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(tagSoupParser.getDOM(),new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");
These line of code only extract microdata from HTML and write them in JSON format. I tried to use MicrodataExtractor which can change the output format to others(Rdf, turtle, ...) but the input document seems to only accept XML format. It throws "Document didn't start" when I put in a HTML document.
If anyone found the way to use MicrodataExtractor, please leave the answer here.
Thank you.
xpath is generally the way to consume html or xml.
have a look at: How to read XML using XPath in Java
I'm trying to read an xml file on from an android app using XOM as the XML library. I'm trying this:
Builder parser = new Builder();
Document doc = parser.build(context.openFileInput(XML_FILE_LOCATION));
But I'm getting nu.xom.ParsingException: Premature end of file. even when the file is empty.
I need to parse a very simple XML file, and I'm ready to use another library instead of XOM so let me know if there's a better one. or just a solution to the problem using XOM.
In case it helps, I'm using xerces to get the parser.
------Edit-----
PS: The purpose of this wasn't to parse an empty file, the file just happened to be empty on the first run which showed this error.
If you follow this post to the end, it seems that this has to do with xerces and the fact that its an empty file, and they didn't reach a solution on xerces side.
So I handled the issue as follows:
Document doc = null;
try {
Builder parser = new Builder();
doc = parser.build(context.openFileInput(XML_FILE_LOCATION));
}catch (ParsingException ex) { //other catch blocks are required for other exceptions.
//fails to open the file with a parsing error.
//I create a new root element and a new document.
//I fill them with xml data (else where in the code) and save them.
Element root = new Element("root");
doc = new Document(root);
}
And then I can do whatever I want with doc. and you can add extra checks to make sure that the cause is really an empty file (like check the file size as indicated by one of sam's comments on the question).
An empty file is not a well-formed XML document. Throwing a ParsingException is the right thing to do here.
I need your expertise once again. I have a java class that searches a directory for xml files (displays the files it finds in the eclipse console window), applies the specified xslt to these and sends the output to a directory.
What I want to do now is create an xml containing the file names and file format types. The format should be something like;
<file>
<fileName> </fileName>
<fileType> </fileType>
</file>
<file>
<fileName> </fileName>
<fileType> </fileType>
</file>
Where for every file it finds in the directory it creates a new <file>.
Any help is truely appreciated.
Use an XML library. There are plenty around, and the third party ones are almost all easier to use than the built-in DOM API in Java. Last time I used it, JDom was pretty good. (I haven't had to do much XML recently.)
Something like:
Element rootElement = new Element("root"); // You didn't show what this should be
Document document = new Document(rootElement);
for (Whatever file : files)
{
Element fileElement = new Element("file");
fileElement.addContent(new Element("fileName").addContent(file.getName());
fileElement.addContent(new Element("fileType").addContent(file.getType());
}
String xml = XMLOutputter.outputString(document);
Have a look at DOM and ECS. The following example was adapted to you requirements from here:
XMLDocument document = new XMLDocument();
for (File f : files) {
document.addElement( new XML("file")
.addXMLAttribute("fileName", file.getName())
.addXMLAttribute("fileType", file.getType())
)
);
}
You can use the StringBuilder approach suggested by Vinze, but one caveat is that you will need to make sure your filenames contain no special XML characters, and escape them if they do (for example replace < with <, and deal with quotes appropriately).
In this case it probably doesn't arise and you will get away without it, however if you ever port this code to reuse in another case, you may be bitten by this. So you might want to look at an XMLWriter class which will do all the escaping work for you.
Well just use a StringBuilder :
StringBuilder builder = new StringBuilder();
for(File f : files) {
builder.append("<file>\n\t<fileName>").append(f.getName).append("</fileName>\n)";
[...]
}
System.out.println(builder.toString());