What Java API data structure is good for HTML trees? - java

For fun, I'm writing a basic parser that finds data within an HTML document. I want to find the best structure to represent the branches of the parsed file.
My criterion for "best structure" is this: I want to easily search for a tag's relative location and access its contents, like "the image in the second image tag after the third h3 tag in the body" or "the title tag in the header".
I expect to search the first level of tags for the tag I'm looking for, then move into the branch associated with that tag. That's the structure this question is looking for, but if there is a better way to find relative locations in an HTML document, please explain.
So that's the question. More generally, what kind of Java structures are available through the API that can represent tree data structures?

Don't reinvent the wheel; just use an HTML parser like Jsoup. You can then retrieve your tags with a CSS selector by calling Element#select(cssQuery):
Document doc = Jsoup.parse(file, encoding);
Elements elements = doc.select(cssQuery);
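On the more general question of which tree structures the JDK itself offers: org.w3c.dom.Document is a tree, and javax.xml.xpath can express relative lookups like the ones described in the question. Here is a minimal sketch, assuming the markup is well-formed XML (real-world HTML usually is not, which is where a lenient parser such as Jsoup comes in). The class name RelativeLookup and the sample page are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class RelativeLookup {
    // Hypothetical sample page; the JDK parser requires well-formed XML.
    static final String PAGE =
        "<html><head><title>Demo page</title></head><body>"
        + "<h3>One</h3><img src='a.png'/>"
        + "<h3>Two</h3><h3>Three</h3>"
        + "<p><img src='b.png'/></p><img src='c.png'/>"
        + "</body></html>";

    /** Evaluates an XPath expression against the markup and returns the first match as a string. */
    static String find(String markup, String expression) {
        try {
            // Build the org.w3c.dom tree, then query it.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        // "the title tag in the header"
        System.out.println(find(PAGE, "/html/head/title"));
        // "the image in the second image tag after the third h3 tag in the body"
        System.out.println(find(PAGE, "(/html/body//h3)[3]/following::img[2]/@src"));
    }
}
```

Run as is, this prints Demo page and then c.png. The following:: axis counts images anywhere after the third h3 in document order, so nesting inside a p does not matter.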

Related

Extensible HTML parsing in Java driven by decoupled rules

We are using the awesome jsoup library to parse HTML documents in Java.
The sources of these documents differ (they come from different clients), so the HTML elements and text differ from source to source. To handle this, we have written a separate HTML parser for each source, which deals with that document's elements, element text, element attributes, and so on. Some of the parsed text also needs to be replaced.
This works, but it is not extensible. We have to write a new HTML parser for each new document source, or change the code of an existing one whenever elements are added to or removed from a supported HTML document.
For example, suppose that today the parser for documents from the company ExampleCompany has to extract the following two values from their HTML:
Document doc = Jsoup.parse(htmlAsString);
String dataExampleCount = doc.select("div[id=top-share-bar]").attr("data-example_count");
String cbDateText = doc.select("div[class=cbdate]").text();
Tomorrow, ExampleCompany adds a new element to their HTML (it may be in JavaScript, in CSS, or in the body), such as "a[class=loc mr10]", and expects us to use that element's text as well. So we have to add another line of code:
String locMr10Text = doc.select("a[class=loc mr10]").text();
Is there a way to decouple the rules (or XPath expressions) for finding elements and their text into an external file, be it XML, JSON, or XSL, where I can just define which elements to look for and which attributes or text to extract?
So, from the above example, if I externalize the rules in JSON:
{
    "Attrs": {
        "div[id=top-share-bar]": "data-example_count"
    },
    "Text": [
        "div[class=cbdate]",
        "a[class=loc mr10]"
    ]
}
With this, we could just keep updating the rules JSON without adding a single line of Java code; the program would parse the JSON and then parse the HTML accordingly.
This would have two benefits:
There will be only one HTML parser, which just takes the rules and the HTML document and produces the output.
No need to recompile the code if the HTML document's elements change. Just change the rules file to accommodate the change.
I am thinking of writing our own format to externalize the XPath expressions and so on, but I wanted to know whether something standard is already used for requirements like ours.
I have read a related question, "File format for storing html parser rules", but I am not sure its answer gives any direction on the best way to decouple what to parse from how to parse it.
Any suggestions will be helpful.
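To make the idea concrete, here is one possible sketch of a rule-driven extractor. It is not a standard format: it assumes the rules are stored as "output name = XPath expression" pairs in a java.util.Properties file, and that the input is well-formed enough for the JDK's XML parser (with jsoup the same loop would run CSS selectors instead). The class name, the rule names, and the sample values 42, 2015-01-01, and Pune are all made up for illustration.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class RuleDrivenExtractor {
    // Hypothetical rules; in practice this text would live in an external .properties file.
    static final String RULES =
        "dataExampleCount = //div[@id='top-share-bar']/@data-example_count\n"
        + "cbDateText = //div[@class='cbdate']\n"
        + "locMr10Text = //a[@class='loc mr10']\n";

    // Hypothetical, well-formed input document.
    static final String PAGE = "<html><body>"
        + "<div id='top-share-bar' data-example_count='42'/>"
        + "<div class='cbdate'>2015-01-01</div>"
        + "<a class='loc mr10'>Pune</a>"
        + "</body></html>";

    /** Applies every rule to the document and returns rule name -> extracted string. */
    static Map<String, String> extract(String markup, String rulesText) {
        try {
            Properties rules = new Properties();
            rules.load(new StringReader(rulesText));
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> result = new TreeMap<>();
            for (String name : rules.stringPropertyNames()) {
                // The Java code never names an element; everything comes from the rules.
                result.put(name, xpath.evaluate(rules.getProperty(name), doc));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not apply the rules", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(PAGE, RULES));
    }
}
```

Supporting a new element then means adding one line to the rules text, with no recompilation.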

retrieve information from a url

I want to write a program that retrieves some information from a URL.
For example, given the URL below, from
librarything
how can I retrieve all the words below the "TAGS" tab, such as
Black Library, fantasy, Thanquol & Boneripper, Thanquol and Bone Ripper, Warhammer?
I am thinking of using Java and designing a data-mining wrapper, but I am not sure how to start. Can anyone give me some advice?
EDIT:
You gave me excellent help, but I want to ask something else.
For every tag, pressing the "number" button shows how many times that tag has been used. How can I retrieve that number as well?
You could use an HTML parser like Jsoup. It allows you to select HTML elements of interest using simple CSS selectors:
E.g.
Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");
for (Element tag : tags) {
    System.out.println(tag.text());
}
which prints
Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer
Please note that you should read the website's robots.txt and terms of service, if any, or your server might be IP-banned sooner or later.
I've done this before in PHP with a page scrape, parsing the HTML as a string using regular expressions.
Example here
I imagine there's something similar in Java and other languages. The concept would be the same:
Load page data.
Parse the data (e.g. with a regex, or via the DOM using CSS or XPath selectors).
Do what you want with the data :)
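In Java, the regex variant of these steps might look roughly like the sketch below. It assumes each tag is rendered as a span with class "tag" wrapping a link, which is a guess at the markup rather than the real page structure. The loading step is replaced by a hard-coded snippet, and the class name RegexTagScraper is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTagScraper {
    // Assumed markup shape: <span class="tag"><a href="...">text</a></span>
    private static final Pattern TAG_LINK =
        Pattern.compile("<span class=\"tag\">\\s*<a[^>]*>([^<]*)</a>");

    // Step 1 (load page data) is replaced by this hypothetical in-memory snippet.
    static final String SAMPLE = "<div class=\"tags\">"
        + "<span class=\"tag\"><a href=\"/tag/fantasy\">fantasy</a></span>"
        + "<span class=\"tag\"><a href=\"/tag/Warhammer\">Warhammer</a></span>"
        + "</div>";

    /** Step 2: parse the data by collecting the link text of every match. */
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG_LINK.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group(1).trim());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Step 3: do what you want with the data; here it is just printed.
        for (String tag : extractTags(SAMPLE)) {
            System.out.println(tag);
        }
    }
}
```

A pattern like this breaks as soon as the markup changes (an extra attribute, different quoting), and it does not decode entities such as &amp;amp;, which is the usual argument for a real parser over regular expressions.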
It's worth remembering that some people might not appreciate you data mining their site and profiting from it or redistributing it on a large scale.

Searching XML through Java Codes

I have around 30 properly formatted XML files containing a huge amount of data. I want to search these XML files to retrieve specific data. Can you suggest any site or blog that I can use as a guideline to solve my problem?
I need to search inside each tag for a keyword provided by the user. Sometimes the user will instead give a specific tag name, and the search should return the content inside that tag.
example : a.xml, b.xml, c.xml
inside a.xml
<abc>
some content
</abc>
The user may search for the tag abc or for some keyword inside the content. In both cases the search should return the content; if there is more than one match, it should return a link for each match so that the user can click through them one by one.
I'd recommend using XPath, a query language for selecting nodes in XML documents (it plays roughly the role for XML that SQL plays for a database):
http://www.ibm.com/developerworks/library/x-javaxpathapi.html
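As a rough illustration of both kinds of search from the question (by tag name and by keyword), here is a sketch using the JDK's javax.xml.xpath API. The class name XmlSearch, the doc root, and the xyz element are made up; only the abc element comes from the question. The user's input is bound as an XPath variable rather than concatenated into the expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmlSearch {
    // Hypothetical stand-in for a.xml.
    static final String SAMPLE = "<doc><abc>some content</abc><xyz>other text</xyz></doc>";

    /** Content of every element whose tag name equals tagName. */
    static List<String> byTag(String xml, String tagName) {
        return select(xml, "//*[name() = $term]", tagName);
    }

    /** Content of every childless element whose text contains keyword. */
    static List<String> byKeyword(String xml, String keyword) {
        return select(xml, "//*[not(*) and contains(., $term)]", keyword);
    }

    private static List<String> select(String xml, String expression, String term) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Resolve the XPath variable $term to the user's input.
            xpath.setXPathVariableResolver(variableName -> term);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> contents = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                contents.add(nodes.item(i).getTextContent().trim());
            }
            return contents;
        } catch (Exception e) {
            throw new IllegalStateException("Could not search the XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(byTag(SAMPLE, "abc"));
        System.out.println(byKeyword(SAMPLE, "other"));
    }
}
```

For 30 files, the same two methods would be called once per file; note that this loads each document fully into memory.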
Use a SAX parser: you don't need to go back and forth within the documents, and with a huge amount of data a DOM parser, which loads each whole document into memory, is a poor fit.
See this link for a tutorial.
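A keyword search with SAX could be sketched like this. The handler buffers the text seen since the last tag boundary and, at each closing tag, records the element name if that text contains the keyword. The class name SaxKeywordSearch and the sample document (apart from the abc element) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxKeywordSearch {
    // Hypothetical stand-in for a.xml.
    static final String SAMPLE = "<doc><abc>some content</abc><xyz>other text</xyz></doc>";

    /** Names of the elements whose text directly before the closing tag contains keyword. */
    static List<String> elementsContaining(String xml, String keyword) {
        List<String> hits = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                text.setLength(0); // a new element starts, so drop text from its parent
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called several times per text run
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (text.toString().contains(keyword)) {
                    hits.add(qName);
                }
                text.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(elementsContaining(SAMPLE, "content"));
    }
}
```

Only one text buffer is held in memory at a time, which is what makes this approach suitable for very large files.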
You could store your XML files in an XML database (for example eXist) and then query them using XQuery.

XML Parsing in java [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Best method to parse various custom XML documents in Java
Hi all,
I am a beginner in Java, so I hope this question is an easy one. If I have an XML file, how can I parse it to get only the elements within a specific tag?
For example, if the XML file looks like this:
<date>2005-10-31</date>
<number>12345</number>
<purchased-by>
    <name>My name</name>
    <address>My address</address>
</purchased-by>
<order-items>
    <item>
        <code>687</code>
        <type>CD</type>
        <label>Some music</label>
    </item>
    <item>
        <code>129851</code>
        <type>DVD</type>
        <label>Some video</label>
    </item>
</order-items>
From this XML I want to parse only the elements within the order-items tag.
Is there a generic way to do this? Please let me know.
Thanks
As said in the comments, a short Google search should bring you to Sun's examples of how to do this. Basically, there are two main XML parsing approaches in Java:
SAX, where you use a handler to grab only what you want from your XML and ditch the rest.
DOM, which parses the whole file and lets you access all elements in a tree-like fashion.
Another very useful XML parsing approach, a little more recent than these two and included in the JRE only since Java 6, is StAX. StAX was conceived as a middle ground between the tree-based approach of DOM and the event-based approach of SAX. It is similar to SAX in that parsing very large documents is easy, but here the application "pulls" information from the parser instead of the parser "pushing" events to the application. You can find more explanation on this subject here.
So, depending on what you want to achieve, you can use one of these approaches.
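To show what "pulling" looks like in practice, here is a small StAX sketch that reads only the label of each item inside order-items. The order root element is an assumption (the fragment in the question has no root), and the class name StaxOrderItems is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxOrderItems {
    // The question's data, wrapped in an assumed <order> root element.
    static final String ORDER = "<order><date>2005-10-31</date><order-items>"
        + "<item><code>687</code><type>CD</type><label>Some music</label></item>"
        + "<item><code>129851</code><type>DVD</type><label>Some video</label></item>"
        + "</order-items></order>";

    /** Pulls events from the parser and returns the label of each item inside order-items. */
    static List<String> labels(String xml) {
        List<String> labels = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            boolean inOrderItems = false;
            while (reader.hasNext()) {
                int event = reader.next(); // the application asks for the next event
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("order-items")) {
                        inOrderItems = true;
                    } else if (inOrderItems && name.equals("label")) {
                        labels.add(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("order-items")) {
                    inOrderItems = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the XML", e);
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(labels(ORDER));
    }
}
```

Because the loop is ordinary application code, it could also stop reading as soon as order-items closes, which is awkward to do with SAX callbacks.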
If you want to limit the parsing work itself to the <order-items> element, then you'll have to use SAX. A SAX parser visits all elements of the input "file" (or stream), and your handler can ignore anything that is not <order-items> or one of its children. SAX does not build a tree on its own, so if you want a Document containing only these elements, the handler has to assemble it.
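A handler along these lines might look like the sketch below. It collects "name=value" strings for the leaf elements inside order-items and ignores everything else; building a Document from them is left out. The order root element is an assumption (the fragment in the question has no root), and the class name SaxOrderItems is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxOrderItems {
    // The question's data, wrapped in an assumed <order> root element.
    static final String ORDER = "<order><date>2005-10-31</date><order-items>"
        + "<item><code>687</code><type>CD</type><label>Some music</label></item>"
        + "<item><code>129851</code><type>DVD</type><label>Some video</label></item>"
        + "</order-items></order>";

    /** Returns "name=value" for every leaf element inside order-items. */
    static List<String> itemValues(String xml) {
        List<String> values = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inside; // true while between <order-items> and </order-items>
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals("order-items")) {
                    inside = true;
                }
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inside) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = text.toString().trim();
                if (qName.equals("order-items")) {
                    inside = false;
                } else if (inside && !value.isEmpty()) {
                    values.add(qName + "=" + value);
                }
                text.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(itemValues(ORDER));
    }
}
```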
If the xml documents are rather small and performance is not a limiting factor, then simply parse the whole document (that's a 2-liner) and use XPath expressions to select the correct nodes.
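For the small-document case, the parse-then-XPath route could look roughly like this. The order root element is an assumption (the fragment in the question has no root), and the class name XPathOrderItems is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathOrderItems {
    // The question's data, wrapped in an assumed <order> root element.
    static final String ORDER = "<order><date>2005-10-31</date><order-items>"
        + "<item><code>687</code><type>CD</type><label>Some music</label></item>"
        + "<item><code>129851</code><type>DVD</type><label>Some video</label></item>"
        + "</order-items></order>";

    /** Returns "code type" for every item under order-items. */
    static List<String> itemSummaries(String xml) {
        try {
            // Parsing the whole document really is short.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate(
                    "//order-items/item", doc, XPathConstants.NODESET);
            List<String> summaries = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                // Relative expressions are evaluated against the current item node.
                String code = xpath.evaluate("code", items.item(i));
                String type = xpath.evaluate("type", items.item(i));
                summaries.add(code + " " + type);
            }
            return summaries;
        } catch (Exception e) {
            throw new IllegalStateException("Could not query the XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(itemSummaries(ORDER));
    }
}
```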
Use XPath. It lets you select nodes by name and by loads of other conditions. Very little code is needed to set it up.
IBM Example
This is a classic case for SAX. Register a handler that receives the tags and ignores all tags other than order-items.
Apache Digester is probably a better way, but it is overkill for your specific task.
You can use a DOM Parser to build a Document and then extract whatever elements you want using the getElementsByTagName method.
Here is some sample code to help you get started:
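import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// parse the file and build a Document
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("file.xml"));
// get the list of elements called order-items
NodeList orderItemsNodes = doc.getElementsByTagName("order-items");
// iterate over the elements
for (int i = 0; i < orderItemsNodes.getLength(); i++) {
    Node orderItemNode = orderItemsNodes.item(i);
    // process the element here, e.g. read its text with orderItemNode.getTextContent()
}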
It honestly depends on how you plan to use the item data. If you want to parse it into objects and then work with those, I would use JAXB unmarshalling. If you just want to pull the string values out of the code, type, and label child elements of each item element, you could consider simple regex matching on the XML string: match the content of each item tag, then match each child element and extract its value.

How can I efficiently parse HTML with Java?

I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate both the tasks.
I want to use a lightweight HTML parser, because with HtmlUnit it takes a long time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need:
Speed
Ease of locating any HtmlElement by its "id", "name", or "tag type".
It is fine if the parser doesn't clean up dirty HTML; I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
Element head = doc.select("head").first();
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
The best I've seen so far is HtmlCleaner:
HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
For other html parsers see this SO question.
I suggest Validator.nu's parser, which is based on the HTML5 parsing algorithm. It is the parser used in Mozilla since 2010-05-03.
