I want to understand how HtmlCleaner handles iframes when cleaning raw HTML to produce valid XML output. One example of a page with iframes is this eBay product page.
When I print HtmlCleaner's output for this page, I find that some iframe tags are intact while others are missing. One of the missing iframes is the one with id="d"; it contains the product description, and its body has been merged into the main page.
The XML output of HtmlCleaner: http://pastebin.com/03f9gtdC
Could anyone kindly look at it, or suggest a better HTML parsing library that handles iframes gracefully? The library should also support XPath evaluation.
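For anyone reproducing this, here is a minimal sketch, assuming the standard HtmlCleaner API (clean a URL into a TagNode, then query it with evaluateXPath); the URL is a placeholder for the eBay page:

import java.net.URL;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class IframeCheck {
    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();
        // Placeholder URL standing in for the eBay product page
        TagNode root = cleaner.clean(new URL("http://www.example.com/product"));
        // List the iframes that survived cleaning, with their ids
        Object[] iframes = root.evaluateXPath("//iframe");
        System.out.println("iframes kept: " + iframes.length);
        for (Object o : iframes) {
            System.out.println("  id=" + ((TagNode) o).getAttributeByName("id"));
        }
    }
}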
Related
Is there a way I can read an entire website's HTML in my code and then convert the HTML to Java or JSON objects? It would be cool to crawl a site and then extract text from certain divs. Is there some way to use a marshaller for this?
You could look at XPath, which can be used to identify HTML elements on web pages. It can select specific elements or match text content using string functions such as contains().
For example, this would be the XPath to your question's paragraph: //*[@id="question"]/div/div[2]/div[1]/p (extracted from Chrome dev tools). It can be used in combination with Selenium WebDriver if you want to crawl a web page using Java.
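A minimal sketch of that combination, assuming Selenium WebDriver with ChromeDriver on the classpath; the URL and the XPath are placeholders:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class DivTextCrawler {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            // Placeholder URL for the site you want to crawl
            driver.get("https://www.example.com/");
            // Placeholder XPath for the element whose text you want
            WebElement p = driver.findElement(
                    By.xpath("//*[@id=\"question\"]/div/div[2]/div[1]/p"));
            System.out.println(p.getText());
        } finally {
            driver.quit();
        }
    }
}

From there you could map the extracted strings onto your own Java objects, or serialize them to JSON with a library such as Gson or Jackson.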
I have to find an element with "== $0" after the end tag of the "span". Below is the HTML code of the element.
<div _ngcontent-c6="" class="col-12">
<span _ngcontent-c6="">Registration with prefilled user's data</span>
</div>
When I copy the HTML code, the "== $0" part is removed, so I am also attaching an image.
I have tried to find a solution but nothing has worked. I have tried an XPath that normally works, like .//span[text()='Registration with prefilled user's data'], but with no success. I found that we can access this element in the Chrome console with the syntax $0, and it works fine there, but I don't know how to find it with XPath or CSS, or any recommended locator strategy in Selenium.
Note: Please don't suggest workarounds such as using className or a CSS selector with a class name like div.col-12 span, as I already know those. My problem is handling elements with == $0.
So the text == $0 is not what you think it is. It is just a feature of Chrome dev tools, not an actual attribute on the page: dev tools binds the currently selected node to $0 so you can reference it when testing scripts in the console. This has been discussed here, and it does not affect Selenium's ability to locate elements on the page.
The issue might be the selector that you are using, or possibly a hidden iframe element higher in the DOM that is obscuring the element.
You can try this XPath:
//span[contains(text(), "Registration with prefilled user's data")]
I just swapped the text()='text' query for contains(text(), 'text'), which may account for any hidden whitespace within the span element.
This XPath is correct for the span element, provided there are no special cases on the page. So if this does not work for you, it would help to post the full HTML or a link to the page you are automating, so that we can help you better.
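One wrinkle worth calling out: the text contains an apostrophe, so the XPath string literal has to be double-quoted, and in Java source those double quotes must be escaped. A minimal sketch, assuming a standard ChromeDriver setup and a placeholder URL:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class SpanLocator {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.example.com/registration"); // placeholder URL
            // Double quotes around the XPath literal keep the apostrophe
            // in "user's" from terminating the string early
            WebElement span = driver.findElement(By.xpath(
                    "//span[contains(text(), \"Registration with prefilled user's data\")]"));
            System.out.println(span.getText());
        } finally {
            driver.quit();
        }
    }
}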
I want to write a Java program to extract all the XPaths of a given HTML page. As a POC, I am using the Gmail login page as an example. In the example, I click on the Google logo and it gives me an XPath to it. I should be able to extract the XPaths of all the elements through a Java program and save them in JSON format, e.g. {"logo": "html/body/div[1]/div[1]/div/div", ...}. Please suggest if there are any libraries available to carry out this task.
Image link to explain better: http://i65.tinypic.com/347zcj6.jpg
1. Parse the XHTML (if it is well-formed XML) with DOM, or use JSoup.
2. Node => XPath: see Generate/get xpath from XML node java, and the sketch below.
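A minimal sketch of step 2 with Jsoup, building an absolute XPath for every element by walking up the parent chain; the class name is made up, and the Google URL is only an example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class XPathExtractor {

    // Build an absolute XPath for a Jsoup element by walking up its
    // parents and counting same-tag siblings for positional predicates.
    static String xpathOf(Element el) {
        StringBuilder path = new StringBuilder();
        for (Element cur = el; cur != null && !cur.tagName().equals("#root"); cur = cur.parent()) {
            int index = 1;
            for (Element sib = cur.previousElementSibling(); sib != null; sib = sib.previousElementSibling()) {
                if (sib.tagName().equals(cur.tagName())) index++;
            }
            path.insert(0, "/" + cur.tagName() + "[" + index + "]");
        }
        return path.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://accounts.google.com/").get();
        for (Element el : doc.getAllElements()) {
            if (el.tagName().equals("#root")) continue; // skip the document root
            System.out.println(el.tagName() + " -> " + xpathOf(el));
        }
    }
}

Writing the resulting name => XPath pairs out as JSON is then straightforward with a library such as Gson or Jackson.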
I have written a Jsoup class file to scrape a page and grab the hrefs for every element on the page. What I would like to do from there is to extract the XPath for each of the elements from their hrefs.
Is there a way to do this in JSoup? If not, what is the best way to do this in Java (and are there any resources on this)?
Update
I want to clarify my question.
I want to scan a page for all the href identifiers and grab the links (that part is done). For my script, I need to get the XPath of each element I have identified and scraped from the (scanned) page.
The problem is that I assumed I could easily translate the href links into XPaths.
The comment from @Rishal dev Singh ended up being the right answer.
Check his link here:
http://stackoverflow.com/questions/7085539/does-jsoup-support-xpath
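For completeness, a minimal sketch of the href-to-XPath step in plain Jsoup (no XPath engine needed), using the same parent-walk idea as above; the URL is a placeholder:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HrefXPaths {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.example.com/").get(); // placeholder URL
        // For every link, print its absolute href next to an XPath built
        // from positional predicates up the parent chain
        for (Element link : doc.select("a[href]")) {
            StringBuilder path = new StringBuilder();
            for (Element cur = link; cur != null && !cur.tagName().equals("#root"); cur = cur.parent()) {
                int index = 1;
                for (Element sib = cur.previousElementSibling(); sib != null; sib = sib.previousElementSibling()) {
                    if (sib.tagName().equals(cur.tagName())) index++;
                }
                path.insert(0, "/" + cur.tagName() + "[" + index + "]");
            }
            System.out.println(link.attr("abs:href") + " -> " + path);
        }
    }
}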
I do a lot of HTML parsing in my line of work. Up until now, I have been using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate both the tasks.
I want to use a lightweight HTML parser, because HtmlUnit takes a lot of time to first load a page, then get the source, and only then parse it.
I want to know which HTML parser can parse HTML efficiently. I need:
Speed
Ease of locating any HTML element by its "id", "name", or "tag type"
It would be OK for me if it doesn't clean up dirty HTML; I don't need to clean any HTML source. I just need the easiest way to move across HTML elements and harvest data from them.
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");          // all <a> elements
Element head = doc.select("head").first(); // the first <head> element
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
The best I've seen so far is HtmlCleaner:
HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
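For example, a minimal sketch of locating an element by id, assuming the TagNode.evaluateXPath API; the HTML snippet is a placeholder:

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class LocateById {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><div id='price'>42.00</div></body></html>";
        TagNode root = new HtmlCleaner().clean(html);
        // evaluateXPath returns the matching nodes as an Object[]
        Object[] hits = root.evaluateXPath("//div[@id='price']");
        if (hits.length > 0) {
            System.out.println(((TagNode) hits[0]).getText()); // prints 42.00
        }
    }
}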
For other HTML parsers, see this SO question.
I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. It is the parser used in Mozilla as of 2010-05-03.
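A minimal sketch of pairing it with the standard javax.xml XPath engine, assuming the nu.validator.htmlparser.dom.HtmlDocumentBuilder API; note that the HTML5 parser puts elements in the XHTML namespace, so the expression below matches by local name:

import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class Html5Parse {
    public static void main(String[] args) throws Exception {
        String html = "<title>t</title><p id=d>hello";
        // The HTML5 algorithm fixes up the markup and yields a W3C DOM
        Document doc = new HtmlDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
        // Elements land in the XHTML namespace, so match by local name
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(
                "//*[local-name()='p'][@id='d']", doc, XPathConstants.NODESET);
        System.out.println(hits.item(0).getTextContent()); // hello
    }
}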