Scrape Website Data Where HTML is loaded based on XML - java

I am trying to scrape data from a webpage using the Jsoup library in Java. However, the data that I want to scrape is loaded based on XML, so when I try to parse it from the HTML I only get:
<div id="report-details-container">
<!-- Container where HTML template will be loaded based on XML -->
</div>
Instead of the full HTML, it just shows this comment.
How can I scrape that data? In the browser's Inspect Element view I can see the full HTML.

You can't scrape the original XML out of the HTML. The XML is not in the HTML.
However:
You might be able to reverse engineer the original XML ... provided that you know the rules for the transformation from XML to HTML (e.g. you have the XSLT file), AND the transformation is not lossy.
If the transformation from XML to HTML is done using client side execution of (say) an XSLT, then you should be able to capture the XML before the transformation is applied.
There might be a way to get the server to send XML instead of HTML. That will depend on the server itself.
However, if all you have is an HTML comment like you have shown us, then you will first need to reverse engineer the process that loads the XML. It is probably done with some client-side scripting.
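Once the XML has been captured (for example by finding the request in the browser's network tab), reading values out of it needs only the JDK. The following is a minimal sketch under stated assumptions: the class name `CapturedXmlReader`, the element names `report`, `row`, and `name`, and the sample data are all hypothetical, since the real XML structure is not shown in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CapturedXmlReader {

    // Returns the trimmed text content of every element with the given tag name.
    public static List<String> textOf(String xml, String tagName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName(tagName);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the XML the page loads in the background.
        String xml = "<report><row><name>A</name></row><row><name>B</name></row></report>";
        System.out.println(textOf(xml, "name"));
    }
}
```

In practice the XML string would come from whatever URL the page's script requests; that URL has to be discovered by inspecting the page, and it is not something this sketch can supply.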

Related

Java: extract all resources links from HTML

I am looking for a way to extract all resource links from an HTML page in Java (URL links, links to files, and so on).
I first thought of extracting all elements inside src, href attributes, but the list will not be exhaustive. There is an example of code here: Jsoup, extract links, images, from website. Exception on runtime.
As a tricky example, I want to be able to detect links hidden inside JavaScript (which can also be hidden anywhere in the HTML DOM):
<IMG onmouseover="window.open('http://www.evil.com/image.jpg')">
EDIT:
1) I am not looking for a regex-based solution, because regexes are not reliable for dealing with HTML documents.
2) I have tried an HTML DOM parser like Jsoup. It allows the extraction of tags and their properties quite well. However, I have not found a way to detect links inside JavaScript with it.
3) Maybe there is an API available that tries to render the page and detect which resources need to be loaded?
Do you have any thoughts?
Thanks.
If you are willing to use PHP and have a bit of programming knowledge, here is a library:
http://simplehtmldom.sourceforge.net/
I used this library to extract info from tags, even from properties of tags. This is exactly what you need to do what you want without working with complicated code.
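For the `onmouseover` example in the question, one possible heuristic is to let an HTML parser hand over the attribute value and then scan that script text (not the HTML) for quoted absolute URLs. The sketch below assumes the attribute value has already been extracted; the class name `ScriptUrlFinder` is hypothetical. It only finds URLs written as string literals, so anything built up at runtime would still need a tool that actually executes the page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptUrlFinder {

    // A single- or double-quoted literal that starts with http:// or https://.
    // Group 1 is the opening quote, group 2 is the URL itself.
    private static final Pattern QUOTED_URL =
            Pattern.compile("(['\"])(https?://[^'\"\\s]+)\\1");

    // Returns every quoted absolute URL found in a fragment of JavaScript.
    public static List<String> findUrls(String script) {
        List<String> urls = new ArrayList<>();
        Matcher m = QUOTED_URL.matcher(script);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        // The attribute value from the question's <IMG onmouseover="..."> example.
        System.out.println(findUrls("window.open('http://www.evil.com/image.jpg')"));
    }
}
```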

Get dynamically loaded html in java

I need to grab the whole HTML from a web page, including dynamically generated HTML. Is that possible in Java or in PHP?
When I inspect an element with Firebug it shows the whole HTML, but when I use Java I get only the static HTML file.

Perform click on Web page element before parsing in Java

I'm trying to parse an HTML page with a DOM parser and the jsoup library.
The problem that I'm facing is this:
On the web site there are two buttons which show two different tables.
I need to parse the table which is shown when the second button is clicked.
There are different attribute values set after clicking the second button.
When I do Jsoup.connect("example.com")
I get a response as if the first button were selected, and I don't need that data.
Is there a way to perform a click on the second button, and then start parsing and retrieving data from the web site?
Jsoup is just a parser, i.e. it can't handle events such as clicking on buttons. Have a look at browser automation tools (e.g. Selenium) to perform this kind of job.
JSoup is an HTML parser and not a browser alternative. Take a look at HtmlUnit.
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
JSoup can't control the web page; it can only parse the content. For manipulation and interaction, there are other tools. I recommend Geb, which uses a Groovy DSL with a jQuery-like syntax, making it very fluent. It's also pretty easy to parse XML/HTML with it.

Converting XML to PDF, using styles from XSL

I have the following problem: I have an XML file with an XSL stylesheet that renders the XML as a neat HTML table when I load it in a web browser. Now I need to make a PDF that looks EXACTLY like that XSL-styled XML in the web browser, without needing to write custom FOs for every file. Everything must be done in Java.
I need to make a PDF that is looking EXACTLY like that XSL-styled XML in web browser
Think again about this requirement. Paged media such as PDF and non-paged media such as HTML may only look "close enough", but never "exactly like" each other. This is even more obvious if you consider your HTML being displayed on devices with different screen sizes.
If you relax the above requirement somewhat, you'll probably agree that XSL-FO is the best choice. You definitely do not need to write "custom FO's for every file": write an XSLT just once, and use it on-the-fly to convert your XML to XSL-FO, and then use a rendering engine to process XSL-FO to PDF. Simple.
XSL-FO does sound like exactly what you need. But if that's not an option, you could first run the XSLT transform on the XML explicitly in Java, and then convert the resulting HTML (which by then is a String, byte array, DOM, or whatever you want) to PDF using an additional library. There are some libraries that support HTML to PDF, iText for example. XSLT transformations in Java are really simple; little code is involved.
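The XSLT step mentioned above can be sketched with the JDK's built-in `javax.xml.transform` API. This is an illustration only: the class name `XsltToHtml`, the sample XML, and the sample stylesheet are hypothetical stand-ins for the asker's files, and the HTML-to-PDF step is left to whichever library is chosen.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltToHtml {

    // Applies the given XSLT stylesheet to the given XML and returns the result as a String.
    public static String transform(String xml, String xslt) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Hypothetical input; real code would read the XML and XSL files instead.
        String xml = "<items><item>Alpha</item><item>Beta</item></items>";
        String xslt =
                "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:output method=\"html\" indent=\"no\"/>"
                + "<xsl:template match=\"/items\">"
                + "<table><xsl:for-each select=\"item\">"
                + "<tr><td><xsl:value-of select=\".\"/></td></tr>"
                + "</xsl:for-each></table>"
                + "</xsl:template>"
                + "</xsl:stylesheet>";
        System.out.println(transform(xml, xslt));
    }
}
```

The same `Transformer` setup would also serve the XSL-FO route: only the stylesheet changes, producing FO instead of HTML for a rendering engine to consume.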

using jpedal to extract hyperlinks from html? --java

The JPedal library in Java is usually used to convert PDF to XML or HTML. However, I need to know whether we could extract data from an HTML5 document and save it to XML using the JPedal library API.
Is there any other possible alternative to this?
Also, I am trying to parse an HTML5 document using Java and store it in XML. Are there any good solutions for finding just specific tags and rendering XML out of them?
Please do let me know. Thank you.
There are a number of Java HTML parsers out there, but I recommend using the HTML5 parser from validator.nu available for download from here: http://about.validator.nu/htmlparser/.
Written to use the HTML5 parser algorithm by one of the main protagonists of HTML5, Henri Sivonen of Mozilla, you won't find a more reliable HTML parser and it creates a true DOM that can be manipulated using standard XML tools and queried for hyperlinks using XPath. There are examples of how to use XSLT transformations with it and how to get an XML serialization of the created DOM.
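Once a DOM is available, querying it for hyperlinks with XPath and serializing the result as XML needs only standard JDK classes. The sketch below is not the validator.nu API: it builds the DOM from a small well-formed XHTML string with the JDK's own parser as a stand-in, and the names `LinkExtractor`, `links`, and `link` are hypothetical. With a namespace-aware DOM the XPath expression may need adjusting.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LinkExtractor {

    // Stand-in for a real HTML5 parser: only works on well-formed XHTML.
    public static Document parseXhtml(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Collects every a/@href in the page into a new <links> document and serializes it.
    public static String linksAsXml(Document page) throws Exception {
        NodeList hrefs = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//a/@href", page, XPathConstants.NODESET);

        Document out = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = out.createElement("links");
        out.appendChild(root);
        for (int i = 0; i < hrefs.getLength(); i++) {
            Element link = out.createElement("link");
            link.setAttribute("href", hrefs.item(i).getNodeValue());
            root.appendChild(link);
        }

        // An identity transform turns the DOM back into XML text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        serializer.transform(new DOMSource(out), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"http://example.com/a\">A</a>"
                + "<p><a href=\"/b\">B</a></p></body></html>";
        System.out.println(linksAsXml(parseXhtml(page)));
    }
}
```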
