I want to retrieve all the links in a web page, but the page uses JavaScript, and each page contains a number of links.
How can I go to the next page and read its contents in a Java program?
Getting this information from a JavaScript-driven page can be hard. Your program must interpret the whole page and understand what the JS is doing. Not all web spiders do this.
Most modern JS libraries (jQuery, etc.) mostly manipulate the CSS and attributes of HTML elements. So first you have to generate the "flat" HTML from the HTML source and the JS, and then maybe run a classical web spider over the flat HTML code.
(For example, the Firefox Web Developer plugin lets you see both the original source code of a page and the generated code of the page once all the JS has run.)
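As a rough illustration of the second step (running a classical link extractor over the already "flat" HTML), here is a minimal sketch in Java. The class name `FlatHtmlLinkExtractor` and the sample markup are made up for the example, and the regular expression is a simplification, not a full HTML parser. It assumes you already obtained the generated HTML from some JavaScript-capable tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects href values from <a> tags in already-rendered HTML.
class FlatHtmlLinkExtractor {
    // Simplified pattern: an <a ...> tag with a quoted href attribute.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extractLinks(String flatHtml) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(flatHtml);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample of "flat" HTML, as it might look after the JS has run.
        String flatHtml = "<ul><li><a href=\"/page/2\">Next</a></li>"
                + "<li><a class='x' href='/item/7'>Item</a></li></ul>";
        System.out.println(extractLinks(flatHtml));
    }
}
```

For real pages a proper HTML parser is likely a better choice than a regular expression; the sketch only shows that once the HTML is flat, link extraction is ordinary text processing.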
What you are looking for is called a web spider engine. There are plenty of open source web spider engines available. Check http://j-spider.sourceforge.net/ for example.
Related
I would like to know how to trigger an action on an external web page.
In this case the external web page is the following: https://tools.pdf24.org/en/webpage-to-pdf
The page converts web pages, given by their URL, to PDF.
I want to run the PDF generation programmatically, since I have a list of web pages that I need as PDFs.
Is there a tool in PHP, jQuery or Java that can run the PDF generation through the external web page?
Thanks
At the time of this writing there are a couple of options: wkhtmltopdf and jsPDF.
wkhtmltopdf is a command-line tool that uses the WebKit engine to create PDFs from an HTML source. There are also multiple server-side libraries and wrappers for it that you can use.
(link at time of writing) https://wkhtmltopdf.org/
jsPDF is a client side library that accomplishes what you're asking.
(link at time of writing) https://github.com/MrRio/jsPDF
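For the wkhtmltopdf option, a list of pages could be converted from Java by calling the command-line tool once per URL. The following is only a sketch under a few assumptions: `wkhtmltopdf` is installed and on the `PATH`, it is invoked in its basic form `wkhtmltopdf <url> <output.pdf>`, and the class name `BatchPdfExport` and the `page-N.pdf` naming scheme are invented for the example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch converter that shells out to the wkhtmltopdf CLI.
class BatchPdfExport {
    // Builds the command line: wkhtmltopdf <url> <outputFile>
    static List<String> buildCommand(String url, String outputFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("wkhtmltopdf"); // assumed to be on the PATH
        cmd.add(url);
        cmd.add(outputFile);
        return cmd;
    }

    // Made-up naming scheme: page-1.pdf, page-2.pdf, ...
    static String outputNameFor(int index) {
        return "page-" + (index + 1) + ".pdf";
    }

    // Runs the tool for one URL and returns its exit code.
    static int convert(String url, String outputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(url, outputFile))
                .inheritIO() // show the tool's own output in the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // URLs are taken from the command-line arguments.
        List<String> urls = List.of(args);
        for (int i = 0; i < urls.size(); i++) {
            String out = outputNameFor(i);
            int exit = convert(urls.get(i), out);
            System.out.println(urls.get(i) + " -> " + out + " (exit code " + exit + ")");
        }
    }
}
```

Running it as `java BatchPdfExport.java https://example.com https://example.org` would attempt to write `page-1.pdf` and `page-2.pdf` to the current directory. This bypasses the pdf24 web page entirely rather than automating it.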
I've got a problem: I want to parse a page (e.g. this one) to collect information about the offered apps and save this information into a database.
I am using crawler4j to visit every (available) page. But the problem, as far as I can see, is that crawler4j needs links in the source code to follow.
In this case the hrefs are generated by some JavaScript code, so crawler4j does not get any new links to visit or pages to crawl.
So my idea was to use Selenium, so that I can inspect elements as in a real browser like Chrome or Firefox (I'm quite new to this).
But, to be honest, I don't know how to get the "generated" HTML instead of the source code.
Can anybody help me?
To inspect elements, you do not need the Selenium IDE; just use Firefox with the Firebug extension. Also, with the developer tools add-on you can view a page's source and also the generated source (this is mainly for PHP).
crawler4j cannot handle JavaScript like this. It is better left to another, more advanced crawling library. See this response here:
Web Crawling (Ajax/JavaScript enabled pages) using java
I am a beginner in Java and web app development. I am supposed to analyze and optimize JSP pages that take a while to get data from the server.
My question is: can we load the supporting files while the JSP is waiting for the server response?
I think you misunderstand how HTML, JS and CSS work.
In short: the browser sends a request for a certain JSP page. This page is returned from the server and holds within it a number of link and script tags referring to the CSS and JS for the page. The browser parses the page and sees that it needs extra resources in order to render it properly. So it sends further requests to the server for the CSS and JS.
Because of this, it is impossible for the browser to know in advance what CSS and JS the JSP page will need, because these are determined by the contents of the page itself.
However, that does not mean that you are out of luck. The first page will always have to load its resources afterwards, but it is possible to load the CSS and JS for the other pages in advance using the techniques explained in Pre-loading external files (CSS, JavaScript) for other pages. I have not tried these methods myself, but they seem valid.
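To make the "parse first, then request" order concrete, here is a small Java sketch of the discovery step a browser performs: it can only learn which CSS and JS files to fetch by reading the returned HTML. The class name `SubresourceScanner` and the sample markup are invented for the illustration, and the regular expressions are a simplification of real HTML parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scanner that mimics how a browser discovers extra resources.
class SubresourceScanner {
    // <link ... href="..."> (typically stylesheets)
    private static final Pattern LINK_HREF = Pattern.compile(
            "<link\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);
    // <script ... src="..."> (external scripts only; inline scripts have no src)
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    private static void collect(Pattern pattern, String html, List<String> out) {
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            out.add(m.group(2));
        }
    }

    // Returns link hrefs first, then script srcs.
    static List<String> findSubresources(String html) {
        List<String> urls = new ArrayList<>();
        collect(LINK_HREF, html, urls);
        collect(SCRIPT_SRC, html, urls);
        return urls;
    }

    public static void main(String[] args) {
        // Made-up page as the server might return it.
        String html = "<html><head>"
                + "<link rel=\"stylesheet\" href=\"/css/site.css\">"
                + "<script src=\"/js/app.js\"></script>"
                + "</head><body></body></html>";
        System.out.println(findSubresources(html));
    }
}
```

Until the HTML has arrived there is nothing for this scan to run on, which is why the resources of the first page cannot be requested while the JSP response is still pending.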
Well, if I understand you correctly, why don't you just load the CSS/JS files and fire your other function when that's done? I'm not quite sure why you'd want that, though.
I am facing a problem retrieving the contents of an HTML page using java. I have described the problem below.
I am loading a URL in java which returns an HTML page.
This page uses JavaScript. When I load the URL in the browser, a JavaScript function call occurs AFTER the page has been loaded (onBodyLoad of the HTML page), and it modifies some content (the innerHTML of one of the divs) on the web page. This change is obviously visible to me in the browser.
Now, when I try to do the same thing using Java, I only get the HTML content of the page BEFORE the JavaScript call has occurred.
What I want to do is fetch the contents of the HTML page after the JavaScript function call has occurred, and all of this has to be done using Java.
How can I do this? What should my approach be?
You need to use a server side browser library that will also execute the JavaScript, so you can get the JavaScript updated DOM contents. The default browser mechanism doesn't do this, which is why you don't get the expected result.
You should try Cobra: Java HTML Parser, which will execute your JavaScript. See here for the download and for the documentation on how to use it.
Cobra:
It is Javascript-aware. DOM modifications that occur during parsing will be reflected in the resulting DOM. However, Javascript can be disabled.
For anyone reading this answer, Scott's answer above was a starting point for me. The Cobra project is long dead and cannot handle pages which use complex JavaScript.
However, there is something called HtmlUnit which does exactly what I want.
Here is a small description:
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
It is typically used for testing purposes or to retrieve information from web sites.
I like playing Magic: The Gathering, and I also have a database of my collection. Magic set information was easy to obtain, since I could parse it directly from the HTML URL stream I opened, but I'm now trying to obtain prices for the cards as well from Star City Games (a MTG vendor). However, when I view source, there are no prices in the HTML; it's all done through JavaScript on-the-fly. Here's an example page for reference: http://sales.starcitygames.com/search.php?substring=Snapcaster+Mage&t_all=All&start_date=2010-01-29&end_date=2012-04-22&order_1=finish&limit=25&action=Show%2BDecks&card_qty%5B1%5D=1&auto=Y
The webpage generates the following HTML: http://pastebin.com/Psrpri8r .
I just want to be able to read the text "24.99." All of the code I'm using is in Java.
Thanks.
You need to run the JavaScript. Try Selenium.
If you don't want to learn it, I have a hacky solution (warning: very complicated): download the page to your computer (including all the .js files), edit the JS code so that it sends the relevant data to your server instead of displaying it, and open the page in a browser.
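Once the JavaScript has run and you have the rendered HTML as a string (for example from Selenium), reading a value such as "24.99" is plain text processing. Below is a minimal Java sketch; the class name `PriceExtractor` and the sample markup are assumptions made for the example, since the real structure of the generated page is not shown here, and the pattern simply looks for the first dollar amount with two decimals.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the first dollar price out of rendered HTML.
class PriceExtractor {
    // A "$" sign, optional whitespace, then digits with exactly two decimals.
    private static final Pattern PRICE = Pattern.compile("\\$\\s*(\\d+\\.\\d{2})");

    static Optional<BigDecimal> firstPrice(String renderedHtml) {
        Matcher m = PRICE.matcher(renderedHtml);
        if (m.find()) {
            return Optional.of(new BigDecimal(m.group(1)));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up fragment standing in for the generated HTML.
        String rendered = "<td class=\"price\">$24.99</td>";
        System.out.println(firstPrice(rendered).map(BigDecimal::toPlainString).orElse("no price found"));
    }
}
```

If the page lists several prices, the pattern would need to be tied to the surrounding markup of the card you are interested in, which depends on the actual generated HTML.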