I'm trying to get the list of Selenium versions available for download from this page: http://selenium-release.storage.googleapis.com/index.html, just as test values. But I can't use APIs that find elements with XPath or CSS selectors (like Jsoup), because HttpURLConnection returns only the JavaScript source of the page. So my question is: how can I get the full HTML of that page so that I can at least parse the Selenium versions? My only restriction is that the solution must be in Java.
Related
I'm trying to parse a website (e.g. Google).
In Chrome's local storage I see variables (key-value pairs),
and the document content depends on these variables.
Can I set them using jsoup?
Or should I use other tools for this?
No, you cannot access Chrome's content with your own code (unless you can find an exploit for that). You'll have to get the site's content with the proper GET or POST request made with jsoup and then parse the content yourself. Just remember that jsoup only loads the HTML of the site and does not execute JavaScript, so if your content is generated by JS, you'll have to find another way to get it.
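To make the "plain GET request" step concrete, here is a minimal sketch that uses only the JDK's built-in java.net.http client (Java 11+) instead of jsoup, so it runs without extra dependencies. The class name StaticFetch and the User-Agent value are made up for illustration. With jsoup, the equivalent fetch would be Jsoup.connect(url).get(). Either way, the result is only the HTML the server sends; no script on the page is run.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StaticFetch {

    // Builds a plain GET request for the given URL.
    // The User-Agent string is a hypothetical value, not something the site requires.
    public static HttpRequest buildGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "example-fetcher/0.1")
                .GET()
                .build();
    }

    // Sends the request and returns the response body as a string.
    // This is the static HTML only: JavaScript in the page is not executed.
    public static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildGet(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java StaticFetch.java <url>
        if (args.length == 1) {
            System.out.println(fetch(args[0]));
        }
    }
}
```

If the string returned by fetch does not contain the data you see in the browser, that data is being added by JavaScript after the page loads, and a plain request will not be enough.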
I would like to generate static HTML5 pages, defining my own tags and rendering more complex HTML in the generated pages.
I like the Polymer architecture approach of defining new sets of tags.
When I generate my pages, I'm not in a browser, so I can use a Java or NodeJS engine to compute the final HTML pages (from a console, for instance).
To sum up, I want to define my own tag libraries using the Polymer approach, write some HTML using those new tags, and "print" the resulting DOM to a static HTML file, running all of that from a console program (using Java or NodeJS).
Does anybody know how to do this?
It seems I need some kind of DOM interpreter. I know that in Java I can use jsoup, but it has no JavaScript interpreter. Can NodeJS do this more simply?
I am looking for a way to extract all resource links from an HTML page in Java (URL links, links to files, etc.).
I first thought of extracting everything inside src and href attributes, but that list would not be exhaustive. There is an example of code here: Jsoup, extract links, images, from website. Exception on runtime.
As a tricky example, I want to be able to detect links hidden inside JavaScript (which can also be hidden anywhere in the HTML DOM):
<IMG onmouseover="window.open('http://www.evil.com/image.jpg')">
EDIT:
1) I am not looking for a regex-based solution, because regexes are not reliable for dealing with HTML documents.
2) I have tried using an HTML DOM parser like JSoup. It extracts tags and their attributes quite well. However, I have not found a way to detect links inside JavaScript with it.
3) Maybe there is an API available that tries to render the page and detects which resources need to be loaded?
Do you have any thoughts?
Thanks.
If you are open to using PHP and have a bit of programming knowledge, here is a library:
http://simplehtmldom.sourceforge.net/
I have used this library to extract information from tags, and even from tag attributes. It should let you do what you want without writing complicated code.
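Since the question asks for Java, here is a rough sketch of one heuristic for the inline-JavaScript case. It is not a full solution and it is not what the PHP library above does. It scans the raw page text for absolute http(s) URLs, which catches literals inside event handlers such as onmouseover. The class name InlineUrlScanner is made up for illustration. It does use a regex, but only to spot URL-shaped strings, not to parse HTML structure. It will miss relative URLs and any URL that a script assembles at runtime; for those you would need something that actually executes the page.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlineUrlScanner {

    // Absolute http(s) URL, ending at the first whitespace, quote,
    // angle bracket, or closing parenthesis.
    private static final Pattern URL =
            Pattern.compile("https?://[^\\s'\"<>)]+");

    // Returns every distinct URL-shaped substring, in order of first appearance.
    public static Set<String> findUrls(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
                "<IMG onmouseover=\"window.open('http://www.evil.com/image.jpg')\">";
        System.out.println(findUrls(html)); // prints [http://www.evil.com/image.jpg]
    }
}
```

A reasonable way to combine this with a DOM parser is to let the parser collect src and href values, then run a scan like this over script bodies and on* attribute values to pick up whatever the parser cannot see.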
I need to grab the whole HTML of a web page, including dynamically generated HTML. Is this possible in Java or in PHP?
When I inspect an element with Firebug, it shows the whole HTML, but when I use Java I get only the static HTML file.
I have an HTML file containing some JavaScript tags. When I open this file in a browser such as IE, some content is fetched from its source and displayed in the browser (for example, the weather of some cities). How can I run this HTML file and get the content of the web page as it was displayed in the browser? I don't want to display the content in my application; I want to parse the returned data and extract specific content (for example, the weather of each city).
Can anyone guide me, please?
What you're trying to do is called HTML scraping.
Your best option is to get help in the form of a library, since this is a common and complex task.
See this question: Options for HTML scraping?
Selenium is a good bet. It supports HtmlUnit, Firefox, and Chrome, among other browsers.
Link: http://seleniumhq.org/