Get a site's scripts with Jsoup - java

I'm trying to get a site's scripts using 'Jsoup.connect(url).get().html()',
but the script I want doesn't appear in the output. Does anyone know how I can get this script?
(Screenshot: the script I want to get)

It doesn't appear in the source because that video is inside an iframe. That iframe has its own src attribute (visible in your screenshot). Try getting that page instead.
EDIT:
Get the first page and parse it. Then select the iframe's src attribute, and once you have that second URL, do the same again: get the page and parse it:
String iframeUrl = Jsoup.connect(url).get().selectFirst("#option-1 iframe").attr("src");
System.out.println(iframeUrl);
Document document = Jsoup.connect(iframeUrl).get();
System.out.println(document.html());
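
One thing to watch for in the two-step fetch above: an iframe's src is often relative or protocol-relative (starting with //), while Jsoup.connect needs an absolute URL. Here is a minimal sketch of resolving it against the page URL using only java.net.URI. The class name IframeSrcResolver and the example.com URLs are made up for illustration. When the document was fetched with connect, jsoup's own absUrl("src") should do the same job.

```java
import java.net.URI;

class IframeSrcResolver {

    // Resolves an iframe's src value against the URL of the page it was found on.
    // URI.create throws IllegalArgumentException for malformed input (e.g. spaces).
    static String resolve(String pageUrl, String iframeSrc) {
        return URI.create(pageUrl).resolve(iframeSrc.trim()).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL, standing in for the "url" variable of the answer.
        String pageUrl = "https://example.com/watch/page.html";

        System.out.println(resolve(pageUrl, "/embed/123"));                    // root-relative
        System.out.println(resolve(pageUrl, "player.html?id=5"));              // relative to the page
        System.out.println(resolve(pageUrl, "//player.example.net/v/1"));      // protocol-relative
        System.out.println(resolve(pageUrl, "https://cdn.example.org/frame")); // already absolute
    }
}
```

The resolved string can then be passed to Jsoup.connect in place of the raw attribute value.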

Related

Extract all Images alt text from web page in Java using jSoup

I want to extract the alt text of all images in a web page.
I have tried all the approaches I know. You can check
the code below.
Document doc = Jsoup.connect("https://www.amazon.com/gp/offer-listing/B003FYLW9Q/ref=olp_f_new?ie=UTF8&f_new=true")
.userAgent("Mozilla")
.timeout(50000)
.cookie("cookiename", "val234")
.cookie("anothercookie", "ilovejsoup")
.referrer("http://google.com")
.header("headersecurity", "xyz123")
.get();
// Method 1
Elements images = doc.select("img[src~=(?i)\\.(gif)]");
System.out.println(images.attr("alt"));
// Method 2
String imageAlt = doc.getElementsByClass("a-spacing-none olpSellerName").select("img").attr("alt");
System.out.println(imageAlt);
This code does not work for the link currently passed to the connect method.
It fails for some links, and it does not fetch the alt text of all images in the page.
But it works for the links below:
https://www.amazon.com/gp/offer-listing/B06XWZWYVP/ref=olp_f_new?ie=UTF8&f_new=true
https://www.amazon.com/gp/offer-listing/B079JD7F7G/ref=olp_f_new?ie=UTF8&f_new=true
The class is the same for all the links, yet it does not work for some of them. Can anyone please tell me the solution to this problem?

Jsoup with a plugin

I'm using Jsoup to scrape some online data from different stores, but I'm having trouble figuring out how to programmatically replicate what I do as a user. To get the data manually (after logging in), a user must select a store from a tree that pops up.
As best I can tell, the tree is not hard-coded into the site but is built interactively when your computer interacts with the server. When you look for the table in "view page source," there are no entries. When I inspect the tree, I do see the HTML and it seems to come from the "FancyTree" plugin.
As best as I can tell from tracking my activity on Developer Tools -- Network, the next step is a "GET" request which doesn't change the URL, so I'm not sure how my store selection is being transferred.
Any advice on how to get Jsoup or Java generally to programmatically interact with this table would be extremely helpful, thank you!
Jsoup only parses the HTML the server sends; it does not execute JavaScript, so it never sees the DOM that scripts build afterwards. To parse that DOM, you'll need to render the page with something like HtmlUnit first. Then you can parse the resulting HTML with Jsoup.
// load the page with HtmlUnit, which runs the page's scripts
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myURL);
// serialize the rendered page and parse it into a Jsoup document
Document doc = Jsoup.parse(myPage.asXml());
// do something with html content
System.out.println(doc.html());
// clean up resources
webClient.close();
See Parsing Javascript Generated Page with Jsoup.

Parsing shopping websites using jsoup

I have the following code:
doc = Jsoup.connect("http://www.amazon.com/gp/goldbox").userAgent("Mozilla").timeout(5000).get();
Elements hrefs = doc.select("div.a-row.layer");
System.out.println("Results:"+ hrefs); //I am trying to print out contents but not able to see the output.
Problem: I want to display the src of every image inside the div with class name "a-row layer", but I am unable to see any output.
What is the mistake in my query?
I have taken a look at the website and tested it myself. The issue seems to be that the piece of HTML you want to extract (div.a-row.layer) is generated by JavaScript.
Jsoup does not execute JavaScript, so it cannot see content that scripts generate. You would need a headless web browser to deal with this, such as HtmlUnit.

Get image for captcha session

I want to get the current captcha that is displayed on a website. An example of this would be
http://top100arena.com/in.asp?id=58978
How would I get the image link of the captcha that is displayed, other than right-clicking -> open image in new page?
You are looking for the div identified by "recaptcha_image".
Then extract the src attribute of the img element inside this div.
To do this, you can either use simple String operations or an HTML parsing library like JSoup.
Here is an example of such an extracted URL:
http://www.google.com/recaptcha/api/image?c=03AHJ_VutGj3wvhGoQGxu6FUnG3uOWJdyB2RpSb2N5v9AQJyakMy1kKMPeDoRfADhjAj5rLqekuOzXe3cRChnA_sEN7PL68em4pI_kE3wFKUhhkqFF9jQzKJerX__InwD_DB0Ox1mKQmZVRl97yuSL62tZhYyhSqtuIta-3n0KvytB9QqSn8nXgw8
Actually, it seems that the captcha box is an iframe. So search for an iframe whose src string contains "captcha". Example of such an iframe:
<iframe src="http://www.google.com/recaptcha/api/noscript?k=6LeyFroSAAAAAJTmR7CLZ5an7pcsS5eJ3wEoWHhJ"
height="300" width="500" frameborder="0"></iframe><br/>
So, once you have extracted that URL, use JSoup again to find the URL of the image.
In the fetched page, look for a center element, and get the img element out of it.
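
For the String-operation route mentioned above, a rough sketch using a regular expression might look like this. It picks the src of the first iframe or img tag whose src contains "captcha". The class name CaptchaSrcFinder and the sample markup are made up for illustration, and the pattern is deliberately naive (it does not decode entities such as &amp; and would also match a data-src attribute), so a real parser like JSoup is the safer choice for anything beyond a quick check.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptchaSrcFinder {

    // Matches an <iframe> or <img> tag whose quoted src value contains "captcha"
    // and captures that value. [^>]*? keeps the match inside a single tag.
    private static final Pattern CAPTCHA_SRC = Pattern.compile(
            "<(?:iframe|img)\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*captcha[^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the first matching src value, or an empty Optional if there is none.
    static Optional<String> findCaptchaSrc(String html) {
        Matcher matcher = CAPTCHA_SRC.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical markup; the query value "abc123" is a placeholder.
        String html = "<img src=\"logo.png\">"
                + "<div id=\"recaptcha_image\">"
                + "<img src=\"http://www.google.com/recaptcha/api/image?c=abc123\" width=\"300\">"
                + "</div>";

        System.out.println(findCaptchaSrc(html).orElse("no captcha src found"));
    }
}
```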
Try using Firebug in Firefox: https://addons.mozilla.org/es/firefox/addon/firebug/. It's easy to use. In the Net panel (labelled "Red" in the Spanish version) there is a tab named Image, and you'll find the image there.

Using java to extract a single value from an html page:

I am continuing work on a project that I've been at for some time now, and I have been struggling to pull some data from a website. The website has an iframe that pulls in some data from an unknown source. The data is in the iframe in a tag something like this:
<DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV>
There is a BUNCH of other crap above it but this div id / label is totally unique and is not used anywhere else in the code.
jsoup is probably what you want, it excels at extracting data from an HTML document.
There are many examples available showing how to use the API: http://jsoup.org/cookbook/extracting-data/selector-syntax
The process has two steps:
1. parse the page and find the URL of the iframe
2. parse the content of the iframe and extract the information you need
The code would look like this:
// let's find the iframe
Document document = Jsoup.parse(inputstream, "iso-8859-1", url);
Elements elements = document.select("iframe");
Element iframe = elements.first();
// now load the iframe
URL iframeUrl = new URL(iframe.absUrl("src"));
document = Jsoup.parse(iframeUrl, 15000);
// extract the div and read the text of the label inside it
Element div = document.getElementById("number_forecast");
String value = div.text();
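
If the extracted label text needs to become a number, note that a value like "9,000" contains a thousands separator, which Integer.parseInt rejects. A small sketch of parsing it with NumberFormat follows. It assumes US-style grouping (comma as the thousands separator), which the question does not actually state, and the class name ForecastValue is made up for illustration.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class ForecastValue {

    // Parses label text such as "9,000" into an int.
    // Assumption: the page uses US-style grouping (comma = thousands separator).
    static int parse(String labelText) {
        NumberFormat format = NumberFormat.getIntegerInstance(Locale.US);
        try {
            return format.parse(labelText.trim()).intValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + labelText, e);
        }
    }

    public static void main(String[] args) {
        // "9,000" is the sample value from the question's markup.
        System.out.println(parse("9,000"));
    }
}
```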
In your page that contains the iframe, change the source of the iframe to your own URL. This URL will be handled by your own controller, which reads the content, parses it, extracts everything you need and writes it to the response. If the references inside the iframe are absolute, this should work.
