I am continuing work on a project that I've been at for some time now, and I have been struggling to pull some data from a website. The website has an iframe that pulls in some data from an unknown source. The data is in the iframe in a tag something like this:
<DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV>
There is a BUNCH of other crap above it but this div id / label is totally unique and is not used anywhere else in the code.
jsoup is probably what you want; it excels at extracting data from an HTML document.
There are many examples available showing how to use the API: http://jsoup.org/cookbook/extracting-data/selector-syntax
The process takes two steps:
1. Parse the page and find the URL of the iframe.
2. Parse the content of the iframe and extract the information you need.
The code would look like this:
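import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.net.URL;

// find the iframe on the outer page (url is the page address as a String)
Document document = Jsoup.parse(inputstream, "iso-8859-1", url);
Element iframe = document.select("iframe").first();
if (iframe == null) {
    throw new IllegalStateException("no iframe found on " + url);
}
// now load the iframe; absUrl resolves a relative src against url
URL iframeUrl = new URL(iframe.absUrl("src"));
document = Jsoup.parse(iframeUrl, 15000); // 15 second timeout
// extract the div, then the text of the label inside it
Element div = document.getElementById("number_forecast");
String value = div.select("#lblDay").text();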
In your page that contains the iframe, change the iframe's source to your own URL. That URL is handled by your own controller, which reads the original content, parses it, extracts what you need, and writes it to the response. If the references inside the iframe content are absolute, this should work.
I'm using jsoup to parse all the HTML from this website: news
I can fetch the title and description by selecting the elements I need, but I can't find an element to select for the video URL. How can I get the video link with jsoup or another library? Thanks!
Maybe I misunderstood your question, but can't you search for <video> elements using jsoup?
A <video> element usually carries its URL in a src attribute, either on the element itself or on a nested <source> element.
Maybe try something like this?
// HTML from your webpage
final var html = "this should hold your HTML";
// Deconstruct into element objects
final var document = Jsoup.parse(html);
// Use CSS to select the first <video> element
final var videoElement = document.select("video").first();
// Grab the video's URL by fetching the "src" attribute
final var src = videoElement.attr("src");
I did not thoroughly check the website you linked, but some websites insert videos using JavaScript. If this website inserts the video tag after loading, you might be out of luck, as jsoup does not run JavaScript. It only sees the initial HTML fetched from the page.
jsoup is an HTML parser: it parses the HTML the server sends, not markup that scripts generate later in the browser.
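A slightly more defensive variant of the snippet above can be sketched as follows. It checks both places a src may live and returns null instead of throwing when the static HTML contains no video. The class name VideoSrc, the method find, and the example.com addresses are made up for illustration; only the jsoup calls (Jsoup.parse, selectFirst, absUrl) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class VideoSrc {
    // Returns the first video URL found in the static HTML, or null if there is none.
    static String find(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        // src may sit on <video> itself or on a nested <source> element
        Element el = doc.selectFirst("video[src], video source[src]");
        // absUrl resolves a relative src against baseUri
        return el == null ? null : el.absUrl("src");
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the fetched page
        String html = "<video controls><source src=\"/media/clip.mp4\" type=\"video/mp4\"></video>";
        System.out.println(find(html, "http://example.com/news"));
    }
}
```

If find returns null for a page that plays a video in the browser, that is a hint the tag is being added by JavaScript after load.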
I'm trying to get the site's scripts using 'Jsoup.connect(url).get().html()', but the script I want doesn't appear in the output. Does anyone know how I can get this script?
Script I want to get
It doesn't appear in the source because that video is inside an iframe. That iframe has its own src attribute (visible on your screenshot). Try getting that page instead.
EDIT:
Get the first page and parse it. Then select the iframe's src, and once you have that second URL, do the same again: get the page and parse it.
// absUrl resolves a relative src against the page URL; selectFirst returns null if nothing matches
String iframeUrl = Jsoup.connect(url).get().selectFirst("#option-1 iframe").absUrl("src");
System.out.println(iframeUrl);
Document document = Jsoup.connect(iframeUrl).get();
System.out.println(document.html());
I want to be able to get the list of all URLs that a browser sends a GET request for when we open a page. For example, if we open cnn.com, the first HTTP response references multiple URLs, which the browser then requests in turn.
I'm not trying to render a page, but I'm trying to obtain a list of all the URLs that are requested when a page is rendered. Doing a simple scan of the HTTP response content wouldn't be sufficient, as there could potentially be images in the CSS which are downloaded. Is there any way I can do this in Java?
My question is similar to this question, but I want to write this in Java.
You can use the jsoup library to extract all the links from a web page, e.g.:
Document document = Jsoup.connect("http://google.com").get();
Elements links = document.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href"));
}
Here's the documentation.
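Note that a[href] selects hyperlinks, which a browser only follows on a click. A rough sketch that is closer to the question collects the resources referenced directly in the markup instead. It does not cover images referenced from CSS files or anything requested by JavaScript, so it is only a partial answer. The class name ResourceUrls, the method collect, and the example.com addresses are illustrative assumptions.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ResourceUrls {
    // Collects absolute URLs of resources referenced directly in the markup.
    static Set<String> collect(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Set<String> urls = new LinkedHashSet<>();
        // img, script, iframe, video, source and embed all use a src attribute
        for (Element e : doc.select("[src]")) {
            String u = e.absUrl("src");
            if (!u.isEmpty()) urls.add(u);
        }
        // stylesheets, icons and similar are declared with <link href>
        for (Element e : doc.select("link[href]")) {
            String u = e.absUrl("href");
            if (!u.isEmpty()) urls.add(u);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical page content; with a live page use Jsoup.connect(url).get() instead
        String html = "<html><head><link rel=\"stylesheet\" href=\"/css/site.css\">"
                + "<script src=\"js/app.js\"></script></head>"
                + "<body><img src=\"http://static.example.com/logo.png\">"
                + "<a href=\"/about\">About</a></body></html>";
        for (String u : collect(html, "http://example.com/")) {
            System.out.println(u);
        }
    }
}
```

The <a href="/about"> link is deliberately left out of the result, since opening the page does not request it.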
I have a table in an HTML page. I need to iterate through its rows and open each link, which leads to another page holding the full information. On that page I extract the data I need and then return to the original page.
How do I change pages with jsoup in Java? Is it actually possible?
If you look at the jsoup Cookbook, it has an example of getting all the links inside an HTML element. Iterate over the Elements from that example and, for each one, call Document doc = Jsoup.connect(<url from Elements>).get();. You can then call String htmlFromLink = doc.toString(); to get the HTML of the linked page.
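Put together, that loop might look like the sketch below. The selector table a[href], the class name TableLinkFollower, and the example.com listing address are assumptions; adjust the selector to the real table. There is no "going back" step: each linked page is simply fetched as its own Document while the listing Document stays in memory.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableLinkFollower {
    // Returns the absolute URL of every link found inside a table.
    static List<String> tableLinks(Document listing) {
        List<String> urls = new ArrayList<>();
        for (Element a : listing.select("table a[href]")) {
            String u = a.absUrl("href"); // resolved against the listing page's URL
            if (!u.isEmpty()) urls.add(u);
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical listing page; replace with the real address
        Document listing = Jsoup.connect("http://example.com/list").get();
        for (String url : tableLinks(listing)) {
            Document detail = Jsoup.connect(url).get(); // one GET per linked page
            System.out.println(url + " -> " + detail.title());
        }
    }
}
```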
I am looking for all of the images on a given website.
For this purpose I need to find the ones that are referenced from the CSS, for example:
.gk-crop {
    background-image: url("../images/style1/g_rss-2.png");
}
Now my question is: how can I get all of these URLs with jsoup?
So far I've tried the following:
Document doc = Jsoup.connect(url).get();
Elements imagePath = doc.select("[src]");
imagePath.select("*[style*='background-image']");
but so far no luck.
Does anyone know how I can achieve this?
jsoup doesn't parse CSS files.
Have a look at this to see what jsoup is responsible for.
You need a separate CSS parser to extract URLs from CSS files. Have a look at this
Just like Niranjan mentioned, jsoup is a parser for HTML (and XML), not for CSS. If you really need to extract images from CSS, you will need a third-party library for that purpose, or you can write a simple regex that grabs URLs from the CSS file; it is still plain text, after all. That is not a flexible solution to your problem, but it would be the fastest one :)
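The regex route can be sketched with the standard library alone. This assumes the CSS text has already been downloaded into a String. The class name CssUrlExtractor, the pattern, and the example.com stylesheet address are illustrative, and the pattern is a deliberately simple approximation of the url(...) syntax rather than a full CSS parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssUrlExtractor {
    // Matches url(...) with optional single or double quotes around the value.
    private static final Pattern URL_PATTERN =
            Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Returns every url(...) value in the CSS, resolved against the stylesheet's own address.
    static List<String> extract(String css, String stylesheetUrl) {
        URI base = URI.create(stylesheetUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(css);
        while (m.find()) {
            String raw = m.group(2).trim();
            if (raw.isEmpty() || raw.startsWith("data:")) continue; // skip inline data URIs
            // resolve throws IllegalArgumentException on malformed values such as unescaped spaces
            urls.add(base.resolve(raw).toString());
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = ".gk-crop {\n  background-image: url(\"../images/style1/g_rss-2.png\");\n}";
        // Hypothetical stylesheet location, used only to resolve the relative path
        System.out.println(extract(css, "http://example.com/css/style.css"));
    }
}
```

Relative paths in a stylesheet are relative to the stylesheet file, not to the HTML page, which is why the method takes the stylesheet's URL as the base.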
If you want to select the URLs of all the images on a website, you can select all the image tags and then get their absolute URLs.
Example:
String url = "http://www.bbc.co.uk";
Document doc = Jsoup.connect(url).get();
Elements titles = doc.select("img");
for (Element e : titles) {
    // absUrl resolves a relative src against the page URL
    System.out.println(e.absUrl("src"));
}
which grabs all the <img> elements and prints their absolute URLs, for example:
http://sa.bbc.co.uk/bbc/bbc/s?name=SET-COUNTER&pal_route=index&ml_name=barlesque&app_type=web&language=en-GB&ml_version=0.16.1&pal_webapp=wwhp&blq_s=3.5&blq_r=3.5&blq_v=default-worldwide
http://static.bbci.co.uk/frameworks/barlesque/2.50.2/desktop/3.5/img/blq-blocks_grey_alpha.png
http://static.bbci.co.uk/frameworks/barlesque/2.50.2/desktop/3.5/img/blq-search_grey_alpha.png
http://news.bbcimg.co.uk/media/images/69139000/jpg/_69139104_69139103.jpg
http://news.bbcimg.co.uk/media/images/69134000/jpg/_69134575_waynerooney1.jpg
If you only want the .jpg files, narrow the selector:
Elements titles = doc.select("img[src$=.jpg]");
which matches only the <img> elements whose src ends in .jpg.