I am looking for all of the images on a given website.
For this I need to find the ones that are referenced in the CSS, for example:
.gk-crop {
background-image: url("../images/style1/g_rss-2.png");
}
Now my question is: how can I get all of these URLs with JSoup?
So far I've tried the following:
Document doc = Jsoup.connect(url).get();
Elements imagePath = doc.select("[src]");
imagePath.select("*[style*='background-image']");
but so far no luck.
Does anyone know how I can achieve it?
Jsoup doesn't parse CSS files.
Have a look at this to know what Jsoup is responsible for.
You need a separate CSS parser to extract URLs from CSS files. Have a look at this
As Niranjan mentioned, Jsoup is for parsing HTML, not CSS. If you really need to extract images from CSS, you will need a third-party library for that, or you can write a simple regex that grabs the URLs from the CSS file; it's still plain text, isn't it? This is not the most flexible solution to your problem, but it would be the fastest one :)
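The regex approach could be sketched roughly as follows, using only the Java standard library. The class name CssUrlExtractor, the method extractUrls and the example.com base URL are made up for illustration; the CSS text is assumed to have been downloaded already. Relative paths such as ../images/... are resolved against the URL of the stylesheet itself, not the page. A regex like this will not cope with every CSS edge case (comments, escaped characters), so treat it as a starting point only.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssUrlExtractor {
    // Matches url(...), with optional single or double quotes around the value.
    private static final Pattern CSS_URL = Pattern.compile(
            "url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // cssText: the raw stylesheet text; stylesheetUrl: where that stylesheet was fetched from.
    static List<String> extractUrls(String cssText, String stylesheetUrl) {
        URI base = URI.create(stylesheetUrl);
        List<String> result = new ArrayList<>();
        Matcher m = CSS_URL.matcher(cssText);
        while (m.find()) {
            String raw = m.group(2).trim();
            // Skip empty values and inline data: URIs, which are not downloadable files.
            if (raw.isEmpty() || raw.startsWith("data:")) {
                continue;
            }
            try {
                // Resolve relative references against the stylesheet location.
                result.add(base.resolve(raw).toString());
            } catch (IllegalArgumentException e) {
                // Not a syntactically valid URI (e.g. contains spaces); ignore it.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String css = ".gk-crop { background-image: url(\"../images/style1/g_rss-2.png\"); }";
        // Hypothetical stylesheet location, used only for this demonstration.
        for (String u : extractUrls(css, "http://example.com/css/style.css")) {
            System.out.println(u);
        }
    }
}
```

To find the stylesheets in the first place, you could select the link elements with Jsoup and read their absolute href values, then download each one as plain text.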
If you want to select the URLs of all the images on a website, you can select all the image tags and then get their absolute URLs.
Example:
String url = "http://www.bbc.co.uk";
Document doc = Jsoup.connect(url).get();
Elements images = doc.select("img");
for (Element e : images) {
// absUrl resolves a relative src against the page's base URL
System.out.println(e.absUrl("src"));
}
which will grab all the <img> elements and print their absolute URLs, such as
http://sa.bbc.co.uk/bbc/bbc/s?name=SET-COUNTER&pal_route=index&ml_name=barlesque&app_type=web&language=en-GB&ml_version=0.16.1&pal_webapp=wwhp&blq_s=3.5&blq_r=3.5&blq_v=default-worldwide
http://static.bbci.co.uk/frameworks/barlesque/2.50.2/desktop/3.5/img/blq-blocks_grey_alpha.png
http://static.bbci.co.uk/frameworks/barlesque/2.50.2/desktop/3.5/img/blq-search_grey_alpha.png
http://news.bbcimg.co.uk/media/images/69139000/jpg/_69139104_69139103.jpg
http://news.bbcimg.co.uk/media/images/69134000/jpg/_69134575_waynerooney1.jpg
If you only want the .jpg files, tell the selector so by using
Elements images = doc.select("img[src$=.jpg]");
which selects only the <img> elements whose src ends in .jpg.
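Note that the $= selector compares the end of the whole attribute value, so a src such as photo.jpg?w=300 would not match. If that matters, one option is to select every <img> and filter the absolute URLs afterwards. The following is only a sketch with the standard library; JpegFilter, isJpeg and onlyJpegs are hypothetical names, and it looks at the path part of the URL so that query strings and upper-case extensions are tolerated.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class JpegFilter {
    // True if the path component of the URL ends in .jpg or .jpeg (case-insensitive).
    static boolean isJpeg(String url) {
        try {
            String path = URI.create(url).getPath();
            if (path == null) {
                return false; // opaque URIs such as data: have no path
            }
            String lower = path.toLowerCase(Locale.ROOT);
            return lower.endsWith(".jpg") || lower.endsWith(".jpeg");
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
    }

    // Keeps only the URLs that look like JPEG files.
    static List<String> onlyJpegs(List<String> urls) {
        return urls.stream().filter(JpegFilter::isJpeg).collect(Collectors.toList());
    }
}
```

The input list would be the values returned by absUrl("src") in the loop shown earlier.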
I'm using jsoup to parse all the HTML from this website: news
I can fetch the title and description by selecting the elements I need, but I can't find the video URL element to select. How can I get the video link with jsoup or another library? Thanks!
Maybe I misunderstood your question, but can't you search for <video> elements using JSoup?
A <video> element usually carries its URL in a src attribute (or in nested <source> elements).
Maybe try something like this?
// HTML from your webpage
final var html = "this should hold your HTML";
// Deconstruct into element objects
final var document = Jsoup.parse(html);
// Use CSS to select the first <video> element
final var videoElement = document.select("video").first();
// Grab the video's URL by fetching the "src" attribute
final var src = videoElement.attr("src");
Now I did not thoroughly check the website you linked. But some websites insert videos using JavaScript. If this website inserts a video tag after loading, you might be out of luck as Jsoup does not run JavaScript. It only runs on the initial HTML fetched from the page.
Jsoup is an HTML parser, which is why it only sees the HTML the server sends and not, say, HTML generated later by JavaScript.
I'm trying to get the image URL from the src attribute of an <img> tag.
E.g. I have this HTML from Facebook:
<img class="profilePic img" alt="Facebook Developers" src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xpf1/v/t1.0-1/p320x320/9988_10151403325753553_1486509350_n.png?oh=ecdfcf4b449779941db77b52950843b3&oe=568F1F42&__gda__=1453778308_a1ffaea01e68e9dade86f1b11989a50d">
How can I get only image src with the class="profilePic img" attribute or class name? Any idea how do I get it? I'm using Jsoup library.
You can get all the images by calling getElementsByTag("img") and then calling select(".your_class_name") to keep only the images with the specified class (or any other query),
e.g:
Jsoup.connect("http://stackexchange.com").get().getElementsByTag("img").select(".favicon")
Try this:
Document document = Jsoup.connect("yourLink").get();
String img_url = document.select("img[class=profilePic img]").first().attr("src");
Log.d("Src image", img_url);
Remember: run this on a background thread, not the main thread :)
Jsoup's CSS selectors support selecting multiple classes through concatenation. The selectors for the individual classes are .profilePic and .img. Selecting both classes means concatenating them: .profilePic.img. So this should be your code:
document.select("img.profilePic.img")
This is better than img[class=profilePic img], because the latter will look for exactly the string "profilePic img". Classes however may appear in different order or with more spaces in the document you parse.
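The difference between the two kinds of matching can be illustrated without Jsoup at all. The sketch below is not Jsoup's implementation; ClassAttributeMatch and its two methods are hypothetical names that only mimic the idea: an exact comparison of the whole attribute string versus a check that every required class appears among the whitespace-separated tokens.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ClassAttributeMatch {
    // Mimics the idea behind [class=profilePic img]: the whole attribute value must be equal.
    static boolean matchesExactly(String classAttr, String expected) {
        return classAttr.equals(expected);
    }

    // Mimics the idea behind .profilePic.img: every required class must be one of the tokens.
    static boolean hasAllClasses(String classAttr, String... required) {
        Set<String> tokens = new HashSet<>(Arrays.asList(classAttr.trim().split("\\s+")));
        return tokens.containsAll(Arrays.asList(required));
    }

    public static void main(String[] args) {
        String reordered = "img  profilePic"; // same classes, different order, extra space
        System.out.println(matchesExactly(reordered, "profilePic img"));
        System.out.println(hasAllClasses(reordered, "profilePic", "img"));
    }
}
```

With the reordered attribute value, the exact comparison fails while the token check still succeeds, which is why the class selector is the more robust choice.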
To get to the src attributes of all img elements you need to loop over the results:
Elements imgs = document.select("img.profilePic.img");
for (Element img : imgs){
String srcStr = img.attr("src");
// do whatever you need to do with srcStr
}
I have a table in an HTML page that I have to iterate through, opening each link to a second page where all the information is. On that page I extract the data I need and then return to my base page.
How do I change pages with the framework JSoup in Java? Is it actually possible?
If you look at the JSoup Cookbook, they have an example of getting all the links inside of an HTML element. Iterate the Elements from this example and do a Document doc = Jsoup.connect(<url from Elements>).get();. You can then do String htmlFromLink = doc.toString(); and get the HTML from the link.
The problem is that some of the URLs I am trying to pull from have JavaScript slideshow containers that haven't loaded yet. I want to get the images in the slideshow, but since it hasn't loaded yet, the element isn't found. Is there any way to do this? This is what I have so far:
Document doc = Jsoup.connect("http://news.nationalgeographic.com/news/2013/03/pictures/130316-gastric-brooding-frog-animals-weird-science-extinction-tedx/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+ng%2FNews%2FNews_Main+%28National+Geographic+News+-+Main%29").get();
Elements jpg = doc.select("img[src$=.jpg]");
jsoup can't execute JavaScript, but you can use an additional library for this:
Parse JavaScript with jsoup
Trying to parse html hidden by javascript
I am continuing work on a project that I've been at for some time now, and I have been struggling to pull some data from a website. The website has an iframe that pulls in some data from an unknown source. The data is in the iframe in a tag something like this:
<DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV>
There is a BUNCH of other crap above it but this div id / label is totally unique and is not used anywhere else in the code.
jsoup is probably what you want, it excels at extracting data from an HTML document.
There are many examples available showing how to use the API: http://jsoup.org/cookbook/extracting-data/selector-syntax
The process has two steps:
1. parse the page and find the URL of the iframe
2. parse the content of the iframe and extract the information you need
The code would look like this:
// let's find the iframe
Document document = Jsoup.parse(inputstream, "iso-8859-1", url);
Elements elements = document.select("iframe");
Element iframe = elements.first();
// now load the iframe
URL iframeUrl = new URL(iframe.absUrl("src"));
document = Jsoup.parse(iframeUrl, 15000);
// extract the div
Element div = document.getElementById("number_forecast");
In your page that contains the iframe, change the iframe's source to your own URL. This URL will be handled by your own controller, which reads the content, parses it, extracts everything you need, and writes it to the response. If the references in your iframe are absolute, this should work.