How to get navigable links to pages from a site using jsoup? - java

I am implementing a basic crawler, with the purpose of later using it in a vulnerability scanner. I am using jsoup for the connection, retrieval, and parsing of the HTML document.
I manually supply the base/root of the intended site (www.example.com) and connect.
...
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
this.htmlDocument = htmlDocument;
...
Then I retrieve all the links on the page.
...
Elements linksOnPage = htmlDocument.select("a[href]");
...
After this I loop over the links and try to collect the links to all the pages on the site.
for (Element link : linksOnPage) {
    this.links.add(link.absUrl("href"));
}
The problem is as follows: depending on the links I get, some might not be links to new pages, or not even links to pages at all. As an example, I got links like:
https://example.example.com/webmail
http://193.231.21.13
mailto:example.example#exampl.com
What I need some help with is filtering the links so that I only get links to new pages of the same root/base site.

This is easy. Check whether absUrl starts with your base URL and does not end with an image, JS, or CSS extension:
if (absUrl.startsWith("http://www.ics.uci.edu/") && !absUrl.matches(".*\\.(bmp|gif|jpg|png|js|css)$")) {
    // here absUrl starts with the domain name and is not an image, JS, or CSS file
}
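For the original question (only links to new pages of the same root/base site), here is a slightly fuller sketch: parse each candidate with java.net.URI, compare its host against the seed host, and skip non-HTTP schemes such as mailto:. The SEED_HOST constant and the visited set are illustrative assumptions, not part of the original code.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

public class LinkFilter {
    // Hypothetical seed host; replace with the base/root you supply manually
    private static final String SEED_HOST = "www.example.com";
    private final Set<String> visited = new HashSet<>();

    boolean isCrawlable(String absUrl) {
        try {
            URI uri = new URI(absUrl);
            String scheme = uri.getScheme();
            // Skip mailto:, javascript:, ftp:, and other non-page schemes
            if (!"http".equals(scheme) && !"https".equals(scheme)) return false;
            // Keep only the seed site's host (subdomains excluded here)
            if (!SEED_HOST.equalsIgnoreCase(uri.getHost())) return false;
            // Skip static assets, as in the answer above
            if (absUrl.matches(".*\\.(bmp|gif|jpg|png|js|css)$")) return false;
            // Avoid re-queueing pages we have already seen
            return visited.add(absUrl);
        } catch (URISyntaxException e) {
            return false; // malformed links are not crawlable
        }
    }
}

Exact host comparison rejects both the subdomain link (https://example.example.com/webmail) and the bare-IP link from the question; relax the check if you do want to crawl subdomains.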

Related

How to get URL of video or audio from a website with jsoup

I'm using jsoup to parse all the HTML from this website: news
I can fetch all the titles and descriptions by selecting the Elements I need, but I can't find the video URL element to select. How can I get the video link with jsoup or another kind of library? Thanks!
Maybe I misunderstood your question, but can't you search for <video> elements using JSoup?
A <video> element typically carries its URL in a so-called src attribute.
Maybe try something like this?
import org.jsoup.Jsoup;

// HTML from your webpage
final var html = "this should hold your HTML";
// Deconstruct into element objects
final var document = Jsoup.parse(html);
// Use CSS to select the first <video> element (null if the page has none)
final var videoElement = document.select("video").first();
// Grab the video's URL by fetching the "src" attribute
final var src = videoElement != null ? videoElement.attr("src") : null;
Now, I did not thoroughly check the website you linked, but some websites insert videos using JavaScript. If this website inserts a video tag after loading, you might be out of luck, as jsoup does not run JavaScript; it only operates on the initial HTML fetched from the page.
Jsoup is an HTML parser, which is why it only parses the HTML as served and not, say, HTML generated at runtime by JavaScript.
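Also note that some pages put the URL on a <source> child of <video> rather than on the <video> tag itself. A minimal sketch covering both cases (the selectors are standard jsoup CSS; the fallback order is an assumption about the page's markup):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class VideoSrcExtractor {
    static String findVideoSrc(String html) {
        Document document = Jsoup.parse(html);
        // First try a src attribute directly on <video>
        Element video = document.selectFirst("video[src]");
        if (video != null) {
            return video.attr("src");
        }
        // Fall back to a <source> child, e.g. <video><source src="...">
        Element source = document.selectFirst("video > source[src]");
        return source != null ? source.attr("src") : null;
    }
}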

Extract the list of URLs obtained during a HTML page render in Java

I want to be able to get the list of all URLs that a browser will send GET requests for when we try to open a page. For example, if we try to open cnn.com, there are multiple URLs within the first HTTP response which the browser recursively requests.
I'm not trying to render a page, but I'm trying to obtain a list of all the URLs that are requested when a page is rendered. Doing a simple scan of the HTTP response content wouldn't be sufficient, as there could potentially be images in the CSS which are downloaded. Is there any way I can do this in Java?
My question is similar to this question, but I want to write this in Java.
You can use the Jsoup library to extract all the links from a webpage, e.g.:
Document document = Jsoup.connect("http://google.com").get();
Elements links = document.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href"));
}
Here's the documentation.
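Anchor hrefs alone won't cover everything a browser fetches, though. Short of actually rendering the page (so requests initiated from JavaScript or CSS are still missed, as the question notes), you can broaden the selection to src-bearing elements and linked resources; the selector list below is an assumption about which tags matter:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class ResourceLister {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://google.com").get();
        // [src] matches images, scripts, iframes, media, etc.;
        // link[href] matches stylesheets, icons, and other linked resources
        for (Element el : doc.select("[src], link[href]")) {
            String url = el.hasAttr("src") ? el.absUrl("src") : el.absUrl("href");
            if (!url.isEmpty()) {
                System.out.println(el.tagName() + " -> " + url);
            }
        }
    }
}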

URL is not returning the correct HTML in a webpage (for my Java crawler)

I want to download some images from a webpage, and for that I was writing a crawler. I tested a couple of crawlers for this page, but none worked as I wanted.
In the first step, I collected the links of the 770+ camera models (parent_url); then I was thinking of collecting the images in each link (child_urls). However, the page is organized in such a way that the child_urls return the same HTML as the parent_url.
Here is my code to collect camera links:
public List<String> html_compiler(String url, String exp, String atr) {
    List<String> outs = new ArrayList<String>();
    try {
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select(exp);
        for (Element link : links) {
            outs.add(link.attr(atr));
            System.out.println("\nlink : " + link.attr(atr));
        }
    } catch (IOException | SelectorParseException e) {
        e.printStackTrace();
    }
    return outs;
}
With this code, I collect the links:
String expCam = "tr[class='gallery cameras'] > td[class='title'] > a[href]";
String url = "https://www.dpreview.com/sample-galleries?category=cameras";
String atr = "href";
List<String> cams = html_compiler(url, expCam, atr); // this gives me the links of the individual cameras
String exp2 = "some expression";
html_compiler(cams.get(0), exp2, "src"); // this should give me the image links of the first
                                         // camera, but the webpage returns the same HTML as above
How can I solve this problem? I'd also love to hear about other pages that classify images according to camera models (other than Flickr).
EDIT:
e.g. in Java, the following two links give the same HTML:
https://www.dpreview.com/sample-galleries?category=cameras
https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one
To understand how to get the image links, it's important to know how the page loads in a browser. When you click a gallery link, a JavaScript event handler is triggered, and the image viewer it creates then loads the images from the data server. The image links are requested via JavaScript and are thus not visible by just parsing the HTML.

The request URL for the image links is https://www.dpreview.com/sample-galleries/data/get-gallery. To get the images in a gallery, you have to add the gallery id, which is provided by the href attribute of the gallery links. The links look like https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one; in this case 2653563139 is the gallery id. Take the request URL above and append the gallery id as ?galleryId=2653563139 to get a JSON object containing all the data needed to build the gallery. Look for the url fields in the images array to get your images.
To summarize:
The link you get from the href attribute: https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one
The gallery id: 2653563139
The request URL: https://www.dpreview.com/sample-galleries/data/get-gallery
The JSON object you need: https://www.dpreview.com/sample-galleries/data/get-gallery?galleryId=2653563139
The URLs you are looking for inside the JSON object: "url":"https://3.img-dpreview.com/files/p/TS1800x1200~sample_galleries/2653563139/7864344228.jpg"
And finally your picture link: https://3.img-dpreview.com/files/p/TS1800x1200~sample_galleries/2653563139/7864344228.jpg
Comment if you want further explanation.
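Putting that together, here is a hedged sketch in Java: extract the gallery id from the href, call the get-gallery endpoint described above, and pull the url fields out of the response. The regex-based JSON scraping is an illustrative shortcut; a real implementation would use a JSON library such as org.json or Jackson.

import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GalleryImages {
    static List<String> imageUrls(String galleryHref) throws Exception {
        // e.g. https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one
        Matcher id = Pattern.compile("/sample-galleries/(\\d+)/").matcher(galleryHref);
        if (!id.find()) throw new IllegalArgumentException("no gallery id in " + galleryHref);

        // Fetch the JSON; ignoreContentType because the response is not HTML
        String json = Jsoup.connect(
                "https://www.dpreview.com/sample-galleries/data/get-gallery?galleryId=" + id.group(1))
            .ignoreContentType(true)
            .execute()
            .body();

        // Crude extraction of "url":"..." fields; prefer a real JSON parser
        List<String> urls = new ArrayList<>();
        Matcher url = Pattern.compile("\"url\":\"(.*?)\"").matcher(json);
        while (url.find()) {
            urls.add(url.group(1).replace("\\/", "/")); // unescape JSON slashes
        }
        return urls;
    }
}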

jsoup won't extract the email link, only website links

I am having trouble with jsoup: it is only extracting website links, not the email link. Here is my code:
try {
    Document doc = Jsoup.connect(url2).get();
    Elements links = doc.select("a[href]");
    for (Element web : links) {
        Log.i("websites/emails/etc.", web.attr("abs:href"));
    }
    Elements links2 = doc.select("link[href]");
    for (Element web : links2) {
        Log.i("websites/emails/etc.", web.attr("abs:href"));
    }
} catch (IOException e) {
    e.printStackTrace();
}
The logcat is only showing the website links.
Edit --
I missed that you were using Android. I tested on the JVM and your code looked good; I re-tested on Android and saw the same thing. The solution appears to be to remove the abs: qualifier from your attr call:
Log.i("websites/emails/etc.", web.attr("href"));
Original answer, which may apply to other attempts to extract mailto: links:
This is almost certainly the intended behavior from the website creator. Because mailto: links are easily scraped by spambot email harvesters, a variety of techniques are used to make the mailto: link non-obvious when you pull the raw HTML. Instead, the addresses are cleverly encoded or generated dynamically by JavaScript; see here for an example. Safari shows you the element because these techniques are designed to render correctly in the browser, even when the raw HTML looks funky. If you download the file with curl and look at the raw text, there is likely no "mailto:" link there.
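When mailto: links are actually present in the raw HTML (i.e. not obfuscated), a minimal sketch for pulling just those out with jsoup's attribute-prefix selector; the URL is a placeholder:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class MailtoExtractor {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://example.com").get(); // hypothetical URL
        // [href^=mailto] matches hrefs that start with "mailto:"
        for (Element a : doc.select("a[href^=mailto]")) {
            // Plain attr("href") keeps the mailto: value as written;
            // abs:href can come back empty for non-hierarchical schemes,
            // which matches the behavior noted in the edit above.
            System.out.println(a.attr("href").replaceFirst("^mailto:", ""));
        }
    }
}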

Open Link in HTML with JSOUP

I have a table in an HTML page that I have to iterate through, opening each link into the next page, where all the information is. On that page I extract the data I need and then return to my base page.
How do I change pages with the JSoup framework in Java? Is it actually possible?
If you look at the JSoup Cookbook, there is an example of getting all the links inside an HTML element. Iterate the Elements from this example and do a Document doc = Jsoup.connect(<url from Elements>).get();. You can then do String htmlFromLink = doc.toString(); to get the HTML from the link, as in the sketch below.
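A minimal sketch of that loop, assuming the links sit inside a <table> and using absUrl to resolve relative hrefs; the URL and selector are placeholders:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class TableFollower {
    public static void main(String[] args) throws IOException {
        Document basePage = Jsoup.connect("https://example.com/table-page").get(); // hypothetical URL
        Elements links = basePage.select("table a[href]");
        for (Element link : links) {
            // Follow each link to its detail page
            Document detail = Jsoup.connect(link.absUrl("href")).get();
            String htmlFromLink = detail.toString();
            // ... extract whatever data you need from detail, then continue the loop
            System.out.println("Fetched " + htmlFromLink.length() + " chars from " + link.absUrl("href"));
        }
    }
}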
