I need to write code that will get all the links in a website recursively. Since I'm new to this, here is what I've got so far:
List<WebElement> no = driver.findElements(By.tagName("a"));
int nooflinks = no.size();
for (WebElement pagelink : no) {
    String linktext = pagelink.getText();
    String link = pagelink.getAttribute("href");
}
What I need now is this: if the loop finds a link on the same domain, it should get all the links from that URL, then return to the previous loop and resume from the next link. This should go on until the last URL in the whole website has been found. For example, say the home page is the base URL and it has 5 URLs to other pages; after getting the first of those 5 URLs, the loop should get all the links of that first URL, return to the home page, and resume from the second URL. If the second URL has sub-sub-URLs, the loop should find the links for those first, then resume with the second URL, and then go back to the home page and resume from the third URL.
Can anybody help me out here?
I saw this post recently. I don't know if you are still looking for a solution to this problem; if you are, I thought this might be useful:
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.Iterator;
public class URLReading {
public static void main(String[] args) {
try {
String url="";
HashMap<String, String> h = new HashMap<>();
Url = "https://abidsukumaran.wordpress.com/";
Document doc = Jsoup.connect(url).get();
// Page Title
String title = doc.title();
//System.out.println("title: " + title);
// Links in page
Elements links = doc.select("a[href]");
List<String> url_array = new ArrayList<>();
int i=0;
url_array.add(url);
String root = url;
h.put(url, title);
Iterator<String> keySetIterator = h.keySet().iterator();
while (i < url_array.size()) {
try{
url = url_array.get(i);
doc = Jsoup.connect(url).get();
title = doc.title();
links = doc.select("a[href]");
for (Element link : links) {
String res= h.putIfAbsent(link.attr("href"), link.text());
if (res==null){
url_array.add(link.attr("href"));
System.out.println("\nURL: " + link.attr("href"));
System.out.println("CONTENT: " + link.text());
}
}
}catch(Exception e){
System.out.println("\n"+e);
}
i++;
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
You can use Set and HashSet. You may try something like this, where extractLinks stands in for whatever method you use to pull the links from a single page:
Set<String> getLinksFromSite(int level, Set<String> links) {
    if (level < 5) {
        Set<String> localLinks = new HashSet<String>();
        for (String link : links) {
            // extractLinks(...) is a placeholder for your own link-extraction code
            Set<String> newLinks = extractLinks(link);
            localLinks.addAll(getLinksFromSite(level + 1, newLinks));
        }
        return localLinks;
    } else {
        return links;
    }
}
I would think the following idiom would be useful in this context:
Set<String> visited = new HashSet<>();
Deque<String> unvisited = new LinkedList<>();
unvisited.add(startingURL);
while (!unvisited.isEmpty()) {
String current = unvisited.poll();
visited.add(current);
for /* each link in current */ {
if (!visited.contains(link.url()))
unvisited.add(link.url());
}
}
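Filled in with Jsoup, that idiom might look roughly like the sketch below (the starting URL is a placeholder, and no same-domain check is applied; you would add one where new URLs are queued):
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class QueueCrawler {
    public static void main(String[] args) {
        String startingURL = "https://example.com/"; // placeholder start page
        Set<String> visited = new HashSet<>();
        Deque<String> unvisited = new ArrayDeque<>();
        unvisited.add(startingURL);
        while (!unvisited.isEmpty()) {
            String current = unvisited.poll();
            if (!visited.add(current)) {
                continue; // already crawled this URL
            }
            try {
                Document doc = Jsoup.connect(current).get();
                System.out.println(current + " : " + doc.title());
                for (Element link : doc.select("a[href]")) {
                    String url = link.absUrl("href"); // resolves relative hrefs
                    if (!url.isEmpty() && !visited.contains(url)) {
                        unvisited.add(url);
                    }
                }
            } catch (IOException e) {
                System.err.println("Could not fetch " + current + ": " + e.getMessage());
            }
        }
    }
}
Using the Deque as a FIFO queue gives breadth-first order; switching to push() and pop() would make it depth-first, which is closer to the order described in the question.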
I have the code below to fetch the pages inside a given URL, but I am not sure how to display them in a tree-like structure.
import java.io.IOException;
import java.util.HashSet;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class BasicWebCrawler {
private HashSet<String> links;
public BasicWebCrawler() {
links = new HashSet<String>();
}
public void getPageLinks(String URL) {
//4. Check if you have already crawled the URLs
//(we are intentionally not checking for duplicate content in this example)
if (!links.contains(URL)) {
try {
//4. (i) If not add it to the index
if (links.add(URL)) {
System.out.println(URL);
}
//2. Fetch the HTML code
Document document = Jsoup.connect(URL).get();
//3. Parse the HTML to extract links to other URLs
Elements linksOnPage = document.select("a[href^=\"" +URL+ "\"]");
//5. For each extracted URL... go back to Step 4.
for (Element page : linksOnPage) {
getPageLinks(page.attr("abs:href"));
}
} catch (IOException e) {
System.err.println("For '" + URL + "': " + e.getMessage());
}
}
}
public static void main(String[] args) {
//1. Pick a URL from the frontier
new BasicWebCrawler().getPageLinks("https://www.wikipedia.com/");
}
}
Okay, I think I managed to do what you asked. The recursion finishes when all links on the site have been checked or a page has no links, but on the open internet that practically never happens; it's funny where you can end up from one site just by following the first unchecked link:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.HashSet;
public class BasicWebCrawler {
private HashSet<String> links;
public BasicWebCrawler() {
links = new HashSet<String>();
}
public void getPageLinks(String URL, int level) {
//4. Check if you have already crawled the URLs
//(we are intentionally not checking for duplicate content in this example)
if (!links.contains(URL)) {
try {
//4. (i) If not add it to the index
if (links.add(URL)) {
for(int i = 0; i < level; i++) {
System.out.print("-");
}
System.out.println(URL);
}
//2. Fetch the HTML code
Document document = Jsoup.connect(URL).get();
//3. Parse the HTML to extract links to other URLs
Elements linksOnPage = document.select("a[href]");
//5. For each extracted URL... go back to Step 4.
for (Element page : linksOnPage) {
getPageLinks(page.attr("abs:href"), level + 1);
}
} catch (IOException e) {
System.err.println("For '" + URL + "': " + e.getMessage());
}
}
}
public static void main(String[] args) {
//1. Pick a URL from the frontier
new BasicWebCrawler().getPageLinks("http://mysmallwebpage.com/", 0);
}
}
With the script below I have to validate links on a page. Here is the twist: I need to validate the links on this page, then follow each link and validate the links on that page as well, but I need to exclude links that were already validated on the first page. I really do not know how to do this. I can get as far as clicking a link and validating the links on that page, but what code should I use to exclude the ones that were already validated? Please help if you can. Thanks.
package siteLinks;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Iterator;
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
public class LinksValidation {
private static WebDriver driver = null;
public static void main(String[] args) {
// TODO Auto-generated method stub
String homePage = "http://www.safeway.com/Shopstores/Site-Map.page";
String url = "http://www.safeway.com/Shopstores/Site-Map.page";
HttpURLConnection huc = null;
int respCode = 200;
System.setProperty("webdriver.chrome.driver", "C:\\Users\\aaarb00\\Desktop\\Quotients\\lib\\chromedriver.exe");
driver = new ChromeDriver();
driver.manage().window().maximize();
driver.get(homePage);
List<WebElement> links = driver.findElements(By.tagName("a"));
Iterator<WebElement> it = links.iterator();
while(it.hasNext()){
url = it.next().getAttribute("href");
System.out.println(url);
if(url == null || url.isEmpty()){
System.out.println("URL is either not configured for anchor tag or it is empty");
continue;
}
try {
huc = (HttpURLConnection)(new URL(url).openConnection());
huc.setRequestMethod("HEAD");
huc.connect();
respCode = huc.getResponseCode();
if(respCode >= 400){
System.out.println(url+" is a broken link");
}
else{
System.out.println(url+" is a valid link");
}
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
driver.quit();
}
}
You can store the links you've already visited in an ArrayList and check whether that ArrayList contains the link already.
ArrayList<String> visitedLinks = new ArrayList<String>();
List<WebElement> elements = driver.findElements(By.tagName("a"));
for(WebElement element : elements) {
if(visitedLinks.contains(element.getAttribute("href"))) {
System.out.println("Link already checked. Not checking.");
} else {
visitedLinks.add(element.getAttribute("href"));
// Your link checking code
}
}
I'm not sure how you're collecting the links from the pages you check for a 200 OK response, but you should probably define the URL of each page whose links you want to check and then loop through those URLs. Otherwise you're likely to leave the site you're checking and escape out onto the wider internet, in which case your test will probably never finish.
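A rough sketch of that approach, reusing the driver and homePage from the script above and treating any link that starts with the site's base address as in scope (the base prefix here is an assumption; adjust it to however the site's URLs are structured):
String base = "http://www.safeway.com"; // assumed prefix for in-scope pages
ArrayList<String> visitedLinks = new ArrayList<String>();
ArrayList<String> pagesToCheck = new ArrayList<String>();
pagesToCheck.add(homePage);
for (int i = 0; i < pagesToCheck.size(); i++) {
    driver.get(pagesToCheck.get(i));
    for (WebElement element : driver.findElements(By.tagName("a"))) {
        String href = element.getAttribute("href");
        if (href == null || href.isEmpty() || visitedLinks.contains(href)) {
            continue; // skip empty links and links already validated
        }
        visitedLinks.add(href);
        // ... your HttpURLConnection status check goes here ...
        if (href.startsWith(base)) {
            pagesToCheck.add(href); // only follow links that stay on the site
        }
    }
}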
I am writing a web crawler program using the Jsoup library. (Sorry, I cannot post my code because it is too long to post here.) I need to crawl only URLs that can lead me to new links, skipping URLs that start with http or https and end with image files, PDF, RAR or ZIP files. I just need to crawl URLs that end with .html, .htm, .jsp, .php, .asp, etc.
I have two questions regarding this issue:
1- How can I prevent the program from reading unneeded URLs (such as images, PDFs or RARs)?
2- How can I improve this class so that it doesn't waste time loading the whole URL content into memory and then parsing the URLs out of it?
This is my code below:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Writer;
import java.math.BigInteger;
import java.util.Formatter;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.security.*;
import java.nio.file.Path;
import java.nio.file.Paths;
public class HTMLParser {
private static final int READ_TIMEOUT_IN_MILLISSECS = (int) TimeUnit.MILLISECONDS.convert(30, TimeUnit.SECONDS);
private static HashMap <String, Integer> filecounter = new HashMap<> ();
public static List<LinkNodeLight> parse(LinkNode inputLink){
List<LinkNodeLight> outputLinks = new LinkedList<>();
try {
inputLink.setIpAdress(IpFromUrl.getIp(inputLink.getUrl()));
String url = inputLink.getUrl();
if (inputLink.getIpAdress() != null) {
url = url.replace(URLWeight.getHostName(url), inputLink.getIpAdress());
}
Document parsedResults = Jsoup
.connect(url)
.timeout(READ_TIMEOUT_IN_MILLISSECS)
.userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
.get();
inputLink.setSize(parsedResults.html().length());
/* IP address moved here in order to speed up the process */
inputLink.setStatus(LinkNodeStatus.OK);
inputLink.setDomain(URLWeight.getDomainName(inputLink.getUrl()));
if (true) {
/* save the file to the html */
String filename = parsedResults.title();//digestBig.toString(16) + ".html";
if (filename.length() > 24) {
filename = filename.substring(0, 24);
}
filename = filename.replaceAll("[^\\w\\d\\s]", "").trim();
filename = filename.replaceAll("\\s+", " ");
if (!filecounter.containsKey(filename)) {
filecounter.put(filename, 1);
} else {
Integer tmp = filecounter.remove(filename);
filecounter.put(filename, tmp + 1);
}
filename = filename + "-" + (filecounter.get(filename)).toString() + ".html";
filename = Paths.get("downloads", filename).toString();
inputLink.setFileName(filename);
/* write the page to disk, prefixed with its source URL */
try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(filename)))) {
out.println("<!--" + inputLink.getUrl() + "-->");
out.print(parsedResults.html());
out.flush();
out.close();
} catch (IOException e) {
e.printStackTrace();
}
}
String tag;
Elements tagElements;
List<LinkNode> result;
tag = "a[href";
tagElements = parsedResults.select(tag);
result = toLinkNodeObject(inputLink, tagElements, tag);
outputLinks.addAll(result);
tag = "area[href";
tagElements = parsedResults.select(tag);
result = toLinkNodeObject(inputLink, tagElements, tag);
outputLinks.addAll(result);
} catch (IOException e) {
inputLink.setParseException(e);
inputLink.setStatus(LinkNodeStatus.ERROR);
}
return outputLinks;
}
static List<LinkNode> toLinkNodeObject(LinkNode parentLink, Elements tagElements, String tag) {
List<LinkNode> links = new LinkedList<>();
for (Element element : tagElements) {
if(isFragmentRef(element)){
continue;
}
String absoluteRef = String.format("abs:%s", tag.contains("[") ? tag.substring(tag.indexOf("[") + 1, tag.indexOf("]")) : "href");
String url = element.attr(absoluteRef);
if(url!=null && url.trim().length()>0) {
LinkNode link = new LinkNode(url);
link.setTag(element.tagName());
link.setParentLink(parentLink);
links.add(link);
}
}
return links;
}
static boolean isFragmentRef(Element element){
String href = element.attr("href");
return href!=null && (href.trim().startsWith("#") || href.startsWith("mailto:"));
}
}
To add another option to Pshemo's answer for your first question: you may want to compare against a regex so that you don't even take the element and put it in the list.
In the method "static List toLinkNodeObject", try something like "^https?://.*(?<!\.(pdf|rar|zip))$" (an http or https URL that does not end in .pdf, .rar or .zip) and match your URL against the regex. This will speed up the program too, because those links are never added for parsing in the first place.
String url = element.attr(absoluteRef);
if(url!=null && url.trim().length()>0
&& url.matches("[http].+[^(pdf|rar|zip)]")) {
LinkNode link = new LinkNode(url);
link.setTag(element.tagName());
link.setParentLink(parentLink);
links.add(link);
}
As for speeding up the class as a whole, it would help to multithread the downloading and parsing, so that multiple threads fetch and validate pages in parallel.
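A minimal sketch of that idea, using a fixed thread pool and a thread-safe visited set (the start URL, pool size and fixed run window are placeholders, and your LinkNode bookkeeping would slot into the crawl method):
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class ParallelCrawler {
    private static final Set<String> visited = ConcurrentHashMap.newKeySet();
    private static final ExecutorService pool = Executors.newFixedThreadPool(8);
    public static void main(String[] args) throws InterruptedException {
        submit("https://example.com/"); // placeholder start URL
        Thread.sleep(60_000);           // let the crawl run for a fixed window...
        pool.shutdownNow();             // ...then stop; a real crawler would track outstanding tasks instead
    }
    private static void submit(String url) {
        // visited.add is atomic, so each URL is handed to the pool at most once
        if (visited.add(url)) {
            pool.submit(() -> crawl(url));
        }
    }
    private static void crawl(String url) {
        try {
            // Download and parse happen on a worker thread, so pages are fetched in parallel
            Document doc = Jsoup.connect(url).timeout(30_000).get();
            for (Element link : doc.select("a[href], area[href]")) {
                String next = link.absUrl("href");
                if (!next.isEmpty()) {
                    submit(next);
                }
            }
        } catch (IOException e) {
            System.err.println("Failed to fetch " + url + ": " + e.getMessage());
        }
    }
}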
I have some URL. I want to get all hrefs from the HTML that the URL points to, and then all hrefs from each of those hrefs (recursively). The point is that I want to set the depth of that "recursion".
For example, if depth = 1, I need only the hrefs from the HTML. If depth = 2, I need the hrefs from the HTML (call that list1) plus the hrefs from each href in list1, and so on.
Here is what I have using jsoup:
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
public class Parser {
private final static String FILE_PATH = "src/main/resources/href.txt";
private List<String> result;
private int currentDepth;
private int maxDepth;
public Parser(int maxDepth) {
result = new ArrayList<String>();
this.maxDepth = maxDepth;
}
public void parseURL(String url) throws IOException {
url = url.toLowerCase();
if (!result.contains(url)) {
Connection connection = Jsoup.connect(url);
Document document = connection.get();
Elements links = document.select("a[href]");
for (Element link : links) {
String href = link.attr("href");
result.add(href);
parseURL(link.absUrl("href"));
currentDepth++;
if (currentDepth == maxDepth)
return;
}
}
}
}
How should I fix the recursion condition to make it right?
I think you should check the depth first before calling the recursive function.
if (currentDepth >= maxDepth){
// do nothing
}else{
parseURL(...)
}
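For instance (a sketch that reuses the result and maxDepth fields of your Parser class, with result doubling as the set of hrefs already seen), you could pass the depth down as a parameter instead of incrementing a shared field:
public void parseURL(String url, int currentDepth) throws IOException {
    if (currentDepth >= maxDepth) {
        return; // depth limit reached for this branch
    }
    Connection connection = Jsoup.connect(url);
    Document document = connection.get();
    Elements links = document.select("a[href]");
    for (Element link : links) {
        String href = link.attr("href");
        if (!result.contains(href)) {
            result.add(href);
            parseURL(link.absUrl("href"), currentDepth + 1);
        }
    }
}
The initial call would then be parseURL(startUrl, 0), so with maxDepth = 1 only the hrefs of the first page are collected, with maxDepth = 2 their pages are parsed as well, and so on.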
public void parseURL(String url) throws IOException {
url = url.toLowerCase();
if (!result.contains(url)) {
Connection connection = Jsoup.connect(url);
Document document = connection.get();
Elements links = document.getElementsByAttribute("href");
// Elements links = document.select("a[href]");
for (Element link : links) {
String href = link.attr("href");
result.add(href);
parseURL(link.absUrl("href"));
currentDepth++;
if (currentDepth == maxDepth)
return;
}
}
}
You can try this in your code: the method getElementsByAttribute(String attribute) gives you all Elements that have the specified attribute.
Hi, I am relatively new to Java, but I am hoping to write a class that will find all the ALT (image) attributes in an HTML file using Jsoup. I want an error message printed if an image has no alt text, and if it does, a reminder for users to check it.
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
public class grabImages {
public static void main(String[] args) throws IOException {
File input = new File("...HTML");
Document doc = Jsoup.parse(input, "UTF-8", "file:///C:...HTML");
Elements img = doc.getElementsByTag("img");
Elements alttext = doc.getElementsByAttribute("alt");
for (Element el : img){
if(el.attr("img").contains("alt")){
System.out.println("is the alt text relevant to the image? ");
}
else { System.out.println("no alt text found on image");
}
}
}
}
I think your logic was a little off.
For example:
Here you are trying to load the 'img' attribute of the 'img' tag...
el.attr("img")
Here's my implementation of the program. You should be able to alter it for your own needs.
public class Controller {
public static void main(String[] args) throws IOException {
// Connect to website. This can be replaced with your file loading implementation
Document doc = Jsoup.connect("http://www.google.co.uk").get();
// Get all img tags
Elements img = doc.getElementsByTag("img");
int counter = 0;
// Loop through img tags
for (Element el : img) {
// If alt is empty or null, add one to counter
if(el.attr("alt") == null || el.attr("alt").equals("")) {
counter++;
}
System.out.println("image tag: " + el.attr("src") + " Alt: " + el.attr("alt"));
}
System.out.println("Number of unset alt: " + counter);
}
}
public class grabImages {
public static void main(String[] args) {
Document doc;
try {
doc = Jsoup.connect("...HTML").get();
Elements img = doc.getElementsByTag("img");
for (Element el : img){
if(el.hasAttr("alt")){
System.out.println("is the alt text relevant to the image? ");
}
else {
System.out.println("no alt text found on image");
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
el.hasAttr("alt") will give 'alt' attr is there or not.
for more informatiom
http://jsoup.org/cookbook/extracting-data/example-list-links
You can simplify this by using CSS selectors to select the img that do not have alt, rather than iterating over every img in the doc.
Document doc = Jsoup.connect(url).get();
for (Element img : doc.select("img:not([alt])"))
System.out.println("img does not have alt: " + img);