For exercise I want to make my own web crawler, but I have a problem with the recursive invocation of my crawl method. It should run for every link in my links array and keep going until I decide to abort the whole program, but it only ever follows the first element of that array, so it simply goes back and forth without any progress. How can I fix this?
Crawler.java
package regularmikey.mikecrawler;
import java.io.IOException;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Crawler implements Runnable {

    private Elements links;
    private Document doc;
    private String start_url;

    public Crawler() {}

    public Crawler(String url) {
        start_url = url;
    }

    public void crawl(String url) {
        try {
            System.out.println(url);
            doc = Jsoup.connect(url).get();
            String title = doc.title();
            System.out.println("title : " + title);

            links = doc.select("a[href]");
            for (Element link : links) {
                if (AdressValidator.validAddress(link.attr("href"))) {
                    crawl(link.attr("href"));
                }
            }
        } catch (org.jsoup.UnsupportedMimeTypeException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void run() {
        crawl(start_url);
    }
}
App.java
package regularmikey.mikecrawler;
public class App {

    public static void main(String[] args) {
        Thread thread = new Thread(new Crawler("http://facebook.com"));
        thread.run();
    }
}
You can keep a List of the URLs that you have already visited.
private List<String> urls = new ArrayList<String>();

// some code

for (Element link : links) {
    if (!urls.contains(link.attr("abs:href"))) {
        urls.add(link.attr("abs:href"));
        crawl(link.attr("abs:href"));
    }
}
EDIT: completed with @PallyP's answer.
Try changing your
crawl(link.attr("href"))
to
crawl(link.attr("abs:href"))
Adding the abs: prefix will return the absolute URL (e.g. "http://facebook.com")
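As a quick illustration (the /careers href below is made up), assuming doc was fetched with Jsoup.connect("http://facebook.com").get() as in the question:

// Suppose the fetched page contains: <a href="/careers">Jobs</a>
Element link = doc.select("a[href]").first();
System.out.println(link.attr("href"));     // "/careers" (relative, useless for the next connect)
System.out.println(link.attr("abs:href")); // "http://facebook.com/careers" (resolved against the base URI)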
The private members of class Crawler are overwritten by each (recursive) call to crawl():
private Elements links;
private Document doc;

public void crawl(String url) {
    try {
        // ...
        doc = Jsoup.connect(url).get();
        links = doc.select("a[href]");
        crawl(link.attr("href"));
    }
}
This means, if a recursive call to crawl() returns, links and doc are not restored to their previous values.
This should be fixed first by using local variables for links and doc inside crawl().
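A minimal sketch of crawl() rewritten along these lines, with doc and links as locals plus a visited set (needs java.util.HashSet and java.util.Set); AdressValidator is the helper from the question and the abs:href change from the other answers is folded in:

private final Set<String> visited = new HashSet<>();

public void crawl(String url) {
    if (!visited.add(url)) {
        return; // already crawled, stop the recursion here
    }
    try {
        // local variables, so recursive calls no longer overwrite each other's state
        Document doc = Jsoup.connect(url).get();
        System.out.println(url + " : " + doc.title());

        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String next = link.attr("abs:href"); // absolute URL
            if (AdressValidator.validAddress(next)) {
                crawl(next);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}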
Related
I want to extend the functionality of EditableFragmentEntryProcessor in Liferay 7.4 (the <lfr-editable> tags in fragments) by searching the text for syntax like {user.name} and replacing it with a value taken from the response of my external API.
For example, I type something like
This is super fragment and you are {user.name}.
And the result should be
This is super fragment and you are Steven.
I achieved that by creating my own FragmentEntryProcessor, but I did it by putting a fragment configuration variable in my custom tag:
<my-data-api> ${configuration.testVariable} </my-data-api>
I tried something like this before
<my-data-api>
<lfr-editable id="some-id" type="text">
some text to edit
</lfr-editable>
</my-data-api>
And it doesn't work (and I know why).
So I want to get something like this working. I appreciate any help or hints.
EDIT:
Here is my custom FragmentEntryProcessor:
package com.example.fragmentEntryProcessorTest.portlet;
import com.example.test.api.api.TestPortletApi;
import com.liferay.fragment.exception.FragmentEntryContentException;
import com.liferay.fragment.model.FragmentEntryLink;
import com.liferay.fragment.processor.FragmentEntryProcessor;
import com.liferay.fragment.processor.FragmentEntryProcessorContext;
import com.liferay.portal.kernel.exception.PortalException;
import com.liferay.portal.kernel.util.Validator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import java.io.IOException;
/**
 * @author kabatk
 */
@Component(
    immediate = true, property = "fragment.entry.processor.priority:Integer=100",
    service = FragmentEntryProcessor.class
)
public class FragmentEntryProcessorApiDataCopy implements FragmentEntryProcessor {

    private static final String _TAG = "my-data-api";

    @Reference
    private TestPortletApi _api;

    @Override
    public String processFragmentEntryLinkHTML(
            FragmentEntryLink fragmentEntryLink, String html,
            FragmentEntryProcessorContext fragmentEntryProcessorContext)
        throws PortalException {

        Document document = _getDocument(html);
        Elements elements = document.getElementsByTag(_TAG);

        elements.forEach(
            element -> {
                String text = element.text();
                String attrValue = element.attr("dataType");
                String classValues = element.attr("classes");
                Element myElement = null;
                String result;

                try {
                    result = _api.changeContent(text);
                } catch (IOException e) {
                    e.printStackTrace();
                    result = "";
                }

                if (attrValue.equals("img")) {
                    myElement = document.createElement("img");
                    myElement.attr("class", classValues);
                    myElement.attr("src", result);
                } else if (attrValue.equals("text")) {
                    myElement = document.createElement("div");
                    myElement.attr("class", classValues);
                    myElement.html(result);
                }

                if (myElement != null) {
                    element.replaceWith(myElement);
                } else {
                    element.replaceWith(
                        document.createElement("div").text("Error"));
                }
            });

        Element bodyElement = document.body();

        return bodyElement.html();
    }

    @Override
    public void validateFragmentEntryHTML(String html, String configuration)
        throws PortalException {

        Document document = _getDocument(html);
        Elements elements = document.getElementsByTag(_TAG);

        for (Element element : elements) {
            if (Validator.isNull(element.attr("dataType"))) {
                throw new FragmentEntryContentException("Missing 'dataType' attribute!");
            }
        }
    }

    private Document _getDocument(String html) {
        Document document = Jsoup.parseBodyFragment(html);

        Document.OutputSettings outputSettings = new Document.OutputSettings();
        outputSettings.prettyPrint(false);
        document.outputSettings(outputSettings);

        return document;
    }
}
I'm trying to extract the created date of issues from the HADOOP Jira issue site (https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues).
As you can see in this screenshot, the created date is the text inside the time tag whose class is livestamp (e.g. <time class="livestamp" ...>'this text'</time>).
So I tried to parse it with the code below.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class CreatedDateExtractor {

    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        Elements elements = doc.select("time.livestamp"); // This line finds elements that match time tags with livestamp class
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e.text());
        }
    }
}
I expected the created date to be extracted, but the actual output is
# of elements : 0.
I figured something was wrong, so I tried to parse the whole HTML of that site with the code below.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class CreatedDateExtractor {

    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        Elements elements = doc.select("*"); // This line finds all elements in the html document.
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e);
        }
    }
}
I compared the HTML shown in Chrome DevTools with the HTML I parsed, element by element, and found that they are different.
Can you explain why this happens and give me some advice on how to extract the created date?
I advise you to get the elements with the "time" tag, then use select to get the time tags which have the "livestamp" class. Here is an example:
Elements timeTags = doc.select("time");
Element timeLivestamp = null;
for (Element tag : timeTags) {
    Element livestamp = tag.selectFirst(".livestamp");
    if (livestamp != null) {
        timeLivestamp = livestamp;
        break;
    }
}
I don't know why, but when I use Jsoup's .select() method with a combined selector (such as time.livestamp, as you did), I get odd results like this.
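For completeness, a minimal way to use the result of the loop above; timeLivestamp can still be null if the HTML that Jsoup receives does not contain the element, so guard for that:

if (timeLivestamp != null) {
    // the created date is the text content of the <time> element
    System.out.println("Created: " + timeLivestamp.text());
} else {
    System.out.println("No <time> element with class 'livestamp' found in the fetched HTML.");
}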
I'm new to Selenium and I would like to download all the pdf, ppt(x) and doc(x) files from a website. I have written the following code, but I'm confused about how to get the inner links:
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.io.FileUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
public class WebScraper {

    String loginPage = "https://blablah/login";
    static String userName = "11";
    static String password = "11";
    static String mainPage = "https://blahblah";
    public WebDriver driver = new FirefoxDriver();
    ArrayList<String> visitedLinks = new ArrayList<>();

    public static void main(String[] args) throws IOException {
        System.setProperty("webdriver.gecko.driver", "E:\\geckodriver.exe");

        WebScraper webSrcaper = new WebScraper();
        webSrcaper.openTestSite();
        webSrcaper.login(userName, password);
        webSrcaper.getText(mainPage);
        webSrcaper.saveScreenshot();
        webSrcaper.closeBrowser();
    }

    /**
     * Open the test website.
     */
    public void openTestSite() {
        driver.navigate().to(loginPage);
    }

    /**
     * @param username
     * @param Password Logins into the website, by entering provided username and password
     */
    public void login(String username, String Password) {
        WebElement userName_editbox = driver.findElement(By.id("IDToken1"));
        WebElement password_editbox = driver.findElement(By.id("IDToken2"));
        WebElement submit_button = driver.findElement(By.name("Login.Submit"));

        userName_editbox.sendKeys(username);
        password_editbox.sendKeys(Password);
        submit_button.click();
    }

    /**
     * Grabs the status text and saves that into status.txt file
     *
     * @throws IOException
     */
    public void getText(String website) throws IOException {
        driver.navigate().to(website);
        try {
            Thread.sleep(10000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        List<WebElement> allLinks = driver.findElements(By.tagName("a"));
        System.out.println("Total no of links Available: " + allLinks.size());

        for (int i = 0; i < allLinks.size(); i++) {
            String fileAddress = allLinks.get(i).getAttribute("href");
            System.out.println(allLinks.get(i).getAttribute("href"));
            if (fileAddress.contains("download")) {
                driver.get(fileAddress);
            } else {
                // getText(allLinks.get(i).getAttribute("href"));
            }
        }
    }

    /**
     * Saves the screenshot
     *
     * @throws IOException
     */
    public void saveScreenshot() throws IOException {
        File scrFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        FileUtils.copyFile(scrFile, new File("screenshot.png"));
    }

    public void closeBrowser() {
        driver.close();
    }
}
I have an if clause which checks whether the current link is a downloadable file (its address contains the word "download"). If it is, I download it; if not, what should I do? That part is my problem. I tried to implement a recursive function to retrieve the nested links and repeat the steps for them, but without success.
In the meantime, the first link found when giving https://blahblah as the input is https://blahblah/#, which refers to the same page as https://blahblah. It can also cause a problem, but currently I'm stuck on the other issue, namely the implementation of the recursive function. Could you please help me?
You are not far off. To answer your question: grab all the links into a list of elements, then iterate and click (and wait). In C#, something like this:
IList<IWebElement> listOfLinks = _driver.FindElements(By.XPath("//a"));
foreach (var link in listOfLinks)
{
    if (link.GetAttribute("href").Contains("download"))
    {
        link.Click();
        WaitForSecs(); // Thread.Sleep(1000)
    }
}
JAVA
List<WebElement> listOfLinks = webDriver.findElements(By.xpath("//a"));
for (WebElement link : listOfLinks) {
    if (link.getAttribute("href").contains("download")) {
        link.click();
        // WaitForSecs(); // Thread.Sleep(1000)
    }
}
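If you also want to follow the non-download links recursively (what the commented-out getText call in the question was aiming at), a rough sketch building on the question's driver, visitedLinks and mainPage fields could look like this; crawlPage is a made-up name and the wait/download handling is left as in the question:

private void crawlPage(String url) {
    if (visitedLinks.contains(url)) {
        return; // avoids loops such as https://blahblah/# pointing back to the same page
    }
    visitedLinks.add(url);
    driver.navigate().to(url);

    // collect the hrefs first: navigating away would make the WebElements stale
    List<String> hrefs = new ArrayList<>();
    for (WebElement link : driver.findElements(By.tagName("a"))) {
        String href = link.getAttribute("href");
        if (href != null) {
            hrefs.add(href);
        }
    }

    for (String href : hrefs) {
        if (href.contains("download")) {
            driver.get(href); // triggers the file download
        } else if (href.startsWith(mainPage)) {
            crawlPage(href); // recurse only into pages of the same site
        }
    }
}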
One option is to embed Groovy in your Java code if you want to search depth-first. When HTTPBuilder parses a page, it gives you an XML-like document tree, and you can then traverse as deep as you like using GPath in Groovy. Your test.groovy would look like this:
@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7')
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.JSON
import groovy.json.*
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
import groovy.json.JsonSlurper

urlValue = "http://yoururl.com"

def http = new HTTPBuilder(urlValue)

// parses the page and provides an xml tree, it even includes malformed html
def parsedText = http.get([:])

// number of a tags. "**" will parse depth-first
aCount = parsedText."**".findAll { it.name() == 'a' }.size()
Then you just call test.groovy from Java like this:
static void runWithGroovyShell() throws Exception {
    new GroovyShell().parse(new File("test.groovy")).invokeMethod("hello_world", null);
}
More info on parsing html with groovy
Addition: when you evaluate Groovy within Java and want to access Groovy variables in the Java environment through Groovy bindings, have a look here.
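A minimal sketch of that idea (the variable names are illustrative): values are bound before evaluation and read back afterwards through the same Binding object:

import groovy.lang.Binding;
import groovy.lang.GroovyShell;

public class GroovyBindingExample {
    public static void main(String[] args) {
        Binding binding = new Binding();
        binding.setVariable("urlValue", "http://yoururl.com"); // visible inside the script

        GroovyShell shell = new GroovyShell(binding);
        // the script can set variables that Java reads back afterwards
        shell.evaluate("aCount = 42 // e.g. the link count computed in test.groovy");

        Object aCount = binding.getVariable("aCount");
        System.out.println("links found: " + aCount);
    }
}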
Consider a URL www.example.com. It may have plenty of links; some may be internal and others external. I want to get a list of all the sub-links, not the sub-sub-links, but only the sub-links.
E.g. if there are four links as follows:
1)www.example.com/images/main
2)www.example.com/data
3)www.example.com/users
4)www.example.com/admin/data
Then out of the four, only 2 and 3 are of use as they are sub-links, not sub-sub (and so on) links. Is there a way to achieve this through jsoup? If it cannot be achieved through jsoup, can someone point me to some other Java API?
Also note that the result should be links of the parent URL which is initially sent (i.e. www.example.com).
If I understand correctly, a sub-link contains only one slash, so you can attempt this by counting the number of slashes, for example:
List<String> list = new ArrayList<>();
list.add("www.example.com/images/main");
list.add("www.example.com/data");
list.add("www.example.com/users");
list.add("www.example.com/admin/data");

for (String link : list) {
    if ((link.length() - link.replaceAll("[/]", "").length()) == 1) {
        System.out.println(link);
    }
}
link.length() counts the number of characters, and link.replaceAll("[/]", "").length() counts the characters that remain after removing the slashes, so the difference between them is the number of slashes.
If the difference equals one, it is a sub-link; otherwise it is not.
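A slightly different way to express the same check, my own variation rather than part of the answer above, is to parse the link with java.net.URI and count the path segments:

import java.net.URI;

public class SubLinkCheck {

    // true when the URL has exactly one path segment, e.g. /data but not /admin/data
    static boolean isSubLink(String link) {
        String path = URI.create("http://" + link).getPath(); // the sample links carry no scheme
        return path != null && path.split("/").length == 2;   // "" and "data" -> 2 parts
    }

    public static void main(String[] args) {
        System.out.println(isSubLink("www.example.com/data"));       // true
        System.out.println(isSubLink("www.example.com/admin/data")); // false
    }
}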
EDIT
How will I scan the whole website for sub-links?
The answer to that lies in the robots.txt file (the robots exclusion standard), which lists the sub-links of the web site, for example https://stackoverflow.com/robots.txt. So the idea is to read this file and extract the sub-links from it. Here is a piece of code that can help you:
public static void main(String[] args) throws Exception {
    // Your web site
    String website = "http://stackoverflow.com";
    // We will read the URL https://stackoverflow.com/robots.txt
    URL url = new URL(website + "/robots.txt");

    // List of your sub-links
    List<String> list;

    // Read the file with BufferedReader
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String subLink;
        list = new ArrayList<>();

        // Loop through your file
        while ((subLink = in.readLine()) != null) {
            // Check if the sub-link matches this regex; if yes, then add it to your list
            if (subLink.matches("Disallow: \\/\\w+\\/")) {
                list.add(website + "/" + subLink.replace("Disallow: /", ""));
            } else {
                System.out.println("not match");
            }
        }
    }

    // Print your result
    System.out.println(list);
}
This will show you:
[https://stackoverflow.com/posts/, https://stackoverflow.com/posts?,
https://stackoverflow.com/search/, https://stackoverflow.com/search?,
https://stackoverflow.com/feeds/, https://stackoverflow.com/feeds?,
https://stackoverflow.com/unanswered/,
https://stackoverflow.com/unanswered?, https://stackoverflow.com/u/,
https://stackoverflow.com/messages/, https://stackoverflow.com/ajax/,
https://stackoverflow.com/plugins/]
Here is a demo of the regex that I use.
Hope this can help you.
To scan the links on a web page you can use the Jsoup library.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class read_data {

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("**your_url**").get();
            Elements links = doc.select("a");
            List<String> list = new ArrayList<>();
            for (Element link : links) {
                list.add(link.attr("abs:href"));
            }
        } catch (IOException ex) {
        }
    }
}
The list can be used as suggested in the previous answer.
The code for reading all the links on a website is given below. I have used http://stackoverflow.com/ for illustration. I would recommend you go through the company's terms of use before scraping its website.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class readAllLinks {

    public static Set<String> uniqueURL = new HashSet<String>();
    public static String my_site;

    public static void main(String[] args) {
        readAllLinks obj = new readAllLinks();
        my_site = "stackoverflow.com";
        obj.get_links("http://stackoverflow.com/");
    }

    private void get_links(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a");
            links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((this_url) -> {
                boolean add = uniqueURL.add(this_url);
                if (add && this_url.contains(my_site)) {
                    System.out.println(this_url);
                    get_links(this_url);
                }
            });
        } catch (IOException ex) {
            // ignore pages that fail to load
        }
    }
}
You will get the list of all the links in the uniqueURL field.
I am trying to extract the data from the table on the following website, i.e. club, venue and start time: http://www.national-autograss.co.uk/february.htm
I have got many examples on here working that use a CSS class for the table, but this website doesn't use one. I have made an attempt with the code below, but it doesn't seem to produce any output. Any help would be very much appreciated.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Main {

    public static void main(String[] args) {
        Document doc = null;
        try {
            doc = Jsoup.connect("http://www.national-autograss.co.uk/february.htm").get();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        Elements elements = doc.select("table#table1");
        String name;
        for (Element element : elements) {
            name = element.text();
            System.out.println(name);
        }
    }
}
An id should be unique, so you can use doc.select("#table1") directly, and so on.
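Building on that, a sketch of pulling the club, venue and start time out of the rows; the assumption that they sit in the first three <td> cells of #table1 is mine, so adjust the indices to the real markup:

Document doc = Jsoup.connect("http://www.national-autograss.co.uk/february.htm").get();

for (Element row : doc.select("#table1 tr")) {
    Elements cells = row.select("td");
    if (cells.size() >= 3) { // skip header or malformed rows
        String club = cells.get(0).text();
        String venue = cells.get(1).text();
        String startTime = cells.get(2).text();
        System.out.println(club + " | " + venue + " | " + startTime);
    }
}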