Exception in thread "main" java.lang.NoClassDefFoundError: org/jsoup/Jsoup - java

I copied a simple web crawler from the internet and then tried to run the application from a test class. Every time I try to run it I get the error "Exception in thread "main" java.lang.NoClassDefFoundError: org/jsoup/Jsoup". I first imported the jsoup JAR as an external JAR in a library, because I needed it for the HTTP handling.
Error messages:
Exception in thread "main" java.lang.NoClassDefFoundError: org/jsoup/Jsoup
at com.copiedcrawler.SpiderLeg.crawl(SpiderLeg.java:35)
at com.copiedcrawler.Spider.search(Spider.java:40)
at com.copiedcrawler.SpiderTest.main(SpiderTest.java:9)
Caused by: java.lang.ClassNotFoundException: org.jsoup.Jsoup
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:602)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 3 more
Spider Class
package com.copiedcrawler;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
public class Spider
{
private static final int MAX_PAGES_TO_SEARCH = 10;
private Set<String> pagesVisited = new HashSet<String>();
private List<String> pagesToVisit = new LinkedList<String>();
public void search(String url, String searchWord)
{
while(this.pagesVisited.size() < MAX_PAGES_TO_SEARCH)
{
String currentUrl;
SpiderLeg leg = new SpiderLeg();
if(this.pagesToVisit.isEmpty())
{
currentUrl = url;
this.pagesVisited.add(url);
}
else
{
currentUrl = this.nextUrl();
}
leg.crawl(currentUrl); // Lots of stuff happening here. Look at the crawl method in
// SpiderLeg
boolean success = leg.searchForWord(searchWord);
if(success)
{
System.out.println(String.format("**Success** Word %s found at %s", searchWord, currentUrl));
break;
}
this.pagesToVisit.addAll(leg.getLinks());
}
System.out.println("\n**Done** Visited " + this.pagesVisited.size() + " web page(s)");
}
/**
* Returns the next URL to visit (in the order that they were found). We also do a check to make
* sure this method doesn't return a URL that has already been visited.
*
* @return the next URL to visit
*/
private String nextUrl()
{
String nextUrl;
do
{
nextUrl = this.pagesToVisit.remove(0);
} while(this.pagesVisited.contains(nextUrl));
this.pagesVisited.add(nextUrl);
return nextUrl;
}
}
SpiderLeg class
package com.copiedcrawler;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class SpiderLeg
{
// We'll use a fake USER_AGENT so the web server thinks the robot is a normal web browser.
private static final String USER_AGENT =
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
private List<String> links = new LinkedList<String>();
private Document htmlDocument;
/**
* This performs all the work. It makes an HTTP request, checks the response, and then gathers
* up all the links on the page. Perform a searchForWord after the successful crawl
*
* @param url
* - The URL to visit
* @return whether or not the crawl was successful
*/
public boolean crawl(String url)
{
try
{
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
this.htmlDocument = htmlDocument;
if(connection.response().statusCode() == 200) // 200 is the HTTP OK status code
// indicating that everything is great.
{
System.out.println("\n**Visiting** Received web page at " + url);
}
if(!connection.response().contentType().contains("text/html"))
{
System.out.println("**Failure** Retrieved something other than HTML");
return false;
}
Elements linksOnPage = htmlDocument.select("a[href]");
System.out.println("Found (" + linksOnPage.size() + ") links");
for(Element link : linksOnPage)
{
this.links.add(link.absUrl("href"));
}
return true;
}
catch(IOException ioe)
{
// We were not successful in our HTTP request
return false;
}
}
/**
* Performs a search on the body of on the HTML document that is retrieved. This method should
* only be called after a successful crawl.
*
* @param searchWord
* - The word or string to look for
* @return whether or not the word was found
*/
public boolean searchForWord(String searchWord)
{
// Defensive coding. This method should only be used after a successful crawl.
if(this.htmlDocument == null)
{
System.out.println("ERROR! Call crawl() before performing analysis on the document");
return false;
}
System.out.println("Searching for the word " + searchWord + "...");
String bodyText = this.htmlDocument.body().text();
return bodyText.toLowerCase().contains(searchWord.toLowerCase());
}
public List<String> getLinks()
{
return this.links;
}
}
SpiderTest class
package com.copiedcrawler;
public class SpiderTest {
public static void main(String[] args) {
// TODO Auto-generated method stub
Spider s1 = new Spider();
s1.search("https://www.w3schools.com/html/", "html");
}
}

Based on the stack trace, you are running the Java program from the command line and forgot to add jsoup to the classpath. Try running
java -cp classes:libs/jsoup.jar com.copiedcrawler.SpiderTest
where classes is the folder with your compiled classes and libs is the folder with the libraries. On Windows, use ; instead of : as the classpath separator.
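If you are not sure whether the JAR actually ends up on the classpath, a quick check is a tiny probe class (a minimal sketch; the class name is just an example):
public class JsoupClasspathCheck {
    public static void main(String[] args) throws Exception {
        // Fails with ClassNotFoundException if jsoup is not on the classpath.
        Class<?> jsoup = Class.forName("org.jsoup.Jsoup");
        System.out.println("Loaded " + jsoup.getName() + " from "
                + jsoup.getProtectionDomain().getCodeSource().getLocation());
    }
}
Run it with the same -cp value as above; if it prints the path of the jsoup JAR, the classpath is fine and the problem is elsewhere.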

You might have added the jsoup JAR file to the Modulepath. You need to add the JAR file to the Classpath instead.
Follow the steps below:
Remove the jsoup JAR from the libraries.
Project -> Build Path -> Configure Build Path -> Libraries -> Classpath -> Add External JARs.
Apply and Close.
Re-run the project.
Now it should work.
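Alternatively, if you deliberately want to keep jsoup on the Modulepath, the project needs a module-info.java that requires the jsoup module. This is only a sketch under the assumption that your jsoup version declares the automatic module name org.jsoup; moving the JAR to the Classpath as described above is the simpler fix.
// module-info.java (hypothetical module name for the crawler project)
module com.copiedcrawler {
    requires org.jsoup; // assumes the jsoup JAR exposes the module name "org.jsoup"
}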

Related

Java Webcrawler to extract emails

I want to write a web crawler that starts at one page and goes to each link on that page looking for an email address. This is what I have so far, but it's not doing anything other than going from webpage to webpage.
package com.netinstructions.crawler;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
public class WebCrawler {
private static final int MAX_PAGES_TO_SEARCH = 26;
private Set<String> pagesVisited = new HashSet<String>();
private List<String> pagesToVisit = new LinkedList<String>();
private List<String> emails = new LinkedList<>();
private String nextUrl()
{
String nextUrl;
do
{
nextUrl = this.pagesToVisit.remove(0);
} while(this.pagesVisited.contains(nextUrl));
this.pagesVisited.add(nextUrl);
return nextUrl;
}
public void search(String url, String searchWord)
{
while(this.pagesVisited.size() < MAX_PAGES_TO_SEARCH)
{
String currentUrl;
SpiderLeg leg = new SpiderLeg();
if(this.pagesToVisit.isEmpty())
{
currentUrl = url;
this.pagesVisited.add(url);
}
else
{
currentUrl = this.nextUrl();
}
leg.crawl(currentUrl); // Lots of stuff happening here. Look at the crawl method in
// SpiderLeg
leg.searchForWord(currentUrl, emails);
this.pagesToVisit.addAll(leg.getLinks());
this.pagesToVisit.addAll(leg.getLinks());
}
System.out.println(emails.toString());
//System.out.println(String.format("**Done** Visited %s web page(s)", this.pagesVisited.size()));
}
}
And this is my Spider Leg Class
package com.netinstructions.crawler;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class SpiderLeg
{
// We'll use a fake USER_AGENT so the web server thinks the robot is a normal web browser.
private static final String USER_AGENT =
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
private List<String> links = new LinkedList<String>();
private Document htmlDocument;
/**
* This performs all the work. It makes an HTTP request, checks the response, and then gathers
* up all the links on the page. Perform a searchForWord after the successful crawl
*
* @param url
* - The URL to visit
* @return whether or not the crawl was successful
*/
public boolean crawl(String url)
{
try
{
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
this.htmlDocument = htmlDocument;
if(connection.response().statusCode() == 200) // 200 is the HTTP OK status code
// indicating that everything is great.
{
System.out.println("\n**Visiting** Received web page at " + url);
}
if(!connection.response().contentType().contains("text/html"))
{
System.out.println("**Failure** Retrieved something other than HTML");
return false;
}
Elements linksOnPage = htmlDocument.select("a[href]");
//System.out.println("Found (" + linksOnPage.size() + ") links");
for(Element link : linksOnPage)
{
this.links.add(link.absUrl("href"));
}
return true;
}
catch(IOException ioe)
{
// We were not successful in our HTTP request
return false;
}
}
/**
* Performs a search on the body of on the HTML document that is retrieved. This method should
* only be called after a successful crawl.
*
* @param searchWord
* - The word or string to look for
* @return whether or not the word was found
*/
public void searchForWord(String searchWord, List<String> emails)
{
if(this.htmlDocument == null)
{
System.out.println("ERROR! Call crawl() before performing analysis on the document");
//return false;
}
Pattern pattern =
Pattern.compile("\"^[A-Z0-9._%+-]+#[A-Z0-9.-]+\\\\.[A-Z]{2,6}$\", Pattern.CASE_INSENSITIVE");
Matcher matchs = pattern.matcher(searchWord);
while (matchs.find()) {
System.out.println(matchs.group());
}
}
public List<String> getLinks()
{
return this.links;
}
}
My web crawler was taken from another source and I changed a few things. I added a List to hold the emails and return them all to me. I think I am going wrong in the way I take the email and put it in the list, but I am not sure how to fix it.
Spider Leg Class
Pattern.compile("\"^[A-Z0-9._%+-]+#[A-Z0-9.-]+\\\\.[A-Z]{2,6}$\", Pattern.CASE_INSENSITIVE");
Shouldn't this be...?
Pattern.compile("[A-Z0-9._%+-]+#[A-Z0-9.-]+\\.[A-Z]{2,6}", Pattern.CASE_INSENSITIVE);
Nothing gets added to emails, so you need to add the emails you find to the list with emails.add(). Secondly, you probably want to be searching the HTML document, not the URL of the page. Since the method no longer returns anything, you need to expand the if statement into an if/else so the search is skipped when the document is null. The searchForWord method should be:
public void searchForWord(String searchWord, List<String> emails)
{
if(this.htmlDocument == null)
{
System.out.println("ERROR! Call crawl() before performing analysis on the document");
} else
{
String input = this.htmlDocument.toString();
Pattern pattern =
Pattern.compile("[A-Z0-9._%+-]+#[A-Z0-9.-]+\\.[A-Z]{2,6}", Pattern.CASE_INSENSITIVE);
Matcher matchs = pattern.matcher(input);
while (matchs.find()) {
emails.add(matchs.group());
}
}
}
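To convince yourself that the corrected pattern behaves as expected before wiring it into the crawler, you can run it against a made-up string (a standalone sketch, not part of the original crawler):
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailPatternCheck {
    public static void main(String[] args) {
        // Sample text standing in for this.htmlDocument.toString()
        String input = "Contact us at info@example.com or sales@example.co.uk for details.";
        Pattern pattern = Pattern.compile("[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,6}", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(input);
        List<String> emails = new ArrayList<>();
        while (matcher.find()) {
            emails.add(matcher.group());
        }
        System.out.println(emails); // [info@example.com, sales@example.co.uk]
    }
}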

Iterate through all links of a website using Selenium

I'm new to Selenium and I would like to download all the pdf, ppt(x) and doc(x) files from a website. I have written the following code, but I'm confused about how to get the inner links:
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.io.FileUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
public class WebScraper {
String loginPage = "https://blablah/login";
static String userName = "11";
static String password = "11";
static String mainPage = "https://blahblah";
public WebDriver driver = new FirefoxDriver();
ArrayList<String> visitedLinks = new ArrayList<>();
public static void main(String[] args) throws IOException {
System.setProperty("webdriver.gecko.driver", "E:\\geckodriver.exe");
WebScraper webScraper = new WebScraper();
webScraper.openTestSite();
webScraper.login(userName, password);
webScraper.getText(mainPage);
webScraper.saveScreenshot();
webScraper.closeBrowser();
}
/**
* Open the test website.
*/
public void openTestSite() {
driver.navigate().to(loginPage);
}
/**
* @param username
* @param Password Logins into the website, by entering provided username and password
*/
public void login(String username, String Password) {
WebElement userName_editbox = driver.findElement(By.id("IDToken1"));
WebElement password_editbox = driver.findElement(By.id("IDToken2"));
WebElement submit_button = driver.findElement(By.name("Login.Submit"));
userName_editbox.sendKeys(username);
password_editbox.sendKeys(Password);
submit_button.click();
}
/**
* grabs the status text and saves that into status.txt file
*
* @throws IOException
*/
public void getText(String website) throws IOException {
driver.navigate().to(website);
try {
Thread.sleep(10000);
} catch (InterruptedException e) {
e.printStackTrace();
}
List<WebElement> allLinks = driver.findElements(By.tagName("a"));
System.out.println("Total no of links Available: " + allLinks.size());
for (int i = 0; i < allLinks.size(); i++) {
String fileAddress = allLinks.get(i).getAttribute("href");
System.out.println(allLinks.get(i).getAttribute("href"));
if (fileAddress.contains("download")) {
driver.get(fileAddress);
} else {
// getText(allLinks.get(i).getAttribute("href"));
}
}
}
/**
* Saves the screenshot
*
* @throws IOException
*/
public void saveScreenshot() throws IOException {
File scrFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
FileUtils.copyFile(scrFile, new File("screenshot.png"));
}
public void closeBrowser() {
driver.close();
}
}
I have an if clause which checks whether the current link is a downloadable file (an address containing the word "download"). If it is, I download it; if not, what should I do? That part is my problem. I tried to implement a recursive function to retrieve the nested links and repeat the steps for them, but with no success.
In the meantime, the first link found when giving https://blahblah as the input is https://blahblah/#, which refers to the same page as https://blahblah. It can also cause a problem, but for now I'm stuck on the other issue, namely implementing the recursive function. Could you please help me?
You are not far off. To answer your question: grab all the links into a list of elements, then iterate, click, and wait. In C# it is something like this:
IList<IWebElement> listOfLinks = _driver.FindElements(By.XPath("//a"));
foreach (var link in listOfLinks)
{
if(link.GetAttribute("href").Contains("download"))
{
link.Click();
WaitForSecs(); //Thread.Sleep(1000)
}
}
JAVA
List<WebElement> listOfLinks = webDriver.findElements(By.xpath("//a"));
for (WebElement link :listOfLinks ) {
if(link.getAttribute("href").contains("download"))
{
link.click();
//WaitForSecs(); //Thread.Sleep(1000)
}
}
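One pitfall with clicking links in a loop is that a click that navigates away can invalidate the previously found elements (StaleElementReferenceException). Below is a variant sketch that collects the href values first and then visits them with driver.get; the "download" check is the same assumption as above:
import java.util.ArrayList;
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class DownloadLinkVisitor {
    public static void visitDownloadLinks(WebDriver driver) throws InterruptedException {
        // Collect the target URLs first so no stale elements are touched after navigating.
        List<String> hrefs = new ArrayList<>();
        for (WebElement link : driver.findElements(By.tagName("a"))) {
            String href = link.getAttribute("href");
            if (href != null && href.contains("download")) {
                hrefs.add(href);
            }
        }
        for (String href : hrefs) {
            driver.get(href);   // triggers the download or opens the file URL
            Thread.sleep(1000); // crude wait, as in the snippets above
        }
    }
}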
One option is to embed Groovy in your Java code if you want to search depth-first. When HTTPBuilder parses the page, it gives you an XML-like document tree, and you can then traverse as deep as you like using GPath in Groovy. Your test.groovy would look like this:
@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7' )
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.JSON
import groovy.json.*
import org.cyberneko.html.parsers.SAXParser
import groovy.util.XmlSlurper
import groovy.json.JsonSlurper
urlValue="http://yoururl.com"
def http = new HTTPBuilder(urlValue)
//parses page and provide xml tree , it even includes malformed html
def parsedText = http.get([:])
// number of a tags. "**" will parse depth-first
aCount= parsedText."**".findAll {it.name()=='a'}.size()
Then you just call test.groovy from Java like this:
static void runWithGroovyShell() throws Exception {
new GroovyShell().parse( new File( "test.groovy" ) ).invokeMethod( "hello_world", null ) ;
}
More info on parsing HTML with Groovy
Addition:
When you evaluate Groovy within Java, to access Groovy variables from the Java environment through Groovy bindings, have a look here.
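For the binding part, a minimal sketch of evaluating the script from Java and sharing a variable through a Binding could look like this (the script path and variable names are placeholders, and Groovy must be on the classpath):
import groovy.lang.Binding;
import groovy.lang.GroovyShell;
import java.io.File;

public class GroovyEmbedExample {
    public static void main(String[] args) throws Exception {
        Binding binding = new Binding();
        binding.setVariable("urlValue", "http://yoururl.com"); // visible inside test.groovy as urlValue
        GroovyShell shell = new GroovyShell(binding);
        // The value of the script's last expression (e.g. aCount) comes back to Java.
        Object result = shell.evaluate(new File("test.groovy"));
        System.out.println("Script returned: " + result);
        // Variables the script sets without 'def' can also be read back from the binding.
        System.out.println("urlValue after the run: " + binding.getVariable("urlValue"));
    }
}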

CmisObjectNotFoundException when trying to access my Alfresco repository

I'm new to CMIS and Alfresco, and I got this error when I tried to connect to my Alfresco repository using the AtomPub binding. I have no idea about the source of my problem. Is it a missing feature? Is it my credentials?
When I installed it, I chose only:
- Alfresco Community
- Solr4
What should I do if I want to use the web services? Should I install a specific plugin in my Alfresco?
I got this error:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Users/ME%2ME/.m2/repository/org/slf4j/slf4j-simple/1.7.9/slf4j-simple-1.7.9.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Users/ME%2ME/.m2/repository/ch/qos/logback/logback-classic/1.1.3/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]
Exception in thread "main" org.apache.chemistry.opencmis.commons.exceptions.CmisObjectNotFoundException: Introuvable
at org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.convertStatusCode(AbstractAtomPubService.java:499)
at org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.read(AbstractAtomPubService.java:701)
at org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.getRepositoriesInternal(AbstractAtomPubService.java:873)
at org.apache.chemistry.opencmis.client.bindings.spi.atompub.RepositoryServiceImpl.getRepositoryInfos(RepositoryServiceImpl.java:66)
at org.apache.chemistry.opencmis.client.bindings.impl.RepositoryServiceImpl.getRepositoryInfos(RepositoryServiceImpl.java:92)
at org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl.getRepositories(SessionFactoryImpl.java:120)
at org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl.getRepositories(SessionFactoryImpl.java:107)
at fr.omb.TestOMB.connect(TestOMB.java:160)
at fr.omb.TestOMB.main(TestOMB.java:35)
My code :
package fr.omb;
import java.io.ByteArrayInputStream;
import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.chemistry.opencmis.client.api.CmisObject;
import org.apache.chemistry.opencmis.client.api.Document;
import org.apache.chemistry.opencmis.client.api.Folder;
import org.apache.chemistry.opencmis.client.api.Repository;
import org.apache.chemistry.opencmis.client.api.Session;
import org.apache.chemistry.opencmis.client.api.SessionFactory;
import org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl;
import org.apache.chemistry.opencmis.commons.PropertyIds;
import org.apache.chemistry.opencmis.commons.SessionParameter;
import org.apache.chemistry.opencmis.commons.data.ContentStream;
import org.apache.chemistry.opencmis.commons.enums.BaseTypeId;
import org.apache.chemistry.opencmis.commons.enums.BindingType;
import org.apache.chemistry.opencmis.commons.enums.UnfileObject;
import org.apache.chemistry.opencmis.commons.enums.VersioningState;
import org.apache.chemistry.opencmis.commons.exceptions.CmisObjectNotFoundException;
import org.apache.commons.lang3.StringUtils;
public class TestOMB {
private static Session session;
private static final String ALFRSCO_ATOMPUB_URL = "http://localhost:8080/alfresco/service/cmis";
private static final String TEST_FOLDER_NAME = "chemistryTestFolder";
private static final String TEST_DOCUMENT_NAME_1 = "chemistryTest1.txt";
private static final String TEST_DOCUMENT_NAME_2 = "chemistryTest2.txt";
public static void main(String[] args) {
Folder root = connect();
cleanup(root, TEST_FOLDER_NAME);
Folder newFolder = createFolder(root, TEST_FOLDER_NAME);
createDocument(newFolder, TEST_DOCUMENT_NAME_1);
createDocument(newFolder, TEST_DOCUMENT_NAME_2);
System.out.println("+++ List Folder +++");
listFolder(0, newFolder);
DeleteDocument(newFolder, "/" + TEST_DOCUMENT_NAME_2);
System.out.println("+++ List Folder +++");
listFolder(0, newFolder);
}
/**
* Clean up test folder before executing test
*
* @param target
* @param delFolderName
*/
private static void cleanup(Folder target, String delFolderName) {
try {
CmisObject object = session.getObjectByPath(target.getPath() + delFolderName);
Folder delFolder = (Folder) object;
delFolder.deleteTree(true, UnfileObject.DELETE, true);
} catch (CmisObjectNotFoundException e) {
System.err.println("No need to clean up.");
}
}
/**
*
* @param target
*/
private static void listFolder(int depth, Folder target) {
String indent = StringUtils.repeat("\t", depth);
for (Iterator<CmisObject> it = target.getChildren().iterator(); it.hasNext();) {
CmisObject o = it.next();
if (BaseTypeId.CMIS_DOCUMENT.equals(o.getBaseTypeId())) {
System.out.println(indent + "[Docment] " + o.getName());
} else if (BaseTypeId.CMIS_FOLDER.equals(o.getBaseTypeId())) {
System.out.println(indent + "[Folder] " + o.getName());
listFolder(++depth, (Folder) o);
}
}
}
/**
* Delete test document
*
* @param target
* @param delDocName
*/
private static void DeleteDocument(Folder target, String delDocName) {
try {
CmisObject object = session.getObjectByPath(target.getPath() + delDocName);
Document delDoc = (Document) object;
delDoc.delete(true);
} catch (CmisObjectNotFoundException e) {
System.err.println("Document is not found: " + delDocName);
}
}
/**
* Create test document with content
*
* @param target
* @param newDocName
*/
private static void createDocument(Folder target, String newDocName) {
Map<String, String> props = new HashMap<String, String>();
props.put(PropertyIds.OBJECT_TYPE_ID, "cmis:document");
props.put(PropertyIds.NAME, newDocName);
System.out.println("This is a test document: " + newDocName);
String content = "aegif Mind Share Leader Generating New Paradigms by aegif corporation.";
byte[] buf = null;
try {
buf = content.getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
ByteArrayInputStream input = new ByteArrayInputStream(buf);
ContentStream contentStream = session.getObjectFactory().createContentStream(newDocName, buf.length,
"text/plain; charset=UTF-8", input);
target.createDocument(props, contentStream, VersioningState.MAJOR);
}
/**
* Create test folder directly under target folder
*
* @param target
* @param createFolderName
* @return newly created folder
*/
private static Folder createFolder(Folder target, String newFolderName) {
Map<String, String> props = new HashMap<String, String>();
props.put(PropertyIds.OBJECT_TYPE_ID, "cmis:folder");
props.put(PropertyIds.NAME, newFolderName);
Folder newFolder = target.createFolder(props);
return newFolder;
}
/**
* Connect to alfresco repository
*
* @return root folder object
*/
private static Folder connect() {
SessionFactory sessionFactory = SessionFactoryImpl.newInstance();
Map<String, String> parameters = new HashMap<String, String>();
// User credentials.
parameters.put(SessionParameter.USER, "myuser");
parameters.put(SessionParameter.PASSWORD, "mypassword");
// Connection settings.
parameters.put(SessionParameter.BINDING_TYPE, BindingType.ATOMPUB.value());
parameters.put(SessionParameter.ATOMPUB_URL, ALFRSCO_ATOMPUB_URL);
parameters.put(SessionParameter.AUTH_HTTP_BASIC, "true");
parameters.put(SessionParameter.COOKIES, "true");
parameters.put(SessionParameter.OBJECT_FACTORY_CLASS,
"org.alfresco.cmis.client.impl.AlfrescoObjectFactoryImpl");
// Create session.
// Alfresco only provides one repository.
Repository repository = sessionFactory.getRepositories(parameters).get(0);
Session session = repository.createSession();
return session.getRootFolder();
}
}
I found the solution: it's because of the Alfresco version. Since v4.x the AtomPub URL is http://localhost:8080/alfresco/cmisatom.
https://community.alfresco.com/docs/DOC-5527-cmis
For Alfresco 3.x : http://[host]:[port]/alfresco/service/cmis
For Alfresco 4.0.x, Alfresco 4.1.x and Alfresco 4.2.a-c: http://[host]:[port]/alfresco/cmisatom
For Alfresco 4.2.d-f, Alfresco 5.0 and Alfresco 5.1: http://[host]:[port]/alfresco/api/-default-/public/cmis/versions/1.0/atom
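A minimal connection sketch against the newer endpoint, assuming Alfresco 4.2.d+/5.x on localhost and placeholder credentials (pick the URL for your version from the list above):
import java.util.HashMap;
import java.util.Map;
import org.apache.chemistry.opencmis.client.api.Repository;
import org.apache.chemistry.opencmis.client.api.Session;
import org.apache.chemistry.opencmis.client.api.SessionFactory;
import org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl;
import org.apache.chemistry.opencmis.commons.SessionParameter;
import org.apache.chemistry.opencmis.commons.enums.BindingType;

public class CmisConnectCheck {
    public static void main(String[] args) {
        SessionFactory factory = SessionFactoryImpl.newInstance();
        Map<String, String> parameters = new HashMap<String, String>();
        parameters.put(SessionParameter.USER, "myuser");
        parameters.put(SessionParameter.PASSWORD, "mypassword");
        parameters.put(SessionParameter.BINDING_TYPE, BindingType.ATOMPUB.value());
        parameters.put(SessionParameter.ATOMPUB_URL,
                "http://localhost:8080/alfresco/api/-default-/public/cmis/versions/1.0/atom");
        // Alfresco exposes a single repository; a CmisObjectNotFoundException here usually
        // means the AtomPub URL does not match the installed Alfresco version.
        Repository repository = factory.getRepositories(parameters).get(0);
        Session session = repository.createSession();
        System.out.println("Connected to repository: " + session.getRepositoryInfo().getName());
    }
}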

How to get scrape using crawler4j?

I've been going at this for 4 hours now, and I simply can't see what I'm doing wrong. I have two files:
MyCrawler.java
Controller.java
MyCrawler.java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;
public class MyCrawler extends WebCrawler {
private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
+ "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
/**
* You should implement this function to specify whether the given url
* should be crawled or not (based on your crawling logic).
*/
@Override
public boolean shouldVisit(WebURL url) {
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
}
/**
* This function is called when a page is fetched and ready to be processed
* by your program.
*/
@Override
public void visit(Page page) {
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String domain = page.getWebURL().getDomain();
String path = page.getWebURL().getPath();
String subDomain = page.getWebURL().getSubDomain();
String parentUrl = page.getWebURL().getParentUrl();
String anchor = page.getWebURL().getAnchor();
System.out.println("Docid: " + docid);
System.out.println("URL: " + url);
System.out.println("Domain: '" + domain + "'");
System.out.println("Sub-domain: '" + subDomain + "'");
System.out.println("Path: '" + path + "'");
System.out.println("Parent page: " + parentUrl);
System.out.println("Anchor text: " + anchor);
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String text = htmlParseData.getText();
String html = htmlParseData.getHtml();
List<WebURL> links = htmlParseData.getOutgoingUrls();
System.out.println("Text length: " + text.length());
System.out.println("Html length: " + html.length());
System.out.println("Number of outgoing links: " + links.size());
}
Header[] responseHeaders = page.getFetchResponseHeaders();
if (responseHeaders != null) {
System.out.println("Response headers:");
for (Header header : responseHeaders) {
System.out.println("\t" + header.getName() + ": " + header.getValue());
}
}
System.out.println("=============");
}
}
Controller.java
package edu.crawler;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
public class Controller
{
public static void main(String[] args) throws Exception
{
String crawlStorageFolder = "../data/";
int numberOfCrawlers = 7;
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);
/*
* Instantiate the controller for this crawl.
*/
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
/*
* For each crawl, you need to add some seed urls. These are the first
* URLs that are fetched and then the crawler starts following links
* which are found in these pages
*/
controller.addSeed("http://www.ics.uci.edu/~welling/");
controller.addSeed("http://www.ics.uci.edu/~lopes/");
controller.addSeed("http://www.ics.uci.edu/");
/*
* Start the crawl. This is a blocking operation, meaning that your code
* will reach the line after this only when crawling is finished.
*/
controller.start(MyCrawler, numberOfCrawlers);
}
}
The Structure is as follows:
java/MyCrawler.java
java/Controller.java
jars/... --> all the jars crawler4j
I'm trying to compile this on a Windows machine using:
javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java\*" MyCrawler.java
This works perfectly, and I end up with:
java/MyCrawler.class
However, when I type:
javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java\*" Controller.java
it bombs out with:
Controller.java:50: error: cannot find symbol
controller.start(MyCrawler, numberOfCrawlers);
^
symbol: variable MyCrawler
location: class Controller
1 error
So I think I am somehow not doing something that I need to be doing, something that will make this new executable class "aware" of MyCrawler.class. I have tried fiddling with the classpath in the javac command line. I've also tried setting it in my environment variables... no luck.
Any idea how I can get this to work?
UPDATE
I got most of this code from the Google Code page itself. But I just can't figure out what must go there. Even if I try this:
MyCrawler mc = new MyCrawler();
No luck. Somehow Controller.class does not know about MyCrawler.class.
UPDATE 2
I don't think it matters, since the problem is clearly that it can't find the class, but either way, here is the signature of the controller.start method. Taken from here.
/**
* Start the crawling session and wait for it to finish.
*
* @param _c
* the class that implements the logic for crawler threads
* @param numberOfCrawlers
* the number of concurrent threads that will be contributing in
* this crawling session.
*/
public <T extends WebCrawler> void start(final Class<T> _c, final int numberOfCrawlers) {
this.start(_c, numberOfCrawlers, true);
}
I am in fact passing in a crawler, since I'm passing in "MyCrawler". The problem is that the application doesn't know what MyCrawler is.
A couple of things come to mind:
Is your MyCrawler extending edu.uci.ics.crawler4j.crawler.WebCrawler?
public class MyCrawler extends WebCrawler
Are you passing in MyCrawler.class (i.e., as a class) into controller.start?
controller.start(MyCrawler.class, numberOfCrawlers);
Both of these need to be satisfied in order for the controller to compile and run. Also, Crawler4j has some great examples here:
https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawler.java
https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawlController.java
These 2 classes will compile and run right away (i.e., BasicCrawlController), so it's a good starting place if you are running into any issues.
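For reference, here is a condensed controller that compiles, assuming the crawler4j JARs are on the classpath and MyCrawler is in the same package; the essential difference from the question is the MyCrawler.class literal (storage folder and seed are the question's own values):
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class MinimalController {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("../data/");
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.ics.uci.edu/");
        controller.start(MyCrawler.class, 7); // pass the class literal, not an instance or a bare name
    }
}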
The parameters for start() should be a class and the number of crawlers. It's throwing an error because you are passing in a crawler object and not the crawler class. Use the start method as shown below and it should work:
controller.start(MyCrawler.class, numberOfCrawlers);
Here you are passing the bare class name MyCrawler as a parameter:
controller.start(MyCrawler, numberOfCrawlers);
A bare class name cannot be used as a parameter like that; pass the class literal MyCrawler.class instead.
I am also working a little bit on crawling!

Error while attempting to program page creation on google sites

I'm trying to programmatically add pages to my Google Site using Java.
This is the code:
import java.io.*;
import java.net.MalformedURLException;
import java.net.URL;
import com.google.gdata.client.sites.*;
import com.google.gdata.data.PlainTextConstruct;
import com.google.gdata.data.XhtmlTextConstruct;
import com.google.gdata.data.sites.*;
import com.google.gdata.util.ServiceException;
import com.google.gdata.util.XmlBlob;
public class PageCreate {
public static void main(String args[]) throws Exception {
WebPageEntry createdEntry = createWebPage("New Webpage Title", "<b>HTML content</b>");
System.out.println("Created! View at " + createdEntry.getHtmlLink().getHref());
}
private static void setContentBlob(BaseContentEntry<?> entry, String pageContent) {
XmlBlob xml = new XmlBlob();
xml.setBlob(pageContent);
entry.setContent(new XhtmlTextConstruct());
}
public static WebPageEntry createWebPage(String title, String content)
throws ServiceException, IOException, MalformedURLException {
SitesService client = new SitesService("*****-pagecreate-v1");
client.setUserCredentials("***********", "*********");
client.site = "intratrial2"; -> ***SYNTAX ERROR REPORTED***
//ContentFeed contentFeed = client.getFeed(new URL(buildContentFeedUrl()), ContentFeed.class);
WebPageEntry entry = new WebPageEntry();
entry.setTitle(new PlainTextConstruct(title));
setContentBlob(entry, content); // Entry's HTML content
return client.insert(new URL(buildContentFeedUrl()), entry);
}
public static String buildContentFeedUrl() {
String domain = "*****"; // OR if the Site is hosted on Google Apps, your domain (e.g. example.com)
String siteName = "intratrial2";
return "https://sites.google.com/feeds/content/" + domain + "/" + siteName + "/";
}
}
If I comment out the line with the syntax error, I get the following error when running:
Exception in thread "main" com.google.gdata.util.ServiceException: Internal Server Error
Internal Error
I'm not sure what I'm doing wrong here and I'd really appreciate some help. Thanks.
