WebCrawler with recursion - java

So I am working on a webcrawler that is supposed to download all images, files, and webpages, and then recursively do the same for all webpages found. However, I seem to have a logic error.
public class WebCrawler {
private static String url;
private static int maxCrawlDepth;
private static String filePath;
/* Recursive function that crawls all web pages found on a given web page.
* This function also saves elements from the DownloadRepository to disk.
*/
public static void crawling(WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
HashMap<String, WebPage> pages = webpage.getCrawledWebPages();
if(currentCrawlDepth < maxCrawlDepth) {
for(WebPage wp : pages.values()) {
crawling(wp, currentCrawlDepth+1, maxCrawlDepth);
public static void main(String[] args) {
if(args.length != 3) {
System.out.println("Must pass three parameters");
url = "";
maxCrawlDepth = 0;
filePath = "";
url = args[0];
try {
URL testUrl = new URL(url);
URLConnection urlConnection = testUrl.openConnection();
} catch (MalformedURLException e) {
System.out.println("Not a valid URL");
} catch (IOException e) {
System.out.println("Could not open URL");
try {
maxCrawlDepth = Integer.parseInt(args[1]);
} catch (NumberFormatException e) {
System.out.println("Argument is not an int");
filePath = args[2];
File path = new File(filePath);
if(!path.exists()) {
System.out.println("File Path is invalid");
WebPage webpage = new WebPage(url);
crawling(webpage, 0, maxCrawlDepth);
System.out.println("Web crawl is complete");
The crawl function parses the contents of a web page and stores any images, files, or links it finds in a hashmap, for example:
public class WebPage implements WebElement {
private static Elements images;
private static Elements links;
private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();
private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();
private HashMap<String, WebFile> files = new HashMap<String, WebFile>();
private String url;
public WebPage(String url) {
this.url = url;
/* The crawl method parses the html on a given web page
* and adds the elements of the web page to the Download
* Repository.
*/
public void crawl(int currentCrawlDepth) {
System.out.print("Crawling " + url + " at crawl depth ");
System.out.println(currentCrawlDepth + "\n");
Document doc = null;
try {
HttpConnection httpConnection = (HttpConnection) Jsoup.connect(url);
doc = httpConnection.get();
} catch (MalformedURLException e) {
} catch (IOException e) {
} catch (IllegalArgumentException e) {
System.out.println(url + "is not a valid URL");
DownloadRepository downloadRepository = DownloadRepository.getInstance();
if(doc != null) {
images = doc.select("img");
links = doc.select("a[href]");
for(Element image : images) {
String imageUrl = image.absUrl("src");
if(!webImages.containsValue(image)) {
WebImage webImage = new WebImage(imageUrl);
webImages.put(imageUrl, webImage);
downloadRepository.addElement(imageUrl, webImage);
System.out.println("Added image at " + imageUrl);
HttpConnection mimeConnection = null;
Response mimeResponse = null;
for(Element link: links) {
String linkUrl = link.absUrl("href");
linkUrl = linkUrl.trim();
if(!linkUrl.contains("#")) {
try {
mimeConnection = (HttpConnection) Jsoup.connect(linkUrl);
mimeResponse = (Response) mimeConnection.execute();
} catch (Exception e) {
String contentType = null;
if(mimeResponse != null) {
contentType = mimeResponse.contentType();
if(contentType == null) {
if(contentType.toString().equals("text/html")) {
if(!webPages.containsKey(linkUrl)) {
WebPage webPage = new WebPage(linkUrl);
webPages.put(linkUrl, webPage);
downloadRepository.addElement(linkUrl, webPage);
System.out.println("Added webPage at " + linkUrl);
else {
if(!files.containsValue(link)) {
WebFile webFile = new WebFile(linkUrl);
files.put(linkUrl, webFile);
downloadRepository.addElement(linkUrl, webFile);
System.out.println("Added file at " + linkUrl);
System.out.print("\nFinished crawling " + url + " at crawl depth ");
System.out.println(currentCrawlDepth + "\n");
public HashMap<String, WebImage> getImages() {
return webImages;
public HashMap<String, WebPage> getCrawledWebPages() {
return webPages;
public HashMap<String, WebFile> getFiles() {
return files;
public String getUrl() {
return url;
public void saveToDisk(String filePath) {
The point of using a hashmap is to ensure that I do not parse the same website more than once. The error seems to be with my recursion. What is the issue?
Here is some sample output from starting the crawl at http://www.google.com:
Crawling https://www.google.com/ at crawl depth 0
Added webPage at http://www.google.com/intl/en/options/
Added webPage at https://www.google.com/intl/en/ads/
Added webPage at https://www.google.com/services/
Added webPage at https://www.google.com/intl/en/about.html
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/ at crawl depth 0
Crawling https://www.google.com/services/ at crawl depth 1
Added webPage at http://www.google.com/intl/en/enterprise/apps/business/?utm_medium=et&utm_campaign=en&utm_source=us-en-et-nelson_bizsol
Added webPage at https://www.google.com/services/sitemap.html
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/services/ at crawl depth 1
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 2**
Added webPage at https://www.google.com/intl/en/policies/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/privacy/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/faq/
Added webPage at https://www.google.com/intl/en/policies/technologies/
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/intl/en/policies/ at crawl depth 2
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 3**
Notice that it parses https://www.google.com/intl/en/policies/ twice.

You are creating a new map for each web page. This ensures that a link occurring twice on the same page is crawled only once, but it does not handle the case where the same link appears on two different pages.
https://www.google.com/intl/en/policies/ appears on both https://www.google.com/ and https://www.google.com/services/.
To avoid this, use a single map throughout your crawl and pass it as a parameter into the recursion.
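public class WebCrawler {
    // static, so that the static main and crawling methods can share it
    private static final Map<String, WebPage> visited = new HashMap<String, WebPage>();

    public static void crawling(Map<String, WebPage> visited, WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
        // crawl as before, skipping any URL that is already in visited
    }
}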
As you are also holding maps of the images and files, you may prefer to create a new object, perhaps called Visited, and make it keep track of everything.
public class Visited {
    private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();
    private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();

    // Returns true the first time a page URL is seen, false on repeats.
    // The maps are keyed by URL, so the lookup must use the URL, not the object.
    public boolean visit(String url, WebPage page) {
        if (webPages.containsKey(url)) {
            return false;
        }
        webPages.put(url, page);
        return true;
    }

    // Same check for images.
    public boolean visit(String url, WebImage image) {
        if (webImages.containsKey(url)) {
            return false;
        }
        webImages.put(url, image);
        return true;
    }
}
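To see the effect of sharing one collection across the whole recursion, here is a minimal sketch that runs without a network. It assumes an in-memory link graph in place of real pages; the class name SharedVisitedDemo, the crawl method, and the short paths are all made up for illustration and are not part of the original crawler.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the crawler: the link graph is an in-memory map
// instead of real HTTP requests, so the recursion can run without a network.
class SharedVisitedDemo {

    // Visits url and, depth permitting, every page it links to.
    // The same visited set is passed down every branch of the recursion.
    static void crawl(Map<String, List<String>> linkGraph, String url,
                      int depth, int maxDepth, Set<String> visited) {
        // Set.add returns false when the URL is already present.
        if (!visited.add(url)) {
            return;
        }
        System.out.println("Crawling " + url + " at crawl depth " + depth);
        if (depth >= maxDepth) {
            return;
        }
        for (String link : linkGraph.getOrDefault(url, List.of())) {
            crawl(linkGraph, link, depth + 1, maxDepth, visited);
        }
    }

    public static void main(String[] args) {
        // "/policies/" is linked from both "/" and "/services/",
        // mirroring the situation in the sample output.
        Map<String, List<String>> linkGraph = Map.of(
                "/", List.of("/services/", "/policies/"),
                "/services/", List.of("/policies/", "/"),
                "/policies/", List.of("/policies/", "/services/"));
        Set<String> visited = new HashSet<>();
        crawl(linkGraph, "/", 0, 3, visited);
        System.out.println(visited.size() + " distinct pages visited");
    }
}
```

Each page is printed once, no matter how many other pages link to it, because the check happens against the one shared set rather than a per-page map.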


JSoup get element Span

I am working with JSoup and this is my code:
public class ClassOLX {
public static final String URL = "https://www.olx.com.pe/item/nuevo-nissan-march-autoland-iid-1103776672";
public static void main (String args[]) throws IOException {
if (getStatusConnectionCode(URL) == 200) {
Document document = getHtmlDocument(URL);
String model = document.select(".rui-2CYS9").select(".itemPrice").text();
System.out.println("Model: "+model);
public static int getStatusConnectionCode(String url) {
Response response = null;
try {
response = Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(100000).ignoreHttpErrors(true).execute();
} catch (IOException ex) {
return response.statusCode();
public static Document getHtmlDocument(String url) {
Document doc = null;
try {
doc = Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(100000).get();
} catch (IOException ex) {
return doc;
This is the page:
I want to get the values of the following elements: itemPrice, _18gRm, itemTitle, _2FRXm
Thanks for all.
All you have to do is use the following class selectors and read the text of each element:
String price = doc.select("._2xKfz").text();
String year = doc.select("._18gRm").text();
String title = doc.select("._3rJ6e").text();
String place = doc.select("._2FRXm").text();
And it will get you the desired data.

Crawling amazon.com

I'm crawling Amazon products, and in principle it's going fine.
I have three classes from this nice tutorial:
I added the files to the following code (class Spider):
import java.io.FileNotFoundException;
import java.util.*;
public class Spider {
public static final int MAX_PAGES_TO_SEARCH = 10000;
private Set<String> pagesVisited = new HashSet<String>();
private List<String> pagesToVisit = new LinkedList<String>();
public void search(String url) {
while (this.pagesVisited.size() < MAX_PAGES_TO_SEARCH) {
String currentUrl;
SpiderLeg leg = new SpiderLeg();
if (this.pagesToVisit.isEmpty()) {
currentUrl = url;
} else {
currentUrl = this.nextUrl();
try {
leg.crawl(currentUrl); // Lots of stuff happening here. Look at the crawl method in
} catch (FileNotFoundException e) {
System.out.println("Oops, FileNotFoundException caught");
} catch (InterruptedException e) {
System.out.println("\n**Done** Visited " + this.pagesVisited.size() + " web page(s)");
SpiderLeg leg = new SpiderLeg();
for (int i = 0; i < leg.adjMatrix.length; i++) {
private String nextUrl() {
String nextUrl;
do {
if (this.pagesToVisit.isEmpty()){
return "https://www.amazon.de/Proband-Thriller-Guido-Kniesel/dp/1535287004/ref=sr_1_1?s=books&ie=UTF8&qid=1478247246&sr=1-1&keywords=%5B%5D";
nextUrl = this.pagesToVisit.remove(0);
} while (this.pagesVisited.contains(nextUrl));
return nextUrl;
class SpiderLeg:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.*;
import java.util.*;
public class SpiderLeg {
// We'll use a fake USER_AGENT so the web server thinks the robot is a normal web browser.
private static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36";
private static List<String> links = new LinkedList<String>();
private static String graphLink;
private Document htmlDocument;
private static double counter = 0;
static Map<String, Set<String>> adjMap = new HashMap<String, Set<String>>();
static int[][] adjMatrix;
static List<String> mapping;
public boolean crawl(String url) throws FileNotFoundException {
if (url.isEmpty()) {
return false;
Connection connection = Jsoup.connect(url).ignoreContentType(true).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
this.htmlDocument = htmlDocument;
if(connection.response().statusCode() == 200){
// 200 is the HTTP OK status code
// indicating that everything is great.
double progress;
progress = (counter/Spider.MAX_PAGES_TO_SEARCH)*100;
System.out.println("\n**Visiting** Received web page at " + url);
System.out.println("\n**Progress** " + progress + "%");
if(!connection.response().contentType().contains("text/html")) {
System.out.println("**Failure** Retrieved something other than HTML");
return false;
//Elements linksOnPage = htmlDocument.select("a[href*=/gp/product/]");
Elements linksOnPage = htmlDocument.select("a[href*=/dp/]");
Elements salesRank = htmlDocument.select("span.zg_hrsr_rank");
Elements category = htmlDocument.select("span.zg_hrsr_ladder a");
String categoryString = category.html();
String salesRankString = salesRank.html();
salesRankString = salesRankString.replace("\n", " ");
categoryString = categoryString.replace("\n", " ");
System.out.println("Found (" + linksOnPage.size() + ") links");
PrintWriter pw = new PrintWriter(new FileWriter("Horror.csv", true));
StringBuilder sb = new StringBuilder();
int beginIndex = url.indexOf(".de/");
int endIndex = url.indexOf("/dp");
String title = url.substring(beginIndex+4,endIndex);
adjMap.put(title, new HashSet<String>());
for(Element link : linksOnPage){
String graphLink = link.attr("abs:href");
return true;
catch(IOException ioe) {
// We were not successful in our HTTP request
System.out.println("Error in our HTTP request " + ioe);
return false;
public static void calcAdjMatrix(){
Set<String> allMyURLs = new HashSet(adjMap.keySet());
for(String s: adjMap.keySet()){
System.out.println(s + "\t" + adjMap.get(s));
int dim = allMyURLs.size();
adjMatrix = new int[dim][dim];
List<String> nodes_list = new ArrayList<>();
for(String s: allMyURLs){
for(String s: nodes_list){
Set<String> outEdges = adjMap.get(s);
int i = nodes_list.indexOf(s);
if(outEdges != null){
for(String s1: outEdges){
int j = nodes_list.indexOf(s1);
adjMatrix[i][j] = 1;
public String cutTitle(String url) throws FileNotFoundException{
int beginIndex = url.indexOf(".de/");
int endIndex = url.indexOf("/dp");
String title;
if(url.contains(".de") && url.contains("/dp")){
title = url.substring(beginIndex+4,endIndex);
title = "wrong url";
return title;
public boolean searchForWord(String searchWord) {
if(this.htmlDocument == null) {
System.out.println("ERROR! Call crawl() before performing analysis on the document");
return false;
System.out.println("Searching for the word " + searchWord + "...");
String bodyText = this.htmlDocument.body().text();
return bodyText.toLowerCase().contains(searchWord.toLowerCase());
public List<String> getLinks(){
return this.links;
class SpiderTest:
public class SpiderTest {
public static void main(String[] args) {
Spider spider = new Spider();
Now the problem is that after about 100 URLs, I think Amazon is banning me from the server. The program doesn't find URLs anymore.
Does anyone have an idea how I can fix that?
Well, don't be rude and crawl them then.
Check their robots.txt to see what they allow you to do. Don't be surprised if they ban you when you go places they don't want you to go.
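As a rough illustration of what checking robots.txt means in code, here is a simplified sketch. It only honours Disallow rules in the "User-agent: *" group and treats each rule as a plain path prefix. Allow rules, wildcards, Crawl-delay, and groups that list several user agents are not handled. The class name RobotsRules and the sample rules are made up for illustration and are not Amazon's actual robots.txt.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified robots.txt check: honours only Disallow rules in the
// "User-agent: *" group and treats each rule as a plain path prefix.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    RobotsRules(String robotsTxt) {
        boolean inWildcardGroup = false;
        for (String rawLine : robotsTxt.split("\\R")) {
            // Strip trailing comments and surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or something that is not "field: value"
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
    }

    // True if no collected Disallow prefix matches the given path.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up rules for the demo.
        String sample = "User-agent: *\nDisallow: /gp/cart\nDisallow: /private/\n";
        RobotsRules rules = new RobotsRules(sample);
        System.out.println(rules.isAllowed("/dp/1535287004"));
        System.out.println(rules.isAllowed("/gp/cart/view.html"));
    }
}
```

A crawler would fetch /robots.txt once per host, build the rules, and call isAllowed before queuing each URL. For real use, a complete parser is a better choice than this sketch.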
The problem is very common when you try to crawl big websites that don't want to be crawled. They basically block you for a period of time to prevent their data being crawled or stolen.
That said, you have two options: either make each request from a different IP/server, which makes your requests look legitimate and avoids the ban, or take the easier route and use a service that does that for you.
I've done both. The first one is complex, takes time, and needs maintenance (you have to build a network of servers). The second is usually not free, but it is very fast to implement, and the service takes care of returning data for your requests so that you don't get banned.
There are several services on the internet that do this. I've used proxycrawl (which also has a free tier) in the past, and it worked very well. They have an API that you call, so you can keep your existing code and only change the URL you request.
This would be an example for amazon:
GET https://api.proxycrawl.com?token=yourtoken&url=https://amazon.com
You would always get a response, even if you crawl 1000 pages a second, and you should not get banned, because you are calling that proxy, which does all the magic for you, instead of the site directly.
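One detail worth noting when building such a request from Java: the target URL goes inside a query parameter, so it generally needs to be URL-encoded, otherwise any & or ? in the product URL would be read as part of the outer request. A minimal sketch, assuming the token and url parameter names from the example above; the class and method names are made up for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds the request URL for a proxy-style API that takes the page to fetch
// as a query parameter. Both parameter values are percent-encoded.
class ProxyRequestUrl {
    static String build(String apiBase, String token, String targetUrl) {
        return apiBase
                + "?token=" + URLEncoder.encode(token, StandardCharsets.UTF_8)
                + "&url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "yourtoken" is a placeholder, as in the example request.
        System.out.println(build("https://api.proxycrawl.com", "yourtoken",
                "https://www.amazon.de/dp/1535287004?s=books&sr=1-1"));
    }
}
```

The resulting string can be passed to the same code that previously fetched the Amazon URL directly.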
I hope it helps :)
You can try using proxy servers to avoid being blocked. There are services that provide working proxies. I have had good experience with https://gimmeproxy.com, which specifically offers proxies that support Amazon.
To get a proxy that works with Amazon, you just need to make the following request:
You will get a JSON response with all the proxy data, which you can use later as needed:
{
  "supportsHttps": true,
  "protocol": "socks5",
  "ip": "",
  "port": "1915",
  "get": true,
  "post": true,
  "cookies": true,
  "referer": true,
  "user-agent": true,
  "anonymityLevel": 1,
  "websites": {
    "example": true,
    "google": false,
    "amazon": true
  },
  "country": "BR",
  "tsChecked": 1517952910,
  "curl": "socks5://",
  "ipPort": "",
  "type": "socks5",
  "speed": 37.78,
  "otherProtocols": {}
}

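Once you have the protocol, ip, and port fields from a response like the one above, they can be turned into a java.net.Proxy. The following is a minimal sketch using only the standard library. The class name ProxyFromJson is made up, and 203.0.113.10 is a documentation placeholder address, since the real IP is blank in the sample response.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Converts the protocol/ip/port fields of a proxy description into a java.net.Proxy.
class ProxyFromJson {
    static Proxy toProxy(String protocol, String ip, int port) {
        // "socks4" and "socks5" both map to Proxy.Type.SOCKS; anything else is treated as HTTP.
        Proxy.Type type = protocol.toLowerCase().startsWith("socks")
                ? Proxy.Type.SOCKS
                : Proxy.Type.HTTP;
        // createUnresolved avoids a DNS lookup when the address object is built.
        return new Proxy(type, InetSocketAddress.createUnresolved(ip, port));
    }

    public static void main(String[] args) {
        // Placeholder address; port taken from the sample response.
        Proxy proxy = toProxy("socks5", "203.0.113.10", 1915);
        System.out.println(proxy.type() + " " + proxy.address());
    }
}
```

The resulting object can be passed to URL.openConnection(Proxy). Recent Jsoup versions also accept it through Connection.proxy(Proxy), so the SpiderLeg connection could be routed through it, assuming your Jsoup version has that method.
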
JSOUP Wait for page to be parsed

I have a JSOUP Login program that logs into a website and grabs info from the page. It works well, but it takes ~3 seconds for the information to be parsed into ArrayLists as JSOUP takes a while.
I also have a check to see if the correct page is loaded correctly. (It's just checking the ArrayLists to see if they are empty meaning the page isn't loaded)
public void onClick(View v) {
SourcePage sp = new SourcePage(user.getText().toString(), pass.getText().toString());
if(sp.isConnected()) { //Refer to the bottom of the next code box
Toast.makeText(getApplicationContext(), sp.getGradeLetters().get(0), Toast.LENGTH_SHORT).show();
startActivity(new Intent(MainActivity.this, gradepage.class));
}else {
Toast.makeText(getApplicationContext(), "Login Failed", Toast.LENGTH_SHORT).show();
private void login() {
Thread th = new Thread() {
public void run() {
try {
HashMap<String, String> cookies = new HashMap<>();
HashMap<String, String> formData = new HashMap<>();
Connection.Response loginForm = Jsoup.connect(URL)
Document loginDoc = loginForm.parse();
String pstoken = loginDoc.select("#LoginForm > input[type=\"hidden\"]:nth-child(1)").first().attr("value");
String contextData = loginDoc.select("#contextData").first().attr("value");
String dbpw = loginDoc.select("#LoginForm > input[type=\"hidden\"]:nth-child(3)").first().attr("value");
String serviceName = "PS Parent Portal";
String credentialType = "User Id and Password Credential";
//Inserting all hidden form data things
formData.put("pstoken", pstoken);
formData.put("contextData", contextData);
formData.put("dbpw", dbpw);
formData.put("serviceName", serviceName);
formData.put("credentialType", credentialType);
formData.put("Account", USERNAME);
formData.put("ldappassword", PASSWORD);
formData.put("pw", PASSWORD);
Connection.Response homePage = Jsoup.connect(POST_URL)
mainDoc = Jsoup.parse(homePage.parse().html());
//Get persons name
NAME = mainDoc.select("div#sps-stdemo-non-conf").select("h1").first().text();
//Getting Grades for Semester 2
Elements grades = mainDoc.select("td.colorMyGrade").select("[href*='fg=S2']");
for (Element j : grades)
if (!j.text().equals("--")) {
String gradeText = j.text();
gradeLetter.add(gradeText.substring(0, gradeText.indexOf(" ")));
gradeNumber.add(Double.parseDouble(gradeText.substring(gradeText.indexOf(" ") + 1)));
Elements teachers = mainDoc.select("td[align='left']");
for (int i = 1; i < teachers.size(); i += 2)
String fullText = teachers.get(i).text().replaceAll("//s+", ".");
}catch (IOException e) {
public boolean isConnected() {
return (!(gradeLetter.isEmpty() || gradeNumber.isEmpty() || teacherList.isEmpty()));
The big problem is that the program (onClick) is giving the Toast "Login Failed" because the isConnected method doesn't wait for the page to load. How can I fix this?
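The underlying issue is that login() starts a background thread and returns immediately, so isConnected() runs before the lists are filled. One common way to handle this is to run the check as a callback once the background work has finished, rather than right after starting it. Here is a minimal sketch of that idea with CompletableFuture, using a stand-in for the Jsoup work. The names LoginAsyncDemo and fetchGrades and the sample grades are made up for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

class LoginAsyncDemo {
    // Hypothetical stand-in for the Jsoup login and parsing; the sleep mimics network delay.
    static List<String> fetchGrades() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return List.of("A", "B+");
    }

    // Runs the slow work on a background thread and decides the message
    // only after the grades are available.
    static CompletableFuture<String> login() {
        return CompletableFuture.supplyAsync(LoginAsyncDemo::fetchGrades)
                .thenApply(grades -> grades.isEmpty() ? "Login Failed" : grades.get(0));
    }

    public static void main(String[] args) {
        // join() is used here only so the demo JVM does not exit early;
        // a click handler would register the callback and return.
        login().thenAccept(message -> System.out.println(message)).join();
    }
}
```

On Android, CompletableFuture needs API level 24 or higher, and the callback runs on a background thread, so the Toast and startActivity calls would have to be posted back to the UI thread (for example with runOnUiThread). On older API levels, the same pattern can be built with a callback interface that the background thread invokes when parsing is done.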

Wicket: redirecting to wicket page using setResponsePage

I have a Wicket page which has a link, ADD PRODUCT. Clicking the link opens a modal window that takes the product information.
public class ProductAddPanel extends Panel {
private InlineFrame uploadIFrame = null;
private ModalWindow window;
private Merchant merchant;
private Page redirectPage;
private List<Component> refreshables;
public ProductAddPanel(String id,final Merchant mct,ModalWindow window,List<Component> refreshables,Page p) {
this.window = window;
merchant = mct;
redirectPage = p;
this.refreshables = refreshables;
protected void onBeforeRender() {
if (uploadIFrame == null) {
// the iframe should be attached to a page to be able to get its pagemap,
// that's why I'm adding it in onBeforeRender
// Create the iframe containing the upload widget
private void addUploadIFrame() {
IPageLink iFrameLink = new IPageLink() {
public Page getPage() {
return new UploadIFrame(window,merchant,redirectPage,refreshables) {
protected String getOnUploadedCallback() {
return "onUpload_" + ProductAddPanel.this.getMarkupId();
public Class<UploadIFrame> getPageIdentity() {
return UploadIFrame.class;
uploadIFrame = new InlineFrame("upload", iFrameLink);
<iframe wicket:id="upload" frameborder="0"style="height: 600px; width: 475px;overflow: hidden"></iframe>
I am using an iframe to upload the image, because it is not possible to upload a file using an Ajax submit. I have added the iframe to my ProductPanel.html.
protected void onSubmit(AjaxRequestTarget target, Form<?> form) {
DynamicImage imageEntry = new DynamicImage();
if(uploadField.getFileUpload() != null && uploadField.getFileUpload().getClientFileName() != null){
FileUpload upload = uploadField.getFileUpload();
String ct = upload.getContentType();
if (!imgctypes.containsKey(ct)) {
hasError = true;
if(upload.getSize() > maximagesize){
hasError = true;
if(hasError == false){
System.out.println("######################## Image can be uploaded ################");
if(imageEntry != null){
try {
} catch (IOException e) {
target.appendJavaScript("$().toastmessage('showNoticeToast','Please select a valid image!!')");
System.out.println("#################### Error in image uploading ###################");
System.out.println("########################### Image not Selected #####################");
MerchantProduct mp =new MerchantProduct();
Product p = new Product();
Date d=new Date();
try {
} catch (Exception e) {
for(Component r: refreshables){
public void save(DynamicImage imageEntry, InputStream imageStream) throws IOException{
//Read the image data
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte [] imageData = baos.toByteArray();
baos = null;
//Get the image suffix
String suffix = null;
suffix = ".gif";
}else if ("image/jpeg".equalsIgnoreCase(imageEntry.getContentType())) {
suffix = ".jpeg";
} else if ("image/png".equalsIgnoreCase(imageEntry.getContentType())) {
suffix = ".png";
// Create a unique name for the file in the image directory and
// write the image data into it.
File newFile = createImageFile(suffix);
OutputStream outStream = new FileOutputStream(newFile);
//copy data from src to dst
private void copy(InputStream source, OutputStream destination) throws IOException{
try {
// Transfer bytes from source to destination
byte[] buf = new byte[1024];
int len;
while ((len = source.read(buf)) > 0) {
destination.write(buf, 0, len);
if (logger.isDebugEnabled()) {
logger.debug("Copying image...");
} catch (IOException ioe) {
throw ioe;
private File createImageFile(String suffix){
UUID uuid = UUID.randomUUID();
File file = new File(imageDir,uuid.toString() + suffix);
logger.debug("File "+ file.getAbsolutePath() + "created.");
return file;
I am using setResponsePage() to redirect to the initial page on which the "Add Product" link is present, so that I get the refreshed page with the new product information.
My problem is that the modal window does not close on window.close(), and I get the refreshed page inside that window.
My requirement is that the modal window should close and the page should be refreshed. I am passing ParentPage.class to setResponsePage().
Any help and advice is appreciated! Thanks in advance.
In ParentPage.class, where the modal window is opened, I called the setWindowClosedCallback() method, in which I add getPage() to the target so that the page refreshes when the modal window is closed.
Here is the code for that:
modalDialog.setWindowClosedCallback(new ModalWindow.WindowClosedCallback()
private static final long serialVersionUID = 1L;
public void onClose(AjaxRequestTarget target)

Do Not Crawl certain page in a particular link(exclude certain url from crawling)

Below is the code in my MyCrawler.java. It crawls all the links that match the prefixes I have provided in href.startsWith. Suppose I do not want to crawl one particular page, http://inv.somehost.com/people/index.html. How can I do that in my code?
public MyCrawler() {
public boolean shouldVisit(WebURL url) {
String href = url.getURL().toLowerCase();
if (href.startsWith("http://www.somehost.com/") || href.startsWith("http://inv.somehost.com/") || href.startsWith("http://jo.somehost.com/")) {
//And If I do not want to crawl this page http://inv.somehost.com/data/index.html then how it can be done..
return true;
return false;
public void visit(Page page) {
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String text = page.getText();
List<WebURL> links = page.getURLs();
int parentDocid = page.getWebURL().getParentDocid();
try {
URL url1 = new URL(url);
System.out.println("URL:- " +url1);
URLConnection connection = url1.openConnection();
Map responseMap = connection.getHeaderFields();
Iterator iterator = responseMap.entrySet().iterator();
while (iterator.hasNext())
String key = iterator.next().toString();
if (key.contains("text/html") || key.contains("text/xhtml"))
// Content-Type=[text/html; charset=ISO-8859-1]
if (filters.matcher(key) != null){
try {
final File parentDir = new File("crawl_html");
final String hash = MD5Util.md5Hex(url1.toString());
final String fileName = hash + ".txt";
final File file = new File(parentDir, fileName);
boolean success =file.createNewFile(); // Creates file crawl_html/abc.txt
System.out.println("hash:-" + hash);
// Create file if it does not exist
// File did not exist and was created
FileOutputStream fos = new FileOutputStream(file, true);
PrintWriter out = new PrintWriter(fos);
// Also could be written as follows on one line
// Printwriter out = new PrintWriter(new FileWriter(args[0]));
// Write text to file
Tika t = new Tika();
String content= t.parseToString(new URL(url1.toString()));
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
} catch (IOException e) {
// TODO Auto-generated catch block
} catch (TikaException e) {
// TODO Auto-generated catch block
// http://google.com
} catch (MalformedURLException e) {
} catch (IOException e) {
And this is my Controller.java code from where MyCrawler is getting called..
public class Controller {
public static void main(String[] args) throws Exception {
CrawlController controller = new CrawlController("/data/crawl/root");
controller.start(MyCrawler.class, 20);
Any suggestions will be appreciated..
How about adding a property that lists the URLs you want to exclude?
Add to your exclusions list all the pages that you don't want crawled.
Here is an example:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyCrawler extends WebCrawler {
    List<Pattern> exclusionsPatterns;

    public MyCrawler() {
        exclusionsPatterns = new ArrayList<Pattern>();
        // Add all your exclusions here, using regular expressions
    }

    /*
     * You should implement this function to specify
     * whether the given URL should be visited or not.
     */
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        // Iterate over the patterns to check whether the URL is excluded.
        for (Pattern exclusionPattern : exclusionsPatterns) {
            Matcher matcher = exclusionPattern.matcher(href);
            if (matcher.matches()) {
                return false;
            }
        }
        if (href.startsWith("http://www.ics.uci.edu/")) {
            return true;
        }
        return false;
    }
}
In this example, we specify that all URLs starting with http://investor.somehost.com should not be crawled.
So these won't be crawled:
I recommend reading up on regular expressions.
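Here is a self-contained sketch of the same exclusion logic that runs without crawler4j, so the patterns can be tried out on plain strings. The class name ExclusionFilter and the example patterns are made up for illustration. Note that Matcher.matches() only succeeds when the whole URL matches, so a prefix-style exclusion needs a trailing .* in the pattern.

```java
import java.util.List;
import java.util.regex.Pattern;

// Decides whether a URL should be crawled: exclusion patterns are checked
// first, then the allowed prefixes.
class ExclusionFilter {
    private final List<Pattern> exclusions;
    private final List<String> allowedPrefixes;

    ExclusionFilter(List<Pattern> exclusions, List<String> allowedPrefixes) {
        this.exclusions = exclusions;
        this.allowedPrefixes = allowedPrefixes;
    }

    boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        for (Pattern exclusion : exclusions) {
            // matches() requires the entire URL to match the pattern.
            if (exclusion.matcher(href).matches()) {
                return false;
            }
        }
        for (String prefix : allowedPrefixes) {
            if (href.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ExclusionFilter filter = new ExclusionFilter(
                List.of(
                        // Excludes exactly one page.
                        Pattern.compile("http://inv\\.somehost\\.com/people/index\\.html"),
                        // Excludes everything under one host.
                        Pattern.compile("http://investor\\.somehost\\.com/.*")),
                List.of("http://www.somehost.com/", "http://inv.somehost.com/",
                        "http://jo.somehost.com/"));
        System.out.println(filter.shouldVisit("http://inv.somehost.com/people/index.html"));
        System.out.println(filter.shouldVisit("http://inv.somehost.com/data/index.html"));
    }
}
```

Dots in host names are escaped as \\. because an unescaped dot in a regular expression matches any character.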

