I am trying to get URLs and HTML elements from a website. I am able to get the URLs and HTML, but when one URL contains multiple elements (for example multiple input elements or multiple textarea elements) I only get the last element. The code is below.
GetURLsAndElemens.java
public static void main(String[] args) throws FileNotFoundException,
        IOException, ParseException {
    Properties properties = new Properties();
    properties.load(new FileInputStream(
            "src//io//servicely//ci//plugin//SeleniumResources.properties"));
    Map<String, String> urls = gettingUrls(properties.getProperty("MAIN_URL"));
    GettingHTMLElements.getHTMLElements(urls);
    // System.out.println(urls.size());
    // System.out.println(urls);
}
public static Map<String, String> gettingUrls(String mainURL) {
    Document doc = null;
    Map<String, String> urlsList = new HashMap<String, String>();
    try {
        System.out.println("Main URL " + mainURL);
        // need http protocol
        doc = Jsoup.connect(mainURL).get();
        GettingHTMLElements.getInputElements(doc, mainURL);
        // get page title
        // String title = doc.title();
        // System.out.println("title : " + title);
        // get all links
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // urlsList.clear();
            // get the value from href attribute and adding to list
            if (link.attr("href").contains("http")) {
                urlsList.put(link.attr("href"), link.text());
            } else {
                urlsList.put(mainURL + link.attr("href"), link.text());
            }
            // System.out.println(urlsList);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    // System.out.println("Total urls are " + urlsList.size());
    // System.out.println(urlsList);
    return urlsList;
}
GettingHtmlElements.java
static Map<String, HtmlElements> urlList = new HashMap<String, HtmlElements>();

public static void getHTMLElements(Map<String, String> urls)
        throws IOException {
    getElements(urls);
}

public static void getElements(Map<String, String> urls) throws IOException {
    for (Map.Entry<String, String> entry1 : urls.entrySet()) {
        try {
            System.out.println(entry1.getKey());
            Document doc = Jsoup.connect(entry1.getKey()).get();
            getInputElements(doc, entry1.getKey());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    Map<String, HtmlElements> list = urlList;
    for (Map.Entry<String, HtmlElements> entry1 : list.entrySet()) {
        HtmlElements ele = entry1.getValue();
        System.out.println("url is " + entry1.getKey());
        System.out.println("input name " + ele.getInput_name());
    }
}
public static HtmlElements getInputElements(Document doc, String entry1) {
    HtmlElements htmlElements = new HtmlElements();
    Elements inputElements2 = doc.getElementsByTag("input");
    Elements textAreaElements2 = doc.getElementsByTag("textarea");
    Elements formElements3 = doc.getElementsByTag("form");
    for (Element inputElement : inputElements2) {
        String key = inputElement.attr("name");
        htmlElements.setInput_name(key);
        String key1 = inputElement.attr("type");
        htmlElements.setInput_type(key1);
        String key2 = inputElement.attr("class");
        htmlElements.setInput_class(key2);
    }
    for (Element inputElement : textAreaElements2) {
        String key = inputElement.attr("id");
        htmlElements.setTextarea_id(key);
        String key1 = inputElement.attr("name");
        htmlElements.setTextarea_name(key1);
    }
    for (Element inputElement : formElements3) {
        String key = inputElement.attr("method");
        htmlElements.setForm_method(key);
        String key1 = inputElement.attr("action");
        htmlElements.setForm_action(key1);
    }
    return urlList.put(entry1, htmlElements);
}
I want to capture these elements as a bean. For every URL I do get the URL and its HTML elements, but when a URL contains multiple elements I only get the last element.
You use a class HtmlElements which is not part of Jsoup as far as I know. I don't know its inner workings, but I assume it is some sort of container for HTML nodes.
However, you seem to use this class like this:
HtmlElements htmlElements = new HtmlElements();
htmlElements.setInput_name(key);
This indicates that only ONE HTML element is stored in the htmlElements variable, which would explain why you only get the last element: you simply overwrite that single instance on every loop iteration.
It is hard to say for sure, since I don't know the HtmlElements class. Maybe something like this works, assuming that HtmlElement represents a single element and HtmlElements has an add method:
HtmlElements htmlElements = new HtmlElements();
...
for (Element inputElement : inputElements2) {
    HtmlElement e = new HtmlElement();
    htmlElements.add(e);
    String key = inputElement.attr("name");
    e.setInput_name(key);
}
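If you cannot change HtmlElements itself, another option is to keep one bean per element and a list of beans per URL. This is only a minimal sketch of that idea; the HtmlElement bean below is hypothetical and only mirrors the setters you already use:
// Hypothetical bean: one instance per <input> element found on a page.
class HtmlElement {
    private String inputName;
    private String inputType;

    public void setInputName(String inputName) { this.inputName = inputName; }
    public void setInputType(String inputType) { this.inputType = inputType; }
    public String getInputName() { return inputName; }
    public String getInputType() { return inputType; }
}

// One list of element beans per URL, so nothing gets overwritten.
static Map<String, List<HtmlElement>> urlList = new HashMap<String, List<HtmlElement>>();

public static void getInputElements(Document doc, String url) {
    List<HtmlElement> elements = new ArrayList<HtmlElement>();
    for (Element inputElement : doc.getElementsByTag("input")) {
        HtmlElement bean = new HtmlElement();   // a new bean for every element
        bean.setInputName(inputElement.attr("name"));
        bean.setInputType(inputElement.attr("type"));
        elements.add(bean);
    }
    urlList.put(url, elements);
}
That way urlList.get(url) returns every input element found on the page instead of just the last one.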
Related
I want to use Jsoup to extract the first link from the Google search results. For example, if I search for "apple" on Google, the first link I see is www.apple.com/. How do I return the first link? I am currently able to extract all links using Jsoup:
new Thread(new Runnable() {
    @Override
    public void run() {
        final StringBuilder stringBuilder = new StringBuilder();
        try {
            Document doc = Jsoup.connect(sharedURL).get();
            String title = doc.title();
            Elements links = doc.select("a[href]");
            stringBuilder.append(title).append("\n");
            for (Element link : links) {
                stringBuilder.append("\n").append(" ").append(link.text())
                        .append(" ").append(link.attr("href")).append("\n");
            }
        } catch (IOException e) {
            stringBuilder.append("Error : ").append(e.getMessage()).append("\n");
        }
        runOnUiThread(new Runnable() {
            @Override
            public void run() {
                // set text
                textView.setText(stringBuilder.toString());
            }
        });
    }
}).start();
Do you mean:
Element firstLink = doc.select("a[href]").first();
It works for me. If you meant something else, let us know. I checked the search results with the following and it's a tough one to decipher, as there are so many types of results that come back: maps, news, ads, etc.
I tidied up the code a little with the use of Java lambdas:
public static void main(String[] args) {
    new Thread(() -> {
        final StringBuilder stringBuilder = new StringBuilder();
        try {
            String sharedUrl = "https://www.google.com/search?q=apple";
            Document doc = Jsoup.connect(sharedUrl).get();
            String title = doc.title();
            Elements links = doc.select("a[href]");
            Element firstLink = links.first(); // <<<<< NEW ADDITION
            stringBuilder.append(title).append("\n");
            for (Element link : links) {
                stringBuilder.append("\n")
                        .append(" ")
                        .append(link.text())
                        .append(" ")
                        .append(link.attr("href"))
                        .append("\n");
            }
        } catch (IOException e) {
            stringBuilder.append("Error : ").append(e.getMessage()).append("\n");
        }
        // replaced some of this for running/testing locally
        SwingUtilities.invokeLater(() -> System.out.println(stringBuilder.toString()));
    }).start();
}
I wrote a simple recursive web crawler that fetches just the URL links from a web page.
Now I am trying to figure out a way to limit the crawler by depth, but I am not sure how to do that (I can limit the crawler to the top N links, but I want to limit it by depth).
For example: depth 2 should fetch the parent links -> the children's links -> the children's children's links.
Any input is appreciated.
public class SimpleCrawler {
    static Map<String, String> retMap = new ConcurrentHashMap<String, String>();

    public static void main(String args[]) throws IOException {
        StringBuffer sb = new StringBuffer();
        Map<String, String> map = (returnURL("http://www.google.com"));
        recursiveCrawl(map);
        for (Map.Entry<String, String> entry : retMap.entrySet()) {
            sb.append(entry.getKey());
        }
    }

    public static void recursiveCrawl(Map<String, String> map)
            throws IOException {
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive);
        }
    }

    public synchronized static Map<String, String> returnURL(String URL)
            throws IOException {
        Map<String, String> tempMap = new HashMap<String, String>();
        Document doc = null;
        if (URL != null && !URL.equals("") && !retMap.containsKey(URL)) {
            System.out.println("Processing==>" + URL);
            try {
                URL url = new URL(URL);
                System.setProperty("http.proxyHost", "proxy");
                System.setProperty("http.proxyPort", "port");
                doc = Jsoup.connect(URL).get();
                if (doc != null) {
                    Elements links = doc.select("a");
                    String FinalString = "";
                    for (Element e : links) {
                        FinalString = "http:" + e.attr("href");
                        if (!retMap.containsKey(FinalString)) {
                            tempMap.put(FinalString, FinalString);
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            retMap.put(URL, URL);
        } else {
            System.out.println("****Skipping URL****" + URL);
        }
        return tempMap;
    }
}
EDIT1:
I thought of using a worklist and modified the code accordingly. I am still not sure how to set the depth here either (I can set the number of web pages to crawl, but not the exact depth). Any suggestions would be appreciated.
public void startCrawl(String url) {
    while (this.pagesVisited.size() < 2) {
        String currentUrl;
        SpiderLeg leg = new SpiderLeg();
        if (this.pagesToVisit.isEmpty()) {
            currentUrl = url;
            this.pagesVisited.add(url);
        } else {
            currentUrl = this.nextUrl();
        }
        leg.crawl(currentUrl);
        System.out.println("pagesToVisit Size" + pagesToVisit.size());
        // SpiderLeg
        this.pagesToVisit.addAll(leg.getLinks());
    }
    System.out.println("\n**Done** Visited " + this.pagesVisited.size()
            + " web page(s)");
}
Based on the non-recursive approach:
Keep a worklist of URLs, pagesToCrawl, of type CrawlURL:
class CrawlURL {
    public String url;
    public int depth;

    public CrawlURL(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }
}
initially (before entering the loop):
Queue<CrawlURL> pagesToCrawl = new LinkedList<>();
pagesToCrawl.add(new CrawlURL(rootUrl, 0)); //rootUrl is the url to start from
now the loop:
while (!pagesToCrawl.isEmpty()) { // will proceed at least once (for rootUrl)
    CrawlURL currentUrl = pagesToCrawl.remove();
    // analyze the url
    // update with crawled links
}
and the update with the crawled links:
if (currentUrl.depth < 2) {
    for (String url : leg.getLinks()) { // referring to your analysis result
        pagesToCrawl.add(new CrawlURL(url, currentUrl.depth + 1));
    }
}
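Putting the pieces together, a minimal sketch of the depth-limited worklist loop could look like this; the leg.crawl(...) / leg.getLinks() calls are taken from your EDIT1 code and stand in for whatever per-page analysis you do:
Queue<CrawlURL> pagesToCrawl = new LinkedList<>();
Set<String> seen = new HashSet<>();            // never queue the same URL twice
pagesToCrawl.add(new CrawlURL(rootUrl, 0));    // rootUrl is the url to start from
seen.add(rootUrl);

while (!pagesToCrawl.isEmpty()) {
    CrawlURL current = pagesToCrawl.remove();
    SpiderLeg leg = new SpiderLeg();
    leg.crawl(current.url);                    // analyze the page
    if (current.depth < 2) {                   // only expand links below the depth limit
        for (String url : leg.getLinks()) {
            if (seen.add(url)) {               // add() returns false if already seen
                pagesToCrawl.add(new CrawlURL(url, current.depth + 1));
            }
        }
    }
}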
You could enhance CrawlURL with other meta data (e.g. link name, referrer, etc.).
Alternative:
In my comment above I mentioned a generational approach. It is a bit more complex than this one. The basic idea is to keep two lists (currentPagesToCrawl and futurePagesToCrawl) together with a generation variable (starting at 0 and increasing every time currentPagesToCrawl becomes empty). All crawled URLs are put into the futurePagesToCrawl queue, and when currentPagesToCrawl becomes empty the two lists are swapped. This is done until the generation variable reaches 2. A rough sketch of this idea follows.
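The sketch again borrows leg.crawl(...) / leg.getLinks() from your EDIT1 code; the exact cutoff on the generation counter is easy to adjust to whatever depth semantics you want:
Queue<String> currentPagesToCrawl = new LinkedList<>();
Queue<String> futurePagesToCrawl = new LinkedList<>();
Set<String> seen = new HashSet<>();
int generation = 0;                            // 0 = root, 1 = children, 2 = grandchildren

currentPagesToCrawl.add(rootUrl);
seen.add(rootUrl);

while (!currentPagesToCrawl.isEmpty() && generation <= 2) {
    String url = currentPagesToCrawl.remove();
    SpiderLeg leg = new SpiderLeg();
    leg.crawl(url);                            // analyze the page
    for (String link : leg.getLinks()) {
        if (seen.add(link)) {
            futurePagesToCrawl.add(link);      // found links belong to the next generation
        }
    }
    if (currentPagesToCrawl.isEmpty()) {       // current generation exhausted: swap the lists
        Queue<String> tmp = currentPagesToCrawl;
        currentPagesToCrawl = futurePagesToCrawl;
        futurePagesToCrawl = tmp;
        generation++;
    }
}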
You could add a depth parameter to the signature of your recursive method, e.g.
in your main:
recursiveCrawl(map,0);
and
public static void recursiveCrawl(Map<String, String> map, int depth) throws IOException {
    if (depth++ < DESIRED_DEPTH) // assuming initial depth = 0
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive, depth);
        }
}
You can do something like this:
static int maxLevels = 10;

public static void main(String args[]) throws IOException {
    ...
    recursiveCrawl(map, 0);
    ...
}

public static void recursiveCrawl(Map<String, String> map, int level) throws IOException {
    for (Map.Entry<String, String> entry : map.entrySet()) {
        String key = entry.getKey();
        Map<String, String> recurSive = returnURL(key);
        if (level < maxLevels) {
            // pass level + 1 so sibling entries all keep the same level
            recursiveCrawl(recurSive, level + 1);
        }
    }
}
Also, you can use a Set instead of a Map.
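For the visited bookkeeping specifically, a Set is enough, since the key and value in retMap are always the same string. A minimal sketch (ConcurrentHashMap.newKeySet() requires Java 8; Collections.synchronizedSet(new HashSet<String>()) works on older versions):
// The URL itself is the only information you need, so a Set is sufficient.
static Set<String> visited = ConcurrentHashMap.newKeySet();
...
if (visited.add(URL)) {      // add() returns false if the URL was already visited
    // process the page as before
} else {
    System.out.println("****Skipping URL****" + URL);
}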
I am attempting to do the following (in pseudocode):
Generate HashMapOne that will be populated by results found in a DICOM file (the key was manipulated for matching purposes).
Generate a second HashMapTwo that will be read from a text document.
Compare the keys of both HashMaps; if there is a match, add the value from HashMapOne to a new HashMapThree.
I am getting stuck when adding the matched key's value to HashMapThree. It always stores a null value, despite my declaring the element a public static variable. Can anyone please tell me why this may be? Here are the code snippets below:
public class viewDICOMTags {
HashMap<String,String> dicomFile = new HashMap<String,String>();
HashMap<String,String> dicomTagList = new HashMap<String,String>();
HashMap<String,String> Result = new HashMap<String, String>();
Iterator<org.dcm4che2.data.DicomElement> iter = null;
DicomObject working;
public static DicomElement element;
DicomElement elementTwo;
public static String result;
File dicomList = new File("C:\\Users\\Ryan\\dicomTagList.txt");
public void readDICOMObject(String path) throws IOException
{
DicomInputStream din = null;
din = new DicomInputStream(new File(path));
try {
working = din.readDicomObject();
iter = working.iterator();
while (iter.hasNext())
{
element = iter.next();
result = element.toString();
String s = element.toString().substring(0, Math.min(element.toString().length(), 11));
dicomFile.put(String.valueOf(s.toString()), element.vr().toString());
}
System.out.println("Collected tags, VR Code, and Description from DICOM file....");
}
catch (IOException e)
{
e.printStackTrace();
return;
}
finally {
try {
din.close();
}
catch (IOException ignore){
}
}
readFromTextFile();
}
public void readFromTextFile() throws IOException
{
try
{
String dicomData = "DICOM";
String line = null;
BufferedReader bReader = new BufferedReader(new FileReader(dicomList));
while((line = bReader.readLine()) != null)
{
dicomTagList.put(line.toString(), dicomData);
}
System.out.println("Reading Tags from Text File....");
bReader.close();
}
catch(FileNotFoundException e)
{
System.err.print(e);
}
catch(IOException i)
{
System.err.print(i);
}
compareDICOMSets();
}
public void compareDICOMSets() throws IOException
{
for (Entry<String, String> entry : dicomFile.entrySet())
{
if(dicomTagList.containsKey(entry.getKey()))
Result.put(entry.getKey(), dicomFile.get(element.toString()));
System.out.println(dicomFile.get(element.toString()));
}
SortedSet<String> keys = new TreeSet<String>(Result.keySet());
for (String key : keys) {
String value = Result.get(key);
System.out.println(key);
}
}
}
This line of code looks very wrong
Result.put(entry.getKey(), dicomFile.get(element.toString()));
If you are trying to copy the key/value pair from HashMapOne, then this is not correct.
The value for each key added to Result will be null because of how you call get on dicomFile: get expects a key as the lookup value, and you are passing in
element.toString()
where element is simply the last element that was read from your file, which is not a key in the map.
I think you should be using
Result.put(entry.getKey(), entry.getValue());
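For clarity, here is a sketch of how compareDICOMSets could look with that change; note the braces around the if body, so the println only runs for actual matches:
public void compareDICOMSets() throws IOException
{
    for (Entry<String, String> entry : dicomFile.entrySet())
    {
        if (dicomTagList.containsKey(entry.getKey()))
        {
            // copy the matching pair straight from dicomFile
            Result.put(entry.getKey(), entry.getValue());
            System.out.println(entry.getValue());
        }
    }
    SortedSet<String> keys = new TreeSet<String>(Result.keySet());
    for (String key : keys) {
        System.out.println(key);
    }
}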
The context is as follows:
I've got objects that represent Tweets (from Twitter). Each object has an id, a date and the id of the original tweet (if there was one).
I receive a file of tweets (where each tweet is in the format 05/04/2014 12:00:00, tweetID, originalID and is on its own line) and I want to save them as an XML file where each field has its own tag.
I want to then be able to read the file and return a list of Tweet objects corresponding to the Tweets from the XML file.
After writing the XML parser that does this I want to test that it works correctly. I've got no idea how to test this.
The XML Parser:
public class TweetToXMLConverter implements TweetImporterExporter {
//there is a single file used for the tweets database
static final String xmlPath = "src/main/resources/tweetsDataBase.xml";
//some "defines", as we like to call them ;)
static final String DB_HEADER = "tweetDataBase";
static final String TWEET_HEADER = "tweet";
static final String TWEET_ID_FIELD = "id";
static final String TWEET_ORIGIN_ID_FIELD = "original tweet";
static final String TWEET_DATE_FIELD = "tweet date";
static File xmlFile;
static boolean initialized = false;
@Override
public void createDB() {
try {
Element tweetDB = new Element(DB_HEADER);
Document doc = new Document(tweetDB);
doc.setRootElement(tweetDB);
XMLOutputter xmlOutput = new XMLOutputter();
// pretty-print the output
xmlOutput.setFormat(Format.getPrettyFormat());
xmlOutput.output(doc, new FileWriter(xmlPath));
xmlFile = new File(xmlPath);
initialized = true;
} catch (IOException io) {
System.out.println(io.getMessage());
}
}
@Override
public void addTweet(Tweet tweet) {
if (!initialized) {
//TODO throw an exception? should not come to pass!
return;
}
SAXBuilder builder = new SAXBuilder();
try {
Document document = (Document) builder.build(xmlFile);
Element newTweet = new Element(TWEET_HEADER);
newTweet.setAttribute(new Attribute(TWEET_ID_FIELD, tweet.getTweetID()));
newTweet.setAttribute(new Attribute(TWEET_DATE_FIELD, tweet.getDate().toString()));
if (tweet.isRetweet())
newTweet.addContent(new Element(TWEET_ORIGIN_ID_FIELD).setText(tweet.getOriginalTweet()));
document.getRootElement().addContent(newTweet);
} catch (IOException io) {
System.out.println(io.getMessage());
} catch (JDOMException jdomex) {
System.out.println(jdomex.getMessage());
}
}
//break glass in case of emergency
@Override
public void addListOfTweets(List<Tweet> list) {
for (Tweet t : list) {
addTweet(t);
}
}
@Override
public List<Tweet> getListOfTweets() {
if (!initialized) {
//TODO throw an exception? should not come to pass!
return null;
}
try {
SAXBuilder builder = new SAXBuilder();
Document document;
document = (Document) builder.build(xmlFile);
List<Tweet> $ = new ArrayList<Tweet>();
for (Object o : document.getRootElement().getChildren(TWEET_HEADER)) {
Element rawTweet = (Element) o;
String id = rawTweet.getAttributeValue(TWEET_ID_FIELD);
String original = rawTweet.getChildText(TWEET_ORIGIN_ID_FIELD);
Date date = new Date(rawTweet.getAttributeValue(TWEET_DATE_FIELD));
$.add(new Tweet(id, original, date));
}
return $;
} catch (JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return null;
}
}
Some usage:
private TweetImporterExporter converter;
List<Tweet> tweetList = converter.getListOfTweets();
for (String tweetString : lines)
converter.addTweet(new Tweet(tweetString));
How can I make sure that the XML file I read (that contains tweets) corresponds to the file I receive (in the form stated above)?
How can I make sure the tweets I add to the file correspond to the ones I tried to add?
Assuming that you have the following model:
public class Tweet {
private Long id;
private Date date;
private Long originalTweetid;
//getters and setters
}
The process would be the following:
create an instance of TweetToXMLConverter
create a list of Tweet instances that you expect to receive after parsing the file
feed the converter the list you generated
compare the list returned by the parser with the list you created at the start of the test
public class MainTest {
    private TweetToXMLConverter converter;
    private List<Tweet> tweets;

    @Before
    public void setup() {
        tweets = new ArrayList<Tweet>(); // initialize the list before adding to it
        Tweet tweet = new Tweet(1, "05/04/2014 12:00:00", 2);
        Tweet tweet2 = new Tweet(2, "06/04/2014 12:00:00", 1);
        Tweet tweet3 = new Tweet(3, "07/04/2014 12:00:00", 2);
        tweets.add(tweet);
        tweets.add(tweet2);
        tweets.add(tweet3);

        converter = new TweetToXMLConverter();
        converter.createDB(); // create the backing XML file so addTweet does not return early
        converter.addListOfTweets(tweets);
    }

    @Test
    public void testParse() {
        List<Tweet> parsedTweets = converter.getListOfTweets();
        Assert.assertEquals(parsedTweets.size(), tweets.size());
        for (int i = 0; i < parsedTweets.size(); i++) {
            // assuming that both lists are sorted
            Assert.assertEquals(parsedTweets.get(i), tweets.get(i));
        }
    }
}
I am using JUnit for the actual testing.
So I am working on a webcrawler that is supposed to download all images, files, and webpages, and then recursively do the same for all webpages found. However, I seem to have a logic error.
public class WebCrawler {
private static String url;
private static int maxCrawlDepth;
private static String filePath;
/* Recursive function that crawls all web pages found on a given web page.
* This function also saves elements from the DownloadRepository to disk.
*/
public static void crawling(WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
webpage.crawl(currentCrawlDepth);
HashMap<String, WebPage> pages = webpage.getCrawledWebPages();
if(currentCrawlDepth < maxCrawlDepth) {
for(WebPage wp : pages.values()) {
crawling(wp, currentCrawlDepth+1, maxCrawlDepth);
}
}
}
public static void main(String[] args) {
if(args.length != 3) {
System.out.println("Must pass three parameters");
System.exit(0);
}
url = "";
maxCrawlDepth = 0;
filePath = "";
url = args[0];
try {
URL testUrl = new URL(url);
URLConnection urlConnection = testUrl.openConnection();
urlConnection.connect();
} catch (MalformedURLException e) {
System.out.println("Not a valid URL");
System.exit(0);
} catch (IOException e) {
System.out.println("Could not open URL");
System.exit(0);
}
try {
maxCrawlDepth = Integer.parseInt(args[1]);
} catch (NumberFormatException e) {
System.out.println("Argument is not an int");
System.exit(0);
}
filePath = args[2];
File path = new File(filePath);
if(!path.exists()) {
System.out.println("File Path is invalid");
System.exit(0);
}
WebPage webpage = new WebPage(url);
crawling(webpage, 0, maxCrawlDepth);
System.out.println("Web crawl is complete");
}
}
The crawl function parses the contents of a web page, storing any images, files, or links it finds into a HashMap, for example:
public class WebPage implements WebElement {
private static Elements images;
private static Elements links;
private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();
private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();
private HashMap<String, WebFile> files = new HashMap<String, WebFile>();
private String url;
public WebPage(String url) {
this.url = url;
}
/* The crawl method parses the html on a given web page
* and adds the elements of the web page to the Download
* Repository.
*/
public void crawl(int currentCrawlDepth) {
System.out.print("Crawling " + url + " at crawl depth ");
System.out.println(currentCrawlDepth + "\n");
Document doc = null;
try {
HttpConnection httpConnection = (HttpConnection) Jsoup.connect(url);
httpConnection.ignoreContentType(true);
doc = httpConnection.get();
} catch (MalformedURLException e) {
System.out.println(e.getLocalizedMessage());
} catch (IOException e) {
System.out.println(e.getLocalizedMessage());
} catch (IllegalArgumentException e) {
System.out.println(url + "is not a valid URL");
}
DownloadRepository downloadRepository = DownloadRepository.getInstance();
if(doc != null) {
images = doc.select("img");
links = doc.select("a[href]");
for(Element image : images) {
String imageUrl = image.absUrl("src");
if(!webImages.containsValue(image)) {
WebImage webImage = new WebImage(imageUrl);
webImages.put(imageUrl, webImage);
downloadRepository.addElement(imageUrl, webImage);
System.out.println("Added image at " + imageUrl);
}
}
HttpConnection mimeConnection = null;
Response mimeResponse = null;
for(Element link: links) {
String linkUrl = link.absUrl("href");
linkUrl = linkUrl.trim();
if(!linkUrl.contains("#")) {
try {
mimeConnection = (HttpConnection) Jsoup.connect(linkUrl);
mimeConnection.ignoreContentType(true);
mimeConnection.ignoreHttpErrors(true);
mimeResponse = (Response) mimeConnection.execute();
} catch (Exception e) {
System.out.println(e.getLocalizedMessage());
}
String contentType = null;
if(mimeResponse != null) {
contentType = mimeResponse.contentType();
}
if(contentType == null) {
continue;
}
if(contentType.toString().equals("text/html")) {
if(!webPages.containsKey(linkUrl)) {
WebPage webPage = new WebPage(linkUrl);
webPages.put(linkUrl, webPage);
downloadRepository.addElement(linkUrl, webPage);
System.out.println("Added webPage at " + linkUrl);
}
}
else {
if(!files.containsValue(link)) {
WebFile webFile = new WebFile(linkUrl);
files.put(linkUrl, webFile);
downloadRepository.addElement(linkUrl, webFile);
System.out.println("Added file at " + linkUrl);
}
}
}
}
}
System.out.print("\nFinished crawling " + url + " at crawl depth ");
System.out.println(currentCrawlDepth + "\n");
}
public HashMap<String, WebImage> getImages() {
return webImages;
}
public HashMap<String, WebPage> getCrawledWebPages() {
return webPages;
}
public HashMap<String, WebFile> getFiles() {
return files;
}
public String getUrl() {
return url;
}
@Override
public void saveToDisk(String filePath) {
System.out.println(filePath);
}
}
The point of using a hashmap is to ensure that I do not parse the same website more than once. The error seems to be with my recursion. What is the issue?
Here is also some sample output for starting the crawl at http://www.google.com
Crawling https://www.google.com/ at crawl depth 0
Added webPage at http://www.google.com/intl/en/options/
Added webPage at https://www.google.com/intl/en/ads/
Added webPage at https://www.google.com/services/
Added webPage at https://www.google.com/intl/en/about.html
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/ at crawl depth 0
Crawling https://www.google.com/services/ at crawl depth 1
Added webPage at http://www.google.com/intl/en/enterprise/apps/business/?utm_medium=et&utm_campaign=en&utm_source=us-en-et-nelson_bizsol
Added webPage at https://www.google.com/services/sitemap.html
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/services/ at crawl depth 1
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 2**
Added webPage at https://www.google.com/intl/en/policies/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/privacy/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/faq/
Added webPage at https://www.google.com/intl/en/policies/technologies/
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/intl/en/policies/ at crawl depth 2
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 3**
Notice that it parses http://www.google.com/intl/en/policies/ twice
You are creating a new map for each web-page. This will ensure that if the same link occurs on the page twice it will only be crawled once but it will not deal with the case where the same link appears on two different pages.
https://www.google.com/intl/en/policies/ appears on both https://www.google.com/ and https://www.google.com/services/.
To avoid this use a single map throughout your crawl and pass it as a parameter into the recursion.
public class WebCrawler {
    private HashMap<String, WebPage> visited = new HashMap<String, WebPage>();

    public static void crawling(Map<String, WebPage> visited, WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
    }
}
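A minimal sketch of how that method could use the shared map, reusing the getUrl(), crawl(int) and getCrawledWebPages() methods your WebPage class already has:
public static void crawling(Map<String, WebPage> visited, WebPage webpage,
        int currentCrawlDepth, int maxCrawlDepth) {
    // Skip pages that were already crawled anywhere in the whole crawl.
    if (visited.containsKey(webpage.getUrl())) {
        return;
    }
    visited.put(webpage.getUrl(), webpage);
    webpage.crawl(currentCrawlDepth);
    if (currentCrawlDepth < maxCrawlDepth) {
        for (WebPage wp : webpage.getCrawledWebPages().values()) {
            crawling(visited, wp, currentCrawlDepth + 1, maxCrawlDepth);
        }
    }
}
The initial call in main then becomes something like crawling(new HashMap<String, WebPage>(), webpage, 0, maxCrawlDepth);.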
As you are also holding a map of the images etc., you may prefer to create a new object, perhaps called Visited, and make it keep track of everything you have already seen.
public class Visited {

    private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();

    public boolean visit(String url, WebPage page) {
        if (webPages.containsKey(url)) {
            return false;
        }
        webPages.put(url, page);
        return true;
    }

    private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();

    public boolean visit(String url, WebImage image) {
        if (webImages.containsKey(url)) {
            return false;
        }
        webImages.put(url, image);
        return true;
    }
}