Web crawler (NYTimes) using Jsoup (link within a link) - Java

I have been given an assignment to crawl the "nytimes" website and display the most liked, shared, etc. articles on that site using the concept of a web crawler.
I have used Jsoup to extract all the links from the homepage of nytimes. The code is as follows:
public static void processPage(String URL) throws IOException
{
    // Fetch and parse the page at the given URL
    Document doc = Jsoup.connect(URL).get();
    // Select every anchor element that has an href attribute
    Elements questions = doc.select("a[href]");
    for (Element link : questions)
    {
        // Resolve the href against the page's base URL
        String absUrl1 = link.absUrl("href");
        if (absUrl1.contains("nytimes.com")) {
            System.out.println(absUrl1);
        }
    }
}
This code extracts and displays all the links containing "nytimes.com", but how do I then parse each of those links and extract the links within them, and so on? That's what a crawler is supposed to do, but I'm not able to figure it out. I tried calling the processPage function recursively, but the output I'm getting is not what I expected.

If you're using a single machine, then a Queue is for you.
As you come across links in a page that need to be crawled, add them to the Queue. If you want to be single threaded, you could write a while loop that reads from this queue. Initially, the queue might have the NY Times link in it. Once that URL is pulled and crawled, more URLs will be in the queue for processing. You can continue this for all NY Times articles.
Using a Queue also allows you to easily multithread, allowing multiple threads to take from the queue, helping increase throughput. Take a look at the producer/consumer pattern.
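As a minimal sketch of that single-threaded idea, reusing the Jsoup calls from the question (the starting URL and the MAX_PAGES cap are illustrative assumptions, not part of the original code):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {

    private static final int MAX_PAGES = 100; // illustrative cap so the crawl terminates

    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("https://www.nytimes.com/");

        while (!frontier.isEmpty() && visited.size() < MAX_PAGES) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already crawled this URL
            }
            System.out.println("Crawling: " + url);
            try {
                Document doc = Jsoup.connect(url).get();
                for (Element link : doc.select("a[href]")) {
                    String absUrl = link.absUrl("href");
                    // Stay on the target site and avoid re-queueing URLs already seen
                    if (absUrl.contains("nytimes.com") && !visited.contains(absUrl)) {
                        frontier.add(absUrl);
                    }
                }
            } catch (IOException e) {
                System.err.println("Failed to fetch " + url + ": " + e.getMessage());
            }
        }
    }
}

The same loop becomes multithreaded by swapping the ArrayDeque for a thread-safe queue and running the loop body in several worker threads.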
If a single machine isn't enough, you'll have to do something more distributed, like using Hadoop. Yahoo uses Hadoop to have multiple machines spider the web at once.

Related

Developing app to detect webpage change

I'm trying to make a desktop app in Java to track changes made to a webpage, as a side project and also to monitor when my professors add content to their webpages. I did a bit of research, and my current approach is to use the Jsoup library to retrieve the webpage, run it through a hashing algorithm, and then compare the current hash value with a previous one.
Is this a recommended approach? I'm open to suggestions and ideas, since before doing any research I had no clue how to start or what Jsoup was.
One potential problem with your hashing method: if the page contains any dynamically generated content that changes on each refresh, as many modern websites do, your program will report that the page is constantly changing. Hashing the whole page will only work if the site does not employ any of this dynamic content (ads, hit counter, social media, etc.).
What specifically are you looking for that has changed? Perhaps new assignments being posted? You likely do not want to monitor the entire page for changes anyway. Therefore, you should use an HTML parser -- this is where Jsoup comes in.
First, parse the page into a Document object:
Document doc = Jsoup.parse(htmlString);
You can now perform a number of methods on the Document object to traverse the HTML Nodes. (See Jsoup docs on DOM navigation methods)
For instance, say there is a table on the site, and each row of the table represents a different assignment. The following code would get the table by its ID and each of its rows by selecting the table's tr tags.
Element assignTbl = doc.getElementById("assignmentTable");
Elements tblRows = assignTbl.getElementsByTag("tr");
for (Element tblRow : tblRows) {
    // tblRow.html() returns this row's inner HTML; store or compare it as needed
    tblRow.html();
}
You will need to somehow view the webpage's source code (such as Inspect Element in Google Chrome) to figure out the page's structure and design your code accordingly. This way, not only would the algorithm be more reliable, but you could take it much further, such as extracting the details of the assignment that has changed. (If you would like assistance, please edit your question with the target page's HTML.)
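Putting the two ideas together, a possible refinement is to hash only the HTML of the element you care about instead of the whole page. This is just a sketch; the "assignmentTable" ID is the hypothetical one from the example above, and SHA-256 is an arbitrary choice of hash:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PageChangeChecker {

    // Returns a hex SHA-256 hash of the assignment table's HTML only
    public static String hashAssignmentTable(String pageUrl)
            throws IOException, NoSuchAlgorithmException {
        Document doc = Jsoup.connect(pageUrl).get();
        Element assignTbl = doc.getElementById("assignmentTable"); // hypothetical ID
        if (assignTbl == null) {
            throw new IllegalStateException("Table not found on " + pageUrl);
        }
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(assignTbl.html().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

Store the returned value, re-run the check periodically, and report a change whenever it differs from the stored one. Dynamic content elsewhere on the page will no longer trigger false positives.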

crawler4j asynchronously saving results to file

I'm evaluating crawler4j for roughly 1M crawls per day.
My scenario is this: I'm fetching each URL and parsing its description, keywords and title; now I would like to save each URL and its words into a single file.
I've seen how it's possible to save crawled data to files. However, since I have many crawls to perform I want different threads performing the save file operation on the file system (in order to not block the fetcher thread). Is that possible to do with crawler4j? If so, how?
Thanks
Consider using a Queue (BlockingQueue or similar) where you put the data to be written, and which is then processed by one or more worker threads (this approach is not crawler4j-specific). Search for "producer consumer" to get some general ideas.
Concerning your follow-up question on how to pass the Queue to the crawler instances, this should do the trick (this is only from looking at the source code; I haven't used crawler4j myself):
final BlockingQueue<Data> queue = …

// Use a factory instead of supplying the crawler class, so the queue can be passed in
controller.start(new WebCrawlerFactory<MyCrawler>() {
    @Override
    public MyCrawler newInstance() throws Exception {
        return new MyCrawler(queue);
    }
}, numberOfCrawlers);
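For the writing side, a minimal consumer sketch could be a dedicated thread that drains the queue and appends to a file, so the crawler threads never block on I/O. Here Data and its toLine() method are hypothetical placeholders for whatever you put on the queue, and the output path is arbitrary:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;

public class FileWriterWorker implements Runnable {

    private final BlockingQueue<Data> queue; // the same queue the crawlers fill

    public FileWriterWorker(BlockingQueue<Data> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("crawl-output.txt"), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            while (!Thread.currentThread().isInterrupted()) {
                Data data = queue.take();   // blocks until a crawler offers data
                out.write(data.toLine());   // toLine() is a hypothetical formatter
                out.newLine();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // allow a clean shutdown
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Start it with new Thread(new FileWriterWorker(queue)).start() before calling controller.start(...), and have each crawler's visit() method put its parsed fields on the queue.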

How to manage a crawler URL frontier?

I have the following code to track visited links in my crawler.
After extracting the links, I have a for loop that goes through each individual href.
After I have visited a link and opened it, I add the URL to a visited-link collection variable, defined as follows:
private final Collection<String> urlForntier = Collections.synchronizedSet(new HashSet<String>());
The crawler implementation is multithreaded. Suppose I have visited 100,000 URLs: if I don't terminate the crawler, this set will keep growing day by day and eventually cause memory issues. What options do I have to refresh (or bound) the variable without creating inconsistency across threads?
Thanks in advance!
If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.
Luckily, you don't need to write this yourself: just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.
See https://github.com/crawler-commons/url-frontier
A common approach in modern crawling systems is to use a NoSQL database.
This is notably slower than a HashSet, which is why you can layer on a caching strategy such as Redis, or even a Bloom filter.
Given the specific nature of URLs, I'd also recommend the Trie data structure, which gives you many options for manipulating and searching by URL string. (A discussion of Java implementations can be found in this Stack Overflow topic.)
As for the question itself, I would recommend using Redis to replace the Collection. It is an in-memory data-structure store that is very fast at inserting and retrieving data and supports all the standard data structures; in your case a Set, where you can check whether a member exists with the SISMEMBER command.
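As a rough sketch of that Redis idea using the Jedis client (the host, port, and key name are assumptions), the visited-URL check could look like this:

import redis.clients.jedis.Jedis;

public class RedisFrontierExample {

    public static void main(String[] args) {
        // Assumes a Redis server running locally on the default port
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String url = "https://example.com/some-article";

            // SISMEMBER: has this URL been visited already?
            if (!jedis.sismember("visited-urls", url)) {
                // SADD: mark it as visited; the set lives in Redis, not in JVM memory
                jedis.sadd("visited-urls", url);
                System.out.println("Crawling " + url);
            } else {
                System.out.println("Skipping " + url + " (already visited)");
            }
        }
    }
}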
Apache Nutch is also good to explore.

Providing input data for web scraping

I want to scrape data from the following site:
http://www.upmandiparishad.in/commodityWiseAll.aspx
There are two input elements, Commodity and Date. How do I provide these values and retrieve the resulting information?
To extract data from a web page in Java, you can use Jsoup.
To provide input elements, you need to understand how they are provided originally by your browser.
Basically, there are two most common methods for a request-response between a client and a server:
GET - Requests data from a specified resource
POST - Submits data to be processed to a specified resource
You can find more about them here.
When you select the Commodity and the Date input values, you can investigate the methods used to provide those values to the server by examining network requests. For example, in Chrome, you can press F12 and select the Network tab to check the information being sent to and from the browser.
When you find out the way of providing the data, you can then form your HTTP request accordingly to provide the same data via jsoup or similar library.
For example, here is how you can provide simple input fields to your request:
Document doc = Jsoup.connect("http://example.com/")
        .data("some_input_1", "some_data_1")
        .data("some_input_2", "some_data_2")
        .post();
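The page in question is an ASP.NET WebForms page, so the form usually also expects hidden fields such as __VIEWSTATE and __EVENTVALIDATION, which means you typically need an initial GET to pick them up before POSTing. A rough sketch using Jsoup's Connection.Response, assuming the surrounding method may throw IOException; the control names ddlCommodity and txtDate are guesses and must be replaced with the real names you see in the network tab:

// First request: grab cookies and the hidden ASP.NET fields
Connection.Response initial = Jsoup.connect("http://www.upmandiparishad.in/commodityWiseAll.aspx")
        .method(Connection.Method.GET)
        .execute();
Document form = initial.parse();

// Second request: submit the form with the user-visible values
Document result = Jsoup.connect("http://www.upmandiparishad.in/commodityWiseAll.aspx")
        .cookies(initial.cookies())
        .data("__VIEWSTATE", form.select("#__VIEWSTATE").val())
        .data("__EVENTVALIDATION", form.select("#__EVENTVALIDATION").val())
        .data("ddlCommodity", "Wheat")   // hypothetical control name and value
        .data("txtDate", "01/01/2015")   // hypothetical control name and value
        .post();

System.out.println(result.title());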
This is of course just to get you started; it is by no means a complete answer. You will need to show real effort on your side to search for answers online, as there are plenty.
Here are just a few to get you started:
http://www.mkyong.com/java/how-to-send-http-request-getpost-in-java/
http://simplescrape.sourceforge.net/
http://www.xyzws.com/Javafaq/how-to-use-httpurlconnection-post-data-to-web-server/139
http://www.javaworld.com/article/2077532/learn-java/java-tip-34--posting-via-java.html

retrieve information from a url

I want to make a program that will retrieve some information from a URL.
For example, I give the URL below, from
librarything
How can I retrieve all the words below the "TAGS" tab, like
Black Library fantasy Thanquol & Boneripper Thanquol and Bone Ripper Warhammer?
I am thinking of using Java and designing a data-mining wrapper, but I am not sure how to start. Can anyone give me some advice?
EDIT:
You gave me excellent help, but I want to ask something else.
For every tag, we can see how many times it has been used when we press the "number" button. How can I retrieve that number as well?
You could use an HTML parser like Jsoup. It allows you to select the HTML elements of interest using simple CSS selectors:
E.g.
Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");

for (Element tag : tags) {
    System.out.println(tag.text());
}
which prints
Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer
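Regarding the follow-up about the usage count shown next to each tag: that depends on how the count is marked up on LibraryThing's page, which I have not verified. If the count sits in a sibling element inside each tag entry (an assumption; inspect the page to find the real class name), a continuation of the snippet above might look like this:

// The ".count" selector is a guess; adjust it to the page's actual markup
Elements tagEntries = document.select(".tags .tag");
for (Element entry : tagEntries) {
    String name = entry.select("a").text();
    String count = entry.select(".count").text(); // empty string if no such element
    System.out.println(name + " -> " + count);
}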
Please note that you should read the website's robots.txt (if any) and the website's terms of service (if any), or your server might be IP-banned sooner or later.
I've done this before using PHP with a page scrape, then parsing the HTML as a string using Regular Expressions.
Example here
I imagine there's something similar in Java and other languages. The concept would be the same:
Load the page data.
Parse the data (e.g. with a regex, or via the DOM model using CSS or XPath selectors).
Do what you want with the data :)
It's worth remembering that some people might not appreciate you data mining their site and profiting from / redistributing it on a large scale.
