Issue with scraping data from websites - java

I'm trying to gather data by scraping webpages using Java with Jsoup. Ideally, I'd like about 8000 lines of data, but I was wondering what the etiquette is when it comes to accessing a site that many times. For each one, my code has to navigate to a different part of the site, so I would have to load 8000 (or more) webpages. Would it be a good idea to put delays between each request so I don't overload the website? They don't offer an API from what I can see.
Additionally, I tried running my code to just get 80 lines of data without any delay, and my internet is out. Could running that code have caused it? When I called the company, the automated message made it sound like service was out in the area, so maybe I just didn't notice it until I tried to run the code. Any help is appreciated, I'm very new to network coding. Thanks!

Here are a couple of things you should consider, based on what I learned while writing a fast web scraper with Java and Jsoup:
The most important one is the legal aspect: whether the website allows crawling at all, and to what extent it allows you to use its data.
Adding delays is fine, but setting a custom user agent that complies with the site's robots.txt rules is even better. I saw noticeably better response times after changing the user agent from the default to one permitted by robots.txt.
If the site allows it and you need to crawl a large number of pages (as was the case for one of my previous projects), you can use a ThreadPoolExecutor to load N pages simultaneously; see the sketch at the end of this answer. It turns an hours-long data-gathering job with a single-threaded Java scraper into just a couple of minutes.
Many ISPs blacklist users who run automated, repetitive tasks such as web crawling or setting up email servers. This varies from ISP to ISP. I previously avoided it by using proxies.
With a target site responding in about 500 ms per request, my scraper was able to pull data from 200k pages using 50 threads and 1,000 proxies in roughly 3 minutes over a 100 Mbps connection.
Should there be delays between requests?
Answer: It depends. If the website tolerates you hitting it constantly, you don't strictly need them, but it's better to have them anyway. I used a delay of 10 ms between each request.
I tried running my code to get just 80 lines of data without any delay, and now my internet is out - could that be why?
Answer: Quite possibly. Your ISP may assume you are mounting a DoS attack against the website and may have temporarily or permanently throttled your connection.
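For the thread-pool approach mentioned above, here is a minimal sketch using Jsoup with a fixed ExecutorService, a custom user agent and a small per-request delay. The URL list, pool size and user-agent string are placeholders, not values from the question:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoliteScraper {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder URLs - in practice this would be the ~8000 pages you need.
        List<String> urls = Arrays.asList(
                "https://example.com/page1",
                "https://example.com/page2");
        ExecutorService pool = Executors.newFixedThreadPool(8); // N pages at a time
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    Thread.sleep(10); // small per-request delay, as discussed above
                    Document doc = Jsoup.connect(url)
                            .userAgent("my-scraper-bot (you@example.com)") // identify yourself
                            .timeout(10_000)
                            .get();
                    // Extract whatever data you need here, e.g. doc.select("table tr")
                    System.out.println(url + " -> " + doc.title());
                } catch (Exception e) {
                    System.err.println("Failed to fetch " + url + ": " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}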

Related

Many Java threads downloading content from web - but not saturating the bandwidth

A while back I tried implementing a crawler in Java and left the project for a while (made a lot of progress since). Basically I have implemented a crawler with circa 200-400 threads, each thread connects and downloads the content of one page (simplified for clarity, but that's basically it):
// we're in a run() method of a truly generic Runnable.
// _url is a member passed to the Runnable object beforehand.
Connection c = Jsoup.connect(_url).timeout(10000);
c.execute();
Document d = c.response().parse();
// use Jsoup to get the links, add them to the backbone of the crawler
// to be checked and maybe later passed to the crawling queue.
This works. The problem is I only use a very small fraction of my internet bandwidth. Having the ability to download at >6MB/s, I've identified (using NetLimiter and my own calculations) that I only use about 1MB/s at best when downloading pages sources.
I've done a lot of statistics and analysis, and it is somewhat reasonable: if the computer cannot efficiently support more than ~400 threads (I'm not sure about that either, but a larger number of threads seems to be ineffective) and each connection takes about 4 seconds to complete, then I should be downloading about 100 pages per second, which is indeed what happens. The bizarre thing is that many times while I run this program, the internet connection is completely clogged - neither I nor anyone else on my wifi can access the web normally (even though I'm only using 16% of the bandwidth!), which does not happen when downloading other files, say movies.
I've spent literally weeks calculating, analyzing and collecting various statistics (making sure all threads are operating with a VM monitor, calculating mean run times for threads, Excel charts...) before coming here, but I've run out of answers. I wonder if this behavior can be explained. I realize there are a lot of "ifs" in this question, but it's the best I can do without it turning into an essay.
My computer specs are an i5 4460 with 8GB DDR3-1600 and a 100 Mb/s (effectively around 8 MB/s) internet connection, connected directly via LAN to the crawler. I'm looking for general directions - where else should I look (I mean the obvious stuff that is clear to experienced developers but not to me) in order to either:
Improve the download speed (maybe not Jsoup? different number of threads? I've already tried using selectors instead of threads and it was slower), or:
Free up the internet when I'm running this program.
I've thought about the router itself (Netgear N600) limiting the number of outgoing connections (seems odd), so I'm saturating the number of connections, and not the bandwidth, but couldn't figure out if that's even possible.
Any general direction / advice would be warmly welcomed :) Feel free to point out newbie mistakes - that's how I learn.
Amir.
The issue was not DNS resolutions, as creating the connections with an IP address (I stored all addresses in advance then used those) resulted in the exact same response times and bandwidth use. Nor was it the threads issue.
I now suspect it was the NetLimiter program's "fault". I measured the number of bytes received directly and wrote the totals to disk (I had done this before, but apparently I had since changed something in the program). It seems I really am saturating the bandwidth. Also, when switching to HttpURLConnection objects instead of Jsoup, NetLimiter does show much higher bandwidth usage, so perhaps it has some issue with Jsoup.
I'm not sure this was the problem, but empirically, the program downloads a lot of data. So I hope this helps anyone who might encounter a similar issue in the future.
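For reference, here is a minimal sketch of measuring raw received bytes with HttpURLConnection, along the lines of what is described above; the URL is a placeholder:
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ByteCounter {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/"); // placeholder page
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);

        long total = 0;
        long start = System.nanoTime();
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n; // count every byte actually received
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes in %.2f s (%.1f KB/s)%n",
                total, seconds, total / 1024.0 / seconds);
    }
}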

How to starve internet speed from Java?

I am trying to experiment with networking in my Java application. For starters, I would like to test how much network capacity I can use before a website begins loading slowly.
Is there a reasonable way to deliberately consume so many networking resources that it affects my browsing experience?
As for my attempts, I tried creating several threads, each trying to download some website (like this: http://docs.oracle.com/javase/tutorial/networking/urls/readingURL.html), but this doesn't do anything to my internet speed (I guess the number of threads is just too small).
My suggestion is that you download some big files instead of just HTML pages. Try different files from different sites so that no single site throttles your downloads by IP or session.
Also, if you really want to slow down your browsing experience, you need to consume your upload link as well, not only your download link. Try uploading something at maximum speed; it will greatly affect your browsing.
You can make some tests for comparison with Apache JMeter.
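A rough sketch of the multi-threaded "download big files" idea suggested above; the file URLs are placeholders and the thread count is arbitrary:
import java.io.InputStream;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BandwidthHog {
    public static void main(String[] args) {
        // Placeholder URLs: point these at large files hosted on different sites.
        List<String> bigFiles = Arrays.asList(
                "https://example.org/large-file-1.iso",
                "https://example.net/large-file-2.iso");

        ExecutorService pool = Executors.newFixedThreadPool(bigFiles.size());
        for (String file : bigFiles) {
            pool.submit(() -> {
                byte[] buf = new byte[64 * 1024];
                try (InputStream in = new URL(file).openStream()) {
                    // Read and discard the data as fast as possible to occupy the downlink.
                    while (in.read(buf) != -1) { /* discard */ }
                } catch (Exception e) {
                    System.err.println("Download failed: " + e.getMessage());
                }
            });
        }
        pool.shutdown();
    }
}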

How to test your website against multiple users (extended)

There is a similar question on this topic that I participated in, but it doesn't really answer what I need at this moment.
How to rigorously test a site?
I noticed a java.util.ConcurrentModificationException in my server log, so I fixed that one, but I still don't know whether this or some other concurrency issue will ever occur again without testing for it.
I've tried to create a test in JMeter which just does a simple GET and simulates 100 users.
The problem:
I retrieve some information from the server when the page is done loading, so I'm interested in that part (because that part caused this exception before).
But JMeter only fetches the page itself; any pending AJAX requests are never issued and therefore never show up in the logs. Actually, I can't see anything in the logs, because JMeter never reaches the AJAX calls that fire when the document is ready - it exits just before that.
Naturally, when I refresh the page from a browser I can see in the logs exactly what is going on on the server side. Is there some kind of tool that waits for all pending requests or can stay on the website for n amount of time, or is there a smarter way to test this to avoid further concurrency exceptions?
AJAX requests are simple GET requests as well, so you just need to configure JMeter to directly call the servlets which serve them.
If you use Selenium instead of JMeter for your tests, you will spawn real browsers that will perform AJAX request exactly like the real application. Simply because it is the real application that is being run.
The problem is... Selenium is for regression testing and browser compatibility, not for raw performance. You can't run more than a few browsers per computer. Some companies provide clusters of browsers (up to 5,000 browsers, and up to 500,000 virtual users with BrowserMob) that you can rent for your performance campaign.
You could also use the desktop computers in your office, say overnight, to perform your tests.
I know this might be a little complicated and may not be the best solution.
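If you go the Selenium route, a minimal sketch might look like the following (this assumes Selenium 4, a ChromeDriver on the path, and that the application under test uses jQuery - the URL and the wait condition are illustrative assumptions, not part of the original question):
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;
import java.time.Duration;

public class AjaxSmokeTest {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://your-app.example.com/page"); // placeholder URL

            // Wait until all pending jQuery AJAX calls have finished
            // (adapt the condition if the page does not use jQuery).
            new WebDriverWait(driver, Duration.ofSeconds(10)).until(d ->
                    (Boolean) ((JavascriptExecutor) d)
                            .executeScript("return window.jQuery != null && jQuery.active === 0"));

            System.out.println("Page settled: " + driver.getTitle());
        } finally {
            driver.quit();
        }
    }
}
Running many of these in parallel is what quickly becomes expensive, which is why the cluster or overnight-desktop approaches above come into play.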

web crawler performance

I am interested to know, in a very general situation (a home-brew amateur web crawler), what kind of performance to expect - more specifically, how many pages such a crawler can process.
When I say home-brew, take that in every sense: a 2.4 GHz Core 2 processor, written in Java, a 50 Mbit internet connection, etc.
Any resources you may share in this regard will be greatly appreciated
Thanks a lot,
Carlos
First of all, the speed of your computer won't be the limiting factor; as for the connection, you should artificially limit the speed of your crawler - most sites will ban your IP address if you start hammering them. In other words, don't crawl a site too quickly (10+ seconds per request should be OK with 99.99% of the sites, but go below that at your own peril).
So, while you could crawl a single site in multiple threads, I'd suggest that each thread crawls a different site (check if it's also not a shared IP address); that way, you could saturate your connection with a lower chance of getting banned from the spidered site.
Some sites don't want you to crawl parts of the site, and there's a commonly used mechanism you should follow: the robots.txt file. Read up on it and implement it.
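A very rough sketch of checking robots.txt before fetching - a real crawler should use a proper parser; this only looks at Disallow lines for the wildcard user agent:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Returns the Disallow path prefixes listed under "User-agent: *".
    static List<String> disallowedPaths(String host) throws Exception {
        List<String> disallowed = new ArrayList<>();
        URL robots = new URL("https://" + host + "/robots.txt");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()))) {
            boolean applies = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    applies = line.substring(11).trim().equals("*");
                } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                    disallowed.add(line.substring(9).trim());
                }
            }
        }
        return disallowed;
    }

    // True if the given path is not covered by any Disallow prefix.
    static boolean isAllowed(String path, List<String> disallowed) {
        return disallowed.stream().noneMatch(prefix -> !prefix.isEmpty() && path.startsWith(prefix));
    }
}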
Note also, that some sites prohibit any automated crawling at all; depending on the site's jurisdiction (yours may also apply), breaking this may be illegal (you are responsible for what your script does, "the robot did it" is not even an excuse, much less a defense).
In my experience, mostly making site scrapers, the network download is always the limiting factor. You can usually shuttle the parsing of the page (or storage for parsing later) to a different thread in less than the time it will take to download the next page.
So figure out, on average, how long it takes to download a web page. Multiply that by how many threads you have downloading until it fills your connection's throughput, average out the speed of any given web server and the math is fairly obvious.
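To make that concrete with purely illustrative numbers: if the average page is around 100 KB and takes about 2 seconds to fetch, each thread delivers roughly 50 KB/s, so a 50 Mbit (~6 MB/s) connection would need on the order of 120 such threads before the link, rather than the crawler, becomes the bottleneck.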
If your program is sufficiently efficient, your internet connection WILL be the limiting factor (as Robert Harvey said in his answer).
However, by doing this with a home internet connection, you are probably abusing your provider's terms of service. They will monitor it and will eventually notice if you frequently exceed their reasonable usage policy.
Moreover, if they use a transparent proxy, you may hammer their proxy to death long before you reach their download limit, so be careful - make sure that you are NOT going through your ISP's proxy, transparent or otherwise.
ISPs are set up for most users to do moderate levels of browsing with a few large streaming operations (video or other downloads). A massive level of tiny requests with 100s outstanding at once, will probably not make their proxy servers happy even if it doesn't use much bandwidth.

How to handle OUT OF MEMORY error for multiple threads in a Java Web Crawler

I'm fairly new to programming and am working on a web crawler for my dissertation. I was provided with a web crawler, but I found it to be too slow since it is single-threaded: it took 30 minutes to crawl 1,000 webpages. I tried creating multiple threads, and with 20 threads running simultaneously the 1,000 webpages took only 2 minutes. But now I'm encountering "Heap Out of Memory" errors. I'm sure what I did was wrong, which was to create the 20 threads in a for loop. What would be the right way to multi-thread the Java crawler without getting these errors? And speaking of which, is multi-threading the solution to my problem or not?
The simple answer (see the other answer below) is to increase the JVM memory size. This will help, but it is likely that the real problem is that your web-crawling algorithm is creating an in-memory data structure that grows in proportion to the number of pages you visit. If that is the case, the solution may be to move the data in that structure to disk, e.g. a database.
The most appropriate solution to your problem depends on how your web crawler works, what it is collecting, and how many pages you need to crawl.
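As a rough sketch of keeping memory bounded while still using multiple threads (thread count and queue size are illustrative): use a thread pool with a bounded work queue so the crawl frontier cannot grow without limit, and write results straight to disk or a database instead of holding them in memory.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedCrawlerPool {
    public static ThreadPoolExecutor create() {
        // 20 worker threads, but at most 1000 queued pages waiting to be crawled.
        // CallerRunsPolicy makes the submitting thread do the work itself when the
        // queue is full, which naturally throttles how fast new URLs are enqueued.
        return new ThreadPoolExecutor(
                20, 20,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}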
My first suggestion is that you increase the heap size for the JVM:
http://www.informix-zone.com/node/46
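For example, the maximum heap can be raised with the standard -Xmx flag when launching the JVM (the class name here is just a placeholder):
java -Xmx2g -Xms512m MyCrawler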
Regarding the speed of your program:
If your web crawler obeys the robots.txt file on servers (which it should, to avoid being banned by the site admins), then there may be little that can be done.
You should profile your program, but I expect most of the time is spent downloading HTML pages, and site admins will usually not be happy if you download so fast that you drain their bandwidth.
In summary, downloading a whole site without hurting that site will take a while.
