I am trying to experiment with networking in my Java application. For starters, I would like to test how much network bandwidth I can use before websites begin loading slowly.
Is there a reasonable way to deliberately consume so many networking resources that it affects my browsing experience?
As for my attempts, I tried creating several threads, each downloading some website (like this: http://docs.oracle.com/javase/tutorial/networking/urls/readingURL.html), but this doesn't do anything to my internet speed (I guess the number of threads is just too small).
My suggestion is that you download some big files instead of just HTML pages. Try different files from different sites so that no single site throttles your downloads by IP or session.
Also, if you really want to slow down your browsing, you need to saturate your upload link as well, not only your download link. Try uploading something at maximum speed; it will greatly affect your browsing.
You can also run some tests for comparison with Apache JMeter.
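As a minimal sketch of the download side of that idea (the URLs here are placeholders; you would point them at real large files hosted on different sites), a few plain threads that stream and discard big downloads are enough to keep the download link busy:

import java.io.InputStream;
import java.net.URL;
import java.util.List;

public class BandwidthHog {
    public static void main(String[] args) {
        // Placeholder URLs - replace with real large files hosted on different sites.
        List<String> bigFiles = List.of(
                "http://example.com/large-file-1.iso",
                "http://example.org/large-file-2.zip",
                "http://example.net/large-file-3.bin");

        for (String file : bigFiles) {
            new Thread(() -> {
                byte[] buffer = new byte[64 * 1024];
                try (InputStream in = new URL(file).openStream()) {
                    // Read and throw away the body; the point is only to occupy the link.
                    while (in.read(buffer) != -1) { /* discard */ }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }).start();
        }
    }
}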
I'm trying to gather data by scraping webpages using Java with Jsoup. Ideally, I'd like about 8000 lines of data, but I was wondering what the etiquette is when it comes to accessing a site that many times. For each one, my code has to navigate to a different part of the site, so I would have to load 8000 (or more) webpages. Would it be a good idea to put delays between each request so I don't overload the website? They don't offer an API from what I can see.
Additionally, I tried running my code to get just 80 lines of data without any delay, and now my internet is out. Could running that code have caused it? When I called the company, the automated message made it sound like service was out in the area, so maybe I just didn't notice it until I tried to run the code. Any help is appreciated; I'm very new to network coding. Thanks!
Here are a couple of things you should consider, and things I learned while writing a super-fast web scraper with Java and Jsoup:
The most important one is the legal aspect: whether the website allows crawling, and to what extent it allows you to use its data.
Adding delays is fine, but it is even better to set a custom user agent that matches what the site's robots.txt permits. I saw a significant improvement in response times after changing the user agent from the default to one that robots.txt recognizes.
If the site allows it and you need to crawl a large number of pages (which was the case on one of my previous projects), you can use a ThreadPoolExecutor to load N pages simultaneously; see the sketch at the end of this answer. It turned an hours-long data-gathering job with a single-threaded Java scraper into just a couple of minutes.
Many ISPs blacklist users who run programmatic, repetitive tasks such as web crawling or setting up email servers. This varies from ISP to ISP. I previously avoided it by using proxies.
With a website that had a response time of 500 ms per request, my web scraper was able to scrape data from 200k pages in 3 minutes, using 50 threads and 1000 proxies over a 100 Mbit/s connection.
Should there be delays between requests?
Answer: It depends. If the website allows you to hit it constantly, then you don't need one, but it's still better to have one. I used a delay of 10 ms between requests.
I tried running my code to get just 80 lines of data without any delay, and now my internet is out. Could that have caused it?
Answer: Most probably. Your ISP may assume you are running a DoS attack against the website and may have temporarily or permanently limited your connection.
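Here is a rough sketch of the executor idea mentioned above, assuming Jsoup, a plain fixed thread pool, and the 10 ms delay I used (the pool size, user agent string and URL list are illustrative, not taken from the question):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelScraper {
    public static void main(String[] args) throws InterruptedException {
        List<String> urls = List.of(/* ... the ~8000 page URLs ... */);
        ExecutorService pool = Executors.newFixedThreadPool(10); // N pages loaded simultaneously

        for (String url : urls) {
            pool.submit(() -> {
                try {
                    Thread.sleep(10); // small politeness delay per request
                    Document doc = Jsoup.connect(url)
                            .userAgent("MyScraper/1.0 (contact@example.com)") // identify your crawler
                            .timeout(10_000)
                            .get();
                    // ... extract the rows you need from doc here ...
                } catch (Exception e) {
                    System.err.println("Failed: " + url + " -> " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}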
A while back I tried implementing a crawler in Java and then left the project for a while (I've made a lot of progress since). Basically, I have implemented a crawler with circa 200-400 threads; each thread connects to and downloads the content of one page (simplified for clarity, but that's basically it):
// We're in the run() method of a truly generic Runnable.
// _url is a member passed to the Runnable object beforehand.
Connection c = Jsoup.connect(_url).timeout(10000);
c.execute();
Document d = c.response().parse();
// use Jsoup to get the links, add them to the backbone of the crawler
// to be checked and maybe later passed to the crawling queue.
This works. The problem is that I only use a very small fraction of my internet bandwidth. Although I'm able to download at >6 MB/s, I've found (using NetLimiter and my own calculations) that I only use about 1 MB/s at best when downloading page sources.
I've done a lot of statistics and analysis, and it is somewhat reasonable: if the computer cannot efficiently support more than ~400 threads (I'm not sure about that either, but a larger number of threads seems to be ineffective) and each connection takes about 4 seconds to complete, then I should be downloading 100 pages per second, which is indeed what happens. The bizarre thing is that many times while I run this program, the internet connection is completely clogged: neither I nor anyone else on my Wi-Fi can access the web normally (while I'm only using 16%!), which does not happen when downloading other files, say movies.
I've spent literally weeks calculating, analyzing and collecting various statistics (making sure all threads are running with the VM monitor, calculating mean thread run times, Excel charts...) before coming here, but I've run out of answers. I wonder if this behavior can be explained. I realize there are a lot of "ifs" in this question, but it's the best I can do without it turning into an essay.
My computer specs are an i5 4460 with 8 GB of DDR3-1600 and a 100 Mbit/s (effectively around 8 MB/s) internet connection, connected directly via LAN to the crawler. I'm looking for general directions (I mean the obvious things that are clear to experienced developers but not to me) on where else to look in order to either:
Improve the download speed (maybe not Jsoup? a different number of threads? I've already tried using selectors instead of threads and it was slower), or:
Free up the internet connection while I'm running this program.
I've thought about the router itself (a Netgear N600) limiting the number of outgoing connections (which seems odd), so that I'm saturating the number of connections rather than the bandwidth, but I couldn't figure out whether that's even possible.
Any general direction or advice would be warmly welcomed :) Feel free to point out newbie mistakes; that's how I learn.
Amir.
The issue was not DNS resolution: creating the connections with an IP address (I stored all addresses in advance and then used those) resulted in the exact same response times and bandwidth use. Nor was it a threading issue.
I now suspect it was the NetLimiter program's "fault". I measured the number of bytes received directly and wrote them to disk (I had done this before, but apparently I had since changed the program), and it seems I really am saturating the bandwidth. Also, when switching to HttpURLConnection objects instead of Jsoup, NetLimiter does show much larger bandwidth usage. Perhaps it has some issue with Jsoup.
I'm not sure this was the problem, but empirically the program downloads a lot of data, so I hope this helps anyone who encounters a similar issue in the future.
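For anyone who wants to reproduce that direct measurement, here is a minimal sketch of counting received bytes with HttpURLConnection (how you log or persist the counter is up to you; the charset handling is deliberately simplified and the class name is mine):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.atomic.AtomicLong;

public class ByteCounter {
    // Shared counter; every crawler thread adds what it actually read from the socket.
    static final AtomicLong totalBytes = new AtomicLong();

    static String fetch(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        byte[] buf = new byte[16 * 1024];
        try (InputStream in = conn.getInputStream()) {
            int n;
            while ((n = in.read(buf)) != -1) {
                totalBytes.addAndGet(n); // measure at the byte level, independent of any tool
                body.write(buf, 0, n);   // keep the raw HTML for parsing later
            }
        } finally {
            conn.disconnect();
        }
        return body.toString("UTF-8"); // assumes UTF-8 pages; adjust if needed
    }
}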
I'm developing a website that, hopefully, will be accessed by more than a million people at the same time. It still has only 70k users, and it's already lagging when uploading a file or just opening pages.
I use SQL Server, Tomcat and the Apache HTTP server.
I've tried using another Tomcat instance to manage database access, but I'm facing another problem: it has to share the same storage as the other Tomcat to save the uploaded files, and that causes a huge delay when uploading.
What can I do to make my website faster?
The website is developed with JSF (with RichFaces), Java and Hibernate.
Scaling is hard.
For some operations scaling is impossible. Even the greatest (Google, Facebook, Amazon) are not free to choose their features; what is offered is often a compromise between "what would be cool" and "what will scale".
The question "how to make it faster" is unanswerable without profiling your application.
Making any decisions without considering the previous point is PLAIN STUPID AND MIGHT PUT YOU IN AN EVEN WORSE SITUATION.
The traditional way of identifying bottlenecks is to think separately about:
a) memory (does the system swap?)
b) cpu (are cpus really busy, or just waiting for the database?)
c) IO (usually that includes the database and bandwidth)
Depending on where your problem is, completely contradictory things will help. For example, if you have plenty of memory and low bandwidth, switch JSF to save state on the server. This will use more memory but make requests smaller. On the other hand, if bandwidth is not a problem but memory is, do the opposite: switch JSF to keep state on the client. This helps conserve memory (although in this case matters are more complicated: if the Tomcats in your cluster try to share session data, then saving state on the server becomes an IO problem).
You say that the problem is with uploading files. To help, we would need to know: where do you save them? To the DB? To the filesystem? Are they small or large? How are they processed? Are there any patterns in the usage of the uploaded files (like "new files are used most of the time")? And probably even more questions would pop up after these have been answered.
For your own sake: close this question. You will get plenty of well-intentioned yet misleading answers like 'drop JSF', 'cluster everything', 'add memory', 'move to GAE or Amazon EC2', 'go with a NoSQL database', 'do everything asynchronously, use a message queue', 'do everything on the client with Ajax', 'drop Ajax, it makes too many requests and kills the server'. All of this is meaningless unless you profile, profile, profile and measure, measure, measure FIRST. And then come back with a better-defined question.
If you access a database, consider using EJB + CMP. Then follow this model:
cluster your application server (e.g. GlassFish) for load balancing
keep all service calls of a single request on a single node (by only calling local services)
I am interested to know, in a very general situation (a home-brew amateur web crawler), what its performance will be. More specifically, how many pages can such a crawler process?
When I say home-brew, take that in every sense: a 2.4 GHz Core 2 processor, written in Java, a 50 Mbit internet connection, etc.
Any resources you can share on this will be greatly appreciated.
Thanks a lot,
Carlos
First of all, the speed of your computer won't be the limiting factor; as for the connection, you should artificially limit the speed of your crawler - most sites will ban your IP address if you start hammering them. In other words, don't crawl a site too quickly (10+ seconds per request should be OK with 99.99% of the sites, but go below that at your own peril).
So, while you could crawl a single site in multiple threads, I'd suggest that each thread crawl a different site (check that it isn't also on a shared IP address); that way, you can saturate your connection with a lower chance of getting banned by the spidered site.
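A small sketch of that per-site politeness rule, assuming requests are keyed by host and a minimum gap is enforced (the 10-second figure is the one suggested above; the class and method names are mine):

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class PolitenessGate {
    private static final long MIN_DELAY_MS = 10_000; // roughly one request per host every 10 s
    private final Map<String, Long> nextAllowed = new HashMap<>();

    /** Blocks the calling thread until it is polite to hit the given URL's host again. */
    public void await(String url) throws InterruptedException {
        String host = URI.create(url).getHost();
        long wait;
        synchronized (nextAllowed) {
            long now = System.currentTimeMillis();
            // Reserve the next slot for this host, then sleep outside the lock
            // so threads crawling other hosts are not held up.
            long slot = Math.max(now, nextAllowed.getOrDefault(host, 0L));
            nextAllowed.put(host, slot + MIN_DELAY_MS);
            wait = slot - now;
        }
        if (wait > 0) Thread.sleep(wait);
    }
}

Each crawler thread would call await(url) on a shared instance right before opening the connection.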
Some sites don't want you to crawl parts of the site, and there's a commonly used mechanism that you should follow: the robots.txt file. Read up on it and implement it.
Note also, that some sites prohibit any automated crawling at all; depending on the site's jurisdiction (yours may also apply), breaking this may be illegal (you are responsible for what your script does, "the robot did it" is not even an excuse, much less a defense).
In my experience, mostly making site scrapers, the network download is always the limiting factor. You can usually shuttle the parsing of the page (or storage for parsing later) to a different thread in less than the time it will take to download the next page.
So figure out, on average, how long it takes to download a web page and how big a typical page is; that gives you the throughput of one downloading thread. Multiply by the number of threads until it fills your connection's throughput, averaging out the speed of any given web server, and the math is fairly obvious.
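To make the arithmetic concrete, here is the kind of back-of-the-envelope estimate meant above (the page size, per-page time and link speed are made-up example numbers; plug in your own measurements):

public class ThreadEstimate {
    public static void main(String[] args) {
        // Example figures only - measure your own averages.
        double pageKB = 100;            // average page size in KB
        double secondsPerPage = 2.0;    // average time one thread spends per page
        double linkKBps = 6_000;        // ~50 Mbit/s connection in KB/s

        double perThreadKBps = pageKB / secondsPerPage;      // throughput of one thread
        double threadsToSaturate = linkKBps / perThreadKBps; // threads needed to fill the link

        System.out.printf("One thread: %.0f KB/s, threads to fill the link: ~%.0f%n",
                perThreadKBps, threadsToSaturate);
    }
}

With those example numbers, one thread averages 50 KB/s, so roughly 120 threads would fill the link; in practice you would stay well below that to respect per-site limits.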
If your program is sufficiently efficient, your internet connection WILL be the limiting factor (as Robert Harvey said in his answer).
However, by doing this with a home internet connection, you are probably abusing your provider's terms of service. They will monitor it and will eventually notice if you frequently exceed their reasonable usage policy.
Moreover, if they use a transparent proxy, you may hammer their proxy to death long before you reach their download limit, so be careful - make sure that you are NOT going through your ISP's proxy, transparent or otherwise.
ISPs are set up for most users doing moderate levels of browsing with a few large streaming operations (video or other downloads). A massive number of tiny requests, with hundreds outstanding at once, will probably not make their proxy servers happy even if it doesn't use much bandwidth.
I'm fairly new to programming and am working on a web crawler for my dissertation. I was provided with a web crawler, but I found it too slow since it is single-threaded: it took 30 minutes to crawl 1000 web pages. I tried creating multiple threads for execution, and with 20 threads running simultaneously the 1000 web pages took only 2 minutes. But now I'm encountering "heap out of memory" errors. I'm sure what I did was wrong, which was to create the 20 threads in a for loop. What would be the right way to multi-thread the Java crawler without getting these errors? And speaking of which, is multi-threading the solution to my problem or not?
The simple answer (see the other answer) is to increase the JVM memory size. This will help, but it is likely that the real problem is that your web crawling algorithm is creating an in-memory data structure that grows in proportion to the number of pages you visit. If that is the case, the solution may be to move the data in that structure to disk, e.g. a database.
The most appropriate solution to your problem depends on how your web crawler works, what it is collecting, and how many pages you need to crawl.
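As a rough illustration of both points (a bounded thread pool rather than a hand-rolled for loop of threads, and writing results to disk instead of accumulating them in memory), assuming Jsoup and a simple tab-separated output file of my own invention:

import org.jsoup.Jsoup;

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedCrawler {
    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(/* ... the 1000 page URLs ... */);
        ExecutorService pool = Executors.newFixedThreadPool(20); // 20 reusable workers, not 20 ad-hoc threads

        try (BufferedWriter out = Files.newBufferedWriter(Path.of("pages.txt"))) {
            for (String url : urls) {
                pool.submit(() -> {
                    try {
                        String title = Jsoup.connect(url).timeout(10_000).get().title();
                        synchronized (out) { // write to disk instead of keeping pages in memory
                            out.write(url + "\t" + title);
                            out.newLine();
                        }
                    } catch (Exception e) {
                        System.err.println("Failed: " + url);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(2, TimeUnit.HOURS);
        }
    }
}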
My first suggestion is that you increase the heap size for the JVM:
http://www.informix-zone.com/node/46
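For example, if the crawler runs as a jar, the maximum heap is raised with the standard -Xmx flag (the 2 GB figure and the jar name are just placeholders):

java -Xmx2g -jar crawler.jar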
Regarding the speed of your program:
If your web crawler obeys the robots.txt file on servers (which it should, to avoid being banned by the site admins), then there may be little that can be done.
You should profile your program, but I expect most of the time is spent downloading HTML pages, and site admins will usually not be happy if you download so fast that you drain their bandwidth.
In summary, downloading a whole site without hurting that site will take a while.