web crawler performance - java

I am interested to know, in a very general situation (a home-brew amateur web crawler), what kind of performance to expect. More specifically, how many pages can such a crawler process?
When I say home-brew, take that in all senses: a 2.4 GHz Core 2 processor, written in Java, a 50 Mbit internet connection, etc.
Any resources you can share in this regard would be greatly appreciated.
Thanks a lot,
Carlos

First of all, the speed of your computer won't be the limiting factor; as for the connection, you should artificially limit the speed of your crawler - most sites will ban your IP address if you start hammering them. In other words, don't crawl a site too quickly (10+ seconds per request should be OK with 99.99% of the sites, but go below that at your own peril).
So, while you could crawl a single site in multiple threads, I'd suggest that each thread crawl a different site (check that it isn't also on a shared IP address); that way, you can saturate your connection with a lower chance of getting banned from the spidered site.
Some sites don't want you to crawl parts of the site, and there's a commonly used mechanism that you should follow: the robots.txt file. Read the linked site and implement this.
Note also that some sites prohibit any automated crawling at all; depending on the site's jurisdiction (yours may also apply), breaking this may be illegal (you are responsible for what your script does; "the robot did it" is not even an excuse, much less a defense).
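To make the per-site delay concrete, here is a minimal sketch of a politeness tracker in Java. The class and method names are my own invention, and it assumes (as suggested above) roughly one thread per host:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal per-host politeness tracker (illustrative only). Each crawler
// thread asks for permission before fetching a URL from a host, and the
// tracker enforces a fixed delay between requests to the same host.
public class PolitenessTracker {
    private final long delayMillis;
    private final Map<String, Long> lastRequest = new ConcurrentHashMap<>();

    public PolitenessTracker(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Blocks until at least delayMillis has passed since the last request to
    // this host. Assumes one thread handles one host (as suggested above),
    // so there is no race between the get and the put.
    public void waitForHost(String host) throws InterruptedException {
        Long last = lastRequest.get(host);
        if (last != null) {
            long elapsed = System.currentTimeMillis() - last;
            if (elapsed < delayMillis) {
                Thread.sleep(delayMillis - elapsed);
            }
        }
        lastRequest.put(host, System.currentTimeMillis());
    }
}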

In my experience, mostly from building site scrapers, the network download is always the limiting factor. You can usually hand off the parsing of the page (or storage for parsing later) to a different thread in less than the time it takes to download the next page.
So figure out, on average, how long it takes to download a web page, multiply that by the number of download threads until it fills your connection's throughput, average out the speed of any given web server, and the math is fairly obvious.
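To put rough, purely illustrative numbers on that: if an average page is about 100 KB and a single fetch takes around one second end to end, one thread moves on the order of 0.8 Mbit/s, so something like 60 download threads would be needed to come close to filling a 50 Mbit/s link. Measure your own averages and plug them in.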

If your program is sufficiently efficient, your internet connection WILL be the limiting factor (as Robert Harvey said in his answer).
However, by doing this with a home internet connection, you are probably abusing your provider's terms of service. They will monitor it and will eventually notice if you frequently exceed their reasonable usage policy.
Moreover, if they use a transparent proxy, you may hammer their proxy to death long before you reach their download limit, so be careful - make sure that you are NOT going through your ISP's proxy, transparent or otherwise.
ISPs are set up for most users to do moderate levels of browsing with a few large streaming operations (video or other downloads). A massive number of tiny requests, with hundreds outstanding at once, will probably not make their proxy servers happy even if it doesn't use much bandwidth.

Related

Many Java threads downloading content from web - but not saturating the bandwidth

A while back I tried implementing a crawler in Java and left the project for a while (I've made a lot of progress since). Basically I have implemented a crawler with circa 200-400 threads; each thread connects to and downloads the content of one page (simplified for clarity, but that's basically it):
// we're in a run() method of a truly generic Runnable.
// _url is a member passed to the Runnable object beforehand.
Connection c = Jsoup.connect(_url).timeout(10000);
c.execute();
Document d = c.response().parse();
// use Jsoup to get the links, add them to the backbone of the crawler
// to be checked and maybe later passed to the crawling queue.
This works. The problem is I only use a very small fraction of my internet bandwidth. Having the ability to download at >6 MB/s, I've identified (using NetLimiter and my own calculations) that I only use about 1 MB/s at best when downloading page sources.
I've done a lot of statistics and analysis, and it is somewhat reasonable - if the computer cannot efficiently support over ~400 threads (I'm not sure about that either, but a larger number of threads seems to be ineffective) and each connection takes about 4 seconds to complete, then I'm supposed to download 100 pages per second, which is indeed what happens. The bizarre thing is that many times while I run this program, the internet connection is completely clogged - neither I nor anyone else on my Wi-Fi network can access the web normally (when I'm only using 16%! - which does not happen when downloading other files, say movies).
I've spent literally weeks calculating, analyzing and collecting various statistics (making sure all threads are operating, using a VM monitor, calculating mean run times for threads, Excel charts...) before coming here, but I've run out of answers. I wonder if this behavior can be explained. I realize there are a lot of "ifs" in this question, but it's the best I can do without it turning into an essay.
My computer specs are an i5 4460 with 8 GB DDR3-1600 and a 100 Mb/s (effectively around 8 MB/s) internet connection, connected directly via LAN to the crawler. I'm looking for general directions - where else should I look (I mean obvious things that are clear to experienced developers but not to me) in order to either:
Improve the download speed (maybe not Jsoup? different number of threads? I've already tried using selectors instead of threads and it was slower), or:
Free up the internet when I'm running this program.
I've thought about the router itself (Netgear N600) limiting the number of outgoing connections (seems odd), so that I'm saturating the number of connections rather than the bandwidth, but I couldn't figure out whether that's even possible.
Any general direction/advice would be warmly welcomed :) Feel free to point out newbie mistakes - that's how I learn.
Amir.
The issue was not DNS resolution, as creating the connections with an IP address (I stored all the addresses in advance and then used those) resulted in the exact same response times and bandwidth use. Nor was it a threading issue.
I now suspect it was the NetLimiter program's "fault". I measured the number of bytes received directly and wrote the totals to disk (I had done this before, but apparently I had made some changes to the program since). It seems I really am saturating the bandwidth. Also, when switching to HttpURLConnection objects instead of Jsoup, NetLimiter does show much larger bandwidth usage. Perhaps it has some issue with Jsoup.
I'm not sure this was the problem, but empirically, the program downloads a lot of data. So I hope this helps anyone who might encounter a similar issue in the future.
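For anyone who wants to reproduce the "count the bytes yourself" check described above, here is a minimal sketch using plain HttpURLConnection. The URL is a placeholder, and the shared counter is there so per-thread totals can be summed:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.atomic.AtomicLong;

// Counts the raw bytes actually read from the network so throughput can be
// computed independently of any monitoring tool (illustrative sketch).
public class ByteCountingFetcher {
    private static final AtomicLong totalBytes = new AtomicLong();

    public static void fetch(String pageUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                totalBytes.addAndGet(n);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        fetch("http://example.com/");          // placeholder URL
        long elapsed = System.currentTimeMillis() - start;
        System.out.printf("%d bytes in %d ms%n", totalBytes.get(), elapsed);
    }
}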

How to stop hack/DOS attack on web API

My website has been experiencing a denial of service/hack attack for the last week. The attack is hitting our web API with randomly generated invalid API keys in a loop.
I'm not sure if they are trying to guess a key (practically impossible, since the keys are 64-bit) or trying to DoS the server. The attack is distributed, so I cannot ban all of the IP addresses, as it comes from hundreds of clients.
Judging by the IPs, my guess is that it is an Android app, so someone has put malware in an Android app and is using all of its installs to attack my server.
The server is Tomcat/Java. Currently the web API just responds with 400 to invalid keys and caches IPs that have made several invalid-key attempts, but it still needs to do some processing for each bad request.
Any suggestions on how to stop the attack? Is there any way to identify the Android app making the requests from the HTTP headers?
Preventing Brute-Force Attacks:
There is a vast array of tools and strategies available to help you do this, and which to use depends entirely on your server implementation and requirements.
Without using a firewall, IDS, or other network-control tools, you can't really stop a DDOS from, well, denying service to your application. You can, however, modify your application to make a brute-force attack significantly more difficult.
The standard way to do this is by implementing a lockout or a progressive delay. A lockout prevents an IP from making a login request for X minutes if they fail to log in N times. A progressive delay adds a longer and longer delay to processing each bad login request.
If you're using Tomcat's authentication system (i.e. you have a <login-config> element in your webapp configuration), you should use Tomcat's LockOutRealm, which lets you easily lock an account out once it makes a number of bad authentication attempts.
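For reference, a LockOutRealm wraps whatever Realm you already use; a sketch of the server.xml entry follows (the numbers are illustrative, not recommendations, and the inner realm is just the Tomcat default):

<!-- Locks an account out for 300 seconds after 5 failed authentication attempts -->
<Realm className="org.apache.catalina.realm.LockOutRealm"
       failureCount="5" lockOutTime="300">
    <Realm className="org.apache.catalina.realm.UserDatabaseRealm"
           resourceName="UserDatabase"/>
</Realm>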
If you are not using Tomcat's authentication system, then you would have to post more information about what you are using to get more specific information.
Finally, you could simply increase the length of your API keys. 64 bits seems like an insurmountably huge keyspace to search, but it's underweight by modern standards. A number of factors could contribute to making it far less secure than you expect:
A botnet (or other large network) could make tens of thousands of attempts per second, if you have no protections in place.
Depending on how you're generating your keys and gathering entropy, your de facto keyspace might be much smaller.
As your number of valid keys increases, the number of keys that need to be attempted to find a valid one (at least in theory) drops sharply.
Upping the API key length to 128 (or 256, or 512) bits won't cost much, and you'll tremendously increase the search space (and thus the difficulty) of any brute-force attack.
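Generating longer keys is cheap. A minimal sketch with java.security.SecureRandom (32 random bytes gives a 256-bit key, URL-safe Base64 encoded; the class name is made up):

import java.security.SecureRandom;
import java.util.Base64;

// Generates a random 256-bit API key and encodes it for transport.
public class ApiKeyGenerator {
    private static final SecureRandom RANDOM = new SecureRandom();

    public static String newKey() {
        byte[] key = new byte[32];                 // 32 bytes = 256 bits
        RANDOM.nextBytes(key);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(key);
    }

    public static void main(String[] args) {
        System.out.println(newKey());
    }
}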
Mitigating DDoS Attacks:
To mitigate DDoS attacks, however, you need to do a bit more legwork. DDoS attacks are hard to defend against, and it's especially hard if you don't control the network your server is on.
That being said, there are a few server-side things you can do:
Installing and configuring a web-application firewall, like mod_security, to reject incoming connections that violate rules that you define.
Setting up an IDS system, like Snort, to detect when a DDOS attack is occurring and take the first steps to mitigate it
See Martin Muller's post for another excellent option, fail2ban.
Creating your own Tomcat Valve, as described here, to reject incoming requests by their User-Agents (or any other criterion) as a last line of defense.
In the end, however, there is only so much you can do to stop a DDoS attack for free. A server has only so much memory, so many CPU cycles, and so much network bandwidth; with enough incoming connections, even the most efficient firewall won't keep you from going down. You'll be better able to weather DDoS attacks if you invest in a higher-bandwidth internet connection and more servers, or if you deploy your application on Amazon Web Services, or if you buy one of the many consumer and enterprise DDoS mitigation products (SDude has some excellent recommendations in his post). None of those options are cheap, quick, or easy, but they're what's available.
Bottom Line:
If you rely on your application code to mitigate a DDoS, you've already lost.
If it's big enough, you just can't stop it alone. You can do all the optimisation you want at the app level, but you'll still go down. In addition to app-level security for prevention (as in FSQ's answer), you should use proven solutions and leave the heavy lifting to professionals (if you are serious about your business). My advice is:
Sign up for CloudFlare or Incapsula. This is day-to-day work for them.
Consider using AWS API Gateway as the second stage for your API requests. You'll enjoy filtering, throttling, security, auto-scaling and HA for your API at Amazon scale. Then you can forward the valid requests to your machines (in or outside Amazon).
Internet --> CloudFlare/Incapsula --> AWS API Gateway --> Your API Server
My 0.02.
PS: I think this question belongs on the Security site.
Here are a couple of ideas. There are a number of additional strategies, but this should get you started. Also realize that Amazon gets DDoSed on a frequent basis, and their systems tend to have a few heuristics that harden them (and therefore you) against these attacks, particularly if you are using Elastic Load Balancing, which you should be using anyway.
Use a CDN - they often have ways of detecting and defending against DDoS. Akamai, Fastly, or Amazon's own CloudFront.
Use iptables to blacklist offending IPs. The more tooling you have around this, the faster you can block/unblock.
Use throttling mechanisms to prevent large numbers of requests
Automatically deny requests that are very large (say, greater than 1-2 MB - unless you have a photo-uploading service or similar) before they get to your application.
Prevent cascading failures by placing a limit on the total number of connections to other components in your system; for example, don't let your database server become overloaded by opening a thousand connections to it.
The best way is to prevent access to your services entirely for IP addresses that have failed, let's say, three times. This will take most of the load off your server, as the attacker gets blocked before Tomcat even has to start a thread for the request.
One of the best tools to achieve this is called fail2ban (http://www.fail2ban.org). It is provided as a package in all major Linux distributions.
What you have to do is basically log the failed attempts to a file and create a custom filter for fail2ban. Darryn van Tonder has a nice example of how to write your own filter on his blog: https://darrynvt.wordpress.com/tag/custom-fail2ban-filters/
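The application side of that is just writing each failure to a log line that a fail2ban failregex can later match on the client IP (fail2ban filters use the <HOST> placeholder for that). A hypothetical sketch of the logging half, with a made-up path and message format:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.time.Instant;

// Appends one easily-greppable line per failed API-key attempt; a fail2ban
// filter would match the "INVALID_API_KEY from <ip>" pattern (illustrative).
public class FailedAttemptLogger {
    private static final Path LOG = Paths.get("/var/log/myapi/auth-failures.log"); // hypothetical path

    public static synchronized void logFailure(String clientIp) {
        String line = Instant.now() + " INVALID_API_KEY from " + clientIp + System.lineSeparator();
        try {
            Files.write(LOG, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            // Logging must never break request handling.
        }
    }
}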
If the DDoS attack is severe, application-level checks do not work at all. The entire bandwidth will be consumed by the DDoS clients and your application-level checks won't even be triggered; practically speaking, your web service does not run at all.
If you have to keep your application safe from severe DDoS attacks, you have no option but to rely on paid third-party tools. One clean-pipe provider (one that forwards only good traffic) I can vouch for from past experience is Neustar.
If the DDoS attack on your website is mild, you can implement application-level checks. For example, the Apache configuration below (it requires the mod_limitipconn module) will restrict the maximum number of connections from a single IP, as quoted in Restrict calls from single IP:
# Adjust this location to match your setup
<Directory /home/*/public_html>
    MaxConnPerIP 1
    OnlyIPLimit audio/mpeg video
</Directory>
For more insight into DDoS attacks, see the Wikipedia article. It provides a list of preventive and responsive tools, which includes firewalls, switches, routers, IPS-based prevention, DDoS defense systems,
and finally
Clean pipes (All traffic is passed through a "cleaning center" or a "scrubbing center" via various methods such as proxies, tunnels or even direct circuits, which separates "bad" traffic (DDoS and also other common internet attacks) and only sends good traffic beyond to the server)
You can find 12 distributors of Clean pipes.
For a targeted and highly distributed DoS attack, the only practical solution (other than providing the capacity to soak it up) is to profile the attack, identify the 'tells', and route that traffic to a low-resource handler.
Your question already has some tells: the requests are invalid, but presumably there is too much cost in determining that; the requests originate from a specific group of networks; and presumably they occur in bursts.
In your comments you've told us at least one other tell - the user agent is null.
Without adding any additional components, you could start by tarpitting the connection - if a request matching the profile comes in, go ahead and validate the key, but then have your code sleep for a second or two. This will reduce the rate of requests from these clients at a small cost.
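As a sketch of that tarpit idea in a servlet filter (the profile test and the sleep time are placeholders; adapt them to your own tells, such as the missing User-Agent combined with an invalid key):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Slows down (tarpits) requests that match the attack profile before they
// reach the normal key-validation code. Illustrative sketch only.
public class TarpitFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        String userAgent = http.getHeader("User-Agent");

        // "Tell" from the question: the abusive clients send no User-Agent header.
        if (userAgent == null || userAgent.isEmpty()) {
            try {
                Thread.sleep(2000); // hold the connection for a couple of seconds
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        chain.doFilter(req, resp);
    }

    @Override
    public void init(FilterConfig config) { }

    @Override
    public void destroy() { }
}

Note that the sleep holds a request-processing thread for its duration, which is the "small cost" mentioned above.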
Another solution would be to log failures matching the tell and use fail2ban to reconfigure your firewall in real time to drop all packets from the source address for a while.
No, it's unlikely you will be able to identify the app without getting hold of an affected device.

Better performance for a lot of users at the same time

I'm developing a website that, hopefully, will be accessed by more than a million people at the same time. It still has only 70k users, and it's already lagging when uploading a file or just opening pages.
I use SQL Server, Tomcat and Apache HTTP Server.
I've tried using another Tomcat to manage access to the database, but I'm facing another problem: it has to share the same storage as the other Tomcat to save the uploaded files, and that causes a huge delay when uploading.
What can I do to make my website faster?
The website is developed with JSF with RichFaces, Java and Hibernate.
Scaling is hard.
For some operations scaling is impossible. Even the greatest (Google, Facebook, Amazon) are not free to choose their features; what is offered is often a compromise between "what would be cool" and "what will scale".
The question "how to make it faster" is unanswerable without profiling your application.
Making any decisions without considering the former point is PLAIN STUPID AND MIGHT PUT YOU IN AN EVEN WORSE SITUATION.
The traditional way of identifying bottlenecks is to think separately about:
a) memory (does the system swap?)
b) CPU (are the CPUs really busy, or just waiting for the database?)
c) IO (usually that includes the database and bandwidth)
Depending on where your problem is, completely contradictory things will help. For example, if you have plenty of memory and low bandwidth, switch JSF to save state on the server. This will use more memory but make requests smaller. On the other hand, if bandwidth is not a problem and memory is, do the opposite: switch JSF to keep state on the client. This will help conserve memory (although in this case matters are more complicated: if the Tomcats in your cluster try to share session data, then saving state on the server becomes an IO problem).
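For reference, that JSF switch is a context parameter in web.xml; a sketch (set the value to server or client depending on which resource you want to trade):

<!-- web.xml: choose where JSF stores component state -->
<context-param>
    <param-name>javax.faces.STATE_SAVING_METHOD</param-name>
    <param-value>client</param-value> <!-- or "server" -->
</context-param>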
You say that the problem is with uploading files. To help, we would need to know: where do you save them? To the DB? To the filesystem? Are they small or large? How are they processed? Are there any patterns in the usage of the uploaded files (like "new files are used most of the time")? Probably even more questions would pop up after these had been answered.
For your own sake: close this question. You will get plenty of well-intentioned, and yet misguided answers like 'drop JSF', 'cluster everything', 'add memory', 'move to GAE or Amazon EC2', 'go with a NoSQL database', 'do everything asynchronously, use a message queue', 'do everything on the client with Ajax', 'drop Ajax, it makes too many requests and kills the server'. All of this is meaningless unless you profile, profile, profile, measure, measure, measure - FIRST. And then give us a better-defined question.
If you access a DB, consider using EJB + CMP. Then follow this model:
cluster your application server (e.g. GlassFish) for load balancing
keep all service calls of a single request on a single node (by only calling local services)

Prevent client from overloading server?

I have a Java servlet that's getting overloaded by client requests during peak hours. Some clients spawn concurrent requests. Sometimes the number of requests per second is just too great.
Should I implement application logic to restrict the number of requests a client can send per second? Does this need to be done at the application level?
The two most common ways of handling this are to turn away requests when the server is too busy, or handle each request slower.
Turning away requests is easy; just run a fixed number of instances. The OS may or may not queue up a few connection requests, but in general the users will simply fail to connect. A more graceful way of doing it is to have the service return an error code indicating the client should try again later.
Handling requests more slowly is a bit more work, because it requires separating the servlet handling the requests from the class doing the work, in a different thread. You can have a larger number of servlets than worker bees. When a request comes in, the servlet accepts it, waits for a worker bee, grabs it and uses it, frees it, then returns the results.
The two can communicate through one of the classes in java.util.concurrent, like LinkedBlockingQueue or ThreadPoolExecutor. If you want to get really fancy, you can use something like a PriorityBlockingQueue to serve some customers before others.
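A minimal sketch of that separation, using a ThreadPoolExecutor with a bounded queue so excess requests fail fast instead of piling up (the class name, pool size and queue size are placeholders):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A small pool of "worker bees" behind a bounded queue. The servlet hands the
// real work to the pool and waits for the result; if the queue is full, the
// hand-off fails fast and the servlet can return "try again later".
public class WorkerPool {
    private final ThreadPoolExecutor workers = new ThreadPoolExecutor(
            4, 4,                                   // 4 worker threads
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),          // at most 100 queued requests
            new ThreadPoolExecutor.AbortPolicy());  // reject when full

    /** @return a Future with the result, or null if the server is too busy. */
    public <T> Future<T> trySubmit(Callable<T> work) {
        try {
            return workers.submit(work);
        } catch (RejectedExecutionException e) {
            return null;                            // caller should return an error code
        }
    }
}

The servlet would call trySubmit() and, if it gets null back, return the "try again later" error code described earlier.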
Me, I would throw more hardware at it like Anon said ;)
Some solid answers here. I think more hardware is the way to go. Having too many clients or traffic is usually a good problem to have.
However, if you absolutely must throttle clients, there are some options.
The most scalable solutions that I've seen revolve around a distributed caching system, like Memcached, and using integers to keep counts.
Figure out a rate at which your system can handle traffic. Either overall, or per client. Then put a count into memcached that represents that rate. Each time you get a request, decrement the value. Periodically increment the counter to allow more traffic through.
For example, if you can handle 10 requests/second, add a count of 50 every 5 seconds, up to a maximum of 50. That way you aren't refilling it all the time, but you can also handle a bit of bursting, limited to a window. You will need to experiment to find a good refresh rate. The key for this counter can either be a global key or based on user ID if you need to restrict that way.
The nice thing about this system is that it works across an entire cluster AND the mechanism that refills the counters need not be in one of your current servers. You can dedicate a separate process for it. The loaded servers only need to check it and decrement it.
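The counting logic itself is small. Here it is sketched with an in-process map so the example is self-contained; in the distributed setup described above, the map operations would be replaced with the equivalent memcached increment/decrement calls and the refill would run in the separate process mentioned. Names and limits are placeholders.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Token-bucket-style counters: each request takes a token; a background task
// refills them periodically. Limits and periods below are placeholders.
public class RateLimiter {
    private static final long MAX_TOKENS = 50;     // burst ceiling
    private static final long REFILL = 50;         // tokens added per refill
    private final Map<String, AtomicLong> buckets = new ConcurrentHashMap<>();
    private final ScheduledExecutorService refiller = Executors.newSingleThreadScheduledExecutor();

    public RateLimiter() {
        // every 5 seconds, top each bucket back up (capped at MAX_TOKENS)
        refiller.scheduleAtFixedRate(() ->
                buckets.values().forEach(b ->
                        b.updateAndGet(v -> Math.min(MAX_TOKENS, v + REFILL))),
                5, 5, TimeUnit.SECONDS);
    }

    /** @return true if the request identified by key (global or per user) may proceed. */
    public boolean tryAcquire(String key) {
        AtomicLong bucket = buckets.computeIfAbsent(key, k -> new AtomicLong(MAX_TOKENS));
        return bucket.getAndUpdate(v -> v > 0 ? v - 1 : 0) > 0;
    }
}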
All that being said, I'd investigate other options first. Throttling your customers is usually a good way to annoy them. Most probably NOT the best idea. :)
I'm assuming you're not in a position to increase capacity (either via hardware or software), and you really just need to limit the externally-imposed load on your server.
Dealing with this from within your application should be avoided unless you have very special needs that are not met by the existing solutions out there, which operate at HTTP server level. A lot of thought has gone into this problem, so it's worth looking at existing solutions rather than implementing one yourself.
If you're using Tomcat, you can configure the maximum number of simultaneous requests allowed via the maxThreads and acceptCount settings. Read the introduction at http://tomcat.apache.org/tomcat-6.0-doc/config/http.html for more info on these.
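For example, a sketch of the relevant attributes on the HTTP Connector in server.xml (the values are illustrative, not recommendations):

<!-- server.xml: at most 150 requests processed concurrently; up to 100 further
     connections queued before new ones are refused -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="150"
           acceptCount="100" />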
For more advanced controls (like per-user restrictions), if you're proxying through Apache, you can use a variety of modules to help deal with the situation. A few modules to google for are limitipconn, mod_bw, and mod_cband. These are quite a bit harder to set up and understand than the basic controls that are probably offered by your appserver, so you may just want to stick with those.

How to handle OUT OF MEMORY error for multiple threads in a Java Web Crawler

I'm fairly new to programming and am working on a web crawler for my dissertation. I was provided with a web crawler, but I found it to be too slow since it is single-threaded: it took 30 minutes to crawl 1000 web pages. I tried creating multiple threads for execution, and with 20 threads running simultaneously the 1000 web pages took only 2 minutes. But now I'm encountering "heap out of memory" errors. I'm sure what I did was wrong, which was to create the 20 threads in a for loop. What would be the right way to multi-thread the Java crawler without producing these errors? And speaking of which, is multi-threading the solution to my problem or not?
The simple answer (see above) is to increase the JVM memory size. This will help, but it is likely that the real problem is that your web-crawling algorithm is creating an in-memory data structure that grows in proportion to the number of pages you visit. If that is the case, the solution may be to move the data in that data structure to disk, e.g. a database.
The most appropriate solution to your problem depends on how your web crawler works, what it is collecting, and how many pages you need to crawl.
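On the threading side specifically, rather than starting 20 raw Threads in a for loop, a fixed-size ExecutorService keeps the number of live threads (and the work queued in memory) under control. A minimal sketch, with a made-up fetchAndStore() standing in for your existing download-and-save code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Crawls a list of URLs with a bounded pool of 20 threads instead of
// creating threads in a loop. Illustrative sketch only.
public class BoundedCrawler {

    public static void crawl(Iterable<String> urls) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(20);
        for (String url : urls) {
            pool.submit(() -> fetchAndStore(url));
        }
        pool.shutdown();                            // no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS);   // wait for the queue to drain
    }

    private static void fetchAndStore(String url) {
        // download the page and persist it to disk or a database; keeping all
        // parsed pages in memory is what typically fills up the heap.
    }
}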
My first suggestion is that you increase the heap size for the JVM:
http://www.informix-zone.com/node/46
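Concretely, the maximum heap is set with the -Xmx flag when launching the JVM (the 1 GB figure below is only an example):

java -Xmx1024m -jar crawler.jar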
Regarding the speed of your program:
If your web crawler obeys the robots.txt file on servers (which it should, to avoid being banned by the site admins), then there may be little that can be done.
You should profile your program, but I expect most of the time is spent downloading HTML pages, and site admins will usually not be happy if you download so fast that you drain their bandwidth.
In summary, downloading a whole site without hurting that site will take a while.
