A while back I started implementing a crawler in Java, left the project for a while, and have since made a lot of progress. Basically I have implemented a crawler with roughly 200-400 threads; each thread connects to and downloads the content of one page (simplified for clarity, but that's basically it):
// we're in a run() method of a truly generic Runnable.
// _url is a member passed to the Runnable object beforehand.
Connection c = Jsoup.connect(_url).timeout(10000);
c.execute();
Document d = c.response().parse();
// use Jsoup to get the links, add them to the backbone of the crawler
// to be checked and maybe later passed to the crawling queue.
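The question doesn't show how these Runnables are driven; purely as a point of reference, a minimal sketch of one way to run a few hundred of them on a fixed thread pool might look like this (the class name, seed list and pool size are illustrative, not taken from the original code):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative driver: a fixed pool of ~300 worker threads, each fetching one URL
// with the same Jsoup calls as above.
class CrawlerDriver {
    public static void main(String[] args) throws InterruptedException {
        List<String> seeds = List.of("https://example.com/");   // placeholder seed URLs
        ExecutorService pool = Executors.newFixedThreadPool(300);
        for (String url : seeds) {
            pool.submit(() -> {
                try {
                    Connection c = Jsoup.connect(url).timeout(10000);
                    c.execute();
                    Document d = c.response().parse();
                    // extract links from d here and feed them back into the crawl queue
                } catch (Exception e) {
                    // a real crawler would log the failure and move on
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}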
This works. The problem is that I only use a very small fraction of my internet bandwidth. Although the connection can download at >6 MB/s, I've found (using NetLimiter and my own calculations) that I only use about 1 MB/s at best when downloading page sources.
I've done a lot of statistics and analysis, and the numbers are somewhat reasonable: if the computer cannot efficiently support more than ~400 threads (I'm not sure about that either, but a larger number of threads seems to be ineffective) and each connection takes about 4 seconds to complete, then I should be downloading about 100 pages per second, which is indeed what happens. The bizarre thing is that many times while this program runs, the internet connection is completely clogged - neither I nor anyone else on my Wi-Fi can access the web normally (even though I'm only using 16% of the bandwidth! This does not happen when downloading other files, say movies).
I've spent literally weeks calculating, analyzing and collecting various statistics (making sure all threads are operating with the VM monitor, calculating mean run times for threads, Excel charts...) before coming here, but I've run out of answers. I wonder if this behavior can be explained. I realize there are a lot of "ifs" in this question, but it's the best I can do without it turning into an essay.
My computer specs are an i5 4460 with 8 GB DDR3-1600 and a 100 Mb/s (effectively around 8 MB/s) internet connection, connected directly via LAN to the crawler. I'm looking for general directions - where else should I look (I mean obvious things that are clear to experienced developers but not to me) in order to either:
Improve the download speed (maybe not Jsoup? different number of threads? I've already tried using selectors instead of threads and it was slower), or:
Free up the internet when I'm running this program.
I've thought about the router itself (Netgear N600) limiting the number of outgoing connections (seems odd), so that I'm saturating the number of connections and not the bandwidth, but I couldn't figure out whether that's even possible.
Any general direction / advice would be warmly welcomed :) and feel free to point out newbie mistakes - that's how I learn.
Amir.
The issue was not DNS resolution, as creating the connections with an IP address (I stored all the addresses in advance and then used those) resulted in the exact same response times and bandwidth use. Nor was it a threading issue.
I now suspect it was the NetLimiter program's "fault". I measured the number of bytes received directly and wrote the figures to disk (I had done this before, but apparently I had since made some changes to the program). It seems I really am saturating the bandwidth. Also, when switching to HttpURLConnection objects instead of Jsoup, NetLimiter does show a much larger bandwidth usage. Perhaps it has some issue with Jsoup.
I'm not sure this was the problem, but empirically the program downloads a lot of data, so I hope this helps anyone who encounters a similar issue in the future.
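For anyone who wants to cross-check a traffic-monitoring tool against raw numbers, here is a minimal sketch (my own, not from the original program) that counts the bytes actually received over a plain HttpURLConnection; the URL and timeouts are placeholders:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative byte counter: measures what actually arrives at the HTTP body level,
// independent of any traffic-monitoring tool.
class ByteCounter {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL("https://example.com/").openConnection();
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        long total = 0;
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        System.out.println("Bytes received: " + total);
    }
}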
Related
We have an issue with a server at work and I'm trying to understand what is happening. It's a Java application that runs on a Linux server; the application receives information from a TCP socket, analyses it, and after the analysis writes it to the database.
Sometimes there are too many packets and the Java application needs to write to the database many times per second (around 100 to 500 times).
I tried to reproduce the issue on my own computer and looked at how the application behaves with JProfiler.
Memory usage seems to keep going up - is it a memory leak? (Sorry, I'm not a Java programmer, I'm a C++ programmer.)
(JProfiler memory screenshots: after 133 minutes and after 158 minutes.)
I have many locked threads - does that mean the application was not programmed correctly?
Are there too many connections to the database (the application uses the BasicDataSource class for a connection pool)?
The program doesn't have a FIFO to manage the database writes for the information continually arriving from the TCP port. My questions are (remember that I'm not a Java programmer and I don't know whether this is the way a Java application should work or whether the program could be written more efficiently):
Do you think something is wrong with the code - that it is not correctly managing writes, reads and updates on the database and consumes too much memory and CPU time - or is that just the way the BasicDataSource class works?
How do you think I can improve this (if you think it is an issue) - by creating a FIFO and removing the part of the code that creates too many threads? Or are those threads not the application's own threads but the BasicDataSource threads?
There are several areas to dig into, but first I would try to find what is actually blocking the threads in question. I'll assume everything upstream of the app is being looked at as well, so this is from the app down.
I know the graphs show free memory, but they are just points in time, so I can't see a trend. GC logging is available; I haven't used JProfiler much, though, so I'm not sure where to find it in that tool. I know in DynaTrace I can see GC events and their duration, as well as any other blocking events and their root causes. If that isn't available, there are command-line switches to log GC activity so you can see its duration and frequency. That is one area that could block.
I would also look at how many connections you have in your pool. If there are 100-500 requests per second trying to write and they are stacking up because you don't have enough connections to serve them, that could be a problem as well. The image shows all transactions but doesn't tell us the pool size. Transactions blocked with nowhere to go could explain your memory jumps as well.
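For reference, pool sizing on BasicDataSource is just a few setters; a minimal sketch assuming Apache Commons DBCP 2 (the JDBC URL and the limits are placeholder values, not the application's real configuration):

import org.apache.commons.dbcp2.BasicDataSource;

// Illustrative DBCP 2 pool configuration with an explicit upper bound.
class PoolConfig {
    static BasicDataSource createPool() {
        BasicDataSource ds = new BasicDataSource();
        ds.setUrl("jdbc:postgresql://dbhost/app");   // placeholder JDBC URL
        ds.setInitialSize(10);
        ds.setMaxTotal(50);          // hard cap on concurrent connections (setMaxActive in DBCP 1.x)
        ds.setMaxWaitMillis(2000);   // fail fast instead of letting writes pile up indefinitely
        return ds;
    }
}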
There is also the flip side: your database may not be able to handle the traffic and is pegged, and that is what is blocking the connections, so you would want to monitor that end of things and see whether it is a possible cause of the blocking.
There is also the chance that the blocking is caused by the SQL being run - waiting for page locks to be released, etc.
Lots of areas to look at, but I would address and verify one layer at a time starting with the app and working down.
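Since the question specifically asks about a FIFO in front of the database, here is a minimal sketch of the usual pattern - a bounded queue drained by a dedicated writer thread - with illustrative names, not code from the application:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative write-behind queue: TCP readers enqueue, one writer drains to the DB.
class DbWriteQueue {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000); // bounded: applies back-pressure

    void enqueue(String record) throws InterruptedException {
        queue.put(record);               // blocks when full instead of growing memory without limit
    }

    void startWriter() {
        Thread writer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String record = queue.take();
                    writeToDatabase(record);   // placeholder for the JDBC insert/update
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "db-writer");
        writer.setDaemon(true);
        writer.start();
    }

    private void writeToDatabase(String record) { /* JDBC work goes here */ }
}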
We have a 64-bit Linux machine and we make multiple HTTP connections to other services; the Drools Guvnor website (a rule engine, if you don't know it) is one of them. In Drools, we create a knowledge base per rule being fired, and creating a knowledge base makes an HTTP connection to the Guvnor website.
All other threads are blocked and CPU utilization goes up to ~100%, resulting in an OOM. We could change things so the rules are compiled only every 15-20 minutes, but I want to be sure of the problem in case someone has already faced it.
I checked "cat /proc/sys/kernel/threads-max" and it shows 27000 threads. Could that be a reason?
I have a couple of questions:
When do we know that we are running over capacity?
How many threads can be spawned internally (any rough estimate or formula relating the different parameters will work)?
Has anyone else seen similar issues with Drools? Concurrent access to the Guvnor website is basically causing the issue.
Thanks,
I am basing my answer on the assumption that you are creating a knowledge base for each request, and that this knowledge base creation includes downloading the latest rule sources from Guvnor - please correct me if I am mistaken.
I suspect that the building/compilation of packages is taking time and hogging your system.
Instead of compiling packages on each and every request, you can download pre-built packages from Guvnor, and you can also cache those packages locally if your rules don't change much. The only restriction is that you need to use the same version of Drools both on Guvnor and in your application.
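A rough sketch of that idea, assuming the Drools 5.x knowledge API and a pre-compiled binary package exposed by Guvnor (the package URL and class names are placeholders):

import org.drools.KnowledgeBase;
import org.drools.KnowledgeBaseFactory;
import org.drools.builder.KnowledgeBuilder;
import org.drools.builder.KnowledgeBuilderFactory;
import org.drools.builder.ResourceType;
import org.drools.io.ResourceFactory;

// Illustrative cache: build the knowledge base once from a pre-compiled Guvnor
// package and reuse it for every request, instead of rebuilding per rule fired.
class RuleBaseHolder {
    private static volatile KnowledgeBase kbase;

    static KnowledgeBase get() {
        if (kbase == null) {
            synchronized (RuleBaseHolder.class) {
                if (kbase == null) {
                    KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
                    // placeholder URL of the pre-built package on the Guvnor instance
                    kbuilder.add(ResourceFactory.newUrlResource(
                            "http://guvnor-host/guvnor/rest/packages/mypackage/binary"),
                            ResourceType.PKG);
                    if (kbuilder.hasErrors()) {
                        throw new IllegalStateException(kbuilder.getErrors().toString());
                    }
                    KnowledgeBase kb = KnowledgeBaseFactory.newKnowledgeBase();
                    kb.addKnowledgePackages(kbuilder.getKnowledgePackages());
                    kbase = kb;
                }
            }
        }
        return kbase;
    }
}

Each request would then just call RuleBaseHolder.get().newStatefulKnowledgeSession() instead of rebuilding the knowledge base.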
I checked "cat /proc/sys/kernel/threads-max" and it shows 27000 threads. Could that be a reason?
That number does look large, but we don't know whether a majority of those threads belong to your Java app. Create a Java thread dump to confirm this. The thread dump will also show the CPU time taken by each thread.
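If attaching an external tool is awkward, a dump can also be produced programmatically; a minimal sketch using the standard ThreadMXBean (nothing Drools-specific, and CPU time may be unsupported on some JVMs):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative in-process thread dump with per-thread CPU time.
class ThreadDump {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.print(info);                                         // name, state, stack, held locks
            System.out.println("cpu ns: " + mx.getThreadCpuTime(info.getThreadId()));
        }
    }
}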
When do we know that we are running over capacity?
You have 100% CPU and an OOM error - you are over capacity :) Jokes aside, you should monitor your HTTP connection queue to determine what you are doing wrong. Your post says nothing about how you are handling the HTTP connections (presumably through some sort of pooling mechanism backed by a queue?). I've seen containers and programs queue requests indefinitely, causing them to crash with a big bang. Plot the following graphs to isolate your problem:
The number of blocked threads over time
Time taken for each thread
Number of threads per thread pool and how they increase / decrease with time (pool size)
How many threads can be spawned internally (any rough estimate or formula relating the different parameters will work)?
Only a load test can answer this question. Load your server and determine the number of concurrent users it can support at 60-70% capacity. Note the number of threads spawned internally at this point. That is your peak capacity (allowing room for unexpected traffic)
Has anyone else seen similar issues with Drools? Concurrent access to the Guvnor website is basically causing the issue.
I can't help there, since I've not accessed Drools this way. Sorry.
I'm thinking about writing a game which is based around a server, and several client programs connect to it. The game (very) basically consists of a list of items which a user can 'accept', which would remove it from the list on all connected computers (this needs to update very quickly).
I'm thinking about using a Java applet for the client since I would like this to be portable and run from a browser (mostly in Windows), as well as updating fast, and either a C++ or Java server running on Linux (currently just a home server, but possibly to go on a VPS).
A previous 'incarnation' of this game ran in a browser, and used PHP+mySQL for the backend, but this swamped the server quite a bit when several people connected (that was with about 8 people, this would eventually need to handle a lot more).
The users would probably all be in the same physical location (with the same public IP address), and the system would get several requests per second, all of which would require sending the list back to the clients.
Some computers may have firewall restrictions on them, so would you recommend using HTTP traffic, a custom port, or perhaps through SSH or some existing protocol?
Could anyone suggest some tips (threading, multiple requests of one item?), tools, databases (mySQL?), or APIs which would help me get started on this project? I would prefer C++ for the backend as it would be faster, but using Java would allow me to reuse code.
Thanks!
I wouldn't choose C++ because of speed alone. It is highly unlikely that the difference in performance will make a real difference to your game. (Your network is likely to mask any performance difference, unless you have 10 GigE between the client and server.) I would choose C++ or Java based on which one you will get working first.
For anyone looking for a good networking API for C++, I always suggest Boost.Asio. It has the advantage of being platform-independent, so you can compile a server for Linux, Windows, etc. However, if you are not too familiar with C++ templates/Boost, the code can be a little overwhelming. Have a look, give it a try.
In terms of general advice: given the description above, you seem to need a relatively simple server. I would suggest keeping it very basic - a single-threaded polling loop. Read a message from your connected clients (wait on multiple sockets) and respond appropriately. This eliminates any issues around multiple accesses to your list and other synchronization problems.
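If the server ends up being written in Java, this kind of single-threaded polling loop is usually built on NIO selectors; a minimal sketch (the port is a placeholder and the message handling is only hinted at):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Illustrative single-threaded server loop: one selector waits on all client sockets.
class PollingServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(5000));        // placeholder port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buf = ByteBuffer.allocate(1024);
        while (true) {
            selector.select();                           // block until something is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {                // new client connected
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {           // a client sent a request
                    SocketChannel client = (SocketChannel) key.channel();
                    buf.clear();
                    if (client.read(buf) == -1) {
                        client.close();
                    } else {
                        // parse the request, update the item list, broadcast the change
                    }
                }
            }
        }
    }
}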
I might also suggest, before you rewrite your initial incarnation, trying to improve it, since you have stated:
and the system would get several requests per second, all of which would require sending the list back to the clients.
Given that each request removes an item from this list, why not just inform your users which item was removed, rather than sending the entire list over the network time and time again? If the list is of any significant size, this minor change will result in a large improvement.
I am interested to know, in a very general situation (a home-brew amateur web crawler), what its performance will be. More specifically, how many pages can such a crawler process?
When I say home-brew, take that in all senses: a 2.4 GHz Core 2 processor, written in Java, a 50 Mbit internet connection, etc.
Any resources you can share in this regard will be greatly appreciated.
Thanks a lot,
Carlos
First of all, the speed of your computer won't be the limiting factor; as for the connection, you should artificially limit the speed of your crawler - most sites will ban your IP address if you start hammering them. In other words, don't crawl a site too quickly (10+ seconds per request should be OK with 99.99% of the sites, but go below that at your own peril).
So, while you could crawl a single site in multiple threads, I'd suggest that each thread crawl a different site (check that it's not a shared IP address either); that way, you can saturate your connection with a lower chance of getting banned from the spidered site.
Some sites don't want you to crawl parts of the site, and there's a commonly used mechanism that you should follow: the robots.txt file. Read the linked site and implement this.
Note also that some sites prohibit any automated crawling at all; depending on the site's jurisdiction (yours may also apply), breaking this may be illegal (you are responsible for what your script does; "the robot did it" is not even an excuse, much less a defense).
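One simple way to implement the one-site-at-a-time throttling described above is to remember when each host was last fetched and wait out the remainder of a fixed delay; a minimal sketch, with the 10-second figure taken from the advice above and the class name invented:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative per-host politeness delay: at most one request per host every 10 s.
class PolitenessThrottle {
    private static final long DELAY_MS = 10_000;
    private final Map<String, Long> lastFetch = new ConcurrentHashMap<>();

    void waitForTurn(String host) throws InterruptedException {
        while (true) {
            long now = System.currentTimeMillis();
            Long prev = lastFetch.get(host);
            if (prev == null || now - prev >= DELAY_MS) {
                // claim the slot atomically so two threads can't hit the same host at once
                if (prev == null ? lastFetch.putIfAbsent(host, now) == null
                                 : lastFetch.replace(host, prev, now)) {
                    return;
                }
            } else {
                Thread.sleep(DELAY_MS - (now - prev));
            }
        }
    }
}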
In my experience, mostly building site scrapers, the network download is always the limiting factor. You can usually hand off the parsing of the page (or the storage for later parsing) to a different thread in less time than it takes to download the next page.
So figure out, on average, how long it takes to download a web page. Multiply that by the number of download threads until it fills your connection's throughput, average out the speed of any given web server, and the math is fairly obvious.
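As a back-of-the-envelope example with invented numbers: if an average page is 100 KB and takes 2 seconds to fetch, one thread delivers about 50 KB/s, so saturating a 50 Mbit/s (~6 MB/s) line like the one in the question would take on the order of 120-130 concurrent downloads - assuming the remote servers and your politeness delays allow it.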
If your program is sufficiently efficient, your internet connection WILL be the limiting factor (as Robert Harvey said in his answer).
However, by doing this over a home internet connection, you are probably abusing your provider's terms of service. They will monitor it and will eventually notice if you frequently exceed their reasonable-usage policy.
Moreover, if they use a transparent proxy, you may hammer their proxy to death long before you reach their download limit, so be careful - make sure that you are NOT going through your ISP's proxy, transparent or otherwise.
ISPs are set up for most users to do moderate levels of browsing with a few large streaming operations (video or other downloads). A massive level of tiny requests with 100s outstanding at once, will probably not make their proxy servers happy even if it doesn't use much bandwidth.
I'm fairly new to programming and am working on a web crawler for my dissertation. I was provided with a web crawler, but I found it to be too slow, since it is single-threaded: it took 30 minutes to crawl 1000 web pages. I tried to create multiple threads for execution, and with 20 threads running simultaneously the 1000 web pages took only 2 minutes. But now I'm encountering "Heap Out of Memory" errors. I'm sure what I did was wrong, which was to create the 20 threads in a for loop. What would be the right way to multi-thread the Java crawler without getting these errors? And speaking of which, is multi-threading the solution to my problem or not?
The simple answer (see above) is to increase the JVM memory size. This will help, but it is likely that the real problem is that your web-crawling algorithm is creating an in-memory data structure that grows in proportion to the number of pages you visit. If that is the case, the solution may be to move the data in that data structure to disk, e.g. a database.
The most appropriate solution to your problem depends on how your web crawler works, what it is collecting, and how many pages you need to crawl.
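As an illustration of moving that data structure to disk, here is a minimal sketch that keeps the set of visited URLs in an embedded database instead of an in-memory collection; it assumes an embedded JDBC database such as H2 on the classpath, and the table name and connection URL are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative disk-backed "visited" set: memory use stays flat as the crawl grows.
class VisitedStore implements AutoCloseable {
    private final Connection conn;

    VisitedStore() throws Exception {
        conn = DriverManager.getConnection("jdbc:h2:./crawler-visited");  // placeholder H2 file URL
        try (PreparedStatement st = conn.prepareStatement(
                "CREATE TABLE IF NOT EXISTS visited (url VARCHAR(2048) PRIMARY KEY)")) {
            st.execute();
        }
    }

    // Returns true if the URL was new and has now been recorded.
    boolean markVisited(String url) throws Exception {
        try (PreparedStatement check = conn.prepareStatement("SELECT 1 FROM visited WHERE url = ?")) {
            check.setString(1, url);
            try (ResultSet rs = check.executeQuery()) {
                if (rs.next()) {
                    return false;    // already crawled
                }
            }
        }
        try (PreparedStatement insert = conn.prepareStatement("INSERT INTO visited (url) VALUES (?)")) {
            insert.setString(1, url);
            insert.executeUpdate();
            return true;
        }
    }

    @Override
    public void close() throws Exception {
        conn.close();
    }
}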
My first suggestion is that you increase the heap size for the JVM:
http://www.informix-zone.com/node/46
Regarding the speed of your program:
If your web crawler obeys the robots.txt file on servers (which it should, to avoid being banned by the site admins), then there may be little that can be done.
You should profile your program, but I expect most of the time is spent downloading HTML pages, and site admins will usually not be happy if you download so fast that you drain their bandwidth.
In summary, downloading a whole site without hurting that site will take a while.