Most efficient java way to test 300,000+ URLs [duplicate] - java

This question already has answers here:
Preferred Java way to ping an HTTP URL for availability
(6 answers)
Closed 9 years ago.
I'm trying to find the most efficient way to test 300,000+ URLs in a database to basically check if the URLs are still valid.
Having looked around the site I've found many excellent answers and am now using something along the lines of:
Read URL from file....
Test URL:
final URL url = new URL("http://" + address);
final HttpURLConnection urlConn = (HttpURLConnection) url.openConnection();
urlConn.setConnectTimeout(1000 * 10);
urlConn.connect();
urlConn.getResponseCode(); // Do something with the code
urlConn.disconnect();
Write details back to file....
So a couple of questions:
1) Is there a more efficient way to test URLs and get response codes?
2) Initially I am able to test about 50 URLs per minute, but after 5 or so minutes things really slow down - I imagine there are some resources I'm not releasing, but I'm not sure what.
3) Certain URLs (e.g. www.bhs.org.au) will cause the above to hang for minutes (not good when I have so many URLs to test) even with the connect timeout set. Is there any way I can tighten this up?
Thanks in advance for any help; it's been quite a few years since I've written any code and I'm starting again from scratch :-)

By far the fastest way to do this would be to use java.nio to open a regular TCP connection to your target host on port 80. Then, simply send it a minimal HTTP request and process the result yourself.
The main advantage of this is that you can have a pool of 10 or 100 or even 1000 connections open and loading at the same time rather than having to do them one after the other. With this, for example, it won't matter much if one server (www.bhs.org.au) takes several minutes to respond. It'll simply hog one of your many connections in the pool, but others will keep running.
You could achieve much the same thing, with a little more overhead but far less complex code, by using a thread pool to run many HttpURLConnections (the way you are doing it now) in parallel across multiple threads, as in the sketch below.
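A minimal sketch of that thread-pool idea, assuming the URLs have already been read into a list (the pool size of 100 and the 10-second timeouts are arbitrary and would need tuning; results are printed here rather than written back to your file):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UrlChecker {
    public static void main(String[] args) throws InterruptedException {
        List<String> addresses = List.of("www.example.com", "www.bhs.org.au"); // placeholder input
        ExecutorService pool = Executors.newFixedThreadPool(100); // 100 checks in flight at once

        for (String address : addresses) {
            pool.submit(() -> {
                int code = -1;
                try {
                    HttpURLConnection conn = (HttpURLConnection) new URL("http://" + address).openConnection();
                    conn.setConnectTimeout(10_000); // time allowed to establish the connection
                    conn.setReadTimeout(10_000);    // time allowed for the server to start responding
                    code = conn.getResponseCode();
                    conn.disconnect();
                } catch (Exception e) {
                    // unreachable, timed out, malformed, etc. - record it as dead
                }
                System.out.println(address + " -> " + code); // write back to your file here instead
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the remaining checks to finish
    }
}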

This may or may not help, but you might want to change your request method to HEAD instead of using the default, which is GET:
urlConn.setRequestMethod("HEAD");
This tells the server that you do not need the response body back, only the headers and the response code.
The article What Is a HTTP HEAD Request Good for describes some uses for HEAD, including link verification:
[Head] asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.... This can be used for example for creating a faster link verification service.
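Combined with the connection code from the question, a HEAD-based check might look something like this (a sketch only; the timeout values are arbitrary):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class HeadCheck {
    // Returns the HTTP status code for a HEAD request, or -1 if the host cannot be reached.
    static int statusOf(String address) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL("http://" + address).openConnection();
            conn.setRequestMethod("HEAD"); // headers and status only, no body transferred
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            int code = conn.getResponseCode();
            conn.disconnect();
            return code;
        } catch (IOException e) {
            return -1;
        }
    }
}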


How to handle large http JSON response body [duplicate]

This question already has answers here:
Best way to process a huge HTTP JSON response
(3 answers)
Closed 1 year ago.
There is a REST API that returns large JSON data.
Example result:
{
  "arrayA": [
    {
      "data1": "data",
      "data2": "data"
    }, ...
  ],
  "arrayB": [
    {
      "data1": "data"
    }, ...
  ]
}
"arrayA" possible record just 0 to 100 records but "arrayB" can be possible 1 million to 10 million record
it make my java application out of memory.
My question is how to handle this case.
There are different concerns here and IMO the question is too broad because the best solution may depend on the actual use case.
You say you have a REST API and you would like to “protect” the server from an OutOfMemoryError; I get that.
However, assuming you find a way to fix the OOM error on the server, what kind of client will want to view tens of millions of objects at once? If it’s a browser, is that really required? How long will the JSON processing take on the client side? Won’t the client side of the application become so slow that the clients start to complain? I’m sure you’ve got the point.
So the first option is to “re-think” why you need such a big response at all. In that case, the best solution is probably to refactor and change the logic of the client-server communication.
Now, another possible case is that you have an “integration” - some kind of server-to-server communication.
In this case there is no point in returning the whole JSON response at once, or even in streaming it. If you’re running in the cloud, for example, you might want to write this huge JSON string to a file and upload it to S3, and then provide a link to it (S3 can deal with files of this size). Of course there are other alternatives in non-AWS environments.
As a “stripped-down” idea, you might take the request, create a temp file on the file system, write the data to it in chunks, and then return a “FileResource” to the client, as sketched below. Working chunk-by-chunk ensures that memory consumption stays low on your Java application’s side. Basically it would be equivalent to downloading a file that gets generated dynamically. When you close the stream you might want to remove the file.
This would work best if you have some kind of “get heap dump” or data-dump functionality in general.
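A rough sketch of that temp-file idea in plain Java (the chunk source and the way the client output stream is obtained are assumptions; a real implementation would plug in whatever produces the records and the framework's own response stream):

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class LargeResponseWriter {
    // Writes the huge payload to a temp file chunk by chunk, then streams the file to
    // the client and deletes it. Memory use stays bounded by the size of one chunk.
    static void writeLargeResponse(Iterable<String> chunks, OutputStream clientOut) throws IOException {
        Path tmp = Files.createTempFile("large-response-", ".json");
        try (BufferedWriter writer = Files.newBufferedWriter(tmp)) {
            for (String chunk : chunks) { // e.g. one serialized record of arrayB per chunk
                writer.write(chunk);
            }
        }
        try {
            Files.copy(tmp, clientOut); // streams the file without loading it into memory
        } finally {
            Files.deleteIfExists(tmp); // clean up once the response has been sent
        }
    }
}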

limit size of requests with com.sun.net.httpserver.HttpExchange

Abusive users may attempt to send really large requests to my HTTP server, so a "maxRequestSize" config option would have been useful, but I have yet to find a way of dealing with this. I also thought there might be a timeout option somewhere, but couldn't find anything like that either.
httpExchange.getRequestBody() returns an InputStream, but based on my research there's no way to determine the length of an InputStream without first reading it.
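One way to approximate a maxRequestSize with this API - an assumption about how you might do it, not a built-in feature - is to reject requests whose Content-Length header is too large and, as a backstop, stop reading once a byte cap is exceeded:

import com.sun.net.httpserver.HttpExchange;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class RequestSizeLimiter {
    static final long MAX_REQUEST_BYTES = 1_000_000; // arbitrary cap, tune to your needs

    // Returns the body if it fits under the cap; otherwise replies 413 and returns null.
    static byte[] readBodyCapped(HttpExchange exchange) throws IOException {
        String declared = exchange.getRequestHeaders().getFirst("Content-Length");
        if (declared != null && Long.parseLong(declared) > MAX_REQUEST_BYTES) {
            exchange.sendResponseHeaders(413, -1); // 413 Payload Too Large, no response body
            exchange.close();
            return null;
        }
        InputStream in = exchange.getRequestBody();
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            if (total > MAX_REQUEST_BYTES) { // header was missing or lied about the size
                exchange.sendResponseHeaders(413, -1);
                exchange.close();
                return null;
            }
            body.write(chunk, 0, n);
        }
        return body.toByteArray();
    }
}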

Handling Huge number of POST parameters

I am passing close to 1000 parameters to my server, and I got the infamous "More than the maximum number of POST parameters detected" error. I increased the MAX_COUNT value in my local server and it was working fine.
But in my production environment the server admin is not ready to increase the count above 512; he is citing DoS attacks as the reason.
Is there a better way to pass these 1000+ parameters to the server? I tried compartmentalizing the page into sections and made save buttons available for each section. But if it weren't possible to compartmentalize, what would be the best practice for sending a huge number of parameters to a server?
You could get around this by posting an XML or JSON document containing all of the parameters. It will require quite a few changes, but will get you around the restriction without having to deal with the sysadmin.
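For example, the client could serialize everything into one JSON body and POST that single document instead of ~1000 form fields (the endpoint and field names below are made up for illustration):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class JsonPost {
    public static void main(String[] args) throws Exception {
        // One JSON document carries what would otherwise be ~1000 separate POST parameters.
        String json = "{\"field1\":\"value1\",\"field2\":\"value2\"}"; // built from the form values

        HttpURLConnection conn = (HttpURLConnection) new URL("http://example.com/save").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");

        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8)); // single body, not url-encoded fields
        }
        System.out.println("Response code: " + conn.getResponseCode());
    }
}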

Tools/libraries to resolve/expand thousands of URLs

In a crawler-like project we have a common, frequently performed task: resolving/expanding thousands of URLs. Say we have (a very simplified example):
http://bit.ly/4Agih5
A GET request to 'http://bit.ly/4Agih5' returns one of the 3xx codes, and we follow the redirect straight to:
http://stackoverflow.com
A GET request to 'http://stackoverflow.com' returns 200, so 'stackoverflow.com' is the result we need.
Any URLs (not only well-known shorteners like bit.ly) are allowed as input. Some of them redirect once, some don't redirect at all (in which case the result is the URL itself), and some redirect multiple times. Our task is to follow all redirects, imitating browser behavior as closely as possible. In general, if we have some URL A, the resolver should return a URL B that is the same as what would be reached if A were opened in a browser.
So far we have used Java, a pool of threads and plain URLConnection to solve this task. The advantages are obvious:
simplicity - just create a URLConnection, set it to follow redirects and that's it (almost);
good HTTP support - Java provides everything we need to imitate a browser as closely as possible: automatic redirect following and cookie support.
Unfortunately this approach also has drawbacks:
performance - threads are not free, and URLConnection starts downloading the document as soon as getInputStream() is called, even if we don't need the body;
memory footprint - not sure exactly, but it seems that URL and URLConnection are quite heavy objects, and again the GET result is buffered right after the getInputStream() call.
Are there other solutions (or improvements to this one) which may significantly increase speed and decrease memory consumption? Presumably, we need something like:
a high-performance, lightweight Java HTTP client based on java.nio;
a C HTTP client which uses poll() or select();
some ready-made library which resolves/expands URLs.
You can use Python, Gevent, and urlopen. Combine this gevent example with the redirect handling in this SO question.
I would not recommend Nutch, it is very complex to set up and has numerous dependencies (Hadoop, HDFS).
I'd use a Selenium script to read URLs off a queue and GET them, then wait about 5 seconds per browser to see if a redirect occurs; if so, put the new redirect URL back into the queue for the next instance to process. You can have as many instances running simultaneously as you want.
UPDATE:
If you only care about the Location header (which is what most non-JS, non-meta redirects use), simply check it - you never need to get the InputStream:
HttpURLConnection.setFollowRedirects(false); // stop Java from following redirects automatically
URL url = new URL("http://bit.ly/abc123");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
String newLocation = conn.getHeaderField("Location"); // sends the request and reads only the headers
If newLocation is populated, stick that URL back into the queue and have it followed in the next round.
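Putting that together, a small resolver loop might look roughly like this (a sketch only: the hop limit and timeouts are arbitrary, and it uses HEAD, which some servers handle differently from GET):

import java.net.HttpURLConnection;
import java.net.URL;

class RedirectResolver {
    // Follows Location headers manually, up to a fixed number of hops, and returns the final URL.
    static String resolve(String start) throws Exception {
        HttpURLConnection.setFollowRedirects(false);
        String current = start;
        for (int hops = 0; hops < 10; hops++) { // guard against redirect loops
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            conn.setRequestMethod("HEAD"); // headers are enough, the body is never downloaded
            int code = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (code / 100 == 3 && location != null) {
                current = new URL(new URL(current), location).toString(); // Location may be relative
            } else {
                return current; // no more redirects - this is the resolved URL
            }
        }
        return current;
    }
}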

How can I make my android app receive/send data faster?

I have an app that needs to transfer data back and forth with a server, but the speed is not satisfactory right now. The main part is that I'm receiving and parsing JSON data (about 200 characters long) over 3G from a server, and the fastest it will ever do the task is about 5 seconds, but sometimes it takes long enough to time out (upwards of 30 seconds). My server is a Rackspace cloud server.
I thought I was following best practices, but it can't be so with these kinds of speeds. I am using AsyncTask and the same global HttpClient variable for everything.
Can you help me find a better way?
I've thought about these options:
using TCP instead of HTTP
encoding the data to try to reduce the size (not sure how this would work)
I don't know a lot about TCP, but it seems like it would have less overhead. What would be the pros and cons of using TCP instead of HTTP? Is it practical for a cell phone to do?
Thanks
FYI - once I solve the problem I'll accept the answer that's the most helpful. So far I've received some really great answers.
EDIT: I made it so that I can see the progress as it downloads, and I've noticed that it stays at 0% for a long time and then quickly jumps to 100% - does anyone have any ideas in light of this new info? It may be relevant that I'm using a Samsung Epic with Froyo.
Try using GZIP to compress the data being sent. This is not a code-complete example, but it should get you on the right path.
Rejinderi is right; GSON rocks.
// Assumes your existing shared HttpClient instance is called httpClient.
HttpGet getRequest = new HttpGet(url);
getRequest.addHeader("Accept-Encoding", "gzip"); // tell the server we can accept a compressed body
HttpResponse response = httpClient.execute(getRequest);
InputStream instream = response.getEntity().getContent();
Header contentEncoding = response.getFirstHeader("Content-Encoding");
if (contentEncoding != null && contentEncoding.getValue().equalsIgnoreCase("gzip")) {
    instream = new GZIPInputStream(instream); // transparently decompress the body
}
HTTP is just a protocol layered on top of TCP, so if you really need raw performance, TCP is the one you should use. HTTP is easier to develop with, as there is more support, and it is easier to implement as a developer; it wraps a lot of things up so you don't have to implement them yourself. The overhead in your case shouldn't be that much.
As for the JSON data: check whether parsing it is what's taking a long time; the standard JSON library Java has is quite slow - take a look here:
http://www.cowtowncoder.com/blog/archives/2009/09/entry_326.html
Debug and see if that is the case. If it's the JSON parse speed, I suggest you use the Gson library. It's cleaner, easy to implement, and much, MUCH faster.
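For instance, a minimal Gson sketch (the class and field names here are made up; they would have to match the actual JSON keys):

import com.google.gson.Gson;

class GsonExample {
    // Hypothetical shape of the ~200-character JSON payload.
    static class ServerReply {
        String status;
        String message;
    }

    public static void main(String[] args) {
        String json = "{\"status\":\"ok\",\"message\":\"hello\"}";
        ServerReply reply = new Gson().fromJson(json, ServerReply.class); // one call to deserialize
        System.out.println(reply.status + ": " + reply.message);
    }
}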
Sounds like you need to profile the application to find out where your bottleneck is. You said you are sending data of about 200 characters. That is minuscule, and I don't see how compression or anything strictly data-related is going to make much of an impact on such a small data set.
I think it is more likely that you have some communication issues, perhaps attempting to establish a new connection for every transfer or something along those lines that is giving you all the overhead.
Profiling is the key to resolving your issues, anything else is a shot in the dark.
