Java HttpClient seems to be caching content

I'm building a simple web scraper and I need to fetch the same page a few hundred times. The page contains a dynamic attribute that should change on every request. I've built a multithreaded HttpClient-based class to process the requests, and I'm using an ExecutorService to create a thread pool and run the threads. The problem is that the dynamic attribute sometimes doesn't change on each request, and I end up getting the same value in 3 or 4 subsequent threads. I've read a lot about HttpClient and I really can't find where this problem comes from. Could it be something about caching, or something like that?
Update: here is the code executed in each thread:
HttpContext localContext = new BasicHttpContext();
HttpParams params = new BasicHttpParams();
HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1);
HttpProtocolParams.setContentCharset(params, HTTP.DEFAULT_CONTENT_CHARSET);
HttpProtocolParams.setUseExpectContinue(params, true);

ClientConnectionManager connman = new ThreadSafeClientConnManager();
DefaultHttpClient httpclient = new DefaultHttpClient(connman, params);

HttpHost proxy = new HttpHost(inc_proxy, Integer.valueOf(inc_port));
httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY, proxy);

HttpGet httpGet = new HttpGet(url);
httpGet.setHeader("User-Agent",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");

String iden = null;
int timeoutConnection = 10000;
HttpConnectionParams.setConnectionTimeout(httpGet.getParams(), timeoutConnection);

try {
    HttpResponse response = httpclient.execute(httpGet, localContext);
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        InputStream instream = entity.getContent();
        String result = convertStreamToString(instream);
        // System.out.printf("Result:\n%s\n", result);
        instream.close();
        iden = StringUtils.substringBetween(result,
                "<input name=\"iden\" value=\"",
                "\" type=\"hidden\"/>");
        System.out.printf("IDEN:%s\n", iden);
        EntityUtils.consume(entity);
    }
} catch (ClientProtocolException e) {
    System.out.println("ClientProtocolException");
} catch (IOException e) {
    System.out.println("IOException");
}

HttpClient does not cache by default (when you use the plain DefaultHttpClient class). It does so only if you use CachingHttpClient, which is an HttpClient interface decorator that enables caching:
HttpClient client = new CachingHttpClient(new DefaultHttpClient(), cacheConfiguration);
It then analyzes the If-Modified-Since and If-None-Match headers to decide whether to perform the request against the remote server or to return the result from the cache.
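For illustration, a minimal sketch of wiring up such a decorator, using the 4.1-era httpclient-cache module (the cache limits shown are arbitrary placeholders):
// CacheConfig and CachingHttpClient live in org.apache.http.impl.client.cache
CacheConfig cacheConfig = new CacheConfig();
cacheConfig.setMaxCacheEntries(1000);     // at most 1000 cached responses
cacheConfig.setMaxObjectSizeBytes(8192);  // bodies larger than 8 KiB are not cached
HttpClient cachingClient = new CachingHttpClient(new DefaultHttpClient(), cacheConfig);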
I suspect that your issue is caused by a proxy server standing between your application and the remote server.
You can test this easily with curl; execute a number of requests omitting the proxy:
#!/bin/bash
for i in {1..50}
do
    echo "*** Performing request number $i"
    curl -D - http://yourserveraddress.com -o $i -s
done
Then run diff between the downloaded files; all of them should show the differences you mentioned. Next, add the -x/--proxy <host[:port]> option to curl, execute the script again, and compare the files once more. If some responses are the same as others, then you can be sure this is a proxy server issue.

Generally speaking, in order to test whether or not HTTP requests are being made over the wire, you can use a "sniffing" tool that analyzes network traffic, for example:
Fiddler ( http://fiddler2.com/fiddler2/ ) - I would start with this
Wireshark ( http://www.wireshark.org/ ) - more low level
I highly doubt HttpClient is performing caching of any sort (this would imply it needs to store pages in memory or on disk - not one of its capabilities).
While this is not an answer, it's a point to ponder: is it possible that the server (or some proxy in between) is returning cached content? If you are performing many requests (simultaneously or near-simultaneously) for the same content, the server may return cached content because it has decided that the information has not "expired" yet. In fact, the HTTP protocol provides caching directives for exactly this functionality. Here is a site that provides a high-level overview of the different HTTP caching mechanisms:
http://betterexplained.com/articles/how-to-optimize-your-site-with-http-caching/
I hope this gives you a starting point. If you have already considered these avenues then that's great.

You could try appending a unique dummy parameter to the URL on every request to try to defeat any URL-based caching (in the server, or somewhere along the way). It won't work if caching isn't the problem; if the server is smart enough to reject requests with unknown parameters; if the server caches based only on the parameters it cares about; or if your chosen parameter name collides with a parameter the site actually uses.
If this is the URL you're using
http://www.example.org/index.html
try using
http://www.example.org/index.html?dummy=1
Set dummy to a different value for each request.
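In HttpClient terms, a minimal sketch of that trick (the parameter name dummy is arbitrary, and System.nanoTime() is just one convenient source of per-request unique values):
String url = "http://www.example.org/index.html";
// Append a unique dummy parameter so every request has a distinct URL
String separator = url.contains("?") ? "&" : "?";
HttpGet httpGet = new HttpGet(url + separator + "dummy=" + System.nanoTime());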


How to cache the httpclient object in Java?

I'm using Apache HttpClient 4.5.x in my client webapp, which connects to (and logs in to) another (say, main) server webapp.
The relationship between these two webapps is many-to-many: for a user's request in the client webapp, it may have to log in as another user and make a REST call in the server webapp. So some separation of cookie stores is needed, and there's no way (is there?) to get/set the cookie store after creating an HttpClient instance. So each request thread received in the client webapp does something like this (which I need to optimize):
HttpClient client = HttpClientBuilder.create().setDefaultCookieStore(new BasicCookieStore()).build();
// Now POST to the login endpoint, get back the JSESSIONID cookie, make one REST
// call, and then the client object goes out of scope when the request ends.
I was hoping to ask about the best practice for caching the HttpClient instance, as it's heavy and is supposed to be reused, at least across multiple requests, if not for the whole client webapp as a static singleton.
Specifically, I was hoping for advice on which of these (if any) approaches would constitute a best practice:
Use a static ConcurrentHashMap to cache the HttpClient and its associated BasicCookieStore for each "user" in the client webapp, and log in only when the cached cookie is near its expiry time. I'm not sure about the memory usage, and unused or rarely-used HttpClients would stay in memory with no eviction.
Cache only the cookie (somehow), but create a new HttpClient object whenever the need arises to use that cookie for a REST call. This avoids the prior login call until the cookie expires, but there is no reuse of the HttpClient.
PoolingHttpClientConnectionManager - but I can't easily find examples, and it might require devising an eviction strategy, a maximum number of threads, etc. (so it can be complex).
Is there any better way of doing this? Thanks.
References:
http://hc.apache.org/httpclient-3.x/performance.html
Generally it is recommended to have a single instance of HttpClient per communication component or even per application
Similar advice appears in the HttpClient 4.x performance guide.
Using a ConcurrentHashMap would be the simplest way to achieve what you want to do.
Additionally, if you are using Spring, you might want to create a bean for holding the HTTP client.
Why would you want to do all this? One can assign a different CookieStore on a per request basis by using a local HttpContext.
If needed one can maintain a map of CookieStore instances per unique user.
CloseableHttpClient httpclient = HttpClients.createDefault();
CookieStore cookieStore = new BasicCookieStore();
// Create local HTTP context
HttpClientContext localContext = HttpClientContext.create();
// Bind custom cookie store to the local context
localContext.setCookieStore(cookieStore);

HttpGet httpget = new HttpGet("http://httpbin.org/cookies");
System.out.println("Executing request " + httpget.getRequestLine());
// Pass local context as a parameter
CloseableHttpResponse response = httpclient.execute(httpget, localContext);
try {
    System.out.println("----------------------------------------");
    System.out.println(response.getStatusLine());
    List<Cookie> cookies = cookieStore.getCookies();
    for (int i = 0; i < cookies.size(); i++) {
        System.out.println("Local cookie: " + cookies.get(i));
    }
    EntityUtils.consume(response.getEntity());
} finally {
    response.close();
}
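A minimal sketch of that per-user map, assuming a String user id as the key and Java 8 (STORES and storeFor are illustrative names; one shared client, one CookieStore per user):
// One CookieStore per unique user, shared across request threads
static final ConcurrentHashMap<String, CookieStore> STORES = new ConcurrentHashMap<>();

static CookieStore storeFor(String userId) {
    return STORES.computeIfAbsent(userId, id -> new BasicCookieStore());
}

// Per request: bind that user's store to a local context
HttpClientContext context = HttpClientContext.create();
context.setCookieStore(storeFor(userId));
CloseableHttpResponse response = httpclient.execute(httpget, context);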

How to handle Cookies with Apache HttpClient 4.3

I need to implement a series of HTTP requests in Java and decided to use Apache's HttpClient in version 4.3 (the most current one).
The problem is that all these requests use a cookie for session management, and I seem to be unable to find a way of accessing that cookie and passing it from request to request. My curl commands look something like:
# Login
curl -c cookies -d "UserName=username&Password=password" "https://example.com/Login"
# Upload a file
curl -b cookies -F fileUpload=@IMG_0013.JPG "https://example.com/File"
# Get results of server processing file
curl -b cookies "https://example.com/File/1234/Content"
They work perfectly. With HttpClient, however, I can't seem to get it to work. What I tried was:
URI serverAddress = new URI("https://example.com/");
URI loginUri = UriBuilder.fromUri(serverAddress).segment("Login")
        .queryParam("UserName", "username")
        .queryParam("Password", "password").build();

RequestConfig globalConfig = RequestConfig.custom()
        .setCookieSpec(CookieSpecs.BEST_MATCH).build();
CookieStore cookieStore = new BasicCookieStore();
HttpClientContext context = HttpClientContext.create();
context.setCookieStore(cookieStore);

CloseableHttpClient httpClient = HttpClients.custom()
        .setDefaultRequestConfig(globalConfig)
        .setDefaultCookieStore(cookieStore).build();

HttpGet httpGet = new HttpGet(loginUri);
CloseableHttpResponse loginResponse = httpClient.execute(httpGet, context);
System.out.println(context.getCookieStore().getCookies());
The output of the last line is always an empty list. I think it should contain my Cookie, am I right?
Can someone give me a small example on how to handle the cookie using Apache HttpClient 4.3?
Thanks
Your code looks OK to me (other than not releasing resources, but I presume exception handling was omitted for brevity). The reason for the cookie store being empty may be a violation of the actual cookie policy (BEST_MATCH in your case) by the target server, so cookies sent by the server get rejected as invalid. You can find out if that is the case (and other useful contextual details) by turning on context/wire logging as described here.
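For reference, a minimal sketch of turning on wire logging via commons-logging's SimpleLog (the property names come from the HttpClient logging documentation; set them before executing the first request):
// Route commons-logging to SimpleLog and raise HttpClient's log levels
System.setProperty("org.apache.commons.logging.Log",
        "org.apache.commons.logging.impl.SimpleLog");
System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http", "DEBUG");
System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.http.wire", "DEBUG");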

How to make all network traffic go via a proxy?

I have an app that makes http requests to a remote server. I do this with the following code:
HttpClient httpClient = new DefaultHttpClient();
HttpPost httpPost = new HttpPost("myURL");
try {
    ArrayList<BasicNameValuePair> postVariables = new ArrayList<BasicNameValuePair>(2);
    postVariables.add(new BasicNameValuePair("key", "value"));
    httpPost.setEntity(new UrlEncodedFormEntity(postVariables));

    HttpResponse response = httpClient.execute(httpPost);
    String responseString = EntityUtils.toString(response.getEntity());
    if (responseString.contains("\"success\":true")) {
        // this means the request succeeded
    } else {
        // failed
    }
} catch (ClientProtocolException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
This works fine, but one of our customers has set up an APN that requires requests to go through a certain proxy server. If I add the following to the request, it works; the request gets rerouted via the proxy to the server:
HttpHost httpHost = new HttpHost("proxyURL",8080);
httpClient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY, httpHost);
So far so good. However, I use a library that makes some HTTP requests as well. The library's code is not accessible to me, so I can't add those two lines to it. I contacted the creators of that library, and they told me it should be possible to set up the Android environment so that all requests automatically go through the proxy. Is there something like that? I didn't find anything on Google.
I'm basically looking for a way to set the above two lines as a default for all HTTP requests. Please note that the APN does not set the proxy as a default for the entire phone, so apps have to do this manually (and yes, that means the majority of apps don't work on that customer's phone).
It's been a year or two since I've needed to use it, but if I remember correctly, you can use System.setProperty(String, String) to set an environment-wide setting for your application that routes all HTTP traffic through a proxy. The properties you need to set are "http.proxyHost" and "http.proxyPort"; then use your HttpClient normally without specifying a proxy, because the VM will handle routing the requests.
Docs for more information about what I'm talking about can be found here: ProxySelector (just so you know which keys to use), and here for documentation about the actual System.setProperty(String, String) function.
If that doesn't work for you, let me know and I'll try to dig out my old code that set a system-level proxy. BTW, it's really only "system-level" since each app runs in its own Dalvik, so you won't impact other apps' network communications.
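A sketch of that approach with placeholder host/port values. One caveat: java.net.HttpURLConnection honors these properties automatically, but Apache HttpClient 4.x only does so when wired to the system ProxySelector, e.g. via ProxySelectorRoutePlanner:
// JVM-wide proxy settings (placeholders; read by the default ProxySelector)
System.setProperty("http.proxyHost", "proxy.example.com");
System.setProperty("http.proxyPort", "8080");
System.setProperty("https.proxyHost", "proxy.example.com");
System.setProperty("https.proxyPort", "8080");

// Make HttpClient consult the system ProxySelector as well
DefaultHttpClient httpClient = new DefaultHttpClient();
httpClient.setRoutePlanner(new ProxySelectorRoutePlanner(
        httpClient.getConnectionManager().getSchemeRegistry(),
        ProxySelector.getDefault()));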

How can I forcefully cache all my HTTP responses?

I'm using the DefaultHttpClient to make some HTTP GET requests. I'd like to forcefully cache all the HTTP responses for a week. After going through the docs and some SO answers, I've done this:
I installed an HttpResponseCache via the onCreate method of my main activity.
try {
    File httpCacheDir = new File(getApplicationContext().getCacheDir(), "http");
    long httpCacheSize = 10 * 1024 * 1024; // 10 MiB
    HttpResponseCache.install(httpCacheDir, httpCacheSize);
} catch (IOException e) {
    Log.i("dd", "HTTP response cache installation failed:" + e);
}
I added a custom HttpResponseInterceptor for my HTTP client, but I still don't get a cache hit. Here's my response interceptor that decompresses GZIPped content, strips caching headers and adds a custom one:
class Decompressor implements HttpResponseInterceptor {
    public void process(HttpResponse hreResponse, HttpContext hctContext) throws HttpException, IOException {
        hreResponse.removeHeaders("Expires");
        hreResponse.removeHeaders("Pragma");
        hreResponse.removeHeaders("Cache-Control");
        hreResponse.addHeader("Cache-Control", "max-age=604800");
        HttpEntity entity = hreResponse.getEntity();
        if (entity != null) {
            Header ceheader = entity.getContentEncoding();
            if (ceheader != null) {
                HeaderElement[] codecs = ceheader.getElements();
                for (int i = 0; i < codecs.length; i++) {
                    if (codecs[i].getName().equalsIgnoreCase("gzip")) {
                        hreResponse.setEntity(new HttpEntityWrapper(entity) {
                            @Override
                            public InputStream getContent() throws IOException, IllegalStateException {
                                return new GZIPInputStream(wrappedEntity.getContent());
                            }
                            @Override
                            public long getContentLength() {
                                return -1;
                            }
                        });
                        return;
                    }
                }
            }
        }
    }
}
Here's how I make my request:
String strResponse = null;
HttpGet htpGet = new HttpGet(strUrl);
htpGet.addHeader("Accept-Encoding", "gzip");
htpGet.setHeader("User-Agent",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1");

DefaultHttpClient dhcClient = new DefaultHttpClient();
dhcClient.addResponseInterceptor(new Decompressor(), 0);
HttpResponse resResponse = dhcClient.execute(htpGet);
Log.d("helpers.network", String.format("Cache hit count: %d",
        HttpResponseCache.getInstalled().getHitCount()));
strResponse = EntityUtils.toString(resResponse.getEntity());
return strResponse;
I can't seem to pinpoint what I'm doing wrong. Would any of you know?
Not sure if this answers your question, but instead of relying on an HTTP server interpreting your cache control headers, have you thought about simply adding a client-side cache using Android's own cache directories?
What we did in ignition is simply write server responses to disk as byte streams, giving us full control over caching (and cache expiry).
There's a sample app here, but it would require you to use the library's HTTP API (which, however, is merely a thin wrapper around HttpClient.) Or simply look at how the cache works and go from there.
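As a rough illustration of that idea, here is a hypothetical fetchWithCache helper that writes response bodies into the app's cache directory and reuses them for a week (the hash-based file name is simplistic and collision-prone; a real implementation should use a stronger key):
static byte[] fetchWithCache(HttpClient client, Context context, String url) throws IOException {
    File cacheFile = new File(context.getCacheDir(), Integer.toHexString(url.hashCode()));
    long maxAgeMs = 7L * 24 * 60 * 60 * 1000; // one week
    if (cacheFile.exists() && System.currentTimeMillis() - cacheFile.lastModified() < maxAgeMs) {
        byte[] cached = new byte[(int) cacheFile.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(cacheFile));
        try {
            in.readFully(cached);
        } finally {
            in.close();
        }
        return cached; // cache hit: no network traffic at all
    }
    byte[] body = EntityUtils.toByteArray(client.execute(new HttpGet(url)).getEntity());
    FileOutputStream out = new FileOutputStream(cacheFile);
    try {
        out.write(body);
    } finally {
        out.close();
    }
    return body;
}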
I failed miserably with this. The Android documentation says this specifically about HttpResponseCache: "Caches HTTP and HTTPS responses to the filesystem so they may be reused, saving time and bandwidth. This class supports HttpURLConnection and HttpsURLConnection; there is no platform-provided cache for DefaultHttpClient or AndroidHttpClient."
So that was out.
Now, Apache's HTTP client has a CachingHttpClient, and this newer version of HttpClient has been backported to Android through this project. Of course, I could use this.
I didn't want to use the hackish version of the Apache HttpClient libraries, so one idea was to cannibalise the caching-related bits from HttpClient and roll my own, but it was too much work.
I even considered moving to the recommended HttpURLConnection class, but I ran into other issues. There doesn't seem to be a good cookie-persistence implementation for that class.
Anyway, I skipped all of that and thought: if I'm reducing the loading time by caching, why not go a step further? Since I was using jSoup to scrape records from the page and build an ArrayList of a custom structure, I might as well serialize the whole ArrayList by making my structure implement Serializable. Now I don't have to wait for the page request, nor do I have to put up with jSoup's parsing slowness. Win.
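A sketch of that shortcut, where Record stands in for the custom structure (it must implement java.io.Serializable, as must everything it references):
// Persist the parsed list so neither the page fetch nor the jSoup parse is repeated
void saveRecords(Context context, ArrayList<Record> records) throws IOException {
    ObjectOutputStream out = new ObjectOutputStream(
            new FileOutputStream(new File(context.getCacheDir(), "records.ser")));
    try {
        out.writeObject(records);
    } finally {
        out.close();
    }
}

@SuppressWarnings("unchecked")
ArrayList<Record> loadRecords(Context context) throws IOException, ClassNotFoundException {
    ObjectInputStream in = new ObjectInputStream(
            new FileInputStream(new File(context.getCacheDir(), "records.ser")));
    try {
        return (ArrayList<Record>) in.readObject();
    } finally {
        in.close();
    }
}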

Class to read HTTP pages with a reliable timeout that returns `java.io.InputStream`

Is there any class for reading HTTP pages that returns a java.io.InputStream and whose timeout is reliable?
I tried java.net.URLConnection and it doesn't have a reliable timeout (it takes more time than the timeout I set before it gives up). My code is here:
URLConnection con = url.openConnection();
con.setConnectTimeout(2000);
con.setReadTimeout(2000);
InputStream in = con.getInputStream();
I expect that the reason the timeout is not working for you is that you are setting it after the connection has been established, or that you are using the wrong setter. It is also possible that you are using a "non-standard" implementation of URLConnection ...
"Some non-standard implementation of this method ignores the specified timeout. To see the read timeout set, please call getReadTimeout()." (or getConnectTimeout())
If you posted the relevant part of your actual code we could give you a better answer ...
Alternatively, use the Apache HttpClient library.
You can use Apache HttpClient to read HTTP pages; it also has an HTTP parser. Check this for further reference about HttpClient. You can get an InputStream object using their API like this:
HttpClient httpclient = new DefaultHttpClient();
// Prepare a request object
HttpGet httpget = new HttpGet("http://www.apache.org/");
// Execute the request
HttpResponse response = httpclient.execute(httpget);
// Examine the response status
System.out.println(response.getStatusLine());
// Get hold of the response entity
HttpEntity entity = response.getEntity();
// If the response does not enclose an entity, there is no need
// to worry about connection release
if (entity != null) {
    InputStream instream = entity.getContent();
    // ... read from instream, then close it to release the connection
}
As for the timeout part, it largely depends on the network, and you can't do much about it from your Java code.
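That said, HttpClient does let you bound both the connection attempt and individual socket reads; a minimal sketch in the same 4.x parameter style as the question (values in milliseconds):
HttpParams params = new BasicHttpParams();
HttpConnectionParams.setConnectionTimeout(params, 2000); // max time to establish the connection
HttpConnectionParams.setSoTimeout(params, 2000);         // max time a blocked read may stall
HttpClient httpclient = new DefaultHttpClient(params);
Note that the socket timeout bounds each blocking read, not the total download time, so a slow-but-steady server can still take longer overall.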
