Slow download with HttpURLConnection - java

I'm trying to write a method that downloads a webpage.
First, I create an HttpURLConnection.
Second, I call the connect() method.
Third, I read the data through a BufferedReader.
The problem is that with some pages I get reasonable reading times, but with other pages it's very slow (it can take about 10 minutes!). The slow pages are always the same, and they come from the same website. Opening those pages in a browser takes just a few seconds instead of 10 minutes. Here is the code:
static private String getWebPage(PageNode pagenode)
{
    String result;
    String inputLine;
    URI url;
    int cicliLettura=0;
    long startTime=0, endTime, openConnTime=0, connTime=0, readTime=0;
    try
    {
        if(Core.logGetWebPage())
            startTime=System.nanoTime();
        result="";
        url=pagenode.getUri();
        if(Core.logGetWebPage())
            openConnTime=System.nanoTime();
        HttpURLConnection yc = (HttpURLConnection) url.toURL().openConnection();
        if(url.toURL().getProtocol().equalsIgnoreCase("https"))
            yc=(HttpsURLConnection)yc;
        yc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
        yc.connect();
        if(Core.logGetWebPage())
            connTime=System.nanoTime();
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        while ((inputLine = in.readLine()) != null)
        {
            result=result+inputLine+"\n";
            cicliLettura++;
        }
        if(Core.logGetWebPage())
            readTime=System.nanoTime();
        in.close();
        yc.disconnect();
        if(Core.logGetWebPage())
        {
            endTime=System.nanoTime();
            System.out.println(/*result+*/"getWebPage executed in "+(endTime-startTime)/1000000+" ms. Size: "+result.length()+" Response Code="+yc.getResponseCode()+" Protocol="+url.toURL().getProtocol()+" openConnTime: "+(openConnTime-startTime)/1000000+" connTime:"+(connTime-openConnTime)/1000000+" readTime:"+(readTime-connTime)/1000000+" cicliLettura="+cicliLettura);
        }
        return result;
    }catch(IOException e){
        System.out.println("Exception: "+e.toString());
        e.printStackTrace();
        return null;
    }
}
Here are two log samples.
One of the "normal" pages:
getWebPage executed Size: 48261 Response Code=200 Protocol=http openConnTime: 0 connTime:1 readTime:569 cicliLettura=359
One of the "slow" pages, http://ricette.giallozafferano.it/Pan-di-spagna-al-cacao.html/allcomments, looks like this:
getWebPage executed Size: 1748261 Response Code=200 Protocol=http openConnTime: 0 connTime:1 readTime:596834 cicliLettura=35685

What you're likely seeing here is a result of the way you are collating result. Remember that Strings in Java are immutable - therefore when string concatenation occurs, a new String has to be instantiated, which can often involve copying all the data contained in that String. You have the following code executing for every line:
result=result+inputLine+"\n";
Under the covers, this line involves:
A new StringBuffer is created with the entire content of result so far
inputLine is appended to the StringBuffer
The StringBuffer is converted to a String
A new StringBuffer is created for that String
A newline character is appended to that StringBuffer
The StringBuffer is converted to a String
That String is stored as result.
This operation becomes more and more time-consuming as result gets bigger and bigger - and your logs appear to show (albeit from a sample of 2!) that the read time increases drastically with page size.
Instead, use StringBuffer directly.
StringBuffer buffer = new StringBuffer();
while ((inputLine = in.readLine()) != null)
{
    buffer.append(inputLine).append('\n');
    cicliLettura++;
}
String result = buffer.toString();
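As a side note, since this method is single-threaded, StringBuilder (the unsynchronized counterpart of StringBuffer, with the same API) would work equally well here and avoids a little locking overhead.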

Related

implementing JTimer in existing code

How can I implement a 60 second timeout in this code?
This code is opening a URL, downloading plain text, and sending output as a string variable.
It works, but sometimes it hangs, and I have to start all over again.
I was hoping for something that would time out after 60 seconds and return whatever data has been retrieved so far.
Please don't suggest using external libraries like Apache etc.; it would be better if I could edit this code itself.
public static String readURL( URL url )
{
    try
    {
        // open the url stream, wrap it in a few "readers"
        BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
        String s="";
        String line="";
        while ((line = reader.readLine()) != null)
        {
            s=s+"\r\n"+line;
        }
        reader.close();
        return s;
    }
    catch(Exception e)
    {
        StringWriter errors = new StringWriter();
        e.printStackTrace(new PrintWriter(errors));
        return errors.toString();
    }
}//end method
Thread.sleep(60000);
The code above will just make the thread sleep for 60 seconds and do nothing during that time.
If you want to change the timeout of your connection, look at Is it possible to read from a InputStream with a timeout?
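If you stick with the standard library, URLConnection already exposes connect and read timeouts. A minimal sketch of readURL using them (the 60-second values are illustrative, and on timeout this version still returns the stack trace rather than partial data):
public static String readURL( URL url )
{
    try
    {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(60000); // fail if the connection cannot be established within 60 s
        conn.setReadTimeout(60000);    // a read that stalls for 60 s throws java.net.SocketTimeoutException
        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        StringBuilder s = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null)
        {
            s.append("\r\n").append(line);
        }
        reader.close();
        return s.toString();
    }
    catch(Exception e)
    {
        StringWriter errors = new StringWriter();
        e.printStackTrace(new PrintWriter(errors));
        return errors.toString();
    }
}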

Java reading HTTP response (using StringBuilder) much slower than in python

I'm calling a webservice that returns a large response, about 59 megabytes of data.
This is how I read it from Java:
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(),"UTF-8"));
result = result.concat(this.getResponseText(in));

private String getResponseText(BufferedReader in) throws IOException {
    StringBuilder response = new StringBuilder(Integer.MAX_VALUE/2);
    System.out.println("Started reading");
    String line = "";
    while((line = in.readLine()) != null) {
        response.append(line);
        response.append("\n");
    }
    in.close();
    System.out.println("Done");
    String r = response.toString();
    System.out.println("Built r");
    return r;
}
In the Windows resource monitor I can see a throughput of about 100,000 bytes per second during the read.
However, when I read exactly the same data from the same webservice in Python, i.e.:
response = requests.request("POST", url, headers=headers, verify=False, json=json)
I see throughput of up to 700,000 bytes per second (about 7 times faster), and the code also finishes about 7 times faster.
The question is: am I missing something that could make the reads in Java faster? Is this really the fastest way to read an HTTP response in Java?
Update - even when I don't store the lines and just iterate through the response, I still get at most 100,000 bytes per second, so I believe the bottleneck is somewhere in the way Java reads:
private List<String> getResponseTextAsList(BufferedReader in) throws IOException {
    System.out.println("Started reading");
    List<String> l = new ArrayList<String>();
    int i = 0;
    long q = 0;
    String line = "";
    while((line = in.readLine()) != null) {
        //l.add(line);
        i++;
        q = q+line.length();
    }
    in.close();
    System.out.println("Done" + i + " " + q);
    return l;
}
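One way to narrow this down is to take readLine() and the character decoding out of the picture entirely and measure the raw byte throughput of the stream. A diagnostic sketch, assuming the same conn object (the buffer size is illustrative):
// Not the original code - reads raw bytes only, to see whether decoding or the network is the bottleneck
InputStream raw = conn.getInputStream();
byte[] buf = new byte[65536];
long total = 0;
long start = System.nanoTime();
int n;
while ((n = raw.read(buf)) != -1) {
    total += n;
}
raw.close();
long elapsedMs = (System.nanoTime() - start) / 1000000;
System.out.println(total + " bytes in " + elapsedMs + " ms");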

Java - How to check if BufferedReader contains line break before running readLine command

I am trying to parse HTML from a website to get very specific data. The following method reads the source and outputs it as a string to be processed by other methods.
StringBuilder source = new StringBuilder();
URL url = new URL(urlIn);
URLConnection spoof;
spoof = url.openConnection();
spoof.setRequestProperty( "User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)" );
BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()));
String strLine = "";
while ((strLine = in.readLine()) != null){
    source.append(strLine);
}
return source.toString();
The problem that I'm having is that since I call this method multiple times with a different urlIn argument each time, sometimes the method gets stuck at the readLine command. I read that this is because readLine looks for a line break and if the BufferedReader object does not contain one for whatever reason, it will be stuck indefinitely.
Is there a way to check whether my BufferedReader object contains a line break before I run the readLine command? I tried if (in.toString().contains("\n")), but that always returns false. Alternatively, could I append a "\n" to my BufferedReader "in" object each time, just so that the while loop would exit instead of hanging indefinitely?
Any help would be appreciated.
Okay, this here should be what you are looking for.
FileInputStream fis = new FileInputStream("C:/sample.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(fis));
System.out.println("Reading File line by line using BufferedReader");
String line = reader.readLine();
while(line != null){
    System.out.println(line);
    line = reader.readLine();
}
Read more: http://javarevisited.blogspot.com/2012/07/read-file-line-by-line-java-example-scanner.html
Edit: in your case, since it seems like you are doing webapp testing, I believe WebDriverWait may work for your needs.
This is not true. BufferedReader.readLine() will not block if the underlying stream has reached the end of input. It will return null. See http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#readLine().
If your method is getting stuck there is another explanation.
Carefully check all of your exception handling and stream closing logic.
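If the method really does hang inside readLine(), a common cause is a connection that stalls before the server closes the stream: readLine() is then blocked waiting for more data, not for a line break. Setting a read timeout on the connection turns that hang into an exception you can handle. A minimal sketch reusing the question's spoof connection (the 30-second values are illustrative):
URLConnection spoof = url.openConnection();
spoof.setConnectTimeout(30000); // give up if the connection cannot be opened within 30 s
spoof.setReadTimeout(30000);    // a read that stalls for 30 s throws java.net.SocketTimeoutException
spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()));
try {
    String strLine;
    while ((strLine = in.readLine()) != null) {
        source.append(strLine);
    }
} catch (java.net.SocketTimeoutException e) {
    // the server stopped sending data; keep whatever was read so far
} finally {
    in.close();
}
return source.toString();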

Building Java server and I can't get my page to stop loading (using PrintWriter and BufferedReader)

I'm building a Java server and everything has been working as expected until now. I can serve up a static HTML page using two methods I wrote: body and header. Now I am trying to write a new method called bodyWithQueryString.
Problem:
It almost works, but after the page is loaded, the loading won't stop. It just loads and loads. This is not happening with my static pages.
The only difference between the old methods and the new bodyWithQueryString() method is that in the new method I am using a BufferedReader and a PrintWriter. These are new-ish classes for me, so I'm guessing I'm not using them right.
Here's how my new method is supposed to function:
I want to pass my route and query string (queryArray) to the bodyWithQueryString method. The method should read the file (from the route) into a byte output stream, do a replaceAll on the key/value pair of the query string while reading and, lastly, return the bytes. The main getResponse() method would then send the HTML to the browser.
Here's my code:
public void getResponse() throws Exception {
    String[] routeParts = parseRoute(route); //break apart route and querystring
    File theFile = new File(routeParts[0]);
    if (theFile.canRead()) {
        out.write(header( twoHundredStatusCode, routeParts[0], contentType(routeParts[0]) ) );
        if (routeParts.length > 1) { //there must be a querystring
            String[] queryStringArray = parseQueryString(routeParts[1]); //break apart querystring
            out.write(bodyWithQueryString(routeParts[0], queryStringArray)); //use new body method
        }
        else out.write(body(routeParts[0])); //use original body method
        out.flush();
    }
}

private byte[] bodyWithQueryString(String route, String[] queryArray)
        throws Exception {
    BufferedReader reader = new BufferedReader(new FileReader(route));
    ByteArrayOutputStream fileOut = new ByteArrayOutputStream();
    PrintWriter writer = new PrintWriter(fileOut);
    String line;
    while ((line = reader.readLine()) != null) writer.println(line.replaceAll(queryArray[0], queryArray[1]));
    writer.flush();
    writer.close();
    reader.close();
    return fileOut.toByteArray();
}
It seems to me that you are not returning a Content-Length header. This makes it hard for the browser to know when to stop reading the response.
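The header() helper isn't shown in the question, so the following is only a hypothetical sketch of the idea (the names and the assumption that out is an OutputStream are illustrative): build the body first, then emit a Content-Length that matches it before writing the body.
byte[] bodyBytes = bodyWithQueryString(routeParts[0], queryStringArray);
String header = "HTTP/1.1 200 OK\r\n"
        + "Content-Type: text/html\r\n"
        + "Content-Length: " + bodyBytes.length + "\r\n"  // tells the browser exactly how many bytes to expect
        + "\r\n";                                          // blank line terminates the headers
out.write(header.getBytes("US-ASCII"));
out.write(bodyBytes);
out.flush();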

extract specific part of html code

I am working on my first Android app and I need to get the source code of an HTML page.
Currently I am doing this:
private class NetworkOperation extends AsyncTask<Void, Void, String > {
    protected String doInBackground(Void... params) {
        try {
            URL oracle = new URL("http://www.nationalleague.ch/NL/fr/");
            URLConnection yc = oracle.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
            String inputLine;
            String s1 = "";
            while ((inputLine = in.readLine()) != null)
                s1 = s1 + inputLine;
            in.close();
            //return
            return s1;
        }
        catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
but the problem is that it takes too much time. How can I take, for example, only the HTML from line 200 to line 300?
Sorry for my bad English :$
Best case: instead of readLine(), use read(char[] cbuf, int off, int len). Another, dirty way:
int i = 0;
while ((inputLine = in.readLine()) != null) {
    i++;
    if (i >= 200 && i <= 300) {
        // DO SOMETHING with inputLine
    }
}
in.close();
You get the HTML document through HTTP, and HTTP usually runs over TCP. So... you can't just "skip lines"! The server will always send you all the data preceding the portion you are interested in, and your side of the connection must acknowledge receiving it.
Do not read line by line [use read(char[] cbuf, int off, int len)]
Do not concatenate Strings [use a StringBuilder]
Open the buffered reader (much like you already do):
URL oracle = new URL("http://www.nationalleague.ch/NL/fr/");
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
Instead of reading line by line, read into a char[] (I would use one of size about 8192)
and then use a StringBuilder to append all the read chars.
Reading specific lines of the HTML source seems a little risky anyway, because the formatting of the page's source code may change.
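A minimal sketch of what this answer describes (buffer size and variable names are illustrative):
URL oracle = new URL("http://www.nationalleague.ch/NL/fr/");
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
StringBuilder s1 = new StringBuilder();
char[] buffer = new char[8192];            // read in 8 KB chunks instead of line by line
int read;
while ((read = in.read(buffer, 0, buffer.length)) != -1) {
    s1.append(buffer, 0, read);            // one growing builder, no per-line String objects
}
in.close();
String html = s1.toString();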
