Best way to get content from a webserver when it's busy? - java

I have the code below to connect to a webserver and fetch content via an HTTP request, and everything works fine. But sometimes the website gets really busy, handling too many requests at the same time from different users. Since I don't know exactly how HTTP servers work, are there any tricks I can use to get the content quickly?
I run the code below from multiple threads, polling every 5 milliseconds, so that I get the data instantly when the website is updated.
Can I keep the connection open? Does that make sense, or is there anything else that would let me get the data sooner when the server is really busy?
URLConnection con = url.openConnection();
BufferedReader in = new BufferedReader(
        new InputStreamReader(con.getInputStream()));
StringBuilder result = new StringBuilder();
String inputLine;
while ((inputLine = in.readLine()) != null) result.append(inputLine);
in.close();
content = result.toString();
Thanks.

Related

How to wait till HttpURLConnection completes?

If I upload a 100K file to a certain URL of my service, wget takes ~20 seconds to complete:
wget --quiet --post-file data.txt --output-document - --header "Content-Type: text/csv" http://localhost:8080/ingest
But if I do it like this in Java, it strangely completes immediately:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("POST");
con.setRequestProperty("Content-Type", "text/csv;charset=UTF-8");
con.setDoOutput(true);
OutputStream outputStream = con.getOutputStream();
outputStream.write(str.getBytes("UTF-8"));
outputStream.flush();
outputStream.close();
System.out.println("code=" + con.getResponseCode());
BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
in.close();
So my guess is that this code is not actually waiting for the data to be submitted, but is doing it in the background. How can I force it to block until the actual data transfer has finished?
That code should be waiting for the response to complete. The con.getResponseCode() call will not (cannot!) return until the server has at least responded with the HTTP reply header containing the response code.
It may be that the server is sending the HTTP reply header before it has finished reading the data that the client has posted. That would be a mistake. (If the server sends the response too soon, it can't set the response code correctly!)
It is also possible that the server response is not a 2xx response, and there are server error messages / diagnostics on the error stream rather than the input stream. (Read the javadocs on getInputStream versus getErrorStream.)
So the most likely reason that it is not blocking for ~20 seconds is that the request has failed ... and this is not being reported properly, due to server- or client-side implementation issues.
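One quick way to check this from the client side is to look at the response code and fall back to the error stream when it is not 2xx. This is only a sketch, reusing the con connection from the snippet above:
int code = con.getResponseCode();              // blocks until the reply header arrives
InputStream body = (code >= 200 && code < 300)
        ? con.getInputStream()                 // normal response body
        : con.getErrorStream();                // server error message / diagnostics
if (body != null) {
    BufferedReader r = new BufferedReader(new InputStreamReader(body));
    String line;
    while ((line = r.readLine()) != null) {
        System.out.println(line);
    }
    r.close();
}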
UPDATE - It turns out that the real issue was that "curl" was behaving strangely on some platforms, probably due to network config issues.

Java - Retrieving a web page with authorization

I'm trying to retrieve a GitHub web page from Java code; for this I used the following.
String startingUrl = "https://github.com/xxxxxx";
URL url = new URL(startingUrl);
HttpURLConnection uc = (HttpURLConnection) url.openConnection();
uc.connect();
String line = null;
StringBuffer tmp = new StringBuffer();
try {
    BufferedReader in = new BufferedReader(
            new InputStreamReader(uc.getInputStream(), "UTF-8"));
    while ((line = in.readLine()) != null) {
        tmp.append(line);
    }
} catch (FileNotFoundException e) {
}
However, the page I receive here is different from what I see in the browser after logging in to GitHub. I tried sending an authorization header as follows, but it didn't work either.
uc.setRequestProperty("Authorization", "Basic encodexxx");
How can I retrieve the same page that I see when I am logged in?
I can't tell you more on this, because I don't know what you are getting back, but the most common issue for web crawlers is that website owners mostly don't like web crawlers. You should therefore behave like a regular user - your browser, for instance. Open your browser's inspection tools (press F12) when you visit the website and see what your browser sends in the request, then try to mimic it: for example, add Host, Referer, etc. to your headers. You will need to experiment with this.
Also good to know: some website owners use advanced techniques to block you from accessing their site, some won't stop you from crawling their website, and some will let you do whatever you want. The fairest option is to check www.somedomain.com/robots.txt, which lists the endpoints that are allowed for scraping and those that are not.
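If Basic auth is really what the site expects, the header value is just a Base64 encoding of username:password, and the headers must be set before connect(). A minimal sketch using the uc connection from the question (the credentials are placeholders; java.util.Base64 needs Java 8+):
String credentials = "username:password";                      // placeholder values
String encoded = Base64.getEncoder()
        .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
uc.setRequestProperty("Authorization", "Basic " + encoded);
uc.setRequestProperty("User-Agent", "Mozilla/5.0");            // mimic a regular browser
uc.setRequestProperty("Accept", "text/html");
uc.connect();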

reading bytes from web site

I am trying to create a proxy server.
I want to read websites byte by byte so that I can display images and everything else. I tried readLine but I can't display images. Do you have any suggestions on how I can change my code and send all the data to the browser with a DataOutputStream object?
try {
    Socket s = new Socket(InetAddress.getByName(req.hostname), 80);
    String file = parcala(req.url);
    DataOutputStream out = new DataOutputStream(clientSocket.getOutputStream());
    BufferedReader dis = new BufferedReader(new InputStreamReader(s.getInputStream()));
    PrintWriter socketOut = new PrintWriter(s.getOutputStream());
    socketOut.print("GET " + req.url + "\n\n");
    //socketOut.print("Host: " + req.hostname);
    socketOut.flush();
    String line;
    while ((line = dis.readLine()) != null) {
        System.out.println(line);
    }
}
catch (Exception e) {}
}
Edited Part
This is what I have to do. I can block banned web sites but can't allow other web sites in my program.
In the filter program, you will open a TCP socket at the specified port and wait for connections. If a request comes (i.e. the client types a URL to access a web site), the application will process it to decide whether access is allowed or not and then, using the same socket, it will send the reply back to the client. After the client opened her connection to WebPolice (and her request has been checked and is allowed), the real web page needs to be shown to the client. Therefore, since the user already gave her request, now it is WebPolice's turn to forward the request so that the user can get the web page. Thus, WebPolice acts as a client and requests the web page. This means you need to open a connection to the web server (without closing the connection to the user), forward the request over this connection, get the reply and forward it back to the client. You will use threads to handle multiple connections (at the same time and/or at different times).
I don't know what exactly you're trying to do, but crafting an HTTP request and reading its response involves somewhat more than what you have done here. readLine won't work on binary data anyway.
You can take a look at the URLConnection class (stolen here):
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
Then you can read textual or binary data from the in object.
readLine will treat whatever it reads as a String, so unless you want to mess around with converting it back to bytes, I wouldn't recommend it.
I would just read bytes until you can't read any more, then write them out to a file; this should allow you to grab the images, keeping file headers intact, which can be important when dealing with files other than text.
Hope this helps.
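A minimal sketch of that byte-copy loop, reusing the s and clientSocket sockets from the question (the names are taken from the snippet above); it forwards whatever the web server sends, images included, straight to the browser:
InputStream fromServer = s.getInputStream();
OutputStream toClient = clientSocket.getOutputStream();
byte[] buffer = new byte[8192];
int n;
while ((n = fromServer.read(buffer)) != -1) {
    toClient.write(buffer, 0, n);          // raw bytes, so images survive intact
}
toClient.flush();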
Instead of using BufferedReader you can try to use InputStream.
It has several methods for reading bytes.
http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html

Reading from a URLConnection

I have a PHP page on my server that accepts a couple of POST requests and processes them. Let's say it's a simple page and the output is simply an echoed statement. With the URLConnection I established from a Java program to send the POST request, I tried to read the output using the input stream obtained through connection.getInputStream(). But all I get is the source of the page (the whole PHP script) and not the output it produces. We shall avoid socket connections here. Can this be done with URLConnection or HttpRequest? How?
class htttp {
    public static void main(String a[]) throws IOException {
        URL url = new URL("http://localhost/test.php");
        URLConnection conn = url.openConnection();
        //((HttpURLConnection) conn).setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setDoInput(true);
        OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
        wr.write("Hello");
        wr.flush();
        wr.close();
        InputStream ins = conn.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins);
        BufferedReader in = new BufferedReader(isr);
        String inputLine;
        String result = "";
        while ((inputLine = in.readLine()) != null)
            result += inputLine;
        System.out.print(result);
    }
}
I get the whole source of the webpage test.php in result. But I want only the output of the php script.
The reason you get the PHP source itself, rather than the output it should be rendering, is that your local HTTP server - receiving your request targeted at http://localhost/test.php - decided to serve back the PHP source, rather than forward the HTTP request to a PHP processor to render the output.
Why does this happen? That has to do with your HTTP server's configuration; there might be a few reasons for it. For starters, you should validate your HTTP server's configuration.
Which HTTP server are you using on your machine?
What happens when you browse http://localhost/test.php through your browser?
The problem here is not the Java code - the problem lies with the web server. You need to investigate why your web server is not executing your PHP script but sending it back raw. You can begin by testing with a simple PHP script that returns a fixed result and is accessed using a GET request (from a web browser). Once that is working, you can test with the one that responds to POST requests.
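As a rough check from the Java side (assuming the test.php from the question), a plain GET whose first line still contains "<?php" means the server is serving the file rather than executing it:
URL url = new URL("http://localhost/test.php");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String firstLine = in.readLine();
in.close();
if (firstLine != null && firstLine.contains("<?php")) {
    System.out.println("Server is returning the raw PHP source");
} else {
    System.out.println("Server output starts with: " + firstLine);
}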

how to click on a button through java?

I want to access forms on HTML pages through the Java programming language without involving a real browser in between.
At present I am doing it through HtmlUnit, but it takes a bit more time to load a page. When it comes to accessing millions of pages, that extra bit of time matters most.
Are there any other methods for doing this?
I've used something similar called HttpUnit before, but I have no idea how it compares performance-wise.
If you have millions of pages to process, I would recommend throwing some more threads at it, as sketched below. Just a guess, but I think that if you scale this up to multiple threads, you'll run out of bandwidth before you run out of CPU power (in which case it won't matter how much faster it could be).
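A rough sketch of that approach, assuming a hypothetical fetchPage(url) helper that wraps whatever per-page work (HtmlUnit or plain HTTP) you already do, and a pageUrls list of the URLs to visit (both names are made up here; the pool size is a guess to tune):
ExecutorService pool = Executors.newFixedThreadPool(20);
for (final String pageUrl : pageUrls) {
    pool.submit(new Runnable() {
        public void run() {
            fetchPage(pageUrl);              // hypothetical per-page worker
        }
    });
}
pool.shutdown();                             // stop accepting new tasks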
Accessing a web page using a browser, even HtmlUnit, is going to be slow. A better method is to test the layer just below the web interface, so that you don't need to access millions of pages -- instead you test enough to make sure that the web interface is using the lower layer correctly.
Most of the interaction in a browser comes down to an HTTP GET or an HTTP POST.
You need to figure out exactly the operation you need, and then you can construct the URL and/or form data. Then you can use something like this:
try {
    // Construct data
    String data = URLEncoder.encode("key1", "UTF-8") + "=" + URLEncoder.encode("value1", "UTF-8");
    data += "&" + URLEncoder.encode("key2", "UTF-8") + "=" + URLEncoder.encode("value2", "UTF-8");

    // Send data
    URL url = new URL("http://hostname:80/cgi");
    URLConnection conn = url.openConnection();
    conn.setDoOutput(true);
    OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
    wr.write(data);
    wr.flush();

    // Get the response
    BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = rd.readLine()) != null) {
        // Process line...
    }
    wr.close();
    rd.close();
} catch (Exception e) {
}
