This is my first post here. I am a hobbyist so please bear with me.
I am attempting to grab a webpage from https://eztv.it/shows/1/24/ with the following code.
public static void WriteHTMLToFile(String URL) {
    try {
        URI myURI = new URI(URL);
        URL url = myURI.toURL();
        HttpsURLConnection con = (HttpsURLConnection) url.openConnection();
        File myFile = new File("c:\\project\\Test.txt");
        myFile.createNewFile();
        FileWriter wr = new FileWriter(myFile);
        InputStream ins = con.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins);
        BufferedReader reader = new BufferedReader(isr);
        String line;
        while ((line = reader.readLine()) != null) {
            wr.write(line + "\n");
        }
        reader.close();
        wr.close();
    }
    catch (Exception e) {
        log(e.toString());
    }
}
When I run this I get the following:
javax.net.ssl.SSLException: SSL peer shut down incorrectly
If I run the above code on this URL: https://eztv.it/shows/887/the-blacklist/ it works as intended. The difference in response size between the two URLs seems to be a contributing factor. Testing different URLs on the same server, the above code only seemed to work for responses smaller than ~30 KB. Anything over that would generate the above exception.
I figured it out. The server responds with gzip encoding once the response exceeds a certain size.
con.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");
was added to the request headers, along with some code to handle the gzip stream (sketched below).
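The stream handling ended up looking roughly like this (a sketch only; it checks the Content-Encoding response header before wrapping the stream):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.util.zip.GZIPInputStream;

// Sketch: wrap the connection's stream in a GZIPInputStream when the
// server reports a gzip-compressed body. HttpsURLConnection extends
// HttpURLConnection, so the same helper works for the code above.
static BufferedReader openReader(HttpURLConnection con) throws IOException {
    InputStream ins = con.getInputStream();
    if ("gzip".equalsIgnoreCase(con.getContentEncoding())) {
        ins = new GZIPInputStream(ins);
    }
    return new BufferedReader(new InputStreamReader(ins));
}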
I have an assignment for school that involves writing a simple web crawler that crawls Wikipedia. The assignment stipulates that I can't use any external libraries so I've been playing around with the java.net.URL class. Based on the official tutorial and some code given by my professor I have:
public static void main(String[] args) {
    System.setProperty("sun.net.client.defaultConnectTimeout", "500");
    System.setProperty("sun.net.client.defaultReadTimeout", "1000");
    try {
        URL url = new URL(BASE_URL + "/wiki/Physics");
        InputStream is = url.openStream();
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String inputLine;
        int lineNum = 0;
        while ((inputLine = br.readLine()) != null && lineNum < 10) {
            System.out.println(inputLine);
            lineNum++;
        }
        is.close();
    }
    catch (MalformedURLException e) {
        System.out.println(e.getMessage());
    }
    catch (IOException e) {
        System.out.println(e.getMessage());
    }
}
In addition, the assignment requires that:
Your program should not continuously send requests to wiki. Your program must wait for at least 1 second after every 10 requests.
So my question is: where exactly in the above code is the "request" being sent? And how does this connection work? Is the entire webpage being loaded in one go, or is it being downloaded line by line?
I honestly don't really understand much about networking at all so apologies if I'm misunderstanding something fundamental. Any help would be much appreciated.
InputStream is = url.openStream();
The request is sent at the line above, when you call openStream().
BufferedReader br = new BufferedReader(new InputStreamReader(is));
This line wraps the returned input stream in a reader so you can read the response.
Calling url.openStream() initiates a new TCP connection to the server that the URL resolves to. An HTTP GET request is then sent over the connection. If all goes right (i.e., 200 OK), the server sends back the HTTP response message that carries the data payload that is served up at the specified URL. You then need to read the bytes from the InputStream that the openStream() method returns in order to retrieve the data payload into your program.
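As for the pacing requirement, one simple sketch (the page list and base URL here are placeholders, not from your assignment) is to count requests and sleep after every tenth one:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.List;

public class PoliteCrawler {
    // 'pages' is a hypothetical list of wiki paths to visit.
    static void crawl(List<String> pages) throws Exception {
        int requests = 0;
        for (String page : pages) {
            URL url = new URL("https://en.wikipedia.org" + page); // assumed base URL
            BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            while ((line = br.readLine()) != null) {
                // parse/collect links here
            }
            br.close();
            requests++;
            if (requests % 10 == 0) {
                Thread.sleep(1000); // wait at least 1 second after every 10 requests
            }
        }
    }
}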
I cannot work out why the following code does not put a packet on the wire (confirmed via Wireshark). It is a fairly standard way of sending an HTTP POST request, as far as I can tell. I don't intend to read anything, just POST.
private void sendRequest() throws IOException {
    String params = "param=value";
    URL url = new URL(otherUrl.toString());
    HttpURLConnection con = (HttpURLConnection) url.openConnection();
    con.setDoOutput(true);
    con.setDoInput(true); // setting this to `false` does not help
    con.setRequestMethod("POST");
    con.setRequestProperty("Content-Type", "text/plain");
    con.setRequestProperty("Content-Length", Integer.toString(params.getBytes().length));
    con.setRequestProperty("Accept", "text/plain");
    con.setUseCaches(false);
    con.connect();
    DataOutputStream wr = new DataOutputStream(con.getOutputStream());
    wr.writeBytes(params);
    wr.flush();
    wr.close();
    //Logger.getLogger("log").info("URL: " + url + ", response: " + con.getResponseCode());
    con.disconnect();
}
What happens is... actually nothing, unless I try to read something, for example by uncommenting the log line above which reads the response code. Trying to read a response via con.getInputStream(); also works. Otherwise there is no movement of packets. When I uncomment the getResponseCode, I can see that the HTTP POST is sent and then 200 OK comes back. The order is proper, i.e. I don't get some wild response before sending the POST. Everything else looks exactly the same (I can attach Wireshark screenshots if needed). In the debugger the code executes (i.e. does not block anywhere).
I don't understand under what circumstances this can happen. I believe it should be possible to send a POST request with con.setDoInput(false);. Currently it either sends nothing, or fails with an exception (when trying to execute con.getResponseCode()) because I obviously promised I wouldn't read anything.
It might be relevant that before sendRequest I request some data from the same site, but I believe I close everything properly:
public static String getData(String urlAddress) throws MalformedURLException, IOException {
    URL url = new URL(urlAddress);
    HttpURLConnection con = (HttpURLConnection) url.openConnection();
    con.setDoOutput(false);
    InputStream in = con.getInputStream();
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    StringBuilder data = new StringBuilder();
    String line;
    while ((line = reader.readLine()) != null) {
        data.append(line);
    }
    reader.close();
    in.close();
    con.getResponseCode();
    con.disconnect();
    return data.toString();
}
The server and port are the same for both URLs, so I believe the same socket could be reused for the communication. The above code works and retrieves the data properly.
I am not sure; maybe I fail to clean something up and it gets cached, so without an explicit read the POST gets delayed. There is no other traffic on the socket.
Unless you're using fixed-length or chunked transfer mode, HttpURLConnection will buffer all your output until you call getInputStream() or getResponseCode(), so that it can send a correct Content-length header.
If you call getResponseCode() you should have a look at its value.
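A minimal sketch of the fixed-length variant, reusing otherUrl from the question (the body string is just a placeholder):

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

private void sendRequestStreaming() throws IOException {
    byte[] body = "param=value".getBytes("UTF-8");
    HttpURLConnection con = (HttpURLConnection) new URL(otherUrl.toString()).openConnection();
    con.setRequestMethod("POST");
    con.setDoOutput(true);
    con.setRequestProperty("Content-Type", "text/plain");
    // Fixed-length streaming mode: the headers and body go to the socket as
    // you write them, instead of being buffered until the response is read.
    con.setFixedLengthStreamingMode(body.length);
    OutputStream out = con.getOutputStream();
    out.write(body);
    out.close();
    // Reading the status code is still the clean way to finish the exchange.
    System.out.println("HTTP status: " + con.getResponseCode());
    con.disconnect();
}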
I'm writing a small crawler for sites in English only, and I'm fetching pages by opening a URL connection. I set the encoding to UTF-8 both on the request and on the InputStreamReader, but I continue to get gobbledygook for some of the requests, while others work fine.
The following code reflects all the research and advice I could find. I have also tried changing URLConnection to HttpURLConnection, with no luck. Some of the returned strings continue to look like this:
??}?r?H????P?n?c??]?d?G?o??Xj{?x?"P$a?Qt?#&??e?a#?????lfVx)?='b?"Y(defUeefee=??????.??a8??{O??????zY?2?M???3c??#
What am I missing?
My code:
public static String getDocumentFromUrl(String urlString) throws Exception {
    String wholeDocument = "";
    URL url = new URL(urlString);
    URLConnection conn = url.openConnection();
    conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
    conn.setRequestProperty("Accept-Charset", "utf-8");
    conn.setConnectTimeout(60 * 1000); // wait only 60 seconds for a response
    conn.setReadTimeout(60 * 1000);
    InputStreamReader isr = new InputStreamReader(conn.getInputStream(), "utf-8");
    BufferedReader in = new BufferedReader(isr);
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        wholeDocument += inputLine;
    }
    in.close();
    isr.close();
    return wholeDocument;
}
The server is sending the document GZIP compressed. You can set the Accept-Encoding HTTP header to make it send the document in plain text.
conn.setRequestProperty("Accept-Encoding", "identity");
Even so, the HTTP client class handles GZIP compression for you, so you shouldn't have to worry about details like this. What seems to be going on here is that the server is buggy: it does not send the Content-Encoding header to tell you the content is compressed. This behavior seems to depend on the User-Agent, so that the site works in regular web browsers but breaks when used from Java. So, setting the user agent also fixes the issue:
conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // for example
I am trying to use a signed Java applet to POST to a URL like:
http://some.domain.com/something/script.asp?param=5041414F9015496EA699F3D2DBAB4AC2|178411|163843|557|1|1|164||attempt|1630315
But when Java makes the connection, the Java console shows:
network: Connecting http://some.domain.com/something/script.asp?param=5041414F9015496EA699F3D2DBAB4AC2%7C178411%7C163843%7C557%7C1%7C1%7C164%7C%7Cattempt%7C1630315
I do not want Java to URL-encode the pipes in the query from | to %7C. It seems the service I'm connecting to doesn't URL-decode the param, and I can't change the server-side code. Is there a way in Java to make the POST without escaping the query?
The Java code I'm using is below:
try {
    URL url = new URL(myURL);
    URLConnection connection = url.openConnection();
    connection.setDoOutput(true);

    OutputStreamWriter out = new OutputStreamWriter(connection.getOutputStream());
    out.write(toSend);
    out.close();

    BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    String decodedString = "";
    while ((decodedString = in.readLine()) != null) {
        totalResponse = totalResponse + decodedString;
    }
    in.close();
} catch (Exception ex) {
}
Thank you for any help!
The URL class does not do any encoding. Testing this on my dev server confirmed this suspicion. Your code must be encoding the '|' character somewhere before the snippet you included in your question.
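A quick way to see this is to print the URL back out; java.net.URL stores the query exactly as given (the address below is made up):

import java.net.URL;

public class PipeCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/script.asp?param=A|B|C");
        System.out.println(url);            // http://example.com/script.asp?param=A|B|C
        System.out.println(url.getQuery()); // param=A|B|C -- the pipes are untouched
    }
}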
I have been looking around at different ways to connect to URLs and there seem to be a few.
My requirements are to do POST and GET queries on a URL and retrieve the result.
I have seen
URL class
DefaultHttpClient class
HttpClient - apache commons
Which method is best?
My rule of thumb and recommendation: don't introduce dependencies and third-party libraries if it's fairly easy to get by without them.
In this case I would say: if you need features such as multiple requests per established connection, session handling, or cookie support, go for HttpClient.
If you only need to perform an HTTP GET, the following will suffice:
Getting Text from a URL
try {
    // Create a URL for the desired page
    URL url = new URL("http://hostname:80/index.html");

    // Read all the text returned by the server
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    String str;
    while ((str = in.readLine()) != null) {
        // str is one line of text; readLine() strips the newline character(s)
    }
    in.close();
} catch (MalformedURLException e) {
} catch (IOException e) {
}
Sending a POST Request Using a URL
try {
    // Construct data
    String data = URLEncoder.encode("key1", "UTF-8") + "=" + URLEncoder.encode("value1", "UTF-8");
    data += "&" + URLEncoder.encode("key2", "UTF-8") + "=" + URLEncoder.encode("value2", "UTF-8");

    // Send data
    URL url = new URL("http://hostname:80/cgi");
    URLConnection conn = url.openConnection();
    conn.setDoOutput(true);
    OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
    wr.write(data);
    wr.flush();

    // Get the response
    BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = rd.readLine()) != null) {
        // Process line...
    }
    wr.close();
    rd.close();
} catch (Exception e) {
}
Both methods work well. (I've even done manual GETs/POSTs with cookies.)
HttpClient is the way to go if your needs go beyond a trivial URL connection (e.g. proxy authentication such as NTLM). There is at least one comparison here of standard HTTP client functionality between the libraries provided by the JRE, Apache HttpClient, and others.
If you are using JDK versions up to and including 1.4 and have fairly large data in your POST requests, like large file uploads, the default HttpURLConnection that comes with the JRE is bound to run out of memory at some point, since it buffers the entire body before posting. Additionally, it does not support some advanced HTTP features, like chunked transfer encoding.
So I'd recommend it, as aioobe did, only if your requests are trivial and you are not posting large data.
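For completeness, on JDK 5 and later the buffering problem can be avoided with a streaming mode; a rough sketch with a made-up endpoint and file:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ChunkedUpload {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/upload"); // hypothetical endpoint
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setChunkedStreamingMode(8192); // stream in 8 KB chunks, no full buffering

        InputStream in = new FileInputStream("bigfile.bin"); // hypothetical file
        OutputStream out = con.getOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        out.close();
        in.close();

        System.out.println("HTTP status: " + con.getResponseCode());
        con.disconnect();
    }
}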