How to read compressed HTML page with Content-Encoding : gzip


I request a web page that sends a Content-Encoding: gzip header, but I am stuck on how to read it.
My code:
try {
URLConnection connection = new URL("http://jquery.org").openConnection();
String html = "";
BufferedReader in = null;
connection.setReadTimeout(10000);
in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null){
html+=inputLine+"\n";
}
in.close();
System.out.println(html);
System.exit(0);
} catch (IOException ex) {
Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
}
The output looks very messy (I was unable to paste it here; it is a run of unreadable symbols).
I believe this is compressed content. How do I parse it?
Note:
If I change jquery.org to jquery.com (which doesn't send that header), my code works well.

This is essentially pb2q's answer, but I am posting the full code for future readers:
try {
URLConnection connection = new URL("http://jquery.org").openConnection();
String html = "";
BufferedReader in = null;
connection.setReadTimeout(10000);
//The changed part
if (connection.getHeaderField("Content-Encoding")!=null && connection.getHeaderField("Content-Encoding").equals("gzip")){
in = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
} else {
in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
}
//End
String inputLine;
while ((inputLine = in.readLine()) != null){
html+=inputLine+"\n";
}
in.close();
System.out.println(html);
System.exit(0);
} catch (IOException ex) {
Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
}

There is a class for this: GZIPInputStream. It is an InputStream and so is very transparent to use.
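As a minimal sketch of what "transparent" means here, the following wraps a body in a GZIPInputStream only when the header value says gzip, then reads it as text. The class name GzipBodyDemo and its helper methods are hypothetical, the gzip helper merely stands in for a compressing server, and UTF-8 is an assumption about the page's charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
final class GzipBodyDemo {

    // Wraps the raw body in a GZIPInputStream only when the header says gzip.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        return raw; // no (or unknown) encoding: hand the stream back unchanged
    }

    // Reads the whole decoded body as text, appending "\n" after every line.
    // UTF-8 is an assumption; a real page may declare another charset.
    static String readAll(InputStream body) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(body, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Stand-in for a compressing server: gzips a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }
}
```

With a real connection the call would look like GzipBodyDemo.readAll(GzipBodyDemo.decode(connection.getInputStream(), connection.getContentEncoding())).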

There are two cases with a Content-Encoding: gzip header.
If the application has already compressed the data and HTTP compression is enabled on the server as well, the data can be compressed a second time, so the body ends up double compressed.
If the application has not compressed the data, HTTP compression gzips the body on the server and the client is expected to decompress it. Most web browsers do this automatically when they find Content-Encoding: gzip in the response. A plain Java URLConnection does not, which is why the code above has to wrap the stream in a GZIPInputStream itself.

Related

Http response code 429 while reading HTML

In Java I want to read and save all the HTML from a URL (Instagram), but I get error 429 (Too Many Requests). I think it is because I am trying to read more lines than the request limit allows.
StringBuilder contentBuilder = new StringBuilder();
try {
URL url = new URL("https://www.instagram.com/username");
URLConnection con = url.openConnection();
InputStream is =con.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(is));
String str;
while ((str = in.readLine()) != null) {
contentBuilder.append(str);
}
in.close();
} catch (IOException e) {
log.warn("Could not connect", e);
}
String html = contentBuilder.toString();
The error is:
Could not connect
java.io.IOException: Server returned HTTP response code: 429 for URL: https://www.instagram.com/username/
The stack trace also shows that the error occurs on this line:
InputStream is =con.getInputStream();
Does anybody have an idea why I get this error and/or what to do to solve it?

The problem might have been caused by the connection not being closed or disconnected.
Try-with-resources is useful for the input too, because it closes the reader automatically, even on an exception or an early return. Also, you constructed an InputStreamReader that uses the default encoding of the machine the application runs on, but you need the charset of the URL's content.
readLine returns the line without its line ending (which in general is very useful), so add one yourself.
StringBuilder contentBuilder = new StringBuilder();
try {
URL url = new URL("https://www.instagram.com/username");
// disconnect() is declared on HttpURLConnection, not on URLConnection, so cast.
HttpURLConnection con = (HttpURLConnection) url.openConnection();
try (BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream(), "UTF-8"))) {
String line;
while ((line = in.readLine()) != null) {
contentBuilder.append(line).append("\r\n");
}
} finally { // The reader "in" has already been closed by try-with-resources here.
con.disconnect();
}
} catch (IOException e) {
log.warn("Could not connect", e);
}
String html = contentBuilder.toString();
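The answer hard-codes "UTF-8", while its text says the charset of the URL's content is what matters. One way to get closer to that, sketched under assumptions, is to parse the charset parameter out of the Content-Type header and fall back to UTF-8 when it is missing or unknown. The class ContentTypeCharset and its method are hypothetical names, and the UTF-8 fallback is a choice made for this sketch, not something the answer prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ContentTypeCharset {

    // Extracts the charset from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is absent, has no charset parameter, or names a
    // charset this JVM does not support.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Case-insensitive check for the "charset=" prefix (8 characters).
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

It would be used as new InputStreamReader(con.getInputStream(), ContentTypeCharset.charsetOf(con.getContentType())).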

How to get correct data from HTTP Request

I'm trying to get my user information from the Stack Overflow API using a simple HTTP GET request in Java.
I have used this code before to fetch other data over HTTP GET without problems:
URL obj;
StringBuffer response = new StringBuffer();
String url = "http://api.stackexchange.com/2.2/users?inname=HCarrasko&site=stackoverflow";
try {
obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
con.setRequestMethod("GET");
int responseCode = con.getResponseCode();
System.out.println("\nSending 'GET' request to URL : " + url);
System.out.println("Response Code : " + responseCode);
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
in.close();
System.out.println(response.toString());
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
But in this case I get only strange symbols when I print the response variable, like this:
�mRM��0�+�N!���FZq�\�pD�z�:V���JX���M��̛yO^���뾽�g�5J&� �9�YW�%c`do���Y'��nKC38<A�&It�3��6a�,�,]���`/{�D����>6�Ɠ��{��7tF ��E��/����K���#_&�yI�a�v��uw}/�g�5����TkBTķ���U݊c���Q�y$���$�=ۈ��ñ���8f�<*�Amw�W�ـŻ��X$�>'*QN�?�<v�ݠ FH*��Ҏ5����ؔA�z��R��vK���"���#�1��ƭ5��0��R���z�ϗ/�������^?r��&�f��-�OO7���������Gy�B���Rxu�#:0�xͺ}�\�����
thanks in advance.

The content is likely GZIP encoded/compressed. The following is a general snippet that I use in all of my Java-based client applications that utilize HTTP, which is intended to deal with this exact problem:
// Read in the response
// Set up an initial input stream:
InputStream inputStream = fetchAddr.getInputStream(); // fetchAddr is the HttpURLConnection
// Check if inputStream is GZipped
if("gzip".equalsIgnoreCase(fetchAddr.getContentEncoding())){
// Format is GZIP
// Replace inputStream with a GZIP-wrapped stream
inputStream = new GZIPInputStream(inputStream);
}else if("deflate".equalsIgnoreCase(fetchAddr.getContentEncoding())){
inputStream = new InflaterInputStream(inputStream, new Inflater(true));
} // Else, we assume it to just be plain text
BufferedReader sr = new BufferedReader(new InputStreamReader(inputStream));
String inputLine;
StringBuilder response = new StringBuilder();
// ... and from here forward just read the response...
This relies on the following imports: java.util.zip.GZIPInputStream; java.util.zip.Inflater; and java.util.zip.InflaterInputStream.

Download part of a URL's content to save bandwidth

I have a huge text file online. I know how to fetch the data at the URL; an example would be something like this:
URL url = new URL(address);
urlConnection = (HttpURLConnection) url.openConnection();
int responseCode = urlConnection.getResponseCode();
if(responseCode == HttpURLConnection.HTTP_OK) {
InputStream stream = urlConnection.getInputStream();
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(stream));
StringBuilder builder = new StringBuilder();
try {
for (String line; (line = bufferedReader.readLine()) != null;)
builder.append(line);
response = builder.toString();
} catch (IOException e) {
e.printStackTrace();
}
}
This file gets updated every X minutes and a new line is added at the bottom, so the real information is only in the last line or lines. I was wondering whether it is possible to download only that part, to save bandwidth.
Edit: I am looking for a client-side solution, without modifying the server.
Thank you very much.
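One client-side option, offered as a sketch and not as a confirmed answer from this thread, is an HTTP Range request. A suffix range such as bytes=-2048 asks for only the last 2048 bytes. This only works when the server supports byte ranges: a 206 (HttpURLConnection.HTTP_PARTIAL) response means the range was honored, while a 200 means the server ignored it and sent the whole file. The class name TailRequest, its methods, and the byte count are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class, for illustration only.
final class TailRequest {

    // Builds a suffix byte range: "bytes=-N" means "the last N bytes".
    static String suffixRange(long lastBytes) {
        if (lastBytes <= 0) {
            throw new IllegalArgumentException("lastBytes must be positive");
        }
        return "bytes=-" + lastBytes;
    }

    // Prepares (but does not yet open) a connection that asks only for the tail.
    static HttpURLConnection prepare(String address, long lastBytes) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
        con.setRequestProperty("Range", suffixRange(lastBytes));
        return con;
    }

    // True when the server honored the range instead of sending everything.
    static boolean isPartial(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_PARTIAL;
    }
}
```

Because the range is counted in bytes, the first line of the returned chunk is usually cut off. Discarding everything up to the first line break leaves only complete lines.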

servlet socket read timeout via reader

I am seeing this exception:
java.net.SocketTimeoutException: Timeout attempting to read data from the socket
Here's the code generating it:
public static String extractBody(HttpServletRequest request) {
StringBuffer sb = new StringBuffer();
String line = null;
try {
//BufferedReader reader = request.getReader();
BufferedReader reader = new BufferedReader(new InputStreamReader(request.getInputStream()));
while ((line = reader.readLine()) != null) {
sb.append(line);
}
} catch (Exception e) {
logger.fatal("Failed to read from socket with content-length: {}", request.getHeader("Content-Length"));
e.printStackTrace();
}
return sb.toString();
}
When it happens, the Content-Length that gets logged is non-zero, for example:
Failed to read from socket with content-length: 279645
What is causing this timeout? Is it that the socket was left unclosed by the client? Is there something else I am missing? Is there a different way I should be reading the body data from a servlet request? Most of the requests work fine, I only see this error sometimes but it may be a certain client version or platform or something.

The Content-Length header didn't match the actual length of the content, so the read kept waiting for more data until it timed out.

How do I include information in the header when calling a web service using http protocol

I'm trying to use the Bittrex API for icons. The documentation states:
calculate the HMAC hash and include it under an apisign header.
I was able to calculate the HMAC hash, but I do not know how to include it in the header.
Code:
try {
String httpsURL = "https://bittrex.com/api/v1.1/public/getticker?market=BTC-BTCD";
httpsURL= cMarkets.mTitle[market];
httpsURL="https://bittrex.com/api/v1.1/public/getticker?market=BTC-";
httpsURL+=cMarkets.mTitle[market];
URL myurl = new URL(httpsURL);
String Hashcode=new String("####hash####");
// How do I include the hashcode under apisign in the header??????
HttpsURLConnection con = (HttpsURLConnection)myurl.openConnection();
InputStream ins = con.getInputStream();
InputStreamReader isr = new InputStreamReader(ins);
BufferedReader in = new BufferedReader(isr);
String inputLine;
while ((inputLine = in.readLine()) != null)
{
reply+=inputLine;
}
} catch (Exception e)
{
System.out.println("Exception in getting data from server");
}

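Something like this should work. Call it before con.getInputStream(), because request headers can no longer be set once the connection has been opened:
con.setRequestProperty("apisign", Hashcode); // Hashcode holds your calculated HMAC hash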
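For completeness, here is a hedged sketch that computes the hash and attaches it in one place. The question only says "HMAC hash". Using HMAC-SHA512 over the full request URI, keyed with the API secret, is an assumption about what the Bittrex documentation requires and should be checked against it. The class ApiSignHeader and its methods are hypothetical names.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper class, for illustration only.
final class ApiSignHeader {

    // Computes an HMAC-SHA512 of the message and returns it as lowercase hex.
    // The algorithm choice is an assumption; verify it against the API docs.
    static String hmacSha512Hex(String secret, String message) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA512");
        mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA512"));
        byte[] digest = mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Prepares a connection with the hash in the "apisign" request header.
    // The header must be set before the connection is opened, that is,
    // before getInputStream() or getResponseCode() is called.
    static HttpURLConnection signedConnection(String uri, String secret)
            throws IOException, GeneralSecurityException {
        HttpURLConnection con = (HttpURLConnection) new URL(uri).openConnection();
        con.setRequestProperty("apisign", hmacSha512Hex(secret, uri));
        return con;
    }
}
```

The returned connection can then be read exactly as in the question, starting from con.getInputStream().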
