Using Java bufferedreader to get html from URL - java

I'm trying to read all the HTML from a page using a BufferedReader, as follows:
String charset = "UTF-8";
URLConnection connection = new URL(url).openConnection();
connection.addRequestProperty("User-Agent",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
connection.setRequestProperty("Accept-Charset", charset);
InputStream response = connection.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(response,charset));
Then I read it line by line like this:
String data = br.readLine();
while(data != null){
data = br.readLine();
}
The problem is that I'm getting something like:
}$B!)(BL$B!)(Bu"~$B!)$(D"C(B|X$B!x!)!x(B}
I've tried this:
do {
data = br.readLine();
SortedMap<String, Charset> map = Charset.availableCharsets();
for(Map.Entry<String, Charset> entry : map.entrySet()){
System.out.println(entry.getKey());
try {
System.out.println(new String(data.getBytes(entry.getValue())));
} catch (Exception e) {
e.printStackTrace();
}
}
} while (data != null);
and I'm not getting readable HTML with any of them. This is really weird, since it was working fine until this morning and I didn't change anything.
What am I doing wrong here? Is it possible that something changed on the website I'm trying to read? Please help.

The server has switched to sending compressed data, as you can see in its response headers:
Connection:keep-alive
Content-Encoding:gzip
Content-Type:text/html; charset=utf-8
Date:Mon, 09 Mar 2015 09:34:41 GMT
Server:nginx
Transfer-Encoding:chunked
Vary:Accept-Encoding
X-Powered-By:PHP/5.5.16-pl0-gentoo
As you can see, the content encoding is set to gzip (Content-Encoding:gzip).
So you have to decompress the gzipped content first:
GZIPInputStream gzis = new GZIPInputStream(connection.getInputStream());
BufferedReader br = new BufferedReader(new InputStreamReader(gzis,charset));
To view the headers of requests and responses you could use a network monitor (see Free Network Monitor).
It is simpler to use the developer tools built into most common browsers. Here is the Chrome DevTools documentation on how to use the network tab: https://developer.chrome.com/devtools/docs/network
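The fix above always wraps the stream in a GZIPInputStream, which would fail if the server ever goes back to sending uncompressed data. A minimal sketch that decides based on the Content-Encoding header could look like the following. The class name GzipAwareReader and its methods are hypothetical, and UTF-8 is assumed as in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper; the names are illustrative, not from the original post.
final class GzipAwareReader {

    // Wraps the raw body in a GZIPInputStream only when the header says "gzip".
    static BufferedReader open(InputStream raw, String contentEncoding, Charset charset)
            throws IOException {
        InputStream body = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw;
        return new BufferedReader(new InputStreamReader(body, charset));
    }

    // Fetches a page and returns its text, with '\n' after every line.
    static String fetch(String url) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        // Request headers must be set before the response is read.
        connection.setRequestProperty("Accept-Encoding", "gzip");
        StringBuilder html = new StringBuilder();
        try (BufferedReader br = open(connection.getInputStream(),
                connection.getContentEncoding(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

Calling GzipAwareReader.fetch(url) would then return readable HTML whether or not the server compresses the response.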

Related

SSL peer shut down incorrectly

This is my first post here. I am a hobbyist so please bear with me.
I am attempting to grab a webpage from https://eztv.it/shows/1/24/ with the following code.
public static void WriteHTMLToFile(String URL){
try {
URI myURI=new URI(URL);
URL url = myURI.toURL();
HttpsURLConnection con= (HttpsURLConnection)url.openConnection();
File myFile=new File("c:\\project\\Test.txt");
myFile.createNewFile();
FileWriter wr=new FileWriter(myFile);
InputStream ins=con.getInputStream();
InputStreamReader isr= new InputStreamReader(ins);
BufferedReader reader = new BufferedReader(isr);
String line;
while ((line = reader.readLine()) != null) {
wr.write(line+"\n");
}
reader.close();
wr.close();
}
catch(Exception e){
log(e.toString());
}
}
When I run this I get the following:
javax.net.ssl.SSLException: SSL peer shut down incorrectly
If I run the above code on this URL: https://eztv.it/shows/887/the-blacklist/ it works as intended. The difference in size between the two pages seems to be a contributing factor. When testing different URLs on the same server, the above code only seemed to work for files smaller than ~30 KB. Anything larger would generate the above exception.
I figured it out. The server is responding with gzip encoding once file sizes are over a certain size.
con.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");
was added to the request header as well as some code to handle the gzip stream.
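The gzip-handling code itself is not shown. One possible shape for it, assuming the goal is still to save the page to a file, is sketched below. PageSaver and save are hypothetical names, and the contentEncoding argument is assumed to come from con.getContentEncoding().

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

// Hypothetical helper; not the poster's actual code.
final class PageSaver {

    // Copies the response body to target, decompressing it first when the
    // server reported gzip. Returns the number of bytes written.
    static long save(InputStream raw, String contentEncoding, Path target) throws IOException {
        try (InputStream body = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw) {
            return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Copying bytes rather than lines also avoids decoding and re-encoding the text with the platform default charset, which the FileWriter version does implicitly.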

POST non latin data from Java to PHP

I post some data from Java to PHP:
try {
URL obj = new URL("http://myphpurl/insert.php");
HttpURLConnection conn = (HttpURLConnection) obj.openConnection();
conn.setReadTimeout(10000);
conn.setConnectTimeout(15000);
conn.setRequestMethod(POST_METHOD);
conn.setDoInput(true);
conn.setDoOutput(true);
Map<String, String> params = new HashMap<String, String>();
params.put("title", "العربية");
OutputStream os = conn.getOutputStream();
BufferedWriter writer =
new BufferedWriter(new OutputStreamWriter(os, "UTF-8"));
writer.write(getQuery(params));
writer.flush();
writer.close();
os.close();
BufferedReader in =
new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
String inputLine;
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
LOG.debug("response {}", response);
in.close();
response = null;
inputLine = null;
conn.disconnect();
conn = null;
obj = null;
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
private String getQuery(Map<String, String> params) throws UnsupportedEncodingException {
StringBuilder result = new StringBuilder();
boolean first = true;
Iterator<Map.Entry<String, String>> it = params.entrySet().iterator();
while (it.hasNext()) {
if (first)
first = false;
else
result.append("&");
Map.Entry<String, String> pairs = it.next();
result.append(URLEncoder.encode(pairs.getKey(), "UTF-8"));
result.append("=");
result.append(URLEncoder.encode(pairs.getValue(), "UTF-8"));
it.remove(); // avoids a ConcurrentModificationException
}
return result.toString();
}
The insert.php file looks like this:
<?php
$posttitle = $_POST["title"];
echo "$posttitle";
echo urldecode($posttitle);
?>
The echo shows some gibberish Ù…Ù„ÙŠÙˆÙ† instead of the actual title العربية .
This gibberish is then inserted into a MySQL database.
Additional info:
The database uses utf8_general_ci and does support Arabic (when I manually update the post using phpMyAdmin it works).
I added UTF-8 to the InputStreamReader and OutputStreamWriter, and I saw the following behaviour:
Tomcat6 on windows, (PHP + mysql) on CentOS --> OK
Tomcat6 on CentOS , (PHP + mysql) on CentOS --> Not OK
Additional info 2
Posting using javascript works fine: The page responds with the right encoding.
There are a number of things that can go wrong with your code, and we can't test it. Also, I suggest using a full-featured HTTP client instead of URLConnection. Here is what you should check:
Pass the right source file encoding to javac (your test string is hardcoded; do you run the same binary on both machines, or do you run the program from your IDE or otherwise recompile on the deployment machine?)
Use UTF-8 to encode the query string
If your API uses the HTTP request body, check that both ends agree on the encoding, and/or use the Content-Type MIME header
PHP strings are plain byte strings (the encoding must be specified separately), so make sure you use the appropriate parameters when connecting to the database, and/or transcode accordingly
When sending text from the PHP server, mind the encoding of the template and of the dynamic bits!
There are quite a few moving parts. You should not debug via print/echo, because that adds another level of transcoding. If possible, dump the raw bytes of the text and inspect them in a hex editor.
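On the Java side, dumping raw bytes does not need an external tool. A minimal sketch, with HexDump as a hypothetical helper name, could be:

```java
import java.nio.charset.Charset;

// Hypothetical debugging helper; the names are illustrative.
final class HexDump {

    // Formats bytes as lowercase two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes the text with the given charset and dumps the resulting bytes.
    static String hex(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }
}
```

Writing the test string with Unicode escapes, for example "\u0627\u0644\u0639\u0631\u0628\u064A\u0629" for العربية, takes the source file encoding out of the picture. With StandardCharsets.UTF_8 the dump of that string should read d8 a7 d9 84 d8 b9 d8 b1 d8 a8 d9 8a d8 a9; anything else logged just before the write suggests the string was already damaged at that point.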
It's funny that Windows → Linux is ok, while Linux → Linux is not. You may want to check the locale on both CentOS machines (possibly running the operating system command from inside the target process - JVM and Apache)
Try using CharsetEncoder to reveal possible encoding exceptions.
CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
encoder.onMalformedInput(CodingErrorAction.REPORT);
encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
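To show how that encoder would actually be used, here is a small sketch that encodes a string strictly and throws instead of silently substituting '?'. StrictEncoding is a hypothetical name; the CharsetEncoder calls are the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper; the name is illustrative.
final class StrictEncoding {

    // Encodes text with the given charset; throws CharacterCodingException
    // when a character is malformed or cannot be represented in that charset.
    static byte[] encode(String text, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }
}
```

By contrast, String.getBytes replaces unmappable characters with the charset's replacement byte, which is one way question marks end up in the output without any exception.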

Java URLConnection utf-8 encoding doesn't work

I'm writing a small crawler for sites in English only, and doing that by opening a URL connection. I set the encoding to UTF-8 both on the request and on the InputStreamReader, but I continue to get gobbledygook for some of the requests, while others work fine.
The following code represents all the research I did and advice out there. I have also tried changing URLConnection to HttpURLConnection with no luck. Some of the returned strings continue to look like this:
??}?r?H????P?n?c??]?d?G?o??Xj{?x?"P$a?Qt?#&??e?a#?????lfVx)?='b?"Y(defUeefee=??????.??a8??{O??????zY?2?M???3c??#
What am I missing?
My code:
public static String getDocumentFromUrl(String urlString) throws Exception {
String wholeDocument = null;
URL url = new URL(urlString);
URLConnection conn = url.openConnection();
conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
conn.setRequestProperty("Accept-Charset", "utf-8");
conn.setConnectTimeout(60*1000); // wait only 60 seconds for a response
conn.setReadTimeout(60*1000);
InputStreamReader isr = new InputStreamReader(conn.getInputStream(), "utf-8");
BufferedReader in = new BufferedReader(isr);
String inputLine;
while ((inputLine = in.readLine()) != null) {
wholeDocument += inputLine;
}
isr.close();
in.close();
return wholeDocument;
}
The server is sending the document GZIP compressed. You can set the Accept-Encoding HTTP header to make it send the document in plain text.
conn.setRequestProperty("Accept-Encoding", "identity");
Even so, the HTTP client class handles GZIP compression for you, so you shouldn't have to worry about details like this. What seems to be going on here is that the server is buggy: it does not send the Content-Encoding header to tell you the content is compressed. This behavior seems to depend on the User-Agent, so that the site works in regular web browsers but breaks when used from Java. So, setting the user agent also fixes the issue:
conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // for example
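If a server really does omit Content-Encoding, another option (not part of the answer above, just a defensive sketch) is to peek at the first two bytes of the body and look for the gzip magic number 0x1f 0x8b. GzipSniffer and unwrap are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper; the names are illustrative.
final class GzipSniffer {

    // Returns a decompressing stream when the body starts with the gzip
    // magic bytes, otherwise returns the (buffered) body unchanged.
    static InputStream unwrap(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);                 // remember the position before peeking
        int first = in.read();
        int second = in.read();
        in.reset();                 // rewind so no bytes are lost
        boolean gzip = first == 0x1f && second == 0x8b;
        return gzip ? new GZIPInputStream(in) : in;
    }
}
```

In the question's code this would mean passing GzipSniffer.unwrap(conn.getInputStream()) to the InputStreamReader. HTML text is very unlikely to begin with those two bytes, but this remains a heuristic rather than a guarantee.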

how to handle utf-8 content from website

I'm new to Java and I'm stuck on this function:
public String getFromUrl(String url){
String content = "";
try{
URL U = new URL(url);
URLConnection conn = U.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
String line;
while((line = reader.readLine()) != null)content += line+"\r\n";
reader.close();
}
catch(Exception e){}
return content;
}
I always get question marks instead of UTF-8 symbols!
What am I doing wrong?
I read this post.
First: I can't understand why a byte array is used.
Second: what should the while loop look like in this case? If I write
while((line = reader.readLine()) != null)content = line.getBytes("UTF-8");
Eclipse says something like "the local variable content may not have been initialized".
Third: how should I convert the byte array back into a string?
Then I read this one. I didn't even try the approach from that post, because I'm trying to write a function that simulates a browser's GET and POST requests. It seems I found out how to do that with the URL class, so I don't want to use any other classes and methods.
Now the only problem I have is how to handle UTF-8 content.
Any help appreciated!
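Check what the server reports about the response:
String encoding = conn.getContentEncoding();
Note that this returns the Content-Encoding header (for example gzip), which describes compression rather than the character set. The character set, if the server declares one, is in the charset parameter of the Content-Type header, and that is the value to use for the reader.
Also print the exception you catch instead of silently swallowing it.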
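Since the character set travels in the Content-Type header (for example text/html; charset=utf-8), a rough sketch of picking it out for the reader could look like this. ContentTypeCharset and parse are hypothetical names, and the parsing is deliberately simple rather than a full MIME parser.

```java
import java.nio.charset.Charset;

// Hypothetical helper; the names are illustrative.
final class ContentTypeCharset {

    // Returns the charset named in a Content-Type header value, or the
    // fallback when the header is missing, has no charset, or names an
    // unknown charset.
    static Charset parse(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix (8 characters).
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }
}
```

In the question's function this would become new InputStreamReader(conn.getInputStream(), ContentTypeCharset.parse(conn.getContentType(), StandardCharsets.UTF_8)).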

Getting the source code for the following page using Java

I am trying to get the source code for the following page: http://www.amazon.com/gp/offer-listing/082470732X/ref=dp_olp_0?ie=UTF8&redirect=true&condition=all
(Please note that Amazon takes you to another page if you click on the link. To get to the page that I am interested in reading please copy the link and paste it to an empty tab in your browser. Thanks!)
Normally using java.net API, I can get the source code for most of the URLs with almost no problem, however for the above link I get nothing. It turned out that the input stream generated by the connection is encoded by gzip, so I tried the following:
URL url = new URL(urlString);
HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
InputStream is = urlConnection.getInputStream();
HttpURLConnection.setFollowRedirects(true);
urlConnection.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = urlConnection.getContentEncoding();
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
is = new GZIPInputStream(is);
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
is = new InflaterInputStream((is), new Inflater(true));
}
However this time I get the following error deterministically:
java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:249)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:239)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:142)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:67)
at domain.logic.ItemScraper.loadURL(ItemScraper.java:405)
at domain.logic.ItemScraper.main(ItemScraper.java:510)
Can anybody see my mistake? Is there another way to read this particular page? Can somebody explain to me why my browser (Firefox) can read it, while I cannot read the source using Java?
Thanks in advance, best regards,
Instead of
is = new GZIPInputStream(is);
try
is = new GZIPInputStream(urlConnection.getInputStream());
As for the EOFException, if you add
urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24");
it would go away.
You can use a standard BufferedReader to read a web server's response for a given URL.
URLIn = new BufferedReader(new InputStreamReader(new URL(URLOrFilename).openStream()));
Then use ...
while ((incomingLine = URLIn.readLine()) != null) {
...
}
... to get the response.
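Put together, the read loop might be sketched as below, with try-with-resources so the reader is closed even when an exception occurs. ReadAll and readAll are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper; the names are illustrative.
final class ReadAll {

    // Reads every line from source and returns the text with '\n' after
    // each line. Closes the reader when done.
    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }
}
```

A call such as ReadAll.readAll(new InputStreamReader(stream, StandardCharsets.UTF_8)) names the charset explicitly; an InputStreamReader constructed without one uses the JVM's default charset, which may not match the page.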
