how to handle utf-8 content from website - java

I'm new to Java and I'm stuck on this function:
public String getFromUrl(String url){
String content = "";
try{
URL U = new URL(url);
URLConnection conn = U.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
String line;
while((line = reader.readLine()) != null)content += line+"\r\n";
reader.close();
}
catch(Exception e){}
return content;
}
I always get question marks instead of UTF-8 characters!
What am I doing wrong?
I read this post.
First: I can't understand why a byte array is used.
Second: what should the while loop look like in this case? If I write
while((line = reader.readLine()) != null)content = line.getBytes("UTF-8");
Eclipse says something like "the local variable content may not have been initialized".
Third: how should I convert the byte array back into a string?
Then I read this one. I didn't try the approach from that post because I'm trying to write a function that simulates a browser's GET and POST requests. It seems I have found out how to do that with the URL class, so I don't want to use any other classes or methods.
Now the only problem I have is how to handle UTF-8 content.
Any help appreciated!

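Dump:
String encoding = conn.getContentEncoding();
Note that this is the Content-Encoding header, which names a compression such as gzip rather than a charset; a compressed body has to be decompressed before it reaches the reader. The charset, when the server sends one, is in the Content-Type header (conn.getContentType()), and that is the value to use for the reader.
Also dump the caught exception instead of swallowing it.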
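The charset a server declares normally travels in the Content-Type header (for example "text/html; charset=ISO-8859-1"). The following is a minimal sketch of pulling it out so it can be handed to the reader. The class name CharsetSniffer, the method charsetFrom, and the fallback to UTF-8 are assumptions made for illustration, not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetSniffer {
    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing, has no charset, or names an unknown one.
    static Charset charsetFrom(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFrom("text/html"));
    }
}
```

With a real connection this could be used as new InputStreamReader(conn.getInputStream(), CharsetSniffer.charsetFrom(conn.getContentType())).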

Related

java.net.URL retrieves unintelligible stream

I have been using Java code to retrieve the content of a URL. The code does not work for https://www.amazon.es/. Similar Python code does retrieve the Amazon page.
The java code:
URL url = new URL(urlToScan);
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
StringBuilder builder = new StringBuilder();
for (String temp = reader.readLine(); temp != null; temp = reader.readLine())
builder.append(temp);
webpage = builder.toString();
The python code:
from urllib.request import urlopen
url = "https://www.amazon.es/"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
I searched Amazon's HTML myself for the charset in use (in case it was a charset issue) and they are using charset="utf-8". As the HTML is 22,000+ lines long, I thought it could be some parsing error for long Strings. I also tried a ByteArrayOutputStream and then constructing the String with the String(byte[], charset) constructor. Java output:
?
Why is java.net.URL not retrieving the URL content properly?
Maybe it's because of the User-Agent. To set the User-Agent, use URLConnection:
URL url = new URL("https://www.amazon.es/");
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
BufferedInputStream bufferedInputStream = new BufferedInputStream(connection.getInputStream());
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(bufferedInputStream));
StringBuilder buffer = new StringBuilder();
String inputLine;
while ((inputLine = bufferedReader.readLine()) != null) {
buffer.append(inputLine).append("\n");
}
bufferedReader.close();
System.out.println(buffer.toString());
Python's urllib, by contrast, should be sending a User-Agent of its own by default.

Reading HTML from URL in Java vs. Python

I'm trying to read the HTML from a particular URL and store it into a String for parsing. I referred to a previous post to help me out. When I print out what was read, all I get are special characters.
Here is my Java code (with try/catches left out) that reads from a URL and prints:
String path = "https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp";
URL url = new URL(path);
InputStream in = url.openStream();
BufferedReader bw = new BufferedReader(new InputStreamReader(in, "UTF-8"));
String line;
while ((line = bw.readLine()) != null) {
System.out.println(line);
}
Program output:
�ĘY106-6b1bd15200.jsonpmP�r� �Ƨ�!�%m�vD"��Ra*��w�%����ݳ�sβ��MK�d�9+%�m��l^��މ����:���� ���8B�Vce�.A*��x$FCo���a�b�<����Xy��m�c�>t����� �Z������Gx�o� �J���oKe�0�5�kGYpb�*l����+|�U���-�N3��jBp�R�z5Cۥjh��o�;�~)����~��)~ɮhy��<c,=;tHW���'�c�=~�w���
Expected output:
window.page106_callback(["<div class=\"newpage\" id=\"page106\" style=\"width: 902px; height:1273px\">\n<div class=image_layer style=\"z-index: 1\">\n<div class=ie_fix>\n<img class=\"absimg\" style=\"left:18px;top:27px;width:860px;height:1077px;clip:rect(1px 859px 1076px 1px)\" orig=\"http://html.scribd.com/913q5pjrsw60h9i4/images/106-6b1bd15200.jpg\"/>\n</div>\n</div>\n</div>\n\n"]);
At first, I thought it was an issue with permissions or something that somehow encrypted the stream, but my friend wrote a small Python script to do the same thing and it worked, thereby ruling this out. This is what he wrote:
import requests
link = 'https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp'
f = requests.get(link)
text = (f.text)
print(text)
So the question is, why is the Java version unable to correctly read and print from this particular URL? Note that I tried testing some other URLs from various websites and those worked fine. Maybe I should learn Python.
The response is gzip-encoded. You can do:
InputStream in = new GZIPInputStream(con.getInputStream());
@Maurice Perry is right; I tried it with the code below:
String url = "https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp";
URL obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
BufferedReader in = new BufferedReader(
new InputStreamReader(new GZIPInputStream(con.getInputStream())));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
in.close();
System.out.println(response.toString());
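Wrapping every response in GZIPInputStream fails when a server answers uncompressed, so it can help to decide based on the Content-Encoding header. The following is a rough sketch of that idea, exercised against an in-memory gzip payload rather than the real URL. The names BodyDecoder, readBody and gzip are hypothetical, and decoding as UTF-8 is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class BodyDecoder {
    // Reads the whole stream, gunzipping first when the Content-Encoding
    // value is "gzip", then decodes the bytes as UTF-8 (assumed charset).
    static String readBody(InputStream raw, String contentEncoding) throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw;
        try (InputStream body = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = body.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Demo helper: gzips a string in memory to stand in for a server response.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("window.page106_callback([\"h\u00e9llo\"]);");
        System.out.println(readBody(new ByteArrayInputStream(compressed), "gzip"));
    }
}
```

With a live connection the call would look like BodyDecoder.readBody(con.getInputStream(), con.getContentEncoding()). Decoding the bytes in one step at the end also avoids dropping line breaks, which the readLine loop does.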

How to post with java without urlencoding query part of url

I am trying to use a signed Java applet to post to a URL like:
http://some.domain.com/something/script.asp?param=5041414F9015496EA699F3D2DBAB4AC2|178411|163843|557|1|1|164||attempt|1630315
But when Java makes the connection, the Java console shows:
network: Connecting http://some.domain.com/something/script.asp?param=5041414F9015496EA699F3D2DBAB4AC2%7C178411%7C163843%7C557%7C1%7C1%7C164%7C%7Cattempt%7C1630315
I do not want Java to URL-encode the pipes in the query from | to %7C. It seems the service I'm connecting to doesn't URL-decode the param, and I can't change the server-side code. Is there a way in Java to make the POST without escaping the query?
The Java code I'm using is below:
try {
URL url = new URL(myURL);
URLConnection connection = url.openConnection();
connection.setDoOutput(true);
OutputStreamWriter out = new OutputStreamWriter(
connection.getOutputStream());
out.write(toSend);
out.close();
BufferedReader in = new BufferedReader(
new InputStreamReader(
connection.getInputStream()));
String decodedString = "";
while ((decodedString = in.readLine()) != null) {
totalResponse = totalResponse + decodedString;
}
in.close();
} catch (Exception ex) {
}
Thank you for any help!
The URL class does not do any encoding. Testing this on my dev server confirmed this suspicion. Your code must be encoding the '|' character somewhere before the snippet you included in your question.
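The claim about the URL class can be checked without any network access. The sketch below only inspects what a java.net.URL object stores and compares it with URLEncoder, which does escape the pipe. It says nothing about what an applet plugin or proxy layer may do later. The class name PipeDemo and its method names are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;

final class PipeDemo {
    // java.net.URL keeps the query string exactly as it was given.
    // (The URL(String) constructor is deprecated in recent JDKs but still works.)
    static String queryOf(String spec) throws MalformedURLException {
        return new URL(spec).getQuery();
    }

    // URLEncoder, by contrast, percent-encodes '|' as %7C.
    static String encoded(String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String spec = "http://some.domain.com/something/script.asp?param=A|178411|1";
        System.out.println(queryOf(spec));           // pipes left untouched
        System.out.println(encoded("A|178411|1"));   // pipes escaped
    }
}
```

If the pipes arrive escaped, a call like URLEncoder.encode (or a URI built from separate components) somewhere earlier in the code is a likely place to look.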

Get raw text from html

I'm at quite a basic level of Android development.
I would like to get text from a page such as "http://www.google.com". (The page I will be using will only have text, so no pictures or anything like that.)
So, to be clear: I want to get the text written on a page into, for example, a String in my application.
I tried this code, but I'm not even sure it does what I want.
URL url = new URL("http://www.google.com");
URLConnection connection = url.openConnection();
// Get the response
BufferedReader rd = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line = "";
I can't get any text from it, though. How should I do this?
From the sample code you gave, you are not even reading the response from the request. I would get the HTML with the following code:
URL u = new URL("http://www.google.com");
URLConnection conn = u.openConnection();
BufferedReader in = new BufferedReader(
new InputStreamReader(
conn.getInputStream()));
StringBuffer buffer = new StringBuffer();
String inputLine;
while ((inputLine = in.readLine()) != null)
buffer.append(inputLine);
in.close();
System.out.println(buffer.toString());
From there you would need to pass the string into some kind of HTML parser if you want only the text. From what I've heard, JTidy is a good library for this; however, I have never used any Java HTML parsing libraries.
Do you want to extract text from an HTML file? You can use a specialized tool such as the Jericho HTML Parser library. I'm not sure whether it can be used directly in an Android app, as it is quite big, but it is open source, so you can take only the code you need for your task.
Here is one way:
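public String scrape(String urlString) throws Exception {
URL url = new URL(urlString);
URLConnection connection = url.openConnection();
// try-with-resources closes the reader and the underlying stream, even on failure
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
connection.getInputStream()))) {
// StringBuilder avoids re-copying the whole text on every line
StringBuilder data = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
data.append(line).append("\n");
}
return data.toString();
}
}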
Here is another.

Getting the source code for the following page using Java

I am trying to get the source code for the following page: http://www.amazon.com/gp/offer-listing/082470732X/ref=dp_olp_0?ie=UTF8&redirect=true&condition=all
(Please note that Amazon takes you to another page if you click on the link. To get to the page that I am interested in reading please copy the link and paste it to an empty tab in your browser. Thanks!)
Normally using java.net API, I can get the source code for most of the URLs with almost no problem, however for the above link I get nothing. It turned out that the input stream generated by the connection is encoded by gzip, so I tried the following:
URL url = new URL(urlString);
HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
InputStream is = urlConnection.getInputStream();
HttpURLConnection.setFollowRedirects(true);
urlConnection.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = urlConnection.getContentEncoding();
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
is = new GZIPInputStream(is);
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
is = new InflaterInputStream((is), new Inflater(true));
}
However this time I get the following error deterministically:
java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:249)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:239)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:142)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:67)
at domain.logic.ItemScraper.loadURL(ItemScraper.java:405)
at domain.logic.ItemScraper.main(ItemScraper.java:510)
Can anybody see my mistake? Is there another way to read this particular page? Can somebody explain to me why my browser (Firefox) can read it, but I cannot read the source using Java?
Thanks in advance, best regards,
Instead of
is = new GZIPInputStream(is);
try
is = new GZIPInputStream(urlConnection.getInputStream());
As for the EOFException, if you add
urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24");
it would go away.
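One more detail about the question's snippet: it calls getInputStream() before setRequestProperty(), and URLConnection.setRequestProperty throws IllegalStateException once the connection is open. Request headers therefore have to be set first. A minimal sketch of that ordering follows; the class name RequestSetup, the method prepare and the example host are placeholders, not taken from the original code.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class RequestSetup {
    // Configures all request headers while the connection is still unopened.
    // openConnection() only creates the object; nothing is sent until
    // connect(), getInputStream() or a similar call is made.
    static HttpURLConnection prepare(String urlString, String userAgent) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(urlString).openConnection();
        con.setInstanceFollowRedirects(true);
        con.setRequestProperty("User-Agent", userAgent);
        con.setRequestProperty("Accept-Encoding", "gzip");
        return con; // caller may now call getInputStream()
    }

    public static void main(String[] args) throws IOException {
        // example.invalid is a placeholder host; no request is sent here.
        HttpURLConnection con = prepare("http://example.invalid/", "Mozilla/5.0");
        System.out.println(con.getRequestProperty("Accept-Encoding"));
    }
}
```

Only after prepare() returns would the caller read con.getInputStream() and, if con.getContentEncoding() reports gzip, wrap that stream in a GZIPInputStream.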
You can use a standard BufferedReader to read the response of a webserver of a given URL.
URLIn = new BufferedReader(new InputStreamReader(new URL(URLOrFilename).openStream()));
Then use ...
while ((incomingLine = URLIn.readLine()) != null) {
...
}
... to get the response.
