Problem with the encoding of a web page - java

I'm trying to get some information from a web page with the code below:
URL url = new URL(webpage);
URLConnection connection;
connection = url.openConnection();
BufferedReader in;
InputStreamReader inputStreamReader;
inputStreamReader = new InputStreamReader(connection.getInputStream(), "iso-8859-1");
in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
But I'm having a problem with the encoding when I read it. The page is in Spanish and contains symbols like "ñ" or "á". The header of the page source says it's in "iso-8859-1", and I've also tried "utf-8", but neither works. When I set the text I read from the URL on a TextView, it shows garbage in place of those symbols.
Any ideas?
Thanks!

I think you are creating the reader incorrectly:
inputStreamReader = new InputStreamReader(connection.getInputStream(), "iso-8859-1");
in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
The first statement is creating a Reader with the specified encoding, but the second one is ignoring the original Reader and creating a new one with the default encoding for your platform. You probably need to do this:
inputStreamReader = new InputStreamReader(connection.getInputStream(), "iso-8859-1");
in = new BufferedReader(inputStreamReader);
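A minimal, self-contained sketch of that fix. The class and method names (ReaderWithCharset, readAll) are made up for illustration, and a byte array stands in for connection.getInputStream() so the example runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderWithCharset {

    // Reads the whole stream as text, decoding bytes with the given charset.
    static String readAll(InputStream stream, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        // The charset goes to the one InputStreamReader that the
        // BufferedReader actually wraps.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(stream, charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream(): n-tilde and a-acute
        // encoded as ISO-8859-1 (one byte each).
        byte[] page = {(byte) 0xF1, (byte) 0xE1};
        System.out.println(readAll(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1));
    }
}
```

Passing a Charset constant instead of the string "iso-8859-1" also avoids the checked UnsupportedEncodingException.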

Related

Fire multiple requests with URLConnection socket flush

What did I do wrong here? I want to send multiple messages over one HTTPS connection with URLConnection. The server only receives the first message.
URL url = new URL("https://example.com:443");
URLConnection connection = url.openConnection();
connection.setDoOutput(true);
BufferedWriter out = new BufferedWriter(
new OutputStreamWriter(connection.getOutputStream()));
out.write("Hello");
out.flush();
inReader = new BufferedReader(
new InputStreamReader(connection.getInputStream()));
out.write("Response");
out.flush();
inReader = new BufferedReader(
new InputStreamReader(connection.getInputStream()));
out.write("Response2");
out.flush();
inReader = new BufferedReader(
new InputStreamReader(connection.getInputStream()));
out.write("Response3");
out.flush();
inReader = new BufferedReader(
new InputStreamReader(connection.getInputStream()));
inReader.close();
out.close();
You can't. URLConnection is for HTTP, which is a stateless protocol. Not for your own stateful messaging protocol. One request and one response. If you want to send another message, get a new URLConnection. Connection pooling will probably happen behind the scenes.
You also need to read the response. Merely getting the input stream is not sufficient, and getting it multiple times is pointless.
Hard to see why you are writing "Response" in a request.
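A rough sketch of the "one request, one response, then a new URLConnection" pattern described above. The send helper and the class name are hypothetical, and using POST with a UTF-8 body is an assumption, since the question does not say what the server expects.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class OneRequestPerConnection {

    // Sends one message as the body of one HTTP request and returns the
    // body of the single response that belongs to it.
    static String send(URL url, String message) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST"); // assumption: the server accepts POST
        connection.setDoOutput(true);
        try (OutputStream out = connection.getOutputStream()) {
            out.write(message.getBytes(StandardCharsets.UTF_8));
        }
        // Actually read the response instead of only obtaining the stream.
        try (InputStream in = connection.getInputStream()) {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                body.write(chunk, 0, n);
            }
            return new String(body.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("https://example.com:443"); // endpoint from the question
        // A fresh URLConnection per message.
        for (String message : new String[] {"Hello", "Response", "Response2", "Response3"}) {
            System.out.println(send(url, message));
        }
    }
}
```

Reading each response to the end and closing its stream is what gives the runtime a chance to reuse the underlying socket for the next request.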

URLConnection didn't return the complete content of the file

My code looks like
URL oracle = new URL(calURL);
FileWriter overall = new FileWriter("overall.txt");
HttpURLConnection yc = (HttpURLConnection) oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
overall.append("\n"+inputLine);
}
It seems to return only half of the content, not the full content.
Note: calURL is dynamically generated.
calURL takes a long time to load, and I guess my stream started reading before it was ready. I added a timeout before opening the URL connection, and it is getting the full data now.
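A sketch of how the timeout and the file writing might look. The post does not show the actual fix, so the use of setConnectTimeout and setReadTimeout, the 30-second values, and the copyLines helper are all assumptions. Separately from any timeout, FileWriter buffers its output, so if it is never flushed or closed the tail of the data can be missing from the file; try-with-resources takes care of that here.

```java
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;

public class FullDownload {

    // Copies every line from source to target, ending each with a newline.
    static void copyLines(Reader source, Writer target) throws IOException {
        BufferedReader in = new BufferedReader(source);
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            target.append(inputLine).append('\n');
        }
        target.flush();
    }

    public static void main(String[] args) throws IOException {
        String calURL = args[0]; // the dynamically generated URL from the question
        HttpURLConnection yc = (HttpURLConnection) new URL(calURL).openConnection();
        yc.setConnectTimeout(30_000); // assumed value, in milliseconds
        yc.setReadTimeout(30_000);    // assumed value, in milliseconds
        // try-with-resources closes (and therefore flushes) the writer.
        try (Reader in = new InputStreamReader(yc.getInputStream());
             Writer overall = new FileWriter("overall.txt")) {
            copyLines(in, overall);
        }
    }
}
```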

Encoding Error while writing HTML to txt file

I am downloading the source of an HTML web page and writing it to a txt file. The output on the terminal looks correct, but after writing it to a file and opening the file in gedit, the contents look something like this:
<^#!^#D^#O^#C^#T^#Y^#P^#E^# ^#h^#t^#m^#l^# ^#P^#U^#B^#L^#I^#C^# ^#"^#-^#/^#/^#W^#3^#C^#/^#/^#D^#T^#D^# ^#X^#H^#T^#M^#L^# ^#1^#.^#0^# ^#T^#r^#a^#n^#s^#i^#t^#i^#o^#n^#a^#l^
I am reading the page line by line using BufferedReader, something like this:
URL oracle = new URL("http://example.com");
BufferedReader in = new BufferedReader(
new InputStreamReader(oracle.openStream()));
while ((inputLine = in.readLine()) != null)
{
// appending to get the complete html string
}
Then I am writing the contents using PrintWriter.
PrintWriter pout = new PrintWriter("output.txt");
pout.write(html); // here html is the appended html string
pout.close();
Can someone help me with this.
While reading the URL, you need to set the encoding to UTF-8, and while writing back, you should again specify UTF-8. Otherwise the default encoding is your system's encoding, which might not handle the Unicode characters well. Both InputStreamReader and OutputStreamWriter accept an encoding as an argument, so you might want to replace your PrintWriter with an OutputStreamWriter.
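A small sketch of the writing half of that advice. It assumes, as this answer does, that UTF-8 is the right encoding for the page; the class name WriteUtf8 and its write method are made up for the example.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WriteUtf8 {

    // Writes the text to the file as UTF-8, regardless of the platform default.
    static void write(Path file, String html) throws IOException {
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(html);
        }
    }

    public static void main(String[] args) throws IOException {
        // "output.txt" is the file name from the question; the content is a placeholder.
        write(Paths.get("output.txt"), "<!DOCTYPE html><html></html>");
    }
}
```

The reading side would pass the same charset to the InputStreamReader that wraps oracle.openStream().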
I would suggest using Apache Commons IO's IOUtils:
org.apache.commons.io.IOUtils.copy(connection.getInputStream(), new FileOutputStream(file));
URL url = new URL("http://example.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
String contentType = connection.getContentType();
System.out.println("content-type: " + contentType);
IOUtils.copy(connection.getInputStream(), new FileOutputStream("/folder/fileName.html"));
^# is a 0 byte, so you are reading with UTF-16, which seems to be your system default encoding.
Specify the encoding. The encoding from the header lines is decisive. If not specified, use the default Latin-1.
URL oracle = new URL("http://example.com");
URLConnection con = oracle.openConnection();
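// Note: getContentEncoding() returns the Content-Encoding header (e.g. "gzip"),
// not the charset; the charset is normally part of the Content-Type header.
String encoding = con.getContentEncoding();
if (encoding == null || encoding.equalsIgnoreCase("ISO-8859-1")) {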
encoding = "Windows-1252"; // Default is Latin-1, as Windows Latin-1
}
con.connect();
BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream(), encoding));
However, you might also need to consider a charset declared in an HTML meta tag.
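For the "encoding from the header lines" part, here is one possible way to pull the charset out of a Content-Type value such as the one returned by con.getContentType(). The helper name charsetOf is invented for this sketch, and the parsing is deliberately simple, so treat it as an illustration under those assumptions rather than a complete header parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html; charset=UTF-8", latin1));
        System.out.println(charsetOf("text/html", latin1));
    }
}
```

The result could then be passed straight to new InputStreamReader(con.getInputStream(), charset).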

Java decode special characters ¡ and ̧ becomes A?¡ and I?§

I'm trying to read a file name off XML, whose encoding can be changed.
The file name on the XML has string such as "̧oÌ" which is supposed to be read by my code as "̧oÌ". However, I keep getting I?§.
Similar problem for  and A?¡
Below is my code:
Socket s = new Socket();
InputStream is = s.getInputStream();
ByteArrayInputStream bAis = new ByteArrayInputStream(buf, 0, rlen);
BufferedReader bReader = new BufferedReader( new InputStreamReader( hbis, "ISO-8859-1" ));
String theStringINeed = bReader.readLine();
Any help would be appreciated.
new InputStreamReader( hbis, "ISO-8859-1" )
If you lie about the encoding of a file, bad things will happen.
You need to read the file using the encoding it was actually written in, which is probably UTF-8.
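To illustrate what "lying about the encoding" does, here is a small demonstration. It assumes the data really is UTF-8, which is only this answer's guess, and the sample characters and class name are chosen for the example.

```java
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {

    // Encodes text as UTF-8 but decodes the bytes as ISO-8859-1 (the mismatch).
    static String decodeAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes and decodes with the same charset, so the text survives.
    static String decodeAsUtf8(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00E1\u00E7"; // a-acute and c-cedilla
        // Each character is two bytes in UTF-8, and ISO-8859-1 turns every
        // byte into its own character, so the two characters become four.
        System.out.println(decodeAsLatin1(original).length());
        System.out.println(decodeAsUtf8(original).equals(original));
    }
}
```

In the question's code, the equivalent change would be to pass the file's real charset to the InputStreamReader instead of "ISO-8859-1".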

Get raw text from html

I'm at quite a basic level of Android development.
I would like to get text from a page such as "http://www.google.com". (The page I will be using will only have text, so no pictures or anything like that.)
So, to be clear: I want to get the text written on a page into, for example, a string in my application.
I tried this code, but I'm not even sure if it does what I want.
URL url = new URL("http://www.google.com");
URLConnection connection = url.openConnection();
// Get the response
BufferedReader rd = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line = "";
I can't get any text from it anyhow. How should I do this?
From the sample code you gave you are not even reading the response from the request. I would get the html with the following code
URL u = new URL("http://www.google.com");
URLConnection conn = u.openConnection();
BufferedReader in = new BufferedReader(
new InputStreamReader(
conn.getInputStream()));
StringBuffer buffer = new StringBuffer();
String inputLine;
while ((inputLine = in.readLine()) != null)
buffer.append(inputLine);
in.close();
System.out.println(buffer.toString());
From there you would need to pass the string into some kind of HTML parser if you want only the text. From what I've heard, JTidy is a good library for this; however, I have never used any Java HTML parsing libraries.
Do you want to extract text from an HTML file? You can make use of a specialized tool such as the Jericho HTML parser library. I'm not sure if it can be used directly in an Android app, as it is quite big, but it is open source, so you can take only the parts of its code that you need for your task.
Here is one way:
public String scrape(String urlString) throws Exception {
URL url = new URL(urlString);
URLConnection connection = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(
connection.getInputStream()));
String line = null, data = "";
while ((line = reader.readLine()) != null) {
data += line + "\n";
}
return data;
}
