Encoding Error while writing HTML to txt file

I am downloading the source code of an HTML web page and writing it to a txt file. The output on the terminal looks correct, but after writing it to a file and opening the file in gedit, the contents look something like this:
<^#!^#D^#O^#C^#T^#Y^#P^#E^# ^#h^#t^#m^#l^# ^#P^#U^#B^#L^#I^#C^# ^#"^#-^#/^#/^#W^#3^#C^#/^#/^#D^#T^#D^# ^#X^#H^#T^#M^#L^# ^#1^#.^#0^# ^#T^#r^#a^#n^#s^#i^#t^#i^#o^#n^#a^#l^
I am reading the page line by line using a BufferedReader, something like this:
URL oracle = new URL("http://example.com");
BufferedReader in = new BufferedReader(
new InputStreamReader(oracle.openStream()));
while ((inputLine = in.readLine()) != null)
{
// appending to get the complete html string
}
Then I am writing the contents using PrintWriter.
PrintWriter pout = new PrintWriter("output.txt");
pout.write(html); // here html is the appended html string
pout.close();
Can someone help me with this.

While reading the URL, you need to set the encoding to UTF-8, and while writing back, you should again specify that your encoding is UTF-8. Otherwise the default encoding is used, which could be your system's encoding and might not handle the Unicode characters well. Both InputStreamReader and OutputStreamWriter accept an encoding as a constructor argument, so you might want to replace your PrintWriter with an OutputStreamWriter (or pass the encoding to the PrintWriter constructor).
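A minimal sketch of that advice, assuming the page really is UTF-8. The class name Utf8Copy and the method save are made up for illustration and are not part of the original code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

class Utf8Copy {
    // Decodes the stream as UTF-8 and writes it back out as UTF-8,
    // so the platform default encoding is never involved.
    static void save(InputStream source, String fileName) throws IOException {
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(source, StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(fileName, "UTF-8")) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);
            }
        }
    }
}
```

It could be called as Utf8Copy.save(new URL("http://example.com").openStream(), "output.txt").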

I suggest using IOUtils from Apache Commons IO:
org.apache.commons.io.IOUtils.copy(connection.getInputStream(), new FileOutputStream(file));
URL url = new URL("http://example.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
String contentType = connection.getContentType();
System.out.println("content-type: " + contentType);
IOUtils.copy(connection.getInputStream(), new FileOutputStream("/folder/fileName.html"));
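If you would rather not add a dependency, the same byte-for-byte copy can be sketched with the JDK alone. Because the bytes are never decoded into characters, no charset can be applied wrongly. The class name RawCopy is hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class RawCopy {
    // Copies the bytes unchanged and returns how many bytes were written.
    static long save(InputStream source, Path target) throws IOException {
        try (InputStream in = source) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

For example: RawCopy.save(connection.getInputStream(), Paths.get("/folder/fileName.html")).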

^# is a zero byte, so you are reading with UTF-16, which seems to be your system default encoding.
Specify the encoding explicitly. The encoding given in the HTTP header lines is decisive. If none is specified, use the default, Latin-1.
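URL oracle = new URL("http://example.com");
URLConnection con = oracle.openConnection();
con.connect();
// The charset is a parameter of the Content-Type header, e.g. "text/html; charset=UTF-8".
// (getContentEncoding() returns Content-Encoding, e.g. gzip, not the charset.)
String encoding = null;
String contentType = con.getContentType();
int pos = contentType == null ? -1 : contentType.toLowerCase().indexOf("charset=");
if (pos >= 0) {
encoding = contentType.substring(pos + "charset=".length()).split(";")[0].replace("\"", "").trim();
}
if (encoding == null || encoding.equalsIgnoreCase("ISO-8859-1")) {
encoding = "Windows-1252"; // Default is Latin-1, as Windows Latin-1
}
BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream(), encoding));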
However, you might also need to consider a charset declared in a meta statement inside the HTML itself.
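A rough sketch of how the header and a meta statement could be combined. The class CharsetSniffer, its methods, and the simple regular expression are assumptions made for illustration. This is not a full HTML parser, and the final fallback is just the one suggested above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Matches charset=... in a Content-Type header value or in a <meta> tag.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first charset name found in the text, or null if there is none.
    static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // The HTTP header wins, then the start of the HTML, then the fallback.
    static String choose(String contentType, String htmlHead) {
        String cs = find(contentType);
        if (cs == null) {
            cs = find(htmlHead);
        }
        return cs != null ? cs : "Windows-1252";
    }
}
```

Here htmlHead would be the first kilobyte or so of the response decoded as ISO-8859-1, which is safe for sniffing because the meta statement itself is plain ASCII.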

Related

why Encoding in http request?

I am trying to learn how to send a request and retrieve data from a server over HTTP in Java. This is the code I found in the Oracle tutorial (Tutorial > Networking); the code is pasted at the bottom of the question.
Question 1: in out.write("string=" + stringToReverse); why isn't "string=" encoded like the stringToReverse variable?
String stringToReverse = URLEncoder.encode(args[1], "UTF-8");
Question 2:
There are two snippets below, one from the Oracle tutorial and the other from an Android Studio tutorial.
Code in the Oracle tutorial:
BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
Android tutorial code:
inputStream = urlConnection.getInputStream();
InputStreamReader inputStreamReader = new InputStreamReader(inputStream, Charset.forName("UTF-8"));
BufferedReader reader = new BufferedReader(inputStreamReader);
Why is Charset.forName("UTF-8") missing in the Oracle code?
Note: an explanation from the basics would be very useful :)
import java.io.*;
import java.net.*;
public class Reverse {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: java Reverse "
+ "http://<location of your servlet/script>"
+ " string_to_reverse");
System.exit(1);
}
String stringToReverse = URLEncoder.encode(args[1], "UTF-8");
URL url = new URL(args[0]);
URLConnection connection = url.openConnection();
connection.setDoOutput(true);
OutputStreamWriter out = new OutputStreamWriter(
connection.getOutputStream());
out.write("string=" + stringToReverse);
out.close();
BufferedReader in = new BufferedReader(
new InputStreamReader(
connection.getInputStream()));
String decodedString;
while ((decodedString = in.readLine()) != null) {
System.out.println(decodedString);
}
in.close();
}
}

Question 1:
There is no need to encode "string=" (as it does not contain any special characters as explained in https://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html)
Question 2:
The charset in the following example is not explicitly defined:
BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
Therefore the default charset is used (which may not be UTF-8):
Every instance of the Java virtual machine has a default charset,
which may or may not be one of the standard charsets. The default
charset is determined during virtual-machine startup and typically
depends upon the locale and charset being used by the underlying
operating system. (https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html)

In a URL, the string after the ? is called the query string:
example.com/users/profile?key1=value1&key2=value2
So for the above URL the query string is "key1=value1&key2=value2".
A query string contains key/value pairs which a server script can access. These key/value pairs are called request parameters and are separated by an &. Characters such as ?, & and space are called special characters in a URL because they are treated specially.
So what happens if value1 itself contains an & character? The server will inadvertently end the value before the &, in the example below after user1:
name=user1&23=hello&place=hyd
The above example will not work as expected.
That is why URL encoding is used to convert special characters such as &, ? and space into other, non-special characters when they appear in a query string. The server converts them back to their original form once the request is received.
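A small sketch of what that conversion looks like with URLEncoder, reusing the values from the example above. The class QueryDemo and the helper param are hypothetical names:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryDemo {
    // Encodes the key and the value separately; the "=" between them stays as-is.
    static String param(String key, String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(key, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The & inside the value becomes %26, so it no longer splits the parameter.
        System.out.println(param("name", "user1&23") + "&" + param("place", "hyd"));
        // prints name=user1%2623&place=hyd
    }
}
```

Only the & that separates the two parameters is left unencoded, which is exactly what lets the server tell the parameters apart.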
Now, coming to your question 1: strictly speaking, this is not URL encoding, because you are not sending string_to_reverse as a request parameter in the query string. As Jesper pointed out, you are sending it in the request body using the output stream.
Now question 2: the documentation of the URLEncoder class (http://docs.oracle.com/javase/7/docs/api/java/net/URLEncoder.html) states the following:
Utility class for HTML form encoding. This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format.
So HTML form data is posted as application/x-www-form-urlencoded, and in your case URLEncoder takes care of that. If no charset is specified, the default character set is used (see "How to Find the Default Charset/Encoding in Java?").
The name URL in the URLEncoder class is a little misleading, as it is not really used for encoding a URL here but for encoding the request body (string_to_reverse) as application/x-www-form-urlencoded.

Java - Problems with "ü/ä/ö" after SCP

I created a program which can load local or remote log files.
If I load a local file there is no error.
But if I first copy the file to my local machine with SCP (using this code: http://www.jcraft.com/jsch/examples/ScpFrom.java.html) and then read it, I get an error and the letters "ü/ä/ö" are shown as �.
How can I fix this?
Remote : Linux-Server
Local: Windows-PC
Code for SCP :
http://www.jcraft.com/jsch/examples/ScpFrom.java.html
Code for reading out :
protected void openTempRemoteFile() throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream( lfile )));
String strLine;
DefaultTableModel dtm = new DefaultTableModel(0, 0);
String header[] = new String[]{ "Timestamp", "Session-ID", "Log" };
dtm.setColumnIdentifiers(header);
table.setModel(dtm);
while ((strLine = reader.readLine()) != null) {
String[] sparts = strLine.split(" ");
String[] bparts = strLine.split(" : ");
String Timestamp = sparts[0] + " " + sparts[1];
String SessionID = sparts[4];
String Log = bparts[1];
dtm.addRow(new Object[] {Timestamp, SessionID, Log});
}
reader.close();
}
EDIT :
Encoding Format of the Local-Files: UTF-8
Encoding Format of the SCP-Remote-Files from Linux-Server: WINDOWS-1252

Supply the appropriate Charset to the InputStreamReader constructor, e.g.:
import java.nio.charset.StandardCharsets;
...
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream( lfile ),
StandardCharsets.UTF_8)); // try also ISO_8859_1 if UTF_8 doesn't help.
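Since the edit to the question says the copied files are WINDOWS-1252, here is a small sketch of how much the charset argument matters. The class Cp1252Read and the method firstLine are made-up names, and the sample bytes are just "üäö" as Windows-1252 would store them:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class Cp1252Read {
    // Reads the first line of the stream, decoding the bytes with the given charset.
    static String firstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }
}
```

With the bytes 0xFC 0xE4 0xF6, passing Charset.forName("windows-1252") yields "üäö", while StandardCharsets.UTF_8 yields replacement characters, because those bytes are not valid UTF-8.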

To fix your problem you have at least two options:
You can specify the encoding for your files directly in your code, updating it as follows:
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream( lfile ),
"UTF8"
)
);
or set the default file encoding when starting the JVM with:
java -Dfile.encoding=UTF-8 … com.example.Main
I definitely prefer the first way, and you can parametrize the "UTF8" value too if you need to.
With the second way you could still face the same issue whenever you forget to pass that option.
You can replace the encoding with whatever you prefer (refer to https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html for the supported encodings); on Windows, "Cp1252" is usually the default encoding.
Remember, you can always query the file.encoding property or Charset.defaultCharset() to find the current default encoding of your application, e.g.:
byte[] byteArray = "blablabla".getBytes();
InputStream inputStream = new ByteArrayInputStream(byteArray);
InputStreamReader reader = new InputStreamReader(inputStream);
String defaultEncoding = reader.getEncoding(); // name of the default charset used by this reader

Working with encodings is very tricky. If your system regularly processes this kind of file (from a different environment), then you should first detect the charset and then read the file with the detected charset. I had a similar problem and used
juniversalchardet
to detect the charset, and then used InputStreamReader(stream, Charset).
In your case it would look like this:
protected void openTempRemoteFile() throws IOException {
String encoding = UniversalDetector.detectCharset(lfile);
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream( lfile ), Charset.forName(encoding)));
....
If it is only a one-time job, then open the file in a text editor (Notepad++, for example) and save it in your encoding. Then use it in your program.

Java decode special characters Â¡ and Ì§ becomes A?¡ and I?§

I'm trying to read a file name from XML, whose encoding can be changed.
The file name in the XML contains a string such as "Ì§oÌ", which is supposed to be read by my code as "Ì§oÌ". However, I keep getting I?§.
Similar problem for Â and A?¡
Below is my code:
Socket s = new Socket();
InputStream is = s.getInputStream();
ByteArrayInputStream bAis = new ByteArrayInputStream(buf, 0, rlen);
BufferedReader bReader = new BufferedReader( new InputStreamReader( hbis, "ISO-8859-1" ));
String theStringINeed = bReader.readLine();
Any help would be appreciated.

new InputStreamReader( hbis, "ISO-8859-1" )
If you lie about the encoding of a file, bad things will happen.
You need to read the file using the encoding it was actually written in, which is probably UTF-8.
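To illustrate, here is a small sketch of how this kind of garbage arises, assuming the data really is UTF-8. The class Mojibake and its methods are hypothetical. The sample characters are chosen because their UTF-8 bytes, read as ISO-8859-1, give exactly the sequences from the question:

```java
import java.nio.charset.StandardCharsets;

class Mojibake {
    // Encodes as UTF-8, then decodes the same bytes with the wrong charset.
    static String decodeWrong(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes as UTF-8 and decodes with the matching charset.
    static String decodeRight(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

decodeWrong("\u00a1") (the character ¡) gives "Â¡", and decodeWrong("\u0327") (a combining cedilla) gives "Ì§", whereas decodeRight returns the original text unchanged.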

Double \\ appears while reading from internet

I'm reading some information from an external server to which I have no access and whose encoding I don't know, and I'm having some problems with characters like í. I make a POST request using the code below and afterwards parse the response.
String response = "";
URL url = new URL(pURL);
URLConnection uc = url.openConnection();
if (sid!=null) uc.setRequestProperty("Cookie", sid);
uc.setDoOutput(true);
OutputStreamWriter osw = new OutputStreamWriter(uc.getOutputStream());
osw.write(request);
osw.flush();
InputStreamReader isr = new InputStreamReader(uc.getInputStream(), "UTF8");
BufferedReader br = new BufferedReader(isr);
String content;
while ((content = br.readLine())!=null){
response += content;
}
br.close();
osw.close();
At this point, if I print the string it shows an extra \\. That is, for í, instead of \u00ed it contains \\\u00ed, and if I convert the response string to a char array I can see that, instead of being converted correctly, it is split into 6 chars: \\\\, u, 0, 0, e, d.
I've tried changing the encoding passed to the InputStreamReader, replacing characters and using some regexes, and none of it worked. Has anyone had this problem and can help me?
Thank you very much.

Not sure why the response is formatted that way, but you could convert strings containing \u00ed into í using StringEscapeUtils from Apache Commons as follows:
String input = "\\u00ed";
String unescaped = StringEscapeUtils.unescapeJava(input);
System.out.println(unescaped);
Output:
í
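If pulling in a library is not an option, a dependency-free sketch of the same idea could look like the following. The class UnicodeUnescape is a made-up name, and it only handles \uXXXX sequences, not the other Java escapes that unescapeJava covers:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each backslash-u escape with the character it denotes.
    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Text without any escape sequences passes through unchanged.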

response = response.replace("\\\\", "\\"); // collapse each doubled backslash into a single one

Problem with the encoding of a web page

I'm trying to get some information from a web page with the code below:
URL url = new URL(webpage);
URLConnection connection;
connection = url.openConnection();
BufferedReader in;
InputStreamReader inputStreamReader;
inputStreamReader = new InputStreamReader(connection.getInputStream(), "iso-8859-1");
in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
But I'm having a problem with the encoding when I read it. The page is in Spanish and contains symbols like "ñ" or "á". The header of the page source says that it is in "iso-8859-1"; I've also tried "utf-8", but neither works. When I set the text I read from the URL on a TextView, it just shows garbage in place of those symbols.
Any ideas?
Thanks!

I think you are creating the reader incorrectly
inputStreamReader = new InputStreamReader(connection.getInputStream(), "iso-8859-1");
in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
The first statement is creating a Reader with the specified encoding, but the second one is ignoring the original Reader and creating a new one with the default encoding for your platform. You probably need to do this:
inputStreamReader = new InputStreamReader(connection.getInputStream(), "iso-8859-1");
in = new BufferedReader(inputStreamReader);
