I'm trying to read a file name from XML, whose encoding can change.
The file name in the XML contains accented characters such as "ç", which my code is supposed to read as "ç". However, I keep getting "Ã§".
Similar problem for other accented characters, e.g. "á" coming out as "Ã¡".
Below is my code:
Socket s = new Socket();
InputStream is = s.getInputStream();
// buf and rlen hold the bytes already read from the socket stream
ByteArrayInputStream bAis = new ByteArrayInputStream(buf, 0, rlen);
BufferedReader bReader = new BufferedReader(new InputStreamReader(bAis, "ISO-8859-1"));
String theStringINeed = bReader.readLine();
Any help would be appreciated.
new InputStreamReader( hbis, "ISO-8859-1" )
If you lie about the encoding of a file, bad things will happen.
You need to read the file using the encoding it was actually written in, which is probably UTF-8: "Ã§" is exactly what the UTF-8 bytes for "ç" (0xC3 0xA7) look like when decoded as ISO-8859-1.
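A minimal sketch of the fix, reusing bAis from the question and assuming the bytes really are UTF-8:
import java.nio.charset.StandardCharsets;
...
BufferedReader bReader = new BufferedReader(
        new InputStreamReader(bAis, StandardCharsets.UTF_8));
String theStringINeed = bReader.readLine();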
I created a program which can load local or remote log files.
If I load a local file, there is no error.
But if I first copy the file to my local machine with SCP (using this code: http://www.jcraft.com/jsch/examples/ScpFrom.java.html) and then read it, I get an error and the letters "ü/ä/ö" are shown as �.
How can I fix this?
Remote: Linux server
Local: Windows PC
Code for SCP:
http://www.jcraft.com/jsch/examples/ScpFrom.java.html
Code for reading the file:
protected void openTempRemoteFile() throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(lfile)));
    String strLine;
    DefaultTableModel dtm = new DefaultTableModel(0, 0);
    String[] header = new String[]{ "Timestamp", "Session-ID", "Log" };
    dtm.setColumnIdentifiers(header);
    table.setModel(dtm);
    while ((strLine = reader.readLine()) != null) {
        String[] sparts = strLine.split(" ");
        String[] bparts = strLine.split(" : ");
        String timestamp = sparts[0] + " " + sparts[1];
        String sessionID = sparts[4];
        String log = bparts[1];
        dtm.addRow(new Object[] { timestamp, sessionID, log });
    }
    reader.close();
}
EDIT:
Encoding Format of the Local-Files: UTF-8
Encoding Format of the SCP-Remote-Files from Linux-Server: WINDOWS-1252
Supply an appropriate Charset to the InputStreamReader constructor, e.g.:
import java.nio.charset.StandardCharsets;
...
BufferedReader reader = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(lfile),
                StandardCharsets.UTF_8)); // try also ISO_8859_1 if UTF_8 doesn't help
To fix your problem you have at least two options:
You can specify the encoding for your files directly in your code, updating it as follows:
BufferedReader reader = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(lfile),
                "UTF8"
        )
);
or set the default file encoding when starting the JVM with:
java -Dfile.encoding=UTF-8 … com.example.Main
I definitely prefer the first way, and you can parametrize the "UTF8" value too if you need to; with the latter approach you could still face the same issue whenever you forget to pass that flag.
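For instance, a sketch of parametrizing the encoding (the log.encoding property name here is just a made-up example):
String encodingName = System.getProperty("log.encoding", "UTF-8"); // hypothetical property, defaulting to UTF-8
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(lfile), encodingName));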
You can replace the encoding with whatever you prefer (refer to https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html for the supported encodings); on Windows, "Cp1252" is usually the default encoding.
Remember, you can always query the file.encoding property or Charset.defaultCharset() to find the current default encoding for your application, e.g.:
byte[] byteArray = "blablabla".getBytes();
InputStream inputStream = new ByteArrayInputStream(byteArray);
InputStreamReader reader = new InputStreamReader(inputStream);
String defaultEncoding = reader.getEncoding(); // the encoding this reader actually uses
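Or query the default directly, without building a reader first:
System.out.println(System.getProperty("file.encoding")); // e.g. "Cp1252" on many Windows setups
System.out.println(Charset.defaultCharset());            // the same default, as a Charset object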
Working with encodings is a very tricky thing. If your system regularly has to deal with files like this (coming from a different environment), then you should first detect the charset and only then read the file with it. I had a similar problem and used juniversalchardet to detect the charset, then read with InputStreamReader(stream, Charset).
In your case it would look like this:
protected void openTempRemoteFile() throws IOException {
    String encoding = UniversalDetector.detectCharset(lfile); // may return null if detection fails
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(lfile), Charset.forName(encoding)));
    ....
If it is only a one-time job, open the file in a text editor (Notepad++, for example) and save it in the encoding you want. Then use it in your program.
I am downloading the source code of an HTML webpage and writing it back to a txt file. The output on the terminal looks correct, but when I write it to a file and open that file in gedit, the contents look something like this:
<^#!^#D^#O^#C^#T^#Y^#P^#E^# ^#h^#t^#m^#l^# ^#P^#U^#B^#L^#I^#C^# ^#"^#-^#/^#/^#W^#3^#C^#/^#/^#D^#T^#D^# ^#X^#H^#T^#M^#L^# ^#1^#.^#0^# ^#T^#r^#a^#n^#s^#i^#t^#i^#o^#n^#a^#l^
I am reading the file line by line using a BufferedReader, something like this:
URL oracle = new URL("http://example.com");
BufferedReader in = new BufferedReader(
        new InputStreamReader(oracle.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
{
    // appending to get the complete html string
}
Then I am writing the contents using PrintWriter.
PrintWriter pout = new PrintWriter("output.txt");
pout.write(html); // here html is the appended html string
pout.close();
Can someone help me with this?
While reading the URL you need to set the encoding to UTF-8, and while writing back you should again state that your encoding is UTF-8. The default encoding is your system's encoding and might not handle the Unicode characters well. Both InputStreamReader and OutputStreamWriter accept an encoding as an argument, so you might want to replace your PrintWriter with an OutputStreamWriter.
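A minimal sketch of both sides, assuming the page really is served as UTF-8 (the URL and file name are the placeholders from the question):
URL oracle = new URL("http://example.com");
BufferedReader in = new BufferedReader(
        new InputStreamReader(oracle.openStream(), StandardCharsets.UTF_8));
StringBuilder html = new StringBuilder();
String inputLine;
while ((inputLine = in.readLine()) != null) {
    html.append(inputLine).append('\n');
}
in.close();
// write back with an explicit encoding instead of PrintWriter's platform default
Writer out = new OutputStreamWriter(new FileOutputStream("output.txt"), StandardCharsets.UTF_8);
out.write(html.toString());
out.close();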
I suggest using Apache Commons IO's IOUtils:
URL url = new URL("http://example.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
String contentType = connection.getContentType();
System.out.println("content-type: " + contentType);
IOUtils.copy(connection.getInputStream(), new FileOutputStream("/folder/fileName.html"));
This copies the raw bytes straight to the file, so no character conversion can corrupt them.
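Note that IOUtils.copy does not close the streams for you; a variant with try-with-resources (Java 7+):
try (InputStream in = connection.getInputStream();
     OutputStream out = new FileOutputStream("/folder/fileName.html")) {
    IOUtils.copy(in, out);
}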
^# is a byte 0, so the file was written as UTF-16, which seems to be your system's default encoding.
Specify the encoding. The encoding from the header lines is decisive. If not specified, use the default Latin-1.
URL oracle = new URL("http://example.com");
URLConnection con = oracle.openConnection();
String encoding = con.getContentEncoding();
if (encoding == null || encoding.equalsIgnoreCase("ISO-8859-1")) {
encoding = "Windows-1252"; // Default is Latin-1, as Windows Latin-1
}
con.connect();
BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream(), encoding));
However, you might also have to consider a meta statement (e.g. <meta charset="UTF-8">) inside the HTML itself, which declares the encoding when the headers don't.
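Note also that the charset usually arrives in the Content-Type header rather than in Content-Encoding (which describes compression such as gzip). A sketch of pulling it out of a header like "text/html; charset=UTF-8":
String contentType = con.getContentType(); // e.g. "text/html; charset=UTF-8"
String charset = "Windows-1252";           // same fallback as above
if (contentType != null) {
    for (String param : contentType.split(";")) {
        param = param.trim();
        if (param.toLowerCase().startsWith("charset=")) {
            charset = param.substring("charset=".length());
        }
    }
}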
From an InputStream I am reading image data and converting it to a String. From the String I am writing directly to an image file, like this:
final BufferedReader reader = new BufferedReader(new InputStreamReader(in));
final char[] cbuf = new char[1024];
final int length = reader.read(cbuf);
String packet = new String(cbuf, 0, length);

File file = new File(fileName);
FileWriter fstream = new FileWriter(file);
BufferedWriter out = new BufferedWriter(fstream);
out.write(packet);
Please guide me on this issue; I am not getting the full image.
final BufferedReader reader = new BufferedReader(new InputStreamReader(in));
This decodes the input using the default encoding, potentially corrupting the data.
out.write(packet);
This encodes the characters using the default encoding, potentially corrupting the data.
Read the documentation for the APIs you use, and only convert between bytes and characters with a default or unknown encoding when you absolutely have to. Binary data such as an image should not go through a Reader/Writer at all; copy it as bytes.
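A minimal sketch of copying the image as raw bytes instead, reusing in and fileName from the question:
try (OutputStream os = new FileOutputStream(fileName)) {
    byte[] buffer = new byte[1024];
    int n;
    // loop until end of stream; a single read() is not guaranteed to return everything
    while ((n = in.read(buffer)) != -1) {
        os.write(buffer, 0, n);
    }
}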
See also: Read/convert an InputStream to a String
I have some bytes which should be UTF-8 encoded, but which may contain text in ISO-8859-1 encoding if the user somehow didn't manage to use his text editor the right way.
I read the file with an InputStreamReader:
InputStreamReader reader = new InputStreamReader(
new FileInputStream(file), Charset.forName("UTF-8"));
But every time the user uses umlauts like "ä", which produce invalid UTF-8 when stored as ISO-8859-1, the InputStreamReader does not complain but inserts placeholder characters.
Is there a simple way to make this throw an exception on invalid input?
// a CharsetDecoder lets you choose REPORT instead of the REPLACE behaviour the reader uses by default
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
InputStreamReader reader = new InputStreamReader(
        new FileInputStream(file), decoder);
// read()/readLine() now throw MalformedInputException on invalid bytes
Simply add .newDecoder(): a freshly created CharsetDecoder defaults to CodingErrorAction.REPORT, so the resulting reader throws MalformedInputException instead of substituting placeholders:
InputStreamReader reader = new InputStreamReader(
new FileInputStream(file), Charset.forName("UTF-8").newDecoder());
I cannot read and write extended characters (French accented characters, for example) to a text file using the standard InputStreamReader methods shown in the Android API examples. When I read back the file using:
InputStreamReader tmp = new InputStreamReader(in);
BufferedReader reader = new BufferedReader(tmp);
String str;
while ((str = reader.readLine()) != null) {
...
the string read is truncated at the extended characters instead of at the end-of-line; the second half of the string then comes on the next line. I'm assuming that I need to persist my data as UTF-8, but I cannot find any examples of that, and I'm new to Java.
Can anyone provide me with an example or a link to relevant documentation?
Very simple and straightforward. :)
String filePath = "/sdcard/utf8_file.txt";
String UTF8 = "utf8";
int BUFFER_SIZE = 8192;

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), UTF8), BUFFER_SIZE);
// note: don't open the writer on a file you are still reading; FileOutputStream truncates the file on open
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(filePath), UTF8), BUFFER_SIZE);
When you instantiate the InputStreamReader, use the constructor that takes a character set.
InputStreamReader tmp = new InputStreamReader(in, "UTF-8");
And do a similar thing with the OutputStreamWriter.
I like to have a
public static final Charset UTF8 = Charset.forName("UTF-8");
in some utility class in my code, so that I can call (see more in the Doc)
InputStreamReader tmp = new InputStreamReader(in, MyUtils.UTF8);
and not have to handle UnsupportedEncodingException every single time.
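On Java 7 and later (and on modern Android) the JDK already ships such a constant, so the utility class isn't needed:
import java.nio.charset.StandardCharsets;
...
InputStreamReader tmp = new InputStreamReader(in, StandardCharsets.UTF_8);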
This should just work on Android, even without explicitly specifying UTF-8, because the default charset is UTF-8. If you can reproduce this problem, please raise a bug with a reproducible test case here:
http://code.google.com/p/android/issues/entry
If you face this kind of problem, try encoding and decoding your data as Base64. This worked for me. I can share the code if you need it; a sketch is below.
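A minimal sketch of that idea with java.util.Base64 (Java 8+): the text travels as plain ASCII, so intermediate steps cannot garble the accented characters, as long as both ends agree on UTF-8.
import java.nio.charset.StandardCharsets;
import java.util.Base64;

String original = "déjà vu";
// sender: UTF-8 bytes -> Base64 ASCII string
String wire = Base64.getEncoder().encodeToString(original.getBytes(StandardCharsets.UTF_8));
// receiver: Base64 ASCII string -> UTF-8 bytes -> String
String roundTripped = new String(Base64.getDecoder().decode(wire), StandardCharsets.UTF_8);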
Check the encoding of your file by right-clicking it in the Project Explorer and selecting Properties. If it's not the right encoding, you'll need to re-enter your special characters after you change it; at least that was my experience.