Java - Unable to read foreign characters

I have successfully used the ISO8859-13 character encoding before, but this time it doesn't seem to be working.
Based on the web site https://en.wikipedia.org/wiki/ISO/IEC_8859-13, the character in question is valid in that encoding.
These are the 3 characters stored in the file.
äää
Here is the code being used.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadFile
{
    public static void main(String[] arguments)
    {
        try
        {
            File inFile = new File("C:\\Downloads\\MyFile.txt");
            if (inFile.exists())
            {
                System.out.println("File found");
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(new FileInputStream(inFile), "ISO8859-13"));
                String line = null;
                while ((line = in.readLine()) != null)
                {
                    System.out.println("Line Read: >" + line + "<");
                }
            }
            else
            {
                System.out.println("File not found");
            }
        }
        catch (IOException e)
        {
        }
    }
}
The output on both Windows and Linux, with and without Eclipse, is the same:
Line Read: >?¤?¤?¤<
This previously worked for a number of other characters, so why not for this one?

There are many possible explanations for what you are observing. Here are the two most likely ones, along with some code you can use to confirm that you've found the cause:
Option #1: Terminal issues
Maybe you are writing to a terminal that either cannot render ä, or there is a terminal transfer issue. Terminals are, in the end, just a bunch of streams and pipes hooked together; it is all bytes under the hood, so if one part of the chain thinks everyone agreed that the bytes are UTF-8 encoded text and another part assumes ISO-8859-13, you get problems. Given that you see the exact same output on Windows as on Linux, this is unlikely (it would be more likely if you were only seeing this in the 'console' view of an IDE, or if the same code produced different output on different systems). If you want to test it, run instead: System.out.println("unicode codepoint of the first character: " + (int) line.charAt(0)); - this should print 228, which is the Unicode codepoint for ä. If it doesn't, then you can be certain this isn't the (only) problem.
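A self-contained version of that check might look like this (the class name is made up; the path and charset name are the ones from the question):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class CodepointCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical diagnostic class; path and charset taken from the question.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("C:\\Downloads\\MyFile.txt"), "ISO8859-13"))) {
            String line = in.readLine();
            if (line != null && !line.isEmpty()) {
                // 228 is U+00E4 (ä); anything else means the terminal is not the (only) problem
                System.out.println("unicode codepoint of the first character: " + (int) line.charAt(0));
            }
        }
    }
}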
If this is it, the fix is to, well, use another terminal or tweak its settings. I'd just ask another SO question and give plenty of detail about your setup (which OS, which terminal client, what SET prints, whether the client has encoding options, etcetera).
Option #2: It's not actually ISO-8859-13
This, too, is simple to test: comment out your BufferedReader in = ... line and replace it with: System.out.println(new FileInputStream(inFile).read()); - this should print 228. If it prints anything else, your input file is not actually ISO-8859-13.
If this is it, find out what the encoding actually is and use that instead. For example, in UTF-8 encoding, ä ends up as 2 bytes in a file. That already implies that your input file, containing just äää and not even a newline afterwards, would be 6 bytes long (in ISO-8859-13 it would be 3), and that the raw bytes, as you read them with fileInputStream.read(), are, in order: 195 164 195 164 195 164. So, if you run the above code and it prints 195 instead of 228, your input is probably UTF-8; it is definitely not ISO-8859-13.
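And a self-contained sketch of that byte-level check (again, the class name is made up and the path is the one from the question):
import java.io.FileInputStream;

public class RawByteCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical diagnostic: dump the first few raw bytes of the file.
        // ISO-8859-13 'ä' is the single byte 228; UTF-8 'ä' is the byte pair 195 164.
        try (FileInputStream in = new FileInputStream("C:\\Downloads\\MyFile.txt")) {
            for (int i = 0; i < 6; i++) {
                int b = in.read();
                if (b == -1) break;
                System.out.println("byte " + i + ": " + b);
            }
        }
    }
}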

Related

File writer is writing the string with no "Line breaks"

It is a simple file-writing program that scans a text file from a given directory and stores its content in a string initialized as words. When the content is concatenated into the string, the spaces and the line breaks are preserved. However, when I attempt to write that string words into a file, the spaces are preserved but the line breaks are lost.
For example:
Original String:
Hello. I am stuck in UMT.
Get me out of here.
File Writing:
Hello, I am stuck in UMT. Get me out of here.
Notice the new line is not preserved? No line break? No "\n"?
package testing;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Reader {
    public static void main(String[] args) {
        String words = "this is a a new file created by me :D \n";
        File file = new File("C:/file.txt");
        try {
            Scanner scan = new Scanner(file);
            while (scan.hasNextLine()) {
                words = words.concat(scan.nextLine() + "\r ");
            }
        } catch (FileNotFoundException ex) {
            Logger.getLogger(Reader.class.getName()).log(Level.SEVERE, null, ex);
        }
        System.out.println(words);
        try {
            FileWriter writer = new FileWriter("D:/newfile.txt");
            writer.write(words);
            writer.close();
        } catch (IOException ex) {
            Logger.getLogger(Reader.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
\r is a carriage return and \n is a new line, but as a general rule you have to look at how your file was produced and encoded, because the new-line convention can differ from one environment to another.
CR, LF, and CRLF are all commonly used end-of-line sequences. In the world of PCs, CRLF is common among Windows apps, CR was common on the Classic Mac OS, and LF is common on modern macOS and Unix-oriented OSes (BSD, Linux, etc.). See Wikipedia.
Some more info
If you are facing problems like these, keep in mind that different systems and editors can use different notations for the new line, and different character encodings on top of that. For example, Word may use UTF-8 while Notepad uses ISO-8859-1, and your system default may be set to yet another encoding; there is no guarantee that every combination agrees on the same new-line convention. When you use the \n character you are writing your own system's idea of a new line, so if you are working on Windows this can produce a new line that you can see in Word, but if you open the same file in a text editor on a Mac you may see no new lines at all if the conventions differ.
Try:
words= words.concat(scan.nextLine() + "\r\n" );
nextLine() returns each line without its line break, so you have to insert one manually.
Also, on Linux and macOS machines the line separator is \n, but on most Windows machines it is \r\n.
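Putting the two suggestions together, a minimal sketch of the corrected copy loop might look like this (the paths are the ones from the question; BufferedWriter.newLine() writes the platform line separator, or you could write "\r\n" explicitly):
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class LineCopy {
    public static void main(String[] args) throws IOException {
        // Hypothetical helper; paths follow the ones in the question.
        File source = new File("C:/file.txt");
        try (Scanner scan = new Scanner(source);
             BufferedWriter writer = new BufferedWriter(new FileWriter("D:/newfile.txt"))) {
            while (scan.hasNextLine()) {
                writer.write(scan.nextLine());
                // newLine() appends the platform line separator, so the breaks survive
                writer.newLine();
            }
        }
    }
}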

java utf8 encoding outputstream not working

I need to write a program which is able to write UTF-8 data to a file.
I found examples on the internet; however, I am not able to get the desired result.
Code:
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class UTF8WriterDemo {
    public static void main(String[] args) {
        Writer out = null;
        try {
            out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream("c://java//temp.txt"), "UTF-8"));
            String text = "This texáát will be added to File !!";
            out.write(text);
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Everything runs successfully, but in the resulting file I see the special characters not showing properly:
This texáát will be added to File !!
I tried several examples from the internet with the same result.
I use Visual Studio code.
Where could the problem be, please?
Thank you
Your code is correct. You probably already have a file named temp.txt, and therefore Java writes the text to the existing file (replacing the previous content). What can be a problem is the encoding that is already associated with that file.
In other words, you can't (or at least shouldn't) write UTF-8 text into a file that is treated as, for example, WINDOWS-1250 encoded, or you will get exactly the result you have described.
If you didn't have this file, Java would automatically create a file with UTF-8 encoding.
Possible solutions:
Change the encoding of your current file (usually you can open it in any text editor, use Save As, and then specify the encoding as UTF-8).
Remove this file and Java will create it automatically with proper encoding.
By the way, you should use the StandardCharsets class instead of a String charset name, in order to avoid having to handle UnsupportedEncodingException:
new OutputStreamWriter(new FileOutputStream("temp.txt"), StandardCharsets.UTF_8)
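A fuller sketch of that suggestion, using try-with-resources so the writer is always closed (the class name is just a placeholder; the file name matches the line above):
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8WriterSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical class name; try-with-resources closes the writer even if write() throws.
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("temp.txt"), StandardCharsets.UTF_8))) {
            out.write("This texáát will be added to File !!");
        }
    }
}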
When you say "I see special characters not showing properly", where are you seeing them?
What you say/show next looks like the string, UTF-8 encoded (i.e. the accented a's are each represented by 2 chars, in what appears to be the appropriate encoding).
What I would expect the issue to be is that the Java code is not outputting a BOM at the beginning of the file, leaving the interpretation of the UTF-8 sequences up to the discretion of the reader.
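If the consumer of the file really does need a BOM to recognise UTF-8 (some Windows editors use it as a hint), a sketch under that assumption is to write U+FEFF as the very first character; the class and file names here are made up:
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8BomWriter {
    public static void main(String[] args) throws Exception {
        // '\uFEFF' written through a UTF-8 writer becomes the three-byte BOM EF BB BF
        // at the start of the file.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("temp-with-bom.txt"), StandardCharsets.UTF_8)) {
            out.write('\uFEFF');
            out.write("This texáát will be added to File !!");
        }
    }
}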

Writing strings with chars like "ñ" to a txt file

I'm having a strange issue trying to write strings which contain characters like "ñ", "á" and so on to text files. Let me first show you my little piece of code:
import java.io.*;

public class test {
    public static void main(String[] args) throws Exception {
        String content = "whatever";
        int c;
        c = System.in.read();
        content = content + (char) c;
        FileWriter fw = new FileWriter("filename.txt");
        BufferedWriter bw = new BufferedWriter(fw);
        bw.write(content);
        bw.close();
    }
}
In this example, I'm just reading a char from the keyboard input and appending it to a given string, then writing the final string into a txt file. The problem is that if I type an "ñ" for example (I have a Spanish-layout keyboard), when I check the txt, it shows a strange char "¤" where there should be an "ñ"; that is, the content of the file is "whatever¤". The same happens with "ç", "ú", etc. However, it writes it fine ("whateverñ") if I just forget about the keyboard input and write:
...
String content = "whateverñ";
...
or
...
content = content + "ñ";
...
It makes me think that there might be something wrong with the read() method? Or maybe I'm using it wrongly? Or should I use a different method to get the keyboard input? I'm a bit lost here.
(I'm using JDK 7u45 on Windows 7 Pro x64.)
So ...
It works (i.e. you can read the accented characters on the output file) if you write them as literal strings.
It doesn't work when you read them from System.in and then write them.
This suggests that the problem is on the input side. Specifically, I think your console / keyboard must be using a character encoding for the input stream that does not match the encoding that Java thinks should be used.
You should be able to confirm this tentative diagnosis by outputting the characters you are reading in hexadecimal, and then checking the codes against the unicode tables (which you can find at unicode.org for example).
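A minimal sketch of that diagnostic (the class name is made up; it just echoes, in hex, every byte that System.in delivers):
public class InputHexDump {
    public static void main(String[] args) throws Exception {
        // Hypothetical diagnostic: print each raw byte read from System.in in hex.
        // For 'ñ' you would expect 0xF1 under ISO-8859-1/Cp1252, but likely 0xA4 under
        // Cp850, the traditional default console code page on Western-European Windows.
        int c;
        while ((c = System.in.read()) != -1) {
            System.out.printf("0x%02X%n", c);
        }
    }
}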
It strikes me as "odd" that the "platform default encoding" appears to be working on the output side, but not the input side. Maybe someone else can explain ... and offer a concrete suggestion for fixing it. My gut feeling is that the problem is in the way your keyboard is configured, not in Java or your application.
Files do not remember their encoding format; when you look at a .txt file, the text editor makes a "best guess" at the encoding used.
If you try to read the file back into your program, the text should be back to normal.
Also, try printing the "strange" character directly.

How to preserve correct offset of string which is read from a file

I have a text.txt file which contains the following text.
Kontagent Announces Partnership with Global Latino Social Network Quepasa
Released By Kontagent
I read this text file into a string documentText.
documentText.substring(0, 9) gives Kontagent, which is good.
But documentText.substring(87, 96) gives y Kontage on Windows (IntelliJ IDEA) and gives Kontagent in a Unix environment. I am guessing it is happening because of the blank line in the file (after which the offset gets skewed). But I cannot understand why I get two different results; I need to get the same result in both environments.
To read the file as a string I used all the functions discussed in
How do I create a Java string from the contents of a file?, but I still get the same results with any of them.
Currently I am using this function to read the file into the documentText String:
public static String readFileAsString(String fileName)
{
    File file = new File(fileName);
    StringBuilder fileContents = new StringBuilder((int) file.length());
    Scanner scanner = null;
    try {
        scanner = new Scanner(file);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    String lineSeparator = System.getProperty("line.separator");
    try {
        while (scanner.hasNextLine()) {
            fileContents.append(scanner.nextLine() + lineSeparator);
        }
        return fileContents.toString();
    } finally {
        scanner.close();
    }
}
EDIT: Is there a way to write a general function which will work in both Windows and Unix environments, even if the file is copied in text mode?
Because, unfortunately, I cannot guarantee that everyone who works on this project will always copy files in binary mode.
The Unix file probably uses the native Unix EOL char: \n, whereas the Windows file uses the native Windows EOL sequence: \r\n. Since you have two EOLs in your file, there is a difference of 2 chars. Make sure to use a binary file transfer, and all the bytes will be preserved, and everything will run the same way on both OSes.
EDIT: in fact, you are the one who appends an OS-specific EOL (System.getProperty("line.separator")) at the end of each line. Just read the file as a char array using a Reader, and everything will be fine. Or use Guava's method which does it for you:
String s = CharStreams.toString(new FileReader(fileName));
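If you'd rather not depend on Guava, a plain-JDK sketch of the same idea (read the chars as they are and never append a separator yourself) could look like this; the method name mirrors the one in the question:
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class ReadFileAsString {
    // Hypothetical plain-JDK equivalent of the Guava one-liner: read the file as
    // a char array so the original line separators are kept exactly as they are.
    public static String readFileAsString(String fileName) throws IOException {
        StringBuilder contents = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader reader = new FileReader(fileName)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                contents.append(buf, 0, n);
            }
        }
        return contents.toString();
    }
}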
On Windows, the newline character \n is preceded by \r, the carriage return character. This does not happen on Linux. Transferring the file from one operating system to the other will not strip or append such characters, but occasionally text editors will auto-format them for you.
Because your file does not include \r characters (presumably it was transferred straight from Linux), System.getProperty("line.separator") returns \r\n and accounts for \r characters that are not actually in the file. This is why your offsets end up 2 characters off.
Good luck!
Based on the input you guys provided, I wrote something like this:
documentText = CharStreams.toString(new FileReader("text.txt"));
documentText = documentText.replaceAll("\\r", "");
to strip off the extra \r if the file has any.
Now I am getting the expected result in the Windows environment as well as Unix. Problem solved!
It works fine irrespective of the mode the file was copied in.
:) I wish I could choose both of your answers, but Stack Overflow doesn't allow it.

StringBuilders ending with mass nul characters

I'm having a very difficult time debugging a problem with an application I've been building. I cannot seem to reproduce the problem with a representative test program, which makes it difficult to demonstrate. Unfortunately I cannot share my actual source because of security, but the following test represents fairly well what I am doing: the files and data use Unix-style EOLs, I write to a zip file with a PrintWriter, and I use StringBuilders:
import java.io.File;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class Tester {
    public static void main(String[] args) {
        // variables
        File target = new File("TESTSAVE.zip");
        PrintWriter printout1;
        ZipOutputStream zipStream;
        ZipEntry ent1;
        StringBuilder testtext1 = new StringBuilder();
        StringBuilder replacetext = new StringBuilder();
        // ensure file replace
        if (target.exists()) {
            target.delete();
        }
        try {
            // open the streams
            zipStream = new ZipOutputStream(new FileOutputStream(target, true));
            printout1 = new PrintWriter(zipStream);
            ent1 = new ZipEntry("testfile.txt");
            zipStream.putNextEntry(ent1);
            // construct the data
            for (int i = 0; i < 30; i++) {
                testtext1.append("Testing 1 2 3 Many! \n");
            }
            replacetext.append("Testing 4 5 6 LOTS! \n");
            replacetext.append("Testing 4 5 6 LOTS! \n");
            // the replace operation
            testtext1.replace(21, 42, replacetext.toString());
            // write it
            printout1 = new PrintWriter(zipStream);
            printout1.println(testtext1);
            // save it
            printout1.flush();
            zipStream.closeEntry();
            printout1.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The heart of the problem is that the file I see on my side contains 16.3k characters. My friend, whether he uses the app on his PC or looks at exactly the same file as me, sees a file of 19,999 characters, the extra characters being a CRLF followed by a massive number of NUL characters. No matter what application, encoding or view I use, I cannot see these NUL characters at all; I only see a single LF on the last line, yet I do see a file of 20k. In all cases there is a difference between what is seen with the exact same files on the two machines, even though both are Windows machines and both are using the same editing software to view them.
I've not yet been able to reproduce this behaviour with any number of dummy programs. I have, however, been able to trace the final line's stray CRLF to my use of println on the PrintWriter. When I replaced the println(s) with print(s + '\n') the problem appeared to go away (the file size was 16.3k). However, when I returned the program to println(s), the problem did not reappear. I'm currently having the files verified by a friend in France to see if the problem really did go away (since I cannot see the NULs but he can), but this behaviour has me thoroughly confused.
I've also noticed that StringBuilder's replace function states "This sequence will be lengthened to accommodate the specified String if necessary". Given that StringBuilder's setLength function pads with NUL characters, and that the ensureCapacity function sets the capacity to the greater of the requested value or (currentCapacity * 2) + 2, I suspected a relation somewhere. However, only once while testing this idea have I been able to get a result resembling what I've seen, and I have not been able to reproduce it since.
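For reference, a tiny standalone demo of that documented setLength padding behaviour (nothing from the original program, just the JDK behaviour described above):
public class SetLengthDemo {
    public static void main(String[] args) {
        // setLength() beyond the current length appends '\u0000' (NUL) characters.
        StringBuilder sb = new StringBuilder("abc");
        sb.setLength(6);
        for (int i = 0; i < sb.length(); i++) {
            System.out.println(i + ": " + (int) sb.charAt(i)); // prints 0 for positions 3..5
        }
    }
}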
Does anyone have any idea what could be causing this error or at least have a suggestion on what direction to take the testing?
Edit since the comments section is broken for me:
Just to clarify, the output is required to be in Unix format regardless of the OS, hence the use of '\n' directly rather than through a formatter. The original StringBuilder that is inserted into is not in fact generated as in the test above, but is the contents of a file read in by the program. I'm happy the reading process works, as the information in it is used heavily throughout the application. I've done a little probing too and found that, directly prior to saving, the buffer IS the correct capacity and the output when toString() is invoked is the correct length (i.e. it contains no null characters and is 16,363 characters long, not 19,999). This would put the cause of the error somewhere between generating the string and saving the zip file.
Finally found the cause. Managed to reproduce the problem a few times and traced the cause down not to the output side of the code but the input side. My file reading function was essentially this:
char[] buf;
int charcount = 0;
StringBuilder line = new StringBuilder(2048);
InputStreamReader reader = new InputStreamReader(stream); // provides a line-wise read
BufferedReader file = new BufferedReader(reader);
do { // capture loop
    try {
        buf = new char[2048];
        charcount = file.read(buf, 0, 2048);
    } catch (IOException e) {
        return null; // unknown IO error
    }
    line.append(buf);
} while (charcount != -1);
// close and output
The problem was appending a buffer that wasn't full, so the later values were still at their initial value, the NUL character. The reason I couldn't reproduce it was that some data filled the buffers nicely and some didn't.
Why I couldn't seem to see the problem in my text editors I still have no idea, but I should be able to resolve this now. Any suggestions on the best way to do so are welcome; as this is part of one of my long-term utility libraries, I want to keep it as generic and optimised as possible.
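For completeness, here is a sketch of what the fixed loop might look like, wrapped in a hypothetical helper method; the buffer size and variable names follow the snippet above:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class CaptureFix {
    // Sketch of the corrected capture loop: append only the chars read() actually
    // filled in, so stale '\u0000' values in the buffer never reach the builder.
    static String capture(InputStream stream) throws IOException {
        StringBuilder line = new StringBuilder(2048);
        char[] buf = new char[2048];
        try (BufferedReader file = new BufferedReader(new InputStreamReader(stream))) {
            int charcount;
            while ((charcount = file.read(buf, 0, 2048)) != -1) {
                line.append(buf, 0, charcount);
            }
        }
        return line.toString();
    }
}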
