Character Encoding in different machines - java

If two machines have different default character encodings, how can a Java program make sure the same file is read the same way on both? Is this possible from Java, or do we have to set the encoding on each machine manually?

It sounds like you just want to use something like:
InputStream inputStream = new FileInputStream(...);
Reader reader = new InputStreamReader(inputStream, "UTF-8"); // Or whatever encoding
Basically you don't have to use the platform default encoding, and you should almost never do so. It's a pain that FileReader always uses the platform default encoding :( I prefer to explicitly specify the encoding, even if I'm explicitly specifying that I want to use the platform default :)

You don't need to change the machine's settings.
You can use any java.io.Reader subclass that allows you to set the character encoding. For instance InputStreamReader, like so:
new InputStreamReader(new FileInputStream("file.txt"), "UTF8");
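The two answers above can be sketched as a small helper. This is only an illustration: the class name ExplicitCharsetRead and the methods readAll and writeThenRead are made up for the example, and it assumes Java 7 or later for java.nio.file.Files and StandardCharsets.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
class ExplicitCharsetRead {

    // Reads the whole file, decoding bytes with the given charset
    // instead of the platform default.
    static String readAll(Path file, Charset charset) {
        try (InputStream in = Files.newInputStream(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes raw bytes to a temporary file and reads them back with the given charset.
    static String writeThenRead(byte[] content, Charset charset) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(tmp, content);
                return readAll(tmp, charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xA9};
        String text = writeThenRead(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(text.equals("\u00e9"));
    }
}
```

Because the charset is passed in explicitly, the result does not depend on the machine's default encoding.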

You are in control of reading/writing the files in both environments.
Working with text files in Java
You have control of the read side only.
You know the encoding used to write the file: Identify what encoding was used to write the file and use the same encoding to read it.
You don't know the encoding used to write the file: The best you can do is guess the encoding: Character Encoding Detection Algorithm
UPDATE
If your issue is that you are not seeing the output properly in the Eclipse console, then the problem might be the encoding setting of Eclipse itself. Read this article on how to fix Eclipse.

Related

java.io.IOException: No such file or directory

I am encountering an issue when saving/creating a file using Java.
java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method) ~[na:1.7.0_79]
My environment is Linux, but the place where I try to store the file is a mount on Windows.
It fails every time I try to create a file whose name contains Chinese characters.
Could this be caused by an encoding difference between Linux and Windows?
When I run and store on the same OS (run the app on Linux and store on Linux, and likewise for Windows), it runs smoothly.
Any help is much appreciated.
The code I use to create the file:
File imgPath = new File(fullpath.toString());
if (!imgPath.exists()) {
    FileUtils.forceMkdir(imgPath);
    imgPath.setWritable(true, false);
}
fullpath.append(File.separator).append(fileName);
outputStream = new FileOutputStream(new File(fullpath.toString()));
Thanks a lot.
Note: I'm a fairly new user and can't comment directly yet (only on my questions and answers so far), so I'm posting this as an answer.
Windows uses UTF-16 internally while Linux typically uses UTF-8 (assuming you haven't installed anything extra to change that). UTF-8 and UTF-16 support the same range of characters. The difference, if I remember correctly, is in how much memory each character takes: UTF-8 uses 8-bit code units (1 to 4 bytes per character) and UTF-16 uses 16-bit code units (2 or 4 bytes per character). So the same text is stored and read a little differently. InputStreamReader then converts characters from their external representation in the specified encoding to the internal representation. The exact byte-level details are covered in this Stack Overflow post (Difference between UTF-8 and UTF-16?). The two encodings produce different bytes even for basic characters, and the lengths differ again for others, like Chinese characters. I would suggest looking for solutions along that line (I have to get to class!). I could be entirely wrong, but this is probably a good starting place. Good luck.
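To make the byte-level difference concrete, here is a minimal sketch that counts how many bytes the same character takes in each encoding. The class name Utf8VsUtf16 and its methods are invented for the example. It only illustrates the encodings themselves and is not a diagnosis of the mount problem in the question.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class Utf8VsUtf16 {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as UTF-16
    // (little-endian, without a byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";
        String chinese = "\u4e2d"; // the Chinese character for "middle"
        System.out.println("A:      UTF-8=" + utf8Length(ascii) + " UTF-16=" + utf16Length(ascii));
        System.out.println("U+4E2D: UTF-8=" + utf8Length(chinese) + " UTF-16=" + utf16Length(chinese));
    }
}
```

An ASCII letter takes 1 byte in UTF-8 and 2 in UTF-16, while this Chinese character takes 3 bytes in UTF-8 and 2 in UTF-16, so bytes written under one encoding cannot be interpreted under the other.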

Process.getInputStream() encoding issue

I have the following lines of code. I want to use the proper encoding scheme.
Process process = processBuilder.start();
InputStreamReader isr = new InputStreamReader(process.getInputStream());
My Eclipse uses Windows-1252 encoding by default, while running the chcp command at the command prompt reports code page 437.
This means the stream of bytes that I get from the command line is encoded with a different scheme (code page 437) than the one used by the JVM (Windows-1252). How do I synchronise the two when I want my application to run across different platforms? [I cannot hardcode code page 437 in my Java application.]
Eclipse has nothing to do with it. At runtime, your constants are UTF-16 strings, independent of whatever encoding for Java source you have set in Eclipse. Your program that reads from the stream simply has to know the encoding in use in the process you launch. That will, as you note, depend on what sort of computer you are running on, what your settings are, and choices made by the creator of the program you launch. I'd expect that the literal values of the bytes written by a native non-_UNICODE program on Windows to appear on the stream. If the program you are running was built as an _UNICODE application, it's an interesting question what would appear on the stream ... UTF-16? In any case, any programmer creating any command-line program can send whatever they like down the standard output stream: even if every other program on the system is coughing up, say, Windows-1252, one particular program might write UTF-8 and be documented to do so for use with > redirection into a file. You just have to know.
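Since the program "just has to know" the encoding, one way to structure the code is to make the charset a parameter rather than relying on the default. The following is a sketch under that assumption. The class ProcessOutput and the methods decode and run are hypothetical names, and where the charset value comes from (configuration, documentation of the launched program) is left to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper class, not part of any library.
class ProcessOutput {

    // Decodes everything from the stream using the charset the caller says is in use.
    static String decode(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Starts the command and decodes its combined stdout/stderr with the given charset.
    static String run(List<String> command, Charset charset) {
        try {
            Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
            String output = decode(process.getInputStream(), charset);
            process.waitFor();
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Simulated process output: byte 0xE9 is U+00E9 in ISO-8859-1.
        byte[] simulated = {(byte) 0xE9};
        String text = decode(new ByteArrayInputStream(simulated), StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("\u00e9"));
    }
}
```

On a Windows console reporting code page 437, the caller might pass Charset.forName("IBM437"), assuming that charset is available in the running JVM.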

Encoding problems in Java File

I have a String in Java which is a filename containing umlauts. The file is stored correctly on a Win 7 Pro disk (umlauts etc. are shown correctly in the Explorer file listing). I also tried saving the filename to a text file, and the filename was written correctly with umlauts. But when I call exists() on File, it says the file doesn't exist. If I call createNewFile(), it creates a file with a garbled name like Ã¤.txt (originally ä.txt). What could be wrong in my settings here? I'm using Tomcat 6 and Eclipse to run my web application.
If the file name were included as a static constant in your source code, it would not make a difference where your code is executed, but since you are reading the filename from a remote address, it makes a significant difference.
By default every Java instance has a default charset; on Windows this is usually "Cp1252", on other systems usually "UTF-8". Every method that reads or writes Strings from/to the network or file system uses the default charset, unless you use the overload where the charset is explicitly specified.
Writing the filename into a file therefore doesn't demonstrate anything, because whether it is displayed correctly depends on the text editor you are using, not on the Java program writing it.
Conclusion: Go through your code and make sure you explicitly set the charset. This is especially relevant for the getBytes() method of String and everywhere you have a Reader/Writer instance connected to an InputStream/OutputStream.
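A small sketch of what a charset mismatch does to an umlaut, and how explicit charsets avoid it. The class ExplicitGetBytes and the method decodeAs are invented for the example. ISO-8859-1 stands in for Cp1252 here because it is guaranteed to exist on every JVM and agrees with Cp1252 on the bytes involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class ExplicitGetBytes {

    // Encodes the text with one charset and decodes the bytes with another,
    // which is what happens when writer and reader disagree.
    static String decodeAs(String text, Charset written, Charset read) {
        return new String(text.getBytes(written), read);
    }

    public static void main(String[] args) {
        String name = "\u00e4.txt"; // a-umlaut followed by ".txt"

        // What the current JVM would use if no charset were given.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Mismatch: UTF-8 bytes decoded as ISO-8859-1 turn one character into two.
        String garbled = decodeAs(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("Garbled length: " + garbled.length());

        // Matching charsets on both sides preserve the text.
        String intact = decodeAs(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("Intact: " + intact.equals(name));
    }
}
```

The garbled result starts with U+00C3 U+00A4, the two characters that the two UTF-8 bytes of the umlaut map to in a single-byte Western encoding.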

Writing new line character in Java that works across different OS

I have a Java program which generates a text file on a UNIX server. This text file is then sent to a customer's machine using other systems, where it is supposed to be opened in Notepad on Windows. As we are well aware, Notepad will not recognise the new lines because it expects Windows CR LF.
I am using
System.getProperty("line.separator");
Is there another way of achieving platform independence, so that a file generated on UNIX can be displayed in Windows Notepad with the proper formatting, such as changing some UNIX property for new lines?
Notepad is a pre-requisite, hence changing that is not possible.
Since Java 7 System.lineSeparator() ..
Java APIs keep changing with time. Writing new line character in Java that works across different OS could be achieved with System.lineSeparator() API which has been there since Java 7. check this,
System.lineSeparator()
No. You need to pick a single file format, Windows or UNIX. Given that your customer requires notepad, you can probably pick Windows format, which means you just need "\r\n" characters. The fact that UNIX generates the file can be considered irrelevant.
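A minimal sketch of the "pick Windows format" approach: the line terminator is a constant in the program, not something looked up from the system it runs on. The class CrlfWriter and its methods are hypothetical names, and UTF-8 is an assumed choice of file encoding.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of any library.
class CrlfWriter {

    // The terminator required by the target file format, independent of the host OS.
    static final String CRLF = "\r\n";

    // Joins the lines, terminating each one with CR LF.
    static String joinWithCrlf(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append(CRLF);
        }
        return sb.toString();
    }

    // Writes the lines to the target file with CR LF terminators and an explicit charset.
    static void write(Path target, List<String> lines) {
        try {
            Files.write(target, joinWithCrlf(lines).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("crlf-demo", ".txt");
        write(tmp, Arrays.asList("first line", "second line"));
        System.out.println("Wrote " + Files.size(tmp) + " bytes to " + tmp);
        Files.delete(tmp);
    }
}
```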
Mac OS X supports CR, LF and CRLF
Linux uses LF, but files using CRLF and CR render too.
Windows uses CRLF
the answer: just always use CRLF, then you have support for all platforms.
CR = '\r' , LF='\n'
so just make strings of the format : "mytexthereblahblahblah \r\n"
System.getProperty("line.separator");
This is wrong. You don't care what the system's line separator is. You need to write out a file in the format that file is supposed to be in. It doesn't matter what system generates the file.
A file is a stream of bytes. You need to write out the stream of bytes that the file format says you should be writing, and that is independent of what line separator any system uses.
Have you checked whether the UNIX server can tolerate Windows-style newlines? Since the requirement is editing in Notepad, you will need to put Windows-style newlines in the file, for example with System.out.print("\r\n").
If the UNIX server can't handle Windows line separators, then your best option is probably to process the file after editing to convert the newlines using dos2unix on the UNIX server.
The Notepad requirement means that this isn't really a platform independence requirement - it's a requirement for the Unix server to understand Windows newlines.
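If the conversion has to happen inside the Java program rather than with dos2unix, something along these lines might do. It is a sketch only: LineEndings, toCrlf and toLf are made-up names, and it assumes the whole text fits in memory.

```java
// Hypothetical helper class, not part of any library.
class LineEndings {

    // Converts any mix of CR LF, lone CR and lone LF to Windows-style CR LF.
    // CR LF is listed first in the pattern so an existing pair is not doubled.
    static String toCrlf(String text) {
        return text.replaceAll("\r\n|\r|\n", "\r\n");
    }

    // Converts CR LF and lone CR to UNIX-style LF (roughly what dos2unix does).
    static String toLf(String text) {
        return text.replaceAll("\r\n|\r", "\n");
    }

    public static void main(String[] args) {
        String mixed = "one\ntwo\r\nthree\rfour";
        String windows = toCrlf(mixed);
        String unix = toLf(windows);
        System.out.println(windows.length() + " chars as CRLF, " + unix.length() + " chars as LF");
    }
}
```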
There is no direct way to produce such an interoperable simple text file. You will either have to move to a structured format like doc, Excel or PDF, or simply fix the format to either Windows or UNIX and tell the end users to use the same.
This answer is an aggregate of the above discussion.
Thanks a lot to the community for helping. If any other solution to the problem is found, please post it for others :)

Java application failing on special characters

An application I am working on reads information from files to populate a database. Some of the characters in the files are non-English, for example accented French characters.
The application is working fine in Windows but on our Solaris machine it is failing to recognise the special characters and is throwing an exception. For example when it encounters the accented e in "Gérer" it says :-
Encountered: "\u0161" (353), after : "\'G\u00c3\u00a9rer les mod\u00c3"
(an exception which is thrown from our application)
I suspect that in order to stop this from happening I need to change the file.encoding property of the JVM. I tried to do this via System.setProperty() but it has not stopped the error from occurring.
Are there any suggestions for what I could do? I was thinking about setting the basic locale of the solaris platform in /etc/default/init to be UTF-8. Does anyone think this might help?
Any thoughts are much appreciated.
That looks like a file that was converted by native2ascii using the wrong parameters. To demonstrate, create a file with the contents
Gérer les modÚ
and save it as "a.txt" with the encoding UTF-8. Then run this command:
native2ascii -encoding windows-1252 a.txt b.txt
Open the new file and you should see this:
G\u00c3\u00a9rer les mod\u00c3\u0161
Now reverse the process, but specify ISO-8859-1 this time:
native2ascii -reverse -encoding ISO-8859-1 b.txt c.txt
Read the new file as UTF-8 and you should see this:
Gérer les modÀ\u0161
It recovers the "é" okay, but chokes on the "Ú", like your app did.
I don't know what all is going wrong in your app, but I'm pretty sure incorrect use of native2ascii is part of it. And that was probably the result of letting the app use the system default encoding. You should always specify the encoding when you save text, whether it's to a file or a database or what--never let it default. And if you don't have a good reason to choose something else, use UTF-8.
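The same effect can be reproduced in plain Java without native2ascii. This sketch uses invented names (DoubleEncodingDemo, misdecode, latin1CanEncode) and assumes the windows-1252 charset is available in the running JVM, which is not guaranteed by the Java specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class DoubleEncodingDemo {

    // Encodes text as UTF-8 and then wrongly decodes the bytes as windows-1252.
    // Assumption: the JVM provides the windows-1252 charset.
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), Charset.forName("windows-1252"));
    }

    // Reports whether ISO-8859-1 is able to represent the character at all.
    static boolean latin1CanEncode(char c) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        // U+00E9 (e acute) becomes U+00C3 U+00A9; both exist in ISO-8859-1,
        // so reversing with ISO-8859-1 can recover the original bytes.
        String e = misdecode("\u00e9");
        System.out.println("e acute -> " + (int) e.charAt(0) + " " + (int) e.charAt(1));

        // U+00DA (U acute) becomes U+00C3 U+0161; U+0161 has no ISO-8859-1
        // encoding, so the reverse step cannot restore the second byte.
        String u = misdecode("\u00da");
        System.out.println("U acute -> " + (int) u.charAt(0) + " " + (int) u.charAt(1));
        System.out.println("ISO-8859-1 can encode U+0161: " + latin1CanEncode('\u0161'));
    }
}
```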
Try to use
java -Dfile.encoding=UTF-8 ...
when starting the application in both systems.
Another way to solve the problem is to change the encoding of both systems to UTF-8, but I prefer the first option (less intrusive on the system).
EDIT:
Check this answer on Stack Overflow; it might help as well:
Changing the default encoding for String(byte[])
Instead of setting the system-wide character encoding, it might be easier and more robust, to specify the character encoding when reading and writing specific text data. How is your application reading the files? All the Java I/O package readers and writers support passing in a character encoding name to be used when reading/writing text to/from bytes. If you don't specify one, it will then use the platform default encoding, as you are likely experiencing.
Some databases are surprisingly limited in the text encodings they can accept. If your Java application reads the files as text, in the proper encoding, then it can output it to the database however it needs it. If your database doesn't support any encoding whose character repertoire includes the non-ASCII characters you have, then you may need to encode your non-English text first, for example into UTF-8 bytes, then Base64 encode those bytes as ASCII text.
PS: Never use String.getBytes() with no character encoding argument for exactly the reasons you are seeing.
I managed to get past this error by running the command
export LC_ALL='en_GB.UTF-8'
This command set the locale for the shell that I was in. This set all of the LC_ environment variables to the Unicode file encoding.
Many thanks for all of your suggestions.
You can also set the encoding at the command line, like so: java -Dfile.encoding=utf-8
I think we'll need more information to be able to help you with your problem:
What exception are you getting exactly, and which method are you calling when it occurs?
What is the encoding of the input file? UTF8? UTF16/Unicode? ISO8859-1?
It'll also be helpful if you could provide us with relevant code snippets.
Also, a few things I want to point out:
The problem isn't occurring at the 'é' but later on.
It sounds like the character encoding may be hard coded in your application somewhere.
Also, you may want to verify that operating system packages to support UTF-8 (SUNWeulux, SUNWeuluf etc) are installed.
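Unless told otherwise, Java uses the platform default encoding while reading and writing files (before Java 18 this is taken from the operating system; since Java 18 the default charset is UTF-8). One should never rely on that. It's always good practice to specify the encoding explicitly.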
In Java you can use following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
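As an alternative to the constructor chains above, the java.nio.file.Files helpers (Java 7 and later) take the charset directly. The following round trip is a sketch: the class name Utf8FileRoundTrip and the method roundTrip are made up for the example, and the temporary file is only there to keep it self-contained.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of any library.
class Utf8FileRoundTrip {

    // Writes the lines to a temporary file as UTF-8 and reads them back as UTF-8.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                try (BufferedWriter writer = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                    for (String line : lines) {
                        writer.write(line);
                        writer.write('\n');
                    }
                }
                List<String> result = new ArrayList<>();
                try (BufferedReader reader = Files.newBufferedReader(tmp, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        result.add(line);
                    }
                }
                return result;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("G\u00e9rer", "mod\u00e8le");
        System.out.println(roundTrip(original).equals(original));
    }
}
```

Because both sides name UTF-8 explicitly, the result is the same on Windows and Solaris regardless of locale settings.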
