I have a String in Java which is a filename containing umlauts. The file is stored correctly on a Win 7 Pro disk (umlauts etc. are shown correctly in the Explorer file listing). I also tried saving the filename to a text file, and the filename was output correctly with umlauts. But when I use the exists() method of File, it says the file doesn't exist. If I try to use createNewFile(), it creates a file like Ã¤.txt (originally ä.txt). What could be wrong in my settings here? I'm using Tomcat 6 and Eclipse to run my web application.
If the file name were included as a static constant in your source code, it would not make a difference where your code is executed, but since you are reading the filename from a remote address it makes a significant difference.
By default, every Java instance has a default charset; on Windows this is usually "Cp1252", on other systems usually "UTF-8". Every method that reads or writes Strings from/to the network or the file system uses this default charset, as long as you don't use the method variants where the charset is explicitly specified.
Therefore, writing the file name into a file doesn't prove anything, because whether it is displayed correctly depends on the text editor you are using, not on the Java program writing it.
Conclusion: Go through your code and make sure you explicitly set the charset. This is especially relevant for String's getBytes() method and everywhere you have a Reader/Writer instance connected to an InputStream/OutputStream.
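For illustration, a minimal sketch of both cases (the file name and charset here are just examples):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

public class CharsetDemo {
    public static void main(String[] args) throws Exception {
        String fileName = "ä.txt";
        // getBytes with an explicit charset, never the no-arg version
        byte[] bytes = fileName.getBytes("UTF-8");
        System.out.println(bytes.length + " bytes as UTF-8");
        // Writer/Reader connected to streams with the charset spelled out
        Writer w = new OutputStreamWriter(new FileOutputStream("names.txt"), "UTF-8");
        w.write(fileName);
        w.close();
        Reader r = new InputStreamReader(new FileInputStream("names.txt"), "UTF-8");
        System.out.println((char) r.read()); // 'ä' again, since both sides agree
        r.close();
    }
}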
The Linux API has an O_TMPFILE flag that can be specified with the open system call to create an unnamed temporary file which cannot be opened by any path. So we can use this to write data to the file "atomically" and then linkat the file to the real path. According to the open man page it can be implemented as simply as
char path[1000];
int fd = open("/tmp", O_TMPFILE | O_WRONLY, S_IWUSR); /* needs _GNU_SOURCE and <fcntl.h> */
write(fd, "123456", sizeof("123456") - 1); /* exclude the trailing NUL */
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, path, AT_FDCWD, "/tmp/1111111", AT_SYMLINK_FOLLOW);
Is there a Java alternative (probably not cross-platform) to do an atomic write to a file without writing a Linux-specific JNI function? Files.createTempFile does something completely different.
By atomic write I mean that either the file cannot be opened and read from, or it contains all the data that was to be written.
I don't believe Java has an API for this, and it seems to depend on both the OS and filesystem having support, so JNI might be the only way, and even then only on Linux.
I did a quick search for what Cygwin does; it seems to be a bit of a hack just to make software work: it creates a file with a random name and then excludes it only from its own directory listings.
I believe the closest you can get in plain Java is to create the file in some other location (kind of like a /proc/self/fd/... equivalent), and then, when you are done writing it, either move it or create a symbolic link to it from the final location. To move the file, you want it on the same filesystem partition so the file contents don't actually need to be copied. Programs watching for the file in, say, /tmp/ wouldn't see it until the move or symlink creation.
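A minimal sketch of that approach with java.nio (the paths mirror the C example; unlike O_TMPFILE, the staged file is briefly visible under its random temp name):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AtomicWriteSketch {
    public static void main(String[] args) throws Exception {
        Path target = Paths.get("/tmp/1111111");
        // Stage in the same directory so the move can't cross filesystems
        // and stays a cheap rename
        Path staged = Files.createTempFile(target.getParent(), ".staging", ".tmp");
        Files.write(staged, "123456".getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException if the
        // filesystem cannot perform the rename atomically
        Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
    }
}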
You could possibly play around with user accounts and filesystem permissions to ensure that no other (non-SYSTEM/root) program can see the file initially, even if it tried to look wherever you hid it.
I have this problem that has been dropped on me, and it has been a couple of days of unsuccessful searches and workaround attempts.
I have an internal Java Swing program, distributed by JNLP/Web Start to OSX and Windows computers, that, among other things, downloads some files from WebDav.
Recently, on a test machine with OSX 10.8 and Java 7, filenames and directory names with accented characters started having those characters replaced by question marks.
No problem on OSX with versions of Java before 7.
Example:
XXXYYY_è_ABCD/
becomes
XXXYYY_?_ABCD/
Using java.text.Normalizer (NFD, NFC, NFKD, NFKC) on the original string, the result is different but still wrong:
XXXYYY_e?_ABCD/
or
XXXYYY_e_ABCD/
I know, from correspondence between [andrew.brygin at oracle.com] and [mik3hall at gmail.com] that
Yes, file.encoding is set based on the locale that the jvm is running
on, and if you run your java vm in xxxx.UTF-8 locale, the
file.encoding should be UTF-8, set to MacRoman will be problematic.
So I believe Oracle/OpenJDK7 behaves correctly. That said, as Andrew
Thompson pointed out, if all previous Apple JDK releases use MacRoman
as the file.encoding for english/UTF-8 locale, there is a
"compatibility" concern here, it might worth putting something in the
release note to give Oracle/OpenJDK MacOS user a heads up.
original mail
From Joni Salonen's blog (java-and-file-names-with-invalid-characters) I know that:
You probably know that Java uses a “default character encoding” to
convert binary data to Strings. To read or write text using another
encoding you can use an InputStreamReader or OutputStreamWriter. But
for data-to-text conversions deep in the API you have no choice but to
change the default encoding.
and
What about file.encoding?
The file.encoding system property can also be used to set the default
character encoding that Java uses for I/O. Unfortunately it seems to
have no effect on how file names are decoded into Strings.
Executing locale from inside the JNLP invariably prints
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
The most similar problem on Stack Overflow with a solution is this:
encoding-issues-on-java-7-file-names-in-os-x
but the solution is to wrap the execution of the Java program in a script with
#!/bin/bash
export LC_CTYPE="UTF-8" # Try other options if this doesn't work
exec java your.program.Here
but I don't think this option is available to me because of Web Start, and I haven't found any way to set the LC_CTYPE environment variable from within the program.
Any solutions or workarounds?
P.S. :
If we run the program directly from the shell, it writes the file/directory correctly even on OSX 10.8 + Java 7.
The problem appears only with the combination of JNLP + OSX + Java 7.
I take it that a maximal ASCII representation of the file name is acceptable, which works in virtually any encoding.
First, you want to use specifically NFKD, so that maximum information is retained in the ASCII form. For example, "2⁵" becomes "25" rather than just "2", and the ligature "ﬁ" becomes "fi" rather than "", etc., once the non-ASCII and non-control characters are filtered out.
String str = "XXXYYY_è_ABCD/";
str = Normalizer.normalize(str, Normalizer.Form.NFKD);
str = str.replaceAll( "[^\\x20-\\x7E]", "");
//The file name will be XXXYYY_e_ABCD no matter what system encoding
You would then always pass filenames through this filter to get their filesystem name. The only thing you lose is some uniqueness, i.e. the file asdé.txt is the same as asde.txt, and under this scheme they cannot be differentiated.
EDIT: After experimenting with OS X some more I realized my answer was totally wrong, so I'm redoing it.
If your JVM supports -Dfile.encoding=UTF-8 on the JVM command line, that might fix the issue. I believe that is a standard property but I'm not certain about that.
HFS Plus, like other POSIX-compliant file systems, stores filenames as bytes. But unlike Linux's ext3 filesystem, it forces filenames to be valid decomposed UTF-8. This can be seen here with the Python interpreter on my OS X system, starting in an empty directory.
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
>>> import os
>>> os.mkdir('\xc3\xa8')
>>> os.mkdir('e\xcc\x80')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 17] File exists: 'e\xcc\x80'
>>> os.mkdir('\x8f')
>>> os.listdir('.')
['%8F', 'e\xcc\x80']
>>> ^D
$ ls
%8F è
This proves that the directory name on your filesystem cannot be Mac-Roman encoded (i.e. with byte value 8F where the è is seen), as long as it's an HFS Plus filesystem. But of course, the JVM is not assured of an HFS Plus filesystem, and SMB and NFS do not have the same encoding guarantees, so the JVM should not assume this scheme.
Therefore, you have to convince the JVM to interpret file and directory names with UTF-8 encoding, in order to read the names as java.lang.String objects correctly.
Shot in the dark: file.encoding does not influence how file names are created, just how the content gets written into the file. Check this post: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/
Here is a short entry from Apple: http://developer.apple.com/library/mac/#qa/qa1173/_index.html
Comparing this to http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html I would assume you want to use
String normalizedString = Normalizer.normalize(targetChars, Normalizer.Form.NFD);
to normalize the file names before you pass them to the File constructor. Does this help?
I don't think there is a real solution to this problem, right now.
In the meantime, I came to the conclusion that the "C" environment variables printed from inside the program come from the Java Web Start sandbox, and (by design, apparently) you can't influence them through the JNLP.
The accepted (as accepted by the company) workaround/compromise was of launching the jnlp using javaws from a bash script.
Apparently, launching the JNLP from the browser or from Finder creates a new sandbox environment with LANG not set (so it is set to "C", which is equivalent to ASCII).
Launching the JNLP from the command line instead picks up the right LANG from the system default, inheriting it from the shell.
This at least preserves the auto-updating feature of the JNLP and its dependencies.
Anyway, we sent a bug report to Oracle, but personally I'm not hopeful it will be resolved anytime soon, if ever.
It's a bug in the old-school java.io.File API, maybe just on Mac? Anyway, the new java.nio API works much better. I have several files containing Unicode characters and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.Path, EVERYTHING started working. And I replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.Files...
...and be sure to read and write the content of the file using an appropriate charset, for example:
Files.readAllLines(myPath, StandardCharsets.UTF_8)
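For instance, a small sketch of the java.nio replacement (the file name and content are made up):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class NioUnicodeDemo {
    public static void main(String[] args) throws Exception {
        // java.nio.Path instead of java.io.File
        Path myPath = Paths.get("XXXYYY_è_ABCD.txt");
        Files.write(myPath, Arrays.asList("contenu accentué"), StandardCharsets.UTF_8);
        for (String line : Files.readAllLines(myPath, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}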
I'm using a method to generate XML files dynamically for a research project. They get put into a loader that reads from a file path, and I don't have any control over how the loader handles things (otherwise I'd pass the internal XML representation instead of monkeying with temp files). I'm using this code to save the file:
File outputs = File.createTempFile("lvlFile", ".tmp.xml");
FileWriter fw = new FileWriter(outputs);
fw.write(el.asXML());
fw.close();
// filenames is my list of file paths which gets returned and passed around
filenames.add(outputs.getAbsolutePath());
Now, I'm sure that the file in question is written to directly. If I print outputs.getAbsolutePath() and navigate there via the terminal to check the files, everything is generated and written properly, so everything is correct on the filesystem. However, this code:
URL url = this.getClass().getClassLoader().getResource(_levelFile);
where _levelFile is one of the filenames generated above, causes url to be null. The path isn't getting corrupted or anything; printing verifies that _levelFile points to the correct path. The same code has succeeded for other files. Further, the bug doesn't seem related to whether I use getPath(), getCanonicalPath(), or getAbsolutePath(), and calling outputs.setReadable(true) doesn't do anything either.
Any ideas? Please don't offer alternatives to the URL url = structure; I don't have any control over this code*, so I'm obligated to change my code so that the url is set correctly.
(*) At least without SIGNIFICANT effort rewriting a large section of the framework I'm working with, even though the current code succeeds in all other cases.
Edit:
Again, I can't use an alternative to the URL code; it's part of a loader that I can't touch. Also, the loading fails even if I set the path of the temp file to the same directory that my successfully loaded files come from.
I assume that the ClassLoader will only look for resources within the classpath, which probably doesn't include /tmp. I'm not sure if it actually supports absolute path names; it might just interpret them as relative to the root of the individual classpath entries.
How about using new File(_levelFile).toURI().toURL() instead?
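Spelled out (assuming _levelFile holds one of the absolute path Strings from above):

import java.io.File;
import java.net.URL;

// Build a file: URL straight from the path; no classpath lookup involved
URL url = new File(_levelFile).toURI().toURL();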
You are creating a file in the file system and then trying to read it as a resource. Resources are loaded from wherever the JVM takes its classes, i.e. the classpath. So this operation will work only if you are writing the file into your classpath.
And even if this is correct, be careful: if, for example, you are running from Eclipse, your process will probably not "see" the new resource until you refresh your workspace.
Now my question is: are you really sure that you want to read files as resources? It seems that you should just create a new FileInputStream(_levelFile) and read from it.
Edit
@Anonymouse is right. You are creating the temporary file using the 2-arg version of createTempFile(), so your file is created in your temporary directory. The chance that it is in your classpath is very low... :)
So, if you want to read it then you have to get its path or just use it when creating your input stream:
File outputs = File.createTempFile("lvlFile", ".tmp.xml");
..........................
InputStream in = new FileInputStream(outputs);
// now read from this stream.
So I'm writing a Java application that uses Simple to store data as an XML file, but it is hellishly slow with big files when it stores to a network drive compared to a local hard drive. So I'd like to store the file locally before copying it over to the desired destination.
Is there some smart way to find a temporary local file storage location in Java in a system-independent way?
E.g. something that returns something such as c:/temp on Windows, /tmp on Linux, and likewise for other platforms (such as Mac). I could use the application path, but the problem is that the Java application is run from the network drive as well.
Try:
String path = System.getProperty("java.io.tmpdir");
See: http://java.sun.com/javase/6/docs/api/java/lang/System.html#getProperties%28%29
And to add it here for completeness' sake, as wic mentioned in his comment, there are also the createTempFile(String prefix, String suffix) and createTempFile(String prefix, String suffix, File directory) methods in Java's File class.
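A quick sketch combining the two (the printed paths will vary by platform):

import java.io.File;

public class TempDirDemo {
    public static void main(String[] args) throws Exception {
        // e.g. C:\Users\me\AppData\Local\Temp on Windows, /tmp on Linux
        System.out.println(System.getProperty("java.io.tmpdir"));
        // Creates something like data1234567890.xml in that directory
        File f = File.createTempFile("data", ".xml");
        System.out.println(f.getAbsolutePath());
    }
}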
System.getProperty("java.io.tmpdir")
The System and Runtime classes are the first whose Javadocs you should check when something related to the system is required.
In the spirit of 'let's solve the problem' instead of 'let's answer the specific question':
What type of input stream are you using when reading into Simple? Be sure to use BufferedInputStream (or BufferedReader) - otherwise you are reading one byte/character at a time from the stream, which will be painfully slow when reading a network resource.
Instead of copying the file to local disk, buffer the inputs and you will be good to go.
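A sketch of what that looks like with Simple's Persister (the network path and the Data class are placeholders):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.simpleframework.xml.Element;
import org.simpleframework.xml.Root;
import org.simpleframework.xml.core.Persister;

@Root
class Data {
    @Element
    String name;
}

public class BufferedLoad {
    public static void main(String[] args) throws Exception {
        // The buffer means Simple no longer hits the network for every byte
        InputStream in = new BufferedInputStream(
                new FileInputStream("//server/share/data.xml"));
        try {
            Data data = new Persister().read(Data.class, in);
            System.out.println(data.name);
        } finally {
            in.close();
        }
    }
}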
try System.getProperty("java.io.tmpdir");
An application I am working on reads information from files to populate a database. Some of the characters in the files are non-English, for example accented French characters.
The application works fine on Windows, but on our Solaris machine it fails to recognise the special characters and throws an exception. For example, when it encounters the accented e in "Gérer" it says:
Encountered: "\u0161" (353), after : "\'G\u00c3\u00a9rer les mod\u00c3"
(an exception which is thrown from our application)
I suspect that in order to stop this from happening I need to change the file.encoding property of the JVM. I tried to do this via System.setProperty() but it has not stopped the error from occurring.
Are there any suggestions for what I could do? I was thinking about setting the basic locale of the Solaris platform in /etc/default/init to UTF-8. Does anyone think this might help?
Any thoughts are much appreciated.
That looks like a file that was converted by native2ascii using the wrong parameters. To demonstrate, create a file with the contents
Gérer les modÚ
and save it as "a.txt" with the encoding UTF-8. Then run this command:
native2ascii -encoding windows-1252 a.txt b.txt
Open the new file and you should see this:
G\u00c3\u00a9rer les mod\u00c3\u0161
Now reverse the process, but specify ISO-8859-1 this time:
native2ascii -reverse -encoding ISO-8859-1 b.txt c.txt
Read the new file as UTF-8 and you should see this:
Gérer les modÃ\u0161
It recovers the "é" okay, but chokes on the "Ú", like your app did.
I don't know everything that's going wrong in your app, but I'm pretty sure incorrect use of native2ascii is part of it. And that was probably the result of letting the app use the system default encoding. You should always specify the encoding when you save text, whether it's to a file or a database or anything else; never let it default. And if you don't have a good reason to choose something else, use UTF-8.
Try to use
java -Dfile.encoding=UTF-8 ...
when starting the application in both systems.
Another way to solve the problem is to change the encoding of both systems to UTF-8, but I prefer the first option (it's less intrusive on the system).
EDIT:
Check this answer on Stack Overflow; it might help too:
Changing the default encoding for String(byte[])
Instead of setting the system-wide character encoding, it might be easier and more robust to specify the character encoding when reading and writing specific text data. How is your application reading the files? All the Java I/O package readers and writers support passing in a character encoding name to be used when reading/writing text to/from bytes. If you don't specify one, the platform default encoding is used, as you are likely experiencing.
Some databases are surprisingly limited in the text encodings they can accept. If your Java application reads the files as text, in the proper encoding, then it can output it to the database however it needs it. If your database doesn't support any encoding whose character repertoire includes the non-ASCII characters you have, then you may need to encode your non-English text first, for example into UTF-8 bytes, and then Base64-encode those bytes as ASCII text.
PS: Never use String.getBytes() with no character encoding argument for exactly the reasons you are seeing.
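A small sketch of both points (java.util.Base64 needs Java 8+; older versions would need a third-party codec):

import java.util.Base64;

public class EncodeDemo {
    public static void main(String[] args) throws Exception {
        String text = "Gérer";
        // Always name the charset; the no-arg getBytes() uses the platform default
        byte[] utf8 = text.getBytes("UTF-8");
        // Base64 turns the bytes into pure ASCII that any database can store
        String ascii = Base64.getEncoder().encodeToString(utf8);
        String back = new String(Base64.getDecoder().decode(ascii), "UTF-8");
        System.out.println(ascii + " -> " + back);
    }
}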
I managed to get past this error by running the command
export LC_ALL='en_GB.UTF-8'
This command set the locale for the shell that I was in, setting all of the LC_* environment variables to the Unicode file encoding.
Many thanks for all of your suggestions.
You can also set the encoding at the command line, like so: java -Dfile.encoding=utf-8.
I think we'll need more information to be able to help you with your problem:
What exception are you getting exactly, and which method are you calling when it occurs?
What is the encoding of the input file? UTF-8? UTF-16/Unicode? ISO-8859-1?
It would also be helpful if you could provide us with relevant code snippets.
Also, a few things I want to point out:
The problem isn't occurring at the 'é' but later on.
It sounds like the character encoding may be hard coded in your application somewhere.
Also, you may want to verify that the operating system packages needed to support UTF-8 (SUNWeulux, SUNWeuluf, etc.) are installed.
Java uses the operating system's default encoding when reading and writing files. One should never rely on that; it's always good practice to specify the encoding explicitly.
In Java you can use the following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));