Naming a file using Unicode characters in Java

I learned that Java allows file names to contain Unicode characters.
How can I name a file,
naïve.java,
using an English keyboard?
Is there a notation, similar to the Unicode escape notation used in Java source code, that we can use to name Java files with Unicode characters?

It seems that you are referring to JLS §7.2 “Host Support for Packages”:
A package name component or class name might contain a character that cannot correctly appear in a host file system's ordinary directory name, such as a Unicode character on a system that allows only ASCII characters in file names. As a convention, the character can be escaped by using, say, the # character followed by four hexadecimal digits giving the numeric value of the character, as in the \uxxxx escape (§3.3).
Under this convention, the package name:
children.activities.crafts.papierM\u00e2ch\u00e9
which can also be written using full Unicode as:
children.activities.crafts.papierMâché
might be mapped to the directory name:
children/activities/crafts/papierM#00e2ch#00e9
Note that this is described as a convention, and only for characters not supported by the host's file system. So the first obstacle is to find a system without Unicode support in its file system API before you can even check whether javac still adheres to this convention.
So on most systems, if you want to name a file naïve.java, not only is naming it directly that way supported, you actually have to name it that way, as the fallback escaping scheme is not supported by tools designed to run only on systems which don't need it.
That leads to your other question about how to enter it via the keyboard. Well, that’s system dependent. The most portable solution is:
open your browser
navigate to this question and mark naïve.java with the mouse
press ctrl+c
use your favorite tool to create a new .java file
when asked for the new name, press ctrl+v
As a general solution, refrain from using every feature the Java programming language offers…

All String objects in Java contain UTF-16 Unicode. All Java APIs that open files ultimately name those files with strings.
However, your keyboard is not Java's problem; it's your operating system's problem.
So:
File foo = new File("na\u00efve.java");
You edited the question to say that what you really want is to have a Java source file with interesting characters. That's a matter for your favorite text editor and operating system.
Java is perfectly happy to compile files with arbitrary names. However, creating those files and managing those files is not its problem. How you would go about creating such a file is between you and your operating system. Windows, Linux, OSX: all have different tools for entering Unicode characters that aren't part of the obvious keyboard map.
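For what it's worth, here is a minimal sketch of that last point, using java.nio and writing the name with a Unicode escape so that it can be typed on any keyboard (the file content is just an example):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// "na\u00efve.java" is the literal name naïve.java; the escape is resolved
// by the Java compiler, so the keyboard never has to produce the ï.
Path file = Paths.get("na\u00efve.java");
Files.write(file, "// naïve\n".getBytes(StandardCharsets.UTF_8));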

Related

What is a "regular file" in Java?

The class BasicFileAttributes, for examining the properties of a file in the file system, has the method isRegularFile(). Unfortunately, the Javadoc description is rather lacking:
Tells whether the file is a regular file with opaque content.
What does this mean? What exactly is a regular file with opaque content? I can tell from the other methods in the class that it's not a directory or symbolic link, so I'm inclined to think that it's everything else. However, there apparently is some type of "irregular file" because a method exists called isOther(), which returns true if it's not a directory, symbolic link, or "regular file".
So what exactly is a regular file in Java?
For example in UNIX, a regular file is one that is not special in some way. Special files include symbolic links and directories. A regular file is a sequence of bytes stored permanently in a file system.
Read this answer on the Unix & Linux Stack Exchange: What is a regular file?
I figure your rm is an alias, possibly for rm -i. The "regular" part doesn't
mean anything in particular; it only means that it's not a pipe,
device, socket or anything otherwise "special".
it means the file is not a symlink, pipe, rand, null, cpu, etc.
Perhaps you have heard the Linux philosophy that everything is text. This
isn't literally true, but it suggests a dominant operational context
where string-processing tools can be applied to filesystem elements
directly. In this case, it means that in a more literal fashion. To
see the detection step in isolation, try the file command, as in file
/etc/passwd or file /dev/null.
From IBM's AIX Files Reference:
A file is a collection of data that can be read from or written to. A file can be a program you create, text you write, data you acquire, or a device you use. Commands, printers, terminals, and application programs are all stored in files. This allows users to access diverse elements of the system in a uniform way and gives the operating system great flexibility. No format is implied when a file is created.
There are three types of files:
Regular - Stores data (text, binary, and executable).
Directory - Contains information used to access other files.
Special - Defines a FIFO (first-in, first-out) file or a physical device.
Regular files are the most common. When a word processing program is used to create a document, both the program and the document are contained in regular files.
Regular files contain either text or binary information. Text files are readable by the user. Binary files are readable by the computer. Binary files can be executable files that instruct the system to accomplish a job. Commands, shell scripts, and other programs are stored in executable files.
Directories contain information the system needs to access all types of files, but they do not contain the actual file data. As a result, directories occupy less space than a regular file and give the file-system structure flexibility and depth. Each directory entry represents either a file or subdirectory and contains the name of a file and the file's i-node (index node reference) number. The i-node number represents the unique i-node that describes the location of the data associated with the file. Directories are created and controlled by a separate set of commands. See "Directories" in Operating system and device management for more information.
Special files define devices for the system or temporary files created by processes. There are three basic types of special files: FIFO (first-in, first-out), block, and character. FIFO files are also called pipes. Pipes are created by one process to temporarily allow communication with another process. These files cease to exist when the first process finishes. Block and character files define devices.
All of the above is from the first link. I've checked many other sources on operating-system differences, and this seems to be the most common definition across all the sources I've found.
I am not an expert on this, but at first look BasicFileAttributes is not a class but an interface. So what a regular file is depends on the implementation of this interface. I can see that there is, for example, the class WindowsFileAttributes that implements this interface.
If you have a look at the OpenJDK version of this class you will find that it is
!isSymbolicLink() && !isDirectory() && !isOther();
Get all other information from the code ;-)
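If it helps, here is a small sketch of querying those flags through the public API; the path is just a hypothetical Unix example:

import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

BasicFileAttributes attrs = Files.readAttributes(
        Paths.get("/dev/null"), BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
System.out.println(attrs.isRegularFile());  // false
System.out.println(attrs.isDirectory());    // false
System.out.println(attrs.isSymbolicLink()); // false
System.out.println(attrs.isOther());        // true: a character device

On an ordinary text file, isRegularFile() would be true and the other three false.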

Eclipse UTF-8-weird characters

I am writing a program in Java with the Eclipse IDE, and I want to write my comments in Greek. So I changed the encoding under Window → Preferences → General → Content Types → Text → Java Source File to UTF-8. The comments in my code are OK, but when I run my program some words contain weird characters, e.g. San Germ�n (San Germán). If I change the encoding to ISO-8859-1, everything is OK when I run the program, but the comments in my code are not (weird characters!). So, what is going wrong?
Edit: My program uses Java Swing, and the weird characters with UTF-8 are Strings in cells of a JTable.
EDIT (2): OK, I solved my problem: I kept the UTF-8 encoding for the Java file but changed the encoding of the strings: String k = new String(myStringInByteArray, "ISO-8859-1");
This is most likely due to the compiler not using the correct character encoding when reading your source. This is a very common source of error when moving between systems.
The typical way to solve it is to use plain ASCII (which is identical in both windows-1252 and UTF-8) and the "\u1234" escape scheme (Unicode character 0x1234), but it is a bit cumbersome to handle, as Eclipse (last time I looked) did not transparently support this.
The property file editor does, though, so a reasonable suggestion could be that you put all your strings in a property file, and load the strings as resources when needing to display them. This is also an excellent introduction to Locales which are needed when you want to have your application be able to speak more than one language.
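As a rough sketch of that suggestion (the bundle and key names here are made up): with a messages.properties on the classpath, and optionally a messages_el.properties for Greek, the lookup would be:

import java.util.Locale;
import java.util.ResourceBundle;

// picks messages_el.properties if present, else falls back to messages.properties
ResourceBundle bundle = ResourceBundle.getBundle("messages", new Locale("el"));
String greeting = bundle.getString("greeting");

Note that before Java 9, .properties files are read as ISO-8859-1, so non-Latin text in them needs \u escapes, which is exactly what the property file editor handles for you.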

Encoding issue on filename with Java 7 on OSX with jnlp/webstart

I have this problem that was dropped on me, and it has been a couple of days of unsuccessful searches and workaround attempts.
I now have an internal Java Swing program, distributed by JNLP/Web Start to OSX and Windows computers, that, among other things, downloads some files from WebDAV.
Recently, on a test machine with OSX 10.8 and Java 7, filenames and directory names with accented characters started having those replaced by question marks.
There was no problem on OSX with versions of Java before 7.
Example:
XXXYYY_è_ABCD/
becomes
XXXYYY_?_ABCD/
Using java.text.Normalizer (NFD, NFC, NFKD, NFKC) on the original string, the result is different but still wrong:
XXXYYY_e?_ABCD/
or
XXXYYY_e_ABCD/
I know from correspondence between [andrew.brygin at oracle.com] and [mik3hall at gmail.com] that
Yes, file.encoding is set based on the locale that the jvm is running
on, and if you run your java vm in xxxx.UTF-8 locale, the
file.encoding should be UTF-8, set to MacRoman will be problematic.
So I believe Oracle/OpenJDK7 behaves correctly. That said, as Andrew
Thompson pointed out, if all previous Apple JDK releases use MacRoman
as the file.encoding for english/UTF-8 locale, there is a
"compatibility" concern here, it might worth putting something in the
release note to give Oracle/OpenJDK MacOS user a heads up.
original mail
From Joni Salonen's blog (java-and-file-names-with-invalid-characters) I know that:
You probably know that Java uses a “default character encoding” to
convert binary data to Strings. To read or write text using another
encoding you can use an InputStreamReader or OutputStreamWriter. But
for data-to-text conversions deep in the API you have no choice but to
change the default encoding.
and
What about file.encoding?
The file.encoding system property can also be used to set the default
character encoding that Java uses for I/O. Unfortunately it seems to
have no effect on how file names are decoded into Strings.
Executing locale from inside the JNLP invariably prints:
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
The most similar problem on Stack Overflow with a solution is this:
encoding-issues-on-java-7-file-names-in-os-x
but the solution is wrapping the execution of the java program in a script with
#!/bin/bash
export LC_CTYPE="UTF-8" # Try other options if this doesn't work
exec java your.program.Here
but I don't think this option is available to me because of Web Start, and I haven't found any way to set the LC_CTYPE environment variable from within the program.
Any solutions or workarounds?
P.S.:
If we run the program directly from the shell, it writes the file/directory correctly, even on OSX 10.8 + Java 7.
The problem appears only with the combination of JNLP + OSX + Java 7.
I take it that it's acceptable to have a maximal-ASCII representation of the file name, which works in virtually any encoding.
First, you want to use specifically NFKD, so that maximum information is retained in the ASCII form. For example, "2⁵" becomes "25" rather than just "2", and the ligature "ﬁ" becomes "fi" rather than "", once the non-ASCII and non-control characters are filtered out.
String str = "XXXYYY_è_ABCD/";
str = Normalizer.normalize(str, Normalizer.Form.NFKD);
str = str.replaceAll("[^\\x20-\\x7E]", "");
// The name becomes XXXYYY_e_ABCD/ no matter what the system encoding is
You would then always pass filenames through this filter to get their filesystem name. All you lose is some uniqueness, i.e. the file asdé.txt maps to the same name as asde.txt, and under this scheme they cannot be differentiated.
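Wrapped up as a helper (the method name is mine, not from any library), the filter might look like this:

import java.text.Normalizer;

static String toPortable(String name) {
    String decomposed = Normalizer.normalize(name, Normalizer.Form.NFKD);
    // keep only printable ASCII; the accents split off by NFKD are dropped here
    return decomposed.replaceAll("[^\\x20-\\x7E]", "");
}

// toPortable("asdé.txt") and toPortable("asde.txt") both return "asde.txt"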
EDIT: After experimenting with OS X some more I realized my answer was totally wrong, so I'm redoing it.
If your JVM supports -Dfile.encoding=UTF-8 on the command line, that might fix the issue. I believe that is a standard property, but I'm not certain about it.
HFS Plus, like other POSIX-compliant file systems, stores filenames as bytes. But unlike Linux's ext3 filesystem, it forces filenames to be valid decomposed UTF-8. This can be seen here with the Python interpreter on my OS X system, starting in an empty directory.
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
>>> import os
>>> os.mkdir('\xc3\xa8')
>>> os.mkdir('e\xcc\x80')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 17] File exists: 'e\xcc\x80'
>>> os.mkdir('\x8f')
>>> os.listdir('.')
['%8F', 'e\xcc\x80']
>>> ^D
$ ls
%8F è
This proves that the directory name on your filesystem cannot be Mac-Roman encoded (i.e. with byte value 8F where the è is seen), as long as it's an HFS Plus filesystem. But of course, the JVM is not assured of an HFS Plus filesystem, and SMB and NFS do not have the same encoding guarantees, so the JVM should not assume this scheme.
Therefore, you have to convince the JVM to interpret file and directory names with UTF-8 encoding, in order to read the names as java.lang.String objects correctly.
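If you can control the JVM invocation at all (which is exactly what Web Start makes difficult), the usual knobs are passed on the command line. Be warned that sun.jnu.encoding, which OpenJDK consults when decoding file names, is an internal, undocumented property, so treat the following as an assumption to test rather than a guarantee:

java -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar YourApp.jar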
Shot in the dark: file.encoding does not influence how file names are created, just how the content gets written into the file. See this post: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/
Here is a short entry from Apple: http://developer.apple.com/library/mac/#qa/qa1173/_index.html
Comparing this to http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html I would assume you want to use
normalized_string = Normalizer.normalize(target_chars, Normalizer.Form.NFD);
to normalize the file names before you pass them to the File constructor. Does this help?
I don't think there is a real solution to this problem right now.
In the meantime, I came to the conclusion that the "C" environment variables printed from inside the program come from the Java Web Start sandbox, and (by design, apparently) you can't influence those from the JNLP.
The accepted (as accepted by the company) workaround/compromise was launching the JNLP using javaws from a bash script.
Apparently, launching the JNLP from the browser or from Finder creates a new sandbox environment with LANG not set (so it falls back to "C", which implies ASCII).
Launching the JNLP from the command line instead picks up the right LANG from the system default, inheriting it from the shell.
This at least preserves the auto-updating feature of the JNLP and its dependencies.
Anyway, we sent a bug report to Oracle, but personally I'm not expecting it to be resolved anytime soon, if ever.
It's a bug in the old-school java.io.File API, maybe just on a Mac? Anyway, the new java.nio API works much better. I have several files containing Unicode characters and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.Path, EVERYTHING started working. And I replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.Files...
...and be sure to read and write the content of file using an appropriate charset, for example:
Files.readAllLines(myPath, StandardCharsets.UTF_8)
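A slightly fuller sketch of the java.nio route, with the accented character written as a Unicode escape; the file name and content here are made up for the example:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

Path file = Paths.get("XXXYYY_\u00e8_ABCD.txt"); // è as a \u escape
Files.write(file, Arrays.asList("h\u00e9llo"), StandardCharsets.UTF_8);
List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);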

Deploying Application to Chinese Speaking Country

Recently, I have been trying to internationalize an application for Chinese-speaking countries.
I realize there is a wide variety of encodings for Chinese characters: Guobiao, Big5, Unicode, HZ.
Whenever users input some text, my Java application needs to know which input encoding they are using, in order to convert the input into processable data.
I feel that it is not reliable to make assumptions about their input encoding based on their OS. This is because when someone is using an OS with a China locale, the JVM will by default use a Guobiao encoding; however, the user may use a Big5 input tool to key in Big5-encoded characters.
I was wondering what reliable method you all use to detect the encoding of user input?
For actual user input, you never have to detect it. It is defined by the environment.
On Windows, for a UNICODE application, the API will deliver UTF-16. For an MBCS application, it will deliver the current code page, and there's an API to tell you what that is.
On Linux, the locale determines the encoding of input as delivered to APIs.
Since you say you are in Java, you really don't need to care. All Java UI programs will deliver either char or String values, and those are always, immutably, in Unicode.
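A tiny illustration of that point, assuming a Swing JTextField called textField (and Java 8+ for codePoints()): whatever input method or code page produced the characters, getText() hands you a Unicode String whose code points you can inspect directly.

String s = textField.getText();
s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));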

Java application failing on special characters

An application I am working on reads information from files to populate a database. Some of the characters in the files are non-English, for example accented French characters.
The application works fine on Windows, but on our Solaris machine it fails to recognise the special characters and throws an exception. For example, when it encounters the accented e in "Gérer" it says:
Encountered: "\u0161" (353), after : "\'G\u00c3\u00a9rer les mod\u00c3"
(an exception which is thrown from our application)
I suspect that in order to stop this from happening I need to change the file.encoding property of the JVM. I tried to do this via System.setProperty(), but it has not stopped the error from occurring.
Are there any suggestions for what I could do? I was thinking about setting the basic locale of the solaris platform in /etc/default/init to be UTF-8. Does anyone think this might help?
Any thoughts are much appreciated.
That looks like a file that was converted by native2ascii using the wrong parameters. To demonstrate, create a file with the contents
Gérer les modÚ
and save it as "a.txt" with the encoding UTF-8. Then run this command:
native2ascii -encoding windows-1252 a.txt b.txt
Open the new file and you should see this:
G\u00c3\u00a9rer les mod\u00c3\u0161
Now reverse the process, but specify ISO-8859-1 this time:
native2ascii -reverse -encoding ISO-8859-1 b.txt c.txt
Read the new file as UTF-8 and you should see this:
Gérer les modÀ\u0161
It recovers the "é" okay, but chokes on the "Ú", like your app did.
I don't know everything that's going wrong in your app, but I'm pretty sure incorrect use of native2ascii is part of it, and that was probably the result of letting the app use the system default encoding. You should always specify the encoding when you save text, whether it's to a file or a database or anything else; never let it default. And if you don't have a good reason to choose something else, use UTF-8.
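For the record, the mojibake step itself is easy to reproduce in a couple of lines. This sketch misreads the UTF-8 bytes of "é" as windows-1252, which is the first half of what happened to your file:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

byte[] utf8 = "é".getBytes(StandardCharsets.UTF_8);           // 0xC3 0xA9
String mangled = new String(utf8, Charset.forName("windows-1252"));
System.out.println(mangled);                                   // prints Ã©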
Try to use
java -Dfile.encoding=UTF-8 ...
when starting the application in both systems.
Another way to solve the problem is to change the encoding of both systems to UTF-8, but I prefer the first option (it is less intrusive on the systems).
EDIT:
Check this answer on Stack Overflow; it might help too:
Changing the default encoding for String(byte[])
Instead of setting the system-wide character encoding, it might be easier and more robust to specify the character encoding when reading and writing specific text data. How is your application reading the files? All the Java I/O package readers and writers support passing in a character encoding name to be used when reading/writing text to/from bytes. If you don't specify one, the platform default encoding is used, which is likely what you are experiencing.
Some databases are surprisingly limited in the text encodings they can accept. If your Java application reads the files as text in the proper encoding, then it can output the text to the database however the database needs it. If your database doesn't support any encoding whose character repertoire includes the non-ASCII characters you have, then you may need to encode your non-English text first, for example into UTF-8 bytes, and then Base64-encode those bytes as ASCII text.
PS: Never use String.getBytes() with no character encoding argument, for exactly the reasons you are seeing.
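If you do end up needing that Base64 route, a sketch with java.util.Base64 (Java 8+); the sample string is just the word from your stack trace:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

String original = "Gérer";
String stored = Base64.getEncoder()
        .encodeToString(original.getBytes(StandardCharsets.UTF_8));
// store the pure-ASCII 'stored' in the database; decode on the way back out:
String restored = new String(Base64.getDecoder().decode(stored),
        StandardCharsets.UTF_8);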
I managed to get past this error by running the command
export LC_ALL='en_GB.UTF-8'
This command set the locale for the shell that I was in, which set all of the LC_* environment variables to a Unicode encoding.
Many thanks for all of your suggestions.
You can also set the encoding at the command line, like so: java -Dfile.encoding=utf-8.
I think we'll need more information to be able to help you with your problem:
What exception are you getting exactly, and which method are you calling when it occurs?
What is the encoding of the input file? UTF-8? UTF-16? ISO-8859-1?
It'll also be helpful if you could provide us with relevant code snippets.
Also, a few things I want to point out:
The problem isn't occurring at the 'é' but later on.
It sounds like the character encoding may be hard-coded in your application somewhere.
Also, you may want to verify that the operating-system packages that support UTF-8 (SUNWeulux, SUNWeuluf, etc.) are installed.
Java uses the operating system's default encoding when reading and writing files. One should never rely on that; it's always good practice to specify the encoding explicitly.
In Java you can use following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath), "UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
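On Java 7 and later, java.nio.file.Files offers a shorter equivalent of both; a sketch, reusing the same path variables:

BufferedReader br = Files.newBufferedReader(Paths.get(inputPath), StandardCharsets.UTF_8);
BufferedWriter bw = Files.newBufferedWriter(Paths.get(outputPath), StandardCharsets.UTF_8);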
