Setting file name encoding - java

I have an input file in a defined encoding (UTF-8) from which I create different files whose names and content (again UTF-8) are taken from that input file.
My problem is that on one particular Windows system, the files created do not have the correct characters in their names. The content of these files is perfectly readable, but their names are not.
Instead of Ü.xml, the file has the name ├£.xml.
On other Windows systems everything works fine.
The file content's encoding can be set via OutputStreamWriter's second argument, but the file name's encoding cannot be set via new File(name), it seems.
Thanks.

Seeing two chars where there should be one UTF-8 multi-byte char (ü) suggests that Windows does not have UTF-8 as its file encoding, and that a UTF-8-named file was copied onto that system, for example by unpacking a zip file.
System.getProperty("file.encoding") should give the platform encoding. It is also remotely imaginable that this is some odd case not covered by Java or Windows, like a compressed directory, or a second external disk formatted with a non-UTF-8-capable file system.

Java uses the "platform's default charset" to convert file names to strings, and there's no way to change that behaviour through the standard API. You may, on some systems, be able to change the default encoding when you launch the JVM:
java -Dfile.encoding=cp1252 package.ClassName
On other systems the only way to affect the file name encoding is through the system locale settings. You can read more about that here: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/
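As a quick diagnostic, a minimal sketch like the following (the class name is a placeholder) prints which charset the JVM is using and reproduces the problem by creating a file with a non-ASCII name:

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;

public class FileNameEncodingCheck {
    public static void main(String[] args) throws IOException {
        // The charset the JVM reports as the platform default.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset() = " + Charset.defaultCharset());

        // Create a file with a non-ASCII name. If the platform encoding
        // cannot represent 'Ü', the name on disk will be mangled.
        File f = new File("Ü.xml");
        System.out.println("created: " + f.createNewFile()
                + " -> " + f.getAbsolutePath());
    }
}

Running this on both the working and the misbehaving Windows systems should show whether the platform encodings differ.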

Related

A Java program to modify encoding

I want to compile my package, but I need to change the encoding to "UTF-8 without BOM". The package has hundreds of Java files, but I don't want to open and re-save every file. Can I write a program to change all the files in the package, or is there a tool I can use?
As this post discussed, a UTF-8 Java source file without a BOM (byte order mark) that contains only ASCII characters is indistinguishable from a plain ASCII source file. That being said, to convert your UTF-8 source files to UTF-8 without BOM, you can simply strip off the leading BOM marker. Here is a link to code which removes the BOM mark for UTF-8 files:
http://www.javapractices.com/topic/TopicAction.do?Id=257
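If a ready-made tool doesn't suit, here is a minimal, batch-style sketch of the same idea: it walks a source tree and rewrites any .java file that starts with the UTF-8 BOM bytes (EF BB BF). The class name and the "src" root are placeholders:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.Arrays;
import java.util.stream.Stream;

public class BomStripper {
    // The UTF-8 BOM is the three bytes EF BB BF at the start of a file.
    private static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    public static void main(String[] args) throws IOException {
        try (Stream<Path> files = Files.walk(Paths.get("src"))) {
            files.filter(p -> p.toString().endsWith(".java"))
                 .forEach(BomStripper::stripBom);
        }
    }

    private static void stripBom(Path p) {
        try {
            byte[] bytes = Files.readAllBytes(p);
            if (bytes.length >= 3 && bytes[0] == BOM[0]
                    && bytes[1] == BOM[1] && bytes[2] == BOM[2]) {
                // Rewrite the file without the leading three BOM bytes.
                Files.write(p, Arrays.copyOfRange(bytes, 3, bytes.length));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}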
Try iconv. The program converts files between encodings, presuming you know what encoding the files are already in.

File separators of Path name of ZipEntry?

ZIP entries store the full path name of the entry because (I'm fairly sure of this next part) the ZIP archive is not organized into directories; the metadata contains the info about how files are supposed to be stored (inside directories).
If I create a ZIP file on Windows, when I unzip the data on another OS, e.g. Mac OS X, the file structure remains as it was on Windows. Is this because the unzipper is designed to handle this, or is it because the file separators inside the ZIP are standard?
I'm asking this because I'm trying to find an entry inside a ZIP file using the name of the zipped file. But which file separator should I use to make it work in systems other than Windows?
I'm using Java, and the method: .getName() of the ZipEntry gives me the path using the Windows file separator \. Would it be enough if I use the java File.separator separator to make it work on another OS? Or will I have to try to find my file with each possible separator?
Honorary Correct Answer Mention
The answer given by @Eren Yilmaz correctly describes the functionality of many tools (or even one you could code yourself). But given that the .zip standard clearly documents how it must be, the correct answer had to be updated.
The .zip file specification states:
4.4.17.1 The name of the file, with optional relative path.
The path stored MUST not contain a drive or
device letter, or a leading slash. All slashes
MUST be forward slashes '/' as opposed to
backwards slashes '\' for compatibility with Amiga
and UNIX file systems etc. If input came from standard
input, there is no file name field.
The file separator depends on the application that creates the zip file. Some applications use the system file separator, whereas some use the "civilized" forward slash "/". So, if you are both creating and consuming the zip file, you can simply use a forward slash as the file separator. If the zip file is created somewhere else, you should find out which separator was used. I don't know a simple way to do that, but you can use a brute-force method and check both separator types as you go, as in the sketch below.
Some applications, especially custom zip creation code, can mix the separators across zip entries, so don't forget to check each entry.
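For example, a lookup that tolerates both separators can normalize every name to forward slashes before comparing. A minimal sketch (the archive path and entry name are placeholders):

import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipEntryFinder {
    // Compare entry names with all backslashes normalized to forward
    // slashes, so archives written with either separator are handled.
    static ZipEntry find(ZipFile zip, String name) {
        String wanted = name.replace('\\', '/');
        Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
            ZipEntry e = entries.nextElement();
            if (e.getName().replace('\\', '/').equals(wanted)) {
                return e;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        try (ZipFile zip = new ZipFile("archive.zip")) {
            ZipEntry e = find(zip, "dir\\file.txt");
            System.out.println(e != null ? e.getName() : "not found");
        }
    }
}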

Saving JSP as UTF-8 in NetBeans

I've got some JSP files from other developers and now need to work with them. When I add any UTF-8 character to a document and want to save it, NetBeans automatically offers to save it in ISO-8859-1.
I'm actually getting this message from NetBeans:
The index.jsp contains characters which will probably be damaged during conversion to the ISO-8859-1 character set. Do you want to save the file using this character set? (Yes/No)
NetBeans didn't offer me any other option, like saving the file as UTF-8 (the encoding it should already be written in).
I don't know how to save those JSP files in the character set they are already written in.
And don't tell me that changing the content of the file itself is the only way (that is ineffective, due to the inclusion of headers etc. from other files)...
http://forums.netbeans.org/topic8750.html
First, don't forget to consider this line at the top:
<%@page contentType="text/html" pageEncoding="UTF-8"%>
Second, in the NetBeans folder there is a config file (netbeans.conf). There should be a line like this:
netbeans_default_options="-J-Xms32m -J-Xmx128m -J-XX:PermSize=32m -J-XX:MaxPermSize=160m -J-Xverify:none -J-Dapple.laf.useScreenMenuBar=true"
Add this to the end of the line:
-J-Dfile.encoding=UTF-8
Third:
NetBeans implements a project encoding setting.
To change the language encoding for a project:
Right-click a project node in the Projects window and choose Properties.
Under Sources, select an encoding value from the Encoding drop-down field.
The encoding affects at least:
* how non-ASCII characters are displayed in the editor window when you open files
* Java file compilation of sources containing non-ASCII identifiers, string literals, or comments
* textual search for international characters over the project
Starting from NetBeans IDE 6.8, you can also specify the encoding that will be used at runtime. For example, this can be useful when the encoding for the operating system on which the application will run is different from your project's encoding.
To specify the encoding to be used at runtime:
In the Files window for your project, open nbproject > private > private.properties
Add the following line to the private.properties file and save changes:
runtime.encoding=<encoding>
This encoding will override the encoding setting for your project and will be used when running your application.
In general,
*.properties files always use ISO-8859-1 encoding plus \uXXXX escapes. (International characters will be displayed natively in the editor but stored as an escape on disk.)
*.xml files and some *.html files can specify their own encodings, regardless of the project encoding. For such files, the IDE's editor ignores the project encoding.
These may help you.
Sources I used for this answer:
Link1: http://forums.netbeans.org/topic33.html
Link2: http://wiki.netbeans.org/FaqI18nProjectEncoding

Illegal Character when trying to compile java code

I have a program that allows a user to type Java code into a rich text box and then compile it with the Java compiler. Whenever I try to compile the code I have written, I get an error saying there is an illegal character at the beginning of my code, even though nothing is visible there. This is the error the compiler gives me:
C:\Users\Travis Michael>"\Program Files\Java\jdk1.6.0_17\bin\javac" Test.java
Test.java:1: illegal character: \187
public class Test
^
Test.java:1: illegal character: \191
public class Test
^
2 errors
The BOM is generated by, say, .NET's File.WriteAllText() or StreamWriter when you don't specify an Encoding; the default is to use UTF-8 and generate a BOM. You can tell the Java compiler about this with its -encoding command line option.
The path of least resistance is to avoid generating the BOM. Do so by specifying System.Text.Encoding.Default; that will write the file with the characters in the default code page of your operating system and doesn't write a BOM. Use the File.WriteAllText(String, String, Encoding) overload or the StreamWriter(String, Boolean, Encoding) constructor.
Just make sure that the file you create doesn't get compiled by a machine in another corner of the world. It will produce mojibake.
That's a byte order mark, as everyone says.
javac does not understand the BOM, not even when you try something like
javac -encoding UTF8 Test.java
You need to strip the BOM or convert your source file to another encoding. Notepad++ can convert a single file's encoding; I'm not aware of a batch utility on the Windows platform for this, though a small program along the lines of the sketch below can do it.
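A minimal sketch, assuming the source is UTF-8 with a BOM and that the platform default encoding is what javac expects (the file name and class name are placeholders):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Recode {
    public static void main(String[] args) throws Exception {
        Path source = Paths.get("Test.java");
        // Decode as UTF-8; a BOM survives decoding as a leading U+FEFF.
        String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);  // drop the BOM
        }
        // Re-encode in the platform default charset, which javac assumes.
        Files.write(source, text.getBytes(Charset.defaultCharset()));
    }
}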
The Java compiler will assume the file is in your platform default encoding, so if you use that encoding, you don't have to specify anything.
If using an IDE, specify the java file encoding (via the properties panel)
If NOT using an IDE, use an advanced text editor (I can recommend Notepad++) and set the encoding to "UTF-8 without BOM", or "ANSI", if that suits you.
In this case, do the following steps 1-7.
In Android Studio
1. Menu -> Edit -> Select All
2. Menu -> Edit -> Cut
3. Open a new Notepad.exe
In Notepad
4. Menu -> Edit -> Paste
5. Menu -> Edit -> Select All
6. Menu -> Edit -> Copy
Back In Android Studio
7. Menu -> Edit -> Paste
http://en.wikipedia.org/wiki/Byte_order_mark
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
The BOM is a funky-looking character that you sometimes find at the start of Unicode streams, giving a clue what the encoding is. It's usually handled invisibly by the string-handling machinery in Java, so you must have bypassed it somehow, but without seeing your code, it's hard to say where.
You might be able to fix it trivially by manually stripping the BOM from the string before feeding it to javac. Note that trim() will not remove it: U+FEFF is above U+0020, so String.trim() does not treat it as whitespace. Strip the leading character explicitly instead, as in the sketch below.
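A minimal sketch of that idea (the class name is a placeholder):

public class BomUtil {
    // Remove a leading byte order mark (U+FEFF) if present.
    // String.trim() won't do this, since U+FEFF is above U+0020.
    public static String stripBom(String source) {
        return source.startsWith("\uFEFF") ? source.substring(1) : source;
    }

    public static void main(String[] args) {
        System.out.println(stripBom("\uFEFFpublic class Test {}"));
    }
}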
That's a problem related to the BOM (byte order mark) character. The BOM is a Unicode character used for defining a text file's byte order, and it comes at the start of the file. Eclipse doesn't allow this character at the start of your file, so you must delete it. For this purpose, use a text editor like Notepad++ and save the file with the encoding "UTF-8 without BOM". That should fix the problem.
I copy-pasted some content from a website into Notepad++, and it showed "LS" characters with a black background. I deleted the "LS" characters, copied the same content from Notepad++ into the Java file, and it worked fine.
I solved this by right-clicking in my TextEdit document, selecting [Substitutions], and unchecking Smart Quotes.
Instead of getting Notepad++, you can simply open the file with WordPad and then use Save As - Plain Text document.
I was facing this issue too, as I use Notepad++ to write code. It is very convenient to type code in Notepad++, but after compiling I got the error "error: illegal character: '\u00bb'".
Solution:
Start writing the code in the older version of Notepad (which is there by default on your PC) and save it. Later modifications can be made using Notepad++.
It works!
I had the same problem with a file I generated using the command echo "" > Main.java in Windows PowerShell. I searched for the problem and it seemed to have something to do with encoding. I checked the encoding of the file using file -i Main.java and the result was text/plain; charset=utf-16le.
Later I deleted the file and recreated it in Git Bash using touch Main.java, and with this the file compiled successfully. I checked the encoding with the file -i command again, and this time the result was Main.java: text/x-c; charset=us-ascii.
Next I searched the internet and found that to create an empty file using PowerShell we can use the New-Item cmdlet. I created the file with New-Item Main.java, checked its encoding (again text/x-c; charset=us-ascii), and this time it compiled successfully.

Unzip files created with WinZIP with I18N file names?

People these days create their ZIP archives with WinZip, which allows for internationalized (i.e. non-Latin: Cyrillic, Greek, Chinese, you name it) file names.
Sadly, trying to unpack such file causes trouble:
UNIX unzip creates garbage-named files and dirs like "®£¤ ©¤¥èì".
Java and its jar command fail miserably on such archives.
Is there a passable way to unpack such files programmatically? UNIX or Java.
DotNetZip supports unicode and arbitrary encodings for filenames within zipfiles, either for reading or writing zips.
It's a .NET library; for Unix usage, you would need Mono as a prerequisite.
If the zipfile is correctly constructed by WinZip, in other words if it's compliant with the zip spec from PKWare, then there's no special work needed to specify the encoding at the time you unpack it. According to the zip spec, there are two supported encodings for filenames in zipfiles: UTF-8 and IBM437. Which of these encodings is used is specified in the zip metadata, and any zip library can detect and use it. DotNetZip automatically detects it when reading a compliant zip, like this:
using (var zip = ZipFile.Read("thearchive.zip"))
{
    foreach (var e in zip)
    {
        // e.FileName refers to the name on the entry
        e.Extract("extract-directory");
    }
}
There are archive programs that produce zips that are "non-compliant" w.r.t. encoding. WinRar is one - it will create a zip that has filenames encoded in the default encoding in use on the computer. In Shanghai it will use cp936, in Iceland something else, and in Lisbon something else again. The advantage of "non-compliance" here is that Windows Explorer will open and correctly display i18n-ized filenames in such zips. In other words, "non-compliance" is often what people want, because Windows doesn't (yet?) support UTF-8 zip files.
(This all has to do with the encoding used in the zipfile, not the encoding used in the files contained in the zip file)
The zip spec doesn't allow for the specification of an arbitrary text encoding in the zip metadata. In other words if you use cp950 when creating the zip, then your extract logic needs to "know" to use cp950 when extracting - nothing in the zip file carries that information. In addition, of course, the zip library you use to programmatically extract must support arbitrary encodings. As far as I know, Java's zip library does not. DotNetZip does. Like so:
using (ZipFile zip = ZipFile.Read(zipToExtract,
                                  System.Text.Encoding.GetEncoding(950)))
{
    foreach (ZipEntry e in zip)
    {
        e.Extract(extractDirectory);
    }
}
DotNetZip can also create zip files with arbitrary encodings - "non compliant" zips.
DotNetZip is free, and open source.
The solution I've found:
Apache commons-compress can unzip such archives just fine, if supplied with the correct fallback charset.
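For reference, a minimal sketch of that approach (the archive name, the output directory, and the cp866 fallback are assumptions; use the charset the archive was actually created with):

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;

public class I18nUnzip {
    public static void main(String[] args) throws IOException {
        // The second argument is the fallback charset for entry names that
        // are not flagged as UTF-8 in the archive's metadata.
        try (ZipFile zip = new ZipFile(new File("archive.zip"), "cp866")) {
            Enumeration<ZipArchiveEntry> entries = zip.getEntries();
            while (entries.hasMoreElements()) {
                ZipArchiveEntry entry = entries.nextElement();
                Path target = Paths.get("out", entry.getName());
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    try (InputStream in = zip.getInputStream(entry)) {
                        Files.copy(in, target);
                    }
                }
            }
        }
    }
}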
