Using Unicode characters for file names inside a zip archive


I am zipping a file whose name contains special characters, such as Péréquation LES HOPITAUX NEUFS.xls, into a different folder, say temp.
The file is zipped, but its name is automatically changed to
P+¬r+¬quation LES HOPITAUX NEUFS.xls.
How can I support Unicode characters in file names inside a zip archive?

It depends a bit on which code you're using to create the archive. The old Java compression classes are not as flexible as you need.
You can use Apache Commons Compress. Michael Simons wrote this nice piece of code:
ZipArchiveOutputStream ostream = ...; // Your initialization code here
ostream.setEncoding("Cp437"); // This should handle your "special" characters
ostream.setFallbackToUTF8(true); // For "unknown" characters!
ostream.setUseLanguageEncodingFlag(true);
ostream.setCreateUnicodeExtraFields(
ZipArchiveOutputStream.UnicodeExtraFieldPolicy.NOT_ENCODEABLE);
If you're using Java 7 or later, you finally have a Charset parameter (which can be UTF-8) on the ZipOutputStream constructor.
The big problem, anyway, is that many implementations don't understand Unicode file names, because the original ZIP format predates Unicode (file names were assumed to be in IBM Code Page 437) and Unicode support was only added to the specification later. See this post for further details.
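As a rough illustration of the Java 7 constructors mentioned above, here is a minimal sketch that writes an entry name as UTF-8 and reads it back. The class and method names (Utf8ZipNames, zipSingleEntry, firstEntryName) are made up for this example, and the archive is kept in memory only to keep the sketch short. It says nothing about how a particular third-party tool will display the name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of any library.
class Utf8ZipNames {

    /** Writes a one-entry archive whose entry name is encoded with the given charset. */
    static byte[] zipSingleEntry(String entryName, byte[] content, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // The two-argument constructor exists since Java 7.
        try (ZipOutputStream zip = new ZipOutputStream(buffer, charset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content);
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    /** Returns the name of the first entry, decoded with the given charset, or null if the archive is empty. */
    static String firstEntryName(byte[] archive, Charset charset) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive), charset)) {
            ZipEntry entry = zip.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        // The accented letters are written as escapes so the source compiles under any file encoding.
        String name = "P\u00e9r\u00e9quation LES HOPITAUX NEUFS.xls";
        byte[] archive = zipSingleEntry(name, new byte[] {1, 2, 3}, StandardCharsets.UTF_8);
        System.out.println(name.equals(firstEntryName(archive, StandardCharsets.UTF_8)));
    }
}
```

Passing a different Charset to both methods is how you would target a tool that expects a legacy code page instead.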

The Zip specification (historically) does not specify which character encoding is to be used for the embedded file names and comments; the original IBM PC character encoding set, commonly referred to as IBM Code Page 437, is supposed to be the only encoding supported. The Jar specification, meanwhile, explicitly specifies UTF-8 as the encoding for all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefore strictly followed the Jar specification and used UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/Zip files.
Consequence? A ZIP file created by a "traditional" ZIP tool is not accessible to a java.util.jar/zip based tool, and vice versa, if a file name contains characters that are not compatible between Cp437 (as an alternative, tools might simply use the default platform encoding) and UTF-8.
For most Europeans, you're "lucky" :-) in that you only need to avoid a "handful" of characters, such as the umlauts (OK, I'm just kidding), but for Japanese and Chinese, most of the characters are simply out of luck. This is why bug 4244499 was No. 1 on the Top 25 Java Bugs for so many years. The bug is no longer on the list :-) it has finally been "fixed" in OpenJDK 7, b57. I still keep a snapshot as the record/kudo for myself :-)
The solution (I would call it a "solution" rather than a "fix") in JDK7 b57 is to introduce a new set of ZipInputStream, ZipOutputStream and ZipFile constructors that take a specific "charset" as a parameter, as shown below.
ZipFile(File, Charset)
ZipInputStream(InputStream, Charset)
ZipOutputStream(OutputStream, Charset)
With these new constructors, applications can now access those non-UTF-8 ZIP files via ZipInputStream or ZipFile objects created with the specific encoding, or create Zip files encoded in something other than UTF-8 via the new ZipOutputStream(os, charset) constructor, if necessary.
zip is a stripped-down version of the Jar tool with an "-encoding" option to support non-UTF-8 encodings for entry names and comments; it can serve as a demo of how to use the new APIs (I used it as a unit test). I'm still debating with myself whether it is a good idea to officially introduce "-encoding" into the Jar tool...
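To make the Cp437/UTF-8 mismatch described above concrete, here is a small sketch. It first shows what a UTF-8 encoded name looks like when its bytes are decoded as Cp437, then writes and reads an archive with the same legacy charset through the new constructors. The class and method names are invented for the example, and it assumes the running JVM provides the IBM437 charset (Charset.forName throws otherwise).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of the JDK.
class LegacyZipNames {

    // Assumption: this JVM ships the IBM437 (Cp437) charset.
    static final Charset CP437 = Charset.forName("IBM437");

    /** Simulates a Cp437-only tool looking at a name that was stored as UTF-8 bytes. */
    static String decodeUtf8BytesAsCp437(String name) {
        return new String(name.getBytes(StandardCharsets.UTF_8), CP437);
    }

    /** Writes a temporary archive with one empty entry and lists its entry names, both using the given charset. */
    static List<String> roundTrip(String entryName, Charset charset) throws IOException {
        Path archive = Files.createTempFile("legacy-names", ".zip");
        try {
            try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(archive), charset)) {
                zip.putNextEntry(new ZipEntry(entryName));
                zip.closeEntry();
            }
            List<String> names = new ArrayList<>();
            try (ZipFile zipFile = new ZipFile(archive.toFile(), charset)) {
                Enumeration<? extends ZipEntry> entries = zipFile.entries();
                while (entries.hasMoreElements()) {
                    names.add(entries.nextElement().getName());
                }
            }
            return names;
        } finally {
            Files.deleteIfExists(archive);
        }
    }

    public static void main(String[] args) throws IOException {
        // Each accented letter becomes two box-drawing style characters.
        System.out.println(decodeUtf8BytesAsCp437("P\u00e9r\u00e9quation"));
        // The name survives when writer and reader agree on the charset.
        System.out.println(roundTrip("t\u00e4scht.txt", CP437));
    }
}
```

The point of the sketch is only that the writer and the reader must agree on the charset; which charset a given "traditional" tool actually expects is something you would have to check for that tool.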

Related

How to create Windows native compatible Zip files with non-ASCII filenames

Our team has a program that generates PDFs written in Java. The PDFs, which may have non-ASCII filenames, are zipped using Apache Commons Compress. The zip files are then uploaded to S3 to be downloaded by Windows and Mac clients.
When unzipped on Mac using the native tools, the files are recreated with the correct filename. However, when trying to unzip using the native Windows UI tool, the filenames are created incorrectly.
The zip process is:
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
I have also added the following code, but it still does not work: the filenames show up as unreadable characters on Windows:
zipFile.setEncoding("UTF-8");
zipFile.setUseLanguageEncodingFlag(true);
zipFile.setCreateUnicodeExtraFields(ZipArchiveOutputStream.UnicodeExtraFieldPolicy.ALWAYS);
How can I create zip files that can be used by both Mac and Windows?

According to the Apache Commons Compress page (https://commons.apache.org/proper/commons-compress/zip.html):
Windows' "compressed folder" feature doesn't recognize any flag or extra field and creates archives using the platform's default encoding - and expects archives to be in that encoding when reading them.
and
If Windows' "compressed folders" is your primary consumer, then your best option is to explicitly set the encoding to the target platform. You may want to enable creation of Unicode extra fields so the tools that support them will extract the file names correctly.
Therefore:
If you know that your Windows users are based in a limited region of the world and your filenames are limited to that region's characters (e.g. all Latin), you could heed Apache's advice and pick an 8-bit code page for filename encoding, which would also be respected by OS X's unzip. However, the archive would then not work on Windows machines in a different region, or on machines that happen to use a slightly different code page (North American vs. Western European).
The sensible alternative would be to use an alternative archive tool on Windows and possibly an alternative archive format. Perhaps you could create self-extracting archives for Windows by prepending a suitable extraction tool to the zip file. For example, you can create a self-extracting 7zip archive in Java using the rough instructions here: http://sourceforge.net/p/sevenzip/discussion/45798/thread/de8aa3c6
The pseudo format is:
7z.sfx + config.txt + your-created-archive.7z your-created-archive.exe
Where 7z.sfx is a 7zip self-extracting executable "header" distributed with 7zip.
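Read literally, the pseudo format above is a byte-for-byte concatenation of three files into one. A minimal Java sketch of that step could look like the following. The class name SfxConcatenator is hypothetical, and the sketch assumes plain concatenation is all that is needed; it does not create the .7z archive or validate the contents of config.txt.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: joins files in order, like "copy /b a + b + c target" on Windows.
class SfxConcatenator {

    /** Writes the bytes of each part, in the given order, into target (overwriting it if it exists). */
    static void concatenate(Path target, Path... parts) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // File names taken from the pseudo format above; they must exist in the working directory.
        concatenate(Paths.get("your-created-archive.exe"),
                Paths.get("7z.sfx"), Paths.get("config.txt"), Paths.get("your-created-archive.7z"));
    }
}
```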
In response to comments in question:
Windows uses UTF-16 for filenames and, AFAIK, uses UTF-16 in its low-level API, which Java calls. However, the Windows console is very broken and doesn't readily support UTF-8.
(Java also uses UTF-16 internally for String objects)
OS X enforces UTF-8 for filename encodings, so Java should also be respecting that when creating filenames.

How to convert (Java) files with different encodings to the same?

I'm working on a big java web application in Eclipse, whose files have different encodings: some are in UTF-8, others in Cp1252, yet others are in ISO-8859-1 (with no distinction between JSP's or java source files, or CSS) — but I know the encoding of each file.
I'm converting the project to Maven, and this is a great occasion to turn all of them to UTF-8.
Of course I don't want to lose a single character (so fully automated conversions do not apply here).
How should I go about it? Is there a tool that can help me ensure I don't lose any special character?
The webapp is in Italian, so, especially in JSP's, there could be lots of accented letters (probably not everywhere HTML entities have been used).
The project is in Eclipse, but I can use an external editor if that could make the conversion easier.

It's very easy to write code to convert encodings - although I'd expect there are tools to do it anyway. Simply:
Create one FileInputStream to the existing file, and wrap it in an InputStreamReader with the appropriate encoding
Create one FileOutputStream to the new file, and wrap it in an OutputStreamWriter with the appropriate encoding
Loop over the reader, reading characters into a buffer and writing out the contents of that buffer (just as many characters as you read) until you've read the whole file
Close all resources (automatic with a try-with-resources block)
The first two steps are simpler with Files.newBufferedReader and Files.newBufferedWriter, too.
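The steps above can be sketched roughly as follows, using the Files helpers mentioned last. EncodingConverter and convert are names made up for this example. As far as I know, the reader and writer returned by these Files methods report malformed or unmappable characters with an exception rather than silently replacing them, which helps with the "don't lose a single character" requirement, but it is worth verifying on a few sample files.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for converting one file between encodings.
class EncodingConverter {

    /** Copies source to target, decoding with sourceCharset and encoding with targetCharset. */
    static void convert(Path source, Charset sourceCharset, Path target, Charset targetCharset)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter writer = Files.newBufferedWriter(target, targetCharset)) {
            char[] buffer = new char[8192];
            int read;
            // Write exactly as many characters as were read in each pass.
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read);
            }
        }
    }
}
```

A call such as convert(oldFile, Charset.forName("windows-1252"), newFile, StandardCharsets.UTF_8) would then handle one Cp1252 file; source and target should be different paths, since the target is overwritten while the source is still being read.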

Converting a single file can be done with the iconv tool (I used LibIconv for Windows).
It lets you specify the source and destination encodings, and warns when characters can't be converted.
I tried it with a couple of source files and all the accented letters were correctly converted in UTF-8 from Cp1252.

File names with Japanese characters turn to garbage when written to a zip file using java.util.zip.*

I have a directory with a name that contains Japanese characters, and I need to use the zip utils in java.util.zip to write it to a zip file. Writing the zip file succeeds, but when I open the resulting zip file with either Windows' built-in compressed file utility or 7-Zip, the directory with Japanese characters in the name appears as a bunch of garbage characters. I do have the Japanese/East Asian language pack installed on my system -- I can create directories with Japanese names, so that isn't the issue.
Interestingly, if I write a separate script to read the resulting zip file using java.util.zip, the directory name is correct, and I can extract the contents of the zip into appropriately named directories, with Japanese characters. But I can't do this using the commercial zip tools that I've tried, which is undoubtedly what our customers will want to do.
Any ideas about what is causing this problem, and how I can work around it?
I know about this bug, but I still need a workaround for this case.

TrueZIP claims to do this better:
The J2SE API always uses UTF-8 (eight bit Unicode character set) for entry names and comments instead of CP437 (a.k.a. IBM437, the genuine IBM-PC character set), which is used by the de-facto standard PKZIP from PKWARE. As a result, you cannot read or write ZIP files with international entry file names such as e.g. "täscht.txt" in a ZIP file created by a (southern) German.
[description of other problems omitted]
The TrueZIP Library has been developed to overcome these limitations/disadvantages.

Miracles indeed happen, and Sun/Oracle did really fix the long-living bug/rfe:
Now it's possible to set up filename encodings upon creating the zip file/stream (requires Java 7):
http://download.java.net/jdk7/docs/api/java/util/zip/ZipOutputStream.html#ZipOutputStream(java.io.OutputStream, java.nio.charset.Charset)

If java.util.zip still behaves as this post describes, I'm not sure if it is possible (with the built-in classes). I have seen Chilkat's Java Zip library mentioned before as a way to get this to work, but have never used it.

Spring Properties File

Hi, I have a J2EE web application developed using the Spring framework. I have a problem rendering messages containing Japanese (nihongo) characters from the properties file. I tried converting the file to ASCII using native2ascii and it solved my problem. Is there another way to do this, by setting the encoding in the configuration files, instead of manually converting the file by running native2ascii at the command prompt?

AFAIK, in property files and resource bundles you have to use ASCII. Inside Spring XML configuration files, Unicode should work fine. If you prefer, you can edit property files in Unicode and run native2ascii automatically as part of your build process (in Ant, Maven, etc.).

As per the java.util.Properties API document:
The load(Reader) / store(Writer, String) methods load and store properties from and to a character based stream in a simple line-oriented format specified below. The load(InputStream) / store(OutputStream, String) methods work the same way as the load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.
(Note that ISO 8859-1 is not the same as ASCII, contrary to what many here are saying.)
So, to fix the particular problem without the need for native2ascii, you should use Properties#load(Reader) with an InputStreamReader(input, charset) instead.
Properties properties = new Properties();
properties.load(new InputStreamReader(classLoader.getResourceAsStream("file.properties"), "UTF-8"));
Note that this method was introduced in Java 1.6 over 4 years ago. Ensure that you're using it as well.
I don't do Spring, so I can't go in detail about how to get Spring to work that way, but it would be obvious that you need to override/replace the Spring's resource bundle manager, if any.

hey, i googled for the same issue and found something written in german that was a help for me: http://www.stefanglase.de/2009/10/13/spring-messagesource-mit-utf-8-encoding/

How do I properly store and retrieve internationalized Strings in properties files?

I'm experimenting with internationalization by making a Hello World program that uses properties files + ResourceBundle to get different strings.
Specifically, I have a file "messages_en_US.properties" that stores "hello.world=Hello World!", which works fine of course.
I then have a file "messages_ja_JP.properties" which I've tried all sorts of things with, but it always appears as some type of garbled string when printed to the console or in Swing. The problem is obviously with the reading of the content into a Java string, as a Java string in Japanese typed directly into the source can print fine.
Things I've tried:
The .properties file in UTF-8 encoding with the Japanese string as-is for the value. Something I read indicates that Java expects a properties file to be in the native encoding of the system...? It didn't work either way.
The file in default encoding (ISO-8859-1) and the value stored as escaped Unicode created by the native2ascii program included with Java. Tried with a source file in various Japanese encodings... SHIFT-JIS, EUC-JP, ISO-2022-JP.
Edit:
I actually figured this out while I was typing this, but I figured I'd post it anyway and answer it in case it helps anyone.

I realized that native2ascii was assuming (surprise) that it was converting from my operating system's default encoding each time, and as such not producing the correct escaped Unicode string.
Running native2ascii with the "-encoding encoding_name" option where encoding_name was the name of the source file's encoding (SHIFT-JIS in this case) produced the correct result and everything works fine.
Ant also has a native2ascii task that runs native2ascii on a set of input files and sends output files wherever you want, so I was able to add a builder that does that in Eclipse so that my source folder has the strings in their original encoding for easy editing and building automatically puts converted files of the same name in the output folder.
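For readers who have not seen the escaped format, the sketch below emulates the core of what native2ascii produces for a string that has already been decoded correctly: every non-ASCII UTF-16 code unit becomes a \uXXXX escape, which Properties.load(InputStream) turns back into the original character. This is only an illustration with invented names (UnicodeEscapes, escapeNonAscii, loadEscaped), not the tool's actual implementation, and it ignores other characters that are special in properties files, such as backslashes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical illustration of the escaped properties format.
class UnicodeEscapes {

    /** Replaces every non-ASCII UTF-16 code unit with a backslash, the letter u and four hex digits. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    /** Builds a one-line properties file with an escaped value and loads it the pre-Java-6 way. */
    static Properties loadEscaped(String key, String value) throws IOException {
        String line = key + "=" + escapeNonAscii(value) + "\n";
        Properties props = new Properties();
        // After escaping, the line is pure ASCII, so ISO-8859-1 encodes it without loss.
        props.load(new ByteArrayInputStream(line.getBytes(StandardCharsets.ISO_8859_1)));
        return props;
    }
}
```

The failure described above happens one step earlier: if the tool decodes the source file with the wrong encoding, the escapes it writes are for the wrong characters, which is why the "-encoding" option matters.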

As of JDK 1.6, Properties has a load() method that accepts a Reader. That means you can save all the property files as UTF-8 and read them all directly by passing an InputStreamReader to load(). I think that's the most elegant solution, but it requires your app to run on a Java 6 runtime.
Historically, load() only accepted an InputStream, and the stream was decoded as ISO-8859-1. Not the system default encoding, always ISO-8859-1. That's important, because it makes a certain hack possible. Say your property file is stored as UTF-8. After you retrieve a property, you can re-encode it as ISO-8859-1 and decode it again as UTF-8, like this:
String realProp = new String(prop.getBytes("ISO-8859-1"), "UTF-8");
It's ugly and fragile, but it does work. But I think the best solution, at least for the next few years, is the one you found: bulk-convert the files with native2ascii using a build tool like Ant.
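Both approaches from this answer can be put side by side in a short sketch. The class and method names are invented for the example, and the properties content is passed in as a byte array so the behaviour can be tried without a file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class comparing the two ways of reading UTF-8 property data.
class Utf8Properties {

    /** Java 6+ approach: decode the bytes as UTF-8 through a Reader. */
    static Properties loadAsUtf8(byte[] utf8Bytes) throws IOException {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }

    /** Historical approach: load(InputStream) always decodes the bytes as ISO-8859-1. */
    static Properties loadAsLatin1(byte[] bytes) throws IOException {
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(bytes));
        return props;
    }

    /** The hack: turn the mis-decoded value back into bytes and decode those as UTF-8. */
    static String reinterpretAsUtf8(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
}
```

With UTF-8 input, loadAsUtf8 returns the intended value directly, while loadAsLatin1 returns a garbled value that reinterpretAsUtf8 can repair. The repair only makes sense for values that really came from UTF-8 bytes, which is part of why the hack is fragile.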

An alternative way to handle the properties files is:
http://www.unipad.org/main/
This is an editor which can read and write files in the \u Unicode escape format, which is the format native2ascii creates.
I don't know how well it works with Japanese; I've used it for Hungarian.
