How to create Windows native compatible Zip files with non-ASCII filenames

Our team has a program that generates PDFs written in Java. The PDFs, which may have non-ASCII filenames, are zipped using Apache Commons Compress. The zip files are then uploaded to S3 to be downloaded by Windows and Mac clients.
When unzipped on Mac using the native tools, the files are recreated with the correct filename. However, when trying to unzip using the native Windows UI tool, the filenames are created incorrectly.
The zip process is:
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
I have also added the following code, but it still does not work; Windows displays unreadable characters:
zipFile.setEncoding("UTF-8");
zipFile.setUseLanguageEncodingFlag(true);
zipFile.setCreateUnicodeExtraFields(ZipArchiveOutputStream.UnicodeExtraFieldPolicy.ALWAYS);
How can I create zip files that can be used by both Mac and Windows?

According to the Apache Commons Compress page (https://commons.apache.org/proper/commons-compress/zip.html):
Windows' "compressed folder" feature doesn't recognize any flag or extra field and creates archives using the platforms default encoding - and expects archives to be in that encoding when reading them.
and
If Windows' "compressed folders" is your primary consumer, then your best option is to explicitly set the encoding to the target platform. You may want to enable creation of Unicode extra fields so the tools that support them will extract the file names correctly.
Therefore:
If you know that your Windows users are based in one region and your filenames are limited to that region's characters (e.g. all Latin), you could heed Apache's advice and choose an 8-bit code page for filename encoding, which OS X's unzip would also respect. However, the archive would then not extract correctly on Windows machines in a different region, or on machines that happen to use a slightly different code page (North American vs. Western European).
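Before committing to a single code page, it may help to check which filenames that code page can actually represent. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and the target charset is whatever code page you assume your Windows users have.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class CodePageCheck {
    // Returns the names that cannot be represented in the target charset.
    static List<String> unencodable(List<String> names, Charset target) {
        List<String> bad = new ArrayList<>();
        for (String name : names) {
            // A fresh encoder per name keeps the check free of leftover state.
            if (!target.newEncoder().canEncode(name)) {
                bad.add(name);
            }
        }
        return bad;
    }
}
```

For example, calling unencodable(names, Charset.forName("windows-1252")) would list the names that a Western European Windows machine could not display if the archive were written in that code page; an empty result suggests the fixed-code-page approach is viable for that set of files.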
The sensible alternative would be to use an alternative archive tool on Windows and possibly an alternative archive format. Perhaps you could create self-extracting archives for Windows by prepending a suitable extraction tool to the zip file. For example, you can create a self-extracting 7zip archive in Java using the rough instructions here: http://sourceforge.net/p/sevenzip/discussion/45798/thread/de8aa3c6
The pseudo format is:
7z.sfx + config.txt + your-created-archive.7z your-created-archive.exe
Where 7z.sfx is a 7zip self-extracting executable "header" distributed with 7zip.
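The concatenation itself is just a byte-for-byte join of the three files. A rough Java sketch, assuming you already have the 7z.sfx module, a config.txt and a finished .7z archive on disk (the class and method names here are hypothetical, and creating the .7z archive itself is not shown):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SfxBuilder {
    // Joins the parts in order: SFX module, config, then the archive.
    static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }

    // File-based wrapper; all four paths are supplied by the caller.
    static void build(Path sfx, Path config, Path archive, Path exe) {
        try {
            Files.write(exe, concat(
                    Files.readAllBytes(sfx),
                    Files.readAllBytes(config),
                    Files.readAllBytes(archive)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This reads everything into memory, which is fine for small archives; for large ones, streaming each part into the output file would be the safer choice.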
In response to comments in question:
Windows uses UTF-16 for filenames and, AFAIK, uses UTF-16 in its low-level API, which Java calls. However, the Windows console is very broken and doesn't readily support UTF-8.
(Java also uses UTF-16 internally for String objects)
OS X enforces UTF-8 for filename encodings, so Java should also be respecting that when creating filenames.

Related

Java PC application - exported JAR do not behave as in development

I have a classic Java application for PC. The build produces a JAR file that runs on a Windows machine.
The application reads some XML files and creates an HTML document as its end result. The XML files contain language-specific characters that are not native to English.
During development, when I run the application from the IDE (Apache NetBeans 13) via Build -> Run, the exported HTML file contains the language-specific characters.
When I run the JAR file from the project's dist directory, the HTML does not contain them.
For example, characters such as č, ć, đ, š are exported as Ä�, whereas when running from NetBeans they are exported correctly, not as that strange symbol.
The letters in question are from Serbian, Croatian and Bosnian.
When I export the project from NetBeans, I made sure to have this option enabled:
Project -> Project properties -> Build -> Packaging where the "Copy Dependent Libraries" option is selected.
I am puzzled at this point. If anybody has any idea why something works one way in the IDE and another way when exported, please let me know.

The likely problem is that your HTML file needs to identify its character encoding. Nowadays, generally best to use UTF-8 as the encoding for most purposes.
Determine the file’s encoding
If you have access to the source code of your Java app, examine that to see what character encoding is being used when producing the HTML file. But I assume you have no such access.
Open the HTML file in a text-editor to examine its raw source code. See if it specifies a character encoding. If it does, and that character encoding indicator is incorrect, you will need to alter your HTML file.
If no character encoding is indicated within the HTML, you will need to experiment to discover the encoding. Open the HTML file in a web browser, then use the "view" or developer tools available in most browsers (Firefox, Safari, Edge, etc.) to explicitly switch between encodings.
If switching to a particular encoding causes the text to appear as expected, then you know the likely encoding.
Specify the file’s encoding
In the modern version of HTML, HTML5, UTF-8 is the default encoding assumed by the web browser. But if the web browser switches into Quirks Mode, the browser may assume another encoding. To help avoid Quirks Mode, a HTML5 document should start with <!DOCTYPE html>.
So, best to be explicit about the encoding. Once you determine the encoding being used by your Java app creating the HTML file, either alter that app (if you have source code) to write an indicator of the encoding, or else write another Java app to edit the produced HTML file to include the indicator. If you are not a Java developer, you could use any programming language or even a shell script to edit the produced HTML file.
To indicate the encoding of an HTML5 file, add a meta element.
For UTF-8:
<meta charset="UTF-8">
For Latin-1:
<meta charset="ISO-8859-1">
If your Java app was developed exclusively on Microsoft Windows, the developer may have knowingly or unwittingly used one of the Microsoft defined character encodings. Older versions of Java defaulted to using a character encoding specific to the host platform — but be aware in Java 18+ the default changes to UTF-8 across platforms.
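The symptom in the question is consistent with UTF-8 bytes being read as a single-byte encoding, although the question alone does not prove that. The sketch below reproduces that kind of damage and shows one way to write HTML with an explicit UTF-8 encoding that does not depend on the JVM default. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encodes text as UTF-8, then decodes it with a different charset,
    // which is what a reader does when it guesses the encoding wrongly.
    static String misdecode(String text, Charset wrongCharset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrongCharset);
    }

    // Builds a small HTML5 page and encodes it explicitly as UTF-8,
    // independent of the JVM's default charset.
    static byte[] htmlPage(String body) {
        String html = "<!DOCTYPE html>\n"
                + "<html><head><meta charset=\"UTF-8\"></head>"
                + "<body>" + body + "</body></html>\n";
        return html.getBytes(StandardCharsets.UTF_8);
    }
}
```

With these helpers, misdecode("č", StandardCharsets.ISO_8859_1) yields Ä followed by an invisible control character, and the bytes from htmlPage can be written with Files.write so that the file content and its meta element agree.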
For more info
You can read about these issues in many places. Like here and in Wikipedia.
If you are not savvy with character sets and character encoding, I highly recommend reading the surprisingly entertaining article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky.

How to specify a char set for file name (not content) in Java?

We are running a Java web application on a Linux server with the default locale "POSIX".
Some of our clients upload files that contain non-ASCII characters in their file names.
We can retain those non-ASCII characters in Java as Unicode, but they are lost (the saved file name contains many question marks) once we actually save the uploaded file to the file system, because the file system's default locale doesn't support non-ASCII characters.
Is there any way to specify a character set for the file name (not the content) before saving a file in Java?

The portable Java API does not have a concept of a file system character encoding, as that wouldn't be portable: Windows, for example, saves file names as Unicode no matter the locale. On Linux, however, the LC_CTYPE facet of your locale determines the encoding of the file system. So by exporting LC_CTYPE=en_US.utf8 or similar to the environment before you launch your Java application, your application will use that for file name handling.
Also see file.encoding has no effect, LC_ALL environment variable does it which talks about some of the internals behind this conversion.

If the files are entirely under the control of your app rather than being uploaded for another app to use, then I would consider doing your own encoding/decoding of the file names before saving them, e.g. URLEncoder.encode(filename, "UTF-8") to map a user-supplied name to one you can use on disk and URLDecoder.decode(encodedName, "UTF-8") vice versa.
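A minimal sketch of that mapping, with hypothetical class and method names. It uses the Charset overloads of URLEncoder.encode and URLDecoder.decode, which assume Java 10 or later; on older versions the String-named overloads quoted above do the same job but throw a checked exception.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class DiskNames {
    // Maps a user-supplied name to an ASCII-only name for storage.
    static String toDiskName(String userName) {
        return URLEncoder.encode(userName, StandardCharsets.UTF_8);
    }

    // Recovers the original name from the stored one.
    static String fromDiskName(String diskName) {
        return URLDecoder.decode(diskName, StandardCharsets.UTF_8);
    }
}
```

Because the encoded form is plain ASCII, it survives a POSIX locale, and path separators in the user-supplied name are escaped as well, so the stored name cannot point into another directory.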

Using Unicode characters for file names inside a zip archive

I am zipping a file whose name contains special characters, such as Péréquation LES HOPITAUX NEUFS.xls, to a different folder, say temp.
I am able to zip the file, but the file name changes automatically to
P+¬r+¬quation LES HOPITAUX NEUFS.xls.
How can I support unicode characters for file names inside a zip archive?

It depends a little bit on what code you're using to create the archive. The old Java compression classes are not as flexible as you need.
You may use Apache Commons Compress. Michael Simons wrote this nice piece of code:
ZipArchiveOutputStream ostream = ...; // Your initialization code here
ostream.setEncoding("Cp437"); // This should handle your "special" characters
ostream.setFallbackToUTF8(true); // For "unknown" characters!
ostream.setUseLanguageEncodingFlag(true);
ostream.setCreateUnicodeExtraFields(
ZipArchiveOutputStream.UnicodeExtraFieldPolicy.NOT_ENCODEABLE);
If you're using Java 7 or later, you finally have a Charset parameter (which can be UTF-8) on the ZipOutputStream constructor.
The big problem, anyway, is that many implementations don't understand Unicode encoding because the original ZIP file format is ASCII and there is not an official standard for Unicode. See this post for further details.
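For the JDK route, here is a minimal sketch (hypothetical class and method names) that writes an entry name with a chosen charset and then inspects general-purpose bit 11 of the first local file header, the flag that tells readers the name is UTF-8. It assumes the archive starts with a local file header, which is the case for archives written by ZipOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipNameEncoding {
    // Writes a one-entry archive in memory, encoding the entry name
    // with the given charset.
    static byte[] zipWithName(String entryName, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("demo".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads general-purpose bit 11 (the UTF-8 name flag) from the first
    // local file header; the little-endian flag field sits at offset 6.
    static boolean hasUtf8Flag(byte[] zipBytes) {
        int flags = (zipBytes[6] & 0xFF) | ((zipBytes[7] & 0xFF) << 8);
        return (flags & 0x0800) != 0;
    }
}
```

Writing with StandardCharsets.UTF_8 sets the flag, while a legacy charset leaves it clear. Readers that ignore the flag will still misread UTF-8 names, which is the interoperability gap described above.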

The Zip specification (historically) does not specify what character encoding is to be used for the embedded file names and comments; the original IBM PC character encoding set, commonly referred to as IBM Code Page 437, is supposed to be the only encoding supported. The Jar specification, meanwhile, explicitly specifies UTF-8 as the encoding for all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefore strictly followed the Jar specification and used UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/Zip files.
Consequence? A ZIP file created by a "traditional" ZIP tool is not accessible to a java.util.jar/zip based tool, and vice versa, if a file name contains characters that are not compatible between Cp437 (as an alternative, tools might simply use the default platform encoding) and UTF-8.
For most Europeans, you're "lucky":-) in that you only need to avoid a "handful" of characters, such as the umlauts (OK, I'm just kidding), but for Japanese and Chinese, most of the characters are simply out of luck. This is why bug 4244499 had been the No.1 on the Top 25 Java Bugs for so many years. The bug is no longer on the list:-) it has been finally "fixed" in OpenJDK 7, b57. I still keep a snapshot as the record/kudo for myself:-)
The solution (I would use "solution" rather than "fix") in JDK7 b57 is to introduce a new set of ZipInputStream, ZipOutputStream and ZipFile constructors with a specific "charset" as the parameter, as shown below.
ZipFile(File, Charset)
ZipInputStream(InputStream, Charset)
ZipOutputStream(OutputStream, Charset)
With these new constructors, applications can now access those non-UTF-8 ZIP files via ZipInputStream or ZipFile objects created with the specific encoding, or create Zip files encoded in non-UTF-8 via the new ZipOutputStream(os, charset) constructor, if necessary.
zip is a stripped-down version of the Jar tool with a "-encoding" option to support non-UTF8 encoding for entry name and comment, it can serve as a demo for how to use the new APIs (I used it as a unit test). I'm still debating with myself if it is a good idea to officially introduce "-encoding" into the Jar tool...
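The charset-taking constructors listed above can be exercised with a small in-memory round trip. This is only a sketch with hypothetical class and method names; ISO-8859-1 stands in for whatever legacy code page a traditional ZIP tool might have used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipCharsetRoundTrip {
    // Writes an archive with one empty entry, encoding its name
    // with the given charset.
    static byte[] write(String entryName, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decodes the first entry name using the charset the caller
    // believes the archive was written with.
    static String firstEntryName(byte[] zipBytes, Charset nameCharset) {
        try (ZipInputStream zip = new ZipInputStream(
                new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry = zip.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The name comes back intact when the reader is given the same charset the writer used. With a mismatched charset the name may be garbled or the read may fail, so the reader needs to know, or guess, the producer's encoding.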

File names with Japanese characters turn to garbage when written to a zip file using java.util.zip.*

I have a directory with a name that contains Japanese characters, and I need to use the zip utils in java.util.zip to write it to a zip file. Writing the zip file succeeds, but when I open the resulting zip file with either Windows' built-in compressed file utility or 7-Zip, the directory with Japanese characters in the name appears as a bunch of garbage characters. I do have the Japanese/East Asian language pack installed on my system -- I can create directories with Japanese names, so that isn't the issue.
Interestingly, if I write a separate script to read the resulting zip file using java.util.zip, the directory name is correct, and I can extract the contents of the zip into appropriately named directories, with Japanese characters. But I can't do this using the commercial zip tools that I've tried, which is undoubtedly what our customers will want to do.
Any ideas about what is causing this problem, and how I can work around it?
I know about this bug, but I still need a workaround for this case.

TrueZIP claims to do this better:
The J2SE API always uses UTF-8 (eight bit Unicode character set) for entry names and comments instead of CP437 (a.k.a. IBM437, the genuine IBM-PC character set), which is used by the de-facto standard PKZIP from PKWARE. As a result, you cannot read or write ZIP files with international entry file names such as e.g. "täscht.txt" in a ZIP file created by a (southern) German.
[description of other problems omitted]
The TrueZIP Library has been developed to overcome these limitations/disadvantages.

Miracles indeed happen, and Sun/Oracle did really fix the long-living bug/rfe:
Now it's possible to [set up filename encodings upon creating][1] the zip file/stream (requires Java 7).
[1]: http://download.java.net/jdk7/docs/api/java/util/zip/ZipOutputStream.html#ZipOutputStream(java.io.OutputStream, java.nio.charset.Charset)

If java.util.zip still behaves as this post describes, I'm not sure if it is possible (with the built-in classes). I have seen Chilkat's Java Zip library mentioned before as a way to get this to work, but have never used it.

Help in creating Zip files from .Net and reading them from Java

I'm trying to create a Zip file from .Net that can be read from Java code.
I've used SharpZipLib to create the Zip file, but even though the generated file is valid according to the CheckZip function of the #ZipLib library and can be successfully uncompressed with WinZip or WinRar, I always get an error when trying to uncompress it using the java.util.zip classes in Java.
The problem seems to be a wrong header written by SharpZipLib. I've also posted a question on the SharpDevelop forum (see http://community.sharpdevelop.net/forums/t/8272.aspx for info), but with no result.
Does anyone have a code sample for creating a Zip file with .Net and decompressing it with the java.util.zip classes?
Regards
Massimo

I have used DotNetZip library and it seems to work properly. Typical code:
using (ZipFile zipFile = new ZipFile())
{
zipFile.AddDirectory(sourceFolderPath);
zipFile.Save(archiveFolderName);
}

I had the same problem creating zips with SharpZipLib (latest version) and extracting with java.util.zip.
Here is what fixed the problem for me. I had to force the exclusion of the zip64 usage:
ZipOutputStream s = new ZipOutputStream(File.Create(someZipFileName));
s.UseZip64 = UseZip64.Off;

Can't help with SharpZipLib, but you can try to create the zip file using the ZipPackage class in System.IO.Packaging, without third-party libraries (requires .NET 3+).

To judge whether it's really a conformant ZIP file, see PKZIP's .ZIP File Format Specification.
For what it's worth I have had no trouble using SharpZipLib to create ZIPs on a Windows Mobile device and open them with WinZip or Windows XP's built-in Compressed Folders feature, and also no trouble producing ZIPs on the desktop with SharpZipLib and processing them with my own ZIP extraction utility (basically a wrapper around zlib) on the mobile device.

You don't wanna use the ZipPackage class in .NET - it isn't quite a standard zip model. Well it is, but it presumes a particular structure in the file, with a manifest with a well-known name, and so on. ZipPackage seems to have been optimized for Office docs and XPS docs.
A third-party library, like http://www.codeplex.com/DotNetZip, is probably a better bet if you are doing general-purpose ZIP files and want good interoperability.
DotNetZip builds files that are very interoperable with just about everything, including Java's java.util.zip. But be careful using features that Java does not support, like ZIP64 or Unicode. ZIP64 is useful only for very large archives, which Java does not support well at this time, I think. Java supports Unicode in a particular way, so if you produce a Unicode-based ZIP file with DotNetZip, you just have to follow a few rules and it will work fine.
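On the Java side, reading such an archive can be sketched as follows. This is not a reproduction of the SharpZipLib problem, only a hypothetical reader that can be pointed at a .NET-produced file to see whether java.util.zip accepts it; the class name, the method names and the in-memory sample archive are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipInteropCheck {
    // Reads every file entry from a ZIP stream into memory, keyed by name.
    static Map<String, byte[]> readAll(InputStream in) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] chunk = new byte[8192];
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.isDirectory()) {
                    continue; // directories carry no data
                }
                ByteArrayOutputStream data = new ByteArrayOutputStream();
                for (int n = zip.read(chunk); n != -1; n = zip.read(chunk)) {
                    data.write(chunk, 0, n);
                }
                entries.put(e.getName(), data.toByteArray());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return entries;
    }

    // Builds a tiny archive in memory so the reader can be tried
    // without a .NET-produced file at hand.
    static byte[] sampleZip() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("docs/"));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docs/readme.txt"));
            zip.write("hello".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }
}
```

To test a real file, pass Files.newInputStream(path) to readAll. If an exception surfaces there for an archive that other tools open without complaint, the archive is probably using a feature this reader does not handle, such as the ZIP64 headers mentioned above.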

I had a similar problem with unzipping SharpZipLib-zipped files on Linux. I think I solved it (well, it works on Linux and Mac now, I tested it); check out my blog post: http://igorbrejc.net/development/c/sharpziplib-making-it-work-for-linuxmac
