I have a classic Java desktop application. The build produces a JAR file that runs on a Windows machine.
The application reads some XML files and produces an HTML document as the end result. The XML files contain language-specific characters that do not exist in English.
During development in the IDE (Apache NetBeans 13), when I use Build -> Run, the exported HTML file contains the language-specific characters.
When I run the JAR file from the project's dist directory, the HTML does not contain the language-specific characters.
For example, characters like č, ć, đ, and š are exported as Ä�, while when running from NetBeans they are exported as themselves, not as that strange symbol.
The letters in question are from Serbian, Croatian and Bosnian.
When I export the project from NetBeans, I made sure this option is enabled:
Project -> Project Properties -> Build -> Packaging, with the "Copy Dependent Libraries" option selected.
I am puzzled at this point. If anybody has any idea why something works one way in the IDE and another way when exported, please let me know.
The likely problem is that your HTML file needs to identify its character encoding. Nowadays it is generally best to use UTF-8 as the encoding for most purposes.
Determine the file’s encoding
If you have access to the source code of your Java app, examine that to see what character encoding is being used when producing the HTML file. But I assume you have no such access.
Open the HTML file in a text-editor to examine its raw source code. See if it specifies a character encoding. If it does, and that character encoding indicator is incorrect, you will need to alter your HTML file.
If no character encoding is indicated within the HTML, you will need to experiment to discover the encoding. Open the HTML file in a web browser, then use the "view" or developer tools available in most browsers (Firefox, Safari, Edge, etc.) to explicitly switch between encodings.
If switching to a particular encoding causes the text to appear as expected, then you know the likely encoding.
Specify the file’s encoding
In the modern version of HTML, HTML5, UTF-8 is the default encoding assumed by the web browser. But if the web browser switches into Quirks Mode, the browser may assume another encoding. To help avoid Quirks Mode, an HTML5 document should start with <!DOCTYPE html>.
So it is best to be explicit about the encoding. Once you determine the encoding used by the Java app that creates the HTML file, either alter that app (if you have the source code) to write an indicator of the encoding, or write another program to edit the produced HTML file to include the indicator. If you are not a Java developer, you could use any programming language or even a shell script to edit the produced HTML file.
To indicate the encoding of an HTML5 file, add a meta element.
For UTF-8:
<meta charset="UTF-8">
For Latin-1:
<meta charset="ISO-8859-1">
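If you cannot change the generating app, a small program can patch the produced file after the fact. Here is a minimal sketch, assuming Java 11+ and hypothetical names (report.html, and windows-1252 as the encoding you determined above); it injects the meta element and rewrites the file as UTF-8 so the declared and actual encodings agree:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class AddCharsetTag {
    public static void main(String[] args) throws Exception {
        Path html = Path.of("report.html"); // hypothetical file name
        Charset discovered = Charset.forName("windows-1252"); // the encoding you determined above
        String content = Files.readString(html, discovered);
        // Inject the meta element right after <head>; assumes a plain lowercase <head> tag.
        String fixed = content.replaceFirst("<head>", "<head><meta charset=\"UTF-8\">");
        Files.writeString(html, fixed, StandardCharsets.UTF_8); // rewrite so the bytes match the tag
    }
}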
If your Java app was developed exclusively on Microsoft Windows, the developer may have knowingly or unwittingly used one of the Microsoft-defined character encodings. Older versions of Java defaulted to a character encoding specific to the host platform; be aware that in Java 18+ the default changed to UTF-8 on all platforms.
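This platform default is very likely what the original question ran into: NetBeans may launch the JVM with one default encoding while java -jar from the dist folder picks up the Windows default (typically windows-1252 in Western locales). Note that č encoded as UTF-8 is the two bytes 0xC4 0x8D, and windows-1252 renders 0xC4 as Ä with 0x8D unmappable, which is exactly the Ä� seen in the question. If you can edit the app, pass an explicit charset wherever the HTML is written instead of relying on the default. A minimal sketch, with hypothetical names:

import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteHtmlUtf8 {
    public static void main(String[] args) throws Exception {
        Path out = Path.of("report.html"); // hypothetical output file
        // Name the charset explicitly instead of relying on the platform default.
        try (Writer writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            writer.write("<!DOCTYPE html>\n<html><head><meta charset=\"UTF-8\"></head>\n");
            writer.write("<body>č ć đ š</body>\n</html>\n");
        }
    }
}

As a stopgap without rebuilding, running java -Dfile.encoding=UTF-8 -jar YourApp.jar has traditionally forced the default on pre-18 JVMs, though that property's behavior was only formally specified in JDK 18.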
For more info
You can read about these issues in many places, such as Wikipedia.
If you are not savvy with character sets and character encoding, I highly recommend reading the surprisingly entertaining article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky.
Related
Our team has a program that generates PDFs written in Java. The PDFs, which may have non-ASCII filenames, are zipped using Apache Commons Compress. The zip files are then uploaded to S3 to be downloaded by Windows and Mac clients.
When unzipped on Mac using the native tools, the files are recreated with the correct filename. However, when trying to unzip using the native Windows UI tool, the filenames are created incorrectly.
The zip process is:
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
and I have added the following code, but it's still not working; the filenames display as unreadable characters on Windows:
zipFile.setEncoding("UTF-8");
zipFile.setUseLanguageEncodingFlag(true);
zipFile.setCreateUnicodeExtraFields(ZipArchiveOutputStream.UnicodeExtraFieldPolicy.ALWAYS);
How can I create zip files that can be used by both Mac and Windows?
According to the Apache Commons Compress documentation (https://commons.apache.org/proper/commons-compress/zip.html):
Windows' "compressed folder" feature doesn't recognize any flag or extra field and creates archives using the platforms default encoding - and expects archives to be in that encoding when reading them.
and
If Windows' "compressed folders" is your primary consumer, then your best option is to explicitly set the encoding to the target platform. You may want to enable creation of Unicode extra fields so the tools that support them will extract the file names correctly.
Therefore:
If you know that your Windows users are based in a limited region of the Earth and your filenames are limited to that region (e.g. all Latin), you could heed Apache's advice and define an 8-bit code page for the filename encoding (see the sketch below), which would also be respected by OS X's unzip. However, this would not work on Windows machines in a different region, or on machines that happen to use a slightly different code page (North American vs. Western European).
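A minimal sketch of that approach with Commons Compress, assuming a Western European user base (hence Cp1252) and a hypothetical input file; the code page is the knob you would tune to your users' region:

import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;

public class RegionalZip {
    public static void main(String[] args) throws Exception {
        Path source = Path.of("relatório.pdf"); // hypothetical input file
        try (ZipArchiveOutputStream zip =
                new ZipArchiveOutputStream(Files.newOutputStream(Path.of("reports.zip")))) {
            zip.setEncoding("Cp1252"); // code page matching the target Windows region (assumption)
            // Unicode extra fields let tools that understand them still recover the exact names.
            zip.setCreateUnicodeExtraFields(ZipArchiveOutputStream.UnicodeExtraFieldPolicy.ALWAYS);
            zip.putArchiveEntry(new ZipArchiveEntry(source.getFileName().toString()));
            Files.copy(source, zip);
            zip.closeArchiveEntry();
        }
    }
}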
The sensible alternative would be to use an alternative archive tool on Windows and possibly an alternative archive format. Perhaps you could create self-extracting archives for Windows by prepending a suitable extraction tool to the zip file. For example, you can create a self-extracting 7zip archive in Java using the rough instructions here: http://sourceforge.net/p/sevenzip/discussion/45798/thread/de8aa3c6
The pseudo-format is a simple concatenation:
7z.sfx + config.txt + your-created-archive.7z -> your-created-archive.exe
Where 7z.sfx is a 7zip self-extracting executable "header" distributed with 7zip.
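The concatenation is just appending the raw bytes of the three parts in order; a minimal Java sketch using the hypothetical names from the pseudo-format above:

import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class MakeSfx {
    public static void main(String[] args) throws Exception {
        try (OutputStream out = Files.newOutputStream(Path.of("your-created-archive.exe"))) {
            // Append the SFX stub, its config, and the archive, byte for byte.
            for (String part : new String[] {"7z.sfx", "config.txt", "your-created-archive.7z"}) {
                Files.copy(Path.of(part), out);
            }
        }
    }
}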
In response to comments on the question:
Windows uses UTF-16 for filenames and, AFAIK, uses UTF-16 in its low-level API, which Java calls. However, the Windows console is quite broken in this regard and does not readily support UTF-8.
(Java also uses UTF-16 internally for String objects)
OS X enforces UTF-8 for filename encodings, so Java should also be respecting that when creating filenames.
I'm working on a big Java web application in Eclipse whose files have different encodings: some are in UTF-8, others in Cp1252, and still others in ISO-8859-1 (with no pattern across JSPs, Java source files, or CSS), but I know the encoding of each file.
I'm converting the project to Maven, and this is a great occasion to turn all of them to UTF-8.
Of course I don't want to lose a single character (so fully automated conversions do not apply here).
How should I go about it? Is there a tool that can help me ensure I don't lose any special character?
The webapp is in Italian, so, especially in the JSPs, there could be lots of accented letters (HTML entities have probably not been used everywhere).
The project is in Eclipse, but I can use an external editor if that could make the conversion easier.
It's very easy to write code to convert encodings - although I'd expect there are tools to do it anyway. Simply:
Create one FileInputStream to the existing file, and wrap it in an InputStreamReader with the appropriate encoding
Create one FileOutputStream to the new file, and wrap it in an OutputStreamWriter with the appropriate encoding
Loop over the reader, reading characters into a buffer and writing out the contents of that buffer (just as many characters as you read) until you've read the whole file
Close all resources (automatic with a try-with-resources block)
The first two steps are simpler with Files.newBufferedReader and Files.newBufferedWriter, too.
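A minimal sketch of those steps, assuming Java 7+ and that you pass in the charset you already know for each file:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Recode {
    static void convert(Path source, Path target, Charset sourceCharset) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read); // write only as many chars as were actually read
            }
        }
    }
}

Note that reading with the wrong source charset silently produces mojibake rather than an error, so spot-check a few accented strings in the converted output before discarding the originals.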
Converting a single file can be done with the iconv tool (I used LibIconv for Windows).
It lets you specify the source and destinations encodings, and warns when characters can't be converted.
I tried it with a couple of source files and all the accented letters were correctly converted to UTF-8 from Cp1252.
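For example, a single conversion looks like this (hypothetical file names):

iconv -f CP1252 -t UTF-8 Page.jsp > Page_utf8.jsp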
I have a NetBeans Platform application that uses a custom DataEditorSupport and a custom CloneableEditor. The files it reads are UTF-8 encoded, but the editor that is created does not seem to be using UTF-8.
For example, my file has
"TêSt"
and the editor displays this as
"TêStÃ"
How can I get the DataEditorSupport or CloneableEditor to correctly read UTF-8?
This FAQ entry in the NetBeans wiki might be of help. See also the General Queries API and, in particular, the FileEncodingQuery.
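A minimal sketch of the FileEncodingQuery route, assuming you expose the implementation for your file type (for example through your DataObject's lookup or a layer registration); the class name is a placeholder:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.netbeans.spi.queries.FileEncodingQueryImplementation;
import org.openide.filesystems.FileObject;

public class Utf8EncodingQuery extends FileEncodingQueryImplementation {
    @Override
    public Charset getEncoding(FileObject file) {
        // Always report UTF-8 for the files this query is registered for.
        return StandardCharsets.UTF_8;
    }
}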
I am building an app that takes information from Java and builds an Excel spreadsheet. Some of the information contains international characters. I am having an issue where Russian characters, for example, are rendered correctly in Java, but when I send those characters to Excel, they are not rendered properly. I initially thought the issue was an encoding problem, but I am now thinking that the problem is simply not having the Russian language pack loaded on my Windows 7 machine.
I need to know if there is a way for a Java application to "force" Excel to show international characters.
Thanks
Check the file encoding you're using if characters don't show up. Java defaults to the platform's native encoding (windows-1252 on Western-locale Windows) instead of UTF-8. You can explicitly set the writers to use UTF-8 or a Cyrillic encoding.
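If the file you hand to Excel is a CSV, note that Excel assumes a legacy code page unless the file starts with a byte order mark. A minimal sketch, with hypothetical file name and content:

import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CyrillicCsv {
    public static void main(String[] args) throws Exception {
        Path csv = Path.of("report.csv"); // hypothetical output file
        try (Writer writer = Files.newBufferedWriter(csv, StandardCharsets.UTF_8)) {
            writer.write('\uFEFF'); // BOM: lets Excel detect UTF-8 instead of guessing a code page
            writer.write("Имя,Город\n");
            writer.write("Иван,Москва\n");
        }
    }
}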
We recently got a localization file that contains Portuguese translations of all the strings in our Java app. The file they gave me was a .csv file, and I use FileMaker to create the .tab file that we need for our purposes. Unfortunately, none of the accents seem to work. For example, the string vocÍ in our localization file shows up as vocΩ inside the application. I tried switching the language settings to Portuguese before compiling, but I still get this problem. Does anyone have ideas of what else I might try?
I think that your problem is related to the file encoding used.
Java has full Unicode support, so there shouldn't be any problems unless the file you are reading (the one made with FileMaker) is encoded in something different from what Java expects (the platform default charset before Java 18, UTF-8 from Java 18 on).
You can try saving the file in a different encoding, or specifying which encoding to use when opening it from Java. Many API classes accept an additional parameter to specify which charset to use when opening a file; take a look at the documentation.
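For instance, here is a minimal sketch (hypothetical file name; x-MacRoman is only a guess, based on the Mac-style symbols in the question and FileMaker's classic Mac OS roots, and requires a JDK that ships the extended charsets):

import java.io.BufferedReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadTabFile {
    public static void main(String[] args) throws Exception {
        Path tab = Path.of("strings_pt.tab"); // hypothetical file name
        // Name the charset the exporter actually used; "x-MacRoman" is only a guess here.
        try (BufferedReader reader = Files.newBufferedReader(tab, Charset.forName("x-MacRoman"))) {
            reader.lines().forEach(System.out::println);
        }
    }
}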