Issue with encoding UTF-8 when FTPing files - java

I am able to have my application upload files via FTP using the FTPClient Java library.
(I happen to be uploading to an Oracle XML DB repository.)
Everything uploads fine unless the XML file has curly quotes in it, in which case I get the error:
LPX-00200: could not convert from encoding UTF-8 to UCS2
I can upload what I believe to be the same file using the Windows command-line FTP tool. I am wondering if there is some encoding setting that the Windows command-line tool uses that I may need to set in my Java code.
Anyone know stuff about this? Thanks!!

I don't know that application, but you could try using -Dfile.encoding=UTF-8 on your JVM command line.

Not familiar with Oracle XML DB repositories—can they accept compressed uploads? Zipping or gzipping your file would save resources and frustrate any ASCII file type autodetection in use.

In binary mode this problem goes away:
FTPClient.setType(FTPClient.TYPE_BINARY);
http://www.sauronsoftware.it/projects/ftp4j/manual.php#3
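For context, a minimal sketch with ftp4j that switches to binary before the upload (the host, credentials and file name are placeholders):
import it.sauronsoftware.ftp4j.FTPClient;
import java.io.File;

public class BinaryUpload {
    public static void main(String[] args) throws Exception {
        FTPClient client = new FTPClient();
        client.connect("ftp.example.com");        // placeholder host
        client.login("user", "password");         // placeholder credentials
        client.setType(FTPClient.TYPE_BINARY);    // send the bytes untouched, no ASCII translation
        client.upload(new File("document.xml"));  // placeholder file
        client.disconnect(true);
    }
}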

If your file contains curly quotes, windows-1252 (the usual Windows superset of iso-8859-1) encodes them as single bytes with the high-order bit set. In UTF-8, those characters take three bytes each.
It's quite possible that you've accidentally saved the XML file in one of these single-byte encodings instead of UTF-8. That would cause the conversion error, because a byte with the high-order bit set is only valid in UTF-8 as part of a multi-octet sequence.
If you're in Windows, open the file in Notepad and re-save the document using Save As... with the UTF-8 encoding, then upload the changed file. On Unix, use iconv or a similar tool to convert from windows-1252 (or iso-8859-1) to UTF-8 before uploading.
If the XML document explicitly marks its encoding, make sure it's marked with the correct encoding (e.g. UTF-8). In many xml parsers, you can parse iso-8859-1 or windows-1252 character set encoded XML as long as it's marked as such.
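If you'd rather do the conversion in Java than with Notepad or iconv, here is a minimal sketch, assuming the source file really is windows-1252 (the file names are placeholders):
import java.io.*;

public class ReEncode {
    public static void main(String[] args) throws IOException {
        // Decode with the actual source charset, re-encode as UTF-8.
        Reader in = new InputStreamReader(new FileInputStream("input.xml"), "windows-1252");
        Writer out = new OutputStreamWriter(new FileOutputStream("output.xml"), "UTF-8");
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
        out.close();
    }
}
Remember to update the XML declaration's encoding attribute if it previously named the old charset.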

Related

Java unicode conversion on Linux not working on Mac OS X

I am writing a java application on Ubuntu Linux that reads in a text file and creates an xml file from the data. Some of the text contains curly apostrophes and quotes that I convert to straight apostrophes and quotes using the following code:
dataLine = dataLine.replaceAll( "[\u2018\u2019]", "\u0027" ).replaceAll( "[\u201C\u201D]", "\u005c\u0022" );
This works fine, but when I port the jar file to a Mac OSX machine, I get three question marks where I should get straight apostrophes and quotes. I created a test application on the Mac using the same line of code to do the conversion and the same test file for input and it worked fine. Why doesn't the jar file created on the Linux machine work correctly on a Mac? I thought java was supposed to be cross platform compatible.
Chances are you're not reading the file correctly to start with. You haven't shown how you're reading the file, but my guess is that you're just using FileReader, or an InputStreamReader without specifying the encoding. In that case, the default platform encoding is used - and if that's not the actual encoding of the file, you won't be reading the right characters. You should be able to detect that without doing any replacement at all.
Instead, you should use a FileInputStream and wrap it in an InputStreamReader with the correct encoding - which is likely to be UTF-8 as it's XML. (You should be able to check this easily.)
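For example, a minimal sketch along those lines, assuming the input file is UTF-8 (the file name is a placeholder):
import java.io.*;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"), "UTF-8"));
        String dataLine;
        while ((dataLine = reader.readLine()) != null) {
            // The curly quotes now arrive as the correct characters on every platform,
            // so the replacements behave the same on Linux and Mac OS X.
            dataLine = dataLine.replaceAll("[\u2018\u2019]", "'")
                               .replaceAll("[\u201C\u201D]", "\"");
            // ... build the XML from dataLine ...
        }
        reader.close();
    }
}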

Character inferno

I need some help. I have to read data from a file and store it in an Oracle db. I run into trouble when characters like 'à' or 'À' appear in the data. For example, 'à' is read in as a garbled multi-character sequence in my application, so when I try to save the data, the db sometimes complains that the values are too big for the fields they are being saved into. I also tried
Normalizer.normalize(row, Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
I paid attention to encoding too. I noticed that if I run my application on the data file (a Cp1252 file) on a Windows machine, I get no errors. Sadly, I get errors when I run it on a Linux machine. I'm using Java 6. TIA.
So, the default character encoding on your windows machine is probably windows-1252 (a superset of latin-1). That means that if you don't specify the charset when reading in the file, Java will default to your system default and get it right.
On your Linux machine, your default charset is probably UTF-8. That means that if you don't explicitly specify a charset while reading a file, it will default to UTF-8, which, in this case, is wrong.
You didn't post how you're reading in your file, but for example:
InputStreamReader isr = new InputStreamReader(new FileInputStream(file), "UTF-8");
This creates a reader that decodes the file as UTF-8 regardless of the platform default. In your case, since the data file is Cp1252, you would pass "Cp1252" (or "windows-1252") instead.

Wrong character encoding in dist jar generated with NetBeans

I finally wrote my little app. It's a desktop app, but it has an embedded web server. When I launch it from NetBeans everything is OK. When I launch the dist jar I get the correct character encoding in the GUI, but the web server output is corrupted ("?" instead of national characters).
I use NetBeans 6.7.1, jdk1.6.0_16, http server from Java 6 SE and lib Rome 1.0
I haven't posted any source code here because I have no idea which part I should post.
//edit:
The data is hardcoded in Strings. Those Strings are passed to Rome as arguments to create RSS nodes; Rome's RSS feeds are written to a String, and then the Strings are passed to the HttpHandler.
Check the encoding in the source files.
Check any point where encoding/decoding is performed (often any place where a String -> byte[] or byte[] -> String conversion happens). Anything that converts bytes to a String is performing a decoding operation (someEncoding -> UTF-16), and anything converting a String to bytes is encoding (UTF-16 -> someEncoding).
Check that you are passing the appropriate encoding information to 3rd party libraries that perform encoding/decoding.
If generating XML, ensure that the header encoding matches the encoding used to write the bytes (<?xml version="1.0" encoding="UTF-8"?>).
If serving content over HTTP, ensure that the content type and charset header is correct (e.g. Content-Type: text/html; charset=utf-8). A charset is usually only applicable if serving a text MIME type (it is not applicable for application/rss+xml, for example). Check your MIME documentation.
This issue probably has nothing to do with NetBeans. Usually character encoding issues are due to not defining the character encoding somewhere, in which case the actual character encoding will be determined pretty much by luck.
For instance, Java Strings are UTF-16 internally, but the encoding used by Java Readers is determined by the platform default unless explicitly specified.
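As a concrete illustration of the String -> byte[] point for the embedded Java 6 HTTP server, here is a rough sketch of an HttpHandler that encodes the feed text with an explicit charset instead of the platform default (the handler name and feed string are assumptions):
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.io.OutputStream;

class FeedHandler implements HttpHandler {
    private final String feedXml; // assumed to begin with <?xml version="1.0" encoding="UTF-8"?>

    FeedHandler(String feedXml) {
        this.feedXml = feedXml;
    }

    public void handle(HttpExchange exchange) throws IOException {
        byte[] body = feedXml.getBytes("UTF-8"); // explicit charset, not feedXml.getBytes()
        exchange.getResponseHeaders().set("Content-Type", "application/rss+xml");
        exchange.sendResponseHeaders(200, body.length);
        OutputStream os = exchange.getResponseBody();
        os.write(body);
        os.close();
    }
}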

Spring Properties File

Hi, I have this J2EE web application developed using the Spring framework. I have a problem with rendering messages in Japanese (nihongo) characters from the properties file. I tried converting the file to ASCII using native2ascii and it solved my problem. Is there no other way of converting the file, such as setting the encoding to ASCII in the configuration files, instead of manually converting it by running native2ascii at the command prompt?
AFAIK, in property files and resource bundles you have to use ASCII. Inside Spring XML configuration files, Unicode should work fine. If you prefer, you can edit the property files in Unicode and run native2ascii automatically as part of your build process (in Ant, Maven, etc.).
As per the java.util.Properties API document:
The load(Reader) / store(Writer, String) methods load and store properties from and to a character based stream in a simple line-oriented format specified below. The load(InputStream) / store(OutputStream, String) methods work the same way as the load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.
(Note that ISO 8859-1 is not the same as ASCII, as many here incorrectly suggest.)
So, to fix the particular problem without the need for native2ascii, you should use Properties#load(Reader) with an InputStreamReader(input, charset) instead.
Properties properties = new Properties();
properties.load(new InputStreamReader(classLoader.getResourceAsStream("file.properties"), "UTF-8"));
Note that this method was only introduced in Java 1.6 (over four years ago now), so make sure you're running at least that version.
I don't do Spring, so I can't go in detail about how to get Spring to work that way, but it would be obvious that you need to override/replace the Spring's resource bundle manager, if any.
Hey, I googled for the same issue and found something written in German that was a help for me: http://www.stefanglase.de/2009/10/13/spring-messagesource-mit-utf-8-encoding/
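Judging by that post's title, the idea is to configure Spring's message source with a UTF-8 default encoding. A minimal sketch (the basename is an assumption):
import org.springframework.context.support.ReloadableResourceBundleMessageSource;

public class MessageSourceConfig {
    // Sketch: have Spring read the .properties files as UTF-8 so native2ascii isn't needed.
    public ReloadableResourceBundleMessageSource messageSource() {
        ReloadableResourceBundleMessageSource messageSource = new ReloadableResourceBundleMessageSource();
        messageSource.setBasename("classpath:messages"); // assumed location of the property files
        messageSource.setDefaultEncoding("UTF-8");
        return messageSource;
    }
}
The same settings can be expressed as an XML bean definition; either way the bean must be registered under the name messageSource so the ApplicationContext picks it up.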

How do I properly store and retrieve internationalized Strings in properties files?

I'm experimenting with internationalization by making a Hello World program that uses properties files + ResourceBundle to get different strings.
Specifically, I have a file "messages_en_US.properties" that stores "hello.world=Hello World!", which works fine of course.
I then have a file "messages_ja_JP.properties" which I've tried all sorts of things with, but it always appears as some type of garbled string when printed to the console or in Swing. The problem is obviously with the reading of the content into a Java string, as a Java string in Japanese typed directly into the source can print fine.
Things I've tried:
The .properties file in UTF-8 encoding with the Japanese string as-is for the value. Something I read indicates that Java expects a properties file to be in the native encoding of the system...? It didn't work either way.
The file in default encoding (ISO-8859-1) and the value stored as escaped Unicode created by the native2ascii program included with Java. Tried with a source file in various Japanese encodings... SHIFT-JIS, EUC-JP, ISO-2022-JP.
Edit:
I actually figured this out while I was typing this, but I figured I'd post it anyway and answer it in case it helps anyone.
I realized that native2ascii was assuming (surprise) that it was converting from my operating system's default encoding each time, and as such not producing the correct escaped Unicode string.
Running native2ascii with the "-encoding encoding_name" option where encoding_name was the name of the source file's encoding (SHIFT-JIS in this case) produced the correct result and everything works fine.
Ant also has a native2ascii task that runs native2ascii on a set of input files and sends output files wherever you want, so I was able to add a builder that does that in Eclipse so that my source folder has the strings in their original encoding for easy editing and building automatically puts converted files of the same name in the output folder.
As of JDK 1.6, Properties has a load() method that accepts a Reader. That means you can save all the property files as UTF-8 and read them all directly by passing an InputStreamReader to load(). I think that's the most elegant solution, but it requires your app to run on a Java 6 runtime.
Historically, load() only accepted an InputStream, and the stream was decoded as ISO-8859-1. Not the system default encoding, always ISO-8859-1. That's important, because it makes a certain hack possible. Say your property file is stored as UTF-8. After you retrieve a property, you can re-encode it as ISO-8859-1 and decode it again as UTF-8, like this:
String realProp = new String(prop.getBytes("ISO-8859-1"), "UTF-8");
It's ugly and fragile, but it does work. But I think the best solution, at least for the next few years, is the one you found: bulk-convert the files with native2ascii using a build tool like Ant.
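If you want to stay with ResourceBundle.getBundle rather than loading Properties yourself, Java 6 also lets you plug the Reader-based loading in through a custom ResourceBundle.Control. A sketch (the class name and bundle name are just examples; it ignores the class-based bundle format and the reload flag for brevity):
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class UTF8Control extends ResourceBundle.Control {
    @Override
    public ResourceBundle newBundle(String baseName, Locale locale, String format,
                                    ClassLoader loader, boolean reload) throws IOException {
        // Resolve e.g. "messages" + ja_JP to "messages_ja_JP.properties".
        String bundleName = toBundleName(baseName, locale);
        String resourceName = toResourceName(bundleName, "properties");
        InputStream stream = loader.getResourceAsStream(resourceName);
        if (stream == null) {
            return null;
        }
        try {
            // PropertyResourceBundle(Reader) lets us pick the charset ourselves.
            return new PropertyResourceBundle(new InputStreamReader(stream, "UTF-8"));
        } finally {
            stream.close();
        }
    }
}
Usage would then be ResourceBundle.getBundle("messages", Locale.JAPAN, new UTF8Control()), with the .properties files saved as UTF-8.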
An alternative way to handle the properties files is:
http://www.unipad.org/main/
This is an editor which can read/write files in the \u unicode escape format, which is the format native2ascii creates.
I don't know how well it works with Japanese; I've used it for Hungarian.
