This question already has answers here:
Java : How to determine the correct charset encoding of a stream
(16 answers)
Closed 8 years ago.
Any idea how to get the real encoding of a file such as .html, .txt, .java, etc. in Java?
Since some source files are not UTF-8, I want to convert them to UTF-8.
In general, it is not possible to always detect exactly what the character encoding of a text file is - there's nothing stored in a text file that tells you explicitly what the character encoding is. You can make some intelligent guesses, but don't expect that you'll always be able to find out exactly what the character encoding of a text file is.
The link that cebewee posted in the comments has more information on how to detect what the character encoding of a text file is.
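One of the "intelligent guesses" mentioned above can be sketched in plain Java: try a strict UTF-8 decode and see whether it fails. This is only a heuristic (pure ASCII passes, and some non-UTF-8 files can pass by coincidence), and the class and method names here are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns true if the bytes decode as UTF-8 with no malformed sequence.
    // A "true" result is a guess, not proof that the file really is UTF-8.
    static boolean looksLikeUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

If the check fails, the file is definitely not valid UTF-8, and you have to fall back on other hints (a BOM, an XML or HTML encoding declaration, or a detection library).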
You can use a tool like UTFCast to batch-convert file encodings. Just run it on all of your source files and you should be done. On Linux, you can use 'iconv' to convert file encodings.
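If you would rather do the conversion from Java itself, a minimal sketch could look like the following. It assumes you already know the source encoding (for example UTF-16); the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Transcode {
    // Reads 'source' using the given charset and writes the same text to 'target' as UTF-8.
    static void toUtf8(Path source, Charset sourceCharset, Path target) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, sourceCharset);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}
```

Files.newBufferedReader throws an exception on malformed input instead of silently substituting characters, so a wrong guess about the source encoding tends to fail loudly. Write to a different target file than the source so a failed run does not destroy the original.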
I am new to Java. I want Java code to convert a text file coming from Unix into a text file that goes to a Linux server. So it is character-conversion code from UTF-16 to UTF-8. The text file goes through an Oracle database before it reaches the Linux server. I need this conversion because some special symbols are being turned into garbage values. Please help, Java experts :)
This question already has answers here:
Validation of files based on their file extensions
(2 answers)
Closed 9 years ago.
I want to validate file contents based on their extension. For example, a user can save a document file (.doc/.docx) as an Excel file (.xls/.xlsx). Before I read the file contents, I need to validate in Java that the content type matches the extension.
Does anyone have an idea about this? Please share your thoughts.
Projects already exist that detect the details of a file. The file command on Linux can do this, for example.
A Java project called Tika might be useful to you. Tika will parse the file(s) specified on the command line and output the extracted text content or metadata to standard output.
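As a rough illustration of what content-based detection does, here is a hand-rolled sketch that looks only at the leading "magic" bytes. This is not Tika's API; the class, enum, and method names are invented for the example. It relies on the fact that .docx/.xlsx files are ZIP containers (starting with "PK" 03 04) and legacy .doc/.xls files are OLE2 compound documents (starting with D0 CF 11 E0 A1 B1 1A E1).

```java
import java.util.Arrays;

final class ContainerSniffer {
    enum Kind { ZIP_BASED, OLE2, UNKNOWN }

    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Classifies a file by the first bytes of its content.
    static Kind sniff(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) return Kind.ZIP_BASED;
        if (startsWith(header, OLE2_MAGIC)) return Kind.OLE2;
        return Kind.UNKNOWN;
    }

    // Container kind that the extension implies (case-insensitive).
    static Kind expectedFor(String fileName) {
        String name = fileName.toLowerCase();
        if (name.endsWith(".docx") || name.endsWith(".xlsx")) return Kind.ZIP_BASED;
        if (name.endsWith(".doc") || name.endsWith(".xls")) return Kind.OLE2;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Note the limitation: magic bytes alone cannot tell a .doc from an .xls (both are OLE2) or a .docx from an .xlsx (both are ZIP). Distinguishing those requires looking inside the container, which is the kind of work a library like Tika is meant for.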
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Decode Base64 data in java
I have a java file which will be downloaded from a location. The files are BinHex encoded.
Is there a jar available that I can use in Java code to decode the BinHex file?
Please help me.
Apache commons has a class called BCodec that allows specifying custom character sequence for encoding. Maybe you can adapt it to your needs?
In any case, the mechanics behind Base64 and BinHex are very similar, so you could take this class and adapt it to your needs. I think modifying the character sequence may be enough.
EDIT:
Here's an implementation of BinConverter as a part of BlowfishJ library.
I am able to have my application upload files via FTP using the FTPClient Java library.
(I happen to be uploading to an Oracle XML DB repository.)
Everything uploads fine unless the xml file has curly quotes in it. In which case I get the error:
LPX-00200: could not convert from encoding UTF-8 to UCS2
I can upload what I believe to be the same file using the Windows CMD line FTP tool. I am wondering if there is some encoding setting that the windows CMD line tool uses that maybe I need to set in my Java code.
Anyone know stuff about this? Thanks!!
I don't know that application, but you could try passing -Dfile.encoding=UTF-8 on your JVM command line.
Not familiar with Oracle XML DB repositories—can they accept compressed uploads? Zipping or gzipping your file would save resources and frustrate any ASCII file type autodetection in use.
In binary this problem goes away.
client.setType(FTPClient.TYPE_BINARY); // client is your connected FTPClient instance
http://www.sauronsoftware.it/projects/ftp4j/manual.php#3
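If your file contains curly quotes, they are single bytes with the high-order bit set (0x91 to 0x94) in the windows-1252 character set. iso-8859-1 proper assigns those byte values to control characters, but windows-1252 text is often mislabeled as iso-8859-1. In UTF-8, each of those characters takes three bytes.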
It's quite possible that you've accidentally encoded the xml file in one of these encodings instead of UTF-8. That would result in a conversion error, because the high-order bit being set is only allowed in sequences of multiple UTF-8 octets.
If you're on Windows, open the file in Notepad, re-save the document using Save As... with the UTF-8 encoding, and upload the changed file. On Unix, use iconv or a similar tool to convert from iso-8859-1 to UTF-8 before uploading.
If the XML document explicitly marks its encoding, make sure it's marked with the correct encoding (e.g. UTF-8). In many xml parsers, you can parse iso-8859-1 or windows-1252 character set encoded XML as long as it's marked as such.
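To make the byte-level difference concrete, here is a small sketch (class and method names are illustrative only). A left curly double quote is the single byte 0x93 in windows-1252 but the three bytes E2 80 9C in UTF-8, so a windows-1252 file read as UTF-8 contains a lone high byte that is not a valid sequence.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CurlyQuotes {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Re-encodes windows-1252 bytes as UTF-8, the same job iconv does on the command line.
    static byte[] cp1252ToUtf8(byte[] cp1252Bytes) {
        return new String(cp1252Bytes, CP1252).getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated uppercase hex for inspection.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

Calling hex(cp1252ToUtf8(...)) on a suspect file's bytes is a quick way to see whether the quotes end up as three-byte UTF-8 sequences after conversion.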
I'm experimenting with internationalization by making a Hello World program that uses properties files + ResourceBundle to get different strings.
Specifically, I have a file "messages_en_US.properties" that stores "hello.world=Hello World!", which works fine of course.
I then have a file "messages_ja_JP.properties" which I've tried all sorts of things with, but it always appears as some type of garbled string when printed to the console or in Swing. The problem is obviously with the reading of the content into a Java string, as a Java string in Japanese typed directly into the source can print fine.
Things I've tried:
The .properties file in UTF-8 encoding with the Japanese string as-is for the value. Something I read indicates that Java expects a properties file to be in the native encoding of the system...? It didn't work either way.
The file in default encoding (ISO-8859-1) and the value stored as escaped Unicode created by the native2ascii program included with Java. Tried with a source file in various Japanese encodings... SHIFT-JIS, EUC-JP, ISO-2022-JP.
Edit:
I actually figured this out while I was typing this, but I figured I'd post it anyway and answer it in case it helps anyone.
I realized that native2ascii was assuming (surprise) that it was converting from my operating system's default encoding each time, and as such not producing the correct escaped Unicode string.
Running native2ascii with the "-encoding encoding_name" option where encoding_name was the name of the source file's encoding (SHIFT-JIS in this case) produced the correct result and everything works fine.
Ant also has a native2ascii task that runs native2ascii on a set of input files and sends output files wherever you want, so I was able to add a builder that does that in Eclipse so that my source folder has the strings in their original encoding for easy editing and building automatically puts converted files of the same name in the output folder.
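For readers who want to see what the conversion amounts to, here is a rough Java sketch of the same idea: decode the bytes with an explicitly named source encoding, then replace every non-ASCII character with a backslash-u escape. It is not the native2ascii implementation, and the names are hypothetical; the point is that the source charset is passed in rather than taken from the platform default.

```java
import java.nio.charset.Charset;

final class UnicodeEscaper {
    // Decodes raw bytes with the stated charset, then escapes the result.
    static String escapeBytes(byte[] raw, Charset sourceCharset) {
        return escapeNonAscii(new String(raw, sourceCharset));
    }

    // Keeps ASCII as-is and writes every other UTF-16 code unit as a
    // backslash, the letter u, and four hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }
}
```

Passing the wrong charset to escapeBytes reproduces the original problem: the escapes are well-formed but describe the wrong characters.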
As of JDK 1.6, Properties has a load() method that accepts a Reader. That means you can save all the property files as UTF-8 and read them all directly by passing an InputStreamReader to load(). I think that's the most elegant solution, but it requires your app to run on a Java 6 runtime.
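A minimal sketch of that approach, assuming the property file is saved as UTF-8 (the helper class name is made up):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class Utf8Properties {
    // Loads a properties stream, decoding it as UTF-8 instead of ISO-8859-1.
    static Properties load(InputStream in) throws IOException {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }
}
```

Note that this covers Properties used directly; whether ResourceBundle picks up the same decoding depends on how the bundle is loaded.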
Historically, load() only accepted an InputStream, and the stream was decoded as ISO-8859-1. Not the system default encoding, always ISO-8859-1. That's important, because it makes a certain hack possible. Say your property file is stored as UTF-8. After you retrieve a property, you can re-encode it as ISO-8859-1 and decode it again as UTF-8, like this:
String realProp = new String(prop.getBytes("ISO-8859-1"), "UTF-8");
It's ugly and fragile, but it does work. But I think the best solution, at least for the next few years, is the one you found: bulk-convert the files with native2ascii using a build tool like Ant.
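For completeness, the re-encoding hack can be wrapped up as a small runnable sketch (helper names are illustrative). It works because ISO-8859-1 maps every byte value to exactly one character, so the original UTF-8 bytes can be recovered losslessly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class Latin1Hack {
    // Loads a UTF-8 encoded properties stream through the InputStream overload,
    // which decodes it as ISO-8859-1 and therefore yields mojibake values.
    static Properties loadRaw(InputStream in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    // Turns a mis-decoded value back into the intended text:
    // recover the original bytes, then decode them as UTF-8.
    static String fix(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
}
```

The fragile part is that fix must be applied exactly once, and only to values that came from a UTF-8 file read this way; applying it to a correctly decoded string corrupts it.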
An alternative way to handle the properties files is:
http://www.unipad.org/main/
This is an editor that can read and write files in the \u Unicode escape format, which is the format native2ascii creates.
I don't know how well it works with Japanese; I've used it for Hungarian.