I have some strings in Java (originally from an Excel sheet) that I presume are in the Windows-1252 code page. I want them converted to Java's own Unicode format. The Excel file was parsed using the JXL package, in case that matters.
To clarify: the strings read from the Excel file apparently already look like some kind of Unicode.
WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
row = s.getRow(4);
String contents = row[0].getContents();
This is where contents seems to contain something Unicode: the åäö are multibyte characters, while the ASCII ones are normal single-byte characters. It is most definitely not Latin-1. If I print the "contents" string with println and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex (195 and 182 in decimal).
[edit]
I have tried the suggestions given below with different code pages, such as converting from Cp1252. Some kind of conversion did happen, because I got a different kind of gibberish instead. As a reference I always printed an "ö" string hard-coded in the source code, to verify that nothing was wrong with my terminal or typefaces. The manually typed "ö" always worked.
[edit]
I also tried WorkbookSettings as suggested in the comments, but I looked in the code for JXL and characterSet seems to be ignored by the parsing code. I think the parsing code just looks at whatever encoding the XLS file is supposed to be in.
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1250");
Worked for me.
If none of the answers above solves the problem, this might do the trick:
String myOutput = new String (myInput, "UTF-8");
This decodes the incoming bytes as UTF-8. Note that myInput has to be the raw byte[], and the data must really be UTF-8 for this to work.
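To illustrate why the charset name has to match the data, here is a minimal sketch. The class and method names are made up for this example, and it assumes the JRE ships the windows-1252 charset (most do). It decodes the same single byte twice: once with the matching charset and once with UTF-8, where a lone 0xF6 is not a valid sequence.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class DecodeDemo {
    // Decodes raw bytes into a String using the given charset.
    // new String(byte[], Charset) silently replaces malformed input with U+FFFD.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xF6}; // "ö" as stored in windows-1252

        String right = decode(raw, Charset.forName("windows-1252"));
        String wrong = decode(raw, StandardCharsets.UTF_8);

        // Print code points rather than glyphs so the console encoding does not matter.
        System.out.println(Integer.toHexString(right.charAt(0))); // f6
        System.out.println(Integer.toHexString(wrong.charAt(0))); // fffd
    }
}
```

Once the replacement character U+FFFD is in the String, the original byte cannot be recovered from it, so the decoding step is the place to get the charset right.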
When Java reads a text file it uses some encoding to turn the bytes on disk into characters in memory. The default encoding varies from platform to platform. Java's internal String representation is Unicode already, so if it parses the file with the right encoding then you are already done; just write out the data in any encoding you want.
If your strings appear corrupted when you look at them in Java, it is probably because you are using the wrong encoding to read the data. Excel is probably using UTF-16 (Little-Endian I think) but I'd expect a library like JXL should be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings. I imagine it auto-detects any encodings as it needs to.
Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:
String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
FileOutputStream fos = new FileOutputStream("test.txt"); // Open file
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-16"); // Specify character encoding
PrintWriter pw = new PrintWriter(osw);
pw.print(text ); // repeat as needed
pw.close(); // cleanup
osw.close();
fos.close();
If your problem is something else please edit your question and provide more details.
You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.
JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.
"windows-1252"/"Cp1252" is not required to be supported by JREs, but is by Sun's (and presumably most others). See the "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.
FileInputStream fis = new FileInputStream (yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis,"CP1250"));
And do with reader whatever you'd do directly with file.
Your description indicates that the encoding is UTF-8 and indeed C3 B6 is the UTF-8 encoding for 'ö'.
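If you want to check that claim yourself, a small sketch like the following prints the bytes a string produces in a given encoding. The class name HexDump is just an illustrative choice, and windows-1252 is assumed to be available in your JRE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class HexDump {
    // Returns the bytes of text in the given charset as space-separated hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String oUmlaut = "\u00f6"; // "ö", written as an escape so the source encoding is irrelevant
        System.out.println(hex(oUmlaut, StandardCharsets.UTF_8));           // C3 B6
        System.out.println(hex(oUmlaut, Charset.forName("windows-1252"))); // F6
    }
}
```

Seeing C3 B6 in the output file therefore suggests that the String inside Java was already correct and that println simply encoded it as UTF-8 on the way out.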
Related
I have a zip file. It contains some files, and the files contain Chinese characters, so I used
ZipInputStream zipStream = new ZipInputStream(
new BufferedInputStream(new FileInputStream(zipFilePath), BUFFER_SIZE),
Charset.forName("ISO-8859-1")
);
......
FileOutputStream fileOutput = new FileOutputStream(uncompressedFileName);
while (zipStream.available() > 0) {
fileOutput.write(zipStream.read());
}
Extraction runs successfully. After that I want to use the encodingDetect method to find the encoding, but now the service does not work: it returns nomatch. If I send the files directly to the service, it works and finds the charset properly, e.g. UTF-8.
I guess that Charset.forName("ISO-8859-1") extracts the files but corrupts their format. Do you have any idea?
The problem is the Charset of the file names in the zip. UTF-8 raises an error (the file names are evidently not in UTF-8), as UTF-8 requires a special format for multi-byte sequences, and evidently there are malformed "multibyte" sequences.
ISO-8859-1 is a single-byte encoding that accepts any garbage.
What you should do is try the small number of Chinese Charsets, so that the file name strings are filled correctly. A Java String contains Unicode, so it can hold text from any Charset. Help from someone who reads Chinese would probably be useful.
And then try writing files with those names. If that is not successful on your PC, you must use artificial file names, maybe a transliteration from Chinese.
A translation table from the original Chinese file name to the actual file name may be created as a UTF-8 text file, maybe with a BOM, '\uFEFF', at the beginning of the file.
The ISO-8859-1 charset most definitely does not support the Chinese language. Use UTF-8 instead of ISO-8859-1.
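For what it is worth, the Charset given to ZipInputStream is only used to decode entry names and comments; the file contents are copied as raw bytes either way. One more thing that may be worth checking in the question's copy loop: read() returns -1 at the end of an entry, and as far as I can tell write(zipStream.read()) then appends that as a stray 0xFF byte, which could be enough to confuse an encoding detector. Below is a minimal sketch of an extraction loop that stops on -1. The class and method names are invented for the example, and it keeps everything in memory for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only.
class ZipExtractSketch {
    // Reads every entry into memory, keyed by its name decoded with nameCharset.
    // Directory entries simply end up with empty content in this sketch.
    static Map<String, byte[]> readEntries(InputStream zipData, Charset nameCharset)
            throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipData, nameCharset)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                // Stop when read() reports end of entry; never write the -1 marker.
                while ((n = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                entries.put(entry.getName(), content.toByteArray());
            }
        }
        return entries;
    }
}
```

For the names, you would pass whichever charset the archive was created with, for example Charset.forName("GBK") if your JRE provides it. That particular choice is only a guess about this archive.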
I am building an app where users have to guess a secret word. I have *.txt files in the assets folder. The problem is that the words are in the Albanian language. Our language uses letters like "ë" and "ç", so whenever I read a word containing any of those characters from the file I get some weird symbol, and I cannot use string.compare() on these characters. I have tried many options with UTF-8 and changed the Eclipse settings, but I still get the same error.
I would really appreciate it if someone has any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
aBuffer += aDataRow + "\n";
stringList.add(aDataRow);
}
Otherwise the code works fine, except for the characters mentioned.
It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is in ISO/IEC_8859-1 or ISO/IEC_8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)
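As a rough sketch of the "hard-wire the encoding" option, the following reads all lines from a stream with an explicit charset. The class and method names are invented for the example; in the question's code the stream would come from am.open(strOpenFile).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class ExplicitCharsetReader {
    // Reads all lines from the stream, decoding bytes with the given charset
    // instead of relying on the platform default.
    static List<String> readLines(InputStream in, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Passing the charset as a parameter keeps the decision in one place, so switching to a configuration property later would not require touching the reading code.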
You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're on Java 7 or later.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).
You need to know the encoding of the file.
The InputStream class reads the file as binary. Although you can interpret the input as characters, that is implicit guessing, which may be wrong.
The InputStreamReader class converts binary to chars, but it needs to know the character set.
You should use the constructor that takes a character set.
UPDATE
Don't assume you have a UTF-8 encoded file; that may be wrong. Here in Russia we have encodings such as CP866, WIN1251 and KOI8, which all differ from UTF-8. You probably have some popular Albanian encoding for text files. Check your OS settings to guess.
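One way to test a guess, sketched below with invented names, is to decode the bytes strictly so that invalid sequences are reported instead of being silently replaced. This can rule UTF-8 out, but it cannot confirm a single-byte encoding such as ISO-8859-1, because those accept almost any byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper for illustration only.
class StrictDecode {
    // Returns the decoded text, or null if the bytes are not valid in the given charset.
    static String tryDecode(byte[] raw, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return null; // the bytes cannot be text in this charset
        }
    }
}
```

For example, the single byte 0xEB ("ë" in ISO-8859-1) is rejected as UTF-8, while the two bytes C3 AB decode to "ë" as UTF-8.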
How can I append a UTF-8 string to a properties file? My code is given below.
public static void addNewAppIdToRootFiles() {
Properties properties = new Properties();
try {
FileInputStream fin = new FileInputStream("C:\\Users\\sarika.sukumaran\\Desktop\\root\\root.properties");
properties.load(new InputStreamReader(fin, Charset.forName("UTF-8")));
String propertyStr = new String(("قسيمات").getBytes("iso-8859-1"), "UTF-8");
BufferedWriter bw = new BufferedWriter(new FileWriter(directoryPath + rootFiles, true));
bw.write(propertyStr);
bw.newLine();
bw.flush();
bw.close();
fin.close();
} catch (Exception e) {
System.out.println("Exception : " + e);
}
}
But when I open the file, the string I have written "قسيمات" to the file shows as "??????". Please help me.
OK, your first mistake is getBytes("iso-8859-1"). You should not do these manipulations at all. If you want to write Unicode text to a file, you should open the file and write the text. The internal representation of strings in Java is Unicode, so everything will be written correctly, provided the writer uses an encoding (such as UTF-8) that can represent the characters.
You have to care about the charset when you are reading the file. BTW, you do that correctly.
But you do not have to use file manipulation tools to append something to a properties file. You can just call prop.setProperty("yourkey", "yourvalue") and then call prop.store(new FileOutputStream(yourfilename), null).
OK, I have checked the specification for the Properties class. If you use load() with an input stream or store() with an output stream, the stream is assumed to be ISO-8859-1 encoded. Therefore, you have to be cautious about a few things:
Some characters in French, German and Portuguese are ISO-8859-1 (Latin-1) compatible, so they normally work fine in ISO-8859-1 and you don't have to worry that much. But others, like Arabic and Hebrew characters, are not Latin-1 compatible, so you need to be careful with the choice of encoding for these characters. If you have a mix of French and Arabic characters, you have no choice but to use Unicode.
What is your current input file's encoding, if the file already exists and is to be used with the Properties load() method? If it is not the default ISO-8859-1, then you need to figure out what it is before opening the file. If the input file's encoding is UTF-8, then use properties.load(new InputStreamReader(new FileInputStream("infile"), "UTF8")); Then stick to this encoding till the end. Match the file encoding with the character encoding as well.
If it is a new input file to be used with the Properties load() method, choose a file encoding that works with your characters. Then stick to this encoding till the end.
Your expected output file's encoding should be the same as the one used with the Properties load() method before you use the store() method. If it is not the default ISO-8859-1, then you need to figure out what it is before saving the file. Stick to this encoding till the end. Match the file encoding with the character encoding as well. If the output file's encoding is UTF-8, then specifically use UTF-8 when saving the file. But if the store() method still ends up producing an output file in ISO-8859-1, then you need to do what is suggested next...
If you stick to the default ISO-8859-1, it works fine for characters like the French ones. But if the characters are not ISO-8859-1 (Latin-1) compatible, you need to use Unicode escapes as an alternative, for example \uFE94 for the Arabic ﺔ character. For me, this escaping is too tedious, and normally we use the native2ascii utility provided in the JRE or JDK to convert a properties file from one encoding to another. Of course, there are other ways; just check the references below. For me, it is better to use a properties file in XML format, since it is UTF-8 by default.
References:
Java properties UTF-8 encoding in Eclipse
Setting the default Java character encoding?
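As a sketch of the "stick to one encoding" advice: Properties also has load(Reader) and store(Writer, String) overloads, and with those the encoding is whatever the Reader or Writer uses. The class and method names below are made up, and note that this rewrites the whole file rather than appending a raw line.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper for illustration only.
class Utf8Properties {
    // Loads a properties file that is stored as UTF-8.
    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(in);
        }
        return props;
    }

    // Writes all properties back as UTF-8; non-Latin-1 characters are written
    // as-is because a Writer is used instead of an OutputStream.
    static void store(Properties props, Path file) throws IOException {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            props.store(out, null); // null means no comment header text
        }
    }
}
```

Adding a value then becomes: load the file, call setProperty with the new key and value, and store it again, with no getBytes or new String conversions in between.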
I'm trying to index Wikipedia dumps. My SAX parser makes Article objects from the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.
I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except those with matching prefixes.
In French, the prefixes with no accent also work (i.e. filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:) but I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the offending pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file, but not in the original…
Any ideas?
Alternatives I tried:
Getting the file (commented lines were tried without success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"CatÃ©gorie:", "ModÃ¨le:", "WikipÃ©dia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad: the one below did work, I had tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(is, handler); // pass the InputSource, not the raw stream
Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings since they are byte-based, not character-based. So, here we create a Reader from a FileInputStream: the latter (stream) doesn't know about encodings, but the former (reader) does, because we give the encoding in the constructor.
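Here is a rough sketch of feeding such a Reader to the SAX parser through an InputSource and filtering titles by prefix. All names in it are invented for the example and it is not the asker's ArticleSink. One detail that might also be relevant to titles that fail only on some pages: SAX is allowed to deliver the text of one element in several characters() calls, so the sketch accumulates the text and only tests the prefix in endElement().

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
class TitleFilterSketch {
    // Returns the text of every <title> element that starts with none of the prefixes.
    static List<String> titlesWithout(InputStream xml, String... ignoredPrefixes)
            throws Exception {
        List<String> kept = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length); // may be called more than once per title
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!"title".equals(qName)) {
                    return;
                }
                String title = text.toString();
                text = null;
                for (String prefix : ignoredPrefixes) {
                    if (title.startsWith(prefix)) {
                        return; // filtered out
                    }
                }
                kept.add(title);
            }
        };
        // The Reader fixes the decoding to UTF-8 before the parser sees any characters.
        Reader reader = new InputStreamReader(xml, StandardCharsets.UTF_8);
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(reader), handler);
        return kept;
    }
}
```

The prefixes passed in should be written with \u escapes, or the source file compiled with the matching -encoding flag, as described above.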
I'm reading a file line by line, like this:
FileReader myFile = new FileReader(file); // file is the uploaded java.io.File
BufferedReader InputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = InputFile.readLine();
while(currentRecord != null) {
currentRecord = InputFile.readLine();
}
But if other types of files are uploaded, it will still read their contents. For instance, if the uploaded file is an image, it will output junk characters when reading the file. So my question is: how can I check the file is CSV for sure before reading it?
Checking extension of the file is kind of lame since someone can upload a file that is not CSV but has a .csv extension. Thanks in advance.
Determining the MIME type of a file is not easy, especially if ASCII sections can be mixed with binary ones.
Actually, when you look at how a Java mail system determines the MIME type of an email, it involves reading all the bytes in it and applying some "rules".
Check out MimeUtility.java
If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
If there is even one non-US-ASCII character, the encoding is "base64".
@return "7bit", "quoted-printable" or "base64"
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
it has been dead since 2006
it does involve reading all the content:
File file = new File("/home/bibi/monfichieratester");
InputStream inputStream = new FileInputStream(file);
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
int readByte;
while ((readByte = inputStream.read()) != -1) {
byteArrayStream.write(readByte);
}
String mimetype = "";
byte[] bytes = byteArrayStream.toByteArray();
MagicMatch m = Magic.getMagicMatch(bytes);
mimetype = m.getMimeType();
So... since you are reading all the content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
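To give an idea of what "your own rules" could look like, here is a small sketch that follows the rules quoted above. It is not the JavaMail implementation: the names are invented, a byte counts as non-US-ASCII only when its value is above 127, and the case of exactly half (which the quoted text leaves open) is treated as quoted-printable.

```java
// Hypothetical helper for illustration only; mirrors the quoted rules, not JavaMail's code.
class EncodingRuleSketch {
    // Picks "7bit", "quoted-printable" or "base64" from the share of non-ASCII bytes.
    static String chooseEncoding(byte[] data, boolean primaryTypeIsText) {
        int nonAscii = 0;
        for (byte b : data) {
            if ((b & 0x80) != 0) { // high bit set means the value is above 127
                nonAscii++;
            }
        }
        if (nonAscii == 0) {
            return "7bit";
        }
        if (!primaryTypeIsText) {
            return "base64"; // even one non-ASCII byte in non-text data
        }
        // Assumption: "exactly half" falls on the quoted-printable side.
        return (2L * nonAscii > data.length) ? "base64" : "quoted-printable";
    }
}
```

For a CSV upload, a result of "base64" on data that is supposed to be text would be a hint that the file is probably binary.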
Java Mime Magic may be of use. It'll analyse mime-types from files and inputstreams. I can't vouch for its functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain. e.g. determining the number of comma-separated values per line and rejecting if it's not within certain limits. Then split on the commas and parse each entry according to requirements (e.g. are they doubles/floats/valid Strings - and if strings, what encoding). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted half-way through.
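A minimal sketch of such a domain-specific check might look like this. The names are invented, the file is assumed to be UTF-8, and the split is a plain comma split that does not handle quoted fields, so treat it as a starting point rather than a CSV parser.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class CsvPlausibility {
    // Returns true if the bytes are valid UTF-8 text and every non-empty line
    // has exactly expectedColumns comma-separated fields.
    static boolean looksLikeCsv(byte[] raw, int expectedColumns) {
        String text;
        try {
            text = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return false; // not valid UTF-8, so probably a binary file
        }
        int checked = 0;
        for (String line : text.split("\\R")) { // \R matches any line break (Java 8+)
            if (line.isEmpty()) {
                continue;
            }
            // limit -1 keeps trailing empty fields such as "a,b,"
            if (line.split(",", -1).length != expectedColumns) {
                return false;
            }
            checked++;
        }
        return checked > 0; // an empty file is not accepted
    }
}
```

Parsing each field according to its expected type would then be the next step, since a file can pass this check and still contain values that make no sense.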