How to read an excel(.xls) file like text? - java

I need to read an excel(.xls) file that i'm receiving.
Using the regular charsets like UTF-8, Cp1252, ISO-8859-1, UTF-16LE, none of these helped me, the characters are still malformed.
So i search ended up using juniversalchardet, it showed me that the charset was MacCyrillic, used MacCyrillic to read the file, but still the same weird outcome.
When i open the file on excel everything is fine, all the characters are fine, since its portuguese its filled whit Ç ~ and such. But opening whit notepad or trough java the file is all messed up.
But if open the file on my excel and then save it again like .txt it becomes readable
My method to find the charset
public static void lerCharset(String fileName) throws IOException {
byte[] buf = new byte[50000000];
FileInputStream fis = new FileInputStream(fileName);
// (1)
UniversalDetector detector = new UniversalDetector(null);
// (2)
int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
detector.handleData(buf, 0, nread);
}
// (3)
detector.dataEnd();
// (4)
String encoding = detector.getDetectedCharset();
if (encoding != null) {
System.out.println("Detected encoding = " + encoding);
} else {
System.out.println("No encoding detected.");
}
// (5)
detector.reset();
fis.close();
}
How can i discover the correct charset?
Should i try a different aproach? Like making my java re-save the excel and then start reading?

If I'm understanding your question, you're trying to read the excel file like a text file.
The challenge is that .xls files are actually binary files containing the text, formatting, sheet information, macro information, etc...
You'd either need to save the files as .csv (Either via Excel before running your program or through your program directly), upgrade them to .xlsx (which has numerous libraries that can read the file as an XML at that point) or use a library (such as apache POI or anything similar) or even query the data out using ADO.
Good luck and I hope that's what you were implying via your question.

Code:
WorkbookSettings workbookSettings = new WorkbookSettings();
WorkbookSettings.setEncoding("Cp1252");

Related

How to get file content properly from a jpg file?

I'm trying to get content from a jpg file so I can encrypt that content and save it in another file that is later decrypted.
I'm trying to do so by reading the jpg file as if it were a text file with this code:
String aBuffer = "";
try {
File myFile = new File(pathRoot);
FileInputStream fIn = new FileInputStream(myFile);
BufferedReader myReader = new BufferedReader(new InputStreamReader(fIn));
String aDataRow = "";
while ((aDataRow = myReader.readLine()) != null) {
aBuffer += aDataRow;
}
myReader.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
But this doesn't give the content the file has, just a short string and weirdly enough it also looks like just reading the file corrupts it.
What could I do so I can achieve the desired behavior?
Image files aren't text - but you're treating the data as textual data. Basically, don't do that. Use the InputStream to load the data into a byte array (or preferably, use Files.readAllBytes(Path) to do it rather more simply).
Keep the binary data as binary data. If you absolutely need a text representation, you'll need to encode it in a way that doesn't lose data - where hex or base64 are the most common ways of doing that.
You mention encryption early in the question: encryption also generally operates on binary data. Any encryption methods which provide text options (e.g. string parameters) are just convenience wrappers which encode the text as binary data and then encrypt it.
and weirdly enough it also looks like just reading the file corrupts it.
I believe you're mistaken about that. Just reading from a file will not change it in any way. Admittedly you're not using try-with-resources statements, so you could end up keeping the file handle open, potentially preventing another process from reading it - but the content of the file itself won't change.

Java IO with UTF characters

I have a weird problem with files.
I intend to modify the timing of an .srt file, but writing the new file seems to be a weird task.
Here's a sample code I wrote:
import java.io.*;
import java.nio.charset.Charset;
public class ReaderWriter {
public static void main(String[] args) throws IOException {
InputStream inputStream = new FileInputStream("D:\\E\\Movies\\English\\1960's\\TheApartment1960.srt");
Reader reader = new InputStreamReader(inputStream,
Charset.forName("UTF-8"));
OutputStream outputStream = new FileOutputStream("output.srt");
Writer writer = new OutputStreamWriter(outputStream,
Charset.forName("UTF-8"));
int data = reader.read();
while (data != -1) {
char theChar = (char) data;
writer.write(theChar);
data = reader.read();
}
reader.close();
writer.close();
}
}
This is an image from the original file:
However, the resulted file seems like:
I searched a lot for a solution but in vain. Any help, please.
First a few points:
There is nothing wrong with your Java code. If I use it to read an input file containing Arabic text encoded in UTF-8 it creates the output file encoded in UTF-8 with no problems.
I don't think there is a font issue. Since you can successfully display the content of the input file there is no reason you cannot also successfully display the content of a valid output file.
Those black diamonds with question marks in the output file are replacement characters which are "used to replace an incoming character whose value is unknown or unrepresentable in Unicode". This indicates that the input file you are reading is not UTF-8 encoded, even though the code explicitly states that it is. I can reproduce similar results to yours if the input file is UTF-16 encoded, but specified as UTF-8 in the code.
Alternatively, if the input file truly is UTF-8 encoded, specify it as UTF-16 in the code. For example, here is a valid UTF-8 input file with some Arabic text where the code (incorrectly) stated Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-16"));:
يونكود في النظم القائمة وفيما يخص التطبيقات الحاسوبية، الخطوط، تصميم النصوص والحوسبة متعددة اللغات.
And here is the output file, containing the replacement characters because the input stream of the UTF-8 file was incorrectly processed as UTF-16:
���⃙臙訠���ꟙ蓙苘Ꟙꛙ藘ꤠ���諘께딠�����ꟙ蓘귘Ꟙ동裘꣙諘꧘谠����꫘뗙藙諙蔠���⃙裘ꟙ蓘귙裘돘꣘ꤠ���⃘ꟙ蓙蓘뫘Ꟙꨮ�
Given all that, simply ensuring that the encoding of the input file is specified correctly in the InputStreamReader() constructor should solve your problem. To verify this, just create another input file and save it with UTF-8 character encoding, then run your code. If it works then you know that the problem was the that the encoding of input file was not UTF-8.

Convert from Codepage 1252 (Windows) to Java, in Java

I have some strings in Java (originally from an Excel sheet) that I presume are in Windows 1252 codepage. I want them converted to Javas own unicode format. The Excel file was parsed using the JXL package, in case that matter.
I will clarify: apparently the strings gotten from the Excel file look pretty much like it already is some kind of unicode.
WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
row = s.getRow(4);
String contents = row[0].getContents();
This is where contents seems to contain something unicode, the åäö are multibyte characters, while the ASCII ones are normal single byte characters. It is most definitely not Latin1. If I print the "contents" string with printLn and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex. (195 and 179 in decimal.)
[edit]
I have tried the suggestions with different codepages etc given below, tried converting from Cp1252 etc. There was some kind of conversion, because I would get some other kind of gibberish instead. As reference I always printed an "ö" string hand coded into the source code, to verify that there was not something wrong with my terminal or typefaces or anything. The manually typed "ö" always worked.
[edit]
I also tried WorkBookSettings as suggested in the comments, but I looked in the code for JXL and characterSet seems to be ignored by parsing code. I think the parsing code just looks at whatever encoding the XLS file is supposed to be in.
WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1250");
Worked for me.
If none of the answer above solve the problem, the trick might be done like this:
String myOutput = new String (myInput, "UTF-8");
This should decode the incoming string, whatever its format.
When Java parses a file it uses some encoding to read the bytes on the disk and create bytes in memory. The default encoding varies from platform to platform. Java's internal String representation is Unicode already, so if it parses the file with the right encoding then you are already done; just write out the data in any encoding you want.
If your strings appear corrupted when you look at them in Java, it is probably because you are using the wrong encoding to read the data. Excel is probably using UTF-16 (Little-Endian I think) but I'd expect a library like JXL should be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings. I imagine it auto-detects any encodings as it needs to.
Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:
String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
FileOutputStream fos = new FileOutputStream("test.txt"); // Open file
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-16"); // Specify character encoding
PrintWriter pw = new PrintWriter(osw);
pw.print(text ); // repeat as needed
pw.close(); // cleanup
osw.close();
fos.close();
If your problem is something else please edit your question and provide more details.
You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.
JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.
"windows-1252"/"Cp1252" is not required to be supported by JREs, but is by Sun's (and presumably most others). See the "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.
FileInputStream fis = new FileInputStream (yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis,"CP1250"));
And do with reader whatever you'd do directly with file.
Your description indicates that the encoding is UTF-8 and indeed C3 B6 is the UTF-8 encoding for 'ö'.

CSV file validation with Java

I'm reading a file line by line, like this:
FileReader myFile = new FileReader(File file);
BufferedReader InputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = InputFile.readLine();
while(currentRecord != null) {
currentRecord = InputFile.readLine();
}
But if other types of files are uploaded, it will still read their contents. For instance, if the uploaded file is an image, it will output junk characters when reading the file. So my question is: how can I check the file is CSV for sure before reading it?
Checking extension of the file is kind of lame since someone can upload a file that is not CSV but has a .csv extension. Thanks in advance.
Determining the MIME type of a file is not something easy to do, especially if ASCII sections can be mixed with binary ones.
Actually, when you look at how a java mail system does determine the MIME type of an email, it does involve reading all bytes in it, and applying some "rules".
Check out MimeUtility.java
If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
If there is even one non-US-ASCII character, the encoding is "base64".
#return "7bit", "quoted-printable" or "base64"
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
it is dead since 2006
it does involve reading the all content!
:
File file = new File("/home/bibi/monfichieratester");
InputStream inputStream = new FileInputStream(file);
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
int readByte;
while ((readByte = inputStream.read()) != -1) {
byteArrayStream.write(readByte);
}
String mimetype = "";
byte[] bytes = byteArrayStream.toByteArray();
MagicMatch m = Magic.getMagicMatch(bytes);
mimetype = m.getMimeType();
So... since you are reading the all content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
Java Mime Magic may be of use. It'll analyse mime-types from files and inputstreams. I can't vouch for it's functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain. e.g. determining the number of comma-separated values per line and rejecting if it's not within certain limits. Then split on the commas and parse each entry according to requirements (e.g. are they doubles/floats/valid Strings - and if strings, what encoding). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted half-way through.

Corrupt file when using Java to download file

This problem seems to happen inconsistently. We are using a java applet to download a file from our site, which we store temporarily on the client's machine.
Here is the code that we are using to save the file:
URL targetUrl = new URL(urlForFile);
InputStream content = (InputStream)targetUrl.getContent();
BufferedInputStream buffered = new BufferedInputStream(content);
File savedFile = File.createTempFile("temp",".dat");
FileOutputStream fos = new FileOutputStream(savedFile);
int letter;
while((letter = buffered.read()) != -1)
fos.write(letter);
fos.close();
Later, I try to access that file by using:
ObjectInputStream keyInStream = new ObjectInputStream(new FileInputStream(savedFile));
Most of the time it works without a problem, but every once in a while we get the error:
java.io.StreamCorruptedException: invalid stream header: 0D0A0D0A
which makes me believe that it isn't saving the file correctly.
I'm guessing that the operations you've done with getContent and BufferedInputStream have treated the file like an ascii file which has converted newlines or carriage returns into carriage return + newline (0x0d0a), which has confused ObjectInputStream (which expects serialized data objects.
If you are using an FTP URL, the transfer may be occurring in ASCII mode.
Try appending ";type=I" to the end of your URL.
Why are you using ObjectInputStream to read it?
As per the javadoc:
An ObjectInputStream deserializes primitive data and objects previously written using an ObjectOutputStream.
Probably the error comes from the fact you didn't write it with ObjectOutputStream.
Try reading it wit FileInputStream only.
Here's a sample for binary ( although not the most efficient way )
Here's another used for text files.
There are 3 big problems in your sample code:
You're not just treating the input as bytes
You're needlessly pulling the entire object into memory at once
You're doing multiple method calls for every single byte read and written -- use the array based read/write!
Here's a redo:
URL targetUrl = new URL(urlForFile);
InputStream is = targetUrl.getInputStream();
File savedFile = File.createTempFile("temp",".dat");
FileOutputStream fos = new FileOutputStream(savedFile);
int count;
byte[] buff = new byte[16 * 1024];
while((count = is.read(buff)) != -1) {
fos.write(buff, 0, count);
}
fos.close();
content.close();
You could also step back from the code and check to see if the file on your client is the same as the file on the server. If you get both files on an XP machine, you should be able to use the FC utility to do a compare (check FC's help if you need to run this as a binary compare as there is a switch for that). If you're on Unix, I don't know the file compare program, but I'm sure there's something.
If the files are identical, then you're looking at a problem with the code that reads the file.
If the files are not identical, focus on the code that writes your file.
Good luck!

Categories

Resources