Problems to compress Excel files, JAVA - java

I have some problems compressing excel files using the Hffman algorthim. The thing is that my code seems to work with .txt files, but when I'm trying to compress .xlsx or older versions of excel an error occurs.
First of all, I read my file like this:
File file = new File("fileName.xlsx");
byte[] dataOfFile = new byte[(int) file.length()];
DataInputStream dis = new DataInputStream(new FileInputStream(file));
dis.readFully(dataOfFile);
dis.close();
To check this (if everything seems OK) I use this code:
String entireFileText = new String(dataOfFile, "UTF-8");
for(int i=0;i<dataOfFile.length;i++)
{
System.out.print(dataOfFile[i]);
}
By doing this to a .txt file I get something like this (which seems to be OK):
"7210110810811132119111114108100331310721111193297114101321211111173"
But when I use this on .xlsx file I get this and I think the hyphen makes errors that might occur later in the compression:
"8075342006080003301165490-90122100-1245001908291671111101161011101169584121112101115934612010910832-944240-96020000000000000"... and so on
Anyway, by using a string a can map this into a HashMap, where I count the frequency of each character. I have a HashMap:
public static HashMap map;
public static boolean countHowOftenACharacterAppear(String s1) {
String s = s1;
for(int i = 0; i < s.length(); i++){
char c = s.charAt(i);
Integer val = map.get(new Character(c));
if(val != null){
map.put(c, new Integer(val + 1));
}
else{
map.put(c,1);
}
}
return true;
}
When I compress my string I use:
public static String compress(String s) {
String c = new String();
for(int i = 0; i < s.length(); i++)
c = c + fromCharacterToCode.get(s.charAt(i));
return c;
}
fromCharactertoCode is another HashMap of type :
public static HashMap fromCharacterToCode;
(I'm traversing through my table I've built. Dont't think this is the problem)
Anyway, the results from this using the .txt file is:
"01000110110111011011110001101110011011000001000000000"... (PERFECT)
From the .xlsx file:
"10101110110001110null0010000null0011000nullnullnull10110000null00001101011111" ...
I really don't get why I'm getting the nullpointers on the .xlsx files. I would be very happy if I could get some help here to solve this. Many thanks!!

Your problem is java I/O, well before getting to compression.
First, you don't really need DataInputStream here, but leave that aside. You then convert to String entireFileText assuming the contents of the file is text in UTF-8, whereas data files like .xlsx aren't text at all and many text files even on Windows aren't UTF-8. But you don't seem to use entireFileText, so that may not matter. If you do, and the file isn't plain ASCII text, your compressor will "lose" chunks of it and the output of decompression will be only a fraction of the compression input; that is usually considered unsatisfactory.
Then you extract each byte from dataOfFile. byte in Java is signed; plain ASCII text files will have only "positive" bytes 0x00 to 0x7F (and usually all 0x20 to 0x7E plus 0x09 0x0D 0x0A), but everything else (UTF-8 text, UTF-16 text, data, and executables) will have "negative" bytes 0x80 to 0xFF which come out as -0x80 to -0x01.
Your printout "7210110810811132119111114108100331310721111193297114101321211111173" for "the .txt file" is almost certainly the byte sequence 72=H 101=e 108=l 108=l 111=o 32=space 119=w 111=o 114=r 108=l 100=d 33=! 13=CR 10=LF 72=H 111=o 119=w 32=space 97=a 114=r 101=e 32=space 121=y 111=o 117=u 3=(ETX aka ctrl-C) (how did you get a ctrl-C into a file?! or was it really 30=ctrl-Z? that's somewhat usual for Windows text files)
Someone more familiar with .xlsx format might be able to reconstruct that one, but I can tell you right off the hyphens are due to bytes with negative values, printed in decimal (by default) as -128 to -1.
For a general purpose compressor, you shouldn't ever convert to java char's and String's; those are designed for text and not all files are text. Just work with bytes, but if you want them in consistently positive, mask with & 0xFF .

Related

Problem with input from user saved to file by RandomAccessFile methods

I've got a problem with input from user. I need to save input from user into binary file and when I read it and show it on the screen it isn't working properly. I dont want to put few hundreds of lines, so I will try to dexcribe it in more compact form. And encoding in NetBeans in properties of project is "UTF-8"
I got input from user, in NetBeans console or cmd console. Then I save it to object made up of strings, then add it to ArrayList<Ksiazka> where Ksiazka is my class (basically a book's properties). Then I save whole ArrayList object to file baza.bin. I do it by looping through whole list of objects of class Ksiazka, taking each String one by one and saving it into file baza.bin using method writeUTF(oneOfStrings). When I try to read file baza.bin I see question marks instead of special characters (ą, ć, ę, ł, ń, ó, ś, ź). I think there is a problem in difference in encoding of file and input data, but to be honest I don't have any idea ho to solve that.
Those are attributes of my class Ksiazka:
private String id;
private String tytul;
private String autor;
private String rok;
private String wydawnictwo;
private String gatunek;
private String opis;
private String ktoWypozyczyl;
private String kiedyWypozyczona;
private String kiedyDoOddania;
This is method for reading data from user:
static String podajDana(String[] tab, int coPokazac){
System.out.print(tab[coPokazac]);
boolean podawajDalej = true;
String linia = "";
Scanner klawiatura = new Scanner(System.in, "utf-8");
do{
try {
podawajDalej = false;
linia = klawiatura.nextLine();
}
catch(NoSuchElementException e){
System.err.println("Wystąpił błąd w czasie podawania wartości!"
+ " Spróbuj jeszcze raz!");
}
catch(IllegalStateException e){
System.err.println("Wewnętrzny błąd programu typu 2! Zgłoś to jak najszybciej"
+ " razem z tą wiadomością");
}
}while(podawajDalej);
return linia;
}
String[] tab is just array of strings I want to be able to show on the screen, each set (array) has its own function, int coPokazac is number of line from an array I want to show.
and this one saves all data from ArrayList<Ksiazka> to file baza.bin:
static void zapiszZmiany(ArrayList<Ksiazka> bazaKsiazek){
try{
RandomAccessFile plik = new RandomAccessFile("baza.bin","rw");
for(int i = 0; i < bazaKsiazek.size(); i++){
plik.writeUTF(bazaKsiazek.get(i).zwrocId());
plik.writeUTF(bazaKsiazek.get(i).zwrocTytul());
plik.writeUTF(bazaKsiazek.get(i).zwrocAutor());
plik.writeUTF(bazaKsiazek.get(i).zwrocRok());
plik.writeUTF(bazaKsiazek.get(i).zwrocWydawnictwo());
plik.writeUTF(bazaKsiazek.get(i).zwrocGatunek());
plik.writeUTF(bazaKsiazek.get(i).zwrocOpis());
plik.writeUTF(bazaKsiazek.get(i).zwrocKtoWypozyczyl());
plik.writeUTF(bazaKsiazek.get(i).zwrocKiedyWypozyczona());
plik.writeUTF(bazaKsiazek.get(i).zwrocKiedyDoOddania());
}
plik.close();
}
catch (FileNotFoundException ex){
System.err.println("Nie znaleziono pliku z bazą książek!");
}
catch (IOException ex){
System.err.println("Błąd zapisu bądź odczytu pliku!");
}
}
I think that there is a problem in one of those two methods (either I do something wrong while reading it or something wrong when it is saving data to file using writeUTF()) but even tho I tried few things to solve it, none of them worked.
After quick talk with lecturer I got information that I can use at most JDK 8.
You are using different techniques for reading and writing, and they are not compatible.
Despite the name, the writeUTF method of RandomAccessFile does not write a UTF-8 string. From the documentation:
Writes a string to the file using modified UTF-8 encoding in a machine-independent manner.
First, two bytes are written to the file, starting at the current file pointer, as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string. Following the length, each character of the string is output, in sequence, using the modified UTF-8 encoding for each character.
writeUTF will write a two-byte length, then write the string as UTF-8, except that '\u0000' characters are written as two UTF-8 bytes and supplementary characters are written as two UTF-8 encoded surrogates, rather than single UTF-8 codepoint sequences.
On the other hand, you are trying to read that data using new Scanner(System.in, "utf-8") and klawiatura.nextLine();. This approach is not compatible because:
The text was not written as a true UTF-8 sequence.
Before the text was written, two bytes indicating its numeric length were written. They are not readable text.
writeUTF does not write a newline. It does not write any terminating sequence at all, in fact.
The best solution is to remove all usage of RandomAccessFile and replace it with a Writer:
Writer plik = new FileWriter(new File("baza.bin"), StandardCharsets.UTF_8);
for (int i = 0; i < bazaKsiazek.size(); i++) {
plik.write(bazaKsiazek.get(i).zwrocId());
plik.write('\n');
plik.write(bazaKsiazek.get(i).zwrocTytul());
plik.write('\n');
// ...

reading UTF-16 produces unexpected results

I use the beaglebuddy Java library in an Android project for reading/writing ID3 tags of mp3 files. I'm having an issue with reading the text that was previously written using the same library and could not find anything related in their docs.
Assume I write the following info:
MP3 mp3 = new MP3(pathToFile);
mp3.setLeadPerformer("Jon Skeet");
mp3.setTitle("A Million Rep");
mp3.save();
Looking at the source code of the library, I see that UTF-16 encoding is explicitly set, internally it calls
protected ID3v23Frame setV23Text(String text, FrameType frameType) {
return this.setV23Text(Encoding.UTF_16, text, frameType);
}
and
protected ID3v23Frame setV23Text(Encoding encoding, String text, FrameType frameType) {
ID3v23FrameBodyTextInformation frameBody = null;
ID3v23Frame frame = this.getV23Frame(frameType);
if(frame == null) {
frame = this.addV23Frame(frameType);
}
frameBody = (ID3v23FrameBodyTextInformation)frame.getBody();
frameBody.setEncoding(encoding);
frameBody.setText(encoding == Encoding.UTF_16?Utility.getUTF16String(text):text);
return frame;
}
At a later point, I read the data and it gives me some weird Chinese characters:
mp3.getLeadPerformer(); // 䄀 䴀椀氀氀椀漀渀 刀攀瀀
mp3.getTitle(); // 䨀漀渀 匀欀攀攀琀
I took a look at the built-in Utility.getUTF16String(String) method:
public static String getUTF16String(String string) {
String text = string;
byte[] bytes = string.getBytes(Encoding.UTF_16.getCharacterSet());
if(bytes.length < 2 || bytes[0] != -2 || bytes[1] != -1) {
byte[] bytez = new byte[bytes.length + 2];
bytes[0] = -2;
bytes[1] = -1;
System.arraycopy(bytes, 0, bytez, 2, bytes.length);
text = new String(bytez, Encoding.UTF_16.getCharacterSet());
}
return text;
}
I'm not quite getting the point of setting the first 2 bytes to -2 and -1 respectively, is this a pattern stating that the string is UTF-16 encoded?
However, I tried to explicitly call this method when reading the data, that seems to be readable, but always prepends some cryptic characters at the start:
Utility.getUTF16String(mp3.getLeadPerformer()); // ��Jon Skeet
Utility.getUTF16String(mp3.getTitle()); // ��A Million Rep
Since the count of those characters seems to be constant, I created a temporary workaround by simply cutting them off.
Fields like "comments" where the author does not explicitly enforce UTF-16 when writing are read without any issues.
I'm really curious about what's going on here and appreciate any suggestions.

Need help reading bytes from a file (incorrect bytes returned)

i am trying to create a simple program to edit MP3 tags.But the problem is i am stuck at the very first step,as i cannot even pass the file bytes into a byte array.Technically i can,but they are wrong.
For example,if i copy the bytes into a txt file and open it,most of the text is gibberish including the tags(a problem for later),but the first letters are ID3 which are correct.
But if i print the byte array that results from the mp3 in the console,the first values are
1001001
1000100
110011
11
0
0
.....
Which are all invalid characters.But add a zero before the first row,a zero before the second,and TWO zeroes before the third and it now says ID3
What would cause zeroes to get lost like that? It's the same for every mp3 file.Thank you in advance for any help
The piece of code is a very simple copy
try {
FileInputStream BF1 = new FileInputStream("test.mp3");
FileOutputStream fout = new FileOutputStream("byteresults.txt");
byte[] tempbyte = new byte[1024];
BF1.read(tempbyte);
BF1.close();
int i;
for(i=0;i<900;i++){
System.out.print(Integer.toBinaryString(tempbyte[i])+'\n');}
} catch(FileNotFoundException fnf)
{
System.out.println("Specified file not found :" + fnf);
}
catch(IOException ioe)
{
System.out.println("Error while copying file :" + ioe);
}
When printing Binary format, most systems drop the leading zeroes since they do not contribute to the final value. You only expect eight digits cos it's bytes but binary counting itself doesn't work like that. It doesnt care for 8 slots or 4 slots etc. Consider 3 is written as 11 in binary so why bother printing that as 00000011?? Why not 0011?? The human reader will ignore those leading zeroes anyway.
Maybe you could try Hex format as it easier (= more efficiency). Something like :
for ( i=0; i<900; i++)
{
//System.out.print(Integer.toBinaryString(tempbyte[i])+'\n');
System.out.printf("0x%02X", tempbyte[i]);
}
This way you can even check your output against some Hex Editor software (handling exact same file's bytes). Easier to check that you have the right bytes at right place with right values etc...

How to split the ByteArray by reading from the file in C++?

I have written a Java program to write the ByteArray in to a file. And that resulting ByteArray is a resulting of these three ByteArrays-
First 2 bytes is my schemaId which I have represented it using short data type.
Then next 8 Bytes is my Last Modified Date which I have represented it using long data type.
And remaining bytes can be of variable size which is my actual value for my attributes..
So I have a file now in which first line contains resulting ByteArray which will have all the above bytes as I mentioned above.. Now I need to read that file from C++ program and read the first line which will contain the ByteArray and then split that resulting ByteArray accordingly as I mentioned above such that I am able to extract my schemaId, Last Modified Date and my actual attribute value from it.
I have done all my coding always in Java and I am new to C++... I am able to write a program in C++ to read the file but not sure how should I read that ByteArray in such a way such that I am able to split it as I mentioned above..
Below is my C++ program which is reading the file and printing it out on the console..
int main () {
string line;
//the variable of type ifstream:
ifstream myfile ("bytearrayfile");
//check to see if the file is opened:
if (myfile.is_open())
{
//while there are still lines in the
//file, keep reading:
while (! myfile.eof() )
{
//place the line from myfile into the
//line variable:
getline (myfile,line);
//display the line we gathered:
// and here split the byte array accordingly..
cout << line << endl;
}
//close the stream:
myfile.close();
}
else cout << "Unable to open file";
return 0;
}
Can anyone help me with that? Thanks.
Update
Below is my java code which will write resulting ByteArray into a file and the same file now I need to read it back from c++..
public static void main(String[] args) throws Exception {
String os = "whatever os is";
byte[] avroBinaryValue = os.getBytes();
long lastModifiedDate = 1379811105109L;
short schemaId = 32767;
ByteArrayOutputStream byteOsTest = new ByteArrayOutputStream();
DataOutputStream outTest = new DataOutputStream(byteOsTest);
outTest.writeShort(schemaId);
outTest.writeLong(lastModifiedDate);
outTest.writeInt(avroBinaryValue.length);
outTest.write(avroBinaryValue);
byte[] allWrittenBytesTest = byteOsTest.toByteArray();
DataInputStream inTest = new DataInputStream(new ByteArrayInputStream(allWrittenBytesTest));
short schemaIdTest = inTest.readShort();
long lastModifiedDateTest = inTest.readLong();
int sizeAvroTest = inTest.readInt();
byte[] avroBinaryValue1 = new byte[sizeAvroTest];
inTest.read(avroBinaryValue1, 0, sizeAvroTest);
System.out.println(schemaIdTest);
System.out.println(lastModifiedDateTest);
System.out.println(new String(avroBinaryValue1));
writeFile(allWrittenBytesTest);
}
/**
* Write the file in Java
* #param byteArray
*/
public static void writeFile(byte[] byteArray) {
try{
File file = new File("bytearrayfile");
FileOutputStream output = new FileOutputStream(file);
IOUtils.write(byteArray, output);
} catch (Exception ex) {
ex.printStackTrace();
}
}
It doesn't look like you want to use std::getline to read this data. Your file isn't written as text data on a line-by-line basis - it basically has a binary format.
You can use the read method of std::ifstream to read arbitrary chunks of data from an input stream. You probably want to open the file in binary mode:
std::ifstream myfile("bytearrayfile", std::ios::binary);
Fundamentally the method you would use to read each record from the file is:
uint16_t schemaId;
uint64_t lastModifiedDate;
uint32_t binaryLength;
myfile.read(reinterpret_cast<char*>(&schemaId), sizeof(schemaId));
myfile.read(reinterpret_cast<char*>(&lastModifiedDate), sizeof(lastModifiedDate));
myfile.read(reinterpret_cast<char*>(&binaryLength), sizeof(binaryLength));
This will read the three static members of your data structure from the file. Because your data is variable size, you probably need to allocate a buffer to read it into, for example:
std::unique_ptr<char[]> binaryBuf(new char[binaryLength]);
myfile.read(binaryBuf.get(), binaryLength);
The above are examples only to illustrate how you would approach this in C++. You will need to be aware of the following things:
There's no error checking in the above examples. You'll need to check that the calls to ifstream::read are successful and return the correct amount of data.
Endianness may be an issue, depending on the the platform the data originates from and is being read on.
Interpreting the lastModifiedDate field may require you to write a function to convert it from whatever format Java uses (I have no idea about Java).

Java : How to determine the correct charset encoding of a stream

With reference to the following thread:
Java App : Unable to read iso-8859-1 encoded file correctly
What is the best way to programatically determine the correct charset encoding of an inputstream/file ?
I have tried using the following:
File in = new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());
But on a file which I know to be encoded with ISO8859_1 the above code yields ASCII, which is not correct, and does not allow me to correctly render the content of the file back to the console.
You cannot determine the encoding of a arbitrary byte stream. This is the nature of encodings. A encoding means a mapping between a byte value and its representation. So every encoding "could" be the right.
The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.
Anyway, you could try to guess an encoding on your own if you have to. Every language has a common frequency for every char. In English the char e appears very often but ê will appear very very seldom. In a ISO-8859-1 stream there are usually no 0x00 chars. But a UTF-16 stream has a lot of them.
Or: you could ask the user. I've already seen applications which present you a snippet of the file in different encodings and ask you to select the "correct" one.
I have used this library, similar to jchardet for detecting encoding in Java:
https://github.com/albfernandez/juniversalchardet
check this out:
http://site.icu-project.org/ (icu4j)
they have libraries for detecting charset from IOStream
could be simple like this:
BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();
if (cm != null) {
reader = cm.getReader();
charset = cm.getName();
}else {
throw new UnsupportedCharsetException()
}
Here are my favorites:
TikaEncodingDetector
Dependency:
<dependency>
<groupId>org.apache.any23</groupId>
<artifactId>apache-any23-encoding</artifactId>
<version>1.1</version>
</dependency>
Sample:
public static Charset guessCharset(InputStream is) throws IOException {
return Charset.forName(new TikaEncodingDetector().guessEncoding(is));
}
GuessEncoding
Dependency:
<dependency>
<groupId>org.codehaus.guessencoding</groupId>
<artifactId>guessencoding</artifactId>
<version>1.4</version>
<type>jar</type>
</dependency>
Sample:
public static Charset guessCharset2(File file) throws IOException {
return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
}
You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results, e.g. do you know beforehand if the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.
Which library to use?
As of this writing, they are three libraries that emerge:
GuessEncoding
ICU4j
juniversalchardet
I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.
How to tell which one has detected the right charset (or as close as possible)?
It's impossible to certify the charset detected by each above libraries. However, it's possible to ask them in turn and score the returned response.
How to score the returned response?
Each response can be assigned one point. The more points a response have, the more confidence the detected charset has. This is a simple scoring method. You can elaborate others.
Is there any sample code?
Here is a full snippet implementing the strategy described in the previous lines.
public static String guessEncoding(InputStream input) throws IOException {
// Load input data
long count = 0;
int n = 0, EOF = -1;
byte[] buffer = new byte[4096];
ByteArrayOutputStream output = new ByteArrayOutputStream();
while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
output.write(buffer, 0, n);
count += n;
}
if (count > Integer.MAX_VALUE) {
throw new RuntimeException("Inputstream too large.");
}
byte[] data = output.toByteArray();
// Detect encoding
Map<String, int[]> encodingsScores = new HashMap<>();
// * GuessEncoding
updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());
// * ICU4j
CharsetDetector charsetDetector = new CharsetDetector();
charsetDetector.setText(data);
charsetDetector.enableInputFilter(true);
CharsetMatch cm = charsetDetector.detect();
if (cm != null) {
updateEncodingsScores(encodingsScores, cm.getName());
}
// * juniversalchardset
UniversalDetector universalDetector = new UniversalDetector(null);
universalDetector.handleData(data, 0, data.length);
universalDetector.dataEnd();
String encodingName = universalDetector.getDetectedCharset();
if (encodingName != null) {
updateEncodingsScores(encodingsScores, encodingName);
}
// Find winning encoding
Map.Entry<String, int[]> maxEntry = null;
for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
maxEntry = e;
}
}
String winningEncoding = maxEntry.getKey();
//dumpEncodingsScores(encodingsScores);
return winningEncoding;
}
private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
String encodingName = encoding.toLowerCase();
int[] encodingScore = encodingsScores.get(encodingName);
if (encodingScore == null) {
encodingsScores.put(encodingName, new int[] { 1 });
} else {
encodingScore[0]++;
}
}
private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
System.out.println(toString(encodingsScores));
}
private static String toString(Map<String, int[]> encodingsScores) {
String GLUE = ", ";
StringBuilder sb = new StringBuilder();
for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
}
int len = sb.length();
sb.delete(len - GLUE.length(), len);
return "{ " + sb.toString() + " }";
}
Improvements:
The guessEncoding method reads the inputstream entirely. For large inputstreams this can be a concern. All these libraries would read the whole inputstream. This would imply a large time consumption for detecting the charset.
It's possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.
As far as I know, there is no general library in this context to be suitable for all types of problems. So, for each problem you should test the existing libraries and select the best one which satisfies your problem’s constraints, but often none of them is appropriate. In these cases you can write your own Encoding Detector! As I have wrote ...
I’ve wrote a meta java tool for detecting charset encoding of HTML Web pages, using IBM ICU4j and Mozilla JCharDet as the built-in components. Here you can find my tool, please read the README section before anything else. Also, you can find some basic concepts of this problem in my paper and in its references.
Bellow I provided some helpful comments which I’ve experienced in my work:
Charset detection is not a foolproof process, because it is essentially based on statistical data and what actually happens is guessing not detecting
icu4j is the main tool in this context by IBM, imho
Both TikaEncodingDetector and Lucene-ICU4j are using icu4j and their accuracy had not a meaningful difference from which the icu4j in my tests (at most %1, as I remember)
icu4j is much more general than jchardet, icu4j is just a bit biased to IBM family encodings while jchardet is strongly biased to utf-8
Due to the widespread use of UTF-8 in HTML-world; jchardet is a better choice than icu4j in overall, but is not the best choice!
icu4j is great for East Asian specific encodings like EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family encodings
Both icu4j and jchardet are debacle in dealing with HTML pages with Windows-1251 and Windows-1256 encodings. Windows-1251 aka cp1251 is widely used for Cyrillic-based languages like Russian and Windows-1256 aka cp1256 is widely used for Arabic
Almost all encoding detection tools are using statistical methods, so the accuracy of output strongly depends on the size and the contents of the input
Some encodings are essentially the same just with a partial differences, so in some cases the guessed or detected encoding may be false but at the same time be true! As about Windows-1252 and ISO-8859-1. (refer to the last paragraph under the 5.2 section of my paper)
The libs above are simple BOM detectors which of course only work if there is a BOM in the beginning of the file. Take a look at http://jchardet.sourceforge.net/ which does scans the text
If you use ICU4J (http://icu-project.org/apiref/icu4j/)
Here is my code:
String charset = "ISO-8859-1"; //Default chartset, put whatever you want
byte[] fileContent = null;
FileInputStream fin = null;
//create FileInputStream object
fin = new FileInputStream(file.getPath());
/*
* Create byte array large enough to hold the content of the file.
* Use File.length to determine size of the file in bytes.
*/
fileContent = new byte[(int) file.length()];
/*
* To read content of the file in byte array, use
* int read(byte[] byteArray) method of java FileInputStream class.
*
*/
fin.read(fileContent);
byte[] data = fileContent;
CharsetDetector detector = new CharsetDetector();
detector.setText(data);
CharsetMatch cm = detector.detect();
if (cm != null) {
int confidence = cm.getConfidence();
System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
//Here you have the encode name and the confidence
//In my case if the confidence is > 50 I return the encode, else I return the default value
if (confidence > 50) {
charset = cm.getName();
}
}
Remember to put all the try-catch need it.
I hope this works for you.
If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.
I found a nice third party library which can detect actual encoding:
http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
I didn't test it extensively but it seems to work.
For ISO8859_1 files, there is not an easy way to distinguish them from ASCII. For Unicode files however one can generally detect this based on the first few bytes of the file.
UTF-8 and UTF-16 files include a Byte Order Mark (BOM) at the very beginning of the file. The BOM is a zero-width non-breaking space.
Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad will check the BOM and use the appropriate encoding. Using unix or Cygwin, you can check the BOM with the file command. For example:
$ file sample2.sql
sample2.sql: Unicode text, UTF-16, big-endian
For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding
An alternative to TikaEncodingDetector is to use Tika AutoDetectReader.
Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();
A good strategy to handle this, is with a way to auto detect the input charset.
I use org.xml.sax.InputSource in Java 11 to solve it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
inputStreamReader = new InputStreamReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...**strong text**
In plain Java:
final String[] encodings = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };
List<String> lines;
for (String encoding : encodings) {
try {
lines = Files.readAllLines(path, Charset.forName(encoding));
for (String line : lines) {
// do something...
}
break;
} catch (IOException ioe) {
System.out.println(encoding + " failed, trying next.");
}
}
This approach will try the encodings one by one until one works or we run out of them.
(BTW my encodings list has only those items because they are the charsets implementations required on every Java platform, https://docs.oracle.com/javase/9/docs/api/java/nio/charset/Charset.html)
Can you pick the appropriate char set in the Constructor:
new InputStreamReader(new FileInputStream(in), "ISO8859_1");

Categories

Resources