How to make a frequency table from file content using fileInputStream

How to make a frequency table from file content using fileInputStream - java

My assignment is to create a program that does compression using the Huffman algorithm. My program must be able to compress any type of file. Hence why i'm not using the Reader that works with characters.
Im not understanding how to be able to make some kind of frequency table when encoding a binary file?
EDIT!! Problem solved.
public static void main(String args[]){
try{
FileInputStream in = new FileInputStream("./src/hello.jpg");
int currentByte;
while((currentByte = in.read())!=-1){ //in.read()
//read all byte streams in file and create a frequency
//table
}
}catch (IOException e){
e.printStackTrace();
}
}

I'm not sure what you mean by "reading from an image and look at the characters" but talking about text files (as you're reading one in in your code example) this is most of the time working by casting the read byte to char by doing a
char charVal = (char) currentByte;
It's mostly working because most data is ASCII and most charsets contain ASCII. It gets more complicated with non-ASCII characters because a simple cast is equivalent with using charset ISO-8859-1. This will still most of the time produce correct results, because e.g. Window's cp1252 (on german systems) only differ with ISO-8859-1 at the Euro-sign.
Things start to run havoc with charsets like UTF-8 where non-ASCII characters are encoded with multiple bytes, so you will see things like Ã¤ instead of an ä. Same for files being encoded with Unicode where every second byte is most likely a binary zero.

You could use Files.readAllBytes and then iterate over this array.
Path path = Paths.get("hello.txt");
try {
byte[] array = Files.readAllBytes(path);
} catch (IOException ) {
}

Related

Base64.Decoder returning foreign characters

I am building a small application to turn the text in a text file to Base64 then back to normal. The decoded text always returns some Chinese characters in the beginning of the first line.
public EncryptionEngine(File appFile){
this.appFile= appFile;
}
public void encrypt(){
try {
byte[] fileText = Files.readAllBytes(appFile.toPath());// get file text as bytes
Base64.Encoder encoder = Base64.getEncoder();
PrintWriter writer = new PrintWriter(appFile);
writer.print("");//erase old, readable text
writer.print(encoder.encodeToString(fileText));// insert encoded text
writer.close();
} catch (IOException e) {
e.printStackTrace();
}
}
public void deycrpt(){
try {
byte[] fileText = Files.readAllBytes(appFile.toPath());
String s = new String (fileText, StandardCharsets.UTF_8);//String s = new String (fileText);
Base64.Decoder decoder = Base64.getDecoder();
byte[] decodedByteArray = decoder.decode(s);
PrintWriter writer = new PrintWriter(appFile);
writer.print("");
writer.print(new String (decodedByteArray,StandardCharsets.UTF_8)); //writer.print(new String (decodedByteArray));
writer.close();
} catch (IOException e) {
e.printStackTrace();
}
}
Text FileBefore before encrypt():
cheese
tomatoes
potatoes
hams
yams
Text File after encrypt()
//5jAGgAZQBlAHMAZQANAAoAdABvAG0AYQB0AG8AZQBzAA0ACgBwAG8AdABhAHQAbwBlAHMADQAKAGgAYQBtAHMADQAKAHkAYQBtAHMA
Text File After decrypt
뿯붿cheese
tomatoes
potatoes
hams
yams
Before encrypt() :
After decrypt() :

Your input file is UTF-16, not UTF-8. It begins with FF FE, the little-endian byte order mark. StandardCharsets.UTF_16 will handle this correctly. (Or instead, set your text editor to UTF-8 instead of UTF-16.)
When you decoded fffe as UTF-8, you got two replacement characters "��", one for each of the two bytes that was not valid in UTF-8. Then when you printed this out, each replacement character '�' was encoded as ef bf bd in UTF-8. Then you interpreted the result as UTF-16, taking them in groups of two, reading it as efbf bdef bfbd. The remainder of the file was UTF-16 the whole time, but the null bytes will safely round-trip.
(If the file were ascii text encoded as UTF-16 without a byte-order mark, you would not have noticed how broken this was!)

Your encrypt and decrypt functions don't make the same assumptions. encrypt Base64-encodes any file and is just fine except for the variable names and comments that suggest that the file is a text file. It need not be.
decrypt reverses the Base64-encoded data back to bytes but then "overprocesses" by assuming that the bytes were text encoding with UTF-8 and decoding then and re-encoding them before writing them to the file. If the assumption was true, it would just be a NOP; It's clearly not true in your case and it mangles the data.
Perhaps you did that because you were trying to use a PrintWriter. In Java (and .NET), the multiple stream and file I/O classes are often confusing—expecially considering their decades-long evolution. Sometimes there is one that does exactly what you need but it could be hard to find; other times, there isn't. And, sometimes, a commonly used library like Apache Commons fills the gap.
So, just write the bytes to the file. There are lots of modern and historical options as explained in the answers to this direct question byte[] to file in Java. Here's one with Files.write:
Files.write(appFile.toPath(), decodedByteArray, StandardOpenOption.CREATE);
Note: While Base64 possibly would have been considered encryption (and cracked) a couple of hundred years ago, it's not intended for that purpose. It's a bit dangerous (and confusing) to call it as such.

Reading from UTF-16 encoded text file, þÿ is prepended on the front

I'm outputting a byte array to a text file using the following method:
try{
FileOutputStream fos = new FileOutputStream(filePath+".8102");
fos.write(concatenatedIVCipherMAC);
fos.close();
}catch(Exception e)
{
e.printStackTrace();
}
which outputs to the file a UTF-16 encoded data, example:
¢¬6î)ªÈP~m˜LïiÆŸÃª•Àe»/#Ó ö¹¥‘þ²XhÃ&¼lG:Öé )GU3«´DÃ{+í—Ã]íò
However when I'm reading it back in I get þÿ prepended to the front of the data, e.g:
þÿ¢¬6î)ªÈP~m˜LïiÆŸÃª•Àe»/?#Ó ö¹¥‘þ²XhÃ&¼lG:Öé )GU3«´DÃ{+í—Ã]íò
This is the method I'm using to read in the file:
private String getFilesContents()
{
String fileContents = "";
Scanner sc = null;
try {
sc = new Scanner(file, "UTF-16");
System.out.println("Can read file: "+file.canRead());
} catch (FileNotFoundException e) {
e.printStackTrace();
}
while(sc.hasNextLine()){
fileContents += sc.nextLine();
}
sc.close();
return fileContents;
}
and then byte[] contentsOfFile = fileContents.getBytes("UTF-16"); to convert the String into a byte array.
A quick Google told me that þÿ represents the byte order but is it Java putting that there or Windows? How can I avoid having the þÿ prepended at the start of the data I'm reading in? I was thinking of just ignoring the first two bytes but if it is Windows then this will obviously break the program on other platforms.
edit: changed appended to prepended.

The file is the IV+data+MAC. It's not meant to be readable text? Should be I be doing something differently?
Yes. You shouldn't be trying to treat it as text anywhere.
If you really need to convert arbitrary binary data into text, use Base64 to convert it. Other than that, stick to byte arrays, InputStream and OutputStream.
I don't know exactly why you're supposedly getting extra characters, but the fact that you haven't got real text to start suggests that it's not really worth diagnosing that side. Just start handling binary data as binary data instead.
EDIT: Have a look at Guava's IO helpers for simplicity...

þÿ is the byte order mark (BOM) unicode character saved as UTF16-BE, interpreted as ISO-8859-1.
You shouldn't treat binary data as text (in whatever encoding), if you want to avoid such errors.

StringBufferInputStream Question in Java

I want to read an input string and return it as a UTF8 encoded string. SO I found an example on the Oracle/Sun website that used FileInputStream. I didn't want to read a file, but a string, so I changed it to StringBufferInputStream and used the code below. The method parameter jtext, is some Japanese text. Actually this method works great. The question is about the deprecated code. I had to put #SuppressWarnings because StringBufferInputStream is deprecated. I want to know is there a better way to get a string input stream? Is it ok just to leave it as is? I've spent so long trying to fix this problem that I don't want to change anything now I seem to have cracked it.
#SuppressWarnings("deprecation")
private String readInput(String jtext) {
StringBuffer buffer = new StringBuffer();
try {
StringBufferInputStream sbis = new StringBufferInputStream (jtext);
InputStreamReader isr = new InputStreamReader(sbis,
"UTF8");
Reader in = new BufferedReader(isr);
int ch;
while ((ch = in.read()) > -1) {
buffer.append((char)ch);
}
in.close();
return buffer.toString();
} catch (IOException e) {
e.printStackTrace();
return null;
}
}
I think I found a solution - of sorts:
private String readInput(String jtext) {
String n;
try {
n = new String(jtext.getBytes("8859_1"));
return n;
} catch (UnsupportedEncodingException e) {
return null;
}
}
Before I was desparately using getBytes(UTF8). But I by chance I used Latin-1 "8859_1" and it worked. Why it worked, I can't fathom. This is what I did step-by-step:
OpenOffice CSV(utf8)------>SQLite(utf8, apparently)------->java encoded as Latin-1, somehow readable.

The reason that StringBufferInputStream is deprecated is because it is fundamentally broken ... for anything other than Strings consisting entirely of Latin-1 characters. According to the javadoc it "encodes" characters by simply chopping off the top 8 bits! You don't want to use it if your application needs to handle Unicode, etc correctly.
If you want to create an InputStream from a String, then the correct way to do it is to use String.getBytes(...) to turn the String into a byte array, and then wrap that in a ByteArrayInputStream. (Make sure that you choose an appropriate encoding!).
But your sample application immediately takes the InputStream, converts it to a Reader and then adds a BufferedReader If this is your real aim, then a simpler and more efficient approach is simply this:
Reader in = new StringReader(text);
This avoids the unnecessary encoding and decoding of the String, and also the "buffer" layer which serves no useful purpose in this case.
(A buffered stream is much more efficient than an unbuffered stream if you are doing small I/O operations on a file, network or console stream. But for a stream that is served from an in-memory data structure the benefits are much smaller, and possibly even negative.)
FOLLOWUP
I realized what you are trying to do now ... work around a character encoding / decoding issue.
My advice would be to try to figure out definitively the actual encoding of the character data that is being delivered by the database, then make sure that the JDBC drivers are configured to use the same encoding. Trying to undo the mis-translation by encoding with one encoding and decoding with another is dodgy, and can give you only a partial correction of the problems.
You also need to consider the possibility that the characters got mangled on the way into the database. If this is the case, then you may be unable to de-mangle them.

Is this what you are trying to do? Here is previous answer on similar question. I am not sure why you want to convert to a String to an exactly the same String.
Java String holds a sequence of chars in which each char represents a Unicode number. So it is possible to construct the same string from two different byte sequences, says one is encoded with UTF-8 and the other is encoded with US-ASCII.
If you want to write it to file, you can always convert it with String.getBytes("encoder");
private static String readInput(String jtext) {
byte[] bytes = jtext.getBytes();
try {
String string = new String(bytes, "UTF-8");
return string;
} catch (UnsupportedEncodingException ex) {
// do something
return null;
}
}
Update
Here is my assumption.
According to your comment, you SQLite DB store text value using one encoding, says UTF-16. For some reason, your SQLite APi cannot determine what the encoding it uses to encode the Unicode values to sequence of bytes.
So when you use getString method from your SQLite API, it reads a set of bytes form you DB, and convert them into Java String using incorrect encoding. If this is the case, you should use getBytes method and reconstruct the String yourself, i.e. new String(bytes, "encoding used in your DB"); If you DB is stored in UTF-16, then new String(bytes, "UTF-16"); should be readable.
Update
I wasn't talking about getBytes method on String class. I talked about getBytes method on your SQL result object, e.g. result.getBytes(String columnLabel).
ResultSet result = .... // from SQL query
String readableString = readInput(result.getBytes("my_table_column"));
You will need to change the signature of your readInput method to
private static String readInput(byte[] bytes) {
try {
// change encoding to your DB encoding.
// this can be UTF-8, UTF-16, 8859_1, etc.
String string = new String(bytes, "UTF-8");
return string;
} catch (UnsupportedEncodingException ex) {
// do something, at least return garbled text
return new String(bytes, "UTF-8");;
}
}
Whatever encoding you set in here which makes your String readable, it is definitely the encoding of your column in DB. This involves no unexplanable phenomenon and you know exactly what your column encoding is.
But it will be good to config your JDBC driver to use the correct encoding so that you will not need to use this readInput method to convert.
If no encoding can make your string readable, you will need consider the possibility of the characters got mangled when it was written to DB as #Stephen C said. If this is the case, using walk around method may cause you to lose some of the charaters during conversions. You will also need to solve encoding problem during writting as well.

The StringReader class is the new alternative to the deprecated StringBufferInputStream class.
However, you state that what you actually want to do is take an existing String and return it encoded as UTF-8. You should be able to do that much more simply I expect. Something like:
s8 = new String(jtext.getBytes("UTF8"));

How to check whether the file is binary?

I wrote the following method to see whether particular file contains ASCII text characters only or control characters in addition to that. Could you glance at this code, suggest improvements and point out oversights?
The logic is as follows: "If first 500 bytes of a file contain 5 or more Control characters - report it as binary file"
thank you.
public boolean isAsciiText(String fileName) throws IOException {
InputStream in = new FileInputStream(fileName);
byte[] bytes = new byte[500];
in.read(bytes, 0, bytes.length);
int x = 0;
short bin = 0;
for (byte thisByte : bytes) {
char it = (char) thisByte;
if (!Character.isWhitespace(it) && Character.isISOControl(it)) {
bin++;
}
if (bin >= 5) {
return false;
}
x++;
}
in.close();
return true;
}

Since you call this class "isASCIIText", you know exactly what you're looking for. In other words, it's not "isTextInCurrentLocaleEncoding". Thus you can be more accurate with:
if (thisByte < 32 || thisByte > 127) bin++;
edit, a long time later — it's pointed out in a comment that this simple check would be tripped up by a text file that started with a lot of newlines. It'd probably be better to use a table of "ok" bytes, and include printable characters (including carriage return, newline, and tab, and possibly form feed though I don't think many modern documents use those), and then check the table.

x doesn't appear to do anything.
What if the file is less than 500 bytes?
Some binary files have a situation where you can have a header for the first N bytes of the file which contains some data that is useful for an application but that the library the binary is for doesn't care about. You could easily have 500+ bytes of ASCII in a preamble like this followed by binary data in the following gigabyte.
Should handle exception if the file can't be opened or read, etc.

Fails badly if file size is less than 500 bytes
The line char it = (char) thisByte; is conceptually dubious, it mixes bytes and chars concepts, ie. assumes implicitly that the encoding is one-byte=one character (them, it excludes unicode encodings). In particular, it fails if the file is UTF-16 encoded.
The return inside the loop (slightly bad practice IMO) forgets to close the file.

The first thing I noticed - unrelated to your actual question, but you should be closing your input stream in a finally block to ensure it's always done. Usually this merely handles exceptions, but in your case you won't even close the streams of files when returning false.
Asides from that, why the comparison to ISO control characters? That's not a "binary" file, that's a "file that contains 5 or more control characters". A better way to approach the situation in my opinion, would be to invert the check - write an isAsciiText function instead which asserts that all the characters in the file (or in the first 500 bytes if you so wish) are in a set of bytes that are known good.
Theoretically, only checking the first few hundred bytes of a file could get you into trouble if it was a composite file of sorts (e.g. text with embedded pictures), but in practice I suspect every such file will have binary header data at the start so you're probably OK.

This would not work with the jdk install packages for linux or solaris. they have a shell-script start and then a bi data blob.
why not check the mime type using some library like jMimeMagic (http://http://sourceforge.net/projects/jmimemagic/) and deside based on the mimetype how to handle the file.

One could parse and compare ageinst a list of known binary file header bytes, like the one provided here.
Problem is, one needs to have a sorted list of binary-only headers, and the list might not be complete at all. For example, reading and parsing binary files contained in some Equinox framework jar. If one needs to identify the specific file types though, this should work.
If you're on Linux, for existing files on the disk, native file command execution should work well:
String command = "file -i [ZIP FILE...]";
Process process = Runtime.getRuntime().exec(command);
...
It will output information on the files:
...: application/zip; charset=binary
which you can furtherly filter with grep, or in Java, depending on, if you simply need estimation of the files' binary character, or if you need to find out their MIME types.
If parsing InputStreams, like content of nested files inside archives, this doesn't work, unfortunately, unless resorting to shell-only programs, like unzip - if you want to avoid creating temp unzipped files.
For this, a rough estimation of examining the first 500 Bytes worked out ok for me, so far, as was hinted in the examples above; instead of Character.isWhitespace/isISOControl(char), I used Character.isIdentifierIgnorable(codePoint), assuming UTF-8 default encoding:
private static boolean isBinaryFileHeader(byte[] headerBytes) {
return new String(headerBytes).codePoints().filter(Character::isIdentifierIgnorable).count() >= 5;
}
public void printNestedZipContent(String zipPath) {
try (ZipFile zipFile = new ZipFile(zipPath)) {
int zipHeaderBytesLen = 500;
zipFile.entries().asIterator().forEachRemaining( entry -> {
String entryName = entry.getName();
if (entry.isDirectory()) {
System.out.println("FOLDER_NAME: " + entryName);
return;
}
// Get content bytes from ZipFile for ZipEntry
try (InputStream zipEntryStream = new BufferedInputStream(zipFile.getInputStream(zipEntry))) {
// read and store header bytes
byte[] headerBytes = zipEntryStream.readNBytes(zipHeaderBytesLen);
// Skip entry, if nested binary file
if (isBinaryFileHeader(headerBytes)) {
return;
}
// Continue reading zipInputStream bytes, if non-binary
byte[] zipContentBytes = zipEntryStream.readAllBytes();
int zipContentBytesLen = zipContentBytes.length;
// Join already read header bytes and rest of content bytes
byte[] joinedZipEntryContent = Arrays.copyOf(zipContentBytes, zipContentBytesLen + zipHeaderBytesLen);
System.arraycopy(headerBytes, 0, joinedZipEntryContent, zipContentBytesLen, zipHeaderBytesLen);
// Output (default/UTF-8) encoded text file content
System.out.println(new String(joinedZipEntryContent));
} catch (IOException e) {
System.out.println("ERROR getting ZipEntry content: " + entry.getName());
}
});
} catch (IOException e) {
System.out.println("ERROR opening ZipFile: " + zipPath);
e.printStackTrace();
}
}

You ignore what read() returns, what if the files is shorter than 500 bytes?
When you return false, you don't close the file.
When converting byte to char, you assume your file is 7-bit ASCII.

How do you convert binary data to Strings and back in Java?

I have binary data in a file that I can read into a byte array and process with no problem. Now I need to send parts of the data over a network connection as elements in an XML document. My problem is that when I convert the data from an array of bytes to a String and back to an array of bytes, the data is getting corrupted. I've tested this on one machine to isolate the problem to the String conversion, so I now know that it isn't getting corrupted by the XML parser or the network transport.
What I've got right now is
byte[] buffer = ...; // read from file
// a few lines that prove I can process the data successfully
String element = new String(buffer);
byte[] newBuffer = element.getBytes();
// a few lines that try to process newBuffer and fail because it is not the same data anymore
Does anyone know how to convert binary to String and back without data loss?
Answered: Thanks Sam. I feel like an idiot. I had this answered yesterday because my SAX parser was complaining. For some reason when I ran into this seemingly separate issue, it didn't occur to me that it was a new symptom of the same problem.
EDIT: Just for the sake of completeness, I used the Base64 class from the Apache Commons Codec package to solve this problem.

String(byte[]) treats the data as the default character encoding. So, how bytes get converted from 8-bit values to 16-bit Java Unicode chars will vary not only between operating systems, but can even vary between different users using different codepages on the same machine! This constructor is only good for decoding one of your own text files. Do not try to convert arbitrary bytes to chars in Java!
Encoding as base64 is a good solution. This is how files are sent over SMTP (e-mail). The (free) Apache Commons Codec project will do the job.
byte[] bytes = loadFile(file);
//all chars in encoded are guaranteed to be 7-bit ASCII
byte[] encoded = Base64.encodeBase64(bytes);
String printMe = new String(encoded, "US-ASCII");
System.out.println(printMe);
byte[] decoded = Base64.decodeBase64(encoded);
Alternatively, you can use the Java 6 DatatypeConverter:
import java.io.*;
import java.nio.channels.*;
import javax.xml.bind.DatatypeConverter;
public class EncodeDecode {
public static void main(String[] args) throws Exception {
File file = new File("/bin/ls");
byte[] bytes = loadFile(file, new ByteArrayOutputStream()).toByteArray();
String encoded = DatatypeConverter.printBase64Binary(bytes);
System.out.println(encoded);
byte[] decoded = DatatypeConverter.parseBase64Binary(encoded);
// check
for (int i = 0; i < bytes.length; i++) {
assert bytes[i] == decoded[i];
}
}
private static <T extends OutputStream> T loadFile(File file, T out)
throws IOException {
FileChannel in = new FileInputStream(file).getChannel();
try {
assert in.size() == in.transferTo(0, in.size(), Channels.newChannel(out));
return out;
} finally {
in.close();
}
}
}

If you encode it in base64, this will turn any data into ascii safe text, but base64 encoded data is larger than the orignal data

See this question, How do you embed binary data in XML?
Instead of converting the byte[] into String then pushing into XML somewhere, convert the byte[] to a String via BASE64 encoding (some XML libraries have a type to do this for you). The BASE64 decode once you get the String back from XML.
Use http://commons.apache.org/codec/
You data may be getting messed up due to all sorts of weird character set restrictions and the presence of non-priting characters. Stick w/ BASE64.

How are you building your XML document? If you use java's built in XML classes then the string encoding should be handled for you.
Take a look at the javax.xml and org.xml packages. That's what we use for generating XML docs, and it handles all the string encoding and decoding quite nicely.
---EDIT:
Hmm, I think I misunderstood the problem. You're not trying to encode a regular string, but some set of arbitrary binary data? In that case the Base64 encoding suggested in an earlier comment is probably the way to go. I believe that's a fairly standard way of encoding binary data in XML.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.