Unexpected output with RandomAccessFile - java

I'm trying to learn about RandomAccessFile but after creating a test program I'm getting some bizarre output.
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class RandomAccessFileTest
{
    public static void main(String[] args) throws IOException
    {
        // Create a new blank file
        File file = new File("RandomAccessFileTest.txt");
        file.createNewFile();
        // Open the file in read/write mode
        RandomAccessFile randomfile = new RandomAccessFile(file, "rw");
        // Write stuff
        randomfile.write("Hello World".getBytes());
        // Go to a location
        randomfile.seek(0);
        // Get the pointer to that location
        long pointer = randomfile.getFilePointer();
        System.out.println("location: " + pointer);
        // Read a char (two bytes?)
        char letter = randomfile.readChar();
        System.out.println("character: " + letter);
        randomfile.close();
    }
}
This program prints out
location: 0
character: ?
Turns out that the value of letter was '䡥' when it should be 'H'.
I've found a question similar to this, and apparently this is caused by reading one byte instead of two, but it didn't explain how exactly to fix it.

You've written "Hello World" in the platform default encoding - which is likely to use a single byte per character.
You're then reading with RandomAccessFile.readChar, which always reads two bytes. From the documentation:
Reads a character from this file. This method reads two bytes from the file, starting at the current file pointer. If the bytes read, in order, are b1 and b2, where 0 <= b1, b2 <= 255, then the result is equal to:
(char)((b1 << 8) | b2)
This method blocks until the two bytes are read, the end of the stream is detected, or an exception is thrown.
So H and e are being combined into a single character - H is U+0048 and e is U+0065, so assuming they've been written as ASCII characters, you're reading bytes 0x48 and 0x65 and combining them into U+4865, which is a Han character for "a moving cart".
Basically, you shouldn't be using readChar to try to read this data.
Usually to read a text file, you want an InputStreamReader (with an appropriate encoding) wrapping an InputStream (e.g. a FileInputStream). It's not really ideal to try to do this with RandomAccessFile - you could read data into a byte[] and then convert that into a String but there are all kinds of subtleties you'd need to think about.
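To make the fix concrete, here is a minimal sketch (the class name is made up for the example) that writes with an explicit encoding and decodes the bytes explicitly when reading them back, assuming the content is plain ASCII/UTF-8:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RandomAccessFileFixedTest
{
    public static void main(String[] args) throws IOException
    {
        try (RandomAccessFile raf = new RandomAccessFile("RandomAccessFileTest.txt", "rw"))
        {
            // Write with an explicit, known encoding
            raf.write("Hello World".getBytes(StandardCharsets.UTF_8));
            raf.seek(0);
            // Read one byte per ASCII character, instead of readChar's two
            byte[] buf = new byte[1];
            raf.readFully(buf);
            // Decode explicitly with the same encoding used for writing
            System.out.println("character: " + new String(buf, StandardCharsets.UTF_8)); // prints: H
        }
    }
}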

Related

Problem with input from user saved to file by RandomAccessFile methods

I've got a problem with input from the user. I need to save the user's input into a binary file, and when I read it back and show it on the screen it isn't working properly. I don't want to paste a few hundred lines, so I will try to describe it in a more compact form. The encoding in the NetBeans project properties is "UTF-8".
I get input from the user, either in the NetBeans console or the cmd console. I save it into an object made up of strings, then add it to an ArrayList<Ksiazka>, where Ksiazka is my class (basically a book's properties). Then I save the whole ArrayList to the file baza.bin: I loop through the whole list of Ksiazka objects, take each String one by one and save it into baza.bin using writeUTF(oneOfStrings). When I try to read baza.bin back I see question marks instead of special characters (ą, ć, ę, ł, ń, ó, ś, ź). I think the problem is a difference between the encoding of the file and of the input data, but to be honest I have no idea how to solve it.
These are the attributes of my class Ksiazka:
private String id;
private String tytul;
private String autor;
private String rok;
private String wydawnictwo;
private String gatunek;
private String opis;
private String ktoWypozyczyl;
private String kiedyWypozyczona;
private String kiedyDoOddania;
This is the method for reading data from the user:
static String podajDana(String[] tab, int coPokazac){
    System.out.print(tab[coPokazac]);
    boolean podawajDalej = true;
    String linia = "";
    Scanner klawiatura = new Scanner(System.in, "utf-8");
    do{
        try {
            podawajDalej = false;
            linia = klawiatura.nextLine();
        }
        catch(NoSuchElementException e){
            // "An error occurred while entering the value! Try again!"
            System.err.println("Wystąpił błąd w czasie podawania wartości!"
                    + " Spróbuj jeszcze raz!");
        }
        catch(IllegalStateException e){
            // "Internal program error, type 2! Report this as soon as possible together with this message"
            System.err.println("Wewnętrzny błąd programu typu 2! Zgłoś to jak najszybciej"
                    + " razem z tą wiadomością");
        }
    }while(podawajDalej);
    return linia;
}
String[] tab is just an array of strings I want to be able to show on the screen (each set/array has its own function); int coPokazac is the number of the line from the array I want to show.
And this one saves all data from the ArrayList<Ksiazka> to the file baza.bin:
static void zapiszZmiany(ArrayList<Ksiazka> bazaKsiazek){
    try{
        RandomAccessFile plik = new RandomAccessFile("baza.bin","rw");
        for(int i = 0; i < bazaKsiazek.size(); i++){
            plik.writeUTF(bazaKsiazek.get(i).zwrocId());
            plik.writeUTF(bazaKsiazek.get(i).zwrocTytul());
            plik.writeUTF(bazaKsiazek.get(i).zwrocAutor());
            plik.writeUTF(bazaKsiazek.get(i).zwrocRok());
            plik.writeUTF(bazaKsiazek.get(i).zwrocWydawnictwo());
            plik.writeUTF(bazaKsiazek.get(i).zwrocGatunek());
            plik.writeUTF(bazaKsiazek.get(i).zwrocOpis());
            plik.writeUTF(bazaKsiazek.get(i).zwrocKtoWypozyczyl());
            plik.writeUTF(bazaKsiazek.get(i).zwrocKiedyWypozyczona());
            plik.writeUTF(bazaKsiazek.get(i).zwrocKiedyDoOddania());
        }
        plik.close();
    }
    catch (FileNotFoundException ex){
        // "Book database file not found!"
        System.err.println("Nie znaleziono pliku z bazą książek!");
    }
    catch (IOException ex){
        // "Error while writing or reading the file!"
        System.err.println("Błąd zapisu bądź odczytu pliku!");
    }
}
I think the problem is in one of those two methods (either I do something wrong while reading, or something goes wrong when saving data with writeUTF()), but even though I tried a few things to solve it, none of them worked.
After a quick talk with my lecturer I was told that I can use at most JDK 8.
You are using different techniques for reading and writing, and they are not compatible.
Despite the name, the writeUTF method of RandomAccessFile does not write a UTF-8 string. From the documentation:
Writes a string to the file using modified UTF-8 encoding in a machine-independent manner.
First, two bytes are written to the file, starting at the current file pointer, as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string. Following the length, each character of the string is output, in sequence, using the modified UTF-8 encoding for each character.
writeUTF will write a two-byte length, then write the string as UTF-8, except that '\u0000' characters are written as two UTF-8 bytes and supplementary characters are written as two UTF-8 encoded surrogates, rather than single UTF-8 codepoint sequences.
On the other hand, you are trying to read that data using new Scanner(System.in, "utf-8") and klawiatura.nextLine();. This approach is not compatible because:
The text was not written as a true UTF-8 sequence.
Before the text was written, two bytes indicating its numeric length were written. They are not readable text.
writeUTF does not write a newline. It does not write any terminating sequence at all, in fact.
The best solution is to remove all usage of RandomAccessFile and replace it with a Writer. (Note: the FileWriter constructor that takes a Charset was only added in Java 11; since you are limited to JDK 8, wrap a FileOutputStream in an OutputStreamWriter instead.)
Writer plik = new OutputStreamWriter(new FileOutputStream("baza.bin"), StandardCharsets.UTF_8);
for (int i = 0; i < bazaKsiazek.size(); i++) {
    plik.write(bazaKsiazek.get(i).zwrocId());
    plik.write('\n');
    plik.write(bazaKsiazek.get(i).zwrocTytul());
    plik.write('\n');
    // ...
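For completeness, a sketch of the matching read side under the same assumed line-per-field format (the Ksiazka field order comes from the question; how you rebuild the object is up to you):

BufferedReader czytnik = new BufferedReader(
        new InputStreamReader(new FileInputStream("baza.bin"), StandardCharsets.UTF_8));
String id;
while ((id = czytnik.readLine()) != null) {
    String tytul = czytnik.readLine();
    // ... read the remaining eight fields the same way,
    // then reconstruct the Ksiazka object from them
}
czytnik.close();

Alternatively, if you do keep writeUTF, the data must be read back with the symmetric DataInput.readUTF (e.g. RandomAccessFile.readUTF), which understands the length-prefixed modified-UTF-8 format.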

Problems to compress Excel files, JAVA

I have some problems compressing Excel files using the Huffman algorithm. My code seems to work with .txt files, but when I try to compress .xlsx or older versions of Excel an error occurs.
First of all, I read my file like this:
File file = new File("fileName.xlsx");
byte[] dataOfFile = new byte[(int) file.length()];
DataInputStream dis = new DataInputStream(new FileInputStream(file));
dis.readFully(dataOfFile);
dis.close();
To check this (if everything seems OK) I use this code:
String entireFileText = new String(dataOfFile, "UTF-8");
for(int i = 0; i < dataOfFile.length; i++)
{
    System.out.print(dataOfFile[i]);
}
By doing this to a .txt file I get something like this (which seems to be OK):
"7210110810811132119111114108100331310721111193297114101321211111173"
But when I use this on a .xlsx file I get the following, and I think the hyphens cause errors that might occur later in the compression:
"8075342006080003301165490-90122100-1245001908291671111101161011101169584121112101115934612010910832-944240-96020000000000000"... and so on
Anyway, using that string I can build a HashMap in which I count the frequency of each character. I have a HashMap:
public static HashMap map;

public static boolean countHowOftenACharacterAppear(String s1) {
    String s = s1;
    for(int i = 0; i < s.length(); i++){
        char c = s.charAt(i);
        Integer val = map.get(new Character(c));
        if(val != null){
            map.put(c, new Integer(val + 1));
        }
        else{
            map.put(c, 1);
        }
    }
    return true;
}
When I compress my string I use:
public static String compress(String s) {
    String c = new String();
    for(int i = 0; i < s.length(); i++)
        c = c + fromCharacterToCode.get(s.charAt(i));
    return c;
}
fromCharacterToCode is another HashMap of type:
public static HashMap fromCharacterToCode;
(I'm traversing through the table I've built. Don't think this is the problem.)
Anyway, the results from this using the .txt file is:
"01000110110111011011110001101110011011000001000000000"... (PERFECT)
From the .xlsx file:
"10101110110001110null0010000null0011000nullnullnull10110000null00001101011111" ...
I really don't get why I'm getting the nulls for the .xlsx files. I would be very happy if I could get some help here to solve this. Many thanks!!
Your problem is Java I/O, well before getting to compression.
First, you don't really need DataInputStream here, but leave that aside. You then convert to String entireFileText assuming the contents of the file is text in UTF-8, whereas data files like .xlsx aren't text at all and many text files even on Windows aren't UTF-8. But you don't seem to use entireFileText, so that may not matter. If you do, and the file isn't plain ASCII text, your compressor will "lose" chunks of it and the output of decompression will be only a fraction of the compression input; that is usually considered unsatisfactory.
Then you extract each byte from dataOfFile. byte in Java is signed; plain ASCII text files will have only "positive" bytes 0x00 to 0x7F (and usually all 0x20 to 0x7E plus 0x09 0x0D 0x0A), but everything else (UTF-8 text, UTF-16 text, data, and executables) will have "negative" bytes 0x80 to 0xFF which come out as -0x80 to -0x01.
Your printout "7210110810811132119111114108100331310721111193297114101321211111173" for "the .txt file" is almost certainly the byte sequence 72=H 101=e 108=l 108=l 111=o 32=space 119=w 111=o 114=r 108=l 100=d 33=! 13=CR 10=LF 72=H 111=o 119=w 32=space 97=a 114=r 101=e 32=space 121=y 111=o 117=u 3=(ETX aka ctrl-C) (how did you get a ctrl-C into a file?! or was it really 30=ctrl-Z? that's somewhat usual for Windows text files)
Someone more familiar with .xlsx format might be able to reconstruct that one, but I can tell you right off the hyphens are due to bytes with negative values, printed in decimal (by default) as -128 to -1.
For a general-purpose compressor, you shouldn't ever convert to Java chars and Strings; those are designed for text, and not all files are text. Just work with bytes, but if you want them consistently positive, mask with & 0xFF.
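As a sketch of that approach (the file name is the question's placeholder), this counts byte frequencies for a Huffman table without ever constructing a String; a table built over all 256 byte values can never hit a missing entry, which is exactly what the nulls in your output indicate:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ByteFrequencies
{
    public static void main(String[] args) throws IOException
    {
        byte[] dataOfFile = Files.readAllBytes(Paths.get("fileName.xlsx")); // placeholder name
        int[] freq = new int[256]; // one counter per possible byte value
        for (byte b : dataOfFile)
        {
            freq[b & 0xFF]++; // mask to 0..255 instead of signed -128..127
        }
        for (int i = 0; i < freq.length; i++)
        {
            if (freq[i] > 0)
            {
                System.out.printf("byte 0x%02X occurs %d times%n", i, freq[i]);
            }
        }
    }
}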

How to split the ByteArray by reading from the file in C++?

I have written a Java program that writes a byte array to a file. That byte array is the concatenation of three parts:
The first 2 bytes are my schemaId, which I represent with a short.
The next 8 bytes are my Last Modified Date, which I represent with a long.
The remaining bytes are of variable size and hold the actual values of my attributes.
So I now have a file whose first line contains the resulting byte array, laid out as described above. I need to read that file from a C++ program, read the first line containing the byte array, and then split it accordingly, so that I can extract my schemaId, Last Modified Date and actual attribute values from it.
I have done all my coding in Java so far and I am new to C++... I am able to write a C++ program that reads the file, but I'm not sure how to read the byte array in a way that lets me split it as described above.
Below is my C++ program, which reads the file and prints it to the console.
int main () {
    string line;
    // the variable of type ifstream:
    ifstream myfile ("bytearrayfile");
    // check to see if the file is opened:
    if (myfile.is_open())
    {
        // while there are still lines in the
        // file, keep reading:
        while (! myfile.eof() )
        {
            // place the line from myfile into the
            // line variable:
            getline (myfile,line);
            // display the line we gathered:
            // and here split the byte array accordingly..
            cout << line << endl;
        }
        // close the stream:
        myfile.close();
    }
    else cout << "Unable to open file";
    return 0;
}
Can anyone help me with that? Thanks.
Update
Below is the Java code that writes the resulting byte array into a file; it produces the same file I now need to read back from C++.
public static void main(String[] args) throws Exception {
    String os = "whatever os is";
    byte[] avroBinaryValue = os.getBytes();
    long lastModifiedDate = 1379811105109L;
    short schemaId = 32767;

    ByteArrayOutputStream byteOsTest = new ByteArrayOutputStream();
    DataOutputStream outTest = new DataOutputStream(byteOsTest);
    outTest.writeShort(schemaId);
    outTest.writeLong(lastModifiedDate);
    outTest.writeInt(avroBinaryValue.length);
    outTest.write(avroBinaryValue);

    byte[] allWrittenBytesTest = byteOsTest.toByteArray();
    DataInputStream inTest = new DataInputStream(new ByteArrayInputStream(allWrittenBytesTest));
    short schemaIdTest = inTest.readShort();
    long lastModifiedDateTest = inTest.readLong();
    int sizeAvroTest = inTest.readInt();
    byte[] avroBinaryValue1 = new byte[sizeAvroTest];
    inTest.read(avroBinaryValue1, 0, sizeAvroTest);

    System.out.println(schemaIdTest);
    System.out.println(lastModifiedDateTest);
    System.out.println(new String(avroBinaryValue1));

    writeFile(allWrittenBytesTest);
}

/**
 * Write the file in Java
 * @param byteArray
 */
public static void writeFile(byte[] byteArray) {
    try{
        File file = new File("bytearrayfile");
        FileOutputStream output = new FileOutputStream(file);
        IOUtils.write(byteArray, output);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
It doesn't look like you want to use std::getline to read this data. Your file isn't written as text data on a line-by-line basis - it basically has a binary format.
You can use the read method of std::ifstream to read arbitrary chunks of data from an input stream. You probably want to open the file in binary mode:
std::ifstream myfile("bytearrayfile", std::ios::binary);
Fundamentally the method you would use to read each record from the file is:
uint16_t schemaId;
uint64_t lastModifiedDate;
uint32_t binaryLength;
myfile.read(reinterpret_cast<char*>(&schemaId), sizeof(schemaId));
myfile.read(reinterpret_cast<char*>(&lastModifiedDate), sizeof(lastModifiedDate));
myfile.read(reinterpret_cast<char*>(&binaryLength), sizeof(binaryLength));
This will read the three fixed-size members of your data structure from the file. Because your data is variable-size, you probably need to allocate a buffer to read it into, for example:
std::unique_ptr<char[]> binaryBuf(new char[binaryLength]);
myfile.read(binaryBuf.get(), binaryLength);
The above are examples only to illustrate how you would approach this in C++. You will need to be aware of the following things:
There's no error checking in the above examples. You'll need to check that the calls to ifstream::read are successful and return the correct amount of data.
Endianness may be an issue, depending on the platform the data originates from and is being read on.
Interpreting the lastModifiedDate field may require you to write a function to convert it from whatever format Java uses (I have no idea about Java).
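One concrete point on the endianness caveat: Java's DataOutputStream always writes multi-byte values in big-endian (network) order, so the plain reads above will produce scrambled values on a little-endian machine such as x86 unless the bytes are swapped. If you control the Java side, one possible alternative (a sketch, not the original program) is to write little-endian explicitly with ByteBuffer, so the C++ reader needs no swapping:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianWriter
{
    public static void main(String[] args) throws IOException
    {
        byte[] value = "whatever os is".getBytes("UTF-8");
        ByteBuffer buf = ByteBuffer.allocate(2 + 8 + 4 + value.length)
                                   .order(ByteOrder.LITTLE_ENDIAN); // match an x86 C++ reader
        buf.putShort((short) 32767);   // schemaId
        buf.putLong(1379811105109L);   // lastModifiedDate
        buf.putInt(value.length);      // length of the variable-size payload
        buf.put(value);                // the payload itself
        try (FileOutputStream out = new FileOutputStream("bytearrayfile"))
        {
            out.write(buf.array());
        }
    }
}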

Most efficient merging of 2 text files.

So I have large (around 4 gigs each) txt files in pairs and I need to create a 3rd file which would consist of the 2 files in shuffle mode. The following equation presents it best:
3rdfile = (4 lines from file 1) + (4 lines from file 2), repeated until I hit the end of file 1 (both input files have the same length, by definition). Here is the code I'm using now, but it doesn't scale very well on large files. I was wondering if there is a more efficient way to do this - would working with memory-mapped files help? All ideas are welcome.
public static void mergeFastq(String forwardFile, String reverseFile, String outputFile) {
    try {
        BufferedReader inputReaderForward = new BufferedReader(new FileReader(forwardFile));
        BufferedReader inputReaderReverse = new BufferedReader(new FileReader(reverseFile));
        PrintWriter outputWriter = new PrintWriter(new FileWriter(outputFile, true));
        String forwardLine = null;
        System.out.println("Begin merging Fastq files");
        int readsMerge = 0;
        while ((forwardLine = inputReaderForward.readLine()) != null) {
            // append the forward file
            outputWriter.println(forwardLine);
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            outputWriter.println(inputReaderForward.readLine());
            // append the reverse file
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            outputWriter.println(inputReaderReverse.readLine());
            readsMerge++;
            if(readsMerge % 10000 == 0) {
                System.out.println("[" + now() + "] Merged 10000");
                readsMerge = 0;
            }
        }
        inputReaderForward.close();
        inputReaderReverse.close();
        outputWriter.close();
    } catch (IOException ex) {
        Logger.getLogger(Utilities.class.getName()).log(Level.SEVERE, "Error while merging FastQ files", ex);
    }
}
Maybe you also want to try to use a BufferedWriter to cut down your file IO operations.
http://download.oracle.com/javase/6/docs/api/java/io/BufferedWriter.html
A simple answer is to use a bigger buffer, which helps reduce the total number of I/O calls being made.
Usually, memory-mapped I/O with FileChannel (see Java NIO) is used for handling large file I/O. Here, however, it doesn't apply directly, as you need to inspect the file content to determine the boundary of every 4 lines.
If performance was the main requirement, then I would code this function in C or C++ instead of Java.
But regardless of the language used, what I would do is manage memory myself. I would create two large input buffers, say 128MB or more each, and fill them with data from the two text files. Then you need a third buffer that is twice as big as the other two. The algorithm starts moving characters one by one from input buffer #1 to the destination buffer, counting EOLs as it goes. Once you reach the 4th line, you store away the current position in that buffer and repeat the same process with the 2nd input buffer. You keep alternating between the two input buffers, replenishing each buffer when you've consumed all its data. Each time you refill an input buffer you can also write out and empty the destination buffer.
Buffer your read and write operations. Buffer needs to be large enough to minimize the read/write operations and still be memory efficient. This is really simple and it works.
void write(InputStream is, OutputStream os) throws IOException {
    byte[] buf = new byte[102400]; // optimize the size of the buffer to your needs
    int num;
    while ((num = is.read(buf)) != -1) {
        os.write(buf, 0, num);
    }
}
EDIT:
I just realized that you need to interleave the lines, so this code will not work for you as-is, but the concept still remains the same.
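For what it's worth, here is a sketch of the original 4-line interleaving with explicit large buffers on both readers and the writer (the buffer sizes are arbitrary, and error handling for truncated records is omitted):

static void mergeFastqBuffered(String forwardFile, String reverseFile, String outputFile) throws IOException {
    try (BufferedReader fwd = new BufferedReader(new FileReader(forwardFile), 1 << 20);
         BufferedReader rev = new BufferedReader(new FileReader(reverseFile), 1 << 20);
         BufferedWriter out = new BufferedWriter(new FileWriter(outputFile), 1 << 20)) {
        String line;
        while ((line = fwd.readLine()) != null) {
            // four lines from the forward file (the first was just read by the loop condition)
            out.write(line);
            out.newLine();
            for (int i = 0; i < 3; i++) { out.write(fwd.readLine()); out.newLine(); }
            // four lines from the reverse file
            for (int i = 0; i < 4; i++) { out.write(rev.readLine()); out.newLine(); }
        }
    }
}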

Why doesn't Java properly re-create this image from an InputStream?

I've looked at this every way I can think... The problem is that I end up writing the PERFECT number of bytes, and the files are VERY similar - but some bytes are different. I opened the Java-generated file in SciTE as well as the original, and even though they are close, they are not the same. Is there any way to fix this? I've tried everything I can think of - different wrappers, readers, writers, and different methods of taking the byte array (or taking it as chars - tried both) and making it into a file.
The image in question, for the test, is at http://www.google.com/images/srpr/nav_logo13.png. Here's the code:
import java.awt.Image;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.imageio.ImageIO;

public class ImgExample
{
    private String address = "http://www.google.com";

    /**
     * Returns a 3 dimensional array that holds the RGB values of each pixel at the position of the current
     * webcam picture. For example, getPicture()[1][2][3] is the pixel at (2,1) and the BLUE value.
     * [row][col][0] is alpha
     * [row][col][1] is red
     * [row][col][2] is green
     * [row][col][3] is blue
     */
    public int[][][] getPicture()
    {
        Image camera = null;
        try {
            int maxChars = 35000;
            //The image in question is 28,736 bytes, but I want to make sure it's bigger
            //for testing purposes as in my case, it's an image stream so it's unpredictable
            byte[] buffer = new byte[maxChars];
            //create the connection
            HttpURLConnection conn = (HttpURLConnection)(new URL(this.address+"/images/srpr/nav_logo13.png")).openConnection();
            conn.setUseCaches(false);
            //wrap a buffer around our input stream
            BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            int bytesRead = 0;
            while ( bytesRead < maxChars && reader.ready() )
            {
                //reader.read returns an int - I'm assuming this is okay?
                buffer[bytesRead] = (byte)reader.read();
                bytesRead++;
                if ( !reader.ready() )
                {
                    //This is here to make sure the stream has time to download the next segment
                    Thread.sleep(10);
                }
            }
            reader.close();
            //Great, write out the file for viewing
            File writeOutFile = new File("testgoog.png");
            if ( writeOutFile.exists() )
            {
                writeOutFile.delete();
                writeOutFile.createNewFile();
            }
            FileOutputStream fout = new FileOutputStream(writeOutFile, false);
            //FileWriter fout = new FileWriter(writeOutFile, false);
            //needed to make sure I was actually reading 100% of the file in question
            System.out.println("Bytes read = "+bytesRead);
            //write out the byte buffer from the first byte to the end of all the chars read
            fout.write(buffer, 0, bytesRead);
            fout.flush();
            fout.close();
            //Finally use a byte stream to create an image
            ByteArrayInputStream byteImgStream = new ByteArrayInputStream(buffer);
            camera = ImageIO.read(byteImgStream);
            byteImgStream.close();
        } catch ( Exception e ) { e.printStackTrace(); }
        return ImgExample.imageToPixels(camera);
    }

    public static int[][][] imageToPixels (Image image)
    {
        //there's a bunch of code here that works in the real program, no worries
        //it creates a 3d arr that goes [x][y][alpha, r, g, b val]
        //e.g. imageToPixels(camera)[1][2][3] gives the pixel's blue value for row 1 col 2
        return new int[][][]{{{-1,-1,-1}}};
    }

    public static void main(String[] args)
    {
        ImgExample ex = new ImgExample();
        ex.getPicture();
    }
}
The problem as I see it is that you're using Readers. In Java, Readers are for processing character streams, not binary streams, and the character conversions that it does are most likely what's changing your bytes on you.
Instead, you should read() from the InputStream directly. InputStream's read() will block until data is available, but returns -1 when the end of the stream is reached.
Edit: You can also wrap the InputStream in a BufferedInputStream.
BufferedReader is intended for reading character streams, not byte/binary streams.
BufferedReader.read() returns the character read, as an integer in the range 0 to 65535. To produce that character, the underlying bytes have already been run through a charset decoder, and that decoding mangles binary data: byte values above 0x7F aren't valid text on their own in most encodings, so they get replaced or combined, and the round trip is no longer byte-for-byte faithful.
I think you want to use InputStream.read() directly, not wrapped in a BufferedReader/InputStreamReader.
And finally, not related to the problem, but if you open a FileOutputStream with append=false, there isn't really any point in deleting any existing file first - append=false truncates it anyway.
I think your problem is you are using an InputStreamReader. From the javadocs
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
You dont want the conversion to character streams.
And you also shouldn't be using ready() like that. You are just wasting time with that and the sleeps. read() will block until data arrives anyway, and it will block the correct length of time, not an arbitrary guess. The canonical copy loop in Java goes like this:
int count;
byte[] buffer = new byte[8192]; // whatever size you like
while ((count = in.read(buffer)) > 0)
{
    out.write(buffer, 0, count);
}
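Putting that together for this particular task, a minimal sketch (the URL and output file name are from the question; the buffer size is arbitrary) that copies the image byte-for-byte with no character decoding anywhere:

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class DownloadImage
{
    public static void main(String[] args) throws Exception
    {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://www.google.com/images/srpr/nav_logo13.png").openConnection();
        conn.setUseCaches(false);
        try (InputStream in = conn.getInputStream();
             OutputStream out = new FileOutputStream("testgoog.png"))
        {
            byte[] buffer = new byte[8192];
            int count;
            while ((count = in.read(buffer)) > 0)
            {
                out.write(buffer, 0, count); // write exactly the bytes read, no decoding
            }
        }
    }
}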
