Resource file format processing in Java


I am trying to implement a processor for a specific resource archive file format in Java. The format has a Header comprised of a three-char description, a dummy byte, plus a byte indicating the number of files.
Then each file has an entry consisting of a dummy byte, a twelve-char string describing the file name, a dummy byte, and an offset declared in a three-byte array.
What would be the proper class for reading this kind of structure? I have tried RandomAccessFile, but it does not seem to let me read arrays of data; e.g. I can only read three chars by calling readChar() three times, etc.
Of course I can extend RandomAccessFile to do what I want, but there's got to be a proper out-of-the-box class for this kind of processing, isn't there?
This is my reader for the header in C#:
protected override void ReadHeader()
{
Header = new string(this.BinaryReader.ReadChars(3));
byte dummy = this.BinaryReader.ReadByte();
NFiles = this.BinaryReader.ReadByte();
}

I think you got lucky with your C# code, as it relies on the character encoding being set somewhere else, and if that encoding didn't match the number of bytes per character in the file, your code would probably have failed.
The safest way to do this in Java would be to strictly read bytes and do the conversion to characters yourself. If you need seek abilities, then RandomAccessFile would indeed be your easiest solution, but it should be pointed out that InputStream allows skipping, so if you don't need actual random access, just the ability to skip some of the files, you could certainly use it.
In either case, you should read the bytes from the file per the file specification, and then convert them to characters based on a known encoding. You should never trust a file that was not written by a Java program to contain any Java data types other than byte, and even if it was written by Java, it may well have been converted to raw bytes while writing.
So your code should be something along the lines of:
String header = "";
int nFiles = 0;
RandomAccessFile raFile = new RandomAccessFile( "filename", "r" );
byte[] buffer = new byte[3];
int numRead = raFile.read( buffer );
header = new String( buffer, StandardCharsets.US_ASCII ); // the Charset overload throws no checked exception
int numSkipped = raFile.skipBytes(1);
nFiles = raFile.read(); // The byte is read as an integer between 0 and 255
Sanity checks (checking that 3 bytes were actually read, that 1 byte was skipped and that nFiles is not -1) and exception handling have been skipped for brevity.
It's more or less the same if you use InputStream.
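To make the approach above concrete, here is a minimal sketch that reads the header and the file entries described in the question through a DataInputStream. The class and method names are made up for illustration, and two things are assumptions because the question does not state them: that names are padded with spaces or NUL bytes, and that the three offset bytes are stored least significant byte first.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ArchiveIndexReader {
    /** One directory entry: file name plus its offset in the archive. */
    static final class Entry {
        final String name;
        final int offset;
        Entry(String name, int offset) { this.name = name; this.offset = offset; }
    }

    /** Reads exactly n bytes and decodes them as ASCII. */
    static String readAscii(DataInputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        in.readFully(buf); // throws EOFException instead of returning a short read
        return new String(buf, StandardCharsets.US_ASCII);
    }

    static List<Entry> readIndex(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        readAscii(in, 3);                   // three-char description (not validated here)
        in.readUnsignedByte();              // dummy byte
        int nFiles = in.readUnsignedByte(); // number of files, 0..255
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < nFiles; i++) {
            in.readUnsignedByte();                  // dummy byte
            String name = readAscii(in, 12).trim(); // assumption: space- or NUL-padded names
            in.readUnsignedByte();                  // dummy byte
            // Assumption: least significant byte first. Swap the shifts if it is not.
            int offset = in.readUnsignedByte()
                    | (in.readUnsignedByte() << 8)
                    | (in.readUnsignedByte() << 16);
            entries.add(new Entry(name, offset));
        }
        return entries;
    }

    /** Builds a tiny in-memory archive index with a single entry, for demonstration. */
    static byte[] sample() {
        byte[] data = new byte[22];
        data[0] = 'R'; data[1] = 'E'; data[2] = 'S'; // description
        data[4] = 1;                                 // one file
        byte[] name = "HELLO.TXT   ".getBytes(StandardCharsets.US_ASCII); // 12 bytes
        System.arraycopy(name, 0, data, 6, name.length);
        data[19] = 0x10; data[20] = 0x02;            // offset 0x000210
        return data;
    }

    public static void main(String[] args) throws IOException {
        for (Entry e : readIndex(new ByteArrayInputStream(sample()))) {
            System.out.println(e.name + " @ " + e.offset);
        }
    }
}
```

The same readIndex method works on a FileInputStream; readFully is used so that a truncated file fails loudly rather than leaving a half-filled buffer.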

I would go with MappedByteBuffer. This will allow you to seek arbitrarily, and will also deal efficiently and transparently with files that are too large to fit comfortably in RAM.
This is, to my mind, the best way of reading structured binary data like this from a file.
You can then build your own data structure on top of that, to handle the specific file format.
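As a rough illustration of the MappedByteBuffer suggestion, the sketch below maps a file read-only and pulls out the three-byte description and the file count from the question's header. The class name, the temp file and its contents are invented for the demo.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedHeaderReader {
    /** Maps the file read-only and returns "<description>:<file count>". */
    static String describe(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] tag = new byte[3];
            buf.get(tag);                   // relative read, advances the position to 3
            int nFiles = buf.get(4) & 0xFF; // absolute read; index 3 is the dummy byte
            return new String(tag, StandardCharsets.US_ASCII) + ":" + nFiles;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("archive", ".res"); // hypothetical sample file
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[] {'R', 'E', 'S', 0, 2});
        System.out.println(describe(tmp)); // prints RES:2
    }
}
```

Absolute get(index) calls are what make random access cheap here. Note that a single mapping cannot exceed Integer.MAX_VALUE bytes, so a very large archive would need several mapped regions.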

Related

C# - Writing strings to a stream using two bytes for length, not one

I am creating an easy to use server-client model with an extensible protocol, where the server is in Java and clients can be Java, C#, what-have-you.
I ran into this issue: Java data streams write strings with a short designating the length, followed by the data.
C# lets me specify the encoding I want, but it only reads one byte for the length. (actually, it says '7 bits at a time'...this is odd. This might be part of my problem?)
Here is my setup: The server sends a string to the client once it connects. It's a short string, so the first byte is 0 and the second byte is 9; the string is 9 bytes long.
//...
_socket.Connect(host, port);
var stream = new NetworkStream(_socket);
_in = new BinaryReader(stream, Encoding.UTF8);
Console.WriteLine(_in.ReadString()); //outputs nothing
Reading a single byte before reading the string of course outputs the expected string. But, how can I set up my stream reader to read a string using two bytes as the length, not one? Do I need to subclass BinaryReader and override ReadString()?

The C# BinaryWriter/Reader behavior uses, if I recall correctly, the high (8th) bit of each length byte as a continuation flag: when it is set, another length byte follows. This allows counts up to 127 to fit in a single byte while still allowing for actual count values much larger (i.e. up to 2^31-1); it's a bit like UTF-8 in that respect.
For your own purposes, note that you are writing the whole protocol (presumably), so you have complete control over both ends. Both behaviors you describe, in C# and Java, are implemented by what are essentially helper classes in each language. There's nothing saying that you have to use them, and both languages offer a way to simply encode text directly into an array of bytes which you can send however you like.
If you do want to stick with the Java-based protocol, you can read those two length bytes explicitly and combine them yourself. Java writes the length most significant byte first, so BitConverter.ToInt16 would give the wrong value on a little-endian machine unless you reverse the bytes first. For example:
_in = new BinaryReader(stream, Encoding.UTF8);
byte[] header = _in.ReadBytes(2);
int count = (header[0] << 8) | header[1]; // big-endian, as Java's DataOutputStream writes it
byte[] data = _in.ReadBytes(count);
string text = Encoding.UTF8.GetString(data);
Console.WriteLine(text); // outputs something
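The two-byte length prefix described in the question can be checked from the Java side. This sketch (class and method names are illustrative) captures what DataOutputStream.writeUTF actually emits and then decodes it by hand, the way a non-Java client would. One caveat: writeUTF uses Java's "modified UTF-8", which differs from standard UTF-8 for U+0000 and for characters outside the Basic Multilingual Plane, so the manual decoding below is only safe for text that avoids those.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class WriteUtfLayout {
    /** Returns the bytes DataOutputStream.writeUTF produces for s. */
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        }
        return bytes.toByteArray();
    }

    /** Decodes the same layout manually: unsigned big-endian length, then the bytes. */
    static String decode(byte[] wire) {
        int length = ((wire[0] & 0xFF) << 8) | (wire[1] & 0xFF);
        return new String(wire, 2, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = encode("connected");          // a 9-character ASCII string
        System.out.println(wire[0] + " " + wire[1]); // prints 0 9
        System.out.println(decode(wire));            // prints connected
    }
}
```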

Writing Bits to a file using BitSet & FileOutputStream

I've run into a bit of a problem when it comes to writing specific bits to a file. I apologise if this is a duplicate of anything but I could not find a reasonable answer with the searches I ran.
I have a number of difficulties with the following:
Writing a header (Long) bit by bit (converted to a byte array so the
FileOutputStream can utilise it) to the file.
Writing single bits to the file. For example, at one stage I am required to write a single bit set to 0 to the file so my initial thought would be to use a BitSet but Java seems to treat this as a null?
BitSet initialPadding = new BitSet();
initialPadding.set(0, false);
fileOutputStream.write(initialPadding.toByteArray());
1)
I create a FileOutputStream as shown below with the necessary file name:
FileOutputStream fileOutputStream = new FileOutputStream(file.getAbsolutePath());
I am attempting to create an ".amr" file so the first step before I perform any bit manipulation is to write a header to the beginning of the file. This has the following value:
Long defaultHeader = 0x2321414d520aL;
I've tried writing this to the file using the following method but I am pretty sure it does not write the correct result:
fileOutputStream.write(defaultHeader.byteValue());
Am I using the correct streams? Are my conversions completely wrong?
2)
I have a public BitSet fileBitSet; which has bits read in from a ".raw" file as the input. I need to be able to extract certain bits from the BitSet in order to write them to the file later. I do this using the following method:
public int getOctetPayloadHeader(int startPoint) {
int readLength = 0;
octetCMR = fileBitSet.get(0, 3);
octetRES = fileBitSet.get(4, 7);
if (octetRES.get(0, 3).isEmpty()) {
/* Keep constructing the payload header. */
octetFBit = fileBitSet.get(8, 8);
octetMode = fileBitSet.get(9, 12);
octetQuality = fileBitSet.get(13, 13);
octetPadding = fileBitSet.get(14, 15);
... }
What would be the best way to go for writing these bits to a file bearing in mind that I may be required to sometimes write a single bit or 81 bits at a particular offset in the fileBitSet ?

There is only one thing you can write to an OutputStream: bytes. You have to do the composing of your bits into bytes yourself; only you know the rules how the bits are to be put together into bytes.
As for stuff like:
Long defaultHeader = 0x2321414d520aL;
fileOutputStream.write(defaultHeader.byteValue());
You should take a close look at the javadocs for the methods you are using. byteValue() returns a single byte (the lowest eight bits of the long), so of course it's not doing what you expect. Working with streams is well explained in Oracle's tutorials: http://docs.oracle.com/javase/tutorial/essential/io/streams.html
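For the header specifically, one possible fix is to emit the long one byte at a time, most significant byte first. The sketch below is only an illustration (the class and method names are invented); it writes the six bytes of 0x2321414d520a, which spell "#!AMR" followed by a line feed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class AmrHeaderWriter {
    /** Writes the lowest byteCount bytes of value, most significant byte first. */
    static void writeHeader(OutputStream out, long value, int byteCount) throws IOException {
        for (int i = byteCount - 1; i >= 0; i--) {
            out.write((int) (value >>> (8 * i))); // write(int) keeps only the low 8 bits
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(); // stands in for the FileOutputStream
        writeHeader(out, 0x2321414d520aL, 6);
        System.out.print(new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

Since the header is plain ASCII text, writing "#!AMR\n".getBytes(StandardCharsets.US_ASCII) would produce the same six bytes.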
For writing single bits or groups of bits, you will need a custom OutputStream that handles grouping the bits into bytes to be written. That's commonly called a BitStream (there is no such class in the JDK); you have to either write it yourself (which I highly recommend, it's a very good exercise to teach you about bits and bytes) or find one on the web.
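A bit writer of the kind described above might look roughly like this. It is a sketch, not a reference implementation: the class name is made up, and it assumes bits are packed most significant bit first and that the final partial byte is padded with zeros. Check the target format before relying on either choice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class BitWriter implements AutoCloseable {
    private final OutputStream out;
    private int current; // bits collected so far for the byte being built
    private int count;   // how many bits are in 'current' (0..7)

    BitWriter(OutputStream out) { this.out = out; }

    /** Appends one bit; a full byte is written as soon as eight bits are collected. */
    void writeBit(boolean bit) throws IOException {
        current = (current << 1) | (bit ? 1 : 0);
        if (++count == 8) {
            out.write(current);
            current = 0;
            count = 0;
        }
    }

    /** Appends the lowest n bits of value, most significant of those first. */
    void writeBits(long value, int n) throws IOException {
        for (int i = n - 1; i >= 0; i--) {
            writeBit(((value >>> i) & 1) == 1);
        }
    }

    /** Pads the last partial byte with zero bits, then closes the underlying stream. */
    @Override
    public void close() throws IOException {
        while (count != 0) {
            writeBit(false);
        }
        out.close();
    }

    /** Demo: a single 0 bit, then 101, then 1111, then a single 1 bit. */
    static byte[] demo() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (BitWriter w = new BitWriter(bytes)) {
            w.writeBit(false);
            w.writeBits(0b101, 3);
            w.writeBits(0xF, 4);
            w.writeBit(true);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (byte b : demo()) {
            System.out.printf("%02X%n", b & 0xFF); // prints 5F then 80
        }
    }
}
```

A single zero bit therefore never reaches the file on its own; it only becomes visible once it has been combined with seven neighbours into a byte.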

Java file IO truncated while reading large files using BufferedInputStream

I have a function in which I am only given a BufferedInputStream and no other information about the file to be read. I unfortunately cannot alter the method definition as it is called by code I don't have access to. I've been using the code below to read the file and place its contents in a String:
public String[] doImport(BufferedInputStream stream) throws IOException, PersistenceException {
int bytesAvail = stream.available();
byte[] bytesRead = new byte[bytesAvail];
stream.read(bytesRead);
stream.close();
String fileContents = new String(bytesRead);
//more code here working with fileContents
}
My problem is that for large files (>2Gb), this code causes the program to either run extremely slowly or truncate the data, depending on the computer the program is executed on. Does anyone have a recommendation regarding how to deal with large files in this situation?

You're assuming that available() returns the size of the file; it does not. It returns an estimate of the number of bytes that can be read without blocking, and that may be any number less than or equal to the size of the file.
Unfortunately there's no way to do what you want in just one shot without having some other source of information on the length of the file data (e.g., by calling java.io.File.length()). Instead, you may have to accumulate the data from multiple reads. One way is by using ByteArrayOutputStream. Read into a fixed, finite-size array, then write the data you read into a ByteArrayOutputStream. At the end, pull the byte array out. You'll need to use the three-argument forms of read() and write() and look at the return value of read() so you know exactly how many bytes were read into the buffer on each call.
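The accumulate-in-a-loop approach described above could be sketched as follows. The class name and the 8192-byte chunk size are arbitrary choices for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamSlurper {
    /** Reads the stream to its end, however many read() calls that takes. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns how many bytes it actually delivered, or -1 at end of stream
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            collected.write(chunk, 0, n); // copy only the bytes that were really read
        }
        return collected.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] source = new byte[20000];
        System.out.println(readAll(new ByteArrayInputStream(source)).length); // prints 20000
    }
}
```

On Java 9 and later, InputStream.readAllBytes() does the same job. Either way, a Java array cannot hold more than Integer.MAX_VALUE elements, so for the multi-gigabyte files mentioned in the question the data would have to be processed chunk by chunk inside the loop rather than collected into one array or String.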

I'm not sure why you don't think you can read it line-by-line. BufferedInputStream only describes how the underlying stream is accessed, it doesn't impose any restrictions on how you ultimately read data from it. You can use it just as if it were any other InputStream.
Namely, to read it line-by-line you could do
InputStreamReader streamReader = new InputStreamReader(stream);
BufferedReader lineReader = new BufferedReader(streamReader);
String line = lineReader.readLine();
...
[Edit] This response is to the original wording of the question, which asked specifically for a way to read the input file line-by-line.

How do you write any ASCII character to a file in Java?

Basically I'm trying to use a BufferedWriter to write to a file using Java. The problem is, I'm actually doing some compression so I generate ints between 0 and 255, and I want to write the character whose ASCII value is equal to that int. When I try writing to the file, it writes many ? characters, so when I read the file back in, it reads those as 63, which is clearly not what I want. Any ideas how I can fix this?
Example code:
int a = generateCode(character); //a now has an int between 0 and 255
bw.write((char) a);
a is always between 0 and 255, but it sometimes writes '?'

You are really trying to write / read bytes to / from a file.
When you are processing byte-oriented data (as distinct from character-oriented data), you should be using InputStream and OutputStream classes and not Reader and Writer classes.
In this case, you should use FileInputStream / FileOutputStream, and wrap with a BufferedInputStream / BufferedOutputStream if you are doing byte-at-a-time reads and writes.
Those pesky '?' characters are due to issues in the encoding/decoding process that happens when Java converts between characters and the default text encoding for your platform. The conversion from bytes to characters and back is often "lossy", depending on the encoding scheme used. You can avoid this by using the byte-oriented stream classes.
(And the answers that point out that ASCII is a 7-bit not 8-bit character set are 100% correct. You are really trying to read / write binary octets, not characters.)
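The difference can be demonstrated in a few lines. In this sketch (names invented for illustration) US-ASCII stands in for a platform default encoding that cannot represent the character; the asker's actual default encoding is not known. The second method writes the same values through byte streams and gets them back unchanged.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BytesVersusChars {
    /** The byte a US-ASCII encoder emits when asked to encode this char code. */
    static int encodedByte(int code) {
        return String.valueOf((char) code).getBytes(StandardCharsets.US_ASCII)[0] & 0xFF;
    }

    /** Writes each code as one raw byte to a temp file, then reads the bytes back. */
    static int[] roundTrip(int[] codes) throws IOException {
        File file = File.createTempFile("codes", ".bin");
        file.deleteOnExit();
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
            for (int code : codes) {
                out.write(code); // write(int) stores the low 8 bits, no encoding involved
            }
        }
        int[] result = new int[codes.length];
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            for (int i = 0; i < result.length; i++) {
                result[i] = in.read(); // returns the byte as an int in 0..255
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(encodedByte(65));  // prints 65
        System.out.println(encodedByte(200)); // prints 63, the code for '?'
        System.out.println(Arrays.toString(roundTrip(new int[] {0, 65, 200, 255})));
    }
}
```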

You need to make up your mind about what you are really doing. Are you trying to write some bytes to a file, or are you trying to write encoded text? These are different concepts in Java; byte I/O is handled by subclasses of InputStream and OutputStream, while character I/O is handled by subclasses of Reader and Writer. If what you really want to write is bytes to a file (which I'm guessing from your mention of compression), use an OutputStream, not a Writer.
Then there's another confusion you have, which is evident from your mention of "ASCII characters from 0-255." There are no ASCII characters above 127. Please take 15 minutes to read this: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" (by Joel Spolsky). Pay particular attention to the parts where he explains the difference between a character set and an encoding, because it's critical for understanding Java I/O. (To review whether you understood, here's what you need to learn: Java Writers are classes that translate character output to byte output by applying a client-specified encoding to the text, and sending the bytes to an OutputStream.)

Java strings are based on 16-bit-wide characters, and Java performs conversions around that assumption when no encoding is clearly specified.
The following sample code writes and reads data directly as bytes, meaning 8-bit numbers, which here have an ASCII meaning associated with them.
import java.io.*;
public class RWBytes{
public static void main(String[] args)throws IOException{
String filename = "MiTestFile.txt";
byte[] bArray1 =new byte[5];
byte[] bArray2 =new byte[5];
bArray1[0]=65;//A
bArray1[1]=66;//B
bArray1[2]=67;//C
bArray1[3]=68;//D
bArray1[4]=69;//E
FileOutputStream fos = new FileOutputStream(filename);
fos.write(bArray1);
fos.close();
FileInputStream fis = new FileInputStream(filename);
fis.read(bArray2);
fis.close();
for(int i =0; i< bArray2.length ; i++){
System.out.println("As the byte value: "+ bArray2[i]);//as the numeric byte value
System.out.println("Converted to char for printing to the screen: "+ String.valueOf((char)bArray2[i]));
}
}
}
A fixed subset of the 7-bit ASCII codes is printable (A=65, for example). Code 10 corresponds to the "new line" character, which moves down one line on screen when it is found and "printed". Many other codes exist to control a character-oriented screen; these are invisible and only affect the screen representation, like tabs, spaces, etc. There are also other control characters, one of which had the purpose of ringing a bell, for example.
The upper half of the 8-bit range, above 127, was defined as whatever the implementer wanted; only the lower half has standard meanings associated.
For general binary byte handling there are no such qualms: the bytes are just numbers that represent the data. Only when you try to print them to the screen do they take on meaning in all kinds of ways.

Extract first valid line of string from byte array

I am writing a utility in Java that reads a stream which may contain both text and binary data. I want to avoid waiting on I/O. To do that, I create a thread that keeps reading the data (and waiting for it) and puts it into a buffer, so clients can check availability and stop waiting whenever they want (by closing the input stream, which generates an IOException and ends the wait). This works very well as far as reading bytes is concerned, i.e. for binary data.
Now, I also want to make it easy for the client to read lines out of it, with something like '.hasNextLine()' and '.readLine()'. Without using a blocking stream such as a buffered stream, (Q1) how can I check whether a binary (byte[]) contains a valid Unicode line (in the form of the length of the first line)? I looked around the String/Charset API but could not find it (or did I miss it?). (NOTE: If possible, I don't want to use a non-built-in library.)
Since I could not find one, I try to create one. Without being so complicated, here is my algorithm.
1). I look from the start of the byte array until I find '\n' or '\r' without '\n'.
2). Then, I cut the byte array from the start to that point and use it to create a string (with a Charset if specified) using 'new String(byte[])' or 'new String(byte[], Charset)'.
3). If that succeeds without an exception, we have found the first valid line and return it.
4). Otherwise, these bytes may not be a string, so I look further to another '\n' or '\r' w/o '\n'. and this process repeat.
5). If the search reaches the end of the available bytes, I stop and return null (no valid line found).
My question is (Q2): Is the above algorithm adequate?
Just when I was about to implement it, I searched on Google and found that there are many other codes for new line, for example U+2424, U+0085, U+000C, U+2028 and U+2029.
So my last question is (Q3): Do I really need to detect these codes? If I do, will it increase the chance of false alarms?
I am well aware that recognize something from binary is not absolute. I am just trying to find the best balance.
To sum up, I have an array of bytes and I want to extract the first valid string line from it, with or without a specific Charset. This must be done in Java and avoid using any non-built-in library.
Thank you all in advance.

I am afraid your problem is not well-defined. You write that you want to extract the "first valid string line" from your data. But whether some byte sequence is a "valid string" depends on the encoding. So you must decide which encoding(s) you want to use in testing.
Sensible choices would be:
the platform default encoding (Java property "file.encoding")
UTF-8 (as it is most common)
a list of encodings you know your clients will use (such as several Russian or Chinese encodings)
What makes sense will depend on the data, there's no general answer.
Once you have your encodings, the problem of line termination should follow, as most encodings have rules on what terminates a line. In ASCII or Latin-1, LF, CR-LF and LF-CR would suffice. In Unicode, you need all the ones you listed above.
But again, there's no general answer, as new line codes are not strictly regulated. Again, it would depend on your data.

First of all, let me ask you a question: is the data you are trying to process legacy data? In other words, are you responsible for the format of the input stream that you are trying to consume here?
If you are indeed in control of the input format, then you probably want to take the binary vs. text decision out of the Q1 algorithm. For me this algorithm has one troubling part.
`4). Otherwise, these bytes may not be a string, so I look further to
another '\n' or '\r' w/o '\n'. and this process repeat.`
Are you discarding the input before the line terminator and taking the bytes that start immediately after it, or are you trying to re-evaluate the string, which now contains two line terminators? If the former, you may have broken the binary data interface; if the latter, you may still not parse the text correctly.
I think having well defined markers for binary data and text data in your stream will simplify your algorithm a lot.
A couple of words on the String constructor. new String(byte[], Charset) will not throw any exception if the byte array is not valid in that particular Charset; instead, it silently substitutes the charset's replacement character for every malformed sequence, so you get a string full of question marks or U+FFFD characters (probably not what you want). If you want an exception, you should use CharsetDecoder.
Also note that in Java 6 there are 2 constructors that take a charset: String(byte[] bytes, String charsetName) and String(byte[] bytes, Charset charset). I did a simple performance test a while ago, and in that test the constructor with String charsetName was orders of magnitude faster than the one that takes a Charset object (Question to Sun: bug, feature?).
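To show what the CharsetDecoder route looks like, here is a minimal sketch of the question's algorithm with strict decoding. The class and method names are hypothetical. It scans for '\n' and '\r' at the byte level, which is only valid for ASCII-compatible encodings such as UTF-8 or Latin-1 (not UTF-16), and it ignores the more exotic Unicode line separators.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class FirstLineFinder {
    /** Strictly decodes bytes[0, length); returns null if they are not valid in cs. */
    static String decodeStrict(byte[] bytes, int length, Charset cs) {
        try {
            return cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes, 0, length))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /**
     * Returns the text before the first CR or LF at which the prefix decodes cleanly,
     * or null if no such terminator exists in the available bytes.
     */
    static String firstLine(byte[] data, Charset cs) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n' || data[i] == '\r') {
                String line = decodeStrict(data, i, cs);
                if (line != null) {
                    return line;
                }
                // not valid text up to here: keep looking for a later terminator
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] text = "first line\r\nsecond".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(text, StandardCharsets.UTF_8)); // prints first line
        byte[] broken = {(byte) 0xC3, '\n'}; // truncated UTF-8 sequence before the LF
        System.out.println(firstLine(broken, StandardCharsets.UTF_8)); // prints null
    }
}
```

As the answer above points out, retrying with a longer prefix after a failed decode is the questionable step: once the prefix contains invalid bytes, a longer prefix will usually fail as well.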

I would try this:
make the IO reader put strings/lines into a thread safe collection (for example some implementation of BlockingQueue)
the main code has only reference to the synced collection and checks for new data when needed, like queue.peek(). It doesn't need to know about the io thread nor the stream.
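Some sketch Java code (myStreamFromSomewhere and the elided setup are placeholders):
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.*;

class IORunner extends Thread {
    private final BufferedReader reader;
    private final BlockingQueue<String> outputQueue;
    IORunner(InputStream in, BlockingQueue<String> outputQueue) {
        this.reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        this.outputQueue = outputQueue;
    }
    @Override
    public void run() {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                outputQueue.put(line); // blocks while a bounded queue is full
            }
        } catch (IOException | InterruptedException e) {
            // stream closed or thread interrupted: stop reading
        }
    }
}
class Main {
    public static void main(String[] args) throws InterruptedException {
        ...
        BlockingQueue<String> dataQueue = new LinkedBlockingQueue<>();
        new IORunner(myStreamFromSomewhere, dataQueue).start();
        while (true) {
            if (!dataQueue.isEmpty()) { // can also use .peek() != null
                System.out.println(dataQueue.take());
            }
            Thread.sleep(1000);
        }
    }
}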
The collection decouples the input(stream) more from the main code. You can also limit the number of lines stored/mem used by creating the queue with a limited capacity (see blockingqueue doc).
The BufferedReader handles the checking of new lines for you :) The InputStreamReader handles the charset (recommend setting one yourself since the default one changes depending on OS etc.).

The java.text package is designed for this sort of natural-language operation. The BreakIterator.getLineInstance() static method returns an iterator that detects line-break positions, i.e. the places where text may be broken when wrapping lines. It works on decoded text rather than raw bytes, so you do need to know the locale and encoding for best results, though.

Q2: The method you use seems reasonable enough to work.
Q1: Can't think of something better than the algorithm that you are using
Q3: I believe it will be enough to test for \r and \n. The others are too exotic for usual text files.

I just solved this to get a test stub working for Datagram: I did byte[] varName = String.getBytes(); then final int len = varName.length; then sent the int through a DataOutputStream, followed by the byte array. On the receiving end, just do readInt() and then read that many bytes.
It's not a library, and it's not hard to do either. Just read up on readUTF and do what it does for the bytes.
The string should be constructible from the byte array recovered that way; if not, you have other problems. If the string can be reconstructed, it can be buffered ... no?
You may be able to just use readUTF() / writeUTF() on the data streams. Why not?
{ edit: per OP's request }
// Sending end ("sink" is a placeholder for whatever OutputStream you send on)
String data = "fdsfjal;sajssaafe8e88e88aa"; // fingers pounding keyboard
byte[] payload = data.getBytes(StandardCharsets.UTF_8);
DataOutputStream dataOutputStream = new DataOutputStream(sink);
dataOutputStream.writeInt(payload.length); // length in bytes, not in chars
dataOutputStream.write(payload);
dataOutputStream.flush();
dataOutputStream.close();
// Receiving end ("source" is the matching InputStream)
DataInputStream dataInputStream = new DataInputStream(source);
final int sizeToRead = dataInputStream.readInt();
byte[] datasink = new byte[sizeToRead];
dataInputStream.readFully(datasink); // plain read() may return fewer bytes than requested
dataInputStream.close();
// constructor
// String(byte[] bytes, Charset charset)
final String result = new String(datasink, StandardCharsets.UTF_8);
// continue coding here
Do me a favor, keep the heat off of me. This was typed very fast, right in the posting tool, so the code probably contains substantial errors. It's faster for me just to explain it by writing Java; there will be others who can translate it to other languages, which you can do too if you want it in another codebase. You will need exception trapping and so on; just do a compile and start fixing errors. When you get a clean compile, start over from the beginning and look for blunders. (That's what a blunder is called in engineering: a blunder.)
