Are there any Java Frameworks for binary file parsing? - java

I want to parse binary files of different types with a generic parser implemented in Java. Maybe by describing the file format in a configuration file which is read by the parser, or by creating Java classes which parse the files according to some sort of parsing rules.
I have searched quite a bit on the internet but found almost nothing on this topic.
What I have found are just things which deal with compiler-generators (Jay, Cojen, etc.) but I don't think that I can use them to generate something for parsing binary files. But I could be wrong on that assumption.
Are there any frameworks which deal especially with easy parsing of binary files or can anyone give me a hint how I could use parser/compiler-generators to do so?
Update:
I'm looking for something where I can write a config-file like
file:
header: FIXED("MAGIC")
body: content(10)
content:
value1: BYTE
value2: LONG
value3: STRING(10)
and it generates automatically something which parses files which start with "MAGIC", followed by ten times the content-package (which itself consists of a byte, a long and a 10-byte string).
Update2:
I found something comparable to what I'm looking for, "Construct", but sadly this is a Python framework. Maybe it helps someone get an idea of what I'm looking for.

Using Preon:
public class File {
    @BoundString(match="MAGIC")
    private String header;
    @BoundList(size="10", type=Body.class)
    private List<Body> body;
    private static class Body {
        @Bound
        byte value1;
        @Bound
        long value2;
        @BoundString(size="10")
        String value3;
    }
}
Decoding data:
Codec<File> codec = Codecs.create(File.class);
File file = Codecs.decode(codec, buffer);
Let me know if you are running into problems.

Give Preon a try.

I have used DataInputStream for reading binary files and I write the rules in Java. ;) Binary files can have just about any format so there is no general rule for how to read them.
Frameworks don't always make things simpler. In your case, the description file is longer than the code to just read the data using a DataInputStream.
public static void parse(DataInput in) throws IOException {
// file:
// header: FIXED("MAGIC")
String header = readAsString(in, 5);
assert header.equals("MAGIC");
// body: content(10)
// ?? not sure what this means
// content:
for(int i=0;i<10;i++) {
// value1: BYTE
byte value1 = in.readByte();
// value2: LONG
long value2 = in.readLong();
// value3: STRING(10)
String value3 = readAsString(in, 10);
}
}
public static String readAsString(DataInput in, int len) throws IOException {
byte[] bytes = new byte[len];
in.readFully(bytes);
return new String(bytes);
}
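To make the DataInputStream approach easier to try out, here is a minimal round-trip sketch of the "MAGIC" format from the question. It is only an illustration under stated assumptions: the class name MagicFileSketch, the Content holder, the generated sample values and the choice of US-ASCII for the strings are all made up for this example and do not come from the question. It throws an exception on a bad header instead of using assert, because assertions are disabled by default in the JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: writes and re-reads the "MAGIC" layout from the question.
final class MagicFileSketch {

    /** One "content" package: a byte, a long and a 10-byte string. */
    static final class Content {
        final byte value1;
        final long value2;
        final String value3;

        Content(byte value1, long value2, String value3) {
            this.value1 = value1;
            this.value2 = value2;
            this.value3 = value3;
        }
    }

    /** Writes "MAGIC" followed by count generated content packages. */
    static byte[] writeSample(int count) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.write("MAGIC".getBytes(StandardCharsets.US_ASCII));
            for (int i = 0; i < count; i++) {
                out.writeByte(i);
                out.writeLong(1000L * i);
                // "item" plus six digits is exactly 10 ASCII bytes.
                String text = String.format("item%06d", i);
                out.write(text.getBytes(StandardCharsets.US_ASCII));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Parses the header and count content packages from the given bytes. */
    static List<Content> parse(byte[] data, int count) {
        List<Content> result = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String header = readAscii(in, 5);
            if (!header.equals("MAGIC")) {
                throw new IllegalArgumentException("Bad header: " + header);
            }
            for (int i = 0; i < count; i++) {
                byte value1 = in.readByte();
                long value2 = in.readLong();   // big-endian, as DataInput defines
                String value3 = readAscii(in, 10);
                result.add(new Content(value1, value2, value3));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    /** Reads exactly len bytes and decodes them with an explicit charset. */
    private static String readAscii(DataInputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        in.readFully(buf);
        return new String(buf, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        List<Content> body = parse(writeSample(10), 10);
        System.out.println(body.size() + " packages, last string: " + body.get(9).value3);
    }
}
```

Naming the charset explicitly keeps the result independent of the platform default, which a bare new String(bytes) does not.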
If you want to have a configuration file you could use a Java Configuration File. http://www.google.co.uk/search?q=java+configuration+file

Google's Protocol Buffers

A parser combinator library is an option. JParsec works fine, although it can be slow.

I have been developing a framework for Java that parses binary data: https://github.com/raydac/java-binary-block-parser
With it, you only need to describe the structure of your binary file in a pseudo-language.

You can parse binary files with parsers like JavaCC. Here you can find a simple example. Probably it's a bit more difficult than parsing text files.

Have you looked into the world of parsers? A good parser generator is yacc, and there may be a port of it for Java.

Related

How to process a string with 823237 characters

I have a string that has 823237 characters in it. It's actually an XML file, and for testing purposes I want to return it as a response from a servlet.
I have tried everything I can possibly think of:
1) creating a constant with the whole string... in this case Eclipse complains (with a red line under servlet class name) -
The type generates a string that requires more than 65535 bytes to encode in Utf8 format in the constant pool
2) breaking the whole string into 20 string constants and writing to the out object directly
something like :
out.println( CONSTANT_STRING_PART_1 + CONSTANT_STRING_PART_2 +
CONSTANT_STRING_PART_3 + CONSTANT_STRING_PART_4 +
CONSTANT_STRING_PART_5 + CONSTANT_STRING_PART_6 +
// add all the string constants till .... CONSTANT_STRING_PART_20);
in this case ... the build fails .. complaining..
[javac] D:\xx\xxx\xxx.java:87: constant string too long
[javac] CONSTANT_STRING_PART_19 + CONSTANT_STRING_PART_20);
^
3) reading the xml file as a string and writing to out object .. in this case I get
SEVERE: Allocate exception for servlet MyServlet
Caused by: org.apache.xmlbeans.XmlException: error: Content is not allowed in prolog.
Finally my question is ... how can I return such a big string (as response) from the servlet ???
You can avoid loading all the text into memory by using streams:
InputStream is = new FileInputStream("path/to/your/file");
// or, if the file is on the classpath:
// InputStream is = MyServlet.class.getResourceAsStream("path/to/file/in/classpath");
byte[] buff = new byte[4 * 1024];
int read;
while ((read = is.read(buff)) != -1) {
out.write(buff, 0, read);
}
The second approach might work the following way:
out.print(CONSTANT_STRING_PART_1);
out.print(CONSTANT_STRING_PART_2);
out.print(CONSTANT_STRING_PART_3);
out.print(CONSTANT_STRING_PART_4);
// ...
out.print(CONSTANT_STRING_PART_N);
out.println();
You can do this in a loop of course (which is highly recommended ;)).
The way you do it, you just temporarily create the large string again and then pass it to println(), which is the same problem as the first one.
Ropes: Theory and practice
Why and when to use Ropes for Java for string manipulations
You can read a 823K file into a String. Maybe not the most elegant method, but totally doable. Method 3 should have worked. There was an XML error, but that has nothing to do with reading from a file into a String, or the length of the data.
It has to be an external file, though, because it is too big to be inlined into a class file (there are size limits for those).
I recommend Commons IO FileUtils#readFileToString.
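If you would rather not add a dependency, the same thing can be done with the standard library. The following is a minimal sketch, not the only way to do it: the class name ReadWholeFileSketch and its helper methods are made up for this example, and UTF-8 is an assumption about how the XML file is encoded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical sketch: read a large external file into a String at runtime.
final class ReadWholeFileSketch {

    /** Reads the whole file and decodes it as UTF-8 (an assumed encoding). */
    static String readUtf8(Path path) {
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a dummy string of the given length, standing in for the XML. */
    static String sample(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    /** Writes the content to a temporary file so the example is self-contained. */
    static Path writeTemp(String content) {
        try {
            Path file = Files.createTempFile("big", ".xml");
            file.toFile().deleteOnExit();
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeTemp(sample(823237));
        String text = readUtf8(file);
        System.out.println("Read " + text.length() + " characters");
    }
}
```

Because the string is built at runtime rather than compiled in as a literal, the class file limit on string constants does not apply.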
You have to deal with a ByteArrayOutputStream and not with the String itself. If you want to send your String in the HTTP response, all you have to do is write the contents of that byte array stream to the response stream, like this:
ByteArrayOutputStream baos = new ByteArrayOutputStream(8232237);
baos.write(constant1.getBytes());
baos.write(constant2.getBytes());
...
baos.writeTo(response.getOutputStream());
Problems 1) and 2) share the same root cause. A String literal (or constant String expression) cannot take more than 65535 bytes in its modified UTF-8 encoding, because the class file format stores each string constant with a 16-bit length.
The third problem sounds like a bug in the way you've implemented it rather than a fundamental problem. In fact, it sounds like you are trying to load the XML as a DOM and then unparse it (which is unnecessary), and that somehow you have managed to mangle the XML in the process. (Or maybe it is mangled in the file you are trying to read ...)
The simple and elegant solution is to save the stuff in a file, and then read it as plain text.
Or ... less elegant, but just as effective:
String[] strings = new String[] {
    "longString1",
    "longString2",
    ...
    "longStringN"
};
for (String str : strings) {
out.write(str);
}
Of course, the problem with embedding test data as string literals is that you have to escape certain characters in the string to keep the compiler happy. That's tedious if you have to do it by hand.

Externalize XML construction from a stream of CSV in Java

I get a stream of values as CSV. Based on some condition, I need to generate an XML document that includes only a subset of the values from the CSV. For example:
Input : a:value1, b:value2, c:value3, d:value4, e:value5.
if (condition1)
XML O/P = <Request><ValueOfA>value1</ValueOfA><ValueOfE>value5</ValueOfE></Request>
else if (condition2)
XML O/P = <Request><ValueOfB>value2</ValueOfB><ValueOfD>value4</ValueOfD></Request>
I want to externalize the process in a way that given a template the output XML is generated accordingly. String manipulation is the easiest way of implementing this but I do not want to mess up the XML if some special characters appear in the input, etc. Please suggest.
Perhaps you could benefit from templating engine, something like Apache Velocity.
I would suggest creating an xsd and using JAXB to create the Java binding classes that you can use to generate the XML.
I recommend my own templating engine, JATL (http://code.google.com/p/jatl/). Although it's geared to (X)HTML, it's also very good at generating XML.
I didn't bother solving the whole problem for you (that is, splitting the input twice, on "," and then on ":"), but this is how you would use JATL.
final String a = "stuff";
HtmlWriter html = new HtmlWriter() {
@Override
protected void build() {
//If condition1
start("Request").start("ValueOfA").text(a).end().end();
}
};
//Now write.
StringWriter writer = new StringWriter();
String results = html.write(writer).getBuffer().toString();
Which would generate
<Request><ValueOfA>stuff</ValueOfA></Request>
All the correct escaping is handled for you.
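For comparison, the JDK's own StAX writer (javax.xml.stream.XMLStreamWriter) also escapes text content for you. The sketch below is only an illustration: the class RequestXmlSketch, its parse and build methods, and the "ValueOf" plus upper-cased key naming rule are assumptions made up to match the example in the question. It also assumes every input pair has the form key:value.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch: pick some CSV values and emit them as escaped XML.
final class RequestXmlSketch {

    /** Splits "a:value1, b:value2" on "," and then on the first ":". */
    static Map<String, String> parse(String csv) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String pair : csv.split(",")) {
            String[] keyValue = pair.trim().split(":", 2);
            values.put(keyValue[0].trim(), keyValue[1].trim());
        }
        return values;
    }

    /** Writes <Request> with one <ValueOfX> child per selected key. */
    static String build(Map<String, String> values, List<String> selectedKeys) {
        StringWriter sink = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);
            xml.writeStartElement("Request");
            for (String key : selectedKeys) {
                xml.writeStartElement("ValueOf" + key.toUpperCase());
                // writeCharacters escapes markup characters such as < and &.
                xml.writeCharacters(values.get(key));
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = parse("a:value1, b:value2, c:value3, d:value4, e:value5");
        // "condition1" from the question: include only a and e.
        System.out.println(build(values, Arrays.asList("a", "e")));
    }
}
```

The list of selected keys is the part that could be externalized, for instance one list per condition in a properties file.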

What is the proper way to write/read a file with different IO streams

I have a file that contains bytes, chars, and an object, all of which need to be written then read. What would be the best way to utilize Java's different IO streams for writing and reading these data types? More specifically, is there a proper way to add delimiters and recognize those delimiters, then triggering what stream should be used? I believe I need some clarification on using multiple streams in the same file, something I have never studied before. A thorough explanation would be a sufficient answer. Thanks!
As EJP already suggested, use ObjectOutputStream and ObjectInputStream and wrap your other elements in one or more objects. I'm posting this as an answer so I can show an example (it's hard to do in a comment). EJP, if you want to embed it in your question, please do and I'll delete the answer.
class MyWrapedData implements Serializable {
private String string1;
private String string2;
private char char1;
// constructors
// getters setters
}
Write to file:
ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName));
out.writeObject(myWrappedDataInstance);
out.flush();
Read from file
ObjectInputStream in = new ObjectInputStream(new FileInputStream(fileName));
Object obj = in.readObject();
MyWrapedData wraped = null;
if (obj instanceof MyWrapedData) // instanceof is already false for null
wraped = (MyWrapedData) obj;
// get the specific elements from the wraped object
see very clear example here: Read and Write
Redesign the file. There is no sensible way of implementing it as presently designed. For example the object presupposes an ObjectOutputStream, which has a header - where's that going to go? And how are you going to know where to switch from bytes to chars?
I would probably use an ObjectOutputStream for the whole thing and write everything as objects. Then Serialization solves all those problems for you. After all you don't actually care what's in the file, only how to read and write it.
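A minimal sketch of that idea: ObjectOutputStream also implements DataOutput, so a byte, a char and an object can all go through one stream, provided they are read back in the same order they were written. The class MixedStreamSketch and its methods are made up for this illustration, and it uses in-memory streams instead of a file to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical sketch: one ObjectOutputStream for primitives and objects alike.
final class MixedStreamSketch {

    /** Writes a byte, a char and a Serializable object to a single stream. */
    static byte[] write(byte b, char c, Object obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeByte(b);
            out.writeChar(c);
            out.writeObject(obj);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the three values back in exactly the order they were written. */
    static Object[] read(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            byte b = in.readByte();
            char c = in.readChar();
            Object obj = in.readObject();
            return new Object[] { b, c, obj };
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object[] values = read(write((byte) 7, 'x', "hello"));
        System.out.println(values[0] + " " + values[1] + " " + values[2]);
    }
}
```

No delimiters are needed here: the stream keeps track of where each value ends, and the read order is the only contract between writer and reader.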
Can you change the structure of the file? It is unclear because the first sentence of your question contradicts being able to add delineators. If you can change the file structure you could output the different data types into separate files. I would consider this the 'proper' way to delineate the data streams.
If you are stuck with the file the way it is then you will need to write an interface to the file's structure which in practice is a shopping list of read operations and a lot of exception handling. A hackish way to program because it will require a hex editor and a lot of trial and error but it works in certain cases.
Why not write the file as XML, possibly with a nice simple library like XStream? If you are concerned about space, wrap it in gzip compression.
If you have control over the file format, and it's not an exceptionally large file (i.e. < 1 GiB), have you thought about using Google's Protocol Buffers?
They generate code that parses (and serializes) file/byte[] content. Protocol buffers use a tagging approach on every value that includes (1) field number and (2) a type, so they have nice properties such as forward/backward compatibility with optional fields etc. They are fairly well optimized for both speed and file size, adding only ~2 bytes of overhead for a short byte[], with ~2-4 additional bytes to encode the length on larger byte[] fields (VarInt encoded lengths).
This could be overkill, but if you have a bunch of different fields & types, protobuf is really helpful. See: http://code.google.com/p/protobuf/.
An alternative is Thrift by Facebook, with support for a few more languages although possibly less use in the wild last I checked.
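To illustrate the VarInt encoded lengths mentioned above, here is a small sketch of base-128 varint encoding, which stores 7 bits per byte and uses the high bit as a "more bytes follow" flag. This is a hand-written illustration of the scheme, not code from the protobuf library; the class VarintSketch and its methods are made up for this example.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of base-128 varint encoding for unsigned values.
final class VarintSketch {

    /** Encodes the value 7 bits at a time, least significant group first. */
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            // More groups follow, so set the continuation bit.
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value); // last group, continuation bit clear
        return out.toByteArray();
    }

    /** Decodes a single varint from the start of the array. */
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // continuation bit clear: this was the last group
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (long length : new long[] { 5, 127, 128, 300, 100000 }) {
            System.out.println(length + " -> " + encode(length).length + " byte(s)");
        }
    }
}
```

A length below 128 fits in a single byte, which is where the small overhead for short byte[] fields comes from: one byte for the tag (for low field numbers) plus one byte for the length.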
If the structure of your file is not fixed, consider using a wrapper per type. First you need to create the interface of your wrapper classes….
interface MyWrapper extends Serializable {
void accept(MyWrapperVisitor visitor);
}
Then you create the MyWrapperVisitor interface…
interface MyWrapperVisitor {
void visit(MyString wrapper);
void visit(MyChar wrapper);
void visit(MyLong wrapper);
void visit(MyCustomObject wrapper);
}
Then you create your wrapper classes…
class MyString implements MyWrapper {
public final String value;
public MyString(String value) {
super();
this.value = value;
}
@Override
public void accept(MyWrapperVisitor visitor) {
visitor.visit(this);
}
}
.
.
.
And finally you read your objects…
final InputStream in = new FileInputStream(myfile);
final ObjectInputStream objIn = new ObjectInputStream(in);
final MyWrapperVisitor visitor = new MyWrapperVisitor() {
@Override
public void visit(MyString wrapper) {
//your logic here
}
.
.
.
};
//loop over all your objects here
final MyWrapper wrapper = (MyWrapper) objIn.readObject();
wrapper.accept(visitor);

Creating Java DataInputStream data in Python

I have a Java program that uses a DataInputStream for storing object data.
Example:
DataInputStream tInput = new DataInputStream(getClass().getResourceAsStream(aDirectory + "/ResultItemInfo.dat"));
this._text = tInput.readUTF();
this._image = tInput.readUTF();
this._audio = tInput.readUTF();
this._random = false;
if (tInput.read() == 1) {
this._random = true;
}
this._hasMenu = false;
if (tInput.read() == 1) {
this._hasMenu = true;
}
Nice, isn't it?
There is an existing dataset, and now I have to add some records. If the tool that I am required to make was written in Java too, this would be pretty easy. Unfortunately, it is written in Python, so I have to discover a way how to create files that can be read from the Java application using Python.
Is there any easy way to do this?
As a last resort, I could:
Modify the Java app and use XML. This would break compatibility with all existing data, so I really don't want to do it.
Use Jython. Possible solution, but I want pure C-Python.
Reverse-Engineer the data format. Not a good solution either.
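For a string to be readable by readUTF(), it must consist of a two-byte counter followed by exactly as many bytes of UTF-8 encoded data as the counter says. (Strictly speaking, Java uses modified UTF-8, which differs from standard UTF-8 only for U+0000 and for characters outside the Basic Multilingual Plane.)
So the following code should write a unicode string data in a form that readUTF() can read: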
import struct

encoded = data.encode('UTF-8')
# 2-byte big-endian unsigned length prefix, as readUTF() expects
outfile.write(struct.pack('>H', len(encoded)))
outfile.write(encoded)
The outfile must be open in binary mode (e.g. "wb").
This is according to description of Java interface DataInput. I did not try to run this, though.
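One way to check the layout from the Java side is to compare what DataOutputStream.writeUTF() produces with a manual "two-byte length plus UTF-8 bytes" encoding. The sketch below does that; the class WriteUtfSketch and its methods are made up for this illustration, and the comparison is only expected to hold for strings without U+0000 and without characters outside the Basic Multilingual Plane.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: compare writeUTF() output with a hand-built encoding.
final class WriteUtfSketch {

    /** The bytes Java itself writes for the string. */
    static byte[] javaEncoding(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Two-byte big-endian length followed by the UTF-8 bytes. */
    static byte[] manualEncoding(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        if (encoded.length > 0xFFFF) {
            throw new IllegalArgumentException("too long for a 2-byte length");
        }
        byte[] result = new byte[encoded.length + 2];
        result[0] = (byte) (encoded.length >>> 8); // most significant byte first
        result[1] = (byte) encoded.length;
        System.arraycopy(encoded, 0, result, 2, encoded.length);
        return result;
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo";
        System.out.println(Arrays.equals(javaEncoding(text), manualEncoding(text)));
    }
}
```

Note that the count is the number of encoded bytes, not the number of characters, so a non-ASCII character such as U+00E9 adds two to it.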

How do you convert binary data to Strings and back in Java?

I have binary data in a file that I can read into a byte array and process with no problem. Now I need to send parts of the data over a network connection as elements in an XML document. My problem is that when I convert the data from an array of bytes to a String and back to an array of bytes, the data is getting corrupted. I've tested this on one machine to isolate the problem to the String conversion, so I now know that it isn't getting corrupted by the XML parser or the network transport.
What I've got right now is
byte[] buffer = ...; // read from file
// a few lines that prove I can process the data successfully
String element = new String(buffer);
byte[] newBuffer = element.getBytes();
// a few lines that try to process newBuffer and fail because it is not the same data anymore
Does anyone know how to convert binary to String and back without data loss?
Answered: Thanks Sam. I feel like an idiot. I had this answered yesterday because my SAX parser was complaining. For some reason when I ran into this seemingly separate issue, it didn't occur to me that it was a new symptom of the same problem.
EDIT: Just for the sake of completeness, I used the Base64 class from the Apache Commons Codec package to solve this problem.
String(byte[]) treats the data as the default character encoding. So, how bytes get converted from 8-bit values to 16-bit Java Unicode chars will vary not only between operating systems, but can even vary between different users using different codepages on the same machine! This constructor is only good for decoding one of your own text files. Do not try to convert arbitrary bytes to chars in Java!
Encoding as base64 is a good solution. This is how files are sent over SMTP (e-mail). The (free) Apache Commons Codec project will do the job.
byte[] bytes = loadFile(file);
//all chars in encoded are guaranteed to be 7-bit ASCII
byte[] encoded = Base64.encodeBase64(bytes);
String printMe = new String(encoded, "US-ASCII");
System.out.println(printMe);
byte[] decoded = Base64.decodeBase64(encoded);
Alternatively, you can use the Java 6 DatatypeConverter:
import java.io.*;
import java.nio.channels.*;
import javax.xml.bind.DatatypeConverter;
public class EncodeDecode {
public static void main(String[] args) throws Exception {
File file = new File("/bin/ls");
byte[] bytes = loadFile(file, new ByteArrayOutputStream()).toByteArray();
String encoded = DatatypeConverter.printBase64Binary(bytes);
System.out.println(encoded);
byte[] decoded = DatatypeConverter.parseBase64Binary(encoded);
// check
for (int i = 0; i < bytes.length; i++) {
assert bytes[i] == decoded[i];
}
}
private static <T extends OutputStream> T loadFile(File file, T out)
throws IOException {
FileChannel in = new FileInputStream(file).getChannel();
try {
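// Do the transfer outside the assert, so it still runs when assertions are disabled.
long transferred = in.transferTo(0, in.size(), Channels.newChannel(out));
assert transferred == in.size();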
return out;
} finally {
in.close();
}
}
}
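On Java 8 and later, java.util.Base64 in the standard library does the same job without extra dependencies. The sketch below shows the round trip, next to the lossy String conversion from the question for contrast. The class Base64RoundTrip and its methods are made up for this illustration, and UTF-8 stands in for whatever the platform default charset happens to be.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical sketch: lossless Base64 round trip vs. lossy String conversion.
final class Base64RoundTrip {

    /** Encodes arbitrary bytes as ASCII-only Base64 text. */
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    /** Decodes Base64 text back into the original bytes. */
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    /** The approach from the question: bytes -> String -> bytes. */
    static byte[] throughString(byte[] data) {
        // Byte sequences that are not valid in the charset get replaced.
        return new String(data, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = { 0x00, (byte) 0xFF, (byte) 0x80, 0x41 };
        System.out.println("Base64 round trip intact: "
                + Arrays.equals(data, decode(encode(data))));
        System.out.println("String round trip intact: "
                + Arrays.equals(data, throughString(data)));
    }
}
```

The encoded text can be placed directly in an XML element, since the Base64 alphabet contains no characters that need escaping there.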
If you encode it in base64, this will turn any data into ASCII-safe text, but base64-encoded data is about one third larger than the original data.
See this question, How do you embed binary data in XML?
Instead of converting the byte[] into a String directly and then pushing it into XML somewhere, convert the byte[] to a String via BASE64 encoding (some XML libraries have a type to do this for you). Then BASE64-decode once you get the String back from the XML.
Use http://commons.apache.org/codec/
Your data may be getting messed up due to all sorts of weird character set restrictions and the presence of non-printing characters. Stick with BASE64.
How are you building your XML document? If you use java's built in XML classes then the string encoding should be handled for you.
Take a look at the javax.xml and org.xml packages. That's what we use for generating XML docs, and it handles all the string encoding and decoding quite nicely.
---EDIT:
Hmm, I think I misunderstood the problem. You're not trying to encode a regular string, but some set of arbitrary binary data? In that case the Base64 encoding suggested in an earlier comment is probably the way to go. I believe that's a fairly standard way of encoding binary data in XML.
