Compact Java Externalization

I am trying to figure out a way to serialize simple Java objects (i.e. all the fields are primitive types) compactly, without the big header that normally gets added when you use writeExternal. It does not need to be super general, backwards compatible across versions, or anything like that; I just want it to work with ObjectOutputStreams (or something similar) and not add ~100 bytes to the size of each object I serialize.
More concretely, I have a class that has 3 members: a boolean flag and two longs. I should be able to represent this object in 17 bytes. Here is a simplified version of the code:
class Record implements Externalizable {
boolean b;
long id;
long uid;
public void writeExternal(ObjectOutput out) throws IOException {
int size = 1 + 8 + 8; //I know, I know, but there's no sizeof
ByteBuffer buff = ByteBuffer.allocate(size);
if (b) {
buff.put((byte) 1);
} else {
buff.put((byte) 0);
}
buff.putLong(id);
buff.putLong(uid);
out.write(buff.array(), 0, size);
}
}
Elsewhere, these are stored by being passed into a method like the following:
public void store(Object value) throws IOException {
ObjectOutputStream out = getStream();
out.writeObject(value);
out.close();
}
After I store just one of these objects in a file this way, the file has a size of 128 bytes (and 256 for two of them, so it's not amortized). Looking at the file, it is clear that it is writing a header similar to the one used in default serialization (which, for the record, uses about 376 bytes to store one of these). I can see that my writeExternal method is getting invoked (I put in some logging), so that isn't the problem. Is this just a fundamental limitation of the way ObjectOutputStream serializes things? Do I need to work on raw DataOutputStreams to get the kind of compactness I want?
[EDIT: In case anyone is wondering, I ended up using DataOutputStreams directly, which turned out to be easier than I'd feared]
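For readers wondering what the DataOutputStream route mentioned in the edit might look like, here is a minimal sketch. It is not the asker's actual code: CompactRecord, writeTo, readFrom and toBytes are hypothetical names chosen for illustration, and the field layout (one boolean, two longs) is taken from the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for the question's Record class.
class CompactRecord {
    final boolean b;
    final long id;
    final long uid;

    CompactRecord(boolean b, long id, long uid) {
        this.b = b;
        this.id = id;
        this.uid = uid;
    }

    // Writes exactly 1 + 8 + 8 = 17 bytes: no stream header, no class descriptor.
    void writeTo(DataOutputStream out) throws IOException {
        out.writeBoolean(b);
        out.writeLong(id);
        out.writeLong(uid);
    }

    // Reads the fields back in the same order they were written.
    static CompactRecord readFrom(DataInputStream in) throws IOException {
        boolean b = in.readBoolean();
        long id = in.readLong();
        long uid = in.readLong();
        return new CompactRecord(b, id, uid);
    }

    // Convenience helper: serializes one record into a byte array.
    static byte[] toBytes(CompactRecord r) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            r.writeTo(out);
        }
        return bytes.toByteArray();
    }
}
```

The trade-off is that the reader must already know which class to expect, since nothing in the stream identifies it.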

Related

Is ByteBuf.arrayOffset useless?

I'm learning Netty in Action.
At the chapter 5.2.2 ByteBuf usage patterns, there is a piece of code that confused me. It is shown below.
ByteBuf heapBuf = ...
if (heapBuf.hasArray()) {
byte[] array = heapBuf.array();
int offset = heapBuf.arrayOffset() + heapBuf.readerIndex();
int length = heapBuf.readableBytes();
handleArray(array, offset, length);
}
I wondered what is the use case of the ByteBuf.arrayOffset() method. The documentation for that method reads:
Returns the offset of the first byte within the backing byte array of this buffer.
Then, I looked up the arrayOffset() method in UnpooledHeapByteBuf.java which implements ByteBuf. The implementation for the method always just returns 0, as seen below.
@Override
public int arrayOffset() {
return 0;
}
So, is ByteBuf#arrayOffset useless?

There may be other implementations for ByteBuf and it could be possible that they have a more useful or even complex implementation.
So for the case of UnpooledHeapByteBuf returning 0 works but that does not mean that there aren't other implementations of ByteBuf that need a different implementation.
The method should do what the documentation states and you could imagine that other implementations indeed have an offset that is different to 0. For example if they use something like a circular-array as backing byte array.
In that case the method needs to return the index of where the current start pointer is located at and not 0.
Picture such a circular array with the current pointer at index 2 rather than 0; the pointer moves around the array as the buffer is used.
And on the user-side, if you want to safely use your ByteBuf object you also should use the method. You can avoid using it if you operate on UnpooledHeapByteBuf but even then you should not because it could be possible that they change the internal behavior with future versions.

No, it's not useless at all, as it allows one huge byte array to back multiple ByteBuf instances. This is in fact done in PooledHeapByteBuf.
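The same idea can be seen without Netty, since the JDK's java.nio.ByteBuffer has an arrayOffset() method with the same meaning. The sketch below is only an analogy, not Netty code; ArrayOffsetDemo and run are made-up names.

```java
import java.nio.ByteBuffer;

class ArrayOffsetDemo {
    // Returns { arrayOffset of the slice, value found at shared[4] }.
    static int[] run() {
        byte[] shared = new byte[16];
        // wrap(array, offset, length) positions the buffer at index 4;
        // slice() then creates a view whose index 0 maps to shared[4].
        ByteBuffer slice = ByteBuffer.wrap(shared, 4, 8).slice();
        slice.put(0, (byte) 99); // absolute put at slice index 0
        return new int[] { slice.arrayOffset(), shared[4] };
    }
}
```

Code that reads slice.array() directly has to add arrayOffset() to every index, which is exactly what the Netty in Action snippet does.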

Is it possible to use struct-like constructs in Java?

I'm considering using Java for a large project but I haven't been able to find anything that remotely represented structures in Java. I need to be able to convert network packets to structures/classes that can be used in the application.
I know that it is possible to use RandomAccessFile but this way is NOT acceptable. So I'm curious if it is possible to "cast" a set of bytes to a structure like I could do in C. If this is not possible then I cannot use Java.
So the question I'm asking is if it is possible to cast aligned data to a class without any extra effort beyond specifying the alignment and data types?

No. You cannot cast an array of bytes to a class object.
That being said, you can use a java.nio.Buffer and easily extract the fields you need to an object like this:
class Packet {
private final int type;
private final float data1;
private final short data2;
public Packet(byte[] bytes) {
ByteBuffer bb = ByteBuffer.wrap(bytes);
bb.order(ByteOrder.BIG_ENDIAN); // or LITTLE_ENDIAN
type = bb.getInt();
data1 = bb.getFloat();
data2 = bb.getShort();
}
}

You're basically asking whether you can use a C-specific solution to a problem in another language. The answer is, predictably, 'no'.
However, it is perfectly possible to construct a class that takes a set of bytes in its constructor and constructs an appropriate instance.
class Foo {
int someField;
String anotherField;
public Foo(byte[] bytes) {
someField = someFieldFromBytes(bytes);
anotherField = anotherFieldFromBytes(bytes);
// etc.
}
}
You can ensure there is a one-to-one mapping of class instances to byte arrays. Add a toBytes() method to serialize an instance into bytes.
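A minimal sketch of such a constructor/toBytes() pair, assuming a made-up six-byte layout (a 4-byte int followed by a 2-byte short, big-endian). Header and its fields are hypothetical and only illustrate the round trip.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-layout type: int type, then short length.
class Header {
    static final int SIZE = 4 + 2;
    final int type;
    final short length;

    Header(int type, short length) {
        this.type = type;
        this.length = length;
    }

    // Decodes the fields in the order they appear on the wire.
    Header(byte[] bytes) {
        ByteBuffer bb = ByteBuffer.wrap(bytes); // big-endian by default
        type = bb.getInt();
        length = bb.getShort();
    }

    // Encodes the fields back into exactly SIZE bytes.
    byte[] toBytes() {
        return ByteBuffer.allocate(SIZE).putInt(type).putShort(length).array();
    }
}
```

Because both directions use the same field order and byte order, new Header(h.toBytes()) reproduces h.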

No, you cannot do that. Java simply doesn't have the same concepts as C.
You can create a class that behaves much like a struct:
public class Structure {
public int field1;
public String field2;
}
and you can have a constructor that takes an array of bytes or a DataInput to read the bytes:
public class Structure {
...
public Structure(byte[] data) throws IOException {
this(new DataInputStream(new ByteArrayInputStream(data)));
}
public Structure(DataInput in) throws IOException {
field1 = in.readInt();
field2 = in.readUTF();
}
}
then read bytes off the wire and pump them into Structures:
byte[] bytes = network.read();
DataInputStream stream = new DataInputStream(new ByteArrayInputStream(bytes));
Structure structure1 = new Structure(stream);
Structure structure2 = new Structure(stream);
...
It's not as concise as C but it's pretty close. Note that the DataInput interface cleanly removes any mucking around with endianness on your behalf, so that's definitely a benefit over C.

As Joshua says, serialization is the typical way to do these kinds of things. However, there are other binary protocols like MessagePack, Protocol Buffers, and Avro.
If you want to play with the bytecode structures, look at ASM and CGLIB; these are very common in Java applications.

There is nothing which matches your description.
The closest thing to a struct in Java is a simple class which holds values accessible either through its fields or through set/get methods.
The typical means to convert between Java class instances and on-the-wire representations is Java serialization which can be heavily customized as need be. It is what is used by Java's Remote Method Invocation API and works extremely well.

ByteBuffer.wrap(new byte[] {}).getDouble();

No, this is not possible. You're trying to use Java like C, which is bound to cause complications. Either learn to do things the Java way, or go back to C.
In this case, the Java way would probably involve DataInputStream and/or DataOutputStream.

You cannot cast an array of bytes to an instance of a class.
But you can do much, much more with Java.
Java has a built-in, very strong and very flexible serialization mechanism. This is what you need. You can read and write objects to/from a stream.
If both sides are written in Java, there is no problem at all. If one of the sides is not Java, you can customize your serialization. Start by reading the Javadoc of java.io.Serializable.

Write from Java generic T[] to DataOutput stream?

Given a generic array T[], where T extends java.lang.Number, I would like to write the array to a byte[], using ByteArrayOutputStream. java.io.DataOutput (and an implementation such as java.io.DataOutputStream) appears close to what I need, but there is no generic way to write the elements of the T[] array. I want to do something like
ByteArrayOutputStream out = new ByteArrayOutputStream();
DataOutputStream dataOut = new DataOutputStream(out);
for (T v : getData()) {
dataOut.write(v); // <== uh, oh
}
but there is no generic <T> void write(T v) method on DataOutput.
Is there any way to avoid having to write a whole bunch of instanceof spaghetti?
Clarification
The byte[] is being sent to a non-Java client, so object serialization isn't an option. I need, for example, the byte[] generated from a Float[] to be a valid float[] in C.

No, there isn't. The instanceof "spaghetti" would have to exist somewhere anyway. Make a generic method that does that:
public <T> void write(DataOutputStream stream, T object) {
// instanceofs and writes here
}
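One way the body of such a method could look is sketched below. NumberWriter and toBytes are hypothetical names, and the set of supported types is an assumption; anything else is rejected.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class NumberWriter {
    // Dispatches on the runtime type and writes the matching primitive encoding.
    static void write(DataOutput out, Number n) throws IOException {
        if (n instanceof Integer) out.writeInt(n.intValue());
        else if (n instanceof Long) out.writeLong(n.longValue());
        else if (n instanceof Float) out.writeFloat(n.floatValue());
        else if (n instanceof Double) out.writeDouble(n.doubleValue());
        else if (n instanceof Short) out.writeShort(n.shortValue());
        else if (n instanceof Byte) out.writeByte(n.byteValue());
        else throw new IllegalArgumentException("Unsupported type: " + n.getClass());
    }

    // Writes every element of the array and returns the raw bytes.
    static <T extends Number> byte[] toBytes(T[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (T v : values) {
            write(out, v);
        }
        out.flush();
        return bytes.toByteArray();
    }
}
```

Note that DataOutput always writes big-endian, so a C client on a little-endian machine would need to swap bytes before treating the result as a float[].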

You can just use an ObjectOutputStream instead of a DataOutputStream, since all Numbers are guaranteed to be serializable.

Regarding the last edit, I would try this approach (ugly or not).
1) Check via instanceof which type you have
2) Store it in a primitive and extract the bytes you need (e.g. from an integer) like this (for the lowest two bytes)
byte[] bytes = new byte[2];
bytes[0]=(byte)(i>>8);
bytes[1]=(byte)i;
3) Send it via the byte[] array
4) Get stuck, because different C implementations use different numbers of bytes for an integer, so nobody can guarantee that the results will equal your initial numbers. E.g. how do you want to handle the 4-byte integer of Java with 2-byte integers in C? How do you handle long?
So... I don't see a way to do it, but I'm not an expert in this area.
Please correct me if I'm wrong. ;-)

Large amount of constants in Java

I need to include about 1 MByte of data in a Java application, for very fast and easy access in the rest of the source code. My main background is not Java, so my initial idea was to convert the data directly to Java source code, defining 1MByte of constant arrays, classes (instead of C++ struct) etc., something like this:
public final/immutable/const MyClass MyList[] = {
{ 23012, 22, "Hamburger"} ,
{ 28375, 123, "Kieler"}
};
However, it seems that Java does not support such constructs. Is this correct? If yes, what is the best solution to this problem?
NOTE: The data consists of 2 tables, each with about 50000 records, which are to be searched in various ways. This may require some indexes later, with significantly more records, maybe 1 million records, saved this way. I expect the application to start up very fast, without iterating through these records.

I personally wouldn't put it in source form.
Instead, include the data in some appropriate raw format in your jar file (I'm assuming you'll be packaging the application or library up) and use Class.getResourceAsStream or ClassLoader.getResourceAsStream to load it.
You may very well want a class to encapsulate loading, caching and providing this data - but I don't see much benefit from converting it into source code.
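A minimal sketch of that approach, under some loud assumptions: the resource name "/records.bin" is hypothetical, and the file is assumed to hold a 4-byte count followed by that many big-endian ints. ResourceData, readInts and loadFromJar are illustrative names.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

class ResourceData {
    // Reads a count-prefixed list of ints from any stream (assumed format).
    static int[] readInts(InputStream raw) throws IOException {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(raw))) {
            int count = in.readInt();
            int[] values = new int[count];
            for (int i = 0; i < count; i++) {
                values[i] = in.readInt();
            }
            return values;
        }
    }

    // "/records.bin" is a hypothetical resource packaged at the root of the jar.
    static int[] loadFromJar() throws IOException {
        InputStream raw = ResourceData.class.getResourceAsStream("/records.bin");
        if (raw == null) {
            throw new IOException("resource /records.bin not found");
        }
        return readInts(raw);
    }
}
```

Keeping the parsing in readInts, separate from the resource lookup, makes it easy to test with an in-memory stream.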

Due to limitations of the Java class-file format, a single method (including the static initializer that would fill such constant arrays) cannot exceed 64 KB of bytecode. (Class files are simply not intended for this type of data.)
I would load the data upon starting the program, using something like the following lines of code:
import java.io.*;
import java.util.*;
public class Test {
public static void main(String... args) throws IOException {
List<DataRecord> records = new ArrayList<DataRecord>();
BufferedReader br = new BufferedReader(new FileReader("data.txt"));
String s;
while ((s = br.readLine()) != null) {
String[] arr = s.split(" ");
int i = Integer.parseInt(arr[0]);
int j = Integer.parseInt(arr[1]);
records.add(new DataRecord(i, j, arr[2])); // third column is the name
}
}
}
class DataRecord {
public final int i, j;
public final String s;
public DataRecord(int i, int j, String s) {
this.i = i;
this.j = j;
this.s = s;
}
}
(NB: The Scanner is quite slow, so don't be tempted to use it just because it has a simple interface. Stick with some form of BufferedReader and split, or StringTokenizer.)
Efficiency can of course be improved if you transform the data into a binary format. In that case, you can make use of DataInputStream (but don't forget to wrap the underlying stream in a BufferedInputStream).
Depending on how you wish to access the data, you might be better off storing the records in a hash-map (HashMap<Integer, DataRecord>) (having i or j as the key).
If you wish to load the data at the same time as the JVM loads the class file itself (roughly!) you could do the read / initialization, not within a method, but encapsulated in static { ... }.
For a memory-mapped approach, have a look at the java.nio.channels package. Especially the method
public abstract MappedByteBuffer map(FileChannel.MapMode mode, long position,long size) throws IOException
Complete code examples can be found here.
Dan Bornstein (the lead developer of DalvikVM) explains a solution to your problem in this talk (Look around 0:30:00). However I doubt the solution applies to as much data as a megabyte.
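For the memory-mapped approach mentioned above, a minimal sketch might look like the following. The record layout (two big-endian ints, 8 bytes per record) is an assumption for illustration, and MappedRecords with its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedRecords {
    static final int RECORD_SIZE = 8; // two ints per record (assumed layout)

    // Maps the whole file read-only; nothing is parsed up front.
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // Absolute reads: the record index is turned into a byte offset.
    static int firstField(MappedByteBuffer buf, int index) {
        return buf.getInt(index * RECORD_SIZE);
    }

    static int secondField(MappedByteBuffer buf, int index) {
        return buf.getInt(index * RECORD_SIZE + 4);
    }
}
```

Start-up cost is then roughly the cost of the map call; pages are only read from disk when a record is actually touched.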

One idea is to use enums, but I'm not sure if this suits your implementation, and it also depends on how you are planning to use the data.
public enum Stuff {
HAMBURGER (23012, 22),
KIELER (28375, 123);
private int a;
private int b;
//private instantiation, does not need to be called explicitly.
private Stuff(int a, int b) {
this.a = a;
this.b = b;
}
public int getAvalue() {
return this.a;
}
public int getBvalue() {
return this.b;
}
}
These can then be accessed like:
Stuff someThing = Stuff.HAMBURGER;
int hamburgerA = Stuff.HAMBURGER.getAvalue(); // = 23012
Another idea is using a static initializer to set private fields of a class.

Putting the data into source code would actually not be the fastest solution, not by a long shot. Loading a Java class is quite complex and slow (at least on a platform that does bytecode verification, not sure about Android).
The fastest possible way to do this would be to define your own binary index format. You could then read that as a byte[] (possibly using memory mapping) or even a RandomAccessFile without interpreting it in any way until you start accessing it. The cost of this would be the complexity of the code that accesses it. With fixed-size records, a sorted list of records that's accessed via binary search would still be pretty simple, but anything else is going to get ugly.
Though before doing that, are you sure this isn't premature optimization? The easiest (and probably still quite fast) solution would be to just serialize a Map, List or array. Have you tried this and determined that it is, in fact, too slow?
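The fixed-size-record idea above could be sketched roughly as follows. The layout (an int key followed by an int value, sorted by key) is an assumption, and SortedRecordIndex and find are hypothetical names; the buffer could come from a byte[] or from a memory-mapped file.

```java
import java.nio.ByteBuffer;

class SortedRecordIndex {
    static final int RECORD_SIZE = 8; // int key + int value (assumed layout)

    // Binary search over records sorted by key.
    // Returns the value stored with the key, or -1 if the key is absent.
    static int find(ByteBuffer data, int key) {
        int lo = 0;
        int hi = data.limit() / RECORD_SIZE - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int midKey = data.getInt(mid * RECORD_SIZE); // absolute read, no parsing pass
            if (midKey < key) {
                lo = mid + 1;
            } else if (midKey > key) {
                hi = mid - 1;
            } else {
                return data.getInt(mid * RECORD_SIZE + 4);
            }
        }
        return -1;
    }
}
```

Only the records visited by the search are ever decoded, so lookup cost stays logarithmic in the number of records and nothing has to be loaded into objects first.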

convert the data directly to Java source code, defining 1MByte of constant arrays, classes
Be aware that there are strict constraints on the size of classes and their structures (see the JVM specification).

This is how you define it in Java, if I understood what you are after:
public final Object[][] myList = {
{ 23012, 22, "Hamburger"} ,
{ 28375, 123, "Kieler"}
};

It looks like you plan to write your own lightweight database.
If you can limit the length of the String to a realistic max size the following might work:
Write each entry into a binary file. All entries have the same size, so you waste some bytes with each entry (int a, int b, int stringsize, string, padding).
To read an entry, open the file as a random access file, multiply the index by the length of an entry to get the offset, and seek to that position.
Put the bytes into a ByteBuffer and read the values; the String has to be converted with the String(byte[], int start, int length, Charset) constructor.
If you can't limit the length of a block, dump the strings into an additional file and only store the offsets in your table. This requires an additional file access and makes modifying the data hard.
Some information about random file access in Java can be found here: http://java.sun.com/docs/books/tutorial/essential/io/rafs.html.
For faster access you can cache some of the entries you have read in a HashMap and always remove the oldest from the map when reading a new one.
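Rough sketch (minimal error handling; UTF-8 is an assumed encoding):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class MyDataStore {
    static final int ENTRY_LENGTH = 100; // bytes per fixed-size entry
    static final int CACHE_SIZE = 50000;
    private FileChannel fc;
    private final Map<Integer, Entry> cache = new HashMap<>();
    private final Deque<Integer> queue = new ArrayDeque<>(); // oldest cached index first

    void open(Path p) throws IOException { fc = FileChannel.open(p, StandardOpenOption.READ); }
    void close() throws IOException { fc.close(); fc = null; }

    Entry getEntryAt(int index) throws IOException {
        Entry cached = cache.get(index);
        if (cached != null) return cached;
        long pos = (long) index * ENTRY_LENGTH; // offset of the entry in the file
        ByteBuffer b = ByteBuffer.allocate(ENTRY_LENGTH);
        // A positional read may return fewer bytes than requested, so loop until full or EOF.
        while (b.hasRemaining() && fc.read(b, pos + b.position()) >= 0) { }
        b.flip();
        Entry e = new Entry(b);
        queue.add(index);
        cache.put(index, e);
        if (queue.size() > CACHE_SIZE) cache.remove(queue.remove()); // evict the oldest entry
        return e;
    }
}

class Entry {
    final int a; final int b; final String s;

    Entry(ByteBuffer bb) {
        a = bb.getInt();
        b = bb.getInt();
        int size = bb.getInt(); // length of the string in bytes; the rest is padding
        byte[] bin = new byte[size];
        bb.get(bin);
        s = new String(bin, StandardCharsets.UTF_8);
    }
}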
Missing from the code above:
writing, since you need it for constant data
the total number of entries / size of the file; this only needs an additional integer at the beginning of the file and an additional 4-byte offset for each access operation.

You could also declare a static class (or a set of static classes) exposing the desired values as methods. After all, you want your code to be able to find the value for a given name, and don't want the value to change.
So: location=MyLibOfConstants.returnHamburgerLocation().zipcode
And you can store this stuff in a hashtable with lazy initialization, if you think that calculating it on the fly would be a waste of time.

Isn't a cache what you need?
Like classes, it is loaded into memory, is not really limited to a defined size, and should be as fast as using constants...
Actually it can even search data with some kind of index (for example, using the object's hash code...)
You can, for example, create all your data arrays (e.g. { 23012, 22, "Hamburger"}) and then create 3 hash maps:
map1.put(23012,hamburgerItem);
map2.put(22,hamburgerItem);
map3.put("Hamburger",hamburgerItem);
This way you can search very fast in one of the maps according to the parameter you have...
(but this works only if your keys are unique in the map... this is just an example that could inspire you)
At work we have a very big webapp (80 WebLogic instances) and it's almost what we do: caching everywhere. From a country list in the database, create a cache...
There are many different kinds of caches; you should check the link and choose what you need...
http://en.wikipedia.org/wiki/Cache_algorithms

Java serialization sounds like something that needs to be parsed... not good. Isn't there some kind of standard format for storing data in a stream, that can be read/looked up using a standard API without parsing it?
If you were to create the data in code, then it would all be loaded on first use. This is unlikely to be much more efficient than loading from a separate file - as well as parsing the data in the class file, the JVM has to verify and compile the bytecodes to create each object a million times, rather than just the once if you load it from a loop.
If you want random access and can't use a memory-mapped file, then there is RandomAccessFile, which might work. You need either to load an index on start, or to make the entries a fixed length.
You might want to check whether the HDF5 libraries run on your platform; it may be overkill for such a simple and small dataset though.

I would recommend to use assets for storing such data.

Examples of Java I/O Stream Filters

I am looking for example code that demonstrates how to create a filter that understands binary data. Links to examples are greatly appreciated.

If you mean examples of FilterInputStream/FilterOutputStream, then you need look no further than the JDK. I'll talk about the input stream variant for the sake of argument, but the same applies to output streams.
Take a look at InflaterInputStream, for example. Look at the array read() method, and notice how at some point it calls fill(), which in turn reads from the underlying input stream. Then, around that method, it calls into the Inflater to actually turn the buffer of "raw" bytes that it pulled from the underlying stream into the actual bytes that are written to the caller's array.
One thing to consider is that FilterInputStream is a little bit of a waste of space. So long as you can define your InputStream to take another underlying InputStream, and in all the read method(s) you read from that underlying stream (bearing in mind that in theory you only need to define the byte read() method), then specifically making your class extend FilterInputStream doesn't really buy you very much. For example, here's part of the code for an input stream that limits the number of bytes from the underlying stream that it allows the caller to read (in effect, we can "chop up" a stream into several sub-streams, which is useful when reading from archive files, for example):
class LimitedInputStream extends InputStream {
private InputStream in;
private long bytesLeft;
LimitedInputStream(InputStream in, long maxBytes) {
this.in = in;
this.bytesLeft = maxBytes;
}
@Override
public int read() throws IOException {
if (bytesLeft <= 0)
return -1;
int b = in.read();
if (b >= 0)
bytesLeft--; // only count bytes that were actually read
return b;
}
@Override
public int read(byte b[], int off, int len) throws IOException {
if (bytesLeft <= 0)
return -1;
len = (int) Math.min((long) len, bytesLeft);
int n = in.read(b, off, len);
if (n > 0)
bytesLeft -= n;
return n;
}
// ... missed off boring implementations of skip(), available()..
}
In this case in my application, it really buys me nothing to declare this class as a FilterInputStream-- in effect, it's a choice between wanting to call in.read() or super.read() to get the underlying data...!

For a good example, I'd suggest taking a look at the source code for java.io.DataInputStream. This class shows you one way to decode primitive types and character strings from "binary" data, from which more complex structures can be produced.
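As a small illustration of a filter that "understands" binary data, here is a sketch of a FilterInputStream subclass layered on DataInputStream. The record format (a 4-byte big-endian length followed by that many payload bytes) is an assumption made up for this example, and RecordInputStream and readRecord are hypothetical names.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class RecordInputStream extends FilterInputStream {
    RecordInputStream(InputStream in) {
        // Wrap the source so that primitive decoding is available on the protected field "in".
        super(new DataInputStream(in));
    }

    // Reads one record: a 4-byte big-endian length, then that many payload bytes.
    // Returns null at end of stream (in this sketch a truncated length also counts as the end).
    byte[] readRecord() throws IOException {
        DataInputStream data = (DataInputStream) in;
        int length;
        try {
            length = data.readInt();
        } catch (EOFException e) {
            return null;
        }
        byte[] payload = new byte[length];
        data.readFully(payload); // throws EOFException if the payload is cut short
        return payload;
    }
}
```

Callers can loop on readRecord() until it returns null, while the ordinary read() methods inherited from FilterInputStream still pass bytes through untouched.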
Of course, other applications might choose to use other encodings. For example, the ASN.1 "distinguished encoding rules" are used for Public Key Infrastructure applications. Lucene provides good documentation for its file format, which is designed for compactness.
If you have the Sun JDK, look in its top directory for "src.zip". Most IDEs will show you the source code for the core Java classes if you tell them where to find this file.
