Deflater: is it possible to clone state (or rollback)? - java

Suppose I'm using a Deflater to compress a stream of bytes, and at some intervals I have the option of feeding it with two different byte arrays (two alternative representations of the same info), so that I can choose the most compressible one. Ideally, I would like to be able to clone the state of a "live" deflater, so that I can feed each instance with an array, check the results, and discard the undesirable one.
Alternatively, I'd like to mark the current state (sort of a savepoint) so that, after feeding and compressing with setInput() + deflate() I can rollback/reset to that state to try with different data.
Looking at the API, this seems rather impossible to me, and reimplementing the Deflater doesn't look feasible either (at least not if one wants to take advantage of the internal native implementation). Am I right? Any ideas or experiences?

It does not appear that the Java interface to zlib provides zlib's deflateCopy() operation. It is possible that the inherited clone operation is properly implemented and does a deflateCopy(), but I don't know.
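Absent deflateCopy(), one workaround is to estimate each candidate's compressibility with throwaway Deflater instances. This ignores any history shared with the live stream, so it is only an approximation of what deflateCopy() would make possible; the sketch below is illustrative, and the names are made up:

```java
import java.util.zip.Deflater;

public class CandidateCompare {

    // Estimate the deflated size of one candidate with a throwaway
    // Deflater. This ignores shared history with the live stream, so it
    // only approximates what zlib's deflateCopy() would allow.
    static int deflatedSize(byte[] candidate) {
        Deflater d = new Deflater();
        d.setInput(candidate);
        d.finish();
        byte[] out = new byte[Math.max(64, candidate.length * 2)];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(out);
        }
        d.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] a = "aaaaaaaaaaaaaaaaaaaa".getBytes();
        byte[] b = "q8e1zx7cv3bn5mq2wd9f".getBytes();
        System.out.println(deflatedSize(a) <= deflatedSize(b)
                ? "first candidate compresses better"
                : "second candidate compresses better");
    }
}
```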

Related

Understanding the importance of serialisation in Java

I was just introduced to the concept of serialisation in Java, and while I 'get' the fundamentals, I can't help but feel like it's a bit of overkill. My logic is that I have pointers to the objects and I know how many bytes each takes up in memory, so why can't I just write these bytes to some txt file, along with some extra bytes to indicate the type? With this, can't I just read these bytes back and restore my original object?
The amount of detail my book goes into on serialisation is a good indication that I'm not really understanding its importance, and that there is probably something more subtle going on than just writing out all the bytes exactly as they are. Any help is greatly appreciated! (I have some background in C++, if that helps.)
Why can't I just theoretically write these bytes to some txt file, along with some extra bytes to indicate the type? With this, can't I just read these bytes back and restore my original object?
How could anyone ever read them back in? Say I'm writing code that's supposed to read in your file. Please tell me what the third byte means so that I can decode it properly.
What if the internal representation of the object contains pointers to other objects that might be in different memory locations the next time the program runs? For example, it is quite common to manage identical strings by having internal references to the same internal string object. How will writing that reference to a file be sensible given that the internal string object may not exist in the next run?
To write data to a file, you need to write it out in some specific format that actually contains all the information needed to read it back in. What happens to work internally for this program at this time just won't do, as there's no guarantee that another program at another time can make sense of it.
What you suggest works provided:
the order and type of the fields don't change (note this is not set at compile time);
the byte order doesn't change;
you don't have any references, e.g. no String, enum, List or Map;
the name and package of the type don't change.
At Chronicle, we use a form of serialization which supports this, as it's much faster, but it's very limiting. You have to be very aware of those limitations and have a problem which suits them. We also have a form of serialization which has none of these constraints, but it is slower.
The purpose of Java Serialization is to support arbitrary object graphs even if data is exchanged between systems which might arrange the data differently.
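For reference, a minimal sketch of that round trip using standard Java serialization (the Point class here is illustrative, not from the question):

```java
import java.io.*;

// Serializable handles the field layout, references, and type information
// that a raw memory dump could not.
public class RoundTrip {
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Point(3, 4));   // type + fields are encoded
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Point p = (Point) in.readObject(); // restored on any JVM
            System.out.println(p.x + "," + p.y);
        }
    }
}
```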

Difference between serialization and normal object storage?

Serialization is the process of converting an object stored in memory into a stream of bytes to be transferred over a network, stored in a DB, etc.
But isn't the object already stored in memory as bits and bytes? Why do we need another process to convert the object stored as bytes into another byte representation? Can't we just transmit the object directly over the network?
I think I may be missing something in the way the objects are stored in memory, or the way the object fields are accessed.
Can someone please help me in clearing up this confusion?
Different systems don't store things in memory in the same way. The obvious example is endianness.
Serialization defines a way by which systems using different in-memory representations can communicate.
Another important fact is that the requirements on in-memory and serialized data may be different: when in-memory, fast read (and maybe write) access is desirable; when serialized, small size is desirable. It is easier to create two different formats to fit these two use cases than it is to create one format which is good for both.
An example which springs to mind is LinkedHashMap: this basically stores two versions of the mapping when in memory (one to capture insertion order; one as a traditional hash map). However, you don't need both of these representations to reconstruct the same map from a serialized form: you only need the insertion order of key/value pairs. As such, the serialized form does not store the same data as the in-memory form.
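To make the endianness point concrete, here is a small illustrative snippet (not part of the original answer) showing the same int laid out in two different byte orders:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// The same int value has a different in-memory layout depending on byte
// order, which is one reason a raw memory dump is not portable.
public class Endianness {
    public static void main(String[] args) {
        int value = 0x01020304;
        ByteBuffer big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        big.putInt(value);
        little.putInt(value);
        System.out.println(Arrays.toString(big.array()));    // [1, 2, 3, 4]
        System.out.println(Arrays.toString(little.array())); // [4, 3, 2, 1]
    }
}
```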
Serialization turns the pre-existing bytes from the memory into a universal form.
This is done because different systems allocate memory in different ways. Thus, we cannot ensure that the object can be saved directly from the memory on one machine and then be loaded back in properly into another, different machine.
Maybe you can find more information on this page of the Oracle docs.
Explanation of object serialization from book Thinking In Java.
When you create an object, it exists for as long as you need it, but under no circumstances does it exist when the program terminates. While this makes sense at first, there are situations in which it would be incredibly useful if an object could exist and hold its information even while the program wasn’t running. Then, the next time you started the program, the object would be there and it would have the same information it had the previous time the program was running. Of course, you can get a similar effect by writing the information to a file or to a database, but in the spirit of making everything an object, it would be quite convenient to declare an object to be "persistent," and have all the details taken care of for you.
Java’s object serialization allows you to take any object that implements the Serializable interface and turn it into a sequence of bytes that can later be fully restored to regenerate the original object. This is even true across a network, which means that the serialization mechanism automatically compensates for differences in operating systems. That is, you can create an object on a Windows machine, serialize it, and send it across the network to a Unix machine, where it will be correctly reconstructed. You don’t have to worry about the data representations on the different machines, the byte ordering, or any other details.
Hope this helps you.
Let's go with that mindset: we take the object as-is and send it as a byte array over the network, and another socket/HTTP handler receives that byte array.
Now, two things come to mind:
How many bytes do we send?
What are these bytes? What class do these bytes represent?
You will have to provide this data as well, so for this action alone we need two extra steps.
Now, in C# and Java, as opposed to C++, objects are scattered throughout the heap, and each object holds references to the objects it contains, so now we have another requirement:
Recursively "catch" all the inner objects and pack them into the byte array.
Now we have a packed byte array which represents some object hierarchy, and we need to tell the other side how to unpack this byte array back into the object plus the objects it holds, so:
Send information on how to unpack that byte array into an object hierarchy.
Some things an object holds cannot be sent over the net, such as functions, so now we have yet another step:
Strip away things that cannot be serialized, like functions.
This process goes on and on; for every new solution you will find many problems. Serialization is the process of taking that byte array you are talking about and making it something that can be handled in other environments, like networks and files.
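As a concrete footnote to that last step: Java's built-in serialization handles it with the transient keyword. A minimal illustrative sketch (the class and field names are made up):

```java
import java.io.Serializable;

// Illustrative only: `transient` marks state that must be "stripped away"
// before the object crosses the wire.
public class Session implements Serializable {
    private static final long serialVersionUID = 1L;

    private String userName;                  // plain data: serialized
    private transient Runnable onClose;       // function-like field: skipped
    private transient java.net.Socket socket; // OS resource: skipped
}
```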

Justification for the design of the public interface of ByteArrayOutputStream?

Many Java standard and third-party libraries have methods in their public APIs for writing to or reading from a Stream.
One example is javax.imageio.ImageIO.write(), which takes an OutputStream to write the content of a processed image to.
Another example is the iText PDF processing library, which takes an OutputStream to write the resulting PDF to.
A third example is the AmazonS3 Java API, which takes an InputStream that it will read to create a file in their S3 storage.
The problem arises when you want to combine two of these. For example, I have an image as a BufferedImage, for which I have to use ImageIO.write to push the result into an OutputStream.
But there is no direct way to push it to Amazon S3, as S3 requires an InputStream.
There are a few ways to work this out, but the subject of this question is the usage of ByteArrayOutputStream.
The idea behind ByteArrayOutputStream is to use an intermediate byte array wrapped in Input/Output Streams, so that the party that wants to write to an output stream writes into the array, and the party that wants to read reads from the array.
My wondering is why ByteArrayOutputStream does not allow any access to the byte array without copying it, for example by providing an InputStream that has direct access to it.
The only way to access it is to call toByteArray(), which makes a copy of the internal array (in the standard implementation). This means that, in my image example, I will have three copies of the image in memory:
First is the actual BufferedImage,
second is the internal array of the OutputStream and
third is the copy produced by toByteArray() so I can create the InputStream.
How is this design justified?
Hiding the implementation? Just provide getInputStream(), and the implementation stays hidden.
Multi-threading? ByteArrayOutputStream is not suited for access by multiple threads anyway, so that cannot be the reason.
Moreover, there is a second flavor of ByteArrayOutputStream, provided by Apache's commons-io library (which has a different internal implementation).
But both have exactly the same public interface, which does not provide a way to access the byte array without copying it.
My wondering is why ByteArrayOutputStream does not allow any access to the byte array without copying it, for example by providing an InputStream that has direct access to it.
I can think of four reasons:
The current implementation uses a single byte array, but it could also be implemented as a linked list of byte arrays, deferring the creation of the final array until the application asks for it. If the application could see the actual byte buffer, it would have to be a single array.
Contrary to your understanding, ByteArrayOutputStream is thread-safe and is suitable for use in multi-threaded applications. But if direct access were provided to the byte array, it is difficult to see how that could be synchronized without creating other problems.
The API would need to be more complicated, because the application also needs to know where the current buffer high-water mark is, and whether the byte array is (still) the live byte array. (The ByteArrayOutputStream implementation occasionally needs to reallocate the byte array, and that would leave the application holding a reference to an array that is no longer the live one.)
When you expose the byte array, you allow an application to modify the contents of the array, which could be problematic.
How is this design justified?
The design is tailored for simpler use-cases than yours. The Java SE class libraries don't aim to support all possible use-cases. But they don't prevent you (or a 3rd party library) from providing other stream classes for other use-cases.
The bottom line is that the Sun designers decided NOT to expose the byte array for ByteArrayOutputStream, and (IMO) you are unlikely to change their minds.
(And if you want to try, this is not the right place to do it. Try submitting an RFE via the Bugs database, or develop a patch that adds the functionality and submit it to the OpenJDK team via the relevant channels. You would increase your chances if you included comprehensive unit tests and documentation.)
You might have more success convincing the Apache Commons IO developers of the rightness of your arguments, provided that you can come up with an API design that isn't too dangerous.
Alternatively, there's nothing stopping you from just implementing your own special purpose version that exposes its internal data structures. The code is GPL'ed so you can copy it ... subject to the normal GPL rules about code distribution.
Luckily, the internal array is protected, so you can subclass ByteArrayOutputStream and wrap a ByteArrayInputStream around that array without any copying.
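A sketch of that subclassing trick (the class name here is made up); buf and count are the protected fields of ByteArrayOutputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

// Hands the live internal array straight to a ByteArrayInputStream,
// avoiding the copy that toByteArray() would make.
public class ExposedByteArrayOutputStream extends ByteArrayOutputStream {
    public InputStream toInputStream() {
        // Shares the internal buffer: only read after all writing is done.
        return new ByteArrayInputStream(buf, 0, count);
    }
}
```

With this in place, the image example needs only two copies in memory: ImageIO.write targets the stream, and toInputStream() feeds the result to S3 without a third copy.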
I think that the behavior you are looking for is a Pipe. A ByteArrayOutputStream is just an OutputStream, not an input/output stream. It wasn't designed for what you have in mind.
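For completeness, a minimal illustrative sketch of the pipe approach using java.io's PipedInputStream/PipedOutputStream (the payload is a stand-in for real image bytes):

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// One thread writes (e.g. ImageIO.write would target `out`), the other
// reads (e.g. an S3 upload would consume `in`). The pipe blocks when its
// buffer fills, so producer and consumer must run concurrently.
public class PipeDemo {
    public static void main(String[] args) throws IOException {
        PipedInputStream in = new PipedInputStream();
        PipedOutputStream out = new PipedOutputStream(in);

        new Thread(() -> {
            try {
                out.write("image bytes would go here".getBytes());
                out.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }).start();

        System.out.println(new String(in.readAllBytes()));
    }
}
```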

Lightweight way to persist objects in Java

I'm trying to design a lightweight way to store persistent data in Java. I've already got a very efficient way to serialize POJOs to DataOutputStreams (and back), but I'm trying to think of a good way to ensure that changes to the data in the POJOs get serialized when necessary.
This is for a client-side app where I'm trying to keep the size of the eventual distributable as low as possible, so I'm reluctant to use anything that would pull in heavyweight dependencies. Right now my distributable is almost 10MB, and I don't want it to get much bigger.
I've considered DB4O, but it's too heavy; I need something light. Really, it's probably more a design pattern I need rather than a library.
Any ideas?
The 'lightest weight' persistence option will almost surely be simply marking some classes Serializable and reading/writing them from some fixed location. Are you trying to accomplish something more complex than that? If so, it's time to bundle hsqldb and use an ORM.
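A minimal sketch of that first option, assuming a made-up file location and that your state classes implement Serializable:

```java
import java.io.*;
import java.nio.file.*;

// Read/write the whole application state from a fixed location.
// The path and method names here are illustrative.
public class Store {
    static final Path FILE = Paths.get(System.getProperty("user.home"), ".myapp.dat");

    static void save(Serializable state) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(FILE))) {
            out.writeObject(state);
        }
    }

    static Object load() throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(Files.newInputStream(FILE))) {
            return in.readObject();
        }
    }
}
```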
If your users are tech savvy, or you're just worried about initial payload, there are libraries which can pull dependencies at runtime, such as Grape.
If you already have a compact data output format in bytes (which I assume you have if you can persist efficiently to a DataOutputStream), then an efficient and general technique is to use run-length encoding on the difference between the previous byte-array output and the new byte-array output (see the sketch after the notes below).
Points to note:
If the object has not changed, the difference in byte arrays will be an array of zeros and hence will compress very small.
For the first time you serialize the object, consider the previous output to be all zeros, so that you communicate a complete set of data.
You probably want to be a bit clever when the object has variable-sized substructures.
You can also try zipping the difference rather than RLE; it might be more efficient in some cases where you have a large object graph with a lot of changes.
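Here is a minimal sketch of the diff-then-RLE idea, under the assumption of fixed-size snapshots; the encoding scheme (a zero marker byte followed by a run length) is invented for illustration:

```java
import java.io.ByteArrayOutputStream;

// XOR the old and new snapshots, then run-length encode the zero runs.
// An unchanged object compresses down to almost nothing.
public class DeltaRle {
    static byte[] encode(byte[] prev, byte[] next) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < next.length) {
            byte diff = (byte) (next[i] ^ (i < prev.length ? prev[i] : 0));
            if (diff == 0) {
                int run = 0;
                while (i < next.length && run < 255
                        && (next[i] ^ (i < prev.length ? prev[i] : 0)) == 0) {
                    i++; run++;
                }
                out.write(0);    // marker: a run of unchanged bytes
                out.write(run);  // run length, 1..255
            } else {
                out.write(diff); // literal non-zero diff byte
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = new byte[100];
        byte[] b = a.clone();
        b[42] = 7;
        System.out.println(encode(a, b).length + " bytes for one change");
    }
}
```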

When to use byte array & when byte buffer?

What is the difference between a byte array and a byte buffer?
Also, in what situations should one be preferred over the other?
[My use case is a web application being developed in Java.]
There are actually a number of ways to work with bytes. And I agree that it's not always easy to pick the best one:
the byte[]
the java.nio.ByteBuffer
the java.io.ByteArrayOutputStream (in combination with other streams)
the java.util.BitSet
The byte[] is just a primitive array containing the raw data, so it does not have convenient methods for building or manipulating the content.
A ByteBuffer is more like a builder. It creates a byte[]. Unlike arrays, it has convenient helper methods (e.g. the put(byte) method). It's not that straightforward in terms of usage. (Most tutorials are way too complicated or of poor quality, but this one will get you somewhere. Take it one step further? Then read about the many pitfalls.)
You could be tempted to say that a ByteBuffer does for byte[] what a StringBuilder does for String. But there is a specific shortcoming of the ByteBuffer class: although it may appear that a ByteBuffer resizes automatically while you add elements, a ByteBuffer actually has a fixed capacity. When you instantiate it, you already have to specify the maximum size of the buffer.
That's one of the reasons why I often prefer to use ByteArrayOutputStream: it automatically resizes, just like an ArrayList does (and it has a toByteArray() method). Sometimes it's practical to wrap it in a DataOutputStream. The advantage is that you get some additional convenience calls (e.g. writeShort(int) if you need to write 2 bytes).
BitSet comes in handy when you want to perform bit-level operations. You can get/set individual bits, and it has logical operator methods like xor(). (The toByteArray() method was only introduced in Java 7.)
Of course depending on your needs you can combine all of them to build your byte[].
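To illustrate how the pieces above combine, a small sketch: a growable ByteArrayOutputStream wrapped in a DataOutputStream builds the byte[], and a fixed-capacity ByteBuffer reads it back (the values are arbitrary):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class BytesDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(); // resizes itself
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeShort(2024); // convenience call: writes 2 bytes
            out.writeInt(123456); // writes 4 bytes
        }

        ByteBuffer buf = ByteBuffer.wrap(bytes.toByteArray()); // fixed capacity
        System.out.println(buf.getShort()); // 2024
        System.out.println(buf.getInt());   // 123456
    }
}
```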
ByteBuffer is part of the new I/O package (nio) that was developed for fast throughput of file-based data. Specifically, Apache is a very fast web server (written in C) because it reads bytes from disk and puts them on the network directly, without shuffling them through various buffers. It does this through memory-mapped files, which early versions of Java did not have. With the advent of nio, it became possible to write a web server in Java that is as fast as Apache. When you want very fast file-to-network throughput, you want to use memory-mapped files and ByteBuffer.
Databases typically use memory-mapped files, but this type of usage is seldom efficient in Java. In C/C++, it's possible to load up a large chunk of memory and cast it to the typed data you want. Due to Java's security model, this isn't generally feasible, because you can only convert to certain native types, and these conversions aren't very efficient. ByteBuffer works best when you are just dealing with bytes as plain byte data -- once you need to convert them to objects, the other java io classes typically perform better and are easier to use.
If you're not dealing with memory mapped files, then you don't really need to bother with ByteBuffer -- you'd normally use arrays of byte. If you're trying to build a web server, with the fastest possible throughput of file-based raw byte data, then ByteBuffer (specifically MappedByteBuffer) is your best friend.
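For reference, a minimal sketch of the memory-mapped reading pattern (assuming a file named data.bin exists):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Maps the whole file into memory and reads it byte by byte; bytes come
// straight from the page cache rather than through copied buffers.
public class MappedRead {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = new RandomAccessFile("data.bin", "r").getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            while (buf.hasRemaining()) {
                buf.get();
            }
        }
    }
}
```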
These two articles may help you: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly and http://evanjones.ca/software/java-bytebuffers.html
