I was warned against using Guava immutable collections in objects used in serialized communication because if the version of Guava on one end were updated, there could be serialization version incompatibility issues. Is this a valid concern?
Let's give some perspective.
The most prominent uses of serialization are:
storing data in between runs of your application
sending data between a client and a server
Guava is totally fine for use case 2, if you control which Guava version is being used on both the client and the server. Moreover, while Guava doesn't make guarantees on the consistency of serialization between Guava versions...in reality, the serialized forms don't change very often.
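For concreteness, here is a minimal round-trip sketch using plain JDK serialization of a Guava ImmutableList (ImmutableList is Serializable; the class name is just for illustration):

    import com.google.common.collect.ImmutableList;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    public class GuavaRoundTrip {
        public static void main(String[] args) throws Exception {
            ImmutableList<String> original = ImmutableList.of("a", "b", "c");

            // Serialize with plain Java serialization.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }

            // Deserialize. This round trip is only safe when both ends run
            // the same Guava version, since serialized forms may change
            // between releases.
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                ImmutableList<?> copy = (ImmutableList<?>) in.readObject();
                System.out.println(copy.equals(original)); // true
            }
        }
    }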
On the other hand, let me give some perspective as to why Guava doesn't guarantee the consistency of serialized forms. I changed the serialized form of ImmutableMultiset between Guava releases 9 and 10, because I needed to refactor things so I could add ImmutableSortedMultiset to the immutable collections. You can see the change yourself here. Trying to do the same refactoring while keeping the serialized forms consistent would almost certainly have required additional awkward hacks, which are...pretty strongly against the Guava team's philosophy. (It might have been doable by a more expert programmer than myself, but I still claim it wouldn't have been trivial.) Guaranteeing serialization compatibility for the long term would have required staggering amounts of effort. In the above-linked mailing list thread, Kevin stated:
Trying to provide for cross-version compatibility made things a hundred times more difficult and we gave up on it before Guava even started.
and Jared:
The underlying problem is still there: ensuring that the serialized forms are compatible between all Guava versions. That was a goal of mine when working towards Google Collections 1.0, but I abandoned that goal after realizing its difficulty. Implementing and testing cross-version compatibility wasn't worth the effort.
Finally, I'll point out that Guava gets used internally at Google all over the place and manages pretty well.
Yes, that is a valid concern.
From the Guava project homepage (http://code.google.com/p/guava-libraries/):
Serialized forms of ALL objects are subject to change. Do not persist these and assume they can be read by a future version of the library.
If you're using Java native serialization, Guava is not a good choice.
I have stumbled upon the Hashing class from the com.google.common.hash package.
IntelliJ IDEA shows the following warning when I use functions of that class:
The class itself is annotated with the @Beta annotation:
The description of the @Beta annotation says:
Signifies that a public API (public class, method or field) is subject to incompatible changes, or even removal, in a future release. An API bearing this annotation is exempt from any compatibility guarantees made by its containing library. Note that the presence of this annotation implies nothing about the quality or performance of the API ...
So the implementation of the API is fine and stable?
... in question, only the fact that it is not "API-frozen."
It is generally safe for applications to depend on beta APIs, at the cost of some extra work ...
Which kind of extra work?
... during upgrades. However it is generally inadvisable for libraries (which get included on users' CLASSPATHs, outside the library developers' control) to do so.
The question is whether it is safe / stable to use mentioned class and its functionality? What is the tradeoff while using a beta API?
The implementation of the API is fine; you can rely on that, since it is an extensively used library from Google.
As for stability - you can do a little research here and compare a couple of versions of this API a year apart. Let's say 23.0 versus 27.0-jre:
https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html
https://google.github.io/guava/releases/27.0-jre/api/docs/com/google/common/hash/Hashing.html
If you do a diff, the APIs from different years (2017 versus 2018) are exactly the same.
Therefore, I would interpret the @Beta here as a heads-up that "be aware, this API may change in the future", but in practice the API is stable, reliable and heavily used.
Maybe at some point the Google developers will choose to remove the @Beta annotation. Or maybe they intend to, or have forgotten (speculative...).
The "extra work" referred to means that, if you build an application using this API, you may need to refactor your application slightly (imagine that a method signature changes, or a method becomes deprecated and replaced) if you need to upgrade to the newest version of this API.
The degree of work there depends on how heavily and how often you use the API, and how deep the dependency on that API is (transitively, through other libraries, for example - those would also need to be rebuilt).
In summary, in this case - "don't worry, move along" :)
So the implementation of the API is fine and stable?
No way to know from this annotation.
To answer that you need to know how widely used it is, and for how long.
Which kind of extra work?
The kind of extra work that you have to do when a method that required only 1 parameter and returned String now requires 3 parameters, and returns a List<String>.
i.e.: Code that uses this API might need to change due to API change.
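To make that concrete, here is a hypothetical sketch; the Formatter interface and both method signatures are invented for illustration and are not a real Guava API:

    import java.util.Collections;
    import java.util.List;
    import java.util.Locale;

    // Hypothetical beta API. In "version N" the method was:
    //     String format(String input);
    // After an incompatible change in "version N+1" it became:
    interface Formatter {
        List<String> format(String input, Locale locale, boolean strict);
    }

    public class BetaApiChange {
        public static void main(String[] args) {
            Formatter formatter = (input, locale, strict) ->
                    Collections.singletonList(input.toLowerCase(locale));
            // A call site written against version N,
            //     String result = formatter.format("Value");
            // no longer compiles and must be rewritten like this:
            List<String> results = formatter.format("Value", Locale.ROOT, false);
            System.out.println(results);
        }
    }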
So the implementation of the API is fine and stable?
The quoted text says that the API is "subject to incompatible changes". That means that it (the API) is not stable.
Note also that the quoted text explicitly states that the annotation is saying nothing about whether or not the API's implementation works.
But also note that this is not a yes / no issue. It is actually a yes / no / maybe issue. Some questions don't have answers.
Which kind of extra work?
Rewriting some of your code ... if the API changes.
The question is whether it is safe / stable to use mentioned class and its functionality?
This requires the ability to predict the future. It is unanswerable. (Unless you ask the people who put that annotation on that API. They may be able to make a reliable prediction ...)
It is also unanswerable because it depends on what you mean by safe, and what context you intend to use the Hashing class in.
What is the tradeoff while using a beta API?
The tradeoff is self evident:
On the plus side, you get to use new API functionality that may be beneficial to your application in the long run. (And if there is no evidence that it may be beneficial, this whole discussion is moot!)
On the minus side, you may have to rewrite some of your code if the authors modify the API ... as they say they might do.
I have the following scenario: I am using a very big external library in my Eclipse RCP application for a specific purpose.
At this point in time I am not sure whether I may have to replace this library with another one in the future (because it does not provide the necessary functionality, or something like that). Also, I have users using this library from day one, so I would like to encapsulate the library, giving me at least a chance of changing it in the future without the users noticing or having to change anything in their code.
Is there a simple way to encapsulate a whole library in some automated fashion?
Unless the part of the library's interface you are actually using is completely trivial, or standardized the way JSF or JAXB are (in which case you don't need encapsulation), this is a completely wasted effort.
I can guarantee that if you have to switch to a different library, the encapsulation would prove worthless, because the other library has different underlying concepts and usage patterns that cannot be made to fit the existing ones.
I don't think that's possible, since the syntax and semantics of the library might be unique to some extent.
Sure, you could create proxies for all the classes and provide those, but that might require quite some work (writing a framework that scans the library), and it wouldn't guarantee that exchanging the library would be easy.
Imagine the replacement would provide different methods and even use different semantics (to some extent). What if methods/fields etc. were missing in the replacement?
The best way to handle that would be to write an explicit wrapper and make the users use only that wrapper. This way you could restrict the API to the core concepts that are really needed. This still might not provide a good enough encapsulation however, based on what the library actually does.
Example:
For 3D programming you could use OpenGL or Direct3D. Both have somewhat different APIs but use the same core concepts. Thus you could create a wrapper for them that provides a unified API. That wrapper might then have to convert some data (like making column-oriented matrices row-oriented and vice versa), but since the core concepts are the same, that should be doable.
However, you'd need to stick to the core concepts and couldn't use additional features. For example, Direct3D also provides a more high-level API (Direct3DX) which isn't provided by OpenGL.
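As a rough sketch of what such an explicit wrapper might look like (all names here are hypothetical; Renderer stands in for whichever slice of the library you actually use):

    import java.util.Arrays;

    public class WrapperSketch {
        // User code depends only on this interface, never on the wrapped
        // library's own types; that is the whole point of the wrapper.
        interface Renderer {
            void drawTriangle(float[] vertices);
        }

        // One adapter per candidate library. Swapping libraries later means
        // writing a new adapter, not touching user code -- provided the
        // replacement supports the same core concepts.
        static final class OpenGlRenderer implements Renderer {
            @Override
            public void drawTriangle(float[] vertices) {
                // delegate to the real OpenGL binding here
                System.out.println("drawing " + Arrays.toString(vertices));
            }
        }

        public static void main(String[] args) {
            Renderer renderer = new OpenGlRenderer(); // the only line naming a backend
            renderer.drawTriangle(new float[] {0, 0, 1, 0, 0, 1});
        }
    }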
What do you think is the best way to use Guava? On the website, the guys say that the interfaces are subject to change until they release 1.0. Taking this into account, the code you write shouldn't depend directly on those interfaces, so are you wrapping all the Guava code you call in some kind of layer or facade in your projects, so that if those interfaces change, you at least have those changes centralized in one place?
Which is the best way to go? I am really interested in starting to use it, but I have that question hitting my mind hahah :)
I'm not sure where you're getting that about the interfaces being subject to change until version 1.0. That was true with Google Collections, Guava's predecessor, but that has had its 1.0 release and is now a part of Guava. Additionally, nothing that was part of Google Collections will be changed in a way that could break code.
Guava itself doesn't even use a release system with a concept of "1.0". It just does releases, labeled "r05", "r06" and so on. All APIs in Guava are effectively frozen unless they are marked with the @Beta annotation. If @Beta is on a class or interface, anything in that class is subject to change. If a class isn't annotated with it, but some methods in the class are, those specific methods are subject to change.
Note that even with the @Beta APIs, the functionality they provide will very likely not be removed completely... at most they'll probably just change how that functionality is provided. Additionally, I believe they're deprecating the original form of any @Beta API they change for 1 release before removing it completely, giving you time to see that it's changed and update to the new form of that API. @Beta also doesn't mean that a class or method isn't well-tested or suitable for production use.
Finally, this shouldn't be much of an issue if you're working on an application that uses Guava. It should be easy enough to update to a new version whenever, just making changes here and there if any @Beta APIs you were using changed. It's people writing libraries that use Guava who really need to avoid using @Beta APIs, as using one could create a situation where you're unable to switch to a newer version of Guava in your application OR use another library that uses a newer version because it would break code in the older library that depends on a changed/removed beta API.
It looks like GAE has chosen a subset of JDK 1.6 classes, as per:
Google App Engine JDK white list
which is very unfortunate, as one gets class linkage errors all over the place with most common Java libraries that deal with data binding, reflection, class loading and annotations. Although some omissions may be for deprecated or legacy things, there are others that are not. My specific concern is with streaming pull parsers (javax.xml.stream.*), which were only added to JDK 1.6 after a long delay (the API was finalized at about the same time as JDK 1.4). Omitting this makes it harder to do scalable, high-performance XML processing.
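(For reference, this is the pull-parsing style being omitted -- a minimal javax.xml.stream sketch that runs on a standard JDK 6, just not under the GAE whitelist:)

    import java.io.StringReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxDemo {
        public static void main(String[] args) throws Exception {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader("<root><item>hi</item></root>"));
            // The caller pulls events one at a time; no in-memory tree is
            // built, which is what makes this scalable for large documents.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println(reader.getLocalName());
                }
            }
            reader.close();
        }
    }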
The problem, as I understand it, is that not only are classes missing, but they cannot even be added, because of security constraints.
So: this is an open-ended philosophical question that probably only the GAE devs could answer for sure, but... why are some APIs dropped from standard JDK 1.6, seemingly arbitrarily?
UPDATE:
Quick note: thanks for the answers. For what it's worth, I really do not see how security would have anything to do with not including javax.xml.stream.
Security aspects are relevant for a great many other things (and I don't need threads, for example, and can see why they are out), so it's an understandable boilerplate answer; it's just not applicable to this one.
The StAX API is just a set of interfaces and abstract classes, for crying out loud. But more importantly, it has exactly the same ramifications as including the SAX, DOM and JAXP interfaces -- which are included already!
But it looks like this issue has been brought to attention of google devs:
discussion on whitelisting Stax API
so here's hoping that this and similar issues can be resolved swiftly.
GAE is run in a hosted environment with untrusted (and potentially malicious) clients, who often are given access for free.
In that type of environment, security is a very high concern, and APIs which have filesystem access get very heavy scrutiny. I think that's why they've chosen to start pretty conservatively in terms of what they allow.
It wouldn't surprise me at all if more classes find their way into the whitelist as security issues are addressed (and based on demand), though.
But I wouldn't even expect to get threading tools available, for example.
It's extremely doubtful that these things were dropped arbitrarily. GAE runs in an extremely security-sensitive environment, and the chances are good that an internal audit of the class libraries found some risks that Google was not willing to take.
As for your high-performance streaming XML parsers, you could try to find an appropriate library (jar file). Unless it relies on threads or file access (or black-listed API), it should work just as well as the one in the JDK.
There are a lot of (rather complex) libraries that work on GAE.
I'm currently working on a project which needs to persist any kind of object (over whose implementation we don't have any control) so these objects can be recovered afterwards.
We can't implement an ORM because we can't restrict the users of our library at development time.
Our first alternative was to serialize them with the default Java serialization, but we had a lot of trouble recovering the objects when the users started to pass different versions of the same object (attributes changed types, names, ...).
We have tried the XMLEncoder class (which transforms an object into XML), but we have found that there is a lack of functionality (it doesn't support enums, for example).
Finally, we also tried JAXB, but this requires our users to annotate their classes.
Any good alternative?
It's 2011, and in a commercial grade REST web services project we use the following serializers to offer clients a variety of media types:
XStream (for XML but not for JSON)
Jackson (for JSON)
Kryo (a fast, compact binary serialization format)
Smile (a binary format that comes with Jackson 1.6 and later).
Java Object Serialization.
We experimented with other serializers recently:
SimpleXML seems solid, runs at 2x the speed of XStream, but requires a bit too much configuration for our situation.
YamlBeans had a couple of bugs.
SnakeYAML had a minor bug relating to dates.
Jackson JSON, Kryo, and Jackson Smile were all significantly faster than good old Java Object Serialization, by about 3x to 4.5x. XStream is on the slow side. But these are all solid choices at this point. We'll keep monitoring the other three.
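For reference, a JSON round trip with Jackson takes only a few lines. This sketch uses the modern com.fasterxml package names (the 2011-era releases discussed above lived under org.codehaus.jackson), and the Point bean is invented for illustration:

    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JacksonRoundTrip {
        // A plain bean: Jackson binds against public fields or
        // getters/setters and needs a no-arg constructor.
        public static class Point {
            public int x;
            public int y;
        }

        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            Point p = new Point();
            p.x = 1;
            p.y = 2;
            String json = mapper.writeValueAsString(p); // {"x":1,"y":2}
            Point back = mapper.readValue(json, Point.class);
            System.out.println(back.x + "," + back.y);  // 1,2
        }
    }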
http://x-stream.github.io/ is nice - please take a look at it! Very convenient.
of which implementation we don't have any control
The solution is don't do this. If you don't have control of a type's implementation you shouldn't be serialising it. End of story. Java serialisation provides serialVersionUID specifically for managing serialisation incompatibilities between different versions of a type. If you don't control the implementation you cannot be sure that IDs are being changed correctly when a developer changes a class.
Take a simple example of a 'Point'. It can be represented by either a cartesian or a polar coordinate system. It would be cost prohibitive for you to build a system that could cope dynamically with these sorts of corrections - it really has to be the developer of the class who designs the serialisation.
In short it's your design that's wrong - not the technology.
The easiest thing for you to do is still to use serialization, IMO, but put more thought into the serialized form of the classes (which you really ought to do anyway). For instance:
Explicitly define the serialVersionUID.
Define your own serialized form where appropriate.
The serialized form is part of the class's API, and careful thought should be put into its design.
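A minimal sketch covering both points (the Customer class is invented for illustration):

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    public class Customer implements Serializable {
        // Pinning the UID explicitly means compatible changes to the class
        // won't accidentally break deserialization of previously stored data.
        private static final long serialVersionUID = 1L;

        private String name;
        private transient int cachedHash; // excluded from the serialized form

        // Custom serialization hooks: the place to evolve the format
        // deliberately rather than relying on the default field layout.
        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            cachedHash = 0; // re-derive transient state after reading
        }
    }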
I won't go into a lot of detail, since pretty much everything I have said comes from Effective Java. Instead, I'll refer you to it, specifically the chapters about Serialization. It warns you about all the problems you're running into, and provides proper solutions to the problem:
http://www.amazon.com/Effective-Java-2nd-Joshua-Bloch/dp/0321356683
With that said, if you're still considering a non-serialization approach, here are a couple:
XML marshalling
As many have pointed out, this is an option, but I think you'll still run into the same problems with backward compatibility. However, with XML marshalling, you'll hopefully catch these right away, since some frameworks may do some checks for you during initialization.
Conversion to/from YAML
This is an idea I have been toying with; I really like the YAML format (at least as a custom toString() format). But really, the only difference for you is that you'd be marshalling to YAML instead of XML. The only benefit is that YAML is slightly more human-readable than XML. The same restrictions apply.
Google came up with a binary protocol, protocol buffers (http://code.google.com/apis/protocolbuffers/), which is faster and has a smaller payload compared to XML -- which others have suggested as an alternative.
One of the advantages of protocol buffers is that they can exchange info with C, C++, Python and Java.
Try serializing to JSON with Gson, for example.
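A minimal sketch of what that looks like (the Point class is invented for illustration; note that Gson requires no annotations and handles enums out of the box, unlike XMLEncoder):

    import com.google.gson.Gson;

    public class GsonDemo {
        enum Color { RED, GREEN }

        static class Point {
            int x = 1;
            int y = 2;
            Color color = Color.RED;
        }

        public static void main(String[] args) {
            Gson gson = new Gson();
            String json = gson.toJson(new Point()); // {"x":1,"y":2,"color":"RED"}
            Point back = gson.fromJson(json, Point.class);
            System.out.println(json + " -> " + back.x + "," + back.y + "," + back.color);
        }
    }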
Also a very fast JDK serialization drop-in replacement:
http://ruedigermoeller.github.io/fast-serialization/
If serialization speed is important to you then there is a comprehensive benchmark of JVM serializers here:
https://github.com/eishay/jvm-serializers/wiki
Personally, I use Fame a lot, since it features interoperability with Smalltalk (both VW and Squeak) and Python. (Disclaimer, I am the main contributor of the Fame project.)
Possibly Castor?
Betwixt is a good library for serializing objects - but it's not going to be an automatic kind of thing. If the number of objects you have to serialize is relatively fixed, this may be a good option for you, but if your 'customer' is going to be throwing new classes at you all the time, it may be more effort than it's worth (Definitely easier than XMLEncoder for all the special cases, though).
Another approach is to require your customer to provide the appropriate .betwixt files for any objects they throw at you (that effectively offloads the responsibility to them).
Long and short - serialization is hard - there is no completely brain-dead approach to it. Java serialization is as close to a brain-dead solution as I've ever seen, but as you've found, incorrect use of the serialVersionUID value can break it. Java serialization also requires use of the marker 'Serializable' interface, so if you can't control your source, you are kind of out of luck on that one.
If the requirement is truly as arduous as you describe, you may have to resort to some sort of BCE (Byte code modification) on the objects / aspects / whatever. This is getting way outside the realm of a small development project, and into the realm of Hibernate, Casper or an ORM....
SBE is an established library for fast, ByteBuffer-based serialization and is capable of versioning. However, it is a bit hard to use, as you need to write lengthy wrapper classes over it.
In light of its shortcomings, I recently made a Java-only serialization library inspired by SBE and the FIX protocol (a common financial-market protocol for exchanging trade/quote messages) that tries to keep the advantages of both while overcoming their weaknesses. You can take a look at https://github.com/iceberglet/anymsg
Another idea: use a cache. Caches provide much better control, scalability and robustness to the application. You still need to serialize, though, but the management becomes much easier within a caching service framework. A cache can be persisted in memory, disk, database or array - or all of those options - with one acting as overflow, standby or fail-over for the others. Commons JCS and Ehcache are two Java implementations; the latter is an enterprise solution, free up to 32 GB of storage (disclaimer: I don't work for Ehcache ;-)).
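A rough sketch of the idea, assuming the Ehcache 2.x API and an ehcache.xml with a defaultCache on the classpath:

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    public class CacheDemo {
        public static void main(String[] args) {
            CacheManager manager = CacheManager.getInstance();
            manager.addCache("objects"); // uses the configured defaultCache settings
            Cache cache = manager.getCache("objects");

            // Values that overflow to disk must still be Serializable;
            // the cache handles eviction, overflow and persistence for you.
            cache.put(new Element("user:42", "any serializable value"));
            Element hit = cache.get("user:42");
            System.out.println(hit == null ? null : hit.getObjectValue());
            manager.shutdown();
        }
    }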