I need to re-write some C++ code in Java that I wrote years ago.
In C++, I used a std::vector<unsigned char> to hold a buffer of bytes read from a socket connection.
With this buffer I do all kinds of operations:
Keep an iterator to a buffer element and advance it as required.
Search it for certain tokens.
Get a range of byte elements (sub-buffer) from buffer using indexes / iterators.
Based on these requirements, I could use ArrayList<Byte> and have a custom iterator-like class to keep track of current element and work using indexes.
In what ways would using a List<Byte> be inefficient?
Somewhere else I have seen ByteBuffer being recommended but my requirements don't call for it.
Perhaps that is because I need index-based access, searching, and so on, and my buffer won't change or be modified once created.
All I need is to read the buffer and perform above operations.
Should I better just have a wrapper around byte[] with my own iterator/pointer class?
This really depends on your requirements. The huge problem with List<Byte> is the fact that you are using Byte objects rather than primitive byte values!
Not only does that affect memory consumption, it also means that there might be a lot of boxing/unboxing going on.
That may cause a lot of objects to be generated† - leading to constant churn for the garbage collector.
So if you intend to do intensive computations, or you are working with massive amounts of data, the ByteBuffer class has various performance advantages here.
† from Java Language Specification §5.1.7 (Boxing Conversion)
The rule above is a pragmatic compromise, requiring that certain common values always be boxed into indistinguishable objects. The implementation may cache these, lazily or eagerly.
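For the operations the question lists (advance a cursor, search for tokens, take a sub-range without copying), a read-only ByteBuffer covers all three. Here is a minimal sketch; note that the token-search helper indexOf is my own, not a ByteBuffer method:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferDemo {
    // Find the first index of `token` in `buf` at or after `from`; -1 if absent.
    static int indexOf(ByteBuffer buf, byte[] token, int from) {
        outer:
        for (int i = from; i <= buf.limit() - token.length; i++) {
            for (int j = 0; j < token.length; j++) {
                if (buf.get(i + j) != token[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] raw = "HEADER:payload".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.wrap(raw).asReadOnlyBuffer();

        // Search for a token.
        int colon = indexOf(buf, new byte[] { ':' }, 0);
        System.out.println(colon);

        // Sub-buffer (a view, no copy): everything after the colon.
        ByteBuffer tail = buf.duplicate();
        tail.position(colon + 1);
        ByteBuffer sub = tail.slice();
        byte[] out = new byte[sub.remaining()];
        sub.get(out);
        System.out.println(new String(out, StandardCharsets.US_ASCII));
    }
}
```

The position/limit pair plays the role of the C++ iterator, and slice() gives a sub-buffer over the same backing array without copying.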
List is not a class, it's an interface, so you need an actual class behind it to provide the implementation. You can declare a List<Byte> variable in your code to hide the implementation details, but if you sometimes need it to behave like an array and at other times like a List (with iterators and such), then don't abandon the ArrayList<Byte> type; keep using it so the index-based and list behaviours stay available together. If you switch the declared type to List<Byte>, you will end up doing unnecessary casts to reach the ArrayList-specific features, and those casts cost CPU (the runtime checks that the proper class is being cast) and blur the code.

You also have alternatives in ByteBuffer or CharBuffer, as suggested by other answers. These classes are abstract, but you don't subclass them yourself; you obtain instances through factory methods such as ByteBuffer.wrap(byte[]) or ByteBuffer.allocate(int).
Related
We use a HashMap<Integer, SomeType>() with more than a million entries. I consider that large.
But integers are their own hash code. Couldn't we save memory with a, say, IntegerHashMap<Integer, SomeType>() that uses a special Map.Entry, using int directly instead of a pointer to an Integer object? In our case, that would save 1000000x the memory required for an Integer object.
Any faults in my line of thought? Too special to be of general interest? (at least, there is an EnumHashMap)
add1. The first generic parameter of IntegerHashMap is used to make it closely similar to the other Map implementations. It could be dropped, of course.
add2. The same should be possible with other maps and collections. For example ToIntegerHashMap<KeyType, Integer>, IntegerHashSet<Integer>, etc.
What you're looking for is a "primitive collections" library. They are usually much better in memory usage and performance. One of the oldest and most popular libraries was Trove; however, it is a bit outdated now. The main actively maintained libraries now are:
Eclipse Collections (originally Goldman Sachs' GS Collections)
fastutil
Koloboke
See Benchmarks Here
Some words of caution:
"integers are their own hash code" I'd be very careful with this statement. Depending on the integers you have, the distribution of keys may be anything from optimal to terrible. Ideally, I'd design the map so that you can pass in a custom IntFunction as hashing strategy. You can still default this to (i) -> i if you want, but you probably want to introduce a modulo factor, or your internal array will be enormous. You may even want to use an IntBinaryOperator, where one param is the int and the other is the number of buckets.
I would drop the first generic param. You probably don't want to implement Map<Integer, SomeType>, because then you will have to box / unbox in all your methods, and you will lose all your optimizations (except space). Trying to make a primitive collection compatible with an object collection will make the whole exercise pointless.
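To make the space saving concrete, here is a minimal sketch of what such a primitive-keyed map looks like inside: open addressing over an int[] key array, so there are no Integer boxes and no Map.Entry objects. The class name IntObjectMap and every detail are hypothetical, not taken from any of the libraries above, and resizing is omitted for brevity:

```java
// Minimal sketch: open-addressed map from int keys to object values.
// A real implementation must grow the table at around 2/3 load.
public class IntObjectMap<V> {
    private final int[] keys;
    private final Object[] values;
    private final boolean[] used;
    private int size;

    public IntObjectMap(int expectedSize) {
        // Round up to a power of two so we can mask instead of modulo.
        int cap = Integer.highestOneBit(Math.max(expectedSize, 8) * 4);
        keys = new int[cap];
        values = new Object[cap];
        used = new boolean[cap];
    }

    // Spread the bits so clustered keys don't all land in one run of
    // slots; this is the "don't just use the int itself" caution above.
    private int slot(int key) {
        int h = key * 0x9E3779B9;            // Fibonacci hashing constant
        return (h ^ (h >>> 16)) & (keys.length - 1);
    }

    public void put(int key, V value) {
        int i = slot(key);
        while (used[i] && keys[i] != key) i = (i + 1) & (keys.length - 1);
        if (!used[i]) { used[i] = true; keys[i] = key; size++; }
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    public V get(int key) {
        int i = slot(key);
        while (used[i]) {
            if (keys[i] == key) return (V) values[i];
            i = (i + 1) & (keys.length - 1);
        }
        return null;
    }

    public int size() { return size; }
}
```

Per entry this stores 4 bytes of key plus one reference, against a full Map.Entry object plus a boxed Integer in HashMap, which is where the memory saving at a million entries comes from.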
I have an array of ByteBuffers (which actually represent integers). I want to separate the unique and non-unique ByteBuffers (i.e. integers) in the array. Thus I am using a HashSet of this type:
HashSet<ByteBuffer> columnsSet = new HashSet<ByteBuffer>();
Just wanted to know if a HashSet is a good way to do so? And do I pay more costs doing this for a ByteBuffer than I would for an Integer?
(Actually I am reading serialized data from a DB which needs to be written back after this operation, so I want to avoid serializing and deserializing between ByteBuffer and Integer and back!)
Your thoughts upon this appreciated.
Creating a ByteBuffer is far more expensive than reading/writing from a reused ByteBuffer.
The most efficient way to store integers is to use int type. If you want a Set of these you can use TIntHashSet which uses int primitives. You can do multiple read/deserialize/store and reverse with O(1) pre-allocated objects.
First of all, it will work. The overhead of equals() on two ByteBuffers will definitely be higher, but perhaps not by enough to cancel out the benefit of not having to deserialize (though I'm not entirely sure that would be such a big problem).
I'm pretty sure the overall performance will be comparable in practice (sorting is O(n log n) against the hash set's expected O(n)), but a more memory-efficient solution is to sort your array, then step through it linearly and test successive elements for equality.
An example, suppose your buffers contain the following:
1 2 5 1
Sort it:
1 1 2 5
Once you start iterating, you get ar[0].equals(ar[1]) and you know these are duplicates. Just keep going like that till n-1.
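The sort-and-scan idea above can be sketched directly, since ByteBuffer implements Comparable and its equals() compares the remaining bytes:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DedupDemo {
    // Returns each value that occurs more than once: sort a copy, then a
    // linear scan compares neighbours, exactly as described above.
    static List<ByteBuffer> duplicates(ByteBuffer[] buffers) {
        ByteBuffer[] sorted = buffers.clone();
        Arrays.sort(sorted);                    // ByteBuffer is Comparable
        List<ByteBuffer> dups = new ArrayList<>();
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i].equals(sorted[i - 1])
                    // avoid reporting a triple (or longer run) twice
                    && (dups.isEmpty() || !dups.get(dups.size() - 1).equals(sorted[i]))) {
                dups.add(sorted[i]);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // The 1 2 5 1 example from above:
        ByteBuffer[] ar = {
            ByteBuffer.wrap(new byte[] { 1 }), ByteBuffer.wrap(new byte[] { 2 }),
            ByteBuffer.wrap(new byte[] { 5 }), ByteBuffer.wrap(new byte[] { 1 }),
        };
        System.out.println(duplicates(ar).size());
    }
}
```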
Collections normally operate on the equals() and hashCode() methods, so performance implications would come through the implementation of the objects stored in the collection.
Looking at ByteBuffer and Integer one can see that the implementation of those methods in Integer are simpler (just one int comparison for equals() and return value; for hashCode()). Thus you could say the Set<ByteBuffer> has higher cost than a Set<Integer>.
However, I can't tell you right now if this cost is higher than the serialization and deserialization cost.
In fact, I'd just go for the more readable code unless you really have a performance problem. In that case I'd just try both methods and take the faster one.
Some time back our architect made this claim to me, and I couldn't talk to him afterwards to get the details, but I don't understand how arrays are more serializable or better performing than ArrayLists.
Update: This is in web services code, if that matters, and he might have meant performance rather than serializability.
Update: There is no problem with XML serialization for ArrayLists.
<sample-array-list>reddy1</sample-array-list>
<sample-array-list>reddy2</sample-array-list>
<sample-array-list>reddy3</sample-array-list>
Could there be a problem in a distributed application?
There's no such thing as "more serializable". Either a class is serializable, or it is not. Both arrays and ArrayList are serializable.
As for performance, that's an entirely different topic. Arrays, especially of primitives, use quite a bit less memory than ArrayLists, but the serialization format is actually equally compact for both.
In the end, the only person who can really explain this vague and misleading statement is the person who made it. I suggest you ask your architect what exactly he meant.
I'm assuming that you are talking about Java object serialization.
It turns out that an array (of objects) and ArrayList have similar but not identical contents. In the array case, the serialization will consist of the object header, the array length and its elements. In the ArrayList case, the serialization consists of the list size, the array length and the first 'size' elements of the array. So one extra 32 bit int is serialized. There may also be differences in the respective object headers.
So, yes, there is a small (probably 4 byte) difference in the size of the serial representations. And it is possible that an array can be serialized / deserialized slightly more quickly. But the differences are likely to be down in the noise, and not worth worrying about ... unless profiling tells you this is a bottleneck.
EDIT
Based on @Tom Hawtin's comment, the object header difference is significant, especially if the serialization only contains a small number of ArrayList instances.
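If you want to see the difference for yourself, serialize both forms to a byte array and compare sizes. The exact numbers vary by JVM version, but the gap stays small, which is the point of the answer above:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SerializedSizeDemo {
    // Size in bytes of the Java object serialization of `o`.
    static int serializedSize(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] array = { "reddy1", "reddy2", "reddy3" };
        List<String> list = new ArrayList<>(Arrays.asList(array));
        System.out.println("array: " + serializedSize(array) + " bytes");
        System.out.println("list:  " + serializedSize(list) + " bytes");
    }
}
```

The ArrayList form is a few dozen bytes larger at most (its class descriptor, the extra size field, and the block-data framing), and that overhead is paid once per list, not per element.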
Maybe he was referring to the XML serialization used in web services?
Having used those a few years ago, I remember that a Webservice returning a List object was difficult to connect to (at least I could not figure it out, probably because of the inner structure of ArrayLists and LinkedLists), although this was trivially done when a native array was returned.
To address Reddy's comment,
But in any case (array or ArrayList)
will get converted to XML, right?
Yes they will, but the XML-serialization basically translated in XML all the data contained in the serialized object.
For an array, that is a series of values.
For instance, if you declare and serialize
int[] array = {42, 83};
You will probably get an XML result looking like :
<array>42</array>
<array>83</array>
For an ArrayList, that is :
an array (obviously), which may have a size bigger than the actual number of elements
several other members such as integer indexes (firstIndex and lastIndex), counts, etc
(you can find all that stuff in the source for ArrayList.java)
So all of those will get translated to XML, which makes it more difficult for the Webservice client to read the actual values : it has to read the index values, find the actual array, and read the values contained between the two indexes.
The serialization of :
ArrayList<Integer> list = new ArrayList<Integer>();
list.add(42);
list.add(83);
might end up looking like :
<firstIndex>0</firstIndex>
<lastIndex>2</lastIndex>
<E>42</E>
<E>83</E>
<E>0</E>
<E>0</E>
<E>0</E>
<E>0</E>
<E>0</E>
<E>0</E>
<E>0</E>
<E>0</E>
So basically, when using XML-serialization in Webservices, you'd better use arrays (such as int[]) than collections (such as ArrayList<Integer>). For that you might find useful to convert Collections to arrays using Collection#toArray().
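A quick sketch of the conversion in both directions (the helper name toPrimitive is my own):

```java
import java.util.Arrays;
import java.util.List;

public class ToArrayDemo {
    // Unbox a List<Integer> into the int[] the web service should expose.
    static int[] toPrimitive(List<Integer> list) {
        return list.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(42, 83);

        // Sending side: expose a plain array instead of the collection.
        int[] primitive = toPrimitive(list);
        System.out.println(Arrays.toString(primitive));

        // Receiving side: wrap an array back into a List cheaply.
        Integer[] boxed = list.toArray(new Integer[0]);
        List<Integer> roundTrip = Arrays.asList(boxed);
        System.out.println(roundTrip.equals(list));
    }
}
```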
They both serialize the same data. So I wouldn't say one is significantly better than the other.
As far as I know, both are serializable, but using arrays is better because ArrayList is mainly designed for convenient internal manipulation, not for exposure to the outside world. It is a little heavier to use, so when serializing it in web services it might create problems in the namespaces and headers; if those are set automatically, you may not be able to send or receive the data properly. So it is better to use primitive arrays.
Only in Java does this make a difference, and even then it's hard to notice it.
If he didn't mean Java then yes, your best bet would most likely be asking him exactly what he meant by that.
Just a related thought: The List interface is not Serializable so if you want to include a List in a Serializable API you are forced to either expose a Serializable implementation such as ArrayList or convert the List to an array. Good design practices discourage exposing your implementation, which might be why your architect was encouraging you to convert the List to an array. You do pay a little time penalty converting the List to an array, but on the other end you can wrap the array with a list interface with java.util.Arrays.asList(), which is fast.
Given that Java ME allows no reflection and is quite stripped down, what would be a good approach to storing data in a RecordStore? I thought of devising a JSON-like syntax adapter class, which would probably require every class whose values are to be stored to expose them through a Hashtable, and might also require an additional regex library, since even regex has been taken away.
If you're wondering why on earth I need it, it's for an assignment. So the dilemma is really between writing a rather unnecessarily large chunk of proper code as a matter of principle or hardcoding everything in knowing nobody has to suffer through maintenance of this junk down the line. But, the good principles person in me is leaning towards the former.
EDIT: I should have elaborated — I'd like to store object's data within the RecordStore, so I'm trying to devise a method to represent an object as a string which can then be converted into a byte array.
For every object you want to save in the RecordStore, pare it down to its component Strings and primitives, then serialise it to a byte array, using a ByteArrayOutputStream wrapped in a DataOutputStream. Then write this byte array to RMS.
Reverse the process using a ByteArrayInputStream wrapped in a DataInputStream to get the object back.
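A sketch of that round trip, using a hypothetical HighScore record type (the RecordStore call itself is omitted; these stream classes all exist in Java ME's CLDC):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type; substitute your own fields.
public class HighScore {
    String player;
    int score;

    // Pare the object down to primitives and serialise to a byte array,
    // ready for RecordStore.addRecord(data, 0, data.length).
    byte[] toBytes() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(player);   // length-prefixed, safe to read back
            out.writeInt(score);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reverse the process: read the fields back in the same order.
    static HighScore fromBytes(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            HighScore h = new HighScore();
            h.player = in.readUTF();
            h.score = in.readInt();
            return h;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The crucial rule is that fromBytes reads fields in exactly the order toBytes wrote them; there is no schema in the byte stream, unlike the JSON-ish approach you were considering.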
I'm new to the Java world, coming from a C++ background, and I'd like to port some C++ code to Java.
The code uses Sparse vectors:
struct Feature{
int index;
double value;
};
typedef std::vector<Feature> featvec_t;
As I understand it, every object carries some memory overhead.
So a naive Feature class will add significant overhead when a set of featvec_t holds 10-100 million Features.
How to represent this structure memory efficiently in Java?
If memory is really your bottleneck, try storing your data in two separate arrays:
int[] index and double[] value.
But in most cases with such big structures performance (time) will be the main issue. Depending on operations mostly performed on your data (insert, delete, get, etc.) you need to choose appropriate data structure to store objects of class Feature.
Start your explorations with java.util.Collection interface, its subinterfaces (List, Set, etc) and their implementations provided in java.util package.
To avoid memory overhead for each entry, you could write a java.util.List<Feature> implementation that wraps arrays of int and double, and builds Feature objects on demand.
To have it resize automatically, you could use TIntArrayList and TDoubleArrayList from GNU trove.
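A sketch of the parallel-array idea (FeatureStore is a hypothetical name; it uses a fixed capacity for brevity, where Trove's TIntArrayList / TDoubleArrayList would give you resizing):

```java
// Stores features as two parallel primitive arrays. Per element this
// costs exactly 12 bytes (one int + one double) with no per-object
// header and no reference, unlike a Feature object per entry.
public class FeatureStore {
    private final int[] indexes;
    private final double[] values;
    private int size;

    public FeatureStore(int capacity) {
        indexes = new int[capacity];
        values = new double[capacity];
    }

    public void add(int index, double value) {
        indexes[size] = index;
        values[size] = value;
        size++;
    }

    public int index(int i)    { return indexes[i]; }
    public double value(int i) { return values[i]; }
    public int size()          { return size; }
}
```

A java.util.List<Feature> facade over this store can still hand out Feature objects on demand for callers that want them, while the bulk of the data stays in the two arrays.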
Is the question about space for the struct itself or the sparse vector? Since others have answered the former, I'll shoot for the latter...
There aren't any sparse lists/matrices in the standard Java collections to my knowledge.
You could build an equivalent using a TreeMap keyed on the index.
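A sketch of that TreeMap-backed sparse vector; note the keys are boxed Integers, so this trades per-entry overhead for only storing the non-zero entries:

```java
import java.util.TreeMap;

public class SparseVector {
    // Only non-zero entries are stored; every other index reads as 0.0.
    private final TreeMap<Integer, Double> entries = new TreeMap<>();

    public void set(int index, double value) {
        if (value == 0.0) entries.remove(index);   // keep the map sparse
        else entries.put(index, value);
    }

    public double get(int index) {
        return entries.getOrDefault(index, 0.0);
    }

    public int nonZeroCount() { return entries.size(); }
}
```

The TreeMap also keeps entries ordered by index, which makes operations like dot products over two sparse vectors a simple merge of two sorted sequences.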
A Feature object in Java holds more than its fields. Besides:
sizeof(index)
sizeof(value)
there is an object header, typically 12-16 bytes on a 64-bit HotSpot JVM (a mark word plus a pointer to the concrete class), and the whole object is padded to a multiple of 8 bytes. So the 4+8=12 bytes of payload usually become 24 bytes, roughly 100% overhead, before counting the 4-8 byte reference that the containing collection holds to each object. Apart from splitting the data into parallel primitive arrays, I can't think of a better way to avoid it.