More specifically, how does the Array object in Java allow users to access each bucket in constant time? I understand that Java allocates memory equivalent to the specified size at initialization, but what about its structure allows such rapid lookup? The reason I am asking is that it's something not available in any other data structure save those that use array representations of data.
Another (possibly silly) question is why can't other data structures provide such quick lookup of the stored data? I guess this question can/will be answered by the answer to the first question (how Arrays are implemented).
An array is just stored as a big contiguous block of memory.
Accessing an element is a matter of:
Finding where the data starts
Validating the index
Multiplying the index by the element size, and adding the result to the start location
Accessing that memory location
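Purely as an illustration (not something you can write in real Java, since the JVM hides raw addresses), those steps boil down to a bounds check plus constant-time arithmetic:

```java
// Illustrative only: models what the JVM does internally for "a[index]".
// Real Java code cannot compute raw addresses like this.
class ArrayAddressSketch {
    static long elementAddress(long dataStart, int index, int length, int elementSize) {
        if (index < 0 || index >= length) {            // validate the index
            throw new ArrayIndexOutOfBoundsException(index);
        }
        return dataStart + (long) index * elementSize; // multiply and add: O(1) for any index
    }
}
```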
Note that this is all done in the JVM, not in byte code. If arrays didn't exist at a JVM level, you couldn't fake them up within pure Java. Aside from anything else, arrays are the only objects in Java which logically vary in size between instances within the same JVM. (It's possible that some JVMs have clever systems to use compressed references in some cases and not in others, but that's an implementation detail rather than a logical difference.)
Related
I am writing a C++ library that will be used by different Android applications to process data organized as two-dimensional storage, where each dimension has no predefined size restriction (like an array of arrays of float, where the arrays can be quite large).
The current solution uses SWIG to copy data from memory allocated by Java code into C++ structures. It turns out that each array of float values (in Java) becomes a vector of float (in C++).
The problem is that duplicating a large amount of data increases the risk of running out of memory available to the application. I understand that, in any case, the memory consumption issue should be resolved by limiting the input volume, but the library does not know how much memory is available and needs the whole data set (access to any data element is needed repeatedly) to perform correct processing.
So now I am considering the possibility of using a single data store for Java and C++, where the C++ code has direct access to data stored by the Java code in memory allocated on the Java side (making memory allocated by C++ the single store is not considered).
I want to know how to organize such memory sharing in a safe manner (preferably using SWIG).
I feel that such an implementation could run into some difficulties, e.g. with the Java garbage collector (the C++ code could address storage that has already been deallocated) and with memory access slowed down by the wrapper (as mentioned earlier, the library requires repeated access to each data item)… but perhaps someone can advise me of a reliable solution.
An explanation of why my idea is wrong can also be accepted, if supported by sufficiently compelling arguments.
You can get access to the raw array data using a Critical Natives implementation. This technique allows direct access to JVM memory without the overhead of transferring data between Java and native code.
But it has the following restrictions:
must be static and not synchronized;
argument types must be primitive or primitive arrays;
implementation must not call JNI functions, i.e. it cannot allocate Java objects or throw exceptions;
should not run for a long time, since it will block GC while running.
The declaration of a critical native looks like a regular JNI method, except that:
it starts with JavaCritical_ instead of Java_;
it does not have extra JNIEnv* and jclass arguments;
Java arrays are passed in two arguments: the first is an array length, and the second is a pointer to raw array data. That is, no need to call GetArrayElements and friends, you can instantly use a direct array pointer.
Look at the original answer and source article for details.
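As a rough illustration (hypothetical class, method and library names), the Java side of such a method might look like this, with the critical C entry point sketched in the comments:

```java
// Hypothetical sketch of the Java side of a critical native.
// The method is static, not synchronized, and takes only a primitive array,
// matching the restrictions listed above.
public final class FloatProcessor {
    // The native side would be implemented as the usual JNI entry point
    //   JNIEXPORT void JNICALL Java_FloatProcessor_process(JNIEnv*, jclass, jfloatArray)
    // plus the critical entry point described above:
    //   JNIEXPORT void JNICALL JavaCritical_FloatProcessor_process(jint length, jfloat* data)
    public static native void process(float[] data);

    static {
        System.loadLibrary("floatprocessor"); // hypothetical library name
    }
}
```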
Serialization is the process of converting an object stored in memory into a stream of bytes to be transferred over a network, stored in a DB, etc.
But isn't the object already stored in memory as bits and bytes? Why do we need another process to convert the object stored as bytes into another byte representation? Can't we just transmit the object directly over the network?
I think I may be missing something in the way the objects are stored in memory, or the way the object fields are accessed.
Can someone please help me in clearing up this confusion?
Different systems don't store things in memory in the same way. The obvious example is endianness.
Serialization defines a way by which systems using different in-memory representations can communicate.
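For instance, a tiny demonstration of the endianness point: the same int has a different raw byte layout depending on the byte order you ask for:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// The same int value produces different raw bytes under different byte orders.
public class EndianDemo {
    public static void main(String[] args) {
        int value = 0x01020304;
        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
        System.out.println(Arrays.toString(big));    // [1, 2, 3, 4]
        System.out.println(Arrays.toString(little)); // [4, 3, 2, 1]
    }
}
```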
Another important fact is that the requirements on in-memory and serialized data may be different: when in-memory, fast read (and maybe write) access is desirable; when serialized, small size is desirable. It is easier to create two different formats to fit these two use cases than it is to create one format which is good for both.
An example which springs to mind is LinkedHashMap: this basically stores two versions of the mapping when in memory (one to capture insertion order; one as a traditional hash map). However, you don't need both of these representations to reconstruct the same map from a serialized form: you only need the insertion order of key/value pairs. As such, the serialized form does not store the same data as the in-memory form.
Serialization turns the object's in-memory bytes into a universal form.
This is done because different systems lay out memory in different ways. Thus, we cannot ensure that an object saved directly from memory on one machine can be loaded back properly on another, different machine.
Maybe you can find more information on this page of the Oracle docs.
Explanation of object serialization from the book Thinking in Java.
When you create an object, it exists for as long as you need it, but under no circumstances does it exist when the program terminates. While this makes sense at first, there are situations in which it would be incredibly useful if an object could exist and hold its information even while the program wasn’t running. Then, the next time you started the program, the object would be there and it would have the same information it had the previous time the program was running. Of course, you can get a similar effect by writing the information to a file or to a database, but in the spirit of making everything an object, it would be quite convenient to declare an object to be "persistent," and have all the details taken care of for you.
Java’s object serialization allows you to take any object that implements the Serializable interface and turn it into a sequence of bytes that can later be fully restored to regenerate the original object. This is even true across a network, which means that the serialization mechanism automatically compensates for differences in operating systems. That is, you can create an object on a Windows machine, serialize it, and send it across the network to a Unix machine, where it will be correctly reconstructed. You don’t have to worry about the data representations on the different machines, the byte ordering, or any other details.
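A minimal sketch of what that looks like in code (a hypothetical Point class), serializing to a byte array and restoring it:

```java
import java.io.*;

// Minimal sketch of Java object serialization as described above.
class Point implements Serializable {
    private static final long serialVersionUID = 1L;
    int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

public class SerializeDemo {
    public static void main(String[] args) throws Exception {
        // Turn the object into a portable byte sequence...
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Point(3, 4));
        }
        // ...and later restore it, possibly on a different machine.
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Point p = (Point) in.readObject();
            System.out.println(p.x + "," + p.y); // 3,4
        }
    }
}
```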
Hope this helps you.
Let's go with that frame of mind: we take the object as is, and we send it as a byte array over the network. Another socket/HTTP handler receives that byte array.
Now, two things come to mind:
How many bytes to send?
What are these bytes? What class do these bytes represent?
You will have to provide this data as well, so for this action alone we need two extra steps.
Now, in C# and Java, as opposed to C++, the objects are scattered throughout the heap; each object holds references to the objects it contains, so now we have another requirement:
Recursively "catch" all the inner objects and pack them into the byte array
Now we get a packed byte array which represents some object hierarchy; we need to tell the other side how to unpack this byte array back into an object plus the objects it holds, so:
Send information on how to unpack that byte array into an object hierarchy
Some things an object has cannot be sent over the net, such as functions, so now we have yet another step:
Strip away things that cannot be serialized, like functions
This process goes on and on; for every new solution you will find many problems. Serialization is the process of taking that byte array you are talking about and making it into something that can be handled in other environments, like networks and files.
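A small sketch of those points using Java's built-in serialization (hypothetical Car/Engine classes): the object graph is packed recursively, and anything that cannot travel is marked transient so it gets stripped away:

```java
import java.io.Serializable;

// Sketch of the steps above with hypothetical classes.
class Engine implements Serializable {
    int horsepower = 150;                  // primitive data: serialized as-is
}

class Car implements Serializable {
    Engine engine = new Engine();          // inner object: packed recursively
    transient Runnable onStart = () -> {}; // behaviour: stripped away, not sent
}
```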
I am working with some relatively large arrays of instances of a single data structure. Each instance consists of about a half a dozen fields. The arrays take up a lot of space and I'm finding that my development environment dies even when running with a vm using 7 gigabytes of heap space. Although I can move to a larger machine, I am also exploring ways I could economize on space without taking an enormous hit in performance. On inspection of the data I've noticed a great deal of redundancy in the data. For about 80 percent of the data, four of the six fields have identical values.
This gave me the idea that I can segregate these instances that have redundant information and put them in a specialized form of the data structure (an extension of the original data structure) with static fields for the four fields that contain the identical information. My assumption is that the static fields will only be instantiated in memory once, and so even though this information is shared by say 100K objects, these fields take up the same memory as they would if only one data structure was instantiated. I therefore should be able to realize a significant memory savings.
Is this a correct assumption?
Thank you,
Elliott
I don't know your specific data structure and a possible algorithm to build a flyweight, but I would suggest one:
http://en.wikipedia.org/wiki/Flyweight_pattern
The pattern is quite close to the solution you are thinking about, and gives you a good separation of "how to get the data."
How about maintaining the redundant fields in a map and just having references to those values in the arrays? That could save space by reducing the size of each individual data structure.
Try to use a HashMap (not threaded) for your storage. That's the way to quickly find equal objects.
You need to think about how to define the hashCode function of your objects.
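Putting the suggestions above together, a rough flyweight sketch (hypothetical field and class names) could intern the shared four-field part in a HashMap and let each record hold a single reference to it:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Rough flyweight sketch: the four fields that are identical across ~80%
// of instances live in one shared, interned object.
final class SharedPart {
    final int a, b, c, d;
    SharedPart(int a, int b, int c, int d) { this.a = a; this.b = b; this.c = c; this.d = d; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof SharedPart)) return false;
        SharedPart s = (SharedPart) o;
        return a == s.a && b == s.b && c == s.c && d == s.d;
    }
    @Override public int hashCode() { return Objects.hash(a, b, c, d); }
}

final class Item {
    final SharedPart shared; // one reference instead of four duplicated fields
    final long e, f;         // the two fields that actually vary
    Item(SharedPart shared, long e, long f) { this.shared = shared; this.e = e; this.f = f; }
}

final class SharedPartPool {
    private static final Map<SharedPart, SharedPart> POOL = new HashMap<>();
    // Return the canonical instance for these four values, creating it only once.
    static SharedPart intern(int a, int b, int c, int d) {
        return POOL.computeIfAbsent(new SharedPart(a, b, c, d), k -> k);
    }
}
```

Instances whose four fields are not shared can keep using the original structure; only the ~80% redundant ones need the interned part.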
If one took say 1000 lines of computer code and, instead of the variables being declared independently, they were grouped together (obviously this does depend on the variable sizes being used) into classes and structs, would this directly improve spatial locality in the cache (and therefore reduce the cache miss rate)?
I was under the impression that by associating the variables within a class/struct they would be assigned contiguous memory addresses?
If you are talking about method-local variables, they are already contiguous on the stack, or strictly speaking in activation records which are all but invariably on the stack. If you are talking about references to Java objects, or pointers to dynamically allocated C++ objects, putting them into containing classes won't make any difference for another reason: the objects concerned will still be in arbitrary positions in the heap.
Answering this question is not possible without making some quite unreasonable assumptions. Spatial locality is as much about algorithms as it is about data structures, so grouping logically related data elements together may be of no consequence or even worse based on an algorithm that you use.
For example, consider a representation of 100 points in 3D space. You could put them in three separate arrays, or create a 3-tuple struct/class, and make an array of these.
If your algorithm must get all three coordinates of each point at once on each step, the tuple representation wins. However, think about what would happen if you wanted to build an algorithm that operates on each dimension independently, and parallelize it three-way among three independent threads. In this case three separate arrays would win hands down, because that layout would avoid false sharing, and improve spatial locality as far as the one-dimension-at-a-time algorithm is concerned.
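A sketch of the two layouts (hypothetical names); note that in Java an array of objects holds references, so the tuple layout is not guaranteed to be contiguous the way a C++ vector of structs would be:

```java
// Sketch of the two layouts from the 3D-point example.
public class LayoutSketch {

    // Layout 1: array of 3-tuples -- good when an algorithm reads x, y and z
    // of the same point together.
    static final class Point3 {
        double x, y, z;
    }

    public static void main(String[] args) {
        Point3[] points = new Point3[100];
        for (int i = 0; i < points.length; i++) {
            points[i] = new Point3();
        }

        // Layout 2: three separate arrays -- good when three threads each
        // process one dimension independently (avoids false sharing).
        double[] xs = new double[100];
        double[] ys = new double[100];
        double[] zs = new double[100];
    }
}
```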
This example shows that there is no "one size fits all" solution. Spatial locality should always be considered in the context of a specific algorithm; a good solution in one case could turn bad in other seemingly similar cases.
If you are asking whether to group local variables into explicitly defined structures, there is not going to be an advantage. Local variables are implemented in terms of activation records, which are usually closely related to the implementation of class structures, for any language that has both.
So, local variables should already have good spatial locality, unless the language implementation is doing something weird to screw it up.
You might improve locality by isolating large chunks of local state which isn't used during recursion into separate non-recursing functions. This would be a micro-optimization, so you need to inspect the machine code first to be sure it's not a waste of time. Anyway, it's unrelated to moving locals into a class.
I am trying to implement a cache-like collection of objects. The purpose is to have fast access to these objects through locality in memory since I'll likely be reading multiple objects at a time. I currently just store objects in a java collections object like vector or deque. But I do not believe this makes use of contiguous memory.
I know this can be done in C, but can it be done in Java? These objects may be of varying lengths (since they may contain strings). Is there a way to allocate contiguous memory through Java? Is there a Java collections object that would?
Please let me know.
Thanks,
jbu
You can't force it. If you allocate all the objects in quick succession they're likely to be contiguous - but if you're storing them in a collection, there's no guarantee that the collection will be local to the actual values. (The collection will have references to the objects, rather than containing the objects themselves.)
In addition, GC compaction will move values around in memory.
Have you actually profiled your app and found this is a bottleneck? In most cases I'd expect other optimisations could help you in a more reliable way.
No, you can't guarantee this locality of reference.
By allocating a byte array, or using a mapped byte buffer from the nio packages, you could get a chunk of contiguous memory, from which you can decode the data you want (effectively deserializing the objects of interest from this chunk of memory). However, if you repeatedly access the same objects, the deserialization overhead would likely defeat the purpose.
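A rough sketch of that idea (hypothetical record layout): pack fixed-size records into one contiguous buffer and decode fields on demand:

```java
import java.nio.ByteBuffer;

// Pack fixed-size records into one contiguous buffer; decode fields on demand.
public class RecordBuffer {
    private static final int RECORD_SIZE = Integer.BYTES + Double.BYTES; // hypothetical layout
    private final ByteBuffer buf;

    RecordBuffer(int capacity) {
        // allocateDirect gives one contiguous block outside the Java heap;
        // ByteBuffer.allocate or a mapped file from java.nio would also work.
        this.buf = ByteBuffer.allocateDirect(capacity * RECORD_SIZE);
    }

    void put(int index, int id, double value) {
        int base = index * RECORD_SIZE;
        buf.putInt(base, id);
        buf.putDouble(base + Integer.BYTES, value);
    }

    int id(int index)       { return buf.getInt(index * RECORD_SIZE); }
    double value(int index) { return buf.getDouble(index * RECORD_SIZE + Integer.BYTES); }
}
```

If you repeatedly rebuild full objects from the buffer, the decoding cost can outweigh the locality benefit, as noted above.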
Have you written this code yet in Java? And if so, have you profiled it? I would argue that you probably don't need to worry about the objects being in contiguous memory - the JVM is better at memory management than you are in a garbage collected environment.
If you're really concerned about performance, maybe Java isn't the right tool for the job, but my gut instinct is to tell you that you're worrying about optimization too early, and that a Java version of your code, working with non-contiguous memory, will probably suit your needs.
I suggest using a HashMap (not threaded) or Hashtable (threaded) for your cache. Both store objects in an array in the Sun JVM. Since all objects in Java are accessed by reference, this would be represented as an array of pointers in C. My bet is that you are performing premature optimization.
If you absolutely must have this, you have two options:
1) Use JNI and write it in c.
2) Get a BIG byte buffer and use ObjectOutputStream to write objects to it. This will probably be VERY SLOW compared to using a hash table.