I am working on refactoring a small portion of an open source large-scale configuration management system for my University.
We're using some open source tools for machine learning like Weka, and the aspect I am assigned to refactor is dealing with data mining and constructing rules.
The open source files we've been using from Liverpool and Japan are working well, but there are some memory usage issues when we use the program on large scale projects.
I've isolated the major memory hogs and come to the conclusion I need to figure out a different data structure to store and manipulate the data. As it stands now, the program is using what end up becoming very large multidimensional arrays of integers, objects, strings, etc.
There are several methods that simply reconfigure the setup of the associations after we derive rules for behaviors. In many cases, we are only adding or removing a single element, or simply flattening the multidimensional arrays.
I primarily program in C/C++ in general, so I am not an expert on the data structures available in Java. What I am looking to replace the static arrays with is a dynamic structure that can be easily resized without having to create a second multidimensional array.
What is happening now is we are having to create an entirely new structure every time we add and remove rules, objects, or other miscellaneous data from the multidimensional array. Then we are immediately copying into the new array.
I'd like to be able to simply use the same multidimensional array and simply add a new row and column. Subsequently, I'd like to be able to manipulate the data in the structure by simply saving a temporary value and overwriting previous values, shifting left, right, etc.
Can anyone think of any data structures in Java that would fit the bill?
On a related note, I have looked into explicit garbage collection, but have found I can only really suggest that the JVM collect by calling System.gc(), or influence the garbage collection behavior of the JVM by way of tuning. Is there a better or more effective way?
Regards,
Edm
If you have a lot of nulls/zeroes/falses/empty-strings in your matrix, then you can save space by using a sparse matrix implementation. Matrix-toolkits has several sparse matrices that you can use / modify to suit your needs, or you can just use a hashmap with an {x, y} tuple as the key. (The hashmap also has the advantage that there are several external hashmap implementations available, e.g. BerkeleyDB, so that it's unlikely that you'll run out of memory.)
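As a rough illustration of the hashmap idea, here is a minimal sketch of a sparse grid of ints. The class name SparseIntGrid and the choice to pack the {x, y} pair into a single long key are my own assumptions, not something taken from Matrix-toolkits or the original program:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sparse 2D grid of ints: only non-zero cells are stored.
class SparseIntGrid {
    private final Map<Long, Integer> cells = new HashMap<>();

    // Pack (row, col) into one long so no separate key class is needed.
    private static long key(int row, int col) {
        return ((long) row << 32) | (col & 0xFFFFFFFFL);
    }

    int get(int row, int col) {
        // Cells that were never set read as the default value 0.
        return cells.getOrDefault(key(row, col), 0);
    }

    void set(int row, int col, int value) {
        if (value == 0) {
            cells.remove(key(row, col)); // storing the default would waste space
        } else {
            cells.put(key(row, col), value);
        }
    }

    int storedCells() {
        return cells.size();
    }
}
```

Each stored cell still costs a boxed key and value plus a map entry, so this only pays off when most of the matrix holds the default value.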
To replace static arrays with a dynamic structure, use an ArrayList, which grows automatically as data is added. For a two-dimensional structure, use a List of Lists:
List<List<Integer>> dataStore = new ArrayList<List<Integer>>();
dataStore.add(new ArrayList<Integer>());
dataStore.add(Arrays.asList(1, 2, 3, 4));
// Access [1][3] as
System.out.println(dataStore.get(1).get(3)); // prints 4
Since you touched upon control over garbage collection (which Java actually does a pretty good job of all by itself), it seems memory management is of paramount importance, as it is what prompted the refactoring in the first place.
You could look into the Flyweight GoF pattern, which focuses on sharing objects instead of duplicating them to cut down the memory footprint of the application. To enable sharing, flyweight objects need to be immutable.
Pseudocode:
// adding a new flyweight obj at [2][1]
fwObjStore.get(2).set(1, FWObjFactory.getInstance(fwKey));
public class FWObjFactory {
    private static Map<String, FWObject> fwMap = new HashMap<String, FWObject>();
    // Returns the shared instance for this key, creating it on first use
    public static FWObject getInstance(String fwKey) {
        if (!fwMap.containsKey(fwKey)) {
            fwMap.put(fwKey, newFwFromKey(fwKey));
        }
        return fwMap.get(fwKey);
    }
    private static FWObject newFwFromKey(String fwKey) {
        // ...
    }
}
I would look into using a "List of Lists". For example, you could declare something like
List<List<Object>> mArray = new ArrayList<List<Object>>();
Any time you need to add a new "row", you could do something like:
mArray.add (new ArrayList<Object>());
Check out the List interface to see what you can do with Lists in Java and which classes implement the interface (or roll your own!).
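To make the "add a row and a column without copying the whole structure" part concrete, here is a minimal sketch built on a list of lists. GrowableGrid and its method names are hypothetical, and it assumes you want the grid to stay rectangular:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a rectangular grid backed by a list of lists.
class GrowableGrid<T> {
    private final List<List<T>> rows = new ArrayList<>();
    private int columnCount = 0;

    // Append a row filled with the given value; existing rows are untouched.
    void addRow(T fill) {
        List<T> row = new ArrayList<>();
        for (int c = 0; c < columnCount; c++) {
            row.add(fill);
        }
        rows.add(row);
    }

    // Append one cell to every existing row.
    void addColumn(T fill) {
        for (List<T> row : rows) {
            row.add(fill);
        }
        columnCount++;
    }

    T get(int r, int c) { return rows.get(r).get(c); }

    // Overwrite a cell in place.
    void set(int r, int c, T value) { rows.get(r).set(c, value); }

    int rowCount() { return rows.size(); }

    int columnCount() { return columnCount; }
}
```

Adding a column touches every row once, but nothing is reallocated as a whole, which seems to be the cost you want to avoid.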
Java has no true multidimensional arrays; it has arrays of arrays.
You can use an ArrayList whose type parameter is itself an ArrayList:
ArrayList<ArrayList<yourType>> myList = new ArrayList<ArrayList<yourType>>();
Also, don't worry about GC. It will collect as and when required.
Why not use two Lists tangled together? Like so:
List<List<String>> rowColumns = new ArrayList<>();
// Add a row with two entries, or columns:
List<String> oneRow = Arrays.asList("Hello", "World!");
rowColumns.add(oneRow);
Also, consider using a Map with entries mapped to Lists.
Garbage collection should generally never have to be dealt with explicitly in Java. Usually you want to look for memory leaks first, whenever one occurs. When that happens, look for background threads that don't die as they are supposed to, or for strong references in caches. If you want to read about the latter issue, you can start here and here.
Related
Out of interest: Recently, I encountered a situation in one of my Java projects where I could store some data either in a two-dimensional array or make a dedicated class for it whose instances I would put into a one-dimensional array. So I wonder whether there exist some canonical design advice on this topic in terms of performance (runtime, memory consumption)?
Without regard of design patterns (extremely simplified situation), let's say I could store data like
class MyContainer {
public double a;
public double b;
...
}
and then
MyContainer[] myArray = new MyContainer[10000];
for(int i = myArray.length; (--i) >= 0;) {
myArray[i] = new MyContainer();
}
...
versus
double[][] myData = new double[10000][2];
...
I somehow think that the array-based approach should be more compact (memory) and faster (access). Then again, maybe it is not, arrays are objects too and array access needs to check indexes while object member access does not.(?) The allocation of the object array would probably(?) take longer, as I need to iteratively create the instances and my code would be bigger due to the additional class.
Thus, I wonder whether the designs of the common JVMs provide advantages for one approach over the other, in terms of access speed and memory consumption?
Many thanks.
Then again, maybe it is not, arrays are objects too
That's right. So I think this approach will not buy you anything.
If you want to go down that route, you could flatten this out into a one-dimensional array (each of your "objects" then takes two slots). That would give you immediate access to all fields in all objects, without having to follow pointers, and the whole thing is just one big memory allocation: since your component type is primitive, there is just one object as far as memory allocation is concerned (the container array itself).
This is one of the motivations for people wanting to have structs and value types in Java, and similar considerations drive the development of specialized high-performance data structure libraries (that get rid of unnecessary object wrappers).
I would not worry about it until you really have a huge data structure, though. Only then will the overhead of the object-oriented way matter.
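If you do want to try the flattened layout, a minimal sketch could look like the following. The class name FlatPairs and the accessor names are made up for illustration; the fields a and b correspond to the two doubles in MyContainer:

```java
// Hypothetical flat layout: record i occupies slots 2*i (a) and 2*i + 1 (b).
class FlatPairs {
    private static final int FIELDS = 2;
    private final double[] data;

    FlatPairs(int count) {
        data = new double[count * FIELDS]; // one allocation for all records
    }

    double getA(int i) { return data[i * FIELDS]; }

    double getB(int i) { return data[i * FIELDS + 1]; }

    void setA(int i, double value) { data[i * FIELDS] = value; }

    void setB(int i, double value) { data[i * FIELDS + 1] = value; }

    int size() { return data.length / FIELDS; }
}
```

The index arithmetic is hidden behind accessors so that calling code does not have to know about the layout.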
I somehow think that the array-based approach should be more compact (memory) and faster (access)
It won't. You can easily confirm this by using Java Management interfaces:
com.sun.management.ThreadMXBean b = (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
long selfId = Thread.currentThread().getId();
long memoryBefore = b.getThreadAllocatedBytes(selfId);
// <-- Put measured code here
long memoryAfter = b.getThreadAllocatedBytes(selfId);
System.out.println(memoryAfter - memoryBefore);
Under measured code put new double[0] and new Object() and you will see that those allocations will require exactly the same amount of memory.
It might be that the JVM/JIT treats arrays in a special way which could make them faster to access in one way or another.
The JIT does some vectorization of array operations in for-loops, but that is more about the speed of arithmetic operations than the speed of access. Besides that, I can't think of any.
The canonical advice that I've seen in this situation is that premature optimisation is the root of all evil. Following that means that you should stick with the code that is easiest to write / maintain / get past your code quality regime, and then look at optimisation if you have a measurable performance issue.
In your examples the memory consumption is similar because in the object case you have 10,000 references plus two doubles per reference, and in the 2D array case you have 10,000 references (the first dimension) to little arrays containing two doubles each. So both are one base reference plus 10,000 references plus 20,000 doubles.
A more efficient representation would be two arrays, where you'd have two base references plus 20,000 doubles.
double[] a = new double[10000];
double[] b = new double[10000];
What would be the best way to store and read a really long string, where each entry is an index into another array?
Right now I have this
String indices="1,4,6,19,22,54,....."
The string has up to hundreds of thousands of entries, so I think maybe I could use a data structure like a linked list. Does anyone know if it would be faster to use one?
List<String> list = new ArrayList<String>();
list.add("1");
list.add("2");
You need to declare an ArrayList of type String, then add to it.
It would depend on what you'll do with the string (the indices) and the corresponding arrays. Also, it will depend on how you're gonna access them.
I'd suggest you first read an overview of the data structures implemented in Java, especially in the Collections Framework.
We could give some suggestions, but you'd have to provide more information, especially what I mentioned at the beginning (what you want, how this data will be stored and accessed, and so on).
For example, if you need to have a fast access to the indexed data, maybe a string isn't even the best approach. Maybe a map would be better. The indexes could be the keys and the indexed arrays could be the values of the map, for example. But this is just a void example, I strongly suggest you give us more information.
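If the entries are only ever used as array indices, one option is to parse the string once into a primitive int array. This is just a sketch under that assumption; IndexParser is a hypothetical name, and the input is assumed to be well-formed comma-separated integers:

```java
import java.util.Arrays;

class IndexParser {
    // Split a comma-separated string such as "1,4,6" into primitive ints.
    static int[] parse(String csv) {
        if (csv.isEmpty()) {
            return new int[0];
        }
        return Arrays.stream(csv.split(","))
                .map(String::trim)              // tolerate spaces around entries
                .mapToInt(Integer::parseInt)    // throws NumberFormatException on bad input
                .toArray();
    }
}
```

After parsing, each entry can be used directly, e.g. other[indices[k]], without any further string handling.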
I really like using the ArrayList class. If you're comfortable using arrays, ArrayList, or any other member of the Collections Framework, would work really well for what you're trying to do.
ArrayList<String> indices = new ArrayList<String>();
indices.add("");
I had a similar hunch: I wanted to store about 1,000 strings and search them to check whether they contain a given item.
I found that, instead of using the Java Collections Framework (Map, Set, or List),
storing the data in a plain array and scanning it with a for-loop is faster.
See this link for the measured results, reported in microseconds:
https://www.programcreek.com/2014/04/check-if-array-contains-a-value-java/
So a simple brute-force scan wins for an unsorted array
(which is what we normally have),
but Arrays.binarySearch() wins if the array is sorted.
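For reference, the two approaches being compared look roughly like this. ArraySearch and its method names are hypothetical; Arrays.binarySearch is the standard library call and requires the array to be sorted already:

```java
import java.util.Arrays;

class ArraySearch {
    // Linear scan: works on unsorted data, O(n) comparisons.
    static boolean containsLinear(String[] items, String target) {
        for (String item : items) {
            if (item.equals(target)) {
                return true;
            }
        }
        return false;
    }

    // Binary search: O(log n), but only valid if the array is sorted.
    static boolean containsSorted(String[] sortedItems, String target) {
        return Arrays.binarySearch(sortedItems, target) >= 0;
    }
}
```

Sorting first costs O(n log n), so binary search only pays off when the same array is searched many times.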
Let's say we have a bunch of data (temp, wind, pressure) that ultimately comes in as a number of float arrays.
For example:
float[] temp = //get after performing some processing (takes time)
float[] wind =
Say we want to store these values in memory for different hours of the day. Is it better to put these on a HashMap like:
HashMap maphr1 = new HashMap();
maphr1.put("temp",temp);
maphr1.put("wind",wind);
...
Or is it better to create a Java object like:
public class HourData {
private float[] temp,wind,pressure;
//getters and setters for above!
}
...
// use it like this
HourData hr1 = new HourData();
hr1.setTemp(temp);
hr1.setWind(wind);
Out of these two approaches, which is better in terms of performance, readability, good OOP practice, etc.?
You're best off having an HourData class that stores a single set of temperature, wind, and pressure values, like this:
public class HourData {
private float temp, wind, pressure;
// Getters and setters for the above fields
}
If you need to store more than one set of values, you can use an array, or a collection of HourData objects. For example:
HourData[] hourDataArray = new HourData[10000];
This is ultimately much more flexible, performant, and intuitive to use than storing the arrays of data in your HourData class.
Flexibility
I say that this approach is more flexible because it leaves the choice of what kind of collection implementation to use (e.g. ArrayList, LinkedList, etc.) to users of the HourData class. Moreover, if he/she wishes to deal just with a single set of values, this approach doesn't force them to deal with an array or collection.
Performance
Suppose you have a list of HourData instances. If you used three float arrays in the way that you described, then accessing the i'th temp, wind, and pressure values may cause three separate pages to be accessed in memory. This happens because all of the temp values will be stored contiguously, followed by all of the wind values, followed by all of the pressure values. If you use a class to group these values together, then accessing the i'th temp, wind, and pressure values will be faster because they will all be stored adjacent to each other in memory.
Intuitive
If you use a HashMap, anyone who needs to access any of the fields will have to know the field names in advance. HashMap objects are better suited to key/value pairs where the keys are not known at compile time. Using an HourData class that contains clearly defined fields, one only needs to look at the class API to know that HourData contains values for temp, wind, and pressure.
Also, getter and setter methods for array fields can be confusing. What if I just want to add a single set of temp, wind, and pressure values to the list? Do I have to get each of the arrays, and add the new values to the end of them? This kind of confusion is easily avoided by using a "wrapper" collection around an HourData that deals only with single values.
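A minimal sketch of such a wrapper follows. HourDataLog, its methods, and the three-argument HourData constructor are assumptions added for brevity; they are not part of the class shown in the question:

```java
import java.util.ArrayList;
import java.util.List;

// One set of readings; the constructor is an addition for this sketch.
class HourData {
    private final float temp, wind, pressure;

    HourData(float temp, float wind, float pressure) {
        this.temp = temp;
        this.wind = wind;
        this.pressure = pressure;
    }

    float getTemp() { return temp; }
    float getWind() { return wind; }
    float getPressure() { return pressure; }
}

// Hypothetical wrapper collection that deals only in single readings.
class HourDataLog {
    private final List<HourData> readings = new ArrayList<>();

    // Adding one set of values is a single call, not three array updates.
    void add(HourData reading) { readings.add(reading); }

    int size() { return readings.size(); }

    float averageTemp() {
        float sum = 0f;
        for (HourData reading : readings) {
            sum += reading.getTemp();
        }
        return readings.isEmpty() ? 0f : sum / readings.size();
    }
}
```

Callers never see parallel arrays, so the three values of one reading cannot get out of step with each other.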
For readability I would definitely go for an object, since it makes more sense, especially because you store different kinds of data: the wind values have a different meaning from the temp values.
Besides this you can also store other information like the location and time of your measurement.
Well, if you don't have any key to differentiate between instances of the same object, I would create HourData objects and store them in an ArrayList.
Putting data in a contained object always increases the readability.
You mentioned a bunch of data, so I would read that as a collection of data.
So the answer is: if something is already available out of the box in the Java Collections Framework, why write your own?
You should look at the Java collection classes and see which fits your requirements best, whether that is concurrent access, fast retrieval time, fast add time, and so on.
Hope this helps
EDIT----
Adding one more dimension to this.
The type of application you are building also affects your approach.
The above discussion rightly mentions readability, flexibility , performance as driving criteria for your design.
But the type of application you are building is also one of the influencing factors.
For example, let's say you are building a web application.
An object that is stored in memory for a long time would be in either application or session scope, so you will have to make it immutable by design or use it in a thread-safe manner.
The business data that remains the same across different implementations should be designed according to OOP or best practices, but the infrastructure or application logic should be driven more by your framework.
I feel that what you are describing, keeping an object in memory for a long time, is more of a framework-driven concern, hence my suggestion to use the Java Collections and put your business objects inside them. The important points are:
Concurrent Access Control
Immutable by design
If you have a limited and already defined list of parameters then it's better to use the second approach.
In terms of performance: you don't need to search for key in hashmap
In terms of readability: data.setTemp(temp) is better than map.put("temp", temp). One benefit of the setter form is that typing errors are caught at compile time.
In terms of good OOP practices: first approach has nothing to do with OOP practices. Using the second approach you can easily change the implementation, add new methods, provide several alternative data object implementations, etc.
But you might want to use collections if you don't know the parameters and if you want to work with uncategorized(extensible) set of parameters.
I have this code:
newArray = new String[][]{{"Me","123"},{"You","321"},{"He","221"}};
And I want to do this dynamically.
Add more elements, things like it.
How do I do this?
PS: Without using Vector, just using String[][];
You can't change the size of an array. You have to create a new array and copy all content from the old array to the new array.
That's why it's much easier to use the java collection classes like ArrayList, HashSet, ...
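If you really must stay with String[][], the create-and-copy step can at least be kept in one place. A minimal sketch, where TableGrowth and appendRow are hypothetical names and Arrays.copyOf is the standard library call:

```java
import java.util.Arrays;

class TableGrowth {
    // Returns a new outer array one row longer. The existing rows are
    // shared with the old array, not deep-copied.
    static String[][] appendRow(String[][] table, String... row) {
        String[][] grown = Arrays.copyOf(table, table.length + 1);
        grown[table.length] = row;
        return grown;
    }
}
```

Usage would be along the lines of newArray = TableGrowth.appendRow(newArray, "She", "111"); note that every call allocates a new outer array, which is exactly the bookkeeping ArrayList does for you.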
You can't change the size of arrays. I think you have some options:
use a List<List<String>> to store a list of lists of strings
use a Map<String,String> if you're storing a key/value pair
Vector tends not to be used these days, btw. A Vector is synchronised on each method call, and thus there's a performance hit (negligible nowadays with modern VMs)
Java does not have the facility to resize arrays like some other languages.
But
In practice you would not notice a difference between a String array and an ArrayList<String>, unless you are specifically required to use an array (as in homework).
You could declare an enormous array so that you don't run out of space, but I would strongly recommend ArrayList if you need to change the size dynamically. As a bonus, ArrayList provides some operations that are not (directly) possible with an array.
You can get away with using arrays if it's possible to calculate the size of arrays before using them. In your example, it seems that we need to know the size of the first array only. So you could impose some limit of how many records could be saved, or you could query user to know how many records it needs to save or something similar.
But again, it's easier to use Collections.
Why is the Collections Framework needed in Java, given that all the data operations (sorting/adding/deleting) are possible with arrays and, moreover, arrays use less memory and perform better than collections?
Can anyone point me to a real-world, data-oriented example that shows the difference between the two implementations (arrays vs. collections)?
Arrays are not resizable.
Java Collections Framework provides lots of different useful data types, such as linked lists (which allow constant-time insertion at a position you already hold an iterator to), resizable array lists (like Vector but cooler), red-black trees, and hash-based maps (like Hashtable but cooler).
Java Collections Framework provides abstractions, so you can refer to a list as a List, whether backed by an array list or a linked list; and you can refer to a map/dictionary as a Map, whether backed by a red-black tree or a hashtable.
In other words, Java Collections Framework allows you to use the right data structure, because one size does not fit all.
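A small sketch of what that abstraction buys you: the method below is written once against the List interface and works unchanged with either backing structure. ListAbstraction and sum are hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class ListAbstraction {
    // Depends only on the List interface, not on a concrete class.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> arrayBacked = new ArrayList<>(Arrays.asList(1, 2, 3));
        List<Integer> linked = new LinkedList<>(Arrays.asList(1, 2, 3));
        // The same method accepts both implementations.
        System.out.println(sum(arrayBacked) == sum(linked));
    }
}
```

Swapping ArrayList for LinkedList later is then a one-line change at the point of construction.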
Several reasons:
Java's collection classes provide a higher-level interface than arrays.
Arrays have a fixed size. Collections (see ArrayList) have a flexible size.
Efficiently implementing complicated data structures (e.g., hash tables) on top of raw arrays is a demanding task. The standard HashMap gives you that for free.
There are different implementations you can choose from for the same set of services: ArrayList vs. LinkedList, HashMap vs. TreeMap, synchronized, etc.
Finally, arrays are covariant: storing an element into an array is not guaranteed to succeed, because some type errors are detectable only at run time. Generic collections catch this kind of error at compile time.
Take a look at this fragment that illustrates the covariance problem:
String[] strings = new String[10];
Object[] objects = strings;
objects[0] = new Date(); // <- ArrayStoreException: java.util.Date
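For contrast, here is a small self-contained sketch that shows the array failure being caught at run time, next to the generic equivalent, which the compiler rejects outright. CovarianceDemo and its method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

class CovarianceDemo {
    // Returns true if storing an Integer through an Object[] alias
    // of a String[] fails at run time.
    static boolean arrayStoreFails() {
        Object[] objects = new String[1]; // legal: arrays are covariant
        try {
            objects[0] = Integer.valueOf(1); // compiles, checked only at run time
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    static List<String> typedList() {
        List<String> strings = new ArrayList<>();
        // List<Object> objects = strings; // does not compile: generics are invariant
        strings.add("ok");
        return strings;
    }
}
```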
Collection classes like Set, List, and Map implementations are closer to the "problem space." They allow developers to complete work more quickly and turn in more readable/maintainable code.
For each class in the Collections API there's a different answer to your question. Here are a few examples.
LinkedList: If you remove an element from the middle of an array, you pay the cost of moving all of the elements to the right of the removed element. Not so with a linked list.
Set: If you try to implement a set with an array, adding an element or testing for an element's presence is O(N). With a HashSet, it's O(1).
Map: To implement a map using an array would give the same performance characteristics as your putative array implementation of a set.
It depends upon your application's needs. There are so many types of collections, including:
HashSet
ArrayList
HashMap
TreeSet
TreeMap
LinkedList
So for example, if you need to store key/value pairs, you will have to write a lot of custom code if it will be based off an array - whereas the Hash* collections should just work out of the box. As always, pick the right tool for the job.
Well the basic premise is "wrong" since Java included the Dictionary class since before interfaces existed in the language...
collections offer Lists which are somewhat similar to arrays, but they offer many more things that are not. I'll assume you were just talking about List (and even Set) and leave Map out of it.
Yes, it is possible to get the same functionality as List and Set with an array, however there is a lot of work involved. The whole point of a library is that users do not have to "roll their own" implementations of common things.
Once you have a single implementation that everyone uses it is easier to justify spending resources optimizing it as well. That means when the standard collections are sped up or have their memory footprint reduced that all applications using them get the improvements for free.
A single interface for each thing also simplifies every developer's learning curve: there are not umpteen different ways of doing the same thing.
If you wanted to have an array that grows over time you would probably not put the growth code all over your classes, but would instead write a single utility method to do that. Same for deletion and insertion etc...
Also, arrays are not well suited to insertion/deletion, especially when you expect that the .length member is supposed to reflect the actual number of contents, so you would spend a huge amount of time growing and shrinking the array. Arrays are also not well suited for Sets as you would have to iterate over the entire array each time you wanted to do an insertion to check for duplicates. That would kill any perceived efficiency.
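To give an idea of the work involved, this is roughly what a hand-rolled "insert into an array" utility looks like. ArrayInsert and insertAt are hypothetical names; System.arraycopy is the standard call:

```java
class ArrayInsert {
    // Insert value at index, returning a new array one element longer.
    static int[] insertAt(int[] source, int index, int value) {
        int[] result = new int[source.length + 1];
        // Copy the elements before the insertion point.
        System.arraycopy(source, 0, result, 0, index);
        result[index] = value;
        // Copy the remaining elements, shifted one slot to the right.
        System.arraycopy(source, index, result, index + 1, source.length - index);
        return result;
    }
}
```

With a List the same operation is a single call to add(index, element), and the list decides for itself when to grow its backing storage.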
Arrays are not always efficient. What if you need something like LinkedList? It looks like you need to learn about some data structures: http://en.wikipedia.org/wiki/List_of_data_structures
Java Collections offer different functionality, usability, and convenience.
When an application needs to work on a group of objects, arrays alone cannot help us, or rather they lead to cumbersome operations.
One important difference is usability and convenience, especially given that collections automatically expand in size when needed.
Collections come with methods that simplify our work.
Each one has a unique feature:
List- Essentially a variable-size array;
You can usually add/remove items at any arbitrary position;
The order of the items is well defined (i.e. you can say what position a given item goes in in the list).
Used- Most cases where you just need to store a "bunch of things" and later iterate through them.
Set- Things can be "there or not"— when you add items to a set, there's no notion of how many times the item was added, and usually no notion of ordering.
Used- Remembering "which items you've already processed", e.g. when doing a web crawl;
Making other yes-no decisions about an item, e.g. "is the item a word of English", "is the item in the database?" , "is the item in this category?" etc.
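The "which items you've already processed" use of a Set can be sketched as follows. VisitTracker and firstVisits are hypothetical names; the only library behaviour relied on is that Set.add returns false when the element is already present:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class VisitTracker {
    // Returns the items in first-seen order, skipping ones already processed.
    static List<String> firstVisits(List<String> urls) {
        Set<String> seen = new HashSet<>();
        List<String> fresh = new ArrayList<>();
        for (String url : urls) {
            if (seen.add(url)) { // add returns false if the item was already present
                fresh.add(url);
            }
        }
        return fresh;
    }
}
```

The List keeps the order in which items arrived, while the Set answers the yes-no question of whether an item has been seen.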
Here you find use of each collection as per scenario:
Collections is a framework in Java, and a framework is much easier to use than implementing everything yourself. As for why we don't simply use arrays: arrays have drawbacks. An array is static, so you have to define its size (at least the number of rows) up front, and if you allocate a large array to be safe, much of that memory may be wasted.
So you can prefer ArrayList, which is part of the collection hierarchy.
Complexity is another issue. To insert into an array, you have to walk to the target index and move elements yourself; instead you can use LinkedList, where all these operations are already implemented and you only need to call them, which makes your code less complex. The collection hierarchy has various other advantages you can read about.
The Collections Framework is much higher level than arrays and provides important interfaces and classes with which we can manage groups of objects in a more sophisticated way, using the many methods each collection already provides.
For example:
ArrayList - It's like a dynamic array, i.e. we don't need to declare its size; it grows as we add elements to it, and its size decreases as we remove elements at run time (the capacity of the backing array is only reduced if trimToSize() is called).
LinkedList - It can be used to depict a Queue(FIFO) or even a Stack(LIFO).
HashSet - It stores its element by a process called hashing. The order of elements in HashSet is not guaranteed.
TreeSet - TreeSet is the best candidate when one needs to store a large number of sorted elements and their fast access.
ArrayDeque - It can also be used to implement a first-in, first-out(FIFO) queue or a last-in, first-out(LIFO) queue.
HashMap - HashMap stores the data in the form of key-value pairs, where key and value are objects.
TreeMap - TreeMap stores key-value pairs sorted in ascending order of their keys, and retrieval of an element from a TreeMap is quite fast.
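A few of the behaviours listed above can be checked with a short sketch. DequeAndTreeMapDemo and its method names are hypothetical; the calls themselves (addLast, pollFirst, push, pop, keySet) are standard library methods:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

class DequeAndTreeMapDemo {
    // FIFO: elements leave in the order they arrived.
    static String queueOrder() {
        Deque<String> queue = new ArrayDeque<>();
        queue.addLast("a");
        queue.addLast("b");
        queue.addLast("c");
        return queue.pollFirst() + queue.pollFirst() + queue.pollFirst();
    }

    // LIFO: the most recently pushed element leaves first.
    static String stackOrder() {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("a");
        stack.push("b");
        stack.push("c");
        return stack.pop() + stack.pop() + stack.pop();
    }

    // TreeMap iterates its keys in ascending order, whatever the insertion order.
    static String sortedKeys() {
        Map<String, Integer> map = new TreeMap<>();
        map.put("pear", 3);
        map.put("apple", 1);
        map.put("fig", 2);
        return String.join(",", map.keySet());
    }
}
```

Here queueOrder() yields "abc", stackOrder() yields "cba", and sortedKeys() yields "apple,fig,pear".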
To learn more about Java collections, check out this article.