Compression feature in java for cache system

Compression feature in java for cache system - java

I am building a cache that has to store as much data as possible. CPU is not a mayor issue, because the next level of data is a lot more expessive to reach than running the CPUs a little bit for decompression.
I'm looking for a good strategy and not a full implemenation. A typical object instance that should be cached can be gernalized as a list of hashmaps. The keys in these map are very similiar to keys in another map in that list. Keys and values are strings.
Maps in different caching objects (this means also different lists) may not always have similar keys. Maybe only a subset (50%) of the keys is the same.
I was thinking of extracting the keys into ONE header array and each collection of values of the hashmap into another array with the same length. This means the data array might be sparse (null-pointers). But I don't have to carry the meta data around. The possition in the data array is the only way of looking up the correct key.
Now I want to compress the data array. Compression won't really work well on a single data array because there is little information. It will need a few data arrays stuck together to get a good compression rate.
Is there any good way of compressing String-Arrays in java? How many of these data arrays should I cluster for good results?
Is there maybe some better aporoach? This is a open questions for collecting ideas, so please feel free to elaborate :-)

Flyweight can help
If are not compressing you can use Flyweight pattern to avoid the cost of the string-key repeated in each object.
Remember a string is an object so a key in your hashmap is a reference to it. If a lot of objects with the same property use references to the same string object you only have 4-bytes for each reference and only one string in memory.
How to ensure you are sharing the string objects between objects? You can use something similar to String.intern(). But please don't use String.intern() itself.
Interning a string is returning the same string-object for the same string value. You must hold a cache for those strings. The reason I don't recommend String.intern() is that the cache is the String class itself so it nevers get freed. But you can implement something analogous.
This code returns your own string if it's new. And returns the first one if it's not.
HashMap<String,String> internedStrings = new HashMap<String,String>();
syncrhonized String returnUniqueString(String str) {
String alreadyCached = internedStrings.get(str);
if (alreadyCached == null) {
internedStrings.put(str, str);
alreadyCached = str;
}
return alreadyCached;
}
But if you are compressing, not
Because compressing means you are serializing your object graphs and each property name will be serialized as a different string, so repeating itself. Maybe the compressed size doesn't grow too much because it's a repeated string but when you rehidrate the objects they will be created separately.
Maybe you can use the returnUniqueString at the time of rehidrating :)

This sounds like a good approach.
However, i suggest you consider a different way of breaking the map values into lists: rather than making a list for each map, make a list for each different key, containing the values for that key for each item.
For example, if your maps are:
1: {
colour: red,
size: small,
},
2: {
colour: blue,
flavour: strawberry
},
3: {
colour: red,
size: large,
flavour: strawberry
}
Then you decompose into:
colour: [red, blue, red]
size: [small, null, large]
flavour: [null, strawberry, strawberry]
This may seem a bit weird, but the point is that you cluster values of the same type together, which will help the compression be more efficient.

Related

What is the Clear Purpose of SparseBooleanArray?? [ I referred official android site for this ]

I referred the android doc site for "SparseBooleanArray" class but still not getting idea of that class about what is the purpose of that class?? For what purpose we need to use that class??
Here is the Doc Link
http://developer.android.com/reference/android/util/SparseBooleanArray.html

From what I get from the documentation it is for mapping Integer values to booleans.
That is, if you want to map, if for a certain userID a widget should be shown and some userIDs have already been deleted, you would have gaps in your mapping.
Meaning, with a normal array, you would create an array of size=maxID and add a boolean value to element at index=userID. Then when iterating over the array, you would have to iterate over maxID elements in the worst case and have to check for null if there is no boolean for that index (eg. the ID does not exist). That is really inefficient.
When using a hashmap to do that you could map the ID to the boolean, but with the added overhead of generating the hashvalue for the key (that is why it is called *hash*map), which would ultimately hurt performance firstly in CPU cycles as well as RAM usage.
So that SparseBooleanArray seems like a good middleway of dealing with such a situation.
NOTE: Even though my example is really contrieved, I hope it illustrates the situation.

Like the javadoc says, SparseBooleanArrays map integers to booleans which basically means that it's like a map with Integer as a key and a boolean as value (Map).
However it's more efficient to use in this particular case It is intended to be more efficient than using a HashMap to map Integers to Booleans
Hope this clears out any issues you had with the description.

I found a very specific and wonderful use for the sparse boolean array.
You can put a true or false value to be associated with a position in a list.
For example: List item #7 was clicked, so putting 7 as the key and true as the value.

There can be three ways to store resource id's
1 Array
Boolean array containing id's as indexes.If we have used that id set it to true else false
Though all the operations are fast but this implementation will require huge amount of space.So it can't be used
High Space Complexity
2 HashMap
Key-ID
Value-Boolean True/False
Using this we need to process each id using the hashing function which will consume memory.Also there may be some empty locations where no id will be stored and we also need to deal with crashes.So due to usage complexity and medium space complexity, it is not used.
Medium Space Complexity
3 SparseBooleanArray
It is middle way.It uses mapping and Array Implementation
Key - ID
Value - Boolean True/False
It is an ArrayList which stores id's in an increasing order.So minimum space is used as it only contains id's which are being used.For searching an id binary search is used.
Though Binary Search O(logn) is slower than hashing O(1) or Array O(1),i.e. all the operations Insertion, Deletion, Searching will take more time but there is least memory wastage.So to save memory we prefer SparseBoolean Array
Least Space Complexity

Looking for memory efficient design

I'm running some experiments over a large dataset and would like to optimize a particular part. Currently, I have 5-6 Models each of which stores a mapping from Topics to List of Strings. The set of Topics is large and the same between each Model, so there must be a better way. Ultimately the query I need to perform is: what is the String in position x of the List for some Model-Topic combination.
One of the problems with using the mapping method is that if there are say 500k-5M topics, each has a list of 20 strings. Then my Map<Model, Map<Topic, List<String>>> is going to be massive.

Have you tried SortedSet / Maps? Sounds like you need to optimize your search, sorted collections (like TreeMap) should be log(n) while regular list is O(1). Of course, this kind of thing is something at which databases excel...

Not clear where/how you want to achieve "memory efficiency". First one needs to look at the particulars of your detailed data to see how much storage that consumes, then examine various ways of organizing it and analyze their efficiency in terms of % overhead vs your "real" data.
A brief glance shows that a HashMap, when you consider the associated tables, has about 80 bytes of overhead per entry. An ArrayList looks to average out around 10-12. Without looking, I would guess that a TreeMap would be more than a HashMap -- maybe 100.
Generally speaking, links within your own objects will be "cheaper", both in storage and speed to access, than links using these aggregating objects. But the aggregating objects are convenient to use, and have been "optimized" to a degree.
(But looking at your update, you probably should be looking at a DB application, rather than holding everything in heap.)

You could use Topic and Model to construct a composite key in a single Map, e.g.
map.put(topic1_id + model1_id, list1_1);
map.put(topic1_id + model2_id, list1_2);
...
map.get(topic_id + model_id)
where the IDs are Strings (or a similar scheme could be used with numeric identifiers).
A similar approach is to assign each topic and model a unique number, then store the lists of strings in arrays, so looking up the list for a given combination is a matter of looking up two indexes, then accessing a given location in a 2D array. (however, this is easier when you know the number of topics and models in advance of constructing the data structure)
For memory efficiency, also consider the small details. In general, you want to minimise the number of Objects - each Object carries an overhead. ArrayLists can have a lot of wasted space as they grow dynamically, doubling in size when they exceed their current capacity. If you can pre-size them to the required capacity (or use an array instead) then you can save a lot of memory. The same applies when using large numbers of small HashMaps.

One possible data structure is a hierarchy of maps, leading to an array of Strings. E.g.:
HashMap<Model, HashMap<Topic, String[]>> map;
A query function would then look like:
public String query(Model model, Topic topic, int x) {
HashMap<Topic, String[]> childMap = map.get(model);
if (childMap == null) {
return null;
}
String[] list = childMap.get(topic);
if (list == null) {
return null;
}
return list[x];
}
Presuming your Model and Topic structures implement hashCode() and equals() reasonably, the query performance should be quite good.
One potential weakness: I'm assuming you need to index a large number of Model/Topic combinations, and related lists of Strings (if not, you presumably wouldn't be asking about optimization). My guess is that the child String[] arrays will consume a large amount of memory. Each array is a Java object (about 20 bytes) + a pointer at each array location.
2 suggestions there:
1) If many Model/Topic combinations share the same set of Strings, you could gain quite a lot by sharing those String[] instances.
2) If you're using a 64-bit VM, be sure to use compressed ordinary object pointers (-XX:+UseCompressedOops). That will at least keep most of the pointers to 4 bytes instead of 8. Compressed OOPs is the default since 1.6.0_23, so a relatively recent VM will save you some memory here.

One other possibility not mentioned is store the strings using String[][][] and models and topics in a List such as ArrayList and then at query time:
public String query(Model model, Topic topic, int x) {
return strings[models.indexOf(model)][topics.indexOf(topic)][x];
}
It could be further improved for speed if the topics and models were sorted, then binary search rather than indexOf could be used.

Hashcode for strings that can be converted to integer

I'm looking for the most effective way of creating hashcodes for a very specific case of strings.
I have strings that can be converted to integer, they vary from 1 to 10,000, and they are very concentrated on the 1-600 range.
My question is what is the most effective way, in terms of performance for retrieving the items from a collection to implement the hashcode for it.
What I'm thinking is:
I can have the strings converted to integer and use a direct acess table (an array of 10.000 rows) - this will be very fast for retrieving but not very smart in terms of memory allocation;
I can use the strings as strings and get a hashcode for it (i wont have to convert it to integer, but i dont know how effective will be the hashcode for the strings in terms of collisions)
Any other ideas are greatly appreciated.
thanks a lot
Thanks everyone for your promptly replies...
There is another information Tha i've forget to add on this. I tink it Will Make this clear if I let you know my final goal with this-I migh not even need a hash table!!!
I just want to validate a stream against a dictiory that is immutable. I want to check if a given tag might or might not be present on my message.
I will receive a string with several pairs tag=value. I want to verify if the tag must or must not be treated by my app.

You might want to consider a trie (http://en.wikipedia.org/wiki/Trie) or radix tree (http://en.wikipedia.org/wiki/Radix_tree). No need to parse the string into an integer, or compute a hash code. You're walking a tree as you walk the string.
Edit:
Both computing a hash code on a string and parsing an integer out of a string involve walking the entire the string, and THEN using that value as a look-up into a specific data structure. Other techniques might involve simultaneously inspecting the string WHILE traversing a data structure. This MIGHT be of value to the poster who asked for "other ideas".

Many collections (e.g. HashMap) already apply a supplemental "rehash" method to help with poor hashcode algorithms. e.g. browse the cource code for HashMap.hash(). And Strings are very common keys, so you can be sure that String.hashCode() is highly optimized. SO, unless you notice a lot of collisions between your hashCodes, I'd go with the standard code.
I tried putting the Strings for 0..600 into a HashSet to see what happened, but it's then pretty tedious to see how many entries had collisions. Look for yourself! If you really really care, copy the source code from HashMap into your own class, edit it so you can get access to the entries (in the Java 6 source code I'm looking at, that would be transient Entry[] table, YMMV), and add methods to count collisions.

If there are only a limited valid range of values, why not represent the collection as a int[10000] as you suggested? The value at array[x] is the number of times that x occurs.
If your strings are represented as decimal integers, then parsing them to strings is a 5-iteration loop (up to 5 digits) and a couple of additions and subtractions. That is, it is incredibly fast. Inserting the elements is effectively O(1), retrieval is O(1). Memory required is around 40kb (4 bytes per int).
One problem is that insertion order is not preserved. Maybe you don't care.
Maybe you could think about caching the hashcode and only updating it if your collection has changed since the last time hashcode() was called. See Caching hashes in Java collections?

«Insert disclaimer about only doing this when it's a hot spot in your application and you can prove it»
Well the integer value itself will be a perfect hash function, you will not get any collisions. However there are two problems with this approach:
HashMap doesn't allow you to specify a custom hash function. So either you'll have to implement you own HashMap or you use a wrapper object.
HashMap uses a bitwise and instead of a modulo operation to find the bucket. This obviously throws bits away since it's just a mask. java.util.HashMap.hash(int) tries to compensate for this but I have seen claims that this is not very successful. Again we're back to implementing your own HashMap.
Now that this point since you're using the integer value as a hash function why not use the integer value as a key in the HashMap instead of the string? If you really want optimize this you can write a hash map that uses int instead of Integer keys or use TIntObjectHashMap from trove.
If you're really interested in finding good hash functions I can recommend Hashing in Smalltalk, just ignore the half dozen pages where the author rants about Java (disclaimer: I know the author).

Java Memory Saving Techniques?

I have this 4 Dimensional array to store String values which are used to create a map which is then displayed on screen with the paintComponent. I have read many articles saying that using huge arrays is very inefficient. (Especially since the array dimensions are 16x16x3x3) I was wondering if there was any way to store the string values (I use them as ID values) differently to save memory or reduce fetching time. If you have any ideas or methods I would appreciate it. Thanks!

Well, if your matrix is full, that is every element contains data, then I believe an array is optimally efficient. But if your matrix is sparse, you could look into using more Linking-based data types.
the first thing I would do is to not use strings as IDs, use ints. It'll reduce the size of your structure a lot.
Also, that array really isn't that big, I wouldn't worry about efficiency if that's the only data structure you have. It's only 2304 elements large.

First off, 16*16*3*3 = 2304 - quite modest really. At this size I'd be more worried about the confusion likely to be caused by a 4D array than the size it is taking!
As others have said, if it fully populated, arrays are ok. If it has gaps, an ArrayList or similar would be better.
If the Strings are just IDs, why not store an enum (or even Integers) instead of a string?

Keep in mind that the String values are separate from the array. The array itself takes the same memory space regardless of what string values it links to. Accessing a specific address in your array will take the same amount of time regardless of what type of object you have saved there, or what the value of that object is.
However, if you find that many of your string values represent exactly the same string, you can avoid having multiple copies of the same string by leveraging String.intern(). If you store the interned string, and you don't have any other references to the non-interned string, that frees the non-interned string up to be garbage-collected. Your array will then have multiple entries that point to the same memory space, rather than different memory addresses with equivalent string objects.
See also:
Is it good practice to use java.lang.String.intern()?
http://www.codeinstructions.com/2009/01/busting-javalangstringintern-myths.html
Depending on the requirements of your IDs, you may also want to look into using a different data structure than strings. For example, while the array itself would be the same size, storing int values would avoid the need to allocate extra space for each individual entry.
Also, a 4-dimensional array may not be the best data structure for your needs in the first place. Can you describe why you've chosen this data structure for what you're trying to represent?

The Strings only take up the space of a reference in each array element. There could be a savings if the strings come from a very small set of values. A more important question is are your 4-dimensional arrays sparse or mostly filled? If you have very few values actually specified then you might have a big savings replacing the 4-d array with a Map from the indicies to the String. Let me know if you want a code sample.

Do you actually have a 4D array of 16x16x3x3 (i.e. 2k) string objects? That doesn't sound that big to me. An array is the most efficient way to store a collection of objects, in terms of memory. An ArrayList can be slightly less efficient (up to 50% wasted space).
The only other way I can think of is to store the Strings end-to-end in one giant String and then use substring() to get the bit you need from that, but you would still need to store the indexes somewhere.
Are you running out of memory? If so check that the Strings in your array are the size you think they are - the backing array of a String instance in Java can be much larger than the string itself. If you do subString() on a 1 GB string, the returned string instance shares the 1 GB array of the first string so will keep it from being GC'd longer than you might expect.

Mapping of strings to integers

What is the easiest way in Java to map strings (Java String) to (positive) integers (Java int), so that
equal strings map to equal integers, and
different strings map to different integers?
So, similar to hashCode() but different strings are required to produce different integers. So, in a sense, it would be a hasCode() without the collision possibility.
An obvious solution would maintain a mapping table from strings to integers,
and a counter to guarantee that new strings are assigned a new integer. I'm just wondering
how is this problem usually solved.
Would also be interesting to extend it to other objects than strings.

Have a look at perfect hashing.

This is impossible to achieve without any restrictions, simply because there are more possible Strings than there are integers, so eventually you will run out of numbers.
A solution is only possible when you limit the number of usable Strings. Then you can use a simple counter. Here is a simple implementation where all (2^32 = 4294967296 different strings) can be used. Never mind that it uses lots of memory.
import java.util.HashMap;
import java.util.Map;
public class StringToInt {
private Map<String, Integer> map;
private int counter = Integer.MIN_VALUE;
public StringToInt() {
map = new HashMap<String, Integer>();
}
public int toInt(String s) {
Integer i = map.get(s);
if (i == null) {
map.put(s, counter);
i = counter;
++counter;
}
return i;
}
}

There's not going to be an easy or complete solution. We use hashes because there are way more possible Strings than there are ints. Collisions are just a limitation of using a finite number of bits to represent integers.

In most hashcode() type implementations, collisions are accepted as inevitable and tested for.
If you absolutely must have no collisions, guaranteed, the solution you outline will work.
Aside from this, there are cryptographic hash functions such as MD5 and SHA, where collisions are extremely unlikely (though with a lot of effort can be forced). The Java Cryptography Architecture has implementations of these. Those methods may perhaps be faster than a good implementation of your solution for very large sets. They will also execute in constant time and give the same code for the same string, no matter which order the strings are added in. Also, it doesn't require storing each string. Crypto hash results could be considered as integers but they won't fit in a java int - you could use a BigInteger to hold them as suggested in another answer.
Incidentally, if you're put off by the idea of a collision being 'extremely unlikely', it's probably similar likelihood that a bit would randomly flip in your computer memory or hard disk and cause any program to behave differently than you expect :-)
Note, there are also some theoretical weaknesses in some hash functions (e.g. MD5) but for your purposes that probably doesn't matter and you could just use the most efficient such function - those weaknesses are only relevant if someone is maliciously trying to come up with strings that have the same code as another string.
edit: I just noticed in the title of your question, it seems you want bidirectional mapping, though you don't actually state this in the question. It is (by design) not possible to go from a Crypto hash to the original string. If you really need that, you'd have to store a map keying hashes back to strings.

I'd try to do by introducing an object holding Map and Map. Adding Strings to that object (or maybe having them created from said object) will assign them an Integer value. Requesting a Integer value for a String already registered will return the same value.
Drawbacks: Different launches will yield different Integers for the same String, depending on order unless you somehow persist the whole thing. Also, it's not very object oriented and requires a special object to create/register a String.
Plus side: It's quite similar to internalizing Strings and easily understandable. (Also, you asked for an easy, not elegant way.)
For the more general case, you might create a high level subclass of Object, introduce a "integerize" method there and extend every single class from that. I think, however, that road leads to tears.

Since Strings in java are unbounded in length, and each character has 16 bits, and ints have 32 bits, you could only produce a unique mapping of Strings to ints if the Strings were up to two characters. But you could use BigInteger to produce a unique mapping, with something like:
String s = "my string";
BigInteger bi = new BigInteger(s.getBytes());
Reverse mapping:
String str = new String(bi.toByteArray());

Can you use a Map to indicate which Strings you already have assigned integers to? That's kind of the "database-y" solution, where you assign each String a "primary key" from a sequence as it comes up. Then you put the String and Integer pair into a Map so you can look it up again. And if you need the String for a given Integer, you can also put the same pair into a Map.

As you outline, a hash table that resolves collisions is a standard solution. You could also use a Bentley/Sedgewick style search trie, which in many applications is faster than hashing.
If you substitute 'unique pointer' for 'unique integer' you can see Dave Hanson's solution to this problem in C. This is quite a nice abstraction because
The pointers can still be used as C strings.
Equal strings hash to equal pointers, so strcmp can be dispensed with in favor of pointer equality, and the pointers can be used as keys in other hash tables.
If Java offers a test for object identity on String objects then you can play the same game there.

If by integer you mean the data type, then as other posters have explained this is quite impossible, due to the fact that the integer data type is of fixed size, and strings are unbound.
However if you simply mean a positive number, then theoretically you should be able to interpret the string as if it were an "integer" simply by regarding it as a byte array (in a consistent encoding). You could also treat it as an array of integers of arbitrary length, but if you can do that why not just use a string? :)
Implementation speaking, this is usually "solved" by using a hash code and simply double-checking any collisions, since there are likely to be none anyway and on the off chance there is a collision, it still works out to be constant time. However if this isn't applicable, I'm not sure what the best solution would be.
Interesting question.

I don't know if this is practical, but if we take only lowercase letter alphabet, than every word can be viewed as a number in 26-base positional system. For example, if a is 0 and z is 25 than boom is 1*26^3 + 14*26^2 + 14*26^1 + 12*26^0 = 27416

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.