I'm currently implementing a generic model for pivot-like data visualization in ColdFusion 9.
I'm not interested in supporting multiple measures and the model exposes a numeric valueAt(string colKey, string rowKey) function that can be called by a view in order to retrieve the resulting aggregation of a measure based on column and row dimensions.
For example, with the data set below, if the measure is AVG(Age) and the column dimension is Rank, then model.valueAt('3', '') would return 2.33.
Wine    Age    Rank
WineA   3      3
WineB   4      2
WineC   2      3
WineD   2      3
Now, the data structure that naturally came to my mind was to use a java.util.HashMap to store the computed data, using a combination of column and row values converted to string as keys. This means that depending on the data set, I might potentially have a very large number of keys that will start with the same prefix.
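To make that concrete, here is a minimal sketch of the composite-string-key approach (valueAt comes from the question; the '|' separator and the PivotCache name are just placeholders, not part of the actual model):
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: computed cells keyed by "colKey|rowKey" strings.
class PivotCache {
    private final Map<String, Double> cells = new HashMap<>();

    void store(String colKey, String rowKey, double aggregate) {
        cells.put(colKey + "|" + rowKey, aggregate);
    }

    // e.g. valueAt("3", "") -> 2.33 for AVG(Age) with Rank as the column dimension
    double valueAt(String colKey, String rowKey) {
        Double v = cells.get(colKey + "|" + rowKey);
        return v != null ? v : 0.0;
    }
}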
I purposely created a large data set (1 million entries) with multiple strings having the same prefix and checked the percentage of bucket collisions I would get using the default java String.hashCode() algorithm and MurmurHash3.
Here's how I build the data set sample:
<cfset maxItemsCount = 1000000>
<cfset tokens = ['test', 'one', 'two', 'tree', 'four', 'five']>
<cfset tokensLen = arrayLen(tokens)>
<cfset items = []>
<cfset loopCount = 1>
<cfloop condition="arrayLen(items) lt maxItemsCount">
    <cfset item = ''>
    <cfloop from="1" to="#tokensLen#" index="i">
        <!--- each pass extends the previous key, so the generated keys share a growing prefix --->
        <cfset item = listAppend(item, tokens[i] & loopCount, '_')>
        <cfset arrayAppend(items, item)>
    </cfloop>
    <cfset ++loopCount>
</cfloop>
With the bucket array sized to twice the entry count, I got 27% collisions with String.hashCode() and 22% with Murmur. Storing and then retrieving each key once in a java.util.HashMap took around 2580 milliseconds.
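For reference, the collision percentage can be measured roughly like this (a sketch; HashMap itself spreads the hash and masks it against a power-of-two table, floorMod is used here so the sketch works for any capacity):
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CollisionCounter {
    // Counts how many keys land in a bucket that is already occupied.
    static double collisionPercent(List<String> keys, int capacity) {
        Set<Integer> usedBuckets = new HashSet<>();
        int collisions = 0;
        for (String key : keys) {
            int h = key.hashCode();
            h ^= (h >>> 16);                             // bit spreading similar to java.util.HashMap
            int bucket = Math.floorMod(h, capacity);     // bucket index for this key
            if (!usedBuckets.add(bucket)) {
                collisions++;
            }
        }
        return 100.0 * collisions / keys.size();
    }
}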
I'm looking for ideas on how to improve performance, whether by using a different data structure (nested hash maps, perhaps?) or by finding a way to reduce the number of collisions, without changing the API signature.
Thanks!
With a million entries there will always be some collisions (unless your bucket array is far, far longer, on the order of 1e12 entries :D). I'd guess that MurmurHash already does a near-perfect job here, but you could try MD5 for comparison (which is about as close to a guaranteed-perfect job as you can get).
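If you want to run the MD5 comparison, one quick way to derive an int hash from the digest looks like this (just a measurement sketch, not something to use as a production hash function):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Hash {
    // Derive a 32-bit hash from the first four bytes of the MD5 digest,
    // purely to compare bucket-collision rates against String.hashCode().
    static int hash(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return ((digest[0] & 0xFF) << 24)
                 | ((digest[1] & 0xFF) << 16)
                 | ((digest[2] & 0xFF) << 8)
                 |  (digest[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }
}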
Now, the data structure that naturally came to my mind was to use a java.util.HashMap to store the computed data, using a combination of column and row values converted to string as keys. This means that depending on the data set, I might potentially have a very large number of keys that will start with the same prefix.
You're concatenating Strings and so producing quite some garbage. It may be better to create a
@Value
static class Key {
    private final String row;
    private final String column;
}
as a key for your HashMap, where @Value is a Lombok annotation generating all the boring stuff like equals, hashCode and the constructor.
You can do easily without Lombok and even a bit better:
static class Key {
    Key(String row, String column) {
        // Do NOT use 31 as a multiplier as it increases the number of collisions!
        // Try Murmur, too.
        hashCode = row.hashCode() + 113 * column.hashCode();
        this.row = row;
        this.column = column;
    }

    public int hashCode() {
        return hashCode;
    }

    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Key)) return false;
        Key that = (Key) o;
        // Check hashCode first.
        if (this.hashCode != that.hashCode) return false;
        if (!this.row.equals(that.row)) return false;
        if (!this.column.equals(that.column)) return false;
        return true;
    }

    private final int hashCode;
    private final String row;
    private final String column;
}
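A quick usage sketch against the example data from the question (the 2.33 average from above, stored under rowKey '' and colKey '3'):
Map<Key, Double> cells = new HashMap<>();
cells.put(new Key("", "3"), 2.33);          // rowKey "", colKey "3" -> AVG(Age) = 2.33
Double avg = cells.get(new Key("", "3"));   // no per-lookup string concatenation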
Related
I am brute-forcing one game and I need to store data for all positions and outcomes. The data will likely be hundreds of GB in size. I considered SQL, but I am afraid that lookups in a tight loop might kill performance. The program will iterate over possible positions, return the winning move if one is known, return the longest losing sequence if all moves are known to lose, and otherwise check the outcome of the unknown moves.
What is the best way to store a large Map<Long, Long[]> positionIdToBestMoves? I am considering SQL or data serialization.
I want to solve tiny checkers by brute-forcing all viable moves in Java. The upper limit on positions is around 100 billion. Most of them are not plausible (i.e. they would contain more pieces than were present at the beginning of the game); around 10 billion is a reasonable estimate. The Map<Long, Long[]> maps a Long positionID to a Long whiteToMove and a Long blackToMove. A positive value indicates that the position is winning and that the move leading to the position stored in the value should be chosen. A negative value -n means the position is losing in at most n moves.
Search itself would have a recursion like this:
// this is a stub
private Map<Long, Long[]> boardBook = ...

// assuming that all winning positions are known
public Long nextMove(Long currentPos, int whiteOrBlack) {
    Set<Long> validMoves = calculateValidMoves(currentPos, whiteOrBlack);

    boolean hasWinner = checkIfValidMoveIsKnownToWin(validMoves, whiteOrBlack);
    if (hasWinner) { // there is a winning move - play it
        Long winningMove = getWinningMove(validMoves, whiteOrBlack);
        boardBook.get(currentPos)[whiteOrBlack] = winningMove;
        return winningMove;
    }

    boolean areAllPositionsKnown = checkIfAllPositionsKnown(validMoves, whiteOrBlack);
    if (areAllPositionsKnown) { // all moves are losing.. choose the longest struggle
        Long longestSequenceToDefeat = findPositionToLongestSequenceToDefeat(validMoves, whiteOrBlack);
        int numberOfStepsToDefeat = boardBook.get(longestSequenceToDefeat)[whiteOrBlack].intValue(); // (not used further in this stub)
        boardBook.get(currentPos)[whiteOrBlack] = longestSequenceToDefeat;
        return longestSequenceToDefeat;
    }

    Set<Long> movesToCheck = getUntestedMoves(validMoves, whiteOrBlack);
    Long longestStruggle = null;
    int maxNumberOfMovesToDefeat = -1;
    for (Long moveToCheck : movesToCheck) {
        Long result = nextMove(moveToCheck, whiteOrBlack);
        if (result > 0) { // just discovered a winning move
            boardBook.get(currentPos)[whiteOrBlack] = moveToCheck;
            return moveToCheck;
        } else {
            int numOfMovesToDefeat = (int) (-1 * boardBook.get(moveToCheck)[whiteOrBlack]);
            if (numOfMovesToDefeat > maxNumberOfMovesToDefeat) {
                maxNumberOfMovesToDefeat = numOfMovesToDefeat;
                longestStruggle = moveToCheck;
            }
        }
    }
    boardBook.get(currentPos)[whiteOrBlack] = (long) (-1 * maxNumberOfMovesToDefeat);
    return longestStruggle;
}
You may want to look at Chronicle. It's a highly optimized key-value store and it should fit your purpose.
Or you can write the storage yourself, but you will still end up with something like a map backed by a memory-mapped file under the hood.
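To illustrate the "map backed by a memory-mapped file" idea, here is a minimal sketch with plain java.nio (all names are made up for the example; a real store for sparse 64-bit position ids would still need hashing or paging on top, which is exactly the bookkeeping Chronicle does for you):
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: a flat, memory-mapped table of fixed-size records (two longs per slot),
// addressed directly by a slot index. A single MappedByteBuffer is limited to 2 GB,
// so a real store would split the data across multiple mappings.
class MappedMoveTable {
    private static final int RECORD_BYTES = 2 * Long.BYTES;
    private final MappedByteBuffer buffer;

    MappedMoveTable(Path file, int slots) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, (long) slots * RECORD_BYTES);
        }
    }

    void put(int slot, long whiteToMove, long blackToMove) {
        int offset = slot * RECORD_BYTES;
        buffer.putLong(offset, whiteToMove);
        buffer.putLong(offset + Long.BYTES, blackToMove);
    }

    long[] get(int slot) {
        int offset = slot * RECORD_BYTES;
        return new long[] { buffer.getLong(offset), buffer.getLong(offset + Long.BYTES) };
    }
}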
I want to query multiple candidates for a search string which could look like "My sear foo".
Now I want to look for documents which have a field that contains one (or more) of the entered strings (split on whitespace).
I found some code which allows me to do a search by pattern:
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc); } } }")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(trim).endKey(trim + "\u9999");
    return db.queryView(viewQuery, DeviceEntityCouch.class);
}
which works quite nicely for looking up just one pattern. But how do I have to modify my code to get a multiple-contains on doc.serialNumber?
EDIT:
This is my current workaround, but I guess there must be a better way.
Also, it only implements OR logic, so an entry only has to match term1 or term2 to end up in the list.
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc); } } }")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    String[] split = trim.split(" ");
    List<DeviceEntityCouch> list = new ArrayList<>();
    for (String s : split) {
        ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
        list.addAll(db.queryView(viewQuery, DeviceEntityCouch.class));
    }
    return list;
}
Looks like you are implementing a full-text search here. That's not going to be very efficient in CouchDB (and I guess the same applies to other databases).
Correct me if I am wrong, but from looking at your code it seems you are trying to search a list of serial numbers for a pattern. CouchDB (or any other database) is quite efficient if you can somehow index the data you will be searching for.
Otherwise you must fetch every single record and perform a string comparison on it.
The only way I can think of to optimize this in CouchDB would be something like the following (with assumptions):
Your serial numbers are not very long (say 20 chars?)
You force the search term to always be 5 characters
Generate a view that emits every single 5-char-long substring of your serial number - more or less like this (it could be optimized, and I'm not sure I got the indexing exactly right):
...
for (var i = 0; doc.serialNo.length > 5 && i < doc.serialNo.length - 5; i++) {
    emit([doc.serialNo.substring(i, i + 5), doc._id]);
}
...
Use the _count reduce function.
Now the following url:
http://localhost:5984/test/_design/serial/_view/complex-key?startkey=["01234"]&endkey=["01234",{}]&group=true
will return one row per matching [ngram, docId] key, each with a hit count, for the search key 01234.
If you don't group and set the reduce option to be false, you will get a list of all matches, including duplicates if a single doc has multiple hits.
Refer to http://ryankirkman.com/2011/03/30/advanced-filtering-with-couchdb-views.html for the information about complex keys lookups.
I am not sure how efficient CouchDB is in terms of updating that view. It depends on how many records you will have and how many new entries appear between view queries (I understand CouchDB rebuilds the view's B-tree on demand).
I have generated a view like that which splits doc ids into 5-char-long keys. Out of just over 1K docs it generated over 30K rows - the ids are 32 chars long, so it's simple maths really: (serialNo.length - searchableKey.length + 1) * docsCount.
Generating the view took a while, but the lookups were fast.
You could generate keys of multiple lengths, etc. All comes down to your records count vs speed of lookups.
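Wired back into the Ektorp repository from the question, a complex-key lookup against that n-gram view might look roughly like this (a sketch; ComplexKey is Ektorp's helper for array keys, and both the view name find_by_serial_ngram and the method name are placeholders):
// Sketch: query the 5-char n-gram view for one term. With reduce disabled you get one
// row per hit; includeDocs pulls the matching documents back (or use group=true for counts).
public List<DeviceEntityCouch> findBySerialNgram(String term) {
    String ngram = term.substring(0, Math.min(5, term.length()));
    ViewQuery query = createQuery("find_by_serial_ngram")
            .startKey(ComplexKey.of(ngram))
            .endKey(ComplexKey.of(ngram, ComplexKey.emptyObject()))
            .reduce(false)
            .includeDocs(true);
    return db.queryView(query, DeviceEntityCouch.class);
}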
I am trying to write a query such as this:
select {r: referrers(f), count:count(referrers(f))}
from com.a.b.myClass f
However, the output doesn't show the actual objects:
{
count = 3.0,
r = [object Object]
}
Removing the Javascript Object notation once again shows referrers normally, but they are no longer compartmentalized. Is there a way to format it inside the Object notation?
I see that you asked this question a year ago, so I don't know if you still need the answer, but since I was searching around for something similar, I can answer it. The problem is that referrers(f) returns an enumeration, so it doesn't really translate well when you try to put it into your object literal. I was doing a similar type of analysis where I was trying to find unique char arrays (counting the unique combinations of char arrays up to the first 50 characters). What I came up with was this:
var counts = {};
filter(
    map(
        unique(
            map(
                filter(heap.objects('char[]'), "it.length > 50"), // filter out strings less than 50 chars in length
                function(charArray) { // chop the string at 50 chars and then count the unique combos
                    var subs = charArray.toString().substr(0, 50);
                    if (!counts[subs]) {
                        counts[subs] = 1;
                    } else {
                        counts[subs] = counts[subs] + 1;
                    }
                    return subs;
                }
            ) // map
        ), // unique
        function(subs) { // map the strings into an array that has the string and the counts of that string
            return { string: subs, count: counts[subs] };
        }
    ), // map
    "it.count > 5000" // filter out strings that have counts < 5000
);
This essentially shows how to take an enumeration (heap.objects('char[]') in this case) and filter it and map it so that you can compute statistics on it. Hope this helps someone.
I want to be able to ask an object "how many events have occurred in the last x seconds", where x is an argument.
e.g. how many events have occurred in the last 120 seconds.
My approach below is linear in the number of events, but I'd like to know the most efficient way (in space and time) to meet this requirement:
public class TimeSinceStat {

    private List<DateTime> eventTimes = new ArrayList<>();

    public void apply() {
        eventTimes.add(DateTime.now());
    }

    public int eventsSince(int seconds) {
        DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
        for (int i = 0; i < eventTimes.size(); i++) {
            DateTime dateTime = eventTimes.get(i);
            if (dateTime.compareTo(startTime) > 0)
                return eventTimes.subList(i, eventTimes.size()).size();
        }
        return 0;
    }
}
(PS - I'm using JodaTime for the date/time representation.)
Edit:
The key of this algorithm is to find all events that have happened in the last x seconds; the exact start time (e.g. now - 30 seconds) may or may not be in the collection.
Store the DateTimes in a TreeSet and then use tailSet to get the most recent events. This saves you from having to find the starting point by iteration (which is O(n)); instead it is found by search (which is O(log n)).
TreeSet<DateTime> eventTimes;

public int eventsSince(int seconds) {
    return eventTimes.tailSet(DateTime.now().minus(Seconds.seconds(seconds)), true).size();
}
Of course, you could also binary search on your sorted list, but this does the work for you.
Edit
If it's a concern that multiple events could occur at the same DateTime, you can take the exact same approach with a SortedMultiset from Guava:
TreeMultiset<DateTime> eventTimes;

public int eventsSince(int seconds) {
    return eventTimes.tailMultiset(
            DateTime.now().minus(Seconds.seconds(seconds)),
            BoundType.CLOSED
    ).size();
}
Edit x2
Here's a much more efficient approach that leverages the fact that you only log events that happened after all other events. With each event, store the number of events up to that date:
NavigableMap<DateTime, Integer> eventCounts = initEventMap();

public NavigableMap<DateTime, Integer> initEventMap() {
    TreeMap<DateTime, Integer> map = new TreeMap<>();
    // prime the map to make subsequent operations much cleaner
    map.put(DateTime.now().minus(Seconds.seconds(1)), 0);
    return map;
}

private int totalCount() {
    // you can handle the edge condition here
    return eventCounts.lastEntry().getValue();
}

public void logEvent() {
    eventCounts.put(DateTime.now(), totalCount() + 1);
}
Then getting the count since a date is super efficient: just take the total and subtract the count of events that occurred before that date.
public int eventsSince(int seconds) {
    DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
    return totalCount() - eventCounts.lowerEntry(startTime).getValue();
}
This eliminates the inefficient iteration; all that's left is a lookup of the running total and an O(log n) lookup for the start time.
If you were implementing a data structure from scratch, and the data are not in sorted order, you'd want to construct a balanced order statistic tree (also see code here). This is just a regular balanced tree with the size of the tree rooted at each node maintained in the node itself.
The size fields enable efficient calculation of the "rank" of any key in the tree. You can do the desired range query by making two O(log n) probes into the tree for the rank of the min and max range values, finally taking their difference.
The proposed tree and set tail operations are great, except that the tail views need time to construct, even though all you need is their size. The asymptotic complexity is the same as the OST, but the OST avoids this overhead completely. The difference could be meaningful if performance is very critical.
Of course I'd definitely use the standard library solution first and consider the OST only if the speed turned out to be inadequate.
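For the curious, the rank query is tiny once every node carries a subtree size. A minimal sketch (unbalanced for brevity, long keys assumed; a real implementation keeps the size fields correct through the AVL/red-black rotations):
// Each node stores the size of the subtree rooted at it.
class Node {
    final long key;
    int size = 1;
    Node left, right;
    Node(long key) { this.key = key; }
}

class OrderStatisticTree {
    private Node root;

    private static int size(Node n) { return n == null ? 0 : n.size; }

    // Naive, unbalanced insert that keeps the size fields up to date.
    void insert(long key) { root = insert(root, key); }

    private Node insert(Node n, long key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else n.right = insert(n.right, key);
        n.size = 1 + size(n.left) + size(n.right);
        return n;
    }

    // Number of keys strictly less than 'key' - one O(log n) descent.
    int rank(long key) {
        int rank = 0;
        Node n = root;
        while (n != null) {
            if (key <= n.key) {
                n = n.left;
            } else {
                rank += size(n.left) + 1; // the whole left subtree plus this node
                n = n.right;
            }
        }
        return rank;
    }

    // Count of keys in [lo, hi): two rank probes, no tail view to build.
    int countInRange(long lo, long hi) {
        return rank(hi) - rank(lo);
    }
}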
Since DateTime already implements the Comparable interface, I would recommend storing the data in a TreeMap instead; you could then use TreeMap#tailMap to get the sub-map of DateTimes that occur in the desired time window.
Based on your code:
public class TimeSinceStat {

    // just in case two or more events start at the "same time"
    private NavigableMap<DateTime, Integer> eventTimes = new TreeMap<>();
    // if this class needs to be used by multiple threads, use ConcurrentSkipListMap instead of TreeMap

    public void apply() {
        DateTime dateTime = DateTime.now();
        Integer times = eventTimes.containsKey(dateTime) ? eventTimes.get(dateTime) + 1 : 1;
        eventTimes.put(dateTime, times);
    }

    public int eventsSince(int seconds) {
        DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
        NavigableMap<DateTime, Integer> eventsInRange = eventTimes.tailMap(startTime, true);
        int counter = 0;
        for (Integer time : eventsInRange.values()) {
            counter += time;
        }
        return counter;
    }
}
Assuming the list is sorted, you could do a binary search. Java Collections already provides Collections.binarySearch, and DateTime implements Comparable (according to the JodaTime JavaDoc). binarySearch returns the index of the value you want if it exists in the list; otherwise it returns (-(insertion point) - 1), from which you can recover the index of the greatest value less than the one you want. So, all you need to do in your eventsSince method is:
// find the time you want
int index = Collections.binarySearch(eventTimes, startTime);
if (index < 0) index = -(index + 1) - 1; // index of the greatest value < startTime (may be -1)
// skip over duplicates of startTime
while (index >= 0 && index != eventTimes.size() - 1
        && eventTimes.get(index).equals(eventTimes.get(index + 1))) {
    index++;
}
// everything after 'index' happened strictly after startTime
return eventTimes.size() - index - 1;
This should be a faster way to do what you want.
I am writing a class that, when called, uses the system time to generate a unique 8-character alphanumeric reference ID. But I fear that at some point multiple calls might be made in the same millisecond, resulting in the same reference ID. How can I protect this call to the system time from multiple threads that might call this method simultaneously?
System time is an unreliable source for unique IDs. That's it. Don't use it.
You need some form of a permanent source (UUID uses SecureRandom, whose seed is provided by the OS).
The system time may jump backwards even by a few milliseconds and screw up your logic entirely. If you can tolerate only 64 bits, you can either use a High/Low generator, which is a very good compromise, or cook your own recipe: say 18 bits of days since the beginning of 2012 (you have over 700 years to go) and then 46 bits of randomness coming from SecureRandom - not the best solution and technically it may fail, but it doesn't require external persistence.
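A rough sketch of that day/random recipe (assumptions on my part: java.time for the day count, the epoch pinned to 2012-01-01, and the 18 day bits packed above the 46 random bits):
import java.security.SecureRandom;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

class TimePlusRandomId {
    private static final LocalDate EPOCH = LocalDate.of(2012, 1, 1);
    private static final SecureRandom RANDOM = new SecureRandom();

    // 18 bits of "days since 2012" (enough for roughly 700 years) in the high bits,
    // 46 bits of SecureRandom output in the low bits.
    static long nextId() {
        long days = ChronoUnit.DAYS.between(EPOCH, LocalDate.now()) & ((1L << 18) - 1);
        long random = RANDOM.nextLong() & ((1L << 46) - 1);
        return (days << 46) | random;
    }
}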
I'd suggest adding the thread ID to the reference ID. This makes the reference less likely to collide. However, even within a single thread, consecutive calls to a time source may deliver identical values. Even calls to the highest-resolution source (QueryPerformanceCounter) may return identical values on certain hardware. A possible solution to this problem is to test the collected time value against its predecessor and add an increment to the "time-stamp". You may need more than 8 characters if this is supposed to stay human readable.
The most efficient source for a timestamp is the GetSystemTimeAsFileTime API. I wrote some details in this answer.
You can use the UUID class to generate the bits for your ID, then use some bitwise operators and Long.toString to convert it to base-36 (alpha-numeric).
public static String getId() {
    UUID uuid = UUID.randomUUID();
    // This is the time-based long, and is predictable
    long msb = uuid.getMostSignificantBits();
    // This contains the variant bits, and is random
    long lsb = uuid.getLeastSignificantBits();
    long result = msb ^ lsb; // XOR
    String encoded = Long.toString(result, 36);
    // Remove sign if negative
    if (result < 0)
        encoded = encoded.substring(1, encoded.length());
    // Trim extra digits or pad with zeroes
    if (encoded.length() > 8) {
        encoded = encoded.substring(encoded.length() - 8, encoded.length());
    }
    while (encoded.length() < 8) {
        encoded = "0" + encoded;
    }
    return encoded;
}
Since your ID space is still much smaller than a UUID's, this isn't foolproof. Test it with this code:
public static void main(String[] args) {
    Set<String> ids = new HashSet<String>();
    int count = 0;
    for (int i = 0; i < 100000; i++) {
        if (!ids.add(getId())) {
            count++;
        }
    }
    System.out.println(count + " duplicate(s)");
}
For 100,000 IDs the code performs well pretty consistently and is very fast. I started getting duplicate IDs when I increased by another order of magnitude to 1,000,000, so I modified the trimming to take the end of the encoded string instead of the beginning, which greatly improved the duplicate rate. Now generating 1,000,000 IDs isn't producing any duplicates for me.
Your best bet may still be to use a synchronized counter like AtomicInteger or AtomicLong and encode the number from that in base-36 using the code above, especially if you plan on having lots of IDs.
Edit: Counter approach, in case you want it:
public class IdGenerator {

    private final AtomicLong counter;

    public IdGenerator(int start) {
        // start could also be initialized from a file or other
        // external source that stores the most recently used ID
        counter = new AtomicLong(start);
    }

    public String getId() {
        long result = counter.getAndIncrement();
        String encoded = Long.toString(result, 36);
        // Remove sign if negative
        if (result < 0)
            encoded = encoded.substring(1, encoded.length());
        // Trim extra digits or pad with zeroes
        if (encoded.length() > 8) {
            encoded = encoded.substring(0, 8);
        }
        while (encoded.length() < 8) {
            encoded = "0" + encoded;
        }
        return encoded;
    }
}
This code is thread-safe and can be accessed concurrently.
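A quick usage sketch:
IdGenerator generator = new IdGenerator(0);
String referenceId = generator.getId(); // "00000000", then "00000001", ...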