How to efficiently store a large Java map?

I am brute-forcing a game and I need to store data for all positions and outcomes. The data will likely be hundreds of GB in size. I considered SQL, but I am afraid that lookups in a tight loop might kill performance. The program will iterate over possible positions, return the winning move if one is known, return the longest losing sequence if all moves are known to lose, and check the outcome of unknown moves.
What is the best way to store a large Map<Long,Long[]> positionIdToBestMoves? I am considering SQL or data serialization.
I want to solve tiny checkers by brute-forcing all viable moves in Java. The upper limit on positions is around 100 billion. Most of them are not plausible (i.e. they would contain more pieces than were present at the beginning of the game), so some 10 billion is a reasonable estimate. The Map<Long, Long[]> maps a Long positionID to a Long whiteToMove and a Long blackToMove. A positive value indicates that the position is winning and that the move leading to the position stored in the value should be chosen. A negative value -n means the position is losing in at most n moves.
Search itself would have a recursion like this:
// this is a stub
private Map<Long, Long[]> boardBook = ...

// assuming that all winning positions are known
public Long nextMove(Long currentPos, int whiteOrBlack) {
    Set<Long> validMoves = calculateValidMoves(currentPos, whiteOrBlack);

    boolean hasWinner = checkIfValidMoveIsKnownToWin(validMoves, whiteOrBlack);
    if (hasWinner) { // there is a winning move - play it
        Long winningMove = getWinningMove(validMoves, whiteOrBlack);
        boardBook.get(currentPos)[whiteOrBlack] = winningMove;
        return winningMove;
    }

    boolean areAllPositionsKnown = checkIfAllPositionsKnown(validMoves, whiteOrBlack);
    if (areAllPositionsKnown) { // all moves are losing... choose the longest struggle
        Long longestSequenceToDefeat = findPositionToLongestSequenceToDefeat(validMoves, whiteOrBlack);
        boardBook.get(currentPos)[whiteOrBlack] = longestSequenceToDefeat;
        return longestSequenceToDefeat;
    }

    Set<Long> movesToCheck = getUntestedMoves(validMoves, whiteOrBlack);
    Long longestStruggle = null;
    int maxNumberOfMovesToDefeat = -1;
    for (Long moveToCheck : movesToCheck) {
        Long result = nextMove(moveToCheck, whiteOrBlack);
        if (result > 0) { // just discovered a winning move
            boardBook.get(currentPos)[whiteOrBlack] = result;
            return result;
        } else {
            int numOfMovesToDefeat = (int) (-1 * boardBook.get(moveToCheck)[whiteOrBlack]);
            if (numOfMovesToDefeat > maxNumberOfMovesToDefeat) {
                maxNumberOfMovesToDefeat = numOfMovesToDefeat;
                longestStruggle = moveToCheck;
            }
        }
    }
    boardBook.get(currentPos)[whiteOrBlack] = (long) (-1 * maxNumberOfMovesToDefeat);
    return longestStruggle;
}

You may want to look at Chronicle Map. It's a highly optimized key-value store and it should fit your purpose.
Or you can write the storage yourself, but you will still end up doing something like a map over a memory-mapped file under the hood.
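For illustration, here is a minimal sketch of what persisting the book with Chronicle Map 3's builder API might look like; the entry count, file name, and value layout (a two-element long[] holding the white/black entries) are assumptions, not part of the original question:

import java.io.File;
import java.io.IOException;
import net.openhft.chronicle.map.ChronicleMap;

public class BoardBookStore {
    public static ChronicleMap<Long, long[]> open() throws IOException {
        return ChronicleMap
                .of(Long.class, long[].class)
                .name("board-book")
                .entries(10_000_000_000L)        // rough upper bound on reachable positions
                .averageValue(new long[2])       // {whiteToMove, blackToMove}
                .createPersistedTo(new File("boardBook.dat"));
    }
}

Because the map lives off-heap and is backed by the file, lookups in a tight loop avoid both GC pressure and SQL round-trips.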

Related

Fast Forward implementation in Realtime Audio byte array in java

I am managing audio capture and playback using the Java Sound API (TargetDataLine and SourceDataLine). Now suppose, in a conference environment, one participant's audio queue grows larger than the jitter size (due to processing or network delays) and I want to fast-forward the audio bytes I have for that participant to make the queue shorter than the jitter size.
How can I fast forward the audio byte array of that participant?
I can't do it during playback, as the player thread normally dequeues one frame from every participant's queue and mixes them for playing. The only way I can see is to dequeue more than one frame of that participant and mix(?) them for fast-forwarding before mixing with the other participants' single dequeued frames?
Thanks in advance for any kind of help or advice.
There are two ways to speed up the playback that I know of. In one case, the faster pace creates a rise in pitch. The coding for this is relatively easy. In the other case, pitch is kept constant, but it involves a technique of working with sound granules (granular synthesis), and is harder to explain.
For the situation where maintaining the same pitch is not a concern, the basic plan is as follows: instead of advancing by single frames, advance by a frame plus a small increment. For example, let's say that advancing 1.1 frames at a time over the course of 44000 frames is sufficient to catch you up. (That would also mean the pitch increase would be about 1/10 of an octave.)
To advance a "fractional" frame, you first have to convert the bytes of the two bracketing frames to PCM. Then, use linear interpolation to get the intermediate value. Then convert that intermediate value back to bytes for the output line.
For example, if you are advancing from frame[0] to frame["1.1"] you will need to know the PCM for frame[1] and frame[2]. The intermediate value can be calculated using a weighted average:
value = PCM[1] * 9/10 + PCM[2] * 1/10
I think it might be good to make the amount by which you advance change gradually. Take a few dozen frames to ramp up the increment and allow time to ramp down again when returning to normal dequeuing. If you suddenly change the rate at which you are reading the audio data, it is possible to introduce a discontinuity that will be heard as a click.
I have used this basic plan for dynamic control of playback speed, but I haven't had the experience of employing it for the situation that you are describing. Regulating the variable speed could be tricky if you also are trying to enforce keeping the transitions smooth.
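To illustrate the fractional-frame plan above, here is a rough sketch for mono 16-bit PCM; the fixed speed factor and the omission of ramping are simplifications of what is described, not a drop-in implementation:

// Sketch: read mono 16-bit PCM at a fractional rate (e.g. 1.1 frames per output frame).
// Assumes `pcm` already holds decoded samples; gradual ramping of `speed` is omitted.
static short[] speedUp(short[] pcm, double speed) {
    int outLen = (int) ((pcm.length - 1) / speed);
    short[] out = new short[outLen];
    double pos = 0.0;
    for (int i = 0; i < outLen; i++) {
        int i0 = (int) pos;                  // frame before the fractional position
        double frac = pos - i0;              // distance into the gap, in [0,1)
        double v = pcm[i0] * (1 - frac) + pcm[i0 + 1] * frac; // weighted average
        out[i] = (short) Math.round(v);
        pos += speed;                        // advance by a frame + increment
    }
    return out;
}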
The basic idea for using granules involves obtaining contiguous PCM (I'm not clear what the optimum number of frames would be for voice; 1 to 50 millis is cited as commonly used with this technique in synthesis), and giving it a volume envelope that allows you to mix sequential granules end-to-end (they must overlap).
I think the envelopes for the granules make use of a Hann function or Hamming window, but I'm not clear on the details, such as the overlapping placement of the granules so that they mix/transition smoothly. I've only dabbled, and I'm going to assume folks at Signal Processing will be the best bet for advice on how to code this.
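For what it's worth, the Hann envelope itself is straightforward to generate; how you overlap and place the granules is the hard part and is not shown in this sketch:

// Sketch: build a Hann window of n samples and apply it to one granule in place.
static float[] hannWindow(int n) {
    float[] w = new float[n];
    for (int i = 0; i < n; i++) {
        w[i] = (float) (0.5 * (1 - Math.cos(2 * Math.PI * i / (n - 1))));
    }
    return w;
}

static void applyEnvelope(short[] granule, float[] window) {
    for (int i = 0; i < granule.length; i++) {
        granule[i] = (short) (granule[i] * window[i]);
    }
}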
I found a fantastic git repo (the sonic library, mainly for audio players) which does exactly what I wanted, with plenty of controls. I can input a whole .wav file or even chunks of audio byte arrays, and after processing we get a sped-up playback experience and more. For real-time processing I call it on every chunk of the audio byte array.
I found another way/algorithm to detect whether an audio chunk/byte array is voice or not; depending on its result, I can simply skip playing non-voice packets, which gives us around a 1.5x speedup with less processing.
public class DTHVAD {
    public static final int INITIAL_EMIN = 100;
    public static final double INITIAL_DELTAJ = 1.0001;

    private static boolean isFirstFrame;
    private static double Emax;
    private static double Emin;
    private static double Lamda;
    private static double DeltaJ;

    static {
        initDTH();
    }

    private static void initDTH() {
        Emax = 0;
        Emin = 0;
        isFirstFrame = true;
        Lamda = 0.950; // range is 0.950---0.999
        DeltaJ = 1.0001;
    }

    public static boolean isAllSilence(short[] samples, int length) {
        boolean r = true;
        for (int l = 0; l < length; l += 80) {
            if (!isSilence(samples, l, l + 80)) {
                r = false;
                break;
            }
        }
        return r;
    }

    // `end` is an exclusive end index (callers pass offset + 80)
    public static boolean isSilence(short[] samples, int offset, int end) {
        boolean isSilenceR = false;
        long energy = energyRMSE(samples, offset, end);
        if (isFirstFrame) {
            Emax = energy;
            Emin = INITIAL_EMIN;
            isFirstFrame = false;
        }
        if (energy > Emax) {
            Emax = energy;
        }
        if (energy < Emin) {
            if ((int) energy == 0) {
                Emin = INITIAL_EMIN;
            } else {
                Emin = energy;
            }
            DeltaJ = INITIAL_DELTAJ; // reset DeltaJ to its initial value
        } else {
            DeltaJ = DeltaJ * 1.0001;
        }
        long threshold = (long) ((1 - Lamda) * Emax + Lamda * Emin);
        Lamda = (Emax - Emin) / Emax;
        if (energy > threshold) {
            isSilenceR = false; // voice marking
        } else {
            isSilenceR = true; // noise marking
        }
        Emin = Emin * DeltaJ;
        return isSilenceR;
    }

    private static long energyRMSE(short[] samples, int offset, int end) {
        double cEnergy = 0;
        float reverseOfN = 1f / (end - offset); // 1/N; e.g. 0.0125 for an 80-sample frame
        for (int i = offset; i < end; i++) {
            long step = (long) samples[i] * samples[i]; // x*x
            cEnergy += step * reverseOfN;               // accumulate x*x/N
        }
        cEnergy = Math.pow(cEnergy, 0.5); // root of the mean square
        return (long) cEnergy;
    }
}
Here I convert my byte array to a short array and detect whether it is voice or not with:
frame.silence = DTHVAD.isSilence(encodeShortBuffer, 0, shortLen);
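In case it helps, here is one possible way to do that byte-to-short conversion, assuming little-endian 16-bit samples (match the endianness to your AudioFormat):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: reinterpret a 16-bit PCM byte array as shorts.
static short[] toShorts(byte[] audioBytes) {
    short[] samples = new short[audioBytes.length / 2];
    ByteBuffer.wrap(audioBytes)
              .order(ByteOrder.LITTLE_ENDIAN) // match your AudioFormat's endianness
              .asShortBuffer()
              .get(samples);
    return samples;
}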

Efficient way to implement 'events since x' in Java

I want to be able to ask an object 'how many events have occurred in the last x seconds?', where x is an argument.
e.g. how many events have occurred in the last 120 seconds..
My approach is linear in the number of events occurring, but I wanted to see the most efficient way (in space and time) to achieve this requirement:
public class TimeSinceStat {
    private List<DateTime> eventTimes = new ArrayList<>();

    public void apply() {
        eventTimes.add(DateTime.now());
    }

    public int eventsSince(int seconds) {
        DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
        for (int i = 0; i < eventTimes.size(); i++) {
            DateTime dateTime = eventTimes.get(i);
            if (dateTime.compareTo(startTime) > 0)
                return eventTimes.size() - i; // equivalent to subList(i, size()).size()
        }
        return 0;
    }
}
(PS - I'm using JodaTime for the date/time representation)
Edit:
The key of this algorithm is to find all events that have happened in the last x seconds; the exact start time (e.g. now - 30 seconds) may or may not be in the collection.
Store the DateTime in a TreeSet and then use tailSet to get the most recent events. This saves you from having to find the starting point by iteration (which is O(n)) and instead by searching (which is O (log n)).
TreeSet<DateTime> eventTimes;

public int eventsSince(int seconds) {
    return eventTimes.tailSet(DateTime.now().minus(Seconds.seconds(seconds)), true).size();
}
Of course, you could also binary search on your sorted list, but this does the work for you.
Edit
If it's a concern that multiple events could occur at the same DateTime, you can take the exact same approach with a SortedMultiset from Guava:
TreeMultiset<DateTime> eventTimes;

public int eventsSince(int seconds) {
    return eventTimes.tailMultiset(
        DateTime.now().minus(Seconds.seconds(seconds)),
        BoundType.CLOSED
    ).size();
}
Edit x2
Here's a much more efficient approach that leverages the fact that you only log events that happened after all other events. With each event, store the number of events up to that date:
NavigableMap<DateTime, Integer> eventCounts = initEventMap();

public NavigableMap<DateTime, Integer> initEventMap() {
    TreeMap<DateTime, Integer> map = new TreeMap<>();
    // prime the map to make subsequent operations much cleaner
    map.put(DateTime.now().minus(Seconds.seconds(1)), 0);
    return map;
}

private int totalCount() {
    // you can handle the edge condition here
    return eventCounts.lastEntry().getValue();
}

public void logEvent() {
    eventCounts.put(DateTime.now(), totalCount() + 1);
}
Then getting the count since a date is super efficient: just take the total and subtract the count of events that occurred before that date.
public int eventsSince(int seconds) {
    DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
    return totalCount() - eventCounts.lowerEntry(startTime).getValue();
}
This eliminates the inefficient iteration: it's a constant-time lookup plus an O(log n) lookup.
If you were implementing a data structure from scratch, and the data are not in sorted order, you'd want to construct a balanced order statistic tree (also see code here). This is just a regular balanced tree with the size of the subtree rooted at each node maintained in the node itself.
The size fields enable efficient calculation of the "rank" of any key in the tree, as sketched below. You can do the desired range query by making two O(log n) probes into the tree for the rank of the min and max range value, finally taking their difference.
The proposed tree and set tail operations are great, except that the tail views need time to construct, even though all you need is their size. The asymptotic complexity is the same as the OST, but the OST avoids this overhead completely. The difference could be meaningful if performance is very critical.
Of course I'd definitely use the standard library solution first and consider the OST only if the speed turned out to be inadequate.
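To make the rank idea concrete, here is a minimal sketch of the rank query, assuming a BST node that maintains the size of its subtree (balancing is omitted, and the long key standing in for a timestamp is an assumption):

// Sketch: count keys strictly less than `key` in a size-augmented BST.
class Node {
    long key;          // e.g. a timestamp in millis
    int size = 1;      // number of nodes in the subtree rooted here
    Node left, right;
}

static int size(Node n) {
    return n == null ? 0 : n.size;
}

// rank(root, t) = number of keys < t; events since t = size(root) - rank(root, t)
static int rank(Node n, long key) {
    if (n == null) return 0;
    if (key <= n.key) return rank(n.left, key);
    return size(n.left) + 1 + rank(n.right, key);
}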
Since DateTime already implements the Comparable interface, I would recommend storing the data in a TreeMap instead; you could use TreeMap#tailMap to get a view of the DateTimes that occur in the desired time range.
Based on your code:
public class TimeSinceStat {
    // just in case two or more events start at the "same time"
    private NavigableMap<DateTime, Integer> eventTimes = new TreeMap<>();
    // if this class needs to be used in multiple threads, use ConcurrentSkipListMap instead of TreeMap

    public void apply() {
        DateTime dateTime = DateTime.now();
        Integer times = eventTimes.containsKey(dateTime) ? eventTimes.get(dateTime) + 1 : 1;
        eventTimes.put(dateTime, times);
    }

    public int eventsSince(int seconds) {
        DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
        NavigableMap<DateTime, Integer> eventsInRange = eventTimes.tailMap(startTime, true);
        int counter = 0;
        for (Integer count : eventsInRange.values()) {
            counter += count;
        }
        return counter;
    }
}
Assuming the list is sorted, you could do a binary search. Java Collections already provides Collections.binarySearch, and DateTime implements Comparable (according to the JodaTime JavaDoc). binarySearch returns the index of the value you want if it exists in the list; otherwise it returns (-(insertion point) - 1), from which you can recover the index of the greatest value less than the one you want. So, all you need to do is (in your eventsSince method):
// find the time you want
int index = Collections.binarySearch(eventTimes, startTime);
if (index < 0) index = -(index + 1) - 1; // greatest index with a value less than startTime
if (index < 0) return eventTimes.size(); // every event is after startTime
// skip over duplicates of the value at index
while (index != eventTimes.size() - 1 && eventTimes.get(index).equals(eventTimes.get(index + 1))) {
    index++;
}
// return the number of events strictly after the index
return eventTimes.size() - index - 1;
This should be a faster way to do what you want.

Fast interpolation between a collection of points

I've built a model of the solar system in Java. In order to determine the position of a planet it does a whole lot of computation, which gives a very exact value. However, I am often satisfied with an approximate position if that makes it go faster. Because I'm using it in a simulation, speed is important: the position of a planet will be requested millions of times.
Currently I cache the positions of a planet throughout its orbit and then reuse those coordinates over and over. If a position between two cached values is requested, I perform a linear interpolation. This is how I store values:
for (int t = 0; t < tp; t++) {
    listCoordinates[t] = super.coordinates(ti + t);
}
interpolator = new PlanetOrbit(listCoordinates, tp);
PlanetOrbit has the interpolation code:
package cometsim;

import org.apache.commons.math3.util.FastMath;

public class PlanetOrbit {
    final double[][] coordinates;
    double tp;

    public PlanetOrbit(double[][] coordinates, double tp) {
        this.coordinates = coordinates;
        this.tp = tp;
    }

    public double[] coordinates(double julian) {
        double T = julian % FastMath.floor(tp);
        if (coordinates.length == 1 || coordinates.length == 0) return coordinates[0];
        if (FastMath.round(T) == T) return coordinates[(int) T];
        int floor = (int) FastMath.floor(T);
        if (floor >= coordinates.length) floor = coordinates.length - 5;
        double[] f = coordinates[floor];
        double[] c = coordinates[floor + 1];
        double[] retval = f;
        retval[0] += (T - FastMath.floor(T)) * (c[0] - f[0]);
        retval[1] += (T - FastMath.floor(T)) * (c[1] - f[1]);
        retval[2] += (T - FastMath.floor(T)) * (c[2] - f[2]);
        return retval;
    }
}
You can think of FastMath as Math, but faster. However, this code is not much of a speed improvement over calculating the exact value every time. Do you have any ideas for how to make it faster?
There are a few issues I can see; the main ones are as follows.
PlanetOrbit#coordinates actually changes the values in the variable coordinates. As this method is supposed to only interpolate, I expect that your orbit will corrupt slightly every time you run through it (because it is a linear interpolation, the orbit will degrade towards its centre).
You do the same thing several times; most clearly, T-FastMath.floor(T) occurs three separate times in the code.
Not a question of efficiency or accuracy, but the variable and method names are very opaque; use real words for variable names.
My proposed method would be as follows
public double[] getInterpolatedCoordinates(double julian) { // julian calendar? This variable name needs to be something else, like day, or time, or whatever it actually means
    int startIndex = (int) julian;
    int endIndex = (startIndex + 1 >= coordinates.length ? 1 : startIndex + 1); // wrap around
    double nonIntegerPortion = julian - startIndex;

    double[] start = coordinates[startIndex];
    double[] end = coordinates[endIndex];
    double[] returnPosition = new double[3];
    for (int i = 0; i < start.length; i++) {
        returnPosition[i] = start[i] * (1 - nonIntegerPortion) + end[i] * nonIntegerPortion;
    }
    return returnPosition;
}
This avoids corrupting the coordinates array and avoids repeating the same floor operation several times (1-nonIntegerPortion is still computed several times and could be hoisted if need be, but I expect profiling will show it isn't significant). However, it does create a new double[] each time, which may be inefficient if you only need the array temporarily. This can be corrected using a store object (an object you used previously but no longer need, usually from the previous loop):
public double[] getInterpolatedCoordinates(double julian, double[] store) {
    int startIndex = (int) julian;
    int endIndex = (startIndex + 1 >= coordinates.length ? 1 : startIndex + 1); // wrap around
    double nonIntegerPortion = julian - startIndex;

    double[] start = coordinates[startIndex];
    double[] end = coordinates[endIndex];
    double[] returnPosition = store;
    for (int i = 0; i < start.length; i++) {
        returnPosition[i] = start[i] * (1 - nonIntegerPortion) + end[i] * nonIntegerPortion;
    }
    return returnPosition; // store is returned
}
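If you adopt the store-object variant, a typical hot loop reuses a single scratch buffer; the orbit, totalSteps, and dt names below are illustrative, not from the original code:

// Reuse one buffer for millions of position queries to avoid per-call allocation.
double[] scratch = new double[3];
for (long step = 0; step < totalSteps; step++) {
    double julian = step * dt; // hypothetical time step
    double[] pos = orbit.getInterpolatedCoordinates(julian, scratch);
    // ... consume pos immediately; its contents are overwritten on the next call ...
}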

Tips optimizing Java code

So, I've written a spellchecker in Java and things work as they should. The only problem is that if I use a word for which the max allowed edit distance is too large (say, 9) then my code runs out of memory. I've profiled my code and dumped the heap to a file, but I don't know how to use it to optimize my code.
Can anyone offer any help? I'm more than willing to put up the file/use any other approach that people might have.
-Edit-
Many people asked for more details in the comments. I figured that other people would find them useful, and they might get buried in the comments. Here they are:
I'm using a Trie to store the words themselves.
In order to improve time efficiency, I don't compute the Levenshtein distance upfront, but calculate it as I go. What I mean by this is that I keep only two rows of the LD table in memory. Since a Trie is a prefix tree, every time I recurse down a node, the previous letters of the word (and therefore the distances for those words) remain the same. Therefore, I only calculate the distance with the new letter included, with the previous row remaining unchanged.
The suggestions that I generate are stored in a HashMap. The rows of the LD table are stored in ArrayLists.
Here's the code of the function in the Trie that leads to the problem. Building the Trie is pretty straightforward, and I haven't included that code here.
/*
 * @param letter: the letter that is currently being looked at in the trie
 * word: the word that we are trying to find matches for
 * previousRow: the previous row of the Levenshtein Distance table
 * suggestions: all the suggestions for the given word
 * maxd: max distance a word can be from the query and still be returned as a suggestion
 * suggestion: the current suggestion being constructed
 */
public void get(char letter, ArrayList<Character> word, ArrayList<Integer> previousRow, HashSet<String> suggestions, int maxd, String suggestion) {
    // the new row of the LD table that is to be computed
    ArrayList<Integer> currentRow = new ArrayList<Integer>(word.size() + 1);
    currentRow.add(previousRow.get(0) + 1);
    int insert = 0;
    int delete = 0;
    int swap = 0;
    int d = 0;
    for (int i = 1; i < word.size() + 1; i++) {
        delete = currentRow.get(i - 1) + 1;
        insert = previousRow.get(i) + 1;
        if (word.get(i - 1) == letter)
            swap = previousRow.get(i - 1);
        else
            swap = previousRow.get(i - 1) + 1;
        d = Math.min(delete, Math.min(insert, swap));
        currentRow.add(d);
    }
    // if this node represents a word and the distance so far is <= maxd, add it as a suggestion
    if (isWord && d <= maxd) {
        suggestions.add(suggestion);
    }
    // if any of the entries in the current row are <= maxd, we can still find possible solutions:
    // recursively search all the branches of the trie
    for (int i = 0; i < currentRow.size(); i++) {
        if (currentRow.get(i) <= maxd) {
            for (int j = 0; j < 26; j++) {
                if (children[j] != null) {
                    children[j].get((char) (j + 97), word, currentRow, suggestions, maxd, suggestion + String.valueOf((char) (j + 97)));
                }
            }
            break;
        }
    }
}
Here's some code I quickly crafted showing one way to generate the candidates and then "rank" them.
The trick is: you never "test" a non-valid candidate.
To me, your "I run out of memory when I've got an edit distance of 9" screams "combinatorial explosion".
Of course, to dodge a combinatorial explosion you don't do things like trying to generate all words that are at a distance of 9 from your misspelled word. You start from the misspelled word and generate (quite a lot of) possible candidates, but you refrain from creating too many candidates, for then you'd run into trouble.
(also note that it doesn't make much sense to compute up to a Levenshtein edit distance of 9, because technically any word of fewer than 10 letters can be transformed into any other word of fewer than 10 letters in at most 9 transformations)
Here's why you simply cannot test all words up to a distance of 9 without either getting an OutOfMemoryError or a program that simply never terminates:
generating all words at an LED of 1 from the word "ptmizing", by only adding one letter (from a to z), already gives 9*26 variations (i.e. 234 variations) [there are 9 positions where you can insert one of 26 letters]
generating all words at an LED of up to 2, by only adding one letter to what we now have, already gives 10*26*234 variations (60 840)
generating all words at an LED of up to 3 gives: 17 400 240 variations
And that is only considering the case where we add one, two, or three letters (we're not counting deletions, swaps, etc.). And that is on a misspelled word that is only eight characters long. On "real" words, it explodes even faster.
Sure, you could get "smart" and generate these in a way that avoids too many dupes, etc., but the point stands: it's a combinatorial explosion, and it explodes fast.
Anyway... Here's an example. I'm simply passing the dictionary of valid words (containing only four words in this case) to the corresponding method to keep this short.
You'll obviously want to replace the call to the LED with your own LED implementation.
The double-metaphone is just an example: in a real spellchecker, words that "sound alike" despite a larger LED should be considered "more correct" and hence often suggested first. For example, "optimizing" and "aupteemising" are quite far apart from an LED point of view, but using the double-metaphone you should get "optimizing" as one of the first suggestions.
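As a hedged illustration, Apache Commons Codec ships a DoubleMetaphone class that can be used for this kind of "sounds alike" check; the ranking strategy hinted at in the comments is an assumption, not the author's code:

import org.apache.commons.codec.language.DoubleMetaphone;

// Sketch: prefer candidates that sound like the misspelled word.
DoubleMetaphone dm = new DoubleMetaphone();
boolean soundsAlike = dm.isDoubleMetaphoneEqual("optimizing", "aupteemising");
// e.g. sort candidates so that sound-alike entries come first,
// breaking ties by Levenshtein edit distance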
(disclaimer: the following was cranked out in a few minutes; it doesn't take into account uppercase, non-English words, etc. It's not a real spellchecker, just an example)
@Test
public void spellCheck() {
    final String src = "misspeled";
    final Set<String> validWords = new HashSet<String>();
    validWords.add("boing");
    validWords.add("Yahoo!");
    validWords.add("misspelled");
    validWords.add("stackoverflow");
    final List<String> candidates = findNonSortedCandidates(src, validWords);
    final SortedMap<Integer, String> res = computeLevenhsteinEditDistanceForEveryCandidate(candidates, src);
    for (final Map.Entry<Integer, String> entry : res.entrySet()) {
        System.out.println(entry.getValue() + " # LED: " + entry.getKey());
    }
}

private SortedMap<Integer, String> computeLevenhsteinEditDistanceForEveryCandidate(
        final List<String> candidates,
        final String mispelledWord
) {
    final SortedMap<Integer, String> res = new TreeMap<Integer, String>();
    for (final String candidate : candidates) {
        res.put(dynamicProgrammingLED(candidate, mispelledWord), candidate);
    }
    return res;
}

private int dynamicProgrammingLED(final String candidate, final String misspelledWord) {
    return Levenhstein.getLevenshteinDistance(candidate, misspelledWord);
}
Here you generate all possible candidates using several methods. I've only implemented one such method (and quickly, so it may be bogus, but that's not the point ;)
private List<String> findNonSortedCandidates(final String src, final Set<String> validWords) {
    final List<String> res = new ArrayList<String>();
    res.addAll(allCombinationAddingOneLetter(src, validWords));
    // res.addAll( allCombinationRemovingOneLetter(src) );
    // res.addAll( allCombinationInvertingLetters(src) );
    return res;
}

private List<String> allCombinationAddingOneLetter(final String src, final Set<String> validWords) {
    final List<String> res = new ArrayList<String>();
    for (char c = 'a'; c <= 'z'; c++) { // inclusive of 'z'
        for (int i = 0; i < src.length(); i++) {
            final String candidate = src.substring(0, i) + c + src.substring(i, src.length());
            if (validWords.contains(candidate)) {
                res.add(candidate); // only adding candidates we know are valid words
            }
        }
        if (validWords.contains(src + c)) {
            res.add(src + c);
        }
    }
    return res;
}
One thing you could try is to increase Java's heap size in order to overcome the "out of memory" error.
The following article explains how to increase the heap size in Java:
http://viralpatel.net/blogs/2009/01/jvm-java-increase-heap-size-setting-heap-size-jvm-heap.html
But I think the better approach to your problem is to find a better algorithm than the current one.
Well, without more information on the topic there is not much the community can do for you... You can start with the following:
Look at what your profiler says (after it has run a little while): does anything pile up? Are there a lot of objects? This should normally give you a hint about what is wrong with your code.
Publish your saved dump somewhere and link it in your question, so someone else can take a look at it.
Tell us which profiler you are using; then somebody can give you hints on where to look for valuable information.
After you have narrowed down your problem to a specific part of your code, and you cannot figure out why there are so many objects of $FOO in your memory, post a snippet of the relevant part.

Find word in dictionary of unknown size using only a method to get a word by index

A few days ago I had an interview at some big company (the name is not required :)), and the interviewer asked me to find a solution to the following task:
Predefined:
There is a dictionary of words of unspecified size; we just know that all words in the dictionary are sorted (for example, alphabetically). Also, we have just one method:
String getWord(int index) throws IndexOutOfBoundsException
Needs:
We need to develop an algorithm to find some input word in the dictionary using Java. For this we should implement the method
public boolean isWordInTheDictionary(String word)
Limitations:
We cannot change the internal structure of the dictionary, we have no access to the internal structure, and we do not know the number of elements in the dictionary.
Issues:
I have developed a modified binary search, and will publish my (working) variant of the algorithm below, but are there other variants with logarithmic complexity? My variant has complexity O(log N).
My variant of the implementation:
public class Dictionary {
    private static final int BIGGEST_TOP_MASK = 0xF00000;
    private static final int LESS_TOP_MASK = 0x0F0000;
    private static final int FULL_MASK = 0xFFFFFF;
    private static final int LESS_MASK = 0x0000FF;
    private static final int BIG_MASK = 0x00FF00;
    private static final int STEP = 100; // for a real test, STEP should be Integer.MAX_VALUE

    private String[] data;
    private int shiftIndex = -1;

    public Dictionary() {
        data = getData();
    }

    String getWord(int index) throws IndexOutOfBoundsException {
        return data[index];
    }

    public String[] getData() {
        return new String[]{"a", "aaaa", "asss", "az", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "test", "u", "v", "w", "x", "y", "z"};
    }

    public boolean isWordInTheDictionary(String word) {
        boolean isFound = false;
        int constantIndex = STEP; // predefined step
        int flag = 0;
        int i = 0;
        while (true) {
            i++;
            if (flag == FULL_MASK) {
                System.out.println("Word is not found ... Steps " + i);
                break;
            }
            try {
                String data = getWord(constantIndex);
                if (null != data) {
                    int compareResult = word.compareTo(data);
                    if (compareResult > 0) {
                        if ((flag & LESS_MASK) == LESS_MASK) {
                            constantIndex = prepareIndex(false, constantIndex);
                            if (shiftIndex == 1)
                                flag |= BIGGEST_TOP_MASK;
                        } else {
                            constantIndex = constantIndex * 2;
                        }
                        flag |= BIG_MASK;
                    } else if (compareResult < 0) {
                        if ((flag & BIG_MASK) == BIG_MASK) {
                            constantIndex = prepareIndex(true, constantIndex);
                            if (shiftIndex == 1)
                                flag |= LESS_TOP_MASK;
                        } else {
                            constantIndex = constantIndex / 2;
                        }
                        flag |= LESS_MASK;
                    } else {
                        // YES!!! We found the word.
                        isFound = true;
                        System.out.println("Steps " + i);
                        break;
                    }
                }
            } catch (IndexOutOfBoundsException e) {
                if (flag > 0) {
                    constantIndex = prepareIndex(true, constantIndex);
                    flag |= LESS_MASK;
                } else {
                    constantIndex = constantIndex / 2;
                }
            }
        }
        return isFound;
    }

    private int prepareIndex(boolean isBiggest, int constantIndex) {
        shiftIndex = (int) Math.ceil(getIndex(shiftIndex == -1 ? constantIndex : shiftIndex));
        if (isBiggest)
            constantIndex = constantIndex - shiftIndex;
        else
            constantIndex = constantIndex + shiftIndex;
        return constantIndex;
    }

    private double getIndex(double constantIndex) {
        if (constantIndex <= 1)
            return 1;
        return constantIndex / 2;
    }
}
It sounds like the part they really want you to think about is how to handle the fact that you don't know the size of the dictionary. I think they assume that you can give them a binary search. So the real question is how you manipulate the range of the search as it progresses.
Once you have found a value in the dictionary that is greater than your search target (or out of bounds), the rest looks like standard binary search. The hard part is how you optimally expand the range when the target value is greater than the dictionary value you've looked up. It looks like you are expanding by a factor of 1.5. This could be really problematic with a huge dictionary and a small fixed initial step like yours (100). Think how many times your algorithm would have to expand the range upwards if there were 50 million words and you were searching for 'zebra'.
Here's an idea: use the ordered nature of the collection to your advantage by assuming the first letter of each word is evenly distributed among the letters of the alphabet (this will never be true, but without knowing more about the collection of words it's probably the best you can do). Then weight the amount of your range expansion by how far from the end you would expect the dictionary word to be, as sketched below.
So if you took your initial step of 100 and the dictionary word at that index was 'aardvark', you would expand your range much more for the next step than if it were 'walrus'. Still O(log n), but probably much better for most collections of words.
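A hedged sketch of that weighting idea; the helper and the aggressiveness factor are illustrative assumptions, not part of the answer:

// Estimate a word's relative position in [0, 1), assuming first letters are
// uniformly distributed over 'a'..'z' (rarely true, but a usable prior).
static double estimatePosition(String word) {
    double p = (Character.toLowerCase(word.charAt(0)) - 'a') / 26.0;
    if (word.length() > 1) {
        p += (Character.toLowerCase(word.charAt(1)) - 'a') / (26.0 * 26.0);
    }
    return Math.max(0.0, Math.min(p, 1.0));
}

// When expanding upwards, grow the step more aggressively the farther the
// target is estimated to lie beyond the word we just read.
static int nextUpperBound(int index, String wordAtIndex, String target) {
    double gap = Math.max(0, estimatePosition(target) - estimatePosition(wordAtIndex));
    // always at least double; 10 is an arbitrary aggressiveness factor
    return (int) Math.min(Integer.MAX_VALUE, (long) (index * (2 + 10 * gap)));
}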
Here is an alternative implementation that uses Collections.binarySearch. It fails if one of the words in the list starts with the character '\uffff' (that is, Unicode 0xFFFF, which is not a valid Unicode character).
public static class ListProxy extends AbstractList<String> implements RandomAccess
{
    @Override public String get( int index )
    {
        try {
            return getWord( index );
        } catch( IndexOutOfBoundsException ex ) {
            return "\uffff";
        }
    }

    @Override public int size()
    {
        return Integer.MAX_VALUE;
    }
}

public static boolean isWordInTheDictionary( String word )
{
    return Collections.binarySearch( new ListProxy(), word ) >= 0;
}
Update: I modified it so that it implements RandomAccess, since the binarySearch in Collections would otherwise use an iterator-based search on such a large list, which would be extremely slow. It should now be decently fast, since the binary search needs only 31 iterations even though the list pretends to be as large as possible.
Here is a slightly modified version that remembers the smallest failed index, to converge its proclaimed size to the actual size of the dictionary en passant, and thus avoids almost all exceptions in successive lookups. You would need to create a new ListProxy instance whenever the size of the dictionary could have changed.
public static class ListProxy extends AbstractList<String> implements RandomAccess
{
    private int size = Integer.MAX_VALUE;

    @Override public String get( int index )
    {
        try {
            if( index < size )
                return getWord( index );
        } catch( IndexOutOfBoundsException ex ) {
            size = index;
        }
        return "\uffff";
    }

    @Override public int size()
    {
        return size;
    }
}

private static ListProxy listProxy = new ListProxy();

public static boolean isWordInTheDictionary( String word )
{
    return Collections.binarySearch( listProxy, word ) >= 0;
}
You have the right idea, but I think your implementation is overly complicated. You want to do a binary search, but you don't know what the upper bound is. So instead of starting at the middle, you start at index 1 (assuming dictionary indexes start at 0).
If the word you're looking for is "less than" the current dictionary word, halve the distance between the current index and your "low" value. ("low" starts at 0, of course).
If the word you're looking for is "greater than" the word at the index you just examined, then either halve the distance between the current index and your "high" value ("high" starts at 2) or, if index and "high" are the same, double the index.
If doubling the index gives you an out of range exception, you halve the distance between the current value and the doubled value. So if going from 16 to 32 throws an exception, try 24. And, of course, keep track of the fact that 32 is more than the max.
So a search sequence might look like 1, 2, 4, 8, 16, 12, 14 - found!
It's the same concept as a binary search, but rather than starting with low = 0, high = n-1, you start with low = 0, high = 2, and double the high value when you need to. It's still O(log N), although the constant is going to be a bit larger than with a "normal" binary search.
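A minimal sketch of that doubling scheme (an exponential search followed by a bounded binary search); the exception-based end-of-dictionary probing follows the question's getWord contract, and the exact expansion policy is a simplification of what is described above:

public boolean isWordInTheDictionary(String word) {
    // Phase 1: gallop upward until we pass the word or fall off the end.
    int low = 0, high = 1;
    while (true) {
        try {
            int cmp = getWord(high).compareTo(word);
            if (cmp == 0) return true;
            if (cmp > 0) break;        // overshot: the word is in (low, high)
            low = high;
            high *= 2;                 // double the upper bound
        } catch (IndexOutOfBoundsException e) {
            break;                     // high is beyond the dictionary
        }
    }
    // Phase 2: ordinary binary search in [low, high], probing defensively.
    while (low <= high) {
        int mid = low + (high - low) / 2;
        try {
            int cmp = getWord(mid).compareTo(word);
            if (cmp == 0) return true;
            if (cmp < 0) low = mid + 1;
            else high = mid - 1;
        } catch (IndexOutOfBoundsException e) {
            high = mid - 1;            // mid is past the end: shrink the range
        }
    }
    return false;
}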
You can incur a one-time cost of O(n), if you know that the dictionary will not change. You can add all the words in the dictionary to a hashtable, and then any subsequent calls to isWordInDictionary() will be O(1) (in theory).
Use the getWord() API to copy the entire contents of the dictionary into a more sensible data structure (e.g. hash table, trie, perhaps even augmented by a Bloom filter). ;-)
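A small sketch of that one-time copy, again using IndexOutOfBoundsException as the end-of-dictionary signal (the HashSet is just the simplest of the suggested structures):

// One-time O(n) pass; afterwards lookups are O(1) on average.
Set<String> copyDictionary() {
    Set<String> words = new HashSet<>();
    for (int i = 0; ; i++) {
        try {
            words.add(getWord(i));
        } catch (IndexOutOfBoundsException e) {
            break; // ran off the end of the dictionary
        }
    }
    return words;
}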
In a different language:
#!/usr/bin/perl

$t = 0;
$cur = 1;
$under = 0;
$EOL = int(rand(1000000)) + 1;
$TARGET = int(rand(1000000)) + 1;
if ($TARGET > $EOL) {
    $x = $EOL;
    $EOL = $TARGET;
    $TARGET = $x;
}
print "Looking for $TARGET with EOL $EOL\n";

sub testWord($) {
    my ($a) = @_;
    ++$t;
    return 0 if ($a eq $TARGET);
    return -2 if ($a > $EOL);
    return 1 if ($a > $TARGET);
    return -1;
}

while ($r = testWord($cur)) {
    print "Tested $cur, got $r\n";
    if ($r == 1) { $over = $cur; }
    if ($r == -1) { $under = $cur; }
    if ($r == -2) { $over = $cur; }
    if ($over) {
        $cur = int(($over - $under) / 2) + $under;
        $cur++ if ($cur <= $under);
        $cur-- if ($cur >= $over);
    } else {
        $cur *= 2;
    }
}
print "Found $TARGET at $cur in $t tests\n";
The main benefit of this one is that it is a bit simpler to understand. I think it may be more efficient if your first guesses are below the target, since I don't think you are taking advantage of the space you have already "searched", but that is just from a quick glance at your code. Since it is looking for numbers for simplicity, it doesn't have to deal with not finding the target, but that is an easy extension.
@Sergii Zagriichuk, hope the interview went well. Good luck with that.
I think, just as @alexcoco said, binary search is the answer.
Other options I see are only available if you could extend the dictionary. You could make it slightly better: e.g. you could count the words for each letter and keep track of the counts; this way you would effectively have to work on only a subset of words.
Or, yes, as others are saying, entirely implement your own dictionary structure.
I know this doesn't answer your question properly, but I cannot see other possibilities.
BTW, it would be nice to see your algorithm.
EDIT:
Expanding on my comment under bshields' answer...
@Sergii Zagriichuk, even better would be to remember the last index where we had null (no word). Then at each run you could check whether it is still true. If not, expand the range to a 'previous index' obtained by reversing the binary search behaviour, so we have null again. This way you would always adjust the size of the range of your search algorithm, adapting to the current state of the dictionary as needed. The changes would have to be significant to trigger a range adjustment, so the adjustment wouldn't have any real negative impact on the algorithm. Also, dictionaries tend to be static in nature, so this should work :)
On the one hand, yes, you are right about the binary search implementation. But on the other hand, if the dictionary is static and does not change between lookups, we could suggest a different algorithm. Here we have a common problem: string sorting/searching is different from sorting/searching an int array, so getWord(int i).compareTo(string) is O(min(length0, length1)).
Suppose we have a request to find the words w0, w1, ... wN; during lookup we could build up a tree of indices (probably some suffix tree will be good enough for this task).
During the next lookup request, for the set a1, a2, ... aM, we could first narrow the range by searching for the position in the tree, decreasing the average time.
The problem with this implementation is concurrency and memory usage, so the next step is implementing a strategy to make the search tree smaller.
PS: my main aim was to check the ideas and problems you suggest.
Well, I think the fact that the dictionary is sorted can be utilized in a better way.
Say you are looking for the word "Zebra", whereas the first guess search resulted in "abcg".
We can use this info in choosing the second guess index. In my case the result starts with a, whereas I am looking for something starting with z, so rather than making a static jump I can make a calculated jump based on the current result and the desired result. If my next jump then takes me to the word "yvu", I know I am very near, so I will make a rather small jump compared to the previous one.
Here is my solution... it uses O(log n) operations. The first part of the code tries to find an estimate of the length, and then the second part takes advantage of the fact that the dictionary is sorted and performs a binary search.
boolean isWordInTheDictionary(String word) {
    if (word == null) {
        return false;
    }
    // estimate the length of the dictionary array
    // (int rather than long, since getWord takes an int index)
    int len = 2;
    while (true) {
        len = len * 2;
        try {
            getWord(len);
        } catch (IndexOutOfBoundsException e) {
            // found an upper bound; break from the loop
            break;
        }
    }
    // do a modified binary search using the estimated length
    int beg = 0;
    int end = len;
    String tempWrd;
    while (true) {
        System.out.println(String.format("beg: %s, end=%s, (beg+end)/2=%s ", beg, end, (beg + end) / 2));
        if (end - beg <= 1) {
            return false;
        }
        int idx = (beg + end) / 2;
        try {
            tempWrd = getWord(idx);
        } catch (IndexOutOfBoundsException e) {
            tempWrd = null; // idx is past the real end of the dictionary
        }
        if (tempWrd == null) {
            end = idx;
            continue;
        }
        if (word.compareTo(tempWrd) > 0) {
            beg = idx;
        } else if (word.compareTo(tempWrd) < 0) {
            end = idx;
        } else {
            // found the word
            System.out.println(String.format("getword at index: %s, =%s", idx, getWord(idx)));
            return true;
        }
    }
}
Assuming the dictionary is 0-based, I would decompose the search into two parts.
First, given that the index parameter to getWord() is an integer, and assuming that the index must be a number between 0 and the maximum positive integer, perform a binary search over that range in order to find the maximum valid index (irrespective of the word values). This operation is O(log N), since it is a simple binary search.
Once the size of the dictionary is obtained, a second ordinary binary search (again of complexity O(log N)) will bring up the desired answer.
Since O(log N) + O(log N) is O(log N), this algorithm complies with your requirement.
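A compact sketch of that first phase, assuming a non-empty dictionary (an exception from getWord marks an invalid index):

// Phase 1: binary-search for the largest valid index in [0, Integer.MAX_VALUE].
int findMaxValidIndex() {
    int lo = 0, hi = Integer.MAX_VALUE;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2 + 1; // bias upward so the loop terminates
        try {
            getWord(mid);   // probe: does this index exist?
            lo = mid;       // yes: the answer is mid or above
        } catch (IndexOutOfBoundsException e) {
            hi = mid - 1;   // no: the answer is below mid
        }
    }
    return lo; // the dictionary size is lo + 1
}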
I'm in a hiring process which asked me this same problem...
My approach was a bit different, and considering the dictionary (web service) I had, it's about 30% more efficient (for the words I've tested).
Here is the solution:
https://github.com/gustavompo/wordfinder
I'll not post the whole solution here because it's decoupled into classes and methods, but the core algorithm is this:
public WordFindingResult FindWord(string word)
{
    var callsCount = 0;
    var lowerLimit = new WordFindingLimit(0, null);
    var upperLimit = new WordFindingLimit(int.MaxValue, null);
    var wordToFind = new Word(word);
    var wordIndex = _initialIndex;
    while (callsCount <= _maximumCallsCount)
    {
        if (CouldNotFindWord(lowerLimit, upperLimit))
            return new WordFindingResult(callsCount, -1, string.Empty, WordFindingResult.ErrorCodes.NOT_FOUND);

        var wordFound = RetrieveWordAt(wordIndex);
        callsCount++;

        if (wordToFind.Equals(wordFound))
            return new WordFindingResult(callsCount, wordIndex, wordFound.OriginalWordString);
        else if (IsIndexTooHigh(wordToFind, wordFound))
        {
            upperLimit = new WordFindingLimit(wordIndex, wordFound);
            wordIndex = IndexConsideringTooHighPreviousResult(lowerLimit, wordIndex);
        }
        else
        {
            lowerLimit = new WordFindingLimit(wordIndex, wordFound);
            wordIndex = IndexConsideringTooLowPreviousResult(lowerLimit, upperLimit, wordToFind);
        }
    }
    return new WordFindingResult(callsCount, -1, string.Empty, WordFindingResult.ErrorCodes.CALLS_LIMIT_EXCEEDED);
}

private int IndexConsideringTooHighPreviousResult(WordFindingLimit maxLowerLimit, int current)
{
    return BinarySearch(maxLowerLimit.Index, current);
}

private int IndexConsideringTooLowPreviousResult(WordFindingLimit maxLowerLimit, WordFindingLimit minUpperLimit, Word target)
{
    if (AreLowerAndUpperLimitsDefined(maxLowerLimit, minUpperLimit))
        return BinarySearch(maxLowerLimit.Index, minUpperLimit.Index);

    var scoreByIndexPosition = maxLowerLimit.Index / maxLowerLimit.Word.Score;
    var indexOfTargetBasedInScore = (int)(target.Score * scoreByIndexPosition);
    return indexOfTargetBasedInScore;
}
