Alright, here's the lowdown: I'm writing a class in Java that finds the Nth Hardy's Taxi number (a number expressible as the sum of two cubes in two different ways). I have the discovery itself down, but I am in desperate need of some space saving. To that end, I need the smallest possible data structure where I can relatively easily use or create a method like contains(). I'm not particularly worried about speed, as my current solution can certainly compute well within the time restrictions.
In short, the data structure needs:
To be able to relatively simply implement a contains() method
To use a low amount of memory
To be able to store a very large number of entries
To be easily usable with the primitive long type
Any ideas? I started with a hash map (because I needed to test the values that led to the sum to ensure accuracy), then moved to a hash set once I guaranteed reliable answers.
Any other general ideas on how to save some space would be greatly appreciated!
I don't think you'd need the code to answer the question, but here it is in case you're curious:
import java.util.HashSet;

public class Hardy {
    // private static HashMap<Long, Long> hm;

    /**
     * Find the nth Hardy number (start counting with 1, not 0) and the numbers
     * whose cubes demonstrate that it is a Hardy number.
     * @param n
     * @return the nth Hardy number
     */
    public static long nthHardyNumber(int n) {
        // long i, j, oldValue;
        int i, j;
        int counter = 0;
        long xyLimit = 2147483647; // xyLimit is the max value of a 32bit signed number
        long sum;
        // hm = new HashMap<Long, Long>();
        int hardyCalculations = (int) (n * 1.1);
        HashSet<Long> hs = new HashSet<Long>(hardyCalculations * hardyCalculations, (float) 0.95);
        long[] sums = new long[hardyCalculations];
        // long binaryStorage, mask = 0x00000000FFFFFFFF;

        for (i = 1; i < xyLimit; i++) {
            for (j = 1; j <= i; j++) {
                // binaryStorage = ((i << 32) + j);
                // long y = ((binaryStorage << 32) >> 32) & mask;
                // long x = (binaryStorage >> 32) & mask;
                sum = cube(i) + cube(j);
                if (hs.contains(sum) && !arrayContains(sums, sum)) {
                    // oldValue = hm.get(sum);
                    // long oldY = ((oldValue << 32) >> 32) & mask;
                    // long oldX = (oldValue >> 32) & mask;
                    // if (oldX != x && oldX != y){
                    sums[counter] = sum;
                    counter++;
                    if (counter == hardyCalculations) {
                        // Arrays.sort(sums);
                        bubbleSort(sums);
                        return sums[n - 1];
                    }
                } else {
                    hs.add(sum);
                }
            }
        }
        return 0;
    }

    private static void bubbleSort(long[] array) {
        long current, next;
        int i;
        boolean ordered = false;

        while (!ordered) {
            ordered = true;
            for (i = 0; i < array.length - 1; i++) {
                current = array[i];
                next = array[i + 1];
                if (current > next) {
                    ordered = false;
                    array[i] = next;
                    array[i + 1] = current;
                }
            }
        }
    }

    private static boolean arrayContains(long[] array, long n) {
        for (long l : array) {
            if (l == n) {
                return true;
            }
        }
        return false;
    }

    private static long cube(long n) {
        return n * n * n;
    }
}
Have you considered using a standard tree? In Java that would be a TreeSet. By sacrificing speed, a tree generally gains back space over a hash.
For that matter, sums might be a TreeMap, transforming the linear arrayContains to a logarithmic operation. Being naturally ordered, there would also be no need to re-sort it afterwards.
EDIT
The complaint against using a Java tree structure for sums is that Java's tree types don't support the k-select algorithm. On the assumption that Hardy numbers are rare, perhaps you don't need to sweat the complexity of this container (in which case your array is fine).
If you did need to improve time performance of this aspect, you could consider using a selection-enabled tree such as the one mentioned here. However that solution works by increasing the space requirement, not lowering it.
Alternatively, we can incrementally throw out Hardy numbers we know we don't need. Suppose during the running of the algorithm, sums already contains n Hardy numbers and we discover a new one. We insert it and do whatever we need to preserve collection order, so sums now contains n+1 sorted elements.
Consider that last element. We already know about n smaller Hardy numbers, and so there is no possible way this last element is our answer. Why keep it? At this point we can shrink sums again down to size n and toss the largest element out. This is both a space savings, and time savings as we have fewer elements to maintain in sorted order.
The natural data structure for sums in that approach is a max heap. Java's java.util.PriorityQueue is a min-heap by default, but handing it a reversed comparator turns it into a max heap. You could also "make it work" with TreeMap::lastKey, which will be slower in the end, but still faster than quadratic bubbleSort.
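For illustration, here is a minimal sketch of the "keep only the n smallest sums" idea using PriorityQueue as a max heap (the class and method names are mine, not from the question):

import java.util.Collections;
import java.util.PriorityQueue;

// Sketch: retain only the n smallest candidates seen so far.
class SmallestN {
    private final int n;
    private final PriorityQueue<Long> maxHeap =
            new PriorityQueue<>(Collections.reverseOrder());

    SmallestN(int n) { this.n = n; }

    // Offer a candidate; once n values are held, evict the current largest,
    // since it can no longer be the nth smallest.
    void offer(long candidate) {
        if (maxHeap.size() < n) {
            maxHeap.add(candidate);
        } else if (candidate < maxHeap.peek()) {
            maxHeap.poll();
            maxHeap.add(candidate);
        }
    }

    // After n values have been collected, the root is the nth smallest.
    long nthSmallest() { return maxHeap.peek(); }
}

This holds at most n longs in the heap, so the space cost stays bounded, and no final sort is needed.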
If you have an extremely large number of elements, and you effectively want an index to allow fast tests for containment in the underlying dataset, then take a look at Bloom Filters. These are space-efficient indexes whose sole purpose is to enable fast tests for containment in a dataset.
Bloom Filters are probabilistic, which means if they return true for containment, then you actually need to check your underlying dataset to confirm that the element is really present.
If they return false, the element is guaranteed not to be contained in the underlying dataset, and in that case the test for containment would be very cheap.
So it depends on whether, most of the time, you expect a candidate to really be contained in the dataset or not.
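To make the idea concrete, here is a minimal, untuned sketch of a Bloom filter over long keys. Real implementations derive the bit count m and hash count k from the expected element count and the target false-positive rate; the mixing constants below are only illustrative assumptions:

import java.util.BitSet;

// Illustrative Bloom filter for long keys: k hash functions over m bits.
class LongBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions

    LongBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Double hashing: derive the i-th index from two mixes of the key.
    private int index(long key, int i) {
        long h1 = key * 0x9E3779B97F4A7C15L;                      // assumed constant
        long h2 = Long.rotateLeft(key, 31) * 0xC2B2AE3D27D4EB4FL; // assumed constant
        return (int) Math.floorMod(h1 + (long) i * h2, (long) m);
    }

    void add(long key) {
        for (int i = 0; i < k; i++) bits.set(index(key, i));
    }

    // false => definitely absent; true => possibly present (verify elsewhere).
    boolean mightContain(long key) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(key, i))) return false;
        return true;
    }
}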
This is the core function for testing whether a given number is a sum of two cubes; it's in C, but one should get the idea:
#include <stdbool.h>
#include <math.h>

bool is_sum_of_cubes(int value)
{
    /* cbrt() is more reliable than pow(value, 1.0/3), which can round
       just below the true cube root and truncate too low. */
    int m = (int) round(cbrt((double) value));
    int i = m;
    int j = 1;
    /* Two-pointer scan over pairs with j <= i; the earlier j < m bound
       missed pairs such as 2^3 + 2^3 = 16. */
    while (j <= i)
    {
        int element = i*i*i + j*j*j;
        if (value == element)
        {
            return true;
        }
        if (element < value)
        {
            ++j;
        }
        else
        {
            --i;
        }
    }
    return false;
}
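For the Java-minded, here is a hedged translation of the same two-pointer scan, extended to count representations, since a Hardy-Ramanujan number needs at least two distinct pairs (the cbrt-based start and the counting extension are my additions, not from the answer above):

// Counts pairs (i, j) with j <= i and i^3 + j^3 == value.
static boolean isTaxicab(long value) {
    long i = (long) Math.cbrt(value) + 1; // start just above the cube root to dodge FP rounding
    long j = 1;
    int representations = 0;
    while (j <= i) {
        long element = i * i * i + j * j * j;
        if (element == value) {
            representations++;
            i--; // step past this pair and keep scanning for another one
            j++;
        } else if (element < value) {
            j++;
        } else {
            i--;
        }
    }
    return representations >= 2;
}

For example, isTaxicab(1729) finds both 12^3 + 1^3 and 10^3 + 9^3 and returns true.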
Related
My current implementation of Quadratic Probing overrides the item stored at the current index with the new item when a collision occurs. I insert three Person objects, which are stored using their last name as the key. To test the collision resolution of the implementation, they all have the same last name, "Windmill".
I need the implementation to keep all person objects but just move them to a different index instead of overriding them.
The list size has been set as 7, stored in variable "M" used for modulo in the insert function.
Insert function
@Override
public void put(String key, Person value) {
    int tmp = hash(key);
    int i, h = 0;
    for (i = tmp; keys[i] != null; i = (i + h * h++) % M) {
        collisionCount++;
        if (keys[i].equals(key)) {
            values[i] = value;
            return;
        }
    }
    keys[i] = key;
    values[i] = value;
    N++;
}
Hash function
private int hash(String key) {
    return (key.hashCode() & 0x7fffffff) % M;
}
get function
@Override
public List<Person> get(String key) {
    List<Person> results = new ArrayList<>();
    int i = hash(key), h = 0;
    while (keys[i] != null) {
        if (keys[i].equals(key))
            results.add(values[i]);
        i = (i + h * h++) % M;
    }
    return results;
}
When I remove the piece of code that overrides previous values, the index int overflows and turns into a negative number, causing the program to crash.
You get overflow because you apply % M only after operations on ints that have already overflowed.
You need to replace i = (i + h * h++) % M with some additional operations based on modulo operation properties (https://en.wikipedia.org/wiki/Modulo_operation):
(a + b) mod n = [(a mod n) + (b mod n)] mod n.
ab mod n = [(a mod n)(b mod n)] mod n.
I think there are two issues with your code:
You don't check whether the (multi-)map is full. In practice you want to do 2 checks:
check if N==M (or maybe some smaller threshold like 90% of M)
make collisionCount a local variable and when it reaches N (unfortunately this check is also necessary to avoid some pathological cases)
in both cases you should extend your storage area and copy old data into it (re-insert). This alone should fix your bug for small values of M but for really big sizes of the map you still need the next thing.
You didn't take into account how the mod (%) operation works in Java. In particular, for a negative value of a, the value of a % b is also negative. So when you insert a lot of values and compute the next index, i + h^2 might overflow Integer.MAX_VALUE and become negative. To fix this you might use a method like this:
static int safeMod(int a, int b) {
    int m = a % b;
    return (m >= 0) ? m : (m + b);
}
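For example, the probe loop in put() might be rewritten along these lines (a sketch reusing your names; resize() is a hypothetical helper that grows the arrays, updates M, and re-inserts the old entries):

public void put(String key, Person value) {
    int i = hash(key);
    int h = 1;
    int probes = 0; // local counter, as suggested above
    while (keys[i] != null) {
        if (++probes >= M) { // table effectively full: grow and start over
            resize();        // hypothetical: reallocate keys/values, re-insert, update M
            i = hash(key);
            h = 1;
            probes = 0;
            continue;
        }
        i = safeMod(i + h * h, M); // quadratic step, index kept non-negative
        h++;
    }
    keys[i] = key;
    values[i] = value;
    N++;
}

Note this version deliberately drops the values[i] = value overwrite on equal keys, since the goal was to keep all Person objects.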
I got this question in an online code challenge. I needed to merge one sorted array of M elements into another sorted array that holds M elements but has capacity 2M. I provided the following solution:
class ArraysSorting {
    /**
     * Function to move elements to the end of the array
     *
     * @param bigger array
     */
    void moveToEnd(int bigger[]) {
        int i = 0, j = bigger.length - 1;
        for (i = bigger.length - 1; i >= 0; i--)
            if (bigger[i] != 0) {
                bigger[j] = bigger[i];
                j--;
            }
    }

    /**
     * Merges the smaller array of size M into the bigger array of capacity 2M
     * @param bigger array
     * @param smaller array
     */
    void merge(int bigger[], int smaller[]) {
        moveToEnd(bigger);
        int i = smaller.length;
        int j = 0;
        int k = 0;
        while (k < (bigger.length)) {
            if ((i < (bigger.length) && bigger[i] <= smaller[j]) || (j == smaller.length)) {
                bigger[k] = bigger[i];
                k++;
                i++;
            } else {
                bigger[k] = smaller[j];
                k++;
                j++;
            }
        }
    }
}
Is there a more efficient way to do this?
The Time Complexity: O(2M)
You can't beat linear time because you have to at least scan all the 2M elements, so that's the best you can ever get.
In practice though, you can optimize it a little further. There's no need to shift the elements of bigger towards the end; just write the results right-to-left rather than left-to-right (and, of course, you'll need to invert the comparison: at any step you'll want to select the largest element rather than the smallest).
Also, this is not good:
if ((i < (bigger.length) && bigger[i] <= smaller[j]) || (j == smaller.length)) {
    /* ... */
}
You should test j == smaller.length before accessing smaller[j]; the code as is will possibly access out of bounds positions in smaller. Do this instead:
if ((j == smaller.length) || (i < (bigger.length) && bigger[i] <= smaller[j])) {
    /* ... */
}
Overall, I do think you can make the code simpler. Here's something that, in my opinion, is easier to read because the if conditions are smaller and easier to understand (it also approaches the traditional way of merging two arrays and avoids the extra O(M) work of shifting the elements to the back):
#include <stddef.h>    /* size_t */
#include <sys/types.h> /* ssize_t (POSIX) */

void merge(int bigger[], size_t bigger_len, int smaller[], size_t smaller_len) {
    ssize_t smaller_i, bigger_i, idx;

    if (smaller_len == 0)
        return;

    smaller_i = smaller_len - 1;
    if (bigger_len == 0)
        bigger_i = -1;
    else
        bigger_i = bigger_len - 1;
    idx = bigger_len + smaller_len - 1;

    while (smaller_i >= 0 && bigger_i >= 0) {
        if (bigger[bigger_i] > smaller[smaller_i]) {
            bigger[idx] = bigger[bigger_i];
            bigger_i--;
        }
        else {
            bigger[idx] = smaller[smaller_i];
            smaller_i--;
        }
        idx--;
    }
    while (smaller_i >= 0) {
        bigger[idx] = smaller[smaller_i];
        smaller_i--;
        idx--;
    }
}
It's easy to see that the first loop runs as long as a comparison between two elements in the different arrays is possible (rather than having the loop always run and use complicated if tests inside). Also note that since output is being written to bigger, once the first loop terminates, we only need to make sure that the rest (if any) of smaller that is left is copied over to bigger. The code is in C, but it's pretty much the same in Java. bigger_len and smaller_len are the number of elements in bigger and in smaller; it is assumed that bigger has enough space for bigger_len+smaller_len elements. The initial if tests to assign to smaller_i and bigger_i are necessary to handle edge cases where subtracting 1 would overflow the (unsigned) range of size_t; they are unnecessary in Java since Java doesn't have unsigned types (correct me if I'm wrong, I haven't done Java recently).
Time complexity remains O(M).
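Since the original question was about Java, here is a direct (hedged) rendering of the same routine; it assumes bigger holds biggerLen real values at the front and has room for biggerLen + smaller.length elements in total:

static void merge(int[] bigger, int biggerLen, int[] smaller) {
    int bi = biggerLen - 1;                   // last real element of bigger
    int si = smaller.length - 1;              // last element of smaller
    int idx = biggerLen + smaller.length - 1; // write position, right to left

    while (si >= 0 && bi >= 0) {
        if (bigger[bi] > smaller[si]) {
            bigger[idx--] = bigger[bi--];
        } else {
            bigger[idx--] = smaller[si--];
        }
    }
    while (si >= 0) { // leftover smaller elements
        bigger[idx--] = smaller[si--];
    }
    // Any leftover bigger elements are already in place.
}

As noted, the unsigned-underflow guards are unnecessary here because Java's int is signed.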
So I've gone back and forth quite a few times trying multiple different methods, but I just can't seem to wrap my head around the appropriate algorithm for this method. I am creating a Polynomial class that uses an ArrayList of Term objects, where Term(int coeff, int expo) takes coeff, the coefficient of the term, and expo, its exponent. In the test class I have to insert multiple different Term objects, but they need to be inserted in ascending order by their exponents (for example, 4x^1 + 2x^3 + x^4 + 5x^7).
This is the code I have up to the end of the insert() method which takes two parameters, the coeff and expo:
public class Polynomial
{
    private ArrayList<Term> polynomials ;

    /**
     * Creates a new Polynomial object with no terms
     */
    public Polynomial()
    {
        polynomials = new ArrayList<>() ;
    }

    /**
     * Inserts a new term into its proper place in a Polynomial
     * @param coeff the coefficient of the new term
     * @param expo the exponent of the new term
     */
    public void insert(int coeff, int expo)
    {
        Term newTerm = new Term (coeff, expo) ;
        if (polynomials.isEmpty())
        {
            polynomials.add(newTerm);
            return;
        }

        int polySize = polynomials.size() - 1 ;
        for (int i = 0 ; i <= polySize ; i++)
        {
            Term listTerm = polynomials.get(i) ;
            int listTermExpo = listTerm.getExpo() ;

            if ( expo <= listTermExpo )
            {
                polynomials.add(i, newTerm);
                return;
            }
            else if ( expo > listTermExpo )
            {
                polynomials.add(newTerm) ;
                return ;
            }
        }
    }
The problem arises near the end of the code. Once I put in a Term whose exponent isn't <= the exponent of the Term at the index, it goes to the else if statement and adds it to the end of the list. That is wrong, since it needs to be added where it JUST becomes bigger than the next exponent. Just because it's larger than that exponent doesn't mean it's the LARGEST exponent. I've tried doing the for statement backwards where:
for (i = polySize ; i >= 0 ; i--)
{
etc.
}
But that didn't work either since it raises the same issue just the other way around. If anyone could provide some solution or answer it would be much appreciated since I am very confused. At this point I'm sure I'm just making it too complicated. I just want to know how to recognize that the exponent is larger but then go back into the for loop until it is smaller than or equal to the index's exponent.
Also, I should mention, I am not allowed to use any other collection or class, so I must do this using a for, if, else, or do while statement.
Thanks in advance!
Remove this from the for loop:
else if (expo > listTermExpo)
{
    polynomials.add(newTerm);
    return;
}
Place this after the for loop:
polynomials.add(newTerm);
return;
Reasoning: You want to add the term to the end of the list only if it is not less than ANY term in it - not just the first term.
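Putting both changes together, the method would look something like this (a sketch reusing your names):

public void insert(int coeff, int expo)
{
    Term newTerm = new Term(coeff, expo);
    for (int i = 0; i < polynomials.size(); i++)
    {
        if (expo <= polynomials.get(i).getExpo())
        {
            polynomials.add(i, newTerm); // first slot whose exponent is >= expo
            return;
        }
    }
    polynomials.add(newTerm); // larger than every exponent already in the list
    return;
}

This also makes the isEmpty() special case unnecessary, since an empty list just falls through to the final add.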
Also, it is good formatting to have the ; immediately after the statement it is for with no space in between, and for ()s to not have any spaces immediately inside them. I've edited the code I copied from you to show what I mean.
This should have the exact behaviour you specified:
public void insert(int coeff, int expo) {
    Term newTerm = new Term(coeff, expo);
    int max = polynomials.size();
    int min = 0;
    int pivot;

    while (max > min) {
        pivot = (min + max) / 2;
        if (expo > polynomials.get(pivot).getExpo()) {
            min = pivot + 1;
        }
        else {
            max = pivot;
        }
    }

    polynomials.add(min, newTerm);
}
This algorithm will add new Terms right in front of the first term with the same exponent if any such Term is already in the list.
This is not homework, I don't have money for school so I am teaching myself whilst working shifts at a tollbooth on the highway (long nights with few customers)
I was trying to implement a simple "mergesort" by thinking first, stretching my brain a little if you like for some actual learning, and then looking at the solution in the manual I am using: The Algorithm Design Manual by Steven S. Skiena (Springer, 2008, ISBN 1848000693).
I came up with a solution which implements the "merge" step using an array as a buffer, I am pasting it below. The author uses queues so I wonder:
Should queues be used instead?
What are the advantages of one method Vs the other? (obviously his method will be better as he is a top algorist and I am a beginner, but I can't quite pinpoint the strengths of it, help me please)
What are the tradeoffs/assumptions that governed his choice?
Here is my code. I am including my implementation of the splitting function as well for the sake of completeness, but I think we are only reviewing the merge step here. (I do not believe this is a Code Review post, by the way, as my questions are specific to just one method and its performance compared to another.)
package exercises;

public class MergeSort {

    private static void merge(int[] values, int leftStart, int midPoint,
            int rightEnd) {
        int intervalSize = rightEnd - leftStart;
        int[] mergeSpace = new int[intervalSize];
        int nowMerging = 0;

        int pointLeft = leftStart;
        int pointRight = midPoint;
        do {
            if (values[pointLeft] <= values[pointRight]) {
                mergeSpace[nowMerging] = values[pointLeft];
                pointLeft++;
            } else {
                mergeSpace[nowMerging] = values[pointRight];
                pointRight++;
            }
            nowMerging++;
        } while (pointLeft < midPoint && pointRight < rightEnd);

        int fillFromPoint = pointLeft < midPoint ? pointLeft : pointRight;
        System.arraycopy(values, fillFromPoint, mergeSpace, nowMerging,
                intervalSize - nowMerging);
        System.arraycopy(mergeSpace, 0, values, leftStart, intervalSize);
    }

    public static void mergeSort(int[] values) {
        mergeSort(values, 0, values.length);
    }

    private static void mergeSort(int[] values, int start, int end) {
        int intervalSize = end - start;
        if (intervalSize < 2) {
            return;
        }

        boolean isIntervalSizeEven = intervalSize % 2 == 0;
        int splittingAdjustment = isIntervalSizeEven ? 0 : 1;

        int halfSize = intervalSize / 2;
        int leftStart = start;
        int rightEnd = end;
        int midPoint = start + halfSize + splittingAdjustment;

        mergeSort(values, leftStart, midPoint);
        mergeSort(values, midPoint, rightEnd);

        merge(values, leftStart, midPoint, rightEnd);
    }
}
Here is the reference solution from the textbook: (it's in C so I am adding the tag)
merge(item_type s[], int low, int middle, int high)
{
    int i;                  /* counter */
    queue buffer1, buffer2; /* buffers to hold elements for merging */

    init_queue(&buffer1);
    init_queue(&buffer2);

    for (i = low; i <= middle; i++) enqueue(&buffer1, s[i]);
    for (i = middle + 1; i <= high; i++) enqueue(&buffer2, s[i]);

    i = low;
    while (!(empty_queue(&buffer1) || empty_queue(&buffer2))) {
        if (headq(&buffer1) <= headq(&buffer2))
            s[i++] = dequeue(&buffer1);
        else
            s[i++] = dequeue(&buffer2);
    }

    while (!empty_queue(&buffer1)) s[i++] = dequeue(&buffer1);
    while (!empty_queue(&buffer2)) s[i++] = dequeue(&buffer2);
}
Abstractly, a queue is just some object that supports the enqueue, dequeue, peek, and is-empty operations. It can be implemented in many different ways (using a circular buffer, using linked lists, etc.)
Logically speaking, the merge algorithm is easiest to describe in terms of queues. You begin with two queues holding the values to merge together, then repeatedly apply peek, is-empty, and dequeue operations on those queues to reconstruct a single sorted sequence.
In your implementation using arrays, you are effectively doing the same thing as if you were using queues. You have just chosen to implement those queues using arrays. One isn't necessarily "better" or "worse" than the other. Using queues makes the high-level operation of the merge algorithm clearer, but might introduce some inefficiency (though it's hard to say for certain without benchmarking). Using arrays might be slightly more efficient (again, you should test this!), but might obscure the high-level operation of the algorithm. From Skiena's point of view, using queues might be better because it makes the high-level details of the algorithm clear. From your point of view, arrays might be better because of the performance concerns.
Hope this helps!
You're worrying about minor constant factors which are largely down to the quality of your compiler. Given that you seem to be worried about that, arrays are your friend. Below is my C# implementation for integer merge-sort which, I think, is close to as tight as you can get. [EDIT: fixed a buglet.]
If you want to do better in practice, you need something like natural merge-sort, where, instead of merging up in powers of two, you simply merge adjacent non-decreasing sequences of the input. This is certainly no worse than powers-of-two, but is definitely faster when the input data contains some sorted sequences (i.e., anything other than a purely descending input sequence). That's left as an exercise for the student.
int[] MSort(int[] src) {
    var n = src.Length;
    var from = (int[]) src.Clone();
    var to = new int[n];

    for (var span = 1; span < n; span += span) {
        var i = 0;
        for (var j = 0; j < n; j += span + span) {
            var l = j;
            var lend = Math.Min(l + span, n);
            var r = lend;
            var rend = Math.Min(r + span, n);
            while (l < lend && r < rend) to[i++] = (from[l] <= from[r] ? from[l++] : from[r++]);
            while (l < lend) to[i++] = from[l++];
            while (r < rend) to[i++] = from[r++];
        }
        var tmp = from; from = to; to = tmp;
    }

    return from;
}
Suppose, I have an unsorted array of overlapped ranges. Each range is just a pair of integers begin and end. Now I want to find if a given key belongs to at least one of the ranges. Probably, I have to know the ranges it belongs as well.
We can assume the ranges array takes ~1 MB and fits in memory. I am looking for an easy algorithm that uses only standard JDK collections without any 3rd-party libraries or special data structures, but works reasonably fast.
What would you suggest?
Sort the ranges numerically by a custom Comparator, then for each key k build a one-element range [k, k] and do a binary search for this range with a different Comparator.
The searching Comparator's compare(x, y) should return
<0 if x.max < y.min
>0 if x.min > y.max
0 otherwise (its two range arguments overlap).
As noted by @Per, you need a different, stricter Comparator for sorting, but the first two clauses still hold.
This should work even if the ranges overlap, though you may want to merge overlapping ranges after sorting to speed up the search. The merging can be done in O(N) time.
This is in effect a static interval tree, i.e. one without O(lg N) insertion or deletion, in the same way that a sorted array can be considered a static binary search tree.
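As a sketch of the two-Comparator idea (the Range record and helper names are mine; it assumes overlapping ranges were merged first, so the array is consistently ordered for the search Comparator, as Arrays.binarySearch requires):

import java.util.Arrays;
import java.util.Comparator;

class RangeSearch {
    record Range(int begin, int end) {}

    // Stricter Comparator for sorting: order by begin.
    static final Comparator<Range> BY_BEGIN = Comparator.comparingInt(Range::begin);

    // Search Comparator: 0 means "the two ranges overlap".
    static final Comparator<Range> OVERLAP = (x, y) -> {
        if (x.end() < y.begin()) return -1;
        if (x.begin() > y.end()) return 1;
        return 0;
    };

    // sortedDisjoint must be sorted with BY_BEGIN and contain no overlaps.
    static Range find(Range[] sortedDisjoint, int key) {
        int idx = Arrays.binarySearch(sortedDisjoint, new Range(key, key), OVERLAP);
        return idx >= 0 ? sortedDisjoint[idx] : null; // null: no range contains key
    }
}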
If you don't need to know which interval contains your point (EDIT: I guess you probably do, but I'll leave this answer for others with this question who don't), then
Preprocess the intervals by computing two arrays B and E. B is the values of begin in sorted order. E is the values of end in sorted order.
To query a point x, use binary search to find the least index i such that B[i] > x and the least index j such that E[j] ≥ x. The number of intervals [begin, end] containing x is i - j.
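A small sketch of that counting trick (the names are mine):

import java.util.Arrays;

// B = sorted begins, E = sorted ends;
// countContaining(x) = (#begins <= x) - (#ends < x).
class IntervalCounter {
    private final double[] B, E;

    IntervalCounter(double[] begins, double[] ends) {
        B = begins.clone();
        E = ends.clone();
        Arrays.sort(B);
        Arrays.sort(E);
    }

    // Least index with a[i] > x, i.e. the number of elements <= x.
    private static int countLessOrEqual(double[] a, double x) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] <= x) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Least index with a[i] >= x, i.e. the number of elements < x.
    private static int countLess(double[] a, double x) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < x) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    int countContaining(double x) {
        return countLessOrEqual(B, x) - countLess(E, x);
    }
}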
class Interval {
    double begin, end;
}

class BeginComparator implements java.util.Comparator<Interval> {
    public int compare(Interval o1, Interval o2) {
        return Double.compare(o1.begin, o2.begin);
    }
}

public class IntervalTree {
    IntervalTree(Interval[] intervals_) {
        intervals = intervals_.clone();
        java.util.Arrays.sort(intervals, new BeginComparator());
        maxEnd = new double[intervals.length];
        initializeMaxEnd(0, intervals.length);
    }

    double initializeMaxEnd(int a, int b) {
        if (a >= b) {
            return Double.NEGATIVE_INFINITY;
        }
        int m = (a + b) >>> 1;
        maxEnd[m] = initializeMaxEnd(a, m);
        return Math.max(Math.max(maxEnd[m], intervals[m].end), initializeMaxEnd(m + 1, b));
    }

    void findContainingIntervals(double x, int a, int b, java.util.Collection<Interval> result) {
        if (a >= b) {
            return;
        }
        int m = (a + b) >>> 1;
        Interval i = intervals[m];
        if (x < i.begin) {
            findContainingIntervals(x, a, m, result);
        } else {
            if (x <= i.end) {
                result.add(i);
            }
            if (maxEnd[m] >= x) {
                findContainingIntervals(x, a, m, result);
            }
            findContainingIntervals(x, m + 1, b, result);
        }
    }

    java.util.Collection<Interval> findContainingIntervals(double x) {
        java.util.Collection<Interval> result = new java.util.ArrayList<Interval>();
        findContainingIntervals(x, 0, intervals.length, result);
        return result;
    }

    Interval[] intervals;
    double[] maxEnd;

    public static void main(String[] args) {
        java.util.Random r = new java.util.Random();
        Interval[] intervals = new Interval[10000];
        for (int j = 0; j < intervals.length; j++) {
            Interval i = new Interval();
            do {
                i.begin = r.nextDouble();
                i.end = r.nextDouble();
            } while (i.begin >= i.end);
            intervals[j] = i;
        }

        IntervalTree it = new IntervalTree(intervals);
        double x = r.nextDouble();
        java.util.Collection<Interval> result = it.findContainingIntervals(x);

        int count = 0;
        for (Interval i : intervals) {
            if (i.begin <= x && x <= i.end) {
                count++;
            }
        }

        System.out.println(result.size());
        System.out.println(count);
    }
}
I believe this is what you are looking for: http://en.wikipedia.org/wiki/Interval_tree
But check this simpler solution first to see if it fits your needs: Using java map for range searches
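For reference, the linked approach boils down to something like this sketch (assuming overlapping ranges have been merged first, so each begin maps to one disjoint range):

import java.util.Map;
import java.util.TreeMap;

class RangeMap {
    private final TreeMap<Integer, Integer> beginToEnd = new TreeMap<>();

    // Caller guarantees the inserted ranges are pairwise disjoint.
    void addMerged(int begin, int end) {
        beginToEnd.put(begin, end);
    }

    boolean contains(int key) {
        // Largest begin <= key; the key is covered iff that range reaches it.
        Map.Entry<Integer, Integer> e = beginToEnd.floorEntry(key);
        return e != null && key <= e.getValue();
    }
}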
A simple solution with O(n) complexity:
for (Range range : ranges) {
    if (key >= range.start && key <= range.end)
        return range;
}
A more clever algorithm can be applied if we know more about the ranges: are they sorted? Do they overlap? And so on.
Given just your specification, I would be inclined to order the ranges by size, with the widest ranges first (use a custom Comparator to facilitate this). Then simply iterate through them and return true as soon as you find a range that contains the key. Because we know nothing else about the data, of course the widest ranges are the most likely to contain a given key; searching them first could be a (small) optimization.
You could preprocess the list in other ways. For instance, you could exclude any ranges that are completely enclosed by other ranges. You could order by begin and early-exit as soon as you encounter a begin value greater than your key.
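The begin-ordered variant might look like this sketch (Range is an assumed simple record, and the list is pre-sorted by begin):

import java.util.List;

class SortedScan {
    record Range(int begin, int end) {}

    static Range findFirstContaining(List<Range> sortedByBegin, int key) {
        for (Range r : sortedByBegin) {
            if (r.begin() > key) break;   // every later range starts past key too
            if (key <= r.end()) return r; // begin <= key <= end
        }
        return null; // no range contains key
    }
}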