FilterOperator NOT IN - java

The com.google.appengine.api.datastore.Query.FilterOperator enum does not have a NOT_IN value. All other operations are possible (equal, not equal and all inequalities). Is it possible to create a FilterPredicate with that behaviour (e.g., "id", notIn(), new int[] { 3, 4, 7 }, where notIn() is something that makes the query return all entities except those whose ids are in the given list)? If not, then how can I query the datastore like that? Something like negating the FilterPredicate, for example.

There isn't (as far as I know) server-side support for that type of query. Your best bet for simulating it client-side is to merge the results of three queries: one for elements below the min of the set, one for elements above the max of the set, and one for [min..max], where you perform the "not in" check in code on the client side.
(Added) You can perform all three queries in parallel to save wall time. A challenge will emerge if any of the queries returns a sufficiently large number of entities to either blow memory or exceed time limits.
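The three-way merge described above can be sketched in plain Java. This is only an illustration under assumptions: the in-memory list stands in for the datastore, and rangeQuery is a hypothetical placeholder for a real query with inequality filters on "id". The names NotInSketch, rangeQuery and notIn are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Sketch: simulates "id NOT IN excluded" by merging three range scans.
// The List<Long> store is an assumption standing in for the datastore.
class NotInSketch {

    // Hypothetical stand-in for a server-side range query: lo <= id <= hi.
    static List<Long> rangeQuery(List<Long> store, long lo, long hi) {
        List<Long> out = new ArrayList<>();
        for (long id : store) {
            if (id >= lo && id <= hi) {
                out.add(id);
            }
        }
        return out;
    }

    static List<Long> notIn(List<Long> store, Set<Long> excluded) {
        if (excluded.isEmpty()) {
            return new ArrayList<>(store);
        }
        long min = Collections.min(excluded);
        long max = Collections.max(excluded);

        List<Long> result = new ArrayList<>();
        // Query 1: everything strictly below the smallest excluded id.
        if (min > Long.MIN_VALUE) {
            result.addAll(rangeQuery(store, Long.MIN_VALUE, min - 1));
        }
        // Query 2: the range [min..max], with the NOT IN check done client-side.
        for (long id : rangeQuery(store, min, max)) {
            if (!excluded.contains(id)) {
                result.add(id);
            }
        }
        // Query 3: everything strictly above the largest excluded id.
        if (max < Long.MAX_VALUE) {
            result.addAll(rangeQuery(store, max + 1, Long.MAX_VALUE));
        }
        return result;
    }
}
```

Only the middle query needs client-side filtering, so the cost of the workaround depends mostly on how many entities fall between the min and max of the excluded set.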

Related

JPQL number to string conversion disagreeing with Java

Due to some of the restrictions of JPQL (no ordering in subqueries, can't check array equality), I'm having to do some workarounds. Namely, I'm concatenating some numbers (and commas) into a string and checking if they're in an array parameter. (I wish to know which item has exactly a particular set of associated entries in a crossreference table. "For each item, is the set of other_id/value pairs referencing item equal to this set I'm passing in as a parameter?") However, the number-to-string conversion tends to go a little weird at high decimal places - I think JPQL (or possibly postgres) is converting the numbers into strings slightly differently than Java is. For instance, JPQL vs Java might give me
1.20991849899292 vs
1.2099184989929199
or maybe the other way around, and new BigDecimal(value).toString() gives me something again slightly different.
How can I get them to agree? Consider that there's string A, returned from a query, and I wish it to match string B, calculated in Java from the same number. One approach I tried was to reduce the precision of the number before storing it, so that it wouldn't run into the issue. This worked, but only once roughly 10 bits of mantissa were left, which discards more precision than I wanted. Another approach I considered was to obtain string B by running something like
entityManager.createQuery("SELECT CAST(:val as text)", String.class)
.setParameter("val", 0.14520927387582)
.getSingleResult();
...but JPQL refuses to do queries that don't have a FROM clause. I could make a junk table, always containing exactly one row, but that's really hacky.
Is there a way to tell JPQL how to format a number when it turns it into a string (without knowing which underlying database you're using)? Is there a way to get the SELECT CAST method to work without being more than a little hacky? Any other, better ideas?

Aligning number of elements in partition in Java Apache Spark

I have two JavaRDD<Double> called rdd1 and rdd2 over which I'd like to evaluate some correlation, e.g. with Statistics.corr(). The two RDDs are generated with many transformations and actions, but at the end of the process, they both have the same number of elements. I know that two conditions must be respected in order to evaluate the correlation, that are related (as far as I understood) to the zip method used in the correlation function. Conditions are:
The RDDs must be split over the same number of partitions
Every partition must have the same number of elements
Moreover, according to the Spark documentation, I'm using methods over the RDD which preserve ordering, so that the final correlation will be correct (a wrong ordering wouldn't raise any exception, it would just give a wrong result). Now, the problem is that even if I'm able to keep the number of partitions consistent, for example with the code
JavaRDD<Double> rdd1Repatitioned = rdd1.repartition(rdd2.getNumPartitions());
what I don't know how to do (and what is giving me exceptions) is to control the number of entries in every partition. I found a workaround that, for now, is working, that is re-initializing the two RDDs I want to correlate
List<Double> rdd1Array = rdd1.collect();
List<Double> rdd2Array = rdd2.collect();
JavaRDD<Double> newRdd1 = sc.parallelize(rdd1Array);
JavaRDD<Double> newRdd2 = sc.parallelize(rdd2Array);
but I'm not sure this guarantees anything about the consistency. It might also be really expensive computationally in some situations. Is there a way to control the number of elements in each partition, or in general to realign the partitions in two or more RDDs (I know more or less how the partitioning system works, and I understand that this might be complicated from the distribution point of view)?
Ok, this worked for me:
Statistics.corr(rdd1.repartition(8), rdd2.repartition(8))

Optimise MySQL reading pattern

I have an Integer-collection with filtered row-IDs in which I am trying to search for sequences/ranges to optimise a MySQL select-query. To give you an example:
The Integer-Collection can be either very fragmented:
[1,2,88,101,200] = Sequence(1-2,88,101,200)
Or entirely contiguous:
[1,2,3,4,..,198,199,200] = Sequence(1-200)
Is there a Java algorithm to find such sequences in the collection, or a way to improve my reading pattern in general?
How long are your collections? Unless you have millions of items, it is probably fastest to load a collection entirely into memory, sort it and then scan for ranges.
In a sorted list, finding ranges is trivial. Just scan it sequentially; if the next element is not previous element + 1, one range has just ended, and another began.
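The sort-and-scan idea can be sketched as follows. RangeFinder, toRanges and toWhereClause are hypothetical names, and the column name passed in ("id" in the usage note) is an assumption about the table. The sketch uses a TreeSet to sort and to drop duplicate IDs.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Sketch: collapse row IDs into inclusive [start, end] ranges.
class RangeFinder {

    static List<int[]> toRanges(Collection<Integer> ids) {
        List<int[]> ranges = new ArrayList<>();
        int[] current = null;
        // TreeSet iterates in ascending order without duplicates.
        for (int id : new TreeSet<>(ids)) {
            if (current != null && id == current[1] + 1) {
                current[1] = id;                 // extends the current range
            } else {
                current = new int[] {id, id};    // previous range ended, start a new one
                ranges.add(current);
            }
        }
        return ranges;
    }

    // Builds a WHERE condition: BETWEEN for real ranges, one IN list for single IDs.
    // Returns an empty string for an empty input; the caller has to handle that case.
    static String toWhereClause(String column, List<int[]> ranges) {
        List<String> parts = new ArrayList<>();
        List<String> singles = new ArrayList<>();
        for (int[] r : ranges) {
            if (r[0] == r[1]) {
                singles.add(String.valueOf(r[0]));
            } else {
                parts.add(column + " BETWEEN " + r[0] + " AND " + r[1]);
            }
        }
        if (!singles.isEmpty()) {
            parts.add(column + " IN (" + String.join(", ", singles) + ")");
        }
        return String.join(" OR ", parts);
    }
}
```

For the fragmented example from the question, toWhereClause("id", toRanges(ids)) yields id BETWEEN 1 AND 2 OR id IN (88, 101, 200). Whether this reads faster than a plain IN list depends on the table and its indexes, so it is worth measuring.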

storing sets of integers to check if a certain set has already been mentioned

I've come across an interesting problem which I would love to get some input on.
I have a program that generates a set of numbers (based on some predefined conditions). Each set contains up to 6 integers ranging from 1 to 100, and they do not have to be unique.
I would like to somehow store every set that is created so that I can quickly check if a certain set with the exact same numbers (order doesn't matter) has previously been generated.
Speed is a priority in this case as there might be up to 100k sets stored before the program stops (maybe more, but most of the time probably less)! Would anyone have any recommendations as to what data structures I should use and how I should approach this problem?
What I have currently is this:
Sort each set before storing it into a HashSet of Strings. The string is simply each number in the sorted set with some separator.
For example, the set {4, 23, 67, 67, 71} would get encoded as the string "4-23-67-67-71" and stored into the HashSet. Then for every new set generated, sort it, encode it and check if it exists in the HashSet.
Thanks!
if you break it into pieces it seems to me that
creating a set (generate 6 numbers, sort, stringify) runs in O(1)
checking if this string exists in the hashset is O(1)
inserting into the hashset is O(1)
you do this n times, which gives you O(n).
this is already optimal as you have to touch every element once anyways :)
you might run into problems depending on the range of your random numbers.
e.g. assume you generate only numbers between one and one, then there's obviously only one possible outcome ("1-1-1-1-1-1") and you'll have only collisions from there on. however, as long as the number of possible sequences is much larger than the number of elements you generate i don't see a problem.
one tip: if you know the number of generated elements beforehand it would be wise to initialize the hashset with the correct number of elements (i.e. new HashSet<String>( 100000 ) );
p.s. now with other answers popping up i'd like to note that while there may be room for improvement on a microscopic level (i.e. using language specific tricks), the asymptotic complexity of your overall approach can't be improved.
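a minimal sketch of the sort, encode and check approach discussed above. SeenSets, encode and addIfNew are hypothetical names. one assumption worth stating: the int passed to the HashSet constructor is an initial capacity, and with the default load factor of 0.75 the table still grows once it is about three quarters full, so the sketch sizes it as expected count / 0.75.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: remembers every generated multiset of numbers in canonical string form.
class SeenSets {
    private final Set<String> seen;

    SeenSets(int expectedCount) {
        // Capacity chosen so that expectedCount entries fit without a resize
        // at the default load factor of 0.75.
        seen = new HashSet<>((int) (expectedCount / 0.75f) + 1);
    }

    // Canonical form: sorted numbers joined with '-', e.g. "4-23-67-67-71".
    static String encode(int[] numbers) {
        int[] sorted = numbers.clone();
        Arrays.sort(sorted);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < sorted.length; i++) {
            if (i > 0) {
                sb.append('-');
            }
            sb.append(sorted[i]);
        }
        return sb.toString();
    }

    // Returns true if this combination was not seen before (and records it).
    boolean addIfNew(int[] numbers) {
        return seen.add(encode(numbers));
    }
}
```

Set.add already reports whether the element was new, so a separate contains call before inserting is not needed.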
Create a class SetOfIntegers
Implement a hashCode() method that will generate reasonably unique hash values
Use HashMap to store your elements like put(hashValue,instance)
Use containsKey(hashValue) to check if the same hashValue already present
This way you will avoid sorting and conversion/formatting of your sets.
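A sketch of such a class is below, with two deliberate deviations that are my assumptions rather than part of the steps above. First, two different sets can share a hash value, so keying a map on the hash value alone could report false duplicates; the sketch therefore also implements equals() and lets a HashSet do the lookup. Second, it still sorts a copy of the numbers to get a canonical order, which is cheap for six elements.

```java
import java.util.Arrays;

// Sketch: value class for a multiset of up to six numbers, usable in a HashSet.
final class SetOfIntegers {
    private final int[] sorted;   // canonical (sorted) copy of the numbers

    SetOfIntegers(int... numbers) {
        sorted = numbers.clone();
        Arrays.sort(sorted);
    }

    @Override
    public boolean equals(Object o) {
        // Same numbers with the same multiplicities, regardless of input order.
        return o instanceof SetOfIntegers
                && Arrays.equals(sorted, ((SetOfIntegers) o).sorted);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(sorted);
    }
}
```

With a Set<SetOfIntegers> seen = new HashSet<>(), the call seen.add(new SetOfIntegers(4, 23, 67, 67, 71)) returns false if an equal combination was stored earlier.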
Just use a java.util.BitSet for each set, adding integers to it with the set(int bitIndex) method. You don't have to sort anything. Check a HashMap for an already existing BitSet before adding a new one; it will be very fast. Don't use sorting of values and toString for that purpose if speed is important. Note that a BitSet records each value only once, so this fits only if repeated numbers within a set can be ignored.

How to find 1 or more partially intersecting time-intervals in a list of few million?

I need an idea for an efficient index/search algorithm, and/or data structure, for determining whether a time-interval overlaps zero or more time-intervals in a list, keeping in mind that a complete overlap is a special case of partial overlap. So far I've not come up with anything fast or elegant...
Consider a collection of intervals with each interval having 2 dates - start, and end.
Intervals can be large or small, they can overlap each other partially, or not at all. In Java notation, something like this:
interface Period
{
    long getStart(); // millis since the epoch
    long getEnd();
    boolean intersects(Period p); // trivial intersection check with another period
}
Collection<Period> c = new ArrayList<Period>(); // assume a lot of elements
The goal is to efficiently find all intervals which partially intersect a newly-arrived input interval. For c as an ArrayList this could look like...
Collection<Period> getIntersectingPeriods(Period p)
{
    // how to implement this without full iteration?
    Collection<Period> result = new ArrayList<Period>();
    for (Period element : c)
        if (element.intersects(p))
            result.add(element);
    return result;
}
Iterating through the entire list linearly requires too many compares to meet my performance goals. Instead of ArrayList, something better is needed to direct the search, and minimize the number of comparisons.
My best solution so far involves maintaining two sorted lists internally and conducting 4 binary searches and some list iteration for every request. Any better ideas?
Editor's Note: Time-intervals are a specific case employing linear segments along a single axis, be that X, or in this case, T (for time).
Interval trees will do:
In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree...
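A minimal sketch of the idea follows. It is not a full interval tree implementation: it is a static variant, built once from a list, that keeps the intervals sorted by start and stores the maximum end per subtree so that whole subtrees can be skipped during a query. IntervalIndex is a hypothetical name, intervals are represented as {start, end} arrays with inclusive bounds, and supporting insertions would need a balanced tree instead of this array layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: static interval index (implicit balanced tree over a sorted array).
class IntervalIndex {
    private final long[][] sorted;  // intervals as {start, end}, sorted by start
    private final long[] maxEnd;    // maxEnd[mid] = largest end in the subtree rooted at mid

    IntervalIndex(List<long[]> intervals) {
        sorted = intervals.toArray(new long[0][]);
        Arrays.sort(sorted, (a, b) -> Long.compare(a[0], b[0]));
        maxEnd = new long[sorted.length];
        build(0, sorted.length - 1);
    }

    // The subtree covering sorted[lo..hi] is rooted at the midpoint.
    private long build(int lo, int hi) {
        if (lo > hi) {
            return Long.MIN_VALUE;
        }
        int mid = (lo + hi) >>> 1;
        long m = Math.max(sorted[mid][1],
                Math.max(build(lo, mid - 1), build(mid + 1, hi)));
        maxEnd[mid] = m;
        return m;
    }

    // Returns all stored intervals that overlap [start, end], ordered by start.
    List<long[]> query(long start, long end) {
        List<long[]> out = new ArrayList<>();
        query(0, sorted.length - 1, start, end, out);
        return out;
    }

    private void query(int lo, int hi, long qs, long qe, List<long[]> out) {
        if (lo > hi) {
            return;
        }
        int mid = (lo + hi) >>> 1;
        if (maxEnd[mid] < qs) {
            return;                       // nothing in this subtree reaches the query
        }
        query(lo, mid - 1, qs, qe, out);
        if (sorted[mid][0] > qe) {
            return;                       // mid and everything to its right start too late
        }
        if (sorted[mid][1] >= qs) {
            out.add(sorted[mid]);
        }
        query(mid + 1, hi, qs, qe, out);
    }
}
```

The Period objects from the question could be mapped to {getStart(), getEnd()} pairs (or the arrays replaced by Period references) without changing the search logic.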
Seems the Wiki article solves more than was asked. Are you tied to Java?
You have a "huge collection of objects", which says "database" to me.
You asked about "built-in period indexing capabilities", and indexing says "database" to me too.
Only you can decide whether this SQL meets your perception of "elegant":
Select A.Key as One_Interval,
       B.Key as Other_Interval
From Big_List_Of_Intervals as A
join Big_List_Of_Intervals as B
  on A.Start between B.Start and B.End
  OR B.Start between A.Start and A.End
If the Start and End columns are indexed, a relational database (according to advertising) will be quite efficient at this.
