Sorting (a stream of) doubles by absolute magnitude - java

I have a series of double values which I want to sum up and get the maximum value.
The DoubleStream.summaryStatistics() sounds perfect for that.
The getSum() method has an API note reminding me of something I learned in one of my computer science courses: the accuracy of the summation tends to be better if the values are sorted by their absolute values. However, DoubleStream does not let me specify the comparator to use; calling sorted() on the stream always sorts in natural order.
Thus I gather the values into a final Stream.Builder<Double> values = Stream.builder(); and call
values.build()
.sorted(Comparator.comparingDouble(Math::abs))
.mapToDouble(a -> a).summaryStatistics();
Yet, this looks somewhat lengthy and I would have preferred to use the DoubleStream.Builder instead of the generic builder.
Did I miss something or do I really have to use the boxed version of the stream just to be able to specify the comparator?

Primitive streams don't have an overloaded sorted method and will get sorted in natural order. But to go back to your underlying problem, there are ways to improve the accuracy of the sum that don't involve sorting the data first.
One such algorithm is the Kahan summation algorithm which happens to be used by the OpenJDK/Oracle JDK internally.
This is admittedly an implementation detail so the usual caveats apply (non-OpenJDK/Oracle JDKs or future OpenJDK JDKs may take alternative approaches etc.)
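For reference, a minimal sketch of textbook Kahan (compensated) summation over a double array; this is just the idea, not the JDK's internal code:
// Kahan (compensated) summation: carries a running error term so that
// low-order bits lost in each addition are fed back into the next one.
static double kahanSum(double[] values) {
    double sum = 0.0;
    double c = 0.0;              // running compensation for lost low-order bits
    for (double v : values) {
        double y = v - c;        // apply the correction from the previous step
        double t = sum + y;      // low-order bits of y may be lost here
        c = (t - sum) - y;       // recover what was lost...
        sum = t;                 // ...and feed it into the next iteration
    }
    return sum;
}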
See also this post: In which order should floats be added to get the most precise result?

The only way to sort a DoubleStream by a custom comparator is to box it, sort, and unbox it again:
double[] input = //...
DoubleStream.of(input).boxed()
.sorted(Comparator.comparingDouble(Math::abs))
.mapToDouble(a -> a).summaryStatistics();
However, as Kahan summation is used internally, the difference should not be very significant. In most applications, unsorted input will yield good accuracy in the result. Of course, you should test for yourself whether unsorted summation is satisfactory for your particular task.

Related

Java8 - Count after filter on stream

I hope this question was not asked before.
In Java 8, I have an array of Strings, myArray, as input, and an integer maxLength.
I want to count the number of strings in my array whose length is at most maxLength. I WANT to use streams to solve this.
For that I thought of doing this:
long solution = Arrays.stream(myArray).filter(s -> s.length() <= maxLength).count();
However I'm not sure if it is the right way to do this. It will need to go through the first array once and then go through the filtered array to count.
But if I don't use a stream, I could easily write an algorithm that loops over myArray only once.
My questions are very simple: Is there a way to solve this with the same time performance as with a loop? Is it always a "good" solution to use streams?
However I'm not sure if it is the right way to do this. It will need to go through the first array once and then go through the filtered array to count.
Your assumption that it will perform multiple passes is wrong. There is something called operation fusion, i.e. multiple operations can be executed in a single pass over the data.
In this case Arrays.stream(myArray) creates a stream object (a cheap operation and a lightweight object), and filter(s -> s.length() <= maxLength).count() is combined into a single pass over the data, because there is no stateful operation in the pipeline; it does not first filter all the elements of the stream and then count the elements that pass the predicate.
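Conceptually, that fused pipeline boils down to something like the following single loop (a sketch of the idea, not the actual stream machinery):
// Roughly what filter(...).count() amounts to: one pass, one counter.
long count = 0;
for (String s : myArray) {
    if (s.length() <= maxLength) {
        count++;
    }
}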
A quote from Brian Goetz post here states:
Stream pipelines, in contrast, fuse their operations into as few passes on the data as possible, often a single pass. (Stateful intermediate operations, such as sorting, can introduce barrier points that necessitate multipass execution.)
As for:
My questions are very simple: Is there a way to solve this with the same time performance as with a loop?
That depends on the amount of data and the cost per element. For a small number of elements, an imperative for loop will almost always win.
Is it always a "good" solution to use streams?
No, if you really care about performance then measure, measure and measure.
Use streams because they are declarative, for their abstraction and composability, and for the possibility of benefiting from parallelism when you know it will actually help.
You can use IntStream.range over the indices instead of streaming the array, and filter on the elements:
long solution = IntStream.range(0, myArray.length)
        .filter(index -> myArray[index].length() <= maxLength)
        .count();

Efficient Intersection and Union of Lists of Strings

I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.
The accepted answer here Efficiently compute Intersection of two Sets in Java? looked extremely helpful but (likely because my sets are small (?)) I haven't gotten much improvement by using the approach suggested in the accepted answer.
Here's what I have so far:
protected double uuEdgeWeight(UVertex u1, UVertex u2) {
    Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
    List<String> u2Tokens = u2.getTokenlist();
    int intersection = 0;
    int union = u1Tokens.size();
    for (String s : u2Tokens) {
        if (u1Tokens.contains(s)) {
            intersection++;
        } else {
            union++;
        }
    }
    return ((double) intersection / union);
}
My question is: is there anything I can do to improve this, given that I'm working with Strings, which may be more time-consuming to compare for equality than other data types?
I think that because I'm comparing multiple u2's against the same u1, I could get some improvement by building the HashSet outside of the loop (which isn't shown; meaning I'd pass in the HashSet instead of the object from which I pull the list and then copy it into a set).
Anything else I can do to squeak out even a small improvement here?
Thanks in advance!
Update
I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt used the clone-the-set-and-retainAll approach to find the intersection, and then shortcut out before doing the clone and addAll to find the union. That was about as efficient as the code posted above, presumably because of the trade-off between it being a slower algorithm overall and being able to shortcut out a lot of the time. So, I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.
Thanks in advance!
You would get a large improvement by moving the HashSet outside of the loop.
If the HashSet really has only got a few entries in it then you are probably actually just as fast to use an Array - since traversing an array is much simpler/faster. I'm not sure where the threshold would lie but I'd measure both - and be sure that you do the measurements correctly. (i.e. warm up loops before timed loops, etc).
One thing to try might be using a sorted array for the things to compare against. Scan until you go past the current element and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.
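For illustration, if both token lists are kept sorted, the intersection can be counted in one merge-style pass, aborting each comparison as soon as one side runs past the other (a sketch; assumes no duplicates within either array):
// Counts common elements of two sorted String arrays in a single merge-style pass.
static int sortedIntersectionCount(String[] a, String[] b) {
    int i = 0, j = 0, common = 0;
    while (i < a.length && j < b.length) {
        int cmp = a[i].compareTo(b[j]);
        if (cmp == 0) {          // match: count it and advance both sides
            common++;
            i++;
            j++;
        } else if (cmp < 0) {    // a[i] is smaller: it cannot match anything later in b
            i++;
        } else {                 // b[j] is smaller: skip past it
            j++;
        }
    }
    return common;
}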
If you want to optimize this function (not sure whether it actually works in your context) you could assign each unique String an int value; when the String is added to a UVertex, set that int as a bit in a BitSet.
The function then becomes a set.or(otherset) for the union and a set.and(otherset) for the intersection. Depending on the number of unique Strings, that could be efficient.
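A sketch of that BitSet idea (tokenToId is a made-up shared map assigning each unique token a bit index); the intersection and union sizes then fall out of and/or plus cardinality():
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Jaccard ratio via BitSets: each vertex's tokens are encoded as set bits.
static double jaccard(BitSet tokensOfU1, BitSet tokensOfU2) {
    BitSet intersection = (BitSet) tokensOfU1.clone();
    intersection.and(tokensOfU2);            // bits present in both
    BitSet union = (BitSet) tokensOfU1.clone();
    union.or(tokensOfU2);                    // bits present in either
    int unionSize = union.cardinality();
    return unionSize == 0 ? 0.0 : (double) intersection.cardinality() / unionSize;
}

// Building a vertex's BitSet from its token list, given a token-to-index map.
static BitSet toBitSet(List<String> tokens, Map<String, Integer> tokenToId) {
    BitSet bits = new BitSet();
    for (String t : tokens) {
        bits.set(tokenToId.get(t));          // assumes every token already has an id
    }
    return bits;
}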

Compressed SortedSet<Long> implementation

I need to store a large number of Long values in a SortedSet implementation in a space-efficient manner. I was considering bit-set implementations and discovered Javaewah. However, the API expects int values rather than longs.
Can anyone recommend any alternatives or suggest a good way to solve this problem? I am mainly concerned with space efficiency. Upon building the set I will need to access the minimum and maximum element once. However, access time is not a huge concern (i.e. so a fully run-length encoded implementation will be fine).
EDIT
I should be clear that the implementation does not have to implement the SortedSet interface providing I can access the minimum and maximum elements of the collection.
You could use TLongArrayList which uses a long[] underneath. It supports sort() so the min and max will be the first and last value.
Or you could use a long[] with a length and do this yourself. ;)
This will use only about 64 bytes more than the raw values themselves. You can get more compact if you can make some assumptions about the range of long values, e.g. if they are actually limited to 48 bits.
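A minimal sketch of the plain long[] route: sort once, then the minimum and maximum are simply the two ends of the array (assumes the array is non-empty):
import java.util.Arrays;

// Plain long[] approach: sort once, then min and max are the ends of the array.
static long[] minAndMax(long[] values) {
    Arrays.sort(values);
    return new long[] { values[0], values[values.length - 1] };
}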
You might consider using LongBuffer. If it is memory mapped it avoids using heap or direct memory, but you would have to implement a sort routine yourself.
If they are clustered, you might be able to represent the data as a set of ranges. The ranges could be a plain A - B pair, or a BitSet with a starting value. The latter works well for phone numbers. ;)
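A sketch of the "BitSet with a starting value" idea (a made-up helper class, only valid when the spread of values above the base fits in an int and at least one value has been added):
import java.util.BitSet;

// Stores values as offsets from a known base value.
class OffsetBitSet {
    private final long base;               // smallest possible value in this cluster
    private final BitSet offsets = new BitSet();

    OffsetBitSet(long base) { this.base = base; }

    void add(long value) {
        offsets.set((int) (value - base)); // assumes 0 <= value - base <= Integer.MAX_VALUE
    }

    long min() { return base + offsets.nextSetBit(0); }       // assumes at least one value

    long max() { return base + offsets.length() - 1; }        // length() is one past the highest set bit
}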
Not sure if it has a Set implementation or how efficient it is compared to the regular JCF collections, but take a look at this:
http://commons.apache.org/primitives/

Mapping from String to integer - performance of various approaches

Let's say that I need to make a mapping from String to an integer. The integers are unique and form a continuous range starting from 0. That is:
Hello -> 0
World -> 1
Foo -> 2
Bar -> 3
Spam -> 4
Eggs -> 5
etc.
There are at least two straightforward ways to do it. With a hashmap:
HashMap<String, Integer> map = ...
int integer = map.get(string); // Plus maybe null check to avoid NPE in unboxing.
Or with a list:
List<String> list = ...
int integer = list.indexOf(string); // Plus maybe check for -1.
Which approach should I use, and why? Arguably the relative performance depends on the size of the list/map, since List#indexOf() is a linear search using String#equals() -> O(n) efficiency, while HashMap#get() uses the hash to narrow down the search -> certainly more efficient when the map is big, but maybe inferior when there are just a few elements (there must be some overhead in calculating the hash, right?).
Since benchmarking Java code properly is notoriously hard, I would like to get some educated guesses. Is my reasoning above correct (list is better for small, map is better for large)? What is the threshold size approximately? What difference do various List and HashMap implementations make?
A third option and possibly my favorite would be to use a trie:
I bet it beats the HashMap in performance (no collisions + the fact that computing the hash code is O(length of string) anyway), and possibly also the List approach in some cases (such as if your strings have long common prefixes, as indexOf would waste a lot of time in the equals methods).
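A minimal sketch of such a trie mapping String keys to int values (a hand-rolled illustration, not a library class):
import java.util.HashMap;
import java.util.Map;

// Minimal trie: each node holds its children by character and an optional value.
class StringIntTrie {
    private static final int NO_VALUE = -1;

    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        int value = NO_VALUE;
    }

    private final Node root = new Node();

    void put(String key, int value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.value = value;
    }

    // Returns the mapped value, or -1 if the key is absent.
    int get(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.get(key.charAt(i));
            if (node == null) {
                return NO_VALUE;
            }
        }
        return node.value;
    }
}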
When choosing between List and Map I would go for a Map (such as HashMap). Here is my reasoning:
Readability
The Map interface simply provides a more intuitive interface for this use case.
Optimization in the right place
I'd say if you're using a List you would be optimizing for the small cases anyway. That's probably not where the bottleneck is.
A fourth option would be to use a LinkedHashMap, iterate through it if the size is small, and get the associated number if the size is large.
A fifth option is to encapsulate the decision in a separate class altogether. In that case you could even implement it to change strategy at runtime as the list grows (see the sketch below).
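A rough sketch of such a strategy-switching wrapper (a made-up class and threshold, purely illustrative; assumes each string is added at most once): it does list lookups while small and switches to a HashMap once it grows.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Wraps the list-vs-map decision; the threshold of 16 is an arbitrary placeholder.
class StringIndex {
    private static final int THRESHOLD = 16;

    private final List<String> list = new ArrayList<>();
    private Map<String, Integer> map;        // created lazily once the list grows

    void add(String s) {
        list.add(s);
        if (map != null) {
            map.put(s, list.size() - 1);
        } else if (list.size() > THRESHOLD) {
            map = new HashMap<>();
            for (int i = 0; i < list.size(); i++) {
                map.put(list.get(i), i);
            }
        }
    }

    // Returns the index of s, or -1 if it has not been added.
    int indexOf(String s) {
        if (map != null) {
            Integer i = map.get(s);
            return i == null ? -1 : i;
        }
        return list.indexOf(s);
    }
}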
You're right: a List would be O(n), a HashMap would be O(1), so a HashMap would be faster for n large enough so that the time to calculate the hash didn't swamp the List linear search.
I don't know the threshold size; that's a matter for experimentation or better analytics than I can muster right now.
Your question is totally correct on all points:
HashMaps are better (they use a hash)
Benchmarking Java code is hard
But at the end of the day, you're just going to have to benchmark your particular application. I don't see why HashMaps would be slower for small cases, but benchmarking will tell you whether they are or not.
One more option, a TreeMap is another map data structure which uses a tree as opposed to a hash to access the entries. If you are doing benchmarking, you might as well benchmark that as well.
Regarding benchmarking, one of the main problems is the garbage collector. However, if you do a test which doesn't allocate any objects, that shouldn't be a problem. Fill up your map/list, then just write a loop to get N random elements, and then time it; that should be reasonably reproducible and therefore informative.
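A naive sketch of such a timing loop (made-up keys, System.nanoTime only; for anything serious a harness like JMH is the better tool):
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Fill a map, then time N random lookups; the "key0", "key1", ... keys are made up for the sketch.
static void crudeBenchmark(int size, int lookups) {
    Map<String, Integer> map = new HashMap<>();
    String[] keys = new String[size];
    for (int i = 0; i < size; i++) {
        keys[i] = "key" + i;
        map.put(keys[i], i);
    }
    Random random = new Random(42);
    long checksum = 0;                       // keep a result alive so the JIT cannot drop the loop
    long start = System.nanoTime();
    for (int i = 0; i < lookups; i++) {
        checksum += map.get(keys[random.nextInt(size)]);
    }
    long elapsed = System.nanoTime() - start;
    System.out.println(elapsed + " ns, checksum " + checksum);
}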
Unfortunately, you are going to have to benchmark this yourself, because the relative performance will depend critically on the actual String values, and also on the relative probability that you will test a string that is not in your mapping. And of course, it depends on how String.equals() and String.hashCode() are implemented, as well as the details of the HashMap and List classes used.
In the case of a HashMap, a lookup will typically involve calculating the hash of the key String, and then comparing the key String with one or more entry key Strings. The hashcode calculation looks at all characters of the String, and is therefore dependent on the length of the key String. The equals operations will typically examine all of the characters when equals returns true and considerably fewer when it returns false. The actual number of times that equals is called for a given key string depends on how the hashed key strings are distributed. Normally, you'd expect an average of 1 or 2 calls to equals for a "hit" and maybe up to 3 for a "miss".
In the case of a List, a lookup will call equals for an average of half the entry key Strings in the case of a "hit" and all of them in the case of a "miss". If you know the relative distribution of the keys that you are looking up, you can improve the performance in the "hit" case by ordering the list. But the "miss" case cannot be optimized.
In addition to the trie alternative suggested by @aioobe, you could also implement a specialized String-to-integer hash map using a so-called perfect hash function. This maps each of the actual key strings to a unique hash within a small range. The hash can then be used to index an array of key/value pairs. This reduces a lookup to exactly one call to the hash function and one call to String.equals. (And if you can assume that the supplied key will always be one of the mapped strings, you can dispense with the call to equals.)
The difficulty of the perfect hash approach is in finding a function that works for the set of keys in the mapping and is not too expensive to compute. AFAIK, this has to be done by trial and error.
But the reality is that simply using a HashMap is a safe option, because it gives O(1) performance with a relatively small constant of proportionality (unless the entry keys are pathological).
(FWIW, my guess is that the break-even point where HashMap.get() becomes better than List.contains() is less than 10 entries, assuming that the strings have an average length of 5 to 10.)
From what I can remember, the list method will be O(n), but would be quick to add items, as no computation occurs. You could get this down to O(log n) if you implemented a binary search or other searching algorithm. The hash is O(1), but it's slower to insert, since the hash needs to be computed every time you add an element.
I know in .NET there's a special collection called HybridDictionary that does exactly this. It uses a list up to a point, then a hash. I think the crossover is around 10 items, so this may be a good line in the sand.
I would say you're correct in your above statement, though I'm not 100% sure if a list would be faster for small sets, and where the crossover point is.
I think a HashMap will always be better. If you have n strings each of length at most l, then String#hashCode and String#equals are both O(l) (in Java's default implementation, anyway).
When you do List#indexOf it iterates through the list (O(n)) and performs a comparison on each element (O(l)), to give O(nl) performance.
Java's HashMap has (let's say) r buckets, and each bucket contains a linked list. Each of these lists is of length O(n/r) (assuming the String's hashCode method distributes the Strings uniformly between the buckets). To look up a String, you need to calculate the hashCode (O(l)), look up the bucket (O(1) - one, not l), and iterate through that bucket's linked list (O(n/r) elements) doing an O(l) comparison on each one. This gives a total lookup time of O(l + (nl)/r).
As the List implementation is O(nl) and the HashMap implementation is O(nl/r) (I'm dropping the first l as it's relatively insignificant), lookup performance should be equivalent when r=1 and the HashMap will be faster for all greater values of r.
Note that you can set r when you construct the HashMap using this constructor (set the initialCapacity to r and the loadFactor argument to n/r for your given n and chosen r).
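For illustration, with the answer's n and r (values you would have to pick yourself), that pre-sizing looks roughly like this; note the actual table size is still rounded up to a power of two internally:
// Pre-size the map for n expected entries spread over roughly r buckets.
Map<String, Integer> map = new HashMap<>(r, (float) n / r);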

Efficiently finding the intersection of a variable number of sets of strings

I have a variable number of ArrayList's that I need to find the intersection of. A realistic cap on the number of sets of strings is probably around 35 but could be more. I don't want any code, just ideas on what could be efficient. I have an implementation that I'm about to start coding but want to hear some other ideas.
Currently, just thinking about my solution, it looks like I should have an asymptotic run-time of Θ(n²).
Thanks for any help!
tshred
Edit: To clarify, I really just want to know if there is a faster way to do it. Faster than Θ(n²).
Set.retainAll() is how you find the intersection of two sets. If you use HashSet, then converting your ArrayLists to Sets and using retainAll() in a loop over all of them is actually O(n).
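For illustration, a sketch of that loop for an arbitrary number of lists (method name made up; it bails out early once the intersection is empty):
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Intersects any number of lists by retaining only elements common to all of them.
static Set<String> intersectAll(List<List<String>> lists) {
    if (lists.isEmpty()) {
        return new HashSet<>();
    }
    Set<String> result = new HashSet<>(lists.get(0));
    for (int i = 1; i < lists.size(); i++) {
        // Wrapping in a HashSet makes retainAll's contains() checks O(1) per element.
        result.retainAll(new HashSet<>(lists.get(i)));
        if (result.isEmpty()) {              // nothing left to intersect, stop early
            break;
        }
    }
    return result;
}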
The accepted answer is just fine; as an update, since Java 8 there is a slightly more efficient way to find the intersection of two Sets:
Set<String> intersection = set1.stream()
.filter(set2::contains)
.collect(Collectors.toSet());
The reason it is slightly more efficient is that the original approach had to add elements of set1 that it then had to remove again if they weren't in set2. This approach only adds to the result set what needs to be in there.
Strictly speaking you could do this pre Java 8 as well, but without Streams the code would have been quite a bit more laborious.
If both sets differ considerably in size, you would prefer streaming over the smaller one.
There is also the static method Sets.intersection(set1, set2) in Google Guava that returns an unmodifiable view of the intersection of two sets.
One more idea - if your arrays/sets are different sizes, it makes sense to begin with the smallest.
The best option would be to use HashSet to store the contents of these lists instead of ArrayList. If you can do that, you can create a temporary HashSet to which you add the elements to be intersected (use the addAll(..) method). Do tempSet.retainAll(storedSet) and tempSet will contain the intersection.
Sort them (n lg n) and then do binary searches (lg n).
You can use a single HashSet. Its add() method returns false when the object is already in the set. Adding the objects from all the lists and counting the false return values gives you the union in the set plus data for a histogram (and the objects whose count+1 equals the number of lists are your intersection). If you throw the counts into a TreeSet, you can detect an empty intersection early.
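A sketch of that counting idea, using a count map instead of tracking add() return values (a slight variation, same principle): one pass gives the union as the key set and the intersection as the keys seen in every list. It assumes no list contains internal duplicates.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One pass over all lists: the union is the key set, the intersection is every
// element whose count equals the number of lists.
static List<String> countingIntersection(List<List<String>> lists) {
    Map<String, Integer> counts = new HashMap<>();
    for (List<String> list : lists) {
        for (String s : list) {
            counts.merge(s, 1, Integer::sum);   // assumes no duplicates within a single list
        }
    }
    List<String> intersection = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getValue() == lists.size()) {
            intersection.add(e.getKey());
        }
    }
    return intersection;
}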
In case you only need to know whether two sets intersect at all, I use the following snippet on Java 8+:
set1.stream().anyMatch(set2::contains)
