I'm eager to use Guava's RangeSets in my program. Besides the features of adding and merging ranges, I'm also interested in the "size" of my ranges.
Some remarks:
no ranges I'm interested in are infinite!
all ranges I'm using are of the bound type "closedOpen"
the underlying use case is a discrete time-space (size = summed-up time ticks)
This seems to be something which is not built in (or I didn't see it), and I'm wondering whether there is a clear conceptual reason against it (which would mean I should not implement some getSize() function myself) or not.
Let's have a look at my use-case:
RangeSet<Integer> usageTicks = TreeRangeSet.create();
usageTicks.add(Range.closedOpen(3, 7));
usageTicks.add(Range.closedOpen(12,18));
usageTicks.add(Range.closedOpen(18, 23));
int size = usageTicks.hypotheticalGetSizeFunction(); // size = 15
Is there any reason against the following:
Set<Range<Integer>> setOfRanges = usageTicks.asRanges();
int sum = 0;
for(Range<Integer> range : setOfRanges)
sum += (range.upperEndpoint() - range.lowerEndpoint());
Guava's Range only requires one thing of its enclosed types: that they implement Comparable.
But not all which implement Comparable have a notion of distance. How would you measure the distance between two Strings, for instance?
This is why Guava also has DiscreteDomain and ContiguousSet; with the former you have methods such as next(), previous() and distance(), which is what you are interested in here. Guava's site has an article on it.
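To illustrate the point without depending on Guava itself, here is a minimal sketch in plain Java. The Span class and the totalSize method are hypothetical names invented for this example (they are not Guava's API); the idea is only that a size can be computed once a distance function is supplied alongside Comparable, which is the role DiscreteDomain plays.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongBiFunction;

class RangeSizeSketch {
    /** Hypothetical closed-open span [lower, upper); a stand-in, not Guava's Range. */
    static final class Span<C extends Comparable<C>> {
        final C lower;
        final C upper;

        Span(C lower, C upper) {
            this.lower = lower;
            this.upper = upper;
        }
    }

    /** Sums the sizes of non-overlapping spans using a caller-supplied distance. */
    static <C extends Comparable<C>> long totalSize(List<Span<C>> spans,
                                                    ToLongBiFunction<C, C> distance) {
        long sum = 0;
        for (Span<C> span : spans) {
            sum += distance.applyAsLong(span.lower, span.upper);
        }
        return sum;
    }

    /** The three tick ranges from the question, measured with integer subtraction. */
    static long demo() {
        List<Span<Integer>> ticks = Arrays.asList(
                new Span<Integer>(3, 7),
                new Span<Integer>(12, 18),
                new Span<Integer>(18, 23));
        return totalSize(ticks, (lo, hi) -> (long) hi - lo);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 4 + 6 + 5 = 15
    }
}
```

For Strings there is no obvious function to pass as the distance, which is why a generic size method cannot live on a type that only knows Comparable.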
Related
I have a series of double values which I want to sum up and get the maximum value.
The DoubleStream.summaryStatistics() sounds perfect for that.
The getSum() method has an API note reminding me of what I learned during one of my computer science courses: the stability of the summation problem tends to be better if the values are sorted by their absolute values. However, DoubleStream does not let me specify the comparator to use, it will just use Double.compareTo if I call sorted() on the stream.
Thus I gathered the values into a final Stream.Builder<Double> values = Stream.builder(); and call
values.build()
.sorted(Comparator.comparingDouble(Math::abs))
.mapToDouble(a -> a).summaryStatistics();
Yet, this looks somewhat lengthy and I would have preferred to use the DoubleStream.Builder instead of the generic builder.
Did I miss something or do I really have to use the boxed version of the stream just to be able to specify the comparator?
Primitive streams don't have an overloaded sorted method and will get sorted in natural order. But to go back to your underlying problem, there are ways to improve the accuracy of the sum that don't involve sorting the data first.
One such algorithm is the Kahan summation algorithm which happens to be used by the OpenJDK/Oracle JDK internally.
This is admittedly an implementation detail so the usual caveats apply (non-OpenJDK/Oracle JDKs or future OpenJDK JDKs may take alternative approaches etc.)
See also this post: In which order should floats be added to get the most precise result?
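For reference, a minimal sketch of Kahan (compensated) summation written by hand. This is the textbook form of the algorithm, not a copy of the JDK's internal code, and the class and method names are made up for this example.

```java
class KahanSum {
    /** Compensated summation: the rounding error of each addition is carried forward. */
    static double sum(double[] values) {
        double sum = 0.0;
        double compensation = 0.0; // running estimate of the low-order bits lost so far
        for (double v : values) {
            double y = v - compensation;  // apply the correction to the next term
            double t = sum + y;           // low-order bits of y may be lost here
            compensation = (t - sum) - y; // recover what was lost (algebraically zero)
            sum = t;
        }
        return sum;
    }

    /** Plain left-to-right summation, for comparison. */
    static double naiveSum(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] values = new double[11];
        values[0] = 1.0;
        for (int i = 1; i < values.length; i++) {
            values[i] = 1e-16; // each term is too small to change 1.0 on its own
        }
        System.out.println("naive: " + naiveSum(values));
        System.out.println("kahan: " + sum(values));
    }
}
```

In this input the naive loop never moves away from 1.0, while the compensated loop accumulates the small terms, without any sorting.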
The only way to sort a DoubleStream with a custom comparator is to box it, sort, and unbox it again:
double[] input = //...
DoubleStream.of(input).boxed()
.sorted(Comparator.comparingDouble(Math::abs))
.mapToDouble(a -> a).summaryStatistics();
However, as Kahan summation is used internally, the difference should not be very significant. In most applications, unsorted input will yield sufficiently accurate results. Of course, you should test for yourself whether unsorted summation is satisfactory for your particular task.
I have a list of objects that implement non-overlapping ranges, e.g.:
1 to 10
11 to 20
21 to 50
51 to 100
They provide min() and max() to retrieve those values.
I need a datastore to easily retrieve the right object, given a value that must be in its interval.
The easiest way I can think of is to create an ordered ArrayList and simply traverse it until I find the correct interval. So in this case a lookup is done in O(N).
Are there more efficient data structures available in the standard Java library to do this task?
You could try using a NavigableMap; the method is explained in this answer: Using java map for range searches (the 'no holes' approach).
Example implementation using TreeMap:
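// Create the TreeMap, keyed by each interval's lower bound
NavigableMap<Integer, MyInterval> map = new TreeMap<>();
// Putting values
map.put(interval1.min(), interval1);
map.put(interval2.min(), interval2);
// Retrieving values: find the interval with the greatest min() <= value
int value = 15;
Map.Entry<Integer, MyInterval> entry = map.floorEntry(value);
// floorEntry returns null if value is below every min()
if (entry == null)
    return null;
MyInterval interval = entry.getValue();
// Check that value is not greater than max()
if (value <= interval.max())
    return interval;
return null;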
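A self-contained version of the same idea might look like the sketch below. The Interval class is a hypothetical stand-in for the question's objects (inclusive bounds, min() and max() accessors), and IntervalLookup is a name invented here. Each lookup costs O(log N) instead of O(N).

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class IntervalLookup {
    /** Hypothetical interval with inclusive bounds, mirroring min()/max() from the question. */
    static final class Interval {
        private final int min;
        private final int max;

        Interval(int min, int max) {
            this.min = min;
            this.max = max;
        }

        int min() { return min; }
        int max() { return max; }
    }

    // Intervals keyed by their lower bound; assumes they do not overlap.
    private final NavigableMap<Integer, Interval> byMin = new TreeMap<>();

    void add(Interval interval) {
        byMin.put(interval.min(), interval);
    }

    /** Returns the interval containing value, or null if value lies in a gap or outside. */
    Interval find(int value) {
        Map.Entry<Integer, Interval> entry = byMin.floorEntry(value);
        if (entry == null) {
            return null; // value is below every lower bound
        }
        Interval candidate = entry.getValue();
        return value <= candidate.max() ? candidate : null;
    }

    public static void main(String[] args) {
        IntervalLookup lookup = new IntervalLookup();
        lookup.add(new Interval(1, 10));
        lookup.add(new Interval(11, 20));
        lookup.add(new Interval(21, 50));
        lookup.add(new Interval(51, 100));
        Interval hit = lookup.find(15);
        System.out.println(hit.min() + " to " + hit.max());
    }
}
```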
No need to reinvent the wheel, Guava provides the Range, RangeSet, and RangeMap classes. See the Ranges Explained docs for more details.
Is there any standard interface or approach usable in collections/streams (max, sort) for the situation where one might need to compare on multiple sides/objects at once?
The signature could be something like
compare(T... toCompare)
instead of
compare(T object1, T object2)
What I would like is an implementation that works with the comparison operations in the Java APIs. But from what I have seen, I think I am forced to stick to pairwise comparisons.
UPDATE: Practical example: I'd like to have a Comparator implementation, usable by Collections.max()/Stream.max(), that lets me make multi-sided comparisons rather than pairwise ones (i.e., one that accepts multiple T in the compare method). The max function would return the element that wins a custom comparison of it against ALL the others, not the winner of n one-on-one battles.
UPDATE2: More specific example:
I have (Pineapple, Pizza, Yogurt), and max returns the item for which my custom 1 -> n comparison yields the biggest quotient. This quotient could be something like degreeOfYumie. So Pineapple is more yummy than Pizza+Yogurt, Pizza is equally yummy as Pineapple+Yogurt, and Yogurt is equally yummy as Pizza+Pineapple. So the winner is Pineapple. If I did that pairwise, all the ingredients would be equally yummy. Is there any mechanism for implementing a comparator/comparable like that? Perhaps a "sortable" interface that works on collections, streams and queues?
There is no need for a specialized interface. If you have a Comparator that conforms to the specification, it will be transitive and allow comparing multiple objects. To get the maximum out of three or more elements, simply use, e.g.
Stream.of(42, 8, 17).max(Comparator.naturalOrder())
.ifPresent(System.out::println);
// or
Stream.of("foo", "BAR", "Baz").max(String::compareToIgnoreCase)
.ifPresent(System.out::println);
If you are interested in the index of the max element, you can do it like this:
List<String> list=Arrays.asList("foo", "BAR", "z", "Baz");
int index=IntStream.range(0, list.size()).boxed()
.max(Comparator.comparing(list::get, String.CASE_INSENSITIVE_ORDER))
.orElseThrow(()->new IllegalStateException("empty list"));
Regarding your updated question…
You said you want to establish an ordering based on the quotient of an element’s property and the remaining elements. Let’s think this through.
Suppose we have the positive numerical values a, b and c and want to establish an ordering based on a/(b+c), b/(a+c) and c/(a+b).
Then we can transform the term by extending the quotients to have a common denominator:
a(a+c)(a+b) / ((a+b)(b+c)(a+c))
b(b+c)(b+a) / ((a+b)(b+c)(a+c))
c(c+b)(c+a) / ((a+b)(b+c)(a+c))
Since common denominators have no effect on the ordering we can elide them and after expanding the products we get the terms:
a³+a²b+a²c+abc b³+b²a+b²c+abc c³+c²a+c²b+abc
Here we can elide the common summand abc as it has no effect on the ordering.
a³+a²b+a²c b³+b²a+b²c c³+c²a+c²b
then factor out again
a²(a+b+c) b²(a+b+c) c²(a+b+c)
to see that we have a common factor which we can elide as it doesn’t affect the ordering so we finally get
a² b² c²
What does this result tell us? Simply that the quotients are ordered the same way as a², b² and c² and, since all values are positive, the same way as a, b and c themselves. So there is no need to implement a quotient-based comparator when we can prove it to have the same outcome as a simple comparator based on the original values a, b and c.
(The picture would be different if negative values were allowed, but since allowing negative values would create the possibility of a zero denominator, they are out of scope for this use case anyway.)
It should be emphasized that any other result for a particular comparator would prove that that comparator is unusable for standard Comparator use cases. If the combined values of all other elements had an effect on the resulting order, in other words, adding another element to the relation would change the ordering, how should an operation like adding an element to a TreeSet or inserting it at the right position of a sorted list work?
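The equivalence derived above can be spot-checked with a small sketch. The class and method names are invented for this example, and it assumes at least two strictly positive values so that every denominator is positive.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class QuotientOrdering {
    /** Maximum by the plain value. */
    static Double maxByValue(List<Double> values) {
        return values.stream().max(Comparator.<Double>naturalOrder()).get();
    }

    /** Maximum by value / (sum of all other values). */
    static Double maxByQuotient(List<Double> values) {
        double total = values.stream().mapToDouble(Double::doubleValue).sum();
        return values.stream()
                .max(Comparator.comparingDouble((Double v) -> v / (total - v)))
                .get();
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(3.0, 7.5, 1.25, 4.0);
        System.out.println(maxByValue(values));
        System.out.println(maxByQuotient(values));
    }
}
```

Both methods pick the same element, because v / (total - v) grows with v as long as total stays fixed and larger than v. Note that the quotient key depends on the whole collection, so it only works for a one-off max over a fixed list, not as a general-purpose Comparator.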
The problem with comparing multiple objects at once is what to return.
A Java comparator returns a negative integer if the first object is "smaller" than the second one, zero if they are equal, and a positive integer if the first one is the "bigger" one.
If you compare more than two objects, an integer wouldn't suffice to describe the difference between said objects.
If you have a normal Comparable<T> you can combine it any way you want. From being able to compare two things you can build anything (see different sorting algorithms, which usually only need a < implementation).
For example, here's a naive one for "you could say if it's bigger, equal or smaller than ANY of the objects":
<T extends Comparable<T>> int compare(T... toCompare) {
    if (toCompare.length < 2) {
        throw new IllegalArgumentException("Nothing to compare"); // or return something
    }
    T first = toCompare[0];
    // local variables must be initialized before they are incremented
    int smallerCount = 0;
    int equalCount = 0;
    int biggerCount = 0;
    for (int i = 1, n = toCompare.length; i < n; ++i) {
        int compare = first.compareTo(toCompare[i]);
        if (compare == 0) {
            equalCount++;
        } else if (compare < 0) {
            smallerCount++;
        } else {
            biggerCount++;
        }
    }
    // someCombinationOf is a placeholder; how to combine the counts is left open
    return someCombinationOf(smallerCount, equalCount, biggerCount);
}
However, I couldn't figure out a proper way of combining them. Take the sequence (3, 5, 3, 1): 3 is smaller than 5, equal to 3 and bigger than 1, so all counts are 1, and all of your "it's bigger, equal or smaller than ANY" conditions are true at the same time. You could, however, return the counts as an object if it helps to defer combining them to a later point in time.
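Returning the counts as an object could be sketched as follows. ComparisonCounts and its field names are invented for this example; the first argument is compared against all the others, as in the code above.

```java
class ComparisonCounts {
    final int smaller; // number of other elements the first one is smaller than
    final int equal;   // number of other elements the first one is equal to
    final int bigger;  // number of other elements the first one is bigger than

    private ComparisonCounts(int smaller, int equal, int bigger) {
        this.smaller = smaller;
        this.equal = equal;
        this.bigger = bigger;
    }

    /** Compares first against each of the others and tallies the three outcomes. */
    @SafeVarargs
    static <T extends Comparable<T>> ComparisonCounts of(T first, T... others) {
        int smaller = 0;
        int equal = 0;
        int bigger = 0;
        for (T other : others) {
            int result = first.compareTo(other);
            if (result == 0) {
                equal++;
            } else if (result < 0) {
                smaller++;
            } else {
                bigger++;
            }
        }
        return new ComparisonCounts(smaller, equal, bigger);
    }

    public static void main(String[] args) {
        ComparisonCounts counts = ComparisonCounts.of(3, 5, 3, 1);
        System.out.println(counts.smaller + " " + counts.equal + " " + counts.bigger);
    }
}
```

The caller can then decide later how to interpret the three numbers, which sidesteps the problem of squeezing them into a single int.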
I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.
The accepted answer here Efficiently compute Intersection of two Sets in Java? looked extremely helpful but (likely because my sets are small (?)) I haven't gotten much improvement by using the approach suggested in the accepted answer.
Here's what I have so far:
protected double uuEdgeWeight(UVertex u1, UVertex u2) {
Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
List<String> u2Tokens = u2.getTokenlist();
int intersection = 0;
int union = u1Tokens.size();
for (String s:u2Tokens) {
if (u1Tokens.contains(s)) {
intersection++;
} else {
union++;
}
}
return ((double) intersection / union);
}
My question is, is there anything I can do to improve this, given that I'm working with Strings which may be more time consuming to check equality than other data types.
I think because I'm comparing multiple u2's against the same u1, I could get some improvement by doing the cloning of u2 into a HashSet outside of the loop (which isn't shown -- meaning I'd pass in the HashSet instead of the object from which I could pull the list and then clone into a set)
Anything else I can do to squeak out even a small improvement here?
Thanks in advance!
Update
I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt used the approach of cloning the set and calling retainAll with the items of the other set to find the intersection, and then short-circuiting before doing the clone and addAll to find the union. That was about as efficient as the code posted above, presumably because of the trade-off between it being a slower algorithm overall and being able to short-circuit a lot of the time. So, I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.
Thanks in advance!
You would get a large improvement by moving the HashSet outside of the loop.
If the HashSet really has only got a few entries in it then you are probably actually just as fast to use an Array - since traversing an array is much simpler/faster. I'm not sure where the threshold would lie but I'd measure both - and be sure that you do the measurements correctly. (i.e. warm up loops before timed loops, etc).
One thing to try might be using a sorted array for the things to compare against. Scan until you go past current and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.
If you want to optimize this function (not sure if it actually works in your context), you could assign each unique String an int value and, when the String is added to the UVertex, set that int as a bit in a BitSet.
This function should then become a set.or(otherset) and a set.and(otherset). Depending on the number of unique Strings that could be efficient.
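A minimal sketch of the BitSet idea, assuming every distinct token gets the next free bit index. BitSetJaccard, encode and jaccard are names made up for this example, not part of the question's code. The union size is derived from the two cardinalities and the intersection, so only one clone is needed, and intersects() gives a cheap early exit for the common case of disjoint sets.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BitSetJaccard {
    // Maps each distinct token to a bit index, assigned in order of first appearance.
    private final Map<String, Integer> ids = new HashMap<>();

    /** Encodes a token list as a BitSet; do this once per vertex, not once per pair. */
    BitSet encode(List<String> tokens) {
        BitSet bits = new BitSet();
        for (String token : tokens) {
            Integer id = ids.get(token);
            if (id == null) {
                id = ids.size();
                ids.put(token, id);
            }
            bits.set(id);
        }
        return bits;
    }

    /** Returns |a AND b| / |a OR b|, or 0.0 if the sets are disjoint. */
    static double jaccard(BitSet a, BitSet b) {
        if (!a.intersects(b)) {
            return 0.0; // early exit, no allocation
        }
        BitSet common = (BitSet) a.clone();
        common.and(b);
        int intersection = common.cardinality();
        int union = a.cardinality() + b.cardinality() - intersection;
        return (double) intersection / union;
    }

    public static void main(String[] args) {
        BitSetJaccard encoder = new BitSetJaccard();
        BitSet u1 = encoder.encode(Arrays.asList("a", "b", "c"));
        BitSet u2 = encoder.encode(Arrays.asList("b", "c", "d"));
        System.out.println(jaccard(u1, u2));
    }
}
```

Whether this beats the HashSet version depends on how many distinct strings there are: with a very large vocabulary each BitSet becomes long and sparse, so it is worth measuring both.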
What is the easiest way in Java to map strings (Java String) to (positive) integers (Java int), so that
equal strings map to equal integers, and
different strings map to different integers?
So, similar to hashCode(), but different strings are required to produce different integers. So, in a sense, it would be a hashCode() without the possibility of collisions.
An obvious solution would maintain a mapping table from strings to integers,
and a counter to guarantee that new strings are assigned a new integer. I'm just wondering
how is this problem usually solved.
Would also be interesting to extend it to other objects than strings.
Have a look at perfect hashing.
This is impossible to achieve without any restrictions, simply because there are more possible Strings than there are integers, so eventually you will run out of numbers.
A solution is only possible when you limit the number of usable Strings. Then you can use a simple counter. Here is a simple implementation where up to 2^32 = 4294967296 different strings can be used. Never mind that it uses lots of memory.
import java.util.HashMap;
import java.util.Map;
public class StringToInt {
private Map<String, Integer> map;
private int counter = Integer.MIN_VALUE;
public StringToInt() {
map = new HashMap<String, Integer>();
}
public int toInt(String s) {
Integer i = map.get(s);
if (i == null) {
map.put(s, counter);
i = counter;
++counter;
}
return i;
}
}
There's not going to be an easy or complete solution. We use hashes because there are way more possible Strings than there are ints. Collisions are just a limitation of using a finite number of bits to represent integers.
In most hashcode() type implementations, collisions are accepted as inevitable and tested for.
If you absolutely must have no collisions, guaranteed, the solution you outline will work.
Aside from this, there are cryptographic hash functions such as MD5 and SHA, where collisions are extremely unlikely (though with a lot of effort can be forced). The Java Cryptography Architecture has implementations of these. Those methods may perhaps be faster than a good implementation of your solution for very large sets. They will also execute in constant time and give the same code for the same string, no matter which order the strings are added in. Also, it doesn't require storing each string. Crypto hash results could be considered as integers but they won't fit in a java int - you could use a BigInteger to hold them as suggested in another answer.
Incidentally, if you're put off by the idea of a collision being 'extremely unlikely', it's probably similar likelihood that a bit would randomly flip in your computer memory or hard disk and cause any program to behave differently than you expect :-)
Note, there are also some theoretical weaknesses in some hash functions (e.g. MD5) but for your purposes that probably doesn't matter and you could just use the most efficient such function - those weaknesses are only relevant if someone is maliciously trying to come up with strings that have the same code as another string.
edit: I just noticed in the title of your question, it seems you want bidirectional mapping, though you don't actually state this in the question. It is (by design) not possible to go from a Crypto hash to the original string. If you really need that, you'd have to store a map keying hashes back to strings.
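As a rough sketch of the cryptographic-hash route, the following uses the JDK's MessageDigest with SHA-256 and wraps the digest in a non-negative BigInteger. The class and method names are invented for this example, and the choice of SHA-256 and UTF-8 is an assumption, not something the answer prescribes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestId {
    /** Maps a string to a 256-bit non-negative number; collisions are possible but very unlikely. */
    static BigInteger idOf(String s) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1 keeps the value non-negative
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(idOf("my string").toString(16));
    }
}
```

The result does not depend on insertion order and needs no stored table, but as noted above it cannot be turned back into the original string.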
I'd try to do it by introducing an object holding a Map<String, Integer> and a Map<Integer, String>. Adding Strings to that object (or maybe having them created from said object) will assign them an Integer value. Requesting an Integer value for a String already registered will return the same value.
Drawbacks: Different launches will yield different Integers for the same String, depending on order unless you somehow persist the whole thing. Also, it's not very object oriented and requires a special object to create/register a String.
Plus side: It's quite similar to internalizing Strings and easily understandable. (Also, you asked for an easy, not elegant way.)
For the more general case, you might create a high-level subclass of Object, introduce an "integerize" method there and extend every single class from that. I think, however, that road leads to tears.
Since Strings in java are unbounded in length, and each character has 16 bits, and ints have 32 bits, you could only produce a unique mapping of Strings to ints if the Strings were up to two characters. But you could use BigInteger to produce a unique mapping, with something like:
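String s = "my string";
// use an explicit charset so the mapping does not depend on the platform default
BigInteger bi = new BigInteger(s.getBytes(StandardCharsets.UTF_8));
Reverse mapping:
String str = new String(bi.toByteArray(), StandardCharsets.UTF_8);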
Can you use a Map to indicate which Strings you already have assigned integers to? That's kind of the "database-y" solution, where you assign each String a "primary key" from a sequence as it comes up. Then you put the String and Integer pair into a Map<String, Integer> so you can look it up again. And if you need the String for a given Integer, you can also put the same pair into a Map<Integer, String>.
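The "primary key from a sequence" idea can be sketched like this. StringRegistry is a name invented for the example; because the ids are handed out densely from 0, the reverse direction uses an ArrayList indexed by id rather than a second Map, which is a design choice of this sketch rather than something the answer specifies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StringRegistry {
    private final Map<String, Integer> idByString = new HashMap<>();
    private final List<String> stringById = new ArrayList<>();

    /** Returns the id for s, assigning the next free id on first sight. */
    int idOf(String s) {
        Integer id = idByString.get(s);
        if (id == null) {
            id = stringById.size(); // ids are 0, 1, 2, ... in order of registration
            idByString.put(s, id);
            stringById.add(s);
        }
        return id;
    }

    /** Reverse lookup; throws IndexOutOfBoundsException for an id that was never assigned. */
    String stringOf(int id) {
        return stringById.get(id);
    }

    public static void main(String[] args) {
        StringRegistry registry = new StringRegistry();
        int foo = registry.idOf("foo");
        int bar = registry.idOf("bar");
        System.out.println(foo + " " + bar + " " + registry.idOf("foo"));
        System.out.println(registry.stringOf(bar));
    }
}
```

This is not thread-safe, and the ids depend on registration order, so they differ between runs unless the tables are persisted.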
As you outline, a hash table that resolves collisions is a standard solution. You could also use a Bentley/Sedgewick style search trie, which in many applications is faster than hashing.
If you substitute 'unique pointer' for 'unique integer' you can see Dave Hanson's solution to this problem in C. This is quite a nice abstraction because
The pointers can still be used as C strings.
Equal strings hash to equal pointers, so strcmp can be dispensed with in favor of pointer equality, and the pointers can be used as keys in other hash tables.
If Java offers a test for object identity on String objects then you can play the same game there.
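Java does offer this: the == operator tests object identity, and String.intern() returns one canonical instance per distinct string value. A small sketch (the class and method names are invented for the example):

```java
class InternDemo {
    /** True if both strings resolve to the same canonical instance after interning. */
    static boolean sameIdentityAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("token");
        String b = new String("token");
        System.out.println(a == b);                        // two distinct objects
        System.out.println(sameIdentityAfterIntern(a, b)); // one canonical instance
    }
}
```

This gives a unique reference per string rather than a unique int, which matches the 'unique pointer' variant described above.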
If by integer you mean the data type, then as other posters have explained this is quite impossible, due to the fact that the integer data type is of fixed size, and strings are unbounded.
However if you simply mean a positive number, then theoretically you should be able to interpret the string as if it were an "integer" simply by regarding it as a byte array (in a consistent encoding). You could also treat it as an array of integers of arbitrary length, but if you can do that why not just use a string? :)
Implementation-wise, this is usually "solved" by using a hash code and simply double-checking any collisions, since there are likely to be none anyway and on the off chance there is a collision, it still works out to be constant time. However, if this isn't applicable, I'm not sure what the best solution would be.
Interesting question.
I don't know if this is practical, but if we take only the lowercase alphabet, then every word can be viewed as a number in a base-26 positional system. For example, if a is 0 and z is 25, then boom is 1*26^3 + 14*26^2 + 14*26^1 + 12*26^0 = 27416.
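The arithmetic above can be written as a short sketch. Base26 and toNumber are names invented for this example, and it uses a long, which is an assumption made to leave a little more room than an int.

```java
class Base26 {
    /** Reads a lowercase word as a base-26 number with a = 0, ..., z = 25. */
    static long toNumber(String word) {
        long value = 0;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z is supported: " + word);
            }
            value = value * 26 + (c - 'a'); // shift one base-26 digit, then add the new one
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toNumber("boom")); // 27416, as computed above
    }
}
```

Two caveats: because a is the digit 0, leading a's vanish ("a" and "aa" both give 0), so the mapping is only unique if words cannot start with a or if a is numbered 1 instead; and a long overflows after roughly 13 letters.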