Stable sort - do we really need it? - java

I do not understand the underlying problem that a stable sorting algorithm tries to solve.
Arrays.sort(Object[]) Javadoc states:
This sort is guaranteed to be stable: equal elements will
not be reordered as a result of the sort.
But if elements are equal, they are not distinguishable from each other! If you swap two equal elements, this should not affect anything. This is the definition of equality.
So, why do we need stability at all?
UPDATE: My question is about the Collections.sort(List<T>) / Arrays.sort(Object[]) methods, not Collections.sort(List<T>, Comparator<T>) / Arrays.sort(T[], Comparator<T>). The latter ones are a bit different. But there is still no need for stability for them: if you want predictable compound sorts, then you create appropriate compound comparators.

Let's say you have two columns: name and date. You first sort your list by date, and afterwards you sort it by name. If your sort is stable, the names come out ordered correctly, and entries with equal names remain sorted by date. If your sort is not stable, there is no guaranteed relative ordering between entries with equal keys.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class StableSortDemo {
    public static void main(String[] args) {
        List<FirstNameLastName> list = new ArrayList<>();
        list.add(new FirstNameLastName("A", "B"));
        list.add(new FirstNameLastName("D", "B"));
        list.add(new FirstNameLastName("F", "B"));
        list.add(new FirstNameLastName("C", "C"));
        list.add(new FirstNameLastName("E", "C"));
        list.add(new FirstNameLastName("B", "C"));
        // List.sort is stable for objects.
        list.sort(Comparator.comparing((FirstNameLastName p) -> p.firstName));
        // A-B, B-C, C-C, D-B, E-C, F-B
        list.sort(Comparator.comparing((FirstNameLastName p) -> p.lastName));
        // A-B, D-B, F-B, B-C, C-C, E-C
        // So as you see here, inside last names B and C the first names
        // are sorted as well.
        // However, if you had sorted the original list directly on last name only:
        // A-B, D-B, F-B, C-C, E-C, B-C
    }

    private static class FirstNameLastName {
        final String firstName;
        final String lastName;

        FirstNameLastName(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }
}

Consider the example
[{1, 'c'}, {2, 'a'}, {3, 'a'}]
It is sorted by number field, but not by character. After a stable sort by character:
[{2, 'a'}, {3, 'a'}, {1, 'c'}]
After an unstable sort the following order is possible:
[{3, 'a'}, {2, 'a'}, {1, 'c'}]
You can notice that {3, 'a'} and {2, 'a'} were reordered.
Java 8 example (Java API has only stable sort for objects):
List<Point> list = Arrays.asList(new Point(1,1), new Point(1,0), new Point(2,1));
list.sort((a,b) -> Integer.compare(a.x,b.x));
System.out.println(list);
list.sort((a,b) -> Integer.compare(a.y,b.y));
System.out.println(list);

Unstable sort can suck for UIs. Imagine you are using file explorer in Windows, with a details view of songs. Can you imagine if you kept clicking filetype column to sort by filetype, and it randomly reordered everything within each type-group every time you clicked it?
Stable sort in UIs allows me (the user) to create my own compound sorts. I can chain multiple sorts together, like "sort by name", then "sort by artist", then "sort by type". The final resulting sort prioritizes type, then artist, and then finally name. This is because a stable sort actually preserves the previous sort, allowing me to "build" my own sorting from a series of elementary sorts! Whereas unstable sorts nuke the previous sort's results.
Of course, in code, you'd just make one big, fully defined ordering and sort once, rather than a chained "compound" sort like I described above. This is why you tend not to need stable sorts for most application-internal sorting. But when the user drives the sorting click by click, stable sorting is the best/simplest.
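To make the two approaches concrete, here is a minimal sketch. The Song class, its fields, and the sample data are made up for illustration. It simulates three column clicks with successive stable sorts and compares the result with a single sort using one compound comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ChainedSortDemo {
    // Hypothetical row of a song list with three sortable columns.
    static final class Song {
        final String name;
        final String artist;
        final String type;

        Song(String name, String artist, String type) {
            this.name = name;
            this.artist = artist;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "/" + artist + "/" + name;
        }
    }

    static List<Song> sample() {
        return Arrays.asList(
                new Song("Zebra", "Beta", "mp3"),
                new Song("Apple", "Beta", "flac"),
                new Song("Mango", "Alpha", "mp3"),
                new Song("Cherry", "Beta", "mp3"),
                new Song("Banana", "Alpha", "flac"));
    }

    // Simulates three column clicks: least significant key first.
    // This only works because List.sort is stable.
    static List<Song> chained(List<Song> songs) {
        List<Song> copy = new ArrayList<>(songs);
        copy.sort(Comparator.comparing((Song s) -> s.name));
        copy.sort(Comparator.comparing((Song s) -> s.artist));
        copy.sort(Comparator.comparing((Song s) -> s.type));
        return copy;
    }

    // One sort with a fully defined compound ordering.
    static List<Song> compound(List<Song> songs) {
        List<Song> copy = new ArrayList<>(songs);
        copy.sort(Comparator.comparing((Song s) -> s.type)
                .thenComparing((Song s) -> s.artist)
                .thenComparing((Song s) -> s.name));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(chained(sample()));
        System.out.println(compound(sample()));
    }
}
```

Both lines print the same order, which is the sense in which a stable sort lets the user "build" a compound ordering one click at a time.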
EDIT:
"My question is about Collections.sort(List<T>) / Arrays.sort(Object[]) methods"
These are generic sorts that work on anything that defines Comparable. Your class's Comparable implementation might return 0 even when the objects are not technically "equal", because you might be trying for a particular ordering. In other words, these methods are every bit as open to custom orderings as Collections.sort(List<T>, Comparator<T>). And you might want stable sort, or you might not.

You may (and almost always do) have elements for which it is true that:
a.compareTo(b) == 0 and
a.equals(b) == false
Take, for example, a List<Product>, where product has a number of properties:
- id
- price
You could see several use cases where you would want to sort Product by id or price but not by other values.
The big benefit that stable sorting brings to the table is that if you sort by price and then by id, you get a List ordered by id in which elements that compare equal on id are still ordered by price.
If your sorting algorithm is unstable, the second sort by id might break the order that the initial sort by price established among those elements.
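As a small illustration of compareTo returning 0 for objects that are not equal, here is a sketch with a hypothetical Product whose natural ordering looks at price only (the class and the sample values are assumptions, not code from the question). Because Collections.sort is stable, products with the same price keep their original relative order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class NaturalOrderStabilityDemo {
    // Hypothetical Product: natural ordering compares price only,
    // so two different products can compare as 0 without being equal.
    static final class Product implements Comparable<Product> {
        final int id;
        final int price;

        Product(int id, int price) {
            this.id = id;
            this.price = price;
        }

        @Override
        public int compareTo(Product other) {
            return Integer.compare(price, other.price);
        }

        @Override
        public String toString() {
            return id + ":" + price;
        }
    }

    static List<Product> sortedByNaturalOrder() {
        List<Product> products = new ArrayList<>();
        products.add(new Product(3, 10));
        products.add(new Product(1, 20));
        products.add(new Product(2, 10));
        // Stable: 3:10 stays in front of 2:10 because it came first.
        Collections.sort(products);
        return products;
    }

    public static void main(String[] args) {
        Product a = new Product(3, 10);
        Product b = new Product(2, 10);
        System.out.println(a.compareTo(b) == 0); // true
        System.out.println(a.equals(b));         // false
        System.out.println(sortedByNaturalOrder()); // [3:10, 2:10, 1:20]
    }
}
```

With an unstable sort, [2:10, 3:10, 1:20] would be an equally valid result, and the two are distinguishable to anyone looking at the ids.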

Related

Why does Collections.sort in reverse order result in an incomplete sort?

I have a list of files that I am trying to sort from the most recent modified date to the least recent. The date is stored as a long value (milliseconds since the epoch) and I used Collections.sort to sort the files. I want the files to go from most recent to least recent (top to bottom), so I did R2-R1 instead of R1-R2 in the Comparator. The code I used is shown below:
Collections.sort(temp, new Comparator<RecordingFile>() {
@Override
public int compare(RecordingFile R1, RecordingFile R2) {
int x = (int) (R2.getLastModfied()-R1.getLastModfied());
return Integer.compare(x, 0);
}
});
This code resulted in something like so:
14-04-2022
10-04-2022
06-04-2022
05-04-2022
20-03-2022
...
18-04-2022
18-04-2022
17-04-2022
The list is somehow ordered correctly but incorrectly at the same time. The files are ordered in parts instead of fully. I tried shuffling the list before ordering and it resulted in a different order but still the same behaviour (ordered, but in parts). To solve this, I did R1-R2 in the comparator and then reversed the sorted list. This resulted in a fully ordered list that takes into account all items in the list.
I was wondering if anyone knows why this happened?
Just like in the comments, try using Comparator.comparing(RecordingFile::getLastModfied).reversed() instead of doing it manually.
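If you decide to still do it manually, check the type of lastModfied: if it is a long, you should not cast the difference to int before comparing. A difference of more than Integer.MAX_VALUE milliseconds (roughly 24.8 days) does not fit in an int, so the cast can flip its sign. Compare the long values directly instead, for example with Long.compare(R2.getLastModfied(), R1.getLastModfied()).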
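The effect of the cast can be reproduced in isolation. The sketch below uses made-up timestamps 30 days apart; brokenCompare mirrors the arithmetic of the comparator in the question, and newestFirst is one possible fix.

```java
class TimestampCompareDemo {
    // Mirrors the question's comparator: the long difference is cast to int first.
    static int brokenCompare(long lastModified1, long lastModified2) {
        int x = (int) (lastModified2 - lastModified1);
        return Integer.compare(x, 0);
    }

    // Compares the long values directly, newest first.
    static int newestFirst(long lastModified1, long lastModified2) {
        return Long.compare(lastModified2, lastModified1);
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long older = 1_650_000_000_000L; // arbitrary example timestamp
        long newer = older + 30 * day;   // 2_592_000_000 ms later, more than Integer.MAX_VALUE

        // The cast wraps the difference to a negative int, so the older file
        // is reported as belonging before the newer one.
        System.out.println(brokenCompare(older, newer)); // -1
        System.out.println(newestFirst(older, newer));   // 1
    }
}
```

Files less than about 24.8 days apart still compare correctly with the broken version, which would explain a list that looks sorted in parts.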

Hibernate Search / Lucene based Sorting Issue

I am having a sorting issue, which I describe below.
Previously, the code was written as
Sort sort = new Sort(new SortField[] {
SortField.FIELD_SCORE,
new SortField("field_1", SortField.STRING),
new SortField("field_2", SortField.STRING),
new SortField("field_2", SortField.LONG)
});
and this is an example posted in a Stack Overflow answer on custom sorting,
Sorting search result in Lucene based on a numeric field.
Though the author does not suggest this is the correct way to do the sorting, this is also the code my company has used for years.
But when I created a new function that needs to sort on lots of fields and ran unit tests, I found that it does not actually work as intended.
I need to remove SortField.FIELD_SCORE to make it work. I think this is what the example described here suggests, if I understood correctly: https://docs.jboss.org/hibernate/search/4.1/reference/en-US/html_single/#d0e5317.
i.e. the main code becomes
Sort sort = new Sort(new SortField[] {
new SortField("field_1", SortField.STRING),
new SortField("field_2", SortField.STRING),
new SortField("field_2", SortField.LONG)
});
So my questions are:
What is SortField.FIELD_SCORE used for? How is the score calculated?
Why does including SortField.FIELD_SCORE sometimes return correct results and sometimes not?
What is SortField.FIELD_SCORE used for? How is the score calculated?
When you search for documents containing a word, each document gets assigned a "score": a float value, generally positive. The higher this value, the better the match. How exactly this is computed is a bit complex, and it gets worse when you have multiple nested queries (e.g. boolean queries, etc.), because then scores get combined with other formulas. Suffice it to say: the score is a number, there's one value for each document, and higher is better.
SortField.FIELD_SCORE will simply sort documents by descending score.
Why does including SortField.FIELD_SCORE sometimes return correct results and sometimes not?
Hard to say. It depends on lots of things, like your analyzers, the exact query you're running, and even the frequency of the search terms in your documents. Like I said, the formula used to compute the score is complex.
One thing that stands out in your sort, though, is that you're sorting by score and by actual fields. That's unlikely to work well. Scores are generally unique, so unless your documents are very similar (e.g. all text fields are empty for some reason), the top documents will have scores like this: [5.1, 3.4, 2.6, 2.4, 2.2]. Their order is already "complete": you can add as many subsequent sorts as you want, the order will not change because it is fully defined by the sort by score.
Think of alphabetical order: if I have to sort ["area", "baby"], the second letter of "baby" may be "a", but it doesn't matter, because the first letter is "b" and it's always going to be after the "a" of "area".
So, if you're not interested in a sort by score (and, if you don't know what score is, chances are you indeed are not interested), just stick to sorts by field:
Sort sort = new Sort(new SortField[] {
new SortField("field_1", SortField.STRING),
new SortField("field_2", SortField.STRING),
new SortField("field_2", SortField.LONG)
});
And if you're interested in a sort by score, then just sort by score:
Sort sort = new Sort(new SortField[] {
SortField.FIELD_SCORE
});
// Or equivalently
Sort sort = Sort.RELEVANCE; // "Relevance" means "sort by score"
Note that Hibernate Search 4.1 (the version for your documentation link) is very old; you should consider upgrading at least to 5.11 (similar API, also old but still maintained), and preferably to 6.0 (different, but more modern API, new and also maintained).
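The point about unique scores can be illustrated without Lucene at all. The following plain-Java sketch uses a hypothetical Hit class and made-up scores (it does not use the Lucene or Hibernate Search API) to show that once the first sort key has no ties, any later key is never consulted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ScoreFirstSortDemo {
    // Hypothetical search hit: a score plus one sortable field.
    static final class Hit {
        final float score;
        final String field1;

        Hit(float score, String field1) {
            this.score = score;
            this.field1 = field1;
        }

        @Override
        public String toString() {
            return field1 + "(" + score + ")";
        }
    }

    static List<Hit> sample() {
        return Arrays.asList(
                new Hit(2.6f, "mango"),
                new Hit(5.1f, "zebra"),
                new Hit(3.4f, "apple"));
    }

    // Descending score only.
    static List<Hit> byScoreOnly(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return copy;
    }

    // Descending score, then field1 as a tie-breaker that never fires
    // because every score in the sample is unique.
    static List<Hit> byScoreThenField(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparing((Hit h) -> h.field1));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(byScoreOnly(sample()));
        System.out.println(byScoreThenField(sample()));
    }
}
```

Both calls print the hits in the same score order, with "zebra" first, even though an alphabetical sort on field1 would put it last.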

Multiobject Comparable/Comparator interface

Is there any standard interface or approach usable in collections/streams (max, sort) for the situation where one might need to compare on multiple sides/objects at once?
The signature could be something like
compare(T... toCompare)
instead of
compare(T object1, T object2)
What I would like is an implementation that works with the comparing operations in the Java APIs. But from what I have seen, I think I am forced to stick to pairwise comparisons.
UPDATE: Practical example: I'd like to have a Comparator implementation interpreted by Collections/Stream.max() that allowed me to make multiside comparisons not unitary comparisons (i.e, that accepts multiple T in the compare method). The max function returns the element so that element is the winner of a comparison mechanism, custom implemented, of it against ALL the others, not the winner of n battles 1 vs 1.
UPDATE2: More specific example:
I have (Pineapple, Pizza, Yogurt), and max returns the item for which my custom 1 -> n comparison returns the biggest quotient. This quotient could be something like degreeOfYumie. So Pineapple is more yummy than Pizza+Yogurt, Pizza is as yummy as Pineapple+Yogurt, and Yogurt is as yummy as Pizza+Pineapple. So the winner is Pineapple. If I did that one against one, all the ingredients would be equally yummy. Is there any mechanism for implementing a comparator/comparable like that? Perhaps a "sortable" interface that works on collections, streams and queues?
There is no need for a specialized interface. If you have a Comparator that conforms to the specification, it will be transitive and allow comparing multiple objects. To get the maximum out of three or more elements, simply use, e.g.
Stream.of(42, 8, 17).max(Comparator.naturalOrder())
.ifPresent(System.out::println);
// or
Stream.of("foo", "BAR", "Baz").max(String::compareToIgnoreCase)
.ifPresent(System.out::println);
If you are interested in the index of the max element, you can do it like this:
List<String> list=Arrays.asList("foo", "BAR", "z", "Baz");
int index=IntStream.range(0, list.size()).boxed()
.max(Comparator.comparing(list::get, String.CASE_INSENSITIVE_ORDER))
.orElseThrow(()->new IllegalStateException("empty list"));
Regarding your updated question…
You said you want to establish an ordering based on the quotient of an element’s property and the remaining elements. Let’s think this through.
Suppose we have the positive numerical values a, b and c and want to establish an ordering based on a/(b+c), b/(a+c) and c/(a+b).
Then we can transform the terms by extending the quotients to have a common denominator:
a(a+c)(a+b) / ((a+b)(b+c)(a+c)),   b(b+c)(b+a) / ((a+b)(b+c)(a+c)),   c(c+b)(c+a) / ((a+b)(b+c)(a+c))
Since common denominators have no effect on the ordering we can elide them and after expanding the products we get the terms:
a³+a²b+a²c+abc,   b³+b²a+b²c+abc,   c³+c²a+c²b+abc
Here we can elide the common summand abc as it has no effect on the ordering:
a³+a²b+a²c,   b³+b²a+b²c,   c³+c²a+c²b
Then factor out again:
a²(a+b+c),   b²(a+b+c),   c²(a+b+c)
to see that we have a common positive factor, which we can elide as it doesn’t affect the ordering, so we finally get
a²,   b²,   c²
What does this result tell us? Simply that the quotients are ordered the same way as a², b² and c², and, since the values are positive, the same way as a, b and c themselves. So there is no need to implement a quotient based comparator when we can prove it to have the same outcome as a simple comparator based on the original values a, b and c.
(The picture would be different if negative values were allowed, but since allowing negative values would create the possibility of getting zero as denominator, they are out of scope for this use case anyway.)
It should be emphasized that any other result for a particular comparator would prove that that comparator is unusable for standard Comparator use cases. If the combined values of all other elements had an effect on the resulting order, in other words, adding another element to the relation would change the ordering, how should an operation like adding an element to a TreeSet or inserting it at the right position of a sorted list work?
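The derivation can also be checked numerically. This is only a sanity check under the same assumption of positive values, with made-up numbers and hypothetical method names: it orders the indices once by the raw values and once by the "one against all the others" quotient and compares the two orders.

```java
import java.util.Arrays;
import java.util.Comparator;

class QuotientOrderDemo {
    // Returns the indices 0..n-1 sorted by ascending key.
    static Integer[] orderBy(double[] keys) {
        Integer[] indices = new Integer[keys.length];
        for (int i = 0; i < keys.length; i++) {
            indices[i] = i;
        }
        Arrays.sort(indices, Comparator.comparingDouble((Integer idx) -> keys[idx]));
        return indices;
    }

    static Integer[] orderByValue(double[] values) {
        return orderBy(values);
    }

    // Key of element i is values[i] divided by the sum of all other values.
    static Integer[] orderByQuotient(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        double[] quotients = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            quotients[i] = values[i] / (sum - values[i]);
        }
        return orderBy(quotients);
    }

    public static void main(String[] args) {
        double[] values = {3.0, 1.0, 2.0}; // quotients: 1.0, 0.2, 0.5
        System.out.println(Arrays.toString(orderByValue(values)));    // [1, 2, 0]
        System.out.println(Arrays.toString(orderByQuotient(values))); // [1, 2, 0]
    }
}
```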
The problem with comparing multiple objects at once is what to return.
A Java comparator returns a negative number if the first object is "smaller" than the second one, 0 if they are equal, and a positive number if the first one is the "bigger" one.
If you compare more than two objects, an integer wouldn't suffice to describe the difference between said objects.
If you have a normal Comparable<T> you can combine it any way you want. From being able to compare two things you can build anything (see different sorting algorithms, which usually only need a < implementation).
For example here's a naive one for "you could say if it's bigger, equal or smaller than ANY of the objects"
<T extends Comparable<T>> int compare(T... toCompare) {
    if (toCompare.length < 2) {
        throw new IllegalArgumentException("Nothing to compare"); // or return something
    }
    T first = toCompare[0];
    // Local counters must be initialized before they are incremented.
    int smallerCount = 0;
    int equalCount = 0;
    int biggerCount = 0;
    for (int i = 1, n = toCompare.length; i < n; ++i) {
        int compare = first.compareTo(toCompare[i]);
        if (compare == 0) {
            equalCount++;
        } else if (compare < 0) {
            smallerCount++;
        } else {
            biggerCount++;
        }
    }
    // someCombinationOf is a placeholder: how to combine the counts is left open.
    return someCombinationOf(smallerCount, equalCount, biggerCount);
}
However, I couldn't figure out a proper way of combining them. Take the sequence (3, 5, 3, 1): 3 is smaller than 5, equal to 3 and bigger than 1, so all counts are 1, and all of your "it's bigger, equal or smaller than ANY" conditions are true at the same time. You could, however, return the counts as an object if it helps to defer the combination of counts to a later point in time.
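Returning the counts as an object could look like the following sketch. The Counts class and the method name countAgainstFirst are made up for illustration; the counting logic is the same as in the naive method above.

```java
class ComparisonCountsDemo {
    // Hypothetical result holder: how the first element relates to the others.
    static final class Counts {
        final int smaller; // first element is smaller than this many others
        final int equal;   // first element is equal to this many others
        final int bigger;  // first element is bigger than this many others

        Counts(int smaller, int equal, int bigger) {
            this.smaller = smaller;
            this.equal = equal;
            this.bigger = bigger;
        }
    }

    @SafeVarargs
    static <T extends Comparable<T>> Counts countAgainstFirst(T... toCompare) {
        if (toCompare.length < 2) {
            throw new IllegalArgumentException("Nothing to compare");
        }
        T first = toCompare[0];
        int smaller = 0;
        int equal = 0;
        int bigger = 0;
        for (int i = 1; i < toCompare.length; i++) {
            int c = first.compareTo(toCompare[i]);
            if (c == 0) {
                equal++;
            } else if (c < 0) {
                smaller++;
            } else {
                bigger++;
            }
        }
        return new Counts(smaller, equal, bigger);
    }

    public static void main(String[] args) {
        Counts counts = countAgainstFirst(3, 5, 3, 1);
        // All three counts are 1 for this sequence.
        System.out.println(counts.smaller + " " + counts.equal + " " + counts.bigger);
    }
}
```

The caller can then decide later how, or whether, to collapse the three counts into a single result.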

Can a 2D array hold an array within a given point?

I am a fairly new programmer with a question on arrays in Java. Consider a 2D array, [i][j]. The value of i is determined at run time. The value of j is known to be 7. At [i][6] and [i][7] I want to be able to store a deeper array or list of values. Is it possible to have something like an array within an array, where there is an x and y axis and a z axis at the point of [i][6] and [i][7], or will I need a full 3D cube of memory to be able to store and navigate my data?
The Details: My goal is to run a query which takes certain information from two tables (target and attacker). My query is fine and I can get a resultset. What I really want to be able to do is to store the data from my resultset and present it in a table in a more useful format while also using it in a data visualization program. The fields I get are: server_id, target_ip, threat_level, client_id, attacker_ip and num_of_attacks. I could get 20 records that have the same server_id, target_ip, threat_level, client_id but different attacker_ip and num_of_attacks because that machine got attacked 20 times. A third dimension would allow me to do this, but the 3rd axis/array would be empty for server_id, target_ip, threat_level, client_id.
UPDATE: after reviewing the answers and doing some more thinking, I'm wondering if using an ArrayList of objects would be best for me, and/or possible. Keeping data organized and easily accessible is a big concern for me. In pseudocode it would be something like this:
Object[] servers
String server_id
String target
String threat_level
String client_id
String arr[][] // this array will hold attacker_ip in one axis and num_of_attacks in the other in order to keep the relation between the attacking ip and the number of attacks they make against one specific server
First, if you have an array DataType[i][j] and j is known to be 7, the two greatest indexes you can use are 5 and 6, not 6 and 7. This is because Java array indexes are 0-based. When creating the array you indicate the number of elements, not the maximum index (which is always one less than the number of elements).
Second, there is nothing wrong with using multidimensional arrays when the problem domain already uses them. I can think of scientific applications and data analysis applications, but not many more. If, on the contrary, you are modelling a business problem whose domain does not use multidimensional arrays, you are probably better off using more abstract data structures instead of forcing arrays into the design just because they seem very efficient, because of experience in other languages where arrays are more important, or for other reasons.
Without having much information, I'd say your "first dimension" could be better represented by a List type (say ArrayList). Why? Because you say its size is determined at runtime (and I assume this comes indirectly, not as a magic number that you obtain from somewhere). Lists are similar to arrays but have the particularity that they "know" how to grow. Your program can easily append new elements as it reads them from a source or otherwise discovers/creates them. It can also easily insert them at the beginning or in the middle, but this is rare.
So, your first dimension would be: ArrayList<something>, where something is the type of your second dimension.
Regarding this second dimension, you say that it has a size of 7, but that the first 5 items accept single values while the last 2 accept multiple ones. This already tells me that the 7 items are not homogeneous, and thus an array is a poor fit. This dimension would be much better represented by a class. To understand this class's structure, let's say that the 5 single-valued elements are homogeneous (of type, say, BigDecimal). One of the most natural representations for this is an array, as the size is known. The 2 remaining, multi-valued elements also seem to constitute an array. However, given that each of its 2 elements contains an unidentified number of data items, the element type of this array should not be BigDecimal as in the previous case, but ArrayList. The type of the elements of these ArrayLists is whatever the type of the multiple values is (say BigDecimal too).
The final result is:
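import java.math.BigDecimal;
import java.util.ArrayList;

class SecondD {
    BigDecimal[] singleValued = new BigDecimal[5];
    // Java forbids "new ArrayList<BigDecimal>[2]" (generic array creation),
    // so create a raw array and suppress the unchecked warning.
    @SuppressWarnings("unchecked")
    ArrayList<BigDecimal>[] multiValued = new ArrayList[2];
    {
        multiValued[0] = new ArrayList<BigDecimal>();
        multiValued[1] = new ArrayList<BigDecimal>();
    }
}
ArrayList<SecondD> data = new ArrayList<SecondD>();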
In this code snippet I'm not only declaring the structures, but also creating them so they are ready to use. Pure declaration would be:
class SecondD {
BigDecimal[] singleValued;
ArrayList<BigDecimal>[] multiValued;
}
ArrayList<SecondD> data= new ArrayList<SecondD>() ;
Array size is not important in Java from a type (and thus structural) point of view. That's why you don't see any 5 or 2.
Access to the data structure would be like
data.get(130).singleValued[2]
data.get(130).multiValued[1].get(27)
A possible variant that could be much clearer in certain cases is
class SecondD {
BigDecimal monday;
BigDecimal tuesday;
BigDecimal wednesday;
BigDecimal thursday;
BigDecimal friday;
ArrayList<BigDecimal> saturday= new ArrayList<BigDecimal>() ;
ArrayList<BigDecimal> sunday= new ArrayList<BigDecimal>() ;
}
ArrayList<SecondD> data= new ArrayList<SecondD>() ;
In this case we are "expanding" each array into individual items, each with a name. Typical access operations would be:
data.get(130).wednesday
data.get(130).sunday.get(27)
Which variant to choose? Well, that depends on how similar or different the operations on the different items are. If every time you perform an operation with monday you also perform it with tuesday, wednesday, thursday, and friday (not saturday and sunday because these are a completely different kind of thing, remember?), then an array could be better. For example, to sum the items when stored as an array, all that is necessary is:
SecondD element = data.get(130);
BigDecimal sum = BigDecimal.ZERO;
for (BigDecimal e : element.singleValued) sum = sum.add(e);
While if expanded:
SecondD element = data.get(130);
BigDecimal sum = BigDecimal.ZERO;
sum = sum.add(element.monday);
sum = sum.add(element.tuesday);
sum = sum.add(element.wednesday);
sum = sum.add(element.thursday);
sum = sum.add(element.friday);
In this case, with only 5 elements, the difference is not much. The first way makes things slightly shorter, while the second makes them clearer. Personally, I vote for clarity. Now, if instead of 5 items there had been 1,000, or even as few as 20, the repetition in the second case would have been too much and the first case preferable. I have another general rule for this too: if I can name every element separately, then it's probably better to do exactly so. If while trying to name the elements I find myself using numbers or sequential letters of the alphabet (either naturally, as in the days of the month, or because things just don't seem to have different names), then it's arrays. You could still find cases that are not clear even after applying these two criteria. In that case toss a coin, start developing the program, and think a bit about how things would be the other way. You can change your mind any time.
If your application is indeed a scientific one, please forgive me for such a long (and useless) explanation. My answer could help others looking for something similar, though.
Use ArrayList instead of array primitives. You can have your three dimensions, without the associated inefficient wastage of allocating a "cube"
If you are not creating a custom class like @nIcE cOw suggested, Collections are more cumbersome for this kind of thing than primitive arrays. This is because Java likes to be verbose and doesn't do certain things for you, like operator overloading (as C++ does), or give you the ability to easily instantiate an ArrayList from arrays.
To exemplify, here's @sbat's example with ArrayLists:
public static <T> ArrayList<T> toAL(T ... input) {
ArrayList<T> output = new ArrayList<T>();
for (T item : input) {
output.add(item);
}
return output;
}
public static void main(String[] args) {
ArrayList<ArrayList<ArrayList<Integer>>> a = toAL(
toAL(
toAL(0, 1, 2)
),
toAL(
toAL(4, 5)
),
toAL(
toAL(6)
)
);
System.out.println(a.get(0).get(0).get(2));
System.out.println(a.get(1).get(0).get(1));
System.out.println(a.get(2).get(0).get(0));
}
Of course, there's nothing syntactically wrong with doing:
int[][][] a = {{{0, 1, 2}}, {{4, 5}}, {{6}}};
System.out.println(a[0][0].length); // 3
System.out.println(a[1][0].length); // 2
System.out.println(a[2][0].length); // 1
In fact, that's what multidimensional arrays in Java are, they're arrays within arrays.
The only problem I see with this is that it might become confusing or difficult to maintain later on, but so would using ArrayLists within ArrayLists:
List<List<List<Integer>>> list = ...;
System.out.println(list.get(0).get(1).get(50)); // using ArrayList
However, there are still reasons as to why you might prefer an array over a collection. But ArrayLists or other collections may be preferable depending on the circumstance.
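For the server/attacker data from the question, the "ArrayList of objects" idea from the update could be sketched roughly as follows. The class names ServerRecord and Attack, the helper methods, and the sample values are assumptions made for illustration; only the field names come from the question.

```java
import java.util.ArrayList;
import java.util.List;

class ServerAttackModelDemo {
    // One (attacker_ip, num_of_attacks) pair.
    static final class Attack {
        final String attackerIp;
        final int numOfAttacks;

        Attack(String attackerIp, int numOfAttacks) {
            this.attackerIp = attackerIp;
            this.numOfAttacks = numOfAttacks;
        }
    }

    // The four single-valued fields plus a growable list of attacks,
    // so no empty "third axis" is allocated for the single-valued fields.
    static final class ServerRecord {
        final String serverId;
        final String targetIp;
        final String threatLevel;
        final String clientId;
        final List<Attack> attacks = new ArrayList<>();

        ServerRecord(String serverId, String targetIp, String threatLevel, String clientId) {
            this.serverId = serverId;
            this.targetIp = targetIp;
            this.threatLevel = threatLevel;
            this.clientId = clientId;
        }

        int totalAttacks() {
            int total = 0;
            for (Attack attack : attacks) {
                total += attack.numOfAttacks;
            }
            return total;
        }
    }

    static ServerRecord sample() {
        ServerRecord record = new ServerRecord("srv-1", "192.0.2.10", "high", "client-7");
        record.attacks.add(new Attack("198.51.100.4", 5));
        record.attacks.add(new Attack("203.0.113.9", 2));
        return record;
    }

    public static void main(String[] args) {
        List<ServerRecord> servers = new ArrayList<>();
        servers.add(sample());
        ServerRecord first = servers.get(0);
        System.out.println(first.serverId + " was attacked " + first.totalAttacks() + " times");
    }
}
```

Each record keeps the relation between an attacking IP and its attack count without reserving a cube of mostly unused cells.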

About SortedSet interface, java tutorial

Reading this Oracle tutorial I came across this explanation of the difference between the range-view operations of a List and the ones provided by the SortedSet interface.
Here is the relevant bit:
The range-view operations are somewhat analogous to those provided by
the List interface, but there is one big difference. Range views of a
sorted set remain valid even if the backing sorted set is modified
directly. This is feasible because the endpoints of a range view of a
sorted set are absolute points in the element space rather than
specific elements in the backing collection, as is the case for lists.
Could anybody explain the bold part in, let's say, other words?
Thanks in advance.
Let's say you have a list and a set both containing the integers 11, 13, 15 and 17.
You could write set.subSet(12, 15) to construct a view, and then insert 12 into the original set. If you do this, 12 will appear in the view.
This is not possible with the list. Even though you can construct a view, the moment you modify the original list structurally (e.g. insert an element), the view becomes invalid.
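The difference can be tried out with a short sketch using the numbers above. It assumes java.util.TreeSet and java.util.ArrayList as the concrete implementations; for a list view the general contract only says that its behaviour becomes undefined after a structural change to the backing list, and ArrayList happens to signal this with a ConcurrentModificationException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class RangeViewDemo {
    // The view is defined by the values 12 (inclusive) and 15 (exclusive),
    // so an element added to the backing set later shows up in it.
    static SortedSet<Integer> setViewAfterInsert() {
        TreeSet<Integer> set = new TreeSet<>(Arrays.asList(11, 13, 15, 17));
        SortedSet<Integer> view = set.subSet(12, 15);
        set.add(12);
        return view;
    }

    // The view is defined by index positions 1..2, so a structural change
    // to the backing list invalidates it.
    static boolean listViewBreaksAfterInsert() {
        List<Integer> list = new ArrayList<>(Arrays.asList(11, 13, 15, 17));
        List<Integer> view = list.subList(1, 3);
        list.add(1, 12);
        try {
            view.size();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(setViewAfterInsert());        // [12, 13]
        System.out.println(listViewBreaksAfterInsert()); // true
    }
}
```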
The short answer is that range views of sorted sets are defined by values, unlike views of lists, where you are working with, essentially, pointers (indexes). Changes to the underlying list change those indexes, which makes holding views of a list for long problematic. Since a sorted set is ordered and contains no duplicates, the boundaries of the range are fixed points in the element space. This means that the view can't become invalid if an insertion or deletion occurs within the range while you hold it.
More technically, the definition of range in this context:
A range, sometimes known as an interval, is a convex (contiguous) portion of a particular domain. Convexity means that for any a <= b <= c, range.contains(a) && range.contains(c) implies that range.contains(b). Ranges may extend to infinity; for example, the range "x > 3" contains arbitrarily large values -- or may be finitely constrained, for example "2 <= x < 5".
