Comparing an array and getting the difference

How would I compare two arrays that might have different lengths and get the difference between each array?
For example:
Cat cat = new Cat();
Dog dog = new Dog();
Alligator alligator = new Alligator();
Animal animals[] = { cat, dog };
Animal animals2[] = { cat, dog, alligator };
How would I compare these two arrays and make the comparison return the instance of Alligator?

I would suggest that your question needs to be clarified. Currently, everyone is guessing about what you are actually asking.
Are the arrays intended to represent sets, or lists, or something in between? In other words, does element order matter, and can there be duplicates?
What does "equal" mean? Does new Cat() "equal" new Cat()? Your example suggests that it does!!
What do you mean by the "difference"? Do you mean set difference?
What do you want to happen if the two arrays have the same length?
Is this a once-off comparison or does it occur repeatedly for the same arrays?
How many elements are there in the arrays (on average)?
Why are you using arrays at all?
Assuming that these arrays are intended to be true sets, you should probably be using HashSet instead of arrays, and using collection operations like removeAll and retainAll to calculate the set difference and intersection.
On the other hand, if the arrays are meant to represent lists, it is not at all clear what "difference" means.
If it is critical that the code runs fast, then you most certainly need to rethink your data structures. If you always start with arrays, you are not going to be able to calculate the "differences" fast ... at least in the general case.
Finally, if you are going to use anything that depends on the equals(Object) method (and that includes any of the Java collection types), you really need to have a clear understanding of what "equals" is supposed to mean in your application. Are all Cat instances equal? Are they all different? Are some Cat instances equal and others not? If you don't figure this out, and implement the equals and hashCode methods accordingly, you will get confusing results.
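To make the point about equals concrete, here is a minimal sketch of the set-based approach. It assumes the arrays really are sets and that two animals count as equal when a hypothetical name field matches; NamedAnimal, ArrayDifference and difference are illustration names, not part of the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

final class ArrayDifference {
    // Returns the elements of 'larger' that do not occur in 'smaller'.
    static <T> Set<T> difference(T[] larger, T[] smaller) {
        Set<T> result = new LinkedHashSet<>(Arrays.asList(larger));
        result.removeAll(new HashSet<>(Arrays.asList(smaller)));
        return result;
    }

    public static void main(String[] args) {
        NamedAnimal[] animals = { new NamedAnimal("cat"), new NamedAnimal("dog") };
        NamedAnimal[] animals2 = {
            new NamedAnimal("cat"), new NamedAnimal("dog"), new NamedAnimal("alligator")
        };
        System.out.println(difference(animals2, animals)); // prints [alligator]
    }
}

// Hypothetical animal type: equality is defined by name only (an assumption).
final class NamedAnimal {
    final String name;

    NamedAnimal(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof NamedAnimal && name.equals(((NamedAnimal) other).name);
    }

    @Override
    public int hashCode() {
        return name.hashCode();
    }

    @Override
    public String toString() {
        return name;
    }
}
```

Without the equals and hashCode overrides, the two "cat" instances above would be treated as different elements and the result would contain all three animals.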

I suggest that you put your objects in sets and then use an intersection of the sets:
// Considering you put your objects in setA and setB
Set<Object> intersection = new HashSet<Object>(setA);
intersection.retainAll(setB);
After that you can use removeAll to get a difference to any of the two sets:
setA.removeAll(intersection);
setB.removeAll(intersection);
Inspired by: http://hype-free.blogspot.com/2008/11/calculating-intersection-of-two-java.html
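Note that removeAll modifies the set it is called on. If you want to keep setA and setB intact, a sketch along these lines works on copies instead; SetDifferences, minus and symmetricDifference are names made up for this example, and strings stand in for the animal objects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SetDifferences {
    // Elements of 'a' that are not in 'b'; neither input is modified.
    static <T> Set<T> minus(Set<T> a, Set<T> b) {
        Set<T> copy = new HashSet<>(a);
        copy.removeAll(b);
        return copy;
    }

    // Elements that occur in exactly one of the two sets.
    static <T> Set<T> symmetricDifference(Set<T> a, Set<T> b) {
        Set<T> result = minus(a, b);
        result.addAll(minus(b, a));
        return result;
    }

    public static void main(String[] args) {
        Set<String> setA = new HashSet<>(Arrays.asList("cat", "dog"));
        Set<String> setB = new HashSet<>(Arrays.asList("cat", "dog", "alligator"));
        System.out.println(minus(setB, setA));               // prints [alligator]
        System.out.println(minus(setA, setB));               // prints []
        System.out.println(symmetricDifference(setA, setB)); // prints [alligator]
    }
}
```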

Well, you could maybe use Set instead and use the removeAll() method.
Or you could use the following simple and slow algorithm:
List<Animal> differences = new ArrayList<Animal>();
for (Animal a1 : animals) {
    boolean isInSecondArray = false;
    for (Animal a2 : animals2) {
        if (a1 == a2) { // reference comparison: same instance, not equals()
            isInSecondArray = true;
            break;
        }
    }
    if (!isInSecondArray)
        differences.add(a1);
}
Then differences will have all the objects that are in animals array but not in animals2 array. In a similar way you can do the opposite (get all the objects that are in animals2 but not in animals).

You may want to look at this article for more information:
http://download-llnw.oracle.com/javase/tutorial/collections/interfaces/set.html
As was mentioned, removeAll() is made for this, but you will want to do it twice, so that you can create a list of all that are missing in both, and then you could combine these two results to have a list of all the differences.
But, this is a destructive operation, so if you don't want to lose the information, copy the Set and operate on that one.
UPDATE:
It appears that my assumption of what is in the array is wrong, so removeAll() won't work, but with a 5ms requirement, depending on the number of items to search, it could be a problem.
So, it would appear a HashMap<String, Animal> would be the best option, as it is fast in searching.
Animal is an interface with at least one property, String name. For each class that implements Animal, write code for equals and hashCode. You can find some discussion here: http://www.ibm.com/developerworks/java/library/j-jtp05273.html. This way, if you want the hash value to be a combination of the type of animal and the name, that will be fine.
So, the basic algorithm is to keep everything in the hashmaps, and then to search for differences, just get an array of keys, and search through to see if that key is contained in the other list, and if it isn't put it into a List<Object>, storing the value there.
You will want to do this twice, so, if you have at least a dual-core processor, you may get some benefit out of having both searches being done in separate threads, but then you will want to use one of the concurrent datatypes added in JDK5 so that you don't have to worry about synchronizations in the combined list of differences.
So, I would write it first as a single thread and test, to get some idea of how much faster it is, also comparing it to the original implementation.
Then, if you need it faster, try using threads, again, compare to see if there is a speed increase.
Before making any optimization ensure you have some metrics on what you already have, so that you can compare and see if the one change will lead to an increase in speed.
If you make too many changes at a time, one may have a large improvement on speed, but others may lead to a performance decrease, and it wouldn't be seen, which is why each change should be one at a time.
Don't lose the other implementations though, by using unit tests and testing perhaps 100 times each, you can get an idea as to what improvement each change gives you.
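The single-threaded version of the map-based search described above could look roughly like this. It is only a sketch: MapDifference and missingFrom are hypothetical names, and plain strings stand in for the Animal values keyed by name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapDifference {
    // Values of 'source' whose keys are absent from 'other'.
    static <K, V> List<V> missingFrom(Map<K, V> source, Map<K, V> other) {
        List<V> missing = new ArrayList<>();
        for (Map.Entry<K, V> entry : source.entrySet()) {
            if (!other.containsKey(entry.getKey())) {
                missing.add(entry.getValue());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Strings stand in for Animal instances; the keys play the role of 'name'.
        Map<String, String> first = new HashMap<>();
        first.put("cat", "Cat");
        first.put("dog", "Dog");
        Map<String, String> second = new HashMap<>(first);
        second.put("alligator", "Alligator");

        // Search in both directions and combine the results.
        List<String> differences = new ArrayList<>(missingFrom(first, second));
        differences.addAll(missingFrom(second, first));
        System.out.println(differences); // prints [Alligator]
    }
}
```

Each containsKey call is a hash lookup, so the whole search is roughly linear in the total number of entries, assuming reasonable hash codes.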

I don't care about perf for my usages (and you shouldn't either, unless you have a good reason to, and you find out via your profiler that this code is the bottleneck).
What I do is similar to functional's answer. I use LINQ set operators to get the exception on each list:
http://msdn.microsoft.com/en-us/library/bb397894.aspx
Edit:
Sorry, I didn't notice this is Java. Sorry, I'm off in C# la-la land, and they look very similar :)

Related

Is creating a HashMap alongside an ArrayList just for constant-time contains() a valid strategy?

I've got an ArrayList that can be anywhere from 0 to 5000 items long (pretty big objects, too).
At one point I compare it against another ArrayList, to find their intersection. I know this is O(n^2).
Is creating a HashMap alongside this ArrayList, to achieve constant-time lookup, a valid strategy here, in order to reduce the complexity to O(n)? Or is the overhead of another data structure simply not worth it? I believe it would take up no additional space (besides for the references).
(I know, I'm sure 'it depends on what I'm doing', but I'm seriously wondering if there's any drawback that makes it pointless, or if it's actually a common strategy to use. And yes, I'm aware of the quote about prematurely optimizing. I'm just curious from a theoretical standpoint).

First of all, a short side note:
And yes, I'm aware of the quote about prematurely optimizing.
What you are asking about here is not "premature optimization"!
You are not talking about replacing a multiplication with some odd bitwise operations "because they are faster (on a 90's PC, in a C-program)". You are thinking about the right data structure for your application pattern. You are considering the application cases (though you did not tell us many details about them). And you are considering the implications that the choice of a certain data structure will have on the asymptotic running time of your algorithms. This is planning, or maybe engineering, but not "premature optimization".
That being said, and to tell you what you already know: It depends.
To elaborate on this a bit: It depends on the actual operations (methods) that you perform on these collections, how frequently you perform them, how time-critical they are, and how memory-sensitive the application is.
(For 5000 elements, the latter should not be a problem, as only references are stored - see the discussion in the comments)
In general, I'd also be hesitant to really store the Set alongside the List, if they are always supposed to contain the same elements. This wording is intentional: You should always be aware of the differences between both collections. Primarily: A Set can contain each element only once, whereas a List may contain the same element multiple times.
For all hints, recommendations and considerations, this should be kept in mind.
But even if it is taken for granted that the lists will always contain elements only once in your case, you still have to make sure that both collections are maintained properly. If you really just stored them, you could easily cause subtle bugs:
private Set<T> set = new HashSet<T>();
private List<T> list = new ArrayList<T>();
// Fine
void add(T element)
{
    set.add(element);
    list.add(element);
}
// Fine
void remove(T element)
{
    set.remove(element);
    list.remove(element); // May be expensive, but ... well
}
// Added later, 100 lines below the other methods:
void removeAll(Collection<T> elements)
{
    set.removeAll(elements);
    // Ooops - something's missing here...
}
To avoid this, one could even consider creating a dedicated collection class - something like a FastContainsList that combines a Set and a List, and forwards the contains call to the Set. But you'll quickly notice that it will be hard (or maybe impossible) to not violate the contracts of the Collection and List interfaces with such a collection, unless the clause that "You may not add elements twice" becomes part of the contract...
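For illustration only, here is a rough sketch of such a combined class. It deliberately does not implement List or Collection, precisely because of the contract problem just mentioned, and it makes "no duplicates" part of its own contract. The name SyncedListAndSet and its small method set are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SyncedListAndSet<T> {
    private final Set<T> set = new HashSet<>();
    private final List<T> list = new ArrayList<>();

    // Rejects duplicates so that the list and the set always hold the same elements.
    boolean add(T element) {
        if (!set.add(element)) {
            return false;
        }
        list.add(element);
        return true;
    }

    boolean remove(T element) {
        if (!set.remove(element)) {
            return false;
        }
        list.remove(element); // linear scan of the list, may be expensive
        return true;
    }

    // Goes through remove() so that both collections are updated together.
    boolean removeAll(Collection<? extends T> elements) {
        boolean changed = false;
        for (T element : elements) {
            changed |= remove(element);
        }
        return changed;
    }

    // Constant-time lookup (on average) through the set.
    boolean contains(T element) {
        return set.contains(element);
    }

    // Indexed access through the list.
    T get(int index) {
        return list.get(index);
    }

    int size() {
        return list.size();
    }
}
```

Every mutating method touches both collections in one place, which is the property the removeAll example above was missing.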
So again, all this depends on what you want to do with these methods, and which interface you really need. If you don't need the indexed access of List, then it's easy. Otherwise, referring to your example:
At one point I compare it against another ArrayList, to find their intersection. I know this is O(n^2).
You can avoid this by creating the sets locally:
static <T> List<T> computeIntersection(List<T> list0, List<T> list1)
{
    Set<T> set0 = new LinkedHashSet<T>(list0);
    Set<T> set1 = new LinkedHashSet<T>(list1);
    set0.retainAll(set1);
    return new ArrayList<T>(set0);
}
This will have a running time of O(n). Of course, if you do this frequently, but rarely change the contents of the lists, there may be options to avoid the copies, but for the reason mentioned above, maintaining the required data structures may become tricky.

Java: Vector sort vs Collection sort

I have never had the pleasure to work with the code below, and I've now stumbled upon an assignment where I'm supposed to argue which of these is most used, and why.
We've got two examples,
public Vector<Integer> sort(Vector<Integer> integers) { ... }
public Collection<Integer> sort(Collection<Integer> integers) {}
Essentially, we're to argue which of these two examples is the best solution to sorting things. Which of these two is used the most, and why?

Essentially, we're to argue which of these two examples is the best solution to sorting things. Which of these two is used the most, and why?
Lots of comments but I don't really see an answer. So: the second one is more used nowadays. The first one is an old method, from the early Java days. You can still use it, but nowadays people mostly use the second one because it imposes fewer requirements on the person using it (in particular, the collection which you're sorting doesn't need to be a Vector).

There are advantages and disadvantages to each. I think you're being asked to choose (perhaps deliberately) between two poor choices.
As many have already said, Vector is obsolete. It should not be used.
However… there's a reason the Collections.sort method takes a List and not a Collection: How does one sort an unordered Set? How would one sort a collection which already has a required order, like a TreeSet?
Since the notion of sorting a Collection is so poorly defined, the method which takes a Vector is probably the better one to use, even though Vector itself shouldn't be used.
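One way to sidestep the problem, which is neither of the two signatures from the assignment, is to accept any Collection but return a List, since only a List has a defined element order. A sketch under that assumption (SortSketch and sorted are made-up names):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

final class SortSketch {
    // Copies the input into a List and sorts the copy; the input is left untouched.
    static List<Integer> sorted(Collection<Integer> integers) {
        List<Integer> copy = new ArrayList<>(integers);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(new HashSet<>(Arrays.asList(3, 1, 2)))); // prints [1, 2, 3]
        System.out.println(sorted(Arrays.asList(5, 4)));                   // prints [4, 5]
    }
}
```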

Dictionary dilemma: Array vs Arraylist

What I want to know is which one would be more efficient, should I use a 1D array and list 100 words, or make an array list to do the same thing in Java?
Note: I've only used arrays so far, array lists would be slightly new to me, I know what it is, I just have never used it before, also they would be used to randomly select a word.

If you know the final number of elements from the beginning, then there is no point in using an ArrayList over an array. ArrayLists are dynamic: they can grow, but you pay a small price for that in performance and memory. The difference is slim, but if you don't need the auto-grow feature of ArrayList, why ask for it?
However, besides that, there is another criterion that may (or may not) make a bigger difference: arrays are covariant whereas ArrayLists are not. That is, if B is a subclass of A, then a reference to an array of A can also accept a reference to an array of B, but a reference to an ArrayList of A cannot accept an ArrayList of B. In other words, an array of B is treated as a subtype of an array of A, but an ArrayList of B is not a subtype of an ArrayList of A:
class A {}
class B extends A {}
A[] a = new B[1]; // OK
ArrayList<A> a2 = new ArrayList<B>(); // Error.
To circumvent this last error, you can try with a family of types such as:
ArrayList<? extends A> a3 = new ArrayList<B>();
but then you are limiting what can be written to the ArrayList a3:
a3.add(new A()); // Error!
a3.add(new B()); // Error again!
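The flip side of array covariance is that it is only checked at run time. The following sketch reuses the A and B classes from above (VarianceSketch and its method names are made up for this example) and shows that a store into a covariant array can fail, while reading through a bounded wildcard is always allowed:

```java
import java.util.ArrayList;
import java.util.List;

final class VarianceSketch {
    static class A {}
    static class B extends A {}

    // Array covariance compiles, but the element type is enforced at run time.
    static boolean storeFailsAtRuntime() {
        A[] array = new B[1];   // compiles: B[] is a subtype of A[]
        try {
            array[0] = new A(); // compiles too, but the array really holds B
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    // Reading elements as A through a bounded wildcard is allowed.
    static int countElements(List<? extends A> list) {
        int count = 0;
        for (A ignored : list) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(storeFailsAtRuntime()); // prints true
        List<B> bs = new ArrayList<>();
        bs.add(new B());
        System.out.println(countElements(bs));     // prints 1
    }
}
```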
However, when you have a hierarchy of classes, it's usually a better idea to keep working with the superclass. Therefore, even when you have a set of objects B where B is a subclass of A, keeping A[] and ArrayList<A> instead of B[] and ArrayList<B> for holding references to these B objects is often better suited to OOP and easier to work with.
Sometimes, you may have to make a cast from A to B in order to access a property or a method of B which is not accessible from A. However, this could be considered as a weakness in the design. OOP works best when you use the polymorphism at its fullest extent and the base class (or super class) should have all the necessary virtual functions to access the properties and methods of all subclasses and therefore you should be able to keep a reference to a subclass using the base class without having to make any cast thereafter.

I suggest you use List; there is almost no difference in performance between an array and a list.
But with a List your code will be easier to manage and more flexible than with an array.

If efficiency is your biggest worry here, then you have no worries. Use whichever you want. You'll see no appreciable difference in performance between an array and a List for 99.99% (completely made up) of applications. In general Lists are preferred over arrays because they're easier to work with.
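For the use case in the question (randomly selecting a word), the two variants look almost identical. A small sketch, with RandomWord and pick as made-up names:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

final class RandomWord {
    // Picks a random word from a plain array.
    static String pick(String[] words, Random random) {
        return words[random.nextInt(words.length)];
    }

    // Picks a random word from a List (for example an ArrayList).
    static String pick(List<String> words, Random random) {
        return words.get(random.nextInt(words.size()));
    }

    public static void main(String[] args) {
        Random random = new Random();
        String[] array = { "apple", "pear", "plum" };
        List<String> list = Arrays.asList(array);
        System.out.println(pick(array, random));
        System.out.println(pick(list, random));
    }
}
```

Both lookups are constant-time index accesses, so the choice mostly comes down to convenience.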

Why do people create arraylist like this?

Occasionally I see somebody create an arraylist like this, why?
List numbers = new ArrayList( );
Instead of:
ArrayList<something> numbers = new ArrayList<something>();

If you are asking about using the interface instead of the concrete class, then it is good practice. Imagine you switch to LinkedList tomorrow: in the first case you won't need to fix the variable declaration.
If the question was about not using generics, then it is bad. Generics are always good as they give type safety.

What's good:
1. List is a general case for many implementations.
List trololo = new ListImpl();
Hides real implementation for the user:
public List giveMeTheList() {
    List trololo = new SomeCoolListImpl();
    return trololo;
}
By design it's good: the user shouldn't have to care about the implementation. He just gets access to it through the interface. The implementation should already have all the necessary properties: being fast for appending, fast for inserting, unmodifiable, etc.
What's bad:
I've read that all raw types will be restricted in future Java versions, so such code is better written this way:
List<?> trololo = new ListImpl<Object>(); // a wildcard cannot be used with new, so a concrete type argument is needed here
In general, the wildcard has the same meaning: you don't know for sure whether your collection will be heterogeneous or homogeneous.

Someday you could do:
List<something> numbers = new LinkedList<something>();
without changing the client code that uses numbers.

Declaring the interface instead of the implementation is indeed a good and widespread practice, but it is not always the best way. Use it every time unless all of the following conditions are true:
You are completely sure that the chosen implementation will satisfy your needs.
You need some implementation-specific feature that is not available through the interface, e.g. ArrayList.trimToSize()
Of course, you may use casting, but then using interface makes no sense at all.

The first line is old-style Java; we had to do it before Java 1.5 introduced generics. But a lot of brilliant software engineers are still forced to use Java 1.4 (or less), because their companies fear the risk and effort of upgrading the applications...
OK, that was off the record. A lot of legacy code has been produced with Java 1.4 or less and has not been refactored.
The second line includes generics (so it's clearly 1.5+) and the variable is declared as an ArrayList. There's actually no big problem. Sure, it's always better to code against interfaces, so in my (and others') opinion, don't declare a variable as ArrayList unless you really need the special ArrayList methods.

Most of the time, when you don't care about the implementation, it's better to program to interface. So, something like:
List<something> numbers = new ArrayList<something>();
would be preferred over:
ArrayList<something> numbers = new ArrayList<something>();
The reason is that you can tweak your program later for performance reasons.
But you have to be careful not to just choose the most generic interface available. For example, if you want to have a sorted set, you should program to SortedSet instead of Set, like this:
SortedSet<something> s = new TreeSet<something>();
If you just blindly use the most general interface like this:
Set<something> s = new TreeSet<something>();
someone can change the implementation to HashSet and your program will break.
Lastly, programming to an interface is even more useful when you define a public API.
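A small sketch of why the declared type matters: a method that relies on ordering can say so in its signature. SortedSetSketch and smallest are made-up names, and strings stand in for something.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

final class SortedSetSketch {
    // Declaring SortedSet documents that the caller relies on ordering.
    static String smallest(SortedSet<String> names) {
        return names.first(); // first() is declared on SortedSet, not on Set
    }

    public static void main(String[] args) {
        SortedSet<String> s = new TreeSet<>(Arrays.asList("dog", "cat", "alligator"));
        System.out.println(smallest(s)); // prints alligator
        System.out.println(s);           // prints [alligator, cat, dog]
    }
}
```

Passing a HashSet to smallest would not compile, so the mistake described above is caught by the compiler instead of showing up at run time.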

There are two differences. First, numbers in the first line is of type List, not ArrayList. This is possible because ArrayList implements List; that is, it has everything that List has, so it can fill in for a List object. (This doesn't work the other way around.)
Second, the second line's ArrayList is typed. This means that the second numbers list can only hold objects of type something.

Removing duplicates without overriding hash method

I have a List which contains a list of objects and I want to remove from this list all the elements which have the same values in two of their attributes. I had thought about doing something like this:
List<Class1> myList;
....
Set<Class1> mySet = new HashSet<Class1>();
mySet.addAll(myList);
and overriding the hash method in Class1 so it returns a number which depends only on the attributes I want to consider.
The problem is that I need to do a different filtering in another part of the application so I can't override hash method in this way (I would need two different hash methods).
What's the most efficient way of doing this filtering without overriding hash method?
Thanks

Overriding hashCode and equals in Class1 (just to do this) is problematic. You end up with your class having an unnatural definition of equality, which may turn out to be wrong for other current and future uses of the class.
Review the Comparator interface and write a Comparator<Class1> implementation to compare instances of your Class1 based on your criteria; e.g. based on those two attributes. Then instantiate a TreeSet<Class1> for duplicate detection using the TreeSet(Comparator) constructor.
EDIT
Comparing this approach with @Tom Hawtin's approach:
The two approaches use roughly comparable space overall. The treeset's internal nodes roughly balance the hashset's array and the wrappers that support the custom equals / hash methods.
The wrapper + hashset approach is O(N) in time (assuming good hashing) versus O(NlogN) for the treeset approach. So that is the way to go if the input list is likely to be large.
The treeset approach wins in terms of the lines of code that need to be written.
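A minimal sketch of the TreeSet approach, assuming Java 8 or later. Item is a hypothetical stand-in for Class1 with two significant attributes (attr1, attr2) and one field that is ignored for duplicate detection; none of these names come from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ComparatorDedup {
    // Hypothetical stand-in for Class1.
    static final class Item {
        final String attr1;
        final int attr2;
        final String other; // not considered when looking for duplicates

        Item(String attr1, int attr2, String other) {
            this.attr1 = attr1;
            this.attr2 = attr2;
            this.other = other;
        }
    }

    // Keeps the first item for each (attr1, attr2) pair.
    // The result is ordered by the comparator, not by input order.
    static List<Item> dedup(List<Item> items) {
        Comparator<Item> byAttrs = Comparator
                .comparing((Item i) -> i.attr1)
                .thenComparingInt(i -> i.attr2);
        Set<Item> unique = new TreeSet<>(byAttrs);
        unique.addAll(items);
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
                new Item("a", 1, "first"),
                new Item("a", 1, "second"),
                new Item("b", 2, "third"));
        System.out.println(dedup(items).size()); // prints 2
    }
}
```

Class1 itself stays untouched, so a different Comparator can be used for the other filtering elsewhere in the application.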

Let your Class1 implement Comparable. Then use TreeSet as in your example (i.e. use the addAll method).

As an alternative to what Roman said you can have a look at this SO question about filtering using Predicates. If you use Google Collections anyway this might be a good fit.

I would suggest introducing a class for the concept of the parts of Class1 that you want to consider significant in this context. Then use a HashSet or HashMap.

Sometimes programmers make things too complicated trying to use all the nice features of a language, and the answers to this question are an example. Overriding anything on the class is overkill. What you need is this:
class MyClass {
    Object attr1;
    Object attr2;
    // Note: MyClass needs equals() and hashCode() based on attr1 and attr2;
    // otherwise the HashSet below compares by identity and never finds a duplicate.
}
List<Class1> list;
Set<Class1> set=....
Set<MyClass> tempset = new HashSet<MyClass>();
for (Class1 c : list) {
    MyClass myc = new MyClass();
    myc.attr1 = c.attr1;
    myc.attr2 = c.attr2;
    if (!tempset.contains(myc)) {
        tempset.add(myc);
        set.add(c);
    }
}
Feel free to fix up minor irregularities. There will be some issues depending on what you mean by equality for the attributes (and obvious changes if the attributes are primitive). Sometimes we need to write code, not just use the built-in libraries.
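For the helper-class idea to detect duplicates, the helper has to define equals and hashCode over the two attributes; Class1 itself does not need to override anything. A self-contained sketch, where Class1 is a hypothetical stand-in with attr1, attr2 and an extra label field, and Key and KeyDedup are made-up names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

final class KeyDedup {
    // Hypothetical stand-in for the Class1 from the question.
    static final class Class1 {
        final Object attr1;
        final Object attr2;
        final String label;

        Class1(Object attr1, Object attr2, String label) {
            this.attr1 = attr1;
            this.attr2 = attr2;
            this.label = label;
        }
    }

    // Holds only the attributes that matter for this particular filtering.
    static final class Key {
        private final Object attr1;
        private final Object attr2;

        Key(Class1 c) {
            this.attr1 = c.attr1;
            this.attr2 = c.attr2;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            return Objects.equals(attr1, k.attr1) && Objects.equals(attr2, k.attr2);
        }

        @Override
        public int hashCode() {
            return Objects.hash(attr1, attr2);
        }
    }

    // Keeps the first element for each key, preserving input order.
    static List<Class1> dedup(List<Class1> list) {
        Set<Key> seen = new HashSet<>();
        List<Class1> result = new ArrayList<>();
        for (Class1 c : list) {
            if (seen.add(new Key(c))) { // add() returns false for a duplicate key
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Class1> list = Arrays.asList(
                new Class1("a", 1, "first"),
                new Class1("a", 1, "second"),
                new Class1("b", 2, "third"));
        for (Class1 c : dedup(list)) {
            System.out.println(c.label); // prints first, then third
        }
    }
}
```

A second filtering with different criteria would simply use a second key class.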
