Efficient algorithm for filtering

I'm working on an application which is a service. I receive a request object and need to pass it through a set of filters and return the response. There are about 10 filters I need to pass the object through.
Currently the application does a sequential search in every filter, as follows:
public List<Element> FilterA(Request request) {
    List<Element> result = new ArrayList<>();
    for (Element element : items) {
        // compare element to request object elements
        // there are different field checks per object
        // (matching elements are added to result)
    }
    return result;
}
So there are FilterB, FilterC, etc. They are all done in a similar fashion; within the for loops, different fields are compared.
Can this be done with a HashSet? Or binary search?
Or is there a more efficient algorithm? Essentially I'd like to improve on O(n).

If you have a list of n elements and f filters, there are basically only two approaches: iterate through the list once and apply each filter to each individual element (keep it if it passes all of them, remove it otherwise); or do what you're doing now and let each filter iterate over the entire list. Both have a worst-case complexity of O(n*f), assuming O(1) element removal (I recommend using a LinkedList and removing through its iterator to achieve this; copy the contents to one if necessary).
You can really only improve upon this complexity by utilising properties of your input. Maybe you can combine multiple filters into one (when they're range checks, for instance) or maybe taking one element from the list will also result in the removal of others. Also, if you can guess which filters will probably remove more elements it will pay off to run these first.
So yeah, it really depends on what kind of stuff you're filtering and what your filters look like. In the most general case you can't win much (as long as you're already using lists from which you can remove elements in O(1) time) but you might gain something if you take knowledge of your input into account.
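The first approach (one pass over the list, every filter applied per element) could be sketched roughly as follows. The class name FilterChain and the method applyAll are made up for illustration, and the filters are modelled as java.util.function.Predicate instances, which is an assumption about how the real filters could be expressed. The inner loop stops at the first failing filter, so putting the most selective filters first reduces the work per element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of the original application.
final class FilterChain {

    /** Returns the items accepted by every filter; each item is visited once. */
    static <T> List<T> applyAll(List<T> items, List<Predicate<T>> filters) {
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            boolean passes = true;
            for (Predicate<T> filter : filters) {
                if (!filter.test(item)) {
                    passes = false;
                    break; // short-circuit: later filters are skipped for this item
                }
            }
            if (passes) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two toy filters: even numbers, then numbers greater than 10.
        List<Predicate<Integer>> filters = List.of(n -> n % 2 == 0, n -> n > 10);
        System.out.println(applyAll(List.of(4, 12, 15, 20), filters)); // prints [12, 20]
    }
}
```

This collects survivors into a new list instead of removing from the original, which sidesteps the removal cost altogether. The worst case is still O(n*f).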

Related

Is switching between Collections worth it?

Java offers us Collections, where every option is best used in a certain scenario.
But what would be a good solution for the combination of following tasks:
Quickly iterate through every element in the list (order does not matter)
Check if the list contains (a) certain element(s)
Some options that were considered which may or may not be good practice:
It could be possible to, for example, first use a LinkedList, and then convert it to a HashSet when the amount of elements is unknown in advance (and if duplicates will not be present)
Pick a solution for one of both tasks and use the same implementation for the other task (if switching to another implementation is not worth it)
Perhaps some implementation exists that does both (failed to find one)
Is there a 'best' solution to this, and if so, what is it?
EDIT: For potential future visitors, this page contains many implementations with big O runtimes.

A HashSet can be iterated through quickly and provides efficient lookups.
HashSet<Object> set = new HashSet<>();
set.add("Hello");
for (Object obj : set) {
    System.out.println(obj);
}
if (set.contains("Hello")) {
    System.out.println("Found");
}

Quickly iterate through every element in the list (order does not matter)
If the order does not matter, any Collection implementation will do: iterating over all elements is O(n) for each of them, since they all implement Iterable and you have to visit each element at least once (hence there is nothing better than O(n)). In practice, of course, one implementation may suit you better than another, since you usually have multiple considerations to take into account.
Check if the list contains (a) certain element(s)
This is typically the use case for a Set: you will get much better time complexity for contains operations. One thing to note here is that a Set does not have a predefined order when iterating over elements. It can change between implementations, and it is risky to make assumptions about it.
Now to your question:
From my perspective, if you get to choose the data structure of a class yourself, go with the most natural one for that use case. If you expect to call contains a lot, then a Set might be suited to your use case. You can also use a List and, each time you need to call contains (multiple times), first create a Set with all elements from the List. Of course, if you call this method often, it would be expensive to create the Set for each invocation; in that case, use a Set in the first place.
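As a rough illustration of the "keep a List, build a Set before a batch of contains calls" idea, here is a minimal sketch. The class and method names are invented for this example; the point is only that the O(n) cost of building the HashSet is paid once per batch rather than once per lookup.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only.
final class ContainsLookup {

    /** Counts how many of the queries occur in items, building the lookup set once. */
    static int countPresent(List<String> items, List<String> queries) {
        Set<String> lookup = new HashSet<>(items); // O(n), done once for the whole batch
        int found = 0;
        for (String query : queries) {
            if (lookup.contains(query)) { // expected O(1) per lookup
                found++;
            }
        }
        return found;
    }
}
```

If there is only one contains call between modifications, building the Set costs more than a single linear scan, so this only pays off for batches of lookups.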
Your comment stated that you have a world of players and you want to check if a player is part of a certain world object. Since the world owns the players, it should also contain a Collection of some kind to store them. In this case I would recommend a Map with a common identifier of the player as key, and the player itself as value.
public class World {
    private Map<String, Player> players = new HashMap<>();
    public Collection<Player> getPlayers() { ... }
    public Optional<Player> getPlayer(String nickname) { ... }
    // ...
}
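One possible way to fill in such a class is sketched below. Everything beyond the Map field and the two getter signatures is an assumption: the Player stand-in, its nickname field, and the addPlayer method are hypothetical and only there to make the example self-contained.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch; the real World and Player classes may look different.
final class WorldSketch {

    /** Hypothetical stand-in for the real Player type. */
    static final class Player {
        final String nickname;

        Player(String nickname) {
            this.nickname = nickname;
        }
    }

    private final Map<String, Player> players = new HashMap<>();

    /** Hypothetical registration method; keyed by nickname. */
    void addPlayer(Player player) {
        players.put(player.nickname, player);
    }

    /** Iteration over all players: a read-only view of the map's values. */
    Collection<Player> getPlayers() {
        return Collections.unmodifiableCollection(players.values());
    }

    /** Lookup by nickname in expected constant time. */
    Optional<Player> getPlayer(String nickname) {
        return Optional.ofNullable(players.get(nickname));
    }
}
```

The map covers both tasks from the question: players.values() for iterating over everyone, and get for checking whether a particular player is in the world.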

Datastructure to find/search and add?

I need to add strings to a data structure (DS). Later I need to find a string and then remove it based on the same condition.
A HashSet seems like the best fit here, as it provides O(1) expected complexity for searching for and removing a given element. In an ArrayList or array, search is O(n), and so is removal.
As I understand it, a HashSet will be better here, as I need to search for a large number of elements and remove each one that is found.
My question: is a HashSet or some other DS better here?

Usually such tasks are best handled by the trie data structure and its variations.
Alternatively you can use a hash table; however, it doesn't guarantee O(1) operations in the worst case.

As usual, it depends on your needs:
if you just need to add String instances, retrieve and delete them, this is a no-brainer: HashSet is your choice (TreeSet has worse asymptotic complexity, O(log n) per operation, and is suitable when you need the String instances ordered for some reason, e.g. alphabetically)
if you wish to store and efficiently search all String instances with a specified prefix, use a trie, as correctly mentioned in Serge Rogatch's answer
if you wish to check whether a pattern exists in a specified String instance, use a suffix tree
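For the first case in the list above (add, find, remove), a minimal sketch with HashSet might look like this. The StringStore wrapper and its method names are invented for illustration; the only library calls are Set.add, Set.remove and Set.size, where remove already reports whether the element was present, so no separate contains call is needed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper, only to show the add / find-and-remove pattern.
final class StringStore {

    private final Set<String> values = new HashSet<>();

    /** Adds s; returns false if it was already stored. */
    boolean add(String s) {
        return values.add(s);
    }

    /** Removes s if present; returns true if it was found and removed. */
    boolean findAndRemove(String s) {
        return values.remove(s);
    }

    int size() {
        return values.size();
    }
}
```

Both operations run in expected constant time, assuming the strings hash reasonably well.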

Why don't we count linear search cost as a prerequisite bottleneck for the insertion operation of a linked list, compared to ArrayList?

I have had this question for a while but I have been unsatisfied with the answers because the distinctions appear to be arbitrary and more like conventional wisdom that is sort of blindly accepted rather than assessed critically.
In an ArrayList it is said that insertion cost (for a single element) is linear. If we are inserting at index p for 0 <= p < n where n is the size of the list, then the remaining n-p elements are shifted over first before the new element is copied into position p.
In a LinkedList, it is said that insertion cost (for a single element) is constant. For instance if we already have a node and we want to insert after it, we rearrange some pointers and it's done quickly. But getting this node in the first place, I don't see how it can be done other than a linear search first (assuming it isn't a trivial case like prepending at the start of the list or appending at the end).
And yet in the case of the LinkedList, we don't count that initial search time. To me this is confusing because it's sort of like saying "The ice cream is free... after you pay for it." It's like, well, of course it is... but that sort of skips the hard part of paying for it. Of course inserting in a LinkedList is going to be constant time if you already have the node you want, but getting that node in the first place may take some extra time! I could easily say that inserting in an ArrayList is constant time... after I move the remaining n-p elements.
So I don't understand why this distinction is made for one but not the other. You could argue that insertion is considered constant for LinkedLists because of the cases where you insert at the front or back where linear time operations are not required, whereas in an ArrayList, insertion requires copying of the suffix array after position p, but I could easily counter that by saying if we insert at the back of an ArrayList, it is amortized constant time and doesn't require extra copying in most cases unless we reach capacity.
In other words, we separate the linear stuff from the constant stuff for LinkedList, but we don't separate them for the ArrayList, even though in both cases the linear operations may or may not be invoked.
So why do we consider them separate for LinkedList and not for ArrayList? Or are they only being defined here in the context where LinkedList is overwhelmingly used for head/tail appends and prepends as opposed to elements in the middle?

This is basically a limitation of the Java interface for List and LinkedList, rather than a fundamental limitation of linked lists. That is, in Java there is no convenient concept of "a pointer to a list node".
Every type of list has a few different concepts loosely associated with the idea of pointing to a particular item:
The idea of a "reference" to a specific item in a list
The integer position of an item in the list
The value of an item that may be in the list (possibly multiple times)
The most general concept is the first one, and is usually encapsulated in the idea of an iterator. As it happens, the simple way to implement an iterator for an array backed list is simply to wrap an integer which refers to the position of the item in a list. So for array lists only, the first and second ways of referring to items are pretty tightly bound.
For other list types, however, and even for most other container types (trees, hashes, etc.), that is not the case. The generic reference to an item is usually something like a pointer to the wrapper structure around one item (e.g., HashMap.Entry or LinkedList.Entry). For these structures the idea of accessing the nth element isn't necessarily natural or even possible (e.g., unordered collections like sets and many hash maps).
Perhaps unfortunately, Java made the idea of getting an item by its index a first-class operation. Many of the operations directly on List objects are implemented in terms of list indexes: remove(int index), add(int index, ...), get(int index), etc. So it's kind of natural to think of those operations as being the fundamental ones.
For LinkedList though it's more fundamental to use a pointer to a node to refer to an object. Rather than passing around a list index, you'd pass around the pointer. After inserting an element, you'd get a pointer to the element.
In C++ this concept is embodied in the concept of the iterator, which is the first class way to refer to items in collections, including lists. So does such a "pointer" exist in Java? It sure does - it's the Iterator object! Usually you think of an Iterator as being for iteration, but you can also think of it as pointing to a particular object.
So the key observation is: given a pointer (iterator) to an object, you can remove from and add to linked lists in constant time, but for an array-like list this takes linear time in general. There is no inherent need to search for an object before deleting it: there are plenty of scenarios where you can maintain or take as input such a reference, or where you are processing the entire list, and here the constant-time deletion of linked lists does change the algorithmic complexity.
Of course, if you need to do something like delete the first entry containing the value "foo", that implies both a search and a delete operation. Both array-based and linked lists take O(n) for search, so they don't vary here, but you can meaningfully separate the search and delete operations.
So you could, in principle, pass around Iterator objects rather than list indexes or object values - at least if your use case supports it. However, at the top I said that "Java has no convenient notion of a pointer to a list node". Why?
Well, because using an Iterator is actually very inconvenient. First of all, it's tough to get an Iterator to an object in the first place: for example, and unlike C++, the add() methods don't return an Iterator, so to get a pointer to the item you just added, you need to go ahead and iterate over the list or use the listIterator(int index) call, which is inherently inefficient for linked lists. Many methods (e.g., subList()) support only a version that takes indexes, but not Iterators, even when such a method could be efficiently supported.
Add to that the restrictions around iterator invalidation when the list is modified, and they actually become pretty useless for referring to elements except in immutable lists.
So Java's support for pointers to list elements is pretty half-hearted, and so it's tough to leverage the constant-time operations that a linked list offers, except in cases such as adding to the front of a list, or deleting items during iteration.
It's not limited to lists, either: ConcurrentLinkedQueue is also a linked structure which supports constant-time deletes, but you can't reliably use that ability from Java.
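The one place where the constant-time operations are easy to reach, editing during iteration, can be sketched with ListIterator. The class name and the editing rule (drop negatives, insert a 0 after each positive) are arbitrary choices for illustration; ListIterator.remove and ListIterator.add are the standard java.util calls.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

// Illustrative example; the editing rule itself is arbitrary.
final class IteratorEdit {

    /** Removes every negative value and inserts a 0 after every positive one, in a single pass. */
    static void edit(List<Integer> list) {
        ListIterator<Integer> it = list.listIterator();
        while (it.hasNext()) {
            int value = it.next();
            if (value < 0) {
                it.remove(); // unlinks the node just returned by next()
            } else if (value > 0) {
                it.add(0);   // inserted right after value; the next call to next() skips past it
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> list = new LinkedList<>(List.of(1, -2, 3));
        edit(list);
        System.out.println(list); // prints [1, 0, 3, 0]
    }
}
```

On a LinkedList each remove or add through the iterator is a constant-time pointer update, so the whole pass is O(n). The same code works on an ArrayList, but there each edit shifts the trailing elements, which can make the pass quadratic in the worst case.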

If you're using a LinkedList, chances are you're not going to use it for a random access insert. LinkedList offers constant time for push (insert at the beginning) or add (because it has a ref to the final element IIRC). You are correct in your suspicion that an insert into a random index (e.g. insert sorted) will take linear time - not constant.
ArrayList, by contrast, is linear in the worst case. Inserting in the middle uses an arraycopy to shift the trailing elements; that is a fast, low-level bulk copy, but its cost still grows with the number of elements shifted. Appending at the end is amortized constant time, and only takes linear time when the backing array has to be resized.

Filtering List without using iterator

I need to filter a List of size 1000 or more and get a sublist out of it.
I don't want to use an iterator.
1) At present I am iterating over the List and comparing elements using Java. This is a time-consuming task. I need to improve the performance of my code.
2) I also tried to use Google Collections (Guava), but I think it will also iterate in the background.
Predicate<String> validList = new Predicate<String>() {
    @Override
    public boolean apply(String aid) {
        return aid.contains("1_15_12");
    }
};
// sourceList is a placeholder name for the List<String> being filtered
Collection<String> finalList = com.google.common.collect.Collections2.filter(sourceList, validList);
Can anyone suggest how I can get the sublist faster without iterating, or, if an iterator must be used, how I can get the result comparatively faster?

Consider what happens if you call size() on your sublist. That has to check every element, as every element may change the result.
If you have a very specialized way of using your list which means you don't touch every element in it, don't use random access, etc, perhaps you don't want the List interface at all. If you could tell us more about what you're doing, that would really help.

A List is an ordered collection of objects, so you must iterate over it in order to filter it.

To expand on my comment:
I think iteration is inevitable during filtering, as each element has to be checked.
Regarding Collections2.filter, it's different from a simple filter: the returned Collection is still "predicated". That means an IllegalArgumentException will be thrown if an element that does not satisfy the predicate is added to the Collection.

If performance is really your concern, most probably the predicate is pretty slow. What you can do is split your list with Lists.partition, filter the parts in parallel (you have to write this yourself) and then concatenate the results.
There might be better ways to solve your problem, but we would need more information about the predicate and the data in the List.
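As a rough sketch of the parallel idea, the version below swaps in the JDK's parallel streams (Java 8+) for a hand-written Lists.partition plus thread pool; that substitution is mine, not something from the answer above. The class and method names are made up. The stream splits the list, filters the chunks on the common fork-join pool, and merges the results in the original order.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative sketch using parallel streams instead of Guava's Lists.partition.
final class ParallelFilter {

    /** Returns the elements matching the predicate, in their original order. */
    static <T> List<T> filter(List<T> items, Predicate<? super T> predicate) {
        return items.parallelStream()
                .filter(predicate)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> ids = List.of("a_1_15_12", "b_2_20_01", "1_15_12_x");
        System.out.println(filter(ids, s -> s.contains("1_15_12"))); // prints [a_1_15_12, 1_15_12_x]
    }
}
```

Every element is still checked once, so this does not change the O(n) cost; it only spreads the work across cores. For a list of around 1000 short strings and a cheap predicate, the coordination overhead may well outweigh the gain, so it is worth measuring before adopting it.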

customize an indexof call for a linkedlist (java)

I'm working with a very large (custom Object) linkedlist, and I'm trying to determine if an object that I'm trying to add to the list is already in there.
The issue is that the item I am searching for is a unique object containing:
A 1st String
A 2nd String
A unique Count #
I'm trying to find out if there is an item in my linked list that contains the (1st String) and (2nd String), but ignore (the unique Count #).
This can be done the dumb way (the way I tried it first) by going through each individual linkedlist item - but this takes way too long. I'm trying to speed it up! I figured using (indexOf) would help, but I don't know how I can customize what it is searching for.
Any ideas?

indexOf() has O(n) performance as well because it progressively scans the List until it finds the element you're looking for.
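Is the list sorted? If so, you might be able to search for an element using something like binary search (though that only pays off with an array-backed list, since a LinkedList has no constant-time random access).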
If you need constant time access for random elements, I don't think a Linked List is your best bet.

Do you NEED to use a LinkedList? If it's not legacy code, I would recommend either HashSet or LinkedHashMap. Both will give you constant-time lookup, and if you still need insertion-order iteration, LinkedHashMap has an internal linked list running through its entries.

Unfortunately the "dumb way" is the most efficient way to do so, although you could use
if ( linkedList.contains(objectThatMayBeInList) ) { //do something }
The problem is that a LinkedList has a worst-case search of O(N), where N is the size of the list. That means that any given search may take up to N comparisons. Linked lists are not the best data structure for that kind of operation, but at the same time it's not that bad and it shouldn't be too slow; computers are good at doing that. Are there more specifics you can give us, such as the size of the list?

Basically you want to find out if object A exists in linked list L. This is the search problem, and if the list is unordered you cannot do it faster than O(n).
If you kept the list sorted (making insertion slower), you could do a binary search to see if A is in the list, which would be much faster, provided the list supports fast random access (e.g. an ArrayList rather than a LinkedList).
Perhaps you could also keep a Map (HashMap or TreeMap for instance) in addition to the list, where you keep track of what stuff is in the list.
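The "keep an index next to the list" idea could look roughly like the sketch below. All names here (DedupList, Item, Key, addIfAbsent) are hypothetical. The key holds only the two strings, so the unique count is ignored when checking for duplicates.

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Illustrative sketch: a list plus a hash-based index on (first, second).
final class DedupList {

    /** Hypothetical element type: two strings and a unique count. */
    static final class Item {
        final String first;
        final String second;
        final int count;

        Item(String first, String second, int count) {
            this.first = first;
            this.second = second;
            this.count = count;
        }
    }

    /** Lookup key that deliberately leaves out the count. */
    private static final class Key {
        final String first;
        final String second;

        Key(String first, String second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Key)) {
                return false;
            }
            Key that = (Key) other;
            return Objects.equals(first, that.first) && Objects.equals(second, that.second);
        }

        @Override
        public int hashCode() {
            return Objects.hash(first, second);
        }
    }

    private final List<Item> items = new LinkedList<>();
    private final Set<Key> seen = new HashSet<>();

    /** Adds the item unless one with the same two strings exists; returns true if added. */
    boolean addIfAbsent(Item item) {
        if (!seen.add(new Key(item.first, item.second))) {
            return false; // same (first, second) pair already in the list
        }
        items.add(item);
        return true;
    }

    int size() {
        return items.size();
    }
}
```

The duplicate check is expected O(1) instead of a scan over the whole list, at the cost of extra memory for the index and the need to update both structures whenever items are removed.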
