How to check multiple contains operations faster? - java

I have a String list as below. I want to do some calculations based on if this list has multiple elements with same value.
I got nearly 120k elements and when I run this code it runs too slow. Is there any faster approach than contains method?
List<String> words= getWordsFromDB(); //words list has nearly 120k elements
List<String> tempWordsList = new LinkedList<String>(); //empty list
String[] keys = getKeysFromDB();
List<String> tempKeysList = new LinkedList<String>();
for (int x = 0; x < words.size(); x++) {
if (!tempWordsList.contains(words.get(x))) {
tempWordsList.add(words.get(x));
String key= keys[x];
tempKeysList.add(key);
} else {
int index = tempWordsList.indexOf(words.get(x));
String m = tempKeysList.get(index);
String n = keys[x];
if (!m.contains(n)) {
String newWord = m + ", " + n;
tempKeysList.set(index, newWord);
}
}
}
EDIT: The words list comes from a database, and the problem is that a service is continuously updating and inserting data into this table. I don't have any access to this service, and other applications are using the same table.
EDIT2: I have updated for full code.

You are searching the list twice per word: once for contains() and once for indexOf(). You could replace contains() with indexOf(), test the result for -1, and otherwise reuse the result instead of calling indexOf() again. But you are certainly using the wrong data structure. What exactly do you need a list for? Do you need one at all? I would use a HashSet, or a HashMap if you need to associate other data with each word.
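As a rough sketch of the HashMap idea applied to the loop in the question: the names GroupKeysSketch and groupKeys are made up for illustration, and the code assumes words.get(i) always pairs with keys[i]. It also deduplicates keys by exact equality, whereas the original m.contains(n) is a substring test, so results can differ when one key is a substring of another.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GroupKeysSketch {

    // Groups keys by word in a single pass (assumes words.get(i) pairs with keys[i]).
    static Map<String, String> groupKeys(List<String> words, String[] keys) {
        // LinkedHashMap keeps words in first-seen order, like the temporary list did.
        Map<String, Set<String>> grouped = new LinkedHashMap<>();
        int i = 0;
        for (String word : words) {
            // One hash lookup replaces the contains() + indexOf() list scans.
            grouped.computeIfAbsent(word, w -> new LinkedHashSet<>()).add(keys[i++]);
        }
        // Join each word's distinct keys with ", " as the original code did.
        Map<String, String> joined = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : grouped.entrySet()) {
            joined.put(entry.getKey(), String.join(", ", entry.getValue()));
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "a");
        String[] keys = {"k1", "k2", "k3", "k1"};
        System.out.println(groupKeys(words, keys));
    }
}
```

Each word costs an average O(1) map lookup instead of an O(n) list scan, so the whole loop should be roughly linear in the number of words.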

//1) if you can avoid using linked list use below solution
List<String> words= getWordsFromDB(); //words list has nearly 120k elements
//if you can avoid using linked list, use set instead
Set<String> set=new HashSet<>();
for (String s:words) {
if (!set.add(s)) {
//do some calculations
}
}
//2) if you can't avoid using linked list use below code
List<String> words= getWordsFromDB(); //words list has nearly 120k elements
List<String> tempList = new LinkedList<String>(); //empty list
//if you can't avoid LinkedList (tempList) you need to use a set
Set<String> set=new HashSet<>();
for (String s:words) {
if (set.add(s)) {
tempList.add(s);
} else {
int a = tempList.indexOf(s);
//do some calculations
}
}

LinkedList.get() runs in O(N) time. Either use ArrayList with O(1) lookup time, or avoid indexed lookups altogether by using an iterator:
for (String word : words) {
if (!tempList.contains(word)) {
tempList.add(word);
} else {
int firstIndex = tempList.indexOf(word);
//do some calculations
}
}
Disclaimer: The above was written under the questionable assumption that words is a LinkedList. I would still recommend the enhanced-for loop, since it's more conventional and its time complexity is not implementation-dependent. Either way, the suggestion below still stands.
You can further improve by replacing tempList with a HashMap. This will avoid the O(N) cost of contains() and indexOf():
Map<String, Integer> indexes = new HashMap<>();
int index = 0;
for (String word : words) {
Integer firstIndex = indexes.putIfAbsent(word, index++);
if (firstIndex != null) {
//do some calculations
}
}
Based on your latest update, it looks like you're trying to group "keys" by their corresponding "word". If so, you might give streams a spin:
List<String> words = getWordsFromDB();
String[] keys = getKeysFromDB();
Collection<String> groupedKeys = IntStream.range(0, words.size())
.boxed()
.collect(Collectors.groupingBy(
words::get,
LinkedHashMap::new, // if word order is significant
Collectors.mapping(
i -> keys[i],
Collectors.joining(", "))))
.values();
However, as mentioned in the comments, it would probably be best to move this logic into your database query.

Actually, tempList relies on linear-time methods:
if (!tempList.contains(words.get(x))) {
and
int a = tempList.indexOf(words.get(x));
Each of these calls scans, on average, half of the list.
Besides, the two calls are redundant.
Calling indexOf() alone is enough:
for (int x = 0; x < words.size(); x++) {
int indexWord = tempList.indexOf(words.get(x));
if (indexWord == -1) {
tempList.add(words.get(x));
} else {
//do some calculations by using indexWord
}
}
But to speed up every access, you should change the data structure: wrap or replace the LinkedList with a LinkedHashSet.
A LinkedHashSet keeps the current behavior because, like a List, it has a defined iteration order (the order in which elements were inserted into the set), and it also uses hashing to make lookups fast.

Related

Removing duplicate elements & count repetitions in ArrayList

This is more difficult than I expected. I have a sorted ArrayList of Strings (words), and my task is to remove the repetitions and print out a list of each word, followed by the number of the word's repetitions. Suffice it to say that it's more complex than I expected. After trying different things, I decided to use a HashMap to store the words (key), value(repetitions).
This is the code. Dictionary is the sorted ArrayList and Repetitions that HashMap.
public void countElements ()
{
String word=dictionary.get(0);
int wordCount=1;
int count=dictionary.size();
for (int i=0;i<count;i++)
{
word=dictionary.get(i);
for (int j=i+1; j<count;j++)
{
if(word.equals(dictionary.get(j)))
{
wordCount=wordCount+1;
repetitions.put(word, wordCount);
dictionary.remove(j--);
count--;
}
}
}
For some reason that I do not understand (I'm a beginner), after I call the dictionary.remove(j--) method, variable j decrements by 1, even though it should be i+1. What am I missing? Any ideas on how to do this properly would be appreciated. I know that it would be best to use an iterator, but that can become even more confusing.
Many thanks.
A version which uses streams:
final Map<String, Long> countMap = dictionary.stream().collect(
Collectors.groupingBy(word -> word, LinkedHashMap::new, Collectors.counting()));
System.out.println("Counts follow");
System.out.println(countMap);
System.out.println("Duplicate-free list follows");
System.out.println(countMap.keySet());
Here we group the elements of the list (using Collectors.groupingBy), using each element (i.e. each word) as a key in the resulting map, and count the occurrences of each word (using Collectors.counting()).
The outer collector (groupingBy) uses the counting collector as a downstream collector that collects (here, counts) all the occurrences of a single word.
We use LinkedHashMap to build the map because it maintains the order in which key-value pairs were added, and we want to keep the order the words had in your initial list.
And one more thing: countMap.keySet() is not a List. If you want to get a List in the end, it's just new ArrayList<>(countMap.keySet()).
This code will serve your purpose. Now dictionary would contain the unique words and hashmap would contain the frequency count of each word.
public class newq {
public static void main(String[] args)
{
ArrayList<String> dictionary=new ArrayList<String>();
dictionary.add("hello");
dictionary.add("hello");
dictionary.add("asd");
dictionary.add("qwet");
dictionary.add("qwet");
HashMap<String,Integer> hs=new HashMap<String,Integer>();
int i=0;
while(i<dictionary.size())
{
String word=dictionary.get(i);
if(hs.containsKey(word)) // check if word repeated
{
hs.put(word, hs.get(word)+1); //if repeated increase the count
dictionary.remove(i); // remove the word
}
else
{
hs.put(word, 1); //not repeated
i++;
}
}
// print each word with its count; the map is left intact
for (Map.Entry<String, Integer> pair : hs.entrySet())
{
System.out.println(pair.getKey() + " = " + pair.getValue());
}
for(String word: dictionary)
{
System.out.println(word);
}
}
}
If you don't want 'j' to decrement you should use j-1.
Using j--, --j, j++, or ++j changes the value of the variable.
This link has a good explanation and simple examples of post- and pre-incrementing.
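To make the side effect of j-- concrete, here is a tiny sketch. PostDecrementSketch and removeWithPostDecrement are illustrative names, not something from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PostDecrementSketch {

    // Removes the element at index j via remove(j--) and returns j's value afterwards.
    static int removeWithPostDecrement(List<String> list, int j) {
        list.remove(j--); // remove() receives the old j; afterwards j is one smaller
        return j;
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>(Arrays.asList("a", "b", "b", "c"));
        int j = removeWithPostDecrement(list, 2);
        System.out.println(list + ", j = " + j);
    }
}
```

In the loop from the question, the decrement followed by the for loop's own j++ means the same index is checked again after a removal, which is usually what you want when removing while iterating by index.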

java 8, most efficient method to return duplicates from a list (not remove them)? [duplicate]

I have an ArrayList of Strings, and I want to find and return all values which exist more than once in the list. Most cases are looking for the opposite (removing the duplicate items like distinct()), and so example code is hard to come by.
I was able to come up with this:
public synchronized List<String> listMatching(List<String> allStrings) {
long startTime = System.currentTimeMillis();
List<String> duplicates = allStrings.stream().filter(string -> Collections.frequency(allStrings, string) > 1)
.collect(Collectors.toList());
long stopTime = System.currentTimeMillis();
long elapsedTime = stopTime - startTime;
LOG.info("Time for Collections.frequency(): "+ elapsedTime);
return duplicates;
}
But this uses Collections.frequency, which loops through the whole list for each item and counts every occurrence. This takes about 150ms to run on my current list of about 4,000 strings.
This is a bit slow for me and will only get worse as the list size increases. I took the frequency method and rewrote it to return immediately on the 2nd occurrence:
protected boolean moreThanOne(Collection<?> c, Object o) {
boolean found = false;
if (o != null) {
for (Object e : c) {
if (o.equals(e)) {
if (found) {
return found;
} else {
found = true;
}
}
}
}
return found;
}
and changed my method to use it:
public synchronized List<String> listMatching(List<String> allStrings) {
long startTime = System.currentTimeMillis();
List<String> duplicates = allStrings.stream().filter(string -> moreThanOne(allStrings, string))
.collect(Collectors.toList());
long stopTime = System.currentTimeMillis();
long elapsedTime = stopTime - startTime;
LOG.info("Time for moreThanOne(): "+ elapsedTime);
return duplicates;
}
This seems to work as expected, but does not really increase the speed as much as I was hoping, clocking in at about 120ms. This is probably due to it also needing to loop through the whole list for each item, but I am not sure how to avoid that and still accomplish the task.
I know this might seem like premature optimization, but my List can easily be 1mil+, and this method is a critical piece of my app that influences the timing of everything else.
Do you see any way that I could further optimize this code? Perhaps using some sort of fancy Predicate? An entirely different approach?
EDIT:
Thanks to all your suggestions, I was able to come up with something significantly faster:
public synchronized Set<String> listMatching(List<String> allStrings) {
Set<String> allItems = new HashSet<>();
Set<String> duplicates = allStrings.stream()
.filter(string -> !allItems.add(string))
.collect(Collectors.toSet());
return duplicates;
}
Running under the same conditions, this is able to go through my list in <5ms.
All the HashMap suggestions would have been great though, if I had needed to know the counts. Not sure why the Collections.frequency() method doesn't use that technique.
An easy way to find duplicates is to iterate over the list and use the add() method to add the item to some other temp set. It will return false if the item already exists in the set.
public synchronized Set<String> listMatching(List<String> allStrings) {
Set<String> tempSet = new HashSet<>();
Set<String> duplicates = new HashSet<>();
allStrings.forEach( item -> {
if (!tempSet.add(item)) duplicates.add(item);
});
return duplicates;
}
A good way to make this really scalable is to build a Map that contains the count of each string. To build the map, you will look up each string in your list. If the string is not yet in the map, put the string and a count of one in the map. If the string is found in the map, increment the count.
You probably want to use some type that allows you to increment the count in-place, rather than having to put() the updated count each time. For example, you can use an int[] with one element.
The other advantage of not re-putting counts is that it is easy to execute in parallel, because you can synchronize on the object that contains your count when you want to read/write the count.
The non-parallel code might look something like this:
Map<String, int[]> map = new HashMap<>(listOfStrings.size());
for (String s: listOfStrings) {
int[] curCount = map.get(s);
if (curCount == null) {
curCount = new int[1];
curCount[0] = 1;
map.put(s, curCount);
} else {
curCount[0]++;
}
}
Then you can iterate over the map entries and do the right thing based on the count of each string.
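If the in-place int[] counter is not a concern, a shorter variant of the same counting idea is possible with Map.merge (available since Java 8). This is only a sketch under that assumption; CountDuplicatesSketch, countOccurrences and duplicates are hypothetical names. Note that it re-puts the boxed count on every update, which is exactly what the int[] trick above avoids.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CountDuplicatesSketch {

    // Counts how often each string occurs, in first-seen order.
    static Map<String, Integer> countOccurrences(List<String> strings) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String s : strings) {
            // merge() stores 1 for a new key, otherwise adds 1 to the existing count.
            counts.merge(s, 1, Integer::sum);
        }
        return counts;
    }

    // Returns every string that occurs more than once.
    static Set<String> duplicates(List<String> strings) {
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : countOccurrences(strings).entrySet()) {
            if (entry.getValue() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("luna", "mirana", "mirana", "mirana");
        System.out.println(countOccurrences(list));
        System.out.println(duplicates(list));
    }
}
```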
The best data structure here is a Set<String>.
Add all elements of the list to the set.
Then traverse the list and remove each element from the set.
If an element is no longer in the set, it is a duplicate in the list (because an earlier occurrence already removed it).
This takes O(n) + O(n) time.
Code:
List<String> list = new ArrayList<>();
List<String> duplicates = new ArrayList<>();
list.add("luna");
list.add("mirana");
list.add("mirana");
list.add("mirana");
Set<String> set = new HashSet<>();
set.addAll(list);
for(String a:list){
if(set.contains(a)){
set.remove(a);
}else{
duplicates.add(a);
}
}
System.out.println(duplicates);
Output
[mirana, mirana]

Binary search over a list of pairs

I need to find elem that would match element.
My program works but it is not efficient. I have a very large ArrayList<Obj> pairs (more than 4000 elements) and I use a binary search to find matching indexes.
public int search(String element) {
ArrayList<String> list = new ArrayList<String>();
for (int i = 0; i < pairs.size(); i++) {
list.add(pairs.get(i).getElem());
}
return index = Collections.binarySearch(list, element);
}
I wonder if there is a more efficient way than using a loop to copy half of the ArrayList pairs into a new ArrayList list.
Constructor for Obj: Obj x = new Obj(String elem, String word);
If your master list (pairs) does not change then I'd recommend creating a TreeMap to maintain reverse index structure, e.g.:
List<Obj> pairs = new ArrayList<>(); //list containing 4000 entries
Map<String, Integer> indexMap = new TreeMap<>();
int index = 0;
for(Obj pair : pairs){
indexMap.put(pair.getElem(), index++);
}
Now, while searching for an element, all you need to do is :
indexMap.get(element);
That will give you the required index or null if element doesn't exist. Also, if an element can be present in the list multiple times then, you can change the indexMap to be Map<String, List<Integer>>.
Your current algorithm copies the list on every call and then runs the binary search, so each search costs O(n) for the copy plus O(log n) for the search itself. A TreeMap lookup costs O(log n) once the map has been built, so repeated searches will be much quicker.
Here's the documentation of TreeMap.
It looks like the problem is solved.
As my issue was that ArrayList pairs type was Obj and element type was String, I couldn't use Collections.binarySearch, I decided to create a new variable
Obj x = new Obj(element, "");. It looks like the string doesn't cause any issues (it passed my JUnit tests) as my compareTo method compares two elems and ignores the second variable of Obj x.
My updated method:
public int search(String element) {
Obj x = new Obj(element, "");
int index = Collections.binarySearch(pairs, x);
return index;
}

Java how to remove element from List efficiently

Ok, this is a proof-of-concept I have on my head that has been bugging me for a few days:
Let's say I have:
List<String> a = new ArrayList<String>();
a.add("foo");
a.add("buzz");
a.add("bazz");
a.add("bar");
for (int i = 0; i < a.size(); i++)
{
String str = a.get(i);
if (!str.equals("foo") && !str.equals("bar")) a.remove(str);
}
this would end up with the list ["foo", "bazz", "bar"] because it would read the string at index 1 ("buzz"), delete it, the string at index 2 ("bazz") would jump to index 1 and it would be bypassed without being verified.
What I came up with was:
List<String> a = new ArrayList<String>();
a.add("foo");
a.add("buzz");
a.add("bazz");
a.add("bar");
for (int i = 0; i < a.size(); i++)
{
String str = a.get(i);
boolean removed = false;
if (!str.equals("foo") && !str.equals("bar"))
{
a.remove(str);
removed = true;
}
if (removed) i--;
}
It should work this way (at least it does in my head lol), but messing with for iterators is not really good practice.
Another way I thought of would be creating a "removal list" and adding the items that need to be removed from list a to it, but that would be just plain resource waste.
So, what is the best practice to remove items from a list efficiently?
Use an Iterator instead and use Iterator#remove method:
for (Iterator<String> it = a.iterator(); it.hasNext(); ) {
String str = it.next();
if (!str.equals("foo") && !str.equals("bar")) {
it.remove();
}
}
From your question:
messing with for iterators is not really good practice
In fact, if you code to interfaces and use List instead of ArrayList directly, calling get could mean traversing the whole collection to reach the desired element (for example, if the List is backed by a singly linked list). So the best practice here would be to use iterators instead of get.
what is the best practice to remove items from a list efficiently?
Not only for Lists, but for any Collection that supports Iterable, and assuming you don't have an index or some sort of key (like in a Map) to directly access to an element, the best way to remove an element would be using Iterator#remove.
You have three main choices:
Use an Iterator, since it has that handy remove method on it. :-)
Iterator<String> it = list.iterator();
while (it.hasNext()) {
if (/*...you want to remove `it.next()`...*/) {
it.remove();
}
}
Loop backward through the list, so that if you remove something, it doesn't matter for the next iteration. This also has the advantage of only calling list.size() once.
for (int index = list.size() - 1; index >= 0; --index) {
// ...check and optionally remove here...
}
Use a while loop instead, and only increment the index variable if you don't remove the item.
int index = 0;
while (index < list.size()) {
if (/*...you want to remove the item...*/) {
list.remove(index);
} else {
// Not removing, move to the next
++index;
}
}
Remember that unless you know you're dealing with an ArrayList, the cost of List#get(int) may be high (it may be a traversal). But if you know you're dealing with ArrayList (or similar), then...
Your first example will likely cause off-by-one errors, since once you remove an object your list's indexes will change. If you want to be quick about it, use an iterator or List's own .remove() function:
Iterator<String> itr = yourList.iterator();
while (itr.hasNext()) {
if ("foo".equals(itr.next())) {
itr.remove();
}
}
Or:
yourList.remove("foo"); // removes the first occurrence
yourList.removeAll(Collections.singleton("foo")); // removes all occurrences
ArrayList.retainAll has a "smart" implementation that does the right thing to be linear time. You can just use list.retainAll(Arrays.asList("foo", "bar")) and you'll get the fast implementation in that one line.
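On Java 8 or later, Collection.removeIf is another one-line option that handles the index bookkeeping internally. This is a sketch of that swapped-in technique rather than something the answers above used; RemoveIfSketch and keepFooAndBar are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemoveIfSketch {

    // Returns a copy of the input that keeps only "foo" and "bar".
    static List<String> keepFooAndBar(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        // removeIf drops every element matching the predicate in a single pass.
        copy.removeIf(s -> !s.equals("foo") && !s.equals("bar"));
        return copy;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("foo", "buzz", "bazz", "bar");
        System.out.println(keepFooAndBar(a));
    }
}
```

The copy is only there so the sketch also works with fixed-size lists such as the one returned by Arrays.asList; on a mutable list you can call removeIf directly.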

Fastest way to find substring in JAVA

Let's say I have a list of names.
ArrayList<String> nameslist = new ArrayList<String>();
nameslist.add("jon");
nameslist.add("david");
nameslist.add("davis");
nameslist.add("jonson");
and this list contains a few thousand names. What is the fastest way to find out how many names in this list start with a given name?
String name = "jon";
The result should be 2.
I have tried comparing every element of the list using the substring function. It works, but it is very slow, especially when the list is huge.
Thanks in advance.
You could use a TreeSet for O(log n) access and write something like:
TreeSet<String> set = new TreeSet<String>();
set.add("jon");
set.add("david");
set.add("davis");
set.add("jonson");
set.add("henry");
Set<String> subset = set.tailSet("jon");
int count = 0;
for (String s : subset) {
if (s.startsWith("jon")) count++;
else break;
}
System.out.println("count = " + count);
which prints 2 as you expect.
Alternatively, you could use Set<String> subset = set.subSet("jon", "joo"); to return the full list of all names that start with "jon", but you need to give the first invalid entry that follows the jons (in this case: "joo").
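To avoid working out the upper bound by hand, one option is to append the largest char value to the prefix. The sketch below assumes no name contains the character '\uffff' right after the prefix; PrefixCountSketch and countWithPrefix are made-up names. Also keep in mind that a TreeSet drops duplicate names, so the count is of distinct names.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class PrefixCountSketch {

    // Counts the distinct names in the sorted set that start with the given prefix.
    static int countWithPrefix(NavigableSet<String> sorted, String prefix) {
        // Assumption: prefix + '\uffff' sorts after every real name with this prefix.
        String upperBound = prefix + Character.MAX_VALUE;
        return sorted.subSet(prefix, true, upperBound, false).size();
    }

    public static void main(String[] args) {
        NavigableSet<String> set =
                new TreeSet<>(Arrays.asList("jon", "david", "davis", "jonson", "henry"));
        System.out.println("count = " + countWithPrefix(set, "jon"));
    }
}
```

Finding the range is O(log n), but size() on a TreeSet view still walks the matching elements, so the total cost grows with the number of matches.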
Have a look at Trie. It's a data structure aimed to perform fast searches according to word prefixes. You may need to manipulate it a bit in order to get the number of leafs in the subtree, but in any case you do not traverse the entire list.
The complexity of searching in ArrayList (or linear array) is O(n), where n is number of elements in array.
For best performance you can see Trie
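For illustration, a minimal trie that stores a counter in each node could look like the sketch below. The class and method names (PrefixTrieSketch, insert, countWithPrefix) are assumptions, and a production trie would likely use a more compact child representation than a HashMap per node.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixTrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count; // number of inserted words whose path passes through this node
    }

    private final Node root = new Node();

    // Adds a word and bumps the counter on every node along its path.
    void insert(String word) {
        Node node = root;
        node.count++;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
            node.count++;
        }
    }

    // Returns how many inserted words start with the prefix, in O(prefix length).
    int countWithPrefix(String prefix) {
        Node node = root;
        for (int i = 0; i < prefix.length(); i++) {
            node = node.children.get(prefix.charAt(i));
            if (node == null) {
                return 0; // no inserted word continues along this path
            }
        }
        return node.count;
    }

    public static void main(String[] args) {
        PrefixTrieSketch trie = new PrefixTrieSketch();
        for (String name : new String[] {"jon", "david", "davis", "jonson"}) {
            trie.insert(name);
        }
        System.out.println(trie.countWithPrefix("jon"));
    }
}
```

Keeping the count in each node means a query never has to walk the subtree, at the price of a one-time build that touches every character of every name.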
Iterate on the ArrayList, for each element, check if it begins with jon. Time complexity is O(n).
What exactly does "very slow" mean?
Really the only way to do this is to loop through the list and check every element:
int count = 0;
for (String name : nameslist) {
if (name.startsWith("jon")) {
count++;
}
}
System.out.println("Found: " + count);
If the strings in your list are not too long, you can use this cheat: store all prefixes in a HashSet, and each lookup will be ~O(1):
// Preprocessing
List<String> list = Arrays.asList("hello", "world"); // Your list
Set<String> set = new HashSet<>();
for (String s : list) {
for (int i = 1; i <= s.length(); i++) {
set.add(s.substring(0, i));
}
}
// Now you want to test
assert set.contains("wor");
If they are too long for that, you can use a full-text search engine such as Apache Lucene.
I'd suggest you to create a Runnable for processing the list elements. Then you create an ExecutorService with fixed pool size, which processes the elements concurrently.
Rough example:
ExecutorService executor = Executors.newFixedThreadPool(5);
for (String str : coll){
Runnable r = new StringProcessor(str);
executor.execute(r);
}
I suggest a TreeSet.
Access every element in the same way and increment the count; because the data is sorted, the loop can stop early, which improves performance.
int count = 0;
Iterator<String> iter = list.iterator();
String name;
while (iter.hasNext()) {
name = iter.next();
if (name.startsWith("jon")) {
count++;
}
// sorted order: nothing at or after "k" can start with "jon"
if (name.compareTo("k") >= 0) break;
}
Because the elements are visited in sorted order, this break skips the remaining string comparisons.
You can consider Boyer–Moore string search algorithm.
complexity O(n+m) worst case.
You need to iterate each name and find the name within it.
String name = "jon";
int count=0;
for(String n:nameslist){
if(n.contains(name)){
count++;
}
}
