What is the best way to find common elements from 2 sets? - java

Recently I had an interview and I was asked one question.
I have 2 sets with around 1 Million records each.
I have to find the common elements of the 2 sets.
My response:
I will create a new empty Set. I gave him the solution below, but he was not happy with it. He said that with 1 million records the solution would not perform well.
public Set<Integer> commonElements(Set<Integer> s1, Set<Integer> s2) {
    Set<Integer> res = new HashSet<>();
    for (Integer temp : s1) {
        if (s2.contains(temp)) {
            res.add(temp);
        }
    }
    return res;
}
What is a better way to solve this problem, then?

First of all: in order to determine the intersection of two sets, you absolutely have to look at all entries of at least one of the two sets (to figure out whether each one is in the other set). There is no magic that would tell you the answer in less than O(min(size(s1), size(s2))). Period.
The next thing to tell the interviewer: "1 million entries? You must be kidding. It is 2019. Any decent piece of hardware crunches two 1-million-entry sets in less than a second." (Of course, that only applies to objects that are cheap to compare, like the Integer instances here. If oneRecord.equals(anotherRecord) is a super expensive operation, then 1 million entries could still be a problem.)
Then you briefly mention that there are various built-in ways to solve this, as well as various 3rd-party libraries. But you avoid the mistake that the other two answers make: pointing to a library that computes the intersection is not at all something you sell as the "solution" to this question.
You see, regarding coding: the Java Set interface has an easy solution to that: s1.retainAll(s2) computes the intersection of the two sets, as it removes all elements from s1 that aren't in s2.
Obviously, you have to mention within the interview that this will modify s1.
If the requirement is to not modify s1 or s2, your solution is a viable way to go, and there isn't anything one can do about the runtime cost. At best, you could call size() on both sets and iterate over the one that has fewer entries.
Alternatively, you can do
Set<Integer> result = new HashSet<>(s1);
result.retainAll(s2);
return result;
but in the end, you have to iterate one set and for each element determine whether it is in the second set.
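Here is a minimal sketch of the non-destructive variant that checks size() and iterates over the smaller set, as mentioned above (assuming java.util.HashSet and java.util.Set are imported; the method name is just illustrative):
// Non-destructive intersection that iterates the smaller set; neither input is modified.
public static <T> Set<T> intersectionOfSmaller(Set<T> s1, Set<T> s2) {
    Set<T> smaller = s1.size() <= s2.size() ? s1 : s2;
    Set<T> larger  = (smaller == s1) ? s2 : s1;
    Set<T> result = new HashSet<>();
    for (T element : smaller) {
        if (larger.contains(element)) {
            result.add(element);
        }
    }
    return result;
}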
But of course, the real answer to such questions is always, always, always to show the interviewer that you are able to dissect the problem into its different aspects. You outline basic constraints, you outline different solutions and discuss their pros and cons. Me, for example: I would expect you to sit down and maybe write a program like this:
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.stream.Collectors;

import org.junit.Before;
import org.junit.Test;

public class Numbers {

    private final static int numberOfEntries = 20_000_000;
    private final static int maxRandom = numberOfEntries;

    private Set<Integer> s1;
    private Set<Integer> s2;

    @Before
    public void setUp() throws Exception {
        Random random = new Random(42);
        s1 = fillWithRandomEntries(random, numberOfEntries);
        s2 = fillWithRandomEntries(random, numberOfEntries);
    }

    private static Set<Integer> fillWithRandomEntries(Random random, int entries) {
        Set<Integer> rv = new HashSet<>();
        for (int i = 0; i < entries; i++) {
            rv.add(random.nextInt(maxRandom));
        }
        return rv;
    }

    @Test
    public void classic() {
        long start = System.currentTimeMillis();
        HashSet<Integer> intersection = new HashSet<>();
        s1.forEach((i) -> {
            if (s2.contains(i))
                intersection.add(i);
        });
        long end = System.currentTimeMillis();
        System.out.println("foreach duration: " + (end - start) + " ms");
        System.out.println("intersection.size() = " + intersection.size());
    }

    @Test
    public void retainAll() {
        long start = System.currentTimeMillis();
        s1.retainAll(s2);
        long end = System.currentTimeMillis();
        System.out.println("Retain all duration: " + (end - start) + " ms");
        System.out.println("intersection.size() = " + s1.size());
    }

    @Test
    public void streams() {
        long start = System.currentTimeMillis();
        Set<Integer> intersection = s1.stream().filter(i -> s2.contains(i)).collect(Collectors.toSet());
        long end = System.currentTimeMillis();
        System.out.println("streaming: " + (end - start) + " ms");
        System.out.println("intersection.size() = " + intersection.size());
    }

    @Test
    public void parallelStreams() {
        long start = System.currentTimeMillis();
        Set<Integer> intersection = s1.parallelStream().filter(i -> s2.contains(i)).collect(Collectors.toSet());
        long end = System.currentTimeMillis();
        System.out.println("parallel streaming: " + (end - start) + " ms");
        System.out.println("intersection.size() = " + intersection.size());
    }
}
The first observation here: I decided to run with 20 million entries. I started with 2 million, but all the tests would run well below 500 ms. Here is the printout for 20 million on my MacBook Pro:
foreach duration: 9304 ms
intersection.size() = 7990888
streaming: 9356 ms
intersection.size() = 7990888
Retain all duration: 685 ms
intersection.size() = 7990888
parallel streaming: 6998 ms
intersection.size() = 7990888
As expected: all intersects have the same size (because I seeded the random number generator to get to comparable results).
And surprise: modifying s1 in place ... is by far the cheapest option. It beats streaming by a factor of 10. Also note: the parallel streaming is quicker here. When running with 1 million entries, the sequential stream was faster.
That is why I initially said to mention that "1 million entries is not a performance problem". That is a very important statement, as it tells the interviewer that you are not one of those people who waste hours micro-optimizing non-existent performance issues.

You can use CollectionUtils from Apache Commons Collections:
CollectionUtils.intersection(Collection a, Collection b)
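A minimal usage sketch, assuming the commons-collections4 jar is on the classpath (the surrounding class and method names are just illustrative):
import java.util.Collection;
import java.util.Set;

import org.apache.commons.collections4.CollectionUtils;

public class IntersectionExample {
    public static Collection<Integer> common(Set<Integer> s1, Set<Integer> s2) {
        // Returns a new collection with the elements contained in both inputs;
        // neither s1 nor s2 is modified.
        return CollectionUtils.intersection(s1, s2);
    }
}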

The answer is:
s1.retainAll(s2);
Ref. https://www.w3resource.com/java-exercises/collection/java-collection-hash-set-exercise-11.php

Related

Compartmentalizing loops over a large iteration

The goal of my question is to enhance the performance of my algorithm by splitting the range of my loop iterations over a large array list.
For example: I have an ArrayList with about 10 billion entries of long values. The goal I am trying to achieve is to run the loop from 0 to 100 million entries and output the result of whatever calculations happen inside the loop for those 100 million entries; then continue from 100 million to 200 million, doing the same and outputting the result; then 200-300 million, 300-400 million, and so on and so forth.
After I get all of the 100 billion / 100 million results, I can then sum them up outside of the loop, collecting the results from the loop outputs in parallel.
I have tried to use a dynamic range-shift method that might achieve something similar, but I can't seem to get the logic fully implemented the way I would like.
public static void tt4() {
    long essir2 = 0;
    long essir3 = 0;
    List<Long> cc = new ArrayList<>();
    List<Long> range = new ArrayList<>();
    // Breakpoint is a method that returns list values; they were converted to
    // String because of some concatenations and are converted back to long here
    for (String ari1 : Breakpoint()) {
        cc.add(Long.valueOf(ari1));
    }
    // the size of the list is huge, about 1 trillion entries at the minimum
    long hy = cc.size() - 1;
    for (long k = 0; k < hy; k++) {
        long t1 = cc.get((int) k);
        long t2 = cc.get((int) (k + 1));
        // My main question: I am trying to iterate the entire list in a dynamic way
        // which would exclude repeated endpoints on each iteration.
        range = LongStream.rangeClosed(t1 + 1, t2)
                .boxed()
                .collect(Collectors.toList());
        for (long i : range) {
            // Hard is another method call on the iteration
            // complexcalc is a method as well
            essir2 = complexcalc((int) i, (int) Hard(i));
            essir3 += essir2;
        }
    }
    System.out.println("\n" + essir3);
}
I don't have any errors; I am just looking for a way to improve performance and running time. I can do a million entries in under a second directly, but when I use the size I actually require, it runs forever. The sizes I'm giving are abstractions to illustrate magnitudes; I don't want opinions like "100 billion is not much if you can do a million in under a second". I'm talking about massively huge numbers I need to iterate over while doing complex tasks and calls; I just need help with the logic I'm trying to achieve, if possible.
One thing I would suggest right off the bat would be to store your Breakpoint return value inside a simple array rather than using a List. This should improve your execution time significantly:
List<Long> cc = new ArrayList<>();
for (String ari1 : Breakpoint()) {
    cc.add(Long.valueOf(ari1));
}
Long[] ccArray = cc.toArray(new Long[0]);
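If the values fit into a long, a primitive long[] (rather than a boxed Long[]) is where most of the memory savings would come from; a small sketch of that variant, reusing the cc list from above:
// Sketch: copy the boxed list into a primitive array to avoid per-element wrapper objects.
long[] ccPrimitive = new long[cc.size()];
for (int i = 0; i < cc.size(); i++) {
    ccPrimitive[i] = cc.get(i); // auto-unboxing Long -> long
}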
I believe what you're looking for is to split your tasks across multiple threads. You can do this with ExecutorService "which simplifies the execution of tasks in asynchronous mode".
Note that I am not overly familiar with this whole concept, but I have experimented with it a bit recently, so I can give you a quick draft of how you could implement this.
I welcome those more experienced with multi-threading to either correct this post or provide additional information in the comments to help improve this answer.
Runnable Task class
public class CompartmentalizationTask implements Runnable {

    private final ArrayList<Long> cc;
    private final long index;

    public CompartmentalizationTask(ArrayList<Long> list, long index) {
        this.cc = list;
        this.index = index;
    }

    @Override
    public void run() {
        Main.compartmentalize(cc, index);
    }
}
Main class
private static ExecutorService exeService = Executors.newCachedThreadPool();
private static List<Future> futureTasks = new ArrayList<>();

public static void tt4() throws ExecutionException, InterruptedException {
    ArrayList<Long> cc = new ArrayList<>();
    // Breakpoint is a method that returns list values; they were converted to
    // String because of some concatenations and are converted back to long here
    for (String ari1 : Breakpoint()) {
        cc.add(Long.valueOf(ari1));
    }
    // the size of the list is huge, about 1 trillion entries at the minimum
    long hy = cc.size() - 1;
    for (long k = 0; k < hy; k++) {
        futureTasks.add(Main.exeService.submit(new CompartmentalizationTask(cc, k)));
    }
    for (int i = 0; i < futureTasks.size(); i++) {
        futureTasks.get(i).get(); // wait for each submitted task to finish
    }
    exeService.shutdown();
}

public static void compartmentalize(ArrayList<Long> cc, long index) {
    long t1 = cc.get((int) index);
    long t2 = cc.get((int) (index + 1));
    // iterate the range between two breakpoints, excluding the repeated endpoint
    List<Long> range = LongStream.rangeClosed(t1 + 1, t2)
            .boxed()
            .collect(Collectors.toList());
    long essir3 = 0; // partial sum for this chunk; collecting these partial results is still left to do
    for (long i : range) {
        // Hard is another method call on the iteration
        // complexcalc is a method as well
        long essir2 = complexcalc((int) i, (int) Hard(i));
        essir3 += essir2;
    }
}
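An alternative sketch of the same idea: submit each chunk as a Callable<Long> that returns its partial sum, then combine the partial sums on the main thread, so no mutable state is shared between threads. Hard and complexcalc are the asker's methods; the placeholders below exist only so the sketch compiles.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedSum {

    public static long sumInChunks(List<Long> cc) throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Long>> partials = new ArrayList<>();
        for (int k = 0; k < cc.size() - 1; k++) {
            final long t1 = cc.get(k);
            final long t2 = cc.get(k + 1);
            // each task sums one breakpoint-to-breakpoint chunk and returns its partial sum
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (long i = t1 + 1; i <= t2; i++) {
                    sum += complexcalc((int) i, (int) Hard(i));
                }
                return sum;
            }));
        }
        long total = 0;
        for (Future<Long> partial : partials) {
            total += partial.get(); // blocks until that chunk is done
        }
        pool.shutdown();
        return total;
    }

    // Placeholders for the asker's methods, only here so the sketch compiles.
    private static long complexcalc(int i, int hard) { return (long) i + hard; }
    private static long Hard(long i) { return i; }
}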

Why is my java program becoming gradually slower?

I recently built a Fibonacci generator that uses recursion and hashmaps to reduce complexity. I am using System.nanoTime() to keep track of the time it takes for my program to print 10000 Fibonacci numbers. It started out good with less than a second but gradually became slower and now it takes more than 4 seconds. Can someone explain why this might be happening? The code is below:
import java.util.*;
import java.math.*;

public class FibonacciGeneratorUnlimited {

    static int numFibCalls = 0;
    static HashMap<Integer, BigInteger> d = new HashMap<Integer, BigInteger>();
    static Scanner fibNumber = new Scanner(System.in);
    static BigInteger ans = new BigInteger("0");

    public static void main(String[] args) {
        d.put(0, new BigInteger("0"));
        d.put(1, new BigInteger("1"));
        System.out.print("Enter the term:\t");
        int n = fibNumber.nextInt();
        long startTime = System.nanoTime();
        for (int i = 0; i <= n; i++) {
            System.out.println(i + " : " + fib_efficient(i, d));
        }
        System.out.println((double) (System.nanoTime() - startTime) / 1000000000);
    }

    public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
        numFibCalls += 1;
        if (d.containsKey(n)) {
            return (d.get(n));
        } else {
            ans = (fib_efficient(n - 1, d).add(fib_efficient(n - 2, d)));
            d.put(n, ans);
            return ans;
        }
    }
}
If you are restarting the program every time you generate a new Fibonacci sequence, then your program most likely isn't the problem. It might just be that your processor got hot after running the program a few times, or that a background process on your computer suddenly started, causing your program to slow down.
More memory (java -Xmx...) or less caching:
public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
    numFibCalls++;
    if ((n & 3) <= 1) { // Two consecutive values out of every four are cached.
        BigInteger cached = d.get(n);
        if (cached != null) {
            return cached;
        } else {
            BigInteger ans = fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
            d.put(n, ans);
            return ans;
        }
    } else {
        return fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
    }
}
Two consecutive numbers out of every four are cached, in order to stop the recursion on both branches of:
fib(n) = fib(n-1) + fib(n-2)
BigInteger isn't the nicest class where performance and memory are concerned.
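For comparison, a minimal iterative sketch that sidesteps both the recursion depth and the cache entirely (assuming java.math.BigInteger is imported; the BigInteger additions still dominate the cost):
public static BigInteger fibIterative(int n) {
    BigInteger a = BigInteger.ZERO; // fib(0)
    BigInteger b = BigInteger.ONE;  // fib(1)
    for (int i = 0; i < n; i++) {
        BigInteger next = a.add(b);
        a = b;
        b = next;
    }
    return a; // fib(n)
}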
It started out good with less than a second but gradually became slower and now it takes more than 4 seconds.
What do you mean by this? Do you mean that you ran this exact same program with the same input and its run-time changed from < 1 second to > 4 seconds?
If you have the same exact code running with the same exact inputs in a deterministic algorithm...
then the differences are probably external to your code - maybe other processes are taking up more CPU on one run.
Do you mean that you increased the inputs from some value X to 10,000 and now it takes > 4 seconds?
Then that's just a matter of the algorithm taking longer with larger inputs, which is perfectly normal.
recursion and hashmaps to reduce complexity
That's not quite how complexity works. You have improved the best-case and the average-case, but you have done nothing to change the worst-case.
Now for some actual performance improvement advice
Stop printing out the results... that's eating up over 99% of your processing time. Seriously, though, switch out "System.out.println(i + " : " + fib_efficient(i, d))" with "fib_efficient(i,d)" and it'll execute over 100x faster.
Concatenating strings and printing to console are very expensive processes.
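A minimal sketch of what that change could look like if you still need the output, reusing the question's fib_efficient, d and n: build the text in one StringBuilder and write it in a single call instead of calling println on every iteration.
StringBuilder sb = new StringBuilder();
for (int i = 0; i <= n; i++) {
    sb.append(i).append(" : ").append(fib_efficient(i, d)).append('\n');
}
System.out.print(sb); // one console write instead of n + 1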
It happens because the complexity of this Fibonacci implementation is O(n^2). This means that the larger the input, the more the time grows, quadratically, as you can see in the graph for O(n^2) in this link. Check this answer to see a complete explanation of its complexity.
Now, the complexity of your algorithm increases because you are using a HashMap to search and insert elements each time the function is invoked. Consider removing this HashMap.

Why is Collections.synchronizedSet(HashSet) faster than HashSet for addAll, retainAll, and contains?

I ran a test to find the best concurrent Set implementation for my program, with a non-synchronized HashSet as a control, and ran into an interesting result: the addAll, retainAll, and contains operations for a Collections.synchronizedSet(HashSet) appear to be faster than those of a regular HashSet. My understanding is that a SynchronizedSet(HashSet) should never be faster than a HashSet because it consists of a HashSet with synchronization locks. I've run the test quite a few times now, with similar results. Am I doing something wrong?
Relevant results:
Testing set: HashSet
Add: 17.467758 ms
Retain: 28.865039 ms
Contains: 22.18998 ms
Total: 68.522777 ms
--
Testing set: SynchronizedSet
Add: 17.54269 ms
Retain: 20.173502 ms
Contains: 19.618188 ms
Total: 57.33438 ms
Relevant code:
public class SetPerformance {

    static Set<Long> source1 = new HashSet<>();
    static Set<Long> source2 = new HashSet<>();

    static Random rand = new Random();

    public static void main(String[] args) {
        Set<Long> control = new HashSet<>();
        Set<Long> synch = Collections.synchronizedSet(new HashSet<Long>());

        // populate sets to draw values from
        System.out.println("Populating source");
        for (int i = 0; i < 100000; i++) {
            source1.add(rand.nextLong());
            source2.add(rand.nextLong());
        }

        // populate sets with initial values
        System.out.println("Populating test sets");
        control.addAll(source1);
        synch.addAll(source1);

        testSet(control);
        testSet(synch);
    }

    public static void testSet(Set<Long> set) {
        System.out.println("--\nTesting set: " + set.getClass().getSimpleName());
        long start = System.nanoTime();
        set.addAll(source1);
        long add = System.nanoTime();
        set.retainAll(source1);
        long retain = System.nanoTime();
        boolean test;
        for (int i = 0; i < 100000; i++) {
            test = set.contains(rand.nextLong());
        }
        long contains = System.nanoTime();
        System.out.println("Add: " + (add - start) / 1000000.0 + " ms");
        System.out.println("Retain: " + (retain - add) / 1000000.0 + " ms");
        System.out.println("Contains: " + (contains - retain) / 1000000.0 + " ms");
        System.out.println("Total: " + (contains - start) / 1000000.0 + " ms");
    }
}
You aren't warming up the JVM.
Note that you run the HashSet test first.
I changed your program slightly to run the test in a loop 5 times. SynchronizedSet was faster, on my machine, in only the first test.
Then, I tried reversing the order of the two tests, and only running the test once. HashSet won again.
Read more about this here: How do I write a correct micro-benchmark in Java?
Additionally, check out Google Caliper for a framework that handles all these microbenchmarking issues.
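A rough sketch of the "run it several times" change described above, reusing the question's control, synch and testSet, so that the later iterations run on JIT-compiled code:
for (int run = 0; run < 5; run++) {
    System.out.println("=== Run " + (run + 1) + " ===");
    testSet(control);
    testSet(synch);
}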
Yes.
Try to run the synchronized set before the regular one and you will get your "needed" results.
I reckon this has to do with JVM warm-up and nothing else.
Try to warm up the VM with some computations and then run the benchmark, or run it a few times in mixed order.

Java : Searching Ids from hashset or String

I have a large number of IDs which I can store in a HashSet or in a String,
i.e.
String strIds=",1,2,3,4,5,6,7,8,.,.,.,.,.,.,.,1000,";
Or
HashSet<String> setOfids = new HashSet<String>();
setOfids.add("1");
setOfids.add("2");
.
.
.
setOfids.add("1000");
Furthermore, I want to perform searches on the IDs.
Which should I use for better performance (faster and more memory efficient)?
1) strIds.indexOf("someId");
or
2) setOfids.contains("someId");
Or tell me any other way I can do the same.
Thanks for Looking here :)
A hash table lookup is "constant time", i.e., it does not grow with the number of ids.
But a compact string of all id's in a String requires the least memory.
So, make up your mind: fastest retrieval or a minimum of storage!
Set will be better choice. Reasons:
Search will be O(1) in case of Set. In case of String it will be O(N).
Performance will not degrade as data grows.
String will use more memory if you want to do any kind of data manipulation (add or remove IDs).
indexOf might also give you a false match.
Say 1000 is present but 100 is not: indexOf("100") will still return a location inside "1000", because 100 is a substring of 1000.
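A tiny illustration of that false match (the values are made up for the example):
String ids = ",1000,";                     // 100 is NOT in this list
System.out.println(ids.indexOf("100"));    // 1  -> false match inside "1000"
System.out.println(ids.indexOf(",100,"));  // -1 -> searching with delimiters avoids it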
Simple POC code for the performance:
import java.util.HashSet;
import java.util.Set;

public class TimeComputationTest {

    public static void main(String[] args) {
        String strIds = null;
        Set<String> setOfids = new HashSet<String>();
        StringBuffer sb = new StringBuffer();
        for (int i = 1; i <= 1000; i++) {
            setOfids.add(String.valueOf(i));
            if (sb.length() != 0) {
                sb.append(",");
            }
            sb.append(i);
        }
        strIds = sb.toString();

        testTime(strIds, setOfids, "1");
        testTime(strIds, setOfids, "100");
        testTime(strIds, setOfids, "500");
        testTime(strIds, setOfids, "1000");
    }

    private static void testTime(String strIds, Set<String> setOfids, String string) {
        long startTime = System.nanoTime();
        strIds.indexOf(string);
        long endTime = System.nanoTime();
        System.out.println("String search time for (" + string + ") is " + (endTime - startTime));

        startTime = System.nanoTime();
        setOfids.contains(string);
        endTime = System.nanoTime();
        System.out.println("HashSet search time for (" + string + ") is " + (endTime - startTime));
    }
}
The output will be (approx.):
String search time for (1) is 3000
HashSet search time for (1) is 7000
String search time for (100) is 6000
HashSet search time for (100) is 2000
String search time for (500) is 33000
HashSet search time for (500) is 2000
String search time for (1000) is 71000
HashSet search time for (1000) is 1000
Besides the performance, you shouldn't use a String like that. Although it is creative, a String is not made for that kind of lookup. What would happen if you wanted to change the format of the IDs?
To improve performance and save memory with the HashSet, you could of course use
HashSet<Integer> instead of HashSet<String>
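A small sketch of that Integer-keyed variant (assuming java.util.HashSet and java.util.Set are imported):
Set<Integer> ids = new HashSet<>();
for (int i = 1; i <= 1000; i++) {
    ids.add(i);                      // int is auto-boxed to Integer
}
boolean present = ids.contains(100); // expected O(1) lookup, no string parsing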
I assume HashSet is the better option to go with.
There are two advantages:
It doesn't allow duplicates
HashSet is internally backed by a HashMap, hence retrieval is fast.
This will work faster:
String strIds=",1,2,3,4,5,6,7,8,.,.,.,.,.,.,.,1000,";
String searchStr = "9";
boolean searchFound = strIds.contains(","+searchStr +",");

Huge performance difference between Vector and HashSet

I have a program which fetches records from database (using Hibernate) and fills them in a Vector. There was an issue regarding the performance of the operation and I did a test with the Vector replaced by a HashSet. With 300000 records, the speed gain is immense - 45 mins to 2 mins!
So my question is, what is causing this huge difference? Is it just the point that all methods in Vector are synchronized or the point that internally Vector uses an array whereas HashSet does not? Or something else?
The code is running in a single thread.
EDIT:
The code is only inserting the values in the Vector (and in the other case, HashSet).
If it's trying to use the Vector as a set, and checking for the existence of a record before adding it, then filling the vector becomes an O(n^2) operation, compared with O(n) for HashSet. It would also become an O(n^2) operation if you insert each element at the start of the vector instead of at the end.
If you're just using collection.add(item) then I wouldn't expect to see that sort of difference - synchronization isn't that slow.
If you can try to test it with different numbers of records, you could see how each version grows as n increases - that would make it easier to work out what's going on.
EDIT: If you're just using Vector.add then it sounds like something else could be going on - e.g. your database was behaving differently between your different test runs. Here's a little test application:
import java.util.*;

public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Vector<String> vector = new Vector<String>();
        for (int i = 0; i < 300000; i++) {
            vector.add("dummy value");
        }
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}
Output:
Time taken: 38ms
Now obviously this isn't going to be very accurate - System.currentTimeMillis isn't the best way of getting accurate timing - but it's clearly not taking 45 minutes. In other words, you should look elsewhere for the problem, if you really are just calling Vector.add(item).
Now, changing the code above to use
vector.add(0, "dummy value"); // Insert item at the beginning
makes an enormous difference - it takes 42 seconds instead of 38ms. That's clearly a lot worse - but it's still a long way from being 45 minutes - and I doubt that my desktop is 60 times as fast as yours.
If you are inserting them at the middle or beginning instead of at the end, then the Vector needs to move them all along. Every insert. The HashSet, on the other hand, doesn't really care or have to do anything.
Vector is outdated and should not be used anymore. Profile with ArrayList or LinkedList (depends on how you use the list) and you will see the difference (sync vs unsync).
Why are you using Vector in a single threaded application at all?
Vector is synchronized by default; HashSet is not. That's my guess. Obtaining a monitor for access takes time.
I don't know if there are reads in your test, but Vector and HashSet are both O(1) if get() is used to access Vector entries.
Under normal circumstances, it is totally implausible that inserting 300,000 records into a Vector will take 43 minutes longer than inserting the same records into a HashSet.
However, I think there is a possible explanation of what might be going on.
First, the records coming out of the database must have a very high proportion of duplicates. Or at least, they must be duplicates according to the semantics of the equals/hashcode methods of your record class.
Next, I think you must be pushing very close to filling up the heap.
So the reason that the HashSet solution is so much faster is that most of the records are being discarded by the set.add operation (duplicates are simply not stored). By contrast, the Vector solution is keeping all of the records, and the JVM is spending most of its time trying to squeeze out that last 0.05% of memory by running the GC over, and over, and over.
One way to test this theory is to run the Vector version of the application with a much bigger heap.
Irrespective, the best way to investigate this kind of problem is to run the application using a profiler, and see where all the CPU time is going.
import java.util.*;

public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Vector<String> vector = new Vector<String>();
        for (int i = 0; i < 300000; i++) {
            String value = "dummy value " + i;
            // checking for a duplicate first makes every insert scan the whole vector
            if (!vector.contains(value)) {
                vector.add(value);
            }
        }
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}
If you check for a duplicate element before inserting it into the Vector, each check takes time that depends on the current size of the Vector, so the whole loop becomes quadratic. The best way is to use a HashSet for high performance, because a HashSet will not allow duplicates and there is no need to check for a duplicate element before inserting.
According to Dr Heinz Kabutz, he said this in one of his newsletters.
The old Vector class implements serialization in a naive way. It simply does the default serialization, which writes the entire Object[] as-is into the stream. Thus if we insert a bunch of elements into the List, then clear it, the difference between Vector and ArrayList is enormous.
import java.util.*;
import java.io.*;

public class VectorWritingSize {

    public static void main(String[] args) throws IOException {
        test(new LinkedList<String>());
        test(new ArrayList<String>());
        test(new Vector<String>());
    }

    public static void test(List<String> list) throws IOException {
        insertJunk(list);
        for (int i = 0; i < 10; i++) {
            list.add("hello world");
        }
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(baos);
        out.writeObject(list);
        out.close();
        System.out.println(list.getClass().getSimpleName() +
                " used " + baos.toByteArray().length + " bytes");
    }

    private static void insertJunk(List<String> list) {
        for (int i = 0; i < 1000 * 1000; i++) {
            list.add("junk");
        }
        list.clear();
    }
}
When we run this code, we get the following output:
LinkedList used 107 bytes
ArrayList used 117 bytes
Vector used 1310926 bytes
Vector can use a staggering amount of bytes when being serialized. The lesson here? Don't ever use Vector as Lists in objects that are Serializable. The potential for disaster is too great.
