I have written some code that combines, in parallel, groups of collections containing Pair<String, Integer> entries. Example:
Thread 1
[Car,1][Bear,1][Car,1]
Thread 2
[River,1][Car,1][River,1]
The result should be one collection per unique pair key (sorted alphabetically):
[Bear,1]
[Car,1][Car,1][Car,1]
[River,1][River,1]
My solution is shown below, but sometimes I don't get the expected result, or a ConcurrentModificationException is thrown from the list that contains the result collections:
List<Collection<Pair<String, Integer>>> combiningResult = new ArrayList<>();
private void startMappingPhase() throws Exception {
SimpleDateFormat formatter = new SimpleDateFormat("HH:mm:ss.SSS");
Invoker invoker = new Invoker(mappingClsPath, "Mapping", "mapper");
List<Callable<Integer>> tasks = new ArrayList<>();
for (String line : fileLines) {
tasks.add(() -> {
try {
combine((Collection<Pair<String, Integer>>) invoker.invoke(line));
} catch (Exception e) {
e.printStackTrace();
executor.shutdownNow();
errorOccurred = true;
return 0;
}
return 1;
});
if (errorOccurred)
Utils.showFatalError("Some error occurred, see log for more details");
}
long start = System.nanoTime();
System.out.println(tasks.size() + " Tasks");
System.out.println("Started at " + formatter.format(new Date()) + "\n");
executor.invokeAll(tasks);
long elapsedTime = System.nanoTime() - start;
combiningResult.forEach(c -> {
System.out.println(c.size() + "\n" + c);
});
System.out.print("\nFinished in " + (elapsedTime / 1_000_000_000.0) + " seconds\n");
}
private void combine(Collection<Pair<String, Integer>> pairs) {
Set<Pair<String, Integer>> uniquePairs = new LinkedHashSet<>(pairs);
for (Pair<String, Integer> uniquePair : uniquePairs) {
int pFrequencyCount = Collections.frequency(pairs, uniquePair);
Optional<Collection<Pair<String, Integer>>> collResult = combiningResult.stream().filter(c -> c.contains(uniquePair)).findAny();
if (collResult.isPresent()) {
collResult.ifPresent(c -> {
for (int i = 0; i < pFrequencyCount; i++)
c.add(uniquePair);
});
} else {
Collection<Pair<String, Integer>> newColl = new ArrayList<>();
for (int i = 0; i < pFrequencyCount; i++)
newColl.add(uniquePair);
combiningResult.add(newColl);
}
}
}
I tried CopyOnWriteArrayList instead of ArrayList, but sometimes the result is incomplete, e.g.
[Car,1][Car,1] instead of three entries. My question:
Is there a way to achieve what I'm trying to do without getting a ConcurrentModificationException or an incomplete result?
If you are trying to modify a single collection from multiple threads, you will need to add a synchronized block or use one of the JDK classes that support concurrency. These will typically perform better than a synchronized block.
https://docs.oracle.com/javase/tutorial/essential/concurrency/collections.html
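As a rough illustration of the "JDK classes supporting concurrency" route, here is a minimal sketch that groups pairs by key from several tasks. It makes some assumptions that are not in the question: `Map.Entry<String, Integer>` stands in for the `Pair` type, the class and method names (`ConcurrentGrouping`, `group`) are made up, and a pool of 4 threads is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ConcurrentGrouping {

    // Groups all pairs by key. The skip-list map keeps the keys sorted alphabetically.
    // Map.Entry is an assumed stand-in for the question's Pair type.
    static ConcurrentNavigableMap<String, Queue<Map.Entry<String, Integer>>> group(
            List<List<Map.Entry<String, Integer>>> batches) {
        ConcurrentNavigableMap<String, Queue<Map.Entry<String, Integer>>> result =
                new ConcurrentSkipListMap<>();
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (List<Map.Entry<String, Integer>> batch : batches) {
                tasks.add(() -> {
                    for (Map.Entry<String, Integer> pair : batch) {
                        // computeIfAbsent returns the queue that is actually stored in the map,
                        // so concurrent callers for the same key end up sharing one queue.
                        result.computeIfAbsent(pair.getKey(), k -> new ConcurrentLinkedQueue<>())
                              .add(pair);
                    }
                    return null;
                });
            }
            executor.invokeAll(tasks); // blocks until every task has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while grouping", e);
        } finally {
            executor.shutdown();
        }
        return result;
    }

    public static void main(String[] args) {
        var result = group(List.of(
                List.of(Map.entry("Car", 1), Map.entry("Bear", 1), Map.entry("Car", 1)),
                List.of(Map.entry("River", 1), Map.entry("Car", 1), Map.entry("River", 1))));
        // Prints "Bear -> 1", "Car -> 3", "River -> 2"
        result.forEach((key, pairs) -> System.out.println(key + " -> " + pairs.size()));
    }
}
```

The point of the design is that no thread ever runs a "search, then add" sequence on a shared ArrayList; the lookup and the append each go through a concurrent collection.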
I have this code, where I have my own homemade array class that I want to use to test the speed of some different concurrency tools in Java:
public class LongArrayListUnsafe {
private static final ExecutorService executor
= Executors.newFixedThreadPool(1);
public static void main(String[] args) {
LongArrayList dal1 = new LongArrayList();
int n = 100_000_000;
Timer t = new Timer();
List<Callable<Void>> tasks = new ArrayList<>();
tasks.add(() -> {
for (int i = 0; i <= n; i+=2){
dal1.add(i);
}
return null;
});
tasks.add(() -> {
for (int i = 0; i < n; i++){
dal1.set(i, i + 1);
}
return null;});
tasks.add(() -> {
for (int i = 0; i < n; i++) {
dal1.get(i);
}
return null;});
tasks.add(() -> {
for (int i = n; i < n * 2; i++) {
dal1.add(i + 1);
}
return null;});
try {
executor.invokeAll(tasks);
} catch (InterruptedException exn) {
System.out.println("Interrupted: " + exn);
}
executor.shutdown();
try {
executor.awaitTermination(1000, TimeUnit.MILLISECONDS);
} catch (Exception e){
System.out.println("what?");
}
System.out.println("Using toString(): " + t.check() + " ms");
}
}
class LongArrayList {
// Invariant: 0 <= size <= items.length
private long[] items;
private int size;
public LongArrayList() {
reset();
}
public static LongArrayList withElements(long... initialValues){
LongArrayList list = new LongArrayList();
for (long l : initialValues) list.add( l );
return list;
}
public void reset(){
items = new long[2];
size = 0;
}
// Number of items in the double list
public int size() {
return size;
}
// Return item number i
public long get(int i) {
if (0 <= i && i < size)
return items[i];
else
throw new IndexOutOfBoundsException(String.valueOf(i));
}
// Replace item number i, if any, with x
public long set(int i, long x) {
if (0 <= i && i < size) {
long old = items[i];
items[i] = x;
return old;
} else
throw new IndexOutOfBoundsException(String.valueOf(i));
}
// Add item x to end of list
public LongArrayList add(long x) {
if (size == items.length) {
long[] newItems = new long[items.length * 2];
for (int i=0; i<items.length; i++)
newItems[i] = items[i];
items = newItems;
}
items[size] = x;
size++;
return this;
}
public String toString() {
return Arrays.stream(items, 0,size)
.mapToObj( Long::toString )
.collect(Collectors.joining(", ", "[", "]"));
}
}
public class Timer {
private long start, spent = 0;
public Timer() { play(); }
public double check() { return (System.nanoTime()-start+spent)/1e9; }
public void pause() { spent += System.nanoTime()-start; }
public void play() { start = System.nanoTime(); }
}
The implementation of the LongArrayList class is not so important; what matters is that it is not thread-safe.
The driver code with the ExecutorService performs a bunch of operations on the array list, using 4 different tasks, each iterating 100_000_000 times.
The problem is that when I give the thread pool more threads, e.g. "Executors.newFixedThreadPool(2);", it only becomes slower.
For example, with one thread a typical timing is 1.0366974 ms, but with 3 threads the time ramps up to 5.7932714 ms.
What is going on? Why are more threads so much slower?
EDIT:
To boil the issue down, I made this much simpler drivercode, that has four tasks that simply add elements:
ExecutorService executor
= Executors.newFixedThreadPool(2);
LongArrayList dal1 = new LongArrayList();
int n = 100_000_00;
Timer t = new Timer();
for (int i = 0; i < 4 ; i++){
executor.execute(new Runnable() {
@Override
public void run() {
for (int j = 0; j < n ; j++)
dal1.add(j);
}
});
}
executor.shutdown();
try {
executor.awaitTermination(1000, TimeUnit.MILLISECONDS);
} catch (Exception e){
System.out.println("what?");
}
System.out.println("Using toString(): " + t.check() + " ms");
Here it still does not seem to matter how many threads I allocate; there is no speedup at all. Could this simply be because of overhead?
There are some problems with your code that make it hard to reason about why the time increases with more threads.
By the way,
public double check() { return (System.nanoTime()-start+spent)/1e9; }
returns seconds, not milliseconds, so change this:
System.out.println("Using toString(): " + t.check() + " ms");
to
System.out.println("Using toString(): " + t.check() + "s");
First problem:
LongArrayList dal1 = new LongArrayList();
dal1 is shared among all threads, and those threads update it without any mutual exclusion, which leads to race conditions. Moreover, this can also lead to cache invalidation, which can increase your overall execution time.
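To make the mutual-exclusion point concrete, here is a minimal sketch of a list whose add and size methods are synchronized. The class name `SynchronizedLongList` and the helper `fill` are hypothetical; this is not the question's LongArrayList, only a cut-down variant for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SynchronizedLongList {
    private long[] items = new long[2];
    private int size;

    // synchronized makes "grow if full, then append" one atomic step.
    public synchronized void add(long x) {
        if (size == items.length) {
            items = Arrays.copyOf(items, items.length * 2);
        }
        items[size++] = x;
    }

    public synchronized int size() {
        return size;
    }

    // Runs `threads` tasks that each append `perTask` values and returns the final size.
    static int fill(int threads, int perTask) {
        SynchronizedLongList list = new SynchronizedLongList();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            executor.execute(() -> {
                for (int j = 0; j < perTask; j++) {
                    list.add(j);
                }
            });
        }
        executor.shutdown();
        try {
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println(fill(4, 100_000)); // 400000: no appended element is lost
    }
}
```

This fixes correctness, not speed: every add now contends for the same lock, so more threads should not be expected to make this particular workload faster.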
The other thing is that you may have load balancing problems. You have 4 parallel tasks, but clearly the last one
tasks.add(() -> {
for (int i = n; i < n * 2; i++) {
dal1.add(i + 1);
}
return null;});
is the most computing-intensive task. Even if the 4 tasks run in parallel, without the problems that I have mentioned (i.e., the lack of synchronization around the shared data), the last task will dictate the overall execution time.
Not to mention that parallelism does not come for free: it adds overhead (e.g., scheduling the parallel work), which might be high enough that parallelizing the code is not worth it in the first place. In your code, there is at least the overhead of waiting for the tasks to complete, and also the overhead of shutting down the executor's thread pool.
Another possibility, which would also explain why you are not getting ArrayIndexOutOfBoundsException all over the place, is that the first 3 tasks are so small that they are all executed by the same thread. This would again make your overall execution time depend heavily on the last task and on the overhead of executor.shutdown() and executor.awaitTermination. However, even if that is the case, the order in which tasks execute, and which threads execute them, is typically non-deterministic and consequently not something your application should rely on. Funnily enough, when I changed your code to execute the tasks immediately (i.e., executor.execute), I got ArrayIndexOutOfBoundsException all over the place.
In "Something like 'contains any' for Java set?" there are several solutions:
Collections.disjoint(A, B)
setA.stream().anyMatch(setB::contains)
Sets.intersection(set1, set2).isEmpty()
CollectionUtils.containsAny()
In my case set1 is new ConcurrentHashMap<>().keySet() and set2 is an ArrayList.
set1 can contain up to 100 entries, set2 fewer than 10.
Which one performs best, or do they all do the same and perform similarly?
public static void main(String[] args) {
Map<String, String> map = new ConcurrentHashMap<>();
List<String> list = new ArrayList<>();
for (int i = 0; i < 100; i++) {
map.put(RandomStringUtils.randomNumeric(5), RandomStringUtils.randomNumeric(5));
}
for (int i = 0; i < 10; i++) {
list.add(RandomStringUtils.randomNumeric(5));
}
Set<String> set = new HashSet<>(list);
List<Runnable> methods = new ArrayList<>();
methods.add(() -> { Collections.disjoint(map.keySet(), list); });
methods.add(() -> { Collections.disjoint(list, map.keySet()); });
methods.add(() -> { map.keySet().stream().anyMatch(list::contains); });
methods.add(() -> { list.stream().anyMatch(map.keySet()::contains); });
methods.add(() -> { Sets.intersection(map.keySet(), set).isEmpty(); });
methods.add(() -> { Sets.intersection(set, map.keySet()).isEmpty(); });
methods.add(() -> { CollectionUtils.containsAny(map.keySet(), list); });
methods.add(() -> { CollectionUtils.containsAny(list, map.keySet()); });
for (Runnable method : methods) {
long start = System.currentTimeMillis();
for (int i = 0; i < 100000; i++) {
method.run();
}
long end = System.currentTimeMillis();
System.out.println("took " + (end - start));
}
}
And the winner is Collections.disjoint (times in milliseconds, in the order the methods were added to the list):
took 15
took 32
took 484
took 62
took 157
took 47
took 24
took 32
setA.stream().anyMatch(setB::contains) will be best, because all the other options are evaluated eagerly and are performed on all the elements.
For the stream, evaluation is lazy and returns as soon as any match is found.
Also, from the documentation of CollectionUtils.containsAny():
In other words, this method returns true iff the intersection(java.lang.Iterable, java.lang.Iterable) of coll1 and coll2 is not empty.
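For reference, here is a small sketch that puts two of the JDK-only variants side by side and checks that they give the same answer on the same inputs. The class and method names (`ContainsAnyDemo`, `viaDisjoint`, `viaStream`) are hypothetical, and the sketch says nothing about which variant is faster; that is best measured on your own data.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ContainsAnyDemo {

    // "Contains any" expressed as "not disjoint".
    static boolean viaDisjoint(Collection<String> a, Collection<String> b) {
        return !Collections.disjoint(a, b);
    }

    // "Contains any" expressed with a stream; anyMatch stops at the first match.
    static boolean viaStream(Collection<String> a, Collection<String> b) {
        return a.stream().anyMatch(b::contains);
    }

    public static void main(String[] args) {
        Set<String> keys = ConcurrentHashMap.newKeySet();
        keys.addAll(Arrays.asList("a", "b", "c"));
        List<String> hit = Arrays.asList("x", "c");
        List<String> miss = Arrays.asList("x", "y");

        System.out.println(viaDisjoint(keys, hit) + " " + viaStream(hit, keys));   // true true
        System.out.println(viaDisjoint(keys, miss) + " " + viaStream(miss, keys)); // false false
    }
}
```

In the stream variant it usually makes sense to stream the small list and call contains on the hash-based set, since a set lookup is cheaper than a list scan.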
I have a for loop that I want to parallelize. In the code below, I iterate my outermost for loop and put entries into various data structures, and it works fine. All those data structures have getters in the same class, which I use later from another class to get all the details once the loop is done. I am populating the info, itemToNumberMapping, catToValueHolder, tasksByCategory, catHolder and itemIds data structures.
// want to parallelize this for loop
for (Task task : tasks) {
if (task.getCategories().isEmpty() || task.getEventList() == null
|| task.getMetaInfo() == null) {
continue;
}
String itemId = task.getEventList().getId();
String categoryId = task.getCategories().get(0).getId();
Processor fp = new Processor(siteId, itemId, categoryId, poolType);
Map<String, Integer> holder = fp.getDataHolder();
if (!holder.isEmpty()) {
for (Map.Entry<String, Integer> entry : holder.entrySet()) {
info.putIfAbsent(entry.getKey(), entry.getValue());
}
List<Integer> values = new ArrayList<>();
for (String key : holder.keySet()) {
values.add(info.get(key));
}
itemToNumberMapping.put(itemId, StringUtils.join(values, ","));
catToValueHolder.put(categoryId, StringUtils.join(values, ","));
}
Category cat = getCategory(task, holder.isEmpty());
tasksByCategory.add(cat);
LinkedList<String> ids = getCategoryIds(task);
catHolder.put(categoryId, ids.getLast());
itemIds.add(itemId);
}
Now, I know how to parallelize a for loop as in the example below, but my confusion is this: in my case, I don't have a single object like output in that example. I have multiple data structures that I populate while iterating, so I am confused about how to parallelize my outermost for loop and still populate all those data structures.
private final ExecutorService service = Executors.newFixedThreadPool(10);
List<Future<Output>> futures = new ArrayList<Future<Output>>();
for (final Input input : inputs) {
Callable<Output> callable = new Callable<Output>() {
public Output call() throws Exception {
Output output = new Output();
// process your input here and compute the output
return output;
}
};
futures.add(service.submit(callable));
}
service.shutdown();
List<Output> outputs = new ArrayList<Output>();
for (Future<Output> future : futures) {
outputs.add(future.get());
}
Update:-
I am parallelizing a for loop that sits inside a do-while loop, and the do-while loop runs as long as number is less than or equal to pages. So maybe I am not doing it correctly: the do-while loop runs until all the pages are done, and for each page I have a for loop that I am trying to parallelize, and the way I have set it up, it throws a RejectedExecutionException.
private void check() {
String endpoint = "some_url";
int number = 1;
int pages = 0;
do {
ExecutorService executorService = Executors.newFixedThreadPool(10);
for (int i = 1; i <= retryCount; i++) {
try {
HttpEntity<String> requestEntity =
new HttpEntity<String>(getBody(number), getHeader());
ResponseEntity<String> responseEntity =
HttpClient.getInstance().getClient()
.exchange(URI.create(endpoint), HttpMethod.POST, requestEntity, String.class);
String jsonInput = responseEntity.getBody();
Process response = objectMapper.readValue(jsonInput, Process.class);
pages = (int) response.getPaginationResponse().getTotalPages();
List<Task> tasks = response.getTasks();
if (pages <= 0 || tasks.isEmpty()) {
continue;
}
// want to parallelize this for loop
for (Task task : tasks) {
Callable<Void> c = new Callable<>() {
public Void call() {
if (!task.getCategories().isEmpty() && task.getEventList() != null
&& task.getMetaInfo() != null) {
// my code here
}
return null;
}
};
executorService.submit(c);
}
// is this the right place? I am getting a RejectedExecutionException
executorService.shutdown();
number++;
break;
} catch (Exception ex) {
// log exception
}
}
} while (number <= pages);
}
You do not have to output something from your parallel code. You just take the body of the outer loop and create a task for each item, like this:
for (Task task : tasks) {
Callable<Void> c = new Callable<>() {
public Void call() {
// skip incomplete tasks (this replaces the `continue` of the sequential loop)
if (task.getCategories().isEmpty() || task.getEventList() == null || task.getMetaInfo() == null) {
return null;
}
// ... rest of code here
return null;
}
};
executorService.submit(c);
}
// wait for executor service, check for exceptions or whatever else you want to do here
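One way this could look end to end is sketched below, under several assumptions: plain strings stand in for the Task objects, the two shared structures are hypothetical simplifications of info and itemIds, and a fixed pool of 4 threads is arbitrary. Because the tasks write to shared structures, those structures are thread-safe ones here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelPopulate {
    // Thread-safe stand-ins for the shared structures (hypothetical simplification).
    final Map<String, Integer> info = new ConcurrentHashMap<>();
    final Set<String> itemIds = ConcurrentHashMap.newKeySet();

    void populate(List<String> items) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (String item : items) {
                tasks.add(() -> {
                    if (item.isEmpty()) {
                        return null; // plays the role of `continue` in the sequential loop
                    }
                    info.putIfAbsent(item, item.length());
                    itemIds.add(item);
                    return null;
                });
            }
            // invokeAll blocks until all tasks are done; get() rethrows task failures.
            for (Future<Void> future : executor.invokeAll(tasks)) {
                future.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown(); // only after everything has been submitted
        }
    }

    public static void main(String[] args) {
        ParallelPopulate p = new ParallelPopulate();
        p.populate(Arrays.asList("a", "bb", "", "a"));
        System.out.println(p.info.size() + " " + p.itemIds.size()); // 2 2
    }
}
```

Submitting all tasks first and shutting the executor down exactly once afterwards keeps the lifecycle simple: nothing is ever submitted to an executor that has already been shut down.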
public static void main(String[] args) {
List<String> data = new ArrayList<>();
for (int i = 0; i < 10000000; i++) {
data.add("data" + i);
}
System.out.println("parallel stream start time" + System.currentTimeMillis());
data.parallelStream().forEach(x -> {
System.out.println("data -->" + x);
});
System.out.println("parallel stream end time" + System.currentTimeMillis());
System.out.println("simple stream start time" + System.currentTimeMillis());
data.stream().forEach(x -> {
System.out.println("data -->" + x);
});
System.out.println("simple stream end time" + System.currentTimeMillis());
System.out.println("normal foreach start time" + System.currentTimeMillis());
for (int i = 0; i < data.size(); i++) {
System.out.println("data -->" + data.get(i));
}
System.out.println("normal foreach end time" + System.currentTimeMillis());
}
Output
parallel stream start time 1501944014854
parallel stream end time 1501944014970
simple stream start time 1501944014970
simple stream end time 1501944015036
normal foreach start time 1501944015036
normal foreach end time 1501944015040
Total time taken
Simple stream -> 66
Parallel stream -> 116
simple foreach -> 4
Many blogs say that parallelStream executes in parallel by internally distributing the work among threads and collecting the results automatically.
But the experiment above clearly shows that the parallel stream takes more time than the simple stream and the normal for loop.
Why does it take more time if it is executed in parallel? Is it a good idea to use it in a project if this feature degrades performance?
Thanks in Advance
Your tests are based on I/O operations (the most expensive kind of operation).
If you want to use parallel streams, you have to take the thread creation overhead into account. So use them only if your operation benefits from that (which is the case for heavy operations). If not, just use normal streams or a regular for-loop.
Basic rules for measurement:
Don't use I/O operations.
Repeat the same test more than once.
So if we reformulate the test scenarios, we could define a test helper class as follows:
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
public class Benchmark {
public static <T> T performTest(Callable<T> callable, int iteration, String name) throws Exception {
Map<String, Iteraion> map = new HashMap<>();
T last = null;
for (int i = 0; i < iteration; i++) {
long s = System.nanoTime();
T temp = callable.call();
long f = System.nanoTime();
map.put(UUID.randomUUID().toString(), new Iteraion(s, f));
if (i == iteration - 1) {
last = temp;
}
}
System.out.print("TEST :\t" + name + "\t\t\t");
System.out.print("ITERATION: " + map.size());
long sum = 0l;
for (String i : map.keySet()) {
sum += (map.get(i).finish - map.get(i).start);
}
long avg = (sum / map.size()) / 1000000;
System.out.println("\t\t\tAVERAGE: " + avg + " ms");
return last;
}
public interface Callable<T> {
T call() throws Exception;
}
static class Iteraion {
Long start;
Long finish;
public Iteraion(Long s, Long f) {
start = s;
finish = f;
}
}
}
Now we can perform the same test more than once using different operations. The following code shows tests performed for two different scenarios.
import java.util.ArrayList;
import java.util.List;
import static java.lang.Math.*;
@SuppressWarnings("unused")
public class Test {
public static void main(String[] args) {
try {
final int iteration = 100;
final List<String> data = new ArrayList<>();
for (int i = 0; i < 10000000; i++) {
data.add("data" + i);
}
/**
* Scenario 1
*/
Benchmark.performTest(new Benchmark.Callable<Void>() {
@Override
public Void call() throws Exception {
data.parallelStream().forEach(x -> {
x.trim();
});
return (Void) null;
}
}, iteration, "PARALEL_STREAM_ASSIGN_VAL");
Benchmark.performTest(new Benchmark.Callable<Void>() {
@Override
public Void call() throws Exception {
data.stream().forEach(x -> {
x.trim();
});
return (Void) null;
}
}, iteration, "NORMAL_STREAM_ASSIGN_VAL");
Benchmark.performTest(new Benchmark.Callable<Void>() {
@Override
public Void call() throws Exception {
for (int i = 0; i < data.size(); i++) {
data.get(i).trim();
}
return (Void) null;
}
}, iteration, "NORMAL_FOREACH_ASSIGN_VAL");
/**
* Scenario 2
*/
Benchmark.performTest(new Benchmark.Callable<Void>() {
@Override
public Void call() throws Exception {
data.parallelStream().forEach(x -> {
Integer i = Integer.parseInt(x.substring(4, x.length()));
double d = tan(atan(tan(atan(i))));
});
return (Void) null;
}
}, iteration, "PARALEL_STREAM_COMPUTATION");
Benchmark.performTest(new Benchmark.Callable<Void>() {
@Override
public Void call() throws Exception {
data.stream().forEach(x -> {
Integer i = Integer.parseInt(x.substring(4, x.length()));
double d = tan(atan(tan(atan(i))));
});
return (Void) null;
}
}, iteration, "NORMAL_STREAM_COMPUTATION");
Benchmark.performTest(new Benchmark.Callable<Void>() {
@Override
public Void call() throws Exception {
for (int i = 0; i < data.size(); i++) {
Integer x = Integer.parseInt(data.get(i).substring(4, data.get(i).length()));
double d = tan(atan(tan(atan(x))));
}
return (Void) null;
}
}, iteration, "NORMAL_FOREACH_COMPUTATION");
} catch (Exception e) {
e.printStackTrace();
}
}
}
The first scenario runs the same test 100 times, calling trim() on every element of a list that contains 10_000_000 elements: first with a parallel stream, then with a normal stream, and finally with the old-school for loop.
The second scenario performs some relatively heavy operations like tan(atan(tan(atan(i)))) for the same list with the same technique as in the first scenario.
The results are:
// First scenario, average times
Parallel stream: 78 ms
Regular stream: 113 ms
For-loop: 110 ms
// Second scenario, average times
Parallel stream: 1397 ms
Regular stream: 3866 ms
For-loop: 3826 ms
Note that if you debug the above code, you will notice that for parallel streams the program creates three extra threads named [ForkJoinPool-1], [ForkJoinPool-2] and [ForkJoinPool-3].
Edit:
The sequential streams and the for-loop use the caller's thread.
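A quick way to observe this without a debugger is to record the name of the thread that processes each element. The sketch below does that; the class and method names (`StreamThreads`, `threadsUsed`) are hypothetical, and the number and names of the worker threads you see for the parallel case depend on your JVM and machine.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

final class StreamThreads {

    // Returns the names of all threads that processed at least one element.
    static Set<String> threadsUsed(boolean parallel) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream range = IntStream.range(0, 100_000);
        (parallel ? range.parallel() : range)
                .forEach(i -> names.add(Thread.currentThread().getName()));
        return names;
    }

    public static void main(String[] args) {
        System.out.println("sequential: " + threadsUsed(false)); // only the calling thread
        System.out.println("parallel:   " + threadsUsed(true));  // typically several threads
    }
}
```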
I have read a little about ConcurrentModificationException on Stack Overflow, and my actual update appears not to be the issue; it could be a problem in my design, or I may need a technique I haven't learnt yet.
Example Situation:
My iterator is running along position markers.
Then an action can be performed to shift the markers over (e.g. Inserting into string).
All Markers greater than the current position must also be shifted to preserve correctness.
Task:
How do I update the remaining markers without the iterator exploding?
Can I refresh the iterator, or break and start the loop again?
The following code is abstracted from my work.
public void innerLoop(Boolean b) {
//An Example of what I'm working with
HashMap<String, HashSet<Integer>> map = new HashMap<String, HashSet<Integer>>() {
{
put("Nonce",
new HashSet<Integer>() {
{
add(1);
add(2);
add(3);
add(4);
add(5);
}
});
}
};
//for each key
for (String key: map.keySet()) {
HashSet<Integer> positions = map.get(key);
//for each integer
for (Iterator<Integer> it = positions.iterator(); it.hasNext();) {
Integer position = it.next();
System.out.println("position =" + position);
//(out of scope) decision requiring elements from the outer loops
if (new Random().nextBoolean()&&b) {
//shift position by +4 (or whatever)
//and every other (int >= position)
System.out.println("Shift " + position + " by 4");
Integer shift = 4;
update(position,
shift,
positions);
it.remove();
}
}
}
}
public void update(Integer current,
Integer diff,
Set<Integer> set) {
if (set != null) {
HashSet<Integer> temp = new HashSet<Integer>();
for (Integer old: set) {
if (old >= current) {
temp.add(old);
System.out.println(old + "Added to temp");
}
}
for (Integer old: temp) {
set.remove(old);
System.out.println(old + "removed");
set.add(old + diff);
System.out.println((old + diff) + "Added");
}
}
}
Edited with Garrett Hall's solution:
public void nestedloops() {
HashMap<String, HashSet<Integer>> map = new HashMap<String, HashSet<Integer>>() {
{
put("Hello",
new HashSet<Integer>() {
{
add(5);
add(2);
add(3);
add(4);
add(1);
add(6);
}
});
}
};
//for each key
for (String key: map.keySet()) {
ArrayList<Integer> positions = new ArrayList<Integer>(map.get(key));
//for each integer
for (int i = 0; i < positions.size(); i++) {
Integer position = positions.get(i);
System.out.println("[" + i + "] =" + position);
//out of scope decision
if (new Random().nextBoolean()) {
//shift position by +4
//and every other (int >= position)
System.out.println("Shift after " + position + " by 4");
Integer shift = 4;
//Update the array
for (int j = 0; j < positions.size(); j++) {
Integer checkPosition = positions.get(j);
if (checkPosition > position) {
System.out.println(checkPosition + "increased by 4");
positions.set(j,
checkPosition + shift);
}
}
}
}
//Add updated Array
map.put(key,
new HashSet<Integer>(positions));
}
}
Your best bet is to index the HashSet by copying it into a list. Then you can use indices to refer to elements rather than an Iterator. As long as you only update elements (no adding or removing), your indices stay correct; otherwise you will have to account for the shifts. Example:
ArrayList<Integer> positions = new ArrayList<Integer>(map.get(key));
for (int i = 0; i < positions.size(); i++) {
// updating list: shift every element from index i onward
for (int j = i; j < positions.size(); j++) {
positions.set(j, positions.get(j) + diff);
}
}
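A closely related option, sketched here as an alternative rather than as the method from this answer, is to copy the set into a list and update elements through a ListIterator. ListIterator.set replaces the current element in place, which is not a structural modification, so it does not trigger a ConcurrentModificationException. The class and method names (`ShiftPositions`, `shiftFrom`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.ListIterator;

final class ShiftPositions {

    // Returns a copy of `positions` in which every value >= from is increased by diff.
    static List<Integer> shiftFrom(Collection<Integer> positions, int from, int diff) {
        List<Integer> shifted = new ArrayList<>(positions);
        for (ListIterator<Integer> it = shifted.listIterator(); it.hasNext();) {
            int position = it.next();
            if (position >= from) {
                it.set(position + diff); // in-place replacement, list size is unchanged
            }
        }
        return shifted;
    }

    public static void main(String[] args) {
        System.out.println(shiftFrom(List.of(1, 2, 3, 4, 5), 3, 4)); // [1, 2, 7, 8, 9]
    }
}
```

The result can be turned back into a set with new HashSet<>(shifted) once all shifts are done.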
I would copy the original set to a list so that you don't need to worry about the current iteration code. Then update a secondary list (the one not being iterated).
Reasons:
You can't iterate over your original collection and modify it at the same time (there is no way around the ConcurrentModificationException).
There is a nice one-liner to shift items in a list:
Collections.rotate(list.subList(j, k+1), -1);
Guava can handle "find the first index that satisfies a predicate and transform the list" with a bunch of utility methods.
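To show what the rotate one-liner does, here is a small sketch: rotating the sub-list from index j to k by -1 moves the element at index j to index k and shifts the elements in between one step to the left. The wrapper `moveForward` and the class name are hypothetical; the sketch works on a copy so the input list is left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class RotateDemo {

    // Moves the element at index j to index k (j <= k) in a copy of the list.
    static <T> List<T> moveForward(List<T> source, int j, int k) {
        List<T> list = new ArrayList<>(source);
        // subList is a view, so rotating it rearranges the backing list.
        Collections.rotate(list.subList(j, k + 1), -1);
        return list;
    }

    public static void main(String[] args) {
        System.out.println(moveForward(Arrays.asList("a", "b", "c", "d", "e"), 1, 3)); // [a, c, d, b, e]
    }
}
```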