Object reuse - mutating same object - in Flink operators - java

I was reading the doc here, which gives a use case to reuse the object as given below:
stream
.apply(new WindowFunction<WikipediaEditEvent, Tuple2<String, Long>, String, TimeWindow>() {
// Create an instance that we will reuse on every call
private Tuple2<String, Long> result = new Tuple2<>();
@Override
public void apply(String userName, TimeWindow timeWindow, Iterable<WikipediaEditEvent> iterable, Collector<Tuple2<String, Long>> collector) throws Exception {
long changesCount = ...
// Set fields on an existing object instead of creating a new one
result.f0 = userName;
// Auto-boxing!! A new Long value may be created
result.f1 = changesCount;
// Reuse the same Tuple2 object
collector.collect(result);
}
});
So instead of creating a new Tuple on every call, it reuses the same Tuple, relying on its mutability, to reduce pressure on the GC. Is this applicable in all operators, i.e. can we always mutate the same object and pass it down the pipeline via the collector.collect(...) call?
Following that idea, I wonder where I can make such an optimization without breaking the code or introducing sneaky bugs. As another example, here is a KeySelector that returns a Tuple, taken from this answer:
KeyedStream<Employee, Tuple2<String, String>> employeesKeyedByCountryndEmployer =
streamEmployee.keyBy(
new KeySelector<Employee, Tuple2<String, String>>() {
@Override
public Tuple2<String, String> getKey(Employee value) throws Exception {
return Tuple2.of(value.getCountry(), value.getEmployer());
}
}
);
I wonder whether, in that case, I can reuse the same Tuple by mutating it with different inputs, as below. In all cases I assume the parallelism is greater than 1, and probably much higher in a real use case.
KeyedStream<Employee, Tuple2<String, String>> employeesKeyedByCountryndEmployer =
streamEmployee.keyBy(
new KeySelector<Employee, Tuple2<String, String>>() {
Tuple2<String, String> tuple = new Tuple2<>();
@Override
public Tuple2<String, String> getKey(Employee value) throws Exception {
tuple.f0 = value.getCountry();
tuple.f1 = value.getEmployer();
return tuple;
}
}
);
I do not know whether Flink copies objects between stages in the pipeline, so I wonder if such an optimization is safe. I read about the enableObjectReuse() configuration in the docs, though I am not sure I really understood it. This may touch on Flink internals, but I could not work out what Flink does, and when, to manage data/objects/records in the pipeline. Maybe I should get this clear first?
Thanks,

This sort of reuse in a KeySelector is not safe. keyBy is not an operator, and the usual rules about object reuse in operator chains (which I covered here) do not apply.

See Dave Anderson's answer to "Flink, rule of using 'object reuse mode'".
Basically, you can't remember input object references across function calls or modify input objects. In your situation above with the KeySelector, you're modifying an object that you created, not an input object.
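To make the underlying risk concrete, here is a minimal sketch in plain Java of what happens when one mutable instance is handed out repeatedly to a consumer that keeps the references. It does not use Flink at all: `MutablePair`, `emitReused`, `emitFresh`, and the list standing in for a buffering consumer are hypothetical names introduced only for illustration. Whether a particular Flink code path holds on to a returned object is not something this sketch establishes; it only shows the consequence if one does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a mutable tuple; this is NOT Flink's Tuple2.
final class MutablePair {
    String f0;
    String f1;
}

class ReuseHazardDemo {
    // Hands out the same instance for every row, as a reusing function would.
    static List<MutablePair> emitReused(String[][] rows) {
        MutablePair shared = new MutablePair();
        List<MutablePair> buffered = new ArrayList<>(); // consumer that keeps references
        for (String[] row : rows) {
            shared.f0 = row[0];
            shared.f1 = row[1];
            buffered.add(shared); // every entry aliases the same object
        }
        return buffered;
    }

    // Hands out a fresh instance per row.
    static List<MutablePair> emitFresh(String[][] rows) {
        List<MutablePair> buffered = new ArrayList<>();
        for (String[] row : rows) {
            MutablePair pair = new MutablePair();
            pair.f0 = row[0];
            pair.f1 = row[1];
            buffered.add(pair);
        }
        return buffered;
    }

    public static void main(String[] args) {
        String[][] rows = {{"DE", "Acme"}, {"FR", "Globex"}};
        System.out.println(emitReused(rows).get(0).f0); // FR: the first entry was overwritten
        System.out.println(emitFresh(rows).get(0).f0);  // DE
    }
}
```

The reused variant silently loses the first row because both list entries point at the same object. Reuse is only harmless when nothing downstream keeps the reference past the call.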

Related

How can I properly make a multilevel map using lambdas?

I'm comparing files in folders (acceptor & sender) using JCIFS. During the comparison two situations may occur:
- the file does not exist at the acceptor
- the file exists at the acceptor
I need to get a map where the compared files are grouped by these two cases, so I can copy the non-existing files or check the size and modification date of the existing ones.
I want to do this using lambdas and streams, because I will use parallel streams in the near future, and it's also convenient.
I've managed to write a working prototype method that checks whether a file exists and creates a map:
private Map<String, Boolean> compareFiles(String[] acceptor, String[] sender) {
return Arrays.stream(sender)
.map(s -> new AbstractMap.SimpleEntry<>(s, Stream.of(acceptor).anyMatch(s::equals)))
.collect(collectingAndThen(
toMap(Map.Entry::getKey, Map.Entry::getValue),
Collections::<String,Boolean> unmodifiableMap));
}
but I can't add a higher-level grouping by the map value.
I have this non-working piece of code:
private Map<String, Boolean> compareFiles(String[] acceptor, String[] sender) {
return Arrays.stream(sender)
.map(s -> new AbstractMap.SimpleEntry<>(s, Stream.of(acceptor).anyMatch(s::equals)))
.collect(groupingBy(
Map.Entry::getValue,
groupingBy(Map.Entry::getKey, Map.Entry::getValue)));
}
}
My code doesn't compile, because I missed something important. Could anyone help me and explain how to write this lambda correctly?
P.S. The arrays passed as method parameters come from SmbFile Samba directories:
private final String master = "smb://192.168.1.118/mastershare/";
private final String node = "smb://192.168.1.118/nodeshare/";
SmbFile masterDir = new SmbFile(master);
SmbFile nodeDir = new SmbFile(node);
Map<Boolean, Map<String, Boolean>> resultingMap = compareFiles(masterDir, nodeDir);
Collecting into nested maps with the same values is not very useful. The resulting Map<Boolean, Map<String, Boolean>> can only have two keys, true and false. When you call get(true) on it, you'll get a Map<String, Boolean> where all string keys redundantly map to true. Likewise, get(false) will give you a map where all values are false.
To me, it looks like you actually want
private Map<Boolean, Set<String>> compareFiles(String[] acceptor, String[] sender) {
return Arrays.stream(sender)
.collect(partitioningBy(Arrays.asList(acceptor)::contains, toSet()));
}
where get(true) gives you a set of all strings where the predicate evaluated to true and vice versa.
partitioningBy is an optimized version of groupingBy for boolean keys.
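One observable difference between the two collectors is worth a short sketch: the map produced by `partitioningBy` always contains both the `true` and the `false` key, whereas `groupingBy` only creates keys that actually occur. The class name, helper methods, and sample strings below are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PartitionVsGroupDemo {
    // partitioningBy: the result always has entries for both true and false.
    static Map<Boolean, List<String>> partition(List<String> names) {
        return names.stream()
                .collect(Collectors.partitioningBy(n -> n.startsWith("a")));
    }

    // groupingBy with a boolean classifier: only keys that occur are present.
    static Map<Boolean, List<String>> group(List<String> names) {
        return names.stream()
                .collect(Collectors.groupingBy(n -> n.startsWith("a")));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("alpha", "avocado");
        System.out.println(partition(names).containsKey(false)); // true (mapped to an empty list)
        System.out.println(group(names).containsKey(false));     // false (key is absent)
    }
}
```

With `partitioningBy`, callers can use `get(false)` without a null check, which fits the "copy the missing files" use case from the question.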
Note that Stream.of(acceptor).anyMatch(s::equals) is an overuse of Stream features. Arrays.asList(acceptor).contains(s) is simpler, and when it is used as a predicate like Arrays.asList(acceptor)::contains, the expression Arrays.asList(acceptor) is evaluated only once and a function calling contains on each evaluation is passed to the collector.
When acceptor gets large, you should not consider parallel processing, but rather replace the linear search with a hash lookup:
private Map<Boolean, Set<String>> compareFiles(String[] acceptor, String[] sender) {
return Arrays.stream(sender)
.collect(partitioningBy(new HashSet<>(Arrays.asList(acceptor))::contains, toSet()));
}
Again, the preparation work of new HashSet<>(Arrays.asList(acceptor)) is only done once, whereas the contains invocation, done for every element of sender, will not depend on the size of acceptor anymore.
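For completeness, here is a self-contained sketch of the hash-lookup variant with a small driver. The file names are invented sample data, and the intermediate `present` variable is an addition for readability; the collector itself is the one from the answer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class CompareFilesDemo {
    static Map<Boolean, Set<String>> compareFiles(String[] acceptor, String[] sender) {
        // Built once; each contains() call is then a hash lookup.
        Set<String> present = new HashSet<>(Arrays.asList(acceptor));
        return Arrays.stream(sender)
                .collect(Collectors.partitioningBy(present::contains, Collectors.toSet()));
    }

    public static void main(String[] args) {
        String[] acceptor = {"a.txt", "b.txt"};
        String[] sender = {"a.txt", "c.txt"};
        Map<Boolean, Set<String>> result = compareFiles(acceptor, sender);
        System.out.println("already at acceptor: " + result.get(true));
        System.out.println("missing at acceptor: " + result.get(false));
    }
}
```

`result.get(false)` is the set of names to copy, and `result.get(true)` is the set whose size and modification date still need checking.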
I've managed to solve my problem. I had a type mismatch, so the working code is:
private Map<Boolean, Map<String, Boolean>> compareFiles(String[] acceptor, String[] sender) {
return Arrays.stream(sender)
.map(s -> new AbstractMap.SimpleEntry<>(s, Stream.of(acceptor).anyMatch(s::equals)))
.collect(collectingAndThen(
groupingBy(Map.Entry::getValue, toMap(Map.Entry::getKey, Map.Entry::getValue)),
Collections::<Boolean, Map<String, Boolean>> unmodifiableMap));
}

How to modify a value to Tuple2 in Java

I am using an accumulator within a fold function. I would like to change the value of the accumulator.
My function looks something like this:
public Tuple2<String, Long> fold(Tuple2<String, Long> acc, eventClass event)
{
acc._1 = event.getUser();
acc._2 += event.getOtherThing();
return acc;
}
To me this should be working, because all I am doing is change the values of the accumulator. However what I get is Cannot assign value to final variable _1. Same for _2. Why are these properties of acc final? How can I assign values to them?
Quick edit:
What I could do is just return a new Tuple instead, but this is not a nice solution in my opinion: return new Tuple2<String, Long>(event.getUser(), acc._2 + event.getOtherThing());
Solution for the Flink framework:
Use the Tuple2 defined in Flink. Import it using
import org.apache.flink.api.java.tuple.Tuple2;
and then use it with
acc.f0 = event.getUser();
acc.f1 += event.getByteDiff();
return acc;
I don't know which Tuple2 you are using, but I assume it is the Scala Tuple2.
The Scala Tuple2 is immutable. You can't change the value of an immutable object; you must recreate it.
Why? The Scala Tuple2 is a functional-programming data structure, so, like most concepts in functional programming, it tries to reduce side effects.
You can use the copy method to create a modified instance.
The following is an example:
@Test
public void test() {
Tuple2<String,Long> tuple = new Tuple2<>("a",1L);
Tuple2<String,Long> actual = tuple.copy(tuple._1,tuple._2+1);
Tuple2<String,Long> expected = new Tuple2<>("a",2L);
assertEquals(expected,actual);
}
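The same "recreate instead of mutate" idea can be sketched without Scala or Flink on the classpath. Everything below is hypothetical: `ImmutablePair` stands in for an immutable tuple, `EditEvent` for the question's event class, and `foldAll` is just a loop that drives `fold`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical immutable pair; its final fields cannot be reassigned.
final class ImmutablePair {
    final String user;
    final long total;

    ImmutablePair(String user, long total) {
        this.user = user;
        this.total = total;
    }
}

// Hypothetical event type exposing the two accessors used in the question.
final class EditEvent {
    private final String user;
    private final long otherThing;

    EditEvent(String user, long otherThing) {
        this.user = user;
        this.otherThing = otherThing;
    }

    String getUser() { return user; }
    long getOtherThing() { return otherThing; }
}

class ImmutableFoldDemo {
    // Returns a new accumulator rather than assigning to final fields.
    static ImmutablePair fold(ImmutablePair acc, EditEvent event) {
        return new ImmutablePair(event.getUser(), acc.total + event.getOtherThing());
    }

    // Applies fold to each event in order, starting from the initial accumulator.
    static ImmutablePair foldAll(ImmutablePair initial, List<EditEvent> events) {
        ImmutablePair acc = initial;
        for (EditEvent event : events) {
            acc = fold(acc, event);
        }
        return acc;
    }

    public static void main(String[] args) {
        List<EditEvent> events = Arrays.asList(new EditEvent("alice", 3), new EditEvent("bob", 4));
        ImmutablePair result = foldAll(new ImmutablePair("", 0), events);
        System.out.println(result.user + " " + result.total); // bob 7
    }
}
```

The cost is one short-lived allocation per event, which is the trade-off the question's "quick edit" already hints at.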
I don't know which Tuple2 you are working with. How about returning a new object:
Tuple2<String, Long> tuple = new Tuple2<String, Long>(event.getUser(), event.getOtherThing() + acc._2);
return tuple;

Inconsistent responses when using ConcurrentHashMap in multi-threaded environment

We have a single thread that regularly updates a Map. And then we have multiple other threads that read this map.
This is how the update thread executes
private Map<String, SecondMap> firstMap = new ConcurrentHashMap<>();
private void refresh() //This method is called every X seconds by one thread only
{
List<SecondMap> newData = getLatestData();
final List<String> newEntries = new ArrayList<>();
for(SecondMap map : newData) {
newEntries.add(map.getName());
firstMap.put(map.getName(), map);
}
final Set<String> cachedEntries = firstMap.keySet();
for (final String cachedEntry : cachedEntries) {
if (!newEntries.contains(cachedEntry)) {
firstMap.remove(cachedEntry);
}
}
}
public Map<String, SecondMap> getFirstMap()//Other threads call this
{
return firstMap;
}
The SecondMap class looks like this
class SecondMap {
Map<String, SomeClass> data; //Not necessarily a concurrent hashmap
public Map<String, SomeClass> getData() {
return data;
}
}
Below is the simplified version of how reader threads access
public void getValue() {
Map<String, SecondMap> firstMap = getFirstMap();
SecondMap secondMap = firstMap.get("SomeKey");
secondMap.getData().get("AnotherKey");// This returns null
}
We are seeing that when other threads iterate over the received firstMap, they sometimes get null values for some keys in the SecondMap. We don't see any null values for keys in the firstMap, but we do see null values for keys in the SecondMap. One thing we can rule out is getLatestData returning such data: it reads from a database and returns these entries, and there can never be null values in the database in the first place. We also see that this happens only occasionally. We are probably missing something in how we handle the multi-threaded situation, but I am looking for an explanation of why this can happen.
Assuming the Map<String, SomeClass> data; inside the SecondMap class is a HashMap, you can get a null value for a key in two scenarios.
1. If the key maps to a null value. Example "Something" -> null.
2. If the key is not in the map in the first place.
So, without knowing much about where the data is coming from: if one of the maps returned by getLatestData() doesn't have the key "SomeKey" in the map at all, the lookup will return null.
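The two scenarios can be told apart with `containsKey`, as in this minimal sketch. The class name and the sample keys are made up; the values are plain strings rather than `SomeClass` to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

class NullLookupDemo {
    static Map<String, String> sample() {
        Map<String, String> data = new HashMap<>();
        data.put("Something", null); // scenario 1: key present, mapped to null
        data.put("Other", "value");
        return data;                 // scenario 2: "Missing" is never inserted
    }

    public static void main(String[] args) {
        Map<String, String> data = sample();
        System.out.println(data.get("Something") == null);  // true
        System.out.println(data.containsKey("Something"));  // true: present, null value
        System.out.println(data.get("Missing") == null);    // true
        System.out.println(data.containsKey("Missing"));    // false: absent key
    }
}
```

Note that `ConcurrentHashMap` rejects null keys and values, so for a `ConcurrentHashMap` a null from `get` can only mean the key was absent at that moment.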
Also, since there's not enough information about how that Map<String, SomeClass> data; is updated, and whether it's mutable or immutable, you may have issues there. If that map is immutable and the SecondMap is immutable, then it's most probably OK. But if you are modifying it from multiple threads, you should make it a ConcurrentHashMap, and if you update the reference to a new Map<String, SomeClass> data from different threads inside the SecondMap, you should also make that reference volatile.
class SecondMap {
volatile Map<String, SomeClass> data; //Not necessarily a concurrent hashmap
public Map<String, SomeClass> getData() {
return data;
}
}
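One way to combine the two suggestions (an immutable map plus a volatile reference) is to build each new map completely and then publish it with a single assignment. The sketch below is an assumption about how that could look, not the asker's actual code: `SnapshotHolder` and `replaceAll` are hypothetical names, and `String` values stand in for `SomeClass`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SnapshotHolder {
    // volatile: a reader that loads this reference sees the fully built map behind it.
    private volatile Map<String, String> data = Collections.emptyMap();

    // Copies the input into a private map, wraps it read-only, then publishes it
    // with one assignment. Readers see either the old or the new snapshot.
    void replaceAll(Map<String, String> fresh) {
        data = Collections.unmodifiableMap(new HashMap<>(fresh));
    }

    Map<String, String> getData() {
        return data;
    }

    public static void main(String[] args) {
        SnapshotHolder holder = new SnapshotHolder();
        Map<String, String> source = new HashMap<>();
        source.put("AnotherKey", "v1");
        holder.replaceAll(source);
        source.clear(); // later changes to the source do not affect the snapshot
        System.out.println(holder.getData().get("AnotherKey")); // v1
    }
}
```

Because nobody mutates a published snapshot, readers never observe a half-populated map.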
If you'd like to understand in depth when to use the volatile keyword and all the intricacies of data races, there's a section about it in this online course: https://www.udemy.com/java-multithreading-concurrency-performance-optimization/?couponCode=CONCURRENCY
I have not seen any resource that explains and demonstrates it better. Unfortunately, there are many articles online that explain it incorrectly.
I hope from the little information in the question I was able to point you to some directions that might help. Please share more information if nothing of that works, or if something does work, please let me know, I'm curious to know what it was :)

How To Generate Mono<Map<Map<List>> from Flux

I have a Flux of responses, built from the list below as Flux.<Response>fromIterable(responses). I want to convert it to a Mono of a map as follows:
Mono< Map< String, Map< String, Collection< Response>>>> collectMap = ?
where company is the first key, and for each company another map of responses is generated with category as the key.
List< Response> responses = new ArrayList(){
{
add(Response.builder().company("Samsung").category("Tab").price("$2000").itemName("Note").build());
add(Response.builder().company("Samsung").category("Phone").price("$2000").itemName("S9").build());
add(Response.builder().company("Samsung").category("Phone").price("$1000").itemName("S8").build());
add(Response.builder().company("Iphone").category("Phone").price("$5000").itemName("Iphone8").build());
add(Response.builder().company("Iphone").category("Tab").price("$5000").itemName("Tab").build());
}
};
I am able to build the initial map as follows:
Mono<Map<String, Collection<Response>>> collect = Flux.<Response>fromIterable( responses )
.collectMultimap( Response::getCompany );
Does anyone have an idea how I can achieve my goal here?
I don't think collectMultimap or collectMap helps you directly in this case:
collectMultimap (and its overloads) can only return a Map<T, Collection<V>>, which is clearly different from what you want. Of course, you can post-process the resulting value set (namely the Collection<V> part of the map) with O(n) complexity.
On the other hand, collectMap (and its overloads) looks a bit more promising if you provide the value function. However, you don't have access to the other V objects, which prevents you from building the Collection<V>.
The solution I came up with uses reduce, though the return type is
Mono<Map<String, Map<String, List<Response>>>> (mind the List<V> instead of Collection<V>):
return Flux.<Response>fromIterable( responses )
.reduce(new HashMap<>(), (map, response) -> {
// computeIfAbsent stores the new inner map/list; getOrDefault would not
map.computeIfAbsent(response.getCompany(), company -> new HashMap<>())
.computeIfAbsent(response.getCategory(), category -> new ArrayList<>())
.add(response);
return map;
});
The full type for the HashMap in reduce is HashMap<String, Map<String, List<Response>>>; thankfully, Java can infer that from the return type of the method or the type of the assigned variable.
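The two-level grouping itself can also be expressed with nested `Collectors.groupingBy`. The sketch below runs on a plain `java.util.stream.Stream` so that it needs nothing beyond the JDK; `Item` is a hypothetical stand-in for the question's `Response` class. Reactor's `Flux` should accept the same collector through its `collect` overload that takes a `java.util.stream.Collector`, but treat that as an assumption to verify against the Reactor version in use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the question's Response class.
final class Item {
    private final String company;
    private final String category;
    private final String itemName;

    Item(String company, String category, String itemName) {
        this.company = company;
        this.category = category;
        this.itemName = itemName;
    }

    String getCompany() { return company; }
    String getCategory() { return category; }
    String getItemName() { return itemName; }
}

class NestedGroupingDemo {
    // Outer key: company. Inner key: category. Values: items in encounter order.
    static Map<String, Map<String, List<Item>>> byCompanyThenCategory(List<Item> items) {
        return items.stream().collect(
                Collectors.groupingBy(Item::getCompany,
                        Collectors.groupingBy(Item::getCategory)));
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
                new Item("Samsung", "Tab", "Note"),
                new Item("Samsung", "Phone", "S9"),
                new Item("Samsung", "Phone", "S8"),
                new Item("Iphone", "Phone", "Iphone8"));
        Map<String, Map<String, List<Item>>> grouped = byCompanyThenCategory(items);
        System.out.println(grouped.get("Samsung").get("Phone").size()); // 2
    }
}
```

As with the reduce approach, the inner values come out as `List` rather than `Collection`.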

Java: how to convert a List<?> to a Map<String,?> [duplicate]

I would like to find a way to take the object-specific routine below and abstract it into a method to which you can pass a class, a list, and a field name to get back a Map.
A general pointer on the pattern to use would be enough to get me started in the right direction.
Map<String,Role> mapped_roles = new HashMap<String,Role>();
List<Role> p_roles = (List<Role>) c.list();
for (Role el : p_roles) {
mapped_roles.put(el.getName(), el);
}
to this? (Pseudo code)
Map<String,?> MapMe(Class clz, Collection list, String methodName)
Map<String,?> map = new HashMap<String,?>();
for (clz el : list) {
map.put(el.methodName(), el);
}
is it possible?
Using Guava (formerly Google Collections):
Map<String,Role> mappedRoles = Maps.uniqueIndex(yourList, Functions.toStringFunction());
Or, if you want to supply your own method that makes a String out of the object:
Map<String,Role> mappedRoles = Maps.uniqueIndex(yourList, new Function<Role,String>() {
public String apply(Role from) {
return from.getName(); // or something else
}});
Here's what I would do. I am not entirely sure if I am handling generics right, but oh well:
public <T> Map<String, T> mapMe(Collection<T> list) {
Map<String, T> map = new HashMap<String, T>();
for (T el : list) {
map.put(el.toString(), el);
}
return map;
}
Just pass a Collection to it, and have your classes implement toString() to return the name. Polymorphism will take care of it.
Java 8 streams and method references make this so easy you don't need a helper method for it.
Map<String, Foo> map = listOfFoos.stream()
.collect(Collectors.toMap(Foo::getName, Function.identity()));
If there may be duplicate keys, you can aggregate the values with the toMap overload that takes a value merge function, or you can use groupingBy to collect into a list:
//taken right from the Collectors javadoc
Map<Department, List<Employee>> byDept = employees.stream()
.collect(Collectors.groupingBy(Employee::getDepartment));
As shown above, none of this is specific to String -- you can create an index on any type.
If you have a lot of objects to process and/or your indexing function is expensive, you can go parallel by using Collection.parallelStream() or stream().parallel() (they do the same thing). In that case you might use toConcurrentMap or groupingByConcurrent, as they allow the stream implementation to just blast elements into a ConcurrentMap instead of making separate maps for each thread and then merging them.
If you don't want to commit to Foo::getName (or any specific method) at the call site, you can use a Function passed in by a caller, stored in a field, etc.. Whoever actually creates the Function can still take advantage of method reference or lambda syntax.
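If a reusable helper is still wanted, the key extractor can be passed in as a `Function`, as described above. This is a minimal sketch; `IndexByDemo`, `indexBy`, and `indexByKeepLast` are hypothetical names, and the second method illustrates the `toMap` overload with a merge function for duplicate keys.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class IndexByDemo {
    // Expects unique keys: Collectors.toMap throws IllegalStateException on a duplicate.
    static <K, T> Map<K, T> indexBy(Collection<T> items,
                                    Function<? super T, ? extends K> keyFn) {
        return items.stream()
                .collect(Collectors.toMap(keyFn, Function.identity()));
    }

    // Tolerates duplicate keys by keeping the value seen last.
    static <K, T> Map<K, T> indexByKeepLast(Collection<T> items,
                                            Function<? super T, ? extends K> keyFn) {
        return items.stream()
                .collect(Collectors.toMap(keyFn, Function.identity(),
                        (first, second) -> second));
    }

    public static void main(String[] args) {
        // Index strings by their length; any key type works, not only String.
        Map<Integer, String> byLength = indexBy(Arrays.asList("apple", "kiwi"), String::length);
        System.out.println(byLength.get(4)); // kiwi
    }
}
```

The caller supplies the method reference (`Role::getName` in the question's case), so no reflection or method-name string is needed.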
Avoid reflection like the plague.
Unfortunately, Java's syntax for this is verbose. (A recent JDK7 proposal would make it much more concise.)
interface ToString<T> {
String toString(T obj);
}
public static <T> Map<String,T> stringIndexOf(
Iterable<T> things,
ToString<T> toString
) {
Map<String,T> map = new HashMap<String,T>();
for (T thing : things) {
map.put(toString.toString(thing), thing);
}
return map;
}
Currently call as:
Map<String,Thing> map = stringIndexOf(
things,
new ToString<Thing>() { public String toString(Thing thing) {
return thing.getSomething();
}}
);
In JDK7, it may be something like:
Map<String,Thing> map = stringIndexOf(
things,
{ thing -> thing.getSomething(); }
);
(Might need a yield in there.)
Using reflection and generics:
public static <T> Map<String, T> MapMe(Class<T> clz, Collection<T> list, String methodName)
throws Exception{
Map<String, T> map = new HashMap<String, T>();
Method method = clz.getMethod(methodName);
for (T el : list){
map.put((String)method.invoke(el), el);
}
return map;
}
In your documentation, make sure you mention that the return type of the method must be a String. Otherwise, it will throw a ClassCastException when it tries to cast the return value.
If you're sure that each object in the List will have a unique index, use Guava with Jorn's suggestion of Maps.uniqueIndex.
If, on the other hand, more than one object may have the same value for the index field (which, while perhaps not true for your specific example, is true in many use cases for this sort of thing), the more general way to do this indexing is to use Multimaps.index(Iterable<V> values, Function<? super V,K> keyFunction) to create an ImmutableListMultimap<K,V> that maps each key to one or more matching values.
Here's an example that uses a custom Function that creates an index on a specific property of an object:
List<Foo> foos = ...
ImmutableListMultimap<String, Foo> index = Multimaps.index(foos,
new Function<Foo, String>() {
public String apply(Foo input) {
return input.getBar();
}
});
// iterate over all Foos that have "baz" as their Bar property
for (Foo foo : index.get("baz")) { ... }
