I am using Flink v.1.4.0.
I am working with the DataSet API and one of the things I want to try is very similar to how broadcast variables are used in Apache Spark.
Practically, I want to apply a map function on a DataSet, go through each of the elements in the DataSet and search for it in a HashMap; if the search element is present in the Map then retrieve the respective value.
The HashMap is very big and I don't know if (since I haven't even built my solution) it needs to be Serializable to be transmitted and used by all workers concurrently.
In general, the solution I have in mind would look like this:
Map<String, T> hashMap = new ... ;
DataSet<Point> points = env.readCsv(...);
points
.map(point -> hashMap.getOrDefault(point.getId(), 0))
...
but I don't know if this would work or whether it is efficient in any way. After doing a bit of searching I found a much better example here, according to which one can use broadcast variables in Flink to broadcast a List as follows:
DataSet<Point> points = env.readCsv(...);
DataSet<Centroid> centroids = ... ; // some computation
points.map(new RichMapFunction<Point, Integer>() {
    private List<Centroid> centroids;

    @Override
    public void open(Configuration parameters) {
        this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
    }

    @Override
    public Integer map(Point p) {
        return selectCentroid(centroids, p);
    }
}).withBroadcastSet(centroids, "centroids");
However, .getBroadcastVariable() seems to only return a List.
Can someone provide an alternative solution with a HashMap?
How would that solution work?
What is the most efficient way to go about solving this?
Could one use a Flink Managed State to do something similar to how broadcast variables are used? How?
Finally, can I attempt multiple mappings with multiple broadcast variables in a pipeline?
Where do the values of hashMap come from? Two other possible solutions:
Reinitialise/recreate/regenerate the hashMap in each instance of your filtering/mapping operator separately, in the open method. This is probably more efficient per record, but it duplicates the initialisation logic.
Create two DataSets, one for the hashMap values and a second for the points, and join the two using your desired join strategy. As an analogy, what you are trying to do could be expressed by the SQL query SELECT * FROM points p, hashMap h WHERE h.key = p.id.
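For what it's worth, the broadcast approach from the question can still yield a HashMap: getBroadcastVariable() hands back a List, but nothing stops you from rebuilding a map from it once, inside open(). The conversion itself is plain Java; the pair type below is a stand-in for whatever element type your broadcast DataSet actually carries (in Flink it would typically be a Tuple2):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BroadcastToMap {
    public static void main(String[] args) {
        // Stand-in for the List returned by getBroadcastVariable("hashMap");
        // in Flink you would receive e.g. List<Tuple2<String, Integer>> instead.
        List<Map.Entry<String, Integer>> broadcast = List.of(
                Map.entry("p1", 10),
                Map.entry("p2", 20));

        // Rebuild the lookup map once per parallel task, inside open()
        Map<String, Integer> lookup = new HashMap<>();
        for (Map.Entry<String, Integer> e : broadcast) {
            lookup.put(e.getKey(), e.getValue());
        }

        // map() can then do constant-time lookups, as in the question's sketch
        System.out.println(lookup.getOrDefault("p1", 0)); // 10
        System.out.println(lookup.getOrDefault("p9", 0)); // 0
    }
}
```

Flink's RuntimeContext also offers getBroadcastVariableWithInitializer(...), which lets this list-to-map transformation happen once and be shared across subtasks on the same task manager; check the Flink 1.4 Javadoc for the exact signature before relying on it.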
Related
I have a fairly demanding task and unfortunately I can't get any further. Maybe you have a tip for me:
Goal:
Create a lot of edges with Apache Gremlin with only one message
to the Gremlin server (a kind of bulk operation for creating edges).
The sourceId, the targetId and the type are saved in a list of Java POJOs.
Use gremlin for java
Do not use IDs from the underlying engine, use some constant property PROP_ID for storing the user-given id
My current approach was:
Create a list of maps, because Gremlin Java can only inject objects when they are maps or arrays:
Object[] edgesMap = edges.stream().map(edge -> {
    Map<String, String> m = new HashMap<>();
    m.put("sourceId", edge.sourceId);
    m.put("targetId", edge.targetId);
    m.put("type", edge.type);
    return m;
}).toArray();
Now I wanted to inject the object into a traversal and iterate over the map, creating an edge for every value:
GraphTraversal<Vertex, Vertex> traversal = g.withSideEffect("edgeList", edgesMap).V().limit(1).sideEffect(
select("edgeList").unfold().as("edge").sideEffect(
g.V().has(PROP_ID, select("edge").select("targetId")).addE(select("edge").select("type")).from(select("edge").select("sourceId"))
)
);
traversal.iterate();
But unfortunately I cannot use .has in the anonymous traversal, because .select(...).select(...) does not inject a constant value but returns a traversal. I was told in the TinkerPop community that the has-traversal will therefore be true for every node, and as a result an edge is created for every node.
I was told to use the where()-step to only get the node whose PROP_ID property matches the iterated value from the map. But where() expects a P<String> predicate, and with my current knowledge I'm unable to get the select(...) values into that predicate.
Maybe someone can help me so that I can either rewrite the traversal or someone has an idea how I can implement the requirements. Thanks! :)
I am creating an agent-based model in AnyLogic 8.7. I created a collection with the ArrayList class and Agent elements, using this code to separate some agents meeting a specific condition:
collection.addAll(findAll(population, p -> p.counter == variable));
for (AgentType p : collection) { traceln(p.probability); }
The above code will print the probability attribute of the separated agents to the console. Is there a way to define a loop that retrieves the printed probability attributes one by one and stores them in a variable so I can operate on them? Or, if there is a more efficient and optimized way of doing this, I would be glad if you shared it with me. Thank you all.
I am not sure why you are following this methodology... Agent-Based Modeling already "stores" the parameters you are looking for, you do not need the console as an intermediate. I believe what you are trying to do is the following:
double sum = 0;
for (AgentType p : agentTypes) {
    if (p.track == 1) {
        sum = sum + p.probability * p.impact;
    }
}
I recommend you read:
https://help.anylogic.com/topic/com.anylogic.help/html/code/for.html?resultof=%22%66%6f%72%22%20%22%6c%6f%6f%70%22%20
and
https://help.anylogic.com/topic/com.anylogic.help/html/agentbased/statistics.html?resultof=%22%73%74%61%74%69%73%74%69%63%73%22%20%22%73%74%61%74%69%73%74%22%20
The latter will give you a better idea on how to collect Agent statistics based on certain criteria.
Depending on the operations you want to perform, you can use the following:
https://help.anylogic.com/index.jsp?topic=%2Fcom.anylogic.help%2Fhtml%2Fjavadoc%2Fcom%2Fanylogic%2Fengine%2FUtilitiesCollection.html&resultof=%22%75%74%69%6c%69%74%69%65%73%22%20%22%75%74%69%6c%22%20
You can use something like this to collect the values of your probabilities one by one:
collection.addAll(findAll(population, p -> p.counter == variable));
LinkedList<Double> probabilities = new LinkedList<>();
for (AgentType p : collection) {
    probabilities.add(p.probability);
}
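Outside AnyLogic, the same filter-and-collect step is ordinary Java. A minimal sketch, assuming a hypothetical Agent class whose counter and probability fields mirror the question (findAll and traceln are AnyLogic utilities, so plain streams stand in for them here):

```java
import java.util.List;
import java.util.stream.Collectors;

public class CollectProbabilities {
    // Hypothetical stand-in for the AnyLogic agent type in the question
    record Agent(int counter, double probability) {}

    public static void main(String[] args) {
        List<Agent> population = List.of(
                new Agent(1, 0.2), new Agent(2, 0.5), new Agent(1, 0.3));
        int variable = 1;

        // Equivalent of findAll(population, p -> p.counter == variable)
        List<Double> probabilities = population.stream()
                .filter(p -> p.counter() == variable)
                .map(Agent::probability)
                .collect(Collectors.toList());

        System.out.println(probabilities); // [0.2, 0.3]

        // Operating on the values directly needs no console round-trip
        double sum = probabilities.stream().mapToDouble(Double::doubleValue).sum();
        System.out.println(sum); // 0.5
    }
}
```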
First time posting, and I am having some difficulty understanding Groovy collections (not sure if they are lists, arrays, or maps). I have typically coded in PHP and am used to PHP multidimensional arrays as a (key => value) association. I am not sure if I am overlooking that flexibility in Groovy. It seems like you have to pick either a map/array combo or a list.
What I am trying to accomplish: I have one associative array that is static, which I would like to have as a key -> value association (e.g. 1 - Tim, 2 - Greg, 3 - Bob, etc...).
I have another associative array that is totally dynamic. It needs to be nested within the associative array stated above, because it will contain task information that the given user has worked on (e.g. Tim might have worked on 3 unrelated tasks at different times, with varying statuses, so it should correlate to something like [Task 1, 3/6/19, Completed Task], [Task 2, 3/5/19, Completed Task], [Task 3, 2/5/19, In Progress Task]; someone named Greg might instead have 4 tasks).
So my question is what is the best data structure to use for this? How do I add data to this data structure effectively?
I'm sorry if these seem like bare-bones basic questions. Again, I'm new to Groovy.
Map model=[:]
List names=['Tim','Greg','Bob']
names?.each { name->
//dynamically call something that returns a list
// model."${name}"= getSomeList(name)
//get a list assign it the above list maybe something like this
// List someTasks = ['task1','task2']
// model."${name}"= someTasks
//or shorter
// model."${name}"= ['task1','task2']
// 1 element multi element list
if (name=='Bob') {
model."${name}"= ['task1']
} else {
model."${name}"= ['task1','task2']
}
}
//This iterates through map and its value being another iteration
model?.each{ key,value ->
println "working on $key"
value?.each { v->
println "$key has task ${v}"
}
}
Trying some of the above may help you understand it better, and yes, you can use <<:
Map model=[:]
model << ['bob':['task1']]
model << ['Greg':['task1','task2']]
You could either build the map literally as above or through an iteration, and nest further lists/maps within it, for example:
model << ['Greg':[
'task1' : ['do thing1','do thing2'],
'task2': [ 'do xyz', 'do abc']
]
]
//This iterates through map and its value being another map with an iteration
model?.each{ key,value ->
println "working on $key"
value?.each {k, v->
println "$key has task ${k}"
v?.each { vv ->
println "$key has task ${k} which needs to do ${vv}"
}
}
}
Using collect you could really simplify all the each iterations, which are a lot more verbose; with collect you can make it one line:
// note: (it) must be parenthesised so the key is the name's value, not the literal string "it"
names?.collect{ [(it): getSomeList(it)] }
//sometimes you need to flatten; in this case I don't think you would
names?.collect{ [(it): getSomeList(it)] }?.flatten()
List getSomeList(String name) {
return ['task1','task2']
}
The basic data structures that are key/value lookups are just Java Maps (usually the LinkedHashMap implementation in Groovy). Your first-level association seems to be something like a Map<Integer, Employee>. The nested one that you are calling "total dynamic" seems instead to really be a structured class, and you definitely should learn how Java/Groovy classes work. This seems to be something like what you're looking for:
class Employee {
int employeeId
String name
List<Task> tasks
}
enum TaskStatus {
PENDING,
IN_PROGRESS,
COMPLETED
}
class Task {
int taskNumber
LocalDate date // java.time.LocalDate
TaskStatus status
}
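To make the recommendation concrete, here is a small Java analogue of the classes above (the sample data and the query are invented for illustration), showing how the nested structure is built and queried without string-keyed maps:

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;

public class EmployeeTasks {
    enum TaskStatus { PENDING, IN_PROGRESS, COMPLETED }
    record Task(int taskNumber, LocalDate date, TaskStatus status) {}
    record Employee(int employeeId, String name, List<Task> tasks) {}

    public static void main(String[] args) {
        // The first-level "1 - Tim, 2 - Greg" association as Map<Integer, Employee>
        Map<Integer, Employee> byId = Map.of(
                1, new Employee(1, "Tim", List.of(
                        new Task(1, LocalDate.of(2019, 3, 6), TaskStatus.COMPLETED),
                        new Task(3, LocalDate.of(2019, 2, 5), TaskStatus.IN_PROGRESS))),
                2, new Employee(2, "Greg", List.of(
                        new Task(2, LocalDate.of(2019, 3, 5), TaskStatus.COMPLETED))));

        // Query: how many of Tim's tasks are completed?
        long done = byId.get(1).tasks().stream()
                .filter(t -> t.status() == TaskStatus.COMPLETED)
                .count();
        System.out.println(done); // 1
    }
}
```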
By the way, Groovy is a great language and my preferred JVM language, but it's better to make sure you understand the basics first. I recommend using @CompileStatic on all of your classes whenever possible and making sure you understand any cases where you can't use it. This will help to prevent errors and missteps as you learn.
I am going to type my code here and then explain my problem below.
for (int i = 0; i < sales.totalSales(); i++) {
    EntidadGeo gec = sales.getSale(i).getCustomer();
    EntidadGeo get = sales.getSale(i).getStore();
    int[] c = geo.georeferenciar(gec.getCP(), gec.getCalle(), gec.getProvincia());
    gec.setX(c[0]);
    gec.setY(c[1]);
    int[] c2 = geo.georeferenciar(get.getCP(), get.getCalle(), get.getProvincia());
    get.setX(c2[0]);
    get.setY(c2[1]);
    mapaventas.representar(gec, get);
}
I have that for loop. What I want to do in my project is to draw on a map: I need to draw customers and stores, and one store can sell to many customers at the same time. My project uses the MVC pattern; this part belongs to the Controller, and the Model part draws the map.
It works now, but the problem is that my project draws one customer and one store instead of 4 customers per store.
Thanks
Your problem is here:
mapaventas.representar(gec, get);
So it looks like you have a Map<Vendor, Client>, which will associate only one client per vendor. I have to guess at this because we have no knowledge of what the method above does. If I am correct, a better solution is perhaps a Map<Vendor, ArrayList<Client>>, so that a Vendor can be associated with multiple clients. Then you would do something like:
ArrayList<Client> getList = mapaventas.get(gec);
if (getList == null) {
    // first time we see this vendor: create the list and put it in the map
    getList = new ArrayList<>();
    mapaventas.put(gec, getList);
}
getList.add(get);
Note that my variable names and types will not be the same as yours, but hopefully you will understand the concept I'm trying to get across. If not, please ask.
It sounds like your database has a one-to-many relation between Store and Customer. A corresponding object model might be List<Map<Store, List<Customer>>>. Because a Customer may trade at more than one Store, you want "to check there if there is an IdStore already drawn, and then I don't want to draw it."
One approach would be to iterate through the List and add entries to a Set<Location>. Because implementations of Set reject duplicate elements, only one copy would be present, and no explicit check would be required. As a concrete example using JMapViewer, you would add a MapMarker to the mapViewer for each Location in the Set, as shown here.
I have a 2D array
public static class Status{
public static String[][] Data= {
{ "FriendlyName","Value","Units","Serial","Min","Max","Mode","TestID","notes" },
{ "PIDs supported [01 – 20]:",null,"Binary","0",null,null,"1","0",null },
{ "Online Monitors since DTCs cleared:",null,"Binary","1",null,null,"1","1",null },
{ "Freeze DTC:",null,"NONE IN MODE 1","2",null,null,"1","2",null },
// ... more rows ...
};
}
I want to
SELECT "FriendlyName","Value" FROM Data WHERE "Mode" = "1" and "TestID" = "2"
How do I do it? The fastest execution time is important because there could be hundreds of these per minute.
Think about how general it needs to be. The solution for something truly as general as SQL probably doesn't look much like the solution for a few very specific queries.
As you present it, I'd be inclined to avoid the 2D array of strings and instead create a collection - probably an ArrayList, but if you're doing frequent insertions & deletions maybe a LinkedList would be more appropriate - of some struct-like class. So
List<MyThing> list = new ArrayList<MyThing>();
and index the fields on which you want to search using a HashMap:
Map<Integer, MyThing> modeIndex = new HashMap<Integer, MyThing>();
for (MyThing thing : list)
modeIndex.put(thing.mode, thing);
Writing it down makes me realize that won't do in and of itself, because multiple things could have the same mode. So probably a multimap instead; or roll your own by making the value type of the map not MyThing but rather List<MyThing>. Google Collections has a fine multimap implementation.
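In plain JDK terms (no Google Collections needed), the roll-your-own multimap is just a Map whose values are Lists; a stream groupingBy collector builds the whole index in one expression. MyThing and its mode field are stand-ins matching the sketch above:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ModeIndex {
    // Hypothetical struct-like class, as suggested in the answer
    record MyThing(int mode, String name) {}

    public static void main(String[] args) {
        List<MyThing> list = List.of(
                new MyThing(1, "a"), new MyThing(2, "b"), new MyThing(1, "c"));

        // Index on mode: several things can share a mode, so map to a List
        Map<Integer, List<MyThing>> modeIndex = list.stream()
                .collect(Collectors.groupingBy(MyThing::mode));

        System.out.println(modeIndex.get(1).size()); // 2
        System.out.println(modeIndex.get(2).size()); // 1
    }
}
```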
This doesn't exactly answer your question, but it is possible to run some Java RDBMs with their tables entirely in your JVM's memory. For example, HSQLDB. This will give you the full power of SQL selects without the overheads of disc access. The only catch is that you won't be able to query a raw Java data structure like you are asking. You'll first have to insert the data into the DB's in-memory tables.
(I've not tried this ... perhaps someone could comment if this approach is really viable.)
As to your actual question, in C# they used to use LINQ (Language Integrated Query) for this, which takes benefit of the language's support for closures. Right now with Java 6 as the latest official release, Java doesn't support closures, but it's going to come in the shortly upcoming Java 7. The Java 7 based equivalent for LINQ is likely going to be JaQue.
As to your actual problem, you're definitely using the wrong data structure for the job. Your best bet will be to convert the String[][] into a List<Entity> and use the convenient searching/filtering APIs provided by Guava, as suggested by Carl Manaster. Iterables#filter() would be a good start.
EDIT: I took a look at your array, and I think this is definitely a job for RDBMS. If you want in-memory datastructure like features (fast/no need for DB server), embedded in-memory databases like HSQLDB, H2 can provide those.
If you want good execution time, you MUST have a good datastructure. If you just have data stored in a 2D array unordered, you'll be mostly stuck with O(n).
What you need are indexes, just like an RDBMS uses. For example, if you use a WHERE clause like WHERE name='Brian' AND last_name='Smith' a lot, you could do something like this (kind of pseudocode):
Set<Entry> everyEntry = ... ; //the set that contains all data
Map<String, Set<Entry>> indexedSet = new HashMap<>();
for (String name : unionSetOfNames) {
    Set<Entry> subset = Sets.filter(everyEntry, new HasName(name));
    indexedSet.put(name, subset);
}
//and later...
Set<Entry> brians = indexedSet.get("Brian");
Entry target = Iterables.find(brians, new HasLastName("Smith"));
(Please forgive me if the Guava API usage is wrong in the example code; it's pseudo-code, but you get the idea.)
In the above code, you do one O(1) lookup, and then another O(n) lookup, but on a much smaller subset. So this can be more effective than doing an O(n) lookup on the entire set. If you use an ordered Set, ordered by last_name, and use binary search, that lookup becomes O(log n). Things like that. There are a bunch of data structures out there, and this is only a very simple example.
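A runnable plain-Java version of that index-then-scan idea (no Guava; Entry with name and lastName fields is a hypothetical type matching the pseudocode):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexThenScan {
    // Hypothetical record matching the pseudocode's Entry
    record Entry(String name, String lastName) {}

    public static void main(String[] args) {
        List<Entry> everyEntry = List.of(
                new Entry("Brian", "Smith"),
                new Entry("Brian", "Jones"),
                new Entry("Alice", "Smith"));

        // Build the index once: O(n)
        Map<String, List<Entry>> indexedSet = new HashMap<>();
        for (Entry e : everyEntry) {
            indexedSet.computeIfAbsent(e.name(), k -> new ArrayList<>()).add(e);
        }

        // Later: O(1) lookup, then a linear scan over the much smaller subset
        Entry target = null;
        for (Entry e : indexedSet.get("Brian")) {
            if (e.lastName().equals("Smith")) { target = e; break; }
        }
        System.out.println(target); // Entry[name=Brian, lastName=Smith]
    }
}
```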
So in conclusion, if I were you, I'd define my own classes and create a data structure using the standard data structures available in the JDK. If that doesn't suffice, I might look at other data structures out there, but if it gets really complex, I think I'd just use some in-memory RDBMS like HSQLDB or H2. They are easy to embed, so they are quite close to having your own in-memory data structure. And as you do more and more complex stuff, chances are that option provides better performance.
Note also that I used the Google Guava library in my sample code. It's excellent, and I highly recommend using it because it's so much nicer. Of course, don't forget to look at the java.util collections, too.
I ended up using a lookup table. 90% of the data is referenced from near the top.
public static int lookupReferenceInTable(String instanceMode, String instanceTID) {
    int[] modeMatches = getReferencesToMode(Integer.parseInt(instanceMode));
    return getReferenceFromPossibleMatches(modeMatches, instanceTID);
}

private static int getReferenceFromPossibleMatches(int[] modeMatches, String instanceTID) {
    instanceTID = instanceTID.trim();
    for (int x : modeMatches) {
        if (Data[x][DataTestID].equals(instanceTID)) {
            return x;
        }
    }
    return 0; // no match found
}
It can be further optimized: instead of looping through all of the arrays, it can loop on one column until it finds a match, then the next, and so on. The data is laid out in a flowing and well-organized manner, so a lookup based on 3 criteria should only take a number of checks roughly equal to the number of rows.