How to do the same work multiple times using Dataflow? (Java)

I have some work that needs to be done repeatedly. For example, let's say I want to roll 2000 dice and collect the result. The caveat is that each dice throw depends on a PCollection. How can this be done with Dataflow?
I've tried using a PCollectionList but the result is that my Dataflow is too large to start (> 10 MB). Here is an example of what I'd like to do (using PCollectionList):
// I'd like to operate on things 2000 times.
PCollection<Thing> things = ...;
List<PCollection<ModifiedThing>> modifiedThingsList = new ArrayList<>();
for (int i = 0; i < 2000; ++i) {
    modifiedThingsList.add(things.apply(ParDo.of(thing -> modify(thing))));
}
PCollection<ModifiedThing> modifiedThings =
    PCollectionList.of(modifiedThingsList).apply(Flatten.pCollections());
Because the JSON representation of the above graph is too large for Dataflow, I need a different way of representing this logic. Any ideas?

ParDo or FlatMapElements can return an arbitrarily large number of outputs per input. For example:
PCollection<ModifiedThing> modifiedThings = things.apply(
    ParDo.of(new DoFn<Thing, ModifiedThing>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            for (int i = 0; i < 2000; ++i) {
                c.output(modify(c.element()));
            }
        }
    }));
Caveat: If you're going to immediately apply other ParDos to modifiedThings, be careful with fusion, since 2000 is a pretty high fan-out factor. A good example code snippet for preventing fusion is here.
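For reference, the standard way to break fusion after a high fan-out step in Beam is to redistribute the elements, for example with Reshuffle. A short sketch (Result and process() are placeholders for whatever downstream work follows, not part of the answer above):

// Redistribute elements after the 2000x fan-out so the next ParDo
// is not fused with it (Apache Beam's Reshuffle transform).
PCollection<ModifiedThing> rebalanced =
    modifiedThings.apply(Reshuffle.viaRandomKey());

// Downstream work now parallelizes across the full fan-out.
PCollection<Result> results = rebalanced.apply(
    ParDo.of(new DoFn<ModifiedThing, Result>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            c.output(process(c.element())); // placeholder downstream work
        }
    }));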

Can I trust that System.nanoTime() will return a different value on each invocation?

My program needs to generate unique labels that consist of a tag, date and time. Something like this:
"myTag__2019_09_05__07_51"
If one tries to generate two labels with the same tag in the same minute, one will receive equal labels, which I cannot allow. I'm thinking of adding the result of System.nanoTime() as an additional suffix to make sure that each label is unique (I cannot access all previously generated labels to look for duplicates):
"myTag__2019_09_05__07_51__{System.nanoTime() result}"
Can I trust that each invocation of System.nanoTime() will produce a different value? I tested it like this on my laptop:
assertNotEquals(System.nanoTime(), System.nanoTime())
and this works. I wonder if I have a guarantee that it will always work.
TLDR; If you only use a single thread on a popular VM on a modern operating system, it may work in practice. But many serious applications use multiple threads and multiple instances of the application, and there won't be any guarantee in that case.
The only guarantee given in the Javadoc for System.nanoTime() is that the resolution of the clock is at least as good as System.currentTimeMillis() - so if you are writing cross-platform code, there is clearly no expectation that the results of nanoTime are unique, as you can call nanoTime() many times per millisecond.
On my OS (Java 11, macOS) I always get at least one nanosecond of difference between successive calls on the same thread (and that held across Integer.MAX_VALUE inspections of successive return values); it's possible that there is something in the implementation that guarantees it.
However, it is simple to generate duplicate results if you use multiple threads and have more than one physical CPU. Here's code that demonstrates it:
public class UniqueNano {
    private static volatile long a = -1, b = -2;

    public static void main(String[] args) {
        long max = 1_000_000;
        new Thread(() -> {
            for (int i = 0; i < max; i++) { a = System.nanoTime(); }
        }).start();
        new Thread(() -> {
            for (int i = 0; i < max; i++) { b = System.nanoTime(); }
        }).start();
        for (int i = 0; i < max; i++) {
            if (a == b) {
                System.out.println("nanoTime not unique");
            }
        }
    }
}
Also, when you scale your application to multiple machines, you will potentially have the same problem.
Relying on System.nanoTime() to get unique values is not a good idea.
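If unique labels are the real goal, a per-process counter guarantees uniqueness within one JVM, and a random UUID component covers multiple processes and machines. A minimal sketch, assuming the asker's label format (the class and method names here are illustrative):

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public class Labels {
    private static final AtomicLong COUNTER = new AtomicLong();
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy_MM_dd__HH_mm");

    // Unique within this process: the counter never repeats.
    static String uniqueLabel(String tag) {
        return tag + "__" + LocalDateTime.now().format(FMT)
                + "__" + COUNTER.incrementAndGet();
    }

    // Unique across processes and machines (with overwhelming probability).
    static String globallyUniqueLabel(String tag) {
        return tag + "__" + LocalDateTime.now().format(FMT)
                + "__" + UUID.randomUUID();
    }
}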

How to improve Java multi-threading performance? Time efficiency of saving/loading data from NoSQL database (like Redis) vs ArrayList?

I am evaluating an SDK and I need to cross-compare ~15,000 iris images stored in a gallery folder and generate the similarity scores as a 15000 x 15000 matrix.
So I pre-processed all the images and stored the processed blobs in an ArrayList. I then use multiple threads, each running two for loops in its run method, to call the SDK's compare method, passing ArrayList indexes so each thread compares its share of blobs, and I save the integer return values to an Excel sheet using the Apache POI library. The performance is very poor: each comparison takes ~40 ms, and the whole task of 225,000,000 comparisons would take an estimated ~100 days with 8 cores running at 100%. Please help me understand this bottleneck.
Multithreading code
int processors = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(processors);
for (int i = 0; i < processors; i++) {
    // each thread compares 1875 images with 15000 images
    Runnable task = new Thread(bloblist, i * 1875, i * 1875 + 1874);
    executor.execute(task);
}
executor.shutdown();
Run Method
public void run() {
    for (int i = startIndex; i <= lastIndex; i++) {
        for (int j = 0; j < 15000; j++) {
            compare.compareIris(bloblist.get(i), bloblist.get(j));
            score = compare.getScore();
            // save result to Excel using Apache POI
            ...
            ...
        }
    }
}
Please suggest a time-efficient architecture to accomplish this task. Should I store the blobs in a NoSQL DB, or is there an alternative way to do this?
I'd consider adding some simple profiling to your code as a first step. Profiling libraries are great, but can be a little intimidating. All you really need to get started is:
public void run() {
    long compareSum = 0;
    long saveSum = 0;
    for (int i = startIndex; i <= lastIndex; i++) {
        for (int j = 0; j < 15000; j++) {
            final long compareStart = System.currentTimeMillis();
            compare.compareIris(bloblist.get(i), bloblist.get(j));
            score = compare.getScore();
            final long compareEnd = System.currentTimeMillis();
            compareSum += (compareEnd - compareStart);
            // save result to Excel using Apache POI
            ...
            ...
            final long saveEnd = System.currentTimeMillis();
            saveSum += (saveEnd - compareEnd);
        }
    }
    System.out.println(String.format("Compare: %d; Save: %d", compareSum, saveSum));
}
Maybe run this over a 100x100 grid instead to get a rough idea of where the bulk of your runtime is.
If it's the save step, I'd strongly recommend using a database as an intermediate step between computing the score and exporting it to a spreadsheet. A NoSQL database would work, although I'd also encourage you to look at something like SQLite, just for simplicity's sake. (Many NoSQL databases are designed to offer advantages across a cluster of database nodes while working with very large datasets; if you're storing write-only data on one node, SQL may be your best bet.)
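As a rough illustration (plain JDBC with the SQLite driver; the table name and schema are made up for the example, while bloblist, compare, startIndex and lastIndex follow the question's code), batched inserts keep the write overhead per score low:

import java.sql.*;

// A sketch: buffer scores in SQLite instead of writing Excel rows
// inside the hot loop; export to a spreadsheet once at the end.
void saveScoresBatched() throws SQLException {
    try (Connection conn = DriverManager.getConnection("jdbc:sqlite:scores.db")) {
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS scores (i INT, j INT, score INT)");
        }
        conn.setAutoCommit(false); // commit per batch, not per row
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO scores VALUES (?, ?, ?)")) {
            for (int i = startIndex; i <= lastIndex; i++) {
                for (int j = 0; j < 15000; j++) {
                    compare.compareIris(bloblist.get(i), bloblist.get(j));
                    ps.setInt(1, i);
                    ps.setInt(2, j);
                    ps.setInt(3, compare.getScore());
                    ps.addBatch();
                }
                ps.executeBatch(); // one batch round-trip per matrix row
            }
            conn.commit();
        }
    }
}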
If the bottleneck is the compute step, it will be more difficult to improve performance. If the blobs don't all fit comfortably in RAM along with whatever RAM the comparisons consume, you may be paying the price of swapping this data onto and off of disk. You may see a small improvement by having each thread take work "off of a queue", rather than starting with a pre-assigned block:
final int processors = Runtime.getRuntime().availableProcessors();
final ExecutorService executor = Executors.newFixedThreadPool(processors);
final AtomicLong nextCompare = new AtomicLong(0);
for (int i = 0; i < processors; i++) {
    // same custom Thread subclass as above, now sharing the counter
    Runnable task = new Thread(bloblist, nextCompare);
    executor.execute(task);
}
executor.shutdown();

public void run() {
    while (true) {
        final long taskNum = nextCompare.getAndIncrement();
        if (taskNum >= 15000L * 15000L) {
            return;
        }
        // row-major mapping of the task number to a (i, j) pair
        final int i = (int) (taskNum / 15000);
        final int j = (int) (taskNum % 15000);
        compare.compareIris(bloblist.get(i), bloblist.get(j));
        score = compare.getScore();
        // Save score, etc.
    }
}
This will result in all threads working on blobs stored relatively close together in memory, so no thread is evicting data from the cache that another thread will require in the near future. You are, however, paying the price of contention on the AtomicLong; if memory thrashing wasn't your issue, this will likely be a bit slower.
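If that contention turns out to matter, a common variant is to claim task numbers in chunks, so each thread touches the shared counter once per several thousand comparisons instead of once per comparison. A sketch (CHUNK is an illustrative tuning knob; the surrounding fields are the same as above):

// Claim CHUNK task numbers at a time to reduce contention on the counter.
static final int CHUNK = 4096;

public void run() {
    while (true) {
        final long start = nextCompare.getAndAdd(CHUNK);
        if (start >= 15000L * 15000L) {
            return;
        }
        final long end = Math.min(start + CHUNK, 15000L * 15000L);
        for (long taskNum = start; taskNum < end; taskNum++) {
            final int i = (int) (taskNum / 15000);
            final int j = (int) (taskNum % 15000);
            compare.compareIris(bloblist.get(i), bloblist.get(j));
            score = compare.getScore();
            // Save score, etc.
        }
    }
}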

Ektorp CouchDb: Query for pattern with multiple contains

I want to query multiple candidates for a search string which could look like "My sear foo".
Now I want to look for documents which have a field that contains one (or more) of the entered strings (split on whitespace).
I found some code which allows me to do a search by pattern:
@View(name = "find_by_serial_pattern",
      map = "function(doc) { var i; if (doc.serialNumber) { for (i = 0; i < doc.serialNumber.length; i += 1) { emit(doc.serialNumber.slice(i), doc); } } }")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(trim).endKey(trim + "\u9999");
    return db.queryView(viewQuery, DeviceEntityCouch.class);
}
which works quite nicely for looking up just one pattern. But how do I have to modify my code to get a multiple-contains on doc.serialNumber?
EDIT:
This is the current workaround, but there must be a better way, I guess.
Also, it only implements OR logic: an entry matching term1 or term2 ends up in the list.
@View(name = "find_by_serial_pattern",
      map = "function(doc) { var i; if (doc.serialNumber) { for (i = 0; i < doc.serialNumber.length; i += 1) { emit(doc.serialNumber.slice(i), doc); } } }")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    String[] split = trim.split(" ");
    List<DeviceEntityCouch> list = new ArrayList<>();
    for (String s : split) {
        ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
        list.addAll(db.queryView(viewQuery, DeviceEntityCouch.class));
    }
    return list;
}
Looks like you are implementing a full-text search here. That's not going to be very efficient in CouchDB (and I guess the same applies to other databases).
Correct me if I am wrong, but from looking at your code it seems you are trying to search a list of serial numbers for a pattern. CouchDB (or any other database) is quite efficient if you can somehow index the data you will be searching for.
Otherwise you must fetch every single record and perform a string comparison on it.
The only way I can think of to optimize this in CouchDB would be something like the following (with assumptions):
Your serial numbers are not very long (say 20 chars?)
You force the search to always be 5 characters
Generate a view that emits every single 5-char-long substring of your serial number; more or less this (could be optimized, and I'm not sure I got the indexing exactly right):
...
for (var i = 0; doc.serialNo.length > 5 && i <= doc.serialNo.length - 5; i++) {
    emit([doc.serialNo.substring(i, i + 5), doc._id], null);
}
...
Use the _count reduce function.
Now the following url:
http://localhost:5984/test/_design/serial/_view/complex-key?startkey=["01234"]&endkey=["01234",{}]&group=true
Will return a list of documents with a hit count for a key of 01234.
If you don't group and set the reduce option to be false, you will get a list of all matches, including duplicates if a single doc has multiple hits.
Refer to http://ryankirkman.com/2011/03/30/advanced-filtering-with-couchdb-views.html for information about complex key lookups.
I am not sure how efficient CouchDB is in terms of updating that view. It depends on how many records you have and how many new entries appear between queries of the view (I understand CouchDB rebuilds the view's B-tree on demand).
I have generated a view like that, splitting doc ids into 5-char-long keys. Out of over 1K docs it generated over 30K results; with the ids being 32 chars long, it's simple maths really: (serialNo.length - key.length + 1) * docCount, i.e. (32 - 5 + 1) * 1000 = 28K.
Generating the view took a while, but the lookups were fast.
You could generate keys of multiple lengths, etc. It all comes down to your record count vs. the speed of lookups.

Genetic Algorithm for Process Allocation

I have the following uni assignment that's been puzzling me. I have to implement a genetic algorithm that allocates processes to processors. More specifically, the problem is the following:
"You have a program that is computed in a parallel processor system. The program is made up of N processes that need to be allocated onto n processors (where n is much smaller than N). Communication between processes during the computation can be quite time consuming, so the best practice would be to assign processes that need to intercommunicate to the same processor.
In order to reduce the communication time between processes you could allocate all of them to the same processor, but this would negate the idea of parallel processing, where every processor should contribute to the overall work.
Consider the following: let Cij be the total amount of communication between process i and process j. Assume that every process needs the same amount of computing power, so that the load constraints can be handled by assigning the same number of processes to each processor. Use a genetic algorithm to assign N processes to n processors."
The above is, roughly translated, the description of the problem. Now I have the following questions that puzzle me.
1) What would be a viable way to evaluate candidate solutions so the genetic algorithm can run? I know the theory behind GAs and have deduced that you need a way to score each generation of the produced population.
2) How can I properly design the whole problem so it can be handled by a program?
I am planning to implement this in Java, but recommendations for other programming languages are welcome.
The Dude abides. Or El Duderino if you're not into the whole brevity thing.
What you're asking is really a two-part question, but the Genetic Algorithm part can be easily illustrated in concept. I find that giving a basic start can be helpful, but this problem as a "whole" is too complicated to address here.
Genetic Algorithms (GA) can be used as an optimizer, as you note. In order to apply a GA to a process execution plan, you need to be able to score an execution plan, then clone and mutate the best plans. A GA works by running several plans, cloning the best, and then mutating some of them slightly to see if the offspring (cloned) plans are improved or worsened.
In this example, I created a list of 100 random Integers. Each Integer is a "process" to be run, and the value of the Integer is the "cost" of running that individual process.
List<Integer> processes = new ArrayList<Integer>();
The processes are then added to an ExecutionPlan, which is a List<List<Integer>>. This List of Lists of Integers will be used to represent 4 processors doing 25 rounds of processing:
class ExecutionPlan implements Comparable<ExecutionPlan> {
    List<List<Integer>> plan;
    int cost;
The total cost of an execution plan will be computed by taking the highest process cost per round (the greatest Integer) and summing the costs of all the rounds. Thus, the goal of the optimizer is to distribute the initial 100 integers (processes) into 25 rounds of "processing" on 4 "processors" such that total cost is as low as possible.
// make 10 execution plans of 25 execution rounds running on 4 processors;
List<ExecutionPlan> executionPlans = createAndInitializePlans(processes);
// Loop on generationCount
for (int generation = 0; generation < GENERATIONCOUNT; ++generation) {
    computeCostOfPlans(executionPlans);
    // sort plans by cost
    Collections.sort(executionPlans);
    // print execution plan costs
    System.out.println(generation + " = " + executionPlans);
    // clone 5 better plans over 5 worse plans,
    // i.e., kill off the least fit and reproduce the best fit.
    cloneBetterPlansOverWorsePlans(executionPlans);
    // mutate 5 cloned plans
    mutateClones(executionPlans);
}
When the program is run, the cost is initially randomly determined, but with each generation it improves. If you run it for 1000 generations and plot the results, a typical run shows the total cost dropping steeply over the early generations and then flattening out as the optimizer converges.
The purpose of the GA is to optimize, or attempt to determine, the best possible solution. The reason it can be applied to your problem is that your ExecutionPlan can be scored, cloned and mutated. The path to success, therefore, is to separate the problems in your mind. First, figure out how you can make an execution plan that can be scored as to what the cost will be to actually run it on an assumed set of hardware. Add routines to clone and mutate an ExecutionPlan. Once you have that, plug it into this GA example. Good luck and stay cool, dude.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Optimize {
    private static int GENERATIONCOUNT = 1000;
    private static int PROCESSCOUNT = 100;
    private static int MUTATIONCOUNT = PROCESSCOUNT / 10;

    public static void main(String... strings) {
        new Optimize().run();
    }

    // define an execution plan as 25 runs on 4 processors
    class ExecutionPlan implements Comparable<ExecutionPlan> {
        List<List<Integer>> plan;
        int cost;

        public ExecutionPlan(List<List<Integer>> plan) {
            this.plan = plan;
        }

        @Override
        public int compareTo(ExecutionPlan o) {
            return cost - o.cost;
        }

        @Override
        public String toString() {
            return Integer.toString(cost);
        }
    }

    private void run() {
        // make 100 processes to be completed
        List<Integer> processes = new ArrayList<Integer>();
        // assign them a random cost between 1 and 100;
        for (int index = 0; index < PROCESSCOUNT; ++index) {
            processes.add((int) (Math.random() * 99.0) + 1);
        }
        // make 10 execution plans of 25 execution rounds running on 4 processors;
        List<ExecutionPlan> executionPlans = createAndInitializePlans(processes);
        // Loop on generationCount
        for (int generation = 0; generation < GENERATIONCOUNT; ++generation) {
            computeCostOfPlans(executionPlans);
            // sort plans by cost
            Collections.sort(executionPlans);
            // print execution plan costs
            System.out.println(generation + " = " + executionPlans);
            // clone 5 better plans over 5 worse plans
            cloneBetterPlansOverWorsePlans(executionPlans);
            // mutate 5 cloned plans
            mutateClones(executionPlans);
        }
    }

    private void mutateClones(List<ExecutionPlan> executionPlans) {
        for (int index = 0; index < executionPlans.size() / 2; ++index) {
            ExecutionPlan execution = executionPlans.get(index);
            // mutate 10 different location swaps, maybe the plan improves
            for (int mutationCount = 0; mutationCount < MUTATIONCOUNT; ++mutationCount) {
                int location1 = (int) (Math.random() * 100.0);
                int location2 = (int) (Math.random() * 100.0);
                // swap two locations
                Integer processCostTemp = execution.plan.get(location1 / 4).get(location1 % 4);
                execution.plan.get(location1 / 4).set(location1 % 4,
                        execution.plan.get(location2 / 4).get(location2 % 4));
                execution.plan.get(location2 / 4).set(location2 % 4, processCostTemp);
            }
        }
    }

    private void cloneBetterPlansOverWorsePlans(List<ExecutionPlan> executionPlans) {
        for (int index = 0; index < executionPlans.size() / 2; ++index) {
            ExecutionPlan execution = executionPlans.get(index);
            List<List<Integer>> clonePlan = new ArrayList<List<Integer>>();
            for (int roundNumber = 0; roundNumber < 25; ++roundNumber) {
                clonePlan.add(new ArrayList<Integer>(execution.plan.get(roundNumber)));
            }
            executionPlans.set(index + executionPlans.size() / 2, new ExecutionPlan(clonePlan));
        }
    }

    private void computeCostOfPlans(List<ExecutionPlan> executionPlans) {
        for (ExecutionPlan execution : executionPlans) {
            execution.cost = 0;
            for (int roundNumber = 0; roundNumber < 25; ++roundNumber) {
                // cost of a round is the greatest "communication time".
                List<Integer> round = execution.plan.get(roundNumber);
                int roundCost = Math.max(Math.max(round.get(0), round.get(1)),
                                         Math.max(round.get(2), round.get(3)));
                // add all the round costs to determine the total plan cost
                execution.cost += roundCost;
            }
        }
    }

    private List<ExecutionPlan> createAndInitializePlans(List<Integer> processes) {
        List<ExecutionPlan> executionPlans = new ArrayList<ExecutionPlan>();
        for (int planNumber = 0; planNumber < 10; ++planNumber) {
            // randomize the processes for this plan
            Collections.shuffle(processes);
            // and make the plan
            List<List<Integer>> currentPlan = new ArrayList<List<Integer>>();
            for (int roundNumber = 0; roundNumber < 25; ++roundNumber) {
                List<Integer> round = new ArrayList<Integer>();
                round.add(processes.get(4 * roundNumber));
                round.add(processes.get(4 * roundNumber + 1));
                round.add(processes.get(4 * roundNumber + 2));
                round.add(processes.get(4 * roundNumber + 3));
                currentPlan.add(round);
            }
            executionPlans.add(new ExecutionPlan(currentPlan));
        }
        return executionPlans;
    }
}
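To connect this back to the original assignment: the scoring step is where the communication matrix Cij comes in. A candidate solution can be encoded as an array mapping each process to a processor, and its cost is the total communication that crosses processor boundaries. A minimal, illustrative sketch (the names are not from the example above):

// Sketch of a fitness function for the original assignment: given the
// communication matrix c[i][j] and an allocation of process -> processor,
// score a candidate by the inter-processor communication it incurs.
// Lower is better; clone/mutate then operate on the assignment array.
static long interProcessorCost(int[][] c, int[] assignment) {
    long cost = 0;
    for (int i = 0; i < assignment.length; i++) {
        for (int j = i + 1; j < assignment.length; j++) {
            if (assignment[i] != assignment[j]) {
                cost += c[i][j]; // pairs on different processors pay Cij
            }
        }
    }
    return cost;
}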

Why is it so slow with 100,000 records when using pipeline in Redis?

It is said that pipelining is the better approach when many set/get operations are required in Redis, so this is my test code:
import redis.clients.jedis.*;
import java.util.*;

public class TestPipeline {
    /**
     * @param args
     */
    public static void main(String[] args) {
        JedisShardInfo si = new JedisShardInfo("127.0.0.1", 6379);
        List<JedisShardInfo> list = new ArrayList<JedisShardInfo>();
        list.add(si);
        ShardedJedis jedis = new ShardedJedis(list);
        long startTime = System.currentTimeMillis();
        ShardedJedisPipeline pipeline = jedis.pipelined();
        for (int i = 0; i < 100000; i++) {
            Map<String, String> map = new HashMap<String, String>();
            map.put("id", "" + i);
            map.put("name", "lyj" + i);
            pipeline.hmset("m" + i, map);
        }
        pipeline.sync();
        long endTime = System.currentTimeMillis();
        System.out.println(endTime - startTime);
    }
}
When I ran it, the program did not respond for a long while, but when I don't use the pipeline it takes only 20073 ms. I am confused why it does better without the pipeline, and by such a wide gap!
Thanks for answering me. A few questions: how do you calculate the 6 MB of data? When I send 10K items the pipeline is always faster than normal mode, but with 100K the pipeline gets no response. I think 100-1000 operations per pipeline is an advisable choice, as said below. Is the JIT involved somehow? I don't understand it.
There are a few points you need to consider before writing such a benchmark (and especially a benchmark using the JVM):
on most (physical) machines, Redis is able to process more than 100K ops/s when pipelining is used. Your benchmark only deals with 100K items, so it does not last long enough to produce meaningful results. Furthermore, there is no time for the successive stages of the JIT to kick in.
the absolute time is not a very relevant metric. Displaying the throughput (i.e. the number of operations per second) while keeping the benchmark running for at least 10 seconds would be a better and more stable metric.
your inner loop generates a lot of garbage. If you plan to benchmark Jedis+Redis, then you need to keep the overhead of your own program low.
because you have defined everything inside the main function, your loop will not be compiled by the JIT (depending on the JVM you use). Only the inner method calls may be. If you want the JIT to be efficient, make sure to encapsulate your code into methods that can be compiled by the JIT.
optionally, you may want to add a warm-up phase before performing the actual measurement, to avoid accounting for the overhead of running the first iterations with the bare-bones interpreter, and for the cost of the JIT itself.
Now, regarding Redis pipelining, your pipeline is way too long. 100K commands in the pipeline means Jedis has to build a 6 MB buffer before sending anything to Redis (each serialized HMSET here comes to roughly 60 bytes of protocol, key and field values included, and 100K of them add up to about 6 MB). It means the socket buffers (on the client side, and perhaps the server side) will be saturated, and that Redis will have to deal with 6 MB communication buffers as well.
Furthermore, your benchmark is still synchronous (using a pipeline does not magically make it asynchronous). In other words, Jedis will not start reading replies until the last query of your pipeline has been sent to Redis. When the pipeline is too long, it has the potential to block things.
Consider limiting the size of the pipeline to 100-1000 operations. Of course, it will generate more roundtrips, but the pressure on the communication stack will be reduced to an acceptable level. For instance, consider the following program:
import redis.clients.jedis.*;
import java.util.*;

public class TestPipeline {
    /**
     * @param args
     */
    int i = 0;
    Map<String, String> map = new HashMap<String, String>();
    ShardedJedis jedis;

    // Number of iterations
    // Use 1000 to test with the pipeline, 100 otherwise
    static final int N = 1000;

    public TestPipeline() {
        JedisShardInfo si = new JedisShardInfo("127.0.0.1", 6379);
        List<JedisShardInfo> list = new ArrayList<JedisShardInfo>();
        list.add(si);
        jedis = new ShardedJedis(list);
    }

    public void push(int n) {
        ShardedJedisPipeline pipeline = jedis.pipelined();
        for (int k = 0; k < n; k++) {
            map.put("id", "" + i);
            map.put("name", "lyj" + i);
            pipeline.hmset("m" + i, map);
            ++i;
        }
        pipeline.sync();
    }

    public void push2(int n) {
        for (int k = 0; k < n; k++) {
            map.put("id", "" + i);
            map.put("name", "lyj" + i);
            jedis.hmset("m" + i, map);
            ++i;
        }
    }

    public static void main(String[] args) {
        TestPipeline obj = new TestPipeline();
        long startTime = System.currentTimeMillis();
        for (int j = 0; j < N; j++) {
            // Use push2 instead to test without pipeline
            obj.push(1000);
            // Uncomment to see the acceleration
            //System.out.println(obj.i);
        }
        long endTime = System.currentTimeMillis();
        double d = 1000.0 * obj.i;
        d /= (double) (endTime - startTime);
        System.out.println("Throughput: " + d);
    }
}
With this program, you can test with or without pipelining. Be sure to increase the number of iterations (the N parameter) when pipelining is used, so that it runs for at least 10 seconds. If you uncomment the println in the loop, you will see that the program is slow at the beginning and gets quicker as the JIT starts to optimize things (that's why the program should run for at least several seconds to give a meaningful result).
On my hardware (an old Athlon box), I can get 8-9 times more throughput when the pipeline is used. The program could be further improved by optimizing the key/value formatting in the inner loop and adding a warm-up phase.
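To illustrate that last point, a warm-up phase is simply the same workload run once with its timing thrown away, so the JIT has already compiled push() before measurement begins. A sketch of a revised main for the TestPipeline class above (the warm-up iteration count is arbitrary):

public static void main(String[] args) {
    TestPipeline obj = new TestPipeline();

    // Warm-up: run the same workload untimed so the JIT compiles push()
    // before measurement starts; the timing of these iterations is discarded.
    for (int j = 0; j < 50; j++) {
        obj.push(1000);
    }
    obj.i = 0; // restart the key counter for the measured run

    long startTime = System.currentTimeMillis();
    for (int j = 0; j < N; j++) {
        obj.push(1000);
    }
    long endTime = System.currentTimeMillis();
    double throughput = 1000.0 * obj.i / (endTime - startTime);
    System.out.println("Throughput: " + throughput);
}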
