Creating Performance Counters in Java

Does anyone know how I can create a new performance counter (for the perfmon tool) in Java?
For example: a new performance counter for monitoring the number or duration of user actions.
I have created such performance counters in C# and it was quite easy, but I couldn't find anything helpful for creating one in Java.

If you want to develop your performance counter independently from the main code, you should look at aspect-oriented programming (AspectJ, Javassist).
You can attach your performance counter to the method(s) you want without modifying the main code.
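A minimal sketch of the same idea using only the JDK: a java.lang.reflect.Proxy wraps an interface and records call counts and elapsed time, so the wrapped class stays untouched. This is a stand-in for AspectJ-style weaving, not AspectJ itself. UserService and TimingProxy are hypothetical names made up for the illustration, and the approach only covers calls made through an interface.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper: counts calls and accumulates elapsed time for a proxied interface.
final class TimingProxy {
    final LongAdder calls = new LongAdder();
    final LongAdder totalNanos = new LongAdder();

    @SuppressWarnings("unchecked")
    <T> T wrap(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the target's own exception
            } finally {
                // recorded for successful and failing calls alike
                totalNanos.add(System.nanoTime() - start);
                calls.increment();
            }
        };
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }

    public static void main(String[] args) {
        TimingProxy stats = new TimingProxy();
        UserService real = user -> "welcome " + user; // the unmodified "main code"
        UserService timed = stats.wrap(real, UserService.class);
        timed.login("alice");
        timed.login("bob");
        System.out.println(stats.calls.sum() + " calls, "
                + stats.totalNanos.sum() + " ns total");
    }
}

// Hypothetical business interface standing in for the code being measured.
interface UserService {
    String login(String user);
}
```

The counters are LongAdder instances so that several threads can record concurrently without a lock.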

Java does not work with perfmon out of the box (but you should see DTrace under Solaris).
Please see this question for suggestions: Java app performance counters viewed in Perfmon

Not sure what you are expecting this tool to do, but I would create some data structures to record these times and counts, like:
class UserActionStats {
    int count;          // number of completed actions
    long durationMS;    // total time spent in actions, in milliseconds
    long start = 0;     // start time of the action in progress
    public void startAction() {
        start = System.currentTimeMillis();
    }
    public void endAction() {
        durationMS += System.currentTimeMillis() - start;
        count++;
    }
}
A collection for these could look like:
private static final Map<String, UserActionStats> map =
        new HashMap<String, UserActionStats>();
public static UserActionStats forUser(String userName) {
    synchronized (map) {
        UserActionStats uas = map.get(userName);
        if (uas == null)
            map.put(userName, uas = new UserActionStats());
        return uas;
    }
}
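The same bookkeeping can be sketched in a thread-safe form. This is a variation on the classes above, not the answerer's code: it assumes Java 8 or later, swaps the synchronized HashMap for a ConcurrentHashMap, and uses LongAdder counters. ActionStatsRegistry, ActionStats and the time helper are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical registry: one stats object per user name, created on first use.
final class ActionStatsRegistry {
    private static final ConcurrentMap<String, ActionStats> STATS =
            new ConcurrentHashMap<>();

    static ActionStats forUser(String userName) {
        // computeIfAbsent is atomic, so no explicit synchronized block is needed
        return STATS.computeIfAbsent(userName, name -> new ActionStats());
    }

    // Runs the action and records its duration even if it throws.
    static void time(String userName, Runnable action) {
        long start = System.nanoTime();
        try {
            action.run();
        } finally {
            forUser(userName).record(System.nanoTime() - start);
        }
    }

    public static void main(String[] args) {
        time("alice", () -> { });
        time("alice", () -> { });
        System.out.println("alice: " + forUser("alice").count() + " actions");
    }
}

// Count and total duration of one user's actions; safe to update from many threads.
final class ActionStats {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();

    void record(long elapsedNanos) {
        count.increment();
        totalNanos.add(elapsedNanos);
    }

    long count() { return count.sum(); }

    long totalMillis() { return totalNanos.sum() / 1_000_000; }
}
```

Passing the elapsed time into record avoids the shared start field, which would be overwritten if two actions for the same user overlapped.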

Related

Tensorflow Java Multi-GPU inference

I have a server with multiple GPUs and want to make full use of them during model inference inside a java app.
By default tensorflow seizes all available GPUs, but uses only the first one.
I can think of three options to overcome this issue:
Restrict device visibility on process level, namely using CUDA_VISIBLE_DEVICES environment variable.
That would require me to run several instances of the Java app and distribute traffic among them. Not a tempting idea.
Launch several sessions inside a single application and try to assign one device to each of them via ConfigProto:
public class DistributedPredictor {
    private Predictor[] nested;
    private int[] counters;
    // ...
    public DistributedPredictor(String modelPath, int numDevices, int numThreadsPerDevice) throws IOException {
        nested = new Predictor[numDevices];
        counters = new int[numDevices];
        for (int i = 0; i < nested.length; i++) {
            nested[i] = new Predictor(modelPath, i, numDevices, numThreadsPerDevice);
        }
    }
    public Prediction predict(Data data) {
        int i = acquirePredictorIndex();
        try {
            return nested[i].predict(data);
        } finally {
            // release the slot even if predict() throws
            releasePredictorIndex(i);
        }
    }
    private synchronized int acquirePredictorIndex() {
        int i = argmin(counters);
        counters[i] += 1;
        return i;
    }
    private synchronized void releasePredictorIndex(int i) {
        counters[i] -= 1;
    }
}
public class Predictor {
    private Session session;
    public Predictor(String modelPath, int deviceIdx, int numDevices, int numThreadsPerDevice) throws IOException {
        GPUOptions gpuOptions = GPUOptions.newBuilder()
            .setVisibleDeviceList("" + deviceIdx)
            .setAllowGrowth(true)
            .build();
        ConfigProto config = ConfigProto.newBuilder()
            .setGpuOptions(gpuOptions)
            .setInterOpParallelismThreads(numDevices * numThreadsPerDevice)
            .build();
        byte[] graphDef = Files.readAllBytes(Paths.get(modelPath));
        Graph graph = new Graph();
        graph.importGraphDef(graphDef);
        this.session = new Session(graph, config.toByteArray());
    }
    public Prediction predict(Data data) {
        // ...
    }
}
At first glance this approach seems to work. However, sessions occasionally ignore the setVisibleDeviceList option and all go for the first device, causing an out-of-memory crash.
Build the model in a multi-tower fashion in Python using tf.device() specifications. On the Java side, give different Predictors different towers inside a shared session.
This feels cumbersome and idiomatically wrong to me.
UPDATE: As @ash proposed, there's yet another option:
Assign an appropriate device to each operation of the existing graph by modifying its definition (graphDef).
To get it done, one could adapt the code from Method 2:
public class Predictor {
    private Session session;
    public Predictor(String modelPath, int deviceIdx, int numDevices, int numThreadsPerDevice) throws IOException {
        byte[] graphDef = Files.readAllBytes(Paths.get(modelPath));
        graphDef = setGraphDefDevice(graphDef, deviceIdx);
        Graph graph = new Graph();
        graph.importGraphDef(graphDef);
        ConfigProto config = ConfigProto.newBuilder()
            .setAllowSoftPlacement(true)
            .build();
        this.session = new Session(graph, config.toByteArray());
    }
    // Rewrites every node of the serialized graph so that it is pinned to the given GPU.
    private static byte[] setGraphDefDevice(byte[] graphDef, int deviceIdx) throws InvalidProtocolBufferException {
        String deviceString = String.format("/gpu:%d", deviceIdx);
        GraphDef.Builder builder = GraphDef.parseFrom(graphDef).toBuilder();
        for (int i = 0; i < builder.getNodeCount(); i++) {
            builder.getNodeBuilder(i).setDevice(deviceString);
        }
        return builder.build().toByteArray();
    }
    public Prediction predict(Data data) {
        // ...
    }
}
Just like the other approaches mentioned, this one doesn't free me from manually distributing data among devices. But at least it works stably and is comparatively easy to implement. Overall, this looks like an (almost) normal technique.
Is there an elegant way to do such a basic thing with the TensorFlow Java API? Any ideas would be appreciated.

In short: There is a workaround, where you end up with one session per GPU.
Details:
The general flow is that the TensorFlow runtime respects the devices specified for operations in the graph. If no device is specified for an operation, then it "places" it based on some heuristics. Those heuristics currently result in "place operation on GPU:0 if GPUs are available and there is a GPU kernel for the operation" (Placer::Run in case you're interested).
What you ask for is, I think, a reasonable feature request for TensorFlow: the ability to treat devices in the serialized graph as "virtual" ones to be mapped to a set of "physical" devices at run time, or alternatively the ability to set a "default device". This feature does not currently exist. Adding such an option to ConfigProto is something you may want to file a feature request for.
I can suggest a workaround in the interim. First, some commentary on your proposed solutions.
Your first idea (restricting visibility with CUDA_VISIBLE_DEVICES) will surely work, but as you pointed out, it is cumbersome.
Your second idea, setting visible_device_list in the ConfigProto, doesn't quite work out, since that is actually a per-process setting and is ignored after the first session is created in the process. This is certainly not documented as well as it should be (and it is somewhat unfortunate that it appears in the per-Session configuration). However, this explains why your suggestion here doesn't work and why you still see a single GPU being used.
Your third idea (a multi-tower graph built with tf.device()) could work.
Another option is to end up with different graphs (with operations explicitly placed on different GPUs), resulting in one session per GPU. Something like this can be used to edit the graph and explicitly assign a device to each operation:
public static byte[] modifyGraphDef(byte[] graphDef, String device) throws Exception {
    GraphDef.Builder builder = GraphDef.parseFrom(graphDef).toBuilder();
    for (int i = 0; i < builder.getNodeCount(); ++i) {
        builder.getNodeBuilder(i).setDevice(device);
    }
    return builder.build().toByteArray();
}
After which you could create a Graph and Session per GPU using something like:
final int NUM_GPUS = 8;
// setAllowSoftPlacement: Just in case our device modifications were too aggressive
// (e.g., setting a GPU device on an operation that only has CPU kernels)
// setLogDevicePlacement: So we can see what happens.
byte[] config =
    ConfigProto.newBuilder()
        .setLogDevicePlacement(true)
        .setAllowSoftPlacement(true)
        .build()
        .toByteArray();
Graph[] graphs = new Graph[NUM_GPUS];
Session[] sessions = new Session[NUM_GPUS];
for (int i = 0; i < NUM_GPUS; ++i) {
    graphs[i] = new Graph();
    graphs[i].importGraphDef(modifyGraphDef(graphDef, String.format("/gpu:%d", i)));
    sessions[i] = new Session(graphs[i], config);
}
Then use sessions[i] to execute the graph on GPU #i.
Hope that helps.
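With one session per GPU, requests still have to be spread across the sessions by hand. A minimal sketch of a least-loaded dispatcher follows. It is plain Java with no TensorFlow dependency, LeastLoadedDispatcher is a hypothetical name, and the workers could be the sessions[i] objects from the answer (strings stand in for them here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: hands each task to the worker with the fewest tasks in flight.
final class LeastLoadedDispatcher<W> {
    private final List<W> workers;
    private final int[] inFlight;

    LeastLoadedDispatcher(List<W> workers) {
        if (workers.isEmpty()) {
            throw new IllegalArgumentException("need at least one worker");
        }
        this.workers = new ArrayList<>(workers);
        this.inFlight = new int[workers.size()];
    }

    // Runs the task on the least-loaded worker; the slot is released even on failure.
    <R> R call(Function<W, R> task) {
        int i = acquire();
        try {
            return task.apply(workers.get(i));
        } finally {
            release(i);
        }
    }

    private synchronized int acquire() {
        int best = 0;
        for (int i = 1; i < inFlight.length; i++) {
            if (inFlight[i] < inFlight[best]) {
                best = i;
            }
        }
        inFlight[best]++;
        return best;
    }

    private synchronized void release(int i) {
        inFlight[i]--;
    }

    synchronized int inFlight(int i) {
        return inFlight[i];
    }

    public static void main(String[] args) {
        LeastLoadedDispatcher<String> dispatcher =
                new LeastLoadedDispatcher<>(Arrays.asList("/gpu:0", "/gpu:1"));
        System.out.println(dispatcher.call(device -> "ran on " + device));
    }
}
```

Only the bookkeeping is synchronized; the task itself runs outside the lock, so calls on different workers can proceed in parallel.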

In Python it can be done as follows:
def get_frozen_graph(graph_file):
    """Read Frozen Graph file from disk."""
    with tf.gfile.GFile(graph_file, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    return graph_def
trt_graph1 = get_frozen_graph('/home/ved/ved_1/frozen_inference_graph.pb')
with tf.device('/gpu:1'):
    [tf_input_l1, tf_scores_l1, tf_boxes_l1, tf_classes_l1, tf_num_detections_l1, tf_masks_l1] = tf.import_graph_def(trt_graph1,
        return_elements=['image_tensor:0', 'detection_scores:0',
        'detection_boxes:0', 'detection_classes:0', 'num_detections:0', 'detection_masks:0'])
tf_sess1 = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
trt_graph2 = get_frozen_graph('/home/ved/ved_2/frozen_inference_graph.pb')
with tf.device('/gpu:0'):
    [tf_input_l2, tf_scores_l2, tf_boxes_l2, tf_classes_l2, tf_num_detections_l2] = tf.import_graph_def(trt_graph2,
        return_elements=['image_tensor:0', 'detection_scores:0',
        'detection_boxes:0', 'detection_classes:0', 'num_detections:0'])
tf_sess2 = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))

Akka stream - limiting Flow rate without introducing delay

I'm working with Akka (version 2.4.17) to build an observation Flow in Java (let's say of elements of type <T> to stay generic).
My requirement is that this Flow should be customizable to deliver a maximum number of observations per unit of time as soon as they arrive. For instance, it should be able to deliver at most 2 observations per minute (the first that arrive, the rest can be dropped).
I looked very closely at the Akka documentation, in particular this page, which details the built-in stages and their semantics.
So far, I tried the following approaches.
With throttle and shaping() mode (to not close the stream when the limit is exceeded):
Flow.of(T.class)
.throttle(2,
new FiniteDuration(1, TimeUnit.MINUTES),
0,
ThrottleMode.shaping())
With groupedWithin and an intermediary custom method:
final int nbObsMax = 2;
Flow.of(T.class)
.groupedWithin(Integer.MAX_VALUE, new FiniteDuration(1, TimeUnit.MINUTES))
.map(list -> {
List<T> listToTransfer = new ArrayList<>();
for (int i = list.size()-nbObsMax ; i>0 && i<list.size() ; i++) {
listToTransfer.add(new T(list.get(i)));
}
return listToTransfer;
})
.mapConcat(elem -> elem) // Splitting List<T> in a Flow of T objects
Both approaches give me the correct number of observations per unit of time, but the observations are retained and only delivered at the end of the time window (so there is an additional delay).
To give a more concrete example, if the following observations arrive in my Flow:
[Obs1 t=0s] [Obs2 t=45s] [Obs3 t=47s] [Obs4 t=121s] [Obs5 t=122s]
It should only output the following ones as soon as they arrive (processing time can be neglected here):
Window 1: [Obs1 t~0s] [Obs2 t~45s]
Window 2: [Obs4 t~121s] [Obs5 t~122s]
Any help will be appreciated, thanks for reading my first StackOverflow post ;)

I cannot think of an out-of-the-box solution that does what you want. Throttle will emit in a steady stream because it is implemented with the bucket model, rather than granting a fresh allowance at the start of every time period.
To get the exact behavior you are after you would have to create your own custom rate-limit stage (which might not be that hard). You can find the docs on how to create custom stages here: http://doc.akka.io/docs/akka/2.5.0/java/stream/stream-customize.html#custom-linear-processing-stages-using-graphstage
One design that could work is an allowance counter that says how many elements may still be emitted and that you reset every interval. For every incoming element you subtract one from the counter and emit; when the allowance is used up, you keep pulling upstream but discard the elements rather than emitting them. Using TimerGraphStageLogic instead of GraphStageLogic allows you to set a timed callback that resets the allowance.
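The allowance logic can be sketched without any Akka types, which makes it easy to check against the example from the question. This is only the counting part under stated assumptions: WindowedAllowance is a hypothetical name, timestamps are passed in explicitly instead of using a timer, and a window is assumed to open at the first element that arrives after the previous window has expired.

```java
// Hypothetical helper: allows at most maxPerWindow elements per window.
// A window opens at the first element seen after the previous window expired.
final class WindowedAllowance {
    private final int maxPerWindow;
    private final long windowMillis;
    private long windowStart;
    private int used;
    private boolean started;

    WindowedAllowance(int maxPerWindow, long windowMillis) {
        if (maxPerWindow < 1 || windowMillis < 1) {
            throw new IllegalArgumentException("limits must be positive");
        }
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if an element arriving at nowMillis may be emitted, false if it is dropped.
    boolean tryAcquire(long nowMillis) {
        if (!started || nowMillis - windowStart >= windowMillis) {
            // the previous window is over: reset the allowance
            started = true;
            windowStart = nowMillis;
            used = 0;
        }
        if (used < maxPerWindow) {
            used++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        WindowedAllowance gate = new WindowedAllowance(2, 60_000);
        long[] arrivalsMillis = {0, 45_000, 47_000, 121_000, 122_000};
        for (long t : arrivalsMillis) {
            // emits at 0s, 45s, 121s, 122s and drops the element at 47s
            System.out.println("t=" + t / 1000 + "s "
                    + (gate.tryAcquire(t) ? "emit" : "drop"));
        }
    }
}
```

Inside a stage, the same decision would be made in the push handler, with a timer callback (or a timestamp comparison as here) performing the reset.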

I think this is exactly what you need: http://doc.akka.io/docs/akka/2.5.0/java/stream/stream-cookbook.html#Globally_limiting_the_rate_of_a_set_of_streams

Thanks to the answer of @johanandren, I've successfully implemented a custom time-based GraphStage that meets my requirements.
I'm posting the code below in case anyone is interested:
import akka.stream.Attributes;
import akka.stream.FlowShape;
import akka.stream.Inlet;
import akka.stream.Outlet;
import akka.stream.stage.*;
import scala.concurrent.duration.FiniteDuration;
public class CustomThrottleGraphStage<A> extends GraphStage<FlowShape<A, A>> {
    private final FiniteDuration silencePeriod;
    private int nbElemsMax;
    public CustomThrottleGraphStage(int nbElemsMax, FiniteDuration silencePeriod) {
        this.silencePeriod = silencePeriod;
        this.nbElemsMax = nbElemsMax;
    }
    public final Inlet<A> in = Inlet.create("TimedGate.in");
    public final Outlet<A> out = Outlet.create("TimedGate.out");
    private final FlowShape<A, A> shape = FlowShape.of(in, out);
    @Override
    public FlowShape<A, A> shape() {
        return shape;
    }
    @Override
    public GraphStageLogic createLogic(Attributes inheritedAttributes) {
        return new TimerGraphStageLogic(shape) {
            private boolean open = false;
            private int countElements = 0;
            {
                setHandler(in, new AbstractInHandler() {
                    @Override
                    public void onPush() throws Exception {
                        A elem = grab(in);
                        if (open || countElements >= nbElemsMax) {
                            pull(in); // we drop all incoming observations since the rate limit has been reached
                        }
                        else {
                            if (countElements == 0) { // we schedule the next instant to reset the observation counter
                                scheduleOnce("resetCounter", silencePeriod);
                            }
                            push(out, elem); // we forward the incoming observation
                            countElements += 1; // we increment the counter
                        }
                    }
                });
                setHandler(out, new AbstractOutHandler() {
                    @Override
                    public void onPull() throws Exception {
                        pull(in);
                    }
                });
            }
            @Override
            public void onTimer(Object key) {
                if (key.equals("resetCounter")) {
                    open = false;
                    countElements = 0;
                }
            }
        };
    }
}

Collectors.summingLong or mapToLong to summarize long values

I've got a list of objects with a value and want to summarise all these values. What is the preferred way to do this in Java 8?
public static void main(String[] args) {
List<AnObject> longs = new ArrayList<AnObject>();
longs.add(new AnObject());
longs.add(new AnObject());
longs.add(new AnObject());
long mappedSum = longs.stream().mapToLong(AnObject::getVal).sum();
long collectedSum = longs.stream().collect(Collectors.summingLong(AnObject::getVal));
System.out.println(mappedSum);
System.out.println(collectedSum);
}
private static class AnObject {
private long val = 10;
public long getVal() {
return val;
}
}
I think mapToLong is more straightforward, but I can't really justify why.
Edit: I've updated the question by changing from summarizingLong to summingLong; that's why some answers and comments might seem a bit off.

I think using Collectors.summarizingLong(AnObject::getVal) would do more work than you need, as it computes other statistics besides the sum (average, count, min, max, ...).
If you just need the sum, use the simpler and more efficient method:
long mappedSum = longs.stream().mapToLong(AnObject::getVal).sum();
After you changed Collectors.summarizingLong to Collectors.summingLong, it's hard to say which option would be more efficient. The first option has an extra step (mapToLong), but I'm not sure how much difference that would make, since the second option does more work in collect compared to what the first option does in sum.
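For reference, a small sketch that puts the three variants side by side. It assumes a plain List of Long values instead of the AnObject class from the question, and SumDemo is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.stream.Collectors;

class SumDemo {
    // Primitive stream: maps to a LongStream and sums without boxing the total.
    static long viaMapToLong(List<Long> values) {
        return values.stream().mapToLong(Long::longValue).sum();
    }

    // Collector: same result, handy when the sum is a downstream step (e.g. in groupingBy).
    static long viaCollector(List<Long> values) {
        return values.stream().collect(Collectors.summingLong(Long::longValue));
    }

    // Computes count, sum, min, max and average in one pass.
    static LongSummaryStatistics statistics(List<Long> values) {
        return values.stream().collect(Collectors.summarizingLong(Long::longValue));
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(10L, 10L, 10L);
        System.out.println(viaMapToLong(values));  // prints 30
        System.out.println(viaCollector(values));  // prints 30
        System.out.println(statistics(values).getSum()); // prints 30
    }
}
```

A common reason to reach for the collector form is composition: Collectors.summingLong can be passed as the downstream collector of Collectors.groupingBy, which mapToLong(...).sum() cannot.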

Is there anything in Java close to the parallel collections in Scala?

What is the simplest way to implement a parallel computation (e.g. on a multi-core processor) using Java?
That is, the Java equivalent of this Scala code:
val list = aLargeList
list.par.map(_*2)
There is this library, but it seems overwhelming.

http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/
Don't give up so fast, snappy! ))
From the javadocs (with changes to map to your f), the essential matter is really just this:
ParallelLongArray a = ... // you provide
a.replaceWithMapping(new LongOp() { public long op(long a) { return a * 2L; } });
which is pretty much this, right?
val list = aLargeList
list.par.map(_*2)
And if you are willing to live with a bit less terseness, the above can be a reasonably clean and clear three-liner (and of course, if you reuse functions, then it's the same exact thing as Scala: inline functions):
ParallelLongArray a = ... // you provide
LongOp f = new LongOp() { public long op(long a) { return a * 2L; } };
a.replaceWithMapping(f);
[edited above to show the concise complete form, à la the OP's Scala variant]
and here it is in maximal verbose form where we start from scratch for demo:
import java.util.Random;
import jsr166y.ForkJoinPool;
import extra166y.Ops.LongGenerator;
import extra166y.Ops.LongOp;
import extra166y.ParallelLongArray;
public class ListParUnaryFunc {
    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]);
        // create a parallel long array
        // with random long values
        ParallelLongArray a = ParallelLongArray.create(n-1, new ForkJoinPool());
        a.replaceWithGeneratedValue(generator);
        // use it: apply unaryLongFuncOp in parallel
        // to all values in array
        a.replaceWithMapping(unaryLongFuncOp);
        // examine it
        for (Long v : a.asList()) {
            System.out.format("%d\n", v);
        }
    }
    static final Random rand = new Random(System.nanoTime());
    static LongGenerator generator = new LongGenerator() {
        @Override final public long op() { return rand.nextLong(); }
    };
    static LongOp unaryLongFuncOp = new LongOp() {
        @Override final public long op(long a) { return a * 2L; }
    };
}
Final edit and notes:
Also note that a simple class such as the following (which you can reuse across your projects):
/**
 * The very basic form w/ TODOs on checks, concurrency issues, init, etc.
 */
final public static class ParArray {
    private ParallelLongArray parr;
    private final long[] arr;
    public ParArray(long[] arr) {
        this.arr = arr;
    }
    public final ParArray par() {
        if (parr == null)
            parr = ParallelLongArray.createFromCopy(arr, new ForkJoinPool());
        return this;
    }
    public final ParallelLongArray map(LongOp op) {
        return parr.replaceWithMapping(op);
    }
    public final long[] values() { return parr.getArray(); }
}
and something like that will allow you to write more fluid Java code (if terseness matters to you):
long[] arr = ... // you provide
LongOp f = ... // you provide
ParArray list = new ParArray(arr);
list.par().map(f);
And the above approach can certainly be pushed to make it even cleaner.

Doing that on one machine is pretty easy, but not as easy as Scala makes it. That library you posted is already a part of Java 5 and beyond. Probably the simplest thing to use is an ExecutorService. It represents a pool of threads that can run on any processor. You send it tasks, and those tasks return results.
http://download.oracle.com/javase/1,5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html
http://www.fromdev.com/2009/06/how-can-i-leverage-javautilconcurrent.html
I'd suggest using ExecutorService.invokeAll(), which returns a list of Futures. You can then check them to see whether they're done.
If you're using Java 7, you could use the fork/join framework, which might save you some work. With any of these you can build something very similar to Scala's parallel arrays, so that using it is fairly concise.
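A minimal sketch of the invokeAll() approach, assuming the work is "double every element" as in the Scala snippet. InvokeAllDemo, doubleAll and the chunk size are hypothetical choices made for the illustration, and lambdas are used for brevity (anonymous Callable classes work the same way on older Java versions).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class InvokeAllDemo {
    // Splits the input into chunks, doubles each chunk on a pool thread,
    // and stitches the results back together in the original order.
    static List<Long> doubleAll(List<Long> input, int chunkSize) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Callable<List<Long>>> tasks = new ArrayList<>();
            for (int from = 0; from < input.size(); from += chunkSize) {
                final List<Long> chunk =
                        input.subList(from, Math.min(from + chunkSize, input.size()));
                tasks.add(() -> {
                    List<Long> out = new ArrayList<>(chunk.size());
                    for (long v : chunk) {
                        out.add(v * 2);
                    }
                    return out;
                });
            }
            List<Long> result = new ArrayList<>(input.size());
            // invokeAll blocks until every task is done and keeps the task order
            for (Future<List<Long>> future : pool.invokeAll(tasks)) {
                result.addAll(future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(doubleAll(Arrays.asList(1L, 2L, 3L, 4L, 5L), 2));
        // prints [2, 4, 6, 8, 10]
    }
}
```

Creating a pool per call keeps the sketch self-contained; a real application would more likely share one pool across calls.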

Using threads, Java doesn't have this sort of thing built-in.

There will be an equivalent in Java 8: http://www.infoq.com/articles/java-8-vs-scala
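With Java 8 streams, the Scala snippet from the question can be written roughly as follows. This is a minimal sketch; ParallelMapDemo and doubled are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

class ParallelMapDemo {
    // Counterpart of list.par.map(_*2): the work is split across the common fork/join pool.
    static List<Long> doubled(List<Long> list) {
        return list.parallelStream()
                .map(x -> x * 2)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> aLargeList = LongStream.rangeClosed(1, 1_000_000)
                .boxed()
                .collect(Collectors.toList());
        List<Long> result = doubled(aLargeList);
        System.out.println(result.size() + " elements, first = " + result.get(0));
    }
}
```

Because the source list is ordered, the collected result keeps the original element order even though the mapping runs in parallel.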

Hashmap vs Array performance

Is it better (performance-wise) to use arrays or HashMaps when the indexes of the array are known? Keep in mind that the 'objects array/map' in the example is just an example; in my real project it is generated by another class, so I can't use individual variables.
ArrayExample:
SomeObject[] objects = new SomeObject[2];
objects[0] = new SomeObject("Obj1");
objects[1] = new SomeObject("Obj2");
void doSomethingToObject(String Identifier){
    SomeObject object;
    if(Identifier.equals("Obj1")){
        object=objects[0];
    }else if(Identifier.equals("Obj2")){
        object=objects[1];
    }
    //do stuff
}
HashMapExample:
HashMap objects = new HashMap();
objects.put("Obj1",new SomeObject());
objects.put("Obj2",new SomeObject());
void doSomethingToObject(String Identifier){
    SomeObject object = (SomeObject) objects.get(Identifier);
    //do stuff
}
The HashMap one looks much much better but I really need performance on this so that has priority.
EDIT: Well, arrays it is then; suggestions are still welcome
EDIT: I forgot to mention, the size of the array/HashMap is always the same (6)
EDIT: It appears that HashMaps are faster
Array: 128ms
Hash: 103ms
With fewer cycles, the HashMap was even twice as fast
test code:
import java.util.HashMap;
import java.util.Random;
public class Optimizationsest {
private static Random r = new Random();
private static HashMap<String,SomeObject> hm = new HashMap<String,SomeObject>();
private static SomeObject[] o = new SomeObject[6];
private static String[] Indentifiers = {"Obj1","Obj2","Obj3","Obj4","Obj5","Obj6"};
private static int t = 1000000;
public static void main(String[] args){
CreateHash();
CreateArray();
long loopTime = ProcessArray();
long hashTime = ProcessHash();
System.out.println("Array: " + loopTime + "ms");
System.out.println("Hash: " + hashTime + "ms");
}
public static void CreateHash(){
for(int i=0; i <= 5; i++){
hm.put("Obj"+(i+1), new SomeObject());
}
}
public static void CreateArray(){
for(int i=0; i <= 5; i++){
o[i]=new SomeObject();
}
}
public static long ProcessArray(){
StopWatch sw = new StopWatch();
sw.start();
for(int i = 1;i<=t;i++){
checkArray(Indentifiers[r.nextInt(6)]);
}
sw.stop();
return sw.getElapsedTime();
}
private static void checkArray(String Identifier) {
SomeObject object;
if(Identifier.equals("Obj1")){
object=o[0];
}else if(Identifier.equals("Obj2")){
object=o[1];
}else if(Identifier.equals("Obj3")){
object=o[2];
}else if(Identifier.equals("Obj4")){
object=o[3];
}else if(Identifier.equals("Obj5")){
object=o[4];
}else if(Identifier.equals("Obj6")){
object=o[5];
}else{
object = new SomeObject();
}
object.kill();
}
public static long ProcessHash(){
StopWatch sw = new StopWatch();
sw.start();
for(int i = 1;i<=t;i++){
checkHash(Indentifiers[r.nextInt(6)]);
}
sw.stop();
return sw.getElapsedTime();
}
private static void checkHash(String Identifier) {
SomeObject object = (SomeObject) hm.get(Identifier);
object.kill();
}
}

HashMap uses an array underneath so it can never be faster than using an array correctly.
Random.nextInt() is many times slower than what you are testing, and even using an array lookup to pick the test key is going to bias your results.
The reason your array benchmark is so slow is the equals comparisons, not the array access itself.
Hashtable is usually much slower than HashMap because it does much the same thing but is also synchronized.
A common problem with micro-benchmarks is the JIT, which is very good at removing code that doesn't do anything. If you are not careful, you will only be testing whether you have confused the JIT enough that it cannot work out that your code doesn't do anything.
This is one of the reasons you can write micro-benchmarks that outperform C++ systems: Java is a simpler language that is easier to reason about, which makes it easier to detect code that does nothing useful. This can lead to tests which show that Java does "nothing useful" much faster than C++ ;)
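A rough sketch of a comparison that tries to avoid the pitfalls listed above: no Random in the timed loop, a warm-up pass, and a result that is actually used so the JIT cannot discard the work. LookupBench and its methods are hypothetical names, the timings it prints are not reliable measurements, and a harness such as JMH is the more robust option.

```java
import java.util.HashMap;
import java.util.Map;

class LookupBench {
    static final String[] KEYS = {"Obj1", "Obj2", "Obj3", "Obj4", "Obj5", "Obj6"};

    // Looks values up by String key; the running sum keeps the loop from being dead code.
    static long sumViaMap(Map<String, Integer> map, int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += map.get(KEYS[i % KEYS.length]);
        }
        return sink;
    }

    // Looks values up by known index; same access pattern, no hashing or equals.
    static long sumViaArray(int[] values, int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += values[i % values.length];
        }
        return sink;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        int[] values = new int[KEYS.length];
        for (int i = 0; i < KEYS.length; i++) {
            map.put(KEYS[i], i + 1);
            values[i] = i + 1;
        }
        int iterations = 10_000_000;
        // warm-up so both loops are compiled before timing starts
        long warm = sumViaMap(map, iterations) + sumViaArray(values, iterations);

        long t0 = System.nanoTime();
        long mapSum = sumViaMap(map, iterations);
        long t1 = System.nanoTime();
        long arraySum = sumViaArray(values, iterations);
        long t2 = System.nanoTime();

        System.out.println("map:   " + (t1 - t0) / 1_000_000 + " ms (sum " + mapSum + ")");
        System.out.println("array: " + (t2 - t1) / 1_000_000 + " ms (sum " + arraySum + ")");
        System.out.println("warm-up checksum: " + warm);
    }
}
```

Both loops compute the same checksum, which doubles as a sanity check that the two variants do equivalent work.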

Arrays are faster when the indexes are known (HashMap uses an array of linked lists behind the scenes, which adds a bit of overhead on top of the array accesses, not to mention the hashing operations that need to be done).
And FYI, HashMap<String,SomeObject> objects = new HashMap<String,SomeObject>(); makes it so you won't have to cast.

For the example shown, HashTable wins, I believe. The problem with the array approach is that it doesn't scale. I imagine you want to have more than two entries in the table, and the conditional branch tree in doSomethingToObject will quickly get unwieldy and slow.

Logically, HashMap is definitely a fit in your case. From a performance standpoint it also wins, since with arrays you need to do a number of string comparisons (in your algorithm), while with a HashMap you just use a hash code, provided the load factor is not too high. Both an array and a HashMap need to be resized if you add many elements, but a HashMap also needs to redistribute its elements. In that use case HashMap loses.

Arrays will usually be faster than Collections classes.
PS. You mentioned HashTable in your post. HashTable has even worse performance than HashMap. I assume your mention of HashTable was a typo:
"The HashTable one looks much much better"

The example is strange. The key question is whether your data is dynamic. If it is, you could not write your program that way (as in the array case). In other words, the comparison between your array and hash implementations is not fair. The hash implementation works for dynamic data, but the array implementation does not.
If you only have static data (6 fixed objects), the array or hash just works as a data holder. You could even define static objects.
