I have a server with multiple GPUs and want to make full use of them during model inference inside a java app.
By default tensorflow seizes all available GPUs, but uses only the first one.
I can think of three options to overcome this issue:
Restrict device visibility on process level, namely using CUDA_VISIBLE_DEVICES environment variable.
That would require me to run several instances of the java app and distribute traffic among them. Not that tempting idea.
Launch several sessions inside a single application and try to assign one device to each of them via ConfigProto:
public class DistributedPredictor {
private Predictor[] nested;
private int[] counters;
// ...
public DistributedPredictor(String modelPath, int numDevices, int numThreadsPerDevice) {
nested = new Predictor[numDevices];
counters = new int[numDevices];
for (int i = 0; i < nested.length; i++) {
nested[i] = new Predictor(modelPath, i, numDevices, numThreadsPerDevice);
}
}
public Prediction predict(Data data) {
int i = acquirePredictorIndex();
Prediction result = nested[i].predict(data);
releasePredictorIndex(i);
return result;
}
private synchronized int acquirePredictorIndex() {
int i = argmin(counters);
counters[i] += 1;
return i;
}
private synchronized void releasePredictorIndex(int i) {
counters[i] -= 1;
}
}
public class Predictor {
private Session session;
public Predictor(String modelPath, int deviceIdx, int numDevices, int numThreadsPerDevice) {
GPUOptions gpuOptions = GPUOptions.newBuilder()
.setVisibleDeviceList("" + deviceIdx)
.setAllowGrowth(true)
.build();
ConfigProto config = ConfigProto.newBuilder()
.setGpuOptions(gpuOptions)
.setInterOpParallelismThreads(numDevices * numThreadsPerDevice)
.build();
byte[] graphDef = Files.readAllBytes(Paths.get(modelPath));
Graph graph = new Graph();
graph.importGraphDef(graphDef);
this.session = new Session(graph, config.toByteArray());
}
public Prediction predict(Data data) {
// ...
}
}
This approach seems to work fine at a glance. However, sessions occasionally ignore setVisibleDeviceList option and all go for the first device causing Out-Of-Memory crash.
Build the model in a multi-tower fashion in python using tf.device() specification. On java side, give different Predictors different towers inside a shared session.
Feels cumbersome and idiomatically wrong to me.
UPDATE: As #ash proposed, there's yet another option:
Assign an appropriate device to each operation of the existing graph by modifying its definition (graphDef).
To get it done, one could adapt the code from Method 2:
public class Predictor {
private Session session;
public Predictor(String modelPath, int deviceIdx, int numDevices, int numThreadsPerDevice) {
byte[] graphDef = Files.readAllBytes(Paths.get(modelPath));
graphDef = setGraphDefDevice(graphDef, deviceIdx)
Graph graph = new Graph();
graph.importGraphDef(graphDef);
ConfigProto config = ConfigProto.newBuilder()
.setAllowSoftPlacement(true)
.build();
this.session = new Session(graph, config.toByteArray());
}
private static byte[] setGraphDefDevice(byte[] graphDef, int deviceIdx) throws InvalidProtocolBufferException {
String deviceString = String.format("/gpu:%d", deviceIdx);
GraphDef.Builder builder = GraphDef.parseFrom(graphDef).toBuilder();
for (int i = 0; i < builder.getNodeCount(); i++) {
builder.getNodeBuilder(i).setDevice(deviceString);
}
return builder.build().toByteArray();
}
public Prediction predict(Data data) {
// ...
}
}
Just like other mentioned approaches, this one doesn't set me free from manually distributing data among devices. But at least it works stably and is comparably easy to implement. Overall, this looks like an (almost) normal technique.
Is there an elegant way to do such a basic thing with tensorflow java API? Any ideas would be appreciated.
In short: There is a workaround, where you end up with one session per GPU.
Details:
The general flow is that the TensorFlow runtime respects the devices specified for operations in the graph. If no device is specified for an operation, then it "places" it based on some heuristics. Those heuristics currently result in "place operation on GPU:0 if GPUs are available and there is a GPU kernel for the operation" (Placer::Run in case you're interested).
What you ask for I think is a reasonable feature request for TensorFlow - the ability to treat devices in the serialized graph as "virtual" ones to be mapped to a set of "phyiscal" devices at run time, or alternatively setting the "default device". This feature does not currently exist. Adding such an option to ConfigProto is something you may want to file a feature request for.
I can suggest a workaround in the interim. First, some commentary on your proposed solutions.
Your first idea will surely work, but as you pointed out, is cumbersome.
Setting using visible_device_list in the ConfigProto doesn't quite work out since that is actually a per-process setting and is ignored after the first session is created in the process. This is certainly not documented as well as it should be (and somewhat unfortunate that this appears in the per-Session configuration). However, this explains why your suggestion here doesn't work and why you still see a single GPU being used.
This could work.
Another option is to end up with different graphs (with operations explicitly placed on different GPUs), resulting in one session per GPU. Something like this can be used to edit the graph and explicitly assign a device to each operation:
public static byte[] modifyGraphDef(byte[] graphDef, String device) throws Exception {
GraphDef.Builder builder = GraphDef.parseFrom(graphDef).toBuilder();
for (int i = 0; i < builder.getNodeCount(); ++i) {
builder.getNodeBuilder(i).setDevice(device);
}
return builder.build().toByteArray();
}
After which you could create a Graph and Session per GPU using something like:
final int NUM_GPUS = 8;
// setAllowSoftPlacement: Just in case our device modifications were too aggressive
// (e.g., setting a GPU device on an operation that only has CPU kernels)
// setLogDevicePlacment: So we can see what happens.
byte[] config =
ConfigProto.newBuilder()
.setLogDevicePlacement(true)
.setAllowSoftPlacement(true)
.build()
.toByteArray();
Graph graphs[] = new Graph[NUM_GPUS];
Session sessions[] = new Session[NUM_GPUS];
for (int i = 0; i < NUM_GPUS; ++i) {
graphs[i] = new Graph();
graphs[i].importGraphDef(modifyGraphDef(graphDef, String.format("/gpu:%d", i)));
sessions[i] = new Session(graphs[i], config);
}
Then use sessions[i] to execute the graph on GPU #i.
Hope that helps.
In python it can be done as follows:
def get_frozen_graph(graph_file):
"""Read Frozen Graph file from disk."""
with tf.gfile.GFile(graph_file, "rb") as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
return graph_def
trt_graph1 = get_frozen_graph('/home/ved/ved_1/frozen_inference_graph.pb')
with tf.device('/gpu:1'):
[tf_input_l1, tf_scores_l1, tf_boxes_l1, tf_classes_l1, tf_num_detections_l1, tf_masks_l1] = tf.import_graph_def(trt_graph1,
return_elements=['image_tensor:0', 'detection_scores:0',
'detection_boxes:0', 'detection_classes:0','num_detections:0', 'detection_masks:0'])
tf_sess1 = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
trt_graph2 = get_frozen_graph('/home/ved/ved_2/frozen_inference_graph.pb')
with tf.device('/gpu:0'):
[tf_input_l2, tf_scores_l2, tf_boxes_l2, tf_classes_l2, tf_num_detections_l2] = tf.import_graph_def(trt_graph2,
return_elements=['image_tensor:0', 'detection_scores:0',
'detection_boxes:0', 'detection_classes:0','num_detections:0'])
tf_sess2 = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
Related
First of all: I'm using moa-release-2019.05.0-bin/moa-release-2019.05.0/lib/moa.jar in my java project.
Now, let's go to the point: I am trying to use moa.clusterers.clustream.WithKmeans stream clustering algorithm and I have no idea why this is happening ...
I am new into using moa and I am having a hard time trying to decode how the clustering algorithms have to be used. The documentation lacks of sample code for common usages, and the implementation is not that well explained ... have not found any tutorial either.
My code:
import com.yahoo.labs.samoa.instances.DenseInstance;
import moa.cluster.Clustering;
import moa.clusterers.clustream.WithKmeans;
public class TestingClustream {
static DenseInstance randomInstance(int size) {
DenseInstance instance = new DenseInstance(size);
for (int idx = 0; idx < size; idx++) {
instance.setValue(idx, Math.random());
}
return instance;
}
public static void main(String[] args) {
WithKmeans wkm = new WithKmeans();
wkm.kOption.setValue(5);
wkm.maxNumKernelsOption.setValue(300);
wkm.resetLearningImpl();
for (int i = 0; i < 10000; i++) {
wkm.trainOnInstanceImpl(randomInstance(2));
}
Clustering clusteringResult = wkm.getClusteringResult();
Clustering microClusteringResult = wkm.getMicroClusteringResult();
}
}
Info from the debugger:
I have read the source code many times, and it seems to me that I am using the correct functions, in the correct order ... I do not know what I am missing ... any feedback is welcomed!
Make sure you have fed the algorithm enough data, it will process the data in batches.
The fields are unused, likely coming from some parent class with a different purpose.
Use the getter methods such as getCenter() that will compute the current center from the running sum.
I'm working with Akka (version 2.4.17) to build an observation Flow in Java (let's say of elements of type <T> to stay generic).
My requirement is that this Flow should be customizable to deliver a maximum number of observations per unit of time as soon as they arrive. For instance, it should be able to deliver at most 2 observations per minute (the first that arrive, the rest can be dropped).
I looked very closely to the Akka documentation, and in particular this page which details the built-in stages and their semantics.
So far, I tried the following approaches.
With throttle and shaping() mode (to not close the stream when the limit is exceeded):
Flow.of(T.class)
.throttle(2,
new FiniteDuration(1, TimeUnit.MINUTES),
0,
ThrottleMode.shaping())
With groupedWith and an intermediary custom method:
final int nbObsMax = 2;
Flow.of(T.class)
.groupedWithin(Integer.MAX_VALUE, new FiniteDuration(1, TimeUnit.MINUTES))
.map(list -> {
List<T> listToTransfer = new ArrayList<>();
for (int i = list.size()-nbObsMax ; i>0 && i<list.size() ; i++) {
listToTransfer.add(new T(list.get(i)));
}
return listToTransfer;
})
.mapConcat(elem -> elem) // Splitting List<T> in a Flow of T objects
Previous approaches give me the correct number of observations per unit of time but these observations are retained and only delivered at the end of the time window (and therefore there is an additional delay).
To give a more concrete example, if the following observations arrives into my Flow:
[Obs1 t=0s] [Obs2 t=45s] [Obs3 t=47s] [Obs4 t=121s] [Obs5 t=122s]
It should only output the following ones as soon as they arrive (processing time can be neglected here):
Window 1: [Obs1 t~0s] [Obs2 t~45s]
Window 2: [Obs4 t~121s] [Obs5 t~122s]
Any help will be appreciated, thanks for reading my first StackOverflow post ;)
I cannot think of a solution out of the box that does what you want. Throttle will emit in a steady stream because of how it is implemented with the bucket model, rather than having a permitted lease at the start of every time period.
To get the exact behavior you are after you would have to create your own custom rate-limit stage (which might not be that hard). You can find the docs on how to create custom stages here: http://doc.akka.io/docs/akka/2.5.0/java/stream/stream-customize.html#custom-linear-processing-stages-using-graphstage
One design that could work is having an allowance counter saying how many elements that can be emitted that you reset every interval, for every incoming element you subtract one from the counter and emit, when the allowance used up you keep pulling upstream but discard the elements rather than emit them. Using TimerGraphStageLogic for GraphStageLogic allows you to set a timed callback that can reset the allowance.
I think this is exactly what you need: http://doc.akka.io/docs/akka/2.5.0/java/stream/stream-cookbook.html#Globally_limiting_the_rate_of_a_set_of_streams
Thanks to the answer of #johanandren, I've successfully implemented a custom time-based GraphStage that meets my requirements.
I post the code below, if anyone is interested:
import akka.stream.Attributes;
import akka.stream.FlowShape;
import akka.stream.Inlet;
import akka.stream.Outlet;
import akka.stream.stage.*;
import scala.concurrent.duration.FiniteDuration;
public class CustomThrottleGraphStage<A> extends GraphStage<FlowShape<A, A>> {
private final FiniteDuration silencePeriod;
private int nbElemsMax;
public CustomThrottleGraphStage(int nbElemsMax, FiniteDuration silencePeriod) {
this.silencePeriod = silencePeriod;
this.nbElemsMax = nbElemsMax;
}
public final Inlet<A> in = Inlet.create("TimedGate.in");
public final Outlet<A> out = Outlet.create("TimedGate.out");
private final FlowShape<A, A> shape = FlowShape.of(in, out);
#Override
public FlowShape<A, A> shape() {
return shape;
}
#Override
public GraphStageLogic createLogic(Attributes inheritedAttributes) {
return new TimerGraphStageLogic(shape) {
private boolean open = false;
private int countElements = 0;
{
setHandler(in, new AbstractInHandler() {
#Override
public void onPush() throws Exception {
A elem = grab(in);
if (open || countElements >= nbElemsMax) {
pull(in); // we drop all incoming observations since the rate limit has been reached
}
else {
if (countElements == 0) { // we schedule the next instant to reset the observation counter
scheduleOnce("resetCounter", silencePeriod);
}
push(out, elem); // we forward the incoming observation
countElements += 1; // we increment the counter
}
}
});
setHandler(out, new AbstractOutHandler() {
#Override
public void onPull() throws Exception {
pull(in);
}
});
}
#Override
public void onTimer(Object key) {
if (key.equals("resetCounter")) {
open = false;
countElements = 0;
}
}
};
}
}
I am trying to enhance data in a pipeline by querying Datastore in a DoFn step.
A field from an object from the Class CustomClass is used to do a query against a Datastore table and the returned values are used to enhance the object.
The code looks like this:
public class EnhanceWithDataStore extends DoFn<CustomClass, CustomClass> {
private static Datastore datastore = DatastoreOptions.defaultInstance().service();
private static KeyFactory articleKeyFactory = datastore.newKeyFactory().kind("article");
#Override
public void processElement(ProcessContext c) throws Exception {
CustomClass event = c.element();
Entity article = datastore.get(articleKeyFactory.newKey(event.getArticleId()));
String articleName = "";
try{
articleName = article.getString("articleName");
} catch(Exception e) {}
CustomClass enhanced = new CustomClass(event);
enhanced.setArticleName(articleName);
c.output(enhanced);
}
When it is run locally, this is fast, but when it is run in the cloud, this step slows down the pipeline significantly. What's causing this? Is there any workaround or better way to do this?
A picture of the pipeline can be found here (the last step is the enhancing step):
pipeline architecture
What you are doing here is a join between your input PCollection<CustomClass> and the enhancements in Datastore.
For each partition of your PCollection the calls to Datastore are going to be single-threaded, hence incur a lot of latency. I would expect this to be slow in the DirectPipelineRunner and InProcessPipelineRunner as well. With autoscaling and dynamic work rebalancing, you should see parallelism when running on the Dataflow service unless something about the structure of your causes us to optimize it poorly, so you can try increasing --maxNumWorkers. But you still won't benefit from bulk operations.
It is probably better to express this join within your pipeline, using DatastoreIO.readFrom(...) followed by a CoGroupByKey transform. In this way, Dataflow will do a bulk parallel read of all the enhancements and use the efficient GroupByKey machinery to line them up with the events.
// Here are the two collections you want to join
PCollection<CustomClass> events = ...;
PCollection<Entity> articles = DatastoreIO.readFrom(...);
// Key them both by the common id
PCollection<KV<Long, CustomClass>> keyedEvents =
events.apply(WithKeys.of(event -> event.getArticleId()))
PCollection<KV<Long, Entity>> =
articles.apply(WithKeys.of(article -> article.getKey().getId())
// Set up the join by giving tags to each collection
TupleTag<CustomClass> eventTag = new TupleTag<CustomClass>() {};
TupleTag<Entity> articleTag = new TupleTag<Entity>() {};
KeyedPCollectionTuple<Long> coGbkInput =
KeyedPCollectionTuple
.of(eventTag, keyedEvents)
.and(articleTag, keyedArticles);
PCollection<CustomClass> enhancedEvents = coGbkInput
.apply(CoGroupByKey.create())
.apply(MapElements.via(CoGbkResult joinResult -> {
for (CustomClass event : joinResult.getAll(eventTag)) {
String articleName;
try {
articleName = joinResult.getOnly(articleTag).getString("articleName");
} catch(Exception e) {
articleName = "";
}
CustomClass enhanced = new CustomClass(event);
enhanced.setArticleName(articleName);
return enhanced;
}
});
Another possibility, if there are very few enough articles to store the lookup in memory, is to use DatastoreIO.readFrom(...) and then read them all as a map side input via View.asMap() and look them up in a local table.
// Here are the two collections you want to join
PCollection<CustomClass> events = ...;
PCollection<Entity> articles = DatastoreIO.readFrom(...);
// Key the articles and create a map view
PCollectionView<Map<Long, Entity>> = articleView
.apply(WithKeys.of(article -> article.getKey().getId())
.apply(View.asMap());
// Do a lookup join by side input to a ParDo
PCollection<CustomClass> enhanced = events
.apply(ParDo.withSideInputs(articles).of(new DoFn<CustomClass, CustomClass>() {
#Override
public void processElement(ProcessContext c) {
Map<Long, Entity> articleLookup = c.sideInput(articleView);
String articleName;
try {
articleName =
articleLookup.get(event.getArticleId()).getString("articleName");
} catch(Exception e) {
articleName = "";
}
CustomClass enhanced = new CustomClass(event);
enhanced.setArticleName(articleName);
return enhanced;
}
});
Depending on your data, either of these may be a better choice.
After some checking I managed to pinpoint the problem: the project is located in the EU (and as such, the Datastore is located in the EU-zone; same as the AppEningine zone), while the Dataflow jobs themselves (and thus the workers) are hosted in the US by default (when not overwriting the zone-option).
The difference in performance is 25-30 fold: ~40 elements/s compared to ~1200 elements/s for 15 workers.
I have a classifier that I trained using Python's scikit-learn. How can I use the classifier from a Java program? Can I use Jython? Is there some way to save the classifier in Python and load it in Java? Is there some other way to use it?
You cannot use jython as scikit-learn heavily relies on numpy and scipy that have many compiled C and Fortran extensions hence cannot work in jython.
The easiest ways to use scikit-learn in a java environment would be to:
expose the classifier as a HTTP / Json service, for instance using a microframework such as flask or bottle or cornice and call it from java using an HTTP client library
write a commandline wrapper application in python that reads data on stdin and output predictions on stdout using some format such as CSV or JSON (or some lower level binary representation) and call the python program from java for instance using Apache Commons Exec.
make the python program output the raw numerical parameters learnt at fit time (typically as an array of floating point values) and reimplement the predict function in java (this is typically easy for predictive linear models where the prediction is often just a thresholded dot product).
The last approach will be a lot more work if you need to re-implement feature extraction in Java as well.
Finally you can use a Java library such as Weka or Mahout that implement the algorithms you need instead of trying to use scikit-learn from Java.
There is JPMML project for this purpose.
First, you can serialize scikit-learn model to PMML (which is XML internally) using sklearn2pmml library directly from python or dump it in python first and convert using jpmml-sklearn in java or from a command line provided by this library. Next, you can load pmml file, deserialize and execute the loaded model using jpmml-evaluator in your Java code.
This way works with not all scikit-learn models, but with many of them.
As some commenters correctly pointed out, it's important to note that JPMML project is licensed under GNU AGPL. AGPL is a strong copyleft license, which may limit your ability to use the project. One of the examples may be if you develop a publically accessible service and want to keep the sources closed.
You can either use a porter, I have tested the sklearn-porter (https://github.com/nok/sklearn-porter), and it works well for Java.
My code is the following:
import pandas as pd
from sklearn import tree
from sklearn_porter import Porter
train_dataset = pd.read_csv('./result2.csv').as_matrix()
X_train = train_dataset[:90, :8]
Y_train = train_dataset[:90, 8:]
X_test = train_dataset[90:, :8]
Y_test = train_dataset[90:, 8:]
print X_train.shape
print Y_train.shape
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)
porter = Porter(clf, language='java')
output = porter.export(embed_data=True)
print(output)
In my case, I'm using a DecisionTreeClassifier, and the output of
print(output)
is the following code as text in the console:
class DecisionTreeClassifier {
private static int findMax(int[] nums) {
int index = 0;
for (int i = 0; i < nums.length; i++) {
index = nums[i] > nums[index] ? i : index;
}
return index;
}
public static int predict(double[] features) {
int[] classes = new int[2];
if (features[5] <= 51.5) {
if (features[6] <= 21.0) {
// HUGE amount of ifs..........
}
}
return findMax(classes);
}
public static void main(String[] args) {
if (args.length == 8) {
// Features:
double[] features = new double[args.length];
for (int i = 0, l = args.length; i < l; i++) {
features[i] = Double.parseDouble(args[i]);
}
// Prediction:
int prediction = DecisionTreeClassifier.predict(features);
System.out.println(prediction);
}
}
}
Here is some code for the JPMML solution:
--PYTHON PART--
# helper function to determine the string columns which have to be one-hot-encoded in order to apply an estimator.
def determine_categorical_columns(df):
categorical_columns = []
x = 0
for col in df.dtypes:
if col == 'object':
val = df[df.columns[x]].iloc[0]
if not isinstance(val,Decimal):
categorical_columns.append(df.columns[x])
x += 1
return categorical_columns
categorical_columns = determine_categorical_columns(df)
other_columns = list(set(df.columns).difference(categorical_columns))
#construction of transformators for our example
labelBinarizers = [(d, LabelBinarizer()) for d in categorical_columns]
nones = [(d, None) for d in other_columns]
transformators = labelBinarizers+nones
mapper = DataFrameMapper(transformators,df_out=True)
gbc = GradientBoostingClassifier()
#construction of the pipeline
lm = PMMLPipeline([
("mapper", mapper),
("estimator", gbc)
])
--JAVA PART --
//Initialisation.
String pmmlFile = "ScikitLearnNew.pmml";
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);
//Determine which features are required as input
HashMap<String, Field>() inputFieldMap = new HashMap<String, Field>();
for (int i = 0; i < evaluator.getInputFields().size();i++) {
InputField curInputField = evaluator.getInputFields().get(i);
String fieldName = curInputField.getName().getValue();
inputFieldMap.put(fieldName.toLowerCase(),curInputField.getField());
}
//prediction
HashMap<String,String> argsMap = new HashMap<String,String>();
//... fill argsMap with input
Map<FieldName, ?> res;
// here we keep only features that are required by the model
Map<FieldName,String> args = new HashMap<FieldName, String>();
Iterator<String> iter = argsMap.keySet().iterator();
while (iter.hasNext()) {
String key = iter.next();
Field f = inputFieldMap.get(key);
if (f != null) {
FieldName name =f.getName();
String value = argsMap.get(key);
args.put(name, value);
}
}
//the model is applied to input, a probability distribution is obtained
res = evaluator.evaluate(args);
SegmentResult segmentResult = (SegmentResult) res;
Object targetValue = segmentResult.getTargetValue();
ProbabilityDistribution probabilityDistribution = (ProbabilityDistribution) targetValue;
I found myself in a similar situation.
I'll recommend carving out a classifier microservice. You could have a classifier microservice which runs in python and then expose calls to that service over some RESTFul API yielding JSON/XML data-interchange format. I think this is a cleaner approach.
Alternatively you can just generate a Python code from a trained model. Here is a tool that can help you with that https://github.com/BayesWitnesses/m2cgen
Does anyone know how can I create a new Performance Counter (perfmon tool) in Java?
For example: a new performance counter for monitoring the number / duration of user actions.
I created such performance counters in C# and it was quite easy, however I couldn’t find anything helpful for creating it in Java…
If you want to develop your performance counter independently from the main code, you should look at aspect programming (AspectJ, Javassist).
You'll can plug your performance counter on the method(s) you want without modifying the main code.
Java does not immediately work with perfmon (but you should see DTrace under Solaris).
Please see this question for suggestions: Java app performance counters viewed in Perfmon
Not sure what you are expecting this tool to do but I would create some data structures to record these times and counts like
class UserActionStats {
int count;
long durationMS;
long start = 0;
public void startAction() {
start = System.currentTimeMillis();
}
public void endAction() {
durationMS += System.currentTimeMillis() - start;
count++;
}
}
A collection for these could look like
private static final Map<String, UserActionStats> map =
new HashMap<String, UserActionStats>();
public static UserActionStats forUser(String userName) {
synchronized(map) {
UserActionStats uas = map.get(userName);
if (uas == null)
map.put(userName, uas = new UserActionStats());
return uas;
}
}