I have a classifier that I trained using Python's scikit-learn. How can I use the classifier from a Java program? Can I use Jython? Is there some way to save the classifier in Python and load it in Java? Is there some other way to use it?
You cannot use jython as scikit-learn heavily relies on numpy and scipy that have many compiled C and Fortran extensions hence cannot work in jython.
The easiest ways to use scikit-learn in a java environment would be to:
expose the classifier as a HTTP / Json service, for instance using a microframework such as flask or bottle or cornice and call it from java using an HTTP client library
write a commandline wrapper application in python that reads data on stdin and output predictions on stdout using some format such as CSV or JSON (or some lower level binary representation) and call the python program from java for instance using Apache Commons Exec.
make the python program output the raw numerical parameters learnt at fit time (typically as an array of floating point values) and reimplement the predict function in java (this is typically easy for predictive linear models where the prediction is often just a thresholded dot product).
The last approach will be a lot more work if you need to re-implement feature extraction in Java as well.
Finally you can use a Java library such as Weka or Mahout that implement the algorithms you need instead of trying to use scikit-learn from Java.
There is JPMML project for this purpose.
First, you can serialize scikit-learn model to PMML (which is XML internally) using sklearn2pmml library directly from python or dump it in python first and convert using jpmml-sklearn in java or from a command line provided by this library. Next, you can load pmml file, deserialize and execute the loaded model using jpmml-evaluator in your Java code.
This way works with not all scikit-learn models, but with many of them.
As some commenters correctly pointed out, it's important to note that JPMML project is licensed under GNU AGPL. AGPL is a strong copyleft license, which may limit your ability to use the project. One of the examples may be if you develop a publically accessible service and want to keep the sources closed.
You can either use a porter, I have tested the sklearn-porter (https://github.com/nok/sklearn-porter), and it works well for Java.
My code is the following:
import pandas as pd
from sklearn import tree
from sklearn_porter import Porter
train_dataset = pd.read_csv('./result2.csv').as_matrix()
X_train = train_dataset[:90, :8]
Y_train = train_dataset[:90, 8:]
X_test = train_dataset[90:, :8]
Y_test = train_dataset[90:, 8:]
print X_train.shape
print Y_train.shape
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)
porter = Porter(clf, language='java')
output = porter.export(embed_data=True)
print(output)
In my case, I'm using a DecisionTreeClassifier, and the output of
print(output)
is the following code as text in the console:
class DecisionTreeClassifier {
private static int findMax(int[] nums) {
int index = 0;
for (int i = 0; i < nums.length; i++) {
index = nums[i] > nums[index] ? i : index;
}
return index;
}
public static int predict(double[] features) {
int[] classes = new int[2];
if (features[5] <= 51.5) {
if (features[6] <= 21.0) {
// HUGE amount of ifs..........
}
}
return findMax(classes);
}
public static void main(String[] args) {
if (args.length == 8) {
// Features:
double[] features = new double[args.length];
for (int i = 0, l = args.length; i < l; i++) {
features[i] = Double.parseDouble(args[i]);
}
// Prediction:
int prediction = DecisionTreeClassifier.predict(features);
System.out.println(prediction);
}
}
}
Here is some code for the JPMML solution:
--PYTHON PART--
# helper function to determine the string columns which have to be one-hot-encoded in order to apply an estimator.
def determine_categorical_columns(df):
categorical_columns = []
x = 0
for col in df.dtypes:
if col == 'object':
val = df[df.columns[x]].iloc[0]
if not isinstance(val,Decimal):
categorical_columns.append(df.columns[x])
x += 1
return categorical_columns
categorical_columns = determine_categorical_columns(df)
other_columns = list(set(df.columns).difference(categorical_columns))
#construction of transformators for our example
labelBinarizers = [(d, LabelBinarizer()) for d in categorical_columns]
nones = [(d, None) for d in other_columns]
transformators = labelBinarizers+nones
mapper = DataFrameMapper(transformators,df_out=True)
gbc = GradientBoostingClassifier()
#construction of the pipeline
lm = PMMLPipeline([
("mapper", mapper),
("estimator", gbc)
])
--JAVA PART --
//Initialisation.
String pmmlFile = "ScikitLearnNew.pmml";
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);
//Determine which features are required as input
HashMap<String, Field>() inputFieldMap = new HashMap<String, Field>();
for (int i = 0; i < evaluator.getInputFields().size();i++) {
InputField curInputField = evaluator.getInputFields().get(i);
String fieldName = curInputField.getName().getValue();
inputFieldMap.put(fieldName.toLowerCase(),curInputField.getField());
}
//prediction
HashMap<String,String> argsMap = new HashMap<String,String>();
//... fill argsMap with input
Map<FieldName, ?> res;
// here we keep only features that are required by the model
Map<FieldName,String> args = new HashMap<FieldName, String>();
Iterator<String> iter = argsMap.keySet().iterator();
while (iter.hasNext()) {
String key = iter.next();
Field f = inputFieldMap.get(key);
if (f != null) {
FieldName name =f.getName();
String value = argsMap.get(key);
args.put(name, value);
}
}
//the model is applied to input, a probability distribution is obtained
res = evaluator.evaluate(args);
SegmentResult segmentResult = (SegmentResult) res;
Object targetValue = segmentResult.getTargetValue();
ProbabilityDistribution probabilityDistribution = (ProbabilityDistribution) targetValue;
I found myself in a similar situation.
I'll recommend carving out a classifier microservice. You could have a classifier microservice which runs in python and then expose calls to that service over some RESTFul API yielding JSON/XML data-interchange format. I think this is a cleaner approach.
Alternatively you can just generate a Python code from a trained model. Here is a tool that can help you with that https://github.com/BayesWitnesses/m2cgen
Related
tl;dr:
How do/can I store the function-handles of multiple js-functions in java for using them later? Currently I have two ideas:
Create multipe ScriptEngine instances, each containing one loaded function. Store them in a map by column, multiple entries per column in a list. Looks like a big overhead depending on how 'heavy' a ScriptEngine instance is...
Some Javascript solution to append methods of the same target field to an array. Dont know yet how to access that from the java-side, but also dont like it. Would like to keep the script files as stupid as possible.
var test1 = test1 || [];
test1.push(function(input) { return ""; });
???
Ideas or suggestions?
Tell me more:
I have a project where I have a directory containing script files (javascript, expecting more than hundred files, will grow in future). Those script files are named like: test1;toupper.js, test1;trim.js and test2;capitalize.js. The name before the semicolon is the column/field that the script will be process and the part after the semicolon is a human readable description what the file does (simplified example). So in this example there are two scripts that will be assigned to the "test1" column and one script to the "test2" column. The js-function template basically looks like:
function process(input) { return ""; };
My idea is, to load (and evaluate/compile) all script files at server-startup and then use the loaded functions by column when they are needed. So far, so good.
I can load/evaluate a single function with the following code. Example uses GraalVM, but should be reproducable with other languages too.
final ScriptEngine engine = new ScriptEngineManager().getEngineByName("graal.js");
final Invocable invocable = (Invocable) engine;
engine.eval("function process(arg) { return arg.toUpperCase(); };");
var rr0 = invocable.invokeFunction("process", "abc123xyz"); // rr0 = ABC123XYZ
But when I load/evaluate the next function with the same name, the previous one will be overwritten - logically, since its the same function name.
engine.eval("function process(arg) { return arg + 'test'; };");
var rr1 = invocable.invokeFunction("process", "abc123xyz"); // rr1 = abc123xyztest
This is how I would do it.
The recommended way to use Graal.js is via the polyglot API: https://www.graalvm.org/reference-manual/embed-languages/
Not the same probably would work with the ScriptEngine API, but here's the example using the polyglot API.
Wrap the function definition in ()
return the functions to Java
Not pictured, but you probably build a map from the column name to a list of functions to invoke on it.
Call the functions on the data.
import org.graalvm.polyglot.*;
import org.graalvm.polyglot.proxy.*;
public class HelloPolyglot {
public static void main(String[] args) {
System.out.println("Hello Java!");
try (Context context = Context.create()) {
Value toUpperCase = context.eval("js", "(function process(arg) { return arg.toUpperCase(); })");
Value concatTest = context.eval("js", "(function process(arg) { return arg + 'test'; })");
String text = "HelloWorld";
text = toUpperCase.execute(text).asString();
text = concatTest.execute(text).asString();
System.out.println(text);
}
}
}
Now, Value.execute() returns a Value, which I for simplicity coerce to a Java String with asString(), but you don't have to do that and you can operate on Value (here's the API for Value: https://www.graalvm.org/sdk/javadoc/org/graalvm/polyglot/Value.html).
First of all: I'm using moa-release-2019.05.0-bin/moa-release-2019.05.0/lib/moa.jar in my java project.
Now, let's go to the point: I am trying to use moa.clusterers.clustream.WithKmeans stream clustering algorithm and I have no idea why this is happening ...
I am new into using moa and I am having a hard time trying to decode how the clustering algorithms have to be used. The documentation lacks of sample code for common usages, and the implementation is not that well explained ... have not found any tutorial either.
My code:
import com.yahoo.labs.samoa.instances.DenseInstance;
import moa.cluster.Clustering;
import moa.clusterers.clustream.WithKmeans;
public class TestingClustream {
static DenseInstance randomInstance(int size) {
DenseInstance instance = new DenseInstance(size);
for (int idx = 0; idx < size; idx++) {
instance.setValue(idx, Math.random());
}
return instance;
}
public static void main(String[] args) {
WithKmeans wkm = new WithKmeans();
wkm.kOption.setValue(5);
wkm.maxNumKernelsOption.setValue(300);
wkm.resetLearningImpl();
for (int i = 0; i < 10000; i++) {
wkm.trainOnInstanceImpl(randomInstance(2));
}
Clustering clusteringResult = wkm.getClusteringResult();
Clustering microClusteringResult = wkm.getMicroClusteringResult();
}
}
Info from the debugger:
I have read the source code many times, and it seems to me that I am using the correct functions, in the correct order ... I do not know what I am missing ... any feedback is welcomed!
Make sure you have fed the algorithm enough data, it will process the data in batches.
The fields are unused, likely coming from some parent class with a different purpose.
Use the getter methods such as getCenter() that will compute the current center from the running sum.
I have a server with multiple GPUs and want to make full use of them during model inference inside a java app.
By default tensorflow seizes all available GPUs, but uses only the first one.
I can think of three options to overcome this issue:
Restrict device visibility on process level, namely using CUDA_VISIBLE_DEVICES environment variable.
That would require me to run several instances of the java app and distribute traffic among them. Not that tempting idea.
Launch several sessions inside a single application and try to assign one device to each of them via ConfigProto:
public class DistributedPredictor {
private Predictor[] nested;
private int[] counters;
// ...
public DistributedPredictor(String modelPath, int numDevices, int numThreadsPerDevice) {
nested = new Predictor[numDevices];
counters = new int[numDevices];
for (int i = 0; i < nested.length; i++) {
nested[i] = new Predictor(modelPath, i, numDevices, numThreadsPerDevice);
}
}
public Prediction predict(Data data) {
int i = acquirePredictorIndex();
Prediction result = nested[i].predict(data);
releasePredictorIndex(i);
return result;
}
private synchronized int acquirePredictorIndex() {
int i = argmin(counters);
counters[i] += 1;
return i;
}
private synchronized void releasePredictorIndex(int i) {
counters[i] -= 1;
}
}
public class Predictor {
private Session session;
public Predictor(String modelPath, int deviceIdx, int numDevices, int numThreadsPerDevice) {
GPUOptions gpuOptions = GPUOptions.newBuilder()
.setVisibleDeviceList("" + deviceIdx)
.setAllowGrowth(true)
.build();
ConfigProto config = ConfigProto.newBuilder()
.setGpuOptions(gpuOptions)
.setInterOpParallelismThreads(numDevices * numThreadsPerDevice)
.build();
byte[] graphDef = Files.readAllBytes(Paths.get(modelPath));
Graph graph = new Graph();
graph.importGraphDef(graphDef);
this.session = new Session(graph, config.toByteArray());
}
public Prediction predict(Data data) {
// ...
}
}
This approach seems to work fine at a glance. However, sessions occasionally ignore setVisibleDeviceList option and all go for the first device causing Out-Of-Memory crash.
Build the model in a multi-tower fashion in python using tf.device() specification. On java side, give different Predictors different towers inside a shared session.
Feels cumbersome and idiomatically wrong to me.
UPDATE: As #ash proposed, there's yet another option:
Assign an appropriate device to each operation of the existing graph by modifying its definition (graphDef).
To get it done, one could adapt the code from Method 2:
public class Predictor {
private Session session;
public Predictor(String modelPath, int deviceIdx, int numDevices, int numThreadsPerDevice) {
byte[] graphDef = Files.readAllBytes(Paths.get(modelPath));
graphDef = setGraphDefDevice(graphDef, deviceIdx)
Graph graph = new Graph();
graph.importGraphDef(graphDef);
ConfigProto config = ConfigProto.newBuilder()
.setAllowSoftPlacement(true)
.build();
this.session = new Session(graph, config.toByteArray());
}
private static byte[] setGraphDefDevice(byte[] graphDef, int deviceIdx) throws InvalidProtocolBufferException {
String deviceString = String.format("/gpu:%d", deviceIdx);
GraphDef.Builder builder = GraphDef.parseFrom(graphDef).toBuilder();
for (int i = 0; i < builder.getNodeCount(); i++) {
builder.getNodeBuilder(i).setDevice(deviceString);
}
return builder.build().toByteArray();
}
public Prediction predict(Data data) {
// ...
}
}
Just like other mentioned approaches, this one doesn't set me free from manually distributing data among devices. But at least it works stably and is comparably easy to implement. Overall, this looks like an (almost) normal technique.
Is there an elegant way to do such a basic thing with tensorflow java API? Any ideas would be appreciated.
In short: There is a workaround, where you end up with one session per GPU.
Details:
The general flow is that the TensorFlow runtime respects the devices specified for operations in the graph. If no device is specified for an operation, then it "places" it based on some heuristics. Those heuristics currently result in "place operation on GPU:0 if GPUs are available and there is a GPU kernel for the operation" (Placer::Run in case you're interested).
What you ask for I think is a reasonable feature request for TensorFlow - the ability to treat devices in the serialized graph as "virtual" ones to be mapped to a set of "phyiscal" devices at run time, or alternatively setting the "default device". This feature does not currently exist. Adding such an option to ConfigProto is something you may want to file a feature request for.
I can suggest a workaround in the interim. First, some commentary on your proposed solutions.
Your first idea will surely work, but as you pointed out, is cumbersome.
Setting using visible_device_list in the ConfigProto doesn't quite work out since that is actually a per-process setting and is ignored after the first session is created in the process. This is certainly not documented as well as it should be (and somewhat unfortunate that this appears in the per-Session configuration). However, this explains why your suggestion here doesn't work and why you still see a single GPU being used.
This could work.
Another option is to end up with different graphs (with operations explicitly placed on different GPUs), resulting in one session per GPU. Something like this can be used to edit the graph and explicitly assign a device to each operation:
public static byte[] modifyGraphDef(byte[] graphDef, String device) throws Exception {
GraphDef.Builder builder = GraphDef.parseFrom(graphDef).toBuilder();
for (int i = 0; i < builder.getNodeCount(); ++i) {
builder.getNodeBuilder(i).setDevice(device);
}
return builder.build().toByteArray();
}
After which you could create a Graph and Session per GPU using something like:
final int NUM_GPUS = 8;
// setAllowSoftPlacement: Just in case our device modifications were too aggressive
// (e.g., setting a GPU device on an operation that only has CPU kernels)
// setLogDevicePlacment: So we can see what happens.
byte[] config =
ConfigProto.newBuilder()
.setLogDevicePlacement(true)
.setAllowSoftPlacement(true)
.build()
.toByteArray();
Graph graphs[] = new Graph[NUM_GPUS];
Session sessions[] = new Session[NUM_GPUS];
for (int i = 0; i < NUM_GPUS; ++i) {
graphs[i] = new Graph();
graphs[i].importGraphDef(modifyGraphDef(graphDef, String.format("/gpu:%d", i)));
sessions[i] = new Session(graphs[i], config);
}
Then use sessions[i] to execute the graph on GPU #i.
Hope that helps.
In python it can be done as follows:
def get_frozen_graph(graph_file):
"""Read Frozen Graph file from disk."""
with tf.gfile.GFile(graph_file, "rb") as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
return graph_def
trt_graph1 = get_frozen_graph('/home/ved/ved_1/frozen_inference_graph.pb')
with tf.device('/gpu:1'):
[tf_input_l1, tf_scores_l1, tf_boxes_l1, tf_classes_l1, tf_num_detections_l1, tf_masks_l1] = tf.import_graph_def(trt_graph1,
return_elements=['image_tensor:0', 'detection_scores:0',
'detection_boxes:0', 'detection_classes:0','num_detections:0', 'detection_masks:0'])
tf_sess1 = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
trt_graph2 = get_frozen_graph('/home/ved/ved_2/frozen_inference_graph.pb')
with tf.device('/gpu:0'):
[tf_input_l2, tf_scores_l2, tf_boxes_l2, tf_classes_l2, tf_num_detections_l2] = tf.import_graph_def(trt_graph2,
return_elements=['image_tensor:0', 'detection_scores:0',
'detection_boxes:0', 'detection_classes:0','num_detections:0'])
tf_sess2 = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
I have real time streaming data coming into spark and I would like to do a moving average forecasting on that time-series data. Is there any way to implement this using spark in Java?
I've already referred to : https://gist.github.com/samklr/27411098f04fc46dcd05/revisions
and
Apache Spark Moving Average
but both these codes are written in Scala. Since I'm not familiar with Scala, I'm not able to judge if I'll find it useful or even convert the code to Java.
Is there any direct implementation of forecasting in Spark Java?
I took the question you were referring and struggled for a couple of hours in order to translate the Scala code into Java:
// Read a file containing the Stock Quotations
// You can also paralelize a collection of objects to create a RDD
JavaRDD<String> linesRDD = sc.textFile("some sample file containing stock prices");
// Convert the lines into our business objects
JavaRDD<StockQuotation> quotationsRDD = linesRDD.flatMap(new ConvertLineToStockQuotation());
// We need these two objects in order to use the MLLib RDDFunctions object
ClassTag<StockQuotation> classTag = scala.reflect.ClassManifestFactory.fromClass(StockQuotation.class);
RDD<StockQuotation> rdd = JavaRDD.toRDD(quotationsRDD);
// Instantiate a RDDFunctions object to work with
RDDFunctions<StockQuotation> rddFs = RDDFunctions.fromRDD(rdd, classTag);
// This applies the sliding function and return the (DATE,SMA) tuple
JavaPairRDD<Date, Double> smaPerDate = rddFs.sliding(slidingWindow).toJavaRDD().mapToPair(new MovingAvgByDateFunction());
List<Tuple2<Date, Double>> smaPerDateList = smaPerDate.collect();
Then you have to use a new Function Class to do the actual calculation of each data window:
public class MovingAvgByDateFunction implements PairFunction<Object,Date,Double> {
/**
*
*/
private static final long serialVersionUID = 9220435667459839141L;
#Override
public Tuple2<Date, Double> call(Object t) throws Exception {
StockQuotation[] stocks = (StockQuotation[]) t;
List<StockQuotation> stockList = Arrays.asList(stocks);
Double result = stockList.stream().collect(Collectors.summingDouble(new ToDoubleFunction<StockQuotation>() {
#Override
public double applyAsDouble(StockQuotation value) {
return value.getValue();
}
}));
result = result / stockList.size();
return new Tuple2<Date, Double>(stockList.get(0).getTimestamp(),result);
}
}
If you want more detail on this, I wrote about Simple Moving Averages here:
https://t.co/gmWltdANd3
I installed word2Vec using this tutorial on by Ubuntu laptop. Is it completely necessary to install DL4J in order to implement word2Vec vectors in Java? I'm comfortable working in Eclipse and I'm not sure that I want all the other pre-requisites that DL4J wants me to install.
Ideally there would be a really easy way for me to just use the Java code I've already written (in Eclipse) and change a few lines -- so that word look-ups that I am doing would retrieve a word2Vec vector instead of the current retrieval process I'm using.
Also, I've looked into using GloVe, however, I do not have MatLab. Is it possible to use GloVe without MatLab? (I got an error while installing it because of this). If so, the same question as above goes... I have no idea how to implement it in Java.
What is preventing you from saving the word2vec (the C program) output in text format and then read the file with a Java piece of code and load the vectors in a hashmap keyed by the word string?
Some code snippets:
// Class to store a hashmap of wordvecs
public class WordVecs {
HashMap<String, WordVec> wordvecmap;
....
void loadFromTextFile() {
String wordvecFile = prop.getProperty("wordvecs.vecfile");
wordvecmap = new HashMap();
try (FileReader fr = new FileReader(wordvecFile);
BufferedReader br = new BufferedReader(fr)) {
String line;
while ((line = br.readLine()) != null) {
WordVec wv = new WordVec(line);
wordvecmap.put(wv.word, wv);
}
}
catch (Exception ex) { ex.printStackTrace(); }
}
....
}
// class for each wordvec
public class WordVec implements Comparable<WordVec> {
public WordVec(String line) {
String[] tokens = line.split("\\s+");
word = tokens[0];
vec = new float[tokens.length-1];
for (int i = 1; i < tokens.length; i++)
vec[i-1] = Float.parseFloat(tokens[i]);
norm = getNorm();
}
....
}
If you want to get the nearest neighbours for a given word, you can keep a list of N nearest pre-computed neighbours associated with each WordVec object.
Dl4j author here. Our word2vec implementation is targeted for people who need to have custom pipelines. I don't blame you for going the simple route here.
Our word2vec implementation is meant for when you want to do something with them not for messing around. The c word2vec format is pretty straight forward.
Here is parsing logic in java if you'd like:
https://github.com/deeplearning4j/deeplearning4j/blob/374609b2672e97737b9eb3ba12ee62fab6cfee55/deeplearning4j-scaleout/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/embeddings/loader/WordVectorSerializer.java#L113
Hope that helps a bit