Deeplearning4j - how to iterate multiple DataSets for large data? - java

I'm studying Deeplearning4j (ver. 1.0.0-M1.1) for building neural networks.
I use IrisClassifier from Deeplearning4j as an example, it works fine:
//First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
RecordReader recordReader = new CSVRecordReader(numLinesToSkip,delimiter);
recordReader.initialize(new FileSplit(new File(DownloaderUtility.IRISDATA.Download(),"iris.txt")));
//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int labelIndex = 4; //5 values in each row of the iris.txt CSV: 4 input features followed by an integer label (class) index. Labels are the 5th value (index 4) in each row
int numClasses = 3; //3 classes (types of iris flowers) in the iris data set. Classes have integer values 0, 1 or 2
int batchSize = 150; //Iris data set: 150 examples total. We are loading all of them into one DataSet (not recommended for large data sets)
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
DataSet allData = iterator.next();
allData.shuffle();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65); //Use 65% of data for training
DataSet trainingData = testAndTrain.getTrain();
DataSet testData = testAndTrain.getTest();
//We need to normalize our data. We'll use NormalizeStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(trainingData); //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(trainingData); //Apply normalization to the training data
normalizer.transform(testData); //Apply normalization to the test data. This is using statistics calculated from the *training* set
final int numInputs = 4;
int outputNum = 3;
long seed = 6;
log.info("Build model....");
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(seed)
.activation(Activation.TANH)
.weightInit(WeightInit.XAVIER)
.updater(new Sgd(0.1))
.l2(1e-4)
.list()
.layer(new DenseLayer.Builder().nIn(numInputs).nOut(3)
.build())
.layer(new DenseLayer.Builder().nIn(3).nOut(3)
.build())
.layer( new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
.activation(Activation.SOFTMAX) //Override the global TANH activation with softmax for this layer
.nIn(3).nOut(outputNum).build())
.build();
//run the model
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
//record score once every 100 iterations
model.setListeners(new ScoreIterationListener(100));
for(int i=0; i<1000; i++ ) {
model.fit(trainingData);
}
//evaluate the model on the test set
Evaluation eval = new Evaluation(3);
INDArray output = model.output(testData.getFeatures());
eval.eval(testData.getLabels(), output);
log.info(eval.stats());
For my project, I have ~30000 input records (the iris example has 150).
Each record is a vector of size ~7000 (the iris example has 4).
Obviously, I can't process all the data in a single DataSet - it would produce an OOM in the JVM.
How can I process the data in multiple DataSets?
I assume it should be something like this (store the DataSets in a List and iterate):
...
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
List<DataSet> trainingData = new ArrayList<>();
List<DataSet> testData = new ArrayList<>();
while (iterator.hasNext()) {
DataSet allData = iterator.next();
allData.shuffle();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65); //Use 65% of data for training
trainingData.add(testAndTrain.getTrain());
testData.add(testAndTrain.getTest());
}
//We need to normalize our data. We'll use NormalizeStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
for (DataSet dataSetTraining : trainingData) {
normalizer.fit(dataSetTraining); //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(dataSetTraining); //Apply normalization to the training data
}
for (DataSet dataSetTest : testData) {
normalizer.transform(dataSetTest); //Apply normalization to the test data. This is using statistics calculated from the *training* set
}
...
for(int i=0; i<1000; i++ ) {
for (DataSet dataSetTraining : trainingData) {
model.fit(dataSetTraining);
}
}
But when I start the evaluation, I get this error:
Exception in thread "main" java.lang.NullPointerException: Cannot read field "javaShapeInformation" because "this.jvmShapeInfo" is null
at org.nd4j.linalg.api.ndarray.BaseNDArray.dataType(BaseNDArray.java:5507)
at org.nd4j.linalg.api.ndarray.BaseNDArray.validateNumericalArray(BaseNDArray.java:5575)
at org.nd4j.linalg.api.ndarray.BaseNDArray.add(BaseNDArray.java:3087)
at com.aarcapital.aarmlclassifier.classification.FAClassifierLearning.main(FAClassifierLearning.java:117)
...
Evaluation eval = new Evaluation(26);
INDArray output = new NDArray();
for (DataSet dataSetTest : testData) {
output.add(model.output(dataSetTest.getFeatures())); // ERROR HERE
}
System.out.println("--- Output ---");
System.out.println(output);
INDArray labels = new NDArray();
for (DataSet dataSetTest : testData) {
labels.add(dataSetTest.getLabels());
}
System.out.println("--- Labels ---");
System.out.println(labels);
eval.eval(labels, output);
log.info(eval.stats());
What is the correct way to iterate over multiple DataSets when training the network?
Thanks!

Firstly, always use Nd4j.create(..) for ndarrays.
Never use the implementation class directly. That allows you to safely create ndarrays that will work whether you use CPUs or GPUs.
Second: always use the RecordReaderDataSetIterator's builder rather than the constructor. The constructor is very long and error prone - that is why we made the builder in the first place.
Your NullPointerException actually isn't coming from where you think it is. It's due to how you're creating the ndarray: new NDArray() has no data type or shape, so it can't know what to expect. Nd4j.create(..) will properly set up the ndarray for you.
Beyond that you are doing things the right way. The record reader handles the batching for you.
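For the evaluation step specifically, here is a minimal sketch of how the last block could look (reusing the testData list from the question). Evaluation accumulates statistics across repeated eval(..) calls, so there is no need to concatenate the outputs into one big ndarray at all:
Evaluation eval = new Evaluation(26);
for (DataSet dataSetTest : testData) {
    INDArray batchOutput = model.output(dataSetTest.getFeatures()); // predictions for this batch
    eval.eval(dataSetTest.getLabels(), batchOutput);                // accumulate into the same Evaluation
}
log.info(eval.stats());
If you really do need a single ndarray, build it with the factory, e.g. Nd4j.vstack(listOfOutputs), rather than with new NDArray(). And the builder form of the iterator (an assumed sketch for 1.0.0-M1.1) would look roughly like:
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize)
    .classification(labelIndex, numClasses)
    .build();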

Related

Apache Ignite updating previously trained ML model

I have a dataset that is used for training a KNN model. Later I'd like to update the model with new training data. What I'm seeing is that the updated model only takes the new training data into account, ignoring what was previously trained.
Vectorizer vec = new DummyVectorizer<Integer>(1, 2).labeled(0);
DatasetTrainer<KNNClassificationModel, Double> trainer = new KNNClassificationTrainer();
KNNClassificationModel model;
KNNClassificationModel modelUpdated;
Map<Integer, Vector> trainingData = new HashMap<Integer, Vector>();
Map<Integer, Vector> trainingDataNew = new HashMap<Integer, Vector>();
Double[][] data1 = new Double[][] {
{0.136,0.644,0.154},
{0.302,0.634,0.779},
{0.806,0.254,0.211},
{0.241,0.951,0.744},
{0.542,0.893,0.612},
{0.334,0.277,0.486},
{0.616,0.259,0.121},
{0.738,0.585,0.017},
{0.124,0.567,0.358},
{0.934,0.346,0.863}};
Double[][] data2 = new Double[][] {
{0.300,0.236,0.193}};
Double[] observationData = new Double[] { 0.8, 0.7 };
// fill dataset (in cache)
for (int i = 0; i < data1.length; i++)
trainingData.put(i, new DenseVector(data1[i]));
// first training / prediction
model = trainer.fit(trainingData, 1, vec);
System.out.println("First prediction : " + model.predict(new DenseVector(observationData)));
// new training data
for (int i = 0; i < data2.length; i++)
trainingDataNew.put(data1.length + i, new DenseVector(data2[i]));
// second training / prediction
modelUpdated = trainer.update(model, trainingDataNew, 1, vec);
System.out.println("Second prediction: " + modelUpdated.predict(new DenseVector(observationData)));
As an output I get this:
First prediction : 0.124
Second prediction: 0.3
It looks like the second prediction only used data2, which necessarily leads to 0.3 as the prediction.
How does model update work? If I have to add data2 to data1 and then train on data1 again, what would be the difference compared to a completely new training on all the combined data?
How does model update work?
For KNN specifically:
Add data2 to data1 and call modelUpdate on the combined data.
See this test as an example: https://github.com/apache/ignite/blob/635dafb7742673494efa6e8e91e236820156d38f/modules/ml/src/test/java/org/apache/ignite/ml/knn/KNNClassificationTest.java#L167
Follow the instructions in that test.
First, set up your trainer:
KNNClassificationTrainer trainer = new KNNClassificationTrainer()
.withK(3)
.withDistanceMeasure(new EuclideanDistance())
.withWeighted(false);
Then set up your vectorizer: (note how the labeled coordinate is created)
model = trainer.fit(
trainingData,
parts,
new DoubleArrayVectorizer<Integer>().labeled(Vectorizer.LabelCoordinate.LAST)
);
Then call updateModel as needed:
KNNClassificationModel updatedOnData = trainer.update(
originalMdlOnEmptyDataset,
newData,
parts,
new DoubleArrayVectorizer<Integer>().labeled(Vectorizer.LabelCoordinate.LAST)
);
docs for KNN classification: https://ignite.apache.org/docs/latest/machine-learning/binary-classification/knn-classification
KNN Classification example: https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/knn/KNNClassificationExample.java
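Applied to the code in the question, a minimal sketch (reusing the question's trainingData, trainingDataNew, trainer, vec and observationData variables) is simply to merge the new entries into the original map and pass the combined map to update:
// combine the old and the new training data (the data2 keys are already offset by data1.length)
Map<Integer, Vector> combinedData = new HashMap<>(trainingData);
combinedData.putAll(trainingDataNew);
// update the model on the combined data instead of only the new rows
modelUpdated = trainer.update(model, combinedData, 1, vec);
System.out.println("Updated prediction: " + modelUpdated.predict(new DenseVector(observationData)));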

The i/p col features must be either string or numeric type, but got org.apache.spark.ml.linalg.VectorUDT

I am very new to Spark machine learning (just a 3-day-old novice) and I'm basically trying to predict some data using the Logistic Regression algorithm in Spark via Java. I have referred to a few sites and the documentation, came up with the code below, and am facing an issue when I try to execute it.
I have pre-processed the data and used a VectorAssembler to combine all the relevant columns into one, and the problem appears when I try to fit the model.
public class Sparkdemo {
static SparkSession session = SparkSession.builder().appName("spark_demo")
.master("local[*]").getOrCreate();
@SuppressWarnings("empty-statement")
public static void getData() {
Dataset<Row> inputFile = session.read()
.option("header", true)
.format("csv")
.option("inferschema", true)
.csv("C:\\Users\\WildJasmine\\Downloads\\NKI_cleaned.csv");
inputFile.show();
String[] columns = inputFile.columns();
int beg = 16, end = columns.length - 1;
String[] featuresToDrop = new String[end - beg + 1];
System.arraycopy(columns, beg, featuresToDrop, 0, featuresToDrop.length);
System.out.println("rows are\n " + Arrays.toString(featuresToDrop));
Dataset<Row> dataSubset = inputFile.drop(featuresToDrop);
String[] arr = {"Patient", "ID", "eventdeath"};
Dataset<Row> X = dataSubset.drop(arr);
X.show();
Dataset<Row> y = dataSubset.select("eventdeath");
y.show();
//Vector Assembler concept for merging all the cols into a single col
VectorAssembler assembler = new VectorAssembler()
.setInputCols(X.columns())
.setOutputCol("features");
Dataset<Row> dataset = assembler.transform(X);
dataset.show();
StringIndexer labelSplit = new StringIndexer().setInputCol("features").setOutputCol("label");
Dataset<Row> data = labelSplit.fit(dataset)
.transform(dataset);
data.show();
Dataset<Row>[] splitsX = data.randomSplit(new double[]{0.8, 0.2}, 42);
Dataset<Row> trainingX = splitsX[0];
Dataset<Row> testX = splitsX[1];
LogisticRegression lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8);
LogisticRegressionModel lrModel = lr.fit(trainingX);
Dataset<Row> prediction = lrModel.transform(testX);
prediction.show();
}
public static void main(String[] args) {
getData();
}}
[Screenshot of the dataset omitted.]
Error message:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The input column features must be either string type or numeric type, but got org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.feature.StringIndexerBase$class.validateAndTransformSchema(StringIndexer.scala:86)
at org.apache.spark.ml.feature.StringIndexer.validateAndTransformSchema(StringIndexer.scala:109)
at org.apache.spark.ml.feature.StringIndexer.transformSchema(StringIndexer.scala:152)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:135)
My end goal is to get a predicted value using the features column.
Thanks in advance.
That error occurs when the input column of your dataframe to which you want to apply the StringIndexer transformation is a Vector. In the Spark documentation https://spark.apache.org/docs/latest/ml-features#stringindexer you can see that the input column must be a string. This transformer performs a distinct on that column and creates a new column with integers corresponding to each distinct string value. It does not work on vectors.
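In your code that means indexing the label column ("eventdeath"), not the assembled "features" vector. A minimal, untested sketch using the question's own column names and variables (the withLabel/assembled names are mine):
// keep the label column in the frame, but exclude it from the feature vector
Dataset<Row> withLabel = dataSubset.drop("Patient", "ID");
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(X.columns())      // feature columns only, no "eventdeath"
    .setOutputCol("features");
Dataset<Row> assembled = assembler.transform(withLabel);
// index the scalar label column, not the "features" vector
StringIndexer labelIndexer = new StringIndexer()
    .setInputCol("eventdeath")
    .setOutputCol("label");
Dataset<Row> data = labelIndexer.fit(assembled).transform(assembled);
LogisticRegression then picks up the default "features" and "label" columns, as in the rest of your code.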

java.lang.Double[] to double[] issue for Polynomial from CSV

First of all thanks for your help in advance.
I'm writing an investment algorithm and am currently pre-processing CSV historical data. The end goal for this part of the process is to create a symmetric covariance matrix of 2k x 2k / 2 (2 million) entries.
The Java class I'm writing takes a folder of CSVs, each with 8 fields of information, the key ones being Date, Time and Opening stock price. Date and time have been combined into one 'seconds from delta' time measure, and opening stock prices remain unchanged. The output CSV contains these two pieces of information plus a filename index for later referencing.
In order to create the covariance matrix, each stock on the NYSE must have a price value for every time; if values are missing, the matrix cannot be properly completed. Due to discrepancies between time entries in the historical training CSV, I have to use a polynomial function to estimate the missing values, which can then be fed into the next process in the chain.
My problem sounds fairly simple and should be easy to overcome (I'm probably being a massive idiot). The polynomial package I'm using takes two arrays of doubles (double[] x, double[] y), X being an array of the 'seconds past delta' time values of a particular stock and Y the corresponding prices. When I try to feed these in, I get a type error, because what I'm actually passing are 'java.lang.Double' objects. Can anyone help me with converting an array of the latter to an array of the former?
I realise there is a load of ridiculousness after the main print statement; that's just me tinkering, trying to miraculously change the type.
Again thanks for your time, I look forward to your replies!
Please find the relevant method below:
public void main(String filePath) throws IOException {
String index = filePath;
index = index.replace("/Users/louislimon/Desktop/Invest Algorithm/Data/Samples US Stock Data/data-1/5 min/us/nyse stocks/1/", "");
index = index.replace(".us.txt", "");
File fout = new File("/Users/louislimon/Desktop/Invest Algorithm/Data.csv");
FileOutputStream fos = new FileOutputStream(fout);
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fos));
Reader in = new FileReader(filePath);
Iterable<CSVRecord> records;
try {
records = CSVFormat.EXCEL.withSkipHeaderRecord(true).parse(in);
} catch ( IOException ex ) {
System.out.println ( "[ERROR] " + ex );
return;
}
ZoneId zoneId = ZoneId.of("America/New_York");
boolean tmp = true;
Instant firstInstant = null; // Track the baseline against which we calculate the increasing time
ArrayList<Double> timeVals = new ArrayList<Double>();
ArrayList<Double> priceVals = new ArrayList<Double>();
for ( CSVRecord record : records ) {
if(tmp){
tmp = false;
}
else {
//System.out.println(record.toString());
String dateInput = record.get(0);
String timeInput = record.get(1);
Double price = Double.parseDouble(record.get(2));
LocalDate date = LocalDate.parse(dateInput);
LocalTime time = LocalTime.parse(timeInput);
//Double price = Double.parseDouble(priceInput);
LocalDateTime ldt = LocalDateTime.of(date, time);
ZonedDateTime zdt = ldt.atZone(zoneId);
Instant instant = zdt.toInstant(); // Use Instant (moment on the timeline in UTC) for data storage, exchange, serialization, database, etc.
if (null == firstInstant) {
firstInstant = instant; // Capture the first instant.
}
Duration duration = Duration.between(firstInstant, instant);
Long deltaInSeconds = duration.getSeconds();
double doubleDeltaInSeconds = deltaInSeconds.doubleValue();
timeVals.add(doubleDeltaInSeconds);
priceVals.add(price);
//System.out.println("deltaInSeconds: " + deltaInSeconds + " | price: " + price + " | index: " + index);
}
Double [] timeValsArray = timeVals.toArray(new Double[timeVals.size()]);
Double [] priceValsArray = timeVals.toArray(new Double[priceVals.size()]);
Double[] timeFeed = new Double[timeVals.size()];
Double[] priceFeed = new Double[priceVals.size()];
for(int x = 0;x<timeVals.size(); x++) {
timeFeed[x] = new Double (timeValsArray[x].doubleValue());
priceFeed[x] = new Double (priceValsArray[x]);
}
PolynomialFunctionLagrangeForm pflf = new PolynomialFunctionLagrangeForm(timeFeed,priceFeed);
}
According to the documentation, the PolynomialFunctionLagrangeForm constructor takes two double[] arrays, not Double[].
Hence you need to create a raw array and pass that:
...
double[] timeFeed = new double[timeVals.size()];
double[] priceFeed = new double[priceVals.size()];
for(int x = 0; x < timeVals.size(); x++) {
timeFeed[x] = timeValsArray[x].doubleValue();
priceFeed[x] = priceValsArray[x].doubleValue();
}
...
See also How to convert an ArrayList containing Integers to primitive int array? for some alternative ways to convert an ArrayList<T> (where T is a wrapper for a primitive type) to the corresponding raw array T[].
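For example, with Java 8 streams the conversion (and the intermediate Double[] arrays) can be written in one line each; a minimal sketch using the two lists from your code:
double[] timeFeed = timeVals.stream().mapToDouble(Double::doubleValue).toArray();
double[] priceFeed = priceVals.stream().mapToDouble(Double::doubleValue).toArray();
PolynomialFunctionLagrangeForm pflf = new PolynomialFunctionLagrangeForm(timeFeed, priceFeed);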
Note that there is also obviously a typo in your code:
Double [] priceValsArray = timeVals.toArray(new Double[priceVals.size()]);
needs to be
Double [] priceValsArray = priceVals.toArray(new Double[priceVals.size()]);

RandomForest with Weka in Java

I am working on a project and I need some examples of how to implement RandomForest in Java with Weka. I did it with IBk() and it worked. If I do it with RandomForest in the same way, it does not work.
Does anyone have a simple example of how to implement RandomForest and how to get the probability for each class? (With IBk I used classifier.distributionForInstance(instance), and it returned the probabilities for each class.) How can I do that for RandomForest? Do I need to get the probability from every tree and combine them?
//example
ConverterUtils.DataSource source = new ConverterUtils.DataSource("..../edit.arff");
Instances dataset = source.getDataSet();
dataset.setClassIndex(dataset.numAttributes() - 1);
IBk classifier = new IBk(5); classifier.buildClassifier(dataset);
Instance instance = new SparseInstance(2);
instance.setValue(0, 65); //example data
instance.setValue(1, 120); //example data
double[] prediction = classifier.distributionForInstance(instance);
//now I get the probability for the first class
System.out.println("Prediction for the first class is: "+prediction[0]);
You can calculate the info gain while building the RandomForest model. It is much slower and requires a lot of memory while building the model. I am not so sure about the documentation. You can add options or set values while building the model.
//numFolds is the number of cross-validation folds, usually between 1-10
//br is your bufferReader
Instances trainData = new Instances(br);
trainData.setClassIndex(trainData.numAttributes() - 1);
RandomForest rf = new RandomForest();
rf.setNumTrees(50);
//You can set the options here
String[] options = new String[2];
options[0] = "-R";
rf.setOptions(options);
rf.buildClassifier(trainData);
weka.filters.supervised.attribute.AttributeSelection as = new weka.filters.supervised.attribute.AttributeSelection();
Ranker ranker = new Ranker();
InfoGainAttributeEval infoGainAttrEval = new InfoGainAttributeEval();
as.setEvaluator(infoGainAttrEval);
as.setSearch(ranker);
as.setInputFormat(trainData);
trainData = Filter.useFilter(trainData, as);
Evaluation evaluation = new Evaluation(trainData);
evaluation.crossValidateModel(rf, trainData, numFolds, new Random(1));
// Using HashMap to store the infogain values of the attributes
int count = 0;
Map<String, Double> infogainscores = new HashMap<String, Double>();
for (int i = 0; i < trainData.numAttributes(); i++) {
String t_attr = trainData.attribute(i).name();
//System.out.println(i+trainData.attribute(i).name());
double infogain = infoGainAttrEval.evaluateAttribute(i);
if(infogain != 0){
//System.out.println(t_attr + "= "+ infogain);
infogainscores.put(t_attr, infogain);
count = count+1;
}
}
//iterating over the hashmap
Iterator it = infogainscores.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pair = (Map.Entry)it.next();
System.out.println(pair.getKey()+" = "+pair.getValue());
it.remove(); // avoids a ConcurrentModificationException
}
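As for the probability part of the question: Weka's RandomForest supports distributionForInstance() just like IBk, returning class probabilities already averaged over the trees, so there is no need to combine them manually. A minimal sketch (assuming trainData is prepared with its class index set, as above):
RandomForest rf = new RandomForest();
rf.buildClassifier(trainData);                                        // train the forest
double[] dist = rf.distributionForInstance(trainData.instance(0));    // probabilities for one instance
for (int c = 0; c < dist.length; c++) {
    System.out.println("P(" + trainData.classAttribute().value(c) + ") = " + dist[c]);
}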

MongoDB - merge collection and map - can performance be improved

The function below merges the word MongoDB collection with the map content, like this:
Collection:
cat 3,
dog 5
Map:
dog 2,
zebra 1
Collection after merge:
cat 3,
dog 7,
zebra 1
We have an empty collection and a map with about 14000 elements.
An Oracle PL/SQL procedure using a single MERGE SQL statement, running on a 15k RPM HDD, does it in less than a second.
MongoDB on an SSD needs about 53 seconds.
It looks like Oracle prepares an in-memory image of the file operation and saves the result in one I/O operation.
MongoDB probably does 14000 I/Os - about 4 ms per insert, which corresponds with the performance of the SSD.
If I just do 14000 inserts, without checking for document existence as in the merge case, everything is also fast - less than a second.
My questions:
Can the code be improved?
Maybe it is necessary to do something with the MongoDB configuration?
Function code:
public void addBookInfo(String bookTitle, HashMap<String, Integer> bookInfo)
{
// insert information to the book collection
Document d = new Document();
d.append("book_title", bookTitle);
book.insertOne(d);
// insert information to the word collection
// prepare collection of word info and book_word info documents
List<Document> wordInfoToInsert = new ArrayList<Document>();
List<Document> book_wordInfoToInsert = new ArrayList<Document>();
for (String key : bookInfo.keySet())
{
Document d1 = new Document();
Document d2 = new Document();
d1.append("word", key);
d1.append("count", bookInfo.get(key));
wordInfoToInsert.add(d1);
d2.append("book_title", bookTitle);
d2.append("word", key);
d2.append("count", bookInfo.get(key));
book_wordInfoToInsert.add(d2);
}
// this is collection of insert/update DB operations
List<WriteModel<Document>> updates = new ArrayList<WriteModel<Document>>();
// iterator for collection of words
ListIterator<Document> listIterator = wordInfoToInsert.listIterator();
// generate list of insert/update operations
while (listIterator.hasNext())
{
d = listIterator.next();
String wordToUpdate = d.getString("word");
int countToAdd = d.getInteger("count").intValue();
updates.add(
new UpdateOneModel<Document>(
new Document("word", wordToUpdate),
new Document("$inc",new Document("count", countToAdd)),
new UpdateOptions().upsert(true)
)
);
}
// perform bulk operation
// this is slowly
BulkWriteResult bulkWriteResult = word.bulkWrite(updates);
boolean acknowledge = bulkWriteResult.wasAcknowledged();
if (acknowledge)
System.out.println("Write acknowledged.");
else
System.out.println("Write was not acknowledged.");
boolean countInfo = bulkWriteResult.isModifiedCountAvailable();
if (countInfo)
System.out.println("Change counters avaiable.");
else
System.out.println("Change counters not avaiable.");
int inserted = bulkWriteResult.getInsertedCount();
int modified = bulkWriteResult.getModifiedCount();
System.out.println("inserted: " + inserted);
System.out.println("modified: " + modified);
// insert information to the book_word collection
// this is very fast
book_word.insertMany(book_wordInfoToInsert);
}
