I'm getting different results every time I run my code - java

I'm using ELKI to cluster my data. I used KMeansLloyd<NumberVector> with k=3, but every time I run my Java code I get totally different clustering results. Is this normal, or is there something I should do to make my output stable? Here is my code, which I took from the ELKI tutorials:
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(a);
// Create a database (which may contain multiple relations!)
Database db = new StaticArrayDatabase(dbc, null);
// Load the data into the database (do NOT forget to initialize...)
db.initialize();
// Relation containing the number vectors:
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
// We know that the ids must be a continuous range:
DBIDRange ids = (DBIDRange) rel.getDBIDs();
// K-means should be used with squared Euclidean (least squares):
//SquaredEuclideanDistanceFunction dist = SquaredEuclideanDistanceFunction.STATIC;
CosineDistanceFunction dist = CosineDistanceFunction.STATIC;
// Default initialization, using global random:
// To fix the random seed, use: new RandomFactory(seed);
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
// Textbook k-means clustering:
KMeansLloyd<NumberVector> km = new KMeansLloyd<>(dist, //
    3 /* k - number of partitions */, //
    0 /* maximum number of iterations: no limit */, init);
// K-means will automatically choose a numerical relation from the data set,
// but we could make it explicit (if there were more than one numeric
// relation!): km.run(db, rel);
Clustering<KMeansModel> c = km.run(db);
// Output all clusters:
int i = 0;
for (Cluster<KMeansModel> clu : c.getAllClusters()) {
    // K-means will name all clusters "Cluster" in lack of noise support:
    System.out.println("#" + i + ": " + clu.getNameAutomatic());
    System.out.println("Size: " + clu.size());
    System.out.println("Center: " + clu.getModel().getPrototype().toString());
    // Iterate over objects:
    System.out.print("Objects: ");
    for (DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
        // To get the vector use:
        NumberVector v = rel.get(it);
        // Offset within our DBID range: "line number"
        final int offset = ids.getOffset(it);
        System.out.print(v + " " + offset);
        // Do NOT rely on using "internalGetIndex()" directly!
    }
    System.out.println();
    ++i;
}

I would say, since you are using RandomlyGeneratedInitialMeans, which is documented to:
Initialize k-means by generating random vectors (within the data set's value range).
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
Yes, it is normal.

K-Means is supposed to be initialized randomly. It is desirable to get different results when running it multiple times.
If you don't want this, use a fixed random seed.
From the code you copied and pasted:
// To fix the random seed, use: new RandomFactory(seed);
That is exactly what you should do...
long seed = 0;
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(
    new RandomFactory(seed));
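With a fixed seed, repeated runs over the same data with the same parameters produce identical initial means and therefore identical clusterings. A minimal sketch of the seeded initialization wired into the question's own pipeline (identifiers as above):
// Deterministic initialization plugged into the same k-means call:
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(new RandomFactory(0L));
KMeansLloyd<NumberVector> km = new KMeansLloyd<>(dist, 3, 0, init);
Clustering<KMeansModel> c = km.run(db); // same result on every run now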

This was too long for a comment. As @Idos stated, you are initializing the algorithm randomly; that's why you're getting random results. Now the question is: how do you ensure the results are robust? Try this:
Run the algorithm N times. Each time, record the cluster membership of each observation. When you are finished, classify each observation into the cluster that contained it most often. For example, suppose you have 3 observations, 3 classes, and run the algorithm 3 times:
obs  R1  R2  R3
  1   A   A   B
  2   B   B   B
  3   C   B   B
Then you should classify obs1 as A, since it was most often classified as A; classify obs2 as B, since it was always classified as B; and classify obs3 as B, since it was most often classified as B. The results should become increasingly stable the more times you run the algorithm.
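For illustration, here is a minimal Java sketch of this majority vote. The labels array is hypothetical, and note one practical caveat the scheme glosses over: raw k-means cluster labels are arbitrary from run to run, so you must first match clusters across runs (e.g. by centroid similarity) before voting.
static int[] majorityVote(int[][] labels, int k) {
    // labels[r][i] = cluster of observation i in run r, already aligned across runs
    int runs = labels.length, n = labels[0].length;
    int[] result = new int[n];
    for (int i = 0; i < n; i++) {
        int[] counts = new int[k];
        for (int r = 0; r < runs; r++) {
            counts[labels[r][i]]++;
        }
        int best = 0;
        for (int c = 1; c < k; c++) {
            if (counts[c] > counts[best]) best = c;
        }
        result[i] = best; // the cluster that contained observation i most often
    }
    return result;
}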


Generate range of promotion codes that are not guessable

I'm looking for a way to generate a range of promotion codes. It would be trivial if it weren't for both of these requirements: it needs to be a range (not saving every single promotion code in a database) to make it fast, and it must not be guessable, so it cannot generate codes like 000-000-001, 000-000-002, 000-000-003, and so on.
Is there an algorithm that solves this problem? I could try to solve it with some sort of hashing, but trying to solve a security problem myself might leave the service open to exploits that I didn't think about.
I think your first requirement (not saving every promotional code in a database) is problematic.
The question is: is it allowed to redeem a single promotional code multiple times?
If this is not allowed, then you have to store the already redeemed codes in some persistent data store anyway, so why not store the generated codes in the persistent data store from the beginning, together with a flag indicating whether each has been redeemed?
If you don't want to store all codes / can't store all codes, you could still use a Random with a seed unique to your current campaign:
long seed = 20190921065347L; // identifies your current campaign
Random r = new Random(seed);
for (int i = 0; i < numCodes; i++) {
    System.out.println(r.nextLong());
}
or
long seed = 20190921065347L; // identifies your current campaign
Random r = new Random(seed);
r.longs(numCodes, 100_000_000_000_000L, 1_000_000_000_000_000L)
    .forEach(System.out::println);
To find out whether a code is valid you can generate the same codes again:
long seed = 20190921065347L; // identifies your current campaign
Random r = new Random(seed);
System.out.println(
    r.longs(numCodes, 100_000_000_000_000L, 1_000_000_000_000_000L)
        .anyMatch(l -> l == 350160558695557L));
Would something like this work?
Random r = new Random();
long start = 1_000_000_000;
long end = 10_000_000_000L;
// draw one random 10-digit number from [start, end)
long n = r.longs(1, start, end).reduce(0, (a, b) -> b);
// format with thousands separators, then turn the commas into dashes, e.g. 1-234-567-890
String s = String.format("%,d", n).replace(",", "-");
System.out.println(s);
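One caveat about the formatting trick above: String.format("%,d", n) uses the default locale's grouping separator, which is not a comma everywhere, so the replace(",", "-") call can silently do nothing. A locale-pinned variant:
String s = String.format(java.util.Locale.US, "%,d", n).replace(",", "-"); // e.g. 1-234-567-890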

Apache Flink: The execution environment and multiple sink

My question might cause some confusion, so please see the Description first; it might help identify my problem. I will add my code at the end of the question (any suggestions regarding my code structure/implementation are also welcome).
Thank you in advance for any help!
My question:
How do I define multiple sinks in Flink batch processing without having it get data from one source repeatedly?
What is the difference between createCollectionsEnvironment() and getExecutionEnvironment()? Which one should I use in a local environment?
What is the use of env.execute()? My code outputs the result without this statement; if I add it, it throws an exception:
Exception in thread "main" java.lang.RuntimeException: No new data sinks have been defined since the last execution. The last execution refers to the latest call to 'execute()', 'count()', 'collect()', or 'print()'.
at org.apache.flink.api.java.ExecutionEnvironment.createProgramPlan(ExecutionEnvironment.java:940)
at org.apache.flink.api.java.ExecutionEnvironment.createProgramPlan(ExecutionEnvironment.java:922)
at org.apache.flink.api.java.CollectionEnvironment.execute(CollectionEnvironment.java:34)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:816)
at MainClass.main(MainClass.java:114)
Description:
I'm new to programming. Recently I've needed to process some data (grouping data, calculating standard deviations, etc.) using Flink batch processing.
However, I came to a point where I need to output two DataSets.
The structure was something like this:
From source (database) -> DataSet 1 (add index using zipWithIndex()) -> DataSet 2 (do some calculation while keeping the index) -> DataSet 3
First I output DataSet 2; the index is e.g. from 1 to 10000.
Then I output DataSet 3, and the index becomes 10001 to 20000, although I did not change the values in any function.
My guess is that when outputting DataSet 3, instead of using the previously calculated DataSet 2, Flink starts getting data from the database again and then performs the calculation.
With the use of the zipWithIndex() function, this not only gives the wrong index numbers but also increases the number of connections to the database.
I guess this is related to the execution environment: when I use
ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
I get the "wrong" index numbers (10001-20000),
while
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
gives the correct index numbers (1-10000).
The time taken and the number of database connections also differ, and the order of printing is reversed.
OS, DB, other environment details and versions:
IntelliJ IDEA 2017.3.5 (Community Edition)
Build #IC-173.4674.33, built on March 6, 2018
JRE: 1.8.0_152-release-1024-b15 amd64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Windows 10 10.0
My test code (Java):
public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
    // Table is used to calculate the standard deviation, as I figured that there is no such calculation in DataSet.
    BatchTableEnvironment tableEnvironment = TableEnvironment.getTableEnvironment(env);
    // Get data from a MySQL database
    DataSet<Row> dbData =
        env.createInput(
            JDBCInputFormat.buildJDBCInputFormat()
                .setDrivername("com.mysql.cj.jdbc.Driver")
                .setDBUrl($database_url)
                .setQuery("select value from $table_name where id =33")
                .setUsername("username")
                .setPassword("password")
                .setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.DOUBLE_TYPE_INFO))
                .finish()
        );
    // Add an index for assigning groups (group capacity is 5)
    DataSet<Tuple2<Long, Row>> indexedData = DataSetUtils.zipWithIndex(dbData);
    // Replace the index (long) with a group number (int), and convert Row to double at the same time
    DataSet<Tuple2<Integer, Double>> rawData = indexedData.flatMap(new GroupAssigner());
    // Use groupBy() to combine the individual data of each group into a list, while calculating
    // the mean and range of each group; put them into a POJO named GroupDataClass
    DataSet<GroupDataClass> groupDS = rawData.groupBy("f0").combineGroup(new GroupCombineFunction<Tuple2<Integer, Double>, GroupDataClass>() {
        @Override
        public void combine(Iterable<Tuple2<Integer, Double>> iterable, Collector<GroupDataClass> collector) {
            Iterator<Tuple2<Integer, Double>> it = iterable.iterator();
            Tuple2<Integer, Double> var1 = it.next();
            int groupNum = var1.f0;
            // Use max and min to calculate the range; use i and sum to calculate the mean
            double max = var1.f1;
            double min = max;
            double sum = max; // include the first value in the sum
            int i = 1;
            // The list stores the individual values
            List<Double> list = new ArrayList<>();
            list.add(max);
            while (it.hasNext()) {
                double next = it.next().f1;
                sum += next;
                i++;
                max = next > max ? next : max;
                min = next < min ? next : min;
                list.add(next);
            }
            // Store the group number, mean, range, and the 5 individual values within the group
            collector.collect(new GroupDataClass(groupNum, sum / i, max - min, list));
        }
    });
    // print, because if no sink is created, Flink will not even perform the calculation.
    groupDS.print();
    // Get the max group number and the range in each group to calculate the average range.
    // If group numbers start with 1, then the maximum group number equals the number of groups.
    // However, because this is the second sink, data will flow from the source again, which doubles the group numbers.
    DataSet<Tuple2<Integer, Double>> rangeDS = groupDS.map(new MapFunction<GroupDataClass, Tuple2<Integer, Double>>() {
        @Override
        public Tuple2<Integer, Double> map(GroupDataClass in) {
            return new Tuple2<>(in.groupNum, in.range);
        }
    }).max(0).andSum(1);
    // collect and print, as if no sink is created, Flink will not even perform the calculation.
    Tuple2<Integer, Double> rangeTuple = rangeDS.collect().get(0);
    double range = rangeTuple.f1 / rangeTuple.f0;
    System.out.println("range = " + range);
}

public static class GroupAssigner implements FlatMapFunction<Tuple2<Long, Row>, Tuple2<Integer, Double>> {
    @Override
    public void flatMap(Tuple2<Long, Row> input, Collector<Tuple2<Integer, Double>> out) {
        // zipWithIndex starts at 0, so indices 0-4 go to group 1, indices 5-9 to group 2, etc.
        int n = (int) (input.f0 / 5) + 1;
        out.collect(new Tuple2<>(n, (Double) input.f1.getField(0)));
    }
}
It's fine to connect a source to multiple sinks; the source gets executed only once and the records get broadcast to the multiple sinks. See this question: Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?
getExecutionEnvironment is the right way to get the environment when you want to run your job. createCollectionsEnvironment is a good way to play around and test. See the documentation.
The exception's error message is very clear: if you call print or collect, your data flow gets executed. So you have two choices:
Either you call print/collect at the end of your data flow, and it gets executed and printed. That's good for testing. Bear in mind you can only call collect/print once per data flow, otherwise it gets executed many times while it's not completely defined.
Or you add a sink at the end of your data flow and call env.execute(). That's what you want to do once your flow is in a more mature shape; a minimal sketch follows.
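For illustration, a sketch of the second option under the question's setup (the output paths are hypothetical):
// Replace print()/collect() with real sinks, then execute once:
groupDS.writeAsText("file:///tmp/groups");  // hypothetical path
rangeDS.writeAsText("file:///tmp/range");   // hypothetical path
// A single execute() covers both sinks, so the source is read only once.
env.execute("two-sink batch job");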

Neural networks and large data sets

I have a basic framework for a neural network to recognize numeric digits, but I'm having some problems training it. My back-propagation works for small data sets, but when I have more than 50 data points, the return value starts converging to 0. And when I have data sets in the thousands, I get NaNs for costs and returns.
Basic structure: 3 layers: 784 : 15 : 1
784 is the number of pixels per image, 15 is the number of neurons in my hidden layer, and there is one output neuron which returns a value from 0 to 1 (multiply by 10 and you get a digit).
public class NetworkManager {
    int inputSize;
    int hiddenSize;
    int outputSize;
    public Matrix W1;
    public Matrix W2;

    public NetworkManager(int input, int hidden, int output) {
        inputSize = input;
        hiddenSize = hidden;
        outputSize = output;
        W1 = new Matrix(inputSize, hiddenSize);
        W2 = new Matrix(hiddenSize, output);
    }

    Matrix z2, z3;
    Matrix a2;

    public Matrix forward(Matrix X) {
        z2 = X.dot(W1);
        a2 = sigmoid(z2);
        z3 = a2.dot(W2);
        Matrix yHat = sigmoid(z3);
        return yHat;
    }

    public double costFunction(Matrix X, Matrix y) {
        Matrix yHat = forward(X);
        Matrix cost = yHat.sub(y);
        cost = cost.mult(cost);
        double returnValue = 0;
        int i = 0;
        while (i < cost.m.length) {
            returnValue += cost.m[i][0];
            i++;
        }
        return returnValue;
    }

    Matrix yHat;

    public Matrix[] costFunctionPrime(Matrix X, Matrix y) {
        yHat = forward(X);
        Matrix delta3 = (yHat.sub(y)).mult(sigmoidPrime(z3));
        Matrix dJdW2 = a2.t().dot(delta3);
        Matrix delta2 = (delta3.dot(W2.t())).mult(sigmoidPrime(z2));
        Matrix dJdW1 = X.t().dot(delta2);
        return new Matrix[]{dJdW1, dJdW2};
    }
}
There's the code for network framework. I pass double arrays of length 784 into the forward method.
int t = 0;
Matrix[] dJdW;
while (t < 10000) {
    dJdW = Nn.costFunctionPrime(X, y);
    Nn.W1 = Nn.W1.sub(dJdW[0].scalar(3));
    Nn.W2 = Nn.W2.sub(dJdW[1].scalar(3));
    t++;
}
I call this to adjust the weights. With small sets, the cost converges to 0 pretty well, but larger sets don't (the cost associated with 100 characters always converges to 13). And if the set is too large, the first adjustment works (and the cost goes down), but after the second all I get is NaN.
Why does this implementation fail with larger data sets (specifically training), and how can I fix it? I tried a similar structure with 10 outputs instead of 1, where each would return a value near 0 or 1 acting like a boolean, but the same thing happened.
I'm doing this in Java, by the way, and I'm wondering if that has something to do with the problem. I wondered if it was a problem with running out of space, but I haven't been getting any heap-space messages. Is there a problem with how I'm back-propagating, or is something else happening?
EDIT: I think I know what's happening. I think my back-propagation function is getting caught in local minima. Sometimes the training succeeds and sometimes it fails for large data sets. Because I'm starting with random weights, I get random initial costs. What I've noticed is that when the cost initially exceeds a certain amount (depending on the number of data points involved), the cost converges to a clean number (sometimes 27, sometimes 17.4) and the outputs converge to 0 (which makes sense).
I was warned about relative minima in the cost function when I began, and I'm beginning to realize why. So now the question becomes: how do I perform gradient descent so that I actually find the global minimum? I'm working in Java, by the way.
This seems like a problem with weight initialization.
As far as I can see, you never initialize the weights to any specific value. Therefore the network diverges. You should at least use random initialization.
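For illustration, a minimal random-initialization sketch, assuming the question's Matrix class exposes its values as a public double[][] named m (as the costFunction code suggests); it could go at the end of the constructor:
java.util.Random rnd = new java.util.Random();
for (int r = 0; r < W1.m.length; r++) {
    for (int c = 0; c < W1.m[r].length; c++) {
        // small random weights centered on zero break the symmetry between neurons
        W1.m[r][c] = (rnd.nextDouble() - 0.5) * 0.1;
    }
}
// ...and the same for W2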
The fact that your backprop works on a small data set is a good indication that there isn't a fundamental problem. If you're suspicious about it, you can test your BP on the XOR problem.
Do your units have bias terms?
I once discussed with someone who was doing exactly the same thing: handwritten digit recognition with 15 units in the hidden layer. I saw a network that did this task well. Its topology was:
Input: 784
First hidden: 500
Second hidden: 500
Third hidden: 2000
Output: 10
You have a set of images, and you nonlinearly transform the 784 pixels of each image into 15 numbers in the <0, 1> interval, doing this for all images in your set. You hope that you can correctly separate the digits based on these 15 numbers. In my view, 15 hidden units is too few for such a task, given that you have a data set with thousands of examples. Try, for example, 500 hidden units.
The learning rate also influences backprop and can cause convergence problems.

Need some help for deeplearning4j single RBM usage

I have a bunch of sensors and I really just want to reconstruct the input.
So what I want is this:
after I have trained my model I will pass in my feature matrix
get the reconstructed feature matrix back
I want to investigate which sensor values are completely different from the reconstructed value
Therefore I thought an RBM would be the right choice, and since I am used to Java, I tried deeplearning4j. But I got stuck very early. If you run the following code, I am facing two problems:
The result is far from a correct prediction; most of the values are simply [1.00, 1.00, 1.00].
I would expect to get back 4 values (the number of inputs that are supposed to be reconstructed).
So what do I have to tune to (a) get a better result and (b) get the reconstructed inputs back?
public static void main(String[] args) {
    // Customizing params
    Nd4j.MAX_SLICES_TO_PRINT = -1;
    Nd4j.MAX_ELEMENTS_PER_SLICE = -1;
    Nd4j.ENFORCE_NUMERICAL_STABILITY = true;
    final int numRows = 4;
    final int numColumns = 1;
    int outputNum = 3;
    int numSamples = 150;
    int batchSize = 150;
    int iterations = 100;
    int seed = 123;
    int listenerFreq = iterations / 5;

    DataSetIterator iter = new IrisDataSetIterator(batchSize, numSamples);
    // Loads data into generator and format consumable for NN
    DataSet iris = iter.next();
    iris.normalize();
    //iris.scale();
    System.out.println(iris.getFeatureMatrix());

    NeuralNetConfiguration conf = new NeuralNetConfiguration.Builder()
        // Gaussian for visible; Rectified for hidden
        // Set contrastive divergence to 1
        .layer(new RBM.Builder()
            .nIn(numRows * numColumns) // Input nodes
            .nOut(outputNum) // Output nodes
            .activation("tanh") // Activation function type
            .weightInit(WeightInit.XAVIER) // Weight initialization
            .lossFunction(LossFunctions.LossFunction.XENT)
            .updater(Updater.NESTEROVS)
            .build())
        .seed(seed) // Locks in weight initialization for tuning
        .iterations(iterations)
        .learningRate(1e-1f) // Backprop step size
        .momentum(0.5) // Speed of modifying learning rate
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT) // ^^ Calculates gradients
        .build();

    Layer model = LayerFactories.getFactory(conf.getLayer()).create(conf);
    model.setListeners(Arrays.asList((IterationListener) new ScoreIterationListener(listenerFreq)));
    model.fit(iris.getFeatureMatrix());
    System.out.println(model.activate(iris.getFeatureMatrix(), false));
}
For (b): when you call activate(), you get a list of "nlayers" arrays. Every array in the list is the activation for one layer. Each array is composed of rows: one row per input vector, and each column contains the activation of one neuron in that layer for that observation (input).
Once all layers have been activated with some input, you can get the reconstruction with the RBM.propDown() method.
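For instance, a rough sketch of that reconstruction step, assuming the question's setup and the old deeplearning4j Layer API it uses (the cast targets the RBM layer implementation class, not the RBM.Builder config class; not verified against a specific dl4j version):
// Propagate the input up to the hidden layer, then back down to visible space.
INDArray hidden = model.activate(iris.getFeatureMatrix(), false);
INDArray reconstructed = ((RBM) model).propDown(hidden);
System.out.println(reconstructed); // expect one row per sample, 4 columns (the reconstructed inputs)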
As for (a), I'm afraid it's very tricky to train an RBM correctly. You really want to play with every parameter and, more importantly, monitor various metrics during training that will give you some hint about whether it's training correctly or not. Personally, I like to plot:
The score() on the training corpus, which is the reconstruction error after every gradient update; check that it decreases.
The score() on another development corpus, which is useful to be warned when overfitting occurs.
The norm of the parameter vector, which has a large impact on the score.
Both activation maps (i.e., an XY rectangular plot of the activated neurons of one layer over the corpus), just after initialization and after N steps: this helps detect unreliable training (e.g., when everything is black/white, or when a large fraction of the neurons are never activated).

Reading value from field variable

I am developing a desktop app in Java 7, and I have a situation. In the method below,
private synchronized void decryptMessage(CopyOnWriteArrayList<Integer> possibleKeys,
        ArrayList<Integer> cipherDigits) {
    // apply the opposite shift algorithm:
    ArrayList<Integer> textDigits = shiftCipher(possibleKeys, cipherDigits);
    // compute the chi-squared statistic:
    double chi = countCHIstatistics(textDigits);
    if (chi < edgeCHI) // if chi is smaller than the best value seen so far
    {
        System.err.println(chi + " " + possibleKeys + " +");
        key = possibleKeys; // store the most suitable key
        edgeCHI = chi;
    }
}
I compute the value called 'chi', and based on that, if 'chi' is less than the 'edgeCHI' value, I save the key in an instance variable. The method is invoked by several threads, so I enforce synchronization.
When all the threads complete, the program continues by passing control to a method which controls the sequence of operations. Then this line is executed in that method:
System.err.println(edgeCHI+" "+key+" -");
It prints the correct value of 'chi' (the same as the last value printed in the decryptMessage method), but the value of key is different. The decryptMessage method is invoked by threads which generate the key values.
I store the key value as an instance variable:
private volatile CopyOnWriteArrayList<Integer> key = null; // stores the most suitable key for decryption.
Why do I get two different key values? The values themselves are not important. What matters is that the value of key printed at the last call of decryptMessage (when chi < edgeCHI) must match the one printed in the method which controls the flow of operations.
This is how you create threads:
for (int y = 0; y < mostOccuringL.length; y++) { // iterate through the five most frequent letters
    for (int i = (y + 1); i < mostOccuringL.length; i++) { // perform letter combinations
        int[] combinations = new int[2];
        combinations[0] = y;
        combinations[1] = i;
        new KeyMembers("" + y + ":" + i, combinations, keywords, intKeyIndex, cipherDigits).t.join();
    }
}
Within the run method, decryptMessage is invoked in order to identify the most feasible decryption key.
I have been trying to figure out the problem for two days, but I don't get it.
Suggestions?
Relying on System.err (or System.out) printing to determine an order of execution is dangerous, especially in multi-threaded environments. There is absolutely no guarantee when the printing actually occurs or whether the printed messages are in order. Maybe what you see as the "last" printed message of one of the threads wasn't the "last" thread modifying the key field. You cannot tell that by looking only at the stderr output.
What you could do is use a synchronized setter for the key field that increases an associated modification counter whenever the field is modified, and print the new value along with the modification count. This way you avoid the pitfalls of stderr printing and can reliably determine what the last set value was, e.g.:
private long keyModCount = 0;

private synchronized long update(CopyOnWriteArrayList<Integer> possibleKeys, double chi) {
    this.key = possibleKeys;
    this.edgeCHI = chi; // how is edgeCHI declared? Also volatile?
    this.keyModCount++;
    return this.keyModCount;
}
And inside decryptMessage:
if (chi < edgeCHI) // if chi is smaller than the best value seen so far
{
    long sequence = update(possibleKeys, chi);
    System.err.println("[" + sequence + "] " + chi + " " + possibleKeys + " +");
}
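The printed sequence numbers then give you a reliable total order of the modifications to key, independent of when, and in what order, the stderr lines actually appear.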
To provide an answer we would need to see more of the (simplified if necessary) code that controls the thread execution.
A solution has been found: I changed the CopyOnWriteArrayList data type to ArrayList at the point where the field variable gets the correct key. It works as expected now.
