I'd like to use the Stanford Classifier for text classification. My features are mostly textual, but there are some numeric features as well (e.g. the length of a sentence).
I started off with the ClassifierExample and replaced the original features with a simple real-valued feature F whose value is 100 if a stop light is BROKEN and 0.1 otherwise, which results in the following code (apart from the makeStopLights() function, this is just the code of the original ClassifierExample class):
public class ClassifierExample {
protected static final String GREEN = "green";
protected static final String RED = "red";
protected static final String WORKING = "working";
protected static final String BROKEN = "broken";
private ClassifierExample() {} // not instantiable
// the definition of this function was changed!!
protected static Datum<String,String> makeStopLights(String ns, String ew) {
String label = (ns.equals(ew) ? BROKEN : WORKING);
Counter<String> counter = new ClassicCounter<>();
counter.setCount("F", (label.equals(BROKEN)) ? 100 : 0.1);
return new RVFDatum<>(counter, label);
}
public static void main(String[] args) {
// Create a training set
List<Datum<String,String>> trainingData = new ArrayList<>();
trainingData.add(makeStopLights(GREEN, RED));
trainingData.add(makeStopLights(GREEN, RED));
trainingData.add(makeStopLights(GREEN, RED));
trainingData.add(makeStopLights(RED, GREEN));
trainingData.add(makeStopLights(RED, GREEN));
trainingData.add(makeStopLights(RED, GREEN));
trainingData.add(makeStopLights(RED, RED));
// Create a test set
Datum<String,String> workingLights = makeStopLights(GREEN, RED);
Datum<String,String> brokenLights = makeStopLights(RED, RED);
// Build a classifier factory
LinearClassifierFactory<String,String> factory = new LinearClassifierFactory<>();
factory.useConjugateGradientAscent();
// Turn on per-iteration convergence updates
factory.setVerbose(true);
//Small amount of smoothing
factory.setSigma(10.0);
// Build a classifier
LinearClassifier<String,String> classifier = factory.trainClassifier(trainingData);
// Check out the learned weights
classifier.dump();
// Test the classifier
System.out.println("Working instance got: " + classifier.classOf(workingLights));
classifier.justificationOf(workingLights);
System.out.println("Broken instance got: " + classifier.classOf(brokenLights));
classifier.justificationOf(brokenLights);
}
}
In my understanding of linear classifiers, feature F should make the classification task pretty easy - after all, we just need to check whether the value of F is greater than some threshold. However, the classifier returns WORKING on every instance in the test set.
Now my question is: Have I done something wrong? Do I need to change other parts of the code as well for real-valued features to work, or is there something wrong with my understanding of linear classifiers?
Your code looks fine. Note that typically with a Maximum Entropy classifier you provide binary valued features (1 or 0).
Here is some more reading on Maximum Entropy classifiers: http://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers
Look at the slide titled "Feature-Based Linear Classifiers" to see the specific probability calculation for Maximum Entropy classifiers.
Here is the formula for your example case with 1 feature and 2 classes ("works", "broken"):
probability(c1) = exp(w1 * f1) / total
probability(c2) = exp(w2 * f1) / total
total = exp(w1 * f1) + exp(w2 * f1)
w1 is the learned weight for "works" and w2 is the learned weight for "broken".
The classifier selects the class with the higher probability. Note that f1 is your feature value (either 100 or 0.1).
If you consider your specific example data, since you have 2 classes, 1 feature, and a feature value that is always positive, it is not possible to build a maximum entropy classifier that separates that data; it will always guess all one way or the other.
For the sake of argument, say w1 > w2.
Say v > 0 is your feature value (either 100 or 0.1).
Then w1 * v > w2 * v, thus exp(w1 * v) > exp(w2 * v), so you'll always assign more probability to "works" regardless of the value of v.
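To make that concrete, one way around it (a sketch only, not part of the original example) is to add a constant bias feature alongside F, so the model has an intercept and can effectively learn a threshold on F. The feature name "BIAS" is just illustrative:

protected static Datum<String,String> makeStopLights(String ns, String ew) {
    String label = (ns.equals(ew) ? BROKEN : WORKING);
    Counter<String> counter = new ClassicCounter<>();
    // real-valued feature, as before
    counter.setCount("F", label.equals(BROKEN) ? 100 : 0.1);
    // constant bias/intercept feature, so the model can learn a threshold on F
    counter.setCount("BIAS", 1.0);
    return new RVFDatum<>(counter, label);
}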
For the past week or so, I have been trying to get a neural network to work with RGB images, but no matter what I do it seems to predict only one class.
I have read all the links I could find on people encountering this problem and experimented with a lot of different things, but it always ends up predicting only one of the two output classes. I have checked the batches going into the model, increased the size of the dataset, increased the original image size (28*28) to 56*56, increased the number of epochs, done a lot of model tuning, and even tried a simple non-convolutional neural network as well as dumbing down my own CNN model, yet nothing changes.
I have also checked the structure of how the data is passed in for the training set (specifically ImageRecordReader), but this input structure (in terms of folder layout and how the data is passed into the training set) works perfectly when given gray-scale images (it was originally built for MNIST and achieved 99% accuracy on that dataset).
Some context: I use the following folder names as my labels, i.e. folder(0) and folder(1), for both training and testing data, as there will only be two output classes. The training set contains 320 images of class 0 and 240 images of class 1, whereas the testing set is made up of 79 and 80 images respectively.
Code below:
private static final Logger log = LoggerFactory.getLogger(MnistClassifier.class);
private static final String basePath = System.getProperty("java.io.tmpdir") + "/ISIC-Images";
public static void main(String[] args) throws Exception {
int height = 56;
int width = 56;
int channels = 3; // RGB Images
int outputNum = 2; // 2 digit classification
int batchSize = 1;
int nEpochs = 1;
int iterations = 1;
int seed = 1234;
Random randNumGen = new Random(seed);
// vectorization of training data
File trainData = new File(basePath + "/Training");
FileSplit trainSplit = new FileSplit(trainData, NativeImageLoader.ALLOWED_FORMATS, randNumGen);
ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator(); // parent path as the image label
ImageRecordReader trainRR = new ImageRecordReader(height, width, channels, labelMaker);
trainRR.initialize(trainSplit);
DataSetIterator trainIter = new RecordReaderDataSetIterator(trainRR, batchSize, 1, outputNum);
// vectorization of testing data
File testData = new File(basePath + "/Testing");
FileSplit testSplit = new FileSplit(testData, NativeImageLoader.ALLOWED_FORMATS, randNumGen);
ImageRecordReader testRR = new ImageRecordReader(height, width, channels, labelMaker);
testRR.initialize(testSplit);
DataSetIterator testIter = new RecordReaderDataSetIterator(testRR, batchSize, 1, outputNum);
log.info("Network configuration and training...");
Map<Integer, Double> lrSchedule = new HashMap<>();
lrSchedule.put(0, 0.06); // iteration #, learning rate
lrSchedule.put(200, 0.05);
lrSchedule.put(600, 0.028);
lrSchedule.put(800, 0.0060);
lrSchedule.put(1000, 0.001);
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(seed)
.l2(0.0008)
.updater(new Nesterovs(new MapSchedule(ScheduleType.ITERATION, lrSchedule)))
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.weightInit(WeightInit.XAVIER)
.list()
.layer(0, new ConvolutionLayer.Builder(5, 5)
.nIn(channels)
.stride(1, 1)
.nOut(20)
.activation(Activation.IDENTITY)
.build())
.layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
.kernelSize(2, 2)
.stride(2, 2)
.build())
.layer(2, new ConvolutionLayer.Builder(5, 5)
.stride(1, 1)
.nOut(50)
.activation(Activation.IDENTITY)
.build())
.layer(3, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
.kernelSize(2, 2)
.stride(2, 2)
.build())
.layer(4, new DenseLayer.Builder().activation(Activation.RELU)
.nOut(500).build())
.layer(5, new OutputLayer.Builder(LossFunctions.LossFunction.SQUARED_LOSS)
.nOut(outputNum)
.activation(Activation.SOFTMAX)
.build())
.setInputType(InputType.convolutionalFlat(56, 56, 3)) // InputType.convolutional for normal image
.backprop(true).pretrain(false).build();
MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();
net.setListeners(new ScoreIterationListener(10));
log.debug("Total num of params: {}", net.numParams());
// evaluation while training (the score should go down)
for (int i = 0; i < nEpochs; i++) {
net.fit(trainIter);
log.info("Completed epoch {}", i);
Evaluation eval = net.evaluate(testIter);
log.info(eval.stats());
trainIter.reset();
testIter.reset();
}
ModelSerializer.writeModel(net, new File(basePath + "/Isic.model.zip"), true);
}
Output from running the model was attached as two images: the odd iteration scores and the evaluation metrics.
Any insight would be much appreciated.
I would suggest changing the activation functions in the convolution layers (layer 0 and layer 2) to a non-linear function. You may try the ReLU and Tanh functions.
You may refer to this Documentation for a list of available activation functions.
Identity activations on CNNs almost never make sense. Stick to ReLU if you can.
I would instead shift your efforts towards gradient normalization or interspersing dropout layers. Almost every time a CNN doesn't learn, it's usually due to a lack of regularization.
Also: never use squared loss with softmax. It never works. Stick to negative log likelihood.
I've never seen squared loss used with softmax in practice.
You can try L2 and L1 regularization (or both: this is called elastic net regularization).
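Roughly, the suggested changes would look like this in the question's own builder chain (a sketch only; I haven't run this exact configuration): non-linear activations in the convolution layers and negative log likelihood instead of squared loss for the softmax output.

// convolution layer: non-linear activation instead of identity
.layer(0, new ConvolutionLayer.Builder(5, 5)
        .nIn(channels)
        .stride(1, 1)
        .nOut(20)
        .activation(Activation.RELU)   // was Activation.IDENTITY
        .build())
// ... layer 2 changed the same way ...
// output layer: negative log likelihood instead of squared loss
.layer(5, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .nOut(outputNum)
        .activation(Activation.SOFTMAX)
        .build())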
It seems that using an Adam optimizer gave some promising results, as did increasing the batch size (I now have thousands of images); otherwise the net requires an absurd number of epochs (at least 50+) to begin learning.
Thank you for all the responses regardless.
I have a basic framework for a neural network to recognize numeric digits, but I'm having some problems training it. My back-propagation works for small data sets, but when I have more than 50 data points, the return value starts converging to 0. And when I have data sets in the thousands, I get NaNs for costs and returns.
Basic structure: 3 layers: 784 : 15 : 1
784 is the number of pixels per input image, 15 is the number of neurons in my hidden layer, and there is one output neuron which returns a value from 0 to 1 (multiply by 10 and you get a digit).
public class NetworkManager {
int inputSize;
int hiddenSize;
int outputSize;
public Matrix W1;
public Matrix W2;
public NetworkManager(int input, int hidden, int output) {
inputSize = input;
hiddenSize = hidden;
outputSize = output;
W1 = new Matrix(inputSize, hiddenSize);
W2 = new Matrix(hiddenSize, output);
}
Matrix z2, z3;
Matrix a2;
public Matrix forward(Matrix X) {
z2 = X.dot(W1);
a2 = sigmoid(z2);
z3 = a2.dot(W2);
Matrix yHat = sigmoid(z3);
return yHat;
}
public double costFunction(Matrix X, Matrix y) {
Matrix yHat = forward(X);
Matrix cost = yHat.sub(y);
cost = cost.mult(cost);
double returnValue = 0;
int i = 0;
while (i < cost.m.length) {
returnValue += cost.m[i][0];
i++;
}
return returnValue;
}
Matrix yHat;
public Matrix[] costFunctionPrime(Matrix X, Matrix y) {
yHat = forward(X);
Matrix delta3 = (yHat.sub(y)).mult(sigmoidPrime(z3));
Matrix dJdW2 = a2.t().dot(delta3);
Matrix delta2 = (delta3.dot(W2.t())).mult(sigmoidPrime(z2));
Matrix dJdW1 = X.t().dot(delta2);
return new Matrix[]{dJdW1, dJdW2};
}
}
That's the code for the network framework. I pass double arrays of length 784 into the forward method.
int t = 0;
while (t < 10000) {
dJdW = Nn.costFunctionPrime(X, y);
Nn.W1 = Nn.W1.sub(dJdW[0].scalar(3));
Nn.W2 = Nn.W2.sub(dJdW[1].scalar(3));
t++;
}
I call this to adjust the weights. With small sets, the cost converges to 0 pretty well, but larger sets don't (the cost associated with 100 characters converges to 13, always). And if the set is too large, the first adjustment works (and costs go down) but after the second all I can get is NaN.
Why does this implementation fail with larger data sets (specifically training) and how can I fix this? I tried a similar structure with 10 outputs instead of 1 where each would return a value near 0 or 1 acting like boolean values, but the same thing was happening.
I'm also doing this in Java, by the way, and I'm wondering if that has something to do with the problem. I wondered if it was a problem with running out of space, but I haven't been getting any heap space messages. Is there a problem with how I'm back-propagating, or is something else happening?
EDIT: I think I know what's happening. I think my backpropagation function is getting caught in local minima. Sometimes the training succeeds and sometimes it fails for large data sets. Because I'm starting with random weights, I get random initial costs. What I've noticed is that when the cost initially exceeds a certain amount (which depends on the number of data points involved), the costs converge to a clean number (sometimes 27, other times 17.4) and the outputs converge to 0 (which makes sense).
I was warned about relative minima in the cost function when I began, and I'm beginning to realize why. So now the question becomes: how do I go about my gradient descent so that I'll actually find the global minimum? I'm working in Java, by the way.
This seems like a problem with weight initialization.
As far as I can see, you never initialize the weights to any specific values. As a result, the network diverges. You should at least use random initialization.
If your backprop works on a small dataset, that is a reasonably good sign that there isn't a problem with it. If you are suspicious about it, you can test your BP on the XOR problem.
Do your units have bias terms?
I once discussed this with someone doing exactly the same thing: handwritten digit recognition with 15 units in the hidden layer. I have seen a network that did this task well. Its topology was:
Input: 784
First hidden: 500
Second hidden: 500
Third hidden: 2000
Output: 10
You have a set of images, and you nonlinearly transform the 784 pixels of each image into 15 numbers in the <0, 1> interval, and you do this for every image in your set. You hope that you can correctly separate the digits based on these 15 numbers. In my view, 15 hidden units is too few for such a task, assuming you have a dataset with thousands of examples. Please try, for example, 500 hidden units.
The learning rate also influences backprop and can cause convergence problems.
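As a minimal illustration of the random initialization suggested above (a sketch only, using plain double arrays since the asker's Matrix API isn't shown here; the 1/sqrt(fanIn) scale is a common Xavier-style choice for sigmoid units):

import java.util.Random;

public class WeightInitSketch {
    public static double[][] xavierInit(int fanIn, int fanOut, Random rng) {
        double scale = 1.0 / Math.sqrt(fanIn);
        double[][] w = new double[fanIn][fanOut];
        for (int i = 0; i < fanIn; i++) {
            for (int j = 0; j < fanOut; j++) {
                // uniform in [-scale, +scale]
                w[i][j] = (rng.nextDouble() * 2.0 - 1.0) * scale;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] w1 = xavierInit(784, 15, new Random(42)); // input -> hidden
        double[][] w2 = xavierInit(15, 1, new Random(43));   // hidden -> output
        System.out.println("W1[0][0] = " + w1[0][0] + ", W2[0][0] = " + w2[0][0]);
    }
}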
I'm using ELKI to cluster my data. I used KMeansLloyd<NumberVector> with k=3, and every time I run my Java code I get totally different clustering results. Is this normal, or is there something I should do to make my output nearly stable? Here is my code, which I got from the ELKI tutorials:
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(a);
// Create a database (which may contain multiple relations!)
Database db = new StaticArrayDatabase(dbc, null);
// Load the data into the database (do NOT forget to initialize...)
db.initialize();
// Relation containing the number vectors:
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
// We know that the ids must be a continuous range:
DBIDRange ids = (DBIDRange) rel.getDBIDs();
// K-means should be used with squared Euclidean (least squares):
//SquaredEuclideanDistanceFunction dist = SquaredEuclideanDistanceFunction.STATIC;
CosineDistanceFunction dist= CosineDistanceFunction.STATIC;
// Default initialization, using global random:
// To fix the random seed, use: new RandomFactory(seed);
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
// Textbook k-means clustering:
KMeansLloyd<NumberVector> km = new KMeansLloyd<>(dist, //
3 /* k - number of partitions */, //
0 /* maximum number of iterations: no limit */, init);
// K-means will automatically choose a numerical relation from the data set:
// But we could make it explicit (if there were more than one numeric
// relation!): km.run(db, rel);
Clustering<KMeansModel> c = km.run(db);
// Output all clusters:
int i = 0;
for(Cluster<KMeansModel> clu : c.getAllClusters()) {
// K-means will name all clusters "Cluster" in lack of noise support:
System.out.println("#" + i + ": " + clu.getNameAutomatic());
System.out.println("Size: " + clu.size());
System.out.println("Center: " + clu.getModel().getPrototype().toString());
// Iterate over objects:
System.out.print("Objects: ");
for(DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
// To get the vector use:
NumberVector v = rel.get(it);
// Offset within our DBID range: "line number"
final int offset = ids.getOffset(it);
System.out.print(v+" " + offset);
// Do NOT rely on using "internalGetIndex()" directly!
}
System.out.println();
++i;
}
I would say, since you are using RandomlyGeneratedInitialMeans:
Initialize k-means by generating random vectors (within the data sets value range).
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
Yes, it is normal.
K-Means is supposed to be initialized randomly. It is desirable to get different results when running it multiple times.
If you don't want this, use a fixed random seed.
From the code you copied and pasted:
// To fix the random seed, use: new RandomFactory(seed);
That is exactly what you should do...
long seed = 0;
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(
new RandomFactory(seed));
This was too long for a comment. As @Idos stated, you are initializing your means randomly; that's why you're getting random results. Now the question is, how do you ensure the results are robust? Try this:
Run the algorithm N times. Each time, record the cluster membership for each observation. When you are finished, classify an observation into the cluster which contained it most often. For example, suppose you have 3 observations, 3 classes, and run the algorithm 3 times:
obs R1 R2 R3
1 A A B
2 B B B
3 C B B
Then you should classify obs1 as A since it was most often classified as A. Classify obs2 as B since it was always classified as B. And classify obs3 as B since it was most often classified as B by the algorithm. The results should become increasingly stable the more times you run the algorithm.
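Here is a rough sketch of that majority-vote step in plain Java; note that it assumes the cluster labels have already been matched up across runs, which k-means itself does not guarantee:

import java.util.HashMap;
import java.util.Map;

public class MajorityVote {
    /** assignments[run][obs] = cluster label of observation obs in that run. */
    public static int[] consensus(int[][] assignments) {
        int numObs = assignments[0].length;
        int[] result = new int[numObs];
        for (int obs = 0; obs < numObs; obs++) {
            // count how often each label was assigned to this observation
            Map<Integer, Integer> counts = new HashMap<>();
            for (int[] run : assignments) {
                counts.merge(run[obs], 1, Integer::sum);
            }
            // pick the most frequent label
            result[obs] = counts.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .get().getKey();
        }
        return result;
    }
}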
I have a bunch of sensors and I really just want to reconstruct the input.
So what I want is this:
after I have trained my model I will pass in my feature matrix
get the reconstructed feature matrix back
I want to investigate which sensor values are completely different from the reconstructed value
Therefore I thought an RBM would be the right choice, and since I am used to Java, I tried deeplearning4j. But I got stuck very early. If you run the following code, I am facing two problems.
The results are far from a correct prediction; most of them are simply [1.00, 1.00, 1.00].
I would expect to get back 4 values (which is the number of inputs expected to be reconstructed).
So what do I have to tune to get a) a better result and b) get the reconstructed inputs back?
public static void main(String[] args) {
// Customizing params
Nd4j.MAX_SLICES_TO_PRINT = -1;
Nd4j.MAX_ELEMENTS_PER_SLICE = -1;
Nd4j.ENFORCE_NUMERICAL_STABILITY = true;
final int numRows = 4;
final int numColumns = 1;
int outputNum = 3;
int numSamples = 150;
int batchSize = 150;
int iterations = 100;
int seed = 123;
int listenerFreq = iterations/5;
DataSetIterator iter = new IrisDataSetIterator(batchSize, numSamples);
// Loads data into generator and format consumable for NN
DataSet iris = iter.next();
iris.normalize();
//iris.scale();
System.out.println(iris.getFeatureMatrix());
NeuralNetConfiguration conf = new NeuralNetConfiguration.Builder()
// Gaussian for visible; Rectified for hidden
// Set contrastive divergence to 1
.layer(new RBM.Builder()
.nIn(numRows * numColumns) // Input nodes
.nOut(outputNum) // Output nodes
.activation("tanh") // Activation function type
.weightInit(WeightInit.XAVIER) // Weight initialization
.lossFunction(LossFunctions.LossFunction.XENT)
.updater(Updater.NESTEROVS)
.build())
.seed(seed) // Locks in weight initialization for tuning
.iterations(iterations)
.learningRate(1e-1f) // Backprop step size
.momentum(0.5) // Speed of modifying learning rate
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT) // ^^ Calculates gradients
.build();
Layer model = LayerFactories.getFactory(conf.getLayer()).create(conf);
model.setListeners(Arrays.asList((IterationListener) new ScoreIterationListener(listenerFreq)));
model.fit(iris.getFeatureMatrix());
System.out.println(model.activate(iris.getFeatureMatrix(), false));
}
For b), when you call activate(), you get a list of "nlayers" arrays. Every array in the list is the activation for one layer. The array itself is composed of rows, one row per input vector; each column corresponds to one neuron of that layer, so each entry is the activation of that neuron for that observation (input).
Once all layers have been activated with some input, you can get the reconstruction with the RBM.propDown() method.
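As a purely hypothetical usage sketch (exact class and method names depend on the deeplearning4j version in use; I'm assuming the RBM layer implementation, not the config builder, exposes the propDown(INDArray) method mentioned above):

INDArray features = iris.getFeatureMatrix();
// hidden activations: one row per input vector, one column per hidden unit
INDArray hidden = model.activate(features, false);
// map the hidden activations back to the visible (input) space
// (cast to the RBM *layer* class, not org.deeplearning4j.nn.conf.layers.RBM)
INDArray reconstruction = ((RBM) model).propDown(hidden);
System.out.println(reconstruction);   // should have 4 columns, matching the input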
As for a), I'm afraid it's very tricky to train an RBM correctly.
So you really want to play with every parameter and, more importantly,
monitor various metrics during training that will give you some hint about whether it's training correctly or not. Personally, I like to plot:
The score() on the training corpus, which is the reconstruction error after every gradient update; check that it decreases.
The score() on another development corpus: useful to be warned when overfitting occurs;
The norm of the parameter vector: it has a large impact on the score
Both activation maps (= XY rectangular plots of the activated neurons of one layer over the corpus), just after initialization and after N steps: this helps detect unreliable training (e.g. when everything is black/white, or when a large proportion of the neurons are never activated, etc.)
There have been other questions and answers on this site suggesting that, to create an echo or delay effect, you need only add one audio sample to a stored audio sample from the past. As such, I have the following Java class:
public class DelayAMod extends AudioMod {
private int delay = 500;
private float decay = 0.1f;
private boolean feedback = false;
private int delaySamples;
private short[] samples;
private int rrPointer;
@Override
public void init() {
this.setDelay(this.delay);
this.samples = new short[44100];
this.rrPointer = 0;
}
public void setDecay(final float decay) {
this.decay = Math.max(0.0f, Math.min(decay, 0.99f));
}
public void setDelay(final int msDelay) {
this.delay = msDelay;
this.delaySamples = 44100 / (1000/this.delay);
System.out.println("Delay samples:"+this.delaySamples);
}
@Override
public short process(short sample) {
System.out.println("Got:"+sample);
if (this.feedback) {
//Delay should feed back into the loop:
sample = (this.samples[this.rrPointer] = this.apply(sample));
} else {
//No feedback - store base data, then add echo:
this.samples[this.rrPointer] = sample;
sample = this.apply(sample);
}
++this.rrPointer;
if (this.rrPointer >= this.samples.length) {
this.rrPointer = 0;
}
System.out.println("Returning:"+sample);
return sample;
}
private short apply(short sample) {
int loc = this.rrPointer - this.delaySamples;
if (loc < 0) {
loc += this.samples.length;
}
System.out.println("Found:"+this.samples[loc]+" at "+loc);
System.out.println("Adding:"+(this.samples[loc] * this.decay));
return (short)Math.max(Short.MIN_VALUE, Math.min(sample + (int)(this.samples[loc] * this.decay), (int)Short.MAX_VALUE));
}
}
It accepts one 16-bit sample at a time from an input stream, finds an earlier sample, and adds them together accordingly. However, the output is just horrible noisy static, especially when the decay is raised to a level that would actually cause any appreciable result. Reducing the decay to 0.01 barely allows the original audio to come through, but there's certainly no echo at that point.
Basic troubleshooting facts:
The audio stream sounds fine if this processing is skipped.
The audio stream sounds fine if decay is 0 (nothing to add).
The stored samples are indeed stored and accessed in the proper order and the proper locations.
The stored samples are being decayed and added to the input samples properly.
All numbers from the call of process() to return sample are precisely what I would expect from this algorithm, and remain so even outside this class.
The problem seems to arise from simply adding signed shorts together, and the resulting waveform is an absolute catastrophe. I've seen this specific method implemented in a variety of places - C#, C++, even on microcontrollers - so why is it failing so hard here?
EDIT: It seems I've been going about this entirely wrong. I don't know if it's FFmpeg/avconv, or some other factor, but I am not working with a normal PCM signal here. Through graphing of the waveform, as well as a failed attempt at a tone generator and the resulting analysis, I have determined that this is some version of differential pulse-code modulation; pitch is determined by change from one sample to the next, and halving the intended "volume" multiplier on a pure sine wave actually lowers the pitch and leaves volume the same. (Messing with the volume multiplier on a non-sine sequence creates the same static as this echo algorithm.) As this and other DSP algorithms are intended to work on linear pulse-code modulation, I'm going to need some way to get the proper audio stream first.
It should definitely work unless you have significant clipping.
For example, this is a text file with two columns. The leftmost column is the 16-bit input. The second column is the sum of the first and a version delayed by 4001 samples. The sample rate is 22 kHz.
Each sample in the second column is the result of summing x[k] and x[k-4001] (e.g. y[5000] = x[5000] + x[999] = -13840 + 9181 = -4659). You can clearly hear the echo signal when playing the samples in the second column.
Try this signal with your code and see if you get identical results.
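For reference, here is a small self-contained sketch of that summing formula, y[k] = x[k] + decay * x[k - D], with clamping to the 16-bit range; the delay and decay values are illustrative, not taken from the file above:

public class EchoSketch {
    public static short[] addEcho(short[] x, int delaySamples, float decay) {
        short[] y = new short[x.length];
        for (int k = 0; k < x.length; k++) {
            int sum = x[k];
            if (k >= delaySamples) {
                // add the decayed sample from delaySamples ago
                sum += (int) (x[k - delaySamples] * decay);
            }
            // clamp to the signed 16-bit range to avoid wrap-around
            y[k] = (short) Math.max(Short.MIN_VALUE, Math.min(sum, Short.MAX_VALUE));
        }
        return y;
    }
}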