Deeplearning4j: LSTM example for review sentiment analysis - java

I am looking through the deeplearning4j example for classifying movie reviews according to their sentiment:
ReviewExample
At lines 124-142 the N-dimensional arrays are created, and I am somewhat unsure what is happening in these lines:
Line 132:
features.put(new INDArrayIndex[]{NDArrayIndex.point(i),
NDArrayIndex.all(), NDArrayIndex.point(j)}, vector);
I can imagine that .point(i) and .point(j) address the cell in the array, but what exactly does the NDArrayIndex.all() call do here?
While building the feature array is more or less clear, I get totally confused by the label mask and this lastIdx variable.
Lines 138-142:
int idx = (positive[i] ? 0 : 1);
int lastIdx = Math.min(tokens.size(),maxLength);
labels.putScalar(new int[]{i,idx,lastIdx-1},1.0); //Set label: [0,1] for negative, [1,0] for positive
labelsMask.putScalar(new int[]{i,lastIdx-1},1.0); //Specify that an output exists at the final time step for this example
The label array itself is addressed by i, idx - e.g. the row/column that is set to 1.0 - but I don't really get how this time-step information fits in. Is it conventional that the last parameter has to mark the last entry?
Then why does the labelsMask use only i and not i, idx?
Thanks for any explanations or pointers that help to clarify some of my questions.

It's an index per dimension. all() is an indicator: use this whole dimension. See the nd4j user guide:
http://nd4j.org/userguide
As for the 1.0: that is the value for the class of the label. It's a text classification problem: take the window from the text and word vectors and have the class be predicted from that.
As for the label mask: the prediction of a neural net happens at the end of a sequence. See:
http://deeplearning4j.org/usingrnns
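If it helps, here is a small standalone sketch (toy sizes, not the dl4j example itself) of how the features, labels and labelsMask arrays relate; it assumes miniBatch=1, vectorSize=3, maxLength=4 and a 2-token positive review, and exact factory signatures vary a little between nd4j versions:
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.indexing.INDArrayIndex;
import org.nd4j.linalg.indexing.NDArrayIndex;

public class MaskSketch {
    public static void main(String[] args) {
        int miniBatch = 1, vectorSize = 3, maxLength = 4;
        INDArray features   = Nd4j.zeros(miniBatch, vectorSize, maxLength);
        INDArray labels     = Nd4j.zeros(miniBatch, 2, maxLength);  // 2 classes
        INDArray labelsMask = Nd4j.zeros(miniBatch, maxLength);

        // word vector for example 0 at time step 0: point(0) picks the example,
        // all() spans the whole vector dimension, point(0) picks the time step
        INDArray vector = Nd4j.ones(vectorSize);
        features.put(new INDArrayIndex[]{NDArrayIndex.point(0),
                NDArrayIndex.all(), NDArrayIndex.point(0)}, vector);

        // a 2-token review: its last time step is index lastIdx - 1 = 1
        int lastIdx = 2;
        labels.putScalar(new int[]{0, 0, lastIdx - 1}, 1.0);   // class 0 = positive
        labelsMask.putScalar(new int[]{0, lastIdx - 1}, 1.0);  // output exists only here

        System.out.println(labelsMask);  // 1.0 only at [0, lastIdx - 1]
    }
}
The mask is indexed by (example, time step) only because it says where an output exists, not which class it is; the class lives in the labels array.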

Write a test and you will see it:
val features = Nd4j.zeros(2, 2, 3)
val toPut = Nd4j.ones(2)
features.put(Array[INDArrayIndex](NDArrayIndex.point(0), NDArrayIndex.all, NDArrayIndex.point(1)), toPut)
The result is:
[[[0.00, 1.00, 0.00],
[0.00, 1.00, 0.00]],
[[0.00, 0.00, 0.00],
[0.00, 0.00, 0.00]]]
It puts the 'toPut' vector into the features array.


How to find the optimal K for clustering? [duplicate]

I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
You can maximize the Bayesian Information Criterion (BIC):
BIC(C | X) = L(X | C) - (p / 2) * log n
where L(X | C) is the log-likelihood of the dataset X according to model C, p is the number of parameters in the model C, and n is the number of points in the dataset.
See "X-means: extending K-means with efficient estimation of the number of clusters" by Dan Pelleg and Andrew Moore in ICML 2000.
Another approach is to start with a large value for k and keep removing centroids (reducing k) until it no longer reduces the description length. See "MDL principle for robust vector quantisation" by Horst Bischof, Ales Leonardis, and Alexander Selb in Pattern Analysis and Applications vol. 2, p. 59-72, 1999.
Finally, you can start with one cluster, then keep splitting clusters until the points assigned to each cluster have a Gaussian distribution. In "Learning the k in k-means" (NIPS 2003), Greg Hamerly and Charles Elkan show some evidence that this works better than BIC, and that BIC does not penalize the model's complexity strongly enough.
Basically, you want to find a balance between two variables: the number of clusters (k) and the average variance of the clusters. You want to minimize the former while also minimizing the latter. Of course, as the number of clusters increases, the average variance decreases (up to the trivial case of k=n and variance=0).
As always in data analysis, there is no one true approach that works better than all others in all cases. In the end, you have to use your own best judgement. For that, it helps to plot the number of clusters against the average variance (which assumes that you have already run the algorithm for several values of k). Then you can use the number of clusters at the knee of the curve.
Yes, you can find the best number of clusters using the Elbow method, but I found it troublesome to extract the value of k from the elbow graph with a script. You can look at the elbow graph and find the elbow point yourself, but it was a lot of work to find it from a script.
So another option is to use the Silhouette method to find it. The result from Silhouette completely complies with the result from the Elbow method in R.
Here's what I did.
# Dataset for clustering
n <- 150
g <- 6
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
mydata <- d

# Plot 3x2 plots
par(mfrow = c(3, 2))

# Plot the original dataset
plot(mydata$x, mydata$y, main = "Original Dataset")

# Scree plot to determine the number of clusters
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) {
  wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
}
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

# Ward hierarchical clustering
d <- dist(mydata, method = "euclidean")  # distance matrix
fit <- hclust(d, method = "ward.D")      # "ward" was renamed to "ward.D"
plot(fit)  # display dendrogram
groups <- cutree(fit, k = 5)  # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k = 5, border = "red")

# Silhouette analysis for determining the number of clusters
library(fpc)
asw <- numeric(20)
for (k in 2:20)
  asw[[k]] <- pam(mydata, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
plot(pam(d, k.best))

# K-means cluster analysis
fit <- kmeans(mydata, k.best)
# get cluster means
aggregate(mydata, by = list(fit$cluster), FUN = mean)
# append cluster assignment
mydata <- data.frame(mydata, clusterid = fit$cluster)
plot(mydata$x, mydata$y, col = fit$cluster, main = "K-means Clustering results")
Hope it helps!!
Maybe a beginner like me is looking for a code example. Information on silhouette_score
is available here.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

range_n_clusters = [2, 3, 4]  # clusters range you want to select
dataToFit = [[12, 23], [112, 46], [45, 23]]  # sample data
best_clusters = 0  # best cluster number which you will get
previous_silh_avg = 0.0

for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters)
    cluster_labels = clusterer.fit_predict(dataToFit)
    silhouette_avg = silhouette_score(dataToFit, cluster_labels)
    if silhouette_avg > previous_silh_avg:
        previous_silh_avg = silhouette_avg
        best_clusters = n_clusters

# Final Kmeans for best_clusters
kmeans = KMeans(n_clusters=best_clusters, random_state=0).fit(dataToFit)
Look at this paper, "Learning the k in k-means" by Greg Hamerly and Charles Elkan. It uses a Gaussian test to determine the right number of clusters. The authors also claim that this method is better than BIC, which is mentioned in the accepted answer.
There is something called the Rule of Thumb. It says that the number of clusters can be calculated by
k = (n/2)^0.5
where n is the total number of elements in your sample.
You can check the veracity of this information in the following paper:
http://www.ijarcsms.com/docs/paper/volume1/issue6/V1I6-0015.pdf
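For example, a sample of n = 200 elements would suggest k = (200/2)^0.5 = 10 clusters as a starting point.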
There is also another method called G-means, which tests whether your distribution follows a Gaussian (normal) distribution.
It consists of increasing k until all of your k groups follow a Gaussian distribution.
It requires a lot of statistics but can be done.
Here is the source:
http://papers.nips.cc/paper/2526-learning-the-k-in-k-means.pdf
I hope this helps!
If you don't know the number of clusters k to provide as a parameter to k-means, there are four ways to find it automatically:
G-means algorithm: it discovers the number of clusters automatically, using a statistical test to decide whether to split a k-means center into two. This algorithm takes a hierarchical approach to detect the number of clusters, based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution (a continuous function which approximates the exact binomial distribution of events), and if not it splits the cluster. It starts with a small number of centers, say one cluster only (k=1); then the algorithm splits it into two centers (k=2) and splits each of these two centers again (k=4), giving four centers in total. If G-means does not accept these four centers, then the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimate of the number of clusters you will get after grouping your instances. Notice that an inconvenient choice for the "k" parameter might give you wrong results. The parallel version of G-means is called p-means. G-means sources:
source 1
source 2
source 3
x-means: a new algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. This version of k-means finds the number k and also accelerates k-means.
Online k-means or Streaming k-means: it permits executing k-means by scanning the whole data once, and it finds the optimal number of k automatically. Spark implements it.
MeanShift algorithm: it is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters. Mean shift clustering aims to discover “blobs” in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids. Sources: source1, source2, source3
First build a minimum spanning tree of your data.
Removing the K-1 most expensive edges splits the tree into K clusters,
so you can build the MST once, look at cluster spacings / metrics for various K,
and take the knee of the curve.
This works only for single-linkage clustering,
but for that it's fast and easy. Plus, MSTs make good visuals.
See for example the MST plot under
stats.stackexchange visualization software for clustering.
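As an illustration, here is a minimal Java sketch of the idea under simplifying assumptions (2-D points, all pairwise edges, Kruskal-style merging stopped once K components remain, which is equivalent to removing the K-1 heaviest MST edges); a sketch, not a tuned implementation:
import java.util.*;

public class MstClustering {
    // union-find with path halving
    static int find(int[] parent, int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    // returns a cluster label (0..k-1) per point
    static int[] cluster(double[][] points, int k) {
        int n = points.length;
        // all pairwise edges: {distance, i, j}
        List<double[]> edges = new ArrayList<>();
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                edges.add(new double[]{
                        Math.hypot(points[i][0] - points[j][0],
                                   points[i][1] - points[j][1]), i, j});
        edges.sort(Comparator.comparingDouble((double[] e) -> e[0]));

        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        int components = n;
        for (double[] e : edges) {
            if (components == k) break; // K clusters remain: done
            int a = find(parent, (int) e[1]), b = find(parent, (int) e[2]);
            if (a != b) { parent[a] = b; components--; }
        }
        // relabel the component roots as 0..k-1
        Map<Integer, Integer> ids = new HashMap<>();
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) {
            int root = find(parent, i);
            Integer id = ids.get(root);
            if (id == null) { id = ids.size(); ids.put(root, id); }
            labels[i] = id;
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {5, 5}, {5, 6}, {10, 0}};
        System.out.println(Arrays.toString(cluster(pts, 3)));
    }
}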
I'm surprised nobody has mentioned this excellent article:
http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
After following several other suggestions I finally came across this article while reading this blog:
https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/
After that I implemented it in Scala, an implementation which for my use cases provides really good results. Here's the code:
import breeze.linalg.DenseVector
import Kmeans.{Features, _}
import nak.cluster.{Kmeans => NakKmeans}

import scala.collection.immutable.IndexedSeq
import scala.collection.mutable.ListBuffer

/*
https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/
*/
class Kmeans(features: Features) {

  def fkAlphaDispersionCentroids(k: Int,
                                 dispersionOfKMinus1: Double = 0d,
                                 alphaOfKMinus1: Double = 1d): (Double, Double, Double, Features) = {
    if (1 == k || 0d == dispersionOfKMinus1) (1d, 1d, 1d, Vector.empty)
    else {
      val featureDimensions = features.headOption.map(_.size).getOrElse(1)
      val (dispersion, centroids: Features) = new NakKmeans[DenseVector[Double]](features).run(k)
      val alpha =
        if (2 == k) 1d - 3d / (4d * featureDimensions)
        else alphaOfKMinus1 + (1d - alphaOfKMinus1) / 6d
      val fk = dispersion / (alpha * dispersionOfKMinus1)
      (fk, alpha, dispersion, centroids)
    }
  }

  def fks(maxK: Int = maxK): List[(Double, Double, Double, Features)] = {
    val fadcs = ListBuffer[(Double, Double, Double, Features)](fkAlphaDispersionCentroids(1))
    var k = 2
    while (k <= maxK) {
      val (fk, alpha, dispersion, features) = fadcs(k - 2)
      fadcs += fkAlphaDispersionCentroids(k, dispersion, alpha)
      k += 1
    }
    fadcs.toList
  }

  def detK: (Double, Features) = {
    val vals = fks().minBy(_._1)
    (vals._3, vals._4)
  }
}

object Kmeans {
  val maxK = 10
  type Features = IndexedSeq[DenseVector[Double]]
}
If you use MATLAB (any version since 2013b, that is), you can make use of the function evalclusters to find out what the optimal k should be for a given dataset.
This function lets you choose from among 3 clustering algorithms - kmeans, linkage and gmdistribution.
It also lets you choose from among 4 clustering evaluation criteria - CalinskiHarabasz, DaviesBouldin, gap and silhouette.
I used the solution I found here: http://efavdb.com/mean-shift/ and it worked very well for me:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from itertools import cycle

# %% Generate sample data
centers = [[1, 1], [-.75, -1], [1, -1], [-3, 2]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

# %% Compute clustering with MeanShift
# The bandwidth can be automatically estimated
bandwidth = estimate_bandwidth(X, quantile=.1, n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = labels.max() + 1

# %% Plot result
plt.figure(1)
plt.clf()
colors = cycle('bgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
My idea is to use the Silhouette Coefficient to find the optimal cluster number (K). A detailed explanation is here.
Assuming you have a matrix of data called DATA, you can perform partitioning around medoids with estimation of number of clusters (by silhouette analysis) like this:
library(fpc)
maxk <- 20 # arbitrary here, you can set this to whatever you like
estimatedK <- pamk(dist(DATA), krange=1:maxk)$nc
One possible answer is to use a metaheuristic algorithm like a genetic algorithm to find k.
It's simple: use a random k (in some range) and evaluate the genetic algorithm's fitness function with some measurement like the silhouette,
then find the best k based on the fitness function.
https://en.wikipedia.org/wiki/Silhouette_(clustering)
import pandas as pd
from sklearn.cluster import KMeans

# assumes num_data (DataFrame), ncluster (clusters per column),
# missing_num (missing counts per column) and count_row are already defined
km = []
for i in range(num_data.shape[1]):
    kmeans = KMeans(n_clusters=ncluster[i])  # we take number of cluster bandwidth theory
    ndata = num_data[[i]].dropna()
    ndata['labels'] = kmeans.fit_predict(ndata.values)
    cluster = ndata
    co = cluster.groupby(['labels'])[cluster.columns[0]].count()   # count for frequency
    me = cluster.groupby(['labels'])[cluster.columns[0]].median()  # median
    ma = cluster.groupby(['labels'])[cluster.columns[0]].max()     # maximum
    mi = cluster.groupby(['labels'])[cluster.columns[0]].min()     # minimum
    stat = pd.concat([mi, ma, me, co], axis=1)  # add all columns
    stat['variable'] = stat.columns[1]          # column name change
    stat.columns = ['Minimum', 'Maximum', 'Median', 'count', 'variable']
    l = []
    for j in range(ncluster[i]):
        n = [mi.loc[j], ma.loc[j]]
        l.append(n)
    stat['Class'] = l
    stat = stat.sort_values(['Minimum'])
    stat = stat[['variable', 'Class', 'Minimum', 'Maximum', 'Median', 'count']]
    if missing_num.iloc[i] > 0:
        stat.loc[ncluster[i]] = 0
        if stat.iloc[ncluster[i], 5] == 0:
            stat.iloc[ncluster[i], 5] = missing_num.iloc[i]
            stat.iloc[ncluster[i], 0] = stat.iloc[0, 0]
    stat['Percentage'] = stat['count'] * 100 / count_row  # freq percentage
    stat['Cumulative Percentage'] = stat['Percentage'].cumsum()
    km.append(stat)

cluster = pd.concat(km, axis=0)  # see documentation for more info
cluster = cluster.round({'Minimum': 2, 'Maximum': 2, 'Median': 2,
                         'Percentage': 2, 'Cumulative Percentage': 2})
Another approach is to use Self-Organizing Maps (SOM) to find the optimal number of clusters. The SOM (Self-Organizing Map) is an unsupervised neural network methodology which needs only the input and is used for clustering and problem solving. This approach is used in a paper about customer segmentation.
The reference of the paper is:
Abdellah Amine et al., Customer Segmentation Model in E-commerce Using Clustering Techniques and LRFM Model: The Case of Online Stores in Morocco, World Academy of Science, Engineering and Technology, International Journal of Computer and Information Engineering, Vol:9, No:8, 2015, pp. 1999-2010.
Hi, I'll make it simple and straight to explain: I like to determine the number of clusters using the NbClust library.
Now, how to use the NbClust function to determine the right number of clusters: you can check the actual project on GitHub, with actual data and clusters - an extension to the kmeans algorithm is also performed there, using the right number of 'centers'.
GitHub project link: https://github.com/RutvijBhutaiya/Thailand-Customer-Engagement-Facebook
You can choose the number of clusters by visually inspecting your data points, but you will soon realize that there is a lot of ambiguity in this process for all except the simplest data sets. This is not always bad, because you are doing unsupervised learning and there's some inherent subjectivity in the labeling process. Here, having previous experience with that particular problem or something similar will help you choose the right value.
If you want some hint about the number of clusters that you should use, you can apply the Elbow method:
First of all, compute the sum of squared error (SSE) for some values of k (for example 2, 4, 6, 8, etc.). The SSE is defined as the sum of the squared distance between each member of the cluster and its centroid. Mathematically:
SSE = ∑_{i=1}^{K} ∑_{x ∈ c_i} dist(x, c_i)²
If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because when the number of clusters increases, they should be smaller, so distortion is also smaller. The idea of the elbow method is to choose the k at which the SSE decreases abruptly, which produces an "elbow effect" in a plot of SSE versus k.
In the original example's plot, k=6 is the value the Elbow method selected. Take into account that the Elbow method is a heuristic and, as such, it may or may not work well in your particular case. Sometimes there is more than one elbow, or no elbow at all. In those situations you usually end up calculating the best k by evaluating how well k-means performs in the context of the particular clustering problem you are trying to solve.
I worked on a Python package, kneed (the Kneedle algorithm). It finds the cluster number dynamically as the point where the curve starts to flatten. Given a set of x and y values, kneed will return the knee point of the function, i.e. the point of maximum curvature. Here is a sample code.
y = [7342.1301373073857, 6881.7109460930769, 6531.1657905495022,
6356.2255554679778, 6209.8382535595829, 6094.9052166741121,
5980.0191582610196, 5880.1869867848218, 5779.8957906367368,
5691.1879324562778, 5617.5153566271356, 5532.2613232619951,
5467.352265375117, 5395.4493783888756, 5345.3459908298091,
5290.6769823693812, 5243.5271656371888, 5207.2501206569532,
5164.9617535255456]
x = range(1, len(y)+1)
from kneed import KneeLocator
kn = KneeLocator(x, y, curve='convex', direction='decreasing')
print(kn.knee)
Leaving here the steps behind a pretty cool gif from a Codecademy course.
The K-Means algorithm:
Place k random centroids for the initial clusters.
Assign data samples to the nearest centroid.
Update centroids based on the above-assigned data samples.
By the way, this is not an explanation of the full algorithm, just a helpful visualization; a minimal code sketch of the loop follows.
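A minimal Java sketch of those three steps, assuming 2-D points, Euclidean distance and a fixed iteration cap (illustrative only, not production code):
import java.util.*;

public class KMeansSketch {
    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1.5, 2}, {0.5, 1.5}, {8, 8}, {8.5, 9}, {9, 8.5}};
        int k = 2;

        // 1. place k random centroids (here: k distinct data points)
        List<double[]> pool = new ArrayList<>(Arrays.asList(data));
        Collections.shuffle(pool, new Random(42));
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = pool.get(c).clone();

        int[] assign = new int[data.length];
        for (int iter = 0; iter < 100; iter++) {
            // 2. assign each sample to the nearest centroid
            boolean changed = false;
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = Math.hypot(data[i][0] - centroids[c][0],
                                          data[i][1] - centroids[c][1]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break; // converged

            // 3. update centroids to the mean of their assigned samples
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (int i = 0; i < data.length; i++) {
                sum[assign[i]][0] += data[i][0];
                sum[assign[i]][1] += data[i][1];
                count[assign[i]]++;
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0) {
                    centroids[c][0] = sum[c][0] / count[c];
                    centroids[c][1] = sum[c][1] / count[c];
                }
        }
        System.out.println(Arrays.toString(assign)); // e.g. [0, 0, 0, 1, 1, 1]
    }
}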

Numbers length of JFormattedTextField mask Java

I'm developing a desktop application in Java; right now I'm at the point of registering person data. One of the fields of the person form is "DocumentTextField", which holds the identification document and number; that's why I tried to use a JFormattedTextField mask, to help the user with the format of this field.
Basically, I just used the AbstractFormatterFactory to create the mask:
Mask = UU - ########## to get something like (PP-0123456789)
It works perfectly on the fly: the user just types "pp0123456789" and the mask turns it into "PP-0123456789". The problem is the number length: as you can see in my mask, I declare 10 digits (##########), but in fact it could be fewer than 10 digits or even more. It only works with exactly 10 digits; if the user types fewer than 10, the JFormattedTextField resets to empty, and the same thing happens if the user types more than 10.
Is there any way to declare a range for the number length? Some documents are just 5 digits (PP-01234).
Thank you so much in advance for reading this and trying to help.
I assume you're using Java 8 for your development. Are you seeing any kind of ParseException?
As per the documentation of the component: https://docs.oracle.com/javase/8/docs/api/javax/swing/text/MaskFormatter.html
When initially formatting a value if the length of the string is less than the length of the mask, two things can happen. Either the placeholder string will be used, or the placeholder character will be used. Precedence is given to the placeholder string.
According to the example:
MaskFormatter formatter = new MaskFormatter("###-####");
formatter.setPlaceholderCharacter('_');
The setPlaceholderCharacter method can help you with your problem.
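For reference, a small runnable sketch of that suggestion, reusing the mask from the question (whether an incomplete document commits still depends on the formatter's validation settings):
import javax.swing.JFormattedTextField;
import javax.swing.JFrame;
import javax.swing.text.MaskFormatter;
import java.text.ParseException;

public class DocumentFieldDemo {
    public static void main(String[] args) throws ParseException {
        MaskFormatter formatter = new MaskFormatter("UU-##########");
        formatter.setPlaceholderCharacter('_');  // shows PP-01234_____ instead of resetting

        JFormattedTextField documentField = new JFormattedTextField(formatter);

        JFrame frame = new JFrame("Document");
        frame.add(documentField);
        frame.setSize(250, 80);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}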

Java: Limiting decimals from a double in output without e.g. %5.2d

I'm very (read: extremely) new to Java and was going to make a table with 5 columns print out for an assignment.
The first two columns are strings, the third one an int, the fourth a double and the fifth an int. Three rows in total are affected by this.
I formatted it with printf and:
System.out.printf(Locale.ENGLISH, "%s%10s%14d%20.2f%12d\n", stringOne,
stringTwo, firstInt, stupidDouble, secondInt);
and
"%s\t%s\t\t%d\t\t%2.4f\t\t%d\n" (with the format and stuff above as well ofc).
But the teacher didn't want me to "hard code" the layout, since it's part of learning special commands (like \t), and thus wanted me to change it and address it in the variable itself instead.
I've been trying everything I can think of but can't get it to skip the last decimals, which are unnecessary 0's.
I wish I could've just used the numbers as a string instead, but I need it to calculate the last int with:
static int lastInt = (int) Math.round(stupidDouble - firstInt);
What I suspect that my teacher is looking for me to do is tabs all the way like:
"%s\t%s\t\t%d\t\t%f\t\t%d\n"
But I need the double to be e.g. 1.2345 instead of 1.234500.
If it makes any difference, it's 3 different doubles (3 different rows in the table); one with 2 decimals, one with 4 and one with 5.
It would more or less save my weekend if someone could help me with this. The simplest possible solution would be much appreciated. <3
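One possible direction, as a minimal sketch (assuming a DecimalFormat is acceptable): a pattern like 0.##### prints up to five decimals but drops trailing zeros, so 1.2345 stays 1.2345 instead of becoming 1.234500; the variable names below mirror the question:
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class TableDemo {
    public static void main(String[] args) {
        DecimalFormat df = new DecimalFormat("0.#####",
                DecimalFormatSymbols.getInstance(Locale.ENGLISH));
        String stringOne = "abc", stringTwo = "def";
        int firstInt = 1;
        double stupidDouble = 1.2345;
        int lastInt = (int) Math.round(stupidDouble - firstInt);
        // tabs as before, but the double is pre-formatted as a string
        System.out.printf("%s\t%s\t\t%d\t\t%s\t\t%d%n",
                stringOne, stringTwo, firstInt, df.format(stupidDouble), lastInt);
    }
}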

GLPK-Java to solve MILP type problems

While I already know that there is not much documentation on using the GLPK-Java library, I'm going to ask anyway... (and no, I can't use another solver).
I have a basic problem that involves scheduling: students to courses per semester, with some basic constraints.
The example problem is:
We consider {s1, s2} a set of two students. They need to take two
courses {c1, c2} during two semesters {t1, t2}.
We assume the courses are the same. We also assume that the students
cannot take more than one course per semester, and we'd like to determine the
minimum capacity X that must be offered by the classroom, assuming they
all offer the same capacity.
The example we were given in CPLEX format looks like this:
minimize X
subject to
y111 + y112 = 1
y121 + y122 = 1
y211 + y212 = 1
y221 + y222 = 1
y111 + y112 <= 1
y121 + y122 <= 1
y211 + y212 <= 1
y221 + y222 <= 1
y111 + y112 -X <= 0
y121 + y122 -X <= 0
y211 + y212 -X <= 0
y221 + y222 -X <= 0
end
I can run this through the solver via the glpsol command and get it to solve, but I need to write this using the API. I've never really worked with linear programming, and the documentation leaves something to be desired. While this is simplistic at best, the real problem involves solving for 600 students over 12 semesters who have to take 12 courses out of 18, with certain classes only available in certain semesters and some courses having prerequisites.
What I need help with is translating the simplistic problem into a coding example using the API. I'm assuming that once I can see how the very simple problem maps to the API calls, I can figure out how to create the application for the more complex case.
From the examples in the library I can see that you set up columns, which in this case would be the semesters, and the rows are the students:
// Define Columns
GLPK.glp_add_cols(lp, 2); // adds the number of columns
GLPK.glp_set_col_name(lp, 1, "Sem_1");
GLPK.glp_set_col_kind(lp, 1, GLPKConstants.GLP_IV);
GLPK.glp_set_col_bnds(lp, 1, GLPKConstants.GLP_LO, 0, 0);
GLPK.glp_set_col_name(lp, 2, "Sem_2");
GLPK.glp_set_col_kind(lp, 2, GLPKConstants.GLP_IV);
GLPK.glp_set_col_bnds(lp, 2, GLPKConstants.GLP_LO, 0, 0);
At this point I would assume you need to set up the row constraints but I'm at a loss. Any direction would be greatly appreciated.
When using the API, the optimization problem is basically represented as a matrix, where the columns are your variables and the rows your constraints.
For your problem you have to define 9 columns, representing y111, y112, ... and X.
Then you can go on with the constraints (rows) by referencing the used variables (columns):
GLPK.glp_add_rows(lp, 12); // one row per constraint
// index/value arrays are 1-based in GLPK
SWIGTYPE_p_int ind = GLPK.new_intArray(3);
SWIGTYPE_p_double val = GLPK.new_doubleArray(3);
GLPK.glp_set_row_name(lp, 1, "constraint1");
GLPK.glp_set_row_bnds(lp, 1, GLPKConstants.GLP_FX, 1.0, 1.0); // equal to 1.0
GLPK.intArray_setitem(ind, 1, 1); // add first column (y111) at first position
GLPK.intArray_setitem(ind, 2, 2); // add second column (y112) at second position
// set the coefficients of the variables (all 1. here; X gets -1. in the capacity rows)
GLPK.doubleArray_setitem(val, 1, 1.);
GLPK.doubleArray_setitem(val, 2, 1.);
GLPK.glp_set_mat_row(lp, 1, 2, ind, val);
This represents the y111 + y112 = 1 constraint - 11 to go.
The GLPK for Java package should also include the GLPK documentation, which contains a good description of the GLPK functions (these are also available in GLPK for Java). Also have a look at the lp.java example.
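Putting it together, here is a reduced, self-contained sketch covering just X, y111 and y112 with two of the rows (a sketch against the org.gnu.glpk bindings, not a drop-in solution; extend the same pattern for all 9 variables and 12 constraints):
import org.gnu.glpk.*;

public class CourseModelSketch {
    public static void main(String[] args) {
        glp_prob lp = GLPK.glp_create_prob();
        GLPK.glp_set_prob_name(lp, "course-capacity");

        // columns: 1 = y111, 2 = y112, 3 = X
        GLPK.glp_add_cols(lp, 3);
        GLPK.glp_set_col_name(lp, 1, "y111");
        GLPK.glp_set_col_kind(lp, 1, GLPKConstants.GLP_BV);  // binary
        GLPK.glp_set_col_name(lp, 2, "y112");
        GLPK.glp_set_col_kind(lp, 2, GLPKConstants.GLP_BV);
        GLPK.glp_set_col_name(lp, 3, "X");
        GLPK.glp_set_col_kind(lp, 3, GLPKConstants.GLP_IV);  // integer
        GLPK.glp_set_col_bnds(lp, 3, GLPKConstants.GLP_LO, 0, 0);

        // objective: minimize X
        GLPK.glp_set_obj_dir(lp, GLPKConstants.GLP_MIN);
        GLPK.glp_set_obj_coef(lp, 3, 1.0);

        // rows: y111 + y112 = 1   and   y111 + y112 - X <= 0
        GLPK.glp_add_rows(lp, 2);
        SWIGTYPE_p_int ind = GLPK.new_intArray(4);      // 1-based, length + 1
        SWIGTYPE_p_double val = GLPK.new_doubleArray(4);

        GLPK.glp_set_row_name(lp, 1, "take_c1");
        GLPK.glp_set_row_bnds(lp, 1, GLPKConstants.GLP_FX, 1.0, 1.0);
        GLPK.intArray_setitem(ind, 1, 1);  GLPK.doubleArray_setitem(val, 1, 1.0);
        GLPK.intArray_setitem(ind, 2, 2);  GLPK.doubleArray_setitem(val, 2, 1.0);
        GLPK.glp_set_mat_row(lp, 1, 2, ind, val);

        GLPK.glp_set_row_name(lp, 2, "capacity_c1");
        GLPK.glp_set_row_bnds(lp, 2, GLPKConstants.GLP_UP, 0, 0.0);
        GLPK.intArray_setitem(ind, 3, 3);  GLPK.doubleArray_setitem(val, 3, -1.0);
        GLPK.glp_set_mat_row(lp, 2, 3, ind, val);

        // solve as a MIP
        glp_iocp iocp = new glp_iocp();
        GLPK.glp_init_iocp(iocp);
        iocp.setPresolve(GLPKConstants.GLP_ON);
        GLPK.glp_intopt(lp, iocp);
        System.out.println("X = " + GLPK.glp_mip_col_val(lp, 3));

        GLPK.delete_intArray(ind);
        GLPK.delete_doubleArray(val);
        GLPK.glp_delete_prob(lp);
    }
}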

Java: Easier pretty printing?

At the end of my computations, I print results:
System.out.println("\nTree\t\tOdds of being by the sought author");
for (ParseTree pt : testTrees) {
conditionalProbs = reg.classify(pt.features());
System.out.printf("%s\t\t%f", pt.toString(), conditionalProbs[1]);
System.out.println();
}
This produces, for instance:
Tree Odds of being by the sought author
K and Burstner 0.000000
how is babby formed answer 0.005170
Mary is in heat 0.999988
Prelim 1.000000
Just putting two \t in there is sort of clumsy - the columns don't really line up. I'd rather have an output like this:
Tree Odds of being by the sought author
K and Burstner 0.000000
how is babby formed answer 0.005170
Mary is in heat 0.999988
Prelim 1.000000
(note: I'm having trouble making the SO text editor line up those columns perfectly, but hopefully you get the idea.)
Is there an easy way to do this, or must I write a method to try to figure it out based on the length of the string in the "Tree" column?
You're looking for field lengths. Try using this:
printf ("%-32s %f\n", pt.toString(), conditionalProbs[1])
The -32 tells you that the string should be left-justified, but with a field length of 32 characters (adjust to your liking; I picked 32 as it is a multiple of 8, which is a normal tab stop on a terminal). Using the same on the header, but with %s instead of %f, will make that one line up nicely too.
What you need is the amazing yet free format().
It works by letting you specify placeholders in a template string; it produces a combination of template and values as output.
Example:
System.out.format("%-25s %9.7f%n", "K and Burstner", 0.055170);
%s is a placeholder for Strings;
%25s means blank-pad any given String to 25 characters.
%-25s means left-justify the String in the field, i.e. pad to the right of the string.
%9.7f means output a floating-point number with 9 places in all and 7 to the right of the decimal.
%n is necessary to "do" a line termination, which is what you're otherwise missing when you go from System.out.println() to System.out.format().
Alternatively, you can use
String outputString = String.format("format-string", arg1, arg2...);
to create an output String, and then use
System.out.println(outputString);
as before to print it.
How about
System.out.printf("%-30s %f\n", pt.toString(), conditionalProbs[1]);
See the docs for more information on the Formatter mini-language.
http://java.sun.com/j2se/1.5.0/docs/api/java/text/MessageFormat.html
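That link points at java.text.MessageFormat; a tiny illustrative sketch of how it could format one row (note that MessageFormat does not pad fields, so printf-style widths remain the easier route for column alignment):
import java.text.MessageFormat;

public class MessageFormatDemo {
    public static void main(String[] args) {
        String row = MessageFormat.format("{0}: {1,number,0.000000}",
                "Mary is in heat", 0.999988);
        System.out.println(row);  // Mary is in heat: 0.999988
    }
}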
Using j-text-utils you may print a nicely formatted table to the console.
And it is as simple as:
TextTable tt = new TextTable(columnNames, data);
tt.printTable();
The API also allows sorting and row numbering ...
Perhaps java.io.PrintStream's printf and/or format method is what you are looking for...
