Naive Bayes Text Classification Algorithm - Java

Hi there! I need help implementing the Naive Bayes text classification algorithm in Java to test my data set for research purposes. It is compulsory to implement the algorithm in Java, rather than using tools such as Weka or RapidMiner to get the results!
My data set has the following layout:
Doc  Words  Category
That is, for each training document I have the words (a String) and the category known in advance. Part of the data set is given below:
Doc  Words                                                                 Category
Training
1    Integration Communities Process Oriented Structures...(more string)   A
2    Integration Communities Process Oriented Structures...(more string)   A
3    Theory Upper Bound Routing Estimate global routing...(more string)    B
4    Hardware Design Functional Programming Perfect Match...(more string)  C
...
Test
5    Methodology Toolkit Integrate Technological Organisational
6    This test contain string naive bayes test text text test
The data set comes from a MySQL database, and it may contain multiple training strings as well as multiple test strings. The point is that I just need to implement the Naive Bayes text classification algorithm in Java.
The algorithm should follow the worked example mentioned here: Table 13.1.
Source: read here.
The thing is that I can implement the algorithm in Java myself, but I need to know whether there is an existing Java library, with source code and documentation, that would let me just test the results.
The problem is that I need the results only once; this is just a one-time test.
So, to come to the point: can somebody tell me about a good Java library that would help me code this algorithm and process my data set, or give me any good ideas on how to do it easily?
I will be thankful for your help.
Thanks in advance

As per your requirement, you can use the machine learning library MLlib from Apache Spark. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, and there is a Java code template on their site for this algorithm. To begin with, you can implement the Java skeleton for Naive Bayes provided there, given below.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;

JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ...     // test set

// train the model; 1.0 is the Laplace smoothing parameter lambda
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

// pair each test instance's prediction with its actual label
JavaPairRDD<Double, Double> predictionAndLabel =
    test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
        @Override public Tuple2<Double, Double> call(LabeledPoint p) {
            return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
        }
    });

// accuracy = fraction of test instances whose prediction matches the label
double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
        return pl._1().equals(pl._2());
    }
}).count() / (double) test.count();
For loading and processing your data sets, there is no better fit here than Spark SQL; MLlib plugs into Spark's APIs seamlessly. To start, I recommend going through the MLlib API first and implementing the algorithm according to your needs; this is fairly easy using the library.
As the next step, to process your data sets (for example, reading them from MySQL), just use Spark SQL.
I recommend sticking with this. I hunted through multiple options myself before settling on this easy-to-use library and its seamless support for interoperation with other technologies. I would have posted the complete code here to fit your case exactly, but I think you are good to go.
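To make the skeleton concrete, the missing piece is turning each document into a LabeledPoint. Below is a minimal sketch of one way to do it, assuming you map categories to numeric labels yourself (e.g. A = 0.0, B = 1.0) and use MLlib's HashingTF to build term-frequency vectors; the dimensionality of 1000 is an arbitrary illustrative choice.
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import java.util.Arrays;

// hash each document's tokens into a fixed-size term-frequency vector
HashingTF tf = new HashingTF(1000); // 1000 features; tune to your vocabulary size

// one training document from the data set, labeled 0.0 for category "A"
Vector features = tf.transform(
        Arrays.asList("integration communities process oriented structures".split(" ")));
LabeledPoint doc1 = new LabeledPoint(0.0, features);
Building a JavaRDD<LabeledPoint> of such points (e.g. via JavaSparkContext.parallelize) gives you the training and test inputs the skeleton above expects.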

You can use the Weka Java API and include it in your project if you do not want to use the GUI.
Here's a link to the documentation to incorporate a classifier in your code:
https://weka.wikispaces.com/Use+WEKA+in+your+Java+code
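In outline, what that page describes boils down to a few calls. A minimal sketch, assuming your data is already loaded as weka.core.Instances with the class index set:
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

NaiveBayes nb = new NaiveBayes();
nb.buildClassifier(train);                                 // train: Instances with class index set
double predicted = nb.classifyInstance(test.instance(0));  // index into the class attribute's values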

Please take a look at the Bow toolkit.
It has a GNU license and source code. Its features include:
Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
Performing test/train splits and automatic classification tests.
It's not a Java library, but you could compile the C code to check that your Java gives similar results for a given corpus.
I also spotted a decent Dr. Dobb's article that implements it in Perl. Once again, it is not the desired Java, but it will give you the one-time results that you are asking for.

Hi, I think Spark would help you a lot:
http://spark.apache.org/docs/1.2.0/mllib-naive-bayes.html
You can even choose the language you think is most appropriate to your needs: Java / Python / Scala!

You may want to take a look at this.
https://mahout.apache.org/users/classification/bayesian.html

You could also use scikit-learn from Python (built on the SciPy stack). There is already an implementation of what you need:
class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
scikit-learn documentation

You can use an algorithm platform like KNIME; it has a variety of classification algorithms (Naive Bayes included). You can run it with a GUI or through its Java API.

If you want to implement the Naive Bayes text classification algorithm in Java, the WEKA Java API is a good solution. The data set has to be in .arff format, and creating an .arff file from a MySQL database is very easy. Below are the Java code for the classifier and a link to a sample .arff file.
Create a new text document, open it with Notepad, copy and paste all the text behind the link below, and save it as DataSet.arff. http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.arff
Download the Weka Java API: http://www.java2s.com/Code/Jar/w/weka.htm
Code for the classifier:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

public static void main(String[] args) {
    try {
        StringBuilder txtAreaShow = new StringBuilder();

        // read the .arff file
        BufferedReader breader = new BufferedReader(new FileReader("DataSet.arff"));
        Instances train = new Instances(breader);
        // if there are 40 attributes, index 39 is the class attribute (yes/no)
        train.setClassIndex(train.numAttributes() - 1);
        breader.close();

        NaiveBayes nB = new NaiveBayes();
        nB.buildClassifier(train);

        // 10-fold cross-validation
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(nB, train, 10, new Random(1));

        System.out.println("Run Information\n=====================");
        System.out.println("Scheme: " + nB.getClass().getName());
        System.out.println("Relation: " + train.relationName());
        System.out.println("\nClassifier Model (full training set)\n===============================");
        System.out.println(nB);
        System.out.println(eval.toSummaryString("\nSummary Results\n==================", true));
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());

        // text-area output
        txtAreaShow.append("\n\n\n");
        txtAreaShow.append("Run Information\n===================\n");
        txtAreaShow.append("Scheme: " + nB.getClass().getName());
        txtAreaShow.append("\n\nClassifier Model (full training set)"
                + "\n======================================\n");
        txtAreaShow.append("" + nB);
        txtAreaShow.append(eval.toSummaryString("\n\nSummary Results\n==================\n", true));
        txtAreaShow.append(eval.toClassDetailsString());
        txtAreaShow.append(eval.toMatrixString());
        txtAreaShow.append("\n\n\n");
        System.out.println(txtAreaShow.toString());
    } catch (FileNotFoundException ex) {
        System.err.println("File not found");
        System.exit(1);
    } catch (IOException ex) {
        System.err.println("Invalid input or output.");
        System.exit(1);
    } catch (Exception ex) {
        System.err.println("Exception occurred!");
        System.exit(1);
    }
}
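Note that the linked weather data set has nominal attributes, while the question's documents are free text. To run the same pipeline on text, the string attribute must first be converted into word-count features. A minimal sketch with Weka's StringToWordVector filter, assuming an Instances object rawTrain whose string attribute holds the document and whose class attribute is the category:
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// turn the string attribute into one numeric attribute per word
StringToWordVector s2wv = new StringToWordVector();
s2wv.setInputFormat(rawTrain);
Instances vectorized = Filter.useFilter(rawTrain, s2wv);
// the class attribute usually ends up at index 0 after this filter; adjust if needed
vectorized.setClassIndex(0);
The vectorized Instances can then be fed to buildClassifier and crossValidateModel exactly as in the code above.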

You can take a look at Blayze. It's a pretty minimal Naive Bayes library for the JVM, written in Kotlin, and should be easy to follow.
Full disclosure: I'm one of the authors of Blayze.

Related

Reading an SPSS file in Java

SPSSReader reader = new SPSSReader(args[0], null);
Iterator it = reader.getVariables().iterator();
while (it.hasNext())
{
    System.out.println(it.next());
}
I am using this SPSSReader to read an SPSS file. Here, every string is printed with some junk characters appended to it.
Obtained Result :
StringVariable: nameogr(nulltpc{)(10)
NumericVariable: weightppuo(nullf{nd)
DateVariable: datexsgzj(nulllanck)
DateVariable: timeppzb(null|wt{l)
DateVariable: datetimegulj{(null|ns)
NumericVariable: commissionyrqh(nullohzx)
NumericVariable: priceeub{av(nullvlpl)
Expected Result :
StringVariable: name (10)
NumericVariable: weight
DateVariable: date
DateVariable: time
DateVariable: datetime
NumericVariable: commission
NumericVariable: price
Thanks in advance :)
I tried recreating the issue and found the same thing.
Considering that there is licensing for that library (see here), I would assume this might be a way for the developers to ensure that a license is bought, as the regular download only contains a demo version for evaluation (see the licensing section before the download).
As that library is rather old (the website's copyright is 2003-2008, the requirement is Java 1.2, there are no generics, Vectors are used, etc.), I would recommend a different library as long as you are not limited to the one used in your question.
After a quick search, it turned out that there is an open-source SPSS reader here, which is also available through Maven here.
Using the example on the github page, I put this together:
import com.bedatadriven.spss.SpssDataFileReader;
import com.bedatadriven.spss.SpssVariable;

public class SPSSDemo {
    public static void main(String[] args) {
        try {
            SpssDataFileReader reader = new SpssDataFileReader(args[0]);
            for (SpssVariable var : reader.getVariables()) {
                System.out.println(var.getVariableName());
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
I wasn't able to find anything that would print NumericVariable or similar, but as those were the class names of the library used in your question, I assume they are not SPSS-standardized. If they are, you will either find something like them in the library or you can open an issue on the GitHub page.
Using the employees.sav file from here, I got this output from the code above using the open-source library:
resp_id
gender
first_name
last_name
date_of_birth
education_type
education_years
job_type
experience_years
monthly_income
job_satisfaction
No additional characters anymore!
Edit regarding the comment:
That is correct. I read through some SPSS material, though, and from my understanding there are only string and numeric variables, which are then formatted in different ways. The version published on Maven only gives you access to the type code of a variable (to be honest, no idea what that is), but the GitHub version (which unfortunately does not appear to be published on Maven as 1.3-SNAPSHOT) does, after write and print formats were introduced.
You can clone or download the library, run mvn clean package (assuming you have Maven installed), and use the generated jar (found under target/spss-reader-1.3-SNAPSHOT.jar) in your project to have the methods SpssVariable#getPrintFormat and SpssVariable#getWriteFormat available.
Those return an SpssVariableFormat, which you can get more information from. As I have no clue what all of that is about, the best I can do is link you to the source here, where the references to what was implemented should help you further (I assume the link referenced in the documentation of SpssVariableFormat#getType is probably the most helpful for determining what kind of format you have there).
If absolutely NOTHING works with that, I guess you could use the demo version of the library in the question to determine the types through it.next().getClass().getSimpleName() as well, but I would resort to that only if there is no other way of determining the format.
I am not sure, but looking at your code, it.next() is returning a Variable object.
There has to be some method to call on the Variable object, something like it.next().getLabel() or it.next().getVariableName(). toString() on an Object is not always meaningful; check the toString() method of the Variable class in the SPSSReader library.
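As a sketch of that suggestion (the base class and accessor names here are guesses; substitute whatever the SPSSReader library's Variable class actually exposes):
Iterator it = reader.getVariables().iterator();
while (it.hasNext())
{
    Variable var = (Variable) it.next();        // hypothetical base class
    System.out.println(var.getVariableName());  // instead of relying on toString()
}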

How to know the Java interfaces an OpenOffice Calc UNO object supports (through queryInterface)

I'm developing a "macro" for OpenOffice Calc. As the language, I chose Java, in order to get code assistance in Eclipse. I even wrote a small ant build script that compiles and embeds the "macro" in an *.ods file. In general, this works fine and surprisingly fast; I'm already using some simple stuff quite successfully.
BUT
So often I get stuck because, with UNO, I need to "query" an interface for any given non-trivial object to be able to access its data or call its methods. That means I literally have to guess which interfaces a given object provides. This is not at all obvious, not visible during Java development (through some sort of meta-information, reflection, or the like), and only sparsely documented (I downloaded tons of stuff, but I can't find the source or JavaDoc for the interfaces I'm using, like XButton, XPropertySet, etc. - XButton has setLabel, but not getLabel - what??).
There is online documentation (for the most fundamental concepts, which is not bad at all!), but it lacks many of the details I'm faced with. It always magically stops exactly at the point I need to solve.
I'm willing to look at the C++ code to get a clue which interfaces an object (e.g. the button / event I'm currently stuck with) may provide. Confusingly, the C++ class and file names don't exactly match the Java interfaces. It's almost what I'm looking for, but then in Java I don't really find the equivalent, or calling queryInterface on a given object returns null. It's becoming a bit frustrating.
How are the UNO Java interfaces generated? Is there some kind of documentation in the code that serves as the origin for the generated (Java) code?
I think I really need to know what interfaces are available at which point, in order to become a bit more fluent during Java-UNO-macro development.
For any serious UNO project, use an introspection tool.
As an example, I created a button in Calc, then used the Java Object Inspector to browse to the button.
Right-clicking and choosing "Add to Source Code" generated the following.
import com.sun.star.awt.XControlModel;
import com.sun.star.beans.XPropertySet;
import com.sun.star.container.XIndexAccess;
import com.sun.star.container.XNameAccess;
import com.sun.star.drawing.XControlShape;
import com.sun.star.drawing.XDrawPage;
import com.sun.star.drawing.XDrawPageSupplier;
import com.sun.star.sheet.XSpreadsheetDocument;
import com.sun.star.sheet.XSpreadsheets;
import com.sun.star.uno.AnyConverter;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.uno.XInterface;
//...
public void codesnippet(XInterface _oUnoEntryObject) {
    try {
        // document -> sheets -> "Sheet1" -> draw page -> first shape -> control model -> "Label" property
        XSpreadsheetDocument xSpreadsheetDocument = (XSpreadsheetDocument) UnoRuntime.queryInterface(XSpreadsheetDocument.class, _oUnoEntryObject);
        XSpreadsheets xSpreadsheets = xSpreadsheetDocument.getSheets();
        XNameAccess xNameAccess = (XNameAccess) UnoRuntime.queryInterface(XNameAccess.class, xSpreadsheets);
        Object oName = xNameAccess.getByName("Sheet1");
        XDrawPageSupplier xDrawPageSupplier = (XDrawPageSupplier) UnoRuntime.queryInterface(XDrawPageSupplier.class, oName);
        XDrawPage xDrawPage = xDrawPageSupplier.getDrawPage();
        XIndexAccess xIndexAccess = (XIndexAccess) UnoRuntime.queryInterface(XIndexAccess.class, xDrawPage);
        Object oIndex = xIndexAccess.getByIndex(0);
        XControlShape xControlShape = (XControlShape) UnoRuntime.queryInterface(XControlShape.class, oIndex);
        XControlModel xControlModel = xControlShape.getControl();
        XPropertySet xPropertySet = (XPropertySet) UnoRuntime.queryInterface(XPropertySet.class, xControlModel);
        String sLabel = AnyConverter.toString(xPropertySet.getPropertyValue("Label"));
    } catch (com.sun.star.beans.UnknownPropertyException e) {
        e.printStackTrace(System.out);
        //Enter your Code here...
    } catch (com.sun.star.lang.WrappedTargetException e2) {
        e2.printStackTrace(System.out);
        //Enter your Code here...
    } catch (com.sun.star.lang.IllegalArgumentException e3) {
        e3.printStackTrace(System.out);
        //Enter your Code here...
    }
}
//...
Python-UNO may be better than Java because it does not require querying specific interfaces. Also XrayTool and MRI are easier to use than the Java Object Inspector.

Weka Java SpreadSubsample filter

I'm quite familiar with Weka, as I've used the GUI. I'm doing some classification experiments that require the SpreadSubsample filter on both my training and testing data.
I'm learning Java and want to use the Weka API to do this. I've got to the point where I'm loading my training and testing data into Weka like so:
DataSource source = new DataSource("training.arff");
Instances trainingData = source.getDataSet();
if (trainingData.classIndex() == -1)
    trainingData.setClassIndex(trainingData.numAttributes() - 1);
and I'm getting output; everything is working.
However, I have no idea how to implement a filter. I already have the training and testing .arff files and need to run them through the SpreadSubsample filter before loading them into Weka.
If anyone could help with a thorough explanation and answer, it'd be much appreciated. Thank you.
Here is some sample code:
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

SpreadSubsample ff = new SpreadSubsample();
String opt = " ";                                       // any options you like, see the documentation
String[] optArray = weka.core.Utils.splitOptions(opt);  // the right format for the options
ff.setOptions(optArray);
ff.setInputFormat(dataset);                             // dataset: your loaded Instances
Instances filteredInstances = Filter.useFilter(dataset, ff);
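Since you need the filter on both the training and the testing data, one reasonable pattern (a sketch, assuming train and test are already loaded Instances with their class index set) is to initialise the filter on the training data once and then pass both sets through it, so the two stay format-compatible:
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

SpreadSubsample ff = new SpreadSubsample();
ff.setOptions(weka.core.Utils.splitOptions("-M 1.0")); // "-M 1.0" = uniform class spread, just an example
ff.setInputFormat(train);                              // initialise on the training data
Instances filteredTrain = Filter.useFilter(train, ff);
Instances filteredTest = Filter.useFilter(test, ff);   // same initialised filter for the test set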
Hope it helps.

What is the estimate function in topic modeling using the Mallet library

I'm new to topic modeling and I'm trying to use the Mallet library, but I have a question.
I'm using the simple parallel threaded implementation of LDA to find topics for some instances. My question is: what is the estimate function in ParallelTopicModel?
I have searched in the API, but it has no description. I have also read this tutorial.
Can someone explain what this function is?
EDIT
This is an example of my code:
public void runModel(String[] str) {
    ParallelTopicModel model = new ParallelTopicModel(numTopics);

    // Pipes: lowercase, tokenize, map to features
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    pipeList.add(new CharSequenceLowercase());
    pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
    pipeList.add(new TokenSequence2FeatureSequence());

    InstanceList instances = new InstanceList(new SerialPipes(pipeList));
    instances.addThruPipe(new StringArrayIterator(str));

    model.addInstances(instances);
    model.setNumThreads(THREADS);
    model.setOptimizeInterval(optimizeation);
    model.setBurninPeriod(burninInterval);
    model.setNumIterations(numIterations);
    // model.estimate();
}
estimate() runs LDA, attempting to estimate the topic model given the data and settings you've already set up.
Have a look at the main() function of the ParallelTopicModel source for inspiration about what's needed to estimate a model.
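As a rough sketch of the call (following the pattern in Mallet's developer examples; note that estimate() declares throws IOException, so your runModel would need to declare or catch it):
model.estimate(); // runs the Gibbs sampler: this is the actual LDA training

// after estimation, per-document topic proportions are available
double[] topicDistribution = model.getTopicProbabilities(0); // topic mix of document 0
for (int topic = 0; topic < topicDistribution.length; topic++) {
    System.out.println("Topic " + topic + ": " + topicDistribution[topic]);
}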

Get prediction percentage in WEKA using own Java code and a model

Overview
I know that one can get the percentages of each prediction in a trained WEKA model through the GUI and command line options as conveniently explained and demonstrated in the documentation article "Making predictions".
Predictions
I know that there are three documented ways to get these predictions:
command line
GUI
Java code/using the WEKA API, which I was able to do in the answer to "Get risk predictions in WEKA using own Java code"
The fourth way, the one I am after, requires a generated WEKA .MODEL file.
I have a trained .MODEL file, and now I want to classify new instances with it and obtain the prediction percentages, similar to the output below (produced by the GUI's Explorer, in CSV format):
inst#,actual,predicted,error,distribution,
1,1:0,2:1,+,0.399409,*0.7811
2,1:0,2:1,+,0.3932409,*0.8191
3,1:0,2:1,+,0.399409,*0.600591
4,1:0,2:1,+,0.139409,*0.64
5,1:0,2:1,+,0.399409,*0.600593
6,1:0,2:1,+,0.3993209,*0.600594
7,1:0,2:1,+,0.500129,*0.600594
8,1:0,2:1,+,0.399409,*0.90011
9,1:0,2:1,+,0.211409,*0.60182
10,1:0,2:1,+,0.21909,*0.11101
The predicted column is what I want to get from a .MODEL file.
What I know
Based on my experience with the WEKA API approach, one can get these predictions using the following code (a PlainText object inserted into an Evaluation object), BUT I do not want to do the k-fold cross-validation that the Evaluation object provides.
StringBuffer predictionSB = new StringBuffer();
Range attributesToShow = null;
Boolean outputDistributions = new Boolean(true);

PlainText predictionOutput = new PlainText();
predictionOutput.setBuffer(predictionSB);
predictionOutput.setOutputDistribution(true);

Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(j48Model, data, numberOfFolds,
        randomNumber, predictionOutput, attributesToShow,
        outputDistributions);
System.out.println(predictionOutput.getBuffer());
From the WEKA documentation
How a .MODEL file classifies data from an .ARFF or related input is discussed in "Use Weka in your Java code" and "Serialization", a.k.a. "How to use a .MODEL file in your own Java code to classify new instances" (why the vague title?).
Using own Java code to classify
A .MODEL file is loaded through "deserialization"; the following is for versions > 3.5.5:
// deserialize model
Classifier cls = (Classifier) weka.core.SerializationHelper.read("/some/where/j48.model");
An Instance object holds the data, and it is fed to classifyInstance. An output is produced (its form depending on the data type of the outcome attribute):
// classify an Instance object (testData)
cls.classifyInstance(testData.instance(0));
The question "How to reuse saved classifier created from explorer(in weka) in eclipse java" has a great answer too!
Javadocs
I have already checked the Javadocs for Classifier (the trained model) and Evaluation (just in case) but none directly and explicitly addresses this issue.
The closest thing to what I want is the classifyInstance method of the Classifier:
Classifies the given test instance. The instance has to belong to a dataset when it's being classified. Note that a classifier MUST implement either this or distributionForInstance().
How can I simultaneously use a WEKA .MODEL file to classify and get predictions of a new instance using my own Java code (aka using the WEKA API)?
This answer simply updates my answer from How to reuse saved classifier created from explorer(in weka) in eclipse java.
I will show how to obtain the predicted instance value and the prediction percentage (or distribution). The example model is a J48 decision tree created and saved in the Weka Explorer. It was built from the nominal weather data provided with Weka. It is called "tree.model".
import weka.classifiers.Classifier;
import weka.core.Instances;

public class Main {
    public static void main(String[] args) throws Exception {
        String rootPath = "/some/where/";

        // load the model
        Classifier cls = (Classifier) weka.core.SerializationHelper.read(rootPath + "tree.model");

        // load or create the Instances to predict (class attribute must be set)
        Instances originalTrain = // instances here

        // which instance to predict the class value of
        int s1 = 0;

        // perform the prediction
        double value = cls.classifyInstance(originalTrain.instance(s1));

        // get the prediction percentage or distribution
        double[] percentage = cls.distributionForInstance(originalTrain.instance(s1));

        // get the name of the class value
        String prediction = originalTrain.classAttribute().value((int) value);

        System.out.println("The predicted value of instance "
                + Integer.toString(s1) + ": " + prediction);

        // format the distribution, marking the predicted class with '*'
        String distribution = "";
        for (int i = 0; i < percentage.length; i = i + 1) {
            if (i == value) {
                distribution = distribution + "*" + Double.toString(percentage[i]) + ",";
            } else {
                distribution = distribution + Double.toString(percentage[i]) + ",";
            }
        }
        distribution = distribution.substring(0, distribution.length() - 1);
        System.out.println("Distribution: " + distribution);
    }
}
The output from this is:
The predicted value of instance 0: no
Distribution: *1, 0
