How to implement Word2Vec in Java?

I installed word2vec using this tutorial on my Ubuntu laptop. Is it completely necessary to install DL4J in order to work with word2vec vectors in Java? I'm comfortable working in Eclipse, and I'm not sure that I want all the other prerequisites that DL4J wants me to install.
Ideally there would be a really easy way for me to take the Java code I've already written (in Eclipse) and change a few lines, so that the word look-ups I'm doing would retrieve a word2vec vector instead of using my current retrieval process.
Also, I've looked into using GloVe; however, I do not have MATLAB. Is it possible to use GloVe without MATLAB? (I got an error while installing it because of this.) If so, the same question as above applies: I have no idea how to implement it in Java.

What is preventing you from saving the word2vec (the C program) output in text format and then reading the file with a piece of Java code that loads the vectors into a hashmap keyed by the word string?
Some code snippets:
// Class to store a hashmap of wordvecs
public class WordVecs {
    HashMap<String, WordVec> wordvecmap;
    ....
    void loadFromTextFile() {
        String wordvecFile = prop.getProperty("wordvecs.vecfile");
        wordvecmap = new HashMap<>();
        try (FileReader fr = new FileReader(wordvecFile);
             BufferedReader br = new BufferedReader(fr)) {
            String line;
            while ((line = br.readLine()) != null) {
                WordVec wv = new WordVec(line);
                wordvecmap.put(wv.word, wv);
            }
        }
        catch (Exception ex) { ex.printStackTrace(); }
    }
    ....
}
// class for each wordvec
public class WordVec implements Comparable<WordVec> {
    String word;
    float[] vec;
    float norm;

    public WordVec(String line) {
        String[] tokens = line.split("\\s+");
        word = tokens[0];
        vec = new float[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++)
            vec[i - 1] = Float.parseFloat(tokens[i]);
        norm = getNorm();
    }
    ....
}
If you want to get the nearest neighbours for a given word, you can keep a list of N nearest pre-computed neighbours associated with each WordVec object.
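For instance, a rough sketch of that pre-computation (my own sketch, not from the snippets above; it assumes the word, vec and norm fields of the WordVec class and the usual java.util imports, and could live inside WordVecs):
// Cosine similarity between two word vectors, using the pre-computed norms.
float cosineSim(WordVec a, WordVec b) {
    float dot = 0;
    for (int i = 0; i < a.vec.length; i++)
        dot += a.vec[i] * b.vec[i];
    return dot / (a.norm * b.norm);
}

// Sort all other vectors by similarity to w and keep the top n.
List<WordVec> nearestNeighbours(WordVec w, Collection<WordVec> all, int n) {
    List<WordVec> candidates = new ArrayList<>(all);
    candidates.remove(w);
    candidates.sort((x, y) -> Float.compare(cosineSim(w, y), cosineSim(w, x)));
    return candidates.subList(0, Math.min(n, candidates.size()));
}
You would call nearestNeighbours once per word after loading and store the result on the WordVec object.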

DL4J author here. Our word2vec implementation is targeted at people who need custom pipelines, so I don't blame you for going the simple route here.
It is meant for when you want to do something further with the vectors, not for messing around. The C word2vec output format is pretty straightforward.
Here is the parsing logic in Java if you'd like:
https://github.com/deeplearning4j/deeplearning4j/blob/374609b2672e97737b9eb3ba12ee62fab6cfee55/deeplearning4j-scaleout/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/embeddings/loader/WordVectorSerializer.java#L113
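If you trained with -binary 1 instead of the text output, the layout is still simple: a header line with the vocabulary size and the vector size, then for each word the token, a space, and vectorSize little-endian 4-byte floats. A rough sketch of a reader (my own sketch, not the DL4J code linked above; it assumes single-byte/ASCII tokens):
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;

public class BinaryVecReader {
    public static Map<String, float[]> load(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            int vocabSize = Integer.parseInt(readToken(in));
            int vectorSize = Integer.parseInt(readToken(in));
            byte[] raw = new byte[4 * vectorSize];
            Map<String, float[]> vecs = new HashMap<String, float[]>();
            for (int w = 0; w < vocabSize; w++) {
                String word = readToken(in);   // token ends at the space before the floats
                in.readFully(raw);
                ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
                float[] vec = new float[vectorSize];
                for (int i = 0; i < vectorSize; i++) vec[i] = buf.getFloat();
                vecs.put(word, vec);
            }
            return vecs;
        }
    }

    // Reads bytes up to the next space or newline, skipping any leading whitespace.
    private static String readToken(DataInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            char c = (char) b;
            if (c == ' ' || c == '\n' || c == '\r') {
                if (sb.length() > 0) break;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}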
Hope that helps a bit

Related

Text classifier with Weka: how to correctly train a classifier

I'm trying to build a text classifier using Weka, but the class probabilities returned by distributionForInstance are 1.0 for one class and 0.0 for all the others, so classifyInstance always returns the same class as the prediction. Something in the training doesn't work correctly.
ARFF training
@relation test1
@attribute tweetmsg String
@attribute classValues {politica,sport,musicatvcinema,infogeneriche,fattidelgiorno,statopersonale,checkin,conversazione}
@data
"Renzi Berlusconi Salvini Bersani",politica
"Allegri insulta la terna arbitrale",sport
"Bravo Garcia",sport
Training methods
public void trainClassifier(final String INPUT_FILENAME) throws Exception
{
    getTrainingDataset(INPUT_FILENAME);
    //trainingInstances consists of feature vector of every input
    for (Instance currentInstance : inputDataset)
    {
        Instance currentFeatureVector = extractFeature(currentInstance);
        currentFeatureVector.setDataset(trainingInstances);
        trainingInstances.add(currentFeatureVector);
    }
    classifier = new NaiveBayes();
    try {
        //classifier training code
        classifier.buildClassifier(trainingInstances);
        //storing the trained classifier to a file for future use
        weka.core.SerializationHelper.write("NaiveBayes.model", classifier);
    } catch (Exception ex) {
        System.out.println("Exception in training the classifier."+ex);
    }
}
private Instance extractFeature(Instance inputInstance) throws Exception
{
    String tweet = inputInstance.stringValue(0);
    StringTokenizer defaultTokenizer = new StringTokenizer(tweet);
    List<String> tokens = new ArrayList<String>();
    while (defaultTokenizer.hasMoreTokens())
    {
        String t = defaultTokenizer.nextToken();
        tokens.add(t);
    }
    Iterator<String> a = tokens.iterator();
    while (a.hasNext())
    {
        String token = (String) a.next();
        String word = token.replaceAll("#", "");
        if (featureWords.contains(word))
        {
            double cont = featureMap.get(featureWords.indexOf(word)) + 1;
            featureMap.put(featureWords.indexOf(word), cont);
        }
        else
        {
            featureWords.add(word);
            featureMap.put(featureWords.indexOf(word), 1.0);
        }
    }
    attributeList.clear();
    for (String featureWord : featureWords)
    {
        attributeList.add(new Attribute(featureWord));
    }
    attributeList.add(new Attribute("Class", classValues));
    int indices[] = new int[featureMap.size() + 1];
    double values[] = new double[featureMap.size() + 1];
    int i = 0;
    for (Map.Entry<Integer, Double> entry : featureMap.entrySet())
    {
        indices[i] = entry.getKey();
        values[i] = entry.getValue();
        i++;
    }
    indices[i] = featureWords.size();
    values[i] = (double) classValues.indexOf(inputInstance.stringValue(1));
    trainingInstances = createInstances("TRAINING_INSTANCES");
    return new SparseInstance(1.0, values, indices, 1000000);
}
private void getTrainingDataset(final String INPUT_FILENAME)
{
    try {
        ArffLoader trainingLoader = new ArffLoader();
        trainingLoader.setSource(new File(INPUT_FILENAME));
        inputDataset = trainingLoader.getDataSet();
    } catch (IOException ex)
    {
        System.out.println("Exception in getTrainingDataset Method");
    }
    System.out.println("dataset " + inputDataset.numAttributes());
}
private Instances createInstances(final String INSTANCES_NAME)
{
    //create an Instances object with initial capacity as zero
    Instances instances = new Instances(INSTANCES_NAME, attributeList, 0);
    //sets the class index as the last attribute
    instances.setClassIndex(instances.numAttributes() - 1);
    return instances;
}
public static void main(String[] args) throws Exception
{
    Classificatore wekaTutorial = new Classificatore();
    wekaTutorial.trainClassifier("training_set_prova_tent.arff");
    wekaTutorial.testClassifier("testing.arff");
}
public Classificatore()
{
    attributeList = new ArrayList<Attribute>();
    initialize();
}
private void initialize()
{
    featureWords = new ArrayList<String>();
    featureMap = new TreeMap<>();
    classValues = new ArrayList<String>();
    classValues.add("politica");
    classValues.add("sport");
    classValues.add("musicatvcinema");
    classValues.add("infogeneriche");
    classValues.add("fattidelgiorno");
    classValues.add("statopersonale");
    classValues.add("checkin");
    classValues.add("conversazione");
}
Testing methods
public void testClassifier(final String INPUT_FILENAME) throws Exception
{
    getTrainingDataset(INPUT_FILENAME);
    //trainingInstances consists of feature vector of every input
    Instances testingInstances = createInstances("TESTING_INSTANCES");
    for (Instance currentInstance : inputDataset)
    {
        //extractFeature method returns the feature vector for the current input
        Instance currentFeatureVector = extractFeature(currentInstance);
        //Make the currentFeatureVector to be added to the testingInstances
        currentFeatureVector.setDataset(testingInstances);
        testingInstances.add(currentFeatureVector);
    }
    try {
        //Classifier deserialization
        classifier = (Classifier) weka.core.SerializationHelper.read("NaiveBayes.model");
        //classifier testing code
        for (Instance testInstance : testingInstances)
        {
            double score = classifier.classifyInstance(testInstance);
            double[] vv = classifier.distributionForInstance(testInstance);
            for (int k = 0; k < vv.length; k++) {
                System.out.println("distribution " + vv[k]); //these are the class probabilities; as a result I get 1.0 for one class and 0.0 for all the others
            }
            System.out.println(testingInstances.attribute("Class").value((int) score));
        }
    } catch (Exception ex) {
        System.out.println("Exception in testing the classifier."+ex);
    }
}
I want to create a text classifier for short messages; this code is based on this tutorial: http://preciselyconcise.com/apis_and_installations/training_a_weka_classifier_in_java.php. The problem is that the classifier predicts the wrong class for almost every message in testing.arff because the class probabilities are not correct. training_set_prova_tent.arff has the same number of messages per class.
The example I'm following uses a featureWords.dat file and associates 1.0 with a word if it is present in a message; instead, I want to create my own dictionary from the words present in training_set_prova_tent plus the words present in the testing set, and associate with every word its number of occurrences.
P.S.
I know that this is exactly what I can do with the StringToWordVector filter, but I haven't found any example that explains how to use this filter with two files: one for the training set and one for the test set. So it seemed easier to adapt the code I found.
Thank you very much
It seems like you changed the code from the website you referenced in some crucial points, but not in a good way. I'll try to draft what you're trying to do and what mistakes I've found.
What you (probably) wanted to do in extractFeature is
Split each tweet into words (tokenize)
Count the number of occurrences of these words
Create a feature vector representing these word counts plus the class
What you've overlooked in that method is
You never reset your featureMap. The line
Map<Integer,Double> featureMap = new TreeMap<>();
originally was at the beginning of extractFeature, but you moved it to initialize. That means you always add up the word counts but never reset them. For each new tweet, your word count also includes the word counts of all previous tweets. I'm sure that is not what you wanted.
You don't initialize featureWords with the words you want as features. Yes, you create an empty list, but you fill it iteratively with each tweet. The original code initialized it once in the initialize method and it never changed after that. There are two problems with that:
With each new tweet, new features (words) get added, so your feature vector grows with each tweet. That wouldn't be such a big problem (SparseInstance), but that means that
Your class attribute is always in another place. These two lines work for the original code, because featureWords.size() is basically a constant, but in your code the class label will be at index 5, then 8, then 12, and so on, but it must be the same for every instance.
indices[i] = featureWords.size();
values[i] = (double) classValues.indexOf(inputInstance.stringValue(1));
This also manifests itself in the fact that you build a new attributeList with each new tweet, instead of only once in initialize, which is bad for the reasons already explained.
There may be more issues, but as it is, your code is rather unfixable. What you want is much closer to the tutorial source code you modified than to your current version.
Also, you should look into StringToWordVector because it seems like this is exactly what you want to do:
Converts String attributes into a set of attributes representing word occurrence (depending on the tokenizer) information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
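In practice that means you configure the filter once on the training data and then apply it to the test data, so both share the same dictionary. Here is a minimal, untested sketch of that batch-filtering pattern (my own, reusing the file names from your question):
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class StringToWordVectorExample {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("training_set_prova_tent.arff");
        Instances test = DataSource.read("testing.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true);      // word counts instead of 0/1 presence
        filter.setInputFormat(train);          // dictionary is learned from the training batch

        Instances trainVec = Filter.useFilter(train, filter);
        Instances testVec = Filter.useFilter(test, filter);   // same dictionary applied to the test batch

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(trainVec);
        for (int i = 0; i < testVec.numInstances(); i++) {
            double[] dist = nb.distributionForInstance(testVec.instance(i));
            System.out.println(java.util.Arrays.toString(dist));
        }
    }
}
This replaces all of the manual extractFeature bookkeeping, which is what the quoted documentation is getting at.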

I need an elegant way to exclude specific words from processing

I am writing an algorithm to extract likely keywords from a document's text. I want to count instances of words and take the top 5 as keywords. Obviously, I want to exclude "insignificant" words lest every document appear with "the" and "and" as major keywords.
Here is the strategy I've successfully used for testing:
exclusions = new ArrayList<String>();
Collections.addAll(exclusions, "a", "and", "the", "or");
Now that I want to do a real-life test, my exclusion list is close to 200 words long, and I'd LOVE to be able to do something like this:
exclusions = new ArrayList<String>();
exclusions.add(each word in foo.txt);
Long term, maintaining an external list (rather than a list embedded in my code) is desirable for obvious reasons. With all the file read/write methods out there in Java, I'm fairly certain that this can be done, but my search results have come up empty...I know I've got to be searching on the wrong keywords. Anyone know an elegant way to include an external list in processing?
This does not directly address the solution you are prescribing, but it might give you another avenue that may be better.
Instead of deciding in advance what is useless, you could count everything and then filter out what you deem insignificant (from an information-carrying standpoint) because of its overwhelming presence. It is similar to a low-pass filter in signal processing that eliminates noise.
So in short, count everything. Then, if something appears with a frequency higher than a threshold you set (you'll have to determine the threshold experimentally; if, say, 5% of all words are 'the', that suggests it does not carry information), filter it out.
If you do it this way, it'll even work for foreign languages.
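For illustration, a rough sketch of that counting-then-filtering idea (my own sketch; the 5% threshold is just the example figure from above):
import java.util.HashMap;
import java.util.Map;

public class FrequencyFilter {
    // Count every word, then drop words whose relative frequency exceeds the threshold.
    public static Map<String, Integer> significantCounts(Iterable<String> words, double maxRelativeFreq) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String w : words) {
            counts.merge(w.toLowerCase(), 1, Integer::sum);
            total++;
        }
        final int finalTotal = total;
        counts.values().removeIf(c -> (double) c / finalTotal > maxRelativeFreq);
        return counts;
    }
}
Calling significantCounts(Arrays.asList(text.split("\\s+")), 0.05) would then drop any token that makes up more than 5% of the text, and the top 5 remaining counts become your keywords.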
Just my two cents on this.
You can use a FileReader to read the Strings out of the file and add them to an ArrayList.
private List<String> createExclusions(String file) throws IOException {
    List<String> exclusions = new ArrayList<String>();
    try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
        String word;
        while ((word = reader.readLine()) != null) {
            exclusions.add(word);
        }
    }
    return exclusions;
}
Then you can use List<String> exclusions = createExclusions("exclusions.txt"); to create the list.
Not sure if it is elegant, but some years ago I created a simple solution to detect the language of tweets and remove noise words:
TweetDetector.java
JTweet.java, which uses word-list data files such as the one for English
The Google Guava library contains lots of useful methods that simplify routine tasks. You can use one of them to read the file contents into a string and split it on whitespace:
String contents = Files.toString(new File("foo.txt"), Charset.defaultCharset());
List<String> exclusions = Lists.newArrayList(contents.split("\\s"));
Apache Commons IO provides similar shortcuts:
String contents = FileUtils.readFileToString(new File("foo.txt"));
...
Commons-io has utilities that support this. Include commons-io as a dependency, then issue
File myFile = ...;
List<String> exclusions = FileUtils.readLines( myFile );
as described in:
http://commons.apache.org/io/apidocs/org/apache/commons/io/FileUtils.html
This assumes that every exclusion word is on a new line.
Reading from a file is pretty simple.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;

public class ExcludeExample {
    public static HashSet<String> readExclusions(File file) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(file));
        String line = "";
        HashSet<String> exclusions = new HashSet<String>();
        while ((line = br.readLine()) != null) {
            exclusions.add(line);
        }
        br.close();
        return exclusions;
    }

    public static void main(String[] args) throws IOException {
        File foo = new File("foo.txt");
        HashSet<String> exclusions = readExclusions(foo);
        System.out.println(exclusions.contains("the"));
        System.out.println(exclusions.contains("Java"));
    }
}
foo.txt
the
a
and
or
I used a HashSet instead of an ArrayList because it has faster lookup.

How can I call scikit-learn classifiers from Java?

I have a classifier that I trained using Python's scikit-learn. How can I use the classifier from a Java program? Can I use Jython? Is there some way to save the classifier in Python and load it in Java? Is there some other way to use it?
You cannot use Jython, as scikit-learn heavily relies on numpy and scipy, which have many compiled C and Fortran extensions and hence cannot work in Jython.
The easiest ways to use scikit-learn in a Java environment would be to:
expose the classifier as an HTTP / JSON service, for instance using a microframework such as flask, bottle, or cornice, and call it from Java using an HTTP client library;
write a command-line wrapper application in Python that reads data on stdin and outputs predictions on stdout in some format such as CSV or JSON (or some lower-level binary representation), and call the Python program from Java, for instance using Apache Commons Exec;
make the Python program output the raw numerical parameters learned at fit time (typically as an array of floating-point values) and reimplement the predict function in Java (this is typically easy for predictive linear models, where the prediction is often just a thresholded dot product; see the sketch after this list).
The last approach will be a lot more work if you need to re-implement feature extraction in Java as well.
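To illustrate that last option, here is a minimal sketch of re-implementing predict for a binary linear model in Java. It assumes you export the fitted model's coef_ and intercept_ values yourself; the class and field names here are illustrative, not part of any library:
public class LinearModel {
    private final double[] weights;   // coef_ exported from the scikit-learn model
    private final double intercept;   // intercept_ exported from the scikit-learn model

    public LinearModel(double[] weights, double intercept) {
        this.weights = weights;
        this.intercept = intercept;
    }

    // Decision function w . x + b, thresholded at 0 to get the class label.
    public int predict(double[] features) {
        double score = intercept;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return score > 0 ? 1 : 0;
    }
}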
Finally, you can use a Java library such as Weka or Mahout that implements the algorithms you need instead of trying to use scikit-learn from Java.
There is the JPMML project for this purpose.
First, you can serialize a scikit-learn model to PMML (which is XML internally) using the sklearn2pmml library directly from Python, or dump it in Python first and convert it using jpmml-sklearn in Java or with the command line provided by that library. Next, you can load the PMML file, deserialize it, and execute the loaded model using jpmml-evaluator in your Java code.
This does not work with all scikit-learn models, but it does with many of them.
As some commenters correctly pointed out, it's important to note that the JPMML project is licensed under the GNU AGPL. The AGPL is a strong copyleft license, which may limit your ability to use the project; one example would be developing a publicly accessible service while wanting to keep the sources closed.
You can also use a porter; I have tested sklearn-porter (https://github.com/nok/sklearn-porter), and it works well for Java.
My code is the following:
import pandas as pd
from sklearn import tree
from sklearn_porter import Porter
train_dataset = pd.read_csv('./result2.csv').as_matrix()
X_train = train_dataset[:90, :8]
Y_train = train_dataset[:90, 8:]
X_test = train_dataset[90:, :8]
Y_test = train_dataset[90:, 8:]
print X_train.shape
print Y_train.shape
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)
porter = Porter(clf, language='java')
output = porter.export(embed_data=True)
print(output)
In my case, I'm using a DecisionTreeClassifier, and the output of
print(output)
is the following code as text in the console:
class DecisionTreeClassifier {

    private static int findMax(int[] nums) {
        int index = 0;
        for (int i = 0; i < nums.length; i++) {
            index = nums[i] > nums[index] ? i : index;
        }
        return index;
    }

    public static int predict(double[] features) {
        int[] classes = new int[2];
        if (features[5] <= 51.5) {
            if (features[6] <= 21.0) {
                // HUGE amount of ifs..........
            }
        }
        return findMax(classes);
    }

    public static void main(String[] args) {
        if (args.length == 8) {
            // Features:
            double[] features = new double[args.length];
            for (int i = 0, l = args.length; i < l; i++) {
                features[i] = Double.parseDouble(args[i]);
            }
            // Prediction:
            int prediction = DecisionTreeClassifier.predict(features);
            System.out.println(prediction);
        }
    }
}
Here is some code for the JPMML solution:
--PYTHON PART--
# helper function to determine the string columns which have to be one-hot-encoded in order to apply an estimator.
def determine_categorical_columns(df):
    categorical_columns = []
    x = 0
    for col in df.dtypes:
        if col == 'object':
            val = df[df.columns[x]].iloc[0]
            if not isinstance(val, Decimal):
                categorical_columns.append(df.columns[x])
        x += 1
    return categorical_columns

categorical_columns = determine_categorical_columns(df)
other_columns = list(set(df.columns).difference(categorical_columns))

# construction of transformators for our example
labelBinarizers = [(d, LabelBinarizer()) for d in categorical_columns]
nones = [(d, None) for d in other_columns]
transformators = labelBinarizers + nones

mapper = DataFrameMapper(transformators, df_out=True)
gbc = GradientBoostingClassifier()

# construction of the pipeline
lm = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", gbc)
])
--JAVA PART--
//Initialisation.
String pmmlFile = "ScikitLearnNew.pmml";
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);

//Determine which features are required as input
HashMap<String, Field> inputFieldMap = new HashMap<String, Field>();
for (int i = 0; i < evaluator.getInputFields().size(); i++) {
    InputField curInputField = evaluator.getInputFields().get(i);
    String fieldName = curInputField.getName().getValue();
    inputFieldMap.put(fieldName.toLowerCase(), curInputField.getField());
}

//prediction
HashMap<String, String> argsMap = new HashMap<String, String>();
//... fill argsMap with input
Map<FieldName, ?> res;

//here we keep only features that are required by the model
Map<FieldName, String> args = new HashMap<FieldName, String>();
Iterator<String> iter = argsMap.keySet().iterator();
while (iter.hasNext()) {
    String key = iter.next();
    Field f = inputFieldMap.get(key);
    if (f != null) {
        FieldName name = f.getName();
        String value = argsMap.get(key);
        args.put(name, value);
    }
}

//the model is applied to the input and a probability distribution is obtained
res = evaluator.evaluate(args);
SegmentResult segmentResult = (SegmentResult) res;
Object targetValue = segmentResult.getTargetValue();
ProbabilityDistribution probabilityDistribution = (ProbabilityDistribution) targetValue;
I found myself in a similar situation.
I'd recommend carving out a classifier microservice: have the classifier run as a service in Python and expose calls to it over a RESTful API using a JSON/XML data-interchange format. I think this is a cleaner approach.
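For example, the Java side of such a call can be as small as this sketch. The /predict endpoint, port, and JSON payload are hypothetical and would be whatever your Python service defines (this uses the JDK 11+ HttpClient):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ClassifierClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint exposed by the Python classifier service.
        String json = "{\"features\": [5.1, 3.5, 1.4, 0.2]}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5000/predict"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // e.g. {"prediction": 0}
    }
}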
Alternatively, you can just generate code from the trained model. Here is a tool that can help you with that: https://github.com/BayesWitnesses/m2cgen

Having trouble opening a file in Java

I am trying to open this file in Java and I want to know what I am doing wrong. The .in file lies in the same directory as my Java file, but when I tried to open it with both NetBeans and Eclipse it gave a FileNotFoundException. Can someone help me open this file and read from it? I am really new to file handling in Java. Here is the code:
import java.util.*;
import java.io.*;

public class Practice
{
    public static void main(String[] args) throws IOException
    {
        FileReader fin = new FileReader("anagrams.in");
        BufferedReader br = new BufferedReader(fin);
        System.out.println(fin);
        String string = "Madam Curie";
        String test = "Radium came";
        string = string.toLowerCase();
        test = test.toLowerCase();
        string = string.replaceAll("[^a-zA-Z0-9]+", "");
        test = test.replaceAll("[^a-zA-Z0-9]+", "");
        char[] array = string.toCharArray();
        char[] array2 = test.toCharArray();
        boolean flag = false;
        HashMap hm = new HashMap();
        for (int i = 0; i < array.length; i++)
        {
            hm.put(array[i], array[i]);
        }
        for (int i = 0; i < array2.length; i++)
        {
            if (hm.get(array2[i]) == null || test.length() != string.length())
            {
                flag = false;
                i = array2.length;
            }
            else
            {
                flag = true;
            }
        }
        System.out.println(flag);
    }
}
A few tips:
Abide by proper code indentation
If you're using an IDE like Eclipse, it can automatically correct indentation for you
Develop debugging instinct
Try to get what the current working directory is, and list all the files in it
Refactor repetitive code
Writing paired statements like you did should immediately raise red flags
Effective Java 2nd Edition
Item 23: Don't use raw types in new code
Item 52: Refer to objects by their interfaces
Item 46: Prefer for-each loops to traditional for loops
Use sensible variable names
With regards to 2, try something like this:
public static void listDir() {
    File current = new File(".");
    System.out.println(current.getAbsolutePath());
    for (String filename : current.list()) {
        System.out.println(filename);
    }
}
Then in your main, simply call listDir before everything else, and see if you're running the app from the right directory and whether there's an "anagrams.in" in that directory. Note that some platforms are case-sensitive.
With regards to 3 and 4, consider having a helper method like this:
static Set<Character> usedCharactersIn(String s) {
    Set<Character> set = new HashSet<Character>();
    for (char ch : s.toLowerCase().toCharArray()) {
        set.add(ch);
    }
    return set;
}
Note how Set<E> is used instead of Map<K,V>. Looking at the rest of the code, you didn't seem to actually need a mapping, but rather a set of some sort (but more on that later).
You can then have something like this in main, which makes the logic very readable:
String s1 = ...;
String s2 = ...;
boolean isNotQuiteAnagram = (s1.length() == s2.length()) &&
usedCharactersIn(s1).containsAll(usedCharactersIn(s2));
Note how variables are now named rather sensibly, highlighting their roles. Note also that this logic does not quite determine that s1 is an anagram of s2 (consider e.g. "abb" and "aab"), but this is in fact what you were doing.
Since this looks like homework, I'll leave it up to you to try to figure out when two strings are anagrams.
See also
Java Coding Conventions
Java Language Guide/For-each loop
Java Tutorials/Collections Framework
Related questions
Why doesn't Java Map extends Collection?
Make sure that the file lies in the same directory as your .class file. It doesn't matter whether it is in the same directory as your .java file or not.
Other than that, the only problem I can see is your indentation, which doesn't affect the behaviour.
The normal practice is to put resources in the runtime classpath, or to add their path to the runtime classpath, so that you can get them via the classloader. Using relative paths in Java IO is considered poor practice since it breaks portability: the relative path depends on the current working directory, over which you have no control from inside the Java code.
After having placed it in the classpath (assuming that it's in the same folder as the Java class itself), just do so:
BufferedReader reader = null;
try {
    InputStream input = Practice.class.getResourceAsStream("anagrams.in");
    reader = new BufferedReader(new InputStreamReader(input, "UTF-8")); // Or whatever encoding it is in.
    // Process it.
    // ...
} finally {
    if (reader != null) try { reader.close(); } catch (IOException ignore) {}
}
Closing the reader in finally is, by the way, mandatory in order to release the lock on the file after reading.
Put the anagrams.in file in the same location as the .class file; then you will be able to read it. A quick search for how to read from files in Java should give you plenty of examples to start from.

Advice on reading indexes

I'm trying to figure out the right way to read a Lucene index only once, even though the application uses it multiple times. How can I do that in Java?
The indexed data will not change, so reading it each time is not necessary. Can someone explain the logic of reading it only once? Thank you.
UPDATE:
public List initTableObject() throws IOException {
    Directory fSDirectory = FSDirectory.open(new File(INDEX_NAME));
    List<String> termList = new ArrayList<String>();
    RAMDirectory directory = new RAMDirectory(fSDirectory);
    IndexReader iReader = IndexReader.open(fSDirectory);
    FilterIndexReader fReader = new FilterIndexReader(iReader);
    // int numOfDocs = fReader.numDocs();
    TermEnum terms = fReader.terms();
    while (terms.next()) {
        Term term = terms.term();
        String termText = term.text();
        termList.add(termText);
    }
    iReader.close();
    return termList;
}
I'm really new to Lucene, so here is what I've got so far; I'm just not there yet with RAMDirectory.
This method returns the term list because I need it to compare against some files that I have. How can I keep this list in RAM so I can use it in other parts of the application for comparison?
I think the answer to this question might be of use.
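In case it helps, one simple way to make sure the index is read only once per run is to cache the term list behind a lazily initialized static field. A minimal sketch, reusing the initTableObject() method from the update (the owning class name here is hypothetical):
import java.io.IOException;
import java.util.List;

public class TermListCache {
    private static List<String> terms;   // filled on first access, reused afterwards

    // Lazily load the term list once and keep it in memory for the rest of the run.
    public static synchronized List<String> getTerms() throws IOException {
        if (terms == null) {
            terms = new IndexTermsReader().initTableObject();  // hypothetical class holding the method above
        }
        return terms;
    }
}
Every other part of the application then calls TermListCache.getTerms() instead of re-opening the index.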
