Saving and Loading a Trained Stanford Classifier in Java

I have a dataset of 1 million labelled sentences and I am using it to find sentiment through maximum entropy. I am using the Stanford Classifier for this:
public class MaximumEntropy {

    static ColumnDataClassifier cdc;

    public static float calMaxEntropySentiment(String text) {
        initializeProperties();
        float sentiment = getMaxEntropySentiment(text);
        return sentiment;
    }

    public static void initializeProperties() {
        cdc = new ColumnDataClassifier(
                "\\stanford-classifier-2016-10-31\\properties.prop");
    }

    public static int getMaxEntropySentiment(String tweet) {
        String filteredTweet = TwitterUtils.filterTweet(tweet);
        System.out.println("Reading training file");
        // The full training set is read and the classifier retrained on every call
        Classifier<String, String> cl = cdc.makeClassifier(cdc.readTrainingExamples(
                "\\stanford-classifier-2016-10-31\\labelled_sentences.txt"));
        Datum<String, String> d = cdc.makeDatumFromLine(filteredTweet);
        System.out.println(filteredTweet + " ==> " + cl.classOf(d) + " " + cl.scoresOf(d));
        // System.out.println("Class score is: " +
        //         cl.scoresOf(d).getCount(cl.classOf(d)));
        if ("0".equals(cl.classOf(d))) { // compare strings with equals(), not ==
            return 0;
        } else {
            return 4;
        }
    }
}
My data is labelled 0 or 1. Right now, for each tweet the whole dataset is read and the classifier retrained, which takes a lot of time given the size of the dataset.
My question is: is there any way to train the classifier once and then load it whenever a tweet's sentiment needs to be found? I think this approach would take far less time. Correct me if I am wrong.
The following link covers this, but there is nothing for the Java API.
Saving and Loading Classifier
Any help would be appreciated.

Yes; the easiest way to do this is to use Java's default serialization mechanism to serialize the classifier. A useful helper here is the IOUtils class:
IOUtils.writeObjectToFile(classifier, "/path/to/file");
To read the classifier:
Classifier<String, String> cl = IOUtils.readObjectFromFile(new File("/path/to/file"));
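Putting this together with the code from the question, a minimal sketch might look like the following (the classifier.ser.gz file name and the shortened paths are placeholders, not part of the Stanford API):
import java.io.File;
import edu.stanford.nlp.classify.Classifier;
import edu.stanford.nlp.classify.ColumnDataClassifier;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.Datum;

public class TrainOnce {
    public static void main(String[] args) throws Exception {
        ColumnDataClassifier cdc = new ColumnDataClassifier("properties.prop");

        // Expensive step: train once, offline
        Classifier<String, String> cl = cdc.makeClassifier(
                cdc.readTrainingExamples("labelled_sentences.txt"));
        IOUtils.writeObjectToFile(cl, "classifier.ser.gz");

        // Later, e.g. at application startup: load instead of retraining
        Classifier<String, String> loaded =
                IOUtils.readObjectFromFile(new File("classifier.ser.gz"));
        Datum<String, String> d = cdc.makeDatumFromLine("some filtered tweet");
        System.out.println(loaded.classOf(d));
    }
}
With this split, each call to getMaxEntropySentiment only needs the cheap readObjectFromFile (or a classifier cached in a field) instead of re-reading the million-sentence training file.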

Related

Language detection with Apache Tika

I am currently trying to get going with Apache Tika and to set up language detection that checks all key values of my various properties files for the correct language of the respective file. Unfortunately, the detection is not really good: many keys are not recognized as the correct language, and I don't know how to do better. An API solution is out of the question, because I have been told to find a free approach, and most free services only allow 1000 calls per day (in German alone I have more than 14000 keys).
If you know how I can improve the current code, or if you have another solution, please let me know!
Thanks a lot,
Pascal
This is my current code:
import java.util.Set;

import org.apache.tika.language.LanguageIdentifier;

public class detect {

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {
        // MyPropAllKeys is a custom helper exposing the keys/values of a properties file
        final MyPropAllKeys mPAK = new MyPropAllKeys("messages_forCheck.properties");
        final Set<Object> keys = mPAK.getAllKeys();
        for (final Object key : keys) {
            final String keyString = key.toString();
            final String keyValueString = mPAK.getPropertyValue(keyString);
            detect(keyValueString, key);
        }
    }

    public static void detect(String keyValueString, Object key) {
        final LanguageIdentifier languageIdentifier = new LanguageIdentifier(keyValueString);
        final String language = languageIdentifier.getLanguage();
        // Report every value that was not detected as German
        if (!language.equals("de")) {
            System.out.println(language + " " + key + ": " + keyValueString);
        }
    }
}
For example, here are some of the results:
pt de.segal.baoss.platform.entity.BackgroundTaskType.MASS_INVOICE_DOCUMENT_CREATION: Rechnungsdokumente erzeugen
sk de.segal.baoss.purchase.supplier.creditorNumber: Kreditorennummer
no de.segal.baoss.module.crm.revenueLastYear: Umsatz vergangenes Jahr
no de.segal.baoss.module.op.customerReturn.action.createCreditEntry: Gutschrift erstellen
All of these values are definitely German.
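One thing worth noting: LanguageIdentifier is deprecated, and any detector is unreliable on strings as short as a single property value. A minimal sketch using the newer OptimaizeLangDetector from the tika-langdetect module (this assumes a Tika version that ships it, 1.14 or later; the sample string is taken from the results above):
import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class DetectWithOptimaize {
    public static void main(String[] args) throws Exception {
        // Load the language models once and reuse the detector for all keys
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();
        LanguageResult result = detector.detect("Umsatz vergangenes Jahr");
        // isReasonablyCertain() helps filter out low-confidence guesses on short input
        System.out.println(result.getLanguage()
                + " (certain: " + result.isReasonablyCertain() + ")");
    }
}
Even so, single short values will stay noisy with any detector; concatenating all values of one properties file and detecting the language once per file is usually far more reliable.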

Pattern from String

I want to extract a pattern from a string, for example:
String x = "1234567 - israel.ekpo#massivelogdata.net cc55ZZ35 1789 Hello Grok";
The pattern it should generate is: "%{EMAIL:username} %{USERNAME:password} %{INT:yearOfBirth}"
Basically, I want to create patterns for the logs generated by a Java application. Any idea how to do that?
In the past I have done this with regular expressions, but that only works when the string always has the same composition and pattern order.
In that case, you can define three matching patterns and run the find operation three times, in pattern order.
If not, you must use a text analyzer or search tool.
I recommend using the grok library to extract data from logs.
Example:
import java.util.Map;

// these imports are assumed from the aicer grok library used in this example
import org.aicer.grok.dictionary.GrokDictionary;
import org.aicer.grok.util.Grok;

public final class GrokStage {

    private static void displayResults(final Map<String, String> results) {
        if (results != null) {
            for (Map.Entry<String, String> entry : results.entrySet()) {
                System.out.println(entry.getKey() + "=" + entry.getValue());
            }
        }
    }

    public static void main(String[] args) {
        final String rawDataLine1 = "1234567 - israel.ekpo#massivelogdata.net cc55ZZ35 1789 Hello Grok";
        final String expression = "%{EMAIL:username} %{USERNAME:password} %{INT:yearOfBirth}";

        final GrokDictionary dictionary = new GrokDictionary();

        // Load the built-in dictionaries
        dictionary.addBuiltInDictionaries();

        // Resolve all expressions loaded
        dictionary.bind();

        // Take a look at how many expressions have been loaded
        System.out.println("Dictionary Size: " + dictionary.getDictionarySize());

        Grok compiledPattern = dictionary.compileExpression(expression);

        displayResults(compiledPattern.extractNamedGroups(rawDataLine1));
    }
}
Output:
username=israel.ekpo#massivelogdata.net
password=cc55ZZ35
yearOfBirth=1789
Note:
These are the underlying patterns used above:
EMAIL %{\S+}#%{\b\w+\b}\.%{[a-zA-Z]+}
USERNAME [a-zA-Z0-9._-]+
INT (?:[+-]?(?:[0-9]+))
More info about grok-patterns: BuiltInDictionary.java

Does Lucene (Java framework) by default calculate the tf-idf and cosine similarity of a document against the term?

I am developing a search-engine-based application and working with the Lucene Java framework. I am confused by the default scoring functionality provided by Lucene: does it implement tf-idf and cosine similarity by default, or do we have to do something else?
public class LuceneTester {

    String indexDir = "C:\\Users\\hamda\\Documents\\NetBeansProjects\\luceneDemo\\Index";
    String dataDir = "C:\\Users\\hamda\\Documents\\NetBeansProjects\\luceneDemo\\Data";
    Indexer indexer;
    Searcher searcher;

    public static void main(String[] args) {
        LuceneTester tester;
        try {
            tester = new LuceneTester();
            tester.createIndex();
            tester.search("DataGuides");
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    private void createIndex() throws IOException {
        indexer = new Indexer(indexDir);
        int numIndexed;
        long startTime = System.currentTimeMillis();
        numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
        long endTime = System.currentTimeMillis();
        indexer.close();
        System.out.println(numIndexed + " files indexed, time taken: "
                + (endTime - startTime) + " ms");
    }
I am getting the document score at the end of the search function below:
    private void search(String searchQuery) throws IOException, ParseException {
        searcher = new Searcher(indexDir);
        long startTime = System.currentTimeMillis();
        TopDocs hits = searcher.search(searchQuery);
        long endTime = System.currentTimeMillis();
        System.out.println(hits.totalHits +
                " documents found. Time: " + (endTime - startTime));
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = searcher.getDocument(scoreDoc);
            System.out.println(scoreDoc.score + " File: "
                    + doc.get(LuceneConstants.FILE_PATH));
        }
        searcher.close();
    }
}
I have googled it and found this:
how can I implement the tf-idf and cosine similarity in Lucene?
Any help will be highly appreciated :)
As of Lucene 6.0, the default similarity implementation is BM25Similarity, which implements BM25.
If you want to use the old standard similarity implementation, use ClassicSimilarity.
For a comparison of the two, you might check out:
Doug Turnbull's BM25 The Next Generation of Lucene Relevance
ElasticSearch's BM25 vs Lucene Default Similarity
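A minimal sketch of switching back to tf-idf scoring (assuming Lucene 6.x and an existing index; the index path is a placeholder):
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.store.FSDirectory;

public class ClassicScoringExample {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        // Replace the default BM25Similarity with classic tf-idf scoring
        searcher.setSimilarity(new ClassicSimilarity());
        // ... run queries through this searcher as usual ...
        reader.close();
    }
}
Note that for consistent scores the same Similarity should also be set on the IndexWriterConfig at index time.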
As I was going through some details on http://lucene.apache.org/, I found that by default the Lucene scoring model uses the DefaultSimilarity class, which extends the TFIDFSimilarity class: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
So their documentation states that the scoring model by default implements tf-idf and cosine similarity. Anyway, I may be wrong, so feel free to correct me :)

Create Custom InputFormat of ColumnFamilyInputFormat for cassandra

I am working on a project using Cassandra 1.2 and Hadoop 1.2.
I have created my normal Cassandra mapper and reducer, but I want to create my own InputFormat class that will read the records from Cassandra and give me the desired column's value by splitting and indexing that value.
So I planned to create a custom InputFormat class, but I am confused and do not know how to go about it. Which classes do I need to extend and implement, and how will I be able to fetch the row key, column name, column values, etc.?
My Mapper class is as follows:
public class MyMapper extends
        Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {

    private Text word = new Text();
    MyJDBC db = new MyJDBC();
    // sb, name, d1, d2, like and myhits are fields declared elsewhere in the
    // original class; this snippet is not self-contained

    public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
            Context context) throws IOException, InterruptedException {
        long std_id = Long.parseLong(ByteBufferUtil.string(key));
        long newSavePoint = 0;
        if (columns.values().isEmpty()) {
            System.out.println("EMPTY ITERATOR");
            sb.append("column_N/A" + ":" + "N/A" + " , ");
        } else {
            for (IColumn cell : columns.values()) {
                name = ByteBufferUtil.string(cell.name());
                String value = null;
                if (name.contains("int")) {
                    value = String.valueOf(ByteBufferUtil.toInt(cell.value()));
                } else {
                    value = ByteBufferUtil.string(cell.value());
                }
                String[] data = value.toString().split(",");
                // if (data[0].equalsIgnoreCase("login")) {
                Long[] dif = getDateDiffe(d1, d2);
                // logic I want to perform inside my custom input class rather
                // than here; I just want a simple mapper class
                if (condition1 && condition2) {
                    myhits++;
                    sb.append(":\t " + data[0] + " " + data[2] + " " + data[1] /* + " " + data[3] */ + "\n");
                    newSavePoint = d2;
                }
            }
            sb.append("~" + like + "~" + newSavePoint + "~");
            word.set(sb.toString().replace("\t", ""));
        }
        db.setInterval(Long.parseLong(ByteBufferUtil.string(key)), newSavePoint);
        db.setHits(Long.parseLong(ByteBufferUtil.string(key)), like + "");
        context.write(new Text(ByteBufferUtil.string(key)), word);
    }
}
I want to slim down the logic in my Mapper class and perform the same calculations in my custom input class.
Please help; I am hoping for a positive response from the Stack Overflow community.
You can do the intended task by moving the Mapper logic into your custom input format class (as you have indicated already).
I found this nice post, which explains a similar problem statement to yours. I think it might solve your problem.
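A rough sketch of what such a class could look like (this assumes Cassandra 1.2's ColumnFamilyInputFormat and the Hadoop mapreduce API; the value transformation is left as a stub):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class MyInputFormat extends ColumnFamilyInputFormat {

    @Override
    public RecordReader<ByteBuffer, SortedMap<ByteBuffer, IColumn>> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Wrap the stock reader so each row can be filtered or transformed
        // before it ever reaches the Mapper.
        final RecordReader<ByteBuffer, SortedMap<ByteBuffer, IColumn>> delegate =
                super.createRecordReader(split, context);

        return new RecordReader<ByteBuffer, SortedMap<ByteBuffer, IColumn>>() {
            @Override
            public void initialize(InputSplit s, TaskAttemptContext c)
                    throws IOException, InterruptedException {
                delegate.initialize(s, c);
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                return delegate.nextKeyValue();
            }

            @Override
            public ByteBuffer getCurrentKey() throws IOException, InterruptedException {
                return delegate.getCurrentKey();
            }

            @Override
            public SortedMap<ByteBuffer, IColumn> getCurrentValue()
                    throws IOException, InterruptedException {
                SortedMap<ByteBuffer, IColumn> columns = delegate.getCurrentValue();
                // place the splitting/indexing logic from the Mapper here,
                // returning only the columns (or derived values) you need
                return columns;
            }

            @Override
            public float getProgress() throws IOException, InterruptedException {
                return delegate.getProgress();
            }

            @Override
            public void close() throws IOException {
                delegate.close();
            }
        };
    }
}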

Extracting data from a collection in Java

I have a CSV dataset like this:
A, 10, USA
B,30, UK
C,4,IT
A,20,UK
B,10,USA
I want to read these CSV lines and produce the following output:
A has ran 30 miles with average of 15.
B has ran 30 miles with average of 20.
C has ran 4 miles with average of 4.
I want to achieve this in Java. I have done it in C# by using LINQ:
var readlines = File.ReadAllLines(filename);
var query = from lines in readlines
            let data = lines.Split(',')
            select new
            {
                Name = data[0],
                Miles = data[1],
            };
var values = query.GroupBy(x => new { x.Name })
                  .Select(group => new
                  {
                      Person = group.Key,
                      Events = group.Sum(g => Convert.ToDouble(g.Miles)),
                      Count = group.Count()
                  });
I am looking to do this in Java, and I am not sure whether I can do it without using any third-party library. Any ideas?
So far, my code looks like this in Java:
CSVReader reader = new CSVReader(new FileReader(filename));
java.util.List<String[]> content = reader.readAll();
String[] row = null;
for (Object object : content) {
    row = (String[]) object;
    String Name = row[0];
    String Miles = row[1];
    System.out.printf("%s has ran %s miles %n", Name, Miles);
}
reader.close();
I am looking for a nice way to get the total mileage for each name so that I can calculate the averages.
As a C# developer, it is sometimes hard not to miss the features of LINQ. But as Farlan suggested, you could do something like this:
CSVReader reader = new CSVReader(new FileReader(filename));
java.util.List<String[]> content = reader.readAll();
Map<String, Group> groups = new HashMap<>();
for (String[] row : content) {
    String Name = row[0];
    String Miles = row[1];
    System.out.printf("%s has ran %s miles %n", Name, Miles);
    if (groups.containsKey(Name)) {
        groups.get(Name).Add(Double.valueOf(Miles));
    } else {
        Group g = new Group();
        g.Add(Double.valueOf(Miles));
        groups.put(Name, g);
    }
}
reader.close();

for (String name : groups.keySet()) {
    System.out.println(name + " ran " + groups.get(name).total()
            + " with avg of " + groups.get(name).average());
}

class Group {

    private List<Double> miles;

    public Group() {
        miles = new ArrayList<>();
    }

    public Double total() {
        double sum = 0;
        for (Double mile : miles) {
            sum += mile;
        }
        return sum;
    }

    public Double average() {
        if (miles.size() == 0)
            return 0d;
        return total() / miles.size();
    }

    public void Add(Double m) {
        miles.add(m);
    }
}
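For comparison, here is a more compact sketch of the same grouping using Java 8 streams (assuming the same opencsv CSVReader as above; input.csv is a placeholder):
import java.io.FileReader;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import au.com.bytecode.opencsv.CSVReader; // com.opencsv.CSVReader in newer opencsv versions

public class MileageStats {
    public static void main(String[] args) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader("input.csv"))) {
            List<String[]> content = reader.readAll();
            // Group rows by name, collecting sum and average of the miles column in one pass
            Map<String, DoubleSummaryStatistics> stats = content.stream()
                    .collect(Collectors.groupingBy(
                            row -> row[0].trim(),
                            Collectors.summarizingDouble(row -> Double.parseDouble(row[1].trim()))));
            stats.forEach((name, s) -> System.out.printf(
                    "%s has ran %.0f miles with average of %.0f.%n",
                    name, s.getSum(), s.getAverage()));
        }
    }
}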
Use Java's BufferedReader class:
BufferedReader in = new BufferedReader(new FileReader("your.csv"));
String line;
while ((line = in.readLine()) != null) {
    String[] fields = line.split(",");
    // fields[2] is the country column; this loop only echoes each line
    // and does not compute the totals or averages asked for
    System.out.println(fields[0] + " has ran " + fields[1] + " miles (" + fields[2] + ")");
}
in.close();
There are quite a few ways to do this, some long-winded approaches, some shorter. The issue is that Java can be very verbose for simple tasks, so the better approaches can be a bit uglier.
The example below shows exactly how to achieve this, bar the printing. Bear in mind, however, that it might not be the best approach, but I feel it is one of the easier ones to read and comprehend.
final File csvFile = new File("filename.csv");
final Scanner reader = new Scanner(csvFile);
final Map<String, Integer> info = new HashMap<>(); // store the data

// Until there are no more lines, continue
while (reader.hasNextLine()) {
    final String[] data = reader.nextLine().split(","); // data[0] = A, data[1] = 10, data[2] = USA
    final String alpha = data[0];
    // trim() guards against the spaces after commas in the sample data
    final int miles = Integer.parseInt(data[1].trim());
    if (!info.containsKey(alpha)) {
        info.put(alpha, miles);
    } else {
        info.put(alpha, info.get(alpha) + miles);
    }
}
reader.close();
The steps involved are simple:
Step 1 - Read the file.
By passing a File into the Scanner object, you set the target of the parsing to the file and not the console. Using the very neat hasNextLine() method, you can continually read each line until no more exist. Each line is then split by a comma and stored in a String array for reference.
Step 2 - Associate the data.
As you want to cumulatively add the integers together, you need a way to associate already-seen letters with the numbers. A heavyweight but clean way of doing this is to use a HashMap. The key it takes is going to be a String, specifically A, B or C. By taking advantage of the fact that the key is unique, we can use the O(1) containsKey(String) method to check whether we have already read the letter. If it is new, add it to the HashMap and save the number with it. If, however, the letter has been seen before, we find the old value, add it to the new one and overwrite the data inside the HashMap.
All you need to do now is print out the data. Feel free to take a different approach, but I hope this is a clear example of how you CAN do it in Java.
Maybe you could try this Java library: https://code.google.com/p/qood/
It handles data without any getters/setters, so it is more flexible than LINQ.
In your case, the file "D:/input.csv" has 3 columns:
NAME,MILES,COUNTRY
A, 10, USA
B,30, UK
C,4,IT
A,20,UK
B,10,USA
the query code would be:
final QModel raw = QNew.modelCSV("D:/input.csv")
        .debug(-1); // print out what was read from the CSV
raw.query()
        .selectAs("OUTPUT",
                "CONCAT(NAME,' has ran ',SUM(MILES),' miles with average of ',MEAN(MILES),'.')")
        .groupBy("NAME")
        .result().debug(-1) // print out the result
        .to().fileCSV("D:/output.csv", "UTF-8"); // write to another CSV file
