Language detection with Apache Tika

I am currently getting to grips with Apache Tika and have set up a language detection that checks all key values of my various properties files against the expected language of the respective file. Unfortunately, the detection is not very good: not all keys are recognized as the correct language, and I don't know how to improve it. An API-based solution is out of the question, because I have been asked to find a free approach and most free services only allow 1000 calls per day (in German alone I have more than 14000 keys).
If you know how I can improve the current code, or have another solution, please let me know!
Thanks a lot,
Pascal
That's my current code:
import java.util.Set;
import org.apache.tika.language.LanguageIdentifier;
public class detect {
@SuppressWarnings("deprecation")
public static void main(String[] args) throws Exception {
final MyPropAllKeys mPAK = new MyPropAllKeys("messages_forCheck.properties");
final Set<Object> keys = mPAK.getAllKeys();
for (final Object key : keys) {
final String keyString = key.toString();
final String keyValueString = mPAK.getPropertyValue(keyString);
detect(keyValueString, key);
}
}
public static void detect(String keyValueString, Object key) {
final LanguageIdentifier languageIdentifier = new LanguageIdentifier(keyValueString);
final String language = languageIdentifier.getLanguage();
if (!language.equals("de")) {
System.out.println(language + " " + key + ": " + keyValueString);
}
}
}
For example, these are some of the results:
pt de.segal.baoss.platform.entity.BackgroundTaskType.MASS_INVOICE_DOCUMENT_CREATION: Rechnungsdokumente erzeugen
sk de.segal.baoss.purchase.supplier.creditorNumber: Kreditorennummer
no de.segal.baoss.module.crm.revenueLastYear: Umsatz vergangenes Jahr
no de.segal.baoss.module.op.customerReturn.action.createCreditEntry: Gutschrift erstellen
All of these are definitely German.

Related

Pattern from String

I want to extract a pattern from a string, for example:
String x = "1234567 - israel.ekpo@massivelogdata.net cc55ZZ35 1789 Hello Grok";
The pattern it should generate is "%{EMAIL:username} %{USERNAME:password} %{INT:yearOfBirth}".
Basically, I want to create patterns for the logs generated in a Java application. Any idea how to do that?

In the past I have done something similar with regular expressions, but in my case the strings always had the same composition and order.
In that case, you can define three matching patterns and run the find operation three times, in the order of the patterns.
If that is not the case, you will need a text analyzer or a search tool.
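The three-pattern approach described above can be sketched with plain java.util.regex. This is only an illustration under assumptions: the class name SequentialMatch, the field names, and the simplified e-mail regex are made up for this example, and it only works when the fields always appear in the same order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: applies three patterns one after another,
// each search starting where the previous match ended.
final class SequentialMatch {
    private static final String[] NAMES = {"username", "password", "yearOfBirth"};
    private static final Pattern[] PATTERNS = {
        Pattern.compile("\\S+@\\w+\\.[a-zA-Z]+"), // simplified e-mail (assumption)
        Pattern.compile("[a-zA-Z0-9._-]+"),       // user-name-like token
        Pattern.compile("[+-]?[0-9]+")            // integer
    };

    static Map<String, String> extract(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        int position = 0;
        for (int i = 0; i < PATTERNS.length; i++) {
            Matcher m = PATTERNS[i].matcher(line);
            if (!m.find(position)) {
                break; // the line does not follow the expected order
            }
            result.put(NAMES[i], m.group());
            position = m.end();
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "1234567 - israel.ekpo@massivelogdata.net cc55ZZ35 1789 Hello Grok";
        extract(line).forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

Because each search resumes at the end of the previous match, the order of the patterns matters; a line with a different layout simply yields fewer entries.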

It's recommended to use the grok library to extract data from logs.
Example:
public final class GrokStage {
private static final void displayResults(final Map<String, String> results) {
if (results != null) {
for(Map.Entry<String, String> entry : results.entrySet()) {
System.out.println(entry.getKey() + "=" + entry.getValue());
}
}
}
public static void main(String[] args) {
final String rawDataLine1 = "1234567 - israel.ekpo@massivelogdata.net cc55ZZ35 1789 Hello Grok";
final String expression = "%{EMAIL:username} %{USERNAME:password} %{INT:yearOfBirth}";
final GrokDictionary dictionary = new GrokDictionary();
// Load the built-in dictionaries
dictionary.addBuiltInDictionaries();
// Resolve all expressions loaded
dictionary.bind();
// Take a look at how many expressions have been loaded
System.out.println("Dictionary Size: " + dictionary.getDictionarySize());
Grok compiledPattern = dictionary.compileExpression(expression);
displayResults(compiledPattern.extractNamedGroups(rawDataLine1));
}
}
Output:
username=israel.ekpo@massivelogdata.net
password=cc55ZZ35
yearOfBirth=1789
Note:
These are the patterns used above:
EMAIL %{\S+}@%{\b\w+\b}\.%{[a-zA-Z]+}
USERNAME [a-zA-Z0-9._-]+
INT (?:[+-]?(?:[0-9]+))
More info about grok-patterns: BuiltInDictionary.java

Saving and Loading Trained Stanford classifier in java

I have a dataset of 1 million labelled sentences and I am using it to determine sentiment with Maximum Entropy. I am using the Stanford Classifier for this:
public class MaximumEntropy {
static ColumnDataClassifier cdc;
public static float calMaxEntropySentiment(String text) {
initializeProperties();
float sentiment = (getMaxEntropySentiment(text));
return sentiment;
}
public static void initializeProperties() {
cdc = new ColumnDataClassifier(
"\\stanford-classifier-2016-10-31\\properties.prop");
}
public static int getMaxEntropySentiment(String tweet) {
String filteredTweet = TwitterUtils.filterTweet(tweet);
System.out.println("Reading training file");
Classifier<String, String> cl = cdc.makeClassifier(cdc.readTrainingExamples(
"\\stanford-classifier-2016-10-31\\labelled_sentences.txt"));
Datum<String, String> d = cdc.makeDatumFromLine(filteredTweet);
System.out.println(filteredTweet + " ==> " + cl.classOf(d) + " " + cl.scoresOf(d));
// System.out.println("Class score is: " +
// cl.scoresOf(d).getCount(cl.classOf(d)));
if ("0".equals(cl.classOf(d))) { // compare string contents, not references
return 0;
} else {
return 4;
}
}
}
My data is labelled 0 or 1. Currently the whole dataset is read again for every tweet, which takes a lot of time given the size of the dataset.
My question is whether there is a way to train the classifier once and then load it whenever a tweet's sentiment is needed. I think this approach will take less time. Correct me if I am wrong.
The following link covers this, but there is nothing for the Java API:
Saving and Loading Classifier
Any help would be appreciated.

Yes; the easiest way to do this is using Java's default serialization mechanism to serialize a classifier. A useful helper here is the IOUtils class:
IOUtils.writeObjectToFile(classifier, "/path/to/file");
To read the classifier:
Classifier<String, String> cl = IOUtils.readObjectFromFile(new File("/path/to/file"));
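To show what "Java's default serialization mechanism" amounts to, here is a minimal JDK-only sketch of the train-once, load-later idea. ToyClassifier and ClassifierStore are hypothetical stand-ins invented for this example, not part of the Stanford library; the sketch assumes the object being saved implements Serializable.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper showing save/load via standard Java serialization.
final class ClassifierStore {

    // Stand-in for a trained model (assumption: the real one is Serializable too).
    static final class ToyClassifier implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Map<String, String> labelByWord = new HashMap<>();

        void train(String word, String label) { labelByWord.put(word, label); }
        String classOf(String word) { return labelByWord.getOrDefault(word, "0"); }
    }

    static void save(Serializable model, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(model);
        }
    }

    @SuppressWarnings("unchecked")
    static <T> T load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyClassifier trained = new ToyClassifier();
        trained.train("great", "1");          // expensive training happens once
        Path file = Files.createTempFile("classifier", ".ser");
        save(trained, file);

        ToyClassifier restored = load(file);  // later runs only pay for loading
        System.out.println(restored.classOf("great"));
        Files.delete(file);
    }
}
```

In the question's code this would mean calling makeClassifier once, saving the result, and loading the saved object at startup instead of re-reading the training file for every tweet.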

Flickr 4 Java - How do you find pictures / metadata from a certain region? e.g. Vienna

Best
Goal:
Retrieve geographic data (coordinates), timestamps, etc. from pictures taken in Vienna.
My question:
How can I do this in Java (using flickrapi-1.2.jar)?
What I have found out so far:
How to get the URLs of the 500 most recent pictures:
public static void main(String[] args) throws FlickrException, IOException,
SAXException {
String apiKey = "123456789abcdefghijklmnopqrstvwuxz";
Flickr f = new Flickr(apiKey);
PhotosInterface photosInterface = f.getPhotosInterface();
Collection photosCollection = null;
photosCollection = photosInterface.getRecent(500, 0);
int i = 0;
Photo photo = null;
Iterator photoIterator = photosCollection.iterator();
while (photoIterator.hasNext()) {
i++;
photo = (Photo) photoIterator.next();
System.out.println(i + " - Description: " + photo.getSmallUrl());
}
}
Optional: good examples or a decent manual are welcome, because I don't know exactly how this API works.
Kind regards

You need to call the flickr.photos.search API method.
With Flickr4Java it would look like this:
String apikey;
String secret;
// Create a Flickr instance with your data. No need to authenticate
Flickr flickr = new Flickr(apikey, secret, new REST());
// Set the wanted search parameters (I'm not using real variables in the example)
SearchParameters searchParameters = new SearchParameters();
searchParameters.setAccuracy(accuracyLevel);
searchParameters.setBBox(minimum_longitude,
minimum_latitude,
maximum_longitude,
maximum_latitude);
PhotoList<Photo> list = flickr.getPhotosInterface().search(searchParameters, 0, 0);
// Do something with the list

How to combine 2 java methods into one efficiently

I'm trying to create a Java validation class that receives four inputs from a single object passed in by the requester. The class needs to convert the float inputs to strings, check that each input matches a certain format, and throw exceptions complete with an error message and code when validation fails.
What I have is split across two methods, and I would like to know if there is a better way to combine them into one validate method for the main class to call. I don't seem to be able to get around using the pattern/matcher concept to ensure the inputs are formatted correctly. Any help you can give would be very much appreciated.
public class Validator {
private static final String MoneyPattern ="^\\d{1,7}(\\.\\d{1,2})$" ;
private static final String PercentagePattern = "^\\d{1,3}\\.\\d{1,2}$";
private static final String CalendarYearPattern = "^20[1-9][0-9]$";
private int errorcode = 0;
private String errormessage = null;
public Validator(MyInput input){
}
private boolean verifyInput(){
String Percentage = ((Float) input.getPercentage().toString();
String Income = ((Float) input.getIncome().toString();
String PublicPlan = ((Float) input.getPublicPlan().toString();
String Year = ((Float) input.getYear();
try {
if (!doesMatch(Income, MoneyPattern)) {
errormessage = errormessage + "income,";
}
if (!doesMatch(PublicPlan, MoneyPattern)) {
errormessage = errormessage + "insurance plan,";
}
if (!doesMatch(Percentage, PercentagePattern)) {
errormessage = errormessage + "Percentage Plan,";
}
if (!doesMatch(Year, CalendarYearPattern)) {
errormessage = errormessage + "Year,";
}
} catch (Exception e){
errorcode = 111;
errormessage = e.getMessage();
}
}
private boolean doesMatch(String s, String pattern) throws Exception{
try {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(s);
if (!s.equals("")){
if(m.find()){
return true;
} else {
return false;
}
}else {
return false;
}
} catch (PatternSyntaxException pse){
errorcode = 111;
errormessage = pse.getMessage();
}
}
}

This code is borked from the word "go". You have a constructor into which you pass a MyInput reference, but there's no code in the ctor and no private data member to receive it. It looks like you expect to use input in your verifyInput() method, but it's a NullPointerException waiting to happen.
Your code doesn't follow the Sun Java coding standards; variable names should start with a lower-case letter.
Why you wouldn't do that input validation in the ctor, when you actually receive the value, is beyond me. Perhaps you really meant to pass it into that verifyInput() method.
I would worry about correctness and readability before concerning myself with efficiency.
I'd have methods like this:
public boolean isValidMoney(String money) {
    // put the regex here
}
public boolean isValidYear(String year) {
    // the regex here
}
I think I'd prefer a real Money class to a String. There's no abstraction whatsoever.
Here's one bit of honesty:
private static final String CalendarYearPattern = "^20[1-9][0-9]$";
I guess you either don't think this code will still be running in the 22nd century or you won't be here to maintain it.
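As a rough sketch of the per-field methods suggested above, the checks could look like the following. The class name InputChecks and the isValidPercentage method are assumptions added for illustration; the regular expressions are copied unchanged from the question, so the money pattern still requires a decimal part.

```java
import java.util.regex.Pattern;

// Hypothetical helper class: one small, readable check per field.
final class InputChecks {
    // Patterns taken from the question, compiled once and reused.
    private static final Pattern MONEY = Pattern.compile("^\\d{1,7}(\\.\\d{1,2})$");
    private static final Pattern PERCENTAGE = Pattern.compile("^\\d{1,3}\\.\\d{1,2}$");
    private static final Pattern CALENDAR_YEAR = Pattern.compile("^20[1-9][0-9]$");

    static boolean isValidMoney(String money) {
        return money != null && MONEY.matcher(money).matches();
    }

    static boolean isValidPercentage(String percentage) {
        return percentage != null && PERCENTAGE.matcher(percentage).matches();
    }

    static boolean isValidYear(String year) {
        return year != null && CALENDAR_YEAR.matcher(year).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidMoney("1234.50"));   // decimal part present
        System.out.println(isValidYear("2012"));       // within 2010..2099
        System.out.println(isValidYear("2012.0"));     // Float-style text is rejected
    }
}
```

A single validate method can then call these checks in turn and collect the names of the fields that failed.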

One way of doing this would be with DynamicBeans.
package com.acme.validator;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.beanutils.PropertyUtils;
public class Validator {
//A simple optimisation of the pattern
private static final Pattern MoneyPattern = Pattern.compile("^\\d{1,7}(\\.\\d{1,2})$");
private static final Pattern PercentagePattern = Pattern.compile("^\\d{1,3}\\.\\d{1,2}$");
private static final Pattern CalendarYearPattern = Pattern.compile("^20[1-9][0-9]$");
public String validate(MyInput input) { // named validate so it is not mistaken for a constructor
String errormessage = "";
/*
* Setting these up as Maps.
* Ideally this would be a 'simple bean'
* but that goes beyond the scope of the
* original question
*/
Map<String,Pattern> patternMap = new HashMap<String,Pattern>();
patternMap.put("percentage", PercentagePattern);
patternMap.put("publicPlan", MoneyPattern);
patternMap.put("income", MoneyPattern);
patternMap.put("year", CalendarYearPattern);
Map<String,String> errorMap = new HashMap<String,String>();
errorMap.put("percentage", "Percentage Plan,");
errorMap.put("publicPlan", "insurance plan,");
errorMap.put("income", "income,");
errorMap.put("year", "Year,");
for (String key : patternMap.keySet()) {
try {
String match = ((Float) PropertyUtils.getSimpleProperty(input, key)).toString();
Matcher m = patternMap.get(key).matcher(match);
if ("".equals(match) || !m.find()) {
errormessage = errormessage + errorMap.get(key);
}
} catch (Exception e) {
errormessage = e.getMessage(); //since getMessage() could be null, you need to work out some way of handling this in the response
//don't know the point of the error code so remove this altogether
break; //Assume an exception trumps any validation failure
}
}
return errormessage;
}
}
I've made a few assumptions about the validation rules (for simplicity used 2 maps but you could also use a single map and a bean containing both the Pattern and the Message and even the 'error code' if that is important).
The key 'flaw' in your original setup and what would hamper the solution above, is that you are using 'year' as Float in the input bean.
(new Float(2012)).toString()
The above returns "2012.0". This will always fail your pattern. When you start messing about with the different types of objects potentially in the input bean, you may need to consider ensuring they are String at the time of creating the input bean and not, as is the case here, when they are retrieved.
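The effect is easy to reproduce with the JDK alone. The sketch below is an illustration under assumptions: YearFormatting and its method names are hypothetical, and dropping the fraction with intValue() is only acceptable if the year is always a whole number.

```java
import java.util.regex.Pattern;

// Hypothetical demo of why a Float year never matches the year pattern.
final class YearFormatting {
    private static final Pattern CALENDAR_YEAR = Pattern.compile("^20[1-9][0-9]$");

    // What the original code effectively does: Float.toString keeps the fraction.
    static String naive(Float year) {
        return year.toString();
    }

    // One possible workaround: format the year as a whole number.
    static String asWholeNumber(Float year) {
        return String.valueOf(year.intValue());
    }

    static boolean matchesYear(String text) {
        return CALENDAR_YEAR.matcher(text).matches();
    }

    public static void main(String[] args) {
        Float year = Float.valueOf(2012f);
        System.out.println(naive(year) + " -> " + matchesYear(naive(year)));
        System.out.println(asWholeNumber(year) + " -> " + matchesYear(asWholeNumber(year)));
    }
}
```

Storing the year as an int or a String in the input bean avoids the conversion problem altogether.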
Good Luck with the rest of your Java experience.

ini4j - How to get all the key names in a setting?

I've decided to use an ini file to store simple key-value pair configuration for my Java application.
I googled and searched Stack Overflow and found that ini4j is highly recommended for parsing and interpreting ini files in Java. I spent some time reading the tutorial on the ini4j site; however, I was not sure how to get all the keys for a section in an ini file.
For instance, if I have an ini file like this:
[ food ]
name=steak
type=american
price=20.00
[ school ]
dept=cse
year=2
major=computer_science
Assume that I do not know the names of the keys ahead of time. How do I get the list of keys so that I can eventually retrieve the values by key? For instance, I would get an array or some other data structure that contains 'name', 'type', and 'price' if I ask for the list of keys for food.
Can someone show me an example where you would open an ini file, parse or interpret it so that an app knows all the structure and values of the ini file, and get the list of keys and values?

No guarantees on this one; I made it up in 5 minutes.
But it reads the ini you provided without further knowledge of the ini itself (besides the knowledge that it consists of a number of sections, each with a number of options).
I guess you will have to figure out the rest yourself.
import org.ini4j.Ini;
import org.ini4j.Profile.Section;
import java.io.FileReader;
public class Test {
public static void main(String[] args) throws Exception {
Ini ini = new Ini(new FileReader("test.ini"));
System.out.println("Number of sections: "+ini.size()+"\n");
for (String sectionName: ini.keySet()) {
System.out.println("["+sectionName+"]");
Section section = ini.get(sectionName);
for (String optionKey: section.keySet()) {
System.out.println("\t"+optionKey+"="+section.get(optionKey));
}
}
}
}
Check out the ini4j Samples and ini4j Tutorials too. As is often the case, the library is not very well documented.

I couldn't find anything in the tutorials so I stepped through the source, until I found the entrySet method. With that you can do this:
Wini ini = new Wini(new File(...));
Set<Entry<String, Section>> sections = ini.entrySet(); /* !!! */
for (Entry<String, Section> e : sections) {
Section section = e.getValue();
System.out.println("[" + section.getName() + "]");
Set<Entry<String, String>> values = section.entrySet(); /* !!! */
for (Entry<String, String> e2 : values) {
System.out.println(e2.getKey() + " = " + e2.getValue());
}
}
This code essentially re-prints the .ini file to the console. Your sample file would produce this output: (the order may vary)
[food]
name = steak
type = american
price = 20.00
[school]
dept = cse
year = 2
major = computer_science

The methods of interest are get() and keySet()
Wini myIni = new Wini (new File ("test.ini"));
// list section names
for (String sName : myIni.keySet()) {
System.out.println(sName);
}
// check for a section, section name is case sensitive
boolean haveFoodParameters = myIni.keySet().contains("food");
// list name value pairs within a specific section
for (String name : myIni.get("food").keySet()) {
System.out.println(name + " = " + myIni.get("food", name));
}

In Kotlin:
val ini = Wini(File(iniPath))
Timber.e("Read value:${ini}")
println("Number of sections: "+ini.size+"\n");
for (sectionName in ini.keys) {
println("[$sectionName]")
val section: Profile.Section? = ini[sectionName]
if (section != null) {
for (optionKey in section.keys) {
println("\t" + optionKey + "=" + section[optionKey])
}
}
}
