I am currently trying to work with Apache Tika and set up language detection that checks all key/value pairs of my various properties files for the correct language of the respective file. Unfortunately, the detection is not very good: many keys are not recognized with the correct language, and I don't know how to do it better. An API solution is out of the question, because I've been instructed to find a free approach, and most free services only allow 1000 calls per day (in German alone I have more than 14000 keys).
If you know how I can make the current code better or maybe have another solution, please let me know!
Thanks a lot,
Pascal
That's my current code:
import java.util.Set;

import org.apache.tika.language.LanguageIdentifier;

public class Detect {

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {
        final MyPropAllKeys mPAK = new MyPropAllKeys("messages_forCheck.properties");
        final Set<Object> keys = mPAK.getAllKeys();
        for (final Object key : keys) {
            final String keyString = key.toString();
            final String keyValueString = mPAK.getPropertyValue(keyString);
            detect(keyValueString, key);
        }
    }

    public static void detect(String keyValueString, Object key) {
        final LanguageIdentifier languageIdentifier = new LanguageIdentifier(keyValueString);
        final String language = languageIdentifier.getLanguage();
        // Print every value whose detected language is not German
        if (!language.equals("de")) {
            System.out.println(language + " " + key + ": " + keyValueString);
        }
    }
}
For example, these are some of the results:
pt de.segal.baoss.platform.entity.BackgroundTaskType.MASS_INVOICE_DOCUMENT_CREATION: Rechnungsdokumente erzeugen
sk de.segal.baoss.purchase.supplier.creditorNumber: Kreditorennummer
no de.segal.baoss.module.crm.revenueLastYear: Umsatz vergangenes Jahr
no de.segal.baoss.module.op.customerReturn.action.createCreditEntry: Gutschrift erstellen
All of these are definitely German.
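The deprecated `LanguageIdentifier` is known to be unreliable on very short strings such as single property values. If upgrading is an option, a sketch using Tika's newer detector might look like the following (this assumes the `tika-langdetect` module from Tika 1.x; `OptimaizeLangDetector` and `LanguageResult` are from that module, the rest is illustrative):

```java
import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class DetectNewApi {

    public static void main(String[] args) throws Exception {
        // Load the built-in language models once and reuse the detector for all values
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();

        String value = "Umsatz vergangenes Jahr";
        LanguageResult result = detector.detect(value);

        // getLanguage() returns an ISO 639-1 code such as "de";
        // isReasonablyCertain() helps filter out low-confidence guesses on short text
        if (!result.isLanguage("de") && result.isReasonablyCertain()) {
            System.out.println(result.getLanguage() + ": " + value);
        }
    }
}
```

Because detection on a handful of words is inherently noisy, it can also help to concatenate many values from the same file and detect the language of the combined text once per file, rather than once per key.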
I have a List<UserVO>
Each UserVO has a getCountry()
I want to group the List<UserVO> based on its getCountry()
I can do it via streams, but I have to do it in Java 6.
This is in Java 8; I want this in Java 6:
Map<String, List<UserVO>> studentsByCountry
= resultList.stream().collect(Collectors.groupingBy(UserVO::getCountry));
for (Map.Entry<String, List<UserVO>> entry: studentsByCountry.entrySet())
System.out.println("Student with country = " + entry.getKey() + " value are " + entry.getValue());
I want output like a Map<String, List<UserVO>>:
CountryA - UserA, UserB, UserC
CountryB - UserM, User
CountryC - UserX, UserY
Edit: Can I further reshuffle this Map so that it displays according to a display order of the countries? The display order is countryC=1, countryB=2 & countryA=3.
For example I want to display
CountryC - UserX, UserY
CountryB - UserM, User
CountryA - UserA, UserB, UserC
This is how you do it in plain Java. Please note that Java 6 doesn't support the diamond operator, so you have to write out <String, List<UserVO>> explicitly every time.
Map<String, List<UserVO>> studentsByCountry = new HashMap<String, List<UserVO>>();
for (UserVO student : resultList) {
    String country = student.getCountry();
    List<UserVO> studentsOfCountry = studentsByCountry.get(country);
    if (studentsOfCountry == null) {
        studentsOfCountry = new ArrayList<UserVO>();
        studentsByCountry.put(country, studentsOfCountry);
    }
    studentsOfCountry.add(student);
}
It's shorter with streams, right? So try to upgrade to Java 8!
To have a specific order based on reversed alphabetical order of the keys, as mentioned in the comments, you can replace the first line with the following:
Map<String,List<UserVO>> studentsByCountry = new TreeMap<String,List<UserVO>>(Collections.reverseOrder());
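For the explicit display order from the edit (countryC first, then countryB, then countryA), one Java 6-compatible sketch is a TreeMap whose comparator looks ranks up in a map. The country names and ranks are the ones from the question; the minimal UserVO class below is only a stand-in so the example is self-contained:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DisplayOrderExample {

    // Minimal stand-in for the UserVO from the question
    static class UserVO {
        private final String name;
        private final String country;
        UserVO(String name, String country) { this.name = name; this.country = country; }
        public String getName() { return name; }
        public String getCountry() { return country; }
    }

    public static void main(String[] args) {
        List<UserVO> resultList = Arrays.asList(
                new UserVO("UserA", "CountryA"), new UserVO("UserB", "CountryA"),
                new UserVO("UserM", "CountryB"),
                new UserVO("UserX", "CountryC"), new UserVO("UserY", "CountryC"));

        // Desired display order: lower rank prints first
        final Map<String, Integer> displayOrder = new HashMap<String, Integer>();
        displayOrder.put("CountryC", 1);
        displayOrder.put("CountryB", 2);
        displayOrder.put("CountryA", 3);

        // A TreeMap keeps its keys sorted by the supplied comparator
        // (anonymous class instead of a lambda, since this targets Java 6)
        Map<String, List<UserVO>> usersByCountry =
                new TreeMap<String, List<UserVO>>(new Comparator<String>() {
                    public int compare(String a, String b) {
                        return displayOrder.get(a).compareTo(displayOrder.get(b));
                    }
                });

        // Same grouping loop as above, just targeting the ordered map
        for (UserVO user : resultList) {
            List<UserVO> group = usersByCountry.get(user.getCountry());
            if (group == null) {
                group = new ArrayList<UserVO>();
                usersByCountry.put(user.getCountry(), group);
            }
            group.add(user);
        }

        for (Map.Entry<String, List<UserVO>> e : usersByCountry.entrySet()) {
            System.out.println(e.getKey() + " - " + e.getValue().size() + " users");
        }
    }
}
```

Note that the comparator assumes every country appears in displayOrder; keys missing from it would throw a NullPointerException, so add a fallback rank if that can happen.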
I want to extract a pattern from a string, for example:
String x = "1234567 - israel.ekpo@massivelogdata.net cc55ZZ35 1789 Hello Grok";
The pattern it should generate is "%{EMAIL:username} %{USERNAME:password} %{INT:yearOfBirth}".
Basically I want to create patterns for logs generated in a Java application. Any idea how to do that?
In the past I've done something similar with regular expressions, but that only works if the string always has the same composition and order. In that case, you can define three matching patterns and run the find operation three times, in pattern order.
If that is not the case, you need a text analyzer or search tool.
I'd recommend using the grok library to extract data from logs.
Example:
import java.util.Map;

// Grok and GrokDictionary come from the grok library (org.aicer.grok)
import org.aicer.grok.dictionary.GrokDictionary;
import org.aicer.grok.util.Grok;

public final class GrokStage {

    private static final void displayResults(final Map<String, String> results) {
        if (results != null) {
            for (Map.Entry<String, String> entry : results.entrySet()) {
                System.out.println(entry.getKey() + "=" + entry.getValue());
            }
        }
    }

    public static void main(String[] args) {
        final String rawDataLine1 = "1234567 - israel.ekpo@massivelogdata.net cc55ZZ35 1789 Hello Grok";
        final String expression = "%{EMAIL:username} %{USERNAME:password} %{INT:yearOfBirth}";

        final GrokDictionary dictionary = new GrokDictionary();

        // Load the built-in dictionaries
        dictionary.addBuiltInDictionaries();
        // Resolve all expressions loaded
        dictionary.bind();
        // Take a look at how many expressions have been loaded
        System.out.println("Dictionary Size: " + dictionary.getDictionarySize());

        Grok compiledPattern = dictionary.compileExpression(expression);
        displayResults(compiledPattern.extractNamedGroups(rawDataLine1));
    }
}
Output:
username=israel.ekpo@massivelogdata.net
password=cc55ZZ35
yearOfBirth=1789
Note:
These are the patterns used above:
EMAIL %{\S+}@%{\b\w+\b}\.%{[a-zA-Z]+}
USERNAME [a-zA-Z0-9._-]+
INT (?:[+-]?(?:[0-9]+))
More info about grok-patterns: BuiltInDictionary.java
I have a dataset of 1 million labelled sentences and I am using it to determine sentiment through Maximum Entropy. I am using the Stanford Classifier for this:
public class MaximumEntropy {

    static ColumnDataClassifier cdc;

    public static float calMaxEntropySentiment(String text) {
        initializeProperties();
        float sentiment = getMaxEntropySentiment(text);
        return sentiment;
    }

    public static void initializeProperties() {
        cdc = new ColumnDataClassifier(
                "\\stanford-classifier-2016-10-31\\properties.prop");
    }

    public static int getMaxEntropySentiment(String tweet) {
        String filteredTweet = TwitterUtils.filterTweet(tweet);
        System.out.println("Reading training file");
        Classifier<String, String> cl = cdc.makeClassifier(cdc.readTrainingExamples(
                "\\stanford-classifier-2016-10-31\\labelled_sentences.txt"));
        Datum<String, String> d = cdc.makeDatumFromLine(filteredTweet);
        System.out.println(filteredTweet + " ==> " + cl.classOf(d) + " " + cl.scoresOf(d));
        // System.out.println("Class score is: " +
        // cl.scoresOf(d).getCount(cl.classOf(d)));

        // Compare class labels with equals(), not ==
        if ("0".equals(cl.classOf(d))) {
            return 0;
        } else {
            return 4;
        }
    }
}
My data is labelled 0 or 1. Right now, the whole dataset is read again for each tweet, which takes a lot of time considering the size of the dataset.
My question is: is there any way to train the classifier once and then load it whenever a tweet's sentiment needs to be determined? I think this approach would take less time. Correct me if I am wrong.
The following link covers this, but there is nothing for the Java API:
Saving and Loading Classifier
Any help would be appreciated.
Yes; the easiest way to do this is using Java's default serialization mechanism to serialize a classifier. A useful helper here is the IOUtils class:
IOUtils.writeObjectToFile(classifier, "/path/to/file");
To read the classifier:
Classifier<String, String> cl = IOUtils.readObjectFromFile(new File("/path/to/file"));
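Under the hood this is plain Java object serialization, so the train-once/load-many-times flow can be sketched with the standard library alone. The helper names and the HashMap stand-in below are illustrative, not Stanford API:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSketch {

    // Write any serializable object (e.g. a trained classifier) to disk once
    static void save(Serializable object, File file) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
        try {
            out.writeObject(object);
        } finally {
            out.close();
        }
    }

    // Load it back on every subsequent run instead of retraining
    @SuppressWarnings("unchecked")
    static <T> T load(File file) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
        try {
            return (T) in.readObject();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("classifier", ".ser");
        // A HashMap stands in here for a trained (serializable) classifier
        java.util.HashMap<String, Integer> model = new java.util.HashMap<String, Integer>();
        model.put("trained", 1);
        save(model, f);
        java.util.HashMap<String, Integer> restored = load(f);
        System.out.println("restored: " + restored);
    }
}
```

With this split, `initializeProperties()` and the expensive `makeClassifier(...)` call only need to run when the serialized file does not exist yet.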
I want to use a separator and different fields in a HashMap. I am trying to write a program that finds duplicate FirstName and LastName fields in the data and then adds a sequence number:
Check firstName && lastName in all records.
If firstName && lastName are found to be duplicates, add a seqNumber to the fields like 0, 1, 2, 3, ...
If no duplicate is found, use 0.
I wrote code that works fine, but it checks whole lines instead of fields; I need to check the two fields FirstName and LastName.
Please help me!
Here is the input data file; I have a data file like:
CustmerNumber,FirstName,LastName,Address1,city
123456789,abcd,efgh,12 spring st,atlanta
2345678,xyz,lastname,16 sprint st,atlanta
232345678,abcd,efgh ,1201 sprint st,atlanta
1234678,xyz,lastname,1234 oakbrook pkwy,atlanta
23556,abcd,efgh,3201 sprint st,atlanta
34564,robert,parker,12032 oakbrrok,atlanta
I want the output data file to look like:
CustmerNumber,FirstName,LastName,Address1,city,**SEQNUMBER**
123456789,**abcd,efgh**,12 spring st,atlanta,**0**
232345678,**abcd,efgh** ,1201 sprint st,atlanta,**1**
23556,**abcd,efgh**,3201 sprint st,atlanta,**2**
2345678,**xyz,lastname**,16 sprint st,atlanta,**0**
1234678,**xyz,lastname**,1234 oakbrook pkwy,atlanta,**1**
34564,**robert,parker**,12032 oakbrrok,atlanta,**0**
Here is my code:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class Test1 {

    /**
     * @param args
     * @throws FileNotFoundException
     */
    public static void main(String[] args) throws FileNotFoundException {
        Map<String, Integer> names = new HashMap<>();
        File dir = new File("Data_File_In");
        for (File file : dir.listFiles()) {
            Scanner s = new Scanner(file);
            s.nextLine(); // skip the header line
            while (s.hasNextLine()) {
                String line = s.nextLine();
                if (!names.containsKey(line)) {
                    names.put(line, 0);
                }
                names.put(line, names.get(line) + 1);
            }
            for (String name : names.keySet()) {
                for (int i = 1; i <= names.get(name); i++) {
                    System.out.println(name + "---->" + (i - 1));
                }
            }
            s.close();
        }
    }
}
My code checks whole lines: if a line is duplicated, the sequence numbers are 0, 1, 2, ...;
if a line does not appear again, it only gets 0.
Instead of the whole line, I need to use the FirstName and LastName fields.
Please help me!
Thanks!
Parse each line so you know the values of firstName and lastName. Then, for each line, use e.g. firstName + "###" + lastName as the key of your map. The values in your map will be e.g. Integer values (these are the counts). When reading a line, construct its key and see if it is already in the map. If yes, increase the value, i.e. the count; otherwise add a new entry to the map with count=1.
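Applied to the code above, that idea can be sketched like this. The `###` separator and the CSV layout are taken from the question; the hard-coded lines replace the file reading only to keep the example self-contained:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SequenceNumbers {

    public static void main(String[] args) {
        // Stand-in for the lines read from the data file (header already skipped)
        List<String> lines = new ArrayList<String>();
        lines.add("123456789,abcd,efgh,12 spring st,atlanta");
        lines.add("2345678,xyz,lastname,16 sprint st,atlanta");
        lines.add("232345678,abcd,efgh ,1201 sprint st,atlanta");
        lines.add("1234678,xyz,lastname,1234 oakbrook pkwy,atlanta");
        lines.add("23556,abcd,efgh,3201 sprint st,atlanta");
        lines.add("34564,robert,parker,12032 oakbrrok,atlanta");

        // Counts how often each firstName###lastName key has been seen so far
        Map<String, Integer> counts = new HashMap<String, Integer>();

        for (String line : lines) {
            String[] fields = line.split(",");
            // fields[1] = FirstName, fields[2] = LastName;
            // trim() so "efgh " and "efgh" count as the same name
            String key = fields[1].trim() + "###" + fields[2].trim();

            Integer seen = counts.get(key);
            int seq = (seen == null) ? 0 : seen; // first occurrence gets 0
            counts.put(key, seq + 1);

            System.out.println(line + "," + seq);
        }
    }
}
```

This prints the sequence numbers in the original line order; to group duplicates together as in the desired output, collect the lines per key first (e.g. in a LinkedHashMap<String, List<String>>) and print them group by group.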