How to detect multi-word names with OpenNLP - java

I'm doing NER with Java OpenNLP and I'm not sure how to detect multi-word names (e.g. New York, Bruno Mars, Hong Kong) using the custom model I have trained.
My training data does cover multi-word spans:
<START:place> Hong Kong <END> ... <START:person> Putin <END>
I'm pretty sure my trained model and training data are working well. I just do not know how to get the multi-word names out. Here is what I did:
// testing the model
NameFinderME nameFinder = new NameFinderME(nameFinderModel);
String sentence = "India may US to Japan France so Putin should Hong Kong review Trump";
WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
// Tokenizing the given paragraph
String tokens[] = whitespaceTokenizer.tokenize(sentence);
Span nameSpans[] = nameFinder.find(tokens);
for (Span s : nameSpans)
System.out.println(s.toString() + " " + tokens[s.getStart()]);
And here is what I get:
[0..1) place India
[0..1) place US
[0..1) place Japan
[0..1) place France
[0..1) person Putin
[0..1) place Hong
[0..1) person Trump
But I want to get [0..1) place Hong Kong instead of splitting them into two categories.
Thanks.

I defined an array list containing the first word of every multi-word place name, e.g. {"Hong", "New", "North", "South" ... }, and then check whether it contains tokens[s.getStart()]. If it does, I add tokens[s.getStart()] + " " + tokens[s.getStart() + 1]; otherwise I add tokens[s.getStart()]. It's not the best approach, but it's enough for me for now.
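An alternative that avoids the hard-coded word list, sketched under an assumption: if the model returns a span that covers both tokens, the full name can be rebuilt by joining every token from the span's start up to (but not including) its end. The helper below is plain Java with a made-up class name (SpanText) and hand-written span boundaries; in the question's loop the call would be join(tokens, s.getStart(), s.getEnd()) in place of tokens[s.getStart()]. OpenNLP also provides Span.spansToStrings(spans, tokens) for the same purpose.

```java
import java.util.Arrays;

class SpanText {
    // Joins tokens[start..end) with single spaces. The end index is exclusive,
    // which matches the convention of OpenNLP's Span.getStart()/getEnd().
    static String join(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] tokens = "so Putin should Hong Kong review".split(" ");
        // Hypothetical span boundaries, written by hand for illustration.
        System.out.println(join(tokens, 3, 5)); // Hong Kong
        System.out.println(join(tokens, 1, 2)); // Putin
    }
}
```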


Java Spark ML - prediction/forecast with Spark ML 3.1+ issue

I have a small question about prediction/forecasting with Spark ML 3.1+.
I have a very simple dataset of timestamps for when an event happened.
Here is a small portion of the (very big) file:
+----------+-----+
| time|label|
+----------+-----+
|1621900800| 43|
|1619568000| 41|
|1620432000| 41|
|1623974400| 42|
|1620604800| 41|
|1622505600| 42|
truncated
|1624665600| 42|
|1623715200| 41|
|1623024000| 43|
|1623888000| 42|
|1621296000| 42|
|1620691200| 44|
|1620345600| 41|
|1625702400| 44|
+----------+-----+
only showing top 20 rows
The dataset is really just a timestamp representing a day on the left and, on the right, the number of bananas sold that day. Here are the first three rows of the sample above, translated:
+--------------+--------------+
|          time|         value|
+--------------+--------------+
|  May 25, 2021|banana sold 43|
|April 28, 2021|banana sold 41|
|   May 8, 2021|banana sold 41|
+--------------+--------------+
My goal is just to build a prediction model: how many bananas will be sold tomorrow, the day after, etc.
Therefore, I tried Linear Regression, though it might not be a good model for this problem:
VectorAssembler vectorAssembler = new VectorAssembler().setInputCols(new String[]{"time", "label"}).setOutputCol("features");
Dataset<Row> vectorData = vectorAssembler.transform(dataSetBanana);
LinearRegression lr = new LinearRegression();
LinearRegressionModel lrModel = lr.fit(vectorData);
System.out.println("Coefficients: " + lrModel.coefficients() + " Intercept: " + lrModel.intercept());
LinearRegressionTrainingSummary trainingSummary = lrModel.summary();
System.out.println("numIterations: " + trainingSummary.totalIterations());
System.out.println("objectiveHistory: " + Vectors.dense(trainingSummary.objectiveHistory()));
trainingSummary.residuals().show();
System.out.println("RMSE: " + trainingSummary.rootMeanSquaredError());
System.out.println("r2: " + trainingSummary.r2());
System.out.println("the magical prediction: " + lrModel.predict(new DenseVector(new double[]{1.0, 1.0})));
I see all the values printed, very happy.
Coefficients: [-1.5625735463489882E-19,1.0000000000000544] Intercept: 2.5338210784074846E-10
numIterations: 0
objectiveHistory: [0.0]
+--------------------+
| residuals|
+--------------------+
|-1.11910480882215...|
RMSE: 3.0933584599870493E-13
r2: 1.0
the magical prediction: 1.0000000002534366
But it is not giving me anything close to a prediction. I was expecting something like:
|Some time in the future| banana sold some prediction|
| 1626414043 | 38 |
May I ask which model could produce an answer like "the model predicts X bananas will be sold at time Y in the future"?
A small piece of code with its result would be great.
Thank you
Linear regression can be a good start to get familiar with MLlib before you go for more complicated models. First, let's have a look at what you have done so far.
Your VectorAssembler transforms your data frame this way:
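before:
+----------+-----+
|      time|label|
+----------+-----+
|1621900800|   43|
|1620432000|   41|
+----------+-----+
after:
+----------+-----+---------------+
|      time|label|       features|
+----------+-----+---------------+
|1621900800|   43|[1621900800;43]|
|1620432000|   41|[1620432000;41]|
+----------+-----+---------------+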
Now, when you ask LinearRegression to train its model, it expects your dataset to contain two columns:
one column named features, containing a vector with everything that can be used to predict the label;
one column named label, holding what you want to predict.
Regression will find the a and b that minimize the error across all records i, where:
y_i = a * x_i + b + error_i
In your particular setup, you have passed the label to your vector assembler, which is wrong: the label is what you want to predict!
Your model has simply learnt that the label perfectly predicts the label:
y = 0.0 * features[0] + 1.0 * features[1]
So you should correct your VectorAssembler:
VectorAssembler vectorAssembler = new VectorAssembler().setInputCols(new String[]{"time"}).setOutputCol("features");
Now, when you made your prediction, you passed this:
lrModel.predict(new DenseVector(new double[]{1.0, 1.0})); // {timestamp, label}
It returned 1.0, as per the formula above.
Now, if you change the VectorAssembler as proposed above, you should call the prediction this way:
lrModel.predict(new DenseVector(new double[]{timeStampIWantToPredict}));
Side notes:
You can also pass a whole dataset to the model (via transform); it returns a dataset with a new column holding the prediction.
You should really have a closer look at the MLlib Pipeline documentation.
Then you can try to add some new features to your linear regression: seasonality, autoregressive features...
The model gives you the coefficients of your variables. Then it's easy to calculate the output. If you have only one variable x1 your model will be something like:
y = a*x1 + b
Then the outputs of your model are a and b. Then you can calculate y.
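To make the formula y = a*x1 + b concrete, here is a minimal sketch in plain Java (deliberately not Spark) that fits a and b by ordinary least squares on a single feature and then evaluates the formula for a new input. The class name SimpleLinearFit and its methods are made up for illustration; the three training rows are taken from the question's sample, and the result on so few rows is not a meaningful forecast.

```java
class SimpleLinearFit {
    final double a; // slope
    final double b; // intercept

    SimpleLinearFit(double a, double b) {
        this.a = a;
        this.b = b;
    }

    // Ordinary least squares for one feature: a = Sxy / Sxx, b = meanY - a * meanX.
    static SimpleLinearFit fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new SimpleLinearFit(slope, meanY - slope * meanX);
    }

    // Evaluates y = a * x + b for a new input.
    double predict(double x) {
        return a * x + b;
    }

    public static void main(String[] args) {
        // First three rows of the question's sample: time -> bananas sold.
        double[] time = {1621900800, 1619568000, 1620432000};
        double[] sold = {43, 41, 41};
        SimpleLinearFit model = fit(time, sold);
        System.out.println("a = " + model.a + ", b = " + model.b);
        System.out.println("prediction: " + model.predict(1626414043));
    }
}
```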
Generally speaking, machine learning libraries also implement other methods that let you calculate the output. It's better to search how to save, load and then evaluate your model with new inputs. Check out https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/ml/regression/LinearRegressionModel.html
There's a method called predict that you can call on your model by giving the input as a Vector instance. I think that will work!
Another thing is: you are trying to solve a time-series problem with a single-variable linear regression model. I think you should use a better algorithm that is intended to deal with time-series or sequence problems such as Long Short Term Memory (LSTM).
I hope that my answer is useful for you. Keep going ;)

Java string indexing confuses me

I need to gather data from my database: the holiday dates in my country. The data comes like this:
Example 1 : THU 21 May Ascension Day of Jesus Christ *ICDX GOLD open for
Example 2 : MON-THU 28-31 Dec Substitute for Commemoration of Idul Fitri Festival
I need to extract the days, the dates, and the holiday name. To get the data from example 1, I'm using code like this:
public static void main(String[] args) {
String ex1 = "THU 21 May Ascension Day of Jesus Christ *ICDX GOLD open for";
String ex2 = "MON-THU 28-31 Dec Substitute for Commemoration of Idul Fitri Festival ";
String[] trim1 = ex1.trim().split("\\s+"); //to split by space
String[] trim2 = ex1.trim().split(" "); //to split by 3 space so i got the data from multiple space as delimiter
System.out.println("DAY " +trim1[0]);//display day
System.out.println("DATE " +trim1[1] +trim1[2]+"2020");//display date
System.out.println("HOLIDAY NAME " +trim2[3]);//display holiday name
}
The output looks like this:
DAY THU
DATE 21May2020
HOLIDAY NAME Ascension Day of Jesus Christ
That is just what I need, but when it comes to example 2, I can't use the same code because the spacing is different. How can I get the data I need from both example 1 and example 2 with the same code?
I am new to Java, so I'm sorry if my question looks dumb. I hope you can help me. Thanks.
.split("\\s+") splits at any run of whitespace, i.e. at one space or more.
This means that you are able to split at any number of spaces (which is what you want). However, it will also split your text comments. You can limit the length of the array produced (and thus the number of times the string is split) using .split(regex, n), which results in an array of at most n elements. See the String.split Javadoc for more details.
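As a minimal sketch of the limit argument: splitting on whitespace with a limit of 4 yields the day, the date, the month, and everything else as one untouched string. The class and method names (HolidaySplit, parse) are made up for illustration, and the sketch does not attempt to separate the holiday name from a trailing comment.

```java
class HolidaySplit {
    // Splits a line into at most 4 parts: day, date, month, remaining text.
    static String[] parse(String line) {
        return line.trim().split("\\s+", 4);
    }

    public static void main(String[] args) {
        String ex2 = "MON-THU 28-31 Dec Substitute for Commemoration of Idul Fitri Festival ";
        String[] parts = parse(ex2);
        System.out.println("DAY " + parts[0]);
        System.out.println("DATE " + parts[1] + parts[2] + "2020");
        System.out.println("HOLIDAY NAME " + parts[3]);
    }
}
```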
As for splitting out your two textual comments, I cannot see a way to do this.
Substitute for Commemoration of Idul Fitri Festival "; contains nothing that tells where the first text comment ends and the second begins.
It seems quite strange to me that you receive information from your database like this; I would recommend seeing if there are other options. There is almost certainly a way to get separate fields.
If you have the ability to change the information in the database, you could insert single quotes (') or some other separator, which would then let you split out the two pieces of text.
This is basically what @DanielBarbarian suggested: since the information always seems to start at the same indexes, you can just use those to get what you need.
String ex1 = "THU 21 May Ascension Day of Jesus Christ *ICDX GOLD open for";
String ex2 = "MON-THU 28-31 Dec Substitute for Commemoration of Idul Fitri Festival ";
String day = ex2.substring(0, 8).trim();
String date = ex2.substring(8, 14).trim() + ex2.substring(14, 22).trim() + "2020";
String name = ex2.substring(22);
System.out.println("DAY " + day);// display day
System.out.println("DATE " + date);// display date
System.out.println("HOLIDAY NAME " + name);// display holiday name

String regular expression java

I'm working on a utility where I've this requirement:
there is a string which contains parameters like - #p1 or #p2 or #pn, where n can be any number.
for example string is :
Input:
It provides #p1 latest news, videos #p2 from India and #p3 the world. Get today's news headlines from #p5 Business, #p5
Replace all the parameters with #pn#. So if the parameter is #p1 it will become #p1#.
The above string will become :
Output:
It provides #p1# latest news, videos #p2# from India and #p3# the world. Get today's news headlines from #p4# Business, #p5#
Any quick help appreciated.
Thanks.
Use string.replaceAll function like below.
string.replaceAll("(#p\\d+)", "$1#");
\d+ matches one or more digits. The parentheses () form a capturing group, which captures the characters matched by the pattern inside them and stores them in the corresponding group. Later we can refer to those characters by the group's index, like $1 or $2.
Example:
String s = "It provides #p1 latest news, videos #p2 from India and #p3 the world. Get today's news headlines from #p5 Business, #p5";
System.out.println(s.replaceAll("(#p\\d+)", "$1#"));
Output:
It provides #p1# latest news, videos #p2# from India and #p3# the world. Get today's news headlines from #p5# Business, #p5#
You can try regex like this :
public static void main(String[] args) {
String s = "it provides #p1 latest news, videos #p2 from India and #p3 the world. Get today's news headlines from #p5 Business, #p5";
System.out.println(s.replaceAll("(#p\\d+)(?=\\s+|$)", "$1\\#"));
}
O/P :
it provides #p1# latest news, videos #p2# from India and #p3# the world. Get today's news headlines from #p5# Business, #p5#
Explanation :
(#p\\d+)(?=\\s+|$) --> `#p` followed by one or more digits (which are all captured), followed by whitespace or the end of the String (which is checked by the lookahead but not consumed or captured).
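The two patterns above differ when a parameter is followed by punctuation rather than whitespace. The following sketch shows the difference on a made-up input; the class and method names (ParamMarkers, closeAll, closeBeforeSpace) are hypothetical.

```java
class ParamMarkers {
    // Appends '#' to every #p<digits> parameter, whatever follows it.
    static String closeAll(String s) {
        return s.replaceAll("(#p\\d+)", "$1#");
    }

    // Appends '#' only when the parameter is followed by whitespace or end of input.
    static String closeBeforeSpace(String s) {
        return s.replaceAll("(#p\\d+)(?=\\s+|$)", "$1#");
    }

    public static void main(String[] args) {
        String s = "see #p1, then #p2 end";
        System.out.println(closeAll(s));         // see #p1#, then #p2# end
        System.out.println(closeBeforeSpace(s)); // see #p1, then #p2# end
    }
}
```

Which variant is appropriate depends on whether parameters directly followed by a comma or period should be rewritten as well.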

String matching and replace in Java

I have a String like this:
String a = "Barbara Liskov (born Barbara Jane Huberman on November 7, 1939"
+" in California) is a computer scientist.[2] She is currently the Ford"
+" Professor of Engineering in the MIT School of Engineering's electrical"
+" engineering and computer science department and an institute professor"
+" at the Massachusetts Institute of Technology.[3]";
I would like to replace all of these elements: [1], [2], [3], etcetera, with a blank space.
I tried with:
if (a.matches("([){1}\\d(]){1}")) {
a = a.replace("");
}
but it does not work!
Your Pattern is all wrong.
Try this example:
String input =
"Barbara Liskov (born Barbara Jane Huberman on November 7, 1939 in California) "
+ "is a computer scientist.[2] She is currently the Ford Professor of Engineering "
+ "in the MIT School of Engineering's electrical engineering and computer "
+ "science department and an institute professor at the Massachusetts Institute "
+ "of Technology.[3]";
//                                   | escaped opening square bracket
//                                   |  | one or more digits
//                                   |  |   | escaped closing square bracket
//                                   |  |   |      | replace with one space
System.out.println(input.replaceAll("\\[\\d+\\]", " "));
Output (newlines added for clarity)
Barbara Liskov (born Barbara Jane Huberman on November 7,
1939 in California) is a computer scientist.
She is currently the Ford Professor of Engineering in the MIT
School of Engineering's electrical engineering and computer science
department and an institute professor at the Massachusetts Institute of Technology.
Very simple:
a = a.replaceAll("\\[\\d+\\]","");
The changes:
Use replaceAll instead of replace.
Escape the [ and ]: they are regex special characters, and wrapping them in parentheses does not escape them.
There is no need for {1} in your regex: [{1} == [, as both specify that the character should occur exactly once.
The + added in \\d+ is for numbers with more than one digit, such as [12].
About your pattern ([){1}\\d(]){1}:
{1} is always useless, since it is always implicit.
[ and ] need to be escaped with a backslash (which must itself be escaped with another backslash, since it is in a string literal).
\\d has no explicit quantifier, so [12], for example, won't match since it has two digits.
So, better try: \\[\\d+\\]
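A short sketch of why the + quantifier matters, using a made-up input that contains a two-digit marker; the class and method names (CitationStrip, strip, stripSingleDigit) are hypothetical.

```java
class CitationStrip {
    // Removes markers such as [2] or [12]: one or more digits between brackets.
    static String strip(String s) {
        return s.replaceAll("\\[\\d+\\]", "");
    }

    // Same pattern without '+': only single-digit markers are removed.
    static String stripSingleDigit(String s) {
        return s.replaceAll("\\[\\d\\]", "");
    }

    public static void main(String[] args) {
        String s = "scientist.[2] Technology.[12]";
        System.out.println(strip(s));            // scientist. Technology.
        System.out.println(stripSingleDigit(s)); // scientist. Technology.[12]
    }
}
```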
Use String's replaceAll(String regex, String replacement).
All you have to do is a = a.replaceAll("\\[\\d+\\]", " ").
You can read the Javadoc for more information.
Use this:
String a = "Barbara Liskov (born Barbara Jane Huberman on November 7, 1939 in California) is a computer scientist.[2] She is currently the Ford Professor of Engineering in the MIT School of Engineering's electrical engineering and computer science department and an institute professor at the Massachusetts Institute of Technology.[3]";
for(int i =1 ; i<= 3; i++){
a= a.replace("["+i+"]","");
}
System.out.println(a);
This will work for the markers [1] to [3]; raise the loop bound if the text contains higher numbers.

Regex to extract a paragraph

I need a regex to extract each paragraph from a text buffer containing many similar paragraphs, and store it as a string for additional processing.
Example: Say, the text buffer is like this:
=== Jun 11 14:05:39 - Person Details ===
Person Name = "Hurlman"
Person Address = "2nd Street Benjamin Blvd NJ"
Persion Age = 25
=== Jun 11 14:05:39 - Person Details ===
Person Name = "Greg"
Person Address = "3rd Street Benjamin Blvd NJ"
Persion Age = 26
=== Jun 11 14:05:42 - Person Details ===
Person Name = "Michel"
Person Address = "4th Street Benjamin Blvd NJ"
Persion Age = 27
And I need to iterate through all the paragraphs and store each one of them to further find the specific person details inside.
Each paragraph I need to extract should be of the below format
=== Jun 11 14:05:42 - Person Details ===
Person Name = "Michel"
Person Address = "4th Street Benjamin Blvd NJ"
Persion Age = 27
Any help is much appreciated!
You could use this pattern: (===.*===[\s\S]*?)(?====|$)
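A minimal sketch of applying that pattern from Java with Pattern and Matcher, on a shortened copy of the question's buffer. The class and method names (RecordRegex, extract) are made up for illustration, and each match is trimmed to drop the trailing newline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordRegex {
    // Header line "=== ... ===", then as little text as possible up to the
    // next "===" or the end of the input.
    private static final Pattern RECORD =
            Pattern.compile("(===.*===[\\s\\S]*?)(?====|$)");

    static List<String> extract(String text) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            records.add(m.group(1).trim());
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "=== Jun 11 14:05:39 - Person Details ===\n"
                + "Person Name = \"Hurlman\"\n"
                + "Persion Age = 25\n"
                + "=== Jun 11 14:05:39 - Person Details ===\n"
                + "Person Name = \"Greg\"\n"
                + "Persion Age = 26\n";
        for (String record : extract(text)) {
            System.out.println(record);
            System.out.println("--");
        }
    }
}
```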
Using regexes to solve this is possible, but it is likely to give you a poor (inefficient, hard to understand, hard to maintain, etc) solution.
What you have is an informal record structure represented using lines of text. (This is not natural language text, so describing it in terms of "paragraphs" doesn't make sense.)
The way to process it is to read it a line at a time and then use Scanner (or equivalent) to parse each line into name value pairs. You just need some simple logic to detect the record boundaries and / or check that they are appearing at the correct place in the input stream.
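The line-at-a-time approach could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: it uses String.split rather than Scanner, treats any line starting with === as a record boundary, and strips double quotes from values. The class and method names (RecordParser, parse) are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecordParser {
    // Returns one map of name/value pairs per "=== ... ===" record.
    static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> records = new ArrayList<>();
        Map<String, String> current = null;
        for (String line : text.split("\\R")) {
            line = line.trim();
            if (line.startsWith("===")) {
                // A header line starts a new record.
                current = new LinkedHashMap<>();
                records.add(current);
            } else if (current != null && line.contains("=")) {
                // Split at the first '=' only; drop double quotes from the value.
                String[] kv = line.split("=", 2);
                current.put(kv[0].trim(), kv[1].trim().replace("\"", ""));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "=== Jun 11 14:05:39 - Person Details ===\n"
                + "Person Name = \"Hurlman\"\n"
                + "Person Address = \"2nd Street Benjamin Blvd NJ\"\n"
                + "Persion Age = 25\n"
                + "=== Jun 11 14:05:39 - Person Details ===\n"
                + "Person Name = \"Greg\"\n"
                + "Persion Age = 26\n";
        for (Map<String, String> record : parse(text)) {
            System.out.println(record);
        }
    }
}
```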
