Java Spark ML - prediction/forecast with Spark ML 3.1+ issue

Small question regarding prediction/forecast using Spark ML 3.1+, please.
I have a very simple dataset of timestamps for when an event happened. Here is a small portion of the very, very big file.
+----------+-----+
| time|label|
+----------+-----+
|1621900800| 43|
|1619568000| 41|
|1620432000| 41|
|1623974400| 42|
|1620604800| 41|
|1622505600| 42|
truncated
|1624665600| 42|
|1623715200| 41|
|1623024000| 43|
|1623888000| 42|
|1621296000| 42|
|1620691200| 44|
|1620345600| 41|
|1625702400| 44|
+----------+-----+
only showing top 20 rows
The dataset is really just a timestamp representing a day on the left and, on the right, the number of bananas sold that day. Here are the first three rows of the sample above, translated.
+--------------+----------------+
|          time|           value|
+--------------+----------------+
|  May 25, 2021| bananas sold 43|
|April 28, 2021| bananas sold 41|
|   May 8, 2021| bananas sold 41|
+--------------+----------------+
My goal is just to build a prediction model: how many bananas will be sold tomorrow, the day after, etc.
Therefore, I went to try Linear Regression, but it might not be a good model for this problem:
VectorAssembler vectorAssembler = new VectorAssembler().setInputCols(new String[]{"time", "label"}).setOutputCol("features");
Dataset<Row> vectorData = vectorAssembler.transform(dataSetBanana);
LinearRegression lr = new LinearRegression();
LinearRegressionModel lrModel = lr.fit(vectorData);
System.out.println("Coefficients: " + lrModel.coefficients() + " Intercept: " + lrModel.intercept());
LinearRegressionTrainingSummary trainingSummary = lrModel.summary();
System.out.println("numIterations: " + trainingSummary.totalIterations());
System.out.println("objectiveHistory: " + Vectors.dense(trainingSummary.objectiveHistory()));
trainingSummary.residuals().show();
System.out.println("RMSE: " + trainingSummary.rootMeanSquaredError());
System.out.println("r2: " + trainingSummary.r2());
System.out.println("the magical prediction: " + lrModel.predict(new DenseVector(new double[]{1.0, 1.0})));
I see all the values printed, very happy.
Coefficients: [-1.5625735463489882E-19,1.0000000000000544] Intercept: 2.5338210784074846E-10
numIterations: 0
objectiveHistory: [0.0]
+--------------------+
| residuals|
+--------------------+
|-1.11910480882215...|
RMSE: 3.0933584599870493E-13
r2: 1.0
the magical prediction: 1.0000000002534366
It is not giving me anything close to a prediction. I was expecting something like:
|Some time in the future| banana sold some prediction|
| 1626414043 | 38 |
May I ask what would be a model that can produce an answer like "model predicts X bananas will be sold at time Y in the future"?
A small piece of code with result would be great.
Thank you

Linear regression can be a good start to get familiar with MLlib before you go for more complicated models. First, let's have a look at what you have done so far.
Your VectorAssembler transforms your data frame this way:
before:
+----------+-----+
|      time|label|
+----------+-----+
|1621900800|   43|
|1620432000|   41|
+----------+-----+
after:
+----------+-----+---------------+
|      time|label|       features|
+----------+-----+---------------+
|1621900800|   43|[1621900800;43]|
|1620432000|   41|[1620432000;41]|
+----------+-----+---------------+
Now, when you are asking LinearRegression to train its model, it will expect your dataset to contain two columns:
one column named features and containing a vector with everything that can be used to predict the label.
one column named label, what you want to predict
Regression will find a and b which minimize the errors across all records i, where:
y_i = a * x_i + b + error_i
In your particular setup, you have passed the label to your vector assembler, which is wrong: that's what you want to predict!
Your model has simply learnt that the label perfectly predicts the label:
y = 0.0 * features[0] + 1.0 * features[1]
So you should correct your VectorAssembler:
VectorAssembler vectorAssembler = new VectorAssembler().setInputCols(new String[]{"time"}).setOutputCol("features");
Now, when you did your prediction, you passed this:
lrModel.predict(new DenseVector(new double[]{ 1.0, 1.0 })); // { timestamp, label }
It returned 1.0 as per formula above.
Now, if you change the VectorAssembler as proposed above, you should call the prediction this way:
lrModel.predict(new DenseVector(new double[]{ timeStampIWantToPredict }));
Side notes:
you can pass a dataset to your predictor; it will return a dataset with a new column containing the prediction.
you should really have a closer look at the MLlib Pipeline documentation.
then you can try adding some new features to your linear regression: seasonality, auto-regressive features...
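Once the assembler uses only time, the forecast is just slope * timestamp + intercept. Here is a plain-Java sketch of that arithmetic; the SLOPE and INTERCEPT values are made up for illustration, not the output of a real fit:

```java
public class BananaForecast {
    // Hypothetical coefficients; in practice, read them from
    // lrModel.coefficients() and lrModel.intercept() after the fit.
    static final double SLOPE = 2.0e-7;      // bananas per epoch-second (assumed)
    static final double INTERCEPT = -282.0;  // assumed

    // y = a * x + b, the regression formula discussed above
    static double predict(double epochSeconds) {
        return SLOPE * epochSeconds + INTERCEPT;
    }

    public static void main(String[] args) {
        double future = 1626414043.0; // the future timestamp from the question
        System.out.printf("model predicts %.1f bananas at time %.0f%n",
                predict(future), future);
    }
}
```

With these toy coefficients, the prediction for the question's future timestamp comes out around 43 bananas, i.e. in the same range as the training labels.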

The model gives you the coefficients of your variables, and from those it's easy to calculate the output. If you have only one variable x1, your model will be something like:
y = a*x1 + b
The outputs of your model are a and b; with them you can calculate y.
Generally speaking, machine learning libraries also implement other methods that let you calculate the output. It's better to search how to save, load and then evaluate your model with new inputs. Check out https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/ml/regression/LinearRegressionModel.html
There's a method called predict that you can call on your model by giving the input as a Vector instance. I think that will work!
Another thing is: you are trying to solve a time-series problem with a single-variable linear regression model. I think you should use an algorithm that is intended to deal with time-series or sequence problems, such as Long Short-Term Memory (LSTM).
I hope that my answer is useful for you. Keep going ;)

Related

How to compute pseudorange from the parameters fetched via Google GNSSLogger?

The official GNSS raw measurements fetched via the GNSS Logger app provide the following parameters:
TimeNanos
LeapSecond
TimeUncertaintyNanos
FullBiasNanos
BiasNanos
BiasUncertaintyNanos
DriftNanosPerSecond
DriftUncertaintyNanosPerSecond
HardwareClockDiscontinuityCount
Svid
TimeOffsetNanos
State
ReceivedSvTimeNanos
ReceivedSvTimeUncertaintyNanos
Cn0DbHz
PseudorangeRateMetersPerSecond
PseudorangeRateUncertaintyMetersPerSecond
I'm looking for the raw pseudorange measurements PR from the above data. A little help?
Reference 1: https://github.com/google/gps-measurement-tools
Reference 2 : https://developer.android.com/guide/topics/sensors/gnss
Pseudorange[m] = (AverageTravelTime[s] + delta_t[s]) * speedOfLight[m/s]
where: m - meters, s - seconds.
Try this way:
Select satellites from one constellation (at first, try with GPS).
Choose the max value of ReceivedSvTimeNanos.
Calculate delta_t for each satellite as the max ReceivedSvTimeNanos minus the current ReceivedSvTimeNanos (delta_t = maxRst - curRst).
Use an average travel time of 70 milliseconds and a speed of light of 299792458 m/s for the calculation.
Don't forget to convert all values to the same units.
For details refer to this pdf and UserPositionVelocityWeightedLeastSquare class
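The steps above can be sketched in plain Java; the method and variable names here are mine (not the Android API), and this only illustrates the arithmetic:

```java
public class PseudorangeSketch {
    static final double SPEED_OF_LIGHT = 299792458.0; // m/s
    static final double AVG_TRAVEL_TIME_S = 0.070;    // ~70 ms, as suggested above

    // receivedSvTimeNanos: ReceivedSvTimeNanos values for satellites
    // of a single constellation (e.g. GPS only)
    static double[] pseudoranges(long[] receivedSvTimeNanos) {
        long maxRst = Long.MIN_VALUE;
        for (long rst : receivedSvTimeNanos) {
            maxRst = Math.max(maxRst, rst);
        }
        double[] pr = new double[receivedSvTimeNanos.length];
        for (int i = 0; i < pr.length; i++) {
            // delta_t = maxRst - curRst, converted from nanoseconds to seconds
            double deltaT = (maxRst - receivedSvTimeNanos[i]) * 1e-9;
            pr[i] = (AVG_TRAVEL_TIME_S + deltaT) * SPEED_OF_LIGHT;
        }
        return pr;
    }
}
```

The satellite with the largest ReceivedSvTimeNanos gets exactly 0.070 s * c; every other satellite gets a proportionally longer pseudorange.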
Unfortunately Android doesn't provide pseudorange directly from the API - you have to calculate this yourself.
The EU GSA has a great document here that explains in detail how to use GNSS raw measurements in section 2.4:
https://www.gsa.europa.eu/system/files/reports/gnss_raw_measurement_web_0.pdf
Specifically, section 2.4.2 explains how to calculate pseudorange from the data given by the Android APIs. It's literally pages of text, so I won't copy the whole thing in-line here, but here's the Example 1 they share for a Matlab code snippet to compute the pseudorange for Galileo, GPS and BeiDou signals when the time-of-week is encoded:
% Select GPS + GAL TOW decoded (state bit 3 enabled)
pos = find( (gnss.Const == 1 | gnss.Const == 6) & bitand(gnss.State,2^3) );
% Generate the measured time in full GNSS time
tRx_GNSS = gnss.timeNano(pos) - (gnss.FullBiasNano(1) + gnss.BiasNano(1));
% Change the valid range from full GNSS to TOW
tRx = mod(tRx_GNSS(pos),WEEKSEC*1e9);
% Generate the satellite time
tTx = gnss.ReceivedSvTime(pos) + gnss.TimeOffsetNano(pos);
% Generate the pseudorange
prMilliSeconds = (tRx - tTx );
pr = prMilliSeconds *Constant.C*1e-9;
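For a single measurement, that Matlab snippet translates to Java roughly as follows; the parameter names mirror the fields above and WEEK_NANOS corresponds to WEEKSEC*1e9, but treat this as a sketch, not a drop-in implementation:

```java
public class GsaPseudorange {
    static final double SPEED_OF_LIGHT = 299792458.0;        // m/s
    static final long WEEK_NANOS = 604800L * 1_000_000_000L; // seconds per week * 1e9

    static double pseudorangeMeters(long timeNanos, long fullBiasNanos, long biasNanos,
                                    double receivedSvTimeNanos, double timeOffsetNanos) {
        // Measured receive time in full GNSS time
        long tRxGnss = timeNanos - (fullBiasNanos + biasNanos);
        // Reduce the valid range from full GNSS time to time of week
        double tRx = Math.floorMod(tRxGnss, WEEK_NANOS);
        // Satellite transmit time
        double tTx = receivedSvTimeNanos + timeOffsetNanos;
        // Pseudorange: travel time in nanoseconds times the speed of light
        return (tRx - tTx) * 1e-9 * SPEED_OF_LIGHT;
    }
}
```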

Predict function in R returns 0.0 [duplicate]

I posted earlier today about an error I was getting with using the predict function. I was able to get that corrected, and thought I was on the right path.
I have a number of observations (actuals) and I have a few data points that I want to extrapolate or predict. I used lm to create a model, then I tried to use predict with the actual value that will serve as the predictor input.
This code is all repeated from my previous post, but here it is:
df <- read.table(text = '
Quarter Coupon Total
1 "Dec 06" 25027.072 132450574
2 "Dec 07" 76386.820 194154767
3 "Dec 08" 79622.147 221571135
4 "Dec 09" 74114.416 205880072
5 "Dec 10" 70993.058 188666980
6 "Jun 06" 12048.162 139137919
7 "Jun 07" 46889.369 165276325
8 "Jun 08" 84732.537 207074374
9 "Jun 09" 83240.084 221945162
10 "Jun 10" 81970.143 236954249
11 "Mar 06" 3451.248 116811392
12 "Mar 07" 34201.197 155190418
13 "Mar 08" 73232.900 212492488
14 "Mar 09" 70644.948 203663201
15 "Mar 10" 72314.945 203427892
16 "Mar 11" 88708.663 214061240
17 "Sep 06" 15027.252 121285335
18 "Sep 07" 60228.793 195428991
19 "Sep 08" 85507.062 257651399
20 "Sep 09" 77763.365 215048147
21 "Sep 10" 62259.691 168862119', header=TRUE)
str(df)
'data.frame': 21 obs. of 3 variables:
$ Quarter : Factor w/ 24 levels "Dec 06","Dec 07",..: 1 2 3 4 5 7 8 9 10 11 ...
$ Coupon: num 25027 76387 79622 74114 70993 ...
$ Total: num 132450574 194154767 221571135 205880072 188666980 ...
Code:
model <- lm(df$Total ~ df$Coupon, data=df)
> model
Call:
lm(formula = df$Total ~ df$Coupon)
Coefficients:
(Intercept) df$Coupon
107286259 1349
Predict code (based on previous help):
(These are the predictor values I want to use to get the predicted value)
Quarter = c("Jun 11", "Sep 11", "Dec 11")
Total = c(79037022, 83100656, 104299800)
Coupon = data.frame(Quarter, Total)
Coupon$estimate <- predict(model, newdate = Coupon$Total)
Now, when I run that, I get this error message:
Error in `$<-.data.frame`(`*tmp*`, "estimate", value = c(60980.3823396919, :
replacement has 21 rows, data has 3
My original data frame that I used to build the model had 21 observations in it. I am now trying to predict 3 values based on the model.
I either don't truly understand this function, or have an error in my code.
Help would be appreciated.
Thanks
First, you want to use
model <- lm(Total ~ Coupon, data=df)
not model <- lm(df$Total ~ df$Coupon, data=df).
Second, by saying lm(Total ~ Coupon), you are fitting a model that uses Total as the response variable, with Coupon as the predictor. That is, your model is of the form Total = a + b*Coupon, with a and b the coefficients to be estimated. Note that the response goes on the left side of the ~, and the predictor(s) on the right.
Because of this, when you ask R to give you predicted values for the model, you have to provide a set of new predictor values, ie new values of Coupon, not Total.
Third, judging by your specification of newdata, it looks like you're actually after a model to fit Coupon as a function of Total, not the other way around. To do this:
model <- lm(Coupon ~ Total, data=df)
new.df <- data.frame(Total=c(79037022, 83100656, 104299800))
predict(model, new.df)
Thanks Hong, that was exactly the problem I was running into. The error you get suggests that the number of rows is wrong, but the problem is actually that the model has been trained using a command that ends up with the wrong names for the parameters.
This is a really critical detail that is entirely non-obvious for lm and so on. Some of the tutorials make reference to lines like lm(olive$Area ~ olive$Palmitic), ending up with a variable name of olive$Area, NOT Area, so creating an entry using anewdata <- data.frame(Palmitic=2) can't then be used. If you use lm(Area ~ Palmitic, data=olive), then the variable names are right and prediction works.
The real problem is that the error message does not indicate the problem at all:
Warning message: 'anewdata' had 1 rows but variable(s) found to have X rows
Instead of newdata, you are using newdate in your predict code; check it once, and just use Coupon$estimate <- predict(model, Coupon).
It will work.
To avoid the error, an important point about the new dataset is the name of the independent variable: it must be the same as the one used in the model. Another way is to nest the two functions without creating a new dataset:
model <- lm(Coupon ~ Total, data=df)
predict(model, data.frame(Total=c(79037022, 83100656, 104299800)))
Pay attention to how the model is built. The next two commands look similar, but for the predict function the first works and the second doesn't:
model <- lm(Coupon ~ Total, data=df)  # OK
model <- lm(df$Coupon ~ df$Total)     # broken for predict

java.lang.NumberFormatException: Expected an int but was 0.6 at line 1 column 8454

I'm using the retrofit library for my calls in a demo project.
I received the following error:
java.lang.NumberFormatException: Expected an int but was 0.6 at line 1 column 8454 path $.result.results.ads[2].acres
I understand that this is down to Gson.
I will show you the JSON it's getting caught in:
{
"ad_id":739580087654,
"property_type":"site",
"house_type":"",
"selling_type":"private-treaty",
"price_type":"",
"agreed":0,
"priority":2,
"description":"Beautiful elevated 0.6 acre site - zoned residential - and within easy walk to this popular and scenic coastal village\r\n\r\n\r\nthe site area is zoned residential ( i.e. can be constructed on for residential home) and has beautiful coastal views\r\n\r\nSpiddal is an exceptionally popular location , just 8 miles west of Galway City but the area has not been over developed.\r\n\r\nAll services and family amenities are location in the village centre.\r\n\r\n",
"price":135000,
"bedrooms":null,
"bathrooms":null,
"tax_section":"0",
"square_metres":0,
"acres":0.6, <----------------------TRIPPING UP HERE
"features":[
"Zoned residential",
"within easy walk of coastal village of Spiddal",
"with coastal views"
],
"ber_rating":"",
"ber_code":"",
"ber_epi":0,
"city":"",
"general_area":"Connemara",
"postcode":null,
"latlon_accuracy":1,
"main_email":"",
"cc_email":"",
"auction_address":"",
"start_date":1384425002,
"listing_date":1384425002,
"agreed_date":0,
"auction_date":0,
"tags":1
},
I'm not that experienced with Retrofit so decided to learn and integrate on this project.
Would anyone have any suggestions?
I don't have any control over the JSON being sent down.
Try using a float or double instead of an int; 0.6 is not an integer, it is a decimal. Note that Java automatically interprets decimal literals as doubles; an example of a float would be 0.6f.
That's because the parser is expecting an int whereas the actual value it got was a float. What you can do is change that value's type from int to float (or double) in your model.
This might cause problems in your code wherever you use that value; you can solve those by casting the float value to an integer.
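A minimal sketch of that model-class change (the class and field names here are assumed from the JSON, not your actual code):

```java
public class Ad {
    // Was: int acres;  -- Gson fails with NumberFormatException on "acres": 0.6
    double acres; // declared as double so fractional values parse cleanly

    // Where the rest of the code really needs a whole number, round explicitly
    int acresAsInt() {
        return (int) Math.round(acres);
    }
}
```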

How do I choose the best k mean cluster in weka

As you can see in the results at the bottom, I have two different clusterings obtained using different seeds. I would like to choose the better of the two.
I know that a lower sum of squared errors is better. However, it shows nearly the same squared error even though I use different seeds. I want to know why the squared errors are so similar. I also want to know what else I need to consider when selecting the best clustering.
*******************************************************************
kMeans
======
Number of iterations: 10
Within cluster sum of squared errors: 527.6988818392938
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(4898) (2781) (2117)
=====================================================
fixedacidity 6.8548 6.9565 6.7212
volatileacidity 0.2782 0.2826 0.2725
citricacid 0.3342 0.3389 0.3279
residualsugar 6.3914 8.2678 3.9265
chlorides 0.0458 0.0521 0.0374
freesulfurdioxide 35.3081 38.6897 30.8658
totalsulfurdioxide 138.3607 155.2585 116.1627
density 0.994 0.9958 0.9916
pH 3.1883 3.1691 3.2134
sulphates 0.4898 0.492 0.4871
alcohol 10.5143 9.6325 11.6726
quality 5.8779 5.4779 6.4034
Time taken to build model (full training data) : 0.19 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 2781 ( 57%)
1 2117 ( 43%)
***********************************************************************
kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 527.6993178146143
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(4898) (2122) (2776)
=====================================================
fixedacidity 6.8548 6.7208 6.9572
volatileacidity 0.2782 0.2723 0.2828
citricacid 0.3342 0.3281 0.3389
residualsugar 6.3914 3.9451 8.2614
chlorides 0.0458 0.0374 0.0522
freesulfurdioxide 35.3081 30.9105 38.6697
totalsulfurdioxide 138.3607 116.2175 155.2871
density 0.994 0.9917 0.9958
pH 3.1883 3.2137 3.1689
sulphates 0.4898 0.4876 0.4916
alcohol 10.5143 11.6695 9.6312
quality 5.8779 6.4043 5.4755
Time taken to build model (full training data) : 0.15 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 2122 ( 43%)
1 2776 ( 57%)
Define "best result".
By the definition of k-means, a lower sum of squares is better.
Anything else is worse by k-means - but that doesn't mean that a different quality criterion (or clustering algorithm) could be more helpful for your actual problem.
Using different seeds does not guarantee you different clusters in the result.
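The criterion Weka prints ("Within cluster sum of squared errors") is the generic k-means objective, which can be sketched on toy 1-D data like this (this is the formula, not Weka's API):

```java
public class KMeansCompare {
    // Within-cluster sum of squared errors: for each point, the squared
    // distance to the centroid of the cluster it is assigned to.
    static double wcss(double[] points, int[] assignment, double[] centroids) {
        double sum = 0.0;
        for (int i = 0; i < points.length; i++) {
            double d = points[i] - centroids[assignment[i]];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 2.0, 9.0, 10.0};
        int[] assign = {0, 0, 1, 1};
        double[] centroids = {1.5, 9.5};
        // When comparing runs with different seeds, lower WCSS is better
        System.out.println("WCSS = " + wcss(pts, assign, centroids));
    }
}
```

In your two runs the WCSS values differ only in the fourth decimal place because both seeds converged to essentially the same two clusters (note the cluster sizes 2781/2117 vs 2776/2122, just with the labels swapped).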

Regex - Get text between two strings

I have a large text file which contains many abstracts (7k of them). I want to separate them. They have the following properties:
a number at the beginning with a period right after:
123.
and it always ends in:
[PubMed - indexed for MEDLINE]
It would be even better if I can get the title and abstract out of the separated string. I am fine if I have to split the articles first then split the texts.
In the example the title is the third line:
Effects of propofol and isoflurane on haemodynamics and the inflammatory response in cardiopulmonary bypass surgery.
The abstract is on the 8th line:
Cardiopulmonary bypass (CPB) causes reperfusion injury...
I have tried to use the following code for this text
Regex:
[0-9\.]*\s*(((?![0-9\.]*|MEDLINE).)+)\s*MEDLINE
Text:
1. Br J Biomed Sci. 2015;72(3):93-101.
Effects of propofol and isoflurane on haemodynamics and the inflammatory response
in cardiopulmonary bypass surgery.
Sayed S, Idriss NK, Sayyedf HG, Ashry AA, Rafatt DM, Mohamed AO, Blann AD.
Cardiopulmonary bypass (CPB) causes reperfusion injury that when most severe is
clinically manifested as a systemic inflammatory response syndrome. The
anaesthetic propofol may have anti-inflammatory properties that may reduce such a
response. We hypothesised differing effects of propofol and isoflurane on
inflammatory markers in patients having CBR Forty patients undergoing elective
CPB were randomised to receive either propofol or isoflurane for maintenance of
anaesthesia. CRP, IL-6, IL-8, HIF-1α (ELISA), CD11 and CD18 expression (flow
cytometry), and haemoxygenase (HO-1) promoter polymorphisms (PCR/electrophoresis)
were measured before anaesthetic induction, 4 hours post-CPB, and 24 hours later.
There were no differences in the 4 hours changes in CRP, IL-6, IL-8 or CD18
between the two groups, but those in the propofol group had higher HIF-1α (P =
0.016) and lower CD11 expression (P = 0.026). After 24 hours, compared to the
isoflurane group, the propofol group had significantly lower levels of CRP (P <
0.001), IL-6 (P < 0.001) and IL-8 (P < 0.001), with higher levels CD11 (P =
0.009) and CD18 (P = 0.002) expression. After 24 hours, patients on propofol had
increased expression of shorter HO-1 GT(n) repeats than patients on isoflurane (P
= 0.001). Use of propofol in CPB is associated with a less adverse inflammatory
profile than is isofluorane, and an increased up-regulation of HO-1. This
supports the hypothesis that propofol has anti-inflammatory activity.
PMID: 26510263 [PubMed - indexed for MEDLINE]
Two useful solutions have been proposed by Mariano and stribizhev:
Mariano's solution: Use the split method with the typical end
(?m)\[PubMed - indexed for MEDLINE\]$
DEMO : http://ideone.com/Qw5ss2
Java 4+
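As a quick illustration of the split approach with plain java.util.regex (the sample text here is abbreviated, not the real file):

```java
public class AbstractSplitter {
    // Split the dump on the line-ending marker common to every record
    static String[] splitRecords(String text) {
        return text.split("(?m)\\[PubMed - indexed for MEDLINE\\]$");
    }

    public static void main(String[] args) {
        String sample =
                "1. First title.\nFirst abstract.\nPMID: 1 [PubMed - indexed for MEDLINE]\n"
              + "2. Second title.\nSecond abstract.\nPMID: 2 [PubMed - indexed for MEDLINE]";
        for (String record : splitRecords(sample)) {
            System.out.println("--- record ---");
            System.out.println(record.trim());
        }
    }
}
```

Each array element then holds one full record (title, authors, abstract, PMID), which can be parsed further with the extraction regex below.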
stribizhev's solution: Fully extract data from the text
(?m)^\s*\d+\..*\R{2} # Get to the title
(?<title>[^\n]*(?:\n(?!\n)[^\n]*)*) # Get title
\R{2} # Get to the authors
[^\n]*(?:\n(?!\R)[^\n]*)* # Consume authors
(?<abstract>[^\[]*(?:\[(?!PubMed[ ]-[ ]indexed[ ]for[ ]MEDLINE\])[^\[]*)*) #Grab abstract
DEMO: https://regex101.com/r/sG2yQ2/2
Java 8+
Try this:
"^[0-9]+\..*\s+(.*)\s+.*\s+((?:\s|.)*?)\[PubMed - indexed for MEDLINE\]"
First group would be title. Second would be abstract.
