I posted earlier today about an error I was getting when using the predict function. I was able to get that corrected and thought I was on the right path.
I have a number of observations (actuals) and I have a few data points that I want to extrapolate or predict. I used lm to create a model, then I tried to use predict with the actual value that will serve as the predictor input.
This code is all repeated from my previous post, but here it is:
df <- read.table(text = '
Quarter Coupon Total
1 "Dec 06" 25027.072 132450574
2 "Dec 07" 76386.820 194154767
3 "Dec 08" 79622.147 221571135
4 "Dec 09" 74114.416 205880072
5 "Dec 10" 70993.058 188666980
6 "Jun 06" 12048.162 139137919
7 "Jun 07" 46889.369 165276325
8 "Jun 08" 84732.537 207074374
9 "Jun 09" 83240.084 221945162
10 "Jun 10" 81970.143 236954249
11 "Mar 06" 3451.248 116811392
12 "Mar 07" 34201.197 155190418
13 "Mar 08" 73232.900 212492488
14 "Mar 09" 70644.948 203663201
15 "Mar 10" 72314.945 203427892
16 "Mar 11" 88708.663 214061240
17 "Sep 06" 15027.252 121285335
18 "Sep 07" 60228.793 195428991
19 "Sep 08" 85507.062 257651399
20 "Sep 09" 77763.365 215048147
21 "Sep 10" 62259.691 168862119', header=TRUE)
str(df)
'data.frame': 21 obs. of 3 variables:
$ Quarter : Factor w/ 24 levels "Dec 06","Dec 07",..: 1 2 3 4 5 7 8 9 10 11 ...
$ Coupon: num 25027 76387 79622 74114 70993 ...
$ Total: num 132450574 194154767 221571135 205880072 188666980 ...
Code:
model <- lm(df$Total ~ df$Coupon, data=df)
> model
Call:
lm(formula = df$Total ~ df$Coupon)
Coefficients:
(Intercept) df$Coupon
107286259 1349
Predict code (based on previous help):
(These are the predictor values I want to use to get the predicted value)
Quarter = c("Jun 11", "Sep 11", "Dec 11")
Total = c(79037022, 83100656, 104299800)
Coupon = data.frame(Quarter, Total)
Coupon$estimate <- predict(model, newdate = Coupon$Total)
Now, when I run that, I get this error message:
Error in `$<-.data.frame`(`*tmp*`, "estimate", value = c(60980.3823396919, :
replacement has 21 rows, data has 3
My original data frame that I used to build the model had 21 observations in it. I am now trying to predict 3 values based on the model.
I either don't truly understand this function, or have an error in my code.
Help would be appreciated.
Thanks
First, you want to use
model <- lm(Total ~ Coupon, data=df)
not model <- lm(df$Total ~ df$Coupon, data=df).
Second, by saying lm(Total ~ Coupon), you are fitting a model that uses Total as the response variable, with Coupon as the predictor. That is, your model is of the form Total = a + b*Coupon, with a and b the coefficients to be estimated. Note that the response goes on the left side of the ~, and the predictor(s) on the right.
Because of this, when you ask R to give you predicted values for the model, you have to provide a set of new predictor values, i.e. new values of Coupon, not Total.
Third, judging by your specification of newdata, it looks like you're actually after a model to fit Coupon as a function of Total, not the other way around. To do this:
model <- lm(Coupon ~ Total, data=df)
new.df <- data.frame(Total=c(79037022, 83100656, 104299800))
predict(model, new.df)
Thanks Hong, that was exactly the problem I was running into. The error you get suggests that the number of rows is wrong, but the problem is actually that the model has been trained using a command that ends up with the wrong names for parameters.
This is really a critical detail that is entirely non-obvious for lm and friends. Some of the tutorials refer to calls like lm(olive$Area ~ olive$Palmitic), which ends up with a variable name of olive$Area rather than Area, so a new data frame created with anewdata <- data.frame(Palmitic=2) can't then be used. If you use lm(Area ~ Palmitic, data=olive), the variable names are right and prediction works.
The real problem is that the error message does not indicate the problem at all:
Warning message: 'anewdata' had 1 rows but variable(s) found to have X rows
Instead of newdata, you are using newdate in your predict code; check that. Then just use Coupon$estimate <- predict(model, Coupon) and it will work.
To avoid the error, an important point about the new dataset is the name of the independent variable: it must be the same as the one used in the model. Another way is to nest the two functions without creating a new dataset:
model <- lm(Coupon ~ Total, data=df)
predict(model, data.frame(Total=c(79037022, 83100656, 104299800)))
Pay attention to how the model is specified. The next two commands look similar, but for the predict function the first works and the second doesn't:
model <- lm(Coupon ~ Total, data=df) #Ok
model <- lm(df$Coupon ~ df$Total) #Ko
I want to take UK from the first row and replace it in the entire COUNTRY column without changing the values in ZONE. I have tried a regex expression from the expression builder but failed.
COUNTRY  ZONE
UK       12
AU       44
FR       21
GER      20
FR       02
Your job design will look like this:
Second, using a tSampleRow you will get the range of lines (in your case you want the first line).
Third, store your wanted line in a global variable, like this:
Finally, in the tMap just read your global variable back, as sketched below:
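Since the screenshots are not visible here, a rough sketch of what those two steps usually look like in Talend code (the key name "country" and the column name COUNTRY are only illustrative):

// e.g. in a tJavaRow (or tSetGlobalVar): store the first row's country value
globalMap.put("country", input_row.COUNTRY);

// e.g. in the tMap expression for the output COUNTRY column: read it back
(String) globalMap.get("country")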
Here is the output (I have 201 lines, so I will have 201 UK rows printed):
.--------.
|tLogRow_1|
|=------=|
|mystring|
|=------=|
|UK|
|UK |
|UK |
|UK |
|UK |
'--------'
[statistics] disconnected
Job operation ended at 14:00 21/02/2022. [exit code = 0]
Query:
Filter the data below to find the last date of each month in the list. Note that, in this context, the last date of a month in the data may or may not match the last date of the calendar month.
The expected output is shown in the second list.
Research:
I believe TemporalAdjusters.lastDayOfMonth() will not help in this case, as the last date in the list may or may not match the calendar month's last date.
I checked several questions on Stack Overflow and searched Google as well, but I was unable to find anything similar to my need.
I hope the issue is clear; please point me in the direction of how this can be done with streams, as I don't want to use a for loop.
Sample Data:
Date        Model  Start  End
27-11-1995  ABC    241    621
27-11-1995  XYZ    3456   7878
28-11-1995  ABC    242    624
28-11-1995  XYZ    3457   7879
29-11-1995  ABC    243    627
29-11-1995  XYZ    3458   7880
30-11-1995  ABC    244    630
30-11-1995  XYZ    3459   7881
01-12-1995  ABC    245    633
01-12-1995  XYZ    3460   7882
04-12-1995  ABC    246    636
04-12-1995  XYZ    3461   7883
27-12-1995  ABC    247    639
27-12-1995  XYZ    3462   7884
28-12-1995  ABC    248    642
28-12-1995  XYZ    3463   7885
29-12-1995  ABC    249    645
29-12-1995  XYZ    3464   7886
01-01-1996  ABC    250    648
01-01-1996  XYZ    3465   7887
02-01-1996  ABC    251    651
02-01-1996  XYZ    3466   7888
29-01-1996  ABC    252    654
29-01-1996  XYZ    3467   7889
30-01-1996  ABC    253    657
30-01-1996  XYZ    3468   7890
31-01-1996  ABC    254    660
31-01-1996  XYZ    3469   7891
Output required:
Date        Model  Start  End
30-11-1995  ABC    244    630
30-11-1995  XYZ    3459   7881
29-12-1995  ABC    249    645
29-12-1995  XYZ    3464   7886
31-01-1996  ABC    254    660
31-01-1996  XYZ    3469   7891
Well, a combination of groupingBy and maxBy will probably do.
I assume each record of the table to be of type Event:
record Event(LocalDate date, String model, int start, int end) { }
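(For illustration only: the events list used below could be filled from a few of the sample rows like this, with the dates parsed using the dd-MM-yyyy pattern from the question.)

DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd-MM-yyyy");
List<Event> events = List.of(
        new Event(LocalDate.parse("30-11-1995", fmt), "ABC", 244, 630),
        new Event(LocalDate.parse("30-11-1995", fmt), "XYZ", 3459, 7881),
        new Event(LocalDate.parse("29-12-1995", fmt), "ABC", 249, 645));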
To get the last days of the month which are within the table, we could utilize groupingBy. In order to group this, we could first create a grouping type. Below, I created an EventGrouping record1, with a static method to convert an Event to an EventGrouping. Your desired output suggests that you want to group by each year-month-model combination, so we just picked those two properties:
public record EventGrouping(YearMonth yearMonth, String model) {
public static EventGrouping fromEvent(Event event) {
return new EventGrouping(YearMonth.from(event.date()), event.model());
}
}
Then, we could get our desired result like this:
events.stream()
.collect(Collectors.groupingBy(
EventGrouping::fromEvent,
Collectors.maxBy(Comparator.comparing(Event::date))
));
What happens here is that all stream elements are grouped by our EventGrouping, and then the "maximum value" of each of the event groups is picked. The maximum value is, of course, the most recent date of that certain month.
Note that maxBy returns an Optional, for the case when a group is empty. Also note that the resulting Map is unordered.
We could fix both of these issues by using collectingAndThen and a map factory respectively:
Map<EventGrouping, Event> map = events.stream()
.collect(groupingBy(
EventGrouping::fromEvent,
() -> new TreeMap<>(Comparator.comparing(EventGrouping::yearMonth)
.thenComparing(EventGrouping::model)),
collectingAndThen(maxBy(Comparator.comparing(Event::date)), Optional::get)
));
Note: groupingBy, collectingAndThen and maxBy are all static imports from java.util.stream.Collectors.
We added a Supplier of a TreeMap. A TreeMap is a Map implementation with a predictable order by a given comparator. This allows us to iterate over the resulting entries ordered by year–month–model.
collectingAndThen allows us to apply a function to the result of the given Collector. As already mentioned, maxBy returns an Optional, because maxBy is not applicable if there are no elements in the source stream. However, in our case, this can never happen. So we can safely map the Optional to its contained value.
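For example, iterating over map then prints one entry per year-month-model group, in order (a small usage sketch):

map.forEach((grouping, event) ->
        System.out.println(grouping.yearMonth() + " " + grouping.model() + " -> " + event.date()));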
1 Instead of writing a custom type, you could also use an existing class holding two arbitrary values, such as a Map.Entry, a Pair or even a List<Object>.
I would suggest creating a Map<YearMonth, List<LocalDate>> and parsing all your dates to fill the map. After that, sort each list; the last (or first, depending on sorting order) value in each list will be your desired value.
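A minimal sketch of that idea (the class name and the handful of hard-coded dates are only for illustration; in practice the dates come from the parsed "Date" column):

import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LastDateOfMonth {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd-MM-yyyy");

        List<LocalDate> dates = Stream.of("27-11-1995", "30-11-1995", "29-12-1995", "02-01-1996", "31-01-1996")
                .map(s -> LocalDate.parse(s, fmt))
                .collect(Collectors.toList());

        // Group every date by its year-month.
        Map<YearMonth, List<LocalDate>> byMonth = dates.stream()
                .collect(Collectors.groupingBy(YearMonth::from));

        // Sort each month's dates; the last element is the latest date present for that month.
        Map<YearMonth, LocalDate> lastPerMonth = new TreeMap<>();
        byMonth.forEach((ym, ds) -> {
            List<LocalDate> sorted = new ArrayList<>(ds);
            sorted.sort(Comparator.naturalOrder());
            lastPerMonth.put(ym, sorted.get(sorted.size() - 1));
        });

        lastPerMonth.forEach((ym, d) -> System.out.println(ym + " -> " + d));
    }
}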
A small question regarding prediction/forecast using Spark ML 3.1+, please.
I have a very simple dataset of timestamps for when an event happened. Here is a small portion of the very, very big file:
+----------+-----+
| time|label|
+----------+-----+
|1621900800| 43|
|1619568000| 41|
|1620432000| 41|
|1623974400| 42|
|1620604800| 41|
|1622505600| 42|
truncated
|1624665600| 42|
|1623715200| 41|
|1623024000| 43|
|1623888000| 42|
|1621296000| 42|
|1620691200| 44|
|1620345600| 41|
|1625702400| 44|
+----------+-----+
only showing top 20 rows
The dataset is really just a timestamp representing a day on the left and, on the right, the number of bananas sold that day. Here are the first three rows of the above sample, translated:
+----------------+------------------+
| time           | value            |
+----------------+------------------+
| May 25, 2021   | banana sold 43   |
| April 28, 2021 | banana sold 41   |
| May 8, 2021    | banana sold 41   |
+----------------+------------------+
My goal is just to build a prediction model: how many bananas will be sold tomorrow, the day after, and so on.
Therefore, I tried Linear Regression, but it might not be a good model for this problem:
VectorAssembler vectorAssembler = new VectorAssembler().setInputCols(new String[]{"time", "label"}).setOutputCol("features");
Dataset<Row> vectorData = vectorAssembler.transform(dataSetBanana);
LinearRegression lr = new LinearRegression();
LinearRegressionModel lrModel = lr.fit(vectorData);
System.out.println("Coefficients: " + lrModel.coefficients() + " Intercept: " + lrModel.intercept());
LinearRegressionTrainingSummary trainingSummary = lrModel.summary();
System.out.println("numIterations: " + trainingSummary.totalIterations());
System.out.println("objectiveHistory: " + Vectors.dense(trainingSummary.objectiveHistory()));
trainingSummary.residuals().show();
System.out.println("RMSE: " + trainingSummary.rootMeanSquaredError());
System.out.println("r2: " + trainingSummary.r2());
System.out.println("the magical prediction: " + lrModel.predict(new DenseVector(new double[]{1.0, 1.0})));
I see all the values printed, very happy.
Coefficients: [-1.5625735463489882E-19,1.0000000000000544] Intercept: 2.5338210784074846E-10
numIterations: 0
objectiveHistory: [0.0]
+--------------------+
| residuals|
+--------------------+
|-1.11910480882215...|
RMSE: 3.0933584599870493E-13
r2: 1.0
the magical prediction: 1.0000000002534366
It is not giving me anything close to a prediction; I was expecting something like:
|Some time in the future| banana sold some prediction|
| 1626414043 | 38 |
May I ask what model could produce an answer like "the model predicts X bananas will be sold at time Y in the future"?
A small piece of code with result would be great.
Thank you
Linear regression can be a good start to get familiar with MLlib before you go for more complicated models. First, let's have a look at what you have done so far.
Your VectorAssembler transforms your data frame this way:
before:

time        label
1621900800  43
1620432000  41

after:

time        label  features
1621900800  43     [1621900800;43]
1620432000  41     [1620432000;41]
Now, when you ask LinearRegression to train its model, it will expect your dataset to contain two columns:
one column named features, containing a vector with everything that can be used to predict the label;
one column named label, containing what you want to predict.
Regression will find a and b which minimize the error across all records i, where:
y_i = a * x_i + b + error_i
In your particular setup, you have passed the label to your vector assembler, which is wrong: that's what you want to predict!
Your model has simply learnt that the label perfectly predicts the label:
y = 0.0 * features[0] + 1.0 * features[1]
So you should correct your VectorAssembler:
VectorAssembler vectorAssembler = new VectorAssembler().setInputCols(new String[]{"time"}).setOutputCol("features");
Now, when you made your prediction, you passed this:
lrModel.predict(new DenseVector(new double[]{ 1.0, 1.0 }));  // { timestamp, label }
It returned 1.0, as per the formula above.
Now if you change the VectorAssembler as proposed above, you should call the prediction this way:
lrModel.predict(new DenseVector(new double[]{ timeStampIWantToPredict }));
Side notes:
you can pass a dataset to your predictor; it will return a dataset with a new column containing the prediction (see the sketch after this list).
you should really have a closer look at the MLlib Pipeline documentation.
then you can try to add some new features to your linear regression: seasonality, autoregressive features...
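As a rough sketch of that dataset route (assuming the corrected single-column assembler from above and a hypothetical futureData Dataset<Row> holding the future timestamps in a time column):

// Assemble features for the future timestamps, then let the fitted model append a "prediction" column.
VectorAssembler futureAssembler = new VectorAssembler()
        .setInputCols(new String[]{"time"})
        .setOutputCol("features");

Dataset<Row> predictions = lrModel.transform(futureAssembler.transform(futureData));
predictions.select("time", "prediction").show();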
The model gives you the coefficients of your variables, so it's easy to calculate the output yourself. If you have only one variable x1, your model will be something like:
y = a*x1 + b
The outputs of your model are then a and b, and from those you can calculate y.
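For instance, a small sketch of that manual calculation with the Spark model from the question, assuming it was retrained on the single time feature:

// a is the slope of the single feature, b the intercept; y = a*x + b
double a = lrModel.coefficients().toArray()[0];
double b = lrModel.intercept();
double futureTimestamp = 1626414043.0;   // a future time value, as in the question
double predictedSold = a * futureTimestamp + b;
System.out.println("Predicted bananas sold: " + predictedSold);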
Generally speaking, machine learning libraries also implement methods that calculate the output for you. It's better to look up how to save, load and then evaluate your model with new inputs. Check out https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/ml/regression/LinearRegressionModel.html
There's a method called predict that you can call on your model by giving the input as a Vector instance. I think that will work!
Another thing: you are trying to solve a time-series problem with a single-variable linear regression model. I think you should use an algorithm intended for time-series or sequence problems, such as Long Short-Term Memory (LSTM).
I hope that my answer is useful for you. Keep going ;)
I have to write a program but I have no idea where to start. Can anyone help me with an outline of how I should go about it? Please excuse my novice level of programming. I have provided the input and output of the program.
The trouble I'm facing is how to handle the input text: how should I store it so I can extract the data I need to produce the output commands? Any guidance would be very helpful.
A little explanation of how the output relates to the input:
The output will start with APPLE1: CT= (whatever number is there for CT in line 4)
The following lines of the output will begin with "APPLES:"
I must extract and include the values for CR, PLANTING and RW in the output.
Wherever there is a non-zero or non-null value in the DATA portion, it will appear in the output.
When the program reads END, "APP;APPLER:CT=(whatever number);" will be the last two commands.
INPUT:
<apple:ct=12;
FARM DATA
INPUT DATA
CT CH CR PLANTING RW DATA
12 YES PG -0 FA=1 R=CODE1 MM2 COA COB CI COC COD
0 0 1 0
COE RN COF COG COH
4 00 0
COI COJ D
0
FA=2 R=CODE2 112 COA COB CI COC COD
0 0 0 0
COE RN COF COG COH
4 00 0
COI COJ D
7
END
OUTPUT:
APPLE1:CT=12;
APPLES:CR=PG-0,FA=1,R=CODE1,RW=MM2,COC=1,COE=4;
APPLES:FA=2,R=CODE2,RW=112,COE=4,COI=7;
APP;
APPLER:CT=12;
I'm a total newbie when it comes to Natty and ANTLR. Up to now, Natty has been great and has parsed dates with no problems. Recently we have started to receive a new date and time format which Natty has trouble extracting.
Mon 29 Feb 09:00:00 2016
It cannot extract the year due to it being separated from the rest of the date.
I've been trying to add my own format into DateParser, where it could pick up on this format as it does with any other.
I've made the following changes:
date_time: Added an extra rule called custom_dates which will be the new rule for my format
date_time
: (
(date)=>date (date_time_separator explicit_time)?
| explicit_time (time_date_separator date)?
| custom_dates
) -> ^(DATE_TIME date? explicit_time?)
| relative_time -> ^(DATE_TIME relative_time?)
;
custom_date: My new rule
custom_date
: relaxed_day_of_week WHITE_SPACE relaxed_day_of_month WHITE_SPACE relaxed_month (date_time_separator explicit_time)? relaxed_year
-> ^(EXPLICIT_DATE relaxed_day_of_week relaxed_day_of_month relaxed_month relaxed_year (date_time_separator explicit_time)?)
;
When I try to build Natty with my changes, it just hangs, and never finishes. The output up to that point is:
Decision can match input such as "COMMA WHITE_SPACE INT_00 INT_00" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
warning(200): com\joestelmach\natty\generated\DateParser.g:444:73:
Decision can match input such as "COMMA WHITE_SPACE INT_00 {INT_13..INT_19, INT_20..INT_23}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
warning(200): com\joestelmach\natty\generated\DateParser.g:496:45:
Decision can match input such as "WHITE_SPACE IN {COMMA, WHITE_SPACE}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
warning(200): com\joestelmach\natty\generated\DateParser.g:504:77:
Decision can match input such as "WHITE_SPACE IN {COMMA, WHITE_SPACE}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Am I possibly going the wrong way about this? I've taken a look at the Natty and ANTLR v3 documentation but there isn't much to go on.
Thanks in advance
EDIT:
As requested in the comments below, I've added where the first warning occurs. However, what I've included above is just a small snapshot of the dozens of warnings that were already there before I modified any code with my own rules.
The first warning appears in date_time_separator:
date_time_separator
: WHITE_SPACE (AT WHITE_SPACE)?
| WHITE_SPACE? COMMA WHITE_SPACE? (AT WHITE_SPACE)?
| T
;
One observation I've made is what happens when I change my rule to always include the time:
custom_date
: relaxed_day_of_week WHITE_SPACE relaxed_day_of_month WHITE_SPACE relaxed_month (date_time_separator explicit_time) relaxed_year
-> ^(EXPLICIT_DATE relaxed_day_of_week relaxed_day_of_month relaxed_month relaxed_year (date_time_separator explicit_time)?)
;
When I compile I receive this error:
error(202): com\joestelmach\natty\generated\DateParser.g:831:3: the decision cannot distinguish between alternative(s) 1,2 for input such as "INT_00 INT_00 INT_00 EOF"
Line 831 is where explicit_time resides. I cannot find anything on Stack Overflow or elsewhere about what this error means. I assume it means there is some ambiguity between the two possible routes, but I don't understand why merely adding my code should cause an error.
explicit_time_hours_minutes returns [String hours, String minutes, String ampm]
: hours (COLON | DOT)? minutes ((COLON | DOT)? seconds)? (WHITE_SPACE (meridian_indicator | (MILITARY_HOUR_SUFFIX | HOUR)))?
{$hours=$hours.text; $minutes=$minutes.text; $ampm=$meridian_indicator.text;}
-> hours minutes seconds? meridian_indicator?
| hours (WHITE_SPACE? meridian_indicator)?
{$hours=$hours.text; $ampm=$meridian_indicator.text;}
-> hours ^(MINUTES_OF_HOUR INT["0"]) meridian_indicator?
;