id no, no2, list
id1 (3, 5, [t[0][66], y[5][626]])
id2 (3, 5, [t[0][66], y[5][626], z[5][626]])
id2 (3, 5, [t[0][66], y[5][626]])
id3 (32, 54, [t[0][66], y[5][626]])
id4 (3, 541, [t[0][66], y[5][626], u[5][626], y[25][6226]])
id5 (3, 52, [t[0][66], y[5][626]])
id6 (23, 5, [t[0][66], y[5][626]])
How would I go about parsing such text? I tried creating an object from it without much success. List can vary in size. Java code would be great, but any language or pseudo code, or regular language is fine.
Not your language, but here is a version in Python:
import re

def regex(pattern, text):
    # Split text on pattern and drop the empty pieces
    return [s for s in re.split(pattern, text) if s]

def parse(fname):
    with open(fname) as f:
        data = f.read().splitlines()
    header = regex('[, ]+', data[0])
    print(header)
    for line in data[1:]:
        fields = [regex('[(),]+', field)[0]  # Remove ) ( ,
                  for field in line.split()]
        fields[3] = fields[3][1:]     # Remove [
        fields[-1] = fields[-1][:-1]  # Remove ]
        print(fields[0], fields[1], fields[2], fields[3:])

parse("file")
Output ('file' contains your text):
$ python parse.py
['id', 'no', 'no2', 'list']
id1 3 5 ['t[0][66]', 'y[5][626]']
id2 3 5 ['t[0][66]', 'y[5][626]', 'z[5][626]']
id2 3 5 ['t[0][66]', 'y[5][626]']
id3 32 54 ['t[0][66]', 'y[5][626]']
id4 3 541 ['t[0][66]', 'y[5][626]', 'u[5][626]', 'y[25][6226]']
id5 3 52 ['t[0][66]', 'y[5][626]']
id6 23 5 ['t[0][66]', 'y[5][626]']
I've tried to make a regex to extract the data, but I have no time to finish it.
Here's what I have so far: "id(\\d) \\((\\d*), (\\d*),\\s*\\,*\\[(\\,*\\s*(\\D)\\[(\\d*)\\]\\[(\\d*)\\])*.*\\]\\)"
Use an online tester to refine it.
The 1st group is the id number, the 2nd group is no, the 3rd group is no2, and the list items should follow in the later groups.
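One caveat with a single pattern: in most regex engines, including java.util.regex and Python's re, a capturing group that sits inside a repeated group keeps only its last match, so the list items cannot all be read from one match. A minimal two-step sketch in Python, assuming every record sits on one line in the format shown in the question. The names Record, LINE_RE, ITEM_RE and parse_line are made up for illustration:

```python
import re
from typing import NamedTuple

class Record(NamedTuple):
    id: str
    no: int
    no2: int
    items: list  # list of (name, first_index, second_index) tuples

# Outer pattern: the id, the two numbers, and the raw text between the outer [ ].
LINE_RE = re.compile(r"^(\w+)\s*\(\s*(\d+)\s*,\s*(\d+)\s*,\s*\[(.*)\]\s*\)\s*$")
# Inner pattern: one list item such as y[5][626].
ITEM_RE = re.compile(r"(\w+)\[(\d+)\]\[(\d+)\]")

def parse_line(line):
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("unrecognised line: %r" % line)
    # findall returns every item, not just the last one.
    items = [(name, int(i), int(j)) for name, i, j in ITEM_RE.findall(m.group(4))]
    return Record(m.group(1), int(m.group(2)), int(m.group(3)), items)

rec = parse_line("id4 (3, 541, [t[0][66], y[5][626], u[5][626], y[25][6226]])")
print(rec.id, rec.no, rec.no2, rec.items)
```

The same split into an outer match plus a repeated inner search should carry over to Java with Pattern and Matcher.find().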
There is really no reason to write a parser by hand, as there are multiple parser generators available, JavaCC being the most popular. A skeleton process is:
Define the language using BNF.
Translate the BNF into the input language the parser generator understands, making sure the grammar is left recursive or right recursive as appropriate. JavaCC requires right recursion.
Invoke the parser generator to create the parser classes.
Augment the generated source code by inserting or refining code in the generator source.
There are many examples
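To make the first step concrete, here is one possible grammar for the input in the question, followed by a small hand-written recursive-descent parser in Python that mirrors it rule by rule. This is only a sketch of the kind of code a generator would produce, not JavaCC output. The grammar and all names (tokenize, Parser, item_list) are assumptions for illustration. The list rule uses repetition, so no left recursion is needed:

```python
import re

# Grammar (EBNF), assumed from the sample input:
#   record ::= IDENT "(" NUMBER "," NUMBER "," list ")"
#   list   ::= "[" [ item { "," item } ] "]"
#   item   ::= IDENT "[" NUMBER "]" "[" NUMBER "]"

TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|([()\[\],]))")

def tokenize(text):
    """Turn one line into a list of (kind, value) tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character at position %d" % pos)
        number, ident, punct = m.groups()
        if number is not None:
            tokens.append(("NUMBER", int(number)))
        elif ident is not None:
            tokens.append(("IDENT", ident))
        else:
            tokens.append((punct, punct))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, kind):
        """Consume the next token, which must be of the given kind."""
        if self.peek() != kind:
            raise SyntaxError("expected %s at token %d" % (kind, self.pos))
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def record(self):
        ident = self.expect("IDENT")
        self.expect("(")
        no = self.expect("NUMBER")
        self.expect(",")
        no2 = self.expect("NUMBER")
        self.expect(",")
        items = self.item_list()
        self.expect(")")
        if self.peek() is not None:
            raise SyntaxError("trailing input after record")
        return (ident, no, no2, items)

    def item_list(self):
        self.expect("[")
        items = []
        if self.peek() != "]":
            items.append(self.item())
            while self.peek() == ",":
                self.expect(",")
                items.append(self.item())
        self.expect("]")
        return items

    def item(self):
        name = self.expect("IDENT")
        self.expect("[")
        first = self.expect("NUMBER")
        self.expect("]")
        self.expect("[")
        second = self.expect("NUMBER")
        self.expect("]")
        return (name, first, second)

print(Parser(tokenize("id2 (3, 5, [t[0][66], y[5][626], z[5][626]])")).record())
```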
Hello all. Based on the title, someone may say this question has already been answered, but my point is to compare the performance of reduceByKey and groupByKey specifically on the Dataset and RDD APIs. I have seen in many posts that reduceByKey is more efficient than groupByKey, and of course I agree with this. Nevertheless, I am a little confused and can't figure out how these methods behave when we use a Dataset or an RDD. Which one should be used in each case?
I will try to be more specific, so I will describe my problem and my solution, along with the working code, and I would welcome any improvements you can suggest.
+---+------------------+-----+
|id |Text1 |Text2|
+---+------------------+-----+
|1 |one,two,three |one |
|2 |four,one,five |six |
|3 |seven,nine,one,two|eight|
|4 |two,three,five |five |
|5 |six,five,one |seven|
+---+------------------+-----+
The point here is to check whether the value in the third column is contained in EACH row of the second column, and then to collect the IDs of all matching rows. For example, the word «one» from the third column appears in the second-column sentences with IDs 1, 5, 2, 3.
+-----+------------+
|Text2|Set |
+-----+------------+
|seven|[3] |
|one |[1, 5, 2, 3]|
|six |[5] |
|five |[5, 2, 4] |
+-----+------------+
Here is my working code
List<Row> data = Arrays.asList(
        RowFactory.create(1, "one,two,three", "one"),
        RowFactory.create(2, "four,one,five", "six"),
        RowFactory.create(3, "seven,nine,one,two", "eight"),
        RowFactory.create(4, "two,three,five", "five"),
        RowFactory.create(5, "six,five,one", "seven")
);
StructType schema = new StructType(new StructField[]{
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("Text1", DataTypes.StringType, false, Metadata.empty()),
        new StructField("Text2", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> df = spark.createDataFrame(data, schema);
df.show(false);
Dataset<Row> df1 = df.select("id", "Text1")
        .crossJoin(df.select("Text2"))
        .filter(col("Text1").contains(col("Text2")))
        .orderBy(col("Text2"));
df1.show(false);
Dataset<Row> df2 = df1
        .groupBy("Text2")
        .agg(collect_set(col("id")).as("Set"));
df2.show(false);
My question has three parts:
In order to improve performance, do I need to convert the Dataset to an RDD and use reduceByKey instead of the Dataset groupBy?
Which one should I use and why: Dataset or RDD?
I would be grateful if you could suggest a more efficient alternative to my approach, if one exists.
TL;DR Both are bad, but if you're using Dataset stay with Dataset.
Dataset.groupBy behaves like reduceByKey if used with a suitable function. Unfortunately, collect_set behaves pretty much like groupByKey if the number of duplicates is low. Rewriting it with reduceByKey won't change a thing.
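A toy model in plain Python (not Spark code, and a deliberate simplification of what Spark actually does) of why that is. Pre-aggregating inside each partition only helps when the partial result is smaller than the input. A count collapses to one number per key, while a set of mostly distinct ids stays as large as the raw rows. The partitions and the helper shuffled_values are hypothetical:

```python
def shuffled_values(partitions, zero, combine, size):
    """Pre-aggregate each partition by key, then count the values that would be shuffled."""
    total = 0
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = combine(local.get(key, zero()), value)
        total += sum(size(acc) for acc in local.values())
    return total

# Two hypothetical partitions of (word, id) pairs with no duplicate ids per word.
partitions = [
    [("one", 1), ("one", 2), ("five", 2)],
    [("one", 3), ("one", 5), ("five", 4), ("five", 5)],
]
raw = sum(len(p) for p in partitions)

# Count-like aggregate: each partition sends a single number per key.
as_count = shuffled_values(partitions, lambda: 0, lambda acc, _v: acc + 1, lambda acc: 1)
# Set-like aggregate: each partition still sends every distinct id it saw.
as_set = shuffled_values(partitions, set, lambda acc, v: acc | {v}, len)

print(raw, as_count, as_set)
```

In this model the set-valued aggregate moves as many values as the raw rows do, which is the groupByKey-like behaviour described above.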
I would be grateful if you could suggest a more efficient alternative to my approach, if one exists.
The best you can do is remove the crossJoin:
val df = Seq((1, "one,two,three", "one"),
  (2, "four,one,five", "six"),
  (3, "seven,nine,one,two", "eight"),
  (4, "two,three,five", "five"),
  (5, "six,five,one", "seven")).toDF("id", "text1", "text2")
df.select(col("id"), explode(split(col("Text1"), ",")).alias("w"))
  .join(df.select(col("Text2").alias("w")), Seq("w"))
  .groupBy("w")
  .agg(collect_set(col("id")).as("Set")).show
+-----+------------+
| w| Set|
+-----+------------+
|seven| [3]|
| one|[1, 5, 2, 3]|
| six| [5]|
| five| [5, 2, 4]|
+-----+------------+
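For readers who want to check the logic without a Spark session, the same explode, join and collect steps can be sketched in plain Python. This is an illustration of the data flow only, not a replacement for the Spark code. The ids are sorted here for stable display, whereas the order in the Spark output above is different. Note that this matches whole comma-separated words, which is not the same as the substring test that contains performs:

```python
from collections import defaultdict

rows = [
    (1, "one,two,three", "one"),
    (2, "four,one,five", "six"),
    (3, "seven,nine,one,two", "eight"),
    (4, "two,three,five", "five"),
    (5, "six,five,one", "seven"),
]

# "explode": map every word of Text1 to the ids of the rows that contain it.
ids_by_word = defaultdict(set)
for row_id, text1, _text2 in rows:
    for word in text1.split(","):
        ids_by_word[word].add(row_id)

# "join" + "groupBy": keep only the words that also occur in Text2.
result = {}
for _row_id, _text1, text2 in rows:
    if text2 in ids_by_word:
        result[text2] = sorted(ids_by_word[text2])

print(result)
```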
I have two CSV files.
The first is Employee.csv, with the schema
EmpId Fname
1 John
2 Jack
3 Ram
and the second CSV file is
Leave.csv
EmpId LeaveType Designation
1 Sick SE
1 Casual SE
2 Sick SE
3 Privilege M
1 Casual SE
2 Privilege SE
Now I want the data in JSON, like this:
EmpID-1
Sick : 2
Casual : 2
Privilege : 0
using Spark in Java.
Group by the 'LeaveType' column and count the rows in each group:
import org.apache.spark.sql.functions.{col, count}
val leaves = ??? // Load leaves
leaves.groupBy(col("LeaveType")).agg(count(col("LeaveType")).as("total_leaves")).show()
I'm not familiar with Java syntax, but if you do not want to use the DataFrame API, you may do something like this in Scala:
val rdd = sc.textFile("/path/to/leave.csv").map(_.split(",")).map(x => ((x(0), x(1), x(2)), 1)).reduceByKey(_ + _)
Now you need to use an external library such as GSON to transform each element of this RDD into the desired JSON format. Each element of this RDD is a pair whose key is the tuple (EmpId, LeaveType, Designation) and whose value is the count of leaves.
Let me know if this helped, Cheers.
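To show the shape of the final step, here is a small sketch in plain Python (not Spark, and not GSON) that counts leave types per employee and fills in zeros for the types an employee never took. It assumes Leave.csv is comma-separated with a header line. The function name leave_counts and the "EmpID-" key format are illustrative choices:

```python
import csv
import io
import json
from collections import Counter, defaultdict

# Sample rows from Leave.csv, assumed to be comma-separated with a header line.
LEAVE_CSV = """EmpId,LeaveType,Designation
1,Sick,SE
1,Casual,SE
2,Sick,SE
3,Privilege,M
1,Casual,SE
2,Privilege,SE
"""

def leave_counts(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    leave_types = sorted({row["LeaveType"] for row in rows})
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["EmpId"]][row["LeaveType"]] += 1
    # A Counter returns 0 for missing keys, so every employee lists every leave type.
    return {
        "EmpID-" + emp_id: {lt: per_emp[lt] for lt in leave_types}
        for emp_id, per_emp in counts.items()
    }

print(json.dumps(leave_counts(LEAVE_CSV), indent=2))
```

In Spark the zero-filling would need the full set of leave types as well, since a plain group-and-count only produces the combinations that actually occur.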
I posted earlier today about an error I was getting with using the predict function. I was able to get that corrected, and thought I was on the right path.
I have a number of observations (actuals) and I have a few data points that I want to extrapolate or predict. I used lm to create a model, then I tried to use predict with the actual value that will serve as the predictor input.
This code is all repeated from my previous post, but here it is:
df <- read.table(text = '
Quarter Coupon Total
1 "Dec 06" 25027.072 132450574
2 "Dec 07" 76386.820 194154767
3 "Dec 08" 79622.147 221571135
4 "Dec 09" 74114.416 205880072
5 "Dec 10" 70993.058 188666980
6 "Jun 06" 12048.162 139137919
7 "Jun 07" 46889.369 165276325
8 "Jun 08" 84732.537 207074374
9 "Jun 09" 83240.084 221945162
10 "Jun 10" 81970.143 236954249
11 "Mar 06" 3451.248 116811392
12 "Mar 07" 34201.197 155190418
13 "Mar 08" 73232.900 212492488
14 "Mar 09" 70644.948 203663201
15 "Mar 10" 72314.945 203427892
16 "Mar 11" 88708.663 214061240
17 "Sep 06" 15027.252 121285335
18 "Sep 07" 60228.793 195428991
19 "Sep 08" 85507.062 257651399
20 "Sep 09" 77763.365 215048147
21 "Sep 10" 62259.691 168862119', header=TRUE)
str(df)
'data.frame': 21 obs. of 3 variables:
$ Quarter : Factor w/ 24 levels "Dec 06","Dec 07",..: 1 2 3 4 5 7 8 9 10 11 ...
$ Coupon: num 25027 76387 79622 74114 70993 ...
$ Total: num 132450574 194154767 221571135 205880072 188666980 ...
Code:
model <- lm(df$Total ~ df$Coupon, data=df)
> model
Call:
lm(formula = df$Total ~ df$Coupon)
Coefficients:
(Intercept) df$Coupon
107286259 1349
Predict code (based on previous help):
(These are the predictor values I want to use to get the predicted value)
Quarter = c("Jun 11", "Sep 11", "Dec 11")
Total = c(79037022, 83100656, 104299800)
Coupon = data.frame(Quarter, Total)
Coupon$estimate <- predict(model, newdate = Coupon$Total)
Now, when I run that, I get this error message:
Error in `$<-.data.frame`(`*tmp*`, "estimate", value = c(60980.3823396919, :
replacement has 21 rows, data has 3
My original data frame that I used to build the model had 21 observations in it. I am now trying to predict 3 values based on the model.
I either don't truly understand this function, or have an error in my code.
Help would be appreciated.
Thanks
First, you want to use
model <- lm(Total ~ Coupon, data=df)
not model <-lm(df$Total ~ df$Coupon, data=df).
Second, by saying lm(Total ~ Coupon), you are fitting a model that uses Total as the response variable, with Coupon as the predictor. That is, your model is of the form Total = a + b*Coupon, with a and b the coefficients to be estimated. Note that the response goes on the left side of the ~, and the predictor(s) on the right.
Because of this, when you ask R to give you predicted values for the model, you have to provide a set of new predictor values, i.e. new values of Coupon, not Total.
Third, judging by your specification of newdata, it looks like you're actually after a model to fit Coupon as a function of Total, not the other way around. To do this:
model <- lm(Coupon ~ Total, data=df)
new.df <- data.frame(Total=c(79037022, 83100656, 104299800))
predict(model, new.df)
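For intuition about what the two R calls above compute, here is a plain-Python sketch of simple least squares, using the data from the question. It is not how lm is implemented. The helper fit_line and the variable names are made up for illustration. The point is that the fitted line is a function of the predictor only, so prediction needs new values of that same variable:

```python
# Data from the question, in row order.
coupon = [25027.072, 76386.820, 79622.147, 74114.416, 70993.058, 12048.162,
          46889.369, 84732.537, 83240.084, 81970.143, 3451.248, 34201.197,
          73232.900, 70644.948, 72314.945, 88708.663, 15027.252, 60228.793,
          85507.062, 77763.365, 62259.691]
total = [132450574, 194154767, 221571135, 205880072, 188666980, 139137919,
         165276325, 207074374, 221945162, 236954249, 116811392, 155190418,
         212492488, 203663201, 203427892, 214061240, 121285335, 195428991,
         257651399, 215048147, 168862119]

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Coupon ~ Total: Coupon is the response, Total is the predictor.
a, b = fit_line(total, coupon)

# Prediction needs new values of the predictor (Total), one estimate per value.
new_total = [79037022, 83100656, 104299800]
estimates = [a + b * t for t in new_total]
print(estimates)
```

Three new predictor values give three estimates, which is why a result with 21 rows signals that the new data was not picked up at all.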
Thanks Hong, that was exactly the problem I was running into. The error you get suggests that the number of rows is wrong, but the problem is actually that the model has been trained using a command that ends up with the wrong names for parameters.
This is really a critical detail that is entirely non-obvious for lm and so on. Some tutorials use lines like lm(olive$Area ~ olive$Palmitic), ending up with variable names of olive$Area, NOT Area, so an entry created with anewdata <- data.frame(Palmitic=2) can't then be used. If you use lm(Area ~ Palmitic, data=olive), then the variable names are right and prediction works.
The real problem is that the error message does not indicate the problem at all:
Warning message: 'anewdata' had 1 rows but variable(s) found to have X rows
You are using newdate instead of newdata in your predict call; check that first. Then just use Coupon$estimate <- predict(model, Coupon)
It will work.
To avoid this error, the important point about the new dataset is the name of the independent variable: it must be the same as the one reported in the model. Another way is to nest the two functions without creating a new dataset:
model <- lm(Coupon ~ Total, data=df)
predict(model, data.frame(Total=c(79037022, 83100656, 104299800)))
Pay attention to how the model is specified. The next two commands look similar, but with the predict function the first works and the second does not:
model <- lm(Coupon ~ Total, data=df) # OK
model <- lm(df$Coupon ~ df$Total) # not OK
Is there a more effective, simplified means of converting a "formatted" Java UUID - without dashes - to a Java compatible format - with dashes - in PHP, and ultimately: How would I do it?
I have code that already performs this action, but it seems unprofessional and I feel that it could probably be done more effectively.
[... PHP code ...]
$uuido = $json['id'];
$uuidx = array();
$uuidx[0] = substr( $uuido, 0, 8 );
$uuidx[1] = substr( $uuido, 8, 4 );
$uuidx[2] = substr( $uuido, 12, 4);
$uuidx[3] = substr( $uuido, 16, 4);
$uuidx[4] = substr( $uuido, 20, 12);
$uuid = implode( "-", $uuidx );
[... PHP code ...]
Input: f9e113324bd449809b98b0925eac3141
Output: f9e11332-4bd4-4980-9b98-b0925eac3141
The data from $json['id'] is called from the following Mojang Profile API using the file_get_contents( $url ) function combined with a json_decode( $file ), which could alternatively be done through cURL - but since it would eventually be requesting anything up to 2048 profiles at once, I figured it would become slow.
I do have my code in use, and public, through the following ProjectRogue Server Ping API, which usually contains a list of online players.
Note: There are several questions related to this, but none apply to PHP as far as I am aware. I have looked.
I mention Java UUID as the parsed output should effectively translate to a Player UUID for use in a Java-Based plugin for Spigot or Craftbukkit after 1.7.X.
Your question doesn't make much sense but assuming you want to read the UUID in the correct format in Java you can do something like this:
import java.util.UUID;

class A
{
    public static void main(String[] args) {
        String input = "f9e113324bd449809b98b0925eac3141";
        String uuid_parse = input.replaceAll(
            "(\\w{8})(\\w{4})(\\w{4})(\\w{4})(\\w{12})",
            "$1-$2-$3-$4-$5");
        UUID uuid = UUID.fromString(uuid_parse);
        System.out.println(uuid);
    }
}
Borrowed from maerics, see here: https://stackoverflow.com/a/18987428/4195825
Or in PHP you can do something like:
<?php
$UUID = "f9e113324bd449809b98b0925eac3141";
$UUID = substr($UUID, 0, 8) . '-' . substr($UUID, 8, 4) . '-' . substr($UUID, 12, 4) . '-' . substr($UUID, 16, 4) . '-' . substr($UUID, 20);
echo $UUID;
?>
Borrowed from fico7489: https://stackoverflow.com/a/33484855/4195825
And then you can send that to Java, where you can create a UUID object using UUID.fromString().
UUID is not a special Java format. It is the Universally Unique Identifier.
A universally unique identifier (UUID) is an identifier standard used in software construction. A UUID is simply a 128-bit value.
What happens is that the conversion from the 128-bit value to a human-readable string generally follows some conventions.
Basically, the number is converted to the more human-readable hexadecimal format, with some hyphens to separate blocks of bits.
From a performance perspective, you can call the substr_replace function several times instead of creating an array of strings with substr and applying implode to it.
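The convention in question is 32 hexadecimal digits split into groups of 8, 4, 4, 4 and 12. A short Python sketch (Python rather than PHP, purely to illustrate the grouping) using the input from the question:

```python
import uuid

raw = "f9e113324bd449809b98b0925eac3141"

# uuid.UUID accepts the 32-digit hex form and prints the canonical 8-4-4-4-12 form.
dashed = str(uuid.UUID(hex=raw))
print(dashed)  # f9e11332-4bd4-4980-9b98-b0925eac3141

# The same grouping done by hand with slices, mirroring the substr approach.
parts = (raw[0:8], raw[8:12], raw[12:16], raw[16:20], raw[20:32])
assert "-".join(parts) == dashed
```

Either way the 128-bit value is unchanged; only its textual form differs.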
Hi all,
I need your advice for a project. As you can see in the table, there are client_id, user_id and date columns. This is a log book that keeps data for each user belonging to some company. At the end of the month I need statistics about users and their usage of the system, so it will look like
User 7 from Client 1 was enabled for 5 days last month
User 25 from Client 1 was enabled for 3 days last month
User 8 from Client 5 was enabled for 4 days last month ..... etc
Currently the easiest method I have found is something like
def logs = LogBook.createCriteria()
def result = logs.list {
    projections {
        groupProperty("user")
        count("user")
        property("client", "client")
    }
}
which returns something like
[User : 7, 4, Client: 1]
[User : 8, 3, Client: 5]
[User : 10, 3, Client: 15]
[User : 11, 3, Client: 16]
[User : 25, 3, Client: 1]
[0] is the User object, [1] is the count (count(user_id)) and [2] is the Client object. Do you have any idea how to make this simpler or more solid? Or is it safe? Thanks for your advice.
How are you looking to display the data? I would write a query for exactly what you want, make a class that grabs it with a ResultSet, and write the set into a JTable.
Depending on the size of the database table, you could return everything with * and then do the parsing in Java.
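If the rows are pulled back and aggregated in application code as suggested, the grouping could look like the following sketch. It is written in Python for brevity, and the sample rows and the function days_enabled are hypothetical. It assumes one row per log entry, groups on both client and user, and counts distinct days within one month:

```python
from collections import defaultdict
from datetime import date

# Hypothetical log rows: (client_id, user_id, day).
log_book = [
    (1, 7, date(2024, 5, 1)),
    (1, 7, date(2024, 5, 2)),
    (1, 7, date(2024, 5, 2)),   # same day logged twice
    (1, 25, date(2024, 5, 3)),
    (5, 8, date(2024, 5, 4)),
    (5, 8, date(2024, 6, 1)),   # outside the reporting month
]

def days_enabled(rows, year, month):
    """Count distinct days per (client, user) within one month."""
    days = defaultdict(set)
    for client_id, user_id, day in rows:
        if day.year == year and day.month == month:
            days[(client_id, user_id)].add(day)
    return {key: len(value) for key, value in days.items()}

for (client_id, user_id), n in sorted(days_enabled(log_book, 2024, 5).items()):
    print("User %d from Client %d was enabled for %d days last month" % (user_id, client_id, n))
```

Counting distinct days rather than rows matters only if a user can be logged more than once per day, which the question does not say.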