Spark - sample() function duplicating data? - java

I want to randomly select a subset of my data and then limit it to 200 entries. But after using the sample() function, I'm getting duplicate rows, and I don't know why. Let me show you:
DataFrame df = sqlContext.sql("SELECT * " +
    " FROM temptable" +
    " WHERE conditions");
DataFrame df1 = df.select(df.col("col1"))
    .where(df.col("col1").isNotNull())
    .distinct()
    .orderBy(df.col("col1"));
df1.show();
System.out.println(df1.count());
Up until now, everything is OK. I get the output:
+-----------+
|col1 |
+-----------+
| 10016|
| 10022|
| 100281|
| 10032|
| 100427|
| 100445|
| 10049|
| 10070|
| 10076|
| 10079|
| 10081|
| 10082|
| 100884|
| 10092|
| 10099|
| 10102|
| 10103|
| 101039|
| 101134|
| 101187|
+-----------+
only showing top 20 rows
10512
with 10512 records without duplicates. AND THEN!
df1 = df1.sample(true, 0.5).limit(200);
df1.show();
System.out.println(df1.count());
This returns 200 rows full of duplicates:
+-----------+
|col1 |
+-----------+
| 10022|
| 100445|
| 100445|
| 10049|
| 10079|
| 10079|
| 10081|
| 10081|
| 10082|
| 10092|
| 10102|
| 10102|
| 101039|
| 101134|
| 101134|
| 101134|
| 101345|
| 101345|
| 10140|
| 10141|
+-----------+
only showing top 20 rows
200
Can anyone tell me why? This is driving me crazy. Thank you!

You explicitly ask for a sample with replacement (the first argument of sample), so there is nothing unexpected about getting duplicates:
public Dataset<T> sample(boolean withReplacement, double fraction)
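If you want a random subset with no repeated rows, a minimal sketch (assuming the same df1 DataFrame as in the question) is to pass false for withReplacement:
// Sample roughly half the rows without replacement, then cap at 200.
// The fraction is approximate: each row is kept independently with probability 0.5.
DataFrame sampled = df1.sample(false, 0.5).limit(200);
sampled.show();
System.out.println(sampled.count());
There is also an overload sample(withReplacement, fraction, seed) if you need a reproducible sample.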

Related

Count distinct while aggregating others?

This is what my dataset looks like:
+---------+------------+-----------------+
| name |request_type| request_group_id|
+---------+------------+-----------------+
|Michael | X | 1020 |
|Michael | X | 1018 |
|Joe | Y | 1018 |
|Sam | X | 1018 |
|Michael | Y | 1021 |
|Sam | X | 1030 |
|Elizabeth| Y | 1035 |
+---------+------------+-----------------+
I want to count the request_types per person and also count the distinct request_group_ids.
The result should be the following:
+---------+--------------------+---------------------+--------------------------------+
| name |cnt(request_type(X))| cnt(request_type(Y))| cnt(distinct(request_group_id))|
+---------+--------------------+---------------------+--------------------------------+
|Michael | 2 | 1 | 3 |
|Joe | 0 | 1 | 1 |
|Sam | 2 | 0 | 2 |
|John | 1 | 0 | 1 |
|Elizabeth| 0 | 1 | 1 |
+---------+--------------------+---------------------+--------------------------------+
What I've done so far (this derives the first two count columns):
msgDataFrame.select(NAME, REQUEST_TYPE)
.groupBy(NAME)
.pivot(REQUEST_TYPE, Lists.newArrayList(X, Y))
.agg(functions.count(REQUEST_TYPE))
.show();
How can I count distinct request_group_ids within this select? Is it even possible to do there?
I think it's only possible via a join of two datasets (my current result plus a separate aggregation that counts distinct request_group_ids).
Example with "countDistinct" ("countDistinct" is not worked over window, replaced with "size","collect_set"):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val groupIdWindow = Window.partitionBy("name")
df.select($"name", $"request_type",
    size(collect_set("request_group_id").over(groupIdWindow)).alias("countDistinct"))
  .groupBy("name", "countDistinct")
  .pivot($"request_type", Seq("X", "Y"))
  .agg(count("request_type"))
  .show(false)
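If you would rather stay in the Java API, here is a hedged sketch of the two-aggregation join the question mentions (assuming msgDataFrame is a Dataset<Row> and that the constants NAME, REQUEST_TYPE, REQUEST_GROUP_ID, X, and Y name the relevant columns and values):
// Pivoted per-type counts, as in the question.
Dataset<Row> typeCounts = msgDataFrame
    .groupBy(NAME)
    .pivot(REQUEST_TYPE, Lists.newArrayList(X, Y))
    .agg(functions.count(REQUEST_TYPE));

// Separate aggregation of distinct group ids, joined back on name.
Dataset<Row> distinctGroups = msgDataFrame
    .groupBy(NAME)
    .agg(functions.countDistinct(REQUEST_GROUP_ID).alias("distinct_group_ids"));

typeCounts.join(distinctGroups, NAME).show();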

How to perform a query using a field that is a merge of 2 columns?

I'm building up a series of distribution analysis using Java Spark library. This is the actual code I'm using to fetch the data from a JSON file and save the output.
Dataset<Row> dataset = spark.read().json("local/foods.json");
dataset.createOrReplaceTempView("cs_food");
List<GenericAnalyticsEntry> menu_distribution= spark
.sql(" ****REQUESTED QUERY ****")
.toJavaRDD()
.map(row -> Triple.of( row.getString(0), BigDecimal.valueOf(row.getLong(1)), BigDecimal.valueOf(row.getLong(2))))
.map(GenericAnalyticsEntry::of)
.collect();
writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);
The query I'm looking for is based on this structure:
+------------+-------------+------------+------------+
| FIRST_FOOD | SECOND_FOOD | DATE | IS_SPECIAL |
+------------+-------------+------------+------------+
| Pizza | Spaghetti | 11/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Lasagna | Pizza | 12/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Spaghetti | Spaghetti | 13/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Pizza | Spaghetti | 14/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Spaghetti | Lasagna | 15/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Pork | Mozzarella | 16/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Lasagna | Mozzarella | 17/02/2017 | FALSE |
+------------+-------------+------------+------------+
How can I achieve the output below from the code written above?
+------------+--------------------+----------------------+
| FOODS | occurrences(First) | occurrences (Second) |
+------------+--------------------+----------------------+
| Pizza | 2 | 1 |
+------------+--------------------+----------------------+
| Lasagna | 2 | 1 |
+------------+--------------------+----------------------+
| Spaghetti | 2 | 3 |
+------------+--------------------+----------------------+
| Mozzarella | 0 | 2 |
+------------+--------------------+----------------------+
| Pork | 1 | 0 |
+------------+--------------------+----------------------+
I've of course tried to figure out a solution by myself but had no luck. I may be wrong, but I think I need something like this:
"SELECT (first_food + second_food) as menu, COUNT(first_food), COUNT(second_food) from cs_food GROUP BY menu"
From the example data, this looks like it will produce the output you want (menus here stands for the table you query, i.e. your cs_food temp view):
select
  f.food as foods,
  coalesce(first_count, 0) as first_count,
  coalesce(second_count, 0) as second_count
from
  (select first_food as food from menus
   union select second_food from menus) as f
left join (
  select first_food, count(*) as first_count from menus
  group by first_food
) as ff on ff.first_food = f.food
left join (
  select second_food, count(*) as second_count from menus
  group by second_food
) as sf on sf.second_food = f.food
;
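Wired into the question's pipeline, a minimal sketch (assuming the cs_food temp view registered above; the output aliases are just illustrative) could replace the query placeholder like this:
Dataset<Row> distribution = spark.sql(
    "SELECT f.food AS foods, " +
    "       COALESCE(ff.first_count, 0) AS first_count, " +
    "       COALESCE(sf.second_count, 0) AS second_count " +
    "FROM (SELECT first_food AS food FROM cs_food " +
    "      UNION SELECT second_food FROM cs_food) f " +
    "LEFT JOIN (SELECT first_food, COUNT(*) AS first_count FROM cs_food " +
    "           GROUP BY first_food) ff ON ff.first_food = f.food " +
    "LEFT JOIN (SELECT second_food, COUNT(*) AS second_count FROM cs_food " +
    "           GROUP BY second_food) sf ON sf.second_food = f.food");
distribution.show();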
A simple combination of flatMap and groupBy should do the job, something like this (sorry, can't check if it's 100% correct right now):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.sum
import spark.sqlContext.implicits._
val df = Seq(("Pizza", "Pasta"), ("Pizza", "Soup")).toDF("first", "second")
df.flatMap({ case Row(first: String, second: String) => Seq((first, 1, 0), (second, 0, 1)) })
  .groupBy("_1")
  .agg(sum("_2").as("occurrences_first"), sum("_3").as("occurrences_second"))
  .show()

Issue with tracing down the array in Java recursion function

I have an issue with recursion in Java. The question is as such:
Given n pairs of parentheses, write a function to generate all combinations of well-formed parentheses.
For example, given n = 3, a solution set is:
["((()))", "(()())", "(())()", "()(())", "()()()"]
The code for the above problem is recursive and is shown below:
public List<String> generateParenthesis(int n) {
    ArrayList<String> result = new ArrayList<String>();
    dfs(result, "", n, n);
    return result;
}

public void dfs(ArrayList<String> result, String s, int left, int right) {
    if (left > right)
        return;
    if (left == 0 && right == 0) {
        result.add(s);
        return;
    }
    if (left > 0) {
        dfs(result, s + "(", left - 1, right);
    }
    if (right > 0) {
        dfs(result, s + ")", left, right - 1);
    }
}
I have been able to trace the program up to a particular point, but I am unable to trace it completely.
if n=2
left=2;right=2;
result="(())",
__________
| s="" |
| l=2 |
| r=2 |
| |
| |
|________|
|
V
__________
| s=( |
| l 1 |
| r 2 |
| |
| |
|________|
|
V
__________
| s=(( |
| l 0 |
| r 2 |
| |
| |
|________|
|
V
__________
| s=(() |
| l 0 |
| r 1 |
| |
| |
|________|
|
V
__________
| s= (())|
| l=0 |
| r=0 |
| |
| |
|________|
How would the program work after the point I have traced above? Can someone help me trace it? Thanks.
From where you left off:
__________
| s=( |
| l=1 |
| r=2 |
| |
| |
|________|
|
V
__________
| s=() |
| l 1 |
| r 1 |
| |
| |
|________|
|
V
__________
| s=()( |
| l 0 |
| r 1 |
| |
| |
|________|
|
V
__________
| s=()() |
| l 0 |
| r 0 |
| |
| |
|________|
If you're using Eclipse or any other IDE, it should be easy to set a breakpoint and step through how your program runs line by line (showing all your variables and how they change). If you haven't learned debugging yet, I would encourage you to Google it and learn how to debug programs.
What your program is actually doing:
left (l=1, r=2)
left (l=0, r=2)
right (l=0, r=1)
right (l=0, r=0)
add s to result (l=0, r=0)
*here you return from 3 recursive calls, and l, r are back at (l=1, r=2)*
right (l=1, r=1)
left (l=0, r=1)
right (l=0, r=0)
add s to result (l=0, r=0)
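If you want to see this whole trace printed without a debugger, here is a minimal sketch (the extra depth parameter is only for illustration and is not part of the original solution):
public void dfs(ArrayList<String> result, String s, int left, int right, int depth) {
    // Indent by recursion depth so the call tree is visible in the console (String.repeat needs Java 11+).
    String indent = "  ".repeat(depth);
    System.out.println(indent + "dfs(s=\"" + s + "\", left=" + left + ", right=" + right + ")");
    if (left > right)
        return;
    if (left == 0 && right == 0) {
        System.out.println(indent + "-> add \"" + s + "\"");
        result.add(s);
        return;
    }
    if (left > 0) {
        dfs(result, s + "(", left - 1, right, depth + 1);
    }
    if (right > 0) {
        dfs(result, s + ")", left, right - 1, depth + 1);
    }
}
Calling dfs(result, "", 2, 2, 0) prints the same sequence of calls as the boxes above and ends with (()) and ()() being added.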

Parsing a SPARQL result into a JTable

I'm working on an Apache Jena project. I've got a Fuseki server running on my localhost.
I want to create a Java program for my Fuseki server that shows all the data in the triplestore in a JTable. I just have no idea how to parse the result from my query into a JTable.
My code so far (I left out the part where the window, table, frame etc. is created):
private void Go() {
String query = "SELECT ?subject ?predicate ?object \n" +
"WHERE { \n" +
"?subject ?predicate ?object }";
Query sparqlQuery = QueryFactory.create(query, Syntax.syntaxARQ) ;
QueryEngineHTTP httpQuery = new QueryEngineHTTP("http://localhost:3030/AnimalDataSet/", sparqlQuery);
ResultSet results = httpQuery.execSelect();
System.out.println(ResultSetFormatter.asText(results));
while (results.hasNext()) {
QuerySolution solution = results.next();
}
httpQuery.close();
}
The sysout prints this, which is the correct data:
-------------------------------------------------------------------------------------------------------------------------------------
| subject | predicate | object |
=====================================================================================================================================
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq> |
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1> | <urn:animals:lion> |
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2> | <urn:animals:tarantula> |
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3> | <urn:animals:hippopotamus> |
| <urn:animals:lion> | <http://www.some-ficticious-zoo.com/rdf#name> | "Lion" |
| <urn:animals:lion> | <http://www.some-ficticious-zoo.com/rdf#species> | "Panthera leo" |
| <urn:animals:lion> | <http://www.some-ficticious-zoo.com/rdf#class> | "Mammal" |
| <urn:animals:tarantula> | <http://www.some-ficticious-zoo.com/rdf#name> | "Tarantula" |
| <urn:animals:tarantula> | <http://www.some-ficticious-zoo.com/rdf#species> | "Avicularia avicularia" |
| <urn:animals:tarantula> | <http://www.some-ficticious-zoo.com/rdf#class> | "Arachnid" |
| <urn:animals:hippopotamus> | <http://www.some-ficticious-zoo.com/rdf#name> | "Hippopotamus" |
| <urn:animals:hippopotamus> | <http://www.some-ficticious-zoo.com/rdf#species> | "Hippopotamus amphibius" |
| <urn:animals:hippopotamus> | <http://www.some-ficticious-zoo.com/rdf#class> | "Mammal" |
-------------------------------------------------------------------------------------------------------------------------------------
I really hope someone here knows how to parse the data from the query into a JTable :D
Thanks in advance!
I've done some further research and finally found the solution! It's quite easy, actually.
You simply change the while loop like this:
while (results.hasNext())
{
    QuerySolution sol = results.nextSolution();
    RDFNode subject = sol.get("subject");
    RDFNode predicate = sol.get("predicate");
    RDFNode object = sol.get("object");
    DefaultTableModel model = (DefaultTableModel) table.getModel();
    model.addRow(new Object[]{subject, predicate, object});
}
And that works fine for me! (One caveat: the ResultSetFormatter.asText(results) call above consumes the ResultSet, so remove that println, or re-run execSelect(), before looping over the results.)
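If your JTable's model has no columns yet, a small setup sketch (assuming table is the JTable created elsewhere; the column names are just illustrative) would be:
// Create a model with three columns and zero rows, then attach it to the table before the loop.
DefaultTableModel model = new DefaultTableModel(new Object[]{"subject", "predicate", "object"}, 0);
table.setModel(model);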
For everyone who's interested, I've published my version as it is now to pastebin, with comments:
The link to the full (current) version of my project

Java code in Hadoop

I am running a map-only job in Hadoop. The dataset is a set of HTML pages in a single file (returned by a crawler).
The mapper code is written in Java, and I am using JSoup to parse. What I want as my output is a key that contains both the content of the title tag and the content of a meta tag. Ideally I should get 1592 map output records; I am getting 3184.
The concatenation I attempt with this line of code is not happening:
String MN_Job = (jobT + "\t" + jobsDetail);
What I get instead is each of these separately, hence double the number of outputs. What am I doing wrong here?
public class JobsDataMapper extends Mapper<LongWritable, Text, Text, Text> {
private Text keytext = new Text();
private Text valuetext = new Text();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
Document doc = Jsoup.parse(line);
Elements desc = doc.select("head title, meta[name=twitter:description]");
for (Element jobhtml : desc) {
Elements title = jobhtml.select("title");
String jobT = "";
for (Element titlehtml : title) {
jobT = titlehtml.text();
}
Elements meta = jobhtml.select("meta[name=twitter:description]");
String jobsDetail ="";
for (Element metahtml : meta) {
String content = metahtml.attr("content");
String content1 = content.replaceAll("\\p{Punct}+", " ");
jobsDetail = content1.replaceAll(" (?i)a | (?i)able | (?i)about | (?i)across | (?i)after | (?i)all | (?i)almost | (?i)also | (?i)am | (?i)among | (?i)an | (?i)and | (?i)any | (?i)are | (?i)as | (?i)at | (?i)be | (?i)because | (?i)been | (?i)but | (?i)by | (?i)can | (?i)cannot | (?i)could | (?i)dear | (?i)did | (?i)do | (?i)does | (?i)either | (?i)else | (?i)ever | (?i)every | (?i)for | (?i)from | (?i)get | (?i)got | (?i)had | (?i)has | (?i)have | (?i)he | (?i)her | (?i)hers | (?i)him | (?i)his | (?i)how | (?i)however | (?i)i | (?i)if | (?i)in | (?i)into | (?i)is | (?i)it | (?i)its | (?i)just | (?i)least | (?i)let | (?i)like | (?i)likely | (?i)may | (?i)me | (?i)might | (?i)most | (?i)must | (?i)my | (?i)neither | (?i)no | (?i)nor | (?i)not | (?i)nbsp | (?i)of | (?i)off | (?i)often | (?i)on | (?i)only | (?i)or | (?i)other | (?i)our | (?i)own | (?i)rather | (?i)said | (?i)say | (?i)says | (?i)she | (?i)should | (?i)since | (?i)so | (?i)some | (?i)than | (?i)that | (?i)the | (?i)their | (?i)them | (?i)then | (?i)there | (?i)these | (?i)they | (?i)this | (?i)tis | (?i)to | (?i)too | (?i)twas | (?i)us | (?i)wants | (?i)was | (?i)we | (?i)were | (?i)what | (?i)when | (?i)where | (?i)which | (?i)while | (?i)who | (?i)whom | (?i)why | (?i)will | (?i)with | (?i)would | (?i)yet | (?i)you | (?i)your "," ");
}
String IT_Job = (jobT + "\t" + jobsDetail);
keytext.set(IT_Job) ;
valuetext.set("JobDetail");
context.write( keytext, valuetext );
}
}
}
Edit: I know what the problem is. But the thing is that the solution might not be obvious in MapReduce. You might have to write your custom RecordReader. Let me explain the problem.
In your code you read line by line. Then you apply this to the line you read:
Elements desc = doc.select("head title, meta[name=twitter:description]");
But evidently, it might only have a title or a <meta name=twitter:description> tag. So you read one of those and store it, and the other one remains blank. So at any time, only one of your variables, jobT or jobsDetail, has any data. So for the code snippet:
String IT_Job = (jobT + "\t" + jobsDetail);
one time the first one is blank and the next time the other one is blank. So if you are expecting n records, you get 2n records. Similarly, if you attempt to extract three fields, you should get 3n records. You can test this theory by extracting another field and then checking whether you get three times the expected number of records.
If the theory turns out to be correct, you might want to delimit the webpages you extract with a specific delimiter string. Then you can write a custom RecordReader which will read one HTML file at a time according to the delimiter and process the entire HTML file at once. That way you'll get the title and the meta tags together.
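One lighter-weight alternative to a full custom RecordReader, as a hedged sketch: if every page in the crawl file ends with a known marker (I'm assuming the literal string </html> here; adjust to whatever your crawler writes), you can tell TextInputFormat to use that marker as the record delimiter, so each map() call sees one whole page:
// In the driver, before the job is submitted.
// Assumption: each crawled page in the input file ends with "</html>".
Configuration conf = new Configuration();
conf.set("textinputformat.record.delimiter", "</html>");
Job job = Job.getInstance(conf, "jobs data mapper");
The delimiter itself is stripped from each record, but Jsoup's lenient parser should still handle a page missing its closing tag.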
Just by looking at the numbers: 3184 / 2 = 1592.
I think that your file is just duplicated in the input folder. I can't tell for sure, because you have not shown how you submit the job, but maybe you can verify it with a simple:
bin/hadoop fs -ls /your/input_path
When submitting, either make sure that there is just the single file in there, or just reference the single file in your submission logic.
I made changes to the original code, removing the loops that were not necessary. What was happening in the older code was that when there was a title in the record, it was output, and later when there was content, it was output as well. So there were two writes per HTML file.
public class JobsDataMapper extends Mapper<LongWritable, Text, Text, Text> {
private Text keytext = new Text();
private Text valuetext = new Text();
private String jobT = new String();
private String jobName= new String();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
Document doc = Jsoup.parse(line);
Elements desc = doc.select("head title, meta[name=twitter:description]");
for (Element jobhtml : desc){
Elements title = jobhtml.select("title");
String jobTT = title.text();
jobT =jobTT ;
if (jobT.length()> 0){
jobName=jobTT;
}
Elements meta = jobhtml.select("meta[name=twitter:description]");
String jobsDetail ="";
String content = meta.attr("content");
String content1 = content.replaceAll("\\p{Punct}+", " ");
jobsDetail = content1.toLowerCase();
jobsDetail = content1.replaceAll(" a| able | about | across | after | all | almost | also | am | among | an | and | any | are | as | at | be| because | been | but | by | can | cannot | could | dear | did | do | does | either | else | ever | every | for | from | get | got | had | has | have | he | her | hers | him | his | how | however | i | if | in | into | is | it | its | just | least | let | like | likely | may | me | might | most | must | my | neither | no | nor | not | nbsp | of | off | often | on | only | or | other | our | own | rather | said | say | says | she | should | since | so | some | than | that | the | their | them | then | there | these | they | this | tis | to | too | twas | us | wants | was | we | were | what | when | where | which | while | who | whom | why | will | with | would | yet | you | your "," ");
if (jobsDetail.length()>0) {
String MN_Job = (jobName+ "\t" + jobsDetail);
keytext.set(MN_Job) ;
valuetext.set("JobInIT");
context.write( keytext, valuetext );
}
}
}
}
