Create dataframe from rdd objectfile

How can I create a DataFrame from an RDD that was saved as an object file? I want to load the RDD, but I don't have a Java class for its rows, only a StructType that I want to use as the DataFrame schema.
I tried reading it as Row:
val myrdd = sc.objectFile[org.apache.spark.sql.Row]("/home/bipin/"+name)
But I get:
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to org.apache.spark.sql.Row
Is there a way to do this?
Edit
From what I understand, I have to read the RDD as an array of objects and convert each array to a Row. If anyone can show how to do this, I would accept that as an answer.

If you have an Array of Object, you only need the Row apply method, which takes a variable number of Any arguments. The code would look something like this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row(x))
EDIT
You are right, @user568109: this creates a DataFrame with only one field, which holds the whole array. To turn each array element into its own field, do this (the resulting RDD[Row] can then be passed to createDataFrame together with your StructType):
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row.fromSeq(x.toSeq))
As @user568109 said, there is another way to do this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row(x:_*))
It doesn't matter which one you use, because both are wrappers for the same code:
/**
* This method can be used to construct a [[Row]] with the given values.
*/
def apply(values: Any*): Row = new GenericRow(values.toArray)
/**
* This method can be used to construct a [[Row]] from a [[Seq]] of values.
*/
def fromSeq(values: Seq[Any]): Row = new GenericRow(values.toArray)

Let me add some explanation.
Suppose you have a MySQL table grocery with 4 columns (grocery_id, item, category, price) and the contents below:
+------------+---------+----------+-------+
| grocery_id | item | category | price |
+------------+---------+----------+-------+
| 1 | tomato | veg | 2.40 |
| 2 | raddish | veg | 4.30 |
| 3 | banana | fruit | 1.20 |
| 4 | carrot | veg | 2.50 |
| 5 | apple | fruit | 8.10 |
+------------+---------+----------+-------+
5 rows in set (0.00 sec)
Now, to read it within Spark, your code will be something like this:
val groceryRDD = new JdbcRDD(sc, ()=> DriverManager.getConnection(url,uname,passwd), "select item,price from grocery limit ?,?",1,10,2,r => r.getString("item")+"|"+r.getString("price"))
Note:
In the above statement I converted each ResultSet row into a String with r => r.getString("item")+"|"+r.getString("price")
So my JdbcRDD will be:
groceryRDD: org.apache.spark.rdd.JdbcRDD[String] = JdbcRDD[29] at JdbcRDD at <console>:21
Now you save it:
groceryRDD.saveAsObjectFile("/user/cloudera/jdbcobject")
Answer to your question
When reading the object file back, write it as below:
val newJdbObjectFile = sc.objectFile[String]("/user/cloudera/jdbcobject")
Put simply, use the same type parameter the RDD had when you saved it.
In my case, groceryRDD has the type parameter String, so I used the same.
UPDATE:
In your case, as mentioned by jlopezmat, you need to use Array[Object].
Because each ResultSet row is converted with JdbcRDD.resultSetToObjectArray, each element of the saved RDD is an Array[Object].
That is, in my case, if I build and save the RDD as below,
val groceryRDD = new JdbcRDD(sc, ()=> DriverManager.getConnection(url,uname,passwd), "select item,price from grocery limit ?,?",1,10,2,r => JdbcRDD.resultSetToObjectArray(r))
and then read it back and collect the data,
val newJdbcObjectArrayRDD = sc.objectFile[Array[Object]]("...")
val result = newJdbcObjectArrayRDD.collect
the result will be of type Array[Array[Object]]:
result: Array[Array[Object]] = Array(Array(raddish, 4.3), Array(banana, 1.2), Array(carrot, 2.5), Array(apple, 8.1))
You can then parse the above based on your column definitions.
Please let me know if this answers your question.

Related

Searching and updating a Spark Dataset column with values from another Dataset

Java 8 and Spark 2.11:2.3.2 here. Although I would greatly prefer Java API answers, I do speak a wee bit of Scala so I will be able to understand any answers provided in it! But Java if at all possible (please)!
I have two Datasets with different schemas, except for a common "model_number" (string) column that exists in both.
For each row in my first Dataset (we'll call that d1), I need to scan/search the second Dataset ("d2") to see if there is a row with the same model_number, and if so, update another d2 column.
Here are my Dataset schemas:
d1
===========
model_number : string
desc : string
fizz : string
buzz : date
d2
===========
model_number : string
price : double
source : string
So again, if a d1 row has a model_number of, say, 12345, and a d2 row has the same model_number, I want to update d2.price by multiplying it by 10.0.
My best attempt thus far:
// I *think* this would give me a 3rd dataset with all d1 and d2 columns, but only
// containing rows from d1 and d2 that have matching 'model_number' values
Dataset<Row> d3 = d1.join(d2, d1.col("model_number") == d2.col("model_number"));
// now I just need to update d2.price based on matching
Dataset<Row> d4 = d3.withColumn("adjusted_price", d3.col("price") * 10.0);
Can anyone help me cross the finish line here? Thanks in advance!

A few points. As @VamsiPrabhala mentioned in the comments, the function you need is join on the shared field. Regarding the "update": keep in mind that DataFrames, Datasets, and RDDs in Spark are immutable, so you cannot update them in place. The solution is to join the two DataFrames and then perform the calculation (here, the multiplication) in a select or with withColumn. In other words, you cannot update the column, but you can create a new DataFrame with the "new" column. The example is in Scala; in the Java API, Column does not support operators such as == or *, so the equivalent calls are Column.equalTo and Column.multiply.
Example:
Input data:
+------------+------+------+----+
|model_number| desc| fizz|buzz|
+------------+------+------+----+
| model_a|desc_a|fizz_a|null|
| model_b|desc_b|fizz_b|null|
+------------+------+------+----+
+------------+-----+--------+
|model_number|price| source|
+------------+-----+--------+
| model_a| 10.0|source_a|
| model_b| 20.0|source_b|
+------------+-----+--------+
using join will output:
val joinedDF = d1.join(d2, "model_number")
joinedDF.show()
+------------+------+------+----+-----+--------+
|model_number| desc| fizz|buzz|price| source|
+------------+------+------+----+-----+--------+
| model_a|desc_a|fizz_a|null| 10.0|source_a|
| model_b|desc_b|fizz_b|null| 20.0|source_b|
+------------+------+------+----+-----+--------+
applying your calculation:
joinedDF.withColumn("price", col("price") * 10).show()
output:
+------------+------+------+----+-----+--------+
|model_number| desc| fizz|buzz|price| source|
+------------+------+------+----+-----+--------+
| model_a|desc_a|fizz_a|null|100.0|source_a|
| model_b|desc_b|fizz_b|null|200.0|source_b|
+------------+------+------+----+-----+--------+

Stream Filter List based on Combination of values from another List

Need: To filter List 1 based on the values present in List 2 using multiple criteria, i.e. the combination of Date and Order Number.
Issue: I can filter on one criterion. But when I add another filter condition, the two are treated as separate conditions and not as a combination. I can't figure out how to make them a combination.
I hope the issue is clear.
Research: I referred to my earlier query on a similar need (Link1) and also checked Link2.
List 1: (All Orders)
[Date | OrderNumber | Time | Company | Rate ]
[2014-10-01 | 12345 | 10:00:01 | CompA | 1000]
[2015-03-01 | 23456 | 08:00:01 | CompA | 2200]
[2016-08-01 | 34567 | 09:00:01 | CompA | 3300]
[2017-09-01 | 12345 | 11:00:01 | CompA | 4400]
[2017-09-01 | 98765 | 12:00:01 | CompA | 7400]
List 2: (Completed Orders)
[Date | OrderNumber | Time]
[2014-10-01 | 12345 | 10:00:01]
[2015-03-01 | 23456 | 08:00:01]
[2016-08-01 | 34567 | 09:00:01]
[2017-09-01 | 98765 | 12:00:01]
Expected O/p after filter :
[Date | OrderNumber | Time | Company | Rate]
[2017-09-01 | 12345 | 11:00:01 | CompA | 4400]
Code:
// Data extracted from MySQL database
// List 1: All Orders
List<ModelAllOrders> listOrders = getDataFromDatabase.getTable1();
// List 2: Completed Orders
List<ModelCompletedOrders> listCompletedOrders = getDataFromDatabase.getTable2();
// Filtering on one criterion works
Set<Integer> setOrderNumbers = listCompletedOrders.stream().map(ModelCompletedOrders::getOrderNumber).collect(Collectors.toSet());
listOrders = listOrders.stream().filter(p -> !setOrderNumbers.contains(p.getOrderNumber())).collect(Collectors.toList());
// The following does not work as expected when trying to filter on the combination
Set<LocalDate> setDates = listCompletedOrders.stream().map(ModelCompletedOrders::getDate).collect(Collectors.toSet());
listOrders = listOrders.stream().filter(p -> !setDates.contains(p.getDate()) && !setOrderNumbers.contains(p.getOrderNumber()))
.collect(Collectors.toList());

You've asked for logic that will do this:
The combination of Date & Order Number is unique. I need to check if that unique combination is present in List-2, if yes then filter out, if not then output should contain that row.
Stream::filter() will return a subset of the stream where the filter predicate returns true (i.e. it filters out those objects in the stream where the predicate is false).
listOrders = listOrders.stream().filter(p -> !setDates.contains(p.getDate()) && !setOrderNumbers.contains(p.getOrderNumber()))
.collect(Collectors.toList());
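Your expression here says "show me orders where the order's date does not appear in the completed orders AND the order's number does not appear in the completed orders". That logical expression is wrong: it drops a row as soon as either value has been seen (you're getting confused between what in electronics would be called positive vs negative logic).
You want either:
listOrders = listOrders.stream().filter(p -> !(setDates.contains(p.getDate()) && setOrderNumbers.contains(p.getOrderNumber())))
.collect(Collectors.toList());
"show me orders where the order's date and the order's number are not both present in the completed orders"
or:
listOrders = listOrders.stream().filter(p -> !setDates.contains(p.getDate()) || !setOrderNumbers.contains(p.getOrderNumber()))
.collect(Collectors.toList());
"show me orders where either the order's date has not been seen before OR the order's number has not been seen before"
By De Morgan's law these two forms are equivalent. Note that both still test the date and the order number against two separate sets, so an order whose date matches one completed order and whose number matches a different one is still dropped. That is the case for the expected row in the question (2017-09-01 | 12345): its date appears with order 98765 and its number appears with 2014-10-01. Testing the date and number together as a single value avoids this.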
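One way to test the combination as a single value is to build a set of (date, order number) pairs from the completed orders and look each order up in it. The following is a minimal sketch under stated assumptions, not the asker's actual code: the Order class is a hypothetical stand-in for ModelAllOrders and ModelCompletedOrders, and key and notCompleted are illustrative names.

```java
import java.time.LocalDate;
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class CompositeKeyFilter {
    // Hypothetical stand-in for ModelAllOrders / ModelCompletedOrders.
    static class Order {
        private final LocalDate date;
        private final int orderNumber;

        Order(String date, int orderNumber) {
            this.date = LocalDate.parse(date);
            this.orderNumber = orderNumber;
        }

        LocalDate getDate() { return date; }
        int getOrderNumber() { return orderNumber; }
    }

    // Combines date and order number into one value with equals/hashCode,
    // so the pair can be used as a single Set element.
    static Map.Entry<LocalDate, Integer> key(Order o) {
        return new SimpleEntry<>(o.getDate(), o.getOrderNumber());
    }

    // Keeps only the orders whose (date, order number) pair is not completed.
    static List<Order> notCompleted(List<Order> all, List<Order> completed) {
        Set<Map.Entry<LocalDate, Integer>> completedKeys = completed.stream()
                .map(CompositeKeyFilter::key)
                .collect(Collectors.toSet());
        return all.stream()
                .filter(o -> !completedKeys.contains(key(o)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Order> all = Arrays.asList(
                new Order("2014-10-01", 12345), new Order("2015-03-01", 23456),
                new Order("2016-08-01", 34567), new Order("2017-09-01", 12345),
                new Order("2017-09-01", 98765));
        List<Order> completed = Arrays.asList(
                new Order("2014-10-01", 12345), new Order("2015-03-01", 23456),
                new Order("2016-08-01", 34567), new Order("2017-09-01", 98765));
        for (Order o : notCompleted(all, completed)) {
            System.out.println(o.getDate() + " | " + o.getOrderNumber());
        }
    }
}
```

With the sample data from the question, only the 2017-09-01 | 12345 row survives, because no completed order carries that exact pair even though the date and the number each appear separately.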

How to add a column of counts to an ArrayList

I have:
TAG | REVIEW
A | hello
B | yay
A | win
in an ArrayList and I am trying to get:
TAG | COUNT
A | 8 //hello+win =8
B | 3 //yay =3
where count is the total number of characters in all strings with the same tag. I have been reading about Collections and Maps, but I am completely lost. Can someone explain how to solve this in pieces?
1) To get the count:
List<String,Integer> poll_reviewText_count=new ArrayList<>();
for(String l:poll_reviewText){
poll_reviewText_count.add({l[0],l[1].length()}) //TAG, COUNT
}
2) Then I think I need to combine all the instances of TAG that match into one sum. Not sure how to do this.

There is no such thing as List<V, T> in Java. A Map from tag to review cannot hold your input data either, because when inserting this:
TAG | REVIEW
A | hello
B | yay
A | win
into a map, A | hello will get replaced by A | win (they have the same key).
A solution is to create a class that contains the TAG and REVIEW information:
class Bar {
    String tag;
    String review;
    String getTag() { return tag; }
    String getReview() { return review; }
    // setters omitted
}
Then, using Java streams, you can collect the data the way you want:
// poll_reviewText_count is assumed here to be a List<Bar>
Map<String, Integer> collect = poll_reviewText_count.stream()
    .collect(Collectors.groupingBy(Bar::getTag, Collectors.summingInt(o -> o.getReview().length())));
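To see the grouping end to end, here is a small self-contained sketch using the sample data from the question. The class and method names (TagCounts, Review, countByTag) are illustrative assumptions, not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TagCounts {
    // Hypothetical holder for one TAG | REVIEW row.
    static class Review {
        private final String tag;
        private final String review;

        Review(String tag, String review) {
            this.tag = tag;
            this.review = review;
        }

        String getTag() { return tag; }
        String getReview() { return review; }
    }

    // Groups rows by tag and sums the character counts of their reviews.
    static Map<String, Integer> countByTag(List<Review> reviews) {
        return reviews.stream()
                .collect(Collectors.groupingBy(
                        Review::getTag,
                        Collectors.summingInt(r -> r.getReview().length())));
    }

    public static void main(String[] args) {
        List<Review> reviews = Arrays.asList(
                new Review("A", "hello"),
                new Review("B", "yay"),
                new Review("A", "win"));
        // A maps to 8 (hello + win) and B maps to 3 (yay).
        System.out.println(countByTag(reviews));
    }
}
```

groupingBy does the "combine all instances of TAG" step from the question, and summingInt does the per-row length count, so the two pieces the asker describes collapse into one collector.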

How to pass Scenario out line data as a object in step method using cucumber-jvm

I am looking for a way to pass each Scenario Outline example row as an object in cucumber-jvm.
For example, consider this scenario:
Scenario Outline: example
Given I have a url
When I choose <input_1>
Then page should hold field1 value as <validation field1> field2 value as <validation field2> fieldn value as <validation fieldn>
Examples:
| input_1 | validation field1 |validation field2|validation fieldn|
| input_1_case_1 | expected value 1 |expected value 1 |expected value n |
So in the step file:
public void validationMethod(String validationField1, String validationField2, String validationField3){
............
............
}
So if I have more fields, my method also takes more arguments.
Now I want to pass all the validation fields to the method as one object. Is this possible with cucumber-jvm? If so, could anyone please provide a suggestion with sample code?

You could try something like this
Then Use the following values
| <validation field1> | <validation field2> | <validation field3> |
Examples:
| input_1 | validation field1 |validation field2|validation field3 |
| input_1_case_1 | expected value 1 |expected value 2 |expected value 3 |
| input_2_case_2 | expected value 1 |expected value 2 |expected value 3 |
Step Definition
@Then("^Use the following values$")
public void useFollVal(List<String> valFields) {
//The values will be inside the list. Use index to access
}
You can even get a validation object instead of a string list, i.e. List<ValidationData>. To do this, add a header row to the table in the step (not the Examples table) with names matching the fields of the ValidationData class, and Cucumber will populate the data into the object.
Then Use the following values
| valField1 | valField2 | valField3 | <<<--- Header to add
| <validation field1> | <validation field2> | <validation field3> |
valField1 -> private String valField1; in ValidationData
Step Definition
@Then("^Use the following values$")
public void useFollVal(List<ValidationData> valObject) {
}

This is more of a comment: wouldn't a variable-length argument list work for you? You would need to know the order of your parameters, though, since there are no argument names to help out.
public void multiParams(String... val){
}
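As a plain-Java illustration of the varargs idea, the sketch below shows how the values arrive as an array in call order, so each one can only be identified by position. The summarize helper is hypothetical, and whether cucumber-jvm binds step arguments to a varargs parameter is not confirmed here.

```java
class VarargsSketch {
    // Hypothetical helper: vals is a String[] holding the arguments in call order.
    static String summarize(String... vals) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < vals.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            // Only the position identifies a value, so label it by index.
            sb.append("field").append(i + 1).append('=').append(vals[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(summarize("expected value 1", "expected value 2"));
    }
}
```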

Changing Mocked ResultSet value each time a SQL statement is run

I'm using Mockrunner to create a mock result set for a select statement. I have a loop that executes the select statement (which returns a single value). I want to have the result set return a different value each time, but I have been unable to find anything about how to specify the result set return value based on the times the statement has been called. Here's a pseudocode snippet of the code:
In the test Code:
String selectSQL = "someselectStmt";
StatementResultSetHandler stmtHandler = conn.GetStatementResultSetHandler();
MockResultSet result = stmtHandler.createResultSet();
result.addRow(new Integer[]{new Integer(1)});
stmtHandler.prepareResultSet(selectSQL, result);
In the Actual Target Class:
Integer[] Results = getResults(selectSQL);
while(Results.length != 0){
//do some stuff that change what gets returned in the select stmt
Results = getResults(selectSQL)
}
So essentially I'd like to return something like 1 on the first time through, 2 on the 2nd and nothing on the 3rd. I haven't found anything so far that I'd be able to leverage that could achieve this. The mocked select statement will always return whatever the last result set was to be associated with it (for instance if I created two MockResultSets and associated both with the same select stmt). Is this idea possible?

Looping Control Flow Working Within Java and SQL
If you're coding this in Java, one way to make your calls return different, sequenced results is a looping control-flow statement such as a do-while loop. This Wikipedia reference has a good discussion contrasting do-while loops in Java and in other programming languages.
Some Additional Influences through Observation:
A clue from your work with the Mockrunner tool:
The mocked select statement will always return whatever the last result set was to be associated with it (for instance if I created two MockResultSets and associated both with the same select stmt)
This is the case because the SELECT statement must actually change as well or else repeating the query will also repeat the result output. A clue is that your SQL exists as a literal string value throughout the execution of the code. Strings can be altered through code and simple string manipulations.
String selectSQL = "someselectStmt";
StatementResultSetHandler stmtHandler = conn.GetStatementResultSetHandler();
MockResultSet result = stmtHandler.createResultSet();
result.addRow(new Integer[]{new Integer(1)});
stmtHandler.prepareResultSet(selectSQL, result);
In addition to the selectSQL variable, add a numeric variable to keep track of how many times the SQL statement is executed:
int queryLoopCount = 0;
In the following target class:
Integer[] Results = getResults(selectSQL);
while(Results.length != 0){
//do some stuff that change what gets returned in the select stmt
Results = getResults(selectSQL)
}
Try rewriting this WHILE loop control following this example. In your pseudocode, you will keep pulling the same data from the call to getResults(selectSQL); because the query remains the same through every pass made through the code.
Setting up the Test Schema and Example SQL Statement
Here is a little workup using a single MySQL table that contains "testdata" output to be fed into some result set. The ID column could be a way of uniquely identifying each different record or "test case"
SQL Fiddle
MySQL 5.5.32 Schema Setup:
CREATE TABLE testCaseData
(
id int primary key,
testdata_col1 int,
testdata_col2 varchar(20),
details varchar(30)
);
INSERT INTO testCaseData
(id, testdata_col1, testdata_col2, details)
VALUES
(1, 2021, 'alkaline gab', 'First Test'),
(2, 322, 'rebuked girdle', '2nd Test'),
(3, 123, 'municipal shunning', '3rd Test'),
(4, 4040, 'regal limerick', 'Skip Test'),
(5, 5550, 'admonished hundredth', '5th Test'),
(6, 98, 'docile pushover', '6th Test'),
(7, 21, 'mousiest festivity', 'Last Test');
commit;
Query 1 A Look at All the Test Data:
SELECT id, testdata_col1, testdata_col2, details
FROM testCaseData
Results:
| ID | TESTDATA_COL1 | TESTDATA_COL2 | DETAILS |
|----|---------------|----------------------|------------|
| 1 | 2021 | alkaline gab | First Test |
| 2 | 322 | rebuked girdle | 2nd Test |
| 3 | 123 | municipal shunning | 3rd Test |
| 4 | 4040 | regal limerick | Skip Test |
| 5 | 5550 | admonished hundredth | 5th Test |
| 6 | 98 | docile pushover | 6th Test |
| 7 | 21 | mousiest festivity | Last Test |
Query 2 Querying Only the First Record in the Table:
SELECT id, testdata_col1, testdata_col2, details
FROM testCaseData
WHERE id = 1
Results:
| ID | TESTDATA_COL1 | TESTDATA_COL2 | DETAILS |
|----|---------------|---------------|------------|
| 1 | 2021 | alkaline gab | First Test |
Query 3 Querying a Specific Test Record Within the Table:
SELECT id, testdata_col1, testdata_col2, details
FROM testCaseData
WHERE id = 2
Results:
| ID | TESTDATA_COL1 | TESTDATA_COL2 | DETAILS |
|----|---------------|----------------|----------|
| 2 | 322 | rebuked girdle | 2nd Test |
Query 4 Returning and Limiting the Output Set Size:
SELECT id, testdata_col1, testdata_col2, details
FROM testCaseData
WHERE id < 5
Results:
| ID | TESTDATA_COL1 | TESTDATA_COL2 | DETAILS |
|----|---------------|--------------------|------------|
| 1 | 2021 | alkaline gab | First Test |
| 2 | 322 | rebuked girdle | 2nd Test |
| 3 | 123 | municipal shunning | 3rd Test |
| 4 | 4040 | regal limerick | Skip Test |
Writing a Parameterized SQL Statement
I do not know whether this difference in syntax yields exactly the same results as your pseudocode, but I am recommending it based on code structures that I know already work.
// set condition value before loop
do {
    // do some work
    // update condition value
} while (condition);
The WHILE condition is now at the end of the statement and should be based on a change to a value made within the looping block. We will now introduce the second variable, an int that tracks the number of times the loop has run:
String selectSQL = "someselectStmt";
String[] Results;
// set condition value before loop
int queryLoopCount = 0;
do {
    // do some work
    Results = getResults(selectSQL);
    // update condition value
    queryLoopCount = queryLoopCount + 1;
} while (queryLoopCount < 6);
Where selectSQL comes from:
SELECT id, testdata_col1, testdata_col2, details
FROM testCaseData
WHERE id = 2;
And adapts, with a built-in parameter, to:
selectSQL = "SELECT id, testdata_col1, testdata_col2, details"
    + " FROM testCaseData"
    + " WHERE id = " + queryLoopCount;
Mixing string and integer values is not a problem in Java. As this reference on concatenation (+) puts it: anything concatenated to a string is converted to a string (e.g., "weight = " + kilograms).
Ideas for Specialized Case Requirements
You could introduce your own numbering sequence to cycle through the records of each case in the reference table. An ORDER BY clause, with a different key value to order by, opens up many more possibilities.
The "Skip" case: within the do-while loop, add an if statement to conditionally skip a specific record.
// set condition value before loop
do {
    if (queryLoopCount != 4) {
        // do some work
    }
    // update condition value
    queryLoopCount = queryLoopCount + 1;
} while (condition);
Using the if statement, this code sample processes all test records except the record with ID = 4, and continues until the while condition is no longer met.
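Putting the counter, the concatenated SELECT, and the skip case together, a runnable sketch might look like the following. It only builds the SQL strings; getResults is left out, and the class and method names (QueryLoopSketch, buildQueries) and the first/last/skip parameters are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class QueryLoopSketch {
    // Builds one SELECT per id from firstId to lastId, skipping skipId.
    // Being a do-while loop, the body runs at least once.
    static List<String> buildQueries(int firstId, int lastId, int skipId) {
        List<String> queries = new ArrayList<>();
        // set condition value before loop
        int queryLoopCount = firstId;
        do {
            if (queryLoopCount != skipId) {
                // the int is converted to a String by the + operator
                queries.add("SELECT id, testdata_col1, testdata_col2, details"
                        + " FROM testCaseData"
                        + " WHERE id = " + queryLoopCount);
            }
            // update condition value
            queryLoopCount = queryLoopCount + 1;
        } while (queryLoopCount <= lastId);
        return queries;
    }

    public static void main(String[] args) {
        // ids 1 to 7 from the test table, skipping the "Skip Test" record
        for (String sql : buildQueries(1, 7, 4)) {
            System.out.println(sql);
        }
    }
}
```

Against a real database through JDBC, the usual alternative to concatenating the id into the string is a PreparedStatement with a ? placeholder for the id.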
