How to read dataframe in rdd operations


Scenario
I have two lists of strings, List a and List b, each containing text file paths.
I want the Cartesian product of a and b so that I can compare the resulting DataFrames pairwise.
My approach is to compute the Cartesian product first,
convert it to a pair RDD, and then apply an operation to each pair.
List<String> a = Lists.newList("/data/1.text", "/data/2.text", "/data/3.text");
List<String> b = Lists.newList("/data/4.text", "/data/5.text", "/data/6.text");
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
List<Tuple2<String,String>> cartesian = cartesian(a,b);
jsc.parallelizePairs(cartesian).filter(new Function<Tuple2<String, String>, Boolean>() {
@Override public Boolean call(Tuple2<String, String> tup) throws Exception {
Dataset<Row> text1 = spark.read().text(tup._1); // <-- this throws NullPointerException
Dataset<Row> text2 = spark.read().text(tup._2);
return text1.first()==text2.first(); // <-- this is an indicative function only
}
});
I can even use Spark to compute the Cartesian product:
JavaRDD<String> sourceRdd = jsc.parallelize(a);
JavaRDD<String> allRdd = jsc.parallelize(b);
sourceRdd.cache().cartesian(allRdd).filter(new Function<Tuple2<String, String>, Boolean>() {
@Override public Boolean call(Tuple2<String, String> tup) throws Exception {
Dataset<Row> text1 = spark.read().text(tup._1); // <-- same issue
Dataset<Row> text2 = spark.read().text(tup._2);
return text1.first()==text2.first();
}
});
Please suggest a good approach to handle this.

I'm not sure I completely understood your problem, but here is a sample Cartesian product using Spark and Java.
public class CartesianDemo {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("CartesianDemo").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
//list
List<String> listOne = Arrays.asList("one", "two", "three", "four", "five");
List<String> listTwo = Arrays.asList("ww", "xx", "yy", "zz");
//RDD
JavaRDD<String> rddOne = jsc.parallelize(listOne);
JavaRDD<String> rddTwo = jsc.parallelize(listTwo);
//Cartesian
JavaPairRDD<String, String> cartesianRDD = rddOne.cartesian(rddTwo);
//print
cartesianRDD.foreach(data -> {
System.out.println("X=" + data._1() + " Y=" + data._2());
});
//stop
jsc.stop();
jsc.close();
}
}
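A likely cause of the NullPointerException in the question is that spark.read() is called inside filter, which runs on the executors, while the SparkSession is only usable on the driver. One way around this is to build the pairs on the driver and read both files for each pair there. The question calls a cartesian(a, b) helper without showing it, so the following is only a minimal sketch of what such a helper could look like in plain Java. The class name DriverSideCartesian and the use of Map.Entry in place of Tuple2 are assumptions made so that the example runs without Spark on the classpath.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class DriverSideCartesian {

    // Hypothetical stand-in for the cartesian(a, b) helper that the question
    // calls but does not show. Pairs are ordered left-major.
    static List<Map.Entry<String, String>> cartesian(List<String> left, List<String> right) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>(left.size() * right.size());
        for (String l : left) {
            for (String r : right) {
                pairs.add(new SimpleImmutableEntry<>(l, r));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("/data/1.text", "/data/2.text", "/data/3.text");
        List<String> b = Arrays.asList("/data/4.text", "/data/5.text", "/data/6.text");
        for (Map.Entry<String, String> pair : cartesian(a, b)) {
            // On the driver, this is the place where spark.read().text(pair.getKey())
            // and spark.read().text(pair.getValue()) could be called and compared.
            System.out.println(pair.getKey() + " x " + pair.getValue());
        }
    }
}
```

Each spark.read() call issued from this loop still runs as a distributed job, so only the outer iteration over the file pairs stays on the driver.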

Related

how to store the values that has the same parent key

Let's say I have:
bob:V
bob:A
bob:B
bob:C
bob:C
sally:B
sally:C
sally:A
steve:A
steve:B
steve:C
How do I store the values as:
bob={V,A,B,C,C}, sally={B,C,A}, steve={A,B,C}
and, for anyone whose values contain the sequence A,B,C, how do I get that person's name?
I am fairly new to Java and I'm trying to implement this scenario, as I don't see anything like this in this community.
Here is my final answer: I first stored the list in a map, then used collectors to stream through it and match each person against the attribute sequence.
public class Solution{
static List<String> doWork(List<LogItem> eventsInput) {
Map<String, String> personMap = eventsInput.stream()
.collect(Collectors.toMap(LogItem::getUserId, p -> Character.toString(p.getEventChar()), String::concat));
System.out.println("person map is \n" + personMap);
BiPredicate<Entry<String, List<String>>, String> contains =
(entry, attr) -> entry.getValue().stream()
.collect(Collectors.joining()).contains(attr);
String attributes = "ABC";
List<String> results = personMap.entrySet().stream()
.filter(e -> e.getValue().contains(attributes))
.map(Entry::getKey).collect(Collectors.toList());
return results;
}
public static void main(String[] args){
List<LogItem> exampleInputItems = new ArrayList<>();
exampleInputItems.add(new LogItem("bob", 'V'));
exampleInputItems.add(new LogItem("bob", 'A'));
exampleInputItems.add(new LogItem("steve", 'A'));
exampleInputItems.add(new LogItem("bob", 'B'));
exampleInputItems.add(new LogItem("bob", 'C'));
exampleInputItems.add(new LogItem("bob", 'C'));
exampleInputItems.add(new LogItem("steve", 'B'));
exampleInputItems.add(new LogItem("sally", 'B'));
exampleInputItems.add(new LogItem("steve", 'C'));
exampleInputItems.add(new LogItem("sally", 'C'));
exampleInputItems.add(new LogItem("sally", 'A'));
List<String> returnedNames = doWork(exampleInputItems);
if (returnedNames.size() != 2) {
throw new RuntimeException("Wrong number of names found. Found: " + returnedNames);
}
if (!returnedNames.contains("bob")) {
throw new RuntimeException("Did not find \"bob\" in the returnedNames: " + returnedNames);
}
if (!returnedNames.contains("steve")) {
throw new RuntimeException("Did not find \"steve\" in the returnedNames: " + returnedNames);
}
System.out.println("The example passed.");
}
static class LogItem {
public String userId;
public char eventChar;
public LocalDateTime dateTime;
LogItem(String userId, char eventChar) {
this.userId = userId;
this.eventChar = eventChar;
dateTime = LocalDateTime.now();
}
public String getUserId() {
return userId;
}
public void setUserId(String userId) {
this.userId = userId;
}
public char getEventChar() {
return eventChar;
}
public void setEventChar(char eventChar) {
this.eventChar = eventChar;
}
public LocalDateTime getDateTime() {
return dateTime;
}
public void setDateTime(LocalDateTime dateTime) {
this.dateTime = dateTime;
}
@Override
public String toString() {
return "LogItem [userId=" + userId + ", eventChar=" + eventChar + ", dateTime=" + dateTime + ", getUserId()="
+ getUserId() + ", getEventChar()=" + getEventChar() + ", getDateTime()=" + getDateTime() + "]";
}
}
}

First, I would store the attributes in a Map<String,String>. This will make it easier to filter the attributes later. I am using a record in lieu of a class but a class would work as well.
record Person(String getName, String getAttribute) {
}
Create the list of Person objects
List<Person> list = List.of(new Person("bob", "V"),
new Person("bob", "A"), new Person("bob", "B"),
new Person("bob", "C"), new Person("bob", "C"),
new Person("sally", "B"), new Person("sally", "C"),
new Person("sally", "A"), new Person("steve", "A"),
new Person("steve", "B"), new Person("steve", "C"));
Now create the map. Simply stream the list of people and concatenate the attributes for each person.
Map<String, String> personMap = list.stream()
.collect(Collectors.toMap(Person::getName,
Person::getAttribute, String::concat));
The map will look like this.
bob=VABCC
steve=ABC
sally=BCA
Now find the names based on an attribute string.
Stream the entries of the map and keep those whose value contains the attribute string. Then retrieve each key (name) and collect the keys into a list of names.
String attributes = "ABC";
List<String> results = personMap.entrySet().stream()
.filter(e -> e.getValue().contains(attributes))
.map(Entry::getKey).collect(Collectors.toList());
System.out.println(results);
prints
[bob, steve]
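Note that the three-argument Collectors.toMap makes no promise about the kind of map it returns (in practice a HashMap), so the order of the printed names should not be relied on. If first-seen order matters, the four-argument overload accepts a map supplier. The following is a minimal sketch of that variant. The class name OrderedAttributes and the {name, attribute} array layout are assumptions made to keep the example short and runnable on older Java versions without records.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class OrderedAttributes {

    // Returns the names whose concatenated attributes contain `sequence`,
    // in the order in which each name first appears in the input.
    // Each element of `events` is a {name, attribute} pair (assumed layout).
    static List<String> namesWithSequence(List<String[]> events, String sequence) {
        Map<String, String> byName = events.stream()
                .collect(Collectors.toMap(
                        e -> e[0],
                        e -> e[1],
                        String::concat,        // append attributes of the same name
                        LinkedHashMap::new));  // keep insertion order
        return byName.entrySet().stream()
                .filter(e -> e.getValue().contains(sequence))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
                new String[] {"bob", "V"}, new String[] {"bob", "A"},
                new String[] {"bob", "B"}, new String[] {"bob", "C"},
                new String[] {"bob", "C"}, new String[] {"sally", "B"},
                new String[] {"sally", "C"}, new String[] {"sally", "A"},
                new String[] {"steve", "A"}, new String[] {"steve", "B"},
                new String[] {"steve", "C"});
        System.out.println(namesWithSequence(events, "ABC")); // prints [bob, steve]
    }
}
```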
Alternative approach using Map<String, List<String>>
Group the objects by name but the values will be a list of attributes instead of a string.
Map<String, List<String>> personMap = list.stream()
.collect(Collectors.groupingBy(Person::getName,
Collectors.mapping(Person::getAttribute,
Collectors.toList())));
The map will look like this.
bob=[V, A, B, C, C]
steve=[A, B, C]
sally=[B, C, A]
To facilitate testing the attributes, a BiPredicate streams the list, concatenates the attributes, and then checks whether the result contains the attribute string.
BiPredicate<Entry<String, List<String>>, String> contains =
(entry, attr) -> entry.getValue().stream()
.collect(Collectors.joining()).contains(attr);
As before, stream the entry set of the map and apply the filter to pass those entries which satisfy the condition. In this case, the filter invokes the BiPredicate.
String attributes = "ABC";
List<String> results = personMap.entrySet().stream()
.filter(e->contains.test(e, attributes))
.map(Entry::getKey).collect(Collectors.toList());
System.out.println(results);
prints
[bob, steve]
Updated answer
To work with Character attributes, you can do the following using the first example.
Map<String, String> personMap2 = list.stream()
.collect(Collectors.toMap(Person::getName,
p -> Character.toString(p.getAttribute()),
String::concat));
In my opinion, it would be easier to change your attribute type to String, if that is possible.

Aggregation and grouping using Hazelcast jet

I am trying to do grouping and aggregation using Hazelcast Jet, but it is getting a little slow because I have to loop through the data twice: once to create the grouping key, and again to aggregate all the data. Is there a better, feasible way to do it? Please help.
Here is my code. First I create a grouping key from my data, since the grouping is done on multiple keys:
// Fields with which I want to do grouping.
List<String> fields1 = Arrays.asList("Field1", "Field4");
List<String> cAggCount = Arrays.asList("CountFiled");
List<String> sumField = Arrays.asList("SumFiled");
BatchStage<Map<Object, List<Object>>> aggBatchStageDataGroupBy = batchStage
.aggregate(AggregateOperations.groupingBy(jdbcData -> {
Map<String, Object> m = ((Map<String, Object>)jdbcData);
Set<String> jset = m.keySet();
StringBuilder stringBuilder = new StringBuilder("");
fields1.stream().forEach(dataValue -> {
if (!jset.contains(dataValue.toString())) {
stringBuilder.append(null+"").append(",");
} else {
Object k = m.get(dataValue.toString());
if (k == null) {
stringBuilder.append("").append(",");
} else {
stringBuilder.append(k).append(",");
}
}
});
return stringBuilder.substring(0, stringBuilder.length() - 1);
}));
And after that I am doing aggregation on it as below:
BatchStage<List<Map<String, Object>>> aggBatchStageData = aggBatchStageDataGroupBy
.map(data -> {
data.entrySet().stream().forEach(v -> {
Map<String, Object> objectMap = new HashMap<>();
IntStream.range(0, cAggCount.size()).forEach(k -> {
objectMap.put(countAlias.get(k), v.getValue().stream().mapToLong(dataMap -> {
return new BigDecimal(1).intValue();
}).count());
});
});
return mapList;
});
So can we do this whole process in one pass, instead of looping twice (first grouping by key and then aggregating)?
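Jet's pipeline API has a keyed stage (groupingKey(...) followed by aggregate(...)) that is intended for grouping and aggregating in a single step, and that is probably the first thing to try here. The idea itself can be shown without a cluster: the sketch below uses plain java.util.stream collectors, not Jet, to build the composite key and compute both the count and the sum per group in one pass. The class name OnePassGrouping, the row helper, and the sample values are assumptions for illustration. Unlike the question's key builder, this one renders both missing and null fields as "null".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

class OnePassGrouping {

    // Builds a composite key such as "a,x" from the chosen fields.
    // Missing or null values are rendered as the text "null".
    static String compositeKey(Map<String, Object> row, List<String> fields) {
        return fields.stream()
                .map(f -> Objects.toString(row.get(f)))
                .collect(Collectors.joining(","));
    }

    // Groups rows by the composite key and, in the same pass, collects
    // count and sum of `sumField` (assumed to hold a non-null Number).
    static Map<String, LongSummaryStatistics> groupAndAggregate(
            List<Map<String, Object>> rows, List<String> keyFields, String sumField) {
        return rows.stream().collect(Collectors.groupingBy(
                row -> compositeKey(row, keyFields),
                Collectors.summarizingLong(row -> ((Number) row.get(sumField)).longValue())));
    }

    // Hypothetical helper that builds one sample row.
    static Map<String, Object> row(Object field1, Object field4, long amount) {
        Map<String, Object> r = new HashMap<>();
        r.put("Field1", field1);
        r.put("Field4", field4);
        r.put("SumFiled", amount);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows =
                Arrays.asList(row("a", "x", 10), row("a", "x", 5), row("b", "y", 7));
        groupAndAggregate(rows, Arrays.asList("Field1", "Field4"), "SumFiled")
                .forEach((key, stats) -> System.out.println(
                        key + " count=" + stats.getCount() + " sum=" + stats.getSum()));
    }
}
```

The point of the design is that the key function and the downstream aggregation are handed to the same grouping step, so no intermediate map of lists has to be built and walked a second time.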

Flatten two lists to be in String .csv style Format

I have two lists: a List<String> headers and a List<Row> rows, which is essentially a List<List<String>>.
I'm trying to flatten these lists to get this desired output:
Name,ID,Value
John Doe,1111,XXX
Jane Doe,2222,YYY
I'm currently using an iterator to go over the headers list and appending them to a StringBuilder, like so:
List<String> headers = ((RowListResult) result).getHeaders();
List<RowListResult.Row> rows = ((RowListResult) result).getRows();
output = output.concat(String.join(",", headers)).concat("\n");
for(RowListResult.Row row: rows) output = output.concat(String.join(",", row.getColumns().toString()).concat("\n")).replace("/[[]]/", "");
System.out.println(output);
return null;
}
}
I was going to do the same thing for the rows, but I feel like I am making this much harder than it should be, and at the end I'd have to remove the commas at the end of each row. Is there a better way to accomplish this? A nudge in the right direction would be helpful. I can also share anything else you may need; just leave a comment.
Data:
Name,ID,Value
[John Doe, 1111, XXX]
[Jane Doe, 2222, YYY]

You can use String.join():
import java.util.Arrays;
import java.util.List;
class Main {
public static void main(String[] args) {
List<String> headers = Arrays.asList("header1","header2","header3");
List<String> row1 = Arrays.asList("cell1","cell2","cell3");
List<String> row2 = Arrays.asList("cell1","cell2","cell3");
List<List<String>> rows = Arrays.asList(row1, row2);
System.out.println(toString(headers, rows));
}
private static String toString(List<String> headers, List<List<String>> rows) {
StringBuilder str = new StringBuilder();
str.append(String.join(", ", headers)).append("\n");
for (List<String> row: rows) str.append(String.join(", ", row)).append("\n");
return str.toString();
}
}

Do it as follows:
import java.util.List;
public class Main {
public static void main(String[] args) {
List<String> header = List.of("Name", "ID", "Value");
List<List<String>> values = List.of(List.of("John Doe", "1111", "XXX"), List.of("Jane Doe", "2222", "YYY"));
StringBuilder sb = new StringBuilder();
sb.append(String.join(",", header).replace("[", "").replace("]", ""));
for (List<String> valueList : values) {
sb.append(System.lineSeparator()).append(String.join(",", valueList).replace("[", "").replace("]", ""));
}
// Display
System.out.println(sb);
}
}
Output:
Name,ID,Value
John Doe,1111,XXX
Jane Doe,2222,YYY

You can use streams and then join using a Collector.
class Row {
final List<String> values;
Row(List<String> values) {
this.values = values;
}
}
class Header {
final List<String> headers;
Header(List<String> headers) {
this.headers = headers;
}
}
@Test
public void test() {
Header header = new Header(Arrays.asList("Name", "Number", "Description"));
List<Row> row = Arrays.asList(new Row(Arrays.asList("John Doe", "111", "XXX")), new Row(Arrays.asList("John Doe", "112", "XXX")));
String headerString = header.headers.stream().collect(Collectors.joining(","));
System.out.println(headerString);
List<String> listOfValues = row.stream()
.map(x -> x.values)
.map(y -> y.stream().collect(Collectors.joining(",")))
.collect(Collectors.toList());
// Print one row per line so the result matches the output shown below.
listOfValues.forEach(System.out::println);
}
Output:
Name,Number,Description
John Doe,111,XXX
John Doe,112,XXX
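None of the approaches above handles a cell that itself contains a comma or a quote, which would break the resulting CSV. The following is a minimal sketch that treats the header as just another row and quotes cells when needed, using the common convention of wrapping the cell in double quotes and doubling any embedded quote. The class name CsvFlatten and the use of "\n" as the line separator are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CsvFlatten {

    // Quotes a cell when it contains a comma, a double quote, or a line break,
    // and doubles any embedded double quotes. Other cells pass through unchanged.
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Joins the header and every row into comma-separated lines.
    static String toCsv(List<String> headers, List<List<String>> rows) {
        return Stream.concat(Stream.of(headers), rows.stream())
                .map(line -> line.stream().map(CsvFlatten::escape).collect(Collectors.joining(",")))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList("Name", "ID", "Value");
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("John Doe", "1111", "XXX"),
                Arrays.asList("Doe, Jane", "2222", "YYY"));
        System.out.println(toCsv(headers, rows));
    }
}
```

For anything beyond simple exports, a dedicated CSV library is usually the safer choice.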

Regex pattern to get only particular part from List of String

As I am new to regex patterns, I want to extract a particular part from each string in the list below and store it in a Map as a key-value pair.
Example:
List<String> ref3Path = new ArrayList<String>();
ref3Path.add("s3://REF3/ca209_040/ahshd.csv");
ref3Path.add("s3://REF3/ca209_040/grren.csv");
ref3Path.add("s3://REF3/ca209_030/aestyyuae.csv");
I want only ca209_040 and aesae.csv from the above list, and I want to store them in a Map.
Below is the code I am writing to compare the list and the map:
public static void main(String[] args) {
// TODO Auto-generated method stub
MultiValuedMap<String, String> studyDomain = new ArrayListValuedHashMap<>();
List<String> ref3Path = new ArrayList<String>();
studyDomain.put("ca209_040", "czvv.csv");
studyDomain.put("ca209_040", "efe.csv");
studyDomain.put("ca209_030", "efef.csv");
studyDomain.put("ca209_030", "hhhjd.csv");
studyDomain.put("ca209_020", "rr.csv");
studyDomain.put("ca209_020", "eghh.csv");
ref3Path.add("s3://REF3/ca209_040/jlkjl.csv");
ref3Path.add("s3://REF3/ca209_040/aesaehkhk.csv");
ref3Path.add("s3://REF3/ca209_030/aesaedhd.csv");
ref3Path.add("null");
ref3Path.add("s3://REF3/ca209_020/aedae.csv");
ref3Path.add("s3://REF3/ca209_020/aeqwee.csv");
log.info("List of inbox: " +studyDomain);
log.info("List of ref3 :" +ref3Path);
rule1(studyDomain,ref3Path);
}

You can do it with a simple split function. A regex pattern is overkill.
Note: check for boundary conditions
scala> val arr = "s3://REF3/ca209_040/aesae.csv".split('/')
arr: Array[String] = Array(s3:, "", REF3, ca209_040, aesae.csv)
scala> arr(3)
res2: String = ca209_040
scala> arr(4)
res3: String = aesae.csv
In Java
Note: check for boundary conditions
public class Demo {
public static void main(String[] s) {
String[] args = "s3://REF3/ca209_040/aesae.csv".split("/");
System.out.println(args[3] + " , " + args[4]);
}
}
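To make the "check for boundary conditions" note concrete, here is a minimal sketch that applies the same split to a whole list and collects the results into a map from study folder to file names. It skips entries that do not have the expected five parts, such as the "null" string in the question's list. The class name PathSplitter is an assumption, and a plain Map<String, List<String>> is used instead of the question's MultiValuedMap so that the example needs only the standard library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PathSplitter {

    // Maps each study folder (e.g. "ca209_040") to the file names found under it.
    // "s3://REF3/ca209_040/x.csv" splits into ["s3:", "", "REF3", "ca209_040", "x.csv"],
    // so anything that does not yield exactly five parts is skipped.
    static Map<String, List<String>> studyToFiles(List<String> paths) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String path : paths) {
            if (path == null) {
                continue;
            }
            String[] parts = path.split("/");
            if (parts.length != 5) {
                continue;
            }
            result.computeIfAbsent(parts[3], k -> new ArrayList<>()).add(parts[4]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ref3Path = Arrays.asList(
                "s3://REF3/ca209_040/jlkjl.csv",
                "s3://REF3/ca209_040/aesaehkhk.csv",
                "s3://REF3/ca209_030/aesaedhd.csv",
                "null",
                "s3://REF3/ca209_020/aedae.csv",
                "s3://REF3/ca209_020/aeqwee.csv");
        System.out.println(studyToFiles(ref3Path));
    }
}
```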

How to pass csv mapped bean class to Dataset

I wrote code to read a csv file and map all the columns to a bean class.
Now, I'm trying to set these values to a Dataset and getting an issue.
7/08/30 16:33:58 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: object is not an instance of declaring class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
If I try to set the values manually, it works fine:
public void run(String t, String u) throws FileNotFoundException {
JavaRDD<String> pairRDD = sparkContext.textFile("C:/temp/L1_result.csv");
JavaPairRDD<String,String> rowJavaRDD = pairRDD.mapToPair(new PairFunction<String, String, String>() {
public Tuple2<String,String> call(String rec) throws FileNotFoundException {
String[] tokens = rec.split(";");
String[] vals = new String[tokens.length];
for(int i= 0; i < tokens.length; i++){
vals[i] =tokens[i];
}
return new Tuple2<String, String>(tokens[0], tokens[1]);
}
});
ColumnPositionMappingStrategy cpm = new ColumnPositionMappingStrategy();
cpm.setType(funds.class);
String[] csvcolumns = new String[]{"portfolio_id", "portfolio_code"};
cpm.setColumnMapping(csvcolumns);
CSVReader csvReader = new CSVReader(new FileReader("C:/temp/L1_result.csv"));
CsvToBean csvtobean = new CsvToBean();
List csvDataList = csvtobean.parse(cpm, csvReader);
for (Object dataobject : csvDataList) {
funds fund = (funds) dataobject;
System.out.println("Portfolio:"+fund.getPortfolio_id()+ " code:"+fund.getPortfolio_code());
}
/* funds b0 = new funds();
b0.setK("k0");
b0.setSomething("sth0");
funds b1 = new funds();
b1.setK("k1");
b1.setSomething("sth1");
List<funds> data = new ArrayList<funds>();
data.add(b0);
data.add(b1);*/
System.out.println("Portfolio:" + rowJavaRDD.values());
//manual set works fine ///
// Dataset<Row> fundDf = SQLContext.createDataFrame(data, funds.class);
Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);
fundDf.printSchema();
fundDf.write().option("mergeschema", true).parquet("C:/test");
}
The line below, which uses rowJavaRDD.values(), is the one causing the issue:
Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);
What is the resolution to this? Whatever values I'm column-mapping should be passed here, but how should this be done? Any ideas would really help.

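Dataset<Row> fundDf = SQLContext.createDataFrame(csvDataList, funds.class);
Passing the list of parsed beans worked. rowJavaRDD.values() is a JavaRDD<String>, so its elements are not funds instances, which is why the reflective getter calls failed with "object is not an instance of declaring class".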
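The stack trace in the question ends in a reflective method invocation, which suggests the bean's getters are being invoked on each element. The following is a minimal sketch, in plain Java without Spark, of how that kind of call fails when the element is a String rather than the bean. The class BeanReflectionDemo, the Funds stand-in, and its single property are hypothetical and only illustrate the mechanism.

```java
import java.lang.reflect.Method;

class BeanReflectionDemo {

    // Hypothetical stand-in for the question's funds bean.
    public static class Funds {
        private final String portfolioId;

        public Funds(String portfolioId) {
            this.portfolioId = portfolioId;
        }

        public String getPortfolioId() {
            return portfolioId;
        }
    }

    // Invokes the Funds getter on an arbitrary object through reflection.
    // Passing anything that is not a Funds makes invoke throw IllegalArgumentException.
    static Object readPortfolioId(Object element) {
        try {
            Method getter = Funds.class.getMethod("getPortfolioId");
            return getter.invoke(element);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readPortfolioId(new Funds("P1")));
        try {
            readPortfolioId("P1;CODE1"); // a raw CSV line, like an element of rowJavaRDD.values()
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The exact exception message varies between JDK versions, so only the exception type should be relied on.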
