Converting a list of object values to group - java

I have the following piece of code:
OrderCriteria o1 = new OrderCriteria(1, 1, 101, 201);
OrderCriteria o2 = new OrderCriteria(1, 1, 102, 202);
OrderCriteria o4 = new OrderCriteria(1, 1, 102, 201);
OrderCriteria o5 = new OrderCriteria(2, 2, 501, 601);
OrderCriteria o6 = new OrderCriteria(2, 2, 501, 602);
OrderCriteria o7 = new OrderCriteria(2, 2, 502, 601);
OrderCriteria o8 = new OrderCriteria(2, 2, 502, 602);
OrderCriteria o9 = new OrderCriteria(2, 2, 503, 603);
Where OrderCriteria looks like below:
public class OrderCriteria {
    private final long orderId;
    private final long orderCatalogId;
    private final long procedureId;
    private final long diagnosisId;

    public OrderCriteria(long orderId, long orderCatalogId, long procedureId, long diagnosisId) {
        this.orderId = orderId;
        this.orderCatalogId = orderCatalogId;
        this.procedureId = procedureId;
        this.diagnosisId = diagnosisId;
    }
    // Getters
}
What I want is to get a list of procedures and a list of diagnoses grouped by order id. So it should return:
{1, {101, 102}, {201, 202}}
{2, {501, 502, 503}, {601, 602, 603}}
which means the order with id 1 has procedure ids 101, 102 and diagnosis ids 201, 202, and so on. I tried using Google Guava's Table but could not come up with any valid solution.

First you'll need a new structure to hold the grouped data:
class OrderCriteriaGroup {
    final Set<Long> procedures = new HashSet<>();
    final Set<Long> diagnoses = new HashSet<>();

    void add(OrderCriteria o) {
        procedures.add(o.getProcedureId());
        diagnoses.add(o.getDiagnosisId());
    }

    OrderCriteriaGroup merge(OrderCriteriaGroup g) {
        procedures.addAll(g.procedures);
        diagnoses.addAll(g.diagnoses);
        return this;
    }
}
add() and merge() are convenience methods that will help us stream and collect the data, like so:
Map<Long, OrderCriteriaGroup> grouped = criteriaList.stream()
        .collect(Collectors.groupingBy(OrderCriteria::getOrderId,
                Collector.of(
                        OrderCriteriaGroup::new,
                        OrderCriteriaGroup::add,
                        OrderCriteriaGroup::merge)));
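For completeness, a quick usage sketch, assuming criteriaList holds the o1..o9 instances from the question (HashSet ordering is not guaranteed, so the printed set order may vary):
List<OrderCriteria> criteriaList = Arrays.asList(o1, o2, o4, o5, o6, o7, o8, o9);
// ... build `grouped` with the collector above, then:
grouped.forEach((orderId, g) ->
        System.out.println(orderId + " -> procedures=" + g.procedures
                + ", diagnoses=" + g.diagnoses));
// 1 -> procedures=[101, 102], diagnoses=[201, 202]
// 2 -> procedures=[501, 502, 503], diagnoses=[601, 602, 603]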

I highly recommend changing the output structure. The current one, judging by your example, is probably Map<Long, List<Set<Long>>>. I suggest you distinguish between the "procedure" and "diagnosis" sets of data using the following structure:
Map<Long, Map<String, Set<Long>>> map = new HashMap<>();
Now filling the data is quite easy:
for (OrderCriteria oc : list) {
    if (!map.containsKey(oc.getOrderId())) {
        // First time we see this order id: create the two empty sets.
        Map<String, Set<Long>> innerMap = new HashMap<>();
        innerMap.put("procedure", new HashSet<>());
        innerMap.put("diagnosis", new HashSet<>());
        map.put(oc.getOrderId(), innerMap);
    }
    map.get(oc.getOrderId()).get("procedure").add(oc.getProcedureId());
    map.get(oc.getOrderId()).get("diagnosis").add(oc.getDiagnosisId());
}
Output: {1={diagnosis=[201, 202], procedure=[101, 102]}, 2={diagnosis=[601, 602, 603], procedure=[501, 502, 503]}}
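A more compact way to fill the same structure is Map.computeIfAbsent (Java 8+); a sketch:
for (OrderCriteria oc : list) {
    // Create the inner map with its two sets only on the first occurrence of the order id.
    Map<String, Set<Long>> inner = map.computeIfAbsent(oc.getOrderId(), id -> {
        Map<String, Set<Long>> m = new HashMap<>();
        m.put("procedure", new HashSet<>());
        m.put("diagnosis", new HashSet<>());
        return m;
    });
    inner.get("procedure").add(oc.getProcedureId());
    inner.get("diagnosis").add(oc.getDiagnosisId());
}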
If you insist on the structure you have drafted, you would have to remember that the first Set contains the procedures and the second one the diagnoses, and the maintenance would be impractical.
Map<Long, List<Set<Long>>> map = new HashMap<>();
for (OrderCriteria oc : list) {
    if (!map.containsKey(oc.getOrderId())) {
        // First time we see this order id: create the two empty sets.
        List<Set<Long>> listOfSet = new ArrayList<>();
        listOfSet.add(new HashSet<>());
        listOfSet.add(new HashSet<>());
        map.put(oc.getOrderId(), listOfSet);
    }
    map.get(oc.getOrderId()).get(0).add(oc.getProcedureId());
    map.get(oc.getOrderId()).get(1).add(oc.getDiagnosisId());
}
Output: {1=[[101, 102], [201, 202]], 2=[[501, 502, 503], [601, 602, 603]]}
Alternatively, you might want to create a new object with two Set<Long> fields to store the data instead (another answer shows the way).

Related

Java8 Sorting Custom Objects having Custom Object in it

I have an Employee object with a Department object inside it. I need to sort by the Employee fields and then by the Department fields too. The data looks like below.
public static List<Employee> getEmployeeData() {
    Department account = new Department("Account", 75);
    Department hr = new Department("HR", 50);
    Department ops = new Department("OP", 25);
    Department tech = new Department("Tech", 150);
    List<Employee> employeeList = Arrays.asList(
            new Employee("David", 32, "Matara", account),
            new Employee("Brayan", 25, "Galle", hr),
            new Employee("JoAnne", 45, "Negombo", ops),
            new Employee("Jake", 65, "Galle", hr),
            new Employee("Brent", 55, "Matara", hr),
            new Employee("Allice", 23, "Matara", ops),
            new Employee("Austin", 30, "Negombo", tech),
            new Employee("Gerry", 29, "Matara", tech),
            new Employee("Scote", 20, "Negombo", ops),
            new Employee("Branden", 32, "Matara", account),
            new Employee("Iflias", 31, "Galle", hr));
    return employeeList;
}
I want to sort by Employee::name, then Employee::age, then by the department's name. How can it be sorted?
This should lead to the desired result:
List<Employee> employees = getEmployeeData()
        .stream()
        .sorted(Comparator
                .comparing(Employee::getName)
                .thenComparing(Employee::getAge)
                .thenComparing(e -> e.getDepartment().getName()))
        .collect(Collectors.toList());
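If some employees could have a null department (an assumption, not stated in the question), the last sort key can be made null-safe; a sketch:
// Hypothetical null-safe variant: employees without a department sort last.
Comparator<Employee> byDeptName = Comparator.comparing(
        (Employee e) -> e.getDepartment() == null ? null : e.getDepartment().getName(),
        Comparator.nullsLast(Comparator.naturalOrder()));
Then use .thenComparing(byDeptName) as the final step instead.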

grouping values with different values in columns using java lambda

I have this object, QuoteProductDTO, with three columns (name, value1, value2):
List<QuoteProductDTO> lstQuoteProductDTO = new ArrayList<>();
lstQuoteProductDTO.add( new QuoteProductDTO("product", 10, 15.5) );
lstQuoteProductDTO.add( new QuoteProductDTO("product", 05, 2.5) );
lstQuoteProductDTO.add( new QuoteProductDTO("product", 13, 1.0) );
lstQuoteProductDTO.add( new QuoteProductDTO("product", 02, 2.0) );
I need to get a consolidated result (a new QuoteProductDTO object):
for the first column (name), I have to take the first value, "product";
for the second one (value1), I have to take the biggest value, 13;
and for the third column (value2), I have to take the sum of all the values, 21.0.
This takes the data provided and generates a new object with the required values. It uses the Collectors.teeing() method, available in Java 12+.
Given the following data:
ArrayList<QuoteProductDTO> lstQuoteProductDTO = new ArrayList<>();
ArrayList<QuoteProductDTO> nextQuoteProductDTO = new ArrayList<>();
// empty Quote for Optional handling below.
QuoteProductDTO emptyQuote = new QuoteProductDTO("EMPTY", -1, -1);
lstQuoteProductDTO.add(new QuoteProductDTO("Product", 10, 15.5));
lstQuoteProductDTO.add(new QuoteProductDTO("Product", 05, 2.5));
lstQuoteProductDTO.add(new QuoteProductDTO("Product", 13, 1.0));
lstQuoteProductDTO.add(new QuoteProductDTO("Product", 02, 2.0));
You can consolidate like you want into a new instance of QuoteProductDTO.
QuoteProductDTO prod = lstQuoteProductDTO.stream()
        .collect(Collectors.teeing(
                Collectors.maxBy(Comparator.comparing(p -> p.value1)),
                Collectors.summingDouble(p -> p.value2),
                (a, b) -> new QuoteProductDTO(
                        a.orElse(emptyQuote).name,
                        a.orElse(emptyQuote).value1,
                        b.doubleValue())));
System.out.println(prod);
Prints
Product, 13, 21.0
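If you're below Java 12 and can't use teeing(), a simple two-pass sketch over the same list yields the same result (assuming the list is non-empty):
// Pre-Java-12 fallback (sketch): two separate passes instead of one teeing collector.
QuoteProductDTO prod8 = new QuoteProductDTO(
        lstQuoteProductDTO.get(0).name,                                        // first name
        lstQuoteProductDTO.stream().mapToInt(p -> p.value1).max().orElse(-1),  // max value1
        lstQuoteProductDTO.stream().mapToDouble(p -> p.value2).sum());         // sum of value2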
You can also take a list of lists of different products and turn them into a list of consolidated products. Add the following to a second list and then put both lists into a main list.
nextQuoteProductDTO.add(new QuoteProductDTO("Product2", 10, 15.5));
nextQuoteProductDTO.add(new QuoteProductDTO("Product2", 25, 20.5));
nextQuoteProductDTO.add(new QuoteProductDTO("Product2", 13, 1.0));
nextQuoteProductDTO.add(new QuoteProductDTO("Product2", 02, 2.0));
List<List<QuoteProductDTO>> list = List.of(
lstQuoteProductDTO, nextQuoteProductDTO);
Now consolidate those into a list of objects.
List<QuoteProductDTO> prods = list.stream()
        .map(lst -> lst.stream()
                .collect(Collectors.teeing(
                        Collectors.maxBy(Comparator.comparing(p -> p.value1)),
                        Collectors.summingDouble(p -> p.value2),
                        (a, b) -> new QuoteProductDTO(
                                a.orElse(emptyQuote).name,
                                a.orElse(emptyQuote).value1,
                                b.doubleValue()))))
        .collect(Collectors.toList());
prods.forEach(System.out::println);
Prints
Product, 13, 21.0
Product2, 25, 39.0
I created a class to help demonstrate this.
class QuoteProductDTO {
    public String name;
    public int value1;
    public double value2;

    public QuoteProductDTO(String name, int value1, double value2) {
        this.name = name;
        this.value1 = value1;
        this.value2 = value2;
    }

    public String toString() {
        return name + ", " + value1 + ", " + value2;
    }
}

Spark Java: how to compare schemas when columns are not in the same order?

Following this question, I now run this code:
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
StructType schema1 = DataTypes.createStructType(fields);
Dataset<Row> df1 = spark.sql("select 1 as A, 2.2 as B");
Dataset<Row> finalDf1 = spark.createDataFrame(df1.javaRDD(), schema1);
fields = new ArrayList<>();
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
StructType schema2 = DataTypes.createStructType(fields);
Dataset<Row> df2 = spark.sql("select 2.2 as B, 1 as A");
Dataset<Row> finalDf2 = spark.createDataFrame(df2.javaRDD(), schema2);
finalDf1.printSchema();
finalDf2.printSchema();
System.out.println(finalDf1.schema());
System.out.println(finalDf2.schema());
System.out.println(finalDf1.schema().equals(finalDf2.schema()));
Here's the output:
root
|-- A: long (nullable = true)
|-- B: double (nullable = true)
root
|-- B: double (nullable = true)
|-- A: long (nullable = true)
StructType(StructField(A,LongType,true), StructField(B,DoubleType,true))
StructType(StructField(B,DoubleType,true), StructField(A,LongType,true))
false
While the columns are not arranged in the same order, both datasets have exactly the same columns and column types. What comparison is required here in order to get true?
Assuming the column order may not match, that the same name implies the same semantics, and that the same number of columns is required, here is an example in Scala that you should be able to tailor to Java:
import spark.implicits._
val df = sc.parallelize(Seq(
("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")
val names = df.columns
val df2 = sc.parallelize(Seq(
("A", "X", 2, 1))).toDF("c1", "c2", "Val1", "Val2")
val names2 = df2.columns
names.sortWith(_ < _) sameElements names2.sortWith(_ < _)
This returns true or false; experiment with the input.
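A rough Java equivalent of this name check, using Dataset.columns() on the DataFrames from the question (a sketch; types are not compared here):
// Compare column names ignoring order.
String[] names1 = finalDf1.columns();
String[] names2 = finalDf2.columns();
Arrays.sort(names1);
Arrays.sort(names2);
boolean sameNames = Arrays.equals(names1, names2); // true for finalDf1/finalDf2 above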
If the columns are in a different order, then the schemas are not the same, even when both have the same number of columns and the same names. If you want to check whether both schemas have the same column names, get the field names of both DataFrames into lists and write the comparison yourself; see the Java example below.
public static void main(String[] args) {
    List<String> firstSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.firstSchemaFields).fieldNames());
    List<String> secondSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.secondSchemaFields).fieldNames());
    if (schemasHaveTheSameColumnNames(firstSchema, secondSchema)) {
        System.out.println("Yes, schemas have the same column names");
    } else {
        System.out.println("No, schemas do not have the same column names");
    }
}

private static boolean schemasHaveTheSameColumnNames(List<String> firstSchema, List<String> secondSchema) {
    if (firstSchema.size() != secondSchema.size()) {
        return false;
    }
    for (String column : secondSchema) {
        if (!firstSchema.contains(column)) {
            return false;
        }
    }
    return true;
}
Following the previous answers, it seems the fastest way to compare the StructFields (both names and types), and not just the names, is the following:
Set<StructField> set1 = new HashSet<>(Arrays.asList(schema1.fields()));
Set<StructField> set2 = new HashSet<>(Arrays.asList(schema2.fields()));
boolean result = set1.equals(set2);
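This works because StructField is a Scala case class with structural equals and hashCode, so set equality ignores column order while still comparing names, types, and nullability.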

Extract Aggregator values in Batch Execution

Is there any way to programmatically extract the final values of the aggregators after a Dataflow batch execution?
Based on the DirectPipelineRunner class, I wrote the following method. It seems to work, but for dynamically created counters it gives different values than those shown in the console output.
P.S. If it helps, I'm assuming that the aggregators are based on Long values, with a sum combining function.
public static Map<String, Object> extractAllCounters(Pipeline p, PipelineResult pr) {
    AggregatorPipelineExtractor aggregatorExtractor = new AggregatorPipelineExtractor(p);
    Map<String, Object> results = new HashMap<>();
    for (Map.Entry<Aggregator<?, ?>, Collection<PTransform<?, ?>>> e :
            aggregatorExtractor.getAggregatorSteps().entrySet()) {
        Aggregator agg = e.getKey();
        try {
            results.put(agg.getName(), pr.getAggregatorValues(agg).getTotalValue(agg.getCombineFn()));
        } catch (AggregatorRetrievalException | IllegalArgumentException aggEx) {
            //System.err.println("Can't extract " + agg.getName() + ": " + aggEx.getMessage());
        }
    }
    return results;
}
The values of aggregators should be available in the PipelineResult. For example:
CountOddsFn countOdds = new CountOddsFn();
pipeline
.apply(Create.of(1, 3, 5, 7, 2, 4, 6, 8, 10, 12, 14, 20, 42, 68, 100))
.apply(ParDo.of(countOdds));
PipelineResult result = pipeline.run();
// Here you may need to use the BlockingDataflowPipelineRunner
AggregatorValues<Integer> values =
result.getAggregatorValues(countOdds.aggregator);
Map<String, Integer> valuesAtSteps = values.getValuesAtSteps();
// Now read the values from the step...
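For example, to print the value recorded at each step (a plain map iteration; the step names are assigned by the runner):
for (Map.Entry<String, Integer> entry : valuesAtSteps.entrySet()) {
    System.out.println(entry.getKey() + " -> " + entry.getValue());
}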
Example DoFn that reports the aggregator:
private static class CountOddsFn extends DoFn<Integer, Void> {
    Aggregator<Integer, Integer> aggregator =
        createAggregator("odds", new SumIntegerFn());

    @Override
    public void processElement(ProcessContext c) throws Exception {
        if (c.element() % 2 == 1) {
            aggregator.addValue(1);
        }
    }
}

Parallelize a collection with Spark

I'm trying to parallelize a collection with Spark and the example in the documentation doesn't seem to work:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
I'm creating a list of LabeledPoints from records, each of which contains data points (double[]) and a label (defaulted: true/false).
public List<LabeledPoint> createLabeledPoints(List<ESRecord> records) {
    List<LabeledPoint> points = new ArrayList<>();
    for (ESRecord rec : records) {
        points.add(new LabeledPoint(
                rec.defaulted ? 1.0 : 0.0, Vectors.dense(rec.toDataPoints())));
    }
    return points;
}

public void test(List<ESRecord> records) {
    SparkConf conf = new SparkConf().setAppName("SVM Classifier Example");
    SparkContext sc = new SparkContext(conf);
    List<LabeledPoint> points = createLabeledPoints(records);
    JavaRDD<LabeledPoint> data = sc.parallelize(points);
    ...
}
The function signature of parallelize no longer takes a single parameter; here is how it looks in spark-mllib_2.11 v1.3.0: sc.parallelize(seq, numSlices, evidence$1)
So any ideas on how to get this working?
In Java, you should use JavaSparkContext instead of SparkContext; its parallelize method accepts a java.util.List directly.
https://spark.apache.org/docs/0.6.2/api/core/spark/api/java/JavaSparkContext.html
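A minimal sketch of the test method rewritten this way, reusing the names from the question:
public void test(List<ESRecord> records) {
    SparkConf conf = new SparkConf().setAppName("SVM Classifier Example");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    List<LabeledPoint> points = createLabeledPoints(records);
    JavaRDD<LabeledPoint> data = jsc.parallelize(points); // accepts a java.util.List
    // ...
}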
