Apache Beam Group by Aggregate Fields - java

I have a PCollection reading data from AvroIO. I want to group by a specific key and then count the distinct values of some fields within each group.
In Pig or plain SQL this is just a GROUP BY with a COUNT(DISTINCT ...), but I can't work out how to express it in Beam.
So far I have been able to write this:
Schema schema = new Schema.Parser().parse(new File(options.getInputSchema()));
Pipeline pipeline = Pipeline.create(options);
PCollection<GenericRecord> inputData = pipeline.apply(AvroIO.readGenericRecords(schema).from(options.getInput()));
PCollection<Row> filteredData = inputData.apply(Select.fieldNames("user_id", "field1", "field2"));
PCollection<Row> groupedData = filteredData.apply(Group.byFieldNames("user_id")
    .aggregateField("field1", Count.perElement(), "out_field1")
    .aggregateField("field2", Count.perElement(), "out_field2"));
But the aggregateField method does not accept these arguments.
Can someone show the correct way to do this?
Thanks!

Group.aggregateField expects a CombineFn, whereas Count.perElement() is a PTransform, which is why the call is rejected. You can replace Count.perElement() with a counting CombineFn, for example the one returned by Count.combineFn():
filteredData.apply(Group.byFieldNames("user_id")
    .aggregateField("field1", Count.combineFn(), "out_field1")
    .aggregateField("field2", Count.combineFn(), "out_field2"));

Related

How to use $toLower and $trim aggregation operators using MongoDB Java?

I have a collection called Users, which contains an array called skills.
I use the following code to unwind the skills array and count the number of documents associated with each skill:
Bson uw = unwind("$skills");
Bson sbc = sortByCount("$skills");
Bson limit = limit(10);
coll.aggregate(Arrays.asList(uw, sbc, limit)).forEach(printDocuments());
Now I want to make use of the $trim and $toLower operators in the above aggregation, because in the database some skills are saved in different ways (e.g., "CSS", "CSS ", and "css").
I'm able to do this in the mongo shell with the following aggregation:
db.users.aggregate([{$unwind:"$skills"} , {$sortByCount:{$toLower:{$trim:{input:"$skills"}}}}])
But I'm having trouble implementing it in Java.
Do you have any idea?
I managed to do this by changing the sortByCount stage to the following:
Bson sbc = sortByCount(eq("$toLower", eq("$trim", eq("input", "$skills"))));
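Equivalently, here is a sketch of the same pipeline with the $sortByCount key built from org.bson.Document objects, which makes the generated expression explicit (printDocuments() is the helper from the question):
import static com.mongodb.client.model.Aggregates.*;
import java.util.Arrays;
import org.bson.Document;
import org.bson.conversions.Bson;

Bson uw = unwind("$skills");
// { $sortByCount: { $toLower: { $trim: { input: "$skills" } } } }
Bson sbc = sortByCount(
    new Document("$toLower",
        new Document("$trim", new Document("input", "$skills"))));
Bson limit = limit(10);
coll.aggregate(Arrays.asList(uw, sbc, limit)).forEach(printDocuments());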

How to sum a mongodb inner field and push it during grouping using MongoTemplate

I can use $sum inside a $push operation in the MongoDB console. However, I can't figure out how to do the same using MongoTemplate:
$group: {
    _id: "$some_id",
    my_field: { $push: { $sum: "$my_field" } }
}
The code I am using for this is something like:
Aggregation aggregation =
    Aggregation.newAggregation(
        match(matchingCriteria),
        group("some_id")
            .count().as("count")
            .push("my_field").as("my_field"),
        project("some_id", "count", "my_field"));
AggregationResults<MyModel> result =
    mongoTemplate.aggregate(aggregation, "my_collection", MyModel.class);
The thing is, I want the sum of my_field, but here it comes back as an array of my_field values (since I am pushing the raw field directly). The shell aggregation above, with $sum inside $push, gives me what I want, but I cannot express that with MongoTemplate. My app is in Spring Boot. I have also looked into the docs for these methods but couldn't find much.
Also, I tried using .sum() directly on the field (without the push), but that does not work because my_field is an inner object: after grouping it is an array of numbers, not a number. That is why I need the push and sum combination.
Any help regarding this is appreciated. Thanks in advance.
I was able to get this to work using the below code:
Aggregation aggregation =
    Aggregation.newAggregation(
        match(allTestMatchingCriteria),
        project("some_id")
            .and(AccumulatorOperators.Sum.sumOf("my_field"))
            .as("my_field_sum"),
        group("some_id")
            .count().as("count")
            .push("my_field_sum").as("my_field_sum"),
        project("some_id", "count", "my_field_sum"));
AggregationResults<MyModel> result =
    mongoTemplate.aggregate(aggregation, "my_collection", MyModel.class);
I used AccumulatorOperators.Sum in the projection stage itself to sum the inner field and get the desired output. Then I passed this to the grouping stage, where I also did the count aggregation (since I need that data as well), and finally projected all the generated fields into the output.
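For completeness, the result type needs properties matching the projected names; a hypothetical sketch (the exact mapping annotations depend on your model):
import java.util.List;
import org.springframework.data.mongodb.core.mapping.Field;

// Hypothetical result model; names must line up with the projected fields above.
public class MyModel {
  @Field("some_id")
  private String someId;
  private long count;
  @Field("my_field_sum")
  private List<Double> myFieldSum; // $push collects one summed value per document
  // getters and setters omitted
}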

Converting String to Double with TableSource, Table or DataSet object in Java

I have imported data from a CSV file into Flink Java. One of the attributes (Result) I had to import as a String because of parsing errors. Now I want to convert the String to a Double, but I don't know how to do that with a TableSource, Table, or DataSet object. See my code below.
I've looked into the Flink documentation and tried some solutions with Map and FlatMap, but did not find a solution there.
BatchTableEnvironment tableEnv = BatchTableEnvironment.create(fbEnv);
// Get H data from the CSV file.
TableSource csvSource = CsvTableSource.builder()
    .path("Path")
    .fieldDelimiter(";")
    .field("ID", Types.INT())
    .field("result", Types.STRING())
    .field("unixDateTime", Types.LONG())
    .build();
// register the TableSource
tableEnv.registerTableSource("HTable", csvSource);
Table HTable = tableEnv.scan("HTable");
DataSet<Row> result = tableEnv.toDataSet(HTable, Row.class);
I think it should work to use a combination of REPLACE and CAST to convert the strings to doubles, as in "SELECT id, CAST(REPLACE(result, ',', '.') AS DOUBLE) AS result, ..." or the equivalent using the Table API.
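Wired into the code above, a sketch (assuming the comma decimal separator this implies, and the "HTable" registration from the question):
// Convert the string column to DOUBLE via SQL on the registered table.
Table converted = tableEnv.sqlQuery(
    "SELECT ID, CAST(REPLACE(result, ',', '.') AS DOUBLE) AS result, unixDateTime FROM HTable");
DataSet<Row> convertedResult = tableEnv.toDataSet(converted, Row.class);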

Implementing a user defined aggregation function to be used in RelationalGroupedDataset.agg() using Java

It seems like you can aggregate multiple columns like this:
Dataset<Row> df = spark.read().textFile(inputFile);
List<Row> result = df.groupBy("id")
.agg(sum(df.col("price")), avg(df.col("weight")))
.collectAsList();
Now, I want to write my own custom aggregation function instead of sum or avg. How can I do that?
The Spark documentation shows how to create a custom aggregation function, but that one is registered and then used in SQL, and I don't think it can be used in the .agg() function, since agg accepts Column instances and the custom aggregation function is not one.
If you have a class GeometricMean which extends UserDefinedAggregateFunction, then you can use it like this (taken from https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html):
// Create an instance of UDAF GeometricMean.
val gm = new GeometricMean
// Show the geometric mean of values of column "id".
df.groupBy("group_id").agg(gm(col("id")).as("GeometricMean")).show()
It should be easy to translate this into Java.
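A sketch of that translation, assuming GeometricMean extends UserDefinedAggregateFunction (whose apply(Column...) returns the Column that agg needs):
import static org.apache.spark.sql.functions.col;

// Instantiate the UDAF and invoke it on a column inside agg().
GeometricMean gm = new GeometricMean();
df.groupBy("group_id")
    .agg(gm.apply(col("id")).alias("GeometricMean"))
    .show();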

Perform aggregation in Dataflow

I am storing (time series) values in Bigtable and have come across a use case where I need to apply a filter on these values and perform an aggregation. I am using the following configuration to get the connection to Bigtable (to perform range scans, etc.):
Connection connection = BigtableConfiguration.connect(projectId, instanceId);
Table table = connection.getTable(TableName.valueOf(tableId));
table.getScanner(<a scanner with filter>);
This gives me a ResultScanner and I can iterate over the rows. However, what I want to do is perform an aggregation on certain columns and get the values. The SQL equivalent of what I want to do would be this:
SELECT SUM(A), SUM(B)
FROM table
WHERE C = D;
To do the same in HBase, I came across AggregationClient (javadoc here); however, it requires a Configuration, and I need something that runs against Bigtable (so that I don't have to use the low-level HBase APIs).
I checked the documentation and couldn't find anything (in Java) that could do this. Can anyone share an example of performing an aggregation with (non-row-key, or any) filters on Bigtable?
Bigtable does not natively have any aggregation mechanisms. In addition, Bigtable has difficulty processing WHERE C = D, so that type of processing is generally better done on the client side.
AggregationClient is an HBase coprocessor. Cloud Bigtable does not support coprocessors.
If you want to use Cloud Bigtable for this type of aggregation, you'll have to use table.scan() and your own logic, as sketched below. If the scale is large enough, you'll have to use Dataflow or BigQuery to perform the aggregations.
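A client-side sketch of that scan-and-aggregate approach (the column family cf, the qualifiers A, B, C, the comparison value "D", and the assumption that A and B are stored as 8-byte longs are all hypothetical):
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

long sumA = 0;
long sumB = 0;
try (ResultScanner scanner = table.getScanner(new Scan())) {
  for (Result row : scanner) {
    // Apply the WHERE C = D condition on the client.
    byte[] c = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("C"));
    if (c != null && Bytes.toString(c).equals("D")) {
      sumA += Bytes.toLong(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("A")));
      sumB += Bytes.toLong(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("B")));
    }
  }
}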
And here's one way to do it at scale with Dataflow (note that this example reads the data through BigQuery):
PCollection<TableRow> rows = p.apply(BigQueryIO.readTableRows()
    .fromQuery("SELECT A, B FROM table;"));
// Key every A value with "A" and every B value with "B".
PCollection<KV<String, Integer>> valuesA = rows.apply(
    MapElements.into(TypeDescriptors.kvs(
            TypeDescriptors.strings(), TypeDescriptors.integers()))
        .via((TableRow row) -> KV.of("A", (Integer) row.getF().get(0).getV())));
PCollection<KV<String, Integer>> valuesB = rows.apply(
    MapElements.into(TypeDescriptors.kvs(
            TypeDescriptors.strings(), TypeDescriptors.integers()))
        .via((TableRow row) -> KV.of("B", (Integer) row.getF().get(1).getV())));
// Flatten both collections and sum per key, yielding KV("A", sum) and KV("B", sum).
PCollection<KV<String, Integer>> sums =
    PCollectionList.of(valuesA).and(valuesB)
        .apply(Flatten.pCollections())
        .apply(Sum.integersPerKey());
