Perform aggregation in Dataflow - java

I am storing the (time series) values in Bigtable and I have come across a use case where I need to apply a filter on these values and perform an aggregation. I am using the following configuration to get the connection to Bigtable (to perform range scan etc):
Connection connection = BigtableConfiguration.connect(projectId, instanceId);
Table table = connection.getTable(TableName.valueOf(tableId));
table.getScanner(<a scanner with filter>);
This gives me a ResultScanner, and I can iterate over the rows. However, what I want to do is perform an aggregation on certain columns and get the values. The SQL equivalent of what I want to do would be:
SELECT SUM(A), SUM(B)
FROM table
WHERE C = D;
To do the same in HBase, I came across AggregationClient (javadoc here); however, it requires a Configuration and I need something that runs against Bigtable (so that I don't need to use the low-level HBase APIs).
I checked the documentation and couldn't find anything (in Java) that could do this. Can anyone share an example of performing an aggregation with (non-row-key, or any) filters on Bigtable?

Bigtable does not natively have any aggregation mechanisms. In addition, Bigtable has difficulty processing WHERE C = D, so that type of processing is generally better done on the client side.
AggregationClient is an HBase coprocessor. Cloud Bigtable does not support coprocessors.
If you want to use Cloud Bigtable for this type of aggregation, you'll have to use table.scan() and your own logic. If the scale is large enough, you would have to use Dataflow or BigQuery to perform the aggregations.
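For moderate data volumes, a minimal client-side sketch of "table.scan() and your own logic" could look like the following (the column family "cf" and the qualifiers "A"/"B"/"C"/"D" are placeholders, and the cell values are assumed to be stored as 8-byte longs):
byte[] cf = Bytes.toBytes("cf");
long sumA = 0;
long sumB = 0;
try (ResultScanner scanner = table.getScanner(new Scan())) {
    for (Result row : scanner) {
        byte[] c = row.getValue(cf, Bytes.toBytes("C"));
        byte[] d = row.getValue(cf, Bytes.toBytes("D"));
        // client-side equivalent of WHERE C = D
        if (c == null || d == null || !Bytes.equals(c, d)) {
            continue;
        }
        byte[] a = row.getValue(cf, Bytes.toBytes("A"));
        byte[] b = row.getValue(cf, Bytes.toBytes("B"));
        if (a != null) sumA += Bytes.toLong(a);
        if (b != null) sumB += Bytes.toLong(b);
    }
}
// sumA and sumB now correspond to SELECT SUM(A), SUM(B) ... WHERE C = D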

Here's one way:
PCollection<TableRow> rows = p.apply(BigQueryIO.readTableRows()
    .fromQuery("SELECT A, B FROM table;"));

PCollection<KV<String, Integer>> valuesA =
    rows.apply(
        MapElements.into(TypeDescriptors.kvs(
                TypeDescriptors.strings(),
                TypeDescriptors.integers()))
            .via((TableRow row) -> KV.of(
                "A", (Integer) row.getF().get(0).getV())));

PCollection<KV<String, Integer>> valuesB =
    rows.apply(
        MapElements.into(TypeDescriptors.kvs(
                TypeDescriptors.strings(),
                TypeDescriptors.integers()))
            .via((TableRow row) -> KV.of(
                "B", (Integer) row.getF().get(1).getV())));

PCollection<KV<String, Integer>> sums =
    PCollectionList.of(valuesA).and(valuesB)
        .apply(Flatten.pCollections())
        .apply(Sum.integersPerKey());

Related

How to get the first result from a group (jOOQ)

My requirement is to take a list of identifiers, each of which could refer to multiple records, and return the newest record per identifier.
This would seem to be doable with a combination of orderBy(date, desc) and fetchGroups() on the identifier column. I then use values() to get the Result objects.
At this point, I want the first value in each result object. I can do get(0) to get the first value in the list, but that seems like cheating. Is there a better way to get that first result from a Result object?
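For reference, the client-side approach described above might be sketched as follows (T.IDENTIFIER is a hypothetical column name; the answer below explains why this is usually better pushed into SQL):
var groups = ctx.selectFrom(T)
                .orderBy(T.DATE.desc())
                .fetchGroups(T.IDENTIFIER); // one ordered Result per identifier

var newestPerIdentifier = groups.values()
                                .stream()
                                .map(result -> result.get(0)) // first row of each group = newest
                                .collect(Collectors.toList());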
You're going to write a top-1-per-category query, which is a special case of a top-n-per-category query. Most syntaxes that produce this behaviour in SQL are supported by jOOQ as well. You shouldn't do the grouping in the client, because you'd transfer all the excess data (the remaining rows per group) from the server to the client.
Some examples:
Standard SQL (when window functions are supported)
Field<Integer> rn = rowNumber().over(orderBy(T.DATE.desc())).as("rn");

var subquery = table(
    select(T.fields())
    .select(rn)
    .from(T)
).as("subquery");

var results =
ctx.select(subquery.fields(T.fields()))
   .from(subquery)
   .where(subquery.field(rn).eq(1))
   .fetch();
Teradata and H2 (we might emulate this soon)
var results =
ctx.select(T.fields())
   .from(T)
   .qualify(rowNumber().over(orderBy(T.DATE.desc())).eq(1))
   .fetch();
PostgreSQL
var results =
ctx.select(T.fields())
   .distinctOn(T.DATE)
   .from(T)
   .orderBy(T.DATE.desc())
   .fetch();
Oracle
var results =
ctx.select(
       T.DATE,
       max(T.COL1).keepDenseRankFirstOrderBy(T.DATE.desc()).as(T.COL1),
       max(T.COL2).keepDenseRankFirstOrderBy(T.DATE.desc()).as(T.COL2),
       ...
       max(T.COLN).keepDenseRankFirstOrderBy(T.DATE.desc()).as(T.COLN))
   .from(T)
   .groupBy(T.DATE)
   .fetch();

Using Apache Spark on low-end systems with Cassandra and Java

I want to use Apache Spark on my cluster, which is made up of 5 low-end machines. First I installed Cassandra 3.11.3 on my nodes, and all of my nodes are OK.
After that I inserted 100k records into my nodes with a Java API, without using Spark, and that is OK too.
Now I want to execute a simple query like as follows:
select * from myKeySpace.myTbl where field1='someValue';
Since my nodes have weak hardware, I want to get just a few records from myTbl, like this:
select * from myKeySpace.myTbl where field1='someValue' limit 20;
I have tested this (A) but it is very slow (and I don't know the reason):
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue' limit 20");
and also this (B), where I think Spark fetches all the data and then applies the limit, which is not my goal:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue'").limit(20);
I think I could use Spark Core (C) too. I also know that a method called perPartitionLimit was introduced in Cassandra 3.6 and later (D).
Since my nodes are weak, I don't want to fetch all the records from the Cassandra table and then apply a limit function or something like that. I want to fetch only a small number of records from my table, so that my nodes can handle it.
So what is the best solution?
update:
I have tried the suggestion given by @AKSW in the comments:
SparkConf conf = new SparkConf()
        .setAppName("SparkTest")
        .set("spark.cassandra.connection.host", "192.168.107.100");

long limit = 20;
JavaSparkContext jsc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
        .cassandraTable("myKeySpace", "myTbl")
        .select("id")
        .perPartitionLimit(limit);
System.out.println("Count: " + rdd1.count()); // output is "Count: 100000", which is wrong!
jsc.stop();
but perPartitionLimit(limit) with limit=20 does not work, and all the records are fetched!
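One thing to note: PER PARTITION LIMIT caps the rows returned per Cassandra partition key, not the total, so if each partition holds at most 20 rows the full 100k still come back. If the goal is a small overall sample, a sketch along these lines (assuming the connector 2.x Java API, where limit() adds a CQL LIMIT that is applied per Spark partition) keeps the fetch small:
CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
        .cassandraTable("myKeySpace", "myTbl")
        .select("id")
        .limit(20L); // CQL LIMIT, applied per Spark partition

List<CassandraRow> sample = rdd1.take(20); // strict cap of 20 rows on the driver
System.out.println("Sample size: " + sample.size());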

Spark writing to Cassandra with varying TTL

In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the dataframe to a Cassandra DB. The data must be written with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra with spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
        {
            put("keyspace", "key_space_name");
            put("table", "table_name");
            put("spark.cassandra.output.ttl", Long.toString(CONST_TTL)); // Should depend on the bucket_timestamp column
        }
    }).mode(SaveMode.Overwrite).save();
One possible way I thought of is: for each possible bucket_timestamp, filter the data by that timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?
The solution should work with Java and Dataset<Row>: I came across some solutions for doing this with RDDs in Scala, but couldn't find one for Java and DataFrames.
Thanks!
From Spark-Cassandra connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as:
constant value (withConstantTTL)
automatically resolved value (withAutoTTL)
column-based value (withPerRowTTL)
In your case you could try the last option and compute the TTL as a new column of the starting Dataset with the rule you provided in the question.
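A rough Java sketch of that column-based option (BucketRow and its fromRow(...) helper are hypothetical; the idea is to precompute CONST_TTL - (currentTime - bucket_timestamp) into a ttl field while mapping the Dataset<Row> down to a JavaRDD):
// Sketch only: BucketRow is a hypothetical bean whose "ttl" field holds
// the per-row TTL in seconds, computed from bucket_timestamp.
JavaRDD<BucketRow> rows = df.toJavaRDD()
        .map(r -> BucketRow.fromRow(r, CONST_TTL)); // hypothetical mapping helper

CassandraJavaUtil.javaFunctions(rows)
        .writerBuilder("key_space_name", "table_name",
                CassandraJavaUtil.mapToRow(BucketRow.class))
        .withPerRowTTL("ttl") // the column-based option listed above
        .saveToCassandra();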
For a usage example you can see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612
For the DataFrame API there is no support for such functionality yet... There is a JIRA for it - https://datastax-oss.atlassian.net/browse/SPARKC-416; you can watch it to get notified when it's implemented...
So the only choice you have is to use the RDD API as described in @bartosz25's answer...

How to get feature records for plan estimate changes using the Lookback API

I am using the Rally Lookback API with Java. I am trying to fetch historical data for features; the sample code I am using is shown below.
LookbackApi lookbackApi = new LookbackApi();
lookbackApi.setCredentials("username", "password");
lookbackApi.setWorkspace("47903209423");
lookbackApi.setServer("https://rally1.rallydev.com");
//lookbackApi.setWorkspace("90432948");
LookbackQuery query = lookbackApi.newSnapshotQuery();
query.addFindClause("_TypeHierarchy", "PortfolioItem/Feature");
query.setPagesize(200) // set pagesize to 200 instead of the default 20k
     .setStart(200) // ask for the second page of data
     .requireFields("ScheduleState", // A useful set of fields, add any others you may want
             "ObjectID",
             "State",
             "Project",
             "PlanEstimate",
             "_ValidFrom",
             "_ValidTo")
     .sortBy("_UnformattedID")
     .hydrateFields("ScheduleState", "State", "PlanEstimate", "Project"); // ScheduleState will come back as an OID if it doesn't get hydrated
LookbackResult resultSet = query.execute();
int resultCount = resultSet.Results.size();
Map<String, Object> firstSnapshot = resultSet.Results.get(0);
Iterator<Map<String, Object>> iterator = resultSet.getResultsIterator();
while (iterator.hasNext()) {
    Map<String, Object> snapshot = iterator.next();
}
I need a way to add a condition so that the query fetches all the historical records where the plan estimate changed, but ignores the rest of the history for any feature and its underlying user stories. I need this so that we can track plan estimate changes while avoiding fetching unnecessary data and reducing the time it takes.
I'm not familiar with the Java toolkit, but using the raw Lookback API, you would accomplish this with a filter clause like {"_PreviousValues.PlanEstimate": {"$exists": true}}.
Map<String, Object> ifExist = new HashMap<>();
ifExist.put("$exists", true);
// Note: true is a Java boolean; be careful with this, as the string "true" will not work.
query.addFindClause("_PreviousValues.PlanEstimate", ifExist);
Additionally, one needs to consider adding "_PreviousValues.PlanEstimate" to
.requireFields(), since requiring and hydrating only "PlanEstimate" will not return the previous value.
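Putting both suggestions together, the query setup might look roughly like this (field list trimmed for brevity):
Map<String, Object> planEstimateChanged = new HashMap<>();
planEstimateChanged.put("$exists", true); // Boolean true, not the string "true"

LookbackQuery query = lookbackApi.newSnapshotQuery();
query.addFindClause("_TypeHierarchy", "PortfolioItem/Feature");
query.addFindClause("_PreviousValues.PlanEstimate", planEstimateChanged);
query.requireFields("ObjectID", "PlanEstimate", "_PreviousValues.PlanEstimate",
                "_ValidFrom", "_ValidTo")
     .hydrateFields("PlanEstimate", "Project");

LookbackResult resultSet = query.execute();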

Inserting Analytic data from Spark to Postgres

I have a Cassandra database from which I analyzed data using Spark SQL through Apache Spark. Now I want to insert that analyzed data into PostgreSQL. Is there any way to achieve this directly, apart from using the PostgreSQL driver? (I achieved it using PostgREST and the driver; I want to know whether there is a method like saveToCassandra().)
At the moment there is no native implementation of writing the RDD to any DBMS. Here are the links to the related discussions in the Spark user list: one, two
In general, the most performant approach would be the following:
Validate the number of partitions in the RDD; it should be neither too low nor too high. 20-50 partitions should be fine: if the number is lower, call repartition with 20 partitions; if higher, call coalesce to 50 partitions.
Call the mapPartitions transformation, and inside of it call a function to insert the records into your DBMS using JDBC. In this function you open a connection to your database and use the COPY command with this API, which eliminates the need for a separate command for each record - this way the insert is processed much faster.
This way you would insert the data into Postgres in a parallel fashion, utilizing up to 50 parallel connections (depending on your Spark cluster size and its configuration). The whole approach could be implemented as a Java/Scala function accepting the RDD and the connection string.
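A rough Java sketch of that approach (jdbcUrl, user, password, my_table and the toCsvLine(...) formatter are hypothetical; the COPY itself goes through the PostgreSQL JDBC driver's CopyManager):
// Repartition to a bounded number of partitions, then stream each
// partition into Postgres over its own connection using COPY.
rdd.repartition(20).foreachPartition(rows -> {
    try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
        CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
        StringBuilder csv = new StringBuilder();
        while (rows.hasNext()) {
            csv.append(toCsvLine(rows.next())).append('\n'); // hypothetical CSV formatter
        }
        copy.copyIn("COPY my_table FROM STDIN WITH CSV", new StringReader(csv.toString()));
    }
});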
You can use the Postgres COPY API to write it; it's much faster that way. See the following two methods - one iterates over the RDD to fill a buffer that can be saved by the COPY API. The only thing you have to take care of is creating a correct statement in CSV format to be used by the COPY API.
def saveToDB(rdd: RDD[Iterable[EventModel]]): Unit = {
  val sb = mutable.StringBuilder.newBuilder
  val now = System.currentTimeMillis()
  rdd.collect().foreach(itr => {
    itr.foreach(_.createCSV(sb, now).append("\n"))
  })
  copyIn("myTable", new StringReader(sb.toString), "statement")
  sb.clear
}

def copyIn(tableName: String, reader: java.io.Reader, columnStmt: String = "") = {
  val conn = connectionPool.getConnection()
  try {
    conn.unwrap(classOf[PGConnection]).getCopyAPI.copyIn(s"COPY $tableName $columnStmt FROM STDIN WITH CSV", reader)
  } catch {
    case se: SQLException => logWarning(se.getMessage)
    case t: Throwable => logWarning(t.getMessage)
  } finally {
    conn.close()
  }
}
The answer by 0x0FFF is good. Here is an additional point that may be useful.
I use foreachPartition to persist to an external store. This is also in line with the "Design Patterns for using foreachRDD" section in the Spark documentation:
https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#output-operations-on-dstreams
Example:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
The answers above refer to old Spark versions; in Spark 2.x there is a JDBC connector that enables writing directly to an RDBMS from a DataFrame.
example:
jdbcDF2.write.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
properties={"user": "username", "password": "password"})
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
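The snippet above is the PySpark form from the linked page; a Java equivalent would be roughly:
Properties connectionProperties = new Properties();
connectionProperties.put("user", "username");
connectionProperties.put("password", "password");

jdbcDF2.write()
       .mode(SaveMode.Append) // pick the save mode as needed
       .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties);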
