Apache Spark DataFrame no RDD partitioning

Apache Spark DataFrame no RDD partitioning - java

According to new Spark Docs, using Spark's DataFrame should be preferred over using JdbcRDD.
First touch was pretty enjoyable until I faced first problem - DataFrame has no flatMapToPair() method. The first mind was to convert it into JavaRDD and I did it.
Everything was fine, I wrote my code using this approach and that noticed that such code:
JavaRDD<Row> myRDD = dataFrame.toJavaRDD();
int amount = myRDD.partitions().length
produces 1. All code below such transformation to JavaRDD is absolutely inefficient. Force repartitioning of RDD takes a good piece of time and makes bigger overhead than code, that works with 1 partition.
How to deal with it?
While using JdbcRDD we wrote specific SQL with "pager" like WHERE id >= ? and id <= ? that was used to create partitions. How to make something like this using DataFrame?

`
val connectionString` = "jdbc:oracle:thin:username/password#111.11.1.11:1521:ORDERS"
val ordersDF = sqlContext.load("jdbc",
Map( "url" -> connectionString,
"dbtable" -> "(select * from CUSTOMER_ORDERS)",
"partitionColumn" -> "ORDER_ID",
"lowerBound"-> "1000",
"upperBound" -> "40000",
"numPartitions"-> "10"))

Related

Find elements in one RDD but not in ther other RDD

I have two JavaRDD A and B. I want to only keep longs that are in A but not in B. How should I do that? Thanks!

I am posting a solution in scala. Should be almost similar in Java.
Do a leftOuterJoin which would give all the records in the first rdd alongwith matching records from the second rdd. Like WrappedArray((168,(def,None)), (192,(abc,Some(abc)))). But to keep the record only present in first rdd, we apply a filter over None.
val data = spark.sparkContext.parallelize(Seq((192, "abc"),(168, "def")))
val data2 = spark.sparkContext.parallelize(Seq((192, "abc")))
val result = data
.leftOuterJoin(data2)
.filter(record => record._2._2 == None)
println(result.collect.toSeq)
Output> WrappedArray((168,(def,None)))

If you use the Dataframe API - RDD is old and does not have a lot of the optimisation of Tungsten engine - you can use an antijoin (it could exist also on RDD api, but let's use the good one ;-) )
val dataA = Seq((192, "abc"),(168, "def") ).toDF("MyLong", "MyString")
val dataB = Seq((192, "abc")).toDF.toDF("MyLong", "MyString")
dataA.join(dataB, Seq("MyLong"), "leftanti").show(false)
+------+--------+
|MyLong|MyString|
+------+--------+
|168 |def |
+------+--------+

Using Apache Spark in poor systems with cassandra and java

I want to use Apache Spark on my cluster which is made by 5 poor systems. At first I have implemented cassandra 3.11.3 on my nodes and all of my nodes are OK.
After that I have inserted 100k records in my nodes with a JAVA api without using Spark and all is OK too.
Now I want to execute a simple query like as follows:
select * from myKeySpace.myTbl where field1='someValue';
Since my nodes are weak in hardware, I want to get just a little records from myTbl like this:
select * from myKeySpace.myTbl where field1='someValue' limit 20;
I have tested this (A) but it is very slow (and I don't know the reason):
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue' limit 20");
and also (B) that I think Spark fetches all data and then uses limit function which is not my goal:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue'").limit(20);
I think I can use Spark core (C) too. Also I know that a method called perPartitionLimit is implemented in cassandra 3.6 and greater (D).
As you know, since my nodes are weak, I don't want to fetch all records from cassandra table and then use limit function or something like that. I want to fetch just a little number of records from my table in such that my nodes can handle that.
So what is the best solution?
update:
I have done the suggestion which is given by #AKSW at the comment:
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.set("spark.cassandra.connection.host","192.168.107.100");
long limit=20;
JavaSparkContext jsc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
.cassandraTable("myKeySpace", "myTbl")
.select("id").perPartitionLimit(limit);
System.out.println("Count: " + rdd1.count()); //output is "Count: 100000" which is wrong!
jsc.stop();
but perPartitionLimit(limit) that limit=20 does not work and all records fetch!

Perform aggregation in Dataflow

I am storing the (time series) values in Bigtable and I have come across a use case where I need to apply a filter on these values and perform an aggregation. I am using the following configuration to get the connection to Bigtable (to perform range scan etc):
Connection connection = BigtableConfiguration.connect(projectId, instanceId);
Table table = connection.getTable(TableName.valueOf(tableId));
table.getScanner(<a scanner with filter>);
This helps me with ResultScanner and I can iterate the rows. However, what I want to do is, perform an aggregation on certain columns and get the values. An SQL equivalent of what I want to do would be this:
SELECT SUM(A), SUM(B)
FROM table
WHERE C = D;
To do the same in HBase, I came across AggregationClient (javadoc here), however, it requires Configuration and I need something that runs off Bigtable (so that I don't need to use the low level Hbase APIs).
I checked the documentation and couldn't find anything (in Java) that could do this. Can anyone share an example to perform aggregation with (non row key or any) filters on BigTable.

Bigtable does not natively have any aggregation mechanisms. In addition, Bigtable has difficulty processing WHERE C = D, so that type of processing is generally better done on the client side.
AggregationClient is an HBase coprocessor. Cloud Bigtable does not support coprocessors.
If you want to use Cloud Bigtable for this type of aggregation, you'll have to use table.scan() and your own logic. If the scale is large enough, you would have to use Dataflow or BigQuery to perform the aggregations.

Here's one way:
PCollection<TableRow> rows = p.apply(BigQueryIO.readTableRows()
.fromQuery("SELECT A, B FROM table;"));
PCollection<KV<String, Integer>> valuesA =
rows.apply(
MapElements.into(TypeDescriptors.kvs(
TypeDescriptors.strings(),
TypeDescriptors.integers()))
.via((TableRow row) -> KV.of(
"A", (Integer) row.getF().get(0).getV())));
PCollection<KV<String, Integer>> valuesB =
rows.apply(
MapElements.into(TypeDescriptors.kvs(
TypeDescriptors.strings(),
TypeDescriptors.integers()))
.via((TableRow row) -> KV.of(
"B", (Integer) row.getF().get(1).getV())));
PCollection<KV<String, Integer>> sums =
PCollectionList.of(sumOfA).and(sumOfB)
.apply(Flatten.pCollections())
.apply(Sum.integersPerKey());

Apache Spark - how to get unmatched rows from two RDDs

I have two different RDDs, each RDD have some common fields, based on that fields i want to get unmatched records from RDD1 or RDD2.[Records available in RDD1 but not available in RDD2] [Records available in RDD2 but not available in RDD1]
It seems we could use subtract or subtractbyKey.
Sample Input:
**File 1:**
sam,23,cricket
alex,34,football
ann,21,football
**File 2:**
ruby,25,football,usa
alex,34,cricket,usa
ann,21,cricket,usa
**expected output:**
sam,23,cricket
Update:
Currently i am using Spark SQL to get the unmatched records from the RDDs(Writing a query to get the unmatched records).
What i am looking is, is it something we can do it with Spark Core itself instead of using Spark SQL and also i am not looking the code, is there any operation available in Spark Core?
Please advise on this.
Regards,
Shankar.

You could bring both RDDs to the same shape and use subtract to remove the common elements.
Given rdd1 from file1 and rdd2 from file2 as presented above, you could do something like:
val userScore2 = rdd2.map{case (name, score, sport, country) => (name, score, sport)}
val in1andNotin2 = rdd1 subtract userScore2
val in2andNotIn1 = userScore2 subtract rdd1

Inserting Analytic data from Spark to Postgres

I have Cassandra database from which i analyzed the data using SparkSQL through Apache Spark. Now i want to insert those analyzed data into PostgreSQL . Is there any ways to achieve this directly apart from using the PostgreSQL driver (I achieved it using postREST and Driver i want to know whether there is any methods like saveToCassandra())?

At the moment there is no native implementation of writing the RDD to any DBMS. Here are the links to the related discussions in the Spark user list: one, two
In general, the most performant approach would be the following:
Validate the number of partitions in RDD, it should not be too low and too high. 20-50 partitions should be fine, if the number is lower - call repartition with 20 partitions, if higher - call coalesce to 50 partitions
Call the mapPartition transformation, inside of it call the function to insert the records to your DBMS using JDBC. In this function you open the connection to your database and use the COPY command with this API, it would allow you to eliminate the need for a separate command for each record - this way the insert would be processed much faster
This way you would insert the data into Postgres in a parallel fashion utilizing up to 50 parallel connection (depends on your Spark cluster size and its configuration). The whole approach might be implemented as a Java/Scala function accepting the RDD and the connection string

You can use Postgres copy api to write it, its much faster that way. See following two methods - one iterates over RDD to fill the buffer that can be saved by copy api. Only thing you have to take care of is creating correct statement in csv format that will be used by copy api.
def saveToDB(rdd: RDD[Iterable[EventModel]]): Unit = {
val sb = mutable.StringBuilder.newBuilder
val now = System.currentTimeMillis()
rdd.collect().foreach(itr => {
itr.foreach(_.createCSV(sb, now).append("\n"))
})
copyIn("myTable", new StringReader(sb.toString), "statement")
sb.clear
}
def copyIn(tableName: String, reader: java.io.Reader, columnStmt: String = "") = {
val conn = connectionPool.getConnection()
try {
conn.unwrap(classOf[PGConnection]).getCopyAPI.copyIn(s"COPY $tableName $columnStmt FROM STDIN WITH CSV", reader)
} catch {
case se: SQLException => logWarning(se.getMessage)
case t: Throwable => logWarning(t.getMessage)
} finally {
conn.close()
}
}

Answer by 0x0FFF is good. Here is an additional point that would be useful.
I use foreachPartition to persist to external store. This is also inline with the design pattern Design Patterns for using foreachRDD given in Spark documentation
https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#output-operations-on-dstreams
Example:
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
}

The answers above refers to old spark versions, in spark 2.* there is jdbc connector, enable write directly to RDBS from a dataFrame.
example:
jdbcDF2.write.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
properties={"user": "username", "password": "password"})
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Apache Spark DataFrame no RDD partitioning - java

Related

Find elements in one RDD but not in ther other RDD

Using Apache Spark in poor systems with cassandra and java

Perform aggregation in Dataflow

Apache Spark - how to get unmatched rows from two RDDs

Inserting Analytic data from Spark to Postgres

Categories

Resources