I have a Cassandra database whose data I analyzed with SparkSQL through Apache Spark. Now I want to insert that analyzed data into PostgreSQL. Are there any ways to achieve this directly, apart from using the PostgreSQL driver? (I achieved it using PostgREST and the driver; I want to know whether there is a method like saveToCassandra().)
At the moment there is no native implementation of writing the RDD to any DBMS. Here are the links to the related discussions in the Spark user list: one, two
In general, the most performant approach would be the following:
Validate the number of partitions in the RDD; it should be neither too low nor too high. 20-50 partitions should be fine: if the number is lower, call repartition with 20 partitions; if higher, call coalesce down to 50 partitions.
Call the mapPartitions transformation and, inside it, insert the records into your DBMS over JDBC. In this function you open a connection to your database and use the COPY command with this API, which eliminates the need for a separate command for each record; this way the insert is processed much faster.
This way you would insert the data into Postgres in parallel, utilizing up to 50 parallel connections (depending on your Spark cluster size and its configuration). The whole approach might be implemented as a Java/Scala function accepting the RDD and the connection string, as sketched below.
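For illustration, here is a minimal Java sketch of that approach, assuming the PostgreSQL JDBC driver is on the classpath; the schema/table name, column list, and naive CSV conversion are placeholders you would adapt to your data. It opens one connection per partition and streams that partition's rows through the driver's CopyManager in a single COPY.
import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.apache.spark.api.java.JavaRDD;
import org.postgresql.PGConnection;

public class PostgresBulkWriter {

    // One JDBC connection and one COPY per partition; table and columns are placeholders.
    public static void saveToPostgres(JavaRDD<String[]> rdd, String jdbcUrl,
                                      String user, String password) {
        rdd.foreachPartition(rows -> {
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
                StringBuilder csv = new StringBuilder();
                while (rows.hasNext()) {
                    // assumes fields contain no commas or newlines; escape properly in real code
                    csv.append(String.join(",", rows.next())).append('\n');
                }
                conn.unwrap(PGConnection.class).getCopyAPI()
                    .copyIn("COPY my_schema.my_table (col1, col2, col3) FROM STDIN WITH CSV",
                            new StringReader(csv.toString()));
            }
        });
    }
}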
You can use the Postgres COPY API to write it; it's much faster that way. See the following two methods: one iterates over the RDD to fill a buffer that can be saved by the COPY API. The only thing you have to take care of is creating a correct statement in CSV format to be used by the COPY API.
import java.io.StringReader
import java.sql.SQLException

import scala.collection.mutable

import org.apache.spark.rdd.RDD
import org.postgresql.PGConnection

// EventModel, connectionPool and logWarning come from the surrounding application code
def saveToDB(rdd: RDD[Iterable[EventModel]]): Unit = {
  val sb = mutable.StringBuilder.newBuilder
  val now = System.currentTimeMillis()
  // Collect the records on the driver and render each one as a CSV line
  rdd.collect().foreach(itr => {
    itr.foreach(_.createCSV(sb, now).append("\n"))
  })
  copyIn("myTable", new StringReader(sb.toString), "statement")
  sb.clear
}

def copyIn(tableName: String, reader: java.io.Reader, columnStmt: String = "") = {
  // Take a pooled JDBC connection and stream the CSV buffer through the COPY API
  val conn = connectionPool.getConnection()
  try {
    conn.unwrap(classOf[PGConnection]).getCopyAPI.copyIn(s"COPY $tableName $columnStmt FROM STDIN WITH CSV", reader)
  } catch {
    case se: SQLException => logWarning(se.getMessage)
    case t: Throwable => logWarning(t.getMessage)
  } finally {
    conn.close()
  }
}
The answer by 0x0FFF is good. Here is an additional point that may be useful.
I use foreachPartition to persist to an external store. This is also in line with the design pattern "Design Patterns for using foreachRDD" given in the Spark documentation:
https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#output-operations-on-dstreams
Example:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
The answers above refer to old Spark versions; in Spark 2.x there is a JDBC connector that enables writing directly to an RDBMS from a DataFrame.
Example (PySpark syntax):
jdbcDF2.write.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
properties={"user": "username", "password": "password"})
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
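For a Java DataFrame (as in the original question) the equivalent is DataFrameWriter.jdbc; here is a minimal sketch, where the connection URL, credentials, and table name are placeholders and df is assumed to be the Dataset<Row> produced by the analysis:
import java.util.Properties;

import org.apache.spark.sql.SaveMode;

// df is the Dataset<Row> produced by the Spark SQL analysis (assumed to exist)
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "username");
connectionProperties.setProperty("password", "password");
connectionProperties.setProperty("driver", "org.postgresql.Driver");

df.write()
  .mode(SaveMode.Append) // or SaveMode.Overwrite, depending on the use case
  .jdbc("jdbc:postgresql://dbserver:5432/mydb", // placeholder connection URL
        "schema.tablename",
        connectionProperties);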
I'm trying to define a Java (written in Kotlin) method that returns multiple result sets when called as a stored procedure. Code is below. The HSQLDB website's features page indicates that this should be possible; what am I missing? I currently get an array index out of bounds error on index 1:
val createProcedure = """
    CREATE PROCEDURE GET_CACHED(dir VARCHAR(100), hashCode INT)
    MODIFIES SQL DATA
    LANGUAGE JAVA
    DYNAMIC RESULT SETS 9
    EXTERNAL NAME 'CLASSPATH:integration.FileCache.getResultSets'
"""
@JvmStatic
@Throws(SQLException::class)
public fun getResultSets(conn: Connection, dir: String, hashCode: Int, result: Array<ResultSet?>) {
    val file = getFile(dir, hashCode, "sql")
    // A list of cached SQL statements
    val sqlList = BufferedReader(InputStreamReader(file.inputStream())).readLines()
    val stmt = conn.createStatement()
    for (i in sqlList.indices) {
        result[i] = stmt.executeQuery(sqlList[i])
    }
}
Given that I can set a breakpoint and reach the inside of the function, I don't think I need to add any more of my code, but if that is wrong let me know.
I'm on a quest to find an in-memory database that can handle multiple result sets for testing purposes (we're modernizing an application, setting up tests first; containerized testing is currently out of reach). I've tried H2, SQLite, and now HSQLDB; if there's a better solution I'm open to it.
HSQLDB supports multiple result sets, but the Guide states: "HyperSQL supports this method of returning single or multiple result sets from SQL/PSM procedures only via the JDBC CallableStatement interface." Note the reference to SQL/PSM, rather than SQL/JRT. Currently there is no Java mechanism to return multiple result sets from a Java language PROCEDURE.
For test purposes, you could use a text template consisting of an SQL/PSM CREATE PROCEDURE statement written in SQL, with placeholders for the actual SQL statements that you want to execute. Process the template with your test SQL statements from your file and execute the resulting CREATE PROCEDURE statement.
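As an illustration of that template approach, here is a rough Java sketch, assuming an in-memory HSQLDB URL, a hypothetical cached-SQL file ("cached.sql", one statement per line, no trailing semicolons, at most 9 statements), and a placeholder procedure name; the cursor-per-query SQL/PSM body follows the pattern shown in the HSQLDB Guide:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.List;

public class PsmTemplateExample {
    public static void main(String[] args) throws Exception {
        List<String> queries = Files.readAllLines(Paths.get("cached.sql"));

        // Fill the template: one cursor declaration plus one OPEN per cached query.
        StringBuilder declares = new StringBuilder();
        StringBuilder opens = new StringBuilder();
        for (int i = 0; i < queries.size(); i++) {
            declares.append("    DECLARE result").append(i)
                    .append(" CURSOR WITH RETURN FOR ").append(queries.get(i))
                    .append(" FOR READ ONLY;\n");
            opens.append("    OPEN result").append(i).append(";\n");
        }
        String createProcedure =
                "CREATE PROCEDURE GET_CACHED_PSM()\n" +
                "  READS SQL DATA DYNAMIC RESULT SETS 9\n" +
                "  BEGIN ATOMIC\n" + declares + opens + "  END";

        try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:test", "SA", "")) {
            conn.createStatement().execute(createProcedure);

            // Multiple result sets are only reachable through CallableStatement.
            try (CallableStatement call = conn.prepareCall("CALL GET_CACHED_PSM()")) {
                boolean hasResults = call.execute();
                while (hasResults) {
                    try (ResultSet rs = call.getResultSet()) {
                        while (rs.next()) {
                            // consume each row as needed
                        }
                    }
                    hasResults = call.getMoreResults();
                }
            }
        }
    }
}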
I am required to execute a stored procedure in SQL Server to fetch some data, and since I will later save the data into MongoDB with ReactiveMongoTemplate and so on, I introduced Spring R2DBC.
implementation("org.springframework.data:spring-data-r2dbc:1.0.0.RELEASE")
implementation("io.r2dbc:r2dbc-mssql:0.8.1.RELEASE")
I see that I can do SELECT and INSERT and so on with R2DBC, but is it possible to EXEC prod_name? I tried it and it hangs forever and then the test terminates, with neither success nor failure. The last line of the log is:
io.r2dbc.mssql.QUERY - Executing query: EXEC "SCHEMA"."MY_PROCEDURE"
The code is like:
public Flux<Coupon> selectWithProcedure() {
    return databaseClient
            .execute("EXEC \"SCHEMA\".\"MY_PROCEDURE\" ")
            .as(Coupon.class)
            .fetch().all()
            .doOnNext(coupon -> {
                coupon.setCouponStatusRefFromId(coupon.getCouponStatusRefId());
            });
}
And it seems that no data is retrieved.
If I test some other methods with simple queries like SELECT... it works. But the problem is that the DBAs do not allow my app to read table data; instead, they create a procedure for me. If this query is not possible, I must go with the traditional JPA way, and going reactive on the Mongo side loses its point.
Well. I just saw this:
https://github.com/r2dbc/r2dbc-mssql, version 0.8.1:
Next steps:
Execution of stored procedures
Add support for TVP and UDTs
And:
https://r2dbc.io/2019/05/13/r2dbc-0-8-milestone-8-released
We already have a few tickets lined up for the next milestone, and we know that they will require further SPI modifications:
Support for Auto-Commit
Connection Validation
Support for Stored Procedures
I want to use Apache Spark on my cluster, which is made up of 5 low-end machines. First I installed Cassandra 3.11.3 on my nodes, and all of my nodes are OK.
After that I inserted 100k records into my nodes with a Java API, without using Spark, and all is OK too.
Now I want to execute a simple query like as follows:
select * from myKeySpace.myTbl where field1='someValue';
Since my nodes are weak in hardware, I want to get just a few records from myTbl, like this:
select * from myKeySpace.myTbl where field1='someValue' limit 20;
I have tested this (A) but it is very slow (and I don't know the reason):
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue' limit 20");
and also (B), where I think Spark fetches all the data and then applies the limit function, which is not my goal:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue'").limit(20);
I think I can use Spark Core (C) too. I also know that a method called perPartitionLimit is implemented in Cassandra 3.6 and later (D).
As you know, since my nodes are weak, I don't want to fetch all the records from the Cassandra table and then use the limit function or something like that. I want to fetch just a small number of records from my table, such that my nodes can handle it.
So what is the best solution?
Update:
I have followed the suggestion given by @AKSW in the comments:
SparkConf conf = new SparkConf()
        .setAppName("SparkTest")
        .set("spark.cassandra.connection.host", "192.168.107.100");
long limit = 20;
JavaSparkContext jsc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
        .cassandraTable("myKeySpace", "myTbl")
        .select("id")
        .perPartitionLimit(limit);
System.out.println("Count: " + rdd1.count()); // output is "Count: 100000" which is wrong!
jsc.stop();
but perPartitionLimit(limit) with limit=20 does not work and all the records are fetched!
How can I be sure that the following condition always holds?
Let's say I have 3 tables which contain 1000, 2000, and 3000 records.
I have to load all of those records into the cache, but I want to be sure all the data is in the cache before a client uses it.
Note that the cache mode is replicated.
Does Apache Ignite have this feature?
Here are the things I plan to do:
I will set the atomicityMode to TRANSACTIONAL
I will create a class for the cacheStoreFactory
This class overrides the loadCache method of CacheStoreAdapter.
Here is the pseudocode:
public void loadCache(IgniteBiInClosure<K, V> clo, Object... args) {
    // Connect to all the databases
    /* while true:
           try (Transaction transaction = Ignition.ignite().transactions().txStart(TransactionConcurrency.OPTIMISTIC, TransactionIsolation.SERIALIZABLE)) {
               prepare a PreparedStatement for each database, such as "select * from persons"
               then take all the ResultSets from all databases
               then create the corresponding objects from the results and put them with "clo.apply(objectID, object);"
               before transaction.commit():
                   if there is a way, find the total number of records across all databases (maybe find it before starting the try block)
                   then, again if there is a way, compare the cache size and the total record count
                   if the 2 numbers are equal -> transaction.commit() & break
                   else -> continue;
           }
    */
}
You can use distributed data structures to signal that some action was completed. For example:
https://apacheignite.readme.io/docs/countdownlatch
https://apacheignite.readme.io/docs/distributed-semaphore
https://apacheignite.readme.io/docs/atomic-types
Also, you can check the size of the cache or send a message from one node to another, as shown here:
https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/messaging/MessagingExample.java
However, until the transaction that loads the data commits, other nodes will not be able to read it.
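As a minimal Java sketch of the countdown-latch idea (the latch name "cacheLoaded" and the cache name "personCache" are hypothetical), the loading node counts the latch down once loading finishes, and clients block on it before reading:
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteCountDownLatch;

public class CacheLoadSignal {

    // Run on the node that triggers the initial load.
    public static void loadAndSignal(Ignite ignite) {
        IgniteCountDownLatch latch = ignite.countDownLatch("cacheLoaded", 1, false, true);

        IgniteCache<Long, Object> cache = ignite.cache("personCache");
        cache.loadCache(null); // invokes your CacheStoreAdapter.loadCache on the cluster

        latch.countDown(); // signal that the data is fully loaded
    }

    // Run on client nodes before they start reading.
    public static void awaitLoaded(Ignite ignite) {
        IgniteCountDownLatch latch = ignite.countDownLatch("cacheLoaded", 1, false, true);
        latch.await(); // blocks until loadAndSignal() counts the latch down
    }
}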
In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the dataframe to a Cassandra DB. The data must be written to the DB with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra with Spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "key_space_name");
                put("table", "table_name");
                put("spark.cassandra.output.ttl", Long.toString(CONST_TTL)); // should depend on the bucket_timestamp column
            }
        }).mode(SaveMode.Overwrite).save();
One possible way I thought about is, for each possible bucket_timestamp, to filter the data by timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?
The solution should work with Java and Dataset<Row>: I encountered some solutions for doing this with RDDs in Scala, but didn't find a solution for Java and DataFrames.
Thanks!
From Spark-Cassandra connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as:
constant value (withConstantTTL)
automatically resolved value (withAutoTTL)
column-based value (withPerRowTTL)
In your case you could try the last option and compute the TTL as a new column of the starting Dataset with the rule you provided in the question.
For a usage example you can see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612
For the DataFrame API there is no support for such functionality yet... There is a JIRA for it - https://datastax-oss.atlassian.net/browse/SPARKC-416 - which you can watch to get notified when it's implemented...
So the only choice you have is to use the RDD API, as described in @bartosz25's answer...
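For illustration only, here is a rough Java sketch of that RDD route; BucketRecord, its fromRow helper, and the exact placeholder semantics of withPerRowTTL are assumptions to verify against the WriterBuilder source and the test linked above, while df and CONST_TTL are the DataFrame and constant from the question:
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import org.apache.spark.api.java.JavaRDD;

// BucketRecord is a hypothetical bean with one property per Cassandra column
// plus an extra "ttl" property holding CONST_TTL - (now - bucket_timestamp) in seconds.
JavaRDD<BucketRecord> rdd = df.toJavaRDD().map(row -> {
    BucketRecord record = BucketRecord.fromRow(row);                  // hypothetical mapping helper
    long bucketTs = row.getLong(row.fieldIndex("bucket_timestamp"));  // epoch seconds assumed
    record.setTtl((int) (CONST_TTL - (System.currentTimeMillis() / 1000L - bucketTs)));
    return record;
});

javaFunctions(rdd)
        .writerBuilder("key_space_name", "table_name", mapToRow(BucketRecord.class))
        .withPerRowTTL("ttl") // per-row TTL taken from the "ttl" field, per the options listed above
        .saveToCassandra();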