I want to use Apache Spark on a cluster of 5 low-end machines. First I installed Cassandra 3.11.3 on the nodes, and all of them are healthy.
After that I inserted 100k records through a Java API, without using Spark, and that works too.
Now I want to execute a simple query like the following:
select * from myKeySpace.myTbl where field1='someValue';
Since my nodes have weak hardware, I want to fetch only a few records from myTbl, like this:
select * from myKeySpace.myTbl where field1='someValue' limit 20;
I have tested this (A), but it is very slow and I don't know why:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue' limit 20");
and also (B), where I think Spark fetches all the data and only then applies the limit, which is not what I want:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue'").limit(20);
I think I can use Spark Core (C) too. I also know that a method called perPartitionLimit is available for Cassandra 3.6 and later (D).
Because my nodes are weak, I don't want to fetch all records from the Cassandra table and then apply a limit. I want to fetch only a small number of records, so that my nodes can handle the load.
So what is the best solution?
Update:
I tried the suggestion given by @AKSW in the comments:
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.set("spark.cassandra.connection.host","192.168.107.100");
long limit=20;
JavaSparkContext jsc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
.cassandraTable("myKeySpace", "myTbl")
.select("id").perPartitionLimit(limit);
System.out.println("Count: " + rdd1.count()); //output is "Count: 100000" which is wrong!
jsc.stop();
but perPartitionLimit(limit) with limit=20 does not work: all records are still fetched!
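One possible explanation for the count above, offered as an assumption rather than a confirmed diagnosis: a per-partition limit caps the rows returned from each Cassandra partition, not the total. If id is the partition key and every row has a distinct id (the table definition is not shown, so this is a guess), each partition holds one row and a limit of 20 removes nothing. The sketch below only simulates that behaviour in plain Java; it does not call Cassandra or the connector, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PerPartitionLimitSketch {
    // Keeps at most `limit` rows for each distinct partition key (row[0]).
    // This is a cap per key, not a cap on the total number of rows.
    static List<String[]> perPartitionLimit(List<String[]> rows, int limit) {
        Map<String, Integer> seenPerKey = new HashMap<>();
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            int count = seenPerKey.merge(row[0], 1, Integer::sum);
            if (count <= limit) {
                kept.add(row);
            }
        }
        return kept;
    }

    // Hypothetical data: n rows, each with its own partition key.
    static List<String[]> uniqueKeyRows(int n) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            rows.add(new String[] {"id-" + i, "value-" + i});
        }
        return rows;
    }

    // Hypothetical data: n rows that all share one partition key.
    static List<String[]> singleKeyRows(int n) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            rows.add(new String[] {"same-key", "value-" + i});
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(perPartitionLimit(uniqueKeyRows(100000), 20).size());
        System.out.println(perPartitionLimit(singleKeyRows(100000), 20).size());
    }
}
```

With unique keys the limit removes nothing, which would match the reported count of 100000; with a single shared key only 20 rows survive.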
In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the dataframe to a Cassandra DB with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra with Spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "key_space_name");
put("table", "table_name");
put("spark.cassandra.output.ttl", Long.toString(CONST_TTL)); // Should depend on the bucket_timestamp column
}
}).mode(SaveMode.Overwrite).save();
One possible way I thought of: for each possible bucket_timestamp, filter the data by that timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?
The solution should work with Java and Dataset<Row>: I found some solutions that do this with RDDs in Scala, but none for Java and DataFrames.
Thanks!
With the Spark-Cassandra connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as:
constant value (withConstantTTL)
automatically resolved value (withAutoTTL)
column-based value (withPerRowTTL)
In your case you could try the last option: compute the TTL as a new column of the starting Dataset, using the rule from your question.
For a usage example, see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612
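The rule from the question can be sketched as a small helper that produces the value for such a TTL column. This is only an illustration in plain Java: the class and method names are hypothetical, and it assumes that CONST_TTL, the current time and bucket_timestamp are all expressed in seconds, which the question does not state.

```java
import java.util.OptionalLong;

final class RowTtlSketch {
    // ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), everything in seconds (assumed unit).
    // A TTL of 0 means "no expiration" in Cassandra, so a non-positive result is
    // returned as empty to let the caller decide what to do with already-expired rows.
    static OptionalLong rowTtlSeconds(long constTtlSeconds, long nowEpochSeconds, long bucketEpochSeconds) {
        long ttl = constTtlSeconds - (nowEpochSeconds - bucketEpochSeconds);
        return ttl > 0 ? OptionalLong.of(ttl) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000L;
        // A bucket that started 1000 seconds ago, with a one-hour constant TTL.
        System.out.println(rowTtlSeconds(3600L, now, now - 1000L));
    }
}
```

Rows whose bucket is older than CONST_TTL come back empty here; whether to skip them or write them with a minimal TTL is a design choice left to the caller.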
The DataFrame API does not support this functionality yet. There is a JIRA for it, https://datastax-oss.atlassian.net/browse/SPARKC-416, which you can watch to be notified when it is implemented.
So the only choice you have is to use the RDD API, as described in @bartosz25's answer.
I am storing the (time series) values in Bigtable and I have come across a use case where I need to apply a filter on these values and perform an aggregation. I am using the following configuration to get the connection to Bigtable (to perform range scan etc):
Connection connection = BigtableConfiguration.connect(projectId, instanceId);
Table table = connection.getTable(TableName.valueOf(tableId));
table.getScanner(<a scanner with filter>);
This gives me a ResultScanner and I can iterate over the rows. However, what I want is to perform an aggregation on certain columns and get the values. An SQL equivalent of what I want to do would be this:
SELECT SUM(A), SUM(B)
FROM table
WHERE C = D;
To do the same in HBase, I came across AggregationClient (javadoc here); however, it requires a Configuration, and I need something that runs off Bigtable (so that I don't need to use the low-level HBase APIs).
I checked the documentation and couldn't find anything (in Java) that could do this. Can anyone share an example of performing an aggregation with filters (on non-row-key or any other columns) on Bigtable?
Bigtable does not natively have any aggregation mechanisms. In addition, Bigtable has difficulty processing WHERE C = D, so that type of processing is generally better done on the client side.
AggregationClient relies on an HBase coprocessor. Cloud Bigtable does not support coprocessors.
If you want to use Cloud Bigtable for this type of aggregation, you'll have to use table.scan() and your own logic. If the scale is large enough, you would have to use Dataflow or BigQuery to perform the aggregations.
Here's one way:
PCollection<TableRow> rows = p.apply(BigQueryIO.readTableRows()
.fromQuery("SELECT A, B FROM table;"));
PCollection<KV<String, Integer>> valuesA =
rows.apply(
MapElements.into(TypeDescriptors.kvs(
TypeDescriptors.strings(),
TypeDescriptors.integers()))
.via((TableRow row) -> KV.of(
"A", (Integer) row.getF().get(0).getV())));
PCollection<KV<String, Integer>> valuesB =
rows.apply(
MapElements.into(TypeDescriptors.kvs(
TypeDescriptors.strings(),
TypeDescriptors.integers()))
.via((TableRow row) -> KV.of(
"B", (Integer) row.getF().get(1).getV())));
PCollection<KV<String, Integer>> sums =
PCollectionList.of(valuesA).and(valuesB)
.apply(Flatten.pCollections())
.apply(Sum.integersPerKey());
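For smaller tables, the client-side option mentioned above (a scan plus your own logic) can be sketched roughly as follows. The Row class is a hypothetical stand-in for a scanned row whose cells have already been decoded; it is not part of the Bigtable or HBase API, and the sketch assumes D is a literal value to compare column C against.

```java
import java.util.Arrays;
import java.util.List;

final class ClientSideSumSketch {
    // Hypothetical decoded row: numeric columns A and B, plus column C used for filtering.
    static final class Row {
        final long a;
        final long b;
        final String c;

        Row(long a, long b, String c) {
            this.a = a;
            this.b = b;
            this.c = c;
        }
    }

    // Client-side equivalent of: SELECT SUM(A), SUM(B) FROM table WHERE C = d
    // Returns {sumA, sumB}.
    static long[] sumWhere(Iterable<Row> rows, String d) {
        long sumA = 0L;
        long sumB = 0L;
        for (Row row : rows) {
            if (d.equals(row.c)) {
                sumA += row.a;
                sumB += row.b;
            }
        }
        return new long[] {sumA, sumB};
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(new Row(1, 10, "x"), new Row(2, 20, "y"), new Row(3, 30, "x"));
        System.out.println(Arrays.toString(sumWhere(rows, "x")));
    }
}
```

In a real client the Iterable would be fed from the ResultScanner, so rows are summed as they stream in rather than being held in memory.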
I am using the Java MongoDB Connector to run a Hadoop MapReduce job against MongoDB.
I am setting the input and output URIs with MongoConfigUtil:
MongoConfigUtil.setInputURI( conf, "mongodb://host/db.collection" );
MongoConfigUtil.setOutputURI( conf, "mongodb://host/db.collectionOut" );
The job correctly fetches all the documents in the specified collection.
Is there a way to limit the number of fetched documents?
I want to achieve this query (Mongo style):
db.collection.find().limit(1000)
I know MongoConfigUtil has a setQuery method, but how can I set a limit on the query? Any hints?
I tried adding
MongoConfigUtil.setLimit(conf, 1000)
But I still get all the documents in the collection.
The default split size is 8 MB (setSplitSize), and this property takes priority over setLimit (mongo.input.limit).
Example: mongoConfig.setSplitSize(5); // MB, the default is 8 MB
In the example above I set the value to 5 MB.
The limit you set (for example 1000) applies to each chunk fetched by each Mapper: setLimit is the query limit of each chunk (split).
I think you want to limit the query for the entire MapReduce process.
setQuery is the query passed to find(), and it must be written in JSON format as in MongoDB. As far as I know, you can't set a limit inside the Mongo query (find()).
You can find another way to filter the query, such as { fieldName: { $lt: 20 } }, depending on your case. Alternatively, you could create a separate collection based on your limit using a projection and then apply MapReduce there.
Finally, setQuery is used to filter the collection.
I found the solution: use the setLimit method of the MongoInputSplit class, passing the number of documents that you want to fetch.
myMongoInputSplitObj = new MongoInputSplit(*param*)
myMongoInputSplitObj.setLimit(100)
MongoConfigUtil setLimit
Allow users to set the limit on MongoInputSplits (HADOOP-267).
According to the new Spark docs, Spark's DataFrame should be preferred over JdbcRDD.
The first experience was pretty enjoyable until I hit the first problem: DataFrame has no flatMapToPair() method. My first thought was to convert it into a JavaRDD, and I did.
Everything was fine; I wrote my code using this approach and then noticed that this code:
JavaRDD<Row> myRDD = dataFrame.toJavaRDD();
int amount = myRDD.partitions().size();
produces 1. All code after this conversion to JavaRDD is extremely inefficient. Forcing a repartition of the RDD takes a good deal of time and adds more overhead than code that works with 1 partition.
How to deal with it?
While using JdbcRDD we wrote specific SQL with a "pager" like WHERE id >= ? and id <= ?, which was used to create partitions. How can I do something like this using DataFrame?
val connectionString = "jdbc:oracle:thin:username/password@111.11.1.11:1521:ORDERS"
val ordersDF = sqlContext.load("jdbc",
Map( "url" -> connectionString,
"dbtable" -> "(select * from CUSTOMER_ORDERS)",
"partitionColumn" -> "ORDER_ID",
"lowerBound"-> "1000",
"upperBound" -> "40000",
"numPartitions"-> "10"))
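To see what these four options do, the sketch below models how the bounds become one WHERE clause per partition. It is a simplified plain-Java illustration of the documented behaviour (lowerBound and upperBound only decide the partition stride; they do not filter rows), not Spark's actual code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class JdbcPartitionSketch {
    // Simplified model: one WHERE clause per partition, derived from the stride.
    // The first partition also takes NULLs and everything below its upper edge;
    // the last partition takes everything above its lower edge.
    static List<String> whereClauses(String column, long lower, long upper, int numPartitions) {
        List<String> clauses = new ArrayList<>();
        if (numPartitions < 2) {
            clauses.add("1 = 1"); // single partition: no real filter
            return clauses;
        }
        long stride = upper / numPartitions - lower / numPartitions;
        long current = lower;
        for (int i = 0; i < numPartitions; i++) {
            String lowerClause = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upperClause = (i == numPartitions - 1) ? null : column + " < " + current;
            if (lowerClause == null) {
                clauses.add(upperClause + " OR " + column + " IS NULL");
            } else if (upperClause == null) {
                clauses.add(lowerClause);
            } else {
                clauses.add(lowerClause + " AND " + upperClause);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (String clause : whereClauses("ORDER_ID", 1000L, 40000L, 10)) {
            System.out.println(clause);
        }
    }
}
```

For the bounds used above (1000, 40000, 10 partitions) the stride is 3900, so the first clause is ORDER_ID < 4900 OR ORDER_ID IS NULL and the last is ORDER_ID >= 36100. This plays the same role as the hand-written "pager" SQL used with JdbcRDD.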
I have a Cassandra database whose data I analyzed using SparkSQL through Apache Spark. Now I want to insert the analyzed data into PostgreSQL. Is there any way to achieve this directly, apart from using the PostgreSQL driver? (I achieved it using postREST and the driver; I want to know whether there is a method like saveToCassandra().)
At the moment there is no native implementation for writing an RDD to a DBMS. Here are links to the related discussions on the Spark user list: one, two
In general, the most performant approach would be the following:
Check the number of partitions in the RDD; it should be neither too low nor too high. 20-50 partitions should be fine: if the number is lower, call repartition with 20 partitions; if higher, call coalesce to 50 partitions.
Call the mapPartitions transformation and, inside it, call a function that inserts the records into your DBMS using JDBC. In this function you open the connection to your database and use the COPY command with this API; this eliminates the need for a separate command for each record, so the insert is processed much faster.
This way you insert the data into Postgres in parallel, using up to 50 parallel connections (depending on your Spark cluster size and its configuration). The whole approach can be implemented as a Java/Scala function that accepts the RDD and the connection string.
You can use the Postgres COPY API to write it; it's much faster that way. See the following two methods: one iterates over the RDD to fill a buffer that can be saved by the COPY API. The only thing you have to take care of is creating a correct statement in CSV format for the COPY API to use.
def saveToDB(rdd: RDD[Iterable[EventModel]]): Unit = {
val sb = mutable.StringBuilder.newBuilder
val now = System.currentTimeMillis()
rdd.collect().foreach(itr => {
itr.foreach(_.createCSV(sb, now).append("\n"))
})
copyIn("myTable", new StringReader(sb.toString), "statement")
sb.clear
}
def copyIn(tableName: String, reader: java.io.Reader, columnStmt: String = "") = {
val conn = connectionPool.getConnection()
try {
conn.unwrap(classOf[PGConnection]).getCopyAPI.copyIn(s"COPY $tableName $columnStmt FROM STDIN WITH CSV", reader)
} catch {
case se: SQLException => logWarning(se.getMessage)
case t: Throwable => logWarning(t.getMessage)
} finally {
conn.close()
}
}
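The createCSV method used above is not shown, so here is a hedged Java sketch of what building one CSV line for COPY ... WITH CSV might look like. The class and method names are hypothetical. It quotes every non-null field and doubles embedded quotes; a null is written as an unquoted empty field, which the CSV format of COPY reads as NULL by default.

```java
final class CopyCsvSketch {
    // Formats one field for COPY ... WITH CSV.
    // Non-null values are always quoted, with embedded quotes doubled;
    // null becomes an unquoted empty field (read as NULL by default).
    static String field(String value) {
        if (value == null) {
            return "";
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Joins the fields with commas and terminates the line with a newline.
    static String line(String... values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(field(values[i]));
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        System.out.print(line("a,b", null, "say \"hi\""));
    }
}
```

Quoting every field keeps the helper short; values containing commas, quotes or newlines then need no special casing.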
The answer by 0x0FFF is good. Here is an additional point that may be useful.
I use foreachPartition to persist to an external store. This is also in line with the "Design Patterns for using foreachRDD" section of the Spark documentation:
https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#output-operations-on-dstreams
Example:
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
}
The answers above refer to old Spark versions. In Spark 2.*, there is a JDBC connector that lets you write directly to an RDBMS from a DataFrame.
Example:
jdbcDF2.write.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
properties={"user": "username", "password": "password"})
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html