I'm trying to load a dataset into Spark using the following code:
Dataset<Row> dataset = spark.read().jdbc(RPP_CONNECTION_URL, creditoDia3, rppDBProperties);
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia2, rppDBProperties));
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia, rppDBProperties));
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia3, rppDBProperties));
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia2, rppDBProperties));
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia, rppDBProperties));
dataset = dataset.cache();
Long numberOfRowsProcessed = dataset.count();
So after these 6 queries hitting my database, extracting the data, and counting the rows, I shouldn't need to go to the database anymore. But after running the following code:
dataset.createOrReplaceTempView("temp");
Dataset<Row> base = spark.sql(new StringBuilder()
.append("select ")
.append("TRANSACTION ")
.append("from temp ")
.append("where PAYMENT_METHOD in (1,2,3,4) ")
.append("and TRANSACTION_STATUS in ('A','B') ")
.toString()
);
base.createOrReplaceTempView("base");
But what I am actually seeing is Spark running the query against the database again, this time appending the filters I passed when defining Dataset<Row> base. As you can see, I already cached the data, but it had no effect.
Question: Is it possible to load everything into memory in Spark and query the cached data, without going back to the database?
Fetching the data from my relational database is expensive and takes a while.
UPDATE
I noticed that Spark is sending new queries to the database when it tries to execute:
from base a
left join base b on a.IDT_TRANSACTION = b.IDT_TRANSACTION and a.DATE = b.DATE
This is the string Spark is appending to the query (captured from the database):
WHERE ("IDT_TRANSACTION_STATUS" IS NOT NULL) AND ("NUM_BIN_CARD" IS NOT NULL)
The log shows:
18/01/16 14:22:20 INFO DAGScheduler: ShuffleMapStage 12 (show at RelatorioBinTransacao.java:496) finished in 13,046 s
18/01/16 14:22:20 INFO DAGScheduler: looking for newly runnable stages
18/01/16 14:22:20 INFO DAGScheduler: running: Set(ShuffleMapStage 9)
18/01/16 14:22:20 INFO DAGScheduler: waiting: Set(ShuffleMapStage 13, ShuffleMapStage 10, ResultStage 14, ShuffleMapStage 11)
18/01/16 14:22:20 INFO DAGScheduler: failed: Set()
I'm not sure I understand what it is trying to say, but I think something is missing from memory.
If I just comment out the left join like this:
from base a
//left join base b on a.IDT_TRANSACTION = b.IDT_TRANSACTION and a.DATE = b.DATE
it works just fine and it doesn't go to the database anymore.
This sounds like you may not have enough memory to store the unioned results on your cluster. After Long numberOfRowsProcessed = dataset.count();, please look at the Storage tab of your Spark UI to see whether the whole dataset is fully cached or not. If it is NOT, then you need more memory (and/or disk space).
If you've confirmed the dataset is indeed cached, then please post the query plan (e.g. base.explain()).
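As a quick sanity check (a sketch using standard Spark APIs, not part of the original code), you can also inspect this programmatically:

// Prints the dataset's storage level; if it is not marked for caching this is StorageLevel.NONE.
System.out.println(dataset.storageLevel());

// Prints the physical plan; when the cache is actually used, the scan shows up as an
// InMemoryTableScan node instead of a JDBC relation scan.
base.explain(true);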
I figured out a way to work around the problem. I had to add a cache() call to every line that sends a query to the database, so it looks like this:
Dataset<Row> dataset = spark.read().jdbc(RPP_CONNECTION_URL, fake, rppDBProperties);
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia3, rppDBProperties).cache());
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia2, rppDBProperties).cache());
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia, rppDBProperties).cache());
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia3, rppDBProperties).cache());
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia2, rppDBProperties).cache());
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia, rppDBProperties).cache());
dataset = dataset.cache();
I had to add the first line with a fake SQL query because, no matter what I did, Spark did not seem to cache the first query, so I kept seeing the first query being sent to the database.
Bottom line: I don't understand why I have to add a cache() call to every line when I already cache the result at the end, but it worked.
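For what it's worth, an untested alternative sketch (same tables and properties as above, just my guess at how it could be structured) would be to persist the whole union once with a level that can spill to disk, and then force materialization with an action:

import org.apache.spark.storage.StorageLevel;

// Build the union lazily, persist it once, then run an action so every partition
// is actually materialized before the cached data is reused.
Dataset<Row> dataset = spark.read().jdbc(RPP_CONNECTION_URL, creditoDia3, rppDBProperties)
        .union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia2, rppDBProperties))
        .union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia, rppDBProperties))
        .union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia3, rppDBProperties))
        .union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia2, rppDBProperties))
        .union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia, rppDBProperties))
        .persist(StorageLevel.MEMORY_AND_DISK()); // spills to disk instead of recomputing from the database

long numberOfRowsProcessed = dataset.count();     // action that materializes the cache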
Related
Small question regarding Spark please.
What I would like to achieve is quite straightforward:
I have a Spark cluster with 10 executors which I would like to utilize.
I need to run a query selecting 10 rows from the DB.
My expectation is something like: select 10 rows, results are rows 1 2 3 4 5 6 7 8 9 10.
Then apply a map operation on each row. Something like: executor 1 applies the operation Op to one row, executor 2 applies Op to another row, etc.
Note: my operation Op has proper logging and proper KPIs.
Therefore, I tried this:
public static void main(String[] args) {
    final String query = "SELECT TOP(10) id, last_name, first_name FROM mytable WHERE ...";
    final SparkSession sparkSession = SparkSession.builder().getOrCreate();
    final Properties dbConnectionProperties = new Properties();
    dbConnectionProperties.putAll([...]);
    final Dataset<Row> topTenDataSet = sparkSession.read().jdbc(someUrl, query, dbConnectionProperties);
    topTenDataSet.show();
    final Dataset<String> topTenDataSetAfterMap = topTenDataSet.repartition(10).map((MapFunction<Row, String>) row -> performOperationWithLogAndKPI(row), Encoders.STRING());
    LOGGER.info("the count is expected to be 10 " + topTenDataSetAfterMap.count() + topTenDataSetAfterMap.showString(100000, 1000000, false));
    sparkSession.stop();
}
With this code, there is a strange outcome.
Both topTenDataSet.show() and topTenDataSetAfterMap.count() show 10 rows, which is good.
But when I look at the logs from the operation performOperationWithLogAndKPI, I can see many more than 10 log lines and many more than 10 metrics. That is, I can see executor 1 performing the operation 10 times, but also executor 2 performing it 10 times, etc.
It seems like each executor runs its own "SELECT TOP(10) from DB" and applies the map function on its own dataset.
May I ask: did I make a mistake in the code?
Is my understanding not correct?
How to achieve the expected, querying once, and having each executor apply a function to part of the result set?
Thank you
If you're trying to execute multiple actions on the same Dataset, try caching it. This way the SELECT TOP(10) query should be executed only once:
final Dataset<Row> topTenDataSet = sparkSession.read().jdbc(someUrl, query, dbConnectionProperties);
topTenDataSet.cache();
topTenDataSet.show();
final Dataset<String> topTenDataSetAfterMap = topTenDataSet.repartition(10).map((MapFunction<Row, String>) row -> performOperationWithLogAndKPI(row), Encoders.STRING());
Further info here
I am reading an Oracle database from my Spark code and I persist it (cache operation).
val dataOracle = spark.read
.format("jdbc")
.option("url",conn_url)
.option("dbtable", s"(select * from table)")
.option("user", oracle_user)
.option("password", oracle_pass)
.option("driver",oracle_driver)
.load().persist()
At the end of the code I need to unpersist this dataframe, because there may be changes in the database and I need that data in the next cycle, but at the same time the time cost is very important to me. If I cache the dataframe, my code takes under 1 second; if I don't, it takes over 3 seconds (which is not acceptable). Is there any strategy to get the latest data from the DB while also minimizing the time cost?
This is my main operation using the Oracle data:
dataOracle.createOrReplaceTempView("TABLE")
val total = spark.sql(s"select count(*) from TABLE where name = ${name}").first().getLong(0)
val items = spark.sql(s"SELECT count(*) from TABLE where index = ${id} and name = ${name}").first().getLong(0)
val first_rule: Double = total.toDouble / items.toDouble
If your dataframe is updated and you need those updates, then by definition you can't cache anything; you just need to read it all over again. A possible optimization is to add a last-modified timestamp column to your table in the database and only read the entries whose last-modified timestamp is greater than some value.
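A rough Java sketch of that idea (the LAST_MODIFIED column and the lastSeen watermark are assumptions for illustration; the connection options mirror the ones above):

// Only rows modified after the watermark are fetched; everything older stays cached on the Spark side.
String lastSeen = "2019-01-01 00:00:00"; // hypothetical watermark kept from the previous cycle
String incrementalQuery =
        "(select * from table where LAST_MODIFIED > TO_TIMESTAMP('" + lastSeen + "', 'YYYY-MM-DD HH24:MI:SS'))";

Dataset<Row> freshRows = spark.read()
        .format("jdbc")
        .option("url", conn_url)
        .option("dbtable", incrementalQuery)
        .option("user", oracle_user)
        .option("password", oracle_pass)
        .option("driver", oracle_driver)
        .load();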
I have a Spark job which reads an input file into a DataFrame, does some computation, and generates two outputs, processed and filtered.
Dataset<Row> input = sparkSession.read().parquet(inputPath);
Dataset<Row> processed = someFunction(input);
Dataset<Row> filtered = processed.filter(someCondition);
processed.write().parquet(outputPath1);
filtered.write().parquet(outputPath2);
I observed that during execution someFunction() is called twice (once while writing processed and again while writing filtered, due to lazy evaluation in Spark).
Is there a way to write both outputs (multiple outputs in general) with a single call to someFunction()?
You can do it by caching processed:
Dataset<Row> processed = someFunction(input).cache(); //cache
Dataset<Row> filtered = processed.filter(someCondition);
Because the DataFrame used to produce filtered is cached, Spark won't need to call someFunction() a second time.
Spark has the ability to .persist() a DataFrame for future computations. By default, it stores the computed DataFrame in memory and spills to disk (temporarily, for the life of the application) if necessary.
Dataset<Row> input = sparkSession.read().parquet(inputPath);
Dataset<Row> processed = someFunction(input).persist();
Dataset<Row> filtered = processed.filter(someCondition);
processed.write().parquet(outputPath1);
filtered.write().parquet(outputPath2);
processed.unpersist();
I just wrote a toy class to test Spark DataFrames (actually Datasets, since I'm using Java).
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
//
System.out.println(ds.count());
According to my understanding, there are 2 actions, "insertInto" and "count".
I debugged the code step by step; when running "insertInto", I see several lines like:
19/01/21 20:14:56 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
When running "count", I still see similar logs:
19/01/21 20:15:26 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
I have 2 questions:
1) When there are 2 actions on the same dataframe like above, if I don't call ds.cache or ds.persist explicitly, will the 2nd action always cause the SQL query to be re-executed?
2) If I understand the log correctly, both actions trigger HDFS file reading; does that mean ds.cache() doesn't actually work here? If so, why doesn't it work here?
Many thanks.
It's because you append into the table that ds is created from, so ds needs to be recomputed because the underlying data changed. In such cases, Spark invalidates the cache. See for example this Jira (https://issues.apache.org/jira/browse/SPARK-24596):
When invalidating a cache, we invalid other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table has been dropped itself, all caches that use this table should be invalidated or refreshed.
Try to run the ds.count before inserting into the table.
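A sketch of that reordering on the toy example above (only the order of the two actions changes):

Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
System.out.println(ds.count());                             // materializes the cache before the table is modified
ds.write().mode(SaveMode.Append).insertInto("test2.dummy"); // the append happens after the count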
I found that the other answer doesn't work. What I had to do was break the lineage so that the DataFrame I was writing does not know that one of its sources is the table I am writing to. To break the lineage, I created a copy of the DataFrame using:
copy_of_df = sql_context.createDataFrame(df.rdd)
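For the Java version of the toy example, a rough equivalent of this lineage-breaking trick (a sketch only, reusing the names from above) would be:

// Rebuilding the Dataset from its RDD and schema gives it a plan with no reference to the
// source table, so the write no longer sees that it reads from the table it writes to.
Dataset<Row> copyOfDs = spark.createDataFrame(ds.javaRDD(), ds.schema());
copyOfDs.write().mode(SaveMode.Append).insertInto("test2.dummy");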
I want to use Apache Spark on my cluster, which is made up of 5 low-end machines. First I installed Cassandra 3.11.3 on my nodes, and all of my nodes are OK.
After that I inserted 100k records into my nodes with a Java API, without using Spark, and everything is OK too.
Now I want to execute a simple query like as follows:
select * from myKeySpace.myTbl where field1='someValue';
Since my nodes are weak in hardware, I want to get just a few records from myTbl, like this:
select * from myKeySpace.myTbl where field1='someValue' limit 20;
I have tested this (A) but it is very slow (and I don't know the reason):
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue' limit 20");
and also this (B), where I think Spark fetches all the data and then applies the limit function, which is not my goal:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue'").limit(20);
I think I can use Spark Core (C) too. I also know that a method called perPartitionLimit is implemented in Cassandra 3.6 and greater (D).
As you know, since my nodes are weak, I don't want to fetch all records from the Cassandra table and then apply a limit or something like that. I want to fetch only a small number of records from my table, so that my nodes can handle it.
So what is the best solution?
UPDATE
I have followed the suggestion given by @AKSW in the comments:
SparkConf conf = new SparkConf()
        .setAppName("SparkTest")
        .set("spark.cassandra.connection.host", "192.168.107.100");

long limit = 20;
JavaSparkContext jsc = new JavaSparkContext(conf);

CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
        .cassandraTable("myKeySpace", "myTbl")
        .select("id")
        .perPartitionLimit(limit);

System.out.println("Count: " + rdd1.count()); // output is "Count: 100000" which is wrong!
jsc.stop();
but perPartitionLimit(limit) with limit=20 does not work and all records are fetched!