Small question regarding Spark please.
What I would like to achieve is quite straightforward:
I have a Spark cluster with 10 executors which I would like to utilize.
I need to run a query selecting 10 rows from the DB.
My expectation is something like: select 10 rows, results are rows 1 2 3 4 5 6 7 8 9 10.
Then apply a map operation on each row: executor 1 applies the operation Op to one row, executor 2 applies the operation Op to another row, and so on.
Note: my operation Op has proper logging and proper KPIs.
Therefore, I tried this:
public static void main(String[] args) {
    final String query = "SELECT TOP(10) id, last_name, first_name FROM mytable WHERE ...";
    final SparkSession sparkSession = SparkSession.builder().getOrCreate();
    final Properties dbConnectionProperties = new Properties();
    dbConnectionProperties.putAll([...]);
    final Dataset<Row> topTenDataSet = sparkSession.read().jdbc(someUrl, query, dbConnectionProperties);
    topTenDataSet.show();
    final Dataset<String> topTenDataSetAfterMap = topTenDataSet
            .repartition(10)
            .map((MapFunction<Row, String>) row -> performOperationWithLogAndKPI(row), Encoders.STRING());
    LOGGER.info("the count is expected to be 10 " + topTenDataSetAfterMap.count() + topTenDataSetAfterMap.showString(100000, 1000000, false));
    sparkSession.stop();
}
With this code, there is a strange outcome.
Both topTenDataSet.show() and topTenDataSetAfterMap.count() show 10 rows, so that part looks fine.
But when I look at the logs from the operation Op (performOperationWithLogAndKPI), I can see far more than 10 log entries and far more than 10 metrics: executor 1 performs the operation 10 times, but so does executor 2, and so on.
It seems like each executor runs its own "SELECT TOP(10)" against the DB and applies the map function to its own result set.
May I ask: did I make a mistake in the code?
Is my understanding incorrect?
How can I achieve the expected behavior: query once, and have each executor apply the function to part of the result set?
Thank you
If you're trying to execute multiple actions on the same Dataset, try caching it. That way the SELECT TOP(10) query should be executed only once:
final Dataset<Row> topTenDataSet = sparkSession.read().jdbc(someUrl, query, dbConnectionProperties);
topTenDataSet.cache();
topTenDataSet.show();
final Dataset<String> topTenDataSetAfterMap = topTenDataSet.repartition(10).map((MapFunction<Row, String>) row -> performOperationWithLogAndKPI(row), Encoders.STRING());
Further info here
I'm having a performance issue trying to query a large table with multiple threads.
I'm using Oracle, Spring 2, and Java 7.
I use a PoolDataSource (driver oracle.jdbc.pool.OracleDataSource) with as many connections as there are threads to analyse a single table.
I made sure, by logging poolDataSource.getStatistics(), that I have enough available connections at any time.
Here is the code:
ExecutorService executorService = Executors.newFixedThreadPool(nbThreads);
List<Foo> foo = new ArrayList<>();
List<Callable<List<Foo>>> callables = new ArrayList<>();
int offset = 1;
while (!offSetMaxReached) {
    callables.add(new Callable<List<Foo>>() {
        @Override
        public List<Foo> call() throws SQLException, InterruptedException {
            return dao.doTheJob(...);
        }
    });
    offset += 10000;
}
for (Future<List<Foo>> fooFuture : executorService.invokeAll(callables)) {
    foo.addAll(fooFuture.get());
}
executorService.shutdown();
executorService.awaitTermination(1, TimeUnit.DAYS);
In the DAO, I use a connection from the PoolDataSource and Spring's JdbcTemplate.query(query, PreparedStatementSetter, RowCallbackHandler).
The doTheJob method does the exact same thing for every query result.
My queries look like : SELECT A, B, C FROM MY.BIGTABLE OFFSET ? ROWS FETCH NEXT ? ROWS ONLY
So, to sum up, I have n threads created by a fixed thread pool; each thread deals with the exact same amount of data and does the exact same things.
But each thread takes longer to complete than the previous one!
Example:
4 threads are launched at the same time, but the first row of each ResultSet (i.e. in the RowCallbackHandlers) is processed at:
thread 1 : 1.5s
thread 2 : 9s
thread 3 : 18s
thread 4 : 35s
and so on...
What could be the cause of such behavior?
Edit:
The main cause of the problem was within the processed table itself.
With OFFSET x ROWS FETCH NEXT y ROWS ONLY, Oracle needs to run through all the rows, starting from the first one.
So accessing offset 10 is much faster than accessing offset 10,000,000!
I was able to get good response times with temporary tables and good indexes.
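For illustration only, here is a keyset-style (seek) pagination sketch that avoids making Oracle skip over all preceding rows on every page; it assumes a hypothetical indexed numeric ID column and invented column names, not the original schema:
// Hypothetical sketch: page through the table by seeking on an indexed ID column
// instead of using OFFSET, so each page costs roughly the same regardless of position.
String sql = "SELECT ID, A, B, C FROM MY.BIGTABLE WHERE ID > ? ORDER BY ID FETCH FIRST 10000 ROWS ONLY";
long lastSeenId = 0L;
boolean more = true;
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    while (more) {
        ps.setLong(1, lastSeenId);
        more = false;
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                lastSeenId = rs.getLong("ID");   // remember where this page ended
                // ... process rs.getString("A"), rs.getString("B"), rs.getString("C")
                more = true;
            }
        }
    }
}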
I want to use Apache Spark on my cluster, which is made up of 5 low-end machines. First I installed Cassandra 3.11.3 on my nodes, and all of my nodes are OK.
After that I inserted 100k records into my nodes with a Java API, without using Spark, and that is OK too.
Now I want to execute a simple query like the following:
select * from myKeySpace.myTbl where field1='someValue';
Since my nodes are weak in hardware, I want to get just a few records from myTbl, like this:
select * from myKeySpace.myTbl where field1='someValue' limit 20;
I have tested this (A), but it is very slow (and I don't know the reason):
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue' limit 20");
and also this (B), where I think Spark fetches all the data and then applies the limit function, which is not my goal:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue'").limit(20);
I think I can use Spark Core (C) too. I also know that a method called perPartitionLimit is implemented in Cassandra 3.6 and later (D).
As you know, since my nodes are weak, I don't want to fetch all records from the Cassandra table and then apply a limit or something like that. I want to fetch just a small number of records from my table, so that my nodes can handle it.
So what is the best solution?
Update:
I have tried the suggestion given by @AKSW in the comments:
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.set("spark.cassandra.connection.host","192.168.107.100");
long limit=20;
JavaSparkContext jsc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
.cassandraTable("myKeySpace", "myTbl")
.select("id").perPartitionLimit(limit);
System.out.println("Count: " + rdd1.count()); //output is "Count: 100000" which is wrong!
jsc.stop();
but perPartitionLimit(limit) with limit=20 does not work, and all records are fetched!
I'm trying to load a dataset into Spark using the following code:
Dataset<Row> dataset = spark.read().jdbc(RPP_CONNECTION_URL, creditoDia3, rppDBProperties);
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia2, rppDBProperties));
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia, rppDBProperties));
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia3, rppDBProperties));
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia2, rppDBProperties));
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia, rppDBProperties));
dataset = dataset.cache();
Long numberOfRowsProcessed = dataset.count();
So, after these 6 reads hitting my database, extracting the datasets and counting the rows, I expected not to need to go to the database anymore. But after running the following code:
dataset.createOrReplaceTempView("temp");
Dataset<Row> base = spark.sql(new StringBuilder()
.append("select ")
.append("TRANSACTION ")
.append("from temp ")
.append("where PAYMENT_METHOD in (1,2,3,4) ")
.append("and TRANSACTION_STATUS in ('A','B') ")
.toString()
);
base.createOrReplaceTempView("base");
But what I am actually seeing is Spark running the query again, this time appending the filters I passed when defining Dataset<Row> base. As you can see, I have already cached the data, but it had no effect.
Question: is it possible to load everything into memory in Spark and use the cached data, querying Spark and no longer the database?
Fetching the data from my relational database is expensive and takes a while.
UPDATE
I noticed that Spark sends new queries to the database when it tries to execute:
from base a
left join base b on a.IDT_TRANSACTION = b.IDT_TRANSACTION and a.DATE = b.DATE
This is the string Spark appends to the query (captured from the database):
WHERE ("IDT_TRANSACTION_STATUS" IS NOT NULL) AND ("NUM_BIN_CARD" IS NOT NULL)
The log shows:
18/01/16 14:22:20 INFO DAGScheduler: ShuffleMapStage 12 (show at RelatorioBinTransacao.java:496) finished in 13,046 s
18/01/16 14:22:20 INFO DAGScheduler: looking for newly runnable stages
18/01/16 14:22:20 INFO DAGScheduler: running: Set(ShuffleMapStage 9)
18/01/16 14:22:20 INFO DAGScheduler: waiting: Set(ShuffleMapStage 13, ShuffleMapStage 10, ResultStage 14, ShuffleMapStage 11)
18/01/16 14:22:20 INFO DAGScheduler: failed: Set()
I'm not sure I understand what it is trying to say, but I think something is missing from memory.
If I just comment out the left join like this:
from base a
//left join base b on a.IDT_TRANSACTION = b.IDT_TRANSACTION and a.DATE = b.DATE
it works just fine and it doesn't go to the database anymore.
This sounds like you may not have enough memory to store the unioned results on your cluster. After Long numberOfRowsProcessed = dataset.count(); please look at the Storage tab of your Spark UI to see if the whole dataset is fully cached or not. If it is NOT then you need more memory (and/or disk space).
If you've confirmed the dataset is indeed cached then please post the query plan (e.g. base.explain()).
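In case the cache does not fit in memory, here is a minimal sketch (reusing the dataset variable from the question, and assuming org.apache.spark.storage.StorageLevel is imported) of requesting an explicit storage level that may spill to disk, and then checking the plan:
// Sketch: pin the unioned dataset with a storage level that can spill to disk,
// materialize it once, then check that later plans read from the cache.
dataset = dataset.persist(StorageLevel.MEMORY_AND_DISK());
dataset.count();   // fills the cache (triggers the JDBC reads once)
dataset.explain(); // should show an InMemoryTableScan instead of a JDBC scan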
I figured out a way to work around the problem. I had to add a cache() call to every line that sends a query to the database. So it looks like this:
Dataset<Row> dataset = spark.read().jdbc(RPP_CONNECTION_URL, fake, rppDBProperties);
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia3, rppDBProperties).cache());
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia2, rppDBProperties).cache());
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, creditoDia, rppDBProperties).cache());
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia3, rppDBProperties).cache());
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia2, rppDBProperties).cache());
dataset = dataset.union(spark.read().jdbc(RPP_CONNECTION_URL, debitoDia,rppDBProperties).cache());
dataset = dataset.cache();
I had to add the first line with a fake SQL query because, no matter what I did, Spark did not seem to cache the first query, so I kept seeing the first query being sent to the database.
Bottom line: I don't understand why I have to add a cache() call to every line if I already did it at the end. But it worked.
I am executing the following set of statements in my Java application. It connects to an Oracle database.
stat = connection.createStatement();
stat1 = connection.createStatement();
ResultSet rs = stat.executeQuery(BIGQUERY);
while (rs.next()) {
    obj1.setAttr1(rs.getString(1));
    obj1.setAttr2(rs.getString(2));
    obj1.setAttr3(rs.getString(3));
    obj1.setAttr4(rs.getString(4));
    ResultSet rs1 = stat1.executeQuery(SMALLQ1);
    while (rs1.next()) {
        obj1.setAttr5(rs1.getString(1));
    }
    ResultSet rs2 = stat1.executeQuery(SMALLQ2);
    while (rs2.next()) {
        obj1.setAttr6(rs2.getString(1));
    }
    .
    .
    .
    linkedBlockingQueue.add(obj1);
}
//all statements and connections are closed
The BIGQUERY returns around 4.5 million records, and for each record I have to execute the smaller queries, which are 14 in number. Each small query has 3 inner joins.
My multi-threaded application can currently process 90,000 records in one hour. But I may have to run the code daily, so I want to process all the records within 20 hours. I am using about 200 threads which run the above code and store the records in a LinkedBlockingQueue.
Does blindly increasing the thread count help performance, or is there some other way in which I can increase the performance of the result set processing?
PS: I am unable to post the queries here, but I am assured that all queries are optimized.
To improve JDBC performance for your scenario, you can apply several modifications.
As you will see, all these modifications can significantly speed up your task.
1. Using batch operations.
You can read the results of your big query and store them in some kind of buffer.
Only when the buffer is full should you run the subquery for all the data collected in the buffer.
This significantly reduces the number of SQL statements to execute.
static final int BATCH_SIZE = 1000;

List<MyData> buffer = new ArrayList<>(BATCH_SIZE);
while (rs.next()) {
    MyData record = new MyData( rs.getString(1), ..., rs.getString(4) );
    buffer.add( record );
    if (buffer.size() == BATCH_SIZE) {
        processBatch( buffer );
        buffer.clear();
    }
}
if (!buffer.isEmpty()) {
    processBatch( buffer ); // process the last, partially filled batch
}

void processBatch( List<MyData> buffer ) {
    String sql = "select ... where X and id in (" + getIDs(buffer) + ")";
    ResultSet rs1 = stat1.executeQuery(sql); // one query for all IDs in the buffer
    while (rs1.next()) { ... }
    ...
}
2. Using efficient maps to store content from many selects.
If your records are not that big, you can store them all at once, even for a 4 million row table.
I have used this approach many times for nightly processes (with no regular users online).
Such an approach may need a lot of heap memory (e.g. 100 MB - 1 GB), but it is much faster than approach 1.
To do that you need an efficient map implementation, e.g. gnu.trove.map.TIntObjectMap,
which is much better than the Java standard library maps.
final TIntObjectMap<MyData> map = new TIntObjectHashMap<MyData>(10000, 0.8f);

// query 1
while (rs.next()) {
    MyData record = new MyData( rs.getInt(1), rs.getString(2), ..., rs.getString(4) );
    map.put(record.getId(), record);
}

// query 2
while (rs.next()) {
    int id = rs.getInt(1); // id of the MyData record
    String x = rs.getString(...);
    int y = rs.getInt(...);
    MyData record = map.get(id);
    record.add( new MyDetail(x, y) );
}

// query 3
// same pattern as query 2
After this you have a map filled with all the collected data, probably with a lot of memory allocated.
This is why you can use this method only if you have such resources.
Another topic is how to write the MyData and MyDetail classes so that they are as small as possible.
You can use some tricks (a sketch follows after this list):
storing 3 integers (with a limited range) in 1 long variable (using a utility for bit shifting)
storing Date objects as integers (yymmdd)
calling str.intern() for each string fetched from the DB
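A minimal sketch of these tricks; the bit widths, field layout and helper names are invented for illustration:
// Sketch of the tricks above; bit widths and helpers are invented for illustration.
// Trick 1: pack three non-negative ints (each assumed to fit in 20 bits) into one long.
static long pack(int a, int b, int c) {
    return ((long) a << 40) | ((long) b << 20) | (long) c;
}
static int unpackA(long packed) { return (int) (packed >>> 40) & 0xFFFFF; }
static int unpackB(long packed) { return (int) (packed >>> 20) & 0xFFFFF; }
static int unpackC(long packed) { return (int) packed & 0xFFFFF; }

// Trick 2: store a date as a compact yymmdd int instead of a java.util.Date object.
static int toYymmdd(java.util.Calendar c) {
    return (c.get(java.util.Calendar.YEAR) % 100) * 10000
         + (c.get(java.util.Calendar.MONTH) + 1) * 100
         + c.get(java.util.Calendar.DAY_OF_MONTH);
}

// Trick 3: intern strings fetched from the DB so repeated values share one instance.
// String lastName = rs.getString(2).intern();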
3. Transactions
If you have to do some updates or inserts, then 4 million records is too much to handle in one transaction.
This is too much for most database configurations.
Use approach 1 and commit the transaction for each batch.
On each newly inserted record you can set something like a RUN_ID, and if everything goes well you can mark that RUN_ID as successful.
If your queries only read, there is no problem. However, you can mark the transaction as read-only to help your database.
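As a hedged sketch of approach 1 combined with per-batch commits (the RESULT_TABLE, the runId variable and the MyData accessors below are placeholders, not from the original code):
// Sketch: insert in batches of 1000, committing each batch in its own transaction.
connection.setAutoCommit(false);
String insertSql = "INSERT INTO RESULT_TABLE (RUN_ID, ID, VALUE) VALUES (?, ?, ?)";
try (PreparedStatement insert = connection.prepareStatement(insertSql)) {
    int inBatch = 0;
    for (MyData record : allRecords) {      // whatever approach 1 or 2 collected
        insert.setLong(1, runId);
        insert.setInt(2, record.getId());
        insert.setString(3, record.getValue());
        insert.addBatch();
        if (++inBatch == 1000) {
            insert.executeBatch();
            connection.commit();            // one transaction per 1000 rows
            inBatch = 0;
        }
    }
    if (inBatch > 0) {                      // flush and commit the last partial batch
        insert.executeBatch();
        connection.commit();
    }
}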
4. JDBC fetch size.
When you load a lot of records from the database, it is very important to set a proper fetch size on your JDBC statement.
This reduces the number of physical round trips to the database socket and speeds up your process.
Example:
// jdbc
statement.setFetchSize(500);
// spring
JdbcTemplate jdbc = new JdbcTemplate(datasource);
jdbc.setFetchSize(500);
Here you can find some benchmarks and patterns for using fetch size:
http://makejavafaster.blogspot.com/2015/06/jdbc-fetch-size-performance.html
5. PreparedStatement
Use PreparedStatement rather than Statement.
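For example, one of the small per-record queries could be prepared once and re-bound for each id; the query body and the ? placeholder below are hypothetical stand-ins for SMALLQ1:
// Sketch: prepare once, bind per record; assumes SMALLQ1 is rewritten with a ? placeholder.
PreparedStatement smallQ1 = connection.prepareStatement(
        "SELECT X FROM SOME_TABLE WHERE PARENT_ID = ?"); // hypothetical body of SMALLQ1
smallQ1.setFetchSize(500);
for (MyData record : buffer) {
    smallQ1.setInt(1, record.getId());
    try (ResultSet rs1 = smallQ1.executeQuery()) {
        while (rs1.next()) {
            // ... fill record from rs1
        }
    }
}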
6. Number of SQL statements.
Always try to minimize the number of SQL statements you send to the database.
Try this:
resultSet.setFetchSize(100);
while (resultSet.next()) {
    ...
}
The parameter is the number of rows that should be retrieved from the database in each round trip.
I have an SQL query which runs against the database every 5 seconds to determine who is currently actively using the software. Active users have pinged the server in the last 10 seconds. (The table gets updated correctly on user activity, and I have a thread evicting entries on session timeouts; that all works correctly.)
What I'm looking for is a more efficient/quicker way to do this, since it gets called frequently, about every 5 seconds. In addition, there may be up to 500 users in the database. The language is Java, but the question really applies to any language.
List<String> r = new ArrayList<String>();
Calendar c = Calendar.getInstance();
long threshold = c.get(Calendar.SECOND) + c.get(Calendar.MINUTE)*60 + c.get(Calendar.HOUR_OF_DAY)*60*60 - 10;
String tmpSql = "SELECT user_name, EXTRACT(HOUR FROM last_access_ts) as hour, EXTRACT(MINUTE FROM last_access_ts) as minute, EXTRACT(SECOND FROM last_access_ts) as second FROM user_sessions";
DBResult rs = DB.select(tmpSql);
for (int i = 0; i < rs.size(); i++)
{
    Map<String, Object> result = rs.get(i);
    long hour = (Long) result.get("hour");
    long minute = (Long) result.get("minute");
    long second = (Long) result.get("second");
    if (hour*60*60 + minute*60 + second > threshold)
        r.add(result.get("user_name").toString());
}
return r;
If you want this to run faster, then create an index on user_sessions(last_access_ts, user_name), and do the date logic in the query:
select user_name
from user_sessions
where last_access_ts >= now() - 5/(24*60*60);
This does have a downside. You are, presumably, updating the last_access_ts field quite often. An index on the field will also have to be updated. On the positive side, this is a covering index, so the index itself can satisfy the query without resorting to the original data pages.
I would move the logic from Java to the DB. This means you translate the if into a where clause, and select only the names of the valid results.
SELECT user_name FROM user_sessions WHERE last_access_ts > ?
In your example, c represents the current time. It is quite possible that the result will be empty.
So your question is really about date-time operations on your database.
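A minimal sketch of binding that parameter from plain JDBC (the connection variable is assumed; the DB helper classes from the question are not used here):
// Sketch: compute the cutoff 10 seconds ago and let the database filter on it.
String sql = "SELECT user_name FROM user_sessions WHERE last_access_ts > ?";
List<String> r = new ArrayList<String>();
PreparedStatement ps = connection.prepareStatement(sql);
ps.setTimestamp(1, new java.sql.Timestamp(System.currentTimeMillis() - 10000L));
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    r.add(rs.getString("user_name"));
}
return r;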
Just let the database do the comparison for you by using this query:
SELECT
  user_name
FROM user_sessions
WHERE TIMESTAMPDIFF(SECOND, last_access_ts, current_timestamp) <= 10
Complete example:
List<String> r = new ArrayList<String>();
// this will return all users that were active within the last 10 seconds
String tmpSql = "SELECT user_name FROM user_sessions "
        + "WHERE TIMESTAMPDIFF(SECOND, last_access_ts, current_timestamp) <= 10";
DBResult rs = DB.select(tmpSql);
for (int i = 0; i < rs.size(); i++)
{
    Map<String, Object> result = rs.get(i);
    r.add(result.get("user_name").toString());
}
return r;
The solution is to move the logic from your code into the SQL query, so that the select returns only the active users, using a where clause.
It is faster to use the SQL built-in functions, fetch fewer records, and iterate less in your code.
Add this to your SQL query to get the active users only:
WHERE TIMESTAMPDIFF(SECOND, last_access_ts, current_timestamp) <= 10
This will get you all the records whose last access was 10 seconds ago or sooner.
Try the MySQL TimeDiff function in your select. This way you can select only the results that are active without having to do any other calculations.
Link: MySQL: how to get the difference between two timestamps in seconds
If I understand you correctly, you have only 500 entries in your user_sessions table. In this case I wouldn't even care about indexes; throw them away. The DB engine probably won't use them anyway for such a low record count, and the performance gain from not updating the indexes on every record update could well be higher than the query overhead.
If you care about DB stress, then lengthen the query/update intervals to 1 minute or more, if your application allows it. Gordon Linoff's answer should give you the best query performance, though.
As a side note (because it has bitten me before): If you don't use the same synchronized time for all user callbacks, then your "active users logic" is flawed by design.