I'm having a performance issue when querying a large table with multiple threads.
I'm using Oracle, Spring 2 and Java 7.
I use a PoolDataSource (driver oracle.jdbc.pool.OracleDataSource) with as many connections as there are threads analysing a single table.
I made sure, by logging poolDataSource.getStatistics(), that I have enough available connections at any time.
Here is the code:
ExecutorService executorService = Executors.newFixedThreadPool(nbThreads);
List<Foo> foo = new ArrayList<>();
List<Callable<List<Foo>>> callables = new ArrayList<>();
int offset = 1;
while (!offsetMaxReached) {               // loop until the last offset has been queued
    callables.add(new Callable<List<Foo>>() {
        @Override
        public List<Foo> call() throws SQLException, InterruptedException {
            return dao.doTheJob(...);
        }
    });
    offset += 10000;
}
for (Future<List<Foo>> fooFuture : executorService.invokeAll(callables)) {
    foo.addAll(fooFuture.get());
}
executorService.shutdown();
executorService.awaitTermination(1, TimeUnit.DAYS);
In the DAO, I use a connection from the PoolDataSource and Spring's JdbcTemplate.query(query, PreparedStatementSetter, RowCallbackHandler).
The doTheJob method does the exact same thing for every query result.
My queries look like: SELECT A, B, C FROM MY.BIGTABLE OFFSET ? ROWS FETCH NEXT ? ROWS ONLY
So, to sum up, I have n threads created by a FixedThreadPool, and each thread deals with the exact same amount of data and does the exact same things.
But each thread takes longer to complete than the previous one!
Example:
4 threads are launched at the same time, but the first row of each result set (i.e. in the RowCallbackHandlers) is processed at:
thread 1 : 1.5s
thread 2 : 9s
thread 3 : 18s
thread 4 : 35s
and so on...
What can be the cause of such behavior ?
Edit :
The main cause of the problem was the processed table itself.
With OFFSET x ROWS FETCH NEXT y ROWS ONLY, Oracle has to scan through all the rows preceding the offset, starting from the first one.
So accessing offset 10 is much faster than accessing offset 10000000!
I was able to get good response times with temporary tables and good indexes.
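For reference, another common way to avoid the growing cost of large offsets is keyset ("seek") pagination, where each worker is given a range of an indexed key instead of an offset. A minimal sketch, assuming BIGTABLE has an indexed numeric ID column and that Foo has a matching constructor (both are illustrative):

// Each worker reads one key range, so Oracle can seek via the index on ID
// instead of scanning every row that precedes the requested offset.
List<Foo> loadRange(final long rangeStart, final long rangeEnd) {
    final String sql = "SELECT A, B, C FROM MY.BIGTABLE WHERE ID > ? AND ID <= ? ORDER BY ID";
    final List<Foo> result = new ArrayList<Foo>();
    jdbcTemplate.query(sql,
            new PreparedStatementSetter() {
                @Override
                public void setValues(PreparedStatement ps) throws SQLException {
                    ps.setLong(1, rangeStart);
                    ps.setLong(2, rangeEnd);
                }
            },
            new RowCallbackHandler() {
                @Override
                public void processRow(ResultSet rs) throws SQLException {
                    result.add(new Foo(rs.getString("A"), rs.getString("B"), rs.getString("C")));
                }
            });
    return result;
}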
Small question regarding Spark, please.
What I would like to achieve is quite straightforward:
I have a Spark cluster with 10 executors which I would like to utilize.
I need to run a query selecting 10 rows from the DB.
My expectation is something like: select 10 rows, and the results are rows 1 2 3 4 5 6 7 8 9 10.
Then apply a map operation on each row: executor 1 applies the operation Op to one row, executor 2 applies the operation Op to another row, etc.
Note that my operation Op has proper logging and proper KPIs.
Therefore, I tried this:
public static void main(String[] args) {
final String query = "SELECT TOP(10) id, last_name, first_name FROM mytable WHERE ...";
final SparkSession sparkSession = SparkSession.builder().getOrCreate();
final Properties dbConnectionProperties = new Properties();
dbConnectionProperties.putAll([...]);
final Dataset<Row> topTenDataSet = sparkSession.read().jdbc(someUrl, query, dbConnectionProperties);
topTenDataSet.show();
final Dataset<String> topTenDataSetAfterMap = topTenDataSet.repartition(10).map((MapFunction<Row, String>) row -> performOperationWithLogAndKPI(row), Encoders.STRING());
LOGGER.info("the count is expected to be 10 " + topTenDataSetAfterMap.count() + topTenDataSetAfterMap.showString(100000, 1000000, false));
sparkSession.stop();
}
With this code, there is a strange outcome.
Both topTenDataSet.show() and topTenDataSetAfterMap.count() show 10 rows, which is what I want.
But when I look at the logs from the operation performOperationWithLogAndKPI, I can see far more than 10 log lines and far more than 10 metrics. I can see executor 1 performing the operation 10 times, but also executor 2 performing it 10 times, etc.
It seems like each executor runs its own "SELECT TOP(10) from DB" and applies the map function to its own dataset.
May I ask: did I make a mistake in the code?
Is my understanding incorrect?
How can I achieve what I expect: query once, and have each executor apply a function to part of the result set?
Thank you
If you're trying to execute multiple actions on the same Dataset, try to cache it. This way the select top 10 results query should be executed only once:
final Dataset<Row> topTenDataSet = sparkSession.read().jdbc(someUrl, query, dbConnectionProperties);
topTenDataSet.cache();
topTenDataSet.show();
final Dataset<String> topTenDataSetAfterMap = topTenDataSet.repartition(10).map((MapFunction<Row, String>) row -> performOperationWithLogAndKPI(row), Encoders.STRING());
Further info here
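Note that cache() is lazy: the JDBC read only runs on the first action that touches the Dataset, and later actions are served from the cached partitions. A minimal sketch using the same variables as above:

topTenDataSet.cache();
final long rowCount = topTenDataSet.count();   // first action: runs the JDBC query once and fills the cache
topTenDataSet.show();                          // served from the cache, no second query against the DB
LOGGER.info("cached " + rowCount + " rows");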
I'm having a bit of trouble understanding how to manually commit properly for each record I consume.
First, let's look at an example from https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
buffer.add(record);
}
if (buffer.size() >= minBatchSize) {
insertIntoDb(buffer);
consumer.commitSync();
buffer.clear();
}
}
This example commits only after all the records that were received in the poll were processed. I think this isn't a great approach, because if we receive three records, and my service dies while processing the second one, it will end up consuming the first record again, which is incorrect.
So there's a second example that covers committing records on a per-partition basis:
try {
while(running) {
ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
for (TopicPartition partition : records.partitions()) {
List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
for (ConsumerRecord<String, String> record : partitionRecords) {
System.out.println(record.offset() + ": " + record.value());
}
long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
}
}
} finally {
consumer.close();
}
However, I think this suffers from the same problem: it only commits after processing all the records that came from a particular partition.
The solution I have managed to come up with is this:
val consumer: Consumer<String, MyEvent> = createConsumer(bootstrap)
consumer.subscribe(listOf("some-topic"))
while (true) {
val records: ConsumerRecords<String, MyEvent> = consumer.poll(Duration.ofSeconds(1))
if (!records.isEmpty) {
mainLogger.info("Received ${records.count()} events from CRS kafka topic, with partitions ${records.partitions()}")
records.forEach {
mainLogger.debug("Record at offset ${it.offset()}, ${it.value()}")
processEvent(it.value()) // Complex event processing occurs in this function
consumer.commitSync(mapOf(TopicPartition(it.topic(), it.partition()) to OffsetAndMetadata (it.offset() + 1)))
}
}
}
Now, this seems to work in my testing. So far, though, there appears to be only one partition in use (I have checked this by logging records.partitions()).
Is this approach going to cause any issues? The Consumer API does not seem to provide a way to commit an offset without specifying a partition, which seems a bit odd to me. Am I missing something here?
There's no right or wrong way to commit. It really depends on your use case and application.
Committing every offset gives more granular control, but it has an impact on performance. At the other end of the spectrum, you could commit asynchronously every X seconds (like auto commit does) and have very little overhead but a lot less control.
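As an illustration of the asynchronous, interval-based style, a minimal sketch (the interval and the process step are made up):

long lastCommit = System.currentTimeMillis();
while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        process(record);                          // hypothetical processing step
    }
    if (System.currentTimeMillis() - lastCommit >= 5_000) {
        consumer.commitAsync();                   // commits the offsets of the last poll, without blocking
        lastCommit = System.currentTimeMillis();
    }
}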
In the first example, events are processed and committed in batch. It's interesting in terms of performance, but in case of error, the full batch could be reprocessed.
In the second example, it's also batching, but only per partition. This should lead to smaller batches, so less performance but less reprocessing in case things go wrong.
In your last example, you chose to commit every single message. While this gives the most control, it significantly affects performance. In addition, like the other cases, it's not fully error proof.
If the application crashes after the event is processed but before it's committed, then upon restart the last event is likely to be reprocessed (i.e. at-least-once semantics). But at least only one event should be affected.
If you want exactly once semantics, you need to use the Transactional Producer.
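For completeness, a minimal sketch of the transactional read-process-write pattern (the topic names, group id, process function and serializer setup are illustrative):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("transactional.id", "my-processor-1");    // required to enable transactions
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();

while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    if (records.isEmpty()) continue;
    producer.beginTransaction();
    try {
        for (ConsumerRecord<String, String> record : records) {
            producer.send(new ProducerRecord<>("output-topic", record.key(), process(record.value())));
        }
        // Commit the consumed offsets as part of the same transaction.
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (TopicPartition tp : records.partitions()) {
            List<ConsumerRecord<String, String>> partitionRecords = records.records(tp);
            offsets.put(tp, new OffsetAndMetadata(partitionRecords.get(partitionRecords.size() - 1).offset() + 1));
        }
        producer.sendOffsetsToTransaction(offsets, "my-consumer-group");
        producer.commitTransaction();
    } catch (KafkaException e) {
        producer.abortTransaction();                 // for fatal errors (e.g. fencing), close the producer instead
    }
}

The consumer should run with enable.auto.commit=false, and downstream readers need isolation.level=read_committed to only see committed messages.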
How can I make sure the following condition always holds?
Let's say I have 3 tables which contain 1000, 2000 and 3000 records.
I have to load all of those records into the cache, but I want to be sure that all the data is in the cache before a client uses it.
Note that the cache mode is replicated.
Does Apache Ignite have this feature?
Here is what I plan to do:
I will set "atomicityMode" to "TRANSACTIONAL".
I will create a class for the cacheStoreFactory.
This class extends CacheStoreAdapter and contains a loadCache method.
Here is the pseudocode:
public void loadCache(IgniteBiInClosure<K, V> clo, Object... args) {
    // Connect to all the databases.
    /*
    while true:
        try (Transaction transaction = Ignition.ignite().transactions()
                .txStart(TransactionConcurrency.OPTIMISTIC, TransactionIsolation.SERIALIZABLE)) {
            - run a PreparedStatement such as "select * from persons" against every database
            - read all the ResultSets
            - build the corresponding objects and call clo.apply(objectID, object)
            - before transaction.commit():
                - if there is a way, find the total number of records across all databases
                  (maybe before entering the try block)
                - then, again if there is a way, compare the cache size with that total
                - if the two numbers are equal -> transaction.commit() and break
                - else -> continue
        }
    */
}
You can use distributed data structures to signal that some action was completed. For example:
https://apacheignite.readme.io/docs/countdownlatch
https://apacheignite.readme.io/docs/distributed-semaphore
https://apacheignite.readme.io/docs/atomic-types
Also, you can check the size of the cache, or send a message from one node to another as shown here:
https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/messaging/MessagingExample.java
However, while you are loading the data inside a transaction, other nodes will not be able to read it until the transaction commits.
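As an illustration of the countdown-latch option mentioned above, a minimal sketch (the cache and latch names are made up):

// On the node that loads the data:
Ignite ignite = Ignition.ignite();
IgniteCountDownLatch loadedLatch = ignite.countDownLatch("cacheLoaded", 1, false, true);
ignite.cache("personsCache").loadCache(null);   // triggers CacheStore.loadCache(...) on the server nodes
loadedLatch.countDown();                        // signal that all records are in the cache

// On a client node, before reading the cache:
IgniteCountDownLatch waitLatch = Ignition.ignite().countDownLatch("cacheLoaded", 1, false, true);
waitLatch.await();                              // blocks until the loader calls countDown()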
I have a requirement to insert/update more than 15000 rows in 3 tables, so that's 45k inserts in total.
I used StatelessSession in Hibernate after reading online that it is the best option for batch processing, as it doesn't have a persistence context cache.
session = sessionFactory.openStatelessSession();
for (Employee e : emplList) {
    session.insert(e);
}
transaction.commit();
But this code takes more than an hour to complete.
Is there a way to save all the entity objects in one go?
Save the entire collection rather than doing it one by one?
Edit: Is there any other framework that can offer a quick insert?
Cheers!!
You should read this article by Vlad Mihalcea:
How to batch INSERT and UPDATE statements with Hibernate
You need to make sure that you've set the hibernate property:
hibernate.jdbc.batch_size
So that Hibernate can batch these inserts, otherwise they'll be done one at a time.
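For example, a minimal sketch of setting it programmatically (the value 50 is just an example; the same keys can also go in hibernate.cfg.xml or persistence.xml, and order_inserts/order_updates are optional but usually help batching):

Configuration cfg = new Configuration();
cfg.setProperty("hibernate.jdbc.batch_size", "50");   // send inserts to the driver in batches of 50
cfg.setProperty("hibernate.order_inserts", "true");   // group inserts of the same entity type together
cfg.setProperty("hibernate.order_updates", "true");   // same for updates
SessionFactory sessionFactory = cfg.buildSessionFactory();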
There is no way to insert all entities in one go. Even if you could do something like session.save(emplList), internally Hibernate would still save them one by one.
According to the Hibernate User Guide, StatelessSession does not use the batching feature:
The insert(), update(), and delete() operations defined by the StatelessSession interface operate directly on database rows. They cause the corresponding SQL operations to be executed immediately. They have different semantics from the save(), saveOrUpdate(), and delete() operations defined by the Session interface.
Instead, use a normal Session and clear the cache from time to time. Actually, I suggest you measure your code first and then make changes like using hibernate.jdbc.batch_size, so you can see how much each tweak improves your load.
Try to change it like this:
session = sessionFactory.openSession();
int count = 0;
int step = 0;
int stepSize = 1_000;
long start = System.currentTimeMillis();
for (Employee e : emplList) {
    session.save(e);
    count++;
    if (++step == stepSize) {
        long elapsed = System.currentTimeMillis() - start;
        long linesPerSecond = stepSize * 1_000 / elapsed;   // multiply first to avoid integer-division truncation
        StringBuilder msg = new StringBuilder();
        msg.append("Step time: ");
        msg.append(elapsed);
        msg.append(" ms Lines: ");
        msg.append(count);
        msg.append("/");
        msg.append(emplList.size());
        msg.append(" Lines/Seconds: ");
        msg.append(linesPerSecond);
        System.out.println(msg.toString());
        start = System.currentTimeMillis();
        step = 0;
        session.flush();   // push the pending insert batch to the database
        session.clear();   // detach the entities to keep the persistence context small
    }
}
transaction.commit();
About hibernate.jdbc.batch_size: you can try different values, including some very large ones, depending on the underlying database and the network configuration. For example, I use a value of 10,000 on a 1 Gbps network between the app server and the database server, which gives me 20,000 records per second.
Set stepSize to the same value as hibernate.jdbc.batch_size.
I am executing the following set of statements in my Java application. It connects to an Oracle database.
stat = connection.createStatement();
stat1 = connection.createStatement();
ResultSet rs = stat.executeQuery(BIGQUERY);
while (rs.next()) {
    obj1.setAttr1(rs.getString(1));
    obj1.setAttr2(rs.getString(2));
    obj1.setAttr3(rs.getString(3));
    obj1.setAttr4(rs.getString(4));
    ResultSet rs1 = stat1.executeQuery(SMALLQ1);
    while (rs1.next()) {
        obj1.setAttr5(rs1.getString(1));
    }
    ResultSet rs2 = stat1.executeQuery(SMALLQ2);
    while (rs2.next()) {
        obj1.setAttr6(rs2.getString(1));
    }
    .
    .
    .
    linkedBlockingQueue.add(obj1);
}
// all statements and connections are closed
The BIGQUERY returns around 4.5 million records, and for each record I have to execute the smaller queries, which are 14 in number. Each small query has 3 inner joins.
My multi-threaded application can currently process 90,000 records in one hour. But I may have to run the code daily, so I want to process all the records in 20 hours. I am using about 200 threads which run the above code and store the records in the LinkedBlockingQueue.
Does blindly increasing the thread count help performance, or is there some other way I can increase the performance of the result-set processing?
PS: I am unable to post the queries here, but I am assured that all of them are optimized.
To improve JDBC performance for your scenario you can apply several modifications.
As you will see, all of these modifications can significantly speed up your task.
1. Using batch operations.
You can read your big query and store the results in some kind of buffer.
Only when the buffer is full should you run the subquery for all the data collected in the buffer.
This significantly reduces the number of SQL statements to execute.
static final int BATCH_SIZE = 1000;
List<MyData> buffer = new ArrayList<>(BATCH_SIZE);
while (rs.next()) {
    MyData record = new MyData( rs.getString(1), ..., rs.getString(4) );
    buffer.add( record );
    if (buffer.size() == BATCH_SIZE) {
        processBatch( buffer );
        buffer.clear();               // start collecting the next batch
    }
}
if (!buffer.isEmpty()) {
    processBatch( buffer );           // process the remaining records
}

void processBatch( List<MyData> buffer ) {
    String sql = "select ... where X and id in (" + getIDs(buffer) + ")";
    ResultSet rs1 = stat1.executeQuery(sql);   // one query for all IDs in the buffer
    while (rs1.next()) { ... }
    ...
}
2. Using efficient maps to store content from many selects.
If your records are not too big, you can store them all at once, even for a 4-million-row table.
I have used this approach many times for overnight processes (with no normal users).
Such an approach may need a lot of heap memory (i.e. 100 MB - 1 GB), but it is much faster than approach 1).
To do that you need an efficient map implementation, e.g. gnu.trove.map.TIntObjectMap (etc.),
which is much better than the standard Java library maps.
final TIntObjectMap<MyData> map = new TIntObjectHashMap<MyData>(10000, 0.8f);
// query 1
while (rs.next()) {
    MyData record = new MyData( rs.getInt(1), rs.getString(2), ..., rs.getString(4) );
    map.put(record.getId(), record);
}
// query 2
while (rs.next()) {
    int id = rs.getInt(1);        // my data id
    String x = rs.getString(...);
    int y = rs.getInt(...);
    MyData record = map.get(id);
    record.add( new MyDetail(x, y) );
}
// query 3
// same pattern as query 2
After this you have the map filled with all the collected data, probably with a lot of memory allocated.
This is why you can only use this method if you have such resources.
Another topic is how to write the MyData and MyDetail classes to be as small as possible.
You can use some tricks:
storing 3 integers (with a limited range) in 1 long variable (using a util for bit shifting), as sketched below
storing Date objects as integers (yymmdd)
calling str.intern() for each string fetched from the DB
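A minimal sketch of the bit-packing trick (the field widths are illustrative; each value must fit in its range):

// Pack three non-negative ints into one long: 21 bits each (values up to about 2 million).
static long pack(int a, int b, int c) {
    return ((long) a << 42) | ((long) b << 21) | (long) c;
}

static int unpackA(long packed) { return (int) (packed >>> 42) & 0x1FFFFF; }
static int unpackB(long packed) { return (int) (packed >>> 21) & 0x1FFFFF; }
static int unpackC(long packed) { return (int) packed & 0x1FFFFF; }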
3. Transactions
If you have to do some updates or inserts, then 4 million records is too much to handle in one transaction.
This is too much for most database configurations.
Use approach 1) and commit the transaction for each batch.
On each newly inserted record you can set something like a RUN_ID, and if everything went well you can mark that RUN_ID as successful.
If your queries only read, there is no problem. However, you can mark the transaction as read-only to help your database.
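For example, with plain JDBC this is a one-line hint:

connection.setReadOnly(true);   // tells the driver/database that this connection will only read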
4. JDBC fetch size.
When you load a lot of records from the database, it is very, very important to set a proper fetch size on your JDBC connection.
This reduces the number of physical round trips to the database socket and speeds up your process.
Example:
// jdbc
statement.setFetchSize(500);
// spring
JdbcTemplate jdbc = new JdbcTemplate(datasource);
jdbc.setFetchSize(500);
Here you can find some benchmarks and patterns for using fetch size:
http://makejavafaster.blogspot.com/2015/06/jdbc-fetch-size-performance.html
5. PreparedStatement
Use PreparedStatement rather than Statement.
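For the small queries inside the loop, this means preparing each statement once and only rebinding the parameters, for example (the query text and the bound column are illustrative):

String smallQ1 = "SELECT x FROM other_table WHERE parent_id = ?";
try (PreparedStatement ps = connection.prepareStatement(smallQ1)) {
    ps.setFetchSize(500);
    while (rs.next()) {
        ps.setString(1, rs.getString(1));          // bind the id coming from the big query
        try (ResultSet rs1 = ps.executeQuery()) {
            while (rs1.next()) {
                obj1.setAttr5(rs1.getString(1));
            }
        }
    }
}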
6. Number of SQL statements.
Always try to minimize the number of SQL statements you send to the database.
Try this:
resultSet.setFetchSize(100);
while (resultSet.next()) {
    ...
}
The parameter is the number of rows that should be retrieved from the database in each round trip.