I have a PostgreSQL table with millions of records. I need to process every row, and to track progress I use a boolean column named 'isProcessed': it defaults to false and is set to true once a row has been processed.
The problem is that there are too many records, and exceptions make the code skip some of them, leaving them with isProcessed = false, which makes the execution really slow.
I was thinking of using an index, but an ordinary index on a boolean column does not help.
Please suggest some optimization technique or a better practice.
UPDATE:
I don't have the code; it's just a problem my colleagues asked for my opinion on.
Normally an index on a boolean isn't a good idea, but in PostgreSQL you can create a partial index that contains entries for only one value: http://www.postgresql.org/docs/9.3/interactive/indexes-partial.html. It ends up acting as a queue of things for you to process; items drop out of it once they are done.
CREATE INDEX "yourtable_isProcessed_idx" ON "public"."yourtable"
USING btree ("isProcessed")
WHERE ("isProcessed" IS NOT TRUE);
This will make it easier to find the next item to process. Ideally you should process more than one row at a time, particularly if you can do it in a single query, though doing millions at once may be prohibitive. In that situation, you might be able to do
update yourtable
set ....
where id in (select id from yourtable where "isProcessed" = false limit 100)
If you have to do things one at a time, I'd still limit what you retrieve, for example
select id from yourtable where "isProcessed" = false limit 1
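If the per-row work can be folded into that UPDATE, a small JDBC loop is enough to drain the queue. A minimal sketch, assuming conn is an open java.sql.Connection and reusing the hypothetical table and column names from above:
// uses java.sql.Statement
String claimBatch =
    "UPDATE yourtable SET \"isProcessed\" = true " +       // plus whatever else your processing sets
    "WHERE id IN (SELECT id FROM yourtable WHERE \"isProcessed\" = false LIMIT 100)";
try (Statement stmt = conn.createStatement()) {
    int updated;
    do {
        updated = stmt.executeUpdate(claimBatch);           // touches at most 100 rows per round
    } while (updated > 0);                                  // stop once the partial index is empty
}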
Without seeing your code, it's tough to say what is really going on, but doing any processing row by row, which is what it sounds like is happening, is going to take a VERY long time.
Generally speaking, the best way to work with data is in sets. At the end of your process, you're going to ultimately have a set of records where isProcessed needs to be true (where the operation was successful), and a set where isProcessed needs to be false (where the operation failed). As you process the data, keep track of which records could be updated successfully, as well as which could not be updated. You could do this by making a list or array of the primary key or whatever other data you use to identify the rows. Then, after you're done processing your data, do one update to flag the records that were successful, and one to update the records that were not successful. This will be a bit more code, but updating each row individually after you process it is going to be awfully slow.
Again, seeing code would help, but if you're updating each record after you process it, I suspect this is what's slowing you down.
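As a rough sketch of that bookkeeping (conn, idsToProcess, processRow and the table name are hypothetical; the flag updates are sent as JDBC batches rather than one statement per row):
// uses java.sql.Connection, java.sql.PreparedStatement, java.util.*
List<Long> succeeded = new ArrayList<>();
List<Long> failed = new ArrayList<>();
for (Long id : idsToProcess) {
    try {
        processRow(id);                  // your per-row work
        succeeded.add(id);
    } catch (Exception e) {
        failed.add(id);
    }
}
try (PreparedStatement ok = conn.prepareStatement(
        "UPDATE yourtable SET \"isProcessed\" = true WHERE id = ?")) {
    for (Long id : succeeded) {
        ok.setLong(1, id);
        ok.addBatch();
    }
    ok.executeBatch();                   // one batched round trip for all successful rows
}
// the failed IDs simply stay false; log them or store them somewhere for a later retry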
Here is the approach I use. You should be able to store the processing state, including errors. It can be a single column with the values PENDING, PROCESSED, ERROR, or two columns is_processed and is_error.
This lets you skip records that couldn't be processed successfully and that would otherwise slow down the processing of good tasks. You can try to reprocess them later, or give DevOps a way to move tasks from ERROR back to PENDING if the failure was caused by, for example, a temporarily unavailable resource.
Then you create a partial index on the table which includes only PENDING tasks.
Processing is done using the following algorithm (using Spring: transaction and nestedTransaction are Spring TransactionTemplate instances):
List<Element> batch;
while (!(batch = getNextBatch()).isEmpty()) {
    final List<Element> currentBatch = batch;   // lambdas need an effectively final reference
    transaction.execute((TransactionStatus status) -> {
        for (Element element : currentBatch) {
            try {
                // each element runs in a nested transaction, implemented with a SAVEPOINT
                nestedTransaction.execute((TransactionStatus nestedStatus) -> {
                    processElement(element);
                    markAsProcessed(element);
                    return null;
                });
            } catch (Exception e) {
                markAsFailed(element);          // only this element's savepoint is rolled back
            }
        }
        return null;
    });
}
Several important notes:
- Records are fetched in batches; this saves round trips to the database and is quicker than one-by-one retrieval.
- Each element is processed in a nested transaction (implemented with PostgreSQL SAVEPOINTs). This is quicker than processing each element in its own transaction, yet a failure while processing one element does not lose the results of the other elements in the batch.
- This approach is appropriate when the processing is complex enough that it cannot be done in SQL with a single query over the whole batch. If processElement is just a simple update of the element, the whole batch can be updated with a single UPDATE statement.
- Elements of the batch may be processed in parallel, but that requires propagating the transaction to the worker threads.
Related
I am facing a similar issue as described in this question.
Spring batch jpaPagingItemReader why some rows are not read?
I have to read some records from a table with a query like
SELECT * FROM TABLE1 WHERE COLUMN1 IS NULL
and, based on the values of other columns, do some processing (connect to a REST service), fetch some data and update COLUMN1 with it.
Since I am using RepositoryItemReader with a taskExecutor, the paginated fetch does not work correctly and around half of the eligible records are skipped.
To avoid this issue, I reduced the query to SELECT * FROM TABLE1 so that pagination is not affected, since that result set does not shrink as records are processed, and I added a check in code to skip a record if the column is not null. Even with this setup, records are still being skipped.
Also, this problem only occurs when the page size is smaller than the actual number of records.
If I keep the page size greater than or equal to the total number of eligible records, I don't face any issues. I am not sure if such a large page size (~100000) is a wise thing to have.
I noticed that in that case a single query retrieves all the records, and the taskExecutors then process and write those records in different threads.
Due to high volume of data, I cannot avoid multi-threading as single-threaded mode is dreadfully slow.
Any pointers on what can be done?
You are basically trying to implement the process indicator pattern in a multi-threaded step. There is a sample for that here: Parallel sample. The idea is to use a staging table with the process indicator instead of modifying the original table.
That said, I am not sure the process indicator pattern can be implemented with a paging technique. A partitioned step, where each partition is read sequentially with a cursor-based reader, is a better option in my opinion.
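For the cursor-based part, a minimal sketch of the reader might look like this (the table, column and DTO names are made up, and this shows only the reader, not the full partitioned step):
// uses org.springframework.batch.item.database.JdbcCursorItemReader and org.springframework.jdbc.core.RowMapper
JdbcCursorItemReader<Table1Row> reader = new JdbcCursorItemReader<>();
reader.setName("table1Reader");
reader.setDataSource(dataSource);
// the cursor streams rows, so there is no page window for concurrent updates to shift
reader.setSql("SELECT id, column2 FROM table1 WHERE column1 IS NULL");
reader.setRowMapper((rs, rowNum) -> new Table1Row(rs.getLong("id"), rs.getString("column2")));

// Table1Row is a hypothetical DTO, e.g.:
// record Table1Row(long id, String column2) {}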
We need to generate sequential numbers for our transactions. We get sqlcode=-911, sqlstate=40001, sqlerrmc=2 (deadlock) when concurrent users try to book transactions at the same time. The deadlock occurs because they are all reading and updating the same record. How can we design this so that deadlocks are prevented?
Create a "seed" table that contains a single data row.
This "seed" table row holds the "Next Sequential" value.
When you wish to insert a new business data row using the "Next Sequential" value, perform the following steps.
1). Open a cursor FOR UPDATE on the "seed" table and fetch the current row. This gives you exclusive control over the seed value.
2). You will use this fetched value as your "Next Value"... however, before doing so,
3). increment the fetched "Next Value" and commit the update. The commit closes your cursor and releases the seed row with the new "Next Value".
You are now free to use your "Next Value".
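In JDBC terms, a sketch of that sequence could look like the following (the seed table and column names are assumptions; the SELECT ... FOR UPDATE plays the role of the updatable cursor):
// uses java.sql.Connection, PreparedStatement, ResultSet
conn.setAutoCommit(false);
long nextValue;
try (PreparedStatement lock = conn.prepareStatement(
         "SELECT next_val FROM seed_table FOR UPDATE");    // exclusive control over the seed row
     ResultSet rs = lock.executeQuery()) {
    rs.next();
    nextValue = rs.getLong(1);
}
try (PreparedStatement bump = conn.prepareStatement(
         "UPDATE seed_table SET next_val = next_val + 1")) {
    bump.executeUpdate();
}
conn.commit();                   // release the seed row as quickly as possible
// nextValue can now be used for the business insert, in a separate transaction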
There are a number of ways around this issue, some less performant than others.
Deadlocks can be prevented if all objects are locked in the same hierarchical sequence. [https://en.wikipedia.org/wiki/Deadlock#Prevention]
However, solutions to the Dining Philosophers Problem [https://en.wikipedia.org/wiki/Dining_philosophers_problem] which completely prevent deadlocks are often less performant than simply rolling back the transaction and retrying. You'll need to test your solution.
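If you go the rollback-and-retry route, the application-side wrapper is small. A sketch, assuming the enclosing method declares throws SQLException and InterruptedException, and where doBooking and maxRetries are made-up names (SQLSTATE 40001 is the deadlock code from the question):
// uses java.sql.Connection and java.sql.SQLException
int maxRetries = 3;
for (int attempt = 1; ; attempt++) {
    try {
        conn.setAutoCommit(false);
        doBooking(conn);                  // your transactional work
        conn.commit();
        break;                            // success
    } catch (SQLException e) {
        conn.rollback();
        // 40001 = deadlock / serialization failure; anything else is rethrown immediately
        if (!"40001".equals(e.getSQLState()) || attempt >= maxRetries) {
            throw e;
        }
        Thread.sleep(50L * attempt);      // small backoff before retrying
    }
}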
If you're looking for a data-side solution, an old-fashioned (and potentially under-performant) approach is to force the acquisition of new transaction IDs to be atomic by establishing a rigorous lock sequence.
A quick-ish solution (test this under load before releasing to production!) could be to use TRANSACTION boundaries and have a control row acting as a gatekeeper. Here's a stupid example which demonstrates the basic technique.
It has no error checking, and the code to reclaim ghost IDs is outside the scope of this example:
DECLARE @NewID INTEGER;
BEGIN TRANSACTION;
UPDATE [TABLE] SET [LOCKFLAG] = CURRENT_TIMESTAMP WHERE [ROW_ID] = 0;
SELECT @NewID = MAX([ROW_ID]) + 1 FROM [TABLE];
INSERT INTO [TABLE] ([ROW_ID]) VALUES (@NewID);
UPDATE [TABLE] SET [LOCKFLAG] = NULL WHERE [ROW_ID] = 0;
COMMIT TRANSACTION;
The idea is to make this atomic, single-threaded, serialized operation very, very short in duration -- do only what is needed to safely reserve the ID and get out of the way.
By making the update of row 0 the first step, and provided all ID requests follow this convention, competing users will simply queue up behind that first step.
Once you have your ID reserved, go off and do what you like, and you can use a new transaction to update the row you've created.
You'd need to cover cases where the later steps decide to ROLLBACK, as there would now be a ghost row in the table. You'd want a way to reclaim those; a variety of simple solutions can be used.
I have a multi-threaded client/server system with thousands of clients continuously sending data to the server that is stored in a specific table. This data is only important for a few days, so it's deleted afterwards.
The server is written in J2SE, database is MySQL and my table uses InnoDB engine. It contains some millions of entries (and is indexed properly for the usage).
One scheduled thread is running once a day to delete old entries. This thread could take a large amount of time for deleting, because the number of rows to delete could be very large (some millions of rows).
On my specific system deletion of 2.5 million rows would take about 3 minutes.
The inserting threads (and reading threads) get a timeout error telling me
Lock wait timeout exceeded; try restarting transaction
How can I detect that state from my Java code? I would prefer to handle the situation myself instead of just waiting. But the more important question is: how can I prevent that situation?
Could I use
conn.setTransactionIsolation( Connection.TRANSACTION_READ_UNCOMMITTED )
for the reading threads, so they get their information regardless of whether it is completely up to date (which is absolutely OK for this use case)?
What can I do to my inserting threads to prevent blocking? They purely insert data into the table (primary key is the tuple userid, servertimemillis).
Should I change my deletion thread? It is purely deleting data for the tuple userid, greater than specialtimestamp.
Edit:
Reading the MySQL documentation, I wonder whether I could simply configure the connections used for inserting and deleting rows with
conn.setTransactionIsolation( Connection.TRANSACTION_READ_COMMITTED )
and achieve what I need. It says that UPDATE and DELETE statements that use a unique index with a unique search condition lock only the matching index entry, not the gap before it, so rows can still be inserted into that gap. It would be great to hear your experience with this, since I can't simply try it in production, and simulating it in a test environment is a big effort.
In your deletion thread, try to first load the IDs of the records to be deleted and then delete them one at a time, committing after each delete.
If you run the thread that does the huge delete once a day and it takes 3 minutes, you can split it into smaller transactions that each delete a small number of records and still get it done quickly enough.
A better solution:
First of all. Any solution you try must be tested prior to deployment in production. Especially a solution suggested by some random person on some random web site.
Now, here's the solution I suggest (making some assumptions regarding your table structure and indices, since you didn't specify them):
Alter your table. A multi-column primary key is not recommended in InnoDB, especially in large tables (since the primary key is automatically included in every other index). See the answer to this question for more reasons. You should add a unique RecordID column as the primary key (I'd recommend a long identifier, i.e. BIGINT in MySQL).
Select the rows for deletion: execute "SELECT RecordID FROM YourTable WHERE ServerTimeMillis < ?".
Commit, to quickly release the lock on the ServerTimeMillis index (which I assume you have).
For each RecordID, execute "DELETE FROM YourTable WHERE RecordID = ?".
Commit after each record or after every X records (I'm not sure whether that makes much difference). Perhaps even a single commit at the end of all the DELETEs will suffice, since with this new logic only the deleted rows should be locked. A JDBC sketch of these steps follows below.
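A minimal sketch of those steps (conn is assumed to be an open connection with auto-commit disabled; cutoffMillis and the commit interval of 1000 are arbitrary examples):
// uses java.sql.PreparedStatement, ResultSet and java.util.*
List<Long> ids = new ArrayList<>();
try (PreparedStatement select = conn.prepareStatement(
         "SELECT RecordID FROM YourTable WHERE ServerTimeMillis < ?")) {
    select.setLong(1, cutoffMillis);
    try (ResultSet rs = select.executeQuery()) {
        while (rs.next()) {
            ids.add(rs.getLong(1));
        }
    }
}
conn.commit();                              // release any locks taken by the SELECT right away

try (PreparedStatement delete = conn.prepareStatement(
         "DELETE FROM YourTable WHERE RecordID = ?")) {
    int n = 0;
    for (Long id : ids) {
        delete.setLong(1, id);
        delete.executeUpdate();
        if (++n % 1000 == 0) {
            conn.commit();                  // keep each transaction, and its locks, short
        }
    }
}
conn.commit();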
As for changing the isolation level. I don't think you have to do it. I can't suggest whether you can do it or not, since I don't know the logic of your server, and how it will be affected by such a change.
You can try to replace your one huge DELETE with multiple shorter DELETE ... LIMIT n statements, with n determined by testing (not so small that it causes many queries, and not so large that it causes long locks). Since each chunk's locks would last only a few milliseconds (or seconds, depending on your n), you could let the delete thread run continuously, provided it can keep up; again, n can be adjusted so that it does.
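A sketch of that chunked delete (DELETE ... LIMIT is MySQL syntax; n = 10000 and cutoffMillis are arbitrary placeholders, and auto-commit is left on so each chunk is its own short transaction):
// uses java.sql.PreparedStatement
int n = 10000;
try (PreparedStatement chunk = conn.prepareStatement(
         "DELETE FROM YourTable WHERE ServerTimeMillis < ? LIMIT " + n)) {
    chunk.setLong(1, cutoffMillis);
    int deleted;
    do {
        deleted = chunk.executeUpdate();    // locks are held only for this one chunk
    } while (deleted == n);                 // a short chunk means we have caught up
}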
Also, table partitioning can help.
Here is the scenario I am researching a solution for at work. We have a table in Postgres which stores events happening on the network. Currently it works like this: rows are inserted as network events come in, and at the same time older records matching a specific timestamp are deleted, in order to keep the table size limited to about 10,000 records; basically the same idea as log rotation. Network events come in bursts of thousands at a time, so the transaction rate is very high, which causes performance degradation; after some time the server either crashes or becomes very slow. On top of that, the customer wants to keep up to a million records in the table, which will accelerate the degradation (since we have to keep deleting records matching a specific timestamp) and cause space-management issues. We use plain JDBC to read from and write to the table. Can the tech community suggest a better-performing way to handle inserts and deletes in this table?
I think I would use partitioned tables, perhaps 10 x the total desired size, inserting into the newest partition and dropping the oldest.
http://www.postgresql.org/docs/9.0/static/ddl-partitioning.html
This makes the load of "dropping the oldest" much smaller than querying and deleting rows.
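As a rough illustration, a daily maintenance job could create the next partition and drop the oldest. This sketch assumes declarative range partitioning, which needs PostgreSQL 10 or later (on the 9.x versions in the linked docs you would use the inheritance approach instead), and all table names are made up:
// uses java.sql.Statement, java.time.LocalDate, java.time.format.DateTimeFormatter
// parent table assumed as:
//   CREATE TABLE network_events (event_time timestamptz NOT NULL, payload text)
//     PARTITION BY RANGE (event_time);
LocalDate tomorrow = LocalDate.now().plusDays(1);
LocalDate oldest = LocalDate.now().minusDays(10);       // keep roughly 10 days of partitions
DateTimeFormatter f = DateTimeFormatter.BASIC_ISO_DATE;

try (Statement stmt = conn.createStatement()) {
    stmt.execute(
        "CREATE TABLE IF NOT EXISTS network_events_" + f.format(tomorrow) +
        " PARTITION OF network_events FOR VALUES FROM ('" + tomorrow + "') TO ('" + tomorrow.plusDays(1) + "')");
    // dropping a whole partition is a cheap metadata operation compared with DELETE
    stmt.execute("DROP TABLE IF EXISTS network_events_" + f.format(oldest));
}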
Update: I agree with nos' comment though, the inserts/deletes may not be your bottleneck. Maybe some investigation first.
Some things you could try -
- Write to a log and have a separate batch process write from the log to the table.
- Keep the writes as they are, and do the deletes periodically or at times of lower traffic.
- Do the writes to a buffer/cache and have the actual database writes happen from the buffer.
A few general suggestions -
Since you're deleting based on a timestamp, make sure the timestamp column is indexed. You could also do this with a counter / auto-incremented row ID (e.g. delete where id < currentId - 1000000).
Also, JDBC batch writes are much faster than individual row writes (easily an order-of-magnitude speedup). Batch-writing 100 rows at a time will help tremendously, if you can buffer the writes.
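A minimal sketch of such a JDBC batch write (the table, columns, Event type and bufferedEvents are made up; the batch size of 100 follows the suggestion above):
// uses java.sql.PreparedStatement
try (PreparedStatement insert = conn.prepareStatement(
         "INSERT INTO network_events (userid, servertimemillis, payload) VALUES (?, ?, ?)")) {
    int inBatch = 0;
    for (Event e : bufferedEvents) {        // bufferedEvents is your in-memory buffer
        insert.setLong(1, e.getUserId());
        insert.setLong(2, e.getTimeMillis());
        insert.setString(3, e.getPayload());
        insert.addBatch();
        if (++inBatch == 100) {             // send 100 rows in one round trip
            insert.executeBatch();
            inBatch = 0;
        }
    }
    if (inBatch > 0) {
        insert.executeBatch();              // flush the remainder
    }
}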
I'm looking for a high-level answer, but here are some specifics in case they help: I'm deploying a J2EE app to a cluster in WebLogic, with a single Oracle database at the backend.
A normal flow of the app is
- users feed data (to be inserted as rows) to the app
- the app waits for the data to reach a certain size and does a batch insert into the database (only 1 commit)
There's a constraint in the database preventing "duplicate" data from being inserted. If the app gets a constraint violation, it has to roll back and re-insert one row at a time, so that the duplicate rows can be "renamed" and inserted.
Suppose I had 2 running instances of the app, each about to insert 1000 rows. Even if there is only 1 duplicate, one instance will have to roll back and insert rows one by one.
I can easily see that it would be smarter for that instance to re-insert the 999 non-conflicting rows as a batch, but what if I had 3 running instances and the 999 rows also had a chance of containing duplicates?
So my question is this: is there a design pattern for this kind of situation?
This is a long question, so please let me know where to clarify. Thank you for your time.
EDIT:
The 1000 rows of data is in memory for each instance, but they cannot see the rows of each other. The only way they know if a row is a duplicate is when it's inserted into the database.
And if the current application design doesn't make sense, feel free to suggest better ways of tackling this problem. I would appreciate it very much.
http://www.oracle-developer.net/display.php?id=329
The simplest would be to avoid parallel processing of the same data. For example, your size- or time-based event could run on only one node, or it could post a message to a JMS queue so that only one of the nodes processes it (for instance, using a similar duplicate check, e.g. based on a timestamp of the message/batch).
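A sketch of the JMS hand-off using the JMS 2.0 simplified API (connectionFactory, insertQueue, serializedBatch, running and insertBatchHandlingDuplicates are all placeholders, and serializing the batch is left out):
// uses javax.jms.JMSContext and javax.jms.JMSConsumer
// producer side: any app instance enqueues its buffered batch instead of inserting directly
try (JMSContext ctx = connectionFactory.createContext()) {
    ctx.createProducer().send(insertQueue, serializedBatch);       // e.g. a JSON string
}

// consumer side: a single consumer performs all inserts, so duplicates are resolved in one place
try (JMSContext ctx = connectionFactory.createContext()) {
    JMSConsumer consumer = ctx.createConsumer(insertQueue);
    while (running) {
        String batch = consumer.receiveBody(String.class, 1000);   // wait up to 1 second
        if (batch != null) {
            insertBatchHandlingDuplicates(batch);                  // your existing insert + rename logic
        }
    }
}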