Update primary keys without creating duplicate rows? - java

I'm working on a Java project which needs to be able to alter all the primary keys in a table - and in most cases - some other values in the row as well.
The problem I have is that, if I update a row by selecting by its old primary key (SET pk=new_pk WHERE pk=old_pk) I get duplicate rows (since the old PK value may be equal to another row's new PK value and both rows are then updated).
I figured that in Oracle and some other DBs I might be able to do this with ROWNUM or something similar, but the system should work with most DB systems and right now we can't get this to work for MySQL.
I should add that I don't have access to change the schema of the DB - so, I can't add a column.
What I have tried:
Updating ResultSets directly with RS.updateRow() - this seems to work, but is very slow.
Hashing the PKs in the table, storing the hash in code and selecting on the hashed PK. The hash acts as a sort of signature: a hashed PK indicates that the row has been read but not yet updated, so I can skip those rows. The issue with this seems to have been hash collisions, as I was getting duplicate PKs.
PS:
I realise this sounds like either a terrible idea, or terrible design, but we really have no choice. The system I'm working on aims to anonymize private data and that may entail changing PKs in some tables. Don't fear, we do account for FK references.

In this case you can use a simple update with a delta equal to the max PK of the table being updated.
First select the delta:
select max(pk) as delta from table
and then use it in the update:
update table SET pk = pk + delta + 1
Before this operation you need to disable constraints. And don't forget that you should also update foreign keys.
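For illustration, a minimal JDBC sketch of this delta-shift step; the table and column names (mytable, pk) are placeholders, not from the question:
import java.sql.*;

public final class PkShifter {
    // Shift every PK past the current maximum so that the later per-row
    // updates (SET pk = new_pk WHERE pk = old_pk) cannot hit a row whose
    // old value equals another row's new value. Constraints and FKs must
    // be disabled first, as the answer notes.
    public static void shiftKeys(Connection con) throws SQLException {
        con.setAutoCommit(false);
        try {
            long delta;
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT MAX(pk) AS delta FROM mytable")) {
                rs.next();
                delta = rs.getLong("delta");
            }
            try (PreparedStatement ps =
                     con.prepareStatement("UPDATE mytable SET pk = pk + ? + 1")) {
                ps.setLong(1, delta);
                ps.executeUpdate();
            }
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}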

Most efficient way to determine if a row EXISTS and INSERT into MySQL using java JDBC

I'm looking at querying a table in a MySQL database (I have the primary key, which is composed of two parts, a name and a number, compared as strings), where the table could have anywhere from very few rows to upwards of hundreds of millions. For efficiency, I'm not exactly sure how costly an INSERT query actually is, but I have a few options for going about it:
I could query the database to see if the element EXISTS and then call an INSERT query if it doesn't.
I could try to brute force INSERT into the database and if it succeeds or fails, so be it.
I could, on program startup, create a cache: grab the primary key columns and store them in a Map<String, List<Integer>>, then check whether the name exists as a key, and if it does, whether the number exists in its List<Integer>; if it doesn't, INSERT into the database.
Option one isn't really something I would implement; it's just on the list of possible choices. Option two would most likely average out better for unique occurrences, i.e. rows that aren't in the table already. Option three would be favoured when duplicates are common, so that most lookups hit the cache.
Bear in mind that the chosen option will be executed potentially millions of times. Memory usage aside (for option 3), by my calculations it's nothing significant relative to the capacity available.
Let the database do the work.
You should use the second method. If you don't want to get a failure, you can use on duplicate key update:
insert into t(pk1, pk2, . . . )
values ( . . . )
on duplicate key update pk1 = values(pk1);
Here, the only purpose of the on duplicate key update clause is to do nothing useful, just to avoid returning an error.
Why is this the best solution? First, a primary key (or columns declared unique) has an index structure that the database can use efficiently.
Second, this requires only one round-trip to the database.
Third, there are no race conditions if you have multiple threads or applications that might be attempting to insert the same record(s).
Fourth, the method with on duplicate key update will work for inserting multiple rows at once. (Without on duplicate key update, a multi-row statement would fail if a single row were duplicated.) Combining multiple inserts into a single statement can be another big efficiency win.
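For illustration, a minimal JDBC sketch of this approach; the table t and columns name/num are placeholders echoing the question. (With MySQL Connector/J, adding rewriteBatchedStatements=true to the connection URL lets the driver rewrite the batch into multi-row inserts.)
import java.sql.*;
import java.util.List;
import java.util.Map;

public final class IdempotentInsert {
    private static final String SQL =
        "INSERT INTO t (name, num) VALUES (?, ?) " +
        "ON DUPLICATE KEY UPDATE name = VALUES(name)"; // no-op when the key already exists

    public static void insertAll(Connection con, Map<String, List<Integer>> rows)
            throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(SQL)) {
            for (Map.Entry<String, List<Integer>> e : rows.entrySet()) {
                for (int num : e.getValue()) {
                    ps.setString(1, e.getKey());
                    ps.setInt(2, num);
                    ps.addBatch();
                }
            }
            ps.executeBatch(); // duplicates are "updated" to their own values, no error
        }
    }
}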
Your second option is really the right way to go.
Rather than fetching all your results as in the third option, you could try using LIMIT 1. Given that the combination of name and number forms the primary key, use LIMIT 1 to fetch at most one row, and if the result is empty you can insert your data. It would be a lot faster that way.
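A sketch of that existence check, using the same placeholder table t as above:
import java.sql.*;

public final class ExistsCheck {
    public static boolean exists(Connection con, String name, int num) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT 1 FROM t WHERE name = ? AND num = ? LIMIT 1")) {
            ps.setString(1, name);
            ps.setInt(2, num);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next(); // at most one row thanks to LIMIT 1
            }
        }
    }
}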
MySQL has a neat way to perform a special insertion. INSERT ... ON DUPLICATE KEY UPDATE is a MySQL extension to the INSERT statement. If you specify the ON DUPLICATE KEY UPDATE option and the new row would cause a duplicate value in the UNIQUE or PRIMARY KEY index, MySQL performs an update of the old row based on the new values:
INSERT INTO table(column_list)
VALUES(value_list)
ON DUPLICATE KEY UPDATE column_1 = new_value_1, column_2 = new_value_2;

Return (self) generated value from insert statement (no id, no returning)

Sorry if the question title is misleading or not accurate enough, but I didn't see how to ask it in one sentence.
Let's say we have a table where the PK is a String (numbers from '100,000' to '999,999'; the comma is for readability only).
Let's also say the PK values are not used sequentially.
Now I want to insert a new row into the table using java.sql and show the PK of the inserted row to the user. Since the PK is not generated by default (e.g. inserting values without the PK doesn't work, and something like generated_keys is not available in the given environment), I've seen two different approaches:
In two different statements: first find a possible next key, then try to insert (and expect that another transaction may have used the same key between the two statements). Is it valid to retry until success, or could some SQL trick with transaction settings/locks help here? How can I realize that in java.sql?
For me, that's a disappointing solution because of the non-deterministic behaviour (perhaps you can convince me of the contrary), so I searched for another one:
Insert with a nested select statement that looks up the next possible PK. Looking at other answers on generating the PK myself, I came close to a working solution with this statement (the casts from string to int are left out):
INSERT INTO mytable (pk, othercolumns)
VALUES (
    (SELECT MIN(empty_numbers.empty_number)
       FROM (SELECT t1.pk + 1 AS empty_number
               FROM mytable t1
               LEFT OUTER JOIN mytable t2
                 ON t1.pk + 1 = t2.pk
              WHERE t2.pk IS NULL
                AND t1.pk > 100000) AS empty_numbers),
    othervalues);
That works like a charm and (AFAIK) is a more predictable and stable solution than my first approach, but: how can I retrieve the generated PK from that statement? I've read that there is no way to return the inserted row (or any of its columns) directly, and most of the Google results I've found point to returning generated keys - even though my key is generated, it's not generated by the DBMS directly but by my statement.
Note that the DBMS used in development is MSSQL 2008 and the production system is currently DB2 on AS/400 (I don't know which version), so I have to stick close to the SQL standard. I can't change the DB structure in any way (e.g. use generated keys; I'm not sure about stored procedures).
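For reference, a sketch of the first approach in plain java.sql: compute a candidate key, insert it explicitly, and retry when a concurrent transaction grabbed the key first. Because the application computes the key itself, it already knows the value to show the user, so getGeneratedKeys() is not needed. Names are placeholders, the string-to-int casts are left out as in the question, and the duplicate-key test via SQLState class 23 is an assumption that may need adjusting per driver:
import java.sql.*;

public final class GapFillingInsert {
    public static String insertWithRetry(Connection con, String otherValue)
            throws SQLException {
        con.setAutoCommit(false);
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                String pk;
                try (Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT MIN(t1.pk + 1) FROM mytable t1" +
                         " LEFT OUTER JOIN mytable t2 ON t1.pk + 1 = t2.pk" +
                         " WHERE t2.pk IS NULL AND t1.pk > 100000")) {
                    rs.next();
                    pk = rs.getString(1);
                }
                try (PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO mytable (pk, othercolumns) VALUES (?, ?)")) {
                    ps.setString(1, pk);
                    ps.setString(2, otherValue);
                    ps.executeUpdate();
                }
                con.commit();
                return pk; // the key is known to the application
            } catch (SQLException e) {
                con.rollback();
                String state = e.getSQLState();
                if (state == null || !state.startsWith("23")) {
                    throw e; // not a duplicate-key failure
                }
                // another transaction took the key; loop and try again
            }
        }
        throw new SQLException("Gave up after repeated key collisions");
    }
}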
DB2 for i allows generated keys, stored procedures, user defined functions - pretty much all of the things SQL Server can do. The exact implementation is different, but that's what manuals are for :-) Ask your admin what version of IBM i they're running, then hit up the Infocenter for specifics.
The constraining factor is that you can't alter the database design; you are stuck with apparently multiple processes trying to INSERT while backfilling 'holes' in the existing keyspace. That's a very tough nut to crack. Because you can't change the DB design, there's nothing to be done except to allow for and handle PK collisions. There's no SQL trick that'll help - the SQL way is to have the DB generate the PK, not the application.
There are several alternatives to suggest, in the event that some change is allowed. All have issues needing a workaround, but that is unavoidable at this point due to the application design.
Create a UDF that all INSERT clients use to retrieve the next available PK. Use a table of 'available numbers' and delete them as they are issued.
Pre-INSERT all the available numbers. Force clients to do an UPDATE. Make them FETCH...FOR UPDATE where (rest of data = not populated). This will lock the row, avoiding collisions as well as make the PK immediately available.
Leave the DB and the other application programs using this table as-is, but have your INSERT process draw from a block of keys that's been set aside for your use. Keep the next available number in an SQL SEQUENCE or an IBM i data area. This only works if there's a very large hole in the keyspace that's not yet used.
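A sketch of the third alternative, assuming a sequence reserved_pk_seq has been created for this application's block of keys. The FROM sysibm.sysdummy1 form is DB2's; note that SQL Server only gained sequences in 2012, so this would not cover the MSSQL 2008 development box:
import java.sql.*;

public final class ReservedKeySource {
    public static long nextKey(Connection con) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT NEXT VALUE FOR myschema.reserved_pk_seq FROM sysibm.sysdummy1")) {
            rs.next();
            return rs.getLong(1); // never collides with keys handed to other clients
        }
    }
}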

PostgreSQL JDBC getGeneratedKeys returns all columns

I've recently switched from MySQL to PostgreSQL for the back end of a project and discovered some of my database proxy methods needed reviewing. To insert linked objects I use a transaction to make sure everything is stored. I do this using jdbc methods such as setAutoCommit(false) and commit(). I've written a utility method that inserts a record into a table and returns the generated key. Basically I've followed technique 2 as described here:
http://www.selikoff.net/2008/09/03/database-key-generation-in-java-applications/
This has worked since the start of the project, but after migrating from MySQL to PostgreSQL getGeneratedKeys returns all the columns of the newly inserted record (see console output below).
Code:
final ResultSet keys = ps.getGeneratedKeys();
final ResultSetMetaData metaData = keys.getMetaData();
for (int j = 0; j < metaData.getColumnCount(); j++) {
System.out.println("Col name: "+metaData.getColumnName(j+1));
}
Output:
Col name: pathstart
Col name: fk_id_c
Col name: xpathid
Col name: firstnodeisroot
Database signature for the table (auto generated SQL from pgAdmin III):
CREATE TABLE configuration.configuration_xpath
(
pathstart integer NOT NULL,
fk_id_c integer NOT NULL,
xpathid integer NOT NULL DEFAULT nextval('configuration.configuration_xpath_id_seq'::regclass),
firstnodeisroot boolean NOT NULL DEFAULT false,
CONSTRAINT configuration_xpath_pkey PRIMARY KEY (xpathid),
CONSTRAINT configuration_fk FOREIGN KEY (fk_id_c)
REFERENCES configuration.configuration (id_c) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
)
Database signature for the sequence behind the PK:
CREATE SEQUENCE configuration.configuration_xpath_id_seq
INCREMENT 1
MINVALUE 1
MAXVALUE 9223372036854775807
START 242
CACHE 1
OWNED BY configuration.configuration_xpath.xpathid;
So the question is, why is getGeneratedKeys returning all the columns instead of just the generated key? I've searched and found someone else with a similar problem here:
http://www.postgresql.org/message-id/004801cb7518$cbc632e0$635298a0$#pravdin#disi.unitn.it
But their question has not been answered, only a suggested workaround is offered.
Most drivers support getGeneratedKeys() by tacking a RETURNING clause onto the end of the query with the columns that are auto-generated. PostgreSQL returns all fields because the driver appends RETURNING *, which simply returns all columns. That means that to return the generated key it doesn't have to query the system tables to determine which column(s) to return, and this saves network round-trips (and query time).
This is implicitly allowed by the JDBC specification, because it says:
Note: If the columns which represent the auto-generated keys were not specified, the JDBC driver implementation will determine the columns which best represent the auto-generated keys.
Reading between the lines you can say that this allows for saying 'I don't know, or it is too much work, so all columns best represent the auto-generated keys'.
An additional reason might be that it is very hard to determine which columns are auto-generated and which aren't (I am not sure if that is true for PostgreSQL). For example in Jaybird (the JDBC driver for Firebird that I maintain) we also return all columns because in Firebird it is impossible to determine which columns are auto-generated (but we do need to query the system tables for the column names because Firebird 3 and earlier do not have RETURNING *).
Therefore it is always advisable to explicitly query the generated keys ResultSet by column name and not by position.
Other solutions are explicitly specifying the column names or the column positions you want returned using the alternate methods accepting a String[] or int[] (although I am not 100% sure how the PostgreSQL driver handles that).
BTW: Oracle is (was?) even worse: by default it returns the ROWID of the row, and you need a separate query to get the (generated) values from that row.
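A sketch combining both suggestions above, against the question's table: request only xpathid via the String[] overload (which the PostgreSQL driver should turn into RETURNING xpathid), and read the key back by name so the code keeps working even if the driver returns every column:
import java.sql.*;

public final class XpathInsert {
    public static int insert(Connection con, int pathstart, int fkIdC,
                             boolean firstnodeisroot) throws SQLException {
        String sql = "INSERT INTO configuration.configuration_xpath " +
                     "(pathstart, fk_id_c, firstnodeisroot) VALUES (?, ?, ?)";
        try (PreparedStatement ps =
                 con.prepareStatement(sql, new String[] {"xpathid"})) {
            ps.setInt(1, pathstart);
            ps.setInt(2, fkIdC);
            ps.setBoolean(3, firstnodeisroot);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getInt("xpathid"); // by name, not by position
            }
        }
    }
}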
UPDATE - The accepted answer (by Mark) correctly explains what the problem is. My solution also works, but that's only because I added the PK column first when recreating the tables. Either way, all columns are returned by getGeneratedKeys().
After some research I've managed to find a possible cause of the problem. As I said before, I've changed from MySQL to PostgreSQL during the development of a software project. For this migration, I've taken an SQL dump which I loaded into PostgreSQL. Aside from the migrated tables, I've also created some new ones (using the GUI wizards in pgAdmin III). After a close investigation of the differences between two tables (one imported, one created), I've established 2 things:
CREATE TABLE statements from the MySQL dump converted PKs to BIGINT NOT NULL, not to SERIAL. This meant auto-generated PKs no longer worked properly, though I fixed this before I asked this question.
The tables that I 'fixed' by adding a new sequence and linking it up work perfectly fine, but the SQL generation code (auto-generated by pgAdmin III, as shown in the original question) is different from that of a table created in PostgreSQL natively.
Note that my fixed tables work(ed) perfectly: I can insert records, update records and perform joins... basically do anything. The primary keys get auto-generated and the sequence gets updated. However, the JDBC driver (postgresql-9.2-1003.jdbc4.jar to be precise) fails to return my generated keys (though the tables are fully functional).
To illustrate the difference between a migrated and created table, here is an example of generation code for a table that I added after the migration:
CREATE TABLE configuration.configuration_xpathitem
(
xpathitemid serial NOT NULL,
xpathid integer,
fk_id_c integer,
itemname text,
index integer,
CONSTRAINT pk_configuration_xpathitem PRIMARY KEY (xpathitemid),
CONSTRAINT fk_configuration_xpathitem_configuration FOREIGN KEY (fk_id_c)
REFERENCES configuration.configuration (id_c) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fk_configuration_xpathitem_configuration_xpath FOREIGN KEY (xpathid)
REFERENCES configuration.configuration_xpath (xpathid) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
You can clearly see that here my PK has the serial keyword, where it is integer not null default ... for the migrated (and fixed) table.
Because of this, I figured maybe the JDBC driver for PostgreSQL was unable to find the PK. I had already read the specification that @Mark highlighted in his reply, and this led me to think that this was the cause of the driver returning all columns. It led me to believe the driver could not find the PK because I think it looks for the serial keyword.
So to solve the problem, I dumped my data, deleted my faulty tables and added them again, this time from scratch rather than with the SQL statements from the MySQL dump, and reloaded my data. This has solved the problem for me. I hope this can help anyone that is also stuck.

Refreshing PrimaryID to start from one after a deleted Row

I'm writing a program in Java and I have a database shown in a JTable just like the ones below. I wanted to know if it is possible to renumber the primary ID from 1 on the GUI interface when a row is deleted? For example, below, the LocationID for London is deleted and added again with an ID of 4. Is this possible?
I'm using SQL in Java.
To answer your question, yes it is possible.
There is no good reason for you to do this though, and I highly recommend you don't do this.
The only reason to do this would be for cosmetic ones - the database doesn't care if records are sequential, only that they relate to one another consistently. There's no need to "correct" the values for the database's sake.
If you use these IDs for some kind of numbering on the UI (a cosmetic reason):
Do not use your identity for this. Separate the visual row number, order or anything else from the internal database key.
If you REALLY want to do this,
Google "reseeding or resetting auto-increment primary IDs" for your SQL product.
Be aware that with some solutions, if you reset the identity seed below values that currently exist in the table, you will violate the identity column's uniqueness constraint as soon as the values start to overlap.
Thanks Andriy for mentioning my blindly pasting a mysql solution :)
Some examples:
Java DB: ALTER TABLE table_name ALTER COLUMN auto_increment_column_name RESTART WITH 8
SQL Server: DBCC CHECKIDENT (mytable, RESEED, 0)
Altering the sequence

Move million records from MEMORY table to MYISAM table

I am looking for a fast way to move records from a MEMORY table to MYISAM table. MEMORY table has around 0.5 million records. Both tables have exactly the same structure (same number of columns, data types etc.). But the MYISAM table is indexed (B-TREE) on a few columns. There are around 25 columns most of which are unsigned integers.
I have already tried an "INSERT INTO ... SELECT * FROM ..." query. But is there any faster way to do this?
Appreciate your help.
Prashant
As others pointed out, you should not update indexes on every insert.
You can disable index updates during the insert:
ALTER TABLE tbl_name DISABLE KEYS;
INSERT INTO tbl_name SELECT ...;
ALTER TABLE tbl_name ENABLE KEYS;
And also lock the table to get a single index write:
LOCK TABLES tbl_name WRITE;
INSERT INTO tbl_name SELECT ...;
UNLOCK TABLES;
Anyway, if you do it in a single INSERT ... SELECT you might not get a significant performance gain.
You can also tune the bulk_insert_buffer_size setting in the server config.
More on: http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
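Putting both together over JDBC; memory_source and myisam_target are placeholder names:
import java.sql.*;

public final class BulkCopy {
    public static void copyAll(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            // Defer non-unique index maintenance until after the load.
            st.execute("ALTER TABLE myisam_target DISABLE KEYS");
            // Every table touched while locked must appear in LOCK TABLES.
            st.execute("LOCK TABLES myisam_target WRITE, memory_source READ");
            try {
                st.executeUpdate("INSERT INTO myisam_target SELECT * FROM memory_source");
            } finally {
                st.execute("UNLOCK TABLES");
                st.execute("ALTER TABLE myisam_target ENABLE KEYS"); // rebuilds the indexes
            }
        }
    }
}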
In principle, you should get good performance by:
1. Create the target table without secondary indexes.
2. Sort the contents of the source table on the target table's primary key.
3. Insert the sorted records into the target table.
4. Add the secondary indexes one at a time.
It's probably mostly about tuning. Is the MyISAM table initially empty? If so, you can do a few cheats - disable indexes during the load, then enable them (this is NOT a good idea on a non-empty table).
Doing an ORDER BY on a memory table is not a particularly good idea, as they usually use hash indexes, hence cannot do an in-order index scan, so it would introduce an extra filesort(), which is probably bad.
