Move million records from MEMORY table to MYISAM table - java

I am looking for a fast way to move records from a MEMORY table to a MyISAM table. The MEMORY table has around 0.5 million records. Both tables have exactly the same structure (same number of columns, data types, etc.), but the MyISAM table is indexed (B-TREE) on a few columns. There are around 25 columns, most of which are unsigned integers.
I have already tried an "INSERT INTO ... SELECT * FROM ..." query. But is there any faster way to do this?
Appreciate your help.
Prashant

As others pointed out -- you should not use indexes during the insert.
You can disable updating them on every insert:
ALTER TABLE tbl_name DISABLE KEYS;
INSERT INTO tbl_name ... ;
ALTER TABLE tbl_name ENABLE KEYS;
You can also lock the table so the index writes happen in a single batch:
LOCK TABLES tbl_name WRITE;
INSERT INTO tbl_name ... ;
UNLOCK TABLES;
That said, if you load everything with a single INSERT ... SELECT, you might not see a significant gain from this.
You can also tune the bulk_insert_buffer_size setting in the server configuration.
More on: http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
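A minimal sketch that combines these ideas, assuming hypothetical table names target_myisam (the MyISAM target) and source_memory (the MEMORY source):

ALTER TABLE target_myisam DISABLE KEYS;
LOCK TABLES target_myisam WRITE, source_memory READ;
INSERT INTO target_myisam SELECT * FROM source_memory;
UNLOCK TABLES;
ALTER TABLE target_myisam ENABLE KEYS;  -- rebuilds the non-unique indexes in one pass after the load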

In principle, you should get good performance by:
Create the target table without secondary indexes.
Sort the contents of the source table on the target table's primary key.
Insert sorted records into target table.
Add the secondary indexes one at a time (see the sketch below).
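A hedged sketch of those steps, assuming the target's primary key column is id and the secondary indexes are on hypothetical columns col_a and col_b:

-- target_myisam was created with only its primary key, no secondary indexes
INSERT INTO target_myisam SELECT * FROM source_memory ORDER BY id;
ALTER TABLE target_myisam ADD INDEX idx_col_a (col_a);
ALTER TABLE target_myisam ADD INDEX idx_col_b (col_b);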

It's probably mostly about tuning. Is the MyISAM table initially empty? If so, you can do a few cheats - disable indexes during the load, then enable them (this is NOT a good idea on a non-empty table).
Doing an ORDER BY on a MEMORY table is not a particularly good idea: MEMORY tables usually use hash indexes, which cannot do an in-order index scan, so the sort would introduce an extra filesort(), which is probably bad.

Related

How to listen to results of an EXPLAIN [duplicate]

I'm trying to speed up bulk inserts into an InnoDB table by temporarily disabling its indexes:
ALTER TABLE mytable DISABLE KEYS;
But it gives a warning:
+-------+------+-------------------------------------------------------------+
| Level | Code | Message                                                      |
+-------+------+-------------------------------------------------------------+
| Note | 1031 | Table storage engine for 'mytable' doesn't have this option |
+-------+------+-------------------------------------------------------------+
1 row in set (0.00 sec)
How can we disable the indexes?
What alternatives are there to avoid using the index when doing bulk inserts?
How can we speed up the process?
Have you tried the following?
SET autocommit=0;
SET unique_checks=0;
SET foreign_key_checks=0;
From the MySQL Reference Manual: https://dev.mysql.com/doc/refman/8.0/en/optimizing-innodb-bulk-data-loading.html
See the section "Bulk Data Loading Tips".
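A minimal sketch of a bulk load wrapped in those settings (table and column names are hypothetical; remember to restore the settings afterwards):

SET autocommit=0;
SET unique_checks=0;
SET foreign_key_checks=0;

INSERT INTO mytable (a, b, c) VALUES (1,2,3),(4,5,6),(7,8,9);
-- ... more batched INSERTs ...

COMMIT;
SET unique_checks=1;
SET foreign_key_checks=1;
SET autocommit=1;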
There is a very good reason why you cannot execute DISABLE KEYS on an InnoDB table; InnoDB is not designed to use it, and MyISAM is.
In fact, here is what happens when you reload a mysqldump:
You will see a CREATE TABLE for a MyISAM table followed by a write lock.
Before all the bulk inserts are run, a call to ALTER TABLE ... DISABLE KEYS is done.
What this does is turn off secondary indexes in the MyISAM table.
Then, the bulk inserts are done. While this is happening, the PRIMARY KEY and all UNIQUE KEYS in the MyISAM table are being rebuilt. Before the UNLOCK TABLES, a call to ALTER TABLE ... ENABLE KEYS is made in order to rebuild all non-unique indexes linearly.
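For illustration, the relevant portion of a mysqldump for a MyISAM table looks roughly like this (hypothetical table name and data):

LOCK TABLES `mytable` WRITE;
/*!40000 ALTER TABLE `mytable` DISABLE KEYS */;
INSERT INTO `mytable` VALUES (1,'a'),(2,'b'),(3,'c');
/*!40000 ALTER TABLE `mytable` ENABLE KEYS */;
UNLOCK TABLES;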
IMHO this operation was not coded into the InnoDB Storage Engine because all keys in a non-unique index come with the primary key entry from gen_clust_index (aka Clustered Index). That would be a very expensive operation since building a non-unique index would require O(n log n) running time to retrieve each unique key to attach to a non-unique key.
In light of this, posting a warning about trying to DISABLE KEYS/ENABLE KEYS on an InnoDB table is far easier than coding exceptions to the mysqldump for any special cases involving non-MyISAM storage engines.
A little late, but... whatever... forget all the answers here: don't try to disable the indexes, there's no way. Just drop them (ALTER TABLE tablename DROP INDEX whatever), bulk insert the data, then recreate them (ALTER TABLE tablename ADD INDEX whatever (whatever)). The time spent recreating the indexes is about 1% of doing the bulk insert with the indexes in place; something like 400,000 rows took 10 minutes with indexes and about 2 seconds without them. Cheers.
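As a quick sketch, assuming a single secondary index named idx_code on a hypothetical column code:

ALTER TABLE tablename DROP INDEX idx_code;
-- bulk insert the data here (multi-row INSERTs or LOAD DATA INFILE)
ALTER TABLE tablename ADD INDEX idx_code (code);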
To reduce the cost of recalculating the indexes, you should insert the data either using LOAD DATA INFILE or using MySQL multi-row inserts, like
INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
-> so you insert several rows with one statement.
How many rows you can insert with one statement depends on the max_allowed_packet MySQL setting.
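A hedged sketch of the LOAD DATA INFILE variant (file path, table, and column names are hypothetical):

LOAD DATA INFILE '/tmp/data.csv'
INTO TABLE tbl_name
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(a, b, c);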

Efficient way to check whether a large number of strings exist in a database

I have a very large table in the database; the table has a column called
"unique_code_string", and it has almost 100,000,000 records.
Every 2 minutes, I receive 100,000 code strings. They arrive in an array and are unique to each other. I need to insert them into the large table if they are all "good".
The meaning of "good" is this:
None of the 100,000 codes in the array already occurs in the large database table.
If one or more codes do occur in the large database table, the whole array is not used at all;
that is, no codes from the array are inserted into the large table.
Currently, I do it this way:
First I loop and check each code in the array to see if the same code already exists in the large database table.
Second, if all codes are "new", I do the real insert.
But this way is very slow, and I must finish everything within 2 minutes.
I am thinking of other ways:
Join the 100,000 codes into a SQL "IN clause"; each code is 32 characters long, and I think no database will accept an IN clause of length 32*100,000.
Use a database transaction: force the insert anyway, and if an error happens, roll the transaction back. This causes some performance issues.
Use a database temporary table. I am not good at writing SQL queries; please give me an example if this idea may work.
Now, can any experts give me some advice or solutions?
I am not a native English speaker; I hope you can see the issue I am facing.
Thank you very much.
Load the 100,000 rows into a table!
Create a unique index on the original table:
create unique index unq_bigtable_uniquecodestring on bigtable (unique_code_string);
Now, you have the tools you need. I think I would go for a transaction, something like this:
insert into bigtable ( . . . )
select . . .
from smalltable;
If any row fails (due to the unique index), then the transaction will fail and nothing is inserted. You can also be explicit:
insert into bigtable ( . . . )
    select . . .
    from smalltable
    where not exists (select 1
                      from smalltable st join
                           bigtable bt
                           on st.unique_code_string = bt.unique_code_string
                     );
For this version, you should also have an index/unique constraint on smalltable(unique_code_string).
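A minimal sketch of the all-or-nothing flow with the transaction approach (column names are hypothetical):

START TRANSACTION;
INSERT INTO bigtable (unique_code_string, other_col)
SELECT unique_code_string, other_col
FROM smalltable;
-- if the INSERT fails because of the unique index, issue ROLLBACK; otherwise:
COMMIT;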
It's hard to find an optimal solution with so little information. Often this depends on the network latency between application and database server and hardware resources.
You can load the 100,000,000 unique_code_string values from the database and use a HashSet or TreeSet to de-duplicate in memory before inserting into the database. If your database server is resource constrained or there is considerable network latency, this might be faster.
Depending on how you receive the 100,000-record delta, you could load it into the database; e.g. a CSV file can be read using an external table. If you can get the data efficiently into a temporary table and the database server is not overloaded, you can do this very efficiently with SQL or a stored procedure.
You should spend some time understanding how real-time the update has to be, e.g. how many SQL queries are reading the 100,000,000-row table and whether you can allow some of those queries to be cancelled or blocked while you update the rows. Often it's a good idea to create a shadow table (sketched after the steps below):
Create new table as copy of the existing 100,000,000 rows table.
Disable the indexes on the new table
Load the delta rows to the new table
Rebuild the indexes on new table
Delete the existing table
Rename the new table to the existing 100,000,000 rows table
The approach here is database specific. It will depend on how your database defines its indexes; e.g. if you have a partitioned table it might not be necessary.
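A rough MySQL-flavoured sketch of the shadow-table steps (table names are hypothetical, and the exact statements depend on your database and storage engine):

CREATE TABLE bigtable_new LIKE bigtable;
ALTER TABLE bigtable_new DISABLE KEYS;               -- MyISAM only; for InnoDB, create the table without secondary indexes instead
INSERT INTO bigtable_new SELECT * FROM bigtable;     -- copy the existing 100,000,000 rows
INSERT INTO bigtable_new SELECT * FROM delta_table;  -- load the delta rows
ALTER TABLE bigtable_new ENABLE KEYS;                -- rebuild the indexes
RENAME TABLE bigtable TO bigtable_old, bigtable_new TO bigtable;
DROP TABLE bigtable_old;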

Update primary keys without creating duplicate rows?

I'm working on a Java project which needs to be able to alter all the primary keys in a table - and in most cases - some other values in the row as well.
The problem I have is that, if I update a row by selecting by its old primary key (SET pk=new_pk WHERE pk=old_pk) I get duplicate rows (since the old PK value may be equal to another row's new PK value and both rows are then updated).
I figured that in Oracle and some other DBs I might be able to do this with ROWNUM or something similar, but the system should work with most DB systems and right now we can't get this to work for MySQL.
I should add that I don't have access to change the schema of the DB - so, I can't add a column.
What I have tried:
Updating ResultSets directly with RS.updateRow() - this seems to work, but is very slow.
Hashing the PKs in the table, storing the hash in code and selecting on the hashed PK. This acts sort of as a signature, since a hashed PK indicates that the row has been read but not yet updated, and I can skip the appropriate rows that way. The issue with this seems to have been hash collisions, as I was getting duplicate PKs.
PS:
I realise this sounds like either a terrible idea, or terrible design, but we really have no choice. The system I'm working on aims to anonymize private data and that may entail changing PKs in some tables. Don't fear, we do account for FK references.
In this case you can use a simple update with delta = the max PK of the table being updated.
First select the delta:
select max(pk) as delta from table;
and then use it in the update:
update table SET pk = pk + delta + 1;
Before this operation you need to disable constraints. And don't forget that you should also update foreign keys.
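A hedged MySQL sketch of the whole sequence (table and column names are hypothetical):

SET foreign_key_checks = 0;
SELECT MAX(pk) INTO @delta FROM parent_table;
UPDATE parent_table SET pk = pk + @delta + 1;
-- keep the referencing tables in sync
UPDATE child_table SET parent_pk = parent_pk + @delta + 1;
SET foreign_key_checks = 1;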

Most efficient way to determine if a row EXISTS and INSERT into MySQL using java JDBC

I'm looking at querying a table in a MySQL database (I have the primary key, which is composed of two parts, a name and a number, compared as strings), and this table could have anywhere from very few rows to upwards of hundreds of millions. Now, for efficiency, I'm not exactly sure how costly it actually is to do an INSERT query, but I have a few options for going about it:
I could query the database to see if the element EXISTS and then call an INSERT query if it doesn't.
I could try to brute force INSERT into the database and if it succeeds or fails, so be it.
I could initially on program execution, create a cache/store, grab the primary key columns and store them in a Map<String, List<Integer>> and then search the key for if the name exists, then if it does, does the key and value combination in the List<Integer> exists, if it doesn't, then INSERT query the database.
Option one really isn't on the table as something I would actually implement; it's just on the list of possible choices. Option two would most likely average out better when occurrences are usually unique, i.e. not already in the table. Option three would be favourable when common occurrences are the norm, so that many lookups hit the cache.
Bear in mind that whichever option is chosen will be iterated over potentially millions of times. Memory usage aside (from option 3), by my calculations it's nothing significant with respect to the capacity available.
Let the database do the work.
You should do the second method. If you don't want to get a failure, you can use on duplicate key update:
insert into t(pk1, pk2, . . . )
    values ( . . . )
    on duplicate key update pk1 = values(pk1);
The only purpose of the on duplicate key update clause here is to do nothing useful; it simply prevents the statement from returning an error.
Why is this the best solution? In a database, a primary key (or columns declared unique) has an index structure. This is efficient for the database to use.
Second, this requires only one round-trip to the database.
Third, there are no race conditions, if you have multiple threads or applications that might be attempting to insert the same record(s).
Fourth, the method with on duplicate key update will work for inserting multiple rows at once. (Without on duplicate key update, a multi-row statement would fail if a single row is duplicated.) Combining multiple inserts into a single statement can be another big efficiency win.
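For example, a multi-row variant might look like this (column names and values are hypothetical):

INSERT INTO t (pk1, pk2, payload)
VALUES ('alice', 1, 'x'),
       ('bob', 2, 'y'),
       ('carol', 3, 'z')
ON DUPLICATE KEY UPDATE pk1 = VALUES(pk1);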
Your second option is really the right way to go.
Rather than fetching the whole result as in the third option, you could try using LIMIT 1, given the fact that the combination of your name and number forms the primary key. Use LIMIT 1 to fetch the result, and if the result is empty you can insert your desired data. It would be a lot faster that way.
MySQL has a neat way to perform a special insertion. INSERT ... ON DUPLICATE KEY UPDATE is a MySQL extension to the INSERT statement. If you specify the ON DUPLICATE KEY UPDATE option in the INSERT statement and the new row causes a duplicate value in the UNIQUE or PRIMARY KEY index, MySQL performs an update of the old row based on the new values:
INSERT INTO table(column_list)
VALUES(value_list)
ON DUPLICATE KEY UPDATE column_1 = new_value_1, column_2 = new_value_2;

Performance of SELECT query- Oracle/JDBC

I have an existing query in the system which is a simple select query, as follows:
SELECT <COLUMN_X>, <COLUMN_Y>, <COLUMN_Z> FROM TABLE <WHATEVER>
Over time, <WHATEVER> is growing in terms of records. Is there any possible way to improve the performance here? The developer is using the Statement interface. I believe PreparedStatement won't help here since the query is executed only once.
Is there anything else that can be done? One of the columns is a primary key and the others are VARCHAR (if that information helps).
Does your query have any predicates? Or are you always returning all of the rows from the table?
If you are always returning all the rows, a covering index on column_x, column_y, column_z would allow Oracle to merely scan the index rather than doing a table scan. The query will still slow down over time but the index should grow more slowly than the table.
If you are returning a subset of rows, there are potentially other indexes that would be more advantageous from a performance perspective.
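For example, a covering index for the query above might look like this (the index name is hypothetical, and "whatever" stands in for the real table name):

CREATE INDEX idx_whatever_covering ON whatever (column_x, column_y, column_z);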
Are there any optimizations you can do outside of SQL query tuning? If so, here are some suggestions:
Try putting the table in memory (like the MEMORY storage engine in MySQL) or apply some other optimization in the DB.
Cache the ResultSet in Java and query again only when the table content changes. If the table only has inserts and no updates or deletes (wishful thinking), then you can use SELECT COUNT(*) FROM table. If the row count differs from the previous check, fire your original query and update the cache only if needed.
