I have a relatively simple object model:
ParentObject
Collection<ChildObject1>
ChildObject2
The MySQL operation when saving this object model does the following:
Update the ParentObject
Delete all previous items from the ChildObject1 table (about 10 rows)
Insert all new ChildObject1 (again, about 10 rows)
Insert ChildObject2
The objects / tables are unremarkable - no strings, rather mainly ints and longs.
MySQL is currently saving about 20-30 instances of the object model per second. When this goes into production it's going to be doing upwards of a million saves, which at current speeds is going to take 10+ hours, which is no good to me...
I am using Java and Spring. I have profiled my app and the bottleneck is in the calls to MySQL by a long distance.
How would you suggest I increase the throughput?
You can get some speedup by tracking a dirty flag on your objects (especially your collection of child objects). You only delete/update the dirty ones. Depending on what % of them change on each write, you might save a good chunk.
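A minimal sketch of the dirty-flag idea, assuming a child object with a single int field. TrackedChild, DirtyFilter and their members are made-up names for illustration, not anything from the original object model:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical child entity; the field names are illustrative only.
class TrackedChild {
    private final long id;
    private int value;
    private boolean dirty; // true when in-memory state differs from the stored row

    TrackedChild(long id, int value) {
        this.id = id;
        this.value = value;
    }

    long getId() { return id; }
    int getValue() { return value; }
    boolean isDirty() { return dirty; }

    void setValue(int newValue) {
        if (newValue != value) { // only a real change marks the object dirty
            value = newValue;
            dirty = true;
        }
    }

    // Call after the row has been written successfully.
    void markClean() { dirty = false; }
}

class DirtyFilter {
    // Returns only the children that need to be written back to the database.
    static List<TrackedChild> dirtyOnly(List<TrackedChild> children) {
        List<TrackedChild> result = new ArrayList<>();
        for (TrackedChild child : children) {
            if (child.isDirty()) {
                result.add(child);
            }
        }
        return result;
    }
}
```

The save path would then delete and insert only what dirtyOnly(...) returns, and call markClean() once the write has succeeded.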
The other thing you can do is bulk writes via batch updating on the prepared statement (look at PreparedStatement.addBatch()). This can be an order of magnitude faster, but the writes may no longer happen record by record, e.g. the sequence might look something like:
delete all dirty-flagged children as a single batch command
update all parents as a single batch command
insert all dirty-flagged children as a single batch command.
Note that since you're dealing with millions of records you probably won't be able to load them all into a map and dump them at once. You'll have to stream them into a batch handler and flush the changes to the db 1000 records at a time or so. Once you've done this, the actual speed is sensitive to the batch size, so you'll have to determine a good value by trial and error.
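Here is a rough sketch of such a batch handler. BatchBuffer and ChildBatchWriter are hypothetical helper names, and the table and column names in the SQL are assumptions; only PreparedStatement.addBatch() and executeBatch() are the real JDBC calls being illustrated:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers items and hands them to a flush callback once batchSize is reached.
class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher;
    private final List<T> buffer = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> flusher) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Must also be called once at the end so the last partial batch is written.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flusher.accept(new ArrayList<>(buffer)); // copy so the callback owns its list
        buffer.clear();
    }
}

class ChildBatchWriter {
    // Assumed table and column names; adjust to the real schema.
    private static final String INSERT_SQL =
            "INSERT INTO child_object1 (parent_id, value) VALUES (?, ?)";

    // Each long[] is {parentId, value}.
    static void insertBatch(Connection conn, List<long[]> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (long[] row : rows) {
                ps.setLong(1, row[0]);
                ps.setLong(2, row[1]);
                ps.addBatch(); // queue the row instead of executing it immediately
            }
            ps.executeBatch(); // submit all queued rows together
        }
    }
}
```

You would create something like new BatchBuffer<long[]>(1000, rows -> ...) whose callback calls ChildBatchWriter.insertBatch and wraps the checked SQLException. With MySQL Connector/J it is also worth checking the rewriteBatchedStatements connection property, since it affects how batches are actually sent to the server.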
Deleting any existing ChildObject1 records from the table and then inserting the ChildObject1 instances from the current state of your Parent object seems unnecessary to me. Are the values of all of the child objects different from what was previously stored?
A better solution might involve only modifying the database when you need to, i.e. when there has been a change in state of the ChildObject1 instances.
Rolling your own persistence logic for this type of thing can be hard (your persistence layer needs to know the state of the ChildObject1 objects when they were retrieved to compare them with the versions of the objects at save-time). You might want to look into using an ORM like Hibernate for something like this, which does an excellent job of knowing when it needs to update the records in the database or not.
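If you do roll your own, the comparison step could look roughly like this. It is only a sketch: ChildDiff is a made-up name, and it assumes the child objects (or the keys standing in for them) have value-based equals() and hashCode():

```java
import java.util.HashSet;
import java.util.Set;

// Compares the children as they were loaded with the children at save time.
class ChildDiff<T> {
    final Set<T> toDelete; // stored before, absent now
    final Set<T> toInsert; // present now, not stored before

    private ChildDiff(Set<T> toDelete, Set<T> toInsert) {
        this.toDelete = toDelete;
        this.toInsert = toInsert;
    }

    static <T> ChildDiff<T> between(Set<T> loaded, Set<T> current) {
        Set<T> delete = new HashSet<>(loaded);
        delete.removeAll(current);
        Set<T> insert = new HashSet<>(current);
        insert.removeAll(loaded);
        return new ChildDiff<>(delete, insert);
    }

    // True when nothing changed, i.e. the child table can be left alone.
    boolean isEmpty() {
        return toDelete.isEmpty() && toInsert.isEmpty();
    }
}
```

When isEmpty() returns true the save can skip the ChildObject1 table entirely; otherwise only toDelete and toInsert need to hit the database.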
Related
I use ORMLite on a solution made by server and clients.
On server side I use PostgreSql, on client side I use SQLite.
In code, I use the same ORMLite methods, without taking care of the DB that is managed (PostgreSql or SQLite).
Let's say that:
Table A corresponds to class A
I have an Arraylist of objects A
I want to insert all items of ArrayList in DB.
Today I use a for() cycle, and I insert them one by one (doing it in Transaction Manager).
When the items are few, no problem, but now the items are becoming more and this is not probably the best way, also because I lock the DB for long time.
I'm searching a way to insert all the items in one step, so to go quickly, to not lock the DB for long time. I understood that it should be a sort of Stored Procedures (I'm not expert...).
Note that some items could be new (that is, no item with the same primary key id exists yet), so an INSERT must be performed; other items could already exist, so an UPDATE must be performed.
Thank you
I'm searching a way to insert all the items in one step, so to go quickly, to not lock the DB for long time.
So there are two ways to do this that I know of: transactions and disabling auto-commit. If you are inserting into the database and it needs to all happen "at once" from a consistency standpoint, transactions are the only way to go. If you just want to insert and update a large number of records with higher performance then you can disable auto-commit, do the operations, and then commit. Depending on the database implementation, this is what the TransactionManager is really doing.
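At the plain JDBC level, the disable-auto-commit approach looks roughly like the sketch below. SqlWork and ManualCommit are hypothetical names; the Connection methods are standard JDBC:

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical callback type for the inserts and updates to run.
@FunctionalInterface
interface SqlWork {
    void run(Connection conn) throws SQLException;
}

class ManualCommit {
    // Runs all the work with auto-commit off and commits once at the end.
    static void runInOneCommit(Connection conn, SqlWork work) throws SQLException {
        boolean previous = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try {
            work.run(conn);
            conn.commit();
        } catch (SQLException | RuntimeException e) {
            conn.rollback(); // undo the partial work before reporting the failure
            throw e;
        } finally {
            conn.setAutoCommit(previous); // restore the caller's setting
        }
    }
}
```

With ORMLite you normally would not write this yourself: as far as I know the Dao's callBatchTasks(...) does the equivalent for you, and createOrUpdate(...) covers the insert-or-update decision per item.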
I understood that it should be a sort of Stored Procedures...
I don't see how stored procedures help you at all. They aren't magic.
but now the items are becoming more and this is not probably the best way, also because I lock the DB for long time.
I don't think there is a magic solution to this. If you are pushing a large number of objects to the database and you need the data to be transactional, then locks are going to have to be held during the updates. One thing to realize is that Postgres should handle this a ton better than SQLite. SQLite does not (I don't think) have row level locking, meaning that the whole DB is paused during transactions. Postgres has a much more mature locking system and should be more performant in this situation. This is also why SQLite is so fast in many other operations: it doesn't have to be burdened with the lock complexity.
One thing to consider is to rearchitect your schema. Try to figure out the minimal amount of data that needs to be transactionally inserted. For example, maybe just the object relationships need to be changed transactionally but all of the data can be stored later. For example, you could have an AccountOwner object which just has 2 ids, while all of the information about the Account can be stored outside of the transaction. This makes your schema more complicated but maybe much faster.
Hope something here helps.
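If you are using JPA, you can use entityManager.merge(item) for each item in the list, inside a single transaction.
Note that merge takes one entity at a time, not a list, so you still loop over the items.
Merge creates the object if it doesn't exist in the database and updates it if it already exists.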
I am trying to insert in batches: objects are stored in an ArrayList and, as soon as count is divisible by 10000, I insert all these objects into my table. But it takes more than 4 minutes to do so. Is there any approach which is faster?
arr.add(new Car(name, count, type));
if (count % 10000 == 0) {
    repository.saveAll(arr);
    arr.clear();
}
So here is what is happening. I am most curious to see the table definition inside Cassandra, but your Car constructor gives a hint:
new Car(name, count, type)
Given those column names, I'm guessing that name is the partition key.
That is significant because the hash of the partition key column is what Cassandra uses to figure out which node (token range) the data should be written to.
When you saveAll on 10000 Cars at once, there is no way you can guarantee that all 10000 of those are going to the same node. To deal with this, Spring Data Cassandra must be using a BATCH (or something like it) behind the scenes. If it is a BATCH, that essentially makes one Cassandra node (designated as a "coordinator") route the writes to the required nodes. Due to Cassandra's distributed nature, that is never going to be fast.
If you really need to store 10000 of them, the best way would be to send one write at a time asynchronously. Of course, you won't want 10000 threads all writing concurrently, so you'll want to throttle down (limit) the number of active threads in your code. DataStax's Ryan Svihla has written a couple of articles detailing how to do this. I recommend this one: Cassandra: Batch Loading Without the Batch - The Nuanced Edition.
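The throttling itself can be sketched with a Semaphore, independent of the driver. ThrottledAsyncWriter is a made-up name, and asyncWrite stands in for whatever asynchronous insert call your driver or template exposes; the sketch assumes that call returns a CompletableFuture:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

class ThrottledAsyncWriter {
    // Sends each item through asyncWrite with at most maxInFlight writes pending.
    static <T> void writeAll(List<T> items, int maxInFlight,
                             Function<T, CompletableFuture<?>> asyncWrite)
            throws InterruptedException {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<?>> pending = new ArrayList<>();
        for (T item : items) {
            permits.acquire(); // blocks while maxInFlight writes are outstanding
            CompletableFuture<?> write = asyncWrite.apply(item)
                    .whenComplete((result, error) -> permits.release());
            pending.add(write);
        }
        // Wait for the remaining writes; join() throws CompletionException
        // if any of them failed.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }
}
```

A reasonable value for maxInFlight depends on the cluster and the client, so treat it as something to tune rather than a fixed number.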
tl;dr;
Spring Data Cassandra's saveAll really shouldn't be used to persist several thousand writes. If I were using Spring Data Cassandra, I wouldn't even go beyond double-digits with saveAll, TBH.
Edit
Check out this answer for details on how to use Spring Boot/Data with Cassandra asynchronously: AsyncCassandraOperations examples
I'm working on an application for a pharmacy. Basically, this application has a class "item" and another class "selling invoices" which logs the selling processes.
So my question is: if the pharmacy is expected to have about ten thousand products in stock, and I'm storing these products in a linked list of type Item, and storing the invoices in a linked list as well, then on closing the app I save them using an object output stream and reload them upon start, is it a bad practice? Do I have to use a database instead?
My second question is: if I continue using LinkedList and ObjectOutputStream, which is better for performance and memory: storing the actual item as a field member in the invoice class, or storing just its ID and then looking the item up by this ID when needed?
Thanks in advance.
It is a bad idea to use ObjectOutputStream like that.
Here are some of the reasons:
If your application crashes (or the power fails) before you "save", then all changes are lost.
Saving "all objects" is expensive.
Serialized objects are opaque. It is only practical to look at them from Java code.
Serialized objects are fragile. If your application classes change, you may find that old serialized objects can no longer be read. That's bad enough, but now consider what happens if your client wants to look at pharmacy records from 5 years ago ... from a backup tape.
Serialized objects provide no way of searching ... apart from reading all of the objects one at a time.
Designs which involve reading all objects into memory do not scale. You are liable to run out of memory. Or compromise on your requirements to avoid running out of memory.
By contrast:
A database won't lose any changes that have been committed. Databases are much more resilient to things like application errors and system level failures.
Committing database changes is not as expensive, because you only write data that has changed.
Typical databases can be viewed, queried, and if necessary repaired using an off-the-shelf database tool.
Changing Java code doesn't break the database. And for some schema changes, there are ways to migrate the database schema and records to match an updated database.
Databases have indexes and query languages for implementing efficient search.
Databases scale because the primary copy of the data is on disk, not in memory.
We are currently trying to solve a performance problem: searching for data and presenting it in a paginated way takes about 2-3 minutes.
Upon further investigation (and after several rounds of SQL tuning), it seems that searching is slow just because of the sheer amount of data.
A possible solution that I'm currently investigating is to replicate the data in a searchable cache. Now this cache can be in the database (i.e. materialized view) or it could be outside the db (nosql approach). However, since I would like the cache to be horizontally scalable, I am leaning towards caching it outside the database.
I've created a proof of concept, and indeed, searching in my cache is faster than in the db. However, the initial full replication takes a long time to complete. Although the full replication will just happen once, and then succeeding replication will just be incremental against those that changed since the last replication, it would still be great if I can speed up the initial full replication.
However, during full replication, aside from the slowness of the query's execution, I also have to battle against network latency. In fact, I can deal with the slow query execution time. But the network latency is really really slowing the replication down.
Which leads me to my question: how can I speed up my replication? Should I spawn several threads, each one doing a query? Should I use a scrollable?
Replicating the data in a cache seems like replicating the functionality of the database.
From reading other comments, I see that you are not doing this to avoid network roundtrips, but because of costly joins. In many DBMS you can create temporary tables - like this:
CREATE TEMPORARY TABLE abTable AS SELECT * FROM a , b ;
If a and b are large (relatively permanent) tables, then you will have a one-time cost of 2-3 minutes to create the temporary table. However, if you use abTable for many queries, then the subsequent per query cost will be much smaller than
SELECT name, city, ... FROM a , b ;
Other database systems have a view concept which lets you do something like this
CREATE VIEW abView AS SELECT * FROM a , b ;
Changes in the underlying a and b table will be reflected in the abView.
If you really are concerned about network round trips, then you may be able to replicate parts of the database on the local computer.
A good database management system should be able to handle your data needs. So why reinvent the wheel?
SELECT * FROM YOUR_TABLE
Map results into an object or data structure
Assign a unique key for each object or data structure
Load the key and object or data structure into a WeakHashMap to act as your cache.
I don't see why you need sorting, because your cache should access values by unique key in O(1) time. What is sorting buying you?
Be sure to think about thread safety.
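The steps above can be sketched as follows. RecordCache is a hypothetical name, and the rows are assumed to have already been mapped from the SELECT into objects. Two things to be aware of: WeakHashMap is not thread-safe on its own, and its keys are weakly referenced, so an entry can disappear once nothing else holds a strong reference to its key:

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

class RecordCache<K, V> {
    // Synchronized wrapper because WeakHashMap itself is not thread-safe.
    private final Map<K, V> map = Collections.synchronizedMap(new WeakHashMap<>());

    // Puts every mapped row into the cache under the key computed by keyOf.
    void loadAll(Iterable<V> rows, Function<V, K> keyOf) {
        for (V row : rows) {
            map.put(keyOf.apply(row), row);
        }
    }

    // Lookup by unique key; returns null when the key is absent
    // (or when the entry has been cleared by the garbage collector).
    V get(K key) {
        return map.get(key);
    }

    int size() {
        return map.size();
    }
}
```

If you need entries to stay put until you remove them, a ConcurrentHashMap is the more predictable choice, at the price of holding everything in memory.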
I'm assuming that this is a read-only cache, and you're doing this to avoid the constant network latency. I'm also assuming that you'll do this once on start up.
How much data per record? 12M records at 1KB per record means you'll need 12GB of RAM just to hold your cache.
I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How can I do multi-threading on different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by each node. Then write the Java program so that it first fetches the last row range claimed by other nodes and then adds an entry to the same table, informing other nodes what range of rows it is going to process (you can decide this number). In our case, let's assume each node can process 1000 rows at a time. Node 1 reads table B and finds it empty. Then Node 1 inserts a row ('Node1', 1000), informing others that it is processing rows whose primary key in A is <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along and finds that the first 1000 primary keys are being processed by some other node. Hence it inserts a row ('Node2', 2000), informing others that it is processing rows between 1001 and 2000. Please note that access to table B should be synchronized, i.e. only one node can work on it at a time.
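The claiming logic can be sketched in memory like this. RangeClaims is a hypothetical stand-in for table B, and the synchronized method plays the role of the lock that must guard table B; in the real database you would need an equivalent, for example a transaction that locks the row or table while the next range is being claimed:

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for table B: records which node claimed which key range of A.
class RangeClaims {
    static final class Claim {
        final String node;
        final long firstId;
        final long lastId;

        Claim(String node, long firstId, long lastId) {
            this.node = node;
            this.firstId = firstId;
            this.lastId = lastId;
        }
    }

    private final long chunkSize;
    private final long maxId;       // highest primary key in table A
    private long lastClaimed = 0;   // highest key claimed so far
    private final List<Claim> log = new ArrayList<>();

    RangeClaims(long chunkSize, long maxId) {
        this.chunkSize = chunkSize;
        this.maxId = maxId;
    }

    // Only one node at a time may read the last claim and record its own.
    // Returns null when every row has already been claimed.
    synchronized Claim claimNext(String node) {
        if (lastClaimed >= maxId) {
            return null;
        }
        long first = lastClaimed + 1;
        long last = Math.min(lastClaimed + chunkSize, maxId);
        lastClaimed = last;
        Claim claim = new Claim(node, first, last);
        log.add(claim);
        return claim;
    }
}
```

Each node would loop on claimNext(...) and process the rows of A between firstId and lastId until it gets null back.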
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits, as well as reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL back to the cluster transparently.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance and the changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL instances in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best options would be to do the update in pure SQL if that is at all possible, or to use a stored procedure so that all processing can take place within the MySQL server and no data movement is required.
If this is not fast enough then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on some application key.