I have MongoDB (version 2) in production in a replica-set configuration (the next step is to add sharding).
I need to implement the following:
Once a day I'll receive a file with millions of rows, and I have to load it into Mongo.
I have a runtime application that constantly reads from this collection - a very large volume of reads, and their performance is very important.
The collection is indexed and all reads go through an index.
My current implementation of loading is:
drop collection
create collection
insert into collection new documents
One thing I see is that, because of MongoDB's lock, my overall performance gets worse during the load.
I've tested the collection with up to 10 million entries.
For more than that I think I should start using sharding.
What is the best way to solve this issue?
Or maybe I should use another strategy?
You could use two collections :)
collectionA contains this day's data
new data arrives
create a new collection (collectionB) and insert the data
now use collectionB as your data
Then, next day, repeat the above just swapping A and B :)
This will let collectionA still service requests while collectionB is being updated.
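The swap can be sketched in plain Java: an AtomicReference holds the name of the live collection, readers consult it before every query, and the loader fills the inactive collection before the pointer is flipped. The CollectionSwapper class and the dataA/dataB names are illustrative; the actual drop/create/insert against the Mongo driver is abstracted behind the loader callback.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

public class CollectionSwapper {
    // Name of the collection readers should currently use.
    private final AtomicReference<String> live = new AtomicReference<>("dataA");

    public String liveCollection() {
        return live.get();
    }

    // Load the new file into the inactive collection, then swap atomically.
    public void reload(Consumer<String> loader) {
        String target = live.get().equals("dataA") ? "dataB" : "dataA";
        loader.accept(target);  // drop/create/insert into the inactive collection
        live.set(target);       // readers now see the fresh data
    }
}
```

Readers call `liveCollection()` before each query, so the in-flight load never blocks them and the switch is a single atomic pointer flip.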
PS Just noticed that I'm about a year late answering this question :)
I am trying to insert in batches: objects are stored in an ArrayList, and as soon as the count is divisible by 10000 I insert all of them into my table. But it takes more than 4 minutes to do so. Is there any approach that is faster?
arr.add(new Car(name, count, type));
if (count % 10000 == 0) {
    repository.saveAll(arr);
    arr.clear();
}
So here is what is happening. I am most curious to see the table definition inside Cassandra. But given your Car constructor,
new Car(name, count, type)
Given those column names, I'm guessing that name is the partition key.
The reason that is significant, is because the hash of the partition key column is what Cassandra uses to figure out which node (token range) the data should be written to.
When you saveAll on 10000 Cars at once, there is no way to guarantee that all 10000 of them are going to the same node. To deal with this, Spring Data Cassandra must be using a BATCH (or something like it) behind the scenes. If it is a BATCH, that essentially forces one Cassandra node (designated as the "coordinator") to route the writes to all of the required nodes. Due to Cassandra's distributed nature, that is never going to be fast.
If you really need to store 10000 of them, the best way would be to send one write at a time asynchronously. Of course, you won't want 10000 threads all writing concurrently, so you'll want to throttle (limit) the number of active threads in your code. DataStax's Ryan Svihla has written a couple of articles detailing how to do this. I recommend this one: Cassandra: Batch Loading Without the Batch - The Nuanced Edition.
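A minimal sketch of that throttling with plain java.util.concurrent: a Semaphore caps the number of in-flight writes, and each task releases its permit when it finishes. The real write would be the driver's async execute; here it's abstracted behind a writeOne callback, and the in-flight limit of 64 is an arbitrary illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

public class ThrottledWriter {
    // Cap on concurrent in-flight writes (value is illustrative).
    private static final int MAX_IN_FLIGHT = 64;

    // writeOne stands in for a single async driver write.
    public static <T> void writeAll(List<T> rows, ExecutorService pool,
                                    java.util.function.Consumer<T> writeOne)
            throws InterruptedException {
        Semaphore permits = new Semaphore(MAX_IN_FLIGHT);
        List<Future<?>> futures = new ArrayList<>();
        for (T row : rows) {
            permits.acquire();                // block when too many writes are in flight
            futures.add(pool.submit(() -> {
                try {
                    writeOne.accept(row);     // one single-partition write
                } finally {
                    permits.release();
                }
            }));
        }
        for (Future<?> f : futures) {         // wait for the tail of the writes
            try { f.get(); } catch (ExecutionException ignored) { }
        }
    }
}
```

Each write goes straight to the right replica set instead of funneling through one coordinator, which is exactly the point of avoiding the BATCH.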
tl;dr
Spring Data Cassandra's saveAll really shouldn't be used to persist several thousand writes. If I were using Spring Data Cassandra, I wouldn't even go beyond double-digits with saveAll, TBH.
Edit
Check out this answer for details on how to use Spring Boot/Data with Cassandra asynchronously: AsyncCassandraOperations examples
I have to process an XML file, and for that I need to fetch around ~4k objects by primary key from a single table. I am using EhCache. I have a few questions:
1) It is taking a lot of time if I query row by row by id and save the results in the cache. Can I query the whole table up front, save it in EhCache, and then look objects up by primary key later in the processing?
2) I don't want to use the query cache, since I can't load 4k objects at a time and loop over them to find the correct one.
I am looking for an optimal solution, as right now my process takes around 2 hours (it involves other processing too).
Thank you for your kind help.
You can read the whole table and store it in a Map<primary-key, table-row> to avoid the overhead of repeated DB round trips.
A HashMap is probably the best choice: lookups by key are O(1) on average, whereas a TreeMap's are O(log n) (a TreeMap only helps if you also need the keys in sorted order).
Ehcache is great for handling concurrency, but if you are reading the XML with a single process you don't even need it (just keep the Map in memory).
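A sketch of that preload, assuming the full table fits in memory. The Row shape and the buildCache name are illustrative; the input list stands in for the result of a single full-table query.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowCache {
    // One row from the table; the fields are illustrative.
    public record Row(long id, String payload) { }

    // Build the cache once from a full-table read, keyed by primary key.
    public static Map<Long, Row> buildCache(List<Row> allRows) {
        Map<Long, Row> byId = new HashMap<>(allRows.size() * 2);
        for (Row r : allRows) {
            byId.put(r.id(), r);  // primary key -> row
        }
        return byId;
    }
}
```

While processing the XML, `cache.get(id)` replaces a per-row query: one DB round trip total instead of ~4k.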
The question goes like this.
From one application I am getting approximately 200,000 encrypted values.
Task:
Read all encrypted values into one VO/list.
Reformat them and add headers/trailers.
Dump these records to the DB in one shot, with the header and trailer in separate, defined columns.
I don't want to use any intermediate file in the process.
What would be the best way to hold the 200,000 records - a list, or something else?
How do I dump these records into the DB in one shot? Is it better to divide them into chunks and have separate threads work on them?
Please suggest the least time-consuming solution for this.
I am using Spring Batch for this, and this process will be one job.
Spring batch is made to do this type of operation. You will want a chunk tasklet. This type of tasklet uses a reader, an item processor, and writer. Also, this type of tasklet uses streaming, so you will never have all items in memory at one time.
I'm not sure of the incoming format of your data, but there are existing readers for pretty much any use-case. And if you can't find the type you need, you can create your own. You will then want to implement ItemProcessor to handle any modifications you need to do.
For writing, you can just use JdbcBatchItemWriter.
As for these headers/footers, I would need more details on this. If they are an aggregation of all the records, you will need to process them beforehand. You can put the end results into the ExecutionContext.
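The chunk flow can be sketched without the framework: a reader streams items, a processor transforms each one, and a writer flushes one fixed-size chunk at a time, so the whole data set is never in memory at once. The ChunkRunner name and the functional interfaces here are illustrative, not Spring Batch's actual API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class ChunkRunner {
    // Read -> process -> write in fixed-size chunks.
    public static <I, O> void run(Iterator<I> reader,
                                  Function<I, O> processor,
                                  Consumer<List<O>> writer,
                                  int chunkSize) {
        List<O> chunk = new ArrayList<>(chunkSize);
        while (reader.hasNext()) {
            chunk.add(processor.apply(reader.next()));
            if (chunk.size() == chunkSize) {
                writer.accept(chunk);             // e.g. one batched JDBC flush
                chunk = new ArrayList<>(chunkSize);
            }
        }
        if (!chunk.isEmpty()) {
            writer.accept(chunk);                 // final partial chunk
        }
    }
}
```

In Spring Batch terms, the reader/processor/writer map to ItemReader, ItemProcessor, and JdbcBatchItemWriter, and chunkSize is the commit interval.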
There are a couple of generic tricks to make bulk insertion go faster:
Consider using the database's native bulk insert.
Sort the records into ascending order on the primary key before you insert them.
If you are inserting into an empty table, drop the secondary indexes first and then recreate them.
Don't do it all in one database transaction.
I don't know how well these tricks translate to spring-batch ... but if they don't you could consider bypassing spring-batch and going directly to the database.
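Tricks 2 and 4 can be combined in one pass: sort the records by primary key, then hand them to the database in transaction-sized slices rather than one giant transaction. This is only a sketch; the insertBatch callback stands in for a real prepared-statement batch plus commit, and the Row shape is illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;

public class BulkInserter {
    public record Row(long primaryKey, String value) { }

    // Sort by primary key, then insert in transaction-sized slices
    // instead of one giant transaction.
    public static void insertAll(List<Row> records,
                                 Consumer<List<Row>> insertBatch,
                                 int batchSize) {
        List<Row> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingLong(Row::primaryKey));
        for (int i = 0; i < sorted.size(); i += batchSize) {
            int end = Math.min(i + batchSize, sorted.size());
            insertBatch.accept(sorted.subList(i, end));  // one transaction per slice
        }
    }
}
```

Sorting keeps the index pages the insert touches close together, and the per-slice commits keep the transaction log from ballooning.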
Does anybody know of a library or good code sample that could be used to re-index all/some entities in all/some namespaces ?
If I implement this on my own, is MapReduce what I should consider ?
"I need to re-index" feels like a problem many developers have run into, but the closest I could find is this, which may be a good start?
The other option is a homebrewed solution using Task Queues that iterates the datastore namespaces and entities, but I'd prefer not to reinvent the wheel and go for a robust, proven solution.
What are the options ?
I'm afraid I don't know of any pre-built system. I think you basically need to create a cursor to iterate through all your entities and then do a get and a put on all of them (or optionally check if they're in the index before doing the put - if you have some that won't need updating, that would save you a write at the cost of a read and/or a small operation).
Follow the example here:
https://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify#Cursors
Create a java.util.concurrent.SynchronousQueue to hold batches of datastore keys.
Create 10 new consumer threads (the current limit) using ThreadManager:
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/ThreadManager
Those threads should do the following:
Create a new objectify instance and turn off the session cache and memcache for objectify.
Get a batch of keys from the SynchronousQueue.
Fetch all of those entities using a batch get.
Optionally do a keys-only query for all those entities using the relevant property.
Put all those entities (or exclude the ones that were returned above).
Repeat from step 2.
Meanwhile, on the original thread, in a loop fetch the next 30 keys using a keys-only cursor query and put them into the SynchronousQueue.
Once you've put all of the items into the SynchronousQueue, set a property to stop all the consumer threads once they've done their work.
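The producer/consumer shape above can be sketched with plain threads: the producer feeds key batches through a SynchronousQueue, and one poison-pill batch per consumer tells it to stop. The datastore batch get/put is abstracted behind a getAndPut callback; class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.SynchronousQueue;

public class ReindexDriver {
    // Distinct sentinel batch that tells a consumer to stop.
    private static final List<Long> STOP = new ArrayList<>();

    public static void reindex(List<Long> allKeys, int batchSize, int consumers,
                               java.util.function.Consumer<List<Long>> getAndPut)
            throws InterruptedException {
        SynchronousQueue<List<Long>> queue = new SynchronousQueue<>();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {      // start the consumer threads
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        List<Long> batch = queue.take();
                        if (batch == STOP) return; // poison pill: done
                        getAndPut.accept(batch);   // batch get + put on the datastore
                    }
                } catch (InterruptedException ignored) { }
            });
            t.start();
            workers.add(t);
        }
        // Producer: hand key batches to whichever consumer is free.
        for (int i = 0; i < allKeys.size(); i += batchSize) {
            queue.put(allKeys.subList(i, Math.min(i + batchSize, allKeys.size())));
        }
        for (int i = 0; i < consumers; i++) queue.put(STOP);  // one pill per consumer
        for (Thread t : workers) t.join();
    }
}
```

On App Engine the threads would come from ThreadManager rather than `new Thread`, and getAndPut would wrap the objectify batch get/put with the session cache and memcache turned off.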
I have a relatively simple object model:
ParentObject
Collection<ChildObject1>
ChildObject2
The MySQL operation when saving this object model does the following:
Update the ParentObject
Delete all previous items from the ChildObject1 table (about 10 rows)
Insert all new ChildObject1 (again, about 10 rows)
Insert ChildObject2
The objects / tables are unremarkable - no strings, rather mainly ints and longs.
MySQL is currently saving about 20-30 instances of the object model per second. When this goes into production it's going to be doing upwards of a million saves, which at current speeds will take 10+ hours, which is no good to me...
I am using Java and Spring. I have profiled my app and the bottleneck is the calls to MySQL, by a long distance.
How would you suggest I increase the throughput?
You can get some speedup by tracking a dirty flag on your objects (especially your collection of child objects). You only delete/update the dirty ones. Depending on what % of them change on each write, you might save a good chunk.
The other thing you can do is bulk writes via batch updating on a prepared statement (look at PreparedStatement.addBatch()). This can be an order of magnitude faster, but it won't be record by record; e.g. it might look something like:
delete all dirty-flagged children as a single batch command
update all parents as a single batch command
insert all dirty-flagged children as a single batch command.
Note that since you're dealing with millions of records you're probably not going to be able to load them all into a map and dump them at once; you'll have to stream them into a batch handler and flush the changes to the db 1000 records at a time or so. Once you've done this, the actual speed is sensitive to the batch size; you'll have to determine the best size by trial and error.
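A sketch of the dirty-flag idea (class and field names are illustrative): the setter flags the object only when the new value actually differs from the old one, and the persistence code deletes/re-inserts only the flagged children.

```java
import java.util.ArrayList;
import java.util.List;

public class DirtyTrackingParent {
    public static class Child {
        public final long id;
        public int value;
        public boolean dirty;  // set when in-memory state diverges from the DB

        public Child(long id, int value) { this.id = id; this.value = value; }

        public void setValue(int newValue) {
            if (this.value != newValue) {
                this.value = newValue;
                this.dirty = true;  // only flag a real change
            }
        }
    }

    // Children whose rows actually need a delete/insert round trip.
    public static List<Child> dirtyChildren(List<Child> children) {
        List<Child> dirty = new ArrayList<>();
        for (Child c : children) {
            if (c.dirty) dirty.add(c);
        }
        return dirty;
    }
}
```

If only a fraction of the ~10 child rows change per save, feeding just `dirtyChildren(...)` into the batched delete/insert cuts the row count proportionally.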
Deleting any existing ChildObject1 records from the table and then inserting the ChildObject1 instances from the current state of your parent object seems unnecessary to me. Are the values of all of the child objects different from what was previously stored?
A better solution might involve only modifying the database when you need to, i.e. when there has been a change in state of the ChildObject1 instances.
Rolling your own persistence logic for this type of thing can be hard (your persistence layer needs to know the state of the ChildObject1 objects when they were retrieved to compare them with the versions of the objects at save-time). You might want to look into using an ORM like Hibernate for something like this, which does an excellent job of knowing when it needs to update the records in the database or not.