I am trying to migrate data in MongoDB with Mongock and stumbled upon an issue when I need to run a migration across a large amount of data.
Is there a way to partition this data in any way? I didn't find support for that in the documentation.
The problem is that queries over this data take a lot of time, and loading all of it into memory at once can also cause problems.
UPD:
The problem is that the sample code below may return anything from a single document to a million (1kk) of them, and it can take a huge amount of time:
mongoTemplate.findAll(User.class).stream()
.map(this::migrateUser)
.forEach(mongoTemplate::save);
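A minimal sketch of chunked ("keyset") processing that avoids loading the whole collection at once. This is not Mongock-specific; it assumes a User entity with an ObjectId id and a getId() accessor, and the batch size of 1000 is an arbitrary choice, none of which comes from the question:
ObjectId lastId = null;
while (true) {
    Query query = new Query().with(Sort.by(Sort.Direction.ASC, "_id")).limit(1000);
    if (lastId != null) {
        query.addCriteria(Criteria.where("_id").gt(lastId));
    }
    List<User> chunk = mongoTemplate.find(query, User.class);
    if (chunk.isEmpty()) {
        break;
    }
    chunk.stream().map(this::migrateUser).forEach(mongoTemplate::save);
    lastId = chunk.get(chunk.size() - 1).getId(); // remember where the next chunk starts
}
Each iteration only holds one chunk in memory, and the _id cursor lets separate workers (or separate change units) take disjoint ranges if you want to partition the work.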
Related
I have created an application where I am migrating data (in GBs) from Oracle to MongoDB. I have used sharding in MongoDB.
How can I reduce the time the migration takes and increase the performance rate?
Your migration could be slow for a number of reasons.
It could be because of:
how the migration code works
how fast Oracle responds to the queries
how you are pulling the data from Oracle (whether a stream or something similar is possible; see the sketch at the end of this answer)
directly querying with something like select *
It also matters whether the two DB servers are in the same data center or are separated. Everything above assumes that the two DB servers are in the same data center (the same location, private server area, or virtual cloud region).
It also depends on the number of indexes on the Oracle side, which may affect the rate at which the data can be read.
If the operations are just reads, they should be faster; if you are also performing updates/deletes on Oracle, the performance could again be slower.
This is a very generic question, and there could be a long list of reasons causing the performance issues.
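A rough sketch of the "stream from Oracle, bulk-write to Mongo" idea mentioned above. The connection details, column/collection names, and the 1,000-row batch size are placeholders, not something from the question:
try (Connection conn = DriverManager.getConnection(oracleUrl, dbUser, dbPassword);
     PreparedStatement ps = conn.prepareStatement("SELECT ID, NAME FROM USERS");
     MongoClient mongo = MongoClients.create(mongoUri)) {

    ps.setFetchSize(1000); // stream rows from Oracle instead of materializing everything
    MongoCollection<Document> target = mongo.getDatabase("target").getCollection("users");

    List<Document> batch = new ArrayList<>(1000);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            batch.add(new Document("_id", rs.getLong("ID")).append("name", rs.getString("NAME")));
            if (batch.size() == 1000) {
                target.insertMany(batch); // one round trip per 1,000 documents
                batch.clear();
            }
        }
    }
    if (!batch.isEmpty()) {
        target.insertMany(batch);
    }
}
Batching the inserts keeps the number of round trips to MongoDB low, while the JDBC fetch size keeps memory usage on the application side bounded.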
I've been working on an application which requires regular writes and massive reads at once.
The application stores a few text columns which are not very big in size, and a map, which is the biggest column in the table.
Working with Phantom-DSL in Scala (Datastax Java driver underneath), my application crashes when the data size increases.
Here is a log from my application.
[error] - com.websudos.phantom - All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.OperationTimedOutException: [/127.0.0.1:9042] Operation timed out))
And here are the cassandra logs.
I have posted the Cassandra logs in a pastebin because they were too large to embed here.
I can't seem to understand the reason for this crash. I have tried increasing the timeout and turning off row cache.
From what I understand, this is a basic problem and can be resolved by tuning cassandra for this special case.
My Cassandra usage is coming from different data sources, so writes are not very frequent. But reads are big: over 300K rows may be required at once, which then need to be transferred over HTTP.
The logs show significant GC pressure (ParNew of 5 seconds).
When you say "reads are big in size in that over 300K rows may be required at once", do you mean you're pulling 300k rows in a single query? The Datastax driver supports native paging: set the fetch size significantly lower (500 or 1000) and let it page through the results instead of trying to load all 300k rows in a single pass.
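A sketch with the Datastax Java driver 3.x (the driver underneath Phantom at the time); the table name, the session variable, and process(row) are placeholders:
Statement stmt = new SimpleStatement("SELECT * FROM my_keyspace.my_table");
stmt.setFetchSize(1000);             // page size, not a row limit
ResultSet rs = session.execute(stmt);
for (Row row : rs) {                 // the driver fetches further pages transparently
    process(row);
}
With paging, only one page is materialized at a time, which also takes a lot of pressure off the heap on both sides.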
Maps (and collections in general) can be very demanding for Cassandra heapspace. Changing your data model to replace the map with another table may solve your gc issues. But that's a lot of speculation due to the lack of further details on your Cassandra usage.
We are working with Google BigQuery (using Java) for one of our cloud solutions and are facing a lot of issues in development. Our observations and issues are as follows:
We are using query jobs (for example: the jobs().insert()/jobs().query() method first and then tabledata().list() for the data) for data retrieval. Job execution takes 2-3 seconds (we only have data in MBs right now). We looked into sample code on code.google.com and github.com and tried to implement it, but we are not able to get execution faster than 2-3 seconds. What is the fastest way to retrieve data from BigQuery tables? Is there a way to improve job execution speed? If yes, can you provide links to sample code?
In our screens, we need to fetch data from different tables (different queries) and display them. So we insert multiple query jobs, and the total execution time adds up (for example: with two jobs, i.e. two queries, it takes 6-7 seconds). The Google documentation mentions that we can run concurrent jobs. Is there any sample code available for this?
Waiting for your valuable responses.
Querying cached results can be much faster if you can run the query independently; a repeated query whose results are still cached returns much quicker.
Check that the bottleneck is not related to the network, pagination, page rendering, etc. You can do this by executing only the second step on its own.
Parallel jobs might be queued on the BigQuery end, depending on its current load.
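As a sketch of the concurrent-jobs part of the question, here is roughly how two query jobs could be submitted in parallel using the newer google-cloud-bigquery client (not the raw API client used in the question); the SQL strings and pool size are placeholders:
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
ExecutorService pool = Executors.newFixedThreadPool(2);

List<String> queries = Arrays.asList(
        "SELECT COUNT(*) FROM dataset.table_a",
        "SELECT COUNT(*) FROM dataset.table_b");

List<Future<TableResult>> futures = new ArrayList<>();
for (String sql : queries) {
    futures.add(pool.submit(() -> bigquery.query(QueryJobConfiguration.newBuilder(sql).build())));
}
for (Future<TableResult> f : futures) {
    f.get().iterateAll().forEach(System.out::println); // both jobs ran concurrently on the BQ side
}
pool.shutdown();
The wall-clock time is then roughly the slowest single job rather than the sum of all jobs, subject to the queuing caveat above.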
My recommendation would be to separate the query from the presentation. Run the BQ queries, retrieve the (small) result sets into a fast-access data store (flat file, cache, Cloud SQL, etc.) and present the data from there.
As Pentium10 says, BQ is excellent for huge datasets (and returns results faster and cheaper than any comparable solution). But if you are looking for a back end for a fast reporting/visualization tool, I am afraid BQ might not be your solution.
1) BigQuery is a highly scalable database before being a "super fast" database. It is designed to process huge amounts of data by distributing the processing among several machines, using a technique named Dremel. Because it is designed around many machines and parallel processing, you should expect super-scalability with good performance.
2) BigQuery is an asset when you want to analyze billions of rows.
For example: analyzing all the Wikipedia revisions in 5-10 seconds isn't bad, is it? But even a much smaller table would take about the same time, even if it has only 10k rows.
3) Under this size, you'll be better off using more traditional data storage solutions such as Cloud SQL or the App Engine Datastore. If you want to keep SQL capability, Cloud SQL is the best guess.
Sybase IQ is often installed as a single database and it doesn't use Dremel. That said, it's going to be faster than BigQuery in many scenarios... as designed.
4) Certainly the performance differs from a dedicated environment. You can get a dedicated environment for around $20K a month.
I was reading this SO article and was amazed to find that no one mentioned how long it takes to run a data dump with the appcfg.py tool.
Now, obviously, this is a function of how much data you have and the hardware you're dumping to, but I was wondering if anyone had ever benchmarked it. If I have a decent amount of data (say, 50GB), how long can I expect to wait for the entire datastore to dump to a YAML file?
Minutes? Hours? Days?
The factors affecting the time to download a dataset are:
number of entities
size of entities
speed of your network
how the App Engine backend is performing ;-)
I am on a slow link (typically only 150kbps on 3G wireless), and my 675MB of data, totalling 1.5 million entities, takes about 2 days to download to the local machine using appcfg.
So unfortunately there won't be a good hard number you can rely on - you will just have to give it a go and see how long it takes in your situation.
I can't tell you about dumping/downloading to your own computer, but I have used Datastore backup, which saves datastore data to Blobstore: I have seen backup speeds of around 100-150 MB per minute.
I am currently in need of a high performance java storage mechanism.
This means:
1) I have 10,000+ objects with one-to-many relationships.
2) The objects are updated every 5 seconds, with the most recent updates persistent in the case of system failure.
3) The objects need to be queryable in a reasonable time (1-5 seconds) (e.g.: give me all of the objects with this timestamp, or all of the objects within these location boundaries).
4) The objects need to be available across various Glassfish installs.
Currently:
I have been using JMS to distribute the objects, Hibernate as an ORM, and HSQLDB to provide the needed recoverability.
I am not exactly happy with the performance. Especially the JMS part of this.
After doing some Stack Overflow research, I am wondering if this would be a better solution. Keep in mind that I have no experience with what Terracotta gives me.
I would use Terracotta to distribute objects around the system, and something else would need to provide the ability to "query" for attributes of those objects.
Does this sound reasonable? Would it meet these performance constraints? What other solutions should I consider?
I know it's not what you asked, but you may want to start by switching from HSQLDB to H2. H2 is a relatively new, pure Java DB. It is written by the same guy who wrote HSQLDB, and he claims the performance is much better. I've been using it for some time now and I'm very happy with it. It should be a very quick transition (add a jar, change the connection string, create the database), so it's worth a shot.
In general, I believe in trying to get the most of what I have before rewriting the application in a different architecture. Try profiling it to identify the bottleneck first.
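A sketch of how small the switch usually is (paths and credentials are placeholders):
// old HSQLDB connection
// Connection c = DriverManager.getConnection("jdbc:hsqldb:file:data/appdb", "sa", "");

// new H2 connection -- same JDBC code, different driver jar and URL
Class.forName("org.h2.Driver");
Connection c = DriverManager.getConnection("jdbc:h2:./data/appdb", "sa", "");
With Hibernate in the mix, it is mostly a matter of changing the JDBC URL, the driver class, and the dialect in the configuration.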
First of all, Lucene isn't your friend here (it's essentially read-only for this use case).
Terracotta is for scaling out at the logical layer! Your problem doesn't seem to be related to the processing logic; it's more about the storage/communication side.
Identify your bottleneck! Benchmark the Storage/Logic/JMS processing time and overhead!
Kill JMS issues with a good JMS framework (eg. ActiveMQ) and a good/tuned configuration.
Maybe a distributed key=>value store is your friend. Try Project Voldemort!
If you'd like to stay with Hibernate and HSQL, check out the Hibernate second-level cache and connection pooling (c3p0, container-driven...)!
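A sketch of what that tuning can look like programmatically, using classic Hibernate 3.x-era property names (the EhCache provider and the pool sizes are assumptions, not something from the answer):
SessionFactory sessionFactory = new Configuration()
        .configure()                                   // reads hibernate.cfg.xml
        .setProperty("hibernate.cache.use_second_level_cache", "true")
        .setProperty("hibernate.cache.provider_class", "org.hibernate.cache.EhCacheProvider")
        .setProperty("hibernate.c3p0.min_size", "5")
        .setProperty("hibernate.c3p0.max_size", "20")
        .setProperty("hibernate.c3p0.timeout", "300")
        .buildSessionFactory();
The same properties can of course go into hibernate.cfg.xml or hibernate.properties instead.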
Several Terracotta users have built systems like this in the past, so I can tell you by proof of existence that it can be done. :)
Compass does have support for clustering with Terracotta so that might help you. I suspect you might get further faster by just being careful with how you create your clustered data structures.
Regarding your requirements and Terracotta:
1) 10k objects is quite small from a Terracotta perspective
2) 5 sec update rate doesn't seem like an issue. Might depend how many nodes there are and whether there is any natural partitioning you can take advantage of. All updates will be persistent.
3) A 1-5 second query time seems quite easy. Building your own well-organized data structures for lookup is the tricky part; obviously you want to avoid scanning all the data (a rough sketch of that idea follows below).
4) Terracotta currently supports Glassfish v1 and v2.
If you post on the Terracotta forums, you could probably get more Terracotta eyeballs on the problem.
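To make point 3 concrete, here is a rough, hypothetical sketch of the kind of lookup structure meant there: a sorted secondary index keyed by timestamp, so that range queries don't scan everything. None of these class or field names come from the question, and a clustered deployment would need Terracotta-supported (or otherwise shared) structures:
ConcurrentNavigableMap<Long, Set<MyObject>> byTimestamp = new ConcurrentSkipListMap<>();

void index(MyObject o) {
    byTimestamp.computeIfAbsent(o.getTimestamp(), t -> ConcurrentHashMap.newKeySet()).add(o);
}

Collection<MyObject> between(long from, long to) {
    return byTimestamp.subMap(from, true, to, true).values().stream()
            .flatMap(Set::stream)
            .collect(Collectors.toList());
}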
I am currently working on writing the client for a very (very) fast key/value distributed hash DB that provides set + list semantics. The DB is C99 and requires GCC, and right now I'm battling with good old Java network IO to break my current barrier of 30,000 gets/sets per second. Hope to be done within the week. Drop me a line through my account and I'll get back to you when it's show time.
With such a high update rate, Lucene is almost definitely not what you're looking for, since there is no way to update a document once it's indexed. You'd have to keep all the object versions in the index and select the one with the latest time stamp, which will kill your performance.
I'm no DB expert, but I think you should look into any one of the distributed DB solutions that's been on the news lately. (CouchDB, Cassandra)
Maybe you should take a look at Prevayler.
Your objects are always in mem.
The "changes" to your objects are persisted.
From time to time you are able to take a snapshot: every object is persisted.
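Roughly, the model looks like the sketch below (this assumes a recent Prevayler with the generic API; the UserRegistry/User/AddUser classes are placeholders):
public class AddUser implements Transaction<UserRegistry> {
    private final User user;
    public AddUser(User user) { this.user = user; }
    public void executeOn(UserRegistry registry, Date executionTime) {
        registry.add(user); // the transaction is journaled before it is applied
    }
}

Prevayler<UserRegistry> prevayler =
        PrevaylerFactory.createPrevayler(new UserRegistry(), "journal-dir");
prevayler.execute(new AddUser(new User("alice"))); // journaled change
prevayler.takeSnapshot();                          // occasional full snapshot
Queries run against the in-memory object graph directly, which is what makes reads fast.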
You don't say which JMS vendor you are using, but it wouldn't surprise me if you have a bottleneck there. I couldn't get more than 100 messages a second from ActiveMQ, and whatever I tried in terms of configuring acknowledgements, queue sizes, etc., we were unable to push the CPU beyond a few percent.
The solution was to batch many queries into one JMS message. We had a simple class that sent a batch of messages either when it had accumulated 200 queries or when it hit a timeout (we used 20ms), and that gave us a dramatic increase in message throughput.
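A hypothetical sketch of that kind of batching class (the 200-item/20ms thresholds come from the answer above; the MessageSender wrapper and everything else is assumed, and a real version would also need a timer so the 20ms flush fires even when no new items arrive):
class BatchingSender {
    private final List<String> buffer = new ArrayList<>();
    private long lastFlush = System.currentTimeMillis();
    private final MessageSender sender; // wraps the actual JMS producer

    BatchingSender(MessageSender sender) { this.sender = sender; }

    synchronized void enqueue(String query) {
        buffer.add(query);
        if (buffer.size() >= 200 || System.currentTimeMillis() - lastFlush >= 20) {
            flush();
        }
    }

    private synchronized void flush() {
        if (!buffer.isEmpty()) {
            sender.send(new ArrayList<>(buffer)); // one JMS message carrying the whole batch
            buffer.clear();
        }
        lastFlush = System.currentTimeMillis();
    }
}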
Guaranteed messaging is going to be much slower than volatile messaging. Given that every object is updated every few seconds, you might consider batching your updates (into, say, 500 changes, or by time, say 1-10 ms' worth), sending them over volatile messaging, and batching your transactions. In this case you are more likely to be limited by bandwidth. As you tune your use case, you may find that smaller batch sizes also work efficiently. If bandwidth is critical (say you have a 10 MB connection or slower), you could use compression over JMS.
You can achieve much higher performance with a custom solution (which also might be simpler), e.g. Hazelcast and JGroups are free (you can add a node or nodes to do the database synchronization so your main app doesn't slow down). There are commercial products which handle on the order of half a million durable messages per second.
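As a minimal sketch of the Hazelcast route (the map name, key type, and MyObject class are placeholders), a clustered map shared by all Glassfish nodes:
HazelcastInstance hz = Hazelcast.newHazelcastInstance();
IMap<Long, MyObject> objects = hz.getMap("objects");

objects.put(42L, new MyObject()); // replicated/partitioned across the cluster
MyObject o = objects.get(42L);
Hazelcast's IMap also supports indexes and predicate-based queries, which covers the "query by attribute" requirement, and a MapStore can handle asynchronous write-behind to the database.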
Terracotta + jofti = queryable persistent clustered data structures
Search Google for "terracotta querymap" or visit tusharkhairnar.blogspot.com for the QueryMap blog post.
You may want to integrate tim-async as well to update your database. The database is your system of record; use Terracotta as a caching and database-offloading mechanism. You can even batch async updates to make it faster, so that the database contains fairly recent data.
Tushar
tusharkhairnar.blogspot.com