I'm trying to migrate a MySQL table to MongoDB. The table has 6 million entries, and I'm using Java with Morphia. After saving about 1.2 million of them, my memory is almost completely consumed.
I've read that MongoDB stores the data in memory and writes it to disk afterwards. Is it possible to send something like a commit to free some of that memory?
1) In terms of durability, you can tell the MongoDB Java driver (which Morphia uses) which write strategy to use, see https://github.com/mongodb/mongo-java-driver/blob/master/src/main/com/mongodb/WriteConcern.java#L53. It's simply a trade-off between speed and safety: from NONE (not even connectivity issues will cause an error) up to FSYNC_SAFE (the data is definitely written to disk). A sketch of setting it from Java follows below.
For the internal details, check out http://www.kchodorow.com/blog/2012/10/04/how-mongodbs-journaling-works/
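For illustration, here is a minimal sketch of setting the write concern with the legacy 2.x Java driver (which Morphia wraps); the class and constant names may differ in newer driver versions, and the host, port, and choice of JOURNAL_SAFE are just examples:

import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

public class WriteConcernExample {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        // JOURNAL_SAFE waits until the write has reached the journal on disk,
        // a middle ground between NONE (fire-and-forget) and FSYNC_SAFE.
        mongo.setWriteConcern(WriteConcern.JOURNAL_SAFE);
        // ... save your entities through Morphia here ...
        mongo.close();
    }
}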
2) Your whole data set is mapped to memory (that's why the 32-bit edition has a size limit of 2 GB), but it is only actually loaded when required. MongoDB leaves that to the operating system by using mmap. So as long as there is RAM available, MongoDB will happily load all the data it needs into RAM to make queries very quick. If there is no more memory available, it's up to the operating system to swap out old pages. This has the nice effect that your data is kept in memory even if you restart the MongoDB process; only if you restart the server itself does the data have to be fetched from disk again. I think the downside is that the database process would probably have a better idea than the operating system of what should be swapped out first.
I'm not using MongoDB on Windows and haven't seen that message on Mac or Linux (yet), but the operating system should handle this for you (and automatically swap out pieces of data as required). Have you tried setting the driver to JOURNAL_SAFE? It should be a good compromise between data safety and speed, and with that setting no data should be lost even if the MongoDB process dies.
3) In general, MongoDB is built to use as much available memory as possible, but you might be able to restrict it with http://captaincodeman.com/2011/02/27/limit-mongodb-memory-use-windows/ - which I haven't tested, as we are using (virtual) Linux servers.
If you just want to release some of the memory MongoDB uses, you can run the following command in the mongo shell once your data has been processed and mongod is idle:
use admin
db.runCommand({closeAllDatabases: 1})
Afterwards, you will see the mapped, vsize, and res values reported by mongostat go down a lot.
I have tried it, and it works. Hope it helps. ^_^
Related
I have a project where I have to send emails using the Amazon SES REST API. Amazon allows a certain number of concurrent connections based on the account; in my case it allows 50 connections at the same time, which means I can send 50 emails/sec. To achieve this, I am currently using Java executor threads, where I throttle the rate to 50/sec. I have also implemented this with the Hibernate framework, because I need to execute some SQL queries before sending the emails.
This Java program runs continuously in the background (it's a jar file). It takes around 512 MB of RAM, so my question is: can I use some other framework or a better threading approach to make it lighter? The SQL query I execute is only a select; update/delete/create queries are not used.
I am not good at Java, so maybe this sounds stupid.
I guess the smallest possible framework to use would be plain JDBC.
This would limit your libraries to those in the JRE plus the DB driver and maybe libraries for AWS / email. Depending on what else you need, selecting a compact profile might be worth investigating.
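For the single select you describe, a plain JDBC version could look roughly like this (the JDBC URL, credentials, table and column names are placeholders for whatever your schema actually uses):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class RecipientLoader {
    // Loads the email addresses still to be sent, without Hibernate.
    public List<String> loadPendingRecipients() throws Exception {
        List<String> emails = new ArrayList<>();
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/mailer", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT email FROM recipients WHERE sent = 0");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                emails.add(rs.getString("email"));
            }
        }
        return emails;
    }
}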
Also check your memory settings:
If you set -Xms512m it's really not surprising your app uses 512m, is it?
Edit due to rephrased question
At your level of parallelism, most of your memory is consumed by objects, not by threads (well, threads are objects too, but small ones). Threads are fine the way they are in Java; you can run hundreds of them without them consuming 500 MB of heap or more, as you claim.
So the issue of 50 threads consuming 512 MB of memory is more likely rooted in your code and your objects, not (only) in your threads.
In order to reduce the memory footprint, try the following:
Remove Hibernate. As you say, you only have a simple SQL select, so you don't need the overhead and additional libraries (see the plain JDBC sketch above).
Take a memory dump of your running app and analyse it (MAT, the Eclipse Memory Analyzer Tool, comes to mind).
Check other objects and how you use them. When you say "sending emails" - how large are your emails? Might there be duplicate buffers due to a poor coding choice? Share your code for how you do it, then we can have a look.
Try running without any memory options and see how the program runs on defaults.
Add garbage collector output (e.g. -verbose:gc) and check that.
For reasons that are beside the point, a company has bought an Exadata Eighth Rack. Some of the managers thought that this would improve the performance of current applications. The problem is that hardly any application does intensive database work (yes, this is a good moment to look at facepalm animated GIFs). So, at the moment, migrations have shown little benefit.
The question is obvious. Most of the applications are written in Java, and some of them make intensive use of Solr and Cassandra. As far as I know, Exadata is intended for storing data, while Exalogic can host applications too. Anyway, I'm wondering if there is some way of taking advantage of the infrastructure mentioned.
Replace Solr with Oracle Text.
Before I get down-voted: normally I would not recommend replacing existing code built with a popular, open-source program with a seldom-used, proprietary product. But if you want to use a lot of space and CPU on your database servers, then Oracle Text can definitely help.
As more generic advice, the primary role of a database is not to store data. A file system can do that. Databases are built to join data. If an application is reading a large amount of data and doing ad hoc joins, those are the jobs you want to move to the database.
Exadata -> Oracle Database extreme performance.
Exalogic -> Fusion Middleware extreme performance. (Java goes here)
Your best move will be refactoring the application to put as much workload as possible on the DB (PL/SQL).
Another thing I can think of, although it is a radical approach that I have never really tried myself (yes, I work with Exadatas too): maybe you can give it a shot and let us know here how it goes.
What about using all those GBs of RAM on the Exadata and starting to tune your Java application's latency? I mean, with that gruesome amount of memory you can try setting a really generous heap and avoid garbage-collection-induced latency. Please do let me know here what comes of it if you actually try this.
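To give a concrete idea, the kind of JVM settings meant here would look something like the following; the heap size is purely illustrative and has to fit your machine and workload, and G1 is just one common low-pause collector choice:

java -Xms32g -Xmx32g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -jar your-application.jar

Setting -Xms and -Xmx to the same value avoids heap resizing, and a large heap combined with a low-pause collector is the usual way to trade memory for latency.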
Which protocol do the Java applications use to connect to Oracle?
If it's not IPC (inter-process communication, aka BEQUEATH, aka shared memory) but rather TCP, and you have many fast & tiny round trips, then this would be your low-hanging fruit: eliminate the network stack.
Edit: I just realized that Exadata cannot run Java applications by default (only the ODA does), so it wouldn't be possible to make use of IPC. However, perhaps you're able to test the impact of IPC on one of your applications using the former infrastructure?
Exadata cannot host any customer application; you cannot install anything there. You can only host an Oracle database on Exadata.
That means you can use database features like DBFS (a file system on top of the Oracle database) or the Java option (storing and executing Java code inside the database). But you need to check which options you have a license for. Also, the internal JVM is used, which cannot be customized or upgraded.
Exadata is a database appliance designed to work with large amounts of differently accessed data in a very effective and manageable way.
This may not be possible, but I thought I might just give it a try. I have a job that processes data and makes one of three decisions for each item it processes: keep, discard, or modify/reprocess (when it is unsure whether to keep or discard). This generates a very large amount of data, because reprocessing may break an item into many different parts.
My initial approach was to send everything to the ExecutorService that was processing the data, but because the number of items to process was large, I would run out of memory very quickly. Then I decided to offload the queue to a messaging server (RabbitMQ), which works fine, but now I'm bound by network I/O. What I like about RabbitMQ is that it keeps messages in memory up to a certain level and then dumps older messages to the local drive, so if I have 8 GB of memory on my server I can still have a 100 GB message queue.
So my question is: is there any library with a similar feature in Java? Something I can use as a non-blocking queue that keeps only X items in memory (either by number of items or by size) and writes the rest to the local drive.
Note: right now I'm only asking for this to be used on one server. In the future I might add more servers, but because each server generates its own data, I would try to take messages from one queue and push them onto another if one server's queue is empty. The library would not need network access, but I would need to access the queue from another Java process. I know this is a long shot, but I thought that if anyone knew, it would be SO.
Not sure if it is the approach you are looking for, but why not use a lightweight database like HSQLDB and a persistence layer like Hibernate? You can keep your messages in memory, then commit them to the DB to save them on disk, and later query them with a convenient SQL query.
Actually, as Cuevas wrote, HSQLDB could be a solution. If you use the "cached table" type it provides, you can limit the amount of memory used; data exceeding that limit is kept on the hard drive.
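A rough sketch of what that could look like over JDBC (the table name, column sizes and the cache-tuning property are illustrative; check the HSQLDB documentation for the exact properties supported by your version):

import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class DiskBackedQueueSketch {
    public static void main(String[] args) throws Exception {
        // File-based database; CACHED tables keep only part of their rows in memory.
        Connection con = DriverManager.getConnection(
            "jdbc:hsqldb:file:/tmp/queuedb;hsqldb.cache_rows=50000", "SA", "");

        try (Statement st = con.createStatement()) {
            // Will fail if the table already exists; a real version would check first.
            st.execute("CREATE CACHED TABLE messages ("
                     + "id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, "
                     + "payload VARBINARY(65536))");
        }

        try (PreparedStatement ps =
                 con.prepareStatement("INSERT INTO messages (payload) VALUES (?)")) {
            ps.setBytes(1, "some work item".getBytes(StandardCharsets.UTF_8));
            ps.executeUpdate();
        }
        con.close();
    }
}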
Use the filesystem. It's old-school, yet so many engineers get bitten by libraries because they are lazy. True, HSQLDB provides lots of value-add features, but in the context of being lightweight...
In an application I'm working on, I need a write-behind data log. That is, the application accumulates data in memory, and can hold all of it in memory. It must, however, persist the data, tolerate reasonable faults, and allow for backup.
Obviously, I could write to a SQL database; Derby springs to mind for easy embedding. But I'm not tremendously fond of dealing with a SQL API (JDBC, however lipsticked), and I don't need any queries, indices, or other decoration. The records go out, and on restart, I need to read them all back.
Are there any other suitable alternatives?
Try using just a simple log file.
As data comes in, store it in memory and write (append) it to a file. A write() followed by an fsync() will guarantee (on most systems; read your system and filesystem docs carefully) that the data has reached persistent storage (disk). These are the same mechanisms any database engine uses to get data onto persistent storage.
On restart, reload the log. Occasionally, trim the front of the log file so it doesn't grow indefinitely. Or model the log file as a circular buffer of the same size as what you can hold in memory.
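A minimal sketch of such a log in Java, where FileChannel.force(true) plays the role of fsync() (the file name and record format are placeholders):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class AppendOnlyLog implements AutoCloseable {
    private final FileChannel channel;

    public AppendOnlyLog(Path file) throws IOException {
        channel = FileChannel.open(file,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE,
            StandardOpenOption.APPEND);
    }

    // Appends one record and forces it (data and metadata) to disk.
    public void append(String record) throws IOException {
        channel.write(ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8)));
        channel.force(true);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        try (AppendOnlyLog log = new AppendOnlyLog(Paths.get("data.log"))) {
            log.append("record-1");
        }
    }
}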
Have you looked at (the now Oracle-owned) Berkeley DB for Java? Its "Direct Persistence Layer" is actually quite simple to use. Docs for the DPL are here.
It has different options for backups and comes with a few utilities, and it runs embedded.
(Licensing: a form of the BSD license, I believe.)
I'm doing a school software project with my classmates in Java.
We store the info in a remote DB.
When we start the application, we pull all the information from the database and transform it into objects to use in our application (using plain Java SQL statements).
In the application we edit some of these objects, and then, when we exit the application,
we save or update the information in the database using Hibernate.
As you can see, we don't use Hibernate for pulling in information; we use it just for saving and updating.
We have two very similar problems.
Loading the objects (when we start the app) and saving the objects with Hibernate (when closing the app) both take too much time.
Our project is not a huge enterprise application; it's a fairly small app where we just manage some students, teachers, homework and tests. So our DB is also very, very small.
How could we increase performance?
Later edit: with a local database it runs very quickly; it is only slow against remote databases.
Are you saying you are loading the entire database into memory and then manipulating it? If that is the case, why don't you instead simply use the database as a storage device and do lookups and manipulation as necessary (using Hibernate if you like, or something else if you don't)? The key there is to make sure that you are using connection pooling, as that will reduce the connection time.
If this is what you are doing, you could be running into memory issues as well: by not caching the entire database in memory, you will reduce memory use and spread the network load out from the beginning/end of the run to the times when it actually needs to happen.
These two sentences are red flags for me:
When we start the application we pull all the information from the database and transform it into objects to use in our application (using java sql statemens). In the application we edit some of these objects and then when we exit the application we save or update information in the database using Hibernate.
Is there a requirements reason why you are loading all the information from the database into memory at startup, or why you're waiting until shutdown to save changes back to the database?
If not, I'd suggest a design change. If you've already got Hibernate mappings for the tables in the DB, I'd use Hibernate for all of your CRUD (create, read, update, delete) operations. And I'd only load the data that each page in your app needs, as it needs it.
If you can't make that kind of design change at this point, I think you've got to look closely at how you're managing the database connections. Are you using connection pools? Are you opening up multiple connections? Forgetting to release them?
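For illustration, a minimal connection-pool setup could look like the sketch below; HikariCP is used here only as one example of a pooling library, and the JDBC URL, credentials, and pool size are placeholders:

import java.sql.Connection;
import java.sql.SQLException;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class Pool {
    private static final HikariDataSource DATA_SOURCE;

    static {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://dbhost:3306/school"); // placeholder URL
        config.setUsername("user");
        config.setPassword("password");
        config.setMaximumPoolSize(10); // reuse a small, fixed set of connections
        DATA_SOURCE = new HikariDataSource(config);
    }

    // Borrows a pooled connection; calling close() on it returns it to the pool.
    public static Connection getConnection() throws SQLException {
        return DATA_SOURCE.getConnection();
    }
}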
Something else to look at: how are you using Hibernate to save the entities to the DB? Are you doing a getHibernateTemplate().get() on each one and then calling save or update on each one? If so, that means you are also causing Hibernate to run a select query for each database object before it does a save or update. So, essentially, you'd be loading each database object twice (once at the beginning of the program, once before saving). To see if that's what's happening, you can turn on the show_sql property or use P6Spy to see exactly what queries Hibernate is running.
For what you are doing, you may very well be better off serializing your objects and writing them out to a flat file.
But, much more likely, you should just read / update objects directly from your database as needed instead of all at once, for all the reasons aperkins gives.
Also, consider what happens if your application crashes. If all of your updates are held only in memory until the application is closed, everything will be lost if the app closes unexpectedly.
The difference between loading everything from a remote DB server and loading everything from a local DB server is network latency / pipe size. The network is a much smaller pipe than anything else. Two questions: first, how much data are we really talking about? Second, what is your network speed, 10/100/1000? Figure that between 10 and 20% of your pipe size will be overhead, due to everything from networking protocols to the actual queries themselves.
As others have stated, the way you've architected this is usually high on the list of "don't do". When starting, pull only enough data to initialize the app. As the user works through it, pull what you need for that task.
The ONLY time you pull everything is when they are working in a disconnected state. Even then, you don't load everything as objects in the application; you just work from a local data store which gets synced with the remote server every so often.
The project is pretty much complete; we can't do large refactorings on it now.
I tried to use a second-level cache for Hibernate when saving: EhCacheProvider.
In hibernate.xml:
net.sf.ehcache.hibernate.EhCacheProvider
I have done a config for the cache, in ehcache.xml:
I have put the cache.jar in the project build path,
and I have set the Hibernate cache property for every class in the mapping.
But this cache doesn't seem to have an effect. I don't know whether it works (whether it is actually being used).
Try minimising the number of SQL queries, since every query has its own overhead.
You can enable database compression, which should speed things up when there is a lot of data.
Maybe you are connecting to the database many times?
Check the ping time to the remote database server - it might be the problem.
As your application is only slow when running against a remote database server, I'd assume that the performance loss is due to:
Connecting to the server: try to reuse connections (pass the instance around) or use connection pooling
Query round-trip time: use as few queries as possible; see here in the case of a hand-written DAL:
Preferred way of retrieving row with multiple relating rows
For Hibernate, you can use its batching functionality and adjust hibernate.jdbc.batch_size (a sketch follows below).
In all cases, especially when you can't refactor larger parts of the codebase, use a profiler (method timings or SQL queries) to find the bottleneck. I bet you'll find thousands of queries, each taking ~10 ms RTT, which could be merged into one.
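As an illustration of the batching idea, a batched save with Hibernate might look roughly like this (the entity list, session factory, and batch size are placeholders; hibernate.jdbc.batch_size should be set to a matching value in your configuration):

import java.util.List;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class BatchSave {
    private static final int BATCH_SIZE = 50; // should match hibernate.jdbc.batch_size

    public static void saveAll(SessionFactory sessionFactory, List<?> entities) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        for (int i = 0; i < entities.size(); i++) {
            session.saveOrUpdate(entities.get(i));
            if (i > 0 && i % BATCH_SIZE == 0) {
                // Push the current batch to the database and detach the objects,
                // so the session does not keep every entity in memory.
                session.flush();
                session.clear();
            }
        }
        tx.commit();
        session.close();
    }
}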
Some other things you can look into:
You can allocate more memory to the JVM
Use the jconsole tool to investigate where the bottlenecks are.
Why don't you use two separate threads, as sketched below?
Thread 1 will load your objects one by one.
Thread 2 will process objects as they are loaded.
Your app will seem more interactive at startup.
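A minimal sketch of that two-thread idea using a BlockingQueue (the loading and processing steps are placeholders for your own code):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class LoadAndProcess {
    private static final Object END = new Object(); // sentinel marking the end of the data

    public static void main(String[] args) {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(100);

        Thread loader = new Thread(() -> {
            try {
                // Replace the loop body with your own loading code,
                // e.g. queue.put(loadNextObjectFromDatabase());
                for (int i = 0; i < 10; i++) {
                    queue.put("object-" + i);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread processor = new Thread(() -> {
            try {
                Object item;
                while ((item = queue.take()) != END) {
                    System.out.println("processing " + item); // replace with real processing
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        loader.start();
        processor.start();
    }
}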
It never hurts to review the basics:
Improving speed means reducing time (obviously), and to do that, you find activities that take significant time but can be eliminated or replaced with something that uses less time. What I mean by an activity is almost always a function call, method call, or property access, performed on a specific line of code for a specific purpose. It may invoke I/O, or computation, or both. If its purpose is not essential, then it can be optimized away.
Many people use profilers to try to find these time-wasting lines of code, but most profilers miss the target because they look at functions, not lines, they go to sleep during I/O, and they worry about "self time".
Many more people try to guess what could be the problem, or they ask others to guess, such as by asking on SO. Such guesses, in the nature of guesses, are sometimes right - more often not, but people still invest time and resources in them.
There's a very simple way to find out for sure, without guessing, what could fruitfully be optimized, and here is one way to do it in Java.
Thanks for your answers. They were more than helpful.
We completely solved this problem as follows:
Refactored the LOAD code. Now it uses Hibernate with lazy fetching.
Refactored the SAVE code. Now it saves just the data that was modified, right after it was modified. This way we don't have a huge save at the end.
I'm amazed at how well it all went. The amount of new code we had to write was very, very small.