We are using MongoDB as intermediate storage for an application that allows users to upload and download video files.
We are using the GridFS API from a Java application, as it is very convenient for this use case (we found it more appropriate, faster, and more reliable than storing the files in a table in a relational database).
Once the videos have been processed from the DB (and stored as physical files) we can remove them, but we have the problem that the freed space is not reclaimed: it remains "used" without holding any useful data. We have tried repairing the database, as suggested in posts like Auto compact the deleted space in mongodb?, but that kept the database down for a few days, which is not ideal as it needs to be running 24/7. (We came across this recently when the DB ran out of free space.)
I am not very knowledgeable in this topic, so I would like to hear opinions on a solution you know or use that would be efficient: one that stores data in blocks and makes it easy to reclaim space once the blocks/chunks are no longer needed.
Some options are:
1) Have two MongoDB instances: from time to time, export and import the data (everything except the collections used by GridFS that contain the videos) from one DB to the other. The first DB can then be dropped and its space defragmented. This seems a bit complex and not great if it needs to be done frequently, as we don't have much total space for the DB.
2) Store the videos in a relational database (for a table without relations and with these special characteristics this does not seem ideal, but it works if other solutions don't).
3) ...
If it helps, the application is deployed on a J2EE infrastructure.
Thanks.
Related
We're developing a biometric matching solution for a verification system. As you may know, one of the main issues with biometric data is that it consists of unstructured binaries, and every biometric minutia must be matched against the whole minutiae database.
Hence, we're looking for a fast and appropriate solution to eliminate the binary retrieval (I/O) latency from the physical hard disk and decrease the overheads by making all the binary records available for new matching requests.
Currently, our solution is to use an in-memory database like Redis with a caching mechanism. The problem with this solution is that the amount of memory (RAM) gets really big when the number of biometric minutiae binaries is high. We're looking for a way to keep all the binaries highly available to our matching application.
Note that each biometric minutia is usually less than 5 KB, and we have millions of minutiae records.
You can use a combination of an in-memory and a disk-based DB to store millions of minutiae.
You can store all minutiae in any disk-based DB such as MySQL, PostgreSQL, or others.
The minutiae data would be spread across three different datastores:
Application cache (Local cache)
In-Memory DB (Memcache, Redis, etc)
Disk-based DB (MySQL, MongoDB, etc)
Let's say you're using Redis and MySQL in your setup.
Your code should first look for the minutiae in the application cache. If they're not found, it should check whether they're available in Redis; if they are, fetch them from there and store them in the local cache with an expiry.
If the data is not available in Redis either, you should look it up in the MySQL database and bring it back; when you find it, store the same data in Redis with an expiry.
Using an expiry, you avoid having all objects in memory at the same time; a sketch of this lookup chain is shown below.
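As a rough illustration of this lookup chain, here is a minimal Java sketch, assuming a Jedis client for Redis and a plain JDBC connection to MySQL. The table and column names (minutiae, id, template) and the TTL are made up for the example, and a real local cache would use a library such as Guava or Caffeine to get expiry, which a plain ConcurrentHashMap does not provide.

import redis.clients.jedis.Jedis;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MinutiaeLookup {

    private final Map<String, byte[]> localCache = new ConcurrentHashMap<>(); // tier 1: local cache (no TTL here)
    private final Jedis redis;                                                // tier 2: Redis
    private final Connection mysql;                                           // tier 3: MySQL
    private static final int TTL_SECONDS = 600;                               // arbitrary expiry for the example

    public MinutiaeLookup(Jedis redis, Connection mysql) {
        this.redis = redis;
        this.mysql = mysql;
    }

    public byte[] find(String id) throws Exception {
        // 1) application (local) cache
        byte[] data = localCache.get(id);
        if (data != null) return data;

        // 2) Redis
        data = redis.get(id.getBytes());
        if (data != null) {
            localCache.put(id, data);          // promote to the local cache
            return data;
        }

        // 3) MySQL, then backfill Redis with an expiry
        try (PreparedStatement ps =
                 mysql.prepareStatement("SELECT template FROM minutiae WHERE id = ?")) {
            ps.setString(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;
                data = rs.getBytes(1);
            }
        }
        redis.setex(id.getBytes(), TTL_SECONDS, data);
        localCache.put(id, data);
        return data;
    }
}

Note that a single Jedis instance is not thread-safe; a pooled client would be used in a real multi-threaded setup.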
Now let's say you don't want to use an expiry because you always need all the minutiae. In that case, you can either increase the size of your Redis instance or use a Redis cluster. As an alternative, an IMDG (in-memory data grid) like Hazelcast or Apache Ignite can be used to store all the minutiae. If you don't want such a complex setup, consider an in-memory database like SAP HANA or MemSQL.
I have an existing database in a file. I want to load the database into memory to speed up my queries, because I'm running a lot of them and the database isn't very large (<50 MB). Is there any way to do this?
50 MB easily fits in the OS file cache; you do not need to do anything.
If the file locking results in a noticeable overhead (which is unlikely), consider using the exclusive locking mode.
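For example (a hedged sketch assuming the xerial sqlite-jdbc driver and a database file named app.db), the exclusive locking mode can be enabled with a PRAGMA on the connection:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ExclusiveMode {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:app.db");
             Statement st = conn.createStatement()) {
            // Only this connection may access the file; the lock is taken on
            // first access and held until the connection is closed.
            st.execute("PRAGMA locking_mode = EXCLUSIVE;");
            // ... run the read-heavy queries here ...
        }
    }
}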
You could create a RAM drive and have the database use those files instead of your HDD/SSD-hosted files. If you have extreme performance requirements, you could go for an in-memory database as well.
Before you go for any in-memory solution: what is "a lot of queries", and what is the expected response time per query? Chances are that the database engine isn't the performance bottleneck, but rather slow application code or inefficient queries / a lack of indexes / ...
I think SQLite does not support concurrent access to the database, which would waste a lot of performance. If writes occur rather infrequently, you could boost your performance by keeping copies of the database and having different threads read different SQLite instances (never tried that).
Neither of the solutions suggested by CL and Ray will perform as well as a true in-memory database, simply because of file-system overhead (irrespective of whether the data is cached and/or on a RAM drive; those measures will help, but you can't beat getting the file system out of the way entirely).
SQLite allows multiple concurrent readers, but any write transaction will block readers until it is complete.
SQLite only allows a single process to use an in-memory database, though that process can have multiple threads.
You can't open a persistent SQLite database as an in-memory database (at least, the last time I looked into it). You'll have to create a second, in-memory database and read from the persistent database to populate it. But if the database is only 50 MB, that shouldn't be an issue. There are third-party tools that will then let you save that in-memory SQLite database and reload it later.
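As a hedged sketch of that "copy into memory" approach: the xerial sqlite-jdbc driver exposes SQLite's online backup API through its non-standard "restore from" / "backup to" statements (the file name data.db is just for illustration; check the driver docs for your version).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class InMemoryCopy {
    public static void main(String[] args) throws Exception {
        try (Connection mem = DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = mem.createStatement()) {
            st.executeUpdate("restore from data.db");   // copy the file database into memory
            // ... run the queries against the in-memory copy ...
            st.executeUpdate("backup to data.db");      // optionally write it back out
        }
    }
}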
I want to create a hashmap-like structure for fast lookups between IDs and assigned names.
The number of entries will be a few hundred thousand, so I don't want to keep everything in memory. However, as performance matters here, I don't want to make a database query for each ID either.
So, what are my chances? How could I get fast lookups on large datasets?
A quick search found these:
Production-ready:
MapDB - I can personally recommend this. It's the successor to JDBM, if you found that one while googling. (A small usage sketch follows the list of links below.)
Chronicle-Map
Voldemort
Probably not production-ready, but worth looking at:
https://github.com/aloksingh/disk-backed-map
https://github.com/reines/persistenthashmap
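As mentioned above, here is roughly what the MapDB option looks like in code. This is a sketch against the MapDB 3.x API (the file name and map name are made up), so check the docs for the release you actually use.

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.HTreeMap;
import org.mapdb.Serializer;

public class IdNameLookup {
    public static void main(String[] args) {
        try (DB db = DBMaker.fileDB("lookup.db")
                            .fileMmapEnableIfSupported()   // use mmap where available
                            .make()) {
            HTreeMap<Long, String> idToName = db
                    .hashMap("idToName", Serializer.LONG, Serializer.STRING)
                    .createOrOpen();

            idToName.put(42L, "some-name");
            System.out.println(idToName.get(42L));   // lookups hit disk, not the Java heap
        }
    }
}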
Well, there are a couple of solutions that come to mind!
1) Go for Lucene -> store in files
2) Make views in the database -> store in the database
So it's up to you which one you go for!
I had a similar requirement a few years back and was avoiding databases, thinking they would have high lookup times. Like you, I had a large set of values, so I could not use in-memory data structures. So I decided to sequentially parse the filesystem. It was a bit slow, but I could do nothing about it.
Then I explored DBs more and used a DB for my application, just to test. Initially it was slower than the filesystem, but after indexing the table and optimizing the database it proved to be at least 10-15 times faster than the file system. I can't remember the exact performance results, but it took just 150-200 ms to read data from a large dataset (around 700 MB of data on the file system), whereas the same read from the filesystem took 3.5 seconds.
I used a DB2 database, and this guide for performance tuning of DB2.
Besides, once the DB is set up, you can reuse it for multiple applications over the network.
If you're looking for a fast solution, the answer is an in-memory database:
Redis, Memcached, Hazelcast, VoltDB, etc.
I'm trying to migrate a MySQL table to MongoDB. My table has 6 million entries. I'm using Java with Morphia. When I've saved about 1.2 million of them, my memory is almost completely consumed.
I've read that Mongo stores the data in memory and only later saves it to disk. Is it possible to send something like a commit to free some of that memory?
1) In terms of durability, you can tell the MongoDB Java driver (which Morphia uses) which strategy to use; see https://github.com/mongodb/mongo-java-driver/blob/master/src/main/com/mongodb/WriteConcern.java#L53. It's simply a trade-off between speed and safety: from NONE (not even connectivity issues will cause an error) up to FSYNC_SAFE (the data is definitely written to disk).
For the internal details check out http://www.kchodorow.com/blog/2012/10/04/how-mongodbs-journaling-works/
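For illustration, here is a hedged sketch of picking a write concern with the legacy 2.x Java driver and the old Morphia package that were current for this answer; newer drivers renamed these constants (e.g. WriteConcern.JOURNALED) and the connection setup differs, so adjust for the versions you actually use. The database name "mydb" is made up.

import com.mongodb.Mongo;
import com.mongodb.WriteConcern;
import com.google.code.morphia.Datastore;
import com.google.code.morphia.Morphia;

public class WriteConcernSetup {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost");

        // Trade-off between speed and safety: NONE is fastest, FSYNC_SAFE the
        // safest; JOURNAL_SAFE waits until the write has reached the journal.
        mongo.setWriteConcern(WriteConcern.JOURNAL_SAFE);

        // Morphia uses this driver instance, so saves inherit the setting.
        Datastore ds = new Morphia().createDatastore(mongo, "mydb");
        // ds.save(entity);
    }
}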
2) Your whole data set is mapped to memory (that's why the 32-bit edition has a size limit of 2 GB), but it is only actually loaded when required. MongoDB leaves that to the operating system by using mmap. So as long as there is RAM available, MongoDB will happily load all the data it needs into RAM to make queries very quick. If there is no more memory available, it's up to the operating system to swap out old pages. This has the nice effect that your data stays in memory even if you restart the MongoDB process; only if you restart the server itself must the data be fetched from disk again. I think the downside is that the database process would have a slightly better understanding than the operating system of what should be swapped out first.
I'm not using MongoDB on Windows and haven't seen that message on Mac or Linux (yet), but the operating system should handle this for you (and automatically swap out pieces of information as required). Have you tried setting the driver to JOURNAL_SAFE (which should be a good compromise between data safety and speed)? With that setting, no data should be lost even if the MongoDB process dies.
3) In general MongoDB is built to use as much available memory as possible, but you might be able to restrict it with http://captaincodeman.com/2011/02/27/limit-mongodb-memory-use-windows/ - which I haven't tested, as we are using (virtual) Linux servers.
If you just want to release some of the memory MongoDB uses, after your data is processed and mongod is idle, you can run this command:
use admin
db.runCommand({closeAllDatabases: 1})
Then you will see the mapped, vsize, and res values output by mongostat go down a lot.
I have tried it, and it works. Hope it helps, ^_^
In an application I'm working on, I need a write-behind data log. That is, the application accumulates data in memory, and can hold all the data in memory. It must, however, persist, tolerate reasonable faults, and allow for backup.
Obviously, I could write to a SQL database; Derby springs to mind for easy embedding. But I'm not tremendously fond of dealing with a SQL API (JDBC, however lipsticked), and I don't need any queries, indices, or other decoration. The records go out, and on restart I need to read them all back.
Are there any other suitable alternatives?
Try using just a simple log file.
As data comes in, store it in memory and write (append) it to a file. A write() followed by fsync() will guarantee (on most systems; read your system and filesystem docs carefully) that the data is written to persistent storage (disk). These are the same mechanisms any database engine would use to get data into persistent storage.
On restart, reload the log. Occasionally, trim the front of the log file so data usage doesn't grow infinitely. Or, model the log file as a circular buffer the same size as what you can hold in memory.
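A minimal Java sketch of that append-then-fsync idea, assuming one record per call and naive newline framing; a real implementation would add proper record framing, trimming/rotation, and error handling.

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteBehindLog implements AutoCloseable {
    private final FileOutputStream out;
    private final FileDescriptor fd;

    public WriteBehindLog(String path) throws IOException {
        this.out = new FileOutputStream(path, /* append = */ true);
        this.fd = out.getFD();
    }

    public synchronized void append(byte[] record) throws IOException {
        out.write(record);
        out.write('\n');   // naive framing, for illustration only
        fd.sync();         // the fsync() step: force the bytes to stable storage
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}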
Have you looked at (now Oracle) Berkeley DB for Java? The "Direct Persistence Layer" is actually quite simple to use. Docs here for DPL.
It has different options for backups and comes with a few utilities. Runs embedded.
(Licensing: a form of the BSD License, I believe.)
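For a feel of the Direct Persistence Layer, here is a hedged sketch; the entity class, store name, and environment directory ("log-env", which must already exist) are made up for the example.

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.persist.EntityStore;
import com.sleepycat.persist.PrimaryIndex;
import com.sleepycat.persist.StoreConfig;
import com.sleepycat.persist.model.Entity;
import com.sleepycat.persist.model.PrimaryKey;

import java.io.File;

@Entity
class LogRecord {
    @PrimaryKey(sequence = "RecordSeq")   // key assigned automatically from a sequence
    long id;
    byte[] payload;
}

public class DplExample {
    public static void main(String[] args) {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        StoreConfig storeCfg = new StoreConfig();
        storeCfg.setAllowCreate(true);

        Environment env = new Environment(new File("log-env"), envCfg);
        EntityStore store = new EntityStore(env, "LogStore", storeCfg);
        PrimaryIndex<Long, LogRecord> byId =
                store.getPrimaryIndex(Long.class, LogRecord.class);

        LogRecord rec = new LogRecord();
        rec.payload = "some data".getBytes();
        byId.put(rec);                        // persisted; can be iterated back on restart

        store.close();
        env.close();
    }
}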