I have a question about the faults metric in mongostat.
I'm running MongoDB 2.0 on Ubuntu, with 2 disks (32 GB each) in a RAID-0 configuration.
The test is to load 5 million user profiles into MongoDB.
I'm doing the load in a single thread, using bulk inserts of 1,000 entries each.
When I set up MongoDB for the first time and load the profiles into it, I see many faults in mongostat (2, 5, and even 15) during the load.
Then I run the load again: first I drop the old collection, and then rerun the load.
On these subsequent runs, faults is 0 almost all the time.
Why is that?
MongoDB delegates memory management to the OS via the memory-mapped files mechanism. Basically, this mechanism allows a program to open files much larger than the amount of installed RAM. When the program tries to access a portion of such a file, the OS checks whether that portion (a page) is already in RAM. If it is not, a page fault happens and that page is loaded from disk. The faults/s metric in mongostat shows exactly this: how many page faults are occurring per second.
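To make this concrete, here is a minimal, hypothetical Java sketch of the same mechanism using java.nio memory mapping (the file path and the 64 MB offset are placeholders, and the file must be at least that large for the read to work):
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        // Map a (possibly huge) data file into virtual memory without reading it up front.
        RandomAccessFile raf = new RandomAccessFile("/data/db/mydb.0", "r"); // placeholder path
        FileChannel channel = raf.getChannel();
        long length = Math.min(channel.size(), Integer.MAX_VALUE); // a single mapping is capped at 2 GB
        MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, length);

        // Touching a byte on a page that is not resident in RAM triggers a page fault;
        // the OS transparently loads that page from disk. This is what mongostat
        // counts in its faults column.
        byte b = map.get(64 * 1024 * 1024); // read at a 64 MB offset (assumes the file is that big)
        System.out.println("byte at 64 MB offset: " + b);

        channel.close();
        raf.close();
    }
}
If the same offset is read again while the page is still resident, no fault occurs, which is exactly the behaviour you see on your second load.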
Now, when you start mongod and load data into it, the data files are not yet mapped into memory and have to be loaded from disk (page faults). When you drop a collection, it is deleted only logically; the corresponding physical files are not deleted and will be reused. Since they are already in RAM, there are no page faults.
If you drop the database instead, its files are deleted with it, so you should see page faults the next time.
I ran into a problem with Solr going OutOfMemory. The situation is as follows. We had 2 Amazon EC2 small instances (3.5G), each running a Spring/BlazeDS backend in Tomcat 6 (behind a load balancer). Each instance has its own local Solr instance. The index size on disk is about 500M. The JVM settings had been unchanged for months (-Xms512m, -Xmx768m). We use Solr to find people based on properties they entered in their profile and documents they uploaded. We're not using the Solr update handler, only select. Updates are done using delta imports: the Spring app in each Tomcat instance has a job that triggers the /dataimport?command=delta-import handler every 30 seconds.
This worked well for months, even for over a year if I'm correct (I haven't been on the project that long). CPU load was minimal, with only occasional peaks.
This past week we suddenly had OutOfMemory crashes of Solr on both machines. I reviewed my changes over the past few weeks, but none of them seemed related to Solr: bug fixes in the UI, something email-related, but again, nothing in the Solr schema or queries.
Today we changed the EC2 instances to m1.large (7.5G) and the Solr JVM settings to -Xms2048m / -Xmx3072m. This helped a bit; they run for 3 to 4 hours, but eventually they crash too.
Oh, and the dataset (number of rows, documents, entities, etc.) did not change significantly. There is constant growth, but it doesn't make sense to me that tripling the JVM memory still leads to crashes...
The question: do you have any directions to point me in?
Measure, don't guess. Instead of guessing what has changed and might be causing your problems, you would do better to attach a memory leak detection tool, e.g. Plumbr. Run your Solr instance with the tool attached and see whether it tells you the exact cause of the memory leak.
Take a look at your Solr cache settings. Reducing the size of the document cache helped us stabilize a Solr 3.6 server that was also experiencing OutOfMemory errors. The query result cache size may also be relevant in your case; it was not in mine.
You can see your Solr cache usage on the admin page for your core:
http://localhost:8983/solr/core0/admin/stats.jsp#cache
(Replace core0 with the name of your Solr core)
documentCache
https://wiki.apache.org/solr/SolrCaching#documentCache
queryResultCache
https://wiki.apache.org/solr/SolrCaching#queryResultCache
I'm trying to migrate a MySQL table to MongoDB. My table has 6 million entries. I'm using Java with Morphia. By the time I've saved about 1.2 million of them, almost all of my memory is consumed.
I've read that MongoDB stores the data in memory and writes it to disk afterwards. Is it possible to send something like a commit to free some amount of memory?
1) In terms of durability, you can tell the MongoDB Java driver (which Morphia uses) which strategy to use; see https://github.com/mongodb/mongo-java-driver/blob/master/src/main/com/mongodb/WriteConcern.java#L53. It's simply a trade-off between speed and safety: from NONE (not even connectivity issues will cause an error) up to FSYNC_SAFE (the data is definitely written to disk). A small sketch follows after point 3 below.
For the internal details check out http://www.kchodorow.com/blog/2012/10/04/how-mongodbs-journaling-works/
2) Your whole data set is mapped to memory (that's why the 32-bit edition has a size limit of 2 GB); however, it is only actually loaded when required. MongoDB leaves that to the operating system by using mmap. So as long as there is more RAM available, MongoDB will happily load all the data it needs into RAM to make queries very quick. If there is no more memory available, it's up to the operating system to page out old data. This has the nice effect that your data will be kept in memory even if you restart the MongoDB process; only if you restart the server itself must the data be fetched from disk again. I think the downside is that the database process might have a slightly better understanding than the operating system of what should be swapped out first.
I'm not using MongoDB on Windows and haven't seen that message on Mac or Linux (yet), but the operating system should handle this for you (and automatically swap out pieces of information as required). Have you tried setting the driver to JOURNAL_SAFE (it should be a good compromise between data safety and speed)? With that setting, no data should be lost even if the MongoDB process dies.
3) In general MongoDB is built to use as much available memory as possible, but you might be able to restrict it with http://captaincodeman.com/2011/02/27/limit-mongodb-memory-use-windows/ - which I haven't tested, as we are using (virtual) Linux servers.
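For point 1, here is a minimal, hypothetical sketch of setting the write concern with the legacy 2.x Java driver and the old com.google.code.morphia packages; the UserProfile entity and the database name are made up, and exact class and package names vary between driver and Morphia versions:
import com.google.code.morphia.Datastore;
import com.google.code.morphia.Morphia;
import com.google.code.morphia.annotations.Entity;
import com.google.code.morphia.annotations.Id;
import com.mongodb.Mongo;
import com.mongodb.WriteConcern;
import org.bson.types.ObjectId;

public class MigrationExample {

    @Entity("profiles")                      // hypothetical collection name
    public static class UserProfile {
        @Id private ObjectId id;
        private String name;
        public UserProfile() {}
        public UserProfile(String name) { this.name = name; }
    }

    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);
        // Trade speed for durability: JOURNAL_SAFE waits for the journal commit,
        // FSYNC_SAFE waits for an fsync of the data files.
        mongo.setWriteConcern(WriteConcern.JOURNAL_SAFE);

        Morphia morphia = new Morphia();
        morphia.map(UserProfile.class);
        Datastore ds = morphia.createDatastore(mongo, "migration"); // hypothetical db name

        ds.save(new UserProfile("example"));  // inherits the write concern set above
        mongo.close();
    }
}
Note that the write concern only affects how writes are acknowledged and made durable; it does not flush MongoDB's memory-mapped data out of RAM.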
If you just want to release some of the memory MongoDB uses, you can run the following command after your data is processed and mongod is idle:
use admin
db.runCommand({closeAllDatabases: 1})
Then you will see the mapped, vsize, and res values reported by mongostat go down a lot.
I have tried it, and it works. Hope it helps! ^_^
Hi, I am working on a spelling corrector project in natural language processing, and I am supposed to read data from a file whose size is 6.2 MB (and could be as large as 1 GB). While it works fine, the problem I am facing is that every time I run the Java program I have to load the data into memory, and it takes the same amount of time on every run.
Is there any way this data can be cached in memory in Java? Can anyone suggest a workaround?
Basically, what I want to know is: what is the procedure for storing the content of a large file in memory so that I don't have to read it again? Let's say the file is on the order of gigabytes.
6.2 MB of data will probably be held in your operating system's file cache, as it is a relatively small amount of data, and therefore shouldn't take much time at all to load. You should investigate whether it is the parsing of this data that is taking a long time, and maybe cache the parsed data to a binary file for quick loading (a sketch follows).
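As a concrete illustration of caching the parsed data to a binary file, here is a minimal, hypothetical sketch using plain Java serialization; the SpellingModel class, the parsing logic, and the file paths are placeholders rather than anything from the original question:
import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class ParsedDataCache {

    // Hypothetical container for the parsed dictionary / word counts.
    public static class SpellingModel implements Serializable {
        private static final long serialVersionUID = 1L;
        public final Map<String, Integer> wordCounts = new HashMap<String, Integer>();
    }

    public static SpellingModel load(File rawText, File cacheFile) throws Exception {
        if (cacheFile.exists()) {
            // Fast path: deserialize the already-parsed model.
            ObjectInputStream in = new ObjectInputStream(
                    new BufferedInputStream(new FileInputStream(cacheFile)));
            try {
                return (SpellingModel) in.readObject();
            } finally {
                in.close();
            }
        }
        SpellingModel model = parse(rawText);   // slow path: parse the raw text once
        ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(cacheFile)));
        try {
            out.writeObject(model);             // cache the parsed result for the next run
        } finally {
            out.close();
        }
        return model;
    }

    private static SpellingModel parse(File rawText) throws IOException {
        SpellingModel model = new SpellingModel();
        BufferedReader reader = new BufferedReader(new FileReader(rawText));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (word.isEmpty()) continue;
                    Integer count = model.wordCounts.get(word);
                    model.wordCounts.put(word, count == null ? 1 : count + 1);
                }
            }
        } finally {
            reader.close();
        }
        return model;
    }
}
Whether this pays off depends on how expensive your parsing is compared to deserializing the binary cache.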
6.2 MB isn't very big, and unless this is taking a long time and you can't use a background thread to load the file, I wouldn't worry about it.
You can use memory-mapped files, but these are not as simple to work with. Memory-mapped files are useful if you have between 1 GB and 1 TB of data.
I see here that loading/parsing the data from the file and creating the cache is causing you some time delay, and you want to avoid doing this every time.
In this case, I would suggest you use Ehcache.
Ehcache (which is of course open source and Apache licensed) will maintain the cache for you, protect your application from out-of-memory errors, and can also persist the state of the cache to disk.
So, on the next start of your application, you can configure it to boot directly from the Ehcache data file; this way you avoid parsing your file again and again.
You can still load whatever cache you are using into memory; the only difference is that you load it through the Ehcache APIs.
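A minimal sketch of that idea, assuming Ehcache 2.x (the net.sf.ehcache API); the cache name, key, and disk store path are made up for illustration, and the persistence API is different in Ehcache 3:
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;
import net.sf.ehcache.config.CacheConfiguration;
import net.sf.ehcache.config.Configuration;
import net.sf.ehcache.config.DiskStoreConfiguration;

public class EhcacheDemo {
    public static void main(String[] args) {
        // Configure a cache that overflows to disk and survives JVM restarts.
        Configuration config = new Configuration();
        config.addDiskStore(new DiskStoreConfiguration().path("/tmp/spellcheck-cache")); // placeholder path
        config.addCache(new CacheConfiguration("parsedData", 10000)
                .eternal(true)
                .overflowToDisk(true)
                .diskPersistent(true));

        CacheManager manager = new CacheManager(config);
        Cache cache = manager.getCache("parsedData");

        if (cache.get("model") == null) {
            // First run: parse the raw file (omitted) and cache the result.
            cache.put(new Element("model", "parsed-model-goes-here"));
        } else {
            // Subsequent runs: reuse the persisted cache instead of re-parsing.
            Object model = cache.get("model").getObjectValue();
            System.out.println("Loaded from cache: " + model);
        }

        // shutdown() flushes the disk-persistent cache so it is available on the next start.
        manager.shutdown();
    }
}
Cached values must be Serializable for the disk store to work, which fits the "parsed model" use case here.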
If you intend to code/debug your program and reloading the resources after every change you make takes too much time, then consider JRebel Social (if this is a non-commercial project) or JRebel (if it is commercial). It allows you to fix bugs in your code or make changes without restarting your VM, so you retain the loaded data (e.g., stored in a static variable) without using any cache or even having to restart your VM. See my previous question: Loading Resources Once in Java. But if this is for production, and your intent is to save memory rather than load time (which in most cases is a problem only during startup), then Ehcache or other caching libraries should be enough.
I have the following problem:
I have a web application that stores data in a database. I would like clients to be able to extract the data of, e.g., 2 tables to a file (local to the client).
The database could be arbitrarily big (meaning I have no idea how much data could potentially be in the database; it could be huge).
What is the best approach for this?
Should all the data be SELECTed out of the tables and returned to the client as a single structure to be stored in a file?
Or should the data be retrieved in parts, e.g. the first 100 entries, then the next 100, and so on, with the single structure assembled on the client?
Are there any pros and cons to consider here?
I've built something similar - there are some really awkward problems here, especially as the file size can grow beyond what you can comfortably handle in a browser. As the amount of data grows, the time to generate the file increases; this in turn is not what a web application is good at, so you run the risk of your web server becoming unhappy with even a smallish number of visitors all requesting a large file.
What we did was split the application into 3 parts.
The "file request" part was a simple web page on which authenticated users can request their file. This kicks off the second part outside the context of the web page request:
File generator.
In our case, this was a Windows service which watched a database table of file requests, picked up the latest one, ran the appropriate SQL query, wrote the output to a CSV file, and ZIPped that file before moving it to the output directory and mailing the user a link. It set the state of the record in the database to make sure only one process handled a request at any one point in time (a sketch of that claim-by-status idea follows below).
FTP/WebDAV site:
The ZIP files were written to a folder which was accessible via FTP and WebDAV - these protocols tend to do better with huge files than a standard HTTP download.
This worked pretty well - users didn't like to wait for their files, but the delay was rarely more than a few minutes.
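For illustration, here is a rough, hypothetical Java sketch of the claim-by-status part of such a file generator (the original was a Windows service; the file_request table and its columns are invented for this example):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FileRequestWorker {

    // Claims the oldest pending request so that only one worker processes it.
    // Returns the claimed request id, or -1 if there is nothing to do.
    public static long claimNextRequest(Connection conn) throws Exception {
        long requestId = -1;
        PreparedStatement find = conn.prepareStatement(
                "SELECT id FROM file_request WHERE status = 'PENDING' " +
                "ORDER BY created_at LIMIT 1");
        ResultSet rs = find.executeQuery();
        try {
            if (rs.next()) {
                requestId = rs.getLong("id");
            }
        } finally {
            rs.close();
            find.close();
        }
        if (requestId == -1) {
            return -1; // no pending requests
        }

        // The conditional UPDATE acts as the "lock": only one worker can flip the
        // status from PENDING to PROCESSING, so only one wins the request.
        PreparedStatement claim = conn.prepareStatement(
                "UPDATE file_request SET status = 'PROCESSING' " +
                "WHERE id = ? AND status = 'PENDING'");
        try {
            claim.setLong(1, requestId);
            return claim.executeUpdate() == 1 ? requestId : -1;
        } finally {
            claim.close();
        }
    }
}
The actual export (running the SQL, writing the CSV, zipping it, mailing the link) is omitted here; it runs only after a successful claim.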
We have a similar use case with an Oracle cluster containing approx. 40 GB of data. The solution that works best for us is to fetch the maximum amount of data per select statement, as this reduces DB overhead significantly.
That being said, there are three optimizations which worked very well for us:
1.) We partition the data into 10 roughly equal-sized sets and select them from the database in parallel. For our cluster we found that 8 parallel connections work approx. 8 times faster than a single connection. There is some additional speedup up to 12 connections, but that depends on your database and your DBA.
2.) Stay away from Hibernate or other ORMs and use custom JDBC code once you are talking about large amounts of data. Use all the optimizations you can get there (e.g. ResultSet.setFetchSize()); a sketch follows below.
3.) Our data compresses very well, and piping it through a gzipper saves a lot of I/O time. In our case it took I/O off the critical path. By the way, this is also true for storing the data in a file.
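To illustrate points 2 and 3, here is a minimal, hypothetical JDBC sketch that streams a large result set with setFetchSize() and writes it through a GZIPOutputStream; the connection URL, query, and output file name are placeholders, and the appropriate JDBC driver must be on the classpath:
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.zip.GZIPOutputStream;

public class GzipExport {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password"); // placeholder URL

        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        // Fetch rows in large batches instead of one round-trip per row.
        // (For MySQL Connector/J, true streaming needs setFetchSize(Integer.MIN_VALUE).)
        stmt.setFetchSize(1000);

        ResultSet rs = stmt.executeQuery("SELECT id, name, email FROM users"); // placeholder query

        // Compress on the fly so disk I/O stays off the critical path.
        Writer out = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream("export.csv.gz")), "UTF-8");
        try {
            while (rs.next()) {
                out.write(rs.getLong(1) + "," + rs.getString(2) + "," + rs.getString(3) + "\n");
            }
        } finally {
            out.close();   // closes the gzip stream and flushes the trailer
            rs.close();
            stmt.close();
            conn.close();
        }
    }
}
The same pattern can be run over several partitions of the data in parallel threads, each with its own connection, as described in point 1.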
I have a problem with my HTML scraper. The scraper is a multithreaded application written in Java using HtmlUnit; by default it runs with 128 threads. Briefly, it works as follows: it takes a site URL from a big text file, pings the URL, and if it is accessible, parses the site, finds specific HTML blocks, saves all URL and block info (including the HTML code) into the corresponding tables in the database, and moves on to the next site. The database is MySQL 5.1; there are 4 InnoDB tables and 4 views. The tables have numeric indexes on the fields used for joins. I also have a web interface for browsing and searching the parsed data (for searching I use Sphinx with delta indexes), written in CodeIgniter.
Server configuration:
CPU: Type Xeon Quad Core X3440 2.53GHz
RAM: 4 GB
HDD: 1TB SATA
OS: Ubuntu Server 10.04
Some mysql config:
key_buffer = 256M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 128
max_connections = 400
table_cache = 64
query_cache_limit = 2M
query_cache_size = 128M
The Java VM runs with default parameters except for the following options: -Xms1024m -Xmx1536m -XX:-UseGCOverheadLimit -XX:NewSize=500m -XX:MaxNewSize=500m -XX:SurvivorRatio=6 -XX:PermSize=128M -XX:MaxPermSize=128m -XX:ErrorFile=/var/log/java/hs_err_pid_%p.log
When the database was empty, the scraper processed 18 URLs per second and was stable enough. But after 2 weeks, with the urls table containing 384929 records (~25% of all processed URLs) and taking 8.2 GB, the Java application began to work very slowly and crashes every 1-2 minutes. I suspect the reason is MySQL, which cannot handle the growing load (the parser performs 2 + 4*BLOCK_NUMBER queries for every processed URL; Sphinx updates its delta indexes every 10 minutes; I don't count the web interface, because it's used by only one person). Maybe it rebuilds the indexes very slowly? But the MySQL and scraper logs (which also contain all uncaught exceptions) are empty. What do you think?
I'd recommend running the following just to check a few status things; posting that output here would help as well:
dmesg
top (check the resident vs. virtual memory per process)
So the application becomes non-responsive? (That's not the same as a crash at all.) I would check that all your resources are free, e.g. do a jstack to check whether any threads are tied up.
Check in MySQL that you have the expected number of connections. If you continuously create connections in Java and don't clean them up, the database will run slower and slower.
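As an illustration of the connection-hygiene point, here is a minimal sketch of the close-in-finally pattern; the query and connection details are placeholders, and with Java 7+ a try-with-resources block does the same thing more concisely:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ScraperDao {
    // Every code path returns the connection's resources, so the number of
    // open MySQL connections stays bounded even with 128 scraper threads.
    public static int countUrls(String jdbcUrl, String user, String password) throws Exception {
        Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
        try {
            PreparedStatement ps = conn.prepareStatement("SELECT COUNT(*) FROM urls"); // placeholder query
            try {
                ResultSet rs = ps.executeQuery();
                try {
                    rs.next();
                    return rs.getInt(1);
                } finally {
                    rs.close();
                }
            } finally {
                ps.close();
            }
        } finally {
            conn.close(); // without this, connections pile up until MySQL hits max_connections
        }
    }
}
In practice a connection pool (e.g. c3p0 or DBCP) plus this discipline keeps the connection count in check.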
Thank you all for your advice; MySQL was indeed the cause of the problem. By enabling the slow query log in my.cnf I saw that one of the queries, which runs on every iteration, was taking 300 s (one field used for searching was not indexed).