How could I load Neo4j into memory on demand?
At different stages of my long-running jobs I persist nodes and relationships to Neo4j. So Neo4j has to stay on disk, since the data may consume too much memory and I don't know in advance when I will run read queries against it.
But at some point (only once) I will want to run a pretty heavy read query against my Neo4j server, and it currently performs very poorly (hours). As a workaround I want to load all of Neo4j into RAM for better performance.
What is the best option for this? Should I use a RAM disk, or are there better solutions?
P.S.
The query with [r:LINK_REL_1*2] works pretty fast, [r:LINK_REL_1*3] takes 17 seconds, and [r:LINK_REL_1*4] takes more than 5 minutes (I don't know exactly how long, since I have a 5-minute timeout). But I need the [r:LINK_REL_1*2..4] query to finish in reasonable time.
My heavy query (with PROFILE):
PROFILE
MATCH path = (start:COLUMN)-[r:LINK_REL_1*2]->(col:COLUMN)
WHERE start.ENTITY_ID = '385'
WITH path UNWIND NODES(path) AS col
WITH path,
COLLECT(DISTINCT col.DATABASE_ID) as distinctDBs
WHERE LENGTH(path) + 1 = SIZE(distinctDBs)
RETURN path
Updated query (it got the same performance in my tests):
PROFILE
MATCH (start:COLUMN)
WHERE start.ENTITY_ID = '385'
MATCH path = (start)-[r:LINK_REL_1*2]->(col:COLUMN)
WITH path, REDUCE(dbs = [], col IN NODES(path) |
CASE WHEN col.DATABASE_ID in dbs
THEN dbs
ELSE dbs + col.DATABASE_ID END) as distinctDbs
WHERE LENGTH(path) + 1 = SIZE(distinctDbs)
RETURN path
The APOC procedures library has apoc.warmup.run() (invoked as CALL apoc.warmup.run()), which may pull much of the Neo4j store into the page cache. See if that makes a difference.
It looks like you're trying to create a query in which the path contains only :PERSON nodes from distinct countries. Is that right?
If so, I think we can find a better query that can do this without hanging.
First, let's go for low-hanging fruit and see if avoiding the UNWIND can make a difference.
PROFILE or EXPLAIN the query and see if any numbers look significantly different compared to the original query.
MATCH (start:PERSON)
WHERE start.ID = '385'
MATCH path = (start)-[r:FRIENDSHIP_REL*2..5]->(person:PERSON)
WITH path, REDUCE(countries = [], person IN NODES(path) |
CASE WHEN person.COUNTRY_ID in countries
THEN countries
ELSE countries + person.COUNTRY_ID END) as distinctCountries
WHERE LENGTH(path) + 1 = SIZE(distinctCountries)
RETURN path
Related
I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make this faster?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not), would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date: 1});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you really need the 40,000th page? Also see this article.
I found it performant to combine the two concepts together (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 with 16 records per page (in JavaScript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id;
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still shows a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined two of the answers.
The problem is that when you use skip and limit without a sort, the results are simply paginated in the order the documents were written to the collection, so the engine first has to build a temporary ordering. It is better to use the ready-made _id index: sort by _id and it is very fast even with large collections, for example:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it will be
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
'sort' => array('_id' => 1),
'limit' => $limit,
'skip' => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options );
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach: combine skip/limit (really as an edge case) with sorted, range-based buckets, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that each cover a range of time, and you have sub-pages within that range if you still need skip/limit, but I suspect the buckets can be made small enough not to need skip/limit at all. By using the sort index this avoids the cursor traversing the entire inventory to reach the final page.
My collection has around 1.3M documents (not that big), properly indexed, but still takes a big performance hit by the issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no way around it: if you use skip, you are bound to hit the issue once your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.
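To make the jump concrete, here is a rough sketch using the MongoDB Java driver; the indexed seq counter field, the database/collection names, and the connection string are illustrative assumptions, not details from the original answer:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class CounterPagination {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> col = client.getDatabase("mydb").getCollection("myCollection");

            int page = 40000;              // an arbitrary page; no cursor walking needed
            int pageSize = 5;
            int start = page * pageSize;   // first counter value on the requested page

            // Jump straight to the page via the indexed "seq" counter instead of skip().
            for (Document d : col.find(Filters.and(
                            Filters.gte("seq", start),
                            Filters.lt("seq", start + pageSize)))
                    .sort(Sorts.ascending("seq"))) {
                System.out.println(d.toJson());
            }
        }
    }
}

Because seq is an indexed, monotonically increasing counter, the range filter lands directly on the requested page without walking the earlier documents.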
On the other hand, assigning a counting number to each document hurts write performance, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Deletions also make this harder, but it is still possible, because MongoDB supports $inc with a negative value for batch updates. Luckily I don't have to deal with deletion in the app I am maintaining.
I'm writing this down as a note to my future self. It is probably too much hassle to fix this issue in the application I'm currently dealing with, but next time I'll build it better if I encounter a similar situation.
If you are using Mongo's default _id, which is an ObjectId, use that instead. This is probably the most viable option for most projects anyway.
As stated in the official MongoDB docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
let endValue = null;
db.students.find( { _id: { $lt: startValue } } )
.sort( { _id: -1 } )
.limit( nPerPage )
.forEach( student => {
print( student.name );
endValue = student._id;
} );
return endValue;
}
Ascending order example here.
If you know the ID of the element from which you want to start, you can do:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a clever little solution that works like a charm.
For faster pagination, don't use the skip() function. Use limit() and find() where you query on the last _id of the previous page.
Here is an example where I'm querying over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // this is the last id of the previous page
for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
            Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
            Aggregation.sort(Sort.Direction.ASC, "_id"),
            new CustomAggregationOperation(queryOffersByProduct),
            Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS = mongockTemplate
            .aggregate(aggregation, "product", ProductGroupedOfferDTO.class)
            .getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}
I have a query with a result set of half a million records; for each record I create an object and add it to an ArrayList.
How can I optimize this operation to avoid memory issues? I'm getting an out-of-heap-space error.
This is a fragment of the code:
while (rs.next()) {
lista.add(sd.loadSabanaDatos_ResumenLlamadaIntervalo(rs));
}
public SabanaDatos loadSabanaDatos_ResumenLlamadaIntervalo(ResultSet rs)
{
SabanaDatos sabanaDatos = new SabanaDatos();
try {
sabanaDatos.setId(rs.getInt("id"));
sabanaDatos.setHora(rs.getString("hora"));
sabanaDatos.setDuracion(rs.getInt("duracion"));
sabanaDatos.setNavegautenticado(rs.getInt("navegautenticado"));
sabanaDatos.setIndicadorasesor(rs.getInt("indicadorasesor"));
sabanaDatos.setLlamadaexitosa(rs.getInt("llamadaexitosa"));
sabanaDatos.setLlamadanoexitosa(rs.getInt("llamadanoexitosa"));
sabanaDatos.setTipocliente(rs.getString("tipocliente"));
} catch (SQLException e) {
logger.info("dip.sabana.SabanaDatos SQLException : "+ e);
e.printStackTrace();
}
return sabanaDatos;
}
NOTE: The reason for using a list is that this is a critical system, and I can only make one call to the database every 2 hours. I don't have permission to query the database more often, but I need to show data every 10 minutes. Example: the first query returns 10 rows, and I show one row each minute after the SQL query.
I don't have permission to create a local database, write files, or anything else... just access to memory.
First of all, it is not good practice to read half a million objects into memory at once.
Consider breaking the number of records to be read into smaller chunks.
As a solution you can consider the following options (a short sketch of both follows this list):
1 - Use a CachedRowSet (e.g. CachedRowSetImpl). It behaves like a ResultSet but is disconnected, so you avoid the bad practice of keeping a ResultSet (and therefore a database connection) open. If you use an ArrayList instead, you are again performing extra operations and using extra memory.
For more info on CachedRowSet you can go to
https://docs.oracle.com/javase/tutorial/jdbc/basics/cachedrowset.html
2 - You can consider using an in-memory database, such as HSQLDB or H2. They are very lightweight and fast, provide a JDBC interface, and let you run SQL queries against the cached data.
For HSQLDB implementation you can check
https://www.tutorialspoint.com/hsqldb/
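A rough sketch of both options; it assumes the standard javax.sql.rowset classes for option 1 and the H2 driver on the classpath for option 2, and the class name and JDBC URL here are illustrative rather than taken from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import javax.sql.rowset.CachedRowSet;
import javax.sql.rowset.RowSetProvider;

public class ResultSetOffloading {

    // Option 1: copy the rows into a disconnected CachedRowSet and close the live ResultSet.
    static CachedRowSet toCachedRowSet(ResultSet rs) throws Exception {
        CachedRowSet crs = RowSetProvider.newFactory().createCachedRowSet();
        crs.populate(rs); // rows are copied; the original connection can now be released
        return crs;
    }

    // Option 2: open an in-memory H2 database and copy the data into it with plain SQL.
    static Connection openInMemoryDatabase() throws Exception {
        // DB_CLOSE_DELAY=-1 keeps the in-memory database alive as long as the JVM runs.
        return DriverManager.getConnection("jdbc:h2:mem:cache;DB_CLOSE_DELAY=-1");
    }
}

Note that a CachedRowSet still keeps all copied rows in memory, so for half a million rows the in-memory database (or chunked reads) is the more realistic option.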
It might help to intern the strings, so that two occurrences of the same value share a single object.
import java.util.HashMap;
import java.util.Map;

public class StringCache {
    private Map<String, String> identityMap = new HashMap<>();

    public String cached(String s) {
        if (s == null) {
            return null;
        }
        String t = identityMap.get(s);
        if (t == null) {
            t = s;
            identityMap.put(t, t);
        }
        return t;
    }
}
StringCache horaMap = new StringCache();
StringCache tipoclienteMap = new StringCache();
sabanaDatos.setHora(horaMap.cached(rs.getString("hora")));
sabanaDatos.setTipocliente(tipoclienteMap.cached(rs.getString("tipocliente")));
Increasing the memory has already been mentioned.
A speed-up is possible by using column numbers instead of column names; if needed, the numbers can be looked up from the names once before the loop (via rs.getMetaData() or rs.findColumn()).
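A possible sketch of that speed-up, reusing rs, lista, and SabanaDatos from the question (the loop body is abbreviated):

// Resolve the column indexes once, outside the loop.
int idIdx       = rs.findColumn("id");
int horaIdx     = rs.findColumn("hora");
int duracionIdx = rs.findColumn("duracion");

while (rs.next()) {
    SabanaDatos sabanaDatos = new SabanaDatos();
    sabanaDatos.setId(rs.getInt(idIdx));           // read by index instead of by name
    sabanaDatos.setHora(rs.getString(horaIdx));
    sabanaDatos.setDuracion(rs.getInt(duracionIdx));
    // ... remaining columns handled the same way
    lista.add(sabanaDatos);
}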
Option 1:
If you need all the items in the list at the same time, you need to increase the heap space of the JVM by adding the argument -Xmx2G, for example, when you launch the app (java -Xmx2G -jar yourApp.jar).
Option 2:
Divide the SQL into more than one call.
Some of your options:
Use a local database, such as SQLite. That's a very lightweight database management system which is easy to install – you don't need any special privileges to do so – its data is held in a single file in a directory of your choice (such as the directory that holds your Java application) and can be used as an alternative to a large Java data structure such as a List.
If you really must use an ArrayList, make sure you take up as little space as possible. Try the following:
a. If you know the approximate number of rows, then construct your ArrayList with an appropriate initialCapacity to avoid reallocations. Estimate the maximum number of rows your database will grow to, and add another few hundred to your initialCapacity just in case.
b. Make sure your SabanaDatos objects are as small as they can be. For example, make sure the id field is an int and not an Integer. If the hora field is just a time of day, it can be held more efficiently in a short than in a String. Similarly for other fields, e.g. duracion - perhaps it can even fit into a byte, if its range allows it to? If you have several flag/Boolean fields, they can be packed into a single byte or short as bits. If you have String fields with a lot of repetition, you can intern them as per Joop's suggestion. (A compact sketch follows this list.)
c. If you still get out-of-memory errors, increase your heap space using the JVM flags -Xms and -Xmx.
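As an illustration of point 2b, one possible compact layout; the field widths and bit assignments are assumptions about the data, not taken from the original schema:

// Hypothetical compact row object: primitives only, time of day in minutes, flags packed into one byte.
public class SabanaDatosCompact {
    private int id;
    private short horaMinutes;   // e.g. "14:30" stored as 14 * 60 + 30
    private short duracion;      // assumes the duration always fits in a short
    private byte flags;          // bit 0: navegautenticado, bit 1: indicadorasesor,
                                 // bit 2: llamadaexitosa,   bit 3: llamadanoexitosa
    private String tipocliente;  // intern/cache repeated values as suggested above

    public void setFlag(int bit, boolean value) {
        if (value) {
            flags |= (1 << bit);
        } else {
            flags &= ~(1 << bit);
        }
    }

    public boolean getFlag(int bit) {
        return (flags & (1 << bit)) != 0;
    }
}

Packed like this, each row carries a handful of primitives instead of several boxed objects and strings, which adds up over half a million rows.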
I'm trying to use H2O via R to build multiple models using subsets of one largish data set (~10 GB). The data is one year's worth, and I'm trying to build 51 models (i.e. train on week 1, predict on week 2, etc.), with each week being about 1.5-2.5 million rows with 8 variables.
I've done this inside a loop, which I know is not always the best way in R. One other issue I found was that the H2O entity would accumulate prior objects, so I created a function to remove all of them except the main data set.
h2o.clean <- function(clust = localH2O, verbose = TRUE, vte = c()){
  # Find all objects on server
  keysToKill <- h2o.ls(clust)$Key
  # Remove items to be excluded, if any
  keysToKill <- setdiff(keysToKill, vte)
  # Loop thru and remove items to be removed
  for(i in keysToKill){
    h2o.rm(object = clust, keys = i)
    if(verbose == TRUE){
      print(i); flush.console()
    }
  }
  # Print remaining objects in cluster.
  h2o.ls(clust)
}
The script runs fine for a while and then crashes - often with a complaint about running out of memory and swapping to disk.
Here's some pseudo code to describe the process
# load h2o library
library(h2o)
# create h2o entity
localH2O = h2o.init(nthreads = 4, max_mem_size = "6g")
# load data
dat1.hex = h2o.importFile(localH2O, inFile, key = "dat1.hex")
# Start loop
for(i in 1:51){
  # create test/train hex objects
  train1.hex <- dat1.hex[dat1.hex$week_num == i,]
  test1.hex <- dat1.hex[dat1.hex$week_num == i + 1,]
  # train gbm
  dat1.gbm <- h2o.gbm(y = 'click_target2', x = xVars, data = train1.hex
                      , nfolds = 3
                      , importance = T
                      , distribution = 'bernoulli'
                      , n.trees = 100
                      , interaction.depth = 10
                      , shrinkage = 0.01
                      )
  # calculate out of sample performance
  test2.hex <- cbind.H2OParsedData(test1.hex, h2o.predict(dat1.gbm, test1.hex))
  colnames(test2.hex) <- names(head(test2.hex))
  gbmAuc <- h2o.performance(test2.hex$X1, test2.hex$click_target2)@model$auc
  # clean h2o entity
  h2o.clean(clust = localH2O, verbose = F, vte = c('dat1.hex'))
} # end loop
My question is: what, if any, is the correct way to manage data and memory in a standalone instance (this is NOT running on Hadoop or a cluster, just a large EC2 instance with ~64 GB RAM and 12 CPUs) for this type of process? Should I be killing and recreating the H2O entity after each loop iteration (this was the original process, but reading the data from file every time adds ~10 minutes per iteration)? Is there a proper way to garbage collect or release memory after each loop?
Any suggestions would be appreciated.
This answer is for the original H2O project (releases 2.x.y.z).
In the original H2O project, the H2O R package creates lots of temporary H2O objects in the H2O cluster DKV (Distributed Key/Value store) with a "Last.value" prefix.
These are visible both in the Store View from the Web UI and by calling h2o.ls() from R.
What I recommend doing is:
at the bottom of each loop iteration, use h2o.assign() to do a deep copy of anything you want to save to a known key name
use h2o.rm() to remove anything you don't want to keep, in particular the "Last.value" temps
call gc() explicitly in R somewhere in the loop
Here is a function which removes the Last.value temp objects for you. Pass in the H2O connection object as the argument:
removeLastValues <- function(conn) {
    df <- h2o.ls(conn)
    keys_to_remove <- grep("^Last\\.value\\.", perl = TRUE, x = df$Key, value = TRUE)
    unique_keys_to_remove = unique(keys_to_remove)
    if (length(unique_keys_to_remove) > 0) {
        h2o.rm(conn, unique_keys_to_remove)
    }
}
Here is a link to an R test in the H2O github repository that uses this technique and can run indefinitely without running out of memory:
https://github.com/h2oai/h2o/blob/master/R/tests/testdir_misc/runit_looping_slice_quantile.R
New suggestion as of 12/15/2015: update to the latest stable release (Tibshirani, 3.6.0.8 or later).
We've completely reworked how R & H2O handle internal temp variables, and the memory management is much smoother.
Next: H2O temps can be held "alive" by R dead variables... so run an R gc() every loop iteration. Once R's GC removes the dead variables, H2O will reclaim that memory.
After that, your cluster should only hold on to specifically named things, like loaded datasets, and models. These you'll need to delete roughly as fast as you make them, to avoid accumulating large data in the K/V store.
Please let us know if you have any more problems by posting to the h2ostream Google group:
https://groups.google.com/forum/#!forum/h2ostream
Cliff
The most current answer to this question is that you should probably just use the h2o.grid() function rather than writing a loop.
With the new H2O version (currently 3.24.0.3), they suggest the following recommendations:
my for loop {
    # perform loop
    rm(R object that isn’t needed anymore)
    rm(R object of h2o thing that isn’t needed anymore)
    # trigger removal of h2o back-end objects that got rm’d above, since the rm can be lazy.
    gc()
    # optional extra one to be paranoid. this is usually very fast.
    gc()
    # optionally sanity check that you see only what you expect to see here, and not more.
    h2o.ls()
    # tell back-end cluster nodes to do three back-to-back JVM full GCs.
    h2o:::.h2o.garbageCollect()
    h2o:::.h2o.garbageCollect()
    h2o:::.h2o.garbageCollect()
}
Here is the source: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html
I am working on some inherited code and I am not used to the entity framework (JPA). I'm trying to figure out why a previous programmer coded things the way they did, sometimes mixing and matching different ways of querying data.
Deal d = _em.find(Deal.class, dealid);
List<DealOptions> dos = d.getDealOptions();
for (DealOptions o : dos) {
    if ("100".equals(o.price)) {   // == would compare references, not the value
        // found the one item I wanted
    }
}
And then sometimes I see this:
Query q = _em.createQuery("select count(o.id) from DealOptions o where o.price = 100 and o.deal.dealid = :dealid");
//set parameters, get results, then check the result and do whatever
I understand what both pieces of code do, and I understand that given a large data set, the second way is more efficient. However, given only a few records, is there any reason not to use a query vs. just letting the entity manager do the join and looping over your result set?
Some reasons never to use the first approach, regardless of the number of records (a parameterized version of the second approach is sketched after this list):
It is more verbose
The intention is less clear, since there is more clutter
The performance is worse, probably starting to degrade with the very first entities
The performance of the first approach will degrade much more with each added entity than with the second approach
It is unexpected - most experienced developers would not do it - so it needs more cognitive effort for other developers to understand. They would assume you were doing it for a compelling reason and would look for that reason without finding one.
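For reference, a sketch of the second approach written as a typed, parameterized query; the entity and property names are copied from the question, and whether price is numeric or a string is an assumption you would need to confirm:

import javax.persistence.TypedQuery;

// Count the matching options in the database instead of loading and looping over them.
TypedQuery<Long> q = _em.createQuery(
        "select count(o.id) from DealOptions o"
        + " where o.price = :price and o.deal.dealid = :dealid", Long.class);
q.setParameter("price", 100);        // use "100" instead if price is mapped as a String
q.setParameter("dealid", dealid);
long matches = q.getSingleResult();

Binding both values as parameters also avoids hard-coding the price literal into the JPQL string.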
I get this exception:
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
at java.lang.StringBuilder.append(StringBuilder.java:119)
at java.util.AbstractMap.toString(AbstractMap.java:493)
at org.hibernate.pretty.Printer.toString(Printer.java:59)
at org.hibernate.pretty.Printer.toString(Printer.java:90)
at org.hibernate.event.def.AbstractFlushingEventListener.flushEverythingToExecutions(AbstractFlushingEventListener.java:97)
at org.hibernate.event.def.DefaultAutoFlushEventListener.onAutoFlush(DefaultAutoFlushEventListener.java:35)
at org.hibernate.impl.SessionImpl.autoFlushIfRequired(SessionImpl.java:969)
at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1114)
at org.hibernate.impl.QueryImpl.list(QueryImpl.java:79)
At this code:
Query query = null;
Transaction tx= session.beginTransaction();
if (allRadio.isSelected()) {
query = session.createQuery("select d from Document as d, d.msg as m, m.senderreceivers as s where m.isDraft=0 and d.isMain=1 and s.organization.shortName like '" + search + "' and s.role=0");
} else if (periodRadio.isSelected()) {
query = session.createQuery("select d from Document as d, d.msg as m, m.senderreceivers as s where m.isDraft=0 and d.isMain=1 and s.organization.shortName like '" + search + "' and s.role=0 and m.receivingDate between :start and :end");
query.setParameter("start", start);
query.setParameter("end", end);
}
final List<Document> documents = query.list();
query = session.createQuery("select o from Organization as o");
List<Organization> organizations = query.list(); <---AT THIS LINE
tx.commit();
I'm making 2 consecutive queries. If I comment out one of them, the other works fine.
If I remove the transaction, the exception disappears. What's going on? Is this a memory leak or something? Thanks in advance.
A tip I picked up from many years of pain with this sort of thing: the answer is usually carefully hidden somewhere in the first 10 lines of the stack trace. Always read the stack trace several times, and if that doesn't give enough help, read the source code of the methods where the failure happens.
In this case the problem comes from somewhere in Hibernate's pretty printer. This is a logging feature, so the problem is that Hibernate is trying to log some enormous string. Notice how it fails while trying to increase the size of a StringBuilder.
Why is it trying to log an enormous string? I can't say from the information you've given, but I'm guessing you have something very big in your Organization entity (maybe a BLOB?) and Hibernate is trying to log the objects that the query has pulled out of the database. It may also be a mistake in the mapping, whereby eager fetching pulls in many dependent objects - e.g. a child collection that loads the entire table due to a wrong foreign-key definition.
If it's a mistake in the mapping, fixing the mapping will solve the problem. Otherwise, your best bet is probably to turn off this particular logging feature, typically by raising the log level for the org.hibernate loggers above DEBUG so the pretty printer is never invoked. There's an existing question on a very similar problem, with some useful suggestions in the answers.
While such an error might be an indicator of a memory leak, it could also just result from high memory usage in your program.
You could try to amend it by adding the following parameter to your command line (which will increase the maximum heap size; adapt the 512m according to your needs):
java -Xmx512m yourprog
If it goes away that way, your program probably just needed more than the default heap size (which depends on the platform); if it comes back (probably a little later), you have a memory leak somewhere.
You need to increase the JVM heap size. Start it with the -Xmx256m command-line parameter.