Discrepancy between memory usage obtained through getMemoryMXBean() and jvisualvm? - java

When trying to monitor my own program's memory usage through the following code:
public static String heapMemUsage()
{
    long used = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    long max = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getMax();
    return "" + used + " of " + max + " (" + used / (double) max * 100.0 + "%)";
}
I got a slightly different result than seen through jvisualvm (17 588 616 in my program vs 18 639 640 in jvisualvm). I know it's not that big of a deal, but it did get me thinking.
Is there any explanation for this fact?
I'd like to use the coded version if possible, but if its results are skewed in some way and jvisualvm is more credible, I'll have to stick with jvisualvm instead.
Thanks

VisualVM uses the same approach to obtain these values; see MonitoredDataImpl.java:
MonitoredDataImpl(Jvm vm, JmxSupport jmxSupport, JvmMXBeans jmxModel) {
    //...
    MemoryUsage mem = jmxModel.getMemoryMXBean().getHeapMemoryUsage();
    MemoryPoolMXBean permBean = jmxSupport.getPermGenPool();
    //...
    genCapacity = new long[2];
    genUsed = new long[2];
    genMaxCapacity = new long[2];
    genCapacity[0] = mem.getCommitted();
    genUsed[0] = mem.getUsed();
    genMaxCapacity[0] = mem.getMax();
    if (permBean != null) {
        MemoryUsage perm = permBean.getUsage();
        genCapacity[1] = perm.getCommitted();
        genUsed[1] = perm.getUsed();
        genMaxCapacity[1] = perm.getMax();
    }
}
So it is safe to use your approach. Please post additional info (JVM version, etc.) so the discrepancy can be traced.
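Note that MonitoredDataImpl above samples the permanent-generation pool in addition to the heap, and the two tools will not sample at exactly the same instant; either of these can account for a small difference. A minimal sketch for comparing like with like, dumping the heap plus every memory pool at the moment you sample (the class name MemSnapshot is just for illustration):
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MemSnapshot {
    // Print heap usage plus each individual memory pool so the numbers can be
    // compared against what jvisualvm displays for the same moment in time.
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("heap used=" + heap.getUsed() + " max=" + heap.getMax());
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            System.out.println(pool.getName() + " used=" + usage.getUsed());
        }
    }
}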

Related

Java Spark - java.lang.OutOfMemoryError: GC overhead limit exceeded - Large Dataset

We have a Spark SQL query that returns over 5 million rows. Collecting them all for processing eventually results in java.lang.OutOfMemoryError: GC overhead limit exceeded. Here's the code:
final Dataset<Row> jdbcDF = sparkSession.read().format("jdbc")
        .option("url", "xxxx")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("query", sql)
        .option("user", "xxxx")
        .option("password", "xxxx")
        .load();

final Encoder<GdxClaim> gdxClaimEncoder = Encoders.bean(GdxClaim.class);
final Dataset<GdxClaim> gdxClaimDataset = jdbcDF.as(gdxClaimEncoder);

System.out.println("BEFORE PARALLELIZE");
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
System.out.println("AFTER");
final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);

mapFunc = (FlatMapFunction<Iterator<GdxClaim>, ClaimResponse>) claim -> {
    System.out.println(":D " + claim.next().getRBAT_ID());
    if (claim != null && !currentRebateId.equals(claim.next().getRBAT_ID())) {
        if (redisCommands == null || (claim.next().getRBAT_ID() == null)) {
            serObjList = Collections.emptyList();
        } else {
            generateYearQuarterKeys(claim.next());
            redisBillingKeys = redisBillingKeys.stream().collect(Collectors.toList());
            final String[] stringArray = redisBillingKeys.toArray(new String[redisBillingKeys.size()]);
            serObjList = redisCommands.mget(stringArray);
            serObjList = serObjList.stream().filter(clientObj -> clientObj.hasValue()).collect(Collectors.toList());
            deserializeClientData(serObjList);
            currentRebateId = claim.next().getRBAT_ID();
        }
    }
    return (Iterator) racAssignmentService.assignRac(claim.next(), listClientRegArr);
};
You can ignore most of this; the line that runs forever and never returns is:
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
Because of:
gdxClaimDataset.collectAsList()
We are unsure where to go from here and are totally stuck. Can anyone help? We've looked everywhere for an example.
At a high level, collectAsList() is going to bring your entire dataset into memory, and this is what you need to avoid doing.
You may want to look at the Dataset docs in general (same link as above). They explain its behavior, including the javaRDD() method, which is probably the way to avoid collectAsList().
Keep in mind that other "terminal" operations that collect your dataset into memory will cause the same problem. The key is to filter down to the small subset you need, whatever that is, either before or during collection.
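For example, if only a subset of claims is actually needed, the Dataset can be narrowed on the executors before anything is collected. A rough sketch, where the column name "RBAT_ID" is only assumed from the getRBAT_ID() getter in the question:
import static org.apache.spark.sql.functions.col;

// Hypothetical filter: keep only rows with a rebate id. This runs on the
// executors, so the driver never sees the full 5 million rows.
final Dataset<GdxClaim> smallSubset = gdxClaimDataset.filter(col("RBAT_ID").isNotNull());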
Try replacing this line:
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
with:
final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();
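A minimal sketch of how that swap fits into the rest of the original pipeline (mapFunc unchanged); the conversion is lazy, so nothing is materialized on the driver:
// Convert the Dataset to an RDD without collecting it.
final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();

// The rest of the pipeline stays the same, but the work now runs on the
// executors partition by partition instead of over one huge collected List.
final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);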

Java OutOfMemoryError in apache Jena using TDB

Hi, I've been using Jena for a project and now I'm trying to query a graph and store the results in plain files for batch processing with Hadoop.
I open a TDB Dataset and then I query by pages with LIMIT and OFFSET.
I output files with 100000 triplets per file.
However, around the 10th file the performance degrades; at the 15th file it is down by a factor of 3, and at the 22nd file it is down to 1%.
My query is:
SELECT DISTINCT ?S ?P ?O WHERE {?S ?P ?O .} LIMIT 100000 OFFSET X
The method that queries and writes to a file is shown in the next code block:
public boolean copyGraphPage(int size, int page, String tdbPath, String query, String outputDir, String fileName) throws IllegalArgumentException {
    boolean retVal = true;
    if (size == 0) {
        throw new IllegalArgumentException("The size of the page should be bigger than 0");
    }
    long offset = ((long) size) * page;
    Dataset ds = TDBFactory.createDataset(tdbPath);
    ds.begin(ReadWrite.READ);
    String queryString = (new StringBuilder()).append(query).append(" LIMIT " + size + " OFFSET " + offset).toString();
    QueryExecution qExec = QueryExecutionFactory.create(queryString, ds);
    ResultSet resultSet = qExec.execSelect();
    List<String> resultVars;
    if (resultSet.hasNext()) {
        resultVars = resultSet.getResultVars();
        String fullyQualifiedPath = joinPath(outputDir, fileName, "txt");
        try (BufferedWriter bwr = new BufferedWriter(new OutputStreamWriter(new BufferedOutputStream(
                new FileOutputStream(fullyQualifiedPath)), "UTF-8"))) {
            while (resultSet.hasNext()) {
                QuerySolution next = resultSet.next();
                StringBuffer sb = new StringBuffer();
                sb.append(next.get(resultVars.get(0)).toString()).append(" ")
                  .append(next.get(resultVars.get(1)).toString()).append(" ")
                  .append(next.get(resultVars.get(2)).toString());
                bwr.write(sb.toString());
                bwr.newLine();
            }
            qExec.close();
            ds.end();
            ds.close();
            bwr.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
        resultVars = null;
        qExec = null;
        resultSet = null;
        ds = null;
    } else {
        retVal = false;
    }
    return retVal;
}
The null assignments are there because I didn't know whether there was a leak in there.
However, after the 22nd file the program fails with the following message:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.jena.ext.com.google.common.cache.LocalCache$EntryFactory$2.newEntry(LocalCache.java:455)
at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.newEntry(LocalCache.java:2144)
at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.put(LocalCache.java:3010)
at org.apache.jena.ext.com.google.common.cache.LocalCache.put(LocalCache.java:4365)
at org.apache.jena.ext.com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:5077)
at org.apache.jena.atlas.lib.cache.CacheGuava.put(CacheGuava.java:76)
at org.apache.jena.tdb.store.nodetable.NodeTableCache.cacheUpdate(NodeTableCache.java:205)
at org.apache.jena.tdb.store.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:129)
at org.apache.jena.tdb.store.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:82)
at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
at org.apache.jena.tdb.store.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
at org.apache.jena.tdb.solver.BindingTDB.get1(BindingTDB.java:122)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:201)
at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:183)
at java.util.HashMap.hash(HashMap.java:338)
at java.util.HashMap.containsKey(HashMap.java:595)
at java.util.HashSet.contains(HashSet.java:203)
at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.getInputNextUnseen(QueryIterDistinct.java:106)
at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.hasNextBinding(QueryIterDistinct.java:70)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIterSlice.hasNextBinding(QueryIterSlice.java:76)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
Disconnected from the target VM, address: '127.0.0.1:57723', transport: 'socket'
Process finished with exit code 255
The memory viewer shows an increase in memory usage after querying each page.
It is clear that Jena's LocalCache is filling up. I have changed -Xmx to 2048m and -Xms to 512m with the same result; nothing changed.
Do I need more memory?
Do I need to clear something?
Do I need to stop the program and do it in parts?
Is my query wrong?
Does the OFFSET have anything to do with it?
I read in some old mailing list posts that you can turn the cache off, but I could not find any way to do it. Is there a way to turn the cache off?
I know it is a very difficult question but I appreciate any help.
It is clear that Jena LocalCache is filling up
This is the TDB node cache - it usually needs 1.5G (2G is better) per dataset itself. This cache persists for the lifetime of the JVM.
A 2G Java heap is small by today's standards. If you must use a small heap, you can try running in 32-bit mode (called "Direct mode" in TDB), but this is less performant (mainly because the node cache is smaller, and in this dataset you do have enough nodes to cause cache churn for a small cache).
The node cache is the main cause of the heap exhaustion but the query is consuming memory elsewhere, per query, in DISTINCT.
DISTINCT is not necessarily cheap. It needs to remember everything it has seen to know whether a new row is the first occurrence or already seen.
Apache Jena does optimize some cases of DISTINCT combined with LIMIT (a top-N query), but the cutoff for the optimization is 1000 by default. See OpTopN in the code.
Otherwise it is collecting all the rows seen so far. The further through the dataset you go, the more that is in the node cache and also the more that is in the DISTINCT filter.
Do I need more memory?
Yes, more heap. The sensible minimum is 2G per TDB dataset, plus whatever Java itself requires (say, 0.5G), plus your program and query workspace.
You seem to have a memory leak somewhere. This is just a guess, but try this:
TDBFactory.release(ds);
REF: https://jena.apache.org/documentation/javadoc/tdb/org/apache/jena/tdb/TDBFactory.html#release-org.apache.jena.query.Dataset-
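A sketch of where the call could fit in copyGraphPage, assuming you keep the per-page approach (the rest of the method is unchanged):
// After finishing with the page, end the read transaction, close the dataset
// and ask TDB to drop its cached structures for it before the next page.
qExec.close();
ds.end();
ds.close();
TDBFactory.release(ds);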

RIAK high diskspace usage

I am evaluating Riak KV v2.1.1 on a local desktop using the Java client and a slightly customised version of the sample code.
My concern is that I found it to be taking almost 920 bytes per KV.
That's too steep. The data dir was 93 MB for 100k KVs and kept increasing linearly thereafter for every 100k store ops.
Is that expected?
RiakCluster cluster = setUpCluster();
RiakClient client = new RiakClient(cluster);
System.out.println("Client object successfully created");

Namespace quotesBucket = new Namespace("quotes2");
long start = System.currentTimeMillis();
for (int i = 0; i < 100000; i++) {
    RiakObject quoteObject = new RiakObject().setContentType("text/plain").setValue(BinaryValue.create("You're dangerous, Maverick"));
    Location quoteObjectLocation = new Location(quotesBucket, ("Ice" + i));
    StoreValue storeOp = new StoreValue.Builder(quoteObject).withLocation(quoteObjectLocation).build();
    StoreValue.Response storeOpResp = client.execute(storeOp);
}
There was a thread on the riak users mailing list a while back that discussed the overhead of the riak object, estimating it at ~400 bytes per object. However, that was before the new object format was introduced, so it is outdated. Here is a fresh look.
First we need a local client:
(node1#127.0.0.1)1> {ok,C}=riak:local_client().
{ok,{riak_client,['node1#127.0.0.1',undefined]}}
Create a new Riak object with a 0-byte value:
(node1#127.0.0.1)2> Obj = riak_object:new(<<"size">>,<<"key">>,<<>>).
#r_object{bucket = <<"size">>,key = <<"key">>,
contents = [#r_content{metadata = {dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
value = <<>>}],
vclock = [],
updatemetadata = {dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
updatevalue = undefined}
The object is actually stored in a reduced binary format:
(node1#127.0.0.1)3> byte_size(riak_object:to_binary(v1,Obj)).
36
That is 36 bytes overhead for just the object, but that doesn't include the metadata like last updated time or the version vector, so store it in Riak and check again.
(node1#127.0.0.1)4> C:put(Obj).
ok
(node1#127.0.0.1)5> {ok,Obj1} = C:get(<<"size">>,<<"key">>).
{ok, #r_object{bucket = <<"size">>,key = <<"key">>,
contents = [#r_content{metadata = {dict,3,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[[...]],[...],...}}},
value = <<>>}],
vclock = [{<<204,153,66,25,119,94,124,200,0,0,156,65>>,
{3,63654324108}}],
updatemetadata = {dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
updatevalue = undefined}}
(node1#127.0.0.1)6> byte_size(riak_object:to_binary(v1,Obj1)).
110
Now it is 110 bytes overhead for an empty object with a single entry in the version vector. If a subsequent put of the object is coordinated by a different vnode, it will add another entry. I've selected the bucket and key names so that the local node is not a member of the preflist, so the second put has a fair probability of being coordinated by a different node.
(node1#127.0.0.1)7> C:put(Obj1).
ok
(node1#127.0.0.1)8> {ok,Obj2} = C:get(<<"size">>,<<"key">>).
{ok, #r_object{bucket = <<"size">>,key = <<"key">>,
contents = [#r_content{metadata = {dict,3,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[[...]],[...],...}}},
value = <<>>}],
vclock = [{<<204,153,66,25,119,94,124,200,0,0,156,65>>,
{3,63654324108}},
{<<85,123,36,24,254,22,162,159,0,0,78,33>>,{1,63654324651}}],
updatemetadata = {dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
updatevalue = undefined}}
(node1#127.0.0.1)9> byte_size(riak_object:to_binary(v1,Obj2)).
141
Which is another 31 bytes added for an additional entry in the version vector.
These numbers don't include storing the actual bucket and key names with the value, or Bitcask storing them again in a hint file, so the actual space on disk would be roughly 2 × (bucket name size + key name size) + value overhead + file structure overhead + checksum/hash size.
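As a rough back-of-the-envelope sketch applying that formula to the sample writes above (interpreting "value overhead" as the 110-byte serialized object plus the 26-byte value itself); file structure overhead, hint files and checksums are left out, so this is a lower bound rather than an explanation of the full 920 bytes observed:
public class RiakSizeEstimate {
    public static void main(String[] args) {
        int objectOverhead = 110;                           // empty object, one vclock entry
        int bucket = "quotes2".length();                    // 7 bytes
        int key = "Ice50000".length();                      // a typical key, ~8 bytes
        int value = "You're dangerous, Maverick".length();  // 26 bytes

        long lowerBound = objectOverhead + 2L * (bucket + key) + value;
        System.out.println("estimated lower bound per KV: " + lowerBound + " bytes");
    }
}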
If you're using bitcask, there is a calculator in the documentation that will help you estimate disk and memory requirements: http://docs.basho.com/riak/kv/2.2.0/setup/planning/bitcask-capacity-calc/
If you use eLevelDB, you have the option of snappy compression which could reduce the size on disk.

How to query nested mongodb fields efficiently in Java?

I'm not very experienced with running Mongo queries from Java, so I'm no expert at commands. I have a Mongo collection with ~6500 documents, each containing multiple fields (some of which have sub-fields), like below:
"_id" : NumberLong(714847),
"franchiseIds" : [
NumberLong(714848),
NumberLong(714849)
],
"profileSettings" : {
"DISCLAIMER_SETUP" : {
"settingType" : "DISCLAIMER_SETUP",
"attributeMap" : {
...
I want to have an operation which will go through the entire collection from time to time and calculate how many franchiseIds are present, since different documents could have anywhere from 1 to 4 franchiseIds.
From the Mongo shell, I did a very simple script to get this, and it calculated the result immediately:
rs_default:SECONDARY> var totalCount = 0;
rs_default:SECONDARY> db.profiles.find().forEach( function(profile) { totalCount += profile.franchiseIds.length } );
rs_default:SECONDARY> totalCount
However, when I attempted to do the same thing in Java, which is where this would run on the server from time to time, it was much less performant, taking around 15 seconds to complete:
int result = 0;
List<Profile> allProfiles = mongoTemplate.findAll(Profile.class, PROFILE_COLLECTION);
for (Profile profile : allProfiles) {
    result += profile.getFranchiseIds().size();
}
return result;
I realize the above isn't performant in Java as it's having to allocate memory for all of the Profiles being loaded in. In the Mongo shell script, is Mongo simply taking care of this itself?
Any ideas how I can do something similar in Java?
EDIT:
I returned only the franchiseIds field on the response from Mongo, and that helped significantly. Below is the improved code:
final Query query = new Query();
query.fields().include(FRANCHISE_IDS);

final List<Profile> allProfiles = mongoTemplate.find(query, Profile.class, PROFILE_COLLECTION);
for (Profile profile : allProfiles) {
    result += profile.getFranchiseIds().size();
}
return result;
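If even the trimmed documents are too slow to pull back, the sum can also be computed server-side with the aggregation framework. A sketch, assuming a reasonably recent Spring Data MongoDB (ArrayOperators appeared in 1.10); the field and collection names are taken from the question:
import org.bson.Document;
import org.springframework.data.mongodb.core.aggregation.Aggregation;
import org.springframework.data.mongodb.core.aggregation.AggregationResults;
import org.springframework.data.mongodb.core.aggregation.ArrayOperators;

// Project each document's franchiseIds array length, then sum the lengths in one $group.
Aggregation agg = Aggregation.newAggregation(
        Aggregation.project().and(ArrayOperators.Size.lengthOfArray("franchiseIds")).as("count"),
        Aggregation.group().sum("count").as("total"));

AggregationResults<Document> out = mongoTemplate.aggregate(agg, PROFILE_COLLECTION, Document.class);
Document doc = out.getUniqueMappedResult();
return doc == null ? 0 : ((Number) doc.get("total")).intValue();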

Out of heap space with hibernate - what's the problem?

I'm running into the following (common) error after I added a new DB table, hibernate class, and other classes to access the hibernate class:
java.lang.OutOfMemoryError: Java heap space
Here's the relevant code:
From .jsp:
<%
com.companyconnector.model.HomepageBean homepage = new com.companyconnector.model.HomepageBean();
%>
From HomepageBean:
public class HomepageBean {
    ...
    private ReviewBean review1;
    private ReviewBean review2;
    private ReviewBean review3;

    public HomepageBean () {
        ...
        GetSurveyResults gsr = new GetSurveyResults();
        List<ReviewBean> rbs = gsr.getRecentReviews();
        review1 = rbs.get(0);
        review2 = rbs.get(1);
        review3 = rbs.get(2);
    }
From GetSurveyResults:
public List<ReviewBean> getRecentReviews() {
    List<OpenResponse> ors = DatabaseBean.getRecentReviews();
    List<ReviewBean> rbs = new ArrayList<ReviewBean>();
    for (int x = 0; ors.size() > x; x =+ 2) {
        String employer = "";
        rbs.add(new ReviewBean(ors.get(x).getUid(), employer, ors.get(x).getResponse(), ors.get(x+1).getResponse()));
    }
    return rbs;
}
and lastly, from DatabaseBean:
public static List<OpenResponse> getRecentReviews() {
    SessionFactory session = HibernateUtil.getSessionFactory();
    Session sess = session.openSession();
    Transaction tx = sess.beginTransaction();
    List results = sess.createQuery(
        "from OpenResponse where (uid = 46) or (uid = 50) or (uid = 51)"
    ).list();
    tx.commit();
    sess.flush();
    sess.close();
    return results;
}
Sorry for all the code and such a long message, but I'm getting over a million instances of ReviewBean (I used jProfiler to find this). Am I doing something wrong in the for loop in GetSurveyResults? Any other problems?
I'm happy to provide more code if necessary.
Thanks for the help.
Joe
Using JProfiler to find which objects occupy the memory is a good first step. Now that you know that needlessly many instances are created, a logical next step is to run your application in debug mode and step through the code that allocates the ReviewBeans. If you do that, the bug should be obvious. (I am pretty sure I spotted it, but I'd rather teach you how to find such bugs on your own; it's a skill that is indispensable for any good programmer.)
Also, you probably want to close the session / commit the transaction in a finally block to make sure it's always invoked, even if your method throws an exception. The standard pattern for working with resources in Java (simplified pseudocode):
Session s = null;
try {
    s = openSession();
    // do something useful
}
finally {
    if (s != null) s.close();
}
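Applied to the getRecentReviews method above, that could look roughly like this; just a sketch that keeps the original query and assumes the same HibernateUtil helper:
public static List<OpenResponse> getRecentReviews() {
    Session sess = HibernateUtil.getSessionFactory().openSession();
    Transaction tx = null;
    try {
        tx = sess.beginTransaction();
        List results = sess.createQuery(
            "from OpenResponse where (uid = 46) or (uid = 50) or (uid = 51)"
        ).list();
        tx.commit();
        return results;
    } catch (RuntimeException e) {
        if (tx != null) tx.rollback();  // undo the transaction on failure
        throw e;
    } finally {
        sess.close();  // always release the session, even if the query fails
    }
}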
