Hi, I've been using Jena for a project, and now I'm trying to query a graph and write the results to plain files for batch processing with Hadoop.
I open a TDB Dataset and then query it page by page with LIMIT and OFFSET.
I output files with 100,000 triples per file.
However, around the 10th file the performance degrades; by the 15th file it is down by a factor of 3, and at the 22nd file performance has dropped to about 1%.
My query is:
SELECT DISTINCT ?S ?P ?O WHERE {?S ?P ?O .} LIMIT 100000 OFFSET X
The method that queries and writes to a file is shown in the next code block:
public boolean copyGraphPage(int size, int page, String tdbPath, String query, String outputDir, String fileName) throws IllegalArgumentException {
    boolean retVal = true;
    if (size == 0) {
        throw new IllegalArgumentException("The size of the page should be bigger than 0");
    }
    long offset = ((long) size) * page;
    Dataset ds = TDBFactory.createDataset(tdbPath);
    ds.begin(ReadWrite.READ);
    String queryString = (new StringBuilder()).append(query).append(" LIMIT " + size + " OFFSET " + offset).toString();
    QueryExecution qExec = QueryExecutionFactory.create(queryString, ds);
    ResultSet resultSet = qExec.execSelect();
    List<String> resultVars;
    if (resultSet.hasNext()) {
        resultVars = resultSet.getResultVars();
        String fullyQualifiedPath = joinPath(outputDir, fileName, "txt");
        try (BufferedWriter bwr = new BufferedWriter(new OutputStreamWriter(new BufferedOutputStream(
                new FileOutputStream(fullyQualifiedPath)), "UTF-8"))) {
            while (resultSet.hasNext()) {
                QuerySolution next = resultSet.next();
                StringBuffer sb = new StringBuffer();
                sb.append(next.get(resultVars.get(0)).toString()).append(" ")
                  .append(next.get(resultVars.get(1)).toString()).append(" ")
                  .append(next.get(resultVars.get(2)).toString());
                bwr.write(sb.toString());
                bwr.newLine();
            }
            qExec.close();
            ds.end();
            ds.close();
            bwr.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
        resultVars = null;
        qExec = null;
        resultSet = null;
        ds = null;
    } else {
        retVal = false;
    }
    return retVal;
}
The null assignments are there because I wasn't sure whether there was a leak in there.
However, after the 22nd file the program fails with the following message:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.jena.ext.com.google.common.cache.LocalCache$EntryFactory$2.newEntry(LocalCache.java:455)
at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.newEntry(LocalCache.java:2144)
at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.put(LocalCache.java:3010)
at org.apache.jena.ext.com.google.common.cache.LocalCache.put(LocalCache.java:4365)
at org.apache.jena.ext.com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:5077)
at org.apache.jena.atlas.lib.cache.CacheGuava.put(CacheGuava.java:76)
at org.apache.jena.tdb.store.nodetable.NodeTableCache.cacheUpdate(NodeTableCache.java:205)
at org.apache.jena.tdb.store.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:129)
at org.apache.jena.tdb.store.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:82)
at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
at org.apache.jena.tdb.store.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
at org.apache.jena.tdb.solver.BindingTDB.get1(BindingTDB.java:122)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:201)
at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:183)
at java.util.HashMap.hash(HashMap.java:338)
at java.util.HashMap.containsKey(HashMap.java:595)
at java.util.HashSet.contains(HashSet.java:203)
at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.getInputNextUnseen(QueryIterDistinct.java:106)
at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.hasNextBinding(QueryIterDistinct.java:70)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIterSlice.hasNextBinding(QueryIterSlice.java:76)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
Disconnected from the target VM, address: '127.0.0.1:57723', transport: 'socket'
Process finished with exit code 255
The memory viewer shows an increase in memory usage after querying each page:
It is clear that Jena LocalCache is filling up. I have changed Xmx to 2048m and Xms to 512m with the same result; nothing changed.
Do I need more memory?
Do I need to clear something?
Do I need to stop the program and do it in parts?
Is my query wrong?
Does the OFFSET have anything to do with it?
I read in some old mailing list posts that you can turn the cache off, but I could not find any way to do it. Is there a way to turn the cache off?
I know it is a very difficult question but I appreciate any help.
"It is clear that Jena LocalCache is filling up"
This is the TDB node cache - it usually needs 1.5G (2G is better) per dataset itself. This cache persists for the lifetime of the JVM.
A Java heap of 2G is a small heap by today's standards. If you must use a small heap, you can try running in 32-bit mode (called "Direct mode" in TDB), but this is less performant (mainly because the node cache is smaller, and in this dataset you do have enough nodes to cause cache churn for a small cache).
The node cache is the main cause of the heap exhaustion but the query is consuming memory elsewhere, per query, in DISTINCT.
DISTINCT is not necessarily cheap. It needs to remember everything it has seen to know whether a new row is the first occurrence or already seen.
Apache Jena does optimize some cases (as a TopN query), but the cutoff for that optimization is 1000 by default. See OpTopN in the code.
Otherwise it is collecting all the rows seen so far. The further through the dataset you go, the more there is in the node cache and also the more there is in the DISTINCT filter.
Do I need more memory?
Yes, more heap. The sensible minimum is 2G per TDB dataset, plus whatever Java itself requires (say, 0.5G), plus your program and query workspace.
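If the end goal is only to get all the triples into plain text for Hadoop, one way to sidestep DISTINCT and the growing OFFSET entirely is to stream the whole default graph out once as N-Triples and split the resulting file afterwards. This is just a sketch of an alternative, not your original approach; the paths are placeholders:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb.TDBFactory;

public class TdbDump {
    public static void main(String[] args) throws Exception {
        String tdbPath = "/path/to/tdb";        // placeholder
        String outputFile = "/path/to/dump.nt"; // placeholder
        Dataset ds = TDBFactory.createDataset(tdbPath);
        ds.begin(ReadWrite.READ);
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(outputFile))) {
            // N-Triples output is written line by line, so nothing is buffered per page
            // and no DISTINCT set has to be kept in memory.
            RDFDataMgr.write(out, ds.getDefaultModel(), Lang.NTRIPLES);
        } finally {
            ds.end();
            ds.close();
        }
    }
}

N-Triples is one triple per line, so the dump can be split at any line boundary for the Hadoop jobs.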
You seem to have a memory leak somewhere; this is just a guess, but try this:
TDBFactory.release(ds);
REF: https://jena.apache.org/documentation/javadoc/tdb/org/apache/jena/tdb/TDBFactory.html#release-org.apache.jena.query.Dataset-
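For example (a sketch only, mirroring the shape of the copyGraphPage method from the question), release could be called after the dataset is closed at the end of each page, so the per-dataset caches can be reclaimed before the next page is processed:

import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.query.ResultSet;
import org.apache.jena.tdb.TDBFactory;

public class PageWithRelease {
    public static void copyOnePage(String tdbPath, String pagedQuery) {
        Dataset ds = TDBFactory.createDataset(tdbPath);
        ds.begin(ReadWrite.READ);
        try (QueryExecution qExec = QueryExecutionFactory.create(pagedQuery, ds)) {
            ResultSet resultSet = qExec.execSelect();
            while (resultSet.hasNext()) {
                resultSet.next();       // write the row out here, as in copyGraphPage(...)
            }
        } finally {
            ds.end();
            ds.close();
            TDBFactory.release(ds);     // drop the node cache and other per-dataset resources
        }
    }
}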
Related
public List<Employee> getEmployeeDataByDate(Date dateCreated) throws Exception {
    List<Employee> listDetails = new ArrayList<>();
    sqlServTemplate.query(SyncQueryConstants.RETRIEVE_EMPLOYEE_RECORDS_BY_DATE, new Object[] { dateCreated },
        new RowCallbackHandler() {
            @Override
            public void processRow(ResultSet rs) throws SQLException {
                Employee emp = new Employee();
                emp.setEmployeeID(rs.getString("employeeID"));
                emp.setFirstName(rs.getString("firstName"));
                emp.setLastName(rs.getString("lastName"));
                emp.setMiddleName(rs.getString("middleName"));
                emp.setNickName(rs.getString("nickName"));
                byte[] res = rs.getBytes("employeeImage");
                Blob blob = new SerialBlob(res);
                emp.setEmployeeImage(blob);
                // .....
                listDetails.add(emp);
            }
        });
    return listDetails;
}
Here I'm trying to retrieve records from the employee table. Because of the BLOB data it throws OutOfMemoryError: Java heap space. Could anyone help me with this?
It's a stand-alone application that syncs from one table to another, so I am unable to use pagination. Every day about 2k records are synced at midnight by a cron job. Please give me some idea of how I can solve this issue.
SELECT * FROM Employees with(nolock) WHERE cast (datediff (day, 0, dateCreated) as datetime) >= ?
This query gives me all the data for a given date (around 2k records each day).
If I comment out these lines:
byte[] res = rs.getBytes("employeeImage");
Blob blob = new SerialBlob(res);
emp.setEmployeeImage(blob);
then there is no issue; otherwise it throws the error.
Please give me some idea and, if possible, some sample code. I've been struggling with this for two days.
As some other commenters have mentioned, you can either increase your heap space or limit the number of records returned from your query and process them in smaller batches.
You are reading megabyte-sized images into byte arrays, which will consume your heap memory.
Instead, try using a binary stream:
InputStream image = rs.getBinaryStream("employeeImage");
Rather than adding each record to the list, you could process them one at a time. Put each record into the other database as it is pulled from the source database, rather than adding it to the List, which is what causes the OOM error. If there is some other processing downstream, then inject a class into this DAO that handles the actual processing/writing to the target DB.
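A rough sketch of that idea, reusing your Employee bean and SyncQueryConstants, and assuming a hypothetical EmployeeWriter collaborator that does the actual insert into the target table (the name and interface are mine, not part of the original code):

import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialBlob;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class EmployeeSync {

    /** Assumed collaborator that inserts a single employee into the target table. */
    public interface EmployeeWriter {
        void write(Employee emp) throws SQLException;
    }

    private final JdbcTemplate sqlServTemplate;
    private final EmployeeWriter employeeWriter;

    public EmployeeSync(JdbcTemplate sqlServTemplate, EmployeeWriter employeeWriter) {
        this.sqlServTemplate = sqlServTemplate;
        this.employeeWriter = employeeWriter;
    }

    public void syncEmployeesByDate(java.util.Date dateCreated) {
        sqlServTemplate.query(SyncQueryConstants.RETRIEVE_EMPLOYEE_RECORDS_BY_DATE,
                new Object[] { dateCreated },
                new RowCallbackHandler() {
                    @Override
                    public void processRow(ResultSet rs) throws SQLException {
                        Employee emp = new Employee();
                        emp.setEmployeeID(rs.getString("employeeID"));
                        emp.setFirstName(rs.getString("firstName"));
                        emp.setLastName(rs.getString("lastName"));
                        emp.setEmployeeImage(new SerialBlob(rs.getBytes("employeeImage")));
                        // write to the target table immediately; nothing accumulates in a List,
                        // so each Employee (and its BLOB) becomes garbage-collectable right away
                        employeeWriter.write(emp);
                    }
                });
    }
}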
We have a Spark SQL query that returns over 5 million rows. Collecting them all for processing results (eventually) in java.lang.OutOfMemoryError: GC overhead limit exceeded. Here's the code:
final Dataset<Row> jdbcDF = sparkSession.read().format("jdbc")
.option("url", "xxxx")
.option("driver", "com.ibm.db2.jcc.DB2Driver")
.option("query", sql)
.option("user", "xxxx")
.option("password", "xxxx")
.load();
final Encoder<GdxClaim> gdxClaimEncoder = Encoders.bean(GdxClaim.class);
final Dataset<GdxClaim> gdxClaimDataset = jdbcDF.as(gdxClaimEncoder);
System.out.println("BEFORE PARALLELIZE");
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
System.out.println("AFTER");
final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);
mapFunc = (FlatMapFunction<Iterator<GdxClaim>, ClaimResponse>) claim -> {
    System.out.println(":D " + claim.next().getRBAT_ID());
    if (claim != null && !currentRebateId.equals((claim.next().getRBAT_ID()))) {
        if (redisCommands == null || (claim.next().getRBAT_ID() == null)) {
            serObjList = Collections.emptyList();
        } else {
            generateYearQuarterKeys(claim.next());
            redisBillingKeys = redisBillingKeys.stream().collect(Collectors.toList());
            final String[] stringArray = redisBillingKeys.toArray(new String[redisBillingKeys.size()]);
            serObjList = redisCommands.mget(stringArray);
            serObjList = serObjList.stream().filter(clientObj -> clientObj.hasValue()).collect(Collectors.toList());
            deserializeClientData(serObjList);
            currentRebateId = (claim.next().getRBAT_ID());
        }
    }
    return (Iterator) racAssignmentService.assignRac(claim.next(), listClientRegArr);
};
You can ignore most of this; the line that runs forever and never returns is:
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
Because of:
gdxClaimDataset.collectAsList()
We are unsure where to go from here and are totally stuck. Can anyone help? We've looked everywhere for an example that could help.
At a high level, collectAsList() is going to bring your entire dataset into memory, and this is what you need to avoid doing.
You may want to look at the Dataset docs in general (same link as above). They explain its behavior, including the javaRDD() method, which is probably the way to avoid collectAsList().
Keep in mind that other "terminal" operations that collect your dataset into memory will cause the same problem. The key is to filter down to your small subset, whatever that is, either before or during the process of collection.
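For instance (a sketch reusing the gdxClaimDataset from the question; the predicate is only a placeholder), a filter can be pushed onto the Dataset so that far fewer rows exist before anything is collected:

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;

// Narrow the dataset while it is still distributed; only the surviving rows
// ever need to be materialized later.
final Dataset<GdxClaim> smallSubset = gdxClaimDataset.filter(
        (FilterFunction<GdxClaim>) claim -> claim.getRBAT_ID() != null);   // placeholder predicate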
Try to replace this line:
final JavaRDD<GdxClaim> gdxClaimJavaRDD = javaSparkContext.parallelize(gdxClaimDataset.collectAsList());
with:
final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();
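Putting it together (again just a sketch built from the variables already shown in the question), the rest of the pipeline can stay on the RDD so the 5 million rows are never gathered on the driver:

// No collectAsList(): the rows stay partitioned across the executors.
final JavaRDD<GdxClaim> gdxClaimJavaRDD = gdxClaimDataset.javaRDD();
final JavaRDD<ClaimResponse> gdxClaimResponse = gdxClaimJavaRDD.mapPartitions(mapFunc);

// Consume or persist the results per partition instead of collecting them all.
gdxClaimResponse.foreachPartition(partition -> {
    while (partition.hasNext()) {
        ClaimResponse response = partition.next();
        // write response to its destination here
    }
});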
I'm trying to read .txt files with more than 10,000 lines per file, splitting them and inserting the data into an Access database using Java and UCanAccess. The problem is that it becomes slower and slower every time (as the database gets bigger).
Now, after reading 7 text files and inserting them into the database, it takes the project more than 20 minutes to read another file.
I tried doing just the reading and it works fine, so the problem is the actual inserting into the database.
N.B.: This is my first time using UCanAccess with Java, because I found that the JDBC-ODBC Bridge is no longer available. Any suggestions for an alternative solution would also be appreciated.
If your current task is simply to import a large amount of data from text files straight into the database, and it does not require any sophisticated SQL manipulations, then you might consider using the Jackcess API directly. For example, to import a CSV file you could do something like this:
String csvFileSpec = "C:/Users/Gord/Desktop/BookData.csv";
String dbFileSpec = "C:/Users/Public/JackcessTest.accdb";
String tableName = "Book";
try (Database db = new DatabaseBuilder()
        .setFile(new File(dbFileSpec))
        .setAutoSync(false)
        .open()) {
    new ImportUtil.Builder(db, tableName)
            .setDelimiter(",")
            .setUseExistingTable(true)
            .setHeader(false)
            .importFile(new File(csvFileSpec));
    // this is a try-with-resources block,
    // so db.close() happens automatically
}
Or, if you need to manually parse each line of input, insert a row, and retrieve the AutoNumber value for the new row, then the code would be more like this:
String dbFileSpec = "C:/Users/Public/JackcessTest.accdb";
String tableName = "Book";
try (Database db = new DatabaseBuilder()
        .setFile(new File(dbFileSpec))
        .setAutoSync(false)
        .open()) {
    // sample data (e.g., from parsing of an input line)
    String title = "So, Anyway";
    String author = "Cleese, John";
    Table tbl = db.getTable(tableName);
    Object[] rowData = tbl.addRow(Column.AUTO_NUMBER, title, author);
    int newId = (int) rowData[0]; // retrieve generated AutoNumber
    System.out.printf("row inserted with ID = %d%n", newId);
    // this is a try-with-resources block,
    // so db.close() happens automatically
}
To update an existing row based on its primary key, the code would be:
Table tbl = db.getTable(tableName);
Row row = CursorBuilder.findRowByPrimaryKey(tbl, 3); // i.e., ID = 3
if (row != null) {
    // Note: column names are case-sensitive
    row.put("Title", "The New Title For This Book");
    tbl.updateRow(row);
}
Note that for maximum speed I used .setAutoSync(false) when opening the Database, but bear in mind that disabling AutoSync does increase the chance of leaving the Access database file in a damaged (and possibly unusable) state if the application terminates abnormally while performing the updates.
Also, if you need to use SQL/UCanAccess, you have to call setAutoCommit(false) on the connection at the beginning, and then do a commit every 200/300 records. Performance will improve dramatically (by about 99%).
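A minimal sketch of that batched-commit pattern over plain JDBC/UCanAccess (the file path, delimiter, table and column names are placeholders, not taken from the question):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedAccessImport {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:ucanaccess://C:/Users/Public/JackcessTest.accdb";   // placeholder path
        try (Connection conn = DriverManager.getConnection(url);
             BufferedReader reader = new BufferedReader(new FileReader("C:/data/input.txt"))) {
            conn.setAutoCommit(false);   // stop committing after every single INSERT
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO Book (Title, Author) VALUES (?, ?)")) {
                String line;
                int count = 0;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(";");   // assumed delimiter
                    ps.setString(1, fields[0]);
                    ps.setString(2, fields[1]);
                    ps.executeUpdate();
                    if (++count % 250 == 0) {
                        conn.commit();   // commit every few hundred rows, as suggested above
                    }
                }
                conn.commit();           // commit the remainder
            }
        }
    }
}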
The user will browse (page through) sorted entities in the Datastore, 12 at a time.
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
//get url parameter
int next = Integer.parseInt(request.getParameter("next") );
//query with sort
Query query = new Query("page").addSort("importance", SortDirection.DESCENDING );
PreparedQuery pq = datastore.prepare(query);
//get 12 entity from query result from the index (next)
FetchOptions options = FetchOptions.Builder.withLimit(12).chunkSize(12).offset(next);
for (Entity result : pq.asIterable(options)) {
    Text text = (Text) result.getProperty("content");
    Document doc = Jsoup.parse(text.getValue());
    //display the content
    .....
}
The problem is that as the next variable increases, the quota consumption increases faster and faster.
For example, when next is 6000 about 40% of the quota is consumed, while when next is 10 less than 1% is consumed.
If you use Google App Engine cursors to facilitate your paging, then your queries will be optimized. It is not recommended to use large offsets; the recommended way to do paging in GAE is with cursors.
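A sketch of cursor-based paging with the low-level Datastore API, reusing the query from the question (the "cursor" request parameter and how it is handed back to the client are assumptions):

import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.SortDirection;
import com.google.appengine.api.datastore.QueryResultList;

// ... inside the request handler:
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Query query = new Query("page").addSort("importance", SortDirection.DESCENDING);
PreparedQuery pq = datastore.prepare(query);

FetchOptions options = FetchOptions.Builder.withLimit(12);
String cursorParam = request.getParameter("cursor");   // web-safe cursor from the previous page, if any
if (cursorParam != null) {
    options.startCursor(Cursor.fromWebSafeString(cursorParam));
}

QueryResultList<Entity> results = pq.asQueryResultList(options);
for (Entity result : results) {
    // render the entity as before
}
// hand this back to the client (e.g., as a link parameter) so the next request
// resumes where this one stopped instead of skipping over "next" entities
String nextCursor = results.getCursor().toWebSafeString();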
I'm trying to use a similar example from the sample code found here.
My sample function is:
void query()
{
    String nodeResult = "";
    String rows = "";
    String resultString;
    String columnsString;
    System.out.println("In query");
    // START SNIPPET: execute
    ExecutionEngine engine = new ExecutionEngine( graphDb );
    ExecutionResult result;
    try ( Transaction ignored = graphDb.beginTx() )
    {
        result = engine.execute( "start n=node(*) where n.Name =~ '.*79.*' return n, n.Name" );
        // END SNIPPET: execute
        // START SNIPPET: items
        Iterator<Node> n_column = result.columnAs( "n" );
        for ( Node node : IteratorUtil.asIterable( n_column ) )
        {
            // note: we're grabbing the name property from the node,
            // not from the n.name in this case.
            nodeResult = node + ": " + node.getProperty( "Name" );
            System.out.println("In for loop");
            System.out.println(nodeResult);
        }
        // END SNIPPET: items
        // START SNIPPET: columns
        List<String> columns = result.columns();
        // END SNIPPET: columns
        // the result is now empty, get a new one
        result = engine.execute( "start n=node(*) where n.Name =~ '.*79.*' return n, n.Name" );
        // START SNIPPET: rows
        for ( Map<String, Object> row : result )
        {
            for ( Entry<String, Object> column : row.entrySet() )
            {
                rows += column.getKey() + ": " + column.getValue() + "; ";
                System.out.println("nested");
            }
            rows += "\n";
        }
        // END SNIPPET: rows
        resultString = engine.execute( "start n=node(*) where n.Name =~ '.*79.*' return n.Name" ).dumpToString();
        columnsString = columns.toString();
        System.out.println(rows);
        System.out.println(resultString);
        System.out.println(columnsString);
        System.out.println("leaving");
    }
}
When I run this query in the web console I get many results (as there are multiple nodes that have a Name attribute containing the pattern 79). Yet running this code returns no results, and the debug print statements 'In for loop' and 'nested' never print either. This must mean that no results are found in the Iterator, yet that doesn't make sense.
And yes, I already checked and made sure that the graphDb variable points to the same path as the web console. I have other code earlier that uses the same variable to write to the database.
EDIT - More info
If I place the contents of query in the same function that creates my data, I get the correct results. If I run the query by itself, it returns nothing. It's almost as if the query only works in the instance where I add the data, and not if I come back to the database cold in a separate instance.
EDIT2 -
Here is a snippet of code that shows the bigger context of how it is being called, sharing the same DB handle:
package ContextEngine;

import ContextEngine.NeoHandle;
import java.util.LinkedList;

/*
 * Class to handle streaming data from any coded source
 */
public class Streamer {

    private NeoHandle myHandle;
    private String contextType;

    Streamer()
    {
    }

    public void openStream(String contextType)
    {
        myHandle = new NeoHandle();
        myHandle.createDb();
    }

    public void streamInput(String dataLine)
    {
        Context context = new Context();
        /*
         * get database instance
         * write to database
         * check for errors
         * report errors & success
         */
        System.out.println(dataLine);
        //apply rules to data (make ContextRules do this, send type and string of data)
        ContextRules contextRules = new ContextRules();
        context = contextRules.processContextRules("Calls", dataLine);
        //write data (using linked list from contextRules)
        NeoProcessor processor = new NeoProcessor(myHandle);
        processor.processContextData(context);
    }

    public void runQuery()
    {
        NeoProcessor processor = new NeoProcessor(myHandle);
        processor.query();
    }

    public void closeStream()
    {
        /*
         * close database instance
         */
        myHandle.shutDown();
    }
}
Now, if I call streamInput AND query in the same instance (parent calls), the query returns results. If I only call query and do not enter ANY data in that instance (yet the web console shows data for the same query), I get nothing. Why would I have to create the nodes and enter them into the database at runtime just to get a valid query result? Shouldn't I ALWAYS get the same results with such a query?
You mention that you are using the Neo4j Browser, which comes with Neo4j. However, the example you posted is for Neo4j Embedded, which is the in-process version of Neo4j. Are you sure you are talking to the same database when you try your query in the Browser?
In order to talk to Neo4j Server from Java, I'd recommend looking at the Neo4j JDBC driver, which has good support for connecting to the Neo4j server from Java.
http://www.neo4j.org/develop/tools/jdbc
You can set up a simple connection by adding the Neo4j JDBC jar to your classpath, available here: https://github.com/neo4j-contrib/neo4j-jdbc/releases. Then just use Neo4j as you would any other JDBC driver:
Connection conn = DriverManager.getConnection("jdbc:neo4j://localhost:7474/");
ResultSet rs = conn.executeQuery("start n=node({id}) return id(n) as id", map("id", id));
while (rs.next()) {
    System.out.println(rs.getLong("id"));
}
Refer to the JDBC documentation for more advanced usage.
To answer your question on why the data is not durably stored, it may be one of many reasons. I would attempt to incrementally scale back the complexity of the code to try and locate the culprit. For instance, until you've found your problem, do these one at a time:
Instead of looping through the result, print it using System.out.println(result.dumpToString());
Instead of the regex query, try just MATCH (n) RETURN n to return all the data in the database (a sketch combining this and the previous suggestion follows below)
Make sure the data you are seeing in the browser is not "old" data inserted earlier on, but really is an insert from your latest run of the Java program. You can verify this by deleting the data via the browser before running the Java program using MATCH (n) OPTIONAL MATCH (n)-[r]->() DELETE n,r;
Make sure you are actually working against the same database directory. You can verify this by leaving the server running: if you can still start your Java program, then (unless your Java program is using the Neo4j REST bindings) you are not using the same directory, because two Neo4j databases cannot run against the same database directory simultaneously.
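For the first two suggestions, a minimal debug sketch (assuming the same graphDb field and the legacy ExecutionEngine API already used in the question) could look like this:

void debugQuery()
{
    try ( Transaction ignored = graphDb.beginTx() )
    {
        ExecutionEngine engine = new ExecutionEngine( graphDb );
        // 1. return everything, no regex involved
        ExecutionResult all = engine.execute( "MATCH (n) RETURN n" );
        // 2. print it without looping, using dumpToString()
        System.out.println( all.dumpToString() );
    }
}

If this prints rows, the store has data and the problem is in the original query or iteration; if it prints nothing, the program is almost certainly opening a different (or empty) database directory.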