The user will browse (page through) sorted entities in the Datastore, 12 at a time.
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
// read the paging offset from the URL parameter
int next = Integer.parseInt(request.getParameter("next"));
// query with a sort order
Query query = new Query("page").addSort("importance", SortDirection.DESCENDING);
PreparedQuery pq = datastore.prepare(query);
// fetch 12 entities from the query result, starting at index `next`
FetchOptions options = FetchOptions.Builder.withLimit(12).chunkSize(12).offset(next);
for (Entity result : pq.asIterable(options)) {
    Text text = (Text) result.getProperty("content");
    Document doc = Jsoup.parse(text.getValue());
    // display the content
    .....
}
The problem is that as the next value increases, quota consumption grows much faster. For example, when next is 6000, about 40% of the quota is consumed, while when next is 10, less than 1% is consumed.
If you use Google App Engine cursors to facilitate your paging, your queries will be optimized. Large offsets are not recommended: the Datastore still retrieves the skipped entities internally, so they count against your quota. The recommended way to do paging in GAE is with cursors.
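For reference, a minimal sketch of the cursor approach applied to the query above (the cursor request parameter name is illustrative, not from the original code):
// Hedged sketch: replace the numeric offset with a web-safe cursor.
Query query = new Query("page").addSort("importance", SortDirection.DESCENDING);
PreparedQuery pq = datastore.prepare(query);
FetchOptions options = FetchOptions.Builder.withLimit(12);
String cursorParam = request.getParameter("cursor"); // illustrative parameter name
if (cursorParam != null) {
    // resume exactly where the previous page ended; skipped entities are never read
    options.startCursor(Cursor.fromWebSafeString(cursorParam));
}
QueryResultList<Entity> results = pq.asQueryResultList(options);
// ... render the 12 entities ...
String nextCursor = results.getCursor().toWebSafeString(); // embed in the "next page" link
Because the cursor encodes the position in the result set, the cost of fetching page N no longer grows with N.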
Query query = new Query("Apple");
query.lang("en");
query.setCount(100);
query.setSince("2018-12-03");
query.setUntil("2018-12-04");
QueryResult result = twitter.search(query);
SentiWordNetDemoCode sentiwordnet = new SentiWordNetDemoCode();
for (Status tweet : result.getTweets()) {
    System.out.println(tweet.getCreatedAt());
}
When testing this, all the tweets are from 7:59 SRET. Is there any way to get tweets from a time other than this?
According to Twitter's API reference, there are some restrictions on the query. There is no longer a since parameter, only a since_id, which takes a status ID. Also, you can only search within the last 7 days, so the until field should not be older than that.
https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
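If it helps, here is a sketch of paging within that 7-day window using Twitter4J's nextQuery() instead of fixed since/until dates (hedged: method names as I recall them from Twitter4J 4.x, so verify against your version):
// Hedged sketch: page with QueryResult.nextQuery() rather than date bounds.
Query query = new Query("Apple").lang("en").count(100);
QueryResult result;
do {
    result = twitter.search(query);
    for (Status tweet : result.getTweets()) {
        System.out.println(tweet.getCreatedAt() + " " + tweet.getText());
    }
    query = result.nextQuery(); // null once there are no further pages
} while (query != null);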
Hi, I've been using Jena for a project and now I am trying to query a graph and store the results in plain files for batch processing with Hadoop.
I open a TDB Dataset and then I query it by pages with LIMIT and OFFSET.
I output files with 100,000 triples per file.
However, at the 10th file the performance degrades; at the 15th file it drops by a factor of 3; and at the 22nd file the performance is down to 1%.
My query is:
SELECT DISTINCT ?S ?P ?O WHERE {?S ?P ?O .} LIMIT 100000 OFFSET X
The method that queries and writes to a file is shown in the next code block:
public boolean copyGraphPage(int size, int page, String tdbPath, String query, String outputDir, String fileName) throws IllegalArgumentException {
    boolean retVal = true;
    if (size == 0) {
        throw new IllegalArgumentException("The size of the page should be bigger than 0");
    }
    long offset = ((long) size) * page;
    Dataset ds = TDBFactory.createDataset(tdbPath);
    ds.begin(ReadWrite.READ);
    String queryString = (new StringBuilder()).append(query).append(" LIMIT " + size + " OFFSET " + offset).toString();
    QueryExecution qExec = QueryExecutionFactory.create(queryString, ds);
    ResultSet resultSet = qExec.execSelect();
    List<String> resultVars;
    if (resultSet.hasNext()) {
        resultVars = resultSet.getResultVars();
        String fullyQualifiedPath = joinPath(outputDir, fileName, "txt");
        try (BufferedWriter bwr = new BufferedWriter(new OutputStreamWriter(new BufferedOutputStream(
                new FileOutputStream(fullyQualifiedPath)), "UTF-8"))) {
            while (resultSet.hasNext()) {
                QuerySolution next = resultSet.next();
                StringBuffer sb = new StringBuffer();
                sb.append(next.get(resultVars.get(0)).toString()).append(" ").
                        append(next.get(resultVars.get(1)).toString()).append(" ").
                        append(next.get(resultVars.get(2)).toString());
                bwr.write(sb.toString());
                bwr.newLine();
            }
            qExec.close();
            ds.end();
            ds.close();
            bwr.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
        resultVars = null;
        qExec = null;
        resultSet = null;
        ds = null;
    } else {
        retVal = false;
    }
    return retVal;
}
The variables set to null at the end are there because I didn't know whether there was a leak.
However, after the 22nd file the program fails with the following message:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.jena.ext.com.google.common.cache.LocalCache$EntryFactory$2.newEntry(LocalCache.java:455)
at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.newEntry(LocalCache.java:2144)
at org.apache.jena.ext.com.google.common.cache.LocalCache$Segment.put(LocalCache.java:3010)
at org.apache.jena.ext.com.google.common.cache.LocalCache.put(LocalCache.java:4365)
at org.apache.jena.ext.com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:5077)
at org.apache.jena.atlas.lib.cache.CacheGuava.put(CacheGuava.java:76)
at org.apache.jena.tdb.store.nodetable.NodeTableCache.cacheUpdate(NodeTableCache.java:205)
at org.apache.jena.tdb.store.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:129)
at org.apache.jena.tdb.store.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:82)
at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
at org.apache.jena.tdb.store.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
at org.apache.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:50)
at org.apache.jena.tdb.solver.BindingTDB.get1(BindingTDB.java:122)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingProjectBase.get1(BindingProjectBase.java:52)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:201)
at org.apache.jena.sparql.engine.binding.BindingBase.hashCode(BindingBase.java:183)
at java.util.HashMap.hash(HashMap.java:338)
at java.util.HashMap.containsKey(HashMap.java:595)
at java.util.HashSet.contains(HashSet.java:203)
at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.getInputNextUnseen(QueryIterDistinct.java:106)
at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.hasNextBinding(QueryIterDistinct.java:70)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIterSlice.hasNextBinding(QueryIterSlice.java:76)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:114)
Disconnected from the target VM, address: '127.0.0.1:57723', transport: 'socket'
Process finished with exit code 255
The memory viewer shows an increase in memory usage after each page is queried.
It is clear that Jena's LocalCache is filling up. I have changed Xmx to 2048m and Xms to 512m, with the same result; nothing changed.
Do I need more memory?
Do I need to clear something?
Do I need to stop the program and do it in parts?
Is my query wrong?
Does the OFFSET have anything to do with it?
I read in some old mailing list posts that you can turn the cache off, but I could not find any way to do it. Is there a way to turn the cache off?
I know it is a very difficult question, but I appreciate any help.
It is clear that Jena LocalCache is filling up
This is the TDB node cache - it usually needs 1.5G (2G is better) per dataset itself. This cache persists for the lifetime of the JVM.
A Java heap of 2G is a small heap by today's standards. If you must use a small heap, you can try running in 32-bit mode (called "direct mode" in TDB), but this is less performant (mainly because the node cache is smaller, and in this dataset you do have enough nodes to cause cache churn for a small cache).
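If you want to experiment with direct mode on a 64-bit JVM, I believe it can be forced before the first dataset is opened; treat the exact call below as an assumption to verify against your Jena version:
// Assumption: must run before any TDB dataset is opened.
// classes: org.apache.jena.tdb.sys.SystemTDB, org.apache.jena.tdb.base.block.FileMode
SystemTDB.setFileMode(FileMode.direct);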
The node cache is the main cause of the heap exhaustion but the query is consuming memory elsewhere, per query, in DISTINCT.
DISTINCT is not necessarily cheap. It needs to remember everything it has seen to know whether a new row is the first occurrence or already seen.
Apache Jena does optimize some cases of DISTINCT combined with LIMIT (as a TopN query), but the cutoff for the optimization is 1000 by default. See OpTopN in the code.
Otherwise it is collecting all the rows seen so far. The further through the dataset you go, the more that is in the node cache and also the more that is in the DISTINCT filter.
Do I need more memory?
Yes, more heap. The sensible minimum is 2G per TDB dataset, plus whatever Java itself requires (say, 0.5G), plus space for your program and query workspace.
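As an aside (not part of the original answer): since an RDF graph is a set of triples, the DISTINCT, and with it the growing OFFSET re-scans, can often be avoided by streaming a single query once and rolling the output files yourself. A minimal sketch of that idea; the method shape and file naming are my own:
// Hedged sketch: one pass over the data, no OFFSET re-scan, no DISTINCT set to grow.
// imports: org.apache.jena.query.*, org.apache.jena.tdb.TDBFactory, java.io.*
public static void dumpAllTriples(String tdbPath, String outputDir, int triplesPerFile) throws IOException {
    Dataset ds = TDBFactory.createDataset(tdbPath);
    ds.begin(ReadWrite.READ);
    try (QueryExecution qExec = QueryExecutionFactory.create(
            "SELECT ?S ?P ?O WHERE { ?S ?P ?O }", ds)) {
        ResultSet rs = qExec.execSelect();
        BufferedWriter out = null;
        int row = 0, fileNo = 0;
        while (rs.hasNext()) {
            if (row % triplesPerFile == 0) { // rotate to a new output file
                if (out != null) out.close();
                out = new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream(new File(outputDir, "part-" + (fileNo++) + ".txt")), "UTF-8"));
            }
            QuerySolution sol = rs.next();
            out.write(sol.get("S") + " " + sol.get("P") + " " + sol.get("O"));
            out.newLine();
            row++;
        }
        if (out != null) out.close();
    } finally {
        ds.end();
        ds.close();
    }
}
Each row is then materialized exactly once instead of being re-scanned for every page.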
You seem to have a memory leak somewhere. This is just a guess, but try this:
TDBFactory.release(ds);
REF: https://jena.apache.org/documentation/javadoc/tdb/org/apache/jena/tdb/TDBFactory.html#release-org.apache.jena.query.Dataset-
I have observed some odd behaviour and I don't see what I am doing wrong.
Via multiple BooleanQueries I created the following query:
+(-(Request.zipCode:18055 Request.zipCode:33333 Request.zipCode:99999) +Request.zipCode:[* TO *]) *:*
...this is what I get via toString().
Update: this is how I created the part of the BooleanQuery responsible for the snippet +Request.zipCode:[* TO *]:
Query fieldOccursQuery = new TermQuery(new Term(queryFieldName, "[* TO *]"));
I created the exact same (per my understanding) Query via QueryParser like this:
String querystr = "+(-(Request.zipCode:18055 Request.zipCode:33333 Request.zipCode:99999) +Request.zipCode:[* TO *]) *:*";
Query query = new QueryParser(Version.LUCENE_46, "title", LuceneServiceI.analyzer).parse(querystr);
I processed both of them the same way like this:
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
int max = reader.maxDoc();
TopScoreDocCollector collector = TopScoreDocCollector.create(max > 0 ? max : 1, true);
searcher.search(query, collector);
....
ScoreDoc[] hits = collector.topDocs().scoreDocs;
Map<Integer, Document> docMap = new TreeMap<Integer, Document>();
for (int i = 0; i < hits.length; i++) {
    docMap.put(hits[i].doc, searcher.doc(hits[i].doc)); // was `indexSearcher`, which is not declared here
}
Different results
On a index like: stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<Request.zipCode:04103>
The Query built via QueryParser delivers one document, as expected.
The Query built via BooleanQuery does not deliver the expected document.
Questions
How can two seemingly identical queries deliver different results? Do I need to set certain attributes on my BooleanQuery, etc.?
How can I get the desired result with the BooleanQuery?
I could not find anything about behavioural differences, only about performance (http://www.gossamer-threads.com/lists/lucene/java-user/144374).
I found the solution to my problem.
Instead of creating this for the BooleanQuery:
Query fieldOccursQuery = new TermQuery(new Term(queryFieldName, "[* TO *]"));
I used this:
ConstantScoreQuery constantScoreQuery = new ConstantScoreQuery(new FieldValueFilter(queryFieldName));
query.add(constantScoreQuery, Occur.MUST);
Now my query looks different, but I only get documents that have a field named queryFieldName.
The issue seems to be the leading wildcard in my first solution:
Find all Lucene documents having a certain field
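For completeness, a sketch of how the whole query might be assembled with this approach (assuming Lucene 4.x, where FieldValueFilter lives in org.apache.lucene.queries; the field name and zip codes are taken from the question):
// Hedged sketch: exclude three zip codes, but require the field to exist.
BooleanQuery excludedZips = new BooleanQuery();
excludedZips.add(new TermQuery(new Term("Request.zipCode", "18055")), Occur.SHOULD);
excludedZips.add(new TermQuery(new Term("Request.zipCode", "33333")), Occur.SHOULD);
excludedZips.add(new TermQuery(new Term("Request.zipCode", "99999")), Occur.SHOULD);

BooleanQuery query = new BooleanQuery();
query.add(excludedZips, Occur.MUST_NOT); // none of the listed zip codes
query.add(new ConstantScoreQuery(new FieldValueFilter("Request.zipCode")), Occur.MUST); // field must exist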
I have a list of games in the GAE datastore and I want to query a fixed number of them, starting from a certain offset, i.e. get the next 25 games starting from the entry with id "75".
PersistenceManager pm = PMF.get().getPersistenceManager(); // from Google examples
Query query = pm.newQuery(Game.class); // objects of class Game are stored in datastore
query.setOrdering("creationDate asc");
/* querying for open games, not created by this player */
query.setFilter("state == Game.STATE_OPEN && serverPlayer.id != :playerId");
String playerId = "my-player-id";
List<Game> games = (List<Game>) query.execute(playerId); // if there are lots of games, the returned list has more entries than the user needs to see at a time
//...
Now I need to extend that query to fetch only 25 games, and only games following the entry with id "75", so the user can browse the open games 25 at a time.
I know there are lots of examples for the GAE datastore, but they are mostly in Python, including the sample code for setting a limit on a query.
I am looking for a working Java code sample and couldn't find one so far.
It sounds like you want to facilitate paging via Query Cursors. See: http://code.google.com/appengine/docs/java/datastore/queries.html#Query_Cursors
From the Google doc:
public class ListPeopleServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Query q = new Query("Person");
        PreparedQuery pq = datastore.prepare(q);

        int pageSize = 15;
        resp.setContentType("text/html");
        resp.getWriter().println("<ul>");

        FetchOptions fetchOptions = FetchOptions.Builder.withLimit(pageSize);

        // If this servlet is passed a cursor parameter, let's use it
        String startCursor = req.getParameter("cursor");
        if (startCursor != null) {
            fetchOptions.startCursor(Cursor.fromWebSafeString(startCursor));
        }

        QueryResultList<Entity> results = pq.asQueryResultList(fetchOptions);
        for (Entity entity : results) {
            resp.getWriter().println("<li>" + entity.getProperty("name") + "</li>");
        }
        resp.getWriter().println("</ul>");

        String cursor = results.getCursor().toWebSafeString();

        // Assuming this servlet lives at '/people'
        resp.getWriter().println(
                "<a href='/people?cursor=" + cursor + "'>Next page</a>");
    }
}
Thanks everyone for the help. Cursors were the right answer.
The thing is that I am pretty much stuck with JDO and can't use DatastoreService directly, so I finally found this link:
http://code.google.com/appengine/docs/java/datastore/jdo/queries.html#Query_Cursors
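For anyone else stuck with JDO, the pattern from that page looks roughly like this (retyped from memory, so verify against the linked docs; JDOCursorHelper comes from the DataNucleus App Engine plugin):
// Hedged sketch of JDO cursor paging, per the linked GAE documentation.
Query query = pm.newQuery(Game.class);
query.setRange(0, 25); // first page of 25
List<Game> games = (List<Game>) query.execute();
// remember where this page ended
Cursor cursor = JDOCursorHelper.getCursor(games);
String cursorString = cursor.toWebSafeString();

// ...later, to fetch the next 25 games:
Query nextQuery = pm.newQuery(Game.class);
Map<String, Object> extensionMap = new HashMap<String, Object>();
extensionMap.put(JDOCursorHelper.CURSOR_EXTENSION, Cursor.fromWebSafeString(cursorString));
nextQuery.setExtensions(extensionMap);
nextQuery.setRange(0, 25);
List<Game> nextGames = (List<Game>) nextQuery.execute();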
The result list is correct, but getResultSize() is incorrect.
I've knocked up some code to illustrate.
Criteria criteria2 = this.getSession().createCriteria(Film.class);
Criterion genre = Restrictions.eq("genreAlias.genreName", details.getSearch().getGenreName());
criteria2.createAlias("genres", "genreAlias", CriteriaSpecification.INNER_JOIN);
criteria2.add(genre);
criteria2.setMaxResults(details.getMaxRows())
         .setFirstResult(details.getStartResult());
FullTextEntityManager fullTextEntityManager = org.hibernate.search.jpa.Search.createFullTextEntityManager(entityManager);
org.apache.lucene.queryParser.QueryParser parser2 = new QueryParser("title", new StopAnalyzer());
org.apache.lucene.search.Query luceneQuery2 = parser2.parse("title:" + details.getSearch());
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(luceneQuery2, Film.class);
fullTextQuery.setCriteriaQuery(criteria2);
fullTextQuery.getResultList(); // returns the correctly filtered list
fullTextQuery.getResultSize(); // returns the result size without the genre restriction
From http://docs.jboss.org/hibernate/search/3.3/api/org/hibernate/search/jpa/FullTextQuery.html
int getResultSize()
Returns the number of hits for this search. Caution: the number of results might be slightly different from getResultList().size() because getResultList() may not be in sync with the database at the time of the query.
You should try to use some of the more specialized queries like this one:
Query query = new FuzzyQuery(new Term("title", q));
FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(query, Film.class);
int filmCount = fullTextQuery.getResultSize();
and this is how you do pagination requests (I'm guessing you have improperly implemented your pagination):
FullTextQuery hits = Search.getFullTextSession(getSession()).createFullTextQuery(query, Film.class)
.setFirstResult((pageNumber - 1) * perPageItems).setMaxResults(perPageItems);
The above works for me every time. You should keep in mind that the result of getResultSize() is more of an estimate. I use pagination a lot and I have seen the number change between pages, so you should say "about xxxx results".