Does anybody know of a library or a good code sample that could be used to re-index all/some entities in all/some namespaces?
If I implement this on my own, is MapReduce what I should consider?
"I need to re-index" feels like a problem many developers have run into, but the closest I could find is this, which may be a good start?
The other option is a homebrewed solution using Task Queues that iterates over the datastore namespaces and entities, but I'd prefer not to re-invent the wheel and instead go for a robust, proven solution.
What are the options?
I'm afraid I don't know of any pre-built system. I think you basically need to create a cursor to iterate through all your entities and then do a get and a put on all of them (or optionally check if they're in the index before doing the put - if you have some that won't need updating, that would save you a write at the cost of a read and/or a small operation).
Follow the example here:
https://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify#Cursors
Create a java.util.concurrent.SynchronousQueue to hold batches of datastore keys.
Create 10 new consumer threads (the current limit) using ThreadManager:
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/ThreadManager
Those threads should do the following:
Create a new objectify instance and turn off the session cache and memcache for objectify.
Get a batch of keys from the SynchronousQueue.
Fetch all of those entities using a batch get.
Optionally do a keys-only query for all those entities using the relevant property.
Put all those entities (or exclude the ones that were returned above).
Repeat from step 2.
In a loop, fetch the next 30 keys using a keys-only cursor query and put them into the SynchronousQueue.
Once you've put all of the items into the SynchronousQueue, set a property to stop all the consumer threads once they've done their work.
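For illustration, here is a minimal sketch of that producer/consumer loop. It uses the low-level datastore API rather than Objectify so the example stays self-contained (which also sidesteps the Objectify session-cache/memcache settings above); the entity kind "MyEntity", the batch size and the stop signal are assumptions you would adapt:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadFactory;
import com.google.appengine.api.ThreadManager;
import com.google.appengine.api.datastore.*;

public class Reindexer {
    private static final List<Key> POISON = new ArrayList<Key>(); // empty batch = stop signal

    public void reindex() throws InterruptedException {
        final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        final SynchronousQueue<List<Key>> queue = new SynchronousQueue<List<Key>>();
        ThreadFactory factory = ThreadManager.currentRequestThreadFactory();

        // Consumers: batch-get each list of keys and put the entities back,
        // which makes the datastore rewrite their index rows.
        for (int i = 0; i < 10; i++) {
            factory.newThread(new Runnable() {
                public void run() {
                    try {
                        List<Key> batch;
                        while ((batch = queue.take()) != POISON) {
                            ds.put(ds.get(batch).values());
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }).start();
        }

        // Producer: keys-only cursor query feeding batches of 30 keys.
        Query q = new Query("MyEntity").setKeysOnly();
        Cursor cursor = null;
        while (true) {
            FetchOptions options = FetchOptions.Builder.withLimit(30);
            if (cursor != null) options.startCursor(cursor);
            QueryResultList<Entity> page = ds.prepare(q).asQueryResultList(options);
            if (page.isEmpty()) break;
            List<Key> keys = new ArrayList<Key>();
            for (Entity e : page) keys.add(e.getKey());
            queue.put(keys);
            cursor = page.getCursor();
        }
        for (int i = 0; i < 10; i++) queue.put(POISON); // stop the consumers
    }
}

The unconditional put after the get is what rewrites each entity's index rows; the optional keys-only check from the steps above would slot in just before the put.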
I am working on a project where I need to use Hibernate Search, and I am going to index just one entity. It is mapped to a huge table with almost 20 million records, and more records are added to it daily, but not through the application and entity manager I am working on, so Hibernate Search can't index the new changes automatically. The problem is that rebuilding the whole index for the entity every day is going to take a long time.
I wanted to know: is there any way to keep my current index and partially rebuild the index documents for just the new changes?
If, at the end of the day, you are able to list all the entities that have been modified during the last 24 hours based on information from the database (a date/time of last change for each entity, for example), then yes, there are ways to do that.
First, you can do it "manually" by running your own Hibernate ORM query and calling FullTextSession.index on each element you want to see reindexed. You will have to do this in batches, preferably opening a transaction for each batch, if you have a large number of elements to reindex.
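For example, a minimal sketch of that manual approach; it assumes an open Session named session and a lastChangeDateTime property on the entity, both of which are placeholders for illustration:

FullTextSession fullTextSession = Search.getFullTextSession( session );
fullTextSession.setFlushMode( FlushMode.MANUAL );
fullTextSession.setCacheMode( CacheMode.IGNORE );

Transaction tx = fullTextSession.beginTransaction();
ScrollableResults results = fullTextSession
        .createQuery( "from MyClass e where e.lastChangeDateTime >= :since" )
        .setParameter( "since", LocalDateTime.now().minus( 26, ChronoUnit.HOURS ) )
        .scroll( ScrollMode.FORWARD_ONLY );
int count = 0;
while ( results.next() ) {
    fullTextSession.index( results.get( 0 ) ); // queue this entity for (re)indexing
    if ( ++count % 100 == 0 ) {
        fullTextSession.flushToIndexes(); // push the current batch to the index
        fullTextSession.clear();          // free memory used by the session
    }
}
fullTextSession.flushToIndexes();
tx.commit();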
Another, better option is to use the JSR352 integration, which will however require you to use a JSR352-compatible runtime (Spring Batch is not very standard-compliant and won't work; JBeret is known to work). By targeting your single entity and calling restrictedBy(Criterion) when building the parameters, you will be able to restrict the list of entities to reindex.
For example:
Properties jobProperties = MassIndexingJob.parameters()
        .forEntity( MyClass.class )
        // 26 hours instead of 24 to account for DST switches and other slight delays
        .restrictedBy( Restrictions.ge( "lastChangeDateTime", LocalDateTime.now().minus( 26, ChronoUnit.HOURS ) ) )
        .build();
long executionId = BatchRuntime.getJobOperator()
        .start( MassIndexingJob.NAME, jobProperties );
The MassIndexer unfortunately doesn't offer such a feature yet. You can vote for the feature on ticket HSEARCH-499 and explain your use case in a comment: we'll try to prioritize features that benefit many users. And of course, you can always reach out to us to discuss how to implement this and contribute a patch :)
I need to iterate over quite a large dataset of an entity in index order as a background task (approx. 200,000+ entities).
I am aware that the TaskQueue API along with possibly a background instance is the way to go, but I am hitting DataStoreUnavailable and timeout exceptions on occasion and what I'm looking for is a reliable way of iterating and updating in the background using the GAE APIs.
It is also very useful for me to know the progress of the iteration.
I am also aware of the experimental Java MapReduce API, but at first glance it seems to me to be a parallel-processing API rather than an ordered one. (Please correct me if I'm wrong; Java MapReduce examples seem to be few and far between at the moment.)
Are there any concrete examples or good patterns for doing this sort of work?
Process only a limited number of entities in a job.
Start with a query as usual, but if the job request has a cursor parameter, apply it to the query. Then fetch only a fixed number of entities, instead of fetching all.
When the job is done but there are more entities to process, retrieve the current query cursor and schedule the same job again with the cursor as a request parameter.
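A minimal sketch of that pattern as a task queue handler, using the low-level datastore and task queue APIs; the servlet URL, entity kind, sort property and batch size are placeholders:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.datastore.*;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class ProcessEntitiesServlet extends HttpServlet {
    private static final int BATCH_SIZE = 100;

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Query q = new Query("MyEntity").addSort("someIndexedProperty");

        FetchOptions options = FetchOptions.Builder.withLimit(BATCH_SIZE);
        String cursorParam = req.getParameter("cursor");
        if (cursorParam != null) {
            options.startCursor(Cursor.fromWebSafeString(cursorParam));
        }

        QueryResultList<Entity> batch = ds.prepare(q).asQueryResultList(options);
        for (Entity entity : batch) {
            // ... process/update the entity, then ds.put(entity) ...
        }

        if (batch.size() == BATCH_SIZE) {
            // More work to do: re-enqueue the same job with the current cursor.
            Queue queue = QueueFactory.getDefaultQueue();
            queue.add(TaskOptions.Builder
                    .withUrl("/tasks/process-entities")
                    .param("cursor", batch.getCursor().toWebSafeString()));
        }
    }
}

Because each task touches only BATCH_SIZE entities, it stays well under the request deadline, and the web-safe cursor string passed between tasks doubles as a progress marker you can log or store.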
I have MongoDB (version 2) in production in a replica set configuration (the next step is to add sharding).
I need to implement the following:
Once a day I'll receive a file with millions of rows and I have to load it into Mongo.
I have a runtime application that constantly reads from this collection - a very large number of reads - and its performance is very important.
The collection is indexed and all reads are performed by index.
My current implementation of loading is:
drop collection
create collection
insert the new documents into the collection
One of the things I see is that, because of the MongoDB lock, my overall performance gets worse during the loading.
I've checked the collection with up to 10 million entries.
For anything larger than that, I think I should start using sharding.
What is the best way to solve such an issue?
Or should I maybe use another solution strategy?
You could use two collections :)
collectionA contains this day's data
new data arrives
create a new collection (collectionB) and insert the data
now use collectionB as your data
Then, next day, repeat the above just swapping A and B :)
This will let collectionA still service requests while collectionB is being updated.
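One way to implement the swap, sketched here with the current MongoDB Java driver (the question mentions MongoDB 2.x, so treat this purely as an illustration; database, collection and field names are placeholders), is to load into a staging collection and then rename it over the live one, so readers never need to know which physical collection is current:

import java.util.List;
import org.bson.Document;
import com.mongodb.MongoNamespace;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.RenameCollectionOptions;

public class DailyLoader {
    public void load(List<Document> newDocuments) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("mydb");

            // 1. Build the new day's data in a staging collection.
            db.getCollection("data_staging").drop();
            MongoCollection<Document> staging = db.getCollection("data_staging");
            staging.insertMany(newDocuments);
            staging.createIndex(Indexes.ascending("email")); // build indexes before the swap

            // 2. Swap it in: readers keep querying "data" and only ever see a
            //    fully loaded, fully indexed collection.
            staging.renameCollection(
                    new MongoNamespace("mydb", "data"),
                    new RenameCollectionOptions().dropTarget(true));
        }
    }
}

If you later shard the collection, the application-level A/B name swap described above may be the safer route, since renaming sharded collections comes with restrictions.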
PS Just noticed that I'm about a year late answering this question :)
I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on the memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
query.execute(myFriendKeys);
You could also rely on the low-level API get() that accepts multiple keys, or do like me and use Objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
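For example, a minimal sketch with the low-level datastore API, assuming a multi-valued friendOf property on the User kind (the property name is an assumption, not something from your schema):

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
// Each User entity lists, in its multi-valued "friendOf" property, the keys of the users
// it is a friend of, so a single equality filter returns all friends in one query.
Query q = new Query("User")
        .setFilter(new Query.FilterPredicate("friendOf", Query.FilterOperator.EQUAL, theUserKey));
for (Entity friend : ds.prepare(q).asIterable()) {
    System.out.println(friend.getProperty("firstName"));
}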
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot, then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, optimize for speed rather than data size. Use extra structures as indexes and save objects in multiple ways so you can read them as quickly as possible. Data is cheap, CPU time is not.
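For the batch-get case, a minimal sketch with the low-level API (kind and property names are placeholders):

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
List<Key> friendKeys = fetchFriendKeys();       // e.g. read from the user's stored friend list
Map<Key, Entity> friends = ds.get(friendKeys);  // one batch round trip instead of 500 single gets
for (Entity friend : friends.values()) {
    System.out.println(friend.getProperty("firstName"));
}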
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace:
(screenshot of the slow query trace: http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements (if it's an indexed property of the user), as described at http://code.google.com/appengine/docs/java/datastore/overview.html.
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either. It can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries; compute and store the results directly for the few supported queries.
In your case, you probably have just one type of query - return the data of this, that and the other user that should be displayed on a user page. You can precompute this big ball of mess, so later a single query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment into it, and save it.
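For illustration only, a sketch of that write path against the datastore, assuming a hypothetical UserPage entity keyed by user id that holds the precomputed page:

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Key pageKey = KeyFactory.createKey("UserPage", userBId);
try {
    Entity page = ds.get(pageKey);                        // userB's "big ball of mess"
    @SuppressWarnings("unchecked")
    List<String> comments = (List<String>) page.getProperty("comments");
    if (comments == null) comments = new ArrayList<String>();
    comments.add(userAId + ": " + commentText);           // insert userA's comment
    page.setProperty("comments", comments);
    ds.put(page);                                          // one read + one write per comment
} catch (EntityNotFoundException e) {
    // first comment for this user: create the UserPage entity here instead
}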
Of course, there are a lot of problems with this approach. For giant internet companies, they probably don't have a choice, generic query engines just don't cut it. But for others? Wouldn't you be happier if you can just use the good old RDBMS?
If it is a frequently used query, you can consider preparing indexes for it.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.
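A minimal JDO sketch of that pattern; the class and field names are placeholders:

import java.util.ArrayList;
import java.util.List;
import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.IdentityType;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;
import com.google.appengine.api.datastore.Key;

@PersistenceCapable(identityType = IdentityType.APPLICATION)
public class UserFriends {

    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key key;                 // child key; its parent is the owning User's key

    @Persistent
    private List<Key> friendKeys = new ArrayList<Key>();   // at most 5000 entries per entity

    public List<Key> getFriendKeys() { return friendKeys; }
}

When one UserFriends entity fills up, create another child under the same User parent; an ancestor query on the user's key then returns all of the chunks in one go.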
Let's say you have a large text file. Each row contains an email id and some other information (say, some product-id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use a MapReduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data. ( j/k :) )
Unable to fit all rows in memory
Even the result won't fit: use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (probably this sample helps). This can be multi-threaded if you want.
The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or something similar and keep adding to the HashSet. You can use a ConcurrentHashMap and more than one thread to read files and add to this map. Another multi-threaded option is a ConcurrentSkipListSet. In this case, you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to this SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method and put them all in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
Oh, and if I were you, I would put the unique constraint on the DB anyway...
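A minimal sketch of the "fits in memory" approach above: stream the file line by line into a HashSet of records whose equals()/hashCode() are based on the email id (the file format and field names are assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class Deduper {

    static final class Row {
        final String email;
        final String productId;

        Row(String email, String productId) {
            this.email = email;
            this.productId = productId;
        }

        @Override public boolean equals(Object o) {
            return o instanceof Row && ((Row) o).email.equalsIgnoreCase(email);
        }

        @Override public int hashCode() {
            return email.toLowerCase().hashCode();
        }
    }

    public static Set<Row> dedup(String path) throws IOException {
        Set<Row> unique = new LinkedHashSet<Row>();  // keeps the first occurrence, preserves order
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2); // assuming "email,productId" rows
                unique.add(new Row(parts[0].trim(), parts.length > 1 ? parts[1].trim() : ""));
            }
        }
        return unique;                               // ready to batch-insert into the DB
    }
}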
I will start with the obvious answer. Make a HashMap, put the email id in as the key and the rest of the information into the value (or make an object to hold all the information). When you get to a new line, check whether the key exists; if it does, move to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
You have two options:
do it in Java: you could put together something like a HashSet for testing - adding an email id for each item that comes in if it doesn't exist in the set.
do it in the database: put a unique constraint on the table, such that dups will not be added to the table. An added bonus is that you can repeat the process and remove dups from previous runs (see the sketch below).
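A sketch of the database option via JDBC (MySQL-style DDL; the table, columns and connection details are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;
import java.sql.Statement;

public class DbLoader {
    public void load(Iterable<String[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass")) {
            try (Statement ddl = conn.createStatement()) {
                ddl.execute(
                    "CREATE TABLE IF NOT EXISTS contacts (" +
                    "  email VARCHAR(255) NOT NULL," +
                    "  product_id VARCHAR(64)," +
                    "  UNIQUE KEY uq_email (email))");       // the constraint does the de-duping
            }

            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO contacts (email, product_id) VALUES (?, ?)")) {
                for (String[] row : rows) {
                    insert.setString(1, row[0]);
                    insert.setString(2, row[1]);
                    try {
                        insert.executeUpdate();
                    } catch (SQLIntegrityConstraintViolationException duplicate) {
                        // duplicate email: skip it, which also makes re-running the load safe
                    }
                }
            }
        }
    }
}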
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record linkage engine written in Java. It uses Lucene to index the data and reduce the number of comparisons (avoiding the unacceptable Cartesian-product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and it is extremely extensible and configurable.
Can you not index the table by email and product ID? Then reading by index should make duplicates of either email or email+prodId readily identified via sequential reads and simply matching the previous record.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
You load your data in an import schema;
Do every transformation you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.