I have an index of documents distributed over several shards and replicas. It currently holds about 40 million documents and I expect it to grow.
Problem: users add information to these documents, and they change it quite frequently. They need it integrated into the search syntax, e.g. funny AND cool AND cat:interesting, where cat is a volatile data set.
As far as I know, neither Solr nor Lucene supports a "true update", which means I would have to reindex the whole set of changed documents again. So I need to connect the index to an external data source, such as a relational database.
I did this in Lucene with the extendable query parser (http://lucene.apache.org/core/4_3_0/queryparser/index.html). The algorithm was pretty simple:
Preprocess the query by adding "_" to all external fields
Map these fields to classes
Each class extends org.apache.lucene.search.Filter and converts IDs to a bitset by overriding public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException:
ResultSet set = state.executeQuery();
OpenBitSet bitset = new OpenBitSet();
while (set.next()) {
    bitset.set(set.getInt("ID"));
}
Then, by extending org.apache.lucene.queryparser.ext.ParserExtension, I override parse like this:
public Query parse(ExtensionQuery eq) throws ParseException {
    String cat = eq.getRawQueryString();
    Filter filter = _cache.getFilter(cat);
    return new ConstantScoreQuery(filter);
}
Finally, extend org.apache.lucene.queryparser.ext.Extensions using its add method, and you're done.
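Putting the steps together, the wiring looked roughly like this (a from-memory sketch against the Lucene 4.3 extensions API, not a drop-in snippet; CatParserExtension stands for the ParserExtension subclass described above):

```java
// Register the extension under the "cat" key so that a query such as
// cat:interesting is routed to the custom parser, which builds the
// database-backed filter.
Extensions extensions = new Extensions();
extensions.add("cat", new CatParserExtension());
ExtendableQueryParser parser = new ExtendableQueryParser(
        Version.LUCENE_43, "text", analyzer, extensions);
Query query = parser.parse("funny AND cool AND cat:interesting");
```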
But HOW to do this in Solr?
I found a couple of suggestions:
Using an ExternalFileField (http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/schema/ExternalFileField.html)
Near-real-time search (http://wiki.apache.org/solr/NearRealtimeSearch), which looks a little under construction to me.
Any ideas how to do it in Solr? Maybe there are some code examples?
Please also consider that I'm fairly new to Solr.
Thank you
The Solr 4.x releases all support Atomic Update which I believe may satisfy your needs.
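For illustration only (the id and field names here are made up), an atomic update resends just the changed field rather than the whole document. A JSON payload POSTed to the core's /update handler might look like:

```json
[
  {"id": "doc42",
   "cat": {"set": "interesting"}}
]
```

The "set" operation replaces the field value; "add" and "inc" are the other operations documented for Solr 4.x. Note that atomic updates require all fields to be stored (apart from copyField destinations), because Solr reindexes the whole document internally.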
Related
I am migrating my app to use Firebase Firestore, and one of my models is very complex (it contains lists of other custom objects). Looking at the documentation on how to commit a model object as a document, it looks like you simply create your model class with a public constructor and getters and setters.
For example, from the add data guide:
public class City {
    private String name;
    private String state;
    private String country;
    private boolean capital;
    private long population;
    private List<String> regions;

    public City() {}

    public City(String name, String state, String country, boolean capital,
                long population, List<String> regions) {
        // field assignments omitted
    }

    // getters/setters omitted
}
Firestore automatically translates this to and from a document without any additional steps. You pass an instance to a DocumentReference.set(city) call, and retrieve it from a call to DocumentSnapshot.toObject(City.class).
How exactly does it serialize this to a document? Through reflection? It doesn't discuss any limitations. Basically, I'm left wondering if this will work on more complex models, and how complex. Will it work for a class with an ArrayList of custom objects?
Firestore automatically translates this to and from a document without any additional steps. How exactly does it serialize this to a document? Through reflection?
You're guessing right: through reflection. As @Doug Stevenson also mentioned in his comment, that's very common for systems like Firebase that convert JSON data to and from POJOs (Plain Old Java Objects). Please also note that setters are not required. If there is no setter for a JSON property, the Firebase client will set the value directly onto the field. A constructor with arguments is also not required. While both are idiomatic, there are good cases for having classes without them. Please also take a look at this information regarding why the no-argument constructor exists.
It doesn't discuss any limitations.
Yes, it does. The official documentation explains that documents have limits, so there are limits on how much data you can put into a single document. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB of data in a single document. When it comes to storing text, you can store quite a lot, but as your array of custom objects gets bigger, be careful about this limitation.
Please also note that if you are storing a large amount of data in arrays, and those arrays will be updated by lots of users, there is another limitation to take care of: you are limited to 1 write per second on each document. So if you have a situation in which a lot of users are all trying to write/update data to the same document at once, you might start to see some of these writes fail. Be careful about this limitation too.
Will it work for a class with an ArrayList of custom objects?
It will work with any kind of class, as long as it uses supported data types.
Basically, I'm left wondering if this will work on more complex models, and how complex.
It will work with any kind of complex model, as long as you use the correct data types for your objects and your documents stay within that 1 MiB limit.
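The client's exact internals aren't shown in the docs, but the mechanism described above (no-arg constructor plus direct field access) can be sketched in plain Java. City and fromMap below are illustrative stand-ins, not the Firestore API:

```java
import java.lang.reflect.Field;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the mapper creates the instance through the
// no-arg constructor and, when a setter is missing, writes each value
// straight into the matching field by reflection.
public class ReflectionDemo {
    public static class City {
        private String name;
        private List<String> regions;

        public City() {}  // the no-arg constructor the mapper needs

        public String getName() { return name; }
        public List<String> getRegions() { return regions; }
    }

    public static City fromMap(Map<String, Object> data) {
        try {
            City city = City.class.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, Object> e : data.entrySet()) {
                Field f = City.class.getDeclaredField(e.getKey());
                f.setAccessible(true);   // no setter required
                f.set(city, e.getValue());
            }
            return city;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```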
Although App Engine is already schema-less, you still need to define the entities to be stored in the Datastore through the DataNucleus persistence layer. So I am thinking of a way to get around this: a layer that stores key-value pairs at runtime, instead of compile-time entities.
The way this is done with Redis is by creating a key like this:
private static final String USER_ID_FORMAT = "user:id:%s";
private static final String USER_NAME_FORMAT = "user:name:%s";
From the docs, the Redis types are: String, Linked list, Set, Sorted set. I am not sure if there are more.
As far as the GAE Datastore is concerned, a String "key" and a "value" have to be the entity that gets stored.
Like:
public class KeyValue {
    private String key;
    private Value value; // value can be a String, Linked-list, Set or Sorted set etc.
    // Code omitted
}
The justification for this scheme is rooted in RESTful access to the Datastore (as provided by datanucleus-api-rest).
Using this REST API, to persist an object or entity:
POST http://datanucleus.appspot.com/dn/guestbook.Greeting
{"author":null,
"class":"guestbook.Greeting",
"content":"test insert",
"date":1239213923232}
The problem with this approach is that, in order to persist an entity, the actual class needs to be defined at compile time. With a key-value store mechanism, we could instead simplify the call:
POST http://datanucleus.appspot.com/dn/org.myframework.KeyValue
{"class":"org.myframework.KeyValue",
 "key":"user:id:johnsmith;followers",
 "value":"the_list"}
Passing a single string as the "value" is fairly easy; I can use a JSON array for a list, set, or sorted list. The real question is how to persist the different types of data passed into the interface. Should there be multiple KeyValue entities, each representing one of the basic types it supports: KeyValueString, KeyValueList, etc.?
Looks like you're using a JSON based REST API, so why not just store Value as a JSON string?
You do not need to use the DataNucleus layer, or any of the other fine ORM layers (like Twig or Objectify). Those are optional, and are all based on the low-level API. If I interpret what you are saying properly, it may already have the functionality that you want. See: https://developers.google.com/appengine/docs/java/datastore/entities
Datanucleus is a specific framework that runs on top of GAE. You can however access the database at a lower, less structured, more key/value-like level - the low-level API. That's the lowest level you can access directly.
BTW, the low-level-"GAE datastore" internally runs on 6 global Google Megastore tables, which in turn are hosted on the Google Big Table database system.
Saving JSON as a String works fine, but you will need ways to retrieve your objects other than by ID. That is, you need a way to index your data to support any kind of useful query on it.
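To make the indexing point concrete, here is a stdlib-only sketch (all names illustrative; this is not the GAE low-level API): values are opaque JSON strings, so any query beyond get-by-ID needs a secondary index maintained by hand on every write.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative key/value layer: JSON strings keyed by ID, plus one
// hand-maintained secondary index (here: by user) so lookups don't
// have to scan and parse every stored value.
public class JsonKeyValueStore {
    private final Map<String, String> byId = new HashMap<>();
    private final Map<String, Set<String>> userIndex = new HashMap<>();

    public void put(String id, String json, String indexedUser) {
        byId.put(id, json);
        // update the secondary index on every write
        userIndex.computeIfAbsent(indexedUser, k -> new HashSet<>()).add(id);
    }

    public String get(String id) {
        return byId.get(id);
    }

    public Set<String> idsForUser(String user) {
        return userIndex.getOrDefault(user, Collections.emptySet());
    }
}
```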
I've looked a lot into being able to use Hibernate to persist a map like Map<String, Set<Entity>>, with little luck (especially since I want it all to be in one table).
Mapping MultiMaps with Hibernate is the thing that seems to get referenced the most, which describes in detail how to go about implementing this using a UserCollectionType.
I was wondering, since that was written over four years ago, is there any better way of doing it now?
So, for example, I would like to have on EntityA a map like Map<String, Set/List<EntityB>>.
There would be two tables: EntityA and EntityB (with EntityB having a foreign key back to EntityA).
I don't want any intermediate tables.
The way it's done on my current project is that we transform beans/collections to XML using XStream:
public static String toXML(Object instance) {
    XStream xs = new XStream();
    StringWriter writer = new StringWriter();
    xs.marshal(instance, new CompactWriter(writer));
    return writer.toString();
}
and then using a Lob type in Hibernate for persisting:
@Lob
@Column(nullable = false)
private String data;
I find this approach very generic, and you can effectively implement flexible key/value storage with it. If you don't like the XML format, the XStream framework has a built-in driver for transforming objects to JSON. Give it a try, it's really cool.
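If you want to try the round-trip idea without pulling in XStream, the JDK's java.beans.XMLEncoder sketches the same pattern (assuming bean-friendly types; XStream copes with more object graphs than this does):

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Stdlib stand-in for the XStream approach above: turn an object graph into
// an XML string (the value that would go into the @Lob column) and back.
public class XmlBlob {
    public static String toXML(Object instance) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(out)) {
            enc.writeObject(instance);
        }
        return out.toString();
    }

    public static Object fromXML(String xml) {
        try (XMLDecoder dec = new XMLDecoder(new ByteArrayInputStream(xml.getBytes()))) {
            return dec.readObject();
        }
    }
}
```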
Cheers
EDIT: Response to comment.
Yes, if you want to overcome the limitations of the classic approach, you will probably sacrifice something like indexing and/or search. You could still implement indexing/searching/foreign/child relationships through collections/generic entity beans yourself: just maintain a separate key/value table with property names and values for the properties you think need to be searchable.
I've seen a number of database designs for products where a flexible and dynamic schema (i.e. creating new attributes for domain objects without downtime) is needed, and many of them use key/value tables for storing domain attributes and references from owner objects to child ones. Those products cost millions of dollars (banking/telco), so I guess this design has already proven effective.
Sorry, that's not an answer to your original question, since you asked for a solution without intermediate tables.
It depends :) When things get complex, you should understand what your application is doing.
In some situations, you may represent your Set as a TreeSet, store that TreeSet as an ordered, encoded String such as ["1", "8", "12"] where 1, 8, 12 are primary keys, and then write the mapping code yourself.
Obviously, it's not a general answer to, in my opinion, a too-general question.
I'm in a position where our company has a database search service that is highly configurable, for which it's very useful to configure queries programmatically. The Criteria API is powerful, but when one of our developers refactors one of the data objects, the criteria restrictions won't signal that they're broken until we run our unit tests, or worse, are live on our production environment. Recently, we had a refactoring project essentially double in working time unexpectedly because of this problem, a planning gap that, had we known how long the work would really take, would probably have pushed us toward an alternative approach.
I'd like to use the Example API to solve this problem. The Java compiler can loudly indicate that our queries are borked if we specify 'where' conditions on real POJO properties. However, there's only so much functionality in the Example API, and it's limiting in many ways. Take the following example:
Product product = new Product();
product.setName("P%");
Example prdExample = Example.create(product);
prdExample.excludeProperty("price");
prdExample.enableLike();
prdExample.ignoreCase();
Here, the property "name" is being queried against (where name like 'P%'), and if I were to remove or rename the field "name", we would know instantly. But what about the property "price"? It's being excluded because the Product object has some default value for it, so we're passing the "price" property name to an exclusion filter. Now if "price" got removed, this query would be syntactically invalid and you wouldn't know until runtime. LAME.
Another problem - what if we added a second where clause:
product.setPromo("Discounts up to 10%");
Because of the call to enableLike(), this example will match on the promo text "Discounts up to 10%", but also "Discounts up to 10,000,000 dollars" or anything else that matches. In general, the Example object's query-wide modifications, such as enableLike() or ignoreCase() aren't always going to be applicable to every property being checked against.
Here's a third, and major, issue - what about other special criteria? There's no way to get every product with a price greater than $10 using the standard example framework. There's no way to order results by promo, descending. If the Product object joined on some Manufacturer, there's no way to add a criterion on the related Manufacturer object either. There's no way to safely specify the FetchMode on the criteria for the Manufacturer either (although this is a problem with the Criteria API in general - invalid fetched relationships fail silently, even more of a time bomb)
For all of the above examples, you would need to go back to the Criteria API and use string representations of properties to make the query - again, eliminating the biggest benefit of Example queries.
What alternatives exist to the Example API that can get the kind of compile-time advice we need?
My company gives developers days when we can experiment and work on pet projects (à la Google), and I spent some time working on a framework that uses Example queries while getting around the limitations described above. I've come up with something that could be useful to other people interested in Example queries too. Here is a sample of the framework using the Product example.
Criteria criteriaQuery = session.createCriteria(Product.class);
Restrictions<Product> restrictions = Restrictions.create(Product.class);
Product example = restrictions.getQueryObject();
example.setName(restrictions.like("N%"));
example.setPromo("Discounts up to 10%");
restrictions.addRestrictions(criteriaQuery);
Here's an attempt to fix the issues in the code example from the question - the problem of the default value for the "price" field no longer exists, because this framework requires that criteria be explicitly set. The second problem of having a query-wide enableLike() is gone - the matcher is only on the "name" field.
The other problems mentioned in the question are also gone in this framework. Here are example implementations.
example.setPrice(restrictions.gt(10)); // price > 10
example.setPromo(restrictions.order(false)); // order by promo desc
Restrictions<Manufacturer> manufacturerRestrictions
        = Restrictions.create(Manufacturer.class);
// configure manufacturer restrictions in the same manner...
example.setManufacturer(restrictions.join(manufacturerRestrictions));
/* there are also joinSet() and joinList() methods
   for one-to-many relationships as well */
Even more sophisticated restrictions are available.
example.setPrice(restrictions.between(45, 55));
example.setManufacturer(restrictions.fetch(FetchMode.JOIN));
example.setName(restrictions.or("Foo", "Bar"));
After showing the framework to a coworker, he mentioned that many data mapped objects have private setters, making this kind of criteria setting difficult as well (a different problem with the Example API!). So, I've accounted for that too. Instead of using setters, getters are also queryable.
restrictions.is(example.getName()).eq("Foo");
restrictions.is(example.getPrice()).gt(10);
restrictions.is(example.getPromo()).order(false);
I've also added some extra checking on the objects to ensure better type safety. For example, the relative criteria (gt, ge, le, lt) all require a value of type ? extends Comparable for the parameter. Also, if you use a getter in the style specified above and there's a @Transient annotation present on the getter, it will throw a runtime error.
But wait, there's more!
If you like that Hibernate's built-in Restrictions utility can be statically imported, so that you can do things like criteria.addRestriction(eq("name", "foo")) without making your code really verbose, there's an option for that too.
Restrictions<Product> restrictions = new Restrictions<Product>() {
    public void query(Product queryObject) {
        queryObject.setPrice(gt(10));
        queryObject.setPromo(order(false));
        // gt() and order() inherited from Restrictions
    }
};
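For the curious, the getter-recording style could plausibly rest on an identity-keyed sentinel register; the following standalone sketch is illustrative only (it is not the hqbe2 source):

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

// Illustrative sketch: each matcher returns a fresh sentinel object and
// records the pending criterion against it by identity; the query object is
// then scanned by reflection to find which field each sentinel landed on.
public class SentinelDemo {
    public static class Recorder {
        final Map<Object, String> pending = new IdentityHashMap<>();

        public String like(String pattern) {
            String sentinel = new String(pattern);  // unique identity, same text
            pending.put(sentinel, "like '" + pattern + "'");
            return sentinel;
        }
    }

    public static class Product {
        public String name;  // public here only to keep the sketch short
    }

    // Map each recorded criterion back to the property it was assigned to.
    public static Map<String, String> resolve(Recorder r, Object queryObject) {
        Map<String, String> out = new HashMap<>();
        for (Field f : queryObject.getClass().getFields()) {
            try {
                String crit = r.pending.get(f.get(queryObject));  // identity lookup
                if (crit != null) out.put(f.getName(), crit);
            } catch (IllegalAccessException e) {
                throw new RuntimeException(e);
            }
        }
        return out;
    }
}
```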
That's it for now. Thank you very much in advance for any feedback! We've posted the code on SourceForge for those who are interested: http://sourceforge.net/projects/hqbe2/
The API looks great!
Restrictions.order(boolean) smells like control coupling. It's a little unclear what the values of the boolean argument represent.
I suggest replacing or supplementing with orderAscending() and orderDescending().
Have a look at Querydsl. Their JPA/Hibernate module requires code generation. Their Java collections module uses proxies but cannot be used with JPA/Hibernate at the moment.
I have a 2D array
public static class Status {
    public static String[][] Data = {
        { "FriendlyName","Value","Units","Serial","Min","Max","Mode","TestID","notes" },
        { "PIDs supported [01 – 20]:",null,"Binary","0",null,null,"1","0",null },
        { "Online Monitors since DTCs cleared:",null,"Binary","1",null,null,"1","1",null },
        { "Freeze DTC:",null,"NONE IN MODE 1","2",null,null,"1","2",null },
        // ... more rows ...
    };
}
I want to
SELECT "FriendlyName","Value" FROM Data WHERE "Mode" = "1" and "TestID" = "2"
How do I do it? The fastest execution time is important, because there could be hundreds of these lookups per minute.
Think about how general it needs to be. The solution for something truly as general as SQL probably doesn't look much like the solution for a few very specific queries.
As you present it, I'd be inclined to avoid the 2D array of strings and instead create a collection - probably an ArrayList, but if you're doing frequent insertions & deletions maybe a LinkedList would be more appropriate - of some struct-like class. So
List<MyThing> list = new ArrayList<MyThing>();
and index the fields on which you want to search using a HashMap:
Map<Integer, MyThing> modeIndex = new HashMap<Integer, MyThing>();
for (MyThing thing : list)
    modeIndex.put(thing.mode, thing);
Writing it down makes me realize that won't do in and of itself, because multiple things could have the same mode. So probably a multimap instead, or roll your own by making the value type of the map not MyThing but rather List<MyThing>. Google Collections has a fine multimap implementation.
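The multimap-of-buckets idea can be sketched with the JDK alone; MyThing and its fields below are illustrative stand-ins for the row class:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stdlib sketch of the multimap index: mode -> all things with that mode.
// A query like "Mode = 1 AND TestID = 2" then scans only one bucket
// instead of the whole data set.
public class ModeIndex {
    public static class MyThing {
        public final String friendlyName, value, mode, testID;

        public MyThing(String friendlyName, String value, String mode, String testID) {
            this.friendlyName = friendlyName;
            this.value = value;
            this.mode = mode;
            this.testID = testID;
        }
    }

    private final Map<String, List<MyThing>> byMode = new HashMap<>();

    public void add(MyThing t) {
        byMode.computeIfAbsent(t.mode, k -> new ArrayList<>()).add(t);
    }

    public List<MyThing> find(String mode, String testID) {
        List<MyThing> result = new ArrayList<>();
        for (MyThing t : byMode.getOrDefault(mode, Collections.emptyList()))
            if (t.testID.equals(testID)) result.add(t);
        return result;
    }
}
```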
This doesn't exactly answer your question, but it is possible to run some Java RDBMs with their tables entirely in your JVM's memory. For example, HSQLDB. This will give you the full power of SQL selects without the overheads of disc access. The only catch is that you won't be able to query a raw Java data structure like you are asking. You'll first have to insert the data into the DB's in-memory tables.
(I've not tried this ... perhaps someone could comment if this approach is really viable.)
As to your actual question: in C#, one would use LINQ (Language Integrated Query) for this, which takes advantage of the language's support for closures. Right now, with Java 6 as the latest official release, Java doesn't support closures, but they are planned for the upcoming Java 7. A Java equivalent of LINQ is likely going to be JaQue.
As to your actual problem, you're definitely using the wrong data structure for the job. Your best bet is to convert the String[][] into a List<Entity> and use the convenient searching/filtering APIs provided by Guava, as suggested by Carl Manaster. Iterables#filter() would be a good start.
EDIT: I took a look at your array, and I think this is definitely a job for an RDBMS. If you want the feel of an in-memory data structure (fast, no separate DB server), embedded in-memory databases like HSQLDB or H2 can provide it.
If you want good execution time, you MUST have a good data structure. If the data is just stored unordered in a 2D array, you'll mostly be stuck with O(n).
What you need are indexes, just like in other RDBMSs. For example, if you often use a WHERE clause like WHERE name='Brian' AND last_name='Smith', you could do something like this (pseudocode):
Set<Entry> everyEntry = ...; // the set that contains all the data
Map<String, Set<Entry>> indexedSet = new HashMap<String, Set<Entry>>();
for (String name : unionSetOfNames) {
    Set<Entry> subset = Iterables.collect(new HasName(name), everyEntry);
    indexedSet.put(name, subset);
}
// and later...
Set<Entry> brians = indexedSet.get("Brian");
Entry target = Iterables.find(new HasLastName("Smith"), brians);
(Please forgive me if the Guava API usage is wrong in the example code. It's pseudocode, but you get the idea.)
In the above code, you do one O(1) lookup, and then another O(n) lookup, but on a much, much smaller subset. So this can be more effective than doing an O(n) lookup on the entire set. If you use an ordered set, ordered by last_name, and use binary search, that lookup becomes O(log n). There are a bunch of data structures out there, and this is only a very simple example.
So in conclusion, if I were you, I'd define my own classes and build a data structure using the standard data structures available in the JDK. If that doesn't suffice, I might look at other data structures, but if it gets really complex, I'd just use an in-memory RDBMS like HSQLDB or H2. They are easy to embed, so they are quite close to having your own in-memory data structure. And as you do more and more complex things, chances are that option provides better performance.
Note also that I used the Google Guava library in my sample code. It is excellent, and I highly recommend it because it's so much nicer. Of course, don't forget to look at the java.util collections package too.
I ended up using a lookup table. 90% of the data is referenced from near the top.
public static int lookupReferenceInTable(String instanceMode, String instanceTID) {
    int[] modeMatches = getReferencesToMode(Integer.parseInt(instanceMode));
    return getReferenceFromPossibleMatches(modeMatches, instanceTID);
}

private static int getReferenceFromPossibleMatches(int[] modeMatches, String instanceTID) {
    instanceTID = instanceTID.trim();
    for (int row : modeMatches) {
        if (Data[row][DataTestID].equals(instanceTID)) {
            return row;
        }
    }
    return 0; // default when no match is found
}
It could be further optimized so that, instead of looping through all of the rows, it loops over one column until it finds a match, then the next, and so on. The data is laid out in a flowing and well-organized manner, so a lookup based on 3 criteria should only take a number of checks equal to the number of rows.