Suppose you have a collection of a few hundred in-memory objects and you need to query this List to return objects matching some SQL or Criteria like query. For example, you might have a List of Car objects and you want to return all cars made during the 1960s, with a license plate that starts with AZ, ordered by the name of the car model.
I know about JoSQL; has anyone used it, or do you have experience with other or homegrown solutions?
Filtering is one way to do this, as discussed in other answers.
Filtering is not scalable though. On the surface the time complexity appears to be O(n) (i.e. already not scalable if the number of objects in the collection grows), but in fact, because one or more tests must be applied to each object depending on the query, the time complexity is more accurately O(n t), where t is the number of tests applied to each object.
So performance will degrade as additional objects are added to the collection, and/or as the number of tests in the query increases.
There is another way to do this, using indexing and set theory.
One approach is to build indexes on the fields within the objects stored in your collection and which you will subsequently test in your query.
Say you have a collection of Car objects and every Car object has a field color. Say your query is the equivalent of "SELECT * FROM cars WHERE Car.color = 'blue'". You could build an index on Car.color, which would basically look like this:
'blue' -> {Car{name=blue_car_1, color='blue'}, Car{name=blue_car_2, color='blue'}}
'red' -> {Car{name=red_car_1, color='red'}, Car{name=red_car_2, color='red'}}
Then given a query WHERE Car.color = 'blue', the set of blue cars could be retrieved in O(1) time complexity. If there were additional tests in your query, you could then test each car in that candidate set to check if it matched the remaining tests in your query. Since the candidate set is likely to be significantly smaller than the entire collection, time complexity is less than O(n) (in the engineering sense, see comments below). Performance does not degrade as much, when additional objects are added to the collection. But this is still not perfect, read on.
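A minimal sketch of such a field index in plain Java (the Car class and its getColor() accessor are assumed for illustration; this is not any particular library's API):

import java.util.*;

class CarColorIndex {
    // color -> set of cars having that color
    private final Map<String, Set<Car>> byColor = new HashMap<String, Set<Car>>();

    void add(Car car) {
        Set<Car> sameColor = byColor.get(car.getColor());
        if (sameColor == null) {
            sameColor = new HashSet<Car>();
            byColor.put(car.getColor(), sameColor);
        }
        sameColor.add(car);
    }

    // O(1) retrieval of the candidate set for "WHERE color = ?"
    Set<Car> withColor(String color) {
        Set<Car> result = byColor.get(color);
        return result != null ? result : Collections.<Car>emptySet();
    }
}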
Another approach, is what I would refer to as a standing query index. To explain: with conventional iteration and filtering, the collection is iterated and every object is tested to see if it matches the query. So filtering is like running a query over a collection. A standing query index would be the other way around, where the collection is instead run over the query, but only once for each object in the collection, even though the collection could be queried any number of times.
A standing query index would be like registering a query with some sort of intelligent collection, such that as objects are added to and removed from the collection, the collection would automatically test each object against all of the standing queries which have been registered with it. If an object matches a standing query then the collection could add/remove it to/from a set dedicated to storing objects matching that query. Subsequently, objects matching any of the registered queries could be retrieved in O(1) time complexity.
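A rough sketch of that idea (just an illustration, not CQEngine's actual API; the Query interface here is an assumption):

import java.util.*;

class StandingQueryCollection<T> {
    interface Query<Q> { boolean matches(Q candidate); }

    private final Collection<T> all = new ArrayList<T>();
    // each registered standing query keeps its own pre-computed result set
    private final Map<Query<T>, Set<T>> resultsByQuery = new LinkedHashMap<Query<T>, Set<T>>();

    void registerQuery(Query<T> query) {
        Set<T> matches = new HashSet<T>();
        for (T element : all) {
            if (query.matches(element)) matches.add(element);
        }
        resultsByQuery.put(query, matches);
    }

    void add(T element) {
        all.add(element);
        // test the new element once against every registered standing query
        for (Map.Entry<Query<T>, Set<T>> entry : resultsByQuery.entrySet()) {
            if (entry.getKey().matches(element)) entry.getValue().add(element);
        }
    }

    void remove(T element) {
        all.remove(element);
        for (Set<T> matches : resultsByQuery.values()) matches.remove(element);
    }

    // O(1) retrieval of objects matching a previously registered query
    Set<T> matching(Query<T> query) {
        Set<T> result = resultsByQuery.get(query);
        return result != null ? result : Collections.<T>emptySet();
    }
}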
The information above is taken from CQEngine (Collection Query Engine). This basically is a NoSQL query engine for retrieving objects from Java collections using SQL-like queries, without the overhead of iterating through the collection. It is built around the ideas above, plus some more. Disclaimer: I am the author. It's open source and in maven central. If you find it helpful please upvote this answer!
I have used Apache Commons JXPath in a production application. It allows you to apply XPath expressions to graphs of objects in Java.
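For example, something along these lines (a sketch from memory; the "garage" bean with a getCars() list property is assumed, and the exact XPath expression may need tweaking):

import org.apache.commons.jxpath.JXPathContext;
import java.util.Iterator;

// garage is a plain bean exposing getCars(), a List<Car>
JXPathContext context = JXPathContext.newContext(garage);
Iterator<?> matches = context.iterate(
        "cars[starts-with(license, 'AZ') and year >= 1960 and year < 1970]");
while (matches.hasNext()) {
    Car car = (Car) matches.next();
    // ...collect or process the matching cars
}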
Yes, I know it's an old post, but new technologies appear every day and the answer changes over time.
I think this is a good problem to solve with LambdaJ. You can find it here:
http://code.google.com/p/lambdaj/
Here you have an example:
Look for active customers (iterative version):
List<Customer> activeCustomers = new ArrayList<Customer>();
for (Customer customer : customers) {
    if (customer.isActive()) {
        activeCustomers.add(customer);
    }
}
LambdaJ version
List<Customer> activeCustomers = select(customers,
having(on(Customer.class).isActive()));
Of course, having this kind of beauty has a performance cost (a small one; roughly a factor of two on average), but can you find more readable code?
It has many, many features; another example is sorting:
Iterative sort:
List<Person> sortedByAgePersons = new ArrayList<Person>(persons);
Collections.sort(sortedByAgePersons, new Comparator<Person>() {
    public int compare(Person p1, Person p2) {
        return Integer.valueOf(p1.getAge()).compareTo(p2.getAge());
    }
});
Sort with LambdaJ:
List<Person> sortedByAgePersons = sort(persons, on(Person.class).getAge());
Update: since Java 8 you can use lambda expressions and the Stream API out of the box, like:
List<Customer> activeCustomers = customers.stream()
.filter(Customer::isActive)
.collect(Collectors.toList());
Continuing the Comparator theme, you may also want to take a look at the Google Collections API. In particular, they have an interface called Predicate, which serves a similar role to Comparator, in that it is a simple interface that can be used by a filtering method, like Sets.filter. They include a whole bunch of composite predicate implementations, to do ANDs, ORs, etc.
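For example, filtering the original Car question with predicates might look something like this (a sketch; the Car accessors are assumptions):

import java.util.Collection;
import com.google.common.base.Predicate;
import com.google.common.base.Predicates;
import com.google.common.collect.Collections2;

Predicate<Car> madeInTheSixties = new Predicate<Car>() {
    public boolean apply(Car car) {
        return car.getYear() >= 1960 && car.getYear() < 1970;
    }
};
Predicate<Car> azPlate = new Predicate<Car>() {
    public boolean apply(Car car) {
        return car.getLicense().startsWith("AZ");
    }
};
// live filtered view; copy it if you need a snapshot
Collection<Car> matches = Collections2.filter(cars, Predicates.and(madeInTheSixties, azPlate));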
Depending on the size of your data set, it may make more sense to use this approach than a SQL or external relational database approach.
If you need a single concrete match, you can have the class implement Comparator, then create a standalone object with all the hashed fields included and use it to return the index of the match. When you want to find more than one (potentially) object in the collection, you'll have to turn to a library like JoSQL (which has worked well in the trivial cases I've used it for).
In general, I tend to embed Derby into even my small applications, use Hibernate annotations to define my model classes and let Hibernate deal with caching schemes to keep everything fast.
I would use a Comparator that takes a range of years and license plate pattern as input parameters. Then just iterate through your collection and copy the objects that match. You'd likely end up making a whole package of custom Comparators with this approach.
The Comparator option is not bad, especially if you use anonymous classes (so as not to create redundant classes in the project), but eventually when you look at the flow of comparisons, it's pretty much just like looping over the entire collection yourself, specifying exactly the conditions for matching items:
for (Car car : cars) {
    if (1959 < car.getYear() && 1970 > car.getYear() &&
            car.getLicense().startsWith("AZ")) {
        result.add(car);
    }
}
Then there's the sorting... that might be a pain in the backside, but luckily there's class Collections and its sort methods, one of which receives a Comparator...
Related
My dataset looks like this:
Task-1, Priority1, (SkillA, SkillB)
Task-2, Priority2, (SkillA)
Task-3, Priority3, (SkillB, SkillC)
The calling application (client) will send in a list of skills, say (SkillD, SkillA).
Lookup:
Search through the dataset for SkillD first, and find nothing.
Search for SkillA. We will find two entries: Task-1 with Priority1 and Task-2 with Priority2.
Identify the task with the highest priority (in this case, Task-1).
Remove Task-1 from the dataset and return Task-1 to the client.
Design considerations:
There will be a lot of add/update/delete operations on the dataset when the website goes live.
There are only a few skills (about 10), but the list is not static; for each skill there can be thousands of tasks, so lookup/retrieval has to be extremely fast.
I have considered a simple List with binarySearch(comparator) or a Map<Skill, SortedSet<Task>>, but I'm looking for more ideas.
What is the best way to design a data structure for this kind of dataset, one that allows a complex key and a sorted collection of tasks associated with that key?
How about changing the approach a bit?
You can use Guava, and a Multimap in particular.
Every experienced Java programmer has, at one point or another, implemented a Map<K, List<V>> or Map<K, Set<V>>, and dealt with the awkwardness of that structure. For example, Map<K, Set<V>> is a typical way to represent an unlabeled directed graph. Guava's Multimap framework makes it easy to handle a mapping from keys to multiple values. A Multimap is a general way to associate keys with arbitrarily many values.
There are two ways to think of a Multimap conceptually: as a collection of mappings from single keys to single values, or as a mapping from unique keys to collections of values.
I would suggest having a Multimap from skill to tasks; the answer to your problem lies in a powerful feature of Multimap called Views.
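A sketch of what that could look like (the Task type and the example values are assumed from the question):

import java.util.Set;
import com.google.common.collect.HashMultimap;
import com.google.common.collect.SetMultimap;

SetMultimap<String, Task> tasksBySkill = HashMultimap.create();
tasksBySkill.put("SkillA", task1);   // Task-1 requires SkillA and SkillB
tasksBySkill.put("SkillB", task1);
tasksBySkill.put("SkillA", task2);   // Task-2 requires SkillA only

// all tasks needing SkillA; pick the highest-priority one from this set
Set<Task> candidates = tasksBySkill.get("SkillA");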
Good luck!
I would consider MongoDB. The data object for one of your rows sounds like a better fit for a JSON document than for a row in a table, because the skill-set list may grow. In a classic relational DB you solve this in one of three ways: have ever-expanding columns to make sure you have the maximum number of skill-set columns (this is very ugly), have a separate table that maps groupings of skill sets to an ID, or store the skill sets as a comma-delimited list. Each of these sucks. In MongoDB you can have array fields, and the items in the array are indexable.
So with this in mind I would do all the querying in MongoDB and let it deal with it all. I would create a POJO that would look like this:
public class TaskPriority {
    String taskId;
    String priorityId;
    List<String> skillIds;
}
In MongoDB you can index all these fields to get fast searching and querying.
If it is the case that you have to cache these items locally and do these queries off of Java data structures then what you can do is create an index for the items you care about that reference instances of the TaskPriority object.
For example to track skill sets to their TaskPriority's then the following Map can be used:
Map<String, TaskPriority> skillSetToTaskPriority;
You can repeat this for taskId and priorityId. You would have to manage these indexes. This is usually the job of your DB to do.
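A rough sketch of maintaining such an index by hand (assuming the TaskPriority POJO above; note that a skill can map to many tasks, so this variation keeps a list of TaskPriority per skill):

Map<String, List<TaskPriority>> bySkillId = new HashMap<String, List<TaskPriority>>();

void index(TaskPriority taskPriority) {
    for (String skillId : taskPriority.skillIds) {
        List<TaskPriority> forSkill = bySkillId.get(skillId);
        if (forSkill == null) {
            forSkill = new ArrayList<TaskPriority>();
            bySkillId.put(skillId, forSkill);
        }
        forSkill.add(taskPriority);
    }
}

void unindex(TaskPriority taskPriority) {
    for (String skillId : taskPriority.skillIds) {
        List<TaskPriority> forSkill = bySkillId.get(skillId);
        if (forSkill != null) {
            forSkill.remove(taskPriority);
        }
    }
}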
Finally, you can then have POJOs and tables (or MongoDB collections) that map the taskId to a Task object containing any metadata about that task that you may wish to have. The same is true for Priority and SkillSet. So that's four MongoDB collections: Tasks, Priorities, SkillSets, and TaskPriorities.
I need to implement an n:m relation in Java.
The use case is a catalog.
a product can be in multiple categories
a category can hold multiple products
My current solution is to have a mapping class that has two hashmaps.
The key of the first hashmap is the product id and the value is a list of category ids
The key to the second hashmap is the category id and the value is a list of product ids
This is totally redundant and I need a managing class that always takes care that the data is stored/deleted in both hashmaps.
But this is the only way I found to make the following performant in O(1):
which products does a category hold?
which categories is a product in?
I want to avoid full array scans or anything like that at all costs.
But there must be another, more elegant solution where I don't need to index the data twice.
Please enlighten me. I have only plain Java; no database, SQLite, or anything similar is available. I also don't really want to implement a B-tree structure if possible.
If you associate Categories with Products via a member collection, and vice versa, then you can accomplish the same thing:
public class Product {
    private Set<Category> categories = new HashSet<Category>();
    //implement hashCode and equals, potentially by id for extra performance
}

public class Category {
    private Set<Product> contents = new HashSet<Product>();
    //implement hashCode and equals, potentially by id for extra performance
}
The only difficult part is populating such a structure, where some intermediate maps might be needed.
But the approach of using auxiliary hashmaps/trees for indexing is not a bad one. After all, most indices placed on databases for example are auxiliary data structures: they coexist with the table of rows; the rows aren't necessarily organized in the structure of the index itself.
Using an external structure like this empowers you to keep optimizations and data separate from each other; that's not a bad thing. Especially if tomorrow you want to add O(1) look-ups for Products given a Vendor, e.g.
Edit: By the way, it looks like what you want is an implementation of a Multimap optimized to do reverse lookups in O(1) as well. I don't think Guava has something to do that, but you could implement the Multimap interface so at least you don't have to deal with maintaining the HashMaps separately. Actually it's more like a BiMap that is also a Multimap which is contradictory given their definitions. I agree with MStodd that you probably want to roll your own layer of abstraction to encapsulate the two maps.
Your solution is perfectly good. Remember that putting an object into a HashMap doesn't make a copy of the Object, it just stores a reference to it, so the cost in time and memory is quite small.
I would go with your first solution. Have a layer of abstraction around two hashmaps. If you're worried about concurrency, implement appropriate locking for CRUD.
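A minimal sketch of such an abstraction layer over the two hashmaps (class and method names are made up for illustration; locking is omitted):

import java.util.*;

public class ProductCategoryIndex {
    private final Map<Integer, Set<Integer>> categoriesByProduct = new HashMap<Integer, Set<Integer>>();
    private final Map<Integer, Set<Integer>> productsByCategory = new HashMap<Integer, Set<Integer>>();

    // one call keeps both directions in sync
    public void link(int productId, int categoryId) {
        get(categoriesByProduct, productId).add(categoryId);
        get(productsByCategory, categoryId).add(productId);
    }

    public void unlink(int productId, int categoryId) {
        get(categoriesByProduct, productId).remove(categoryId);
        get(productsByCategory, categoryId).remove(productId);
    }

    public Set<Integer> categoriesOf(int productId) {   // O(1)
        return get(categoriesByProduct, productId);
    }

    public Set<Integer> productsIn(int categoryId) {    // O(1)
        return get(productsByCategory, categoryId);
    }

    private static Set<Integer> get(Map<Integer, Set<Integer>> map, int key) {
        Set<Integer> values = map.get(key);
        if (values == null) {
            values = new HashSet<Integer>();
            map.put(key, values);
        }
        return values;
    }
}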
If you're able to use an immutable data structure, Guava's ImmutableMultimap offers an inverse() method, which enables you to get a collection of keys by value.
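For example (a sketch, assuming integer ids for products and categories):

import com.google.common.collect.ImmutableSetMultimap;

ImmutableSetMultimap<Integer, Integer> categoriesByProduct =
        ImmutableSetMultimap.<Integer, Integer>builder()
                .putAll(1, 10, 20)   // product 1 is in categories 10 and 20
                .putAll(2, 20)
                .build();

// lookups in both directions
ImmutableSetMultimap<Integer, Integer> productsByCategory = categoriesByProduct.inverse();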
I'm in a position where our company has a database search service that is highly configurable, for which it's very useful to configure queries programmatically. The Criteria API is powerful, but when one of our developers refactors one of the data objects, the criteria restrictions won't signal that they're broken until we run our unit tests or, worse, until we're live on our production environment. Recently, a refactoring project essentially doubled in working time unexpectedly because of this problem, a gap in project planning that, had we known how long it would really take, would probably have led us to take an alternative approach.
I'd like to use the Example API to solve this problem. The Java compiler can loudly indicate that our queries are borked if we are specifying 'where' conditions on real POJO properties. However, there's only so much functionality in the Example API, and it's limiting in many ways. Take the following example:
Product product = new Product();
product.setName("P%");
Example prdExample = Example.create(product);
prdExample.excludeProperty("price");
prdExample.enableLike();
prdExample.ignoreCase();
Here, the property "name" is being queried against (where name like 'P%'), and if I were to remove or rename the field "name", we would know instantly. But what about the property "price"? It's being excluded because the Product object has some default value for it, so we're passing the "price" property name to an exclusion filter. Now if "price" got removed, this query would be syntactically invalid and you wouldn't know until runtime. LAME.
Another problem - what if we added a second where clause:
product.setPromo("Discounts up to 10%");
Because of the call to enableLike(), this example will match on the promo text "Discounts up to 10%", but also "Discounts up to 10,000,000 dollars" or anything else that matches. In general, the Example object's query-wide modifications, such as enableLike() or ignoreCase() aren't always going to be applicable to every property being checked against.
Here's a third, and major, issue - what about other special criteria? There's no way to get every product with a price greater than $10 using the standard example framework. There's no way to order results by promo, descending. If the Product object joined on some Manufacturer, there's no way to add a criterion on the related Manufacturer object either. There's no way to safely specify the FetchMode on the criteria for the Manufacturer either (although this is a problem with the Criteria API in general - invalid fetched relationships fail silently, even more of a time bomb)
For all of the above examples, you would need to go back to the Criteria API and use string representations of properties to make the query - again, eliminating the biggest benefit of Example queries.
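For contrast, here is roughly what the string-based Criteria version of that kind of query looks like, using Hibernate's org.hibernate.criterion.Restrictions (property names are from the Product example; a typo in any of these strings would only show up at runtime):

List results = session.createCriteria(Product.class)
        .add(Restrictions.like("name", "P%"))
        .add(Restrictions.gt("price", 10))
        .addOrder(Order.desc("promo"))
        .createCriteria("manufacturer")
            .add(Restrictions.eq("name", "Acme"))
        .list();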
What alternatives exist to the Example API that can get the kind of compile-time advice we need?
My company gives developers days when we can experiment and work on pet projects (a la Google), and I spent some time working on a framework to use Example queries while getting around the limitations described above. I've come up with something that could be useful to other people interested in Example queries too. Here is a sample of the framework using the Product example.
Criteria criteriaQuery = session.createCriteria(Product.class);
Restrictions<Product> restrictions = Restrictions.create(Product.class);
Product example = restrictions.getQueryObject();
example.setName(restrictions.like("N%"));
example.setPromo("Discounts up to 10%");
restrictions.addRestrictions(criteriaQuery);
Here's an attempt to fix the issues in the code example from the question - the problem of the default value for the "price" field no longer exists, because this framework requires that criteria be explicitly set. The second problem of having a query-wide enableLike() is gone - the matcher is only on the "name" field.
The other problems mentioned in the question are also gone in this framework. Here are example implementations.
product.setPrice(restrictions.gt(10)); // price > 10
product.setPromo(restrictions.order(false)); // order by promo desc
Restrictions<Manufacturer> manufacturerRestrictions
= Restrictions.create(Manufacturer.class);
//configure manuf restrictions in the same manner...
product.setManufacturer(restrictions.join(manufacturerRestrictions));
/* there are also joinSet() and joinList() methods
for one-to-many relationships as well */
Even more sophisticated restrictions are available.
product.setPrice(restrictions.between(45,55));
product.setManufacturer(restrictions.fetch(FetchMode.JOIN));
product.setName(restrictions.or("Foo", "Bar"));
After showing the framework to a coworker, he mentioned that many data mapped objects have private setters, making this kind of criteria setting difficult as well (a different problem with the Example API!). So, I've accounted for that too. Instead of using setters, getters are also queryable.
restrictions.is(product.getName()).eq("Foo");
restrictions.is(product.getPrice()).gt(10);
restrictions.is(product.getPromo()).order(false);
I've also added some extra checking on the objects to ensure better type safety - for example, the relative criteria (gt, ge, le, lt) all require a value of type ? extends Comparable for the parameter. Also, if you use a getter in the style specified above and there's an @Transient annotation present on the getter, it will throw a runtime error.
But wait, there's more!
If you like that Hibernate's built-in Restrictions utility can be statically imported, so that you can do things like criteria.addRestriction(eq("name", "foo")) without making your code really verbose, there's an option for that too.
Restrictions<Product> restrictions = new Restrictions<Product>() {
    public void query(Product queryObject) {
        queryObject.setPrice(gt(10));
        queryObject.setPromo(order(false));
        //gt() and order() inherited from Restrictions
    }
};
That's it for now - thank you very much in advance for any feedback! We've posted the code on Sourceforge for those that are interested. http://sourceforge.net/projects/hqbe2/
The API looks great!
Restrictions.order(boolean) smells like control coupling. It's a little unclear what the values of the boolean argument represent.
I suggest replacing or supplementing with orderAscending() and orderDescending().
Have a look at Querydsl. Their JPA/Hibernate module requires code generation. Their Java collections module uses proxies but cannot be used with JPA/Hibernate at the moment.
I have a 2D array
public static class Status {
    public static String[][] Data = {
        { "FriendlyName", "Value", "Units", "Serial", "Min", "Max", "Mode", "TestID", "notes" },
        { "PIDs supported [01 – 20]:", null, "Binary", "0", null, null, "1", "0", null },
        { "Online Monitors since DTCs cleared:", null, "Binary", "1", null, null, "1", "1", null },
        { "Freeze DTC:", null, "NONE IN MODE 1", "2", null, null, "1", "2", null },
        // ... more rows ...
    };
}
I want to
SELECT "FriendlyName","Value" FROM Data WHERE "Mode" = "1" and "TestID" = "2"
How do I do it? The fastest execution time is important because there could be hundreds of these per minute.
Think about how general it needs to be. The solution for something truly as general as SQL probably doesn't look much like the solution for a few very specific queries.
As you present it, I'd be inclined to avoid the 2D array of strings and instead create a collection - probably an ArrayList, but if you're doing frequent insertions & deletions maybe a LinkedList would be more appropriate - of some struct-like class. So
List<MyThing> list = new ArrayList<MyThing>();
and index the fields on which you want to search using a HashMap:
Map<Integer, MyThing> modeIndex = new HashMap<Integer, MyThing>();
for (MyThing thing : list)
    modeIndex.put(thing.mode, thing);
Writing it down makes me realize that won't do, in and of itself, because multiple things could have the same mode. So probably a multimap instead - or roll your own by making the value type of the map not MyThing, but rather List<MyThing>. Google Collections has a fine multimap implementation.
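A sketch with Google Collections' Multimap (MyThing and its mode field are the assumed names from above):

import java.util.Collection;
import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;

Multimap<Integer, MyThing> modeIndex = ArrayListMultimap.create();
for (MyThing thing : list) {
    modeIndex.put(thing.mode, thing);
}

// all things with mode 1, without scanning the whole list
Collection<MyThing> mode1Things = modeIndex.get(1);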
This doesn't exactly answer your question, but it is possible to run some Java RDBMSs with their tables entirely in your JVM's memory. For example, HSQLDB. This will give you the full power of SQL selects without the overhead of disc access. The only catch is that you won't be able to query a raw Java data structure like you are asking. You'll first have to insert the data into the DB's in-memory tables.
(I've not tried this ... perhaps someone could comment if this approach is really viable.)
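For what it's worth, a sketch of what that might look like with HSQLDB (the in-memory JDBC URL form is standard; the table and column names are just for illustration, and exception handling is omitted):

import java.sql.*;

Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:statusdb", "SA", "");
Statement ddl = conn.createStatement();
ddl.execute("CREATE TABLE status (friendly_name VARCHAR(100), value VARCHAR(50), "
        + "units VARCHAR(50), mode VARCHAR(10), test_id VARCHAR(10))");

// ... insert one row per entry of the in-memory data ...

PreparedStatement query = conn.prepareStatement(
        "SELECT friendly_name, value FROM status WHERE mode = ? AND test_id = ?");
query.setString(1, "1");
query.setString(2, "2");
ResultSet rs = query.executeQuery();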
As to your actual question, in C# you would use LINQ (Language Integrated Query) for this, which takes advantage of the language's support for closures. Right now, with Java 6 as the latest official release, Java doesn't support closures, but they are expected in the upcoming Java 7. The Java 7 based equivalent of LINQ is likely going to be JaQue.
As to your actual problem, you're definitely using the wrong data structure for the job. Your best bet will be to convert the String[][] into a List<Entity> and use the convenient searching/filtering APIs provided by Guava, as suggested by Carl Manaster. Iterables#filter() would be a good start.
EDIT: I took a look at your array, and I think this is definitely a job for an RDBMS. If you want in-memory-data-structure-like features (fast, no need for a DB server), embedded in-memory databases like HSQLDB or H2 can provide them.
If you want good execution time, you MUST have a good data structure. If your data is just stored unordered in a 2D array, you'll mostly be stuck with O(n).
What you need are indexes, just like in any other RDBMS. For example, if you run a lot of WHERE clauses like WHERE name='Brian' AND last_name='Smith', you could do something like this:
Set<Entry> everyEntry = //the set that contains all data
Map<String, Set<Entry>> indexedSet = new HashMap<String, Set<Entry>>();
for (String name : unionSetOfNames) {
    Set<Entry> subset = Sets.newHashSet(Iterables.filter(everyEntry, new HasName(name)));
    indexedSet.put(name, subset);
}
//and later...
Set<Entry> brians = indexedSet.get("Brian");
Entry target = Iterables.find(brians, new HasLastName("Smith"));
(HasName and HasLastName are assumed to be simple Guava Predicate implementations; please forgive me if the API usage isn't exact, but you get the idea.)
In the above code, you'll do one O(1) lookup and then another O(n) lookup, but on a much, much smaller subset. So this can be more effective than doing an O(n) lookup over the entire set. If you use an ordered Set, ordered by last_name, and use binary search, that lookup becomes O(log n). Things like that. There are a bunch of data structures out there, and this is only a very simple example.
So in conclusion, if I were you, I'd define my own classes and create a data structure using some of the standard data structures available in the JDK. If that doesn't suffice, I might look at other data structures out there, but if it gets really complex, I think I'd just use an in-memory RDBMS like HSQLDB or H2. They are easy to embed, so they are quite close to having your own in-memory data structure. And as you do more and more complex things, chances are that that option provides better performance.
Note also that I used the Google Guava library in my sample code. It's excellent, and I highly recommend using it because it's so much nicer. Of course, don't forget to look at the java.util collections, too.
I ended up using a lookup table. 90% of the data is referenced from near the top.
public static int lookupReferenceInTable(String instanceMode, String instanceTID) {
    // getReferencesToMode returns the row indexes whose Mode column matches
    int[] modeMatches = getReferencesToMode(Integer.parseInt(instanceMode));
    int lineLookup = getReferenceFromPossibleMatches(modeMatches, instanceTID);
    return lineLookup;
}

private static int getReferenceFromPossibleMatches(int[] modeMatches, String instanceTID) {
    int match = 0;
    instanceTID = instanceTID.trim();
    // scan the candidate rows until the TestID column matches
    for (int counter = 0; counter < modeMatches.length; counter++) {
        int x = modeMatches[counter];
        if (Data[x][DataTestID].equals(instanceTID)) {
            return modeMatches[counter];
        }
    }
    return match;
}
It can be further optimized so that instead of looping through all of the rows, it loops on one column until it finds a match, then the next column, and so on. The data is laid out in a flowing and well-organized manner, so a lookup based on three criteria should only take a number of checks equal to the number of rows.
I need to store data in memory where I map one or more key strings to an object, as follows:
"green", "blue" -> object1
"red", "yellow" -> object2
So, in Java the datastructure might implement:
Map<Set<String>, V>
I need to be able to efficiently retrieve a list of objects where the strings match some boolean criteria, such as:
("red" OR "green") AND NOT "blue"
I'm working in Java, so the ideal solution would be an off-the-shelf Java library. I am, however, willing to implement something from scratch if necessary.
Anyone have any ideas? I'd rather avoid the overhead of an in-memory database if possible, I'm hoping for something comparable in speed to a HashMap (or at least the same order of magnitude).
Actually, I liked the problem so I implemented a full solution in the spirit of my earlier answer:
http://pastebin.com/6iazSKG9
A simple solution, not thread safe or anything, but fun and a good starting point, I guess.
Edit: Some elaboration, as requested
See the unit test for usage.
There are two interfaces, DataStructure<K,V> and Query<V>. DataStructure behaves somewhat like a map (and in my implementation it actually works with an internal map), but it also provides reusable and immutable query objects which can be combined like this:
Query<String> combinedQuery =
    structure.and(
        structure.or(
            structure.search("blue"),
            structure.search("red")
        ),
        structure.not(
            structure.search("green")
        )
    );
(A Query that searches for objects that are tagged as (blue OR red) AND NOT green). This query is reusable, which means that its results will change whenever the backing map is changed (kind of like an iTunes smart playlist).
The query objects are already thread safe, but the backing map is not, so there is some room for improvement here. Also, the queries could cache their results, but that would probably mean that the interface would have to be extended to provide for a purge method (kind of like the detach method in Wicket's models), which wouldn't be pretty.
As for licensing: if anybody wants this code I'll be happy to put it on SourceForge etc. ...
Sean
Would the criteria be amenable to bitmap indexing: http://en.wikipedia.org/wiki/Bitmap_index ?
I would say that the easiest way is simply to do recursive filtering, and to be clever about it, for instance when evaluating X AND Y where X has already been evaluated to the empty set.
The mapping, however, needs to be from tags (such as "red" or "blue") to sets of objects.
The base case (resolving the atomic tags) of the recursion, would then be a simple lookup in this map. AND would be implemented using intersection, OR using union, and so on.
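A sketch of that set-based evaluation over a tag-to-objects map (the method names here are made up; a real implementation would recurse over an expression tree):

import java.util.*;

// tag -> objects carrying that tag, e.g. "blue" -> {object1}
Map<String, Set<Object>> byTag = new HashMap<String, Set<Object>>();

Set<Object> evalTag(String tag) {                       // base case: atomic tag lookup
    Set<Object> result = byTag.get(tag);
    return result != null ? result : Collections.<Object>emptySet();
}

Set<Object> evalAnd(Set<Object> left, Set<Object> right) {
    if (left.isEmpty()) return Collections.<Object>emptySet();  // short-circuit on empty operand
    Set<Object> result = new HashSet<Object>(left);
    result.retainAll(right);                            // intersection
    return result;
}

Set<Object> evalOr(Set<Object> left, Set<Object> right) {
    Set<Object> result = new HashSet<Object>(left);
    result.addAll(right);                               // union
    return result;
}

Set<Object> evalNot(Set<Object> universe, Set<Object> operand) {
    Set<Object> result = new HashSet<Object>(universe);
    result.removeAll(operand);                          // complement within the universe
    return result;
}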
Check out the Apache Commons - Collections project. They have tons of great stuff that you will be able to use, particularly the CollectionUtils class for performing strong collection-based logic.
For instance, if your values were stored in a HashMap (as suggested by another answer) as follows:
myMap["green"] -> obj1
myMap["blue"] -> obj1
myMap["red"] -> obj2
myMap["yellow"] -> obj2
Then to retrieve results that match ("red" or "green") and not "blue", you might do this:
CollectionUtils.subtract(CollectionUtils.union(myMap.get("red"), myMap.get("green")), myMap.get("blue"))
You could map string keys to a binary constant, then use bit shifting to produce an appropriate mask.
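A sketch of that idea (the tag-to-bit assignments here are arbitrary):

// assign each tag one bit
static final int GREEN  = 1 << 0;
static final int BLUE   = 1 << 1;
static final int RED    = 1 << 2;
static final int YELLOW = 1 << 3;

// each object stores the mask of its tags, e.g. object1 -> GREEN | BLUE
int object1Tags = GREEN | BLUE;
int object2Tags = RED | YELLOW;

// ("red" OR "green") AND NOT "blue"
boolean matches(int tags) {
    return (tags & (RED | GREEN)) != 0 && (tags & BLUE) == 0;
}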
I truly think some type of database solution is your best bet. SQL easily supports querying data by
(X and Y) and not Z
This would have worked too: reusable condition/expression classes.
The Google Collections SetMultimap looks like an easy way to get the underlying structure, then combining that with the Maps static filters to get the querying behavior you want.
Construction would go something like
smmInstance.put(from1,to1);
smmInstance.put(from1,to2);
smmInstance.put(from2,to3);
smmInstance.put(from3,to1);
smmInstance.put(from1,to3);
//...
Queries would then look like:
valueFilter = //...build predicate
Set<FromType> result = Maps.filterValues(smmInstance.asMap(), valueFilter).keySet();
You can do any amount of fancy building the predicate, but Predicates has a few methods that would probably be enough to do contains/doesn't contain style queries.
I wasn't able to find a satisfactory solution, so I decided to cook up my own and release it as an open-source (LGPL) project; find it here.