I need to store data in memory where I map one or more key strings to an object, as follows:
"green", "blue" -> object1
"red", "yellow" -> object2
So, in Java the datastructure might implement:
Map<Set<String>, V>
I need to be able to efficiently receive a list of objects, where the strings match some boolean criteria, such as:
("red" OR "green") AND NOT "blue"
I'm working in Java, so the ideal solution would be an off-the-shelf Java library. I am, however, willing to implement something from scratch if necessary.
Anyone have any ideas? I'd rather avoid the overhead of an in-memory database if possible, I'm hoping for something comparable in speed to a HashMap (or at least the same order of magnitude).
Actually, I liked the problem so I implemented a full solution in the spirit of my earlier answer:
http://pastebin.com/6iazSKG9
A simple solution, not thread safe or anything, but fun and a good starting point, I guess.
Edit: Some elaboration, as requested
See the unit test for usage.
There are two interfaces, DataStructure<K,V> and Query<V>. DataStructure behaves somewhat like a map (and in my implementation it actually works with an internal map), but it also provides reuseable and immutable query objects which can be combined like this:
Query<String> combinedQuery =
structure.and(
structure.or(
structure.search("blue"),
structure.search("red")
),
structure.not(
structure.search("green")
)
);
(A Query that searches for objects that are tagged as (blue OR red) AND NOT green). This query is reuseable, which means that it's results will change whenever the backing map is changed (kind of like an ITunes smart playlist).
The query objects are already thread safe, but the backing map is not, so there is some room for improvement here. Also, the queries could cache their results, but that would probably mean that the interface would have to be extended to provide for a purge method (kind of like the detach method in Wicket's models), which wouldn't be pretty.
As for licensing: if anybody wants this code I'll be happy to put it on SourceForge etc. ...
Sean
Would the criteria be amenable to bitmap indexing: http://en.wikipedia.org/wiki/Bitmap_index ?
I would say that the easiest way is simply to do a recursive filtering and being cleaver, when for instance evaluating X AND Y where X has been evaluated to the empty set.
The mapping however, needs to be from tags (such as "red" or "blue") to sets of objects.
The base case (resolving the atomic tags) of the recursion, would then be a simple lookup in this map. AND would be implemented using intersection, OR using union, and so on.
Check out the Apache Commons - Collections project. They have tons of great stuff that you will be able to use, particularly the CollectionUtils class for performing strong collection-based logic.
For instance, if your values were stored in a HashMap (as suggested by another answer) as follows:
myMap["green"] -> obj1
myMap["blue"] -> obj1
myMap["red"] -> obj2
myMap["yellow"] -> obj2
Then to retrieve results that match: ("red" or "green") and not "blue you might do this:
CollectionUtils.disjunction(CollectionUtils.union(myMap.get("red"), myMap.get("green")), myMap.get("blue"))
You could map string keys to a binary constant, then use bit shifting to produce an appropriate mask.
i truly think some type of database solution is your best bet. SQL easily supports querying data by
(X and Y) and not Z
this would have worked too reusable condition/expression classes
The Google Collections SetMultimap looks like an easy way to get the underlying structure, then combining that with the Maps static filters to get the querying behavior you want.
Construction would go something like
smmInstance.put(from1,to1);
smmInstance.put(from1,to2);
smmInstance.put(from2,to3);
smmInstance.put(from3,to1);
smmInstance.put(from1,to3);
//...
queries would then look like
valueFilter = //...build predicate
Set<FromType> result = Maps.filterValues(smmInstance.asMap(),valueFilter).keySet()
You can do any amount of fancy building the predicate, but Predicates has a few methods that would probably be enough to do contains/doesn't contain style queries.
I wasn't able to find a satisfactory solution, so I decided to cook up my own and release it as an open source (LGPL) project, find it here.
Related
Generally I've been quite a fan of using List.of / Arrays.asList, but unfortunately they don't accept nulls. The usecase I most often stumble upon this, is when dealing with DB calls, and parameters are abstracted to be passed via list. So it always have to be of certain length for any given procedure, and optional values are given as nulls.
I found Why does Map.of not allow null keys and values? which mentions List.of, but mainly talks about maps. And in the case of maps, I agree -- ambiguity of .get seems troublesome (key missing, or value intentionally null?), but this doesn't seem to apply to a list. If you get a null from a list, then you know for sure someone intentionally had put it there.
Some people might say "use optional", but I strongly believe it should be reserved for return types, and others seem to agree Uses for Optional
having an Optional in a class field or in a data structure, is considered a misuse of the API
What is the intended clean, standard solution? Does it really boil down to "use boilerplaty ArrayList initialization without these nice shorthand syntaxes" or "write your own util methods"?
Generally I've been quite a fan of using List.of / Arrays.asList, but unfortunately they don't accept nulls
This is incorrect:
List<String> list = Arrays.asList("a", "b", null);
String nullRef = list.get(2);
System.out.println(nullRef);
Works fine. List.of indeed doesn't accept null refs.
Some people might say "use optional"
Yeah, ignore those people.
What is the intended clean, standard solution?
Clean is a weaselword that is generally best read as: "I like it personally but am unwilling to just say that it's a taste preference" - it means nothing. You can't argue about taste, after all.
The standard solution, that is borderline objective language, we can work with that.
I think your API design that takes a list of db values is the problem. Presumably, you've written your own abstraction layer for this.
I'm having a hard time imaginine DB abstraction APIs where 'list of objects that includes null refs' is the right call.
If you're trying to weave prepared statement 'parameters', this would appear to be a nice API for that, and doesn't require lists at all:
db.select()
.append("SELECT a, b FROM c INNER JOIN d ON c.e = d.f ")
.append("WHERE (signup IS NULL OR signup < ?) ", LocalDate.now())
.append("AND LEFT(username, ?) = ?", 2, "AA")
.query()
If you're trying to abstract an insert/merge/upsert statement, you need some way of linking values with the column name that the value is a value for. This would suggest an API along the lines of:
db.insert("tableName")
.mergeOn("key", key)
.mergeOn("date", someDate)
.put("value", value)
.exec();
You can also re-use Map.of for this purpose, in which case 'I want the db to use the default value for this column' is done by simply not including that key/value pair in your map, thus sidestepping your needs to put null in such a map.
Perhaps if you show your specific API use-case, more insights can be provided.
If your question boils down to:
"I want to put null refs and use List.of, what is the standard clean way to do that"
then the answer is a rather obvious: There is nothing - as that does not work.
There is no --shut-up-List.of--let-me-add-nulls command line switch.
I have a 2D array
public static class Status{
public static String[][] Data= {
{ "FriendlyName","Value","Units","Serial","Min","Max","Mode","TestID","notes" },
{ "PIDs supported [01 – 20]:",null,"Binary","0",null,null,"1","0",null },
{ "Online Monitors since DTCs cleared:",null,"Binary","1",null,null,"1","1",null },
{ "Freeze DTC:",null,"NONE IN MODE 1","2",null,null,"1","2",null },
I want to
SELECT "FriendlyName","Value" FROM Data WHERE "Mode" = "1" and "TestID" = "2"
How do I do it? The fastest execution time is important because there could be hundreds of these per minute.
Think about how general it needs to be. The solution for something truly as general as SQL probably doesn't look much like the solution for a few very specific queries.
As you present it, I'd be inclined to avoid the 2D array of strings and instead create a collection - probably an ArrayList, but if you're doing frequent insertions & deletions maybe a LinkedList would be more appropriate - of some struct-like class. So
List<MyThing> list = new ArrayList<MyThing>();
and index the fields on which you want to search using a HashMap:
Map<Integer, MyThing> modeIndex = new HashMap<Integer, MyThing>()
for (MyThing thing : list)
modeIndex.put(thing.mode, thing);
Writing it down makes me realize that won't do, in and of itself, because multiple things could have the same mode. So probably a multimap instead - or roll your own by making the value type of the map not MyThing, but rather List. Google Collections has a fine multimap implementation.
This doesn't exactly answer your question, but it is possible to run some Java RDBMs with their tables entirely in your JVM's memory. For example, HSQLDB. This will give you the full power of SQL selects without the overheads of disc access. The only catch is that you won't be able to query a raw Java data structure like you are asking. You'll first have to insert the data into the DB's in-memory tables.
(I've not tried this ... perhaps someone could comment if this approach is really viable.)
As to your actual question, in C# they used to use LINQ (Language Integrated Query) for this, which takes benefit of the language's support for closures. Right now with Java 6 as the latest official release, Java doesn't support closures, but it's going to come in the shortly upcoming Java 7. The Java 7 based equivalent for LINQ is likely going to be JaQue.
As to your actual problem, you're definitely using a wrong datastructure for the job. Your best bet will be to convert the String[][] into a List<Entity> and using convenient searching/filtering API's provided by Guava, as suggested by Carl Manaster. The Iterables#filter() would be a good start.
EDIT: I took a look at your array, and I think this is definitely a job for RDBMS. If you want in-memory datastructure like features (fast/no need for DB server), embedded in-memory databases like HSQLDB, H2 can provide those.
If you want good execution time, you MUST have a good datastructure. If you just have data stored in a 2D array unordered, you'll be mostly stuck with O(n).
What you need is indexes for example, just like other RDBMS. For example, if you use a lot of WHERE clause like this WHERE name='Brian' AND last_name='Smith' you could do something like this (kind of a pseudocode):
Set<Entry> everyEntry = //the set that contains all data
Map<String, Set<Entry>> indexedSet = newMap();
for(String name : unionSetOfNames){
Set<Entry> subset = Iterables.collect(new HasName(name), everyEntries);
indexedSet.put(name, subset);
}
//and later...
Set<Entry> brians = indexedSet.get("Brian");
Entry target = Iterables.find(new HasLastName("Smith"),brians);
(Please forgive me if the Guava API usage is wrong in the example code (it's pseudo-code!, but you get the idea).
In the above code, you'll be doing a lookup of O(1) once, and then another O(n) lookup, but on a much much smaller subset. So this can be more effective than doing a O(n) lookup on the entire set, etc. If you use a ordered Set, ordered by the last_name and use binary search, that lookup will become O(log n). Things like that. There are bunch of datastructures out there and this is only a very simple example.
So in conclusion, if I were you, I'll define my own classes and create a datastructure using some standard datastructures available in JDK. If that doesn't suffice, I might look at some other datastructures out there, but if it gets really complex, I think I'd just use some in-memory RDBMS like HSQLDB or H2. They are easy to embed, so there are quite close to having your own in-memory datastructure. And as more and more you do complex stuff, chances are that that option provides better performance.
Note also that I used the Google Guava library in my sample code.. They are excellent, and I highly recommend to use them because it's so much nicer. Of course don't forget to look at the java.utli.collections package, too..
I ended up using a lookup table. 90% of the data is referenced from near the top.
public static int lookupReferenceInTable (String instanceMode, String instanceTID){
int ModeMatches[]=getReferencesToMode(Integer.parseInt(instanceMode));
int lineLookup = getReferenceFromPossibleMatches(ModeMatches, instanceTID);
return lineLookup;
}
private static int getReferenceFromPossibleMatches(int[] ModeMatches, String instanceTID) {
int counter = 0;
int match = 0;
instanceTID=instanceTID.trim();
while ( counter < ModeMatches.length ){
int x = ModeMatches[counter];
if (Data[x][DataTestID].equals(instanceTID)){
return ModeMatches[counter];
}
counter ++ ;
}
return match;
}
It can be further optimized so that instead of looping through all of the arrays it will loop on column until it finds a match, then loop the next, then the next. The data is laid out in a flowing and well organized manner so a lookup based on 3 criteria should only take a number of checks equal to the rows.
I'm looking for clever ways to build dynamic Java classes, that is classes where you can add/remove fields at runtime. Usage scenario: I have an editor where users should be able to add fields to the model at runtime or maybe even create the whole model at runtime.
Some design goals:
Type safe without casts if possible for custom code that works on the dynamic fields (that code would come from plugins which extend the model in unforeseen ways).
Good performance (can you beat HashMap? Maybe use an array and assign indexes to the fields during setup?)
Field "reuse" (i.e. if you use the same type of field in several places, it should be possible to define it once and then reuse it).
Calculated fields which depend on the value of other fields
Signals should be sent when fields change value (no necessarily via the Beans API)
"Automatic" parent child relations (when you add a child to a parent, then the parent pointer in the child should be set for "free").
Easy to understand
Easy to use
Note that this is a "think outside the circle" question. I'll post an example below to get you in the mood :-)
Type safe without casts if possible for custom code that works on the dynamic fields (that code would come from plugins which extend the model in unforeseen ways)
AFAIK, this is not possible. You can only get type-safety without type casts if you use static typing. Static typing means method signatures (in classes or interfaces) that are known at compile time.
The best you can do is have an interface with a bunch of methods like String getStringValue(String field), int getIntValue(String field) and so on. And of course you can only do that for a predetermined set of types. Any field whose type is not in that set will require a typecast.
The obvious answer is to use a HashMap (or a LinkedHashMap if you care for the order of fields). Then, you can add dynamic fields via a get(String name) and a set(String name, Object value) method.
This code can be implemented in a common base class. Since there are only a few methods, it's also simple to use delegation if you need to extend something else.
To avoid the casting issue, you can use a type-safe object map:
TypedMap map = new TypedMap();
String expected = "Hallo";
map.set( KEY1, expected );
String value = map.get( KEY1 ); // Look Ma, no cast!
assertEquals( expected, value );
List<String> list = new ArrayList<String> ();
map.set( KEY2, list );
List<String> valueList = map.get( KEY2 ); // Even with generics
assertEquals( list, valueList );
The trick here is the key which contains the type information:
TypedMapKey<String> KEY1 = new TypedMapKey<String>( "key1" );
TypedMapKey<List<String>> KEY2 = new TypedMapKey<List<String>>( "key2" );
The performance will be OK.
Field reuse is by using the same value type or by extending the key class of the type-safe object map with additional functionality.
Calculated fields could be implemented with a second map that stores Future instances which do the calculation.
Since all the manipulation happens in just two (or at least a few) methods, sending signals is simple and can be done any way you like.
To implement automatic parent/child handling, install a signal listener on the "set parent" signal of the child and then add the child to the new parent (and remove it from the old one if necessary).
Since no framework is used and no tricks are necessary, the resulting code should be pretty clean and easy to understand. Not using String as keys has the additional benefit that people won't litter the code with string literals.
So basically you're trying to create a new kind of object model with more dynamic properties, a bit like a dynamic language?
Might be worth looking at the source code for Rhino (i.e. Javascript implemented in Java), which faces a similar challenge of implementing a dynamic type system in Java.
Off the top of my head, I suspect you will find that internal HashMaps ultimately work best for your purposes.
I wrote a little game (Tyrant - GPL source available) using a similar sort of dynamic object model featuring HashMaps, it worked great and performance was not an issue. I used a few tricks in the get and set methods to allow dynamic property modifiers, I'm sure you could do the same kind of thing to implement your signals and parent/child relations etc.
[EDIT] See the source of BaseObject how it is implemented.
You can use the bytecode manipulation libraries for it. Shortcoming of this approach is that you need to do create own classloader to load changes in classes dynamically.
I do almost the same, it's pure Java solution:
Users generate their own models, which are stored as JAXB schema.
Schema is compiled in Java classes on the fly and stored in
user jars
All classes are forced to extend one "root" class, where you could put every extra functionality you want.
Appropriate classloaders are implemented with "model change"
listeners.
Speaking of performance (which is important in my case), you can hardly beat this solution. Reusability is the same of XML document.
The Facts
I have the following datastructure consisting of a table and a list of attributes (simplified):
class Table {
List<Attribute> m_attributes;
}
abstract class Attribute {}
class LongAttribute extends Attribute {}
class StringAttribute extends Attribute {}
class DateAttribute extends Attribute {}
...
Now I want to do different actions with this datastructure:
print it in XML notation
print it in textual form
create an SQL insert statement
create an SQL update statement
initialize it from a SQL result set
First Try
My first attempt was to put all these functionality inside the Attribute, but then the Attribute was overloaded with very different responsibilities.
Alternative
It feels like a visitor pattern could do the job very well instead, but on the other side it looks like overkill for this simple structure.
Question
What's the most elegant way to solve this?
I would look at using a combination of JAXB and Hibernate.
JAXB will let you marshall and unmarshall from XML. By default, properties are converted to elements with the same name as the property, but that can be controlled via #XmlElement and #XmlAttribute annotations.
Hibernate (or JPA) are the standard ways of moving data objects to and from a database.
The Command pattern comes to mind, or a small variation of it.
You have a bunch of classes, each of which is specialized to do a certain thing with your data class. You can keep these classes in a hashmap or some other structure where an external choice can pick one for execution. To do your thing, you call the selected Command's execute() method with your data as an argument.
Edit: Elaboration.
At the bottom level, you need to do something with each attribute of a data row.
This indeed sounds like a case for the Visitor pattern: Visitor simulates a double
dispatch operation, insofar as you are able to combine a variable "victim" object
with a variable "operation" encapsulated in a method.
Your attributes all want to be xml-ed, text-ed, insert-ed updat-ed and initializ-ed.
So you end up with a matrix of 5 x 3 classes to do each of these 5 operations
to each of 3 attribute types. The rest of the machinery of the visitor pattern
will traverse your list of attributes for you and apply the correct visitor for
the operation you chose in the right way for each attribute.
Writing 15 classes plus interface(s) does sound a little heavy. You can do this
and have a very general and flexible solution. On the other hand, in the time
you've spent thinking about a solution, you could have hacked together the code
to it for the currently known structure and crossed your fingers that the shape
of your classes won't change too much too often.
Where I thought of the command pattern was for choosing among a variety of similar
operations. If the operation to be performed came in as a String, perhaps in a
script or configuration file or such, you could then have a mapping from
"xml" -> XmlifierCommand
"text" -> TextPrinterCommand
"serial" -> SerializerCommand
...where each of those Commands would then fire up the appropriate Visitor to do
the job. But as the operation is more likely to be determined in code, you probably
don't need this.
I dunno why you'd store stuff in a database yourself these days instead of just using hibernate, but here's my call:
LongAttribute, DateAttribute, StringAttribute,… all have different internals (i.e. fields specific to them not present in Attribute class), so you cannot create one generic method to serialize them all. Now XML, SQL and plain text all have different properties when serializing to them. There's really no way you can avoid writing O(#subclasses of Attribute #output formats)* different methods of serializing.
Visitor is not a bad pattern for serializing. True, it's a bit overkill if used on non-recursive structures, but a random programmer reading your code will immediately grasp what it is doing.
Now for deserialization (from XML to object, from SQL to object) you need a Factory.
One more hint, for SQL update you probably want to have something that takes old version of the object, new version of the object and creates update query only on the difference between them.
In the end, I used the visitor pattern. Now looking back, it was a good choice.
Suppose you have a collection of a few hundred in-memory objects and you need to query this List to return objects matching some SQL or Criteria like query. For example, you might have a List of Car objects and you want to return all cars made during the 1960s, with a license plate that starts with AZ, ordered by the name of the car model.
I know about JoSQL, has anyone used this, or have any experience with other/homegrown solutions?
Filtering is one way to do this, as discussed in other answers.
Filtering is not scalable though. On the surface time complexity would appear to be O(n) (i.e. already not scalable if the number of objects in the collection will grow), but actually because one or more tests need to be applied to each object depending on the query, time complexity more accurately is O(n t) where t is the number of tests to apply to each object.
So performance will degrade as additional objects are added to the collection, and/or as the number of tests in the query increases.
There is another way to do this, using indexing and set theory.
One approach is to build indexes on the fields within the objects stored in your collection and which you will subsequently test in your query.
Say you have a collection of Car objects and every Car object has a field color. Say your query is the equivalent of "SELECT * FROM cars WHERE Car.color = 'blue'". You could build an index on Car.color, which would basically look like this:
'blue' -> {Car{name=blue_car_1, color='blue'}, Car{name=blue_car_2, color='blue'}}
'red' -> {Car{name=red_car_1, color='red'}, Car{name=red_car_2, color='red'}}
Then given a query WHERE Car.color = 'blue', the set of blue cars could be retrieved in O(1) time complexity. If there were additional tests in your query, you could then test each car in that candidate set to check if it matched the remaining tests in your query. Since the candidate set is likely to be significantly smaller than the entire collection, time complexity is less than O(n) (in the engineering sense, see comments below). Performance does not degrade as much, when additional objects are added to the collection. But this is still not perfect, read on.
Another approach, is what I would refer to as a standing query index. To explain: with conventional iteration and filtering, the collection is iterated and every object is tested to see if it matches the query. So filtering is like running a query over a collection. A standing query index would be the other way around, where the collection is instead run over the query, but only once for each object in the collection, even though the collection could be queried any number of times.
A standing query index would be like registering a query with some sort of intelligent collection, such that as objects are added to and removed from the collection, the collection would automatically test each object against all of the standing queries which have been registered with it. If an object matches a standing query then the collection could add/remove it to/from a set dedicated to storing objects matching that query. Subsequently, objects matching any of the registered queries could be retrieved in O(1) time complexity.
The information above is taken from CQEngine (Collection Query Engine). This basically is a NoSQL query engine for retrieving objects from Java collections using SQL-like queries, without the overhead of iterating through the collection. It is built around the ideas above, plus some more. Disclaimer: I am the author. It's open source and in maven central. If you find it helpful please upvote this answer!
I have used Apache Commons JXPath in a production application. It allows you to apply XPath expressions to graphs of objects in Java.
yes, I know it's an old post, but technologies appear everyday and the answer will change in the time.
I think this is a good problem to solve it with LambdaJ. You can find it here:
http://code.google.com/p/lambdaj/
Here you have an example:
LOOK FOR ACTIVE CUSTOMERS // (Iterable version)
List<Customer> activeCustomers = new ArrayList<Customer>();
for (Customer customer : customers) {
if (customer.isActive()) {
activeCusomers.add(customer);
}
}
LambdaJ version
List<Customer> activeCustomers = select(customers,
having(on(Customer.class).isActive()));
Of course, having this kind of beauty impacts in the performance (a little... an average of 2 times), but can you find a more readable code?
It has many many features, another example could be sorting:
Sort Iterative
List<Person> sortedByAgePersons = new ArrayList<Person>(persons);
Collections.sort(sortedByAgePersons, new Comparator<Person>() {
public int compare(Person p1, Person p2) {
return Integer.valueOf(p1.getAge()).compareTo(p2.getAge());
}
});
Sort with lambda
List<Person> sortedByAgePersons = sort(persons, on(Person.class).getAge());
Update: after java 8 you can use out of the box lambda expressions, like:
List<Customer> activeCustomers = customers.stream()
.filter(Customer::isActive)
.collect(Collectors.toList());
Continuing the Comparator theme, you may also want to take a look at the Google Collections API. In particular, they have an interface called Predicate, which serves a similar role to Comparator, in that it is a simple interface that can be used by a filtering method, like Sets.filter. They include a whole bunch of composite predicate implementations, to do ANDs, ORs, etc.
Depending on the size of your data set, it may make more sense to use this approach than a SQL or external relational database approach.
If you need a single concrete match, you can have the class implement Comparator, then create a standalone object with all the hashed fields included and use it to return the index of the match. When you want to find more than one (potentially) object in the collection, you'll have to turn to a library like JoSQL (which has worked well in the trivial cases I've used it for).
In general, I tend to embed Derby into even my small applications, use Hibernate annotations to define my model classes and let Hibernate deal with caching schemes to keep everything fast.
I would use a Comparator that takes a range of years and license plate pattern as input parameters. Then just iterate through your collection and copy the objects that match. You'd likely end up making a whole package of custom Comparators with this approach.
The Comparator option is not bad, especially if you use anonymous classes (so as not to create redundant classes in the project), but eventually when you look at the flow of comparisons, it's pretty much just like looping over the entire collection yourself, specifying exactly the conditions for matching items:
if (Car car : cars) {
if (1959 < car.getYear() && 1970 > car.getYear() &&
car.getLicense().startsWith("AZ")) {
result.add(car);
}
}
Then there's the sorting... that might be a pain in the backside, but luckily there's class Collections and its sort methods, one of which receives a Comparator...