Using Java, I've got a source data set of integers, it's big but not huge - let's say it won't get bigger than 30,000 values.
Using the source dataset I have some summary values I want to create (these are domain specific so not something you'll find in a library such as Apache Math).
There is a relationship between the summary values like this:
[source data] -> summary1 -> summary2 -> summary3
\ ^
\____________________|
I don't want to over-engineer the solution, but I do expect in future that there may be additional summary values that will build upon this graph. Currently my solution involves having a domain object that has a 'getter' for each summary and merely checks if it has already been computed, and compute-stores it if needed. This works fine, but I don't like having all this compute logic in my domain object.
It feels to me like this could be represented as more of a key->calculator design where results are stored in a map and calculators know which "keys" they need. Before I go off and implement something like this it's very hard to imagine someone hasn't already done this (a thousand times).
Can anyone advise me on idioms or any libraries that would be worth looking at for this kind of problem space? I'm familiar with things like JGraph, but I don't believe this will let me associate a calculator with a node; it will merely provide a graph model. Perhaps this is more a problem for a caching library?
The key -> calculator idea looks like a typical application of a loading cache (aka auto populating, aka read through). An example to do it with cache2k:
Cache<Key, Integer> summary1cache = new Cache2kBuilder<Key, Integer>() {}
    .loader(this::calculateSummary1)
    .build();

int calculateSummary1(Key key) {
    ...
}
To achieve the best performance I recommend one cache per summary type. The user guide has more information about cache loaders / read through.
You can do exactly the same with other caches, e.g. Guava Cache or Caffeine.
An alternative pattern is Map.computeIfAbsent(key, function). However, if the loader function is known from the start I recommend configuring the cache with it.
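For a handful of summaries, the key -> calculator idea can also be sketched with plain JDK collections. The following is only an illustration under assumptions: the names `SummaryGraph`, `register`, and `get` are hypothetical, and the three formulas are placeholders for the real domain-specific calculations. Each calculator asks the graph for the keys it depends on, and results are stored on first access.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical key -> calculator registry; results are memoized per key. */
class SummaryGraph {
    private final List<Integer> source;
    private final Map<String, ToIntFunction<SummaryGraph>> calculators = new HashMap<>();
    private final Map<String, Integer> results = new HashMap<>();

    SummaryGraph(List<Integer> source) {
        this.source = source;
    }

    List<Integer> source() {
        return source;
    }

    /** Registers the calculator for a key; it may call get() for the keys it depends on. */
    void register(String key, ToIntFunction<SummaryGraph> calculator) {
        calculators.put(key, calculator);
    }

    /** Returns the stored value, computing and storing it on first access. */
    int get(String key) {
        Integer cached = results.get(key);
        if (cached != null) {
            return cached;
        }
        ToIntFunction<SummaryGraph> calculator = calculators.get(key);
        if (calculator == null) {
            throw new IllegalArgumentException("No calculator registered for " + key);
        }
        // Explicit get/put instead of computeIfAbsent: the calculator re-enters
        // get() for its dependencies, and HashMap.computeIfAbsent does not allow
        // the mapping function to modify the map.
        int value = calculator.applyAsInt(this);
        results.put(key, value);
        return value;
    }
}

class SummaryGraphDemo {
    static SummaryGraph build(List<Integer> data) {
        SummaryGraph graph = new SummaryGraph(data);
        // Placeholder formulas (assumptions); the real summaries are domain specific.
        graph.register("summary1", g -> g.source().stream().mapToInt(Integer::intValue).sum());
        graph.register("summary2", g -> g.get("summary1") * 2);
        graph.register("summary3", g -> g.get("summary2") + g.source().size());
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(build(List.of(1, 2, 3)).get("summary3"));
    }
}
```

This sketch has no cycle detection and is not thread-safe; a loading cache takes care of concerns such as concurrent loads for you.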
Disclaimer: I'm not 100% sure whether this is the best solution, since it is not totally clear from the question how many different keys / summaries you'll have and what the access pattern looks like.
Related
I am fairly new to DS and Algorithms and recently at a job interview I was asked a question on performance tuning along with code. We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure. So which Java feature/library can we use to do the searching in the quickest time possible ?
On the spot I could not think of exact answer so I wrote that:
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
How can I understand the exact answer to this question and what can be the optimal solution(s) ?
After reading the question and getting clarification in the comments, I think what has become apparent to me is that: you needed to ask follow-up questions.
I'll try to break it down and provide comments that I hope will be helpful, because I also know what it's like to be "in the moment" and how nerves can stab you in the back when you least need them to.
We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure.
I think a good follow-up question here would've been:
Q: What specific data structure is being used to contain all this data?
I would press until they give me an actual name and explain why it is not possible to name a Java algorithm/library. For all you know, the data structure could've been String[], a Set<String>, or even a fancy name for a file on disk (if they're trying to throw you off). They could've also clarified and said the DS was not relevant and that you could pick whichever DS you thought was best.
The wording also implies that they implemented the structure and that it's already populated in a system with, presumably, enough memory to hold all of it. Asking to confirm that this is really the case could've given you helpful information.
For example: "Based on the wording, it seems this mystery data structure is already implemented and fully populated in memory in a system with enough memory to hold it. Can you confirm my understanding here is correct? If not, could you clarify further?"
Given the suggested wording, and the fact that we don't have additional clarifications to go from, I will assume, for the purposes of this answer, that my suppositions are indeed correct.
Note that if you had been asked to design the data structure to hold all of this info, you would've had to ask very different questions, take memory constraints into account, and perhaps even ask about character sets/encodings (e.g. ASCII vs multi-byte Unicode).
Also, if you had been asked to design the search algorithm, then knowing the DS is a pre-requisite, and not knowing this could've made the task impossible. For example, the binary search algorithm implementation will look very different if you're working on an array vs a binary search tree, even though both would offer O(lg n) time complexity.
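To make the array-versus-tree point concrete, here is a small sketch using only built-in JDK classes. The class and method names (`SortedLookupDemo`, `inSortedArray`, `inTree`) are made up for illustration; `Arrays.binarySearch` and `TreeSet` are the standard library pieces doing the work, and both lookups take logarithmic time.

```java
import java.util.Arrays;
import java.util.TreeSet;

class SortedLookupDemo {
    /** Binary search over a sorted array; a non-negative result is the index of a match. */
    static boolean inSortedArray(String[] sortedWords, String word) {
        return Arrays.binarySearch(sortedWords, word) >= 0;
    }

    /** Lookup in a balanced search tree (TreeSet is backed by a red-black tree). */
    static boolean inTree(TreeSet<String> words, String word) {
        return words.contains(word);
    }

    public static void main(String[] args) {
        String[] sorted = {"apple", "banana", "cherry"};
        TreeSet<String> tree = new TreeSet<>(Arrays.asList(sorted));
        System.out.println(inSortedArray(sorted, "banana"));
        System.out.println(inTree(tree, "banana"));
    }
}
```

Note that `Arrays.binarySearch` only gives meaningful results if the array is already sorted, which is exactly the kind of detail the follow-up question about the data structure would settle.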
So which java feature/library can we use to do the searching in the quickest time possible?
Consistent with the 1st part, this question only asks what pre-existing/built-in Java code you would choose to perform the search for you. The "quickest time possible" here should make you think about solutions that are in O(1), i.e. are constant time. However, the data structure may open/close doors for you.
Some search algorithms in Java work on generics and others work on other types like arrays. Some algorithms work on Maps while others work on Lists, Sets, and so on. The follow-up question from the first part could've helped in answering this question.
That said, even if you knew the DS but couldn't think of a specific method name at the time, I think it should be considered reasonable to mention the interface, or at least a relevant package, and say that further details can be checked in the Java documentation if you're pressed for more specificity, given that's what it's there for in the first place.
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
Given the wording, my interpretation of their question was not "which data structure would you use?", but rather, "which pre-existing search algorithm would you choose?". It seems to me like it was them who needed to answer the question regarding DS.
That said, if you had indeed been asked "which data structure would you use?", then a Map would've still worked against you, since you didn't really need to map a key to a value. You only needed to store a value (i.e. the words). Therefore, a Set, specifically a HashSet, would've been a better candidate, since it also avoids duplicates and should consume less memory in the process because it stores singular values, rather than key/value pairs.
Of course, that's still under the assumption(s) I made earlier. If memory constraints are said to be an issue, then scaling horizontally to multiple servers and so on would've likely been necessary.
How can I understand the exact answer to this question and what can be the optimal solution(s)?
It is probably the case that they wanted to see if you would follow up with questions, given the lack of information they gave you.
There are a couple of data structures that allow for efficient searching, assuming that memory requirements aren't an issue and the data structure is already populated.
Regarding time complexity, Set#contains and Map#containsKey are both O(1) on average for the hash-based implementations (HashSet, HashMap), assuming that the hash function isn't expensive and that there aren't many collisions.
Because the data structure stores words (assuming you're referring to Strings), it could also be relatively efficient to use a trie (prefix tree) or its compressed variant, the radix tree, which lets you search character by character. A lookup then costs O(k), where k is the length of the word, regardless of how many entries are stored. If the hash function is expensive or there are many collisions, this could be a good alternative!
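A minimal trie might look like the sketch below. `WordTrie` and its methods are hypothetical names, and a `HashMap` per node is just the simplest child representation, not the most memory-efficient one. The lookup walks one node per character, so its cost depends on the length of the word rather than on the number of stored entries.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal trie for exact word lookup. */
class WordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    /** Adds a word, creating one node per character where needed. */
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.endOfWord = true;
    }

    /** Returns true only if the exact word was inserted (a mere prefix does not count). */
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.endOfWord;
    }
}
```

For multi-billion entries the per-node object overhead of this naive layout would be significant, so treat it as an illustration of the lookup, not a sizing recommendation.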
The answer that you gave to the interviewer should suffice since hashing is an effective searching method, even for billions of entries.
You did not mention whether the entries are words or documents (multiple words). In both cases a search index could be suitable.
Search indexes extract words from the billion document entries and manage a map of these words to the documents they are used in. Frameworks like Lucene (e.g. as part of SOLR or ElasticSearch) manage memory and persistence for you.
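The word-to-documents map described above is usually called an inverted index. The toy version below is only a sketch of the idea: `TinyInvertedIndex` is a made-up class, the tokenizer is a plain regex split, and nothing here reflects how Lucene stores or persists its index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory inverted index: word -> ids of the documents containing it. */
class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Splits the text on non-word characters and records the document id under each word. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of all documents containing the word (case-insensitive). */
    Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

A search is then a single map lookup, regardless of how long the documents are; the cost has been moved to indexing time.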
If it were only a few thousand entries, a simple HashMap would be sufficient, because there is no need for memory management then. If all of the billion entries are single words, a database could be a slightly better choice.
The hashmap solution is reasonable as stated by others but there are doubts with respect to scalability.
Here is a possible solution for the problem as discussed in the below post
1. Sub-string match. If your entry blob is a single string or word (without any white space) and you need to search for an arbitrary sub-string within it, you need to parse every entry to find the entries that match best. One uses algorithms like the Boyer-Moore algorithm. See this and this for details. This is also equivalent to grep, because grep uses similar techniques internally.
2. Indexed search. Here you are assuming that an entry contains a set of words and that the search is limited to fixed word lengths. In this case, entries are indexed over all the possible occurrences of words. This is often called "full-text search". There are a number of algorithms to do this and a number of open source projects that can be used directly. Many of them also support wildcard search, approximate search, etc., as below:
a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
b. OpenFTS : http://openfts.sourceforge.net/
c. Sphinx http://sphinxsearch.com/
Most likely, if you need "fixed words" as queries, the second approach will be very fast and effective.
Reference - https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa
Multi-billion entries lie at the edge of what might conceivably be stored in main memory (for instance, storing 10 billion entries at 100 bytes per entry will take 1000 GB main memory).
While storing the data in main memory offers a very high throughput (thousands to millions of requests per second), you'd likely need special hardware (typical blade servers only offer 16 GB, but there are commodity servers that permit installation of up to 3000 GB of main memory). Also, keeping this much data in the Java heap will likely cause garbage collector pauses of seconds or minutes unless special care is taken.
Therefore, unless the structure of your data admits a very compact representation in main memory (say, you only need membership checking among ints, which is possible with a 512 MB bit set), you'll not want to store it in main memory, but on disk.
Therefore, you'll need persistence. Any relational or NoSQL database permits efficient searching by key and can handle such amounts of data with ease. To talk to a relational database, use JPA or JDBC. To talk to a non-relational database, you can use their proprietary Java API or an abstraction layer such as Spring Data.
You could also implement persistence from scratch if you wanted to (i.e. the interviewer asks for that). A data structure optimized for efficient lookup in external memory is the B-Tree, that's what many databases use internally :-)
I'm currently looking for a Java library (or a native library with a Java API) for formula parsing and evaluation.
Using recommendations from here, I took a look at many libraries:
JFormula
JEval
Symja
JEP
But none of them fulfils my needs, which are:
Evaluation of multiple formulas with dependencies between them (a formula is always an assignment to a variable using other variables or numerical values)
Possibility to change only one formula out of maybe 50, with good performance if only one formula changes
No need to handle variable dependencies by hand
Automatically update other dependent variables if a formula changes
Possibility to listen for which variables changed
No need for a specific format for the variables (the user will directly enter a name and doesn't want a complex notation)
Maybe an example will be better. Let's say we have, entered in the system in this order:
a = b + c
c = 2 * d
b = 3
d = 2
I would like to be able to enter those 4 lines in this order, and ask for the result of "a" (or "b", whatever).
Then, if in the user interface (basically a table variable <> formula) "b" is changed to "2 * d", the library will automatically change the values of "b" and "a", and return me (or fire an event, or call a function with) a list of changes.
The best library would be one just like JEP, but with the out-of-order variables capability and the possibility to auto-evaluate dependent variables.
I know that compilers and spreadsheet software use such mechanisms, but I didn't find any Java or Java-compatible libraries that are directly usable.
Does someone know one?
EDIT: Clarification: the question is really about a library, or possibly a set of libraries to link together. The question is for a project in a company and the idea is to spend the minimum amount of time. The "do it yourself" solution has already been estimated and is not in the scope of the question.
For a project where I also needed a simple formula parser, I used the code from the article Lexical analysis, Part 2: Build an application on javaworld.com. It's simple and small (7 classes), and you can adapt it to your needs.
You can download the source from here (search for the 'Lexical Analysis Part II' entry).
Don't know of any libraries.
Assuming what you have is a set of equations with a single variable on at least one side of the equation (A+B=C-D is disallowed) and no cycles (e.g., A=B+1; B=A-2), what you technically need to do is to build a data flow graph showing how each operator depends on its operands. For side-effect-free equations (e.g., pure math) this is pretty easy; you end up with a directed acyclic graph (a forest with shared subtrees representing shared subexpressions). Then, if the value of a variable is changed, or a new formula is introduced, you revise the dag and re-evaluate the changed parts, propagating changes up the dag to the dag roots. So, you need to build trees for the expressions, and then share them (often by hashing on subtrees to find potential equivalent candidates). So, lots of structure manipulation to keep the dag (and its root values) up to date.
But if it's only 50 variables of the complexity you show, you could simply re-evaluate them all. If you store the expressions as trees (or better yet, reverse Polish) you can evaluate each one quite fast, and you don't pay any overhead to keep all those data structures up to date.
If you have hundreds of equations, the dag scheme is likely a lot better.
If you have constraint equations (e.g., you aren't restricted as to what can be on both sides), you're outside the spreadsheet paradigm and into constraint solvers, which is a far more complex technology.
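The "simply re-evaluate them all" variant for a small, acyclic set of formulas could be sketched as follows. Everything here is an assumption made for illustration: `FormulaSheet`, `Scope`, `define`, and `recalculate` are hypothetical names, and Java lambdas stand in for expressions that a real system would obtain from a parser. Variables can be defined in any order, and each recalculation reports which values changed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** Hypothetical sheet of named formulas that is fully re-evaluated on demand. */
class FormulaSheet {
    /** Lookup handed to each formula so that it can read other variables. */
    interface Scope {
        double get(String name);
    }

    private final Map<String, ToDoubleFunction<Scope>> formulas = new LinkedHashMap<>();
    private Map<String, Double> values = new HashMap<>();

    /** Defines or replaces a formula; the order of definitions does not matter. */
    void define(String name, ToDoubleFunction<Scope> formula) {
        formulas.put(name, formula);
    }

    /** Re-evaluates every formula and returns the names whose value changed. */
    List<String> recalculate() {
        Map<String, Double> fresh = new LinkedHashMap<>();
        Set<String> inProgress = new HashSet<>();
        for (String name : formulas.keySet()) {
            evaluate(name, fresh, inProgress);
        }
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Double> entry : fresh.entrySet()) {
            if (!entry.getValue().equals(values.get(entry.getKey()))) {
                changed.add(entry.getKey());
            }
        }
        values = fresh;
        return changed;
    }

    /** Value from the last recalculate() call; the variable must exist. */
    double value(String name) {
        return values.get(name);
    }

    // Depth-first evaluation: dependencies are computed on demand and memoized
    // in 'fresh'; 'inProgress' detects cycles such as A=B+1; B=A-2.
    private double evaluate(String name, Map<String, Double> fresh, Set<String> inProgress) {
        Double done = fresh.get(name);
        if (done != null) {
            return done;
        }
        ToDoubleFunction<Scope> formula = formulas.get(name);
        if (formula == null) {
            throw new IllegalStateException("Undefined variable: " + name);
        }
        if (!inProgress.add(name)) {
            throw new IllegalStateException("Cycle at: " + name);
        }
        double result = formula.applyAsDouble(dep -> evaluate(dep, fresh, inProgress));
        inProgress.remove(name);
        fresh.put(name, result);
        return result;
    }

    public static void main(String[] args) {
        FormulaSheet sheet = new FormulaSheet();
        sheet.define("a", s -> s.get("b") + s.get("c"));
        sheet.define("c", s -> 2 * s.get("d"));
        sheet.define("b", s -> 3.0);
        sheet.define("d", s -> 2.0);
        sheet.recalculate();
        System.out.println("a = " + sheet.value("a"));
        sheet.define("b", s -> 2 * s.get("d"));
        System.out.println("changed: " + sheet.recalculate());
    }
}
```

With hundreds of formulas, the full pass on every edit becomes the cost the dag scheme avoids, since the dag re-evaluates only what depends on the edited variable.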
Why would not you just write your own? Your assessment of complexity of this task might be wrong. It is much easier than you might think - chances are, learning how to deal with any 3rd party library would require much more effort than implementing such a trivial thing from scratch. It should not take more than a couple of hours in the worst case.
It does not make any sense to look for 3rd party libraries for doing simple things (I know, it is a part of the Java ethos, but still...)
I'd recommend to take a look at the Cells library for inspiration. It is in Common Lisp, but ideas are basic enough to be transferred anywhere else.
you can check these links too...
MathPiper (a Java fork of the Java Yacas version; has its own editor based on jEdit) (GPL) http://code.google.com/p/mathpiper/
Symja/Matheclipse (my own project, uses JAS and Commons Math libraries) (LGPL) http://krum.rz.uni-mannheim.de/jas/
Java Algebra System (JAS) (LGPL) http://krum.rz.uni-mannheim.de/jas/
I would embed Groovy; see the tutorial about embedding here. Freeplane (a Java mind mapper) also uses Groovy for formulas.
Whenever a variable changes, you have to put the new value into the binding.
All the cell code should be given to the Groovy shell as a single piece of code. You can register for changes via BindPath.
Anyway, I assume you have to implement a thin layer to fulfill your requirements:
No need to handle variable dependencies by hand
Possibility to listen which variable changed
I have to store more than 100 million key-values in my HashMultiMap (a key can have multiple values). Now, I want to use Jedis for that. I downloaded it from here - Jedis 2.0.0.0.jar, as recommended to me here. Now, after a little bit of searching, I could not find any good document that helps me as a beginner:
1) How to use Jedis (specifically, do I have to treat it like a normal .jar file in Java, e.g. like Guava)?
2) How to implement a HashMultiMap (a key can have multiple values) in Redis?
3) How to perform all insertion, searching etc. in Redis?
4) While searching for Redis I found many options, like Jedis, Redis, Jredis etc. What are those variations? And which one would be best for solving this?
Any information and/or link to any document will be helpful for me. Sorry if I ask any stupid questions; I have no idea about Redis, so any starting point will be valuable for me. Thanks.
I'm afraid there isn't a simple way to achieve what you want. Redis only has normal hashes. One key - one value.
However, you can serialize your multiple values to a string and store that as a value. Of course, you lose ability to individually insert/update/remove items, you'll have to reset the whole value every time. But this might not be a problem for you.
Redis has a few internal types, like lists, sets, or hashes. I guess you can use sets for your case. That's better than serializing the whole data, because operations on internal types are atomic, and you will not need to worry about possible race conditions.
check out https://github.com/xetorthio/jedis/wiki and http://redis.io/commands
There are several ways, which involve using lists/sorted sets/hashes as the individual fields of your multimap. Then:
a) make use of sub-databases to provide separate namespaces, i.e. to delimit your overall multimap (SELECT), and/or
b) use the rich semantics that keys have in Redis (see example here). You could build your multimap simply from regular key/value mappings (SET/GET), with the key name additionally describing your map fields. You have a variety of options to get what you want. One of the last resorts is scripting.
Depends!
afaik, jedis is the most mature.
Is there any Java library with TreeMap-like data structure which also supports all of these:
lookup by value (like Guava's BiMap)
possibility of non-unique keys as well as non unique values (like Guava's Multimap)
keeps track of sorted values as well as sorted keys
If it exists, it would probably be called SortedBiTreeMultimap, or similar :)
This can be produced using a few data structures together, but I never took time to unite them in one nice class, so I was wondering if someone else has done it already.
I think you are looking for a "Graph". You might be interested in this slightly similar question asked a while ago, as well as this discussion thread on BiMultimaps / Graphs. Google has a BiMultimap in its internal code base, but they haven't yet decided whether to open source it.
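As a rough idea of what uniting "a few data structures together" could look like, the sketch below keeps two TreeMaps in sync, one per direction. This is a hypothetical class, not an existing library type; it uses set semantics, so a repeated (key, value) pair is stored once, and it is not thread-safe.

```java
import java.util.Collections;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sorted, bidirectional multimap built from two TreeMaps. */
class SortedBiMultimap<K extends Comparable<K>, V extends Comparable<V>> {
    private final TreeMap<K, TreeSet<V>> byKey = new TreeMap<>();
    private final TreeMap<V, TreeSet<K>> byValue = new TreeMap<>();

    /** Records the pair in both directions. */
    void put(K key, V value) {
        byKey.computeIfAbsent(key, k -> new TreeSet<>()).add(value);
        byValue.computeIfAbsent(value, v -> new TreeSet<>()).add(key);
    }

    /** Removes the pair from both directions; returns false if it was not present. */
    boolean remove(K key, V value) {
        TreeSet<V> values = byKey.get(key);
        if (values == null || !values.remove(value)) {
            return false;
        }
        if (values.isEmpty()) {
            byKey.remove(key);
        }
        TreeSet<K> keys = byValue.get(value);
        keys.remove(key);
        if (keys.isEmpty()) {
            byValue.remove(value);
        }
        return true;
    }

    /** Sorted values stored under a key (read-only view, empty if the key is unknown). */
    SortedSet<V> valuesFor(K key) {
        TreeSet<V> values = byKey.get(key);
        if (values == null) {
            return Collections.emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(values);
    }

    /** Lookup by value: sorted keys that map to the given value. */
    SortedSet<K> keysFor(V value) {
        TreeSet<K> keys = byValue.get(value);
        if (keys == null) {
            return Collections.emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(keys);
    }

    NavigableSet<K> sortedKeys() {
        return Collections.unmodifiableNavigableSet(byKey.navigableKeySet());
    }

    NavigableSet<V> sortedValues() {
        return Collections.unmodifiableNavigableSet(byValue.navigableKeySet());
    }
}
```

The price for lookups in both directions is that every pair is stored twice and every mutation has to touch both maps.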
We have a system which performs a 'coarse search' by invoking an interface on another system which returns a set of Java objects. Once we have received the search results I need to be able to further filter the resulting Java objects based on certain criteria describing the state of the attributes (e.g. from the initial objects return all objects where x.y > z && a.b == c).
The criteria used to filter the set of objects each time is partially user configurable, by this I mean that users will be able to select the values and ranges to match on but the attributes they can pick from will be a fixed set.
The data sets are likely to contain <= 10,000 objects for each search. The search will be executed manually by the application user base probably no more than 2000 times a day (approx). It's probably worth mentioning that all the objects in the result set are known domain object classes which have Hibernate and JPA annotations describing their structure and relationship.
Possible Solutions
Off the top of my head I can think of 3 ways of doing this:
For each search persist the initial result set objects in our database, then use Hibernate to re-query them using the finer grained criteria.
Use an in-memory Database (such as hsqldb?) to query and refine the initial result set.
Write some custom code which iterates the initial result set and pulls out the desired records.
Option 1
Option 1 seems to involve a lot of toing and froing across a network to a physical Database (Oracle 10g) which might result in a lot of network and disk activity. It would also require the results from each search to be isolated from other result sets to ensure that different searches don't interfere with each other.
Option 2
Option 2 seems like a good idea in principle as it would allow me to do the finer query in memory and would not require the persistence of result data which would only be discarded after the search was complete. Gut feeling is that this could be pretty performant too but might result in larger memory overheads (which is fine as we can be pretty flexible on the amount of memory our JVM gets).
Option 3
Option 3 could be very performant but is something I would like to avoid, as any code we write would require such careful testing that the time taken to achieve something flexible and robust enough would probably be prohibitive.
I don't have time to prototype all 3 ideas so I am looking for comments people may have on the 3 options above, plus any further ideas I have not considered, to help me decide which idea might be most suitable. I'm currently leaning toward option 2 (in memory database) so would be keen to hear from people with experience of querying POJOs in memory too.
Hopefully I have described the situation in enough detail but don't hesitate to ask if any further information is required to better understand the scenario.
Cheers,
Edd
Options 1 and 2 are quite compatible: by implementing one you can replace it with the other with simple reconfiguration of persistence.xml (given that in-memory database is JPA compatible, e.g. JavaDB, Derby, etc.).
Option 3 is re-implementing both third-party software (database) and your own code (existing JPA entities). You also listed its advantages as concerns. It's clearly a less feasible option in your case. I can't think of anything else to promote Option 3 either.
It seems that in-memory database is more suitable given use cases and their time span. If requirements evolve into less transient ones then you can switch to Oracle.
If your expressions are not too complex, you can use an expression language for evaluating string queries on your Java objects (POJOs). I can recommend MVEL http://mvel.codehaus.org .
The idea is that you put your objects into MVEL context. Then you provide string query written according to MVEL simple notation, and finally evaluate expression.
Example taken from MVEL site:
Map vars = new HashMap();
vars.put("x", new Integer(5));
vars.put("y", new Integer(10));
Integer result = (Integer) MVEL.eval("x * y", vars);
assert result.intValue() == 50; // Mind the JDK 1.4 compatible code :)
Usually expression languages support traversing your object graph (collections) and accessing members in JSP EL style (dot notation).
Also, I can suggest looking at OGNL (google it, I can't add more than one link)
How complex are the refining criteria? If the majority are quite simple, I'd be tempted to go for option (3) to start with, but make sure it's encapsulated behind a suitable interface so that if you come across something that is too complex or inefficient to code up yourself you can switch to the in-memory DB at that point (either wholesale for all queries, or just for the complex ones if there's an overhead in setting up the temporary tables).
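One way to keep option (3) behind an interface is to express each user-selectable criterion as a `java.util.function.Predicate` and combine them at query time. The sketch below is only an assumption about how that could look: `Item`, `Refiner`, `priceAbove`, and `statusIs` are invented names standing in for the real domain classes and attributes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class ResultRefinement {
    /** Hypothetical stand-in for a domain object returned by the coarse search. */
    static final class Item {
        final String name;
        final int price;
        final String status;

        Item(String name, int price, String status) {
            this.name = name;
            this.price = price;
            this.status = status;
        }
    }

    /** The seam: callers depend on this, not on how the refinement is carried out. */
    interface Refiner {
        List<Item> refine(List<Item> coarseResults);
    }

    // The fixed set of attributes users can filter on; only the values are user supplied.
    static Predicate<Item> priceAbove(int min) {
        return item -> item.price > min;
    }

    static Predicate<Item> statusIs(String status) {
        return item -> item.status.equals(status);
    }

    /** In-memory implementation: keeps the items that satisfy every criterion. */
    static Refiner inMemory(List<Predicate<Item>> criteria) {
        Predicate<Item> all = criteria.stream().reduce(item -> true, Predicate::and);
        return coarse -> coarse.stream().filter(all).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Item> coarse = Arrays.asList(
                new Item("a", 5, "OPEN"),
                new Item("b", 20, "OPEN"),
                new Item("c", 30, "CLOSED"));
        Refiner refiner = inMemory(Arrays.asList(priceAbove(10), statusIs("OPEN")));
        for (Item item : refiner.refine(coarse)) {
            System.out.println(item.name);
        }
    }
}
```

If some criterion later turns out to be too complex for this, a second `Refiner` implementation backed by an in-memory database could be swapped in without touching the callers.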
Option 2 seems to be good, since you can toggle between 1 and 2 as needed. Option 3 is also limited if data sizes grow in the future. Querying objects directly would imply a greater dependency on the code structure for both storage and querying.
It would probably be a good idea to include some caching mechanism (ehcache/memcache) along with Option 2, and then profile to check the performance difference.