I want to store large mapping tables between an id and two text attributes.
The dataset will be up to 1 million entries and refreshed on a daily basis.
Would you rather create a Lucene index and index it by that id? Or create a database (Postgres) table with the id as primary key? Or even a different solution?
And why would one prefer either solution?
I only want to look up by ID, no reverse lookup. The mapping table should be as simple as that: put in an id, and get back two string attributes.
What you are looking for appears to be a key-value store (see the Wikipedia article):
Key-value (KV) stores use the associative array (also known as a map
or dictionary) as their fundamental data model. In this model, data is
represented as a collection of key-value pairs, such that each
possible key appears at most once in the collection.
The key-value model is one of the simplest non-trivial data models,
and richer data models are often implemented on top of it. The
key-value model can be extended to an ordered model that maintains
keys in lexicographic order. This extension is powerful, in that it
can efficiently process key ranges.
Key-value stores can use consistency models ranging from eventual
consistency to serializability. Some support ordering of keys. Some
maintain data in memory (RAM), while others employ solid-state drives
or rotating disks.
The article there also gives a rather complete list of available implementations. Unfortunately I cannot recommend one of them, as I have not used any of these in production. But I am sure a quick Google search will turn up plenty of key-value store comparisons.
To answer your question, I would not go for Lucene: it is an open-source information retrieval library, designed for building search applications, and what you are going to do will not hit Lucene's sweet spot.
A classic RDBMS comes closer to your requirements, but as stated above, a key-value store would nail it.
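If you do go the Postgres route, the lookup you describe is a plain primary-key fetch. Here is a minimal JDBC sketch, assuming a table like mapping(id BIGINT PRIMARY KEY, attr1 TEXT, attr2 TEXT) and a local Postgres instance (table, column, and connection details are illustrative; the Postgres JDBC driver must be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MappingLookup {
    public static void main(String[] args) throws Exception {
        // Assumed schema: CREATE TABLE mapping (id BIGINT PRIMARY KEY, attr1 TEXT, attr2 TEXT)
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                "SELECT attr1, attr2 FROM mapping WHERE id = ?")) {
            ps.setLong(1, 42L);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println(rs.getString("attr1") + " / " + rs.getString("attr2"));
                }
            }
        }
    }
}
```

A primary-key lookup over 1 million rows is tiny work for Postgres, and the daily refresh can be done as a bulk load into a fresh table that you then swap in.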
I want to build a data structure to store the information of multiple houses, so that later a user can retrieve the desired housing information through a search query. To achieve fast search I will use a red-black tree. The problem I am facing is that the key of each node only contains one attribute of the house, i.e. price; the others, such as number of beds, land size, etc., cannot be stored in a single tree. What would be a good data structure for this problem? Initially I thought of a tree nested in a tree; is this viable or considered good practice?
The problem you are facing can be solved using secondary indexes on top of your data. Secondary indexes are a concept studied intensely in the database world and you should have no trouble finding resources to help you understand how they are implemented in real databases.
So, you currently have a primary key for your data: the object's memory reference, or maybe an index into a collection of references. For each attribute that you want to query you will need a fast way of looking up matching objects. The exact data structure you use will depend on the type of queries you perform, but some kind of search tree is a good general-purpose choice and is usually efficient to update, which is very important for a lot of databases. The data structure should take a query relating to the specific attribute and return the references, or primary keys, of all the objects which match that query.
In your example you might have one red-black tree for price and another for number-of-beds. If you are answering a query for "price = 30 or number-of-beds = 4", all you need to do is query your price data structure and then your number-of-beds data structure; since you have an "or" in your query, you simply take the union of the primary keys returned from your data structures (take the intersection for "and"s).
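To make the mechanics concrete, here is a minimal sketch of two secondary indexes and an "or" query answered by a union of primary keys (the House record and its fields are invented for illustration):

```java
import java.util.*;

public class HouseIndex {
    record House(long id, int price, int beds) {}

    // Primary storage: primary key -> full house record
    private final Map<Long, House> houses = new HashMap<>();
    // Secondary indexes: attribute value -> primary keys of houses with that value.
    // TreeMaps keep the keys ordered, so range queries via subMap() are also possible.
    private final TreeMap<Integer, Set<Long>> byPrice = new TreeMap<>();
    private final TreeMap<Integer, Set<Long>> byBeds = new TreeMap<>();

    public void add(House h) {
        houses.put(h.id(), h);
        byPrice.computeIfAbsent(h.price(), k -> new HashSet<>()).add(h.id());
        byBeds.computeIfAbsent(h.beds(), k -> new HashSet<>()).add(h.id());
    }

    // "price = p OR beds = b": query each index and take the union of the keys
    // (use retainAll instead of addAll for an "and").
    public Set<Long> priceOrBeds(int p, int b) {
        Set<Long> result = new HashSet<>(byPrice.getOrDefault(p, Set.of()));
        result.addAll(byBeds.getOrDefault(b, Set.of()));
        return result;
    }
}
```

Because the indexes are TreeMaps, a range query such as "price between 20 and 40" can be answered with subMap() on the price index.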
Notice that if you add to or update your objects then you will also need to update all the indexes that change. This is a trade-off you also see in real databases; faster reads for slower writes.
A nested tree approach might work depending on what kind of queries you are making but will quickly become unsuitable if the data structure is not static - it will be very slow to update the tree if you update your objects.
I have an abstract superclass and some subclasses. My question is: what is the best way to keep objects of those classes so I can easily find them using all the different parameters?
For example, if I want to look up by resourceCode (every object has a unique resource code) I can use a HashMap keyed by resourceCode. But what happens if I want to look up by genre? There are many games with the same genre, so I will get all those games. My first idea was an ArrayList of those objects, but isn't that too slow if we have 1 000 000 games (about 1 000 000 operations)?
My other idea is to have a Hashtable keyed by the product code, so the search complexity is constant. In addition, I create as many HashSets as there are fields in the classes; for each field value (for example, game promoter) I get the product code(s) of the objects stored in the HashSet for that field, and with those unique codes I can get everything I want from the Hashtable. Is this a good idea? It seems it will need a lot of space to store the data, but it will be fast.
So my question is: what data structure should I use to implement fast lookup of custom objects, searching by their attributes (fields)?
Please see the attachment: Classes Example
Thank you in advance.
Stefan Stefanov
You can use sorted or ordered data structures to optimize search complexity.
You can introduce your own search index for custom data.
But it is better to use a database or a search engine.
Have a look at Elasticsearch, Apache Solr, or PostgreSQL.
It sounds like most of your fields can be mapped to a string (name, genre, promoter, description, year of release, ...). You could put all these strings in a single large index that maps each keyword to all objects that contain the word in any of their fields. Then if you search for a certain keyword it will return a list of all entries that contain that word. For example, searching for 'mine' should return 'minecraft' (because of the title), as well as all Minecraft clones (having 'minecraft-like' as genre) and all games that use the word 'mine' in the 'info text' field.
You can code this yourself, but I suppose some full-text indexer, such as Lucene, may be useful. I haven't used Lucene myself, but I suppose it would also allow you to search for multiple keywords at once, even if they occur in different fields.
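If you want to try the hand-rolled variant before reaching for Lucene, the keyword index described above is essentially a map from each token to the set of objects containing it. A rough sketch, with an invented Game record standing in for your classes:

```java
import java.util.*;

public class KeywordIndex {
    record Game(String name, String genre, String info) {}

    // Inverted index: lowercase token -> every game whose fields contain that token
    private final Map<String, Set<Game>> index = new HashMap<>();

    public void add(Game game) {
        for (String field : List.of(game.name(), game.genre(), game.info())) {
            for (String token : field.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new HashSet<>()).add(game);
                }
            }
        }
    }

    public Set<Game> search(String keyword) {
        return index.getOrDefault(keyword.toLowerCase(), Set.of());
    }
}
```

Note that an exact token lookup like this will not match 'mine' inside 'minecraft'; prefix or substring matching needs a sorted key set, a trie, or a full-text engine such as Lucene.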
This is not a very appealing answer.
Start with a database. Maybe an embedded database (like h2database).
Easy set of fixed develop/test data; can be easily changed. (The database dump.)
Too many indices (hash maps) do harm.
Developing and optimizing queries is easier (declarative) than with data structures.
Database tables are less coupled than data structures with helper structures (maps).
The resulting system is far less complex and scales better.
After development has stabilized the set of queries, you can think about doing away with the DB part. Use at least a two-tier separation between the database and the classes.
Then you might arrive at a stable, best-fitting data model.
Should you still intend to do it all with pure objects, work them out in detail in design documentation before you start programming: example stories, and how one would solve them.
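To illustrate the embedded-database route, here is a minimal sketch using H2 (the table layout is made up, and the H2 jar needs to be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedDbDemo {
    public static void main(String[] args) throws Exception {
        // In-memory H2 database; use a file URL (e.g. jdbc:h2:./games) to persist between runs
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:games;DB_CLOSE_DELAY=-1");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE game (resource_code VARCHAR PRIMARY KEY, name VARCHAR, genre VARCHAR)");
            st.execute("CREATE INDEX idx_game_genre ON game(genre)");
            st.execute("INSERT INTO game VALUES ('RC1', 'Minecraft', 'sandbox')");

            // A declarative query against an indexed column instead of a hand-maintained map
            try (ResultSet rs = st.executeQuery("SELECT name FROM game WHERE genre = 'sandbox'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}
```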
Question about Google App Engine + Datastore. We have some queries with several equality filters. For this, we don't need to keep any composite index; Datastore maintains these indexes automatically, as described here.
The built-in indexes can handle simple queries, including all entities of a given kind, filters and sort orders on a single property, and equality filters on any number of properties.
However, we need the result to be sorted on one of these properties. I can do that (using Objectify) with .sort("prop") on the datastore query, which requires me to add a composite index and will make for a huge index once deployed. The alternative I see is retrieving the unordered list (max 100 entities in the resultset) and then sorting them in-memory.
Since our entity implements Comparable, I can simply use Collections.sort(entities).
My question is simple: which one is desired? And even if the datastore composite index would be more performant, is it worth creating all those indexes?
Thanks!
There is no right or wrong approach; the solution depends on your requirements. There are several factors to consider:
Extra indexes take space and cost more both in storage costs and in write costs - you have to update every index on every update of an entity.
Sort on property is faster, but with a small result set the difference is negligible.
You can store sorted results in Memcache and avoid sorting them in every request.
You will not be able to use pagination without a composite index, i.e. you will have to retrieve all results every time for in-memory sort.
It depends on your definition of "desired". IMO, if you know the result set is a "manageable" size, I would just do the in-memory sort. Adding lots of indexes will have a cost impact; you can do a cost analysis first to check it.
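For the in-memory option, sorting the small result set is a one-liner once the entities are loaded. A trivial sketch with a stand-in entity class (the real entities would come from the Objectify query with equality filters only):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class InMemorySortDemo {
    // Stand-in for the entity fetched from the Datastore
    record Product(String name, int year) {}

    public static void main(String[] args) {
        List<Product> results = new ArrayList<>(List.of(
                new Product("b", 2010), new Product("a", 2012), new Product("c", 2008)));

        // The result set is capped at ~100 entities, so sorting it in memory is cheap
        // and avoids defining (and paying for) a composite index in the Datastore.
        results.sort(Comparator.comparingInt(Product::year));
        results.forEach(p -> System.out.println(p.name() + " " + p.year()));
    }
}
```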
I need some help to store some data efficiently. I have a large list of objects (about 100,000) and want to store associations between these items with a coefficient. Not all items are associated; in fact I have about 1 million associations. I need fast access to these associations when referencing them by the two items. What I did is a structure like this:
Map<Item, Map<Item, Float>>
I tried this with HashMap and Hashtable. Both work fine and are fast enough. My problem is that all those Maps create a lot of memory overhead: more than 300 MB for the given scenario. Is there a Map implementation with a smaller footprint? Is there maybe a better algorithm to store that kind of data?
Here are some ideas:
Store in a Map<Pair<Item,Item>,Float>. If you are worried about allocating a new Pair for each lookup, and your code is synchronized, you can keep a single lookup Pair instance.
Loosen the outer map to be Map<Item, ?>. The value can be a simple {Item,Float} tuple for the first association, then a small tuple array for a small number of associations, then promote to a full-fledged Map.
Use Commons Collections' Flat3Map for the inner maps.
If you are in tight control of the Items, and Item equality is referential (i.e. each Item instance is only equal to itself), then you can number each instance. Since you are talking about fewer than 2 billion instances, a single long can represent an Item pair with some bit manipulation (see the sketch below). The map then gets much smaller if you use Trove's TLongObjectHashMap.
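A sketch of that last idea: number the items and pack the two 32-bit ids into one long that serves as the map key. It is shown here with a plain HashMap to keep it dependency-free; swapping in a primitive long-keyed map (such as Trove's) removes the Long/Float boxing as well:

```java
import java.util.HashMap;
import java.util.Map;

public class PairKeyMap {
    private final Map<Long, Float> coefficients = new HashMap<>();

    // Pack two int ids into a single long key: high 32 bits = itemA, low 32 bits = itemB
    private static long key(int itemA, int itemB) {
        return ((long) itemA << 32) | (itemB & 0xFFFFFFFFL);
    }

    public void put(int itemA, int itemB, float coefficient) {
        coefficients.put(key(itemA, itemB), coefficient);
    }

    public Float get(int itemA, int itemB) {
        return coefficients.get(key(itemA, itemB));
    }

    public static void main(String[] args) {
        PairKeyMap m = new PairKeyMap();
        m.put(17, 42, 0.83f);
        System.out.println(m.get(17, 42)); // 0.83
    }
}
```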
You have two options.
1) Reduce what you're storing.
If your data is calculable, using a WeakHashMap will allow the garbage collector to remove members. You will probably want to decorate it with a mechanism that calculates lost or absent key/value pairs on the fly. This is basically a cache.
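A rough illustration of that decorated-WeakHashMap idea, assuming the coefficients can be recomputed on demand (the compute function is a placeholder you would supply):

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

public class RecomputingCache<K, V> {
    private final Map<K, V> cache = new WeakHashMap<>();  // entries vanish once a key is no longer referenced elsewhere
    private final Function<K, V> compute;                 // recomputes lost or absent values

    public RecomputingCache(Function<K, V> compute) {
        this.compute = compute;
    }

    public synchronized V get(K key) {
        // Recompute on a miss, whether the value was never stored or was collected
        return cache.computeIfAbsent(key, compute);
    }
}
```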
Another possibility that might trim a relatively tiny amount of RAM is to instruct your JVM to use compressed object pointers (-XX:+UseCompressedOops). That may save you about 3 MB with your current data size.
2) Expand your capacity.
I'm not sure what your constraint is (run-time memory on a desktop, serialization, etc.) but you can either expand the heapsize and deal with it, or you can push it out of process. With all those "NoSQL" stores out there, one will probably fit your needs. Or, an indexed db table can be quite fast. If you're looking for a simple key-value store, Voldemort is extremely easy to set up and integrate.
However, I don't know what you're doing with your working set. Can you give more details? Are you performing aggregations, partitioning, cluster analysis, etc.? Where are you running into trouble?
Let's say you have a large text file. Each row contains an email id and some other information (say a product-id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use a MapReduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data. (j/k :) )
Unable to fit all rows in memory
Even the result won't fit: use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (probably this sample helps). This can be multi-threaded if you want.
The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or something similar and keep adding to the HashSet. You can use a ConcurrentHashMap and more than one thread to read files and add to this Map. Another multi-threaded option is a ConcurrentSkipListSet; in this case, you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to this SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method and put them all in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
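A minimal version of the in-memory case, keyed on the email id alone (the file layout is an assumption; if a duplicate means the whole email + product-id combination, use that combination as the set element instead):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class DedupByEmail {
    public static void main(String[] args) throws IOException {
        Set<String> seenEmails = new HashSet<>();

        try (BufferedReader reader = Files.newBufferedReader(Path.of("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed layout: email<TAB>product-id<TAB>...
                String email = line.split("\t", 2)[0];
                if (seenEmails.add(email)) {
                    // First time we see this email: insert the row into the database here
                }
                // add() returned false -> duplicate, skip the line
            }
        }
    }
}
```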
Oh, and if I were you, I would put the unique constraint on the DB anyway...
I will start with the obvious answer. Make a HashMap and put the email id in as the key and the rest of the information into the value (or make an object to hold all the information). When you get to a new line, check whether the key exists; if it does, move to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
You have two options,
do it in Java: you could put together something like a HashSet for testing, adding an email id for each item that comes in if it doesn't exist in the set.
do it in the database: put a unique constraint on the table, such that dups will not be added to the table. An added bonus to this is that you can repeat the process and remove dups from previous runs.
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record linkage engine written in Java. It uses Lucene to index the data and reduce the number of comparisons (avoiding the unacceptable Cartesian product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and it is extremely extensible and configurable.
Can you not index the table by email and product ID? Then, reading by index, duplicates of either email or email+prodId are readily identified via sequential reads, simply by matching against the previous record.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
You load your data into an import schema;
Do every transformation you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.
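Sketched with plain JDBC and Postgres-flavoured SQL (schema and table names are invented, and the input file must be readable by the database server for COPY):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SimpleEtl {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Statement st = conn.createStatement()) {

            // 1) Extract/load: bulk-load the raw file into the import schema
            st.execute("COPY staging.contacts FROM '/data/input.txt' WITH (FORMAT text)");

            // 2) Transform + 3) load: keep one row per email id in the target schema
            st.execute("INSERT INTO target.contacts (email, product_id) " +
                       "SELECT DISTINCT ON (email) email, product_id FROM staging.contacts");
        }
    }
}
```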