I'm trying to choose a tool (or approach) for the following problem.
There is a collection of more than 600,000 locations. Each location has a name (in several languages), and the locations form a tree: region -> country -> administrative division -> city -> zip. Users can add custom locations, but I expect that to happen rarely. The application must be able to efficiently search locations by name and type, build hierarchical names (e.g. "London->England->United Kingdom"), and build subtrees of locations (e.g. all countries of Europe and the cities in those countries).
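Whatever storage ends up underneath, the hierarchical-name requirement itself is cheap once every location keeps a reference to its parent. A minimal sketch (the class and field names are invented for illustration, not taken from the actual model):

```java
import java.util.StringJoiner;

public class Location {
    final String name;
    final Location parent; // null for the root (region level)

    Location(String name, Location parent) {
        this.name = name;
        this.parent = parent;
    }

    // Builds a "City->Country->Region"-style name by walking up the tree.
    String hierarchicalName() {
        StringJoiner joiner = new StringJoiner("->");
        for (Location node = this; node != null; node = node.parent) {
            joiner.add(node.name);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Location uk = new Location("United Kingdom", null);
        Location england = new Location("England", uk);
        Location london = new Location("London", england);
        System.out.println(london.hierarchicalName()); // London->England->United Kingdom
    }
}
```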
I've considered three solutions.
Plain database: locations are stored in a few tables, and the main tree-building logic is implemented in Java code. With this solution I'm worried about performance, because searching, building the tree, and creating custom locations can all involve additional table joins.
Solr: at first glance this task looks like an exact fit for Solr: the data set changes rarely, and we need search by name. But I'm not sure whether Solr's pivot facets will satisfy the tree-building needs. I'm also not sure Solr's searching will be much better than a plain DB, because the search itself is not difficult (just matching names, which are short strings).
Graph DB (Neo4j): it seems well suited to building trees and subtrees, but I'm unsure about search performance (it seems I would have to use the Community Edition, which lacks some useful performance features such as caching).
A plain relational database is a big no: an RDBMS is not optimized for relationship-based queries. For example: show me the people who eat in the same restaurant I do and who also live in the same region I do. Or, to make it more complex, a DB query can be a killer when the depth of a relationship has to be computed, e.g. I am your second-level friend if one or more of your friends is/are my friend(s).
Solr: Solr is a good option, but you have to consider its performance impact. With so many rows to index, it can be a memory killer. Go through these pages before implementing Solr:
http://wiki.apache.org/solr/SolrPerformanceProblems
http://wiki.apache.org/solr/SolrPerformanceFactors
Solr is also not a good fit for more complex, logic-heavy searches, and there is a learning curve before you can use it effectively.
Neo4j (or any other graph DB) is the best fit here. I have implemented all three of these technologies myself, and in my experience Neo4j is the best match for this kind of requirement.
However, you should look into how to back up the database and how to recover it after a crash.
All the best.
We have a table of vocabulary items that we use to search text documents. The java program that uses this table currently reads it from a database, stores it in memory and then searches documents for individual items in the table. The table is brought into memory for performance reasons. This has worked for many years but the table has grown quite large over time and now we are starting to see Java Heap Space errors.
There is a brute-force approach to this problem: upgrade to a larger server, install more memory, and allocate more of it to the Java heap. But I'm wondering whether there are better solutions. I don't think an embedded database will work for our purposes, because the tables are constantly being updated and the application is hosted at multiple sites, which suggests a maintenance nightmare. I'm uncertain what other techniques are out there that might help in this situation.
Some more details: there are currently over a million vocabulary items (think of these as short text strings, not individual words). Our application reads documents from a directory, then analyzes each document to determine whether any of the vocabulary is present in it. If it is, we note which items appear and store them in a database. The vocabulary itself is stored and maintained in an MS SQL relational database that we have been growing for years. Since every vocabulary item must be checked for each document, repeatedly reading from the database is inefficient. And the number of documents to analyze each day can be quite large at some of our installations (on the order of 100K documents a day). Documents are typically 2 to 3 pages long, although we occasionally see documents as large as 100 pages.
In the hopes of making your application more performant, you're taking all the data out of a database that is designed with efficient data operations in mind and putting it into your application's memory. This works fine for small data sets, but as those data sets grow, you are eventually going to run out of resources in the application to handle the entire dataset.
The solution is to use a database, or at least a data tier, that's appropriate for your use case. Let your data tier do the heavy lifting instead of replicating the data set into your application. Databases are incredible, and their ability to crunch through huge amounts of data is often underrated. You don't always get blazing fast performance for free (you might have to think hard about indexes and models), but few are the use cases where java code is going to be able to pull an entire data set down and process it more efficiently than a database can.
You don't say much about which database technology you're using, but most relational databases offer a lot of useful tools for full-text searching. I've seen well-designed relational databases perform text searches very effectively. But if you're constrained by your database technology, or your table really is so big that a relational-database text search isn't feasible, you should put your data into a searchable cache such as Elasticsearch. If you model and index your data effectively, you can build a very performant text-search platform that scales reliably. Tom's suggestion of Lucene is another good one. There are also cloud technologies that can help with this kind of thing: S3 + Athena comes to mind, if you're into AWS.
I'd look at Lucene (http://lucene.apache.org); it should be a good fit for what you've described.
I had the same issue with a table of more than a million rows, and a client who wanted to export all of that data. My solution was very simple: I followed the approach from this question. But with more than 100k records at once I still hit heap-space errors, so I read the data in chunks, querying WITH (NOLOCK). (I know this can return inconsistent data, but I needed it because without that hint the export was blocking the DB.) I hope this approach helps you.
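As a rough illustration of the chunking idea (the table and column names in the comments are invented, the SQL is SQL Server-flavored to match the WITH (NOLOCK) hint above, and the in-memory list merely stands in for the real paged queries):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ChunkedExport {
    // Pages through the data in fixed-size chunks so only one chunk sits on the
    // heap at a time. The in-memory list stands in for repeated queries such as:
    //   SELECT id, term FROM vocabulary WITH (NOLOCK)
    //   ORDER BY id OFFSET ? ROWS FETCH NEXT ? ROWS ONLY
    static void exportInChunks(List<String> table, int chunkSize, Consumer<List<String>> sink) {
        for (int offset = 0; offset < table.size(); offset += chunkSize) {
            int end = Math.min(offset + chunkSize, table.size());
            sink.accept(table.subList(offset, end)); // process/write one chunk, move on
        }
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 250; i++) rows.add("row-" + i);
        List<Integer> chunkSizes = new ArrayList<>();
        exportInChunks(rows, 100, chunk -> chunkSizes.add(chunk.size()));
        System.out.println(chunkSizes); // [100, 100, 50]
    }
}
```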
When you had a small table, you probably implemented an approach of looping over the words in the table and for each one looking it up in the document to be processed.
Now the table has grown to the point where you have trouble loading it all in memory. I expect that the processing of each document has also slowed down due to having more words to look up in each document.
If you flip the processing around, you have more opportunities to optimize this process. In particular, to process a document you first identify the set of words in the document (e.g., by adding each word to a Set). Then you loop over each document word and look it up in the table. The simplest implementation simply does a database query for each word.
To optimize this, without loading the whole table in memory, you will want to implement an in-memory cache of some kind. Your database server will actually automatically implement this for you when you query the database for each word; the efficacy of this approach will depend on the hardware and configuration of your database server as well as the other queries that are competing with your word look-ups.
You can also implement an in-memory cache of the most-used portion of the table. You will limit the size of the cache based on how much memory you can afford to give it. All words that you look up that are not in the cache need to be checked by querying the database. Your cache might use a least-recently-used eviction strategy so that you keep the most common words in the cache.
Rather than caching only the words that exist in the table, you might achieve better performance by caching the result of each lookup. The cache will then hold the words that appear most often in the documents, each with a boolean value indicating whether or not the word is in the table.
There are several really good open source in-memory caching implementations available in Java, which will minimize the amount of code you need to write to implement a caching solution.
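As a sketch of the cached-lookup idea from this answer, here is a minimal LRU cache built on LinkedHashMap's access-order mode; the vocabulary Set stands in for the real database query, and all names are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LookupCache {
    private final int maxEntries;
    private final Set<String> vocabularyTable; // stand-in for the real database query
    // Caches the *result* of each lookup (true/false), so common absent words are cached too.
    private final Map<String, Boolean> cache;

    LookupCache(Set<String> vocabularyTable, int maxEntries) {
        this.vocabularyTable = vocabularyTable;
        this.maxEntries = maxEntries;
        // accessOrder=true plus removeEldestEntry turns LinkedHashMap into an LRU cache.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > LookupCache.this.maxEntries;
            }
        };
    }

    boolean contains(String word) {
        Boolean hit = cache.get(word);            // also refreshes LRU order on a hit
        if (hit == null) {
            hit = vocabularyTable.contains(word); // would be a DB query in the real system
            cache.put(word, hit);                 // may evict the least-recently-used entry
        }
        return hit;
    }
}
```

The open-source caches mentioned above (e.g. Caffeine, Ehcache, Guava's cache) add size limits, expiry and statistics on top of the same basic idea.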
I have an abstract superclass and some subclasses. My question is: what is the best way to store objects of those classes so that I can easily find them using several different parameters?
For example, to look up by resourceCode (every object has a unique resource code) I can use a HashMap keyed by resourceCode. But what happens if I want to look up by genre? There are many games with the same genre, so I would get all of those games back. My first idea was an ArrayList of the objects, but isn't a linear scan too slow with 1,000,000 games (about 1,000,000 operations)?
My other idea is a HashTable keyed by product code, so that lookup has constant complexity. On top of that I create one index per field in the classes: for each field value (for example, game promoter) I store the product code(s) of the objects having that value. With those unique codes I can then get everything I want from the HashTable. Is this a good idea? It seems a lot of space will be needed to store the data, but it will be fast.
So my question is: what data structure should I use to implement fast lookup of custom objects, searching by their attributes (fields)?
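The scheme described above (a primary map keyed by product code plus one index map per field) might be sketched like this; Game and its fields are invented stand-ins for the classes in the attachment:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GameCatalog {
    record Game(String productCode, String name, String genre) {}

    private final Map<String, Game> byCode = new HashMap<>();         // O(1) primary lookup
    private final Map<String, Set<String>> byGenre = new HashMap<>(); // genre -> product codes

    void add(Game game) {
        byCode.put(game.productCode(), game);
        byGenre.computeIfAbsent(game.genre(), g -> new HashSet<>()).add(game.productCode());
    }

    Game findByCode(String code) {
        return byCode.get(code);
    }

    // Resolve the code set from the genre index back through the primary map.
    Set<Game> findByGenre(String genre) {
        Set<Game> result = new HashSet<>();
        for (String code : byGenre.getOrDefault(genre, Set.of())) {
            result.add(byCode.get(code));
        }
        return result;
    }
}
```

Each additional searchable field would get its own index map maintained in add(), trading memory for constant-time lookups.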
Please see the attachment: Classes Example
Thank you in advance.
Stefan Stefanov
You can use Sorted or Ordered data structures to optimize search complexity.
You can introduce your own search index for custom data.
But it is better to use a database or a search engine. Have a look at Elasticsearch, Apache Solr, or PostgreSQL.
It sounds like most of your fields can be mapped to a string (name, genre, promoter, description, year of release, ...). You could put all these strings into a single large index that maps each keyword to all objects containing that word in any of their fields. Searching for certain keywords then returns a list of all entries containing them. For example, searching for 'mine' should return 'Minecraft' (because of the title), as well as all Minecraft clones (having 'minecraft-like' as their genre) and all games that use the word 'mine' in their 'info text' field.
You can code this yourself, but a fulltext indexer such as Lucene may be useful. I haven't used Lucene myself, but I believe it would also let you search for multiple keywords at once, even if they occur in different fields.
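A naive hand-rolled version of such a single large index might look like the sketch below (a real fulltext indexer like Lucene would do this far better; all names are illustrative, and the substring match is just one simple way to make 'mine' hit 'minecraft'):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class KeywordIndex {
    // word (lower-cased token from any field) -> names of entries containing it
    private final Map<String, Set<String>> index = new HashMap<>();

    void add(String entryName, String... fieldValues) {
        for (String field : fieldValues) {
            for (String token : field.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, t -> new HashSet<>()).add(entryName);
                }
            }
        }
    }

    // Naive substring match over the indexed words; a real indexer would use
    // analyzers and prefix/wildcard queries instead of scanning every key.
    Set<String> search(String term) {
        Set<String> hits = new TreeSet<>();
        String needle = term.toLowerCase();
        index.forEach((word, entries) -> {
            if (word.contains(needle)) hits.addAll(entries);
        });
        return hits;
    }

    public static void main(String[] args) {
        KeywordIndex idx = new KeywordIndex();
        idx.add("Minecraft", "Minecraft", "sandbox");
        idx.add("CraftClone", "minecraft-like", "dig in your own mine");
        System.out.println(idx.search("mine")); // [CraftClone, Minecraft]
    }
}
```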
This is not a very appealing answer.
Start with a database, maybe an embedded one (like h2database), because:
An easy, fixed set of develop/test data can be set up and easily changed (the database dump).
Too many indices (hash maps) hurt more than they help.
Developing and optimizing queries is easier (declarative) than doing the same with data structures.
Database tables are less coupled than data structures plus their helper structures (maps).
The resulting system is far less complex and scales better.
After development has stabilized the set of queries, you can think about doing away with the DB part. Use at least a two-tier separation between the database and the classes.
Then you might find a stable and best fitting data model.
Should you still intend to do it all with pure objects, then work them out in detail as design documentation before you start programming: example stories, and how each one is solved.
What is an index in Elasticsearch? Does one application have multiple indexes or just one?
Let's say you built a system for some car manufacturer. It deals with people, cars, spare parts, etc. Do you have one index named manufacturer, or do you have one index for people, one for cars and a third for spare parts? Could someone explain?
Good question, and the answer is a lot more nuanced than one might expect. You can use indices for several different purposes.
Indices for Relations
The easiest and most familiar layout clones what you would expect from a relational database. You can (very roughly) think of an index like a database.
MySQL => Databases => Tables => Rows/Columns
ElasticSearch => Indices => Types => Documents with Properties
An ElasticSearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns).
So in your car manufacturing scenario, you may have a SubaruFactory index. Within this index, you have three different types:
People
Cars
Spare_Parts
Each type then contains documents that correspond to that type (e.g. a Subaru Imprezza doc lives inside of the Cars type. This doc contains all the details about that particular car).
Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]
So to retrieve the Subaru document, I may do this:
$ curl -XGET localhost:9200/SubaruFactory/Cars/SubaruImprezza
Indices for Logging
Now, the reality is that Indices/Types are much more flexible than the Database/Table abstractions we are used to in RDBMs. They can be considered convenient data organization mechanisms, with added performance benefits depending on how you set up your data.
To demonstrate a radically different approach, a lot of people use ElasticSearch for logging. A standard format is to assign a new index for each day. Your list of indices may look like this:
logs-2013-02-22
logs-2013-02-21
logs-2013-02-20
ElasticSearch allows you to query multiple indices at the same time, so it isn't a problem to do:
$ curl -XGET 'localhost:9200/logs-2013-02-22,logs-2013-02-21/Errors/_search?q="Error Message"'
Which searches the logs from the last two days at the same time. This format has advantages due to the nature of logs: most logs are never looked at, and they are organized in a linear flow of time. Making an index per day is more logical and offers better search performance.
Indices for Users
Another radically different approach is to create an index per user. Imagine you have a social networking site, and each user has a large amount of random data. You can create a single index for each user. Your structure may look like:
Zach's Index
Hobbies Type
Friends Type
Pictures Type
Fred's Index
Hobbies Type
Friends Type
Pictures Type
Notice how this setup could easily be done in a traditional RDBMS fashion (e.g. a "Users" index, with hobbies/friends/pictures as types). All users would then be thrown into a single, giant index.
Instead, it sometimes makes sense to split data apart for data organization and performance reasons. In this scenario, we are assuming each user has a lot of data, and we want them separate. ElasticSearch has no problem letting us create an index per user.
Zach's answer is valid for Elasticsearch 5.x and below. Since Elasticsearch 6.x, types have been deprecated, and they are removed completely in 7.x. Quoting the Elasticsearch docs:
Initially, we spoke about an “index” being similar to a “database” in an SQL database, and a “type” being equivalent to a “table”.
This was a bad analogy that led to incorrect assumptions.
To elaborate: in SQL, two columns with the same name in two different tables can be independent of each other. In an Elasticsearch index that is not possible, because fields with the same name are backed by the same Lucene field. Thus an "index" in Elasticsearch is not quite the same as a "database" in SQL: same-named fields within one index can end up with conflicting field types. To avoid this, the Elasticsearch documentation recommends storing one document type per index.
Refer: Removal of mapping types
An index is a data structure for storing the mapping of fields to the corresponding documents. The objective is to allow faster searches, often at the expense of increased memory usage and preprocessing time.
The number of indices you create is a design decision that you should make according to your application's requirements. You can have an index for each business concept, or an index for each month of the year, and so on.
You should invest some time getting acquainted with Lucene and Elasticsearch concepts. Take a look at the introductory video, and at this one covering some data design patterns.
The answer above is very detailed; in short, an index can be defined as follows.
Index: a collection of documents of different types together with their properties. An index also uses the concept of shards to improve performance. For example, one index might contain the data of a social networking application.
(Definition from tutorialspoint.com.)
Since an index is a collection of documents of different types, how many indices you need depends on how you want to categorize your data.
Do you have one index named manufacturer?
Yes; we would keep one index for everything belonging to the manufacturer.
Do you have one index for people, one for cars and a third for spare parts? Could someone explain?
Think of a car model supplied by the same manufacturer to many people driving it on the road: there could be many indices, depending on how the data is used. Thinking it through, all of the options except the first are poor fits here.
Elasticsearch documents are quite different from SQL rows, CSV records or spreadsheet rows: from one index, using its powerful query language, you can produce millions of differently categorized, CSV-style views of the data. Because Elasticsearch is blazingly fast and everything is indexed, we create only one index per customer, and from it derive as many document views as we need. For example: all older people using the same model, or one older person using all models. The permutations are infinite.
I need to store about 100,000 objects representing users. Each user has a username, age, gender, city and country.
The users should be searchable by an age range and by any of the other attributes, and also by a combination of attributes (e.g. women between 30 and 35 from Brussels). Results should be found quickly, as this is one of the server's services for many connected clients. Users may only be added or deleted, not updated.
I've thought of a fast database with indexed attributes (such as H2, which seems to be pretty fast and, I've seen, has an in-memory mode).
I was wondering if any other option was possible before going for the DB.
Thank you for any ideas!
How much memory does your server have? How much memory would these objects take up? Is it feasible to keep them all in memory, or not? Do you really need the speedup of keeping in memory, vs shoving in a database? It does make it more complex to keep in memory, and it does increase hardware requirements... are you sure you need it?
Everything you describe could run on a very simple server against a very simple database and give you the results you want in the order of 100 ms per request. Do you need a response time faster than 100 ms? Why?
I would use a RDBMS - there are plenty of good ORMs available, such as Hibernate, which allow you to transparently stuff the POJOs into a db. Once you've got the data access abstracted, you then have the freedom to decide how best to persist the data.
For this size of project, I would use the H2 database. It has both embedded and client/server modes, and can operate from disk or entirely in memory.
Most definitely a relational database. At that size you'll want a client-server system, not something embedded like SQLite. Pick a system based on your further requirements; indexing is a basic feature that most systems support. Personally I'd try something popular and free such as MySQL or PostgreSQL, so you can more easily google your way out of problems. If you keep your SQL queries generic enough (no vendor-specific constructs), you can switch systems without much pain. I agree with bwawok: try whether a standard setup is good enough, and think about optimizations later.
Have you considered a caching system like EHCache or Memcached?
Also, if you have enough memory, you can use a sorted collection such as TreeMap as an index map, or a HashMap to look up users by name (a separate map per field). It will take more memory but can be effective. You can also determine, from real usage, the most frequently used query with the best selectivity, and build an index around that query alone; the matching subset of elements will then be small and can be filtered quickly without any additional optimization.
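The TreeMap-as-index idea might look like the following sketch: a TreeMap over age serves the range part, and a HashMap per exact-match field narrows the rest (the User fields follow the question; the combination strategy is just one possibility):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class UserDirectory {
    record User(String username, int age, String gender, String city, String country) {}

    private final TreeMap<Integer, List<User>> byAge = new TreeMap<>(); // sorted: range queries
    private final Map<String, List<User>> byCity = new HashMap<>();    // exact-match index

    void add(User u) {
        byAge.computeIfAbsent(u.age(), a -> new ArrayList<>()).add(u);
        byCity.computeIfAbsent(u.city(), c -> new ArrayList<>()).add(u);
    }

    // e.g. women between minAge and maxAge (inclusive) from the given city:
    // take the age range from the sorted index, then filter by the other attributes.
    List<User> search(int minAge, int maxAge, String gender, String city) {
        Set<User> inCity = new HashSet<>(byCity.getOrDefault(city, List.of()));
        List<User> result = new ArrayList<>();
        for (List<User> bucket : byAge.subMap(minAge, true, maxAge, true).values()) {
            for (User u : bucket) {
                if (inCity.contains(u) && u.gender().equals(gender)) result.add(u);
            }
        }
        return result;
    }
}
```

For best performance, start the filtering from whichever index is most selective for the query, as the answer suggests.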
I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on the memcache. This should speedup your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List<Key> myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
List<User> friends = (List<User>) query.execute(myFriendKeys);
You could also rely on the low-level API get() that accept multiple keys, or do like me and use objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, rather optimize for speed and not data size. Use extra structures as indexes, save objects in multiple ways so you can read it as quickly as possible. Data is cheap, CPU time is not.
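The precompute-on-change idea from this answer can be sketched with plain maps standing in for datastore entities (all entity and property names are invented): each time the friend list changes, the stored name list is rebuilt, so displaying it later is a single get.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FriendNameCache {
    // Stand-ins for datastore entities: userId -> display name, userId -> friend ids.
    private final Map<String, String> names = new HashMap<>();
    private final Map<String, List<String>> friends = new HashMap<>();
    // Precomputed read-side structure: userId -> friend display names, one get to read.
    private final Map<String, List<String>> friendNames = new HashMap<>();

    void addUser(String id, String name) {
        names.put(id, name);
        friends.put(id, new ArrayList<>());
        friendNames.put(id, new ArrayList<>());
    }

    // Writes are rarer than reads, so we pay the extra work here, not on display.
    void addFriend(String userId, String friendId) {
        friends.get(userId).add(friendId);
        friendNames.get(userId).add(names.get(friendId));
    }

    // The hot path: no joins, no N round trips, just one lookup.
    List<String> friendNamesOf(String userId) {
        return friendNames.get(userId);
    }
}
```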
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
(trace screenshot: http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements long (if it's an indexed property of the user) as described at: http://code.google.com/appengine/docs/java/datastore/overview.html .
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 terabytes of memory cache... but making 500 trips to memcached isn't very cheap either, and it can't be used to store a gazillion small items. "Denormalization" is the key: such applications do not need to support ad-hoc queries. Compute and store the results directly for the few queries you do support.
In your case, you probably have just one type of query: return the data of this, that and the other that should be displayed on a user's page. You can precompute this big ball of mess, so that later a single query by userId fetches it all.
When userA comments on userB, you retrieve userB's big ball of mess, insert userA's comment into it, and save it.
Of course, there are a lot of problems with this approach. Giant internet companies probably have no choice: generic query engines just don't cut it. But for everyone else, wouldn't you be happier if you could just use the good old RDBMS?
If it is a frequently used query, consider preparing indexes for it.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.