(Java) Store a huge collection of objects with indexed attributes

(Java) Store a huge collection of objects with indexed attributes - java

I need to store about 100 thousands of objects representing users. Those users have a username, age, gender, city and country.
The users should be searchable by a range of age and any of the other attributes, but also a combination of attributes (e.g. women between 30 and 35 from Brussels). The results should be found quickly as it is one of the Server's services for many connected Clients). Users may only be deleted or added, not updated.
I've thought of a fast database with indexed attributes (like h2 db which seems to be pretty fast, and I've seen they have a in-memory mode)
I was wondering if any other option was possible before going for the DB.
Thank you for any ideas !

How much memory does your server have? How much memory would these objects take up? Is it feasible to keep them all in memory, or not? Do you really need the speedup of keeping in memory, vs shoving in a database? It does make it more complex to keep in memory, and it does increase hardware requirements... are you sure you need it?
Because all of what you describe could be ran on a very simple server and put in a very simple database and give you the results you want in the order of 100ms per request. Do you need faster than 100ms response time? Why?

I would use a RDBMS - there are plenty of good ORMs available, such as Hibernate, which allow you to transparently stuff the POJOs into a db. Once you've got the data access abstracted, you then have the freedom to decide how best to persist the data.
For this size of project, I would use the H2 database. It has both embedded and client/server modes, and can operate from disk or entirely in memory.

Most definitely a relational database. With that size you'll want a client-server system, not something embedded like Sqlite. Pick one system depending on further requirements. Indexing is a basic feature, most systems support it. Personally I'd try something that's popular and free such as MySQL or PostgreSQL so you can more easily google your way out of problems. If you make your SQL queries generic enough (no vendor-specific constructs), you can switch systems without much pain. I agree with bwawok, try whether a standard setup is good enough and think of optimizations later.

Did you think to use cache system like EHCache or Memcached?
Also If you have enough memory you can use some sorted collection like TreeMap as index map, or HashMap to search user by name (separate Map per field). It will take more memory but can be effective. Also you can find based on the user query experience the most frequently used query with the best selectivity and create comparator based on this query onli. In this case subset of the element will not be a big and can can be filter fast without any additional optimization.

Related

Cache query results

Let's say, we have a highly configurable report system, which allows users to select columns, filters, and sorting.
All this configuration comes to BE, where it's being transformed to SQL, executed against DB and then the user sees his report and can continue to work with it. But on each operation, like sorting, we still build a query.
The transformation itself takes few milliseconds, but the query execution against DB can take 3-5 seconds (up to 20 if there are a lot of parallel executions).
So, I'm thinking about adding some sort of cache.
Currently, I see 3 ways:
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on much less amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Like the first option, but LRU cache on Java side. We can fit in memory 2-3k report results. It will be usually faster than in the first option since we don't have a lot of parallel users, just users with lots of reports.
The cache invalidation will be a few times a day.
What do you see is the best way to make it faster? What cons and pros in proposed solutions from yours perspective? What would you do if you are free in selecting Database and technology (Java stack)?

OK, let's make sure I got it right.
there are more than 10k different reports
So it doesn't make sense to pre-calculate and pre-cache them, they have to be generated on-demand.
there is not a lot of data in rows, just short strings, dates and integers. It’s not costly to fetch it in memory and even save there for a while
So caching a small amount of data can avoit a big costly query, that's good.
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Problem is, most likely every report query will have different columns, with different names, so that doesn't fit a single table well unless you use a format like JSON, storing each cached result row as a JSON dictionary... And in this case indexing it would be a problem, even if you create indexes on fields inside JSON values, if you have a zillion different column names from your many reports you'll need a zillion indexes too...
Smells like a can of worms.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on much less amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Pros: each cache table can have the proper columns, data types and indexes. It is easy to invalidate the cache, just truncate it. You can set all the cache tables to UNLOGGED to make them faster. And you can do all the extra sorting/filtering on the cached result using the same SQL queries you were using before, so this might be the simpler option to code. It is also nice for pagination if you only want to fetch part of the result. And that will be the fastest option as far as copying the results of reporting queries into cache since the cache is already in postgres, there is no need to transfer data. You can also store the cache on another drive/SSD.
Cons: I've heard the main issue with tons of tables is if your filesystem slows down on directories with large numbers of files. That shouldn't be an issue on modern filesystems though, and I don't think postgres itself is going to be bothered at all by 10k tables.
It might make queries on information_schema slow, and stuff like "\dt" in psql problematic, so the cache tables would be better hidden away in a "cache" schema so they don't interfere. This will also make it easier to exclude them from backups.
It will also use some RAM on postgres server to cache the cache tables, that depends on the number of online users.
I'd say it would be worth a little bit of benchmarking. Create a schema, add 10k tables, see if something breaks.
Like the first option, but LRU cache on Java side. We can fit in memory 2-3k report results. It will be usually faster than in the first option since we don't have a lot of parallel users, just users with lots of reports.
That's a bit of reinventing the wheel, and you got to reimplement the sort/filter in java... plus the cache algos... meeeh.
There are other options though:
Put the cache in another database, on another machine. This may be a postgres instance, or another database (which may require rewriting some queries). Could be interesting only if the cache eats too much RAM on your database.
Put the cache in the web browser, and use javascript to filter/sort. That could be faster depending on speed of internet connection, and it would reduce server load, but you'll have to write lots of javascript code.
IMO you're cautious about the large number of tables, it is good to be cautious, but if it works well, it really is the simplest solution...

Collection processing or database request ? which one is better

This is my first post on stackoverflow, so please be nice to me :-)
So let me explain the context. I'm developing a web service with a standard layer (resources, services, DAO Layer...). I use JPA with hibernate implementation for my object model with the database.
For a class A parent and a class B child, most of the time when i want to find an object B on the collection, I use the streamAPI to filter the collection based on what i want. My question here is more general, is it better to search an object by requesting the database (from my point of view this gonna cause a lot of calls to the database but it's gonna use less CPU), or do the opposite by searching over the model object and process over collection (this gonna cause less database calls, but more CPU process)

If you consider latency, the database will always be slower.
So you gotta ask yourself some questions:
how far away is the database (latency)?
how big is the dataset?
How do I process them ?
do I have any major runtime issues ?
from my point of view this gonna cause a lot of calls to the database but it's gonna use less CPU), or do the opposite by searching over the model object and process over collection (this gonna cause less database calls, but more CPU process)
You're program is probably not very performant programmed. I suggest you check the O-Notation if you have any major runtime leaks.
Your Question is very broad, so it's hard to tell you, for your use-case, which might be the best.

Use database to return data what you need and Java to perform processing on them that would be complicated to do in a JPQL/SQL query.
Databases are designed to perform queries more efficiently than Java (stream or no).
Besides, fetching many data from a database to finally keep only a part of them is not efficient.

The database is usually faster since it is optimized for requesting specific data. Usually one would add indexes to speed up querying on certain fields.
TLDR: Filter your data in the database and process them from java.

This isn't an easy question to answer, since there are many different factors that would influence my decision to go to the db or not. First, I think it's fair to say that, for almost every app I've worked on in the past 20 years, hitting the DB for information is the default strategy. More recently (say past 10 or so years) data access through web service calls has become common as well.
For me, the main question would be something along the lines of, "Are there any situations when I would not hit an external resource (DB, Service, or even file read) for data every time I need it?"
So, I'll outline some of the things I would consider.
Is the data search space very small?
If you are searching a data space of tens of different records, then this information might be a candidate for non-db storage. On the other hand, once you get past a fairly small set records, this approach becomes increasingly untenable. Examples of these "small sets" might be something like salutations (Mr., Ms., Dr., Mrs., Lord). I looks for small sets of data that rarely change, which I, as a lazy developer, wouldn't mind typing into a configuration file. Once I get past something like 50 different records (like US States, for example), I want to pull that info from a DB or service call.
Are the data cacheable?
If you have multiple requests that could legitimately use the exact same data, then leverage caching in your application. Examine the data and expected usage of your service for opportunities to leverage regularities in data and likely requests to cache data whenever possible. Remember to consider cache keys, how long items should be cached, and when cached items should be evicted.
In many web usage scenarios, it's not uncommon that each display could include a fairly large amount of cached information, and a small amount of dynamic data. Menu and other navigation items are good candidates for caching. User-specific data, such as contract-sepcific pricing in an eCommerce app are often poor candidates.
Can you pre-load some data into cache?
Some items can be read once and cached for the entire duration of your application. A list of US States and/or Canadian Provinces is a good example here. These almost never change, so once read from the db, you would rarely need to read them again. Consider application components that can load such data on startup, and then hold this data in an appropriate collection.

Oracle distinct vs java (cqengine/set) : whose leads to better performances?

I have a table from which I extract 8 columns, said columns will be properties of a pojo, say MyPojo.
I want to remove duplicates.
I came up with two strategies.
1-Let oracle take care of this with distinct keyword
select distinct c1,c2...c8 from TABLE where...`
2-Do this in java with cqengine (https://code.google.com/p/cqengine/wiki/DeduplicationStrategies#Logical_Elimination_Strategy):
DeduplicationOption deduplication = deduplicate(DeduplicationStrategy.LOGICAL_ELIMINATION);
ResultSet<Car> results = cars.retrieve(query, queryOptions(deduplication));
3-Do this in java with a set
simply storing rows inside of a Set<MyPojo>
From a performance point of view which one is better?

Let the database do the work. In this case you don't send unnecessary data over the network which will - probably - have the biggest positive impact on performance.
Also it is the most compact solution in terms of code size.

The best way to decide these things is to model it.
What are the access patterns in your application?
If this is would be a one-off request: have the database do the filtering.
If you expect to get many such identical requests: have the database do the filtering, and consider caching results in the application.
If you expect to get a variety of queries on the same dataset, consider caching the unfiltered dataset into the application tier, and querying it with CQEngine.
There is no rule of thumb such as "always have the database do the work". If your application operates at any kind of scale, you will not want every request to hit the database. You need to scale out your application tier.
On the other hand, you should not over-engineer. The answer depends on the traffic volume and data access patterns that you expect.

Java File IO vs Local database

I am working on a project that involves parsing through a LARGE amount of data rapidly. Currently this data is on disk and broken down into a directory hierarchy:
(Folder: DataSource) -> (Files: Day1, Day2, Day3...Day1000...)
(Folder: DataSource2) -> (Files: Day1, Day2, Day3...Day1000...)
...
(Folder: DataSource1000) -> ...
...
Each Day file consists of entries that need to be accessed very quickly.
My initial plans were to use traditional FileIO in java to access these files, but upon further reading, I began to fear that this might be too slow.
In short, what is the fastest way I can selectively load entries from my filesystem from varying DataSources and Days?

The issue could be solved both ways but it depends on few factors
go for FileIO.
if the volume is < millons of rows
if your dont do a complicated query like Jon Skeet said
if your referance for fetching the row is by using hte Folder Name: "DataSource" as the key
go for DB
if you see your program reading through millions of records
you can do complicated selection, even multiple rows using a single select.
if you have knowledge of creating a basic table structure for DB

Depending on architecture you are using you can implement different ways of caching, in the Jboss there is a built-in Jboss Caching, there are also third party opensource software that lets utilizes caching, like Redis, or EhCache depending on your needs. Basically Caching stores objects in their memory, some are passivated/activated upon demand, when memory is exhausted it is stored as a physical IO file, which are also easily activated marshalled by the caching mechanism. It lowers the database connectivity held by your program. There are other caches but here are some of them that I've worked with:
Jboss:http://www.jboss.org/jbosscache/
Redis:http://redis.io/
EhCache:http://ehcache.org/

what is the fastest way I can selectively load entries from my filesystem from varying DataSources and Days?
selectively means filtering, so my answer is a localhost database. Generally speaking if you filter, sort, paginate or extract distinct records from a large number of records, it's hard to beat a localhost SQL server. You get a query optimizer (nobody does that Java), a cache (which requires effort in Java, especially the invalidation), database indexes (have not seen that being done in Java either) etc. It's possible to implement these things manually, but then your are writing a database in Java.
On top of this you gain access to higher level SQL functions like window aggegrates etc., so in most cases there is no need to post-process data in Java.

Using hashmap or H2 database?

I am developing a web application in which I need to store session, user messages etc. I am thinking of using HashMap or H2 database.
Please let me know which is better approach in terms of performance and memory utilization. The web site has to support 10,000 users.
Thanks.

As usual with these questions, I would worry about performance as/when you know it's an issue.
10000 users is not a lot of data to hold in memory. I would likely start off with a standard Java collection, and look at performance when you predict it's going to cause you grief.
Abstract out the access to this Java collection such that when you substitute it, the refactoring required is localised (and perhaps make it configurable, such that you can easily perform before/after performance tests with your different solutions -H2, Derby, Oracle, etc. etc.)

If your session objects aren't too big (which should be the case), there is no need to persist them in a database.
Using a database for this would add a lot of complexity in a case when you can start with a few lines of code. So don't use a database, simply store them in a ligth memory structure (HashMap for example).
You may need to implement a way to clean your HashMap if you don't want to keep sessions in memory when the user left from a long time. Many solutions are available (the easiest is simply to have a background thread removing from time to time the too old sessions). Note that it's usually easier to clean a hashmap than a database.

Both H2 and Hash Map are gonna keep the data in memory (So from space point of view they are almost the same).
If look ups are simple like KEY VALUE then looking up in the Hash Map will be quicker.
If you have to do comparisons like KEY < 100 etc use H2.
In fact 10K user info is not that high a number.

If you don't need to save user messages - use the collections. But if the message is should be saved, be sure to use a database. Because after restart you lost all data.

The problem with using a HashMap for storing objects is that you would run into issues when your site becomes too big for one server and would need to be clustered in order to scale with demand. Then you would face problems with how to synchronise the HashMap instances on different servers.
A possible alternative would be to use a key-value store like Redis as you won't need the structure of a database or even use the distributed cache abilities of something like EHCache

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.