Collection processing or database request ? which one is better

Collection processing or database request ? which one is better - java

This is my first post on stackoverflow, so please be nice to me :-)
So let me explain the context. I'm developing a web service with a standard layer (resources, services, DAO Layer...). I use JPA with hibernate implementation for my object model with the database.
For a class A parent and a class B child, most of the time when i want to find an object B on the collection, I use the streamAPI to filter the collection based on what i want. My question here is more general, is it better to search an object by requesting the database (from my point of view this gonna cause a lot of calls to the database but it's gonna use less CPU), or do the opposite by searching over the model object and process over collection (this gonna cause less database calls, but more CPU process)

If you consider latency, the database will always be slower.
So you gotta ask yourself some questions:
how far away is the database (latency)?
how big is the dataset?
How do I process them ?
do I have any major runtime issues ?
from my point of view this gonna cause a lot of calls to the database but it's gonna use less CPU), or do the opposite by searching over the model object and process over collection (this gonna cause less database calls, but more CPU process)
You're program is probably not very performant programmed. I suggest you check the O-Notation if you have any major runtime leaks.
Your Question is very broad, so it's hard to tell you, for your use-case, which might be the best.

Use database to return data what you need and Java to perform processing on them that would be complicated to do in a JPQL/SQL query.
Databases are designed to perform queries more efficiently than Java (stream or no).
Besides, fetching many data from a database to finally keep only a part of them is not efficient.

The database is usually faster since it is optimized for requesting specific data. Usually one would add indexes to speed up querying on certain fields.
TLDR: Filter your data in the database and process them from java.

This isn't an easy question to answer, since there are many different factors that would influence my decision to go to the db or not. First, I think it's fair to say that, for almost every app I've worked on in the past 20 years, hitting the DB for information is the default strategy. More recently (say past 10 or so years) data access through web service calls has become common as well.
For me, the main question would be something along the lines of, "Are there any situations when I would not hit an external resource (DB, Service, or even file read) for data every time I need it?"
So, I'll outline some of the things I would consider.
Is the data search space very small?
If you are searching a data space of tens of different records, then this information might be a candidate for non-db storage. On the other hand, once you get past a fairly small set records, this approach becomes increasingly untenable. Examples of these "small sets" might be something like salutations (Mr., Ms., Dr., Mrs., Lord). I looks for small sets of data that rarely change, which I, as a lazy developer, wouldn't mind typing into a configuration file. Once I get past something like 50 different records (like US States, for example), I want to pull that info from a DB or service call.
Are the data cacheable?
If you have multiple requests that could legitimately use the exact same data, then leverage caching in your application. Examine the data and expected usage of your service for opportunities to leverage regularities in data and likely requests to cache data whenever possible. Remember to consider cache keys, how long items should be cached, and when cached items should be evicted.
In many web usage scenarios, it's not uncommon that each display could include a fairly large amount of cached information, and a small amount of dynamic data. Menu and other navigation items are good candidates for caching. User-specific data, such as contract-sepcific pricing in an eCommerce app are often poor candidates.
Can you pre-load some data into cache?
Some items can be read once and cached for the entire duration of your application. A list of US States and/or Canadian Provinces is a good example here. These almost never change, so once read from the db, you would rarely need to read them again. Consider application components that can load such data on startup, and then hold this data in an appropriate collection.

Related

Database Data Filtering Best Practice

I am currently using raw JDBC to query records in a MySql database; each record in the subsequent Resultset is ultimately extracted, placed in a domain specific model, and stored to a List Instance.
My query is: in circumstances where there is a requirement to further filter that data (incidentally based on columns that exist in the SAME Table) which of the following approaches would generally be considered best practice:
1.The issuance of further WHERE clause calls into the database. This will effectively offload the filtering process to the database but obviously results in an additional query or queries where multiple filters are applied consecutively.
2.Explicitly filtering the aforementioned preprocessed List at the Application level, thus negating the need to have to make additional calls into the database each time the records are filtered.
3.Some hybrid combination of the above two approaches, perhaps where all filtering operations are initially undertaken by the database server but THEN preprocessed to a application specific model and implicitly cached to a collection for some finite amount of time. Further filter queries, received within this interval, would then be serviced from the data stored in the cache.
It is important to note that the Database Server in this scenario is actually located on
an external machine, therefore the overhead and latency of sending query traffic over the local network also has to be factored into the approach we ultimately elect to take.
I am patently aware of the age-old mantra that stipulates that: "The database server should be used to do what its good at." however in this scenario it just seems like a less than adequate solution to be making numerous calls into the database to filter data that I ALREADY HAVE at the application level.
Your thoughts and insights would be greatly appreciated.

I have used the hybrid approach on many applications with good results.
Database filtering works good especially for columns that are indexed. This reduces network overhead since fewer rows are sent to application.
Database filtering can be really slow for some columns depending upon the quantity of rows in the results and the lack of indexes. The network overhead can be negligible compared to database query time so application filtering may be faster for this situation.
I also find that application filtering in Java easier to write and understand instead of complex SQL.
I usually experiment manually to get the fewest rows in a reasonable time with plain SQL. Then write Java to refine to the desired rows.

i appreciate this question first...as i too faced similar situation few days back...as you already discussed all available options i prefer to go with the second option....i mean handling at application level rather than filtering at DB level.

why don't almost all databases implement row cache?

I noticed that almost all database don't implement row cache internally. I ask this question because I find someone add row-cache patch for innodb. At least, it has one advantage beyond performance gain, i.e. it is transparent for the client.
Is there any difficult technical reason which prevent doing so, or just because it's useful for very specific access pattern?
Thanks

Quite frankly, if you're getting the same version of the same row multiple times, you're doing it wrong. Data should be cached, as a general rule, if it's unlikely to change and needs to be accessed multiple times. Given this rule, if caching a DB row on the server side ever helps, it means you're making too many round trips to the database for the data you're interested in. You should instead be caching it client-side to cut down on the round trips. If the data changes often and so you need to access it often, caching still won't help because the cached data is out of date and the query must be re-executed. Only getting the data that's different from the cached data doesn't help; you have to figure out what's different and you're still making a query of the DB.
On top of that, most databases are designed for high-concurrency performance. Caching one guy's massive result set is going to eat into resources available for the next guy's massive result set and so on. In a high-user-count scenario, building a cache would likely simply result in the cached data being thrown away to make room for more cached data; it wouldn't be able to stick around long enough to be of use.

How to retrieve huge (>2000) amount of entities from GAE datastore in under 1 second?

We have some part of our application that need to load a large set of data (>2000 entities) and perform computation on this set. The size of each entity is approximately 5 KB.
On our initial, naïve, implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We had tried several strategies to speed up the entities retrieval:
Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
Storing the entities at an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The computation needs to be dynamically computed, so doing a precomputation at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200.000 entities of the same kind in the datastore and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained could be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say that we have a StockDerivative entity that need to be analyzed to know whether it's a good investment or not.
The analysis performed requires complex computations based on many factors both external (e.g. user's preference, market condition) and internal (i.e. from the entity's properties), and would output a single "investment score" value.
The user could request the derivatives to be sorted based on its investment score and ask to be presented with N-number of highest-scored derivatives.

200.000 by 5kb is 1GB. You could keep all this in memory on the largest backend instance or have multiple instances. This would be the fastest solution - nothing beats memory.
Do you need the whole 5kb of each entity for computation?
Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.

Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't you probably have to move to another platform.

This is very interesting, but yes, its possible & Iv seen some mind boggling results.
I would have done the same; map-reduce concept
It would be great if you would provide us more metrics on how many parallel instances do you use & what are the results of each instance?
Also, our process includes retrieval alone or retrieval & storing ?
How many elements do you have in your data store? 4000? 10000? Reason is because you could cache it up from the previous request.
regards

In the end, it does not appear that we could retrieve >2000 entities from a single instance in under one second, so we are forced to use in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we found a better strategy/implementation for this problem, I would change or update the accepted answer.

Our solution involves periodically reading entities in a background task and storing the result in a json blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in javascript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately the Java library for MapReduce is broken for me so we're using non optimal task to do all the reading but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Mind that, last time I checked, the Blobstore wasn't returning gzipped entities > 1MB so at the moment we're loading the content from a compressed entity and expanding it into memory, that way the final payload gets gzipped. I don't like that, it introduces latency, I hope they fix issues with GZIP soon!

caching readonly data for java application

I have a database which has around 150K records of data with a primary key on the table. The data size for each record will take less than 1kB. The processing time for constructing a POJO from the DB record takes about 1-2 secs(there is some business logic that takes too much time). This is read-only data. Hence I'm planning to implement caching the data. What I'm thinking to do is. Load the data in subsets(200 records each time) and create a thread that'll construct the POJOs and keep them in a hashtable. While the cache is being loaded(when I start the application) the User will see a wait sign. For storing the data in HashTable is an issue I'll actually store the processed data in to another DB table(marshall the POJO to xml).
I use a third party API to load the data from database. Once I load a record I'll have load the data I'll have to load associations for the loaded data and then associations for the association found at the top level. It's like loading a family tree.
I can't use Hibernate or any ORM framework as I'm using a third party API to load the data which is shipped with the database it self(it's a product). More over I don't think loading data once is not a big issue.
If there is a possibility to fine tune the business logic I wouldn't have asked this question here.
Caching the data on demand is an option, but I'm trying to see if I can do anything better.
Suggest me if there is a better idea that you are aware of. Thank you./

Suggest me if there is a better idea that you are aware of.
Yes, fix the business logic so that it doesn't take 1 to 2 seconds per record. That's a ridiculously long time.
Before you do that, profile your application to make sure that it is really the business logic that is causing the slow record loading, and not something else. (For example, it could be a pathological data structure, or a database issue.)
Once you've fixed the root cause of the slow record loading, it is still a good idea to cache the read-only records, but you probably don't need to preload the cache. Instead, just load the records on demand.

It sounds like you are reinventing the wheel. I'd be looking to use hibernate. Apart from simplifying the code to access the database, hibernate has built-in caching and lazy loading of data so it only creates objects as you request them. Ergo, a lot of what you describe above is already in place and you can concentrate on sorting out your business logic. I suspect that once you solve the business logic performance issue, there will be no need to do such as complicated caching system and hibernate defaults will be sufficient.

As maximdim said in a comment, preloading the whole thing will take a lot of time. If your system is not very strange, the user won't need all data at once. Just cache on demand instead. I would also recommend using an established caching solution, such as EHCache, which has persistence via DiskStore -- the only issue is that whatever you cache in this case has to be Serializable. Since you can marshall it as XML, I'm betting you can serialize it too, which should be faster.
In a past project, we had to query a very busy, very sluggish service running in an off-site mainframe in order to assemble one of the entities. Average response times from our app were dominated by this query. Since the data we retrieved was mostly read-only caching with EHCache solved our problems.

jdbm has a nice, persistent map implementation (http://code.google.com/p/jdbm2/) - that may help you do local caching - it would certainly be a lot faster than serializing your POJOs to XML and writing them back into a SQL database.
If your data is truly read-only, then I'd think that the best solution would be to treat the source database as an input queue that feeds your app database. Create a background process (heck, a service would be better), and have it monitor the source database and keep your app database synced.

(Java) Store a huge collection of objects with indexed attributes

I need to store about 100 thousands of objects representing users. Those users have a username, age, gender, city and country.
The users should be searchable by a range of age and any of the other attributes, but also a combination of attributes (e.g. women between 30 and 35 from Brussels). The results should be found quickly as it is one of the Server's services for many connected Clients). Users may only be deleted or added, not updated.
I've thought of a fast database with indexed attributes (like h2 db which seems to be pretty fast, and I've seen they have a in-memory mode)
I was wondering if any other option was possible before going for the DB.
Thank you for any ideas !

How much memory does your server have? How much memory would these objects take up? Is it feasible to keep them all in memory, or not? Do you really need the speedup of keeping in memory, vs shoving in a database? It does make it more complex to keep in memory, and it does increase hardware requirements... are you sure you need it?
Because all of what you describe could be ran on a very simple server and put in a very simple database and give you the results you want in the order of 100ms per request. Do you need faster than 100ms response time? Why?

I would use a RDBMS - there are plenty of good ORMs available, such as Hibernate, which allow you to transparently stuff the POJOs into a db. Once you've got the data access abstracted, you then have the freedom to decide how best to persist the data.
For this size of project, I would use the H2 database. It has both embedded and client/server modes, and can operate from disk or entirely in memory.

Most definitely a relational database. With that size you'll want a client-server system, not something embedded like Sqlite. Pick one system depending on further requirements. Indexing is a basic feature, most systems support it. Personally I'd try something that's popular and free such as MySQL or PostgreSQL so you can more easily google your way out of problems. If you make your SQL queries generic enough (no vendor-specific constructs), you can switch systems without much pain. I agree with bwawok, try whether a standard setup is good enough and think of optimizations later.

Did you think to use cache system like EHCache or Memcached?
Also If you have enough memory you can use some sorted collection like TreeMap as index map, or HashMap to search user by name (separate Map per field). It will take more memory but can be effective. Also you can find based on the user query experience the most frequently used query with the best selectivity and create comparator based on this query onli. In this case subset of the element will not be a big and can can be filter fast without any additional optimization.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.