DB4O performance retrieving a large number of objects - java

I'm interested in using DB4O to store the training data for a learning algorithm. This will consist of (potentially) hundreds of millions of objects. Each object is on average 2k in size based on my benchmarking.
The training algorithm needs to iterate over the entire set of objects repeatedly (perhaps 10 times). It doesn't care what order the objects are in.
My question is this: When I retrieve a very large set of objects from DB4O, are they all loaded into memory, or are they pulled off disk as needed?
Clearly, pulling hundreds of millions of 2k objects into memory won't be practical on the type of servers I'm working with (they have about 19GB of RAM).
Is Db4o a wise choice here?

db4o's activation mechanism allows you to control which objects are loaded into memory. For complex object graphs you should probably use transparent activation, where db4o loads an object into memory as soon as it is used.
However, db4o doesn't explicitly remove objects from memory. It just keeps a weak reference to each loaded object. If an object is reachable, it will stay there (just like any other object). Optionally, you can explicitly deactivate an object.
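A minimal sketch of what that could look like, assuming db4o 8.x's embedded API (TransparentActivationSupport, query(Class) and deactivate); the TrainingSample class and file name are placeholders, and for full transparent activation your persistent classes would typically also need enhancement or an Activatable implementation, which this sketch omits:

    import com.db4o.Db4oEmbedded;
    import com.db4o.ObjectContainer;
    import com.db4o.config.EmbeddedConfiguration;
    import com.db4o.ta.TransparentActivationSupport;

    public class ActivationSketch {

        // Placeholder for the real training-data class.
        static class TrainingSample {
            double[] features;
            double label;
        }

        public static void main(String[] args) {
            EmbeddedConfiguration config = Db4oEmbedded.newConfiguration();
            // Activate referenced objects lazily, only when they are first used.
            config.common().add(new TransparentActivationSupport());

            ObjectContainer db = Db4oEmbedded.openFile(config, "training.db4o");
            try {
                for (TrainingSample sample : db.query(TrainingSample.class)) {
                    // ... use the sample ...
                    // Explicitly drop the sample's fields from memory again (depth 1).
                    db.deactivate(sample, 1);
                }
            } finally {
                db.close();
            }
        }
    }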
I just want to add a few notes on the scalability of db4o. db4o was built for embedding in applications and devices; it was never built for large datasets, so it has its limitations.
It is internally single-threaded. Most db4o operations block all other db4o operations.
It can only deal with relatively small databases. By default a database is limited to 2 GB. You can increase that limit up to 127 GB, but in my experience db4o operates well in the 2-16 GB range; beyond that the database is probably too large for it. In any case, hundreds of millions of 2 KB objects is far too large a database (100 million 2 KB objects => 200 GB).
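If you do want to raise the default 2 GB limit, the block size is configured before opening the file. A minimal sketch, assuming db4o 8.x's EmbeddedConfiguration API; the blockSize call and the value 8 are from memory and should be verified against your db4o version:

    import com.db4o.Db4oEmbedded;
    import com.db4o.ObjectContainer;
    import com.db4o.config.EmbeddedConfiguration;

    public class BlockSizeSketch {
        public static void main(String[] args) {
            EmbeddedConfiguration config = Db4oEmbedded.newConfiguration();
            // A larger block size raises the maximum database file size;
            // the 2 GB default corresponds to the smallest block size.
            config.file().blockSize(8);
            ObjectContainer db = Db4oEmbedded.openFile(config, "training.db4o");
            db.close();
        }
    }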
Therefore you should probably look at larger object databases, like VOD. Or maybe a graph database like Neo4J would also be a good choice for your problem?

Related

database or ObjectOutputStream, Object specific member or actual object for reference

I'm working on an application for a pharmacy. Basically this application has a class "item" and another class "selling invoices" which logs selling processes.
So my question is: if the pharmacy is expected to have about ten thousand products in stock, and I'm storing these products in a linked list of type Item, storing the invoices in a linked list as well, and then on closing the app I save them using an ObjectOutputStream and reload them on start, is that bad practice? Should I use a database instead?
My second question is: if I continue using linked lists and ObjectOutputStream, which is better for performance and memory, storing the actual item as a field in the invoice class, or just its ID and then looking the item up by that ID when needed?
Thanks in advance.
It is a bad idea to use ObjectOutputStream like that.
Here are some of the reasons:
If your application crashes (or the power fails) before you "save", then all changes are lost.
Saving "all objects" is expensive.
Serialized objects are opaque. It is only practical to look at them from Java code.
Serialized objects are fragile. If your application classes change, you may find that old serialized objects can no longer be read. That's bad enough, but now consider what happens if your client wants to look at pharmacy records from 5 years ago ... from a backup tape.
Serialized objects provide no way of searching ... apart from reading all of the objects one at a time.
Designs which involve reading all objects into memory do not scale. You are liable to run out of memory. Or compromise on your requirements to avoid running out of memory.
By contrast:
A database won't lose changes that have been committed. Databases are much more resilient to things like application errors and system-level failures.
Committing database changes is not as expensive, because you only write data that has changed.
Typical databases can be viewed, queried, and if necessary repaired using an off-the-shelf database tool.
Changing Java code doesn't break the database. And for some schema changes, there are ways to migrate the database schema and records to match the updated application classes.
Databases have indexes and query languages for implementing efficient search.
Databases scale because the primary copy of the data is on disk, not in memory.
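To make the search point concrete, here is a minimal sketch using plain JDBC; the table, columns, and the embedded H2 connection URL are made-up examples, not anything from the question. An indexed query pulls back only the matching rows, instead of deserializing every stored object:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class InvoiceLookup {
        public static void main(String[] args) throws Exception {
            // Hypothetical embedded database file and schema.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./pharmacy");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT id, item_id, quantity, sold_at FROM invoice WHERE item_id = ?")) {
                ps.setLong(1, 42L);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // Only matching rows ever leave the disk; nothing else is loaded.
                        System.out.println(rs.getLong("id") + " x" + rs.getInt("quantity"));
                    }
                }
            }
        }
    }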

When to hit the database instead of searching an object in your List

Suppose we are using MongoDB (NoSQL) or MySQL (a relational DB) to retrieve objects and we want to search for a specific element (a where clause), but we already have an in-memory list (LinkedList, ArrayList or whatever) containing some of the objects, either as a cache or for some other reason.
Is there an equation / library that can "advise" when it is cheaper to use the in-memory structure for retrieval instead of querying the database (taking into consideration, for example, the size of the in-memory list)?
It's always cheaper to query the in-memory cache.
The only exception would be if you had constructed the cache so that it was very inefficient to search it (e.g., linear search). But as long as it's a hash-based structure, it'll be doing at worst what the database needs to do, but without the network overhead. Looking something up in a hash table is essentially free.
The bigger question is whether your cache uses so much memory it starves the rest of the application. You'll want a weak hash map or similar, to avoid this. If this is a cache produced by some sort of ORM, it'll be weakly referenced already.
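As a rough illustration of the check-the-cache-first idea, here is a sketch; the Customer type, the DAO interface, and the use of SoftReference values are placeholders rather than any particular ORM's cache:

    import java.lang.ref.SoftReference;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CustomerCache {
        private final Map<Long, SoftReference<Customer>> cache = new ConcurrentHashMap<>();
        private final CustomerDao dao;   // whatever actually queries MongoDB/MySQL

        public CustomerCache(CustomerDao dao) {
            this.dao = dao;
        }

        public Customer findById(long id) {
            SoftReference<Customer> ref = cache.get(id);
            Customer cached = (ref != null) ? ref.get() : null;
            if (cached != null) {
                return cached;                     // O(1) hash lookup, no network round trip
            }
            Customer fromDb = dao.loadById(id);    // fall back to the database
            if (fromDb != null) {
                // SoftReferences let the GC reclaim entries under memory pressure.
                cache.put(id, new SoftReference<>(fromDb));
            }
            return fromDb;
        }

        // Minimal placeholder types so the sketch compiles.
        public static class Customer { long id; String name; }
        public interface CustomerDao { Customer loadById(long id); }
    }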

Java Application / ArrayList versus direct database queries

In short, I want to know how effective it is to use ArrayLists in Java to hold objects with a lot of data in them. How large can an ArrayList grow, and are there any issues with using an ArrayList to hold 2000+ customer details (objects) at runtime? Does it hurt performance in any way? Or is there a better way to design an app which needs to access data quickly?
I am developing a new module (customer lead tracker) for my small ERM application, which also handles payroll details for a company. So far the data has not been huge; now with this module I am expecting the database to grow fast, and I will have to load 2000+ customer details from the database to perform various data manipulations and updates.
I wanted some suggestions as to which approach would be better:
Querying the customer database (100+ columns) and getting the data to work with for each transaction (a lot of separate queries each time).
Loading each row into an object at the start, saving it in an ArrayList, using the list to work with each row when required, and saving the objects (rows) back at the end of a transaction.
Sorry if I have asked a dumb question; I am really a start-up independent developer, and this may sound a bit awkward from an experienced developer's perspective.
It depends on how much memory you have. Querying the DB for each and every transaction is not a good approach either. A better approach would be to load a batch of data into memory (sized to fit your memory), and once you are done with it, discard it and fire the next set of DB queries. This way you can optimize memory usage as well as DB queries.
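A sketch of that batched approach with plain JDBC; the connection URL, table, columns, and the batch size of 500 are assumptions for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class ChunkedCustomerLoader {
        private static final int BATCH_SIZE = 500;

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/erm", "user", "pass");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT id, name, phone FROM customer ORDER BY id LIMIT ? OFFSET ?")) {
                int offset = 0;
                while (true) {
                    ps.setInt(1, BATCH_SIZE);
                    ps.setInt(2, offset);
                    List<String[]> chunk = new ArrayList<>();
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            chunk.add(new String[]{rs.getString("id"), rs.getString("name"), rs.getString("phone")});
                        }
                    }
                    if (chunk.isEmpty()) {
                        break;                 // no more rows
                    }
                    process(chunk);            // work with this chunk only
                    offset += chunk.size();    // the chunk becomes garbage-collectable here
                }
            }
        }

        private static void process(List<String[]> chunk) {
            // ... transactions / manipulations on the loaded rows ...
        }
    }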
An ArrayList can hold no more than 2^31 - 1 elements, due to the int-typed index of its backing array.
There is an approach called an in-memory DB, which implies that you hold a lot of data in memory to gain fast access to it. But this approach also implies that:
a. you have a lot of memory available for holding all the necessary data (it could be several tens of gigabytes);
b. your DB implements a compact form of data storage. This means the DB will not contain ready Java objects, but fragments of byte-array data from which you construct objects on demand.
So you need to estimate how much memory you will need for all the data you want to load into memory and decide whether this approach is viable or not.

Reducing memory usage for batch JDO inserts

We have a Java web app using Data Nucleus 1.1.4 / JDO 2.3 for persistence.
There's a batch import operation that persists a large number of JDO objects in one shot. We've had some situations where OutOfMemoryErrors are thrown because the data to import is so large.
The intended pattern was to loop through the input stream, parse a row, instantiate a JDO object, call makePersistent, then release the object reference to the JDO object in order to keep our memory footprint flat regardless of the input data size.
In doing some analysis of the heap during this operation, it appears that the JDO object instances pile up and take up a large chunk of memory until the commit happens. Even though we don't hold references to them, it looks like DataNucleus's PersistenceManager and Transaction implementations reference an org.datanucleus.ObjectManagerImpl object that holds onto a list of "dirty" JDO object instances (actually copies of the originals). There's probably a good rationale for this, but I was a bit surprised that the framework needed to hold onto copies of each JDO object. They are let go after the commit, but since we want all insertions to happen atomically, we need to run this operation inside a transaction. In its current state the memory usage is linearly correlated with the input data size, which opens us up to these OutOfMemoryErrors - if not for a single operation, then under concurrent operations.
Are there any tips or best practices around keeping as close to a flat memory footprint for a batch JDO insertion operation like this?
What I found out was that the best practice was to call the PersistenceManager's flush method periodically through the loop. This causes the JDO framework (ObjectManagerImpl) to let go of the objects.
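A minimal sketch of that pattern using the standard JDO API; the input format, the ImportedRecord class, and the flush interval of 1,000 are placeholders:

    import javax.jdo.PersistenceManager;
    import javax.jdo.Transaction;
    import java.io.BufferedReader;

    public class BatchImporter {
        private static final int FLUSH_INTERVAL = 1000;

        public void importAll(PersistenceManager pm, BufferedReader input) throws Exception {
            Transaction tx = pm.currentTransaction();
            tx.begin();
            try {
                int count = 0;
                String line;
                while ((line = input.readLine()) != null) {
                    ImportedRecord record = parse(line);  // placeholder parse step
                    pm.makePersistent(record);
                    if (++count % FLUSH_INTERVAL == 0) {
                        // Push pending inserts to the datastore so the
                        // PersistenceManager can release its "dirty" copies.
                        pm.flush();
                    }
                }
                tx.commit();  // the whole import still commits atomically
            } finally {
                if (tx.isActive()) {
                    tx.rollback();
                }
            }
        }

        private ImportedRecord parse(String line) {
            // ... build the persistent object from one input row ...
            return new ImportedRecord();
        }

        // Placeholder persistent class; the real one would be enhanced/annotated for JDO.
        public static class ImportedRecord { }
    }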

Using hashmap or H2 database?

I am developing a web application in which I need to store sessions, user messages, etc. I am thinking of using a HashMap or an H2 database.
Please let me know which is the better approach in terms of performance and memory utilization. The web site has to support 10,000 users.
Thanks.
As usual with these questions, I would worry about performance as/when you know it's an issue.
10000 users is not a lot of data to hold in memory. I would likely start off with a standard Java collection, and look at performance when you predict it's going to cause you grief.
Abstract out the access to this Java collection such that when you substitute it, the refactoring required is localised (and perhaps make it configurable, so that you can easily perform before/after performance tests with your different solutions - H2, Derby, Oracle, etc.).
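For example, a small abstraction along these lines (all names are illustrative) keeps the choice of backing store swappable:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Call sites depend only on this interface, so swapping the in-memory
    // implementation for an H2-, Derby-, or Oracle-backed one stays a local change.
    interface SessionStore {
        void put(String sessionId, UserSession session);
        UserSession get(String sessionId);
        void remove(String sessionId);
    }

    class InMemorySessionStore implements SessionStore {
        private final Map<String, UserSession> sessions = new ConcurrentHashMap<>();

        @Override public void put(String sessionId, UserSession session) { sessions.put(sessionId, session); }
        @Override public UserSession get(String sessionId) { return sessions.get(sessionId); }
        @Override public void remove(String sessionId) { sessions.remove(sessionId); }
    }

    // Placeholder session type.
    class UserSession { String userId; long lastAccessMillis; }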
If your session objects aren't too big (which should be the case), there is no need to persist them in a database.
Using a database for this would add a lot of complexity to a case where you could start with a few lines of code. So don't use a database; simply store them in a light in-memory structure (a HashMap, for example).
You may need to implement a way to clean your HashMap if you don't want to keep sessions in memory after the user has been gone for a long time. Many solutions are available (the easiest is simply to have a background thread that removes stale sessions from time to time). Note that it's usually easier to clean a HashMap than a database.
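A rough sketch of that background-cleanup idea; the 30-minute timeout, the 5-minute sweep interval, and the session shape are assumptions:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SessionRegistry {
        private static final long TIMEOUT_MILLIS = 30 * 60 * 1000;  // assumed 30-minute timeout

        private final Map<String, Session> sessions = new ConcurrentHashMap<>();
        private final ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();

        public SessionRegistry() {
            // Periodically drop sessions that have not been touched recently.
            cleaner.scheduleAtFixedRate(this::removeStaleSessions, 5, 5, TimeUnit.MINUTES);
        }

        public void touch(String sessionId) {
            sessions.computeIfAbsent(sessionId, id -> new Session()).lastAccess = System.currentTimeMillis();
        }

        private void removeStaleSessions() {
            long cutoff = System.currentTimeMillis() - TIMEOUT_MILLIS;
            sessions.values().removeIf(s -> s.lastAccess < cutoff);
        }

        static class Session {
            volatile long lastAccess = System.currentTimeMillis();
        }
    }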
Both H2 and a HashMap are going to keep the data in memory (so from a space point of view they are almost the same).
If lookups are simple key-value gets, then a lookup in the HashMap will be quicker.
If you have to do comparisons like KEY < 100, use H2.
In fact, 10K users' worth of info is not that high a number.
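For the range-query case, a sketch using an in-memory H2 database over JDBC; the jdbc:h2:mem: URL and the table layout are illustrative, and it assumes the H2 driver is on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class H2RangeQuery {
        public static void main(String[] args) throws Exception {
            // DB_CLOSE_DELAY=-1 keeps the in-memory database alive while the JVM runs.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:sessions;DB_CLOSE_DELAY=-1");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE message (id INT PRIMARY KEY, user_id INT, body VARCHAR(255))");
                st.execute("INSERT INTO message VALUES (1, 50, 'hello'), (2, 150, 'world')");

                // Range comparisons like this are where SQL beats a plain key lookup.
                try (ResultSet rs = st.executeQuery("SELECT id, body FROM message WHERE user_id < 100")) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + ": " + rs.getString("body"));
                    }
                }
            }
        }
    }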
If you don't need to persist user messages, use the collections. But if the messages should be saved, be sure to use a database, because after a restart you would lose all the data.
The problem with using a HashMap for storing objects is that you would run into issues when your site becomes too big for one server and would need to be clustered in order to scale with demand. Then you would face problems with how to synchronise the HashMap instances on different servers.
A possible alternative would be to use a key-value store like Redis, since you won't need the structure of a database, or even to use the distributed cache abilities of something like EHCache.
