Reducing memory usage for batch JDO inserts - java

We have a Java web app using DataNucleus 1.1.4 / JDO 2.3 for persistence.
There's a batch import operation that persists a large number of JDO objects in one shot. We've had some situations where OutOfMemoryErrors are thrown because the data to import is so large.
The intended pattern was to loop through the input stream, parse a row, instantiate a JDO object, call makePersistent, then release the object reference to the JDO object in order to keep our memory footprint flat regardless of the input data size.
In doing some analysis of the heap during this operation, it appears that the JDO object instances pile up and take up a large chunk of memory until the commit happens. Even though we don't hold references to them, it looks like DataNucleus's PersistenceManager and Transaction implementations reference an org.datanucleus.ObjectManagerImpl object that holds onto a list of "dirty" JDO object instances (actually copies of the originals). There's probably a good reason for this, but I was a bit surprised that the framework needed to hold onto copies of each JDO object. They are let go after the commit, but given that we want all insertions to happen atomically, we need to run this operation inside a transaction. In its current state the memory usage grows linearly with the input data size, which opens us up to these OutOfMemoryErrors, if not for a single operation, then under concurrent operations.
Are there any tips or best practices around keeping as close to a flat memory footprint for a batch JDO insertion operation like this?

What I found out was that the best practice was to call the PersistenceManager's flush method periodically through the loop. This causes the JDO framework (ObjectManagerImpl) to let go of the objects.
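A minimal sketch of that loop follows. To keep it self-contained, the `Store` interface is a hypothetical stand-in that mirrors only the two `PersistenceManager` calls involved (`makePersistent` and `flush`); the class name, method name, and `flushEvery` parameter are likewise assumptions for illustration. In the real import the surrounding transaction would still be committed once at the end, so atomicity is unchanged.

```java
import java.util.Iterator;

public class BatchImportSketch {

    /** Hypothetical stand-in for the two PersistenceManager calls used below. */
    public interface Store {
        void makePersistent(Object obj);
        void flush();
    }

    /**
     * Persists every row and flushes after each full batch of flushEvery rows,
     * plus once more for any leftover rows. Returns the number of flush calls.
     */
    public static int importRows(Iterator<?> rows, Store store, int flushEvery) {
        if (flushEvery <= 0) {
            throw new IllegalArgumentException("flushEvery must be positive");
        }
        int pending = 0;
        int flushes = 0;
        while (rows.hasNext()) {
            // No reference to the row object is kept after this call.
            store.makePersistent(rows.next());
            pending++;
            if (pending == flushEvery) {
                store.flush();
                flushes++;
                pending = 0;
            }
        }
        if (pending > 0) {
            store.flush();
            flushes++;
        }
        return flushes;
    }
}
```

The flush interval is a tuning knob: a smaller value keeps fewer objects in memory between flushes at the cost of more round trips to the datastore.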

Related

database or ObjectOutputStream, Object specific member or actual object for reference

I'm working on an application for a pharmacy. Basically, this application has a class "item" and another class "selling invoices", which logs sales.
So my question is: if the pharmacy is expected to have about ten thousand products in stock, and I'm storing these products in a linked list of type Item and the invoices in a linked list as well, and on closing the app I save them using an ObjectOutputStream and reload them on startup, is that bad practice? Do I have to use a database instead?
My second question is: if I continue using LinkedList and ObjectOutputStream, which is better for performance and memory: storing the actual item as a field in the invoice class, or storing just its ID and looking the item up by that ID when needed?
Thanks in advance.
It is a bad idea to use ObjectOutputStream like that.
Here are some of the reasons:
If your application crashes (or the power fails) before you "save", then all changes are lost.
Saving "all objects" is expensive.
Serialized objects are opaque. It is only practical to look at them from Java code.
Serialized objects are fragile. If your application classes change, you may find that old serialized objects can no longer be read. That's bad enough, but now consider what happens if your client wants to look at pharmacy records from 5 years ago ... from a backup tape.
Serialized objects provide no way of searching ... apart from reading all of the objects one at a time.
Designs which involve reading all objects into memory do not scale. You are liable to run out of memory. Or compromise on your requirements to avoid running out of memory.
By contrast:
A database won't lose any changes that have been committed. Databases are much more resilient to things like application errors and system-level failures.
Committing database changes is not as expensive, because you only write data that has changed.
Typical databases can be viewed, queried, and if necessary repaired using an off-the-shelf database tool.
Changing Java code doesn't break the database. And for some schema changes, there are ways to migrate the database schema and records to match an updated database.
Databases have indexes and query languages for implementing efficient search.
Databases scale because the primary copy of the data is on disk, not in memory.

Collection processing or database request ? which one is better

This is my first post on stackoverflow, so please be nice to me :-)
So let me explain the context. I'm developing a web service with standard layers (resources, services, DAO layer...). I use JPA with the Hibernate implementation to map my object model to the database.
For a parent class A and a child class B, most of the time when I want to find a B object in the collection, I use the Stream API to filter the collection based on what I want. My question here is more general: is it better to search for an object by querying the database (from my point of view this is going to cause a lot of calls to the database but use less CPU), or to do the opposite and search the object model by processing the collection (this is going to cause fewer database calls, but more CPU processing)?
If you consider latency, the database will always be slower.
So you gotta ask yourself some questions:
how far away is the database (latency)?
how big is the dataset?
How do I process them ?
do I have any major runtime issues ?
Quoting the question: "from my point of view this gonna cause a lot of calls to the database but it's gonna use less CPU), or do the opposite by searching over the model object and process over collection (this gonna cause less database calls, but more CPU process)"
Your program is probably not written with performance in mind. I suggest you check the big-O complexity of your code if you have any major runtime problems.
Your Question is very broad, so it's hard to tell you, for your use-case, which might be the best.
Use the database to return the data you need, and Java to perform processing on it that would be complicated to do in a JPQL/SQL query.
Databases are designed to perform queries more efficiently than Java (stream or no).
Besides, fetching a lot of data from a database only to keep part of it is not efficient.
The database is usually faster since it is optimized for requesting specific data. Usually one would add indexes to speed up querying on certain fields.
TLDR: Filter your data in the database and process them from java.
This isn't an easy question to answer, since there are many different factors that would influence my decision to go to the db or not. First, I think it's fair to say that, for almost every app I've worked on in the past 20 years, hitting the DB for information is the default strategy. More recently (say past 10 or so years) data access through web service calls has become common as well.
For me, the main question would be something along the lines of, "Are there any situations when I would not hit an external resource (DB, Service, or even file read) for data every time I need it?"
So, I'll outline some of the things I would consider.
Is the data search space very small?
If you are searching a data space of tens of different records, then this information might be a candidate for non-db storage. On the other hand, once you get past a fairly small set of records, this approach becomes increasingly untenable. Examples of these "small sets" might be something like salutations (Mr., Ms., Dr., Mrs., Lord). I look for small sets of data that rarely change, which I, as a lazy developer, wouldn't mind typing into a configuration file. Once I get past something like 50 different records (like US States, for example), I want to pull that info from a DB or service call.
Are the data cacheable?
If you have multiple requests that could legitimately use the exact same data, then leverage caching in your application. Examine the data and expected usage of your service for opportunities to leverage regularities in data and likely requests to cache data whenever possible. Remember to consider cache keys, how long items should be cached, and when cached items should be evicted.
In many web usage scenarios, it's not uncommon that each display could include a fairly large amount of cached information, and a small amount of dynamic data. Menu and other navigation items are good candidates for caching. User-specific data, such as contract-specific pricing in an eCommerce app, are often poor candidates.
Can you pre-load some data into cache?
Some items can be read once and cached for the entire duration of your application. A list of US States and/or Canadian Provinces is a good example here. These almost never change, so once read from the db, you would rarely need to read them again. Consider application components that can load such data on startup, and then hold this data in an appropriate collection.
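As a rough illustration of the preload idea, here is a small sketch that reads a lookup table once at construction time and serves every later lookup from memory. The class name `ReferenceDataCache` and the `loader` supplier are assumptions made for this example; the supplier stands in for whatever one-time DB or service read the application actually performs.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class ReferenceDataCache {

    // Read-only snapshot taken once, when the component is created.
    private final Map<String, String> namesByCode;

    /** loader is a hypothetical stand-in for the one-time DB or service read. */
    public ReferenceDataCache(Supplier<Map<String, String>> loader) {
        // Copy defensively so later changes to the loader's map cannot leak in.
        this.namesByCode =
            Collections.unmodifiableMap(new LinkedHashMap<>(loader.get()));
    }

    /** Returns the cached name for a code, or null if the code is unknown. */
    public String nameFor(String code) {
        return namesByCode.get(code);
    }

    public int size() {
        return namesByCode.size();
    }
}
```

Creating one instance at application startup and sharing it gives the "read once, hold for the life of the application" behaviour; data that does change occasionally would need an explicit reload or expiry on top of this.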

Java Application / ArrayList versus direct database queries

In short, I want to know how effective it is to use ArrayLists in Java to hold objects with a lot of data in them. How large can an ArrayList grow, and are there any issues using an ArrayList to hold 2000+ customer details (objects) at runtime? Does it hurt performance in any way? Or is there a better way to design an app which needs to access data quickly?
I am developing a new module (customer lead tracker) for my small ERM application, which also handles payroll details for a company. So far the data has not been very large; with this module I am expecting the database to grow fast, and I will have to load 2000+ customer details from the database to perform different data manipulations and updates.
I wanted some suggestion as to which approach would be better,
Querying the customer database (100+ columns) and getting data to work with for each transaction. (A lot of separate queries for each.)
Loading each row into an object, saving the objects in an ArrayList at the beginning, and using the list to work with each row when required, then saving the objects (rows) at the end of a transaction?
Sorry if I have asked a dumb question. I am really a start-up independent developer, so this may sound a bit awkward from an experienced developer's perspective.
It depends on how much memory you have. Querying the DB for each and every transaction is not a good approach either. A better approach would be to load data into memory depending on your memory size and, once you are done with it, remove it and fire the next set of DB queries. In this way you can optimize memory as well as DB queries.
An ArrayList can hold no more than 2^31-1 elements, because its backing array is indexed by an int.
There is an approach called an in-memory DB, which means that you hold a lot of data in memory to gain fast access to it. But this approach also implies that:
a. you have a lot of memory available for holding all the necessary data (it could be several tens of gigabytes);
b. your DB implements a compact form of data storage. It means that the DB will not contain ready-made Java objects, but fragments of byte-array data, from which you will construct objects on demand.
So you need to estimate how much memory you will need for all the data that you want to load into memory and decide whether this approach is suitable or not.

DB4O performance retrieving a large number of objects

I'm interested in using DB4O to store the training data for a learning algorithm. This will consist of (potentially) hundreds of millions of objects. Each object is on average 2k in size based on my benchmarking.
The training algorithm needs to iterate over the entire set of objects repeatedly (perhaps 10 times). It doesn't care what order the objects are in.
My question is this: When I retrieve a very large set of objects from DB4O, are they all loaded into memory, or are they pulled off disk as needed?
Clearly, pulling hundreds of millions of 2k objects into memory won't be practical on the type of servers I'm working with (they have about 19GB of RAM).
Is Db4o a wise choice here?
db4o's activation mechanism allows you to control which objects are loaded into memory. For complex object graphs you should probably use transparent activation, where db4o loads an object into memory as soon as it is used.
However, db4o doesn't explicitly remove objects from memory. It just keeps a weak reference to all loaded objects. If an object is reachable, it will stay there (just like any other object). Optionally you can explicitly deactivate an object.
I just want to add a few notes on the scalability of db4o. db4o was built for embedding in applications and devices. It was never built for large datasets. Therefore it has its limitations.
It is internally single-threaded. Most db4o operations block all other db4o operations.
It can only deal with relatively small databases. By default a database can only be 2 GB. You can increase it up to 127 GB. However, I think db4o operates well in the 2-16 GB range. Beyond that the database is probably too large for it. Anyway, hundreds of millions of 2K objects is far too large a database (100 million 2K objects => 200 GB).
Therefore you should probably look at larger object databases, like VOD. Or maybe a graph database like Neo4J is also a good choice for your problem?
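The size estimate is simple arithmetic, shown here as a tiny sketch. It assumes decimal units (2K taken as 2,000 bytes, 1 GB as 10^9 bytes) and uses 100 million objects as the lower bound of "hundreds of millions"; the class and method names are made up for the example.

```java
public class Db4oSizeEstimate {

    /** Raw payload size in bytes; throws ArithmeticException on long overflow. */
    public static long totalBytes(long objectCount, long bytesPerObject) {
        return Math.multiplyExact(objectCount, bytesPerObject);
    }

    public static void main(String[] args) {
        long bytes = totalBytes(100_000_000L, 2_000L);
        // Integer division by 10^9 converts bytes to whole decimal gigabytes.
        System.out.println(bytes / 1_000_000_000L + " GB"); // prints "200 GB"
    }
}
```

This counts only the object payload, so the actual database file would be expected to be larger once indexes and storage overhead are included.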

How would you go about improving MySQL throughput in this simple scenario?

I have a relatively simple object model:
ParentObject
Collection<ChildObject1>
ChildObject2
The MySQL operation when saving this object model does the following:
Update the ParentObject
Delete all previous items from the ChildObject1 table (about 10 rows)
Insert all new ChildObject1 (again, about 10 rows)
Insert ChildObject2
The objects / tables are unremarkable - no strings, rather mainly ints and longs.
MySQL is currently saving about 20-30 instances of the object model per second. When this goes into production it's going to be doing upwards of a million saves, which at current speeds is going to take 10+ hours, which is no good to me...
I am using Java and Spring. I have profiled my app and the bottleneck is in the calls to MySQL by a long distance.
How would you suggest I increase the throughput?
You can get some speedup by tracking a dirty flag on your objects (especially your collection of child objects). You only delete/update the dirty ones. Depending on what % of them change on each write, you might save a good chunk.
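One way the dirty flag could look is sketched below. The `Child` class and its single `long` field are hypothetical placeholders for ChildObject1; the point is only that the setter records a change when the value actually differs, so the save step can skip untouched rows.

```java
import java.util.ArrayList;
import java.util.List;

public class DirtyTracking {

    /** Hypothetical stand-in for ChildObject1 with one tracked field. */
    public static class Child {
        private long value;
        private boolean dirty;

        public Child(long value) {
            this.value = value;
        }

        public long getValue() {
            return value;
        }

        /** Marks the object dirty only when the value really changes. */
        public void setValue(long newValue) {
            if (newValue != value) {
                value = newValue;
                dirty = true;
            }
        }

        public boolean isDirty() {
            return dirty;
        }

        /** Call after the object has been written to the database. */
        public void markClean() {
            dirty = false;
        }
    }

    /** Returns only the children that need to be written. */
    public static List<Child> dirtyOnly(List<Child> children) {
        List<Child> result = new ArrayList<>();
        for (Child c : children) {
            if (c.isDirty()) {
                result.add(c);
            }
        }
        return result;
    }
}
```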
The other thing you can do is bulk writes via batch updating on the prepared statement (look at PreparedStatement.addBatch()). This can be an order of magnitude faster, but it might not be record by record; e.g. it might look something like:
delete all dirty-flagged children as a single batch command
update all parents as a single batch command
insert all dirty-flagged children as a single batch command.
Note that since you're dealing with millions of records you're probably not going to be able to load them all into a map and dump them at once, you'll have to stream them into a batch handler and dump the changes to the db 1000 records at a time or so. Once you've done this the actual speed is sensitive to the batch size, you'll have to determine the defaults by trial-and-error.
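A sketch of the streaming batch insert described above, using plain JDBC (`PreparedStatement.addBatch()` and `executeBatch()`). The table name, column names, and the `long[]` row shape are assumptions made for illustration, since the real schema isn't shown; the batch size is passed in so it can be tuned by experiment.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Iterator;

public class ChildBatchInserter {

    // Hypothetical table and column names; adjust to the real schema.
    private static final String INSERT_SQL =
        "INSERT INTO child_object1 (parent_id, child_value) VALUES (?, ?)";

    /**
     * Streams rows into the statement and sends them batchSize at a time.
     * Each row is a {parentId, childValue} pair.
     * Returns the number of executeBatch calls made.
     */
    public static int insertAll(Connection conn, Iterator<long[]> rows, int batchSize)
            throws SQLException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        int pending = 0;
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            while (rows.hasNext()) {
                long[] row = rows.next();
                ps.setLong(1, row[0]);
                ps.setLong(2, row[1]);
                ps.addBatch();
                pending++;
                if (pending == batchSize) {
                    ps.executeBatch();
                    batches++;
                    pending = 0;
                }
            }
            // Send whatever is left over from the last partial batch.
            if (pending > 0) {
                ps.executeBatch();
                batches++;
            }
        }
        return batches;
    }
}
```

Turning auto-commit off on the connection and committing once per batch (or once at the end) is usually part of the same change, since committing every row individually can dominate the cost. Whether the driver actually sends a batch as one round trip depends on driver configuration, so it is worth measuring rather than assuming.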
Deleting any existing ChildObject1 records from the table and then inserting the ChildObject1 instances from the current state of your Parent object seems unnecessary to me. Are the values of all of the child objects different from what was previously stored?
A better solution might involve only modifying the database when you need to, i.e. when there has been a change in state of the ChildObject1 instances.
Rolling your own persistence logic for this type of thing can be hard (your persistence layer needs to know the state of the ChildObject1 objects when they were retrieved to compare them with the versions of the objects at save-time). You might want to look into using an ORM like Hibernate for something like this, which does an excellent job of knowing when it needs to update the records in the database or not.
