Spring Data JPA: Batch insert for nested entities

I have a test case where I need to persist 100'000 entity instances into the database. The code I'm currently using does this, but it takes up to 40 seconds until all the data is persisted in the database. The data is read from a JSON file which is about 15 MB in size.
I had already implemented a batch insert method in a custom repository for another project. In that case, however, I had a lot of top-level entities to persist, with only a few nested entities.
In my current case I have 5 Job entities, each containing a List of about 30 JobDetail entities. One JobDetail contains between 850 and 1100 JobEnvelope entities.
When writing to the database I commit the List of Job entities with the default save(Iterable<Job> jobs) interface method. All nested entities have the CascadeType PERSIST. Each entity has its own table.
The usual way to enable batch inserts would be to implement a custom method like saveBatch that flushes every once in a while. But my problem in this case is the JobEnvelope entities. I don't persist them with a JobEnvelope repository; instead I let the repository of the Job entity handle it. I'm using MariaDB as the database server.
So my question boils down to the following: how can I make the JobRepository insert its nested entities in batches?
These are the three entities in question:
Job
@Entity
public class Job {

    @Id
    @GeneratedValue
    private int jobId;

    @OneToMany(fetch = FetchType.EAGER, cascade = CascadeType.PERSIST, mappedBy = "job")
    @JsonManagedReference
    private Collection<JobDetail> jobDetails;
}
JobDetail
@Entity
public class JobDetail {

    @Id
    @GeneratedValue
    private int jobDetailId;

    @ManyToOne(fetch = FetchType.EAGER, cascade = CascadeType.PERSIST)
    @JoinColumn(name = "jobId")
    @JsonBackReference
    private Job job;

    @OneToMany(fetch = FetchType.EAGER, cascade = CascadeType.PERSIST, mappedBy = "jobDetail")
    @JsonManagedReference
    private List<JobEnvelope> jobEnvelopes;
}
JobEnvelope
@Entity
public class JobEnvelope {

    @Id
    @GeneratedValue
    private int jobEnvelopeId;

    @ManyToOne(fetch = FetchType.EAGER, cascade = CascadeType.PERSIST)
    @JoinColumn(name = "jobDetailId")
    private JobDetail jobDetail;

    private double weight;
}

Make sure to configure Hibernate batch-related properties properly:
<property name="hibernate.jdbc.batch_size">100</property>
<property name="hibernate.order_inserts">true</property>
<property name="hibernate.order_updates">true</property>
The point is that successive statements can only be batched if they manipulate the same table. When a statement inserts into a different table, the current batch must be closed and executed before that statement runs. With the hibernate.order_inserts property you give Hibernate permission to reorder inserts before constructing batch statements (hibernate.order_updates has the same effect for update statements).
hibernate.jdbc.batch_size is the maximum batch size that Hibernate will use. Try different values and pick the one that shows the best performance in your use cases.
Note that batching of insert statements is disabled if the IDENTITY id generator is used.
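Since the question targets MariaDB: MariaDB 10.3+ does support database sequences, so one option to keep insert batching enabled (a sketch, assuming that server version and a sequence-aware Hibernate dialect; the sequence name is illustrative) is to switch the IDs to a sequence generator:
@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "job_envelope_seq")
@SequenceGenerator(name = "job_envelope_seq", sequenceName = "job_envelope_seq", allocationSize = 50)
private int jobEnvelopeId;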
Specific to MySQL, you have to specify rewriteBatchedStatements=true as part of the connection URL. To make sure that batching is working as expected, add profileSQL=true to inspect the SQL the driver sends to the database. More details here.
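For example (host, port, and database name are placeholders):
jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true&profileSQL=true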
If your entities are versioned (for optimistic locking purposes), then in order to utilize batch updates (doesn't impact inserts) you will have to turn on also:
<property name="hibernate.jdbc.batch_versioned_data">true</property>
With this property you tell Hibernate that the JDBC driver is capable of returning the correct count of affected rows when executing a batch update (needed to perform the version check). You have to check whether this works properly for your database/JDBC driver. For example, it does not work in Oracle 11 and older Oracle versions.
You may also want to flush and clear the persistence context after each batch to release memory, otherwise all of the managed objects remain in the persistence context until it is closed.
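A minimal sketch of that flush-and-clear pattern in a custom repository method (assuming an injected EntityManager; the names saveBatch and batchSize are illustrative):
@Transactional
public void saveBatch(List<Job> jobs) {
    int batchSize = 100; // keep in sync with hibernate.jdbc.batch_size
    for (int i = 0; i < jobs.size(); i++) {
        entityManager.persist(jobs.get(i)); // cascades to JobDetails and JobEnvelopes
        if ((i + 1) % batchSize == 0) {
            entityManager.flush(); // execute the accumulated batched inserts
            entityManager.clear(); // detach flushed entities to free memory
        }
    }
    entityManager.flush(); // flush the remainder
    entityManager.clear();
}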
Also, you may find this blog useful, as it nicely explains the details of Hibernate's batching mechanism.

To complete Dragan Bozanovic's previous answer: Hibernate sometimes silently deactivates the ordering of batch execution if, for example, it encounters cyclic relations between the entities while building the graph of dependencies between the batches (see the InsertActionSorter.sort(..) method). It would have been helpful for Hibernate to log this behavior when it happens.

Related

Weblogic to Liberty w JPA upgrade - related entities intermittently not being queried

Just a quick question, please, in case something stands out immediately.
We're migrating an EAR/EJB application from WebLogic 11g to the latest WebSphere Liberty (22.x), also upgrading several of the frameworks, including JPA to 2.2. This also changes the JPA implementation to EclipseLink. We came from com.oracle.weblogic.11g.modules:javax.persistence:1.0.0.0_1-0-2. The underlying DB is MS SQL Server.
I'm running into some weirdness with related objects intermittently not being resolved/queried.
Just as an example, we have entities whose columns hold reference-data codes or similar lookups. Say I have an entity called PaymentRecordT with a status code that refers to a ref table that also holds a textual description. Something like this:
SQL:
CREATE TABLE [PAYMENT_RECORD_T](
[PAYMENT_ID] [int] NOT NULL,
...
[PAYMENT_STATUS_CD] [CHAR](8) NOT NULL,
...
)
ALTER TABLE [PAYMENT_RECORD_T] WITH CHECK ADD CONSTRAINT [FK_PAYM4] FOREIGN KEY([PAYMENT_STATUS_CD])
REFERENCES [RECORD_STATUS_T] ([REC_STAT_CD])
GO
CREATE TABLE [RECORD_STATUS_T] (
[REC_STAT_CD] [CHAR](8) NOT NULL,
[REC_STAT_DSC] [VARCHAR](60) NOT NULL,
CONSTRAINT [PK_RECORD_STATUS_T] PRIMARY KEY CLUSTERED (
[REC_STAT_CD] ASC
) WITH (PAD_INDEX = OFF...) ON [PRIMARY]
) ON [PRIMARY]
GO
Java:
@Table(name = "PAYMENT_RECORD_T")
@Entity
public class PaymentRecordT {
    ...
    @ManyToOne
    @PrimaryKeyJoinColumn(name = "payment_status_cd", referencedColumnName = "REC_STAT_CD")
    private RecordStatusT recordStatusT;
}
@Table(name = "RECORD_STATUS_T")
@Entity
public class RecordStatusT {

    @Column(name = "REC_STAT_CD")
    @Id
    private String recStatCd;

    @Column(name = "REC_STAT_DSC")
    @Basic
    private String recStatDsc;
}
Other relations in our app might not be primary-key relations but loose relations, in which case it's just @JoinColumn, but the pattern would be the same.
My 'weirdness' is the following:
So in this example I have a list of 10 payment records, each of which has such a record status, which is actually NON NULL in the database. When I do the initial retrieval via an EJB method, it grabs the 10 records and I also get the correctly resolved/queried record statuses.
Then I add a new record via an EJB method (TRANSACTION_REQUIRED). After the add method returns, I can query the new payment record in the database via SSMS. It's committed, it looks 100% correct, and it contains a correct record status code.
Now I run the retrieval method again and I get the 11 records as I would expect. Only the 11th (newly inserted) record has recordStatusT as null.
When I restart the app, all goes well again for the retrieval of all 11 records. But for subsequent additions the outcome again seems 'undefined'.
In the JDBC logging I can see that during the original retrieval the record_status_t table was queried, but the second time around it was not, and I have no explanation why.
I played with FetchType.EAGER and read up on caching etc., but I'm not getting anywhere.
Any ideas?
Thanks for your time
Carsten
I solved the problem by ensuring that after inserts/updates the objects aren't queried from the cache.
In the end, rather than doing it with a query hint, I disabled caching for the entity involved using the @Cacheable annotation, like so:
@Table(name = "PAYMENT_RECORD_T")
@Entity
@Cacheable(false)
public class PaymentRecordT {
    ...
    @ManyToOne
    @PrimaryKeyJoinColumn(name = "payment_status_cd", referencedColumnName = "REC_STAT_CD")
    private RecordStatusT recordStatusT;
}
I still feel like there should be a better solution. EclipseLink tracks the inserts/updates, so it should be able to track what needs rereading from the DB and what doesn't. I still feel like I don't fully understand the entire picture, but this works for me and it's reasonably clean.
I can leave the considerable amount of read-only data/objects cacheable and the few that are changeable as non-cacheable.
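For reference, the query-hint alternative mentioned above would look roughly like this (a sketch assuming EclipseLink's standard QueryHints constants and an EntityManager em):
import org.eclipse.persistence.config.HintValues;
import org.eclipse.persistence.config.QueryHints;

TypedQuery<PaymentRecordT> query = em.createQuery(
        "SELECT p FROM PaymentRecordT p", PaymentRecordT.class);
// Ask EclipseLink to refresh the results from the database instead of
// serving possibly stale instances from the shared cache
query.setHint(QueryHints.REFRESH, HintValues.TRUE);
List<PaymentRecordT> records = query.getResultList();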
Thanks for reading
Carsten

JPA/Hibernate Bulk/Batch Insert using Mysql with Database Generated ID's

Okay, I've searched forever and I can't seem to find a good way of accomplishing batch inserts with JPA/Hibernate and MySQL.
I want to be able to save/insert many records at once using JPA, but by default batching behavior is disabled if you use GenerationType.IDENTITY. I'm aware that you can switch to GenerationType.SEQUENCE, but that isn't available on MySQL, and creating new tables in order to use GenerationType.TABLE is not an option in my scenario.
So in the end, I need an efficient way of doing batch/bulk inserts using JPA/Hibernate, MySQL, and database-generated IDs. I know it's possible to do this efficiently because I can do it with a JDBC connection, but I'd really rather not write my own JDBC queries for each of my repositories.
Anyone know how to accomplish this?
I'm okay with not getting the updated entities with their IDs back (think void saveAll() instead of List<User> saveAll()). My main requirement is that this happens in one or two big queries instead of iteratively saving each entity, as saveAll does now.
I can include more if needed, but my entity looks like this:
@Entity
@Builder
@Getter
@Setter
@With
@AllArgsConstructor
@NoArgsConstructor
@EqualsAndHashCode(callSuper = false, exclude = "id")
@Table(name = "user")
@ToString(callSuper = true, onlyExplicitlyIncluded = true)
public class User {

    @Id
    @ToString.Include
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Column(name = "uID")
    private long id;

    private String name;
}
There is no way to accomplish JDBC batching on insert with Hibernate when using the identity generation strategy, because for Hibernate, every entity must have a PK value assigned after a persist/insert.
You can use Hibernate SPIs to implement this yourself though. Take a look at how Hibernate implements inserts here org.hibernate.persister.entity.AbstractEntityPersister#insert(java.lang.Object, java.lang.Object[], java.lang.Object, org.hibernate.engine.spi.SharedSessionContractImplementor). You can reduce the complexity if you want to implement this only for a few known entities that only use a handful of features.
The IDENTITY generator disables Hibernate's JDBC batch inserts. Since sequences are not supported in MySQL, there is no way to bulk/batch insert records using MySQL and Spring Data JPA. Please read my blog on that. This is not the end of the road, though: we can use the JDBC template or Querydsl-SQL. To see how to implement it using Querydsl-SQL, click here. For the JDBC template, click here.
If you need a type-safe, easy-to-code option, choose Querydsl-SQL; otherwise choose the JDBC template.
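As an illustration of the JDBC template route, a minimal sketch (assuming a Spring-managed JdbcTemplate and the user table above; IDs stay database-generated and no entities come back, matching the void saveAll() requirement):
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

public class UserBatchInserter {

    private final JdbcTemplate jdbcTemplate;

    public UserBatchInserter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Inserts all users in JDBC batches of 100; MySQL generates the IDs.
    // With rewriteBatchedStatements=true the driver collapses each batch
    // into a single multi-row INSERT.
    public void saveAll(List<User> users) {
        jdbcTemplate.batchUpdate(
                "INSERT INTO user (name) VALUES (?)",
                users,
                100,
                (ps, user) -> ps.setString(1, user.getName()));
    }
}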

Dynamically cascade type in JPA

I am using JPA in my application, and take one model for example:
public class Project {

    @Id
    private String uuid;

    private String name;
    .................

    @OneToMany(mappedBy = "project", cascade = CascadeType.ALL)
    private List<ProjectDetails> details;
}
As shown, there is a one-to-many association between Project and ProjectDetails; once a project is fetched, its details are populated by the JPA provider, which is exactly what I want.
However, once authentication is added to the project, the details will only be available to specific users, which means auto-fetching the details is not always necessary.
I know I can use the
project.setDetails(null);
to remove the details for an unauthenticated user at the application level. But I wonder if this is a waste of SQL resources. It would be better if I could set the cascade type at runtime.
How do you solve this kind of problem?
Since @OneToMany relationships are lazily loaded by default, you can simply avoid accessing the getter when the user has not been authenticated, thereby preventing the relationship from being loaded.
When the entity eventually gets loaded, I think you should just decide what to do at the application level rather than setting the relationship to null.
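A minimal sketch of that approach (the isAuthorized check and the detailsFor service method are hypothetical):
// Leave the association lazy (the default for @OneToMany), so nothing is
// fetched until the getter is actually accessed.
@OneToMany(mappedBy = "project", cascade = CascadeType.ALL, fetch = FetchType.LAZY)
private List<ProjectDetails> details;

// In the service layer, touch the getter only for authorized users.
public List<ProjectDetails> detailsFor(Project project, User user) {
    if (!isAuthorized(user)) {
        return Collections.emptyList(); // the lazy list is never initialized
    }
    return project.getDetails(); // triggers the SELECT only here
}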

Delete cascade hangs in JPA when large number of objects

I have a JPA Entities like this:
@Entity
class MyEntity {

    @JsonIgnore
    @OneToMany(mappedBy = "application", cascade = ALL, fetch = LAZY)
    private List<MyChildEntity> myChildEntities;
}
...
@Entity
class MyChildEntity {

    @ManyToOne(optional = false, fetch = FetchType.LAZY, cascade = { REFRESH, DETACH })
    @JoinColumn(name = "APPLICATION_ID")
    private MyEntity application;
}
I access this entity from a REST call. When the number of elements is very large and I try to delete the MyEntity object, the REST call hangs and then times out. For a small number of elements in the MyChildEntity table it works fine. When I debugged, I saw that JPA fetches one record at a time and deletes it. This is too slow and far too much work.
Is this expected behavior? Shouldn't JPA be intelligent enough to convert this into a single DELETE call on the MyChildEntity table?
I'm using OpenJPA with Derby and DB2 databases.
The reason you get one delete statement for each element probably has to do with the fact that JPA lets you do something pre- and post-removal. If you write a JPQL delete statement, you can bypass the callback mechanism and delete everything in a single request.
Documentation for entity listeners and callbacks (this is JPA functionality).
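A sketch of such a JPQL bulk delete (assuming an injected EntityManager em and the entity names above; the children go first to satisfy the foreign key):
// Bulk deletes bypass cascade settings and lifecycle callbacks,
// so remove the child rows explicitly before the parent.
em.createQuery("DELETE FROM MyChildEntity c WHERE c.application = :app")
        .setParameter("app", myEntity)
        .executeUpdate();
em.createQuery("DELETE FROM MyEntity e WHERE e = :app")
        .setParameter("app", myEntity)
        .executeUpdate();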

JPA persisting objects without calling persist

I have an entity class Document and another one called Space. The relation:
@ManyToOne(fetch = FetchType.EAGER, cascade = { CascadeType.PERSIST,
        CascadeType.MERGE, CascadeType.REFRESH }, optional = true)
@ForeignKey(name = "FK_TO_SPACE__DOCUMENT")
@IndexedEmbedded(prefix = DocumentDefaultFields.SPACE_TO_PREFIX)
private Space toSpace;
Well, I query the DB and load some docs into a LinkedList.
This list is bound to a dataTable, from which I can do some update operations like:
<a:commandLink value="move" action="#{moveDocsOperation.moveDocumentToNewSpace(entity)}" reRender="confim,origTable,newTable"/>
and the method:
public void moveDocumentToNewSpace(final Document document) {
log.info("~~move document #0 from space #1 to space #2", document.getDocumentId(), origSpace.getPath(), newSpace.getPath());
document.setToSpace(newSpace);
origSpaceDocuments.remove(document);
newSpaceDocuments.add(document);
entityAuditer.auditBean(document, Crud.UPDATE);
}
I do not understand why, when setting the toSpace of the document entity, the update is also done in the DB without actually calling persist....
Do you know why?
When you load an object via the Hibernate session, it is managed by that session. When you make changes, the changes in the object are synchronized with the database at flush time.
So calling persist() is not needed to persist data modifications. (Related: http://techblog.bozho.net/?p=227)
One way you can get around this and make changes to the entity without persisting them to the database is by evicting the object from the session:
// Unwrap the Hibernate Session from the JPA EntityManager
org.hibernate.Session session = (Session) em.getDelegate();
// Detach the entity so further changes are no longer tracked or flushed
session.evict(yourEntity);
