Stream rows from PostgreSQL (with fetch size)

I would like to stream results from PostgreSQL 11.2 instead of reading them all into memory at once. I use Spring Boot 2.1.4.RELEASE, the newest stable version.
I read an article on how to do this in MySQL:
http://knes1.github.io/blog/2015/2015-10-19-streaming-mysql-results-using-java8-streams-and-spring-data.html
I also read an article on how to do it in PostgreSQL:
Java 8 JPA Repository Stream row-by-row in Postgresql
I have a repository like this:
public interface ProductRepository extends JpaRepository<Product, UUID> {
    @Query("SELECT p from Product p")
    @QueryHints(value = @QueryHint(name = HINT_FETCH_SIZE, value = "50"))
    Stream<Product> streamAll();
}
Then I use the stream this way:
productRepository.streamAll().forEach(product -> export(product));
To keep the example simple, the 'export' method is completely empty.
When I call the method, I see this Hibernate query:
Hibernate: select product0_.id as id1_0_, product0_.created as created2_0_, product0_.description as descript3_0_, product0_.name as name4_0_, product0_.product_type_id as product_5_0_ from products product0_ order by product0_.id
and after some time I get an OutOfMemoryError.
The query hint didn't help.
How can I read data using a Spring Boot repository (or even the EntityManager) and load rows from the DB in an optimal way?
I know that I can use pagination, but according to the articles it is not the most efficient approach.

You must detach the entity after your work finishes.
import javax.persistence.EntityManager;
...
@Autowired
private EntityManager entityManager;
...
// Your business logic
productRepository.streamAll().forEach(product -> {
    export(product);
    // must detach so that the garbage collector can reclaim the memory
    entityManager.detach(product);
});

At the moment, Spring retrieves all the data, and the Stream is applied only to data that is already in memory.
If you look at the source of org.springframework.data.jpa.provider.PersistenceProvider, it seems to use ScrollableResults to stream over the data.
Generally, ScrollableResults retrieves all the data into memory.
There is an interesting, complete analysis of this for a MySQL database, and the same probably applies to a Postgres database.
So even if you pick a solution that should not need much memory, in practice it does, because the underlying implementation is not optimal.

I faced exactly the same problem, and after a long debugging session through the internals of Spring Data and Hibernate I found a solution that worked for me.
To fetch data using a cursor in PostgreSQL, the repository needs a method with a Stream result plus this annotation (Kotlin syntax):
@QueryHints(QueryHint(name = org.hibernate.annotations.QueryHints.FETCH_SIZE, value = "50"))
Whether the value is 50 or something else is not so important.
You probably used the wrong name for the hint.
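
For comparison, this is roughly what cursor-based fetching looks like at the plain JDBC level. The PostgreSQL JDBC driver only fetches rows in batches when autocommit is off, the result set is forward-only and the fetch size is greater than zero, so the hint has to be combined with a transaction. The sketch below is an illustration under those assumptions, not code from the answers above; `CursorReader`, `RowHandler` and `forEachRow` are hypothetical names.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical callback invoked once for every row. */
interface RowHandler {
    void handle(ResultSet row) throws SQLException;
}

/** Hypothetical helper; a sketch, not part of any library. */
final class CursorReader {
    private CursorReader() {}

    /** Runs a read-only query with the given fetch size and returns the number of rows handled. */
    static long forEachRow(Connection conn, String sql, int fetchSize, RowHandler handler)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        // The PostgreSQL driver uses a cursor only inside a transaction.
        conn.setAutoCommit(false);
        long count = 0;
        try (PreparedStatement ps = conn.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Number of rows fetched per round trip; must be greater than zero.
            ps.setFetchSize(fetchSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    handler.handle(rs);
                    count++;
                }
            }
            conn.commit();
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return count;
    }
}
```

With Spring Data, these conditions are usually met by calling the streaming repository method from inside a method annotated with @Transactional(readOnly = true).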

Related

How to iterate over large number of records in MySQL with memory efficient manner in Spring Boot

I want to fetch all records from a table using findAll and do some processing on each of them, but I'm not sure whether this will cause memory issues if the number of records is huge, say in the millions.
I have looked into Pageable, but I'm not sure how to iterate over all the data with the Pageable approach. Is it even possible to fetch a few records at a time, process them, and fetch again until all the records are processed?
And which would be better: fetching all the records into an Iterable using the findAll() method, or the Pageable approach?

Don't use findAll if there are a lot of entities.
If you want to use pagination you can do something like this:
Pageable pageRequest = PageRequest.of(0, 200);
Page<Qmail> onePage = repository.findAll(pageRequest);
while (!onePage.isEmpty()) {
    // do something with the entities of the current page
    onePage.forEach(entity -> System.out.println(entity.getId()));
    // then move on to the next page
    pageRequest = pageRequest.next();
    onePage = repository.findAll(pageRequest);
}
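
The same paging loop can be modelled without Spring to make the control flow easier to test. This is only a sketch: `PageSource`, `PagedProcessor` and `inMemory` are hypothetical names, and `fetch` stands in for a call such as repository.findAll(pageRequest).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a repository call that returns one page of rows. */
interface PageSource<T> {
    List<T> fetch(int pageNumber, int pageSize);
}

final class PagedProcessor {
    private PagedProcessor() {}

    /** Fetches page after page until an empty page comes back; returns the number of items processed. */
    static <T> int processAll(PageSource<T> source, int pageSize, Consumer<? super T> action) {
        int processed = 0;
        int pageNumber = 0;
        List<T> page = source.fetch(pageNumber, pageSize);
        while (!page.isEmpty()) {
            for (T item : page) {
                action.accept(item);
                processed++;
            }
            // Only one page is referenced at a time, so earlier pages can be garbage collected.
            pageNumber++;
            page = source.fetch(pageNumber, pageSize);
        }
        return processed;
    }

    /** In-memory page source, used only to exercise the loop. */
    static <T> PageSource<T> inMemory(List<T> rows) {
        return (pageNumber, pageSize) -> {
            int from = Math.min(pageNumber * pageSize, rows.size());
            int to = Math.min(from + pageSize, rows.size());
            return new ArrayList<>(rows.subList(from, to));
        };
    }
}
```

Note that with a real database the query should have a stable ORDER BY, otherwise rows can be skipped or repeated between pages.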

Since Spring Data 1.8 you can stream over the results:
Stream<Record> findAll();
It is important to add a QueryHint for the fetch size here. If it is set, the results are fetched in batches internally while streaming.
Use this for MySQL databases:
@QueryHints(value = @QueryHint(name = org.hibernate.jpa.QueryHints.HINT_FETCH_SIZE, value = "-2147483648"))
Stream<Record> findAll();
For non-MySQL databases you can experiment with the fetch size:
@QueryHints(value = @QueryHint(name = org.hibernate.jpa.QueryHints.HINT_FETCH_SIZE, value = "5000"))
Stream<Record> findAll();
And if you do not update or delete the records, do not forget to make your transaction read-only:
@Transactional(readOnly = true)

If it can be millions:
1) Do not use findAll() to retrieve a list of actual managed entities. If you only need to read the data, use a projection query together with a Spring Data JPA projection interface. This bypasses the persistence context and saves a lot of time and memory.
2) Use paging (to save memory) and make sure each call runs in a new transaction (@Transactional(propagation = REQUIRES_NEW)). This keeps other transactions from hanging forever, which might happen if you did NOT use paging and issued a single "give me everything" query.
3) This also looks like a candidate for an overnight batch job. Think about that.

What you need is to read the data in batches, process each record, and perhaps persist the result somewhere else or generate a report from it.
This is an ETL use case.
Spring Batch handles this case very well.
The reader reads the data one item at a time and the processor processes it. The writer persists the results or generates the report according to the chunk/batch size you set.
This way you do not hold a lot of data in memory.
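
The reader/processor/writer flow described here can be sketched with plain Java. This is not the Spring Batch API; `ChunkRunner` and `run` are hypothetical names that only illustrate why at most one chunk is held in memory at a time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical chunk-oriented loop; a sketch of the idea, not Spring Batch itself. */
final class ChunkRunner {
    private ChunkRunner() {}

    /** Reads items one by one, processes each, and hands them to the writer in chunks. Returns the number of chunks written. */
    static <I, O> int run(Iterator<I> reader, Function<I, O> processor,
                          Consumer<List<O>> writer, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunksWritten = 0;
        List<O> chunk = new ArrayList<>(chunkSize);
        while (reader.hasNext()) {
            chunk.add(processor.apply(reader.next()));
            if (chunk.size() == chunkSize) {
                writer.accept(chunk);
                chunksWritten++;
                // Start a fresh list so the written chunk can be garbage collected.
                chunk = new ArrayList<>(chunkSize);
            }
        }
        if (!chunk.isEmpty()) {
            // Flush the last, possibly smaller, chunk.
            writer.accept(chunk);
            chunksWritten++;
        }
        return chunksWritten;
    }
}
```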

Hibernate first level cache is missed

I'm a newbie to JPA/Hibernate first level caching.
I have the following repository class.
Each time I call the findByState method (within the same transaction), I see the Hibernate SQL query being output to the console.
public interface PersonRepository extends JpaRepository<PersonEntity, id> {
    @Query("select person from PersonEntity p where name= (?1)")
    List<PersonEntity> findByState(String state);
    ....
}
I expected the results to be cached by the first level cache and the database not be repeatedly queried.
What am I doing wrong?

Caching is often misunderstood.
Hibernate does not cache queries and query results by default. The first level cache is only used when you call EntityManager.find(); in that case you will not see a SQL query executing. The cache is also used to avoid creating a new object if the entity is already loaded.
What you are looking for is called the "query cache".
It can be enabled by setting hibernate.cache.use_query_cache=true
Please read more about this topic in the official documentation:
https://docs.jboss.org/hibernate/orm/5.4/userguide/html_single/Hibernate_User_Guide.html#caching-query

The query will always go to the database. The first level cache only contains the constructed entities. Its purpose is to ensure that the same db id is mapped to the same entity object (within a session).
It's also possible to use a query cache. You have to enable it per query. Check the docs https://docs.jboss.org/hibernate/core/4.0/devguide/en-US/html/ch06.html
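
The behaviour described in these answers can be illustrated with a toy model. This is NOT Hibernate code; `ToySession`, `Person`, `find` and `queryByName` are hypothetical names, and the "database" is just a map. It shows that a lookup by id can be served from the first level cache, while a query always goes to the database and only reuses the entity instances that are already loaded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal entity used by the toy model. */
final class Person {
    final long id;
    final String name;

    Person(long id, String name) {
        this.id = id;
        this.name = name;
    }
}

/** Toy model of a session with a first level cache; an illustration only. */
final class ToySession {
    private final Map<Long, String> table; // stands in for the database table: id -> name
    private final Map<Long, Person> firstLevelCache = new HashMap<>();
    private int databaseHits = 0;

    ToySession(Map<Long, String> table) {
        this.table = table;
    }

    /** Lookup by id: answered from the first level cache when the entity is already loaded. */
    Person find(long id) {
        Person cached = firstLevelCache.get(id);
        if (cached != null) {
            return cached;
        }
        databaseHits++;
        String name = table.get(id);
        if (name == null) {
            return null;
        }
        Person loaded = new Person(id, name);
        firstLevelCache.put(id, loaded);
        return loaded;
    }

    /** Query: always hits the "database", but maps each row to the already loaded instance if there is one. */
    List<Person> queryByName(String name) {
        databaseHits++;
        List<Person> result = new ArrayList<>();
        for (Map.Entry<Long, String> row : table.entrySet()) {
            if (row.getValue().equals(name)) {
                result.add(firstLevelCache.computeIfAbsent(
                        row.getKey(), id -> new Person(id, row.getValue())));
            }
        }
        return result;
    }

    int databaseHits() {
        return databaseHits;
    }
}
```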

Use Spring Data JPA, QueryDSL to update a bunch of records

I'm refactoring a code base to get rid of SQL statements and primitive access and modernize with Spring Data JPA (backed by hibernate). I do use QueryDSL in the project for other uses.
I have a scenario where the user can "mass update" a ton of records, and select some values that they want to update. In the old way, the code manually built the update statement with an IN statement for the where for the PK (which items to update), and also manually built the SET clauses (where the options in SET clauses can vary depending on what the user wants to update).
In looking at QueryDSL documentation, it shows that it supports what I want to do. http://www.querydsl.com/static/querydsl/4.1.2/reference/html_single/#d0e399
I tried looking for a way to do this with Spring Data JPA and haven't had any luck. Is there a repository interface I'm missing, or another library that is required? Or would I need to autowire a queryFactory into a custom repository implementation and very literally implement the code in the QueryDSL example?

You can either write a custom method or use the @Query annotation.
For a custom method:
public interface RecordRepository extends RecordRepositoryCustom,
        CrudRepository<Record, Long>
{
}
public interface RecordRepositoryCustom {
    // Custom method
    void massUpdateRecords(long... ids);
}
public class RecordRepositoryImpl implements RecordRepositoryCustom {
    @Override
    public void massUpdateRecords(long... ids) {
        // implement using em or querydsl
    }
}
For the @Query annotation:
public interface RecordRepository extends CrudRepository<Record, Long>
{
    @Modifying // Spring Data JPA requires this for update and delete queries
    @Query("update records set someColumn=someValue where id in :ids")
    void massUpdateRecords(@Param("ids") long... ids);
}
There is also the @NamedQuery option if you want your model class to be reusable with custom methods:
@Entity
@NamedQuery(name = "Record.massUpdateRecords", query = "update records set someColumn=someValue where id in :ids")
@Table(name = "records")
public class Record {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;
    // rest of the entity...
}
public interface RecordRepository extends CrudRepository<Record, Long>
{
    // this will use the named query
    @Modifying
    void massUpdateRecords(@Param("ids") long... ids);
}
Check repositories.custom-implementations, jpa.query-methods.at-query and jpa.query-methods.named-queries in the Spring Data reference documentation for more info.

This question is quite interesting to me because I was solving this very problem in my current project, with the same technology stack mentioned in your question. We were particularly interested in the second part of your question:
"where the options in SET clauses can vary depending on what the user wants to update"
I understand this is probably not the answer you want, but we did not find anything out there :( Spring Data is quite cumbersome for update operations, especially when it comes to flexibility.
After I saw your question I looked again for anything new on Spring and QueryDSL integration (you know, maybe something had been released during the past months), but nothing had been.
The only thing that came quite close is flush on the entity manager, which allows the following scenario:
1) Get the ids of the entities you want to update
2) Retrieve all entities by these ids (the first actual query to the db)
3) Modify them in any way you want
4) Call entityManager.flush, resulting in N separate updates to the database.
This approach results in N+1 actual queries to the database, where N is the number of ids to be updated. Moreover, you are moving the data back and forth, which is not good either.
I would advise you to
"autowire a queryFactory into a custom repository implementation"
Also, have a look at the Spring Data and QueryDSL example. However, you will find only lookup examples.
Hope my pessimistic answer helps :)
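
The varying SET clauses from the question can be sketched as plain string building, independent of QueryDSL. This is only an illustration: `MassUpdateBuilder` and `buildJpql` are hypothetical names, the entity alias `e` and the `:ids` parameter are assumptions, and field names are checked against a whitelist because they are concatenated into the statement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that builds one bulk update with a variable SET clause. */
final class MassUpdateBuilder {
    private MassUpdateBuilder() {}

    /**
     * Builds a JPQL-style update with one named parameter per changed field.
     * The values in the map are not inlined; they are meant to be bound as parameters afterwards.
     */
    static String buildJpql(String entityName, Set<String> allowedFields, Map<String, Object> changes) {
        if (changes.isEmpty()) {
            throw new IllegalArgumentException("nothing to update");
        }
        List<String> assignments = new ArrayList<>();
        for (String field : changes.keySet()) {
            // Field names end up in the statement text, so only accept known ones.
            if (!allowedFields.contains(field)) {
                throw new IllegalArgumentException("unknown field: " + field);
            }
            assignments.add("e." + field + " = :" + field);
        }
        return "update " + entityName + " e set " + String.join(", ", assignments)
                + " where e.id in :ids";
    }
}
```

In a custom repository implementation, the resulting string could be passed to EntityManager.createQuery, with each entry of the map and the id list bound via setParameter before calling executeUpdate. This issues a single statement instead of the N+1 queries described above.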

Counting query in spring-data-couchbase (N1QL)

I'm writing a Couchbase repository using the Spring module, and I'm trying to add my own implementation of a count method using an N1QL query:
public interface MyRepository extends CouchbaseRepository<Entity, Long> {
    @Query("SELECT count(*) FROM default")
    long myCount();
}
But it doesn't work:
org.springframework.data.couchbase.core.CouchbaseQueryExecutionException: Unable to retrieve enough metadata for N1QL to entity mapping, have you selected _ID and _CAS?
So my question is: how can I write a counting query using spring-data-couchbase?
I cannot find anything about this in the Spring documentation.

This exception happens because the @Query annotation was designed with the use-case of retrieving entities in mind. Projections to a scalar like count are uncovered corner cases as of RC1. Maybe I can think of some way of adding support for it through an explicit boolean flag in the annotation?
Unfortunately I was unable to find a workaround. I was trying to come up with a custom repository method implementation but it appears support for it is broken in 2.0.0-RC1 :(
edit:
The use case of simple return types like long, with a SELECT that only uses a single aggregation, should work so this is a bug/improvement. I've opened ticket DATACOUCH-187 in the Spring Data JIRA.

@Query("SELECT count(*) , META(default).id as _ID, META(default).cas as _CAS FROM default")
Change your query to this one.

Use this query :
@Query("SELECT count(*) as count FROM #{#n1ql.bucket} WHERE #{#n1ql.filter} ")
long myCount();

JPA - Setting entity class property from calculated column?

I'm just getting to grips with JPA in a simple Java web app running on Glassfish 3 (Persistence provider is EclipseLink). So far, I'm really liking it (bugs in netbeans/glassfish interaction aside) but there's a thing that I want to be able to do that I'm not sure how to do.
I've got an entity class (Article) that's mapped to a database table (article). I'm trying to do a query on the database that returns a calculated column, but I can't figure out how to set up a property of the Article class so that the property gets filled by the column value when I call the query.
If I do a regular "select id,title,body from article" query, I get a list of Article objects fine, with the id, title and body properties filled. This works fine.
However, if I do the below:
Query q = em.createNativeQuery("select id,title,shorttitle,datestamp,body,true as published, ts_headline(body,q,'ShortWord=0') as headline, type from articles,to_tsquery('english',?) as q where idxfti @@ q order by ts_rank(idxfti,q) desc",Article.class);
(this is a fulltext search using tsearch2 on Postgres - it's a db-specific function, so I'm using a NativeQuery)
You can see I'm fetching a calculated column, called headline. How do I add a headline property to my Article class so that it gets populated by this query?
So far, I've tried setting it to be @Transient, but that just ends up with it being null all the time.

There are probably no good ways to do it, only manually:
Object[] r = (Object[]) em.createNativeQuery(
        "select id,title,shorttitle,datestamp,body,true as published, ts_headline(body,q,'ShortWord=0') as headline, type from articles,to_tsquery('english',?) as q where idxfti @@ q order by ts_rank(idxfti,q) desc", "ArticleWithHeadline")
    .setParameter(...).getSingleResult();
Article a = (Article) r[0];
a.setHeadline((String) r[1]);
The "ArticleWithHeadline" result set mapping is declared on the entity:
@Entity
@SqlResultSetMapping(
    name = "ArticleWithHeadline",
    entities = @EntityResult(entityClass = Article.class),
    columns = @ColumnResult(name = "HEADLINE"))
public class Article {
    @Transient
    private String headline;
    ...
}

AFAIK, JPA doesn't offer standardized support for calculated attributes. With Hibernate, one would use a Formula, but EclipseLink doesn't have a direct equivalent. James Sutherland made some suggestions in Re: Virtual columns (@Formula of Hibernate) though:
There is no direct equivalent (please log an enhancement), but depending on what you want to do, there are ways to accomplish the same thing.
EclipseLink defines a TransformationMapping which can map a computed value from multiple field values, or access the database.
You can override the SQL for any CRUD operation for a class using its descriptor's DescriptorQueryManager.
You could define a VIEW on your database that performs the function and map your Entity to the view instead of the table.
You can also perform minor translations using Converters or property get/set methods.
Also have a look at the enhancement request that has a solution using a DescriptorEventListener in the comments.
All this is non-standard JPA, of course.
