I'm trying to integrate Hibernate Search into one of the projects I'm currently working on. The first step in such an endeavour is fairly simple - index all the existing entities with Hibernate Search (which uses Lucene under the hood). Many of the tables mapped to entities in the domain model contain a lot of records (> 1 million) and I'm using a simple pagination technique to split them into smaller units. However, I'm experiencing a memory leak while indexing the entities. Here's my code:
@Service(objectName = "LISA-Admin:service=HibernateSearch")
@Depends({"LISA-automaticStarters:service=CronJobs", "LISA-automaticStarters:service=InstallEntityManagerToPersistenceMBean"})
public class HibernateSearchMBeanImpl implements HibernateSearchMBean {

    private static final int PAGE_SIZE = 1000;

    private static final Logger LOGGER = LoggerFactory.getLogger(HibernateSearchMBeanImpl.class);

    @PersistenceContext(unitName = "Core")
    private EntityManager em;

    @Override
    @TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED)
    public void init() {
        FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(em);
        Session s = (Session) em.getDelegate();
        SessionFactory sf = s.getSessionFactory();
        Map<String, EntityPersister> classMetadata = sf.getAllClassMetadata();
        for (String key : classMetadata.keySet()) {
            LOGGER.info("Class: " + key + "\nEntity name: " + classMetadata.get(key).getEntityName());
            Class entityClass = classMetadata.get(key).getMappedClass(EntityMode.POJO);
            if (entityClass != null) {
                LOGGER.info("Class: " + entityClass.getCanonicalName());
                if (entityClass.getAnnotation(Indexed.class) != null) {
                    index(fullTextEntityManager, entityClass, classMetadata.get(key).getEntityName());
                }
            }
        }
    }

    @TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED)
    public void index(FullTextEntityManager pFullTextEntityManager, Class entityClass, String entityName) {
        LOGGER.info("Class " + entityClass.getCanonicalName() + " is indexed by Hibernate Search");
        int currentResult = 0;
        Query tQuery = em.createQuery("select c from " + entityName + " as c order by oid asc");
        tQuery.setFirstResult(currentResult);
        tQuery.setMaxResults(PAGE_SIZE);
        List entities;
        do {
            entities = tQuery.getResultList();
            indexUnit(pFullTextEntityManager, entities);
            currentResult += PAGE_SIZE;
            tQuery.setFirstResult(currentResult);
        } while (entities.size() == PAGE_SIZE);
        LOGGER.info("Finished indexing for " + entityClass.getCanonicalName() + ", current result is " + currentResult);
    }

    @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
    public void indexUnit(FullTextEntityManager pFullTextEntityManager, List entities) {
        for (Object object : entities) {
            pFullTextEntityManager.index(object);
            LOGGER.info("Indexed object with id " + ((BusinessObject) object).getOid());
        }
    }
}
It's just a simple MBean whose init method I execute manually via JBoss's JMX console. When I monitor the execution of the method in JVisualVM I see that the memory usage grows constantly until the whole heap is consumed, and although a lot of garbage collections happen, no memory gets freed, which leads me to believe I have introduced a memory leak in my code. I, however, cannot spot the offending code, so I'm hoping for your assistance in locating it.
The problem is certainly not in the indexing itself, because I get the leak even without it, so I think I'm not doing the pagination right. The only reference to the entities that I have, however, is the list entities, which should be easily garbage collected after each iteration of the loop calling indexUnit.
Thanks in advance for your help.
EDIT
Changing the code to
List entities;
do {
    Query tQuery = em.createQuery("select c from " + entityName + " as c order by oid asc");
    tQuery.setFirstResult(currentResult);
    tQuery.setMaxResults(PAGE_SIZE);
    entities = tQuery.getResultList();
    indexUnit(pFullTextEntityManager, entities);
    currentResult += PAGE_SIZE;
    tQuery.setFirstResult(currentResult);
} while (entities.size() == PAGE_SIZE);
alleviated the problem. The leak is still there, but not as bad as it was. I guess there is something faulty with the JPA query itself, keeping references it shouldn't, but who knows.
It looks like the injected EntityManager is holding on to a reference to all the entities returned from your query. It's a container-managed EM, so it should be closed or cleared automatically at the end of a transaction - but you're doing a bunch of non-transactional queries.
If you are just going to index the entities, you might want to call em.clear() at the end of the loop in init(). The entities will be detached (the EntityManager will no longer track changes made to them), but if they're just going to be GC'ed, that shouldn't be a problem.
I don't think there is a "leak"; however, I do think that you're accumulating a high number of entities into the persistence context (yes, you are, since you're loading them) and, ultimately, eating all the memory. You need to clear the EM after each loop (without clear, paging doesn't help). Something like this:
do {
    entities = tQuery.getResultList();
    indexUnit(pFullTextEntityManager, entities);
    pFullTextEntityManager.clear();
    currentResult += PAGE_SIZE;
    tQuery.setFirstResult(currentResult);
} while (entities.size() == PAGE_SIZE);
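One more detail worth adding here (a sketch, assuming the Hibernate Search FullTextEntityManager API): when you index outside of a transaction, the queued index work should be flushed to Lucene before the clear, otherwise clear() can discard work that has not been written yet. flushToIndexes() is the call Hibernate Search provides for that:
do {
    entities = tQuery.getResultList();
    indexUnit(pFullTextEntityManager, entities);
    pFullTextEntityManager.flushToIndexes(); // write pending index work to Lucene
    pFullTextEntityManager.clear();          // then detach the processed page
    currentResult += PAGE_SIZE;
    tQuery.setFirstResult(currentResult);
} while (entities.size() == PAGE_SIZE);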
Seems like this question won't be finding a real solution. In the end I've just moved the indexing code out into a separate app - the leak is still there, but it doesn't matter that much, since the app runs to completion (with a huge heap) outside of the critical container.
Related
I am new to Java and Hibernate.
I have implemented a functionality where I generate request numbers based on the already saved request number. This is done by finding the maximum request number, incrementing it by 1, and then saving it to the database again.
However, I am facing issues with multithreading. When two threads access my code at the same time, both generate the same request number. My code is already synchronized. Please suggest some solution.
synchronized (this.getClass()) {
    System.out.println("start");
    certRequest.setRequestNbr(generateRequestNumber(certInsuranceRequestAddRq.getAccountInfo().getAccountNumberId()));
    reqId = Utils.getUniqueId();
    certRequest.setRequestId(reqId);
    ItemIdInfo itemIdInfo = new ItemIdInfo();
    itemIdInfo.setInsurerId(certRequest.getRequestId());
    certRequest.setItemIdInfo(itemIdInfo);
    dao.insert(certRequest);
    addAccountRel();
    System.out.println("end");
}
Following is the output showing my synchronization:
start
end
start
end
Is it some Hibernate issue?
Does the use of the transactional attribute in Spring affect the commit in my case?
I am using the following transactional attribute:
@Transactional(readOnly = false, propagation = Propagation.REQUIRED, rollbackFor = Exception.class)
EDIT: code for generateRequestNumber() shown in chat room.
public String generateRequestNumber(String accNumber) throws Exception {
    String requestNumber = null;
    if (accNumber != null) {
        String SQL_QUERY = "select CERTREQUEST.requestNbr from CertRequest as CERTREQUEST, "
                + "CertActObjRel as certActObjRel where certActObjRel.certificateObjkeyId=CERTREQUEST.requestId "
                + " and certActObjRel.certObjTypeCd=:certObjTypeCd "
                + " and certActObjRel.certAccountId=:accNumber ";
        String[] parameterNames = {"certObjTypeCd", "accNumber"};
        Object[] parameterValues = new Object[]
                {
                    Constants.REQUEST_RELATION_CODE, accNumber
                };
        List<?> resultSet = dao.executeNamedQuery(SQL_QUERY, parameterNames, parameterValues);
        // List<?> resultSet = dao.retrieveTableData(SQL_QUERY);
        if (resultSet != null && resultSet.size() > 0) {
            requestNumber = (String) resultSet.get(0);
        }
        int maxRequestNumber = -1;
        if (requestNumber != null && requestNumber.length() > 0) {
            maxRequestNumber = maxValue(resultSet.toArray());
            requestNumber = Integer.toString(maxRequestNumber + 1);
        } else {
            requestNumber = Integer.toString(1);
        }
        System.out.println("inside function request number " + requestNumber);
        return requestNumber;
    }
    return null;
}
Don't synchronize on the Class instance obtained via getClass(). It can have some strange side effects. See https://www.securecoding.cert.org/confluence/pages/viewpage.action?pageId=43647087
For example use:
synchronized (this) {
    // synchronized code
}
or
private synchronized void myMethod() {
    // synchronized code
}
Both of these synchronize on the object instance.
Or do:
private static final Object lock = new Object();

private void myMethod() {
    synchronized (lock) {
        // synchronized code
    }
}
Like @diwakar suggested. This uses a constant field to synchronize on, which guarantees that this code always synchronizes on the same lock.
EDIT: Based on information from chat, you are using a SELECT to get the maximum requestNumber and increasing the value in your code. Then this value is set on the CertRequest, which is then persisted in the database via a DAO. If this persist action is not committed (e.g. by making the method @Transactional or by some other means), then another thread will still see the old requestNumber value. So you could solve this by making the code transactional (how depends on which frameworks you use etc.). But I agree with @VA31's answer, which states that you should use a database sequence for this instead of incrementing the value in code. Instead of a sequence you could also consider using an auto-increment field in CertRequest, something like:
@GeneratedValue(strategy = GenerationType.AUTO)
private int requestNumber;
For getting the next value from a sequence you can look at this question.
You mentioned this information in your question.
I have implemented a functionality where I generate request numbers based on the already saved request number. This is done by finding the maximum request number, incrementing it by 1, and then saving it to the database again.
At first look, it seems the problem is caused by running on multiple app servers. Threads are synchronized only inside one JVM (app server). If you are using more than one app server, then you have to do it differently, using a more robust approach such as server-to-server communication or batch allocation of request numbers to each app server.
But if you are using only one app server and multiple threads access the same code, then you can put a lock on the instance of the class rather than the class itself.
synchronized (this) {
    lastName = name;
    nameCount++;
}
Or you can use a lock private to the class instance:
private Object lock = new Object();
...
synchronized (lock) {
    System.out.println("start");
    certRequest.setRequestNbr(generateRequestNumber(certInsuranceRequestAddRq.getAccountInfo().getAccountNumberId()));
    reqId = Utils.getUniqueId();
    certRequest.setRequestId(reqId);
    ItemIdInfo itemIdInfo = new ItemIdInfo();
    itemIdInfo.setInsurerId(certRequest.getRequestId());
    certRequest.setItemIdInfo(itemIdInfo);
    dao.insert(certRequest);
    addAccountRel();
    System.out.println("end");
}
But make sure that your DB is updated with the new sequence number before the next thread accesses it to get a new one.
It is good practice to generate the request number (unique id) using a DATABASE SEQUENCE so that you don't need to synchronize your Service/DAO methods.
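A minimal sketch of what that can look like with JPA (the sequence name REQUEST_NBR_SEQ is hypothetical and must exist in your schema, em is an injected EntityManager, and the nextval syntax shown is Oracle's, so adjust it for your database):
// Fetch the next request number atomically from a database sequence
// instead of running select-max and incrementing in Java.
public String generateRequestNumber() {
    Number next = (Number) em.createNativeQuery(
            "select REQUEST_NBR_SEQ.nextval from dual").getSingleResult();
    return next.toString();
}
Because the database hands out each value exactly once, two threads (or even two app servers) can never receive the same number, and no synchronized block is needed.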
First thing: why are you taking the lock inside the method? It is not required here.
Also, can you try it like this once:
final static Object lock = new Object();

synchronized (lock)
{
    .....
}
What I feel is that the object you are locking on is different in each call, so try this once.
I need to iterate over 50k objects and change some fields in them.
I'm limited in memory, so I don't want to bring all 50k objects into memory at once.
I thought of doing it with the following code using a cursor, but I was wondering whether all the objects I've processed using the cursor are left in the Entity Manager cache.
The reason I don't want to do it with offset and limit is that the database needs to work much harder, since each page is a completely new query.
From previous experience, once the Entity Manager cache gets bigger, updates become really slow.
So usually I call flush and clear after every few hundred updates.
The problem here is that flushing/clearing will break the cursor.
I will be happy to learn the best approach for updating a large set of objects without loading them all into memory.
Additional information on how the EclipseLink cursor works in such a scenario would be valuable too.
JpaQuery<T> jQuery = (JpaQuery<T>) query;
jQuery.setHint(QueryHints.RESULT_SET_TYPE, ResultSetType.ForwardOnly)
      .setHint(QueryHints.SCROLLABLE_CURSOR, true);

Cursor cursor = jQuery.getResultCursor();
Iterator<MyObj> cursorIterator = cursor.iterator();
while (cursorIterator.hasNext()) {
    MyObj myObj = cursorIterator.next();
    ChangeMyObj(myObj);
}
cursor.close();
Use pagination + entityManager.clear() after each page. Also execute every page in a single transaction, or you will have to create/get a new EntityManager after an exception occurs (at least with Hibernate: the EntityManager instance could be in an inconsistent state after an exception).
Try this sample code:
List results;
int index = 0;
int max = 100;
do {
    Query query = entityManager.createQuery("JPQL QUERY");
    query.setMaxResults(max).setFirstResult(index);
    results = query.getResultList();
    for (Object c : results) {
        // process c
    }
    entityManager.clear();
    index = index + results.size();
} while (results.size() > 0);
I'm writing a Java application that copies one database's information (DB2) to another database (SQL Server). The order of operations is very simple:
1. Check to see if anything has been updated in a certain time frame
2. Grab everything from the first database that is within the designated time frame
3. Map database information to POJOs
4. Divide subsets of POJOs into threads (predefined # in a properties file)
5. Threads cycle through each POJO individually
6. Update the second database
I have everything working just fine, but at certain times of the day there is a huge jump in the amount of updates that need to take place (it can get into the hundreds of thousands).
Below you can see a generic version of my code. It follows the basic algorithm of the application. Object is generic; the actual application has 5 different types of specified objects, each with its own updater thread class. But the generic functions below are exactly what they all look like. And in the updateDatabase() method, they all get added to threads and all run at the same time.
private void updateDatabase()
{
    List<Thread> threads = new ArrayList<>();
    addObjectThreads( threads );
    startThreads( threads );
    joinAllThreads( threads );
}

private void addObjectThreads( List<Thread> threads )
{
    List<Object> objects = getTransformService().getObjects();
    logger.info( "Found " + objects.size() + " Objects" );
    createThreads( threads, objects, ObjectUpdaterThread.class );
}

private void createThreads( List<Thread> threads, List<?> objects, Class threadClass )
{
    final int BASE_OBJECT_LOAD = 1;
    int objectLoad = objects.size() / Database.getMaxThreads() > 0 ? objects.size() / Database.getMaxThreads() + BASE_OBJECT_LOAD : BASE_OBJECT_LOAD;

    for (int i = 0; i < (objects.size() / objectLoad); ++i)
    {
        int startIndex = i * objectLoad;
        int endIndex = (i + 1) * objectLoad;
        try
        {
            List<?> objectSubList = objects.subList( startIndex, endIndex > objects.size() ? objects.size() : endIndex );
            threads.add( new Thread( (Thread) threadClass.getConstructor( List.class ).newInstance( objectSubList ) ) );
        }
        catch (Exception exception)
        {
            logger.error( exception.getMessage() );
        }
    }
}
public class ObjectUpdaterThread extends BaseUpdaterThread
{
    private List<Object> objects;

    final private Logger logger = Logger.getLogger( ObjectUpdaterThread.class );

    public ObjectUpdaterThread( List<Object> objects )
    {
        this.objects = objects;
    }

    public void run()
    {
        for (Object object : objects)
        {
            logger.info( "Now Updating Object: " + object.getId() );
            getTransformService().updateObject( object );
        }
    }
}
All of these go to a Spring service that looks like the code below. Again, it's generic, but each type of object has the exact same kind of logic. The getObjects() calls from the code above are just one-line pass-throughs to the DAO, so no need to really post that.
@Service
@Scope(value = "prototype")
public class TransformServiceImpl implements TransformService
{
    final private Logger logger = Logger.getLogger( TransformServiceImpl.class );

    @Autowired
    private TransformDao transformDao;

    @Override
    public void updateObject( Object object )
    {
        String sql;
        if ( object.exists() )
        {
            sql = Object.Mapper.UPDATE;
        }
        else
        {
            sql = Object.Mapper.INSERT;
        }

        boolean isCompleted = false;
        while ( !isCompleted )
        {
            try
            {
                transformDao.updateObject( object, sql );
                isCompleted = true;
            }
            catch (Exception exception)
            {
                logger.error( exception.getMessage() );
                threadSleep();
                logger.info( "Now retrying update for Object: " + object.getId() );
            }
        }
        logger.info( "Updated Object: " + object.getId() );
    }
}
Finally these all go to the DAO that looks like this:
@Repository
@Scope(value = "prototype")
public class TransformDaoImpl implements TransformDao
{
    // @Resource is like @Autowired but with the added option of being able to specify the name.
    // Good for autowiring two different instances of the same class [NamedParameterJdbcTemplate].
    // Another alternative = @Autowired @Qualifier(BEAN_NAME)
    @Resource(name = "db2")
    private NamedParameterJdbcTemplate db2;

    @Resource(name = "sqlServer")
    private NamedParameterJdbcTemplate sqlServer;

    final private Logger logger = Logger.getLogger( TransformDaoImpl.class );

    @Override
    public void updateObject( Object object, String sql )
    {
        MapSqlParameterSource source = new MapSqlParameterSource();
        source.addValue( "column1_value", object.getColumn1Value() );
        // put all source values from the POJO in just like above
        sqlServer.update( sql, source );
    }
}
My insert statements look like this:
"INSERT INTO dbo.OBJECT_TABLE " +
"(COLUMN1, COLUMN2...) " +
"VALUES(:column1_value, :column2_value...)"
And my update statements look like this:
"UPDATE dbo.OBJECT_TABLE SET " +
"COLUMN1 = :column1_value, COLUMN2 = :column2_value " +
"WHERE PRIMARY_KEY_COLUMN = :primary_key_value"
It's a lot of code and stuff, I know, but I just wanted to lay out everything I have in hopes that I can get help making this faster or more efficient. It takes hours upon hours to update so many rows, and it would be nice if it only took a couple/few hours instead. Thanks for any help. I welcome all learning experiences about Spring, threads and databases.
If you're sending large amounts of SQL to the server, you should consider batching it using the Statement.addBatch and Statement.executeBatch methods. The batches are finite in size (I always limited mine to 64K of SQL), but they dramatically lower the number of round trips to the database.
As I was iterating and creating SQL, I would keep track of how much I had batched already; when the SQL crossed the 64K boundary, I'd fire off an executeBatch and start a fresh one.
You may want to experiment with the 64K number; it may have been an Oracle limitation, which I was using at the time.
I can't speak to Spring, but batching is a part of the JDBC Statement API. I'm sure it's straightforward to get to this.
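Since the DAO in the question already injects NamedParameterJdbcTemplate, its batchUpdate method is probably the shortest route to JDBC batching from Spring. A sketch under that assumption (updateObjects and MyObject are made-up names for illustration):
import java.util.List;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class BatchingSketch {
    // Hypothetical batch counterpart to the question's updateObject():
    // one JDBC batch (one round trip) per chunk instead of one update per row.
    public int[] updateObjects(NamedParameterJdbcTemplate template, String sql,
                               List<MyObject> objects) {
        MapSqlParameterSource[] batch = new MapSqlParameterSource[objects.size()];
        for (int i = 0; i < objects.size(); i++) {
            MapSqlParameterSource source = new MapSqlParameterSource();
            source.addValue("column1_value", objects.get(i).getColumn1Value());
            // ...add the remaining columns exactly as updateObject() does
            batch[i] = source;
        }
        return template.batchUpdate(sql, batch);
    }
}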
Check to see if anything has been updated in a certain time frame
Grab everything from the first database that is within the designated time frame
Is there an index on the LAST_UPDATED_DATE column (or whatever you're using) in the source table? Rather than put the burden on your application, if it's within your control, why not write some triggers in the source database that create entries in an "update log" table? That way, all that your app would need to do is consume and execute those entries.
How are you managing your transactions? If you're creating a new transaction for each operation it's going to be brutally slow.
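If it turns out you are committing per row, here is a sketch of one way to commit per chunk instead, using Spring's TransactionTemplate (the wiring names here are assumptions, not your actual beans):
import java.util.List;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.support.TransactionTemplate;

public class ChunkedUpdateSketch {
    // Commits once per chunk of objects instead of once per updateObject() call.
    public void updateChunk(PlatformTransactionManager txManager,
                            TransformService service, List<Object> chunk) {
        TransactionTemplate tx = new TransactionTemplate(txManager);
        tx.execute(status -> {
            for (Object object : chunk) {
                service.updateObject(object);
            }
            return null; // TransactionCallback requires a return value
        });
    }
}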
Regarding the threading code, have you considered using something more standard rather than writing your own? What you have is a pretty typical producer/consumer and Java has excellent support for that type of thing with ThreadPoolExecutor and numerous queue implementations to move data between threads that perform different tasks.
The benefit of using something off the shelf is that 1) it's well tested and 2) there are numerous tuning options and sizing strategies that you can adjust to increase performance.
Also, rather than use 5 different thread types for each type of object that needs to be processed, have you considered encapsulating the processing logic for each type into separate strategy classes? That way, you could use a single pool of worker threads (which would be easier to size and tune).
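As a sketch of the pool-based version (TransformService is from the question; the pool size and timeout are placeholders to tune):
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PooledUpdater {
    // maxThreads mirrors Database.getMaxThreads() from the question.
    public void updateAll(TransformService service, List<Object> objects, int maxThreads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        for (Object object : objects) {
            // One small task per object; the pool balances the load, so the
            // manual sublist arithmetic in createThreads() goes away.
            pool.submit(() -> service.updateObject(object));
        }
        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.HOURS); // placeholder timeout
    }
}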
Does Objectify throw a ConcurrentModificationException in case an entity with the same key (without a parent) is created at the same time (when it did not exist before) in two different transactions? I've only found information regarding the case where the entity already exists and is modified, but not the case where it does not yet exist...
ofy().transactNew(20, new VoidWork() {
    @Override
    public void vrun() {
        Key<GameRequest> key = Key.create(GameRequest.class, numberOfPlayers + "_" + rules);
        Ref<GameRequest> ref = ofy().load().key(key);
        GameRequest gr = ref.get();
        if (gr == null) {
            // create new gamerequest and add...
            // <-- HERE
        } else {
            ...
        }
    }
});
Thanks!
Yes, you will get a ConcurrentModificationException (CME) if anything in that entity group changes - including entity creation and deletion.
The code you show should work fine. Unless you really know what you are doing, you're probably better off just using the transact() method without trying to limit retries or forcing a new transaction. 99% of the time, transact() just does the right thing.
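For reference, a sketch of the plain transact() form, mirroring the Objectify calls from the question:
ofy().transact(new VoidWork() {
    @Override
    public void vrun() {
        Key<GameRequest> key = Key.create(GameRequest.class, numberOfPlayers + "_" + rules);
        Ref<GameRequest> ref = ofy().load().key(key);
        GameRequest gr = ref.get();
        if (gr == null) {
            // create and save the new GameRequest inside this transaction
        }
        // transact() retries the whole block automatically when a
        // ConcurrentModificationException is thrown at commit time.
    }
});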
I'm running an import job that had worked pretty well until a couple of days ago, when the number of entities increased dramatically.
What happens is that I get a "Lock wait timeout exceeded" error. The application then retries and an exception is thrown since I call em.getTransaction().begin() one more time.
To get rid of this problem I changed innodb_lock_wait_timeout to 120 and lowered the batch size to 50 entities.
What I can't figure out is how to handle all of this properly in code. I don't want the entire import to fail because of locking. How would you handle this? Do you have any code examples? Maybe some other thoughts? Please go nuts!
My BatchPersister:
public class BatchPersister implements Persister {

    private final static Log log = getLog(BatchPersister.class);
    private WorkLogger workLog = WorkLogger.instance();

    private static final int BATCH_SIZE = 500;

    private int persistedObjects;
    private long startTime;
    private UpdateBatch batch;
    private String dataSource;

    public BatchPersister(String dataSource) {
        this.dataSource = dataSource;
    }

    public void persist(Persistable obj) {
        persistedObjects++;
        logProgress(100);

        if (batch == null)
            batch = new UpdateBatch(BATCH_SIZE, dataSource);

        batch.add(obj);

        if (batch.isFull()) {
            batch.persist();
            batch = null;
        }
    }
}
UpdateBatch
public class UpdateBatch {

    private final static Log log = LogFactory.getLog(UpdateBatch.class);
    private WorkLogger workLogger = WorkLogger.instance();

    private final Map<Object, Persistable> batch;
    private final EntityManager em;
    private int size;

    /**
     * Initializes the batch and specifies its size.
     */
    public UpdateBatch(int size, String dataSource) {
        this.size = size;
        batch = new LinkedHashMap<Object, Persistable>();
        em = EmFactory.getEm(dataSource);
    }

    public void persist() {
        log.info("Persisting " + this);
        em.getTransaction().begin();
        persistAllToDB();
        em.getTransaction().commit();

        WorkLog batchLog = new WorkLog(IMPORT_PERSIST, IN_PROGRESS);
        batchLog.setAffectedItems(batch.size());
        workLogger.log(batchLog);

        em.close();
    }

    /**
     * Persists all data in this update batch.
     */
    private void persistAllToDB() {
        for (Persistable persistable : batch.values())
            em.persist(persistable);
    }

    @Override
    public String toString() {
        final ArrayList<Persistable> values = new ArrayList<Persistable>(batch.values());
        Persistable first = values.get(0);
        Persistable last = values.get(values.size() - 1);
        return "UpdateBatch[" +
                first.getClass().getSimpleName() + "(" + first.getId() + ")" +
                " - " +
                last.getClass().getSimpleName() + "(" + last.getId() + ")" +
                "]";
    }
}
Solution 1.
Do not use JPA; it was not designed to work with massive database operations. Since you have access to your DataSource and you are managing transactions manually, there is nothing that stops you from using plain old SQL.
Solution 2.
There might be a performance problem connected with the persistence context first-level cache - every persisted entity is kept in that cache, and when this cache gets large it may hurt performance (mostly memory).
To improve the situation, set the hibernate.jdbc.batch_size property (or the equivalent, if you are not using the Hibernate implementation of JPA) to more or less 20 - thanks to that, queries will be sent to the database in packs of 20.
Secondly, clear the persistence context every 20 operations, forcing synchronization with the database:
private void persistAllToDB() {
    int counter = 0;
    for (Persistable persistable : batch.values()) {
        em.persist(persistable);
        counter++;
        if (counter % 20 == 0) {
            em.flush();
            em.clear();
        }
    }
}
Solution 3.
Tune the MySQL InnoDB engine [http://dev.mysql.com/doc/refman/5.1/en/insert-speed.html, http://dev.mysql.com/doc/refman/5.0/en/innodb-tuning.html]. If your table is heavily indexed, it may hurt insert performance.
Those are my speculations; I hope something helps you.
Pitor already named a couple of options. I would point out that a variation of his "Solution 2" would be to leverage the Hibernate StatelessSession API instead of using Session and clearing.
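A sketch of that variant, assuming you can reach the native SessionFactory (for example by unwrapping it from your EntityManagerFactory):
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

public class StatelessBatchSketch {
    // StatelessSession keeps no first-level cache and does no dirty checking,
    // so memory stays flat no matter how large the batch is.
    public void persistAll(SessionFactory sessionFactory, Iterable<Persistable> batch) {
        StatelessSession session = sessionFactory.openStatelessSession();
        Transaction tx = session.beginTransaction();
        try {
            for (Persistable persistable : batch) {
                session.insert(persistable); // issues the INSERT immediately
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}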
However, something else you should consider is that a transaction is a grouping of statements that are expected to fail or succeed in total. If you have a bunch of statements, one in the middle fails, and you want all the preceding statements to be persistent, then you should not be grouping them together in a single transaction. Group your statements properly in transactions. Generally it is a good idea to enable JDBC batching in Hibernate anyway; it generally leads to more efficient database communication.