Restarting quartz scheduler without getting an error - java

I am trying to use the quartz scheduler in cluster mode using jdbc.
Before I started with jdbc in clustered mode I just tested the scheduler in general with the RAM store. That worked without a problem and I was able to restart the scheduler (main class) without any errors. The problem I have now is that when I stop the execution (ctrl+c) and then restart it I always get the error message:
org.quartz.ObjectAlreadyExistsException: Unable to store Job : 'MyTestJob', because one already exists with this identification.
I don't understand what is going on here. Does quartz not support restarting the scheduler? I mean, what happens if there is a crash and the scheduler restarts after recovery? Is the only option to then delete the jobs from the quartz database? Perhaps there is another method or something that I have missed. I don't feel very comfortable using a library that does not cope with restarts.
Another odd thing is, that when changing to jdbc my job does not get triggered anymore and I just see the state WAITING in the DB. What could this be? The job (cron-schedule) worked without a problem in RAM mode.
I am a bit surprised about the level of documentation and the problems I am encountering with this simple task, because I have heard of the quartz scheduler for many years now but never got round to using it. Google suggests that I am not the only one with this problem. I hope that this is just me and that there is a simple solution to my problem; otherwise it would be very disappointing to try this library out for the first time in the 2.2.x version and already have to look for something else.
Here is my configuration:
# Configure Main Scheduler Properties
org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.instanceName = Test-Scheduler
org.quartz.scheduler.instanceId = AUTO
# Configure ThreadPool
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 25
org.quartz.threadPool.threadPriority = 5
# Configure JobStore
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.useProperties = true
org.quartz.jobStore.dataSource = quartzDS
org.quartz.jobStore.tablePrefix = QRTZ_
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
Here is my code:
SchedulerFactory sf = new StdSchedulerFactory();
Scheduler scheduler = sf.getScheduler();
JobDetail jobDetail = newJob(job.getClass())
    .withIdentity("test-name", "test-group")
    .build();
CronTrigger trigger = newTrigger()
    .withIdentity("test-name-trigger", "test-group")
    .withSchedule(cronSchedule("0 0/1 * * * ?"))
    .build();
scheduler.scheduleJob(jobDetail, trigger);
This is interesting.
1) RAM mode works.
2) jdbc with clustering enabled does not work and fails (almost) silently, even with logging enabled. In the log output I see the following:
19:57:29,913 INFO StdSchedulerFactory:1184 - Using default implementation for ThreadExecutor
19:57:29,936 INFO SchedulerSignalerImpl:61 - Initialized Scheduler Signaller of type: class org.quartz.core.SchedulerSignalerImpl
19:57:29,936 INFO QuartzScheduler:240 - Quartz Scheduler v.2.2.1 created.
19:57:29,938 INFO JobStoreTX:667 - Using db table-based data access locking (synchronization).
19:57:29,940 INFO JobStoreTX:59 - JobStoreTX initialized.
19:57:29,941 INFO QuartzScheduler:305 - Scheduler meta-data: Quartz Scheduler (v2.2.1) 'Test-Scheduler' with instanceId 'Michael-PC1405447049916'
Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 25 threads.
Using job-store 'org.quartz.impl.jdbcjobstore.JobStoreTX' - which supports persistence. and is clustered.
19:57:29,941 INFO StdSchedulerFactory:1339 - Quartz scheduler 'Test-Scheduler' initialized from default resource file in Quartz package: ''
19:57:29,941 INFO StdSchedulerFactory:1343 - Quartz scheduler version: 2.2.1
19:57:29,995 INFO AbstractPoolBackedDataSource:462 - Initializing c3p0 pool... com.mchange.v2.c3p0.ComboPooledDataSource [ acquireIncrement -> 3, acquireRetryAttempts -> 30, acquireRetryDelay -> 1000, autoCommitOnClose -> false, automaticTestTable -> null, breakAfterAcquireFailure -> false, checkoutTimeout -> 0, connectionCustomizerClassName -> null, connectionTesterClassName -> com.mchange.v2.c3p0.impl.DefaultConnectionTester, dataSourceName -> 1hgeby993gf1xpdmdc44s|7ec4d0, debugUnreturnedConnectionStackTraces -> false, description -> null, driverClass -> com.mysql.jdbc.Driver, factoryClassLocation -> null, forceIgnoreUnresolvedTransactions -> false, identityToken -> 1hgeby993gf1xpdmdc44s|7ec4d0, idleConnectionTestPeriod -> 50, initialPoolSize -> 3, jdbcUrl -> jdbc:mysql://localhost:3306/scheduler, maxAdministrativeTaskTime -> 0, maxConnectionAge -> 0, maxIdleTime -> 0, maxIdleTimeExcessConnections -> 0, maxPoolSize -> 5, maxStatements -> 0, maxStatementsPerConnection -> 120, minPoolSize -> 1, numHelperThreads -> 3, numThreadsAwaitingCheckoutDefaultUser -> 0, preferredTestQuery -> SELECT 1 FROM QRTZ_JOB_DETAILS, properties -> {user=******, password=******}, propertyCycle -> 0, testConnectionOnCheckin -> true, testConnectionOnCheckout -> false, unreturnedConnectionTimeout -> 0, usesTraditionalReflectiveProxies -> false ]
19:57:30,243 DEBUG StdRowLockSemaphore:107 - Lock 'TRIGGER_ACCESS' is desired by: main
19:57:30,262 DEBUG StdRowLockSemaphore:92 - Lock 'TRIGGER_ACCESS' is being obtained: main
19:58:21,328 DEBUG StdRowLockSemaphore:141 - Lock 'TRIGGER_ACCESS' was not obtained by: main - will try again.
19:58:22,329 DEBUG StdRowLockSemaphore:92 - Lock 'TRIGGER_ACCESS' is being obtained: main
19:59:13,389 DEBUG StdRowLockSemaphore:141 - Lock 'TRIGGER_ACCESS' was not obtained by: main - will try again.
19:59:14,389 DEBUG StdRowLockSemaphore:92 - Lock 'TRIGGER_ACCESS' is being obtained: main
Although, just as I was about to enable cluster mode again, I saw the exception:
Exception in thread "main" org.quartz.impl.jdbcjobstore.LockException: Failure obtaining db row lock: Lock wait timeout exceeded; try restarting transaction [See nested exception: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction]
at org.quartz.impl.jdbcjobstore.StdRowLockSemaphore.executeSQL(
at org.quartz.impl.jdbcjobstore.DBSemaphore.obtainLock(
at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInNonManagedTXLock(
at org.quartz.impl.jdbcjobstore.JobStoreTX.executeInLock(
at org.quartz.impl.jdbcjobstore.JobStoreSupport.clearAllSchedulingData(
at org.quartz.core.QuartzScheduler.clear(
at org.quartz.impl.StdScheduler.clear(
at com.scs.core.cron.TaskRunner.main(
3) In jdbc mode with clustering disabled it does not work either, but I get an exception:
20:04:15,993 DEBUG SimpleSemaphore:132 - Lock 'TRIGGER_ACCESS' retuned by: main
20:04:15,993 DEBUG JobStoreTX:703 - JobStore background threads started (as scheduler was started).
20:04:15,994 INFO QuartzScheduler:575 - Scheduler Test-Scheduler_$_NON_CLUSTERED started.
20:04:15,994 DEBUG JobStoreTX:3933 - MisfireHandler: scanning for misfires...
20:04:16,000 DEBUG JobStoreTX:3182 - Found 0 triggers that missed their scheduled fire-time.
20:04:16,004 DEBUG QuartzSchedulerThread:276 - batch acquisition of 0 triggers
20:04:16,008 DEBUG SimpleSemaphore:81 - Lock 'TRIGGER_ACCESS' is desired by: main
20:04:16,008 DEBUG SimpleSemaphore:88 - Lock 'TRIGGER_ACCESS' is being obtained: main
20:04:16,008 DEBUG SimpleSemaphore:105 - Lock 'TRIGGER_ACCESS' given to: main
20:04:16,052 DEBUG SimpleSemaphore:132 - Lock 'TRIGGER_ACCESS' retuned by: main
Found job: class to.test.cron.ImportProducts
Tue Jul 15 20:05:00 CEST 2014
20:04:16,058 DEBUG QuartzSchedulerThread:276 - batch acquisition of 0 triggers
20:04:42,961 ERROR ErrorLogger:2425 - An error occurred while scanning for the next triggers to fire.
org.quartz.JobPersistenceException: Couldn't acquire next trigger: to.test.cron.ImportProducts [See nested exception: java.lang.ClassNotFoundException: to.test.cron.ImportProducts]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.acquireNextTrigger(
at org.quartz.impl.jdbcjobstore.JobStoreSupport$40.execute(
at org.quartz.impl.jdbcjobstore.JobStoreSupport$40.execute(
at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInNonManagedTXLock(
at org.quartz.impl.jdbcjobstore.JobStoreSupport.acquireNextTriggers(
Caused by: java.lang.ClassNotFoundException: to.test.cron.ImportProducts
at$ Source)
at$ Source)
at Method)
at Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at org.quartz.simpl.InitThreadContextClassLoadHelper.loadClass(
at org.quartz.simpl.CascadingClassLoadHelper.loadClass(
at org.quartz.simpl.CascadingClassLoadHelper.loadClass(
at org.quartz.impl.jdbcjobstore.StdJDBCDelegate.selectJobDetail(
at org.quartz.impl.jdbcjobstore.JobStoreSupport.acquireNextTrigger(
... 5 more
I don't quite understand why I am getting 3 completely different behaviours in the 3 different modes. Surely if the class can be found in RAM mode, it should also be found in jdbc mode? And why is the error not being logged in clustered mode? The class is actually in an osgi-type module. Can that cause a problem (in jdbc mode)? Is there anything I can do so that the class can be found, like passing the classloader to quartz?
I am pretty lost now and would really appreciate any help. It would be a shame to have to go back to standard cron jobs, especially as quartz has so much more to offer.
Thanks in advance for any help provided,

This is a general "problem" with a persistent job store. Your application apparently tries to add a job that already exists in the job store because it has already been added by your application in the past. You have two options:
1) You wipe out the contents of your job store during the initialization of your application, before you attempt to add jobs/triggers. Since Quartz 2.x, there is a new method Scheduler.clear() that you can use.
2) You modify your application code to deal with the fact that the job/trigger you are trying to add may already be present in the job store. If it is present, you simply update the job/trigger if necessary, or skip the job/trigger altogether.
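The second option can be sketched independently of the job store. This is a minimal sketch, not Quartz code: the `ExistsCheck` and `ScheduleAction` interfaces and the in-memory set are hypothetical stand-ins. In a real application the check would typically be something like `scheduler.checkExists(jobKey)` and the action `scheduler.scheduleJob(jobDetail, trigger)`.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch: add a job only if the persistent store does not already hold it. */
final class IdempotentScheduling {

    /** Hypothetical stand-in for a check such as scheduler.checkExists(jobKey). */
    interface ExistsCheck {
        boolean exists() throws Exception;
    }

    /** Hypothetical stand-in for a call such as scheduler.scheduleJob(jobDetail, trigger). */
    interface ScheduleAction {
        void run() throws Exception;
    }

    /** Returns true if the job was newly scheduled, false if it was already stored. */
    static boolean scheduleIfAbsent(ExistsCheck check, ScheduleAction action) {
        try {
            if (check.exists()) {
                // Already persisted by an earlier run: skip (or update) instead of failing.
                return false;
            }
            action.run();
            return true;
        } catch (Exception e) {
            throw new IllegalStateException("Scheduling failed", e);
        }
    }

    public static void main(String[] args) {
        // The set plays the role of the JDBC job store, which survives restarts.
        Set<String> store = new HashSet<>();
        String key = "test-group.test-name";

        boolean firstStart = scheduleIfAbsent(() -> store.contains(key), () -> store.add(key));
        boolean restart = scheduleIfAbsent(() -> store.contains(key), () -> store.add(key));

        System.out.println("first start scheduled: " + firstStart); // true
        System.out.println("restart scheduled: " + restart);        // false
    }
}
```

With this shape a restart no longer attempts to store the same job twice, which is the situation that triggers ObjectAlreadyExistsException in the question.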
When you think of it, this Quartz behavior actually makes sense, because jobs / triggers in the job store can be modified from outside of your application (e.g. by external systems using Quartz remote APIs).
You may also want to look into the XMLSchedulingDataProcessorPlugin that allows you to externalize job and trigger definitions from your application to an XML file/resource and it can deal with job/trigger name conflicts. This article provides an example of the XML file structure.
Hope this helps.


Unable to get c3p0 logging to go to file

I am using Java 8, Hibernate 4.3.11 and c3p0 9.2.1 and the standard Java logging package and am having trouble with writing the debug information from c3p0 to my debug log.
I added
to start up, and this gets c3p0 to use standard logging and write to the console, but it doesn't write to my debug log file.
I initialize loggers for my application and lib
SongKong.ioLogger = Logger.getLogger("org.jaudiotagger");
MainWindow.logger = Logger.getLogger("com.jthink");
and then call my LogProperties class to configure the log files and console and write the data, and this works.
What am I doing wrong?
package com.jthink.songkong.logging;
import com.jthink.songkong.cmdline.SongKong;
import com.jthink.songkong.preferences.GeneralPreferences;
import com.jthink.songkong.preferences.UserPreferences;
import com.jthink.songkong.ui.MainWindow;
import com.jthink.songkong.util.Platform;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.ConsoleHandler;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
 * This defines the command line properties of SongKong, currently consists of logger settings
 */
public final class LogProperties
{
    public static int LOG_SIZE_IN_BYTES = 10000000;
    public LogProperties()
    {
        //Set logging for jaudiotagger lib, user configurable
        //Set logging for songkongdebug, user configurable
        //C3p0 Logger
        Logger c3p0Logger = Logger.getLogger("com.mchange.v2.c3p0");
        try
        {
            //Set Filehandler used for writing to debug log
            String logFileName = Platform.getPlatformLogFolderInLogfileFormat() + "songkong_debug%u-%g.log";
            FileHandler fe = new FileHandler(logFileName, LOG_SIZE_IN_BYTES, 10, true);
            fe.setFormatter(new com.jthink.songkong.logging.LogFormatter());
            //Write output from these loggers to the debug log file
            ConsoleHandler ch = new ConsoleHandler();
            ch.setFormatter(new com.jthink.songkong.logging.LogFormatter());
        }
        catch (IOException ioe)
        {
            MainWindow.userInfoLogger.severe("Unable to open log file");
        }
    }
}
I need the debugging to get written to the log file because I want a customer to run some tests, so it is no good to me if the data is just written to the console. Also, the c3p0 data written to the console is not in the format of my other messages (as defined by com.jthink.songkong.logging.LogFormatter()), so it seems that my call to LogProperties() is effectively being ignored even though it is called before I access c3p0 for the first time.
e.g this is output to console at startup
debuglogfile is:C:\Users\Paul\AppData\Roaming\SongKong\Logs/songkong_debug%u-%g.log
userlogfile is:C:\Users\Paul\AppData\Roaming\SongKong\Logs/songkong_user%u-%g.log
23/08/2019 10.44.26:BST:SongKong:setLocale:SEVERE: Locale is:en
23/08/2019 10.44.27:BST:SongKong:setFonts:WARNING: Fonts Enabled:true
23/08/2019 10.44.27:BST:SongKong:setFonts:WARNING: Fonts configured successfully
23/08/2019 10.44.27:BST:SongKong:init:WARNING: end
23/08/2019 10.44.27:BST:SongKong:finish:WARNING: finish
23/08/2019 10.44.29:BST:SongKong:writeSystemInfo:WARNING: SongKong 6.3 Psychocandy 1099 24/07/2019 using Java 1.8.0_181 25.181-b13 64bit on Windows 10 10.0 amd64 initialized successfully
23/08/2019 10.44.29:BST:SongKong:writeSystemInfo:WARNING: No of CPUs:8
23/08/2019 10.44.29:BST:SongKong:writeSystemInfo:WARNING: SongKong has been configured with minimum heap memory of 100 mb, maximum heap memory of 1,778 mb and maximum permanent memory of -32 mb
23/08/2019 10.44.29:BST:SongKong:writeSystemInfo:WARNING: Total Computer Memory is 24,466 mb
23/08/2019 10.44.30:BST:SongKong:writeSystemInfo:WARNING: Username:Paul:Domain:pclaptop:RunningAsAdmin:false
23/08/2019 10.44.30:BST:SongKong:checkDatabase:WARNING: Setting Db Folder:C:\Users\Paul\AppData\Roaming\SongKong/Database
23/08/2019 10.44.30:BST:SongKong:checkDatabase:WARNING: Lock File remaining from previous, deleting lock
23/08/2019 10.44.30:BST:HibernateUtil:createFactory:SEVERE: ----Initilizing Hibernate Session factory
Aug 23, 2019 10:44:31 AM com.mchange.v2.log.MLog <clinit>
INFO: MLog clients using java 1.4+ standard logging.
Aug 23, 2019 10:44:32 AM com.mchange.v2.c3p0.C3P0Registry banner
INFO: Initializing c3p0- [built 20-March-2013 10:47:27 +0000; debug? true; trace: 10]
Aug 23, 2019 10:44:32 AM com.mchange.v2.c3p0.impl.AbstractPoolBackedDataSource getPoolManager
INFO: Initializing c3p0 pool... com.mchange.v2.c3p0.PoolBackedDataSource#3c73cbbb [ connectionPoolDataSource -> com.mchange.v2.c3p0.WrapperConnectionPoolDataSource#adb66302 [ acquireIncrement -> 3, acquireRetryAttempts -> 10, acquireRetryDelay -> 1000, autoCommitOnClose -> false, automaticTestTable -> null, breakAfterAcquireFailure -> false, checkoutTimeout -> 0, connectionCustomizerClassName -> null, connectionTesterClassName -> com.mchange.v2.c3p0.impl.DefaultConnectionTester, debugUnreturnedConnectionStackTraces -> true, factoryClassLocation -> null, forceIgnoreUnresolvedTransactions -> false, identityToken -> 2rwcn5a41gohnzr1p7tndj|54e1c68b, idleConnectionTestPeriod -> 3000, initialPoolSize -> 1, maxAdministrativeTaskTime -> 0, maxConnectionAge -> 0, maxIdleTime -> 2000, maxIdleTimeExcessConnections -> 0, maxPoolSize -> 5, maxStatements -> 3000, maxStatementsPerConnection -> 50, minPoolSize -> 1, nestedDataSource -> com.mchange.v2.c3p0.DriverManagerDataSource#2d7c4b75 [ description -> null, driverClass -> null, factoryClassLocation -> null, identityToken -> 2rwcn5a41gohnzr1p7tndj|f736069, jdbcUrl -> jdbc:h2:async:C:\Users\Paul\AppData\Roaming\SongKong/Database/Database;FILE_LOCK=SOCKET;MVCC=TRUE;DB_CLOSE_ON_EXIT=FALSE;CACHE_SIZE=50000;, properties -> {user=******, password=******} ], preferredTestQuery -> null, propertyCycle -> 0, statementCacheNumDeferredCloseThreads -> 0, testConnectionOnCheckin -> false, testConnectionOnCheckout -> false, unreturnedConnectionTimeout -> 10, usesTraditionalReflectiveProxies -> false; userOverrides: {} ], dataSourceName -> null, factoryClassLocation -> null, identityToken -> 2rwcn5a41gohnzr1p7tndj|a38c7fe, numHelperThreads -> 3 ]
23/08/2019 10.44.36:BST:SongKong:checkDatabase:SEVERE: Accessed Database okay
23/08/2019 10.44.36:BST:SongKong:checkCache:WARNING: Checking Cache:C:\Users\Paul\AppData\Roaming\SongKong\Database\EhCache
23/08/2019 10.44.38:BST:SongKong:checkCache:WARNING: Checked Cache:C:\Users\Paul\AppData\Roaming\SongKong\Database\EhCache
23/08/2019 10.44.39:BST:SongKong:setUserAgent:WARNING: start
23/08/2019 10.44.41:BST:AbstractAcoustidQuery:performBasicSubmissionQuery:SEVERE: Posting to url:
23/08/2019 10.44.42:BST:SongKong:setUserAgent:WARNING: end
23/08/2019 10.44.42:BST:SongKong:finish:WARNING: finish
Loggers are subject to garbage collection. One bug in your code is the following:
Logger c3p0Logger = Logger.getLogger("com.mchange.v2.c3p0");
Remove that line and create a constant:
private static final Logger c3p0Logger = Logger.getLogger("com.mchange.v2.c3p0");
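The garbage-collection point can be illustrated with plain java.util.logging. This is a minimal sketch under stated assumptions: the class name `PinnedLoggerDemo`, the temporary log file and the child logger name are hypothetical, and `SimpleFormatter` stands in for the question's own `LogFormatter`. The key detail is that the logger is held in a static field, because the LogManager may keep only a weak reference to it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.FileHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

final class PinnedLoggerDemo {

    // Strong reference: a logger held only in a local variable may be collected,
    // and its handlers and level are lost with it.
    private static final Logger C3P0_LOGGER = Logger.getLogger("com.mchange.v2.c3p0");

    /** Attaches a file handler to the pinned logger and returns the logger. */
    static Logger configure(Path logFile) {
        try {
            FileHandler handler = new FileHandler(logFile.toString(), true);
            handler.setFormatter(new SimpleFormatter()); // the question uses its own LogFormatter
            handler.setLevel(Level.ALL);
            C3P0_LOGGER.addHandler(handler);
            C3P0_LOGGER.setLevel(Level.FINE);
            return C3P0_LOGGER;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path logFile = Files.createTempFile("c3p0_debug", ".log"); // hypothetical location
        Logger logger = configure(logFile);

        // Child loggers pass their records up to the handler on the parent logger.
        Logger.getLogger("com.mchange.v2.c3p0.impl.Example").fine("written via parent handler");

        for (Handler h : logger.getHandlers()) {
            h.flush();
        }
        System.out.println("log file size in bytes: " + Files.size(logFile));
    }
}
```

Because the configuration is attached to the `com.mchange.v2.c3p0` logger, records from loggers below that name reach the file handler as long as the parent logger object stays reachable.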
The fundamental flaw with my approach was that I was calling my logging code from within the application itself. What I needed to do instead was specify the following property
at startup with the name of my logging config class
I also needed
so that c3p0 knew I was using standard logging.
This solved the issue, although not being able to call the class from the code made some logic problematic.

MarkLogic Java API deadlock detection

One of our applications just suffered from some nasty deadlocks. I had quite a hard time recreating the problem because the deadlock (or stacktrace) did not show up immediately in my java application logs.
To my surprise the marklogic java api retries failing requests (e.g. because of a deadlock). This might make sense if your request is not a multi statement request, but otherwise I'm not sure that it does.
So let's stick with this deadlock problem. I created a simple code snippet in which I create a deadlock on purpose. The snippet creates a document test.xml and then tries to read and write from two different transactions, each on a new thread.
public static void main(String[] args) throws Exception {
    final Logger root = (Logger) LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME);
    final Logger ok = (Logger) LoggerFactory.getLogger(OkHttpServices.class);
    final DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000, new DatabaseClientFactory.DigestAuthContext("username", "password"));
    final StringHandle handle = new StringHandle("<doc><name>Test</name></doc>");
    client.newTextDocumentManager().write("test.xml", handle);
    root.info("t1: opening");
    final Transaction t1 = client.openTransaction();
    root.info("t1: reading");
    // Assumption: the reads go through the same XML document manager as the writes below.
    client.newXMLDocumentManager().read("test.xml", new StringHandle(), t1);
    root.info("t2: opening");
    final Transaction t2 = client.openTransaction();
    root.info("t2: reading");
    client.newXMLDocumentManager().read("test.xml", new StringHandle(), t2);
    new Thread(() -> {
        root.info("t1: writing");
        client.newXMLDocumentManager().write("test.xml", new StringHandle("<doc><t>t1</t></doc>").withFormat(Format.XML), t1);
    }).start();
    new Thread(() -> {
        root.info("t2: writing");
        client.newXMLDocumentManager().write("test.xml", new StringHandle("<doc><t>t2</t></doc>").withFormat(Format.XML), t2);
    }).start();
}
This code will produce the following log:
14:12:27.437 [main] DEBUG c.m.client.impl.OkHttpServices - Connecting to localhost at 8000 as admin
14:12:27.570 [main] DEBUG c.m.client.impl.OkHttpServices - Sending test.xml document in transaction null
14:12:27.608 [main] INFO ROOT - t1: opening
14:12:27.609 [main] DEBUG c.m.client.impl.OkHttpServices - Opening transaction
14:12:27.962 [main] INFO ROOT - t1: reading
14:12:27.963 [main] DEBUG c.m.client.impl.OkHttpServices - Getting test.xml in transaction 5298588351036278526
14:12:28.283 [main] INFO ROOT - t2: opening
14:12:28.283 [main] DEBUG c.m.client.impl.OkHttpServices - Opening transaction
14:12:28.286 [main] INFO ROOT - t2: reading
14:12:28.286 [main] DEBUG c.m.client.impl.OkHttpServices - Getting test.xml in transaction 8819382734425123844
14:12:28.289 [Thread-1] INFO ROOT - t1: writing
14:12:28.289 [Thread-1] DEBUG c.m.client.impl.OkHttpServices - Sending test.xml document in transaction 5298588351036278526
14:12:28.289 [Thread-2] INFO ROOT - t2: writing
14:12:28.290 [Thread-2] DEBUG c.m.client.impl.OkHttpServices - Sending test.xml document in transaction 8819382734425123844
Neither t1 nor t2 will get committed. MarkLogic logs confirm that there actually is a deadlock:
==> /var/opt/MarkLogic/Logs/8000_AccessLog.txt <== - admin [24/Nov/2018:14:12:30 +0000] "PUT /v1/documents?txid=5298588351036278526&category=content&uri=test.xml HTTP/1.1" 503 1034 - "okhttp/3.9.0"
==> /var/opt/MarkLogic/Logs/ErrorLog.txt <==
2018-11-24 14:12:30.719 Info: Deadlock detected locking Documents test.xml
This would not be a problem if one of the requests failed and threw an exception, but this is not the case. The MarkLogic Java API retries every request for up to 120 seconds, and one of the updates times out after roughly 120 seconds:
Exception in thread "Thread-1" com.marklogic.client.FailedRequestException: Service unavailable and maximum retry period elapsed: 121 seconds after 65 retries
at com.marklogic.client.impl.OkHttpServices.putPostDocumentImpl(
at com.marklogic.client.impl.OkHttpServices.putDocument(
at com.marklogic.client.impl.DocumentManagerImpl.write(
at com.marklogic.client.impl.DocumentManagerImpl.write(
at com.marklogic.client.impl.DocumentManagerImpl.write(
at Scratch.lambda$main$0(
What are possible ways to overcome this problem? One way might be to set a maximum time to live for a transaction (like 5 seconds), but this feels hacky and unreliable. Any other ideas? Are there any other settings I should check out?
I'm on MarkLogic 9.0-7.2 and using marklogic-client-api:4.0.3.
Edit: One way to solve the deadlock would be to synchronize the calling function; this is actually the way I solved it in my case (see comments). But I think the underlying problem still exists. A deadlock in a multi statement transaction should not be hidden away in a 120 second timeout. I would rather have an immediately failing request than a 120 second lock on one of my documents + 64 failing retries per thread.
Deadlocks are usually resolvable by retrying. Internally, the server runs an inner retry loop because deadlocks are usually transient and incidental, lasting a very short time. In your case you have constructed a scenario that will never succeed with any timeout that is equal for both threads.
Deadlocks can be avoided at the application layer by avoiding multi-statement transactions when using the REST API (which is what the Java API uses).
Multi-statement transactions over REST cannot be implemented 100% safely, due to the client's responsibility to manage the transaction ID and the server's inability to detect client-side errors or client-side identity. Very subtle problems can and do occur unless you are aggressively proactive about handling errors and multithreading. If you 'push' the logic to the server (xquery or javascript), the server is able to manage things much better.
As for whether it is 'good' or not for the Java API to implement retries for this case, that's debatable either way. (The compromise for a seemingly easy-to-use interface is that many things that would otherwise be options are decided for you as a convention. There's generally no one-size-fits-all answer. In this case I am presuming the thought was that a deadlock is more likely caused by independent code/logic by 'accident' as opposed to identical code running in tandem; a retry in that case would be a good choice. In your example it's not, but then an earlier error would still fail predictably until you change your code to 'not do that'.)
If it doesn't already exist, a feature request for a configurable timeout and retry behaviour does seem reasonable. I would recommend, however, attempting to avoid any REST calls that result in an open transaction. That is inherently problematic, particularly if you don't notice the problem upfront (then it's more likely to bite you in production). Unlike JDBC, which keeps a connection open so that the server can detect client disconnects, HTTP and the MarkLogic REST API do not, which leads to a different programming model than traditional database coding in Java.
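The workaround mentioned in the question's edit (synchronizing the calling function) can be narrowed to one lock per document URI, so that unrelated documents are not serialized against each other. This is a minimal sketch using only the JDK; `PerDocumentLock`, `withLock` and the in-memory counter are hypothetical and no MarkLogic call is made. It only protects callers inside a single JVM, so it does not replace server-side logic.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

final class PerDocumentLock {

    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();

    /** Runs the given work while holding the lock that belongs to this document URI. */
    <T> T withLock(String uri, Supplier<T> work) {
        Object lock = locks.computeIfAbsent(uri, k -> new Object());
        synchronized (lock) {
            return work.get();
        }
    }

    /** Two threads do read-then-write on the same URI; returns the final counter. */
    static int runDemo() {
        PerDocumentLock guard = new PerDocumentLock();
        int[] document = {0}; // stands in for the content of test.xml
        Runnable task = () -> {
            for (int i = 0; i < 1000; i++) {
                guard.withLock("test.xml", () -> {
                    int read = document[0];  // "read" step of the transaction
                    document[0] = read + 1;  // "write" step of the same transaction
                    return null;
                });
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return document[0];
    }

    public static void main(String[] args) {
        System.out.println("final counter: " + runDemo()); // 2000, no update was lost
    }
}
```

In the real application the lambda would wrap the whole open, read, write and commit sequence, so that two transactions never hold read locks on the same document while both wait to write.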

Spark Job fails at saveAsHadoopDataset stage due to Lost Executor due to some unknown reason

I have a Spark job that runs on YARN. It works with a dataset of about 150 GB, does multiple shuffle operations and finally stores the data in HBase. It keeps failing at saveAsHadoopDataset. Basically, multiple executors fail at this stage after reporting high GC activity. However, none of the executor logs, driver logs or node manager logs indicate any OutOfMemory errors, GC Overhead Exceeded errors or memory-limit-exceeded errors. I don't see any other reason for the executor failures in the Spark UI either.
val hConf = HBaseConfiguration.create
hConf.setInt("hbase.client.scanner.caching", 10000)
hConf.setBoolean("hbase.cluster.distributed", true)
new PairRDDFunctions(hbaseRdd).saveAsHadoopDataset(jobConfig)
Driver Logs:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Job aborted due to stage failure: Task 388 in stage 22.0 failed 4 times, most recent failure: Lost task 388.3 in stage 22.0 (TID 32141, maprnode5): ExecutorLostFailure (executor 5 lost)
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 388 in stage 22.0 failed 4 times, most recent failure: Lost task 388.3 in stage 22.0 (TID 32141, maprnode5): ExecutorLostFailure (executor 5 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1124)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1065)
Executor Logs:
16/02/24 11:09:47 INFO executor.Executor: Finished task 224.0 in stage 8.0 (TID 15318). 2099 bytes result sent to driver
16/02/24 11:09:47 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15333
16/02/24 11:09:47 INFO executor.Executor: Running task 239.0 in stage 8.0 (TID 15333)
16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Getting 125 non-empty blocks out of 3007 blocks
16/02/24 11:09:47 INFO storage.ShuffleBlockFetcherIterator: Started 14 remote fetches in 10 ms
16/02/24 11:11:47 ERROR server.TransportChannelHandler: Connection to maprnode5 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust if this is wrong.
16/02/24 11:11:47 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from maprnode5 is closed
16/02/24 11:11:47 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches Connection from maprnode5 closed
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(
at io.netty.util.concurrent.SingleThreadEventExecutor$
16/02/24 11:11:47 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 6 outstanding blocks after 5000 ms
16/02/24 11:11:52 INFO client.TransportClientFactory: Found inactive connection to maprnode5, creating a new one.
16/02/24 11:12:16 WARN server.TransportChannelHandler: Exception in connection from maprnode5 Connection reset by peer
at Method)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(
at io.netty.buffer.AbstractByteBuf.writeBytes(
at io.netty.util.concurrent.SingleThreadEventExecutor$
16/02/24 11:12:16 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from maprnode5 is closed
16/02/24 11:12:16 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
So it turns out that although the Spark UI says it failed at saveAsHadoopDataset, it was in fact failing at the first step of the stage in which saveAsHadoopDataset was the last step. To elaborate, Spark defines stage boundaries based on either a sequence of narrow transformations or a wide transformation combined with the narrow transformations that follow it. In my particular case, the sequence was groupByKey (wide dep) -> mapValues (narrow dep) -> map (narrow dep), where the last map is actually doing saveAsHadoopDataset. The executors were reporting high GC activity and memory usage during the groupByKey shuffle step, not during the save. I changed my application logic to use reduceByKey instead of groupByKey. Now it's super slow, but at least it is not failing.
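The memory difference between the two operators can be shown with a plain-Java analogy. This is only a sketch of the idea, not Spark code: `GroupVersusReduce` and its methods are hypothetical. The first method buffers every value per key before summing, as groupByKey does; the second keeps a single running value per key, which is closer to how reduceByKey combines values as they arrive.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupVersusReduce {

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    /** groupByKey-like: all values of a key are held in memory before they are summed. */
    static Map<String, Integer> sumViaGrouping(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            int total = 0;
            for (int v : g.getValue()) {
                total += v;
            }
            sums.put(g.getKey(), total);
        }
        return sums;
    }

    /** reduceByKey-like: one running value per key, merged as each pair arrives. */
    static Map<String, Integer> sumViaReducing(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                Arrays.asList(pair("a", 1), pair("b", 2), pair("a", 3), pair("b", 4));
        System.out.println(sumViaGrouping(pairs)); // {a=4, b=6}
        System.out.println(sumViaReducing(pairs)); // {a=4, b=6}
    }
}
```

Both methods return the same result, but the grouping variant's peak memory grows with the number of values per key, which is consistent with the GC pressure described above.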

Unable to Execute More than a spark Job "Initial job has not accepted any resources"

I am using standalone Spark from Java to execute the code snippet below. The status is always WAITING, with the error shown below. It stops working as soon as I add the print statement. Is there any configuration I might have missed to run multiple jobs?
15/09/18 15:02:56 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[2] at filter at
15/09/18 15:02:56 INFO TaskSchedulerImpl: Adding task set 0.0 with 2
15/09/18 15:03:11 WARN TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are
registered and have sufficient resources
15/09/18 15:03:26 WARN TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are
registered and have sufficient resources
15/09/18 15:03:41 WARN TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are
registered and have sufficient resources
JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() { // Ln:143
    @Override
    public Iterable<String> call(String x) {
        // Split each input line into words
        return Arrays.asList(x.split(" "));
    }
});
// Count all the words
System.out.println("Total words is " + words.count());
This error message means that your application is requesting more resources than the cluster can currently provide, i.e. more cores or more RAM than is available in the cluster.
One of the reasons for this could be that you already have a job running which uses up all the available cores.
When this happens, your job is most probably waiting for another job to finish and release resources.
You can check this in the Spark UI.
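On a standalone cluster an application takes every available core by default unless spark.cores.max is set, so a second application can sit in WAITING until the first one finishes. One way to let applications run side by side is to cap each application's share, for example in conf/spark-defaults.conf. The values below are placeholders, not recommendations, and the executor memory also has to fit within what a worker has free:

```properties
# Values are illustrative assumptions; size them to what the cluster UI shows as free.
spark.cores.max        2
spark.executor.memory  512m
```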

Slow Hibernate/C3P0 handling of non-slow Postgres SELECT

A certain indexed SELECT query against a Postgres database takes a highly variable amount of time - from 50 msecs to multiple seconds, and very occasionally minutes, even under the lightest load.
Our Postgres query log records anything over 10 msecs, but never records any of these. The EXPLAIN output suggests the query isn't particularly efficient; nonetheless, it shouldn't be slow in this tiny database (thousands of records), and we're trusting the Postgres logs.
With our application logs set to report all Hibernate, C3P0, and Spring/Spring Data logging (see version numbers at end), the evidence suggests this is very much a Hibernate/C3P0 issue. However, the logs also suggest that the pool size and utilisation are fine for the time being. Unfortunately we cannot drill down any further.
Can you suggest an explanation for the 26 second gap?
10:19:29.149 DEBUG org.hibernate.SQL [I=9534] - select eventrepor0_.consortium_id as consorti1_3_3_, eventrepor0_.customer_resource_id as customer6_3_3_, eventrepor0_.item_type_id as item2_3_3_, eventrepor0_.reporting_date as reportin3_3_3_, eventrepor0_.event_subtype as event4_3_3_, eventrepor0_.event_count as event5_3_3_, as id1_2_0_, customerre1_.customer_id as customer2_2_0_, customerre1_.resource_id as resource3_2_0_, as id1_8_1_, resource2_.data_type_id as data2_8_1_, resource2_.platform_id as platform5_8_1_, resource2_.prop_id as prop3_8_1_, resource2_.title as title4_8_1_, resource2_1_.doi as doi1_6_1_, resource2_1_.isbn as isbn2_6_1_, resource2_1_.online_issn as online3_6_1_, resource2_1_.print_issn as print4_6_1_, resource2_1_.publisher as publishe5_6_1_, resource2_1_.yop as yop6_6_1_, case when is not null then 1 when is not null then 0 end as clazz_1_, as id1_4_2_, platform3_.api_key as api2_4_2_, platform3_.platform_name as platform3_4_2_, hostnames4_.platform_id as platform1_4_5_, hostnames4_.hostname as hostname2_5_5_ from event_report eventrepor0_ inner join customer_resource customerre1_ on left outer join resource resource2_ on left outer join published_resource resource2_1_ on left outer join platform platform3_ on left outer join platform_hostnames hostnames4_ on where eventrepor0_.consortium_id=? and eventrepor0_.customer_resource_id=? and eventrepor0_.item_type_id=? and eventrepor0_.reporting_date=? and eventrepor0_.event_subtype=?
10:19:29.149 DEBUG c.m.v.a.ThreadPoolAsynchronousRunner [I=9534] - com.mchange.v2.async.ThreadPoolAsynchronousRunner#4ffa2724: Adding task to queue -- com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#31e6b320
10:19:29.149 DEBUG c.m.v.c3p0.stmt.GooGooStatementCache [I=9534] - CULLING: update event_report set event_count=event_count+1 where customer_resource_id=? and item_type_id=? and event_subtype=? and reporting_date=? and consortium_id=?
10:19:29.149 DEBUG c.m.v.a.ThreadPoolAsynchronousRunner [I=9534] - com.mchange.v2.async.ThreadPoolAsynchronousRunner#4ffa2724: Adding task to queue -- com.mchange.v2.c3p0.stmt.GooGooStatementCache$StatementDestructionManager$1UncheckedStatementCloseTask#20fa1378
10:19:29.149 DEBUG c.m.v.c3p0.stmt.GooGooStatementCache [I=9534] - cxnStmtMgr.statementSet( org.postgresql.jdbc4.Jdbc4Connection#38e040d2 ).size(): 5
10:19:29.150 DEBUG c.m.v.c3p0.stmt.GooGooStatementCache [I=9534] - checkoutStatement: com.mchange.v2.c3p0.stmt.GlobalMaxOnlyStatementCache stats -- total size: 20; checked out: 5; num connections: 6; num keys: 20
10:19:29.150 TRACE o.h.e.j.internal.JdbcCoordinatorImpl [I=9534] - Registering statement [com.mchange.v2.c3p0.impl.NewProxyPreparedStatement#7cc20161]
10:19:29.150 TRACE o.h.e.j.internal.JdbcCoordinatorImpl [I=9534] - Registering last query statement [com.mchange.v2.c3p0.impl.NewProxyPreparedStatement#7cc20161]
10:19:29.150 TRACE o.h.type.descriptor.sql.BasicBinder [I=9534] - binding parameter [1] as [VARCHAR] -
10:19:29.150 TRACE o.h.type.descriptor.sql.BasicBinder [I=9534] - binding parameter [2] as [BIGINT] - 47
10:19:29.150 TRACE org.hibernate.type.EnumType [I=9534] - Binding [SEARCH_REG] to parameter: [3]
10:19:29.150 TRACE o.h.type.descriptor.sql.BasicBinder [I=9534] - binding parameter [4] as [TIMESTAMP] - Tue Jul 16 00:00:00 BST 2013
10:19:29.151 TRACE o.h.type.descriptor.sql.BasicBinder [I=9534] - binding parameter [5] as [VARCHAR] -
10:19:29.151 TRACE org.hibernate.loader.Loader [I=9534] - Bound [6] parameters total
[... massive gap ...]
10:19:55.644 TRACE o.h.e.j.internal.JdbcCoordinatorImpl [I=9534] - Registering result set [com.mchange.v2.c3p0.impl.NewProxyResultSet#fa7b109]
A note on concurrency: there is a great deal of variability even with only one request (50 to 300 msecs end-to-end), but when one user submits a batch of about 100 of these lookups (probably 10-20 run concurrently), there's a high probability that a few will take 5-10 seconds. And yet the C3P0 stats are never any worse than:
com.mchange.v2.c3p0.stmt.GlobalMaxOnlyStatementCache stats -- total size: 20; checked out: 6; num connections: 6; num keys: 20
These are pretty powerful servers, so there's no obvious disk, network, or CPU activity. We use NewRelic to monitor.
Our DataSource setup:
ComboPooledDataSource dataSource = new com.mchange.v2.c3p0.ComboPooledDataSource();
dataSource.setTestConnectionOnCheckin(true); // the setter takes a boolean, not a String
dataSource.setPreferredTestQuery("select 1");
JPA properties:
props.put("hibernate.dialect", "org.hibernate.dialect.PostgreSQL82Dialect");
props.put("hibernate.show_sql", "false");
props.put("generate_statistics", "false");
props.put("javax.persistence.sharedCache.mode", "ENABLE_SELECTIVE");
props.put("javax.persistence.validation.mode", "NONE");
props.put("hibernate.cache.use_second_level_cache", "false");
props.put("hibernate.cache.region.factory_class", "org.hibernate.cache.impl.NoCachingRegionFactory");
props.put("", "false");
Versions: Postgres 9.1.7 with latest 9.2 JDBC driver; Hibernate 4.2.3.Final; C3P0; Spring 3.2.2.RELEASE; Spring Data JPA 1.1.0; Tomcat 7; JDK 1.7
Update - the C3P0 properties we're currently using (after switch to use maxStatementsPerConnection)
c.m.v.c.i.AbstractPoolBackedDataSource [] - Initializing c3p0 pool... com.mchange.v2.c3p0.ComboPooledDataSource [ acquireIncrement -> 3, acquireRetryAttempts -> 30, acquireRetryDelay -> 1000, autoCommitOnClose -> false, automaticTestTable -> null, breakAfterAcquireFailure -> false, checkoutTimeout -> 0, connectionCustomizerClassName -> null, connectionTesterClassName -> com.mchange.v2.c3p0.impl.DefaultConnectionTester, dataSourceName -> 2s05p58v1s6oref13lw967|538ab4bc, debugUnreturnedConnectionStackTraces -> false, description -> null, driverClass -> org.postgresql.Driver, factoryClassLocation -> null, forceIgnoreUnresolvedTransactions -> false, identityToken -> 2s05p58v1s6oref13lw967|538ab4bc, idleConnectionTestPeriod -> 1800, initialPoolSize -> 5, jdbcUrl -> jdbc:postgresql://******, maxAdministrativeTaskTime -> 0, maxConnectionAge -> 0, maxIdleTime -> 0, maxIdleTimeExcessConnections -> 0, maxPoolSize -> 20, maxStatements -> 0, maxStatementsPerConnection -> 20, minPoolSize -> 5, numHelperThreads -> 3, preferredTestQuery -> select 1, properties -> {user=******, password=******}, propertyCycle -> 0, statementCacheNumDeferredCloseThreads -> 0, testConnectionOnCheckin -> true, testConnectionOnCheckout -> false, unreturnedConnectionTimeout -> 0, userOverrides -> {}, usesTraditionalReflectiveProxies -> false ]
I can't be certain this is the cause of your issue, but I think there's a pretty good shot!
You have set maxStatements to a value that is way, way too low for the load you are carrying. Try setting maxStatements to zero (turn Statement caching off), or else try setting maxStatementsPerConnection to 20, which I think is what you may have intended. As is, you've set a global max of 20 PreparedStatements to be shared by up to 20 Connections. That's not likely to yield good performance.
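The effect of a too-small global cap can be illustrated with a toy model. The sketch below is plain Java, not c3p0 code. It assumes the cache evicts least-recently-used entries and that 6 connections each cycle through 5 distinct statements (both numbers are assumptions loosely based on the stats in the question). The class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of a statement cache with LRU eviction.
class StatementCacheSketch {

    // Counts cache misses when `connections` connections each run the same
    // `statementsPerConnection` statements, round-robin, for `rounds` rounds,
    // against one shared LRU cache holding at most `capacity` statements.
    static int misses(final int capacity, int connections,
                      int statementsPerConnection, int rounds) {
        // Access-ordered LinkedHashMap: the eldest entry is the least recently used.
        Map<String, Boolean> lru = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
        int misses = 0;
        for (int r = 0; r < rounds; r++) {
            for (int c = 0; c < connections; c++) {
                for (int s = 0; s < statementsPerConnection; s++) {
                    // A prepared statement belongs to one connection.
                    String key = c + ":" + s;
                    if (lru.get(key) == null) {
                        misses++; // statement has to be prepared again
                        lru.put(key, Boolean.TRUE);
                    }
                }
            }
        }
        return misses;
    }
}
```

Under these assumptions, misses(20, 6, 5, 10) reports a miss on every one of the 300 lookups, because 30 distinct statements cycle through 20 slots. A cache that only has to hold one connection's 5 statements, misses(20, 1, 5, 10), misses only on the first 5.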

