Easiest way to set spark's connection timeout

Easiest way to set spark's connection timeout - java

What's the easiest way to set a timeout for spark's connection methods, like read and write?
So far I tried to add "spark.network.timeout" with something really low, like "2s", but then I got an exception requesting for "spark.executor.heartbeatInterval" to be lower than the timeout, so I set the heartbeatInterval to "1s".
Setting the timeout:
SparkSession sparkSession = SparkSession.builder().appName("test").master("local[*]").config("spark.network.timeout","2s").config("spark.executor.heartbeatInterval", "1s").getOrCreate();
Reading data:
Dataset<Row> dataset = sparkSession.read().jdbc(url, fromStatement, properties);
Writing data:
dataset.write().mode(SaveMode.Overwrite).jdbc(destinyUrl, tableName, accessProperties);
The read method took 11 seconds to load the dataset, and the write method took 13 seconds to save the dataset into the database, but no actions got stopped after the 2 seconds.

When there is no n/w issues you cant expect to get it stopped. 'spark.network.timeout' is default for all network interactions.
This config will be used in place if below properties are not configured.
spark.core.connection.ack.wait.timeout,
spark.storage.blockManagerSlaveTimeoutMs,
spark.shuffle.io.connectionTimeout
spark.rpc.askTimeout or spark.rpc.lookupTimeout
where as spark.executor.heartbeatInterval is Interval between each executor's heartbeats to the driver. Heartbeats let the driver know that the executor is still alive and update it with metrics for in-progress tasks. spark.executor.heartbeatInterval should be significantly less than spark.network.timeout (default = 10s)
source : https://spark.apache.org/docs/latest/configuration.html#networking

Related

what should be the Hikari maxLifetime for mysql wait_time out value 28800

I have a spring boot application with below HikariCP properties enabled in application.properties and in mysql i have wait_timeout = 28800
spring.datasource.hikari.minimumIdle=9
spring.datasource.hikari.maximumPoolSize=10
spring.datasource.hikari.maxLifetime=28799990
I still get the below error
13:02:46.103 [http-nio-8082-exec-2] WARN com.zaxxer.hikari.pool.PoolBase - HikariPool-1 - Failed to validate connection com.mysql.cj.jdbc.ConnectionImpl#13f6e098 (No operations allowed after connection closed.). Possibly consider using a shorter maxLifetime value.
what values i need to set in HikariCP to fix this issue
Thanks in advance
Edit
#Autowired
JdbcTemplate jdbcTemplate;
public Map<String, Object> getSlideData(String date, String sp){
SimpleJdbcCall simpleJdbcCall = new SimpleJdbcCall(jdbcTemplate).withProcedureName(sp)
.withoutProcedureColumnMetaDataAccess()
.useInParameterNames(ReportGeneratorConstants.TIMEPERIOD)
.declareParameters(
new SqlParameter(ReportGeneratorConstants.TIMEPERIOD,java.sql.Types.VARCHAR)
);
Map<String,String> map=new HashMap<>();
map.put(ReportGeneratorConstants.TIMEPERIOD, date);
return simpleJdbcCall.execute(map);
}
}
I am using simpleJdbcCall to call the stored procedure, I know that simpleJdbcCall uses multithread to run the stored procedure, What i want to know is, does simpleJdbcCall releases/close all connection to pool once execute() is completed (does spring boot take care of closing connections) if yes, where i can see the code for it.

There are many reasons why connections can be terminated. Usually there's something in the middle, tipically a firewall, that drops connections after set amount of time. Find what time that is and set the max life to at least 1 minutes before that. Hikari has a background thread that expires connections but I think it triggers every few seconds; I'm mentioning this as it's not exact to the millisecond.
A note that Hikari will try to keep the minimumIdle number of connections, so usually new connections will be open in the background and won't create a pause in the application to wait for a new connection.
Edit
I'm not familiar with Azure, so I'm not sure if the network stack drops connections after a certain amount of time. Personally, I never set the idle timeout nor max-life settings as long as the ones you have as the chances of a connection dropping is quite high in a cloud environment (in my experience). As an example, in AWS, our data team suggests to use a max lifetime of 5 minutes and idletimeout of 1 minute and this is in a platform with high traffic.
Opening a DB connection is quite fast and usually happens behind the scenes. My suggestion would be to use these settings, but you have to validate them that they work ok for the load of your application (based on the ones you added on the comment below):
# set this equal to maximum pool size if the traffic has burst
spring.datasource.hikari.minimumIdle=5
spring.datasource.hikari.maximumPoolSize=30
# It's the same as the default = 30 minutes
spring.datasource.hikari.maxLifetime=1800000
# It's the same as the default = 10 minutes
spring.datasource.hikari.keepalive-time=60000

How to set connectTimeout in case of slow internet and if I don't know the size of file to download

private fun downloadAPKStream() : InputStream? {
val url = URL(this.url)
val connection = url.openConnection() as HttpURLConnection
connection.requestMethod = "GET"
connection.connect() connection.connectTimeout = 5000
fileSize = connection.contentLength
val inputStream = connection.inputStream
return inputStream
}
I'm using this method to download apk file. But here if internet is slow then due to timeout of 5000 ms, download gets stuck in between without get completed. And if I comment this line or I don't provide any **connection.connectTimeout then it runs fine but sometimes get stuck in infinite time loop. What should I do to make it download files of any size and with slow internet as well.

You got timeout meaning wrong. It is not a max. allowed time of given (network in this case) operation, but max. allowed time of inactivity after which operation is considered stalled and fail. So you should set the timeout to some sane value, that would make sense in real life. As value is in milliseconds, the 5000 is not the one because it's just 5 seconds - any small network hiccup and your connection will get axed. Set it to something higher, like 30 secs or 1 minute or more.
Also note that this is connection timeout only. This means you should be able to establish protocol connection to remote server during that time, but this got nothing to data transfer itself. Data transfer oa process that comes next, once connection is established. For data transfer timeout (which definitely should be set higher) you need to use setReadTimeout().
Finally, you must set connection timeout prior calling connect() otherwise it makes no sense as it is already too late - this is what you got in your code now.
PS: use Download Manager instead.

JDBC Connection Pool test query "SELECT 1" does not catch AWS RDS Writer/Reader failover

We are running an AWS RDS Aurora/MySQL database in a cluster with a writer and a reader instance where the writer is replicated to the reader.
The application accessing the database is a standard java application using a HikariCP Connection Pool. The pool is configured to use a "SELECT 1" test query on checkout.
What we noticed is that once in a while RDS fails over the writer to the reader. The failover can also be replicated manually by clicking "Instance Actions/Failover" in the AWS console.
The connection pool is not able to detect the failover and the fact that it is now connected to a reader database, as the "SELECT 1" test queries still succeed. However any subsequent database updates fail with "java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement" errors.
It appears that instead of a "SELECT 1" test query, the Connection Pool can detect that it is now connected to the reader by using a "SELECT count(1) FROM test_table WHERE 1 = 2 FOR UPDATE" test query.
Has anybody experienced the same issue?
Are there any downsides on using "FOR UPDATE" in the test query?
Are there any alternate or better approaches of handling an AWS RDS cluster writer/reader failover?
Your help is much appreciated
Bernie

I've been giving this a lot of thought in the two months since my original reply...
How Aurora endpoints work
When you start up an Aurora cluster you get multiple hostnames to access the cluster. For the purposes of this answer, the only two that we care about are the "cluster endpoint," which is read-write, and the "read-only endpoint," which is (you guessed it) read-only. You also have an endpoint for each node within the cluster, but accessing nodes directly defeats the purpose of using Aurora, so I won't mention them again.
For example, if I create a cluster named "example", I'll get the following endpoints:
Cluster endpoint: example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
Read-only endpoint: example.cluster-ro-x91qlr44xxxz.us-east-1.rds.amazonaws.com
You might think that these endpoints would refer to something like an Elastic Load Balancer, which would be smart enough to redirect traffic on failover, but you'd be wrong. In fact, they're simply DNS CNAME entries with a really short time-to-live:
dig example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40120
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. IN A
;; ANSWER SECTION:
example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. 5 IN CNAME example.x91qlr44xxxz.us-east-1.rds.amazonaws.com.
example.x91qlr44xxxz.us-east-1.rds.amazonaws.com. 4 IN CNAME ec2-18-209-198-76.compute-1.amazonaws.com.
ec2-18-209-198-76.compute-1.amazonaws.com. 7199 IN A 18.209.198.76
;; Query time: 54 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Dec 14 18:12:08 EST 2018
;; MSG SIZE rcvd: 178
When a failover happens, the CNAMEs are updated (from example to example-us-east-1a):
; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27191
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. IN A
;; ANSWER SECTION:
example.cluster-x91qlr44xxxz.us-east-1.rds.amazonaws.com. 5 IN CNAME example-us-east-1a.x91qlr44xxxz.us-east-1.rds.amazonaws.com.
example-us-east-1a.x91qlr44xxxz.us-east-1.rds.amazonaws.com. 4 IN CNAME ec2-3-81-195-23.compute-1.amazonaws.com.
ec2-3-81-195-23.compute-1.amazonaws.com. 7199 IN A 3.81.195.23
;; Query time: 158 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Dec 14 18:15:33 EST 2018
;; MSG SIZE rcvd: 187
The other thing that happens during a failover is that all of the connections to the "cluster" endpoint get closed, which will fail any in-process transactions (assuming that you've set reasonable query timeouts).
The connections to the "read-only" endpoint don't get closed, which means that whatever node gets promoted will get read-write traffic in addition to read-only traffic (assuming, of course, that your application doesn't just send all requests to the cluster endpoint). Since read-only connections are typically used for relatively expensive queries (eg, reporting), this may cause performance problems for your read-write operations.
The Problem: DNS Caching
When failover happens, all in-process transactions will fail (again, assuming that you've set query timeouts). There will be a short amount of time that any new connections will also fail, as the connection pool attempts to connect to the same host before it's done with recovery. In my experience, failover takes around 15 seconds, during which time your application shouldn't expect to get a connection.
After that 15 seconds (or so), everything should return to normal: your connection pool attempts to connect to the cluster endpoint, it resolves to the IP address of the new read-write node, and all is well. But if anything prevents resolving that chain of CNAMEs, you may find that your connection pool makes connections to a read-only endpoint, which will fail as soon as you try an update operation.
In the case of the OP, he had his own CNAME with a longer timeout. So rather than connect to the cluster endpoint directly, he would connect to something like database.example.com. This is a useful technique in a world where you would manually fail-over to a replica database; I suspect it's less useful with Aurora. Regardless, if you use your own CNAMEs to refer to database endpoints, you need them to have short time-to-live values (certainly no more than 5 seconds).
In my original answer, I also pointed out that Java caches DNS lookups, in some cases forever. The behavior of this cache depends on (I believe) the version of Java, and also whether you're running with a security manager installed. With OpenJDK 8 running as an application, it appears that the JVM will delegate all naming lookups and not cache anything itself. However, you should be familiar with the networkaddress.cache.ttl system property, as described in this Oracle doc and this SO question.
However, even after you've eliminated any unexpected caches, there may still be times where the cluster endpoint is resolved to a read-only node. That leaves the question of how you handle this situation.
Not-so-good solution: use a read-only test on checkout
The OP was hoping to use a database connection test to verify that his application was running on a read-only node. This is surprisingly hard to do: most connection pools (including HikariCP, which is what the OP is using) simply verify that the test query executes successfully; there's no ability to look at what it returns. This means that any test query has to throw an exception to fail.
I haven't been able to come up with a way to make MySQL throw an exception with just a stand-alone query. The best I've come up with is to create a function:
DELIMITER EOF
CREATE FUNCTION throwIfReadOnly() RETURNS INTEGER
BEGIN
IF ##innodb_read_only THEN
SIGNAL SQLSTATE 'ERR0R' SET MESSAGE_TEXT = 'database is read_only';
END IF;
RETURN 0;
END;
EOF
DELIMITER ;
Then you call that function in your test query:
select throwIfReadOnly()
This works, mostly. When running my test program I could see a series of "failed to validate connection" messages, but then, inexplicably, the update query would run with a read-only connection. Hikari doesn't have a debug message to indicate which connection it hands out, so I couldn't identify whether it had allegedly passed validation.
But aside from that possible problem, there's a deeper issue with this implementation: it hides the fact that there's a problem. A user makes a request, and maybe waits for 30 seconds to get a response. There's nothing in the log (unless you enable Hikari's debug logging) to give a reason for this delay.
Moreover, while the database is inaccessible Hikari is furiously trying to make connections: in my single-threaded test, it would attempt a new connection every 100 milliseconds. And these are real connections, they simply go to the wrong host. Throw in an app-server with a few dozen or hundred threads, and that could cause a significant ripple effect on the database.
Better solution: use a read-only test on checkout, via a wrapper Datasource
Rather than let Hikari silently retry connections, you could wrap the HikariDataSource in your own DataSource implementation and test/retry yourself. This has the benefit that you can actually look at the results of the test query, which means that you can use a self-contained query rather than calling a separately-installed function. It also lets you log the problem using your preferred log levels, lets you pause between attempts, and gives you a chance to change pool configuration.
private static class WrappedDataSource
implements DataSource
{
private HikariDataSource delegate;
public WrappedDataSource(HikariDataSource delegate) {
this.delegate = delegate;
}
#Override
public Connection getConnection() throws SQLException {
while (true) {
Connection cxt = delegate.getConnection();
try (Statement stmt = cxt.createStatement()) {
try (ResultSet rslt = stmt.executeQuery("select ##innodb_read_only")) {
if (rslt.next() && ! rslt.getBoolean(1)) {
return cxt;
}
}
}
// evict connection so that we won't get it again
// should also log here
delegate.evictConnection(cxt);
try {
Thread.sleep(1000);
}
catch (InterruptedException ignored) {
// if we're interrupted we just retry
}
}
}
// all other methods can just delegate to HikariDataSource
This solution still suffers from the problem that it introduces a delay into user requests. True, you know that it's happening (which you didn't with the on-checkout test), and you could introduce a timeout (limit the number of times through the loop). But it still represents a bad user experience.
The best (imo) solution: switch into "maintenance mode"
Users are incredibly impatient: if it takes more than a few seconds to get a response back, they'll probably try to reload the page, or submit the form again, or do something that doesn't help and may hurt.
So I think the best solution is to fail quickly and let them know that somethng's wrong. Somewhere near the top of the call stack you should already have some code that responds to exceptions. Maybe you just return a generic 500 page now, but you can do a little better: look at the exception, and return a "sorry, temporarily unavailable, try again in a few minutes" page if it's a read-only database exception.
At the same time, you should send a notification to you ops staff: this may be a normal maintance window failover, or it may be something more serious (but don't wake them up unless you have some way of knowing that it's more serious).

set connection pool idle connection timeout in your java code datasource. set around 1000ms

Aurora failover
As Sayantan Mandal hints in his comments. When using Aurora just use the MariaDb driver it has support for failover.
It is documented here:
https://aws.amazon.com/blogs/database/using-the-mariadb-jdbc-driver-with-amazon-aurora-with-mysql-compatibility/
And here:
https://mariadb.com/kb/en/failover-and-high-availability-with-mariadb-connector-j/#aurora-endpoints-and-discovery
Your connection string will start with jdbc:mariadb:aurora// or jdbc:mysql:aurora//.
The connection pool normally calls JDBC4Connection#isValid which should correctly return false with this driver when on a read only replica.
No custom code required.
DNS Caching
As for DNS caching (networkaddress.cache.ttl) depending on your JVM the default is 30 or 60s depening of whether a security manager is present.
You can retrieve the value at runtime with this snippet if unsure:
Class.forName("sun.net.InetAddressCachePolicy").getMethod("get").invoke(null)
With 30s DNS cache, your connection will start to arrive at the read-write replica at most 30s after the failover happens.

Cassandra - Set write timeout with Java API

I am trying to set the write timeout in Cassandra with the Java drive. SocketOptions allows me to set a read and connect timeout but not a write timeout.
Does anyone knows the way to do this without changing the cassandra.yaml?
thanks
Altober

The name is misleading, but SocketOptions.getReadTimeoutMillis() applies to all requests from the driver to cassandra. You can think of it as a client-level timeout. If a response hasn't been returned by a cassandra node in that period of time an OperationTimeoutException will be raised and another node will be tried. Refer to the javadoc link above for more nuanced information about when the exception is raised to the client. Generally, you will want this timeout to be greater than your timeouts in cassandra.yaml, which is why 12 seconds is the default.
If you want to effectively manage timeouts at the client level, you can control this on a per query basis by using executeAsync along with a timed get on the ResultSetFuture to give up on the request after a period of time, i.e.:
ResultSet result = session.executeAsync("your query").get(300, TimeUnit.MILLISECONDS);
This will throw a TimeoutException if the request hasn't been completed in 300 ms.

How can I solve MongoWaitQueueFullException?

I run a java program which is a thread executor program that inserts thousands of documents to a table in mongodb. I get the following error
Exception in thread "pool-1-thread-301" com.mongodb.MongoWaitQueueFullException: Too many threads are already waiting for a connection. Max number of threads (maxWaitQueueSize) of 500 has been exceeded.
at com.mongodb.PooledConnectionProvider.get(PooledConnectionProvider.java:70)
at com.mongodb.DefaultServer.getConnection(DefaultServer.java:73)
at com.mongodb.BaseCluster$WrappedServer.getConnection(BaseCluster.java:221)
at com.mongodb.DBTCPConnector$MyPort.getConnection(DBTCPConnector.java:508)
at com.mongodb.DBTCPConnector$MyPort.get(DBTCPConnector.java:456)
at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:414)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:176)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:159)
at com.mongodb.DBCollection.insert(DBCollection.java:93)
at com.mongodb.DBCollection.insert(DBCollection.java:78)
at com.mongodb.DBCollection.insert(DBCollection.java:120)
at ScrapResults103$MyRunnable.run(MyProgram.java:368)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
How can I resolve this? Please help me.

You need to check what is the connections per host value which you have given while setting up connection (looking at the exception I think you would have set it to 500).
MongoClientOptions.Builder builder = new MongoClientOptions.Builder();
builder.connectionsPerHost(200);
MongoClientOptions options = builder.build();
mongoClient = new MongoClient(URI, connectionOptions);
An ideal way of setting the connections per host would be by trial and error but you need to make sure that the value which you set should not exceed the number of connections you can have by opening the mongo shell and executing:
db.serverStatus().connections.available

you are in maxWaitQueueSize limit , so increase multiplier ;)
MongoClientOptions options = MongoClientOptions.builder()
.threadsAllowedToBlockForConnectionMultiplier(10)
.build();
MongoClient mongo = new MongoClient("127.0.0.1:27017", options);
//run 2000 threads and use database ;)

waitQueueMultiple is the product of maxConnectionPoolSize and threadsAllowedToBlockForConnectionMultiplier hence you can modify one of these three options to tune your app in MongoClientOptions with corresponding values and consume it to your MongoClient as an argument how it was done above here (marked as an answer) https://stackoverflow.com/a/25347310/2852528
BUT
I strongly recommend analysing your code first (where it communicates to the DB), and if no optimization is available (e.g. caching, using aggregation, paging etc.) then go ahead and change the options

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.