I'm seeing strange behavior with the DataStax Cassandra driver (3.0). I create a new cluster, then start a set of threads that all use the same Cluster object. With 1 or 2 threads I see an average extraction time of 5 ms, but if I increase to 60 threads, the extraction time rises to 200 ms (per thread). The strange thing is that if I leave the 60-thread app running and start another process on the same machine with only 1 thread, the extraction time for that single-threaded app is again 5 ms. So it seems to be something on the client side. I've repeated the same tests many times to rule out cache cold-start effects.
Here is how the cluster object is configured:
PoolingOptions poolingOptions = new PoolingOptions();
poolingOptions
.setConnectionsPerHost(HostDistance.LOCAL, parallelism, parallelism+20)
.setConnectionsPerHost(HostDistance.REMOTE, parallelism, parallelism+20)
.setMaxRequestsPerConnection(HostDistance.LOCAL, 32768)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 2000);
this.cluster = Cluster.builder()
.addContactPoints(nodes)
.withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
.withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
.withLoadBalancingPolicy(new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
.withCompression(Compression.LZ4)
.withPoolingOptions(poolingOptions)
.withProtocolVersion(ProtocolVersion.V4)
.build();
Has anyone experienced the same problem? It seems like a client configuration issue. Maybe some additional Netty configuration is missing?
UPDATE 1
What the application does is extract chunks of data using a query like:
select * from table where id=? and ts>=? and ts<?
So I have 60 threads extracting that data in parallel. id is the partition key. Every query is executed by a thread as:
//Prepare statement
PreparedStatement stmt = ... // get the cached prepared statement
BoundStatement bstmt = stmt.bind(...);
//Execute query
long te1 = System.nanoTime();
ResultSet rs = this.session.execute(bstmt);
long te2 = System.nanoTime();
//Fetch...
Iterator<Row> iterator = rs.iterator();
while (!rs.isExhausted() && iterator.hasNext()) { .... }
The session is a single instance shared across all threads. What I'm measuring is the average time of the session.execute() call.
Thanks!
UPDATE 2
Here is schema definition
CREATE TABLE d_t (
id bigint,
xid bigint,
ts timestamp,
avg double,
ce double,
cg double,
p double,
w double,
c double,
sum double,
last double,
max double,
min double,
p75 double,
p90 double,
p95 double,
squad double,
sumq double,
wavg double,
weight double,
PRIMARY KEY ((id), xid, ts)
) WITH CLUSTERING ORDER BY (xid DESC, ts DESC)
and compaction = {'class': 'SizeTieredCompactionStrategy'}
and gc_grace_seconds=86400
and caching = { 'keys' : 'ALL', 'rows_per_partition':'36000' }
and min_index_interval = 2
and max_index_interval = 20;
UPDATE 3
Also tried with
.setMaxRequestsPerConnection(HostDistance.LOCAL, 1)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 1)
with no changes
Ultimately I think it will depend on what your code is doing. Can you share an example?
With regards to increased latency, how are you measuring this? Based on your statement:
Strange thing is that, if I let the 60 threads app running and I start on the same machine another process with only 1 threads, extraction time for that single threaded app is again 5ms.
60 concurrent requests really isn't all that much, and in general you shouldn't need to do thread-per-request with the DataStax Java driver. You can achieve high throughput from a single application thread, since the Netty event loop group the driver uses will do most of the work.
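For illustration, here is a minimal sketch of that async pattern with a 3.x driver (it ships with Guava, whose Futures.addCallback is used below); the method name and the source of the statements are made up for the example:
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

// Fire many queries from a single application thread; the driver's Netty event
// loops handle the I/O and invoke the callbacks as each result arrives.
void extractAsync(Session session, Iterable<BoundStatement> statements) {
    for (BoundStatement bstmt : statements) {
        ResultSetFuture future = session.executeAsync(bstmt);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override
            public void onSuccess(ResultSet rs) {
                // iterate rs here, or hand it off to a small worker pool
            }

            @Override
            public void onFailure(Throwable t) {
                // log / retry
            }
        });
    }
}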
The native protocol C* uses allows many requests per connection. As you have it configured here, each connection is maxed out at 32768 concurrent requests. In reality you don't need to touch this configuration at all: the default (1000 requests per connection) is sensible, since in practice C* is not going to process more than native_transport_max_threads (128 by default in cassandra.yaml) at a time and will queue up the rest.
Because of this, you do not need many connections for each host. The default of 1 core connection per host should be more than enough for 60 concurrent requests. Increasing the number of connections per host won't do much for you and in profiling I've found diminishing returns beyond 8 connections per host with high throughputs (thousands of concurrent requests) and throughput getting worse past 16 connections per host, though your mileage may vary based on environment.
With all that said, I would recommend not configuring PoolingOptions beyond the default, other than maybe setting core and max to 8 for scenarios where you are trying to achieve higher throughputs (> 10k requests/sec).
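As a rough sketch of that recommendation (contact points and method name are illustrative; everything else is left at its defaults):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

private Cluster createCluster(String... nodes) {
    // Defaults are fine for ~60 concurrent requests; 8/8 is only worth it for very high throughput.
    PoolingOptions poolingOptions = new PoolingOptions()
            .setConnectionsPerHost(HostDistance.LOCAL, 8, 8);
    return Cluster.builder()
            .addContactPoints(nodes)
            .withPoolingOptions(poolingOptions)
            .build();
}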
Related
I am reading data from a Vertica database using multiple threads in Java.
I have around 20 million records, and I am opening 5 different threads that run select queries like this:
start = threadnum;
while (start * 20000 <= totalRecords) {
    // executes: select * from tableName order by colname limit 20000 offset start*20000
    start += 5;
}
The above approach assigns each thread distinct blocks of 20K records to read from the DB.
For example, the first thread reads the first 20K records, then the 20K records starting at position 100,000, and so on.
But I am not getting any performance improvement. In fact, if it takes x seconds to read 20 million records using a single thread, it is taking almost x seconds for each thread to read from the database.
Shouldn't there be some improvement over x seconds (I was expecting about x/5 seconds)?
Can anybody pinpoint what is going wrong?
Your database presumably lies on a single disk; that disk is connected to a motherboard using a single data cable; if the database server is on a network, then it is connected to that network using a single network cable; so, there is just one path that all that data has to pass through before it can arrive at your different threads and be processed.
The result is, of course, bad performance.
The lesson to take home is this:
Massive I/O from the same device can never be improved by multithreading.
To put it in different terms: parallelism never increases performance when the bottleneck is the transferring of the data, and all the data come from a single sequential source.
If you had 5 different databases stored on 5 different disks, that would work better.
If transferring the data was only taking half of the total time, and the other half of the time was consumed in doing computations with the data, then you would be able to halve the total time by desynchronizing the transferring from the processing, but that would require only 2 threads. (And halving the total time would be the best that you could achieve: more threads would not increase performance.)
As for why reading 20 thousand records appears to perform almost as badly as reading 20 million records, I am not sure why this is happening, but it could be due to a silly implementation of the database system that you are using.
What may be happening is that your database system is implementing the offset and limit clauses on the database driver, meaning that it implements them on the client instead of on the server. If this is in fact what is happening, then all 20 million records are being sent from the server to the client each time, and then the offset and limit clauses on the client throw most of them away and only give you the 20 thousand that you asked for.
You might think that you should be able to trick the system to work correctly by turning the query into a subquery nested inside another query, but my experience when I tried this a long time ago with some database system that I do not remember anymore is that it would result in an error saying that offset and limit cannot appear in a subquery, they must always appear in a top-level query. (Precisely because the database driver needed to be able to do its incredibly counter-productive filtering on the client.)
Another approach would be to assign an incrementing unique integer id to each row, with no gaps in the numbering, so that you can select ... where unique_id >= start and unique_id < (start + 20000), which will definitely be executed on the server rather than on the client.
However, as I wrote above, this will probably not allow you to achieve any increase in performance by parallelizing things, because you will still have to wait for a total of 20 million rows to be transmitted from the server to the client, and it does not matter whether this is done in one go or in 1000 goes of 20 thousand rows each. You cannot have two streams of rows simultaneously flying down a single wire.
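For illustration only, a keyed-range chunk read along the lines suggested above might look like this in JDBC; tableName and unique_id are hypothetical names, and as noted, this mainly moves the filtering to the server rather than speeding up the overall transfer:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Reads one 20K-row chunk using a gap-free unique_id column, so the range filter
// runs on the server instead of being applied on the client.
void readChunk(Connection conn, long start) throws SQLException {
    String sql = "SELECT * FROM tableName WHERE unique_id >= ? AND unique_id < ? ORDER BY unique_id";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, start);
        ps.setLong(2, start + 20000);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // process the row
            }
        }
    }
}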
I will not repeat what Mike Nakis said, as it is true and well explained:
I/O from a physical disk cannot be improved by multithreading
Nevertheless I would like to add something.
When you execute a query like this:
select * from tableName order by colname limit 20000 offset start*20000.
on the client side you can handle the result of the query, and that part you could improve by using multiple threads.
But on the database side you have no control over how the query is processed, and the Vertica database is probably designed to execute your query with parallel tasks according to what the machine allows.
So splitting the execution into one, two or three parallel client threads should not change much in the end, since a professional database is designed to optimize its response time according to the number of requests it receives and the machine's capabilities.
No, you shouldn't get x/5 seconds. You are not thinking about the fact that you are getting 5 times the number of records in the same amount of time. It's about throughput, not about time.
In my opinion, the following is a good solution. It has worked for us to stream and process millions of records without much of a memory and processing overhead.
// Forward-only, read-only result set, so the driver can stream rows instead of buffering them all
PreparedStatement pstmt = conn.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
pstmt.setFetchSize(Integer.MIN_VALUE); // hint to the driver to stream the result set
ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
    // Do the thing
}
Using OFFSET x LIMIT 20000 will result in the same query being executed again and again. For 20 million records at 20K records per execution, the query will be executed 1000 times.
OFFSET 0 LIMIT 20000 will perform well, but OFFSET 19980000 LIMIT 20000 by itself will take a long time: the query has to be executed fully, and then 19,980,000 records have to be skipped from the top to return the last 20,000.
But using the ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY options and setting the fetch size to Integer.MIN_VALUE will result in the query being executed only ONCE and the records will be streamed in chunks, and can be processed in a single thread.
We're experiencing issues with continuously running Java applications that update counters in Cassandra. Monitoring the load of the servers shows no correlation with the failures. The queries are quite constant, because they update values in only 8 different tables. Every minute the Java applications fire thousands of queries (can be 20k or even 50k), but every once in a while some of them fail. When that happens we write them to a file, along with the exception message. This message is always:
Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
We did some googling and troubleshooting and took several actions:
Changed the retry policy in the java applications to DefaultRetryPolicy instead of the FallthroughRetryPolicy, to have the client retry a query on failure.
Changed the write_request_timeout_in_ms setting on the Cassandra nodes from the standard value of 2000 to 4000 and then to 10000.
These actions diminished the number of failing queries, but they still occur. From the millions of queries that are executed on an hourly basis, we see about 2000 failing queries over a period of 24 hours. All have the same exception listed above, and they occur at varying times.
Of course we see from the logs that when queries do fail, they take a while, because the client waits for the timeout and performs retries.
Some facts:
We run Cassandra v2.2.5 (recently upgraded from v2.2.4)
We have a geo aware Cassandra cluster with 6 nodes: 3 in Europe, 3 in US.
The java applications that fire queries are the only clients that communicate with Cassandra (for now).
The number of java applications is 10: 5 in EU, 5 in US.
We execute all queries asynchronously (session.executeAsync(statement);) and keep track of individual queries by adding callbacks for success and failure.
The replication factor is 2.
We run Oracle Java 1.7.0_76 Java(TM) SE Runtime Environment (build 1.7.0_76-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.76-b04, mixed mode)
The 6 Cassandra nodes run on bare metal with the following specs:
Storage is a group of SSDs in RAID 5.
Each node has 2x 6-core Intel Xeon E5-2620 CPUs @ 2.00GHz (24 hardware threads in total).
The RAM size is 128GB.
How we create the cluster:
private Cluster createCluster() {
return Cluster.builder()
.addContactPoints(contactPoints)
.withRetryPolicy(DefaultRetryPolicy.INSTANCE)
.withLoadBalancingPolicy(getLoadBalancingPolicy())
.withReconnectionPolicy(new ConstantReconnectionPolicy(reconnectInterval))
.build();
}
private LoadBalancingPolicy getLoadBalancingPolicy() {
return DCAwareRoundRobinPolicy.builder()
.withUsedHostsPerRemoteDc(allowedRemoteDcHosts) // == 3
.build();
}
How we create the keyspace:
CREATE KEYSPACE IF NOT EXISTS traffic WITH REPLICATION = { 'class': 'NetworkTopologyStrategy', 'AMS1': 2, 'WDC1': 2};
Example table (they all look similar)
CREATE TABLE IF NOT EXISTS traffic.per_node (
node text,
request_time timestamp,
bytes counter,
ssl_bytes counter,
hits counter,
ssl_hits counter,
PRIMARY KEY (node, request_time)
) WITH CLUSTERING ORDER BY (request_time DESC)
AND compaction = {'class': 'DateTieredCompactionStrategy'};
Many remarks:
first, for the Cluster config, you should specify the local DC name (see the sketch after these remarks)
you should use LOCAL_ONE instead of ONE as the consistency level, to improve data locality
DO NOT change the write_request_timeout_in_ms value. You're just sweeping issues under the carpet; your real issue is not the timeout setting
What is your Replication Factor ?
"Every minute the java applications fires thousands of queries (can be 20k or even 50k queries)" --> simple maths gives me ~300 inserts/sec per node, assuming RF=1. That is not huge, but your inserts may be limited by hardware. What is your CPU config (number of cores) and disk type (spinning disk or SSD)?
Do you throttle the async inserts? E.g. fire them in batches of N inserts and wait a little for the cluster to breathe. See my answer here for throttling: What is the best way to get backpressure for Cassandra Writes?
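To make the first, second and last remarks concrete, here is a rough sketch against the question's own createCluster()/getLoadBalancingPolicy() code, reusing its contactPoints and allowedRemoteDcHosts fields; "AMS1" is assumed to be the local DC of the EU clients, and the permit count for the throttle is purely illustrative (only the parts relevant to the remarks are shown):
import java.util.concurrent.Semaphore;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.LoadBalancingPolicy;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

private LoadBalancingPolicy getLoadBalancingPolicy() {
    return DCAwareRoundRobinPolicy.builder()
            .withLocalDc("AMS1")                              // name the local DC explicitly
            .withUsedHostsPerRemoteDc(allowedRemoteDcHosts)
            .build();
}

private Cluster createCluster() {
    return Cluster.builder()
            .addContactPoints(contactPoints)
            .withLoadBalancingPolicy(getLoadBalancingPolicy())
            .withQueryOptions(new QueryOptions()
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE)) // LOCAL_ONE instead of ONE
            .build();
}

// Bound the number of in-flight async writes so the cluster is never flooded.
private final Semaphore inFlight = new Semaphore(1024);

private void executeThrottled(Session session, Statement stmt) throws InterruptedException {
    inFlight.acquire();
    Futures.addCallback(session.executeAsync(stmt), new FutureCallback<ResultSet>() {
        @Override public void onSuccess(ResultSet rs) { inFlight.release(); }
        @Override public void onFailure(Throwable t) { inFlight.release(); /* log / retry */ }
    });
}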
We're on Cassandra 2.0.15 and seeing huge read latencies (>60 sec) coming up at regular intervals (about every 3 min) from all app hosts. We measure this latency around calls to session.execute(stmt). At the same time, Cassandra traces report a duration of <1 s. We also ran a query in a loop via cqlsh from the same hosts during those peak latency times, and cqlsh always came back within 1 s. What can explain this discrepancy at the Java driver level?
-- edit: in reply to comments --
Cassandra servers JVM settings: -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -Xms32G -Xmx32G -XX:+UseG1GC -Djava.net.preferIPv4Stack=true -Dcassandra.jmx.local.port=7199 -XX:+DisableExplicitGC.
Client side GC is negligible (below). Client settings: -Xss256k -Xms4G -Xmx4G, Cassandra driver version is 2.1.7.1
Client side measuring code:
val selectServiceNames = session.prepare(QueryBuilder.select("service_name").from("service_names"))
override def run(): Unit = {
val start = System.currentTimeMillis()
try {
val resultSet = session.execute(selectServiceNames.bind())
val serviceNames = resultSet.all()
val elapsed = System.currentTimeMillis() - start
latency.add(elapsed) // emits metric to statsd
if (elapsed > 10000) {
log.info("Canary2 sensed high Cassandra latency: " + elapsed + "ms")
}
} catch {
case e: Throwable =>
log.error(e, "Canary2 select failed")
} finally {
Thread.sleep(100)
schedule()
}
}
Cluster construction code:
def createClusterBuilder(): Cluster.Builder = {
val builder = Cluster.builder()
val contactPoints = parseContactPoints()
val defaultPort = findConnectPort(contactPoints)
builder.addContactPointsWithPorts(contactPoints)
builder.withPort(defaultPort) // This ends up config.protocolOptions.port
if (cassandraUsername.isDefined && cassandraPassword.isDefined)
builder.withCredentials(cassandraUsername(), cassandraPassword())
builder.withRetryPolicy(ZipkinRetryPolicy.INSTANCE)
builder.withLoadBalancingPolicy(new TokenAwarePolicy(new LatencyAwarePolicy.Builder(new RoundRobinPolicy()).build()))
}
One more observation I cannot explain: I ran two threads that execute the same query in the same manner (as above) in a loop; the only difference is that the yellow thread sleeps 100 ms between queries while the green thread sleeps 60 s between queries. The green thread hits low latency (under 1 s) much more often than the yellow one.
This is a common problem when you get a component to test itself:
you can experience delays which are not visible to the tools in question;
your component has no idea when the request should have started;
when the JVM stops, this can prevent you from seeing the delays you are trying to measure.
The most likely explanation is the second one (your component has no idea when the request should have started). Say you have a queue of 100 tasks, and because the system is running slowly each task takes 1 second. You time each task internally and it sees that it took 1 second; however, if you add 100 tasks to the queue, the first one starts after 0 seconds but the last starts after 99 seconds and then reports that it took 1 second, while from your point of view it took 100 seconds to complete, 99 of which were spent waiting to start.
There can also be delays in the result reaching you, but this is less likely unless the processing you do on the results takes longer than the database itself does. I.e. you might be assuming the bottleneck is on the server when it is actually on the client.
I tracked the issue down to queries timing out on nodes in the remote data center. The cluster has nodes in two DCs, but the keyspace is only replicated within the local DC, so it is surprising that remote nodes were even considered. I was able to bring the latency down by:
changing from ONE to LOCAL_ONE consistency, and
changing from the plain round-robin load balancer to a DC-aware one (also wrapping it in latency-aware and token-aware policies).
It still feels to me like a bug in the Java driver that it tries to use nodes from the remote data center as coordinator nodes when the keyspace clearly does not exist in that data center. Also, even if that somehow couldn't be detected, I was also using a latency-aware policy, which should have excluded the remote DC nodes from consideration.
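For reference, a rough Java sketch of those two changes with a 2.1.x driver might look as follows; the contact point and DC name are placeholders, and 2.1 accepts the local DC name directly in the DCAwareRoundRobinPolicy constructor:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.LatencyAwarePolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

// DC-aware (pinned to the local DC), wrapped in latency-aware and token-aware policies,
// plus LOCAL_ONE as the default consistency level.
Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")                              // placeholder contact point
        .withLoadBalancingPolicy(new TokenAwarePolicy(
                LatencyAwarePolicy.builder(
                        new DCAwareRoundRobinPolicy("local-dc-name")) // placeholder local DC name
                        .build()))
        .withQueryOptions(new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
        .build();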
While coding the solution to the problem of downloading a huge dynamic zip with low RAM impact, an idea started besieging me, and it led to this question, asked out of pure curiosity / hunger for knowledge:
What kind of drawbacks could I run into if, instead of loading the InputStreams one at a time (with separate queries to the database), I loaded all of the InputStreams in a single query, returning a List of (n, potentially thousands of, "opened") InputStreams?
Current (safe) version: n queries, one inputStream instantiated at a time
for (long id : ids){
InputStream in = getMyService().loadStreamById(id);
IOUtils.copyStream(in, out);
in.close();
}
Hypothetical version: one query, n instantiated inputStreams
List<InputStream> streams = getMyService().loadAllStreams();
for (InputStream in : streams){
IOUtils.copyStream(in, out);
in.close();
in = null;
}
What are the pros and cons of the second approach, excluding the (I suppose small) amount of memory used to keep multiple Java InputStreams instantiated?
Could it lead to some kind of network freeze or database stress (or locks, or problems if others read/write the same BLOB field a stream points to, etc.) more than multiple queries would?
Or are they smart enough to be almost invisible until asked for data, so that 1 query + 1000 active streams could be better than 1000 queries + 1 active stream?
The short answer is that you risk hitting a limit of your operating system and/or DBMS.
The longer answer depends on the specific operating system and DBMS, but here are a few things to think about:
On Linux there are a maximum number of open file descriptors that any process can hold. The default is/was 1024, but it's relatively easy to increase. The intent of this limit IMO is to kill a poorly-written process, as the amount of memory required per file/socket is minimal (on a modern machine).
If the open stream represents an individual socket connection to the database, there's a hard limit on the total number of client sockets that a single machine may open to a single server address/port. This is driven by the client's dynamic port range, and it's roughly 16k or 32k ports (but can be modified). This limit applies across all processes on the machine, so excessive consumption by one process may starve other processes trying to access the same server.
Depending on how the DBMS manages the connections used to retrieve BLOBs, you may run into a much smaller limit enforced by the DBMS. Oracle, for example, defaults to a total of 50 "cursors" (active retrieval operations) per user connection.
Aside from these limits, you won't get any benefit given your code as written, since it runs through the connections sequentially. If you were to use multiple threads to read, you may see some benefit from having multiple concurrent connections. However, I'd still open those connections on an as-needed basis. And lest you think of spawning a thread for each connection (and running into the physical limit of number of threads), you'll probably reach a practical throughput limit before you hit any physical limits.
I tested it in PostgreSQL, and it works.
Since PostgreSQL does not seem to have a predefined max cursor limit, I still don't know whether the simple assignment of a cursor/pointer from a BLOB field to a Java InputStream object via java.sql.ResultSet.getBinaryStream("blob_field") counts as an active retrieval operation or not (I guess not, but who knows...);
Loading all the InputStreams at once with something like SELECT blob_field FROM table WHERE length(blob_field)>0 produced a very long query execution time, but then very fast access to the binary content (read sequentially, as above).
With a test case of 200 MB, made up of 20 files of 10 MB each:
The old way took roughly 1 second for each query, plus 0.xx seconds for the other operations (reading each InputStream and writing it to the output stream, etc.);
Total elapsed time: 35 seconds.
The experimental way took roughly 22 seconds for the big query, plus 12 seconds in total for iterating and performing the other operations;
Total elapsed time: 34 seconds.
This makes me think that while assigning the binary stream from the database to the Java InputStream object, the complete read is already being performed :/ which makes using an InputStream similar to using a byte[] (but worse in this case, because of the memory overhead of having all the items instantiated at once);
Conclusion
reading everything at once is a bit faster (~1 second faster per 30 seconds of execution),
but it can easily make the big query time out, besides causing heavy RAM usage and, potentially, hitting max cursor limits.
Do not try this at home; just stick with one query at a time...
We are currently using a 4-CPU Windows box with 8 GB RAM, with MySQL 5.x installed on the same box. We use the WebLogic application server for our application. We are targeting 200 concurrent users for our application (obviously not all on the same module/screen). So what is the optimal number of connections (min and max) to configure in the connection pool, given that we are using WebLogic's connection pooling mechanism?
Did you really mean 200 concurrent users or just 200 logged in users? In most cases, a browser user is not going to be able to do more than 1 page request per second. So, 200 users translates into 200 transactions per second. That is a pretty high number for most applications.
Regardless, as an example, let's go with 200 transactions per second. Say each front-end (browser) tx takes 0.5 seconds to complete, and of that 0.5 seconds, 0.25 is spent in the database. You would then need 0.5 * 200 = 100 threads in the WebLogic thread pool and 0.25 * 200 = 50 connections in the DB connection pool.
To be safe, I would set the max thread pool sizes to at least 25% larger than you expect to allow for spikes in load. The minimums can be a small fraction of the max, but the tradeoff is that it could take longer for some users because a new connection would have to be created. In this case, 50 - 100 connections is not that many for a DB so that's probably a good starting number.
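Spelling the arithmetic out (essentially Little's law: concurrency ≈ arrival rate × time each request holds a resource), using the numbers assumed above:
// Sizing heuristic from this answer, with its assumed numbers.
int requestsPerSecond = 200;   // assumed peak load
double txSeconds = 0.5;        // time a request holds a WebLogic thread
double dbSeconds = 0.25;       // portion of that time spent holding a DB connection

int threadPoolSize = (int) Math.ceil(requestsPerSecond * txSeconds); // 100
int dbPoolSize     = (int) Math.ceil(requestsPerSecond * dbSeconds); // 50
int dbPoolMax      = (int) Math.ceil(dbPoolSize * 1.25);             // ~63, with 25% headroom for spikes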
Note, that to figure out what your average transaction response times are, along with your average DB query time, you are going to have to do a performance test because your times at load are probably not going to be the times you see with a single user.
There is a very simple answer to this question:
The number of connections in the connection pool should be equal to the number of exec threads configured in WebLogic.
The rationale is very simple: if the number of connections is less than the number of threads, some of the threads may end up waiting for a connection, making the connection pool a bottleneck. So it should be at least equal to the number of exec threads (the thread pool size).
Sizing a connection pool is not a trivial thing to do. You basically need:
metrics to investigate the connection usage
failover mechanisms for when there is no connection available
FlexyPool aims to aid you in figuring out the right connection pool size.
You should profile the different expected workflows to find out. Ideally, your connection pool will also dynamically adjust the number of live connections based on recent usage, as it's pretty common for load to be a function of the current time of day in your target geographical area.
Start with a small number and try to reach a reasonable number of concurrent users, then crank it up. I think it's likely that you'll find that your connection pooling mechanism is not nearly as instrumental in your scalability as the rest of the software.
The connection pool should be able to grow and shrink based on actual needs. Log the numbers needed to analyze the running system, either through logging statements or through JMX monitoring. Consider setting up alerts for scenarios like "peak detected: more than X new connections had to be allocated in Y seconds" or "a connection was out of the pool for more than X seconds", which will allow you to give attention to performance issues before they become real problems.
It's difficult to get hard data for this. It's also dependent on a number of factors you don't mention -
200 concurrent users, but how much of their activity will generate database queries? 10 queries per page load? 1 query just on login? etc. etc.
Size of the queries and the db obviously. Some queries run in milliseconds, some in minutes.
You can monitor mysql to watch the current active queries with "show processlist". This could give you a better sense of how much activity is actually going on in the db under peak load.
This is something that needs to be tested and determined on an individual basis - it's pretty much impossible to give an accurate answer for your circumstances without intimately being familiar with them.
Based on my experience with high-transaction financial systems, if you want to handle, for example, 1K requests per second and you have 32 CPUs, you need roughly 1000/32 open connections in your pool to the database.
Here is my formula:
RPS / CPU_COUNT
In most cases your database engine will be able to handle your requests even with far fewer connections, but if the number is too low your requests will sit waiting for a connection.
I think it's also important to mention that your database must actually be able to handle those transactions (based on your disk speed, database configuration and server power).
Good luck.