Is Cassandra table creation slow? - java

I have a JAR file that initialises my Cassandra database, creating ~13 tables in the process. Our tests run this JAR: they start a Cassandra test container and use the JAR to set it up.
But I'm surprised to see that each table takes ~1-2 seconds to create, totalling ~15 seconds. If I create one of these tables manually using cqlsh, it takes ~100-120 ms.
Is there an explanation for this delay? Is there a workaround?
I came across "Why does it take so long to create a table?" but I don't have any tabs in my tables.
Update
The Java Code boils down to
cqlSession.execute( SimpleStatement.newInstance(query).setIdempotent(isIdempotent) );
which uses the java-driver-core version 4.14.1. The query looks like
CREATE TABLE settings (key text, value text, PRIMARY KEY (key))
and took 1.125 seconds.

Because Cassandra is a distributed system, a newly created table has to be propagated to all nodes so that the schema is in agreement across the cluster. This is especially important with a client such as the Java driver, which by default uses a round-robin policy: different DDL statements can be sent to different nodes, causing schema mismatch errors. You can find an example of how to do that correctly here.
In cqlsh this is not an issue: it always sends all commands over the same connection, so the schema versions are generated on the same node and you won't get a schema mismatch.

It turns out that java-driver-core has a feature called debouncing: schema refresh requests are accumulated for up to 1 second, or until an upper count is reached, before the driver acts on them. You can see the code here.
There are driver config settings that control the debounce behaviour, which I set as
datastax-java-driver.advanced.metadata {
  schema.debouncer {
    window = 1 milliseconds
    max-events = 1
  }
}
in order to remove the 1-second delay. This is appropriate for my use case, but whether the change suits yours depends on how the driver is used.
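To see why a single CREATE TABLE can end up waiting for the whole window, here is a small deterministic model of a debouncer. It is only a sketch and not the driver's code: the class DebounceModel, the method flushTimes and the upper count of 20 are made up for illustration, and it assumes that each new event restarts the window and that reaching the upper count flushes immediately.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a debouncer (hypothetical, NOT the driver's implementation).
// It only illustrates the effect of the "window" and "max-events" settings.
final class DebounceModel {

    // Returns the times (in ms) at which accumulated events would be flushed.
    // eventTimesMs must be sorted in ascending order.
    static List<Long> flushTimes(long[] eventTimesMs, long windowMs, int maxEvents) {
        List<Long> flushes = new ArrayList<>();
        int pending = 0;
        long deadline = 0;
        for (long t : eventTimesMs) {
            // The window ran out before this event arrived: flush what was waiting.
            if (pending > 0 && deadline <= t) {
                flushes.add(deadline);
                pending = 0;
            }
            pending++;
            if (pending >= maxEvents) {
                // Upper count reached: flush immediately.
                flushes.add(t);
                pending = 0;
            } else {
                // Assumption: every new event restarts the window.
                deadline = t + windowMs;
            }
        }
        if (pending > 0) {
            // Whatever is still waiting is flushed when its window ends.
            flushes.add(deadline);
        }
        return flushes;
    }

    public static void main(String[] args) {
        long[] oneCreateTable = {0};
        // A lone schema event waits out the full 1000 ms window.
        System.out.println(flushTimes(oneCreateTable, 1000, 20)); // [1000]
        // With window = 1 ms and max-events = 1 it is flushed at once.
        System.out.println(flushTimes(oneCreateTable, 1, 1));     // [0]
    }
}
```

In this model, 13 tables created one after another each pay the full window, which is roughly consistent with the ~15 seconds from the question.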

Related

Populate database table on a frequent basis using JPA

One of my Java application's functions is to read and parse an XML file very frequently (roughly every 5 minutes) and populate a database table. I have created a cron job to do that. Most of the columns' values remain the same, but certain columns may be updated frequently. I was wondering which is the most efficient way of doing that:
1) Delete the table every time and re-create it, or
2) Update the table data, specifically the columns where a change has appeared in the source file.
The number of rows parsed and persisted every time is about 40000-50000.
I would assume that around 2000-3000 rows need to be updated on every cron job run.
I am using JPA to persist data to a MySQL server and I have gone for the first option so far.
Obviously, for both options the job would execute as a single transaction.
Any ideas on which one is better, and possibly any optimization suggestions?
I would suggest scheduling your jobs using something more sophisticated than cron. For instance, Quartz.
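Option 2 from the question could be sketched as follows: compare the freshly parsed rows with what is currently stored and write only the rows that are new or changed. This is a minimal sketch that assumes both sides fit in memory as key/value maps. The class ChangedRows and the method changedOrNew are hypothetical names; the JPA part (loading the current rows and merging the result) is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper: keeps only the rows whose value differs from what is stored.
final class ChangedRows {

    // current: key -> value as stored in the table
    // parsed:  key -> value as read from the XML file
    static Map<String, String> changedOrNew(Map<String, String> current,
                                            Map<String, String> parsed) {
        Map<String, String> toWrite = new HashMap<>();
        for (Map.Entry<String, String> e : parsed.entrySet()) {
            // A missing key yields null from get(), so new rows are kept as well.
            if (!Objects.equals(current.get(e.getKey()), e.getValue())) {
                toWrite.put(e.getKey(), e.getValue());
            }
        }
        return toWrite;
    }

    public static void main(String[] args) {
        Map<String, String> current = Map.of("a", "1", "b", "2");
        Map<String, String> parsed = Map.of("a", "1", "b", "3", "c", "4");
        // Only "b" (changed) and "c" (new) need to be persisted.
        System.out.println(changedOrNew(current, parsed));
    }
}
```

With the numbers from the question, this would mean persisting about 2000-3000 rows per run instead of 40000-50000. Rows that disappeared from the file are not handled here.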

How To Change the replication of a Column family in HBase

In my project I want to create a table with 3 column families. The default replication number is 3, but I want to change it for a certain column family, because we don't need that much replication for it.
For example, a table named table1 has 3 column families: f1, f2 and f3. In this case, we want to set the replication number of f3 to 1. How can I set this config? Is there any solution that doesn't require changing the source code?
PS: via the HBase shell or Java?
First we should specify that the term replication is a little overloaded.
HBase uses HDFS as its storage. HDFS will replicate, to multiple DataNodes, the blocks that make up any files that HBase generates. (See http://hadoop.apache.org/docs/stable/hdfs_design.html#Data+Replication ) This value isn't configurable per column family or per table. It's only configurable per server. (See http://hbase.apache.org/book.html#hdfs_client_conf )
If this is something you'd like to change then I would suggest filing a jira requesting a new feature.
HBase also has the ability to replicate edits from one HBase cluster to another cluster. This replication is per write-ahead log and is configurable per column family. Setting REPLICATION_SCOPE to 1 tells HBase to apply the edits from this region server to another cluster. Setting it to 0 turns replication off.
I looked into this a lot. As I see it, you cannot define a different replication number for individual tables, let alone for column families.
The number of replicas is defined in hbase-site.xml, which applies to the whole table.
You can define whether or not a column family is replicated using REPLICATION_SCOPE.

H2 Database - Creating Indexes

I'm using the H2 database, running in embedded mode, and when my app starts up I load the H2 database with data from a MySQL database. I'm using linked tables to point to the MySQL tables.
My issue is that I'm trying to reduce the time that H2 takes to create the indexes on the tables, particularly for larger tables (5 million+ rows).
Does anyone know if it is safe to run the CREATE INDEX commands in a separate thread while I load the next table's data into H2?
For example:
Thread 1: Loads table 1 -> tells Thread 2 to start creating indexes and then Thread 1 loads table 2, etc.
I can't use the MVCC mode when loading the tables because later on I need to use the MULTI_THREADED mode when I do my selects. When I try using the MULTI_THREADED mode I got locking errors even though I was loading data into discrete tables.
Many thanks!
What might work (but I'm not sure if it's faster) is to create the tables and indexes first, and then load the tables in parallel. This should avoid locking problems in the system table.
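The ordering suggested above (all DDL first, then parallel loading) could look roughly like this. It is only a sketch: the class ParallelLoad and the callbacks ddl and load are hypothetical placeholders for the real JDBC statements, and whether this is actually faster in H2 would have to be measured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical skeleton: create tables and indexes first, then load in parallel.
final class ParallelLoad {

    // ddl runs once per table on the calling thread;
    // load runs once per table on a worker thread.
    static void run(List<String> tables, Consumer<String> ddl,
                    Consumer<String> load, int threads) {
        // Phase 1: create every table and its indexes sequentially.
        for (String table : tables) {
            ddl.accept(table);
        }
        // Phase 2: load the tables in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (String table : tables) {
                pending.add(CompletableFuture.runAsync(() -> load.accept(table), pool));
            }
            for (CompletableFuture<Void> f : pending) {
                f.join(); // waits for the loader and rethrows its failure, if any
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        run(List.of("table1", "table2"),
            t -> System.out.println("create table and indexes: " + t),
            t -> System.out.println("load: " + t),
            2);
    }
}
```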
I would also like to add the method rst.findColumn("columnName"), which finds a column's index AFTER you have obtained the table's result set. rst is a ResultSet object. This is what I have used.
Another way to dramatically improve H2 loading, and especially indexing performance, is to set the initial memory close to the expected memory requirement. As one example, this one change allowed an app with about a 1.5GB requirement to start up in 47 seconds instead of failing after 15-20 minutes. Prior to this, we were seeing "GC overhead limit exceeded" and JVMTI errors.
Add the following to your VM arguments (as an example):
-Xms2g
-Xmx4g

how to create a copy of a table in HBase on same cluster? or, how to serve requests using original state while operating on a working state

Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which has been described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but have no knowledge on whether it has been designed to handle that scenario efficiently.
Use the export+import jobs. Doing that sounds like a hack but since I'm new to HBase maybe that might be a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is to serve your requests with reads that use this parameter, pointing to a time before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them with a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.
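The idea of reading "as of" a timestamp can be illustrated with a small in-memory model. This is not the HBase client API: VersionedStore, put and getAsOf are hypothetical names, and the model keeps every version, whereas a real table keeps only the configured number of versions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of timestamped cell versions (hypothetical, NOT the HBase API).
final class VersionedStore {
    // row key -> (timestamp -> value), versions sorted by timestamp
    private final Map<String, TreeMap<Long, String>> cells = new HashMap<>();

    void put(String rowKey, long timestamp, String value) {
        cells.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(timestamp, value);
    }

    // Returns the latest value written at or before asOf, or null if there is none.
    String getAsOf(String rowKey, long asOf) {
        TreeMap<Long, String> versions = cells.get(rowKey);
        if (versions == null) {
            return null;
        }
        Map.Entry<Long, String> entry = versions.floorEntry(asOf);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.put("obj1", 100L, "original");
        long batchStart = 150L;                     // snapshot point for readers
        store.put("obj1", 200L, "updated by batch");

        // While the batch runs, readers stay pinned to the time before it began.
        System.out.println(store.getAsOf("obj1", batchStart)); // original
        // After a successful batch, the read timestamp is bumped forward.
        System.out.println(store.getAsOf("obj1", 250L));       // updated by batch
    }
}
```

If the batch fails, readers simply keep using the old timestamp. The newer versions would still have to be cleaned up separately.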

speed up operation on mysql

I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How can I do multi-threading across different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to record which rows each node is processing. Write the Java program so that it first fetches the last row range claimed by the other nodes and then adds an entry to the same table telling the other nodes which range of rows it is going to process (you can decide the range size). In our case, let's assume each node processes 1000 rows at a time. Node 1 reads table B and finds it empty, so it inserts a row ('Node1', 1000), announcing that it is processing rows of A with primary key <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 then finds that the first 1000 primary keys have been claimed by another node, so it inserts a row ('Node2', 2000), announcing that it is processing rows 1001 to 2000. Note that access to table B must be synchronized, i.e. only one node can work on it at a time.
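The claiming protocol described above can be sketched in memory like this. RangeClaims and claim are hypothetical names, and the synchronized keyword only protects threads inside one JVM. Across ten separate nodes the same mutual exclusion would have to come from the database itself, for example a transaction that locks table B while a node reads and inserts.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for table B (hypothetical). In the real setup this state
// lives in the database and is shared by all nodes.
final class RangeClaims {
    private final long batchSize;
    private final List<String> rows = new ArrayList<>(); // rows of table B
    private long lastClaimed = 0;                        // highest primary key claimed so far

    RangeClaims(long batchSize) {
        this.batchSize = batchSize;
    }

    // synchronized: only one caller at a time may read and extend the claims.
    synchronized long[] claim(String node) {
        long from = lastClaimed + 1;
        long to = lastClaimed + batchSize;
        lastClaimed = to;
        rows.add("('" + node + "', " + to + ")");
        return new long[] {from, to};
    }

    synchronized List<String> rows() {
        return new ArrayList<>(rows);
    }

    public static void main(String[] args) {
        RangeClaims tableB = new RangeClaims(1000);
        long[] first = tableB.claim("Node1");
        long[] second = tableB.claim("Node2");
        System.out.println(first[0] + "-" + first[1]);   // 1-1000
        System.out.println(second[0] + "-" + second[1]); // 1001-2000
        System.out.println(tableB.rows());               // [('Node1', 1000), ('Node2', 2000)]
    }
}
```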
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits, as well as reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL back to the cluster transparently.
You could set up sharding with a MySQL instance on each machine, but the setup time, the maintenance and the changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL instances in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best options would be to do the update in pure SQL if that is at all possible, or to use a stored procedure so that all processing can take place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on some application key.
