Multiple Operations on Multiple Sets (Tables) in Aerospike cluster - java

Current system state:
Currently, I maintain three sets (the equivalent of tables in an RDBMS) in my Aerospike namespace (the equivalent of a database in an RDBMS), backed by a RESTful service.
Use-case:
I want to perform CRUD operations on at least one set, and sometimes on all of them, based on bulk inputs into my system.
Expectation:
I want to perform all these CRUD operations atomically, meaning either all happen or none. This includes the edge case where some sets are successfully updated with their respective latest changes and a later update to even a single set fails; in that case I want to roll back each set to its previous state.
My workaround:
First I tried to find the equivalent of InsertOnSubmit in Aerospike so I could use the approaches explained in this StackOverflow answer, but it seems that doesn't exist.
Second, I thought of creating an intermediate rollback workflow module. Pseudocode shown below:
1. Temporarily save the new data in some data structure, segregated set-wise.
2. Loop through the set-wise data, pick the primary key from each entry, fetch the older data from Aerospike, and save it into a second data structure, again segregated set-wise.
3. Loop through all the sets one by one from the first data structure and perform the CRUD operations accordingly. IF [everything runs to the end]: GOTO step 6; ELSE: GOTO step 4.
4. Roll back by looping through all the sets one by one from the second data structure and performing the CRUD operations. IF [everything runs to the end]: GOTO step 7; ELSE: GOTO step 5.
5. Log the error with all the details and report it to the alerting system. Someone will get paged to take a look. GOTO step 7.
6. Terminate; operation successful.
7. Terminate; operation unsuccessful.
Help Needed:
Is there any way to incorporate InsertOnSubmit behaviour on an Aerospike cluster without creating my own rollback workflow?
If not, is there any better way to optimize my second approach?

1 - No. Aerospike offers atomicity only at the single-record level. While writing the master record and then replicating its copy to another node does follow true two-phase commit semantics in Aerospike's Strong Consistency (SC) mode, any multi-record transaction has to be implemented at the application level.
2 - Any scheme implementing multi-record transactions, such as the one you are thinking of, typically involves: creating some kind of "lock" bin in a record that you set, doing the multi-record updates, building a before-and-after state of your data, and enforcing some maximum time to complete so that the client application can roll back and clear abandoned operations and locks. Any such scheme will only work reliably under Aerospike's Strong Consistency mode. A sketch of the idea follows.
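Here is a minimal sketch of such a lock-record scheme with the Aerospike Java client. The namespace test, the txn_locks set, and the 60-second lock expiration are assumptions for illustration; this is not a hardened transaction manager (for example, a crash in the middle of the rollback still needs the alerting path described in the question).

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.ResultCode;
import com.aerospike.client.policy.RecordExistsAction;
import com.aerospike.client.policy.WritePolicy;
import java.util.HashMap;
import java.util.Map;

public class MultiSetTxnSketch {
    private final AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

    /** Applies updates across several sets, rolling back all of them on any failure. */
    public boolean updateAtomically(String txnId, Map<Key, Bin[]> newData) {
        Key lockKey = new Key("test", "txn_locks", txnId);  // assumed lock set
        WritePolicy createOnly = new WritePolicy();
        createOnly.recordExistsAction = RecordExistsAction.CREATE_ONLY;
        createOnly.expiration = 60;  // lock auto-expires, bounding abandoned transactions

        try {
            client.put(createOnly, lockKey, new Bin("owner", txnId));  // acquire lock
        } catch (AerospikeException e) {
            if (e.getResultCode() == ResultCode.KEY_EXISTS_ERROR) return false;  // busy
            throw e;
        }

        // snapshot the "before" state of every record we are about to touch
        Map<Key, Record> before = new HashMap<>();
        for (Key key : newData.keySet()) {
            before.put(key, client.get(null, key));  // null record means a new insert
        }

        try {
            for (Map.Entry<Key, Bin[]> entry : newData.entrySet()) {
                client.put(null, entry.getKey(), entry.getValue());  // apply updates
            }
            return true;
        } catch (AerospikeException e) {
            rollback(before);  // restore the previous state, set by set
            return false;
        } finally {
            client.delete(null, lockKey);  // release the lock
        }
    }

    private void rollback(Map<Key, Record> before) {
        for (Map.Entry<Key, Record> entry : before.entrySet()) {
            Record old = entry.getValue();
            if (old == null) {
                client.delete(null, entry.getKey());  // record did not exist before
            } else {
                Bin[] bins = old.bins.entrySet().stream()
                        .map(b -> new Bin(b.getKey(), b.getValue()))
                        .toArray(Bin[]::new);
                client.put(null, entry.getKey(), bins);  // write the old bins back
            }
        }
    }
}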

Related

Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

I have a pipeline that takes URLs of files, downloads them, and generates a BigQuery table row for each line apart from the header.
To avoid duplicate downloads, I want to check URLs against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table.
For this to work I need to either store the history in a database allowing unique values or it might be easier to use BigQuery for this also, but then access to the table must be strictly serial.
Can I enforce single-thread execution (on a single machine) to satisfy this for part of my pipeline only?
(After this point, each of my hundreds of URLs/files would be suitable for processing on a separate thread; each single file gives rise to 10,000 to 10,000,000 rows, so throttling at that point will almost certainly not cause performance issues.)
Beam is designed for parallel processing of data and it tries to explicitly stop you from synchronizing or blocking except using a few built-in primitives, such as Combine.
It sounds like what you want is a filter that emits an element (your URL) only the first time it is seen. You can probably use the built-in Distinct transform for this. It uses a per-key Combine to group the elements by key (your URL in this case), then emits each key only the first time it is seen.
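A minimal sketch with the Beam Java SDK; the sample URLs and the commented-out download step are placeholders, not part of the original pipeline:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.values.PCollection;

public class DistinctUrls {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<String> urls = p.apply(Create.of(
                "http://example.com/a.csv",
                "http://example.com/a.csv",   // duplicate, emitted only once
                "http://example.com/b.csv"));

        // Distinct groups by element and emits each distinct URL exactly once
        PCollection<String> uniqueUrls = urls.apply(Distinct.create());

        // downstream: download each unique URL in parallel (hypothetical step)
        // uniqueUrls.apply(ParDo.of(new DownloadFn()));

        p.run().waitUntilFinish();
    }
}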

Counting Number Of Specific Record In Database

I have an application which needs to be aware of the latest count of certain records in a table. The solution should work without changing the database code or adding triggers or functions to it, so I need a database-vendor-independent solution.
My program is written in Java, but the database could be SQLite, MySQL, PostgreSQL, or MSSQL. For now I'm doing it like this:
In a separate thread that is set as a daemon, my application sends a simple command through JDBC to the database to learn the latest count of the records matching a condition:
// polling loop on the daemon thread ("statement" is a JDBC java.sql.Statement)
while (true) {
    ResultSet rs = statement.executeQuery("SELECT COUNT(*) FROM Mytable WHERE exited='1'");
}
and this sort of polling causes the database to lock, slows down the whole system, and generates huge DB logs, which finally brings the whole thing down!
How can I do this the right way, so that I always have the latest count of those records, or only count when the number changes?
A SELECT statement should not -- by itself -- have the behavior that you are describing. For instance, nothing is logged with a SELECT. Now, it is possible that concurrent insert/update/delete statements are going on, and that these cause problems because the SELECT locks the table.
Two general things you can do:
Be sure that the comparison uses the same type. So, if exited is a number, do not use single quotes (mixing types can confuse some databases).
Create an index on (exited). In basically all databases, this is a single command: create index idx_mytable_exited on mytable(exited).
If locking and concurrent transactions are an issue, then you will need to do more database specific things, to avoid that problem.
As others have said, make sure that exited is indexed.
Also, you can set the transaction isolation on your query to do a "dirty read"; this indicates to the database server that you do not need to wait for other processes' transactions to commit, and instead you wish to read the current value of exited on rows that are being updated by those other processes.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED is the standard syntax for using "dirty read".
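In JDBC the dirty read can be requested per connection. A minimal sketch, with placeholder connection settings; note that PostgreSQL treats READ UNCOMMITTED as READ COMMITTED, so the hint only takes effect on engines such as MSSQL and MySQL:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DirtyReadCount {
    // jdbcUrl/user/password are placeholders for your own connection settings
    static long countExited(String jdbcUrl, String user, String password) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            // dirty read: do not wait for other transactions to commit
            conn.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT COUNT(*) FROM Mytable WHERE exited = 1")) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}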

Concurrency Control on my Code

I am working on an order capture and generation application. The application works fine with concurrent users working on different orders. The problem starts when two users from different systems/locations try to work on the same order. The business impact is that the application generates duplicate data for the same order, since two users are working on it simultaneously.
I have tried synchronizing the method where I generate the order, but that means no other user can work on any new order either: synchronization locks the whole method and blocks every user who hits that code, even for unrelated orders.
I have also tried criteria initialization for an order, but with no success.
Can anyone please suggest a proper approach?
All suggestions/comments are welcome. Thanks in advance.
Instead of synchronizing at the method level, you may use block-level synchronization for the blocks of code which must be operated on by only one thread at a time. This way you widen the scope for parallel processing of different orders; see the sketch below.
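A minimal sketch of per-order locking; the generateOrder/doGenerate names are hypothetical. One lock object per order id means two users on the same order serialize while different orders proceed in parallel (the sketch ignores cleanup of the lock map):

import java.util.concurrent.ConcurrentHashMap;

public class OrderGenerator {
    // one lock object per order id; computeIfAbsent is atomic
    private final ConcurrentHashMap<String, Object> locks = new ConcurrentHashMap<>();

    public void generateOrder(String orderId) {
        Object lock = locks.computeIfAbsent(orderId, id -> new Object());
        synchronized (lock) {
            // only one thread per orderId gets here;
            // different orderIds run fully in parallel
            doGenerate(orderId);  // hypothetical order-generation logic
        }
    }

    private void doGenerate(String orderId) { /* ... */ }
}

Note that an in-JVM lock only helps within one application instance; for users coming from different systems, you need the database-level approach described next.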
On a grand scale, if you are backing your entities with a database, I would advise you to look at optimistic locking.
Add a version field to your order entity. When the order is first placed, the version is 1. Every update must then proceed in order from this version, so imagine two concurrent processes:
a -> Read data (version=1)
     Update data
     Store data (set version=2 if version=1)

b -> Read data (version=1)
     Update data
     Store data (set version=2 if version=1)
If these two run concurrently rather than serialized, you will notice that one of the processes fails to store its data. That is the losing user, who will have to retry his edits (on re-reading, he sees version=2 instead).
If you use JPA, optimistic locking is as easy as adding a @Version attribute to your model. If you use raw JDBC, you will need to add the version check to the update condition:
update table set version=2, data=xyz where orderid=x and version=1
That is by far the best, and in fact the preferred, solution to your general problem.
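A minimal JPA sketch of such an entity (the Order class and its fields are illustrative):

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;
import javax.persistence.Version;

@Entity
@Table(name = "orders")
public class Order {
    @Id
    private Long id;

    // JPA checks and increments this column on every update;
    // a concurrent modification raises OptimisticLockException
    @Version
    private Long version;

    private String data;  // stand-in for the real order fields
}

On commit, the provider issues exactly the "update ... where ... version=1" statement shown above and throws OptimisticLockException if no row matched, which is the losing user's signal to re-read and retry.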

how to create a copy of a table in HBase on same cluster? or, how to serve requests using original state while operating on a working state

Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which has been described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but I don't know whether it has been designed to handle that scenario efficiently.
Using the export+import jobs. Doing that sounds like a hack, but since I'm new to HBase maybe it could be a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects that I need access to in a "snapshot" state, if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests against the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB, but I'm in a whole new world of really large data now, so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is serve your requests using this parameter, pointed at a time just before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
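A minimal sketch with the HBase Java client; the table name objects and the row key are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SnapshotRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // captured just before the daily batch starts
        long cutoff = System.currentTimeMillis();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("objects"))) {
            Get get = new Get(Bytes.toBytes("row-key-1"));
            // only consider cell versions written strictly before the cutoff,
            // so in-flight batch writes stay invisible to this read
            get.setTimeRange(0, cutoff);
            Result result = table.get(get);
            System.out.println(result);
        }
    }
}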
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data than you've configured the table to keep, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.

speed up operation on mysql

I'm currently writing a Java project against MySQL on a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, and then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How do I do multi-threading across different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce-style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com, though).
I'd think about doing that calculation in a stored procedure on the database server and skipping the transfer of millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc., you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to record which range of rows has been claimed by each node. Each node first reads the last row claimed by the other nodes, then inserts a row into B announcing the range it is going to process (you decide the batch size). Say each node processes 1,000 rows at a time: Node 1 reads table B, finds it empty, and inserts ('Node1', 1000), meaning it will process rows with primary key <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along, sees that the first 1,000 keys are claimed, and inserts ('Node2', 2000), announcing it will process rows 1001 to 2000. Please note that access to table B must be synchronized, i.e. only one node can claim a range at a time. A sketch of the claim step is shown below.
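A minimal sketch of the claim step over JDBC, assuming a coordination table batch_claims(node VARCHAR, last_key BIGINT) on InnoDB; SELECT ... FOR UPDATE serializes concurrent claimers:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RangeClaimer {
    static final int BATCH = 1000;

    /** Atomically claims the next range of primary keys; returns the upper bound. */
    static long claimRange(Connection conn, String nodeName) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement();
             // FOR UPDATE blocks other nodes until this claim commits
             ResultSet rs = st.executeQuery(
                     "SELECT COALESCE(MAX(last_key), 0) FROM batch_claims FOR UPDATE")) {
            rs.next();
            long upper = rs.getLong(1) + BATCH;
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO batch_claims (node, last_key) VALUES (?, ?)")) {
                ins.setString(1, nodeName);
                ins.setLong(2, upper);
                ins.executeUpdate();
            }
            conn.commit();
            return upper;  // this node processes rows (upper - BATCH, upper]
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}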
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chance of query-cache hits, as well as reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL to the cluster transparently.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance, and changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best option is to do the update in pure SQL if that is at all possible, or use a stored procedure, so that all processing takes place within the MySQL server and no data movement is required. For example:
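A minimal sketch of pushing the work to the server over JDBC; the connection settings, table, and the derived-column calculation are all placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ServerSideUpdate {
    public static void main(String[] args) throws Exception {
        // placeholder connection settings
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/mydb", "user", "password");
             Statement st = conn.createStatement()) {
            // one set-based statement: millions of rows are read, computed,
            // and written entirely inside the MySQL server; nothing crosses the wire
            int updated = st.executeUpdate(
                    "UPDATE mytable SET score = price * quantity WHERE processed = 0");
            System.out.println("rows updated: " + updated);
        }
    }
}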
If this is not fast enough, you will need to split your database among several MySQL instances and come up with a scheme to partition the data based on some application key.
