Our system needs to process billions of queries from thousands of clients for millions of resources. Some resources will be queried much more often than others. Each client will submit anywhere from hundreds to hundreds of millions of queries at a time. Because each resource can only support thousands of queries per minute, the queries will be enqueued and their results will be determined asynchronously.
Now, here's the rub: each client's queries need to be given equal priority with respect to each resource. That is, if one client submits a million queries for a particular resource and another client submits a dozen immediately after, the second client should not have to wait for the first client's queries to be processed before theirs are. Rather, the first client's first query should be handled, then the other's first query, then the first's second query, and so on, back and forth. (And the analogous idea for more than two clients and multiple resources; it can also be a little less granular, as long as this basic idea is preserved.)
If this were small enough to fit in memory, we'd just keep, per resource, a map from accounts to a queue of queries and iterate over the accounts circularly (roughly like the sketch below); but it isn't, so we need a disk-based solution. We also need it to be robust, highly available, transactional, etc. What are my options? I'm using Java SE.
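For concreteness, here is a minimal sketch of the per-resource structure I have in mind (names are purely illustrative; the full thing would be a map from resource IDs to such per-resource queues):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Pending queries for a single resource, grouped by account and served
// round-robin so a client with a dozen queries is not stuck behind a
// client with a million.
class ResourceQueue {
    private final Map<String, Queue<String>> queriesByAccount = new HashMap<>();
    private final Deque<String> accountRotation = new ArrayDeque<>();

    void enqueue(String accountId, String query) {
        Queue<String> q = queriesByAccount.computeIfAbsent(accountId, k -> new ArrayDeque<>());
        if (q.isEmpty()) {
            accountRotation.addLast(accountId); // account (re-)enters the rotation
        }
        q.add(query);
    }

    // Hand out one query from the account at the head of the rotation,
    // then move that account to the back if it still has queries pending.
    String nextQuery() {
        String accountId = accountRotation.pollFirst();
        if (accountId == null) {
            return null; // nothing pending for this resource
        }
        Queue<String> q = queriesByAccount.get(accountId);
        String query = q.poll();
        if (q.isEmpty()) {
            queriesByAccount.remove(accountId);
        } else {
            accountRotation.addLast(accountId);
        }
        return query;
    }
}
```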
Thanks in advance!
Up front: I know HBase much better than I do Cassandra. Some aspects of my response are HBase-specific, and I'll mark them as such.
Assuming that you provision enough hardware, then a BigTable implementation like Cassandra or HBase would give you the following:
The ability to store and retrieve your queries at an extremely high rate
The ability to absorb deletes at an extremely high rate (though with HBase and Cassandra, flushing writes to disk can cause periodic delays)
Trivially, I could see a schema where you use resource-id as the row key and account-id (and perhaps a timestamp) as the column key, but (in HBase specifically) this could lead to hotspots on the servers hosting certain popular resources (in both HBase and Cassandra, a single server is responsible for hosting the master copy of any given row at any given time). In Cassandra you can reduce the overhead of updates by using async writes (writing to only one or two nodes and allowing gossip to replicate them), but this could result in old records sticking around dramatically longer than you expect when network traffic is high. In HBase, writes are always consistent and always go to the RegionServer hosting the row, so hotspotting is definitely a potential problem.
You can reduce the impact of hotspotting by making your row key a combination of resource ID and account ID, but then you need to scan all row keys to determine the list of accounts that have outstanding queries for a resource.
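(HBase-specific) For example, a salted composite row key could be built like this. This is only a rough sketch using the standard HBase client API; the table name, column family, and salting scheme are all made up:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class QueryWriter {
    private final Connection connection;

    public QueryWriter(Connection connection) {
        this.connection = connection;
    }

    // Row key: salt + resourceId + accountId + timestamp. The salt spreads a
    // hot resource across several regions, at the cost of having to scan all
    // salt buckets when you want every outstanding query for that resource.
    public void storeQuery(String resourceId, String accountId, byte[] query) throws Exception {
        int salt = Math.floorMod(accountId.hashCode(), 16); // 16 buckets, illustrative
        byte[] rowKey = Bytes.toBytes(String.format("%02d#%s#%s#%d",
                salt, resourceId, accountId, System.currentTimeMillis()));

        try (Table table = connection.getTable(TableName.valueOf("queries"))) {
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("q"), Bytes.toBytes("body"), query);
            table.put(put);
        }
    }
}
```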
One other advantage that you may not have considered is the potential to run your queries directly on the HBase or Cassandra data nodes, saving you the need to ship each query over the network again to a separate executor process. You might want to look into HBase Coprocessors or Cassandra Plugins to do something like that. Specifically, I am talking about turning this workflow:
/-> Query -> Executor -> Resource -> Results -> \
Client -> Query -> Query Storage --> Query -> Executor -> Resource -> Results -> --> Client
\-> Query -> Executor -> Resource -> Results -> /
into something like:
/-> Query -> Resource -> Results -> \
Client -> Query -> Query Storage --> Query -> Resource -> Results -> --> Client
\-> Query -> Resource -> Results -> /
This may not make sense in your use case though.
I can give you some answers with respect to Cassandra.
Cassandra writes only new data files, and only does so sequentially, never overwriting or modifying existing files, and it has an append-only write-ahead log like transactional relational databases do. Internally, Cassandra treats deletes essentially like any other write.
Cassandra is linearly scalable across many nodes and has no single point of failure. It is linearly scalable for both reads and writes. That is to say, a single cluster can support any number of concurrent reads and writes you wish to throw at it, so long as you add enough nodes to the cluster and give the cluster time to rebalance data across the new nodes. Netflix recently load-tested Cassandra on EC2 and found linear scalability, with the largest cluster they tested at 288 nodes supporting 1,000,000 writes/sec sustained for an hour.
Cassandra supports many consistency levels. When performing each read or write from Cassandra, you specify with what consistency level you want that read or write to be executed. This lets you determine, per-read and per-write, whether that read or write must be fast or must be done consistently across all nodes hosting that row.
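For illustration, per-request consistency levels look roughly like this. This is a sketch assuming the DataStax Java driver's 3.x-style API; the keyspace, table, and column names are made up:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("queries_ks")) {

            // Fast write: only one replica must acknowledge.
            SimpleStatement insert = new SimpleStatement(
                    "INSERT INTO pending_queries (resource_id, account_id, query) VALUES (?, ?, ?)",
                    "resource-1", "account-42", "query body");
            insert.setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(insert);

            // Stronger read: a majority of replicas must agree.
            SimpleStatement select = new SimpleStatement(
                    "SELECT * FROM pending_queries WHERE resource_id = ?", "resource-1");
            select.setConsistencyLevel(ConsistencyLevel.QUORUM);
            ResultSet rows = session.execute(select);
            rows.forEach(row -> System.out.println(row.getString("account_id")));
        }
    }
}
```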
Cassandra does not support multi-operation transactions.
If the Cassandra data model works well in your case, Cassandra may well be the simplest solution, at least at the operations level. Every node is configured exactly alike. There are no masters and no slaves, only equal peers. It is not necessary to set up separate load balancing, failover, heartbeats, log shipping, replication, etc.
But the only way to find out for sure is to test it out.
Related
I have one database with millions of records in it. We read the records one by one using Java and insert them into another system daily, after end of day. We have been told to make this faster.
I told them we would create multiple threads using a thread pool, and these threads would read the data in parallel and inject it into the other system, but I don't know how to stop the threads from reading the same data twice. How can we make it faster and still achieve data consistency? That is, how can we make this process faster using multithreading in Java, or is there some way other than multithreading to achieve it?
One possible solution for your task would be to take the IDs of the records in your database, split them into chunks (e.g. of size 1000 each), and call JpaRepository.findAllById(Iterable<ID>) within Runnables passed to ExecutorService.submit().
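A rough sketch of that idea (assuming Spring Data JPA; the CustomerRecord entity, the repository, and the downstream call are hypothetical):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import javax.persistence.Entity;
import javax.persistence.Id;

import org.springframework.data.jpa.repository.JpaRepository;

@Entity
class CustomerRecord {
    @Id
    Long id;
    // ... other columns
}

public class ChunkedExporter {
    private static final int CHUNK_SIZE = 1000;

    private final JpaRepository<CustomerRecord, Long> repository;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public ChunkedExporter(JpaRepository<CustomerRecord, Long> repository) {
        this.repository = repository;
    }

    public void export(List<Long> allIds) throws InterruptedException {
        // Split the IDs into fixed-size chunks; each chunk is handled by exactly
        // one task, so no two threads ever read the same record.
        for (int from = 0; from < allIds.size(); from += CHUNK_SIZE) {
            List<Long> chunk = allIds.subList(from, Math.min(from + CHUNK_SIZE, allIds.size()));
            pool.submit(() -> {
                List<CustomerRecord> records = repository.findAllById(chunk);
                records.forEach(ChunkedExporter::sendToOtherSystem);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void sendToOtherSystem(CustomerRecord record) {
        // hypothetical hand-off to the downstream system
    }
}
```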
If you don't want to do it manually then you could have a look into Spring Batch. It is designed particularly for bulk transformation of large amounts of data.
I think you should identify the slowest part in this flow and try to optimize it step by step.
In the described flow you could:
Try to reduce the number of "round trips" between the Java application (via the driver) and the database: stop reading records one by one and move to bulk reading. Namely, read, say, 2000 records at once from the DB into memory and process the whole batch (see the JDBC sketch after this list). Consider even larger numbers (like 5000), but you really should measure this; it depends on the memory of the Java application and other factors. If it causes problems, scale the batch size back down.
The data itself might not be organized correctly: when you read a bulk of data you might need to order it by some criteria, so make sure this doesn't cause a full table scan, define indices properly, etc.
If applicable, talk to your DBA, he/she might provide additional insights about data management itself: partitioning, storage related optimizations, etc.
If all this fails and reading from the DB is still a bottleneck, consider redesigning the flow (for instance, publish messages to Kafka if you have it); those can be naturally partitioned so you could scale out the whole process, but this might be beyond the scope of this question.
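To make the bulk-reading point above concrete, here is a minimal keyset-pagination sketch in plain JDBC (the connection URL, table, and column names are invented):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BulkReader {
    private static final int BATCH_SIZE = 2000;

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/appdb", "user", "password")) {
            long lastId = 0;
            while (true) {
                // Keyset pagination: one round trip fetches a whole batch,
                // and the index on the primary key avoids a full table scan.
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, payload FROM records WHERE id > ? ORDER BY id LIMIT ?")) {
                    ps.setLong(1, lastId);
                    ps.setInt(2, BATCH_SIZE);
                    ps.setFetchSize(BATCH_SIZE);
                    int read = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");
                            process(rs.getString("payload"));
                            read++;
                        }
                    }
                    if (read < BATCH_SIZE) {
                        break; // no more records
                    }
                }
            }
        }
    }

    private static void process(String payload) {
        // hypothetical hand-off to the other system
    }
}
```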
I want to use more than one app instance for my Java application. The application works with a DB: it writes and reads data. I'd like to use Hibernate with L1 caching (Session level) only. My question is: should I sync the caches between instances, or is there no need to worry about synchronizing them?
It all depends on what your application is about.
Say you are running an e-commerce shop and, in the admin panel, there is a service for managing products. Two separate users open the same product page and update it. There is nothing wrong with that (unless you have some specific business case).
Another scenario: you are tracking the inventory of products. Say you maintain a count of each product. When you add products, this count is increased, and when you sell products, it is decreased. This operation is very sensitive and requires some sort of locking. Without locking, the following scenario can happen:
Timestamp | App Instance A                     | App Instance B
T1        | Reads and finds 10 products        | Reads and finds 10 products
T2        | Removes two products and writes 8  | Does nothing
T3        | Does nothing                       | Adds two products and writes 12
Thus the database now tracks the wrong count.
To tackle these scenarios there are mainly two kinds of locking mechanism:
Optimistic locking
Pessimistic locking
To learn more about these kinds of locking, read here.
A simple way to implement optimistic locking in Hibernate is to use a version column in the database and on the application entity.
Here is a good article about entity versioning.
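A minimal sketch of such a versioned entity (plain JPA annotations; the Product fields are just illustrative):

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Product {

    @Id
    private Long id;

    private int stockCount;

    // Hibernate increments this column on every update and adds
    // "WHERE version = ?" to the UPDATE; if another instance has already
    // changed the row, the update matches nothing and Hibernate reports an
    // optimistic-lock failure instead of silently losing the write.
    @Version
    private long version;

    public int getStockCount() {
        return stockCount;
    }

    public void setStockCount(int stockCount) {
        this.stockCount = stockCount;
    }
}
```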
You can use caches like Hazelcast, Infinispan or EHCache that allow caching spanning multiple instances. These caches follow different strategies, basically distributed (via a distributed hash table, DHT) or replicating. With a distributed cache, only a subset of the data is on each instance, which leads to non-uniform access times; however, it's possible to cache a vast amount of data. With a replicating cache, all data is replicated across the instances, so you get fast access times, but modifications take longer because all instances need to be notified.
To prevent dirty reads, Hibernate stops caching an object before the write transaction commits and then starts caching it again afterwards. In the case of a replicating cache this adds at least two network requests, so write throughput can decrease quite dramatically.
There are many details that should be understood, and maybe tested, before going into operation, especially what happens when an instance is added, dies, or is unreachable for some amount of time. When I looked at the Hibernate code a few years back there was a hard-coded 30-second timeout for a cache lock, meaning: if an instance that was modifying data disappears, other instances modifying the same data would "hang" for at most 30 seconds. But if the node did not die, and it was just a connection problem, and it reappears after the timeout, you will get data inconsistencies. Within the caches you also have timers and recovery strategies for failures and connection problems that you need to understand and configure correctly for your operating environment.
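For reference, wiring Hibernate up to such a second-level cache mostly comes down to a couple of settings plus marking entities as cacheable, along these lines. This is only a sketch; the exact region-factory class depends on your Hibernate version and cache provider:

```java
import java.util.Properties;

import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public class SecondLevelCacheSetup {
    public static SessionFactory buildSessionFactory() {
        Properties props = new Properties();
        // Turn on the shared second-level cache and pick a region factory;
        // the factory class shown here assumes the hibernate-ehcache module.
        props.setProperty("hibernate.cache.use_second_level_cache", "true");
        props.setProperty("hibernate.cache.region.factory_class",
                "org.hibernate.cache.ehcache.EhCacheRegionFactory");

        return new Configuration()
                .addProperties(props)
                // the entity you want cached; it would also carry @Cacheable / @Cache
                .addAnnotatedClass(Product.class)
                .buildSessionFactory();
    }
}
```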
We are using Redis 2.8.17 as JobQueues.
We are using RPUSH and BRPOPLPUSH for making a reliable queue.
As per our current design, multiple app servers push (RPUSH) jobs to a single job queue. Since the BRPOPLPUSH operation is atomic in Redis, the jobs are later popped (BRPOPLPUSH) and processed by any of the servers' consumers, roughly as in the sketch below.
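For reference, a minimal version of that push/pop pattern with the Jedis client (key names and payload are illustrative):

```java
import redis.clients.jedis.Jedis;

public class ReliableQueue {
    private static final String PENDING = "jobs:pending";
    private static final String PROCESSING = "jobs:processing";

    public static void main(String[] args) {
        try (Jedis producer = new Jedis("localhost", 6379);
             Jedis consumer = new Jedis("localhost", 6379)) {

            // App servers enqueue jobs.
            producer.rpush(PENDING, "{\"jobId\": 1}");

            // A consumer atomically moves a job to a processing list
            // (blocking up to 5 seconds), works on it, then removes it.
            String job = consumer.brpoplpush(PENDING, PROCESSING, 5);
            if (job != null) {
                handle(job);
                consumer.lrem(PROCESSING, 1, job); // acknowledge: drop from processing list
            }
        }
    }

    private static void handle(String job) {
        // hypothetical job execution
    }
}
```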
Since the app servers are capable of scaling out, I am a bit concerned that Redis might become a bottleneck in the future.
I learnt the following from the documentation on Redis partitioning:
"it is not possible to shard a dataset with a single huge key like a very big sorted set"
I wonder whether pre-sharding the queues across app servers is the only option for scaling out.
Is there anything that Redis Cluster can do for the above design?
The main thing you need to consider is whether you'll need to shard at all. The entire StackExchange network (not just StackOverflow -- all the network) runs off of 2 Redis servers (one of which I'm pretty sure only exists for redundancy), which it uses very aggressively. Take a look at http://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-overflow/
Redis is absurdly fast (and remarkably space-efficient), with only one caveat: deleting an entire list/set/sorted set/hash is O(n), where n is the number of elements it contains. As long as you don't do that (operations like RPOP, BRPOP, BRPOPLPUSH, etc. don't count -- those are constant time), you should be golden.
TLDR: Unless you're planning to go bigger than StackOverflow, you won't need to shard, especially for a simple job queue.
I have a database table with 3 million records. A Java thread reads 10,000 records from the table and processes them; after processing, it jumps to the next 10,000, and so on. To speed things up, I have 25 threads doing the same task (reading + processing), and I have 4 physical servers running the same Java program. So effectively I have 100 threads doing the same work (reading + processing).
The strategy I have used is a SQL procedure which grabs the next 10,000 records and marks them as being processed by a particular thread. However, I have noticed that the threads seem to spend some time waiting to invoke the procedure and get a response back. What other strategy can I use to speed up this record-selection process?
My database server is MySQL and the programming language is Java.
The idiomatic way of handling such a scenario is the producer-consumer design pattern, and the idiomatic way of implementing it in Java land is with JMS.
Essentially you need one master server reading records and pushing them to a JMS queue. Then you'll have an arbitrary number of consumers reading from that queue and competing with each other. It is up to you how you want to implement this in detail: do you want to send a message with the whole record or only the ID? All 10,000 records in one message or one record per message?
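A bare-bones producer side might look like this (a sketch assuming ActiveMQ as the JMS provider; the broker URL, queue name, and ID-per-message choice are just illustrative):

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class RecordProducer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("records.to.process");
            MessageProducer producer = session.createProducer(queue);

            // The single master reader pushes one message per record ID;
            // any number of competing consumers then drain the queue.
            for (long id = 1; id <= 10_000; id++) {
                TextMessage message = session.createTextMessage(Long.toString(id));
                producer.send(message);
            }
        } finally {
            connection.close();
        }
    }
}
```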
Another approach is MapReduce; check out Hadoop. But the learning curve is a bit steeper.
Sounds like a job for Hadoop to me.
I would suspect that you are mostly database-IO-bound with this scheme. If you are trying to increase the performance of your system, I would suggest partitioning your data across multiple database servers if you can. MySQL has some partitioning modes that I have no experience with. If you do the partitioning yourself, it can add a lot of complexity to the database schema, and you'd have to add some sort of routing layer using a hash mechanism to divide your records across the partitions. But I suspect you'd get a significant speed increase and your threads would not be waiting nearly as much.
If you cannot partition your data, then moving your database to an SSD would be a huge win, I suspect; anything to increase the IO rates on those partitions. Stay away from RAID5 because of its inherent performance issues. If you need a reliable file system, then mirroring or RAID10 would have much better performance, with RAID50 also being an option for a large partition.
Lastly, you might find that your application performs better with fewer threads if you are thrashing your database IO bus. This depends on a number of factors, including concurrent queries, database layout, etc. You might try dialing down the per-client thread count to see if that makes a difference. The effect may be minimal, however.
I am trying to develop a piece of code in Java that will be able to process large amounts of data fetched via a JDBC driver from a SQL database and then persisted back to the DB.
I thought of creating a manager containing one reader thread, one writer thread, and a customizable number of worker threads processing data. The reader thread would read data into DTOs and pass them to a queue labeled 'ready for processing'. Worker threads would process the DTOs and put the processed objects onto another queue labeled 'ready for persistence'. The writer thread would persist the data back to the DB. Is such an approach optimal? Or perhaps I should allow more readers for fetching data? Are there any ready-made libraries in Java for doing this sort of thing that I am not aware of?
Whether or not your proposed approach is optimal depends crucially on how expensive it is to process the data in relation to how expensive it is to get it from the DB and to write the results back into the DB. If the processing is relatively expensive, this may work well; if it isn't, you may be introducing a fair amount of complexity for little benefit (you still get pipeline parallelism which may or may not be significant to the overall throughput.)
The only way to be sure is to benchmark the three stages separately, and then decide on the optimal design.
Provided the multithreaded approach is the way to go, your design with two queues sounds reasonable. One additional thing you may want to consider is having a limit on the size of each queue.
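For instance, bounded queues give you backpressure more or less for free (a sketch; the DTO type and capacities are arbitrary):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineQueues {
    // Bounded queues: if the workers or the writer fall behind, put() blocks,
    // which automatically slows the reader down instead of exhausting memory.
    private final BlockingQueue<RowDto> readyForProcessing = new ArrayBlockingQueue<>(1_000);
    private final BlockingQueue<RowDto> readyForPersistence = new ArrayBlockingQueue<>(1_000);

    // Reader thread:  readyForProcessing.put(dto)
    // Worker threads: dto = readyForProcessing.take(); readyForPersistence.put(process(dto))
    // Writer thread:  readyForPersistence.take() -> batched INSERT/UPDATE

    static class RowDto {
        long id;
        String payload;
    }
}
```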
I hear echoes from my past and I'd like to offer a different approach just in case you are about to repeat my mistake. It may or may not be applicable to your situation.
You wrote that you need to fetch a large amount of data out of the database, and then persist back to the database.
Would it be possible to temporarily insert any external data you need to work with into the database, and perform all the processing inside the database? This would offer the following advantages:
It eliminates the need to extract large amounts of data
It eliminates the need to persist large amounts of data
It enables set-based processing (which outperforms procedural)
If your database supports it, you can make use of parallel execution
It gives you a framework (Tables and SQL) to make reports on any errors you encounter during the process.
To give an example: a long time ago I implemented a (Java) program whose purpose was to load purchases, payments and related customer data from files into a central database. At the time (and I regret it deeply), I designed the load to process the transactions one by one and, for each piece of data, perform several database lookups (SQL) and finally a number of inserts into the appropriate tables. Naturally this did not scale once the volume increased.
Then I made another mistake. I decided that the database was the problem (because I had heard that SELECT is slow), so I pulled all the data out of the database and did ALL the processing in Java, finally persisting everything back to the database. I implemented all kinds of layers with callback mechanisms to make the load process easy to extend, but I just couldn't get it to perform well.
Looking in the rear-view mirror, what I should have done was insert the (laughably small amount of) 100,000 rows temporarily into a table and process them from there. What took nearly half a day to process would have taken a few minutes at most if I had played to the strengths of all the technologies at my disposal.
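In code, that boils down to a batched insert into a staging table followed by a single set-based statement, something like this (a sketch; the connection URL, tables, and columns are invented):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.List;

public class SetBasedLoad {
    public static void load(List<String[]> fileRows) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/appdb", "user", "password")) {
            conn.setAutoCommit(false);

            // 1. Bulk-insert the raw file rows into a staging table.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO staging_purchases (customer_ref, amount) VALUES (?, ?)")) {
                for (String[] row : fileRows) {
                    ps.setString(1, row[0]);
                    ps.setBigDecimal(2, new java.math.BigDecimal(row[1]));
                    ps.addBatch();
                }
                ps.executeBatch();
            }

            // 2. Do the lookups and inserts as one set-based statement inside
            //    the DB instead of one SELECT + INSERT per record from Java.
            try (Statement stmt = conn.createStatement()) {
                stmt.executeUpdate(
                    "INSERT INTO purchases (customer_id, amount) " +
                    "SELECT c.id, s.amount FROM staging_purchases s " +
                    "JOIN customers c ON c.external_ref = s.customer_ref");
            }

            conn.commit();
        }
    }
}
```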
An alternative to using an explicit queue is to have an ExecutorService and add tasks to it. This way you let Java manage the pool of threads.
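Roughly like this (a sketch; the fetch, process, and persist methods are placeholders):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorPipeline {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(8);

        // The reader submits one task per fetched row; the executor's own
        // internal queue replaces the explicit "ready for processing" queue.
        for (String dto : fetchFromDatabase()) {
            workers.submit(() -> persist(process(dto)));
        }

        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }

    private static List<String> fetchFromDatabase() { return List.of("row1", "row2"); }
    private static String process(String dto) { return dto.toUpperCase(); }
    private static void persist(String result) { System.out.println(result); }
}
```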
You're describing functionality similar to what Spring Batch provides. I'd check it out if I were you. I've had great luck using it for operations similar to what you're describing. Parallel and multithreaded processing, several different database readers/writers, and a whole bunch of other stuff are provided.
Use Spring Batch! That is exactly what you need.