We are writing a high-load order processing engine. Every cluster node processes some set of contracts and writes an action log to a local file. This file should be distributed among some other nodes (for fault tolerance). If a node fails, there should be a way to restore its state on one of the replication nodes as fast as possible. Currently we use Cassandra, but there is a problem with the partitioner: there is no way to specify which nodes should be used for a specific table.
So we need to replicate the file. Is there a solution?
Edit: peak load will be about 200k records per second.
With respect to your Cassandra issue: while you can't have a different replication layout per table/columnfamily, you can have a different layout per keyspace. This includes a case like yours, where it sounds like you want some set of nodes S1 to be wholly responsible for some parts of the data, and some other set S2 to be responsible for another part.
If you represent S1 and S2 as different datacenters to Cassandra (via PropertyFileSnitch or whatever), then you can configure, say, keyspace K1 to have X copies on S1 and none on S2, and vice versa for keyspace K2.
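For example (a sketch assuming the DataStax Java driver; the keyspace names, contact point, and replication counts are placeholders, and the datacenter names in the replication map must match whatever your snitch reports for S1 and S2):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class KeyspaceLayout {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("192.168.5.101").build();
        Session session = cluster.connect();

        // K1 is replicated only within the S1 "datacenter", K2 only within S2.
        session.execute("CREATE KEYSPACE IF NOT EXISTS k1 WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'S1': 3}");
        session.execute("CREATE KEYSPACE IF NOT EXISTS k2 WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'S2': 3}");

        cluster.close();
    }
}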
From what I understand, there are multiple copies of data in RDDs in the cluster, so that in case of failure of a node, the program can recover. However, in cases where chance of failure is negligible, it would be costly memory-wise to have multiple copies of data in the RDDs. So, my question is, is there a parameter in Spark, which can be used to reduce the replication factor of the RDDs?
First, note Spark does not automatically cache all your RDDs, simply because applications may create many RDDs, and not all of them are to be reused. You have to call .persist() or .cache() on them.
You can set the storage level with which you want to persist an RDD via
myRDD.persist(StorageLevel.MEMORY_AND_DISK); .cache() is shorthand for .persist(StorageLevel.MEMORY_ONLY).
The default storage level for persist is indeed StorageLevel.MEMORY_ONLY for an RDD in Java or Scala – but usually differs if you are creating a DStream (refer to your DStream constructor API doc). If you're using Python, it's StorageLevel.MEMORY_ONLY_SER.
The doc details a number of storage levels and what they mean, but fundamentally each one is just an instance of the StorageLevel class that Spark is pointed to. You can thus define your own, with a replication factor of up to 40.
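For instance, using the Java API (a sketch; it assumes the org.apache.spark.api.java.StorageLevels helper, whose create(...) factory signature may vary slightly between Spark versions):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.storage.StorageLevel;

import java.util.Arrays;

public class PersistExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("persist-example").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A single in-memory copy per partition, exactly what .cache() would give you.
        JavaRDD<Integer> singleCopy = sc.parallelize(Arrays.asList(1, 2, 3, 4));
        singleCopy.persist(StorageLevels.MEMORY_ONLY);

        // A custom level: memory + disk, deserialized, replication factor 2.
        // StorageLevels.create(useDisk, useMemory, useOffHeap, deserialized, replication)
        StorageLevel twoCopies = StorageLevels.create(true, true, false, true, 2);
        JavaRDD<Integer> replicated = sc.parallelize(Arrays.asList(5, 6, 7, 8));
        replicated.persist(twoCopies);

        sc.stop();
    }
}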
Note that of the various predefined storage levels, some keep a single copy of the RDD. In fact, that's true of all of those whose names aren't suffixed with _2 (except NONE):
DISK_ONLY
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
OFF_HEAP
That's one copy per medium they employ, of course; if you want a single copy overall, you have to choose a single-medium storage level.
As huitseeker said, unless you specifically ask Spark to persist an RDD and specify a StorageLevel that uses replication, it won't keep multiple copies of the RDD's partitions.
What Spark does do is keep a lineage of how a specific piece of data was calculated, so that when/if a node fails it only repeats the processing needed to recompute the lost RDD partitions. In my experience this mostly works, though on occasion it is faster to restart the job than to let it recover.
Say I have 2 nodes with IPs 192.168.5.101 and 192.168.5.102. I'd like to launch the first one with some task initializing a distributed map and, a couple of minutes later, the second one (on those two hosts). How should I configure them to be able to see one another and to share that Map?
Update: I had a glance at the Hazelcast docs and managed to run two instances with the following code:
Config config = new Config();
config.getNetworkConfig().getJoin().getMulticastConfig().setEnabled(false);
config.getNetworkConfig().getJoin().getTcpIpConfig()
        .addMember("192.168.4.101")
        .addMember("192.168.4.102")
        .setRequiredMember("192.168.4.101")
        .setEnabled(true);
config.getNetworkConfig().getInterfaces().setEnabled(true).addInterface("192.168.4.*");
And somewhere further:
HazelcastInstance hazelcast = Hazelcast.newHazelcastInstance(config);
MultiMap<Long, Long> idToPids = hazelcast.getMultiMap("multiMapName");
IMap<Long, EntityDesc> idToDesc = hazelcast.getMap("mapName");
All that followed by some job-performing code.
I run this class on two different nodes; they successfully see each other and communicate (even share the resources, as far as I can tell).
But the problem is that the work of the two nodes seems a lot slower than in the case of a single local node. What am I doing wrong?
One of the reasons for a slowdown is that the data used in the tasks (I don't know anything about them) could be stored on a different member than the one the task is running on. With a single-node cluster you don't have this problem, but with a multi-node cluster the map will be partitioned, so every member will only store a subset of the data.
Also, with a single node there are no backups, and therefore it is a lot faster than a true clustered setup (i.e. more than one member).
These are some of the obvious reasons why things could slow down, but without additional information it is very hard to guess the cause.
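If remote data access is the culprit, one option is to send the work to the member that owns the key instead of pulling the data to the caller. A sketch, assuming the Hazelcast 3.x IExecutorService API (the map names, the counting task, and the executor name are only illustrative); it also shows how backups could be disabled for a throwaway benchmark:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IExecutorService;

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class LocalityExample {

    // Runs on the member that owns the key, so it reads data held locally by that member.
    static class CountValuesForKey implements Callable<Integer>, Serializable, HazelcastInstanceAware {
        private final Long key;
        private transient HazelcastInstance hz;

        CountValuesForKey(Long key) { this.key = key; }

        @Override
        public void setHazelcastInstance(HazelcastInstance hz) { this.hz = hz; }

        @Override
        public Integer call() {
            return hz.getMultiMap("multiMapName").get(key).size();
        }
    }

    public static void main(String[] args) throws Exception {
        Config config = new Config();
        // For a throwaway benchmark, backups on a map can be disabled entirely.
        config.getMapConfig("mapName").setBackupCount(0);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        IExecutorService executor = hz.getExecutorService("workers");

        Long someId = 42L;
        Future<Integer> result = executor.submitToKeyOwner(new CountValuesForKey(someId), someId);
        System.out.println("values for " + someId + ": " + result.get());
    }
}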
I am planning to use Couchbase as the document store in my web application. I am looking at the Couchbase client for Java, and you need to create a separate CouchbaseClient for each bucket if I treat a Couchbase bucket as I would treat a generic entity. This is a bit of overkill for the system (though I can reuse the executor service to minimize object creation and thread management overhead).
So
Is there a way to reuse an existing CouchbaseClient for multiple buckets (not only sharing the ExecutorService)?
Wouldn't it be better, from a performance point of view, to use a single bucket, distinguish objects based on their keys, and rely on views for querying?
You should treat a Couchbase bucket like a database. One bucket per application should be enough in most cases, but I prefer to have 2 buckets: one for common data and one for "temporary" or "fast changing" data (caches, user sessions, etc.). For the latter you can even use just a memcached bucket.
And answering your 2 questions:
I don't know of such a way and have never seen anyone even try to do that. But remember that the client should be a singleton, so if you have 2 buckets for your application you'll only ever have 2 clients (which is definitely not overkill).
As I said before, treat a bucket like a database. You don't even need to create a test database: Couchbase has built-in, separate dev and production views, and you can easily test your app on production data with dev views.
Regarding using a bucket as a table/database, this post explains it pretty well:
http://blog.couchbase.com/10-things-developers-should-know-about-couchbase
Start with everything in one bucket
A bucket is equivalent to a database. You store objects with different characteristics or attributes in the same bucket, so if you are moving from an RDBMS, you should store records from multiple tables in a single bucket.
Remember to create a “type” attribute that will help you differentiate the various objects stored in the bucket and create indexes on them. It is recommended to start with one bucket and grow to more buckets when necessary.
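A minimal sketch of that pattern with the legacy couchbase-client 1.x API (the bucket name, keys, and JSON layout are illustrative):

import com.couchbase.client.CouchbaseClient;
import java.net.URI;
import java.util.Arrays;

public class SingleBucketExample {
    public static void main(String[] args) throws Exception {
        CouchbaseClient client = new CouchbaseClient(
                Arrays.asList(URI.create("http://127.0.0.1:8091/pools")), "app", "");

        // Everything lives in one bucket; a "type" attribute tells the documents apart.
        client.set("user::42", 0, "{\"type\":\"user\",\"name\":\"Alice\"}").get();
        client.set("order::1001", 0, "{\"type\":\"order\",\"userId\":42,\"total\":99.5}").get();

        // A view can then emit or filter on doc.type to query each "table" separately.
        System.out.println(client.get("order::1001"));

        client.shutdown();
    }
}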
I have a standalone Java application which operates on a large number of elements read from an input file, each element being associated with an identifier. For each element, I do the following (among other things, of course):
Check that the element has not already been processed, using its identifier.
Map the element to a grid using some statistical method; each cell of the grid is responsible for tracking the unique elements that were assigned to it, along with some properties calculated for each element.
The number of elements might be quite large (several million), as might the grid itself. Each cell is created on the fly as soon as an element is assigned to it, to avoid storing empty cells.
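To make the memory footprint concrete, here is roughly what such an in-memory structure looks like (just a sketch of the approach described above; the class and field names are made up):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GridAccumulator {
    // Identifiers already processed: grows with every unique element.
    private final Set<Long> seenIds = new HashSet<>();
    // Grid cells keyed by cell index, created lazily on first assignment.
    private final Map<Long, Cell> cells = new HashMap<>();

    static class Cell {
        long count;
        double sum;  // example of a per-cell aggregate
    }

    void process(long id, double value, long cellIndex) {
        if (!seenIds.add(id)) {
            return;  // already processed
        }
        Cell cell = cells.computeIfAbsent(cellIndex, k -> new Cell());
        cell.count++;
        cell.sum += value;
    }
}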
Question is: with a large amount of data, memory issues naturally arise. What would be the best strategy to process large amounts of data while avoiding memory issues?
I have a couple of things in mind, but I'd like to know if anyone has already had this kind of problem and, if so, can share their experience:
Embedded lightweight SQL database
Caching solutions such as Ehcache or Apache JCS
NoSQL key-value stores such as Cassandra
Thoughts?
I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How do I do multi-threading on different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce-style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com, though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
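For the prototype, the middle tier would just invoke the procedure instead of pulling rows; a sketch with plain JDBC (the connection details, procedure name, and parameter are hypothetical):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class StoredProcCall {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://db-host:3306/mydb", "user", "password");
             // Hypothetical procedure that does the calculation server-side
             // and writes its results back, so no rows cross the wire.
             CallableStatement stmt = conn.prepareCall("{call recalc_orders(?)}")) {
            stmt.setInt(1, 2014);  // e.g. a batch id parameter
            stmt.execute();
        }
    }
}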
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by each node. Write the Java program so that it first fetches the last row processed by other nodes and then adds an entry to the same table telling the other nodes which range of rows it is going to process (you can decide this number). In our case, let's assume each node can process 1000 rows at a time. Node 1 fetches table B and finds it empty, so it inserts a row ('Node1', 1000), meaning it is processing rows of A with primary key <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along, finds that 1000 primary keys are already claimed by some other node, and inserts a row ('Node2', 2000) informing others that it is processing rows between 1001 and 2000. Please note that access to table B should be synchronized, i.e. only one node can work on it at a time.
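A rough sketch of the claim step in JDBC (the layout of table B, its column names, and the use of MySQL's named GET_LOCK/RELEASE_LOCK functions to serialize access are all assumptions; any other mutual-exclusion mechanism would do):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class RangeClaimer {
    private static final int BATCH = 1000;

    // Claims the next range of table A's primary keys in table B and returns its upper bound.
    static long claimNextRange(Connection conn, String nodeName) throws Exception {
        try (Statement s = conn.createStatement()) {
            // Serialize access to table B with a MySQL advisory lock.
            try (ResultSet lock = s.executeQuery("SELECT GET_LOCK('claim_table_b', 10)")) {
                lock.next();
                if (lock.getInt(1) != 1) {
                    throw new IllegalStateException("could not acquire claim lock");
                }
            }
            long upTo;
            try (ResultSet rs = s.executeQuery("SELECT COALESCE(MAX(last_key), 0) FROM b")) {
                rs.next();
                upTo = rs.getLong(1) + BATCH;
            }
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO b (node, last_key) VALUES (?, ?)")) {
                ins.setString(1, nodeName);
                ins.setLong(2, upTo);
                ins.executeUpdate();
            }
            s.execute("SELECT RELEASE_LOCK('claim_table_b')");
            return upTo;  // this node now processes rows (upTo - BATCH, upTo] of table A
        }
    }
}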
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits, as well as reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to transparently offload batch processing from MySQL back to the cluster.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance, and changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best options are to do the update in pure SQL if that is at all possible, or to use a stored procedure so that all processing takes place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on an application key.