I will be hosting my Cassandra database on Google Cloud. Instances are priced linearly: 1 CPU with 2 GB RAM is $1, 2 CPUs with 4 GB is $2, 4 CPUs with 8 GB is $4, and so on.
I am deciding on the size of my instances and am not sure what the standard is. I was thinking of using more, smaller instances (e.g. 2 CPU, 4 GB) as opposed to fewer, larger ones (e.g. 8 CPU, 64 GB). My thought process is that with more instances, each node carries less of the overall data, so a node failure has a smaller impact. As well, the OS of each smaller instance would have less overhead because it would accept fewer connections.
These are pros, but here are some cons I can think of:
1) Each instance will be less utilized
2) Cassandra + JVM overhead, multiplied across so many instances, can add up to a lot of overhead.
3) I will be using local SSDs as opposed to persistent SSDs, which are much more expensive, meaning each instance will need its own local SSD, which raises costs.
These are some reasons I can think of. Are there any other pros/cons between choosing more smaller instances vs. fewer larger ones for a Cassandra database (or for nodes in general)? Are there any best practices associated with choosing Cassandra server sizes?
PS: I added the 'Java' tag because Cassandra is built in Java and runs on the JVM, and I would like to see if the JVM has any pros/cons here.
I think you've hit some of the tradeoff points, but here are a few other things:
As the amount of data stored on a single node increases, the cost of bootstrapping (adding new nodes) increases. For instance, you'll get reasonable bootstrapping times storing 100 GB per node, but the process will take eons with 10 TB per node.
SSD usage makes this less important, but consider using separate physical disks for your commitlog and data.
Configurations with fewer than 4 cores or less than 8 GB of memory are usually not recommended, but your mileage may vary.
In every example I see of the -Xmx flag in the Java documentation (and in examples on the web), it is always a power of 2. For example '-Xmx512m', '-Xmx1024m', etc.
Is this a requirement, or is it a good idea for some reason?
I understand that memory in general comes in powers of 2; this question is specifically about the Java memory flags.
It keeps things simple, but it is more for your benefit than anything else.
There is no particular reason to pick a power of 2 or a multiple of 50 MB (also common), e.g. -Xmx400m or -Xmx750m.
Note: the JVM doesn't follow this strictly. It will use the value to calculate the sizes of the different regions, which, if you add them up, tend to come to slightly less than the number you provide. In my experience the heap is 1% - 2% less, but if you consider all the other memory regions the JVM uses, this doesn't make much difference.
Note: memory sizes for hardware are typically a power of two (on some PCs it was 3x a power of two). This might have got people into the habit of thinking of memory sizes as powers of two.
BTW: AFAIK, in most JVMs the actual size of each region is a multiple of the page size i.e. 4 KB.
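A quick way to see this for yourself is to print what the JVM actually reports for a given -Xmx (exactly how much gets shaved off depends on the GC and JVM version):

// Run with, e.g.: java -Xmx1024m MaxMemoryCheck
// The reported maximum heap is usually a little less than 1024 MB; the difference
// varies with the GC and JVM version (one survivor space not being counted is a
// common explanation).
public class MaxMemoryCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Reported max heap: %.1f MB%n", maxBytes / (1024.0 * 1024.0));
    }
}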
I'm working on a Java program to create a really large Neo4j database. I use the BatchInserter and Executors.newFixedThreadPool to speed things up. My Win2012R2 server has 2 CPUs (2x6 cores + 2x6 hyper-threads) and 256 GB of RAM in a NUMA architecture. My problem is that my importer only uses 1 CPU (NUMA node).
Is it possible to use both NUMA nodes with only one Java process?
Java options: -XX:+UseNUMA -Xmx64g -Xms64g
It isn't clear how much memory is assigned to each node - is it 256GB or 128GB? Either way, as I understand it, setting a max-heap size less than the amount of memory assigned to the node will usually mean the application stays affined to a single node. This is true under Windows, Solaris and Linux, as far as I'm aware.
Even if you allocate a JVM max heap size greater than the memory assigned to a node, if your heap doesn't grow beyond that size, the process won't spill over, because the JVM object allocator will always try to create a new object in the same memory pool as the creating thread - and that includes new thread objects.
The primary design goal of the NUMA architecture is to enable different processes to operate on different CPUs with each CPU having localised memory access, rather than having all CPUs contend for the same global shared memory. Having the same process running across multiple nodes is not necessarily that efficient, unless you can arrange for a particular thread to always use the local memory associated with a specific node (thread affinity). Otherwise, remote memory access will slow you down.
I suspect that to use more than one node in your example you will need to either assign different tasks to different nodes, or parallelise the same task across multiple nodes. In the latter case you'll need to ensure that each node has a copy of the same data in local memory. There are libraries available to manage thread affinity from your Java code.
https://github.com/peter-lawrey/Java-Thread-Affinity
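For illustration, a hypothetical sketch of how that library could be used to pin worker threads, so that each thread (and, with first-touch allocation, its working set) stays on one CPU; the class and method names are taken from the library's README and may differ between versions:

import net.openhft.affinity.AffinityLock;

// Each worker grabs an affinity lock so it keeps running on a single CPU.
public class PinnedWorkers {
    public static void main(String[] args) throws InterruptedException {
        Runnable importTask = () -> {
            AffinityLock lock = AffinityLock.acquireLock();   // pin this thread to a free CPU
            try {
                // ... run this thread's share of the batch-insert work ...
            } finally {
                lock.release();
            }
        };

        Thread a = new Thread(importTask, "importer-1");
        Thread b = new Thread(importTask, "importer-2");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}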
The BatchInserter is single-threaded. You should use the import tool instead. See http://neo4j.com/docs/stable/import-tool.html
I have a large amount of data that I'm currently storing in an AtomicReferenceArray<X>, and processing from a large number of threads concurrently.
Each element is quite small and I've just got to the point where I'm going to have more than Integer.MAX_VALUE entries. Unfortunately, List and arrays in Java are limited to Integer.MAX_VALUE (or just less) values. Now I have enough memory to keep a larger structure in memory - the machine has about 250 GB of memory and runs a 64-bit JVM.
Is there a replacement for AtomicReferenceArray<X> that is indexed by longs? (Otherwise I'm going to have to create my own wrapper that stores several smaller AtomicReferenceArray and maps long accesses to int accesses in the smaller ones.)
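For reference, a minimal sketch of the wrapper I'd otherwise have to write (class name and chunk size are just placeholders):

import java.util.concurrent.atomic.AtomicReferenceArray;

// A long-indexed view backed by several AtomicReferenceArrays of at most 2^30 elements each.
final class LongAtomicReferenceArray<X> {
    private static final int CHUNK_BITS = 30;
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final AtomicReferenceArray<X>[] chunks;

    @SuppressWarnings("unchecked")
    LongAtomicReferenceArray(long length) {
        int chunkCount = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new AtomicReferenceArray[chunkCount];
        long remaining = length;
        for (int i = 0; i < chunkCount; i++) {
            int size = (int) Math.min(remaining, CHUNK_SIZE);
            chunks[i] = new AtomicReferenceArray<>(size);
            remaining -= size;
        }
    }

    X get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)].get((int) (index & CHUNK_MASK));
    }

    void set(long index, X value) {
        chunks[(int) (index >>> CHUNK_BITS)].set((int) (index & CHUNK_MASK), value);
    }

    boolean compareAndSet(long index, X expect, X update) {
        return chunks[(int) (index >>> CHUNK_BITS)]
                .compareAndSet((int) (index & CHUNK_MASK), expect, update);
    }
}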
Sounds like it is time to use native memory. Having 4+ billion objects is going to cause some dramatic GC pause times. However if you use native memory you can do this with almost no impact on the heap. You can also use memory mapped files to support faster restarts and sharing the data between JVMs.
Not sure what your specific needs are, but there are a number of open source data structures which do this, such as HugeArray, Chronicle Queue and Chronicle Map. You can create an array which is 1 TB in size but uses almost no heap and has no GC impact.
BTW: for each object you create, there is an 8-byte reference and a 16-byte header. By using native memory you can save 24 bytes per object, e.g. 4 bn * 24 bytes is 96 GB of memory.
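As a rough illustration of the memory-mapped-file idea, here is a sketch that assumes the data can be stored as fixed-size primitive records (longs) rather than object references; class name, file name and chunk size are illustrative, and libraries like Chronicle handle all of this far more robustly:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// An off-heap long array backed by a memory-mapped file. A single MappedByteBuffer
// is limited to 2 GB, so the file is mapped in 1 GB chunks.
final class MappedLongArray implements AutoCloseable {
    private static final long CHUNK_BYTES = 1L << 30;   // 1 GB per mapping, a multiple of 8

    private final RandomAccessFile file;
    private final MappedByteBuffer[] chunks;

    MappedLongArray(String path, long length) throws Exception {
        long totalBytes = length * Long.BYTES;
        file = new RandomAccessFile(path, "rw");
        FileChannel channel = file.getChannel();
        int chunkCount = (int) ((totalBytes + CHUNK_BYTES - 1) / CHUNK_BYTES);
        chunks = new MappedByteBuffer[chunkCount];
        for (int i = 0; i < chunkCount; i++) {
            long offset = i * CHUNK_BYTES;
            long size = Math.min(CHUNK_BYTES, totalBytes - offset);
            chunks[i] = channel.map(FileChannel.MapMode.READ_WRITE, offset, size);
        }
    }

    long get(long index) {
        long byteIndex = index * Long.BYTES;   // a long never straddles a chunk boundary
        return chunks[(int) (byteIndex / CHUNK_BYTES)].getLong((int) (byteIndex % CHUNK_BYTES));
    }

    void set(long index, long value) {
        long byteIndex = index * Long.BYTES;
        chunks[(int) (byteIndex / CHUNK_BYTES)].putLong((int) (byteIndex % CHUNK_BYTES), value);
    }

    @Override
    public void close() throws Exception {
        file.close();   // unmapping the buffers is left to the GC in this sketch
    }
}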
I'm trying to improve query performance. It takes an average of about 3 seconds for simple queries which don't even touch a nested document, and it's sometimes longer.
curl "http://searchbox:9200/global/user/_search?n=0&sort=influence:asc&q=user.name:Bill%20Smith"
Even without the sort it takes seconds. Here are the details of the cluster:
1.4TB index size.
210m documents that aren't nested (about 10 KB each)
500m documents in total. (nested documents are small: 2-5 fields).
About 128 segments per node.
3 nodes, m2.4xlarge (-Xmx set to 40g, machine memory is 60g)
3 shards.
Index is on amazon EBS volumes.
Replication 0 (have tried replication 2 with only a little improvement)
I don't see any noticeable spikes in CPU/memory etc. Any ideas how this could be improved?
Garry's points about heap space are true, but it's probably not heap space that's the issue here.
With your current configuration, you'll have less than 60GB of page cache available, for a 1.5 TB index. With less than 4.2% of your index in page cache, there's a high probability you'll be needing to hit disk for most of your searches.
You probably want to add more memory to your cluster, and you'll want to think carefully about the number of shards as well. Just sticking to the default can cause skewed distribution. If you had five shards in this case, you'd have two machines with 40% of the data each, and a third with just 20%. In either case, you'll always be waiting for the slowest machine or disk when doing distributed searches. This article on Elasticsearch in Production goes a bit more in depth on determining the right amount of memory.
For this exact search example, you can probably use filters, though. You're sorting, thus ignoring the score calculated by the query. With a filter, it'll be cached after the first run, and subsequent searches will be quick.
Ok, a few things here:
Decrease your heap size: you have over 32 GB of heap dedicated to each Elasticsearch instance on each machine. Java doesn't use compressed pointers above 32 GB. Drop your nodes to only 32 GB and, if you need to, spin up another instance.
If spinning up another instance isn't an option and 32 GB on 3 nodes isn't enough to run ES, then you'll have to bump your heap memory to somewhere over 48 GB!
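If you want to verify the 32 GB cutoff on your own JVM, you can check whether compressed oops are in use at a given heap size. This uses the HotSpot-specific diagnostic MXBean, so it only works on HotSpot-based JVMs:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Run with, e.g.: java -Xmx31g CompressedOopsCheck   (typically prints true)
//            and: java -Xmx40g CompressedOopsCheck   (prints false)
// The exact threshold is a little under 32 GB and depends on the JVM version.
public class CompressedOopsCheck {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println("UseCompressedOops = "
                + bean.getVMOption("UseCompressedOops").getValue());
    }
}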
I would probably stick with the default settings for shards and replicas: 5 shards, 1 replica. However, you can tweak the shard settings to suit. What I would do is reindex the data into several indices under several different conditions: the first index would have only 1 shard, the second index would have 2 shards, and so on all the way up to 10 shards. Query each index and see which performs best. If the 10-shard index is the best performing one, keep increasing the shard count until you get worse performance; then you've hit your shard limit.
One thing to think about though, sharding might increase search performance but it also has a massive effect on index time. The more shards the longer it takes to index a document...
You also have quite a bit of data stored, maybe you should look at Custom Routing too.
In the in-memory computing model, the running-time calculations that need to be done can be done abstractly, by considering the data structure.
However, there aren't a lot of docs on high-performance disk I/O algorithms. Thus I ask the following set of questions:
1) How can we estimate the running time of disk I/O operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
2) And more specifically, what is the difference in performance when accessing a specific index in a file? Is this a constant-time operation? Or does it depend on how "far down" the index is?
3) Finally... how does the JVM optimize access of indexed portions of a file?
4) And... as far as resources go, in general: are there any good idioms or libraries for on-disk data structure implementations?
1) How can we estimate the running time of disk I/O operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
In chapter 6 of Computer Systems: A Programmer's Perspective they give a pretty practical mathematical model for how long it takes to read some data from a typical magnetic disk.
To quote the last page in the linked pdf:
Putting it all together, the total estimated access time is
T_access = T_avg_seek + T_avg_rotation + T_avg_transfer
         = 9 ms + 4 ms + 0.02 ms
         = 13.02 ms
This example illustrates some important points:
• The time to access the 512 bytes in a disk sector is dominated by the seek time and the rotational latency. Accessing the first byte in the sector takes a long time, but the remaining bytes are essentially free.
• Since the seek time and rotational latency are roughly the same, twice the seek time is a simple and reasonable rule for estimating disk access time.
(Note: the linked PDF is from the author's website, so no piracy.)
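If you want to play with the numbers, the estimate above is easy to reproduce. The parameters below are the ones used in the book's example (9 ms average seek, 7200 RPM, 400 sectors per track); note the book rounds the rotation term to 4 ms:

// Reproduces the back-of-the-envelope disk access estimate quoted above.
public class DiskAccessEstimate {
    public static void main(String[] args) {
        double avgSeekMs = 9.0;

        double rpm = 7200.0;
        double fullRotationMs = 60_000.0 / rpm;                   // ~8.33 ms per revolution
        double avgRotationMs = fullRotationMs / 2.0;              // on average half a revolution

        int sectorsPerTrack = 400;
        double avgTransferMs = fullRotationMs / sectorsPerTrack;  // ~0.02 ms for one sector

        double accessMs = avgSeekMs + avgRotationMs + avgTransferMs;
        // Prints ~13.19 ms; the book rounds the rotation term to 4 ms, giving 13.02 ms.
        System.out.printf("Estimated access time: %.2f ms%n", accessMs);
    }
}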
Of course, if the data being accessed was recently accessed, there's a decent chance it's cached somewhere in the memory hierarchy, in which case the access time is extremely small (practically "near instant" when compared to disk access time).
2) And more specifically, what is the difference in performance when accessing a specific index in a file? Is this a constant-time operation? Or does it depend on how "far down" the index is?
Another seek + rotation amount of time may occur if the location being sought isn't stored sequentially nearby. It depends on where in the file you're seeking, and where that data is physically stored on the disk. For example, fragmented files are guaranteed to cause disk seeks to read the entire file.
Something to keep in mind is that even though you may only request to read a few bytes, the physical reads tend to occur in fixed-size chunks (the sector size), which end up in cache. So you may later do a seek to some nearby location in the file and get lucky that it's already in cache for you.
BTW, the full chapter in that book on the memory hierarchy is pure gold, if you're interested in the subject.
1) If you need to compare the speed of various I/O functions, you just have to run each one a thousand times and record how long it takes (see the sketch after this list).
2) That depends on how you plan to get to this index. An index at the beginning of a file is exactly the same as an index in the middle of a file; it just points to a location on the disk. If you get to this index by starting at the beginning and reading your way there, then yes, it will take longer.
3/4) No; these are managed by the operating system itself. Java isn't low-level enough to handle these kinds of operations.
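For example, something along these lines ("test.dat" and the 4 KB read size are placeholders; a serious benchmark would also need warm-up runs and would have to account for the OS page cache):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Random;

// Times 1000 random 4 KB reads from an existing file larger than 4 KB.
public class RandomReadBenchmark {
    public static void main(String[] args) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(4096);
        Random random = new Random();

        try (FileChannel channel = FileChannel.open(Paths.get("test.dat"), StandardOpenOption.READ)) {
            long range = channel.size() - buffer.capacity();
            int reads = 1000;
            long start = System.nanoTime();
            for (int i = 0; i < reads; i++) {
                long position = (random.nextLong() & Long.MAX_VALUE) % range;
                buffer.clear();
                channel.read(buffer, position);
            }
            long elapsedNanos = System.nanoTime() - start;
            System.out.printf("Average random 4 KB read: %.1f us%n", elapsedNanos / (reads * 1000.0));
        }
    }
}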
high-performance disk I/O algorithms.
The performance of your hardware is usually so important that what you do in software doesn't matter so much. You should first consider buying the right hardware for the job.
How can we estimate the running time of disk I/O operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
It's simple to time them, as they are always going to take many microseconds each. For example, a HDD can perform 80-120 IOPS and an SSD can perform 80K to 230K IOPS. You can usually get within 1/2 of what the manufacturer specifies easily, and getting to 100% is where you might do tricks in software. Nevertheless, you will never get a HDD to perform like an SSD unless you have lots of memory and only ever read the data, in which case the OS will do all the work for you.
You can buy hybrid drives which give you the capacity of a HDD but performance close to that of an SSD. For commercial production use you may be willing to spend the money on a disk subsystem with multiple drives. This can increase the performance to, say, 500 IOPS, but the cost increases significantly. You usually buy a disk subsystem because you need the capacity and redundancy it provides, but you usually get a performance boost as well by having more spindles working together. Although this link on disk subsystem performance is old (2004), things haven't changed that much since then.
And more specifically, what is the difference in performance when accessing a specific index in a file? Is this a constant-time operation? Or does it depend on how "far down" the index is?
It depends on whether it is in memory or not. If it is very close to data you recently read, it quite likely is; if it is far away, it depends on what accesses you have done in the past and how much memory you have free to cache disk accesses.
The typical latency for a HDD is ~8 ms per read (i.e. if you have 10 random reads queued it can take 80 ms). The typical latency of an SSD is 25 to 100 us. It is far less likely that reads will already be queued, as an SSD is much faster to start with.
how does the JVM optimize access of indexed portions of a file?
Assuming you are using sensible buffer sizes, there is little you can do generically in software. What can be done is done by the OS.
Are there any good idioms or libraries for on-disk data structure implementations?
Use a sensible buffer size like 512 bytes to 64 KB.
Much more importantly, buy the right hardware for your requirements.
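To make the buffer-size advice concrete, a minimal example of reading a file through a 64 KB buffer, so most read() calls are served from memory rather than the disk ("data.bin" is a placeholder file name):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Counts the bytes in a file while reading through a 64 KB buffer.
public class BufferedReadExample {
    public static void main(String[] args) throws IOException {
        long bytes = 0;
        try (BufferedInputStream in =
                     new BufferedInputStream(new FileInputStream("data.bin"), 64 * 1024)) {
            while (in.read() != -1) {
                bytes++;
            }
        }
        System.out.println("Read " + bytes + " bytes");
    }
}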
1) How can we estimate the running time of disk I/O operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
There are no such universal constants. In fact, performance models of physical disk I/O, file systems and operating systems are too complicated to be able to make accurate predictions for specific operations.
2) And more specifically, what is the difference in performance when accessing a specific index in a file? Is this a constant-time operation? Or does it depend on how "far down" the index is?
It is too complicated to predict. For instance, it depends on how much file buffering the OS does, physical disk parameters (e.g. seek times) and how effectively the OS can schedule disk activity ... across all applications.
3) Finally... how does the JVM optimize access of indexed portions of a file?
It doesn't. It is an operating system level thing.
4) Are there any good idioms or libraries for on-disk data structure implementations?
That is difficult to answer without more details of your actual requirements. But the best idea is not to try and implement this kind of thing yourself. Find an existing library that is a good fit to your requirements.
Also note that Linux systems, at least, allow different file systems. Depending on the application, one might be a better fit than the others. http://en.wikipedia.org/wiki/File_system#Linux