What is the difference between CPU cache and memory cache?
When data is cached in memory there is also a higher probability that
this data is also cached in the CPU cache of the CPU executing the
thread. [1]
And how can we relate caching in the CPU and caching in memory?
To go into detail, your question relates to both the hardware and the software used in computing.
Cache
This is just a general term used to refer to sets of data that are accessed quite often.
In computing, a cache (/ˈkæʃ/ kash) is a hardware or software component that stores data so future requests for that data can be served faster.
Source
Memory Cache
Quite simply, this is a cache of frequently accessed data stored in a reasonably fast medium, e.g. in RAM or on a disk drive.
CPU Cache
This is a small block of RAM-like memory that is physically part of the CPU. It generally doesn't hold a lot of data.
e.g. the Intel Core i7-920 has a cache of 8 MB
Source
The point of this cache is to store data that the CPU is using quite regularly, to speed up transfer time, since the CPU cache is physically closer to the processor than RAM is.
According to Wikipedia:
In computing, a cache is a hardware or software component that stores
data so future requests for that data can be served faster
So basically it is a location where you store data so that the next time you want the data you can access it more quickly, which means that the cache needs to be in a location that is faster than the original location.
Typically the hard disk is used to store most data in a persistent manner. This is the largest data store in a computer system and is normally slow.
All the "work" is however done by the CPU. So in order to do processing of the data the CPU needs to first read the data, then process it, then write it out. As the CPU has a very limited memory/data registers then it does a lot of reading and writing.
Ideally your CPU would have a large enough data registers to store everything you need. But memory on the CPU is very expensive, so this is not practical.
So, you have main memory where the applications store some data temporarily while running to make it quicker.
The way that applications work mean that you tend to have a lot of data which is accessed very frequently. Often referred to as hot data.
So the purpose of the cache is to store such hot data so that you can quicker and easier refer to it and use it when needed.
So the closer to the CPU core you have your data, the quicker it can be accessed and hence performance is increased. But the more expensive it is.
The graphic shows the different levels together with approximate access times.
It can vary slightly depending on CPU architecture (and has changed over time), but generally the L1 and L2 caches are per core, while the L3 cache is shared between multiple cores. The L1 cache is also often split into a data cache and an instruction cache.
So your CPU cache will contain the data which is accessed the most at that time, so there is a relation of sorts to either the main memory or the HDD the data was fetched from. But because it is small, the cache will quickly change to holding other data if you do something else, or if something else is running in the background.
It is therefore not really possible to control the cache of the CPU. Plus if you did you would effectively slow down everything else (including the O/S) because you are denying them the ability to use the cache.
Every time your application reads and stores data in main memory then it is effectively creating its own cache, assuming you then access the data from this location and don't read it from disk (or other location) every time you need it.
So this can mean that part of it is also in the CPU cache, but not necessarily, as you can have data in main memory from your application while your application is not doing anything, or has not accessed that data for a long time.
Remember also that the data in the CPU caches is very small in comparison to the data in main memory. For example, the Broadwell Intel Xeon chips have:
L1 Cache = 64 KB (per core)
L2 Cache = 256 KB (per core)
L3 Cache = 2 - 6 MB (shared).
The "memory cache" appears to really just be talking about anywhere in memory. Sometimes this is a cache of data stored on disk or externally. This is a software cache.
The CPU cache is a hardware cache and is faster, more localised but smaller.
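To make the software-cache idea concrete, here is a minimal sketch of a memoizing lookup in Java (the class and method names are made up for illustration; a real cache would also need a size limit and an eviction policy):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal sketch of a software cache: keep results of a slow lookup
// (e.g. a disk read or database query) in fast memory for reuse.
class SimpleCache<K, V> {
    private final Map<K, V> hotData = new ConcurrentHashMap<>();

    // 'loader' stands for the slow source; it is only called on a cache miss.
    V get(K key, Function<K, V> loader) {
        return hotData.computeIfAbsent(key, loader);
    }
}
```

A call like cache.get(userId, id -> loadFromDisk(id)) then only pays the disk cost the first time (loadFromDisk standing in for whatever slow lookup you are caching).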
Related
In numerous articles, YouTube videos, etc., I have seen Java's volatile keyword explained as a problem of cache memory, where declaring a variable volatile ensures that reads/writes are forced to main memory, and not cache memory.
It has always been my understanding that in modern CPUs, cache memory implements coherency protocols that guarantee that reads/writes are seen equally by all processors, cores, hardware threads, etc, at all levels of the cache architecture. Am I wrong?
jls-8.3.1.4 simply states
A field may be declared volatile, in which case the Java Memory Model ensures that all threads see a consistent value for the variable (§17.4).
It says nothing about caching. As long as cache coherency is in effect, volatile variables simply need to be written to a memory address, as opposed to being stored locally in a CPU register, for example. There may also be other optimizations that need to be avoided to guarantee the contract of volatile variables, and thread visibility.
I am simply amazed at the number of people who imply that CPUs do not implement cache coherency, such that I have been forced to go to StackOverflow because I am doubting my sanity. These people go to a great deal of effort, with diagrams, animated diagrams, etc. to imply that cache memory is not coherent.
jls-8.3.1.4 really is all that needs to be said, but if people are going to explain things in more depth, wouldn't it make more sense to talk about CPU Registers (and other optimizations) than blame CPU Cache Memory?
CPUs are very, very fast. That memory is physically a few centimeters away. Let's say 15 centimeters.
The speed of light is 300,000 kilometers per second, give or take. That's 30,000,000,000 centimeters every second. The speed of light in a medium is not as fast as in vacuum, but it's close, so let's ignore that part. That means sending a single signal from the CPU to the memory, even if the CPU and memory both can instantly process it all, is already limiting you to 1,000,000,000 round trips per second, or 1 GHz (you need to cover 30 centimeters to get from the core to the memory and back, so you can do that 1,000,000,000 times every second. If you can do it any faster, you're travelling backwards in time. Or some such. You get a Nobel Prize if you figure out how to manage that one).
Processors are about that fast! We measure core speeds in GHz these days, as in, in the time it takes the signal to travel, the CPU's clock has already ticked. In practice of course that memory controller is not instantaneous either, nor is the CPU pipelining system.
Thus:
It has always been my understanding that in modern CPUs, cache memory implements coherency protocols that guarantee that reads/writes are seen equally by all processors, cores, hardware threads, etc, at all levels of the cache architecture. Am I wrong?
Yes, you are wrong. QED.
I don't know why you think that or where you read that. You misremember, or you misunderstood what was written, or whatever was written was very very wrong.
In actual fact, an actual update to 'main memory' takes on the order of a thousand cycles! A CPU is just sitting there, twiddling its thumbs, doing nothing, in a time window where it could roll through a thousand, or on some cores multiple thousands, of instructions. Memory is that slow. Molasses level slow.
The fix is not registers; you are missing about 20 years of CPU improvement. There aren't 2 layers (registers, then main memory), no. There are more like 5: registers, on-die cache in multiple hierarchical levels, and then, eventually, main memory. To make it all very very fast these things are very very close to the core. So close, in fact, that each core has its own, and, drumroll here - modern CPUs cannot read main memory. At all. They are entirely incapable of it.
Instead what happens is that the CPU sees you write or read to main memory and translates that, as it can't actually do any of that, by figuring out which 'page' of memory you are trying to read/write (each chunk of e.g. 64k worth of memory is a page; actual page size depends on hardware). The CPU then checks if any of the pages loaded in its on-die cache is that page. If yes, great, and it's all mapped to that. Which does mean that, if 2 cores both have that page loaded, they both have their own copy, and obviously anything that one core does to its copy is entirely invisible to the other core.
If the CPU does -not- find this page in its own on-die cache you get what's called a cache miss, and the CPU will then check which of its loaded pages is least used, and will purge this page. Purging is 'free' if the CPU hasn't modified it, but if that page is 'dirty', it will first send a ping to the memory controller followed by blasting the entire sequence of 64k bytes into it (because sending a burst is way, way faster than waiting for the signal to bounce back and forth or to try to figure out which part of that 64k block is dirty), and the memory controller will take care of it. Then, that same CPU pings the controller to blast the correct page to it and overwrites the space that was just purged out. Now the CPU 'retries' the instruction, and this time it does work, as that page IS now in 'memory', in the sense that the part of the CPU that translates the memory location to cachepage+offset now no longer throws a CacheMiss.
And whilst all of that is going on, THOUSANDS of cycles can pass, because it's all very very slow. Cache misses suck.
This explains many things:
It explains why volatile is slow and synchronized is slower. Dog slow. In general if you want big speed, you want processes that run [A] independent (do not need to share memory between cores, except at the very start and very end perhaps to load in the data needed to operate on, and to send out the result of the complex operation), and [B] fit all memory needs to perform the calculation in 64k or so, depending on CPU cache sizes and how many pages of L1 cache it has.
It explains why one thread can observe a field having value A and another thread observes the same field having a different value for DAYS on end if you're unlucky. If the cores aren't doing all that much and the threads checking the values of those fields do it often enough, that page is never purged, and the 2 cores go on their merry way with their local core value for days (a minimal sketch of this visibility problem follows this list). A CPU doesn't sync pages for funsies. It only does this if that page is the 'loser' and gets purged.
It explains why Spectre happened.
It explains why LinkedList is slower than ArrayList even in cases where basic fundamental informatics says it should be faster (big-O notation, analysing computational complexity). Because as long as the arraylist's stuff is limited to a single page you can more or less consider it all virtually instant - it takes about the same order of magnitude to fly through an entire page of on-die cache as it takes for that same CPU to wait around for a single cache miss. And LinkedList is horrible on this front: Every .add on it creates a tracker object (the linkedlist has to store the 'next' and 'prev' pointers somewhere!) so for every item in the linked list you have to read 2 objects (the tracker and the actual object), instead of just the one (as the arraylist's array is in contiguous memory, that page is worst-case scenario read into on-die once and remains active for your entire loop), and it's very easy to end up with the tracker object and the actual object being on different pages.
It explains the Java Memory Model rules: Any line of code may or may not observe the effect of any other line of code on the value of any field, unless you have established a happens-before/happens-after relationship using any of the many rules set out in the JMM to establish these. That's to give the JVM the freedom to, you know, not run literally 1000x slower than necessary, because guaranteeing consistent reads/writes can only be done by flushing memory on every read, and that is 1000x slower than not doing that.
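To make the visibility point concrete, here is a minimal sketch (not from the question) of the classic stop-flag example; without volatile the reader thread is allowed to keep seeing a stale value indefinitely, with volatile the write is guaranteed to become visible:

```java
// Sketch: a stop flag shared between two threads. Declared volatile,
// the write in stop() must eventually become visible to the looping thread;
// without volatile, the loop may legitimately spin forever on a stale value.
class Worker implements Runnable {
    private volatile boolean running = true;

    void stop() {
        running = false;
    }

    @Override
    public void run() {
        while (running) {
            // do some work
        }
    }
}
```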
NB: I have massively oversimplified things. I do not have the skill to fully explain ~20 years of CPU improvements in a mere SO answer. However, it should explain a few things, and it is a marvellous thing to keep in mind as you try to analyse what happens when multiple Java threads try to write/read to the same field and you haven't gone out of your way to make very very sure you have an HB/HA relationship between the relevant lines. If you're scared now, good. You shouldn't be attempting to communicate between 2 threads often, or even via fields, unless you really, really know what you are doing. Toss it through a message bus, use designs where the data flow is bounded to the start and end of the entire thread's process (make a job, initialize the job with the right data, toss it in an ExecutorPool queue, set up that you get notified when it's done, read out the result, don't ever share anything whatsoever with the actual thread that runs it), or talk to each other via the database.
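As a concrete version of the "make a job, toss it in an ExecutorPool queue, read out the result" advice, a minimal sketch using the standard ExecutorService API (the job itself is just a placeholder computation):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class JobExample {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // All input is handed over when the job is created, and the result
        // comes back through the Future; nothing else is shared with the
        // worker thread while it runs.
        final int input = 1_000_000;
        Callable<Long> job = () -> {
            long sum = 0;
            for (int i = 0; i < input; i++) sum += i;
            return sum;
        };

        Future<Long> result = pool.submit(job);
        // Future.get() establishes the happens-before edge back to this thread.
        System.out.println("sum = " + result.get());
        pool.shutdown();
    }
}
```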
I have a large amount of data that I'm currently storing in an AtomicReferenceArray<X>, and processing from a large number of threads concurrently.
Each element is quite small and I've just got to the point where I'm going to have more than Integer.MAX_VALUE entries. Unfortunately List and arrays in Java are limited to Integer.MAX_VALUE (or just less) entries. Now I have enough memory to keep a larger structure in memory - the machine has about 250 GB of memory in a 64-bit VM.
Is there a replacement for AtomicReferenceArray<X> that is indexed by longs? (Otherwise I'm going to have to create my own wrapper that stores several smaller AtomicReferenceArray and maps long accesses to int accesses in the smaller ones.)
Sounds like it is time to use native memory. Having 4+ billion objects is going to cause some dramatic GC pause times. However if you use native memory you can do this with almost no impact on the heap. You can also use memory mapped files to support faster restarts and sharing the data between JVMs.
Not sure what your specific needs are, but there are a number of open-source data structures which do this, like HugeArray, Chronicle Queue and Chronicle Map. You can create an array which is 1 TB in size but uses almost no heap and has no GC impact.
BTW, for each object you create, there is an 8-byte reference and a 16-byte header. By using native memory you can save 24 bytes per object, e.g. 4 bn * 24 bytes is 96 GB of memory.
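If you do decide to stay on the heap instead, here is a minimal sketch of the wrapper approach mentioned in the question - splitting a long index across several smaller AtomicReferenceArray instances (the class name and chunk size here are made up for illustration):

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch: maps a long index onto an array of AtomicReferenceArray chunks.
class LongAtomicReferenceArray<X> {
    private static final int CHUNK_BITS = 30;            // 2^30 elements per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final AtomicReferenceArray<X>[] chunks;

    @SuppressWarnings("unchecked")
    LongAtomicReferenceArray(long length) {
        int nChunks = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new AtomicReferenceArray[nChunks];
        for (int i = 0; i < nChunks; i++) {
            long remaining = length - ((long) i << CHUNK_BITS);
            chunks[i] = new AtomicReferenceArray<>((int) Math.min(CHUNK_SIZE, remaining));
        }
    }

    X get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)].get((int) (index & CHUNK_MASK));
    }

    void set(long index, X value) {
        chunks[(int) (index >>> CHUNK_BITS)].set((int) (index & CHUNK_MASK), value);
    }

    boolean compareAndSet(long index, X expect, X update) {
        return chunks[(int) (index >>> CHUNK_BITS)]
                .compareAndSet((int) (index & CHUNK_MASK), expect, update);
    }
}
```

Each long index maps to exactly one slot of one underlying AtomicReferenceArray, so per-element operations keep the same atomicity guarantees as the original class.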
In the memory-based computing model, the running-time calculations that need to be done can be done abstractly, by considering the data structure.
However, there aren't a lot of docs on high-performance disk I/O algorithms. Thus I ask the following set of questions:
1) How can we estimate the running time of disk I/O operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
2) And more specifically, what is the performance of accessing a specific index in a file? Is this a constant-time operation? Or does it depend on how "far down" the index is?
3) Finally... how does the JVM optimize access of indexed portions of a file?
4) And... as far as resources go, in general... are there any good idioms or libraries for on-disk data structure implementations?
1) how can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
In chapter 6 of Computer Systems: A Programmer's Perspective they give a pretty practical mathematical model for how long it takes to read some data from a typical magnetic disk.
To quote the last page in the linked pdf:
Putting it all together, the total estimated access time is
Taccess = Tavg seek + Tavg rotation + Tavg transfer
= 9 ms + 4 ms + 0.02 ms
= 13.02 ms
This example illustrates some important points:
• The time to access the 512 bytes in a disk sector is dominated by the seek time and the rotational latency. Accessing the first byte in the sector takes a long time, but the remaining bytes are essentially free.
• Since the seek time and rotational latency are roughly the same, twice the seek time is a simple and reasonable rule for estimating disk access time.
*Note: the linked PDF is from the author's website, so no piracy.
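For reference, here is the arithmetic behind those numbers as a quick sketch, using the parameters from the book's example (assumed here: a 7,200 RPM disk, roughly 400 sectors per track and a 9 ms average seek time):

```java
// Rough sketch of the book's model: Taccess = Tseek + Trotation + Ttransfer.
public class DiskAccessEstimate {
    public static void main(String[] args) {
        double rpm = 7_200;
        double sectorsPerTrack = 400;
        double avgSeekMs = 9.0;

        double fullRotationMs = 60_000.0 / rpm;                    // ~8.33 ms for one revolution
        double avgRotationMs  = fullRotationMs / 2;                // ~4.17 ms (half a revolution on average)
        double avgTransferMs  = fullRotationMs / sectorsPerTrack;  // ~0.02 ms to pass one sector

        double accessMs = avgSeekMs + avgRotationMs + avgTransferMs;
        // ~13.2 ms; the quote rounds the rotational term to 4 ms, giving 13.02 ms.
        System.out.printf("estimated access time: %.2f ms%n", accessMs);
    }
}
```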
Of course, if the data being accessed was recently accessed, there's a decent chance it's cached somewhere in the memory hierarchy, in which case the access time is extremely small (practically "near instant" when compared to disk access time).
2) And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
Another seek + rotation amount of time may occur if the location you seek to isn't stored nearby. It depends where in the file you're seeking, and where that data is physically stored on the disk. For example, fragmented files are guaranteed to cause disk seeks to read the entire file.
Something to keep in mind is that even though you may only request to read a few bytes, the physical reads tend to occur in multiples of a fixed-size chunk (the sector size), which ends up in cache. So you may later seek to some nearby location in the file and get lucky that it's already in cache for you.
Btw- The full chapter in that book on the memory hierarchy is pure gold, if you're interested in the subject.
1) If you need to compare the speed of various IO functions, you have to just run them a thousand times and record how long they take.
2) That depends on how you plan to get to this index. An index to the beginning of a file is exactly the same as an index to the middle of a file. It just points to a location on the disk. If you get to this index by starting at the beginning and progressing there, then yes, it will take longer.
3/4) No, these are managed by the operating system itself. Java isn't low-level enough to handle these kinds of operations.
high performance disk I/o algorithms.
The performance of your hardware is usually so important that what you do in software doesn't matter so much. You should first consider buying the right hardware for the job.
how can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
It's simple to time them, as they are always going to take many microseconds each. For example, a HDD can perform 80-120 IOPS and an SSD can perform 80K to 230K IOPS. You can usually get within 1/2 of what the manufacturer specifies easily, and getting to 100% is where you might do tricks in software. Nevertheless, you will never get a HDD to perform like an SSD unless you have lots of memory and only ever read the data, in which case the OS will do all the work for you.
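If you want to measure it yourself, a rough sketch like the following is usually enough to see the order of magnitude (not a rigorous benchmark: no warm-up, and the OS page cache will flatter repeated runs; the file path is a placeholder and the file is assumed to be reasonably large):

```java
import java.io.RandomAccessFile;
import java.util.Random;

// Rough sketch: time N random single-block reads from a large file.
public class RandomReadTiming {
    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "testfile.bin"; // placeholder path
        int reads = 1_000;
        byte[] block = new byte[4096];
        Random rnd = new Random();

        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long range = file.length() - block.length;
            long start = System.nanoTime();
            for (int i = 0; i < reads; i++) {
                file.seek((long) (rnd.nextDouble() * range));  // random position
                file.readFully(block);                         // read one 4 KB block
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("avg per read: %.1f us%n", elapsed / 1_000.0 / reads);
        }
    }
}
```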
You can buy hybrid drives which give you the capacity of an HDD but performance close to that of an SSD. For commercial production use you may be willing to spend the money on a disk sub-system with multiple drives. This can increase the performance to, say, 500 IOPS, but the cost increases significantly. You usually buy a disk subsystem because you need the capacity and redundancy it provides, but you usually get a performance boost as well by having more spindles working together. Although this link on disk subsystem performance is old (2004), things haven't changed that much since then.
And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
It depends on whether it is in memory or not. If it is very close to data you recently read, it quite likely is; if it is far away, it depends on what accesses you have done in the past and how much memory you have free to cache disk accesses.
The typical latency for a HDD is ~8 ms per read (i.e. if you have 10 random reads queued it can be 80 ms). The typical latency of an SSD is 25 to 100 µs. It is far less likely that reads will already be queued, as it is much faster to start with.
how does the JVM optimize access of indexed portions of a file?
Assuming you are using sensible buffer sizes, there is little you can do generically in software. What can be done is done by the OS.
are there any good idioms or libraries for on disk data structure implementations?
Use a sensible buffer size like 512 bytes to 64 KB.
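For example, on the Java side that usually just means wrapping the stream in a buffer of roughly that size (a sketch; "data.bin" and the 64 KB figure are placeholders):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

// Sketch: read a file through a 64 KB buffer so each underlying read
// hits the disk/OS in sensible chunks rather than byte by byte.
public class BufferedRead {
    public static void main(String[] args) throws Exception {
        int bufferSize = 64 * 1024;
        try (InputStream in = new BufferedInputStream(
                new FileInputStream("data.bin"), bufferSize)) { // placeholder file name
            byte[] chunk = new byte[bufferSize];
            long total = 0;
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n;
            }
            System.out.println("read " + total + " bytes");
        }
    }
}
```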
Much more importantly, buy the right hardware for your requirements.
1) how can we estimate running time of disk I/o operations? I assume there is a simple set of constants which we might add for looking up a value on disk, rather than in memory...
There are no such universal constants. In fact, performance models of physical disk I/O, file systems and operating systems are too complicated to be able to make accurate predictions for specific operations.
2) And more specifically, what is the difference between performance for accessing a specific index in a file? Is this a constant time operation? Or does it depend on how "far down" the index is?
It is too complicated to predict. For instance, it depends on how much file buffering the OS does, physical disk parameters (e.g. seek times) and how effectively the OS can schedule disk activity ... across all applications.
3) Finally... how does the JVM optimize access of indexed portions of a file?
It doesn't. It is an operating system level thing.
4) are there any good idioms or libraries for on disk data structure implementations?
That is difficult to answer without more details of your actual requirements. But the best idea is not to try and implement this kind of thing yourself. Find an existing library that is a good fit to your requirements.
Also note that Linux systems, at least, allow different file systems. Depending on the application, one might be a better fit than the others. http://en.wikipedia.org/wiki/File_system#Linux
As the title says, in my module I have a BlockingQueue to deliver my data. The server can produce a large amount of logging information. In order to avoid affecting the performance of the server, I wrote multi-threaded clients to consume this data and persist it in data caches. Because a huge amount of data can be produced per minute, I became confused about what size I should initialize my queue with. I know that I can set a queue policy so that if more data is produced, I can drop the overflow. But what size should I create the queue with in order to hold as much of this data as I can?
Could you give me some suggestions? As far as I know, it is related to my server's JVM stack size and the size of a single logging entry in my JVM?
Make it "as large as is reasonable". For example, if you are OK with it consuming up to 1Gb of memory, then allocate its size to be 1Gb divided by the average number of bytes of the objects in the queue.
If I had to pick a "reasonable" number, I would start with 10000. The reason is, if it grows to larger than that, then making it larger isn't a good idea and isn't going to help much, because clearly the logging requirement is outpacing your ability to log, so it's time to back off the clients.
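As a sketch of that arithmetic (the 1 GB budget and the per-entry estimate are assumptions you would replace with your own measurements, e.g. from a profiler):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSizing {
    public static void main(String[] args) {
        long memoryBudgetBytes = 1L << 30;   // assume we are happy to spend ~1 GB on queued entries
        long avgEntryBytes     = 512;        // measured/estimated size of one logging record

        // Capacity = memory budget / average entry size, capped at Integer.MAX_VALUE.
        int capacity = (int) Math.min(Integer.MAX_VALUE, memoryBudgetBytes / avgEntryBytes);
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        System.out.println("queue capacity: " + capacity); // ~2 million entries with these numbers
    }
}
```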
"Tuning" through experimentation is usually the best approach, as it depends on the profile of your application:
If there are highs and lows in your application's activity, then a larger queue will help "smooth out" the load on your server
If your application has a relatively steady load, then a smaller queue is appropriate as a larger queue only delays the inevitable point when clients are blocked - you would be better to make it smaller and dedicate more resources (a couple more logging threads) to consuming the work.
Note also that a very large queue may impact garbage collection responsiveness to freeing up memory, as it has to traverse a much larger heap (all the objects in the queue) each time it runs, increasing the load on both CPU and memory.
You want to make the size as small as you can without impacting throughput and responsiveness too much. To assess this you'll need to set up a test server and hit it with a typical load to see what happens. Note that you'll probably need to hit it from multiple machines to put a realistic load on the server, as hitting it from one machine can limit the load due to the number of CPU cores and other resources on the test client machine.
To be frank, I'd just make the size 10000 and tune the number of worker threads rather than the queue size.
Contiguous writes to disk are reasonably fast (easily 20MB per second). Instead of storing data in RAM, you might be better off writing it to disk without worrying about memory requirements. Your clients then can read data from files instead of RAM.
To know the size of a Java object, you could use any Java profiler. YourKit is my favorite.
I think the real problem is not the size of the queue but what you want to do when things exceed your planned capacity. ArrayBlockingQueue will simply block your threads, which may or may not be the right thing to do. Your options typically are:
1) Block the threads (use ArrayBlockingQueue) based on memory committed for this purpose
2) Return an error to the "layer above" and let that layer decide what to do... maybe send an error to the client
3) Can you throw away some data, say data which was enqueued long ago? (A sketch of this follows the list.)
4) Start writing to disk, once you overflow RAM capacity.
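For example, option 3 can be sketched with a bounded queue and a non-blocking offer: if the queue is full, discard the oldest entry instead of blocking the producer (the capacity and the drop policy here are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of option 3: make room by dropping the oldest queued entry
// rather than blocking the producing thread.
class DroppingLogQueue {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    void publish(String logLine) {
        while (!queue.offer(logLine)) {  // offer() returns false instead of blocking when full
            queue.poll();                // drop the oldest entry and retry
        }
    }

    String take() throws InterruptedException {
        return queue.take();             // consumers block until data is available
    }
}
```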
What are the effects of paging on garbage collection?
The effects of paging on garbage-collection are pretty much the same as upon anything; it allows access to lots of memory, but hurts performance when it happens.
The more pressing question, is what is the effect of garbage-collection on paging?
Garbage collection can cause areas of memory to be read from and written to that would not otherwise be touched at a given point in time. Reducing the degree to which garbage collection causes paging is therefore advantageous. This is one of the advantages a generational compacting collector offers: it leads to more short-lived objects being in one page, collected from that page, and the memory made available to other objects, while also keeping long-lived objects in a page where related objects are more likely to also be (long-lived objects are often related to other long-lived objects, because one long-lived object is keeping the others alive). This not only reduces the amount of paging necessary to perform the collection, but can help reduce the amount of paging necessary for the rest of the application.
First a bit of terminology. In some areas, e.g. Linux-related talks, paging is a feature of the operating system in which executable code need not be permanently in RAM. Executable code comes from an executable file, and the kernel loads it from the disk on demand, when the CPU walks through the instructions in the program. When memory is tight, the kernel may decide to simply "forget" a page of code, because it knows that it can always reload it from the executable file, if that code needs to be executed again.
The kernel also implements another feature, called swapping, which is about something similar, but for data. Data is not obtained from the executable file. Hence, the kernel cannot simply forget a page of data; it has to save it somewhere, in a dedicated area called a "swap file" or "swap partition". This makes swapping more expensive than paging: the kernel must write out the data page before reusing the corresponding RAM, whereas a code page can simply be reused directly. In practice, the kernel pages quite a lot before considering swapping.
Paging is thus orthogonal to garbage collection. Swapping, however, is not. The general rule of thumb is that swapping and GC do not mix well. Most GC work by regularly inspecting data, and if said data has been sent to the swap partition, then it will have to be reloaded from that partition, which means that some other data will have to be sent to the said partition, because if the data was in the swap and not in RAM then this means that memory is tight. In the presence of swapping, a GC tends to imply an awful lot of disk activity.
Some GCs apply intricate strategies to reduce swap-related costs. This includes generational GC (which tries to explore old data less often) and strict typing (the GC looks at data because it needs to locate pointers; if it knows that a big chunk of RAM contains only non-pointers, e.g. it is some picture data with only pixel values, then it can leave it alone, and in particular not force it back from the swap area). The GC in the Java virtual machine (the one from Sun/Oracle) is known to be quite good at that. But that's only relative: if your Java application hits swap, then you will suffer horribly. But it could have been much worse.
Just buy some extra RAM.