This will be a somewhat abstract question, since I don't even know whether anything like this exists.
Suppose we have an application that delivers text data from point A to point B.
A and B are quite far apart, so the size of the data has a significant effect on all the metrics we care about (speed, latency and throughput). The first thing that comes to mind is compression, but compression is not very effective when we have to compress many small messages; it is very effective when the amount of data being compressed is large.
I have no experience with compression algorithms, but my understanding is that the bigger the input, the better the compression ratio can be, since there is a higher likelihood of repeated chunks and other patterns that can be exploited.
Another way we could go is batching: by waiting for some period of time N and collecting all the tiny messages into one big compressed message, we could get a good compression ratio, but we would sacrifice latency; the message that arrives first would incur an unnecessary delay of up to N.
The solution I'm looking for is something like this: when a compression algorithm traverses the data set, it presumably builds some dictionary of patterns it knows it can exploit. This dictionary is thrown away every time compression finishes, and it is always sent along with the message to B.
rawMsg -> [dictionary|compressedPayload] -> send to B
However, if this dictionary could be maintained in memory and sent only when it changes, we could efficiently compress even small messages and avoid sending the dictionary to the other end every time...
rawMsg -> compress(existingDictionaryOfSomeVersion, rawMsg) -> [dictionaryVersion|compressedPayload] -> send to B
Now, obviously, the assumption here is that B also keeps an instance of the dictionary and updates it whenever a newer version arrives.
Note that exactly this already happens with protocols like Protobuf or FIX (in financial applications).
With any message you have a schema (the dictionary) available on both ends, and then you just send raw binary data: efficient and fast, but the schema is fixed and unchanging.
I'm looking for something that can do this for free-form text.
Is there any technology that allows this (without a fixed schema)?
You can simply send the many small messages in a single compressed stream. Then they will be able to take advantage of the previous history of small messages. With zlib you can flush out each message, which will avoid having to wait for a whole block to be built up before transmitting. This will degrade compression, but not nearly as much as trying to compress each string individually (which will likely just end up expanding them). In the case of zlib, your dictionary is always the last 32K of messages that you have sent.
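If you are on the JVM, the same idea is available through java.util.zip's Deflater/Inflater, which wrap zlib. A minimal sketch, assuming each small message fits comfortably in a fixed buffer (class and method names are just for illustration); messages must reach B in order, with none dropped:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    // Sketch: A keeps one long-lived Deflater, B keeps one long-lived Inflater.
    // SYNC_FLUSH pushes each small message out immediately while both sides keep
    // the shared 32K history window, so later messages can reference earlier ones.
    public class StreamingText {
        private final Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        private final Inflater inflater = new Inflater();
        private final byte[] buf = new byte[64 * 1024];   // assumes messages fit here

        // On A: compress one small message and send the returned bytes to B.
        public byte[] compress(String msg) {
            deflater.setInput(msg.getBytes(StandardCharsets.UTF_8));
            int n = deflater.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH);
            return Arrays.copyOf(buf, n);
        }

        // On B: feed the chunks, in order, into the matching Inflater.
        public String decompress(byte[] chunk) throws DataFormatException {
            inflater.setInput(chunk);
            int n = inflater.inflate(buf);
            return new String(buf, 0, n, StandardCharsets.UTF_8);
        }
    }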
I know that insert/delete works in O(1) time with Java HashMaps.
But is it still the fastest data structure if I have over a million objects (with distinct keys - i.e. each object has a unique key) in my HashMap?
TL;DR - profile your code!
The average performance of HashMap insertion and deletion scales as O(1) (assuming you have a sound hashCode() method on the keys [1]) until you start running into 2nd-order memory effects:
The larger the heap is, the longer it takes to garbage collect. Generally, the factors that matter most are the number and size of non-garbage objects. A big enough HashMap will do that ...
Your hardware has a limited amount of physical memory. If your JVM's memory demand grows beyond that, the host OS will "swap" memory pages between RAM and disk. A big enough HashMap will do that ... if your heap size is bigger than the amount of physical RAM available to the JVM process.
There are memory effects due to the sizes of your processor's caches and TLB. Basically, if the processor's demand for reading and writing memory is too great, the memory system becomes the bottleneck. These effects can be exacerbated by a large heap and highly non-localized access patterns. (And by running the GC!)
There is also a limit of about 2^31 on the size of a HashMap's primary hash array. So if you have more than about 2^31 / 0.75 entries (roughly 2.9 billion), the performance of the current HashMap implementation is theoretically O(N). However, we are talking about billions of entries, and the 2nd-order memory effects will impact performance well before then.
[1] If your keys have a poor hashCode() function, you may find that a significant proportion of the keys hash to the same code. If that happens, lookup, insert and delete performance for those keys will be either O(logN) or O(N) ... depending on the key's type and your Java version. In this case, N is the number of keys in the table with the same hashcode as the one you are looking up, etc.
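To illustrate that footnote, here is a hypothetical key type with a deliberately poor hashCode(): every instance hashes to the same bucket, so operations among these keys degrade as described above.

    // Hypothetical key with a legal but terrible hashCode(): all instances land
    // in one bucket, so lookups among them degrade from O(1) to O(logN)
    // (tree bins, Java 8+ with Comparable keys) or O(N) (linked bins).
    final class BadKey {
        private final String id;
        BadKey(String id) { this.id = id; }

        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id.equals(id);
        }
        @Override public int hashCode() {
            return 42;   // defeats the whole point of a hash table
        }
    }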
Is HashMap the fastest data structure for your use-case?
It is hard to say without more details of your use-case.
It is hard to say without understanding how much time and effort you are prepared to put into the problem. (If you put in enough coding effort, you could almost certainly trim a few percent off. Maybe a lot more. HashMap is general purpose.)
It is hard to say without you (first!) doing a proper performance analysis.
For example, you first need to be sure that the HashMap really is the cause of your performance problems. Sure, you >>think<< it is, but have you actually profiled your code to find out? Until you do this, you risk wasting your time on optimizing something that isn't the bottleneck.
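For a very crude first measurement (not a substitute for a real profiler or JMH; JIT warm-up and GC pauses can easily distort a single run like this), something along these lines will at least tell you whether the map operations are where the time goes:

    import java.util.HashMap;
    import java.util.Map;

    // Crude timing sketch only: a single un-warmed-up run, so treat the numbers
    // as a rough indication, not a benchmark result.
    public class MapTiming {
        public static void main(String[] args) {
            Map<Integer, String> map = new HashMap<>();
            int n = 1_000_000;

            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++) {
                map.put(i, "value-" + i);
            }
            long t1 = System.nanoTime();
            for (int i = 0; i < n; i++) {
                map.remove(i);
            }
            long t2 = System.nanoTime();

            System.out.printf("insert: %d ms, delete: %d ms%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }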
So HashMaps will have O(1) insert/delete even for a huge number of objects. The problem with a huge amount of data is space; for a million entries you are probably fine keeping everything in memory.
Java's HashMap has a default load factor of 0.75, meaning that a map with a million entries needs about 1.33 million slots. If you can hold that in memory, it's all fine. Even if you can't hold it all in memory, you'd probably still want to use HashMaps, perhaps a distributed HashMap.
As for Big-O time, it describes how running time grows as the data set gets larger, so it is most meaningful for large inputs. For a really small data set, an algorithm that takes 5n + 10 steps behaves noticeably differently from one that takes n steps, even though both are O(n). The reason constant time (O(1)) is so valuable is that it means the time doesn't depend on the size of the data set at all. Therefore, for a large data set like the one you're describing, a HashMap is an excellent option because of its constant-time insert/delete.
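As a side note, if you know the expected size up front you can pre-size the map so it never rehashes while filling. A minimal sketch (the 0.75 figure is the default load factor mentioned above):

    import java.util.HashMap;
    import java.util.Map;

    // Pre-sizing sketch: reserve enough table slots up front so that ~1,000,000
    // entries never trigger an intermediate rehash.
    // Required capacity = expectedEntries / loadFactor = 1_000_000 / 0.75 ≈ 1_333_334;
    // HashMap then rounds that up to the next power of two internally (2_097_152).
    public class PresizedMap {
        public static void main(String[] args) {
            int expectedEntries = 1_000_000;
            Map<Integer, String> map = new HashMap<>((int) (expectedEntries / 0.75f) + 1);
            for (int i = 0; i < expectedEntries; i++) {
                map.put(i, "value-" + i);
            }
            System.out.println("entries: " + map.size());
        }
    }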
I am a student in Computer Science and I am hearing the word "overhead" a lot when it comes to programs and sorts. What does this mean exactly?
It's the resources required to set up an operation. It might seem unrelated to the work itself, but it's necessary.
It's like when you need to go somewhere, you might need a car. But, it would be a lot of overhead to get a car to drive down the street, so you might want to walk. However, the overhead would be worth it if you were going across the country.
In computer science, sometimes we use cars to go down the street because we don't have a better way, or it's not worth our time to "learn how to walk".
The meaning of the word can differ a lot with context. In general, it's resources (most often memory and CPU time) that are used, which do not contribute directly to the intended result, but are required by the technology or method that is being used. Examples:
Protocol overhead: Ethernet frames, IP packets and TCP segments all have headers, and TCP connections require handshake packets. Thus, you cannot use the entire bandwidth the hardware is capable of for your actual data. You can reduce the overhead by using larger packet sizes, and UDP has a smaller header and no handshake.
Data structure memory overhead: A linked list requires at least one pointer for each element it contains. If the elements are the same size as a pointer, this means a 50% memory overhead, whereas an array can potentially have 0% overhead (see the rough arithmetic sketched after this list).
Method call overhead: A well-designed program is broken down into lots of short methods. But each method call requires setting up a stack frame, copying parameters and a return address. This represents CPU overhead compared to a program that does everything in a single monolithic function. Of course, the added maintainability makes it very much worth it, but in some cases, excessive method calls can have a significant performance impact.
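To make the linked-list example concrete, here is a rough back-of-the-envelope sketch. The 8-byte figures are just assumptions; real sizes depend on pointer width, object headers, alignment and, on a JVM, boxing.

    // Illustrative arithmetic only, not a measurement of any real data structure.
    public class OverheadEstimate {
        public static void main(String[] args) {
            long elements = 1_000_000;
            long payloadPerElement = 8;     // the data we actually care about
            long pointerSize = 8;           // one "next" pointer per node (singly linked)

            long arrayBytes = elements * payloadPerElement;
            long listBytes  = elements * (payloadPerElement + pointerSize);

            System.out.println("array: " + arrayBytes + " bytes, overhead ~0%");
            System.out.println("list:  " + listBytes + " bytes, overhead "
                    + (100 * pointerSize / (payloadPerElement + pointerSize)) + "%");
        }
    }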
You're tired and can't do any more work, so you eat food. The energy spent looking for the food, getting it and actually eating it is overhead!
Overhead is something wasted in order to accomplish a task. The goal is to keep overhead as small as possible.
In computer science, let's say you want to print a number; that's your task. Storing the number, setting up the display to print it, calling the routines that do the printing, and fetching the number from its variable are all overhead.
Wikipedia has us covered:
In computer science, overhead is generally considered any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal. It is a special case of engineering overhead.
Overhead typically refers to the amount of extra resources (memory, processor time, etc.) that different algorithms require.
For example, the overhead of inserting into a balanced binary tree can be much larger than the same insert into a simple linked list (the insert takes longer and uses more processing power to keep the tree balanced, which results in a longer perceived operation time for the user).
For a programmer, overhead refers to the system resources consumed by your code when it's running on a given platform with a given set of input data. Usually the term is used in the context of comparing different implementations or possible implementations.
For example, we might say that a particular approach might incur considerable CPU overhead, while another might incur more memory overhead, and yet another might be weighted toward network overhead (and entail an external dependency, for example).
Let's give a specific example: Compute the average (arithmetic mean) of a set of numbers.
The obvious approach is to loop over the inputs, keeping a running total and a count. When the last number is encountered (signaled by "end of file" (EOF), some sentinel value, some GUI button, whatever), we simply divide the total by the number of inputs and we're done.
This approach incurs almost no overhead in terms of CPU, memory or other resources. (It's a trivial task).
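A sketch of this first approach, reading doubles from standard input (names are just illustrative):

    import java.util.Scanner;

    // Streaming average: one running total and a count, no buffering of the input.
    public class StreamingAverage {
        public static void main(String[] args) {
            Scanner in = new Scanner(System.in);
            double total = 0.0;
            long count = 0;
            while (in.hasNextDouble()) {     // stop at EOF or the first non-number
                total += in.nextDouble();
                count++;
            }
            System.out.println(count == 0 ? "no input" : "average = " + total / count);
        }
    }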
Another possible approach is to "slurp" the input into a list, iterate over the list to calculate the sum, then divide that by the number of valid items in the list.
By comparison this approach might incur arbitrary amounts of memory overhead.
In a particularly bad implementation, we might perform the sum operation using recursion but without tail-call elimination. Now, in addition to the memory overhead of our list, we're also introducing stack overhead (which is a different sort of memory and is often a more limited resource than other forms of memory).
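A sketch of what such a bad implementation could look like (hypothetical names; with default JVM stack sizes, a list of a few hundred thousand elements will likely blow the stack):

    import java.util.Collections;
    import java.util.List;

    // Recursive sum without tail-call elimination: each element adds a stack frame,
    // so a large enough list overflows the stack on a typical JVM.
    public class RecursiveSum {
        static double sum(List<Double> values, int index) {
            if (index == values.size()) {
                return 0.0;
            }
            return values.get(index) + sum(values, index + 1);   // work after the call => not a tail call
        }

        public static void main(String[] args) {
            List<Double> values = Collections.nCopies(200_000, 1.0);
            System.out.println(sum(values, 0));   // likely StackOverflowError with default stack size
        }
    }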
Yet another (arguably more absurd) approach would be to post all of the inputs to some SQL table in an RDBMS and then simply call the SQL SUM function on that column of that table. This shifts our local memory overhead to some other server, and incurs network overhead and an external dependency on our execution. (Note that the remote server may or may not have any particular memory overhead associated with this task --- it might shove all the values immediately out to storage, for example.)
Hypothetically we might consider an implementation over some sort of cluster (possibly to make the averaging of trillions of values feasible). In this case any necessary encoding and distribution of the values (mapping them out to the nodes) and the collection/collation of the results (reduction) would count as overhead.
We can also talk about the overhead incurred by factors beyond the programmer's own code. For example, compiling some code for a 32- or 64-bit processor might entail greater overhead than one would see for an old 8-bit or 16-bit architecture. This might involve larger memory overhead (alignment issues), or CPU overhead (where the CPU is forced to adjust bit ordering or use non-aligned instructions, etc.), or both.
Note that the disk space taken up by your code and its libraries, etc. is not usually referred to as "overhead" but rather is called "footprint." Also, the base memory your program consumes (without regard to any data set it's processing) is called its "footprint" as well.
Overhead here is simply extra time consumed during program execution. For example, when we call a function, control is passed to wherever the function is defined, its body is executed there, and then control is passed back to where it came from. Making the CPU run through this detour costs time, hence overhead. One way to reduce this overhead is to declare the function inline, which copies the function's body to the call site, so control never jumps elsewhere and the program continues in a straight line.
You could look it up in a dictionary; the definition is the same. But to save you time: overhead is the work required in order to do the productive work. For instance, an algorithm runs and does useful work, but requires memory to do that work. The memory allocation takes time and is not directly related to the work being done, so it is overhead.
You can check Wikipedia, but basically it's when more actions or resources are used than strictly necessary. For example, if you are familiar with .NET, there you have value types and reference types. Reference types have memory overhead, as they require more memory than value types.
A concrete example of overhead is the difference between a "local" procedure call and a "remote" procedure call.
For example, with classic RPC (and many other remote frameworks, like EJB), a function or method call looks the same to a coder whether it's a local, in-memory call or a distributed, network call.
For example:
service.function(param1, param2);
Is that a normal method, or a remote method? From what you see here you can't tell.
But you can imagine that the difference in execution times between the two calls are dramatic.
So, while the core implementation will "cost the same", the "overhead" involved is quite different.
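A purely hypothetical illustration of the point (all names invented, and the network cost is just simulated with a sleep): the call site cannot tell the two implementations apart, yet the second one hides far more overhead.

    // The caller sees only this interface; the overhead lives in the implementation.
    interface Service {
        int function(int param1, int param2);
    }

    class LocalService implements Service {
        public int function(int param1, int param2) {
            return param1 + param2;                   // plain in-memory call
        }
    }

    class FakeRemoteService implements Service {
        public int function(int param1, int param2) {
            try {
                Thread.sleep(50);                     // stand-in for marshalling + network round trip
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return param1 + param2;
        }
    }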
Think of the overhead as the time required to manage the threads and coordinate among them. It is a burden if the threads do not have enough work to do; in that case the overhead outweighs the time saved by using threads, and the code takes longer than the sequential version.
To answer you, I would give the analogy of cooking rice, for example.
Ideally, when we want to cook, we want everything to be available: the pots already clean, enough rice on hand. If that is the case, we take less time to cook our rice (less overhead).
On the other hand, suppose you don't have clean water immediately available and you don't have rice, so you first need to go buy rice from the shop and fetch clean water from the tap outside your house. These extra tasks are not part of cooking itself; ideally, your ingredients would already be at hand when you want to cook.
So the time spent going to buy rice from the shop and water from the tap is overhead on cooking the rice. It is a cost we can avoid or minimize, compared to the ideal way of cooking rice (everything is around you, and you don't waste time gathering ingredients).
The time spent collecting ingredients is what we call overhead.
In computer science, for example in multithreading, communication overhead among threads arises when threads have to take turns giving each other access to a shared resource or when they pass information or data to each other, and context switching adds overhead too. Even though this coordination is essential, it is a waste of time (CPU cycles) compared to traditional single-threaded programming, where there is never any time spent on communication; a single-threaded program does the work straight away.
It's anything other than the data itself, i.e. TCP flags, headers, CRC, FCS, etc.
I have a Java app that uses the spymemcached library (http://code.google.com/p/spymemcached) to read and write objects to memcached.
The app always caches the same type of object to memcached. The cached object is always an ArrayList of 5 or 6 java.util.Strings. Using the SizeOf library (http://www.codeinstructions.com/2008/12/sizeof-for-java.html), I've determined that the average deep size of the ArrayList is about 800 bytes.
Overall, I have allocated 12 GB of RAM to memcached. My question is: How many of these objects can memcached hold?
It's not clear to me if it's correct to use the "800 byte" metric from SizeOf, or if that's misleading. For example, SizeOf counts each char to be 2 bytes. I know that every char in my String is a regular ASCII character. I believe spymemcached uses Java serialization, and I'm not sure if that causes each char to take up 1 byte or 2 bytes.
Also, I don't know how much per-object overhead memcached uses. So the calculation should account for the RAM that memcached uses for its own internal data structures.
I don't need a number that's 100% exact. A rough back-of-the-envelope calculation would be great.
The simple approach would be experimentation:
restart memcache
Check bytes allocated: echo "stats" | nc localhost 11211 | fgrep "bytes "
insert 1 object, check bytes allocated
insert 10 objects, check bytes allocated
etc.
This should give you a good idea of bytes-per-key.
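To complement the server-side check, you can also measure the client-side payload directly. This sketch assumes the value goes through standard Java serialization (which, as far as I know, is what spymemcached's default transcoder falls back to for arbitrary Serializable objects; treat that as an assumption), and the sample strings are made up:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.util.ArrayList;
    import java.util.List;

    // Sanity check of the serialized payload size for one cached value
    // (assuming plain Java serialization; an assumption on my part).
    public class SerializedSize {
        public static void main(String[] args) throws IOException {
            List<String> value = new ArrayList<>(
                    List.of("alpha", "bravo", "charlie", "delta", "echo"));

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            System.out.println("serialized size: " + bytes.size() + " bytes");
            // Note: Strings are written in modified UTF-8 here, so plain ASCII
            // characters take 1 byte each in the serialized form, not 2.
        }
    }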
However, even if you figure out your serialized size, that alone probably won't tell you how many objects of that size memcache will hold. Memcache's slab system and LRU implementation make any sort of estimate of that nature difficult.
Memcache doesn't really seem to be designed around guaranteeing data availability -- when you GET a key, it might be there, or it might not: maybe it was prematurely purged; maybe one or two of the servers in your pool went down.
I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A member where row i of the matrix is A[i]. The other class (Mat2) stores A and T where T is the transpose of A.
Let's say we have a square matrix M and we want the product of M.mult(M). Call the product P.
When M is a Mat1 instance the algorithm used was the straightforward one:
P[i][j] += M.A[i][k] * M.A[k][j]
for k in range(0, M.A.length)
In the case where M is a Mat2 I used:
P[i][j] += M.A[i][k] * M.T[j][k]
which is the same algorithm because T[j][k]==A[k][j]. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it this much faster?
My only guess is that the second one makes better use of the CPU caches: data is pulled into the caches in chunks larger than one word, and the second algorithm benefits by traversing only rows, while the first throws away the data just pulled into the cache by jumping immediately to the row below (which is ~1000 words away in memory, because arrays are stored in row-major order), none of which is cached yet.
I asked someone and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I didn't think of this at all but I can sort of see how it results in fewer TLB faults.
So, which is it? Or is there some other reason for the performance difference?
This is because of the locality of your data.
In RAM, a matrix, although two-dimensional from your point of view, is of course stored as a contiguous array of bytes. The only difference from a 1D array is that the offset is calculated by combining the two indices that you use.
This means that if you access the element at position x,y, it calculates x*row_length + y, and that is the offset used to reference the element at the specified position.
What happens is that a big matrix isn't stored in just one page of memory (pages are how your OS manages RAM, by splitting it into chunks), so the correct page has to be loaded into the CPU cache if you try to access an element that is not already present.
As long as you traverse the matrix contiguously while doing your multiplication, you don't create any problems, since you mostly use all the coefficients of a page before switching to the next one. But if you invert the indices, every single element may live in a different memory page, so it has to fetch a different page from RAM for almost every single multiplication; that is why the difference is so pronounced.
(I have rather simplified the whole explanation; this is just to give you the basic idea of the problem.)
In any case, I don't think this is caused by the JVM itself. It may be related to how your OS manages the memory of the Java process.
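To make the two access patterns concrete, here is a minimal sketch of the two inner kernels, with field names borrowed from the question (this is not the full benchmark, just the loops):

    // Sketch of the two kernels: A is the matrix, T its transpose.
    public class Kernels {
        // Mat1-style: A[i][k] walks along a row, but A[k][j] jumps to a different
        // row array on every k step, so consecutive iterations touch memory that
        // is far apart and rarely already in cache.
        static double[][] multByRowsAndColumns(double[][] A) {
            int n = A.length;
            double[][] P = new double[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        P[i][j] += A[i][k] * A[k][j];
            return P;
        }

        // Mat2-style: both A[i][k] and T[j][k] walk along rows, so consecutive k
        // iterations hit adjacent memory in both operands.
        static double[][] multByRowsOnly(double[][] A, double[][] T) {
            int n = A.length;
            double[][] P = new double[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        P[i][j] += A[i][k] * T[j][k];
            return P;
        }
    }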
The cache and TLB hypotheses are both reasonable, but I'd like to see the complete code of your benchmark ... not just pseudo-code snippets.
Another possibility is that the performance difference is a result of your application using 50% more memory for the data arrays in the version with the transpose. If your JVM's heap size is small, it is possible that this is causing the GC to run too often. This could well be a result of using the default heap size. (Three lots of 1000 x 1000 x 8 bytes is ~24MB.)
Try setting the initial and max heap sizes to (say) double the current max size. If that makes no difference, then this is not a simple heap size issue.
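For example (the class name and the 256m figure are just placeholders; -Xms and -Xmx are the standard flags for the initial and maximum heap size):

    java -Xms256m -Xmx256m MatrixBenchmark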
It's easy to guess that the problem might be locality, and maybe it is, but that's still a guess.
It's not necessary to guess. Two techniques might give you the answer - single stepping and random pausing.
If you single-step the slow code you might find out that it's doing a lot of stuff you never dreamed of. Such as, you ask? Try it and find out. What you should see it doing, at the machine-language level, is efficiently stepping through the inner loop with no waste motion.
If it actually is stepping through the inner loop with no waste motion, then random pausing will give you information. Since the slow one is taking 20 times longer than the fast one, that implies 95% of the time it is doing something it doesn't have to. So see what it is. Each time you pause it, the chance is 95% that you will see what that is, and why.
If in the slow case, the instructions it is executing appear just as efficient as the fast case, then cache locality is a reasonable guess of why it is slow. I'm sure, once you've eliminated any other silliness that may be going on, that cache locality will dominate.
You might try comparing performance between JDK6 and OpenJDK7, given this set of results...