Why do bools take less memory in an array? - java

On the Oracle website, it says that bools take 32 bits on the stack, but 8 in an array. I'm having trouble understanding why they would take less space in a group than as singles. How are they stored, and what difference does it make? If arrays of bools are more efficient, why has that technology not been transferred over to singles?
Also, why not 1 bit?
And what is the difference between how a 64 bit system and a 32 bit system store these?
Thanks!

A boolean value can be stored as a single binary digit, but our computers group values as a convenience. The smallest unit practically dealt with is a byte; the next largest is a word. A byte is, in modern hardware, always 8 bits, and 32 bits has emerged as the standard for a word (even our 64 bit computers can deal in 32 bit words effectively). It is much more convenient to store a bool in whatever unit comes naturally than as a single bit. In an array, the natural unit is a byte, since you can address any byte in memory. On the stack, which is a word stack, the natural unit is a word. You could stuff several bools into a single byte or word and work on pulling them out again bit by bit, literally, but that is less efficient than giving each one its own byte or word. Modern memories are large, so CPU speed is more of a concern: rather than waste the time it takes to pack bits in compactly, we waste memory instead, since it is more expendable.

Look, when it comes to stack, one must keep in mind that speed is the most important thing. For example, consider the following:
void method(int foo, boolean bar, String name) ....
Then the stack just after entering the method looks like this:
|-other variables-|-...-|-name-|-bar-|-foo-|---- return address etc. --
^
stack pointer
These are all quantities on a word boundary, symbolized by |. Sure, the JVM could (theoretically, but see below) store the boolean in a single byte. But one must keep in mind that 32-bit loads may be slower when they don't address word boundaries. Depending on the architecture, it may be impossible to dereference a pointer that does not point to a word boundary, or it may be impossible to use the quantity in a floating point instruction, etc. etc.
In addition, the byte code format can only address the n-th word on the stack. If this were not so, addresses relative to the stack pointer would have to be specified in bytes and this would mean that almost any stack access would have two bits that are irrelevant most of the time, as the majority of arguments will be words (int, float or reference) or double words (long, double).
What is never possible is to use 1 single bit for booleans. Why? Because bits are not directly addressable. The smallest addressable unit is the byte.
You can still store 32 booleans in an int if you feel that you should save on memory.
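For illustration, here is a minimal sketch of that idea (the class and method names are mine, not anything the JVM or the answers above provide):

public class PackedFlags {
    private int bits; // each of the 32 bit positions holds one boolean

    // set the flag at position index (0-31)
    public void set(int index, boolean value) {
        if (value) {
            bits |= (1 << index);   // turn the bit on
        } else {
            bits &= ~(1 << index);  // turn the bit off
        }
    }

    // read the flag at position index (0-31)
    public boolean get(int index) {
        return (bits & (1 << index)) != 0;
    }
}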

Because of the way CPUs work, all operations are done in 32 bits. If you have a single bool, the only realistic thing a compiler can do is zero out the remaining 24 bits and save it to the stack, since it's not practical to scan your Java file for other bools and store them all in the same 32-bit memory block.
If you have an array of bools, it's simple to just reference them in blocks of 4, so it's only 8 bits per bool.
Note that this only applies to 32 bit applications/machines.

Related

Why must Java booleans be at least 1 byte in size?

It is known that C++ bools must be at least 1 byte in size so that pointers can be created for each [https://stackoverflow.com/a/2064565/7154924]. But there are no pointers to primitive types in Java. Yet, they still take up at least 1 byte [https://stackoverflow.com/a/383597/7154924].
Why is this the case - why can't Java booleans be 1 bit in size? Computation time aside, if one has a large boolean array, surely one could conceive a compiler that does the appropriate shifting to retrieve the individual bit corresponding to a boolean value?
There is no reason why a boolean must be one byte in size. In fact, it is likely that booleans already aren't 1 byte in (effective) size in some scenarios: when packed next to other elements larger than 1 byte on the stack or in an object, they are likely to be larger (i.e., adding them to an object may cause the size to grow by more than one byte).
Any JVM is free to implement booleans as 1 bit, but as far as I know none chooses to do so, probably largely because:
Accessing one bit is often more expensive than accessing a byte, particularly when writing.
To read a bit, a CPU using a "classic RISC" instruction set would often need an additional and instruction to extract the relevant bit out of a packed byte (or larger word) of boolean bits. Some might even need an additional instruction to load a constant to and with. In the case of indexing an array of boolean, where the bit index isn't fixed at compile time, you'd need a variable shift. Some CPUs such as x86 have an easier time since they have memory-source test instructions, including specific bit-test instructions that take a variable position, such as bt. Such a CPU probably has similar read performance in both representations.
Writing is worse: rather than a simple byte write to set a boolean value you now need to read the value, modify the appropriate bit and write it back. Some platforms such as x86 have memory source-and-destination RMW instructions such as and and or that will help, but these are still significantly more expensive than plain writes. In the worst case, repeatedly writing the same element will result in a dependency chain through memory that could slow your code down by an order of magnitude (a series of plain stores can't form a dependency chain).
Even worse, the write method above is totally thread-unsafe. Two threads working on "independent" booleans might clobber each other, so the runtime would have to use atomic update operations just to write a bit for any field where the object cannot be proven local to the thread.
The space savings outside of arrays are usually very small, and often zero: alignment concerns mean that a single bit will often end up taking the same space as a byte on the stack or in the layout of an object. Only if you had many primitive boolean values on the stack or in an object would you see any savings (for example, objects are typically aligned to 8-byte boundaries, so if you have an object whose non-boolean fields are int or larger, you'd need at least 4 boolean values to save any space, and often you'd need 8).
This leaves the last remaining "big win" for a bit representation of boolean: arrays of boolean, where you could get an asymptotic 8x space saving for large arrays. In fact, this case was motivating enough in the C++ world that vector<bool> there has a "special" implementation where each bool takes one bit - a never-ending source of headaches due to all the required special cases and non-intuitive behavior (and often used as an example of a mis-feature that can't be removed now).
If it weren't for the memory model, I could imagine a world where Java implemented arrays of boolean in a bit-wise manner. They don't have the same issues as vector<bool> (mostly because of the extra layer of abstraction provided by the JIT, and also because an array provides a simpler interface than vector), and it could be done efficiently, I think. There is that pesky memory model, though. That model requires writes to different array elements to be safe when done by different threads (i.e., they act as independent variables for the purposes of the memory model). All common CPUs support this directly if you implement boolean as a byte, since they have independent byte accesses. No CPUs offer independent bit access, though: you are stuck using atomic operations (x86 offers the lock bt* operations, but these are slow; other platforms have even worse options). That would destroy the performance of any boolean array implemented as a bit-array.
Finally, as described above, implementing boolean as a bit has significant downsides - but what about the upside?
As it turns out, if the user really wants this bit-packed representation for boolean, they can do it themselves! They can pack 8 boolean values into a byte (or 32 values into an int, or whatever) in an object (and this is common for flags, etc.), and the generated accessor code should be about as efficient as if the JVM natively supported boolean-as-bit. In fact, when you know you want an array-of-bits representation for a large number of booleans, you can simply use BitSet - this has the representation you want and sidesteps the atomic issues by not offering any thread-safety guarantees. So by implementing boolean as a byte, you sidestep all the problems above, but still let the user "opt in" to a bit-level representation if they want, without much runtime penalty.
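As a quick illustration of that opt-in, here is a minimal java.util.BitSet sketch (the variable name and sizes are made up):

import java.util.BitSet;

public class BitSetSketch {
    public static void main(String[] args) {
        // one bit (plus a little bookkeeping) per flag instead of one byte per boolean
        BitSet seen = new BitSet(1_000_000);
        seen.set(42);                            // mark index 42 as true
        seen.clear(100);                         // mark index 100 as false
        System.out.println(seen.get(42));        // true
        System.out.println(seen.cardinality());  // number of bits currently set: 1
    }
}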

Why would you use a BitSet in java as opposed to an array of booleans (in Java)?

Other than the difference in methods available, why would someone use a BitSet as opposed to an array of booleans? Is performance better for some operations?
You would do it to save space: a boolean occupies a whole byte, so an array of N booleans would occupy eight times the space of a BitSet with the equivalent number of entries.
Execution speed is another closely related concern: you can produce a union or an intersection of several BitSet objects faster, because these operations can be performed by the CPU as bitwise ANDs and ORs on a whole 64-bit word at a time (a BitSet is backed by an array of longs).
In addition to the space savings noted by @dasblinkenlight, a BitSet has the advantage that it will grow as needed. If you do not know beforehand how many bits will be needed, or the high-numbered bits are sparse and rarely used (e.g. you are detecting which Unicode characters are present in a document and you want to allow for the unusual "foreign" ones > 128, but you know that they will be rare), a BitSet will save even more memory.
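For example, a small sketch of both points - word-at-a-time union/intersection and automatic growth (the variable names are mine):

import java.util.BitSet;

public class BitSetOps {
    public static void main(String[] args) {
        BitSet a = new BitSet();   // no size needed: it grows as bits are set
        BitSet b = new BitSet();

        a.set('a');       // bit 97
        a.set(10_000);    // a sparse, high-numbered bit is fine
        b.set('a');
        b.set('b');       // bit 98

        BitSet union = (BitSet) a.clone();
        union.or(b);                       // bitwise OR, a whole word at a time

        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);               // bitwise AND, a whole word at a time

        System.out.println(union);         // {97, 98, 10000}
        System.out.println(intersection);  // {97}
    }
}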

Memory efficiency: HashMap versus Array

I was thinking about the following situation: I want to count the occurrence of characters in a string (for example for a permutation check).
One way to do it would be to allocate an array of 256 integers (I assume that the characters are UTF-8), fill it with zeros, and then go through the string and increment the integers at the array positions corresponding to the int value of the chars.
However, for this approach you would have to allocate a 256-element array each time, even when the analyzed string is very short (and consequently uses only a small part of the array).
Another approach would be to use a Character-to-Integer HashTable and store a number for each encountered char. This way, you would only have keys for chars that actually are in the string.
As my understanding of the HashTable is rather theoretic and I do not really know how it is implemented in Java my question is: Which of the two approaches would be more memory efficient?
Edit:
During the discussion of this question (thank you for your answers everyone) I did realize that I had a very fuzzy understanding of the nature of UTF-8. After some searching, I have found this great video that I want to share, in case someone has the same problem.
I wonder why you chose 256 as the length of your array when you assume that your String is UTF-8. In UTF-8, a character can be composed of up to 4 bytes, which means quite a few more characters than just 256.
Anyway: using a HashTable/HashMap carries a huge memory overhead. First, all your characters and integers need to be wrapped in objects (Integer/Character), and an Integer consumes about 3x as much memory as an int. For arrays the difference can be even larger due to the optimizations Java performs on them (e.g. the Java stack works only in multiples of 4 bytes, while in an array Java allows smaller types such as char to consume only 2 bytes).
Then the HashTable itself creates a memory overhead because it needs to maintain an array (which is usually not fully used) and linked lists to maintain all objects which generate the same hash.
Additionally, access will be dramatically faster with arrays. You save multiple method invocations (add, hashCode, iterator, ...) and there are a number of opcodes in Java bytecode that make working with arrays more efficient.
Anyway, your question was:
Which of the two approaches would be more memory efficient?
And it is safe to say that arrays will be more memory efficient.
However, you should make absolutely sure what your requirements are. Do you need more memory efficiency? (Could be true if you process large amounts of data or you are on a slow device - mobile devices?) How important is readability of code? How about size of code? Reusability?
And is 256 really the correct size?
Without looking in the code I know that a HashMap requires, at minimum, a base object, a hashtable array, and individual objects for each hash entry. Generally an int value would have to be stored as an Integer object so that's more objects. Let's assume you have 30 unique characters:
32 bytes for the base object
256 bytes for a minimum-size hashtable array
32 bytes for each of the 30 table entries
16 bytes (if highly optimized) for each of 30 Integers
32 + 256 + 960 + 480 = 1728 bytes. That's for a minimal, non-fancy implementation.
The array of 256 ints would be about 1056 bytes.
I would use the array. From a performance aspect, you have guaranteed constant access. Better than what a hash table can get you.
As it also only uses a constant amount of memory, I see no downside. The HashMap will most likely need more memory, even if you only store a few elements.
By the way, the memory footprint should not be a concern, as you will only need the data structure as long as you need it for counting. Then it will be garbage collected, anyway.
Well here are the facts.
HashMap uses an array for its table behind the scenes.
So if you were actually limited by finding a contiguous space in memory, HashMap's benefit is only that the array may be smaller.
HashMap is generic and therefore uses objects.
Objects take up extra space. As I remember, it's typically 8 or 16 bytes minimum depending on whether it's a 32- or 64-bit system. This means the HashMap may very well not be smaller, even if the number of characters in the String is small. HashMap will require 3 extra objects for each entry: an Entry, a Character and an Integer. HashMap also needs to store the int for the index locally whereas the array does not.
That's on top of the extra computation involved in using the HashMap.
I would also say space optimization is not something you should worry about here. Either way, the memory footprint is actually very small.
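For reference, a rough sketch of the HashMap approach the question describes (the class and method names are mine); it works, but every entry costs a boxed Character, a boxed Integer and an entry object:

import java.util.HashMap;
import java.util.Map;

public class CharCountMap {
    public static Map<Character, Integer> count(String s) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            // merge() adds 1 to the existing count, or stores 1 if the key is new
            counts.merge(s.charAt(i), 1, Integer::sum);
        }
        return counts;
    }
}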
Initialize an array of integers indexed by the int value of a char; for example, the int value of 'f' is 102, which is its ASCII value:
http://www.asciitable.com/
char c = 'f';
int x = (int)c;
If you know the range of chars you're dealing with, then it is easier.
For each occurrence of a char, increment the entry at that char's index in the array by one. This approach would be slow if you have to iterate and complicated if you need to sort, but it won't be memory intensive.
Just be aware that when you sort, you lose the indexes.
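A rough sketch of that array-based counting, assuming plain ASCII input (the class and method names are mine):

public class CharCountArray {
    public static int[] countAscii(String s) {
        int[] counts = new int[128];          // one counter per 7-bit ASCII code
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {                    // ignore anything outside the assumed range
                counts[c]++;                  // the char's numeric value is the index
            }
        }
        return counts;                        // counts['f'] is the number of 'f' characters
    }
}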

Purpose of byte type in Java

I read this line in the Java tutorial:
byte: The byte data type is an 8-bit signed two's complement integer. It has a minimum value of -128 and a maximum value of 127 (inclusive). The byte data type can be useful for saving memory in large arrays, where the memory savings actually matters. They can also be used in place of int where their limits help to clarify your code; the fact that a variable's range is limited can serve as a form of documentation.
I don't clearly understand the bold line. Can somebody explain it for me?
Byte has a (signed) range from -128 to 127, whereas int has an (also signed) range of −2,147,483,648 to 2,147,483,647.
What it means is that since the values you're going to use will always be between that range, by using the byte type you're telling anyone reading your code this value will be at most between -128 to 127 always without having to document about it.
Still, proper documentation is always key and you should only use it in the case specified for readability purposes, not as a replacement for documentation.
If you're using a variable whose maximum value is 127, you can use byte instead of int, so others know - without reading any boundary-checking if conditions further on - that this variable can only have a value between -128 and 127.
So it's kind of self-documenting code - as mentioned in the text you're citing.
Personally, I do not recommend this kind of "documentation" - just because a variable can only hold a maximum value of 127 doesn't reveal its real purpose.
Integers in Java are stored in 32 bits; bytes are stored in 8 bits.
Let's say you have an array with one million entries. Yikes! That's huge!
int[] foo = new int[1000000];
Now, for each of these integers in foo, you use 32 bits or 4 bytes of memory. In total, that's 4 million bytes, or 4MB.
Remember that an integer in Java is a whole number between -2,147,483,648 and 2,147,483,647 inclusively. What if your array foo only needs to contain whole numbers between, say, 1 and 100? That's a whole lot of numbers you aren't using, by declaring foo as an int array.
This is when byte becomes helpful. Bytes store whole numbers between -128 and 127 inclusively, which is perfect for what you need! But why choose bytes? Because they use one-fourth of the space of integers. Now your array is wasting less memory:
byte[] foo = new byte[1000000];
Now each entry in foo takes up 8 bits or 1 byte of memory, so in total, foo takes up only 1 million bytes or 1MB of memory.
That's a huge improvement over using int[] - you just saved 3MB of memory.
Clearly, you wouldn't want to use this for arrays that hold numbers that would exceed 127, so another way of reading the bold line you mentioned is, Since bytes are limited in range, this lets developers know that the variable is strictly limited to these bounds. There is no reason for a developer to assume that a number stored as a byte would ever exceed 127 or be less than -128. Using appropriate data types saves space and informs other developers of the limitations imposed on the variable.
I imagine one can use byte for anything dealing with actual bytes.
Also, the parts (red, green and blue) of colors commonly have a range of 0-255 (although byte is technically -128 to 127, but that's the same amount of numbers).
There may also be other uses.
The general opposition I have to using byte (and probably why it isn't seen as often as it could be) is that there's lots of casting needed. For example, whenever you do arithmetic operations on a byte (except compound assignments such as +=, which include an implicit cast), it is automatically promoted to int (even byte + byte), so you have to cast the result if you want to put it back into a byte.
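A tiny sketch of that promotion rule:

byte a = 10;
byte b = 20;
// byte c = a + b;          // does not compile: a + b is promoted to int
byte c = (byte) (a + b);    // explicit cast needed to narrow back to byte
byte d = 10;
d += b;                     // compiles: compound assignment includes an implicit cast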
A very elementary example:
FileInputStream::read returns a byte wrapped in an int (or -1). This can be cast to a byte to make it clearer. I'm not endorsing this example as such (because I don't really, at this moment, see the point of doing the below), just saying something similar may make sense.
It could also have returned a byte in the first place (and possibly thrown an exception if end-of-file). This may have been even clearer, but the way it was done does make sense.
FileInputStream file = new FileInputStream("Somefile.txt");
int val;
while ((val = file.read()) != -1)
{
    byte b = (byte) val;
    // ...
}
If you don't know much about FileInputStream, you may not know what read returns, so you see an int and you may assume the valid range is the entire range of int (-2^31 to 2^31-1), or possibly the range of a char (0-65535) (not a bad assumption for file operations), but then you see the cast to byte and you give that a second thought.
If the return type were to have been byte, you would know the valid range from the start.
Another example:
One of Color's constructors could have been changed from three ints to three bytes instead, since their range is limited to 0-255.
It means that knowing that a value is explicitly declared as a very small number might help you recall the purpose of it.
Go for real docs when you have to create documentation for your code, though; relying on data types is not documentation.
An int covers 2 to the 32nd power distinct values, from -2,147,483,648 to 2,147,483,647. This is a huge range, and if you are scoring a test that is out of 100 then you are wasting that extra space if all of your numbers are between 0 and 100. It just takes more memory and hard disk space to store ints, and in serious data-driven applications this translates to money wasted if you are not using the extra range that ints provide.
The byte data type is generally used when you want to handle data as streams, either from a file or from the network. The reason is that networks and files work in terms of bytes.
Example: FileOutputStream's write method accepts a byte array as its input parameter.
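A minimal sketch of that (the file name and contents are made up):

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class WriteBytes {
    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes(StandardCharsets.US_ASCII);  // text encoded as raw bytes
        try (FileOutputStream out = new FileOutputStream("somefile.bin")) {
            out.write(data);  // streams deal in bytes
        }
    }
}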

Declaring 'long' over 'int' in Java

If int is enough for a field, and if I use long for some reason, would that cost me more memory?
In Java, yes, a long is 8 bytes wide and an integer is 4 bytes wide. This Java tutorial goes over the primitive data types. If you multiply the number of allocations by a certain amount (say, if you're allocating five million of these variables), the difference becomes more than negligible. For the average usage, however, it doesn't matter as much.
(You're already using Java, memory's kind of all over the place anyway.)
In native languages, there's a performance consideration: a 32-bit value can be held in a single register on a 32-bit architecture, but a 64-bit value cannot; on 64-bit architectures, obviously, it can. I'm not sure what kind of optimization Java does on its native integers, but this might be true in its runtime as well. There are alignment issues to worry about as well -- you see this more with shorts and bytes, though.
Best practice would be to use the type you need. If the value will never be over 2^31 - 1, don't use a long.
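A quick sketch of why the range matters (the values here are made up):

int  smallCount = 2_000_000_000;     // fine: below 2^31 - 1
long bigCount   = 3_000_000_000L;    // needs long: 3 billion does not fit in an int
int  wrapped    = 2_147_483_647 + 1; // silently overflows to -2,147,483,648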
Assuming from your previous questions that you mean to ask this in the scope of Java, the int data type is four bytes and the long data type is eight bytes.
However, whether the difference in size actually means a difference in memory usage depends on the situation.
If it's a local variable, it's allocated on the stack. As the stack is already allocated, using more stack space will not use more memory, provided of course that you don't exhaust the stack.
If it is a member of a class, it will depend on how the members are aligned. Sometimes members are not packed compactly in memory; padding is used so that some members start at an even address. If you, for example, have a byte and an int in a class, there will likely be three bytes of padding between them so that the int starts at the next address divisible by four.
int is 32 bits and long is 64 bits, so long takes twice as much memory (which is pretty insignificant for most applications).
long in Java is 64-bit and int is 32-bit, so obviously longs use more memory (8 bytes instead of 4 bytes).
ifwdev guessed correctly. Java defines an int as a 32-bit signed integer, and a long as a 64-bit signed integer. If you declare a variable as a long, then yes, it will take twice as much memory as the same variable declared as an int. In general, int is typically the "default" numeric type, even for values that could be contained in smaller types like a short. Unless you have a particular reason for requiring values greater than 2^31-1, use an int.
...would that cost me more memory?
You'll be using twice as much memory.
Before worrying about if you're using more memory or not, you should profile.
To use 1 extra megabyte of RAM by using long rather than int, you'd have to declare 262,144 long variables (or use that many indirectly in your program).
So if for some reason you declare one or two long variables when ints would do, you'll be using 4 or 8 bytes more of memory. Not too much to worry about (I mean, there might be worse memory problems in your app).
Taken from the Java Tutorial, here are the definitions of int and long:
int: The int data type is a 32-bit signed two's complement integer. It has a minimum value of -2,147,483,648 and a maximum value of 2,147,483,647 (inclusive). For integral values, this data type is generally the default choice unless there is a reason (like the above) to choose something else. This data type will most likely be large enough for the numbers your program will use, but if you need a wider range of values, use long instead.
long: The long data type is a 64-bit signed two's complement integer. It has a minimum value of -9,223,372,036,854,775,808 and a maximum value of 9,223,372,036,854,775,807 (inclusive). Use this data type when you need a range of values wider than those provided by int.
But remember: "Premature optimization is the root of all evil", according to Donald Knuth (according to me, Copy/Paste is the root of all evil, though).
If you know your data will fit in a specific data type (say short int in C), the only reason to use a bigger one is performance, right? And if that's your goal, regardless of how marginal your performance gain is, as a general rule of thumb you want to use a size that matches your architecture's word size (so for a normal 32-bit target system, you'd use a 32-bit type).
If you target more than one system, you can use a data type that matches the most often used one.
