Alternative representation of a String

Alternative representation of a String - java

During the run of my program i create a lot of String(1.000.000) up to size of 700 and my program eats up a lot of memory.These Strings can contain only R,D,L,U as chars so i thought that i could represent them differently.I thought about using BitSet but i am not sure it is more memory efficient.Any ideas?
P.S:i could also shrink the String compressing equal chars(RRRRRRDDDD->R6D4) but i was hoping for a better solution.

as a first step, you could try to switch to char[]. Java String takes approx 40 bytes more than the sum of its characters (source) and char[] is considerably more convenient than bit arithmetic
even more economical is byte[] since one char requires two bytes allocation, while a byte is, of course, one byte (and still has room for 256 distinct values)

Related

Any good techniques for Memory Usage optimization for maximum String input length

I'm designing a string algorithm and the problem lies in the size of the input. By definition Java's maximum string length is 2147483647, to avoid confusion ~2.15x10^9.
Manacher's algorithm requires, by definition an array of characters:
char[n*2 +3] where n is the length of the input (string of size n)
By definition the maximum integer is the mentioned above ~2.15x10^9, so an array of characters can be of max size
char [ ~2.15x10^9 ];
This computation of manachers algorithm in java, lowers the limit of the inputed string to n = (~2.15x10^9 - 3) / 2 . To be accurate thats exactly 1073741822. ~1.1x10^9.
A char array of maximum length has (n*2) + 32 bytes = ( ~2.1x10^9 * 2 ) + 32 bytes = ~4.2x10^9 bytes (4.2GBs)
Theres additional arrays of various sizes, sets and other collections. I believe this will make the program take an entire space of ~~30GBs. for maximal input of RAM memory for the computation of the algorithm which we identified to be at most ~1.1x10^9 characters long.
Could you advice me on some technique to keep things even between "Longest possible string input" and "Memory management" ? Thank you

According to this article, the Manacher algorithm finds the longest palindromic substring in linear time (with n being the length of the original string).
Here's an implementation in Java, which shows the algorithm is also quite good at memory consumption (you need two arrays, one of chars and another one of ints, both having twice the length of the original string, and you also need to store the original string).
The problem is that your original string is extremely long, so you are reaching language limits, memory limits, etc.
On the other hand, your alphabet consists only of 7 characters: your original string characters A C T G, plus a letter separator (such as #) and start and end of string characters (such as $ and #). This means that you only need 3 bits to store each possible character. So, if you are willing to work with bitwise operators and bit masks, you could store 21 characters in a long (this is because a long is represented with 64 bits). This approach would be more complex to code, but it would use much less memory.
Another possible solution would be to work with dynamic structures instead of strings and arrays. These structures would use considerable memory, but it wouldn't be contiguous memory, meaning that you wouldn't reach max array size limits and language limits for ints, etc. This approach uses a suffix tree, which, according to this article, is a linear time approach. In that article there's a solution in C++. Good luck!

Memory efficiency: HashMap versus Array

I was thinking about the following situation: I want to count the occurrence of characters in a string (for example for a permutation check).
One way to do it would be to allocate an array with 256 integers (I assume that the characters are UTF-8), to fill it with zeros and then to go through the string and increment the integers on the array positions corresponding to the int value of the chars.
However, for this approach, you would have to allocate a 256 array each time, even when the analyzed string is very short (and consequently uses only a small part of the array).
An other approach would be to use a Character to Integer HashTable and to store a number for each encountered char. This way, you only would have keys for chars that actually are in the string.
As my understanding of the HashTable is rather theoretic and I do not really know how it is implemented in Java my question is: Which of the two approaches would be more memory efficient?
Edit:
During the discussion of this question (thank you for your answers everyone) I did realize that I had a very fuzzy understanding of the nature of UTF-8. After some searching, I have found this great video that I want to share, in case someone has the same problem.

Ich wonder why you choose 256 as the length of your array when you assume that your String is UTF-8. In UTF-8 a character can be composed of up to 4 bytes which means quite a number of more characters than just 256.
Anyway: Using a HashTable/HashMap needs a huge memory overhead. First all your characters and integer need to be wrapped in an object (Integer/Character). And Integer consumes about 3x as much memory as an int. For arrays the difference can be even larger due to the optimizations java performs on arrays (e.g. the java stack works only in multiples of 4 byte, while in an array java allows smaller types such as a char to consume only 2 bytes).
Then the HashTable itself creates a memory overhead because it needs to maintain an array (which is usually not fully used) and linked lists to maintain all objects which generate the same hash.
Additionally access times will be dramatically faster for arrays. You save multiple method invocations (add, hashCode, iterator,...) and there exist a number of opcode in java byte code to make working with arrays more efficient.
Anyway. You question was:
Which of the two approaches would be more memory efficient?
And it is safe to say that arrays will be more memory efficient.
However you should make absolutely sure what your requirements are. Do you need more memory efficiency? (Could be true if you process large amounts of data or you are on a slow device (mobile devices?)) How important is readability of code? How about size of code? Reuseability?
And ist 256 really the correct size?

Without looking in the code I know that a HashMap requires, at minimum, a base object, a hashtable array, and individual objects for each hash entry. Generally an int value would have to be stored as an Integer object so that's more objects. Let's assume you have 30 unique characters:
32 bytes for the base object
256 bytes for a minimum-size hashtable array
32 bytes for each of the 30 table entries
16 bytes (if highly optimized) for each of 30 Integers
32 + 256 + 960 + 480 = 1728 bytes. That's for a minimal, non-fancy implementation.
The array of 256 ints would be about 1056 bytes.

I would use the array. From a performance aspect, you have guaranteed constant access. Better than the what a hash table can get you.
As it also only uses an constant amount of memory, I see no downside. The HashMap will most likely need more memory, even if you only store a few elements.
By the way, the memory footprint should not be a concern, as you will only need the data structure as long as you need it for counting. Then it will be garbage collected, anyway.

Well here are the facts.
HashMap uses an array for its table behind the scenes.
So if you were actually limited by finding a contiguous space in memory, HashMap's benefit is only that the array may be smaller.
HashMap is generic and therefore uses objects.
Objects take up extra space. As I remember, it's typically 8 or 16 bytes minimum depending on whether it's a 32- or 64-bit system. This means the HashMap may very well not be smaller, even if the number of characters in the String is small. HashMap will require 3 extra objects for each entry: an Entry, a Character and an Integer. HashMap also needs to store the int for the index locally whereas the array does not.
That's beyond that there will be some extra computation using the HashMap.
I would also say space optimization is not something you should worry about here. Either way, the memory footprint is actually very small.

Initialize an array of integers that represent the int value of a char, for example the int value of f is 102 which is its ascii value
http://www.asciitable.com/
char c = 'f';
int x = (int)c;
If you know the range of char's youre dealing with then it is easier.
For each occurance of char increment the index of that char in the array by one. This approach would be slow if you have to iterate and complicated if you are to sort but wont be memory intensive.
Just be aware when you sort you lose the indexes

Purpose of byte type in Java

I read this line in the Java tutorial:
byte: The byte data type is an 8-bit signed two's complement integer. It has
a minimum value of -128 and a maximum value of 127 (inclusive). The
byte data type can be useful for saving memory in large arrays, where
the memory savings actually matters. They can also be used in place of
int where their limits help to clarify your code; the fact that a
variable's range is limited can serve as a form of documentation.
I don't clearly understand the bold line. Can somebody explain it for me?

Byte has a (signed) range from -128 to 127, where as int has a (also signed) range of −2,147,483,648 to 2,147,483,647.
What it means is that since the values you're going to use will always be between that range, by using the byte type you're telling anyone reading your code this value will be at most between -128 to 127 always without having to document about it.
Still, proper documentation is always key and you should only use it in the case specified for readability purposes, not as a replacement for documentation.

If you're using a variable which maximum value is 127 you can use byte instead of int so others know without reading any if conditions after, which may check the boundaries, that this variable can only have a value between -128 and 127.
So it's kind of self-documenting code - as mentioned in the text you're citing.
Personally, I do not recommend this kind of "documentation" - only because a variable can only hold a maximum value of 127 doesn't reveal it's really purpose.

Integers in Java are stored in 32 bits; bytes are stored in 8 bits.
Let's say you have an array with one million entries. Yikes! That's huge!
int[] foo = new int[1000000];
Now, for each of these integers in foo, you use 32 bits or 4 bytes of memory. In total, that's 4 million bytes, or 4MB.
Remember that an integer in Java is a whole number between -2,147,483,648 and 2,147,483,647 inclusively. What if your array foo only needs to contain whole numbers between, say, 1 and 100? That's a whole lot of numbers you aren't using, by declaring foo as an int array.
This is when byte becomes helpful. Bytes store whole numbers between -128 and 127 inclusively, which is perfect for what you need! But why choose bytes? Because they use one-fourth of the space of integers. Now your array is wasting less memory:
byte[] foo = new byte[1000000];
Now each entry in foo takes up 8 bits or 1 byte of memory, so in total, foo takes up only 1 million bytes or 1MB of memory.
That's a huge improvement over using int[] - you just saved 3MB of memory.
Clearly, you wouldn't want to use this for arrays that hold numbers that would exceed 127, so another way of reading the bold line you mentioned is, Since bytes are limited in range, this lets developers know that the variable is strictly limited to these bounds. There is no reason for a developer to assume that a number stored as a byte would ever exceed 127 or be less than -128. Using appropriate data types saves space and informs other developers of the limitations imposed on the variable.

I imagine one can use byte for anything dealing with actual bytes.
Also, the parts (red, green and blue) of colors commonly have a range of 0-255 (although byte is technically -128 to 127, but that's the same amount of numbers).
There may also be other uses.
The general opposition I have to using byte (and probably why it isn't seen as often as it can be) is that there's lots of casting needed. For example, whenever you do arithmetic operations on a byte (except X=), it is automatically promoted to int (even byte+byte), so you have to cast it if you want to put it back into a byte.
A very elementary example:
FileInputStream::read returns a byte wrapped in an int (or -1). This can be cast to an byte to make it clearer. I'm not supporting this example as such (because I don't really (at this moment) see the point of doing the below), just saying something similar may make sense.
It could also have returned a byte in the first place (and possibly thrown an exception if end-of-file). This may have been even clearer, but the way it was done does make sense.
FileInputStream file = new FileInputStream("Somefile.txt");
int val;
while ((val = file.read()) != -1)
{
byte b = (byte)val;
// ...
}
If you don't know much about FileInputStream, you may not know what read returns, so you see an int and you may assume the valid range is the entire range of int (-2^31 to 2^31-1), or possibly the range of a char (0-65535) (not a bad assumption for file operations), but then you see the cast to byte and you give that a second thought.
If the return type were to have been byte, you would know the valid range from the start.
Another example:
One of Color's constructors could have been changed from 3 int's to 3 byte's instead, since their range is limited to 0-255.

It means that knowing that a value is explicitly declared as a very small number might help you recall the purpose of it.
Go for real docs when you have to create a documentation for your code, though, relying on datatypes is not documentation.

An int covers the values from 0 to 4294967295 or 2 to the 32nd power. This is a huge range and if you are scoring a test that is out of 100 then you are wasting that extra spacce if all of your numbers are between 0 and 100. It just takes more memory and harddisk space to store ints, and in serious data driven applications this translates to money wasted if you are not using the extra range that ints provide.

byte data types are generally used when you want to handle data in the forms of streams either from file or from network. Reason behind this is because network and files works on the concept of byte.
Example: FileOutStream always takes byte array as input parameter.

How to manage and manipulate extremely large binary values

I need to read in a couple of extremely large strings which are comprised of binary digits. These strings can be extremely large (up to 100,000 digits) which I need to store, be able to manipulate (flip bits) and add together. My first though was to split the string in to 8 character chunks, convert them to bytes and store them in an array. This would allow me to flip bits with relative ease given an index of the bit needed to be flipped, but with this approach I'm unsure how I would go about adding the entirety of the two values together.
Can anyone see a way of storing these values in a memory efficient manner which would allow me to be able to still be able to perform calculations on them?
EDIT:
"add together" (concatenate? arithmetic addition?) - arithmetic addition
My problem is that in the hardest case I have two 100,000 bit numbers (stored in an array of 12,500 bytes). Storing and manually flipping bits isn't an issue, but I need the sum of both numbers and then to be able to find out what the xth bit of this is.

"Strings of binary digits" definitely sound like byte arrays to me. To "add" two such byte arrays together, you'd just allocate a new byte array which is big enough to hold everything, and copy the contents using System.arraycopy.
However that assumes each "string" is a multiple of 8 bits. If you want to "add" a string of 15 bits to another string of 15 bits, you'll need to do bit-shifting. Is that likely to be a problem for you? Depending on what operations you need, you may even want to just keep an object which knows about two byte arrays and can find an arbitrary bit in the logically joined "string".
Either way, byte[] is going to be the way forward - or possibly BitSet.

What about
// Addition
byte[] resArr = new byte[byteArr1.length];
for (int i=0; i<byteArr1.length; i++)
{
res = byteArr1[i]+byteArr2[i];
}
?
Is it something like this you are trying to do?

Reading a UTF-8 String from a ByteBuffer where the length is an unsigned int

I am trying to read a UTF8 string via a java.nio.ByteBuffer. The size is an unsinged int, which, of course, Java doesn't have. I have read the value into a long so that I have the value.
The next issue I have is that I cannot create an array of bytes with the long, and casting he long back to an int will cause it to be signed.
I also tried using limit() on the buffer, but again it works with int not long.
The specific thing I am doing is reading the UTF8 strings out of a class file, so the buffer has more in it that just the UTF8 string.
Any ideas on how to read a UTF8 string that has a potential length of an unsigned int from a ByteBuffer.
EDIT:
Here is an example of the issue.
SourceDebugExtension_attribute {
u2 attribute_name_index;
u4 attribute_length;
u1 debug_extension[attribute_length];
}
attribute_name_index
The value of the attribute_name_index item must be a valid index into the constant_pool table. The constant_pool entry at that index must be a CONSTANT_Utf8_info structure representing the string "SourceDebugExtension".
attribute_length
The value of the attribute_length item indicates the length of the attribute, excluding the initial six bytes. The value of the attribute_length item is thus the number of bytes in the debug_extension[] item.
debug_extension[]
The debug_extension array holds a string, which must be in UTF-8 format. There is no terminating zero byte.
The string in the debug_extension item will be interpreted as extended debugging information. The content of this string has no semantic effect on the Java Virtual Machine.
So, from a technical point of view, it is possible to have a string in the class file that is the full u4 (unsigned, 4 bytes) in length.
These won't be an issue if there is a limit to the size of a UTF8 string (I am no UTF8 expert so perhaps there is such a limit).
I could just punt on it and go with the reality that there is not going to be a String that long...

Unless your array of bytes is more than 2GB (the largest positive value of a Java int), you won't have a problem with casting the long back into a signed int.
If your array of bytes needs to be more than 2GB in length, you're doing it wrong, not least because that's way more than the default maximum heapsize of the JVM...

Having signed int won't be your main problem. Say you had a String which was 4 billion in length. You would need a ByteBuffer which is at least 4 GB, a byte[] which is at least 4 GB. When you convert this to a String, you need at least 8 GB (2 bytes per character) and a StringBuilder to build it. (Of at least 8 GB)
All up you need, 24 GB to process 1 String. Even if you have a lot of memory you won't get many Strings of this size.
Another approach is to treat the length as signed and if unsigned treat as a error as you won't have enough memory to process the String in any case. Even to handle a String which is 2 billion (2^31-1) in length you will need 12 GB to convert it to a String this way.

Java arrays use a (Java, i.e. signed) int for access as per the languge spec, so it's impossible to have an String (which is backed by a char array) longer than Integer.MAX_INT
But even that much is way too much to be processing in one chunk - it'll totally kill performance and make your program fail with an OutOfMemoryError on most machines if a sufficiently large String is ever encountered.
What you should do is process any string in chunks of a sensible size, say a few megs at a time. Then there's no practical limit on the size you can deal with.

I guess you could implement CharSequence on top of a ByteBuffer. That would allow you to keep your "String" from turning up on the heap, although most utilities that deal with characters actually expect a String. And even then, there is actually a limit on CharSequence as well. It expects the size to be returned as an int.
(You could theoretically create a new version of CharSequence that returns the size as a long, but then there's nothing in Java that would help you in dealing with that CharSequence. Perhaps it would be useful if you would implement subSequence(...) to return an ordinary CharSequence.)

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.