How do you make REALLY large boolean arrays using Java?

When I try to make a very large boolean array using Java, such as:
boolean[] isPrime1 = new boolean[600851475144];
I get a "possible loss of precision" compile error.
Is it too big?

To store 600 billion bits, you need an absolute minimum address space of 75 gigabytes! Good luck with that!
Even worse, the Java spec doesn't specify that a boolean array will use a single bit of memory for each element - it could (and in some cases does) use more.
In any case, I recognise that number from Project Euler #3. If it needs that much memory, you're doing it wrong...

Consider using a BitSet.
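For illustration, a BitSet-based sieve might look like the sketch below (my code, not from the thread; note that BitSet is also indexed by int, so it cannot reach 600851475144 either):

import java.util.BitSet;

public class SieveSketch {
    // Bit i is set iff i is composite (for i >= 2).
    static BitSet composites(int limit) {
        BitSet composite = new BitSet(limit);
        for (int i = 2; (long) i * i < limit; i++) {
            if (!composite.get(i)) {
                for (long j = (long) i * i; j < limit; j += i) {
                    composite.set((int) j);
                }
            }
        }
        return composite;
    }

    public static void main(String[] args) {
        BitSet composite = composites(50);
        for (int i = 2; i < 50; i++) {
            if (!composite.get(i)) System.out.print(i + " "); // primes below 50
        }
    }
}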

Since you're attempting to solve Euler-problem #3 the wrong way, here's a hint: You're supposed to find all the prime factors of a number, not all the prime numbers below a certain limit.
BTW: This particular Euler-problem can be solved using a very small amount of RAM.

An array index is an int, not a long, so your "array" is too big to fit into an array. One of the Java Collection classes might be more suited. Never mind - Collection.size() returns an int as well, so Collection can't store more than Integer.MAX_VALUE items either.

Um... that would be about 70GB worth of booleans. Not gonna work. No way.

The problem is you are using a long value vs. an int value for the size of the array. Java does not support array lengths longer than the maximum value of an int. Java is treating your length as a long because the size you specified exceeds the maximum value for an int but fits within a long. Hence it must convert the length back to an int to create an array. The conversion from long to int is producing the warning you are seeing.

You can use an array of longs, encapsulated in a class that would handle all the operations on the array. Something like your own implementation of BitSet.
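A minimal sketch of that idea (my code; note that even a long[] of Integer.MAX_VALUE entries only reaches about 2^37 bits, still short of 600 billion):

public class LongBitArray {
    private final long[] words;

    public LongBitArray(long nbits) {
        long needed = (nbits + 63) >>> 6; // 64 bits per long
        if (needed > Integer.MAX_VALUE)
            throw new IllegalArgumentException("still too big: " + nbits + " bits");
        words = new long[(int) needed];
    }

    public void set(long index, boolean value) {
        int word = (int) (index >>> 6);  // index / 64
        long mask = 1L << (index & 63);  // index % 64
        if (value) words[word] |= mask; else words[word] &= ~mask;
    }

    public boolean get(long index) {
        return (words[(int) (index >>> 6)] & (1L << (index & 63))) != 0;
    }
}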

Why not just store the values in a file, and then seek to the position in the file and pull up the right value. Like others have stated, that's 70GB of data. In most cases, you wouldn't even be able to hold that in memory. If you're going to store it to a file, you could even look at individual bits when storing and retrieving the data using bitwise operators to save on storage space.
Also, since the number of primes decreases with the size of the numbers, it's probably better just to store the prime numbers themselves in the file, in order, and then do a binary search for the number to see if it is one of the primes.

What values do you have in the array?
For such a large number I guess it's going to be a sparse array, so it may be best to use a Map/List and only store an entry for each bit whose value is 1 (or for the 0 bits, if most of your values will be 1).
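As a sketch of that sparse approach (my code): keep only the indices of the 1 bits in a Set, so memory scales with the number of set bits rather than with the index range.

import java.util.HashSet;
import java.util.Set;

public class SparseBits {
    public static void main(String[] args) {
        Set<Long> ones = new HashSet<Long>();              // indices whose "bit" is 1
        ones.add(600851475143L);                           // set a bit
        System.out.println(ones.contains(600851475143L));  // true
        System.out.println(ones.contains(42L));            // false
    }
}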

Apache ActiveMQ has a datastructure called BitArrayBin. This is used to find out whether a message is duplicated. A message ID is a combination of producer ID and sequence ID.
Each producer will have a BitArrayBin to track its sequence IDs. Once it finds out the BitArrayBin for the given producer, it sets the sequence ID which is a long value to the BitArrayBin.
boolean oldValue = bitArrayBin.setBit(sequenceId, true);
if (oldValue) {
    // message is duplicated
}
The method returns the old value.
If y is the long index, it is used to derive a bin index and an offset into it:
y = bin index * 64 + offset
BitArrayBin is nothing but a holder for many bins where the size can be defined during its construction. Each bin contains a long variable to store the bits, so it can store up to 64 boolean values.
Bit masking is used to set the bit and then get its value.
This class doesn't have much documentation. You need to go through its source code to know the internals.

Related

code giving Exception! can you figure out why? MSS [duplicate]

I'm trying to create a byte array whose size is of type long. For example, think of it as:
long x = _________;
byte[] b = new byte[x];
Apparently you can only specify an int for the size of a byte array.
Before anyone asks why I would need a byte array so large, I'll say I need to encapsulate data of message formats that I am not writing, and one of these message types has a length of an unsigned int (long in Java).
Is there a way to create this byte array?
I am thinking if there's no way around it, I can create a byte array output stream and keep feeding it bytes, but I don't know if there's any restriction on the size of a byte array...
(It is probably a bit late for the OP, but it might still be useful for others)
Unfortunately Java does not support arrays with more than 2^31 - 1 elements. The maximum consumption is 2 GiB of space for a byte[] array, or 16 GiB of space for a long[] array.
While it is probably not applicable in this case, if the array is going to be sparse, you might be able to get away with using an associative data structure like a Map to match each used offset to the appropriate value. In addition, Trove provides a more memory-efficient implementation for storing primitive values than standard Java collections.
If the array is not sparse and you really, really do need the whole blob in memory, you will probably have to use a two-dimensional structure, e.g. with a Map matching each offset divided by 1024 to the proper 1024-byte array. This approach might be more memory efficient even for sparse arrays, since adjacent filled cells can share the same Map entry.
A byte[] with size of the maximum 32-bit signed integer would require 2GB of contiguous address space. You shouldn't try to create such an array. Otherwise, if the size is not really that large (and it's just a larger type), you could safely cast it to an int and use it to create the array.
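A sketch of that range check before the cast (the helper name is mine):

static byte[] allocate(long size) {
    // Reject sizes that cannot be represented as an int array length.
    if (size < 0 || size > Integer.MAX_VALUE)
        throw new IllegalArgumentException("size does not fit in an int: " + size);
    return new byte[(int) size];
}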
You should probably be using a stream to read your data in and another to write it out. If you are going to need access to data later on in the file, save it. If you need access to something you haven't run into yet, you need a two-pass system: run through once and store the stuff you'll need for the second pass, then run through again.
Compilers work this way.
The only case for loading in the entire array at once is if you have to repeatedly randomly access many locations throughout the array. If this is the case, I suggest you load it into multiple byte arrays all stored in a single container class.
The container class would have an array of byte arrays, but from outside all the accesses would seem contiguous. You would just ask for byte 4987432912871439183 and your class would divide your long by the size of each byte array to calculate which array to access, then use the remainder to determine the byte.
It could also have methods to store and retrieve "chunks" that could span byte-array boundaries, which would require creating a temporary copy--but the cost of creating a few temporary arrays would be more than made up for by the fact that you don't have a locked 2 GB space allocated, which I think could just destroy your performance.
Edit: ps. If you really need the random access and can't use streams then implementing a containing class is a Very Good Idea. It will let you change the implementation on the fly from a single byte array to a group of byte arrays to a file-based system without any change to the rest of your code.
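A minimal sketch of such a container class (my code; the chunk size is a power of two so the index math is a shift and a mask):

public class BigByteArray {
    private static final int CHUNK_BITS = 30;             // 1 GiB per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private final byte[][] chunks;

    public BigByteArray(long size) {
        int n = (int) ((size + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new byte[n][];
        long remaining = size;
        for (int i = 0; i < n; i++) {
            chunks[i] = new byte[(int) Math.min(CHUNK_SIZE, remaining)];
            remaining -= chunks[i].length;
        }
    }

    public byte get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & (CHUNK_SIZE - 1))];
    }

    public void set(long index, byte value) {
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & (CHUNK_SIZE - 1))] = value;
    }
}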
It's not of immediate help, but creating arrays with larger sizes (via longs) is a proposed language change for Java 7. Check out the Project Coin proposals for more info.
One way to "store" the array is to write it to a file and then access it (if you need to access it like an array) using a RandomAccessFile. The api for that file uses long as an index into file instead of int. It will be slower, but much less hard on the memory.
This is when you can't extract what you need during the initial input scan.
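For illustration, a rough sketch of wrapping RandomAccessFile behind array-like get/set (my code; every access costs a seek):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileBackedBytes {
    private final RandomAccessFile file;

    public FileBackedBytes(String path, long size) throws IOException {
        file = new RandomAccessFile(path, "rw");
        file.setLength(size); // pre-size the "array"
    }

    public byte get(long index) throws IOException {
        file.seek(index);
        return file.readByte();
    }

    public void set(long index, byte value) throws IOException {
        file.seek(index);
        file.writeByte(value);
    }

    public void close() throws IOException {
        file.close();
    }
}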

Implementing efficient data structure using Arrays only

As part of my programming course I was given an exercise to implement my own String collection. I was planning on using the ArrayList collection or similar, but one of the constraints is that we are not allowed to use any Java API to implement it, so only arrays are allowed. I could have implemented this using arrays, however efficiency is very important, as is the amount of data that this code will be tested with. I was advised to use hash tables or ordered trees as they are more efficient than arrays. After doing some research I decided to go with hash tables because they seemed easy to understand and implement, but once I started writing code I realised it is not as straightforward as I thought.
So here are the problems I have come up with and would like some advice on what is the best approach to solve them again with efficiency in mind:
ACTUAL SIZE: If I understood it correctly, hash tables are not ordered (indexed), which means there are going to be gaps between items because the hash function gives scattered indices. So how do I know when the array is full and I need to resize it?
RESIZE: One of the difficulties is that I need to create a dynamic data structure using arrays. So if I have an array String[100], once it gets full I will need to resize it by some factor (I decided to increase it by 100 each time). Once I do that, I will need to change the positions of all existing values, since their hash keys will be different, as the key is calculated:
int position = "orange".hashCode() % currentArraySize;
So if I try to find a certain value, its hash key will be different from what it was when the array was smaller.
HASH FUNCTION: I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.
DEALING WITH MULTIPLE OCCURRENCES: one of the requirements is to be able to add multiple words that are the same, because I need to be able to count how many times each word is stored in my collection. Since duplicates will have the same hash code, I was planning to add the next occurrence at the next index, hoping that there would be a gap. I don't know if it is the best solution, but here is how I implemented it:
public int count(String word) {
    int count = 0;
    while (collection[(word.hashCode() % size) + count] != null
            && collection[(word.hashCode() % size) + count].equals(word))
        count++;
    return count;
}
Thank you in advance for your advice. Please ask if anything needs to be clarified.
P.S. The length of words is not fixed and varies greatly.
UPDATE Thank you for your advice. I know I made a few silly mistakes there; I will try to do better. I took all your suggestions and quickly came up with the following structure. It is not elegant, but I hope it is roughly what you meant. I did have to make a few judgement calls, such as the bucket size; for now I halve the number of elements, but is there a way to calculate it, or some generally accepted value? Another uncertainty was by what factor to increase my array: should I multiply by some number n, or is adding a fixed amount also applicable? Also, I was wondering about general efficiency, because I am actually creating instances of classes, but String is a class too, so I am guessing the difference in performance should not be too big?
ACTUAL SIZE: The built-in Java HashMap just resizes when the total number of elements exceeds the number of buckets multiplied by a number called the load factor, which is by default 0.75. It does not take into account how many buckets are actually full. You don't have to, either.
RESIZE: Yes, you'll have to rehash everything when the table is resized, which does include recomputing its hash.
So if I try to find a certain value, its hash key will be different from what it was when the array was smaller.
Yup.
HASH FUNCTION: Yes, you should use the built-in hashCode() function. It's good enough for basic purposes.
DEALING WITH MULTIPLE OCCURRENCES: This is complicated. One simple solution would just be to have the hash entry for a given string also keep count of how many occurrences of that string are present. That is, instead of keeping multiple copies of the same string in your hash table, keep an int along with each String counting its occurrences.
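A sketch of such an entry, for illustration (the names are mine):

class Entry {
    final String word; // each distinct word stored once
    int count;         // number of occurrences of this word
    Entry next;        // next entry in the same bucket (separate chaining)

    Entry(String word) {
        this.word = word;
        this.count = 1;
    }
}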
So how do I know when array is full and I need to resize it?
You keep track of the size, just as HashMap does. When the size used > capacity * load factor, you grow the underlying array, either as a whole or in part.
int position = "orange".hashCode() % currentArraySize;
Some things to consider.
The % of a negative value is a negative value.
Math.abs can return a negative value (for Integer.MIN_VALUE).
Using & with a bit mask is faster, however you need a size which is a power of 2 (see the sketch below).
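For illustration, a sketch of the masked index computation (the helper name is mine):

static int indexFor(String word, int capacity) {
    // capacity must be a power of two; the mask keeps the result
    // in [0, capacity) even when hashCode() is negative.
    return word.hashCode() & (capacity - 1);
}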
I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.
The built-in hashCode is cached, so it is fast. However it is not a great hashCode and has poor randomness in the low bits, and in the high bits for short strings. You might want to implement your own hashing strategy, possibly a 64-bit one.
DEALING WITH MULTIPLE OCCURRENCES:
This is usually done with a counter for each key. This way you can have, say, 32767 duplicates (if you use a short) or 2 billion (if you use an int) of the same key/element.

How to remove possible loss of precision error in java

Here is the part of my code where I am getting the error:
long p = 1000000000 - size - 1;
long j;
for (j = p; j > p - k; j--)
{
    sum2 = sum2 + sum[j];
}
System.out.print(sum2);
I think it is because I am using a long variable to define the size of the array.
How can I deal with this error?
Or, if I am wrong, please tell me how I could declare an array containing 10^10 elements.
Java does not provide a way to create arrays with a size bigger than Integer.MAX_VALUE (which is about 2*10^9). Also, are you sure you have enough memory to store such an array? If you want to store an array of 10^10 int values, you will need at least 40 GB of RAM.
Change both the variables to int. If the values don't fit into an int the code won't work anyway.
I think it is because I am using a long variable to define the size of array. How could I deal with this error
No it isn't, because you aren't. You can't.
Aside from the maximum array size concerns and the availability of enough virtual memory:
You have an integer literal 1000000000, but what you actually want is a long literal 1000000000L.
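For illustration, the line with a long literal as suggested:

long p = 1000000000L - size - 1; // 1000000000L is a long literal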

How to count string num with limit memory?

The task is to count the occurrences of each word in an input file.
The input file has 8 chars per line, and there are 10M lines, for example:
aaaaaaaa
bbbbbbbb
aaaaaaaa
abcabcab
bbbbbbbb
...
The output is:
aaaaaaaa 2
abcabcab 1
bbbbbbbb 2
...
It'll take 80 MB of memory if I load all of the words into memory, but only 60 MB of system memory is available for this task. So how can I solve this problem?
My algorithm is to use a Map<String,Integer>, but the JVM throws Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. I know I could solve this by setting -Xmx1024m, for example, but I want to use less memory to solve it.
I believe that the most robust solution is to use the disk space.
For example you can sort your file into another file, using an algorithm for sorting large files (one that uses disk space), and then count the consecutive occurrences of the same word.
I believe that this post can help you. Or search for material on external sorting yourself.
Update 1
Or as @jordeu suggests, you can use a Java embedded database library, like H2, JavaDB, or similar.
Update 2
I thought about another possible solution, using a prefix tree. However I still prefer the first one, because I'm not an expert on them.
Read one line at a time and then have e.g. a HashMap<String,Integer> where you put your words as keys and the counts as values. If a key exists, increase the count. Otherwise add the key to the map with a count of 1. There is no need to keep the whole file in memory.
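A minimal sketch of that loop (my code; the file name is a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader("words.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            Integer c = counts.get(line);          // null if not seen yet
            counts.put(line, c == null ? 1 : c + 1);
        }
        in.close();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}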
I guess you mean the number of distinct words, do you?
So the obvious approach is to store (distinctive information about) each different word as a key in a map, where the value is the associated counter. Depending on how many distinct words are expected, storing all of them may even fit into your memory, however not in the worst case scenario when all words are different.
To lessen memory needs, you could calculate a checksum for the words and store that, instead of the words themselves. Storing e.g. a 4-byte checksum instead of an 8-character word (requiring at least 9 bytes to store) requires 40M instead of 90M. Plus you need a counter for each word too. Depending on the expected number of occurrences for a specific word, you may be able to get by with 2 bytes (for max 65535 occurrences), which requires max 60M of memory for 10M distinct words.
Update
Of course, the checksum can be calculated in many different ways, and it can be lossless or not. This also depends a lot on the character set used in the words. E.g. if only lowercase standard ASCII characters are used (as shown in the examples above), we have 26 different characters at each position. Consequently, each character can be losslessly encoded in 5 bits. Thus 8 characters fit into 5 bytes, which is a bit more than the limit, but may be dense enough, depending on the circumstances.
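For illustration, a sketch of that 5-bit packing (my code; assumes the word is exactly 8 lowercase a-z characters):

static long pack(String word) {
    long packed = 0;
    for (int i = 0; i < 8; i++) {
        packed = (packed << 5) | (word.charAt(i) - 'a'); // 0..25 fits in 5 bits
    }
    return packed; // only the low 40 bits are used
}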
I suck at explaining theoretical answers but here we go....
I have made some assumptions about your question, as it is not entirely clear:
The memory used to store all the distinct words is 80MB (the entire file is bigger).
The words could contain non-ascii characters (so we just treat the data as raw bytes).
It is sufficient to read over the file twice storing ~ 40MB of distinct words each time.
// Loop over the file and for each word:
//
// Compute a hash of the word.
// Convert the hash to a number by some means (skip if possible).
// If the number is odd then skip to the next word.
// Use conventional means to store the distinct word.
//
// Do something with all the distinct words.
Then repeat the above a second time using even instead of odd.
Then you have divided the task into 2 and can do each separately.
No words from the first set will appear in the second set.
The hash is necessary because the words could (in theory) all end with the same letter.
The solution can be extended to work with different memory constraints. Rather than saying just odd/even we can divide the words into X groups by using number MOD X.
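A sketch of that grouping test (my code): run one pass over the file per group, and on pass g only store and count the words whose hash falls in group g.

static boolean inGroup(String word, int pass, int groups) {
    int h = word.hashCode() % groups;
    if (h < 0) h += groups; // % can yield a negative value in Java
    return h == pass;       // process this word only on its own pass
}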
Use the H2 Database Engine; it can work on disk or in memory if necessary, and it has really good performance.
I'd create a SHA-1 of each word, then store these numbers in a Set. Then, when reading a word, compute its SHA-1 and check whether the Set already contains it (not strictly necessary, since a Set is by definition unique, so you can just add its SHA-1 number as well).
Depending on what kind of character the words are build of you can chose for this system:
If it might contain any character of the alphabet in upper and lower case, you will have (26*2)^8 combinations, which is 53459728531456. This number fits in a long datatype.
So compute the checksum for the strings like this:
public static long checksum(String str)
{
    String tokens = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    long checksum = 0;
    for (int i = 0; i < str.length(); ++i)
    {
        int c = tokens.indexOf(str.charAt(i));
        checksum *= tokens.length();
        checksum += c;
    }
    return checksum;
}
This will reduce the memory taken per word by more than 8 bytes. A String is backed by an array of char, and each char in Java is 2 bytes, so 8 chars = 16 bytes. But the String class contains more data than only the char array; it holds some ints for size and offset as well, at 4 bytes per int. Don't forget the memory pointers to the Strings and the char arrays either. So a raw estimation makes me think that this saves about 28 bytes per word.
So, 8 bytes per word and 10,000,000 words gives 76 MB, which shows your first estimate was wrong because it ignored all the overhead I noted. So this means that even this method won't work.
You can convert each 8-byte word into a long and use TLongIntHashMap, which is quite a bit more efficient than Map<String, Integer> or Map<Long, Integer>.
If you just need the distinct words, you can use TLongHashSet.
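For illustration, a sketch with Trove 3.x (my code; assumes each word is exactly 8 ASCII characters):

import gnu.trove.map.hash.TLongIntHashMap;

public class TroveCount {
    static long toLong(String word) {
        long key = 0;
        for (int i = 0; i < 8; i++) {
            key = (key << 8) | word.charAt(i); // one byte per ASCII char
        }
        return key;
    }

    public static void main(String[] args) {
        TLongIntHashMap counts = new TLongIntHashMap();
        counts.adjustOrPutValue(toLong("aaaaaaaa"), 1, 1); // add 1, or insert 1
        counts.adjustOrPutValue(toLong("aaaaaaaa"), 1, 1);
        System.out.println(counts.get(toLong("aaaaaaaa"))); // 2
    }
}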
If you can sort your file first (e.g. using the memory-efficient "sort" utility on Unix), then it's easy. You simply read the sorted items, counting the neighboring duplicates as you go, and write the totals to a new file immediately.
If you need to sort using Java, this post might help:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
You can use constant memory by reading your file multiple times.
Basic idea:
Treat the file as n partitions p_1...p_n, sized so that you can load each of them into RAM.
Load p_i into a Map structure, scan through the whole file and keep track of counts of the p_i elements only (see answer of Heiko Rupp)
Remove an element if we already encountered the same value in an earlier partition p_j with j < i
Output result counts for elements in the Map
Clear Map, repeat for all p_1...p_n
As in any optimization, there are tradeoffs. In your case, you can do the same task with less memory but it comes at the cost of increasing runtime.
Your scarce resource is memory, so you can't store the words in RAM.
You could use a hash instead of the word as other posts mention, but if your file grows in size this is no solution, since at some point you'll run into the same problem again.
Yes, you could use an external web server to crunch the file and do the job for your client app, but reading your question it seems that you want to do the whole thing in one place (your app).
So my proposal is to iterate over the file, and for each word:
If the word was found for the first time, write the string to a result file together with the integer value 1.
If the word was processed before (it will appear in the result file), increment the record value.
This solution scales well regardless of the number of lines in your input file or the length of the words*.
You can optimize the way you do the writes in the output file, so that the search is made faster, but the basic version described above is enough to work.
EDIT:
*It scales well until you run out of disk space XD. So the precondition would be to have a disk with at least 2N bytes of free usable space, where N is the input file size in bytes.
Possible solutions:
Use file sorting and then just count the consecutive occurrences of each value.
Load the file in a database and use a count statement like this: select value, count(*) from table group by value

in java, which is better - three arrays of booleans or 1 array of bytes?

I know the question sounds silly, but consider this: I have an array of ints (1..N) and a labelling algorithm. At any point, the item an int represents is in one of three states. The current version holds these states in a byte array, where 0, 1 and 2 represent the three states. Alternatively, I could have three arrays of booleans, one for each state. Which is better (consumes less memory) depends on how the JVM (Sun's version) stores the arrays: is a boolean represented by 1 bit? Is there any other magic happening behind the scenes? (P.S. don't start with all that "this is not the way OO/Java works"; I know, but here performance comes first. Plus the algorithm is simple and perfectly readable even in such a form.)
Thanks a lot
Instead of two booleans or 1 int, just use a BitSet - http://java.sun.com/j2se/1.4.2/docs/api/java/util/BitSet.html
You can then have two bits per label/state. And BitSet being a standard Java class, you are likely to get good performance.
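A sketch of the two-bits-per-state idea (my code; states are 0, 1 or 2):

import java.util.BitSet;

public class TwoBitStates {
    private final BitSet bits;

    public TwoBitStates(int n) {
        bits = new BitSet(2 * n); // two bits per element
    }

    public void set(int i, int state) {
        bits.set(2 * i, (state & 1) != 0);      // low bit
        bits.set(2 * i + 1, (state & 2) != 0);  // high bit
    }

    public int get(int i) {
        return (bits.get(2 * i) ? 1 : 0) | (bits.get(2 * i + 1) ? 2 : 0);
    }
}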
Theoretically, with 3 boolean arrays you'll need to do:
firstState[n] = false;
secondState[n] = true;
thirdState[n] = false;
every time you want to change the n-th element's state. That is 3 index operations and 3 assignment operations.
With 1 byte array you'll need:
elements[n] = 1;
It's more readable and 3 times faster. And one more advantage of this solution is that you can easily add as many new states as you want (whereas with boolean arrays you'd need to introduce new arrays).
But I don't think you'll ever see the performance difference.
UPD: actually I'd do this in a more Java way (even though you said you're not looking for that) and use an array of enums. This will make it much clearer and will give you some flexibility (maybe in the future you'll decide that OOP is not such a bad thing):
enum ElementState {
    FIRST, SECOND, THIRD;
}
ElementState[] elementStates = new ElementState[N];
...
elementStates[i] = ElementState.FIRST;
The JVM second edition spec (http://java.sun.com/docs/books/jvms/second_edition/html/Overview.doc.html) specifies that boolean arrays are encoded as (0,1), but doesn't specify the type used. So the particular JVM may or may not use bit - it could use int.
However, if performance is paramount, using a single byte per element would seem to be your best option anyway.
EDIT: I incorrectly said that boolean arrays are stored as bit arrays - this is possible but implementation specific.
If you want a guaranteed minimum you could use three java.util.BitSets. These will only use one bit per flag (though you will have the extra object overhead, that may outweigh the benefits if the number of flags is small.) I would say if you have a large number of objects BitSet may be a better alternative, otherwise an array of byte constants or enums will lead to more readable code (and the extra storage shouldn't be a real concern.)
The array of bytes is much better!
In most implementations a boolean takes a whole byte, so with three boolean arrays you would use 3 bytes per state where 1 byte is enough (and in theory you can reduce it to only 1 bit; see the other posts).
With a byte array, you simply set the byte to the value you want. With three arrays you have to change the value in every array!
While developing your application, you may find you need an extra state. That would mean creating yet another array, and having to change 4 values on every update (see the previous point).
So, I hope we persuaded you!
