How to store 100 million integers in a Java array?

I am working on a small task where I need to store around 1 billion integers in an array. However, I am running into a heap space problem. Could you please help me with this?
Machine details: Core 2 Duo processor with 4 GB RAM. I have even tried -Xmx3072m. Is there any workaround for this?
The same thing works in C++, so there should definitely be a way to store this many numbers in memory.
Below is the code and the exception I am getting:
public class test {
    private static int C[] = new int[10000 * 10000];

    public static void main(String[] args) {
        System.out.println(java.lang.Runtime.getRuntime().maxMemory());
    }
}
Exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at test.<clinit>(test.java:3)

Use an associative array. The key is an integer, and the value is the count (the number of times the integer has been added to the list).
This should get you some decent space savings if the distribution is relatively random, much more so if it's not.
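A minimal sketch of that counting idea (the class and method names are just for illustration); note it only saves space when the input actually contains many duplicates, since each map entry costs far more than a bare 4-byte int:

import java.util.HashMap;
import java.util.Map;

public class IntCounts {
    // Maps each distinct value to the number of times it has been added.
    private final Map<Integer, Integer> counts = new HashMap<>();

    public void add(int value) {
        counts.merge(value, 1, Integer::sum);   // increment the count for this value
    }

    public int countOf(int value) {
        return counts.getOrDefault(value, 0);
    }
}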

If you need to store 1 billion completely random integers then I am afraid that you really do need the corresponding space, i.e. about 4 GB of memory for 32-bit int values. You can try increasing the JVM heap space, but you need a 64-bit OS and at least as much physical memory - and there is only so far that you can go.
On the other hand, you might be able to store those numbers more efficiently if you can exploit specific constraints within your application.
E.g. if you only need to know whether a specific int is contained in a set, you could get away with a bit set - i.e. a single bit for each value in the int range. That is about 4 billion bits, i.e. 512 MB - a far more reasonable space requirement. For example, a handful of BitSet objects could cover the whole 32-bit integer range without you having to write any bit-handling code...
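A rough sketch of that bit-set idea, assuming you only need membership tests: two java.util.BitSet objects split the signed int range into its non-negative and negative halves (the wrapper class is made up for illustration):

import java.util.BitSet;

public class IntMembership {
    // One BitSet per half of the int range; together about 512 MB when fully populated.
    private final BitSet nonNegative = new BitSet();
    private final BitSet negative = new BitSet();

    public void add(int value) {
        if (value >= 0) {
            nonNegative.set(value);
        } else {
            negative.set(~value);   // ~value maps -1..Integer.MIN_VALUE onto 0..Integer.MAX_VALUE
        }
    }

    public boolean contains(int value) {
        return value >= 0 ? nonNegative.get(value) : negative.get(~value);
    }
}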

Maybe using memory-mapped files will help? They are not allocated from the heap.
Here is an article on how to create a matrix this way; a plain array should be even easier.
Using a memory mapped file for a huge matrix - Peter Lawrey
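A sketch of the memory-mapped approach (the file name and element count are made up for illustration; a single MappedByteBuffer is limited to 2 GB, so larger data sets need several mapped regions):

import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedInts {
    public static void main(String[] args) throws Exception {
        int count = 100_000_000;                          // 100 million ints, about 400 MB
        long bytes = (long) count * Integer.BYTES;
        try (RandomAccessFile file = new RandomAccessFile("ints.bin", "rw");
             FileChannel channel = file.getChannel()) {
            // The mapping lives outside the Java heap; one region must stay below 2 GB.
            MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
            IntBuffer ints = region.asIntBuffer();
            ints.put(0, 42);                              // write element 0
            System.out.println(ints.get(0));              // read it back
        }
    }
}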

You can increase the heap to 4 GB on a 32-bit system.
If you're on a 64-bit system you can go higher.
Type this in cmd:
java -Xmx4g programname

As an array this big may not fit into your RAM, you need to configure sufficient HDD swap space; 4 to 16 GB of swap does not look like something unrealistic these days.
Java only allows an int, not a long, as an array index. Hence the largest possible array can have 2,147,483,647 values - enough here.
Use -Xmx to raise the memory ceiling, which by default will probably be insufficient. 3072m is not enough, as one billion ints requires about 4 GB. Since space is also needed for the operating system and the like, a machine with 4 GB of RAM cannot hold the whole 4 GB data structure in memory.
The JRE or OS may also refuse to grant such a big piece of memory in one go, requiring you to allocate it in smaller chunks (maybe an array of arrays, as sketched below).
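A small sketch of that chunked allocation (the chunk size is arbitrary); note the total is still about 4 GB of heap, so this only avoids the single huge contiguous block, it does not reduce the overall memory requirement:

public class ChunkedInts {
    public static void main(String[] args) {
        final int CHUNK = 100_000_000;                    // 100 million ints per row, ~400 MB each
        final int ROWS = 10;                              // 10 rows, about 1 billion ints in total
        int[][] data = new int[ROWS][];
        for (int r = 0; r < ROWS; r++) {
            data[r] = new int[CHUNK];                     // each row is a separate allocation
        }
        long index = 999_999_999L;                        // a logical index into the billion values
        data[(int) (index / CHUNK)][(int) (index % CHUNK)] = 7;
        System.out.println(data[(int) (index / CHUNK)][(int) (index % CHUNK)]);
    }
}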

Related

How can I create an array or arraylist which the size is a BigInteger. I need a real big array [duplicate]

Is there a limit to the number of elements a Java array can contain? If so, what is it?
Using
OpenJDK 64-Bit Server VM (build 15.0.2+7, mixed mode, sharing)
... on MacOS, the answer seems to be Integer.MAX_VALUE - 2. Once you go beyond that:
cat > Foo.java << "END"
public class Foo {
    public static void main(String[] args) {
        boolean[] array = new boolean[Integer.MAX_VALUE - 1]; // too big
    }
}
END
java -Xmx4g Foo.java
... you get:
Exception in thread "main" java.lang.OutOfMemoryError:
Requested array size exceeds VM limit
This is (of course) totally VM-dependent.
Browsing through the source code of OpenJDK 7 and 8 java.util.ArrayList, .Hashtable, .AbstractCollection, .PriorityQueue, and .Vector, you can see this claim being repeated:
/**
 * Some VMs reserve some header words in an array.
 * Attempts to allocate larger arrays may result in
 * OutOfMemoryError: Requested array size exceeds VM limit
 */
private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
which is added by Martin Buchholz (Google) on 2010-05-09; reviewed by Chris Hegarty (Oracle).
So, probably we can say that the maximum "safe" number would be 2 147 483 639 (Integer.MAX_VALUE - 8) and "attempts to allocate larger arrays may result in OutOfMemoryError".
(Yes, Buchholz's standalone claim does not include backing evidence, so this is a calculated appeal to authority. Even within OpenJDK itself, we can see code like return (minCapacity > MAX_ARRAY_SIZE) ? Integer.MAX_VALUE : MAX_ARRAY_SIZE; which shows that MAX_ARRAY_SIZE does not yet have a real use.)
There are actually two limits. One, the maximum element indexable for the array and, two, the amount of memory available to your application. Depending on the amount of memory available and the amount used by other data structures, you may hit the memory limit before you reach the maximum addressable array element.
Going by this article http://en.wikipedia.org/wiki/Criticism_of_Java#Large_arrays:
Java has been criticized for not supporting arrays of more than 2^31 - 1 (about 2.1 billion) elements. This is a limitation of the language; the Java Language Specification, Section 10.4, states that:
Arrays must be indexed by int values... An attempt to access an array
component with a long index value results in a compile-time error.
Supporting large arrays would also require changes to the JVM. This limitation manifests itself in areas such as collections being limited to 2 billion elements and the inability to memory map files larger than 2 GiB. Java also lacks true multidimensional arrays (contiguously allocated single blocks of memory accessed by a single indirection), which limits performance for scientific and technical computing.
Arrays are indexed by non-negative integers, so the maximum array index you can access is Integer.MAX_VALUE. The other question is how big an array you can actually create. That depends on the maximum memory available to your JVM and on the element type of the array. Each array element has its own size, e.g. byte = 1 byte, int = 4 bytes, object reference = 4 bytes (on a 32-bit system).
So if you have 1 MB of memory available on your machine, you could allocate an array of byte[1024 * 1024] or Object[256 * 1024].
Answering your question: you can allocate an array of size (maximum available memory / size of one array item).
Summary: theoretically the maximum size of an array is Integer.MAX_VALUE. Practically it depends on how much memory your JVM has and how much of that has already been allocated to other objects.
I tried to create a byte array like this
byte[] bytes = new byte[Integer.MAX_VALUE-x];
System.out.println(bytes.length);
With this run configuration:
-Xms4G -Xmx4G
And java version:
Openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)
It only works for x >= 2, which means the maximum size of an array is Integer.MAX_VALUE - 2.
Values above that give
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at Main.main(Main.java:6)
Maximum number of elements of an array is (2^31)−1 or 2 147 483 647
Yes, there is a limit on Java arrays. Java uses an int as the array index, and the largest positive int the JVM can store is 2^31 - 1, so you can store at most 2,147,483,647 elements in an array.
If you need more than that maximum length you can use two different arrays, but the recommended approach is to store the data in a file, because storing data in a file has no such limit: files live on your storage drive, whereas arrays live in the JVM, which provides only limited space for program execution.
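A minimal sketch of that "store it in a file" suggestion, using DataOutputStream (the file name and count are only for illustration):

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class IntsToFile {
    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("numbers.bin")))) {
            for (int i = 0; i < 1_000_000; i++) {          // the same loop works for billions of values
                out.writeInt(i);                           // 4 bytes per value, written sequentially
            }
        }
    }
}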
Actually it's a JVM limitation capping it at 2^30 - 4, which is 1073741820, not 2^31 - 1. I don't know why, but I tested it manually on the JDK; 2^30 - 3 still throws the VM exception.
Edit: fixed -1 to -4, checked on the Windows JVM.
A Java array has a limit because it is indexed by an int, which means it can hold at most 2,147,483,647 elements.

can java.util.BitSet hold more than MAX_INT no. of bits?

As the BitSet.get() function takes an int argument, I was wondering whether I could store more than 2^32 bits in a BitSet, and if so, how would I retrieve them?
I am doing a Project Euler problem where I need to generate primes up to 10^10. The algorithm I'm currently using is the Sieve of Eratosthenes, storing the boolean values as bits in a BitSet. Is there any workaround for this?
You could keep a list of bitsets, as a List<BitSet>, and when the end of one bitset has been reached move on to the next one; see the sketch after this answer.
However, I think your approach is probably incorrect. Even if you use a single bit for each number you need 10^10 bits, which is about 1 GB of memory (8 bits in a byte and 1024^3 bytes in a GB). Most Project Euler problems should be solvable without needing that much memory.
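One way to realise the List<BitSet> idea is a thin wrapper that splits a long bit index into a segment number plus an offset. A rough sketch (the class name and segment size are arbitrary; 10^10 bits would need ten segments of about 128 MB each, roughly 1.25 GB in total):

import java.util.BitSet;

public class SegmentedBitSet {
    private static final int SEGMENT_BITS = 1 << 30;      // about 10^9 bits (128 MB) per segment
    private final BitSet[] segments;

    public SegmentedBitSet(long totalBits) {
        int count = (int) ((totalBits + SEGMENT_BITS - 1) / SEGMENT_BITS);
        segments = new BitSet[count];
        for (int i = 0; i < count; i++) {
            segments[i] = new BitSet(SEGMENT_BITS);
        }
    }

    public void set(long index) {
        segments[(int) (index / SEGMENT_BITS)].set((int) (index % SEGMENT_BITS));
    }

    public boolean get(long index) {
        return segments[(int) (index / SEGMENT_BITS)].get((int) (index % SEGMENT_BITS));
    }
}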
No, it's limited by the int indexing in its interface.
So its full potential is not exploited (it is roughly 64x smaller than its long[] backing could support), probably because it wasn't feasible to use that much RAM.
I worked on a LongBitSet implementation and published it here.
It can take 137,438,953,216 bits (the LongBitSet maximum size), compared with 2,147,483,647 (Integer.MAX_VALUE) for reference; in binary that maximum is 0b1111111_11111111_11111111_11111100_000000L.
I had to address some corner cases; in the commit history you can see that the first commit is a copy-paste of java.util.BitSet.
See the factory method:
public static LongBitSet getMaxSizeInstance() {
    // Integer.MAX_VALUE - 3 << ADDRESS_BITS_PER_WORD
    return new LongBitSet(0b1111111_11111111_11111111_11111100_000000L);
}
Note: -Xmx24G -Xms24G -ea is the minimum heap the JVM needs to be started with for getMaxSizeInstance() to succeed without a java.lang.OutOfMemoryError: Java heap space.

Maximum size of an array - Type mismatch: cannot convert from long to int

I see that the maximum size of an array can only be the maximum size of an int. Why does Java not allow an array of size Long.MAX_VALUE?
long no = 10000000000L;
int[] nums = new int[no]; // error here
You'll have to address the "why" question to the Java designers. Anyone else can only speculate. My speculation is that they felt that a two-billion-element array ought to be enough for anybody (which, in fairness, it probably is).
An int-sized length allows arrays of 2^31 - 1 ("~2 billion") elements. In the gigantically overwhelming majority of arrays' uses, that's plenty.
An array of that many elements will take between 2 gigabytes and 16 gigabytes of memory, depending on the element type. When Java appeared in 1995, new PCs had only around 8 megabytes of RAM. And those 32-bit operating systems, even if they used virtual memory on disk, had a practical limit on the size of a contiguous chunk of memory they could allocate which was quite a bit less than 2 gigabytes, because other allocated things are scattered around in a process's address space. Thus the limits of an int-sized array length were untestable, unforeseeable, and just very far away.
On 32-bit CPUs, arithmetic with ints is much faster than with longs.
Arrays are a basic internal type and they are used numerously. A long-sized length would take an extra 4 bytes per array to store, which in turn could affect packing together of arrays in memory, potentially wasting more bytes between them. (Even though the longer length would almost never be useful.)
If you ever do need in-RAM storage for more than ~2 billion items, you can use an array of arrays.
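A rough sketch of that array-of-arrays workaround: a wrapper that takes a long index and splits it into an outer and an inner int index (the class name and chunk size are made up for illustration):

public class LongIndexedIntArray {
    private static final int CHUNK_BITS = 27;
    private static final int CHUNK = 1 << CHUNK_BITS;      // 134,217,728 ints (512 MB) per chunk
    private final int[][] chunks;

    public LongIndexedIntArray(long size) {
        int outer = (int) ((size + CHUNK - 1) / CHUNK);
        chunks = new int[outer][];
        long remaining = size;
        for (int i = 0; i < outer; i++) {
            chunks[i] = new int[(int) Math.min(CHUNK, remaining)];
            remaining -= chunks[i].length;
        }
    }

    public int get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & (CHUNK - 1))];
    }

    public void set(long index, int value) {
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & (CHUNK - 1))] = value;
    }
}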
Unfortunately Java does not support arrays with more than 2^31 - 1 elements.
That would mean roughly 16 GiB of space for a long[] array of that length.
Try creating this...
Object[] array = new Object[Integer.MAX_VALUE - 4];
You should get an OutOfMemoryError... so the maximum size would be Integer.MAX_VALUE - 5.

In Java, empty HashMap space allocation

How can I tell how much space a pre-sized HashMap takes up before any elements are added? For example, how do I determine how much memory the following takes up?
HashMap<String, Object> map = new HashMap<String, Object>(1000000);
In principle, you can:
- calculate it by theory:
  - look at the implementation of HashMap to figure out what this constructor does.
  - look at the implementation of the VM to know how much space the individual created objects take.
- measure it somehow.
Most of the other answers are about the second way, so I'll look at the first one (in the OpenJDK source, 1.6.0_20).
The constructor uses a capacity that is the next power of two >= your initialCapacity parameter, thus 1048576 = 2^20 in our case.
It then creates a new Entry[capacity] and assigns it to the table field (additionally it assigns a few primitive fields).
So, we now have one quite small HashMap object (it contains only 3 ints, one float and one reference variable), and one quite big Entry[] object. This array needs space for its elements (which are ordinary reference variables) and some metadata (size, class).
So, it comes down to how big a reference variable is. This depends on VM implementation - usually in 32-bit VMs it is 32 bit (= 4 bytes), in 64-bit VMs 64 bit (= 8 bytes).
So, basically on 32-bit VMs your array takes 4 MB, on 64-bit VMs it takes 8 MB, plus some tiny administration data.
If you then fill your HashMap with mappings, each mapping corresponds to an Entry object. This Entry object consists of one int and three references, taking about 24 bytes on 32-bit VMs and maybe double that on 64-bit VMs. Thus your 1,000,000-mapping HashMap (assuming a load factor > 1) would take ~28 MB on 32-bit VMs and ~56 MB on 64-bit VMs.
Additionally to the key and value objects themselves, of course.
You could check memory usage before and after creation of the variable. For example:
long preMemUsage = Runtime.getRuntime().totalMemory()
        - Runtime.getRuntime().freeMemory();
HashMap<String, Object> map = new HashMap<String, Object>(1000000);
long postMemUsage = Runtime.getRuntime().totalMemory()
        - Runtime.getRuntime().freeMemory();
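The numbers from that snippet can jump around because of garbage collection; a rough, JVM-dependent refinement is to hint a GC before each sample (System.gc() is only a hint, so treat the output as an estimate):

import java.util.HashMap;

public class MapFootprint {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc();                                      // only a hint; results are JVM-dependent
        long before = rt.totalMemory() - rt.freeMemory();

        HashMap<String, Object> map = new HashMap<String, Object>(1000000);

        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        // Printing map.size() keeps the map reachable past the second sample.
        System.out.println("size=" + map.size() + ", approx bytes=" + (after - before));
    }
}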
The exact answer will depend on the version of Java you are using, the JVM vendor and the target platform, and is best determined by direct measurement, as described in other answers.
But as a simple estimate, the size is likely to be either ~4 * 2^20 or ~8 * 2^20 bytes, for a 32-bit or 64-bit JVM respectively.
Reasoning:
The Sun Java 1.6 implementation of HashMap has a fixed-size top-level object and a table field that points to the array of references to hash chains.
In a newly created (empty) HashMap the references are all null and the array size is the next power of two larger than the supplied initialCapacity. (Yes ... I checked the source code.)
A reference occupies 4 bytes on a typical 32bit JVM and 8 bytes on a typical 64 bit JVM. Some 64 bit JVMs support compact references ("compressed oops"), but you need to set JVM options to enable this.
The top object has 5 fields including the table array reference, but this is a relatively small constant overhead.
The top object and the array have object header overheads, but these are constant and relatively small.
Thus the size of the table array dominates, and it is 2^20 (the next power of 2 greater than 1,000,000) multiplied by the size of a reference.
So, this tells you that setting a large initial capacity really does use a lot of memory. On the other hand, if the initial capacity is a good estimate of the map's capacity when fully populated, you will save significant amounts of time by setting it. (This avoids a number of cycles of reallocating the array and rebuilding of the hash chains.)
You could probably use a profiler like VisualVM and track memory use.
Have a look at this too: http://www.velocityreviews.com/forums/t148009-java-hashmap-size.html
I'd have a look at this article: http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
In short, java does not have a C-style sizeof operator. You could use profiling tools, but IMO the above link gives the simplest solution.
Another piece of info that may be helpful: an empty java String consumes 40 bytes. One million of them would probably be at least 40MB...
I agree that a profiler is really the only way to tell. The other bit of relevant information is whether you're using a 32-bit or 64-bit JVM. The amount of overhead due to memory references (pointers) varies depending on that and whether you have compressed oops turned on. I've found that for smaller data sets the overhead of objects and pointers is significant.
In the latest version of Java 1.7 (I'm looking at 1.7.0_55) HashMap actually lazily instantiates its internal table. It's only instantiated when put() is called - see the private method "inflateTable()". So your HashMap, before you add anything to it at least, will occupy only the handful of bytes of object overhead and instance fields.
You should be able to use VisualVM (comes with JDK 6 or can be downloaded) to create a memory snapshot and inspect the allocated objects for their size.

Why does creating a big Java array consume so much memory?

Why does the following line
Object[] objects = new Object[10000000];
result in a lot of memory (~40M) being used by the JVM? Is there any way to know the internal workings of the VM when allocating arrays?
Well, that allocates enough space for 10000000 references, as well as a small amount of overhead for the array object itself.
The actual size will depend on the VM - but it's surely not surprising that it's taking up a fair amount of memory... I'd expect at least 40MB, and probably 80MB on a 64-bit VM, unless it's using compressed oops for arrays.
Of course, if you populate the array with that many distinct objects, that will take much, much more memory... but the array itself still needs space just for the references.
What do you mean by "a lot of memory"? You are allocating 10,000,000 pointers, each taking 4 bytes (on a 32-bit machine) - that is about 40 MB of memory.
You are creating ten million references to an object. A reference is at least 4 bytes; IIRC in Java it might be 8, but I'm unsure of that.
So with that one line you're creating 40 or 80 megabytes of data.
You are reserving space for ten million references. That is quite a bit.
It results in a lot of memory being used because it needs to allocate heap space for 10 million object references and their associated overhead.
To look into the internal workings of the JVM, you can check out its source code, as it is open source.
Your array has to hold 10 million object references, which on modern platforms are 64 bit (8 byte) pointers. Since it is allocated as a contiguous chunk of storage, it should take 80 million bytes. That's big in one sense, small compared to the likely amount of memory you have. Why does it bother you?
It creates an array with 10,000,000 reference pointers, all initialized to null.
What did you expect when you call this "a lot"?
Further reading
Size of object references in Java
One of the principal reasons arrays are used so widely is that their elements can be accessed in constant time. This means that the time taken to access a[i] is the same for each index i. This is because the address of a[i] can be determined arithmetically by adding a suitable offset to the address of the head of the array. The reason is that space for the contents of an array is allocated as a contiguous block of memory.
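The offset arithmetic described above looks roughly like this; the base address and header size below are made-up numbers purely for illustration:

public class ArrayAddressing {
    public static void main(String[] args) {
        long baseAddress = 0x7f00_0000L;   // hypothetical address of the array object
        long headerBytes = 16;             // hypothetical array header (mark word, class, length)
        long elementBytes = 4;             // 4 bytes per int element
        int i = 1234;
        // Constant-time access: the element's address is computed, never searched for.
        long elementAddress = baseAddress + headerBytes + (long) i * elementBytes;
        System.out.println("a[" + i + "] would live at 0x" + Long.toHexString(elementAddress));
    }
}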
According to this site, the memory usage for arrays is a 12-byte header + 4 bytes per element. If you declare an empty array of Object holding 10M elements, then you have just about 40 MB of memory used from the start. If you start actually filling that array with 10M objects, the size increases quite rapidly.
From this site, and I just tested it on my 64-bit machine, the size of a plain Object is about 31 bytes, so an array of 10M Object instances is just about 12 bytes + (4 + 31 bytes) * 10M = 350,000,012 bytes (about 334 MB).
If your array is holding other type of objects, then the size will be even larger.
I would suggest you use some kind of random access file(s) to hold you data if you have to keep so much data inside your program. Or even use a database such as Apache Derby, which will also enable you to sort and filter your data, etc.
I may be behind the times but I understood from the book Practical Java that Vectors are more efficient and faster than Arrays. Is it possible to use a Vector instead of an array?
