In my Java code, I am using Guava's Multimap (com.google.common.collect.Multimap), created like this:
Multimap<Integer, Integer> index = HashMultimap.create();
Here, the Multimap key is some portion of a URL and the value is another portion of the URL (converted into an integer). Now, I assign my JVM 2560 MB (2.5 GB) of heap space (using -Xmx and -Xms). However, it can only store about 9-10 million such (key, value) pairs of integers. But theoretically (given the memory occupied by an int) it should store more.
Can anybody help me:
1. Why is the Multimap using so much memory? I checked my code and, without inserting pairs into the Multimap, it uses only 1/2 MB of memory.
2. Is there another way, or a home-baked solution, to solve this memory issue? That is, is there any way to reduce the object overhead, since I only want to store int-int pairs? In any other language? Or any other solution (home-baked preferred), such as a DB-based one?
There's a huge amount of overhead associated with Multimap. At a minimum:
Each key and value is an Integer object, which (at a minimum) doubles the storage requirements of each int value.
Each unique key value in the HashMultimap is associated with a Collection of values (according to the source, the Collection is a HashSet).
Each HashSet is created with default space for 8 values.
So each key/value pair requires (at a minimum) perhaps an order of magnitude more space than you might expect for two int values. (Somewhat less when multiple values are stored under a single key.) I would expect 10 million key/value pairs to take perhaps 400MB.
Although you have 2.5GB of heap space, I wouldn't be all that surprised if that's not enough. The above estimate is, I think, on the low side. Plus, it only accounts for how much is needed to store the map once it is built. As the map grows, the table needs to be reallocated and rehashed, which temporarily at least doubles the amount of space used. Finally, all this assumes that int values and object references require 4 bytes. If the JVM is using 64-bit addressing, the byte count probably doubles.
Probably the simplest way to minimize the memory overhead would be to mix Trove's primitive collection implementations (to avoid the memory overhead of boxing) with Guava's Multimap, something like:
// requires Guava (SetMultimap, Multimaps, com.google.common.base.Supplier) and Trove 3 (TDecorators, TIntObjectHashMap, TIntHashSet)
SetMultimap<Integer, Integer> multimap = Multimaps.newSetMultimap(
    TDecorators.wrap(new TIntObjectHashMap<Collection<Integer>>()),
    new Supplier<Set<Integer>>() {
      @Override
      public Set<Integer> get() {
        return TDecorators.wrap(new TIntHashSet());
      }
    });
That still has the overhead of boxing and unboxing on queries, but the memory it consumes just sitting there would be significantly reduced.
It sounds like you need a sparse boolean matrix. A search for sparse matrix / array libraries in Java should provide pointers to library code. Then, instead of putting (i, j) into the multimap, just put a 1 (true) into the matrix at [i][j].
You could probably use an ArrayListMultimap, which requires less memory than a HashMultimap, since ArrayLists are smaller than HashSets. Or, you could modify Louis's Trove solution, replacing the Set with a List, to reduce memory usage further.
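A minimal sketch of the ArrayListMultimap route (assuming Guava on the classpath; the literal ints stand in for the URL-derived key and value):
import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.ListMultimap;

ListMultimap<Integer, Integer> index = ArrayListMultimap.create();
index.put(42, 7);   // hypothetical key/value derived from a URL
index.put(42, 9);   // duplicate values per key are kept, unlike with HashMultimap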
Some applications depend on the fact that HashMultimap satisfies the SetMultimap interface, but most don't.
Related
I have a JavaPairRDD<Integer, Integer[]> on which I want to perform a groupByKey action.
The groupByKey action gives me a:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
which is practically an OutOfMemory error, if I am not mistaken. This occurs only in big datasets (in my case when "Shuffle Write" shown in the Web UI is ~96GB).
I have set:
spark.serializer org.apache.spark.serializer.KryoSerializer
in $SPARK_HOME/conf/spark-defaults.conf, but I am not sure if Kryo is used to serialize my JavaPairRDD.
Is there something else that I should do to use Kryo, apart from setting this conf parameter, to serialize my RDD? I can see in the serialization instructions that:
Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library.
and that:
Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
I also noticed that when I set spark.serializer to be Kryo, the Shuffle Write in the Web UI increases from ~96GB (with default serializer) to 243GB!
EDIT: In a comment, I was asked about the logic of my program, in case groupByKey can be replaced with reduceByKey. I don't think it's possible, but here it is anyway:
Input has the form:
key: index bucket id,
value: Integer array of entity ids in this bucket
The shuffle write operation produces pairs in the form:
key: entityId,
value: Integer array of all entity ids in the same bucket (call them neighbors)
The groupByKey operation gathers all the neighbor arrays of each entity, some possibly appearing more than once (in many buckets).
After the groupByKey operation, I keep a weight for each bucket (based on the number of negative entity ids it contains) and for each neighbor id I sum up the weights of the buckets it belongs to.
I normalize the scores of each neighbor id with another value (let's say it's given) and emit the top-3 neighbors per entity.
The number of distinct keys that I get is around 10 million (around 5 million positive entity ids and 5 million negatives).
EDIT2: I tried using Hadoop's Writables (VIntWritable and VIntArrayWritable extending ArrayWritable) instead of Integer and Integer[], respectively, but the shuffle size was still bigger than the default JavaSerializer.
Then I increased spark.shuffle.memoryFraction from 0.2 to 0.4 (even though it is deprecated in version 2.1.0, with no description of what should be used instead) and enabled off-heap memory, and the shuffle size was reduced by ~20GB. Even though this does what the title asks, I would prefer a more algorithmic solution, or one that includes better compression.
Short Answer: Use fastutil and maybe increase spark.shuffle.memoryFraction.
More details:
The problem with this RDD is that Java needs to store Object references, which consume much more space than primitive types. In this example, I need to store Integers instead of int values. A Java Integer takes 16 bytes, while a primitive Java int takes 4 bytes. Scala's Int type, on the other hand, is a 32-bit (4-byte) type, just like Java's int, which is why people using Scala may not have faced something similar.
Apart from increasing the spark.shuffle.memoryFraction to 0.4, another nice solution was to use the fastutil library, as suggested in Spark's tuning documentation:
The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this: Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.
This enables storing each element in the int arrays of my pair RDD as a primitive int (i.e., using 4 bytes instead of 16 for each element of the array). In my case, I used IntArrayList instead of Integer[]. This made the shuffle size drop significantly and allowed my program to run in the cluster. I also used this library in other parts of the code, where I was making some temporary Map structures. Overall, by increasing spark.shuffle.memoryFraction to 0.4 and using the fastutil library, the shuffle size dropped from 96GB to 50GB (!) using the default Java serializer (not Kryo).
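For illustration, a sketch of that conversion (buckets is a stand-in name for the original JavaPairRDD<Integer, Integer[]>; fastutil must be available on both the driver and the executors):
import it.unimi.dsi.fastutil.ints.IntArrayList;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<Integer, IntArrayList> compact = buckets.mapValues(ids -> {
    IntArrayList list = new IntArrayList(ids.length);  // pre-sized, stores primitive ints
    for (Integer id : ids) {
        list.add(id.intValue());                        // unbox once, 4 bytes per stored element
    }
    return list;
});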
Alternative: I have also tried sorting each int array of an RDD pair and storing the deltas using Hadoop's VIntArrayWritable type (smaller numbers use less space than bigger numbers), but this also required registering VIntWritable and VIntArrayWritable in Kryo, which didn't save any space after all. In general, I think that Kryo only makes things work faster, but does not decrease the space needed, but I am still not sure about that.
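For reference, the Kryo registration for those classes looked roughly like this (a sketch, assuming the SparkConf is built programmatically rather than only through spark-defaults.conf; VIntArrayWritable is my own class mentioned above):
import org.apache.hadoop.io.VIntWritable;
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // register the classes that actually travel through the shuffle
    .registerKryoClasses(new Class<?>[] { VIntWritable.class, VIntArrayWritable.class });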
I am not marking this answer as accepted yet, because someone else might have a better idea, and because I didn't use Kryo after all, as my OP was asking. I hope reading it will help someone else with the same issue. I will update this answer if I manage to further reduce the shuffle size.
Still not really sure what you want to do. However, because you use groupByKey and say that there is no way to do it by using reduceByKey, it makes me more confused.
I think you have rdd = (Integer, Integer[]) and you want something like (Integer, Iterable[Integer[]]) that's why you are using groupByKey.
Anyway, I am not really familiar with Java in Spark, but in Scala I would use reduceByKey to avoid the shuffle by
rdd.mapValues(Iterable(_)).reduceByKey(_ ++ _)
Basically, you want to convert the value to a list of arrays and then combine the lists together.
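For anyone using the Java API, a rough equivalent of that Scala one-liner might look like this (a sketch; rdd stands for the JavaPairRDD<Integer, Integer[]> from the question):
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<Integer, List<Integer[]>> grouped = rdd
    .mapValues(array -> {
        List<Integer[]> single = new ArrayList<>();
        single.add(array);              // wrap each value in a one-element list
        return single;
    })
    .reduceByKey((a, b) -> {
        List<Integer[]> merged = new ArrayList<>(a.size() + b.size());
        merged.addAll(a);               // combine the partial lists per key
        merged.addAll(b);
        return merged;
    });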
I think the best approach that can be recommended here (without more specific knowledge of the input data) in general is to use the persist API on your input RDD.
As step one, I'd try to call .persist(MEMORY_ONLY_SER) on the input RDD to lower memory usage (albeit at a certain CPU overhead, which shouldn't be that much of a problem for ints in your case).
If that is not sufficient, you can try out .persist(MEMORY_AND_DISK_SER); or, if your shuffle still takes so much memory that the input dataset needs to be made easier on the memory, .persist(DISK_ONLY) may be an option, but one that will strongly deteriorate performance.
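In the Java API, that sequence of fallbacks looks roughly like this (a sketch; input stands for your JavaPairRDD):
import org.apache.spark.storage.StorageLevel;

input.persist(StorageLevel.MEMORY_ONLY_SER());        // step one: serialized, in memory only
// if that is not enough:
// input.persist(StorageLevel.MEMORY_AND_DISK_SER()); // spill serialized partitions to disk
// input.persist(StorageLevel.DISK_ONLY());           // last resort, much slower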
I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320"
I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with
tweetids = new HashSet<String>(220000,0.80F);
and that lets it get a little farther, but it is still excruciatingly slow (by around 10 million items it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million), should I create a HashSet that rehashes only two or three times, or would the overhead of such a set incur too many time penalties? Would things work better if I weren't using a String, or if I defined a different hashCode function (which, for a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.
tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: for (each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))) {
    duplicates++;
    continue;
}
SOLUTION
Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, HashSet<String> was simply enormous and uncalled for, because String.hashCode() is exorbitant at this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. The final implementation was simple, and looked like this:
import gnu.trove.set.hash.TLongHashSet;
...
TLongHashSet tweetids; // class variable
...
tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
// inside for(each record)
String twid = (String) tweet_twitter_data.get("id");
if (!(tweetids.add(Long.parseLong(twid)))) {
    duplicates++;
    continue;
}
You may want to look beyond the Java collections framework. I've done some memory-intensive processing and you will face several problems:
The number of buckets for large hash maps and hash sets is going to cause a lot of overhead (memory). You can influence this by using some kind of custom hash function and a modulo of e.g. 50000.
Strings are represented using 16 bit characters in Java. You can halve that by using utf-8 encoded byte arrays for most scripts.
HashMaps are in general quite wasteful data structures and HashSets are basically just a thin wrapper around those.
Given that, take a look at trove or guava for alternatives. Also, your ids look like longs. Those are 64 bit, quite a bit smaller than the string representation.
An alternative you might want to consider is using Bloom filters (Guava has a decent implementation). A Bloom filter would tell you if something is definitely not in a set, and with reasonable certainty (less than 100%) if something is contained. That, combined with some disk-based solution (e.g. database, MapDB, memcached, ...), should work reasonably well. You could buffer up incoming new ids, write them in batches, and use the Bloom filter to check if you need to look in the database, thus avoiding expensive lookups most of the time.
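A sketch of the Bloom-filter part with Guava (the expected-insertion count and false-positive rate are assumptions you would tune):
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

BloomFilter<Long> maybeSeen = BloomFilter.create(Funnels.longFunnel(), 22_000_000, 0.01);

long id = Long.parseLong(twid);       // 'twid' as in the question's loop
if (maybeSeen.mightContain(id)) {
    // possibly a duplicate: confirm against the disk-based store (database, MapDB, ...)
} else {
    maybeSeen.put(id);                // definitely not seen before
}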
If you are just looking for the existence of Strings, then I would suggest you try using a Trie (also called a prefix tree). The total space used by a Trie should be less than that of a HashSet, and it's quicker for string lookups.
The main disadvantage is that it can be slower when used from a hard disk, as it loads a tree rather than a linearly stored structure like a hash. So make sure that it can be held inside of RAM.
The link I gave is a good list of pros/cons of this approach.
As an aside, the Bloom filters suggested by Jilles van Gurp make great, fast prefilters.
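To make the Trie idea concrete, here is a minimal, unoptimized digit-trie sketch for numeric ids like tweet IDs (purely illustrative; a production version would use a more compact node layout to actually save memory):
class DigitTrie {
    private static final class Node {
        Node[] next = new Node[10];  // one child slot per digit '0'..'9'
        boolean terminal;            // true if an id ends at this node
    }

    private final Node root = new Node();

    /** Returns true if the id was newly added, false if it was already present. */
    boolean add(String id) {
        Node node = root;
        for (int i = 0; i < id.length(); i++) {
            int d = id.charAt(i) - '0';
            if (node.next[d] == null) {
                node.next[d] = new Node();
            }
            node = node.next[d];
        }
        boolean isNew = !node.terminal;
        node.terminal = true;
        return isNew;
    }
}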
Simple, untried and possibly stupid suggestion: Create a Map of Sets, indexed by the first/last N characters of the tweet ID:
Map<String, Set<String>> sets = new HashMap<String, Set<String>>();
String tweetId = "166471306949304320";
String prefix = tweetId.substring(0, 5);
if (!sets.containsKey(prefix)) {
    sets.put(prefix, new HashSet<String>());  // create the bucket only once per prefix
}
sets.get(prefix).add(tweetId);
assert(sets.containsKey(prefix) && sets.get(prefix).contains(tweetId));
That easily lets you keep the maximum size of the hashing space(s) below a reasonable value.
I'm running some experiments over a large dataset and would like to optimize a particular part. Currently, I have 5-6 Models each of which stores a mapping from Topics to List of Strings. The set of Topics is large and the same between each Model, so there must be a better way. Ultimately the query I need to perform is: what is the String in position x of the List for some Model-Topic combination.
One of the problems with using the mapping method is that if there are say 500k-5M topics, each has a list of 20 strings. Then my Map<Model, Map<Topic, List<String>>> is going to be massive.
Have you tried SortedSets / SortedMaps? Sounds like you need to optimize your search: sorted collections (like TreeMap) give O(log n) lookups, while searching through a regular list is O(n). Of course, this kind of thing is something at which databases excel...
Not clear where/how you want to achieve "memory efficiency". First, one needs to look at the particulars of your detailed data to see how much storage it consumes, then examine various ways of organizing it and analyze their efficiency in terms of % overhead vs your "real" data.
A brief glance shows that a HashMap, when you consider the associated tables, has about 80 bytes of overhead per entry. An ArrayList looks to average out around 10-12. Without looking, I would guess that a TreeMap would be more than a HashMap -- maybe 100.
Generally speaking, links within your own objects will be "cheaper", both in storage and speed to access, than links using these aggregating objects. But the aggregating objects are convenient to use, and have been "optimized" to a degree.
(But looking at your update, you probably should be looking at a DB application, rather than holding everything in heap.)
You could use Topic and Model to construct a composite key in a single Map, e.g.
map.put(topic1_id + model1_id, list1_1);
map.put(topic1_id + model2_id, list1_2);
...
map.get(topic_id + model_id)
where the IDs are Strings (or a similar scheme could be used with numeric identifiers).
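A sketch of that composite-key idea with a separator character, which avoids ambiguous concatenations (e.g. "1"+"23" colliding with "12"+"3"); the identifiers and method names here are made up for illustration:
import java.util.HashMap;
import java.util.List;
import java.util.Map;

Map<String, List<String>> listByTopicAndModel = new HashMap<>();

void put(String topicId, String modelId, List<String> strings) {
    listByTopicAndModel.put(topicId + '|' + modelId, strings);
}

String query(String topicId, String modelId, int x) {
    return listByTopicAndModel.get(topicId + '|' + modelId).get(x);
}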
A similar approach is to assign each topic and model a unique number, then store the lists of strings in arrays, so looking up the list for a given combination is a matter of looking up two indexes, then accessing a given location in a 2D array. (however, this is easier when you know the number of topics and models in advance of constructing the data structure)
For memory efficiency, also consider the small details. In general, you want to minimise the number of objects: each object carries an overhead. ArrayLists can have a lot of wasted space, since they grow dynamically (by roughly 50% of the current capacity in the standard implementation) when they exceed their current capacity. If you can pre-size them to the required capacity (or use an array instead) then you can save a lot of memory. The same applies when using large numbers of small HashMaps.
One possible data structure is a hierarchy of maps, leading to an array of Strings. E.g.:
HashMap<Model, HashMap<Topic, String[]>> map;
A query function would then look like:
public String query(Model model, Topic topic, int x) {
    HashMap<Topic, String[]> childMap = map.get(model);
    if (childMap == null) {
        return null;
    }
    String[] list = childMap.get(topic);
    if (list == null) {
        return null;
    }
    return list[x];
}
Presuming your Model and Topic structures implement hashCode() and equals() reasonably, the query performance should be quite good.
One potential weakness: I'm assuming you need to index a large number of Model/Topic combinations, and related lists of Strings (if not, you presumably wouldn't be asking about optimization). My guess is that the child String[] arrays will consume a large amount of memory. Each array is a Java object (about 20 bytes) + a pointer at each array location.
Two suggestions there:
1) If many Model/Topic combinations share the same set of Strings, you could gain quite a lot by sharing those String[] instances (a sketch follows after this list).
2) If you're using a 64-bit VM, be sure to use compressed ordinary object pointers (-XX:+UseCompressedOops). That will at least keep most of the pointers to 4 bytes instead of 8. Compressed OOPs is the default since 1.6.0_23, so a relatively recent VM will save you some memory here.
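A sketch of suggestion 1, canonicalizing identical String[] contents so that Model/Topic combinations with the same strings share one array instance (the names here are made up; it assumes the arrays are not mutated afterwards):
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

Map<List<String>, String[]> canonical = new HashMap<>();

String[] intern(String[] strings) {
    // List equality/hashCode are element-based, so equal contents map to one shared array
    return canonical.computeIfAbsent(Arrays.asList(strings), key -> strings);
}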
One other possibility not mentioned is to store the strings in a String[][][] and the models and topics in Lists (such as ArrayLists), and then at query time:
public String query(Model model, Topic topic, int x) {
    return strings[models.indexOf(model)][topics.indexOf(topic)][x];
}
It could be further improved for speed if the topics and models were sorted, then binary search rather than indexOf could be used.
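For example, the sorted variant could look like this (a sketch; it assumes Model and Topic implement Comparable and that models and topics are kept in sorted order):
import java.util.Collections;

public String query(Model model, Topic topic, int x) {
    int m = Collections.binarySearch(models, model);   // O(log n) instead of O(n) indexOf
    int t = Collections.binarySearch(topics, topic);
    return strings[m][t][x];
}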
Possible Duplicate:
What exactly are hashtables?
I understand the purpose of using hash functions to securely store passwords. I have used arrays and arraylists for class projects for sorting and searching data. What I am having trouble understanding is the practical value of hashtables for something like sorting and searching.
I got a lecture on hashtables but we never had to use them in school, so it hasn't clicked. Can someone give me a practical example of a task a hashtable is useful for that couldn't be done with a numerical array or arraylist? Also, a very simple low level example of a hash function would be helpful.
There are all sorts of collections out there. Collections are used for storing and retrieving things, so one of the most important properties of a collection is how fast these operations are. To estimate "fastness", people in computer science use big-O notation, which roughly means how many individual operations you have to perform to invoke a certain method (be it get or set, for example). So, for example, to get an element of an ArrayList by an index you need exactly 1 operation, which is O(1); if you have a LinkedList of length n and you need to get something from the middle, you'll have to traverse from the start of the list to the middle, taking n/2 operations, so in this case get has complexity O(n). The same goes for key-value stores such as hashtables. There are implementations that give you complexity of O(log n) to get a value by its key, whereas a hashtable copes in O(1). Basically it means that getting a value from a hashtable by its key is really cheap.
Basically, hashtables have similar performance characteristics to arrays with numerical indices (cheap lookup, and cheap appending; hashtables are unordered, which is partly why adding to them is cheap), but are much more flexible in terms of what the key may be. Given a continuous chunk of memory and a fixed size per item, you can get the address of the nth item very easily and cheaply. That's thanks to the indices being integers; you can't do that with, say, strings. At least not directly. Hashing allows reducing any object (that implements it) to a number, and then you're back to arrays. You still need to add checks for hash collisions and resolve them (which incurs mostly a memory overhead, since you need to store the original value), but with a halfway decent implementation, this is not much of an issue.
So you can now associate any (hashable) object with any (really any) value. This has countless uses (although I have to admit, I can't think of one that's applicable to sorting or searching). You can build caches with small overhead (because checking whether the cache can help in a given case is O(1)), implement a relatively performant object system (several dynamic languages do this), you can go through a list of (id, value) pairs and accumulate the values for identical ids in any way you like, and many other things.
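As a tiny, self-contained example of the "accumulate the values for identical ids" use case:
import java.util.HashMap;
import java.util.Map;

Map<String, Integer> totals = new HashMap<>();
String[][] pairs = { {"a", "3"}, {"b", "1"}, {"a", "4"} };  // made-up (id, value) pairs
for (String[] pair : pairs) {
    totals.merge(pair[0], Integer.parseInt(pair[1]), Integer::sum);
}
// totals is now {a=7, b=1}: one cheap O(1) lookup per pair, regardless of how many ids exist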
Very simple. Hashtables are often called "associative arrays." Arrays allow you to access your data by index. Hashtables allow you to access your data by any other identifier, e.g. a name. For example:
one is associated with 1
two is associated with 2
So, when you get the word "one" you can find its value 1 using a hashtable where the key is "one" and the value is 1. An array only allows the opposite mapping (from a numeric index to a value).
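In Java, that example looks like this:
import java.util.HashMap;
import java.util.Map;

Map<String, Integer> numberByWord = new HashMap<>();
numberByWord.put("one", 1);
numberByWord.put("two", 2);
int n = numberByWord.get("one");   // 1 -- looked up by name, not by numeric index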
For n data elements:
Hashtables allow O(k) searches (usually dependent only on the hashing function). This is better than O(log n) for binary searches (which require an O(n log n) sort first; if the data is not sorted you are worse off).
However, on the flip side, the hashtables tend to take roughly 3n amount of space.
I am wondering what is the memory overhead of java HashMap compared to ArrayList?
Update:
I would like to improve the speed for searching for specific values of a big pack (6 Millions+) of identical objects.
Thus, I am thinking about using one or several HashMap instead of using ArrayList. But I am wondering what is the overhead of HashMap.
As far as I understand, the key is not stored, only the hash of the key, so it should be something like the size of the hash of the object + one pointer.
But what hash function is used? Is it the one offered by Object or another one?
If you're comparing HashMap with ArrayList, I presume you're doing some sort of searching/indexing of the ArrayList, such as binary search or a custom hash table...? Because a .get(key) through 6 million entries would be infeasible using a linear search.
Using that assumption, I've done some empirical tests and come up with the conclusion that "You can store 2.5 times as many small objects in the same amount of RAM if you use ArrayList with binary search or a custom hash map implementation, versus HashMap". My test was based on small objects containing only 3 fields, of which one is the key, and the key is an integer. I used a 32-bit JDK 1.6. See below for caveats on this figure of "2.5".
The key things to note are:
(a) it's not the space required for references or "load factor" that kills you, but rather the overhead required for object creation. If the key is a primitive type, or a combination of 2 or more primitive or reference values, then each key will require its own object, which carries an overhead of 8 bytes.
(b) In my experience you usually need the key as part of the value, (e.g. to store customer records, indexed by customer id, you still want the customer id as part of the Customer object). This means it is IMO somewhat wasteful that a HashMap separately stores references to keys and values.
Caveats:
The most common type used for HashMap keys is String. The object creation overhead doesn't apply here so the difference would be less.
I got a figure of 2.8, being 8880502 entries inserted into the ArrayList compared with 3148004 into the HashMap on -Xmx256M JVM, but my ArrayList load factor was 80% and my objects were quite small - 12 bytes plus 8 byte object overhead.
My figure, and my implementation, requires that the key is contained within the value, otherwise I'd have the same problem with object creation overhead and it would be just another implementation of HashMap.
My code:
public class Payload {
    int key, b, c;
    Payload(int _key) { key = _key; }
}
import org.junit.Test;
import java.util.HashMap;
import java.util.Map;
public class Overhead {
    @Test
    public void useHashMap()
    {
        int i = 0;
        try {
            Map<Integer, Payload> map = new HashMap<Integer, Payload>();
            for (i = 0; i < 4000000; i++) {
                int key = (int) (Math.random() * Integer.MAX_VALUE);
                map.put(key, new Payload(key));
            }
        }
        catch (OutOfMemoryError e) {
            System.out.println("Got up to: " + i);
        }
    }

    @Test
    public void useArrayList()
    {
        int i = 0;
        try {
            ArrayListMap map = new ArrayListMap();
            for (i = 0; i < 9000000; i++) {
                int key = (int) (Math.random() * Integer.MAX_VALUE);
                map.put(key, new Payload(key));
            }
        }
        catch (OutOfMemoryError e) {
            System.out.println("Got up to: " + i);
        }
    }
}
import java.util.ArrayList;
public class ArrayListMap {
    private ArrayList<Payload> map = new ArrayList<Payload>();
    private int[] primes = new int[128];

    static boolean isPrime(int n)
    {
        for (int i = (int) Math.sqrt(n); i >= 2; i--) {
            if (n % i == 0)
                return false;
        }
        return true;
    }

    ArrayListMap()
    {
        for (int i = 0; i < 11000000; i++) // this is clumsy, I admit
            map.add(null);
        int n = 31;
        for (int i = 0; i < 128; i++) {
            while (! isPrime(n))
                n += 2;
            primes[i] = n;
            n += 2;
        }
        System.out.println("Capacity = " + map.size());
    }

    public void put(int key, Payload value)
    {
        int hash = key % map.size();
        int hash2 = primes[key % primes.length];
        if (hash < 0)
            hash += map.size();
        do {
            if (map.get(hash) == null) {
                map.set(hash, value);
                return;
            }
            hash += hash2;
            if (hash >= map.size())
                hash -= map.size();
        } while (true);
    }

    public Payload get(int key)
    {
        int hash = key % map.size();
        int hash2 = primes[key % primes.length];
        if (hash < 0)
            hash += map.size();
        do {
            Payload payload = map.get(hash);
            if (payload == null)
                return null;
            if (payload.key == key)
                return payload;
            hash += hash2;
            if (hash >= map.size())
                hash -= map.size();
        } while (true);
    }
}
The simplest thing would be to look at the source and work it out that way. However, you're really comparing apples and oranges - lists and maps are conceptually quite distinct. It's rare that you would choose between them on the basis of memory usage.
What's the background behind this question?
All that is stored in either is pointers. Depending on your architecture a pointer should be 32 or 64 bits (or more or less)
An array list of 10 tends to allocate 10 "Pointers" at a minimum (and also some one-time overhead stuff).
A map has to allocate twice that (20 pointers) because it stores two values at a time. Then, on top of that, it has to store the hash values; the hash table should be bigger than the map itself, so at a load factor of 75% it SHOULD be around 13 32-bit values (hashes).
so if you want an offhand answer, the ratio should be about 1:3.25 or so, but you are only talking pointer storage--very small unless you are storing a massive number of objects--and if so, the utility of being able to reference instantly (HashMap) vs iterate (array) should be MUCH more significant than the memory size.
Oh, also:
Arrays can be fit to the exact size of your collection. HashMaps can as well if you specify the size, but if it "Grows" beyond that size, it will re-allocate a larger array and not use some of it, so there can be a little waste there as well.
I don't have an answer for you either, but a quick google search turned up a function in Java that might help.
Runtime.getRuntime().freeMemory();
So I propose that you populate a HashMap and an ArrayList with the same data. Record the free memory, delete the first object, record memory, delete the second object, record the memory, compute the differences,..., profit!!!
You should probably do this with magnitudes of data. ie Start with 1000, then 10000, 100000, 1000000.
EDIT: Corrected, thanks to amischiefr.
EDIT:
Sorry for editing your post, but this is pretty important if you are going to use this (and it's a little much for a comment).
freeMemory does not work like you think it would. First, its value is changed by garbage collection. Secondly, its value is changed when Java allocates more memory. Just using the freeMemory call alone doesn't provide useful data.
Try this:
public static void displayMemory() {
    Runtime r = Runtime.getRuntime();
    r.gc();
    r.gc(); // YES, you NEED 2!
    System.out.println("Memory Used=" + (r.totalMemory() - r.freeMemory()));
}
Or you can return the memory used and store it, then compare it to a later value. Either way, remember the two gc() calls and the subtraction from totalMemory().
Again, sorry to edit your post!
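For the "return the memory used and store it" variant, a small sketch:
public static long usedMemory() {
    Runtime r = Runtime.getRuntime();
    r.gc();
    r.gc();   // same double-gc trick as displayMemory()
    return r.totalMemory() - r.freeMemory();
}

// usage:
// long before = usedMemory();
// ... populate the HashMap or ArrayList ...
// long delta = usedMemory() - before;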
Hashmaps try to maintain a load factor (usually 75% full); you can think of a hashmap as a sparsely filled array list. The problem with a straight-up size comparison is that the map's table grows ahead of the data in order to maintain that load factor. An ArrayList, on the other hand, grows to meet its need by enlarging its internal array (by roughly 50% in the standard implementation). For relatively small sizes they are comparable, but as you pack more and more data into the map, it requires a lot of empty references in order to maintain the hash performance.
In either case I recommend priming the expected size of the data before you start adding. This will give the implementations a better initial setting and will likely consume less memory overall in both cases.
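A sketch of that priming for, say, 6 million entries (the HashMap capacity is padded so the default 0.75 load factor is never exceeded):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

int expected = 6_000_000;
List<String> list = new ArrayList<>(expected);                     // exact capacity up front
Map<Integer, String> map = new HashMap<>(expected * 4 / 3 + 1);    // stays under the 0.75 load factor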
Update:
Based on your updated problem, check out Glazed Lists. This is a neat little tool written by some of the Google people for doing operations similar to the one you describe. It's also very quick. It allows clustering, filtering, searching, etc.
A HashMap holds a reference to the value and a reference to the key.
An ArrayList just holds a reference to the value.
So, assuming that the key uses the same amount of memory as the value, a HashMap uses 50% more memory (although, strictly speaking, it is not the HashMap that uses that memory, because it just keeps a reference to it).
On the other hand, a HashMap provides constant-time performance for the basic operations (get and put). So, although it may use more memory, getting an element may be much faster using a HashMap than an ArrayList.
So, the next thing you should do is not to care about which uses more memory, but about what they are good for.
Using the correct data structure for your program saves more CPU/memory than how the library is implemented underneath.
EDIT
After Grant Welch's answer, I decided to measure for 2,000,000 integers.
Here's the source code
This is the output
$
$javac MemoryUsage.java
Note: MemoryUsage.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
$java -Xms128m -Xmx128m MemoryUsage
Using ArrayListMemoryUsage#8558d2 size: 0
Total memory: 133.234.688
Initial free: 132.718.608
Final free: 77.965.488
Used: 54.753.120
Memory Used 41.364.824
ArrayListMemoryUsage#8558d2 size: 2000000
$
$java -Xms128m -Xmx128m MemoryUsage H
Using HashMapMemoryUsage#8558d2 size: 0
Total memory: 133.234.688
Initial free: 124.329.984
Final free: 4.109.600
Used: 120.220.384
Memory Used 129.108.608
HashMapMemoryUsage#8558d2 size: 2000000
Basically, you should be using the "right tool for the job". Since there are different instances where you'll need a key/value pair (where you may use a HashMap) and different instances where you'll just need a list of values (where you may use an ArrayList), the question of "which one uses more memory" is, in my opinion, moot, since it is not a consideration in choosing one over the other.
But to answer the question, since HashMap stores key/value pairs while ArrayList stores just values, I would assume that the addition of keys alone to the HashMap would mean that it takes up more memory, assuming, of course, we are comparing them by the same value type (e.g. where the values in both are Strings).
I think the wrong question is being asked here.
If you would like to improve the speed at which you can search for an object in a List containing six million entries, then you should look into how fast these datatypes' retrieval operations perform.
As usual, the Javadocs for these classes state pretty plainly what type of performance they offer:
HashMap:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
This means that HashMap.get(key) is O(1).
ArrayList:
The size, isEmpty, get, set, iterator, and listIterator operations run in constant time. The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking).
This means that most of ArrayList's operations are O(1), but likely not the ones that you would be using to find objects that match a certain value.
If you are iterating over every element in the ArrayList and testing for equality, or using contains(), then this means that your operation is running at O(n) time (or worse).
If you are unfamiliar with O(1) or O(n) notation, this is referring to how long an operation will take. In this case, if you can get constant-time performance, you want to take it. If HashMap.get() is O(1) this means that retrieval operations take roughly the same amount of time regardless of how many entries are in the Map.
The fact that something like ArrayList.contains() is O(n) means that the amount of time it takes grows as the size of the list grows; so iterating through an ArrayList with six million entries will not be very effective at all.
I don't know the exact numbers, but HashMaps are much heavier. Comparing the two, ArrayList's internal representation is self-evident, but HashMaps retain Entry objects, which can balloon your memory consumption.
It's not that much larger, but it's larger. A great way to visualize this would be with a dynamic profiler such as YourKit which allows you to see all heap allocations. It's pretty nice.
This post gives a lot of information about object sizes in Java.
If you're considering two ArrayLists vs one Hashmap, it's indeterminate; both are partially-full data structures. If you were comparing Vector vs Hashtable, Vector is probably more memory efficient, because it only allocates the space it uses, whereas Hashtables allocate more space.
If you need a key-value pair and aren't doing incredibly memory-hungry work, just use the Hashmap.
As Jon Skeet noted, these are completely different structures. A map (such as HashMap) is a mapping from one value to another - i.e. you have a key that maps to a value, in a Key->Value kind of relationship. The key is hashed, and is placed in an array for quick lookup.
A List, on the other hand, is a collection of elements with order - ArrayList happens to use an array as the back end storage mechanism, but that is irrelevant. Each indexed element is a single element in the list.
edit: based on your comment, I have added the following information:
The key is stored in a hashmap. This is because a hash is not guaranteed to be unique for any two different elements. Thus, the key has to be stored in the case of hashing collisions. If you simply want to see if an element exists in a set of elements, use a Set (the standard implementation of this being HashSet). If the order matters, but you need a quick lookup, use a LinkedHashSet, as it keeps the order the elements were inserted. The lookup time is O(1) on both, but the insertion time is slightly longer on a LinkedHashSet. Use a Map only if you are actually mapping from one value to another - if you simply have a set of unique objects, use a Set, if you have ordered objects, use a List.
This site lists the memory consumption for several commonly (and not so commonly) used data structures. From there one can see that the HashMap takes roughly 5 times the space of an ArrayList. The map will also allocate one additional object per entry.
If you need a predictable iteration order and use a LinkedHashMap, the memory consumption will be even higher.
You can do your own memory measurements with Memory Measurer.
There are two important facts to note however:
A lot of data structures (including ArrayList and HashMap) allocate more space than they currently need, because otherwise they would have to frequently execute a costly resize operation. Thus the memory consumption per element depends on how many elements are in the collection. For example, an ArrayList with the default settings uses the same amount of memory for 0 to 10 elements.
As others have said, the keys of the map are stored, too. So if they are not in memory anyway, you will have to add this memory cost, too. An additional object will usually take 8 bytes of overhead alone, plus the memory for its fields, and possibly some padding. So this will also be a lot of memory.