Array or List in Java. Which is faster? - java

I have to keep thousands of strings in memory to be accessed serially in Java. Should I store them in an array, or should I use some kind of List?
Since arrays keep all the data in a contiguous chunk of memory (unlike Lists), would the use of an array to store thousands of strings cause problems?

I suggest that you use a profiler to test which is faster.
My personal opinion is that you should use Lists.
I work on a large codebase and a previous group of developers used arrays everywhere. It made the code very inflexible. After changing large chunks of it to Lists we noticed no difference in speed.

The Java way is that you should consider what data abstraction most suits your needs. Remember that in Java a List is an abstract, not a concrete data type. You should declare the strings as a List, and then initialize it using the ArrayList implementation.
List<String> strings = new ArrayList<String>();
This separation of Abstract Data Type and specific implementation is one of the key aspects of object-oriented programming.
An ArrayList implements the List Abstract Data Type using an array as its underlying implementation. Access speed is virtually identical to an array, with the additional advantages of being able to add and remove elements from a List (although this is an O(n) operation with an ArrayList), and of being able to change the underlying implementation later if you decide to. For example, if you realize you need synchronized access, you can change the implementation to a Vector without rewriting all your code.
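For instance, a minimal sketch of that swap: because the rest of the code depends only on the List interface, changing the implementation is a one-line edit.

    List<String> strings = new ArrayList<String>();
    // Later, if synchronized access is needed, only the construction changes:
    // List<String> strings = new Vector<String>();
    // Everything written against List<String> keeps working unchanged.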
In fact, the ArrayList was specifically designed to replace the low-level array construct in most contexts. If Java was being designed today, it's entirely possible that arrays would have been left out altogether in favor of the ArrayList construct.
Since arrays keep all the data in a contiguous chunk of memory (unlike Lists), would the use of an array to store thousands of strings cause problems?
In Java, all collections store only references to objects, not the objects themselves. Both an array and an ArrayList will store a few thousand references in a contiguous array, so they are essentially identical. You can assume that a contiguous block of a few thousand 32-bit references will always be readily available on modern hardware. This does not guarantee that you will not run out of memory altogether, of course, just that the contiguous-block-of-memory requirement is not difficult to fulfil.

Although the answers proposing to use an ArrayList do make sense in most scenarios, the actual question of relative performance has not really been answered.
There are a few things you can do with an array:
create it
set an item
get an item
clone/copy it
General conclusion
Although get and set operations are somewhat slower on an ArrayList (slower by roughly 1 and 3 nanoseconds per call, respectively, on my machine), there is very little overhead in using an ArrayList vs. an array for any non-intensive use. There are, however, a few things to keep in mind:
resizing operations on a list (when calling list.add(...)) are costly and one should try to set the initial capacity at an adequate level when possible (note that the same issue arises when using an array)
when dealing with primitives, arrays can be significantly faster, as they allow one to avoid many boxing/unboxing conversions (see the sketch after this list)
an application that only gets/sets values in an ArrayList (not very common!) could see a performance gain of more than 25% by switching to an array
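To illustrate the boxing point with a quick sketch (not part of the benchmark below): every add() and get() on an ArrayList<Integer> goes through an Integer object, while the int[] version touches only primitives.

    List<Integer> boxed = new ArrayList<Integer>(1000);
    for (int i = 0; i < 1000; i++) boxed.add(i);      // autoboxes: Integer.valueOf(i)

    int[] primitive = new int[1000];
    for (int i = 0; i < 1000; i++) primitive[i] = i;  // plain store, no object allocation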
Detailed results
Here are the results I measured for those three operations using the jmh benchmarking library (times in nanoseconds) with JDK 7 on a standard x86 desktop machine. Note that the ArrayLists are never resized in the tests, to make sure the results are comparable. Benchmark code available here.
Array/ArrayList Creation
I ran 4 tests, executing the following statements:
createArray1: Integer[] array = new Integer[1];
createList1: List<Integer> list = new ArrayList<>(1);
createArray10000: Integer[] array = new Integer[10000];
createList10000: List<Integer> list = new ArrayList<>(10000);
Results (in nanoseconds per call, 95% confidence):
a.p.g.a.ArrayVsList.CreateArray1 [10.933, 11.097]
a.p.g.a.ArrayVsList.CreateList1 [10.799, 11.046]
a.p.g.a.ArrayVsList.CreateArray10000 [394.899, 404.034]
a.p.g.a.ArrayVsList.CreateList10000 [396.706, 401.266]
Conclusion: no noticeable difference.
get operations
I ran 2 tests, executing the following statements:
getList: return list.get(0);
getArray: return array[0];
Results (in nanoseconds per call, 95% confidence):
a.p.g.a.ArrayVsList.getArray [2.958, 2.984]
a.p.g.a.ArrayVsList.getList [3.841, 3.874]
Conclusion: getting from an array is about 25% faster than getting from an ArrayList, although the difference is only on the order of one nanosecond.
set operations
I ran 2 tests, executing the following statements:
setList: list.set(0, value);
setArray: array[0] = value;
Results (in nanoseconds per call):
a.p.g.a.ArrayVsList.setArray [4.201, 4.236]
a.p.g.a.ArrayVsList.setList [6.783, 6.877]
Conclusion: set operations on arrays are about 40% faster than on lists, but, as for get, each set operation takes a few nanoseconds - so for the difference to reach 1 second, one would need to set items in the list/array hundreds of millions of times!
clone/copy
ArrayList's copy constructor delegates to Arrays.copyOf, so performance is identical to an array copy (copying an array via clone, Arrays.copyOf or System.arraycopy makes no material difference performance-wise).
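For reference, the two copy idioms compared above look like this (a sketch):

    Integer[] arrayCopy = Arrays.copyOf(array, array.length); // array copy
    List<Integer> listCopy = new ArrayList<Integer>(list);    // ArrayList copy constructor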

You should prefer generic types over arrays. As mentioned by others, arrays are inflexible and do not have the expressive power of generic types. (They do however support runtime typechecking, but that mixes badly with generic types.)
But, as always, when optimizing you should always follow these steps:
Don't optimize until you have a nice, clean, and working version of your code. Changing to generic types could very well be motivated at this step already.
When you have a version that is nice and clean, decide if it is fast enough.
If it isn't fast enough, measure its performance. This step is important for two reasons. If you don't measure you won't (1) know the impact of any optimizations you make and (2) know where to optimize.
Optimize the hottest part of your code.
Measure again. This is just as important as measuring before. If the optimization didn't improve things, revert it. Remember, the code without the optimization was clean, nice, and working.

I'm guessing the original poster is coming from a C++/STL background, which is causing some confusion. In C++, std::list is a doubly linked list.
In Java, [java.util.]List is an implementation-free interface (a pure abstract class in C++ terms). A List can be a doubly linked list; java.util.LinkedList is provided. However, 99 times out of 100 when you want to make a new List, you want to use java.util.ArrayList instead, which is the rough equivalent of C++'s std::vector. There are other standard implementations, such as those returned by java.util.Collections.emptyList() and java.util.Arrays.asList().
From a performance standpoint there is a very small hit from having to go through an interface and an extra object; however, runtime inlining means this rarely has any significance. Also remember that a String is typically an object plus a character array, so for each entry you probably have two other objects anyway. In C++, a std::vector<std::string> holds the strings by value rather than through a pointer, but the character arrays still form a separate object per string (and these will not usually be shared).
If this particular code is really performance-sensitive, you could create a single char[] array (or even byte[]) for all the characters of all the strings, and then an array of offsets. IIRC, this is how javac is implemented.
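A minimal sketch of that layout (the names here are made up for illustration): all characters live in one contiguous char[], and string i is the slice between offsets[i] and offsets[i + 1].

    char[] chars = "foobarbaz".toCharArray(); // "foo", "bar", "baz" packed together
    int[] offsets = {0, 3, 6, 9};             // start of each string; last entry is the total length

    // Materialize string i only when it is actually needed:
    int i = 1;
    String s = new String(chars, offsets[i], offsets[i + 1] - offsets[i]); // "bar"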

I agree that in most cases you should choose the flexibility and elegance of ArrayLists over arrays - and in most cases the impact to program performance will be negligible.
However, if you're doing constant, heavy iteration with little structural change (no adds and removes) for, say, software graphics rendering or a custom virtual machine, my sequential access benchmarking tests show that ArrayLists are 1.5x slower than arrays on my system (Java 1.6 on my one year-old iMac).
Some code:
import java.util.*;

public class ArrayVsArrayList {
    public static void main(String[] args) {
        String[] array = new String[300];
        ArrayList<String> list = new ArrayList<String>(300);
        for (int i = 0; i < 300; ++i) {
            if (Math.random() > 0.5) {
                array[i] = "abc";
            } else {
                array[i] = "xyz";
            }
            list.add(array[i]);
        }

        int iterations = 100000000;
        long start_ms;
        int sum;

        start_ms = System.currentTimeMillis();
        sum = 0;
        for (int i = 0; i < iterations; ++i) {
            for (int j = 0; j < 300; ++j) sum += array[j].length();
        }
        System.out.println((System.currentTimeMillis() - start_ms) + " ms (array)");
        // Prints ~13,500 ms on my system

        start_ms = System.currentTimeMillis();
        sum = 0;
        for (int i = 0; i < iterations; ++i) {
            for (int j = 0; j < 300; ++j) sum += list.get(j).length();
        }
        System.out.println((System.currentTimeMillis() - start_ms) + " ms (ArrayList)");
        // Prints ~20,800 ms on my system - about 1.5x slower than direct array access
    }
}

Well, firstly it's worth clarifying whether you mean "list" in the classical computer-science data-structure sense (i.e. a linked list) or java.util.List. If you mean java.util.List, it's an interface. If you want array-like behaviour and semantics, just use the ArrayList implementation. Problem solved.
If you mean an array vs. a linked list, it's a slightly different argument, for which we go back to Big O (here is a plain-English explanation if this is an unfamiliar term).
Array:
Random access: O(1)
Insert: O(n)
Delete: O(n)
Linked list:
Random access: O(n)
Insert: O(1)
Delete: O(1)
So you choose whichever best suits how you use the data. If you insert and delete a lot, then maybe a linked list is a better choice. The same goes if random access is rare. You mention serial access: if you're mainly doing serial access with very little modification, then it probably doesn't matter which you choose.
Linked lists have a slightly higher overhead since, like you say, you're dealing with potentially non-contiguous blocks of memory and (effectively) pointers to the next element. That's probably not an important factor unless you're dealing with millions of entries however.

I wrote a little benchmark to compare ArrayLists with arrays. On my old-ish laptop, the time to traverse a 5000-element ArrayList 1000 times was about 10 milliseconds slower than the equivalent array code.
So, if you're doing nothing but iterating the list, and you're doing it a lot, then maybe it's worth the optimisation. Otherwise I'd use the List, because it'll make it easier when you do need to optimise the code.
n.b. I did notice that using for (String s : stringsList) was about 50% slower than using an old-style for loop to access the list. Go figure... Here are the two functions I timed; the array and the list were filled with 5000 random (distinct) strings.
private static void readArray(String[] strings) {
    long totalchars = 0;
    for (int j = 0; j < ITERATIONS; j++) {
        totalchars = 0;
        for (int i = 0; i < strings.length; i++) {
            totalchars += strings[i].length();
        }
    }
}

private static void readArrayList(List<String> stringsList) {
    long totalchars = 0;
    for (int j = 0; j < ITERATIONS; j++) {
        totalchars = 0;
        for (int i = 0; i < stringsList.size(); i++) {
            totalchars += stringsList.get(i).length();
        }
    }
}

No, because technically the array only stores references to the strings; the strings themselves are allocated in a different location. For a thousand items, I would say a list would be better: it is slower, but it offers more flexibility and it's easier to use, especially if you are going to resize it.

If you have thousands, consider using a trie. A trie is a tree-like structure that merges the common prefixes of the stored strings.
For example, if the strings were
intern
international
internationalize
internet
internets
The trie would store:
intern
  -> \0
  international
    -> \0
    -> ize\0
  net
    -> \0
    -> s\0
The strings require 57 characters (including the null terminator, '\0') for storage, plus whatever the size of the String objects that hold them is. (In truth, we should probably round all sizes up to multiples of 16, but...) Call it 57 + 5 = 62 bytes, roughly.
The trie requires 29 characters (including the null terminators) for storage, plus the size of the trie nodes, each of which is a reference to an array plus a list of child trie nodes.
For this example, that probably comes out about the same; for thousands, it probably comes out less as long as you do have common prefixes.
Now, when using the trie in other code, you'll have to convert to String, probably using a StringBuffer as an intermediary. If many of the strings are in use at once as Strings, outside the trie, it's a loss.
But if you're only using a few at the time -- say, to look up things in a dictionary -- the trie can save you a lot of space. Definitely less space than storing them in a HashSet.
You say you're accessing them "serially" -- if that means sequentially and alphabetically, the trie also obviously gives you alphabetical order for free if you iterate it depth-first.
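For concreteness, a minimal (uncompressed) trie node might look like the sketch below; a serious implementation would also merge single-child chains, as in the example above.

    import java.util.HashMap;
    import java.util.Map;

    class TrieNode {
        Map<Character, TrieNode> children = new HashMap<Character, TrieNode>();
        boolean isWord; // true if a stored string ends at this node

        void insert(String s) {
            TrieNode node = this;
            for (int i = 0; i < s.length(); i++) {
                TrieNode child = node.children.get(s.charAt(i));
                if (child == null) {
                    child = new TrieNode();
                    node.children.put(s.charAt(i), child);
                }
                node = child;
            }
            node.isWord = true;
        }

        boolean contains(String s) {
            TrieNode node = this;
            for (int i = 0; i < s.length(); i++) {
                node = node.children.get(s.charAt(i));
                if (node == null) return false;
            }
            return node.isWord;
        }
    }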

Since there are already a lot of good answers here, I would like to add some information from a practical point of view: an insertion and iteration performance comparison of a primitive array vs. a LinkedList in Java.
This is a simple, practical performance check, so the results will depend on the machine.
The source code used for this is below:
import java.util.Iterator;
import java.util.LinkedList;

public class Array_vs_LinkedList {
    private final static int MAX_SIZE = 40000000;

    public static void main(String[] args) {
        LinkedList<Integer> lList = new LinkedList<Integer>();

        /* insertion performance check */
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < MAX_SIZE; i++) {
            lList.add(i);
        }
        long stopTime = System.currentTimeMillis();
        long elapsedTime = stopTime - startTime;
        System.out.println("[Insert]LinkedList insert operation with " + MAX_SIZE + " number of integer elapsed time is " + elapsedTime + " millisecond.");

        int[] arr = new int[MAX_SIZE];
        startTime = System.currentTimeMillis();
        for (int i = 0; i < MAX_SIZE; i++) {
            arr[i] = i;
        }
        stopTime = System.currentTimeMillis();
        elapsedTime = stopTime - startTime;
        System.out.println("[Insert]Array insert operation with " + MAX_SIZE + " number of integer elapsed time is " + elapsedTime + " millisecond.");

        /* iteration performance check */
        startTime = System.currentTimeMillis();
        Iterator<Integer> itr = lList.iterator();
        while (itr.hasNext()) {
            itr.next();
            // System.out.println("Linked list running : " + itr.next());
        }
        stopTime = System.currentTimeMillis();
        elapsedTime = stopTime - startTime;
        System.out.println("[Loop]LinkedList iteration with " + MAX_SIZE + " number of integer elapsed time is " + elapsedTime + " millisecond.");

        startTime = System.currentTimeMillis();
        int t = 0;
        for (int i = 0; i < MAX_SIZE; i++) {
            t = arr[i];
            // System.out.println("array running : " + i);
        }
        stopTime = System.currentTimeMillis();
        elapsedTime = stopTime - startTime;
        System.out.println("[Loop]Array iteration with " + MAX_SIZE + " number of integer elapsed time is " + elapsedTime + " millisecond.");
    }
}
The performance results were posted as an image in the original answer and are not reproduced here.

I came here to get a better feeling for the performance impact of using lists over arrays. I had to adapt the code here for my scenario: an array/list of ~1000 ints, using mostly getters, meaning array[j] vs. list.get(j).
Taking the best of 7 runs, to be unscientific about it (the first few runs with the list were 2.5x slower), I get this:
array Integer[] best 643ms iterator
ArrayList<Integer> best 1014ms iterator
array Integer[] best 635ms getter
ArrayList<Integer> best 891ms getter (strange though)
So, very roughly, the array is 30% faster.
The second reason for posting now is that no one mentions the impact if you write math/matrix/simulation/optimization code with nested loops.
Say you have three nested levels, and each level is twice as slow: you are looking at an 8x performance hit. Something that would run in a day now takes a week.
*EDIT
Quite shocked here; for kicks I tried declaring int[1000] rather than Integer[1000]:
array int[] best 299ms iterator
array int[] best 296ms getter
Using Integer[] vs. int[] represents a 2x performance hit, and the ArrayList with an iterator is 3x slower than int[]. I really thought Java's list implementations were similar to native arrays...
Code for reference (call multiple times):
public static void testArray() {
    final long MAX_ITERATIONS = 1000000;
    final int MAX_LENGTH = 1000;
    final Random r = new Random();

    // Locals captured by the anonymous subclass below must be (effectively) final.
    // final Integer[] array = new Integer[MAX_LENGTH];
    final int[] array = new int[MAX_LENGTH];
    List<Integer> list = new ArrayList<Integer>() {{
        for (int i = 0; i < MAX_LENGTH; ++i) {
            int val = r.nextInt();
            add(val);
            array[i] = val;
        }
    }};

    long start = System.currentTimeMillis();
    int test_sum = 0;
    for (int i = 0; i < MAX_ITERATIONS; ++i) {
        // for (int e : array)
        // for (int e : list)
        for (int j = 0; j < MAX_LENGTH; ++j) {
            int e = array[j];
            // int e = list.get(j);
            test_sum += e;
        }
    }
    long stop = System.currentTimeMillis();
    long ms = (stop - start);
    System.out.println("Time: " + ms);
}

Lists are slower than arrays. If you need efficiency, use arrays. If you need flexibility, use lists.

UPDATE:
As Mark noted, there is no significant difference after JVM warm-up (several test passes). I checked this with a re-created array, and even with a new pass starting on a new row of the matrix. In all likelihood this indicates there is no reason to prefer a plain indexed array over a collection.
Still, on the first 1-2 passes the simple array is 2-3 times faster.
ORIGINAL POST:
Too many words for a subject this simple to check. Without any question, an array is several times faster than any class container. I ran into this question while looking for alternatives for my performance-critical section. Here is the prototype code I built to check the real situation:
import java.util.List;
import java.util.Arrays;

public class IterationTest {
    private static final long MAX_ITERATIONS = 1000000000;

    public static void main(String[] args) {
        Integer[] array = {1, 5, 3, 5};
        List<Integer> list = Arrays.asList(array);

        long start = System.currentTimeMillis();
        int test_sum = 0;
        for (int i = 0; i < MAX_ITERATIONS; ++i) {
            // for (int e : array) {  // array variant
            for (int e : list) {      // list variant
                test_sum += e;
            }
        }
        long stop = System.currentTimeMillis();
        long ms = (stop - start);
        System.out.println("Time: " + ms);
    }
}
And here is the answer:
Based on the array (with the array loop uncommented):
Time: 7064
Based on the list (with the list loop uncommented):
Time: 20950
Any more comments on 'faster'? The result is clear enough. The real question is when being roughly 3 times faster matters more to you than the flexibility of a List. But that is another question.
By the way, I checked this with a manually constructed ArrayList too; almost the same result.

Remember that an ArrayList encapsulates an array, so there is little difference compared to using a primitive array (except that a List is much easier to work with in Java).
Pretty much the only time it makes sense to prefer an array to an ArrayList is when you are storing primitives (byte, int, and so on) and you need the particular space-efficiency you get from primitive arrays.

The array-vs-list choice is not so important (considering performance) when storing string objects, because both an array and a list will store references to the string objects, not the actual objects.
If the number of strings is almost constant, then use an array (or an ArrayList). But if the number varies a lot, you may be better off with a LinkedList.
If there is (or will be) a need to add or delete elements in the middle, then you certainly want to use a LinkedList.
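One caveat on "the middle" (a sketch of my own, not from the answer above): a LinkedList only makes middle insertion cheap if you are already positioned there, e.g. via a ListIterator; calling add(index, element) still walks to the index in O(n).

    LinkedList<String> words = new LinkedList<String>();
    // ... fill the list ...
    ListIterator<String> it = words.listIterator();
    while (it.hasNext()) {
        if ("marker".equals(it.next())) {
            it.add("inserted"); // O(1) splice: no element shifting, unlike ArrayList
        }
    }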

If you can live with a fixed size, arrays will be faster and need less memory.
If you need the flexibility of the List interface for adding and removing elements, the question remains which implementation to choose. ArrayList is often recommended and used in every case, but ArrayList too has its performance problems if elements at the beginning or in the middle of the list must be removed or inserted.
You therefore may want to have a look at https://dzone.com/articles/gaplist-lightning-fast-list which introduces GapList. This new list implementation combines the strengths of both ArrayList and LinkedList, resulting in very good performance for nearly all operations. Get it at https://github.com/magicwerk/brownies-collections.

If you know in advance how large the data is, then an array will be faster.
A List is more flexible; you can use an ArrayList, which is backed by an array.

List is the preferred way in Java 1.5 and beyond, as it can use generics; arrays cannot. Arrays also have a predefined length that cannot grow dynamically, so initializing an array with a large size is not a good idea.
ArrayList is the way to declare an array-backed collection with generics, and it can grow dynamically.
But if deletes and inserts are used more frequently, then a linked list is the fastest data structure to use.

Which one to use depends on the problem. We need to look at the Big O.
(The original answer includes a Big-O comparison chart: reading is O(1) for arrays and O(n) for linked lists; insertion and deletion are O(n) for arrays and O(1) for linked lists.)
image source: https://github.com/egonSchiele/grokking_algorithms

Depending on the implementation, it's possible that an array of primitive types will be smaller and more efficient than an ArrayList. This is because the array stores the values directly in a contiguous block of memory, while the simplest ArrayList implementation stores references to each value. On a 64-bit platform especially, this can make a huge difference.
Of course, it's possible for the JVM implementation to special-case this situation, in which case the performance will be the same.

Arrays are recommended anywhere you can use them instead of a list, especially if you know the item count and that the size will not change.
See the Oracle Java best practices: http://docs.oracle.com/cd/A97688_16/generic.903/bp/java.htm#1007056
Of course, if you need to add and remove objects from the collection many times, lists are easier to use.

A lot of microbenchmarks given here have found numbers of a few nanoseconds for things like array/ArrayList reads. This is quite reasonable if everything is in your L1 cache.
A higher-level cache or main-memory access can take on the order of 10 ns to 100 ns, versus more like 1 ns for an L1 cache hit. Accessing an ArrayList has an extra memory indirection, and in a real application you could pay this cost anything from almost never to every time, depending on what your code is doing between accesses. And, of course, if you have a lot of small ArrayLists this might add to your memory use and make it more likely you'll have cache misses.
The original poster appears to be using just one and accessing a lot of contents in a short time, so it should be no great hardship. But it might be different for other people, and you should watch out when interpreting microbenchmarks.
Java Strings, however, are appallingly wasteful, especially if you store lots of small ones (just look at them with a memory analyzer, it seems to be > 60 bytes for a string of a few characters). An array of strings has an indirection to the String object, and another from the String object to a char[] which contains the string itself. If anything's going to blow your L1 cache it's this, combined with thousands or tens of thousands of Strings. So, if you're serious - really serious - about scraping out as much performance as possible then you could look at doing it differently. You could, say, hold two arrays, a char[] with all the strings in it, one after another, and an int[] with offsets to the starts. This will be a PITA to do anything with, and you almost certainly don't need it. And if you do, you've chosen the wrong language.

None of the answers had information that I was interested in - repetitive scan of the same array many many times. Had to create a JMH test for this.
Results (Java 1.8.0_66 x32; iterating a plain array is at least 5 times quicker than an ArrayList):
Benchmark                    Mode  Cnt   Score   Error  Units
MyBenchmark.testArrayForGet  avgt   10   8.121 ± 0.233  ms/op
MyBenchmark.testListForGet   avgt   10  37.416 ± 0.094  ms/op
MyBenchmark.testListForEach  avgt   10  75.674 ± 1.897  ms/op
Test
package my.jmh.test;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class MyBenchmark {
    public final static int ARR_SIZE = 100;
    public final static int ITER_COUNT = 100000;

    String[] arr = new String[ARR_SIZE];
    List<String> list = new ArrayList<>(ARR_SIZE);

    public MyBenchmark() {
        for (int i = 0; i < ARR_SIZE; i++) {
            list.add(null);
        }
    }

    @Benchmark
    public void testListForEach() {
        int count = 0;
        for (int i = 0; i < ITER_COUNT; i++) {
            for (String str : list) {
                if (str != null)
                    count++;
            }
        }
        if (count > 0)
            System.out.print(count);
    }

    @Benchmark
    public void testListForGet() {
        int count = 0;
        for (int i = 0; i < ITER_COUNT; i++) {
            for (int j = 0; j < ARR_SIZE; j++) {
                if (list.get(j) != null)
                    count++;
            }
        }
        if (count > 0)
            System.out.print(count);
    }

    @Benchmark
    public void testArrayForGet() {
        int count = 0;
        for (int i = 0; i < ITER_COUNT; i++) {
            for (int j = 0; j < ARR_SIZE; j++) {
                if (arr[j] != null)
                    count++;
            }
        }
        if (count > 0)
            System.out.print(count);
    }
}

"Thousands" is not a large number. A few thousand paragraph-length strings are on the order of a couple of megabytes in size. If all you want to do is access these serially, use an immutable singly-linked List.

Don't fall into the trap of optimizing without proper benchmarking. As others have suggested, use a profiler before making any assumptions.
The different data structures that you have enumerated have different purposes. A (linked) list is very efficient at inserting elements at the beginning and at the end, but suffers a lot when accessing random elements. An array has fixed storage but provides fast random access. Finally, an ArrayList improves on the array by allowing it to grow. Normally, the data structure to use should be dictated by how the stored data will be accessed or added.
About memory consumption: you seem to be mixing some things up. An array only gives you a contiguous chunk of memory for the type of data that you have. Don't forget that Java has a fixed set of data types: the primitives (boolean, char, int, long, float and so on) and object references (every object, even an array, is an Object). That means that if you declare new String[1000] or new MyObject[1000], you only get 1000 memory slots big enough to store the locations (references or pointers) of the objects; you don't get 1000 memory slots big enough to fit the objects themselves. Don't forget that your objects are first created with "new": that is when the memory allocation is done, and later a reference (their memory address) is stored in the array. The object doesn't get copied into the array, only its reference.

I don't think it makes a real difference for Strings. What is contiguous in an array of strings is the references to the strings; the strings themselves are stored at random places in memory.
Arrays vs. lists can make a difference for primitive types, not for objects. If you know the number of elements in advance and don't need flexibility, an array of millions of integers or doubles will be more efficient in memory, and marginally so in speed, than a list, because it will indeed be stored contiguously and accessed instantly. That's why Java still uses arrays of chars for strings, arrays of ints for image data, etc.

Array is faster - all memory is pre-allocated in advance.

It depends on how you have to access it.
After storing, if you mainly want to do search operations, with little or no insert/delete, then go for an array (lookup by index is O(1) in an array, whereas add/delete may need re-ordering of the elements).
After storing, if your main purpose is to add/delete strings, with little or no searching, then go for a List.

Arrays: always better when you need fast fetching of results.
Lists: perform better on insertion and deletion, since those can be done in O(1); they also provide methods to add, fetch and delete data easily, and are much easier to use.
But always remember that fetching data from an array is fast only when the index position where the data is stored is known.
That index could be found by sorting the array, but sorting increases the total time to get at the data (storing the data + sorting it + seeking the position where it is found), so it adds latency even though arrays are good at fetching data quickly once the index is known.
This can be solved with a trie or a ternary search tree. As discussed above, a trie is very efficient at searching: looking up a particular word takes time proportional to the word's length, no matter how many strings are stored. If time matters, i.e. if you have to search for and retrieve data quickly, go with a trie.
If you want less memory consumption together with good performance, go with a ternary search tree. Both are suitable for storing huge numbers of strings (e.g. the words in a dictionary).

Related

Very large Java ArrayList has slow traversal time

Solution: My ArrayList was filled with duplicates. I modified my code to filter these out, which reduced running times to about 1 second.
I am working on an algorithms project that requires me to look at large amounts of data.
My program has a potentially very large ArrayList (A) that has every element in it traversed. For each of these elements in (A), several other, calculated elements are added to another ArrayList (B). (B) will be much, much larger than (A).
Once my program has run through seven of these ArrayLists, the running time goes up to approximately 5 seconds. I'm trying to get that down to under 1 second, if possible.
I am open to different ways of traversing the ArrayList, as well as to using a completely different data structure. I don't care about the order of the values inside the lists, as long as I can go through all of them very fast. I have tried a linked list, and it was significantly slower.
Here is a snippet of code, to give you a better understanding. The code tries to find all single-digit permutations of a prime number.
public static Integer primeLoop(ArrayList current, int endVal, int size) {
    Integer compareVal = 0;
    Integer currentVal = 0;
    Integer tempVal = 0;
    int currentSize = current.size() - 1;
    ArrayList next = new ArrayList();
    for (int k = 0; k <= currentSize; k++) {
        currentVal = Integer.parseInt(current.get(k).toString());
        for (int i = 1; i <= 5; i++) {
            for (int j = 0; j <= 9; j++) {
                compareVal = orderPrime(currentVal, endVal, i, j);
                //System.out.println(compareVal);
                if (!compareVal.equals(tempVal) && !currentVal.equals(compareVal)) {
                    tempVal = compareVal;
                    next.add(compareVal);
                    //System.out.println("Inserted: " + compareVal + " with parent: " + currentVal);
                    if (compareVal.equals(endVal)) {
                        System.out.println("Separation: " + size);
                        return -1;
                    }
                }
            }
        }
    }
    size++;
    //System.out.println(next);
    primeLoop(next, endVal, size);
    return -1;
}
*Edit: Removed unnecessary code from the snippet above. Created a currentSize variable that stops the program from having to call size() on (current) every time. Still no difference. Here is an idea of how the ArrayList grows:
2,
29,
249,
2293,
20727,
190819,
When something is slow, the typical advice is to profile it. This is generally wise, as it's often difficult to determine what's the cause of slowness, even for performance experts. Sometimes it's possible to pick out code that's likely to be a performance problem, but this is hit-or-miss. There are some likely things in this code, but it's hard to say for sure, since we don't have the code for the orderPrime() and primeLoop() methods.
That said, there's one thing that caught my eye. This line:
currentVal = Integer.parseInt(current.get(k).toString());
This gets an element from current, turns it into a string, parses it back to an int, and then boxes it into an Integer. Conversion to and from String is pretty expensive, and it allocates memory, so it puts pressure on garbage collection. Boxing primitive int values to Integer objects also allocates memory, contributing to GC pressure.
It's hard to say what the fix is, since you're using the raw type ArrayList for current. I surmise it might be ArrayList<Integer>, and if so, you could just replace this line with
currentVal = (Integer)current.get(k);
You should be using generics in order to avoid the cast. (But that doesn't affect performance, just the readability and type-safety of the code.)
If current doesn't contain Integer values, then it should. Whatever it contains should be converted to Integer beforehand, instead of putting conversions inside a loop.
After fixing this, you are still left with boxing/unboxing overhead. If performance is still a problem, you'll have to switch from ArrayList<Integer> to int[] because Java collections cannot contain primitives. This is inconvenient, since you'll have to implement your own list-like structure that simulates a variable-length array of int (or find a third party library that does this).
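A minimal sketch of such a list-like structure (all names here are illustrative, not from any library):

    // Growable array of primitive int, avoiding Integer boxing entirely.
    class IntList {
        private int[] data = new int[16];
        private int size = 0;

        void add(int value) {
            if (size == data.length) {
                data = java.util.Arrays.copyOf(data, data.length * 2); // grow by doubling
            }
            data[size++] = value;
        }

        int get(int index) {
            return data[index]; // primitive access, no unboxing
        }

        int size() {
            return size;
        }
    }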
But even all of the above might not be enough to make your program run fast enough. I don't know what your algorithm is doing, but it looks like it's doing linear searching. There are a variety of ways to speed up searching. But another commenter suggested binary search, and you said it wasn't allowed, so it's not clear what can be done here.
Here is an idea of how the ArrayList grows: 2, 29, 249, 2293, 20727, 190819
Your next list grows too large, so it must contain duplicates:
190_819 entries for 100_000 numbers?
According to primes.utm.edu/howmany.html there are only 9,592 primes up to 100_000.
Getting rid of the duplicates will certainly improve your response times.
Why do you have this line?
current.iterator();
You don't use the iterator at all; you don't even have a variable for it. It's just a waste of time.
for(int k = 0; k <= current.size()-1; k++)
Instead of computing the size on every iteration, create a variable like:
int curSize = current.size() - 1;
and use it in the loop.
It can save some time.

Slow initialization of large array of small objects

I stumbled upon this case today, and I'm wondering what the reason is behind this huge difference in time.
The first version initializes a 5k x 5k array of raw ints:
public void initializeRaw() {
    int size = 5000;
    int[][] a = new int[size][size];
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            a[i][j] = -1;
}
and it takes roughly 300ms on my machine.
On the other hand, initializing the same array with simple 2-int structs:
public class Struct {
    public int x;
    public int y;
}

public void initializeStruct() {
    int size = 5000;
    Struct[][] a = new Struct[size][size];
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            a[i][j] = new Struct();
}
takes over 15000ms.
I would expect it to be a bit slower, after all there is more memory to allocate (10 bytes instead of 4, if I'm not mistaken), but I cannot understand why it could take 50 times longer.
Could anyone explain this? Maybe there is just a better way to do this kind of initialization in Java?
EDIT: For some comparison, the same code using Integer instead of int/Struct runs in 700ms, only about two times slower.
I would expect it to be a bit slower, after all there is more memory to allocate (10 bytes instead of 4, if I'm not mistaken), but I cannot understand why it could take 50 times longer.
No, it's much worse than that. In the first case, you're creating 5001 objects. In the second case, you're creating 25,005,001 objects. Each of the Struct objects is going to take between 16 and 32 bytes, I suspect. (It will depend on various JVM details, but that's a rough guess.)
Your 5001 objects in the first case will take a total of ~100MB. The equivalent objects (the arrays) may take a total of ~200MB, if you're on a platform with 64-bit references... and then there's the other 25 million objects to allocate and initialize.
So yes, a pretty huge difference...
When you create an array of 5000 ints, you are allocating all the space you need for all those ints in one go, as a single block of consecutive elements. When you assign an int to each array element, you are not allocating anything. Contrast that with an array of 5000 Struct instances. You iterate through that array and for every single one of those 5000 elements you allocate a Struct instance. Allocating an object takes a lot longer than simply writing an int value into a variable.
The fact that you have two-dimensional arrays doesn't make much comparative difference here, as it just means you allocate 5000 array objects in both cases.
If you are timing an array of Integer objects, and you are then setting each element to -1, then you are not allocating separate Integer objects each time. Instead, you are using autoboxing, which means the compiler is implicitly calling Integer.valueOf(-1) and that method returns the same object from the cache each time.
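A quick sketch of that caching behaviour (the spec guarantees caching for values from -128 to 127; behaviour outside that range may vary by JVM):

    Integer a = Integer.valueOf(-1);
    Integer b = Integer.valueOf(-1);
    System.out.println(a == b); // true: small values come from Integer's cache

    Integer c = Integer.valueOf(1000);
    Integer d = Integer.valueOf(1000);
    System.out.println(c == d); // false (typically): outside the cached range, distinct objects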
UPDATE: Going back to addressing your concern, if I understand correctly, you have a requirement to keep 5000x5000 Structs in a 2D array, and you are disappointed that creating this array takes a lot longer than using primitives. To improve performance, you can create two arrays of primitives, one for each field of Struct, but this would reduce code clarity.
You can also create a single array of longs (since each long is double the size of an int) and use the & and >> operators to get your original ints. Again, this would reduce code clarity, but you'll only have one array.
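A sketch of that packing (my illustration of the suggestion, with hypothetical helper names):

    // x goes in the high 32 bits, y in the low 32 bits of a single long.
    static long pack(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL); // mask stops a negative y from overwriting x's bits
    }

    static int unpackX(long packed) {
        return (int) (packed >> 32);
    }

    static int unpackY(long packed) {
        return (int) packed; // truncation keeps the low 32 bits
    }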
However, you seem to be concentrating on a single part of the code, namely the creation of the array. You may well find that the processing you do on each element overshadows the time it takes to create the array. Profile your whole application and see if the creation of arrays is significant.

Randomizing list processing faster than Collections.shuffle()?

I am developing an agent-based model in Java. I have used a profiler to reduce any inefficiencies down to the point that the only thing holding it back is Java's Collections.shuffle().
The agents (they're animals) in my model need to be processed in a random order so that no agent is consistently processed before the others.
I am looking for: Either a faster way to shuffle than Java's Collections.shuffle() or an alternative method of processing the elements in an ArrayList in a randomized order that is significantly faster. If you know of a data structure that would be faster than an ArrayList, by all means please answer. I have considered LinkedList and ArrayDeque, but they aren't making much of a difference.
Currently, I have over 1,000,000 elements in the list I am trying to shuffle. Over time, this amount increases and it is becoming increasingly inefficient to shuffle it.
Is there an alternative data structure or way of randomizing the processing of elements that is faster?
I only need to be able to store elements and process them in a randomized order. I do not use contains or anything more complex than storage and iterating over them.
Here is some sample code to better explain what I am trying to achieve:
UPDATE: Sorry for the ConcurrentModificationException, I didn't realize I had done that and I didn't intend to confuse anyone. Fixed it in the code below.
ArrayList<Agent> list = new ArrayList<>();

void process() {
    list.add(new Agent("Zebra"));
    Random r = new Random();
    for (int i = 0; i < 100000; i++) {
        ArrayList<Agent> newlist = new ArrayList<>();
        // Something that will allow the order to be random (random quality
        // does not matter to me), yet faster than a shuffle:
        Collections.shuffle(list);
        for (Agent agent : list) {
            newlist.add(agent);
            if (r.nextDouble() > 0.99) { // 1% chance of adding another agent to the list
                newlist.add(new Agent("Lion"));
            }
        }
        list = newlist;
    }
}
ANOTHER UPDATE
I thought about doing list.remove(rando.nextInt(list.size())), but since remove for ArrayLists is O(n), doing that would be even worse than shuffling for such a large list size.
I would use a simple ArrayList and not shuffle it at all. Instead select random list indices to process. To avoid processing a list element twice, I'd remove the processed elements from the list.
Now if the list is very large, removing a random entry itself would be the bottleneck. This can however be avoided easily by removing the last entry instead and moving it into the place the selected entry occupied before:
public String pullRandomElement(List<String> list, Random random) {
    // select a random list index
    int size = list.size();
    int index = random.nextInt(size);
    String result = list.get(index);
    // move the last entry into the selected slot; skip when the last entry itself was selected
    String last = list.remove(size - 1);
    if (index < size - 1) {
        list.set(index, last);
    }
    return result;
}
Needless to say, you should choose a list implementation where get(index) and remove(lastIndex) are fast O(1), such as ArrayList. You may also want to add edge-case handling (such as an empty list).
You could use this: if you already have the list of items, generate a random index according to its size using nextInt.
ArrayList<String> list = new ArrayList<>();
int sizeOfCollection = list.size();
Random randomGenerator = new Random();
int randomId = randomGenerator.nextInt(sizeOfCollection);
Object x = list.get(randomId);
list.remove(randomId);
Since your code doesn't actually depend on the order of the list, it's enough to shuffle it once at the end of the processing.
void process() {
    Random r = new Random();
    for (int i = 0; i < 100000; i++) {
        for (String str : list) {
            if (r.nextDouble() > 0.9) {
                list.add(str + str);
            }
        }
    }
    Collections.shuffle(list);
}
Though this would still throw a ConcurrentModificationException, like the original code.
Collections.shuffle() uses the modern variant of the Fisher-Yates algorithm:
From https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
To shuffle an array a of n elements (indices 0..n-1):
  for i from n − 1 downto 1 do
    j ← random integer such that 0 ≤ j ≤ i
    exchange a[j] and a[i]
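A direct Java translation of that pseudocode over a plain array (a sketch) is:

    static <T> void shuffle(T[] a, Random rnd) {
        for (int i = a.length - 1; i >= 1; i--) {
            int j = rnd.nextInt(i + 1); // random integer such that 0 <= j <= i
            T tmp = a[j];
            a[j] = a[i];
            a[i] = tmp;
        }
    }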
Collections.shuffle converts the list to an array, does the shuffle using random.nextInt(), and then copies everything back (see http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/Collections.java#Collections.shuffle%28java.util.List%29).
You can only make this faster by avoiding the overhead of copying the array out and writing it back:
Either write your own implementation of ArrayList where you can directly access the backing array, or access the field "elementData" of your ArrayList via reflection.
Now use the same algorithm as Collections.shuffle on that array, using the correct size().
This is faster because it avoids copying the whole array, as Collections.shuffle() does.
The access via reflection takes a bit of time, so this solution is faster only for larger numbers of elements.
I would not recommend this solution unless you really need to win the race for the fastest possible shuffle, measured by execution time.
And as always when comparing speeds, make sure you warm up the VM by running the algorithm to be measured 1000 times before starting the measurement.
According to the documentation, Collections.shuffle() runs in O(N) time.
This method runs in linear time. If the specified list does not implement the RandomAccess interface and is large, this implementation dumps the specified list into an array before shuffling it, and dumps the shuffled array back into the list. This avoids the quadratic behavior that would result from shuffling a "sequential access" list in place.
I recommend you use the public static void shuffle(List<?> list, Random rnd) overload, although the performance benefit will probably be negligible.
Improving the performance will be difficult unless you allow some bias, such as partial shuffling (only a segment of the list gets re-shuffled each time) or under-shuffling. Under-shuffling means writing your own Fisher-Yates routine and skipping certain list indices during the reverse traversal; for example, you could skip all odd indices. However, the end of your list would then receive less shuffling than the front, which is another form of bias.
If you had a fixed list size M, you might consider caching some large number N of different fixed index permutations (0 to M-1 in random order) in memory at application startup. Then you could just randomly select one of these pre-orderings whenever you iterate the collection and just iterate according to that particular previously defined permutation. If N were large (say 1000 or more), the overall bias would be small (and also relatively uniform) and would be very fast. However you noted your list slowly grows, so this approach wouldn't be viable.

Adding 2 lists element by element

I have 2 Lists and want to add them element by element, so that result[i] = list1[i] + list2[i].
Is there an easier, and probably better-performing, way than using a for loop to iterate over the first list and add each sum to the result list?
I appreciate your answers!
Depends on what kind of list and what kind of for loop.
Iterating over the elements (rather than indices) would almost certainly be plenty fast enough.
On the other hand, iterating over indices and repeatedly getting the element by index could work rather poorly for certain types of lists (e.g. a linked list).
My understanding is that you have List1 and List2 and that you want to find the best-performing way to compute result[index] = List1[index] + List2[index].
My main suggestion is that before you start optimising for performance, you should measure whether you need to optimise at all. You can iterate through the lists as you said, something like:
for (int i = 0; i < listSize; i++) {
    result[i] = List1[i] + List2[i];
}
In most cases this is fine. See NPE's answer for a description of where this might be expensive, i.e. a linked list. Also see this answer, and note that each step of the for loop does a get: on an array it is done in one step, but on a linked list it takes as many steps as it takes to walk to the element in the list.
Assuming a standard array, this is O(n) and (depending on array size) will be done so quickly that it will hardly result in a blip on your performance profiling.
As a twist, since the operations are completely independent, that is result[0] = List1[0] + List2[0] is independent of result[1] = List1[1] + List2[1], etc, you can run these operations in parallel. E.g. you could run the first half of the calculations (<= List.Size / 2) on one thread and the other half (> List.Size / 2) on another thread and expect the elapsed time to roughly halve (assuming at least 2 free CPUs). Now, the best number of threads to use depends on the size of your data, the number of CPUs, other operations happening at the same time and is normally best decided by testing and modeling under different conditions. All this adds complexity to your program, so my main recommendation is to start simple, then measure and then decide whether you need to optimise.
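As a sketch of that parallel idea, here is the same element-wise addition using Java 8 parallel streams instead of hand-managed threads (assuming plain int arrays of equal length):

    import java.util.stream.IntStream;

    static int[] addParallel(int[] a, int[] b) {
        int[] result = new int[a.length];
        // Every index is independent, so the work can be split safely across cores.
        IntStream.range(0, a.length).parallel()
                 .forEach(i -> result[i] = a[i] + b[i]);
        return result;
    }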
Looping is inevitable unless you have a matrix API (e.g. OpenGL). You could, however, implement a List<Integer> which is backed by the original Lists:
public class CalcList implements List<Integer> {
    private List<Integer> l1, l2;

    @Override
    public Integer get(int index) {
        return l1.get(index) + l2.get(index);
    }

    // constructor and the remaining List methods omitted
}
This avoids copy operations and moves the calculations to the end of your stack:

CalcList results1 = new CalcList(list1, list2);
CalcList results2 = new CalcList(results1, list3);
// no calculation or memory allocated until now
for (int result : results2) {
    // here the calculation happens, still without unnecessary memory
}
This could give an advantage if the compiler is able to translate it into:
for (int i = 0; i < list1.size; i++) {
    int result = list1[i] + list2[i] + list3[i] + …;
}
But I doubt that. You have to run a benchmark for your specific use case to find out if this implementation has an advantage.
Java doesn't come with a map-style function, so the way to do this kind of operation is with a for loop.
Even if you use some other construct, the looping will be done anyway. An alternative is using the GPU for computations but this is not a default Java feature.
Also using arrays should be faster than operating with linked lists.

Best way to write this program

I have a general programming question that I happened to use Java to answer. This is the question:
Given an array of ints write a program to find out how many numbers that are not unique are in the array. (e.g. in {2,3,2,5,6,1,3} 2 numbers (2 and 3) are not unique). How many operations does your program perform (in O notation)?
This is my solution.
int counter = 0;
for (int i = 0; i < theArray.length - 1; i++) {
    for (int j = i + 1; j < theArray.length; j++) {
        if (theArray[i] == theArray[j]) {
            counter++;
            break; // go to the next i; since we know it isn't unique, we don't need to keep comparing
        }
    }
}
return counter;
Now, in my code every element is compared with every other element, so there are about n(n-1)/2 operations, giving O(n^2). Please tell me if you think my code is incorrect/inefficient or my O expression is wrong.
Why not use a Map, as in the following example:

// NOTE! I assume that the elements of theArray are Integers, not primitives like ints.
// You'll need to cast things to Integers if they are ints to put them in a Map, as
// Maps can't take primitives as keys or values.
Map<Integer, Integer> elementCount = new HashMap<Integer, Integer>();
for (int i = 0; i < theArray.length; i++) {
    if (elementCount.containsKey(theArray[i])) {
        elementCount.put(theArray[i], new Integer(elementCount.get(theArray[i]) + 1));
    } else {
        elementCount.put(theArray[i], new Integer(1));
    }
}

List<Integer> moreThanOne = new ArrayList<Integer>();
for (Integer key : elementCount.keySet()) { // method may be getKeySet(), can't remember
    if (elementCount.get(key) > 1) {
        moreThanOne.add(key);
    }
}
// do whatever you want with the moreThanOne list
Notice that this method requires iterating through the list twice (I'm sure there's a way to do it in a single pass). It iterates once through theArray, and then implicitly again as it iterates through the key set of elementCount, which, if no two elements are the same, will be exactly as large. However, iterating through the same list twice serially is still O(n) instead of O(n^2), and thus has much better asymptotic running time.
Your code doesn't do what you want. If you run it using the array {2, 2, 2, 2}, you'll find that it returns 3 instead of 1. You'll have to find a way to make sure that the counting is never repeated.
However, your Big O expression is correct as a worst-case analysis, since every element might be compared with every other element.
Your analysis is correct, but you could easily get it down to O(n) time. Try using a HashMap<Integer,Integer> to store previously-seen values as you iterate through the array (the key is the number you've seen, the value is the number of times you've seen it). Each time you try to add an integer into the hashmap, check whether it's already there. If it is, just increment that integer's counter. Then, at the end, loop through the map and count the number of keys with a corresponding value higher than 1.
First, your approach is what I would call "brute force", and it is indeed O(n^2) in the worst case. It's also incorrectly implemented, since numbers that repeat n times are counted n-1 times.
Setting that aside, there are a number of ways to approach the problem. The first (which a number of answers have suggested) is to iterate the array, using a map to keep track of how many times each element has been seen. Assuming the map uses a hash table for its underlying storage, the average-case complexity should be O(n), since gets and inserts on the map are O(1) on average, and you only need to iterate the list and the map once each. Note that this is still O(n^2) in the worst case, since there's no guarantee that the hashing will produce constant-time results.
Another approach is to simply sort the array first, and then iterate the sorted array looking for duplicates. This approach is entirely dependent on the sort chosen, and can be anywhere from O(n^2) (for a naive bubble sort) to O(n log n) worst case (for a merge sort) to O(n log n) average-and-likely case (for a quicksort).
That's the best you can do with the sorting approach, assuming arbitrary objects in the array. Since your example involves integers, though, you can do much better by using radix sort, which has worst-case complexity of O(dn), where d is essentially constant (it maxes out at 10 digits for 32-bit integers).
Finally, if you know that the elements are integers and that their magnitude isn't too large, you can improve the map-based solution by using an array of size ElementMax, which guarantees O(n) worst-case complexity, at the cost of 4*ElementMax additional bytes of memory.
I think your time complexity of O(n^2) is correct.
If space complexity is not an issue, then you can have an array of 256 ints (covering the ASCII range) and fill it with counts. For example:
// Java zero-initializes int arrays, so no explicit initialization is needed.
// The following runs in O(n + m), where n is the length of theArray and m is the length of array.
int[] array = new int[256];
for (int i = 0; i < theArray.length; i++)
    array[theArray[i]] = array[theArray[i]] + 1;
for (int i = 0; i < array.length; i++)
    if (array[i] > 1)
        System.out.print(i);
As others have said, an O(n) solution is quite possible using a hash. In Perl:
my @data = (2,3,2,5,6,1,3);
my %count;
$count{$_}++ for @data;
my $n = grep $_ > 1, values %count;
print "$n numbers are not unique\n";
OUTPUT
2 numbers are not unique
