Are ArrayLists more than twice as slow as arrays? - java

I wrote a test that attempts to test two things:
Whether the size of a buffer array affects its performance, even if you don't use the whole buffer
The relative performance of arrays and ArrayList
I was kind of surprised with the results
Boxed arrays (i.e. Integer vs int) are not very much slower than the primitive version
The size of the underlying array doesn't matter very much
ArrayLists are more than twice as slow as the corresponding array.
The Questions
Why is ArrayList so much slower?
Is my benchmark written well? In other words, are my results accurate?
The Results
0% Scenario{vm=java, trial=0, benchmark=SmallArray} 34.57 ns; σ=0.79 ns @ 10 trials
17% Scenario{vm=java, trial=0, benchmark=SmallBoxed} 40.40 ns; σ=0.21 ns @ 3 trials
33% Scenario{vm=java, trial=0, benchmark=SmallList} 105.78 ns; σ=0.09 ns @ 3 trials
50% Scenario{vm=java, trial=0, benchmark=BigArray} 34.53 ns; σ=0.05 ns @ 3 trials
67% Scenario{vm=java, trial=0, benchmark=BigBoxed} 40.09 ns; σ=0.23 ns @ 3 trials
83% Scenario{vm=java, trial=0, benchmark=BigList} 105.91 ns; σ=0.14 ns @ 3 trials
benchmark ns linear runtime
SmallArray 34.6 =========
SmallBoxed 40.4 ===========
SmallList 105.8 =============================
BigArray 34.5 =========
BigBoxed 40.1 ===========
BigList 105.9 ==============================
vm: java
trial: 0
The Code
This code was written on Windows using Java 7 and Google Caliper 0.5-rc1 (because last I checked, 1.0 doesn't work on Windows yet).
Quick outline: in all 6 tests, each iteration of the loop sums the values in the first 128 cells of the array (no matter how big the array is) and adds that to a running total. Caliper tells me how many reps the test should run, so I repeat that 128-element sum that many times.
The 6 tests have a big (131072) and a small (128) version of int[], Integer[], and ArrayList<Integer>. You can probably figure out which is which from the names.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;
public class SpeedTest {
public static class TestBenchmark extends SimpleBenchmark {
int[] bigArray = new int[131072];
int[] smallArray = new int[128];
Integer[] bigBoxed = new Integer[131072];
Integer[] smallBoxed = new Integer[128];
List<Integer> bigList = new ArrayList<>(131072);
List<Integer> smallList = new ArrayList<>(128);
@Override
protected void setUp() {
Random r = new Random();
for(int i = 0; i < 128; i++) {
smallArray[i] = Math.abs(r.nextInt(100));
bigArray[i] = smallArray[i];
smallBoxed[i] = smallArray[i];
bigBoxed[i] = smallArray[i];
smallList.add(smallArray[i]);
bigList.add(smallArray[i]);
}
}
public long timeBigArray(int reps) {
long result = 0;
for(int i = 0; i < reps; i++) {
for(int j = 0; j < 128; j++) {
result += bigArray[j];
}
}
return result;
}
public long timeSmallArray(int reps) {
long result = 0;
for(int i = 0; i < reps; i++) {
for(int j = 0; j < 128; j++) {
result += smallArray[j];
}
}
return result;
}
public long timeBigBoxed(int reps) {
long result = 0;
for(int i = 0; i < reps; i++) {
for(int j = 0; j < 128; j++) {
result += bigBoxed[j];
}
}
return result;
}
public long timeSmallBoxed(int reps) {
long result = 0;
for(int i = 0; i < reps; i++) {
for(int j = 0; j < 128; j++) {
result += smallBoxed[j];
}
}
return result;
}
public long timeBigList(int reps) {
long result = 0;
for(int i = 0; i < reps; i++) {
for(int j = 0; j < 128; j++) {
result += bigList.get(j);
}
}
return result;
}
public long timeSmallList(int reps) {
long result = 0;
for(int i = 0; i < reps; i++) {
for(int j = 0; j < 128; j++) {
result += smallList.get(j);
}
}
return result;
}
}
public static void main(String[] args) {
Runner.main(TestBenchmark.class, new String[0]);
}
}

Firstly ...
Are ArrayLists more than twice as slow as arrays?
As a generalization, no. For operations that potentially involve "changing" the length of the list / array, an ArrayList will be faster than an array ... unless you use a separate variable to represent the array's logical size.
For other operations, the ArrayList is likely to be slower, though the performance ratio will most likely depend on the operation and the JVM implementation. Also note that you have only tested one operation / pattern.
Why is ArrayList so much slower?
Because an ArrayList has a distinct array object inside of it.
Operations typically involve extra indirections (e.g. to fetch the list's size and inner array) and there are extra bounds checks (e.g. checking the list's size and the array's length). A typical JIT compiler is (apparently) not able to optimize these away. (And in fact, you would NOT want to optimize away the inner array because that's what allows an ArrayList to grow.)
For arrays of primitives, the corresponding list types hold wrapper objects, and that adds overhead. For example, your result += ... involves unboxing in the "list" cases.
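To make that concrete, the "list" addition desugars to roughly the following (a sketch of what the compiler emits, not the actual benchmark code):
long sumList(List<Integer> list) {
    long result = 0;
    for (int j = 0; j < 128; j++) {
        Integer boxed = list.get(j);   // indirection into the list's backing Object[] plus a bounds check
        result += boxed.intValue();    // the unboxing that "result += list.get(j)" hides
    }
    return result;
}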
Is my benchmark written well? In other words, are my results accurate?
There's nothing wrong with it technically. But it is not sufficient to demonstrate your point. For a start, you are only measuring one kind of operation: fetching an element by index, and its equivalents. And you are only measuring with int / Integer element types.
Finally, this largely misses the point of using List types. We use them because they are almost always easier to use than plain arrays. A performance difference of (say) a factor of 2 is typically not important to overall application performance.

Keep in mind that in using an ArrayList you are actually calling a method, which in the case of get() makes two further calls. (One of those is a range check, which I suspect accounts for part of the delay.)
The important thing with ArrayList is not so much how much faster or slower it is compared with a plain array, but that its access time is always constant (like an array's). In the real world, you'll almost always find that the added delay is negligible, especially if you have an application that even thinks about connecting to a database. :)
In short, I think your test (and the results) are legit.

These results don't surprise me. List.get(int) involves a cast, which is slow. Java's generics are implemented via type erasure, meaning that any type of List<T> is actually a List<Object>, and the only reason you get your type out is because of a cast. The source for ArrayList looks like this:
public E get(int index) {
rangeCheck(index);
return elementData(index);
}
// snip...
private void rangeCheck(int index) {
if (index >= size)
throw new IndexOutOfBoundsException(outOfBoundsMsg(index));
}
// snip...
@SuppressWarnings("unchecked")
E elementData(int index) {
return (E) elementData[index];
}
The rangeCheck and the overhead of the function calls are trivial, it's that cast to E that kills you.

If you store millions of objects, then add or contains operations will be super slow. The best way is to split the data using a HashMap of ArrayLists. Although similar algorithms can be used for other types of objects, this is how I made processing 10 million strings roughly 1000 times faster (at the cost of 2-3 times more memory).
public static class ArrayHashList {
private String temp1;
HashMap<String, ArrayList<String>> allKeys = new HashMap<>();
ArrayList<String> curKeys;
private int keySize;
public ArrayHashList(int keySize) {
this.keySize = keySize;
}
public ArrayHashList(int keySize, String fromFileName) {
this.keySize = keySize;
String line;
try {
BufferedReader br1 = new BufferedReader(new FileReader(fromFileName));
while ((line = br1.readLine()) != null)
addString(line);
br1.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public boolean addString(String strToAdd) {
// the key is the first keySize characters of the string
if (strToAdd.length() < keySize)
temp1 = strToAdd;
else
temp1 = strToAdd.substring(0, keySize);
if (!allKeys.containsKey(temp1))
allKeys.put(temp1, new ArrayList<String>());
curKeys = allKeys.get(temp1);
if (!curKeys.contains(strToAdd)) {
curKeys.add(strToAdd);
return true;
}
return false;
}
public boolean haveString(String strCheck) {
if (strCheck.length() < keySize)
temp1 = strCheck;
else
temp1 = strCheck.substring(0, keySize);
if (!allKeys.containsKey(temp1))
allKeys.put(temp1, new ArrayList<String>());
curKeys = allKeys.get(temp1);
return curKeys.contains(strCheck);
}
}
to init and use it:
ArrayHashList fullHlist = new ArrayHashList(3, filesPath+"\\allPhrases.txt");
ArrayList<String> pendingList = new ArrayList<>();
BufferedReader br1 = new BufferedReader(new FileReader(filesPath + "\\processedPhrases.txt"));
String line, wordEnc;
while ((line = br1.readLine()) != null) {
wordEnc = StrUtil.GetFirstToken(line,",~~~,");
if (!fullHlist.haveString(wordEnc))
pendingList.add(wordEnc);
}
br1.close();

Related

Why can Java HotSpot not optimize an array instance after a one-time resize (leading to a massive performance loss)?

Question
Why is the use of fBuffer1 in the attached code example (SELECT_QUICK = true) twice as fast as the other variant, when fBuffer2 is resized only once at the very beginning (SELECT_QUICK = false)?
The code path is absolutely identical, but even after 10 minutes the throughput of fBuffer2 does not rise to the level of fBuffer1.
Background:
We have a generic data processing framework that collects thousands of Java primitive values in different subclasses (one subclass for each primitive type). These values are stored internally in arrays, which we originally sized sufficiently large. To save heap memory, we have now switched these arrays to dynamic resizing (the arrays grow only when needed). As expected, this change has massively reduced heap usage. On the other hand, performance has unfortunately degraded significantly: our processing jobs now take 2-3 times longer than before (e.g. 6 min instead of 2 min).
I have reduced our problem to a minimum working example and attached it. You can choose with SELECT_QUICK which buffer should be used. I see the same effect with jdk-1.8.0_202-x64 as well as with openjdk-17.0.1-x64.
Buffer 1 (is not resized) shows the following numbers:
duration buf1: 8,890.551ms (8.9s)
duration buf1: 8,339.755ms (8.3s)
duration buf1: 8,620.633ms (8.6s)
duration buf1: 8,682.809ms (8.7s)
...
Buffer 2 (is resized exactly 1 time at the beginning) shows the following numbers:
make buffer 2 larger
duration buf2 (resized): 19,542.750ms (19.5s)
duration buf2 (resized): 22,423.529ms (22.4s)
duration buf2 (resized): 22,413.364ms (22.4s)
duration buf2 (resized): 22,219.383ms (22.2s)
...
I would really appreciate some hints on how I can change the code so that fBuffer2 (after resizing) works as fast as fBuffer1. The other way round (making fBuffer1 as slow as fBuffer2) is pretty easy. ;-) Since this problem sits in a framework-like component, I would prefer to change the code instead of tuning HotSpot with external arguments. But of course, suggestions in both directions are very welcome.
Source Code
import java.util.Locale;
public final class Collector {
private static final boolean SELECT_QUICK = true;
private static final long LOOP_COUNT = 50_000L;
private static final int VALUE_COUNT = 150_000;
private static final int BUFFER_LENGTH = 100_000;
private final Buffer fBuffer = new Buffer();
public void reset() {fBuffer.reset();}
public void addValueBuf1(long val) {fBuffer.add1(val);}
public void addValueBuf2(long val) {fBuffer.add2(val);}
public static final class Buffer {
private int fIdx = 0;
private long[] fBuffer1 = new long[BUFFER_LENGTH * 2];
private long[] fBuffer2 = new long[BUFFER_LENGTH];
public void reset() {fIdx = 0;}
public void add1(long value) {
ensureRemainingBuf1(1);
fBuffer1[fIdx++] = value;
}
public void add2(long value) {
ensureRemainingBuf2(1);
fBuffer2[fIdx++] = value;
}
private void ensureRemainingBuf1(int remaining) {
if (remaining > fBuffer1.length - fIdx) {
System.out.println("make buffer 1 larger");
fBuffer1 = new long[(fIdx + remaining) << 1];
}
}
private void ensureRemainingBuf2(int remaining) {
if (remaining > fBuffer2.length - fIdx) {
System.out.println("make buffer 2 larger");
fBuffer2 = new long[(fIdx + remaining) << 1];
}
}
}
public static void main(String[] args) {
Locale.setDefault(Locale.ENGLISH);
final Collector collector = new Collector();
if (SELECT_QUICK) {
while (true) {
final long start = System.nanoTime();
for (long j = 0L; j < LOOP_COUNT; j++) {
collector.reset();
for (int k = 0; k < VALUE_COUNT; k++) {
collector.addValueBuf1(k);
}
}
final long nanos = System.nanoTime() - start;
System.out.printf("duration buf1: %1$,.3fms (%2$,.1fs)%n",
nanos / 1_000_000d, nanos / 1_000_000_000d);
}
} else {
while (true) {
final long start = System.nanoTime();
for (long j = 0L; j < LOOP_COUNT; j++) {
collector.reset();
for (int k = 0; k < VALUE_COUNT; k++) {
collector.addValueBuf2(k);
}
}
final long nanos = System.nanoTime() - start;
System.out.printf("duration buf2 (resized): %1$,.3fms (%2$,.1fs)%n",
nanos / 1_000_000d, nanos / 1_000_000_000d);
}
}
}
}
JIT compilation in the HotSpot JVM 1) is based on runtime profile data, and 2) uses speculative optimizations.
Once the method is compiled at the maximum optimization level, HotSpot stops profiling this code, so it is never recompiled afterwards, no matter how long the code runs. (The exception is when the method needs to be deoptimized or unloaded, but it's not your case).
In the first case (SELECT_QUICK == true), the condition remaining > fBuffer1.length - fIdx is never met, and HotSpot JVM is aware of that from profiling data collected at lower tiers. So it speculatively hoists the check out of the loop, and compiles the loop body with the assumption that array index is always within bounds. After the optimization, the loop is compiled like this (in pseudocode):
if (VALUE_COUNT > collector.fBuffer.fBuffer1.length) {
uncommon_trap();
}
for (int k = 0; k < VALUE_COUNT; k++) {
collector.fBuffer.fBuffer1[k] = k; // no bounds check
}
In the second case (SELECT_QUICK == false), on the contrary, HotSpot knows that condition remaining > fBuffer2.length - fIdx is sometimes met, so it cannot eliminate the check.
Since fIdx is not the loop counter, HotSpot is apparently not smart enough to split the loop into two parts (with and without bounds check). However, you can help JIT compiler by splitting the loop manually:
for (long j = 0L; j < LOOP_COUNT; j++) {
collector.reset();
int fastCount = Math.min(collector.fBuffer.fBuffer2.length, VALUE_COUNT);
for (int k = 0; k < fastCount; k++) {
collector.addValueBuf2Fast(k);
}
for (int k = fastCount; k < VALUE_COUNT; k++) {
collector.addValueBuf2(k);
}
}
where addValueBuf2Fast inserts a value without bounds check:
public void addValueBuf2Fast(long val) {fBuffer.add2Fast(val);}
public static final class Buffer {
...
public void add2Fast(long value) {
fBuffer2[fIdx++] = value;
}
}
This should dramatically improve performance of the loop:
make buffer 2 larger
duration buf2 (resized): 5,537.681ms (5.5s)
duration buf2 (resized): 5,461.519ms (5.5s)
duration buf2 (resized): 5,450.445ms (5.5s)

Are arrays supposed to be faster than ArrayLists by this much?

I have implemented two methods, shuffleList and shuffleArray, which use exactly the same logic. I have an ArrayList of half a million Integers and an array of the same half million ints. My benchmarking code, which runs each of the methods 100 times on the corresponding array or ArrayList and records the time, shows that shuffleArray takes around 0.5 seconds while shuffleList takes around 3.5 seconds, even though the code uses no ArrayList methods other than get and set, which are supposed to be as fast as array access.
Now I know that ArrayLists are a little bit slower because they internally use arrays but with some additional code, but does it make this big of a difference?
void shuffleList(List<Integer> list){
Random rnd = ThreadLocalRandom.current();
for(int i=list.size()-1;i>0;i--){
int index=rnd.nextInt(i+1);
int a=list.get(index);
list.set(index,list.get(i));
list.set(i,a);
}
}
void shuffleArray(int[] ar)
{
Random rnd = ThreadLocalRandom.current();
for (int i = ar.length - 1; i > 0; i--)
{
int index = rnd.nextInt(i + 1);
int a = ar[index];
ar[index] = ar[i];
ar[i] = a;
}
}
Benchmarking code:
import org.openjdk.jmh.Main;
import org.openjdk.jmh.annotations.*;
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
@Benchmark
@Fork(value = 1)
@Warmup(iterations = 3)
@Measurement(iterations = 10)
public void compete() {
try {
Sorting sorting = new Sorting();
sorting.load();
System.out.println(sorting.test());
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) throws Exception {
Main.main(args);
}
}
protected List<Integer> list = new ArrayList<Integer>();
protected List<int[]> arrays= new ArrayList<>();
protected void load(){
try (Stream<String> stream = Files.lines(Paths.get("numbers.txt"))) {
stream.forEach(x -> list.add(Integer.parseInt(x)));
} catch (IOException e) {
e.printStackTrace();
}
finally{
int[] arr =new int[list.size()];
for(int i=0;i<list.size();i++)
arr[i]=list.get(i);
arrays.add(arr);
}
}
protected double test(){
int arr[]=arrays.get(0);
Stopwatch watch = new Stopwatch();
for (int i=0; i<100; i++){
shuffleArray(arr);
shuffleList(list);
}
return watch.elapsedTime();
}
I comment out one of the two method calls in the for loop and use the other.
Update:
I did what a lot of you suggested and changed int a to Integer a in the shuffleList method, and it does make it a little faster: 3 seconds instead of 3.5. But I still think that is a big difference.
It is worth mentioning that changing int[] arr to Integer[] arr in the shuffleArray method (while keeping int a as it is, to simulate the boxing and unboxing cost for the array) does make it a lot slower, around 3 seconds. So I can make the array as slow as the ArrayList, but I cannot do the opposite.
Update:
Using Collections.swap() in shuffleList did indeed make it as fast as the array, but I still do not understand why. Is my benchmarking too sensitive, or does it really matter?
Final shuffleList code, courtesy of Andy Turner and Joop Eggen:
protected void shuffleList(List<Integer> list){
Random rnd = ThreadLocalRandom.current();
for(int i=list.size()-1;i>0;i--){
int index=rnd.nextInt(i+1);
Collections.swap(list, i, index);
}
}
Use Integer a, which saves one unboxing and one boxing operation.
for (int i = list.size()-1; i>0; i--){
int index=rnd.nextInt(i+1);
Integer a=list.get(index);
list.set(index,list.get(i));
list.set(i,a);
}
And the Integer objects use more memory.
@Andy Turner mentioned the existing Collections#swap.
for (int i = list.size()-1; i > 0; i--) {
int index = rnd.nextInt(i+1);
Collections.swap(list, i, index);
}
Without JIT warm-up this might slow down the benchmark, but it will look better in production code. Though then you would probably use Collections.shuffle anyway.
As commented, the swap version is fast too (the OP was already using proper microbenchmarking code).
swap keeps the original Integer objects as well. It does l.set(i, l.set(j, l.get(i))); in order to swap, as set returns the previous element at that position. The JIT compiler can probably inline set and use that previous element immediately.
There is a Java function to do the job:
Collections.shuffle( list );
This should be significantly faster than a for loop.

Algorithm too slow for shuffling ArrayList

I am trying to implement the Fisher-Yates shuffle algorithm on java. It works but when my ArrayList is of a size > 100000, it goes very slow. I will show you my code and do you see any way to optimize the code? I did some research about the complexity of .get and .set from ArrayList and it is O(1) which makes sense to me.
UPDATE 1: I noticed my implementation was wrong. This is now the proper Fisher-Yates algorithm. I have also included my next() function so you can see it. I tested with java.util.Random to check whether my next() function was the problem, but it gives the same result. I believe the problem is the way I use my data structure.
UPDATE 2: I ran a test and the ArrayList is an instance of RandomAccess, so the problem is not there.
private long next(){ // MurmurHash3
seed ^= seed >> 33;
seed *= 0xff51afd7ed558ccdL;
seed ^= seed >> 33;
seed *= 0xc4ceb9fe1a85ec53L;
seed ^= seed >> 33;
return seed;
}
public int next(int range){
return (int) Math.abs((next() % range));
}
public ArrayList<Integer> shuffle(ArrayList<Integer> pList){
Integer temp;
int index;
int size = pList.size();
for (int i = size - 1; i > 0; i--){
index = next(i + 1);
temp = pList.get(index);
pList.set(index, pList.get(i));
pList.set(i, temp);
}
return pList;
}
EDIT: Added some comments after you correctly implemented the Fisher-Yates algorithm.
The Fisher-Yates algorithm relies on uniformly distributed random integers to produce unbiased permutations. Using a hash function (MurmurHash3) to generate random numbers, and introducing abs and modulo operations to force the numbers into a fixed range, makes the implementation less robust.
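To see the modulo issue in isolation, consider a toy generator (purely illustrative, not your MurmurHash3 mixer) that produces 16 equally likely values: folding them into a range of 3 cannot be uniform.
int[] counts = new int[3];
for (int v = 0; v < 16; v++) {
    counts[v % 3]++;                                   // 16 values cannot split evenly into 3 buckets
}
System.out.println(java.util.Arrays.toString(counts)); // [6, 5, 5] -- 0 is slightly over-represented
With a 64-bit generator and small ranges the bias is tiny, but Random.nextInt(bound) avoids it entirely by rejecting values that would skew the distribution.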
This implementation uses the java.util.Random PRNG and should work fine for your needs:
public <T> List<T> shuffle(List<T> list) {
// trust the default constructor which sets the seed to a value very likely
// to be distinct from any other invocation of this constructor
final Random random = new Random();
final int size = list.size();
for (int i = size - 1; i > 0; i--) {
// pick a random number between one and the number
// of unstruck numbers remaining (inclusive)
int index = random.nextInt(i + 1);
list.set(index, list.set(i, list.get(index)));
}
return list;
}
I can't see any major performance bottleneck in your code. However, here is a fire&forget comparison of the implementation above against the Collections#shuffle method:
public void testShuffle() {
List<Integer> list = new ArrayList<>();
for (int i = 0; i < 1_000_000; i++) {
list.add(i);
}
System.out.println("size: " + list.size());
System.out.println("Fisher-Yates shuffle");
for (int i = 0; i < 10; i++) {
long start = System.currentTimeMillis();
shuffle(list);
long stop = System.currentTimeMillis();
System.out.println("#" + i + " " + (stop - start) + "ms");
}
System.out.println("Java shuffle");
for (int i = 0; i < 10; i++) {
long start = System.currentTimeMillis();
Collections.shuffle(list);
long stop = System.currentTimeMillis();
System.out.println("#" + i + " " + (stop - start) + "ms");
}
}
which gives me the following results:
size: 1000000
Fisher-Yates shuffle
#0 84ms
#1 60ms
#2 42ms
#3 45ms
#4 47ms
#5 46ms
#6 52ms
#7 49ms
#8 47ms
#9 53ms
Java shuffle
#0 60ms
#1 46ms
#2 44ms
#3 48ms
#4 50ms
#5 46ms
#6 46ms
#7 49ms
#8 50ms
#9 47ms
(Better suited for Code Review forum.)
I changed what I could:
Random random = new Random(42);
for (ListIterator<Integer> iter = pList.listIterator(); iter.hasNext(); ) {
Integer value = iter.next();
int index = random.nextInt(pList.size());
iter.set(pList.get(index));
pList.set(index, value);
}
As an ArrayList is backed by a large array, you might set the initialCapacity in the ArrayList constructor. trimToSize() might do something too. Using a ListIterator means that one is already at the current position in the backing array, and that might help.
The optional seed parameter of the Random constructor (here 42) picks a fixed, repeatable random sequence, which during development allows timing and tracing the same sequence.
Combining some fragments that have been scattered in comments and other answers:
The original code was not an implementation of the Fisher-Yates shuffle. It was only swapping random elements. This means that certain permutations are more likely than others, and the result is not truly random.
If there is a bottleneck, it could (based on the code provided) only be in the next method, which you did not say anything about. It should be replaced by the nextInt method of an instance of java.util.Random
Here is an example of what it may look like. (Note that the speedTest method is not even remotely intended as a "benchmark", but should only indicate that the execution time is negligible even for large lists).
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
class FisherYatesShuffle {
public static void main(String[] args) {
basicTest();
speedTest();
}
private static void basicTest() {
List<Integer> list = new ArrayList<Integer>(Arrays.asList(1,2,3,4,5));
shuffle(list, new Random(0));
System.out.println(list);
}
private static void speedTest() {
List<Integer> list = new ArrayList<Integer>();
int n = 1000000;
for (int i=0; i<n; i++) {
list.add(i);
}
long before = System.nanoTime();
shuffle(list, new Random(0));
long after = System.nanoTime();
System.out.println("Duration "+(after-before)/1e6+"ms");
System.out.println(list.get(0));
}
public static <T> void shuffle(List<T> list, Random random) {
for (int i = list.size() - 1; i > 0; i--) {
int index = random.nextInt(i + 1);
T t = list.get(index);
list.set(index, list.get(i));
list.set(i, t);
}
}
}
An aside: You gave a list as an argument, and returned the same list. This may be appropriate in some cases, but did not make any sense here. There are several options for the signature and behavior of such a method. But most likely, it should receive a List, and shuffle this list in-place. In fact, it would also make sense to explicitly check whether the list implements the java.util.RandomAccess interface. For a List that does not implement the RandomAccess interface, this algorithm would degrade to quadratic performance. In that case, it would be better to copy the given list into a list that implements RandomAccess, shuffle this copy, and then copy the results back into the original list.
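A sketch of that defensive pattern might look like the following (the method name is illustrative; it reuses the shuffle(List, Random) method from the example above):
public static <T> void shuffleAdaptive(List<T> list, Random random) {
    if (list instanceof RandomAccess) {
        shuffle(list, random);                    // indexed access is cheap: shuffle in place
    } else {
        List<T> copy = new ArrayList<>(list);     // copy into a RandomAccess list
        shuffle(copy, random);
        ListIterator<T> it = list.listIterator();
        for (T element : copy) {                  // write the shuffled order back in one linear pass
            it.next();
            it.set(element);
        }
    }
}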
Try this code and compare the execution time with your Fisher-Yates method. (The snippet below is JavaScript, but the algorithm is the same.)
It is probably your "next" method that is slow.
function fisherYates(array) {
for (var i = array.length - 1; i > 0; i--) {
var index = Math.floor(Math.random() * i);
//swap
var tmp = array[index];
array[index] = array[i];
array[i] = tmp;
}
}

In java, is it more efficient to use byte or short instead of int and float instead of double?

I've noticed I always use ints and doubles no matter how small or big the number needs to be. So in Java, is it more efficient to use byte or short instead of int, and float instead of double?
So assume I have a program with plenty of ints and doubles. Would it be worth going through and changing my ints to bytes or shorts if I knew the number would fit?
I know java doesn't have unsigned types but is there anything extra I could do if I knew the number would be positive only?
By efficient I mostly mean processing. I'd assume the garbage collector would be a lot faster if all the variables were half the size, and that calculations would probably be somewhat faster too.
(I guess since I am working on Android I also need to worry somewhat about RAM.)
(I'd assume the garbage collector only deals with Objects and not primitives, but still deletes all the primitives in abandoned objects, right?)
I tried it with a small android app I have but didn't really notice a difference at all. (Though I didn't "scientifically" measure anything.)
Am I wrong in assuming it should be faster and more efficient? I'd hate to go through and change everything in a massive program to find out I wasted my time.
Would it be worth doing from the beginning when I start a new project? (I mean I think every little bit would help but then again if so, why doesn't it seem like anyone does it.)
Am I wrong in assuming it should be faster and more efficient? I'd hate to go through and change everything in a massive program to find out I wasted my time.
Short answer
Yes, you are wrong. In most cases, it makes little difference in terms of space used.
It is not worth trying to optimize this ... unless you have clear evidence that optimization is needed. And if you do need to optimize memory usage of object fields in particular, you will probably need to take other (more effective) measures.
Longer answer
The Java Virtual Machine models stacks and object fields using offsets that are (in effect) multiples of a 32 bit primitive cell size. So when you declare a local variable or object field as (say) a byte, the variable / field will be stored in a 32 bit cell, just like an int.
There are two exceptions to this:
long and double values require 2 primitive 32-bit cells
arrays of primitive types are represented in packed form, so that (for example) an array of bytes holds 4 bytes per 32-bit word.
So it might be worth optimizing use of long and double ... and large arrays of primitives. But in general no.
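As a rough illustration of the packed-array point (not a rigorous benchmark; exact numbers depend on the JVM and heap state), allocating a large primitive array shows the per-element width directly:
public class ArrayFootprint {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();
        byte[] data = new byte[50_000_000];        // ~50 MB; an int[] would be ~200 MB, a long[] ~400 MB
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.printf("~%.1f MB allocated (length %d)%n", (after - before) / 1e6, data.length);
    }
}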
In theory, a JIT might be able to optimize the object field layout, but in practice I've never heard of a JIT that does. One impediment is that the JIT typically cannot run until after instances of the class being compiled have been created. If the JIT optimized the memory layout, you could have two (or more) "flavors" of object of the same class ... and that would present huge difficulties.
Revisitation
Looking at the benchmark results in @meriton's answer, it appears that using short and byte instead of int incurs a performance penalty for multiplication. Indeed, if you consider the operations in isolation, the penalty is significant. (You shouldn't consider them in isolation ... but that's another topic.)
I think the explanation is that JIT is probably doing the multiplications using 32bit multiply instructions in each case. But in the byte and short case, it executes extra instructions to convert the intermediate 32 bit value to a byte or short in each loop iteration. (In theory, that conversion could be done once at the end of the loop ... but I doubt that the optimizer would be able to figure that out.)
Anyway, this does point to another problem with switching to short and byte as an optimization. It could make performance worse ... in an algorithm that is arithmetic and compute intensive.
Secondary questions
I know java doesn't have unsigned types but is there anything extra I could do if I knew the number would be positive only?
No. Not in terms of performance anyway. (There are some methods in Integer, Long, etc for dealing with int, long, etc as unsigned. But these don't give any performance advantage. That is not their purpose.)
(I'd assume the garbage collector only deals with Objects and not primitive but still deletes all the primitives in abandoned objects right? )
Correct. A field of an object is part of the object. It goes away when the object is garbage collected. Likewise the cells of an array go away when the array is collected. When the field or cell type is a primitive type, then the value is stored in the field / cell ... which is part of the object / array ... and that has been deleted.
That depends on the implementation of the JVM, as well as the underlying hardware. Most modern hardware will not fetch single bytes from memory (or even from the first-level cache), i.e. using the smaller primitive types generally does not reduce memory bandwidth consumption. Likewise, modern CPUs have a word size of 64 bits. They can perform operations on fewer bits, but that works by discarding the extra bits, which isn't faster either.
The only benefit is that smaller primitive types can result in a more compact memory layout, most notably when using arrays. This saves memory, which can improve locality of reference (thus reducing the number of cache misses) and reduce garbage collection overhead.
Generally speaking however, using the smaller primitive types is not faster.
To demonstrate that, behold the following benchmark:
public class Benchmark {
public static void benchmark(String label, Code code) {
print(25, label);
try {
for (int iterations = 1; ; iterations *= 2) { // detect reasonable iteration count and warm up the code under test
System.gc(); // clean up previous runs, so we don't benchmark their cleanup
long previouslyUsedMemory = usedMemory();
long start = System.nanoTime();
code.execute(iterations);
long duration = System.nanoTime() - start;
long memoryUsed = usedMemory() - previouslyUsedMemory;
if (iterations > 1E8 || duration > 1E9) {
print(25, new BigDecimal(duration * 1000 / iterations).movePointLeft(3) + " ns / iteration");
print(30, new BigDecimal(memoryUsed * 1000 / iterations).movePointLeft(3) + " bytes / iteration\n");
return;
}
}
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
private static void print(int desiredLength, String message) {
System.out.print(" ".repeat(Math.max(1, desiredLength - message.length())) + message);
}
private static long usedMemory() {
return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
}
@FunctionalInterface
interface Code {
/**
* Executes the code under test.
*
* @param iterations
* number of iterations to perform
* @return any value that requires the entire code to be executed (to
* prevent dead code elimination by the just in time compiler)
* @throws Throwable
* if the test could not complete successfully
*/
Object execute(int iterations) throws Throwable;
}
public static void main(String[] args) {
benchmark("long[] traversal", (iterations) -> {
long[] array = new long[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = i;
}
return array;
});
benchmark("int[] traversal", (iterations) -> {
int[] array = new int[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = i;
}
return array;
});
benchmark("short[] traversal", (iterations) -> {
short[] array = new short[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = (short) i;
}
return array;
});
benchmark("byte[] traversal", (iterations) -> {
byte[] array = new byte[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = (byte) i;
}
return array;
});
benchmark("long fields", (iterations) -> {
class C {
long a = 1;
long b = 2;
}
C[] array = new C[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = new C();
}
return array;
});
benchmark("int fields", (iterations) -> {
class C {
int a = 1;
int b = 2;
}
C[] array = new C[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = new C();
}
return array;
});
benchmark("short fields", (iterations) -> {
class C {
short a = 1;
short b = 2;
}
C[] array = new C[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = new C();
}
return array;
});
benchmark("byte fields", (iterations) -> {
class C {
byte a = 1;
byte b = 2;
}
C[] array = new C[iterations];
for (int i = 0; i < iterations; i++) {
array[i] = new C();
}
return array;
});
benchmark("long multiplication", (iterations) -> {
long result = 1;
for (int i = 0; i < iterations; i++) {
result *= 3;
}
return result;
});
benchmark("int multiplication", (iterations) -> {
int result = 1;
for (int i = 0; i < iterations; i++) {
result *= 3;
}
return result;
});
benchmark("short multiplication", (iterations) -> {
short result = 1;
for (int i = 0; i < iterations; i++) {
result *= 3;
}
return result;
});
benchmark("byte multiplication", (iterations) -> {
byte result = 1;
for (int i = 0; i < iterations; i++) {
result *= 3;
}
return result;
});
}
}
Run with OpenJDK 14 on my Intel Core i7 CPU @ 3.5 GHz, this prints:
long[] traversal 3.206 ns / iteration 8.007 bytes / iteration
int[] traversal 1.557 ns / iteration 4.007 bytes / iteration
short[] traversal 0.881 ns / iteration 2.007 bytes / iteration
byte[] traversal 0.584 ns / iteration 1.007 bytes / iteration
long fields 25.485 ns / iteration 36.359 bytes / iteration
int fields 23.126 ns / iteration 28.304 bytes / iteration
short fields 21.717 ns / iteration 20.296 bytes / iteration
byte fields 21.767 ns / iteration 20.273 bytes / iteration
long multiplication 0.538 ns / iteration 0.000 bytes / iteration
int multiplication 0.526 ns / iteration 0.000 bytes / iteration
short multiplication 0.786 ns / iteration 0.000 bytes / iteration
byte multiplication 0.784 ns / iteration 0.000 bytes / iteration
As you can see, the only significant speed savings occur when traversing large arrays; using smaller object fields yields negligible benefit, and computations are actually slightly slower on the small datatypes.
Overall, the performance differences are quite minor. Optimizing algorithms is far more important than the choice of primitive type.
Using byte instead of int can increase performance if you are using them in huge quantities. Here is an experiment:
import java.lang.management.*;
public class SpeedTest {
/** Get CPU time in nanoseconds. */
public static long getCpuTime() {
ThreadMXBean bean = ManagementFactory.getThreadMXBean();
return bean.isCurrentThreadCpuTimeSupported() ? bean
.getCurrentThreadCpuTime() : 0L;
}
public static void main(String[] args) {
long durationTotal = 0;
int numberOfTests=0;
for (int j = 1; j < 51; j++) {
long beforeTask = getCpuTime();
// MEASURES THIS AREA------------------------------------------
long x = 20000000;// 20 millions
for (long i = 0; i < x; i++) {
TestClass s = new TestClass();
}
// MEASURES THIS AREA------------------------------------------
long duration = getCpuTime() - beforeTask;
System.out.println("TEST " + j + ": duration = " + duration + "ns = "
+ (int) duration / 1000000);
durationTotal += duration;
numberOfTests++;
}
double average = durationTotal/numberOfTests;
System.out.println("-----------------------------------");
System.out.println("Average Duration = " + average + " ns = "
+ (int)average / 1000000 +" ms (Approximately)");
}
}
This class tests the speed of creating a new TestClass. Each test does it 20 million times, and there are 50 tests.
Here is the TestClass:
public class TestClass {
int a1= 5;
int a2= 5;
int a3= 5;
int a4= 5;
int a5= 5;
int a6= 5;
int a7= 5;
int a8= 5;
int a9= 5;
int a10= 5;
int a11= 5;
int a12=5;
int a13= 5;
int a14= 5;
}
I've run the SpeedTest class and in the end got this:
Average Duration = 8.9625E8 ns = 896 ms (Approximately)
Now I'm changing the ints into bytes in the TestClass and running it again. Here is the result:
Average Duration = 6.94375E8 ns = 694 ms (Approximately)
I believe this experiment shows that if you are instantiating a huge number of objects with many fields, using byte instead of int can increase efficiency.
byte is generally considered to be 8 bits. short is generally considered to be 16 bits.
In a "pure" environment (which Java isn't, since the implementation of bytes, shorts, longs, and other fun things is generally hidden from you), byte makes better use of space.
However, your computer is probably not 8-bit, and it is probably not 16-bit. This means that to obtain 16 or 8 bits in particular, it would need to resort to "trickery" which wastes time in order to pretend that it has the ability to access those types when needed.
At this point, it depends on how the hardware is implemented. However, from what I've been taught, the best speed is achieved by storing things in chunks that are comfortable for your CPU to use. A 64-bit processor likes dealing with 64-bit elements, and anything less than that often requires "engineering magic" to pretend that it likes dealing with them.
One of the reasons short/byte/char are less performant is the lack of direct support for these data types. By direct support, I mean that the JVM specification does not define a dedicated instruction set for them. Instructions like store, load, and add have versions for the int data type, but they do not have versions for short/byte/char. E.g. consider the Java code below:
void spin() {
int i;
for (i = 0; i < 100; i++) {
; // Loop body is empty
}
}
This gets compiled into bytecode as below.
0 iconst_0 // Push int constant 0
1 istore_1 // Store into local variable 1 (i=0)
2 goto 8 // First time through don't increment
5 iinc 1 1 // Increment local variable 1 by 1 (i++)
8 iload_1 // Push local variable 1 (i)
9 bipush 100 // Push int constant 100
11 if_icmplt 5 // Compare and loop if less than (i < 100)
14 return // Return void when done
Now, consider changing int to short as below.
void sspin() {
short i;
for (i = 0; i < 100; i++) {
; // Loop body is empty
}
}
The corresponding bytecode changes as follows:
0 iconst_0
1 istore_1
2 goto 10
5 iload_1 // The short is treated as though an int
6 iconst_1
7 iadd
8 i2s // Truncate int to short
9 istore_1
10 iload_1
11 bipush 100
13 if_icmplt 5
16 return
As you can observe, to manipulate the short data type, the JVM still uses the int versions of the instructions and explicitly converts int to short when required. This is what reduces performance.
The reason cited for not providing direct support is as follows:
The Java Virtual Machine provides the most direct support for data of type int. This is partly in anticipation of efficient implementations of the Java Virtual Machine's operand stacks and local variable arrays. It is also motivated by the frequency of int data in typical programs. Other integral types have less direct support. There are no byte, char, or short versions of the store, load, or add instructions, for instance.
Quoted from the JVM specification (page 58).
I would say the accepted answer is somewhat wrong in saying "it makes little difference in terms of space used". Here is an example showing that the difference can in fact be substantial:
Baseline usage 4.90MB, java: 11.0.12
Mem usage - bytes : +202.60 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints : +363.02 MB
Mem usage - bytes : +203.02 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints : +363.02 MB
Mem usage - bytes : +203.02 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints : +363.02 MB
The code to verify:
static class Bytes {
public byte f1;
public byte f2;
public byte f3;
public byte f4;
}
static class Shorts {
public short f1;
public short f2;
public short f3;
public short f4;
}
static class Ints {
public int f1;
public int f2;
public int f3;
public int f4;
}
@Test
public void memUsageTest() throws Exception {
int countOfItems = 10 * 1024 * 1024;
float MB = 1024*1024;
Runtime rt = Runtime.getRuntime();
System.gc();
Thread.sleep(1000);
long baseLineUsage = rt.totalMemory() - rt.freeMemory();
trace("Baseline usage %.2fMB, java: %s", (baseLineUsage / MB), System.getProperty("java.version"));
for( int j = 0; j < 3; j++ ) {
Bytes[] bytes = new Bytes[countOfItems];
for( int i = 0; i < bytes.length; i++ ) {
bytes[i] = new Bytes();
}
System.gc();
Thread.sleep(1000);
trace("Mem usage - bytes : +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
bytes = null;
Shorts[] shorts = new Shorts[countOfItems];
for( int i = 0; i < shorts.length; i++ ) {
shorts[i] = new Shorts();
}
System.gc();
Thread.sleep(1000);
trace("Mem usage - shorts: +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
shorts = null;
Ints[] ints = new Ints[countOfItems];
for( int i = 0; i < ints.length; i++ ) {
ints[i] = new Ints();
}
System.gc();
Thread.sleep(1000);
trace("Mem usage - ints : +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
ints = null;
}
}
private static void trace(String message, Object... args) {
String line = String.format(Locale.US, message, args);
System.out.println(line);
}
The difference is hardly noticeable! It's more a question of design, appropriateness, uniformity, habit, etc. Sometimes it's just a matter of taste. When all you care about is that your program gets up and running, and substituting a float for an int would not harm correctness, I see no advantage in going for one or the other unless you can demonstrate that using either type alters performance. Tuning performance based on types that differ by 2 or 3 bytes is really the last thing you should care about; Donald Knuth once said: "Premature optimization is the root of all evil" (not sure it was him, edit if you have the answer).

Why does Java's ArrayList's remove function seem to cost so little?

I have a function which manipulates a very large list, exceeding about 250,000 items. For the majority of those items, it simply replaces the item at position x. However, for about 5% of them, it must remove them from the list.
Using a LinkedList seemed to be the most obvious solution to avoid expensive removals. However, naturally, accessing a LinkedList by index becomes increasingly slow as time goes on. The cost here is minutes (and a lot of them).
Using an Iterator over that LinkedList is also expensive, as I appear to need a separate copy to avoid Iterator concurrency issues while editing that list. The cost here is minutes.
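For illustration, the copy-then-edit pattern I mean looks roughly like this (the element type Item and the shouldRemove predicate are placeholders for my real logic, and the per-item replacement is omitted):
List<Item> copy = new ArrayList<>(items);   // defensive copy just to iterate safely
for (Item item : copy) {
    if (shouldRemove(item)) {
        items.remove(item);                 // on a LinkedList this is a linear search every time
    }
}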
However, here's where my mind is blown a bit. If I change to an ArrayList, it runs almost instantly.
For a list with 297515 elements, removing 11958 elements and modifying everything else takes 909ms. I verified that the resulting list is indeed 285557 in size, as expected, and contains the updated information I need.
Why is this so fast? I looked at the source for ArrayList in JDK6 and it appears to be using an arraycopy function as expected. I would love to understand why an ArrayList works so well here when common sense would seem to indicate that an array for this task is an awful idea, requiring shifting several hundred thousand items.
I ran a benchmark, trying each of the following strategies for filtering the list elements:
Copy the wanted elements into a new list
Use Iterator.remove() to remove the unwanted elements from an ArrayList
Use Iterator.remove() to remove the unwanted elements from a LinkedList
Compact the list in-place (moving the wanted elements to lower positions)
Remove by index (List.remove(int)) on an ArrayList
Remove by index (List.remove(int)) on a LinkedList
Each time I populated the list with 100000 random instances of Point and used a filter condition (based on the hash code) that would accept 95% of elements and reject the remaining 5% (the same proportion stated in the question, but with a smaller list because I didn't have time to run the test for 250000 elements.)
And the average times (on my old MacBook Pro: Core 2 Duo, 2.2GHz, 3Gb RAM) were:
CopyIntoNewListWithIterator : 4.24ms
CopyIntoNewListWithoutIterator: 3.57ms
FilterLinkedListInPlace : 4.21ms
RandomRemoveByIndex : 312.50ms
SequentialRemoveByIndex : 33632.28ms
ShiftDown : 3.75ms
So removing elements by index from a LinkedList was more than 300 times more expensive than removing them from an ArrayList, and probably somewhere between 6000-10000 times more expensive than the other methods (that avoid linear search and arraycopy)
Here there doesn't seem to be much difference between the four faster methods, but I ran just those four again with a 500000-element list with the following results:
CopyIntoNewListWithIterator : 92.49ms
CopyIntoNewListWithoutIterator: 71.77ms
FilterLinkedListInPlace : 15.73ms
ShiftDown : 11.86ms
I'm guessing that with the larger size cache memory becomes the limiting factor, so the cost of creating a second copy of the list becomes significant.
Here's the code:
import java.awt.Point;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
public class ListBenchmark {
public static void main(String[] args) {
Random rnd = new SecureRandom();
Map<String, Long> timings = new TreeMap<String, Long>();
for (int outerPass = 0; outerPass < 10; ++ outerPass) {
List<FilterStrategy> strategies =
Arrays.asList(new CopyIntoNewListWithIterator(),
new CopyIntoNewListWithoutIterator(),
new FilterLinkedListInPlace(),
new RandomRemoveByIndex(),
new SequentialRemoveByIndex(),
new ShiftDown());
for (FilterStrategy strategy: strategies) {
String strategyName = strategy.getClass().getSimpleName();
for (int innerPass = 0; innerPass < 10; ++ innerPass) {
strategy.populate(rnd);
if (outerPass >= 5 && innerPass >= 5) {
Long totalTime = timings.get(strategyName);
if (totalTime == null) totalTime = 0L;
timings.put(strategyName, totalTime - System.currentTimeMillis());
}
Collection<Point> filtered = strategy.filter();
if (outerPass >= 5 && innerPass >= 5) {
Long totalTime = timings.get(strategyName);
timings.put(strategy.getClass().getSimpleName(), totalTime + System.currentTimeMillis());
}
CHECKSUM += filtered.hashCode();
System.err.printf("%-30s %d %d %d%n", strategy.getClass().getSimpleName(), outerPass, innerPass, filtered.size());
strategy.clear();
}
}
}
for (Map.Entry<String, Long> e: timings.entrySet()) {
System.err.printf("%-30s: %9.2fms%n", e.getKey(), e.getValue() * (1.0/25.0));
}
}
public static volatile int CHECKSUM = 0;
static void populate(Collection<Point> dst, Random rnd) {
for (int i = 0; i < INITIAL_SIZE; ++ i) {
dst.add(new Point(rnd.nextInt(), rnd.nextInt()));
}
}
static boolean wanted(Point p) {
return p.hashCode() % 20 != 0;
}
static abstract class FilterStrategy {
abstract void clear();
abstract Collection<Point> filter();
abstract void populate(Random rnd);
}
static final int INITIAL_SIZE = 100000;
private static class CopyIntoNewListWithIterator extends FilterStrategy {
public CopyIntoNewListWithIterator() {
list = new ArrayList<Point>(INITIAL_SIZE);
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
ArrayList<Point> dst = new ArrayList<Point>(list.size());
for (Point p: list) {
if (wanted(p)) dst.add(p);
}
return dst;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final ArrayList<Point> list;
}
private static class CopyIntoNewListWithoutIterator extends FilterStrategy {
public CopyIntoNewListWithoutIterator() {
list = new ArrayList<Point>(INITIAL_SIZE);
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
int inputSize = list.size();
ArrayList<Point> dst = new ArrayList<Point>(inputSize);
for (int i = 0; i < inputSize; ++ i) {
Point p = list.get(i);
if (wanted(p)) dst.add(p);
}
return dst;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final ArrayList<Point> list;
}
private static class FilterLinkedListInPlace extends FilterStrategy {
public String toString() {
return getClass().getSimpleName();
}
FilterLinkedListInPlace() {
list = new LinkedList<Point>();
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
for (Iterator<Point> it = list.iterator();
it.hasNext();
) {
Point p = it.next();
if (! wanted(p)) it.remove();
}
return list;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final LinkedList<Point> list;
}
private static class RandomRemoveByIndex extends FilterStrategy {
public RandomRemoveByIndex() {
list = new ArrayList<Point>(INITIAL_SIZE);
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
for (int i = 0; i < list.size();) {
if (wanted(list.get(i))) {
++ i;
} else {
list.remove(i);
}
}
return list;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final ArrayList<Point> list;
}
private static class SequentialRemoveByIndex extends FilterStrategy {
public SequentialRemoveByIndex() {
list = new LinkedList<Point>();
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
for (int i = 0; i < list.size();) {
if (wanted(list.get(i))) {
++ i;
} else {
list.remove(i);
}
}
return list;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final LinkedList<Point> list;
}
private static class ShiftDown extends FilterStrategy {
public ShiftDown() {
list = new ArrayList<Point>();
}
@Override
void clear() {
list.clear();
}
@Override
Collection<Point> filter() {
int inputSize = list.size();
int outputSize = 0;
for (int i = 0; i < inputSize; ++ i) {
Point p = list.get(i);
if (wanted(p)) {
list.set(outputSize++, p);
}
}
list.subList(outputSize, inputSize).clear();
return list;
}
@Override
void populate(Random rnd) {
ListBenchmark.populate(list, rnd);
}
private final ArrayList<Point> list;
}
}
Array copy is a rather inexpensive operation. It is done at a very low level (it's a native static method in Java), and you are not yet in the range where performance really starts to matter.
In your example you copy an array of size 150000 (on average) approximately 12000 times. This does not take much time. I tested it on my laptop and it took less than 500 ms.
Update: I used the following code to measure on my laptop (Intel P8400):
import java.util.Random;
public class PerformanceArrayCopy {
public static void main(String[] args) {
int[] lengths = new int[] { 10000, 50000, 125000, 250000 };
int[] loops = new int[] { 1000, 5000, 10000, 20000 };
for (int length : lengths) {
for (int loop : loops) {
Object[] list1 = new Object[length];
Object[] list2 = new Object[length];
for (int k = 0; k < 100; k++) {
System.arraycopy(list1, 0, list2, 0, list1.length);
}
int[] len = new int[loop];
int[] ofs = new int[loop];
Random rnd = new Random();
for (int k = 0; k < loop; k++) {
len[k] = rnd.nextInt(length);
ofs[k] = rnd.nextInt(length - len[k]);
}
long n = System.nanoTime();
for (int k = 0; k < loop; k++) {
System.arraycopy(list1, ofs[k], list2, ofs[k], len[k]);
}
n = System.nanoTime() - n;
System.out.print("length: " + length);
System.out.print("\tloop: " + loop);
System.out.print("\truntime [ms]: " + n / 1000000);
System.out.println();
}
}
}
}
Some results:
length: 10000 loop: 10000 runtime [ms]: 47
length: 50000 loop: 10000 runtime [ms]: 228
length: 125000 loop: 10000 runtime [ms]: 575
length: 250000 loop: 10000 runtime [ms]: 1198
I think the difference in performance likely comes down to the fact that ArrayList supports random access where LinkedList does not.
If I call get(1000) on an ArrayList, I am specifying an index to access directly; a LinkedList doesn't support this, as it is organized through node references.
If I call get(1000) on a LinkedList, it will iterate the list until it finds index 1000, and this can be exorbitantly expensive if the LinkedList holds a large number of items.
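For reference, this is roughly how the JDK's LinkedList resolves an index internally (simplified from LinkedList.node(int)); every indexed get has to walk node by node from whichever end is closer:
Node<E> node(int index) {
    if (index < (size >> 1)) {
        Node<E> x = first;
        for (int i = 0; i < index; i++)
            x = x.next;                     // walk forward from the head
        return x;
    } else {
        Node<E> x = last;
        for (int i = size - 1; i > index; i--)
            x = x.prev;                     // walk backward from the tail
        return x;
    }
}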
Interesting and unexpected results. This is just a hypothesis, but...
On average one of your array element removals will require moving half of your list (everything after it) back one element. If each item is a 64-bit pointer to an object (8 bytes), then this means copying 125000 items x 8 Bytes per pointer = 1 MB.
A modern CPU can copy a contiguous block of 1 MB of RAM to RAM pretty quickly.
Compared to looping over a linked list for every access, which requires comparisons and branching and other CPU unfriendly activities, the RAM copy is fast.
You should really try benchmarking the various operations independently and see how efficient they are with various list implementations. Share your results here if you do!
I'm skipping over some implementation details on purpose here, just to explain the fundamental difference.
To remove the N-th element of a list of M elements, the LinkedList implementation will navigate up to this element, then simply remove it and update the pointers of the N-1 and N+1 elements accordingly. This second operation is very simple, but it's getting up to this element that costs you time.
For an ArrayList however, the access time is instantaneous as it is backed by an array, meaning contiguous memory spaces. You can jump directly to the right memory address to perform, broadly speaking, the following:
allocate a new array of M - 1 elements
put everything from 0 to N - 1 at index 0 in the new ArrayList's array
put everything from N + 1 to M at index N in the ArrayList's array.
Thinking of it, you'll notice you can even reuse the same array as Java can use ArrayList with pre-allocated sizes, so if you remove elements you might as well skip steps 1 and 2 and directly do step 3 and update your size.
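That is in fact what the JDK does. Slightly simplified from the ArrayList source, remove(int) shifts the tail down with one bulk copy and reuses the same backing array:
public E remove(int index) {
    rangeCheck(index);
    E oldValue = elementData(index);
    int numMoved = size - index - 1;
    if (numMoved > 0)
        System.arraycopy(elementData, index + 1, elementData, index, numMoved); // shift the tail down by one slot
    elementData[--size] = null;  // drop the stale reference so the GC can reclaim it
    return oldValue;
}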
Memory accesses are fast, and copying a chunk of memory is cheap enough on modern hardware that it beats walking node by node to the N-th position.
However, should you use your LinkedList in such a way that it allows you to remove multiple elements that follow each other and keep track of your position, you would see a gain.
But clearly, on a long list, doing a simple remove(i) will be costly.
To add a bit of salt and spice to this:
See the note on Efficiency on the Array Data Structure and the note on Performance on the Dynamic Array Wikipedia entries, which describe your concern.
Keep in mind that using a memory structure that requires contiguous memory requires, well, contiguous memory. That means your virtual memory needs to be able to allocate contiguous chunks. Even with Java, you'll see your JVM happily go down with an obscure OutOfMemoryError rooted in a low-level allocation failure.
