Huge performance difference between Vector and HashSet - java

I have a program which fetches records from a database (using Hibernate) and fills them into a Vector. The operation was performing poorly, so I ran a test with the Vector replaced by a HashSet. With 300000 records, the speed gain is immense - 45 mins down to 2 mins!
So my question is, what is causing this huge difference? Is it just the point that all methods in Vector are synchronized or the point that internally Vector uses an array whereas HashSet does not? Or something else?
The code is running in a single thread.
EDIT:
The code is only inserting the values in the Vector (and in the other case, HashSet).

If your code is trying to use the Vector as a set, checking for the existence of a record before adding it, then filling the vector becomes an O(n^2) operation, compared with O(n) for HashSet. It would also become an O(n^2) operation if you inserted each element at the start of the vector instead of at the end.
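For instance, a de-duplicating fill like this sketch ('record' stands in for whatever your entity is) rescans the whole vector on every add:

// O(n) contains() scan before each insert makes the whole fill O(n^2)
if (!vector.contains(record)) {
    vector.add(record);
}

// HashSet.add() both checks and inserts in amortized O(1)
set.add(record);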
If you're just using collection.add(item) then I wouldn't expect to see that sort of difference - synchronization isn't that slow.
If you can try to test it with different numbers of records, you could see how each version grows as n increases - that would make it easier to work out what's going on.
EDIT: If you're just using Vector.add then it sounds like something else could be going on - e.g. your database was behaving differently between your different test runs. Here's a little test application:
import java.util.*;

public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Vector<String> vector = new Vector<String>();
        for (int i = 0; i < 300000; i++) {
            vector.add("dummy value");
        }
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}
Output:
Time taken: 38ms
Now obviously this isn't going to be very accurate - System.currentTimeMillis isn't the best way of getting accurate timing - but it's clearly not taking 45 minutes. In other words, you should look elsewhere for the problem, if you really are just calling Vector.add(item).
Now, changing the code above to use
vector.add(0, "dummy value"); // Insert item at the beginning
makes an enormous difference - it takes 42 seconds instead of 38ms. That's clearly a lot worse - but it's still a long way from being 45 minutes - and I doubt that my desktop is 60 times as fast as yours.

If you are inserting them at the middle or beginning instead of at the end, then the Vector needs to shift all the later elements along. Every insert. The HashSet, on the other hand, doesn't care about position and doesn't have to move anything.

Vector is outdated and should not be used anymore. Profile with ArrayList or LinkedList (depending on how you use the list) and you will see the difference (synchronized vs. unsynchronized).
Why are you using Vector in a single-threaded application at all?

Vector is synchronized by default; HashSet is not. That's my guess: obtaining a monitor for access takes time.
I don't know if there are reads in your test, but Vector and HashSet are both O(1) for lookups - Vector via get(index), HashSet via contains().

Under normal circumstances, it is totally implausible that inserting 300,000 records into a Vector will take 43 minutes longer than inserting the same records into a HashSet.
However, I think there is a possible explanation of what might be going on.
First, the records coming out of the database must have a very high proportion of duplicates. Or at least, they must be duplicates according to the semantics of the equals/hashCode methods of your record class.
Next, I think you must be pushing very close to filling up the heap.
So the reason that the HashSet solution is so much faster is that most of the records are being discarded as duplicates by the set.add operation. By contrast, the Vector solution keeps all of the records, and the JVM is spending most of its time trying to squeeze out that last 0.05% of memory by running the GC over, and over, and over again.
One way to test this theory is to run the Vector version of the application with a much bigger heap.
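For instance (the heap size and class name here are illustrative), rerun it along these lines and watch the GC output:

java -Xmx4g -verbose:gc YourApp

If the 45-minute run shows near-continuous GC messages and the bigger heap makes them (and the slowdown) go away, that supports the theory.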
Irrespective, the best way to investigate this kind of problem is to run the application using a profiler, and see where all the CPU time is going.

import java.util.*;

public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Vector<String> vector = new Vector<String>();
        for (int i = 0; i < 300000; i++) {
            String value = "dummy value " + i;
            if (!vector.contains(value)) { // O(n) scan of the vector before every insert
                vector.add(value);
            }
        }
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}
If you check for a duplicate element before inserting into the vector, each insert takes time proportional to the current size of the vector. The best way is to use a HashSet for high performance, because a HashSet does not allow duplicates, so there is no need to check for an existing element before inserting.

Dr Heinz Kabutz said this in one of his newsletters:
The old Vector class implements serialization in a naive way. It simply does the default serialization, which writes the entire internal Object[] array as-is into the stream. Thus if we insert a bunch of elements into the List, then clear it, the difference between Vector and ArrayList is enormous.
import java.util.*;
import java.io.*;

public class VectorWritingSize {
    public static void main(String[] args) throws IOException {
        test(new LinkedList<String>());
        test(new ArrayList<String>());
        test(new Vector<String>());
    }

    public static void test(List<String> list) throws IOException {
        insertJunk(list);
        for (int i = 0; i < 10; i++) {
            list.add("hello world");
        }
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(baos);
        out.writeObject(list);
        out.close();
        System.out.println(list.getClass().getSimpleName() +
                " used " + baos.toByteArray().length + " bytes");
    }

    private static void insertJunk(List<String> list) {
        for (int i = 0; i < 1000 * 1000; i++) {
            list.add("junk");
        }
        list.clear();
    }
}
When we run this code, we get the following output:
LinkedList used 107 bytes
ArrayList used 117 bytes
Vector used 1310926 bytes
Vector can use a staggering amount of bytes when being serialized. The lesson here? Don't ever use Vector as a List in objects that are Serializable. The potential for disaster is too great.
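If you are stuck with a Vector field, one workaround (a sketch, not from the newsletter) is to serialize a compact copy instead, since ArrayList's serialized form only contains the elements actually present:

// sketch: write an ArrayList copy rather than the Vector itself
out.writeObject(new ArrayList<String>(vector));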

Related

Why is my java program becoming gradually slower?

I recently built a Fibonacci generator that uses recursion and hashmaps to reduce complexity. I am using System.nanoTime() to keep track of the time it takes for my program to print 10000 Fibonacci numbers. It started out well at less than a second, but it has gradually become slower and now takes more than 4 seconds. Can someone explain why this might be happening? The code is below:
import java.util.*;
import java.math.*;

public class FibonacciGeneratorUnlimited {
    static int numFibCalls = 0;
    static HashMap<Integer, BigInteger> d = new HashMap<Integer, BigInteger>();
    static Scanner fibNumber = new Scanner(System.in);
    static BigInteger ans = new BigInteger("0");

    public static void main(String[] args) {
        d.put(0, new BigInteger("0"));
        d.put(1, new BigInteger("1"));
        System.out.print("Enter the term:\t");
        int n = fibNumber.nextInt();
        long startTime = System.nanoTime();
        for (int i = 0; i <= n; i++) {
            System.out.println(i + " : " + fib_efficient(i, d));
        }
        System.out.println((double) (System.nanoTime() - startTime) / 1000000000);
    }

    public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
        numFibCalls += 1;
        if (d.containsKey(n)) {
            return d.get(n);
        } else {
            ans = fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
            d.put(n, ans);
            return ans;
        }
    }
}
If you are restarting the program every time you generate a new Fibonacci sequence, then your program most likely isn't the problem. It might just be that your processor got hot after running the program a few times, or that a background process on your computer suddenly started, causing your program to slow down.
More memory (java -Xmx...) or less caching:
public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
    numFibCalls++;
    if ((n & 3) <= 1) { // two consecutive values out of every four are cached
        BigInteger cached = d.get(n);
        if (cached != null) {
            return cached;
        } else {
            BigInteger ans = fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
            d.put(n, ans);
            return ans;
        }
    } else {
        return fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
    }
}
Two subsequent numbers out of every four are cached, in order to stop the recursion on both branches of fib(n) = fib(n-1) + fib(n-2).
BigInteger isn't the nicest class where performance and memory are concerned.
It started out good with less than a second but gradually became slower and now it takes more than 4 seconds.
What do you mean by this? Do you mean that you ran this exact same program with the same input and its run-time changed from < 1 second to > 4 seconds?
If you have the same exact code running with the same exact inputs in a deterministic algorithm...
then the differences are probably external to your code - maybe other processes are taking up more CPU on one run.
Do you mean that you increased the inputs from some value X to 10,000 and now it takes > 4 seconds?
Then that's just a matter of the algorithm taking longer with larger inputs, which is perfectly normal.
recursion and hashmaps to reduce complexity
That's not quite how complexity works. You have improved the best-case and the average-case, but you have done nothing to change the worst-case.
Now for some actual performance improvement advice
Stop printing out the results... that's eating up over 99% of your processing time. Seriously though, switch out System.out.println(i + " : " + fib_efficient(i, d)) for plain fib_efficient(i, d) and it'll execute over 100x faster.
Concatenating strings and printing to the console are very expensive operations.
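A minimal sketch of the batching alternative (timing numbers will vary by machine): build all the output in a StringBuilder and print once at the end.

StringBuilder sb = new StringBuilder();
for (int i = 0; i <= n; i++) {
    sb.append(i).append(" : ").append(fib_efficient(i, d)).append('\n');
}
System.out.print(sb); // one console write instead of n + 1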
It happens because the complexity of this Fibonacci implementation is O(n^2). This means the running time grows quadratically with the input, as you can see in the graph for O(n^2) in this link. Check this answer to see a complete explanation of its complexity.
Now, the running time of your algorithm also grows because you are using a HashMap to search and insert elements each time the function is invoked. Consider removing this HashMap.

How can I evaluate a hash table implementation? (Using HashMap as reference)

Problem:
I need to compare 2 hash table implementations (well basically HashMap with another one) and make a reasonable conclusion.
I am not interested in 100% accuracy but just being in the right direction in my estimation.
I am interested in the difference not only per operation but mainly on the hashtable as a "whole".
I don't have a strict requirement on speed so if the other implementation is reasonably slower I can accept it but I do expect/require that the memory usage be better (since one of the hashtables is backed by primitive table).
What I did so far:
Originally I created my own custom "benchmark" with loops and many calls to hint for gc to get a feeling of the difference but I am reading online that using a standard tool is more reliable/appropriate.
Example of my approach (MapInterface is just a wrapper so I can switch among implementations.):
Integer[] keys = new Integer[10000000];
String[] values = new String[10000000];
for (int i = 0; i < keys.length; ++i) {
    keys[i] = i;
    values[i] = "" + i;
}

if (operation.equals("put")) {
    runPutOperation(map);
}

public static long[] runOperation(MapInterface map, Integer[] keys, String[] values) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long run = 0;
    for (int i = 0; i < 10; ++i) {
        long start = System.currentTimeMillis();
        for (int j = 0; j < keys.length; ++j) {
            map.put(keys[j], values[j]);
        }
        long total = System.currentTimeMillis() - start;
        System.out.println(total / 1000d + " seconds");
        if (total < min) {
            min = total;
        }
        if (total > max) {
            max = total;
        }
        run += total;
        map = null;
        map = createNewHashMap();
        hintsToGC();
    }
    return new long[] {min, max, run};
}

public void hintsToGC() {
    for (int i = 0; i < 20; ++i) {
        System.out.print(". ");
        System.gc();
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

private HashMapInterface<String> createNewHashMap() {
    if (jdk) {
        return new JDKHashMapWrapper<String>();
    } else {
        return new AlternativeHashMapWrapper<String>();
    }
}

public class JDKHashMapWrapper implements HashMapInterface<String> {
    HashMap<Integer, String> hashMap;

    JDKHashMapWrapper() {
        hashMap = new HashMap<Integer, String>();
    }

    public String put(Integer key, String value) {
        return hashMap.put(key, value);
    }
    //etc
}
(I want to test put, get, contains and the memory utilization)
Can I be sure by using my approach that I can get reasonable measurements?
If not what would be the most appropriate tool to use and how?
Update:
- I also test with random numbers (also ~10M random numbers) using SecureRandom.
- When the hash table resizes I print the logical size of the hash table/size of the actual table to get the load factor
Update:
For my specific case, where I am also interested in integers, what kind of pitfalls are there with my approach?
UPDATE after #dimo414 comments:
Well at a minimum the hashtable as a "whole" isn't meaningful
I mean how the hashtable behaves under various loads both at runtime and in memory consumption.
Every data structure is a tradeoff of different methods
I agree. My trade-off is an acceptable access penalty for memory improvement
You need to identify what features you're interested in verifying
1) put(key, value);
2) get(key, value);
3) containsKey(key);
4) all the above when having many entries in the hash table
Some key considerations for using hash tables are the size of the "buckets" allocation, the collision resolution strategy, and the shape of your data. Essentially, a hash table takes the key supplied by the application and hashes it to select one of the allocated buckets. When two key values hash to the same bucket, the implementation has to resolve the collision and return the right value. For example, one could have a sorted linked list for each bucket, and that list is searched.
If your data happens to have a lot of collisions, then your performance will suffer, because the hash table implementation will spend too much time resolving collisions. On the other hand, if you have a very large number of buckets, you solve the collision problem at the expense of memory. Also, Java's built-in HashMap implementation will "rehash" if the number of entries gets larger than a certain amount - I imagine this is an expensive operation that is worth avoiding.
Since your key data is the positive integers from 1 to 10M, your test data looks good. I would also ensure that the different hash table implementations are initialized to the same bucket size for a given test, otherwise it's not a fair comparison. Finally, I would vary the bucket size over a pretty significant range and rerun the tests to see how the implementations change their behavior.
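For example (the capacity and load factor here are illustrative, and AlternativeMap is a stand-in for whatever your other implementation is called):

Map<Integer, String> jdkMap = new HashMap<Integer, String>(1 << 24, 0.75f);
// give the alternative implementation equivalent settings, e.g.:
// AlternativeMap<String> other = new AlternativeMap<String>(1 << 24, 0.75f);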
As I understand it, you are interested in both the operations' execution time and the memory consumption of the maps under test.
I will start with memory consumption, as this seems not to be answered at all. What I propose is to use a small library called Classmexer. I personally use it when I need to get the 100% correct memory consumption of any object. It takes the java agent approach (because it uses the Instrumentation API), which means that you need to add it as a parameter to the JVM executing your tests:
-javaagent:[PATH_TO]/classmexer.jar
The usage of Classmexer is very simple. At any point in time you can get the memory consumption in bytes by executing:
MemoryUtil.deepMemoryUsageOf(mapIamInterestedIn, VisibilityFilter.ALL)
Note that with the visibility filter you can specify whether the memory calculation should be done for the object alone (our map) or also for all other objects reachable through references. That's what VisibilityFilter.ALL is for. However, this means that the size you get back includes all objects you used for keys and values. Thus if you have 100 Integer/String entries, the reported size will include those as well.
For the timing aspect I would propose JMH, as this tool is made for micro-benchmarking. There are plenty of examples online; for example, this article has map testing examples that can guide you pretty well.
Note that you should be careful about when you call Classmexer's MemoryUtil, as it will interfere with the timing results if you call it during time measurement. Furthermore, I am sure that there are many other tools similar to Classmexer, but I like it because it's small and simple.
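To make the JMH suggestion concrete, here is a minimal benchmark sketch (class and parameter names are mine, not from your code):

import org.openjdk.jmh.annotations.*;
import java.util.HashMap;
import java.util.Map;

@State(Scope.Benchmark)
public class MapPutBenchmark {
    @Param({"1000000", "10000000"})
    int size;

    Integer[] keys;
    String[] values;

    @Setup
    public void setup() {
        keys = new Integer[size];
        values = new String[size];
        for (int i = 0; i < size; i++) {
            keys[i] = i;
            values[i] = Integer.toString(i);
        }
    }

    @Benchmark
    public Map<Integer, String> putAll() {
        Map<Integer, String> map = new HashMap<Integer, String>();
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], values[i]);
        }
        return map; // return the map so JMH doesn't dead-code-eliminate the loop
    }
}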
I was just doing something similar to this, and I ended up using the built in profiler in the Netbeans IDE. You can get really detailed info on both CPU and memory usage. I had originally written all my code in Eclipse, but Netbeans has an import feature for bringing in Eclipse projects and it set it all up no problem, if that is possibly your situation too.
For timing, you might also look at the StopWatch class in Apache Commons. It's a much more intuitive way of tracking time on targeted operations, e.g.:
StopWatch myMapTimer = new StopWatch();
HashMap<Integer, Integer> hashMap = new HashMap<>();

myMapTimer.start();
for (int i = 0; i < numElements; i++) {
    hashMap.put(i, i);
}
myMapTimer.stop();

System.out.println(myMapTimer.getTime()); // time will be in milliseconds

Performance difference between assignment and conditional test

This question is specifically geared towards Java, but I would not mind feedback on the general concept. I would like to know which operation is faster, or whether there is no difference: assigning a variable a value, or testing a value. Suppose we have a large series of Boolean values that will receive many change requests. Is testing whether a value needs to change wasted effort compared with simply changing the value on every request?
public static void main(String[] args) {
    Boolean[] array = new Boolean[veryLargeValue];
    for (int i = 0; i < array.length; i++) {
        array[i] = randomTrueFalseAssignment;
    }
    for (int i = 400; i < array.length - 400; i++) {
        testAndChange(array, i);
    }
    for (int i = 400; i < array.length - 400; i++) {
        justChange(array, i);
    }
}
This could be the testAndChange method
public static void testAndChange(Boolean[] pArray, int ind) {
    if (pArray[ind])
        pArray[ind] = false;
}
This could be the justChange method
public static void justChange(Boolean[] pArray, int ind) {
    pArray[ind] = false;
}
If we were to end up with the very rare case that every value within the range supplied to the methods were false, would there be a point where one method would eventually become slower than the other? Is there a best practice for issues similar to this?
Edit: I wanted to add this to help clarify the question a bit more. I realize that the data type can be factored into the answer, as larger or more efficient data types can be utilized. I am more focused on the task itself: is the task of a test, if (aConditionalTest), slower than, faster than, or indistinguishable from the task of an assignment, x = aValue, without additional information (such as data type)?
As #TrippKinetics points out, there is a semantic difference between the two methods. Because you use Boolean instead of boolean, it is possible that one of the values is a null reference. In that case the first method (with the if-statement) will throw an exception, while the second simply assigns values to all the elements in the array.
Assuming you use boolean[] instead of Boolean[]: optimization is an undecidable problem, and there are very rare cases where adding an if-statement can result in better performance. For instance, most processors use caches, and the if-statement can mean the executed code happens to fit on exactly two cache pages where, without the if, it would span more, resulting in cache misses. You may think you will save an assignment instruction, but it comes at the cost of a fetch instruction and a conditional instruction (which breaks the CPU pipeline). Assigning has more or less the same cost as fetching a value.
In general, however, one can assume that adding an if-statement like this is useless and will nearly always result in slower code.
More specifically on your question, there are faster ways to set a range to false. For instance using bitvectors like:
long[] data = new long[(veryLargeValue + 0x3f) >> 0x06]; // a long has 64 bits
// assign random values
int low = 400 >> 0x06;
int high = (veryLargeValue - 400) >> 0x06;
data[low] &= 0xffffffffffffffffL << (0x3f - (400 & 0x3f));
for (int i = low + 0x01; i < high; i++) {
    data[i] = 0x00;
}
data[high] &= 0xffffffffffffffffL >>> ((veryLargeValue - 400) & 0x3f);
The advantage is that a processor can perform operations on 32- or 64-bits at once. Since a boolean is one bit, by storing bits into a long or int, operations are done in parallel.
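If you would rather not hand-roll the masks, java.util.BitSet does the same word-at-a-time clearing for you; a rough equivalent of the snippet above:

BitSet bits = new BitSet(veryLargeValue);
// ... random true/false assignment ...
bits.clear(400, veryLargeValue - 400); // toIndex is exclusive; clears whole 64-bit words internally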

Java wordcount: a mediocre implementation

I implemented a wordcount program with Java. Basically, the program takes a large file (in my tests, I used a 10 gb data file that contained numbers only), and counts the number of times each 'word' appears - in this case, a number (23723 for example might appear 243 times in the file).
Below is my implementation. I seek to improve it, with mainly performance in mind, but a few other things as well, and I am looking for some guidance. Here are a few of the issues I wish to correct:
Currently, the program is threaded and works properly. However, what I do is pass a chunk of memory (500MB/NUM_THREADS) to each thread, and each thread proceeds to wordcount. The problem here is that I have the main thread wait for ALL the threads to complete before passing more data to each thread. It isn't too much of a problem, but there is a period of time where a few threads will wait and do nothing for a while. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance). Currently, I use the fact that I know the input is an integer, and just store the temporary variables as an int, so no memory problems there. I want to be able to use some sort of delimiter, whether that delimiter be a space, or several characters.
I am using a global ConcurrentHashMap to store key-value pairs. For example, if a thread finds a number "24624", it searches for that number in the map. If it exists, it will increase the value of that key by one. The value of the key at the end represents the number of occurrences of that key. So is this the proper design? Would I gain in performance by giving each thread its own hashmap, and then merging them all at the end?
Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? That class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
I am open to other possibilities as well, this is just what comes to mind.
Note: Splitting the file is not an option I want to explore, as I might be deploying this on a server in which I should not be creating my own files, but if it would really be a performance boost, I might listen.
Other Note: I am new to java threading, as well as new to StackOverflow. Be gentle.
import java.io.*;
import java.util.*;
import java.util.concurrent.*;

public class BigCount2 {
    public static void main(String[] args) throws IOException, InterruptedException {
        int num, counter;
        long i, j;
        String delimiterString = " ";
        ArrayList<Character> delim = new ArrayList<Character>();
        for (char c : delimiterString.toCharArray()) {
            delim.add(c);
        }
        int counter2 = 0;
        num = Integer.parseInt(args[0]);
        int bytesToRead = 1024 * 1024 * 1024 / 2; // 500 MB, size of each outer pass
        int remainder = bytesToRead % num;
        int k = 0;
        bytesToRead = bytesToRead - remainder;
        int byr = bytesToRead / num;
        String filepath = "C:/Users/Daniel/Desktop/int-dataset-10g.dat";
        RandomAccessFile file = new RandomAccessFile(filepath, "r");
        Thread[] t = new Thread[num]; // array of threads
        ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<Integer, Integer>(25000);
        byte[] byteArray = new byte[byr]; // one thread's slice of the 500 MB chunk
        char[] newbyte;
        for (i = 0; i < file.length(); i += bytesToRead) {
            counter = 0;
            for (j = 0; j < bytesToRead; j += byr) {
                file.seek(i + j);
                file.read(byteArray, 0, byr);
                newbyte = new String(byteArray).toCharArray();
                t[counter] = new Thread(
                        new BigCountThread2(counter,
                                newbyte,
                                delim,
                                wordCountMap)); // each thread gets its own slice
                t[counter].start();
                counter++;
                newbyte = null;
            }
            for (k = 0; k < num; k++) {
                t[k].join(); // main thread continues after ALL threads have finished
            }
            counter2++;
            System.gc();
        }
        file.close();
        System.exit(0);
    }
}
class BigCountThread2 implements Runnable {
    private final ConcurrentMap<Integer, Integer> wordCountMap;
    char[] newbyte;
    private ArrayList<Character> delim;
    private int threadId; // for later use

    BigCountThread2(int tid,
                    char[] newbyte,
                    ArrayList<Character> delim,
                    ConcurrentMap<Integer, Integer> wordCountMap) {
        this.delim = delim;
        threadId = tid;
        this.wordCountMap = wordCountMap;
        this.newbyte = newbyte;
    }

    public void run() {
        int intCheck = 0;
        int counter = 0;
        int i = 0;
        Integer check;
        int j = 0;
        int temp = 0;
        int intbuilder = 0;
        for (i = 0; i < newbyte.length; i++) {
            intCheck = Character.getNumericValue(newbyte[i]);
            if (newbyte[i] == ' ' || intCheck == -1) { // delimiter found: add the current number to the map
                check = wordCountMap.putIfAbsent(intbuilder, 1);
                if (check != null) { // putIfAbsent returns null only for the first instance
                    wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1);
                }
                intbuilder = 0;
            } else {
                intbuilder = (intbuilder * 10) + intCheck;
                counter++;
            }
        }
    }
}
Some thoughts on most of these points:
.. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
If all the threads take about the same time to process the same amount of data, then there really isn't that much of a "problem" here.
However, one nice thing about a Thread Pool is it allows one to rather trivially adjust some basic parameters such as number of concurrent workers. Furthermore, using an executor service and Futures can provide an additional level of abstraction; in this case it could be especially handy if each thread returned a map as the result.
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance) ..
This sounds like an implementation issue. While I would first try a StreamTokenizer (because it's already written), if doing it manually, I would check out the source - a good bit of that can be omitted when simplifying the notion of a "token". (It uses a temporary array to build the token.)
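For reference, a rough sketch of the StreamTokenizer route (untested; assumes a method that declares throws IOException, and the file path is the one from your code):

Reader r = new BufferedReader(new FileReader("C:/Users/Daniel/Desktop/int-dataset-10g.dat"));
StreamTokenizer st = new StreamTokenizer(r);
while (st.nextToken() != StreamTokenizer.TT_EOF) {
    if (st.ttype == StreamTokenizer.TT_NUMBER) {
        int value = (int) st.nval; // nval is a double
        // update the count for 'value' here
    }
}
r.close();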
I am using a global ConcurrentHashMap to store key-value pairs. .. So is this the proper design? Would I gain in performance by giving each thread its own hashmap, and then merging them all at the end?
It would reduce locking and may increase performance to use a separate map per thread and a merge strategy. Furthermore, the current implementation is broken: wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1) is not atomic, and thus the operation might undercount. I would use a separate map simply because reducing mutable shared state makes a threaded program much easier to reason about.
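A sketch of that merge strategy (assumes Java 8's Map.merge; perThreadMaps is a hypothetical list holding each worker's finished map):

// in each worker: a plain HashMap, no locking, no lost updates
Map<Integer, Integer> local = new HashMap<Integer, Integer>();
local.merge(number, 1, Integer::sum);

// in the main thread, after all join() calls:
Map<Integer, Integer> merged = new HashMap<Integer, Integer>();
for (Map<Integer, Integer> m : perThreadMaps) {
    for (Map.Entry<Integer, Integer> e : m.entrySet()) {
        merged.merge(e.getKey(), e.getValue(), Integer::sum);
    }
}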
Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
Consider using a FileReader (and BufferedReader) per thread on the same file. This avoids having to first copy the file into an array and slice it out for individual threads, which, while the same amount of total reading, avoids soaking up so much memory. The reading done is actually not random access, but merely sequential (with a "skip") starting from different offsets - each thread still works on a mutually exclusive range.
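A sketch of that per-thread reader (startOffset and endOffset are hypothetical per-thread bounds; note that Reader.skip counts chars, which matches byte offsets only for single-byte encodings):

BufferedReader reader = new BufferedReader(new FileReader(filepath));
long remaining = startOffset;
while (remaining > 0) {
    remaining -= reader.skip(remaining); // skip may skip fewer chars than requested
}
// read and tokenize until endOffset, then read past it to finish the last word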
Also, the original code with the slicing is broken if an integer value is "cut" in half, as each of the threads would read half the word. One workaround is to have each thread skip the first word if it is a continuation from the previous block (i.e., scan one byte sooner) and then read past the end of its range as required to complete the last word.

Why is System.out.println so slow?

Is this something common to all programming languages? Doing multiple print calls followed by a println seems faster, but moving everything to a String and just printing that seems fastest. Why?
EDIT: For example, Java can find all the prime numbers up to 1 million in less than a second - but printing them all out, each on its own println, can take minutes! Up to 10 billion can take hours to print!
EX:
package sieveoferatosthenes;

public class Main {
    public static void main(String[] args) {
        int upTo = 10000000;
        boolean[] primes = new boolean[upTo];
        for (int b = 0; b < upTo; b++) {
            primes[b] = true;
        }
        primes[0] = false;
        primes[1] = false;
        int testing = 1;
        while (testing <= Math.sqrt(upTo)) {
            testing++;
            int testingWith = testing;
            if (primes[testing]) {
                while (testingWith < upTo) {
                    testingWith = testingWith + testing;
                    if (testingWith < upTo) {
                        primes[testingWith] = false;
                    }
                }
            }
        }
        for (int b = 2; b < upTo; b++) {
            if (primes[b]) {
                System.out.println(b);
            }
        }
    }
}
println is not slow, it's the underlying PrintStream that is connected with the console, provided by the hosting operating system.
You can check it yourself: compare dumping a large text file to the console with piping the same textfile into another file:
cat largeTextFile.txt
cat largeTextFile.txt > temp.txt
Reading and writing are similar and proportional to the size of the file (O(n)); the only difference is that the destination is different (console compared to file). And that's basically the same with System.out.
The underlying OS operation (displaying chars on a console window) is slow because
The bytes have to be sent to the console application (should be quite fast)
Each char has to be rendered using (usually) a TrueType font (that's pretty slow; switching off anti-aliasing could improve performance, btw)
The displayed area may have to be scrolled in order to append a new line to the visible area (best case: bit block transfer operation, worst case: re-rendering of the complete text area)
System.out is a static PrintStream instance. PrintStream has, among other things, those methods you're probably quite familiar with, like print() and println() and such.
It's not unique to Java that input and output operations take a "long" time. Printing or writing to a PrintStream takes a fraction of a second, but over 10 billion instances of this print can add up to quite a lot!
This is why your "moving everything to a String" approach is the fastest. Your huge String is built, but you only print it once. Sure, it's a huge print, but the time is spent actually printing, not on the overhead associated with print() or println().
As Dvd Prd has mentioned, Strings are immutable. That means whenever you assign a new String to an old one but reuse the reference, you actually destroy the reference to the old String and create a reference to the new one. So you can make this whole operation go even faster by using the StringBuilder class, which is mutable. This will decrease the overhead associated with building the string you'll eventually print.
I believe this is because of buffering. A quote from the article:
Another aspect of buffering concerns text output to a terminal window. By default, System.out (a PrintStream) is line buffered, meaning that the output buffer is flushed when a newline character is encountered. This is important for interactivity, where you'd like to have an input prompt displayed before actually entering any input.
A quote explaining buffers from Wikipedia:
In computer science, a buffer is a region of memory used to temporarily hold data while it is being moved from one place to another. Typically, the data is stored in a buffer as it is retrieved from an input device (such as a mouse) or just before it is sent to an output device (such as speakers).
public void println()
Terminate the current line by writing the line separator string. The line separator string is defined by the system property line.separator, and is not necessarily a single newline character ('\n').
So the buffer gets flushed every time you call println, which means each line is handed off to the operating system separately, and that makes printing slower. The other methods you described require less frequent flushing of buffers and are thus faster.
Take a look at my System.out.println replacement.
By default, System.out.print() is only line-buffered and does a lot of work related to Unicode handling. Because of its small buffer size, System.out.println() is not well suited to handling many repetitive outputs in batch mode; each line is flushed right away. If your output is mainly ASCII-based, removing the Unicode-related activities will improve the overall execution time.
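For comparison, a minimal sketch (reusing the primes/upTo variables from the question) that pushes all output through one large buffer, so the stream is flushed when the buffer fills rather than on every line:

PrintWriter out = new PrintWriter(
        new BufferedWriter(new OutputStreamWriter(System.out), 1 << 16));
for (int b = 2; b < upTo; b++) {
    if (primes[b]) {
        out.println(b);
    }
}
out.flush(); // final flush for whatever is left in the buffer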
If you're printing to the console window, not to a file, that will be the killer.
Every character has to be painted, and on every line the whole window has to be scrolled.
If the window is partly overlaid with other windows, it also has to do clipping.
That's going to take far more cycles than what your program is doing.
Usually that's not a bad price to pay, since console output is supposed to be for your reading pleasure :)
The problem you have is that displaying to the screen is very expensive, especially if you have a graphical windows/X-windows environment (rather than a pure text terminal). Just rendering one digit in a font is far more expensive than the calculations you are doing. When you send data to the screen faster than it can display it, it buffers the data and quickly blocks. Even writing to a file is significant compared to the calculations, but it's 10x-100x faster than displaying on the screen.
BTW: Math.sqrt() is very expensive, and using a loop is much slower than using modulus, i.e. %, to determine if a number is a multiple. BitSet can be 8x more space efficient than boolean[], and faster for operations on multiple bits, e.g. counting or searching bits.
If I dump the output to a file, it is quick, but writing to the console is slow - and if I then write the file's contents to the console, it takes about the same amount of time.
Took 289 ms to examine 10,000,000 numbers.
Took 149 ms to toString primes up to 10,000,000.
Took 306 ms to write to a file primes up to 10,000,000.
Took 61,082 ms to write to a System.out primes up to 10,000,000.
time cat primes.txt
real 1m24.916s
user 0m3.619s
sys 0m12.058s
The code
int upTo = 10 * 1000 * 1000;
long start = System.nanoTime();
BitSet nonprimes = new BitSet(upTo);
for (int t = 2; t * t < upTo; t++) {
    if (nonprimes.get(t)) continue;
    for (int i = 2 * t; i <= upTo; i += t)
        nonprimes.set(i);
}
PrintWriter report = new PrintWriter("report.txt");

long time = System.nanoTime() - start;
report.printf("Took %,d ms to examine %,d numbers.%n", time / 1000 / 1000, upTo);

long start2 = System.nanoTime();
for (int i = 2; i < upTo; i++) {
    if (!nonprimes.get(i))
        Integer.toString(i);
}
long time2 = System.nanoTime() - start2;
report.printf("Took %,d ms to toString primes up to %,d.%n", time2 / 1000 / 1000, upTo);

long start3 = System.nanoTime();
PrintWriter pw = new PrintWriter(new BufferedOutputStream(new FileOutputStream("primes.txt"), 64 * 1024));
for (int i = 2; i < upTo; i++) {
    if (!nonprimes.get(i))
        pw.println(i);
}
pw.close();
long time3 = System.nanoTime() - start3;
report.printf("Took %,d ms to write to a file primes up to %,d.%n", time3 / 1000 / 1000, upTo);

long start4 = System.nanoTime();
for (int i = 2; i < upTo; i++) {
    if (!nonprimes.get(i))
        System.out.println(i);
}
long time4 = System.nanoTime() - start4;
report.printf("Took %,d ms to write to a System.out primes up to %,d.%n", time4 / 1000 / 1000, upTo);
report.close();
Most of the answers here are right, but they don't cover the most important point: system calls. This is the operation that induces the most overhead.
When your software needs to access some hardware resource (your screen, for example), it needs to ask the OS (or hypervisor) whether it can access the hardware. This costs a lot.
Here are some interesting blogs about syscalls, the last one being dedicated to syscalls and Java:
http://arkanis.de/weblog/2017-01-05-measurements-of-system-call-performance-and-overhead
http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html
https://blog.packagecloud.io/eng/2017/03/14/using-strace-to-understand-java-performance-improvement/
