sorting 50 000 000 numbers - java

Suppose that we need to sort 50 000 000 numbers, and that the numbers are stored in a file. What is the most efficient algorithm for solving this problem? A parallel algorithm for sorting...
How should I do it? Maybe a useful link?
I can't use a standard algorithm, therefore I'm asking about methods and algorithms :)
OK, I read about parallel mergesort, but it's not clear to me.
Solution, the first version:
The code is located here.

50 million is not particularly large. I would just read them into memory, sort them and write them out. It should take just a few seconds. How fast do you need it to be? How complicated does it need to be?
On my old laptop it took 28 seconds. If I had more processors it might be a little faster, but much of the time is spent reading and writing the file (15 seconds), which wouldn't be any faster.
One of the critical factors is the size of your cache. The comparison itself is very cheap provided the data is in cache. As the L3 cache is shared, one thread is all you need to make full use of it.
import java.io.*;
import java.util.Arrays;
import java.util.Random;

public static void main(String... args) throws IOException {
    generateFile();

    long start = System.currentTimeMillis();
    int[] nums = readFile("numbers.bin");
    Arrays.sort(nums);
    writeFile("numbers2.bin", nums);
    long time = System.currentTimeMillis() - start;
    System.out.println("Took " + time / 1000 + " secs to sort " + nums.length + " numbers.");
}

private static void generateFile() throws IOException {
    Random rand = new Random();
    int[] ints = new int[50 * 1000 * 1000];
    for (int i = 0; i < ints.length; i++)
        ints[i] = rand.nextInt();
    writeFile("numbers.bin", ints);
}

private static int[] readFile(String filename) throws IOException {
    DataInputStream dis = new DataInputStream(
            new BufferedInputStream(new FileInputStream(filename), 64 * 1024));
    int len = dis.readInt();
    int[] ints = new int[len];
    for (int i = 0; i < len; i++)
        ints[i] = dis.readInt();
    dis.close();
    return ints;
}

private static void writeFile(String name, int[] numbers) throws IOException {
    DataOutputStream dos = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream(name), 64 * 1024));
    dos.writeInt(numbers.length);
    for (int number : numbers)
        dos.writeInt(number);
    dos.close();
}

Off the top of my head, merge sort seems to be the best option when it comes to parallelisation and distribution, as it uses a divide-and-conquer approach. For more information, google for "parallel merge sort" and "distributed merge sort".
For a single-machine, multiple-core example, see Correctly multithreaded quicksort or mergesort algo in Java?. If you can use Java 7 fork/join, then see "Java 7: more concurrency" and "Parallelism with Fork/Join in Java 7".
For distributing it over many machines, see Hadoop; it has a distributed merge sort implementation: see MergeSort and MergeSorter. Also of interest: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds.
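To make the single-machine fork/join option concrete, here is a rough sketch of a parallel merge sort over an in-memory int[]; the class name and threshold are illustrative and not taken from the linked articles.

import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Recursively splits the array, sorts the halves in parallel tasks and merges the results.
// Below THRESHOLD it falls back to the sequential Arrays.sort.
public class ParallelMergeSort extends RecursiveAction {
    private static final int THRESHOLD = 1 << 16;
    private final int[] a;
    private final int lo, hi; // sorts a[lo..hi)

    public ParallelMergeSort(int[] a, int lo, int hi) {
        this.a = a; this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            Arrays.sort(a, lo, hi);
            return;
        }
        int mid = (lo + hi) >>> 1;
        invokeAll(new ParallelMergeSort(a, lo, mid),
                  new ParallelMergeSort(a, mid, hi));
        merge(mid);
    }

    private void merge(int mid) {
        int[] left = Arrays.copyOfRange(a, lo, mid); // copy the left half, merge back in place
        int i = 0, j = mid, k = lo;
        while (i < left.length && j < hi)
            a[k++] = (left[i] <= a[j]) ? left[i++] : a[j++];
        while (i < left.length)
            a[k++] = left[i++];
    }

    public static void main(String[] args) {
        int[] nums = new java.util.Random().ints(50_000_000).toArray();
        new ForkJoinPool().invoke(new ParallelMergeSort(nums, 0, nums.length));
        System.out.println("sorted " + nums.length + " numbers");
    }
}

On Java 8 and later, Arrays.parallelSort(nums) gives you a comparable fork/join-based parallel sort without writing any of this.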

For sorting that many elements, your best shot is Merge Sort. It's usually the algorithm used by databases. Even though it is not as fast as Quick Sort, it uses intermediate storage, so you don't need huge amounts of memory to perform the sort.
Also, as pointed by sje397 and Scott in the comments, Merge Sort is highly parallelizable.

It depends a lot on the problem domain. For instance, if all the numbers are positive ints, the best way may be to make an array of counters indexed from 0 to MAX_INT, count how many times each number occurs as you read the file, and then print out each int with a non-zero count however many times it occurred. That's an O(n) "sort". There's an official name for that sort, but I forget what it is.
By the way, I got asked this question in a Google interview. From the problem constraints I came up with this solution, and it seemed to be the answer they were looking for. (I turned down the job because I didn't want to move.)
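That counting approach is commonly called counting sort. Here is a minimal sketch of it, assuming the values are non-negative and bounded by a known maximum (the MAX_VALUE bound is illustrative) and reusing the length-prefixed binary format from the answer above.

import java.io.*;

// Counting "sort": tally how often each value occurs, then replay the counts in order.
// Memory is proportional to the range of values, not to the number of values.
public class CountingSort {
    static final int MAX_VALUE = 1 << 20; // illustrative bound; every value must be in [0, MAX_VALUE)

    public static void main(String[] args) throws IOException {
        int[] counts = new int[MAX_VALUE];
        int n;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream("numbers.bin")))) {
            n = in.readInt();
            for (int i = 0; i < n; i++)
                counts[in.readInt()]++;            // O(n) counting pass
        }
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("sorted.bin")))) {
            out.writeInt(n);                       // keep the same length-prefixed format
            for (int v = 0; v < MAX_VALUE; v++)    // O(range) output pass
                for (int c = 0; c < counts[v]; c++)
                    out.writeInt(v);
        }
    }
}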

Do not be afraid of the large number; in fact, 50 000 000 numbers is not that big. If the numbers are integers, then each number is 4 bytes in size, so the overall memory needed for this array is 50 000 000 * 4 / 1024 / 1024 = 190.7 MB, which is relatively small. Having done the math, you can proceed to do QuickSort, which runs in O(n log n). Note that the built-in sort method for .NET arrays uses QuickSort; I'm not sure if this is the case in Java too.
Sorting 250 000 000 integers on my machine took about 2 minutes, so go for it :)

They are not that many. If they were 10-byte extendeds, for example, it would be an array of 500 MB; it can almost fit on my phone! ;)
So I'd say go for Quicksort if it's only that.

50e6 numbers is very small nowadays, don't make things more complex than they need to be...
bash$ sort < file > sorted.file

Related

Like to know the time complexity of this code snippet in Java

I would like to know the time complexity of the following code snippet,
FileReader fr = new FileReader("myfile.txt");
BufferedReader br = new BufferedReader(fr);
for (long i = 0; i < n-1; i++) {
    br.readLine();
}
System.out.println("Line content:" + br.readLine());
br.close();
fr.close();
Edit: To clarify, n is a constant number, e.g. 100000.
The complexity is O(n) but that doesn't tell you much because you don't know how much time each readLine() needs.
Calculating the complexity doesn't make much sense when individual operations have a very variable runtime behavior.
In this case, the loop is very cheap and will not contribute much to the runtime of the whole program. The loading from disk, on the other hand, will contribute very much to the runtime but it's hard to say without statistical information about the average number of lines per file and the average length of a line.
This is a very simple case, but here's how to find the time complexity. The same method can be applied for more complex algorithms.
For the following portion of code (and regardless of the complexity of readline())
for (long i = 0; i < n-1; i++) {
    br.readLine();
}
i = 0 will be executed once, i < n-1 will be executed n times, i++ will be executed n-1 times, and br.readLine() will be executed n-1 times.
This gives us 1 + n + (n-1) + (n-1) = 3n - 1, which is proportional to n, so the complexity is O(n).
I'm not sure what you mean by "time complexity", but it would appear that its performance is linear (a.k.a. O(n)) in the size of the file it reads from.
The readLine() function has to scan every character of the input up to the next newline. This should be O(N), where N is the number of bytes in the first n lines (which you read). Using a buffered reader does not reduce algorithmic complexity, it just reduces the number of actual IO calls needed to read a given number of bytes (a good thing, since IO calls are expensive). In this case, the only way that would change things is if the buffer's read size was much larger than the total number of bytes you were going to read.
The time complexity of reading an entire file should be O(N) where N is the size of the file.
However, proving this would be difficult, given the amount of software that is involved. You have got the Java code in the main method, the Reader stack (including the Charset decoder) and the JVM. Then you have the code in the OS. Then you have to take into account file buffering in kernel memory, file system organizations, disk seek times, etcetera.
(It is not meaningful to consider just the time taken by the application. We can safely predict that that component of the total time will be dominated by the other components.)
And, as Aaron says the complexity measure is not going to be a reliable predictor of the actual file read time.

Java - BitSet Replacement

In my code I am trying to check whether a number exists in the HashMap or not. My code is the following:
BitSet arp = new BitSet();
for i = 0 to 10 million
HashMap.get (i)
if number exist
arp.set(i , true)
else
arp.set(i , false)
After that, from the BitSet I get whether number i exists or not. However, I found this BitSet operation quite slow (I also tried string = string + 0/1, which was even slower). Can anybody help me replace this operation with a faster one?
Your code is really difficult to read clearly, but I suspect you're just trying to set bits in the BitSet that are keys from your HashMap?
In that case, your code should just be more or less
BitSet bits = new BitSet(10000000);
for (Integer k : map.keySet()) {
    bits.set(k);
}
Even if this wasn't what you meant, as a general rule, BitSet is blazing fast; I suspect it's the rest of your code that's slow.
If you provided your actual relevant code, we could have found some performance errors in the first place. But assuming your code is ok and you profiled your application to make sure that the BitSet operations are actually slow:
If you have enough memory space available, you can always just go for a boolean[] instead of a BitSet.
BitSet internally uses long[] to store the separate bits, so it's very good memory-wise, but can sometimes be a little bit too slow.
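For illustration, a minimal sketch of the boolean[] variant of the loop above; map is assumed to be the questioner's HashMap with Integer keys.

// A boolean[] uses roughly one byte per flag instead of one bit, so about 8x the memory of a
// BitSet, but each lookup is a plain array access with no word indexing or masking.
boolean[] present = new boolean[10000000];
for (Integer k : map.keySet()) {
    if (k >= 0 && k < present.length)
        present[k] = true;
}
// lookup: present[i] instead of bits.get(i)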

Why is Java HashMap slowing down?

I am trying to build a map with the contents of a file and my code is as below:
System.out.println("begin to build the sns map....");
String basePath = PropertyReader.getProp("oldbasepath");
String pathname = basePath + "\\user_sns.txt";
FileReader fr;
Map<Integer, List<Integer>> snsMap =
new HashMap<Integer, List<Integer>>(2000000);
try {
fr = new FileReader(pathname);
BufferedReader br = new BufferedReader(fr);
String line;
int i = 1;
while ((line = br.readLine()) != null) {
System.out.println("line number: " + i);
i++;
String[] strs = line.split("\t");
int key = Integer.parseInt(strs[0]);
int value = Integer.parseInt(strs[1]);
List<Integer> list = snsMap.get(key);
//if the follower is not in the map
if(snsMap.get(key) == null)
list = new LinkedList<Integer>();
list.add(value);
snsMap.put(key, list);
System.out.println("map size: " + snsMap.size());
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("finish building the sns map....");
return snsMap;
The program is very fast at first but gets much slower when the information printed is:
map size: 1138338
line number: 30923602
map size: 1138338
line number: 30923603
....
I tried to find the reason with the two System.out.println() calls, to judge the performance of BufferedReader and HashMap instead of using a Java profiler.
Sometimes it takes a while to get the map size information after getting the line number information, and sometimes it takes a while to get the line number information after getting the map size. My question is: which makes my program slow, the BufferedReader for a big file or the HashMap for a big map?
If you are testing this from inside Eclipse, you should be aware of the huge performance penalty of writing to stdout/stderr, due to Eclipse capturing that output in the Console view. Printing inside a tight loop is always a performance issue, even outside of Eclipse.
But, if what you are complaining about is the slowdown experienced after processing 30 million lines, then I bet it's a memory issue. First it slows down due to intense GC'ing and then it breaks with OutOfMemoryError.
You will have to check your program with some profiling tools to understand why it is slow.
In general, file access is much slower than in-memory operations (unless you are constrained in memory and doing excessive GC), so the guess would be that reading the file is the slower part here.
Until you have profiled, you will not know what is slow and what isn't.
Most likely, the System.out will show up as being the bottleneck, and you'll then have to profile without them again. System.out is the worst thing you can do for finding performance bottlenecks, because in doing so you usually add an even worse bottleneck.
An obvious optimization for your code is to move the line
snsMap.put(key, list);
into the if statement. You only need to put it when you have created a new list. Otherwise, the put will just replace the current value with itself.
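A minimal sketch of that change, with the Java 8 computeIfAbsent form as an alternative; key, value and snsMap are as in the question.

List<Integer> list = snsMap.get(key);
if (list == null) {                  // the follower is not in the map yet
    list = new LinkedList<Integer>();
    snsMap.put(key, list);           // put only when a new list was created
}
list.add(value);                     // the list is already referenced by the map

// or, on Java 8+:
snsMap.computeIfAbsent(key, k -> new LinkedList<Integer>()).add(value);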
The cost associated with Integer objects in Java (and in particular the use of Integer in the Java Collections API) is largely a memory (and thus garbage collection!) issue. You can sometimes get significant gains by using primitive collections such as GNU Trove, depending on how well you can adjust your code to use them efficiently. Most of the gains of Trove are in memory usage. Definitely try rewriting your code to use TIntArrayList and TIntObjectMap from GNU Trove. I'd avoid linked lists, too, in particular for primitive types.
Roughly estimated, a HashMap<Integer, List<Integer>> needs at least 3*16 bytes per entry. The doubly linked list again needs at least 2*16 bytes per entry stored. 1m keys + 30m values ~ 1 GB. No overhead included yet. With GNU trove TIntObjectHash<TIntArrayList> that should be 4+4+16 bytes per key and 4 bytes per value, so 144 MB. The overhead is probably similar for both.
The reason that Trove uses less memory is because the types are specialized for primitive values such as int. They will store the int values directly, thus using 4 bytes to store each.
A Java collections HashMap consists of many objects. It roughly looks like this: there are Entry objects that point to a key and a value object each. These must be objects, because of the way generics are handled in Java. In your case, the key will be an Integer object, which uses 16 bytes (4 bytes mark, 4 bytes type, 4 bytes actual int value, 4 bytes padding) AFAIK. These are all 32 bit system estimates. So a single entry in the HashMap will probably need some 16 (entry) + 16 (Integer key) + 32 (yet empty LinkedList) bytes of memory that all need to be considered for garbage collection.
If you have lots of Integer objects, it just will take 4 times as much memory as if you could store everything using int primitives. This is the cost you pay for the clean OOP principles realized in Java.
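A rough sketch of the Trove variant of the question's loop; the imports assume GNU Trove 3.x package names (adjust them for your Trove version), and strs, key and value are as in the question.

import gnu.trove.list.array.TIntArrayList;
import gnu.trove.map.hash.TIntObjectHashMap;

// Keys and values stay primitive ints; only the per-key list object lives on the heap.
TIntObjectHashMap<TIntArrayList> snsMap = new TIntObjectHashMap<TIntArrayList>(2000000);

// inside the read loop:
int key = Integer.parseInt(strs[0]);
int value = Integer.parseInt(strs[1]);
TIntArrayList list = snsMap.get(key);
if (list == null) {
    list = new TIntArrayList();
    snsMap.put(key, list);
}
list.add(value);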
The best way is to run your program with a profiler (for example, JProfiler) and see what parts are slow. Also, debug output can slow your program down, for example.
HashMap is not slow; in reality it's the fastest of the maps. Hashtable is the thread-safe one among the basic map implementations, and it can be slow sometimes.
Important note: close the BufferedReader and the file after you read the data... this might help.
e.g. br.close()
file.close()
Please check your system processes in the task manager; there may be too many processes running in the background.
Sometimes Eclipse is really resource-heavy, so try running your program from the console to check.

CPU Intensive Calculation Examples?

I need a few easily implementable, single-CPU- and memory-intensive calculations that I can write in Java for a test thread scheduler.
They should be slightly time-consuming, but more importantly resource-consuming.
Any ideas?
A few easy examples of CPU-intensive tasks:
searching for prime numbers (involves lots of BigInteger divisions)
calculating large factorials, e.g. 2000! (involves lots of BigInteger multiplications; see the sketch after this list)
many Math.tan() calculations (this is interesting because Math.tan is native, so you're using two call stacks: one for Java calls, the other for C calls.)
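For example, a minimal sketch of the large-factorial item; the bound 2000 just follows the example above.

import java.math.BigInteger;

// CPU-bound: repeated BigInteger multiplications, no I/O.
public class Factorial {
    public static void main(String[] args) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= 2000; i++)
            result = result.multiply(BigInteger.valueOf(i));
        System.out.println("2000! has " + result.toString().length() + " digits");
    }
}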
Multiply two matrices. The matrices should be huge and stored on the disk.
String search. Or index a huge document (detect and count the occurrences of each word or string of alphabetic characters). For example, you can index all of the identifiers in the source code of a large software project.
Calculate pi.
Rotate a 2D matrix, or an image.
Compress some huge files.
...
The CPU soak test for the PDP-11 was tan(atan(tan(atan(...))) etc. Works the FPU pretty hard and also the stack and registers.
OK, this is not Java, but it is based on the Dhrystone benchmark algorithm found here. These implementations of the algorithm might give you an idea of how it is done. The link contains sources in C/C++ and Assembler to obtain the benchmarks.
Calculate the nth term of the Fibonacci series, where n is greater than 70. (time consuming; see the sketch after this list)
Calculate factorials of large numbers. (time consuming)
Find all possible paths between two nodes in a graph. (memory consuming)
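A minimal sketch of the Fibonacci item, deliberately using the naive exponential-time recursion so it stays CPU-bound; the argument 45 is illustrative, since the plain recursion already takes noticeable time around there (well before n = 70).

// Naive recursion performs on the order of phi^n calls, so it burns CPU quickly as n grows.
public class NaiveFib {
    static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        System.out.println("fib(45) = " + fib(45));
    }
}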
Official RSA Challenge
Unofficial RSA Challenge - grab some ciphertext that you want to read as plaintext and let the computer at it. If you use a randomized algorithm, there is a small but non-zero chance that you will succeed.
I was messing around with Thread priority in Java and used the code below. It seems to keep the CPU busy enough that the thread priority makes a difference.
@Test
public void testCreateMultipleThreadsWithDifferentPriorities() throws Exception {
    class MyRunnable implements Runnable {
        @Override
        public void run() {
            for (int i = 0; i < 1_000_000; i++) {
                double d = tan(atan(tan(atan(tan(atan(tan(atan(tan(atan(123456789.123456789))))))))));
                cbrt(d);
            }
            LOGGER.debug("I am {}, and I have finished", Thread.currentThread().getName());
        }
    }

    final int NUMBER_OF_THREADS = 32;
    List<Thread> threadList = new ArrayList<Thread>(NUMBER_OF_THREADS);
    for (int i = 1; i <= NUMBER_OF_THREADS; i++) {
        Thread t = new Thread(new MyRunnable());
        if (i == NUMBER_OF_THREADS) {
            // Last thread gets MAX_PRIORITY
            t.setPriority(Thread.MAX_PRIORITY);
            t.setName("T-" + i + "-MAX_PRIORITY");
        } else {
            // All other threads get MIN_PRIORITY
            t.setPriority(Thread.MIN_PRIORITY);
            t.setName("T-" + i);
        }
        threadList.add(t);
    }
    threadList.forEach(t -> t.start());
    for (Thread t : threadList) {
        t.join();
    }
}

Any code tips for speeding up random reads from a Java FileChannel?

I have a large (3 GB) binary file of doubles which I access (more or less) randomly during an iterative algorithm I have written for clustering data. Each iteration does about half a million reads from the file and about 100k writes of new values.
I create the FileChannel like this...
f = new File(_filename);
_ioFile = new RandomAccessFile(f, "rw");
_ioFile.setLength(_extent * BLOCK_SIZE);
_ioChannel = _ioFile.getChannel();
I then use a private ByteBuffer the size of a double to read from it
private ByteBuffer _double_bb = ByteBuffer.allocate(8);
and my reading code looks like this
public double GetValue(long lRow, long lCol)
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long position = idx * BLOCK_SIZE;
    double d = 0;
    try
    {
        _double_bb.position(0);
        _ioChannel.read(_double_bb, position);
        d = _double_bb.getDouble(0);
    }
    ...snip...
    return d;
}
and I write to it like this...
public void SetValue(long lRow, long lCol, double d)
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long offset = idx * BLOCK_SIZE;
    try
    {
        _double_bb.putDouble(0, d);
        _double_bb.position(0);
        _ioChannel.write(_double_bb, offset);
    }
    ...snip...
}
The time taken for an iteration of my code increases roughly linearly with the number of reads. I have added a number of optimisations to the surrounding code to minimise the number of reads, but I am now down to the core set that I feel is necessary without fundamentally altering how the algorithm works, which I want to avoid at the moment.
So my question is whether there is anything in the read/write code or JVM configuration I can do to speed up the reads? I realise I can change hardware, but before I do that I want to make sure that I have squeezed every last drop of software juice out of the problem.
Thanks in advance
As long as your file is stored on a regular harddisk, you will get the biggest possible speedup by organizing your data in a way that gives your accesses locality, i.e. causes as many get/set calls in a row as possible to access the same small area of the file.
This is more important than anything else you can do because accessing random spots on a HD is by far the slowest thing a modern PC does - it takes about 10,000 times longer than anything else.
So if it's possible to work on only a part of the dataset (small enough to fit comfortably into the in-memory HD cache) at a time and then combine the results, do that.
Alternatively, avoid the issue by storing your file on an SSD or (better) in RAM. Even storing it on a simple thumb drive could be a big improvement.
Instead of reading into a ByteBuffer, I would use file mapping, see: FileChannel.map().
Also, you don't really explain how your GetValue(row, col) and SetValue(row, col) access the storage. Are row and col more or less random? The idea I have in mind is the following: sometimes, in image processing, when you have to access pixels like row + 1, row - 1, col - 1, col + 1 to average values, one trick is to organize the data in 8 x 8 or 16 x 16 blocks. Doing so helps keep the different pixels of interest in a contiguous memory area (and hopefully in the cache).
You might transpose this idea to your algorithm (if it applies): you map a portion of your file once, so that the different calls to GetValue(row, col) and SetValue(row, col) work on this portion that's just been mapped.
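A minimal sketch of the FileChannel.map() idea, wrapped in an illustrative class; BLOCK_SIZE = 8 (one double per slot) and the single-region assumption are mine, and since one mapping is limited to 2 GB, a 3 GB file would need more than one region in practice.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Maps one region of a file of doubles; reads and writes then go through the OS page cache
// instead of issuing a read()/write() call per value.
public class MappedDoubleStore {
    private static final int BLOCK_SIZE = 8;          // one double per slot, as in the question
    private final MappedByteBuffer region;

    public MappedDoubleStore(String filename) throws IOException {
        RandomAccessFile file = new RandomAccessFile(filename, "rw");
        FileChannel channel = file.getChannel();
        long size = Math.min(channel.size(), Integer.MAX_VALUE); // one mapping covers at most 2 GB
        this.region = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
    }

    // idx is the value returned by TriangularMatrix.CalcIndex in the question;
    // only valid for indices that fall inside the mapped region.
    public double get(long idx) {
        return region.getDouble((int) (idx * BLOCK_SIZE));
    }

    public void set(long idx, double d) {
        region.putDouble((int) (idx * BLOCK_SIZE), d);
    }
}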
Presumably if we can reduce the number of reads then things will go more quickly.
3 GB isn't huge for a 64-bit JVM, hence quite a lot of the file would fit in memory.
Suppose that you treat the file as "pages" which you cache. When you read a value, read the page around it and keep it in memory. Then when you do more reads check the cache first.
Or, if you have the capacity, read the whole thing into memory at the start of processing.
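A rough sketch of that paging idea, using a LinkedHashMap in access order as a small LRU cache of pages; the class name, page size and cache size are illustrative.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

// Reads the file in fixed-size pages and keeps the most recently used pages in memory,
// so repeated reads near each other hit the cache instead of the disk.
public class PagedDoubleReader {
    private static final int PAGE_DOUBLES = 8192;     // 64 KB pages
    private static final int MAX_PAGES = 4096;        // roughly 256 MB of cache
    private final FileChannel channel;
    private final Map<Long, ByteBuffer> pages =
            new LinkedHashMap<Long, ByteBuffer>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, ByteBuffer> eldest) {
                    return size() > MAX_PAGES;        // evict the least recently used page
                }
            };

    public PagedDoubleReader(FileChannel channel) {
        this.channel = channel;
    }

    public double get(long idx) throws IOException {
        long page = idx / PAGE_DOUBLES;
        ByteBuffer buf = pages.get(page);
        if (buf == null) {                            // cache miss: load the whole page
            buf = ByteBuffer.allocate(PAGE_DOUBLES * 8);
            channel.read(buf, page * PAGE_DOUBLES * 8L);
            pages.put(page, buf);
        }
        return buf.getDouble((int) (idx % PAGE_DOUBLES) * 8);
    }
}

A write path would have to update or invalidate the cached page as well; this sketch only covers reads.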
Accessing data byte-by-byte always produces poor performance (not only in Java). Try to read/write bigger blocks (e.g. rows or columns).
How about switching to a database engine for handling such amounts of data? It would handle all the optimizations for you.
Maybe this article helps you...
You might want to consider using a library which is designed for managing large amounts of data and random reads rather than using raw file access routines.
The HDF file format may be a good fit. It has a Java API but is not pure Java. It's licensed under an Apache-style license.
