I am generating random edges for a complete graph with 32678 vertices, so over 500 million values.
I am using a HashMap with the edges as keys and the random edge weights as values. I keep encountering:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringBuilder.toString(StringBuilder.java:430)
    at pa1.Graph.<init>(Graph.java:60)
    at pa1.Main.main(Main.java:19)
This graph will then be used to construct a Minimum Spanning Tree.
Any ideas on a better data-structure or approach?
I know there are overrides to allocate more memory, but I would prefer a solution that works as-is.
A HashMap will be very large, because it will contain Doubles (with a capital D), which are significantly larger than 8 bytes (not to mention the Entry objects). It depends on the implementation and the CPU, but I think they are at least 16 bytes each, and probably more.
I think you should consider keeping the primary data in a huge double[] (or, if you can spare some accuracy, a float[]). That cuts memory usage by an easy 2x or 4x (500M floats is a "mere" 2 GB). Then use integer indexes into this array to implement your edges and vertices; for example, an edge could be an int[2]. This is far from object-oriented, and there's some serious hand-waving here (I don't understand all the nuances of what you are trying to do).
Very "old fashioned" in style, but requires a lot less memory.
Correction - I think an edge might be an int[4] and a vertex an int[2], but you get the idea. Actually, for the edges and vertices themselves you will have a smaller number of objects, and for those you can probably use "real" Objects, Maps, etc.
Since it is a complete graph, there is no doubt about what the edges are. How about storing the weights for those edges in a simple list ordered in a fixed manner? E.g. if you have 5 nodes, the edge weights would be ordered as follows: {1,2} {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}.
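As an illustration, here is a minimal sketch of that ordering over a flat float[], as the previous answer suggests (the class and method names are mine, and 0-based vertex indices are assumed):

    // Flat array holding all n*(n-1)/2 edge weights of a complete graph,
    // stored in the order {0,1}, {0,2}, ..., {0,n-1}, {1,2}, {1,3}, ...
    public class CompleteGraphWeights {
        private final int n;
        private final float[] weights; // float halves the memory cost of double

        public CompleteGraphWeights(int n) {
            this.n = n;
            this.weights = new float[n * (n - 1) / 2]; // ~536M entries for n = 32768
        }

        // Maps an unordered vertex pair (i, j), i != j, to its slot in the array.
        private int index(int i, int j) {
            if (i > j) { int t = i; i = j; j = t; } // normalize so that i < j
            return i * n - i * (i + 1) / 2 + (j - i - 1);
        }

        public float get(int i, int j) { return weights[index(i, j)]; }

        public void set(int i, int j, float w) { weights[index(i, j)] = w; }
    }

At 4 bytes per float this is about 2 GB for n = 32768, with no per-entry object overhead.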
However, as pointed out by @BillyO'Neal, this might still take up 8 GB of space. You might want to split this list across multiple files and simultaneously maintain an index of those files indicating where one set of weights ends and the next begins.
Additionally, given that you are finding the MST of the graph, you might want to have a look at the following paper as well: http://cvit.iiit.ac.in/papers/Vibhav09Fast.pdf. The paper appears to be based on Borůvka's algorithm (http://en.wikipedia.org/wiki/Bor%C5%AFvka's_algorithm; http://iss.ices.utexas.edu/?p=projects/galois/benchmarks/mst).
Related
I want to create an adjacency list in Java and since I will get a huge set of nodes later as input, it needs to be really efficient.
What sort of implementation is best for this scenario?
A list of lists, or maybe a map? I also need to save the edge weights somewhere. I could not figure out how to do this, since an adjacency list by itself apparently just keeps track of the connected nodes, but not the edge weights.
Warning: this route is the most masochistic and hardest-to-maintain option possible, and is only recommended when the highest possible performance is required.
Adjacency lists are one of the most awkward classes of data structures to optimize, mainly because they vary in size from one vertex to the next. At some broad conceptual level, if you include the adjacency data as part of the definition of a Vertex or Node, then that makes the size of a Vertex/Node variable. Variable-sized data and the kind of memory contiguity needed to be cache-friendly tend to fight one another in most programming languages.
Most object-oriented languages weren't designed to deal with objects that can actually vary in size. They solve that by making them point to/reference memory elsewhere, but that leads to much higher cache misses.
If you want cutting-edge efficiency and you traverse adjacent vertices/nodes a lot, then you want a vertex and its variable number of references/indices to adjacent neighbors (and their weights in your case) to fit in a single cache line, and possibly with a good likelihood that some of those neighboring vertices also fit in the same cache line (though solving this and reorganizing the data to map a 2D graph to a 1-dimensional memory space is an NP-hard problem, but existing heuristics help a lot).
So it ceases to become a question of what data structures to use so much as what memory layouts to use. Arrays are your friend here, but not arrays of nodes. You want an array of bytes packing node data contiguously. Something like this:
[node1_data num_adj adj1 adj2 adj3 (possibly some padding for alignment and to avoid straddling) node2_data num_adj adj1 adj2 adj3 ...]
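To make the hand-waving a bit more concrete, here is a minimal sketch of one way such a packed layout could be read; the exact record format (interleaved neighbour/weight pairs, weights stored as raw float bits) and all names are assumptions for illustration:

    // All adjacency data packed into a single int[]; a separate offsets table
    // locates each node's record. Record layout per node:
    // [numAdj, adj1, w1, adj2, w2, ...], with weights stored as raw float bits.
    public class PackedGraph {
        private final int[] data;    // interleaved records for all nodes
        private final int[] offset;  // offset[v] = start of node v's record

        public PackedGraph(int[] data, int[] offset) {
            this.data = data;
            this.offset = offset;
        }

        public int degree(int v) {
            return data[offset[v]];
        }

        public int neighbor(int v, int k) {
            return data[offset[v] + 1 + 2 * k];
        }

        public float weight(int v, int k) {
            return Float.intBitsToFloat(data[offset[v] + 2 + 2 * k]);
        }
    }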
Node insertion and removal here starts to resemble the kind of algorithms you find to implement memory allocators. When you connect a new edge, that actually changes the node's size and potentially its position in these giant, contiguous memory blocks. Unlike memory allocators, you're potentially allowed to reshuffle and compact and defrag the data provided that you can update your references/indices to it.
Now this is only if you want the fastest possible solution, and provided your use cases are heavily weighted towards read operations (evaluation, traversal) rather than writes (connecting edges, inserting nodes, removing nodes). It's completely overkill otherwise, and a complete PITA since you'll lose all that nice object-oriented structure that helps keep the code easy to maintain, reuse, etc. This has you obliterating all that structure in favor of dealing with things at the bits and bytes level, and it's only worth doing if your software is in a realm where its quality is somehow very proportional to the efficiency of that graph.
One solution you can think of is to create a class Node which contains the data and a weight wt; this weight is the weight of the edge through which it is connected to that node.
Suppose you have a list for node I, which is connected to nodes A, B, and C with edge weights a, b, and c, and node J is connected to A, B, and C with weights x, y, and z. The adjacency list of I will then contain the Node objects as
I -> <A, a>, <B, b>, <C, c>
and the list of J will contain the Node objects as
J -> <A, x>, <B, y>, <C, z>
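A minimal sketch of this idea (class and field names are illustrative):

    import java.util.ArrayList;
    import java.util.List;

    class Node {
        final String label;
        final List<Edge> adj = new ArrayList<>(); // this node's adjacency list

        Node(String label) { this.label = label; }

        void connect(Node to, double wt) { adj.add(new Edge(to, wt)); }
    }

    class Edge {
        final Node to;    // the neighboring node
        final double wt;  // weight of the edge leading to it

        Edge(Node to, double wt) { this.to = to; this.wt = wt; }
    }

Building I's list is then just i.connect(a, weightA), i.connect(b, weightB), and so on.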
I am looking around for the best algorithms for bitset operations like intersection and union, and have found a lot of links and similar questions.
Eg: Similar Question on Stack-Overflow
One thing I am trying to understand, however, is where BitSet stands in all this. E.g., Lucene has adopted BitSet operations to provide high-performing set operations, especially because it can work at a lower level.
However, it looks to me like a bit set will perform slower and slower as the domain grows while the set stays sparse - say the set has ~10 elements but the maximum possible element is 2 billion - because that calls for a lot of unnecessary matching. What do you suggest?
Bit sets indeed make sense for dense sets, i.e. sets covering a significant fraction of the domain, as they represent every possible element. The space and running-time requirements are O(D) [D = domain size = 2 billion!].
Sorted-set operations represent only the elements actually in the given set and have O(E) behavior [E = number of elements = 10], which is much more appropriate here.
Bit sets are fast, but they are not efficient; I mean their hidden constant is smaller. They are blazingly fast for small sets (say D <= 1024), as they can process 32/64 elements in a single CPU instruction.
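For reference, java.util.BitSet provides exactly these operations; a small demonstration:

    import java.util.BitSet;

    public class BitSetDemo {
        public static void main(String[] args) {
            BitSet a = new BitSet();
            BitSet b = new BitSet();
            a.set(3); a.set(64); a.set(1023);
            b.set(64); b.set(512);

            BitSet intersection = (BitSet) a.clone();
            intersection.and(b); // {64}

            BitSet union = (BitSet) a.clone();
            union.or(b);         // {3, 64, 512, 1023}

            System.out.println(intersection + " " + union);
        }
    }

Note that BitSet's backing array grows to cover the highest set bit, so setting a single bit near 2 billion allocates on the order of 256 MB - exactly the sparse-set problem described in the question.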
For sparse bitsets you can greatly improve performance (and reduce memory usage) by using sparse bitmaps, where you divide your data into chunks as opposed to storing everything under a single key.
When using bitmaps for analytics, you have a limited number of users active at any given time (e.g. per day), and sparse bitmaps use this fact to their advantage.
Shameless plug: http://github.com/bilus/redis-bitops (for Ruby, but there are also performance notes there).
After reading the theory of the PageRank algorithm from this site, I would like to play with it.
I am trying to implement it in Java. I would like to play with PageRank in detail (giving different weights and so on), and for this I need to build the hyperlink matrix. If I have 1 million nodes, my hyperlink matrix will be 1 million x 1 million in size, which causes this exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at WebGraph.main(WebGraph.java:6)
How can I implement PageRank in Java, is there any way of storing hyperlink matrix?
That is a great article for learning about PageRank. I implemented a Perl version from it here to use with TextRank. However, if you just want to learn about PageRank and how the various aspects discussed in the article affect the results (damping factor, directed or undirected graph, etc.), I would recommend running experiments in R or Octave. If you want to learn how to implement it efficiently, then programming it from scratch, as you are doing, is best.
Most web graphs (or networks) are very sparse, which means most of the entries in the matrix representation of the graph are zero. A common data structure for representing a sparse matrix is a hash map, in which the zero values are not stored. For example, if the matrix were
    1, 0, 0
    0, 0, 2
    0, 3, 0
a two-dimensional hash map would store only the values hm(0,0)=1, hm(1,2)=2, and hm(2,1)=3. So in a 1,000,000 x 1,000,000 matrix of a web graph, I would expect only a few million values to be non-zero. If each row averages only 5 non-zero values, a hash map will use about 5 * (8+8+8) * 10^6 bytes ~ 115 MB to store it (8 bytes for the row index, 8 for the column index, and 8 for the double value). The full square matrix would use 8 * 10^6 * 10^6 bytes ~ 7 terabytes.
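A minimal sketch of such a hash-map-backed sparse matrix; packing the two indexes into a single long key is my own choice here, to avoid allocating a pair object per lookup:

    import java.util.HashMap;
    import java.util.Map;

    public class SparseMatrix {
        // Only nonzero entries are stored, keyed by (row, col) packed into a long.
        private final Map<Long, Double> entries = new HashMap<>();

        private static long key(int row, int col) {
            return ((long) row << 32) | (col & 0xFFFFFFFFL);
        }

        public void set(int row, int col, double value) {
            if (value != 0.0) entries.put(key(row, col), value);
            else entries.remove(key(row, col));
        }

        public double get(int row, int col) {
            return entries.getOrDefault(key(row, col), 0.0);
        }
    }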
Implementing an efficient sparse matrix-vector multiply in Java is not trivial, and there are libraries that already implement it if you don't want to devote time to that aspect of the algorithm. The sparse-matrix multiply is the most difficult part of the PageRank algorithm to implement; after that it gets easier (and more interesting).
The Python networkx module has a nice implementation of PageRank. It uses scipy/numpy for the matrix implementation. The two questions below on Stack Overflow should be enough to get you started.
How do weighted edges affect PageRank in networkx?
Networkx: Differences between pagerank, pagerank_numpy, and pagerank_scipy?
A few suggestions:
Use Python, not Java: Python is an excellent prototyping language and has sparse matrices available (in scipy), as well as many other goodies. As others have noted, it also has a PageRank implementation.
Don't store all your data in memory: any type of lightweight database would be fine, for instance SQLite, Hibernate, ...
Work on tiles of the data: if there is a big NxN matrix, break it up into small MxM tiles, where M is a fraction of N, that fit in memory. Combined with sparse matrices this allows you to work with really big N (hundreds of millions to billions, depending on how sparse the data is).
As Dan W suggested, try increasing the heap size. If you run your Java application from the command line, just add the -Xmx switch with the desired heap size. Assuming you compiled your Java code into a runnable JAR file called pagerank.jar and you want to set your heap size to 512 MB, you would issue the following command:
    java -Xmx512m -jar pagerank.jar
EDIT:
But that only works if you don't have that many "pages"... A 1 million x 1 million matrix is too big to fit into your RAM (10^12 entries * 8-byte double values ~ 7.28 terabytes). You should change your algorithm to load chunks of data from disk, manipulate them, and store them back to disk.
You could use a graph database like Neo4j for that purpose.
You don't have to store the whole 1000000x1000000 matrix, because most matrix entries will be zero. Instead, you can (for example) store a list of nonzero entries for each row, and write your matrix functions to use it directly, without expanding it into a full matrix.
This kind of compressed representation is called a sparse matrix format, and most matrix libraries have an option to build and work with sparse matrices.
One disadvantage with sparse matrices is that multiplying two of them will result in a matrix which is much less sparse. However, the PageRank algorithm is designed so that you don't need to do that: the hyperlink matrix is constant, and only the score vector is updated.
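A rough sketch of that row-wise representation, together with the sparse matrix-vector product that a PageRank power iteration needs (names are illustrative):

    // cols[i] holds the column indexes of row i's nonzero entries,
    // vals[i] holds the corresponding values.
    public class SparseRowMatrix {
        private final int[][] cols;
        private final double[][] vals;

        public SparseRowMatrix(int[][] cols, double[][] vals) {
            this.cols = cols;
            this.vals = vals;
        }

        // y = M * x, touching only the nonzero entries of M
        public double[] multiply(double[] x) {
            double[] y = new double[cols.length];
            for (int i = 0; i < cols.length; i++) {
                for (int k = 0; k < cols[i].length; k++) {
                    y[i] += vals[i][k] * x[cols[i][k]];
                }
            }
            return y;
        }
    }

Each PageRank iteration then only calls multiply on the constant hyperlink matrix and mixes in the damping/teleport term, so the full dense matrix never has to exist.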
Google computes PageRank using its 'Pregel' BSP framework ('Pregel' and 'BSP' are really just the keywords to search for).
I remembered Apache Giraph (another Pregel implementation), which includes a version of PageRank in its benchmark package.
Here's a video about Giraph: it's an introduction, and it specifically talks about handling PageRank.
If that doesn't work:
In Java there is an implementation of Pregel called GoldenOrb.
Pseudocode for the PageRank algorithm is here (from a different implementation of Pregel).
You'll have to read up on BSP and PageRank to handle the size of data you have.
Because the matrix is sparse, you can apply dimensionality-reduction techniques such as SVD, PCA, MDS, or LSI (which builds on SVD). There is a library for implementing these kinds of processes called JAMA. You can find it here
Let's say I have a 2D accumulator array in Java, int[][] array. The array could look like this:
(Images: two surface plots of an int[56][56] array with values from 0 to ~4500; the x and z axes represent indexes into the array, the y axis represents the values.)
What I need to do is find the peaks in the array - there are 2 peaks in the first image and 8 peaks in the second. These peaks are always 'obvious' (there is always a gap between peaks), but they don't have to look like the ones in these images; they can be more or less random - these images are not based on real data, just samples. The real array can have a size like 5000x5000, with peak values from thousands to several hundred thousand... The algorithm has to be universal: I don't know how big the array or the peaks can be, and I don't know how many peaks there are. But I do know some sort of threshold - the peaks can't be smaller than a given value.
The problem is that one peak can consist of several smaller peaks nearby (first image), the height can be quite random, and the size can differ significantly within one array (by size I mean the number of units a peak occupies in the array - one peak can consist of 6 units and another of 90). It also has to be fast (all done in one iteration); the array can be really big.
Any help is appreciated - I don't expect code from you, just the right idea :) Thanks!
edit: You asked about the domain, but it's quite complicated and IMHO it can't help with the problem. It's actually an array of ArrayLists of 3D points, like ArrayList<Point3D>[][], and the value in question is the size of the ArrayList. Each peak contains points that belong to one cluster (a plane, in this case) - this array is the result of an algorithm that segments a point cloud. I need to find the highest value in each peak so I can fit the points from the 'biggest' ArrayList to a plane, compute some parameters from it, and then properly cluster most of the points from the peak.
He's not interested in estimating the global maximum using some sort of optimization heuristic - he just wants to find the maximum values within each of a number of separate clusters.
These peaks are always 'obvious' (there's always a gap between peaks)
Based on your images, I assume you mean there are always some 0-values separating clusters? If that's the case, you can use a simple flood fill to identify the clusters. You can also keep track of each cluster's maximum while doing the flood fill, so you identify the clusters and find their maxima at the same time.
This is also as fast as you can get, without relying on heuristics (which could return the wrong answer), since the maximum of each cluster could potentially be any value in the cluster, so you have to check them all at least once.
Note that this will iterate through every item in the array. This is also necessary, since (from the information you've given us) it's potentially possible for any single item in the array to be its own cluster (which would also make it a peak). With around 25 million items in the array, this should only take a few seconds on a modern computer.
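A minimal sketch of this approach, assuming 4-connectivity and zero-separated clusters as described above:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    public class PeakFinder {
        private static final int[][] DIRS = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};

        // Returns the maximum value of each nonzero cluster, one entry per cluster.
        public static List<Integer> findPeaks(int[][] grid) {
            int rows = grid.length, cols = grid[0].length;
            boolean[][] seen = new boolean[rows][cols];
            List<Integer> peaks = new ArrayList<>();

            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < cols; c++) {
                    if (grid[r][c] == 0 || seen[r][c]) continue;
                    // Flood-fill this cluster iteratively, tracking its maximum.
                    int max = 0;
                    Deque<int[]> stack = new ArrayDeque<>();
                    stack.push(new int[]{r, c});
                    seen[r][c] = true;
                    while (!stack.isEmpty()) {
                        int[] p = stack.pop();
                        max = Math.max(max, grid[p[0]][p[1]]);
                        for (int[] d : DIRS) {
                            int nr = p[0] + d[0], nc = p[1] + d[1];
                            if (nr >= 0 && nr < rows && nc >= 0 && nc < cols
                                    && !seen[nr][nc] && grid[nr][nc] != 0) {
                                seen[nr][nc] = true;
                                stack.push(new int[]{nr, nc});
                            }
                        }
                    }
                    peaks.add(max);
                }
            }
            return peaks;
        }
    }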
This might not be an optimal solution, but since the problem sounds somewhat fluid too, I'll write it down.
1. Construct a list of all the values (and coordinates) that are over your minimum threshold.
2. Sort it in descending order of height.
3. The first element will be the biggest peak; add it to the peak list.
4. Then descend down the list; if the current element is farther than the minimum distance from all the existing peaks, add it to the peak list.
This is a linear description, but all the steps (except 3) can be trivially parallelised. In step 4 you can also use a coverage map: a 2D array of booleans that shows which coordinates have already been "covered" by a nearby peak (see the sketch below).
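A rough sketch of steps 1-4; minDist (the minimum allowed distance between peaks) is an assumed tuning parameter:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class SortedPeaks {
        // Returns peak cells as {value, row, col} triples.
        public static List<int[]> findPeaks(int[][] grid, int threshold, double minDist) {
            // Step 1: collect all cells above the threshold.
            List<int[]> cells = new ArrayList<>();
            for (int r = 0; r < grid.length; r++)
                for (int c = 0; c < grid[r].length; c++)
                    if (grid[r][c] >= threshold) cells.add(new int[]{grid[r][c], r, c});

            // Step 2: sort by value, descending.
            cells.sort(Comparator.comparingInt((int[] a) -> a[0]).reversed());

            // Steps 3 and 4: greedily accept cells far enough from accepted peaks.
            List<int[]> peaks = new ArrayList<>();
            for (int[] cell : cells) {
                boolean nearExisting = false;
                for (int[] p : peaks) {
                    if (Math.hypot(cell[1] - p[1], cell[2] - p[2]) < minDist) {
                        nearExisting = true;
                        break;
                    }
                }
                if (!nearExisting) peaks.add(cell);
            }
            return peaks;
        }
    }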
(Caveat emptor: once you refine the criteria, this solution might become completely unfeasible, but in general it works.)
Simulated annealing or hill climbing is what immediately comes to mind. These algorithms, though, will not guarantee that all peaks are found.
However, if your "peaks" are separated by gaps of 0-values, maybe a connected-components analysis would help. You would label a region as "connected" if it is connected through values greater than 0 (or, if you have a certain threshold, label regions as connected if they are over that threshold); then your number of components is your number of peaks. You could then do another pass over the array to find the max of each component.
I should note that connected components can be done in linear time, and finding the peak values can also be done in linear time.
What is the best data structure to use when programming a 2-dimensional grid of tiles in Java? Tiles on the grid should be easily referenced by their location, so that neighbors and paths can be efficiently computed. Should it be a 2D array? An ArrayList? Something else?
If you're not worrying about speed or memory too much, you can simply use a 2D array - this should work well enough.
If speed and/or memory are issues for you then this depends on memory usage and the access pattern.
A single-dimensional array is the way to go if you need high performance. You compute the proper index as y * width + x. There are two potential problems with this: cache misses and memory usage.
If you know that your access pattern is such that you fetch the neighbours of an element most of the time, then mapping a 2D space into a 1D array as described above may cause cache misses - you want the neighbours to be close in memory, and neighbours from two different rows are not. You may have to map your 2D tiles to the 1D array in a different order. See Hilbert curves, for example.
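For reference, a minimal sketch of the flat single-dimensional-array grid in plain row-major order (i.e. before any cache-friendly reordering such as a Hilbert curve):

    public class Grid {
        private final int width, height;
        private final int[] tiles; // one value per tile, row-major

        public Grid(int width, int height) {
            this.width = width;
            this.height = height;
            this.tiles = new int[width * height];
        }

        public int get(int x, int y) { return tiles[y * width + x]; }

        public void set(int x, int y, int value) { tiles[y * width + x] = value; }
    }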
For better memory usage, if you know that most of your tiles are always the same (e.g. always grass), you might want to implement a sparse array or a quad tree. Both can be implemented quite efficiently with cache awareness in mind (the sparse-array link is a good example of this). Another benefit is that these can be dynamically extended. However, you will always pay for extra levels of indirection in the end for this to work.
NOTE: be careful with using generic classes such as HashMap, with the key type being some primitive type or a special location class, if you're worried about performance - you will either have to allocate an object each time you index the hash map or pay the price of boxing/unboxing. In addition, hash maps will not allow you efficient spatial queries (e.g. give me all objects within a radius R of a given object - quad trees are better for this).
If you have a fixed dimension for your grid, use a 2D array. If you need the size to be dynamic, use an ArrayList of ArrayLists.
A 2D array seems like a good bet if you plan on inserting stuff into specific locations, as long as it has a fixed size.
The data structure to use really depends on the type of operations you will perform:
In case the number of meaningful positions (nonzero/non-default) in the grid is rather low (<< n x m), it might be more space-efficient to use a hash map that maps (x,y) positions to specific tiles. You can also iterate over the meaningful positions a lot more efficiently. In addition, you could store references to neighboring tiles in each tile to speed up path/neighborhood traversal.
If your grid is densely filled with "information", you should consider using a 2D array or an ArrayList (if at some point you have generic types involved as the "tile type", you have to use ArrayLists, since Java does not allow native arrays of a generic type).
If you simply need to iterate over the grid and address cells randomly, then MyCellType[][] should be fine. This is most efficient in terms of space and (one would expect) time for these use cases.