After reading the theory of the PageRank algorithm on this site, I would like to experiment with it.
I am trying to implement it in Java so I can play with PageRank in detail (different weights and so on). For this I need to build the hyperlink matrix. With 1 million nodes, the hyperlink matrix would be 1 million x 1 million, which causes this exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at WebGraph.main(WebGraph.java:6)
How can I implement PageRank in Java? Is there a better way of storing the hyperlink matrix?
That is a great article for learning about PageRank. I implemented a Perl version from it here to use with TextRank. However, if you just want to learn about PageRank and how the various aspects discussed in the article affect the results (damping factor, directed or undirected graph, etc.), I would recommend running experiments in R or Octave. If you want to learn how to implement it efficiently, then programming it from scratch, as you are doing, is best.
Most web graphs (or networks) are very sparse, which means most of the entries in the matrix representation of the graph are zero. A common data structure used to represent a sparse matrix is a hash-map, where the zero values are not stored. For example, if the matrix was
1, 0, 0
0, 0, 2
0, 3, 0
a two-dimensional hash-map would store only the values hm(0,0)=1, hm(1,2)=2, and hm(2,1)=3. So in a 1,000,000 by 1,000,000 matrix of a web graph, I would expect only a few million values to be non-zero. If each row averages only 5 non-zero values, a hash-map will use about 5 * (8 + 8 + 8) * 10^6 bytes ≈ 115 MB to store it (8 bytes for the row index, 8 for the column index, and 8 for the double value). The full square matrix would need 8 * 10^6 * 10^6 bytes ≈ 7 terabytes.
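For illustration, here is a minimal Java sketch of such a two-dimensional hash-map; packing the two int indices into one long key is just one convenient way to avoid allocating a key object per entry:

import java.util.HashMap;
import java.util.Map;

// Sparse matrix stored as a map from (row, col) -> value; zero entries are simply absent.
public class SparseMatrix {
    private final Map<Long, Double> entries = new HashMap<>();

    // Pack the two int indices into one long key.
    private static long key(int row, int col) {
        return ((long) row << 32) | (col & 0xFFFFFFFFL);
    }

    public void set(int row, int col, double value) {
        if (value == 0.0) entries.remove(key(row, col));
        else entries.put(key(row, col), value);
    }

    public double get(int row, int col) {
        return entries.getOrDefault(key(row, col), 0.0);
    }

    public static void main(String[] args) {
        SparseMatrix m = new SparseMatrix();
        m.set(0, 0, 1);   // the example matrix from above
        m.set(1, 2, 2);
        m.set(2, 1, 3);
        System.out.println(m.get(1, 2)); // 2.0
        System.out.println(m.get(0, 1)); // 0.0 (not stored)
    }
}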
Implementing an efficient sparse matrix-vector multiply in Java is not trivial, and there are existing implementations if you don't want to devote time to that aspect of the algorithm. The sparse matrix-vector multiply is the most difficult part of the PageRank algorithm to implement, so after that it gets easier (and more interesting).
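If you do write it yourself, the core operation of each iteration is y = M * x with M stored sparsely. A minimal sketch using per-row index/value arrays (a simplified CSR-style layout; the field names are only illustrative):

// Each row stores only the column indices and values of its non-zero entries.
public class SparseMatVec {
    final int[][] cols;      // cols[i] = column indices of the non-zeros in row i
    final double[][] vals;   // vals[i] = the corresponding values

    SparseMatVec(int[][] cols, double[][] vals) {
        this.cols = cols;
        this.vals = vals;
    }

    // y = M * x, touching only the stored non-zero entries.
    double[] multiply(double[] x) {
        double[] y = new double[cols.length];
        for (int i = 0; i < cols.length; i++) {
            double sum = 0.0;
            for (int k = 0; k < cols[i].length; k++) {
                sum += vals[i][k] * x[cols[i][k]];
            }
            y[i] = sum;
        }
        return y;
    }
}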
Python's networkx module has a nice implementation of PageRank. It uses scipy/numpy for the matrix operations. The two Stack Overflow questions below should be enough to get you started.
How do weighted edges affect PageRank in networkx?
Networkx: Differences between pagerank, pagerank_numpy, and pagerank_scipy?
A few suggestions:
Use Python, not Java: Python is an excellent prototyping language and has sparse matrices available (in scipy), as well as many other goodies. As others have noted, it also has a PageRank implementation.
Don't keep all your data in memory: any lightweight database would be fine, for instance sqlite, hibernate, ...
Work on tiles of the data: if you have a big N x N matrix, break it up into small M x M tiles (where M is a fraction of N) that fit in memory. Combined with sparse matrices, this lets you work with really big N (hundreds of millions to billions, depending on how sparse the data is); see the sketch after this list.
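A rough sketch of the tiling idea for a matrix-vector product; TileLoader is a hypothetical stand-in for whatever reads one tile from disk or a database, so only one M x M block is in memory at a time:

public class TiledMultiply {
    // Hypothetical loader: returns tile (blockRow, blockCol), at most m x m values.
    interface TileLoader {
        double[][] load(int blockRow, int blockCol);
    }

    // y = A * x for an n x n matrix processed tile by tile.
    static double[] multiply(int n, int m, double[] x, TileLoader loader) {
        double[] y = new double[n];
        int blocks = (n + m - 1) / m;                  // tiles per dimension
        for (int bi = 0; bi < blocks; bi++) {
            for (int bj = 0; bj < blocks; bj++) {
                double[][] tile = loader.load(bi, bj);
                int rowOff = bi * m, colOff = bj * m;
                for (int i = 0; i < tile.length; i++) {
                    for (int j = 0; j < tile[i].length; j++) {
                        y[rowOff + i] += tile[i][j] * x[colOff + j];
                    }
                }
            }
        }
        return y;
    }
}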
As Dan W suggested, try increasing the heap size. If you run your Java application from the command line, just add the -Xmx switch with the desired heap size. Assuming you compiled your Java code into a runnable JAR file called pagerank.jar and want a 512 MB heap, you would issue the following command:
java -jar -Xmx512m pagerank.jar
EDIT:
But that only works if you don't have that many "pages" ... A 1 million x 1 million array is too big to fit into your RAM (10^12 entries * 8 bytes per double ≈ 7.28 terabytes). You should change your algorithm to load chunks of data from disk, manipulate them, and store them back to disk.
You could use a graph database like Neo4j for that purpose.
You don't have to store the whole 1000000x1000000 matrix, because most matrix entries will be zero. Instead, you can (for example) store a list of nonzero entries for each row, and write your matrix functions to use it directly, without expanding it into a full matrix.
This kind of compressed representation is called a sparse matrix format, and most matrix libraries have an option to build and work with sparse matrices.
One disadvantage with sparse matrices is that multiplying two of them will result in a matrix which is much less sparse. However, the PageRank algorithm is designed so that you don't need to do that: the hyperlink matrix is constant, and only the score vector is updated.
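As an illustration, a minimal power-iteration sketch that stores only the nonzero structure (each page's outgoing links) and updates only the score vector; the handling of dangling pages shown here is one common convention, not the only one:

// outLinks[i] holds the pages that page i links to; the link structure stays fixed and sparse.
public class SparsePageRank {
    static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0;                    // rank held by pages with no out-links
            for (int i = 0; i < n; i++) {
                if (outLinks[i].length == 0) {
                    danglingMass += rank[i];
                } else {
                    double share = rank[i] / outLinks[i].length;
                    for (int j : outLinks[i]) next[j] += share;
                }
            }
            double base = (1.0 - damping) / n + damping * danglingMass / n;
            for (int j = 0; j < n; j++) next[j] = base + damping * next[j];
            rank = next;
        }
        return rank;
    }
}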
Google computes PageRank using its 'Pregel' BSP framework (BSP and Pregel are really just keywords to search for).
I remembered Apache Giraph (another Pregel implementation), which includes a version of PageRank in its benchmark package.
Here's a video about Giraph: it's an introduction, and it specifically talks about handling PageRank.
If that doesn't work:
In Java there is an implementation of Pregel called GoldenOrb.
Pseudo code for the PageRank algorithm is here (on a different implementation of Pregel).
You'll have to read up on BSP and PageRank to handle the amount of data you have.
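For intuition, here is a conceptual single-machine sketch of the vertex-centric model those frameworks use. This is not the Giraph or Pregel API, just the message-passing shape of a superstep repeated a few times; the frameworks distribute exactly this loop across machines:

// "Think like a vertex": each superstep, every vertex sends rank / outdegree to its
// neighbours, then recomputes its own value from the messages it received.
public class VertexCentricPageRank {
    static double[] run(int[][] outLinks, double damping, int supersteps) {
        int n = outLinks.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int step = 0; step < supersteps; step++) {
            double[] incoming = new double[n];            // messages received this superstep
            for (int v = 0; v < n; v++) {
                for (int target : outLinks[v]) {
                    incoming[target] += rank[v] / outLinks[v].length;
                }
            }
            for (int v = 0; v < n; v++) {
                rank[v] = (1.0 - damping) / n + damping * incoming[v];
            }
        }
        return rank;
    }
}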
Because the matrix is sparse, you can apply dimensionality reduction such as SVD, PCA, MDS, or LSI (which builds on SVD). There is a library for this kind of computation called Jama. You can find it here.
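As a small illustration, a truncated SVD with Jama might look like the sketch below (as far as I know, Jama's SVD expects at least as many rows as columns):

import Jama.Matrix;
import Jama.SingularValueDecomposition;

// Rank-k approximation of a small matrix via SVD.
public class SvdExample {
    public static void main(String[] args) {
        double[][] data = {
            {1, 0, 0},
            {0, 0, 2},
            {0, 3, 0},
            {1, 1, 0}
        };
        Matrix a = new Matrix(data);
        SingularValueDecomposition svd = a.svd();
        Matrix u = svd.getU();   // left singular vectors
        Matrix s = svd.getS();   // diagonal matrix of singular values (descending)
        Matrix v = svd.getV();   // right singular vectors

        int k = 2;               // keep the k largest singular values
        Matrix uk = u.getMatrix(0, u.getRowDimension() - 1, 0, k - 1);
        Matrix sk = s.getMatrix(0, k - 1, 0, k - 1);
        Matrix vk = v.getMatrix(0, v.getRowDimension() - 1, 0, k - 1);
        Matrix approx = uk.times(sk).times(vk.transpose());  // rank-k approximation
        approx.print(8, 3);
    }
}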
Related
I am generating random edge weights for a complete graph with 32678 vertices, so 500 million+ values.
I am using a HashMap with the edges as keys and the random edge weights as values. I keep encountering:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringBuilder.toString(StringBuilder.java:430)
at pa1.Graph.<init>(Graph.java:60)
at pa1.Main.main(Main.java:19)
This graph will then be used to construct a Minimum Spanning Tree.
Any ideas on a better data-structure or approach?
I know there are overrides to allocate more memory, but I would prefer a solution that works as-is.
A HashMap will be very large, because it will contain Double objects (with a capital D), which are significantly larger than 8 bytes (not to mention the Entry objects). The exact size depends on the JVM implementation and the CPU, but I think it's at least 16 bytes each, and probably more.
I think you should consider keeping the primary data in a huge double[] (or, if you can spare some accuracy, a float[]). That cuts memory usage by an easy 2x or 4x (500M floats is a "mere" 2 GB). Then use integer indexes into this array to represent your edges and vertices. For example, an edge could be an int[2]. This is far from O-O, and there's some serious hand-waving here (and I don't understand all the nuances of what you are trying to do).
Very "old fashioned" in style, but requires a lot less memory.
Correction - I think an edge might be an int[4] and a vertex an int[2]. But you get the idea. Actually, for edges and vertices you will have a smaller number of objects, so for those you can probably use "real" Objects, Maps, etc.
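A rough sketch of that idea for a complete graph: all edge weights live in one flat float[], and an {i, j} pair is mapped to an array index arithmetically, so there are no per-edge objects at all. For n = 32678 that is about 534 million floats, roughly 2 GB:

import java.util.Random;

// Edge weights of a complete graph on n vertices stored in a single float[],
// indexed arithmetically: no HashMap, no boxed Double per edge.
public class CompleteGraphWeights {
    final int n;
    final float[] weights;          // one entry per undirected edge {i, j} with i < j

    CompleteGraphWeights(int n, long seed) {
        this.n = n;
        this.weights = new float[n * (n - 1) / 2];
        Random rnd = new Random(seed);
        for (int i = 0; i < weights.length; i++) weights[i] = rnd.nextFloat();
    }

    // Index of edge {i, j} in the row-major upper triangle (no diagonal).
    int edgeIndex(int i, int j) {
        if (i > j) { int t = i; i = j; j = t; }
        return i * (2 * n - i - 1) / 2 + (j - i - 1);
    }

    float weight(int i, int j) {
        return weights[edgeIndex(i, j)];
    }
}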
Since it is a complete graph, there is no doubt about what the edges are. How about storing the weights for those edges in a simple list ordered in a certain manner? For example, with 5 nodes, the edge weights would be ordered as follows: {1,2}, {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {2,5}, {3,4}, {3,5}, {4,5}.
However, as pointed out by @BillyO'Neal, this might still take up 8 GB of space. You might want to split this list into multiple files and maintain an index of those files recording where one set of weights ends and the next begins.
Additionally, given that you are finding the MST of the graph, you might want to have a look at the following paper as well: http://cvit.iiit.ac.in/papers/Vibhav09Fast.pdf. The paper appears to be based on Borůvka's algorithm (http://en.wikipedia.org/wiki/Bor%C5%AFvka's_algorithm; http://iss.ices.utexas.edu/?p=projects/galois/benchmarks/mst).
We have a Java service that computes some logical operations on a huge binary matrix (10,000 x 10,000). The matrix is an array of bitsets. The most important operation is an intersection (logical AND) between a given bitset and each bitset in the array. We are using OpenBitSet and it shows quite good results (at least better than java.util.BitSet). Data sparsity is moderate (there can be many 0s or 1s in a row), and the bitset size is fixed.
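For reference, the hot loop boils down to a word-wise AND plus a popcount over fixed-length long[] words. This is not our actual code, just a simplified sketch of the operation we are trying to speed up:

// Intersect a query bitset with every row and count the surviving bits.
// Each bitset is a long[] of fixed length (10,000 bits is about 157 longs).
public class AndCount {
    static int andCardinality(long[] row, long[] query) {
        int count = 0;
        for (int w = 0; w < row.length; w++) {
            count += Long.bitCount(row[w] & query[w]);   // word-wise AND + popcount
        }
        return count;
    }

    static int[] intersectAll(long[][] matrix, long[] query) {
        int[] result = new int[matrix.length];
        for (int r = 0; r < matrix.length; r++) {
            result[r] = andCardinality(matrix[r], query);
        }
        return result;
    }
}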
The most important thing for us is fast response time (currently ~0.05 s), so we would like to find ways to improve further as the matrix and the number of requests grow. Perhaps there are algebraic methods or faster libraries for this.
We tried javaewah, but it performed operations about 10x slower than OpenBitSet. There is a comparison on that project's page showing that other bitset-compression libraries are slower than Java's BitSet.
Could you suggest some other methods or new ideas?
In my recent blog post I discussed "yet another" bitset implementation, with source code. Maybe you want to give it a try: http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset
If you don't mind using a client-server solution, Pilosa would be perfect for your use case. It offers:
bindings for Java, Python, and Go
groupBy support
time range support
huge matrix support
a high-performance Roaring bitmap backend
horizontal scaling
a Helm chart: https://github.com/pilosa/helm
I have thousands of data points, and each data point has 50 dimensions. I would like to visualize the sparseness of the data using Java. Are there any Java packages/methods to plot such high-dimensional data?
What you need to look for is multidimensional scaling. It basically reduces the dimensionality of the data space while trying to preserve the distances.
So you take an MDS package, reduce your data to 2D (or 3D), and plot it using a 2D/3D graphics library (Swing, JOGL).
It might or might not work, depending on the number of data points and the space they're in. For 50 dimensions you might be out of luck, but it really depends.
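If you want to see what an MDS package does under the hood, here is a rough classical-MDS sketch using Jama for the eigendecomposition. For thousands of points the n x n decomposition may be slow, so treat it as an illustration rather than a production tool:

import Jama.EigenvalueDecomposition;
import Jama.Matrix;

// Classical (metric) MDS: project n points (rows of data) down to 2-D while
// approximately preserving pairwise Euclidean distances; plot the result with any 2-D library.
public class ClassicalMds {

    // Returns an n x 2 array of 2-D coordinates.
    static double[][] reduceTo2d(double[][] data) {
        int n = data.length;

        // Squared pairwise distances.
        double[][] d2 = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0;
                for (int k = 0; k < data[i].length; k++) {
                    double diff = data[i][k] - data[j][k];
                    s += diff * diff;
                }
                d2[i][j] = s;
            }

        // Double centering: B = -1/2 * J * D2 * J with J = I - (1/n) * 1 * 1^T.
        Matrix ones = new Matrix(n, n, 1.0 / n);
        Matrix centering = Matrix.identity(n, n).minus(ones);
        Matrix b = centering.times(new Matrix(d2)).times(centering).times(-0.5);

        // Top-2 eigenpairs of B give the coordinates: X = V_k * sqrt(Lambda_k).
        EigenvalueDecomposition eig = b.eig();
        double[] vals = eig.getRealEigenvalues();
        Matrix v = eig.getV();

        int[] top = topTwoIndices(vals);
        double[][] coords = new double[n][2];
        for (int i = 0; i < n; i++)
            for (int c = 0; c < 2; c++)
                coords[i][c] = v.get(i, top[c]) * Math.sqrt(Math.max(vals[top[c]], 0));
        return coords;
    }

    // Indices of the two largest eigenvalues.
    static int[] topTwoIndices(double[] vals) {
        int first = 0, second = 1;
        if (vals[second] > vals[first]) { first = 1; second = 0; }
        for (int i = 2; i < vals.length; i++) {
            if (vals[i] > vals[first]) { second = first; first = i; }
            else if (vals[i] > vals[second]) { second = i; }
        }
        return new int[]{first, second};
    }
}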
A quick Google search for a Java implementation got me this.
There's a package in R too, so you can use that.
I've written a basic Java applet which works as a map viewer (like Google Maps) for a game fansite.
In it, I've implemented an A* pathfinding algorithm on a 2D map with 16 different floors, "connected" at certain points. The floors are stored as PNG images, which are downloaded when needed and converted to byte arrays. The node cost is derived from the pixel RGB values and stored in the byte arrays.
The map contains about 2 million tiles spread over 16 floors. The images are 1475 x 2000 pixels (15-140 KB as PNGs), so some of the floors contain a lot of empty tiles.
The byte arrays will be huge in memory, resulting in a “java.lang.OutOfMemoryError: Java heap space” error for most JVM configurations.
So my questions are
Is there any way to reduce the size of these byte arrays and still have the pathfinder work properly?
Should I take a different approach to finding the optimal path, one that doesn't keep all the tiles in memory?
I would think finding a path on the web server would be too CPU intensive.
You've just run into the biggest problem with A*: its memory requirement is proportional to the size of the state space.
You have a few options here.
The first would be to change your search algorithm from A* to IDA*, and add in search enhancements like a memory cache to remember as many previously-searched node costs as possible.
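A minimal IDA* sketch on a 4-connected grid, just to show the shape of the algorithm. The int[][] tile costs and the Manhattan heuristic are assumptions for illustration; a production version would avoid deep recursion and add the caching mentioned above:

import java.util.ArrayDeque;
import java.util.Deque;

// IDA*: iterative deepening on f = g + h. Unlike A*, memory use is proportional
// to the path length, not to the number of visited tiles.
public class IdaStar {
    static final int FOUND = -1;
    static final int[] DX = {1, -1, 0, 0};
    static final int[] DY = {0, 0, 1, -1};

    // Manhattan-distance heuristic.
    static int h(int x, int y, int gx, int gy) {
        return Math.abs(x - gx) + Math.abs(y - gy);
    }

    // cost[x][y] is the tile cost; negative means impassable.
    // Call with an empty Deque, e.g. new ArrayDeque<>(); on success it holds the path.
    static boolean findPath(int[][] cost, int sx, int sy, int gx, int gy, Deque<int[]> path) {
        int threshold = h(sx, sy, gx, gy);
        while (true) {
            path.clear();
            path.push(new int[]{sx, sy});
            int next = search(cost, gx, gy, 0, threshold, path);
            if (next == FOUND) return true;
            if (next == Integer.MAX_VALUE) return false;   // no path exists
            threshold = next;                              // deepen the bound and retry
        }
    }

    // Depth-first search bounded by f <= threshold; returns the smallest f that
    // exceeded the threshold, or FOUND. (A real implementation would use an
    // explicit stack instead of recursion for very long paths.)
    static int search(int[][] cost, int gx, int gy, int g, int threshold, Deque<int[]> path) {
        int[] cur = path.peek();
        int x = cur[0], y = cur[1];
        int f = g + h(x, y, gx, gy);
        if (f > threshold) return f;
        if (x == gx && y == gy) return FOUND;
        int min = Integer.MAX_VALUE;
        for (int d = 0; d < 4; d++) {
            int nx = x + DX[d], ny = y + DY[d];
            if (nx < 0 || ny < 0 || nx >= cost.length || ny >= cost[0].length) continue;
            if (cost[nx][ny] < 0) continue;                       // blocked tile
            if (onPath(path, nx, ny)) continue;                   // avoid cycles
            path.push(new int[]{nx, ny});
            int t = search(cost, gx, gy, g + cost[nx][ny], threshold, path);
            if (t == FOUND) return FOUND;                         // keep the path on the stack
            if (t < min) min = t;
            path.pop();
        }
        return min;
    }

    static boolean onPath(Deque<int[]> path, int x, int y) {
        for (int[] p : path) if (p[0] == x && p[1] == y) return true;
        return false;
    }
}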
Another alternative is to keep A* but move to hierarchical search. This will likely require you to do some preprocessing on your image files, however.
You'll find several good resources (downloadable papers) on this subject here: http://webdocs.cs.ualberta.ca/~holte/CMPUT651/readinglist.html
I am sorry if my question sounds stupid :)
Can you please recommend pseudocode or a good algorithm for an LSI implementation in Java?
I am not a math expert. I tried to read some articles about LSI (latent semantic indexing) on Wikipedia and other websites, but they were full of math.
I know LSI is full of math, but if I see some source code or an algorithm, I understand things more easily. That's why I asked here: so many gurus are here!
Thanks in advance
The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".
The same holds for documents: the more common/similar words two documents share, the more similar the documents themselves are. So you can express the similarity of documents through word frequencies, and vice versa.
Knowing this, we can construct a co-occurrence matrix where columns represent documents, rows represent words, and each cell[i][j] holds the frequency of word words[i] in document documents[j]. Frequency may be computed in many ways; IIRC, the original LSA uses tf-idf weighting.
Having such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How do you compare them? Again, there are several ways. The most popular is the cosine distance. You may remember from school maths that a matrix can be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
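To make the comparison concrete, the cosine similarity of two columns is just a normalized dot product; a tiny Java helper:

// Cosine similarity of two document columns: 1.0 means same direction, 0.0 means orthogonal.
public class Cosine {
    static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}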
But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep only the most "important" vectors. After the reduction, the matrix is ready to use.
So, algorithm for LSA will look something like this:
Collect all documents and all unique words from them.
Extract frequency information and build co-occurrence matrix.
Reduce matrix with SVD.
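A minimal sketch of steps 1 and 2 using raw term counts (real LSA would normally use tf-idf weights instead of plain counts, and Lucene would handle the tokenization properly):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Collect the vocabulary and build a word x document frequency matrix.
public class TermDocumentMatrix {
    public static void main(String[] args) {
        String[] documents = {
            "the quick brown fox",
            "the lazy dog",
            "the quick dog chased the fox"
        };

        Map<String, Integer> vocabulary = new HashMap<>();        // word -> row index
        List<Map<Integer, Integer>> docCounts = new ArrayList<>();
        for (String doc : documents) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (String word : doc.toLowerCase().split("\\s+")) {
                Integer row = vocabulary.get(word);
                if (row == null) {                                // new word: assign next row
                    row = vocabulary.size();
                    vocabulary.put(word, row);
                }
                counts.merge(row, 1, Integer::sum);
            }
            docCounts.add(counts);
        }

        double[][] matrix = new double[vocabulary.size()][documents.length];
        for (int j = 0; j < docCounts.size(); j++) {
            for (Map.Entry<Integer, Integer> e : docCounts.get(j).entrySet()) {
                matrix[e.getKey()][j] = e.getValue();
            }
        }
        System.out.println(vocabulary.size() + " words x " + documents.length + " documents");
    }
}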
If you're going to write an LSA library yourself, a good place to start is the Lucene search engine, which will make steps 1 and 2 much easier, together with an implementation of high-dimensional matrices with SVD capability such as Parallel Colt or UJMP.
Also pay attention to other techniques that grew out of LSA, like Random Indexing. RI uses the same idea and gives approximately the same results, but doesn't require the full-matrix stage and is completely incremental, which makes it much more computationally efficient.
This may be a bit late, but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site if you are interested.
The process is way less complicated than it is often written up as. Really, all you need is a library that can do singular value decomposition of a matrix.
If you are interested, I can explain it in a couple of short takeaway bits:
1) You create a matrix/dataset/etc. with the word counts of various documents - the documents are your columns and the distinct words are your rows.
2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.
3) Once you have the word, document, and sigma vectors, you shrink them equally (to some rank k) by copying smaller parts of the vectors/matrices and then multiplying them back together. Shrinking them kind of normalizes your data, and this is LSI.
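A minimal sketch of steps 2 and 3 with Jama, assuming the word-by-document count matrix has already been built and has at least as many rows as columns (which, as far as I know, Jama's SVD expects):

import Jama.Matrix;
import Jama.SingularValueDecomposition;

// SVD of the word x document matrix, keeping only the k largest singular values.
// Documents can then be compared (e.g. by cosine similarity) in the reduced k-dimensional space.
public class LsiSketch {
    // counts: rows = words, columns = documents.
    static Matrix reducedDocumentVectors(double[][] counts, int k) {
        Matrix a = new Matrix(counts);
        SingularValueDecomposition svd = a.svd();

        Matrix s = svd.getS();                       // singular values (descending)
        Matrix v = svd.getV();                       // document-space vectors

        // Keep only the first k singular values/vectors.
        Matrix sk = s.getMatrix(0, k - 1, 0, k - 1);
        Matrix vk = v.getMatrix(0, v.getRowDimension() - 1, 0, k - 1);

        // Each row is one document represented by k latent "concept" weights.
        return vk.times(sk);
    }
}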
here are some fairly clear resources:
http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf
Hope this helps you out a bit.
Eric