We have a Java service that computes some logical operations on a huge binary matrix (10,000 x 10,000). The matrix is an array of bitsets. The most important operation is an intersection (logical AND) between a given bitset and each bitset in the array. We are using OpenBitSet and it shows quite good results (at least better than java.util.BitSet). Data sparsity is moderate (a row could contain many 0s or many 1s), and the bitset size is fixed.
The most important thing for us is fast response time (currently ~0.05 sec), so we would like to find ways for further improvement as the matrix and the number of requests grow. Perhaps there are algebraic methods or faster libraries for this.
We tried javaewah, but that library performed operations about 10x slower compared to OpenBitSet. There is a comparison on the project's page showing that other bitset-compression libraries are also slower than java.util.BitSet.
Could you suggest some other methods or new ideas?
In a recent blog post I discussed a "yet another" bitset implementation, with source code. Maybe you want to give it a try: http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset
If you don't mind using a client-server solution, Pilosa would be perfect for your use case:
bindings for Java, Python and Go
groupBy support
time range support
huge matrix support
uses the high-performance RoaringBitmap library
scales horizontally
Helm chart: https://github.com/pilosa/helm
I am looking around for the best algorithms for bitset operations like intersection and union, and I have found a lot of links and similar questions as well.
E.g.: Similar Question on Stack-Overflow
One thing I am trying to understand, however, is where a bit set fits into this. For example, Lucene has used BitSet operations to provide high-performing set operations, especially because it can work at a lower level.
However, it looks to me as if a bit set will perform slower and slower as the domain grows while the set stays sparse, say a set with ~10 elements where the maximum number of elements can be 2 billion, because that calls for a lot of unnecessary matching. What do you suggest?
Bit sets indeed make sense for dense sets, i.e. sets covering a significant fraction of the domain, as they represent every possible element. Their space and running time requirements are O(D) [D = domain size = 2 billion!].
Sorted-set operations represent only the elements actually in the given set and have O(E) behavior [E = number of elements = 10], which is much more appropriate here.
Bit sets are fast not because they are asymptotically efficient, but because their hidden constant is small. They are blazingly fast for small domains (say D <= 1024), as they can process 32/64 elements in a single CPU instruction.
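To make the trade-off concrete, here is a minimal sketch comparing a dense java.util.BitSet intersection with a sorted-array merge (class and method names are just illustrative):

    import java.util.Arrays;
    import java.util.BitSet;

    public class IntersectionDemo {

        // Dense representation: AND walks every 64-bit word of the domain, O(D/64).
        static BitSet intersectDense(BitSet a, BitSet b) {
            BitSet result = (BitSet) a.clone();
            result.and(b);
            return result;
        }

        // Sparse representation: merge two sorted element arrays, O(E1 + E2).
        static int[] intersectSorted(int[] a, int[] b) {
            int[] out = new int[Math.min(a.length, b.length)];
            int i = 0, j = 0, k = 0;
            while (i < a.length && j < b.length) {
                if (a[i] < b[j]) i++;
                else if (a[i] > b[j]) j++;
                else { out[k++] = a[i]; i++; j++; }
            }
            return Arrays.copyOf(out, k);
        }
    }

With D = 2 billion and E ≈ 10, the merge touches a couple of dozen elements, while the dense AND has to walk roughly 31 million 64-bit words.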
For sparse bitsets you can greatly improve performance (and reduce memory usage) using sparse bitmaps, where you divide your data into chunks instead of storing everything under a single key.
When using bitmaps for analytics, only a limited number of users are active at any given time (e.g. on a given day), and sparse bitmaps use this fact to their advantage.
Shameless plug: http://github.com/bilus/redis-bitops (it is for Ruby, but there are also performance notes there).
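The chunking idea itself is independent of Redis; a minimal in-process sketch in Java might look like this (illustrative names, arbitrary chunk size):

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    // Sparse bitmap that allocates fixed-size chunks lazily, so long runs of
    // zeros cost nothing.
    public class ChunkedBitmap {
        private static final int CHUNK_BITS = 4096;
        private final Map<Integer, BitSet> chunks = new HashMap<>();

        public void set(long bit) {
            int chunkIndex = (int) (bit / CHUNK_BITS);
            chunks.computeIfAbsent(chunkIndex, c -> new BitSet(CHUNK_BITS))
                  .set((int) (bit % CHUNK_BITS));
        }

        public boolean get(long bit) {
            BitSet chunk = chunks.get((int) (bit / CHUNK_BITS));
            return chunk != null && chunk.get((int) (bit % CHUNK_BITS));
        }

        // AND only the chunks that both bitmaps actually contain.
        public ChunkedBitmap and(ChunkedBitmap other) {
            ChunkedBitmap result = new ChunkedBitmap();
            for (Map.Entry<Integer, BitSet> e : chunks.entrySet()) {
                BitSet o = other.chunks.get(e.getKey());
                if (o == null) continue;
                BitSet merged = (BitSet) e.getValue().clone();
                merged.and(o);
                if (!merged.isEmpty()) result.chunks.put(e.getKey(), merged);
            }
            return result;
        }
    }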
After reading the theory of the PageRank algorithm on this site, I would like to play with it.
I am trying to implement it in Java. I would like to play with PageRank in detail (giving different weights and so on). For this I need to build the hyperlink matrix. If I have 1 million nodes, then my hyperlink matrix will be 1 million x 1 million in size, which causes this exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at WebGraph.main(WebGraph.java:6)
How can I implement PageRank in Java? Is there any way of storing the hyperlink matrix?
That is a great article for learning about PageRank. I implemented a Perl version from it here to use with TextRank. However, if you just want to learn about PageRank and how the various aspects discussed in the article affect the results (damping factor, directed or undirected graph, etc.), I would recommend running experiments in R or Octave. If you want to learn how to implement it efficiently, then programming it up from scratch, as you are doing, is best.
Most web graphs (or networks) are very sparse, which means most of the entries in the matrix representation of the graph are zero. A common data structure for representing a sparse matrix is a hash map, in which the zero values are not stored. For example, if the matrix were
1, 0, 0
0, 0, 2
0, 3, 0
a two-dimensional hash map would store only the values hm(0,0)=1, hm(1,2)=2, and hm(2,1)=3. So in a 1,000,000 by 1,000,000 matrix of a web graph, I would expect only a few million values to be non-zero. If each row averages only 5 non-zero values, a hash map will use about 5*(8+8+8)*10^6 bytes ≈ 115 MB to store it (8 bytes for the row index, 8 for the column index, and 8 for the double value). The full square matrix would need 8*10^6*10^6 bytes ≈ 7 terabytes.
Implementing an efficient sparse matrix-vector multiply in Java is not trivial, and there are some already implemented if you don't want to devote time to that aspect of the algorithm. The sparse matrix multiply is the most difficult part of the PageRank algorithm to implement, so after that it gets easier (and more interesting).
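A minimal sketch of that two-dimensional hash-map representation, including a sparse matrix-vector multiply (the class and method names are made up):

    import java.util.HashMap;
    import java.util.Map;

    // Stores only the non-zero entries, keyed by (row, col) packed into one long.
    public class SparseMatrix {
        private final Map<Long, Double> entries = new HashMap<>();
        private final int size;

        public SparseMatrix(int size) { this.size = size; }

        private static long key(int row, int col) {
            return ((long) row << 32) | (col & 0xFFFFFFFFL);
        }

        public void set(int row, int col, double value) {
            if (value != 0.0) entries.put(key(row, col), value);
        }

        public double get(int row, int col) {
            return entries.getOrDefault(key(row, col), 0.0);
        }

        // Sparse matrix-vector multiply: iterate only over the stored entries.
        public double[] multiply(double[] vector) {
            double[] result = new double[size];
            for (Map.Entry<Long, Double> e : entries.entrySet()) {
                int row = (int) (e.getKey() >>> 32);
                int col = (int) (e.getKey() & 0xFFFFFFFFL);
                result[row] += e.getValue() * vector[col];
            }
            return result;
        }
    }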
The Python networkx module has a nice implementation of PageRank. It uses scipy/numpy for the matrix implementation. The two questions below on Stack Overflow should be enough to get you started.
How do weighted edges affect PageRank in networkx?
Networkx: Differences between pagerank, pagerank_numpy, and pagerank_scipy?
A few suggestions:
Use Python, not Java: Python is an excellent prototyping language and has sparse matrices available (in scipy) as well as many other goodies. As others have noted, it also has a PageRank implementation.
Don't keep all your data in memory: any type of lightweight database would be fine, for instance SQLite, Hibernate, ...
Work on tiles of the data: if there is a big NxN matrix, break it up into small MxM tiles, where M is a fraction of N chosen so that a tile fits in memory. Combined with sparse matrices, this allows you to work with really big N (hundreds of millions to billions, depending on how sparse the data is).
As Dan W suggested, try to increase the heap size. If you run your Java application from the command line, just add the -Xmx switch with the desired heap size. Assuming you compiled your Java code into a runnable JAR file called pagerank.jar and you want to set the heap size to 512 MB, you would issue the following command:
java -jar -Xmx512m pagerank.jar
EDIT:
But that only works if you don't have that many "pages" ... A 1 million x 1 million array is too big to fit into your RAM (10^12 entries * 8 bytes per double ≈ 7.3 terabytes). You should change your algorithm to load chunks of data from disk, manipulate them, and store them back to disk.
You could use a graph database like Neo4j for that purpose.
You don't have to store the whole 1000000x1000000 matrix, because most matrix entries will be zero. Instead, you can (for example) store a list of nonzero entries for each row, and write your matrix functions to use it directly, without expanding it into a full matrix.
This kind of compressed representation is called a sparse matrix format, and most matrix libraries have an option to build and work with sparse matrices.
One disadvantage with sparse matrices is that multiplying two of them will result in a matrix which is much less sparse. However, the PageRank algorithm is designed so that you don't need to do that: the hyperlink matrix is constant, and only the score vector is updated.
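For illustration, here is a rough sketch of one PageRank update that works directly on per-page out-link lists and a score vector, never materializing the matrix (all names are illustrative, and spreading dangling-page mass uniformly is just one common choice):

    // One PageRank iteration over an adjacency-list web graph.
    // outLinks[i] holds the pages that page i links to.
    public class PageRankStep {
        static double[] iterate(int[][] outLinks, double[] scores, double damping) {
            int n = scores.length;
            double[] next = new double[n];
            double danglingMass = 0.0;

            for (int page = 0; page < n; page++) {
                int[] targets = outLinks[page];
                if (targets.length == 0) {
                    danglingMass += scores[page];      // no out-links: spread everywhere
                } else {
                    double share = scores[page] / targets.length;
                    for (int target : targets) {
                        next[target] += share;
                    }
                }
            }

            double base = (1.0 - damping) / n + damping * danglingMass / n;
            for (int page = 0; page < n; page++) {
                next[page] = base + damping * next[page];
            }
            return next;
        }
    }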
Google computes PageRank using its 'Pregel' BSP framework (these are really just keywords to search for).
I remembered Apache Giraph (another Pregel implementation), which includes a version of PageRank in its benchmark package.
Here's a video about Giraph: it's an introduction, and it specifically talks about handling PageRank.
If that doesn't work:
In Java there is an implementation of Pregel called GoldenOrb.
Pseudo code for the PageRank algorithm is here (on a different implementation of Pregel).
You'll have to read up on BSP and PageRank to handle the size of data you have.
Because the matrix is sparse, you can apply dimensionality reduction such as SVD, PCA, MDS, or LSI (which builds on SVD). There is a library called JAMA that implements these kinds of processes. You can find it here.
Currently, I am serializing some long data using DataOutput.writeLong(long). The issue with this is obvious: there are many cases where the longs will be quite small. I was wondering what the most performant varint implementation is. I've seen the strategy from protocol buffers, and when testing on random long data (which probably isn't the right distribution to test against), I see a pretty big performance drop (about 3-4x slower). Is this to be expected? Are there any good strategies for serializing longs as quickly as possible while still saving space?
Thanks for your help!
How about using the standard DataOutput format for serializing and using a generic compression algorithm such as GZIPOutputStream for compression?
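For example, a minimal sketch (the file name and sample values are placeholders):

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    // Keep writing plain 8-byte longs and let GZIP squeeze out the unused
    // high-order bytes.
    public class GzipLongsDemo {
        public static void main(String[] args) throws IOException {
            long[] values = {1, 2, 3, 1000, 123456789L};
            try (DataOutputStream out = new DataOutputStream(
                    new GZIPOutputStream(new FileOutputStream("longs.bin.gz")))) {
                for (long v : values) {
                    out.writeLong(v);
                }
            }
        }
    }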
The protocol buffer encoding is actually pretty good, but it isn't helpful with random longs; it is mostly useful if your longs are likely to be small positive or negative numbers (say in the +/- 1000 range 95% of the time).
Numbers in that range will typically be encoded in 1, 2, or 3 bytes, compared with 8 for a normal long. Try it with this sort of input on a large set of longs; you can often get 50-70% space savings.
Of course calculating this encoding has some performance overhead, but if you are using it for serialisation then CPU time will not be your bottleneck anyway, so you can effectively ignore the encoding cost.
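For reference, a minimal sketch of that style of encoding in Java, i.e. a zig-zag step followed by a 7-bits-per-byte loop (the method name is made up):

    import java.io.DataOutput;
    import java.io.IOException;

    public class VarLongWriter {
        static void writeVarLong(DataOutput out, long value) throws IOException {
            // Zig-zag: map small negative numbers to small unsigned values.
            long zigzag = (value << 1) ^ (value >> 63);
            // Emit 7 bits per byte, high bit set while more bytes follow.
            while ((zigzag & ~0x7FL) != 0) {
                out.writeByte((int) ((zigzag & 0x7F) | 0x80));
                zigzag >>>= 7;
            }
            out.writeByte((int) zigzag);
        }
    }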
I am looking for an alternative to the Java BitSet implementation. I am implementing a high-performance algorithm, and it seems that using a BitSet object is killing its performance. Any ideas?
Someone here has compared boolean[] to BitSet and concluded:
BitSet is more memory efficient than boolean[] except for very small sizes. Each boolean in the array takes a byte. The numbers from runtime.freeMemory() are a bit muddled for BitSet, but less.
boolean[] is more CPU efficient except for very large sizes, where they are about even. E.g., for size 1 million boolean[] is about four times faster (e.g. 6 ms vs 27 ms); for ten and a hundred million they are about even.
If you Google around, you can find some alternative implementations as well, such as JavaEWAH, used by Apache Hive, Apache Spark and Eclipse JGit. It claims:
The goal of word-aligned compression is not to achieve the best compression, but rather to improve query processing time. Hence, we try to save CPU cycles, maybe at the expense of storage. However, the EWAH scheme we implemented is always more efficient storage-wise than an uncompressed bitmap (as implemented in the BitSet class). Unlike some alternatives, javaewah does not rely on a patented scheme.
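A rough usage sketch, assuming the usual EWAHCompressedBitmap entry points (the bit positions are arbitrary):

    import com.googlecode.javaewah.EWAHCompressedBitmap;

    public class EwahDemo {
        public static void main(String[] args) {
            // bitmapOf expects the set bit positions in ascending order.
            EWAHCompressedBitmap first = EWAHCompressedBitmap.bitmapOf(0, 2, 55, 64, 1 << 20);
            EWAHCompressedBitmap second = EWAHCompressedBitmap.bitmapOf(1, 2, 64, 1 << 20);

            // AND two compressed bitmaps without decompressing them first.
            EWAHCompressedBitmap intersection = first.and(second);
            System.out.println(intersection);   // the bits common to both
        }
    }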
While searching for an answer to my question single byte comparison vs multiple boolean comparison, I found OpenBitSet.
They claim to be faster than java.util.BitSet and to offer direct access to the array of words storing the bits.
I am definitely going to try it. See if it solves your purpose too.
Look at Javolution's FastBitSet:
A high-performance bitset integrated with the collection framework as a set of indices and obeying the collection semantic for methods such as FastSet.size() (cardinality) or FastCollection.equals(java.lang.Object) (same set of indices).
See also http://code.google.com/p/guava-libraries/issues/detail?id=724#c3.
If you really must squeeze the maximum performance out of this, and if memory does not matter, you can try storing each one of your flags in an integer whose bit size is equal to the width of the data bus of your CPU.
You are probably on a CPU with a 64-bit data bus, so try long integers.
There are a number of compressed alternatives to the BitSet class. EWAH was already mentioned (https://github.com/lemire/javaewah). More recent additions include Roaring bitmaps (https://github.com/RoaringBitmap/RoaringBitmap), which are used by Apache Lucene, Apache Spark, Elasticsearch, and so forth.
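A rough usage sketch (the bit positions are arbitrary); Roaring splits the domain into 2^16-bit chunks and intersects only the chunks present on both sides:

    import org.roaringbitmap.RoaringBitmap;

    public class RoaringDemo {
        public static void main(String[] args) {
            RoaringBitmap row = RoaringBitmap.bitmapOf(1, 2, 3, 1000, 1 << 20);
            RoaringBitmap query = RoaringBitmap.bitmapOf(2, 3, 4, 1 << 20);

            RoaringBitmap intersection = RoaringBitmap.and(row, query);  // logical AND
            System.out.println(intersection.getCardinality());           // 3
        }
    }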
I am sorry if my question sounds stupid :)
Can you please recommend some pseudo code or a good algorithm for an LSI implementation in Java?
I am not a math expert. I tried to read some articles about LSI (latent semantic indexing) on Wikipedia and other websites, but they were full of math.
I know LSI is full of math, but if I see some source code or an algorithm, I understand things more easily. That's why I am asking here, where there are so many gurus!
Thanks in advance
The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".
The same holds for documents: the more common/similar words two documents share, the more similar the documents themselves are. So you can express the similarity of documents through the frequencies of words, and vice versa.
Knowing this, we can construct a co-occurrence matrix, where the columns represent documents, the rows represent words, and each cell[i][j] holds the frequency of word words[i] in document documents[j]. Frequency may be computed in many ways; IIRC, the original LSA uses the tf-idf index.
Having such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How do you compare them? Again, there are several ways. The most popular is cosine distance. You may remember from school maths that a matrix can be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
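For concreteness, a tiny sketch of cosine similarity between two document columns of that matrix (plain arrays, illustrative names; cosine distance is simply 1 minus this value):

    public class Cosine {
        static double cosineSimilarity(double[] docA, double[] docB) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int i = 0; i < docA.length; i++) {
                dot   += docA[i] * docB[i];
                normA += docA[i] * docA[i];
                normB += docB[i] * docB[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }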
But we have one problem with such a matrix: it is big. Very, very big. Working with it directly is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep the most "important" vectors. After the reduction, the matrix is ready to use.
So, the algorithm for LSA will look something like this:
Collect all documents and all unique words from them.
Extract frequency information and build co-occurrence matrix.
Reduce matrix with SVD.
If you're going to write an LSA library by yourself, a good starting point is the Lucene search engine, which will make steps 1 and 2 much easier, together with some implementation of high-dimensional matrices with SVD capability, such as Parallel Colt or UJMP.
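For a toy-sized illustration of steps 2 and 3, here is a sketch that builds a tiny word-by-document count matrix and runs a dense SVD with the Jama library (the counts are made up; a real vocabulary would need one of the high-dimensional/sparse packages mentioned above):

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    public class LsaSketch {
        public static void main(String[] args) {
            // Rows = words, columns = documents; raw counts instead of tf-idf.
            double[][] counts = {
                {2, 0, 1},   // "programming"
                {1, 0, 2},   // "algorithm"
                {0, 3, 0},   // "dog"
                {0, 2, 1},   // "breeding"
            };
            Matrix termDoc = new Matrix(counts);

            // Jama's dense SVD expects at least as many rows as columns,
            // which is the usual case (many more words than documents).
            SingularValueDecomposition svd = termDoc.svd();
            Matrix u = svd.getU();   // word space
            Matrix s = svd.getS();   // singular values on the diagonal
            Matrix v = svd.getV();   // document space

            System.out.println("rank = " + svd.rank());
        }
    }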
Also pay attention to other techniques that grew out of LSA, such as Random Indexing. RI uses the same idea and shows approximately the same results, but doesn't require the full-matrix stage and is completely incremental, which makes it much more computationally efficient.
This may be a bit late, but I have always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site if you are interested.
The process is much less complicated than it is often written up to be. Really, all you need is a library that can do singular value decomposition of a matrix.
If you are interested, I can explain it in a couple of short take-away bits:
1) You create a matrix/dataset/etc. with the word counts of various documents: the different documents will be your columns and the rows the distinct words.
2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three different parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.
3) Once you have the word, document, and sigma vectors/matrices, you shrink them equally to some rank k by just copying smaller parts of each and then multiplying them back together. Shrinking them kind of normalizes your data, and this is LSI (see the sketch below).
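A short sketch of step 3 with Jama matrices (k and the method name are made up):

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    public class LsiTruncation {
        // Keep only the first k singular values/vectors and multiply back,
        // giving the rank-k ("LSI") approximation of the original matrix.
        static Matrix reduce(Matrix termDoc, int k) {
            SingularValueDecomposition svd = termDoc.svd();
            int m = termDoc.getRowDimension();
            int n = termDoc.getColumnDimension();

            Matrix uk = svd.getU().getMatrix(0, m - 1, 0, k - 1);  // first k word vectors
            Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);  // top k singular values
            Matrix vk = svd.getV().getMatrix(0, n - 1, 0, k - 1);  // first k document vectors

            return uk.times(sk).times(vk.transpose());             // rank-k approximation
        }
    }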
Here are some fairly clear resources:
http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf
Hope this helps you out a bit.
Eric