How to add two BitSet

How to add two BitSet - java

I am looking for a way to add two BitSet. Should i go to the basics of binary number and perform XOR and AND operation over BitSet. As basic says here-
Will this be efficient?

No, it wouldn't be efficient, because you wouldn't know the carry for bit N until you've processed all bits through N-1. This is the problem solved by carry look-ahead adders in hardware.
There is no way to implement addition of BitSets in a way that does not involve examining all their bits one-by-one in the worst case. An alternative strategy depends a lot on your specific requirements: if you mutate your bit sets a lot, you may want to roll your own, based on Sun's Oracle's implementation. You can shamelessly copy borrow their code, and add an implementation of add that operates on the "guts" of the BitSet, stored as long[] bits. You'll need to be very careful dealing with overflows (remember, all numbers in Java are signed), but otherwise it should be rather straightforward.

The most efficient way would be to convert both bitsets to numbers and just add them

Related

Java's BigInteger implementation

I'm new here so please excuse my noob mistakes. I'm currently working on a little project of mine that sees me dealing with digits with a length in the forty thousands and beyond.
I'm currently using BigInteger to handle these values, and I need something that performs faster. I've read that BigInteger uses an array of integers in its implementation, and what I need to know is whether BigInteger is using each index in this array to represent each decimal point, as in 1 - 9, or is it using something more efficient.
I ask this because I already have an implementation in mind that uses bit operations, which makes it more efficient, memory and processing wise.
So the final question is - is BigInteger already efficient enough, and should I just rely on that? It would better to know this rather than putting it to the test unnecessarily, which would take a lot of time.
Thank you.

At least with Oracle's Java 8 and OpenJDK 8, it doesn't store one decimal digit per int. It stores full 32-bit portions per 32-bit int in the int[], which can be seen with its source code.
Bit operations are fast for it, since it's a sign-magnitude value and the magnitude is stored packed just as you'd expect, just make sure that you use the relevant BigInteger bitwise methods rather than implementing your own.
If you still need more speed, try something like GMP, though be aware that it uses a LGPL or GPL license. It would also be better to use it outside of Java.

Datastructure to use for storing a huge list of long values

I am caching list of Long indexes in my Java program and it is causing the memory to overflow.
So, decided to cache only the start and end indexes of all continuous indexes and rewrite the ArrayList's required APIs. Now, what data structure will be best here to implement the start-end index cache? Is it better to go for TreeMap and keep start index as key and end index as value?

If I were you, I would use some variation of bit string storage.
In Java bit strings are implemented by BitSet.
For example, to represent arbitrary list of unique 32-bit integers, you could store it as a single bit string 4 billion bits long, so this will take 4 bln / 8 bits = 512MB of memory. This is a lot, but it is worst possible case.
But, you can be a lot smarter than that. For example, you could store it as list or binary tree of some smaller fixed (or dynamic) sized bit strings, say 65536 bits or less (or 8KB or less). In other words, each leaf object in this tree will have small header representing start offset and length (probably power of 2 for simplicity, but it does not have to be), and bit string storing actual array members. For efficiency, you could optionally compress this bit string using gzip or similar algorithm - it will make access slower, but could improve memory efficiency by factor of 10 or more.
If your 20 million index elements are almost consecutive (not very sparse), it should take only around 20mln/8bits ~= 2 million bits = 2 MB to represent it in memory. If you gzip it, it will be probably under 1MB overall.

The most compact representation will depend greatly on the distribution of indices in your specific application.
If your indices are densely clustered, the range-based representation suggested by mvp will probably work well (you might look at implementations of run-length encoding for raster graphics, since they're similar problems).
If your indices aren't clustered in dense runs, that encoding will actually increase memory consumption. For sparsely-populated lists, you might look into primitive data structures such as LongArrayList or LongOpenHashSet in FastUtil, or similar structures in Gnu Trove or Colt. In most VMs, each Long object in your ArrayList consumes 20+ bytes, whereas a primitive long consumes only 8. So you can often realize a significant memory savings with type-specific primitive collections instead of the standard Collections framework.
I've been very pleased with FastUtil, but you might find another solution suits you better. A little simulation and memory profiling should help you determine the most effective representation for your own data.

Most BitSet (compressed or uncompressed) implementations are for integers. Here's one for longs: http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset which works like an ordered primitive long hash set or long to long hash map.

Logarithm Algorithm

I need to evaluate a logarithm of any base, it does not matter, to some precision. Is there an algorithm for this? I program in Java, so I'm fine with Java code.
How to find a binary logarithm very fast? (O(1) at best) might be able to answer my question, but I don't understand it. Can it be clarified?

Use this identity:
logb(n) = loge(n) / loge(b)
Where log can be a logarithm function in any base, n is the number and b is the base. For example, in Java this will find the base-2 logarithm of 256:
Math.log(256) / Math.log(2)
=> 8.0
Math.log() uses base e, by the way. And there's also Math.log10(), which uses base 10.

I know this is extremely late, but this may come to be useful for some since the matter here is precision. One way of doing this is essentially implementing a root-finding algorithm that uses, from its base, the high precision types you might want to be using, consisting of simple +-x/ operations.
I would recommend implementing Newton's method since it demands relatively few iterations and has great convergence. For this sort of application, specifically, I believe it's fair to say it will always provide the correct result provided good input validation is implemented.
Considering a simple constant "a" where
Where a is sought to be solved for such that it obeys, then
We can use the Newton method iteratively to find "a" within any specified tolerance, where each a-ith iteration can be computed by
and the denominator is
,
because that's the first derivative of the function, as necessary for the Newton method. Once this is solved for, "a" is the direct answer for the "a = log,b(x)" problem, obtainable by simple +-x/ operations, so you're already good to go. "Wait, but there's a power there?". Yes. If you can rely on your power function as being accurate enough, then there are no issues with going ahead and using it there. Otherwise, you can further break down the power operation into a series of other +-x/ operations by using these methods to simplify whatever decimal number that is on the power into two integer power operations that can be computed easily with a series of multiplication operations. This process will eventually leave you with nth-roots to solve for, which you can also find with the Newton method. If you do go down that road, you can use this for the newton method
which, as you can see, has to be solved for recursively until you reach b = 1.
Phew, but yeah, that's it. This is the way you can solve the problem by making sure you use high precision types along the whole way with only +-x/ operations. Below is a quick implementation I did in Excel to solve for log,2(3), compared with the solution given by the software's original function. As you can see, I can just keep refining "a" until I reach the tolerance I want by monitoring what the optimization function gives me. In this, I used a=2 as the initial guess, which you can use and should be fine for most cases.

How to deal with underflow in scientific computing?

I am working on probabilistic models, and when doing inference on those models, the estimated probabilities can become very small. In order to avoid underflow, I am currently working in the log domain (I store the log of the probabilities). Multiplying probabilities is equivalent to an addition, and summing is done by using the formula:
log(exp(a) + exp(b)) = log(exp(a - m) + exp(b - m)) + m
where m = max(a, b).
I use some very large matrices, and I have to take the element-wise exponential of those matrices to compute matrix-vector multiplications. This step is quite expensive, and I was wondering if there exist other methods to deal with underflow, when working with probabilities.
Edit: for efficiency reasons, I am looking for a solution using primitive types and not objects storing arbitrary-precision representation of real numbers.
Edit 2: I am looking for a faster solution than the log domain trick, not a more accurate solution. I am happy with the accuracy I currently get, but I need a faster method. Particularly, summations happen during matrix-vector multiplications, and I would like to be able to use efficient BLAS methods.
Solution: after a discussion with Jonathan Dursi, I decided to factorize each matrix and vector by its largest element, and to store that factor in the log domain. Multiplications are straightforward. Before additions, I have to factorize one of the added matrices/vectors by the ratio of the two factors. I update the factor every ten operations.

This issue has come up recently on the computational science stack exchange site as well, and although there the immediate worry there was overflow, the issues are more or less the same.
Transforming into log space is certainly one reasonable approach. Whatever space you're in, to do a large number of sums correctly, there's a couple of methods you can use to improve the accuracy of your summations. Compensated summation approaches, most famously Kahan summation, keep both a sum and what's effectively a "remainder"; it gives you some of the advantages of using higher precision arithmeitic without all of the cost (and only using primitive types). The remainder term also gives you some indication of how well you're doing.
In addition to improving the actual mechanics of your addition, changing the order of how you add your terms can make a big difference. Sorting your terms so that you're summing from smallest to largest can help, as then you're no longer adding terms as frequently that are very different (which can cause significant roundoff problems); in some cases, doing log2 N repeated pairwise sums can also be an improvement over just doing the straight linear sum, depending on what your terms look like.
The usefullness of all these approaches depend a lot on the properties of your data. The arbitrary precision math libraries, while enormously expensive in compute time (and possibly memory) to use, have the advantage of being a fairly general solution.

I ran into a similar problem years ago. The solution was to develop an approximation of log(1+exp(-x)). The range of the approximation does not need to be all that large (x from 0 to 40 will more than suffice), and at least in my case the accuracy didn't need to be particularly high, either.
In your case, it looks like you need to compute log(1+exp(-x1)+exp(-x2)+...). Throw out those large negative values. For example, suppose a, b, and c are three log probabilities, with 0>a>b>c. You can ignore c if a-c>38. It's not going to contribute to your joint log probability at all, at least not if you are working with doubles.

Option 1: Commons Math - The Apache Commons Mathematics Library
Commons Math is a library of lightweight, self-contained mathematics and statistics components addressing the most common problems not
available in the Java programming language or Commons Lang.
Note: The API protects the constructors to force a factory pattern while naming the factory DfpField (rather than the somewhat more intuitive DfpFac or DfpFactory). So you have to use
new DfpField(numberOfDigits).newDfp(myNormalNumber)
to instantiate a Dfp, then you can call .multiply or whatever on this. I thought I'd mention this because it's a bit confusing.
Option 2: GNU Scientific Library or Boost C++ Libraries.
In these cases you should use JNI in order to call these native libraries.
Option 3: If you are free to use other programs and/or languages, you could consider using programs/languages for numerical computations such as Octave, Scilab, and similar.
Option 4: BigDecimal of Java.

Rather than storing values in logarithmic form, I think you'd probably be better off using the same concept as doubles, namely, floating-point representation. For example, you might store each value as two longs, one for sign-and-mantissa and one for the exponent. (Real floating-point has a carefully tuned design to support lots of edge cases and avoid wasting a single bit; but you probably don't need to worry so much about any of those, and can focus on designing it in a way that's simple to implement.)

I don't understand why this works, but this formula seems to work and is simpler:
c = a + log(1 + exp(b - a))
Where c = log(exp(a)+exp(b))

Java - computing large mathematical expressions

Im facing a scenario where in ill have to compute some huge math expressions. The expressions in themselves are simple, ie have just the conventional BODMAS fundamental but the numbers that occur as operands are very large, to the tune of 1000 digit numbers. I do know of the BigInteger class of the java.math module but am looking for a different way so that the computation can occur also in a speedy manner. Im a guy still finding his feet in Java, so any pointers or advice in gthis regard would be of great help.
Regards
p1nG

Try it with BigInteger, profile the results with some test calculations, and see if it will work for you, before you look for something more optimized.

Since you say you are new to Java, I would have to suggest you use BigInteger and BigDecimal unless you want to write your own arbitrarily large number handlers. BigInteger and BigDecimal are fast enough for most uses of them. The only time I've had speed issues with them is when dealing with numbers on the order of a million digits.
That is unless you have a specific need for not using BigInteger.

First write the program correctly (using BigFoo) and then determine if optimization is appropriate.

BigInteger/BigFloat will be the most optimized implementation of generalized math that you will possibly get.
If you want it faster, you MIGHT be able to write assembly to use bit-shifting patters for specialized math (well, like divide by 2 tends to be a simple right shift), but if you are doing more than a few different types of equations, that will be very impractical.
BigInteger is only slow in comparison to int, but it will probably be the best you are going to possibly get for operations on numbers of more than 64 bits or so without going to another language--and even then you probably won't get much of an improvement unless that other language is assembly...

I am surprised that equations with 1000 digits have a practical application (except perhaps encryption)
Could you could explain what you are doing and what your speed requirements are?

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.