Several months ago I implemented a solution to choose unique values from the range 1 to 65535 (16 bits). This range is used to generate unique Route Target suffixes, which in this customer's massive network (it's a huge ISP) are a highly contested resource, so any released value needs to become immediately available to end users again.
To tackle this requirement I used a BitSet: allocate a suffix by calling set on the RT index, deallocate it with clear, and find the next available value with nextClearBit(). I handle synchronization / concurrency issues manually.
This works pretty well for a small range: the entire index is small (around 10 KB), it is blazing fast, and it can easily be serialized and stored in a blob field.
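For context, a minimal sketch of that approach (class and method names are mine, not the production code):

import java.util.BitSet;

// Sketch of the described allocator; the real code handles
// synchronization differently, this one just locks coarsely.
public class SuffixPool {
    private final BitSet used = new BitSet(65536);

    public SuffixPool() {
        used.set(0); // suffix 0 is never handed out; the valid range is 1-65535
    }

    public synchronized int allocate() {
        int suffix = used.nextClearBit(1); // next free value >= 1
        if (suffix > 65535) {
            throw new IllegalStateException("suffix pool exhausted");
        }
        used.set(suffix);
        return suffix;
    }

    public synchronized void release(int suffix) {
        used.clear(suffix); // immediately reusable, as required
    }
}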
The problem is, some new devices can handle RT suffixes of 32 unsigned bits (range 1 to 4294967296), which can't be managed with a BitSet: it would, by itself, consume around 512 MB, and it is limited to the signed 32-bit int range anyway. Even with this massive range available, the client still wants released Route Target suffixes to become reusable, mainly because the lowest ones (up to 65535), which are compatible with old routers, are heavily contested.
Before I tell the customer that this is impossible and that they will have to make do with my reusable index for the lower RTs (up to 65535) plus a database sequence for the others (which means that when a user frees one of those Route Targets, it will never become available again), would anyone shed some light?
Maybe some kind soul has already implemented a high-performance number pool for Java (6, if it matters), or I am missing a killer feature of the Oracle database (11gR2, if it matters)... Just some wishful thinking :).
Thank you very much in advance.
I would combine the following:
your current BitSet solution for 1-65535
PLUS
an Oracle-based solution with a wrap-around sequence for 65536 - 4294967296, defined as:
CREATE SEQUENCE MyIDSequence
  MINVALUE 65536
  MAXVALUE 4294967296
  START WITH 65536
  INCREMENT BY 1
  CYCLE
  NOCACHE
  ORDER;
This sequence gives you ordered values in the specified range and allows values to be reused, but only after the maximum has been reached, which should leave enough time for values to be released. If need be, you can keep track of the values currently in use in a table and simply draw the next value from the sequence whenever the returned one is already taken. All of this can be wrapped nicely into a stored procedure for convenience.
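A rough JDBC sketch of that loop (the IN_USE_SUFFIX table is my assumption, and Java 7 try-with-resources is used for brevity; as noted, the same logic can be wrapped in a stored procedure):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SuffixAllocator {
    // Assumes a table IN_USE_SUFFIX(SUFFIX NUMBER PRIMARY KEY) tracking live values.
    public static long nextFreeSuffix(Connection con) throws SQLException {
        try (PreparedStatement next = con.prepareStatement(
                 "SELECT MyIDSequence.NEXTVAL FROM dual");
             PreparedStatement claim = con.prepareStatement(
                 "INSERT INTO IN_USE_SUFFIX (SUFFIX) VALUES (?)")) {
            while (true) {
                long candidate;
                try (ResultSet rs = next.executeQuery()) {
                    rs.next();
                    candidate = rs.getLong(1);
                }
                claim.setLong(1, candidate);
                try {
                    claim.executeUpdate(); // the primary key rejects values still in use
                    return candidate;
                } catch (SQLException stillInUse) {
                    // value is taken; loop and draw the next one from the sequence
                }
            }
        }
    }
}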
This project may be of some use.
Related
I am writing a process that returns data to subscribers every few seconds. I would like to create a unique id for each subscriber:
producer -> subscriber1
         -> subscriber2
What is the difference between using:
java.util.UUID.randomUUID()
System.nanoTime()
System.currentTimeMillis()
Will the nano time always be unique? What about the random UUID?
UUID
The 128-bit UUID was invented exactly for your purpose: Generating identifiers across one or more machines without coordinating through a central authority.
Ideally you would use the original Version 1 UUID, or one of its variations (Versions 2, 3, and 5). The original combines the MAC address of the host computer's network interface with the current moment, plus a small arbitrary number that increments whenever the host clock has been adjusted. This approach eliminates any practical concern about duplicates.
Java does not bundle an implementation for generating these Versions. I presume the Java designers had privacy and security concerns over divulging place, time, and MAC address.
Java comes with only one implementation of a generator, for Version 4. In this type all but 6 of the 128 bits are randomly generated. If a cryptographically strong random generator is used, this Version is good enough to use in most common situations without concern for collisions.
Understand that 122 bits is a really big range of numbers (about 5.3 x 10^36). 64 bits yields a range of 18,446,744,073,709,551,616 (about 18 quintillion). The remaining 58 bits (122 - 64 = 58) yield a range of 288,230,376,151,711,744 (about 288 quadrillion). Now multiply those two numbers to get the range of 122 bits: 2^122 = 18,446,744,073,709,551,616 * 288,230,376,151,711,744, which is about 5.3 undecillion.
Nevertheless, if you have a way to generate a Version of UUID other than 4, take it. For example, in a database system such as Postgres, the database server can generate UUIDs in the various Versions, including Version 1. Or you may find a Java library for generating such UUIDs, though that library may not be platform-independent (it may contain native code).
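For reference, using Java's built-in Version 4 generator is a one-liner:

import java.util.UUID;

public class SubscriberIds {
    public static void main(String[] args) {
        // Version 4: 122 bits drawn from a cryptographically strong random generator
        UUID subscriberId = UUID.randomUUID();
        System.out.println(subscriberId); // e.g. 3f8e1d2a-9c4b-4e7f-8a1d-2b3c4d5e6f70
    }
}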
System.nanoTime
Be clear that System.nanoTime has nothing to do with the current date and time. To quote the Javadoc:
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
The System.nanoTime feature simply returns a long number, a count of nanoseconds since some origin, but that origin is not specified.
The only promise made in the Java spec is that the origin will not change during the runtime of a JVM, so you know the number is ever increasing during the execution of your app, unless it reaches the limit of a long and the counter rolls over. That rollover would take about 292 years (2^63 nanoseconds) if the origin were zero, but, again, the origin is not specified.
In my experience with the particular Java implementations I have used, the origin is the moment when the JVM starts up. This means I will most certainly see the same numbers all over again after the next JVM restart.
So using System.nanoTime as an identifier is a poor choice. Whether your app coincidentally hits the exact same nanosecond number seen in a prior run is pure chance, but it is a chance you need not take. Use UUID instead.
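For completeness, a small sketch of what System.nanoTime is actually meant for, measuring elapsed time within one JVM:

public class Timing {
    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime(); // opaque origin; only differences are meaningful
        Thread.sleep(250);              // the work being measured
        long elapsedNanos = System.nanoTime() - start;
        System.out.printf("took %.1f ms%n", elapsedNanos / 1_000_000.0);
    }
}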
java.util.UUID.randomUUID() is thread-safe; it can be called concurrently from multiple threads.
System.nanoTime() gives no uniqueness guarantee across threads: two threads calling it at the same instant can receive the same value, so it is not safe to use the results as identifiers.
The same is true for System.currentTimeMillis(): every call within the same millisecond returns the same value.
Comparing System.currentTimeMillis() and System.nanoTime(), the latter is more expensive, as it takes more CPU cycles, but it also has higher resolution. So UUID should serve your purpose.
I think yes, you can use System.nanoTime() as an id. I tested it and did not encounter duplicates.
P.S. But I strongly suggest you use UUID instead.
As the BitSet.get() function takes an int argument, I was wondering whether I could store more than 2^32 bits in a BitSet, and if so, how I would retrieve them.
I am doing a Project Euler problem where I need to generate primes up to 10^10. The algorithm I'm currently using is the Sieve of Eratosthenes, storing the boolean values as bits in a BitSet. Any workaround for this?
You could use a list of bitsets, as a List<BitSet>, and move on to the next one once the end of one bitset has been reached.
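A minimal sketch of that wrapper (class and constant names are mine; each inner BitSet covers a fixed-size chunk of the long index space):

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class ChunkedBitSet {
    private static final int CHUNK_BITS = 1 << 30; // about 1 billion bits per chunk

    private final List<BitSet> chunks = new ArrayList<BitSet>();

    private BitSet chunk(long index) {
        int c = (int) (index / CHUNK_BITS);
        while (chunks.size() <= c) {
            chunks.add(new BitSet()); // grow lazily as higher indexes are touched
        }
        return chunks.get(c);
    }

    public void set(long index) {
        chunk(index).set((int) (index % CHUNK_BITS));
    }

    public boolean get(long index) {
        return chunk(index).get((int) (index % CHUNK_BITS));
    }
}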
However, I think your approach is probably incorrect. Even if you use a single bit for each number, you need 10^10 bits, which is about 1.2 GB of memory (8 bits in a byte and 1024^3 bytes in a GB). Most Project Euler problems should be solvable without needing that much memory.
No, it's limited by the int indexing in its interface.
So the implementation doesn't exploit the full potential of its long[] backing store (it could address about 64 times more bits), probably because it wasn't feasible to use that much RAM.
I worked on a LongBitSet implementation and published it here.
It can take:
    2,147,483,647 bits - java.util.BitSet max size (Integer.MAX_VALUE), for reference
  137,438,953,216 bits - LongBitSet max size
                         (0b1111111_11111111_11111111_11111100_000000L in binary)
I had to address some corner cases; in the commit history you can see that the first commit is a copy-paste of java.util.BitSet.
See the factory method:
public static LongBitSet getMaxSizeInstance() {
    // (Integer.MAX_VALUE - 3) << ADDRESS_BITS_PER_WORD
    return new LongBitSet(0b1111111_11111111_11111111_11111100_000000L);
}
Note: -Xmx24G -Xms24G -ea is the minimum heap size the JVM needs to be started with for getMaxSizeInstance() to run without java.lang.OutOfMemoryError: Java heap space.
I am caching a list of Long indexes in my Java program and it is causing memory to overflow.
So I decided to cache only the start and end indexes of each run of continuous indexes and to rewrite the required ArrayList APIs on top of that. Now, what data structure would be best for this start-end index cache? Is it better to go with a TreeMap, keeping the start index as the key and the end index as the value?
If I were you, I would use some variation of bit string storage.
In Java bit strings are implemented by BitSet.
For example, to represent an arbitrary list of unique 32-bit integers, you could store it as a single bit string 4 billion bits long, which would take 4 bln / 8 bits = 512 MB of memory. This is a lot, but it is the worst possible case.
But you can be a lot smarter than that. For example, you could store it as a list or binary tree of smaller fixed-size (or dynamically sized) bit strings, say 65536 bits (8 KB) or less. Each leaf object in this tree would have a small header holding its start offset and length (probably a power of 2 for simplicity, though it does not have to be), plus the bit string storing the actual array members. For efficiency, you could optionally compress each bit string using gzip or a similar algorithm; this makes access slower, but can improve memory efficiency by a factor of 10 or more.
If your 20 million index elements are almost consecutive (not very sparse), it should take only around 20 million bits, i.e. about 2.5 MB, to represent them in memory. If you gzip that, it will probably be under 1 MB overall.
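To illustrate the compression part of this idea, a minimal sketch of gzipping a chunk's bits (assumes Java 7+ for BitSet.toByteArray()/valueOf()):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.BitSet;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BitSetGzip {
    // Pack a cold chunk down to its compressed bytes.
    static byte[] compress(BitSet bits) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(bits.toByteArray());
        gz.close();
        return bos.toByteArray();
    }

    // Inflate it again on access.
    static BitSet decompress(byte[] packed) throws IOException {
        GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = gz.read(buf)) > 0) {
            bos.write(buf, 0, n);
        }
        gz.close();
        return BitSet.valueOf(bos.toByteArray());
    }
}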
The most compact representation will depend greatly on the distribution of indices in your specific application.
If your indices are densely clustered, the range-based representation suggested by mvp will probably work well (you might look at implementations of run-length encoding for raster graphics, since that is a similar problem).
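As a rough illustration of such a range-based structure, here is a minimal sketch of the TreeMap idea from the question (start index as key, end index as value; class and method names are mine, and add() assumes ranges arrive in increasing order):

import java.util.Map;
import java.util.TreeMap;

public class RangeSet {
    // key = start of a run of consecutive indexes, value = end of that run (inclusive)
    private final TreeMap<Long, Long> runs = new TreeMap<Long, Long>();

    public boolean contains(long index) {
        Map.Entry<Long, Long> run = runs.floorEntry(index);
        return run != null && index <= run.getValue();
    }

    public void add(long start, long end) {
        Map.Entry<Long, Long> prev = runs.floorEntry(start);
        if (prev != null && prev.getValue() >= start - 1) {
            // overlaps or touches the previous run: extend it instead of adding a new entry
            runs.put(prev.getKey(), Math.max(prev.getValue(), end));
        } else {
            runs.put(start, end);
        }
    }
}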
If your indices aren't clustered in dense runs, that encoding will actually increase memory consumption. For sparsely populated lists, look into primitive collections such as LongArrayList or LongOpenHashSet in FastUtil, or similar structures in GNU Trove or Colt. In most VMs, each Long object in your ArrayList consumes 20+ bytes, whereas a primitive long consumes only 8, so type-specific primitive collections often yield significant memory savings over the standard Collections framework.
I've been very pleased with FastUtil, but you might find another solution suits you better. A little simulation and memory profiling should help you determine the most effective representation for your own data.
Most BitSet (compressed or uncompressed) implementations are for integers. Here's one for longs: http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset which works like an ordered primitive long hash set or long to long hash map.
I have to store millions of entries in a database. Each entry is identified by a set of unique integer identifiers; for example, a value may be identified by a set of 10 integer identifiers, each of which is less than 100 million.
In order to reduce the size of the database, I thought of the following encoding using a single 32-bit integer value.
Identifier 1: 0 - 100,000,000
Identifier 2: 100,000,001 - 200,000,000
.
.
.
Identifier 10: 900,000,001 - 1,000,000,000
I am using Java. I can write a simple method to encode/decode. The user code does not have to know that I am encoding/decoding during fetch/store.
What I want to know is: what is the most efficient (fastest) and recommended way to implement such encoding/decoding. A simple implementation will perform a large number of multiplications/subtractions.
Is it possible to use shifts (or bitwise operations) and choose different partition size (the size of each segment still has to be close to 100 million)?
I am open to any suggestions, ideas, or even a totally different scheme. I want to exploit the fact that the integer identifiers are bounded to drastically reduce the storage size without noticeably compromising performance.
Edit: I just wanted to add that I went through some of the answers posted on this forum. A common solution was to split the bits among the identifiers. But if I use, say, 2 bits for each of the 10 identifiers, the range of each identifier gets severely limited.
It sounds like you want to pack multiple integer values in the range 0...100m into a single 32-bit integer? Unless you are omitting important information that would allow these 0...100m values to be stored more efficiently, there is simply no way to do it.
ceil(log2(100m)) = 27 bits, which means a single such value already needs 27 of your 32 bits, leaving you only 5 "spare bits".
You can make the segment size 27 bits, which gives you 32 segments of 128M values each instead of 42 segments of 100M.
int value = …;                     // the packed 32-bit value
int high = value >>> 27;           // segment index
int low = value & ((1 << 27) - 1); // offset within the segment
It is worth noting that this calculation is likely to be trivial compared to the cost of using a database.
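Put together, a pack/unpack pair for the 27-bit scheme might look like this (a sketch; class and method names are mine):

public class PackedId {
    private static final int SEGMENT_BITS = 27;
    private static final int OFFSET_MASK = (1 << SEGMENT_BITS) - 1;

    // identifier: 0..31 (which segment), offset: 0..134217727 (position inside it)
    static int pack(int identifier, int offset) {
        return (identifier << SEGMENT_BITS) | offset;
    }

    static int identifier(int packed) {
        return packed >>> SEGMENT_BITS; // unsigned shift recovers segments 16..31 too
    }

    static int offset(int packed) {
        return packed & OFFSET_MASK;
    }

    public static void main(String[] args) {
        int packed = pack(9, 12345678);
        System.out.println(identifier(packed) + " / " + offset(packed)); // 9 / 12345678
    }
}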
It's unclear what you actually want to do, but it sounds like you want an integer value with each bit representing a particular attribute, tested by applying a bitmask.
A 32-bit integer can store 32 different attributes, a 64-bit one 64, and so on. To have more, you'll need multiple integer columns.
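For example, a minimal sketch of such attribute flags (the flag names are made up):

public class Attributes {
    static final int ACTIVE   = 1 << 0;
    static final int VERIFIED = 1 << 1;
    static final int PREMIUM  = 1 << 2; // ... up to 32 flags in one int column

    public static void main(String[] args) {
        int flags = ACTIVE | PREMIUM;               // set two attributes
        boolean verified = (flags & VERIFIED) != 0; // test one with a bitmask
        flags &= ~PREMIUM;                          // clear an attribute
        System.out.println(verified + " " + Integer.toBinaryString(flags));
    }
}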
If that's not it, I don't know what you mean by "encode".
I am storing certain entities in my database with integer ids of 32 bits, thus covering the range -2.14 billion to +2.14 billion.
I have tried giving some meaning to my ids, because of which the positive range has been used up rather quickly. I am now looking to use the negative range, -2.14 billion to 0.
I wanted to know whether you see any downsides to using negative integers as ids; personally I don't see any.
There is an old saying in database design that goes like this: "Intelligent keys are not". You should never design for special meaning in an id when a descriptive attribute is more appropriate.
Given that dumb keys are only compared for equality, the sign (or lack thereof) has no impact.