I have a use case for generating distributed unique sequence numbers in integer format. UUID comes out to be the best and simplest solution for me.
However, I need to have integers only, so I will convert that big hexadecimal number (the UUID) to a decimal number. A UUID has 128 bits and hence will produce a decimal number of up to 39 digits.
I can't afford a 39-digit number due to some strict database constraints. So, I get back to basics and try to convert the number to binary first and then to decimal. Now, the standard process for converting a hexadecimal number to binary is to take each hexadecimal digit and convert it into 4 bits. Hence, for the 32 hex digits in a UUID, we get 128 bits (32 × 4).
Now, I am thinking of not following the rule of converting each hexadecimal digit to 4 bits. Instead, I will use only as many bits as are needed to represent each digit.
For example, take 12B as a hexadecimal number.
By the standard process, the conversion to binary comes out to be 0000-0001-0010-1011 (9 significant bits once leading zeros are dropped).
By my custom process, it comes out to be 1-10-1011 (7 bits).
So, by this method, the number of bits is reduced, and with fewer bits the converted decimal number will have fewer digits and can fit within my constraints.
Can you please help me validate my theory? Does this approach have any problems? Will it cause collisions? Is the method correct, and can I go ahead with it?
Thanks in advance.
Yes, this will cause collisions.
e.g.
0000-0001-0010-1011 -> 1101011
0000-0000-0110-1011 -> 1101011
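To see it concretely, here is a minimal Java sketch (with a hypothetical customEncode helper) that applies the proposed variable-width encoding; both inputs above collapse to the same bit string:

    public class CustomEncodeCollision {
        // Concatenates each hex digit's minimal binary form (no 4-bit padding),
        // then strips leading zeros, reproducing the scheme from the question.
        static String customEncode(String hex) {
            StringBuilder bits = new StringBuilder();
            for (char c : hex.toCharArray()) {
                bits.append(Integer.toBinaryString(Character.digit(c, 16)));
            }
            return bits.toString().replaceFirst("^0+(?=.)", "");
        }

        public static void main(String[] args) {
            System.out.println(customEncode("012B")); // 1101011
            System.out.println(customEncode("006B")); // 1101011 -- collision!
        }
    }

The fixed 4-bits-per-digit padding is exactly what makes the standard mapping reversible; once it is dropped, distinct inputs can no longer be told apart.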
Some time ago I spent a couple of days debugging problems with UUID collisions (the UUIDs had been trimmed); debugging these things is a nightmare. You won't have a good time.
What you need is to implement your own unique identifier scheme; depending on your use case, developing such a scheme could be either very easy or very hard. You could, for example, assign each machine a unique number (say, two bytes), and each machine would assign IDs serially from a 4-byte namespace. In 6 bytes you have a nice UUID-like scheme (with some constraints), as sketched below.
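A minimal sketch of that 6-byte scheme, assuming each machine is configured with its own machineId; in a real deployment the counter would also have to survive restarts (e.g. be persisted):

    import java.util.concurrent.atomic.AtomicLong;

    public class MachineScopedIdGenerator {
        private final long machineId;                         // 2-byte machine number
        private final AtomicLong counter = new AtomicLong();  // 4-byte serial space

        public MachineScopedIdGenerator(int machineId) {
            if (machineId < 0 || machineId > 0xFFFF) {
                throw new IllegalArgumentException("machineId must fit in 2 bytes");
            }
            this.machineId = machineId;
        }

        // Packs the machine number into bits 32..47 and the serial counter
        // into bits 0..31, yielding a 48-bit (6-byte) identifier.
        public long nextId() {
            long serial = counter.getAndIncrement();
            if (serial > 0xFFFFFFFFL) {
                throw new IllegalStateException("4-byte serial space exhausted");
            }
            return (machineId << 32) | serial;
        }
    }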
Java provides ways for writing numeric literals in the bases 2, 8, 10 and 16.
I am wondering why base 8 is included, e.g. int x = 0123;?
I am thinking that there might be something akin to the fact that in hexadecimal the capacity of one byte is FF+1, and so forth.
This answer was written for the original question, "Why is writing a number in base 8 useful?"
It was to make the language familiar to those who knew C etc. Then the question is why support it in those!
There were architectures (various PDPs) which used 18 bit wide words (and others used 36 bit words), so literals where the digit is 3 bits wide would be useful.
Practically, the only place I have seen it used in Java code is for specifying unix-style permissions, e.g. 0777, 0644 etc.
(The tongue-in-cheek answer to why it is supported is "to get upvotes on this question").
"The octal numbers are not as common as they used to be. However, Octal is used when the number of bits in one word is a multiple of 3. It is also used as a shorthand for representing file permissions on UNIX systems and representation of UTF8 numbers, etc."
From: https://www.tutorialspoint.com/octal-number-system
A bit of computer (science) history: to represent a group of bits, base 10 does not fit; base 8 = 2^3 for 3 bits and base 16 = 2^4 for 4 bits fit better.
The advantage of base 8 is that all digits are really digits: 0-7, whereas base 16 has "digits" 0-9A-F.
For the 8 bits of a byte, base 16 (hexadecimal) is a better fit, and it won. On Unix, base 8 (octal) often still is used for the rwx bits (read, write, execute) for user, group, and others; hence octal numbers like 0666 or 0777.
Hexadecimal is ubiquitous, not least because computers' word sizes nowadays are multiples of bytes. That the 8-bit byte became the standard is another, though related, story (2^3 bits, and addressing).
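To illustrate the rwx point, here is a small Java sketch showing how each digit of an octal literal maps directly onto one permission group (the rwx helper is just for demonstration):

    public class OctalPermissions {
        public static void main(String[] args) {
            int mode = 0754; // octal literal: rwxr-xr--

            // Each octal digit is exactly 3 bits, so shifting by multiples
            // of 3 isolates the user, group, and other permission groups.
            System.out.println("user:  " + rwx(mode >> 6));
            System.out.println("group: " + rwx(mode >> 3));
            System.out.println("other: " + rwx(mode));
        }

        static String rwx(int bits) {
            return ((bits & 4) != 0 ? "r" : "-")
                 + ((bits & 2) != 0 ? "w" : "-")
                 + ((bits & 1) != 0 ? "x" : "-");
        }
    }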
Original answer for "What are octal numbers (base 8) used for?"
Common Usage of Octal
As an abbreviation of binary: For computing machines (such as UNIVAC 1050, PDP-8, ICL 1900, etc.), Octal has been used as an abbreviation of binary because their word size is divisible by three (each octal digit represents three binary digits). So two, four, eight or twelve digits could concisely display an entire machine word. It also cut costs by allowing Nixie tubes, seven-segment displays, and calculators to be used for the operator consoles, where binary displays were too complex to use, decimal displays needed complex hardware to convert radices, and hexadecimal displays needed to display more numerals.
16-, 32-, or 64-bit word representation: All modern computing platforms use 16-, 32-, or 64-bit words, further divided into eight-bit bytes. On such systems, three octal digits per byte would be required, with the most significant octal digit representing two binary digits (plus one bit of the next significant byte, if any). Octal representation of a 16-bit word requires 6 digits, but the most significant octal digit represents (quite inelegantly) only one bit (0 or 1). This representation offers no way to easily read the most significant byte, because it is smeared over four octal digits. Therefore, hexadecimal is more commonly used in programming languages today, since two hexadecimal digits exactly specify one byte. Some platforms with a power-of-two word size still have instruction subwords that are more easily understood if displayed in octal; this includes the PDP-11 and Motorola 68000 family. The modern-day ubiquitous x86 architecture belongs to this category as well, but octal is rarely used on this platform.
Encoding descriptions: Certain properties of the binary encoding of opcodes in modern x86 architecture become more readily apparent when displayed in octal, e.g. the ModRM byte, which is divided into fields of 2, 3, and 3 bits, so octal can be useful in describing these encodings.
Computations and File Access Permissions: Octal is sometimes used in computing instead of hexadecimal, perhaps most often in modern times in conjunction with file permissions under Unix systems (as in the mode argument to chmod). It has the advantage of not requiring any extra symbols as digits (the hexadecimal system is base 16 and therefore needs six additional symbols beyond 0-9).
Digital Displays: Octal numbers are also used for displaying digital content on a screen, since octal uses fewer distinct symbols for its representation.
Graphical representation of byte strings: Some programming languages (C, Perl, PostScript, etc.) can represent texts/graphics in octal, with bytes escaped as \nnn. Octal representation is particularly handy with the non-ASCII bytes of UTF-8, which encodes in groups of 6 bits, and where any start byte has octal value \3nn and any continuation byte has octal value \2nn (see the Java sketch after this list).
Early Floating-Point Arithmetics: Octal was also used for floating-point in the Ferranti Atlas (1962), Burroughs B5500 (1964), Burroughs B5700 (1971), Burroughs B6700 (1971) and Burroughs B7700 (1972) computers.
In Transponders: Aircraft transmit a code, expressed as a four-octal-digit number, when interrogated by ground radar. This code is used to distinguish different aircraft on the radar screen.
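As promised above, a short Java sketch of the \nnn octal view of UTF-8 bytes (Java string literals also accept octal escapes up to \377):

    import java.nio.charset.StandardCharsets;

    public class OctalUtf8 {
        public static void main(String[] args) {
            for (byte b : "é".getBytes(StandardCharsets.UTF_8)) {
                // Print each byte as a 3-digit octal escape.
                System.out.printf("\\%03o ", b & 0xFF);
            }
            // Output: \303 \251  -- start byte \3nn, continuation byte \2nn
        }
    }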
Further Readings: https://en.wikipedia.org/wiki/Octal
I want to generate unique IDs of 15 chars in Java. Requests are coming to multiple servers, and it should support a 3-4 year timeframe.
Also, we can assume 15 TPS max.
Please point me in the right direction.
Consider using UUID. It is longer than you specified, but it has a very low (like, really low) probability of giving you the same value in different runs. It is pretty commonly used.
You do not state your security requirements. For a non-secure solution, use 15-digit base-36 numbers. The digits are 0..9 and A..Z; increment the numbers in the usual way. It may be easier to use an ordinary increment of a variable and convert to base 36 at the end.
For a secure solution, encrypt the underlying number and convert the encrypted number to base 36. As long as the inputs are unique: 1, 2, 3 ... K, L, M, ... 1H2J, 1H2K, 1H2L ... then the encrypted outputs will be unique.
You may want to omit potentially confusing letters like I, O and Z (confused with 1, 0 and 2) and reduce to base 33 output.
15 digits in base 36 gives 36^15 ≈ 2.2 × 10^23 unique numbers, which should be enough.
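A minimal sketch of the non-secure variant, assuming a single shared counter (across multiple servers you would partition the counter, e.g. by reserving a range per server):

    import java.util.concurrent.atomic.AtomicLong;

    public class Base36Ids {
        private static final AtomicLong COUNTER = new AtomicLong();

        // Converts the next counter value to base 36 and left-pads to 15 chars.
        static String nextId() {
            String base36 = Long.toString(COUNTER.getAndIncrement(), 36).toUpperCase();
            return "0".repeat(15 - base36.length()) + base36;
        }

        public static void main(String[] args) {
            System.out.println(nextId()); // 000000000000000
            System.out.println(nextId()); // 000000000000001
        }
    }

A long counter can never overflow 15 base-36 digits, since Long.MAX_VALUE is only 13 digits in base 36.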
I want to generate unique ID ...
I think you should consider UUIDs (Universally Unique IDs).
The standard Java libraries include options to generate UUIDs. For instance, you can use java.util.UUID, introduced in Java 1.5. You can also use other open-source libraries, such as the Java UUID Generator (JUG) or the UUID library. Johann Burkard compares some of them.
I want to generate unique of 15 chars in java
15 chars ?
Usually, UUIDs are represented by 32 hexadecimal (base 16) digits. If you need an ID with only 15 chars, you may check the source code of the JDK or of JUG and create your own variant.
Usually a UUID combines information from the device (such as the Apple UUID or the MAC address of a network interface) with a timestamp.
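For reference, generating a random (version 4) UUID with the standard library is a one-liner:

    import java.util.UUID;

    public class UuidExample {
        public static void main(String[] args) {
            UUID id = UUID.randomUUID();  // random (version 4) UUID
            System.out.println(id);       // 36 chars with dashes
            System.out.println(id.toString().replace("-", "")); // 32 hex digits
        }
    }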
Hi everyone, I have been doing some reading and have come across true random number generation using the entropy from radioactive decay. I have written a helper tool that returns the next random byte. It uses a server that provides bits from such a setup; I believe its data comes from cesium decay. I have done quite a bit of searching and have not really been able to figure out how to go about using this to generate numbers in a range from 0..n-1.
A user on the unofficial SO irc told me this
if you have a random byte, 0..255, evenly distributed, and you want a random number in the range 0..5: there are 6 values in the output range and 256 in the input range. The greatest multiple of 6 that is <= 256 is 252, so you would sample your random byte until you get a number in the range 0..251, then take that number mod 6 to get your output number.
I'm not really sure how to sample the byte. Do I use a single byte, or do I have to continually request more bytes? I'm really just having a hard time wrapping my head around this, so any thorough explanation not using obscure math notation would be extremely appreciated.
Thanks.
"Sampling" means (Disclaimer: not the dictionary definition) "repeatedly checking for a value", so in your case you'd read bytes until you get one in the proper range, discarding the others.
The Javadoc of the nextLong() method of the Random class states that
Because class Random uses a seed with only 48 bits, this algorithm will not return all possible long values. (Random javadoc)
The implementation is:
return ((long)next(32) << 32) + next(32);
The way I see it is as follows: to create any possible long, we should generate any possible bit pattern of 64 bits with equal likelihood. Assuming the calls to next(int) give us 32 random bits, then the concatenation of these bits will be a sequence of 64 random bits and hence we generate each 64 bit pattern with equal likelihood. And therefore all possible long values.
I suppose that the person who wrote the Javadoc knows better and that my reasoning is flawed somehow. Can anyone explain where my reasoning is incorrect, and what kinds of longs will be returned instead?
Since Random is pseudo-random, we know that given the same seed it will return the same values. Taking the docs at their word, there are 48 bits of seed. This means there are at most 2^48 unique values that can come out. If there were more, that would mean some state we had seen before at a position < 2^48 gives us a different value this time than it did last time.
If we try to join up two results what do we see?
|a|b|c|d|e|f|...|(2^48)-1|
Above are some values. How many pairs are there? a-b, b-c, c-d, ..., (2^48)-1 - a. There are also only 2^48 pairs. We can't cover all 2^64 values with only 2^48 pairs.
Pseudo-random number generators are like giant rings of numbers. You start somewhere, and then move around the ring step by step as you pull numbers out. This means that with a given seed (an initial internal state) all subsequent numbers are predetermined. Therefore, since the internal state is only 48 bits wide, only 2^48 distinct states are possible. And since the next number is given by the previous one, it is now clear why that implementation of nextLong will not generate all possible long values.
Let's say a perfect pseudo-random K-bit generator is one that produces all 2^K possible seed values in 2^K tries. We can't do better, as there are only 2^K states, and every state is completely determined by the previous state and itself determines the next state.
Assume we write the output of the 48-bit generator down in binary. We get 2^48 × 48 bits that way.
And now we can say exactly how many 64-bit sequences we can get by going through the list and noting the next 64 bits (wrapping to the start when needed). It is exactly the number of bits we have: 2^48 × 48 = 13510798882111488.
Even if we assume that all those 64-bit sequences are pairwise different (which is not at all obvious), we have a long way to go until 2^64: 18446744073709551616.
I write the numbers again:
18446744073709551616 pairwise different 64 bit sequences we need
13510798882111488 64 bit sequences we can get with a 48 bit seed.
This proves that the Javadoc writer was right: only about 1/1365th of all long values can be produced by this generator (13510798882111488 / 18446744073709551616 = 48/2^16 ≈ 1/1365).
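The effect is easy to see on a scaled-down analogue: a toy generator with only 8 bits of state emitting 16-bit outputs (this is not java.util.Random's algorithm, just the same structural limitation):

    import java.util.HashSet;
    import java.util.Set;

    public class SmallStatePrng {
        // A fixed, deterministic 8-bit state transition.
        static int next8(int state) {
            return (state * 37 + 11) & 0xFF;
        }

        public static void main(String[] args) {
            Set<Integer> outputs = new HashSet<>();
            for (int seed = 0; seed < 256; seed++) {
                int s1 = next8(seed);  // first call: state advances, emits 8 bits
                int s2 = next8(s1);    // second call is determined by the first
                outputs.add((s1 << 8) | s2); // "nextLong"-style concatenation
            }
            // At most 256 distinct 16-bit values out of 65536 possible.
            System.out.println(outputs.size() + " distinct outputs of 65536 possible");
        }
    }

Because the second half is fully determined by the first, the concatenation never gains entropy beyond the state size; the same argument scales up to 48 bits of state and 64-bit outputs.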
I need to generate a hash value used for the uniqueness of many billions of records in Java. The trouble is, I only have 16 numeric digits to play with. In researching this, I have found algorithms for 32-bit hashes, which return Java integers. But this is too small, as it only has a range of +/- 2 billion, and we will have more records than that. I cannot go to a 64-bit hash, as that will give me back numeric values that are too large (+/- 9 quintillion, or 19 digits). The trouble is, I am dealing with a legacy system that is forcing me into a static key length of 16 digits.
Suggestions? I know no hash function will guarantee uniqueness, but I need a good one that will fit into these restrictions.
Thanks
If your generated hash is too large you can just mod it with your keyspace max to make it fit.
myhash = hash64bitvalue % 10^16
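Note that a literal translation to Java needs two corrections: ^ is XOR, not exponentiation, and % can return negative values for a negative hash. A minimal sketch:

    public class HashTruncation {
        // 10^16: the size of a 16-decimal-digit keyspace.
        private static final long KEYSPACE = 10_000_000_000_000_000L;

        // Reduces a 64-bit hash to at most 16 decimal digits.
        // Math.floorMod avoids the negative results plain % would give.
        static long truncate(long hash64) {
            return Math.floorMod(hash64, KEYSPACE);
        }

        public static void main(String[] args) {
            System.out.println(truncate(-1234567890123456789L)); // always in 0..10^16-1
        }
    }

Be aware that truncating the hash this way increases the collision rate accordingly (see the birthday-paradox answer below).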
If you are limited to 16 decimal digits, your key space contains 10^16 values.
Even if you find a hash that gives uniform distribution on your data set, due to Birthday Paradox you will have a 50% chance of collision on ~10^8 items of data, which is an order of magnitude less than your billions of records.
This means that you cannot use any kind of hash alone and rely on uniqueness.
A straightforward solution is to use a global counter instead. If a global counter is infeasible, counters with preallocated ranges can be used. For example, the 6 most significant digits denote a fixed data source index, and the 10 least significant digits contain a monotonic counter maintained by that data source, as sketched below.
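A minimal sketch of that partitioning, assuming each data source is configured with a fixed index (counter persistence is left out):

    import java.util.concurrent.atomic.AtomicLong;

    public class PartitionedCounterIds {
        private static final long COUNTER_SPACE = 10_000_000_000L; // 10^10 per source

        private final long sourcePrefix;
        private final AtomicLong counter = new AtomicLong();

        public PartitionedCounterIds(int sourceIndex) {
            if (sourceIndex < 0 || sourceIndex > 999_999) {
                throw new IllegalArgumentException("source index needs at most 6 digits");
            }
            this.sourcePrefix = sourceIndex * COUNTER_SPACE;
        }

        // 6 most significant digits: data source index.
        // 10 least significant digits: per-source monotonic counter.
        public long nextId() {
            long serial = counter.getAndIncrement();
            if (serial >= COUNTER_SPACE) {
                throw new IllegalStateException("counter space exhausted");
            }
            return sourcePrefix + serial;
        }
    }

The largest possible ID, 999999 × 10^10 + (10^10 - 1), still fits in 16 digits, and uniqueness is guaranteed by construction rather than by probability.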
So your restriction is 53 bits? (10^16 is just over 2^53.)
To my understanding, the position of a bit in a hash code doesn't affect its value as a hash (position and value of a bit are fully independent of each other). So you could take a 64-bit hash function and use only the low 53 bits of it. And you must use binary operations for this ( hash64 & ((1L << 53) - 1) ), not arithmetic.
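A small sketch of that masking, spelling out the precedence pitfall (in Java, - binds tighter than <<, and the shift must be done on a long):

    public class Low53Bits {
        // Keeps only the 53 least significant bits of a 64-bit hash.
        static long low53(long hash64) {
            return hash64 & ((1L << 53) - 1); // mask of 53 one-bits
        }

        public static void main(String[] args) {
            long hash = 0xDEADBEEFCAFEBABEL;
            System.out.println(Long.toBinaryString(low53(hash)).length()); // <= 53
            // Beware: 1 << 54 - 1 parses as 1 << 53, and with an int 1 the
            // shift count is taken mod 32; write (1L << 53) - 1 explicitly.
        }
    }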
You don't have to store your hashes in a human-readable form (hex, as you said). Just store the 64-bit long datatype (generated by a 64-bit hash function) in your database, which is only 8 bytes, not the 19 digits that scared you off.
If that isn't a solution, improve the legacy system.
Edit: Wait!
64-bit: 2^64 = 18446744073709551616
16 hex digits: 16^16 = 18446744073709551616
Exact fit! So make a hex representation of your 64-bit hash, and there you are.
If you can store 16 alphanumeric characters, then you can use a hexadecimal representation and pack 16^16 distinct values into 16 chars. 16^16 is 2^64.
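For instance, formatting a 64-bit hash as a fixed-width 16-character hex key in Java:

    public class HexKey {
        public static void main(String[] args) {
            long hash64 = 0xDEADBEEFCAFEBABEL; // any 64-bit hash value

            // %016x: zero-padded, 16 lowercase hex digits -- exactly 64 bits.
            String key = String.format("%016x", hash64);
            System.out.println(key);          // deadbeefcafebabe
            System.out.println(key.length()); // 16
        }
    }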