I noticed that the max limit for a radix in Java is base 36.
Is this an arbitrary limit, or does Java have reason for limiting the radix in this way?
It's the number of decimal digits (10), plus the number of letters in the alphabet (26).
If a radix of 37 were allowed, a new character would have to be picked to represent the 37th digit. While it certainly would have been possible to pick some character, there is no obvious choice. It makes sense to just disallow larger radixes.
Very simple: 26 letters + 10 digits = 36.
Traditionally, decimal digits and Latin letters are used to represent numbers.
For completeness, I would add that there are two constants defined in the JDK:
Character.MIN_RADIX
Character.MAX_RADIX
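For example (a quick, minimal sketch; the out-of-range radix falling back to base 10 is the documented behaviour of Integer.toString):

System.out.println(Character.MIN_RADIX); // 2
System.out.println(Character.MAX_RADIX); // 36
System.out.println(Integer.toString(123456789, 36)); // "21i3v9" - all ten digits and twenty-six letters are available
System.out.println(Integer.toString(123456789, 37)); // "123456789" - radix out of range, so base 10 is used instead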
A radix limit makes sense if the output has to be readable.
In various cases, the output does NOT need to be readable.
In such cases a higher limit would indeed help.
So the Java language's radix limit is a weak point of Java.
You can use the Base64 encoding scheme as specified in RFC 4648 and RFC 2045.
Just generate the byte representation of your int number according to your needs, so that it is compatible with the majority of the libraries that implement Base64.
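For instance, a minimal sketch using java.util.Base64 and java.nio.ByteBuffer from the standard library (the example value is my own):

byte[] bytes = java.nio.ByteBuffer.allocate(Integer.BYTES).putInt(123456789).array();
String encoded = java.util.Base64.getUrlEncoder().withoutPadding().encodeToString(bytes); // "B1vNFQ"
int decoded = java.nio.ByteBuffer.wrap(java.util.Base64.getUrlDecoder().decode(encoded)).getInt(); // 123456789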
Java provides ways for writing numeric literals in the bases 2, 8, 10 and 16.
I am wondering why base 8 is included, e.g. int x = 0123;?
I am thinking that there might be something akin to the fact that in hexadecimal the capacity of one byte is FF+1, and so forth.
This answer was written for the original question, "Why is writing a number in base 8 useful?"
It was to make the language familiar to those who knew C etc. Then the question is why support it in those!
There were architectures (various PDPs) which used 18-bit-wide words (and others used 36-bit words), so literals in which each digit is 3 bits wide would be useful.
Practically, the only place I have seen it used in Java code is for specifying unix-style permissions, e.g. 0777, 0644 etc.
(The tongue-in-cheek answer to why it is supported is "to get upvotes on this question").
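To make the literal forms concrete, here is a minimal sketch (the values are my own examples):

int binary = 0b110100100; // 420 (binary literal, Java 7+)
int octal  = 0644;        // 420 (a leading zero marks an octal literal)
int dec    = 420;         // 420
int hex    = 0x1A4;       // 420
System.out.println(Integer.toOctalString(octal)); // "644" - unix-style permission masks read back naturally in octal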
"The octal numbers are not as common as they used to be. However, Octal is used when the number of bits in one word is a multiple of 3. It is also used as a shorthand for representing file permissions on UNIX systems and representation of UTF8 numbers, etc."
From: https://www.tutorialspoint.com/octal-number-system
History of computing (and computer science): base 10 does not fit a group of bits; base 8 = 2^3 (for 3 bits) and base 16 = 2^4 (for 4 bits) fit better.
The advantage of base 8 is that all digits are really digits: 0-7, whereas base 16 has "digits" 0-9A-F.
For the 8 bits of a byte, base 16 (hexadecimal) is the better fit, and it won. On Unix, base 8 (octal) is often still used for the rwx bits (read, write, execute) for user, group and others; hence octal numbers like 0666 or 0777.
Hexadecimal is ubiquitous, not least because computers' word sizes nowadays are multiples of bytes. That the 8-bit byte became the standard is another, though related, story (2^3 bits, and addressing).
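A quick illustration of that fit, using the standard conversion helpers (the example value is my own):

int b = 0xFF; // one 8-bit byte
System.out.println(Integer.toBinaryString(b)); // "11111111" - 8 bits
System.out.println(Integer.toHexString(b));    // "ff"  - two hex digits cover the byte exactly
System.out.println(Integer.toOctalString(b));  // "377" - three octal digits, the top one only half used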
Original answer for "What are octal numbers (base 8) used for?"
Common Usage of Octal
As an abbreviation of binary: For computing machines (such as UNIVAC 1050, PDP-8, ICL 1900, etc.), Octal has been used as an abbreviation of binary because their word size is divisible by three (each octal digit represents three binary digits). So two, four, eight or twelve digits could concisely display an entire machine word. It also cut costs by allowing Nixie tubes, seven-segment displays, and calculators to be used for the operator consoles, where binary displays were too complex to use, decimal displays needed complex hardware to convert radices, and hexadecimal displays needed to display more numerals.
16-, 32-, or 64-bit word representation: All modern computing platforms use 16-, 32-, or 64-bit words, further divided into eight-bit bytes. On such systems, three octal digits per byte would be required, with the most significant octal digit representing two binary digits (plus one bit of the next significant byte, if any). Octal representation of a 16-bit word requires 6 digits, but the most significant octal digit represents (quite inelegantly) only one bit (0 or 1). This representation offers no way to easily read the most significant byte, because it is smeared over four octal digits. Therefore, hexadecimal is more commonly used in programming languages today, since two hexadecimal digits exactly specify one byte. Some platforms with a power-of-two word size still have instruction subwords that are more easily understood if displayed in octal; this includes the PDP-11 and Motorola 68000 family. The modern-day ubiquitous x86 architecture belongs to this category as well, but octal is rarely used on this platform.
Encoding descriptions: Certain properties of the binary encoding of opcodes in modern x86 architecture become more readily apparent when displayed in octal, e.g. the ModRM byte, which is divided into fields of 2, 3, and 3 bits, so octal can be useful in describing these encodings.
Computations and File Access Permissions: Octal is sometimes used in computing instead of hexadecimal, perhaps most often in modern times in conjunction with file permissions under Unix systems (in the permission argument to chmod). It has the advantage of not requiring any extra symbols as digits (the hexadecimal system is base-16 and therefore needs six additional symbols beyond 0–9).
Digital Displays: Octal numbers have also been used for digital displays, since fewer distinct symbols are needed for the representation.
Graphical representation of byte strings: Some programming languages (C, Perl, PostScript, etc.) represent bytes in text/graphics strings with octal escapes of the form \nnn. Octal representation is particularly handy with the non-ASCII bytes of UTF-8, which encodes payload bits in groups of 6, and where any start byte has octal value \3nn and any continuation byte has octal value \2nn (a short sketch follows this list).
Early Floating-Point Arithmetic: Octal was also used for floating point in the Ferranti Atlas (1962), Burroughs B5500 (1964), Burroughs B5700 (1971), Burroughs B6700 (1971) and Burroughs B7700 (1972) computers.
In Transponders: Aircraft transmit a code, expressed as a four-octal-digit number when interrogated by ground radar. This code is used to distinguish different aircraft on the radar screen.
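To make the UTF-8 point above concrete, here is a small sketch (my own example) that prints the UTF-8 bytes of 'é' in octal:

byte[] utf8 = "é".getBytes(java.nio.charset.StandardCharsets.UTF_8);
for (byte b : utf8) {
    System.out.print("\\" + Integer.toOctalString(b & 0xFF) + " "); // prints "\303 \251"
}
// 0303 is a start byte (\3nn), 0251 a continuation byte (\2nn).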
Further Readings: https://en.wikipedia.org/wiki/Octal
Basically what the title says. I'm aware that I could use char as the type if I only had one letter, but I need a datatype for 2 letters, e.g. "XY". Is there anything that uses less storage (bits) or is smaller than a String? Or are multiple letters generally just saved as Strings? Thanks!
If you are sure that there are no higher-unicode characters (i.e. characters that use more than 1 char to store) in use, there are a few options:
As mentioned by #rdas, you could use an array: char[2]. This would be a bit more memory-efficient than a String, as the String has additional members. If it's only ASCII-characters, you could even use byte[2].
As one char is 16 bits, 2 chars are 32 bits, so you could also try to encode the 2 characters into 1 int, as this also uses only 32 bits and you would not have the object overhead of the array (see the sketch after this list). Clearly, this requires some additional steps to encode/decode when you need to show the stored information as actual characters, e.g. when presenting it to the user.
If your characters are only ASCII codes, i.e. every character fits into 1 byte, you could even fit it into a short.
Depending on the number of two-character combinations that you actually need to support, you could actually just enumerate all the possible combinations, use a lookup Map or sorted Array, and then only store the number/index of the code. Again, depending on the number of combinations, use a byte, short or int.
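A minimal sketch of the pack-into-an-int option mentioned above (the helper names are my own):

// Pack two 16-bit chars into one 32-bit int and read them back.
static int pack(char a, char b) { return (a << 16) | b; }
static char first(int packed)   { return (char) (packed >>> 16); }
static char second(int packed)  { return (char) (packed & 0xFFFF); }

int code = pack('X', 'Y');
System.out.println("" + first(code) + second(code)); // prints "XY"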
No, it is not possible.
This is why:
String s = "ab"; // uses only 4 bytes of character data, as each character reserves 2 bytes
All other data types use >= 4 bytes, except short and byte, but short and byte cannot store characters.
Could someone please help me understand the significance of radix in the Character.forDigit(int digit, int radix) method?
Choosing a suitable radix lets you produce "digits" that are not decimal digits - for example, if you pass 16 for the radix, you can produce characters for digits ten through fifteen, because counting in hex uses sixteen digits - 0 through 9, followed by A through F:
char ten = Character.forDigit(10, 16); // returns 'a'
This is tricky because the significance isn't as obvious as it first appears. When converting a string to an integer, of course the radix matters a lot. If you are converting "101" to an integer, you will get different answers depending on whether the radix (base) is binary (2), decimal (10), octal (8), hex (16), or any other base. Similarly, when converting an integer to a string, the results (when the source is >= MAX_RADIX) are all different for the different radices.
For forDigit, the answer isn't as clear. When you're converting a number to a single character representing a digit, the answer is always the same as long as the digit is valid for the radix. Thus, Character.forDigit(11,radix) always returns 'b' for all radices 12 and up. So the only significance is in how it handles the case when the digit is not valid for the radix? That is, for binary (radix=2), forDigit only works if the digit is 0 or 1; so what should it do if you say Character.forDigit(2,2), since 2 is not a valid binary digit?
There are a few things the language designers could have done: (1) get rid of the radix parameter and put the onus on the programmer to make sure the digit is in range (which in many cases will be a given anyway); (2) throw an exception; (3) return some special value. They chose (3): if you give it a digit that isn't valid for the radix, it returns '\0', the null character. This doesn't seem to be the best choice--you're unlikely to really want to use the null character for anything, which means you have to make your own check, which means they probably should have had the method throw an exception. But there it is.
But anyway, that's the significance of radix for this method: it performs a check to make sure the argument is in range, based on the radix.
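For example (a quick check of that behaviour):

char ok  = Character.forDigit(1, 2); // '1' - a valid binary digit
char bad = Character.forDigit(2, 2); // '\u0000' - 2 is not a valid digit in radix 2
System.out.println(bad == '\u0000'); // true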
It is the base of the number. One normally uses base 10 (ie 0-9). However, you might also be interested in using hexadecimal (ie 0-9, A-F), for example. The radix would then be 16.
Example:
Character.forDigit(8, 16)  = '8'
Character.forDigit(9, 16)  = '9'
Character.forDigit(10, 16) = 'a'
Character.forDigit(11, 16) = 'b'
I have a use-case for getting distributed unique sequence numbers in integer format. UUID comes out to be the best and simple solution for me.
However, I need to have integers only, so I will convert that big hexadecimal number (the UUID) to a decimal number. A UUID has 128 bits and hence will produce a decimal number of up to 39 digits.
I can't afford a 39-digit number due to some strict database constraints. So, I get back to basics and try to convert the number to binary first and then to decimal. Now, the standard process of converting hexadecimal directly to binary is to take each hexadecimal digit and convert it into 4 bits; each hexadecimal digit maps to a set of 4 bits. Hence, for the 32 hex digits in a UUID, we get 128 bits (32*4).
Now, I am thinking of not following the rule of converting each hexadecimal digit to 4 bits. Instead, I will just use enough bits to represent each digit.
For example, take 12B as a hexadecimal number.
By standard process, conversion to binary comes out to be 0000-0001-0010-1011 (9 bits actually).
By my custom process, it comes out to be 1-10-1011 (7 bits actually).
So, by this method, the number of bits is reduced. And if the bits are reduced, the converted decimal number will have fewer digits and can fit within my constraints.
Can you please help in validating my theory? Does this approach have some problem? Will it cause collisions? Is the method correct, and can I go ahead with it?
Thanks in advance.
Yes, this will cause collisions.
e.g.
0000-0001-0010-1011 -> 1101011
0000-0000-0110-1011 -> 1101011
Some time ago I spent a couple of days debugging problems with UUID collisions (the UUIDs were trimmed); debugging these things is a nightmare. You won't have a good time.
What you need is to implement your own unique identifier schema. Depending on your use case, developing such a schema could be either very easy or very hard. You could, for example, assign each machine a unique number (let's say two bytes), and each machine would assign IDs serially from a 4-byte namespace. In 6 bytes you have a nice UUID-like schema (with some constraints).
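A minimal sketch of that kind of scheme, assuming a hypothetical two-byte machineId handed out ahead of time and a per-machine serial counter (none of these names come from a library):

import java.util.concurrent.atomic.AtomicInteger;

class SimpleIdGenerator {
    private final int machineId;                          // 0..65535, assigned out of band
    private final AtomicInteger counter = new AtomicInteger();

    SimpleIdGenerator(int machineId) { this.machineId = machineId; }

    // 6 significant bytes: 2 bytes of machine id + 4 bytes of per-machine serial.
    long nextId() {
        return ((long) machineId << 32) | (counter.getAndIncrement() & 0xFFFFFFFFL);
    }
}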
I've seen it suggested that Base 40 encoding can be used to compress Strings (in Java, to send to a Redis instance, FWIW), and a quick test shows it is more efficient for some of the data I'm using than an alternative I'm considering, Smaz.
Is there any reason to prefer base 32 or 64 encoding over 40? Any disadvantages? Is encoding like this potentially lossless?
Base 40 provides letters (probably lower case, unless your application tends to use upper case most of the time) and digits for 36 of the values, and then four more for punctuation and shifts. You can make it lossless by making one of the remaining values an escape, so that the next one or two characters represent a byte not covered by the other 39. Also a good approach is to have a shift-lock character that toggles between upper and lower case, if you tend to have strings of upper-case characters.
40 is a convenient base, since three base-40 digits fit nicely in two bytes. 40^3 (64000) is a smidge less than 2^16 (65536).
What you should use depends on the statistics of your data.
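For example, a minimal sketch of packing three base-40 digits into one 16-bit value (my own encode/decode helpers, assuming each digit is already mapped into the range 0..39):

// Three base-40 digits fit in 16 bits because 40^3 = 64000 < 65536 = 2^16.
static int pack(int d0, int d1, int d2) {           // each digit in 0..39
    return (d0 * 40 + d1) * 40 + d2;                // 0..63999, fits in two bytes
}
static int[] unpack(int packed) {
    return new int[] { packed / 1600, (packed / 40) % 40, packed % 40 };
}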