Java provides ways for writing numeric literals in the bases 2, 8, 10 and 16.
I am wondering why base 8 is included, e.g. int x = 0123;?
I am guessing there might be a reason akin to the fact that, in hexadecimal, the capacity of one byte is FF+1, and so forth.
This answer was written for the original question, "Why is writing a number in base 8 useful?"
It was to make the language familiar to those who knew C, etc. Then the question becomes why it was supported in those languages!
There were architectures (various PDPs) which used 18 bit wide words (and others used 36 bit words), so literals where the digit is 3 bits wide would be useful.
Practically, the only place I have seen it used in Java code is for specifying unix-style permissions, e.g. 0777, 0644 etc.
(The tongue-in-cheek answer to why it is supported is "to get upvotes on this question").
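For illustration, a minimal Java sketch of the Unix-permissions use case mentioned above; the particular permission values are just examples:

```java
public class OctalLiterals {
    public static void main(String[] args) {
        // Unix-style permission bits written as octal literals:
        // each octal digit is one rwx triple (user, group, others)
        int dirPerms  = 0755;   // rwxr-xr-x
        int filePerms = 0644;   // rw-r--r--
        System.out.println(dirPerms);                          // 493 (same value in decimal)
        System.out.println(Integer.toOctalString(filePerms));  // 644
    }
}
```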
"The octal numbers are not as common as they used to be. However, Octal is used when the number of bits in one word is a multiple of 3. It is also used as a shorthand for representing file permissions on UNIX systems and representation of UTF8 numbers, etc."
From: https://www.tutorialspoint.com/octal-number-system
A bit of computer (science) history: to represent a group of bits, base 10 does not fit; base 8 = 2^3 fits groups of 3 bits, and base 16 = 2^4 fits groups of 4 bits better.
The advantage of base 8 is that all digits are really digits: 0-7, whereas base 16 has "digits" 0-9A-F.
For the 8 bits of a byte, base 16 (hexadecimal) is the better fit, and it won. On Unix, base 8 (octal) is often still used for the rwx bits (read, write, execute) for user, group and others; hence octal numbers like 0666 or 0777.
Hexadecimal is ubiquitous, not least because computers' word sizes nowadays are multiples of a byte. That the 8-bit byte became the standard is another, though related, story (2^3 bits, and addressing).
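To make the comparison concrete, here is a small sketch showing the same value written with each of Java's literal bases (binary literals and underscores in literals require Java 7+):

```java
public class NumberBases {
    public static void main(String[] args) {
        int dec = 511;             // base 10
        int hex = 0x1FF;           // base 16: roughly two hex digits per 8-bit byte
        int oct = 0777;            // base 8: each digit is exactly 3 bits
        int bin = 0b1_1111_1111;   // base 2
        System.out.println(dec == hex && hex == oct && oct == bin);  // true
    }
}
```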
Original answer for "What are octal numbers (base 8) used for?"
Common Usage of Octal
As an abbreviation of binary: For computing machines (such as UNIVAC 1050, PDP-8, ICL 1900, etc.), Octal has been used as an abbreviation of binary because their word size is divisible by three (each octal digit represents three binary digits). So two, four, eight or twelve digits could concisely display an entire machine word. It also cut costs by allowing Nixie tubes, seven-segment displays, and calculators to be used for the operator consoles, where binary displays were too complex to use, decimal displays needed complex hardware to convert radices, and hexadecimal displays needed to display more numerals.
16-, 32-, or 64-bit word representation: All modern computing platforms use 16-, 32-, or 64-bit words, further divided into eight-bit bytes. On such systems, three octal digits per byte would be required, with the most significant octal digit representing two binary digits (plus one bit of the next significant byte, if any). Octal representation of a 16-bit word requires 6 digits, but the most significant octal digit represents (quite inelegantly) only one bit (0 or 1). This representation offers no way to easily read the most significant byte, because it is smeared over four octal digits. Therefore, hexadecimal is more commonly used in programming languages today, since two hexadecimal digits exactly specify one byte. Some platforms with a power-of-two word size still have instruction subwords that are more easily understood if displayed in octal; this includes the PDP-11 and Motorola 68000 family. The modern-day ubiquitous x86 architecture belongs to this category as well, but octal is rarely used on this platform.
Encoding descriptions: Certain properties of the binary encoding of opcodes in modern x86 architecture become more readily apparent when displayed in octal, e.g. the ModRM byte, which is divided into fields of 2, 3, and 3 bits, so octal can be useful in describing these encodings.
Computations and File Access Permissions: Octal is sometimes used in computing instead of hexadecimal, perhaps most often in modern times in conjunction with file permissions under Unix systems (e.g. the mode argument to chmod). It has the advantage of not requiring any extra symbols as digits (the hexadecimal system is base-16 and therefore needs six additional symbols beyond 0–9).
Digital Displays: Octal numbers are also used in displaying digital content on a screen, since octal uses fewer symbols for its representation.
Graphical representation of byte strings: Some programming languages (C, Perl, PostScript, etc.) can represent text/graphics bytes in strings as octal escapes of the form \nnn. Octal representation is particularly handy with the non-ASCII bytes of UTF-8, which encodes groups of 6 bits, and where any start byte has octal value \3nn and any continuation byte has octal value \2nn (a short Java sketch follows below).
Early Floating-Point Arithmetics: Octal was also used for floating-point in the Ferranti Atlas (1962), Burroughs B5500 (1964), Burroughs B5700 (1971), Burroughs B6700 (1971) and Burroughs B7700 (1972) computers.
In Transponders: Aircraft transmit a code, expressed as a four-octal-digit number when interrogated by ground radar. This code is used to distinguish different aircraft on the radar screen.
Further Readings: https://en.wikipedia.org/wiki/Octal
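As a small illustration of the octal-escape and UTF-8 points above, here is a sketch in Java (which also supports \nnn escapes in string literals); the character é is just an example:

```java
import java.nio.charset.StandardCharsets;

public class OctalEscapes {
    public static void main(String[] args) {
        // Octal escape in a Java string literal: \101 is 65 decimal, i.e. 'A'
        System.out.println("\101");   // prints A

        // UTF-8 bytes of 'é' (U+00E9) printed in octal:
        // the start byte is \3nn and the continuation byte is \2nn
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.print("\\" + Integer.toOctalString(b & 0xFF) + " ");
        }
        System.out.println();         // prints \303 \251
    }
}
```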
Related
I'm learning about Text I/O and Binary I/O in Java right now. I read that each value you write to a file is initially stored in binary. For text I/O, the individual digits are converted to their corresponding Unicode values and then encoded with the file-specific encoding, such as ASCII. For binary I/O, the binary value is directly represented in the file. For example, 199 would be represented as 0xC7, which in binary is 11000111.

Now I'm confused about one part. If a variable is initially stored in a binary format, does each digit represent a separate byte, or is the entire number stored as a single byte? For example, is 199 originally stored as 0xC7, which would be 11000111 in binary? Or would it be stored in 3 bytes, with each byte representing the binary value of one digit? If it is stored in 3 separate bytes, does binary I/O convert that 3-byte number to a single byte? If it is stored in a single byte, how does text I/O translate that single byte into 3 separate byte values? I'm just confused about how to word this. Hope you can understand what I'm getting at. Thanks
The only thing a computer is capable of dealing with are sets of 0/1 bits, which are stored in memory or, if you wish, on a storage device. Those bits can be streamed to monitors and converted to characters by graphics hardware. Same story with keyboards: you type a key and a few bits of data are sent to the computer.
Bits are stored in memory and are accessible by memory addresses. The addresses are also sets of bits.
For practical reasons the bits are grouped into bytes, words, long words, ... A byte used to be the smallest addressable unit of bits and historically ended up as a group of 8 bits, which is currently used in most of the hardware. Modern memory can store data in multiple byte addressable chunks. Same for the disk, you store data there, using specific addressing mechanisms. But in any case those are just sets of bits.
What you are confused about is the interpretation of those bits. They can represent integer numbers, floating point numbers, characters, addresses, ... The way they are interpreted only depends on the program which uses them.
Characters do not exist in the computer. They are just an abstraction which is provided by programming languages. The programs interpret the bits stored on the computer. There are standards. For example, the ASCII encoding maps English characters plus a few special characters to numbers from 0 to 127. Those fit into a single byte (leaving numbers 128 to 255 for special use). A print command will read those bytes one by one and send them to the graphics hardware to form letters on the screen, as specified in the encoding standard. A different encoding scheme will display the same bytes differently.
If you write a program with the "hello world" string in it, the program will convert the symbols between the quotes into a set of 11 ASCII bytes. (In C it will add yet another byte, equal to 0, which ends the string this way.) Unicode is yet another way to represent characters. Every Unicode character is represented by one or more bytes of data. There are other schemes as well. One thing to pay attention to: if you write strings to disk using a certain encoding, you should read them with the same encoding, or your prints will give you garbage. But you can always read and copy them as binary data without interpretation.
So, any variable of any type is just an abstraction and always consists of bytes of data which your program knows how to interpret based on the data type and/or operations it wants to perform. Variables of type int, double, any java object, including String, are just sets of bytes of different sizes. Only the program (and java interpreter is a program) knows what to do with them, use them in calculations or display as characters.
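To tie this back to the original 199/0xC7 question, here is a minimal sketch (file names are arbitrary) contrasting binary and text output in Java:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class TextVsBinary {
    public static void main(String[] args) throws IOException {
        int value = 199;

        // Binary I/O: a single byte with the bit pattern 11000111 (0xC7) is written
        try (FileOutputStream bin = new FileOutputStream("binary.dat")) {
            bin.write(value);   // write(int) stores only the low 8 bits
        }

        // Text I/O: the three characters '1', '9', '9' are encoded, giving three bytes
        try (Writer text = new OutputStreamWriter(
                new FileOutputStream("text.txt"), StandardCharsets.US_ASCII)) {
            text.write(Integer.toString(value));   // bytes 0x31 0x39 0x39
        }
    }
}
```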
I'm confused about the LEB128, or Little Endian Base 128, format. In the AOSP source code (Leb128.java), the read functions return int, whether signed or unsigned. I know the size of int in Java is 4 bytes, aka 32 bits. But the max length of LEB128 in AOSP is 5 bytes, aka 35 bits. So where did the other 3 bits get lost?
Each byte of data in LEB only accounts for 7 bits in the actual output - the remaining bit is used to indicate whether or not it's the end.
From Wikipedia:
To encode an unsigned number using unsigned LEB128 first represent the number in binary. Then zero extend the number up to a multiple of 7 bits (such that the most significant 7 bits are not all 0). Break the number up into groups of 7 bits. Output one encoded byte for each 7 bit group, from least significant to most significant group.
The extra bits aren't so much "lost" as "used to indicate whether or not it's the end of the data".
You can't hope to encode arbitrary 32-bit values with some of them taking fewer than 4 bytes unless some of them take more than 4 bytes.
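A minimal sketch of unsigned LEB128 encoding (not the AOSP code itself) that shows why a 32-bit value can need 5 output bytes:

```java
import java.io.ByteArrayOutputStream;

public class Leb128Sketch {
    // Each output byte carries 7 payload bits; the high bit says "more bytes follow".
    static byte[] encodeUnsigned(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int remaining = value;
        do {
            int chunk = remaining & 0x7F;   // take the low 7 bits
            remaining >>>= 7;               // unsigned shift: drop the bits we consumed
            if (remaining != 0) {
                chunk |= 0x80;              // set the continuation bit
            }
            out.write(chunk);
        } while (remaining != 0);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A 32-bit value with the top bit set needs ceil(32 / 7) = 5 bytes
        System.out.println(encodeUnsigned(0xFFFFFFFF).length);  // 5
        System.out.println(encodeUnsigned(127).length);         // 1
    }
}
```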
I can store numbers ranging from -127 to 127 in a byte, but beyond that it is impossible and the compiler gives a warning. The binary value of 127 is 01111111, and 130 is 10000010, still the same size (8 bits), so I thought I could store 130 in a byte, but it is not possible. How did that happen?
Java does not have unsigned types; each numeric type in Java is signed (except char, but it is meant for representing Unicode characters, not numbers).
Let's take a look at byte. It is one byte which is 8 bits. If it would be unsigned, yes, its range would be 0..255.
But if it is signed, it takes 1 bit of information to store the sign (2 possible values: + or -), which leaves us 7 bits to store the numeric (absolute) value. The range of 7 bits of information is 0..127.
Note that the representation of signed integer numbers uses the 2's complement number format in most languages, Java included.
Note: The range of Java's byte type is actually -128..127. The range -127..127 only contains 255 numbers (not 256 which is the number of all combinations of 8 bits).
In Java, a byte is a signed data type. You are thinking about unsigned bytes, in which case it is possible to store the value 130 in 8 bits. But with a signed data type, that also allows negative numbers, the first bit is necessary to indicate a negative number.
There are two common ways to store negative numbers (one's complement and two's complement), but the most popular one is two's complement. Its benefit is that most arithmetic operations do not need to take the sign of the number into account; they work regardless of the sign.
The first bit indicates the sign of the number: When the first bit is 1, then the number is negative. When the first bit is 0, then the number is positive. So you basically only have 7 bits available to store the magnitude of the number. (Using a small trick, this magnitude is shifted by 1 for negative numbers - otherwise, there would be two different bit patterns for "zero", namely 00000000 and 10000000).
When you want to store a number like 130, whose binary representation is 10000010, then it will be interpreted as a negative number, due to the first bit being 1.
Also see http://en.wikipedia.org/wiki/Two%27s_complement , where the trick of how the magnitude is shifted is explained in more detail.
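A quick demonstration of this in Java; casting 130 to byte shows the same bit pattern being reinterpreted as a negative two's complement value:

```java
public class ByteOverflowDemo {
    public static void main(String[] args) {
        int i = 130;                   // binary 1000 0010
        byte b = (byte) i;             // the explicit cast keeps only the low 8 bits
        // Interpreted as a signed (two's complement) byte, 1000 0010 means -126
        System.out.println(b);         // -126
        // Masking with 0xFF recovers the unsigned value in a wider int
        System.out.println(b & 0xFF);  // 130
    }
}
```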
I have a use-case for getting distributed unique sequence numbers in integer format. UUID comes out to be the best and simple solution for me.
However, I need to have integers only, so I will convert that big hexadecimal number (the UUID) to a decimal number. A UUID has 128 bits and hence will produce a decimal number of 39 digits.
I can't afford a 39-digit number due to some strict database constraints. So, I get back to basics and try to convert the number to binary first and then to decimal. Now, the standard process of converting a hexadecimal number directly to binary is to take each hexadecimal digit and convert it into 4 bits. Hence, for the 32 hex digits in a UUID, we get 128 bits (32*4).
Now, I am thinking of not following the rule of converting each hexadecimal digit to 4 bits. Instead, I will just use enough bits to represent that digit.
For example, take 12B as one hexadecimal number.
By standard process, conversion to binary comes out to be 0000-0001-0010-1011 (9 bits actually).
By my custom process, it comes out to be 1-10-1011 (7 bits actually).
So, by this method, the number of bits is reduced. And if the bits are reduced, the number of digits in the converted decimal number is reduced as well, so I can live with my constraints.
Can you please help validate my theory? Does this approach have some problem? Will this cause collisions? Is the method correct, and can I go ahead with it?
Thanks in advance.
Yes, this will cause collisions.
e.g.
0000-0001-0010-1011 -> 1101011
0000-0000-0110-1011 -> 1101011
Some time ago I spent a couple of days debugging problems with UUID collisions (the UUIDs were trimmed); debugging these things is a nightmare. You won't have a good time.
What you need is just to implement your own unique identifier scheme; depending on your use case, developing such a scheme could be either very easy or very hard. You could, for example, assign each machine a unique number (let's say two bytes), and each machine would assign IDs serially from a 4-byte namespace. In 6 bytes you then have a nice UUID-like scheme (with some constraints).
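A rough sketch of such a scheme, assuming a 16-bit machine id and a serial counter packed into a long; the names and sizes here are illustrative, not a drop-in design:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class MachineScopedIds {
    private final int machineId;                         // assigned uniquely per machine, 0..65535
    private final AtomicInteger counter = new AtomicInteger();

    public MachineScopedIds(int machineId) {
        this.machineId = machineId & 0xFFFF;
    }

    public long nextId() {
        long serial = counter.getAndIncrement() & 0xFFFFFFFFL;  // serially assigned low part
        return ((long) machineId << 32) | serial;               // 48 bits of id in a 64-bit long
    }

    public static void main(String[] args) {
        MachineScopedIds ids = new MachineScopedIds(7);
        System.out.println(ids.nextId());   // 30064771072
        System.out.println(ids.nextId());   // 30064771073
    }
}
```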
I noticed that the max limit for a radix in Java is base 36.
Is this an arbitrary limit, or does Java have reason for limiting the radix in this way?
It's the number of decimal digits (10), plus the number of letters in the alphabet (26).
If a radix of 37 were allowed, a new character would have to be picked to represent the 37th digit. While it certainly would have been possible to pick some character, there is no obvious choice. It makes sense to just disallow larger radixes.
Very simple: 26 letters + 10 digits = 36.
In order to represent a number, traditionally digits and Latin letters are used.
For completeness, I would add that there are two constants defined in the JDK:
Character.MIN_RADIX
Character.MAX_RADIX
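A quick illustration of those constants and of base-36 conversion:

```java
public class RadixDemo {
    public static void main(String[] args) {
        System.out.println(Character.MIN_RADIX);         // 2
        System.out.println(Character.MAX_RADIX);         // 36
        // Base 36 uses 0-9 then a-z, so 35 is the last single-digit value
        System.out.println(Integer.toString(35, 36));    // z
        System.out.println(Integer.parseInt("z", 36));   // 35
        // A radix outside 2..36 silently falls back to 10 for toString...
        System.out.println(Integer.toString(255, 37));   // 255
        // ...while parseInt with radix 37 throws NumberFormatException
    }
}
```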
A radix limit makes sense if the output has to be readable.
In various cases, the output does NOT need to be readable.
Thus indeed, a higher limit would help in such cases.
And the Java language's radix limit is a weak point of Java.
You can use the Base64 encoding scheme as specified in RFC 4648 and RFC 2045.
Just generate the byte representation of your int number according to your needs, to be compatible with the majority of the libraries that implement Base64.
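A minimal sketch using java.util.Base64 (available since Java 8); packing the int as 4 big-endian bytes is one possible choice, not the only one:

```java
import java.nio.ByteBuffer;
import java.util.Base64;

public class IntBase64 {
    public static void main(String[] args) {
        int value = 1234567890;

        // Pack the int into its 4-byte big-endian representation
        byte[] bytes = ByteBuffer.allocate(Integer.BYTES).putInt(value).array();

        // RFC 4648 Base64 encoding of those bytes
        String encoded = Base64.getEncoder().encodeToString(bytes);
        System.out.println(encoded);                 // SZYC0g==

        // Round-trip back to the original int
        int decoded = ByteBuffer.wrap(Base64.getDecoder().decode(encoded)).getInt();
        System.out.println(decoded == value);        // true
    }
}
```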