Android LEB128 type size - java

I'm confused about the LEB128 (Little Endian Base 128) format. In the AOSP source file Leb128.java, the read functions return an int whether the value is signed or unsigned. I know the size of an int in Java is 4 bytes, i.e. 32 bits. But the maximum length of a LEB128 value in AOSP is 5 bytes, i.e. 35 bits. So where do the other 3 bits go?
Thanks for your reply.

Each byte of data in LEB only accounts for 7 bits in the actual output - the remaining bit is used to indicate whether or not it's the end.
From Wikipedia:
To encode an unsigned number using unsigned LEB128 first represent the number in binary. Then zero extend the number up to a multiple of 7 bits (such that the most significant 7 bits are not all 0). Break the number up into groups of 7 bits. Output one encoded byte for each 7 bit group, from least significant to most significant group.
The extra bits aren't so much "lost" as "used to indicate whether or not it's the end of the data".
You can't hope to encode arbitrary 32-bit values with some of them taking less than 4 bytes unless some of them take more than 4 bytes.
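To make that concrete, here is a minimal sketch of an unsigned LEB128 decoder in Java. This is not the AOSP Leb128.java code, just an illustration of the 7-payload-bits-plus-continuation-bit scheme; the class and method names are made up.

import java.io.ByteArrayInputStream;

public class Leb128Demo {

    // Reads an unsigned LEB128 value into a Java int.
    // Each encoded byte contributes 7 payload bits; the top bit of each byte
    // says "another byte follows". On a 5th byte, only 4 of its 7 payload bits
    // still fit into the 32-bit int - that is the 35-vs-32-bit gap from the question.
    static int readUnsignedLeb128(ByteArrayInputStream in) {
        int result = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();                 // one whole byte: 0..255
            result |= (b & 0x7F) << shift; // keep the low 7 bits
            shift += 7;
        } while ((b & 0x80) != 0);         // continuation bit set => keep reading
        return result;
    }

    public static void main(String[] args) {
        // 0xE5 0x8E 0x26 is the textbook LEB128 encoding of 624485
        byte[] encoded = { (byte) 0xE5, (byte) 0x8E, 0x26 };
        System.out.println(readUnsignedLeb128(new ByteArrayInputStream(encoded))); // 624485
    }
}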

How to send an integer greater than 127 from Android (server, Java) using a byte array to a computer (client, C)? [closed]

I want to send an integer value (6000) from Android. I have to transfer it using byte[], for which I tried converting int[] to byte[]. But [0,0,23,112] is being stored. Could someone help?
[0,0,23,112] is 6000.
As you said, you must send the data in the shape of a byte array. A single byte is 8 bits; a single bit is an on/off switch. With 8 on/off switches, you can represent 256 unique states (2^8 is 256). A byte is just that, and it ends there. The bit sequence 00000101 is commonly understood to mean '5', but that's just convention. The computer doesn't know what 5 is; it just knows bits and bytes, and it keeps seeing 00000101. If you call System.out.println and pass that byte, and you see 5? That's println deciding to render it that way. It's not a universal truth about bytes.
In Java specifically, all the various methods that interact with bytes, including println, have decreed that they interpret byte values as two's complement signed. That means a byte counts up from 0 to 127, then 'rolls over' to -128, and as you keep incrementing the bits, it goes from -128 up to -1, at which point we've covered all 256 unique combinations (0, that's 1 combination; 127 positive integers; 128 negative ones: 1 + 127 + 128 is 256). But, again, it's just a choice.
This is why there is no such thing as a "signed byte" and an "unsigned byte" as far as the byte itself is concerned. The question of 'is it signed or unsigned' is for the code that prints it to decide. When you put bytes on a wire or in a file, it's irrelevant. In that sense, the byte 255 and the byte -1 are the identical value. That value (the bit sequence 11111111) prints as -1 if the code that does the printing decides to treat it as signed, and prints as 255 if the printing code decides to treat it as unsigned.
This also explains why you can't "just" put, say, 200 in a byte value, because Java treats things as signed (even the compiler does), but this:
byte b = (byte) 200;
works fine and does exactly what you want.
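A tiny sketch of what that looks like when you print it (the class name is made up):

public class ByteCastDemo {
    public static void main(String[] args) {
        byte b = (byte) 200;          // stores the bit pattern 11001000
        System.out.println(b);        // prints -56: println treats the byte as signed
        System.out.println(b & 0xFF); // prints 200: masking widens to int and reads the same bits as unsigned
    }
}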
However, even unsigned, bytes are still limited to the 0-255 range (or, at least, they can represent 256 unique things, and assuming we want to start at 0, that means only 0-255 is covered).
Thus, how do you represent higher numbers? Simply by adding more bytes.
The exact same thing happens when you count. We have 10 digits in the common western arabic numeral system: A single digit symbol can differentiate 10 different things. So what happens if you want to count up to 12?
Once you get to the 10th digit (the 9), and you want to add 1 more to it, what do we do?
We invent a second digit! We increment the second digit (from blank/0 to 1), and start our first digit (what used to be the 9) over from the beginning. Thus, after 9, we have 10.
You can do the exact same thing with bytes. A common digit covers 10 different things (0-9). A byte, however, covers 256 different things (0-255).
So what do you do when you want to 'roll over' and you need to add 1 to 255?
You add a second digit byte, and restart the first digit byte from 0 again.
So, we go from a single byte with bit sequence 11111111 (representing 255; let's say we treat it as unsigned for this exercise), and when you add one to that, we end up with 2 bytes. The first byte is 00000001 (representing 1), and the second byte is 00000000 (representing 0). Just like we went from 9 to 10.
Just like with human decimal, computers treat a two-byte '10' the same way: that leftmost digit now counts the number of times we 'rolled over' our first digit. Except with bytes, of course, each 'rollover' is worth 256, whereas with human decimal digits, each rollover is worth 10. Mathematically: decimal (western Arabic numeral) counting does 1 * 10^1 + 0 * 10^0, byte-based counting does 1 * 256^1 + 0 * 256^0. So, the byte sequence [1, 0] is 256.
Let's do that math now on the byte sequence: 23, 112.
We 'rolled over' our 256-ranged byte 23 times, so that's 23 * 256 = 5888, and the final digit byte adds 112 more. 5888 + 112 is... 6000!
Hence, whatever you did to turn 6000 into the byte array [0, 0, 23, 112]? That was correct. That is 6000 in big endian bytes.
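As a sketch, here is the same arithmetic in Java: splitting 6000 into big-endian bytes with shifts and reassembling it (this is illustrative, not the OP's code):

public class BigEndianDemo {
    public static void main(String[] args) {
        int value = 6000;
        // Split the int into 4 big-endian bytes (most significant first)
        byte[] out = {
            (byte) (value >>> 24),
            (byte) (value >>> 16),
            (byte) (value >>> 8),
            (byte) value
        };
        System.out.println(java.util.Arrays.toString(out)); // [0, 0, 23, 112]

        // Reassemble: each earlier byte counts 256 times as much as the next one
        int back = ((out[0] & 0xFF) << 24) | ((out[1] & 0xFF) << 16)
                 | ((out[2] & 0xFF) << 8)  |  (out[3] & 0xFF);
        System.out.println(back); // 6000 again: 23 * 256 + 112
    }
}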
NB: Little endian means that the least significant byte is sent first - that you write the 'digits' in reverse order. 6000 in little endian is [112, 23, 0, 0]. Most protocols (networks, files, etc.) use big endian. Many CPUs work in big endian, but Intel CPUs work in little endian; as in, if your computer has an Intel chip and it stores 6000 in its own memory banks, it stores [112, 23, 0, 0]. Some protocols/file formats just dump memory, and those tend to be in little endian, because for a decade or two a lot of computers had Intel chips in them. However, that era appears to be ending, as is the era of 'just dump memory straight to a file, voila, state saved'. Hence, little endian was never particularly relevant and is getting less relevant as time progresses.
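If you would rather let a library pick the byte order for you, java.nio.ByteBuffer can produce either ordering; a small sketch:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class EndianDemo {
    public static void main(String[] args) {
        // ByteBuffer defaults to big endian
        byte[] big = ByteBuffer.allocate(4).putInt(6000).array();
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(6000).array();
        System.out.println(Arrays.toString(big));    // [0, 0, 23, 112]
        System.out.println(Arrays.toString(little)); // [112, 23, 0, 0]
    }
}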

Does Java read a single byte in big endian bit order?

So we can talk about endianness of both bit order and byte order.
When I read the next byte from a FileInputStream, for example, I practically get an 8-bit signed integer, but I have no idea what bit order Java uses to calculate the byte's integer value. Which comes first, the most significant or the least significant bit?
(sign bit, 2^6 ..... 2^0)
Or...
(2^0, ..... 2^6, sign bit)
Endianness only really applies when a unit is broken down into other units. So if you were transmitting a byte over a bit stream, you could observe whether the least significant bit was transmitted first or last. And at that point we could say that the stream was little-endian or big-endian.
But within a byte-addressable machine, i.e., where the byte is the smallest unit of storage, there is no "endianness" within the byte. No bit of the byte is "before" any other bit of the byte.
Note that another term for endianness is "byte order". The order of bytes within larger entities.
It is true we like to number bits (0 to 7, for an 8-bit byte) so we can talk about them, but this really does not define endianness, even though the numbering is often chosen to match the byte order of the machine; this is convention.
With respect to FileInputStream: according to its documentation, it transfers bytes; no part of the byte is sent before any other part, at least not as far as FileInputStream is concerned. If the byte has to be sent bitwise over some interconnect (say, a SATA cable), then the decision about which bit goes first is a matter for the hardware. The higher-layer code is dealing in bytes (or even blocks).
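For what it's worth, a small sketch showing that FileInputStream.read() hands you a whole byte as an int in the range 0..255 (or -1 at end of stream), and that any bit "order" only appears when you choose how to format it. The file name here is made up:

import java.io.FileInputStream;
import java.io.IOException;

public class ReadByteDemo {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("some-file.bin")) { // hypothetical file
            int b = in.read();   // one whole byte at a time: 0..255, or -1 at end of stream
            if (b != -1) {
                // Only when formatting do we pick a textual order (most significant bit first here)
                System.out.println(String.format("%8s", Integer.toBinaryString(b)).replace(' ', '0'));
            }
        }
    }
}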
In an int, the first bit is the sign bit, the rest is the value, and the last bit is the least significant bit.

The number 149 is stored in the byte at address 16

I am reading a book about Java programming and in the first chapter it says: "The number 149 is stored in the byte at address 16" - is storing three characters, the 1, the 4, and the 9 in one byte possible?
No. The size of a character in Java is 2 bytes, so three characters take 6 bytes, and obviously 6 bytes cannot fit into 1.
I think the book was really asking whether the number 149 could fit into a byte, to which the answer is yes and no: an unsigned byte can hold a value of 255 at most, while a two's complement (signed) byte can only hold values up to 127.
Info about primitive data type
Storing the number 149 and storing the characters '1', '4', and '9' separately are completely different things. Storing the character '1' actually stores its ASCII value 49, and the ASCII values 52 and 57 represent '4' and '9' respectively. The size of each character in Java is 2 bytes, so 3 characters with a total size of 6 bytes cannot fit into a single byte.
The byte data type is only 8 bits, so it can store numbers from -128 to +127. That means the maximum value for a byte (Byte.MAX_VALUE) is 127, and since 149 is bigger than 127, it cannot fit into a byte; you have to use at least a short to store 149. A short is 2 bytes in Java.
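A short sketch illustrating the difference; the printed character codes are the standard ASCII values, and Character.BYTES and Byte.MAX_VALUE are standard Java constants:

public class CharVsNumberDemo {
    public static void main(String[] args) {
        System.out.println((int) '1');       // 49 - the character '1' is stored as its code, not as the number 1
        System.out.println((int) '4');       // 52
        System.out.println((int) '9');       // 57
        System.out.println(Character.BYTES); // 2  - each Java char occupies 2 bytes
        System.out.println(Byte.MIN_VALUE + " to " + Byte.MAX_VALUE); // -128 to 127, so 149 does not fit
    }
}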
I highly encourage you to read this Java documentation on data types. It's very short, but pretty useful to understand everything about data types.
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
In general, a byte is 8 bits and can store a number in the range 0 to 255. Bytes are often used in raw data processing, and bytes are how data is stored in memory. When storing characters or a "String", you are storing a sequence of bytes that represents a sequence of characters.
The number 149 in binary byte form is 10010101.
Decimal to Binary Converter
But storing characters is different from storing numbers. To address your question, storing the characters "1", "4", and "9" in a single byte is not possible, but storing the number 149 is.
Also, the number of bytes that a given character/string uses is highly dependent on which encoding you are using.
Java String: see .getBytes(Charset charset)
All this being said, a byte in Java is signed. Its range goes from -128 to +127 inclusive. A byte can store 256 unique values; you can think of them as numbers, individual flags, whatever you want. I have no context for the OP's original problem, but if they are using a Java primitive byte, it cannot by default hold the number 149. If you are talking about a sequence of 8 bits, it can.
Java Primitive Datatypes
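As a sketch of that last point, forcing 149 into a Java byte keeps the 8-bit pattern but changes how it prints (the class name is made up):

public class Byte149Demo {
    public static void main(String[] args) {
        byte b = (byte) 149;                                   // keeps the 8-bit pattern 10010101
        System.out.println(b);                                 // -107: the signed reading of that pattern
        System.out.println(b & 0xFF);                          // 149: the same 8 bits read as unsigned
        System.out.println(Integer.toBinaryString(b & 0xFF));  // 10010101
    }
}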

What does it mean when we say the width of byte in Java is 8 bits?

I can store numbers ranging from -127 to 127, but other than that it is impossible and the compiler gives a warning. The binary value of 127 is 01111111, and 130 is 10000010, still the same size (8 bits), so I would think I can store 130 in a byte, but it is not possible. How did that happen?
Java does not have unsigned types; each numeric type in Java is signed (except char, but that is not meant for representing numbers, only Unicode characters).
Let's take a look at byte. It is one byte, which is 8 bits. If it were unsigned, yes, its range would be 0..255.
But since it is signed, it takes 1 bit of information to store the sign (2 possible values: + or -), which leaves us 7 bits to store the numeric (absolute) value. The range of 7 bits of information is 0..127.
Note that the representation of signed integer numbers use the 2's complement number format in most languages, Java included.
Note: The range of Java's byte type is actually -128..127. The range -127..127 only contains 255 numbers (not 256 which is the number of all combinations of 8 bits).
In Java, a byte is a signed data type. You are thinking of unsigned bytes, in which case it is possible to store the value 130 in 8 bits. But with a signed data type, which also allows negative numbers, the first bit is needed to indicate a negative number.
There are two common ways to store negative numbers (one's complement and two's complement), but the most popular one is two's complement. Its benefit is that most arithmetic operations do not need to take the sign of the number into account; they work regardless of the sign.
The first bit indicates the sign of the number: When the first bit is 1, then the number is negative. When the first bit is 0, then the number is positive. So you basically only have 7 bits available to store the magnitude of the number. (Using a small trick, this magnitude is shifted by 1 for negative numbers - otherwise, there would be two different bit patterns for "zero", namely 00000000 and 10000000).
When you want to store a number like 130, whose binary representation is 10000010, then it will be interpreted as a negative number, due to the first bit being 1.
Also see http://en.wikipedia.org/wiki/Two%27s_complement , where the trick of how the magnitude is shifted is explained in more detail.
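A small sketch of both points: the 10000010 pattern reading as a negative number, and arithmetic working out the same regardless of how you interpret the sign (the class name is made up):

public class TwosComplementDemo {
    public static void main(String[] args) {
        byte b = (byte) 0b10000010;  // the 8-bit pattern for 130
        System.out.println(b);       // -126: the leading 1 makes it negative in two's complement

        // Two's complement arithmetic works the same whichever way you read the byte:
        int asSigned   = b;          // -126
        int asUnsigned = b & 0xFF;   // 130
        System.out.println((byte) (asSigned + 10));   // -116
        System.out.println((byte) (asUnsigned + 10)); // -116 as well: 140 truncated back to 8 bits
    }
}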

Reducing the number of bits in UUID

I have a use case for getting distributed unique sequence numbers in integer format. UUIDs look like the best and simplest solution for me.
However, I need integers only, so I will convert that big hexadecimal number (the UUID) to a decimal number. A UUID has 128 bits and hence will produce a decimal number of 39 digits.
I can't afford a 39-digit number due to some strict database constraints. So, I go back to basics and try to convert the number to binary first and then to decimal. Now, the standard process of converting a hexadecimal number directly to binary is to take each hexadecimal digit and convert it into 4 bits. Hence, for the 32 hex digits in a UUID, we get 128 bits (32 * 4).
Now, I am thinking of not following the rule of converting each hexadecimal digit to 4 bits. Instead, I will just use enough bits to represent each digit.
For example, take 12B as a hexadecimal number.
By the standard process, the conversion to binary comes out to 0000-0001-0010-1011 (9 significant bits).
By my custom process, it comes out to 1-10-1011 (7 bits).
So, by this method, the number of bits is reduced. With fewer bits, the converted decimal number has fewer digits and can live within my constraints.
Can you please help me validate my theory? Does this approach have some problem? Will this cause collisions? Is the method correct, and can I go ahead with it?
Thanks in advance.
Yes, this will cause collisions.
e.g.
0000-0001-0010-1011 -> 1101011
0000-0000-0110-1011 -> 1101011
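A quick sketch that reproduces this collision (the two values above, written without their leading zero digits); this is my own rendering of the proposed scheme, not the asker's code:

public class CollisionDemo {
    // The proposed scheme: minimal-length binary for each hex digit, concatenated.
    static String customEncode(String hex) {
        StringBuilder sb = new StringBuilder();
        for (char c : hex.toCharArray()) {
            int digit = Character.digit(c, 16);
            sb.append(Integer.toBinaryString(digit)); // drops each digit's leading zeros
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(customEncode("12B")); // 1101011
        System.out.println(customEncode("6B"));  // 1101011 - a different value, same output
    }
}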
Some time ago I spent a couple of days debugging problems with UUID collisions (the UUIDs were trimmed); debugging these things is a nightmare. You won't have a good time.
What you need is to implement your own unique identifier scheme; depending on your use case, developing such a scheme could be either very easy or very hard. You could, for example, assign each machine a unique number (let's say two bytes), and each machine would then assign IDs serially from a 4-byte namespace. In 6 bytes you have a nice UUID-like scheme (with some constraints).
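If you go that route, here is a rough sketch of such a scheme. The 2-byte machine id and 4-byte serial part follow the suggestion above, but the class, field names, and widths are illustrative only:

public class IdGenerator {
    private final long machineId; // assumed to fit in 2 bytes (0..65535), assigned per machine
    private long counter;         // serial part, allowed 4 bytes here (0..2^32 - 1)

    IdGenerator(int machineId) {
        this.machineId = machineId & 0xFFFFL;
    }

    // Packs the machine id into the top 16 bits and the serial counter into the low 32 bits,
    // giving a 48-bit (6-byte) identifier that fits comfortably in a Java long.
    synchronized long nextId() {
        return (machineId << 32) | (counter++ & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        IdGenerator gen = new IdGenerator(23);
        System.out.println(gen.nextId()); // 98784247808 (23 * 2^32 + 0)
        System.out.println(gen.nextId()); // 98784247809
    }
}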
