I'm new to Java and I'm trying to understand a Huffman coding program. The program takes the contents of any file and encodes it according to a Huffman coding scheme. There is one small part of the code, in the main file, that I don't understand:
an array with a size of 256.
The comment says we will assume that all our characters will have codes less than 256, for simplicity.
Why is 256 the simple choice? What happens if I increase or decrease the size of the array?
For some sizes, I get an out-of-bounds error.
Can someone explain why?
Thanks.
import java.io.File;
import java.io.FileNotFoundException;
import java.math.BigInteger;
import java.util.Scanner;

public class Main {

    public static void main(String[] args) throws FileNotFoundException {
        String a = "test-short.txt";
        @SuppressWarnings("resource")
        final long startTime = System.currentTimeMillis();
        String content = new Scanner(new File(a)).useDelimiter("\\Z").next();
        HuffmanCode newCode = new HuffmanCode();
        // we will assume that all our characters will have
        // code less than 256, for simplicity
        int[] charFreqs = new int[256];
        // read each character and record the frequencies
        for (char loop : content.toCharArray()) {
            charFreqs[loop]++;
        }
        // Build tree
        // Parse the int array of frequencies to HuffmanTree
        HuffmanTree tree = newCode.createTree(charFreqs);
        // print out results
        System.out.println("Char\tFreq\tHUFF CODE");
        newCode.printResults(tree, new StringBuffer());
        newCode.findHeight(tree);
        printRequiredResults(content, newCode.realcode, newCode.height, newCode.numberOfNode, newCode.printAverageDepth());
        final long endTime = System.currentTimeMillis();
        System.out.println("Total execution time: " + (endTime - startTime));
    }

    public static void printRequiredResults(String content, String compressedCode, int heightOfTree, int huffTreeTotalNode, float avrTreeDepth) {
        int textFileLength = (content.length() * 3);
        int textFileCompressed = compressedCode.length();
        float compressionRatio = ((float) textFileLength / textFileCompressed);
        System.out.println("Uncompressed file size: " + textFileLength);
        System.out.println("Compressed file size: " + textFileCompressed);
        System.out.printf("Compression ratio: %.6f%n", compressionRatio);
        System.out.println("Huffman tree height: " + heightOfTree);
        System.out.println("Huffman tree number of nodes: " + huffTreeTotalNode);
        System.out.printf("Huffman tree average depth: %.6f%n", avrTreeDepth);
    }
}
Java chars (which you get when you call toCharArray()) have a size of 16 bits, ranging (when interpreted as an unsigned type) from 0 to 65535:
For char, from '\u0000' to '\uffff' inclusive, that is, from 0 to 65535
This model is based on the original Unicode specification.
Theoretically, in order to represent all possible character frequencies, your array would require a size of 65536 (last index + 1).
However, most charsets (put simply: the mapping that says which code represents which character and vice versa) are built upon ASCII, which only uses 7 bits per character. Therefore, if you only use ASCII characters (e.g. digits, some punctuation, whitespace and English letters), you can safely assume that all codes will be in the range from 0 to 255 (which is 8 bits, so one bit more than ASCII needs). And further: if you only use characters from the ASCII table, you could even reduce your array size to 128, which is still enough to hold all required frequencies, since every printable ASCII character has a code below 128:
int[] freq = new int[128]; // enough for ASCII characters
freq['A'] = 10; // okay: ASCII character
freq['Ä'] = 10; // not okay, will throw an ArrayIndexOutOfBoundsException
// as Ä is not an ASCII character
int[] freq = new int[256]; // enough for ASCII characters plus all
// 8-bit wide characters
freq['A'] = 10; // okay: ASCII character
freq['Ä'] = 10; // okay: Ä has the code point 196 (U+00C4), which is not ASCII
                // but our array is large enough to hold it
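If you don't want to worry about the array size at all, the simplest fix is to size the counter array to the full Java char range; it costs 65536 ints but can never go out of bounds, whatever file you feed it. A minimal sketch against the question's code (HuffmanCode/HuffmanTree stay as they are):

int[] charFreqs = new int[Character.MAX_VALUE + 1]; // 65536 counters, one per possible char value
for (char c : content.toCharArray()) {
    charFreqs[c]++; // can never be out of bounds now
}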
The Unicode value of the UCS-4 character '🤣' is 0001f923, and it gets automatically changed to the corresponding value \uD83E\uDD23 when it is copied into Java code in IntelliJ IDEA.
Java's char type only supports UCS-2, so a transformation from UCS-4 to UCS-2 (surrogate pairs) takes place.
I want to know the logic of this transformation, but I haven't found any material about it.
https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF
U+010000 to U+10FFFF
0x10000 is subtracted from the code point (U), leaving a 20-bit number (U') in the range 0x00000–0xFFFFF. U is defined to be no greater than 0x10FFFF.
The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.
The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF.
Now with input code point \U1F923:
\U1F923 - \U10000 = \UF923
\UF923 = 1111100100100011 = 00001111100100100011 = [0000111110][0100100011] = [\U3E][\U123]
\UD800 + \U3E = \UD83E
\UDC00 + \U123 = \UDD23
The result: \UD83E\UDD23
Programming:
public static void main(String[] args) {
    int input = 0x1f923;                   // code point U+1F923
    int x = input - 0x10000;               // subtract 0x10000, leaving a 20-bit value
    int highTenBits = x >> 10;             // top 10 bits
    int lowTenBits = x & ((1 << 10) - 1);  // bottom 10 bits
    int high = highTenBits + 0xd800;       // high surrogate
    int low = lowTenBits + 0xdc00;         // low surrogate
    System.out.println(String.format("[%x][%x]", high, low)); // prints [d83e][dd23]
}
Though String stores Unicode text as a char array, where each char is a 16-bit UTF-16 code unit, there is also support for UCS-4.
UCS4: UTF-32, "code points":
Unicode code points, UCS4, are represented in java as int.
int[] ucs4 = new int[] {0x0001_f923};
String s = new String(ucs4, 0, ucs4.length);
ucs4 = s.codePoints().toArray();
There are encodings (transformations) of code points to UTF-16 and UTF-8 which require longer sequences of 2-byte or 1-byte units respectively.
These encodings are chosen such that the extra units can never be confused with an ordinary single-unit value: in UTF-8 the bytes of a multi-byte sequence all have their high bit set (continuation bytes start with the bits 10), and in UTF-16 the surrogate units fall in a reserved range (0xD800–0xDFFF). That means such a sequence will not erroneously match "/" or anything else in a string search; the remaining bits hold the bits of the code point, most significant first.
Rather than searching for UCS-4 and UCS-2, a search for UTF-16 will yield information on the algorithms used.
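For completeness, the standard library will do this surrogate-pair conversion for you; a small sketch using only java.lang:

public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x1F923;

        // Character.toChars encodes a code point as its UTF-16 char sequence:
        // two chars (a surrogate pair) for supplementary code points like this one.
        char[] units = Character.toChars(codePoint);
        System.out.printf("[%x][%x]%n", (int) units[0], (int) units[1]); // [d83e][dd23]

        // And back again: read the code point out of the surrogate pair.
        String s = new String(units);
        System.out.printf("%x%n", s.codePointAt(0)); // 1f923
    }
}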
I am trying to figure out how to convert hex into a string and an integer so I can manipulate an RGB light on my Arduino microcontroller through its serial port. I found a good example on the Java website, but I'm having a difficult time understanding some of the methods and I am getting hung up. I could easily just copy-paste this code and have it work, but I want to fully understand it. I will add comments with my understanding and hopefully someone can provide some feedback.
public class HexToDecimalExample3 {
    public static int getDecimal(String hex) { //this is the function which we will call later and they are declaring string hex here. Can we declare string hex inside the scope..?
        String digits = "0123456789ABCDEF"; //declaring string "digits" with all possible inputs in linear order for later indexing
        hex = hex.toUpperCase(); //converting string to uppercase, just "in case"
        int val = 0; //declaring int val. I don't get this part.
        for (int i = 0; i < hex.length(); i++) //hex.length is how long the string is I think, so we don't finish the loop until all letters in string is done. pls validate this
        {
            char c = hex.charAt(i); //char is completely new to me. Are we taking the characters from the string 'hex' and making an indexed array of a sort? It seems similar to indexOf but non-linear? help me understand this..
            int d = digits.indexOf(c); //indexing linearly where 0=1 and A=11 and storing to an integer variable
            val = 16*val + d; //How do we multiply 16(bits) by val=0 to get a converted value? I do not get this..
        }
        return val;
    }

    public static void main(String args[]) {
        System.out.println("Decimal of a is: " + getDecimal("a")); //printing the conversions out.
        System.out.println("Decimal of f is: " + getDecimal("f"));
        System.out.println("Decimal of 121 is: " + getDecimal("121"));
    }
}
To summarize the comments: it's primarily the char c = hex.charAt(i); and the val = 16*val + d; parts I don't understand.
Ok, let's go line for line
public static int getDecimal(String hex)
hex is the parameter, it needs to be declared there, so you can pass a String when you call the function.
String digits = "0123456789ABCDEF";
Yes, this declares a string with all characters which can occur in a hexadecimal number.
hex = hex.toUpperCase();
It converts the letters in the hex-String to upper case, so that it is consistent, i.e. you always have F and never f, no matter which is being input.
int val = 0;
This is the variable that will later hold the corresponding decimal value. We will do our calculations with this variable.
for (int i = 0; i < hex.length(); i++)
hex.length() is the number of characters in the hex-String provided. We execute the code inside this for loop once per character.
char c = hex.charAt(i);
Yes, char represents a single character. We retrieve the character from the hex-String at index i, so in the first iteration it is the first character, in the second iteration the second character and so on.
int d = digits.indexOf(c);
We look up which index the character has in the digits-String. In that way we determine the decimal representation of this specific digit: 0-9 stay 0-9 and F becomes 15, for example.
val = 16*val + d;
Let's think about what we have to do. We have the decimal value of the digit. But in hexadecimal this digit sits at a specific position, which determines what it gets multiplied by. The '1' in '100' is actually not worth 1, but 1 * 100, because of the position it is in.
10 in hexadecimal is 16 in decimal, because we have 1 * 16. Now the approach here is a little bit tricky. val is not uninitialized: it is 0 at the beginning and then holds the accumulated value from the previous iterations. Since the first character in the String is the highest position, we don't know directly what we have to multiply it by, because we don't know yet how many digits the number has (actually we do, but this approach doesn't use that). So we just add the digit value to val; in the following iterations it gets multiplied by 16 again and again, which scales it up to the correct positional value. Let me show you an example:
Take 25F as a hex number. The first iteration takes the 2, converts it to a 2, and adds it to val. The 16 * val resolves to 0, so it has no effect the first time.
The next iteration multiplies val (the 2) by 16 and adds the 5 (converted to 5). So now we have (split up mathematically so you can see it):
2 * 16 + 5
Next we get the F which is decimal 15. We multiply val by 16 and add the 15.
We get 2 * 256 + 5 * 16 + 15 (* 1), which is exactly how you calculate the decimal value of this hex value mathematically.
Another possibility to compute val is:
val += Math.pow(16, hex.length() - i - 1) * d;
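For what it's worth, outside of a learning exercise you would normally let the standard library do this conversion:

int value = Integer.parseInt("25F", 16);    // 607
long bigValue = Long.parseLong("25F", 16);  // same idea, for values that don't fit in an int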
I am reading a number from a file, and I want to convert it to an ASCII text string. I've seen questions relating to this, but they all deal with single decimal-equivalent ASCII values, like 12 to ASCII, 87 to ASCII, 112 to ASCII, etc. I need to convert a long run of decimal digits with no spaces into ASCII text. Can anyone show me how this is done? How would the system ascertain whether the first number to translate is 1, 12, or 123?
For example:
int intval = 668976111;
//CONVERSION CODE
System.out.println(newval);
prints "bylo"
If the int were 123, how would it know whether I meant 1,2,3 or 12,3 or 123 or 1,23, etc.? How can I convert decimal numbers like this to Unicode characters?
Try this.
static void decode(String s, int index, byte[] decode, int size) {
    if (index >= s.length())
        System.out.println(new String(decode, 0, size)); // all digits consumed: print one complete decoding
    else
        for (int i = index + 1; i <= s.length(); ++i) {
            int d = Integer.parseInt(s.substring(index, i)); // try the next 1, 2, 3, ... digits as one character code
            if (Character.isISOControl(d)) continue;         // skip unprintable control codes
            if (d > 255) break;                              // anything above 255 can't be a single byte
            decode[size] = (byte) d;
            decode(s, i, decode, size + 1);                  // recurse on the remaining digits
        }
}

static void decode(String s) {
    decode(s, 0, new byte[s.length()], 0);
}
and
decode("668976111"); // -> BYLo
This is a somewhat hard problem, as ASCII codes can be one, two or three digits long.
If you are only encoding alphanumeric characters and characters above decimal code 20, it is pretty easy.
The algorithm would be as follows: iterate through the array (a digit is an element of the array); if the first digit is 1, take three digits, as you can't have a char with a code less than 20; otherwise the first two digits already form a number greater than 20, so take only two digits. (A sketch of this idea follows below.)
This way you will get the right decoding, assuming you don't have anything encoded with a code less than 20, which is a very reasonable assumption, as the first "useful" code is number 32, which is space.
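A minimal sketch of that greedy rule, assuming the digits encode only printable ASCII (32-126), so two-digit codes start with 3-9 and three-digit codes start with 1 (the method name decodeGreedy is mine, not from the answer):

static String decodeGreedy(String digits) {
    StringBuilder out = new StringBuilder();
    int i = 0;
    while (i < digits.length()) {
        // Printable ASCII: 32-126. Three-digit codes (100-126) start with '1',
        // two-digit codes (32-99) start with '3'..'9'.
        int len = (digits.charAt(i) == '1') ? 3 : 2;
        int code = Integer.parseInt(digits.substring(i, i + len));
        out.append((char) code);
        i += len;
    }
    return out.toString();
}

decodeGreedy("668976111"); // -> "BYLo"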
I need to represent both very large and small numbers in the shortest string possible. The numbers are unsigned. I have tried just straight Base64 encode, but for some smaller numbers, the encoded string is longer than just storing the number as a string. What would be the best way to most optimally store a very large or short number in the shortest string possible with it being URL safe?
I have tried just straight Base64 encode, but for some smaller numbers, the encoded string is longer than just storing the number as a string
Base64 encoding of binary byte data will make it longer, by about a third. It is not supposed to make it shorter, but to allow safe transport of binary data in formats that are not binary safe.
However, base 64 is more compact than decimal representation of a number (or of byte data), even if it is less compact than base 256 (the raw byte data). Encoding your numbers in base 64 directly will make them more compact than decimal. This will do it:
private static final String base64Chars =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

static String encodeNumber(long x) {
    char[] buf = new char[11];
    int p = buf.length;
    do {
        buf[--p] = base64Chars.charAt((int) (x % 64));
        x /= 64;
    } while (x != 0);
    return new String(buf, p, buf.length - p);
}

static long decodeNumber(String s) {
    long x = 0;
    for (char c : s.toCharArray()) {
        int charValue = base64Chars.indexOf(c);
        if (charValue == -1) throw new NumberFormatException(s);
        x *= 64;
        x += charValue;
    }
    return x;
}
Using this encoding scheme, Long.MAX_VALUE will be the string H__________, which is 11 characters long, compared to its decimal representation 9223372036854775807 which is 19 characters long. Numbers up to about 16 million will fit in a mere 4 characters. That's about as short as you'll get it. (Technically there are two other characters which do not need to be encoded in URLs: . and ~. You can incorporate those to get base 66, which would be a smidgin shorter for some numbers, although that seems a bit pedantic.)
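A quick round trip with the methods above (the expected outputs follow directly from the scheme just described):

System.out.println(encodeNumber(Long.MAX_VALUE));           // H__________
System.out.println(encodeNumber(16777215L));                // ____  (64^4 - 1, the largest 4-character value)
System.out.println(decodeNumber(encodeNumber(123456789L))); // 123456789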
To extend on Stephen C's answer, here is a piece of code that converts to base 62 (you can increase the base by adding more characters to the digits String; just pick whichever characters are valid for you):
public static String toString(long n) {
    String digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    int base = digits.length();
    String s = "";
    while (n > 0) {
        int d = (int) (n % base);  // current least significant base-62 digit
        s = digits.charAt(d) + s;  // prepend its character
        n = n / base;
    }
    return s;
}
This will never result in the string representation being longer than the decimal one. (Note that for n = 0 the loop never runs and an empty string is returned, so you may want to special-case zero.)
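The answer only shows the encoding direction; a matching decoder could look like this (the name fromString is mine, not part of the original answer):

public static long fromString(String s) {
    String digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    long n = 0;
    for (char c : s.toCharArray()) {
        int d = digits.indexOf(c);
        if (d == -1) throw new NumberFormatException(s);
        n = n * digits.length() + d; // shift the previous digits up one position and add the new one
    }
    return n;
}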
Assuming that you don't do any compression, and that you restrict yourself to URL safe characters, then the following procedure will give you the most compact encoding possible.
Make a list of all URL safe characters
Count them. Suppose you have N.
Represent your number in base N, representing 0 by the first character, 1 by the 2nd and so on.
So, what about compression ...
If you assume that the numbers you are representing are uniformly distributed across their range, then there is no real opportunity for compression.
Otherwise, there is potential for compression. If you can reduce the size of the common numbers then you can typically achieve a saving by compression. This is how Huffman encoding works.
But the downside is that compression at this level is not perfect across the range of numbers. It reduces the size of some numbers, but it inevitably increases the size of others.
So what does this mean for your use-case?
I think it means that you are looking at the problem the wrong way. You should not be aiming for a minimal encoded size for every number. You should be aiming to minimize the size on average ... averaged over the actual distribution of your numbers.
I recall reading about a method for efficiently using random bits in an article on a math-oriented website, but I can't seem to get the right keywords in Google to find it anymore, and it's not in my browser history.
The gist of the problem that was being asked was to take a sequence of random numbers in the domain [domainStart, domainEnd) and efficiently use the bits of the random number sequence to project uniformly into the range [rangeStart, rangeEnd). Both the domain and the range are integers (more correctly, longs and not Z). What's an algorithm to do this?
Implementation-wise, I have a function with this signature:
long doRead(InputStream in, long rangeStart, long rangeEnd);
in is based on a CSPRNG (fed by a hardware RNG, conditioned through SecureRandom) that I am required to use; the return value must be between rangeStart and rangeEnd, but the obvious implementation of this is wasteful:
long doRead(InputStream in, long rangeStart, long rangeEnd) throws IOException {
    long retVal = 0;
    long range = rangeEnd - rangeStart;
    // Fill until we get to range
    for (int i = 0; (1 << (8 * i)) < range; i++) {
        int b = 0;
        do {
            b = in.read();
            // but be sure we don't exceed range
        } while (retVal + (b << (8 * i)) >= range);
        retVal += b << (8 * i);
    }
    return retVal + rangeStart;
}
I believe this is effectively the same idea as (rand() * (max - min)) + min, only we're discarding bits that push us over max. Rather than use a modulo operator which may incorrectly bias the results to the lower values, we discard those bits and try again. Since hitting the CSPRNG may trigger re-seeding (which can block the InputStream), I'd like to avoid wasting random bits. Henry points out that this code biases against 0 and 256; Banthar demonstrates it in an example.
First edit: Henry reminded me that summation invokes the Central Limit Theorem. I've fixed the code above to get around that problem.
Second edit: Mechanical snail suggested that I look at the source for Random.nextInt(). After reading it for a while, I realized that this problem is similar to the base conversion problem. See answer below.
Your algorithm produces biased results. Let's assume rangeStart=0 and rangeEnd=257. If the first byte is greater than 0, that will be the result. If it's 0, the result will be either 0 or 256 with 50/50 probability. So 0 and 256 are only half as likely to be chosen as any other number.
I did a simple test to confirm this:
p(0)=0.001945
p(1)=0.003827
p(2)=0.003818
...
p(254)=0.003941
p(255)=0.003817
p(256)=0.001955
I think you need to do the same as java.util.Random.nextInt and discard the whole number, instead of just the last byte.
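A sketch of that idea, drawing whole candidate values and rejecting the ones that would skew the distribution (the class and method names are mine; it assumes the range fits in 7 bytes to keep the long arithmetic simple):

import java.io.IOException;
import java.io.InputStream;

final class UnbiasedRange {

    // Returns a uniformly distributed value in [rangeStart, rangeEnd).
    // Whole candidate values are rejected and redrawn (as java.util.Random.nextInt
    // does) instead of rejecting individual bytes, which removes the bias
    // against 0 and 256 described above.
    static long doRead(InputStream in, long rangeStart, long rangeEnd) throws IOException {
        long range = rangeEnd - rangeStart;

        // How many whole bytes are needed to cover 'range'?
        int numBytes = 1;
        while (numBytes < 7 && (1L << (8 * numBytes)) < range) {
            numBytes++;
        }
        long numValues = 1L << (8 * numBytes);

        // Accept only candidates below the largest multiple of 'range' that fits,
        // so that every residue class modulo 'range' is equally likely.
        long limit = numValues - (numValues % range);
        long candidate;
        do {
            candidate = 0;
            for (int i = 0; i < numBytes; i++) {
                int b = in.read();
                if (b < 0) {
                    throw new IOException("random stream exhausted");
                }
                candidate = (candidate << 8) | b;
            }
        } while (candidate >= limit);

        return rangeStart + (candidate % range);
    }
}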
After reading the source to Random.nextInt(), I realized that this problem is similar to the base conversion problem.
Rather than converting a single symbol at a time, it would be more effective to convert blocks of input symbols at a time, through an accumulator "buffer" which is large enough to represent at least one symbol in the domain and in the range. The new code looks like this:
public int[] fromStream(InputStream input, int length, int rangeLow, int rangeHigh) throws IOException {
    int[] outputBuffer = new int[length];
    // buffer is initially 0, so there is only 1 possible state it can be in
    int numStates = 1;
    long buffer = 0;
    int alphaLength = rangeHigh - rangeLow;
    // Fill outputBuffer from 0 to length
    for (int i = 0; i < length; i++) {
        // Until buffer has sufficient data filled in from input to emit one symbol in the output alphabet, fill buffer.
        fill:
        while (numStates < alphaLength) {
            // Shift buffer by 8 (*256) to mix in new data (of 8 bits)
            buffer = buffer << 8 | input.read();
            // Multiply by 256, as that's the number of states that we have possibly introduced
            numStates = numStates << 8;
        }
        // spits out least significant symbol in alphaLength
        outputBuffer[i] = (int) (rangeLow + (buffer % alphaLength));
        // We have consumed the least significant portion of the input.
        buffer = buffer / alphaLength;
        // Track the number of states we've introduced into buffer
        numStates = numStates / alphaLength;
    }
    return outputBuffer;
}
There is a fundamental difference between converting numbers between bases and this problem, however; in order to convert between bases, I think one needs to have enough information about the number to perform the calculation - successive divisions by the target base result in remainders which are used to construct the digits in the target alphabet. In this problem, I don't really need to know all that information, as long as I'm not biasing the data, which means I can do what I did in the loop labeled "fill."
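A quick usage sketch of fromStream (the class name RangeConverter and the byte source are my own placeholders; any InputStream of random bytes would do, e.g. one backed by the CSPRNG from the question):

// Assume the fromStream method above lives in a class named RangeConverter.
byte[] randomBytes = new byte[64];
new java.security.SecureRandom().nextBytes(randomBytes);   // stand-in for the hardware-backed CSPRNG

int[] symbols = new RangeConverter().fromStream(
        new java.io.ByteArrayInputStream(randomBytes),
        10,   // number of output symbols wanted
        0,    // rangeLow, inclusive
        6);   // rangeHigh, exclusive -> each symbol is in 0..5
System.out.println(java.util.Arrays.toString(symbols));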