Determine whether a UTF-32 encoded string has unique characters - java

I have a question about the bit-vector approach that is commonly used to determine whether a string has unique characters. I have seen solutions out there (one of them) that work well for the ASCII and UTF-16 character sets.
However, how would the same approach work for UTF-32? The longest contiguous bit vector in Java is a long variable, right? UTF-16 requires 1024 such variables. If we take the same approach, UTF-32 would require 2^26 long variables (I think). Is it possible to solve this for such a big character set using a bit vector?

I think you are missing something important here. UTF-32 is an encoding for Unicode, and Unicode actually fits within a 21-bit code space. As the Unicode FAQ states:
"The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space."
Any UTF-32 "characters" that are outside of the Unicode code space are invalid ... and you should never see them in a UTF-32 encoded String. So 2^15 longs should be enough.
In practice, you are unlikely to see code points outside of the Basic Multilingual Plane (plane 0). So it makes sense to use a bitmap for the BMP (i.e. code points up to 65535) and a sparse data structure (e.g. a HashSet<Integer>) for the other planes.
You could also consider using BitSet instead of "rolling your own" bit-set data structure using long or long[].
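A minimal sketch of that hybrid approach, assuming the input arrives as a Java String (so UTF-16 surrogate pairs are decoded to code points first); the class and method names are my own:

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class UniqueCodePoints {
    // Returns true if every code point in s occurs at most once.
    public static boolean allUnique(String s) {
        BitSet bmp = new BitSet(0x10000);          // one bit per BMP code point
        Set<Integer> supplementary = new HashSet<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);             // decodes surrogate pairs
            i += Character.charCount(cp);          // advances by 1 or 2 chars
            if (cp < 0x10000) {
                if (bmp.get(cp)) return false;
                bmp.set(cp);
            } else if (!supplementary.add(cp)) {
                return false;
            }
        }
        return true;
    }
}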
Finally, I should note that some of the code in the Q&A that you linked to is NOT appropriate for looking for unique characters in UTF-16, for a couple of reasons:
The idea of using N variables of type long and a switch statement does not scale. The code of the switch statement gets large and unmanageable ... and possibly gets bigger than the JVM spec can cope with. (The maximum size of a compiled method is 2^16 - 1 bytes of bytecode, so it clearly isn't viable for implementing a bit-vector for all of the Unicode code space.)
It is a better idea to use an array of long and get rid of the need for a switch, which is only really there because you have N distinct long variables (see the sketch after these points).
In UTF-16, each code unit (16 bit value) encodes either 1 code point (character) or half a code point. If you simply create a bitmap of the code units, you won't detect unique characters properly.
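To make the first point concrete, here is a sketch of the long[] indexing that replaces the switch with plain array arithmetic; the helper name is hypothetical:

// One bit per code point: word index = cp / 64, bit within word = cp % 64.
// The caller sizes the array as (maxCodePoint / 64) + 1.
static boolean testAndSet(long[] bits, int cp) {
    int word = cp >>> 6;                  // cp / 64
    long mask = 1L << (cp & 63);          // cp % 64
    boolean seen = (bits[word] & mask) != 0;
    bits[word] |= mask;                   // mark the code point as seen
    return seen;
}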

Well, a long contains 64 bits of information, and the set of Unicode code points that UTF-32 encodes contains roughly 2^21 elements, which would require 2^21 bits of information. You would be right that it would require 2^26 long variables if UTF-32 used all 32 bits. However, as it is, you only require 2^15 long variables (still a lot).
If you assume that the characters are evenly distributed over that space, this inefficiency is unavoidable, and the best solution would be to use something else, such as a Set<Integer>. However, English plaintext tends to have the majority of its characters in the ASCII range (0-127), and most Western languages are fairly constrained to a specific range, so you could use a bit vector for the high-frequency regions and a Set or another order-independent data structure with an efficient contains operation for the rest.
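A minimal sketch of that hybrid, keeping code points 0-127 in two longs and spilling everything else into a HashSet; the method name is my own:

import java.util.HashSet;
import java.util.Set;

public static boolean allUniqueAsciiFast(String s) {
    long low = 0, high = 0;                  // bits for code points 0-63 and 64-127
    Set<Integer> rest = new HashSet<>();     // sparse fallback for everything else
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        i += Character.charCount(cp);
        if (cp < 128) {
            long mask = 1L << (cp & 63);     // bit position within one long
            if (cp < 64) {
                if ((low & mask) != 0) return false;
                low |= mask;
            } else {
                if ((high & mask) != 0) return false;
                high |= mask;
            }
        } else if (!rest.add(cp)) {
            return false;
        }
    }
    return true;
}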

Related

How is a variable actually stored in memory for Java?

I'm learning about Text I/O and Binary I/O in Java right now. I read that each value you write to a file is initially stored in binary. For text I/O, the individual digits are converted to their corresponding Unicode values and then encoded in the file-specific encoding, such as ASCII. For binary I/O, the binary value is represented directly in the file. For example, 199 would be represented as 0xC7, which in binary is 11000111. Now I'm confused about one part. If a variable is initially stored in a binary format, does each digit represent a separate byte, or is the entire number stored as a single byte? For example, is 199 originally stored as 0xC7, which would be 11000111 in binary? Or would it be stored in 3 bytes, with each byte representing the binary value of one digit? If it is stored in 3 separate bytes, does binary I/O convert that 3-byte number to a single byte? If it is stored in a single byte, how does text I/O translate that single byte into 3 separate byte values? I'm just confused about how to word this. Hope you can understand what I'm getting at. Thanks
The only things a computer is capable of dealing with are sets of 0/1 bits, which are stored in memory or, if you wish, on a storage device. Those bits can be streamed to monitors and converted to characters by graphics hardware. Same story with keyboards: you type a key and a few bits of data are sent to the computer.
Bits are stored in memory and are accessible by memory addresses. The addresses are also sets of bits.
For practical reasons the bits are grouped into bytes, words, long words, ... A byte used to be the smallest addressable unit of bits and historically ended up as a group of 8 bits, which is what most current hardware uses. Modern memory can store data in multi-byte addressable chunks. The same goes for the disk: you store data there using specific addressing mechanisms. But in any case those are just sets of bits.
What you are confused about is the interpretation of those bits. They can represent integer numbers, floating point numbers, characters, addresses, ... The way they are interpreted only depends on the program which uses them.
Characters do not exist in the computer. They are just an abstraction provided by programming languages. Programs interpret the bits stored on the computer. There are standards. For example, the ASCII encoding maps English characters plus a few special characters to the numbers 0 to 127. Those fit into a single byte (leaving the numbers 128 to 255 for special use). A print command will read those bytes one by one and send them to the graphics hardware to form letters on the screen, as specified in the encoding standard. A different encoding scheme will display the same bytes differently.
If you write a program with the "hello world" string in it, the program will convert the symbols between the quotes into a set of 11 ASCII bytes. (In C it will add yet another byte, equal to 0, which terminates the string.) Unicode is yet another way to represent characters; a Unicode character may be represented by multiple bytes of data. There are other schemes as well. One thing to pay attention to: if you write strings to disk using a certain encoding, you should read them back with the same encoding, or your prints will give you garbage. But you can always read and copy them as binary data, without interpretation.
So, any variable of any type is just an abstraction and always consists of bytes of data which your program knows how to interpret based on the data type and/or the operations it wants to perform. Variables of type int, double, or any Java object, including String, are just sets of bytes of different sizes. Only the program (and the Java interpreter is a program) knows what to do with them: use them in calculations or display them as characters.
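To make that concrete for the 199 example from the question, here is a small sketch that writes the same value both ways; the stream setup is illustrative:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class TextVsBinary {
    public static void main(String[] args) throws IOException {
        // Text I/O: each digit becomes its own character byte.
        ByteArrayOutputStream text = new ByteArrayOutputStream();
        try (OutputStreamWriter w = new OutputStreamWriter(text, StandardCharsets.US_ASCII)) {
            w.write(Integer.toString(199));   // '1','9','9' -> 0x31 0x39 0x39
        }
        System.out.println(text.size());      // prints 3

        // Binary I/O: the value is written as one raw byte.
        ByteArrayOutputStream bin = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bin)) {
            out.writeByte(199);               // 0xC7
        }
        System.out.println(bin.size());       // prints 1
    }
}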

How can I directly work with bits in Clojure?

I began working through the first problem set over at https://cryptopals.com the other day. I'm trying to learn Clojure simultaneously, so I figured I'd implement all of the exercises in Clojure. These exercises are for learning purposes of course, but I'm going out of my way to not use any libraries besides clojure.core and the Java standard library.
The first exercise asks you to write code that takes in a string encoded in hexadecimal and spits out a string encoded in base64. The algorithm for doing this is fairly straightforward:
Get the byte associated with each couplet of hex digits (for example, the hex 49 becomes 01001001).
Once all bytes for the hex string have been retrieved, turn the list of bytes into a sequence of individual bits.
For every 6 bits, return a base64 character (they're all represented as units of 6 bits).
I'm having trouble actually representing and working with bits and bytes in Clojure (operating on raw bytes is one of the requirements of the exercise). I know I can call byte-array on the initial hex values and get back an array of bytes, but how do I access the raw bits so that I can translate from a series of bytes into a base64 encoded string?
Any help or direction would be greatly appreciated.
Always keep a browser tab open to the Clojure CheatSheet.
For detailed bit work, you want functions like bit-and, bit-test, etc.
If you are just parsing a hex string, see java.math.BigInteger with the radix option: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/math/BigInteger.html#%3Cinit%3E(java.lang.String,int)
java.lang.Long/parseLong(string, radix) is also useful.
For the base64 part, you may be interested in the tupelo.base64 functions. That library is really all you need to convert a string of hex into a base-64 string, although it may not count for your homework!
Please note that Java includes base-64 functions:
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Base64.html
Remember, also, that you can get ideas by looking at the source code for both Clojure & the Tupelo lib.
And also, keep in mind that one of Clojure's super-powers is the ability to write low-level or performance-critical code in native Java and then link all the *.clj and *.java files together into one program (you can use Leiningen to compile & link everything in one step).
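Putting those pieces together (in Java here; the same classes are reachable from Clojure via interop), a sketch of the hex-to-base64 pipeline might look like this; the helper names are my own:

import java.util.Base64;

public class HexToBase64 {
    // Parse each couplet of hex digits into one byte.
    static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // 0x49 is 'I'; its base64 encoding is "SQ==".
        System.out.println(Base64.getEncoder().encodeToString(hexToBytes("49")));
    }
}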

Why do bools take less memory in an array?

On the Oracle website, it says that bools take 32 bits on the stack, but 8 in an array. I'm having trouble understanding why they would take less in a group than as singles. How are they stored, and what difference does it make? If arrays of bools are more efficient, why has that technique not been transferred over to singles?
Also, why not 1 bit?
And what is the difference between how a 64-bit system and a 32-bit system stores these?
Thanks!
A boolean value can be stored as a single binary digit, but our computers group values as a convenience. The smallest unit practically dealt with is a byte; the next largest is a word. A byte is, in modern hardware, always 8 bits, and 32 bits has emerged as the standard for a word. Even our 64-bit computers can deal in 32-bit words effectively. It is much more convenient to store a bool in whatever unit comes naturally than as a single bit. In an array, the natural unit is a byte, since you can address any byte in memory. On the stack, which is a word stack, the natural unit is a word. You could pack bools into bytes and words and pull them out again bit by bit, literally, but that is less efficient than giving each its own byte or word. Modern memories are large, so CPU speed is more of a concern: you wouldn't want to waste the time it takes to pack bits in compactly, so we waste memory instead, since it is more expendable.
Look, when it comes to stack, one must keep in mind that speed is the most important thing. For example, consider the following:
void method(int foo, boolean bar, String name) ....
Then the stack just after entering the method looks like this:
|-other variables-|-...-|-name-|-bar-|-foo-|---- return address etc. --
^
stack pointer
These are all quantities on a word boundary, symbolized by |. Sure, the JVM could (theoretically, but see below) store the boolean in a single byte. But one must keep in mind that 32-bit loads may be slower when they don't address word boundaries. Depending on the architecture, it may be impossible to dereference a pointer that does not live on a word boundary. Or it may be impossible to use the quantity in a floating point instruction, etc. etc.
In addition, the byte code format can only address the n-th word on the stack. If this were not so, addresses relative to the stack pointer would have to be specified in bytes and this would mean that almost any stack access would have two bits that are irrelevant most of the time, as the majority of arguments will be words (int, float or reference) or double words (long, double).
What is never possible is to use 1 single bit for booleans. Why? Because bits are not directly addressable. The smallest addressable unit is the byte.
You can still store 32 booleans in an int if you feel that you should save on memory.
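As a sketch of that manual packing, with a hypothetical class name:

public class PackedFlags {
    private int bits;                      // 32 flags, one per bit

    void set(int i, boolean value) {
        if (value) bits |= (1 << i);       // turn bit i on
        else       bits &= ~(1 << i);      // turn bit i off
    }

    boolean get(int i) {
        return (bits & (1 << i)) != 0;     // test bit i
    }
}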
Because of the way CPUs work, all operations are done in 32 bits. If you have a single bool, the only realistic thing a compiler can do is zero out the remaining 24 bits and save that to the stack, since it's not practical to scan your Java file for other bools and store them all in the same 32-bit memory block.
If you have an array of bools, it's simple to just reference them in blocks of 4, so it's only 8 bits per bool.
Note that this only applies to 32 bit applications/machines.

What are the implications of using base 40 to encode a String?

I've seen it suggested that base-40 encoding can be used to compress Strings (in Java, to send to a Redis instance, FWIW), and a quick test shows it to be more efficient for some of the data I'm using than an alternative I'm considering, Smaz.
Is there any reason to prefer base 32 or 64 encoding over 40? Any disadvantage? Is encoding like this potentially lossless?
Base 40 provides letters (probably lower case, unless your application tends to use upper case most of the time) and digits for 36 of the codes, and then four more for punctuation and shifts. You can make it lossless by making one of those remaining codes an escape, so that the next one or two characters represent a byte not among the other 39. Also a good approach is to have a shift-lock character that toggles between upper and lower case, if you tend to have strings of upper-case characters.
40 is a convenient base, since three base-40 digits fit nicely in two bytes. 40^3 (64000) is a smidge less than 2^16 (65536).
What you should use depends on the statistics of your data.
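As a sketch of why that packing works, three base-40 digits go into and out of a 16-bit value like this; the class and method names are my own:

public class Base40 {
    // Packs three digits (each 0-39) into one value in 0..63999.
    static int pack(int d0, int d1, int d2) {
        return (d0 * 40 + d1) * 40 + d2;   // fits in 16 bits since 64000 < 65536
    }

    // Recovers the three digits from a packed value.
    static int[] unpack(int packed) {
        int d2 = packed % 40;
        int d1 = (packed / 40) % 40;
        int d0 = packed / 1600;            // 1600 = 40 * 40
        return new int[] { d0, d1, d2 };
    }
}

Inputs whose length is not a multiple of three need a padding convention, which this sketch leaves as a design choice.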

How to parse byte[] (including BCD coded values) to Object in Java

I'd like to know if there is a simple way to "cast" a byte array containing a data structure of a known layout to an Object. The byte[] consists of BCD-packed values, 1- or 2-byte integer values, and character values. I'm obtaining the byte[] by reading a file with a FileInputStream.
People who've worked on IBM-Mainframe systems will know what I mean right away - the problem is I have to do the same in Java.
Any suggestions welcome.
No, because the object layout can vary depending on what VM you're using, what architecture the code is running on etc.
Relying on an in-memory representation has always felt brittle to me...
I suggest you look at DataInputStream - that will be the simplest way to parse your data, I suspect.
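For instance, here is a sketch of reading a fixed-layout record with DataInputStream; the field sizes and file name are hypothetical, so adjust them to the actual record layout:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class RecordReader {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream("record.dat"))) {
            int oneByteField = in.readUnsignedByte();   // 1-byte integer value
            int twoByteField = in.readUnsignedShort();  // 2-byte integer value (big-endian)
            byte[] bcdField = new byte[4];
            in.readFully(bcdField);                     // packed BCD, decoded separately
            System.out.println(oneByteField + " " + twoByteField);
        }
    }
}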
Not immediately, but you can write one pretty easily if you know exactly what the bytes represent.
To convert a BCD-packed number you need to extract the two digits encoded in each byte. The four lower bits encode the lower digit, which you get by ANDing with 15 (1111 binary). The four upper bits encode the higher digit, which you get by shifting right 4 bits and ANDing with 15.
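In code, that nibble extraction looks like this sketch (it ignores the sign nibble that some mainframe packed-decimal formats append):

// Decodes a packed-BCD byte[] into a long, two digits per byte.
static long bcdToLong(byte[] bcd) {
    long value = 0;
    for (byte b : bcd) {
        value = value * 10 + ((b >> 4) & 0x0F);  // upper nibble: higher digit
        value = value * 10 + (b & 0x0F);         // lower nibble: lower digit
    }
    return value;
}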
Also note that IBM most likely has tooling available if this is what you are actually doing. For the IBM i, look for jt400, the IBM Toolbox for Java.
