How can I directly work with bits in Clojure? - java

I began working through the first problem set over at https://cryptopals.com the other day. I'm trying to learn Clojure simultaneously, so I figured I'd implement all of the exercises in Clojure. These exercises are for learning purposes of course, but I'm going out of my way to not use any libraries besides clojure.core and the Java standard library.
The first exercise asks you to write code that takes in a string encoded in hexadecimal and spits out a string encoded in base64. The algorithm for doing this is fairly straightforward:
Get the byte associated with each couplet of hex digits (for example, the hex 49 becomes 01001001).
Once all bytes for the hex string have been retrieved, turn the list of bytes into a sequence of individual bits.
For every 6 bits, return a base64 character (they're all represented as units of 6 bits).
I'm having trouble actually representing and working with bits and bytes in Clojure (operating on raw bytes is one of the requirements of the exercise). I know I can call byte-array on the initial hex values and get back an array of bytes, but how do I access the raw bits so that I can translate from a series of bytes into a base64-encoded string?
Any help or direction would be greatly appreciated.

Always keep a browser tab open to the Clojure CheatSheet.
For detailed bit work, you want functions like bit-and, bit-test, etc.
If you are just parsing a hex string, see java.lang.BigInteger with the radix option: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/math/BigInteger.html#%3Cinit%3E(java.lang.String,int)
java.lang.Long/parseLong(string, radix) is also useful.
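For instance, here is a minimal Java sketch of that radix-based parsing, taking each two-digit couplet with Integer.parseInt and a radix of 16 (the helper name and sample string are just illustrations):

public class HexParse {
    public static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            // e.g. the couplet "49" becomes the byte 0x49, i.e. 01001001
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        for (byte b : hexToBytes("49276d")) {
            // prints e.g. 01001001 for 0x49
            System.out.println(String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0'));
        }
    }
}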
For the base64 part, you may be interested in the tupelo.base64 functions. These library functions are all you really need to convert a hex string into a base-64 string, although using them may not count for your homework!
Please note that Java includes base-64 functions:
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Base64.html
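For instance, once you have the raw bytes, the standard encoder does the rest; a minimal sketch (the sample bytes here are simply the ones for hex "49276d"), though using it skips the bit-twiddling the exercise wants you to practise:

import java.util.Base64;

public class HexToBase64 {
    public static void main(String[] args) {
        byte[] bytes = {0x49, 0x27, 0x6d};  // the bytes parsed from the hex string "49276d"
        System.out.println(Base64.getEncoder().encodeToString(bytes));  // prints "SSdt"
    }
}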
Remember, also, that you can get ideas by looking at the source code for both Clojure & the Tupelo lib.
And also, keep in mind that one of Clojure's super-powers is the ability to write low-level or performance-critical code in native Java and then link all the *.clj and *.java files together into one program (you can use Leiningen to compile & link everything in one step).

Related

How to write ints with specific amount of bits in Java

So I want to write integers to a file with, for example, 10 bits each, in little-endian format. They also shouldn't be aligned to byte boundaries.
The following image may help you understand the structure.
I looked at ByteBuffer (I'm coding in Java) but it doesn't seem to do this.
This is not possible by default. Java doesn't have a bit type, so the closest you are going to get is Byte or Boolean. You can make a util class with 10 booleans (as bits) in whatever order you would like, but other than that, Java does not provide this functionality.
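That said, a small utility along those lines is easy to write. Here is a minimal, unofficial sketch that packs fixed-width values (10 bits each) into a byte[] with no byte alignment, least significant bit first; the class name and the LSB-first layout are assumptions, since the exact bit order depends on your file format:

public final class BitPacker {
    public static byte[] pack(int[] values, int bitsPerValue) {
        byte[] out = new byte[(values.length * bitsPerValue + 7) / 8];
        int bitPos = 0;  // absolute bit position in the output
        for (int v : values) {
            for (int i = 0; i < bitsPerValue; i++) {
                if (((v >> i) & 1) != 0) {  // copy bit i of the current value
                    out[bitPos / 8] |= (byte) (1 << (bitPos % 8));
                }
                bitPos++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two 10-bit values (0x3FF and 0x001) occupy 20 bits, i.e. 3 bytes.
        byte[] packed = pack(new int[] {0x3FF, 0x001}, 10);
        for (byte b : packed) System.out.printf("%02x ", b);  // prints: ff 07 00
    }
}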

Determine whether a UTF-32 encoded string has unique characters

I have a question about using the bit-vector approach that is commonly used to determine whether a string has unique characters. I have seen solutions out there (one of them) that work well for the ASCII and UTF-16 character sets.
However, how will the same approach work for UTF-32? The longest contiguous bit vector in Java can be a long variable, right? UTF-16 requires 1024 such variables. If we take the same approach, UTF-32 will require 2^26 long variables (I think). Is it possible to solve this for such a big character set using a bit vector?
I think you are missing something important here. UTF-32 is an encoding for Unicode. Unicode actually fits within a 21 bit space. As the Unicode FAQ states:
"The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space."
Any UTF-32 "characters" that are outside of the Unicode code space are invalid ... and you should never see them in a UTF-32 encoded String. So 2^15 longs should be enough.
In practice, you are unlikely to see code points outside of the Basic Multilingual Plane (plane 0). So it makes sense to use a bitmap for the BMP (i.e. codes up to 65535) and a sparse data structure (e.g. a HashSet<Integer>) for the other planes.
You could also consider using BitSet instead of "rolling your own" bit-set data structure using long or long[].
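As a rough, illustrative sketch of that hybrid (the names are mine, not from the linked Q&A): a long[] bitmap covers plane 0 and a HashSet handles everything above it, iterating by code point so surrogate pairs are handled for you:

import java.util.HashSet;
import java.util.Set;

public final class UniqueCodePoints {
    public static boolean allUnique(String s) {
        long[] bmp = new long[65536 / 64];          // 1024 longs cover the BMP
        Set<Integer> supplementary = new HashSet<>();
        for (int cp : s.codePoints().toArray()) {   // codePoints() pairs surrogates for us
            if (cp < 0x10000) {
                long mask = 1L << (cp & 63);
                if ((bmp[cp >>> 6] & mask) != 0) return false;  // seen this code point before
                bmp[cp >>> 6] |= mask;
            } else if (!supplementary.add(cp)) {
                return false;                        // add() returns false on a duplicate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allUnique("abc\uD83D\uDE00"));  // true: a, b, c, U+1F600
        System.out.println(allUnique("abca"));             // false: 'a' repeats
    }
}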
Finally, I should note that some of the code in the Q&A that you linked to is NOT appropriate for looking for unique characters in UTF-16, for a couple of reasons:
The idea of using N variables of type long and a switch statement does not scale. The code of the switch statement gets large and unmanageable ... and possibly gets bigger than the JVM spec can cope with. (The maximum size of a compiled method is 2^16 - 1 bytes of bytecode, so it clearly isn't viable for implementing a bit-vector for all of the Unicode code space.)
It is a better idea to use an array of long and get rid of the need for a switch ... which is only really there because you have N distinct long variables.
In UTF-16, each code unit (16 bit value) encodes either 1 code point (character) or half a code point. If you simply create a bitmap of the code units, you won't detect unique characters properly.
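A two-line illustration of that pitfall (the string here is just an example): a supplementary character is one code point but two UTF-16 code units, so a bitmap keyed by code units would record it as two unrelated entries:

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00";                            // U+1F600, stored as a surrogate pair
        System.out.println(s.length());                       // 2 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point
    }
}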
Well, a long contains 64 bits of information, and the set of UTF-32 characters contains approximately 2^21 elements, which would require 2^21 bits of information. You would be right that it would require 2^26 long variables if the UTF-32 dataset used all 32 bits. However, as it is, you only require 2^15 long variables (still a lot).
If you assume that the characters are evenly distributed over the dataset, this inefficiency is unavoidable and the best solution would be to use something else, like a Set<Long>. However, English plaintext tends to have the majority of its characters in the ASCII range (0-127), and most Western languages are fairly constrained to a specific range, so you could use a bit vector for the high-frequency regions and a Set or another order-independent data structure with efficient contains checks for the remaining regions.

What integer format when reading binary data from Java DataOutputStream in PHP?

I'm aware that this is probably not the best idea but I've been playing around trying to read a file in PHP that was encoded using Java's DataOutputStream.
Specifically, in Java I use:
dataOutputStream.writeInt(number);
Then in PHP I read the file using:
$data = fread($handle, 4);
$number = unpack('N', $data);
The strange thing is that the only format character in PHP that gives the correct value is 'N', which is supposed to represent "unsigned long (always 32 bit, big endian byte order)". I thought that int in Java was always signed?
Is it possible to reliably read data encoded in Java in this way or not? In this case the integer will only ever need to be positive. It may also need to be quite large so writeShort() is not possible. Otherwise of course I could use XML or JSON or something.
This is fine, as long as you don't need that extra bit. 'l' (instead of 'N') would work on a big-endian machine.
Note, however, that the maximum number that you can store is 2,147,483,647 unless you want to do some math on the Java side to get the proper negative integer to represent the desired unsigned integer.
Note that a signed Java integer uses the two's complement method to represent a negative number, so it's not as easy as flipping a bit.
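As a hedged sketch of that "math on the Java side" (the value and stream here are just an example): keep the unsigned quantity in a long, cast it to int for writeInt, and the bit pattern PHP reads with 'N' comes back as the intended unsigned number:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class UnsignedWriteDemo {
    public static void main(String[] args) throws IOException {
        long unsignedValue = 3_000_000_000L;                 // larger than Integer.MAX_VALUE
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt((int) unsignedValue);                   // four bytes, high byte first
        out.close();
        for (byte b : bos.toByteArray()) System.out.printf("%02x ", b);  // b2 d0 5e 00
        System.out.println();
        int signed = (int) unsignedValue;                     // -1294967296 as a signed int
        System.out.println(signed & 0xFFFFFFFFL);             // 3000000000 again
    }
}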
DataOutputStream.writeInt:
Writes an int to the underlying output stream as four bytes, high byte first.
The formats available for the unpack function for signed integers all use machine dependent byte order. My guess is that your machine uses a different byte order than Java. If that is true, the DataOutputStream + unpack combination will not work for any signed primitive.

Problems with reproducing the same HMAC MD5 on Java and C

I am currently stumped trying to recreate, in C, an HMAC-MD5 hash generated by a Java program. Any help, suggestions, corrections and recommendations would be greatly appreciated.
The Java program creates the HMAC-MD5 string (encoded as a base-16 hex string, 32 characters long) using UTF-16LE and MAC; what I need is to recreate the same result in a C program.
I am using the RSA source for MD5 and the HMAC-MD5 code is from RFC 2104 (http://www.koders.com/c/fidBA892645B9DFAD21A2B5ED526824968A1204C781.aspx)
I have "simulated" UTF16LE on the C implementation by padding every even byte with 0s. The Hex/Int representation seem to be consistent on both ends when I do this; but is this the correct way to do this? I figured this would be the best way because the HMAC-MD5 function call only allows for a byte array (no such thing as a double byte array call in the RFC2104 implementation but that's irrelevant).
When I run the string to be HMAC'd through - you naturally get "garbage". Now my problem is that not even the "garbage" is consistent across the systems (excluding the fact that perhaps the base 16 encoding could be inconsistent). What I mean by this is "�����ԙ���," might be the result from Java HMAC-MD5 but C might give "v ����?��!��{� " (Just an example, not actual data).
I have 2 things I would like to confirm:
Did padding every even byte with 0 mess up the HMAC-MD5 algorithms? (either because it would come across a null immediately after the first byte or whatever)
Is the fact that I see different "garbage" because C and Java are using different character encodings? (same machine running Ubuntu)
I am going to read through the HMAC-MD5 and MD5 code to see how they treat the byte array going in (whether or not the null even bytes is causing a problem). I am also having a hard time writing a proper encoding function on the C side to convert the resultant string into a 32 character hex string. Any input/help would be greatly appreciated.
Update (Feb 3rd): Would passing a signed vs. unsigned byte array alter the output of HMAC-MD5? The Java implementation takes a byte array (which is SIGNED), but the C implementation takes an UNSIGNED byte array. I think this might also be a factor in producing different results. If this does affect the final output, what can I really do? Would I pass a SIGNED byte array in C (the method takes an unsigned byte array), or would I cast the SIGNED byte array as unsigned?
Thanks!
Clement
The problem is probably due to your naive creation of the UTF-16 string. Any character greater than 0x7F (see the Unicode explanation) needs to be expanded properly into the UTF-16 encoding scheme; simply zero-padding the bytes only works for ASCII characters.
I would work on first getting the same byte string between the C and Java implementation as that is probably where your problem lies -- so I would agree with your assumption (1)
Have you tried to calculate the MD5 without padding the C string, but rather just converting it to UTF-16? You can use iconv to experiment with the encoding.
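For reference, a minimal Java-side sketch of the kind of computation being matched (the key and message are placeholders, since the original code isn't shown): HMAC-MD5 over the UTF-16LE bytes of the input, hex-encoded to a 32-character string:

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class HmacMd5Demo {
    public static void main(String[] args) throws Exception {
        byte[] key = "secret".getBytes(StandardCharsets.UTF_8);        // placeholder key
        byte[] message = "hello".getBytes(StandardCharsets.UTF_16LE);  // placeholder message

        Mac mac = Mac.getInstance("HmacMD5");
        mac.init(new SecretKeySpec(key, "HmacMD5"));
        byte[] digest = mac.doFinal(message);                 // 16 bytes

        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b & 0xFF));  // mask avoids sign extension
        System.out.println(hex);                               // 32 hex characters
    }
}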
The problem was that I used the RSA implementation. After I switched to OpenSSL, all my problems were resolved. The RSA implementation did not take into consideration all the necessary details of cross-platform support (including 32-bit/64-bit processors).
Always use OpenSSL, because they have already resolved all the cross-platform issues.

How to parse byte[] (including BCD coded values) to Object in Java

I'd like to know if there is a simple way to "cast" a byte array containing a data structure of a known layout to an Object. The byte[] consists of BCD-packed values, 1- or 2-byte integer values and character values. I'm obtaining the byte[] by reading a file with a FileInputStream.
People who've worked on IBM-Mainframe systems will know what I mean right away - the problem is I have to do the same in Java.
Any suggestions welcome.
No, because the object layout can vary depending on what VM you're using, what architecture the code is running on etc.
Relying on an in-memory representation has always felt brittle to me...
I suggest you look at DataInputStream - that will be the simplest way to parse your data, I suspect.
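For instance, a minimal sketch (the file name and field layout below are invented purely for illustration; the real layout comes from your record description):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class RecordReader {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream("record.dat"))) {
            int twoByteField = in.readUnsignedShort();  // big-endian, 0..65535
            int oneByteField = in.readUnsignedByte();
            byte[] bcdField = new byte[3];
            in.readFully(bcdField);                     // raw bytes, to be BCD-decoded separately
            System.out.println(twoByteField + " " + oneByteField);
        }
    }
}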
Not immediately, but you can write one pretty easily if you know exactly what the bytes represent.
To convert a BCD-packed number you need to extract the two digits encoded in each byte. The four lower bits encode the lowest digit, and you get that by &'ing with 15 (1111 binary). The four upper bits encode the highest digit, which you get by shifting right 4 bits and &'ing with 15.
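Put together, a minimal sketch of that unpacking (the method name is mine; mainframe packed decimal often also carries a sign nibble, which this ignores):

public final class Bcd {
    // Decodes packed BCD bytes, e.g. {0x12, 0x34} -> 1234.
    public static long decode(byte[] packed) {
        long value = 0;
        for (byte b : packed) {
            int high = (b >> 4) & 15;  // upper four bits: the higher digit
            int low = b & 15;          // lower four bits: the lower digit
            value = value * 100 + high * 10 + low;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(decode(new byte[] {0x12, 0x34}));  // 1234
    }
}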
Also note that IBM most likely has tooling available if this is what you are actually doing. For IBM i, look for jt400, the IBM Toolbox for Java.
