I am trying to write a BWT + Huffman compression program in Java. In the BWT stage I want to implement Distance Coding (DC). I am looking for some examples, but there aren't many of them.
I found this example:
http://www.cs.ucr.edu/~stelo/cpm/cpm07/move_to_front_gagie.pdf
DC starts on page 29, but it is really hard to understand because there are no comments.
Maybe someone has implemented DC, or knows the theory well enough to explain how to implement it in real code? :)
I understood the part where, first of all, you need to write what the character was, but I didn't get the distance part.
I read that for each character, DC finds its next occurrence in the sequence S and outputs the distance to it; if there is no further occurrence, it writes 0.
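Is this roughly what is meant? Here is a naive sketch of my current understanding in Java (just the distance part; I've left out the initial "what char was" header and any efficiency concerns):

// Naive sketch: for each position, emit the distance to the next
// occurrence of the same byte, or 0 if that byte never occurs again.
static int[] distanceEncode(byte[] input) {
    int[] output = new int[input.length];
    for (int i = 0; i < input.length; i++) {
        int distance = 0;                         // 0 means "no further occurrence"
        for (int j = i + 1; j < input.length; j++) {
            if (input[j] == input[i]) {
                distance = j - i;                 // distance to the next occurrence
                break;
            }
        }
        output[i] = distance;
    }
    return output;
}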
Thanks.
I have written an implementation in Java:
http://code.google.com/p/kanzi/source/browse/java/src/kanzi/function/DistanceCodec.java
You can see the explanation of the algorithm at the beginning of the code (complete example).
Also, take a look at the Block coder (it includes BWT + MoveToFront + Zero Length Transform + Entropy coding):
http://code.google.com/p/kanzi/source/browse/java/src/kanzi/function/BlockCodec.java
I have tried to use Distance Coding instead of Move To Front. The output is smaller (better compression) with DC compared to MTFT. However, after entropy coding, the result is the opposite: MTFT looks more amenable to entropy compression.
I have an array list with some names inside it (first and last names). What I have to do is go through each "first name" and see how many times a character (which the user specifies) shows up at the end of every first name in the array list, and then print out the number of times that character showed up.
public int countFirstName(char c) {
    int i = 0;
    for (Name n : list) {
        if (n.getFirstName().length() - 1 == c) {
            i++;
        }
    }
    return i;
}
That is the code I have. The problem is that the counter (i) doesn't add 1 even if there is a character that matches the end of the first name.
You're comparing the index of the last character in the string to the required character, instead of the last character itself, which you can access with charAt:
String firstName = n.getFirstName();
if (firstName.charAt(firstName.length() - 1) == c) {
    i++;
}
When you're setting out to learn to code, there is great value in using pencil and paper, or describing your algorithm ahead of time in the language you think in. Most people who learn a foreign language start out by assembling a sentence in their native language, translating it to the foreign language, then speaking it. Few, if any, learners of a foreign language are able to think in it natively from the start.
Coding is no different; all your life you've been speaking English and thinking in it. Now you're aiming to learn a different pattern of thinking, a new syntax and new keywords. This task will go a lot more easily if you:
work out in high level natural language what you want to do first
write down the steps in clear and simple language, like a recipe
don't try to do too much at once
Had I been a tutor marking your program, I'd have been looking for something like this:
//method to count the number of list entries ending with a particular character
public int countFirstNamesEndingWith(char lookFor) {
    //declare a variable to hold the count
    int cnt = 0;
    //iterate the list
    for (Name n : list) {
        //get the first name
        String fn = n.getFirstName();
        //get the last char of it
        char lc = fn.charAt(fn.length() - 1);
        //compare
        if (lc == lookFor) {
            cnt++;
        }
    }
    return cnt;
}
Taking the bullet points in turn:
The comments serve as a high-level description of what must be done. We write them all first, before writing a single line of code. My course penalised uncommented code, and writing the comments first was a handy way of getting that requirement out of the way (they're a chore, right? Not always, but..). More importantly, it is really easy to write a logical algorithm in high-level language and then translate the steps into the language you're learning. I definitely think that if you'd taken this approach you wouldn't have made the error you did, as it would have been clear that the code you wrote didn't implement the algorithm you'd described earlier.
Don't try to do too much in one line. Yes, I'm sure plenty of coders think it looks cool, or tricky, or shows off their impressive coding smarts to pack a good 10-line algorithm into a single line of code that uses some obscure language feature. But one day it's highly likely that someone else will have to come along to maintain that code, improve it or change part of what it does - at that moment it's no longer cool, and it was never really a smart thing to do.
Aominee, in their comment, actually gives us something like an example of this:
return (int) list.stream().filter(e -> e.charAt(e.length() - 1) == c).count();
It's a one-line implementation of a solution to your problem. Cool, huh? Well, it has a bug* (for a start), but that's not the main thrust of my argument. At a more basic level: have you got any idea what it's doing? Can you look at it and in 2 seconds tell me how it works?
It's quite an advanced language feature, and it's tricky for sure, but it might be a very poor solution because it's hard to understand, hard to maintain as a result, and does a lot while looking like a little - it only really makes sense if you're well versed in the language. This one line bundles up a facility that loops over your list with a tiny sub-method that is called for every item in the list, whose job is to work out whether the name ends with the sought character.
It's a brilliant feature and a cute example, and it surely has its place in production Java, but its place is probably not here, in your learning exercise.
Similarly, I'd go as far as to say that this line of yours:
if (n.getFirstName().length() - 1 == c) {
is approaching "doing too much" - I say this because it's where your logic broke down; you didn't write enough code to correctly implement the algorithm. You'd actually have to write even more code to do it this way:
if (n.getFirstName().charAt(n.getFirstName().length() - 1) == c) {
This is a right eyeful to load into your brain and understand. The accepted answer broke it down a bit by first getting the name into a temporary variable. That's a sensible optimisation. I broke it out another step by getting the last char into a temp variable. In a production system I probably wouldn't go that far, but this is your learning phase - try to minimise the number of operations each of your lines does. It will aid your understanding of your own code a great deal
If you do ever get a penchant for writing as much code as possible in as few characters, look at some of the code golf games here on the Stack Exchange network; the game is to abuse as many language features as possible to make really short, tricky code... pretty much every winner stands as a testament to condensed code that should never, ever be put into a production system maintained by normal coders who value their sanity.
*the bug is that it doesn't get the first name out of the Name object
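For completeness, a fixed version of that one-liner (assuming the same Name.getFirstName() accessor used in the question) would look something like this:

// same idea, but actually pulling the first name out of the Name object first
return (int) list.stream()
                 .filter(n -> {
                     String fn = n.getFirstName();
                     return !fn.isEmpty() && fn.charAt(fn.length() - 1) == c;
                 })
                 .count();

Notice that even the fixed version needs a little block body before it reads clearly - which rather supports the point above.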
So I have an odd little problem with the hashing function in PHP. It only happens some of the time, which is what is confusing me. Essentially, I have a Java app and a PHP page, both of which calculate the SHA-256 of the same string. There haven't been any issues between the two, as they calculate the same hash (generally). The one exception is that every once in a while, PHP's output is one character longer than Java's.
I have this code in PHP:
$token = $_GET["token"];
$token = hash("sha256", $token."<salt>");
echo "Your token is " . $token;
99% of the time, I get the right hash. But every once in a while, I get something like this (space added to show the difference):
26be60ec9a36f217df83834939cbefa33ac798776977c1970f6c38ba1cf92e92 # PHP
26be60ec9a36f217df83834939cbefa33ac798776977c197 f6c38ba1cf92e92 # Java
As you can see, they're nearly identical. But the top one (computed by PHP) has one more 0 for some reason. I haven't really noticed a rhyme or reason to it, but it's certainly stumped me. I've tried thinking of things like the wrong encoding, or wrong return value, but none of them really explain why they're almost identical except for that one character.
Any help on this issue would be much appreciated.
EDIT: The space is only in the bottom one to highlight where the extra 0 is. The actual hash has no space, and is indeed a valid hash, as it's the same one that Java produces.
EDIT2: Sorry about that. I checked the lengths with Notepad++, and since it's different than my normal text editor, I misread the length by 1. So yes, the top one is indeed right. Which means that it's a bug in my Java code. I'm going to explore Ignacio's answer and get back to you.
The top hash is the correct length; the bottom hash is short because the hexadecimal values were not zero-filled on output (note that the missing digit is the most significant nibble of a byte). So it's a bug in the Java program, unrelated to the hash algorithm. For example, in Python:
>>> '%04x %02x%02x %x%x' % (0x1201, 0x12, 0x01, 0x12, 0x01)
'1201 1201 121'
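For what it's worth, this is how that bug typically creeps into Java code when the digest bytes are converted to hex by hand - a sketch of the likely cause, not the asker's actual code:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

static String sha256Hex(String token) throws Exception {
    byte[] digest = MessageDigest.getInstance("SHA-256")
                                 .digest((token + "<salt>").getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
        // BUG: Integer.toHexString drops the leading zero when the high nibble
        // of the byte is 0, so 0x0f becomes "f" instead of "0f"
        hex.append(Integer.toHexString(b & 0xff));
        // Fix: always emit two hex digits per byte, e.g.
        // hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString();
}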
Actually it's the SECOND hash which seems to have an incorrect length (63). Could it be that it is generated by assembling two different tokens, and maybe the last one - which should be 16 characters - gets the initial zero removed?
I have a file which contains the result of two XORed plaintext files. How do I attack this file in order to decrypt either of the plaintext files? I have searched quite a bit, but could not find any answers. Thanks!
EDIT:
Well, I also have the two ciphertexts which I XORed to get the XOR of the two plaintexts. The reason I ask this question is that, according to Bruce Schneier (pg. 198, Applied Cryptography, 1996), "...she can XOR them together and get two plaintext messages XORed with each other. This is easy to break, and then she can XOR one of the plaintexts with the ciphertext to get the keystream." (This is in relation to a simple stream cipher.) But beyond that he provided no explanation, which is why I asked here. Forgive my ignorance.
Also, the algorithm used is a simple one, and a symmetric key is used whose length is 3.
FURTHER EDIT:
I forgot to add: I'm assuming that a simple stream cipher was used for encryption.
I'm no cryptanalyst, but if you know something about the characteristics of the files you might have a chance.
For example, let's assume that you know that both original plaintexts:
contain plain ASCII English text
are articles about sports (or whatever)
Given those 2 pieces of information, one approach you might take is to scan through the ciphertext 'decrypting' using words that you might expect to be in them, such as "football", "player", "score", etc. Perform the decryption using "football" at position 0 of the ciphertext, then at position 1, then 2 and so on.
If the result of decrypting a sequence of bytes appears to be a word or word fragment, then you have a good chance that you've found plaintext from both files. That may give you a clue as to some surrounding plaintext, and you can see if that results in a sensible decryption. And so on.
Repeat this process with other words/phrases/fragments that you might expect to be in the plaintexts.
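A very rough sketch of that guessing loop in Java (purely illustrative - the crib word, the printable-ASCII test and all names here are my own choices):

// xored = plaintextA XOR plaintextB; crib = a word we guess appears in one of the plaintexts
static void dragCrib(byte[] xored, byte[] crib) {
    for (int pos = 0; pos + crib.length <= xored.length; pos++) {
        byte[] candidate = new byte[crib.length];
        boolean printable = true;
        for (int i = 0; i < crib.length; i++) {
            // if the crib really sits at this offset in one plaintext, this
            // recovers the corresponding fragment of the other plaintext
            candidate[i] = (byte) (xored[pos + i] ^ crib[i]);
            if (candidate[i] < 0x20 || candidate[i] > 0x7e) {
                printable = false;
                break;
            }
        }
        if (printable) {
            System.out.println(pos + ": " + new String(candidate, java.nio.charset.StandardCharsets.US_ASCII));
        }
    }
}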
In response to your question's edit: what Schneier is talking about is that if someone has 2 ciphertexts that have been XOR encrypted using the same key, XORing those ciphertexts will 'cancel out' the keystream, since:
(A ^ k) - ciphertext of A
(B ^ k) - ciphertext of B
(A ^ k) ^ (B ^ k) - the two ciphertexts XOR'ed together which simplifies to:
A ^ B ^ k ^ k - which continues to simplify to
A ^ B ^ 0
A ^ B
So now, the attacker has a new ciphertext that's composed only of the two plaintexts. If the attacker knows one of the plaintexts (say the attacker has legitimate access to A, but not B), that can be used to recover the other plaintext:
A ^ (A ^ B)
(A ^ A) ^ B
0 ^ B
B
Now the attacker has the plaintext for B.
It's actually worse than this - if the attacker has A and the ciphertext for A then he can recover the keystream already.
But, the guessing approach I gave above is a variant of the above with the attacker using (hopefully good) guesses instead of a known plaintext. Obviously it's not as easy, but it's the same concept, and it can be done without starting with known plaintext. Now the attacker has a ciphertext that 'tells' him when he's correctly guessed some plaintext (because it results in other plaintext from the decryption). So even if the key used in the original XOR operation is random gibberish, an attacker can use the file that has that random gibberish 'removed' to gain information when he's making educated guesses.
You need to take advantage of the fact that both files are plain text. There are a lot of implications that can be derived from that fact. Assuming that both are English texts, you can use the fact that some letters are much more common than others. See this article.
Another hint is to note the structure of correct English text. For example, every time one sentence ends and the next begins, there is a (dot, space, capital letter) sequence.
Note that in ASCII, space is binary "0010 0000", and flipping that bit in a letter changes its case (lower to upper and vice versa). There will be a lot of XORing with spaces if both files are plain text, right?
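A tiny illustration of that case-flipping property:

// XORing a letter with the ASCII space character (0x20) flips its case
System.out.println((char) ('a' ^ ' '));   // prints A
System.out.println((char) ('A' ^ ' '));   // prints a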
Analyse the printable characters table on this page.
Also, at the end, you can use a spell checker.
I know I didn't provide a solution for your question.
I just gave you some hints. Have fun, and please share your findings.
It's really an interesting task.
That is interesting. The Schneier book does indeed say that it is easy to break this. And then he kind of leaves it hanging at that. I guess you have to leave some exercises up to the reader!
There is an article by Dawson and Nielsen that apparently describes an automated process for this task for text files. It's a bit on the $$ side to buy the single article. However, a second paper, titled A Natural Language Approach to Automated Cryptanalysis of Two-time Pads, references the Dawson and Nielsen work and describes some assumptions they made (primarily that the text was limited to 27 characters). This second paper appears to be freely available and describes their own system. I don't know for sure that it is free, but it is openly available on a Johns Hopkins University server.
That paper is about 10 pages long and looks interesting. I don't have time to read it at the moment but may later. I find it interesting (and telling) that it takes a 10 page paper to describe a task that another cryptographer describes as "easy".
I don't think you can - not unless you know something about the structure of the two files.
Unless you have one of the plaintext files, you can't get the original information of the other. Mathematically expressed:
p1 XOR p2 = result
You have one equation with two unknowns; you can't possibly get something meaningful out of it.
Suppose you need to perform some kind of comparison between 2 files. You only want to do it when it makes sense; in other words, you wouldn't want to compare a JSON file with a properties file, or a .txt file with a .jar file.
Additionally suppose that you have a mechanism in place to sort all of these things out and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
The task is to pick the closest name for the later comparison. Unfortunately, an identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to _myFile.txt than somethingReallyClever.txt is. Fantastic. But it also says that it is only about 2 times closer, whereas in reality we can look at the two names and would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest I apply to not only figure out likelihood from characters being in the same place, but also to test whether the resulting weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0
I hope I am being clear.
Please share your experience and thoughts on this.
(whatever approach you suggest should not depend on the number of characters the filename consists of)
Possibly helpful previous question which highlights several possible algorithms:
Word comparison algorithm
These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
Certainly any sensible metric here should have a low score as meaning close (think distance between the two strings) and larger scores as meaning not so close.
Sounds like you want the Levenshtein distance, perhaps modified by preconverting both words to the same case and normalising separators (e.g. replacing all spaces and underscores with the empty string).
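A minimal sketch of that approach (textbook dynamic-programming Levenshtein distance, with the case and underscore/space normalisation applied first; the method name is just mine):

static int editDistance(String a, String b) {
    // normalise: same case, drop separators that carry no real meaning here
    a = a.toLowerCase().replaceAll("[\\s_]", "");
    b = b.toLowerCase().replaceAll("[\\s_]", "");

    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;

    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                        d[i][j - 1] + 1),     // insertion
                               d[i - 1][j - 1] + cost);       // substitution
        }
    }
    return d[a.length()][b.length()];
}

With that normalisation, _myFile.txt and _m_y_f_i_l_e.txt end up at distance 0, while somethingReallyClever.txt stays far away - which matches the intuition in the question.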
Can anyone tell me the best way to decode binary data with variable length bit strings in java?
For example:
The binary data is 10101000 11100010 01100001 01010111 01110001 01010110
I might need to find the first match of any of the following 01, 100, 110, 1110, 1010...
In this case the match would be 1010. I then need to do the same for the remainder of the binary data. The bit strings can be up to 16 bits long and cross the byte boundaries.
Basically, I'm trying to Huffman-decode JPEGs using the bit strings I created from the Huffman tables in the headers. I can do it, only it's very messy: I'm turning everything, binary data included, into StringBuffers first, and I know that isn't the right way.
Before I loaded everything into string buffers I tried using just numbers in binary, but of course I can't ignore the leading 0s in a code like 00011. I'm sure there must be some clever way of doing this using bitwise operators and the like, but I've been staring at pages explaining bit masks and left shifts etc. and I still don't have a clue!
Thanks a lot for any help!
EDIT:
Thanks for all the suggestions. I've gone with the binary tree approach as it seems to be the standard way to do Huffman decoding. Makes sense really, as Huffman codes are created using trees. I'll also look into storing the binary data I need to search in a BigInteger. Don't know how to mark multiple answers as correct, but thanks all the same.
You might use a state machine consuming zeros and ones. The state machine would have final states for all the patterns that you want to detect. Whenever it enters one of the final states, it sends a message to you with the matched pattern and goes back to the initial state.
Finally you would have only one state machine, in the form of a DAG, which contains all your patterns.
To implement it use the state pattern (http://en.wikipedia.org/wiki/State_pattern) or any other implementation of a state machine.
Since you are decoding Huffman encoded-data, you should create a binary tree, where leaves hold the decoded bit string as data, and the bits of each Huffman code are the path to the corresponding data. The bits of the Huffman code are accessed with bit-shift and bit-mask operations. When you get to a leaf, you output the data at that leaf and go back to the root of the tree. It's very fast and efficient.
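A bare-bones sketch of that idea (the node and method names are made up for illustration, not taken from any particular JPEG library):

import java.util.ArrayList;
import java.util.List;

class HuffNode {
    HuffNode zero, one;   // children: a 0 bit goes one way, a 1 bit the other
    Integer value;        // non-null only at leaves (the decoded symbol)
}

// insert one code (e.g. "1010" -> value) read from the Huffman table into the tree
static void insert(HuffNode root, String bits, int value) {
    HuffNode node = root;
    for (char bit : bits.toCharArray()) {
        if (bit == '0') {
            if (node.zero == null) node.zero = new HuffNode();
            node = node.zero;
        } else {
            if (node.one == null) node.one = new HuffNode();
            node = node.one;
        }
    }
    node.value = value;
}

// walk the compressed data bit by bit; every time a leaf is reached,
// emit its value and restart from the root
static List<Integer> decode(HuffNode root, byte[] data, int totalBits) {
    List<Integer> out = new ArrayList<>();
    HuffNode node = root;
    for (int i = 0; i < totalBits; i++) {
        int bit = (data[i / 8] >> (7 - (i % 8))) & 1;   // bits taken MSB-first, crossing byte boundaries
        node = (bit == 0) ? node.zero : node.one;
        if (node.value != null) {
            out.add(node.value);
            node = root;
        }
    }
    return out;
}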
You could try stuffing the data into a BigInteger and then using its shift and test methods. Then use a loop to walk the bits and accept each sub-pattern.
If the Huffman codes are in a tree: 1 == right node, 0 == left node.
// walk from the most significant bit down; note that testBit returns a boolean
for (int i = numbitsTotal - 1; i >= 0; --i)
{
    boolean bit = bigInt.testBit(i);
    if (bit)
    {
        // take right node -- if it is null, accept the code found so far and start again from the top
    }
    else
    {
        // take left node -- if it is null, accept the code found so far and start again from the top
    }
}
I would suggest a trie. It is explicitly designed for prefix searching. In your case, it would be a binary trie.
You could use a java.util.BitSet to store your binary data and then you can implement some search functions to find the position of a smaller BitSet inside the big one...