How to find the encryption key of a shift cipher? - java

So I got an assignment where I need to decrypt a string that was encrypted with a certain key in a shift cipher. My idea was to find the most common letter in the string and derive the key from it, because the most common letter in English is E. Once I have the key I can decrypt the text (basically, subtract the key from each letter in the string that isn't a space). Now I've run into a few difficulties:
I want my program to work for every key, so I don't know whether the most common letter comes after or before E (in the ASCII chart). So I can't just subtract E and get the answer, and I can't figure out what math I need to make it work.
Also, once I've found the key, I don't know how to make, for example, A go back to Y, i.e. wrap around the circle (but I think I may know how, with the % operator).
Anyway, any help is much appreciated. I'm not allowed to use really advanced commands. Also, all letters in the String are caps and the cipher is a simple shift cipher.

I'm going to assume that you are trying to solve a Caesar (or shift) cipher, for simplicity's sake. The principles involved should be applicable to other ciphers as well, though.
You indicated that you wanted to find the most common character. This doesn't really help you in all cases, because that letter can map to any of the other letters. However, you could do a letter frequency attack using this... It is only really effective with very long strings, however...
Probably the easiest way to solve such a problem is brute-forcing the solution. Since there are only a handful of possible keys, brute force can be very effective. There are 26 letters in the alphabet, so the shift is in the range 0-25. A shift of 0 leaves the text unchanged, which means there are only 25 candidate strings to check.
Finding these strings is relatively trivial. Use a loop in the range 1-25. Now, just convert each character in your string to a number 0-25 [A=0, B=1 ... Z=25] and add the shift [1-25]. (char_value + shift) % 26 will give you your new character value. (You can also use ASCII character values, but I use these for the sake of understanding.)
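The loop described above could be sketched roughly like this. The class name, method name and sample ciphertext are made up for illustration, and the sketch assumes the text contains only capital letters A-Z plus characters (such as spaces) that should be left alone:

```java
class ShiftBruteForce {
    // Shifts every letter A-Z forward by `shift` positions, wrapping past Z.
    // Any other character (e.g. a space) is copied unchanged.
    static String shift(String text, int shift) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                int value = c - 'A';                       // A=0 ... Z=25
                out.append((char) ('A' + (value + shift) % 26));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cipher = "KHOOR ZRUOG";   // hypothetical sample input
        // Print all 25 candidates; a human (or a dictionary check) picks the readable one.
        for (int s = 1; s <= 25; s++) {
            System.out.println(s + ": " + shift(cipher, s));
        }
    }
}
```

Because the value is always in 0-25 and the shift is positive, the % operator never sees a negative number here, which avoids Java's negative-remainder behaviour.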
Less trivial, however, is determining which string is most likely the correct one. In my opinion, the best way to do this is to use a dictionary of common words - you can read more about this type of attack here: http://en.wikipedia.org/wiki/Known-plaintext_attack. With your dictionary, you just look for the string with the highest number of known words - although it can get more complex than this. Chances are that this will be your solution.
This will work for all cases.
If you only want to look at problems where the most common letter is shifted to E, the problem becomes significantly simpler. However, in real examples, the most common letter in the string will not necessarily be E, so this strategy is less than optimal.
Find the most common letter by looping through the string, keeping a running tally of the occurrences of each letter. You can do this in a number of ways. You could use an array of ints, a map, or pretty much anything...
From here, determine the size of the shift from this letter to E. For example, if S is the most common, the shift is 12: E has a value of 4, S has a value of 18, and the shift is (26-18)+4 = 12. This generalizes to every letter using the remainder operator, i.e. ((26-o)+4) % 26, where o is the value of the most frequent letter. So if the most frequent letter is A=0, (26+4) % 26 = 4, which is the correct shift.
Now, you can shift all the characters as explained above.
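Put together, the frequency approach might look like the following. This is only a sketch under the assumption that the most frequent ciphertext letter really stands for E and that all letters are capitals; the names FrequencyKey, shiftToE and decrypt are invented here:

```java
class FrequencyKey {
    // Returns the shift that moves the most frequent letter of `cipher` onto 'E'.
    static int shiftToE(String cipher) {
        int[] counts = new int[26];                 // running tally, index 0 = 'A'
        for (char c : cipher.toCharArray()) {
            if (c >= 'A' && c <= 'Z') counts[c - 'A']++;
        }
        int most = 0;
        for (int i = 1; i < 26; i++) {
            if (counts[i] > counts[most]) most = i; // ties keep the earlier letter
        }
        return ((26 - most) + 4) % 26;              // 4 is the value of 'E'
    }

    // Applies the shift to every capital letter, leaving other characters alone.
    static String decrypt(String cipher) {
        int shift = shiftToE(cipher);
        StringBuilder out = new StringBuilder();
        for (char c : cipher.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                out.append((char) ('A' + (c - 'A' + shift) % 26));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

On short inputs the assumption often fails, so it is worth eyeballing the result before trusting it.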

Related

Can anybody explain how to determine if the string has all unique characters without any additional data structures? [closed]

I am not familiar with any ASCII rules or ASCII representations of strings. I have looked at the book Cracking the Coding Interview but cannot understand why a string can have only 256 possible character values. How is that possible? Can anybody explain it and help solve the problem with the simplest explanation possible?
Here is the question:
Implement an algorithm to determine if a string has all unique
characters. What if you cannot use additional data structures?
Thanks in advance!
There is absolutely no need to limit yourself to 256 unique characters to solve this problem.
Here is a trivial algorithm:
1. First, consider the string not so much a java.lang.String, but instead, an array of characters.
2. Sort this array of characters, in place. This takes 0 additional space, and O(n log n) time.
3. Loop through the char array, front to back, starting at index 1 (the second character). For each index, check if the char you find there is equal to the char you find at the previous index. If the answer is ever yes, return immediately, and answer false. If you get to the end without a hit, return true (the string consisted of unique characters).
The runtime characteristic is O(n log n), requiring no additional space. Though you did mangle the input.
Now, in java this is a tad tricky; java.lang.String instances are immutable. You cannot modify them, therefore step 2 (sort-in-place) is not possible. You'd have to make a char[] copy via yourString.toCharArray() and then you can write this algorithm. That's an intentional restriction of string, not a fundamental limitation.
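A minimal sketch of the sort-then-compare idea, using the char[] copy mentioned above (so strictly speaking it does allocate one array; the class and method names are just placeholders):

```java
import java.util.Arrays;

class UniqueBySorting {
    // Returns true if no char value occurs twice in s.
    static boolean allUnique(String s) {
        char[] chars = s.toCharArray();   // copy, because String is immutable
        Arrays.sort(chars);               // duplicates end up next to each other
        for (int i = 1; i < chars.length; i++) {
            if (chars[i] == chars[i - 1]) {
                return false;             // same char as the previous position
            }
        }
        return true;
    }
}
```

Note that this compares Java char values (UTF-16 code units), which is fine for the question as posed but is not the same thing as comparing full Unicode code points.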
But, if you want to go with the rule that the input also cannot be modified in any way, and you can't make any new data structures, that's still possible, and still has absolutely no requirement that 'strings can only select from a plane of 256 characters'. It's just going to be far, far slower:
1. Loop through every character. i = position of the character.
2. Loop through every further character (from i+1 to the end).
3. Compare the characters at the two positions. If equal, return false.
4. If you get to the end, return true.
The runtime characteristic is O(n^2) (icky), requiring no additional space, and it does not modify any data in place.
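The nested-loop variant is short enough to write out in full. Again a sketch with made-up names; it only reads the string through charAt and creates nothing beyond two loop counters:

```java
class UniqueByComparing {
    // Returns true if no char value occurs twice in s; O(n^2) comparisons.
    static boolean allUnique(String s) {
        for (int i = 0; i < s.length(); i++) {
            for (int j = i + 1; j < s.length(); j++) {
                if (s.charAt(i) == s.charAt(j)) {
                    return false;     // found the same char at two positions
                }
            }
        }
        return true;
    }
}
```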
The 256 thing just doesn't factor into any of this.
However, the thing is, a lot of code and examples incorrectly conflate the idea of 'a sequence of bytes' and 'a string' (as in, a sequence of characters), treating them all 'just as a bag o numbers'. At that point, if you have unicode characters, or charset encoding factors into the equation, all sorts of complications occur.
Properly written code knows that chars are chars, bytes are bytes, chars are never bytes and bytes are never chars, and every single last time you go from one to the other, you always, always, always specify which encoding, explicitly. If you do that, you'll never have any issues. But I guess your prof didn't want you to worry about it? I dunno - dumb restriction, not material whatsoever to the question.
It is because extended ASCII tables use 8 bits per character, so there are at most 2^8 = 256 possible characters. (ASCII proper is a 7-bit code with 128 characters.)

Getting/Applying capitalization mask before/after encoding?

My project takes a String s and passes an all lower case version s.toLowerCase() to a lossless encoder.
I can encode/decode the lower case string just fine, but this obviously would not be practical, so I need to be able to preserve the original String's capitalization somehow.
I was thinking of using Character.isUpperCase() to get an array of integers UpperCaseLetters[] that represents the locations of all capital letters in s. I would then use this array to place a ^ at all locations UpperCaseLetters[i] + 1 in the encoded string. When decoding the string, I would know that every character preceding a ^ is capital. (By the way, the encoder will never generate ^ when encoding.)
This method seems sloppy to me though. I was also thinking of using bit strings to represent capitalization, but the overall goal of the application is compression, so that would not be very efficient.
Is there any easier way to get and apply capitalization masks for strings? If there is, how much "storage" would it need?
Your options:
Auto-capitalize:
Use a general algorithm for capitalization, and use one of the techniques below to record only the letters that differ between the generated and the actual capitalization. To regenerate, just run the algorithm again and flip the capitalization of all the recorded letters. Assuming there are capital letters where you'd expect them (e.g. at the start of sentences), this will slow the algorithm down slightly (only by a small constant factor of n, and decent compression is generally much slower than that) and reduce the amount of storage space required, since fewer positions need to be recorded.
Bitmap of capital positions:
You've already covered this one, not particularly efficient.
Prefix capitals with identifying character:
Also already covered, except that you described postfix, but prefix is generally better and, for a more generic solution, you can also escape the ^ with ^^. Not a bad idea. Depending on the compression, it might be a good idea to instead use a letter that already appears in the dataset. Either the most or least common letter, or you may have to look at the compression algorithm and do quite a bit of processing to determine the ideal letter to use.
Store distance of capital from start in any format:
Has no advantage over distance to next capital (below).
Distance to next capital - non-bitstring representation:
Generally less efficient than using bitstrings.
Bit string = distance to next capital:
You have a sequence of lengths, each indicating, in sequence, the distances between capitals. So if we have distances 0,3,1,0,5 capitalization would be as follows: AbcdEfGHijklmNo (skip 0 characters to the first, 3 characters to the second, 1 character to the third, etc.). There are some options available to store this:
Fixed length: Not a good idea since it needs to be large enough for the longest possible distance. An obvious alternative is having some sort of overflow into the next length, but this still uses too much space.
Fixed length, different settings: Best explained with an example - the first 2 bits indicate the length: 00 means there are 2 bits following to indicate the distance, 01 means 4 bits, 10 means 8 bits, 11 means 16 bits. If there's a chance of more than 16 bits, you may want to do something like - 110 means 16 bits, 1110 means 32 bits, 11110 means 64 bits, etc. (this may sound similar to determining the class of an IPv4 address). So 0001010100 would split into 00-01, 01-0100, thus distances 1, 4. Note that the lengths don't have to increment in powers of 2. 16 bits = 65535 characters is a lot and 2 bits = 3 is very little, so you can probably make it 4, 6, 8, (16?), (32?), ??? (unless there are a few capitals in a row, then you probably want 2 bits as well).
Variable length using escape sequence: Say the escape sequence is 00, and we want to use all strings that don't contain 00, so the bit value table will look as follows:
Bits    Value
1       1
10      2
11      3
101     4    // skipped 100
110     5
111     6
1010    7    // skipped 1000 and 1001
10100101010010101000101000010 will split into 101, 10101, 101010, 101, 0, 10. Note that ...1001.. just causes a split ending at the left 1 and a split starting at the right 1, and ...10001... causes a split ending at the first 0 and a split starting at the right 1, and ...100001... indicates a 0-valued distance in between. The pseudo-code is something like:
if (current value == 1 && zeroCount < 2)
    if (zeroCount == 1) add zero to current split   // a single 0 belongs to the split
    add one to current split
    zeroCount = 0
else if (current value == 1) // after 00...
    if (zeroCount % 2 == 1) { add zero to current split; zeroCount--; }
    record current split, clear current split
    while (zeroCount > 2) { record 0-distance split; zeroCount -= 2; }
    add one to current split   // this 1 starts the next split
    zeroCount = 0
else zeroCount++
This looks like a good solution for short distances, but once the distances become large I suspect you start skipping too many values and the length increases too quickly.
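For what it's worth, here is one way the splitting rule for the 00 escape sequence could be written in Java. It is a sketch, not a reference decoder: the names are invented, an empty string stands for a 0-valued distance, and trailing escape sequences at the very end of the input are simply ignored (an assumption, since the text above doesn't say how the stream ends).

```java
import java.util.ArrayList;
import java.util.List;

class EscapeSplitter {
    // Splits a string of '0'/'1' characters on the escape sequence "00".
    static List<String> split(String bits) {
        List<String> splits = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int zeroCount = 0;                       // zeros seen since the last 1
        for (char bit : bits.toCharArray()) {
            if (bit == '0') {
                zeroCount++;
                continue;
            }
            if (zeroCount >= 2) {                // at least one escape sequence passed
                if (zeroCount % 2 == 1) {        // odd run: first 0 ends the old split
                    current.append('0');
                    zeroCount--;
                }
                splits.add(current.toString());
                current.setLength(0);
                while (zeroCount > 2) {          // each extra "00" is an empty split
                    splits.add("");
                    zeroCount -= 2;
                }
            } else if (zeroCount == 1) {
                current.append('0');             // lone 0 is part of the split
            }
            current.append('1');
            zeroCount = 0;
        }
        if (zeroCount % 2 == 1) current.append('0');
        if (current.length() > 0) splits.add(current.toString());
        return splits;
    }
}
```

Fed the 29-bit example string from above, this yields the six splits 101, 10101, 101010, 101, (empty), 10.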
There is no ideal solution, it greatly depends on the data, you'll have to play around with prefixing capitals and different options for bit string distances to see which is best for your typical dataset.
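As a starting point for experimenting, extracting the "distance to next capital" sequence and applying it again could look something like this. The class and method names are hypothetical, and how the resulting integers get packed into bits is left open:

```java
import java.util.ArrayList;
import java.util.List;

class CapitalMask {
    // For each capital, the number of non-capital characters skipped since the previous one.
    static List<Integer> distances(String s) {
        List<Integer> result = new ArrayList<>();
        int gap = 0;
        for (char c : s.toCharArray()) {
            if (Character.isUpperCase(c)) {
                result.add(gap);
                gap = 0;
            } else {
                gap++;
            }
        }
        return result;
    }

    // Restores the capitals on an all-lower-case string from the distance list.
    static String apply(String lower, List<Integer> distances) {
        char[] chars = lower.toCharArray();
        int pos = -1;                            // index of the previous capital
        for (int d : distances) {
            pos += d + 1;                        // skip d characters, land on the next capital
            chars[pos] = Character.toUpperCase(chars[pos]);
        }
        return new String(chars);
    }
}
```

For "AbcdEfGHijklmNo" the distances come out as 0, 3, 1, 0, 5, matching the example earlier in this answer.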

Convert string to a large integer?

I have an assignment (I think a pretty common one) where the goal is to develop a LargeInteger class that can do calculations with very large integers.
I am obviously not allowed to use the java.math.BigInteger class at all.
Right off the bat I am stuck. I need to take 2 Strings from the user (the long digit sequences) and then I will be using these strings to perform the various calculation methods (add, divide, multiply etc.).
Can anyone explain to me the theory behind how this is supposed to work? After I take the string from the user (since it is too large to store in an int), am I supposed to break it up, maybe into 10-digit blocks of long numbers (I think 10 is the max for long, maybe 9)?
Any help is appreciated.
First off, think about what a convenient data structure to store the number would be. Think about how you would store an N digit number into an int[] array.
Now let's take addition for example. How would you go about adding two N digit numbers?
Using our grade-school addition, first we look at the least significant digit (in standard notation, this would be the right-most digit) of both numbers. Then add them up.
So if the right-most digits were 7 and 8, we would obtain 15. Take the right-most digit of this result (5) and that's the least significant digit of the answer. The 1 is carried over to the next calculation. So now we look at the 2nd least significant digit and add those together along with the carry (if there is no carry, it is 0). And repeat until there are no digits left to add.
The basic idea is to translate how you add, multiply, etc by hand into code when the numbers are stored in some data structure.
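To make the grade-school addition concrete, here is a small sketch that stores one decimal digit per int[] cell, least significant digit first. The layout and all names are assumptions for illustration, and it only handles non-negative numbers:

```java
class DigitAdder {
    // Parses a decimal string into digits, least significant digit at index 0.
    static int[] parse(String s) {
        int[] digits = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            digits[i] = s.charAt(s.length() - 1 - i) - '0';
        }
        return digits;
    }

    // Adds digit by digit, carrying into the next position.
    static int[] add(int[] a, int[] b) {
        int n = Math.max(a.length, b.length);
        int[] sum = new int[n + 1];              // one extra cell for a final carry
        int carry = 0;
        for (int i = 0; i < n; i++) {
            int t = carry + (i < a.length ? a[i] : 0) + (i < b.length ? b[i] : 0);
            sum[i] = t % 10;                     // digit that stays in this column
            carry = t / 10;                      // 0 or 1, carried to the next column
        }
        sum[n] = carry;
        return sum;
    }

    // Converts back to a string, dropping leading zeros.
    static String format(int[] digits) {
        int i = digits.length - 1;
        while (i > 0 && digits[i] == 0) i--;
        StringBuilder sb = new StringBuilder();
        for (; i >= 0; i--) sb.append(digits[i]);
        return sb.toString();
    }
}
```

Storing the least significant digit at index 0 keeps the columns of both numbers aligned without any index arithmetic.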
I'll give you a few pointers as to what I might do with a similar task, but let you figure out the details.
Look at how addition is done in simple electronic adder circuits. Specifically, they use small blocks of addition combined together. These principles will help. Specifically, you can add the blocks, just remember to carry over from one block to the next.
Your idea of breaking it up into smaller blocks is an excellent one. Just remember to do the correct conversions. I suspect 9 digits is just about right, for the purpose of carry-overs, etc.
These tasks will help you with addition and subtraction. Multiplication and Division are a bit trickier, but again, a few tips.
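Multiplication is the easier of the tasks, just remember to multiply each block of one number with each block of the other, shifting each partial product by its block position (the zeros you append in long multiplication) and carrying between blocks.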
Integer division could basically be approached like long division, only using whole blocks at a time.
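A rough sketch of block-wise multiplication with 9-digit blocks, as suggested above. Everything here (the names, the little-endian block order, using long for intermediate products) is my own choice for illustration; it assumes non-negative input consisting only of digits:

```java
class BlockMultiplier {
    static final int BASE = 1_000_000_000;       // 9 decimal digits per block

    // Splits a decimal string into 9-digit blocks, least significant block first.
    static int[] parse(String s) {
        int n = (s.length() + 8) / 9;
        int[] blocks = new int[n];
        for (int i = 0; i < n; i++) {
            int end = s.length() - 9 * i;
            int start = Math.max(0, end - 9);
            blocks[i] = Integer.parseInt(s.substring(start, end));
        }
        return blocks;
    }

    // Schoolbook multiplication: every block of a times every block of b.
    static int[] multiply(int[] a, int[] b) {
        int[] product = new int[a.length + b.length];
        for (int i = 0; i < a.length; i++) {
            long carry = 0;
            for (int j = 0; j < b.length; j++) {
                // long is needed: two 9-digit blocks multiply to up to 18 digits
                long t = (long) a[i] * b[j] + product[i + j] + carry;
                product[i + j] = (int) (t % BASE);
                carry = t / BASE;
            }
            product[i + b.length] = (int) carry;
        }
        return product;
    }

    // Converts back to a string; inner blocks are zero-padded to 9 digits.
    static String format(int[] blocks) {
        int i = blocks.length - 1;
        while (i > 0 && blocks[i] == 0) i--;
        StringBuilder sb = new StringBuilder(Integer.toString(blocks[i]));
        for (i--; i >= 0; i--) sb.append(String.format("%09d", blocks[i]));
        return sb.toString();
    }
}
```

With 9-digit blocks each block fits in an int, and a block product plus carries still fits in a long, which is why 9 is a comfortable block size.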
I've never actually built such a class, so hopefully there will be something in here you can use.
Look at the source code for MPI 1.8.6 by Michael Fromberger (a C library). It uses a simple data structure for bignums and simple algorithms. It's C, not Java, but straightforward.
Its division performs poorly (and results in slow conversion of very large bignums to text), but you can follow the code.
There is a function mpi_read_radix to read a number in an arbitrary radix (up to base 36, where the letter Z is 35) with an optional leading +/- sign, and produce a bignum.
I recently chose that code for a programming language interpreter because although it is not the fastest performer out there, nor the most complete, it is very hackable. I've been able to rewrite the square root myself to a faster version, fix some coding bugs affecting a port to 64 bit digits, and add some missing operations that I needed. Plus the licensing is BSD compatible.

how can i generate a unique int from a unique string?

I have an object with a String that holds a unique id .
(such as "ocx7gf" or "67hfs8")
I need to supply it with an implementation of int hashCode() which will be unique, obviously.
How do I convert a string to a unique int in the easiest/fastest way?
Thanks.
Edit - OK. I already know that String.hashCode() is possible. But it is not recommended anywhere. Actually, if no other method is recommended either, should I use it or not if I have my object in a collection and I need the hashCode? Should I concat it to another string to make it more successful?
No, you don't need to have an implementation that returns a unique value, "obviously", as obviously the majority of implementations would be broken.
What you want to do, is to have a good spread across bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, then just using the hashcode of the string itself would be best.
With special knowledge of the limits of your id format, it may be possible to customise and result in better performance, though false assumptions are more likely to make things worse than better.
Edit: On good spread of bits.
As stated here and in other answers, being completely unique is impossible and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact upon performance, so we want collisions to be rare.
Further, hashes are generally re-hashed, so our 32-bit number may end up being reduced to e.g. one in the range 0 to 22, and we want as good a distribution within that as possible too.
We also want to balance this with not taking so long to compute our hash, that it becomes a bottleneck in itself. An imperfect balancing act.
A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:
return X ^ Y;
While this does a perfectly good job of returning 2^32 possible values out of the 4^32 possible inputs, in real world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on) which all hash to zero, or matching pairs ({2,3} and {3, 2}) which will hash to the same number. We are likely better served by:
return ((X << 16) | (X >>> 16)) ^ Y;
Now, there are just as many possible values for which this is dreadful as for the former, but it tends to serve better in real-world cases.
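To see the difference on the coordinate pairs mentioned above, the two hash functions can be put side by side. A sketch only; the class and method names are invented, and in Java the rotation needs the unsigned shift >>> (Integer.rotateLeft(x, 16) does the same thing):

```java
class CoordinateHash {
    // Naive version: x == y always gives 0, and (x, y) collides with (y, x).
    static int xorHash(int x, int y) {
        return x ^ y;
    }

    // Rotates x by 16 bits before mixing, so the two halves no longer cancel out
    // for small equal or swapped coordinates.
    static int rotatedHash(int x, int y) {
        return ((x << 16) | (x >>> 16)) ^ y;
    }

    public static void main(String[] args) {
        System.out.println(xorHash(1, 1) + " " + xorHash(2, 2));          // both 0
        System.out.println(rotatedHash(1, 1) + " " + rotatedHash(2, 2));  // two different values
    }
}
```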
Of course, there is a different job if you are writing a general-purpose class (no idea what possible inputs there are) or have a better idea of the purpose at hand. For example, if I was using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, month and lower-digits of the years, over the standard one. The writer of Date though can't work on such knowledge and has to try to cater for everyone.
Hence, If I for instance knew that a given string is always going to consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, but it isn't clear from your question that it does) then I might use an algorithm that assigned a value from 0 to 35 (the 36 possible values for each character) to each character, and then walk through the string, each time multiplying the current value by 36 and adding the value of the next char.
Assuming a good spread in the ids, this would be the way to go, especially if I made the order such that the lower-significant digits in my hash matched the most frequently changing char in the id (if such a call could be made), hence surviving re-hashing to a smaller range well.
However, lacking such knowledge of the format for sure, I can't make that call with certainty, and I could well be making things worse (slower algorithm for little or even negative gain in hash quality).
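If the ids really are short strings over [a-z0-9], the multiply-by-36 idea could be sketched as below. The IdKey class is hypothetical, and the value assignment (digits 0-9, then letters 10-35, case-insensitive) is simply what Character.digit gives for radix 36, not something stated in the question:

```java
class IdKey {
    private final String id;

    IdKey(String id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof IdKey && ((IdKey) o).id.equals(id);
    }

    @Override
    public int hashCode() {
        int h = 0;
        for (int i = 0; i < id.length(); i++) {
            // Character.digit returns 0-35 for [0-9a-z] and -1 for anything else
            h = h * 36 + Character.digit(id.charAt(i), 36);
        }
        return h;   // may wrap around on overflow, which is harmless for a hash
    }
}
```

Equal ids still produce equal hash codes, which is the only hard requirement; whether this actually beats String.hashCode() for a given data set would need measuring.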
One advantage you have is that since it's an ID in itself, then presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.
You can't get a unique integer from a String of unlimited length. There are 4 billionish (2^32) unique integers, but an unbounded number of unique strings.
String.hashCode() will not give you unique integers, but it will do its best to give you differing results based on the input string.
EDIT
Your edited question says that String.hashCode() is not recommended. This is not true, it is recommended, unless you have some special reason not to use it. If you do have a special reason, please provide details.
Looks like you've got a base-36 number there (a-z + 0-9). Why not convert it to an int using Integer.parseInt(s, 36)? Obviously, if there are too many unique IDs, it won't fit into an int, but in that case you're out of luck with unique integers and will need to get by using String.hashCode(), which does its best to be close to unique.
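One caveat worth checking before relying on this: six base-36 characters can encode values up to 36^6 - 1 = 2176782335, which is slightly more than Integer.MAX_VALUE, so Integer.parseInt throws NumberFormatException for the largest ids. A small sketch (helper names are made up) that tests the fit and falls back to long:

```java
class Base36Id {
    // True if the id parses as a base-36 number that fits in an int.
    // Also false for ids containing characters outside [0-9a-zA-Z].
    static boolean fitsInInt(String id) {
        try {
            Integer.parseInt(id, 36);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // A long has room for up to 12 base-36 characters.
    static long toLong(String id) {
        return Long.parseLong(id, 36);
    }
}
```

So for 6-character ids a long is always unique, while an int is unique only for ids that happen to stay below 2^31.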
Unless your strings are limited in some way or your integers hold more bits than the strings you're trying to convert, you cannot guarantee the uniqueness.
Let's say you have a 32 bit integer and a 64-character character set for your strings. That means six bits per character. That will allow you to store five characters into an integer. More than that and it won't fit.
Represent each character of the string by a five-bit binary number, e.g. a by 00001, b by 00010, etc.; thus 32 combinations are possible. For example, cat would be written as 00011 00001 10100. Then convert this binary number into decimal, which gives 3124, so cat would be 3124. Similarly, you can get cat back from 3124 by converting it to binary first and mapping each five-bit group back to its character.
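In code, the five-bits-per-letter packing might look like this. It is a sketch that assumes lower-case a-z only (a = 1 ... z = 26) and at most six letters, so that the 30 bits fit in a non-negative int; the names are placeholders:

```java
class FiveBitPacker {
    // Packs up to six lower-case letters into an int, five bits per letter.
    static int pack(String word) {
        int packed = 0;
        for (char c : word.toCharArray()) {
            packed = (packed << 5) | (c - 'a' + 1);   // a=1 ... z=26
        }
        return packed;
    }

    // Reverses pack() by peeling off five bits at a time from the low end.
    static String unpack(int packed) {
        StringBuilder sb = new StringBuilder();
        while (packed != 0) {
            sb.append((char) ('a' + (packed & 31) - 1));
            packed >>>= 5;
        }
        return sb.reverse().toString();
    }
}
```

With this numbering pack("cat") is 3124 and unpack(3124) gives "cat" back. Starting at a = 1 rather than 0 is what lets unpack know where the word ends.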
One way to do it is to assign each letter a value, and each place of the string its own multiple, i.e. a = 1, b = 2, and so on. Then everything in the first digit (read left to right) would be multiplied by a prime number, the next by the next prime number and so on, such that the final digit was multiplied by a prime larger than the number of possible subsets in that digit (26+1 for a space or 52+1 with capitals and so on for other supported characters). If the number is mapped back to the first digits (leftmost character) any number you generate from a unique string mapping back to 1 or 6 or whatever the first letter will be, gives a unique value.
Dog might be 30,3(15),101(7) or 782, while God 33,3(15),101(4) or 482. More importantly than unique strings being generated they can be useful in generation if the original digit is kept, like 30(782) would be unique to some 12(782) for the purposes of differentiating like strings if you ever managed to go over the unique possibilities. Dog would always be Dog, but it would never be Cat or Mouse.

frequency analysis algorithm

I want to write a java program that searches through a cipher text and returns a frequency count of the characters in the cipher, for example the cipher:
"jshddllpkeldldwgbdpked" will have a result like this:
2 letter occurrences:
pk = 2, ke = 2, ld = 2
3 letter occurrences:
pke = 2.
Any algorithm that allows me to do this as efficiently as possible?
The map strategy is a good one, but I'd go for HashMap<String, Integer>, since it's tuples of characters being counted.
Iterating over the characters in the ciphertext, you can save the last X characters and that will give you a map over all occurrences of substrings of length X+1.
The usual approach would be to use some kind of map to map your characters to their counts. You can use a HashMap<Character, Integer> for example. You can then iterate through your ciphertext, character-wise and either put the character into the map with a count of 1 (if it doesn't yet exist) or increment its count.
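Both map-based suggestions boil down to the same loop. Here is a sketch with HashMap<String, Integer> that counts every substring of length n (n = 1 gives plain character counts); the class and method names are made up:

```java
import java.util.HashMap;
import java.util.Map;

class NGramCounter {
    // Counts how often each substring of length n occurs in text.
    static Map<String, Integer> count(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            String gram = text.substring(i, i + n);
            counts.merge(gram, 1, Integer::sum);   // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        String cipher = "jshddllpkeldldwgbdpked";
        // Print only the n-grams that repeat
        for (int n = 2; n <= 3; n++) {
            for (Map.Entry<String, Integer> e : count(cipher, n).entrySet()) {
                if (e.getValue() > 1) System.out.println(e.getKey() + " = " + e.getValue());
            }
        }
    }
}
```

For the sample ciphertext this finds pk, ke and ld twice each, plus dl, which the list in the question leaves out, and pke twice among the 3-letter sequences. The work is linear in the text length for a fixed n.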
You could store the n-grams in a trie, reversing the normal order so the last character in the n-gram is at the top of the trie. Each node in the trie stores a character count. Loop over the string, keeping track of the last N characters (as Buhb suggests). Each time through the outer loop, you traverse the trie, using the last N characters to pick the path, starting with the last character and ending with the Nth to last. For each node you visit, increment its counter.
To print the n-gram frequencies, perform a breadth-first traversal of the trie.
Overall performance left as an exercise.
Either have an array with a cell for each possible value (easy if the cipher text is all lower case characters - 26 - harder if not) or go for a Map where you pass in the character and increment the value in either case. The array is quicker but less flexible.
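The array variant for single characters is about as simple as it gets. A sketch assuming the ciphertext is all lower-case a-z (anything else is skipped); names are illustrative:

```java
class LetterCounts {
    // counts[0] is the number of 'a's, counts[25] the number of 'z's.
    static int[] count(String cipher) {
        int[] counts = new int[26];
        for (char c : cipher.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }
}
```

Usage would be something like count(cipher)['d' - 'a'] to read the tally for d.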
If the set of lengths of sequences you need is fixed, the obvious algorithm takes a linear number of counting operations (say, looking up a counter in a hashtable and incrementing it).
When you say "as efficiently as possible", do you propose to spend a lot of effort for a meagre constant-factor improvement, to search hopelessly for a sublinear algorithm, or do you not understand algorithm complexity classes at all?
You can use a hash or a graph (thanks to outis, I now know its special name: this kind of graph is called a "trie"). A hash will be slower, a graph will be faster. A hash will take less memory; a graph will take more in a bad implementation.
You cannot do it with an array if your maximum sequence length equals your text length and the text is long enough, since it would take a HUGE amount of memory. If you limit the length, it will take something like ([number of letters]^[max sequence length])*4 bytes, which is (52^4)*4 ~= 29 MB of memory for sequences of 4 lower/upper-case letters. If a limited sequence length is OK for you and this amount of memory is acceptable, then the algorithm will be pretty easy for <=4 letters in sequence.
You could start by looking for the largest possible repeatable sequence first then work your way down from there. For example if the string is 10 characters the largest repeatable sequence that could occur would be 5 letters long so first look for 5 letter sequences then 4 letters and so on till you reach 2. This should reduce the number of iterations in your program.
I don't have an answer in mind for this, but I feel this algorithm is exactly the same as the one compression algorithms use to create compressed files with the dictionary approach.
If I am not wrong, in this approach a dictionary is used in the following manner:
data:
abccccabaccabcaaaaabcaaabbbbbccccaaabcbbbbabbabab
parse 1 :
key: *
value: abc
new data:
*cccabacc*aaaa*aaabbbbbccccaa*bbbbabbabab
Just an educated guess, I think (not sure here) the standard "zip" file uses this approach,
so I suggest you look at these algorithms
