Getting/Applying capitalization mask before/after encoding? - java

My project takes a String s and passes an all lower case version s.toLowerCase() to a lossless encoder.
I can encode/decode the lower case string just fine, but this obviously would not be practical, so I need to be able to preserve the original String's capitalization somehow.
I was thinking of using Character.isUpperCase() to get an array of integers UpperCaseLetters[] that represents the locations of all capital letters in s. I would then use this array to place a ^ at all locations UpperCaseLetters[i] + 1 in the encoded string. When decoding the string, I would know that every character preceding a ^ is capital. (By the way, this encoder will never generate ^ when encoding.)
This method seems sloppy to me though. I was also thinking of using bit strings to represent capitalization, but the overall goal of the application is compression, so that would not be very efficient.
Is there any easier way to get and apply capitalization masks for strings? If there is, how much "storage" would it need?

Your options:
Auto-capitalize:
Use a general algorithm for capitalization, then use one of the techniques below to record only the letters whose case differs between the generated and the actual capitalization. To regenerate, just run the algorithm again and flip the case of all the recorded letters. Assuming there are capital letters where there should be (e.g. at the start of sentences), this slows things down only slightly (by a small constant factor of n, and decent compression is generally much slower than that) and always reduces the amount of storage required by a little.
Bitmap of capital positions:
You've already covered this one, not particularly efficient.
Prefix capitals with identifying character:
Also already covered, except that you described postfix, but prefix is generally better and, for a more generic solution, you can also escape the ^ with ^^. Not a bad idea. Depending on the compression, it might be a good idea to instead use a letter that already appears in the dataset. Either the most or least common letter, or you may have to look at the compression algorithm and do quite a bit of processing to determine the ideal letter to use.
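For illustration, a minimal sketch of the prefix-marker idea (the method names are made up; the marking is applied to the plain text before it goes to the encoder, and undone after decoding):

static String encodeCaps(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
        if (c == '^') {
            sb.append("^^");                                  // escape a literal caret
        } else if (Character.isUpperCase(c)) {
            sb.append('^').append(Character.toLowerCase(c));  // mark a capital
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}

static String decodeCaps(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '^' && i + 1 < s.length()) {
            char next = s.charAt(++i);
            sb.append(next == '^' ? '^' : Character.toUpperCase(next));
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}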
Store distance of capital from start in any format:
Has no advantage over distance to next capital (below).
Distance to next capital - non-bitstring representation:
Generally less efficient than using bitstrings.
Bit string = distance to next capital:
You have a sequence of lengths, each indicating, in sequence, the distances between capitals. So if we have distances 0,3,1,0,5 capitalization would be as follows: AbcdEfGHijklmNo (skip 0 characters to the first, 3 characters to the second, 1 character to the 3rd, etc.). There are some options available to store this:
Fixed length: Not a good idea, since each length field needs to be as big as the longest possible distance. An obvious alternative is having some sort of overflow into the next length, but this still uses too much space.
Fixed length, different settings: Best explained with an example - the first 2 bits indicate the length: 00 means there are 2-bits following to indicate the distance, 01 means 4-bits, 10 means 8-bits, 11 means 16-bits. If there's a chance of more than 16-bits, you may want to do something like - 110 means 16-bits, 1110 means 32-bits, 11110 means 64-bits, etc. (this may sound similar to determining the class of an IPv4 address). So 0001010100 would split into 00-01, 01-0100, thus distances 1, 4. Note that the lengths don't have to increment in powers of 2. 16-bits = 65535 characters is a lot and 2-bits = 3 is very little; you can probably make it 4, 6, 8, (16?), (32?), ??? (unless there are a few capitals in a row, then you probably want 2-bits as well).
Variable length using escape sequence: Say the escape sequence is 00; we want to use all bit strings that don't contain 00, so the value table will look as follows:
Bits  Value
1     1
10    2
11    3
101   4     // skipped 100
110   5
111   6
1010  7     // skipped 1000 and 1001
10100101010010101000101000010 will split into 101, 10101, 101010, 101, 0, 10. Note that ...1001.. just causes a split ending at the left 1 and a split starting at the right 1, and ...10001... causes a split ending at the first 0 and a split starting at the right 1, and ...100001... indicates a 0-valued distance in between. The pseudo-code is something like:
for each bit b in the input:
    if (b == 1 && zeroCount < 2) {
        if (zeroCount == 1) { add a 0 to the current split; }   // a lone 0 belongs to the split
        add a 1 to the current split; zeroCount = 0;
    } else if (b == 1) {                                        // we have just passed an escape (00...)
        if (zeroCount % 2 == 1) { add a 0 to the current split; zeroCount--; }
        record the current split; clear the current split;
        while (zeroCount > 2) { record a 0-distance split; zeroCount -= 2; }
        zeroCount = 0; add a 1 to the current split;            // this 1 starts the next split
    } else {
        zeroCount++;
    }
// at the end of the input: if zeroCount == 1, add a final 0, then record the last split
This looks like a good solution for short distances, but once the distances become large I suspect you start skipping too many values and the length increases too quickly.
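For completeness, a small sketch of extracting the distance sequence itself, independent of how it is then encoded (the method names are made up):

import java.util.ArrayList;
import java.util.List;

static List<Integer> capitalDistances(String s) {
    List<Integer> distances = new ArrayList<>();
    int gap = 0;
    for (char c : s.toCharArray()) {
        if (Character.isUpperCase(c)) {
            distances.add(gap);   // distance since the previous capital
            gap = 0;
        } else {
            gap++;
        }
    }
    return distances;             // "AbcdEfGHijklmNo" -> [0, 3, 1, 0, 5]
}

static String applyCapitals(String lower, List<Integer> distances) {
    char[] chars = lower.toCharArray();
    int pos = -1;
    for (int gap : distances) {
        pos += gap + 1;                                   // skip 'gap' characters...
        chars[pos] = Character.toUpperCase(chars[pos]);   // ...and capitalize the next one
    }
    return new String(chars);
}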
There is no ideal solution; it greatly depends on the data. You'll have to play around with prefixing capitals and the different options for bit-string distances to see which is best for your typical dataset.

Related

Find if a permutation (using + and -) on a string of integers matches a number

Basically what I am doing is taking a string of integers (e.g. "1234"), and I am able to insert a + or - anywhere in this string, as much or as little as I want. For example, I can do "1 + 2 + 3 + 4", "12 + 34", "123 - 4", etc. It is required to use all integers of the string; I cannot exclude any.
What I am trying to do is take another array of integers and find whether it is possible to get each of those numbers using the permutations mentioned in the first paragraph. I am somewhat lost on where to start looking for this. I could possibly create a recursive loop function to create every possible combination of the string and see if each result matches but this seems like it will be terribly slow. Another thought was to index them into an array - that way I could simply look up the answers after calculating them once.
Anyone have any suggestions?
I could possibly create a recursive loop function to create every possible combination of the string and see if each result matches but this seems like it will be terribly slow.
Doing an exhaustive search is your only option here. Fortunately, the timing isn't going to be too bad even for moderately long strings of up to 7..10 characters, because you do not need to "redo" additions and subtractions of a prior string when you process the "tail".
An outline of a possible implementation could be as follows:
Put all desired results from your array of integers in a hash set
Make a recursive method that takes the result so far, the string, and the position of the next "cut"
When the next "cut" is at the end of the string, check the result so far against the hash set from step 1
Otherwise, try these possibilities in a loop on k, the number of digits to take at the "cut":
Use a k-digit number from the "cut" as a positive number, and make a recursive invocation with the "cut" moved by k digits. This is equivalent to inserting a + at the cut
Use a k-digit number from the "cut" as a negative number, and make a recursive invocation with the "cut" moved by k digits. This is equivalent to inserting a - at the cut
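A rough sketch of that outline (the class and method names are mine; it assumes the targets and intermediate results fit in a long, and it also allows a '-' in front of the very first group):

import java.util.HashSet;
import java.util.Set;

public class PlusMinusSearch {

    // Returns every value from 'targets' that can be produced from 'digits'
    // by inserting '+' or '-' between groups of digits.
    static Set<Long> reachable(String digits, long[] targets) {
        Set<Long> wanted = new HashSet<>();
        for (long t : targets) wanted.add(t);
        Set<Long> found = new HashSet<>();
        cut(digits, 0, 0L, wanted, found);
        return found;
    }

    static void cut(String digits, int pos, long resultSoFar, Set<Long> wanted, Set<Long> found) {
        if (pos == digits.length()) {
            if (wanted.contains(resultSoFar)) found.add(resultSoFar);   // step 3 of the outline
            return;
        }
        for (int k = 1; pos + k <= digits.length(); k++) {
            long part = Long.parseLong(digits.substring(pos, pos + k));
            cut(digits, pos + k, resultSoFar + part, wanted, found);    // '+' at the cut
            cut(digits, pos + k, resultSoFar - part, wanted, found);    // '-' at the cut
        }
    }
}

For example, reachable("1234", new long[]{10, 127}) finds both 10 (1+2+3+4) and 127 (123+4).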
I'll give some starting help, with the approach for such a solution:
formal problem statement;
data model;
algorithm;
heuristics, cleverness.
For N digits there are some 3^(N-1) possibilities (between each pair of adjacent digits you can insert +, -, or nothing).
The solution must model the running data as:
the digits, as int[]
the sum
the index from which to advance (prior digits are already done);
the number part already tried, plus its sign. The sign must come separately (as -1, +1) as the coming digit may be 0;
(What I leave out is the collecting of the entire result.)
The brute force solution then could be:
boolean solve(int[] digits, int sum) {
    return solve(digits, sum, 1, 0, 0);
}

boolean solve(int[] digits, int sum, int signum, int part, int index) {
    if (index >= digits.length) {
        return signum * part == sum;
    }
    // Before the digit at index do either nothing, +, or -.
    // "Nothing" extends the current part; + and - close the part off (its signed value
    // is subtracted from the remaining sum) and start a new part with this digit.
    return solve(digits, sum, signum, part * 10 + digits[index], index + 1)
        || solve(digits, sum - signum * part, 1, digits[index], index + 1)
        || solve(digits, sum - signum * part, -1, digits[index], index + 1);
}
Mind you could also split the digits in half and try to insert (nothing, +, -) there.
There are pruning opportunities to diminish the number of tries. First, the above can be done in a loop; the alternatives need not all be tried. The order of evaluation might favor more likely candidates:
if digit 0 ...
if part > sum first - then +
...
Unfortunately the +/- makes a number-theoretic approach illusory, as far as I can tell.
@dasblinkenlight mentions even better data models, which allow you to avoid repeating evaluations across the alternatives. That would be even more interesting, but it might fail miserably due to time constraints, and I wanted to come up with something concrete without providing an entirely ready-made solution.
It is reasonable to take a brute force approach if you can rely on the input string not to be too long. If it contains n digits then you can construct 3^(n-1) formulae from it (between each pair of digits you can insert '+', '-', or nothing, for n-1 internal positions). For a 12-digit input string that's 3^11, roughly 177,000 formulae, which should be computable quite quickly. Of course, you would build and compute each one once, and compare the result to all the alternatives. Don't redo the computation for each array element.
It may be that there's a dynamic programming approach to this, but I'm not immediately seeing it, at least not one that would be substantially better than brute force.

How to find the encryption key of a shift cipher?

So I got an assignment where I need to decrypt a string that was encrypted with a certain key using a shift cipher. My plan was to find the most common letter in the string and derive the key from that, because the most common letter should be E; once I have the key I can decrypt the text (basically subtract the key from each letter in the string that isn't a space). Now I've faced a few difficulties:
I want my program to work for every key, so I don't know whether the most common letter comes after or before E (in the ASCII chart). That means I can't just subtract E and get the answer, and I can't figure out what math I need to make it work.
Also, once I've found the key, I don't know how to make, for example, A wrap back around to Y, i.e. going around the circle (though I think I might know how, maybe with the % operator).
Anyway, anyone who can help is much appreciated, and I'm not allowed to use really advanced commands. Also, all letters in the String are caps and the cipher is a simple shift cipher.
I'm going to assume that you are trying to solve a Caesar (or shift) cipher - for simplicity's sake. The principles involved should be applicable to other ciphers as well, though.
You indicated that you wanted to find the most common character. This doesn't really help you in all cases, because that letter can become any of the other letters. However, you could do a letter-frequency attack using this... It is only really effective with fairly long strings, however...
Probably the easiest way to solve such a problem is brute forcing the solution. Since there are only a very few number of solutions, brute force can be very effective. There are 26 letters in the alphabet, so we can shift 0-25. That means there are only 25 potential strings to check.
Finding these strings is relatively trivial. Use a loop over the range 1-25. Now, just convert each character in your string to a number 0-25 [A=0, B=1 ... Z=25] and add the shift [1-25]. (char_value + shift) % 26 will give you your new character value. (You could also work directly with ASCII character values, but I use these for the sake of understanding.)
Less trivial, however, is determining which string is most likely the correct one. In my opinion, the best way to do this is to use a dictionary of common words - you can read more about this type of attack here: http://en.wikipedia.org/wiki/Known-plaintext_attack. With your dictionary, you just look for the string with the highest number of known words - although it can get more complex than this. Chances are that this will be your solution.
This will work for all cases.
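A minimal sketch of that brute-force loop, assuming upper-case A-Z ciphertext and leaving everything else (e.g. spaces) alone; the method names are made up:

static String shift(String cipherText, int shift) {
    StringBuilder sb = new StringBuilder(cipherText.length());
    for (char c : cipherText.toCharArray()) {
        if (c >= 'A' && c <= 'Z') {
            sb.append((char) ('A' + (c - 'A' + shift) % 26));
        } else {
            sb.append(c);   // leave spaces and anything else untouched
        }
    }
    return sb.toString();
}

// Try every possible key; a dictionary check (or a human) picks the real plaintext.
static void printAllCandidates(String cipherText) {
    for (int s = 1; s <= 25; s++) {
        System.out.println(s + ": " + shift(cipherText, s));
    }
}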
If you only want to look at problems where the most common letter is shifted to E, the problem becomes significantly simpler. However, in real examples, the most common letter in the string will not necessarily be E, so this strategy is less than optimal.
Find the most common letter by looping through the string, keeping a running tally of the occurrences of each letter. You can do this in a number of ways. You could use a list of ints, a map, or pretty much anything...
From here, determine the size of the shift from this letter to E. For example, if S is the most common letter, the shift is 12 (the amount to add to each ciphertext letter to decrypt it): E has a value of 4, S has a value of 18, and the shift is (26-18)+4. This can be computed for any letter using the remainder operator, i.e. ((26-o)+4) % 26, where o is the value of the most frequent letter. So if the most frequent letter is A=0, (26+4) % 26 = 4, which is the correct shift.
Now, you can shift all the characters as explained above.
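Putting the frequency idea together, a minimal sketch (again assuming upper-case A-Z ciphertext; mapping the most common letter to E is only a guess on short texts):

static String decryptByFrequency(String cipherText) {
    int[] counts = new int[26];
    for (char c : cipherText.toCharArray()) {
        if (c >= 'A' && c <= 'Z') counts[c - 'A']++;
    }
    int mostFrequent = 0;
    for (int i = 1; i < 26; i++) {
        if (counts[i] > counts[mostFrequent]) mostFrequent = i;
    }
    int shift = (26 - mostFrequent + ('E' - 'A')) % 26;   // amount to add so the top letter becomes E
    StringBuilder sb = new StringBuilder(cipherText.length());
    for (char c : cipherText.toCharArray()) {
        sb.append(c >= 'A' && c <= 'Z' ? (char) ('A' + (c - 'A' + shift) % 26) : c);
    }
    return sb.toString();
}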

Java algorithm for evenly distributing ranges of strings into buckets [closed]

Short version - I'm looking for a Java algorithm that given a String and an integer representing a number of buckets returns which bucket to place the String into.
Long version - I need to distribute a large number of objects into bins, evenly (or approximately evenly). The number of bins/buckets will vary, so the algorithm can't assume a particular number of bins. It may be 1, 30, or 200. The key for these objects will be a String.
The String has some predictable qualities that are important. The first 2 characters of the string actually appear to be a hex representation of a byte, i.e. 00-ff, and the strings themselves are quite evenly distributed within that range. There are a couple of outliers that start differently though, so this can't be relied on 100% (though easily 99.999%). This just means that edge cases do need to be handled.
It's critical that, once all the strings have been distributed, there is zero overlap in range between the values that appear in any 2 bins. That way, if I know what range of values appears in a bin, I don't have to look in any other bin to find an object. So, for example, if I had 2 bins, it could be that bin 0 has Strings starting with letters a-m and bin 1 those starting with n-z. However, that wouldn't satisfy the need for even distribution, given what we know about the Strings.
Lastly, the implementation can have no knowledge of the current state of the bins. The method signature should literally be:
public int determineBucketIndex(String key, int numBuckets);
I believe that the foreknowledge about the distribution of the Strings should be sufficient.
EDIT: Clarifying for some questions
Number of buckets can exceed 256. The strings do contain additional characters after the first 2, so this can be leveraged.
The buckets should hold a range of Strings to enable fast lookup later. In fact, that's why they're being binned to begin with. With only the knowledge of ranges, I should be able to look in exactly 1 bucket to see if the value is there or not. I shouldn't have to look in others.
Hashcodes won't work. I need the buckets to contain only String within a certain range of the String value (not the hash). Hashing would lose that.
EDIT 2: Apparently not communicating well.
After bins have been chosen, these values are written out to files, 1 file per bin. The system that uses these files after binning is NOT Java. It's already implemented, and it needs the values in each bin to fit within a range. I repeat, hashCode will not work. I explicitly said the ranges for strings cannot overlap between two bins; using hashCode cannot work.
I have read through your question twice and I still don't understand the constraints. Therefore, I am making a suggestion here and you can give feedback on it. If this won't work, please explain why.
First, do some math on the number of bins to determine how many bits you need for a unique bin number. Take the logarithm to base 2 of the number of bins, round up to get the number of bits, then take the ceiling of the number of bits divided by 8. This is the number of bytes of data you need, numBytes.
Take the first two letters and convert them to a byte. Then grab numBytes - 1 characters and convert them to bytes. Take the ordinal value of the character ('A' becomes 65, and so on). If the next characters could be Unicode, pick some rule to convert them to bytes... probably grab the least significant byte (modulus by 256). Get numBytes bytes total, including the byte made from the first two letters, and convert to an integer. Make the byte from the first two letters the least significant 8 bits of the integer, the next byte the next 8 significant bits, and so on. Now simply take the modulus of this value by the number of bins, and you have an integer bin number.
If the string is too short and there are no more characters to turn into byte values, use 0 for each missing character.
If there are any predictable characters (for example, the third character is always a space) then don't use those characters; skip past them.
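A rough sketch of the byte-modulus idea described above (determineBucketIndex is the signature from the question; everything else, including the fallback for the rare non-hex outliers, is my assumption):

public static int determineBucketIndex(String key, int numBuckets) {
    int bitsNeeded = 32 - Integer.numberOfLeadingZeros(Math.max(numBuckets - 1, 1));
    int numBytes = (bitsNeeded + 7) / 8;                     // bytes of key material we need

    long value;
    try {
        value = Integer.parseInt(key.substring(0, 2), 16);   // byte from the hex prefix
    } catch (RuntimeException e) {                           // outlier or very short key
        value = key.isEmpty() ? 0 : key.charAt(0) & 0xFF;
    }
    for (int i = 1; i < numBytes; i++) {
        int b = (i + 1 < key.length()) ? (key.charAt(i + 1) & 0xFF) : 0;  // 0 if the key is too short
        value |= ((long) b) << (8 * i);                      // next byte gets the next 8 bits
    }
    return (int) (value % numBuckets);
}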
Now, if this doesn't work for you, please explain why, and then maybe we will understand the question well enough to answer it.
answer edited after 2 updates to original post
It would have been an excellent idea to include all the information in your question from the start - with your new edits, your description already gives you the answer: stick your objects into a balanced tree (giving you the homogeneous distribution you say you need) based on the hashCode of your string's substring(0,2), or something similarly head-based. Then write each leaf (being a set of strings) in the BTree to a file.
I seriously doubt that the problem, as described, can be done perfectly. How about this:
Create 257 bins.
Put all normal Strings into bins 0-255.
Put all the outliers into bin 256.
Other than the "even distribution", doesn't this meet all your requirements?
At this point, if you really want a more even distribution, you could reorganize bins 0-255 into a smaller number of more evenly distributed bins. But I think you may just have to lessen the requirements there.
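A minimal sketch of the 257-bin idea (hypothetical: it assumes you have agreed on 257 bins and therefore ignores the numBuckets argument):

public static int determineBucketIndex(String key, int numBuckets) {
    if (key.length() >= 2) {
        try {
            return Integer.parseInt(key.substring(0, 2), 16);   // bins 0-255 from the hex prefix
        } catch (NumberFormatException e) {
            // not a hex prefix - fall through to the outlier bin
        }
    }
    return 256;   // the outlier bin
}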

How can I generate a unique int from a unique string?

I have an object with a String that holds a unique id.
(such as "ocx7gf" or "67hfs8")
I need to supply an implementation of int hashCode() for it which will obviously be unique.
How do I convert a string to a unique int in the easiest/fastest way?
10x.
Edit - OK. I already know that String.hashCode() is possible. But it is not recommended anywhere. Actually, if no other method is recommended - should I use it or not, given that I have my object in a collection and I need the hash code? Should I concat it with another string to make it more successful?
No, you don't need to have an implementation that returns a unique value, "obviously", as obviously the majority of implementations would be broken.
What you want to do is have a good spread across bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, just using the hash code of the string itself would be best.
With special knowledge of the limits of your id format, it may be possible to customise and result in better performance, though false assumptions are more likely to make things worse than better.
Edit: On good spread of bits.
As stated here and in other answers, being completely unique is impossible and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact upon performance, so we want collisions to be rare.
Further, hashes are generally re-hashed, so our 32-bit number may end up being reduced to e.g. one in the range 0 to 22, and we want as good a distribution within that as possible too.
We also want to balance this with not taking so long to compute our hash, that it becomes a bottleneck in itself. An imperfect balancing act.
A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:
return X ^ Y;
While this does a perfectly good job of returning 2^32 possible values out of the 4^32 possible inputs, in real world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on) which all hash to zero, or matching pairs ({2,3} and {3, 2}) which will hash to the same number. We are likely better served by:
return ((X << 16) | (X >>> 16)) ^ Y; // rotate X by 16 bits before XOR-ing
Now, there are just as many possible values for which this is dreadful as for the former, but it tends to serve better in real-world cases.
Of course, there is a different job if you are writing a general-purpose class (no idea what possible inputs there are) or have a better idea of the purpose at hand. For example, if I was using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, month and lower-digits of the years, over the standard one. The writer of Date though can't work on such knowledge and has to try to cater for everyone.
Hence, if I for instance knew that a given string is always going to consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to be, but it isn't clear from your question that they are), then I might use an algorithm that assigned a value from 0 to 35 (the 36 possible values for each character) to each character, and then walk through the string, each time multiplying the current value by 36 and adding the value of the next char.
Assuming a good spread in the ids, this would be the way to go, especially if I made the order such that the lower-significant digits in my hash matched the most frequently changing char in the id (if such a call could be made), hence surviving re-hashing to a smaller range well.
However, lacking such knowledge of the format for sure, I can't make that call with certainty, and I could well be making things worse (slower algorithm for little or even negative gain in hash quality).
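For concreteness, a sketch of that base-36 walk (it assumes the six-character [a-z0-9] format really does hold; 36^6 is about 2.2 billion, so the top of the range wraps past Integer.MAX_VALUE, which is harmless for a hash code):

static int idHash(String id) {
    int h = 0;
    for (int i = 0; i < id.length(); i++) {
        char c = Character.toLowerCase(id.charAt(i));
        int v = (c >= '0' && c <= '9') ? c - '0' : 10 + (c - 'a');   // 36 values per character
        h = h * 36 + v;
    }
    return h;
}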
One advantage you have is that since it's an ID in itself, then presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.
You can't get a unique integer from a String of unlimited length. There are 4 billionish (2^32) unique integers, but an almost infinite number of unique strings.
String.hashCode() will not give you unique integers, but it will do its best to give you differing results based on the input string.
EDIT
Your edited question says that String.hashCode() is not recommended. This is not true, it is recommended, unless you have some special reason not to use it. If you do have a special reason, please provide details.
Looks like you've got a base-36 number there (a-z + 0-9). Why not convert it to an int using Integer.parseInt(s, 36)? Obviously, if there are too many unique IDs, it won't fit into an int, but in that case you're out of luck with unique integers and will need to get by using String.hashCode(), which does its best to be close to unique.
Unless your strings are limited in some way or your integers hold more bits than the strings you're trying to convert, you cannot guarantee the uniqueness.
Let's say you have a 32 bit integer and a 64-character character set for your strings. That means six bits per character. That will allow you to store five characters into an integer. More than that and it won't fit.
Represent each string character by a five-digit binary number, e.g. a by 00001, b by 00010, etc.; thus 32 combinations are possible. For example, cat would be written as 00011 00001 10100. Then convert this binary into decimal - here that is 3124 - so cat becomes 3124. Similarly, you can get cat back from 3124 by converting it to binary first and mapping each five-digit binary group back to a letter.
One way to do it is to assign each letter a value and each place of the string its own multiplier, i.e. a = 1, b = 2, and so on; then everything in the first digit (read left to right) would be multiplied by a prime number, the next by the next prime number, and so on, such that the final digit was multiplied by a prime larger than the number of possible values in that digit (26+1 for a space, or 52+1 with capitals, and so on for other supported characters). If the number is mapped back to the first digits (leftmost character), any number you generate from a unique string, mapping back to 1 or 6 or whatever the first letter is, gives a unique value.
Dog might be 30,3(15),101(7) or 782, while God 33,3(15),101(4) or 482. More importantly than unique strings being generated they can be useful in generation if the original digit is kept, like 30(782) would be unique to some 12(782) for the purposes of differentiating like strings if you ever managed to go over the unique possibilities. Dog would always be Dog, but it would never be Cat or Mouse.

frequency analysis algorithm

I want to write a java program that searches through a cipher text and returns a frequency count of the characters in the cipher, for example the cipher:
"jshddllpkeldldwgbdpked" will have a result like this:
2 letter occurrences:
pk = 2, ke = 2, ld = 2
3 letter occurrences:
pke = 2.
Any algorithm that allows me to do this as efficiently as possible?
The map strategy is a good one, but I'd go for HashMap<String, Integer>, since it's tuples of characters being counted.
Iterating over the characters in the ciphertext, you can save the last X characters and that will give you a map over all occurrences of substrings of length X+1.
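A minimal sketch of that map approach, counting every substring of a given length (the names are made up):

import java.util.HashMap;
import java.util.Map;

static Map<String, Integer> countNGrams(String cipherText, int n) {
    Map<String, Integer> counts = new HashMap<>();
    for (int i = 0; i + n <= cipherText.length(); i++) {
        counts.merge(cipherText.substring(i, i + n), 1, Integer::sum);   // count this n-gram
    }
    return counts;
}

For example, countNGrams("jshddllpkeldldwgbdpked", 2) reports pk=2, ke=2 and ld=2, along with the other pairs that occur.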
The usual approach would be to use some kind of map to map your characters to their counts. You can use a HashMap<Character, Integer> for example. You can then iterate through your ciphertext, character-wise and either put the character into the map with a count of 1 (if it doesn't yet exist) or increment its count.
You could store the n-grams in a trie, reversing the normal order so the last character in the n-gram is at the top of the trie. Each node in the trie stores a character count. Loop over the string, keeping track of the last N characters (as Buhb suggests). Each time through the outer loop, traverse the trie, using the last N characters to pick the path, starting with the last character and ending with the Nth-to-last. For each node you visit, increment its counter.
To print the n-gram frequencies, perform a breadth-first traversal of the trie.
Overall performance left as an exercise.
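A rough sketch of that trie, under the same assumptions (nodes keyed by character, a counter per node, n-grams stored last-character-first; printing the counts is the breadth-first traversal mentioned above):

import java.util.HashMap;
import java.util.Map;

class NGramTrie {
    static final class Node {
        int count;
        final Map<Character, Node> children = new HashMap<>();
    }

    final Node root = new Node();
    final int maxLen;

    NGramTrie(int maxLen) { this.maxLen = maxLen; }

    void addAll(String text) {
        // For every position, walk backwards up to maxLen characters; a node at depth d
        // then counts the d-gram ending at that position.
        for (int end = 1; end <= text.length(); end++) {
            Node node = root;
            for (int i = end - 1; i >= 0 && end - i <= maxLen; i--) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
                node.count++;
            }
        }
    }
}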
Either have an array with a cell for each possible value (easy if the cipher text is all lower case characters - 26 - harder if not) or go for a Map where you pass in the character and increment the value in either case. The array is quicker but less flexible.
If the set of lengths of sequences you need is fixed, the obvious algorithm takes a linear number of counting operations (say, looking up a counter in a hashtable and incrementing it).
When you say "as efficiently as possible", do you propose to spend a lot of effort for a meagre constant-factor improvement, to search hopelessly for a sublinear algorithm, or do you not understand algorithm complexity classes at all?
You can use a hash or a graph (thanks to outis, I now know its special name: this kind of graph is called a "trie"). The hash will be slower, the trie will be faster; the hash will use less memory, the trie will take more in a bad implementation.
You cannot do it with a plain array, since it would need a HUGE amount of memory if your maximum sequence length is equal to your text length and the text is long enough. If you limit the length, it will take something like ([number of letters]^[max sequence length])*4 bytes, which is (52^4)*4 ~= 28MB of memory for sequences of up to 4 lower/upper-case letters. If a limited sequence length is OK for you and this memory amount is acceptable, then the algorithm will be pretty easy for sequences of <= 4 letters.
You could start by looking for the largest possible repeatable sequence first, then work your way down from there. For example, if the string is 10 characters, the largest repeatable sequence that could occur would be 5 letters long, so first look for 5-letter sequences, then 4 letters, and so on until you reach 2. This should reduce the number of iterations in your program.
I don't have an answer in mind for this, but I feel this algorithm is essentially the same as the one used by compression algorithms that build compressed files with a dictionary approach.
If I am not wrong, in this approach a dictionary is used in the following manner:
data:
abccccabaccabcaaaaabcaaabbbbbccccaaabcbbbbabbabab
parse 1 :
key: *
value: abc
new data:
*cccabacc*aaaa*aaabbbbbccccaa*bbbbabbabab
Just an educated guess: I think (not sure here) the standard "zip" format uses this kind of approach, so I suggest you look at those algorithms.
