substitution cipher with different alphabet length - java
I would like to implement a simple substitution cipher to mask private ids in URLs.
I know what my IDs will look like (a combination of uppercase ASCII letters, digits and underscores), and they will be rather long, as they are composite keys. I would like to use a longer alphabet to shorten the resulting codes (I'd like to use upper- and lowercase ASCII letters, digits and nothing else). So my incoming alphabet would be
[A-Z0-9_] (37 chars)
and my outgoing alphabet would be
[A-Za-z0-9] (62 chars)
so a reasonable amount of compression would be available.
Let's say my URLs look like this:
/my/page/GFZHFFFZFZTFZTF_24_F34
and I want them to look like this instead:
/my/page/Ft32zfegZFV5
Obviously both arrays would be shuffled to bring some random order in.
This does not have to be secure. If someone figures it out: fine, but I don't want the scheme to be obvious.
My desired solution would be to convert the string to an integer representation of radix 37, convert the radix to 62 and use the second alphabet to write out that number. Is there any sample code available that does something similar? Integer.parseInt() has some similar logic, but it is hard-coded to use standard digit behavior.
Any ideas?
I am using Java to implement this but code or pseudo-code in any other language is of course also helpful.
Inexplicably Character.MAX_RADIX is only 36, but you can always write your own base conversion routine. The following implementation isn't high-performance, but it should be a good starting point:
import java.math.BigInteger;

public class BaseConvert {
    static BigInteger fromString(String s, int base, String symbols) {
        BigInteger num = BigInteger.ZERO;
        BigInteger biBase = BigInteger.valueOf(base);
        for (char ch : s.toCharArray()) {
            num = num.multiply(biBase)
                     .add(BigInteger.valueOf(symbols.indexOf(ch)));
        }
        return num;
    }

    static String toString(BigInteger num, int base, String symbols) {
        StringBuilder sb = new StringBuilder();
        BigInteger biBase = BigInteger.valueOf(base);
        while (!num.equals(BigInteger.ZERO)) {
            sb.append(symbols.charAt(num.mod(biBase).intValue()));
            num = num.divide(biBase);
        }
        return sb.reverse().toString();
    }

    static String span(char from, char to) {
        StringBuilder sb = new StringBuilder();
        for (char ch = from; ch <= to; ch++) {
            sb.append(ch);
        }
        return sb.toString();
    }
}
Then you can have a main() test harness like the following:
public static void main(String[] args) {
    final String SYMBOLS_AZ09_ = span('A','Z') + span('0','9') + "_";
    final String SYMBOLS_09AZ = span('0','9') + span('A','Z');
    final String SYMBOLS_AZaz09 = span('A','Z') + span('a','z') + span('0','9');

    BigInteger n = fromString("GFZHFFFZFZTFZTF_24_F34", 37, SYMBOLS_AZ09_);

    // let's convert back to base 37 first...
    System.out.println(toString(n, 37, SYMBOLS_AZ09_));
    // prints "GFZHFFFZFZTFZTF_24_F34"

    // now let's see what it looks like in base 62...
    System.out.println(toString(n, 62, SYMBOLS_AZaz09));
    // prints "ctJvrR5kII1vdHKvjA4"

    // now let's test with something we're more familiar with...
    System.out.println(fromString("CAFEBABE", 16, SYMBOLS_09AZ));
    // prints "3405691582"
    n = BigInteger.valueOf(3405691582L);
    System.out.println(toString(n, 16, SYMBOLS_09AZ));
    // prints "CAFEBABE"
}
Some observations
BigInteger is probably the easiest choice if the numbers can exceed long.
You can shuffle the chars in the symbol String; just stick to one "secret" permutation.
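One way to get such a fixed permutation is to shuffle the symbol string with a seeded random generator. This is only a sketch: the class name, the helper name and the seed value are made up for illustration, and nothing here is meant to be secure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SecretAlphabet {
    // Hypothetical seed: any fixed value works as long as it never changes
    static final long SEED = 42L;

    // Returns the same symbols in a pseudo-random order determined by the seed
    static String shuffle(String symbols, long seed) {
        List<Character> chars = new ArrayList<>();
        for (char ch : symbols.toCharArray()) {
            chars.add(ch);
        }
        Collections.shuffle(chars, new Random(seed));
        StringBuilder sb = new StringBuilder(chars.size());
        for (char ch : chars) {
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String plain = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_";
        // Same seed, same permutation on every run
        System.out.println(shuffle(plain, SEED));
    }
}
```

If the codes have to stay decodable for a long time, it is probably safer to run this once and paste the resulting string into the source as a constant, so the permutation does not depend on the shuffle implementation.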
Note regarding "50% compression"
You can't generally expect the base 62 string to be around half as long as the base 37 string. Here's Long.MAX_VALUE in base 10, 20, and 30:
System.out.format("%s%n%s%n%s%n",
Long.toString(Long.MAX_VALUE, 10), // "9223372036854775807"
Long.toString(Long.MAX_VALUE, 20), // "5cbfjia3fh26ja7"
Long.toString(Long.MAX_VALUE, 30) // "hajppbc1fc207"
);
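The reason is that the number of digits grows with the logarithm of the value, so the expected length ratio between two bases is log(from) / log(to). A small sketch (class and method names are made up for illustration) that estimates the saving for the 37 and 62 symbol alphabets from the question:

```java
class LengthRatio {
    // Approximate ratio of digit counts when re-encoding from one base to another
    static double ratio(int fromBase, int toBase) {
        return Math.log(fromBase) / Math.log(toBase);
    }

    public static void main(String[] args) {
        double r = ratio(37, 62);
        // About 0.875, i.e. a saving of roughly 12%, not 50%
        System.out.println(r);
        // Upper estimate for a 22 character ID
        System.out.println(Math.ceil(22 * r));
    }
}
```

This matches the example above: the 22 character ID came out as 19 characters in base 62.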
It's not a substitution cipher at all, but your question is clear enough.
Have a look at Base85: http://en.wikipedia.org/wiki/Ascii85
For Java (as indirectly linked by the Wikipedia article):
http://java.freehep.org/freehep-io/apidocs/org/freehep/util/io/ASCII85InputStream.html
http://java.freehep.org/freehep-io/apidocs/org/freehep/util/io/ASCII85OutputStream.html
I now have a working solution which you can find here:
http://pastebin.com/Mctnidng
The problem was that a) I was losing precision in long codes through this part:
value = value.add(//
BigInteger.valueOf((long) Math.pow(alphabet.length, i)) // error here
.multiply(
BigInteger.valueOf(ArrayUtils.indexOf(alphabet, c))));
(long just wasn't long enough)
and b) whenever I had a text that started with the character at offset 0 in the alphabet, this would be dropped, so I needed to add a length character (a single character will do fine here, as my codes will never be as long as the alphabet)
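For readers who cannot reach the pastebin, here is a rough sketch of the length-character idea. It is not the code behind the link: the class, method names and alphabets are assumptions, and it only works for IDs shorter than the 62 symbol output alphabet.

```java
import java.math.BigInteger;

class LengthPrefixedCodec {
    // Unshuffled alphabets for readability; in practice they would be permuted
    static final String IN = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_";
    static final String OUT =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    static BigInteger parse(String s, String symbols) {
        BigInteger base = BigInteger.valueOf(symbols.length());
        BigInteger num = BigInteger.ZERO;
        for (char ch : s.toCharArray()) {
            int digit = symbols.indexOf(ch);
            if (digit < 0) throw new IllegalArgumentException("Bad symbol: " + ch);
            num = num.multiply(base).add(BigInteger.valueOf(digit));
        }
        return num;
    }

    static String format(BigInteger num, String symbols) {
        BigInteger base = BigInteger.valueOf(symbols.length());
        StringBuilder sb = new StringBuilder();
        while (num.signum() > 0) {
            sb.append(symbols.charAt(num.mod(base).intValue()));
            num = num.divide(base);
        }
        return sb.reverse().toString();
    }

    // First output symbol stores the ID length, the rest is the number in base 62
    static String encode(String id) {
        if (id.length() >= OUT.length()) throw new IllegalArgumentException("ID too long");
        return OUT.charAt(id.length()) + format(parse(id, IN), OUT);
    }

    // Re-create leading "zero" symbols that the numeric conversion dropped
    static String decode(String code) {
        int length = OUT.indexOf(code.charAt(0));
        String body = format(parse(code.substring(1), OUT), IN);
        StringBuilder sb = new StringBuilder();
        for (int i = body.length(); i < length; i++) {
            sb.append(IN.charAt(0));
        }
        return sb.append(body).toString();
    }

    public static void main(String[] args) {
        String id = "AAB_1"; // starts with the symbol at offset 0
        String code = encode(id);
        System.out.println(code + " -> " + decode(code));
    }
}
```

Using BigInteger for every step also avoids the Math.pow precision problem described in a).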
Related
Java Hexadecimal to Decimal conversion: Custom Logic
I am trying to figure out how to convert hex into a string and integer so I can manipulate an RGB light on my Arduino microcontroller through its serial port. I found a good example on the Java website, but I'm having a difficult time understanding some of the methods and I am getting hung up. I could easily just copy-paste this code and have it work, but I want to fully understand it. I will add comments with my understanding and hopefully someone can provide some feedback.
public class HexToDecimalExample3 {
    // this is the function which we will call later and they are declaring string hex here.
    // Can we declare string hex inside the scope..?
    public static int getDecimal(String hex) {
        // declaring string "digits" with all possible inputs in linear order for later indexing
        String digits = "0123456789ABCDEF";
        // converting string to uppercase, just "in case"
        hex = hex.toUpperCase();
        // declaring int val. I don't get this part.
        int val = 0;
        // hex.length is how long the string is I think, so we don't finish the loop
        // until all letters in string is done. pls validate this
        for (int i = 0; i < hex.length(); i++) {
            // char is completely new to me. Are we taking the characters from the string 'hex'
            // and making an indexed array of a sort? It seems similar to indexOf but non-linear?
            // help me understand this..
            char c = hex.charAt(i);
            // indexing linearly where 0=1 and A=11 and storing to an integer variable
            int d = digits.indexOf(c);
            // How do we multiply 16(bits) by val=0 to get a converted value? I do not get this..
            val = 16*val + d;
        }
        return val;
    }

    public static void main(String args[]) {
        // printing the conversions out.
        System.out.println("Decimal of a is: "+getDecimal("a"));
        System.out.println("Decimal of f is: "+getDecimal("f"));
        System.out.println("Decimal of 121 is: "+getDecimal("121"));
    }
}
To summarize the comments, it's primarily the char c = hex.charAt(i); and the val = 16*val + d; parts I don't understand.
Ok, let's go line by line.
public static int getDecimal(String hex)
hex is the parameter. It needs to be declared there so you can pass a String when you call the function.
String digits = "0123456789ABCDEF";
Yes, this declares a string with all characters which can occur in a hexadecimal number.
hex = hex.toUpperCase();
It converts the letters in the hex String to upper case so that it is consistent, i.e. you always have F and never f, no matter which is being input.
int val = 0;
This is the variable that will hold the corresponding decimal value. We will do our calculations with this variable.
for (int i = 0; i < hex.length(); i++)
hex.length() is the number of characters in the hex String provided. We execute the code inside this for loop once per character.
char c = hex.charAt(i);
Yes, char represents a single character. We retrieve the character from the hex String at index i, so in the first iteration it is the first character, in the second iteration the second character and so on.
int d = digits.indexOf(c);
We look up which index the character has in the digits String. That way we determine the decimal value of this specific digit: 0-9 stay 0-9, A becomes 10 and F becomes 15.
val = 16*val + d;
Let's think about what we have to do. We have the decimal value of the digit. But in hexadecimal this digit sits at a specific position, and that position determines what it gets multiplied by. The '1' in '100' is actually not a 1, but 100 * 1, because it is at this position. 10 in hexadecimal is 16 in decimal, because we have 1 * 16. Now the approach here is a little bit complicated. val is not uninitialized. val is 0 at the beginning and then contains the accumulated values from the previous iterations. Since the first character in the String is the highest position, we don't know directly what we have to multiply it by, because we don't know how many digits the number has (actually we do, but this approach doesn't use this). So we just add the digit value to it. In the subsequent iterations it will get multiplied by 16 to scale it up to the corresponding position value.
Let me show you an example. Take 25F as the hex number. The first iteration takes the 2, converts it to a 2 and adds it to val. The 16 * val resolves to 0, so it has no effect the first time. The next iteration multiplies the 2 by 16, takes the 5 (converted to 5) and adds it to val. So now we have (I split it mathematically so you understand it): 2 * 16 + 5. Next we get the F, which is decimal 15. We multiply val by 16 and add the 15. We get 2 * 256 + 5 * 16 + 15 (* 1), which is 607 and is exactly how you calculate the decimal value of this hex value mathematically.
Another possibility to compute val is: val += Math.pow(16, hex.length() - i - 1) * d;
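The 25F walk-through can be reproduced with a small tracing variant of the method. This is a sketch for illustration only: the class name and the printed messages are made up, and the validity check is an addition that the original method does not have.

```java
class HornerTrace {
    static int getDecimal(String hex) {
        String digits = "0123456789ABCDEF";
        int val = 0;
        for (char c : hex.toUpperCase().toCharArray()) {
            int d = digits.indexOf(c);
            if (d < 0) {
                throw new IllegalArgumentException("Not a hex digit: " + c);
            }
            // Shift everything seen so far one hex position left, then add the new digit
            val = 16 * val + d;
            System.out.println("after '" + c + "': val = " + val);
        }
        return val;
    }

    public static void main(String[] args) {
        // val goes through 2, then 37 (2 * 16 + 5), then 607 (37 * 16 + 15)
        getDecimal("25F");
    }
}
```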
Getting letter from integer index
I wish to have a java method which gives me, index given, a corresponding letter set excel like, so: 258 => IZ (last index) 30 => AD 120 => DR 56 => BD First method gives correct output, but it's very dumb and I don't like that. I tried to build a second method that involves a bit of thinking. I already saw methods using String Builder or something else like this one but I tried to build a method myself aka betterGetColumnName. better 258 => IHGFEDCBAX (not ok) better 30 => AD (OK, 2nd alphabet round it's ok) better 120 => DCBAP (not ok) better 56 => BAD (seems like 3rd alphabet round breaks my logic) public String getColumnName(int index){ String[] letters = { "A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R", "S","T","U","V","W","X","Y","Z","AA","AB","AC","AD","AE","AF","AG","AH", "AI","AJ","AK","AL","AM","AN","AO","AP","AQ","AR","AS","AT","AU","AV", "AW","AX","AY","AZ","BA","BB","BC","BD","BE","BF","BG","BH","BI","BJ", "BK","BL","BM","BN","BO","BP","BQ","BR","BS","BT","BU","BV","BW","BX", "BY","BZ","CA","CB","CC","CD","CE","CG","CH","CI","CJ","CK","CL","CM", "CN","CO","CP","CQ","CR","CS","CT","CU","CV","CW","CX","CY","CZ","DA", "DB","DC","DD","DF","DG","DH","DI","DJ","DK","DL","DM","DN","DO","DP", "DQ","DR","DS","DT","DU","DV","DW","DX","DY","DZ","EA","EB","EC","ED", "EE","EF","EG","EH","EI","EJ","EK","EL","EM","EN","EO","EP","EQ","ER", "ES","ET","EU","EV","EW","EX","EY","EZ","FA","FB","FC","FD","FE","FF", "FG","FH","FI","FJ","FK","FL","FM","FN","FO","FP","FQ","FR","FS","FT", "FU","FV","FW","FX","FY","FZ","GA","GB","GC","GD","GE","GF","GG","GH", "GI","GJ","GK","GL","GM","GN","GO","GP","GQ","GR","GS","GT","GU","GV", "GW","GX","GY","GZ","HA","HB","HC","HD","HE","HF","HG","HH","HI","HJ", "HK","HL","HM","HN","HO","HP","HQ","HR","HS","HT","HU","HV","HW","HX", "HY","HZ","IA","IB","IC","ID","IE","IF","IG","IH","II","IJ","IK","IL", "IM","IN","IO","IP","IQ","IR","IS","IT","IU","IV","IW","IX","IY","IZ" }; if (index<=letters.length){ return 
letters[index-1]; }else{ return null; } } I think I should save how many times I made a full alphabet round, I wouldn't use StringBuilder or else, just char, String and integers because at school we can't upgrade java version (1.5.x) also I think it might be useful for me to understand why is my logic so wrong. public String betterGetColumnName(int index){ int res=0; String s = ""; char h='0'; while(index>26){ res=index/26; h=(char)(res+64); s+=h; index -=26; } h=(char)(index+64); s+=h; return s; }
You are definitely on the right track, though your logic is a bit off. What you are effectively trying to do is to convert a base 10 integer into a base 26 string. But instead of digits, the converted "number" actually consists of the 26 letters of the alphabet. The algorithm you want here is to determine each letter of the output by taking the remainder of the input number divided by 26. Then, divide the input by 26 and again inspect the "tens" position to see what letter it is. In the code snippet below, I assume that 1 corresponds to A, 26 corresponds to Z, and 27 to AA. You may shift the indices however you feel is best.
int input = 53;
String output = "";
while (input > 0) {
    int num = (input - 1) % 26;
    char letter = (char)(num+65);
    output = letter + output;
    input = (input-1) / 26;
}
System.out.println(output);
Output:
BA
Note: A helpful edit was suggested which uses StringBuilder instead of String to do the concatenations. While this might be more optimal than the above code, it might make it harder to see the algorithm.
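For reference, the same loop wrapped in a method and using StringBuilder could look roughly like this. The class and method names are made up, and it keeps the assumption that 1 maps to A.

```java
class ColumnNames {
    // 1 -> A, 26 -> Z, 27 -> AA, following the numbering assumed above
    static String columnName(int index) {
        if (index < 1) {
            throw new IllegalArgumentException("index must be >= 1");
        }
        StringBuilder sb = new StringBuilder();
        while (index > 0) {
            int num = (index - 1) % 26;      // 0..25 for the current letter
            sb.append((char) ('A' + num));   // letters are collected right to left
            index = (index - 1) / 26;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(columnName(53)); // BA, as in the snippet above
    }
}
```

The "- 1" before both the modulo and the division is what makes Z roll over to AA instead of producing a zero digit, which plain base 26 would do.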
Can I multiply charAt in Java?
When I try to do arithmetic on charAt results I get a "big" number: String s = "25999993654"; System.out.println(s.charAt(0)+s.charAt(1)); Result: 103. But when I print only one character it's OK. From the Java documentation: "the character at the specified index of this string. The first character is at index 0." So I need an explanation or a solution (I think that I should convert the string to int, but it seems to me that this is unnecessary work).
char is an integral type. The value of s.charAt(0) in your example is the char version of the number 50 (the character code for '2'). s.charAt(1) is (char)53. When you use + on them, they're converted to ints, and you end up with 103 (not 7). If you're trying to use the numbers 2 and 5, yes, you'll have to parse them. Or if you know they're standard ASCII-style digits (character codes 48 through 57, inclusive), you can just subtract 48 from them (as 48 is the character code for '0'). Or better yet, as Peter Lawrey points out elsewhere, use Character.getNumericValue, which handles a broader range of characters.
Yes, you should parse the extracted digit, or rely on the ASCII table and subtract 48:
public final class Test {
    public static void main(String[] a) {
        String s = "25999993654";
        System.out.println(intAt(s, 0) + intAt(s, 1));
    }

    public static int intAt(String s, int index) {
        return Integer.parseInt(""+s.charAt(index));
        // or
        // return (int) s.charAt(index) - 48;
    }
}
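A compact way to see the difference between character codes and digit values is sketched below. It uses Character.digit, which returns -1 for anything that is not a digit in the given radix; the class and method names are made up for illustration.

```java
class DigitAt {
    // Numeric value of the decimal digit at the given position
    static int digitAt(String s, int index) {
        int d = Character.digit(s.charAt(index), 10);
        if (d < 0) {
            throw new IllegalArgumentException("Not a digit: " + s.charAt(index));
        }
        return d;
    }

    public static void main(String[] args) {
        String s = "25999993654";
        System.out.println(s.charAt(0) + s.charAt(1));      // 103: adds character codes 50 and 53
        System.out.println(digitAt(s, 0) + digitAt(s, 1));  // 7: adds the digits 2 and 5
        System.out.println(digitAt(s, 0) * digitAt(s, 1));  // 10: multiplies the digits
    }
}
```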
Short, case-insensitive string obfuscation strategy
I am looking for a way to identify (i.e. encode and decode) a set of Java strings with one token. The identification should not involve DB persistence. So far I have looked into Base64 encoding and DES encryption, but both are not optimal with respect to the following requirements:
Token should be as short as possible
Token should be insensitive to casing
Token should survive a URLEncoder/Decoder round-trip (i.e. will be used in URLs)
Is Base32 my best shot or are there better options? Note that I'm primarily interested in shortening & obfuscating the set, encryption/security is not important.
What is the structure of the text (i.e. the set of strings)? You could use your knowledge of it to encode it in a shorter form. E.g. if you have a large decimal number such as "1234567890", you could translate it into a base 36 number, which will be shorter. Otherwise it looks like you are trying to invent a universal archiver. If you don't care about length, then yes, processing with an alphabet-based encoder (such as Base32) is the only choice. Also, if the text is large enough, maybe you could save some space by gzipping it.
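As a rough illustration of the base 36 idea, the sketch below assumes the input is a decimal number that fits into a long and has no leading zeros (class and method names are made up). The result uses only digits and lowercase letters, so it can be treated as case-insensitive and is URL-safe.

```java
class Base36Demo {
    // Re-encode a decimal string in base 36 (digits 0-9 and letters a-z)
    static String shorten(String decimal) {
        return Long.toString(Long.parseLong(decimal), 36);
    }

    // Long.parseLong accepts upper- and lowercase letters for radix 36
    static String restore(String base36) {
        return Long.toString(Long.parseLong(base36, 36));
    }

    public static void main(String[] args) {
        String token = shorten("1234567890");
        System.out.println(token);                         // kf12oi
        System.out.println(restore(token.toUpperCase()));  // 1234567890
    }
}
```

Ten decimal digits shrink to six characters here; for numbers beyond the range of long, BigInteger offers the same radix conversions.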
Rot13 obfuscates but does not shorten. Zip shortens (usually) but does not survive the URL round trip. Encryption will not shorten, and may lengthen. Hashing shortens but is one-way. You do not have an easy problem. Base32 is case insensitive, but takes more space than Base64, which isn't. I suspect that you are going to have to drop or modify your requirements. Which requirements are most important and which least important?
I have spent some time on this and I have a good solution for you. Encode as base64, then as a custom base32 that uses 0-9a-v. Essentially, you lay out the bits 6 at a time (your chars are 0-9a-zA-Z), then encode them 5 at a time. This leads to hardly any extra space. For example, ABCXYZdefxyz123789 encodes as i9crnsuj9ov1h8o4433i14. Here's an implementation that works, including some test code that shows it is case-insensitive:
// Note: You can add 1 more char to this if you want to
static String chars = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

private static String decodeToken(String encoded) {
    // Lay out the bits 5 at a time
    StringBuilder sb = new StringBuilder();
    for (byte b : encoded.toLowerCase().getBytes())
        sb.append(asBits(chars.indexOf(b), 5));
    // Drop the padding bits that generateToken added
    sb.setLength(sb.length() - (sb.length() % 6));
    // Consume it 6 bits at a time
    int length = sb.length();
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < length; i += 6)
        result.append(chars.charAt(Integer.parseInt(sb.substring(i, i + 6), 2)));
    return result.toString();
}

private static String generateToken(String x) {
    StringBuilder sb = new StringBuilder();
    for (byte b : x.getBytes())
        sb.append(asBits(chars.indexOf(b), 6));
    // Round up to a 5 bit multiple (pads 0 to 4 zero bits)
    sb.append("00000".substring(0, (5 - sb.length() % 5) % 5));
    // Consume it 5 bits at a time
    int length = sb.length();
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < length; i += 5)
        result.append(chars.charAt(Integer.parseInt(sb.substring(i, i + 5), 2)));
    return result.toString();
}

private static String asBits(int index, int width) {
    String bits = "000000" + Integer.toBinaryString(index);
    return bits.substring(bits.length() - width);
}

// Assert is assumed to be JUnit 4's org.junit.Assert
public static void main(String[] args) {
    String input = "ABCXYZdefxyz123789";
    String token = generateToken(input);
    System.out.println(input + " ==> " + token);
    Assert.assertEquals("mixed", input, decodeToken(token));
    Assert.assertEquals("lower", input, decodeToken(token.toLowerCase()));
    Assert.assertEquals("upper", input, decodeToken(token.toUpperCase()));
    System.out.println("pass");
}
How to convert Java long's as Strings while keeping natural order
I'm currently looking at a simple programming problem that might be fun to optimize, at least for anybody who believes that programming is art :) So here it is: How to best represent longs as Strings while keeping their natural order? Additionally, the String representation should match ^[A-Za-z0-9]+$. (I'm not too strict here, but avoid using control characters or anything that might cause headaches with encodings, is illegal in XML, has line breaks, or similar characters that will certainly cause problems.)
Here's a JUnit test case:
@Test
public void longConversion() {
    final long[] longs = {
        Long.MIN_VALUE, Long.MAX_VALUE,
        -5664572164553633853L, -8089688774612278460L,
        7275969614015446693L, 6698053890185294393L,
        734107703014507538L, -350843201400906614L,
        -4760869192643699168L, -2113787362183747885L,
        -5933876587372268970L, -7214749093842310327L,
    };
    // keep it reproducible
    //Collections.shuffle(Arrays.asList(longs));
    final String[] strings = new String[longs.length];
    for (int i = 0; i < longs.length; i++) {
        strings[i] = Converter.convertLong(longs[i]);
    }
    // Note: Comparator is not an option
    Arrays.sort(longs);
    Arrays.sort(strings);
    final Pattern allowed = Pattern.compile("^[A-Za-z0-9]+$");
    for (int i = 0; i < longs.length; i++) {
        assertTrue("string: " + strings[i], allowed.matcher(strings[i]).matches());
        assertEquals("string: " + strings[i], longs[i], Converter.parseLong(strings[i]));
    }
}
and here are the methods I'm looking for:
public static class Converter {
    public static String convertLong(final long value) {
        // TODO
    }

    public static long parseLong(final String value) {
        // TODO
    }
}
I already have some ideas on how to approach this problem. Still, I thought I might get some nice (creative) suggestions from the community.
Additionally, it would be nice if this conversion would be
as short as possible
easy to implement in other languages
EDIT: I'm quite glad to see that two very reputable programmers ran into the same problem as I did: using '-' for negative numbers can't work, as the '-' doesn't reverse the order of sorting:
-0001
-0002
0000
0001
0002
Ok, take two:
class Converter {
    public static String convertLong(final long value) {
        return String.format("%016x", value - Long.MIN_VALUE);
    }

    public static long parseLong(final String value) {
        String first = value.substring(0, 8);
        String second = value.substring(8);
        long temp = (Long.parseLong(first, 16) << 32) | Long.parseLong(second, 16);
        return temp + Long.MIN_VALUE;
    }
}
This one takes a little explanation. Firstly, let me demonstrate that it is reversible and that the resulting conversions preserve the ordering:
for (long aLong : longs) {
    String out = Converter.convertLong(aLong);
    System.out.printf("%20d %16s %20d\n", aLong, out, Converter.parseLong(out));
}
Output:
-9223372036854775808 0000000000000000 -9223372036854775808
 9223372036854775807 ffffffffffffffff  9223372036854775807
-5664572164553633853 316365a0e7370fc3 -5664572164553633853
-8089688774612278460 0fbba6eba5c52344 -8089688774612278460
 7275969614015446693 e4f96fd06fed3ea5  7275969614015446693
 6698053890185294393 dcf444867aeaf239  6698053890185294393
  734107703014507538 8a301311010ec412   734107703014507538
 -350843201400906614 7b218df798a35c8a  -350843201400906614
-4760869192643699168 3dedfeb1865f1e20 -4760869192643699168
-2113787362183747885 62aa5197ea53e6d3 -2113787362183747885
-5933876587372268970 2da6a2aeccab3256 -5933876587372268970
-7214749093842310327 1be00fecadf52b49 -7214749093842310327
As you can see, Long.MIN_VALUE and Long.MAX_VALUE (the first two rows) are correct and the other values basically fall in line.
What is this doing? Assuming signed byte values you have:
-128 => 0x80
-1 => 0xFF
0 => 0x00
1 => 0x01
127 => 0x7F
Now if you add 0x80 to those values you get:
-128 => 0x00
-1 => 0x7F
0 => 0x80
1 => 0x81
127 => 0xFF
which is the correct order (with overflow). Basically the above is doing that with 64 bit signed longs instead of 8 bit signed bytes.
The conversion back is a little more roundabout. You might think you can use:
return Long.parseLong(value, 16);
but you can't. Pass in 16 f's to that function (-1) and it will throw an exception. It treats the digits as a positive hex magnitude, which a signed long cannot accommodate. So instead I split it in half and parse each piece, combining them together, left-shifting the first half by 32 bits.
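On Java 8 or later, the split-and-shift step can probably be avoided with Long.parseUnsignedLong. The sketch below is a variation on the answer above, not the answer's own code; the class name is made up, and it relies on the fact that flipping the sign bit has the same effect as subtracting Long.MIN_VALUE with overflow.

```java
class SortableLong {
    // Flip the sign bit so negative values sort before positive ones as unsigned hex
    static String convertLong(long value) {
        return String.format("%016x", value ^ Long.MIN_VALUE);
    }

    // parseUnsignedLong accepts the full 64-bit range, unlike Long.parseLong
    static long parseLong(String value) {
        return Long.parseUnsignedLong(value, 16) ^ Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] samples = { Long.MIN_VALUE, -1L, 0L, 1L, Long.MAX_VALUE };
        for (long sample : samples) {
            String out = convertLong(sample);
            System.out.println(sample + " " + out + " " + parseLong(out));
        }
    }
}
```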
EDIT: Okay, so just adding the negative sign for negative numbers doesn't work... but you could convert the value into an effectively "unsigned" long such that Long.MIN_VALUE maps to "0000000000000000", and Long.MAX_VALUE maps to "ffffffffffffffff". Harder to read, but will get the right results. Basically you just need to add 2^63 to the value before turning it into hex, but that may be a slight pain to do in Java due to it not having unsigned longs... it may be easiest to do using BigInteger:
private static final BigInteger OFFSET = BigInteger.valueOf(Long.MIN_VALUE).negate();

public static String convertLong(long value) {
    BigInteger afterOffset = BigInteger.valueOf(value).add(OFFSET);
    return String.format("%016x", afterOffset);
}

public static long parseLong(String text) {
    BigInteger beforeOffset = new BigInteger(text, 16);
    return beforeOffset.subtract(OFFSET).longValue();
}
That wouldn't be terribly efficient, admittedly, but it works with all your test cases.
If you don't need a printable String, you can encode the long in four chars after you've shifted the value by Long.MIN_VALUE (-0x8000000000000000L) to emulate an unsigned long:
public static String convertLong(long value) {
    value += Long.MIN_VALUE;
    return "" + (char)(value>>48) + (char)(value>>32) + (char)(value>>16) + (char)value;
}

public static long parseLong(String value) {
    return (
        (((long)value.charAt(0))<<48) +
        (((long)value.charAt(1))<<32) +
        (((long)value.charAt(2))<<16) +
        (long)value.charAt(3)) + Long.MIN_VALUE;
}
Usage of surrogate pairs is not a problem, since the natural order of a string is defined by the UTF-16 values in its chars and not by the Unicode code point values.
There's a technique in RFC2550 -- an April 1st joke RFC about the Y10K problem with 4-digit dates -- that could be applied to this purpose. Essentially, each time the integer's string representation grows to require another digit, another letter or other (printable) character is prepended to retain desired sort-order. The negative rules are more arcane, yielding strings that are harder to read at a glance... but still easy enough to apply in code. Nicely, for positive numbers, they're still readable. See: http://www.faqs.org/rfcs/rfc2550.html