I am looking for a way to identify (i.e. encode and decode) a set of Java strings with one token. The identification should not involve DB persistence. So far I have looked into Base64 encoding and DES encryption, but both are not optimal with respect to the following requirements:
Token should be as short as possible
Token should be insensitive to casing
Token should survive a URLEncoder/Decoder round-trip (i.e. will be used in URLs)
Is Base32 my best shot or are there better options? Note that I'm primarily interested in shortening & obfuscating the set, encryption/security is not important.
What's a structure of the text (i.e. set of strings)? You could use your knowledge of it to encode it in a shorten form. E.g. if you have large base-decimal number "1234567890" you could translate it into 36-base number, which will be shorter.
Otherwise it looks like you are trying invent an universal archiver.
If you don't care about length, then yes, processing by alphabet based encoder (such as Base32) is the only choice.
Also, if text is large enough, maybe you could save some space by gzipping it.
Rot13 obfuscates but does not shorten. Zip shortens (usually) but does not survive the URL round trip. Encryption will not shorten, and may lengthen. Hashing shortens but is one-way. You do not have an easy problem. Base32 is case insensitive, but takes more space than Base64, which isn't. I suspect that you are going to have to drop or modify your requirements. Which requirements are most important and which least important?
I have spent some time on this and I have a good solution for you.
Encode as base64 then as a custom base32 that uses 0-9a-v. Essentially, you lay out the bits 6 at a time (your chars are 0-9a-zA-Z) then encode them 5 at a time. This leads to hardly any extra space. For example, ABCXYZdefxyz123789 encodes as i9crnsuj9ov1h8o4433i14
Here's an implementation that works, including some test code that proves it is case-insensitive:
// Note: You can add 1 more char to this if you want to
static String chars = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static String decodeToken(String encoded) {
// Lay out the bits 5 at a time
StringBuilder sb = new StringBuilder();
for (byte b : encoded.toLowerCase().getBytes())
sb.append(asBits(chars.indexOf(b), 5));
sb.setLength(sb.length() - (sb.length() % 6));
// Consume it 6 bits at a time
int length = sb.length();
StringBuilder result = new StringBuilder();
for (int i = 0; i < length; i += 6)
result.append(chars.charAt(Integer.parseInt(sb.substring(i, i + 6), 2)));
return result.toString();
}
private static String generateToken(String x) {
StringBuilder sb = new StringBuilder();
for (byte b : x.getBytes())
sb.append(asBits(chars.indexOf(b), 6));
// Round up to 5 bit multiple
// Consume it 5 bits at a time
int length = sb.length();
sb.append("00000".substring(0, length % 5));
StringBuilder result = new StringBuilder();
for (int i = 0; i < length; i += 5)
result.append(chars.charAt(Integer.parseInt(sb.substring(i, i + 5), 2)));
return result.toString();
}
private static String asBits(int index, int width) {
String bits = "000000" + Integer.toBinaryString(index);
return bits.substring(bits.length() - width);
}
public static void main(String[] args) {
String input = "ABCXYZdefxyz123789";
String token = generateToken(input);
System.out.println(input + " ==> " + token);
Assert.assertEquals("mixed", input, decodeToken(token));
Assert.assertEquals("lower", input, decodeToken(token.toLowerCase()));
Assert.assertEquals("upper", input, decodeToken(token.toUpperCase()));
System.out.println("pass");
}
Related
I need to represent both very large and small numbers in the shortest string possible. The numbers are unsigned. I have tried just straight Base64 encode, but for some smaller numbers, the encoded string is longer than just storing the number as a string. What would be the best way to most optimally store a very large or short number in the shortest string possible with it being URL safe?
I have tried just straight Base64 encode, but for some smaller numbers, the encoded string is longer than just storing the number as a string
Base64 encoding of binary byte data will make it longer, by about a third. It is not supposed to make it shorter, but to allow safe transport of binary data in formats that are not binary safe.
However, base 64 is more compact than decimal representation of a number (or of byte data), even if it is less compact than base 256 (the raw byte data). Encoding your numbers in base 64 directly will make them more compact than decimal. This will do it:
private static final String base64Chars =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
static String encodeNumber(long x) {
char[] buf = new char[11];
int p = buf.length;
do {
buf[--p] = base64Chars.charAt((int)(x % 64));
x /= 64;
} while (x != 0);
return new String(buf, p, buf.length - p);
}
static long decodeNumber(String s) {
long x = 0;
for (char c : s.toCharArray()) {
int charValue = base64Chars.indexOf(c);
if (charValue == -1) throw new NumberFormatException(s);
x *= 64;
x += charValue;
}
return x;
}
Using this encoding scheme, Long.MAX_VALUE will be the string H__________, which is 11 characters long, compared to its decimal representation 9223372036854775807 which is 19 characters long. Numbers up to about 16 million will fit in a mere 4 characters. That's about as short as you'll get it. (Technically there are two other characters which do not need to be encoded in URLs: . and ~. You can incorporate those to get base 66, which would be a smidgin shorter for some numbers, although that seems a bit pedantic.)
To extend on Stephen C's answer, here is a piece of code to convert to base 62 (but you can increase this by adding more characters to the digits String (just pick what characters are valid for you):
public static String toString(long n) {
String digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
int base = digits.length();
String s = "";
while (n > 0) {
long d = n % base;
s = digits.charAt(d) + s;
n = n / base;
}
return s;
}
This will never result in the string representation being longer than the digit one.
Assuming that you don't do any compression, and that you restrict yourself to URL safe characters, then the following procedure will give you the most compact encoding possible.
Make a list of all URL safe characters
Count them. Suppose you have N.
Represent your number in base N, representing 0 by the first character, 1 by the 2nd and so on.
So, what about compression ...
If you assume that the numbers you are representing are uniformly distributed across their range, then there is no real opportunity for compression.
Otherwise, there is potential for compression. If you can reduce the size of the common numbers then you can typically achieve a saving by compression. This is how Huffman encoding works.
But the downside is that compression at this level is not perfect across the range of numbers. It reduces the size of some numbers, but it inevitably increases the size of others.
So what does this mean for your use-case?
I think it means that you are looking at the problem the wrong way. You should not be aiming for a minimal encoded size for every number. You should be aiming to minimize the size on average ... averaged over the actual distribution of your numbers.
I'm here:
String password = "123";
byte passwordByte[] = password.getBytes();
MessageDigest md = MessageDigest.getInstance("SHA-512");
byte passwortHashByte[] = md.digest(passwordByte);
The passwortHashByte-Array contains only a lot of numbers. I want to convernt this numbers to one String which contains the hash-code as plaintext.
How i do this?
I want to convernt this numbers to one String which contains the hash-code as plaintext.
The hash isn't plain-text. It's binary data - arbitrary bytes. That isn't plaintext any more than an MP3 file is.
You need to work out what textual representation you want to use for the binary data. That in turn depends on what you want to use the data for. For the sake of easy diagnostics I'd suggest a pure-ASCII representation - probably either base64 or hex. If you need to easily look at individual bytes, hex is simpler to read, but base64 is a bit more compact.
It's also important to note that MD5 isn't a particularly good way of hashing passwords... and it looks like you're not even salting them. It may be good enough for a demo app which will never be released into the outside world, but you should really look into more secure approaches. See Jeff Atwood's blog post on the topic for an introduction, and ideally get hold of a book about writing secure code.
Here is how I did it for my website.
private static byte[] fromHex(String hex) {
byte[] bytes = new byte[hex.length() / 2];
for (int i = 0; i < hex.length() / 2; i++) {
bytes[i] = (byte)(Character.digit(hex.charAt(i * 2), 16) * 16 + Character.digit(hex.charAt(i * 2 + 1), 16) - 128);
}
return bytes;
}
private static String toHex(byte[] bytes) {
String hex = new String();
for (int i = 0; i < bytes.length; i++) {
String c = Integer.toHexString(bytes[i] + 128);
if (c.length() == 1) c = "0" + c;
hex = hex + c;
}
return hex;
}
That'll allow you to convert your byte array to and from a hex string.
Well, byte passwortHashByte[] = md.digest(passwordByte); can contain some controll characters, which will broke your String. Consider encoding passwortHashByte[] to Base64 form, if you really need String from it. You can use Apache Commons Codecs for making base64 form.
I would like to implement a simple substitution cipher to mask private ids in URLs.
I know how my IDs will look like (combination of uppercase ASCII letters, digits and underscore), and they will be rather long, as they are composed keys. I would like to use a longer alphabet to shorten the resulting codes (I'd like to use upper- and lowercase ASCII letters, digits and nothing else). So my incoming alphabet would be
[A-Z0-9_] (37 chars)
and my outgoing alphabet would be
[A-Za-z0-9] (62 chars)
so a compression of almost 50% reasonable amount of compression would be available.
Let's say my URLs look like this:
/my/page/GFZHFFFZFZTFZTF_24_F34
and I want them to look like this instead:
/my/page/Ft32zfegZFV5
Obviously both arrays would be shuffled to bring some random order in.
This does not have to be secure. If someone figures it out: fine, but I don't want the scheme to be obvious.
My desired solution would be to convert the string to an integer representation of radix 37, convert the radix to 62 and use the second alphabet to write out that number. is there any sample code available that does something similar? Integer.parseInt() has some similar logic, but it is hard-coded to use standard digit behavior.
Any ideas?
I am using Java to implement this but code or pseudo-code in any other language is of course also helpful.
Inexplicably Character.MAX_RADIX is only 36, but you can always write your own base conversion routine. The following implementation isn't high-performance, but it should be a good starting point:
import java.math.BigInteger;
public class BaseConvert {
static BigInteger fromString(String s, int base, String symbols) {
BigInteger num = BigInteger.ZERO;
BigInteger biBase = BigInteger.valueOf(base);
for (char ch : s.toCharArray()) {
num = num.multiply(biBase)
.add(BigInteger.valueOf(symbols.indexOf(ch)));
}
return num;
}
static String toString(BigInteger num, int base, String symbols) {
StringBuilder sb = new StringBuilder();
BigInteger biBase = BigInteger.valueOf(base);
while (!num.equals(BigInteger.ZERO)) {
sb.append(symbols.charAt(num.mod(biBase).intValue()));
num = num.divide(biBase);
}
return sb.reverse().toString();
}
static String span(char from, char to) {
StringBuilder sb = new StringBuilder();
for (char ch = from; ch <= to; ch++) {
sb.append(ch);
}
return sb.toString();
}
}
Then you can have a main() test harness like the following:
public static void main(String[] args) {
final String SYMBOLS_AZ09_ = span('A','Z') + span('0','9') + "_";
final String SYMBOLS_09AZ = span('0','9') + span('A','Z');
final String SYMBOLS_AZaz09 = span('A','Z') + span('a','z') + span('0','9');
BigInteger n = fromString("GFZHFFFZFZTFZTF_24_F34", 37, SYMBOLS_AZ09_);
// let's convert back to base 37 first...
System.out.println(toString(n, 37, SYMBOLS_AZ09_));
// prints "GFZHFFFZFZTFZTF_24_F34"
// now let's see what it looks like in base 62...
System.out.println(toString(n, 62, SYMBOLS_AZaz09));
// prints "ctJvrR5kII1vdHKvjA4"
// now let's test with something we're more familiar with...
System.out.println(fromString("CAFEBABE", 16, SYMBOLS_09AZ));
// prints "3405691582"
n = BigInteger.valueOf(3405691582L);
System.out.println(toString(n, 16, SYMBOLS_09AZ));
// prints "CAFEBABE"
}
Some observations
BigInteger is probably easiest if the numbers can exceed long
You can shuffle the char in the symbol String, just stick to one "secret" permutation
Note regarding "50% compression"
You can't generally expect the base 62 string to be around half as short as the base 36 string. Here's Long.MAX_VALUE in base 10, 20, and 30:
System.out.format("%s%n%s%n%s%n",
Long.toString(Long.MAX_VALUE, 10), // "9223372036854775807"
Long.toString(Long.MAX_VALUE, 20), // "5cbfjia3fh26ja7"
Long.toString(Long.MAX_VALUE, 30) // "hajppbc1fc207"
);
It's not a substitution cipher at all, but your question is clear enough.
Have a look at Base85: http://en.wikipedia.org/wiki/Ascii85
For Java (as indirectly linked by the Wikipedia article):
http://java.freehep.org/freehep-io/apidocs/org/freehep/util/io/ASCII85InputStream.html
http://java.freehep.org/freehep-io/apidocs/org/freehep/util/io/ASCII85OutputStream.html
I now have a working solution which you can find here:
http://pastebin.com/Mctnidng
The problem was that a) I was losing precision in long codes through this part:
value = value.add(//
BigInteger.valueOf((long) Math.pow(alphabet.length, i)) // error here
.multiply(
BigInteger.valueOf(ArrayUtils.indexOf(alphabet, c))));
(long just wasn't long enough)
and b) whenever I had a text that started with the character at offset 0 in the alphabet, this would be dropped, so I needed to add a length character (a single character will do fine here, as my codes will never be as long as the alphabet)
What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding would be only known during runtime. In UTF-8 for example the characters have a variable byte length, so each character needs to be determined individually. As far now I've come up with this:
char c = getCharSomehow();
String encoding = getEncodingSomehow();
// ...
int length = new String(new char[] { c }).getBytes(encoding).length;
But this is clumsy and inefficient in a loop since a new String needs to be created everytime. I can't find other and more efficient ways in the Java API. There's a String#valueOf(char), but according its source it does basically the same as above. I imagine that this can be done with bitwise operations like bit shifting, but that's my weak point and I'm unsure how to take the encoding into account here :)
If you question the need for this, check this topic.
Update: the answer from #Bkkbrad is technically the most efficient:
char c = getCharSomehow();
String encoding = getEncodingSomehow();
CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
// ...
int length = encoder.encode(CharBuffer.wrap(new char[] { c })).limit();
However as #Stephen C pointed out, there are more problems with this. There may for example be combined/surrogate characters which needs to be taken into account as well. But that's another problem which needs to be solved in the step before this step.
Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.
On my system, the following code takes 25 seconds to encode 100,000 single characters:
Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
for (array[0] = 0; array[0] < 10000; array[0]++) {
int len = new String(array).getBytes(utf8).length;
}
}
However, the following code does the same thing in under 4 seconds:
Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
for (array[0] = 0; array[0] < 10000; array[0]++) {
output.clear();
input.clear();
encoder.encode(input, output, false);
int len = output.position();
}
}
Edit: Why do haters gotta hate?
Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:
Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = //allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);
int limit = input.limit();
while(input.position() < limit) {
output.clear();
input.mark();
input.limit(Math.max(input.position() + 2, input.capacity()));
if (Character.isHighSurrogate(input.get()) && !Character.isLowSurrogate(input.get())) {
//Malformed surrogate pair; do something!
}
input.limit(input.position());
input.reset();
encoder.encode(input, output, false);
int encodedLen = output.position();
}
If you can guarantee that the input is well-formed UTF-8, then there's no reason to find code points at all. One of the strengths of UTF-8 is that you can detect the start of a code point from any position in the string. Simply search backwards until you find a byte such that (b & 0xc0) != 0x80, and you've found another character. Since a UTF-8 encoded code point is always 6 bytes or less, you can copy the intermediate bytes into a fixed-length buffer.
Edit: I forgot to mention, even if you don't go with this strategy, it is not sufficient to use a Java "char" to store arbitrary code points since code point values can exceed 0xffff. You need to store code points in an "int".
It is possible that an encoding scheme could encode a given character as a variable number of bytes, depending on what comes before and after it in the character sequence. The byte length you get from encoding a single character String is therefore not the whole answer.
(For example, you could theoretically receive a baudot / teletype characters encoded as 4 characters every 3 bytes, or you could theoretically treat a UTF-16 + a stream compressor as an encoding scheme. Yes, it is all a bit implausible, but ...)
Try Charset.forName("UTF-8").encode("string").limit(); Might be a bit more efficient, maybe not.
I'm looking for a simple way to translate long to String and back in a way that will "hide" the long value.
I'd prefer to avoid adding another .jar to the project for this feature.
It does not have to be a hard-to-crack encryption, just to look random to the inexperienced eye.
Addition:
My purpose here is to attach a counter value (of type long) to URLs as a tracking parameter without the users aware of the counter's value (sort of like tinyURL's hash), so that the servlet will know the value of the counter when the URL is clicked.
Thanks
If
X * Y = 1 (mod 2^32)
then
A * (X * Y) = A (mod 2^32)
(A * X) * Y = A (mod 2^32)
So, you can "encrypt" some 32-bit number by multiplying it by X, and then later "decrypt" by multiplying by Y. All you need is to find some non-trivial X and Y that satisfy the condition.
For instance, (X, Y) = (3766475841, 1614427073), or (699185821, 3766459317). I just found these with a simple brute force program.
Once you have A*X you can encode it using Base-64 or hexadecimal or some similar scheme in the URL. I'd suggest Base64 because it takes up less space and looks fairly "random".
If you're just obfuscating, something trivial like
long -> string -> char array
for each character, XOR with the previous output value (with some arbitrary value for before the first character)
char array -> string
To reveal for each character, XOR with the previous obfuscated input value (with some arbitrary value for before the first character)
Uhm… without further details it’s hard to come up with a solution on this. You can do lots of different things to accomplish your goal… I guess.
You could XOR it with a random number and store both numbers. This obviously won’t work when your algorithm is out in the open (e.g. if you want to put it in open source).
Or you could use some symmetric encryption algorithm, e.g. Twofish or Serpent.
All of this is rather futile if the number is stored on the client system and evaluated by a client application. As soon as you hand out your application clients can access everything it stores. If the information is worth it, decryption programs will be available soon after your application is released. If it’s not worth it, then why bother?
I like to use Long.toString(value, 36). This prints the number in base 36 which is compact, though cryptic (provided the number is >= 10) You can parse it with Long.parseLong(text, 36)
XOR with an one-time pad (OTP) provides perfect secrecy. I wrote a cipher using OTP to encrypt integers. I just adapted it to long to meet your requirement. It encrypts a long into a 24 char hex string with salt prepended,
-3675525778535888036 => 4fe555ca33021738a3797ab2
-6689673470125604264 => 76092fda5cd67e93b18b4f2f
8956473951386520443 => 0fb25e533be315bdb6356a2a
4899819575233977659 => 7cf17d74d6a2968370fbe149
Here is the code,
public class IntegerCipher {
// This is for salt so secure random is not needed
private static Random PRNG = new Random();
private String secret;
public IntegerCipher(String secret) {
if (secret.length() < 4)
throw new IllegalArgumentException("Secret is too short");
this.secret = secret;
}
public String encrypt(long value) {
int salt = PRNG.nextInt();
long otp = generatePad(salt);
long cipher = value ^ otp;
return String.format("%08x%016x", salt, cipher);
}
public long decrypt(String ciphertext) {
if (ciphertext.length() != 24)
throw new IllegalArgumentException("Invalid cipher text");
try {
int salt = (int) Long.parseLong(ciphertext.substring(0, 8), 16);
long cipher = Long.parseLong(ciphertext.substring(8, 16), 16) << 32
| Long.parseLong(ciphertext.substring(16), 16);
long otp = generatePad(salt);
return cipher ^ otp;
} catch (NumberFormatException e) {
throw new IllegalArgumentException("Invalid hex value: "
+ e.getMessage());
}
}
private long generatePad(int salt) {
String saltString = Integer.toString(salt);
String lpad = saltString + secret;
String rpad = secret + saltString;
return ((long) lpad.hashCode()) << 32 | (long) rpad.hashCode();
}
public static void main(String[] args) {
IntegerCipher cipher = new IntegerCipher("Top Secret");
Random rand = new Random();
for (int i = 0; i < 100; i++) {
Long n = rand.nextLong();
String ciphertext = cipher.encrypt(n);
Long m = cipher.decrypt(ciphertext);
System.out.printf("%24d => %s\n", n, ciphertext);
assert(m == n);
}
}
}
For encrypting short plaintexts in a deterministic way, you might want to have a look at
FFSEM.