URL shortening algorithm - java

Now, this is not strictly about URL shortening, but my purpose is such anyway, so let's view it like that. Of course the steps to URL shortening are:
1. Take the full URL
2. Generate a unique short string to be the key for the URL
3. Store the URL and the key in a database (a key-value store would be a perfect match here)
Now, about the second point. Here's what I've come up with:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
UUID uuid = UUID.randomUUID();
dos.writeLong(uuid.getMostSignificantBits());
String encoded = new String(Base64.encodeBase64(baos.toByteArray()), "ISO-8859-1");
String shortUrlKey = StringUtils.left(encoded, 6); // returns the leftmost 6 characters
// check if exists in database, repeat until it does not
Is this good enough?

For a file upload application I wrote, I needed this functionality too. Having read this SO article, I decided to stick with random numbers and check whether they exist in the DB.
So your approach is similar to what I did.

Well what do you mean by URL shortening?
There are very different techniques. Most websites, AFAIK, simply put the database primary key (possibly in some encoded form) at a position in the URL where it can be parsed by a regular expression, and pad the rest with keywords.
Example from Amazon: http://www.amazon.de/Bauknecht-WA-PLUS-614-Waschmaschine/dp/B003V1JDU8/
You can enter anything in place of the name of the product, only the id at the end is important.
However, you may want to keep your links clean: check whether the requested URL is the correct one and, if it is not, either issue a 301 redirect to the real URL or declare a canonical URL.
However:
If you want to do something like TinyURL, my answer is a definite no.
It's not good enough.
Well it depends.
It's not "secure". It would be pretty easy to guess URLs. A better approach would be using some cryptographic function like SHA-1/MD5.
When it comes to collisions I can't really tell. GUID was designed to have no collisions, but you are only using the first 6 characters. I don't know what exactly they represent in the algorithm. But it's definitely not optimal.
Why, however, don't you just use the database's auto-incrementing primary key? If security is important, you also definitely have to go with more than 6 characters.
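If the key is derived from the auto-incrementing primary key as suggested here, one way to keep it short is to write the number in base 62. The following is only an illustrative sketch: the class name Base62 and the alphabet order are assumptions, not something from the original post, and sequential ids encoded this way remain easy to guess.

```java
final class Base62 {
    // Assumed alphabet order: digits, then upper case, then lower case.
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    private Base62() {}

    /** Encodes a non-negative id (e.g. a database primary key) as a base-62 string. */
    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62))); // least significant digit first
            id /= 62;
        }
        return sb.reverse().toString();
    }

    /** Decodes a string produced by encode back into the numeric id. */
    static long decode(String key) {
        long id = 0;
        for (int i = 0; i < key.length(); i++) {
            int digit = ALPHABET.indexOf(key.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + key.charAt(i));
            }
            id = id * 62 + digit;
        }
        return id;
    }
}
```

With this alphabet, six characters are enough for ids up to 62^6 - 1 (about 56.8 billion), and the lookup is a plain primary-key query after decoding.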
On a project I did I used something like
/database-primary-key/hash-of-primary-key-with-some-token-or-client-information/
This way I could look up the primary key in the database directly, which was the fastest possible way, and use the hash to verify that the link had not simply been found by brute force. In my case the hash was the SHA-1 sum of the client's secret token and the primary key.
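A rough sketch of that URL scheme, using only the JDK. The class and method names are hypothetical, and the ":" separator between token and key is an assumption; the answer does not say how the two values were combined.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SignedLink {
    private SignedLink() {}

    /** Returns the lower-case hex SHA-1 digest of the UTF-8 bytes of the input. */
    static String sha1Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Builds "/<primary key>/<hash>/" from the key and the client's secret token. */
    static String path(long primaryKey, String secretToken) {
        return "/" + primaryKey + "/" + sha1Hex(secretToken + ":" + primaryKey) + "/";
    }

    /** Recomputes the hash for the key and compares it with the one from the URL. */
    static boolean verify(long primaryKey, String hashFromUrl, String secretToken) {
        byte[] expected = sha1Hex(secretToken + ":" + primaryKey)
                .getBytes(StandardCharsets.US_ASCII);
        byte[] actual = hashFromUrl.getBytes(StandardCharsets.US_ASCII);
        return MessageDigest.isEqual(expected, actual);
    }
}
```

The row is still fetched by primary key; the hash only serves to reject URLs where someone has changed the number without knowing the token.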

Related

Easiest way in Java to turn String into UUID

How to generate a valid UUID from a String? The String alone is not what I'm looking for. Rather, I'm looking for something like a hash function converting any String to a valid UUID.
Try this out:
String superSecretId = "f000aa01-0451-4000-b000-000000000000";
UUID.fromString(superSecretId);
I am using this in my project and it works. Make sure you import the right stuff.
In the Java core library, there's java.util.UUID.nameUUIDFromBytes(byte[]).
I wouldn't recommend it because it's UUID 3 which is based on MD5, a very broken hash function. You'd be better off finding an implementation of UUID 5, which is based on SHA-1 (better although also sort of broken).
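For reference, a small sketch of the JDK call mentioned above. The wrapper class and method names are made up for illustration, and UTF-8 is an assumed encoding choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class NameUuidDemo {
    private NameUuidDemo() {}

    /** Derives a name-based (version 3, MD5) UUID from an arbitrary string. */
    static UUID fromName(String name) {
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        UUID first = fromName("any string at all");
        UUID second = fromName("any string at all");
        System.out.println(first + " version=" + first.version());
        System.out.println("same input, same UUID: " + first.equals(second));
    }
}
```

The same input always yields the same UUID, which is what distinguishes this from UUID.randomUUID().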

Adler32 Repeating Very Quickly

I'm using the adler32 checksum algorithm to generate a number from a database id. So, when I insert a row into the database, I take the identity of that row and use it to create the checksum. The problem that I'm running into is that I just generated a repeat checksum after only 207 inserts into the database. This is much much faster than I expected. Here is my code:
String dbIdStr = Long.toString(dbId);
byte[] bytes = dbIdStr.getBytes();
Checksum checksum = new Adler32();
checksum.update(bytes, 0, bytes.length);
result = checksum.getValue();
Is there something wrong with what/how I'm doing? Should I be using a different method to create unique strings? I'm doing this because I don't want to use the db id in a url... a change to the structure of the db will break all the links out there in the world.
Thanks!
You should not be using Adler-32 as a hash code generator. That's not what it's for. You should use an algorithm that has good hash properties, which, among other things minimizes the probability of collisions.
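The early repeat is easy to reproduce. The sketch below (class and method names are hypothetical) applies the question's approach to ids 1, 2, 3, ... and reports the first pair that shares a checksum.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.Adler32;

final class AdlerCollisionDemo {
    private AdlerCollisionDemo() {}

    /** Adler-32 of the decimal string form of the id, as in the question. */
    static long adler32Of(long id) {
        byte[] bytes = Long.toString(id).getBytes(StandardCharsets.US_ASCII);
        Adler32 checksum = new Adler32();
        checksum.update(bytes, 0, bytes.length);
        return checksum.getValue();
    }

    /** Returns {earlierId, laterId} for the first repeated checksum in 1..limit, or null. */
    static long[] firstCollision(long limit) {
        Map<Long, Long> seen = new HashMap<>();
        for (long id = 1; id <= limit; id++) {
            Long earlier = seen.putIfAbsent(adler32Of(id), id);
            if (earlier != null) {
                return new long[] {earlier, id};
            }
        }
        return null;
    }

    public static void main(String[] args) {
        long[] pair = firstCollision(1000);
        System.out.println(pair == null ? "no collision" : pair[0] + " collides with " + pair[1]);
    }
}
```

Adler-32 is just two running sums over the bytes, so for short digit strings both sums stay small and different digit arrangements can land on the same pair of values; with ids starting at 1, "120" and "201" are the first to do so.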
You can simply use Java's hashCode method (on any object). For the String object, the hash code is the sum of the char values of the string, each multiplied by a successive power of 31. There can be collisions with very short strings, but it's not a horrible algorithm. It's definitely a lot better than Adler-32 as a hash algorithm.
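To make the formula concrete, here is a hand-written version of the same computation; the class and method names are only for illustration.

```java
final class StringHashDemo {
    private StringHashDemo() {}

    /** Computes s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] with int overflow. */
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(manualHash("abc") == "abc".hashCode());
        System.out.println("Aa".hashCode() == "BB".hashCode());
    }
}
```

The second line in main is an instance of the short-string collisions mentioned above: "Aa" and "BB" both hash to 2112.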
The suggestions to use a cryptographically secure hash function (like SHA-256) are certainly overkill for your application, both in terms of execution time and hash code size. You should try Java's hashCode and see how many collisions you get. If they seem much more frequent than you'd expect for a 2^-n probability (where n is the number of bits in the hash code), then you can override it with a better one. You can find a link here for decent Java hash functions.
Try and use a secure hash function like SHA-256. If you ever find a collision for any data that is not binary equal, you'll get $1000 on your bank account, with compliments. Offer ends if/when SHA-2 is cracked and you enter a collision deliberately. That said, the output is 32 bytes instead of 32 bits.

Random GUID in Java (A different format)

One of the components that I use needs to feed an XML into it. The component provider has not provided any documentation or the specs of the XML. I am trying to generate the XMLs by trial and error using the sample XMLs from the component.
This was the story. Here is my problem.
In the XML, they have used some f_key = "b3f39bb9-3f8c-453a-bdb4-2486a887e39f-0000a008:000001e8"
Java gives me this : UUID.randomUUID().toString()
which generates random strings in this format : "22572e59-f7dc-404a-9c0c-78161e3a4df7"
Any clue, what does "0000a008:000001e8" in the f_key provided by the component mean [The random string up to 5 pieces matches in both. The 6th and 7th piece are extra in the random string provided by the component]? What sort of UUID generator would be generating that? Does it look familiar?
According to this code
Regex guidRegEx = new Regex(@"^(\{{0,1}([0-9a-fA-F]){8}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){12}\}{0,1})$");
guidRegEx.IsMatch("b3f39bb9-3f8c-453a-bdb4-2486a887e39f-0000a008:000001e8");
that isn't a valid GUID; it's a valid GUID with something on the end. I am guessing they've tacked a timestamp on the end. I've seen stuff come out of timestamp appliances in the past.
But that is a best guess.
I believe that it is just some kind of key that is generated by the provider. Although I have no idea about the rules of the key generation (that is application specific), I converted the hex numbers a008 and 1e8 to decimal and found that the ratio between them is about 84: 40968/488 ≈ 83.95 (83 with integer division). So you could try creating a UUID and appending a suffix of two hex numbers with roughly that ratio.
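To experiment with keys of this shape, the sample f_key can be split into a standard UUID and the two hex fields. This is a sketch under the assumption that the format is always a 36-character UUID, a hyphen, and two hex numbers separated by a colon; the class and field names are invented, and what the two numbers mean remains unknown.

```java
import java.util.UUID;

final class FKeyParts {
    final UUID uuid;
    final long first;
    final long second;

    private FKeyParts(UUID uuid, long first, long second) {
        this.uuid = uuid;
        this.first = first;
        this.second = second;
    }

    /** Splits "<uuid>-<hex>:<hex>" into its three components. */
    static FKeyParts parse(String fKey) {
        // A canonical UUID string is 36 characters: 8-4-4-4-12 hex digits.
        UUID uuid = UUID.fromString(fKey.substring(0, 36));
        // Skip the hyphen at index 36, then split the remainder on the colon.
        String[] suffix = fKey.substring(37).split(":");
        if (suffix.length != 2) {
            throw new IllegalArgumentException("expected <hex>:<hex> suffix: " + fKey);
        }
        return new FKeyParts(uuid, Long.parseLong(suffix[0], 16), Long.parseLong(suffix[1], 16));
    }

    /** Rebuilds the key in the same layout, with each hex field padded to 8 digits. */
    String format() {
        return uuid + "-" + String.format("%08x:%08x", first, second);
    }
}
```

For the sample key, the two fields parse to 40968 and 488, and format() reproduces the original string.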

Which Java library provides a facility to generate unique random strings from a given set of characters?

Say I have this set of characters: [a-zA-Z0-9]
And I need to generate 4-character string from this set that is less likely to collide.
Apache Commons Lang has a RandomStringUtils class with a method that takes a sequence of characters and a count, and does what you ask. It makes no guarantee of collision avoidance, though, and with only 4 characters, you're going to struggle to achieve that.
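If adding a dependency is not an option, the same idea can be sketched with the JDK alone. Note that this is a plain stand-in built on java.security.SecureRandom, not the Commons Lang implementation, and the class and method names are hypothetical. Like RandomStringUtils, it does nothing to prevent collisions.

```java
import java.security.SecureRandom;

final class RandomKeys {
    static final String ALPHANUMERIC =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    private static final SecureRandom RANDOM = new SecureRandom();

    private RandomKeys() {}

    /** Returns a string of the given length, each character drawn uniformly from alphabet. */
    static String random(int length, String alphabet) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(alphabet.charAt(RANDOM.nextInt(alphabet.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(random(4, ALPHANUMERIC));
    }
}
```

Uniqueness still has to be enforced separately, for example with a unique constraint in the database and a retry on conflict.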
"And I need to generate 4-character string from this set that is less likely to collide."
Less likely than what? There are 62^4 ≈ 14.8 million such strings. Due to the birthday paradox, you get about a 50% chance of a collision once you have randomly generated roughly 4,500 of them (and already close to 40% after 3,800). If that's not acceptable, no library will help you; you need to use a longer string or establish uniqueness explicitly (e.g. by incrementing an integer and formatting it in base 62).
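The birthday estimate can be checked with the usual approximation P ≈ 1 - exp(-n(n-1)/(2N)) for n random draws from N equally likely keys. The helper below is a sketch with hypothetical names; it relies on that approximation rather than the exact product formula.

```java
final class BirthdayEstimate {
    private BirthdayEstimate() {}

    /** Approximate probability of at least one collision among n draws from keySpace values. */
    static double collisionProbability(long n, double keySpace) {
        return 1.0 - Math.exp(-(double) n * (n - 1) / (2.0 * keySpace));
    }

    /** Approximate number of draws needed to reach collision probability p (0 < p < 1). */
    static long drawsForProbability(double p, double keySpace) {
        // Solve n^2 / (2N) = ln(1 / (1 - p)) for n.
        return (long) Math.ceil(Math.sqrt(2.0 * keySpace * Math.log(1.0 / (1.0 - p))));
    }

    public static void main(String[] args) {
        double keySpace = Math.pow(62, 4); // 14,776,336 four-character keys
        System.out.printf("P(collision) after 3800 draws: %.3f%n",
                collisionProbability(3800, keySpace));
        System.out.println("draws for 50%: " + drawsForProbability(0.5, keySpace));
    }
}
```

Under this approximation the 50% point for 62^4 keys falls at a little over 4,500 draws, while 3,800 draws give a collision probability of roughly 39%.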
If you'd be OK with a longer hash, you'd certainly be able to find some MD5 libraries. It's the most common choice for this kind of task. A lot of web sites use it to generate password hashes.

Making a line of code difficult to read

I'm writing a way of checking whether a customer's serial number matches my hard-coded number. Is there a way of making this as hard to read as possible in case an undesirable gets their hands on the code?
I am working in java.
For instance (pseudo code)
if (x != y) jump out of code and return error
Cheers , apologies if this is a bit of an odd one
Security through obscurity is always a bad idea. You don't need to avoid it, but you should not rely solely on it.
Either encrypt your serials with a key you type in at startup of the service, or just specify the serials as hex or base64, not ASCII.
The normal way to do this would be to use a hash.
Create a hash of your serial code.
To validate the client serial, hash that using the same function.
If the hashes match, the serial was correct, even though the serial itself was not in the code.
By definition, it's almost impossible to deduce the original code from the hash.
Making the code look complex to avoid being hacked never helps!
You can try SHA-1 or some other one-way hash (MD5 is not as secure, but it's pretty good). Don't do this:
if (userPassword equals myHardCodedpassword)
Do this:
if (ENCRYPTED(userPassword) equals myhardcodedEncryptedpassword)
So anyone reading the code can only see a hashed value, which is very, very difficult to reverse.
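In Java, the second form might look like the sketch below. It swaps in SHA-256 from the JDK for the SHA-1/MD5 suggested above; the class name, the method names, and the sample serial "abc" are all placeholders. Only the digest is embedded in the code, never the serial itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SerialCheck {
    // SHA-256 digest (hex) of the placeholder serial "abc".
    private static final String EXPECTED_HEX =
            "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";

    private SerialCheck() {}

    /** Returns the lower-case hex SHA-256 digest of the UTF-8 bytes of the input. */
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Hashes the customer's serial and compares it with the embedded digest. */
    static boolean isValid(String customerSerial) {
        return MessageDigest.isEqual(
                sha256Hex(customerSerial).getBytes(StandardCharsets.US_ASCII),
                EXPECTED_HEX.getBytes(StandardCharsets.US_ASCII));
    }
}
```

A short or predictable serial can still be found by hashing every candidate, so the protection is only as good as the serial is hard to guess.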
Tangle the control structure of the released code?
E.g. feed the numbers in at a random point in the code under different variable names, and at some other random point assign them to x and y?
http://en.wikipedia.org/wiki/Spaghetti_code
There is a wikipedia article on code obfuscation. Maybe the tricks there can help you =)
Instead of trying to make the code complex, you can implement other methods which will not expose your hard-coded serial number.
Try storing the hard-coded number at some permanent location as an encrypted byte array. That way it's not readable. For comparison, encrypt the client's serial code with the same algorithm and compare the results.
