Encode String (rfc4122) to Number in Java, decode in PHP - java

In my use case, a JavaScript tracker generates a unique ID for a visitor whenever he/she visits the site, using the following function:
function generateUUID(){
return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
var r = Math.random()*16|0, v = c == 'x' ? r : (r&0x3|0x8);
return v.toString(16);
});
}
It generates strings like this (rfc4122):
"3314891e-285e-40a7-ac59-8b232863bead"
Now I need to encode that string in a Number (e.g. BigInteger in Java) that can be read by Mahout. And likewise, restore it (in PHP) to display results. Is there any fast, consistent and reliable way to do that?
Some solutions are:
Mapping each possible char (alphanumeric + '-') to a number [1..M] and summing each char position accordingly
get 2 longs from md5 hash
keep a hash map in memory
Any ideas appreciated!

If Mahout can use a compound ID of two longs, you can use:
UUID uuid = UUID.fromString(string);
long l1 = uuid.getMostSignificantBits();
long l2 = uuid.getLeastSignificantBits();
If you really are stuck with one long, then I'd agree with your idea to use a portion of a hash based on the entire UUID
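If a single arbitrary-precision number is acceptable, the 32 hex digits of the UUID can be read directly as one 128-bit value and turned back into the dashed form later. The following is a minimal sketch under that assumption; the class and method names (UuidNumbers, toBigInteger, toUuidString, fromLongs) are made up for illustration and are not part of Mahout or any library.

```java
import java.math.BigInteger;
import java.util.UUID;

class UuidNumbers {
    // Drop the dashes and read the remaining 32 hex digits as one non-negative 128-bit number.
    static BigInteger toBigInteger(String uuid) {
        return new BigInteger(uuid.replace("-", ""), 16);
    }

    // Inverse: left-pad to 32 hex digits and put the dashes back (8-4-4-4-12).
    static String toUuidString(BigInteger n) {
        String hex = String.format("%032x", n);
        return hex.substring(0, 8) + "-" + hex.substring(8, 12) + "-"
             + hex.substring(12, 16) + "-" + hex.substring(16, 20) + "-"
             + hex.substring(20);
    }

    // The two-long form from the answer above, rebuilt into a UUID.
    static UUID fromLongs(long mostSig, long leastSig) {
        return new UUID(mostSig, leastSig);
    }
}
```

The mapping is lossless, so no lookup table is needed. On the PHP side the number exceeds a native integer, so decoding would presumably need an arbitrary-precision extension such as GMP or BCMath to convert the decimal string back to hex.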

Related

How to convert a string to a number in Java? I am not looking for hashCode, as it won't give a unique number; at the least I want the math logic for it

Hi, I want to convert a string to a unique number in Java.
Example: "Production-0-1" to 100021
"Process-23-30" to 12310
Every returned number has to be unique.
I don't want to use hashCode, because it can return duplicates: "Aa" and "BB" have the same hash code.
Let me know the math logic to do this if no method is available.
String random = "Production-0-1";
String bi = new BigInteger(random.getBytes("UTF-8")).toString();
BigInteger numBig = new BigInteger(bi);
System.out.println(numBig);
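Based on @markspace's comments, I tried the code above. It is deterministic rather than random: the same string always yields the same number, and different strings of ordinary text yield different numbers. Beware that the number grows with the length of the String, so a very large String produces a very large BigInteger and may exhaust limited memory.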
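Because the number is just the string's bytes read as one big integer, it can also be turned back into the string. A minimal sketch of the round trip follows; the class name StringNumberCodec is made up for illustration. It uses the sign-magnitude constructor so that non-ASCII text (whose first UTF-8 byte has the high bit set) still gives a non-negative number. One assumption is labeled in the code: leading NUL characters, and the empty string, do not survive the round trip.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StringNumberCodec {
    // Interpret the UTF-8 bytes as the magnitude of a non-negative integer.
    static BigInteger encode(String s) {
        return new BigInteger(1, s.getBytes(StandardCharsets.UTF_8));
    }

    // Recover the bytes; toByteArray() may prepend a zero sign byte, which is stripped.
    // Assumption: the original string did not start with a NUL character and was not empty.
    static String decode(BigInteger n) {
        byte[] bytes = n.toByteArray();
        if (bytes.length > 1 && bytes[0] == 0) {
            bytes = Arrays.copyOfRange(bytes, 1, bytes.length);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Unlike hashCode, this keeps "Aa" and "BB" apart, at the cost of a number whose size is proportional to the string length.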

How to get a unique alphanumeric based on a unique integer

My web application has a table in the database with an id column which will always be unique for each row. In addition to this I want to have another column called code that will hold a unique 6-character alphanumeric code made of the digits 0-9 and the letters A-Z. Letters and digits may repeat within a code, e.g. FFQ77J. I understand that unique codes become harder to find over time as more rows are added, but for now I am OK with this.
Requirement (update)
- The code should be at least of length 6
- Each code should be Alphanumeric
So I want to generate this Alphanumeric code.
Question
What is a good way to do this?
Should I generate the code and after the generation, run a query to the database and check if it already exists, and if so then generate a new one? To ensure the uniqueness, does this piece of code need to be synchronized so that only one thread runs it?
Is there something built-in to the database that will let me do this?
For the generation I will be using something like this which I saw in this answer
char[] symbols = new char[36];
char[] buf;
for (int idx = 0; idx < 10; ++idx)
symbols[idx] = (char) ('0' + idx);
for (int idx = 10; idx < 36; ++idx)
symbols[idx] = (char) ('A' + idx - 10);
public String nextString()
{
for (int idx = 0; idx < buf.length; ++idx)
buf[idx] = symbols[random.nextInt(symbols.length)];
return new String(buf);
}
Since it's a requirement for the short code to not be guessable, you don't want to tie it to your unique row ID. Otherwise your row ID would need to be random in addition to unique. Starting with a counter at 0 and incrementing makes it pretty obvious when your codes are 000001, 000002, 000003, and so forth.
For your short code, generate a random 32bit int, omit the sign and convert to base36. Make a call to your database, to ensure it's available.
You haven't explicitly called out scalability, but I think it's important to understand the limitations of your design with respect to scale.
With 2^31 possible values (each at most 6 base-36 characters), you should expect collisions after roughly 50-65k rows (see Birthday Paradox questions).
From your comment, modify your code:
public String nextString()
{
return Integer.toString(random.nextInt() & Integer.MAX_VALUE, 36); // mask off the sign bit
}
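The collision estimate above can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1)/(2d)) for n random draws from d equally likely values. This is a small sketch of that formula only; the class name BirthdayEstimate is made up for illustration.

```java
class BirthdayEstimate {
    // Approximate probability of at least one collision among n uniform random
    // draws from a space of d possible values (standard birthday approximation).
    static double collisionProbability(long n, double d) {
        return 1.0 - Math.exp(-((double) n * (n - 1)) / (2.0 * d));
    }
}
```

For d = 2^31, this gives about a 50% chance of at least one collision at around 55,000 rows and about 63% at 65,536 rows, which is why the database check (or a unique constraint) is needed even for a fairly small table.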
I would simply do this:
String s = Integer.toString(i, 36).toUpperCase();
Choosing base-36 will use characters 0-9a-z for the digits. To get a string that uses uppercase letters (as per your question) you would need to fold the result to upper case.
If you use an auto increment column for your id, set the next value to at least 60,466,176, which when rendered to base 36 is 100000 - always giving you a 6 digit number.
I would start with 0 for an empty table and do a
SELECT MAX(ID) FROM table
to find the largest id so far. Store it in an AtomicInteger and convert it using toString
AtomicInteger counter = new AtomicInteger(maxSoFar);
String nextId = Integer.toString(counter.incrementAndGet(), 36);
or, to zero-pad the code to 6 characters, add 36^6 = 2176782336L and drop the leading digit:
String nextId = Long.toString(2176782336L + counter.incrementAndGet(), 36).substring(1);
This will give you uniqueness and no duplicates to worry about. (it's not random either)
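The padding trick can be wrapped in a small helper. This is a sketch, assuming ids stay below 36^6 so that the result is always exactly 6 characters; the class name PaddedBase36 and the range check are additions for illustration.

```java
import java.util.Locale;

class PaddedBase36 {
    private static final long OFFSET = 2176782336L; // 36^6, i.e. "1000000" in base 36

    // Adding OFFSET forces a 7-character result whose first character is always '1';
    // dropping that character leaves the id zero-padded to 6 base-36 digits.
    static String code(long id) {
        if (id < 0 || id >= OFFSET) {
            throw new IllegalArgumentException("id must be in [0, 36^6)");
        }
        return Long.toString(OFFSET + id, 36).substring(1).toUpperCase(Locale.ROOT);
    }
}
```

With this, id 1 becomes 000001 and id 60,466,176 becomes 100000. The codes are unique but sequential, so they remain guessable.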
Simply, you can use Integer.toString(int i, int radix). Since you have base 36(26 letters+10 digits) you set the radix to 36 and i to your integer. For example, to use 16501, do:
String identifier=Integer.toString(16501, 36);
You can uppercase it with .toUpperCase()
Now onto your other questions: yes, you should query the database first to ensure the code doesn't already exist. Depending on the database, this may need to be synchronized, or it may not, as the database will use its own locking system. In any case, you'd need to tell us which database.
On the question of whether there's a builtin, we'd need to know the DB type as well.
To create a random but unique value within a small range here are some ideas I know of:
1. Create a new random value and try to insert it.
Let a database constraint catch violations. This column should also likely be indexed. The DML may need to be tried several times until a unique ID is found. This will lead to more collisions as time progresses, as noted (see the birthday problem).
2. Create a "free IDs" table ahead of time and on usage mark the ID as being used (or delete it from the "free IDs" table). This is similar to #1 but shifts when the work is done.
This allows the work of finding "free IDs" to be done at another time, perhaps during a cron job, so that there will not be a constraint violation during the insert, keeping the insert itself the "same speed" throughout the usage of said domain. Make sure to use transactions.
3. Create a 1-to-1/injective "mixer" function such that the output "appears random". The point is this function must be 1-to-1 to inherently avoid duplicates.
This output number would then be "base 36 encoded" (which is also injective); but it would be guaranteed unique as long as the input (say, an auto-increment PK) was unique. This would likely be less random than the other approaches, but should still create a nice-looking non-linear output.
A custom injective function can be created around an 8-bit lookup table fairly trivially: just process a byte at a time and shuffle the map appropriately. I really like this idea, but it can still lead to somewhat predictable output.
To find free IDs, approaches #1 and #2 above can use "probing with IN" to minimize the number of SQL statements used. That is, generate a bunch of random values and query for them using IN (keeping in mind what sizes of IN your database likes) and then see which values were free (as having no results).
To create a unique ID not constrained to such a small space, a GUID or even hashing (e.g. SHA1) might be useful. However, these only guarantee uniqueness because they have 128/160-bit spaces so that the chance of collision (for different input/time-space) is currently accepted as improbable.
I actually really like the idea of using an injective function. Bearing in mind that it is not good "random" output, consider this pseudo-code:
byte_map = [0..255]
map[0] = shuffle(byte_map, seed[0])
..
map[n] = shuffle(byte_map, seed[n])
output[0] = map[0][input[0]]
..
output[n] = map[n][input[n]]
output_str = base36_encode(output[0] .. output[n])
While a very simple setup, numbers like 0x200012 and 0x200054 will still share common output - e.g. 0x1942fe and 0x1942a9 - although the lines will be changed a bit due to the later application of the base-36 encoding. This could probably be further improved to "make it look more random".
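The pseudo-code above could be fleshed out in Java roughly as follows. This is a sketch under stated assumptions, not a hardened scheme: the class name ByteMapMixer, the use of java.util.Random seeded per byte position, and the 7-byte limit (to keep the result a non-negative long) are all choices made for illustration. Each byte position gets its own permutation of 0..255, so the whole mapping stays injective.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Random;

class ByteMapMixer {
    private final int[][] maps; // maps[k] is a permutation of 0..255 for byte position k

    ByteMapMixer(long... seeds) {
        if (seeds.length < 1 || seeds.length > 7) {
            throw new IllegalArgumentException("expected 1 to 7 seeds");
        }
        maps = new int[seeds.length][256];
        for (int k = 0; k < seeds.length; k++) {
            List<Integer> perm = new ArrayList<>();
            for (int b = 0; b < 256; b++) perm.add(b);
            Collections.shuffle(perm, new Random(seeds[k])); // seeded, so repeatable
            for (int b = 0; b < 256; b++) maps[k][b] = perm.get(b);
        }
    }

    // Substitute each input byte through its own permutation; 1-to-1 by construction.
    long mix(long input) {
        if (input < 0 || input >= (1L << (8 * maps.length))) {
            throw new IllegalArgumentException("input does not fit in " + maps.length + " bytes");
        }
        long out = 0;
        for (int k = 0; k < maps.length; k++) {
            int b = (int) ((input >>> (8 * k)) & 0xFF);
            out |= ((long) maps[k][b]) << (8 * k);
        }
        return out;
    }

    // Base-36 rendering of the mixed value (also injective).
    String code(long input) {
        return Long.toString(mix(input), 36).toUpperCase(Locale.ROOT);
    }
}
```

As noted above, inputs that share their high bytes still share the corresponding output bytes, so this hides the sequence only superficially; it is not a substitute for encryption.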
For efficient usage, try caching generated code in a HashSet<String> in your application:
HashSet<String> codes = new HashSet<String>();
This way you don't have to make a db call every time to check whether the generated code is unique or not. All you have to do is:
codes.contains(newCode);
And, yes, you should synchronize your method which updates the cache
public synchronized String getCode()
{
String newCode;
do {
newCode = nextString();
}
while (codes.contains(newCode));
codes.add(newCode); // remember the code so it is never handed out again
return newCode;
}
You mentioned in your comments that the relationship between id and code should not be easily guessable. For this you basically need encryption; there are plenty of encryption programs and modules out there that will perform encryption for you, given a secret key that you initially generate. To employ this approach, I would recommend converting your id into ascii (i.e., representing as base-256, and then interpreting each base-256 digit as a character) and then running the encryption, and then converting the encrypted ascii (base-256) into base 36 so you get your alpha-numeric, and then using 6 randomly chosen locations in the base 36 representation to get your code. You can resolve collisions e.g. by just choosing the nearest unused 6-digit alpha-numeric code when a collision occurs, and noting the re-assigned alpha-numeric code for the id in a (code <-> id) table that you will have to maintain anyway since you cannot decrypt directly if you only store 6 base-36 digits of the encrypted id.

Hector does not handle Control Characters correctly in Java Strings - how to get Hexadecimal from Hector instead of text string?

I have a problem with Hector's handling of control-characters in Key and Column names. I am writing a program using Hector to talk with a Cassandra instance, and there are pre-existing Keys and Column names with e.g. hexadecimal "594d69e0b8e611e10000242d50cf1ff7".
I have inputted that hexadecimal into a Java String and plugged it through some simple conversion-to-text code:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s1.length() - 1; i+=2 ){
/*Grab the hex in pairs*/
String output = s1.substring(i, (i + 2));
/*Convert Hex to Decimal*/
int decimal = Integer.parseInt(output, 16);
sb.append((char)decimal);
}
return sb.toString();
(Converting the returned Java String back to hexadecimal by calling hexString.append(Integer.toHexString(textString.charAt(i))); for every character, returns the original hexadecimal, so Java should be capable of handling this data.) Printing said Java String yields the top line in the below image:
[Image not posted because new users aren't allowed to post images.]
Image here: http://i.stack.imgur.com/yUJxs.png
Unfortunately, the bottom line (corrupted) is what Hector is returning to me when I call the following code (lots of checks and setup omitted, for simplicity of the question):
OrderedRows<String, String, String> orderedRows;
orderedRows = rangeSlicesQuery.execute().get();
Row<String,String,String> lastRow = orderedRows.peekLast();
for (Row<String, String, String> r : orderedRows) {
String key = r.getKey();
System.out.println(key);
...
So, Hector is not handling control characters properly when returning the Java String. How can I get Hector to return to me the Keys and Columns in Hexadecimal instead of a (corrupted) text-based Java String? I tried to look it up but the documentation on how to do so is essentially missing (http://hector-client.github.com/hector//source/content/API/core/1.0-1/me/prettyprint/hector/api/beans/OrderedRows.html - what are K, V, and N?). I imagine it should be simple, as the Cassandra CLI assumes hexadecimal if you do not wrap the input with ascii(''), but I cannot figure out how to do it.
In Cassandra, everything is stored as raw bytes (which tools such as the CLI display as hex). The Cassandra thrift API also accepts binary. In real life however, people like to deal with human types like String, integer etc. Hector makes it easy for you to use the thrift API by abstracting out the serializing/deserializing logic.
K, N and V are types of the row key, column name and column value respectively. When you use String, String, String, you are telling hector that all the three types for your column family are Strings.
If you are storing the row key and column names as Bytes, you should use byte[] instead for retrievals and BytesArraySerializer for serializing.
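Once keys and column names come back as byte[] rather than String, turning them into hexadecimal (and back) needs only the JDK. The sketch below deliberately shows no Hector calls; the class name HexCodec is made up for illustration. Going through bytes avoids the lossy step of decoding arbitrary binary data as text.

```java
class HexCodec {
    // Render each byte as exactly two lowercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Parse pairs of hex digits back into bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }
}
```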

Is it possible to compare two strings by their "hash" numbers?

I have a string which is lost forever. The only thing I have about it is some magic hash number. Now I have a new string, which could be similar or equal to the lost one. I need to find out how close it is.
Integer savedHash = 352736;
String newText = "this is new string";
if (Math.abs(hash(newText) - savedHash) < 100) {
// wow, they are very close!
}
Are there any algorithms for this purpose?
ps. The length of the text is not fixed.
pps. I know how usual hash codes work. I'm interested in an algorithm that will work differently, giving me the functionality explained above.
ppps. In a very simple scenario this hash() method would look like:
public int hash(String txt) {
return txt.length();
}
Standard hashing will not work in this case since close hash values do not imply close strings. In fact, most hash functions are designed to give close strings very different values, so as to create a random distribution of hash values for any given set of input strings.
If you had access to both strings, then you could use some kind of string distance function, such as Levenshtein distance. This calculates the edit distance between two strings, or the number of edits required to transform one to the other.
In this case however, the best approach might be to use some kind of fuzzy hashing technique. That way you don't have to store the original string, and can still get some measure of similarity.
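For reference, when both strings are available, the Levenshtein distance mentioned above can be computed with the standard dynamic-programming recurrence. This is a minimal sketch using two rolling rows; the class name EditDistance is made up for illustration.

```java
class EditDistance {
    // Minimum number of single-character insertions, deletions and substitutions
    // needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // distance from "" to b[0..j)
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance from a[0..i) to ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }
}
```

This needs both original strings, which is exactly what the question lacks; it cannot be applied to a stored hash value.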
If the hashes don't match then the strings are different.
If the hashes match then the strings are probably the same.
There is nothing else you can infer from the hash value.
No, this isn't going to work. The similarity of a hash bears no relation to the similarity of the original strings. In fact, it is entirely possible for 2 different strings to have the same hash. All you can say for sure is that if the hashes are different the strings were different.
[Edited in light of comment, possibility of collision is of course very real]
Edit for clarification:
If you only have the hash of the old string then there is no way you are going to find the original value of that string. There is no algorithm that would tell you if the hashes of 2 different strings represented strings that were close, and even if there was it wouldn't help. Even if you find a string that has an exact hash match with your old string there is still no way you would know if it was your original string, as any number of strings can produce the same hash value. In fact, there is a vast* number of strings that can produce the same hash.
[In theory this vast number is actually infinite but on any real storage system you can't generate an infinite number of strings. In any case your chance of matching an unknown string via this approach is very slim unless your hashes are large in relation to the input string, and even then you will need to brute force your way through every possible string]
As others have pointed out, with a typical hash algorithm, it just doesn't work like that at all.
There are, however, a few people who've worked out algorithms that are at least somewhat similar to that. For one example, there's a company called "Xpriori" that has some hashing (or at least hash-like) algorithms that allow things like that. They'll let you compare for degree of similarity, or (for example) let you combine hashes so hash(a) + hash(b) == hash(a+b) (for some definition of +, not just simple addition of the numbers). Like with most hashes, there's always a possibility of collision, so you have some chance of a false positive (but by picking the hash size, you can set that chance to an arbitrarily small value).
As such, if you're dealing with existing data, you're probably out of luck. If you're creating something new, and want capabilities on this order, it is possible -- though trying to do it on your own is seriously non-trivial.
No. Hashes are designed so that minor variations in the input string cause huge differences in the resulting hash. This is very useful for dictionary implementations, as well as verifying the integrity of a file (a single changed bit will cause a completely different hash). So no, it's not something you can ever use for a closeness comparison.
If the hashCodes are different it cannot be the same String, however many Strings can have the same hashCode().
Depending on the nature of the Strings, doing a plain comparison could be more efficient than comparing hashCode() values: hashCode() has to inspect and perform a calculation on every character, whereas a comparison can stop early, e.g. if the lengths differ or as soon as it sees a different character.
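The one-directional nature of this check can be seen with the "Aa"/"BB" pair: Java's documented String.hashCode() formula gives both the value 2112. In this short sketch, the class and method names (HashCheck, mayBeEqual) are made up for illustration.

```java
class HashCheck {
    // A stored hash can only rule a candidate out; a match does not prove equality.
    static boolean mayBeEqual(int savedHash, String candidate) {
        return candidate.hashCode() == savedHash;
    }
}
```

With savedHash taken from "Aa", mayBeEqual returns true for "BB" as well, even though the strings differ; it returns false for "Ab", which proves only that "Ab" was not the original.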
Any good hashing algorithm will by definition NEVER yield similar hashes for similar arguments. Otherwise, it would be too easy to crack. If the hashed value of "aaaa" looks similar to that of "aaab", then that is a poor hash. I have cracked ones like that before without too much difficulty (fun puzzle to solve!). But you never know, maybe your hash algorithm is poor. Any idea what it is?
If you have time, you can just brute force this solution by hashing every possible word. Not elegant, but possible. Easier if you know the length of the original word as well.
If it is a standard hash algorithm, like MD5, you can find websites that already have large mappings of source and hash, and get the answer that way. Try http://hashcrack.com/
I used this website successfully after one of our devs left and I needed to recover a password.
Cheers,
Daniel
You can treat the string as a really big number, but that's about the extent of your abilities in the general situation. If you have a specific problem domain, you may be able to compress a representation of the string to something smaller without losses, but still it will not be very useful.
For example, if you are working with individual words, you can use soundex to compare how similar two words will sound...
The best you can do with traditional hash codes will be to compare two strings for equality vs likely inequality. False positives are possible, but there will be no false negatives. You cannot compare for similarity this way, though.
A normal hash code changes a lot when the object changes a bit. It is made to distinguish different objects and doesn't care how much they resemble each other. Therefore the answer is no.
Well, seems you want not real hash of string, but some fingerprint of string. Because you want it to be of 32-bits one way could be:
Calculate Pearson correlation coefficient between first and second half of string (if string length is odd number of chars, then add some padding) and store this number as 32-bit floating point number. But I'm not sure how reliable this method will be.
==EDIT==
Here is C example code (un-optimized) which implements this idea (a little bit modified):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
float mean(char *str) {
char *x;
float sum = 0.0;
for(x=str; *x!='\0'; x++) {
sum += (float) *x;
}
return sum/strlen(str);
}
float stddev(char *str) {
char *x;
float sum = 0.0;
float u = mean(str);
for(x=str; *x!='\0'; x++) {
sum += ((float)*x - u)*((float)*x - u);
}
return sqrt(sum/strlen(str));
}
float covariance(char *str1, char *str2) {
int i;
int im = fmin(strlen(str1),strlen(str2));
float sum = 0.0;
float u1 = mean(str1);
float u2 = mean(str2);
for(i=0; i<im; i++) {
sum += ((float)str1[i] - u1)*((float)str2[i] - u2);
}
return sum/im;
}
float correlation(char *str1, char *str2) {
float cov = covariance(str1,str2);
float dev1 = stddev(str1);
float dev2 = stddev(str2);
return cov/(dev1*dev2);
}
float string_fingerprint(char *str) {
int len = strlen(str);
char *rot = (char*) malloc((len+1)*sizeof(char));
int i;
// rotate string by CHAR_COUNT/2
for(i=0; i<len; i++){
rot[i] = str[(i+len/2)%len];
}
rot[len] = '\0';
// now calculate correlation between original and rotated strings
float corr = correlation(str,rot);
free(rot);
return corr;
}
int main() {
char string1[] = "The quick brown fox jumps over the lazy dog";
char string2[] = "The slow brown fox jumps over the crazy dog";
float f1 = string_fingerprint(string1);
float f2 = string_fingerprint(string2);
if (fabs(f1 - f2) < 0.2) {
printf("wow, they are very close!\n");
}
return 0;
}
hth!

Data structure for soundex algorithm?

Can anyone suggest what data structure to use for a soundex algorithm program? The language to be used is Java. Perhaps somebody has worked on this before in Java. The program should have these features:
be able to read about 50,000 words
should be able to read a word and return the related words having the same soundex
I don't want the program implementation, just some advice on what data structure to use.
TIP: If you use SQL as a databackend then you can let SQL handle it with the two sql-functions SOUNDEX and DIFFERENCE.
Maybe not what you wanted, but many people do not know that MSsql has those two functions.
Well soundex can be implemented in a straightforward pass over a string, so that doesn't require anything special.
After that the 4 character code can be treated as an integer key.
Then just build a dictionary that stores word sets indexed by that integer key. 50,000 words should easily fit into memory so nothing fancy is required.
Then walk the dictionary and each bucket is a group of similar sounding words.
Actually, here is the whole program in perl:
#!/usr/bin/perl
use Text::Soundex;
use Data::Dumper;
open(DICT,"</usr/share/dict/linux.words");
my %dictionary = ();
while (<DICT>) {
chomp();
chomp();
push @{$dictionary{soundex($_)}},$_;
}
close(DICT);
while (<>) {
my @words = split / +/;
foreach (@words) {
print Dumper $dictionary{soundex($_)};
}
}
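Since the question asks about Java, the same dictionary-of-buckets idea might look like the sketch below. It includes a compact implementation of the common American Soundex rules (first letter kept, H and W skipped, repeated codes collapsed, padded to 4 characters); the class name SoundexIndex is made up for illustration, and non-letters are simply ignored as a simplifying assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SoundexIndex {
    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";
    private final Map<String, Set<String>> buckets = new HashMap<>();

    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char c : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') continue;          // assumption: ignore non-letters
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                          // keep the first letter as-is
                last = code;
            } else {
                if (c == 'H' || c == 'W') continue;     // H and W do not separate equal codes
                if (code != '0' && code != last) out.append(code);
                last = code;                            // a vowel resets 'last'
            }
            if (out.length() == 4) break;
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');       // pad short codes with zeros
        return out.toString();
    }

    void add(String word) {
        buckets.computeIfAbsent(soundex(word), k -> new TreeSet<>()).add(word);
    }

    // All stored words that share the query word's soundex code.
    Set<String> similar(String word) {
        return buckets.getOrDefault(soundex(word), Collections.<String>emptySet());
    }
}
```

A HashMap keyed by the 4-character code gives constant-time lookup of each bucket, and 50,000 words fit comfortably in memory.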
I believe you just need to convert the original strings into soundex keys into a hashtable; the value for each entry in the table would be a collection of original strings mapping to that soundex.
The Multimap collection interface (and its implementations) in Google Collections would be useful to you.
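import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
class SpellChecker
{
interface Hash {
String hash(String word);
}
private final Hash hash;
private final Map<String, Set<String>> collisions;
SpellChecker(Hash hash) {
this.hash = hash;
collisions = new TreeMap<String, Set<String>>();
}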
boolean addWord(String word) {
String key = hash.hash(word);
Set<String> similar = collisions.get(key);
if (similar == null)
collisions.put(key, similar = new TreeSet<String>());
return similar.add(word);
}
Set<String> similar(String word) {
Set<String> similar = collisions.get(hash.hash(word));
if (similar == null)
return Collections.emptySet();
else
return Collections.unmodifiableSet(similar);
}
}
The hash strategy could be Soundex, Metaphone, or what have you. Some strategies might be tunable (how many characters does it output, etc.)
Since soundex is a hash, I'd use a hash table, with the soundex as the key.
you want a 4-byte integer.
The soundex algorithm always returns a 4-character code, if you use ANSI inputs, you'll get 4-bytes back (represented as 4 letters).
So store the codes returned in a hashtable, convert your word to the code and look it up in the hashtable. It's really that easy.
