Is it possible to compare two strings by their "hash" numbers? - java

I have a string which is lost forever. The only thing I have about it is some magic hash number. Now I have a new string, which could be similar or equal to the lost one. I need to find out how close it is.
Integer savedHash = 352736;
String newText = "this is new string";
if (Math.abs(hash(newText) - savedHash) < 100) {
// wow, they are very close!
}
Are there any algorithms for this purpose?
ps. The length of the text is not fixed.
pps. I know how usual hash codes work. I'm interested in an algorithm that will work differently, giving me the functionality explained above.
ppps. In a very simple scenario this hash() method would look like:
public int hash(String txt) {
return txt.length();
}

Standard hashing will not work in this case since close hash values do not imply close strings. In fact, most hash functions are designed to give close strings very different values, so as to create a random distribution of hash values for any given set of input strings.
If you had access to both strings, then you could use some kind of string distance function, such as Levenshtein distance. This calculates the edit distance between two strings, or the number of edits required to transform one to the other.
In this case however, the best approach might be to use some kind of fuzzy hashing technique. That way you don't have to store the original string, and can still get some measure of similarity.
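As a rough illustration of the fuzzy-hashing idea, here is a minimal SimHash-style sketch in Java. It is not taken from the answer above: the class name, the trigram size and the bit-mixing constants are assumptions chosen for illustration. Each character trigram votes on the 32 bits of the fingerprint, so strings that share most of their trigrams tend to get fingerprints that differ in only a few bits.

```java
final class SimHashSketch {

    // Scrambles the bits of a trigram hash so every bit position carries information.
    // The constants are borrowed from the MurmurHash3 finalizer; any good mixer would do.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Builds a 32-bit fingerprint: each trigram votes +1 or -1 on every bit position.
    static int fingerprint(String text) {
        int[] votes = new int[32];
        for (int i = 0; i + 3 <= text.length(); i++) {
            int h = mix(text.substring(i, i + 3).hashCode());
            for (int bit = 0; bit < 32; bit++) {
                votes[bit] += ((h >>> bit) & 1) == 1 ? 1 : -1;
            }
        }
        int result = 0;
        for (int bit = 0; bit < 32; bit++) {
            if (votes[bit] > 0) {
                result |= 1 << bit;
            }
        }
        return result;
    }

    // Number of differing bits: 0 means identical fingerprints, 32 means all bits differ.
    static int distance(int fingerprintA, int fingerprintB) {
        return Integer.bitCount(fingerprintA ^ fingerprintB);
    }
}
```

With a scheme like this you would store fingerprint(text) instead of an ordinary hash and later treat a small distance as "probably similar". The threshold has to be tuned on real data, and none of this helps if the saved number came from an ordinary hash function.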

If the hashes don't match then the strings are different.
If the hashes match then the strings are probably the same.
There is nothing else you can infer from the hash value.

No, this isn't going to work. The similarity of a hash bears no relation to the similarity of the original strings. In fact, it is entirely possible for 2 different strings to have the same hash. All you can say for sure is that if the hashes are different the strings were different.
[Edited in light of comment, possibility of collision is of course very real]
Edit for clarification:
If you only have the hash of the old string then there is no way you are going to find the original value of that string. There is no algorithm that would tell you if the hashes of 2 different strings represented strings that were close, and even if there was it wouldn't help. Even if you find a string that has an exact hash match with your old string there is still no way you would know if it was your original string, as any number of strings can produce the same hash value. In fact, there is a vast* number of strings that can produce the same hash.
[In theory this vast number is actually infinite, but on any real storage system you can't generate an infinite number of strings. In any case, your chance of matching an unknown string via this approach is very slim unless your hashes are large in relation to the input string, and even then you will need to brute force your way through every possible string.]

As others have pointed out, with a typical hash algorithm, it just doesn't work like that at all.
There are, however, a few people who've worked out algorithms that are at least somewhat similar to that. For one example, there's a company called "Xpriori" that has some hashing (or at least hash-like) algorithms that allow things like that. They'll let you compare for degree of similarity, or (for example) let you combine hashes so hash(a) + hash(b) == hash(a+b) (for some definition of +, not just simple addition of the numbers). Like with most hashes, there's always a possibility of collision, so you have some chance of a false positive (but by picking the hash size, you can set that chance to an arbitrarily small value).
As such, if you're dealing with existing data, you're probably out of luck. If you're creating something new, and want capabilities on this order, it is possible -- though trying to do it on your own is seriously non-trivial.

No. Hashes are designed so that minor variations in the input string cause huge differences in the resulting hash. This is very useful for dictionary implementations, as well as for verifying the integrity of a file (a single changed bit will cause a completely different hash). So no, it's not something you can ever use for a "how close" comparison.

If the hashCodes are different it cannot be the same String, however many Strings can have the same hashCode().
Depending on the nature of the Strings, a plain comparison could be more efficient than comparing hashCode() values: computing a hash has to inspect and perform a calculation on every character, whereas a comparison can stop early, e.g. if the lengths differ or as soon as it sees a different character.

Any good hashing algorithm will by definition NEVER yield similar hashes for similar arguments. Otherwise, it would be too easy to crack. If the hashed value of "aaaa" looks similar to that of "aaab", then that is a poor hash. I have cracked ones like that before without too much difficulty (fun puzzle to solve!). But you never know, maybe your hash algorithm is poor. Any idea what it is?
If you have time, you can just brute force this solution by hashing every possible word. Not elegant, but possible. Easier if you know the length of the original word as well.
If it is a standard hash algorithm, like MD5, you can find websites that already have large mappings of source and hash, and get the answer that way. Try http://hashcrack.com/
I used this website successfully after one of our devs left and I needed to recover a password.
Cheers,
Daniel

You can treat the string as a really big number, but that's about the extent of your abilities in the general situation. If you have a specific problem domain, you may be able to compress a representation of the string to something smaller without losses, but still it will not be very useful.
For example, if you are working with individual words, you can use soundex to compare how similar two words will sound...
The best you can do with traditional hash codes will be to compare two strings for equality vs likely inequality. False positives are possible, but there will be no false negatives. You cannot compare for similarity this way, though.

A normal hash code changes a lot when the object changes a bit. It is made to distinguish different objects and doesn't care how much they resemble each other. Therefore the answer is no.

Well, it seems you want not a real hash of the string, but some kind of fingerprint of it. Since you want it to fit in 32 bits, one way could be:
Calculate Pearson correlation coefficient between first and second half of string (if string length is odd number of chars, then add some padding) and store this number as 32-bit floating point number. But I'm not sure how reliable this method will be.
==EDIT==
Here is C example code (un-optimized) which implements this idea (a little bit modified):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
float mean(char *str) {
char *x;
float sum = 0.0;
for(x=str; *x!='\0'; x++) {
sum += (float) *x;
}
return sum/strlen(str);
}
float stddev(char *str) {
char *x;
float sum = 0.0;
float u = mean(str);
for(x=str; *x!='\0'; x++) {
sum += ((float)*x - u)*((float)*x - u);
}
return sqrt(sum/strlen(str));
}
float covariance(char *str1, char *str2) {
int i;
int im = fmin(strlen(str1),strlen(str2));
float sum = 0.0;
float u1 = mean(str1);
float u2 = mean(str2);
for(i=0; i<im; i++) {
sum += ((float)str1[i] - u1)*((float)str2[i] - u2);
}
return sum/im;
}
float correlation(char *str1, char *str2) {
float cov = covariance(str1,str2);
float dev1 = stddev(str1);
float dev2 = stddev(str2);
return cov/(dev1*dev2);
}
float string_fingerprint(char *str) {
int len = strlen(str);
char *rot = (char*) malloc((len+1)*sizeof(char));
int i;
// rotate string by CHAR_COUNT/2
for(i=0; i<len; i++){
rot[i] = str[(i+len/2)%len];
}
rot[len] = '\0';
// now calculate correlation between original and rotated strings
float corr = correlation(str,rot);
free(rot);
return corr;
}
int main() {
char string1[] = "The quick brown fox jumps over the lazy dog";
char string2[] = "The slow brown fox jumps over the crazy dog";
float f1 = string_fingerprint(string1);
float f2 = string_fingerprint(string2);
if (fabs(f1 - f2) < 0.2) {
printf("wow, they are very close!\n");
}
return 0;
}
hth!

Related

java hashing function collision

Attempting to write my own hash function in Java. I'm aware that this is the same one that java implements but wanted to test it out myself. I'm getting collisions when I input different values and am not sure why.
public static int hashCodeForString(String s) {
int m = 1;
int myhash = 0;
for (int i = 0; i < s.length(); i++, m++){
myhash += s.charAt(i) * Math.pow(31,(s.length() - m));
}
return myhash;
}
Kindly remember just how a hash-table (in any language ...) actually works: it consists of a (usually, prime) number of "buckets." The purpose of the hash-function is simply to convert any incoming key-value into a bucket-number. (The worst-case scenario is always that 100% of the incoming keys wind-up in a single bucket, leaving you with "a linked list.") You simply strive to devise a hash-function that will "typically" produce a "widely scattered" distribution of values so that, when calculated modulo the (prime ...) number of buckets, "most of the time, most of the buckets" will be "more-or-less equally" filled. (But remember: you can never be sure.)
"Collisions" are entirely to be expected: in fact, "they happen all the time."
In my humble opinion, you're "over-thinking" the hash-function: I see no compelling reason to use Math.pow() at all. Expect that any value which you produce will be converted to a hash-bucket number by taking its absolute value modulo the number of buckets. The best way to see if you came up with a good one (for your data ...) is to observe the resulting distribution of bucket-sizes. (Is it "good enough" for your purposes yet?)
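For comparison, here is a minimal sketch of the same polynomial computed without Math.pow(), using Horner's rule in plain int arithmetic; the class name is made up for illustration. One likely source of the extra collisions in the question's version is that Math.pow() returns a double, and `myhash += someDouble` narrows the sum back to int, which saturates at Integer.MAX_VALUE once the value no longer fits. Longer strings can then all collapse to the same number, whereas int arithmetic simply wraps around.

```java
final class HornerHash {

    // Same polynomial as String.hashCode(): s[0]*31^(n-1) + ... + s[n-1],
    // evaluated with Horner's rule so that overflow wraps instead of saturating.
    static int hashCodeForString(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
        }
        return hash;
    }
}
```

Collisions are still possible, as with any function that maps arbitrary strings to 32 bits, but they should no longer pile up for long inputs.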

How to get a unique alphanumeric based on a unique integer

My webapplication has a table in the database with an id column which will always be unique for each row. In addition to this I want to have another column called code that will have a 6 digit unique Alphanumeric code with numbers 0-9 and alphabets A-Z. Alphabets and number can be duplicate in a code. i.e. FFQ77J. I understand the uniqueness of this 6 digit alphanumeric code reduces over time as more rows are added but for now I am ok with this.
Requirement (update)
- The code should be at least of length 6
- Each code should be Alphanumeric
So I want to generate this Alphanumeric code.
Question
What is a good way to do this?
Should I generate the code and after the generation, run a query to the database and check if it already exists, and if so then generate a new one? To ensure the uniqueness, does this piece of code need to be synchronized so that only one thread runs it?
Is there something built-in to the database that will let me do this?
For the generation I will be using something like this which I saw in this answer
char[] symbols = new char[36];
char[] buf;
for (int idx = 0; idx < 10; ++idx)
symbols[idx] = (char) ('0' + idx);
for (int idx = 10; idx < 36; ++idx)
symbols[idx] = (char) ('A' + idx - 10);
public String nextString()
{
for (int idx = 0; idx < buf.length; ++idx)
buf[idx] = symbols[random.nextInt(symbols.length)];
return new String(buf);
}
Since it's a requirement for the shortcode to not be guessable, you don't want to tie it to your uniqueID row ID. Otherwise that means your rowID needs to be random, in addition to unique. Starting with a counter 0, and incrementing, makes it pretty obvious when your codes are: 000001, 000002, 000003, and so forth.
For your short code, generate a random 32bit int, omit the sign and convert to base36. Make a call to your database, to ensure it's available.
You haven't explicitly called out scalability, but I think it's important to understand the limitations of your design with respect to scale.
With 2^31 possible values (each of which fits in 6 base-36 characters), you should expect collisions after only tens of thousands of rows, roughly 55k to 65k (see Birthday Paradox questions).
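The birthday estimate can be checked with a few lines of Java. This is a sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n random values drawn from N possibilities; the class and method names are invented for this example.

```java
final class BirthdayEstimate {

    // Approximate probability of at least one collision among `rows` values
    // drawn uniformly at random from `space` possibilities.
    static double collisionProbability(long rows, double space) {
        return 1.0 - Math.exp(-(double) rows * (rows - 1) / (2.0 * space));
    }
}
```

Calling collisionProbability(rows, Math.pow(2, 31)) for a few row counts shows the chance of a collision passing one half in the mid-50,000s, long before the code space is anywhere near full.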
From your comment, modify your code:
public String nextString()
{
return Integer.toString(random.nextInt() & Integer.MAX_VALUE, 36); // mask off the sign bit so no "-" appears
}
I would simply do this:
String s = Integer.toString(i, 36).toUpperCase();
Choosing base-36 will use characters 0-9a-z for the digits. To get a string that uses uppercase letters (as per your question) you would need to fold the result to upper case.
If you use an auto increment column for your id, set the next value to at least 60,466,176, which when rendered to base 36 is 100000 - always giving you a 6 digit number.
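A small sketch of that suggestion, with an invented class name and constant: 60,466,176 is 36^5, so it is the first id whose base-36 rendering needs six digits.

```java
import java.util.Locale;

final class Base36Code {

    // 36^5: the smallest number whose base-36 form has six digits ("100000").
    static final int FIRST_SIX_DIGIT_ID = 60_466_176;

    // Renders an id as an upper-case base-36 string.
    static String toCode(int id) {
        return Integer.toString(id, 36).toUpperCase(Locale.ROOT);
    }
}
```

Ids below FIRST_SIX_DIGIT_ID produce shorter codes (the one just below it is "ZZZZZ"), which is why the auto-increment counter would have to start at that value or above.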
I would start with 0 for an empty table and do a
SELECT MAX(ID) FROM table
to find the largest id so far. Store it in an AtomicInteger and convert it using toString
AtomicInteger counter = new AtomicInteger(maxSoFar);
String nextId = Integer.toString(counter.incrementAndGet(), 36);
or for padding. 36 ^^ 6 = 2176782336L
String nextId = Long.toString(2176782336L + counter.incrementAndGet(), 36).substring(1);
This will give you uniqueness and no duplicates to worry about. (it's not random either)
Simply, you can use Integer.toString(int i, int radix). Since you have base 36(26 letters+10 digits) you set the radix to 36 and i to your integer. For example, to use 16501, do:
String identifier=Integer.toString(16501, 36);
You can uppercase it with .toUpperCase()
Now onto your other questions: yes, you should query the database first to ensure the code doesn't exist. Depending on the database, this may need to be synchronized, or it may not, as the database will use its own locking system. In any case, you'd need to tell us which database.
On the question of whether there's a builtin, we'd need to know the DB type as well.
To create a random but unique value within a small range here are some ideas I know of:
Create a new random value and try to insert it.
Let a database constraint catch violations. This column should also likely be indexed. The DML may need to be tried several times until a unique ID is found. This will lead to more collisions as time progresses, as noted (see the birthday problem).
Create a "free IDs" table ahead of time and on usage mark the ID as being used (or delete it from the "free IDs" table). This is similar to #1 but shifts when the work is done.
This allows the work of finding "free IDs" to be done at another time, perhaps during a cron job, so that there will not be a constraint violation during the insert, keeping the insert itself the "same speed" throughout the usage of said domain. Make sure to use transactions.
Create a 1-to-1/injective "mixer" function such that the output "appears random". The point is this function must be 1-to-1 to inherently avoid duplicates.
This output number would then be "base 36 encoded" (which is also injective); but it would be guaranteed unique as long as the input (say, an auto-increment PK) was unique. This would likely be less random than the other approaches, but should still create a nice-looking non-linear output.
A custom injective function can be created around an 8-bit lookup table fairly trivially - just process a byte at a time and shuffle the map appropriately. I really like this idea, but it can still lead to somewhat predictable output
To find free IDs, approaches #1 and #2 above can use "probing with IN" to minimize the number of SQL statements used. That is, generate a bunch of random values and query for them using IN (keeping in mind what sizes of IN your database likes) and then see which values were free (as having no results).
To create a unique ID not constrained to such a small space, a GUID or even hashing (e.g. SHA1) might be useful. However, these only guarantee uniqueness because they have 128/160-bit spaces so that the chance of collision (for different input/time-space) is currently accepted as improbable.
I actually really like the idea of using an injective function. Bearing in mind that it is not good "random" output, consider this pseudo-code:
byte_map = [0..255]
map[0] = shuffle(byte_map, seed[0])
..
map[n] = shuffle(byte_map, seed[n])
output[0] = map[0][input[0]]
..
output[n] = map[n][input[n]]
output_str = base36_encode(output[0] .. output[n])
While a very simple setup, numbers like 0x200012 and 0x200054 will still share common output - e.g. 0x1942fe and 0x1942a9 - although the lines will be changed a bit due to the later application of the base-36 encoding. This could probably be further improved to "make it look more random".
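The pseudo-code could be fleshed out in Java roughly as follows. This is only a sketch: the class name, the use of java.util.Random for the shuffles and the seed values are assumptions, and java.util.Random is not a cryptographic generator, so the output is still predictable to anyone who knows the seeds. Each byte of the id goes through its own shuffled 256-entry table; because every table is a permutation, distinct ids always give distinct outputs.

```java
import java.util.Locale;
import java.util.Random;

final class ByteMixer {

    // One shuffled permutation of 0..255 per input byte.
    private final int[][] maps;

    ByteMixer(long... seeds) {
        if (seeds.length < 1 || seeds.length > 4) {
            throw new IllegalArgumentException("need 1 to 4 seeds, one per byte");
        }
        maps = new int[seeds.length][256];
        for (int k = 0; k < seeds.length; k++) {
            for (int v = 0; v < 256; v++) {
                maps[k][v] = v;
            }
            // Fisher-Yates shuffle driven by the seed for this byte position.
            Random rnd = new Random(seeds[k]);
            for (int i = 255; i > 0; i--) {
                int j = rnd.nextInt(i + 1);
                int tmp = maps[k][i];
                maps[k][i] = maps[k][j];
                maps[k][j] = tmp;
            }
        }
    }

    // Maps each byte of the id through its table, then base-36 encodes the result.
    String encode(int id) {
        if (maps.length < 4 && (id >>> (8 * maps.length)) != 0) {
            throw new IllegalArgumentException("id does not fit in " + maps.length + " bytes");
        }
        long out = 0;
        for (int k = 0; k < maps.length; k++) {
            int b = (id >>> (8 * k)) & 0xFF;
            out |= (long) maps[k][b] << (8 * k);
        }
        return Long.toString(out, 36).toUpperCase(Locale.ROOT);
    }
}
```

For example, new ByteMixer(11L, 22L, 33L).encode(rowId) handles ids below 2^24. The codes vary in length, so padding to a fixed width would still have to be added on top.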
For efficient usage, try caching generated code in a HashSet<String> in your application:
HashSet<String> codes = new HashSet<String>();
This way you don't have to make a db call every time to check whether the generated code is unique or not. All you have to do is:
codes.contains(newCode);
And, yes, you should synchronize your method which updates the cache
public synchronized String getCode()
{
String newCode;
do {
newCode = nextString();
}
while (codes.contains(newCode));
codes.add(newCode); // remember the code so it is never handed out twice
return newCode;
}
You mentioned in your comments that the relationship between id and code should not be easily guessable. For this you basically need encryption; there are plenty of encryption programs and modules out there that will perform encryption for you, given a secret key that you initially generate. To employ this approach, I would recommend converting your id into ascii (i.e., representing as base-256, and then interpreting each base-256 digit as a character) and then running the encryption, and then converting the encrypted ascii (base-256) into base 36 so you get your alpha-numeric, and then using 6 randomly chosen locations in the base 36 representation to get your code. You can resolve collisions e.g. by just choosing the nearest unused 6-digit alpha-numeric code when a collision occurs, and noting the re-assigned alpha-numeric code for the id in a (code <-> id) table that you will have to maintain anyway since you cannot decrypt directly if you only store 6 base-36 digits of the encrypted id.

Is there an equivalent to Java's String intern function in Go?

Is there an equivalent to Java's String intern function in Go?
I am parsing a lot of text input that has repeating patterns (tags). I would like to be memory efficient about it and store pointers to a single string for each tag, instead of multiple strings for each occurrence of a tag.
No such function exists that I know of. However, you can make your own very easily using maps. The string type itself is a uintptr and a length. So, a string assigned from another string takes up only two words. Therefore, all you need to do is ensure that there are no two strings with redundant content.
Here is an example of what I mean.
type Interner map[string]string
func NewInterner() Interner {
return Interner(make(map[string]string))
}
func (m Interner) Intern(s string) string {
if ret, ok := m[s]; ok {
return ret
}
m[s] = s
return s
}
This code will deduplicate redundant strings whenever you do the following:
str = interner.Intern(str)
EDIT: As jnml mentioned, my answer could pin memory depending on the string it is given. There are two ways to solve this problem. Both of these should be inserted before m[s] = s in my previous example. The first copies the string twice, the second uses unsafe. Neither are ideal.
Double copy:
b := []byte(s)
s = string(b)
Unsafe (use at your own risk. Works with current version of gc compiler):
b := []byte(s)
s = *(*string)(unsafe.Pointer(&b))
I think that, for example, Pool and GoPool may fulfill your needs. That code solves one thing which Stephen's solution ignores. In Go, a string value may be a slice of a bigger string. There are scenarios where that doesn't matter and scenarios where it is a show stopper. The linked functions attempt to be on the safe side.

Generating sequentially all combination of a finite set using lexicographic order and bitwise arithmetic

Consider all combination of length 3 of the following array of integer {1,2,3}.
I would like to traverse all combination of length 3 using the following algorithm from wikipedia
// find next k-combination
bool next_combination(unsigned long& x) // assume x has form x'01^a10^b in binary
{
unsigned long u = x & -x; // extract rightmost bit 1; u = 0'00^a10^b
unsigned long v = u + x; // set last non-trailing bit 0, and clear to the right; v=x'10^a00^b
if (v==0) // then overflow in v, or x==0
return false; // signal that next k-combination cannot be represented
x = v +(((v^x)/u)>>2); // v^x = 0'11^a10^b, (v^x)/u = 0'0^b1^{a+2}, and x ← x'100^b1^a
return true; // successful completion
}
What should be my starting value for this algorithm for all combination of {1,2,3}?
When I get the output of the algorithm, how do I recover the combination?
I've tried the following direct adaptation, but I'm new to bitwise arithmetic and I can't tell if this is correct.
// find next k-combination, Java
int next_combination(int x)
{
int u = x & -x;
int v = u + x;
if (v==0)
return v;
x = v +(((v^x)/u)>>2);
return x;
}
I found a class that solves exactly this problem. See the class CombinationGenerator here
https://bitbucket.org/rayortigas/everyhand-java/src/9e5f1d7bd9ca/src/Combinatorics.java
To recover a combination do
for(Long combination : combinationIterator(10,3))
toCombination(toPermutation(combination));
Thanks everybody for your input.
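For readers who want to stay with the bit trick from the question, here is a minimal self-contained sketch; the class and method names are invented. It answers the two questions directly: the starting value is the mask with the k lowest bits set, (1 << k) - 1, and a combination is recovered by picking the array elements whose bit positions are set in the mask.

```java
import java.util.ArrayList;
import java.util.List;

final class BitCombinations {

    // Next larger integer with the same number of set bits. Requires x > 0.
    static int nextCombination(int x) {
        int u = x & -x;                    // lowest set bit
        int v = u + x;                     // ripple the lowest run of 1s upwards
        return v + (((v ^ x) / u) >>> 2);  // move the rest of that run to the bottom
    }

    // All k-element combinations of items; assumes 1 <= k <= items.length <= 30.
    static List<List<Integer>> combinations(int[] items, int k) {
        List<List<Integer>> result = new ArrayList<>();
        int limit = 1 << items.length;     // first mask that uses a bit beyond the array
        for (int x = (1 << k) - 1; x < limit; x = nextCombination(x)) {
            List<Integer> combo = new ArrayList<>();
            for (int i = 0; i < items.length; i++) {
                if ((x & (1 << i)) != 0) {
                    combo.add(items[i]);   // bit i set means items[i] is chosen
                }
            }
            result.add(combo);
        }
        return result;
    }
}
```

For the array {1,2,3} and k = 2 the masks visited are 011, 101 and 110 in binary, i.e. {1,2}, {1,3} and {2,3}. With k = 3 there is only the single mask 111.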
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it might be faster than the link you have found.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
It should not be hard to convert this class to Java.

How do I implement a string comparison in Java that takes the same amount of time no matter whether they match or where a mismatch (if any) occurs?

I want to implement a String comparison function that doesn't take a different amount of time depending on the number of characters that match or the position of the first mismatch. I assume there must be a library out there somewhere that provides this, but I was unable to find it via a quick search.
So far, the best idea I've got is to sum the XOR of each character and return whether or not the sum is 0. However, I'm pretty sure this wouldn't work so well with Unicode. I also have a vague concern that HotSpot would do some optimizations that would change my constant-time property, but I can't think of a specific optimization that would do this off the top of my head.
Thanks.
UPDATE: Sorry, I don't believe I was clear. I'm not looking for O(1), I'm looking for something that won't leak timing information. This would be used to compare hashed password values, and if the time it took to compare was different based on where the first mismatch occurred, that would be leaking information to an attacker.
I want to implement a String comparison function that doesn't take a
different amount of time depending on the number of characters that
match or the position of the first mismatch. I assume there must be a
library out there somewhere that provides this, but I was unable to
find it via a quick search.
So far, the best idea I've got is to sum the XOR of each character
Do you see the contradiction?
update:
To the updated, and therefore different question:
Can you gain information once, how much time is spent for comparing 2 Strings, in terms of constant amount and time, depending on the length of the two strings?
a + b*s(1).length + c*s(2).length + d*f(s(1), s(2))?
Is there an upper bound of characters for String 1 and 2?
If the time depends on the machine, say you expect 0.01ms for the longest strings: measure the time it takes to encode the string, and stay idle until you reach that time, maybe plus a random factor of up to 10% of the time.
If the length of the input is not limited, you could calculate the timing in a way, that will fit for 99%, 99.9% or 99.99% of typical input, depending on your security needs, and the speed of the machine. If the program is interacting with the user, a delay up to 0.2s is normally experienced as instant reaction, so it wouldn't annoy the user, if your code sleeps for 0.19994s, while doing real calculations for 0.00006s.
I see two immediate possibilities for not leaking password-related information in timing:
1/ Pad both the password string and candidate string out to 1K, with a known, fixed character (like A). Then run the following (pseudo-code):
match = true
for i = 0 to 1023:
if password[i] != candidate[i]:
match = false
That way, you're always taking the same amount of loops to do the comparison regardless of where it matches.
There's no need to muck about with xor since you can still do a simple comparison, but without exiting the loop early.
Just set the match flag to false if a mismatch is found and keep going. Once the loop exits (taking the same time regardless of size or content of password and candidate), then check whether it matched.
2/ Just add a large (relative to the normal comparison time) but slightly random delay at the end of the comparison. For example, a random value between 0.9 and 1.1 seconds. The time taken for the comparison should be swamped by the delay and the randomness should fully mask any information leakage (unless your randomness algorithm leaks information, of course).
That also has the added advantage of preventing brute force attacks since a password check takes at least about a second.
This should take approximately the same time for any matching length Strings. It's constant-time with a big constant.
public static boolean areEqualConstantTime(String a, String b) {
if ( a.length() != b.length() ) {
return false;
}
if ( a.isEmpty() ) {
return true; // both are empty; this also avoids a modulo by zero below
}
boolean equal = true;
for ( long i = 0; i < Integer.MAX_VALUE; i++ ) {
if ( a.charAt((int)(i % a.length())) != b.charAt((int)(i % b.length())) ) {
equal = false;
}
}
return equal;
}
Edit
Wow, if you're just trying to avoid leaking timing information this facetious answer got pretty close to the mark! We can start with a naive approach like this:
public static boolean arePasswordsEqual(String a, String b) {
boolean equal = true;
if ( a.length() != b.length() ) {
equal = false;
}
for ( int i = 0; i < MAX_PASSWORD_LENGTH; i++ ) {
if ( a.charAt(i%a.length()) != b.charAt(i%b.length()) ) {
equal = false;
}
}
return equal;
}
We need the MAX_PASSWORD_LENGTH constant because we can't simply use either the max or the min of the two input lengths as that would also leak timing information. An attacker could start with a very small guess and see how long the function takes. When the function time plateaus, he would know his password has the right length which eliminates much of the range of values he needs to try.
The following code is one common way to do a constant-time byte[] comparison in Java and avoid leaking password info via the time taken:
public static boolean isEqual(byte[] a, byte[] b) {
if (a.length != b.length) {
return false;
}
int result = 0;
for (int i = 0; i < a.length; i++) {
result |= a[i] ^ b[i];
}
return result == 0;
}
See http://codahale.com/a-lesson-in-timing-attacks/ for more discussion of this issue.
For inspiration, the current OpenJDK implementation is available in the MessageDigest.isEqual method.
(This assumes that the length of the secret is not sensitive, for example if it is a hash. You should pad both sides to the same length if that is not true.)
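If the values being compared are hash strings, one option is to lean on the JDK instead of a hand-written loop. The sketch below wraps java.security.MessageDigest.isEqual; the class and method names are made up, and whether the comparison is time-constant depends on the JDK in use (recent OpenJDK releases implement it so that the time does not depend on the contents), so check the documentation of the version you run.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

final class ConstantTimeCompare {

    // Compares two hash strings byte by byte via the JDK helper.
    // Assumes the length of the values is not itself a secret.
    static boolean hashesEqual(String storedHash, String candidateHash) {
        byte[] a = storedHash.getBytes(StandardCharsets.UTF_8);
        byte[] b = candidateHash.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(a, b);
    }
}
```

Converting to bytes with an explicit charset also sidesteps most of the Unicode worry, since two hex or Base64 hash strings encode to the same bytes exactly when they are the same string.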
This is essentially the same as your first suggestion:
So far, the best idea I've got is to sum the XOR of each character and return whether or not the sum is 0.
You asked:
However, I'm pretty sure this wouldn't work so well with Unicode.
This is a valid concern, but you need to clarify what you will accept as "equal" for a solution to be proposed. Luckily, you also say "This would be used to compare hashed password values", so I don't think that any of the unicode concerns will be in play.
I also have a vague concern that HotSpot would do some optimizations that would change my constant-time property,
Hopefully that's not true. I expect that the literature on how to avoid timing attacks in Java would address this if it were true, but I can't offer you any citations to back this up :-)
You can do constant time string comparisons if you intern your strings. Then you can compare them with == operator resulting in a constant time equality check.
Read this for further details
The usual solution is to set a timer, and not return the result until the timer has expired.
The time taken to compare strings or hashes is not important, just set the time-out value sufficiently large.
