Convert a string into something reversible in Java

I have a lot of URLs that serve as keys in an HBase table. Since they "all" start with http://, HBase puts them on the same node, so I end up with one node at 100% load and the others idle.
So I need to map each URL to something hash-like, but reversible. Is there any simple, standard, and fast way to do that in Java 8?
I am looking for a random (roughly uniform) distribution of prefixes.
Note:
reversing the URL is not a good option, since many URLs end with /, ?, or =, which risks unbalancing the distribution.
I do not need encryption, but I can accept it.
I am not looking for compression either, but it is welcome if possible :)
Thanks,
Costin

There's not a single, standard way.
One thing you can do is to prefix the key with its hash. Something like:
a01cc0fe http://...
That's easily reversible (just snip off the hash characters, which you can make a fixed length) and will get you a good distribution.
The hash code for a string is stable and consistent across JVMs. The algorithm for computing it is specified in String.hashCode's documentation, so you can consider it part of the contract of how a String works.
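As a sketch of that idea (the helper names here are illustrative, not from the original post), a fixed-width hex prefix could look like this:
// Prepend a fixed-width (8 hex chars) rendering of the hash; reverse by stripping it off.
static String toKey(String url) {
    return String.format("%08x", url.hashCode()) + url;
}

static String fromKey(String key) {
    return key.substring(8); // the prefix is always exactly 8 characters
}
Because the prefix length is fixed, no delimiter is needed between the hash and the original URL.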

Add a prefix derived from the hash code, encoded in base 36 ([0-9a-z]).
public static String encode(String s) {
    return Integer.toString(s.hashCode() & 0xffffff, 36) + "#" + s;
}

public static String decode(String s) {
    return s.replaceFirst("^[^#]*#", "");
}
sample:
http://google.com/ <-> 5o07l#http://google.com/

Related

What is the minimum test to verify that a component can save/retrieve UTF8 encoded strings

I am integration testing a component. The component allows you to save and fetch strings.
I want to verify that the component is handling UTF-8 characters properly. What is the minimum test that is required to verify this?
I think that doing something like this is a good start:
// This is the ☺ character
String toSave = "\u263A";
int id = 123;
// Saves to Database
myComponent.save( id, toSave );
// Retrieve from Database
String fromComponent = myComponent.retrieve( id );
// Verify they are same
org.junit.Assert.assertEquals( toSave, fromComponent );
One mistake I made in the past was setting String toSave = "è". My test passed because the string was saved and retrieved properly to/from the DB. Unfortunately the application was not actually working correctly, because it was using ISO 8859-1 encoding. This meant that è worked but other characters like ☺ did not.
Question restated: What is the minimum test (or tests) to verify that I can persist UTF-8 encoded strings?
A code and/or documentation review is probably your best option here, but you can probe if you want. It seems that a sufficient test is the goal and minimizing it is less important. It is hard to say what a sufficient test is based only on speculation about the threat, but here is my suggestion: cover all codepoints, including U+0000, and proper handling of "combining characters."
The method you want to test has a Java string as a parameter. Java doesn't have "UTF-8 encoded strings": Java's native text datatypes use the UTF-16 encoding of the Unicode character set. This is common for in-memory representations of text; it is used by Java, .NET, JavaScript, VB6, VBA, and others. UTF-8 is commonly used for streams and storage, so it makes sense to ask about it in the context of "saving and fetching". Databases typically offer one or more of UTF-8, 3-byte-limited UTF-8, or UTF-16 (NVARCHAR) datatypes and collations.
The encoding is an implementation detail. If the component accepts a Java string, it should either throw an exception for data it is unwilling to handle or handle it properly.
"Characters" is a rather ill-defined term. Unicode codepoints range from 0x0 to 0x10FFFF—21 bits. Some codepoints are not assigned (aka "defined"), depending on the Unicode Standard revision. Java datatypes can handle any codepoint, but information about them is limited by version. For Java 8, "Character information is based on the Unicode Standard, version 6.2.0.". You can limit the test to "defined" codepoints or go all possible codepoints.
A codepoint is either a base "character" or a "combining character". Also, each codepoint is in exactly one Unicode category; two categories are for combining characters. To form a grapheme, a base character is followed by zero or more combining characters. It might be difficult to lay out graphemes graphically (see Zalgo text), but for text storage all that is needed is to not mangle the sequence of codepoints (and byte order, if applicable).
So, here is a non-minimal, somewhat comprehensive test:
final Stream<Integer> codepoints = IntStream
        .rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
        .filter(cp -> Character.isDefined(cp)) // optional filtering
        .boxed();
final int[] combiningCategories = {
        Character.COMBINING_SPACING_MARK,
        Character.ENCLOSING_MARK
};
final Map<Boolean, List<Integer>> partitionedCodepoints = codepoints
        .collect(Collectors.partitioningBy(cp ->
                Arrays.binarySearch(combiningCategories, Character.getType(cp)) < 0));
final Integer[] baseCodepoints = partitionedCodepoints.get(true)
        .toArray(new Integer[0]);
final Integer[] combiningCodepoints = partitionedCodepoints.get(false)
        .toArray(new Integer[0]);
final int baseLength = baseCodepoints.length;
final int combiningLength = combiningCodepoints.length;

final StringBuilder graphemes = new StringBuilder();
for (int i = 0; i < baseLength; i++) {
    graphemes.append(Character.toChars(baseCodepoints[i]));
    graphemes.append(Character.toChars(combiningCodepoints[i % combiningLength]));
}
final String test = graphemes.toString();
final byte[] testUTF8 = StandardCharsets.UTF_8.encode(test).array();

// Java 8 counts for when filtering by Character.isDefined
assertEquals(736681, test.length()); // number of UTF-16 code units
assertEquals(3241399, testUTF8.length); // number of UTF-8 code units
If your component is only capable of storing and retrieving strings, then all you need to do is make sure that nothing gets lost in the conversion to and from the Unicode strings of java and the UTF-8 strings that the component stores.
That would involve checking with at least one character from each UTF-8 code point length. So, I would suggest checking with:
One character from the US-ASCII set, (1-byte long code point,) then
One character from Greek, (2-byte long code point,) and
One character from Chinese (3-byte long code point.)
Ideally you would also want to check with an emoji (a 4-byte code point in UTF-8); Java represents these supplementary code points as a surrogate pair of two chars, so they can appear in ordinary Java strings.
A useful extra test would be to try a string combining at least one character from each of the above cases, so as to make sure that characters of different code-point lengths can co-exist within the same string.
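A rough sketch of such a round-trip test, reusing the save/retrieve component from the question (the code points chosen are just examples of each UTF-8 length; the emoji appears as a surrogate pair in Java source):
// 1-byte (ASCII), 2-byte (Greek alpha), 3-byte (CJK), 4-byte (emoji) code points in UTF-8
String toSave = "A" + "\u03B1" + "\u4E2D" + "\uD83D\uDE00";
int id = 124;
myComponent.save(id, toSave);
String fromComponent = myComponent.retrieve(id);
org.junit.Assert.assertEquals(toSave, fromComponent);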
(If your component does anything more than storing and retrieving strings, like searching for strings, then things can get a bit more complicated, but it seems to me that you specifically avoided asking about that.)
I do believe that black box testing is the only kind of testing that makes sense, so I would not recommend polluting the interface of your component with methods that would expose knowledge of its internals. However, there are two things that you can do to increase the testability of the component without ruining its interface:
Introduce additional functions to the interface that might help with testing without disclosing anything about the internal implementation and without requiring the test code to have knowledge of the component's internals.
Introduce functionality useful for testing in the constructor of your component. The code that constructs the component knows precisely what component it is constructing, so it is intimately familiar with the nature of the component, so it is okay to pass something implementation-specific there.
An example of what you could do with either of the above techniques would be to artificially and severely limit the number of bytes that the internal representation is allowed to occupy, so that you can make sure that a certain string you are planning to store will fit. So, you could limit the internal size to no more than 9 bytes, and then make sure that a Java string containing 3 Chinese characters gets properly stored and retrieved.
String instances use a predefined and unchangeable internal encoding (16-bit char values).
So, returning only a String from your service is probably not enough to do this check.
You should try to return the byte representation of the persisted String (a byte array for example) and compare the content of this array with the "\u263A" String that you would encode in bytes with the UTF-8 charset.
String toSave = "\u263A";
int id = 123;
// Saves to Database
myComponent.save(id, toSave );
// Retrieve from Database
byte[] actualBytes = myComponent.retrieve(id );
// assertion
byte[] expectedBytes = toSave.getBytes(Charset.forName("UTF-8"));
Assert.assertTrue(Arrays.equals(expectedBytes, actualBytes));

Easiest way in Java to turn String into UUID

How to generate a valid UUID from a String? The String alone is not what I'm looking for. Rather, I'm looking for something like a hash function converting any String to a valid UUID.
Try this out:
String superSecretId = "f000aa01-0451-4000-b000-000000000000";
UUID.fromString(superSecretId);
I am using this in my project and it works. Make sure you import the right stuff.
In the Java core library, there's java.util.UUID.nameUUIDFromBytes(byte[]).
I wouldn't recommend it because it's UUID 3 which is based on MD5, a very broken hash function. You'd be better off finding an implementation of UUID 5, which is based on SHA-1 (better although also sort of broken).
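For reference, a minimal sketch of the type-3 route (the method name uuidForString and the UTF-8 choice are assumptions; nameUUIDFromBytes accepts any byte array):
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Deterministic: the same input string always maps to the same version-3 UUID.
static UUID uuidForString(String s) {
    return UUID.nameUUIDFromBytes(s.getBytes(StandardCharsets.UTF_8));
}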

Is it possible to compare two strings by their "hash" numbers?

I have a string which is lost forever. The only thing I have about it is some magic hash number. Now I have a new string, which could be similar or equal to the lost one. I need to find out how close it is.
Integer savedHash = 352736;
String newText = "this is new string";

if (Math.abs(hash(newText) - savedHash) < 100) {
    // wow, they are very close!
}
Are there any algorithms for this purpose?
ps. The length of the text is not fixed.
pps. I know how usual hash codes work. I'm interested in an algorithm that will work differently, giving me the functionality explained above.
ppps. In a very simple scenario this hash() method would look like:
public int hash(String txt) {
    return txt.length();
}
Standard hashing will not work in this case since close hash values do not imply close strings. In fact, most hash functions are designed to give close strings very different values, so as to create a random distribution of hash values for any given set of input strings.
If you had access to both strings, then you could use some kind of string distance function, such as Levenshtein distance. This calculates the edit distance between two strings, or the number of edits required to transform one to the other.
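For illustration, a compact dynamic-programming version of Levenshtein distance (a standard textbook formulation; it only works when both strings are still available):
static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;   // cost of inserting j characters
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;                                     // cost of deleting i characters
        for (int j = 1; j <= b.length(); j++) {
            int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
            curr[j] = Math.min(substitution, Math.min(prev[j] + 1, curr[j - 1] + 1));
        }
        int[] tmp = prev; prev = curr; curr = tmp;       // reuse the two rows
    }
    return prev[b.length()];
}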
In this case however, the best approach might be to use some kind of fuzzy hashing technique. That way you don't have to store the original string, and can still get some measure of similarity.
If the hashes don't match then the strings are different.
If the hashes match then the strings are probably the same.
There is nothing else you can infer from the hash value.
No, this isn't going to work. The similarity of a hash bears no relation to the similarity of the original strings. In fact, it is entirely possible for 2 different strings to have the same hash. All you can say for sure is that if the hashes are different the strings were different.
[Edited in light of comment, possibility of collision is of course very real]
Edit for clarification:
If you only have the hash of the old string then there is no way you are going to find the original value of that string. There is no algorithm that would tell you if the hashes of 2 different strings represented strings that were close, and even if there was it wouldn't help. Even if you find a string that has an exact hash match with your old string there is still no way you would know if it was your original string, as any number of strings can produce the same hash value. In fact, there is a vast* number of strings that can produce the same hash.
[In theory this vast number is actually infinite, but on any real storage system you can't generate an infinite number of strings. In any case your chance of matching an unknown string via this approach is very slim unless your hashes are large in relation to the input string, and even then you will need to brute force your way through every possible string.]
As others have pointed out, with a typical hash algorithm, it just doesn't work like that at all.
There are, however, a few people who've worked out algorithms that are at least somewhat similar to that. For one example, there's a company called "Xpriori" that has some hashing (or at least hash-like) algorithms that allow things like that. They'll let you compare for degree of similarity, or (for example) let you combine hashes so hash(a) + hash(b) == hash(a+b) (for some definition of +, not just simple addition of the numbers). Like with most hashes, there's always a possibility of collision, so you have some chance of a false positive (but by picking the hash size, you can set that chance to an arbitrarily small value).
As such, if you're dealing with existing data, you're probably out of luck. If you're creating something new, and want capabilities on this order, it is possible -- though trying to do it on your own is seriously non-trivial.
No. Hashes are designed so that minor variations in the input string cause huge differences in the resulting hash. This is very useful for dictionary implementations, as well as for verifying the integrity of a file (a single changed bit will cause a completely different hash). So no, a hash is not something you can use to measure how different two inputs are.
If the hashCodes are different, the Strings cannot be equal; however, many Strings can have the same hashCode().
Depending on the nature of the Strings, a plain comparison could be more efficient than comparing hash codes: hashCode() has to inspect and perform a calculation on every character, whereas a comparison can stop early, e.g. if the lengths differ or as soon as it sees a differing character.
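A concrete example of why a matching hashCode() proves nothing: "Aa" and "BB" are different strings that collide.
System.out.println("Aa".hashCode());   // 2112
System.out.println("BB".hashCode());   // 2112 - same hash, different strings
System.out.println("Aa".equals("BB")); // false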
Any good hashing algorithm will by definition NEVER yield similar hashes for similar arguments; otherwise it would be too easy to crack. If the hashed value of "aaaa" looks similar to that of "aaab", then that is a poor hash. I have cracked ones like that before without too much difficulty (a fun puzzle to solve!). But you never know, maybe your hash algorithm is poor. Any idea what it is?
If you have time, you can just brute force this solution by hashing every possible word. Not elegant, but possible. Easier if you know the length of the original word as well.
If it is a standard hash algorithm, like MD5, you can find websites that already have large mappings of source and hash, and get the answer that way. Try http://hashcrack.com/
I used this website successfully after one of our devs left and I needed to recover a password.
Cheers,
Daniel
You can treat the string as a really big number, but that's about the extent of your abilities in the general situation. If you have a specific problem domain, you may be able to compress a representation of the string to something smaller without losses, but still it will not be very useful.
For example, if you are working with individual words, you can use soundex to compare how similar two words will sound...
The best you can do with traditional hash codes will be to compare two strings for equality vs likely inequality. False positives are possible, but there will be no false negatives. You cannot compare for similarity this way, though.
A normal hash code changes a lot when the object changes a little; it is designed to distinguish different objects and does not care how similar they might be. Therefore the answer is no.
Well, it seems you want not a real hash of the string but some fingerprint of it. Since you want it to be 32 bits, one way could be:
Calculate the Pearson correlation coefficient between the first and second half of the string (if the string length is an odd number of chars, add some padding) and store this number as a 32-bit floating point number. But I'm not sure how reliable this method will be.
==EDIT==
Here is C example code (un-optimized) which implements this idea (a little bit modified):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>

float mean(char *str) {
    char *x;
    float sum = 0.0;
    for (x = str; *x != '\0'; x++) {
        sum += (float) *x;
    }
    return sum / strlen(str);
}

float stddev(char *str) {
    char *x;
    float sum = 0.0;
    float u = mean(str);
    for (x = str; *x != '\0'; x++) {
        sum += ((float) *x - u) * ((float) *x - u);
    }
    return sqrt(sum / strlen(str));
}

float covariance(char *str1, char *str2) {
    int i;
    int im = fmin(strlen(str1), strlen(str2));
    float sum = 0.0;
    float u1 = mean(str1);
    float u2 = mean(str2);
    for (i = 0; i < im; i++) {
        sum += ((float) str1[i] - u1) * ((float) str2[i] - u2);
    }
    return sum / im;
}

float correlation(char *str1, char *str2) {
    float cov = covariance(str1, str2);
    float dev1 = stddev(str1);
    float dev2 = stddev(str2);
    return cov / (dev1 * dev2);
}

float string_fingerprint(char *str) {
    int len = strlen(str);
    char *rot = (char *) malloc((len + 1) * sizeof(char));
    int i;

    // rotate string by CHAR_COUNT/2
    for (i = 0; i < len; i++) {
        rot[i] = str[(i + len / 2) % len];
    }
    rot[len] = '\0';

    // now calculate correlation between original and rotated strings
    float corr = correlation(str, rot);
    free(rot);
    return corr;
}

int main() {
    char string1[] = "The quick brown fox jumps over the lazy dog";
    char string2[] = "The slow brown fox jumps over the crazy dog";

    float f1 = string_fingerprint(string1);
    float f2 = string_fingerprint(string2);

    if (fabs(f1 - f2) < 0.2) {
        printf("wow, they are very close!\n");
    }
    return 0;
}
hth!

How to compress a String in Java?

I use GZIPOutputStream or ZipOutputStream to compress a String (my string.length() is less than 20), but the compressed result is longer than the original string.
On some site I found people saying that this is because my original string is too short: GZIPOutputStream is meant for compressing longer strings.
So, can somebody help me compress a String?
My function is like:
String compress(String original) throws Exception {
}
Update:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// ZipUtil
public class ZipUtil {

    public static String compress(String str) throws IOException {
        if (str == null || str.length() == 0) {
            return str;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(out);
        gzip.write(str.getBytes());
        gzip.close();
        return out.toString("ISO-8859-1");
    }

    public static void main(String[] args) throws IOException {
        String string = "admin";
        System.out.println("after compress:");
        System.out.println(ZipUtil.compress(string));
    }
}
The result is a string of unreadable characters, longer than the original.
Compression algorithms almost always have some form of space overhead, which means that they are only effective when compressing data which is sufficiently large that the overhead is smaller than the amount of saved space.
Compressing a string which is only 20 characters long is not too easy, and it is not always possible. If you have repetition, Huffman Coding or simple run-length encoding might be able to compress, but probably not by very much.
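As an illustration of the run-length idea, here is a naive encoder (a hypothetical helper, only worthwhile when the input contains long runs of repeated characters):
static String runLengthEncode(String s) {
    StringBuilder out = new StringBuilder();
    int i = 0;
    while (i < s.length()) {
        int j = i;
        while (j < s.length() && s.charAt(j) == s.charAt(i)) j++;
        out.append(s.charAt(i)).append(j - i); // e.g. "aaaabbc" -> "a4b2c1"
        i = j;
    }
    return out.toString();
}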
When you create a String, you can think of it as a list of chars; this means that for each character in your String, you need to support all the possible values of char. From the Sun docs:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
If you have a reduced set of characters you want to support, you can write a simple compression algorithm, which is analogous to binary -> decimal -> hex radix conversion. You go from 65,536 (or however many characters your target system supports) down to 26 (alphabetic) / 36 (alphanumeric) etc.
I've used this trick a few times, for example encoding timestamps as text (target base 36+, source base 10); just make sure you have plenty of unit tests!
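A small example of the radix trick, assuming a millisecond timestamp: 13 decimal digits become 8 characters in the [0-9a-z] alphabet.
long timestamp = 1735689600000L;               // 13 decimal digits
String encoded = Long.toString(timestamp, 36); // 8 base-36 characters
long decoded = Long.parseLong(encoded, 36);    // lossless round trip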
If the passwords are more or less "random" you are out of luck, you will not be able to get a significant reduction in size.
But: why do you need to compress the passwords? Maybe what you need is not compression, but some sort of hash value. If you just need to check whether a name matches a given password, you don't need to save the password; you can save the hash of the password. To check whether a typed-in password matches a given name, you build the hash value the same way and compare it to the saved hash. As a hash (Object.hashCode()) is an int, you would be able to store all 20 password hashes in 80 bytes.
Your friend is correct. Both gzip and ZIP are based on DEFLATE. This is a general purpose algorithm, and is not intended for encoding small strings.
If you need this, a possible solution is a custom encoding and decoding HashMap<String, String>. This can allow you to do a simple one-to-one mapping:
HashMap<String, String> toCompressed, toUncompressed;
String compressed = toCompressed.get(uncompressed);
// ...
String uncompressed = toUncompressed.get(compressed);
Clearly, this requires setup, and is only practical for a small number of strings.
Huffman Coding might help, but only if you have a lot of frequent characters in your small String
The ZIP algorithm is a combination of LZ77 and Huffman coding. You can use either of these algorithms separately.
The compression is based on 2 factors:
the repetition of substrings in your original string (LZ77): if there are a lot of repetitions, the compression will be efficient. This algorithm performs well for compressing long plain text, since words are often repeated
the frequency of each character in the compressed string (Huffman): the more unbalanced the character frequencies are, the more efficient the compression will be
In your case, you should try the LZ77 algorithm only. Used on its own, the string can be compressed without adding meta-information: this is probably better for compressing short strings.
For the Huffman algorithm, the coding tree has to be sent with the compressed text. So, for a small text, the result can be larger than the original text, because of the tree.
Huffman encoding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, then send the data encoded with the tree. If the tree is large relative to the data, there may be no net saving in size.
However, it is possible to avoid sending a tree: instead, you arrange for the sender and receiver to already have one. It can't be built specifically for every string, but you can have a single global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you should still get good compression, although not as good as with a custom tree for every input.
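A related trick the JDK supports out of the box is a preset DEFLATE dictionary: instead of a shared Huffman tree, both sides agree up front on a blob of likely substrings, so short inputs no longer pay for teaching the compressor from scratch. A sketch (the class name and dictionary contents are arbitrary examples):
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionaryExample {
    // Both compressor and decompressor must use exactly the same dictionary.
    private static final byte[] DICTIONARY =
            "http://www.https://.com/.org/index.html?id=".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(String s) {
        Deflater deflater = new Deflater();
        deflater.setDictionary(DICTIONARY);
        byte[] input = s.getBytes(StandardCharsets.UTF_8);
        deflater.setInput(input);
        deflater.finish();
        // Buffer sized generously for short inputs, so a single deflate() call is enough here.
        byte[] buffer = new byte[input.length + 64];
        int length = deflater.deflate(buffer);
        deflater.end();
        return Arrays.copyOf(buffer, length);
    }

    static String decompress(byte[] data) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && inflater.needsDictionary()) {
                inflater.setDictionary(DICTIONARY); // supply the agreed-upon dictionary
            } else {
                out.write(buffer, 0, n);
            }
        }
        inflater.end();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
The gain depends entirely on how well the dictionary matches the real data.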
If you know that your strings are mostly ASCII you could convert them to UTF-8.
byte[] bytes = string.getBytes(Charset.forName("UTF-8"));
This may reduce the memory size by about 50%. However, you will get a byte array out and not a string. If you are writing it to a file though, that should not be a problem.
To convert back to a String:
private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
...
String s = new String(bytes, UTF8_CHARSET);
You don't see any compression happening for your String because you need at least a couple of hundred bytes to get real compression out of GZIPOutputStream or ZipOutputStream. Your String is too small. (I don't understand why you need compression for it.)
Check the conclusion from this article:
The article also shows how to compress and decompress data on the fly in order to reduce network traffic and improve the performance of your client/server applications. Compressing data on the fly, however, improves the performance of client/server applications only when the objects being compressed are more than a couple of hundred bytes. You would not be able to observe improvement in performance if the objects being compressed and transferred are simple String objects, for example.
Take a look at the Huffman algorithm.
https://codereview.stackexchange.com/questions/44473/huffman-code-implementation
The idea is that each character is replaced with sequence of bits, depending on their frequency in the text (the more frequent, the smaller the sequence).
You can read your entire text and build a table of codes, for example:
Symbol Code
a 0
s 10
e 110
m 111
The algorithm builds a symbol tree based on the text input. The more variety of characters you have, the worse the compression will be.
But depending on your text, it could be effective.
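A self-contained sketch of that idea in Java (illustrative names; it only builds the code table and does not serialize the tree or the bit stream):
import java.util.*;

public class HuffmanSketch {

    private static final class Node {
        final char symbol;      // only meaningful for leaves
        final int frequency;
        final Node left, right;
        Node(char symbol, int frequency) { this(symbol, frequency, null, null); }
        Node(char symbol, int frequency, Node left, Node right) {
            this.symbol = symbol; this.frequency = frequency;
            this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    static Map<Character, String> buildCodeTable(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        PriorityQueue<Node> queue =
                new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.frequency));
        freq.forEach((c, f) -> queue.add(new Node(c, f)));

        // Repeatedly merge the two least frequent nodes until one tree remains.
        while (queue.size() > 1) {
            Node a = queue.poll(), b = queue.poll();
            queue.add(new Node('\0', a.frequency + b.frequency, a, b));
        }

        Map<Character, String> codes = new HashMap<>();
        assignCodes(queue.poll(), "", codes);
        return codes;
    }

    private static void assignCodes(Node node, String prefix, Map<Character, String> codes) {
        if (node == null) return;
        if (node.isLeaf()) {
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix); // single-symbol edge case
            return;
        }
        assignCodes(node.left, prefix + "0", codes);
        assignCodes(node.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        System.out.println(buildCodeTable("aaaassem")); // frequent symbols get shorter codes
    }
}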

Data structure for soundex algorithm?

Can anyone suggest what data structure to use for a soundex algorithm program? The language to be used is Java. Has anybody worked on this before in Java? The program should have these features:
be able to read about 50,000 words
should be able to read a word and return the related words having the same soundex
I don't want the program implementation, just some advice on what data structure to use.
TIP: If you use SQL as a data backend, then you can let SQL handle it with the two SQL functions SOUNDEX and DIFFERENCE.
Maybe not what you wanted, but many people do not know that MS SQL Server has those two functions.
Well soundex can be implemented in a straightforward pass over a string, so that doesn't require anything special.
After that the 4 character code can be treated as an integer key.
Then just build a dictionary that stores word sets indexed by that integer key. 50,000 words should easily fit into memory so nothing fancy is required.
Then walk the dictionary and each bucket is a group of similar sounding words.
Actually, here is the whole program in Perl:
#!/usr/bin/perl
use Text::Soundex;
use Data::Dumper;

open(DICT, "</usr/share/dict/linux.words");
my %dictionary = ();
while (<DICT>) {
    chomp();
    push @{$dictionary{soundex($_)}}, $_;
}
close(DICT);

while (<>) {
    my @words = split / +/;
    foreach (@words) {
        print Dumper $dictionary{soundex($_)};
    }
}
I believe you just need to convert the original strings into soundex keys into a hashtable; the value for each entry in the table would be a collection of original strings mapping to that soundex.
The MultiMap collection interface (and its implementations) in Google Collections would be useful to you.
import java.util.*;

class SpellChecker {

    interface Hash {
        String hash(String word);
    }

    private final Hash hash;
    // words grouped by their hash (e.g. Soundex) code
    private final Map<String, Set<String>> collisions;

    SpellChecker(Hash hash) {
        this.hash = hash;
        collisions = new TreeMap<String, Set<String>>();
    }

    boolean addWord(String word) {
        String key = hash.hash(word);
        Set<String> similar = collisions.get(key);
        if (similar == null)
            collisions.put(key, similar = new TreeSet<String>());
        return similar.add(word);
    }

    Set<String> similar(String word) {
        Set<String> similar = collisions.get(hash.hash(word));
        if (similar == null)
            return Collections.emptySet();
        else
            return Collections.unmodifiableSet(similar);
    }
}
The hash strategy could be Soundex, Metaphone, or what have you. Some strategies might be tunable (how many characters does it output, etc.)
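For example, a hypothetical wiring of the class above to a real Soundex implementation (this assumes Apache Commons Codec is on the classpath; any other Hash strategy plugs in the same way):
// Soundex from Apache Commons Codec used as the hashing strategy.
org.apache.commons.codec.language.Soundex soundex = new org.apache.commons.codec.language.Soundex();
SpellChecker checker = new SpellChecker(word -> soundex.soundex(word));

checker.addWord("Robert");
checker.addWord("Rupert");
System.out.println(checker.similar("Robert")); // [Robert, Rupert] - both map to Soundex code R163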
Since soundex is a hash, I'd use a hash table, with the soundex as the key.
You want a 4-byte integer.
The soundex algorithm always returns a 4-character code; if you use ANSI input, you get 4 bytes back (represented as 4 characters).
So store the codes returned in a hashtable, convert your word to its code and look it up in the hashtable. It's really that easy.
