My assignment is to compress a DNA sequence, first encoding it using A = 00, C = 01, G = 10, T = 11. I have to read the sequence in from a file and convert it to my encoding. I know I have to use the BitSet class in Java, but I'm having issues with how to implement it. How do I ensure my encoding is used and the letters are not converted to their actual binary character codes?
This is the prompt: Develop space-efficient Java code for two kinds of compressed encodings of this file of data. (N's are to be ignored.) Convert lower-case to upper-case chars. Do the following and answer the questions: Credit will be awarded to both time- and space-efficient mechanisms. If your code takes too long to run, you need to rethink your design.
Encoding 1. Using two bits A:00, C:01, G:10, T:11.
(a) How many total bits are needed to represent the genome sequence? (b) How many of the total bits are 1's in the encoded sequence?
I know the logic I have to use, but the actual use of the BitSet class and the encoding is where I'm having issues.
Welcome to Stack Overflow! Have a look at a certain Forward Genetic simulator being developed on GitHub. It contains a BitSetDNASequence class that may be helpful for creating your bit mask. Of course it will serve more as a guideline than a 1:1 solution to your problem, but it may well get you up to speed.
I've made an example below of how you can convert the letter 'C' into bits. So for the string "CCCC" it should print "01010101".
import java.util.BitSet;

public class Test {
    public static void main(String[] args) {
        String originalString = "CCCC";
        int bitSetSize = 2 * originalString.length();
        BitSet bitSet = new BitSet(bitSetSize);
        for (int i = 0; i < originalString.length(); i++) {
            if (originalString.charAt(i) == 'C') {
                // put 01 in the bitset
                bitSet.clear(i * 2);
                bitSet.set(i * 2 + 1);
            }
        }
        // print all the bits in the bitset
        for (int i = 0; i < bitSetSize; i++) {
            if (bitSet.get(i))
                System.out.print("1");
            else
                System.out.print("0");
        }
    }
}
I believe all you need to understand from BitSet in order to do your assignment are the methods set, clear, and get. Hope it helps.
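To extend that idea to all four letters, here is a fuller sketch. The two-bit mapping comes from your assignment (A:00, C:01, G:10, T:11); the class and helper names are my own invention. BitSet.cardinality() conveniently answers part (b):

import java.util.BitSet;

public class DnaEncoder {
    // A:00, C:01, G:10, T:11 (the assignment's mapping); -1 for ignored chars
    private static int code(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1; // N's (and anything else) are ignored
        }
    }

    public static void main(String[] args) {
        String seq = "acgtNCGA"; // stand-in for the file contents
        BitSet bits = new BitSet(2 * seq.length());
        int bitPos = 0;
        for (int i = 0; i < seq.length(); i++) {
            int c = code(seq.charAt(i));
            if (c < 0) continue;                    // skip ignored characters
            if ((c & 2) != 0) bits.set(bitPos);     // high bit of the 2-bit code
            if ((c & 1) != 0) bits.set(bitPos + 1); // low bit of the 2-bit code
            bitPos += 2;
        }
        System.out.println("(a) total bits: " + bitPos);
        System.out.println("(b) one bits:   " + bits.cardinality());
    }
}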
You can have a look at BinCodec, which provides binary encoding/decoding procedures to convert DNA and protein sequences to and from a compact binary representation. It relies on the standard Java BitSet. Also have a look at BinCodedTest, which shows how to use these APIs.
Related
I am writing a Huffman Compression/Decompression program. I have started writing my compression method and I am stuck. I am trying to read all bytes in the file and then put all of the bytes into a byte array. After putting all bytes into the byte array I create an int[] array that will store all the frequencies of each byte (with the index being the ASCII code).
It does include the extended ASCII table since the size of the int array is 256. However I encounter issues as soon as I read a special character in my file (AKA characters with a higher ASCII value than 127). I understand that a byte is signed and will wrap around to a negative value as soon as it crosses the 127 limit (and an array index obviously can't be negative), so I tried to counter this by turning it into an unsigned value when I specify my index for the array (array[myByte&0xFF]).
This kind of worked but it gave me the wrong ASCII value (for example if the correct ASCII value for the character is 134 I instead got 191 or something). The even more annoying part is that I noticed that special characters are split into 2 separate bytes, which I feel will cause problems later (for example when I try to decompress).
How do I make my program compatible with every single type of character (this program is supposed to be able to compress/decompress pictures, mp3's etc).
Maybe I am taking the wrong approach to this, but I don't know what the right approach is. Please give me some tips for structuring this.
Tree:
package CompPck;

import java.util.TreeMap;

abstract class Tree implements Comparable<Tree> {
    public final int frequency; // the frequency of this tree

    public Tree(int freq) { frequency = freq; }

    // compares on the frequency
    public int compareTo(Tree tree) {
        return frequency - tree.frequency;
    }
}

class Leaf extends Tree {
    public final int value; // the character this leaf represents

    public Leaf(int freq, int val) {
        super(freq);
        value = val;
    }
}

class Node extends Tree {
    public final Tree left, right; // subtrees

    public Node(Tree l, Tree r) {
        super(l.frequency + r.frequency);
        left = l;
        right = r;
    }
}
Build tree method:
public static Tree buildTree(int[] charFreqs) {
    PriorityQueue<Tree> trees = new PriorityQueue<Tree>();
    // start with one leaf per byte value that actually occurs
    for (int i = 0; i < charFreqs.length; i++) {
        if (charFreqs[i] > 0) {
            trees.offer(new Leaf(charFreqs[i], i));
        }
    }
    //assert trees.size() > 0;
    // repeatedly merge the two lowest-frequency trees
    while (trees.size() > 1) {
        Tree a = trees.poll();
        Tree b = trees.poll();
        trees.offer(new Node(a, b));
    }
    return trees.poll();
}
Compression method:
public static void compress(File file) {
    try {
        Path path = Paths.get(file.getAbsolutePath());
        byte[] content = Files.readAllBytes(path);
        TreeMap<Integer, String> treeMap = new TreeMap<Integer, String>();
        File nF = new File(file.getName() + "_comp");
        nF.createNewFile();
        BitFileWriter bfw = new BitFileWriter(nF);

        int[] charFreqs = new int[256];

        // read each byte and record the frequencies
        for (byte b : content) {
            charFreqs[b & 0xFF]++;
            System.out.println(b & 0xFF);
        }

        // build tree
        Tree tree = buildTree(charFreqs);

        // build TreeMap
        fillEncodeMap(tree, new StringBuffer(), treeMap);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
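fillEncodeMap isn't shown in the question. For context, a hypothetical version might walk the tree recursively, appending '0' for left edges and '1' for right edges and recording the finished path at each leaf:

// Hypothetical sketch of the missing fillEncodeMap; not the OP's actual code.
public static void fillEncodeMap(Tree tree, StringBuffer prefix, TreeMap<Integer, String> map) {
    if (tree instanceof Leaf) {
        // a lone root leaf would otherwise get an empty code
        map.put(((Leaf) tree).value, prefix.length() > 0 ? prefix.toString() : "0");
    } else if (tree instanceof Node) {
        Node node = (Node) tree;
        prefix.append('0');
        fillEncodeMap(node.left, prefix, map);
        prefix.setCharAt(prefix.length() - 1, '1');
        fillEncodeMap(node.right, prefix, map);
        prefix.deleteCharAt(prefix.length() - 1);
    }
}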
Encodings matter
If I take the character "ö" and read it in my file it will now be
represented by 2 different values (191 and 182 or something like that)
when its actual ASCII table value is 148.
That really depends on which kind of encoding was used to create your text file. Encodings determine how text messages are stored.
In UTF-8 the ö is stored as hex [0xc3, 0xb6] or [195, 182]
In ISO/IEC 8859-1 (= "Latin-1") it would be stored as hex [0xf6], or [246]
In Mac OS Central European, it would be hex [0x9a] or [154]
Please note, that the basic ASCII table itself doesn't really describe anything for that kind of character. ASCII only uses 7 bits, and by doing so only maps 128 codes.
Part of the problem is that, in layman's terms, "ASCII" is sometimes used to describe extensions of ASCII as well (e.g. Latin-1).
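You can see this for yourself with a couple of lines of Java (a minimal sketch; the charset names are the standard ones):

import java.nio.charset.Charset;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "ö";
        // UTF-8 needs two bytes, Latin-1 only one
        System.out.println(Arrays.toString(s.getBytes(Charset.forName("UTF-8"))));      // [-61, -74], i.e. 0xC3 0xB6
        System.out.println(Arrays.toString(s.getBytes(Charset.forName("ISO-8859-1")))); // [-10], i.e. 0xF6
    }
}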
History
There's actually a bit of history behind that. Originally ASCII was a very limited set of characters. When those weren't enough, each region started using the 8th bit to add its language-specific characters, leading to all kinds of compatibility issues.
Then there was a consortium that made an inventory of all characters in all possible languages (and beyond). That set is called "Unicode". It contains not just 128 or 256 characters, but thousands of them.
From that point on you would need more advanced encodings to cover them. UTF-8 is one of those encodings that covers that entire unicode set, and it does so while being kind-of backwards compatible with ASCII.
Each ASCII character is still mapped in the same way, but when one byte isn't enough, the 8th bit is used to indicate that a second byte will follow, which is the case for the ö character.
Tools
If you're using a more advanced text editor like Notepad++, then you can select your encoding from the drop-down menu.
In programming
Having said that, your current Java source reads bytes; it's not reading characters. And I would consider it a plus that it works on the byte level, because then it can support all encodings. Maybe you don't need to work on the character level at all.
However, it may matter for your specific algorithm. Let's say you've written an algorithm that is only supposed to handle Latin-1 encoding; then it's really going to work on the "character level" and not on the "byte level". In that case, consider reading directly into a String or char[].
Java can do the heavy lifting for you in that case. There are readers in Java that will let you read a text file directly into Strings/char[]. However, in those cases you should of course specify an encoding when you use them. Internally a single Java char holds 2 bytes of data (one UTF-16 code unit).
Trying to convert bytes to characters manually is a tricky business, unless you're working with plain old ASCII of course. The moment you see a value above 0x7F (127) (which is represented by a negative value in a byte), you're no longer working with simple ASCII. Then consider using something like new String(bytes, StandardCharsets.UTF_8). There's no need to write a decoding algorithm from scratch.
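As a minimal sketch of the two levels (the file name here is made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadLevels {
    public static void main(String[] args) throws IOException {
        // byte level: encoding-agnostic, the right fit for a general compressor
        byte[] raw = Files.readAllBytes(Paths.get("input.txt"));

        // character level: only if the algorithm really works on text,
        // and then the encoding must be named explicitly
        String text = new String(raw, StandardCharsets.UTF_8);
        System.out.println(raw.length + " bytes, " + text.length() + " chars");
    }
}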
For example
first number = 123456.....40 digits
second number = 123456.....40digits
Then I need to store the number
third = first * second;
After that I need to print third, and again I need to perform an operation like
fourth = third * third;
and print fourth. So how can I handle integers that long? Which data type do I need to use?
Use the BigInteger class in java.math, then use BigInteger.multiply to multiply them together.
Check here for more on how to use it:
https://www.tutorialspoint.com/java/math/biginteger_multiply.htm
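For example (the 40-digit values below are just placeholders):

import java.math.BigInteger;

public class BigMul {
    public static void main(String[] args) {
        BigInteger first  = new BigInteger("1234567890123456789012345678901234567890"); // 40 digits
        BigInteger second = new BigInteger("9876543210987654321098765432109876543210"); // 40 digits

        BigInteger third = first.multiply(second);
        System.out.println(third);

        BigInteger fourth = third.multiply(third);
        System.out.println(fourth);
    }
}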
See this similar question: Arbitrary-precision arithmetic Explanation. The answer explains it quite well.
The basic idea is that you work with smaller parts. Just remember how you learned to work with big numbers in school (2nd-3rd grade): you wrote down two numbers, like
  2351
*   12
------
  4702
 2351
------
 28212
You just do small operations and store the results somewhere; you can put them in a string or, better, in an array of integers, where for example the number 123456789123456789 can be stored in 4-digit chunks, least significant first:
number[0] = 6789
number[1] = 2345
number[2] = 7891
number[3] = 3456
number[4] = 12
String numberToShow = "";
for (int i = number.length - 1; i >= 0; i--) {
    // pad every chunk except the most significant one to 4 digits
    numberToShow += (i == number.length - 1)
            ? Integer.toString(number[i])
            : String.format("%04d", number[i]);
}
Here are some links on computer arithmetic:
https://en.wikipedia.org/wiki/Category:Computer_arithmetic
and for adders
https://en.wikipedia.org/wiki/Category:Adders_(electronics)
Your computer, too, basically has just adders that can only work with numbers of a certain size, and if you need to work with bigger ones you need to split them into smaller parts.
Some of these parts can be computed in parallel, so you can speed up your algorithm; such methods are usually more sophisticated. But the basic principle is similar to working with big numbers in primary school.
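To illustrate, here is a minimal sketch of that schoolbook method in Java, on base-10000 chunks stored least-significant-first (in practice you would just use java.math.BigInteger):

// Multiply two numbers stored as little-endian arrays of 4-digit chunks.
public static int[] multiply(int[] a, int[] b) {
    int[] result = new int[a.length + b.length];
    for (int i = 0; i < a.length; i++) {
        long carry = 0;
        for (int j = 0; j < b.length; j++) {
            long cur = result[i + j] + (long) a[i] * b[j] + carry;
            result[i + j] = (int) (cur % 10000); // keep 4 digits in this slot
            carry = cur / 10000;                 // carry the rest
        }
        result[i + b.length] += (int) carry;
    }
    return result;
}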
I am involved with biology, specifically DNA and often there is a problem with the size of the data that comes from sequencing a genome.
For those of you who don't have a background in biology, I'll give a quick overview of DNA sequencing. DNA consists of four letters: A, T, G, and C, the specific order of which determines what happens in the cell.
A major problem with DNA sequencing technology however is the size of the data that results, (for a whole genome, often much more than gigabytes).
I know that the size of an int in C varies from computer to computer, but it still has far more information-storage capacity than four choices require. Is there a way to define a type for a 'base' that only takes up 2 or 3 bits? I've looked at defining a structure, but I'm afraid that isn't what I'm looking for. Thanks.
Also, would this work better in other languages (maybe a higher-level one like Java)?
Can't you just stuff two ATGC sets into one byte then? Like:
0 1 0 1 1 0 0 1
A T G C A T G C
So this one byte would represent TC,AC?
If you want to use Java, you're going to have to give up some control over how big things are. The smallest you can go, AFAIK, is the byte primitive, which is 8 bits (-128 to 127).
Although I guess this is debatable, it seems like Java is more suitable for broad systems control rather than fast, efficient nitty-gritty detail work such as you would generally do with C.
If there is no requirement that you hold the entire dataset in memory at once, you might even try using a managed database like MySQL to store the base information and then read that in piece by piece.
If I were writing similar code, I would store the nucleotide identifier in a byte, using 1, 2, 3, 4 as the values for A, T, G, C. If you later decide to handle RNA as well, you can just add a 5th element, with value 5 for U.
If you are really digging yourself into the project, I would recommend making a class for codons. In this class you can specify if this is an intron/exon, a Start or Stop codon and so on. And on top of this, you can make a gene class, where you can specify the promoter regions and etc.
If you will have big sequences of DNA or RNA that need a lot of computing, then I strongly recommend C++, and for scientific computations Fortran. (The total human genome is roughly 3 billion bases.)
Also, because there are many repetitive sequences, structuring the genome into codons is useful; this way you save a lot of memory (you just keep a reference to a codon object and do not have to build the object N times).
Also, by structuring into codons you can predefine your classes, and there are only 64 of them, so your whole genome would just be an ordered list of references. So in my opinion making the codon the base unit is much more efficient.
The link below is one of my research papers. Check it out, and let me know if you need more details about the implementation, if you find it useful.
GenCodeX - Kaliuday Balleda
Try the char datatype.
It is generally the smallest addressable memory unit in C/C++. Most systems I've used have it at 1 byte.
The reason you can't use anything like one or two bits is because the CPU is already pulling in that extra data.
Take a look at this for more details
The issue is not just which data type will hold the smallest value, but also what is the most efficient way to access bit-level memory.
With my limited knowledge I might try setting up a bit-array of ints (which are, from my understanding, the most efficient way to access memory for bit-arrays; I may be mistaken in my understanding, but the same principles apply if there is a better one), then using bit-wise operators to write/read.
Here are some partial codes that should give you an idea of how to proceed with 2-bit definitions and a large array of ints.
Assuming a pointer (a) set to a large array of ints:
unsigned int *a, dna[LARGE_NUMBER]; /* LARGE_NUMBER: whatever array size you need */
a = dna;
*a = 0;
Setting up bit definitions:
For A:
da = 0;
da = ~da;       /* all 1s */
da = da << 2;   /* ...11111100 */
da = ~da;       /* 00000011, i.e. binary 11 */
For G:
dg = 0;
dg = ~dg;       /* all 1s */
dg = dg << 1;   /* ...11111110 */
dg = ~dg;       /* 00000001 */
dg = dg << 1;   /* 00000010, i.e. binary 10 */
and so on for T and C.
For the loop:
while ((b = getchar())!=EOF){
i = sizeof(int)*8; /*bytes into bits*/
if (i-= 2 > 0){ /*keeping track of how much unused memory is left in int*/
if (b =='a' || b == 'A')
*a = *a | da;
else if (b == 't' || b == 'T')
*a = *a | ta;
else if (t...
else if (g...
else
error;
*a = *a << 2;
} else{
*++a = 0; /*advance to next 32-bit set*/
i = sizeof(int)*8 /* it may be more efficient to set this value aside earlier, I don't honestly know enough to know this yet*/
if (b == 'a'...
else if (b == 't'...
...
else
error;
*a = *a <<2;
}
}
And so on. This will store 32 bits in each int (i.e. 16 letters). For array size maximums, see The maximum size of an array in C.
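Since the question also asks about higher-level languages: the same packing idea in Java might look like the sketch below (the mapping A=00, C=01, G=10, T=11 and the method name are my assumptions, and the input is presumed upper-cased with N's already filtered out):

public class Pack2Bit {
    // Pack four bases per byte, two bits each.
    public static byte[] pack(String seq) {
        byte[] out = new byte[(seq.length() + 3) / 4];
        for (int i = 0; i < seq.length(); i++) {
            int code;
            switch (seq.charAt(i)) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': code = 3; break;
                default: throw new IllegalArgumentException("unexpected base");
            }
            out[i / 4] |= code << ((i % 4) * 2); // place base i within its byte
        }
        return out;
    }
}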
I am speaking only from a novice C perspective. I would think that a machine language would do a better job of what you are asking for specifically, though I'm certain there are high-level solutions out there. I know that FORTRAN is well-regarded when it comes to the sciences, but I understand that this is due to its computational speed, not necessarily its efficient storage (though I'm sure it's not lacking there); an interesting read here: http://arstechnica.com/science/2014/05/scientific-computings-future-can-any-coding-language-top-a-1950s-behemoth/. I would also look into compression, though I sadly have not learned much of it myself.
A source I turned to when I was looking into bit-arrays:
http://www.mathcs.emory.edu/~cheung/Courses/255/Syllabus/1-C-intro/bit-array.html
I am looking for a Java data structure for storing a large text (about a million words), such that I can get a word by index (for example, get the 531467 word).
The problem with String[] or ArrayList is that they take too much memory - about 40 bytes per word on my environment.
I thought of using a String[] where each element is a chunk of 10 words, joined by a space. This is much more memory-efficient - about 20 bytes per word; but the access is much slower.
Is there a more efficient way to solve this problem?
As Jon Skeet already mentioned, 40 MB isn't too large.
But you stated that you are storing text, so there may be many identical Strings, for example stop words like "and" and "or".
You can use String.intern() [1]. This will pool your Strings and return a reference to an already existing String.
intern() is quite slow, though, so you can replace it with a HashMap that will do the same trick for you.
[1] http://download.oracle.com/javase/6/docs/api/java/lang/String.html#intern%28%29
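A minimal sketch of that HashMap trick (the class and method names are made up):

import java.util.HashMap;
import java.util.Map;

public class StringPool {
    private final Map<String, String> pool = new HashMap<String, String>();

    // Returns a canonical instance, so duplicate words all share one String.
    public String canonical(String s) {
        String existing = pool.get(s);
        if (existing != null) {
            return existing;
        }
        pool.put(s, s);
        return s;
    }
}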
You could look at memory-mapping the data structure, but performance might be completely horrible.
Store all the words in a single string:
import java.util.Arrays;
import java.util.Collection;

class WordList {
    private final String content;
    private final int[] indices;

    public WordList(Collection<String> words) {
        StringBuilder buf = new StringBuilder();
        indices = new int[words.size()];
        int currentWordIndex = 0;
        int previousPosition = 0;
        for (String word : words) {
            buf.append(word);
            indices[currentWordIndex++] = previousPosition;
            previousPosition += word.length();
        }
        content = buf.toString();
    }

    public String wordAt(int index) {
        if (index == indices.length - 1) return content.substring(indices[index]);
        return content.substring(indices[index], indices[index + 1]);
    }

    public static void main(String... args) {
        WordList list = new WordList(Arrays.asList(args));
        for (int i = 0; i < args.length; ++i) {
            System.out.printf("Word %d: %s%n", i, list.wordAt(i));
        }
    }
}
Apart from the characters they contain, each word has an overhead of four bytes using this solution (the entry in indices). Retrieving a word with wordAt will always allocate a new string; you could avoid this by saving the toString() of the StringBuilder rather than the builder itself, although it uses more memory on construction.
Depending on the kind of text, language, and more, you might want a solution that deals with recurring words better (like the one previously proposed).
-XX:+UseCompressedStrings
Use a byte[] for Strings which can be represented as pure ASCII.
(Introduced in Java 6 Update 21 Performance Release)
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
Seems like an interesting article:
http://www.javamex.com/tutorials/memory/string_saving_memory.shtml
I hear ropes are quite good in terms of speed for storing large strings, though I'm not sure about memory. You might want to check them out.
http://ahmadsoft.org/ropes/
http://en.wikipedia.org/wiki/Rope_%28computer_science%29
One option would be to store byte arrays instead with the text encoded in UTF-8:
byte[][] words = ...;
Then:
public String getWord(int index) {
    // StandardCharsets.UTF_8 avoids the checked exception of the "UTF-8" string form
    return new String(words[index], StandardCharsets.UTF_8);
}
This will be smaller in two ways:
The data for each string is directly in a byte[], rather than the String having a couple of integer members and a reference to a separate char[] object
If your text is mostly-ASCII, you'll benefit from UTF-8 using a single byte per character for those ASCII characters
I wouldn't really recommend this approach though... again it will be slower on access, as it needs to create a new String each time. Fundamentally, if you need a million string objects (so you don't want to pay the recreation penalty each time) then you're going to have to use the memory for a million string objects...
You could create a datastructure like this:
List<string> wordlist
Dictionary<string, int> tsildrow // for reverse lookup while building the structure
List<int> wordindex
wordlist will contain a list of all (unique) words,
tsildrow will give the index of a word in wordlist and wordindex will tell you the index in wordlist of a specific index in your text.
You would operate it in the following fashion:
for word in text:
if not word in tsildrow:
wordlist.append(word)
tsildrow.add(word, wordlist.last_index)
wordindex.append(tsildrow[word])
This fills up your data structure. Now, to find the word at index 531467:
print wordlist[wordindex[531467]]
you can reproduce the entire text like this:
for index in wordindex:
print wordlist[index] + ' '
except that you will still have a problem with punctuation etc...
If you won't be adding any more words (i.e. your text is stable), you can delete tsildrow to free up some memory, if that is a concern of yours.
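In Java, that structure might look roughly like this (a sketch; the names follow the pseudocode above):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordIndex {
    private final List<String> wordlist = new ArrayList<String>();
    private final Map<String, Integer> tsildrow = new HashMap<String, Integer>();
    private final List<Integer> wordindex = new ArrayList<Integer>();

    public void add(String word) {
        Integer pos = tsildrow.get(word);
        if (pos == null) {              // first time we see this word
            pos = wordlist.size();
            wordlist.add(word);
            tsildrow.put(word, pos);
        }
        wordindex.add(pos);
    }

    public String wordAt(int textIndex) {
        return wordlist.get(wordindex.get(textIndex));
    }
}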
OK, I have experimented with several of your suggestions, and here are my results (I checked (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory()) before filling the array, and checked again after filling the array and gc()):
Original (array of strings): 54 bytes/word (not 40 as I mistakenly wrote)
My solution (array of chunks of strings, separated by spaces):
2 words per chunk - 36 b/w (but unacceptable performance)
10 words per chunk - 18 b/w
100 words per chunk - 14 b/w
byte arrays - 40 b/w
char arrays - 36 b/w
HashMap, either mapping a string to itself, or mapping a string to its index - 26 b/w
(not sure I implemented this correctly)
intern - 10 b/w
baseline (empty array) - 4 b/w
The average word length is about 3 chars, and most chars are non-ASCII so it's probably about 6 bytes. So, it seems that intern is close to the optimum. It makes sense, since it's an array of words, and many of the words appear much more than once.
I would probably consider using a file, with either fixed-size words or some sort of index. FileInputStream with skip can be pretty efficient.
If you have a mobile device you can use TIntArrayList, which uses 4 bytes per int value. If you use one index per word it will only need a couple of MB. You can also use an int[].
If you have a PC or server, this is a trivial amount of memory. Memory costs about £6 per GB, or 1 cent per MB.
I have a string which is lost forever. The only thing I have about it is some magic hash number. Now I have a new string, which could be similar or equal to the lost one. I need to find out how close it is.
Integer savedHash = 352736;
String newText = "this is new string";
if (Math.abs(hash(newText) - savedHash) < 100) {
// wow, they are very close!
}
Are there any algorithms for this purpose?
ps. The length of the text is not fixed.
pps. I know how usual hash codes work. I'm interested in an algorithm that will work differently, giving me the functionality explained above.
ppps. In a very simple scenario this hash() method would look like:
public int hash(String txt) {
    return txt.length();
}
Standard hashing will not work in this case since close hash values do not imply close strings. In fact, most hash functions are designed to give close strings very different values, so as to create a random distribution of hash values for any given set of input strings.
If you had access to both strings, then you could use some kind of string distance function, such as Levenshtein distance. This calculates the edit distance between two strings, or the number of edits required to transform one to the other.
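For reference, a compact sketch of Levenshtein distance with the usual two-row dynamic program:

// Edit distance between a and b: insertions, deletions, substitutions.
public static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                               prev[j - 1] + cost);
        }
        int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[b.length()];
}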
In this case however, the best approach might be to use some kind of fuzzy hashing technique. That way you don't have to store the original string, and can still get some measure of similarity.
If the hashes don't match then the strings are different.
If the hashes match then the strings are probably the same.
There is nothing else you can infer from the hash value.
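In code, that makes the saved value usable only as a quick reject test (assuming, for the sake of the sketch, that the magic number came from String.hashCode()):

int savedHash = 352736;
String candidate = "this is new string";

if (candidate.hashCode() != savedHash) {
    // definitely not the lost string
} else {
    // possibly the lost string; collisions mean you can't be sure
}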
No, this isn't going to work. The similarity of a hash bears no relation to the similarity of the original strings. In fact, it is entirely possible for 2 different strings to have the same hash. All you can say for sure is that if the hashes are different the strings were different.
[Edited in light of comment, possibility of collision is of course very real]
Edit for clarification:
If you only have the hash of the old string then there is no way you are going to find the original value of that string. There is no algorithm that would tell you if the hashes of 2 different strings represented strings that were close, and even if there was it wouldn't help. Even if you find a string that has an exact hash match with your old string there is still no way you would know if it was your original string, as any number of strings can produce the same hash value. In fact, there is a vast* number of strings that can produce the same hash.
[In theory this vast number is actually infinite, but on any real storage system you can't generate an infinite number of strings. In any case your chance of matching an unknown string via this approach is very slim unless your hashes are large in relation to the input string, and even then you would need to brute-force your way through every possible string.]
As others have pointed out, with a typical hash algorithm, it just doesn't work like that at all.
There are, however, a few people who've worked out algorithms that are at least somewhat similar to that. For one example, there's a company called "Xpriori" that has some hashing (or least hash-like) algorithms that allow things like that. They'll let you compare for degree of similarity, or (for example) let you combine hashes so hash(a) + hash(b) == hash(a+b) (for some definition of +, not just simple addition of the numbers). Like with most hashes, there's always a possibility of collision, so you have some chance of a false positive (but by picking the hash size, you can set that chance to an arbitrarily small value).
As such, if you're dealing with existing data, you're probably out of luck. If you're creating something new, and want capabilities on this order, it is possible -- though trying to do it on your own is seriously non-trivial.
No. Hashes are designed so that minor variations in the input string cause huge differences in the resulting hash. This is very useful for dictionary implementations, as well as for verifying the integrity of a file (a single changed bit will cause a completely different hash). So no, it's not something you can ever use as a similarity comparison.
If the hashCodes are different it cannot be the same String, however many Strings can have the same hashCode().
Depending on the nature of the Strings, doing a plain comparison could be more efficient than comparing the hashCodes: hashCode() has to inspect and perform a calculation on every character, whereas a comparison can stop early, e.g. if the lengths are different or as soon as it sees a differing character.
Any good hashing algorithm will by definition NEVER yield similar hashes for similar arguments; otherwise it would be too easy to crack. If the hashed value of "aaaa" looks similar to that of "aaab", then that is a poor hash. I have cracked ones like that before without too much difficulty (a fun puzzle to solve!). But you never know, maybe your hash algorithm is poor. Any idea what it is?
If you have time, you can just brute force this solution by hashing every possible word. Not elegant, but possible. Easier if you know the length of the original word as well.
If it is a standard hash algorithm, like MD5, you can find websites that already have large mappings of source and hash, and get the answer that way. Try http://hashcrack.com/
I used this website successfully after one of our devs left and I needed to recover a password.
You can treat the string as a really big number, but that's about the extent of your abilities in the general situation. If you have a specific problem domain, you may be able to compress a representation of the string to something smaller without losses, but still it will not be very useful.
For example, if you are working with individual words, you can use soundex to compare how similar two words will sound...
The best you can do with traditional hash codes will be to compare two strings for equality vs likely inequality. False positives are possible, but there will be no false negatives. You cannot compare for similarity this way, though.
A normal hash code changes a lot when the object changes a little; it's made to distinguish different objects and doesn't care how similar they might be. Therefore the answer is no.
Well, it seems you want not a real hash of the string, but some fingerprint of it. Since you want it to be 32 bits, one way could be:
Calculate the Pearson correlation coefficient between the first and second half of the string (if the string length is an odd number of chars, add some padding) and store this number as a 32-bit floating point number. But I'm not sure how reliable this method will be.
==EDIT==
Here is example C code (unoptimized) which implements this idea (a little bit modified):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>

float mean(char *str) {
    char *x;
    float sum = 0.0;
    for (x = str; *x != '\0'; x++) {
        sum += (float) *x;
    }
    return sum / strlen(str);
}

float stddev(char *str) {
    char *x;
    float sum = 0.0;
    float u = mean(str);
    for (x = str; *x != '\0'; x++) {
        sum += ((float) *x - u) * ((float) *x - u);
    }
    return sqrt(sum / strlen(str));
}

float covariance(char *str1, char *str2) {
    int i;
    int im = fmin(strlen(str1), strlen(str2));
    float sum = 0.0;
    float u1 = mean(str1);
    float u2 = mean(str2);
    for (i = 0; i < im; i++) {
        sum += ((float) str1[i] - u1) * ((float) str2[i] - u2);
    }
    return sum / im;
}

float correlation(char *str1, char *str2) {
    float cov = covariance(str1, str2);
    float dev1 = stddev(str1);
    float dev2 = stddev(str2);
    return cov / (dev1 * dev2);
}

float string_fingerprint(char *str) {
    int len = strlen(str);
    char *rot = (char *) malloc((len + 1) * sizeof(char));
    int i;
    /* rotate the string by half its length */
    for (i = 0; i < len; i++) {
        rot[i] = str[(i + len / 2) % len];
    }
    rot[len] = '\0';
    /* now calculate the correlation between the original and rotated strings */
    float corr = correlation(str, rot);
    free(rot);
    return corr;
}

int main() {
    char string1[] = "The quick brown fox jumps over the lazy dog";
    char string2[] = "The slow brown fox jumps over the crazy dog";
    float f1 = string_fingerprint(string1);
    float f2 = string_fingerprint(string2);
    if (fabs(f1 - f2) < 0.2) {
        printf("wow, they are very close!\n");
    }
    return 0;
}
hth!