Creating a strong password based on an input regex [duplicate] - java

I am writing a Java utility that helps me to generate loads of data for performance testing. It would be really cool to be able to specify a regex for Strings so that my generator spits out things that match this.
Is something out there already baked that I can use to do this? Or is there a library that gets me most of the way there?

Edit:
Complete list of suggested libraries on this question:
Xeger* - Java
Generex* - Java
Rgxgen - Java
rxrdg - C#
* - Depends on dk.brics.automaton
Edit:
As mentioned in the comments, there is a library available at Google Code to achieve this:
https://code.google.com/archive/p/xeger/
See also https://github.com/mifmif/Generex as suggested by Mifmif
Original message:
Firstly, with a complex enough regexp, I believe this can be impossible. But you should be able to put something together for simple regexps.
If you take a look at the source code of the class java.util.regex.Pattern, you'll see that it uses an internal representation of Node instances. Each of the different pattern components has its own implementation of a Node subclass. These Nodes are organised into a tree.
By producing a visitor that traverses this tree, you should be able to call an overloaded generator method or some kind of Builder that cobbles something together.

It's too late to help the original poster, but it could help a newcomer. Generex is a useful Java library that provides many features for using regexes to generate strings (random generation, generating a string based on its index, generating all strings, ...).
Example:
Generex generex = new Generex("[0-3]([a-c]|[e-g]{1,2})");
// Generate the second string, in lexicographical order, that matches the given regex.
String secondString = generex.getMatchedString(2);
System.out.println(secondString); // prints '0b'
// Generate all strings that match the given regex.
List<String> matchedStrs = generex.getAllMatchedStrings();
// Use the Generex iterator.
Iterator iterator = generex.iterator();
while (iterator.hasNext()) {
    System.out.print(iterator.next() + " ");
}
// Prints: 0a 0b 0c 0e 0ee 0e 0e 0f 0fe 0f 0f 0g 0ge 0g 0g 1a 1b 1c 1e
// 1ee 1e 1e 1f 1fe 1f 1f 1g 1ge 1g 1g 2a 2b 2c 2e 2ee 2e 2e 2f 2fe 2f 2f 2g
// 2ge 2g 2g 3a 3b 3c 3e 3ee 3e 3e 3f 3fe 3f 3f 3g 3ge 3g 3g 1ee
// Generate a random string.
String randomStr = generex.random();
System.out.println(randomStr); // a random value from the previous list of strings
Disclosure
The project mentioned in this post belongs to the user answering the question (Mifmif). As per the rules, this needs to be disclosed.

Xeger (Java) is capable of doing it as well:
String regex = "[ab]{4,6}c";
Xeger generator = new Xeger(regex);
String result = generator.generate();
assert result.matches(regex);

This question is really old, but the problem was still relevant for me.
I've tried Xeger and Generex, and they don't seem to meet my requirements.
They actually fail to process some regex patterns (like a{60000}), and for others (e.g. (A|B|C|D|E|F)) they just don't produce all possible values. Since I didn't find any other appropriate solution, I've created my own library.
https://github.com/curious-odd-man/RgxGen
This library can be used to generate both matching and non-matching strings.
There is also an artifact available on Maven Central.
Usage example:
RgxGen rgxGen = new RgxGen(aRegex); // Create generator
String s = rgxGen.generate(); // Generate new random value

I've gone the route of rolling my own library for that (in C#, but it should be easy to understand for a Java developer).
Rxrdg started as a solution to a problem of creating test data for a real life project. The basic idea is to leverage the existing (regular expression) validation patterns to create random data that conforms to such patterns. This way valid random data is created.
It is not that difficult to write a parser for simple regex patterns. Using an abstract syntax tree to generate strings should be even easier.

On stackoverflow podcast 11:
Spolsky: Yep. There's a new product also, if you don't want to use the Team System there our friends at Redgate have a product called SQL Data Generator [http://www.red-gate.com/products/sql_data_generator/index.htm]. It's $295, and it just generates some realistic test data. And it does things like actually generate real cities in the city column that actually exist, and then when it generates those it'll get the state right, instead of getting the state wrong, or putting states into German cities and stuff like... you know, it generates pretty realistic looking data. I'm not really sure what all the features are.
This is probably not what you are looking for, but it might be a good starting off point, instead of creating your own.
I can't seem to find anything on Google, so I would suggest tackling the problem by parsing a given regular expression into the smallest units of work (\w, [x-x], \d, etc.) and writing some basic methods to support those regular expression phrases.
So for \w you would have a method getRandomLetter() which returns any random letter, and you would also have getRandomLetter(char startLetter, char endLetter) which gives you a random letter between the two values.
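The approach described above (split the pattern into small units and give each unit its own random producer) can be sketched in Java for a deliberately tiny subset: literals, \d, \w, bracketed classes with ranges, and {n} / {n,m} counts. The class and method names are made up for illustration. There is no support for groups, alternation, negated classes, *, + or ?, and malformed patterns are not validated.

```java
import java.util.Random;

public class SimpleRegexGenerator {
    private final Random random;

    public SimpleRegexGenerator(Random random) {
        this.random = random;
    }

    /** Generates one string for a pattern made of literals, \d, \w, [..] classes and {n} / {n,m} counts. */
    public String generate(String pattern) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < pattern.length()) {
            String alphabet; // characters the current unit may produce
            char c = pattern.charAt(pos);
            if (c == '[') {
                int close = pattern.indexOf(']', pos);
                alphabet = expandClass(pattern.substring(pos + 1, close));
                pos = close + 1;
            } else if (c == '\\') {
                char kind = pattern.charAt(pos + 1);
                alphabet = kind == 'd' ? expandClass("0-9")
                         : kind == 'w' ? expandClass("a-zA-Z0-9") + "_"
                         : String.valueOf(kind); // any other escape is treated as a literal
                pos += 2;
            } else {
                alphabet = String.valueOf(c);
                pos++;
            }
            // An optional {n} or {n,m} repeats the unit that was just read.
            int min = 1;
            int max = 1;
            if (pos < pattern.length() && pattern.charAt(pos) == '{') {
                int close = pattern.indexOf('}', pos);
                String[] bounds = pattern.substring(pos + 1, close).split(",");
                min = Integer.parseInt(bounds[0].trim());
                max = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : min;
                pos = close + 1;
            }
            int count = min + random.nextInt(max - min + 1);
            for (int i = 0; i < count; i++) {
                out.append(alphabet.charAt(random.nextInt(alphabet.length())));
            }
        }
        return out.toString();
    }

    /** Expands the body of a character class such as "A-Z0-9" into all of its characters. */
    private static String expandClass(String body) {
        StringBuilder chars = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            if (i + 2 < body.length() && body.charAt(i + 1) == '-') {
                for (char ch = body.charAt(i); ch <= body.charAt(i + 2); ch++) {
                    chars.append(ch);
                }
                i += 2;
            } else {
                chars.append(body.charAt(i));
            }
        }
        return chars.toString();
    }
}
```

Calling new SimpleRegexGenerator(new Random()).generate("[A-Z0-9]{16}") returns a 16-character string of uppercase letters and digits. Anything beyond this subset is better left to one of the libraries listed above.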

I am on a flight and just saw the question. I have written the easiest, though inefficient and incomplete, solution. I hope it helps you start writing your own parser:
public static void main(String[] args) {
String line = "[A-Z0-9]{16}";
String[] tokens = line.split(line);
char[] pattern = new char[100];
int i = 0;
int len = tokens.length;
String sep1 = "[{";
StringTokenizer st = new StringTokenizer(line, sep1);
while (st.hasMoreTokens()) {
String token = st.nextToken();
System.out.println(token);
if (token.contains("]")) {
char[] endStr = null;
if (!token.endsWith("]")) {
String[] subTokens = token.split("]");
token = subTokens[0];
if (!subTokens[1].equalsIgnoreCase("*")) {
endStr = subTokens[1].toCharArray();
}
}
if (token.startsWith("^")) {
String subStr = token.substring(1, token.length() - 1);
char[] subChar = subStr.toCharArray();
Set set = new HashSet<Character>();
for (int p = 0; p < subChar.length; p++) {
set.add(subChar[p]);
}
int asci = 1;
while (true) {
char newChar = (char) (subChar[0] + (asci++));
if (!set.contains(newChar)) {
pattern[i++] = newChar;
break;
}
}
if (endStr != null) {
for (int r = 0; r < endStr.length; r++) {
pattern[i++] = endStr[r];
}
}
} else {
pattern[i++] = token.charAt(0);
}
} else if (token.contains("}")) {
char[] endStr = null;
if (!token.endsWith("}")) {
String[] subTokens = token.split("}");
token = subTokens[0];
if (!subTokens[1].equalsIgnoreCase("*")) {
endStr = subTokens[1].toCharArray();
}
}
int length = Integer.parseInt((new StringTokenizer(token, (",}"))).nextToken());
char element = pattern[i - 1];
for (int j = 0; j < length - 1; j++) {
pattern[i++] = element;
}
if (endStr != null) {
for (int r = 0; r < endStr.length; r++) {
pattern[i++] = endStr[r];
}
}
} else {
char[] temp = token.toCharArray();
for (int q = 0; q < temp.length; q++) {
pattern[i++] = temp[q];
}
}
}
String result = "";
for (int j = 0; j < i; j++) {
result += pattern[j];
}
System.out.print(result);
}

You'll have to write your own parser, like the author of String::Random (Perl) did. In fact, he doesn't use regexes anywhere in that module, it's just what perl-coders are used to.
On the other hand, maybe you can have a look at the source, to get some pointers.
EDIT: Damn, blair beat me to the punch by 15 seconds.

I know there's already an accepted answer, but I've been using RedGate's Data Generator (the one mentioned in Craig's answer) and it works REALLY well for everything I've thrown at it. It's quick and that leaves me wanting to use the same regex to generate the real data for things like registration codes that this thing spits out.
It takes a regex like:
[A-Z0-9]{3,3}-[A-Z0-9]{3,3}
and it generates tons of unique codes like:
LLK-32U
Is this some big secret algorithm that RedGate figured out and we're all out of luck or is it something that us mere mortals actually could do?

It's far from supporting a full PCRE regexp, but I wrote the following Ruby method to take a regexp-like string and produce a variation on it. (For language-based CAPTCHA.)
# q = "(How (much|many)|What) is (the (value|result) of)? :num1 :op :num2?"
# values = { :num1=>42, :op=>"plus", :num2=>17 }
# 4.times{ puts q.variation( values ) }
# => What is 42 plus 17?
# => How many is the result of 42 plus 17?
# => What is the result of 42 plus 17?
# => How much is the value of 42 plus 17?
class String
def variation( values={} )
out = self.dup
while out.gsub!( /\(([^())?]+)\)(\?)?/ ){
( $2 && ( rand > 0.5 ) ) ? '' : $1.split( '|' ).random
}; end
out.gsub!( /:(#{values.keys.join('|')})\b/ ){ values[$1.intern] }
out.gsub!( /\s{2,}/, ' ' )
out
end
end
class Array
def random
self[ rand( self.length ) ]
end
end

This question is very old, but I stumbled across it on my own search, so I will include a couple links for others who might be searching for the same functionality in other languages.
There is a Node.js library here: https://github.com/fent/randexp.js
There is a PHP library here: https://github.com/icomefromthenet/ReverseRegex
The PHP faker package includes a "regexify" method that accomplishes this: https://packagist.org/packages/fzaninotto/faker

If you want to generate "critical" strings, you may want to consider:
EGRET http://elarson.pythonanywhere.com/
that generates "evil" strings covering your regular expressions
MUTREX http://cs.unibg.it/mutrex/
that generates fault-detecting strings by regex mutation
Both are academic tools (I am one of the authors of the latter) and work reasonably well.

Related

JPEG how to skip user defined tags while decoding the file stream

public ArrayList DCTread(char[] im,int flag,int select, int DC0,int row,int col){
//Input:im is the binary sequence of the host image. I wrote a "byte2char" function to convert that. flag serves as an outside pointer to locate the to-be-decoded chars. DC0 is the DC coeff for the last block.And row,col is simply for debug.
//Main Output:An ArrayList that contains the DCT coeffs,pointer(an int showing how many bits are read in this function)
String rev=new String();char[] DCcode;
ArrayList res = new ArrayList(2);int[][] ac = null; int[][] dc = null;int[][] coeff = new int[8][8];int pointer = 0;
int[] ans;int wordLen;int zeroLen;int diff;int ACnum = 1;int dct;
switch(select){//determine using which two huffman trees.
case(0):ac = a0;dc = d0;break;
case(1):ac = a1;dc = d0;break;
case(16):ac = a0;dc = d1;break;
case(17):ac = a1;dc = d1;break;
}
//DC
ans = T.huffmanDecoder(im,pointer+flag,dc,row,col);
if(ans[0]==-1){
int a1 = T.bin2dec_str(im,pointer+flag,8);int a2 = T.bin2dec_str(im,pointer+flag+8,8);
pointer +=16;//I wish to skip the User Defined Tags by reading its length
int autoLen = T.bin2dec_str(im,pointer+flag,8)*16+T.bin2dec_str(im,pointer+flag+8,8);
pointer +=autoLen*8;
}
ans = T.huffmanDecoder(im,pointer+flag,dc,row,col);
pointer += ans[0];wordLen = ans[1];
diff = T.i_unsignDecoder(T.bin2dec_str(im,pointer+flag,wordLen),wordLen);
coeff[0][0]= DC0 + diff;
pointer += wordLen;DCcode=Arrays.copyOfRange(im, flag, pointer+flag);
//AC
while(ACnum<=63){
ans = T.huffmanDecoder(im,pointer+flag,ac,row,col);
pointer += ans[0];
if(ans[1]==0){//
break;}
zeroLen = (ans[1]&(0xF0))/16;wordLen = ans[1]&(0x0F);
for(int j=0;j<zeroLen;j++){
coeff[zigZag[ACnum][0]][zigZag[ACnum][1]] = 0;
ACnum ++;
}
dct = T.i_unsignDecoder(T.bin2dec_str(im,pointer+flag,wordLen),wordLen);
pointer += wordLen;
coeff[zigZag[ACnum][0]][zigZag[ACnum][1]] = dct;
ACnum ++;
}
res.add(coeff);
res.add(pointer);
res.add(DCcode);
return res;
}
Hi everyone. This problem has bothered me for two days, and I would be grateful for any help with it. I've been reading Stack Overflow for a long time, but this is my first time posting a question.
I want to read the DCT coefficients of a JPEG in Java without using the libjpeg library (which is written in C). However, I encounter many user-defined tags (UDTs) that I find hard to skip using the method listed above. I'm not very familiar with UDTs.
Aren't they written as "0xFFXX 0x....", where 0x.... gives the length of the tag? Your suggestions would be a great help. Thanks!
The markers that can be user defined are APPn's and COM. Those markers are followed by lengths in BIG ENDIAN format.
However, I am surprised you are finding "many" such tags. Typically, there will only be one or two in a JPEG stream.
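As a minimal sketch of what skipping such a segment looks like in Java, assuming the file has been read into a byte array and the offset points at the 0xFF byte of a marker: the class and method names below are hypothetical. The two-byte length is big-endian and counts its own two bytes but not the two marker bytes.

```java
public class JpegSegmentSkipper {
    /** Returns true for APP0..APP15 (0xFFE0..0xFFEF) and COM (0xFFFE). */
    public static boolean isUserDefined(int marker) {
        return (marker >= 0xE0 && marker <= 0xEF) || marker == 0xFE;
    }

    /**
     * Given the offset of the 0xFF byte of a marker, returns the offset of the
     * first byte after that marker's segment.
     */
    public static int skipSegment(byte[] data, int markerOffset) {
        int high = data[markerOffset + 2] & 0xFF; // mask because Java bytes are signed
        int low = data[markerOffset + 3] & 0xFF;
        int length = (high << 8) | low;           // big-endian: high byte first
        return markerOffset + 2 + length;
    }

    public static void main(String[] args) {
        // SOI, then a COM segment of length 5 (2 length bytes + 3 payload bytes), then EOI.
        byte[] data = {
            (byte) 0xFF, (byte) 0xD8,
            (byte) 0xFF, (byte) 0xFE, 0x00, 0x05, 'a', 'b', 'c',
            (byte) 0xFF, (byte) 0xD9
        };
        int offset = 2; // just past SOI
        while ((data[offset + 1] & 0xFF) != 0xD9) {
            int marker = data[offset + 1] & 0xFF;
            System.out.printf("marker 0xFF%02X, user defined: %b%n", marker, isUserDefined(marker));
            offset = skipSegment(data, offset);
        }
        System.out.println("EOI at offset " + offset);
    }
}
```

Compared with the code in the question: if bin2dec_str returns the value of the 8 bits it reads, the high length byte has to be multiplied by 256 rather than 16. Also, these segments appear between markers in the header, so they would normally be skipped before Huffman decoding of a scan begins.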

How to generate 1000 unique email-ids using java

My requirement is to generate 1000 unique email IDs in Java. I already generate random text, and I use a for loop to limit the number of email IDs generated. The problem is that when I run the program, 10 email IDs are generated, but they are all the same.
Below is the code and output:
public static void main() {
    first fr = new first();
    String n = fr.genText()+"@mail.com";
    for (int i = 0; i<=9; i++) {
        System.out.println(n);
    }
}
public String genText() {
    String randomText = "abcdefghijklmnopqrstuvwxyz";
    int length = 4;
    String temp = RandomStringUtils.random(length, randomText);
    return temp;
}
and the output is:
myqo@mail.com
myqo@mail.com
...
myqo@mail.com
When I run the same program again I get a different set of email IDs, for example 'bfta' instead of 'myqo'. But my requirement is to generate different, unique IDs within a single run.
For example:
myqo@mail.com
bfta@mail.com
kjuy@mail.com
Move the String initialization inside the for loop, so that a new value is generated on every iteration:
for (int i = 0; i<=9; i++) {
    String n = fr.genText()+"@mail.com";
    System.out.println(n);
}
I would rewrite your method a little:
public String generateEmail(String domain, int length) {
    return RandomStringUtils.random(length, "abcdefghijklmnopqrstuvwxyz") + "@" + domain;
}
It can then be called like this:
generateEmail("gmail.com", 4);
As I understand it, you want to generate 1000 unique emails. You can do this conveniently with the Stream API:
Stream.generate(() -> generateEmail("gmail.com", 4))
    .limit(1000)
    .collect(Collectors.toSet())
But the problem still exists. I deliberately collected the Stream<String> into a Set<String> (which removes duplicates) to check its size(). As you can see, the size does not always equal 1000:
999
1000
997
That means your algorithm returns duplicate values even for such a small range.
Therefore, you'd better look into existing email generators for Java or improve your own (for example, by adding numbers or some special characters, which, in turn, will generate plenty of exceptions).
If you are planning to use MockNeat, email generation is already implemented.
Example 1:
String corpEmail = mock.emails().domain("startup.io").val();
// Possible Output: tiptoplunge@startup.io
Example 2:
String domsEmail = mock.emails().domains("abc.com", "corp.org").val();
// Possible Output: funjulius@corp.org
Note: mock is the default "mocking" object.
To guarantee uniqueness you could use a counter as part of the email address:
myqo0000@mail.com
bfta0001@mail.com
kjuy0002@mail.com
If you want to stick to letters only then convert the counter to base 26 representation using 'a' to 'z' as the digits.
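A minimal sketch of the letters-only counter idea, assuming a fixed width of four letters (enough for 26^4 = 456,976 addresses): the class and method names are illustrative. Unlike the examples above, there is no random prefix here. Uniqueness comes entirely from the counter.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class UniqueEmails {
    /** Encodes a non-negative counter as a fixed-width string using 'a'..'z' as base-26 digits. */
    public static String toBase26(int value, int width) {
        char[] digits = new char[width];
        for (int i = width - 1; i >= 0; i--) {
            digits[i] = (char) ('a' + value % 26);
            value /= 26;
        }
        return new String(digits);
    }

    /** Builds count addresses; distinct counters below 26^4 give distinct local parts. */
    public static Set<String> generate(int count, String domain) {
        Set<String> emails = new LinkedHashSet<>();
        for (int i = 0; i < count; i++) {
            emails.add(toBase26(i, 4) + "@" + domain);
        }
        return emails;
    }

    public static void main(String[] args) {
        Set<String> emails = generate(1000, "mail.com");
        System.out.println(emails.size());            // 1000
        System.out.println(emails.iterator().next()); // aaaa@mail.com
    }
}
```

If the addresses should not look sequential, a random prefix can be placed in front of the counter part without affecting uniqueness.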

Finding string from the next line of an ArrayList

I have this code; it should find a known method name in the chosen file:
String[] sorok = new String[listaZ.size()];
String[] sorokPlusz1 = new String[listaIdeig.size()];
boolean keresesiFeltetel1;
boolean keresesiFeltetel3;
boolean keresesiFeltetel4;
int ind=0;
for (int i = 0; i < listaZ.size(); i++) {
for (int id = 0; id < listaIdeig.size(); id++) {
sorok = listaZ.get(i);
sorokPlusz1 = listaIdeig.get(id);
for (int j = 0; j < sorok.length; j++) {
for (int jj = 1; jj < sorok.length; jj++) {
keresesiFeltetel3 = (sorok[j].equals(oldName)) && (sorokPlusz1[id].startsWith("("));
keresesiFeltetel4 = sorok[j].startsWith(oldNameV3);
keresesiFeltetel1 = sorok[j].equals(oldName) && sorok[jj].startsWith("(");
if (keresesiFeltetel1 || keresesiFeltetel3 || keresesiFeltetel4) {
Array.set(sorok, j, newName);
listaZarojeles.set(i, sorok);
}
}
System.out.println(ind +". index, element: " +sorok[j]);
}
ind++;
}
}
listaZ is an ArrayList whose elements are separated by '(' and ' '; listaIdeig is the same list without the first line (because of keresesiFeltetel3).
oldNameV3 is: oldName+ ()
I'd like to find a method's name when it looks like this:
methodname
() {...
To do this I need the next line in keresesiFeltetel3, but I can't get it working properly. It either finds nothing or throws errors.
Right now it prints the input file's elements about 15 times more often than it should, shows an error at keresesiFeltetel3, and throws:
Exception in thread "AWT-EventQueue-0" java.lang.ArrayIndexOutOfBoundsException: 0
I think your problem is here: sorokPlusz1[id]. id does not seem to span sorokPlusz1's range. I suspect you want to use jj and that jj should span sorokPlusz1's range instead of sorok's and that sorok[jj].startsWith("(") should be sorokPlusz1[jj].startsWith("(").
But note that I'm largely speculating as I'm not 100% sure what you're trying to do or what listaZ and listaIdeig look like.
You're creating sorok with size = listaZ's size, and then you do this: sorok = listaZ.get(i);. This is clearly not right. Not knowing the exact type of listaZ makes it difficult to tell you what's wrong with it. If it's ArrayList<String[]>, then change
String[] sorok = new String[listaZ.size()]; to String[] sorok = null; or String[] sorok;. If it's ArrayList<String> then you probably want to do something more like sorok[i] = listaZ.get(i);
Now for some general notes about asking questions here: (with some repetition of what was said in the comments) (in the spirit of helping you be successful in getting answers to questions on this site).
Your question is generally unclear. After reading through your question and the code, I still have little idea what you're trying to do and what the input variables (listaZ and listaIdeig) look like.
Using non-English variable names makes it more difficult for any English speaker to help. Even changing sorok to array and keresesiFeltetelX to bX would be better (though still not great). Having long variable names that aren't understandable makes it much more difficult to read.
Comment your code. Enough comments (on almost every line) makes it much easier to understand your code.
Examples. If you have difficulty properly explaining what you want to do (in English), you can always provide a few examples which would assist your explanation a great deal (and doing this is a good idea in general). Note that a good example is both providing the input and the desired output (and the actual output, if applicable).

Tips optimizing Java code

So, I've written a spellchecker in Java and things work as they should. The only problem is that if I use a word where the max allowed distance of edits is too large (like say, 9) then my code runs out of memory. I've profiled my code and dumped the heap into a file, but I don't know how to use it to optimize my code.
Can anyone offer any help? I'm more than willing to put up the file/use any other approach that people might have.
-Edit-
Many people asked for more details in the comments. I figured that other people would find them useful, and they might get buried in the comments. Here they are:
I'm using a Trie to store the words themselves.
In order to improve time efficiency, I don't compute the Levenshtein Distance upfront, but I calculate it as I go. What I mean by this is that I keep only two rows of the LD table in memory. Since a Trie is a prefix tree, it means that every time I recurse down a node, the previous letters of the word (and therefore the distance for those words) remains the same. Therefore, I only calculate the distance with that new letter included, with the previous row remaining unchanged.
The suggestions that I generate are stored in a HashMap. The rows of the LD table are stored in ArrayLists.
Here's the code of the function in the Trie that leads to the problem. Building the Trie is pretty straight forward, and I haven't included the code for the same here.
/*
 * @param letter: the letter that is currently being looked at in the trie
 *        word: the word that we are trying to find matches for
 *        previousRow: the previous row of the Levenshtein Distance table
 *        suggestions: all the suggestions for the given word
 *        maxd: max distance a word can be from the query and still be returned as suggestion
 *        suggestion: the current suggestion being constructed
 */
public void get(char letter, ArrayList<Character> word, ArrayList<Integer> previousRow, HashSet<String> suggestions, int maxd, String suggestion){
// the new row of the trie that is to be computed.
ArrayList<Integer> currentRow = new ArrayList<Integer>(word.size()+1);
currentRow.add(previousRow.get(0)+1);
int insert = 0;
int delete = 0;
int swap = 0;
int d = 0;
for(int i=1;i<word.size()+1;i++){
delete = currentRow.get(i-1)+1;
insert = previousRow.get(i)+1;
if(word.get(i-1)==letter)
swap = previousRow.get(i-1);
else
swap = previousRow.get(i-1)+1;
d = Math.min(delete, Math.min(insert, swap));
currentRow.add(d);
}
// if this node represents a word and the distance so far is <= maxd, then add this word as a suggestion
if(isWord==true && d<=maxd){
suggestions.add(suggestion);
}
// if any of the entries in the current row are <=maxd, it means we can still find possible solutions.
// recursively search all the branches of the trie
for(int i=0;i<currentRow.size();i++){
if(currentRow.get(i)<=maxd){
for(int j=0;j<26;j++){
if(children[j]!=null){
children[j].get((char)(j+97), word, currentRow, suggestions, maxd, suggestion+String.valueOf((char)(j+97)));
}
}
break;
}
}
}
Here's some code I quickly crafted showing one way to generate the candidates and to then "rank" them.
The trick is: you never "test" a non-valid candidate.
To me, your "I run out of memory when I've got an edit distance of 9" screams "combinatorial explosion".
Of course, to dodge a combinatorial explosion you don't try to generate every word that is at a distance of 9 from your misspelled word. You start from the misspelled word and generate (quite a lot of) possible candidates, but you refrain from creating too many candidates, for then you'd run into trouble.
(Also note that it doesn't make much sense to compute up to a Levenshtein edit distance of 9, because technically any word of fewer than 10 letters can be transformed into any other word of fewer than 10 letters in at most 9 transformations.)
Here's why you simply cannot test all words up to a distance of 9 without either hitting an OutOfMemoryError or having a program that never terminates:
generating all the LED up to 1 for the word "ptmizing", by only adding one letter (from a to z), already generates 9*26 variations (i.e. 234 variations) [there are 9 positions where you can insert one of 26 letters]
generating all the LED up to 2, by only adding one letter to what we now have, already generates 10*26*234 variations (60 840)
generating all the LED up to 3 gives: 17 400 240 variations
And that is only by considering the case where we add one, two or three letters (we're not counting deletions, swaps, etc.). And that is on a misspelled word that is only eight characters long. On "real" words, it explodes even faster.
Sure, you could get "smart" and generate this in a way that avoids too many dupes, etc., but the point stands: it's a combinatorial explosion, and it explodes fast.
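The growth described above can be reproduced with a few lines of Java. This is only an illustration of the arithmetic (insertions only, duplicates not removed), and the class and method names are made up: each round multiplies the running total by the number of insertion positions times 26 letters.

```java
public class InsertionExplosion {
    /** Counts the strings produced by inserting one letter at a time, without removing duplicates. */
    public static long variations(int wordLength, int insertions) {
        long total = 1;
        for (int k = 0; k < insertions; k++) {
            // a word of length n has n + 1 insertion positions, each taking one of 26 letters
            total *= (long) (wordLength + k + 1) * 26;
        }
        return total;
    }

    public static void main(String[] args) {
        int length = "ptmizing".length(); // 8 letters, so 9 insertion positions
        for (int d = 1; d <= 3; d++) {
            System.out.println(d + " insertion(s): " + variations(length, d));
        }
        // prints 234, then 60840, then 17400240
    }
}
```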
Anyway... Here's an example. I'm simply passing the dictionary of valid words (containing only four words in this case) to the corresponding method to keep this short.
You'll obviously want to replace the call to the LED with your own LED implementation.
The double-metaphone is just an example: in a real spellchecker, words that "sound alike" despite a larger LED should be considered "more correct" and hence are often suggested first. For example, "optimizing" and "aupteemising" are quite far apart from an LED point of view, but using the double-metaphone you should get "optimizing" as one of the first suggestions.
(Disclaimer: the following was cranked out in a few minutes; it doesn't take into account uppercase, non-English words, etc. It's not a real spell-checker, just an example.)
@Test
public void spellCheck() {
final String src = "misspeled";
final Set<String> validWords = new HashSet<String>();
validWords.add("boing");
validWords.add("Yahoo!");
validWords.add("misspelled");
validWords.add("stackoverflow");
final List<String> candidates = findNonSortedCandidates( src, validWords );
final SortedMap<Integer,String> res = computeLevenhsteinEditDistanceForEveryCandidate(candidates, src);
for ( final Map.Entry<Integer,String> entry : res.entrySet() ) {
System.out.println( entry.getValue() + " # LED: " + entry.getKey() );
}
}
private SortedMap<Integer, String> computeLevenhsteinEditDistanceForEveryCandidate(
final List<String> candidates,
final String mispelledWord
) {
final SortedMap<Integer, String> res = new TreeMap<Integer, String>();
for ( final String candidate : candidates ) {
res.put( dynamicProgrammingLED(candidate, mispelledWord), candidate );
}
return res;
}
private int dynamicProgrammingLED( final String candidate, final String misspelledWord ) {
return Levenhstein.getLevenshteinDistance(candidate,misspelledWord);
}
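The Levenhstein.getLevenshteinDistance helper called above is not shown in the answer. As a stand-in, here is a minimal sketch of a Levenshtein distance that keeps only two rows of the table in memory, the same idea the question describes. The class name is hypothetical and this is not the answerer's implementation.

```java
public class TwoRowLevenshtein {
    public static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // building the first j characters of b from nothing takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // removing the first i characters of a takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // Reuse the two arrays instead of allocating a new row per character.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("misspeled", "misspelled")); // 1
        System.out.println(distance("kitten", "sitting"));       // 3
    }
}
```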
Here you generate all possible candidates using several methods. I've only implemented one such method (and quickly so it may be bogus but that's not the point ; )
private List<String> findNonSortedCandidates( final String src, final Set<String> validWords ) {
final List<String> res = new ArrayList<String>();
res.addAll( allCombinationAddingOneLetter(src, validWords) );
// res.addAll( allCombinationRemovingOneLetter(src) );
// res.addAll( allCombinationInvertingLetters(src) );
return res;
}
private List<String> allCombinationAddingOneLetter( final String src, final Set<String> validWords ) {
    final List<String> res = new ArrayList<String>();
    for (char c = 'a'; c <= 'z'; c++) { // <= so that 'z' is tried as well
        // insert c before each existing character
        for (int i = 0; i < src.length(); i++) {
            final String candidate = src.substring(0, i) + c + src.substring(i, src.length());
            if ( validWords.contains(candidate) ) {
                res.add(candidate); // only adding candidates we know are valid words
            }
        }
        // append c at the end
        if ( validWords.contains(src+c) ) {
            res.add( src + c );
        }
    }
    return res;
}
One thing you could try is increasing Java's heap size to overcome the out-of-memory error.
The following article explains how to increase the heap size in Java:
http://viralpatel.net/blogs/2009/01/jvm-java-increase-heap-size-setting-heap-size-jvm-heap.html
But I think the better approach is to find a better algorithm than the current one.
Well, without more information on the topic there is not much the community can do for you. You can start with the following:
Look at what your profiler says (after it has run for a little while): does anything pile up? Are there a lot of objects of one kind? This should normally give you a hint about what is wrong with your code.
Publish your saved dump somewhere and link it in your question, so someone else can take a look at it.
Tell us which profiler you are using, so somebody can give you hints on where to look for valuable information.
After you have narrowed your problem down to a specific part of your code, and you cannot figure out why there are so many objects of $FOO in memory, post a snippet of the relevant part.

sample java code for approximate string matching or boyer-moore extended for approximate string matching

I need to find (1) mismatches (incorrectly played notes), (2) insertions (additional notes played), and (3) deletions (missed notes) in a music piece (e.g. note pitches [string values] stored in a table) compared against a reference music piece.
This is possible either through exact string matching algorithms or through dynamic programming / approximate string matching algorithms. However, I realised that approximate string matching is more appropriate for my problem, since it identifies mismatches, insertions and deletions of notes. An extended version of Boyer-Moore that supports approximate string matching would also work.
Is there a link to sample Java code I can use to try out approximate string matching? I find complex explanations and equations, but I think I could do well with some sample code and simple explanations. Or can I find sample Java code for Boyer-Moore extended for approximate string matching? I understand the Boyer-Moore concept, but I'm having trouble adjusting it to support approximate string matching (i.e. to support mismatch, insertion, deletion).
Also, what is the most efficient approximate string matching algorithm (the way Boyer-Moore is for exact string matching)?
Greatly appreciate any insight/ suggestions.
Many thanks in advance
You could start with the Wikipedia page on approximate string matching.
The problem is that this is a complex field, and simply looking at / copying some example code probably won't help you understand what is going on.
EDIT - besides, I don't see how Boyer-Moore would adapt to approximate string matching.
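For the note-comparison problem in the question, a minimal Java sketch using the classic dynamic-programming edit distance with a backtrace may be a simpler starting point than Boyer-Moore. It fills the full cost table over two note sequences and then walks back through it to label each difference as a mismatch, deletion or insertion. The class name, method name and message format are illustrative. It runs in time and memory proportional to the product of the two lengths, and it reports only one of possibly several optimal alignments.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NoteAligner {
    /** Lists the edits that turn the reference notes into the played notes. */
    public static List<String> diff(String[] reference, String[] played) {
        int n = reference.length;
        int m = played.length;
        int[][] cost = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) cost[i][0] = i; // all reference notes missed
        for (int j = 0; j <= m; j++) cost[0][j] = j; // all played notes are extra
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = cost[i - 1][j - 1] + (reference[i - 1].equals(played[j - 1]) ? 0 : 1);
                cost[i][j] = Math.min(sub, Math.min(cost[i - 1][j] + 1, cost[i][j - 1] + 1));
            }
        }
        // Walk back from the bottom-right corner to recover one optimal edit script.
        List<String> edits = new ArrayList<>();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && reference[i - 1].equals(played[j - 1])
                    && cost[i][j] == cost[i - 1][j - 1]) {
                i--; j--; // note played correctly
            } else if (i > 0 && j > 0 && cost[i][j] == cost[i - 1][j - 1] + 1) {
                edits.add("mismatch at " + (i - 1) + ": expected " + reference[i - 1]
                        + ", played " + played[j - 1]);
                i--; j--;
            } else if (i > 0 && cost[i][j] == cost[i - 1][j] + 1) {
                edits.add("deletion at " + (i - 1) + ": missed " + reference[i - 1]);
                i--;
            } else {
                edits.add("insertion before " + i + ": extra " + played[j - 1]);
                j--;
            }
        }
        Collections.reverse(edits);
        return edits;
    }

    public static void main(String[] args) {
        String[] reference = {"C4", "D4", "E4", "F4", "G4"};
        String[] played = {"C4", "D4", "F4", "G4", "A4"};
        diff(reference, played).forEach(System.out::println);
    }
}
```

Positions in the messages are zero-based indexes into the reference sequence.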
Here is C# Boyer-Moore code, which can be tweaked into BMH or approximate matching.
Dictionary<char, int> ShiftSizeTable = new Dictionary<char, int>();
//Calculate the shift/skip count for each element in the pattern text, so that we can skip that many characters in the given text while searching.
public void PreProcessBMSBadMatchTable(char[] patternCharacters)
{
ShiftSizeTable.Clear();
int totalCharacters = patternCharacters.Length;
for (int lpIndex = 0; lpIndex < totalCharacters; lpIndex++)
{
//Calculate the shift size for each character in the string or char array.
int ShiftSize = Math.Max(1, (totalCharacters - 1) - lpIndex);
//If the charater is already exists in the ShiftSize table then replace it else add it to ShiftSize table.
if (ShiftSizeTable.ContainsKey(patternCharacters[lpIndex]))
{
ShiftSizeTable.Remove(patternCharacters[lpIndex]);
}
ShiftSizeTable.Add(patternCharacters[lpIndex], ShiftSize);
}
}
//Use the PreProcessed Shift/Skip table to find the pattern Characters in text and skip the bad Characters in the text.
public int BoyerMooreSearch1UsingDictionary(char[] textCharacters, char[] patternCharacters)
{
PreProcessBMSBadMatchTable(patternCharacters);
int SkipLength;
int patternCharactersLenght = patternCharacters.Length;
int textCharactersLenght = textCharacters.Length;
// Step2. Use Loop through each character in source text use ShiftArrayTable to skip the elements.
for (int lpTextIndex = 0; lpTextIndex <= (textCharactersLenght - patternCharactersLenght); lpTextIndex += SkipLength)
{
SkipLength = 0;
for (int lpPatIndex = patternCharactersLenght - 1; lpPatIndex >= 0; lpPatIndex--)
{
if (patternCharacters[lpPatIndex] != textCharacters[lpTextIndex + lpPatIndex])
{
SkipLength = Math.Max(1, lpPatIndex - ShiftSizeTable[patternCharacters[lpPatIndex]]);
break;
}
}
if (SkipLength == 0)
{
return lpTextIndex; // Found
}
}
return -1; // Not found
}
