String Match How to implement - java

I have a class Matcher, shown below. Its find method accepts two strings: pattern (the string to look for) and source (the string to search in). For example, if pattern = "abc" and source = "abc cda abc" are passed to find, it returns [0, 8], i.e. the pattern abc (exact match) is found at index 0 and index 8 of source. Whatever you pass as pattern is treated as a single pattern. Without modifying the Matcher class, what is the best way to search for more than one pattern? For example, I want to call find from another class: if I have two patterns stored in an ArrayList, I want to pass the first pattern and get its result, then pass the second pattern and get its result, all in one go. Processing should stop only after every pattern in the ArrayList has been looked for in source. I'd appreciate your ideas.
import java.util.ArrayList;
import java.util.List;

public class Matcher {
    public static List<Integer> find(String pattern, String source) {
        char[] x = pattern.toCharArray(), y = source.toCharArray();
        int i, j, m = x.length, n = y.length;
        List<Integer> result = new ArrayList<Integer>();
        /* Searching: try every start offset j and compare character by character */
        for (j = 0; j <= n - m; ++j) {
            for (i = 0; i < m && x[i] == y[i + j]; ++i)
                ;
            if (i >= m)
                result.add(j);
        }
        return result;
    }
}

So, you want your method find to return a list of indexes? Did you have a look at String.indexOf? That might do exactly what you want.

It seems to me that you answered your own question. You'll need to use a loop in your client code to make repeated calls to the find() method. You won't be able to do it in a single call unless you rewrite your find method, which you don't want to do. Your client code in mangled pseudocode:
declare a Matcher object
for (each pattern I want to match)
call the find method with the pattern and the source string
store the result
end loop
How you deal with the result will depend on what you need to do with it. You could create an ArrayList object and append the List objects to it. Or you could create a HashMap and use the pattern as a key to the List objects if you need to know which pattern is where.
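The loop described above could look something like the following minimal sketch. The names MultiPatternSearch and findAll are made up for illustration, and the nested Matcher class is only a copy of the one from the question so that the example compiles on its own; real client code would call the existing Matcher.find directly. Since find is static, no Matcher object is actually needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiPatternSearch {

    // Copy of the question's Matcher, included only to keep the sketch self-contained.
    static class Matcher {
        public static List<Integer> find(String pattern, String source) {
            char[] x = pattern.toCharArray(), y = source.toCharArray();
            int i, j, m = x.length, n = y.length;
            List<Integer> result = new ArrayList<Integer>();
            for (j = 0; j <= n - m; ++j) {
                for (i = 0; i < m && x[i] == y[i + j]; ++i)
                    ;
                if (i >= m)
                    result.add(j);
            }
            return result;
        }
    }

    // Hypothetical helper: calls find once per pattern and keeps the results
    // keyed by pattern, in the order the patterns were given.
    public static Map<String, List<Integer>> findAll(List<String> patterns, String source) {
        Map<String, List<Integer>> results = new LinkedHashMap<String, List<Integer>>();
        for (String pattern : patterns) {
            results.put(pattern, Matcher.find(pattern, source));
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("abc", "cda");
        // prints {abc=[0, 8], cda=[4]}
        System.out.println(findAll(patterns, "abc cda abc"));
    }
}
```

A LinkedHashMap is used here so that the results come back in the same order as the patterns; a plain HashMap would work too if the order does not matter.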
Let me know if I completely missed your point.
Cheers,
dean

Related

Adding everything but the nth element to another arraylist

For my project we have to manipulate certain LISP phrasing using Java. One of the tasks is given:
'((4A)(1B)(2C)(2A)(1D)(4E)2)
The number at the end is the "n". The task is to delete every nth element from the expression. For example, the expression above would evaluate to:
'((4A)(2C)(1D)2)
My approach right now is adding all the elements that aren't at the nth index to another array. My error is that it adds every single element to the new array leaving both elements identical.
My code:
String input4=inData.nextLine();
length=input4.length();
String nString=input4.substring(length-2,length-1);
int n = Integer.parseInt(nString);
count=n;
String delete1=input4.replace("'(","");
String delete2=delete1.replace("(","");
final1=delete2.replace(")","");
length=final1.length();
for (int i=1;i<length;i++)
{
    part=final1.substring(i-1,i);
    list.add(part);
}
for(int i=0;i<=list.size();i++)
{
    if(!(i%n==0))
    {
        delete.add(list.get(i-1));
        delete.add(list.get(i));
    }
    else
    {
    }
}
System.out.print("\n"+list);
One solution to this problem (although it does not directly address the bug in your code) is to use a regex Pattern, which works nicely for this sort of thing, especially if the code does not have to adapt much to different input strings. When it is possible, I find this easier than manipulating Strings directly, although Patterns (and regexes in general) can be slow.
// Same as you had before
String input4="'((4A)(1B)(2C)(2A)(1D)(4E)2)";
int length=input4.length();
String nString=input4.substring(length-2,length-1);
int n = Integer.parseInt(nString);
int count=n;
// Match (..)
// This could be adapted to catch ( ) with anything in it other than another
// set of parentheses.
Matcher m = Pattern.compile("\\(.{2}\\)").matcher(input4);
// Initialize with the start of the resulting string.
StringBuilder sb = new StringBuilder("'(");
int i = 0;
while (m.find())
{
    // If we are not at an index to skip, then append this group
    if (++i % count != 0)
    {
        sb.append(m.group());
    }
}
// Add the end, which is the count and the ending parentheses.
sb.append(count).append(")");
System.out.println(sb.toString());
Some example input/output:
'((4A)(1B)(2C)(2A)(1D)(4E)2)
'((4A)(2C)(1D)2)
'((4A)(1B)(2C)(2A)(1D)(4E)3)
'((4A)(1B)(2A)(1D)3)
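For comparison, the list-based approach from the question can also be made to work without regex. The following is only a sketch under the same assumptions as the code above: the input has the shape '((..)(..)...n) with a single-digit n of at least 1. The class and method names are made up for illustration. The key difference from the question's code is that whole (..) groups are collected, and a group is copied only when its 1-based position is not a multiple of n.

```java
import java.util.ArrayList;
import java.util.List;

public class DropEveryNth {

    public static String dropEveryNth(String input) {
        // The single digit just before the closing parenthesis is n.
        int n = Character.getNumericValue(input.charAt(input.length() - 2));
        // Strip the leading "'(" and the trailing "n)" so only the groups remain.
        String body = input.substring(2, input.length() - 2);

        List<String> kept = new ArrayList<String>();
        int position = 0; // 1-based position of the current group
        int start = 0;
        while (start < body.length()) {
            int end = body.indexOf(')', start) + 1;
            if (end == 0) {
                break; // malformed input: no closing parenthesis left
            }
            position++;
            if (position % n != 0) {
                kept.add(body.substring(start, end));
            }
            start = end;
        }

        StringBuilder sb = new StringBuilder("'(");
        for (String group : kept) {
            sb.append(group);
        }
        return sb.append(n).append(')').toString();
    }

    public static void main(String[] args) {
        // prints '((4A)(2C)(1D)2)
        System.out.println(dropEveryNth("'((4A)(1B)(2C)(2A)(1D)(4E)2)"));
    }
}
```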

java regex replace any double letter in word with single

I've been searching for hours but can't find an answer; I apologize if this has been answered before.
I'm trying to check each word in a message for any double letters and remove the extra letter, so that words like wall or doll become wal or dol. The purpose is a fake language translation for a game. So far I've gotten as far as identifying the double letters, but I don't know how to replace them.
Here's my code so far:
public String[] removeDouble(String[] words){
    Pattern pattern = Pattern.compile("(\\w)\\1+");
    for (int i = 0; i < words.length; i++){
        Matcher matcher = pattern.matcher(words[i]);
        if (matcher.find()){
            words[i].replaceAll("what to replace with?");
        }
    }
    return words;
}
You can do the whole replacement operation in one statement if you use back references:
for (int i = 0; i < words.length; i++)
    words[i] = words[i].replaceAll("(.)\\1", "$1");
Note that you must assign the value returned from string methods that (appear to) change strings, because they return new strings rather than mutate the string.
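Put together, a complete version of the method might look like the sketch below. It keeps the question's regex (\w)\1+, which collapses a run of any length to one letter, rather than the (.)\1 variant above, which only collapses pairs; that choice and the class name DoubleLetters are assumptions for illustration. It also returns a new array instead of modifying the argument.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class DoubleLetters {

    // A word character followed by one or more copies of itself.
    private static final Pattern REPEATED = Pattern.compile("(\\w)\\1+");

    public static String[] removeDouble(String[] words) {
        String[] result = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            // replaceAll returns a new String; "$1" is the first captured letter.
            result[i] = REPEATED.matcher(words[i]).replaceAll("$1");
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [wal, dol, cat]
        System.out.println(Arrays.toString(removeDouble(new String[] {"wall", "doll", "cat"})));
    }
}
```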
String.replaceAll does not modify the string in place (Java Strings are immutable). You need to assign the returned value back.
String.replaceAll also takes two parameters: the regex and the replacement.
Replace the following line:
words[i].replaceAll("what to replace with?");
with:
words[i] = words[i].replaceAll("(\\w)\\1+", "$1");

String parsing based on mask

I have several strings that follow multiple masks. Is there a better way of parsing strings against a mask than calling String.split, looping over the tokens, and identifying each one by its position? That code also gets clumsy, because a lot of token logic has to be written.
Sample masks can be:
PROD-LOC-STATE-CITY
PROD-DEST-STATE-ZIP
PROD-OZIP-DZIP-VER-INS
Sample Strings:
CoolDuo-GROUND-NYC-10082
Sample code:
String[] arr = input.split("-");
int pos = 0;
for(String k:arr){
if(pos == 0) {
//-- k is of PROD
...
...
}
..
...
pos++;
}
Code like the above is repeated for every mask type.
You can use regex groups to get the target strings by group name: http://docs.oracle.com/javase/tutorial/essential/regex/groups.html. See also "Regex Named Groups in Java".
If you can't use named groups, you can do it this way (if you are absolutely sure of your strings' structure):
final static int PROD_POS = 1;
final static int STATE_POS = 3;
...
Pattern pattern = Pattern.compile("(some_regexp)-(some_regexp)-(some_regexp)");
Matcher matcher = pattern.matcher(input);
if ( matcher.matches() ) {
String state = matcher.group(STATE_POS);
}
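If named groups are available (Java 7 and later), one regex can be generated from each mask so that no per-mask token logic is needed. The sketch below is only an illustration: the class MaskParser and its methods are made-up names, and it assumes that fields are separated by "-", never contain "-" themselves, and that every mask token is a valid group name (letters and digits, starting with a letter).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaskParser {

    // Turns a mask such as "PROD-DEST-STATE-ZIP" into
    // (?<PROD>[^-]+)-(?<DEST>[^-]+)-(?<STATE>[^-]+)-(?<ZIP>[^-]+)
    public static Pattern compileMask(String mask) {
        StringBuilder regex = new StringBuilder();
        for (String field : mask.split("-")) {
            if (regex.length() > 0) {
                regex.append('-');
            }
            regex.append("(?<").append(field).append(">[^-]+)");
        }
        return Pattern.compile(regex.toString());
    }

    // Returns field name -> value, or an empty map if the input does not fit the mask.
    public static Map<String, String> parse(String mask, String input) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        Matcher m = compileMask(mask).matcher(input);
        if (m.matches()) {
            for (String field : mask.split("-")) {
                fields.put(field, m.group(field));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints {PROD=CoolDuo, DEST=GROUND, STATE=NYC, ZIP=10082}
        System.out.println(parse("PROD-DEST-STATE-ZIP", "CoolDuo-GROUND-NYC-10082"));
    }
}
```

In practice the compiled Pattern for each mask would probably be cached rather than rebuilt on every call.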
If your masks get too big to manage and you really want to delve deeper into this problem, you can use one of the lexical analysis packages available for Java.
For background on what that means, see http://en.wikipedia.org/wiki/Lexical_analysis.
A popular package for Java is JFlex (http://jflex.de/), but there are many others; a web search will turn them up.
Best of luck

Efficient way to search for a set of strings in a string in Java

I have a set of elements of size about 100-200. Let a sample element be X.
Each of the elements is a set of strings (number of strings in such a set is between 1 and 4). X = {s1, s2, s3}
For a given input string (about 100 characters), say P, I want to test whether any of the X is present in the string.
X is present in P iff, for every s in X, s is a substring of P.
The set of elements is available for pre-processing.
I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:
Checking whether all the strings s are substring of P seems like a costly operation
Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
I cannot directly use regex as s1, s2, s3 can be present in any order and all of the strings need to be present as substring
Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of strings. Because number of elements in X <= 4, this is still feasible. It would be great if somebody can point me to a better (faster/more elegant) approach for the same.
Please note that the set of elements is available for pre-processing and I want the solution in java.
You can use regex directly:
Pattern regex = Pattern.compile(
"^ # Anchor search to start of string\n" +
"(?=.*s1) # Check if string contains s1\n" +
"(?=.*s2) # Check if string contains s2\n" +
"(?=.*s3) # Check if string contains s3",
Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();
foundMatch is true if all three substrings are present in the string.
Note that you might need to escape your "needle strings" if they could contain regex metacharacters.
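As a sketch of how the lookahead pattern could be generated for an arbitrary element X, with the escaping taken care of by Pattern.quote: the class and method names below are made up for illustration, and Pattern.COMMENTS is left out so that whitespace inside the needle strings stays significant.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class AllSubstrings {

    // Builds ^(?=.*\Qs1\E)(?=.*\Qs2\E)... from the given needle strings.
    public static Pattern allOf(List<String> needles) {
        StringBuilder regex = new StringBuilder("^");
        for (String needle : needles) {
            // Pattern.quote makes regex metacharacters in the needle literal.
            regex.append("(?=.*").append(Pattern.quote(needle)).append(')');
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    public static boolean containsAll(List<String> needles, String haystack) {
        return allOf(needles).matcher(haystack).find();
    }

    public static void main(String[] args) {
        System.out.println(containsAll(Arrays.asList("foo", "bar"), "xxbarxxfoo")); // true
        System.out.println(containsAll(Arrays.asList("foo", "baz"), "xxbarxxfoo")); // false
        System.out.println(containsAll(Arrays.asList("a.b"), "axb"));               // false
    }
}
```

Since the elements are available for pre-processing, the Pattern for each X would be compiled once up front and reused for every input string.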
It sounds like you're prematurely optimising your code before you've actually discovered a particular approach is actually too slow.
The nice property of your problem is that the string must contain all elements of X as substrings, meaning we can fail fast as soon as we find one element of X that is not contained in P. This might turn out to save more time than other approaches, especially if the elements of X are typically longer than a few characters and contain no or only a few repeating characters. For instance, a regex engine need only check 20 characters of a 100-character string when checking for the presence of a 5-character string with non-repeating characters (e.g. coast). And since there are 100-200 sets X, you really, really want to fail fast if you can.
My suggestion would be to sort the strings in order of length and check for each string in turn, stopping early if one string is not found.
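A minimal sketch of that fail-fast idea, using plain String.contains. The class name FailFastMatcher is made up, and sorting longest-first is an assumption on my part (the suggestion above does not say which direction); the reasoning is that a longer string is less likely to occur, so checking it first tends to reject an element sooner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class FailFastMatcher {

    private final List<List<String>> elements = new ArrayList<List<String>>();

    // Pre-processing: copy each element X and sort its strings longest first.
    public FailFastMatcher(List<List<String>> rawElements) {
        for (List<String> element : rawElements) {
            List<String> sorted = new ArrayList<String>(element);
            Collections.sort(sorted, new Comparator<String>() {
                @Override
                public int compare(String a, String b) {
                    return Integer.compare(b.length(), a.length());
                }
            });
            elements.add(sorted);
        }
    }

    // True if at least one element X has all of its strings inside p.
    public boolean anyPresent(String p) {
        for (List<String> element : elements) {
            if (allPresent(element, p)) {
                return true;
            }
        }
        return false;
    }

    private static boolean allPresent(List<String> element, String p) {
        for (String s : element) {
            if (!p.contains(s)) {
                return false; // fail fast: one missing string rules out this element
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<String>> xs = new ArrayList<List<String>>();
        xs.add(Arrays.asList("coast", "sea"));
        xs.add(Arrays.asList("hill", "valley", "river"));
        FailFastMatcher matcher = new FailFastMatcher(xs);
        System.out.println(matcher.anyPresent("a road by the sea along the coast")); // true
        System.out.println(matcher.anyPresent("a hill above the river"));            // false
    }
}
```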
Looks like a perfect case for the Rabin–Karp algorithm:
Rabin–Karp is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.
When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter etc. combination which occurs in at least one string to a list of strings in which it occurs.
The algorithm to index a string would look like this (untested):
HashMap<String, Set<String>> indexes = new HashMap<String, Set<String>>();
for (int pos = 0; pos < string.length(); pos++) {
    // end is exclusive, so this visits every non-empty substring starting at pos
    for (int end = pos + 1; end <= string.length(); end++) {
        String substring = string.substring(pos, end);
        Set<String> stringsForThisKey = indexes.get(substring);
        if (stringsForThisKey == null) {
            stringsForThisKey = new HashSet<String>();
            indexes.put(substring, stringsForThisKey);
        }
        stringsForThisKey.add(string);
    }
}
Indexing each string that way would be quadratic in the length of the string, but it only needs to be done once for each string.
But the result would be constant-speed access to the list of strings in which a specific string occurs.
You are probably looking for the Aho-Corasick algorithm, which constructs a trie-like automaton from the set of strings (the dictionary) and then matches the input string against the dictionary using this automaton.
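To give an idea of what that involves, here is a compact, unoptimized sketch of Aho-Corasick. It is not taken from any library; the class name AhoCorasickSketch and its methods are made up for illustration. The constructor builds the trie and the failure links once (the pre-processing step), and findAll then scans the input a single time and reports every dictionary string that occurs in it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class AhoCorasickSketch {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<Character, Node>();
        final List<String> hits = new ArrayList<String>(); // words ending at this state
        Node fail;                                         // longest proper suffix state
    }

    private final Node root = new Node();

    public AhoCorasickSketch(Iterable<String> dictionary) {
        // Phase 1: insert every non-empty dictionary string into a trie.
        for (String word : dictionary) {
            if (word.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : word.toCharArray()) {
                Node child = node.next.get(c);
                if (child == null) {
                    child = new Node();
                    node.next.put(c, child);
                }
                node = child;
            }
            node.hits.add(word);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<Node>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> edge : node.next.entrySet()) {
                Node child = edge.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(edge.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(edge.getKey());
                child.fail = (target != null) ? target : root;
                // Words that end at the suffix state also end here.
                child.hits.addAll(child.fail.hits);
                queue.add(child);
            }
        }
    }

    // Returns every dictionary string that occurs somewhere in text.
    public Set<String> findAll(String text) {
        Set<String> found = new HashSet<String>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            found.addAll(node.hits);
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasickSketch ac = new AhoCorasickSketch(Arrays.asList("he", "she", "his", "hers"));
        // Contains he, she and hers (set order is unspecified).
        System.out.println(ac.findAll("ushers"));
    }
}
```

For the problem in the question, the dictionary would be the union of all strings in all elements; an element X is then present in P exactly when findAll(P).containsAll(X).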
You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here
I have used proprietary implementations (that I no longer even have access to) and they are very fast.
One way is to generate every possible substring and add this to a set. This is pretty inefficient.
Instead you can create all the strings from any point to the end into a NavigableSet and search for the closest match. If the closest match starts with the string you are looking for, you have a substring match.
static class SubstringMatcher {
    final NavigableSet<String> set = new TreeSet<String>();

    SubstringMatcher(Set<String> strings) {
        for (String string : strings) {
            for (int i = 0; i < string.length(); i++)
                set.add(string.substring(i));
        }
        // remove duplicates.
        String last = "";
        for (String string : set.toArray(new String[set.size()])) {
            if (string.startsWith(last))
                set.remove(last);
            last = string;
        }
    }

    public boolean findIn(String s) {
        String s1 = set.ceiling(s);
        return s1 != null && s1.startsWith(s);
    }
}

public static void main(String... args) {
    Set<String> strings = new HashSet<String>();
    strings.add("hello");
    strings.add("there");
    strings.add("old");
    strings.add("world");
    SubstringMatcher sm = new SubstringMatcher(strings);
    System.out.println(sm.set);
    for (String s : "ell,he,ow,lol".split(","))
        System.out.println(s + ": " + sm.findIn(s));
}
prints
[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false

Matching Subset in a String

Let's say I have-
String x = "ab";
String y = "xypa";
If I want to see if any subset of x exists in y, what would be the fastest way? Looping is time consuming. In the example above a subset of x is "a" which is found in y.
The answer really depends on many things.
If you just want to find any subset and you're doing this only once, looping is just fine (and the best you can do without using additional storage) and you can stop when you find a single character that matches.
If you have a fixed x and want to use it for matching several strings y, you can do some pre-processing to store the characters in x in a table and use this table to check if each character of y occurs in x or not.
If you want to find the largest subset, then you're looking at a different problem: the longest common subsequence problem.
Well, I'm not sure it's better than looping, but you could use String#matches:
if (y.matches(".*[" + x + "]+.*")) ...
You'd need to escape characters that are special in a regex [] construct, though (like ], -, \, ...).
The above is just an example, if you're doing it more than once, you'll want to use Pattern, Matcher, and the other stuff from the java.util.regex package.
You have to use a for loop, or a regex, which is just as expensive as a for loop, because you basically need to break one of your strings into chars. Using a for loop:
boolean isSubset = false;
for (int i = 0; i < x.length(); i++) {
    // String.contains takes a CharSequence, so look the char up with indexOf instead
    if (y.indexOf(x.charAt(i)) >= 0) {
        isSubset = true;
        break;
    }
}
It looks like this could be a case of the longest common substring problem.
You can generate all subsets of x (in your example: ab, a, b) and then generate a regexp that does the matching:
Pattern p = Pattern.compile("(ab|a|b)");
Matcher m = p.matcher(y);
if (m.find()) {
    System.err.println(m.group());
}
If both Strings only contain [a-z], the fastest approach is to make two bitmaps, each 26 bits long. Set the bit for every letter contained in the corresponding String. Take the AND of the bitmaps: the resulting bits are the letters present in both Strings, i.e. the largest common subset. This is a simple O(n), with n the length of the longer String.
(If you want to cover the whole of Unicode, Bloom filters might be more appropriate.)
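A small sketch of the bitmap idea, using one int per string since 26 bits fit comfortably. The class and method names are made up for illustration, and the code assumes (and checks) that both strings contain only the letters a to z.

```java
public class LetterBitmap {

    // Bit 0 stands for 'a', bit 1 for 'b', ..., bit 25 for 'z'.
    public static int letterMask(String s) {
        int mask = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z is supported: " + c);
            }
            mask |= 1 << (c - 'a');
        }
        return mask;
    }

    // Letters that occur in both strings, in alphabetical order.
    public static String commonLetters(String x, String y) {
        int common = letterMask(x) & letterMask(y);
        StringBuilder sb = new StringBuilder();
        for (int bit = 0; bit < 26; bit++) {
            if ((common & (1 << bit)) != 0) {
                sb.append((char) ('a' + bit));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(commonLetters("ab", "xypa")); // prints a
    }
}
```

If only a yes/no answer is needed, (letterMask(x) & letterMask(y)) != 0 is enough.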
Looping is time-consuming, but there's no way to do what you want other than going over the target string repeatedly.
What you can do is optimize by checking the smallest strings first, and work your way up. For example, if the target string doesn't contain abc, it can't possibly contain abcdef.
Other optimizations off the top of my head:
Don't continue to check for a match after a non-matching character is hit, though in Java you can let the computer worry about this.
Don't check to see if something is a match if there aren't enough characters left in the target string for a match to be possible.
If you need speed and have lots of space, you might be able to break the target string up into a fancy data structure like a trie for better results, though I don't have an exact algorithm in mind.
Another storage-is-not-a-problem solution: decompose the target into every possible substring and store the results in a HashSet.
What about this?
package so3935620;

import static org.junit.Assert.*;

import java.util.BitSet;

import org.junit.Test;

public class Main {

    public static boolean overlap(String s1, String s2) {
        // Mark every character of s1 in the bit set
        BitSet bs = new BitSet();
        for (int i = 0; i < s1.length(); i++) {
            bs.set(s1.charAt(i));
        }
        // Any character of s2 that is marked occurs in both strings
        for (int i = 0; i < s2.length(); i++) {
            if (bs.get(s2.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    @Test
    public void test() {
        assertFalse(overlap("", ""));
        assertTrue(overlap("a", "a"));
        assertFalse(overlap("abcdefg", "ABCDEFG"));
    }
}
And if that version is too slow, you can compute the BitSet depending on s1, save that in some variable and later only loop over s2.
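That precomputed variant might look like the sketch below; the class name CharSetMatcher and its method are made up for illustration. The BitSet is built once from the fixed string, and each later call only loops over the other string.

```java
import java.util.BitSet;

public class CharSetMatcher {

    private final BitSet chars = new BitSet();

    // Pre-processing: mark every character of the fixed string once.
    public CharSetMatcher(String fixed) {
        for (int i = 0; i < fixed.length(); i++) {
            chars.set(fixed.charAt(i));
        }
    }

    // True if other shares at least one character with the fixed string.
    public boolean overlaps(String other) {
        for (int i = 0; i < other.length(); i++) {
            if (chars.get(other.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        CharSetMatcher matcher = new CharSetMatcher("ab");
        System.out.println(matcher.overlaps("xypa")); // true
        System.out.println(matcher.overlaps("xyz"));  // false
    }
}
```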
