Match a string against multiple regex patterns - java

I have an input string.
I am wondering how to match this string against more than one regular expression effectively.
Example Input: ABCD
I'd like to match against these reg-ex patterns, and return true if at least one of them matches:
[a-zA-Z]{3}
^[^\\d].*
([\\w&&[^b]])*
I am not sure how to match against multiple patterns at once. Can someone tell me how to do it effectively?

If you have just a few regexes, and they are all known at compile time, then this can be enough:
private static final Pattern
rx1 = Pattern.compile("..."),
rx2 = Pattern.compile("..."),
...;
return rx1.matcher(s).matches() || rx2.matcher(s).matches() || ...;
If there are more of them, or they are loaded at runtime, then use a list of patterns:
final List<Pattern> rxs = new ArrayList<>();
for (Pattern rx : rxs) if (rx.matcher(input).matches()) return true;
return false;

You can make one large regex out of the individual ones:
[a-zA-Z]{3}|^[^\\d].*|([\\w&&[^b]])*
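A minimal sketch of building that combined regex programmatically. The class and method names (CombinedRegex, combine, matchesAny) are made up for illustration. Each alternative is wrapped in a non-capturing group so that a '|' inside one pattern cannot bleed into its neighbours; patterns that rely on capture-group numbers or inline flags would need more care.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any library.
final class CombinedRegex {

    // Joins the regexes with '|', wrapping each one in (?:...) first.
    static Pattern combine(List<String> regexes) {
        String joined = regexes.stream()
                .map(r -> "(?:" + r + ")")
                .collect(Collectors.joining("|"));
        return Pattern.compile(joined);
    }

    // The three patterns from the question, compiled once.
    static final Pattern ANY = combine(Arrays.asList(
            "[a-zA-Z]{3}",
            "^[^\\d].*",
            "([\\w&&[^b]])*"));

    // True if the whole input matches at least one alternative.
    static boolean matchesAny(String input) {
        return ANY.matcher(input).matches();
    }
}
```

With this, CombinedRegex.matchesAny("ABCD") is true (the second alternative accepts it), while "1b" is rejected by all three.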

To avoid recreating instances of the Pattern and Matcher classes, you can create one of each and reuse them. To reuse a Matcher, call its reset(newInput) method.
Warning: This approach is not thread-safe. Use it only when you can guarantee that only one thread at a time will call this method; otherwise create a separate Matcher instance for each method call.
Here is one possible code example:
private static Matcher m1 = Pattern.compile("regex1").matcher("");
private static Matcher m2 = Pattern.compile("regex2").matcher("");
private static Matcher m3 = Pattern.compile("regex3").matcher("");
public boolean matchesAtLeastOneRegex(String input) {
return m1.reset(input).matches()
|| m2.reset(input).matches()
|| m3.reset(input).matches();
}
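If several threads need to call this, one possible variation is to keep one Matcher per thread in a ThreadLocal. This is only a sketch: the class name ThreadSafeMatchers is made up, and the two patterns are borrowed from the question for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; Pattern is thread-safe, Matcher is not,
// so every thread gets its own Matcher instances.
final class ThreadSafeMatchers {
    private static final Pattern P1 = Pattern.compile("[a-zA-Z]{3}");
    private static final Pattern P2 = Pattern.compile("^[^\\d].*");

    private static final ThreadLocal<Matcher> M1 =
            ThreadLocal.withInitial(() -> P1.matcher(""));
    private static final ThreadLocal<Matcher> M2 =
            ThreadLocal.withInitial(() -> P2.matcher(""));

    static boolean matchesAtLeastOneRegex(String input) {
        // reset(input) re-targets the per-thread matcher at the new input
        return M1.get().reset(input).matches()
                || M2.get().reset(input).matches();
    }
}
```

Whether this beats simply calling pattern.matcher(input) on every call depends on the workload, so it is worth measuring before adopting it.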

As explained in (Running multiple regex patterns on String), it is better to concatenate the regexes into one large regex and then run the matcher only once. This is a large improvement if you often reuse the regex.

I'm not sure what effectively means, but if it's about performance and you want to check a lot of strings, I'd go for this
...
static Pattern p1 = Pattern.compile("[a-zA-Z]{3}");
static Pattern p2 = Pattern.compile("^[^\\d].*");
static Pattern p3 = Pattern.compile("([\\w&&[^b]])*");
public static boolean test(String s){
return p1.matcher(s).matches()
|| p2.matcher(s).matches()
|| p3.matcher(s).matches();
}
I'm not sure how it will affect performance, but combining them all in one regexp with | could also help.

Here's an alternative.
Note that one thing this doesn't do is return them in a specific order. But one could do that by sorting by m.start() for example.
private static HashMap<String, String> regs = new HashMap<String, String>();
...
regs.put("COMMA", ",");
regs.put("ID", "[a-z][a-zA-Z0-9]*");
regs.put("SEMI", ";");
regs.put("GETS", ":=");
regs.put("DOT", "\\.");
for (HashMap.Entry<String, String> entry : regs.entrySet()) {
String key = entry.getKey();
String value = entry.getValue();
Matcher m = Pattern.compile(value).matcher("program var a, b, c; begin a := 0; end.");
boolean f = m.find();
while(f)
{
System.out.println(key);
System.out.print(m.group() + " ");
System.out.print(m.start() + " ");
System.out.println(m.end());
f = m.find();
}
}
}
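The ordering mentioned above could be sketched like this: collect every match into a small token object, then sort by start offset. The names OrderedTokens, Token and scan are invented for this example, and the patterns are precompiled by the caller.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that returns matches in input order.
final class OrderedTokens {

    static final class Token {
        final String name;   // rule name, e.g. "ID"
        final String text;   // matched text
        final int start;     // offset of the match in the input

        Token(String name, String text, int start) {
            this.name = name;
            this.text = text;
            this.start = start;
        }
    }

    static List<Token> scan(String input, Map<String, Pattern> rules) {
        List<Token> tokens = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            while (m.find()) {
                tokens.add(new Token(rule.getKey(), m.group(), m.start()));
            }
        }
        // Sort by position so tokens come out in the order they appear.
        tokens.sort(Comparator.comparingInt(t -> t.start));
        return tokens;
    }
}
```

Note that overlapping rules are not resolved here: the ID pattern also matches keywords such as "program", just as in the loop above.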

Related

how to replace multiple strings with replace method at once without making more string objects

I have a string test "{\"userName\": \"<userName>\",\"firstName\": \"<firstName>\",\"lastName\": \"<lastName>\"}". I want to replace the placeholders in angle brackets with dynamic values. Sample code:
public class Test {
public static void main(String[] args) {
// TODO Auto-generated method stub
String toBeReplaced0 = "alpha";
String toBeReplaced1 = "beta";
String toBeReplaced2 = "gama";
String test = "{\"userName\": \"<userName>\",\"firstName\": \"<firstName>\",\"lastName\": \"<lastName>\"}";
}
}
Now, in this code, I want to replace <userName> with alpha, <firstName> with beta, and <lastName> with gama, all at once and without creating multiple string objects. This is not a homework question. The test string can contain more angle-bracket placeholders to be filled with dynamic values. How can I do this with the replace method or anything else?
Matcher.appendReplacement could be an option. Example:
public static void main(String[] args){
String toBeReplaced0 = "alpha";
String toBeReplaced1 = "beta";
String toBeReplaced2 = "gama";
String test = "{\"userName\": \"<userName>\",\"firstName\": \"<firstName>\",\"lastName\": \"<lastName>\"}";
System.out.println(findAndReplace(test,toBeReplaced0,toBeReplaced1,toBeReplaced2));
}
public static String findAndReplace(String original, String... replacments){
Pattern p = Pattern.compile("<[^<]*>");
Matcher m1 = p.matcher(original);
// count the matches
int count = 0;
while (m1.find()){
count++;
}
// if matches count equals replacement params length replace in the given order
if(count == replacments.length){
Matcher m = p.matcher(original);
StringBuffer sb = new StringBuffer();
int i = 0;
while (m.find()) {
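// quoteReplacement keeps '$' and '\' in the values from being treated specially
m.appendReplacement(sb, Matcher.quoteReplacement(replacments[i]));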
i++;
}
m.appendTail(sb);
return sb.toString();
}
//else return original
return original;
}
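A possible variation on the appendReplacement idea that looks each placeholder up by name instead of relying on argument order. This is a sketch only: PlaceholderFiller and fill are made-up names, and leaving unknown placeholders untouched is an assumption about the desired behaviour.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: fills <name> placeholders from a map in one pass.
final class PlaceholderFiller {
    // Group 1 captures the name between the angle brackets.
    private static final Pattern PLACEHOLDER = Pattern.compile("<([^<>]+)>");

    static String fill(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            // Assumption: unknown placeholders are copied through unchanged.
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement stops '$' and '\' from being interpreted.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

The template is scanned once and only the buffer and the final string are allocated, which is about as close to "no extra string objects" as immutable strings allow.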
If that's JSON, I'd prefer to use a JSON library to dynamically construct the resultant string and mitigate any possible syntactic errors.
If you really want to use String.replaceAll() or similar, I wouldn't expect that to be a problem in the above limited scope. Simply chain together your calls (e.g. see this tutorial)
Note that strings are immutable, and as such you can't easily do this without creating new string objects. If that's really a concern, perhaps you need to modify an array of chars (but that will be a non-trivial task when substituting multiple strings of varying lengths differing from their placeholders).
Simply:
String result = orig.replace("<userName>", "replacement");
(not forgetting that strings are immutable and so you have to use the result returned from this call)
I would do that with the StringTemplate library. It works with your example out of the box:
<dependency>
<groupId>org.antlr</groupId>
<artifactId>ST4</artifactId>
<version>4.0.8</version>
<scope>compile</scope>
</dependency>
// Given
final ST template = new ST("{\"userName\": \"<userName>\",\"firstName\": \"<firstName>\",\"lastName\": \"<lastName>\"}");
template.add("userName", "alpha");
template.add("firstName", "beta");
template.add("lastName", "gamma");
// When
final String result = template.render();
// Then
Assert.assertEquals("{\"userName\": \"alpha\",\"firstName\": \"beta\",\"lastName\": \"gamma\"}", result);

java startsWith() method with custom rules

I am implementing a typing trainer and would like to create my own String startsWith() method with specific rules.
For example: the '-' char should be equal to any long hyphen ('‒', etc.). I'll also add other rules for special accented characters (e equals é, but é does not equal e).
public class TestCustomStartsWith {
private static Map<Character, List<Character>> identityMap = new HashMap<>();
static { // different hyphens: ‒, –, —, ―
List<Character> list = new LinkedList<>();
list.add('‒');
list.add('–'); // etc
identityMap.put('-', list);
}
public static void main(String[] args) {
System.out.println(startsWith("‒d--", "-"));
}
public static boolean startsWith(String s, String prefix) {
if (s.startsWith(prefix)) return true;
if (prefix.length() > s.length()) return false;
int i = prefix.length();
while (--i >= 0) {
if (prefix.charAt(i) != s.charAt(i)) {
List<Character> list = identityMap.get(prefix.charAt(i));
if ((list == null) || (!list.contains(s.charAt(i)))) return false;
}
}
return true;
}
}
I could just replace all kinds of long hyphens with the '-' char, but if more rules are added, I'm afraid replacing will be too slow.
How can I improve this algorithm?
I don't know all of your custom rules, but would a regular expression work?
The user is passing in a String. Create a method to convert that String to a regex, e.g.
replace a short hyphen with short or long ([-‒]),
same for your accents, e becomes [eé]
Prepend with the start of word dohicky (\b),
Then convert this to a regex and give it a go.
Note that the list of replacements could be kept in a Map as suggested by Tobbias. Your code could be something like
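public boolean myStartsWith(String testString, String startsWith) {
// fancyTransformMap is assumed to be a Map<String, String> of plain text -> regex character class
for (Map.Entry<String, String> me : fancyTransformMap.entrySet()) {
startsWith = startsWith.replaceAll(me.getKey(), me.getValue());
}
// matches() must cover the whole string, so allow anything after the prefix.
// Note: \b only matches next to a word character, so drop it if a prefix can start with a hyphen.
return testString.matches("\\b" + startsWith + ".*");
}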
P.S. I'm not a regex super-guru, so there may be possible improvements.
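One way the regex idea could be sketched without replaceAll: build the pattern one character at a time, quote anything that has no rule, and use lookingAt(), which anchors at the start of the input but does not require the whole string to match. PrefixRegex, VARIANTS and the specific character classes are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: turns a prefix into a lenient regex.
final class PrefixRegex {
    // Expected character -> character class of accepted typed characters.
    private static final Map<Character, String> VARIANTS = new HashMap<>();
    static {
        VARIANTS.put('-', "[-\u2012\u2013\u2014\u2015]"); // hyphen and long dashes
        VARIANTS.put('e', "[e\u00E9\u00E8]");             // e plus two accented forms
    }

    static Pattern toPattern(String prefix) {
        StringBuilder regex = new StringBuilder();
        for (char c : prefix.toCharArray()) {
            String charClass = VARIANTS.get(c);
            // Characters without a rule are quoted so '.' or '*' stay literal.
            regex.append(charClass != null ? charClass : Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString());
    }

    static boolean startsWith(String s, String prefix) {
        // Compiles on every call; cache the Pattern if the prefix is reused.
        return toPattern(prefix).matcher(s).lookingAt();
    }
}
```

Because only the prefix is expanded, the rule stays one-directional: an expected 'e' accepts a typed 'é', but an expected 'é' does not accept a typed 'e'.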
I'd think something like a HashMap that maps the undesirable characters to what you want them to be interpreted as might be the way to go if you are worried about performance;
Map<Character, Character> fastMap = new HashMap<>();
// read it as '<long hyphen>' can be interpreted as '<regular hyphen>'
fastMap.put('–', '-');
fastMap.put('é', 'e');
fastMap.put('è', 'e');
fastMap.put('?', '?');
...
// and so on
That way you could ask for the value of the key: value = map.get(key).
However, this will only work as long as you have unique key-values. The caveat is that é can't be interpreted as è with this method - all the keys must be unique. However, if you are worried about performance, this is an exceedingly fast way of doing it, since the lookup time for a HashMap is pretty close to being O(1). But as others on this page have written, premature optimization is often a bad idea - try implementing something that works first, and if at the end of it you find it is too slow, then optimize.
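A minimal sketch of how such a map could be plugged into a startsWith check. LenientPrefix and CANONICAL are made-up names and the listed characters are only examples. Only the typed string is normalised, never the expected prefix, which keeps the rule one-directional as the question requires.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper built around a typed-char -> canonical-char map.
final class LenientPrefix {
    private static final Map<Character, Character> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put('\u2012', '-'); // figure dash
        CANONICAL.put('\u2013', '-'); // en dash
        CANONICAL.put('\u2014', '-'); // em dash
        CANONICAL.put('\u2015', '-'); // horizontal bar
        CANONICAL.put('\u00E9', 'e'); // e with acute accent
        CANONICAL.put('\u00E8', 'e'); // e with grave accent
    }

    static boolean startsWith(String s, String prefix) {
        if (prefix.length() > s.length()) return false;
        for (int i = 0; i < prefix.length(); i++) {
            char expected = prefix.charAt(i);
            char actual = s.charAt(i);
            if (actual == expected) continue;
            // Fall back to the canonical form of the typed character.
            Character canonical = CANONICAL.get(actual);
            if (canonical == null || canonical != expected) return false;
        }
        return true;
    }
}
```

No strings are created and each character costs at most one map lookup, so adding more rules only grows the map.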

How to iterate over regexp compliant strings

What is the easiest way to implement a class (in Java) that would serve as an iterator over the set of all values which conform to a given regexp?
Let's say I have a class like this:
public class RegexpIterator
{
private String regexp;
public RegexpIterator(String regexp) {
this.regexp = regexp;
}
public boolean hasNext() {
...
}
public String next() {
...
}
}
How do I implement it? The class assumes some linear ordering on the set of all conforming values and the next() method should return the i-th value when called for the i-th time.
Ideally the solution should support full regexp syntax (as supported by the Java SDK).
To avoid confusion, please note that the class is not supposed to iterate over matches of the given regexp over a given string. Rather it should (eventually) enumerate all string values that conform to the regexp (i.e. would be accepted by the matches() method of a matcher), without any other input string given as argument.
To further clarify the question, let's show a simple example.
RegexpIterator it = new RegexpIterator("ab?cd?e");
while (it.hasNext()) {
System.out.println(it.next());
}
This code snippet should have the following output (the order of lines is not relevant, even though a solution which would list shorter strings first would be preferred).
ace
abce
acde
abcde
Note that with some regexps, such as ab[A-Z]*cd, the set of values over which the class is to iterate is infinite. The preceding code snippet would run forever in these cases.
Do you need to implement a class? This pattern works well:
Pattern p = Pattern.compile("[0-9]+");
Matcher m = p.matcher("123, sdfr 123kjkh 543lkj ioj345ljoij123oij");
while (m.find()) {
System.out.println(m.group());
}
output:
123
123
543
345
123
For a more generalized solution:
public static List<String> getMatches(String input, String regex) {
List<String> retval = new ArrayList<String>();
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
while (m.find()) {
retval.add(m.group());
}
return retval;
}
which then can be used like this:
public static void main(String[] args) {
List<String> matches = getMatches("this matches _all words that _start _with an _underscore", "_[a-z]*");
for (String s : matches) { // List implements the 'iterable' interface
System.out.println(s);
}
}
which produces this:
_all
_start
_with
_underscore
More information about the Matcher class can be found here: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html
Here is another working example. It might be helpful :
public class RegxIterator<E> implements RegexpIterator {
private Iterator<E> itr = null;
public RegxIterator(Iterator<E> itr, String regex) {
ArrayList<E> list = new ArrayList<E>();
while (itr.hasNext()) {
E e = itr.next();
if (Pattern.matches(regex, e.toString()))
list.add(e);
}
this.itr = list.iterator();
}
@Override
public boolean hasNext() {
return this.itr.hasNext();
}
@Override
public String next() {
return this.itr.next().toString();
}
}
If you want to use it for other data types (Integer, Float, etc., or other classes where toString() is meaningful), declare next() to return Object instead of String. Then you may be able to cast the return value back to the actual type.
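For the enumeration the question describes (all strings accepted by matches(), with no input string), a brute-force sketch is possible if you accept two extra assumptions that are not in the question: a finite alphabet and a maximum length. It generates candidates shortest first and keeps those the regexp accepts. BruteForceRegexEnumerator is a made-up name, and the approach is exponential in the length, so it is only practical for tiny cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical brute-force enumerator: tries every string over the alphabet
// up to maxLength and keeps the ones the regex fully matches.
final class BruteForceRegexEnumerator {

    static List<String> enumerate(String regex, String alphabet, int maxLength) {
        Pattern p = Pattern.compile(regex);
        List<String> accepted = new ArrayList<>();
        List<String> current = new ArrayList<>();
        current.add(""); // the only string of length 0
        for (int len = 0; len <= maxLength; len++) {
            List<String> next = new ArrayList<>();
            for (String candidate : current) {
                if (p.matcher(candidate).matches()) {
                    accepted.add(candidate);
                }
                if (len < maxLength) {
                    // Extend each candidate by one character for the next round.
                    for (int i = 0; i < alphabet.length(); i++) {
                        next.add(candidate + alphabet.charAt(i));
                    }
                }
            }
            current = next;
        }
        return accepted;
    }
}
```

Calling enumerate("ab?cd?e", "abcde", 5).iterator() yields ace, abce, acde, abcde in that order. Supporting full regexp syntax and infinite languages efficiently would need a different technique, such as walking an automaton built from the expression, which java.util.regex does not expose.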

java regex multiple patterns sequential matching

I have a specific question, to which I couldn't find any answer online. Basically, I would like to run a pattern-matching operation on a text, with multiple patterns. However, I do not wish that the matcher gets me the result all at once, but instead that each pattern is called at different stages of the loop, at the same time that specific operations are performed on each of these stages. So for instance, imagining I have Pattern1, Pattern2, and Pattern3, I would like something like:
if (Pattern 1 = true) {
delete Pattern1;
} else if (Pattern 2 = true) {
delete Pattern2;
} else if (Pattern 3 = true) {
replace with 'something';
} .....and so on
(This is just an illustration of the loop, so the syntax is probably not correct.)
My question is then: how can I compile different patterns, while calling them separately?
(I've only seen multiple patterns compiled together and searched together with the help of AND/OR and so on..that's not what I'm looking for unfortunately) Could I save the patterns in an array and call each of them on my loop?
Prepare your Pattern objects pattern1, pattern2, pattern3 and store them in any container (array or list). Then loop over this container, calling the Matcher object's usePattern(Pattern newPattern) method at each iteration.
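A rough sketch of that loop. SequentialMatcher and firstMatches are invented names, and recording the first match (or null) stands in for whatever operation each stage should perform. Since usePattern keeps the matcher's current position in the input, reset() is called so that every pattern searches from the beginning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one Matcher, switched between patterns in a loop.
final class SequentialMatcher {

    // Returns, for each pattern in order, its first match in text or null.
    static List<String> firstMatches(String text, List<Pattern> patterns) {
        List<String> found = new ArrayList<>();
        if (patterns.isEmpty()) return found;
        Matcher m = patterns.get(0).matcher(text);
        for (Pattern p : patterns) {
            m.usePattern(p); // switch pattern, keep the same input
            m.reset();       // search from the start again
            found.add(m.find() ? m.group() : null);
        }
        return found;
    }
}
```

Inside the loop you could branch on the index or on the pattern itself to delete, replace, or skip, which is roughly the if/else chain from the question.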
You can make a common interface, and make anonymous implementations that use patterns or whatever else you may want to transform your strings:
interface StringProcessor {
String process(String source);
}
StringProcessor[] processors = new StringProcessor[] {
new StringProcessor() {
private final Pattern p = Pattern.compile("[0-9]+");
public String process(String source) {
String res = source;
if (p.matcher(source).find()) {
res = ... // delete
}
return res;
}
}
, new StringProcessor() {
private final Pattern p = Pattern.compile("[a-z]+");
public String process(String source) {
String res = source;
if (p.matcher(source).find()) {
res = ... // replace
}
return res;
}
}
, new StringProcessor() {
private final Pattern p = Pattern.compile("[%^##]{2,5}");
public String process(String source) {
String res = source;
if (p.matcher(source).find()) {
res = ... // do whatever else
}
return res;
}
}
};
String res = "My starting string 123 and more 456";
for (StringProcessor p : processors) {
res = p.process(res);
}
Note that implementations of StringProcessor.process do not need to use regular expressions at all. The loop at the bottom has no idea the regexp is involved in obtaining the results.
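On Java 8 or later the same idea can be sketched more compactly with lambdas. The steps below (delete digits, collapse whitespace, trim) are placeholders chosen for illustration, not the operations from the question, and SequentialRules is a made-up name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical pipeline: each step is a String -> String function.
final class SequentialRules {
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");
    private static final Pattern SPACES = Pattern.compile("\\s{2,}");

    static final List<UnaryOperator<String>> STEPS = Arrays.asList(
            s -> DIGITS.matcher(s).replaceAll(""),   // stage 1: delete digits
            s -> SPACES.matcher(s).replaceAll(" "),  // stage 2: collapse runs of whitespace
            String::trim                             // stage 3: no regex involved at all
    );

    static String apply(String input) {
        String res = input;
        for (UnaryOperator<String> step : STEPS) {
            res = step.apply(res);
        }
        return res;
    }
}
```

SequentialRules.apply("My starting string 123 and more 456") returns "My starting string and more".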

avoid code duplication

Consider the following code:
if (matcher1.find()) {
String str = line.substring(matcher1.start()+7,matcher1.end()-1);
/*+7 and -1 indicate the prefix and suffix of the matcher... */
method1(str);
}
if (matcher2.find()) {
String str = line.substring(matcher2.start()+8,matcher2.end()-1);
method2(str);
}
...
I have n matchers, and all of them are independent (if one matches, it says nothing about the others). For each matcher that matches, I invoke a different method on the content it matched.
Question: I do not like the code duplication or the "magic numbers" here, and I'm wondering whether there is a better way to do it (maybe the Visitor pattern?). Any suggestions?
Create an abstract class and define the offsets in subclasses (along with the string processing, depending on your requirements).
Then populate a list with them and process the list.
Here is a sample abstract processor:
public abstract class AbstractProcessor {
public void find(Pattern pattern, String line) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
process(line.substring(matcher.start() + getStartOffset(), matcher.end() - getEndOffset()));
}
}
protected abstract int getStartOffset();
protected abstract int getEndOffset();
protected abstract void process(String str);
}
Simply mark the part of the regex that you want to pass to the method with a capturing group.
For example if your regex is foo.*bar and you are not interested in foo or bar, make the regex foo(.*)bar. Then always grab the group 1 from the Matcher.
Your code would then look like this:
method1(matcher1.group(1));
method2(matcher2.group(1));
...
One further step would be to replace your methods with classes implementing an interface like this:
public interface MatchingMethod {
String getRegex();
void apply(String result);
}
Then you can easily automate the task:
for (MatchingMethod mm : getAllMatchingMethods()) {
Pattern p = Pattern.compile(mm.getRegex());
Matcher m = p.matcher(input);
while (m.find()) {
mm.apply(m.group(1));
}
}
Note that if performance is important, then pre-compiling the Pattern can improve runtime if you apply this to many inputs.
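A sketch of what pre-compiling could look like, with the pattern compiled once when a rule is registered and the action passed as a lambda. MatchDispatcher, Rule and on are hypothetical names, and every regex is assumed to contain a capturing group 1 (otherwise group(1) throws).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical dispatcher: pairs a precompiled Pattern with an action.
final class MatchDispatcher {

    private static final class Rule {
        final Pattern pattern;
        final Consumer<String> action;

        Rule(Pattern pattern, Consumer<String> action) {
            this.pattern = pattern;
            this.action = action;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Compiles the regex once; it must define capturing group 1.
    MatchDispatcher on(String regex, Consumer<String> action) {
        rules.add(new Rule(Pattern.compile(regex), action));
        return this;
    }

    // Runs every rule against the line and hands group 1 to its action.
    void process(String line) {
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(line);
            while (m.find()) {
                rule.action.accept(m.group(1));
            }
        }
    }
}
```

Usage might look like new MatchDispatcher().on("foo(.*)bar", this::method1).process(line), which removes both the magic offsets and the repeated if blocks.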
You could make it a little bit shorter, but the question is whether this is really worth the effort:
private String getStringFromMatcher(Matcher matcher, int magicNumber) {
return line.substring(matcher.start() + magicNumber, matcher.end() - 1);
}
if (matcher1.find()) {
method1(getStringFromMatcher(matcher1, 7));
}
if (matcher2.find()) {
method2(getStringFromMatcher(matcher2, 8));
}
Use Cochard's solution combined with a factory (switch statement) covering all the methodX methods, so you can call it like this:
Factory.CallMethodX(myEnum.MethodX, str)
You can assign myEnum.MethodX in the population step of Cochard's solution.
