Regular expression Java Merge Pattern - java

I have these three regular expressions. They work individually, but I would like to merge them into a single pattern.
regex1 = [0-9]{16}
regex2 = [0-9]{4}[-][0-9]{4}[-][0-9]{4}[-][0-9]{4}
regex3 = [0-9]{4}[ ][0-9]{4}[ ][0-9]{4}[ ][0-9]{4}
I use this method:
Pattern.compile(regex);
Which is the regex string to merge them?

You can use backreferences:
[0-9]{4}([ -]|)([0-9]{4}\1){2}[0-9]{4}
This will only match if the separators are all the same: either all spaces, all hyphens, or all empty (no separator).
\1 means "this matches exactly what the first capturing group – expression in parentheses – matched".
Since ([ -]|) is that group, both other separators need to be the same for the pattern to match.
You can simplify it further to:
\d{4}([ -]|)(\d{4}\1){2}\d{4}
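To see the backreference at work, here is a minimal sketch in Java. The class and method names (CardNumberBackreference, isValid) are made up for illustration; only the pattern comes from the answer above.

```java
import java.util.regex.Pattern;

class CardNumberBackreference {
    // Group 1 captures the first separator (space, hyphen, or empty);
    // \1 then forces the remaining two separators to be identical to it.
    static final Pattern CARD =
            Pattern.compile("\\d{4}([ -]|)(\\d{4}\\1){2}\\d{4}");

    // Hypothetical helper: true only if the whole string matches.
    static boolean isValid(String s) {
        return CARD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1234567812345678"));    // true
        System.out.println(isValid("1234-5678-1234-5678")); // true
        System.out.println(isValid("1234 5678 1234 5678")); // true
        System.out.println(isValid("1234-5678 12345678"));  // false (mixed separators)
    }
}
```

Note that matches() tests the entire input, so no ^ or $ anchors are needed here.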

The following should match anything the three patterns match:
regex = [0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}
That is, I'm assuming you are happy with either a hyphen, a space or nothing between the numbers?
Note: this will also match situations where you have any combination of the three, e.g.
0000-0000 00000000
which may not be desired?
Alternatively, if you need to match any of the three individual patterns then just concatenate them with |, as follows:
([0-9]{16})|([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})|([0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4})
(Your original example appears to have unnecessary square brackets around the space and hyphen)
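The difference between the two approaches can be checked with a small sketch like the following. The class and method names are hypothetical; the two patterns are the ones given in this answer.

```java
import java.util.regex.Pattern;

class CardNumberAlternatives {
    // Each separator is independently optional, so mixed forms are accepted.
    static final Pattern OPTIONAL_SEPARATORS =
            Pattern.compile("[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}");

    // Exactly one of the three original formats must match as a whole.
    static final Pattern ALTERNATION = Pattern.compile(
            "([0-9]{16})"
            + "|([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})"
            + "|([0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4})");

    static boolean lenient(String s) {
        return OPTIONAL_SEPARATORS.matcher(s).matches();
    }

    static boolean strict(String s) {
        return ALTERNATION.matcher(s).matches();
    }

    public static void main(String[] args) {
        String mixed = "0000-0000 00000000";
        System.out.println(lenient(mixed)); // true
        System.out.println(strict(mixed));  // false
        System.out.println(strict("0000 0000 0000 0000")); // true
    }
}
```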

Related

Regex: match everything the other regex left

I am struggling with the following issue: say there's a regex 1 and there's regex 2 which should match everything the regex 1 does not.
Let's have the regex 1:
/\$\d+/ (i.e. a dollar sign followed by one or more digits).
Having a string like foo$12___bar___$34wilma buzz it detects $12 and $34.
How should regex 2 look in order to match the remaining parts of the aforementioned string, i.e. foo, ___bar___ and wilma buzz? In other words, it should pick up all the "remaining" chunks of the source string.
You may use String#split to split on given regex and get remaining substrings in an array:
String[] arr = str.split( "\\$\\d+" );
//=> ["foo", "___bar___", "wilma buzz"]
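A runnable version of the split approach might look like this (the class and method names are only for illustration). One thing to keep in mind: if the input starts with a match, split returns an empty leading element, while trailing empty strings are dropped.

```java
import java.util.Arrays;

class SplitOnPattern {
    // Hypothetical helper: everything that "\$\d+" does not match.
    static String[] remainder(String input) {
        return input.split("\\$\\d+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(remainder("foo$12___bar___$34wilma buzz")));
        // [foo, ___bar___, wilma buzz]

        System.out.println(Arrays.toString(remainder("$12foo")));
        // [, foo]  (empty leading element because the input starts with a match)
    }
}
```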
It was tricky to get this working, but this regex will match everything besides \$\d+ for you. EDIT: no longer erroneously matches $44$444 or similar.
(?!\$\d+)(.+?)\$\d+|\$\d+|(?!\$\d+)(.+)
Breakdown
(?!\$\d+)(.+?)\$\d+
(?! ) negative lookahead: assert the following string does not match
\$\d+ your pattern - can be replaced with another pattern
(.+?) match at least one symbol, as few as possible
\$\d+ non-capturing match your pattern
OR
\$\d+ non-capturing group: matches one instance of your pattern
OR
(?!\$\d+)(.+)
(?!\$\d+) negative lookahead to not match your pattern
(.+) match at least one symbol, as many as possible
GENERIC FORM
(?!<pattern>)(.+?)<pattern>|<pattern>|(?!<pattern>)(.+)
By replacing <pattern>, you can match anything that doesn't match your pattern.
Good luck!
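For what it's worth, here is a rough sketch of how the pattern above could be used from Java. The "leftover" text ends up in capture group 1 or group 2 depending on which alternative matched, so both have to be checked. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnmatchedChunks {
    static final Pattern P = Pattern.compile(
            "(?!\\$\\d+)(.+?)\\$\\d+|\\$\\d+|(?!\\$\\d+)(.+)");

    // Hypothetical helper: collects the text that lies between $<digits> matches.
    static List<String> chunks(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                out.add(m.group(1));      // text before a $<digits> match
            } else if (m.group(2) != null) {
                out.add(m.group(2));      // trailing text after the last match
            }
            // the middle alternative matches $<digits> alone: nothing to collect
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(chunks("foo$12___bar___$34wilma buzz"));
        // [foo, ___bar___, wilma buzz]
    }
}
```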
Try this one
[a-zA-Z_]+
Or even better
[^\$\d]+ -> A ^ at the start of a character class negates it, like ! (not) in Java. Note that this class excludes every $ and every digit, not only a $ followed by digits.

Java regular expressions for specific name\value format

I'm not familiar yet with java regular expressions. I want to validate a string that has the following format:
String INPUT = "[name1 value1];[name2 value2];[name3 value3];";
namei and valuei are strings that may contain any characters except whitespace.
I tried with this expression:
String REGEX = "([\\S*\\s\\S*];)*";
But when I call matches(), I always get false, even for a valid string.
What's the best regular expression for this?
This does the trick:
(?:\[\w.*?\s\w.*?\];)*
If you want to only match three of these, replace the * at the end with {3}.
Explanation:
(?:: Start of non-capturing group
\[: Escapes the [ sign, which is a metacharacter in regex, so that it matches a literal [.
\w.*?: Matches one word character ([a-zA-Z0-9_]) followed lazily by any characters. Lazy matching means it matches as few characters as possible, so here it stops as soon as the following \s can match.
\s: Matches one whitespace
\]: See \[
;: Matches one semicolon
): End of non-capturing group
*: Matches any number of what is contained in the preceding non-capturing group.
You should escape the square brackets. Also, if your aim is to match exactly three, replace * with {3}. As a Java string literal:
"(\\[\\S*\\s\\S*\\];){3}"
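Here is a small sketch for trying this out with matches(). LENIENT is the (?:\[\w.*?\s\w.*?\];)* pattern from the first answer, written as a Java string. STRICT is my own stricter variant and not something from the answers: it assumes names and values may contain neither whitespace nor ], and it requires at least one pair. All class and field names are made up.

```java
import java.util.regex.Pattern;

class NameValueValidator {
    // Pattern from the first answer; note that * also accepts the empty string.
    static final Pattern LENIENT =
            Pattern.compile("(?:\\[\\w.*?\\s\\w.*?\\];)*");

    // Assumed stricter variant: exactly one whitespace character per pair,
    // no whitespace or ']' inside name or value, at least one pair.
    static final Pattern STRICT =
            Pattern.compile("(?:\\[[^\\s\\]]+\\s[^\\s\\]]+\\];)+");

    public static void main(String[] args) {
        String input = "[name1 value1];[name2 value2];[name3 value3];";
        System.out.println(LENIENT.matcher(input).matches());      // true
        System.out.println(STRICT.matcher(input).matches());       // true
        System.out.println(LENIENT.matcher("[a b c];").matches()); // true (".*?" may cross spaces)
        System.out.println(STRICT.matcher("[a b c];").matches());  // false
        System.out.println(LENIENT.matcher("").matches());         // true
        System.out.println(STRICT.matcher("").matches());          // false
    }
}
```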

Why does this pattern not match? ([\\\\A\\\\W]its[\\\\W\\\\z])

I'm trying to do a replace with this pattern, so I need to match this:
String pattern = "[\\\\A\\\\W]its[\\\\W\\\\z]";
The way I'm interpreting my pattern is: either a beginning of the string OR a non word character like a space or comma, then an "its", then a non word character OR the end of the string.
Why doesn't it match on this "its" inside this string?
its about time
The idea is to detect incorrectly written words like "its" and fix them to "it's".
Also, why do I need so many escape characters for the pattern to be accepted by the VM at all?
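\\A and \\z are boundary matchers. They cannot go inside character classes. With four backslashes in the Java string, the regex itself contains two, i.e. an escaped literal backslash, so your first class matches a backslash, the letter A, or the letter W; that is why the pattern compiles but does not match. If you write them properly, i.e. with two backslashes instead of four, the regex compiler throws an exception, because \A and \z cannot go inside [] blocks.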
Use straight | syntax with non-capturing groups instead:
String pattern = "(?:\\A|\\W)its(?:\\W|\\z)";
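A quick sketch to check the corrected pattern. Note that (?:\A|\W) and (?:\W|\z) consume the neighbouring characters, so for the actual replacement this example swaps in word boundaries (\b) instead; that substitution is my own assumption about what the replace should do, not part of the answer above. Class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ItsFinder {
    // Pattern from the answer: start-of-input or non-word char, "its",
    // then non-word char or end-of-input.
    static final Pattern BOUNDED =
            Pattern.compile("(?:\\A|\\W)its(?:\\W|\\z)");

    static boolean containsIts(String s) {
        return BOUNDED.matcher(s).find();
    }

    // Assumed replacement helper: \b is zero-width, so only "its" is replaced
    // and the surrounding spaces or punctuation are left untouched.
    static String fix(String s) {
        return s.replaceAll("\\bits\\b", "it's");
    }

    public static void main(String[] args) {
        System.out.println(containsIts("its about time"));  // true
        System.out.println(containsIts("bits and pieces")); // false

        Matcher m = BOUNDED.matcher("its about time");
        if (m.find()) {
            // The match includes the trailing space consumed by \W.
            System.out.println("[" + m.group() + "]"); // [its ]
        }
        System.out.println(fix("its about time")); // it's about time
    }
}
```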

Very slow Regular Expression in Java

Using Java, I want to detect whether a line starts with words and separators followed by "myword", but this regex takes too long. What is wrong with it?
^\s*(\w+(\s|/|&|-)*)*myword
The pattern ^\s*(\w+(\s|/|&|-)*)*myword is not efficient due to the nested quantifier. \w+ requires at least one word character and (\s|/|&|-)* can match zero or more of some characters. When the * is applied to the group and the input string has no separators in between word characters, the expression becomes similar to a (\w+)* pattern, which is the classic cause of catastrophic backtracking.
Just a small illustration of \w+ and (\w+)* performance: [illustrations for \w+ and (\w+)* not preserved]
Your pattern is even more complicated and involves even more backtracking steps. To avoid such issues, a pattern should not have optional subpatterns inside quantified groups. That is, create a group with obligatory subpatterns and apply the necessary quantifier to the group.
In this case, you can unroll the group you have as
String rx = "^\\s*(\\w+(?:[\\s/&-]+\\w+)*)[\\s/&-]+myword";
Here, (\w+(\s|/|&|-)*)* is unrolled as (\w+(?:[\s/&-]+\w+)*). I kept the outer parentheses to produce capture group #1; you may remove them if you do not need it. \w+ matches one or more word characters (so it is an obligatory subpattern), and the (?:[\s/&-]+\w+)* subpattern matches zero or more sequences (because of *, this whole group is optional) of one or more characters from the character class [\s/&-] (obligatory because of +) followed by one or more word characters \w+.
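As a rough check, the unrolled pattern can be exercised like this. The class and method names are invented for the example. The last call feeds it a long run of word characters with no separators, the kind of input on which nested quantifiers blow up; the unrolled pattern fails on it quickly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledPrefix {
    static final Pattern RX = Pattern.compile(
            "^\\s*(\\w+(?:[\\s/&-]+\\w+)*)[\\s/&-]+myword");

    // Hypothetical helper: returns the words before "myword", or null if no match.
    static String wordsBefore(String line) {
        Matcher m = RX.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(wordsBefore("foo bar/baz - myword rest")); // foo bar/baz
        System.out.println(wordsBefore("myword"));                    // null

        // A long line of word characters without separators or "myword".
        StringBuilder longWord = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            longWord.append('a');
        }
        System.out.println(wordsBefore(longWord.toString()));         // null
    }
}
```

Unlike the original pattern, the unrolled one requires at least one word and one separator before myword, so a line consisting of just "myword" no longer matches.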

Returning java regex (words, spaces, special characters, double quotes)

I am trying to use java regex to tokenize any language source file. What I want the list to return is:
words ([a-z_A-Z0-9])
spaces
any of [()*.,+-/=&:] as a single character
and quoted items left in quotes.
Here is the code I have so far:
Pattern pattern = Pattern.compile("[\"(\\w)\"]+|[\\s\\(\\)\\*\\+\\.,-/=&:]");
Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<String>();
while(matcher.find()) {
matchlist.add(matcher.group(0));
}
For example,
"I" am_the 2nd "best".
returns: list, size 8
("I", ,am_the, ,2nd, ,"best", .)
which is what I want. However, if the whole sentence is quoted, except for the period:
"I am_the 2nd best".
returns: list, size 8
("I, ,am_the, ,2nd, ,best", .)
and I want it to be able to return: list, size 2
("I am_the 2nd best", .)
If that makes sense. I believe it works for everything I want it to except for returning string literals (which I want to keep the quotes). What is it that I am missing from the pattern that will allow me to achieve this?
And by all means, if there is an easier pattern that I am not seeing, please help me out. The pattern shown above is the result of much trial and error. Thank you very much in advance for any help.
First, you'll need to separate the word-matching code from the string-literal-matching code. For word matching, use:
\w+
Next there's whitespace.
\s+
To match strings as one token, you need to allow more characters than just \w. That only allows alphanumeric characters and _, which means whitespace and symbols are not allowed. You also need to move the starting and ending quotes outside of the square brackets.
And don't forget backslashes to escape characters. You want to allow \" inside of strings.
"(\\.|[^"])+"
Finally, there are the symbols. You could list all the symbols, or you could just treat any non-word, non-whitespace, non-quote character as a symbol. I recommend the latter so you don't choke on other symbols like # or |. So for symbols:
[^\s\w"]
Putting the pieces together, we get this combined regex:
\w+|\s+|"(\\.|[^"])+"|[^\s\w"]
Or, escaping everything properly so it can be put into source code:
Pattern pattern = Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");
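Plugged into the loop from the question, the combined pattern can be tried like this (the class and method names are only for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Tokenizer {
    // word | whitespace run | quoted string (with \" escapes) | single symbol
    static final Pattern TOKEN = Pattern.compile(
            "\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(source);
        while (matcher.find()) {
            tokens.add(matcher.group(0));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "I" am_the 2nd "best".   -> 8 tokens
        System.out.println(tokenize("\"I\" am_the 2nd \"best\".").size());
        // "I am_the 2nd best".     -> 2 tokens: the quoted string and the period
        System.out.println(tokenize("\"I am_the 2nd best\"."));
    }
}
```

One limitation of the pattern as written: because the string part uses +, an empty literal "" is not matched as a token.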
Typically, when parsing text, the process you're describing is called "lexical analysis" and the function used is called a 'lexer' which is used to break up an input stream into identifiable tokens like words, numbers, spaces, periods, etc.
The output of a lexer is consumed by a 'parser' which does "syntactic analysis" by identifying groups of tokens which belong together, like [double-quote] [word] [double-quote].
I would recommend you follow the same two-pass strategy, since it's been proven time and again in many, many parsers.
So, your first step might be to use this regular expression as your lexer:
\W|\w+
which will break your input text into either single non-word characters (like spaces, double and single quotation marks, commas, periods, etc.) or sequences of one or more word characters where \w is really just a shortcut for [a-zA-Z_0-9].
So, using your example above:
String str = "\"I\" am_the 2nd \"best\".";
String p = "\\W|\\w+";
Pattern pattern = Pattern.compile(p);
Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<String>();
while(matcher.find()) {
matchlist.add(matcher.group(0));
}
produces:
['"', 'I', '"', ' ', 'am_the', ' ', '2nd', ' ', '"', 'best', '"', '.']
which you can then decide how to treat in your code.
No, this doesn't give you a single one-size-fits-all regular expression that matches both of the cases you list above. In my experience, though, regular expressions aren't the best tool for the kind of syntactic analysis you require: either they lack the expressiveness needed to cover all possible cases or, far more likely, they quickly become too complex for all but the true regex maven to comprehend.
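To make the two-pass idea concrete, here is a minimal sketch: the first pass is the \W|\w+ lexer, the second pass glues together everything between a pair of double-quote tokens. The class and method names are hypothetical, and the grouping step is a deliberate simplification that assumes quotes are never escaped inside a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassTokenizer {
    static final Pattern LEXER = Pattern.compile("\\W|\\w+");

    // Pass 1: single non-word characters or runs of word characters.
    static List<String> lex(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LEXER.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Pass 2: merge the tokens between two '"' tokens into one string token.
    static List<String> groupQuoted(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder quoted = null; // non-null while inside a quoted string
        for (String t : tokens) {
            if (t.equals("\"")) {
                if (quoted == null) {
                    quoted = new StringBuilder("\"");          // opening quote
                } else {
                    out.add(quoted.append('"').toString());    // closing quote
                    quoted = null;
                }
            } else if (quoted != null) {
                quoted.append(t);
            } else {
                out.add(t);
            }
        }
        if (quoted != null) {
            out.add(quoted.toString()); // unterminated string: keep what we have
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(groupQuoted(lex("\"I am_the 2nd best\".")));
        // ["I am_the 2nd best", .]
    }
}
```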
