Extracting both matching and not matching regex - java

I have a String like this one: abc3a de'f gHi?jk. I want to split it into the substrings abc3a, de'f, gHi, ? and jk. In other words, I want to return the Strings that match the regular expression [a-zA-Z0-9'] as well as the Strings that do not match it. If there is a way to tell whether each resulting substring is a match or not, that would be a plus.
Thanks!

import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class HelloWorld{
public static void main(String []args){
Pattern pattern = Pattern.compile("([a-zA-Z0-9']*)?([^a-zA-Z0-9']*)?");
String str = "abc3a de'f gHi?jk";
Matcher matcher = pattern.matcher(str);
while(matcher.find()){
if(matcher.group(1).length() > 0)
System.out.println("Match:" + matcher.group(1));
if(matcher.group(2).length() > 0)
System.out.println("Miss: `" + matcher.group(2) + "`");
}
}
}
Output:
Match:abc3a
Miss: ` `
Match:de'f
Miss: ` `
Match:gHi
Miss: `?`
Match:jk
If you don't want the whitespace reported, exclude it from the second group:
Pattern pattern = Pattern.compile("([a-zA-Z0-9']*)?([^a-zA-Z0-9'\\s]*)?");
Output:
Match:abc3a
Match:de'f
Match:gHi
Miss: `?`
Match:jk

You can use this regex:
"[a-zA-Z0-9']+|[^a-zA-Z0-9' ]+"
Will give:
["abc3a", "de'f", "gHi", "?", "jk"]
Online Demo: http://regex101.com/r/xS0qG4
Java code:
Pattern p = Pattern.compile("[a-zA-Z0-9']+|[^a-zA-Z0-9' ]+");
Matcher m = p.matcher("abc3a de'f gHi?jk");
while (m.find())
System.out.println(m.group());
OUTPUT
abc3a
de'f
gHi
?
jk
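If you also need to know whether each piece matched the character class, one option is to give each alternative its own capture group and check which group participated in the match. This is a minimal sketch, not part of the answer above: the class name TaggedTokens and the helper tokenize are illustrative names, and whitespace is excluded with \s rather than a literal space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaggedTokens {
    // Group 1: a run of allowed characters.
    // Group 2: a run of anything else, except whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("([a-zA-Z0-9']+)|([^a-zA-Z0-9'\\s]+)");

    // Hypothetical helper: prefixes each token with "Match:" or "Miss:".
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups is non-null for every match.
            if (m.group(1) != null) {
                out.add("Match:" + m.group(1));
            } else {
                out.add("Miss:" + m.group(2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Match:abc3a, Match:de'f, Match:gHi, Miss:?, Match:jk]
        System.out.println(tokenize("abc3a de'f gHi?jk"));
    }
}
```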

myString.split("\\s+|(?<=[a-zA-Z0-9'])(?=[^a-zA-Z0-9'\\s])|(?<=[^a-zA-Z0-9'\\s])(?=[a-zA-Z0-9'])")
splits at all the boundaries between runs of characters in that charset.
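Each lookbehind (?<=...) asserts that the preceding character belongs to one kind of run, and the lookahead (?=...) that follows it asserts that the next character belongs to the other kind, so each pair matches the zero-width boundary between a run of characters in the set and a run of characters outside it.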
The \\s+ is not a boundary match, and matches a run of whitespace characters. This has the effect of removing white-space from the result entirely.
The | alternation makes the split happen at either kind of boundary or at a run of white-space.
Since the lookbehind and lookahead are both positive, the boundaries will not match at the start or end of the string, so there's no need to ignore empty strings in the output unless there is white-space there.
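A small sketch of that split in action, assuming the illustrative names SplitDemo and splitTokens (they are not from the answer). The second call shows the leading-whitespace case mentioned above, where an empty first element does appear.

```java
import java.util.Arrays;

public class SplitDemo {
    // The same expression as above: whitespace runs, or a zero-width
    // boundary between the two kinds of character runs.
    static final String BOUNDARY =
            "\\s+"
            + "|(?<=[a-zA-Z0-9'])(?=[^a-zA-Z0-9'\\s])"
            + "|(?<=[^a-zA-Z0-9'\\s])(?=[a-zA-Z0-9'])";

    // Hypothetical helper wrapping String.split.
    static String[] splitTokens(String s) {
        return s.split(BOUNDARY);
    }

    public static void main(String[] args) {
        // prints [abc3a, de'f, gHi, ?, jk]
        System.out.println(Arrays.toString(splitTokens("abc3a de'f gHi?jk")));
        // Leading white-space yields an empty first element: prints [, abc]
        System.out.println(Arrays.toString(splitTokens(" abc")));
    }
}
```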

You can use lookarounds (zero-width assertions) to split:
private static String[] splitString(final String s) {
final String [] arr = s.split("(?=[^a-zA-Z0-9'])|(?<=[^a-zA-Z0-9'])");
final ArrayList<String> strings = new ArrayList<String>(arr.length);
for (final String str : arr) {
if(!"".equals(str.trim())) {
strings.add(str);
}
}
return strings.toArray(new String[strings.size()]);
}
(?=xxx) means that xxx follows this position, and (?<=xxx) means that xxx precedes this position.
As you did not want to include all-whitespace matches in the result, you need to filter the array returned by split.

Related

How to use Pattern, Matcher in Java regex API to remove a specific line

I have a complicated string to split: I need to remove the comments and spaces, keep all the numbers, and break every other string into single characters. If the - sign is at the start and is followed by a number, treat it as a negative number rather than an operator.
A comment has the form ?<space>comments<space>? (where comments is a placeholder).
Input :
-122+2 ? comments ?sa b
-122+2 ? blabla ?sa b
output :
["-122","+","2","?","s","a","b"]
(all strings split into characters, with no spaces and no comments)
Replace the unwanted string \s*\?\s*\w+\s*(?=\?) with "". You can chain String#replaceAll to remove any remaining whitespace. Note that ?= means positive lookahead and here it means \s*\?\s*\w+\s* followed by a ?. I hope you already know that \s specifies whitespace and \w specifies a word character.
Then you can use the regex ^-\d+|\d+|\D, which means either a negative integer at the beginning (i.e. ^-\d+), or a run of digits (i.e. \d+), or a single non-digit (\D).
Demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String str = "-122+2 ? comments ?sa b";
str = str.replaceAll("\\s*\\?\\s*\\w+\\s*(?=\\?)", "").replaceAll("\\s+", "");
Pattern pattern = Pattern.compile("^-\\d+|\\d+|\\D");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
Output:
-122
+
2
?
s
a
b

Java Regex to extract substring with optional trailing slash

Regex:
\/test\/(.*|\/?)
Input
/something/test/{abc}/listed
/something/test/{abc}
Expected
{abc} for both the inputs
You need to capture all characters other than / after /test/:
String s = "/something/test/{abc}/listed";
Pattern pattern = Pattern.compile("/test/([^/]+)"); // or "/test/\\{([^/}]+)"
Matcher matcher = pattern.matcher(s);
if (matcher.find()){
System.out.println(matcher.group(1));
}
See the online demo
Details:
/test/ - matches /test/
([^/]+) - matches and captures into Group 1 one or more (+) (but as many as possible, since + is greedy) characters other than / (due to the negated character class [^/]).
Note that in Java regex patterns you do not need to escape / since it is not a special character and one needs no regex delimiters.
This should work for you :
public static void main(String[] args) {
String s1 = "/something/test/{abc}/listed";
String s2 = "/something/test/{abc}";
System.out.println(s1.replaceAll("[^{]+(\\{\\w+\\}).*", "$1"));
System.out.println(s2.replaceAll("[^{]+(\\{\\w+\\}).*", "$1"));
}
O/P :
{abc}
{abc}
Regex (as Java string, that is with doubled backslashes):
".*\\/test\\/([^/]*).*"

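A minimal sketch of how the last expression above could be applied with Matcher.matches(), which requires the pattern to cover the whole input. The class name TestSegment and the helper extract are illustrative names, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestSegment {
    // Group 1 captures everything after "/test/" up to the next "/" (or the end).
    private static final Pattern SEGMENT = Pattern.compile(".*\\/test\\/([^/]*).*");

    // Hypothetical helper: returns the captured segment, or null if "/test/" is absent.
    static String extract(String path) {
        Matcher m = SEGMENT.matcher(path);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("/something/test/{abc}/listed")); // prints {abc}
        System.out.println(extract("/something/test/{abc}"));        // prints {abc}
    }
}
```
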
Splitting and Parsing formula String

I have below formula
(Trig01:BAO)/(((Trig01:COUNT*86400)-Trig01:UPI-Trig01:SOS)*2000)
I want to split it and get as output only the string values which are before the colon.
The final output needs to be:
{ "BAO","COUNT","UPI","SOS" }
Thanks in advance,
You can try a positive lookbehind, as in the regex pattern below, to get all the alphanumeric characters after each colon:
(?<=:)[^\W]+
Online demo
Pattern explanation:
(?<=      look behind to see if there is:
  :       ':'
)         end of look-behind
[^\W]+    any character except: non-word characters (all but a-z, A-Z, 0-9, _) (1 or more times)
Sample code:
String str="(Trig01:BAO)/(((Trig01:COUNT*86400)-Trig01:UPI-Trig01:SOS)*2000)";
Pattern p=Pattern.compile("(?<=:)[^\\W]+");
Matcher m=p.matcher(str);
while(m.find()){
System.out.println(m.group());
}
Use Regex, try this:
public static List<String> extractSubstringsFromAllMatches(String sourceString, String pattern) {
Pattern regexPattern = Pattern.compile(pattern);
Matcher matcher = regexPattern.matcher(sourceString);
List<String> matches = new ArrayList<String>();
while (matcher.find()) {
matches.add(matcher.group(1));
}
return matches;
}
Get the results you require by calling:
extractSubstringsFromAllMatches(YourString,":(\\w*)\\W")
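Put together as a runnable sketch (the wrapper class ExtractDemo is an illustrative name; the method body is the one shown above). Note that the trailing \W must consume a character, so a value sitting at the very end of the input would not be found by this pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractDemo {
    // Collects group 1 of every match of the given pattern.
    public static List<String> extractSubstringsFromAllMatches(String sourceString, String pattern) {
        Pattern regexPattern = Pattern.compile(pattern);
        Matcher matcher = regexPattern.matcher(sourceString);
        List<String> matches = new ArrayList<String>();
        while (matcher.find()) {
            matches.add(matcher.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        String formula = "(Trig01:BAO)/(((Trig01:COUNT*86400)-Trig01:UPI-Trig01:SOS)*2000)";
        // prints [BAO, COUNT, UPI, SOS]
        System.out.println(extractSubstringsFromAllMatches(formula, ":(\\w*)\\W"));
    }
}
```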
Try this one-line solution:
String[] arr = str.replaceAll("^.*?(?=\\w+:)|:[^:]*$", "").split(":.*?(?=\\w+(:|$))");
This works by first stripping off the leading and trailing non-target chars, then splitting on the intervening chars. Matching is done using lookaheads, which assert, but don't capture, that a word followed by a colon follows.
Here's some test code:
String str = "(Trig01:BAO)/(((Trig02:COUNT*86400)-Trig03:UPI-Trig04:SOS)*2000)";
String[] arr = str.replaceAll("^.*?(?=\\w+:)|:[^:]*$", "").split(":.*?(?=\\w+(:|$))");
System.out.println(Arrays.toString(arr));
Output:
[Trig01, Trig02, Trig03, Trig04]

Replace a comma that is not in parentheses using regex

I have this string:
john(man,24,engineer),smith(man,23),lucy(female)
How do I replace each comma that is not inside parentheses with #?
The result should be:
john(man,24,engineer)#smith(man,23)#lucy(female)
My code:
String str = "john(man,24,engineer),smith(man,23),lucy(female)";
Pattern p = Pattern.compile(".*?(?:\\(.*?\\)).+?");
Matcher m = p.matcher(str);
System.out.println(m.matches()+" "+m.find());
Why is m.matches() true and m.find() false? How can I achieve this?
Use a negative lookahead to achieve this:
,(?![^()]*\))
Explanation:
, # Match a literal ','
(?! # Start of negative lookahead
[^()]* # Match any character except '(' & ')', zero or more times
\) # Followed by a literal ')'
) # End of lookahead
Regex101 Demo
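In Java, the lookahead expression above can be passed straight to String.replaceAll. A minimal sketch, assuming balanced, non-nested parentheses; the class name CommaReplace and the helper replaceOuterCommas are illustrative names:

```java
public class CommaReplace {
    // A comma that is NOT followed by "non-parenthesis characters, then ')'",
    // i.e. a comma that is not inside a (non-nested) pair of parentheses.
    private static final String OUTER_COMMA = ",(?![^()]*\\))";

    // Hypothetical helper: replaces every comma outside parentheses with '#'.
    static String replaceOuterCommas(String s) {
        return s.replaceAll(OUTER_COMMA, "#");
    }

    public static void main(String[] args) {
        String str = "john(man,24,engineer),smith(man,23),lucy(female)";
        // prints john(man,24,engineer)#smith(man,23)#lucy(female)
        System.out.println(replaceOuterCommas(str));
    }
}
```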
A simple regex for another approach in case we encounter unbalanced parentheses, as in smiley:) or escape\)
While the lookahead approach works (and I too am a fan), it breaks down with input such as ,smiley:)(man,23), so I'll give you an alternative simple regex just in case. For the record, it's hard to find a simple approach that works all of the time because of potential nesting.
This situation is very similar to this question about "regex-matching a pattern unless...".
We can solve it with a beautifully-simple regex:
\([^()]*\)|(,)
Of course we can avoid more unpleasantness by allowing the parentheses matched on the left to roll over escaped parentheses:
\((?:\\[()]|[^()])*\)|(,)
The left side of the alternation | matches complete (parentheses). We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
This program shows how to use the regex (see the results at the bottom of the online demo):
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "john(man,24,engineer),smith(man,23),smiley:)(notaperson) ";
Pattern regex = Pattern.compile("\\([^()]*\\)|(,)");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "#");
else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0))); // quote so that '$' or '\' in the match is copied literally
}
m.appendTail(b);
String replaced = b.toString();
System.out.println(replaced);
} // end main
} // end Program
For more information about the technique
How to match (or replace) a pattern except in situations s1, s2, s3...

Java regex patterns

I need help with this matter. Look at the following regex:
Pattern pattern = Pattern.compile("[A-Za-z]+(\\-[A-Za-z]+)");
Matcher matcher = pattern.matcher(s1);
I want to look for words like "home-made" and "aaaa-bbb", but not "aaa - bbb" and not
"aaa--aa--aaa". Basically, I want the following:
word - hyphen - word.
It is working for everything, except this pattern will pass: "aaa--aaa--aaa" and shouldn't. What regex will work for this pattern?
You can remove the backslash from your expression:
"[A-Za-z]+-[A-Za-z]+"
The following code should work then
Pattern pattern = Pattern.compile("[A-Za-z]+-[A-Za-z]+");
Matcher matcher = pattern.matcher("aaa-bbb");
boolean match = matcher.matches();
Note that you can use Matcher.matches() instead of Matcher.find() in order to check the complete string for a match.
If instead you want to look inside a string using Matcher.find() you can use the expression
"(^|\\s)[A-Za-z]+-[A-Za-z]+(\\s|$)"
but note that then only words delimited by whitespace (or by the start or end of the string) will be found, so a word followed by punctuation, as in "aaa-bbb.", is missed. To capture this case as well, you can use lookbehinds and lookaheads:
"(?<![A-Za-z-])[A-Za-z]+-[A-Za-z]+(?![A-Za-z-])"
which will read
(?<![A-Za-z-]) // before the match there must not be a letter or a -
[A-Za-z]+ // the match itself starts with one or more letters
- // followed by a -
[A-Za-z]+ // followed by one or more letters
(?![A-Za-z-]) // but afterwards not by any letter or -
An example:
Pattern pattern = Pattern.compile("(?<![A-Za-z-])[A-Za-z]+-[A-Za-z]+(?![A-Za-z-])");
Matcher matcher = pattern.matcher("It is home-made.");
if (matcher.find()) {
System.out.println(matcher.group()); // => home-made
}
Actually, I can't reproduce the problem with your expression when the String holds a single word. As the discussion in the comments cleared up, though, the String s contains a whole sentence, which has to be tokenised into words first, with each word then matched on its own.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegExp {
private static void match(String s) {
Pattern pattern = Pattern.compile("[A-Za-z]+(\\-[A-Za-z]+)");
Matcher matcher = pattern.matcher(s);
if (matcher.matches()) {
System.out.println("'" + s + "' match");
} else {
System.out.println("'" + s + "' doesn't match");
}
}
/**
* @param args
*/
public static void main(String[] args) {
match(" -home-made");
match("home-made");
match("aaaa-bbb");
match("aaa - bbb");
match("aaa--aa--aaa");
match("home--home-home");
}
}
The output is:
' -home-made' doesn't match
'home-made' match
'aaaa-bbb' match
'aaa - bbb' doesn't match
'aaa--aa--aaa' doesn't match
'home--home-home' doesn't match
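For the whole-sentence case, one way to sketch the "tokenise first, then match each word" idea is to split on whitespace and call matches() on every token. The class name HyphenatedWords and the helper hyphenated are illustrative names, and splitting on whitespace alone is an assumption (a token with trailing punctuation such as "home-made." would be rejected).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class HyphenatedWords {
    // word - hyphen - word, with nothing else allowed in the token
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+-[A-Za-z]+");

    // Hypothetical helper: returns the tokens that are exactly word-hyphen-word.
    static List<String> hyphenated(String sentence) {
        List<String> result = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            // matches() tests the whole token, so "aaa--aa--aaa" is rejected.
            if (WORD.matcher(token).matches()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [home-made]
        System.out.println(hyphenated("It is home-made not aaa--aa--aaa or aaa - bbb"));
    }
}
```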
