Replace a comma that is not in parentheses using regex


I have this string:
john(man,24,engineer),smith(man,23),lucy(female)
How do I replace each comma that is not inside parentheses with #?
The result should be:
john(man,24,engineer)#smith(man,23)#lucy(female)
My code:
String str = "john(man,24,engineer),smith(man,23),lucy(female)";
Pattern p = Pattern.compile(".*?(?:\\(.*?\\)).+?");
Matcher m = p.matcher(str);
System.out.println(m.matches()+" "+m.find());
Why is m.matches() true and m.find() false? How can I achieve this?

Use a negative lookahead to achieve this:
,(?![^()]*\))
Explanation:
, # Match a literal ','
(?! # Start of negative lookahead
[^()]* # Match any character except '(' & ')', zero or more times
\) # Followed by a literal ')'
) # End of lookahead
Regex101 Demo
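As a minimal sketch of how the lookahead could be applied in Java (the class and method names here are hypothetical, not from the question), String#replaceAll is enough. The same sketch also illustrates the matches()/find() part of the question: a successful matches() consumes the whole input, so a following find() resumes at the end of the input and finds nothing until the matcher is reset.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class CommaOutsideParens {

    // Replaces every comma that is not followed by "non-parenthesis characters, then ')'".
    // Assumes parentheses are balanced and not nested.
    static String replaceOuterCommas(String input) {
        return input.replaceAll(",(?![^()]*\\))", "#");
    }

    public static void main(String[] args) {
        String str = "john(man,24,engineer),smith(man,23),lucy(female)";
        System.out.println(replaceOuterCommas(str));
        // prints john(man,24,engineer)#smith(man,23)#lucy(female)

        // The pattern from the question: matches() succeeds on the whole input...
        Matcher m = Pattern.compile(".*?(?:\\(.*?\\)).+?").matcher(str);
        System.out.println(m.matches()); // true
        // ...so find() starts searching at the end of the input and fails.
        System.out.println(m.find());    // false
        // After reset(), find() searches from the beginning again.
        m.reset();
        System.out.println(m.find());    // true
    }
}
```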

A simple regex for another approach, in case we encounter unbalanced parentheses as in smiley:) or escape\)
While the lookahead approach works (and I too am a fan), it breaks down with input such as ,smiley:)(man,23), so I'll give you an alternative simple regex just in case. For the record, it's hard to find a simple approach that works all of the time because of potential nesting.
This situation is very similar to this question about "regex-matching a pattern unless...".
We can solve it with a beautifully-simple regex:
\([^()]*\)|(,)
Of course we can avoid more unpleasantness by allowing the parentheses matched on the left to roll over escaped parentheses:
\((?:\\[()]|[^()])*\)|(,)
The left side of the alternation | matches complete (parentheses). We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
This program shows how to use the regex (see the results at the bottom of the online demo):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Program {
    public static void main(String[] args) {
        String subject = "john(man,24,engineer),smith(man,23),smiley:)(notaperson) ";
        Pattern regex = Pattern.compile("\\([^()]*\\)|(,)");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) {
                // Group 1 is set: this is a comma outside parentheses
                m.appendReplacement(b, "#");
            } else {
                // A complete (...) block: copy it back unchanged. quoteReplacement
                // keeps any '$' or '\' in it from being read as replacement syntax.
                m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(b);
        String replaced = b.toString();
        System.out.println(replaced);
    } // end main
} // end Program
For more information about the technique
How to match (or replace) a pattern except in situations s1, s2, s3...
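The variant that rolls over escaped parentheses can be wrapped in a small helper. The following is only a sketch: the class name, method name, and the test inputs in main are assumptions made for illustration, and it still does not handle nested parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only for this illustration.
class OuterCommaReplacer {

    // Left side: a complete (...) block, allowing \( and \) inside it.
    // Right side: a comma, captured in group 1.
    private static final Pattern PARENS_OR_COMMA =
            Pattern.compile("\\((?:\\\\[()]|[^()])*\\)|(,)");

    static String replaceOuterCommas(String input, String replacement) {
        Matcher m = PARENS_OR_COMMA.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Group 1 is only set for commas outside a (...) block.
            String piece = (m.group(1) != null) ? replacement : m.group();
            // quoteReplacement makes '$' and '\' in the copied text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(piece));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The stray ')' in "smiley:)" no longer protects the preceding comma.
        System.out.println(replaceOuterCommas("a(b,c),smiley:)(d,e)", "#"));
        // prints a(b,c)#smiley:)(d,e)

        // An escaped ')' inside the block does not end the block early.
        System.out.println(replaceOuterCommas("a(x\\),y),b", "#"));
        // prints a(x\),y)#b
    }
}
```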

Related

How to use Pattern, Matcher in Java regex API to remove a specific line

I have a complicated string split: I need to remove the comments and spaces, keep all the numbers, and break every string into single characters. If the - sign is at the start and followed by a number, treat it as a negative number rather than an operator.
The comment has the style ?<space>comments<space>? (comments is a placeholder).
Input :
-122+2 ? comments ?sa b
-122+2 ? blabla ?sa b
output :
["-122","+","2","?","s","a","b"]
(every string broken into characters, no spaces, no comments)

Replace the unwanted string \s*\?\s*\w+\s*(?=\?) with "". You can chain String#replaceAll to remove any remaining whitespace. Note that ?= means positive lookahead, and here it means \s*\?\s*\w+\s* followed by a ?. \s specifies a whitespace character and \w specifies a word character, so \w+ covers a single-word comment as in the examples.
Then you can use the regex ^-\d+|\d+|\D, which means either a negative integer at the beginning (i.e. ^-\d+), or digits (i.e. \d+), or a single non-digit (\D).
Demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "-122+2 ? comments ?sa b";
        str = str.replaceAll("\\s*\\?\\s*\\w+\\s*(?=\\?)", "").replaceAll("\\s+", "");
        Pattern pattern = Pattern.compile("^-\\d+|\\d+|\\D");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
Output:
-122
+
2
?
s
a
b
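Since the question asks for the tokens as a list, the same two steps could be wrapped in a method that returns them. This is a sketch only; the class name ExpressionTokenizer and the method name tokenize are made up for the example, and it keeps the single-word-comment assumption of the regex above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class ExpressionTokenizer {

    // A leading negative integer, a run of digits, or any single non-digit.
    private static final Pattern TOKEN = Pattern.compile("^-\\d+|\\d+|\\D");

    static List<String> tokenize(String line) {
        // Step 1: drop "? comment " (up to, but not including, the closing ?),
        // then remove all remaining whitespace.
        String cleaned = line
                .replaceAll("\\s*\\?\\s*\\w+\\s*(?=\\?)", "")
                .replaceAll("\\s+", "");

        // Step 2: collect the tokens in order.
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(cleaned);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("-122+2 ? comments ?sa b"));
        // prints [-122, +, 2, ?, s, a, b]
        System.out.println(tokenize("-122+2 ? blabla ?sa b"));
        // prints [-122, +, 2, ?, s, a, b]
    }
}
```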

Java - Regular Expressions matching one to another

I am trying to retrieve bits of data using RE. Problem is I'm not very fluent with RE. Consider the code.
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class HTTP {
    private static String getServer(String httpresp) {
        Pattern p = Pattern.compile("(\bServer)(.*[Server:-\r\n]"); //What RE syntax do I use here?
        Matcher m = p.matcher(httpresp);
        if (m.find()) {
            return m.group(2);
        }
        return null;
    }

    public static void main(String[] args) {
        String testdata = "HTTP/1.1 302 Found\r\nServer: Apache\r\n\r\n"; //Test data
        System.out.println(getServer(testdata));
    }
}
How would I get "Server:" to the next "\r\n" out which would output "Apache"? I googled around and tried myself, but have failed.

It's a one liner:
private static String getServer(String httpresp) {
    // (?s) lets . match \r and \n too, so the pattern can span the whole multi-line response
    return httpresp.replaceAll("(?s).*Server: (.*?)\r\n.*", "$1");
}
The trick here is two-part:
use .*?, which is a reluctant match (it consumes as little as possible while still matching)
the regex matches the whole input, but the desired target is captured and returned using a back reference

You could use a capturing group (combined here with a positive lookahead) or a positive lookbehind.
Pattern.compile("(?:\\bServer:\\s*)(.*?)(?=[\r\n]+)");
Then print the group index 1.
Example:
String testdata = "HTTP/1.1 302 Found\r\nServer: Apache\r\n\r\n";
Matcher matcher = Pattern.compile("(?:\\bServer:\\s*)(.*?)(?=[\r\n]+)").matcher(testdata);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
OR
Matcher matcher = Pattern.compile("(?:\\bServer\\b\\S*\\s+)(.*?)(?=[\r\n]+)").matcher(testdata);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
Output:
Apache
Explanation:
(?:\\bServer:\\s*) (?:...) is a non-capturing group: it matches but does not capture. \b is a word boundary, which matches between a word character and a non-word character. Server: matches the literal string Server:, and \s* matches the zero or more whitespace characters that follow.
(.*?) (...) is a capturing group, which captures the characters matched by the pattern inside it. Here (.*?) captures characters non-greedily, up to the point where
(?=[\r\n]+) detects one or more line breaks. (?=...) is a positive lookahead, which asserts that the match must be followed by text matching the pattern inside the lookahead.
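The positive lookbehind option is mentioned but not shown. A minimal sketch of it could look like the following; the class name is hypothetical, and it assumes the header is written exactly as "Server: " with a single space. Because . does not match \r or \n by default, .* stops at the end of the header line, so no capturing group is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class ServerHeader {

    // (?<=\bServer: ) asserts that "Server: " sits immediately before the match;
    // .* then takes the rest of that line.
    private static final Pattern SERVER =
            Pattern.compile("(?<=\\bServer: ).*");

    // Returns the header value, or null when there is no Server header.
    static String getServer(String httpResponse) {
        Matcher m = SERVER.matcher(httpResponse);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String testdata = "HTTP/1.1 302 Found\r\nServer: Apache\r\n\r\n";
        System.out.println(getServer(testdata)); // prints Apache
    }
}
```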

Replace all spaces except the ones with in HTML tags

I need to replace all spaces with the HTML code &nbsp in a string. Currently I use the following, which does the replacement but also replaces the spaces within HTML tags like <a href="http://google.com" />.
string.replaceAll(" ", "&nbsp")
But I need it to not change the tags.
Example:
String s1 = "Hello!, Check out this <^a href=\"http://www.entrepreneur.com/article/234538\">10 Movies Every Entrepreneur Needs to Watch <^/a>"
After replacement, it should look like this:
String s1 = "Hello!,&nbspCheck&nbspout&nbspthis&nbsp<^a href=\"http://www.entrepreneur.com/article/234538\">10&nbspMovies&nbspEvery&nbspEntrepreneur&nbspNeeds&nbspto&nbspWatch&nbsp<^/a>"
Can anybody suggest a more intelligent regex to accomplish the task?

I know you have already accepted an answer, but your problem has another simple solution that wasn't mentioned. This situation sounds very similar to this question to "regex-match a pattern, excluding..."
With all the disclaimers about using regex to parse html, here is a simple way to do it.
We can solve it with a beautifully-simple regex:
<[^<>]*>|( )
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
This full Java program shows how to use the regex (see the results at the bottom of the online demo):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Program {
    public static void main(String[] args) {
        String subject = "Hello!, Check out this <^a href=\"http://www.entrepreneur.com/article/234538\">10 Movies Every Entrepreneur Needs to Watch <^/a>";
        Pattern regex = Pattern.compile("<[^<>]*>|( )");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) {
                // Group 1 is set: a space outside any tag. The trailing semicolon
                // makes this a valid HTML entity (the question writes it as &nbsp).
                m.appendReplacement(b, "&nbsp;");
            } else {
                // A complete <tag>: copy it back unchanged. quoteReplacement
                // keeps any '$' or '\' in it from being read as replacement syntax.
                m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(b);
        String replaced = b.toString();
        System.out.println(replaced);
    } // end main
} // end Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
How to match a pattern unless...

If we can assume that the only use of > and < in the string is for the tags, then this regex will work:
" (?![^<]*>)" (the pattern starts with a literal space; the quotes are only there to make it visible)
It works for your example.
How it works:
The leading space matches the space character. This is exactly like what you did.
(?! starts a negative lookahead. This means that this regex will match only if it is not followed by something that matches the regex in the lookahead.
[^<]* matches any character that is not <, zero or more times
> matches >
) closes the lookahead.
In other words, this regex matches a space only if the next angle bracket after it, if there is one, is < rather than >, which means the space is not inside a tag.
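A short sketch of this lookahead in Java follows. The class name, method name, and sample input are invented for the example; it relies on the same assumption as above, that < and > appear only as tag delimiters. It writes the entity with its trailing semicolon (&nbsp;), whereas the question uses &nbsp.

```java
// Hypothetical class name, used only for this illustration.
class NbspOutsideTags {

    // Replaces each space that is not inside a <...> tag.
    // Assumes '<' and '>' occur only as tag delimiters.
    static String replaceSpaces(String html) {
        return html.replaceAll(" (?![^<]*>)", "&nbsp;");
    }

    public static void main(String[] args) {
        String s = "Hello <a href=\"x y\">link text</a> end";
        System.out.println(replaceSpaces(s));
        // prints Hello&nbsp;<a href="x y">link&nbsp;text</a>&nbsp;end
    }
}
```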

regex for letters or numbers in brackets

I am using Java to process text using regular expressions. I am using the following regular expression
^[\([0-9a-zA-Z]+\)\s]+
to match one or more letters or numbers in parentheses one or more times. For instance, I would like to match
(aaa) (bb) (11) (AA) (iv)
or
(111) (aaaa) (i) (V)
I tested this regular expression on http://java-regex-tester.appspot.com/ and it is working. But when I use it in my code, the code does not compile. Here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tester {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("^[\([0-9a-zA-Z]+\)\s]+");
String[] words = pattern.split("(a) (1) (c) (xii) (A) (12) (ii)");
String w = pattern.
for(String s:words){
System.out.println(s);
}
}
}
I tried to use \\ instead of \ but the regex gave different results than I expected (it matches only one group like (aaa), not multiple groups like (aaa) (111) (ii)).
Two questions:
How can I fix this regex and be able to match multiple groups?
How can I get the individual matches separately (like (aaa) alone and then (111) and so on). I tried pattern.split but did not work for me.

Firstly, you need to escape every backslash inside the quotation marks with another backslash. The regex engine will then see a single backslash. (E.g. write the word-character class \w as \\w inside quotation marks.)
Secondly, you have to finish the line that reads:
String w = pattern.
That line explains why it doesn't compile.

Here is my final solution to match the individual groups of letters/numbers in brackets that appear at the beginning of a line and ignore the rest
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tester {
    static List<String> listOfEnums;

    public static void main(String[] args) {
        listOfEnums = new ArrayList<String>();
        // ^ anchors the match to the start of the (remaining) string
        Pattern pattern = Pattern.compile("^\\([0-9a-zA-Z]+\\)");
        String p = "(a) (1) (c) (xii) (A) (12) (ii) and the good news (1)";
        Matcher matcher = pattern.matcher(p);
        boolean isMatch = matcher.find();
        // Once you find a match, store it in the list and remove it from the string.
        while (isMatch) {
            String s = matcher.group();
            System.out.println(s);
            // Store it in the list
            listOfEnums.add(s);
            // Remove it from the beginning of the string.
            p = p.substring(s.length()).trim();
            matcher = pattern.matcher(p);
            isMatch = matcher.find();
        }
    }
}
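An alternative sketch for the same goal, offered as an assumption rather than as part of the solution above: the \G anchor matches only where the previous match ended (or at the start of the input for the first match), so a plain find() loop collects the leading groups and stops at the first thing that is not a group, without rebuilding the string each time. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class LeadingGroups {

    // \G ties each match to the end of the previous one, \s* skips the
    // separating whitespace, and group 1 is the bracketed item itself.
    private static final Pattern GROUP =
            Pattern.compile("\\G\\s*(\\([0-9a-zA-Z]+\\))");

    static List<String> leadingGroups(String line) {
        List<String> groups = new ArrayList<>();
        Matcher m = GROUP.matcher(line);
        while (m.find()) {
            groups.add(m.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        String p = "(a) (1) (c) (xii) (A) (12) (ii) and the good news (1)";
        System.out.println(leadingGroups(p));
        // prints [(a), (1), (c), (xii), (A), (12), (ii)]
    }
}
```

The trailing (1) is not collected, because the text "and the good news" breaks the chain of consecutive matches.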

1) Your regex is incorrect. You want to match individual groups of letters / numbers in brackets, and the current regex will match only a single string of one or more such groups. I.e. it will match
(abc) (def) (123)
as a single group rather than three separate groups.
A better regex that would match only up to the closing bracket would be
\([0-9a-zA-Z]+\)
2) Java requires you to escape every backslash in a string literal with another backslash
3) The split() method will not do what you want. It finds all matches in your string, throws them away, and returns an array of what is left over. You want to use matcher() instead:
Pattern pattern = Pattern.compile("\\([0-9a-zA-Z]+\\)");
Matcher matcher = pattern.matcher("(a) (1) (c) (xii) (A) (12) (ii)");
while (matcher.find()) {
    System.out.println(matcher.group());
}

Extracting both matching and not matching regex

I have a String like this one abc3a de'f gHi?jk I want to split it into the substrings abc3a, de'f, gHi, ? and jk. In other terms, I want to return Strings that match the regular expression [a-zA-Z0-9'] and the Strings that do not match this regular expression. If there is a way to tell whether each resulting substring is a match or not, this will be a plus.
Thanks!

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class HelloWorld {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("([a-zA-Z0-9']*)?([^a-zA-Z0-9']*)?");
        String str = "abc3a de'f gHi?jk";
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            if (matcher.group(1).length() > 0)
                System.out.println("Match:" + matcher.group(1));
            if (matcher.group(2).length() > 0)
                System.out.println("Miss: `" + matcher.group(2) + "`");
        }
    }
}
Output:
Match:abc3a
Miss: ` `
Match:de'f
Miss: ` `
Match:gHi
Miss: `?`
Match:jk
If you don't want white space.
Pattern pattern = Pattern.compile("([a-zA-Z0-9']*)?([^a-zA-Z0-9'\\s]*)?");
Output:
Match:abc3a
Match:de'f
Match:gHi
Miss: `?`
Match:jk

You can use this regex:
"[a-zA-Z0-9']+|[^a-zA-Z0-9' ]+"
Will give:
["abc3a", "de'f", "gHi", "?", "jk"]
Online Demo: http://regex101.com/r/xS0qG4
Java code:
Pattern p = Pattern.compile("[a-zA-Z0-9']+|[^a-zA-Z0-9' ]+");
Matcher m = p.matcher("abc3a de'f gHi?jk");
while (m.find())
System.out.println(m.group());
OUTPUT
abc3a
de'f
gHi
?
jk
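The question also asks for a way to tell whether each substring is a match or not. One possible sketch, assuming it is acceptable to wrap each side of the alternation in its own capturing group: whichever group is non-null tells you which side matched. The class name, method name, and the "match:"/"miss:" labels are made up for the example, and whitespace is skipped entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class MatchOrMiss {

    // Group 1: a run of characters from the set.
    // Group 2: a run of non-whitespace characters outside the set.
    private static final Pattern RUN =
            Pattern.compile("([a-zA-Z0-9']+)|([^a-zA-Z0-9'\\s]+)");

    static List<String> classify(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            String label = (m.group(1) != null) ? "match:" : "miss:";
            result.add(label + m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classify("abc3a de'f gHi?jk"));
        // prints [match:abc3a, match:de'f, match:gHi, miss:?, match:jk]
    }
}
```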

myString.split("\\s+|(?<=[a-zA-Z0-9'])(?=[^a-zA-Z0-9'\\s])|(?<=[^a-zA-Z0-9'\\s])(?=[a-zA-Z0-9'])")
splits at all the boundaries between runs of characters in that charset.
The lookbehind (?<=...) matches after a character in a run, while the lookahead (?=...) matches before a character in a run of characters outside the set.
The \\s+ is not a boundary match, and matches a run of whitespace characters. This has the effect of removing white-space from the result entirely.
The | allows splitting to happen either at a boundary or at a run of white-space.
Since the lookbehind and lookahead are both positive, the boundaries will not match at the start or end of the string, so there's no need to ignore empty strings in the output unless there is white-space there.
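As a quick check of the split expression above, here is a small sketch applied to the string from the question (the class and method names are hypothetical):

```java
import java.util.Arrays;

// Hypothetical class name, used only for this illustration.
class BoundarySplit {

    // Splits at runs of whitespace and at the zero-width boundaries between
    // a character inside the set and a non-whitespace character outside it.
    static String[] splitRuns(String s) {
        return s.split("\\s+"
                + "|(?<=[a-zA-Z0-9'])(?=[^a-zA-Z0-9'\\s])"
                + "|(?<=[^a-zA-Z0-9'\\s])(?=[a-zA-Z0-9'])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitRuns("abc3a de'f gHi?jk")));
        // prints [abc3a, de'f, gHi, ?, jk]
    }
}
```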

You can use lookarounds (zero-width assertions) to split
// requires: import java.util.ArrayList;
private static String[] splitString(final String s) {
    final String[] arr = s.split("(?=[^a-zA-Z0-9'])|(?<=[^a-zA-Z0-9'])");
    final ArrayList<String> strings = new ArrayList<String>(arr.length);
    for (final String str : arr) {
        // skip the pieces that consist only of whitespace
        if (!"".equals(str.trim())) {
            strings.add(str);
        }
    }
    return strings.toArray(new String[strings.size()]);
}
(?=xxx) means xxx follows this position and (?<=xxx) means xxx precedes this position.
As you did not want to include all-whitespace pieces in the result, you need to filter the array returned by split.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
