Split a string by commas but no inside parenthesis [duplicate]

Split a string by commas but no inside parenthesis [duplicate] - java

I have a string that looks something like the following:
12,44,foo,bar,(23,45,200),6
I'd like to create a regex that matches the commas, but only the commas that are not inside of parentheses (in the example above, all of the commas except for the two after 23 and 45). How would I do this (Java regular expressions, if that makes a difference)?

Assuming that there can be no nested parens (otherwise, you can't use a Java Regex for this task because recursive matching is not supported):
Pattern regex = Pattern.compile(
", # Match a comma\n" +
"(?! # only if it's not followed by...\n" +
" [^(]* # any number of characters except opening parens\n" +
" \\) # followed by a closing parens\n" +
") # End of lookahead",
Pattern.COMMENTS);
This regex uses a negative lookahead assertion to ensure that the next following parenthesis (if any) is not a closing parenthesis. Only then the comma is allowed to match.

Paul, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Also the existing solution checks that the comma is not followed by a parenthesis, but that does not guarantee that it is embedded in parentheses.
The regex is very simple:
\(.*?\)|(,)
The left side of the alternation matches complete set of parentheses. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
In this demo, you can see the Group 1 captures in the lower right pane.
You said you want to match the commas, but you can use the same general idea to split or replace.
To match the commas, you need to inspect Group 1. This full program's only goal in life is to do just that.
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "12,44,foo,bar,(23,45,200),6";
Pattern regex = Pattern.compile("\\(.*?\\)|(,)");
Matcher regexMatcher = regex.matcher(subject);
List<String> group1Caps = new ArrayList<String>();
// put Group 1 captures in a list
while (regexMatcher.find()) {
if(regexMatcher.group(1) != null) {
group1Caps.add(regexMatcher.group(1));
}
} // end of building the list
// What are all the matches?
System.out.println("\n" + "*** Matches ***");
if(group1Caps.size()>0) {
for (String match : group1Caps) System.out.println(match);
}
} // end main
} // end Program
Here is a live demo
To use the same technique for splitting or replacing, see the code samples in the article in the reference.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...

I don’t understand this obsession with regular expressions, given that they are unsuited to most tasks they are used for.
String beforeParen = longString.substring(longString.indexOf('(')) + longString.substring(longString.indexOf(')') + 1);
int firstComma = beforeParen.indexOf(',');
while (firstComma != -1) {
/* do something. */
firstComma = beforeParen.indexOf(',', firstComma + 1);
}
(Of course this assumes that there always is exactly one opening parenthesis and one matching closing parenthesis coming somewhen after it.)

Related

Regex to capture everything that's not inside parenthesis [duplicate]

I have a string that looks something like the following:
12,44,foo,bar,(23,45,200),6
I'd like to create a regex that matches the commas, but only the commas that are not inside of parentheses (in the example above, all of the commas except for the two after 23 and 45). How would I do this (Java regular expressions, if that makes a difference)?

Assuming that there can be no nested parens (otherwise, you can't use a Java Regex for this task because recursive matching is not supported):
Pattern regex = Pattern.compile(
", # Match a comma\n" +
"(?! # only if it's not followed by...\n" +
" [^(]* # any number of characters except opening parens\n" +
" \\) # followed by a closing parens\n" +
") # End of lookahead",
Pattern.COMMENTS);
This regex uses a negative lookahead assertion to ensure that the next following parenthesis (if any) is not a closing parenthesis. Only then the comma is allowed to match.

Paul, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Also the existing solution checks that the comma is not followed by a parenthesis, but that does not guarantee that it is embedded in parentheses.
The regex is very simple:
\(.*?\)|(,)
The left side of the alternation matches complete set of parentheses. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
In this demo, you can see the Group 1 captures in the lower right pane.
You said you want to match the commas, but you can use the same general idea to split or replace.
To match the commas, you need to inspect Group 1. This full program's only goal in life is to do just that.
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "12,44,foo,bar,(23,45,200),6";
Pattern regex = Pattern.compile("\\(.*?\\)|(,)");
Matcher regexMatcher = regex.matcher(subject);
List<String> group1Caps = new ArrayList<String>();
// put Group 1 captures in a list
while (regexMatcher.find()) {
if(regexMatcher.group(1) != null) {
group1Caps.add(regexMatcher.group(1));
}
} // end of building the list
// What are all the matches?
System.out.println("\n" + "*** Matches ***");
if(group1Caps.size()>0) {
for (String match : group1Caps) System.out.println(match);
}
} // end main
} // end Program
Here is a live demo
To use the same technique for splitting or replacing, see the code samples in the article in the reference.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...

I don’t understand this obsession with regular expressions, given that they are unsuited to most tasks they are used for.
String beforeParen = longString.substring(longString.indexOf('(')) + longString.substring(longString.indexOf(')') + 1);
int firstComma = beforeParen.indexOf(',');
while (firstComma != -1) {
/* do something. */
firstComma = beforeParen.indexOf(',', firstComma + 1);
}
(Of course this assumes that there always is exactly one opening parenthesis and one matching closing parenthesis coming somewhen after it.)

Regex matching word that is in the middle of any character except a letter

I'd like to know how to detect word that is between any characters except a letter from alphabet. I need this, because I'm working on a custom import organizer for Java. This is what I have already tried:
The regex expression:
[^(a-zA-Z)]InitializationEvent[^(a-zA-Z)]
I'm searching for the word "InitializationEvent".
The code snippet I've been testing on:
public void load(InitializationEvent event) {
It looks like adding space before the word helps... is the parenthesis inside of alphabet range?
I tested this in my program and it didn't work. Also I checked it on regexr.com, showing same results - class name not recognized.
Am I doing something wrong? I'm new to regex, so it might be a really basic mistake, or not. Let me know!

Lose the parentheses:
[^a-zA-Z]InitializationEvent[^a-zA-Z]
Inside [], parentheses are taken literally, and by inverting the group (^) you prevent it from matching because a ( is preceding InitializationEvent in your string.
Note, however, that the above regex will only match if InitializationEvent is neither at the beginning nor at the end of the tested string. To allow that, you can use:
(^|[^a-zA-Z])InitializationEvent([^a-zA-Z]|$)
Or, without creating any matching groups (which is supposed to be cleaner, and perform better):
(?:^|[^a-zA-Z])InitializationEvent(?:[^a-zA-Z]|$)

how to detect word that is between any characters except a letter from alphabet
This is the case where lookarounds come handy. You can use:
(?<![a-zA-Z])InitializationEvent(?![a-zA-Z])
(?<![a-zA-Z]) is negative lookbehind to assert that there is no alphabet at previous position
(?![a-zA-Z]) is negative lookahead to assert that there is no alphabet at next position
RegEx Demo

The parentheses are causing the problem, just skip them:
"[^a-zA-Z]InitializationEvent[^a-zA-Z]"
or use the predefined non-word character class which is slightly different because it also excludes numbers and the underscore:
"\\WInitializationEvent\\W"
But as it seems you want to match a class name, this might be ok because the remaining character are exactly those that are allowed in a class name.

I'm not sure about your application but from a regexp perspective you can use negative lookaheads and negative lookbehinds to define what cannot surround the String to specify a match.
I have added the negative lookahead (?![a-zA-Z]) and the negative lookbehind (?<![a-zA-Z]) in place of your [^(a-zA-Z)] originally supplied to create: (?<![a-zA-Z])InitializationEvent(?![a-zA-Z])
Quick Fiddle I created:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HelloWorld{
public static void main(String []args){
String pattern = "(?<![a-zA-Z])InitializationEvent(?![a-zA-Z])";
String sourceString = "public void load(InitializationEvent event) {";
String sourceString2 = "public void load(BInitializationEventA event) {";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(sourceString);
if (m.find( )) {
System.out.println("Found value of pattern in sourceString: " + m.group(0) );
} else {
System.out.println("NO MATCH in sourceString");
}
Matcher m2 = r.matcher(sourceString2);
if (m2.find( )) {
System.out.println("Found value of pattern in sourceString2: " + m2.group(0) );
} else {
System.out.println("NO MATCH in sourceString2");
}
}
}
output:
sh-4.3$ java -Xmx128M -Xms16M HelloWorld
Found value of pattern in sourceString: InitializationEvent
NO MATCH in sourceString2

You seem really close:
[^(a-zA-Z)]*(InitializationEvent)[^(a-zA-Z)]*
I think this is what you are looking for. The asterisk provides a match for zero or many of the character or group before it.
EDIT/UPDATE
My apologies on the initial response.
[^a-zA-Z]+(InitializationEvent)[^a-zA-Z]+
My regex is a little rusty, but this will match on any non-alphabet character one or many times prior to the InitializationEvent and after.

Explicitly defining the end of the input in a regular expression using $

I have this code using a regular expression to separate an input string into two words, where the second word is optional (I know that I might use String.split() in this particular case, but the actual regular expression is a bit more complex):
package com.example;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Dollar {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(.*?)\\s*(?: (.*))?$"); // Works
//Pattern pattern = Pattern.compile("(.*?)\\s*(?: (.*))?"); // Does not work
Matcher matcher = pattern.matcher("first second");
matcher.find();
System.out.println("first : " + matcher.group(1));
System.out.println("second: " + matcher.group(2));
}
}
With this code, I get the expected output
first : first
second: second
and it also works if the second word is not there.
However, if I use the other regexp (without the dollar sign at then end), I get empty strings / nulls for the capture groups.
My question is: Why do I have to explicitly put a dollar sign at the end of the regexp to match the "the end of the input sequence" (as the Javadoc says)? In other words, why is the end of the regular expression not implicitly treated as the end of the input sequence?

That is due to lazy nature of your regex which finds & captures many empty matches.
If you use this better regex:
(\S+)(?: (.*))?
Then it will also work with:
(\S+)(?: (.*))?$

Replace all spaces except the ones with in HTML tags

I need to replace all spaces with html code, i.e. &nbsp, in a string. Currently following, does the replacement but it also replaces the spaces with in html tags like <a href="http://google.com" />.
string.replaceAll(" ", "&nbsp")
But I need it to not change the tags.
Example:
String s1 = "Hello!, Check out this <^a href=\"http://www.entrepreneur.com/article/234538\">10 Movies Every Entrepreneur Needs to Watch <^/a>"
After replacment, it should be like;
String s1 = "Hello!,&nbspCheck&nbspout&nbspthis&nbsp<^a href=\"http://www.entrepreneur.com/article/234538\">10&nbspMovies&nbspEvery&nbspEntrepreneur&nbspNeeds&nbspto&nbspWatch&nbsp<^/a>"
Can anybody suggest a more intelligent regex to accomplish the task?

I know you have already accepted an answer, but your problem has another simple solution that wasn't mentioned. This situation sounds very similar to this question to "regex-match a pattern, excluding..."
With all the disclaimers about using regex to parse html, here is a simple way to do it.
We can solve it with a beautifully-simple regex:
<[^<>]*>|( )
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
This full Java program shows how to use the regex (see the results at the bottom of the online demo):
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "Hello!, Check out this <^a href=\"http://www.entrepreneur.com/article/234538\">10 Movies Every Entrepreneur Needs to Watch <^/a>";
Pattern regex = Pattern.compile("<[^<>]*>|( )");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, " ");
else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
System.out.println(replaced);
} // end main
} // end Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
How to match a pattern unless...

If we can assume that the only use of > and < in the string is for the tags, then this regex will work:
(?![^<]*>)
It works for your example.
How it works:
matches the space character. This is exactly like what you did.
(?! starts a negative lookahead. This means that this regex will match only if it is not followed by something that matches the regex in the lookahead.
[^<]* matches any character that is not <, multiple times
> matches >
) closes the lookahead.
In other words, this regex matches any space, but with the requirement there must be a < before every > after the space.

Exclude strings within parentheses from a regular expression?

I'm looking to split space-delimited strings into a series of search terms. However, in doing so I'd like to ignore spaces within parentheses. For example, I'd like to be able to split the string
a, b, c, search:(1, 2, 3), d
into
[[a] [b] [c] [search:(1, 2, 3)] [d]]
Does anyone know how to do this using regular expressions in Java?
Thanks!

This isn't a full regex, but it'll get you there:
(\([^)]*\)|\S)*
This uses a common trick, treating one long string of characters as if it were a single character. On the right side we match non-whitespace characters with \S. On the left side we match a balanced set of parentheses with anything in between.
The end result is that a balanced set of parentheses is treated as if it were a single character, and so the regex as a whole matches a single word, where a word can contain these parenthesized groups.
(Note that because this is a regular expression it can't handle nested parentheses. One set of parentheses is the limit.)

This problem had another solution that wasn't mentioned, so I'll post it here for completion. This situation is similar to this question to ["regex-match a pattern, excluding..."][4]
We can solve this with a beautifully-simple regex:
\([^)]*\)|(\s*,\s*)
The left side of the alternation | matches complete (parentheses). We will ignore these matches. The right side matches and captures commas and surrounding spaces to Group 1, and we know they are the right apostrophes because they were not matched by the expression on the left. We will replace these commas by something distinctive, then split.
This program shows how to use the regex (see the results at the bottom of the online demo):
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "a, b, c, search:(1, 2, 3), d";
Pattern regex = Pattern.compile("\\([^)]*\\)|(\\s*,\\s*)");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "SplitHere");
else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits) System.out.println(split);
} // end main
} // end Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Split a string by commas but no inside parenthesis [duplicate] - java

Related

Regex to capture everything that's not inside parenthesis [duplicate]

Regex matching word that is in the middle of any character except a letter

Explicitly defining the end of the input in a regular expression using $

Replace all spaces except the ones with in HTML tags

Exclude strings within parentheses from a regular expression?

Categories

Resources