Split a String based on regex - java

I have a string that needs to be split on each "," (comma), but I need to ignore any comma that occurs within a pair of parentheses.
For example, B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3
should be split into
B2B,
(A2C,AMM),
(BNC,1NF),
(106,A01),
AAA,
AX3

For non-nested parentheses:
,(?![^\(]*\))
For nested parentheses (parentheses inside parentheses):
(?<!\([^\)]*),(?![^\(]*\))
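As a rough sketch of how the non-nested pattern above might be used from Java, the snippet below passes it to String.split. The class and method names are made up for illustration, and the sketch assumes the parentheses are balanced and never nested. Inside a character class, `[^(]` is equivalent to `[^\(]`.

```java
import java.util.Arrays;
import java.util.List;

class LookaheadSplit {
    // Split on every comma that is NOT followed by a ')' before the next '('.
    // Assumption: parentheses are balanced and not nested.
    static List<String> splitOutsideParens(String input) {
        return Arrays.asList(input.split(",(?![^(]*\\))"));
    }

    public static void main(String[] args) {
        String str = "B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3";
        for (String token : splitOutsideParens(str)) {
            System.out.println(token);
        }
    }
}
```

Because the comma itself is the delimiter, the returned tokens do not carry a trailing comma.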

Try below:
var str = 'B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3';
console.log(str.match(/\([^)]*\)|[A-Z\d]+/g));
// gives you ["B2B", "(A2C,AMM)", "(BNC,1NF)", "(106,A01)", "AAA", "AX3"]
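Java version:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Match either a parenthesized group or a run of upper-case letters and digits.
String str = "B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3";
Pattern p = Pattern.compile("\\([^)]*\\)|[A-Z\\d]+");
Matcher m = p.matcher(str);
List<String> matches = new ArrayList<String>();
while (m.find()) {
    matches.add(m.group());
}
for (String val : matches) {
    System.out.println(val);
}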

A simple iteration will probably be a better option than any regex, especially if your data can have parentheses inside parentheses. For example:
String data = "Some,(data,(that),needs),to (be, splited) by, comma";
StringBuilder buffer = new StringBuilder();
int parenthesesCounter = 0;
for (char c : data.toCharArray()) {
    if (c == '(') parenthesesCounter++;
    if (c == ')') parenthesesCounter--;
    if (c == ',' && parenthesesCounter == 0) {
        // let's do something with the token inside the buffer
        System.out.println(buffer);
        // now we need to clear the buffer
        buffer.delete(0, buffer.length());
    } else {
        buffer.append(c);
    }
}
// let's not forget about the part after the last comma
System.out.println(buffer);
Output:
Some
(data,(that),needs)
to (be, splited) by
comma
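The loop above prints each token as it goes. If the tokens are needed as a collection instead, the same depth-counting idea could be wrapped in a method along these lines. This is only a sketch: the class and method names are invented here, and it assumes the parentheses in the input are balanced.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelCommaSplitter {
    // Collect the pieces between commas that sit at parenthesis depth 0.
    // Assumption: parentheses in the input are balanced.
    static List<String> split(String data) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int depth = 0;
        for (char c : data.toCharArray()) {
            if (c == '(') depth++;
            if (c == ')') depth--;
            if (c == ',' && depth == 0) {
                tokens.add(buffer.toString());
                buffer.setLength(0); // start collecting the next token
            } else {
                buffer.append(c);
            }
        }
        tokens.add(buffer.toString()); // the part after the last comma
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3"));
    }
}
```

Unlike the regex approaches, this keeps working when groups are nested, for example `a,(b,(c),d),e`.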

Try this:
\w{3}(?=,)|(?<=,)\(\w{3},\w{3}\)(?=,)|(?<=,)\w{3}
Explanation: there are three alternatives separated by OR (|):
\w{3}(?=,) - matches any three word characters (letters, digits, or underscore) that are followed by a comma (positive lookahead)
(?<=,)\(\w{3},\w{3}\)(?=,) - matches a parenthesized pair such as (ABC,E4R) that is preceded by a comma (positive lookbehind) and followed by a comma (positive lookahead)
(?<=,)\w{3} - matches any three word characters that are preceded by a comma (positive lookbehind)
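For reference, this is one way the pattern above might be driven from Java with Matcher.find. The class and method names are illustrative only, and the sketch inherits the pattern's assumption that every token is exactly three word characters, or a parenthesized pair of such tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeCharTokens {
    // The three alternatives described above, with Java string escaping.
    private static final Pattern TOKEN = Pattern.compile(
            "\\w{3}(?=,)|(?<=,)\\(\\w{3},\\w{3}\\)(?=,)|(?<=,)\\w{3}");

    // Return every match of the pattern, in order of appearance.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String token : tokens("B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3")) {
            System.out.println(token);
        }
    }
}
```

Tokens of any other length would be skipped or cut short, so this only suits input as regular as the sample.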

Related

Java Regex replacing every digit in beginning

How can I use a regex to replace each digit at the beginning of a word with an underscore, and also replace every character in the rest of the word that is not a letter, digit, dash, or dot with an underscore?
I tried this regex:
^(\d+)|[^\w-.]
However, it replaces all the digits at the beginning with a single underscore.
So 34567fgf-kl.)*/676hh is converted to _fgf-kl.___676hh, while I need every digit at the beginning to be replaced with its own underscore, giving _____fgf-kl.___676hh.
Is it possible to achieve this using a regex?
You can do it like this, using Matcher.appendReplacement together with Matcher.find:
// requires java.util.regex.Matcher and java.util.regex.Pattern
String fileText = "34567fgf-kl.)*/676hh";
String pattern = "^\\d+|[^\\w.-]+";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(fileText);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    // replace each match with as many underscores as it has characters
    m.appendReplacement(sb, repeat("_", m.group(0).length()));
}
m.appendTail(sb); // append the rest of the contents
System.out.println(sb);
And the repeat helper is:
public static String repeat(String s, int n) {
    if (s == null) {
        return null;
    }
    final StringBuilder sb = new StringBuilder(s.length() * n);
    for (int i = 0; i < n; i++) {
        sb.append(s);
    }
    return sb.toString();
}
Also, repeat can be replaced with String repeated = StringUtils.repeat("_", m.group(0).length()); using Commons Lang StringUtils.repeat().
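On a recent JDK the helper may not be needed at all. The sketch below assumes Java 11 or later: it relies on String.repeat (Java 11) and on the Matcher.replaceAll overload that takes a function (Java 9). The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class LeadingDigitMask {
    // Same pattern as above: a leading run of digits, or a run of characters
    // that are not word characters, dots, or dashes.
    private static final Pattern TARGET = Pattern.compile("^\\d+|[^\\w.-]+");

    // Replace each match with one underscore per matched character.
    // Assumption: runs on Java 11+ (String.repeat, Matcher.replaceAll(Function)).
    static String mask(String text) {
        return TARGET.matcher(text)
                .replaceAll(match -> "_".repeat(match.group().length()));
    }

    public static void main(String[] args) {
        System.out.println(mask("34567fgf-kl.)*/676hh")); // prints _____fgf-kl.___676hh
    }
}
```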
You can use a negative lookbehind to match each leading digit individually, i.e. any digit that doesn't have a non-digit somewhere before it.
(?<!\D.{0,999})\d|[^\w-.]
Because of the constraints on lookbehind, the repetition cannot be unbounded, so the pattern above can handle at most 999 leading digits.
You can also use replaceAll() with this regex:
(^\d)|(?<=\d\G)\d|[^-\w.\n]
which means match:
(^\d) - a digit at the beginning of the input,
| - or
(?<=\d\G)\d - a digit that immediately follows the previously matched digit,
| - or
[^-\w.\n] - any character that is not a dash, a word character (\w is [A-Za-z_0-9]), a dot, or a newline (\n).
Since [^-\w.\n] is a rather broad category, you may want to exclude more characters or character groups from matching; it is enough to add them inside the brackets.
\n is included in case the string is multiline. For a single-line string, \n is redundant.
Example in Java:
public class Test {
    public static void main(String[] args) {
        String example = "34567fgf-kl.)*/676hh";
        System.out.println(example.replaceAll("(^\\d)|(?<=\\d\\G)\\d|[^\\w.-]", "_"));
    }
}
with output:
_____fgf-kl.___676hh

Why the string does not split?

While trying to split a string xyz213123kop234430099kpf4532 into tokens :
xyz213123
kop234430099
kpf4532
I wrote the following code:
String s = "xyz213123kop234430099kpf4532";
String regex = "/^[a-zA-z]+[0-9]+$/";
String tokens[] = s.split(regex);
for (String t : tokens) {
    System.out.println(t);
}
But instead of tokens, I get the whole string as one output. What is wrong with the regular expression I used?
You can do it like this:
String s = "xyz213123kop234430099kpf4532";
String[] result = s.split("(?<=[0-9])(?=[a-z])");
The idea is to use zero-width assertions to find the places where the string should be cut: a lookbehind (preceded by a digit, [0-9]) and a lookahead (followed by a letter, [a-z]).
These lookarounds are only checks and consume nothing, so the delimiter of the split is an empty string and no characters are removed from the result.
You could split at each boundary between a digit and a non-digit.
String s = "xyz213123kop234430099kpf4532";
String[] parts = s.split("(?<![^\\d])(?=\\D)");
for (String p : parts) {
    System.out.println(p);
}
Output:
xyz213123
kop234430099
kpf4532
There's nothing in your string that matches the regular expression, because your expression starts with ^ (beginning of string) and ends with $ (end of string). So it would either match the whole string or nothing at all. (The surrounding slashes are also a problem: they are JavaScript-style delimiters, and Java treats them as literal characters.) Because the expression doesn't match anything, no delimiter is found when you split the string into tokens. That's why you get just one big token.
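To see this behaviour concretely, here is a small sketch (the class and method names are invented for illustration). It counts the tokens that split returns for the anchored pattern and for the same pattern without anchors.

```java
class AnchoredSplitDemo {
    // Number of tokens String.split produces for the given regex.
    static int tokenCount(String s, String regex) {
        return s.split(regex).length;
    }

    public static void main(String[] args) {
        String s = "xyz213123kop234430099kpf4532";
        // Anchored: the pattern never matches, so the whole string comes back as one token.
        System.out.println(tokenCount(s, "^[a-zA-Z]+[0-9]+$")); // prints 1
        // Unanchored: every character belongs to a delimiter, so nothing is left over.
        System.out.println(tokenCount(s, "[a-zA-Z]+[0-9]+"));   // prints 0
    }
}
```

Removing the anchors does not help either, because split treats whatever the pattern matches as the delimiter and throws it away.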
You don't want to use split for that. The argument to split is the delimiter between tokens. You don't have that. Instead, you have a pattern that repeats and you want each match to the pattern. Try this instead:
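// requires java.util.regex.Matcher and java.util.regex.Pattern
String s = "xyz213123kop234430099kpf4532";
// [a-zA-Z], not [a-zA-z]: the range A-z would also admit [ \ ] ^ _ and `
Pattern p = Pattern.compile("([a-zA-Z]+[0-9]+)");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}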
Output:
xyz213123
kop234430099
kpf4532
(I don't know by what logic you would have the second token be "3kop234430099" as in your posted question. I assume that the leading "3" is a typo.)

How to convert a String to String array in Java ( Ignore whitespace and parentheses )

The String will look like this:
String temp = "IF (COND_ITION) (ACT_ION)";
// Only one whitespace character on either side of the parentheses
or
String temp = " IF (COND_ITION) (ACT_ION) ";
// More irrelevant whitespace in the String,
// but no whitespace inside the condition or action
I hope to get a new String array that contains three elements, ignoring the parentheses:
String[] tempArray;
tempArray[0] = "IF";
tempArray[1] = "COND_ITION";
tempArray[2] = "ACT_ION";
I tried to use the String.split(regex) method, but I don't know how to write the regex.
If your input string will always be in the format you described, it is better to parse it against the whole pattern instead of just the delimiter, as this code does:
// requires java.util.regex.Matcher and java.util.regex.Pattern
Pattern pattern = Pattern.compile("(\\S+)\\s+\\((.*?)\\)\\s+\\((.*?)\\)");
Matcher matcher = pattern.matcher(inputString);
String[] tempArray = new String[3];
if (matcher.find()) {
    tempArray[0] = matcher.group(1);
    tempArray[1] = matcher.group(2);
    tempArray[2] = matcher.group(3);
}
Pattern breakdown:
(\\S+) IF
\\s+ whitespace
\\((.*?)\\) (COND_ITION)
\\s+ whitespace
\\((.*?)\\) (ACT_ION)
You can use StringTokenizer to split the input into strings delimited by whitespace. From the Java documentation:
The following is one example of the use of the tokenizer. The code:
StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
prints the following output:
this
is
a
test
Then write a loop to process the strings to replace the parentheses.
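I think you want a regular expression like "\\)? *\\(?", assuming any whitespace inside the parentheses is not to be removed. Be aware that every part of that pattern is optional, so it can also match the empty string, and String.split would then break the input between every character; requiring at least one delimiter character, as in "[\\s()]+", avoids that, at the cost of also splitting on whitespace inside the parentheses. Note that neither pattern validates that the parentheses match properly. Hope this helps.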
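A minimal split-based sketch, assuming (as the question states) that the condition and action contain no whitespace or parentheses of their own. It trims the input and then treats any run of whitespace and parentheses as the delimiter. The class and method names are hypothetical.

```java
import java.util.Arrays;

class ParenTokenizer {
    // Trim first so that leading whitespace does not produce an empty first token,
    // then split on runs of whitespace and parentheses.
    // Assumption: no whitespace or parentheses inside the condition or action.
    static String[] tokens(String temp) {
        return temp.trim().split("[\\s()]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("IF (COND_ITION) (ACT_ION)")));
        System.out.println(Arrays.toString(tokens("  IF   (COND_ITION)  (ACT_ION)  ")));
    }
}
```

The closing parenthesis at the end produces a trailing empty string, which split discards by default, so both inputs yield three elements.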

Java Split not working as expected

I am trying to use a simple split to break up the following string: 00-00000
My expression is: ^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])
And my usage is:
String s = "00-00000";
String pattern = "^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])";
String[] parts = s.split(pattern);
If I play around with the Pattern and Matcher classes I can see that my pattern does match and the matcher tells me my groupCount is 7 which is correct. But when I try and split them I have no luck.
String.split does not use capturing groups as its result. It finds whatever matches and uses that as the delimiter, so the resulting String[] contains the substrings in between what the regex matches. As it stands, the regex matches the whole string, and with the whole string as the delimiter there is nothing left, so it returns an empty array.
If you want to use regex capturing groups you will have to use Matcher.group(); String.split() will not do.
For your example, you could simply do this:
String s = "00-00000";
String pattern = "-";
String[] parts = s.split(pattern);
I cannot be sure, but I think what you are trying to do is get each matched group into an array:
Matcher matcher = Pattern.compile(pattern).matcher(s);
if (matcher.matches()) {
    String[] groups = new String[matcher.groupCount()];
    for (int i = 0; i < matcher.groupCount(); i++) {
        groups[i] = matcher.group(i + 1); // group 0 is the whole match
    }
}
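Put together as a self-contained sketch, the difference between the two calls looks roughly like this. The class, constant, and method names are invented for the example, and the input is changed to "12-34567" only so that the groups are easy to tell apart.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitVersusGroups {
    static final String PATTERN =
            "^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])";

    // Return the text captured by each group, or an empty array if there is no match.
    static String[] groups(String s) {
        Matcher matcher = Pattern.compile(PATTERN).matcher(s);
        if (!matcher.matches()) {
            return new String[0];
        }
        String[] result = new String[matcher.groupCount()];
        for (int i = 1; i <= matcher.groupCount(); i++) {
            result[i - 1] = matcher.group(i); // groups are numbered from 1
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "12-34567";
        // The pattern consumes the whole string as a delimiter, so nothing remains.
        System.out.println(s.split(PATTERN).length);     // prints 0
        System.out.println(Arrays.toString(groups(s)));  // prints [12, -, 3, 4, 5, 6, 7]
    }
}
```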
From the documentation:
String[] split(String regex) -- Returns: the array of strings computed by splitting this string around matches of the given regular expression
Essentially the regular expression is used to define delimiters in the input string. You can use capturing groups and backreferences in your pattern (e.g. for lookarounds), but ultimately what matters is what and where the pattern matches, because that defines what goes into the returned array.
If you want to split your original string into 7 parts using regular expression, then you can do something like this:
String s = "12-3456";
String[] parts = s.split("(?!^)");
System.out.println(parts.length); // prints "7"
for (String part : parts) {
    System.out.print("[" + part + "] ");
} // prints "[1] [2] [-] [3] [4] [5] [6] "
This splits on the zero-length assertion (?!^), which matches anywhere except before the first character in the string. This prevents the empty string from being the first element in the array, and the trailing empty string is already discarded because we use the default limit parameter of split.
Using a regular expression to get the individual characters of a string like this is overkill, though. If you have only a few characters, the most concise option is to use a for-each loop over toCharArray():
for (char ch : "12-3456".toCharArray()) {
    System.out.print("[" + ch + "] ");
}
This is not the most efficient option if you have a longer string.
Splitting on -
This may also be what you're looking for:
String s = "12-3456";
String[] parts = s.split("-");
System.out.println(parts.length); // prints "2"
for (String part : parts) {
    System.out.print("[" + part + "] ");
} // prints "[12] [3456] "

Regarding String manipulation

I have a String str which can hold values like the ones below. I want the first letter of the string to be upper case, and wherever an underscore appears I need to remove it and make the letter after it upper case. All the remaining letters should be lower case.
""
"abc"
"abc_def"
"Abc_def_Ghi12_abd"
"abc__de"
"_"
Output:
""
"Abc"
"AbcDef"
"AbcDefGhi12Abd"
"AbcDe"
""
Well, without showing us that you put any effort into this problem this is going to be kinda vague.
I see two possibilities here:
1. Split the string at underscores, capitalize each part (first letter upper case, the rest lower case), and re-combine them.
2. Create a StringBuilder, walk through the string, and keep track of whether you are
   - at the start of the string,
   - after an underscore, or
   - somewhere else,
   and act appropriately on the current character before appending it to the StringBuilder instance.
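The StringBuilder walk described above could look something like the following sketch. The class and method names are made up here. A single flag covers both the "start of the string" and "after an underscore" states, since both call for an upper-case letter next.

```java
class CamelCase {
    // Walk the input once: drop underscores, upper-case the first character and
    // any character that follows an underscore, lower-case everything else.
    static String camelize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean upperNext = true; // true at the start and right after an underscore
        for (char ch : input.toCharArray()) {
            if (ch == '_') {
                upperNext = true;
            } else if (upperNext) {
                out.append(Character.toUpperCase(ch));
                upperNext = false;
            } else {
                out.append(Character.toLowerCase(ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] samples = {"", "abc", "abc_def", "Abc_def_Ghi12_abd", "abc__de", "_"};
        for (String sample : samples) {
            System.out.println("\"" + camelize(sample) + "\"");
        }
    }
}
```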
First, replace _ with a space (str.replace("_", " ")).
Then use WordUtils.capitalizeFully(str) (from commons-lang).
Finally, replace the space with nothing (str.replace(" ", "")).
You can use the following regexp-based code:
public static String camelize(String input) {
    // Lower-case everything first; the letters to capitalize are fixed up below.
    char[] c = input.toLowerCase().toCharArray();
    if (c.length > 0) {
        c[0] = Character.toUpperCase(c[0]);
    }
    // Find each underscore that is directly followed by a letter.
    Matcher m = Pattern.compile("_([a-z])").matcher(new String(c));
    while (m.find()) {
        int index = m.start(1);
        c[index] = Character.toUpperCase(c[index]);
    }
    return String.valueOf(c).replace("_", "");
}
Use Pattern/Matcher in the java.util.regex package.
For each string in your array, do the following:
StringBuffer output = new StringBuffer();
// Lower-case the input first so that only the captured characters end up upper case.
Matcher match = Pattern.compile("(?:^|_)_*([^_]?)").matcher(inStr.toLowerCase());
while (match.find()) {
    // Replace the whole match (underscores included) with the capitalized character.
    match.appendReplacement(output, Matcher.quoteReplacement(match.group(1).toUpperCase()));
}
match.appendTail(output);
// Will have the properly capitalized string.
String capitalized = output.toString();
The regular expression looks for either the start of the string or an underscore, "(?:^|_)", and skips any further underscores, "_*".
It then puts the following character, if there is one, into a group, "([^_]?)".
The code then goes through each of the matches in the input string, replacing the match with the capitalized content of that group.
