I'm after some help with a regex that I can't get to work correctly. I've used a few online tools to test the patterns, but with little success.
I need to split a string based on the pattern FS[0-9][0-9], but also include the trailing text, which could be comma-separated text and numbers of any length.
For example: FS01,a,b,c,d,1,2,3FS02,x,y,zFS03,some random text,123FS04,1
Would need to be split into:
FS01,a,b,c,d,1,2,3
FS02,x,y,z
FS03,some random text,123
FS04,1
Use a negative lookbehind and positive lookahead to get the splits.
String s = "FS01,a,b,c,d,1,2,3FS02,x,y,zFS03,some random text,123FS04,1";
String tok[] = s.split("(?<!^)(?=FS\\d{2})");
System.out.println(tok[0]);
System.out.println(tok[1]);
System.out.println(tok[2]);
System.out.println(tok[3]);
Output:
FS01,a,b,c,d,1,2,3
FS02,x,y,z
FS03,some random text,123
FS04,1
Explanation:
(?<!^) Negative lookbehind: asserts that the current position is not the start of the input, so no split happens at index 0.
(?=FS\\d{2}) Positive lookahead: asserts that what follows is FS and two digits. Together they place a zero-width split point just before every FS\d\d except the one at the start.
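The same split can be sketched with a precompiled Pattern, which avoids recompiling the regex on every call. The class and method names (FsSplitSketch, splitRecords) are made up for illustration; only the regex comes from the answer above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FsSplitSketch {
    // Zero-width split point: not at the start of the input, and just before "FS" plus two digits.
    private static final Pattern BOUNDARY = Pattern.compile("(?<!^)(?=FS\\d{2})");

    // Hypothetical helper: returns the records as a fixed-size list.
    static List<String> splitRecords(String input) {
        return Arrays.asList(BOUNDARY.split(input));
    }

    public static void main(String[] args) {
        String s = "FS01,a,b,c,d,1,2,3FS02,x,y,zFS03,some random text,123FS04,1";
        for (String record : splitRecords(s)) {
            System.out.println(record);
        }
    }
}
```

On Java 8 and later, split does not produce an empty leading element for a zero-width match at index 0, so the lookbehind mainly matters on older runtimes.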
Try this regex:
// requires java.util.regex.Matcher and java.util.regex.Pattern
public static void main(String[] args) {
    String s = "FS01,a,b,c,d,1,2,3FS02,x,y,zFS03,some random text,123FS04,1";
    // Lazily capture text starting at FS, up to (but not including) the next FS or the end of the string ($).
    Pattern p = Pattern.compile("(FS.*?)(?=(FS|$))");
    Matcher m = p.matcher(s);
    while (m.find()) {
        System.out.println(m.group(1));
    }
}
O/P :
FS01,a,b,c,d,1,2,3
FS02,x,y,z
FS03,some random text,123
FS04,1
Try this regex: FS.*?(?=FS|$) (with a plain (?=FS) lookahead, the last segment, FS04,1, has no following FS and is never matched)
http://www.regexr.com/3999u
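To compare a lookahead for FS alone with one that also accepts the end of the string, here is a small sketch. The helper name findAll and the class name are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FsMatchSketch {
    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "FS01,a,b,c,d,1,2,3FS02,x,y,zFS03,some random text,123FS04,1";
        // The last record has no FS after it, so this variant skips it.
        System.out.println(findAll("FS.*?(?=FS)", s).size());
        // Accepting the end of the string as a terminator picks it up.
        System.out.println(findAll("FS.*?(?=FS|$)", s).size());
    }
}
```

Both variants assume the payload itself never contains the letters FS; if it can, a stricter lookahead such as (?=FS\d{2}|$) is one option.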
I am trying to extract all leading digits from a string using a Java regex, without writing additional code, and I could not find anything that works:
"12345XYZ6789ABC" should give me "12345".
"X12345XYZ6789ABC" should give me nothing
public final class NumberExtractor {
private static final Pattern DIGITS = Pattern.compile("what should be my regex here?");
public static Optional<Long> headNumber(String token) {
var matcher = DIGITS.matcher(token);
return matcher.find() ? Optional.of(Long.valueOf(matcher.group())) : Optional.empty();
}
}
Use a word boundary \b:
\b\d+
If you strictly want to match only digits at the start of the input, and not from each word (same thing when the input contains only one word), use ^:
^\d+
Pattern DIGITS = Pattern.compile("\\b\\d+"); // leading digits of all words
Pattern DIGITS = Pattern.compile("^\\d+"); // leading digits of input
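I'd think something like "^[0-9]+" would work. Use + rather than *: "^[0-9]*" also matches the empty string, so find() would succeed on input with no leading digits and Long.valueOf("") would then throw. Note that in Java \d is equivalent to [0-9] by default; it only matches other Unicode digits when the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS.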
Edit: removed errant . from the string.
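Putting the anchored pattern into the class from the question gives something like the sketch below. The class is made package-private here so it can live in any file; everything else follows the question's code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberExtractor {
    // ^ anchors at the start of the input; \d+ requires at least one digit.
    private static final Pattern DIGITS = Pattern.compile("^\\d+");

    static Optional<Long> headNumber(String token) {
        Matcher matcher = DIGITS.matcher(token);
        // Long.valueOf throws NumberFormatException if the digit run does not fit in a long.
        return matcher.find() ? Optional.of(Long.valueOf(matcher.group())) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(headNumber("12345XYZ6789ABC"));
        System.out.println(headNumber("X12345XYZ6789ABC"));
    }
}
```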
In the following code:
public static void main(String[] args) {
List<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("\\d+\\D+\\d+").matcher("2abc3abc4abc5");
while (m.find()) {
allMatches.add(m.group());
}
String[] res = allMatches.toArray(new String[0]);
System.out.println(Arrays.toString(res));
}
The result is:
[2abc3, 4abc5]
I'd like it to be
[2abc3, 3abc4, 4abc5]
How can it be achieved?
Make the matcher attempt to start its next scan from the latter \d+.
Matcher m = Pattern.compile("\\d+\\D+(\\d+)").matcher("2abc3abc4abc5");
if (m.find()) {
do {
allMatches.add(m.group());
} while (m.find(m.start(1)));
}
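As a self-contained sketch of this restart technique (the class and method names are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapFindSketch {
    // Group 1 marks the trailing number, where the next scan should begin.
    private static final Pattern PAIR = Pattern.compile("\\d+\\D+(\\d+)");

    static List<String> overlapping(String input) {
        List<String> allMatches = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        boolean found = m.find();
        while (found) {
            allMatches.add(m.group());
            // find(int) resets the matcher and searches again from the given index.
            found = m.find(m.start(1));
        }
        return allMatches;
    }

    public static void main(String[] args) {
        System.out.println(overlapping("2abc3abc4abc5")); // prints [2abc3, 3abc4, 4abc5]
    }
}
```

The loop terminates because group 1 always starts after the match start, so each restart index is strictly larger than the previous one.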
Not sure if this is possible in Java, but in PCRE you could do the following:
(?=(\d+\D+\d+)).
Explanation
The technique is to use a matching group in a lookahead, and then "eat" one character to move forward.
(?= : start of positive lookahead
( : start matching group 1
\d+ : match a digit one or more times
\D+ : match a non-digit character one or more times
\d+ : match a digit one or more times
) : end of group 1
) : end of lookahead
. : match any single character (except a line terminator, by default); this is to "move forward".
Thanks to Casimir et Hippolyte, it seems this really does work in Java. Double the backslashes in the string literal, giving "(?=(\\d+\\D+\\d+)).", and read each match from the first capturing group.
Tested on www.regexplanet.com.
HamZa's solution above works perfectly in Java. To find overlapping matches of a specific pattern in a text, all you have to do is:
String regex = "\\d+\\D+\\d+";
String updatedRegex = "(?=(" + regex + ")).";
Here regex is the pattern you are looking for. To get overlapping matches, wrap it with (?=( at the start and )). at the end, then read each match from group 1.
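A minimal runnable sketch of the lookahead wrapper follows; the class and method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapLookaheadSketch {
    static List<String> overlapping(String regex, String input) {
        // The lookahead captures a candidate without consuming it; the final dot advances one character.
        Matcher m = Pattern.compile("(?=(" + regex + ")).").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(overlapping("\\d+\\D+\\d+", "2abc3abc4abc5")); // prints [2abc3, 3abc4, 4abc5]
        System.out.println(overlapping("\\d+\\D+\\d+", "12abc3"));        // prints [12abc3, 2abc3]
    }
}
```

Because a match is attempted at every character position, multi-digit numbers also produce matches that start in the middle of a number, as the second call shows. If that is unwanted, restarting the scan from the trailing number with find(int) avoids it.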
I am trying to parse a string to remove commas between numbers. Please read the complete question before answering.
Let us consider the following string. AS IS :)
John loves cakes and he always orders them by dialing "989,444 1234". Johns credentials are as follows"
"Name":"John", "Jr", "Mobile":"945,234,1110"
Assuming I have the above line of text in a Java string, I would like to remove all commas between numbers. I would like to replace the following in the same string:
"989,444 1234" with "989444 1234"
"945,234,1110" with "9452341110"
without making any other changes to the string.
I could iterate through the string and, whenever a comma is found, check the previous and next index for digits and then delete the comma. But that looks ugly, doesn't it?
If I use the regex "[0-9],[0-9]", then I would lose two characters, the ones before and after the comma.
I am looking for an efficient solution rather than a brute-force "search and replace" over the complete string. The real string length is ~80K characters. Thanks.
public static void main(String[] args) {
    // Match a comma only when it has a digit on both sides; the lookarounds consume nothing.
    String regex = "(?<=[\\d])(,)(?=[\\d])";
    Pattern p = Pattern.compile(regex);
    String str = "John loves cakes and he always orders them by dialing \"989,444 1234\". Johns credentials are as follows\" \"Name\":\"John\", \"Jr\", \"Mobile\":\"945,234,1110\"";
    Matcher m = p.matcher(str);
    str = m.replaceAll("");
    System.out.println(str);
}
Output
John loves cakes and he always orders them by dialing "989444 1234". Johns credentials are as follows" "Name":"John", "Jr", "Mobile":"9452341110"
This regex uses a positive lookbehind and a positive lookahead to only match commas with a preceding digit and a following digit, without including those digits in the match itself:
(?<=\d),(?=\d)
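For a large input that is processed repeatedly, the pattern can be compiled once and reused. This is only a sketch: the class name, constant name, and method name are invented for the example.

```java
import java.util.regex.Pattern;

class CommaStripSketch {
    // Compiled once; the lookarounds check the neighbouring digits without consuming them.
    private static final Pattern COMMA_BETWEEN_DIGITS = Pattern.compile("(?<=\\d),(?=\\d)");

    static String stripDigitCommas(String text) {
        return COMMA_BETWEEN_DIGITS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String line = "\"Name\":\"John\", \"Jr\", \"Mobile\":\"945,234,1110\"";
        System.out.println(stripDigitCommas(line));
    }
}
```

Since neither digit is part of the match, adjacent commas such as those in 945,234,1110 are each removed in a single pass, which is what a plain [0-9],[0-9] replacement cannot do.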
You could try a regex like this:
public static void main(String[] args) {
    String s = "asd,asdafs,123,456,789,asda,dsfds";
    // Positive lookbehind for a digit and positive lookahead for a digit,
    // i.e. only remove a comma that is preceded by a digit and followed by another digit.
    System.out.println(s.replaceAll("(?<=\\d),(?=\\d)", ""));
}
O/P :
asd,asdafs,123456789,asda,dsfds
I'm trying to write a regex that will identify whether a string has 2 or more consecutive commas. For example:
hello,,457
,,,,,
dog,,,elephant,,,,,
Can anyone help on what a valid regex would be?
String str = "hello,,,457";
Pattern pat = Pattern.compile("[,]{2,}");
Matcher matcher = pat.matcher(str);
if (matcher.find()) {
    System.out.println("contains 2 or more consecutive commas");
}
The regex below matches strings that contain two or more consecutive commas:
^.*?,,+.*$
You don't need the start and end anchors when using the regex with the matches method, because matches always tests the whole string.
System.out.println("dog,,,elephant,,,,,".matches(".*?,,+.*"));
Output:
true
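The difference between find() and matches() is easy to trip over here, so a short sketch may help. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class ConsecutiveCommasSketch {
    // Two or more commas in a row, anywhere in the input.
    private static final Pattern TWO_OR_MORE = Pattern.compile(",{2,}");

    static boolean hasConsecutiveCommas(String s) {
        // find() searches for the pattern anywhere; matches() would require the whole string to be commas.
        return TWO_OR_MORE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasConsecutiveCommas("hello,,457"));
        System.out.println(hasConsecutiveCommas("a,b,c"));
        // With matches(), the surrounding text has to be covered by the pattern itself.
        System.out.println("hello,,457".matches(".*?,,+.*"));
    }
}
```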
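Try, with Spring's StringUtils:
int occurrences = StringUtils.countOccurrencesOf("dog,,,elephant,,,,,", ",,");
or, with Apache Commons Lang's StringUtils:
int count = StringUtils.countMatches("dog,,,elephant,,,,,", ",,");
depending on which library you use. A result greater than zero means the string contains at least two consecutive commas.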
Check the solution here: Java: How do I count the number of occurrences of a char in a String?
Why is the non-greedy match not working for me? Take the following example:
public String nonGreedy(){
String str2 = "abc|s:0:\"gef\";s:2:\"ced\"";
return str2.split(":.*?ced")[0];
}
In my eyes the result should be: abc|s:0:\"gef\";s:2 but it is: abc|s
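The .*? in your regex matches any character except a line terminator (0 or more times, as few as possible). Laziness only shortens a match from the right, though: the engine still takes the leftmost position where a match can start, which is the first colon, and from there .*? has to run on until it reaches ced.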
You can try the regular expression:
:[^:]*?ced
On another note, you should use a constant Pattern to avoid recompiling the expression every time, something like:
private static final Pattern REGEX_PATTERN =
Pattern.compile(":[^:]*?ced");
public static void main(String[] args) {
String input = "abc|s:0:\"gef\";s:2:\"ced\"";
System.out.println(java.util.Arrays.toString(
REGEX_PATTERN.split(input)
)); // prints "[abc|s:0:"gef";s:2, "]"
}
It is behaving as expected. The non-greedy match will match as little as it has to, and with your input, the minimum characters to match is the first colon to the next ced.
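You could try limiting the number of characters consumed. For example, to allow at most one character between the colon and ced (a limit of 3 would be too loose for this input, since the match would then start at the colon before the 2):
:.{0,1}ced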
To make it split as close to ced as possible, use a negative look-ahead, with this regex:
:(?!.*:.*ced).*ced
This makes sure there isn't a closer colon to ced.
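To see the leftmost-match behaviour side by side with two of the fixes, here is a small sketch using the question's input. The class and helper names are made up for the example.

```java
class LazySplitSketch {
    // Hypothetical helper: the text before the first match of regex.
    static String firstPart(String regex, String input) {
        return input.split(regex)[0];
    }

    public static void main(String[] args) {
        String input = "abc|s:0:\"gef\";s:2:\"ced\"";
        // Lazy, but the match still starts at the first colon.
        System.out.println(firstPart(":.*?ced", input));            // prints abc|s
        // Forbidding colons inside the match forces it to start at the last colon before ced.
        System.out.println(firstPart(":[^:]*?ced", input));         // prints abc|s:0:"gef";s:2
        // Negative lookahead: reject any colon that has another colon between it and ced.
        System.out.println(firstPart(":(?!.*:.*ced).*ced", input)); // prints abc|s:0:"gef";s:2
    }
}
```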