Find all subsequences within double squared brackets - java

My input text looks like this:
..some_text0 [[some_text1]] some_text2 [[some_text3]] some_text4 ....
I want to extract all the text contained within double square brackets, i.e. I want to obtain each of these groups separately:
some_text1
some_text3
I tried this solution:
Matcher m = Pattern.compile("\\[\\[.*\\]\\]").matcher(line_input);
while (m.find()) {
System.out.println("Found: " + m.group());
}
but this prints:
[[some_text1]] some_text2 [[some_text3]]
as the only result. How can I achieve my goal?

The regex for this task is as below
\[\[(.*?)]]
It searches for [[ followed by the shortest possible string that ends with ]], and captures the text in between in group 1.
Here is DEMO and explanation
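A minimal sketch of how the lazy pattern above could be used from Java. The class and method names (DoubleBracketExtractor, extract) are made up for illustration; only the regex comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleBracketExtractor {
    // Lazy quantifier: each match stops at the first "]]" after a "[["
    private static final Pattern DOUBLE_BRACKETS = Pattern.compile("\\[\\[(.*?)]]");

    // Hypothetical helper: collects the text between each pair of double brackets
    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = DOUBLE_BRACKETS.matcher(input);
        while (m.find()) {
            found.add(m.group(1)); // group 1 is the text between the brackets
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "..some_text0 [[some_text1]] some_text2 [[some_text3]] some_text4 ....";
        System.out.println(extract(line)); // prints [some_text1, some_text3]
    }
}
```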

The \[\[.*?]] regex is comparatively slow, and . does not match a newline by default. You might also run into problems with exhausted backtracking limits if you parse very long strings.
I suggest using a regex based on the unrolling-the-loop method:
\[{2}([^\]]*(?:\](?!\])[^\]]*)*)\]{2}
Or an even shorter one:
\[{2}([^\]]*(?:\][^\]]+)*)\]{2}
See regex demo 1 and demo 2.
Here is a Java demo:
String str = "some_text0 [[some_text1]] some_text2 [[some_text3]] some_text4";
Pattern ptrn = Pattern.compile("\\[{2}([^\\]]*(?:\\][^\\]]+)*)\\]{2}");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Result:
some_text1
some_text3
Regex breakdown:
\[{2} - match exactly 2 [ symbols
[^\]]* - match 0 or more symbols other than ]
(?:\][^\]]+)* - match 0 or more sequences of...
\] - a single ] that is followed by
[^\]]+ - 1 or more symbols other than ]
\]{2} - match exactly 2 ] symbols.
The difference from .*?-based regex is that matching becomes more linear and thus the regex pattern is much faster and less error prone.
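To illustrate the point about newlines, here is a small sketch comparing the lazy pattern with the shorter unrolled pattern on an input that contains a single ] and a line break inside the brackets. The class name, the helper, and the sample input are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledVsLazyDemo {
    static final Pattern LAZY = Pattern.compile("\\[\\[(.*?)]]");
    static final Pattern UNROLLED =
            Pattern.compile("\\[{2}([^\\]]*(?:\\][^\\]]+)*)\\]{2}");

    // Hypothetical helper: returns group 1 of every match of p in input
    static List<String> groups(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "[[a]b]] and [[c\nd]]";
        // The lazy pattern cannot cross the line break because . excludes \n by default
        System.out.println(groups(LAZY, input).size());     // prints 1
        // The negated character class [^\]] matches line breaks as well
        System.out.println(groups(UNROLLED, input).size()); // prints 2
    }
}
```

Compiling the lazy pattern with Pattern.DOTALL would also let it cross line breaks; the unrolled form simply does not need the flag.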

Related

regex for not matching alpha plus numeric range

I have the following regex
.{19}_.{3}PDR_.{8}(ABCD|CTNE|PFRE)006[0-9][0-9].{3}_.{6}\.POC
a match is for example
NRM_0157F0680884976_598PDR_T0060000ABCD00619_00_6I1N0T.POC
and would like to negate the (ABCD|CTNE|PFRE)006[0-9][0-9]
portion such that
NRM_0157F0680884976_598PDR_T0060000ABCD00719_00_6I1N0T.POC
is a match but
NRM_0157F0680884976_598PDR_T0060000ABCD007192_00_6I1N0T.POC
or
NRM_0157F0680884976_598PDR_T0060000ABCD0061_00_6I1N0T.POC
is not (the negated part must be 9 chars long just like the non negated part for a total length of 58 chars).
Consider using the following pattern:
\b(?:ABCD|CTNE|PFRE)006[0-9][0-9]\b
Sample Java code:
String input = "Matching value is ABCD00601 but EFG123 is non matching";
Pattern r = Pattern.compile("\\b(?:ABCD|CTNE|PFRE)006[0-9][0-9]\\b");
Matcher m = r.matcher(input);
while (m.find()) {
System.out.println("Found a match: " + m.group());
}
This prints:
Found a match: ABCD00601
I would like to propose this expression
(ABCD|CTNE|PFRE)006\d{1,2}
where \d{1,2} matches any one- or two-digit number.
That is, it would match any value from ABCD0060 to ABCD00699, CTNE0060 to CTNE00699, or PFRE0060 to PFRE00699.
Edit #1:
As user #Hao Wu mentioned, the above regex would also accept ABCD0060, which is not ideal.
Removing the 1 from the { } restricts the match to
values from ABCD00600 to ABCD00699, CTNE00600 to CTNE00699, or PFRE00600 to PFRE00699,
so the resulting regex would be
(ABCD|CTNE|PFRE)006\d{2}
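The patterns above match the code segment itself. For the negation the question asks about, one possible approach, which is not taken from the answers, is to put a negative lookahead in front of a fixed-width .{9}. The sketch below assumes the whole file name is tested with matches(); the class and method names are illustrative.

```java
import java.util.regex.Pattern;

class NegatedSegmentDemo {
    // (?!...) rejects the 9-character segment if it is one of the listed codes;
    // .{9} then consumes exactly 9 characters, so the total length stays fixed.
    static final Pattern NEGATED = Pattern.compile(
            ".{19}_.{3}PDR_.{8}(?!(?:ABCD|CTNE|PFRE)006[0-9][0-9]).{9}.{3}_.{6}\\.POC");

    static boolean accepts(String fileName) {
        return NEGATED.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Segment ABCD00719 is not in the excluded range
        System.out.println(accepts("NRM_0157F0680884976_598PDR_T0060000ABCD00719_00_6I1N0T.POC"));  // true
        // Segment ABCD00619 is excluded by the lookahead
        System.out.println(accepts("NRM_0157F0680884976_598PDR_T0060000ABCD00619_00_6I1N0T.POC"));  // false
        // Segment is one character too long
        System.out.println(accepts("NRM_0157F0680884976_598PDR_T0060000ABCD007192_00_6I1N0T.POC")); // false
        // Segment is one character too short
        System.out.println(accepts("NRM_0157F0680884976_598PDR_T0060000ABCD0061_00_6I1N0T.POC"));   // false
    }
}
```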

Regex expression is not working correctly

I have a string in which numbers are formatted like "4.97", but numbers smaller than 1 are written as .97, .80 etc. I want to find these substrings in the String and replace them so that they start with a 0.
It's working for the string
String str = "Rate is : .97";
Result : "Rate is : 0.97"
But not for the string:
String str = "Rate is : .97 . XXXXXXXXX do you want . to perform another calculation . ";
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
String pattern = "(.*\\D)(.\\d\\d.*)";
System.out.println(str.matches("(.*\\D)(.\\d\\d.*)"));
str = str.replaceAll(pattern, "$10$2");
Why is this happening?
In your second example, the .* after the last \\d will match any character except a newline which will match the rest of the string.
You might do the replacement without a capturing group using a negative lookbehind (?<!\S) to check if what is on the left is not a non whitespace char.
(?<!\S)\.[0-9]
In the replacement use a zero followed by the full match.
Regex demo | Java demo
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
String pattern = "(?<!\\S)\\.[0-9]";
System.out.println(str.replaceAll(pattern, "0$0"));
Output
Rate is : 0.97 . XXXXXXXXX do you want . 87 to perform another calculation .
If there should be a non digit before, you could make use of a positive lookbehind
(?<=\D)\.[0-9]
Regex demo
In Java
String regex = "(?<=\\D)\\.[0-9]";
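The two lookbehinds behave differently at the start of the string and after non-whitespace characters. Here is a small sketch of that difference; the class name, method names, and sample input are invented for the example.

```java
class LeadingZeroDemo {
    // Matches a dot+digit that is at the start of the input or preceded by whitespace
    static String withWhitespaceLookbehind(String s) {
        return s.replaceAll("(?<!\\S)\\.[0-9]", "0$0");
    }

    // Matches a dot+digit preceded by any non-digit character
    // (so never at the very start of the input)
    static String withNonDigitLookbehind(String s) {
        return s.replaceAll("(?<=\\D)\\.[0-9]", "0$0");
    }

    public static void main(String[] args) {
        String s = ".97 and 4.97 and x.80";
        System.out.println(withWhitespaceLookbehind(s)); // prints 0.97 and 4.97 and x.80
        System.out.println(withNonDigitLookbehind(s));   // prints .97 and 4.97 and x0.80
    }
}
```

Which variant fits depends on whether a leading .97 at the very start of the input, or a dot directly after a letter, should be rewritten.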
It looks like you need to add some lazy matching to your regex.
The ? makes the quantifier lazy: it will match as few characters as possible, so in this case it picks up only the first number and does not run on to the second.
^(.*?\D)(.\d\d.*?)
You can see this regex work here, with a more complete explanation.
I have also added the ^ start-of-string anchor to make sure that only one match is created and the pattern is not applied again at the second number.
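As a rough illustration, the lazy, anchored pattern can be combined with the "$10$2" replacement from the question. The class and method names are placeholders. Note that the . before \d\d is left unescaped, as in the answer, so it would match any character there, not only a dot.

```java
class LazyAnchoredDemo {
    // With only two groups in the pattern, Java reads "$10" as group 1 followed by a literal 0
    static String fixFirst(String s) {
        return s.replaceFirst("^(.*?\\D)(.\\d\\d.*?)", "$10$2");
    }

    public static void main(String[] args) {
        String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
        // Only the first number is touched, because of the ^ anchor
        System.out.println(fixFirst(str));
    }
}
```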
First of all, your regex pattern seems to be wrong. I think you can just use:
(\D)(\.\d+)
Find a character that is not a digit, followed by a dot and at least one digit.
Second, for replacing, you could use more low-level features, such as:
String str = "Rate is : .97 . XXXXXXXXX do you want . 87 to perform another calculation . ";
final Pattern regex = Pattern.compile("(\\D)(\\.\\d+)");
final Matcher m = regex.matcher(str);
if (m.find()) {
str = m.replaceFirst(m.group(1) + "0" + m.group(2));
}
System.out.println(str);
But of course, this works too:
str = str.replaceAll("(\\D)(\\.\\d+)", "$10$2");
You can use a positive lookahead, so that you also catch whitespace between the . and the number:
(.(?=.\d)|(\d+))+
Then in your code you can perform whatever operation you wish on group 1 and group 2.

Parse string using Java Regex Pattern?

I have the below java string in the below format.
String s = "City: [name:NYK][distance:1100] [name:CLT][distance:2300] [name:KTY][distance:3540] Price:";
Using the Matcher and Pattern classes of the java.util.regex package, I have to get the output string in the following format:
Output: [NYK:1100][CLT:2300][KTY:3540]
Can you suggest a RegEx pattern which can help me get the above output format?
You can use this regex \[name:([A-Z]+)\]\[distance:(\d+)\] with Pattern like this :
String regex = "\\[name:([A-Z]+)\\]\\[distance:(\\d+)\\]";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
StringBuilder result = new StringBuilder();
while (matcher.find()) {
result.append("[");
result.append(matcher.group(1));
result.append(":");
result.append(matcher.group(2));
result.append("]");
}
System.out.println(result.toString());
Output
[NYK:1100][CLT:2300][KTY:3540]
regex demo
\[name:([A-Z]+)\]\[distance:(\d+)\] captures two groups: the first gets the uppercase letters after [name: and the second gets the number after [distance:
Another solution, suggested by #tradeJmark, is to use named groups:
String regex = "\\[name:(?<name>[A-Z]+)\\]\\[distance:(?<distance>\\d+)\\]";
That way you can get the result of each group by its name instead of its index, like this:
while (matcher.find()) {
result.append("[");
result.append(matcher.group("name"));
//----------------------------^^
result.append(":");
result.append(matcher.group("distance"));
//------------------------------^^
result.append("]");
}
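Put together, the named-group variant might look like the following self-contained sketch. The class name and the reformat method are hypothetical; the regex and the loop are the ones shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    private static final Pattern ENTRY = Pattern.compile(
            "\\[name:(?<name>[A-Z]+)\\]\\[distance:(?<distance>\\d+)\\]");

    // Rewrites every [name:X][distance:N] pair as [X:N] and drops everything else
    static String reformat(String s) {
        Matcher matcher = ENTRY.matcher(s);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            result.append('[')
                  .append(matcher.group("name"))
                  .append(':')
                  .append(matcher.group("distance"))
                  .append(']');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String s = "City: [name:NYK][distance:1100] [name:CLT][distance:2300] [name:KTY][distance:3540] Price:";
        System.out.println(reformat(s)); // prints [NYK:1100][CLT:2300][KTY:3540]
    }
}
```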
If the format of the string is fixed, and you always have just 3 [...] groups inside to deal with, you may define a block that matches [name:...] and captures the 2 parts into separate groups and use a quite simple code with .replaceAll:
String s = "City: [name:NYK][distance:1100] [name:CLT][distance:2300] [name:KTY][distance:3540] Price:";
String matchingBlock = "\\s*\\[name:([A-Z]+)]\\[distance:(\\d+)]";
String res = s.replaceAll(String.format(".*%1$s%1$s%1$s.*", matchingBlock),
"[$1:$2][$3:$4][$5:$6]");
System.out.println(res); // [NYK:1100][CLT:2300][KTY:3540]
See the Java demo and a regex demo.
The block pattern matches:
\\s* - 0+ whitespaces
\\[name: - a literal [name: substring
([A-Z]+) - Group n capturing 1 or more uppercase ASCII chars (\\w+ can also be used)
]\\[distance: - a literal ][distance: substring
(\\d+) - Group m capturing 1 or more digits
] - a ] symbol.
In the .*%1$s%1$s%1$s.* pattern, the groups will have 1 to 6 IDs (referred to with $1 - $6 backreferences from the replacement pattern) and the leading and final .* will remove start and end of the string (add (?s) at the start of the pattern if the string can contain line breaks).

Regex - Match numbers & special cases

I'm trying to make a regex that would produce the following results :
for 7.0 + 5 - :asc + (8.256 - :b)^2 + :d/3 : 7.0, 5, :asc, 8.256, :b, 2, :d, 3
for -+*-/^^ )ç# : nothing
It should first match numbers, which can be floats, so in my regex I have: [0-9]+(\\.[0-9])? but it should also match special cases like :a or :Abc.
To be more precise, it should (if possible) match anything but mathematical operators /*+^- and parentheses.
So here is my final regex : ([0-9]+(\\.[0-9])?)|(:[a-zA-Z]+) but it's not working because matcher.groupCount() returns 3 for both of the examples I gave.
Groups are what you specifically group in the regex. Anything surrounded in parentheses is a group. (Hello) World has 1 group, Hello. What you need to be doing is finding all the matches.
In your code ([0-9]+(\\.[0-9])?)|(:[a-zA-Z]+), 3 sets of parentheses can be seen. This is why you will always be given 3 groups in every match.
Your code works fine as it is, here is an example:
String text = "7.0 + 5 - :asc + (8.256 - :b)^2 + :d/3";
Pattern p = Pattern.compile("([0-9]+(\\.[0-9]+)?)|(:[a-zA-Z]+)");
Matcher m = p.matcher(text);
List<String> matches = new ArrayList<String>();
while (m.find()) matches.add(m.group());
for (String match : matches) System.out.println(match);
The ArrayList matches will contain all of the matches that your regex finds.
The only change I made was add a + after the second [0-9].
Here is the output:
7.0
5
:asc
8.256
:b
2
:d
3
Here is some more information about groups in java.
Does that help?
Your regex is correct, run the following code:
String input = "7.0 + 5 - :asc + (8.256 - :b)^2 + :d/3"; // your input
String regex = "(\\d+(\\.\\d+)?)|(:[a-zA-Z]+)"; // yours, with + added after the fractional digit
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
Your problem is the understanding of the method matcher.groupCount(). JavaDoc clearly says
Returns the number of capturing groups in this matcher's pattern.
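The difference between groupCount() and the number of matches can be seen in a short sketch like the one below. The class name and the countMatches helper are made up for illustration; the second input leaves out the non-ASCII character from the question's example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    static final Pattern TOKEN = Pattern.compile("([0-9]+(\\.[0-9]+)?)|(:[a-zA-Z]+)");

    // Counts how many times the pattern is found in the text
    static int countMatches(String text) {
        Matcher m = TOKEN.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String expr = "7.0 + 5 - :asc + (8.256 - :b)^2 + :d/3";
        String operatorsOnly = "-+*-/^^ )#";

        // groupCount() depends only on the pattern: three pairs of capturing parentheses
        System.out.println(TOKEN.matcher(expr).groupCount());          // prints 3
        System.out.println(TOKEN.matcher(operatorsOnly).groupCount()); // prints 3

        // The number of matches depends on the input
        System.out.println(countMatches(expr));          // prints 8
        System.out.println(countMatches(operatorsOnly)); // prints 0
    }
}
```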
([^\()+\-*\s])+ //put any mathematical operator inside square bracket

Author and time matching regex

I would like to use a regex in my Java program to recognize some features of my strings.
I've this type of string:
-Author- has wrote (-hh-:-mm-)
So, for example, I've a string with:
Cecco has wrote (15:12)
and I have to extract the author, hh and mm fields. Obviously there are some restrictions to consider:
hh and mm must be numbers
author hasn't any restrictions
I've to consider space between "has wrote" and (
How can I use a regex for this?
EDIT: I attach my snippet:
String mRegex = "(\\s)+ has wrote \\((\\d\\d):(\\d\\d)\\)";
Pattern mPattern = Pattern.compile(mRegex);
String[] str = {
"Cecco CQ has wrote (14:55)", //OK (matched)
"yesterday you has wrote that I'm crazy", //NO (different text)
"Simon has wrote (yesterday)", // NO (yesterday isn't numbers)
"John has wrote (22:32)", //OK
"James has wrote(22:11)", //NO (missed space between has wrote and ()
"Tommy has wrote (xx:ss)" //NO (xx and ss aren't numbers)
};
for(String s : str) {
Matcher mMatcher = mPattern.matcher(s);
while (mMatcher.find()) {
System.out.println(mMatcher.group());
}
}
homework?
Something like:
(.+) has wrote \((\d\d):(\d\d)\)
Should do the trick
() - mark groups to capture (there are three in the above)
.+ - any chars (you said no restrictions)
\d - any digit
\(\) escape the parens as literals instead of a capturing group
use:
Pattern p = Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
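As a quick check, the pattern can be run against the test strings from the question. This is only a sketch; the class name and the describe helper are assumptions, and matches() is used so that the whole string has to fit the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AuthorTimeDemo {
    static final Pattern P = Pattern.compile("(.+) has wrote \\((\\d\\d):(\\d\\d)\\)");

    // Returns "author @ hh:mm" for a matching line, or "no match" otherwise
    static String describe(String s) {
        Matcher m = P.matcher(s);
        if (!m.matches()) {
            return "no match";
        }
        return m.group(1) + " @ " + m.group(2) + ":" + m.group(3);
    }

    public static void main(String[] args) {
        String[] lines = {
            "Cecco CQ has wrote (14:55)",             // Cecco CQ @ 14:55
            "yesterday you has wrote that I'm crazy", // no match
            "Simon has wrote (yesterday)",            // no match
            "John has wrote (22:32)",                 // John @ 22:32
            "James has wrote(22:11)",                 // no match (missing space)
            "Tommy has wrote (xx:ss)"                 // no match (not digits)
        };
        for (String line : lines) {
            System.out.println(describe(line));
        }
    }
}
```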
To cope with an optional (HH:mm) at the end you need to start to use some dark regex voodoo:
Pattern p = Pattern.compile("(.+) has wrote\\s?(?:\\((\\d\\d):(\\d\\d)\\))?");
Matcher m = p.matcher("Gareth has wrote (12:00)");
if( m.matches()){
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
m = p.matcher("Gareth has wrote");
if( m.matches()){
System.out.println(m.group(1));
// m.group(2) == null since it didn't match anything
}
The new unescaped pattern:
(.+) has wrote\s?(?:\((\d\d):(\d\d)\))?
\s? optionally matches a space (there might not be a space at the end if there isn't a (HH:mm) group)
(?: ... ) is a non-capturing group; it allows us to put ? after it to make it optional
I think #codinghorror has something to say about regex
The easiest way to figure out regular expressions is to use a testing tool before coding.
I use an eclipse plugin from http://www.brosinski.com/regex/
Using this I came up with the following result:
([a-zA-Z]*) has wrote \((\d\d):(\d\d)\)
Cecco has wrote (15:12)
Found 1 match(es):
start=0, end=23
Group(0) = Cecco has wrote (15:12)
Group(1) = Cecco
Group(2) = 15
Group(3) = 12
An excellent tutorial on regular expression syntax can be found at http://www.regular-expressions.info/tutorial.html
Well, just in case you didn't know, Matcher has a nice function that can draw out specific groups, i.e. the parts of the pattern enclosed by (): Matcher.group(int). For example, if I wanted to match a number between two colons like:
:22:
I could use the regex ":(\\d+):" to match one or more digits between two colons, and then I can fetch specifically the digits with:
Matcher.group(1)
And then it's just a matter of parsing the String into an int. As a note, group numbering starts at 1. group(0) is the whole match, so Matcher.group(0) for the previous example would return :22:
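A minimal sketch of the :22: example, assuming a helper named extract (not part of any library) that throws when no number is found:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColonNumberDemo {
    private static final Pattern BETWEEN_COLONS = Pattern.compile(":(\\d+):");

    // Hypothetical helper: returns the first number found between two colons
    static int extract(String s) {
        Matcher m = BETWEEN_COLONS.matcher(s);
        if (!m.find()) {
            throw new IllegalArgumentException("no number between colons: " + s);
        }
        // group(0) would be the whole match, e.g. ":22:"; group(1) is just the digits
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(extract(":22:")); // prints 22
    }
}
```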
For your case, I think the regex bits you need to consider are
"[A-Za-z]" for alphabet characters (you could probably also safely use "\\w", which matches alphabet characters, as well as numbers and _).
"\\d" for digits (1,2,3...)
"+" for indicating you want one or more of the previous character or group.
