How to extract multi-line text delimited by 2 strings - java

I have the following pattern:
Claims(40)
This is good.
This is good, too.
Description
This is description.
The delimiter strings in this case are:
1st delimiter: "Claims(40)"
2nd delimiter: "Description"
I want to extract text between these delimiters while excluding the delimiters.
Also, in the above text, following rules exist:
1st delimiter starts on the 1st column in the text and it's the only word on the line.
In the first delimiter, the opening parenthesis, the digits, and the closing parenthesis may be absent. However, if the opening parenthesis is present, the digits and the closing parenthesis are present too.
2nd delimiter starts on the 1st column in the text and it's the only word on the line.
My regular expression:
String regxStr = "^Claims(\\(\\d+\\)?)$(.*?)^Description$";
This doesn't work.
I tried many other regexes, but none of them worked. So finally, I resorted to a brute-force approach with the regex:
String regxStr = "Claims(.*?)Description";
But neither regex works. I can't figure out what is going wrong with them, or where.
I'm using Matcher class and find() method of Matcher class for further processing.
Please help me.

This captures the text you want, although I'm not totally clear on your requirements for the (40) part. #lovetostrike's answer addresses that.
\bClaims(?:\(\d+\))?\s+(.+?)\s+Description\b
You must activate the DOTALL flag when compiling the pattern:
Pattern.compile(regxStr, Pattern.DOTALL)
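Putting the pattern and the flag together, a minimal sketch might look like the following. The class name ClaimsExtractor and the method extract are made up for illustration, and returning null when nothing matches is an assumption rather than part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ClaimsExtractor {
    // The pattern from the answer; DOTALL lets '.' match line breaks.
    private static final Pattern BETWEEN = Pattern.compile(
            "\\bClaims(?:\\(\\d+\\))?\\s+(.+?)\\s+Description\\b", Pattern.DOTALL);

    // Returns the text between the delimiters, or null when there is no match (assumed convention).
    static String extract(String input) {
        Matcher m = BETWEEN.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "Claims(40)\nThis is good.\nThis is good, too.\nDescription\nThis is description.";
        System.out.println(extract(input));
    }
}
```

Group 1 holds only the text between the delimiters, so neither "Claims(40)" nor "Description" ends up in the result.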
Escaped in a Java string:
"\\bClaims(?:\\(\\d+\\))?\\s+(.+?)\\s+Description\\b"

Here's a one-line solution:
String target = input.replaceAll("(?s).*Claims(?:\\(\\d+\\))?\\s+(.*?)\\s*Description.*", "$1");

In addition to #aliteralmind's answer: regex isn't a good tool for nested structures, i.e. matching paren pairs. But in your simple case, you can use the OR operator, '|', in your pattern. The outer parens group the two alternatives for the OR operator: the first part with parens, the second without.
(\\(\\d+\\)|\\d+)

Related

Regex for finding the text inside parentheses followed by #en : "example"#en [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
(["']) match a quote; ((?=(\\?))\2.) if a backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, so as not to eat the closing quote); \1 match the same quote that was used for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
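For readers working in Java, a rough sketch of the same negated-character-class idea might look like this. The class QuotedValues and its extract method are hypothetical names, not something from the answers above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class QuotedValues {
    // Negated character class: everything up to the next double quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    static List<String> extract(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            values.add(m.group(1)); // group 1 excludes the surrounding quotes
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("\"Foo Bar\" \"Another Value\" something else"));
    }
}
```

Note that this simple form does not handle escaped quotes inside the value.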
Let's look at two efficient ways to deal with escaped quotes. These patterns are not designed to be concise or pretty, but to be efficient.
These approaches use first-character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to quickly discard characters that are not quotes without testing the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously, to deal with strings that don't have balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can also decide that a quoted part runs from an opening quote to the next (non-escaped) quote or to the end of the string. In that case there is no need for possessive quantifiers; you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
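As a hedged illustration of the unrolled-loop content subpattern for backslash-escaped double quotes, here is a small Java sketch. The class EscapedQuotes, the method contents, and the capture group around the content are additions for the example; they are not part of the patterns in this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class EscapedQuotes {
    // Unrolled loop: a run of ordinary characters, then any number of
    // (backslash + any character, run of ordinary characters) repetitions.
    // DOTALL lets the escaped character be a line break.
    private static final Pattern DOUBLE_QUOTED = Pattern.compile(
            "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"", Pattern.DOTALL);

    // Returns the raw content of each double-quoted part, escapes left untouched.
    static List<String> contents(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = DOUBLE_QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // The text being scanned is: say "a \"b\" c" and "d"
        System.out.println(contents("say \"a \\\"b\\\" c\" and \"d\""));
    }
}
```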
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation instead, but with ["'] factored out at the beginning.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind looks behind the current character to check for a quote and, if one is found, the match starts from there; the lookahead then checks the character ahead for a quote and, if one is found, stops on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start; this is then used in the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The RegEx of the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string from $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
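A minimal Java sketch of the grouped variant, assuming you want the content from group 2, could look like this. The class BackrefQuotes and the method contents are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class BackrefQuotes {
    // Group 1: the opening quote. Group 2: the content, where each step is either
    // a character that is neither the opening quote nor a backslash, or a
    // backslash followed by any character.
    private static final Pattern QUOTED = Pattern.compile(
            "(['\"])((?:(?!\\1|\\\\).|\\\\.)*)\\1");

    static List<String> contents(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // The text being scanned is: He said "it's \"fine\"" and 'ok'
        System.out.println(contents("He said \"it's \\\"fine\\\"\" and 'ok'"));
    }
}
```

The escape sequences stay in the extracted content as written; unescaping them would be a separate step.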
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
A very late answer, but I'd like to add one:
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad but could be better). Mine below is ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution I used:
\"([^\"]*?icon[^\"]*?)\"
TL;DR:
Replace the word icon with what you're looking for in said quotes and voila!
The way this works is that it looks for the keyword and doesn't care what else is between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
The regex looks for a quote mark ",
then it looks for any run of characters that are not ",
until it finds icon,
then any run of characters that are not ",
and finally it looks for a closing ".
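The steps above can be sketched in Java as follows. The class KeywordInQuotes and its find method are hypothetical, and passing the keyword through Pattern.quote is an added precaution that the original pattern does not have.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class KeywordInQuotes {
    // Finds double-quoted values that contain the given keyword.
    static List<String> find(String input, String keyword) {
        // Pattern.quote keeps regex metacharacters in the keyword literal (assumed precaution).
        Pattern p = Pattern.compile("\"([^\"]*?" + Pattern.quote(keyword) + "[^\"]*?)\"");
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "id=\"fb-icon\" id=\"icon-close\" id=\"large-icon-close\" id=\"other\"";
        System.out.println(find(input, "icon"));
    }
}
```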
I liked Axeman's more expansive version, but had some trouble with it. For example, it didn't match
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
import re
string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
Just try this out, it works like a charm!
The \ escapes the quote character that follows it in the string literal.
Unlike Adam's answer, I have a simple one that works:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answers above are good, except that they DO NOT support all Unicode characters in ECMAScript (JavaScript).
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedily matches any characters, zero or more times, up to the end of the line; the regex engine then backtracks until the remainder of the pattern (the closing ' or ") can match.
\1 -> Matches the same text that was matched earlier by the first capture group.
(?![^\s]) -> Negative lookahead to ensure there is no non-whitespace character immediately after the previous match.
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity. Also, using the non-greedy version with this sed command, we first throw out the junk before and after the ""'s, then replace this with the part between the ""'s and surround it with ><'s.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match; no partial match should trigger a hit,
e.g. "test" could not match for "test2".
import re
reg = r"""(['"])(%s)\1"""
if re.search(reg % needle, haystack, re.IGNORECASE):
    print("winning...")
Hunter
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
A supplementary answer for Microsoft VBA coders only: using the library Microsoft VBScript Regular Expressions 5.5 gives the following code:
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub

Java Regex to validate String

I have just bought a book on Regex to try and get my head around it but I'm still really struggling with it. I am trying to create a java regex that will satisfy a string configuration that can;
Can contain lowercase letters ([a-z])
Can contain commas (,) but only between words
Can contain colon (:) but must be separated by words or multiply (*)
Can contain hyphens (-) but must be separated by words
Can contain multiply (*) but if used it must be the only character before/between/after the colon
Cannot contain spaces; 'words' are delimited by a hyphen (-), a comma (,), a colon (:), or the end of the string
So for example the following would be true:
foo:bar
foo-bar:foo
foo,bar:foo
foo-bar,foo:bar,foo-bar
foo:bar:foo,bar
*:foo
foo:*
*:*:*
But the following would be false:
foo :bar
,foo:bar
foo-:bar
-foo:bar
foo,:bar-
foo:bar,
foo,*:bar
foo-*:bar
This is what I have so far:
^[a-z-]|*[:?][a-z-]|*[:?][a-z-]|*
Here is a regex that will work for all your cases:
([a-z]+([,-][a-z]+)*|\*)(:([a-z]+([,-][a-z]+)*|\*))*
Here is a detailed analysis:
One of the basic structures used to build complicated regular expressions like this is actually pretty simple, and has the form text(separator text)*. A regex of that form will match:
one text
one text, a separator, and another text
one text, a separator, another text, another separator, and yet another text
or more, just add another separator and a text to the end.
So here is a breakdown of the code:
[a-z]+([,-][a-z]+)* is an instance of the pattern I discussed above: the text here is [a-z]+, and the separator is [,-].
([a-z]+([,-][a-z]+)*|\*) allows an asterisk to be matched instead.
([a-z]+([,-][a-z]+)*|\*)(:([a-z]+([,-][a-z]+)*|\*))* is another instance of the pattern I discussed above: the text is ([a-z]+([,-][a-z]+)*|\*), and the separator is :.
If you plan to use this as a component of an even larger regex, in which the group matches will be important, I would recommend making the internal parens non-grouping, and place grouping parens around the entire regex, like so:
((?:[a-z]+(?:[,-][a-z]+)*|\*)(?::(?:[a-z]+(?:[,-][a-z]+)*|\*))*)
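To try the text(separator text)* structure against the question's examples, a small Java sketch along these lines could help. It uses the form of the regex given in the breakdown, with non-capturing groups; the class ConfigValidator and the method isValid are invented names for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ConfigValidator {
    // segment = words joined by ',' or '-', or a lone '*'; segments are joined by ':'.
    private static final Pattern VALID = Pattern.compile(
            "(?:[a-z]+(?:[,-][a-z]+)*|\\*)(?::(?:[a-z]+(?:[,-][a-z]+)*|\\*))*");

    static boolean isValid(String s) {
        // matches() requires the whole string to fit the pattern.
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"foo:bar", "foo-bar,foo:bar,foo-bar", "*:*:*", "foo-:bar", "foo,*:bar"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Because matches() anchors the pattern at both ends, no explicit ^ or $ is needed.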
We rarely see somebody here who can define positive and negative test cases. That makes life a lot easier.
Here's my regex with a 95% solution:
"(([a-z]+|\\*)[:,-])*([a-z]+|\\*)" (JAVA-Version)
(([a-z]+|\*)[:,-])*([a-z]+|\*) (plain regex)
It simply differentiates between words (a-z or *) and separators (one of :-,); it must contain at least one word, and words must be separated by a separator. It works for the positive cases and for the negative cases except the last two negative ones.
One remark: in real life, such a complex "syntax" would be implemented with a grammar definition tool like ANTLR (or, a few years ago, with lex/yacc or flex/bison). Regex can do that, but it will not be easy to maintain.

Remove everything from a string upto a certain character and optionally a string if it follows too

I am looking to write a regex that can remove any characters upto the first &emsp and if there is a (new section) following &emsp then remove that as well. But the following regex doesn't seem to work. Why? How do I correct this?
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
Pattern removeEmspPattern1 = Pattern.compile("(.*( (\\(new section\\)))?)(.*)", Pattern.MULTILINE);
System.out.println(removeEmspPattern1.matcher(removeEmsp).replaceAll("$2"));
Have you tried String.split? It creates an array of strings from a string, based on a delimiter.
Once you have the string split, just select the elements of the array that you need for print statement.
Read more here
Your regex is very long and I do not want to debug it. However, the tip is that some characters have a special meaning in regular expressions. For example, square brackets define character classes, and inside a class && means intersection. Such characters must be escaped if you want them to be interpreted as plain characters and not as regex commands. To escape a special character, write \ in front of it. But \ is the escape character in Java string literals too, so it must be doubled.
For example, to replace an ampersand with the letter A you would write str.replaceAll("\\&", "A")
Now you have all the information you need. Try to start from a simpler regex and then expand it to what you need. Good luck.
EDIT
BTW, parsing XML and/or HTML using regular expressions is possible but strongly discouraged. Use a dedicated parser for such formats.
Try this:
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
System.out.println(removeEmsp.replaceFirst("^.*?\\ (\\(new\\ssection\\))?", ""));
System.out.println(removeEmsp.replaceAll("^.*?\\ (\\(new\\ssection\\))?", ""));
Output:
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
It will remove everything up to " " and optionally, the following "(new section)" text if any.

capture all characters between match character (single or repeated) on string

I'm trying to extract the string preceding a specific character, even when that character is repeated (e.g. underscore '_'), like this:
this_is_my_example_line_0
this_is_my_example_line_1_
this_is_my_example_line_2___
_this_is_my_ _example_line_3_
__this_is_my___example_line_4__
and after running my regex I should get this (the regex should ignore any instances of the matching character in the middle of the string):
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4
In other words I'm trying to 'trim' the matched character(s) at the beginning and end of string.
I'm trying to use a Regex in Java to accomplish this, my idea is to capture the group of characters between the special character(s) at the end or beginning of the line.
So far I can only do this successfully for example 3 with this regexp:
/[^_]+|_+(.*)[_$]+|_$+/
[^_]+ not 'underscore' once or more
| OR
_+ underscore once or more
(.*) capture all characters
[_$]+ not 'underscore' once or more followed by end of line
|_$+ OR 'underscore' once or more followed by end of line
I just realized that this excludes the first word of the message on example 0,1,2 since the string doesn't start with underscore and it only starts matching after finding a underscore..
Is there an easier way not involving regex?
I don't really care about the first character (although it would be nice); I only need to ignore the repeating character at the end. According to this regex tester, it looks like just doing /()_+$/ would work: the empty parentheses match anything before a single or repeated match at the end of the line. Would that be correct?
Thank you!
There are a couple of options here: you could either replace matches of ^_+|_+$ with an empty string, or extract the contents of the first capture group from the match of ^_*(.*?)_*$. Note that if your strings may span multiple lines and you want to perform the replacement on each line, then you will need to use the Pattern.MULTILINE flag for either approach. If your strings may span multiple lines and you only want the replacement to occur at the very beginning and end, don't use Pattern.MULTILINE but use Pattern.DOTALL for the second approach.
For example: http://regexr.com?355ff
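Both options can be sketched in Java roughly as follows, for single-line input. The class UnderscoreTrimmer and its two method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class UnderscoreTrimmer {
    private static final Pattern CAPTURE = Pattern.compile("^_*(.*?)_*$");

    // Option 1: delete the leading and trailing runs of underscores.
    static String trimByReplace(String s) {
        return s.replaceAll("^_+|_+$", "");
    }

    // Option 2: capture what lies between the leading and trailing underscores.
    static String trimByCapture(String s) {
        Matcher m = CAPTURE.matcher(s);
        return m.matches() ? m.group(1) : s;
    }

    public static void main(String[] args) {
        System.out.println(trimByReplace("__this_is_my___example_line_4__"));
        System.out.println(trimByCapture("_this_is_my_ _example_line_3_"));
    }
}
```

Underscores in the middle of the string are left alone in both variants, since only the runs anchored at the start and end are affected.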
How about [^_\n\r](.*[^_\n\r])??
Demo
String data =
        "this_is_my_example_line_0\n" +
        "this_is_my_example_line_1_\n" +
        "this_is_my_example_line_2___\n" +
        "_this_is_my_ _example_line_3_\n" +
        "__this_is_my___example_line_4__";
Pattern p = Pattern.compile("[^_\n\r](.*[^_\n\r])?");
Matcher m = p.matcher(data);
while (m.find()) {
    System.out.println(m.group());
}
output:
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4

Regular expression to match a character only once before any whitespace

In Java, what regular expression would I use to match a string that has exactly one colon and makes sure that the colon appears before any whitespace?
For example, it should match these strings:
label: print "Enter input"
But: I still had the money.
ghjkdhfjkgjhalergfyujhrageyjdfghbg:
area:54
But not
label: print "Enter input:"
There was one more thing: I still had the money.
ghfdsjhgakjsdhfkjdsagfjkhadsjkhflgadsjklfglsd
area::54
If you use it with matches (which requires to match the entire string), you could use
[^\\s:]*:[^:]*
Which means: arbitrarily many non-whitespace, non-: characters, then a :, then more arbitrarily many non-: characters.
I've really only used two regex concepts: (negated) character classes and repetition.
If you want to require at least one character before or after :, replace the corresponding * with + (as jlordo pointed out in a comment).
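As a quick check against the examples in the question, a sketch like the one below could be used. The class ColonCheck and the method hasSingleLeadingColon are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ColonCheck {
    // Non-whitespace, non-colon characters, then one colon, then no further colons.
    private static final Pattern ONE_COLON = Pattern.compile("[^\\s:]*:[^:]*");

    static boolean hasSingleLeadingColon(String s) {
        // matches() must consume the entire string.
        return ONE_COLON.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasSingleLeadingColon("label: print \"Enter input\""));
        System.out.println(hasSingleLeadingColon("area::54"));
    }
}
```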
The following should work:
^[^\s:]*:(?!.*:)
If your strings can contain line breaks, use the DOTALL flag or change the regex to the following:
(?s)^[^\s:]*:(?!.*:)
It depends on what we call white space, it could be
[^\\p{Space}:]*:[^:]*
The following should get you started:
Matcher MatchedPattern = Pattern.compile("^(\\w+\\:{1}[\"\\w\\s\\.]*)$").matcher("yourstring");
