Regex to find missing double quote in csv - java

We are processing csv files which contain lines with non-closed double quoted entries. These blow up the csv parser, so I am trying to put together a regex which will identify these lines so we can delete them from the files before trying to process them.
In the following example, the csv parser gets to line 2 and includes everything up to the first double quote in line 3 before trying to close out the token and then blows up because there are non-whitespace characters after the "closing" double quote before the next comma.
Example Line 1,some data,"good line",processes fine,happy
Example Line 2,some data,"bad line,processes poorly,unhappy
Example Line 3,some data,"good line",dies before here,unhappy
I am trying to do something like:
.*,"[^(",)]*[\r\n]
The idea is finding a single line with anything followed by ," without an instance of ", which follows before the line ends.
The negation of the sequence is not working though. How is something like this done?
NOTE:
Since people keep suggesting essentially checking for an even number of double quotes, it's worth noting that a single double-quoted csv entry could contain a standalone double quote (e.g. ...,"Measurement: 1' 2"",...).

You can use:
int count = str.length() - str.replace("\"", "").length();
if (count % 2 == 0) {
    // even number of double quotes: do what you want
}

With your current requirements (including your concern about "Measurement: 1' 2""), this will select the bad lines:
^.*(?:^|,)[^",]*"(?:[^",]*(?:"[^",]*")?)+(?:$|,.*)
The ^ anchors at the top of the string
The .*(?:^|,) eats up any characters up to the top of the string or a comma
We match a "...
and, once or more times, [^",]*(?:"[^",]*")? matches characters that are neither a " nor a comma, and, optionally, a balanced set of quotes: "[^",]*"
We either match the end of the string, or a comma and anything that follows
A note about escaped double quotes
You may have, in your input, double-quoted strings that contain an escaped double quote, like this: "abc\"de" If so, we need to replace our expression for double-quoted strings (?:"[^",]*") with something more solid: (?:"(?:\\"|[^"])*")
Hence the whole regex would become:
^.*(?:^|,)[^",]*"(?:[^",]*(?:"(?:\\"|[^"])*")?)+(?:$|,.*)

Something like this should work:
^[^"]*("[^"]*"[^"]*)*[^"]*$
The [^"]* that you see repeated all over the place means "any number of non-quote characters".
The ("[^"]*"[^"]*)* will match paired quotes while the [^"]*s will match the unquoted text before and after the final quotes.
The ^ and $ anchors ensure that we're matching the whole line, not just a portion of it.
Essentially: if there's an even number of quotes it will match. If there is an odd number of quotes, it will fail.
Here's an example of the regex in action.
If whatever solution you're working in has the option, there's a much simpler method that doesn't involve regular expressions. Simply count the number of double quotes in the CSV line. If it's odd, the line has a mismatched quote.

This was a regex someone else gave me the framework for that ended up working with a few modifications:
This will match anything followed by ," with or without spaces in between, not followed eventually by a ", (also with potential white space) and finally ending in a newline.
.*,[\s]*"(?!.*"[\s]*,).*\n
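A minimal Java sketch of how this last pattern could be used to drop the flagged lines before parsing. The class and method names (UnclosedQuoteFilter, keepGoodLines) are made up for illustration, and the pattern is applied one line at a time, so the trailing \n is left out. Note that a line whose last field is quoted would also be flagged, because no ", follows its closing quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class UnclosedQuoteFilter {
    // The pattern from above, minus the trailing \n, since it is matched per line.
    private static final Pattern BAD_LINE =
            Pattern.compile(".*,\\s*\"(?!.*\"\\s*,).*");

    // Returns only the lines that the pattern does not flag as unclosed.
    static List<String> keepGoodLines(List<String> lines) {
        List<String> good = new ArrayList<>();
        for (String line : lines) {
            if (!BAD_LINE.matcher(line).matches()) {
                good.add(line);
            }
        }
        return good;
    }
}
```

Calling UnclosedQuoteFilter.keepGoodLines(...) on the three example lines from the question keeps lines 1 and 3 and drops line 2.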

Regex doesn't really work reliably for that as there are many edge cases. You should try univocity-parsers as it is the only CSV parser I know that handles unescaped quotes properly.
It gives you the following options:
STOP_AT_CLOSING_QUOTE - If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.
STOP_AT_DELIMITER - If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until a delimiter or line ending is found in the input.
SKIP_VALUE - If unescaped quotes are found in the input, the content parsed for that value is skipped until the next delimiter is found, and a null is produced instead.
RAISE_ERROR - Throws an exception if unescaped quotes are found in the input
Use it like this:
CsvParserSettings settings = new CsvParserSettings();
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER);
CsvParser parser = new CsvParser(settings);
for(String row[] : parser.iterate(input)){
System.out.println(Arrays.toString(row));
}
Hope it helps. By default it runs with the STOP_AT_DELIMITER setting.
Disclaimer: I'm the author of this library. It's open-source and free (Apache 2.0 license)

Related

Regex for finding the text inside parentheses followed by #en : "example"#en [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']
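For readers working in Java, roughly the same extraction could look like the sketch below. The class and method names (QuotedValues, extract) are illustrative only; the pattern is the non-greedy "(.*?)" from above, and capture group 1 holds the text without the quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValues {
    // Non-greedy match between two double quotes; group 1 is the content.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    static List<String> extract(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }
}
```

For the input "Foo Bar" "Another Value" something else, extract returns the two values Foo Bar and Another Value.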
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
(["']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, so as not to eat the closing quote); \1 match the same quote that was used for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
Let's look at two efficient ways to deal with escaped quotes. These patterns are not designed to be concise or pretty, but to be efficient.
They rely on first-character discrimination to find quotes in the string quickly, without the cost of an alternation. (The idea is to discard characters that are not quotes quickly, without testing both branches of the alternation.)
The content between quotes is described with an unrolled loop (instead of a repeated alternation), which is also more efficient: [^"\\]*(?:\\.[^"\\]*)*
To deal with strings whose quotes are not balanced, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround that emulates them, to prevent too much backtracking. You can also decide that a quoted part runs from an opening quote to the next (non-escaped) quote or to the end of the string. In that case there is no need for possessive quantifiers; you only need to make the last quote optional.
Note: sometimes quotes are escaped not with a backslash but by doubling the quote. In that case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid a capture group and a backreference (something like (["']).....\1) and use a simple alternation instead, with ["'] factored out at the beginning.
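As an illustration in Java, the unrolled-loop content subpattern can be wrapped in double quotes and used with a Matcher. This is only a sketch for the double-quote case; the class and method names (UnrolledQuotes, findQuoted) are assumptions, and DOTALL is switched on so that an escaped line break is accepted too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledQuotes {
    // Regex: "[^"\\]*(?:\\.[^"\\]*)*"  (opening quote, unrolled content, closing quote)
    private static final Pattern QUOTED = Pattern.compile(
            "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"", Pattern.DOTALL);

    // Returns every double-quoted string, quotes included, honoring backslash escapes.
    static List<String> findQuoted(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

An escaped quote inside a string no longer ends the match, so say "a \"quoted\" word" and "b" yields two matches rather than three.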
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind looks behind the current character to check for a quote and, if one is found, the match starts from there. The lookahead checks the character ahead for a quote and, if one is found, the match stops on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start; this group is then used in the lookahead at the end (?=\1) to make sure the match only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The regex in the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are regexes that return only the values between the quotation marks (as the questioner asked for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string form $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
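A short Java sketch of the capture-group variant (['"])((?:(?!\1|\\).|\\.)*)\1, reading the content from group 2. The class and method names (QuoteContents, contents) are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteContents {
    // Group 1: the opening quote character; group 2: the content between the quotes.
    private static final Pattern QUOTED = Pattern.compile(
            "(['\"])((?:(?!\\1|\\\\).|\\\\.)*)\\1");

    static List<String> contents(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }
}
```

Escape sequences are kept as they appear in the input, so the backslash in it\'s fine is still part of the returned content.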
A very late answer, but I'd still like to contribute:
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad but could be better). Mine below is ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution I used:
\"([^\"]*?icon[^\"]*?)\"
TL;DR:
Replace the word icon with what you're looking for inside the quotes, and voila!
This works by looking for the keyword without caring what else is between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters thats not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "
I liked Axeman's more expansive version, but had some trouble with it. For example, it didn't correctly match
foo "string \\ string" bar
or
foo "string1" bar "string2"
so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
import re
string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
Just try this out; it works like a charm!
The \ before a quote escapes it inside the Python string literal.
Unlike Adam's answer, mine is simple but works:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answers above are good, except that they do NOT support all Unicode characters in ECMAScript (JavaScript)!
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedy approach to continue matching everything zero or more times until it encounters ' or " at end of the string. After encountering such state, regex engine backtrack to previous matching character and here regex is over and will move to next regex.
\1 -> Matches to the character or string that have been matched earlier with the first capture group.
(?![^\s]) -> Negative lookahead to ensure there should not any non space character after the previous match
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match; no partial match should trigger a hit,
e.g. "test" could not match for "test2".
import re
reg = r"""(['"])(%s)\1"""
if re.search(reg % needle, haystack, re.IGNORECASE):
    print("winning...")
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
A supplementary answer for the subset of Microsoft VBA coders only: one uses the library Microsoft VBScript Regular Expressions 5.5, which gives the following code:
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub

Escape 'escape character' in between double quotes in a string

I have python code in String format
"input = \"1\n2\t3\n4\"\nprint(input)"
I want to escape any characters that only occur between double quotes or single quote.
The final string should look like this
"input = \"1\\n2\\t3\n4\"\nprint(input)"
I tried doing this but doesn't work.
code.replaceAll("(\")[\n\t\b]*(\")", "\"\\n\"")
You want to find all characters after a quote (") that are non-quotes, up to the next quote.
Regex to find all texts between two quotes:
"[^"]*"
Reading the pattern from left to right:
"      start with a "
[^"]   characters that are NOT "
*      MANY repetitions of that
"      and terminated with a "
But now you don't want ONE finding for such a string between " and " - you want each character to be a unique finding.
At this point, standard regex can't do the job for you:
You could get just the first character of such a finding with "([^"])[^"]*" but then your next question is how to get 2nd, 3rd etc finding. You might think about adding variable-length identifiers before and after the match.. but even with a regex like "[^"]*([^"])[^"]*" you will always get just ONE match for the finding between both "". Regex does not support a concept of looping through findings; wildcards are always evaluated as max-match.
So you will need something different.
I'd recommend to search for first position of " within your string, e.g. with String.indexOf(...) and then loop through the string until you find the next (terminating) quote.
For all characters in between, you can replace it.
You will therefore work with a separate data output variable.
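The loop described above could be sketched in Java as follows. This is an assumption-laden illustration, not the asker's code: the class and method names (QuoteEscaper, escapeInsideQuotes) are invented, only double quotes are tracked, and only newline and tab characters are rewritten.

```java
class QuoteEscaper {
    // Rewrites newline and tab characters as \n and \t, but only between double quotes.
    static String escapeInsideQuotes(String code) {
        StringBuilder out = new StringBuilder();   // separate output variable
        boolean inQuotes = false;
        for (int i = 0; i < code.length(); i++) {
            char c = code.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;              // entering or leaving a quoted section
                out.append(c);
            } else if (inQuotes && c == '\n') {
                out.append("\\n");                 // backslash followed by the letter n
            } else if (inQuotes && c == '\t') {
                out.append("\\t");                 // backslash followed by the letter t
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Characters outside the quotes, such as the line break between two statements, pass through unchanged.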

Parsing BibTeX record with Java RegEx

I have to write simple BibTeX parser using Java regular expressions. Task is a bit simplified: every tag value is between quotation marks "", not brackets {}. The thing is, {} can be inside "".
I'm trying to cut single records from entire String file, e. g. I want to get #book{...} as String. The problem is that there can be no comma after last tag, so it can end like: author = "john"}.
I've tried #\w*\{[\s\S]*?\}, but it stops if I have } in any tag value between "". There is also no guarantee that } will be in separate line, it can be directly after last tag value (which may not end with " either, since it can be an integer).
Can you help me with this?
You could try the following expression as a basis: #\w+\{(?>\s*\w+\s*=\s*"[^"]*")*\}
Explanation:
#\w+\{...\} would be the record, e.g. #book{...}
(?>...)* is an atomic (non-capturing, non-backtracking) group that can occur multiple times or not at all; it represents the tags
\s*\w+\s*=\s*"[^"]*" would mean a tag which could be preceded by whitespace (\s*). The tag's value has to be in double quotes and anything between double quotes will be consumed, even curly braces.
Note that there might be some more cases to take into account but this should be able to handle curly braces in tag values because it will "consume" every content between the double quotes, thus it wouldn't match if the closing curly brace were missing (e.g. it would match #book{ title="the use of { and }" author="John {curly} Johnson"} but not #book{ title="the use of { and }" author="John {curly} Johnson").
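A small Java sketch of that expression in use, under the same simplifications as the answer (no record key, no commas between tags, every value in double quotes). The class and method names (BibRecordFinder, findRecords) are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BibRecordFinder {
    // Record start, any number of name = "value" tags (atomic group), closing brace.
    private static final Pattern RECORD = Pattern.compile(
            "#\\w+\\{(?>\\s*\\w+\\s*=\\s*\"[^\"]*\")*\\}");

    // Returns every complete record found in the text.
    static List<String> findRecords(String text) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            records.add(m.group());
        }
        return records;
    }
}
```

Curly braces inside a quoted value are consumed along with the value, so they do not end the record early.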
I've found a hack, it may help someone with same problem: there must be new line character after } sign. If end of value is only " (} sign doesn't end any value), then [\r\n] at the end of regex will suffice.

How to split a string in java using (,) with certain conditions [duplicate]

I would like to find a regex that will pick out all commas that fall outside quote sets.
For example:
'foo' => 'bar',
'foofoo' => 'bar,bar'
This would pick out the single comma on line 1, after 'bar',
I don't really care about single vs double quotes.
Has anyone got any thoughts? I feel like this should be possible with readaheads, but my regex fu is too weak.
This will match any string up to and including the first non-quoted ",". Is that what you are wanting?
/^([^"]|"[^"]*")*?(,)/
If you want all of them (and as a counter-example to the guy who said it wasn't possible) you could write:
/(,)(?=(?:[^"]|"[^"]*")*$)/
which will match all of them. Thus
'test, a "comma,", bob, ",sam,",here'.gsub(/(,)(?=(?:[^"]|"[^"]*")*$)/,';')
replaces all the commas not inside quotes with semicolons, and produces:
'test; a "comma,"; bob; ",sam,";here'
If you need it to work across line breaks just add the m (multiline) flag.
The regexes below match all the commas that are outside double quotes:
,(?=(?:[^"]*"[^"]*")*[^"]*$)
Or (PCRE only):
"[^"]*"(*SKIP)(*F)|,
"[^"]*" matches a whole double-quoted block. For the input buz,"bar,foo" this part matches "bar,foo" only. The following (*SKIP)(*F) then makes that match fail and skips past it. The engine moves on to the pattern after the | symbol and tries it against the rest of the string, so the , after | matches only the comma right after buz. It does not match the comma inside the double quotes, because the double-quoted part has already been skipped.
The regex below matches all the commas that are inside double quotes:
,(?!(?:[^"]*"[^"]*")*[^"]*$)
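In Java, the lookahead version (the (*SKIP)(*F) verbs are PCRE-only) can be passed straight to String.split. A minimal sketch, assuming well-formed double quotes; the class and method names (CsvSplitter, splitOutsideQuotes) are made up for illustration, and very long lines may be slow or run into the stack overflow reported further down for a similar pattern.

```java
class CsvSplitter {
    // A comma followed by an even number of double quotes up to the end of the line.
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // The limit -1 keeps trailing empty fields instead of dropping them.
    static String[] splitOutsideQuotes(String line) {
        return line.split(COMMA_OUTSIDE_QUOTES, -1);
    }
}
```

For the line test, a "comma,", bob, ",sam,",here this gives five fields, with the commas inside the quoted parts left intact.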
While it's possible to hack it with a regex (and I enjoy abusing regexes as much as the next guy), you'll get in trouble sooner or later trying to handle substrings without a more advanced parser. Possible ways to get in trouble include mixed quotes, and escaped quotes.
This function will split a string on commas, but not those commas that are within a single- or double-quoted string. It can be easily extended with additional characters to use as quotes (though character pairs like « » would need a few more lines of code) and will even tell you if you forgot to close a quote in your data:
function splitNotStrings(str){
  var parse=[], inString=false, escape=0, end=0
  for(var i=0, c; c=str[i]; i++){ // looping over the characters in str
    if(c==='\\'){ escape^=1; continue} // 1 when odd number of consecutive \
    if(c===','){
      if(!inString){
        parse.push(str.slice(end, i))
        end=i+1
      }
    }
    else if(splitNotStrings.quotes.indexOf(c)>-1 && !escape){
      if(c===inString) inString=false
      else if(!inString) inString=c
    }
    escape=0
  }
  // now we finished parsing, strings should be closed
  if(inString) throw SyntaxError('expected matching '+inString)
  if(end<i) parse.push(str.slice(end, i))
  return parse
}
splitNotStrings.quotes="'\"" // add other (symmetrical) quotes here
Try this regular expression:
(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*=>\s*(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*,
This does also allow strings like “'foo\'bar' => 'bar\\',”.
MarkusQ's answer worked great for me for about a year, until it didn't. I just got a stack overflow error on a line with about 120 commas and 3682 characters total. In Java, like this:
String[] cells = line.split("[\t,](?=(?:[^\"]|\"[^\"]*\")*$)", -1);
Here's my extremely inelegant replacement that doesn't stack overflow:
private String[] extractCellsFromLine(String line) {
    List<String> cellList = new ArrayList<String>();
    while (true) {
        String[] firstCellAndRest;
        if (line.startsWith("\"")) {
            firstCellAndRest = line.split("([\t,])(?=(?:[^\"]|\"[^\"]*\")*$)", 2);
        }
        else {
            firstCellAndRest = line.split("[\t,]", 2);
        }
        cellList.add(firstCellAndRest[0]);
        if (firstCellAndRest.length == 1) {
            break;
        }
        line = firstCellAndRest[1];
    }
    return cellList.toArray(new String[cellList.size()]);
}
#SocialCensus, The example you gave in the comment to MarkusQ, where you throw in ' alongside the ", doesn't work with the example MarkusQ gave right above that if we change sam to sam's: (test, a "comma,", bob, ",sam's,",here) has no match against (,)(?=(?:[^"']|["|'][^"']")$). In fact, the problem itself, "I don't really care about single vs double quotes", is ambiguous. You have to be clear what you mean by quoting either with " or with '. For example, is nesting allowed or not? If so, to how many levels? If only 1 nested level, what happens to a comma outside the inner nested quotation but inside the outer nesting quotation? You should also consider that single quotes happen by themselves as apostrophes (ie, like the counter-example I gave earlier with sam's). Finally, the regex you made doesn't really treat single quotes on par with double quotes since it assumes the last type of quotation mark is necessarily a double quote -- and replacing that last double quote with ['|"] also has a problem if the text doesn't come with correct quoting (or if apostrophes are used), though, I suppose we probably could assume all quotes are correctly delineated.
MarkusQ's regexp answers the question: find all commas that have an even number of double quotes after it (ie, are outside double quotes) and disregard all commas that have an odd number of double quotes after it (ie, are inside double quotes). This is generally the same solution as what you probably want, but let's look at a few anomalies. First, if someone leaves off a quotation mark at the end, then this regexp finds all the wrong commas rather than finding the desired ones or failing to match any. Of course, if a double quote is missing, all bets are off since it might not be clear if the missing one belongs at the end or instead belongs at the beginning; however, there is a case that is legitimate and where the regex could conceivably fail (this is the second "anomaly"). If you adjust the regexp to go across text lines, then you should be aware that quoting multiple consecutive paragraphs requires that you place a single double quote at the beginning of each paragraph and leave out the quote at the end of each paragraph except for at the end of the very last paragraph. This means that over the space of those paragraphs, the regex will fail in some places and succeed in others.
Examples and brief discussions of paragraph quoting and of nested quoting can be found here http://en.wikipedia.org/wiki/Quotation_mark .

How to extract multi-line text delimited by 2 strings

I have the following text:
Claims(40)
This is good.
This is good, too.
Description
This is description.
The delimiter strings in this case are:
1st delimiter: "Claims(40)"
2nd delimiter: "Description"
I want to extract text between these delimiters while excluding the delimiters.
Also, in the above text, following rules exist:
1st delimiter starts on the 1st column in the text and it's the only word on the line.
In the first delimiter, the opening parenthesis, the digits, and the closing parenthesis may be absent. However, if the opening parenthesis is present, the digits and the closing parenthesis are present too.
2nd delimiter starts on the 1st column in the text and it's the only word on the line.
My regular expression:
String regxStr = "^Claims(\\(\\d+\\)?)$(.*?)^Description$";
This doesn't work.
I tried many other regexes, but none of them worked. So finally, I resorted to a brute-force approach with the regex:
String regxStr = "Claims(.*?)Description";
But neither regex works. I can't figure out what is wrong with them.
I'm using the Matcher class and its find() method for further processing.
Please help me.
This captures the text you want, although I'm not totally clear on your requirements for the (40) part. @lovetostrike's answer addresses that.
\bClaims(?:\(\d+\))?\s+(.+?)\s+Description\b
You must activate the DOTALL flag when compiling the pattern:
Pattern.compile(regxStr, Pattern.DOTALL)
Escaped in a Java string:
"\\bClaims(?:\\(\\d+\\))?\\s+(.+?)\\s+Description\\b"
Here's a one-line solution (the (?s) flag lets . match line breaks, and the first group is non-capturing so that $1 is the text between the delimiters):
String target = input.replaceAll("(?s).*Claims(?:\\(\\d+\\))?\\s+(.*?)\\s*Description.*", "$1");
In addition to @aliteralmind's answer: regex isn't a good tool for nested structures, such as matching pairs of parentheses. But in your simple case, you can use the OR operator, |, in your pattern. The outer parentheses group the two alternatives for the OR operator: the first part with parentheses and the second without.
(\\(\\d+\\)|\\d+)
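Putting the first answer together with Matcher.find(), as the question asks, might look like the sketch below. The class and method names (ClaimsExtractor, extract) are illustrative, and returning null when nothing matches is just one possible choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClaimsExtractor {
    // DOTALL lets . cross line breaks; group 1 is the text between the delimiters.
    private static final Pattern CLAIMS = Pattern.compile(
            "\\bClaims(?:\\(\\d+\\))?\\s+(.+?)\\s+Description\\b", Pattern.DOTALL);

    // Returns the text between the two delimiter lines, or null if they are not found.
    static String extract(String text) {
        Matcher m = CLAIMS.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

The optional (?:\(\d+\))? part means the same pattern handles both Claims(40) and a bare Claims line.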
