Regex double new line - java

I want to capture the text to the left and to the right of a double newline.
Example
LEFT_PART\r\n\r\nRIGHT_PART
The left and right parts can be anything, but they will not contain a double newline.
What I'm doing is not working (doesn't match the string I give it). This is what I've done so far.
^(.*)[\r\r|\n\n|\r\n\r\n]{1,1}(.*)$
It can start with anything, followed by exactly one double-new line, followed by anything.
I group the left and right parts because I need to use them afterwards.
EDIT
I use OR to cover all three types of new-line

Square brackets define a character class; they do not group. Try using parentheses:
^(.*)(\r\r|\n\n|\r\n\r\n)(.*)$
And to avoid capturing the double newlines:
^(.*)(?:\r\r|\n\n|\r\n\r\n)(.*)$
The {1,1} is also redundant. I removed it.
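As a minimal sketch of how the non-capturing version could be used from Java (the class and method names here are hypothetical, not from the question), the two groups can be read back with Matcher.group:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoubleNewlineSplit {
    // The non-capturing alternation from the answer above, escaped for a Java string.
    private static final Pattern DOUBLE_NEWLINE =
            Pattern.compile("^(.*)(?:\\r\\r|\\n\\n|\\r\\n\\r\\n)(.*)$");

    /** Returns {left, right}, or null when the input does not match. */
    static String[] split(String input) {
        Matcher m = DOUBLE_NEWLINE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("LEFT_PART\r\n\r\nRIGHT_PART");
        System.out.println(parts[0] + " | " + parts[1]); // prints "LEFT_PART | RIGHT_PART"
    }
}
```

Because the dot does not match line terminators by default, this only works when each part sits on a single line, which is what the question describes.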

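It's not working because you have used a character class, which matches just a single character. You should use parentheses instead. You can also simplify your regex with the {n} quantifier; for example, \r{2} matches \r\r:
^(.*)(?:\r|\n|\r\n){2}(.*)$
Note that this is looser than the original: the two repetitions may differ, so a single \r\n (matched as \r followed by \n) also satisfies {2}.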
Apart from that, I would rather get the line separator for my system using:
String lineSeparator = System.getProperty("line.separator");
String regex = "^(.*)(?:" + Pattern.quote(lineSeparator) + "){2}(.*)$";

Try the following:
^(.*)(?:(\r|\r?\n){2})(.*)$

Try this:
(?m)^(.*)$[\r\n]{1,2}^$[\r\n]{1,2}^(.*)$
The (?m) switch makes the caret and dollar match after and before line terminators for the remainder of the regular expression.

Related

How to split a string in java using (,) with certain conditions [duplicate]

I would like to find a regex that will pick out all commas that fall outside quote sets.
For example:
'foo' => 'bar',
'foofoo' => 'bar,bar'
This would pick out the single comma on line 1, after 'bar'.
I don't really care about single vs double quotes.
Has anyone got any thoughts? I feel like this should be possible with lookaheads, but my regex fu is too weak.
This will match any string up to and including the first non-quoted ",". Is that what you are wanting?
/^([^"]|"[^"]*")*?(,)/
If you want all of them (and as a counter-example to the guy who said it wasn't possible) you could write:
/(,)(?=(?:[^"]|"[^"]*")*$)/
which will match all of them. Thus
'test, a "comma,", bob, ",sam,",here'.gsub(/(,)(?=(?:[^"]|"[^"]*")*$)/,';')
replaces all the commas not inside quotes with semicolons, and produces:
'test; a "comma,"; bob; ",sam,";here'
If you need it to work across line breaks just add the m (multiline) flag.
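For readers working in Java rather than Ruby, the same lookahead can be sketched as follows. The class and method names are hypothetical; the pattern is the one from the answer above, escaped for a Java string, and it only accounts for double quotes.

```java
final class QuotedCommaSplit {
    // A comma counts as "outside quotes" when the rest of the string consists only of
    // non-quote characters and complete "..." blocks.
    private static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]|\"[^\"]*\")*$)";

    static String replaceUnquotedCommas(String s) {
        return s.replaceAll(COMMA_OUTSIDE_QUOTES, ";");
    }

    static String[] splitOnUnquotedCommas(String s) {
        // Limit -1 keeps trailing empty fields.
        return s.split(COMMA_OUTSIDE_QUOTES, -1);
    }

    public static void main(String[] args) {
        System.out.println(replaceUnquotedCommas("test, a \"comma,\", bob, \",sam,\",here"));
        // prints: test; a "comma,"; bob; ",sam,";here
    }
}
```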
The regexes below match all the commas that are outside the double quotes:
,(?=(?:[^"]*"[^"]*")*[^"]*$)
Or (PCRE only):
"[^"]*"(*SKIP)(*F)|,
"[^"]*" matches each double-quoted block. For example, in the input buz,"bar,foo" it matches only "bar,foo". The (*SKIP)(*F) that follows forces that match to fail and makes the engine resume after the quoted block instead of retrying inside it. Everywhere else, the engine falls through to the pattern after the | symbol, so the , matches only the comma right after buz. The comma inside the double quotes is never matched, because the quoted part has already been skipped.
The regex below matches all the commas that are inside the double quotes:
,(?!(?:[^"]*"[^"]*")*[^"]*$)
While it's possible to hack it with a regex (and I enjoy abusing regexes as much as the next guy), you'll get in trouble sooner or later trying to handle substrings without a more advanced parser. Possible ways to get in trouble include mixed quotes, and escaped quotes.
This function will split a string on commas, but not those commas that are within a single- or double-quoted string. It can be easily extended with additional characters to use as quotes (though character pairs like « » would need a few more lines of code) and will even tell you if you forgot to close a quote in your data:
function splitNotStrings(str){
  var parse=[], inString=false, escape=0, end=0
  for(var i=0, c; c=str[i]; i++){ // looping over the characters in str
    if(c==='\\'){ escape^=1; continue} // 1 when odd number of consecutive \
    if(c===','){
      if(!inString){
        parse.push(str.slice(end, i))
        end=i+1
      }
    }
    else if(splitNotStrings.quotes.indexOf(c)>-1 && !escape){
      if(c===inString) inString=false
      else if(!inString) inString=c
    }
    escape=0
  }
  // now we finished parsing, strings should be closed
  if(inString) throw SyntaxError('expected matching '+inString)
  if(end<i) parse.push(str.slice(end, i))
  return parse
}
splitNotStrings.quotes="'\"" // add other (symmetrical) quotes here
Try this regular expression:
(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*=>\s*(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*,
This does also allow strings like “'foo\'bar' => 'bar\\',”.
MarkusQ's answer worked great for me for about a year, until it didn't. I just got a stack overflow error on a line with about 120 commas and 3682 characters total. In Java, like this:
String[] cells = line.split("[\t,](?=(?:[^\"]|\"[^\"]*\")*$)", -1);
Here's my extremely inelegant replacement that doesn't stack overflow:
private String[] extractCellsFromLine(String line) {
    List<String> cellList = new ArrayList<String>();
    while (true) {
        String[] firstCellAndRest;
        if (line.startsWith("\"")) {
            firstCellAndRest = line.split("([\t,])(?=(?:[^\"]|\"[^\"]*\")*$)", 2);
        }
        else {
            firstCellAndRest = line.split("[\t,]", 2);
        }
        cellList.add(firstCellAndRest[0]);
        if (firstCellAndRest.length == 1) {
            break;
        }
        line = firstCellAndRest[1];
    }
    return cellList.toArray(new String[cellList.size()]);
}
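Another way to sidestep the stack overflow is to drop the regex entirely and scan the line once, tracking whether the current position is inside double quotes. The following is only a sketch under assumptions: the class and method names are hypothetical, quotes are kept in the output cells, and escaped quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class CellSplitter {
    /** Splits on tabs and commas that are not inside a double-quoted section. */
    static String[] extractCells(String line) {
        List<String> cells = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;      // toggle on every double quote
                cell.append(c);
            } else if ((c == ',' || c == '\t') && !inQuotes) {
                cells.add(cell.toString()); // delimiter outside quotes ends the cell
                cell.setLength(0);
            } else {
                cell.append(c);
            }
        }
        cells.add(cell.toString());         // last cell, possibly empty
        return cells.toArray(new String[0]);
    }
}
```

Like split with a limit of -1, this keeps a trailing empty cell. It uses no recursion, so line length does not affect stack depth.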
@SocialCensus, the example you gave in the comment to MarkusQ, where you throw in ' alongside the ", doesn't work with the example MarkusQ gave right above it if we change sam to sam's: (test, a "comma,", bob, ",sam's,",here) has no match against (,)(?=(?:[^"']|["|'][^"']*")*$). In fact, the problem itself, "I don't really care about single vs double quotes", is ambiguous. You have to be clear about what you mean by quoting with either " or '. For example, is nesting allowed or not? If so, to how many levels? If only one nested level, what happens to a comma outside the inner quotation but inside the outer quotation? You should also consider that single quotes occur by themselves as apostrophes (i.e., like the counter-example I gave earlier with sam's). Finally, the regex you made doesn't really treat single quotes on a par with double quotes, since it assumes the last quotation mark is necessarily a double quote. Replacing that last double quote with ['|"] also has a problem if the text doesn't come with correct quoting (or if apostrophes are used), though I suppose we could probably assume all quotes are correctly delimited.
MarkusQ's regexp answers the question: find all commas that have an even number of double quotes after them (i.e., are outside double quotes) and disregard all commas that have an odd number of double quotes after them (i.e., are inside double quotes). This is generally the same solution as what you probably want, but let's look at a few anomalies. First, if someone leaves off a quotation mark at the end, then this regexp finds all the wrong commas rather than finding the desired ones or failing to match any. Of course, if a double quote is missing, all bets are off, since it might not be clear whether the missing one belongs at the end or at the beginning; however, there is a case that is legitimate and where the regex could conceivably fail (this is the second "anomaly"). If you adjust the regexp to go across text lines, then you should be aware that quoting multiple consecutive paragraphs requires that you place a single double quote at the beginning of each paragraph and leave out the quote at the end of each paragraph except at the end of the very last paragraph. This means that over the space of those paragraphs, the regex will fail in some places and succeed in others.
Examples and brief discussions of paragraph quoting and of nested quoting can be found here http://en.wikipedia.org/wiki/Quotation_mark .

How to extract multi-line text delimited by 2 strings

I have the following text:
Claims(40)
This is good.
This is good, too.
Description
This is description.
The delimiter strings in this case are:
1st delimiter: "Claims(40)"
2nd delimiter: "Description"
I want to extract text between these delimiters while excluding the delimiters.
The following rules also apply to the text above:
The 1st delimiter starts in the 1st column of the text and is the only word on its line.
In the 1st delimiter, the opening parenthesis, the digits, and the closing parenthesis may all be absent. However, if the opening parenthesis is present, the digits and the closing parenthesis are present too.
The 2nd delimiter starts in the 1st column of the text and is the only word on its line.
My regular expression:
String regxStr = "^Claims(\\(\\d+\\)?)$(.*?)^Description$";
This doesn't work.
I tried many other regexes, but none of them worked. So finally, I resorted to a brute-force approach with this regex:
String regxStr = "Claims(.*?)Description";
But neither regex works. I can't figure out what is going wrong, or where.
I'm using Matcher class and find() method of Matcher class for further processing.
Please help me.
This captures the text you want, although I'm not totally clear on your requirements for the (40) part. @lovetostrike's answer addresses that.
\bClaims(?:\(\d+\))?\s+(.+?)\s+Description\b
You must activate the DOTALL flag when compiling the pattern:
Pattern.compile(regxStr, Pattern.DOTALL)
Escaped in a Java string:
"\\bClaims(?:\\(\\d+\\))?\\s+(.+?)\\s+Description\\b"
Here's a one-line solution:
String target = input.replaceAll("(?s).*Claims(?:\\(\\d+\\))?\\s+(.*?)Description.*", "$1");
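For a fuller picture of the Matcher-based approach with the DOTALL flag, here is a minimal sketch. The class and method names are hypothetical; the pattern is the escaped one given earlier in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClaimsExtractor {
    // DOTALL lets (.+?) run across line breaks between the two delimiters.
    private static final Pattern CLAIMS = Pattern.compile(
            "\\bClaims(?:\\(\\d+\\))?\\s+(.+?)\\s+Description\\b", Pattern.DOTALL);

    /** Returns the text between the delimiters, or null if they are not found. */
    static String extract(String text) {
        Matcher m = CLAIMS.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Claims(40)\nThis is good.\nThis is good, too.\nDescription\nThis is description.\n";
        System.out.println(extract(text));
    }
}
```

Because the digits group is optional, the same pattern also handles a bare "Claims" line.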
In addition to @aliteralmind's answer: regex isn't a good tool for nested structures, i.e. matching paren pairs. But in your simple case, you can use the OR operator, '|', in your pattern. The outer parens group the two alternatives for the OR operator: the first with parens, the second without.
(\\(\\d+\\)|\\d+)

Using java.util.regex.Pattern

I'm not a programmer, so I'm a newbie in this field. I must create a regular expression to check two lines. Between these two lines, A and B, there could be one, two, or more other lines.
I've been reviewing http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html but I haven't reached a solution, although I think I'm very close.
I am testing the expression
^(.*$)
and this gets an entire line. If I write this expression twice, it gets two lines. So it seems the expression matches as many entire lines as there are occurrences of it.
But I would like to allow an undetermined number of lines between A and B. I know that there will be at least one line.
If I write ^(.*$){1,} it doesn't work.
Does anyone know what the mistake could be?
Thank you for your time
Andres
The dot (.) in a regex matches any character except line terminators.
You're looking for the DOTALL (s) flag, which makes the dot match any character, including line terminators. So if you want to match all the lines between the literals A and B, use this regex:
(?s)A.*?B
(?s) enables DOTALL, which lets .*? match all characters between A and B, including line terminators.
The ? after .* makes the match non-greedy.
Read More: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html
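A minimal sketch of this in Java, assuming you want the text between the markers rather than the markers themselves (the class name, method name, and the added capturing group are assumptions for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BetweenMarkers {
    /** Returns the text between the first start marker and the next end marker, or null. */
    static String between(String text, String start, String end) {
        // (?s) turns on DOTALL so .*? can cross line terminators; the ? keeps it non-greedy.
        // Pattern.quote treats the markers as literal text.
        Pattern p = Pattern.compile(
                "(?s)" + Pattern.quote(start) + "(.*?)" + Pattern.quote(end));
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(between("A\nline one\nline two\nB\n", "A", "B"));
    }
}
```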
Why don't you use Scanner? It might be closer to what you want:
Scanner sc = new ...
while (sc.nextLine().compareTo(strB)!=0) {
whatYouWantToDo
}
You could try to search for line terminators \r and \n. Depending on the source of the file you maybe have to experiment a bit.
As far as I understand it, you want to match the lines with at least one empty line in between? Try (?m)^(.*)$\n{2,}^(.*)$ (the (?m) flag is needed so that ^ and $ match at line boundaries).
If you want to find two equal lines, using regex:
Pattern pattern = Pattern.compile("^(?:.*\n)*(.*\n)(?:.*\n)*\\1");
// Skip some lines, capture a line, skip some lines, then match the captured line again (\1)
Matcher m = pattern.matcher(text);
while (m.find()) {
    System.out.println("Double: " + m.group(1));
}
The (?: ...) is a non-capturing group; that is, not available through m.group(#).
However this won't find line B in: "A\nB\nA\nB\n".

Remove string before double line break using regex

I have a string like this:
this is my text
more text
more text
text I want
is below
I just want the text below the double line break and not the stuff before.
Here is what I thought should work:
myString.replaceFirst(".+?(\n\n)","");
However it does not work. Any help would be greatly appreciated
Use the regex below:
str = str.replaceFirst("(?s).+?(\n\n)", "");
This is because you want to match anything, including newline characters, up to the first two consecutive newlines.
Note that the dot (.) does not match a newline, so without a flag it stops matching at the first newline character.
If you want the dot (.) to match newlines, you can use Pattern.DOTALL, which for str.replaceFirst is enabled with the embedded (?s) expression.
From the documentation of Pattern.DOTALL:
In dotall mode, the expression . matches any character, including a
line terminator. By default this expression does not match line
terminators.
Dotall mode can also be enabled via the embedded flag expression (?s).
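To make the difference concrete, here is a small sketch that runs both versions on the string from the question (the class and method names are hypothetical):

```java
final class DropBeforeDoubleBreak {
    // With (?s) the lazy .+? can cross single newlines and stops at the first "\n\n".
    static String withDotall(String s) {
        return s.replaceFirst("(?s).+?(\n\n)", "");
    }

    // Without (?s) the dot cannot cross a newline, so only the line directly
    // before the double break is removed.
    static String withoutDotall(String s) {
        return s.replaceFirst(".+?(\n\n)", "");
    }

    public static void main(String[] args) {
        String s = "this is my text\nmore text\nmore text\n\ntext I want\nis below";
        System.out.println(withDotall(s));
        System.out.println("---");
        System.out.println(withoutDotall(s));
    }
}
```

The second call leaves the first two lines in place, which is why the original attempt appeared not to work.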
Why not:
s = s.substring(s.indexOf("\n\n") + 2);
The + 2 skips the two newline characters themselves. If the string contains no double line break, indexOf returns -1 and this drops only the first character, so check for that case if it can occur.
You can also use split. Here is an example:
String newString = string.split("\n\n")[1];
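Both non-regex suggestions misbehave when the input has no double line break (indexOf returns -1, and split returns a one-element array). A guarded sketch might look like this; the names are hypothetical, and returning the input unchanged in that case is an assumption about the desired behaviour:

```java
final class AfterDoubleBreak {
    /** Returns everything after the first "\n\n", or the input unchanged if there is none. */
    static String textAfterFirstDoubleBreak(String s) {
        int idx = s.indexOf("\n\n");
        // + 2 skips the two newline characters themselves.
        return idx < 0 ? s : s.substring(idx + 2);
    }

    public static void main(String[] args) {
        System.out.println(textAfterFirstDoubleBreak(
                "this is my text\nmore text\n\ntext I want\nis below"));
    }
}
```

Unlike split("\n\n")[1], this keeps any later double line breaks in the remaining text.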

regular expression to match one or more of char a or just one of char b

I am taking user input through a UI, and I have to validate it. The input text should obey the following condition:
It should either end with one or more
white space characters OR with just
single '='
I can use
".*[\s=]+"
but it also matches multiple '=', which I don't want.
Please help.
You can use alternation:
(\s+|=)$
This expression means match one or more whitespace character or one equals, at the end of the string. The $ is an anchor which matches the end of the string (as you mentioned you're looking for characters at the end of the string).
(As tchrist correctly pointed out in the comments, $ matches the end of line instead of end of string when in multiline mode. If this is true in your case, and you are indeed looking for the end of the string instead of the end of the line, you can use \Z instead, which matches the end of the string regardless of multiline mode.)
If you want to ensure that there is only one = at the end, you can use a lookaround (in this case, a negative lookbehind, specifically). A lookaround is a zero-width assertion which tells the regex engine that the assertion must pass for the pattern to match, but it does not consume any characters.
(\s+|(?<!=)=)$
In this case, (?<!=) tells the regex engine, the character before the current position cannot be an =. When put into the expression, (?<!=)= means that the = will only match if the previous character is not also a =.
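A short sketch of how the lookbehind version could be used for validation in Java (the class and method names are hypothetical). It uses find() rather than matches() because the pattern only describes the end of the input:

```java
import java.util.regex.Pattern;

final class TrailingCharCheck {
    // One or more trailing whitespace characters, or a single '=' not preceded by another '='.
    private static final Pattern VALID_ENDING = Pattern.compile("(\\s+|(?<!=)=)$");

    static boolean isValid(String input) {
        return VALID_ENDING.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("abc "));  // true
        System.out.println(isValid("abc="));  // true
        System.out.println(isValid("abc==")); // false
        System.out.println(isValid("abc"));   // false
    }
}
```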
Begin string
Anything not "=" ( to avoid the double "==")
One or more blank spaces OR one "="
End of string
^([^=]*(?:\s+|=))$
Should work :-)
Try this expression:
".*(\\s+|=)"
