Remove string before double line break using regex - java

I have a string like this:
this is my text
more text
more text
text I want
is below
I just want the text below the double line break and not the stuff before.
Here is what I thought should work:
myString.replaceFirst(".+?(\n\n)","");
However, it does not work. Any help would be greatly appreciated.

Use the following regex:
str = str.replaceFirst("(?s).+?(\n\n)", "");
You want to match anything, including newline characters, up to the first two consecutive newlines.
Note that the dot (.) does not match a newline by default, so your pattern stops at the first newline character.
To make the dot match newlines, use Pattern.DOTALL, which in the case of str.replaceFirst is enabled with the embedded (?s) flag expression. Also assign the result back, since replaceFirst returns a new string and leaves the original unchanged.
From the documentation of Pattern.DOTALL:
In dotall mode, the expression . matches any character, including a
line terminator. By default this expression does not match line
terminators.
Dotall mode can also be enabled via the embedded flag expression (?s).
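To make the difference concrete, here is a small sketch that runs both patterns on the sample text from the question. The class and method names are made up for illustration. Without (?s), the lazy dot cannot cross a single newline, so only the last line before the blank line gets removed.

```java
class DotallDemo {
    // Removes everything up to and including the first "\n\n".
    static String afterFirstDoubleNewline(String text) {
        // (?s) enables DOTALL so '.' also matches '\n';
        // '.+?' is lazy, so it stops at the first "\n\n" it can reach.
        return text.replaceFirst("(?s).+?\\n\\n", "");
    }

    public static void main(String[] args) {
        String text = "this is my text\nmore text\nmore text\n\ntext I want\nis below";

        // Without (?s): '.' stops at each single '\n', so the match can only
        // start on the line directly before the blank line.
        System.out.println(text.replaceFirst(".+?\\n\\n", ""));
        System.out.println("---");

        // With (?s): everything before the blank line is removed.
        System.out.println(afterFirstDoubleNewline(text));
    }
}
```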

Why not:
s = s.substring(s.indexOf("\n\n") + 2);
The + 2 skips the two newline characters themselves. Note that indexOf returns -1 when the string contains no double line break, in which case this expression would drop only the first character.
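A minimal sketch of the substring approach with a guard for the case where there is no double line break. Returning the input unchanged in that case is my assumption about the desired behavior, and the class and method names are hypothetical.

```java
class SubstringAfterBreak {
    static String afterDoubleNewline(String s) {
        int i = s.indexOf("\n\n");
        // indexOf returns -1 when "\n\n" does not occur; keep the input as-is then.
        // Otherwise skip the two newline characters (hence + 2).
        return i < 0 ? s : s.substring(i + 2);
    }

    public static void main(String[] args) {
        System.out.println(afterDoubleNewline("this is my text\nmore text\n\ntext I want\nis below"));
    }
}
```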

You can use split. Here is an example:
String newString = string.split("\n\n")[1];
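One thing to watch with the split approach: index [1] throws if there is no double line break, and any text after a second double line break is lost. A hedged variant, using the two-argument split with a limit of 2 (names are hypothetical, and returning the input unchanged when nothing matches is an assumption):

```java
class SplitAfterBreak {
    static String afterDoubleNewline(String s) {
        // Limit 2: split at the first "\n\n" only, keep the rest intact.
        String[] parts = s.split("\n\n", 2);
        return parts.length == 2 ? parts[1] : s;
    }

    public static void main(String[] args) {
        System.out.println(afterDoubleNewline("header\n\nbody part 1\n\nbody part 2"));
    }
}
```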

Related

How to match regex pattern on single line only?

I have the following regex and sample input:
http://regex101.com/r/xK9dE3
As you can see, it is matching the first "yo". I only want the pattern to match when "yo" is on the same line as "cut me" (the second "yo").
How can I make sure that the regex match is only on the same line?
Output:
Hi
Expected Output (this is what I really want):
Hi
yo keep this here
Keep this here
You can use this regex with s (DOTALL) regex flag:
^.*?(?=yo\b[^\n]*cut me:)
Online Demo: http://regex101.com/r/oV3eP7
yo\b[^\n]*cut me: is the lookahead pattern. It makes sure that yo (with a word boundary) and cut me: are matched on the same line.
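For anyone trying this from Java rather than regex101, a rough sketch of the lookahead version follows. The sample input is my own guess at the shape of the original (the real one is only on the linked page), and the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SameLineLookahead {
    // DOTALL lets '.*?' run across lines; [^\n]* inside the lookahead
    // keeps "yo" and "cut me:" on the same line.
    private static final Pattern PREFIX =
            Pattern.compile("^.*?(?=yo\\b[^\\n]*cut me:)", Pattern.DOTALL);

    // Returns the text before the "yo ... cut me:" line, or the input if no such line exists.
    static String keepBeforeMarkedLine(String input) {
        Matcher m = PREFIX.matcher(input);
        return m.find() ? m.group() : input;
    }

    public static void main(String[] args) {
        // Hypothetical input: the first "yo" line has no "cut me:", the second one does.
        String input = "Hi\nyo keep this here\nKeep this here\nyo cut me: rest";
        System.out.print(keepBeforeMarkedLine(input));
    }
}
```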
Remove the s or DOTALL flag and change your regex to the following:
^.*?((\byo\b.*?(cut me:)[\s\S]*))
With the DOTALL flag enabled . will match newline characters, so your match can span multiple lines including lines before yo or between yo and cut me. By removing this flag you can ensure that you only match the line with both yo and cut me, and then change the .* at the end to [\s\S]* which will match any character including newlines so that you can match to the end of the string.
http://regex101.com/r/sX2kL0
edit: Note that this takes a slightly different approach than the other answer. It matches the portion of the string that you want deleted, so you can replace that portion with an empty string to remove it.

Java Regex ignoring newline character WITHOUT Dotall

I have to parse returned emails for a specific object id. The problem is that, when the email is returned, the id may be split into several lines. Usually it should look like this:
foo#bar-20130101-103000#12345
where I'm interested in the last part, "12345". The problem is that the string tends to be split by a newline, for example:
foo#bar-20130101-103000#12
345
which causes my regex
[a-zA-Z0-9äöüÄÖÜß]{1,5}#[a-zA-Z0-9äöüÄÖÜß]{1,5}-\d{8}-\d{6}#(\d+)
to only find "12" instead of "12345". All the hints I find on the net say to use Pattern.MULTILINE and/or Pattern.DOTALL, but MULTILINE only influences the ^ and $ anchors, and DOTALL only makes . match newline characters too. The problem is that I don't have a . here, and it's not really applicable either, because I only want digits.
So how can I make my regex match the whole thing and not stop at the line break?
[\d\r\n] will match a digit or a new line, so try with ([\d\r\n]+).
Since your number is in the end you can try:
"(?s)^[a-zA-Z0-9äöüÄÖÜß]{1,5}#[a-zA-Z0-9äöüÄÖÜß]{1,5}-\\d{8}-\\d{6}#(.*)$"
i.e. capture everything after the last # with DOTALL. Note that backslashes must be doubled inside a Java string literal.
The following should also work without DOTALL:
"^[a-zA-Z0-9äöüÄÖÜß]{1,5}#[a-zA-Z0-9äöüÄÖÜß]{1,5}-\\d{8}-\\d{6}#([\\d\\r\\n]+)$"
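A sketch of how the [\d\r\n]+ idea could be used end to end: capture digits and line breaks, then strip the line breaks from the captured group. To keep the example ASCII-only I shortened the prefix character class to [a-zA-Z0-9]; that is a simplification on my part, so put the umlauts back for real data. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WrappedIdParser {
    // Group 1 accepts digits plus CR/LF so a wrapped id is still captured whole.
    private static final Pattern ID = Pattern.compile(
            "[a-zA-Z0-9]{1,5}#[a-zA-Z0-9]{1,5}-\\d{8}-\\d{6}#([\\d\\r\\n]+)");

    static Optional<String> extractId(String mail) {
        Matcher m = ID.matcher(mail);
        if (!m.find()) {
            return Optional.empty();
        }
        // Remove the line breaks that were captured along with the digits.
        return Optional.of(m.group(1).replaceAll("[\\r\\n]", ""));
    }

    public static void main(String[] args) {
        System.out.println(extractId("foo#bar-20130101-103000#12\n345").orElse("not found"));
    }
}
```

Be aware that this also swallows digits that happen to start the following line, so it is only safe if the id is followed by something other than a digit.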

Regex double new line

What I want to do is take the left and right parts around a double line break.
Example
LEFT_PART\r\n\r\nRIGHT_PART
Left and right part can be anything but they will not contain double new line.
What I'm doing is not working (doesn't match the string I give it). This is what I've done so far.
^(.*)[\r\r|\n\n|\r\n\r\n]{1,1}(.*)$
It can start with anything, followed by exactly one double-new line, followed by anything.
I group the right and left parts because I need to use them afterwards.
EDIT
I use OR to cover all three types of new-line
Square brackets are used for character classes, not grouping. Try using parentheses:
^(.*)(\r\r|\n\n|\r\n\r\n)(.*)$
And to avoid capturing the double newlines:
^(.*)(?:\r\r|\n\n|\r\n\r\n)(.*)$
The {1,1} is also redundant. I removed it.
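Here is a short Java sketch of the non-capturing version, using the example string from the question. The class and method names are made up. Since the dot does not match line terminators by default, each part is limited to a single line, which fits the stated constraint only if the parts contain no single newlines either.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleNewlineSplit {
    private static final Pattern PARTS =
            Pattern.compile("^(.*)(?:\\r\\r|\\n\\n|\\r\\n\\r\\n)(.*)$");

    // Returns {left, right}, or empty if the input has no double newline in that shape.
    static Optional<String[]> leftAndRight(String input) {
        Matcher m = PARTS.matcher(input);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2) });
    }

    public static void main(String[] args) {
        leftAndRight("LEFT_PART\r\n\r\nRIGHT_PART")
                .ifPresent(p -> System.out.println(p[0] + " | " + p[1]));
    }
}
```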
It's not working because you have used a character class, which matches just a single character. You should use parentheses. Also, you can simplify your regex by using the {n} quantifier. To match \r\r, use \r{2}:
^(.*)(?:\r|\n|\r\n){2}(.*)$
Apart from that, I would rather get the line separator for my system using:
String lineSeparator = System.getProperty("line.separator");
String regex = "^(.*)(?:" + Pattern.quote(lineSeparator) + "){2}(.*)$";
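One detail worth checking when combining Pattern.quote with a quantifier: as far as I know, a quantifier placed directly after a quoted sequence applies only to its last character, so a two-character separator like \r\n needs a group around it. A sketch under that assumption follows. It takes the separator as a parameter so it can be tried with any value, and the class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorSplit {
    static Optional<String[]> leftAndRight(String input, String lineSeparator) {
        // The (?: ... ) group makes {2} repeat the whole quoted separator.
        Pattern p = Pattern.compile(
                "^(.*)(?:" + Pattern.quote(lineSeparator) + "){2}(.*)$");
        Matcher m = p.matcher(input);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2) });
    }

    public static void main(String[] args) {
        String sep = System.lineSeparator();
        leftAndRight("LEFT_PART" + sep + sep + "RIGHT_PART", sep)
                .ifPresent(parts -> System.out.println(parts[0] + " | " + parts[1]));
    }
}
```

Keep in mind that the system separator only helps if the text was produced on the same platform; input from a file or network may use a different one.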
Try this:
^(.*)(?:(\r|\r?\n){2})(.*)$
Try this:
(?m)^(.*)$[\r\n]{1,2}^$[\r\n]{1,2}^(.*)$
The switch (?m) has the effect that caret and dollar match after and before newlines for the remainder of the regular expression.
Here's a live demo of this regex working.

Remove everything from a string upto a certain character and optionally a string if it follows too

I am looking to write a regex that removes all characters up to the first &emsp and, if a (new section) follows the &emsp, removes that as well. But the following regex doesn't seem to work. Why? How do I correct this?
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
Pattern removeEmspPattern1 = Pattern.compile("(.*( (\\(new section\\)))?)(.*)", Pattern.MULTILINE);
System.out.println(removeEmspPattern1.matcher(removeEmsp).replaceAll("$2"));
Have you tried String.split? It creates an array of strings from a string, based on a delimiter.
Once you have the string split, just select the elements of the array that you need for the print statement.
Read more here
Your regex is very long and I do not want to debug it. However, the tip is that some characters have special meaning in regular expressions. For example, . matches any character, parentheses create groups, and square brackets define character classes. Such characters must be escaped if you want them to be interpreted as literal characters and not regex commands. To escape a special character, write \ in front of it. But \ is the escape character in Java string literals too, so it must be doubled.
For example, to replace a literal dot with the letter A you should write str.replaceAll("\\.", "A"). Note that & is not special outside a character class, so it does not need escaping.
Now you have all information you need. Try to start from simpler regex and then expand it to what you need. Good luck.
EDIT
BTW, parsing XML and/or HTML using regular expressions is possible but strongly discouraged. Use a dedicated parser for such formats.
Try this:
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
System.out.println(removeEmsp.replaceFirst("^.*?\\ (\\(new\\ssection\\))?", ""));
System.out.println(removeEmsp.replaceAll("^.*?\\ (\\(new\\ssection\\))?", ""));
Output:
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
It will remove everything up to " " and optionally, the following "(new section)" text if any.
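The delimiter in the strings above appears only as whitespace here, so the following sketch assumes the input contains the literal entity text &emsp; instead. That is an assumption on my part, as are the ASCII-only sample strings and the class and method names. The idea is the same: a lazy prefix up to the first delimiter, then an optional "(new section)" group.

```java
class EmspPrefixRemover {
    static String stripPrefix(String s) {
        // '.*?' is lazy, so it stops at the first "&emsp;";
        // the optional group also removes "(new section)" when it follows directly.
        return s.replaceFirst("^.*?&emsp;(\\(new\\ssection\\))?", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("431:10A-126&emsp;(new section)Chemotherapy services."));
        System.out.println(stripPrefix("431:10A-126&emsp;Cancer treatment."));
    }
}
```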

regular expression to match one or more of char a or just one of char b

I am taking user input through a UI, and I have to validate it. The input text should obey the following condition:
It should either end with one or more
white space characters OR with just
single '='
I can use
".*[\s=]+"
but it also matches multiple '=', which I don't want.
Please help.
You can use alternation:
(\s+|=)$
This expression means: match one or more whitespace characters or one equals sign, at the end of the string. The $ is an anchor which matches the end of the string (as you mentioned you're looking for characters at the end of the string).
(As tchrist correctly pointed out in the comments, $ matches the end of line instead of end of string when in multiline mode. If this is true in your case, and you are indeed looking for the end of the string instead of the end of the line, you can use \Z instead, which matches the end of the string regardless of multiline mode.)
If you want to ensure that there is only one = at the end, you can use a lookaround (in this case, a negative lookbehind, specifically). A lookaround is a zero-width assertion which tells the regex engine that the assertion must pass for the pattern to match, but it does not consume any characters.
(\s+|(?<!=)=)$
In this case, (?<!=) tells the regex engine, the character before the current position cannot be an =. When put into the expression, (?<!=)= means that the = will only match if the previous character is not also a =.
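A quick way to try the lookbehind version from Java is to use Matcher.find() so the pattern only has to match at the end of the input. This is just a sketch, and the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class TrailingCharValidator {
    // Ends with one or more whitespace characters, or with a single '='
    // that is not preceded by another '='.
    private static final Pattern VALID_END = Pattern.compile("(\\s+|(?<!=)=)$");

    static boolean hasValidEnding(String input) {
        return VALID_END.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "abc ", "abc=", "abc==", "abc" }) {
            System.out.println("\"" + s + "\" -> " + hasValidEnding(s));
        }
    }
}
```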
Begin string
Anything not "=" (to avoid the double "==")
One or more whitespace characters OR one "="
End of string
^([^=]*(?:\s+|=))$
Should work :-)
Try this expression:
".*(\\s+|=)"
