This is my first time using regex. I have almost achieved what I need, but I can't seem to combine my two patterns into a single statement.
I have a string of words in which I want to replace \n with a space, unless the \n is preceded by a dot or by a dot followed by a space.
I can run either of these two statements to achieve the required result. However, if I either run them one after another or try to combine them into a single regex, it does not work.
//replaces \n if not preceded by dot space
xx = xx.replaceAll("(.+)(?<!\\. )\n", "$1 ");
//replaces \n if not preceded by dot
xx = xx.replaceAll("(.+)(?<!\\.)\n", "$1 ");
//one of my attempts to combine into a single statement
xx = xx.replaceAll("(.+)(?<!\\. )\n|(?<!\\.)\n", "$1 ");
Example of String I'm trying to fix.
BEFORE
This is some text which may\n
have a newline character to break the line\n
but I only want to remove it if it's not preceded with a full.\n
or it's not preceded with a full stop and a space. \n
AFTER
This is some text which may
have a newline character to break the line
but I only want to remove it if it's not preceded with a full.\n
or it's not preceded with a full stop and a space. \n
I think I'm close, but being new to regex, I am getting more confused the more I read.
It's easier than you think:
String resultString = subjectString.replaceAll("(?<!\\. ?)\n", " ");
Explanation:
(?<! # Assert that the previous characters are not...
\. # a dot
[ ]? # optionally followed by a space
) # End of lookbehind
\n # Match a newline character
So you don't need to match (.+) in the first place, only to replace it with itself afterwards. Incidentally, here's what tripped you up:
(.+)(?<!\. )\n|(?<!\.)\n
is logically grouped as
(.+)(?<!\. )\n # Match this
| # or
(?<!\.)\n # this
so the (.+) belongs only to the first alternative; the second alternative matches the \n on its own, without capturing the text before it.
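As a rough illustration, the lookbehind pattern from this answer can be run against the BEFORE text from the question. The class name NewlineJoiner and the method name joinWrappedLines are made up for this sketch; only the regex comes from the answer.

```java
class NewlineJoiner {
    // Hypothetical helper: replaces a newline with a space unless the newline
    // is preceded by "." or by ". " (the regex is the one from the answer).
    static String joinWrappedLines(String text) {
        return text.replaceAll("(?<!\\. ?)\n", " ");
    }

    public static void main(String[] args) {
        String before = "This is some text which may\n"
                + "have a newline character to break the line\n"
                + "but I only want to remove it if it's not preceded with a full.\n"
                + "or it's not preceded with a full stop and a space. \n";
        // The first two newlines become spaces; the last two are kept.
        System.out.println(joinWrappedLines(before));
    }
}
```

The optional space inside the lookbehind is accepted by Java because the lookbehind still has a bounded length (one or two characters).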
I have this code:
String[] parts = sentence.split("\\s");
and a sentence like: "this is a whitespace   and I want to split it" (note there are 3 whitespaces after "whitespace")
I want to split it so that only the last whitespace of each run is removed and the rest of the message stays intact. The output should be
"[this], [is], [a], [whitespace  ], [and], [I], [want], [to], [split], [it]"
(two whitespaces after the word "whitespace")
Can I do this with regex and if not, is there even a way?
I removed the + from \\s+ to only remove one whitespace
You can use
String[] parts = sentence.split("\\s(?=\\S)");
That will split at a whitespace char that is immediately followed by a non-whitespace char.
See the regex demo. Details:
\s - a whitespace char
(?=\S) - a positive lookahead that requires a non-whitespace char to appear immediately to the right of the current location.
To make it fully Unicode-aware in Java, add the (?U) (Pattern.UNICODE_CHARACTER_CLASS option equivalent) embedded flag option: .split("(?U)\\s(?=\\S)").
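A small sketch of how this split behaves on the sentence from the question (with three spaces after "whitespace"). The class and method names are hypothetical; the pattern is the one given above.

```java
import java.util.Arrays;

class SingleSpaceSplit {
    // Hypothetical helper: splits at a whitespace char only when the next
    // char is non-whitespace, so extra spaces stay attached to the left word.
    static String[] splitKeepingExtraSpaces(String sentence) {
        return sentence.split("\\s(?=\\S)");
    }

    public static void main(String[] args) {
        String sentence = "this is a whitespace   and I want to split it";
        // "whitespace" keeps two of its three trailing spaces.
        System.out.println(Arrays.toString(splitKeepingExtraSpaces(sentence)));
    }
}
```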
To start with, I am using Java, in case that influences the regex.
I am trying to match a line that starts with any number of whitespace characters (and nothing else), followed by any number of pound signs (#), followed by any characters, and ending with a newline.
Alternatively, it should match an empty line that contains only whitespace or a newline.
I tried finding the first part myself but it doesn't seem to match any of the comments:
^(?!.+)#+.*$
It doesn't work even if I include \r*\n* on the end
In your regexr example you have selected JavaScript and enabled the s flag to have the dot match a newline.
If you want to match all lines, you can enable the multiline and global flag instead, and use
^[^\S\r\n]*(?:#.*)?\r?\n
In Java, you might use
^\h*(?:#.*)?\R
With doubled backslashes:
String regex = "^\\h*(?:#.*)?\\R";
The pattern matches:
^ Start of string (or start of each line when the multiline flag is enabled)
\h* Match optional horizontal whitespace chars
(?:#.*)? Optionally match # followed by the rest of the line
\R Match any Unicode newline sequence
If you want to match the whole line, and instead of matching a newline you want to assert the end of the string you can use an anchor $ instead of \R
^\h*(?:#.*)?$
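The anchored variant can be tried in Java roughly as follows. This is only a sketch: it assumes the MULTILINE flag is wanted so that ^ and $ apply per line, and the class name CommentLineFinder and its method are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentLineFinder {
    // MULTILINE makes ^ and $ match at the start and end of every line.
    private static final Pattern COMMENT_OR_BLANK =
            Pattern.compile("^\\h*(?:#.*)?$", Pattern.MULTILINE);

    // Hypothetical helper: collects every line that is blank or is a comment
    // consisting of optional horizontal whitespace followed by '#'.
    static List<String> commentOrBlankLines(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = COMMENT_OR_BLANK.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "# comment\n   ## indented\ncode line\n\n  \nx = 1 # trailing\n";
        // "code line" and "x = 1 # trailing" are skipped.
        System.out.println(commentOrBlankLines(text));
    }
}
```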
I have tried [\s]+$ and (?:$|\s)+$ but I don't get the desired output.
What I am looking for is
String str ="this is a string ending with multiple newlines\n\n\n"
The newline can be \n, \r or \r\n depending on the OS, which is why I used \s+ here.
I need to find all the newline characters at the end of the string,
and I have to do it in Java code.
The point is that \s, in Java, matches only ASCII whitespace by default (it matches any Unicode whitespace if you use (?U)\s).
You can use
String regex = "\\R+$";
or
String regex = "\\R+\\z";
See the regex demo.
If you need to get each individual line break sequence at the end of string, you can use
String regex = "\\R(?=\\R*$)";
See this regex demo.
These patterns mean
\R+ - one or more line break sequences
$ - at the end of the string (\z matches the very end of string and will work identically in this case)
\R(?=\R*$) - any line break sequence followed with zero or more line break sequences up to the end of the whole string.
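To make the two patterns concrete, here is a minimal sketch that strips the trailing line breaks with \R+\z and counts them one by one with \R(?=\R*$). The class and method names are assumptions for the example, and \R needs Java 8 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingLineBreaks {
    // Hypothetical helper: removes every line break sequence at the very end.
    static String stripTrailing(String s) {
        return s.replaceAll("\\R+\\z", "");
    }

    // Hypothetical helper: counts individual line break sequences that are
    // followed only by more line breaks up to the end of the string.
    static int countTrailing(String s) {
        Matcher m = Pattern.compile("\\R(?=\\R*$)").matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "this is a string ending with multiple newlines\n\n\n";
        System.out.println("[" + stripTrailing(str) + "]");
        System.out.println(countTrailing(str));
    }
}
```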
I am trying to replace 'eed' and 'eedly' with 'ee' from words where there is a vowel before either term ('eed' or 'eedly') appears.
So for example, the word indeed would become indee because there is a vowel ('i') that happens before the 'eed'. On the other hand the word 'feed' would not change because there is no vowel before the suffix 'eed'.
I have this regex: (?i)([aeiou]([aeiou])*[e{2}][d]|[dly]\\b)
You can see what is happening with this here.
As you can see, this is correctly identifying words that end with 'eed', but it is not correctly identifying 'eedly'.
Also, when it does the replacement, it replaces all words that end with 'eed', even words like feed, where the eed should not be removed.
What should I be considering here in order to make it correctly identify the words based on the rules I specified?
You can use:
str = str.replaceAll("(?i)\\b(\\w*?[aeiou]\\w*)eed(?:ly)?", "$1ee");
\\b(\\w*?[aeiou]\\w*) before eed or eedly makes sure there is at least one vowel in the same word before this.
To speed this regex up, you can use a negated character class:
\\b([^\\Waeiou]*[aeiou]\\w*)eed(?:ly)?
RegEx Breakup:
\\b # word boundary
( # start capturing group #1
[^\\Waeiou]* # match 0 or more word characters that are not vowels
[aeiou] # match one vowel
\\w* # followed by 0 or more word characters
) # end capturing group #1
eed # followed by literal "eed"
(?: # start non-capturing group
ly # match literal "ly"
)? # end non-capturing group, ? makes it optional
Replacement is:
"$1ee" which means back reference to captured group #1 followed by "ee"
Look for dly before d; otherwise the regex evaluation stops after finding eed.
(?i)([aeiou]([aeiou])*[e{2}](dly|d))
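For reference, a short sketch that applies the first answer's replaceAll call (the pattern that requires a vowel earlier in the same word) to the words from the question. The class name EedSuffix and method name stripEed are hypothetical.

```java
class EedSuffix {
    // Hypothetical helper: turns "eed" or "eedly" into "ee", but only when
    // a vowel occurs earlier in the same word (pattern from the first answer).
    static String stripEed(String text) {
        return text.replaceAll("(?i)\\b(\\w*?[aeiou]\\w*)eed(?:ly)?", "$1ee");
    }

    public static void main(String[] args) {
        // "indeed" and "agreedly" change, "feed" has no earlier vowel.
        System.out.println(stripEed("indeed feed agreedly"));
    }
}
```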
I have the following regex
in = in.replaceAll(" d+\n", "");
I wanted to use it to get rid of the "d" at the end of lines
But I just won't do that d
<i>I just won't do that</i> d
No, no-no-no, no, no d
What is wrong with my regex in = in.replaceAll(" d+\n", "");?
Most probably your lines are separated not just with \n but with \r\n. You can use \r?\n to optionally match \r before \n. Let's also not forget about the last d, which doesn't have any line separator after it. To handle it, you need to add $ to your regex, which is an anchor representing the end of your data. So your final pattern could look like
in.replaceAll(" d+(\r?\n|$)", "")
In case you don't want to remove these line separators you can use "end of line anchor" $ with MULTILINE flag (?m) instead of line separators like
in.replaceAll("(?m) d+$", "")
especially because there is no line separator after the last d.
In Java, when MULTILINE flag is specified, $ will match the empty string:
Before a line terminator:
A carriage-return character followed immediately by a newline character ("\r\n")
Newline (line feed) character ('\n') without carriage-return ('\r') right in front
Standalone carriage-return character ('\r')
Next-line character ('\u0085')
Line-separator character ('\u2028')
Paragraph-separator character ('\u2029')
At the end of the string
When UNIX_LINES flag is specified along with MULTILINE flag, $ will match the empty string right before a newline ('\n') or at the end of the string.
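The difference between the two patterns in this answer can be seen in a small sketch. The class and method names are invented here, and the sample input assumes Windows-style \r\n separators.

```java
class TrailingMarkerRemover {
    // Hypothetical helper: removes " d" at the end of each line and keeps
    // the line separators, relying on $ in MULTILINE mode.
    static String removeKeepingBreaks(String in) {
        return in.replaceAll("(?m) d+$", "");
    }

    // Hypothetical helper: removes " d" together with the line separator
    // (or the end of the string) that follows it.
    static String removeWithBreaks(String in) {
        return in.replaceAll(" d+(\r?\n|$)", "");
    }

    public static void main(String[] args) {
        String in = "But I just won't do that d\r\nNo, no-no-no, no, no d";
        System.out.println(removeKeepingBreaks(in));
        System.out.println(removeWithBreaks(in));
    }
}
```

The second variant joins the lines together, because the separators are part of what gets replaced.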
Anyway, if possible, don't use regex with HTML.
As Pshemo states in his answer, your string most likely contains Windows-style newline characters, which are \r\n as opposed to just \n.
You can modify your regex to account for both newline styles (plus the case where the string ends with a d without a newline) with the code:
in = in.replaceAll("(d+(?=\r\n)|d+(?=\n)|d+$)","");
This regex will remove anything that matches d+ followed by \r\n, d+ followed by \n or d+$ (any d before the end of the String).
(d+(?=\r\n)|d+(?=\n)|d+$)
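A quick sketch of what this lookahead version does in practice; the class and method names are made up for the example.

```java
class LookaheadMarkerRemover {
    // Hypothetical helper: removes a run of 'd' that is followed by \r\n or
    // \n, or that sits at the end of the string. The lookaheads do not
    // consume the line separators, so those stay in the result.
    static String removeTrailingD(String in) {
        return in.replaceAll("(d+(?=\r\n)|d+(?=\n)|d+$)", "");
    }

    public static void main(String[] args) {
        String in = "But I just won't do that d\r\nNo, no-no-no, no, no d";
        System.out.println(removeTrailingD(in));
    }
}
```

Because this pattern has no leading space, the space before each d is left behind, and a word that merely ends in d right before a newline (for example "and") loses its last letter as well.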