How to locate the end of the line in regex? - java

I have the following regex
in = in.replaceAll(" d+\n", "");
I wanted to use it to get rid of the "d" at the end of these lines:
But I just won't do that d
<i>I just won't do that</i> d
No, no-no-no, no, no d
What is wrong with my regex in = in.replaceAll(" d+\n", "");?

Most probably your lines are separated not just with \n but with \r\n. You can use \r?\n to optionally match \r before \n. Let's also not forget about the last d, which doesn't have any line separator after it. To handle it, add $ to your regex; it is an anchor that matches at the end of your data. So your final pattern could look like
in.replaceAll(" d+(\r?\n|$)", "")
In case you don't want to remove these line separators, you can use the "end of line" anchor $ with the MULTILINE flag (?m) instead of matching the line separators, like
in.replaceAll("(?m) d+$", "")
This is especially useful because there are no line separators after the last d.
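As a rough sketch of the difference between the two patterns above (the class name, method names, and sample text are made up for illustration): the first removes the line separator together with the d, the second keeps it.

```java
class TrailingD {
    // Removes " d" (one or more d's) together with the line separator that follows it.
    static String stripWithSeparator(String in) {
        return in.replaceAll(" d+(\r?\n|$)", "");
    }

    // Removes " d" at the end of every line but keeps the line separators.
    static String stripKeepSeparator(String in) {
        return in.replaceAll("(?m) d+$", "");
    }

    public static void main(String[] args) {
        // Mixed \r\n and \n, and no separator after the last line.
        String in = "first d\r\nsecond d\nthird d";
        System.out.println(stripWithSeparator(in)); // prints firstsecondthird
        System.out.println(stripKeepSeparator(in)); // three lines: first, second, third
    }
}
```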
In Java, when MULTILINE flag is specified, $ will match the empty string:
Before a line terminator:
A carriage-return character followed immediately by a newline character ("\r\n")
Newline (line feed) character ('\n') without carriage-return ('\r') right in front
Standalone carriage-return character ('\r')
Next-line character ('\u0085')
Line-separator character ('\u2028')
Paragraph-separator character ('\u2029')
At the end of the string
When UNIX_LINES flag is specified along with MULTILINE flag, $ will match the empty string right before a newline ('\n') or at the end of the string.
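To see these rules in action, here is a small sketch that marks every position where $ matches. The helper names (markEnds, visible) and the sample text are only illustrative.

```java
import java.util.regex.Pattern;

class DollarPositions {
    // Inserts "|" at every position where $ matches under the given flags.
    static String markEnds(String text, int flags) {
        return Pattern.compile("$", flags).matcher(text).replaceAll("|");
    }

    // Makes CR and LF visible when printing.
    static String visible(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String text = "a\r\nb\rc";
        // MULTILINE: before "\r\n" (never between \r and \n), before a lone \r, and at the end.
        System.out.println(visible(markEnds(text, Pattern.MULTILINE)));
        // prints a|\r\nb|\rc|
        // MULTILINE + UNIX_LINES: only before \n and at the end.
        System.out.println(visible(markEnds(text, Pattern.MULTILINE | Pattern.UNIX_LINES)));
        // prints a\r|\nb\rc|
    }
}
```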
Anyway, if possible, don't use regex to process HTML.

As Pshemo states in his answer, your string most likely contains Windows-style newline characters, which are \r\n as opposed to just \n.
You can modify your regex to account for both newline styles (plus the case where the string ends with a d without a newline) with the code:
in = in.replaceAll("(d+(?=\r\n)|d+(?=\n)|d+$)","");
This regex will remove anything that matches d+ followed by \r\n, d+ followed by \n or d+$ (any d before the end of the String).
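A quick way to check what this pattern does (class name and sample text are made up for illustration). Because the line separators sit inside lookaheads they are kept, and because the pattern as written has no leading space, the space before each d stays in the result.

```java
class LookaheadStrip {
    // Same pattern as above: d+ before \r\n, d+ before \n, or d+ at the end of the input.
    static String strip(String in) {
        return in.replaceAll("(d+(?=\r\n)|d+(?=\n)|d+$)", "");
    }

    public static void main(String[] args) {
        String out = strip("that d\r\nno d\nend d");
        // The d inside "end" is untouched; only the d's before a line break or at the end go away.
        System.out.println(out.equals("that \r\nno \nend ")); // prints true
    }
}
```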

Related

Matching pound (#) or empty line comments with regex

As a start, I am using Java, if this influences the regex.
I am trying to match the contents of a line that starts with any number of whitespace characters (and nothing else), followed by any number of pound signs (#), followed by any characters, and ending with a new line.
Or a completely empty line containing only whitespace or a new line.
I tried finding the first part myself but it doesn't seem to match any of the comments:
^(?!.+)#+.*$
It doesn't work even if I include \r*\n* on the end
In your regexr example you have selected JavaScript and enabled the s flag to have the dot match a newline.
If you want to match all lines, you can enable the multiline and global flag instead, and use
^[^\S\r\n]*(?:#.*)?\r?\n
In Java, you might use
^\h*(?:#.*)?\R
With doubled backslashes in a Java string literal:
String regex = "^\\h*(?:#.*)?\\R";
The pattern matches:
^ Start of string (or of each line when the MULTILINE flag (?m) is enabled)
\h* Match optional horizontal whitespace chars
(?:#.*)? Optionally match # followed by the rest of the line
\R Match any Unicode newline sequence
If you want to match the whole line, and instead of matching a newline you want to assert the end of the string you can use an anchor $ instead of \R
^\h*(?:#.*)?$
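In Java, ^ and $ only match at every line boundary when the MULTILINE flag is on, so a sketch that collects all comment or blank lines could look like this. The class name, method name, and sample input are assumptions for illustration; \h needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentLines {
    // (?m) makes ^ and $ match at each line, not only at the ends of the input.
    private static final Pattern COMMENT_OR_BLANK =
            Pattern.compile("(?m)^\\h*(?:#.*)?$");

    // Returns every line that is blank or consists of optional whitespace plus a # comment.
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = COMMENT_OR_BLANK.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "# header\nkey=value\n   \n  ## note\nlast";
        // Finds "# header", the whitespace-only line, and "  ## note".
        System.out.println(find(text).size()); // prints 3
    }
}
```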

What is the Regular Expression to get all the newline characters from the end of the string

I have tried [\s]+$ and (?:$|\s)+$ but I don't get the desired output.
What I am looking for is
String str = "this is a string ending with multiple newlines\n\n\n";
The newline can be \n, \r, or \r\n depending on the OS, so we use \s+ here.
I need to find all the newline characters at the end of the string,
and I have to use it in Java code.
The point is that \s, in Java, matches only ASCII whitespace ([ \t\n\x0B\f\r]) by default (it matches any Unicode whitespace if you use (?U)\s).
You can use either of the following:
String regex = "\\R+$";
String regex = "\\R+\\z";
If you need to get each individual line break sequence at the end of string, you can use
String regex = "\\R(?=\\R*$)";
These patterns mean
\R+ - one or more line break sequences
$ - at the end of the string (\z matches the very end of string and will work identically in this case)
\R(?=\R*$) - any line break sequence followed with zero or more line break sequences up to the end of the whole string.
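A short sketch of both patterns in use; the class and method names are hypothetical, and \R requires Java 8 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingBreaks {
    // Returns the whole run of line breaks at the end of the input, or "" if there is none.
    static String trailingRun(String s) {
        Matcher m = Pattern.compile("\\R+\\z").matcher(s);
        return m.find() ? m.group() : "";
    }

    // Counts the individual line break sequences at the end of the input.
    static int countTrailing(String s) {
        Matcher m = Pattern.compile("\\R(?=\\R*$)").matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String str = "this is a string ending with multiple newlines\n\n\n";
        System.out.println(trailingRun(str).length()); // prints 3
        System.out.println(countTrailing(str));        // prints 3
        // \r\n counts as one sequence, so this input ends with two line breaks.
        System.out.println(countTrailing("a\r\n\n"));  // prints 2
    }
}
```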

Matching at line endings

This is a pretty trivial thing: Replace a line
possibly containing trailing blanks
ended by '\n', '\r', '\r\n' or nothing
by a line containing no trailing blanks and ended by '\n'.
I thought I could do it via a simple regex. Here, "\\s+$" doesn't work as the $ matches before the final \n. That's why there's \\z. At least I thought. But
"\n".replaceAll("\\s*\\z", "\n").length()
returns 2. Actually, $, \\z, and \\Z do exactly the same thing here. I'm confused...
The explanation by Alan Moore was helpful, but it only now occurred to me that for replacing arbitrary trailing whitespace at EOF I can do
replaceFirst("\\s*\\z", "\n");
instead of replaceAll. A simple solution doing all the things described above is
replaceAll("(?<!\\s)\\s*\\z|[ \t]*(\r?\n|\r)", "\n");
I'm afraid, it's not very fast, but it's acceptable.
Actually, the \z is irrelevant. On the first match attempt, \s* consumes the linefeed (\n) and \z succeeds because it's now at the end of the string. So it replaces the linefeed with a linefeed, then it tries to match at the position after the linefeed, which is the end of the string. It matches again because \s* is allowed to match the empty string, so it replaces the empty string with another linefeed.
You might expect it to go on matching nothing and replacing it with infinite linefeeds, but that can't happen. Unless you reset it, the regex can't match twice at the same position. Or more accurately, starting at the same position. In this case, the first match started at position #0, and the second at position #1.
By the way, \s+$ should match the string "\n"; $ can match the very end of the string as well as before a line separator at the end of the string.
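The double match described above is easy to reproduce. In this sketch the class and method names are made up; the patterns are the ones from the question.

```java
class DoubleMatch {
    // Matches twice at the end: once consuming the whitespace, once empty at the very end.
    static String withReplaceAll(String s) {
        return s.replaceAll("\\s*\\z", "\n");
    }

    // Only the first match is replaced, so exactly one linefeed is appended.
    static String withReplaceFirst(String s) {
        return s.replaceFirst("\\s*\\z", "\n");
    }

    public static void main(String[] args) {
        System.out.println(withReplaceAll("\n").length());   // prints 2
        System.out.println(withReplaceFirst("\n").length()); // prints 1
    }
}
```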
Update: In order to handle both cases: (1) getting rid of unwanted whitespace at the end of the line, and (2) adding a linefeed in cases where there's no unwanted whitespace, I think your best bet is to use a lookbehind:
line = line.replaceAll("(?<!\\s)\\s*\\z", "\n");
This will still match every line, but it will only match once per line.
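Here is a sketch of the lookbehind version next to the combined pattern from the question, which also normalizes the inner line endings. Class and method names are illustrative only.

```java
class FinalNewline {
    // A match may only start where the previous character is not whitespace,
    // so the empty match at the very end is rejected once trailing whitespace was consumed.
    static String ensureFinalNewline(String s) {
        return s.replaceAll("(?<!\\s)\\s*\\z", "\n");
    }

    // Combined pattern from the question: also strips trailing blanks on inner lines
    // and converts \r\n and \r to \n.
    static String normalize(String s) {
        return s.replaceAll("(?<!\\s)\\s*\\z|[ \t]*(\r?\n|\r)", "\n");
    }

    public static void main(String[] args) {
        System.out.println(ensureFinalNewline("\n").length());            // prints 1
        System.out.println(normalize("a  \r\nb\t\rc ").equals("a\nb\nc\n")); // prints true
    }
}
```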
Could you just do something like the following?
String result = myString.trim() + '\n';

Meaning of "\\cM?\r?\n" in Java

I got a class for something that I wanted to do in Java and it uses a line
text[i] = text[i].replaceAll("\\cM?\r?\n", "");
I understand that replaceAll replaces matches of the first string with the second one, but I don't completely understand what "\cM?\r?\n" stands for.
I would appreciate if someone can explain this text between quotes. (I did try to google it but did not find a satisfactory answer)
It's a regular expression.
\cM matches a Control-M or carriage return character
\r Matches a carriage return character
\n is a new line
? matches the preceding character or subexpression zero or one time. For example, "do(es)?" matches the "do" in "do" or "does". ? is equivalent to {0,1}.
Different operating systems have different ways to start a new line: on Windows it's \r\n, on POSIX systems it's \n, etc.
Your code is essentially removing all new lines and putting everything on a single line.
It matches newline sequences. \cM is Control-M, which is the same character as the carriage return \r, so both \cM? and \r? match an optional carriage return. \r\n is the Windows line ending and \n is the standard Unix line ending.
? means optional.
So it matches \n, \r\n, or \r\r\n (a lone \r without \n is not matched) and removes them. That should put everything on a single line...
As I said earlier, in comment...
Character Escapes
\cX matches a control character. E.g: \cm matches control-M.
\r matches carriage return.
\n matches linefeed.
http://www.javascriptkit.com/javatutors/redev2.shtml
It's a regular expression, and in your case it removes all line breaks that match the following: \cM matches a Control-M, i.e. carriage return, character; \r also matches a carriage return; and \n matches a new line.
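To check what the pattern from the question actually matches, a small sketch (class and method names are made up): \cM and \r are the same character, and the pattern needs a \n to match at all.

```java
class ControlM {
    // Same call as in the question: removes \n, \r\n and \r\r\n.
    static String joinLines(String s) {
        return s.replaceAll("\\cM?\r?\n", "");
    }

    public static void main(String[] args) {
        System.out.println("\r".matches("\\cM"));             // prints true
        System.out.println(joinLines("a\r\nb\nc\r\r\nd"));    // prints abcd
        // A lone carriage return is left alone because the \n is mandatory.
        System.out.println(joinLines("a\rb").equals("a\rb")); // prints true
    }
}
```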

How to tell java.lang.String.split to skip the delimiter?

As input, my program gets a String containing IP addresses separated by a line delimiter, i.e. one IP address per line. To validate each of the addresses I do:
String[] temp;
temp = address.split(System.getProperty("line.separator"));
and then I loop through the array of Strings.
I was wondering why all but the last IP address were always invalid. I've found out that they look like 10.1.1.1^M
Is there a way to tell java.lang.String.split to drop the delimiter before putting the token into the array? Or what other options do I have here? Sorry, I'm not a Java Ninja, so I thought I'd ask you guys before I start googling for hours.
Thanks
Thomas
The problem is that the delimiter in your file is "\r\n", but the value of System.getProperty("line.separator") is "\n". This means that the "\r" is not treated as part of the delimiter.
Why don't you just use address.split("\\s+") since valid IP addresses can never contain spaces in them?
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
It appears the file uses a different line separator from that of the platform (e.g. edited on MS-DOS/Windows and run on Linux).
I would use \\s+ to break on any run of whitespace. This also drops trailing whitespace; leading whitespace would produce an empty first element, so call trim() first if the input may start with blanks.
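A sketch of the options, with a fixed "\n" standing in for a Unix line.separator so the result does not depend on the platform. The class and method names are hypothetical, and split("\\R") is an additional alternative not mentioned in the answers above (it assumes Java 8 or later).

```java
class SplitAddresses {
    // Reproduces the problem: \r stays attached to every token but the last.
    static String[] onNewlineOnly(String s) {
        return s.split("\n");
    }

    // Splits on any run of whitespace; trim() avoids an empty first token.
    static String[] onWhitespace(String s) {
        return s.trim().split("\\s+");
    }

    // Assumption: Java 8+, where \R matches \r\n, \n, \r and other line breaks.
    static String[] onLineBreaks(String s) {
        return s.split("\\R");
    }

    public static void main(String[] args) {
        String address = "10.1.1.1\r\n10.1.1.2\r\n10.1.1.3";
        System.out.println(onNewlineOnly(address)[0].endsWith("\r")); // prints true
        System.out.println(onWhitespace(address).length);             // prints 3
        System.out.println(onLineBreaks(address)[0]);                 // prints 10.1.1.1
    }
}
```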
Your line.separator does not match the file. Its value depends on the system you are running on:
\n = LF (Line Feed) // Used as a new line character in Unix
\r = CR (Carriage Return) // Used as a new line character in classic Mac OS
\r\n = CR + LF // Used as a new line character in Windows
