Meaning of "\\cM?\r?\n" in Java - java

I got a class for something that I wanted to do in Java and it uses a line
text[i] = text[i].replaceAll("\\cM?\r?\n", "");
I understand that replaceAll replaces the first string with the second one, but I don't completely understand what "\\cM?\r?\n" stands for.
I would appreciate if someone can explain this text between quotes. (I did try to google it but did not find a satisfactory answer)

It's a regular expression.
\cM matches a Control-M or carriage return character
\r Matches a carriage return character
\n is a new line
? Matches the preceding character or subexpression zero or one time.
For example, "do(es)?" matches the "do" in "do" or "does". ? is
equivalent to {0,1}
Different operating systems mark the end of a line differently: Windows uses \r\n, while POSIX systems use just \n.
Your code is essentially removing all new lines and making everything on one single line.
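As a minimal sketch of what the pattern from the question does, the snippet below applies it to a few line-ending styles. The class and method names are hypothetical and only serve the illustration.

```java
class LineBreakStripDemo {
    // The pattern from the question: optional Control-M, optional CR, then a required LF.
    static final String PATTERN = "\\cM?\r?\n";

    static String stripLineBreaks(String text) {
        return text.replaceAll(PATTERN, "");
    }

    public static void main(String[] args) {
        System.out.println(stripLineBreaks("one\ntwo"));       // Unix-style ending
        System.out.println(stripLineBreaks("one\r\ntwo"));     // Windows-style ending
        System.out.println(stripLineBreaks("one\r\r\ntwo"));   // doubled carriage return
        // A lone CR is left untouched because the LF in the pattern is not optional.
        System.out.println(stripLineBreaks("one\rtwo").equals("one\rtwo"));
    }
}
```

Note that \cM and \r denote the same character, so the pattern effectively allows up to two carriage returns before the line feed.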

It matches line endings. \cM is Control-M, which is the same character as the carriage return \r. \r\n is the Windows line ending and \n is the standard Unix line ending.
? means optional.
So the pattern matches \n, \r\n or \r\r\n: an optional Control-M, an optional \r, then a required \n. Replacing those with "" puts everything on a single line.

As I said earlier, in comment...
Character Escapes
\cX matches a control character. E.g.: \cM matches control-M.
\r matches carriage return.
\n matches linefeed.
http://www.javascriptkit.com/javatutors/redev2.shtml

It's a regular expression, and in your case it will remove all line break characters that match the following: \cM matches a Control-M (carriage return) character, \r matches a carriage return character, and \n matches a new line.

Related

Regex string validation

Trying to write some regex to validate a string, where null and empty strings are not allowed, but characters + new line should be allowed. The string I'm trying to validate is as follows:
First line \n
Second line \n
This is as far as I got:
^(?!\s*$).+
This fails my validation because of the new line. Any ideas? I should add, I cannot use awk.
Code
The following regex matches the entire line.
See regex in use here
^[^\r\n]*?\S.*$
The following regexes do the same as above except they're used for validation purposes only (they don't match the whole line, instead they simply ensure it's properly formed). The benefit of using these regexes over the one above is the number of steps (performance). In the regex101 links below they show as 28 steps as opposed to 34 for the pattern above.
See regex in use here
^[^\r\n]*?\S
See regex in use here
^.*?\S
Results
Input
First line \n
Second line \n
s
Output
Matches only
First line \n
Second line \n
s
Explanation
^ Assert position at the start of the line
[^\r\n]*? Match any character not present in the set (any character except the carriage return or line-feed characters) any number of times, but as few as possible (making this lazy increases performance: fewer steps)
\S Match any non-whitespace character
.* Match any character (excludes newline characters) any number of times
$ Assert position at the end of the line
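A minimal sketch of the validation-only pattern in Java, assuming the MULTILINE flag is what the answer means by "start of the line". The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class NonBlankValidator {
    // MULTILINE makes ^ match at the start of every line, not only at the start of the input.
    private static final Pattern HAS_CONTENT =
            Pattern.compile("^[^\\r\\n]*?\\S", Pattern.MULTILINE);

    // Valid means: not null and at least one line contains a non-whitespace character.
    static boolean isValid(String s) {
        return s != null && HAS_CONTENT.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("First line \nSecond line \n"));
        System.out.println(isValid("   \n  "));
        System.out.println(isValid(null));
    }
}
```

find() is used rather than matches() because the pattern only needs to locate one non-blank line, not consume the whole input.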
Try this pattern:
([\S ]*(\n)*)*

How to locate the end of the line in regex?

I have the following regex
in = in.replaceAll(" d+\n", "");
I wanted to use it to get rid of the "d" at the end of lines
But I just won't do that d
<i>I just won't do that</i> d
No, no-no-no, no, no d
What is not accurate with my regex in = in.replaceAll(" d+\n", "");
Most probably your lines are separated not only with \n but with \r\n. You can try \r?\n to optionally match \r before \n. Let's also not forget about the last d, which doesn't have any line separator after it. To handle it you need to add $ to your regex, which is an anchor representing the end of your data. So your final pattern could look like
in.replaceAll(" d+(\r?\n|$)", "")
In case you don't want to remove these line separators you can use "end of line anchor" $ with MULTILINE flag (?m) instead of line separators like
in.replaceAll("(?m) d+$", "")
especially because there is no line separator after the last d.
In Java, when the MULTILINE flag is specified, $ will match the empty string:
  Before a line terminator:
    A carriage-return character followed immediately by a newline character ("\r\n")
    Newline (line feed) character ('\n') without carriage-return ('\r') right in front
    Standalone carriage-return character ('\r')
    Next-line character ('\u0085')
    Line-separator character ('\u2028')
    Paragraph-separator character ('\u2029')
  At the end of the string
When the UNIX_LINES flag is specified along with the MULTILINE flag, $ will match the empty string right before a newline ('\n') or at the end of the string.
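The difference between the two flag combinations can be sketched as follows. The class name, method names and sample input are made up for illustration; (?m) and (?d) are the embedded forms of MULTILINE and UNIX_LINES.

```java
class EndOfLineAnchorDemo {
    // (?m): $ matches before every line terminator (including \r\n) and at the end of input.
    static String stripTrailingD(String in) {
        return in.replaceAll("(?m) d+$", "");
    }

    // (?dm): with UNIX_LINES added, only '\n' counts as a line terminator for $.
    static String stripTrailingDUnixLines(String in) {
        return in.replaceAll("(?dm) d+$", "");
    }

    public static void main(String[] args) {
        String in = "first d\r\nsecond d\nthird d";
        System.out.println(stripTrailingD(in).equals("first\r\nsecond\nthird"));
        // Under UNIX_LINES the " d" before \r\n survives, because $ does not match before '\r'.
        System.out.println(stripTrailingDUnixLines(in).equals("first d\r\nsecond\nthird"));
    }
}
```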
Anyway if it is possible don't use regex with HTML.
As Pshemo states in his answer, your string most likely contains Windows-style newline characters, which are \r\n as opposed to just \n.
You can modify your regex to account for both newline character (plus the case where the string ends with a d without a newline) with the code:
in = in.replaceAll("(d+(?=\r\n)|d+(?=\n)|d+$)","");
This regex will remove anything that matches d+ followed by \r\n, d+ followed by \n or d+$ (any d before the end of the String).
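A compact variant of the same lookahead idea might look like the sketch below. It is not the answer's exact pattern: it keeps the leading space from the question's regex (an assumption, so that a d ending an ordinary word is not removed) and folds the three alternatives into one lookahead. Names are hypothetical.

```java
class LookaheadDemo {
    // The lookahead checks for a line separator or the end of input without consuming it,
    // so the line separators stay in the result.
    static String stripTrailingD(String in) {
        return in.replaceAll(" d+(?=\r?\n|$)", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingD("that d\r\nno d").equals("that\r\nno"));
    }
}
```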

Matching at line endings

This is a pretty trivial thing: Replace a line
possibly containing trailing blanks
ended by '\n', '\r', '\r\n' or nothing
by a line containing no trailing blanks and ended by '\n'.
I thought I could do it via a simple regex. Here, "\\s+$" doesn't work as the $ matches before the final \n. That's why there's \\z. At least I thought. But
"\n".replaceAll("\\s*\\z", "\n").length()
returns 2. Actually, $, \\z, and \\Z do exactly the same thing here. I'm confused...
The explanation by Alan Moore was helpful, but it was just now when it occurred to me that for replacing an arbitrary final blank garbage at EOF I can do
replaceFirst("\\s*\\z", "\n");
instead of replaceAll. A simple solution doing all the things described above is
replaceAll("(?<!\\s)\\s*\\z|[ \t]*(\r?\n|\r)", "\n");
I'm afraid, it's not very fast, but it's acceptable.
Actually, the \z is irrelevant. On the first match attempt, \s* consumes the linefeed (\n) and \z succeeds because it's now at the end of the string. So it replaces the linefeed with a linefeed, then it tries to match at the position after the linefeed, which is the end of the string. It matches again because \s* is allowed to match the empty string, so it replaces the empty string with another linefeed.
You might expect it to go on matching nothing and replacing it with infinite linefeeds, but that can't happen. Unless you reset it, the regex can't match twice at the same position. Or more accurately, starting at the same position. In this case, the first match started at position #0, and the second at position #1.
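The two matches described above can be observed directly with a Matcher. This is only a sketch; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Collects every match of \s*\z in the input as a "start-end" offset pair.
    static List<String> matchSpans(String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile("\\s*\\z").matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        // For "\n" there are two matches: the linefeed itself, then the empty string after it.
        System.out.println(matchSpans("\n"));
    }
}
```

Because replaceAll substitutes once per match, the two matches are what make the resulting string two characters long.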
By the way, \s+$ should match the string "\n"; $ can match the very end of the string as well as before a line separator at the end of the string.
Update: In order to handle both cases: (1) getting rid of unwanted whitespace at the end of the line, and (2) adding a linefeed in cases where there's no unwanted whitespace, I think your best bet is to use a lookbehind:
line = line.replaceAll("(?<!\\s)\\s*\\z", "\n");
This will still match every line, but it will only match once per line.
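A small sketch of the lookbehind version applied to a few inputs, with hypothetical class and method names:

```java
class TrailingWhitespaceNormalizer {
    // (?<!\s) rejects a match that would start right after a whitespace character,
    // so the empty match at the very end cannot fire once \s* has consumed the trailing blanks.
    static String endWithSingleNewline(String line) {
        return line.replaceAll("(?<!\\s)\\s*\\z", "\n");
    }

    public static void main(String[] args) {
        System.out.println(endWithSingleNewline("\n").length());
        System.out.println(endWithSingleNewline("foo  \r\n").equals("foo\n"));
        System.out.println(endWithSingleNewline("foo").equals("foo\n"));
    }
}
```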
Could you just do something like the following?
String result = myString.trim() + '\n';

Remove string before double line break using regex

I have a string like this:
this is my text
more text
more text
text I want
is below
I just want the text below the double line break and not the stuff before.
Here is what I thought should work:
myString.replaceFirst(".+?(\n\n)","");
However it does not work. Any help would be greatly appreciated
You should use the below regex for your purpose: -
str = str.replaceFirst("(?s).+?(\n\n)", "");
Because, you want to match anything including the newline character before it encounters two newline characters back to back.
Note that dot (.) does not match a newline, so it would stop matching on encountering the first newline character.
If you want your dot(.) to match newline, you can use Pattern.DOTALL, which in case of str.replaceFirst, is achieved by using (?s) expression.
From the documentation of Pattern.DOTALL: -
In dotall mode, the expression . matches any character, including a
line terminator. By default this expression does not match line
terminators.
Dotall mode can also be enabled via the embedded flag expression (?s).
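To make the effect of (?s) concrete, here is a sketch that runs both patterns on the sample text, assuming the text contains exactly one blank line before "text I want". The class, constant and method names are hypothetical.

```java
class DotallDemo {
    // Assumed layout of the question's sample: three lines, a blank line, then two lines.
    static final String TEXT =
            "this is my text\nmore text\nmore text\n\ntext I want\nis below";

    // Without DOTALL the dot cannot cross a newline, so only the last line
    // before the double line break (plus the break itself) is removed.
    static String withoutDotall(String s) {
        return s.replaceFirst(".+?(\n\n)", "");
    }

    // With (?s) the dot also matches newlines, so everything up to and
    // including the first double line break is removed.
    static String withDotall(String s) {
        return s.replaceFirst("(?s).+?(\n\n)", "");
    }

    public static void main(String[] args) {
        System.out.println(withDotall(TEXT));
        System.out.println("---");
        System.out.println(withoutDotall(TEXT));
    }
}
```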
Why not:
s = s.substring(s.indexOf("\n\n") + 2);
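Note that + 2 is the length of "\n\n", so the result starts right after the double line break. If there is no double line break, indexOf returns -1 and this drops only the first character, so check for that case.
You can use split. Here is an example: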
String newString = string.split("\n\n")[1];
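Both non-regex suggestions assume the double line break is present. A guarded version of the indexOf approach could be sketched like this; the helper name is hypothetical, and returning the input unchanged when there is no double line break is an assumption about the desired behavior.

```java
class AfterBlankLine {
    // Returns the text after the first "\n\n", or the input unchanged if there is none.
    static String afterFirstBlankLine(String s) {
        int idx = s.indexOf("\n\n");
        // indexOf returns -1 when the separator is missing.
        return idx < 0 ? s : s.substring(idx + 2);
    }

    public static void main(String[] args) {
        System.out.println(afterFirstBlankLine("this is my text\nmore text\n\ntext I want\nis below"));
    }
}
```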

How to tell java.lang.String.split to skip the delimiter?

As input my program gets a String containing IP addresses separated by a line delimiter, i.e. one IP address per line. To validate each of the addresses I do:
String[] temp;
temp = address.split(System.getProperty("line.separator"));
and then I loop through the array of Strings.
I was wondering why all but the last IP Address were always invalid. I've found out, that they look like 10.1.1.1^M
Is there a way to tell the java.lang.String.split to drop the delimiter before putting the token into the array? Or what other options do I have here? Sorry, I'm not a Java Ninja, so I thought I'll ask you guys before I start googling for hours.
Thanks
Thomas
The problem is that the delimiter in your file is "\r\n", but the value of System.getProperty("line.separator") is "\n". This means that the "\r" is not treated as part of the delimiter.
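One way to avoid depending on the platform's line.separator is to split on a pattern that accepts either style. The sketch below shows two options; the class and method names are hypothetical, and \R requires Java 8 or later.

```java
import java.util.Arrays;

class SplitLinesDemo {
    // \R (Java 8+) matches any line break sequence and treats \r\n as a single unit.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    // Alternative for older Java versions: an optional CR followed by LF.
    static String[] splitLinesPortable(String text) {
        return text.split("\r?\n");
    }

    public static void main(String[] args) {
        String input = "10.1.1.1\r\n10.1.1.2\r\n10.1.1.3\r\n";
        System.out.println(Arrays.toString(splitLines(input)));
        // Splitting on "\n" alone leaves the carriage return attached to each token.
        System.out.println(input.split("\n")[0].endsWith("\r"));
    }
}
```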
Why don't you just use address.split("\\s+") since valid IP addresses can never contain spaces in them?
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
It appears you are using a different line separator from that of the platform (e.g. editing on MS-DOS/Windows and running on Linux).
I would use \\s+ to break on any run of whitespace. This also copes with trailing whitespace, since split drops trailing empty strings; leading whitespace still produces an empty first element, so trim() the input first.
Your line.separator does not match the line endings in your input. It depends on the system you are using:
\n = LF (Line Feed) // Used as a new line character in Unix
\r = CR (Carriage Return) // Used as a new line character in classic Mac OS
\r\n = CR + LF // Used as a new line character in Windows
