Matching at line endings - java

This is a pretty trivial thing: Replace a line
possibly containing trailing blanks
ended by '\n', '\r', '\r\n' or nothing
by a line containing no trailing blanks and ended by '\n'.
I thought I could do it via a simple regex. Here, "\\s+$" doesn't work as the $ matches before the final \n. That's why there's \\z. At least I thought. But
"\n".replaceAll("\\s*\\z", "\n").length()
returns 2. Actually, $, \\z, and \\Z do exactly the same thing here. I'm confused...
The explanation by Alan Moore was helpful, but it only just occurred to me that, to replace arbitrary trailing blank garbage at EOF, I can use
replaceFirst("\\s*\\z", "\n");
instead of replaceAll. A simple solution doing all the things described above is
replaceAll("(?<!\\s)\\s*\\z|[ \t]*(\r?\n|\r)", "\n");
I'm afraid it's not very fast, but it's acceptable.
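As a rough check of what that combined pattern does, here is a minimal sketch that applies it to a few inputs with mixed line endings. The class and method names (LineEndingNormalizer, normalize) are made up for illustration; only the regex comes from the text above.

```java
class LineEndingNormalizer {
    // Regex copied from the question: either trailing whitespace at the very end
    // of the input, or blanks/tabs followed by one line separator.
    private static final String REGEX = "(?<!\\s)\\s*\\z|[ \t]*(\r?\n|\r)";

    // Hypothetical helper: strips trailing blanks and normalizes every line ending to \n.
    static String normalize(String text) {
        return text.replaceAll(REGEX, "\n");
    }

    public static void main(String[] args) {
        // Mixed \r\n and \r separators, trailing blanks, no final terminator.
        System.out.println(normalize("a  \r\nb\t\rc").equals("a\nb\nc\n"));
        // Blank lines in the middle are kept.
        System.out.println(normalize("a\n\nb").equals("a\n\nb\n"));
        // Blank garbage at EOF collapses into a single \n.
        System.out.println(normalize("a \n \n").equals("a\n"));
    }
}
```

Note that the first alternative swallows all whitespace at the end of the input, so trailing blank lines at EOF are collapsed, while blank lines elsewhere survive.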

Actually, the \z is irrelevant. On the first match attempt, \s* consumes the linefeed (\n) and \z succeeds because it's now at the end of the string. So it replaces the linefeed with a linefeed, then it tries to match at the position after the linefeed, which is the end of the string. It matches again because \s* is allowed to match the empty string, so it replaces the empty string with another linefeed.
You might expect it to go on matching nothing and replacing it with infinite linefeeds, but that can't happen. Unless you reset it, the regex can't match twice at the same position. Or more accurately, starting at the same position. In this case, the first match started at position #0, and the second at position #1.
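To see the two matches for yourself, a small sketch along these lines lists the start and end offsets of every match. The class and helper names (EmptyMatchDemo, matchSpans) are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Returns "start-end" for every match of regex in input, in order.
    static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        // First the linefeed itself (0-1), then the empty string at the end (1-1).
        System.out.println(matchSpans("\\s*\\z", "\n"));
        // With + instead of *, the empty match at the end is not possible.
        System.out.println(matchSpans("\\s+$", "\n"));
        // Two replacements give two linefeeds; replaceFirst stops after the first match.
        System.out.println("\n".replaceAll("\\s*\\z", "\n").length());   // 2
        System.out.println("\n".replaceFirst("\\s*\\z", "\n").length()); // 1
    }
}
```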
By the way, \s+$ should match the string "\n"; $ can match the very end of the string as well as before a line separator at the end of the string.
Update: In order to handle both cases: (1) getting rid of unwanted whitespace at the end of the line, and (2) adding a linefeed in cases where there's no unwanted whitespace, I think your best bet is to use a lookbehind:
line = line.replaceAll("(?<!\\s)\\s*\\z", "\n");
This will still match every line, but it will only match once per line.
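Here is a quick sketch of how the lookbehind version behaves on single lines. The wrapper class and method name (LookbehindFix, fixLine) are assumptions made for the example; the regex is the one shown above.

```java
class LookbehindFix {
    // The lookbehind keeps the match from starting right after whitespace,
    // so the empty match at the very end cannot follow a whitespace match.
    static String fixLine(String line) {
        return line.replaceAll("(?<!\\s)\\s*\\z", "\n");
    }

    public static void main(String[] args) {
        System.out.println(fixLine("abc  ").equals("abc\n"));     // trailing blanks removed
        System.out.println(fixLine("abc").equals("abc\n"));       // linefeed added
        System.out.println(fixLine("abc \r\n").equals("abc\n"));  // \r\n normalized
        System.out.println(fixLine("\n").equals("\n"));           // no doubled linefeed
    }
}
```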

Could you just do something like the following?
String result = myString.trim() + '\n';
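One thing to keep in mind with this approach: trim() strips whitespace from both ends, so leading indentation is lost as well. The sketch below compares it with the lookbehind regex from the other answer; the class and method names are made up for illustration.

```java
class TrimCaveat {
    // trim() removes leading AND trailing whitespace.
    static String viaTrim(String line) {
        return line.trim() + '\n';
    }

    // The lookbehind regex only touches whitespace at the end.
    static String viaRegex(String line) {
        return line.replaceAll("(?<!\\s)\\s*\\z", "\n");
    }

    public static void main(String[] args) {
        String line = "  indented  \r\n";
        System.out.println(viaTrim(line).equals("indented\n"));    // indentation gone
        System.out.println(viaRegex(line).equals("  indented\n")); // indentation kept
    }
}
```

If the lines never start with whitespace that matters, the trim() version is the simpler choice.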

Related

Regex string validation

Trying to write some regex to validate a string, where null and empty strings are not allowed, but characters + new line should be allowed. The string I'm trying to validate is as follows:
First line \n
Second line \n
This is as far as I got:
^(?!\s*$).+
This fails my validation because of the new line. Any ideas? I should add, I cannot use awk.
Code
The following regex matches the entire line.
See regex in use here
^[^\r\n]*?\S.*$
The following regexes do the same as above, except that they're meant for validation only (they don't match the whole line; they simply ensure it's properly formed). The benefit of these regexes over the one above is the number of steps (performance). In the regex101 links below, they show as 28 steps as opposed to 34 for the pattern above.
See regex in use here
^[^\r\n]*?\S
See regex in use here
^.*?\S
Results
Input
First line \n
Second line \n
s
Output
Matches only
First line \n
Second line \n
s
Explanation
^ Assert position at the start of the line
[^\r\n]*? Match any character not present in the set (any character except the carriage return or line-feed characters) any number of times, but as few as possible (making this lazy improves performance: fewer steps)
\S Match any non-whitespace character
.* Match any character (excludes newline characters) any number of times
$ Assert position at the end of the line
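The question does not name a language, so the following is only a sketch of how the validation-only pattern could be used from Java. It assumes the MULTILINE flag (so that ^ can match at the start of any line, as in the regex tester) and treats null as invalid; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class NonBlankCheck {
    // ^.*?\S : from the start of a line, reach the first non-whitespace character.
    // MULTILINE is an assumption here; without it only the first line is inspected.
    private static final Pattern HAS_TEXT = Pattern.compile("^.*?\\S", Pattern.MULTILINE);

    // Valid means: not null, and at least one line contains a non-whitespace character.
    static boolean isValid(String s) {
        return s != null && HAS_TEXT.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("First line \nSecond line \n")); // true
        System.out.println(isValid(""));                            // false
        System.out.println(isValid("   \n\t"));                     // false
        System.out.println(isValid(null));                          // false
    }
}
```

find() is used instead of matches() because the pattern deliberately stops at the first non-whitespace character rather than consuming the whole input.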
Try this pattern:
([\S ]*(\n)*)*

How to locate the end of the line in regex?

I have the following regex
in = in.replaceAll(" d+\n", "");
I wanted to use it to get rid of the "d" at the end of lines
But I just won't do that d
<i>I just won't do that</i> d
No, no-no-no, no, no d
What is wrong with my regex in = in.replaceAll(" d+\n", "");
Most probably your lines are separated not only by \n but by \r\n. You can try \r?\n to optionally match a \r before the \n. Let's also not forget the last d, which has no line separator after it. To handle it, add $ to your regex; it is an anchor that matches the end of your data. So your final pattern could look like
in.replaceAll(" d+(\r?\n|$)", "")
In case you don't want to remove these line separators you can use "end of line anchor" $ with MULTILINE flag (?m) instead of line separators like
in.replaceAll("(?m) d+$", "")
especially because there is no line separator after the last d.
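The difference between the two patterns can be sketched as follows, using the three sample lines from the question joined with \r\n (the Windows-style separators are an assumption, as is the class name).

```java
class TrailingMarker {
    // Removes " d" together with the line separator that follows it.
    static String stripWithSeparators(String in) {
        return in.replaceAll(" d+(\r?\n|$)", "");
    }

    // Removes only " d"; with (?m), $ matches before each line terminator.
    static String stripKeepingSeparators(String in) {
        return in.replaceAll("(?m) d+$", "");
    }

    public static void main(String[] args) {
        String in = "But I just won't do that d\r\n"
                  + "<i>I just won't do that</i> d\r\n"
                  + "No, no-no-no, no, no d";
        // Everything ends up on one line.
        System.out.println(stripWithSeparators(in));
        // The three lines are kept, each without its trailing " d".
        System.out.println(stripKeepingSeparators(in));
    }
}
```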
In Java, when the MULTILINE flag is specified, $ will match the empty string:
  1. Before a line terminator:
     - A carriage-return character followed immediately by a newline character ("\r\n")
     - Newline (line feed) character ('\n') without carriage-return ('\r') right in front
     - Standalone carriage-return character ('\r')
     - Next-line character ('\u0085')
     - Line-separator character ('\u2028')
     - Paragraph-separator character ('\u2029')
  2. At the end of the string
When UNIX_LINES flag is specified along with MULTILINE flag, $ will match the empty string right before a newline ('\n') or at the end of the string.
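A practical consequence, sketched below with a made-up class name: once UNIX_LINES is added via the embedded flag (?d), $ no longer matches in front of a \r, so a trailing " d" before \r\n is left alone.

```java
class UnixLinesDemo {
    // MULTILINE only: $ matches before \r\n as well as before \n.
    static String multiline(String in) {
        return in.replaceAll("(?m) d+$", "");
    }

    // MULTILINE + UNIX_LINES: $ matches only before \n or at the end of the string.
    static String multilineUnix(String in) {
        return in.replaceAll("(?md) d+$", "");
    }

    public static void main(String[] args) {
        String in = "a d\r\nb d";
        System.out.println(multiline(in).equals("a\r\nb"));       // both lines stripped
        System.out.println(multilineUnix(in).equals("a d\r\nb")); // first line untouched
    }
}
```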
Anyway if it is possible don't use regex with HTML.
As Pshemo states in his answer, your string most likely contains Windows-style newline characters, which are \r\n as opposed to just \n.
You can modify your regex to account for both newline styles (plus the case where the string ends with a d without a newline) with the code:
in = in.replaceAll("(d+(?=\r\n)|d+(?=\n)|d+$)","");
This regex will remove anything that matches d+ followed by \r\n, d+ followed by \n or d+$ (any d before the end of the String).
(d+(?=\r\n)|d+(?=\n)|d+$)

Meaning of "\\cM?\r?\n" in Java

I got a class for something that I wanted to do in Java and it uses a line
text[i] = text[i].replaceAll("\\cM?\r?\n", "");
I understand that replaceAll replaces matches of the first string with the second one, but I don't completely understand what "\\cM?\r?\n" stands for.
I would appreciate if someone can explain this text between quotes. (I did try to google it but did not find a satisfactory answer)
It's a regular expression.
\cM matches a Control-M or carriage return character
\r Matches a carriage return character
\n is a new line
? Matches the preceding character or subexpression zero or one time.
For example, "do(es)?" matches the "do" in "do" or "does". ? is
equivalent to {0,1}
Different operating systems mark the start of a new line in different ways: on Windows it's \r\n, on POSIX systems it's \n, and so on.
Your code is essentially removing all the line breaks and putting everything on one single line.
It matches new-line sequences. \cM is Control-M, which is the carriage return, the same character that \r matches. \r\n is the Windows line ending. \n is the standard Unix line ending.
? means optional.
So it matches \n, optionally preceded by one or two carriage returns (\n, \r\n or \r\r\n). Removing those should put everything on a single line...
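To check what the pattern really matches, here is a small sketch (class and method names invented for the example). It relies on the fact that in Java's regex syntax \cM denotes Control-M, which is the carriage-return character.

```java
class ControlMDemo {
    // Same call as in the question: remove every line break the pattern matches.
    static String joinLines(String text) {
        return text.replaceAll("\\cM?\r?\n", "");
    }

    public static void main(String[] args) {
        // \cM and \r are the same character (U+000D).
        System.out.println("\r".matches("\\cM"));                       // true
        // \n, \r\n and \r\r\n are all removed.
        System.out.println(joinLines("a\r\nb\nc\r\r\nd").equals("abcd")); // true
        // A lone \r is not matched, because the \n is mandatory.
        System.out.println(joinLines("a\rb").equals("a\rb"));           // true
    }
}
```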
As I said earlier, in comment...
Character Escapes
\cX matches a control character. E.g: \cm matches control-M.
\r matches carriage return.
\n matches linefeed.
http://www.javascriptkit.com/javatutors/redev2.shtml
It's a regular expression, and in your case it will remove all new line/line break characters that match the following: \cM will match a Control-M or carriage return character, \r is used to match a carriage return character and \n is used for a new line

capture all characters between match character (single or repeated) on string

I'm trying to extract the string preceding a specific character, even when the character is repeated (i.e. underscore '_'), like this:
this_is_my_example_line_0
this_is_my_example_line_1_
this_is_my_example_line_2___
_this_is_my_ _example_line_3_
__this_is_my___example_line_4__
and after running my regex I should get this (the regex should ignore any instances of the matching character in the middle of the string):
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4
In other words I'm trying to 'trim' the matched character(s) at the beginning and end of string.
I'm trying to use a Regex in Java to accomplish this, my idea is to capture the group of characters between the special character(s) at the end or beginning of the line.
So far I can only do this successfully for example 3 with this regexp:
/[^_]+|_+(.*)[_$]+|_$+/
[^_]+ not 'underscore' once or more
| OR
_+ underscore once or more
(.*) capture all characters
[_$]+ not 'underscore' once or more followed by end of line
|_$+ OR 'underscore' once or more followed by end of line
I just realized that this excludes the first word of the message in examples 0, 1 and 2, since the string doesn't start with an underscore and it only starts matching after finding an underscore.
Is there an easier way not involving regex?
I don't really care about the first character (although it would be nice); I only need to ignore the repeating character at the end. It looks like (according to this regex tester) just doing this would work: /()_+$/. The empty parentheses match anything before single or repeating matches at the end of the line. Would that be correct?
Thank you!
There are a couple of options here: you could either replace matches of ^_+|_+$ with an empty string, or extract the contents of the first capture group from the match of ^_*(.*?)_*$. Note that if your strings may be multiple lines and you want to perform the replacement on each line, then you will need to use the Pattern.MULTILINE flag for either approach. If your strings may be multiple lines and you only want the replacement to occur at the very beginning and end, don't use Pattern.MULTILINE but use Pattern.DOTALL for the second approach.
For example: http://regexr.com?355ff
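Both options can be sketched in Java roughly like this. The class and method names are hypothetical; the two regexes are the ones given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreTrim {
    private static final Pattern CORE = Pattern.compile("^_*(.*?)_*$");
    private static final Pattern EDGES_PER_LINE =
            Pattern.compile("^_+|_+$", Pattern.MULTILINE);

    // Option 1: delete leading and trailing runs of underscores.
    static String byReplace(String s) {
        return s.replaceAll("^_+|_+$", "");
    }

    // Option 2: capture whatever sits between the leading and trailing underscores.
    static String byGroup(String s) {
        Matcher m = CORE.matcher(s);
        return m.find() ? m.group(1) : s;
    }

    // Option 1 applied to every line of a multi-line string.
    static String byReplacePerLine(String text) {
        return EDGES_PER_LINE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(byReplace("__this_is_my___example_line_4__"));
        System.out.println(byGroup("_this_is_my_ _example_line_3_"));
        System.out.println(byReplacePerLine("line_1_\n__line_2"));
    }
}
```

Underscores in the middle of the string are untouched in all three variants, because they are neither at the start nor followed by the end of the line.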
How about [^_\n\r](.*[^_\n\r])??
Demo
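import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo {
    public static void main(String[] args) {
        String data =
                "this_is_my_example_line_0\n" +
                "this_is_my_example_line_1_\n" +
                "this_is_my_example_line_2___\n" +
                "_this_is_my_ _example_line_3_\n" +
                "__this_is_my___example_line_4__";
        // A non-underscore character, optionally followed by anything ending in a non-underscore
        Pattern p = Pattern.compile("[^_\n\r](.*[^_\n\r])?");
        Matcher m = p.matcher(data);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}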
output:
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4

regular expression to match one or more of char a or just one of char b

I am taking user input through a UI, and I have to validate it. The input text should obey the following condition:
It should end either with one or more whitespace characters OR with just a single '='
I can use
".*[\s=]+"
but it matches multiple '=' also which I don't want to.
Please help.
You can use alternation:
(\s+|=)$
This expression means match one or more whitespace character or one equals, at the end of the string. The $ is an anchor which matches the end of the string (as you mentioned you're looking for characters at the end of the string).
(As tchrist correctly pointed out in the comments, $ matches the end of line instead of end of string when in multiline mode. If this is true in your case, and you are indeed looking for the end of the string instead of the end of the line, you can use \Z instead, which matches the end of the string regardless of multiline mode.)
If you want to ensure that there is only one = at the end, you can use a lookaround (in this case, a negative lookbehind, specifically). A lookaround is a zero-width assertion which tells the regex engine that the assertion must pass for the pattern to match, but it does not consume any characters.
(\s+|(?<!=)=)$
In this case, (?<!=) tells the regex engine, the character before the current position cannot be an =. When put into the expression, (?<!=)= means that the = will only match if the previous character is not also a =.
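A short sketch of the difference between the two expressions, using find() so that only the end of the input is inspected. The class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class EndingCheck {
    // Plain alternation: accepts any input whose last character is '=' or whitespace.
    private static final Pattern LOOSE = Pattern.compile("(\\s+|=)$");
    // With the negative lookbehind: a final '=' must not be preceded by another '='.
    private static final Pattern STRICT = Pattern.compile("(\\s+|(?<!=)=)$");

    static boolean looseEnding(String input) {
        return LOOSE.matcher(input).find();
    }

    static boolean strictEnding(String input) {
        return STRICT.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(strictEnding("key="));   // true
        System.out.println(strictEnding("key  "));  // true
        System.out.println(strictEnding("key=="));  // false
        System.out.println(looseEnding("key=="));   // true: the last '=' alone satisfies it
        System.out.println(strictEnding("key"));    // false
    }
}
```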
Begin string
Anything not "=" ( to avoid the double "==")
One or more blank spaces OR one "="
End of string
^([^=]*(\s+|=))$
Should work :-)
Try this expression:
".*(\\s+|=)"
