How to tell java.lang.String.split to skip the delimiter? - java

As input, my program gets a String containing IP addresses separated by a line delimiter, i.e. one IP address per line. To validate each of the addresses I do:
String[] temp;
temp = address.split(System.getProperty("line.separator"));
and then I loop through the array of Strings.
I was wondering why all but the last IP address were always invalid. I found out that they look like 10.1.1.1^M
Is there a way to tell java.lang.String.split to drop the delimiter before putting the token into the array? Or what other options do I have here? Sorry, I'm not a Java Ninja, so I thought I'd ask you guys before I start googling for hours.
Thanks
Thomas

The problem is that the delimiter in your file is "\r\n", but the value of System.getProperty("line.separator") is "\n". This means that the "\r" is not treated as part of the delimiter.
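A minimal sketch of the problem and one possible fix, assuming the input uses Windows-style "\r\n" endings while the program runs where line.separator is "\n". The class and method names are made up for illustration.

```java
import java.util.Arrays;

class LineSplitDemo {
    // Splits on a bare "\n", as the question's code does on a Unix-like system.
    static String[] splitOnNewline(String text) {
        return text.split("\n");
    }

    // Splits on "\n" with an optional preceding "\r", so both endings work.
    static String[] splitOnAnyLineEnding(String text) {
        return text.split("\\r?\\n");
    }

    public static void main(String[] args) {
        String addresses = "10.1.1.1\r\n10.1.1.2\r\n10.1.1.3";
        // Every token except the last still ends with '\r' (displayed as ^M).
        System.out.println(splitOnNewline(addresses)[0].endsWith("\r"));
        System.out.println(Arrays.toString(splitOnAnyLineEnding(addresses)));
    }
}
```

Splitting on "\\r?\\n" does not depend on the platform's line.separator, which is why it is used here instead of System.getProperty.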

Why don't you just use address.split("\\s+") since valid IP addresses can never contain spaces in them?
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]

It appears your input uses a different line separator from that of the platform (e.g. the file was edited on MS-DOS/Windows and the program runs on Linux).
I would use \\s+ to split on any run of whitespace. This also ignores trailing whitespace; leading whitespace still yields an empty first element, so call trim() on the input first if that matters.
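A small sketch of the \\s+ approach, assuming the addresses may be surrounded by stray whitespace. The helper name tokens is hypothetical; trimming first is one way to avoid the empty first element that leading whitespace would otherwise produce.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Trims first so that leading whitespace does not yield an empty first token.
    static String[] tokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String input = " 10.1.1.1\r\n10.1.1.2 \n";
        // Without trim(): the leading space produces an empty first element.
        System.out.println(Arrays.toString(input.split("\\s+")));
        // With trim(): only the two addresses remain.
        System.out.println(Arrays.toString(tokens(input)));
    }
}
```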

Your line.separator does not match the input. Its value depends on the system you are using:
\n = LF (Line Feed) // Used as a new line character in Unix
\r = CR (Carriage Return) // Used as a new line character in classic Mac OS
\r\n = CR + LF // Used as a new line character in Windows
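For reference, '\n' is the line feed (code 10) and '\r' is the carriage return (code 13). One option, sketched below with a hypothetical normalize helper, is to convert every ending to "\n" before splitting, so the platform's separator no longer matters.

```java
class LineEndingDemo {
    // Converts Windows ("\r\n") and classic Mac ("\r") endings to Unix ("\n").
    static String normalize(String text) {
        // Handle the two-character pair first so it does not become two newlines.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        System.out.println((int) '\n'); // 10, line feed
        System.out.println((int) '\r'); // 13, carriage return
        String[] lines = normalize("10.1.1.1\r\n10.1.1.2\r10.1.1.3").split("\n");
        System.out.println(lines.length); // 3
    }
}
```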

Related

Java regex splitting, but only removing one whitespace

I have this code:
String[] parts = sentence.split("\\s");
and a sentence like: "this is a whitespace   and I want to split it" (note there are 3 whitespaces after "whitespace")
I want to split it in a way, where only the last whitespace will be removed, keeping the original message intact. The output should be
"[this], [is], [a], [whitespace  ], [and], [I], [want], [to], [split], [it]"
(two whitespaces after the word "whitespace")
Can I do this with regex and if not, is there even a way?
I removed the + from \\s+ to only remove one whitespace
You can use
String[] parts = sentence.split("\\s(?=\\S)");
That will split with a whitespace char that is immediately followed with a non-whitespace char.
Details:
\s - a whitespace char
(?=\S) - a positive lookahead that requires a non-whitespace char to appear immediately to the right of the current location.
To make it fully Unicode-aware in Java, add the (?U) (Pattern.UNICODE_CHARACTER_CLASS option equivalent) embedded flag option: .split("(?U)\\s(?=\\S)").
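A short sketch of the lookahead split on a sentence with three spaces after "whitespace". The class and method names are illustrative only.

```java
import java.util.Arrays;

class LookaheadSplitDemo {
    // Splits on a whitespace char only when the next char is non-whitespace,
    // so any extra whitespace before it stays attached to the preceding token.
    static String[] splitKeepingExtraSpaces(String sentence) {
        return sentence.split("\\s(?=\\S)");
    }

    public static void main(String[] args) {
        String sentence = "a whitespace   and more"; // three spaces after "whitespace"
        System.out.println(Arrays.toString(splitKeepingExtraSpaces(sentence)));
        // prints [a, whitespace  , and, more]
    }
}
```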

Java String Split() Method

I was wondering what the following line would do:
String[] parts = inputLine.split("\\s+");
Would this simply split the string at any spaces in the line? I think this is a regex, but I've never seen them before.
Yes, as the documentation states, split takes a regex as its argument.
In regex, \s represents the character class containing whitespace characters such as:
tab \t,
space " ",
line separators \n \r
more...
+ is a quantifier which can be read as "one or more", so \s+ represents text built from one or more whitespace characters.
We need to write this regex as "\\s+" (with two backslashes) because in a String literal \ is a special character which needs escaping (with another backslash) to produce a \ literal.
So split("\\s+") will produce an array of tokens separated by one or more whitespace characters. BTW trailing empty elements are removed, so "a b c ".split("\\s+") will return the array ["a", "b", "c"], not ["a", "b", "c", ""].
Yes, though it actually splits on any run of whitespace characters (including tabs, newlines, etc.). See the Java documentation on Patterns.
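The trailing-empty-element behaviour can be checked with a small sketch. The two-argument split with a negative limit is the standard way to keep trailing empty strings; the class and method names are made up.

```java
import java.util.Arrays;

class TrailingEmptyDemo {
    // Default split: trailing empty strings are removed from the result.
    static String[] dropTrailing(String s) {
        return s.split("\\s+");
    }

    // A negative limit keeps trailing empty strings.
    static String[] keepTrailing(String s) {
        return s.split("\\s+", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(dropTrailing("a b c "))); // [a, b, c]
        System.out.println(keepTrailing("a b c ").length);           // 4
    }
}
```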
It will split the string on one (or more) consecutive white space characters. The Pattern Javadoc describes the Predefined character classes (of which \s is one) as,
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
Note that the \\ is to escape the back-slash as required to embed it in a String.
Yes, and it splits both tab and space:
String t = "test your function aaa";
for(String s : t.split("\\s+"))
System.out.println(s);
Output:
test
your
function
aaa

How to locate the end of the line in regex?

I have the following regex
in = in.replaceAll(" d+\n", "");
I wanted to use it to get rid of the "d" at the end of lines
But I just won't do that d
<i>I just won't do that</i> d
No, no-no-no, no, no d
What is wrong with my regex in = in.replaceAll(" d+\n", "");
Most probably your lines are separated not only with \n but with \r\n. You can try \r?\n to optionally match \r before \n. Let's also not forget the last d, which doesn't have any line separator after it. To handle it you need to add $ to your regex, an anchor that matches the end of your data. So your final pattern could look like
in.replaceAll(" d+(\r?\n|$)", "")
In case you don't want to remove these line separators, you can use the "end of line" anchor $ with the MULTILINE flag (?m) instead of line separators, like
in.replaceAll("(?m) d+$", "")
especially because there is no line separator after the last d.
In Java, when MULTILINE flag is specified, $ will match the empty string:
  Before a line terminator:
    A carriage-return character followed immediately by a newline character ("\r\n")
    Newline (line feed) character ('\n') without carriage-return ('\r') right in front
    Standalone carriage-return character ('\r')
    Next-line character ('\u0085')
    Line-separator character ('\u2028')
    Paragraph-separator character ('\u2029')
  At the end of the string
When UNIX_LINES flag is specified along with MULTILINE flag, $ will match the empty string right before a newline ('\n') or at the end of the string.
Anyway if it is possible don't use regex with HTML.
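A sketch comparing the two patterns from this answer on input with Windows line endings. The method names are hypothetical; the regexes are the ones given above, written with escaped \\r and \\n.

```java
class EndOfLineDemo {
    // Removes " d" at the end of each line but keeps the line separators.
    static String stripKeepingLineBreaks(String in) {
        return in.replaceAll("(?m) d+$", "");
    }

    // Removes " d" together with the line separator that follows it, if any.
    static String stripWithLineBreaks(String in) {
        return in.replaceAll(" d+(\\r?\\n|$)", "");
    }

    public static void main(String[] args) {
        String in = "But I just won't do that d\r\nNo, no-no-no, no, no d";
        System.out.println(stripKeepingLineBreaks(in)); // still two lines
        System.out.println(stripWithLineBreaks(in));    // joined into one line
    }
}
```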
As Pshemo states in his answer, your string most likely contains Windows-style newline characters, which are \r\n as opposed to just \n.
You can modify your regex to account for both newline styles (plus the case where the string ends with a d without a newline) with the code:
in = in.replaceAll("(d+(?=\r\n)|d+(?=\n)|d+$)","");
This regex will remove anything that matches d+ followed by \r\n, d+ followed by \n or d+$ (any d before the end of the String).

Meaning of "\\cM?\r?\n" in Java

I got a class for something that I wanted to do in Java and it uses a line
text[i] = text[i].replaceAll("\\cM?\r?\n", "");
I completely understand that command replaceAll replaces first string with second one but don't completely understand what "\cM?\r?\n" stands for?
I would appreciate if someone can explain this text between quotes. (I did try to google it but did not find a satisfactory answer)
It's a regular expression.
\cM matches a Control-M or carriage return character
\r Matches a carriage return character
\n is a new line
? Matches the preceding character or subexpression zero or one time.
For example, "do(es)?" matches the "do" in "do" or "does". ? is
equivalent to {0,1}
Different operating systems mark the start of a new line differently: on Windows it is \r\n, on POSIX systems it is \n, etc.
Your code is essentially removing all new lines and putting everything on one single line.
It matches line endings. \cM is Control-M, which is the same character as \r (carriage return); \r\n is the Windows line ending and \n is the standard Unix line ending.
? means optional.
So it matches \n optionally preceded by one or two carriage returns, i.e. \n, \r\n or \r\r\n (a lone \r is not matched). Should make everything on a single line...
As I said earlier, in comment...
Character Escapes
\cX matches a control character. E.g: \cm matches control-M.
\r matches carriage return.
\n matches linefeed.
http://www.javascriptkit.com/javatutors/redev2.shtml
It's a regular expression, and in your case it will remove all new line/line break characters which match the following: \cM will match a Control-M or carriage return character, \r is used to match a carriage return character and \n is used for a new line.
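A quick sketch to check what the pattern does, using a hypothetical joinLines helper around the replaceAll call from the question. It also shows that \cM matches the same character as \r, and that a carriage return without a following \n is left alone because the pattern requires \n.

```java
import java.util.regex.Pattern;

class ControlMDemo {
    // Same call as in the question: removes "\n" plus up to two preceding CRs.
    static String joinLines(String text) {
        return text.replaceAll("\\cM?\r?\n", "");
    }

    public static void main(String[] args) {
        // Control-M and carriage return are the same character (code 13).
        System.out.println(Pattern.matches("\\cM", "\r"));
        System.out.println(joinLines("one\r\ntwo\nthree"));
    }
}
```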

combine two java regex

I'm having to use regex for the first time, and although I have almost achieved what I require, I do not seem to be able to combine it into a single statement.
I have a string of words where I wish to replace \n if it is preceded neither by a dot nor by a dot and a space.
I can run either of these two statements to achieve the required result. However, if I either run them one after another or try to combine them into a single regex, it does not work.
//replaces \n if not preceded by dot space
xx = xx.replaceAll("(.+)(?<!\\. )\n", "$1 ");
//replaces \n if not preceded by dot
xx = xx.replaceAll("(.+)(?<!\\.)\n", "$1 ");
//one of my attempts to combine into a single statement
xx = xx.replaceAll("(.+)(?<!\\. )\n|(?<!\\.)\n", "$1 ");
Example of String I'm trying to fix.
BEFORE
This is some text which may\n
have a newline character to break the line\n
but I only want to remove it if it's not preceded with a full.\n
or it's not preceded with a full stop and a space. \n
AFTER
This is some text which may
have a newline character to break the line
but I only want to remove it if it's not preceded with a full.\n
or it's not preceded with a full stop and a space. \n
I think I'm close, but being new to regex, I am getting more confused the more I read.
It's easier than you think:
String resultString = subjectString.replaceAll("(?<!\\. ?)\n", " ");
Explanation:
(?<! # Assert that the previous characters are not...
\. # a dot
[ ]? # optionally followed by a space
) # End of lookbehind
\n # Match a newline character
So you don't need to match (.+) in the first place, only to replace it with itself afterwards. Incidentally, here's what tripped you up:
(.+)(?<!\. )\n|(?<!\.)\n
is logically grouped as
(.+)(?<!\. )\n # Match this
| # or
(?<!\.)\n # this
so the (.+) is only matched if there is no space after the dot.
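A minimal sketch applying the lookbehind pattern to text shaped like the BEFORE example. The helper name joinSoftBreaks is made up; the regex is the one from this answer.

```java
class LookbehindJoinDemo {
    // Replaces "\n" with a space unless it is preceded by "." or ". ".
    static String joinSoftBreaks(String text) {
        return text.replaceAll("(?<!\\. ?)\n", " ");
    }

    public static void main(String[] args) {
        String before = "text which may\nhave a newline\nafter a full.\nor a stop and a space. \nend";
        // Only the first two newlines are replaced; the two after a full stop stay.
        System.out.println(joinSoftBreaks(before));
    }
}
```

Java accepts this lookbehind because its length is bounded (one or two characters).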
