On HackerRank this line of code occurs very frequently. I think it is meant to skip whitespace, but what does the "\r\u2028\u2029\u0085" part mean?
scanner.skip("(\r\n|[\n\r\u2028\u2029\u0085])?");
Scanner.skip skips input that matches the pattern. Here the pattern is:
(\r\n|[\n\r\u2028\u2029\u0085])?
? matches zero or one occurrence of the preceding group.
| Alternative
[] Matches a single character present in the set
\r matches a carriage return
\n matches a newline (line feed)
\u2028 matches the character with index 2028 base 16 (8232 base 10 or 20050 base 8), case sensitive
\u2029 matches the character with index 2029 base 16 (8233 base 10 or 20051 base 8), case sensitive
\u0085 matches the character with index 85 base 16 (133 base 10 or 205 base 8), case sensitive
1st Alternative \r\n
\r matches a carriage return (ASCII 13)
\n matches a line-feed (newline) character (ASCII 10)
2nd Alternative [\n\r\u2028\u2029\u0085]
Match a single character present in the list below [\n\r\u2028\u2029\u0085]
\n matches a line-feed (newline) character (ASCII 10)
\r matches a carriage return (ASCII 13)
\u2028 matches the character with index 2028 (base 16), i.e. 8232 (base 10) or 20050 (base 8), literally (case sensitive): LINE SEPARATOR
\u2029 matches the character with index 2029 (base 16), i.e. 8233 (base 10) or 20051 (base 8), literally (case sensitive): PARAGRAPH SEPARATOR
\u0085 matches the character with index 85 (base 16), i.e. 133 (base 10) or 205 (base 8), literally (case sensitive): NEXT LINE
Skipping \r\n is for Windows.
The rest is standard \r=CR, \n=LF (see \r\n , \r , \n what is the difference between them?)
Then some Unicode special characters:
\u2028 = LINE SEPARATOR (https://www.fileformat.info/info/unicode/char/2028/index.htm)
\u2029 = PARAGRAPH SEPARATOR (http://www.fileformat.info/info/unicode/char/2029/index.htm)
\u0085 = NEXT LINE (https://www.fileformat.info/info/unicode/char/0085/index.htm)
OpenJDK's source code shows that nextLine() uses this regex for line separators:
private static final String LINE_SEPARATOR_PATTERN = "\r\n|[\n\r\u2028\u2029\u0085]";
\r\n is a Windows line ending.
\n is a UNIX line ending.
\r is a Macintosh (pre-OSX) line ending.
\u2028 is LINE SEPARATOR.
\u2029 is PARAGRAPH SEPARATOR.
\u0085 is NEXT LINE (NEL).
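To see how the optional group behaves at the start of some input, here is a minimal sketch using plain java.util.regex rather than Scanner. The class and method names (OptionalLineBreakDemo, consumedAtStart) are made up for illustration; only the pattern string comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalLineBreakDemo {
    // The separator alternation quoted above, wrapped in an optional group.
    static final Pattern OPTIONAL_BREAK =
            Pattern.compile("(\r\n|[\n\r\u2028\u2029\u0085])?");

    // Number of chars the pattern consumes at the very start of the input.
    static int consumedAtStart(String input) {
        Matcher m = OPTIONAL_BREAK.matcher(input);
        // lookingAt() anchors the match at the start of the input.
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        System.out.println(consumedAtStart("\r\nabc"));   // 2: CR LF is taken as one unit
        System.out.println(consumedAtStart("\nabc"));     // 1
        System.out.println(consumedAtStart("\u2028abc")); // 1
        System.out.println(consumedAtStart("\n\nabc"));   // 1: only one break is consumed
        System.out.println(consumedAtStart("abc"));       // 0: the empty match still succeeds
    }
}
```

Because \r\n is listed before the character class, a Windows line ending is consumed as a whole instead of leaving a stray \n behind.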
The whole thing is a regular expression, so you could simply drop it into https://regexr.com or https://regex101.com/ and it will provide you with a full description of what each part of the regex means.
Here it is for you though:
(\r\n|[\n\r\u2028\u2029\u0085])? / gm
1st Capturing Group (\r\n|[\n\r\u2028\u2029\u0085])?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
1st Alternative \r\n
\r matches a carriage return (ASCII 13)
\n matches a line-feed (newline) character (ASCII 10)
2nd Alternative [\n\r\u2028\u2029\u0085]
Match a single character present in the list below
[\n\r\u2028\u2029\u0085]
\n matches a line-feed (newline) character (ASCII 10)
\r matches a carriage return (ASCII 13)
\u2028 matches the character with index 2028 (base 16), i.e. 8232 (base 10) or 20050 (base 8), literally (case sensitive)
\u2029 matches the character with index 2029 (base 16), i.e. 8233 (base 10) or 20051 (base 8), literally (case sensitive)
\u0085 matches the character with index 85 (base 16), i.e. 133 (base 10) or 205 (base 8), literally (case sensitive)
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
As for scanner.skip this does (Scanner Pattern Tutorial):
The java.util.Scanner.skip(Pattern pattern) method skips input that matches the specified pattern, ignoring delimiters. This method will skip input if an anchored match of the specified pattern succeeds. If a match to the specified pattern is not found at the current position, then no input is skipped and a NoSuchElementException is thrown.
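The trailing ? matters because of that exception: a pattern that can match nothing always succeeds, so skip never throws. A small sketch, assuming a Scanner over a String instead of System.in; SkipBehaviourDemo and skipThrows are hypothetical names for this example.

```java
import java.util.NoSuchElementException;
import java.util.Scanner;

class SkipBehaviourDemo {
    // Returns true if Scanner.skip throws NoSuchElementException for this input and pattern.
    static boolean skipThrows(String input, String pattern) {
        try (Scanner sc = new Scanner(input)) {
            sc.skip(pattern);
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String required = "\r\n|[\n\r\u2028\u2029\u0085]";
        String optional = "(" + required + ")?";
        // The input does not start with a line break.
        System.out.println(skipThrows("abc", required)); // true: nothing to match, so skip throws
        System.out.println(skipThrows("abc", optional)); // false: the empty match is accepted
        System.out.println(skipThrows("\nabc", required)); // false: the line break is skipped
    }
}
```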
I would also recommend reading Alan Moore's answer to "RegEx in Java: how to deal with newline", where he talks about newer options added in Java 1.8.
scanner.skip("(\r\n|[\n\r\u2028\u2029\u0085])?");
In Unix and all Unix-like systems, \n is the code for end-of-line; \r means nothing special.
As a consequence, in C and most languages that somehow copy it (even remotely), \n is the standard escape sequence for end of line (translated to/from OS-specific sequences as needed).
In old Mac systems (pre-OS X), \r was the code for end-of-line instead.
In Windows (and many old OSs), the code for end of line is 2 characters, \r\n, in this order.
As a (surprising;-) consequence (harking back to OSs much older than Windows), \r\n is the standard line-termination for text formats on the Internet.
\u0085 NEXT LINE (NEL)
\u2029 PARAGRAPH SEPARATOR
\u2028 LINE SEPARATOR
The whole point of this is to consume the extra line break left in the input when reading with Scanner.
There's already a similar question here: scanner.skip. It won't skip ordinary spaces, since the space character (\u0020) is not in the pattern.
\r = CR (Carriage Return) // Used as a new line character in Mac OS before X
\n = LF (Line Feed) // Used as a new line character in Unix/Mac OS X
\r\n = CR + LF // Used as a new line character in Windows
\u2028 = line separator
\u2029 = paragraph separator
\u0085 = next line
This ignores one line break, see \R.
Almost exactly the same could have been done with \R (Java 8 and later), which additionally matches \u000B and \u000C - sigh.
scanner.skip("\\R?");
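To compare the hand-written alternation with \R, here is a rough sketch that checks each candidate separator against both patterns. It assumes Java 8 or newer, since older versions reject \R; the class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class LinebreakPatternComparison {
    static final Pattern EXPLICIT = Pattern.compile("\r\n|[\n\r\u2028\u2029\u0085]");
    static final Pattern BACKSLASH_R = Pattern.compile("\\R"); // needs Java 8 or newer

    static boolean explicitMatches(String s) {
        return EXPLICIT.matcher(s).matches();
    }

    static boolean backslashRMatches(String s) {
        return BACKSLASH_R.matcher(s).matches();
    }

    public static void main(String[] args) {
        String verticalTab = String.valueOf((char) 0x0B);
        String[] names = {"CR LF", "LF", "CR", "LINE SEPARATOR",
                "PARAGRAPH SEPARATOR", "NEXT LINE", "VERTICAL TAB", "FORM FEED"};
        String[] samples = {"\r\n", "\n", "\r", "\u2028",
                "\u2029", "\u0085", verticalTab, "\f"};
        for (int k = 0; k < samples.length; k++) {
            System.out.println(names[k]
                    + ": explicit=" + explicitMatches(samples[k])
                    + ", \\R=" + backslashRMatches(samples[k]));
        }
    }
}
```

The two patterns agree on every separator in the HackerRank pattern; they differ only for vertical tab and form feed, which \R accepts and the explicit pattern does not.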
I have a much simpler exercise to explain this
import java.util.Scanner;

public class Solution {
    public static void main(String[] args) {
        int i = 4;
        double d = 4.0;
        String s = "HackerRank ";
        Scanner scan = new Scanner(System.in);
        int a;
        double b;
        String c = null;
        a = scan.nextInt();
        b = scan.nextDouble();
        // If the double is followed by a line break, nextLine() returns "" here.
        c = scan.nextLine();
        System.out.println(c);
        scan.close();
        System.out.println(a + i);
        System.out.println(b + d);
        System.out.println(s.concat(c));
    }
}
Try running this first and look at the output.
After that, try this version:
import java.util.Scanner;

public class Solution {
    public static void main(String[] args) {
        int i = 4;
        double d = 4.0;
        String s = "HackerRank ";
        Scanner scan = new Scanner(System.in);
        int a;
        double b;
        String c = null;
        a = scan.nextInt();
        b = scan.nextDouble();
        // Consume the line break left behind by nextDouble(), if there is one.
        scan.skip("(\r\n|[\n\r\u2028\u2029\u0085])?");
        c = scan.nextLine();
        System.out.println(c);
        scan.close();
        System.out.println(a + i);
        System.out.println(b + d);
        System.out.println(s.concat(c));
    }
}
Then run it again.
This can be a very tricky interview question.
I was cursing myself before I realised the issue.
Just ask any programmer to read
an integer,
a double,
and a string,
all from user input.
If they don't know this, they will most likely fail.
You can find a much simpler explanation of the behavior of nextInt() and nextDouble() in their javadocs.
This is related to the Scanner class.
Let's suppose you have this input from the system console:
4
This is next line
int a = scanner.nextInt();
String s = scanner.nextLine();
The value of a will be read as 4, and the value of s will be an empty string, because nextLine() only reads the rest of the current line (here nothing but the line break after 4) and then moves to the next line.
To read it correctly, call nextLine() one more time, like below:
int a = scanner.nextInt();
scanner.nextLine();
String s = scanner.nextLine();
This makes sure the scanner reaches the next line, discarding anything left on the current one.
scan.skip("(\r\n|[\n\r\u2028\u2029\u0085])?");
The line above does the same job for a single leftover line break, whichever line-ending convention the OS or environment uses.
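The three variants can be compared without typing anything by feeding a Scanner from a String. This is only a sketch: the class and method names are hypothetical, and a String source stands in for System.in.

```java
import java.util.Scanner;

class NextLineAfterNextInt {
    // nextInt() followed directly by nextLine().
    static String withoutSkip(String input) {
        try (Scanner sc = new Scanner(input)) {
            sc.nextInt();
            return sc.nextLine();
        }
    }

    // nextInt(), then skip at most one line break, then nextLine().
    static String withSkip(String input) {
        try (Scanner sc = new Scanner(input)) {
            sc.nextInt();
            sc.skip("(\r\n|[\n\r\u2028\u2029\u0085])?");
            return sc.nextLine();
        }
    }

    // nextInt(), then throw away the rest of the current line, then nextLine().
    static String withExtraNextLine(String input) {
        try (Scanner sc = new Scanner(input)) {
            sc.nextInt();
            sc.nextLine();
            return sc.nextLine();
        }
    }

    public static void main(String[] args) {
        String input = "4\nThis is next line\n";
        System.out.println("[" + withoutSkip(input) + "]");       // []
        System.out.println("[" + withSkip(input) + "]");          // [This is next line]
        System.out.println("[" + withExtraNextLine(input) + "]"); // [This is next line]
    }
}
```

The two fixes are not identical. If something follows the number on the same line, as in "4 extra\nnext", the skip variant returns " extra" (the pattern finds no line break to skip), while the extra nextLine() call discards it and returns "next".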
Related
Why don't I have to use nextLine() when entering numbers into the console on separate lines? I expected the console to treat the end of each line as \n, but the program works the same whether or not I call nextLine() after each nextInt(). Using Eclipse.
// Example used
10 5 // # lines to read , divisor
22 // if # divisible by divisor, count++
15
10
25
17
13
15
10
7
9
public static void main(String[] args) {
    Scanner keyboard = new Scanner(System.in);
    int n = getNumber(keyboard);
    int k = getNumber(keyboard);
    int numLinesPassed = getNumberPassCriteria(keyboard, n, k);
    System.out.println("# Passed: " + numLinesPassed);
}
public static int getNumber(Scanner keyboard) {
    return keyboard.nextInt();
}
public static int getNumberPassCriteria(Scanner keyboard, int n, int k) {
    int counter = 0;
    for (int i = 0; i < n; i++) {
        int value = keyboard.nextInt();
        if (value % k == 0) {
            counter++;
        }
        //keyboard.nextLine(); not understand why I don't need this
    }
    return counter;
}
You are probably confusing this with "Scanner is skipping nextLine() after using next() or nextFoo()?".
This is only a problem if you are reading, or trying to read the next line after the number using nextLine.
Why? nextInt reads a token, defined as text between delimiters (by default whitespace, including line feed and carriage return). That means that if the input starts with one or more delimiters, they are ignored (discarded). nextLine does not read tokens; it just reads up to the next line break, so if the next character is a line break, an empty string is returned.
Easy to test: just enter a couple of empty lines between the numbers - your code should read them without problems.
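The same test can be written down as code. A minimal sketch, assuming the numbers come from a String rather than the keyboard; TokenSkipsBlankLines and readAllInts are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TokenSkipsBlankLines {
    // Reads every int token; delimiters (spaces, line breaks, blank lines) are discarded.
    static List<Integer> readAllInts(String input) {
        List<Integer> values = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            while (sc.hasNextInt()) {
                values.add(sc.nextInt());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Blank lines and mixed line endings between the numbers.
        System.out.println(readAllInts("10 5\n\n22\r\n\r\n15\n")); // [10, 5, 22, 15]
    }
}
```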
Because the default delimiter for Scanner is this pattern
\p{javaWhitespace}+
(Note: this exact pattern is not explicitly documented, but it's what you obtain calling the delimiter() method of Scanner when you instantiate one without explicitly specifying a different separator)
EDIT
As per user15244370's suggestion this mess of references I made is actually documented directly in the Scanner documentation here. You can skip directly to the list below.
The meaning of that pattern is documented in the documentation for the Pattern class which itself refers to the Character.isWhitespace method
That pattern means a sequence of at least one character that is one of the following:
a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) that is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
'\t', U+0009 HORIZONTAL TABULATION.
'\n', U+000A LINE FEED.
'\u000B', U+000B VERTICAL TABULATION.
'\f', U+000C FORM FEED.
'\r', U+000D CARRIAGE RETURN.
'\u001C', U+001C FILE SEPARATOR.
'\u001D', U+001D GROUP SEPARATOR.
'\u001E', U+001E RECORD SEPARATOR.
'\u001F', U+001F UNIT SEPARATOR.
Which includes newline ('\n', the third item in the list).
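Both claims can be checked directly: the default delimiter pattern via Scanner.delimiter(), and individual characters via Character.isWhitespace. A small sketch with hypothetical class and method names:

```java
import java.util.Scanner;

class DefaultDelimiterDemo {
    // The delimiter pattern of a Scanner created without an explicit delimiter.
    static String defaultDelimiter() {
        try (Scanner sc = new Scanner("")) {
            return sc.delimiter().pattern();
        }
    }

    // javaWhitespace is defined in terms of Character.isWhitespace.
    static boolean isDefaultDelimiterChar(char c) {
        return Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        System.out.println(defaultDelimiter());
        System.out.println(isDefaultDelimiterChar('\n'));     // true: line feed
        System.out.println(isDefaultDelimiterChar('\r'));     // true: carriage return
        System.out.println(isDefaultDelimiterChar(' '));      // true: space
        System.out.println(isDefaultDelimiterChar('\u00A0')); // false: non-breaking space
    }
}
```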
When a user input contains Unicode characters (e.g. ‘ or ” ), the following action fails:
String[] unicodeStrings = answerText.split("((?<=\\R)|(?=\\R))");
I've tried debugging the split method, but I haven't found the root cause. I have a hunch it has something to do with the question mark (?) in the expression.
I've also tried an online java regex tool and applied the expression on some text with the following characters ‘”. It didn't show any error.
I've also tried writing a simple test method in online java compiler where I passed a test string with the ‘” characters and performed the above-mentioned split. No error either.
Code:
String answerText = uiq.getAnswerText();
if (answerText.matches("[\\x00-\\x7F]*")) //if the answerString consists only of ascii characters we encode it
sb.append("<String name=\"answerText\">")
.append(wrapCdata(uiq.isDate() ? formatDate(uiq.getAnswerText(), sourceFormat, targetFormat) : answerText)).append("</String>");
else { //if the answerString consists of unicode characters we encode only the Linebreakers (the \R)
String answerNonEscapedText = "";
String[] unicodeStrings = answerText.split("((?<=\\R)|(?=\\R))");//This regex splits the string to its linebreak-delimiters, including them. i.e. ("$$$\r\n" ---> [$,$,$,\r\n])
for (String str : unicodeStrings) {
if (str.matches("\\R"))
str = StringEscapeUtils.escapeJava(str);
answerNonEscapedText += str;
}
Error:
java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 6
((?<=\R)|(?=\R))
^
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.escape(Pattern.java:2416)
at java.util.regex.Pattern.atom(Pattern.java:2164)
at java.util.regex.Pattern.sequence(Pattern.java:2046)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.group0(Pattern.java:2807)
at java.util.regex.Pattern.sequence(Pattern.java:2018)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.group0(Pattern.java:2854)
at java.util.regex.Pattern.sequence(Pattern.java:2018)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.lang.String.split(String.java:2313)
at java.lang.String.split(String.java:2355)
Could you please help me finding the root cause of the failure?
String answerText = uiq.getAnswerText();
if (answerText.matches("[\\x00-\\x7F]*")) {
    sb.append("<String name=\"answerText\">")
            .append(wrapCdata(uiq.isDate()
                    ? formatDate(uiq.getAnswerText(), sourceFormat, targetFormat)
                    : answerText))
            .append("</String>");
} else {
    String[] unicodeStrings = answerText.split("\\R"); // Splits on line breaks.
    // This loses the exact line delimiter.
    String answerNonEscapedText = ""; // A StringBuilder would be better.
    for (String str : unicodeStrings) {
        answerNonEscapedText += str + "\\r\\n";
    }
In some cases this loss of the original line delimiters matters: there are CSV files where a field value may contain the line separator \n while each record ends in \r\n, or similar.
A simpler solution:
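// Java >= 9
// quoteReplacement keeps the backslashes produced by escapeJava from being
// interpreted as escapes in the replacement string.
String answerText = Pattern.compile("\\R").matcher(uiq.getAnswerText())
        .replaceAll(mr -> Matcher.quoteReplacement(StringEscapeUtils.escapeJava(mr.group())));
// Java < 9 (only for \r and \n)
String answerText = uiq.getAnswerText()
        .replace("\r", "\\r").replace("\n", "\\n");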
In this case, the regular expression was not incorrect. It is, however, supported only by Java 8+, and I had Java 7 in my environment. Upgrading Java solved the issue.
Pattern (Java Platform SE 7)
Perl constructs not supported by this class:
Predefined character classes (Unicode character)
\h A horizontal whitespace
\H A non horizontal whitespace
\v A vertical whitespace
\V A non vertical whitespace
\R Any Unicode linebreak sequence \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
\X Match Unicode extended grapheme cluster
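Where upgrading is not an option, the equivalent listed for \R can be spelled out by hand. The following is a sketch under assumptions, not the original poster's code: it avoids lambdas so it would also fit Java 7, and escapeChar is a hypothetical stand-in for StringEscapeUtils.escapeJava that only covers what the example needs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineBreakEscaper {
    // Explicit spelling of the linebreak matcher, usable where \R is unsupported.
    static final Pattern LINE_BREAK = Pattern.compile(
            "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]");

    // Hypothetical stand-in for an escaping helper, limited to line-break characters.
    static String escapeChar(char c) {
        switch (c) {
            case '\r': return "\\r";
            case '\n': return "\\n";
            case '\f': return "\\f";
            default:   return String.format("\\u%04X", (int) c);
        }
    }

    // Replaces every line break by its escaped form and leaves all other text untouched.
    static String escapeLineBreaks(String text) {
        Matcher m = LINE_BREAK.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder escaped = new StringBuilder();
            for (char c : m.group().toCharArray()) {
                escaped.append(escapeChar(c));
            }
            // quoteReplacement stops the backslashes from being read as replacement escapes.
            m.appendReplacement(out, Matcher.quoteReplacement(escaped.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The curly quotes pass through unchanged; only the line breaks are escaped.
        System.out.println(escapeLineBreaks("\u2018quoted\u201D\r\nsecond line\nthird"));
    }
}
```

Unlike split("\\R"), this keeps track of which delimiter was actually present, so \r\n and \n stay distinguishable in the output.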
Given a String containing numbers (possibly with decimals), parentheses and any amount of whitespace, I need to iterate through the String and handle each number and parenthesis.
The below works for the String "1 ( 2 3 ) 4", but does not work if I remove the whitespace between the parentheses and the numbers, as in "1 (2 3) 4".
Scanner scanner = new Scanner(expression);
while (scanner.hasNext()) {
String token = scanner.next();
// handle token ...
System.out.println(token);
}
Scanner uses whitespace as its default delimiter. You can change this to use a different regex pattern, for example:
(?:\\s+)|(?<=[()])|(?=[()])
With this pattern, the scanner splits on one or more whitespace characters, and also at the positions just after or just before a parenthesis. Because the last two alternatives are zero-width lookarounds, the parentheses themselves are kept as tokens (as I think you want to include those in your parsing?), while the whitespace is discarded.
Here is an example of using this:
String test = "123(3 4)56(7)";
Scanner scanner = new Scanner(test);
scanner.useDelimiter("(?:\\s+)|(?<=[()])|(?=[()])");
while(scanner.hasNext()) {
System.out.println(scanner.next());
}
Output:
123
(
3
4
)
56
(
7
)
Detailed Regex Explanation:
(?:\\s+)|(?<=[()])|(?=[()])
1st Alternative: (?:\\s+)
(?:\\s+) Non-capturing group
\\s+ match any whitespace character [ \t\n\x0B\f\r]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Alternative: (?<=[()])
(?<=[()]) Positive Lookbehind - Assert that the regex below can be matched
[()] match a single character present in the list below
() a single character in the list () literally
3rd Alternative: (?=[()])
(?=[()]) Positive Lookahead - Assert that the regex below can be matched
[()] match a single character present in the list below
() a single character in the list () literally
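An alternative to describing the delimiters is to describe the tokens themselves and collect them with Matcher.find(). This is a sketch, not part of the answer above; the token pattern (digits with an optional decimal part, or one parenthesis) and the names ExpressionTokenizer and tokenize are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpressionTokenizer {
    // A number with an optional decimal part, or a single parenthesis.
    static final Pattern TOKEN = Pattern.compile("\\d+(?:\\.\\d+)?|[()]");

    // Collects every token in order; anything that is not a token is skipped.
    static List<String> tokenize(String expression) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(expression);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1 (2 3) 4"));     // [1, (, 2, 3, ), 4]
        System.out.println(tokenize("123(3 4)56(7)")); // [123, (, 3, 4, ), 56, (, 7, )]
    }
}
```

Note that this version silently ignores any character that is neither whitespace nor part of a token, so input validation would have to be added separately.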
Scanner's .next() method uses whitespace as its delimiter by default. Luckily, we can change the delimiter!
For example, if you want the scanner to treat both whitespace and parentheses as delimiters (which discards the parentheses rather than returning them as tokens), you could run this code immediately after constructing your Scanner:
scanner.useDelimiter("[\\s()]+");
System.out.println("Enter a string:");
Scanner sc = new Scanner(System.in);
String str = sc.nextLine();
if (str.contains("\n")) {
System.out.println("yes");
}
With the code above, typing one\ntwo as the input does not print "yes",
but the code below does print "yes":
String str = "one\ntwo";
if (str.contains("\n")) {
System.out.println("yes");
}
Could anyone suggest the reason for such a result?
When you type one\ntwo in console input, \n is treated as two characters: \ and n. When you write "\n" in a String literal in code, it represents a single line feed character.
To check whether your input contains a \ character followed by n, use contains("\\n"). To create a literal \ we need to escape it by writing "\\", because \ is a special character in String literals (used, for instance, to create \n, \r, \t, or \").
In Java, \ is the 'escape' character. If you use \ in a String literal, it is not put into the String literally, but escapes the character right after it. For instance, you can use it to escape the double quote:
String str = "A double quote: \"";
You can also escape the escape character:
String str = "A backslash : \\";
The escape character is also used in meta-characters like \n. If you want to literally use those in a string, you have to escape them as well:
String str = "A newline character: \\n";
And that last example is effectively what you get when you read input from System.in: the string holds a literal backslash followed by n, not the new-line meta-character.
So to summarize: typing \n into System.in is equivalent to setting a String to "\\n" in code.
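The difference is easy to see by comparing lengths. In this sketch the typed input is simulated with the literal "one\\ntwo"; the class name and the unescapeNewlines helper are hypothetical additions, the latter showing one way to turn the typed two-character sequence into a real line feed if that is what you want.

```java
class EscapeSequenceDemo {
    static boolean containsLineFeed(String s) {
        return s.contains("\n");
    }

    static boolean containsBackslashN(String s) {
        return s.contains("\\n");
    }

    // Converts a typed backslash followed by n into a real line feed character.
    static String unescapeNewlines(String s) {
        return s.replace("\\n", "\n");
    }

    public static void main(String[] args) {
        String typed = "one\\ntwo";  // what nextLine() returns when the user types one\ntwo
        String literal = "one\ntwo"; // a String literal containing a real line feed

        System.out.println(typed.length());            // 8: backslash and n are two chars
        System.out.println(literal.length());          // 7: the line feed is one char
        System.out.println(containsLineFeed(typed));   // false
        System.out.println(containsBackslashN(typed)); // true
        System.out.println(containsLineFeed(unescapeNewlines(typed))); // true
    }
}
```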