What characters are in [[:jletterdigit:]] in JFlex?
I need to translate [[:jletterdigit:]] to a classical regex.
To clarify Michael Lowman's answer:
This is what the JFlex documentation says:
jletter and jletterdigit are predefined character classes. jletter includes all characters for which the Java function Character.isJavaIdentifierStart returns true and jletterdigit all characters for that Character.isJavaIdentifierPart returns true.
And what he wrote is the documentation of Character.isJavaIdentifierPart:
Determines if the specified character may be part of a Java identifier
as other than the first character.
A character may be part of a Java identifier if any of the following
are true:
it is a letter
it is a currency symbol (such as '$')
it is a connecting punctuation character (such as '_')
it is a digit
it is a numeric letter (such as a Roman numeral character)
it is a combining mark
it is a non-spacing mark
isIdentifierIgnorable returns true for the character
isIdentifierIgnorable is in turn defined as:
Determines if the specified character (Unicode code point) should be
regarded as an ignorable character in a Java identifier or a Unicode
identifier.
The following Unicode characters are ignorable in a Java identifier or
a Unicode identifier:
ISO control characters that are not whitespace
'\u0000' through '\u0008'
'\u000E' through '\u001B'
'\u007F' through '\u009F'
all characters that have the FORMAT general category value
A character may be part of a Java identifier if any of the following are true:
it is a letter
it is a currency symbol (such as '$')
it is a connecting punctuation character (such as '_')
it is a digit
it is a numeric letter (such as a Roman numeral character)
it is a combining mark
it is a non-spacing mark
isIdentifierIgnorable returns true for the character
from the Java API
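As a rough sketch of the translation asked for above: the categories in the quoted documentation map onto Unicode general-category properties, so one possible regex is a character class built from them. The class and method names below are made up for illustration, and the regex assumes an engine that supports \p{...} properties (other engines may ship a different Unicode version). If you stay in Java, the regex engine can also call the Character method directly via \p{javaJavaIdentifierPart}.

```java
import java.util.regex.Pattern;

final class JLetterDigitRegex {
    // Categories named in the isJavaIdentifierPart documentation:
    // letters (L), currency symbols (Sc), connector punctuation (Pc),
    // decimal digits (Nd), letter numbers (Nl), combining spacing marks (Mc),
    // non-spacing marks (Mn), format characters (Cf), plus the ignorable
    // ISO control ranges 0x00-0x08, 0x0E-0x1B and 0x7F-0x9F.
    static final Pattern EXPLICIT = Pattern.compile(
        "[\\p{L}\\p{Sc}\\p{Pc}\\p{Nd}\\p{Nl}\\p{Mc}\\p{Mn}\\p{Cf}"
        + "\\x00-\\x08\\x0E-\\x1B\\x7F-\\x9F]");

    // Java-only shortcut: delegates to Character.isJavaIdentifierPart.
    static final Pattern BUILT_IN = Pattern.compile("\\p{javaJavaIdentifierPart}");

    static boolean explicitMatches(int codePoint) {
        return EXPLICIT.matcher(new String(Character.toChars(codePoint))).matches();
    }

    static boolean builtInMatches(int codePoint) {
        return BUILT_IN.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // A few sample code points; 0x0301 is a combining acute accent.
        int[] samples = {'a', '$', '_', '7', 0x0301, '+', ' '};
        for (int cp : samples) {
            System.out.printf("U+%04X explicit=%b builtIn=%b Character=%b%n",
                cp, explicitMatches(cp), builtInMatches(cp),
                Character.isJavaIdentifierPart(cp));
        }
    }
}
```

The explicit class is only checked here against a handful of samples, so treat it as a starting point rather than a verified equivalent.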
I used Regex to verify a password with a minimum of 8 characters containing 1 uppercase, 1 lowercase, 1 special character, and 1 numeric value. The problem I encountered is that if people from Russia or any other country whose language is different from English try to enter the password using the default keyboard, it will create a problem for them.
Currently I have set this regex condition for my application:
String regex = "^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!#$%^&*-]).{8,}$";
It works if the user sets the password in English. But it does not work if the user's language is different. (A password like Испытание#123 will fail).
I searched on Google for a Russian regex for the default keyboard and found a solution for the Russian keyboard.
Do I need to work out a separate password validation regex for every language?
Is there an alternative solution so that I can check the password check for any default keyboard layout with the password validation that I want?
You can use \p{Ll} to match any Unicode lowercase letters and \p{Lu} to match any Unicode uppercase letters:
String regex = "^(?=\\P{Lu}*\\p{Lu})(?=\\P{Ll}*\\p{Ll})(?=[^0-9]*[0-9])(?=[^#?!#$%^&*-]*[#?!#$%^&*-]).{8,}$";
Note that the .*? in the lookaheads is not efficient, so I used the reverse patterns (\P{Lu} matches any char that is not a Unicode uppercase letter, \P{Ll} matches any char other than a Unicode lowercase letter, etc.) in line with the principle of contrast.
If you need to support any Unicode digits, too, replace (?=[^0-9]*[0-9]) with (?=\\P{Nd}*\\p{Nd}) where \P{Nd} matches any char other than a Unicode decimal digit and \p{Nd} matches any Unicode decimal digit.
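Here is a small sketch of how that could be wired up and tried out. The class and method names are invented for the example, the \p{Nd} variant is used for digits, and the special-character set is the one from the regex above with the repeated # written once. The test strings use Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

final class UnicodePasswordCheck {
    // One lookahead per requirement, each using a negated class ("contrast")
    // instead of a lazy dot pattern.
    static final Pattern POLICY = Pattern.compile(
        "^(?=\\P{Lu}*\\p{Lu})"             // at least one uppercase letter
        + "(?=\\P{Ll}*\\p{Ll})"            // at least one lowercase letter
        + "(?=\\P{Nd}*\\p{Nd})"            // at least one decimal digit
        + "(?=[^#?!$%^&*-]*[#?!$%^&*-])"   // at least one special character
        + ".{8,}$");                       // at least 8 characters overall

    static boolean isValid(String password) {
        return POLICY.matcher(password).matches();
    }

    public static void main(String[] args) {
        // Cyrillic word from the question followed by "#123".
        String russian = "\u0418\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435#123";
        // Same word with a lowercase first letter, so no uppercase letter at all.
        String noUpper = "\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435#123";
        System.out.println(isValid(russian));  // true
        System.out.println(isValid(noUpper));  // false
        System.out.println(isValid("Ab#1"));   // false, too short
    }
}
```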
When matching certain characters (such as line feed), you can use the regex "\\n" or indeed just "\n". For example, the following splits a string into an array of lines:
String[] lines = allContent.split("\\r?\\n");
But the following works just as well:
String[] lines = allContent.split("\r?\n");
My question:
Do the above two work in exactly the same way, or is there any subtle difference? If the latter, can you give an example case where you get different results?
Or is there a difference only in [possible/theoretical] performance?
There is no difference in the current scenario. The usual string escape sequences are formed with a single backslash followed by a valid escape char ("\n", "\r", etc.), while regex escape sequences are formed with a literal backslash (that is, a double backslash in the Java string literal) followed by a valid regex escape char ("\\n", "\\d", etc.).
"\n" (an escape sequence) is a literal LF (newline) and "\\n" is a regex escape sequence that matches an LF symbol.
"\r" (an escape sequence) is a literal CR (carriage return) and "\\r" is a regex escape sequence that matches a CR symbol.
"\t" (an escape sequence) is a literal tab symbol and "\\t" is a regex escape sequence that matches a tab symbol.
See the list in the Java regex docs for the supported list of regex escapes.
However, if you use a Pattern.COMMENTS flag (used to introduce comments and format a pattern nicely, making the regex engine ignore all unescaped whitespace in the pattern), you will need to either use "\\n" or "\\\n" to define a newline (LF) in the Java string literal and "\\r" or "\\\r" to define a carriage return (CR).
See a Java test:
String s = "\n";
System.out.println(s.replaceAll("\n", "LF")); // => LF
System.out.println(s.replaceAll("\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\n", "<LF>"));
// => <LF>
//<LF>
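Why is the last one producing <LF>+newline+<LF>? Because in "(?x)\n" the literal newline is unescaped whitespace, which the COMMENTS flag ignores. The pattern is therefore equal to "", an empty pattern, and it matches the empty string before the newline and after it.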
Yes, they are different. The Java compiler handles Unicode escapes in its own way, as described in section 3.3 of The Java Language Specification:
The Java programming language specifies a standard way of transforming
a program written in Unicode into ASCII that changes a program into a
form that can be processed by ASCII-based tools. The transformation
involves converting any Unicode escapes in the source text of the
program to ASCII by adding an extra u - for example, \uxxxx becomes
\uuxxxx - while simultaneously converting non- ASCII characters in the
source text to Unicode escapes containing a single u each.
So how does this affect \n vs \\n? From the Javadoc:
It is therefore necessary to double backslashes in string literals
that represent regular expressions to protect them from interpretation
by the Java bytecode compiler.
An example from the same doc:
The string literal "\b", for example, matches a single backspace
character when interpreted as a regular expression, while "\\b"
matches a word boundary. The string literal "\(hello\)" is illegal and
leads to a compile-time error; in order to match the string (hello)
the string literal "\\(hello\\)" must be used.
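To see the difference between the Java string literals "\b" and "\\b" in action, here is a minimal sketch (the class and method names are made up for the example). The first pattern reaches the regex engine as a single backspace character, the second as the two characters backslash and b, which the engine reads as a word boundary.

```java
final class BackslashB {
    // "\b" is turned into one backspace character (U+0008) by the compiler,
    // so the regex engine sees a literal backspace to match.
    static String replaceBackspace(String input) {
        return input.replaceAll("\b", "X");
    }

    // "\\b" reaches the regex engine as backslash + b, the word-boundary anchor.
    static String markWordBoundaries(String input) {
        return input.replaceAll("\\b", "|");
    }

    public static void main(String[] args) {
        System.out.println(replaceBackspace("a\bb"));   // aXb
        System.out.println(markWordBoundaries("a b"));  // |a| |b|
    }
}
```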
I have an issue with matching some punctuation characters when the Pattern.UNICODE_CHARACTER_CLASS flag is enabled.
Sample code is as follows:
final Pattern p = Pattern.compile("\\p{Punct}",Pattern.UNICODE_CHARACTER_CLASS);
final Matcher matcher = p.matcher("+");
System.out.println(matcher.find());
The output is false, although the documentation explicitly states that \p{Punct} includes characters such as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Apart from the '+' sign, the same problem occurs for the following characters: $+<=>^`|~
When Pattern.UNICODE_CHARACTER_CLASS is removed, it works fine.
I would appreciate any hints on this problem.
From the documentation:
When this flag is specified then the (US-ASCII only) Predefined
character classes and POSIX character classes are in conformance with
Unicode Technical Standard #18: Unicode Regular Expression Annex
C: Compatibility Properties.
If you take a look at the General Category Property section of UTS #18 (the Unicode Technical Standard mentioned above), you'll see a distinction between symbols (S and sub-categories) and punctuation (P and sub-categories) in the table there.
Quoting:
The most basic overall character property is the General Category,
which is a basic categorization of Unicode characters into: Letters,
Punctuation, Symbols, Marks, Numbers, Separators, and Other.
If you try your example with \\p{S}, with the flag on, it will match.
My guess is that + is not listed under punctuation as an arbitrary (yet semantically appropriate) choice, i.e. literally punctuation != symbols.
The Javadoc lists what comes under \p{Punct} with the caveat that it applies to
POSIX character classes (US-ASCII only)
If you take a look at the punctuation chars in Unicode, there is no + or $. See, for example, the list at http://www.fileformat.info/info/unicode/category/Po/list.htm .
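A quick way to check this yourself is sketched below (class and method names are made up for the example). It compiles \p{Punct} with and without the flag, and also shows one possible workaround, assuming what you actually want is "punctuation or symbol": a character class combining \p{P} and \p{S}.

```java
import java.util.regex.Pattern;

final class PunctVsSymbol {
    // Default behaviour: the US-ASCII POSIX punctuation class.
    static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}");

    // With the flag, \p{Punct} follows the Unicode punctuation category.
    static final Pattern UNICODE_PUNCT =
        Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);

    // Possible workaround: accept both punctuation (P) and symbols (S).
    static final Pattern PUNCT_OR_SYMBOL =
        Pattern.compile("[\\p{P}\\p{S}]", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean found(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(ASCII_PUNCT, "+"));      // true
        System.out.println(found(UNICODE_PUNCT, "+"));    // false
        System.out.println(found(UNICODE_PUNCT, "!"));    // true
        System.out.println(found(PUNCT_OR_SYMBOL, "+"));  // true
    }
}
```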
The java.net.URI ctor accepts most non-ASCII characters but does not accept ideographic space (0x3000). The ctor fails with java.net.URISyntaxException: Illegal character in path ...
So my questions are:
Why doesn't the URI ctor accept 0x3000 when it does accept other non-ASCII characters?
What other characters doesn't it accept?
The set of acceptable characters is spelled out in detail in the JavaDoc documentation for java.net.URI
Character categories
RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:
alpha The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
digit The US-ASCII decimal digit characters, '0' through '9'
alphanum All alpha and digit characters
unreserved All alphanum characters together with those in the string "_-!.~'()*"
punct The characters in the string ",;:$&+="
reserved All punct characters together with those in the string "?/[]@"
escaped Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
other The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)
The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.
In particular, "other" does not include space characters, which are defined (by Character.isSpaceChar) as those with Unicode general category types
SPACE_SEPARATOR
LINE_SEPARATOR
PARAGRAPH_SEPARATOR
and according to the page you've linked to in the question, the ideographic space character is indeed one of these types.
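The following sketch shows the effect described above. The class name, helper method and example.com URLs are placeholders chosen for the example; the Character and URI calls are the standard ones.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriSpaceCheck {
    // Returns true if the single-argument URI constructor accepts the string.
    static boolean parses(String text) {
        try {
            new URI(text);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+3000 (ideographic space) is a space character, so it is not "other".
        System.out.println(Character.isSpaceChar('\u3000'));                             // true
        System.out.println(Character.getType('\u3000') == Character.SPACE_SEPARATOR);   // true
        System.out.println(parses("http://example.com/a\u3000b"));                       // false
        // U+00E9 (e with acute) is neither a space nor a control character.
        System.out.println(parses("http://example.com/\u00e9"));                         // true
    }
}
```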
Please note the 1st example contains the ideographic space rather than a regular space.
It is the ideographic space that is the problem.
Here is the code that allows non-ASCII characters to be used:
} else if ((c > 128)
&& !Character.isSpaceChar(c)
&& !Character.isISOControl(c)) {
// Allow unescaped but visible non-US-ASCII chars
return p + 1;
}
As you can see, it disallows "funky" non-visible characters.
See also the URI class javadocs which specifies which characters are allowed (by the class!) in each component of a URI.
Why?
It is probably a safety measure.
What others are disallowed?
Any character that is whitespace or a control character ... according to the respective Character predicate methods. (See the Character javadocs for a precise specification.)
You should also note that this is a deviation from the URI specification. The URI specification says that non-ASCII characters are only allowed if you:
convert the UCS character code to UTF-8, and
percent encode the UTF-8 bytes as required by the spec.
My understanding is that the URI.toASCIIString() method will take care of that if you have a "deviant" java.net.URI object.
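For what it's worth, here is a minimal sketch of that conversion. The class name, method name and example.com URL are placeholders; URI.create is used only to avoid the checked exception.

```java
import java.net.URI;

final class UriAsciiForm {
    // Parses the text as a URI and returns its US-ASCII form, in which
    // non-ASCII characters are UTF-8 encoded and percent-escaped.
    static String asciiForm(String text) {
        return URI.create(text).toASCIIString();
    }

    public static void main(String[] args) {
        // U+00E9 is encoded in UTF-8 as the two bytes C3 A9.
        System.out.println(asciiForm("http://example.com/\u00e9"));  // http://example.com/%C3%A9
        // A URI that is already pure ASCII comes back unchanged.
        System.out.println(asciiForm("http://example.com/plain"));   // http://example.com/plain
    }
}
```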
I got into an interesting discussion in a forum where we discussed the naming of variables.
Conventions aside, I noticed that it is legal for a variable to have the name of a Unicode character, for example the following is legal:
int \u1234;
However, if I for example gave it the name #, it produces an error. According to Sun's tutorial it is valid if "beginning with a letter, the dollar sign "$", or the underscore character "_"."
But the unicode 1234 is some Ethiopic character. So what is really defined as a "letter"?
The Unicode standard defines what counts as a letter.
From the Java Language Specification, section 3.8:
Letters and digits may be drawn from
the entire Unicode character set,
which supports most writing scripts in
use in the world today, including the
large sets for Chinese, Japanese, and
Korean. This allows programmers to use
identifiers in their programs that are
written in their native languages.
A "Java letter" is a character for which
the method
Character.isJavaIdentifierStart(int)
returns true. A "Java letter-or-digit"
is a character for which the method
Character.isJavaIdentifierPart(int)
returns true.
From the Character documentation for isJavaIdentifierPart:
Determines if the character (Unicode code point) may be part of a Java identifier as other
than the first character.
A character may be part of a Java identifier if any of the following are true:
it is a letter
it is a currency symbol (such as '$')
it is a connecting punctuation character (such as '_')
it is a digit
it is a numeric letter (such as a Roman numeral character)
it is a combining mark
it is a non-spacing mark
isIdentifierIgnorable(codePoint) returns true for the character
Unicode characters fall into character classes, and one of those classes is "letter".
In Java, Character.isLetter(c) tells you whether a character belongs to it. For identifiers, though, Character.isJavaIdentifierStart(c) and Character.isJavaIdentifierPart(c) are more relevant.
For the relevant Unicode spec, see this.
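To check a particular character or name yourself, something like the following sketch works. The class and method names are made up for the example, and the check ignores reserved words such as keywords, which the language excludes separately.

```java
final class IdentifierCheck {
    // True if every code point satisfies the "Java letter" /
    // "Java letter-or-digit" rules quoted from the JLS above.
    static boolean looksLikeIdentifier(String name) {
        if (name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(Character.isLetter(0x1234));   // true, an Ethiopic syllable
        System.out.println(looksLikeIdentifier("\u1234")); // true
        System.out.println(looksLikeIdentifier("#"));      // false
        System.out.println(looksLikeIdentifier("$x"));     // true
        System.out.println(looksLikeIdentifier("1x"));     // false, a digit cannot start a name
    }
}
```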