The java.net.URI ctor accepts most non-ASCII characters but does not accept ideographic space (0x3000). The ctor fails with java.net.URISyntaxException: Illegal character in path ...
So my questions are:
Why doesn't the URI ctor accept 0x3000 when it does accept other non-ASCII characters?
Which other characters does it reject?
The set of acceptable characters is spelled out in detail in the JavaDoc documentation for java.net.URI
Character categories
RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:
alpha The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
digit The US-ASCII decimal digit characters, '0' through '9'
alphanum All alpha and digit characters
unreserved All alphanum characters together with those in the string "_-!.~'()*"
punct The characters in the string ",;:$&+="
reserved All punct characters together with those in the string "?/[]#"
escaped Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
other The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)
The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.
In particular, "other" does not include space characters, which are defined (by Character.isSpaceChar) as those with Unicode general category types
SPACE_SEPARATOR
LINE_SEPARATOR
PARAGRAPH_SEPARATOR
and according to the page you've linked to in the question, the ideographic space character is indeed one of these types.
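A minimal sketch of that check, assuming only the JDK. The class name IdeographicSpaceCheck and the example.com URIs are made up for illustration. It confirms that Character.isSpaceChar treats U+3000 as a space character, and that the single-argument URI constructor rejects it while accepting another non-ASCII character:

```java
import java.net.URI;
import java.net.URISyntaxException;

class IdeographicSpaceCheck {

    // True if the single-argument URI constructor accepts the string as-is.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        char ideographicSpace = (char) 0x3000;
        // U+3000 has general category SPACE_SEPARATOR, so it is excluded from "other".
        System.out.println(Character.isSpaceChar(ideographicSpace));
        System.out.println(Character.getType(ideographicSpace) == Character.SPACE_SEPARATOR);
        // A non-ASCII letter (U+00E9) falls into "other" and is accepted unescaped.
        System.out.println(parses("http://example.com/caf" + (char) 0x00E9));
        // The same kind of path with an ideographic space is rejected.
        System.out.println(parses("http://example.com/a" + ideographicSpace + "b"));
    }
}
```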
Please note the 1st example contains the ideographic space rather than a regular space.
It is the ideographic space that is the problem.
Here is the code that allows non-ASCII characters to be used:
} else if ((c > 128)
&& !Character.isSpaceChar(c)
&& !Character.isISOControl(c)) {
// Allow unescaped but visible non-US-ASCII chars
return p + 1;
}
As you can see, it disallows "funky" non-visible characters.
See also the URI class javadocs, which specify which characters are allowed (by the class!) in each component of a URI.
Why?
It is probably a safety measure.
What others are disallowed?
Any non-ASCII character that is a space character or a control character, according to the respective Character predicate methods (Character.isSpaceChar and Character.isISOControl). See the Character javadocs for a precise specification.
You should also note that this is a deviation from the URI specification. The URI specification says that non-ASCII characters are only allowed if you:
convert the UCS character code to UTF-8, and
percent encode the UTF-8 bytes as required by the spec.
My understanding is that the URI.toASCIIString() method will take care of that if you have a "deviant" java.net.URI object.
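A small sketch of both routes, under the assumption that the JDK behaves as its URI javadoc describes. The class and method names are hypothetical. toAscii shows URI.toASCIIString() percent-encoding the UTF-8 bytes of an "other" character; quotePath relies on the multi-argument constructors, which the javadoc says quote illegal characters rather than reject them, so it also copes with the ideographic space:

```java
import java.net.URI;
import java.net.URISyntaxException;

class AsciiUriSketch {

    // ASCII-only, percent-encoded form of a URI that contains "other" characters.
    static String toAscii(String uri) {
        return URI.create(uri).toASCIIString();
    }

    // The multi-argument constructors quote illegal characters instead of failing.
    static String quotePath(String path) {
        try {
            return new URI("http", "example.com", path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // U+00E9 is the UTF-8 byte sequence C3 A9.
        System.out.println(toAscii("http://example.com/caf" + (char) 0x00E9));
        // U+3000 is the UTF-8 byte sequence E3 80 80.
        System.out.println(quotePath("/a" + (char) 0x3000 + "b"));
    }
}
```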
Related
I'm trying to escape a string to ensure that special characters are escaped.
Using
StringEscapeUtils.escapeJava("😀") escapes to \\uD83D\\uDE00
StringEscapeUtils.escapeJava("% ! # $ ^ & * ") doesn't escape any of the characters
StringEscapeUtils.escapeJava("£") escapes to \\u00A3
I can understand that emojis contain backslashes and so are escaped, but why is the pound sign being escaped, and how do I stop it from being escaped?
The documentation of StringEscapeUtils.escapeJava() is vague on exactly what "Java String rules" are.
I guess it is referring to the bit in JLS Chapter 3, where it says:
Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters.
and
ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode UTF-16 encoding are the ASCII characters.
So it might mean escaping the string so that it can be written using only ASCII characters.
%, !, #, $, ^, & and * are all ASCII characters. They have values less than 128 (i.e. they are in the 7-bit block).
£ isn't an ASCII character: in ISO8859-1, it is encoded as 163 (0xA3), which is outside the 7-bit ASCII block.
If you open a file with the £ in a string literal, it might be rendered as something else, if that editor doesn't set the character encoding correctly. For example, it could be Ł, if it's interpreted in ISO8859-2.
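To make the ambiguity concrete, here is a short sketch that decodes the single byte 0xA3 under two charsets. The class name is made up, and it assumes the JDK in use ships the ISO-8859-2 charset (most do, but only ISO-8859-1 is guaranteed):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PoundSignDecoding {

    // Decode one byte under the given charset.
    static String decode(byte b, Charset cs) {
        return new String(new byte[] { b }, cs);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xA3;
        // ISO-8859-1 maps 0xA3 to U+00A3 (pound sign).
        System.out.println(Integer.toHexString(decode(b, StandardCharsets.ISO_8859_1).charAt(0)));
        // ISO-8859-2 maps the same byte to U+0141 (capital L with stroke).
        // Assumption: this charset is available in the running JDK.
        System.out.println(Integer.toHexString(decode(b, Charset.forName("ISO-8859-2")).charAt(0)));
    }
}
```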
In order to be unambiguous, the pound sign is therefore escaped.
how do I stop it from being escaped
You can't, using this method; you'd need to find an alternative. The only thing you can do is replace each \u00A3 in the resulting string with £ again.
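As an illustration of the "ASCII only" reading, here is a hand-rolled sketch. It is not the Commons Lang implementation; escapeNonAscii and restorePound are hypothetical helpers. The first escapes every UTF-16 code unit above 127, which reproduces the behaviour seen for £ and for the emoji; the second undoes the escape for the pound sign only:

```java
class AsciiOnlyEscaper {

    // Hypothetical helper: escapes every UTF-16 code unit above 127 and leaves
    // ASCII alone. Unlike a full Java-literal escaper, it does not touch quotes,
    // backslashes or control characters.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 127) {
                sb.append("\\u").append(String.format("%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Turns the six-character escape for U+00A3 back into the pound sign.
    static String restorePound(String escaped) {
        return escaped.replace("\\u00A3", String.valueOf((char) 0x00A3));
    }

    public static void main(String[] args) {
        String pound = String.valueOf((char) 0x00A3);
        System.out.println(escapeNonAscii(pound));
        System.out.println(escapeNonAscii("% ! # $ ^ & * "));
        System.out.println(restorePound(escapeNonAscii("5" + pound)).equals("5" + pound));
    }
}
```

Note that restorePound cannot tell an escape it produced from the same six characters already present in the input, so it is only safe when the original text contains no such sequence.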
I have an issue with matching some of punctuation characters when Pattern.UNICODE_CHARACTER_CLASS flag is enabled.
Sample code is as follows:
final Pattern p = Pattern.compile("\\p{Punct}",Pattern.UNICODE_CHARACTER_CLASS);
final Matcher matcher = p.matcher("+");
System.out.println(matcher.find());
The output is false, although the documentation explicitly states that \p{Punct} includes characters such as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Apart from the '+' sign, the same problem occurs for the following characters: $<=>^`|~
When Pattern.UNICODE_CHARACTER_CLASS is removed, it works fine.
I would appreciate any hints on this problem.
From the documentation:
When this flag is specified then the (US-ASCII only) Predefined
character classes and POSIX character classes are in conformance with
Unicode Technical Standard #18: Unicode Regular Expression Annex
C: Compatibility Properties.
If you take a look at the General Category property in UTS #18 (the Unicode Technical Standard referenced above), you'll see a distinction between symbols (S and its sub-categories) and punctuation (P and its sub-categories) in the table under General Category Property.
Quoting:
The most basic overall character property is the General Category,
which is a basic categorization of Unicode characters into: Letters,
Punctuation, Symbols, Marks, Numbers, Separators, and Other.
If you try your example with \\p{S}, with the flag on, it will match.
My guess is that + is not listed under punctuation as an arbitrary (yet semantically appropriate) choice, i.e. literally punctuation != symbols.
The javadoc states what comes under \p{Punct} with the caveat that
POSIX character classes (US-ASCII only)
If you take a look at the punctuation chars in Unicode (for example the list at http://www.fileformat.info/info/unicode/category/Po/list.htm ), there is no + or $.
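The behaviour described in this thread can be reproduced with a short sketch (the class and helper names are made up). It also shows one possible workaround that is an assumption on my part rather than something the documentation prescribes: match the union of the P and S categories when the flag is on.

```java
import java.util.regex.Pattern;

class PunctVersusSymbol {

    static boolean finds(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;
        // '+' has general category Sm (MATH_SYMBOL), not one of the P categories.
        System.out.println(Character.getType('+') == Character.MATH_SYMBOL);
        // Without the flag, Punct is the ASCII-only POSIX class and includes '+'.
        System.out.println(finds("\\p{Punct}", 0, "+"));
        // With the flag, Punct follows the Unicode punctuation categories only.
        System.out.println(finds("\\p{Punct}", unicode, "+"));
        // The symbol categories do contain '+'.
        System.out.println(finds("\\p{S}", unicode, "+"));
        // Union of punctuation and symbols, usable with the flag on.
        System.out.println(finds("[\\p{P}\\p{S}]", unicode, "+"));
    }
}
```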
I have a String like this which is coming in a JSON processing data call\\U007fabc computers. When I try to parse it, Jackson throws an exception like this:
org.codehaus.jackson.JsonParseException: Unrecognized character escape 'U' (code 85)
at [Source: java.io.StringReader@1b43c429; line: 1, column: 361]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1292)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
at org.codehaus.jackson.impl.JsonParserMinimalBase._handleUnrecognizedCharacterEscape(JsonParserMinimalBase.java:360)
at org.codehaus.jackson.impl.ReaderBasedParser._decodeEscaped(ReaderBasedParser.java:1064)
at org.codehaus.jackson.impl.ReaderBasedParser._finishString2(ReaderBasedParser.java:785)
at org.codehaus.jackson.impl.ReaderBasedParser._finishString(ReaderBasedParser.java:762)
I think the problem is happening because of \\U007f. It definitely means something in UTF-8. Any idea how we can avoid this issue? Will JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER help here?
Your JSON data is malformed.
JSON uses the \u escape sequence to encode a UTF-16 code unit.
In this case, your JSON data is trying to escape Unicode codepoint U+007F DELETE (which is an ASCII control character that is not required by the JSON spec to be escaped, but is allowed to be escaped), but is using the \U escape sequence to do so. The JSON spec explicitly states that \u MUST be used:
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters.
...
Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point.
...
To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair.
Although not explicitly stated in that last paragraph, the twelve-character sequence for a UTF-16 surrogate pair consists of two six-character sequences that must follow the same escape format as characters in the BMP. This is enforced by the character encoding diagram:
(source: json.org)
There is no \U escape sequence defined. That is what the parser error message is complaining about:
Unrecognized character escape 'U'
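If the producer of the data cannot be fixed, one possible workaround is to normalise the text before handing it to the parser. The following is only a sketch under that assumption; it is not part of Jackson, and the class and method names are hypothetical. It rewrites a backslash followed by an uppercase U and four hex digits into the lowercase form, and it consumes escapes in pairs so that an escaped backslash followed by a literal U is left untouched:

```java
class UppercaseEscapeFixer {

    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // Rewrites backslash + 'U' + four hex digits into the lowercase form JSON requires.
    static String fixUppercaseEscapes(String json) {
        StringBuilder sb = new StringBuilder(json.length());
        int i = 0;
        while (i < json.length()) {
            char c = json.charAt(i);
            if (c != '\\' || i + 1 >= json.length()) {
                sb.append(c);
                i++;
                continue;
            }
            char next = json.charAt(i + 1);
            if (next == 'U' && i + 5 < json.length()
                    && isHex(json.charAt(i + 2)) && isHex(json.charAt(i + 3))
                    && isHex(json.charAt(i + 4)) && isHex(json.charAt(i + 5))) {
                // Emit the lowercase escape followed by the same four hex digits.
                sb.append('\\').append('u').append(json, i + 2, i + 6);
                i += 6;
            } else {
                // Any other escape pair is copied through unchanged.
                sb.append(c).append(next);
                i += 2;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String broken = "{\"name\":\"call\\U007fabc computers\"}";
        System.out.println(fixUppercaseEscapes(broken));
    }
}
```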
Unicode Character U+007F DELETE is probably what you are facing.
This answer states that it shouldn't have been encoded.
However, to work around it, you can refer to this answer on how to strip such characters out.
What characters are in [[:jletterdigit:]] in JFlex?
I need to translate [[:jletterdigit:]] to classical regex.
To clarify Michael Lowman's answer:
This is what the JFlex documentation says:
jletter and jletterdigit are predefined character classes. jletter includes all characters for which the Java function Character.isJavaIdentifierStart returns true and jletterdigit all characters for that Character.isJavaIdentifierPart returns true.
And what he wrote is the documentation of Character.isJavaIdentifierPart:
Determines if the specified character may be part of a Java identifier
as other than the first character.
A character may be part of a Java identifier if any of the following
are true:
it is a letter
it is a currency symbol (such as '$')
it is a connecting punctuation character (such as '_')
it is a digit
it is a numeric letter (such as a Roman numeral character)
it is a combining mark
it is a non-spacing mark
isIdentifierIgnorable returns true for the character
isIdentifierIgnorable is in turn defined as:
Determines if the specified character (Unicode code point) should be
regarded as an ignorable character in a Java identifier or a Unicode
identifier.
The following Unicode characters are ignorable in a Java identifier or
a Unicode identifier:
ISO control characters that are not whitespace
'\u0000' through '\u0008'
'\u000E' through '\u001B'
'\u007F' through '\u009F'
all characters that have the FORMAT general category value
A character may be part of a Java identifier if any of the following are true:
it is a letter
it is a currency symbol (such as '$')
it is a connecting punctuation character (such as '_')
it is a digit
it is a numeric letter (such as a Roman numeral character)
it is a combining mark
it is a non-spacing mark
isIdentifierIgnorable returns true for the character
from the Java API
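If the target is Java's own regex engine, the Pattern documentation says that the boolean Character.isXxx methods are available as \p{javaXxx} properties, so the two JFlex classes can be sketched as follows (the class and field names are made up for illustration):

```java
import java.util.regex.Pattern;

class JLetterDigitSketch {

    // Regex property classes that mirror the Character methods JFlex refers to.
    static final Pattern JLETTER = Pattern.compile("\\p{javaJavaIdentifierStart}");
    static final Pattern JLETTERDIGIT = Pattern.compile("\\p{javaJavaIdentifierPart}");

    static boolean isJLetterDigit(String oneChar) {
        return JLETTERDIGIT.matcher(oneChar).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "a", "_", "$", "7", "-", " " }) {
            boolean viaRegex = isJLetterDigit(s);
            boolean viaCharacter = Character.isJavaIdentifierPart(s.codePointAt(0));
            System.out.println("'" + s + "' regex=" + viaRegex + " Character=" + viaCharacter);
        }
    }
}
```

For engines without these properties, an approximation assembled from the categories listed above would be [\p{L}\p{Sc}\p{Pc}\p{Nd}\p{Nl}\p{Mc}\p{Mn}\p{Cf}\x00-\x08\x0E-\x1B\x7F-\x9F]. I have not verified that it agrees with Character.isJavaIdentifierPart for every code point, so treat it as a starting point.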
I got into an interesting discussion in a forum where we discussed the naming of variables.
Conventions aside, I noticed that it is legal for a variable to have the name of a Unicode character, for example the following is legal:
int \u1234;
However, if I for example gave it the name #, it produces an error. According to Sun's tutorial it is valid if "beginning with a letter, the dollar sign "$", or the underscore character "_"."
But the unicode 1234 is some Ethiopic character. So what is really defined as a "letter"?
The Unicode standard defines what counts as a letter.
From the Java Language Specification, section 3.8:
Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.
A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.
From the Character documentation for isJavaIdentifierPart:
Determines if the character (Unicode code point) may be part of a Java identifier as other
than the first character.
A character may be part of a Java identifier if any of the following are true:
it is a letter
it is a currency symbol (such as '$')
it is a connecting punctuation character (such as '_')
it is a digit
it is a numeric letter (such as a Roman numeral character)
it is a combining mark
it is a non-spacing mark
isIdentifierIgnorable(codePoint) returns true for the character
Unicode characters fall into character classes. There's a set of Unicode characters which fall into the class "letter".
In Java, membership of that class is determined by Character.isLetter(c). But for identifiers, Character.isJavaIdentifierStart(c) and Character.isJavaIdentifierPart(c) are more relevant.
For the relevant Unicode spec, see this.
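To tie this back to the question, here is a minimal sketch (looksLikeIdentifier is a hypothetical helper, not a JDK method) that applies the two Character methods to whole strings. It only checks the character rules, so keywords such as "int" are not excluded:

```java
class IdentifierCheck {

    // True if s is made only of characters allowed in a Java identifier.
    // Keywords and literals such as "int" or "true" are not excluded here.
    static boolean looksLikeIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // U+1234 is an Ethiopic syllable, which Unicode classifies as a letter.
        String ethiopic = String.valueOf((char) 0x1234);
        System.out.println(Character.isLetter(0x1234));
        System.out.println(looksLikeIdentifier(ethiopic));
        System.out.println(looksLikeIdentifier("#"));
        System.out.println(looksLikeIdentifier("$x_1"));
        System.out.println(looksLikeIdentifier("1x"));
    }
}
```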