Unrecognized character while parsing JSON - java

I have a String like this, which arrives in a JSON data-processing call: \\U007fabc computers. When I try to parse it, Jackson throws an exception like this:
org.codehaus.jackson.JsonParseException: Unrecognized character escape 'U' (code 85)
at [Source: java.io.StringReader#1b43c429; line: 1, column: 361]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1292)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
at org.codehaus.jackson.impl.JsonParserMinimalBase._handleUnrecognizedCharacterEscape(JsonParserMinimalBase.java:360)
at org.codehaus.jackson.impl.ReaderBasedParser._decodeEscaped(ReaderBasedParser.java:1064)
at org.codehaus.jackson.impl.ReaderBasedParser._finishString2(ReaderBasedParser.java:785)
at org.codehaus.jackson.impl.ReaderBasedParser._finishString(ReaderBasedParser.java:762)
I think the problem is caused by \\U007f. It definitely means something in UTF-8. Any idea how we can avoid this issue? Will JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER help here?

Your JSON data is malformed.
JSON uses the \u escape sequence to encode a UTF-16 code unit.
In this case, your JSON data is trying to escape Unicode code point U+007F DELETE (an ASCII control character that the JSON spec does not require to be escaped, but allows to be escaped), and it is using the \U escape sequence to do so. The JSON spec explicitly states that a lowercase \u MUST be used:
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters.
...
Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point.
...
To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair.
Although not explicitly stated in that last paragraph, the twelve-character sequence for a UTF-16 surrogate pair consists of two six-character sequences that must follow the same escape format as characters in the BMP. This is enforced by the string syntax diagram on json.org.
There is no \U escape sequence defined. That is what the parser error message is complaining about:
Unrecognized character escape 'U'
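If the producer of the data cannot be fixed, one possible workaround is to repair the text before handing it to the parser. The following is only a sketch under assumptions: JsonEscapeRepair and its methods are hypothetical names, not part of Jackson, and it assumes every backslash-U followed by exactly four hex digits was meant as a \u escape. Pairs of backslashes (an escaped backslash) are skipped so they are not misread. As far as I can tell, ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER would only make the parser accept \U as a plain U rather than decode the code point, so it would hide the error without giving you U+007F.

```java
// Hypothetical helper, not part of Jackson: repairs backslash + 'U' + 4 hex digits before parsing.
final class JsonEscapeRepair {

    /** Rewrites each uppercase-U escape with four hex digits to the lowercase form JSON requires. */
    static String lowercaseUnicodeEscapes(String json) {
        StringBuilder out = new StringBuilder(json.length());
        int i = 0;
        while (i < json.length()) {
            char c = json.charAt(i);
            if (c == '\\' && i + 1 < json.length()) {
                char next = json.charAt(i + 1);
                if (next == 'U' && hasFourHexDigits(json, i + 2)) {
                    out.append("\\u");               // lowercase the escape marker only
                } else {
                    out.append(c).append(next);      // copy any other escape (including an escaped backslash) as-is
                }
                i += 2;                              // the backslash and the char after it are consumed together
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    private static boolean hasFourHexDigits(String s, int start) {
        if (start + 4 > s.length()) {
            return false;
        }
        for (int k = start; k < start + 4; k++) {
            char h = s.charAt(k);
            boolean hex = (h >= '0' && h <= '9') || (h >= 'a' && h <= 'f') || (h >= 'A' && h <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String broken = "{\"name\":\"\\U007fabc computers\"}";
        System.out.println(lowercaseUnicodeEscapes(broken));
    }
}
```

The repaired text can then be passed to the JSON parser as usual.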

Unicode character U+007F DELETE is probably what you are facing.
This answer states that it shouldn't have been encoded.
However, to work around it, you can refer to this answer on how to strip such characters.
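A minimal sketch of such stripping, to be applied to the decoded string value rather than to the raw JSON text. ControlCharStripper is a hypothetical name, and the sketch assumes you are happy to drop all ASCII control characters, which includes tabs and line breaks as well as U+007F.

```java
// Hypothetical helper: removes ASCII control characters from an already-decoded string.
final class ControlCharStripper {

    static String stripControlChars(String text) {
        // \p{Cntrl} is the POSIX control class in java.util.regex: U+0000 to U+001F plus U+007F.
        return text.replaceAll("\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        // The literal below starts with the DELETE character, written as a Unicode escape.
        System.out.println(stripControlChars("\u007fabc computers"));
    }
}
```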

How to use Escape character in HL7 message

I am using ca.uhn.hl7v2.util.Terser to create an HL7 message. For one of the HL7 fields I need to set the following value:
"\home\one\two".
The HL7 message type is MDM_T02 (version 2.3.1). Because "\" is an escape character in HL7 messages, if I try to use
public void methodOne() {
    MDM_T02 mdmt02 = new MDM_T02();
    Terser terser = new Terser(mdmt02);
    terser.set("OBX-5-1", "\\\\usne-server\\Pathology\\Quantum");
}
In the HL7 message OBX-5-1 is printed as "\E\E\usne-server\E\Pathology\E\Quantum".
Can someone help me to print the proper message?
You may refer to the description of HL7 Escape Sequences here or here.
HL7 defines character sequences to represent ’special’ characters not otherwise permitted in HL7 messages. These sequences begin and end with the message’s Escape character (usually ‘\’), and contain an identifying character, followed by 0 or more characters. The most common use of HL7 escape sequences is to escape the HL7 defined delimiter characters.
Character | Description | Conversion
\Cxxyy\ | Single-byte character set escape sequence with two hexadecimal values | not converted
\E\ | Escape character | converted to escape character (e.g., ‘\’)
\F\ | Field separator | converted to field separator character (e.g., ‘|’)
\H\ | Start highlighting | not converted
\Mxxyyzz\ | Multi-byte character set escape sequence with two or three hexadecimal values (zz is optional) | not converted
\N\ | Normal text (end highlighting) | not converted
\R\ | Repetition separator | converted to repetition separator character (e.g., ‘~’)
\S\ | Component separator | converted to component separator character (e.g., ‘^’)
\T\ | Subcomponent separator | converted to subcomponent separator character (e.g., ‘&’)
\Xdd…\ | Hexadecimal data (dd must be hexadecimal characters) | converted to the characters identified by each pair of digits
\Zdd…\ | Locally defined escape sequence | not converted
If \ is part of your data, you need to escape it with \E\.
So your value:
"\home\one\two"
becomes
"\E\home\E\one\E\two"
About the second issue:
In the HL7 message OBX-5-1 is printed as "\E\E\usne-server\E\Pathology\E\Quantum".
When reading the value, you have to reverse the process. That means you should replace each \E\ with \ to get the original value back.
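The escaping and its reversal can be sketched in plain Java as follows. Hl7BackslashEscaping is a hypothetical helper, not part of HAPI, and it assumes the default escape character \ and handles only the \E\ sequence, not the delimiter sequences such as \F\ or \S\. Judging by the output shown in the question, the library already applies the escaping when it encodes the message, so a helper like this would mostly matter on the reading side.

```java
// Hypothetical helper, not part of HAPI: handles only the \E\ sequence with the default escape character.
final class Hl7BackslashEscaping {

    /** Replaces every literal backslash in the data with the HL7 sequence \E\. */
    static String escape(String raw) {
        return raw.replace("\\", "\\E\\");
    }

    /** Reverses escape(): turns every \E\ sequence back into a single backslash. */
    static String unescape(String encoded) {
        return encoded.replace("\\E\\", "\\");
    }

    public static void main(String[] args) {
        String raw = "\\home\\one\\two";
        String encoded = escape(raw);
        System.out.println(encoded);
        System.out.println(unescape(encoded));
    }
}
```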
As @Amit Joshi mentioned, this has to do with HL7 escaping. Since your client appears not to follow the escaping rules anyway, you may want to change your escape character from a backslash to a character that is unlikely to appear in your message.
This would be the 3rd character in MSH-2.

StringEscapeUtils escapeJava is escaping pound signs

I'm trying to escape a string to ensure that special characters are escaped.
Using
StringEscapeUtils.escapeJava("😀") escapes to \\uD83D\\uDE00
StringEscapeUtils.escapeJava("% ! # $ ^ & * ") doesn't escape any of the characters
StringEscapeUtils.escapeJava("£") escapes to \\u00A3
I can understand that emojis contain backslashes and so are escaped, but why is the pound sign being escaped, and how do I stop it from being escaped?
The documentation of StringEscapeUtils.escapeJava() is vague on exactly what "Java String rules" are.
I guess it is referring to the bit in JLS Chapter 3, where it says:
Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters.
and
ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode UTF-16 encoding are the ASCII characters.
So it might mean escaping the string so that it can be written using only ASCII characters.
%, !, #, $, ^, & and * are all ASCII characters. They have values less than 128 (i.e. they are in the 7-bit block).
£ isn't an ASCII character: in ISO8859-1, it is encoded as 163 (0xA3), which is outside the 7-bit ASCII block.
If you open a file with the £ in a string literal, it might be rendered as something else, if that editor doesn't set the character encoding correctly. For example, it could be Ł, if it's interpreted in ISO8859-2.
In order to be unambiguous, the pound sign is therefore escaped.
how do I stop it from being escaped
You can't, using this method; you'd need to find an alternative. The only thing you can do is replace each \u00A3 in the resulting string with £ again.
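A rough sketch of that replacement. To keep it self-contained, escapeNonAscii is a hypothetical stand-in that only mimics the behaviour observed in the question (every char above 0x7F becomes a backslash-u escape with uppercase hex digits); it is not the Commons implementation. restorePound is likewise a made-up name. Note the naive replace would also rewrite an input that already contained the six literal characters \u00A3.

```java
// Hypothetical sketch; escapeNonAscii only imitates the output seen in the question.
final class PoundRestore {

    /** Escapes every char above 0x7F as a backslash, 'u' and four uppercase hex digits. */
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Puts the pound sign back after escaping; all other escapes are left alone. */
    static String restorePound(String escaped) {
        // First argument: the six-character escape text. Second argument: the pound sign itself.
        return escaped.replace("\\u00A3", "\u00A3");
    }

    public static void main(String[] args) {
        String escaped = escapeNonAscii("\u00A3 5");
        System.out.println(escaped);
        System.out.println(restorePound(escaped).equals("\u00A3 5"));
    }
}
```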

why is java accepting unicode outside the " " and ' ' also?

This line compiles fine
Thread t = \u006E\u0065\u0077\u0020\u0054\u0068\u0072\u0065\u0061\u0064\u0028\u0029\u003B
This is the Unicode-escaped form of the text new Thread();
My question is: what is the need for accepting Unicode escapes outside " " or ' '? We can use Unicode escapes in string literals and character literals, but why do they need to be accepted in the actual code itself?
The reason why this works is that the Unicode escape sequence isn't handled by the grammar or the string parsing code but the tokenizer. So the Java grammar never "sees" those escape sequences, it gets a Unicode string.
This has unfortunate side effects. For example, this code doesn't compile:
// C:\user\...
For most of us, it's a comment. For the tokenizer, \user is an illegal Unicode escape, because \u must be followed by four hexadecimal digits.
The reason to do it this way is that you can now use any Unicode character anywhere in the Java source code - Java identifiers are not limited to ASCII!
But the tools to edit Java might not be as good. In 1994, it was pretty hard to find a text editor capable of Unicode. Also, code generators often work better if you stay with ASCII.
The JLS specifies it:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
This works because the Java Language Specification requires this. See 3.3. Unicode Escapes:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
The reason is simple: Java allows full unicode support (even for identifiers!), but sometimes it is not practical to use actual unicode for your source files, in that case you can use escapes.
This also means that Unicode escapes are not an artifact of strings in Java, but of the compiler: if you have a String (or char) literal with Unicode escapes, it will be translated to the actual character at compile time, not at runtime!
The section 3.2. Lexical Translations is also relevant:
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
2. A translation of the Unicode stream resulting from step 1 [...]
If the source file is not saved as UTF-8, this feature makes it possible to use Unicode characters in the source code that would otherwise not be available.
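A small sketch of the behaviour the answers above describe. UnicodeEscapeDemo and its method names are made up for illustration. Because the escapes are translated before the compiler tokenizes the source, they work in identifiers as well as in literals.

```java
// Hypothetical demo class: Unicode escapes are translated before tokenization.
final class UnicodeEscapeDemo {

    static int identifierViaEscape() {
        int \u0061bc = 42;   // the escape is the letter 'a', so this declares a variable named abc
        return abc;          // refers to the same variable
    }

    static boolean literalIsTranslated() {
        // The literal holds the single character 'A'; no escape text survives compilation.
        return "\u0041".equals("A") && "\u0041".length() == 1;
    }

    public static void main(String[] args) {
        System.out.println(identifierViaEscape());
        System.out.println(literalIsTranslated());
    }
}
```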

Regular expression for Multi Bytes string

What regular expression could be used to detect a multi-byte string?
For example, here is the expression to detect a string in English:
Pattern p=Pattern.compile("[a-zA-Z/]");
Similarly, I want a pattern that matches multi-byte text like
コメント_1050_固-減価償却費
You may want to have a look at Unicode Support in Java
I think basically you want the Unicode property \p{L}. This would match any code point that has the property "letter".
So your regex could look like this
Pattern p=Pattern.compile("[\\p{L}/]");
I just replaced the character ranges a-zA-Z with \p{L}
Since Java 7 you could also use Pattern.UNICODE_CHARACTER_CLASS
Enables the Unicode version of Predefined character classes and POSIX character classes.
That would turn the predefined \w into its Unicode version, meaning it would match all Unicode letters and digits (and connecting characters like _).
So to match your string コメント_1050_固-減価償却費, you could use the following (the hyphen has to be added to the class explicitly, because \w does not cover it):
Pattern p=Pattern.compile("^[\\w-]+$", Pattern.UNICODE_CHARACTER_CLASS);
This would match any string consisting of letters, digits, _ and -
See here for more details
and here on regular-expression.info an overview over the Unicode scripts, properties and blocks.
See here a famous answer from tchrist about the caveats of regex in Java, including an update on what has changed with Java 7 (or will change in Java 8).
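A short sketch of the two patterns discussed in this answer. UnicodeWordMatch and its members are hypothetical names, and SAMPLE is a shortened variant of the string from the question, written with Unicode escapes so the source file stays ASCII-only. The hyphen is listed in the character class on purpose: \w does not match it, even with UNICODE_CHARACTER_CLASS.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the patterns discussed above.
final class UnicodeWordMatch {

    // Any Unicode letter, or a slash (the Unicode-aware counterpart of [a-zA-Z/]).
    static final Pattern LETTER_OR_SLASH = Pattern.compile("[\\p{L}/]");

    // Unicode-aware \w only: letters, digits and connectors such as the underscore.
    static final Pattern WORD_ONLY = Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);

    // The same class plus a literal hyphen.
    static final Pattern WORD_OR_HYPHEN = Pattern.compile("^[\\w-]+$", Pattern.UNICODE_CHARACTER_CLASS);

    // Four katakana letters, "_1050_", one ideograph, a hyphen, one more ideograph.
    static final String SAMPLE = "\u30B3\u30E1\u30F3\u30C8_1050_\u56FA-\u8CBB";

    static boolean hasLetterOrSlash(String s) {
        return LETTER_OR_SLASH.matcher(s).find();
    }

    static boolean isWordOnly(String s) {
        return WORD_ONLY.matcher(s).matches();
    }

    static boolean isWordOrHyphen(String s) {
        return WORD_OR_HYPHEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWordOnly(SAMPLE));      // the hyphen is not covered here
        System.out.println(isWordOrHyphen(SAMPLE));
    }
}
```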
If you want to detect whether you have a multi-byte string, you can look at the length:
if (text.length() != text.getBytes(encoding).length)
This will detect that a multi-byte character has been used for any encoding.
Essentially, Java regular expressions work on Strings, not arrays of bytes - characters are represented as abstract "character" entities, not as bytes in some specific encoding. This is not completely true since the char type only contains characters from the Basic Multilingual Plane and Unicode chars from outside this range are represented as two char values each, but nonetheless "multibyte" is relative and depends on the encoding.
If what you need is "multibyte in UTF-8", then note that only characters with values 0-127 are single-byte in this encoding. So, the easiest way to check would be to use a loop and check each character - if it's greater than 127, it's more than one byte in UTF-8.
If you insist on using a regex, you could probably use the character range operator in the regex like this: [\u0080-\uFFFF] (haven't checked and \uFFFF is not really a character but I think the regex engine should accept it).
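The three checks mentioned in these answers can be sketched as follows for UTF-8. MultiByteCheck and its method names are hypothetical. One deliberate deviation: the regex uses a negated ASCII class instead of [\u0080-\uFFFF], because java.util.regex matches whole code points and a surrogate pair (a character above U+FFFF) would fall outside that range.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical demo class: three ways to detect characters that need more than one byte in UTF-8.
final class MultiByteCheck {

    // Anything that is not in the 7-bit ASCII range.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    /** Compares the number of chars with the number of encoded bytes. */
    static boolean byLength(String text) {
        return text.length() != text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Checks each char: values above 127 take more than one byte in UTF-8. */
    static boolean byLoop(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 127) {
                return true;
            }
        }
        return false;
    }

    /** Same check expressed as a regular expression. */
    static boolean byRegex(String text) {
        return NON_ASCII.matcher(text).find();
    }

    public static void main(String[] args) {
        String pound = "\u00A3";   // pound sign, two bytes in UTF-8
        System.out.println(byLength(pound) + " " + byLoop(pound) + " " + byRegex(pound));
        System.out.println(byLength("abc") + " " + byLoop("abc") + " " + byRegex("abc"));
    }
}
```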
You will need to use Unicode for elements which are not in the English language. This link should provide you with some information.
There is a nice introduction to Unicode regex here.

What is meant by "escape sequence" in the definition of Java string literals?

From the Java Language Specification, Section 3.10.5 String Literals:
Characters may be represented by escape sequences - one escape sequence for characters in the range U+0000 to U+FFFF, two escape sequences for the UTF-16 surrogate code units of characters in the range U+010000 to U+10FFFF.
What does this mean? If a character falls within the range U+0000 to U+FFFF, then one escape sequence may be used. How different is one escape sequence from two escape sequences?
By escape sequence, does it refer to \n, \r and similar? Are these one sequence or two escape sequences?
Each value from U+0000 to U+FFFF fits in a single 16-bit char, so one escape sequence is enough to represent such a character. Characters in the range U+010000 to U+10FFFF do not fit in one char; in UTF-16 each of them is encoded as a surrogate pair, that is, two code units (a high surrogate followed by a low surrogate). So to write such a character with escapes, two escape sequences are required, one per code unit.
By escape sequence, they mean things like \u0000 (which you can use in a String literal to represent a Unicode character).
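A small illustration of the difference, using U+00A3 (inside the U+0000 to U+FFFF range) and U+1F600 (outside it). SurrogatePairDemo and its members are names chosen for this sketch only.

```java
// Hypothetical demo class: one escape for a BMP character, two for a supplementary one.
final class SurrogatePairDemo {

    // U+00A3 fits in one char, so a single escape is enough.
    static final String POUND = "\u00A3";

    // U+1F600 does not fit in one char: high surrogate D83D followed by low surrogate DE00.
    static final String GRINNING_FACE = "\uD83D\uDE00";

    /** Number of 16-bit code units (chars) in the string. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points (actual characters) in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(POUND) + " unit(s), " + codePoints(POUND) + " code point(s)");
        System.out.println(codeUnits(GRINNING_FACE) + " unit(s), " + codePoints(GRINNING_FACE) + " code point(s)");
        System.out.println(Integer.toHexString(GRINNING_FACE.codePointAt(0)));
    }
}
```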
