Determine if there is/are escape character(s) in string - java

Let's say I have
String str="hello\" world\\";
when printing str, the output is
hello" world\
even when printing str.length() the output is
13
Is there any way to prove that str value has escape character(s)?

There is no such thing as escape characters at run time.
Escape characters appear only in String literals. For example,
String literal = "Some\nEscape\rSequence\\\"";
At compile time, the compiler replaces each escape sequence with the character it denotes and stores the resulting String value in the class file (in a modified UTF-8 encoding). The JVM uses that String value directly.
You wrote
I am thinking that whenever we print a string and the output contains
characters such as " and \, then we can conclude that those characters,
" and \, were escaped?
This is not true: those characters might have been read from a file or some other InputStream. In a text file they were definitely not escaped.
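To make the point concrete, here is a minimal sketch (the class and method names are made up for this example) that builds the same value without any escape sequence in a string literal. The two strings are equal, so nothing in the value records how it was written:

```java
// Illustrative sketch; RuntimeCharsDemo and buildWithoutStringEscapes are hypothetical names.
class RuntimeCharsDemo {
    // Builds hello" world\ from code points, with no escape sequence in any string literal.
    static String buildWithoutStringEscapes() {
        char quote = 34;      // code point of the double quote
        char backslash = 92;  // code point of the backslash
        return "hello" + quote + " world" + backslash;
    }

    public static void main(String[] args) {
        String fromLiteral = "hello\" world\\";
        String built = buildWithoutStringEscapes();
        System.out.println(fromLiteral.equals(built)); // true
        System.out.println(fromLiteral.length());      // 13
    }
}
```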

Yes.
Use the Apache Commons Library, specifically StringEscapeUtils#escapeJava.
jshell> StringEscapeUtils.escapeJava("Newline \n here \u0344 and unicode \f\n\r\t\"\0\13 and more")
$136 ==> "Newline \\n here \\u0344 and unicode \\f\\n\\r\\t\\\"\\u0000\\u000B and more"
This prepends a backslash to each escape sequence and also swaps the variable-width octal sequences for fixed-width Unicode sequences. This means that every escape sequence in the result will consist of a backslash (written "\\" in a Java literal, and shown doubled in the jshell output above), followed by one of {n, b, r, t, f, ", \}, or a 'u' character plus exactly four hexadecimal digits [0-9A-F].
If you just want to know whether or not the original String contains escape sequences, search for "\\" in the Apache-fied string. If you want to find the positions of those sequences, it's a bit more involved.
See more at this Gist.
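If pulling in Apache Commons is not an option, a rough dependency-free check might look like the sketch below. It is not the library's algorithm: the method name hasEscapeCandidates and the choice of what counts (double quote, backslash, ISO control characters) are assumptions made for this example, and it only tells you that the value contains characters that are usually written with an escape in source code, not how the string was actually created.

```java
// Illustrative sketch; EscapeCandidateDemo and hasEscapeCandidates are hypothetical names.
class EscapeCandidateDemo {
    // True if s contains a double quote, a backslash, or an ISO control character
    // (newline, tab, NUL, and so on). This rule is an assumption of the sketch.
    static boolean hasEscapeCandidates(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\' || Character.isISOControl(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasEscapeCandidates("hello\" world\\")); // true
        System.out.println(hasEscapeCandidates("hello world"));     // false
    }
}
```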

Related

How can we remove a "\" backslash character from a string in java?

I've been trying to figure out how we can remove a special character along with its preceding letters within a string.
Let's suppose there is a string "ABC\n000111". In this case we have to remove the "ABC\" part from the string, so the result would be n000111.
Can someone help me find an efficient way of doing this?
The Java string literal "ABC\n000111" doesn't contain a backslash: \n is a special character sequence, meaning a single character for a (unix) newline.
If you want to replace \n with n, you can do so:
System.out.println("ABC\n000111".replace('\n', 'n'));
If you want to replace everything up to and including the \n with n, you can do so:
System.out.println("ABC\n000111".replaceAll("^.*\n", "n"));
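If the value really does contain a backslash at run time (for example because it was read from a file), the literal has to be written with a doubled backslash to reproduce it in a test. A minimal sketch under that assumption, with made-up names:

```java
// Illustrative sketch; BackslashPrefixDemo and dropThroughBackslash are hypothetical names.
class BackslashPrefixDemo {
    // Drops everything up to and including the first real backslash.
    // Returns s unchanged when it contains no backslash.
    static String dropThroughBackslash(String s) {
        int i = s.indexOf('\\');
        return i < 0 ? s : s.substring(i + 1);
    }

    public static void main(String[] args) {
        String withBackslash = "ABC\\n000111"; // 11 chars: A B C \ n 0 0 0 1 1 1
        System.out.println(dropThroughBackslash(withBackslash)); // n000111

        String withNewline = "ABC\n000111";    // 10 chars, the 4th is a line feed
        System.out.println(withNewline.indexOf('\\'));           // -1
    }
}
```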

Java regex escaped characters

When matching certain characters (such as line feed), you can use the regex "\\n" or indeed just "\n". For example, the following splits a string into an array of lines:
String[] lines = allContent.split("\\r?\\n");
But the following works just as well:
String[] lines = allContent.split("\r?\n");
My question:
Do the above two work in exactly the same way, or is there any subtle difference? If the latter, can you give an example case where you get different results?
Or is there a difference only in [possible/theoretical] performance?
There is no difference in the current scenario. The usual string escape sequences are formed with a single backslash and a valid escape char ("\n", "\r", etc.), while regex escape sequences are formed with a literal backslash (that is, a double backslash in the Java string literal) and a valid regex escape char ("\\n", "\\d", etc.).
"\n" (an escape sequence) is a literal LF (newline) and "\\n" is a regex escape sequence that matches an LF symbol.
"\r" (an escape sequence) is a literal CR (carriage return) and "\\r" is a regex escape sequence that matches a CR symbol.
"\t" (an escape sequence) is a literal tab symbol and "\\t" is a regex escape sequence that matches a tab symbol.
See the Java regex docs for the list of supported regex escapes.
However, if you use a Pattern.COMMENTS flag (used to introduce comments and format a pattern nicely, making the regex engine ignore all unescaped whitespace in the pattern), you will need to either use "\\n" or "\\\n" to define a newline (LF) in the Java string literal and "\\r" or "\\\r" to define a carriage return (CR).
See a Java test:
String s = "\n";
System.out.println(s.replaceAll("\n", "LF")); // => LF
System.out.println(s.replaceAll("\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\\\n", "LF")); // => LF
System.out.println(s.replaceAll("(?x)\n", "<LF>"));
// => <LF>
//<LF>
Why is the last one producing <LF>+newline+<LF>? Because "(?x)\n" is equal to "", an empty pattern, and it matches an empty space before the newline and after it.
Yes, there is a difference. The Java compiler handles Unicode escapes specially, as described in section 3.3 of The Java Language Specification:
The Java programming language specifies a standard way of transforming
a program written in Unicode into ASCII that changes a program into a
form that can be processed by ASCII-based tools. The transformation
involves converting any Unicode escapes in the source text of the
program to ASCII by adding an extra u - for example, \uxxxx becomes
\uuxxxx - while simultaneously converting non- ASCII characters in the
source text to Unicode escapes containing a single u each.
So how does this affect \n vs \\n? From the Javadoc for Pattern:
It is therefore necessary to double backslashes in string literals
that represent regular expressions to protect them from interpretation
by the Java bytecode compiler.
An example from the same doc:
The string literal "\b", for example, matches a single backspace
character when interpreted as a regular expression, while "\\b"
matches a word boundary. The string literal "\(hello\)" is illegal and
leads to a compile-time error; in order to match the string (hello)
the string literal "\\(hello\\)" must be used.
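The quoted distinction can be checked with a short sketch (the class and helper names are hypothetical):

```java
import java.util.regex.Pattern;

// Illustrative sketch; BackspaceVsBoundaryDemo and found are hypothetical names.
class BackspaceVsBoundaryDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String backspace = "\b";  // one character, U+0008, resolved by the compiler
        String boundary = "\\b";  // two characters, backslash and b, resolved by the regex engine

        System.out.println(found(backspace, "a\bc"));              // true
        System.out.println(found(backspace, "abc"));               // false
        System.out.println(found(boundary, "abc"));                // true
        System.out.println(found("\\(hello\\)", "say (hello)"));   // true
    }
}
```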

converting string of unicode "\u0063" into "c"

I'm doing some cryptanalysis homework and was trying to write code that does a + b = c. My idea was to use unicode: b + (b - a) = c. The problem is that my code returns the unicode escape for c, not the String "c", and I can't convert it.
Please can someone explain the difference between the string below called unicode and those called test and test2? Also is there any way I could get the string unicodeOfC to print "c"?
//this calculates the unicode value for c
String unicodeOfC = ("\\u" + Integer.toHexString('b'+('b'-'a') | 0x10000).substring(1));
//this prints \u0063
System.out.println(unicodeOfC);
String test = "\u0063";
//this prints c
System.out.println(test);
//this is false
System.out.println(test.equals(unicodeOfC));
String test2 = "\u0063";
//this is true
System.out.println(test.equals(test2));
There is no difference between test and test2. They are both String literals referring to the same String. This String literal is made up of a unicode escape.
A compiler for the Java programming language ("Java compiler") first
recognizes Unicode escapes in its input, translating the ASCII
characters \u followed by four hexadecimal digits to the UTF-16 code
unit (§3.1) for the indicated hexadecimal value, and passing all other
characters unchanged.
So the compiler will translate this unicode escape and convert it to the corresponding UTF-16 code unit. That is, the unicode escape \u0063 translates to the character c.
In this
String unicodeOfC = ("\\u" + Integer.toHexString('b'+('b'-'a') | 0x10000).substring(1));
the String literal "\\u" (which uses a \ character to escape a \ character) has a runtime value of \u, i.e. the two characters \ and u. Integer.toHexString(..) is called on 0x10063 and returns "10063"; substring(1) is invoked on that result and drops the leading 1, giving "0063". That String is then concatenated onto \u and assigned to unicodeOfC. So the String value is \u0063, i.e. the 6 characters \, u, 0, 0, 6, and 3.
Also is there any way I could get the string unicodeOfC to print "c"?
Reverse the steps you used to create it: extract the hexadecimal part of the unicode escape, parse it as a base-16 number, and cast the result to char:
String numerical = unicodeOfC.replace("\\u", "");
int val = Integer.parseInt(numerical, 16);
System.out.println((char) val);
This prints c.
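As a hedged aside, if the goal is only to get from 'a' and 'b' to 'c', the detour through an escape string may not be needed at all, since char values support arithmetic directly. The sketch below shows both routes; the class and method names are made up for illustration.

```java
// Illustrative sketch; UnicodeEscapeDemo and decode are hypothetical names.
class UnicodeEscapeDemo {
    // Parses a six-character runtime string (backslash, 'u', four hex digits) into its char.
    static char decode(String escape) {
        if (escape.length() != 6 || !escape.startsWith("\\u")) {
            throw new IllegalArgumentException("expected backslash, u, four hex digits: " + escape);
        }
        return (char) Integer.parseInt(escape.substring(2), 16);
    }

    public static void main(String[] args) {
        // Direct arithmetic: 98 + (98 - 97) = 99, which is the code unit for c.
        char c = (char) ('b' + ('b' - 'a'));
        System.out.println(c);                   // c

        // Decoding the escape text built at run time.
        System.out.println(decode("\\u0063"));   // c
    }
}
```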
I think you're not understanding how string escaping works.
In Java, the backslash is an escape character that lets you write characters such as newlines (\n), tabs (\t), or unicode escapes (\u0063) in string literals.
Suppose I am writing code and I need to print a newline. I would write System.out.println("\n");
Now let's say I want to show a backslash: System.out.println("\"); will be a compile error, but System.out.println("\\"); will print \.
So your first string is printing the literal backslash character then the letter u then the hexadecimal number.

Remove escape char ' \' from string in java

I have to remove \ from the string.
My String is "SEPIMOCO EUROPE\119"
I tried replace, indexOf, Pattern but I am not able to remove this \ from this string
String strconst="SEPIMOCO EUROPE\119";
System.out.println(strconst.replace("\\", " ")); // Gives SEPIMOCO EUROPE 9
System.out.println(strconst.replace("\\\\", " ")); // Gives SEPIMOCO EUROPE 9
System.out.println(strconst.indexOf("\\",0)); //Gives -1
Any solutions for this ?
Your string doesn't actually contain a backslash. This part: "\11" is treated as an octal escape sequence (so it's really a tab - U+0009). If you really want a backslash, you need:
String strconst="SEPIMOCO EUROPE\\119";
It's not really clear where you're getting your input data from or what you're trying to achieve, but that explains everything you're seeing at the moment.
You have to distinguish between the string literal, i.e. the thing you write in your source code, enclosed with double quotes, and the string value it represents. When turning the former into the latter, escape sequences are interpreted, causing a difference between these two.
Stripping from string literals
\11 in the literal represents the character with octal value 11, i.e. a tab character, in the actual string value. \11 is equivalent to \t.
There is no way to reliably obtain the escaped version of a string literal. In other words, you cannot know whether the source code contained \11 or \t, because that information isn't present in the class file any more. Therefore, if you wanted to “strip backslashes” from the sequence, you wouldn't know whether 11 or t was the correct replacement.
For this reason, you should try to fix the string literals, either to not include the backslashes if you don't want them at all, or to contain proper backslashes, by escaping them in the literal as well. \\ in a string literal gives a single \ in the string it expresses.
Runtime strings
As your comments to other answers indicate that you're actually receiving this string at runtime, I would expect the string to contain a real backslash instead of a tab character. Unless you employ some fancy input method which parses escape sequences, you will still have the raw backslash. In order to simulate that situation in testing code, you should include a real backslash in your string, i.e. a double backslash \\ in your string literal.
When you have a real backslash in your string, strconst.replace("\\", " ") should do what you want it to do:
String strconst="SEPIMOCO EUROPE\\119";
System.out.println(strconst.replace("\\", " ")); // Gives SEPIMOCO EUROPE 119
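The difference between the two literals can be verified with a small sketch (class and method names are hypothetical):

```java
// Illustrative sketch; OctalEscapeDemo and replaceBackslash are hypothetical names.
class OctalEscapeDemo {
    // Replaces every real backslash in s with a space.
    static String replaceBackslash(String s) {
        return s.replace("\\", " ");
    }

    public static void main(String[] args) {
        String asWritten = "SEPIMOCO EUROPE\119";  // \11 is an octal escape for tab
        System.out.println(asWritten.equals("SEPIMOCO EUROPE\t9")); // true
        System.out.println(asWritten.indexOf('\\'));                // -1
        System.out.println(asWritten.indexOf('\t'));                // 15

        String withBackslash = "SEPIMOCO EUROPE\\119"; // real backslash in the value
        System.out.println(replaceBackslash(withBackslash));        // SEPIMOCO EUROPE 119
    }
}
```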
Where does your String come from? If you declare it like in the example you will want to add another escaping backslash before the one you have there.

Matching Unicode Dashes in Java Regular Expressions?

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:
private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");
which, if I'm reading the Pattern documentation correctly, should capture any of the unicode dashes or the ascii dash, when surrounded on both sides by whitespace. I'm using the pattern as follows:
String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);
No joy. For the sample input below, the dash is not detected, and
titleSegmentSeparator.matcher(sectionTitle).find() returns false!
In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its unicode code point, no?
Sample input:
Study Summary (1 of 10) – Competition
S(83)t(116)u(117)d(100)y(121)
(32)S(83)u(117)m(109)m(109)a(97)r(114)y(121)
(32)((40)1(49) (32)o(111)f(102)
(32)1(49)0(48))(41) (32)–(8211)
(32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?
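You're mixing decimal (8211) and hexadecimal (0x8211).
\x and \u both expect a hexadecimal number. The dash in your sample (decimal 8211) is the en dash, \u2013, and the em dash (decimal 8212) is \u2014, so \u8211 and \u8212 match different characters entirely. Likewise the ASCII hyphen (decimal 45) is \x2D, not \x45.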
But why not simply use the Unicode property "Dash punctuation"?
As a Java string: "\\s\\p{Pd}\\s"
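A minimal sketch of the split using that property, with the sample title from the question (the class, field, and method names are made up for this example):

```java
import java.util.regex.Pattern;

// Illustrative sketch; DashSplitDemo, SEPARATOR and splitTitle are hypothetical names.
class DashSplitDemo {
    // Whitespace, any character in the Unicode category Pd (dash punctuation), whitespace.
    private static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // The unicode escape below is the en dash, U+2013 (decimal 8211).
        String[] parts = splitTitle("Study Summary (1 of 10) \u2013 Competition");
        System.out.println(parts.length); // 2
        System.out.println(parts[0]);     // Study Summary (1 of 10)
        System.out.println(parts[1]);     // Competition
    }
}
```

The same pattern also splits on the ASCII hyphen and the em dash, since both belong to the Pd category.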
