java how to escape accented character in string

For example, given this JSON string:
{"orderNumber":"S301020000","customerFirstName":"ke ČECHA ","customerLastName":"张科","orderStatus":"PENDING_FULFILLMENT_REQUEST","orderSubmittedDate":"May 13, 2015 1:41:28 PM"}
How can I find accented characters such as "Č" in the JSON string above and escape them in Java?
For context, see my earlier question:
Ajax unescape response text from java servlet not working properly
Sorry for my English :)

Escape every character greater than 0x7F. Loop through the string's characters with the charAt(index) method and replace each character ch that needs escaping with its \uXXXX form:
String hexDigits = Integer.toHexString(ch).toUpperCase();
String escapedCh = "\\u" + "0000".substring(hexDigits.length()) + hexDigits;
You probably won't need to unescape them in JavaScript, because JavaScript supports escaped characters in string literals, so you should be able to work with the string as the server returns it. I'm guessing you will use JSON.parse() to convert the returned JSON string into a JavaScript object; it decodes these escapes as well.
Here's a complete function:
public static String escapeJavaScript(String source)
{
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < source.length(); i++)
    {
        char ch = source.charAt(i);
        // Escape anything outside 7-bit ASCII as a four-digit Unicode escape
        if (ch > 0x7F)
        {
            String hexDigits = Integer.toHexString(ch).toUpperCase();
            // Left-pad the hex digits with zeros to exactly four places
            String escapedCh = "\\u" + "0000".substring(hexDigits.length()) + hexDigits;
            result.append(escapedCh);
        }
        else
        {
            result.append(ch);
        }
    }
    return result.toString();
}
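As a usage sketch, the same escaping can also be written with String.format, which pads the hex digits to four places for you. The class and method names here (EscapeDemo, escapeNonAscii) are illustrative and not part of the answer above; the sample input reuses the first name from the question.

```java
class EscapeDemo {
    // Replaces every char above 0x7F with a backslash-u escape; ASCII passes through.
    static String escapeNonAscii(String source) {
        StringBuilder result = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char ch = source.charAt(i);
            if (ch > 0x7F) {
                // %04X prints the code unit as four upper-case hex digits
                result.append(String.format("\\u%04X", (int) ch));
            } else {
                result.append(ch);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // U+010C is the accented C from the question, written as an escape
        // so that this source file stays pure ASCII.
        System.out.println(escapeNonAscii("ke \u010CECHA"));
    }
}
```

The output contains only ASCII characters, so it should survive a servlet response whose character encoding is misconfigured.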

Related

Java %u20AC conversion to euro €

How can I convert a string like "promo desc %u20AC" into "promo desc €"? I tried:
URLDecoder.decode("promo desc %u20AC", "UTF-16");
This doesn't work, because % must be followed by two hex digits and u2 is not valid hex.
The string to decode is generated in JavaScript like this:
var string = escape("{€ć"); // "%7B%u20AC%u0107"
I would rather not use URLDecoder anyway because, semantically, what I'm decoding is not a URL but a very long text. Simply converting % to \ seems naive, since the text may contain other % sequences.
What I am after is the equivalent of this function:
unescape("%7B%u20AC%u0107")
It exists in JavaScript but, to my knowledge, not in Java. How can I achieve this in Java?
Thanks

I was curious, because I've not seen the %u escapes before, but it turns out unescaping them is fairly easy:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
private static final Pattern JAVASCRIPT_ESCAPE_SEQUENCE = Pattern.compile("%(u[0-9a-fA-F]{4}|[0-9a-fA-F]{2})");
/**
 * Unescape a JavaScript-escaped string.
 * Undoes the effect of calling
 * <a href="https://developer.mozilla.org/de/docs/Web/JavaScript/Reference/Global_Objects/escape">
 * the JavaScript escape method</a>.
 */
static String unescape(String input) {
    Matcher matcher = JAVASCRIPT_ESCAPE_SEQUENCE.matcher(input);
    // appendReplacement(StringBuilder, ...) needs Java 9+; use StringBuffer on older versions
    StringBuilder sb = new StringBuilder(input.length());
    while (matcher.find()) {
        String escapeSequence = matcher.group(1);
        if (escapeSequence.startsWith("u")) {
            // Four-digit form: drop the leading 'u' and keep the hex digits
            escapeSequence = escapeSequence.substring(1);
        }
        char c = (char) Integer.parseInt(escapeSequence, 16);
        // quoteReplacement keeps a decoded '$' or '\' from being read as a group reference
        matcher.appendReplacement(sb, Matcher.quoteReplacement(Character.toString(c)));
    }
    matcher.appendTail(sb);
    return sb.toString();
}
With this method, unescape("%7B%u20AC%u0107") produces the desired output {€ć.

Encode only specific characters in String

I need to encode only certain special characters in a string as numeric values.
Say,
String name = "test $#";
I want to encode only the characters $ and # in the string above. I tried the code below, but it did not work.
String encode = URLEncoder.encode(StringEscapeUtils.escapeJava(name), "UTF-8");
The encoded values should be numeric; for example, the encoded value for white space is &#160

How about splitting the string with String#split, using a space as the regex? The last item of the returned array contains the symbols you need :)
String name = "test $#";
String[] nameSplittedArr = name.split(" ");
String yourChars = nameSplittedArr[nameSplittedArr.length-1]; // indexes start at zero
That should work :)

As per the comments, I think you are after a customized encoding function. Something like:
public static String EncodeString(String text) {
StringBuffer sb = new StringBuffer();
for (char c : text.toCharArray()) {
if (Character.isLetterOrDigit(c)) {
sb.append(c);
} else {
sb.append("&#" + (int)c + ";");
}
}
return sb.toString();
}
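The function above encodes every character that is not a letter or digit, including spaces. If only a chosen set such as $ and # should be encoded, a variant along these lines might fit better. This is a minimal sketch; the names SelectiveEncoder and encodeOnly, and the idea of passing the set as a string, are assumptions made for illustration.

```java
class SelectiveEncoder {
    // Encodes only the characters listed in toEncode as decimal character
    // references (&#NN;) and copies everything else unchanged.
    static String encodeOnly(String text, String toEncode) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if (toEncode.indexOf(c) >= 0) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // '$' is code 36 and '#' is code 35
        System.out.println(encodeOnly("test $#", "$#")); // prints test &#36;&#35;
    }
}
```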

How to disguise escape character - \" within a string

I am having some difficulty with a syntax highlighter that I've made, which is 90% complete. It reads the source text of a .java file, detects keywords, comments, etc., and writes (colorful) output to an HTML file.
My program seems to work correctly with keywords, literals and comments, so it can document almost all programs. But it breaks when a string contains the escape sequence for ", i.e. \". In that case the string literal highlighting doesn't stop at the end of the literal, but continues until it finds another cue, such as a keyword or another literal.
So, the question is: how do I disguise/hide/remove this \" within a string?
The stringFilter method of my program is:
public String stringFilter(String line) {
if (line == null || line.equals("")) {
return "";
}
StringBuffer buf = new StringBuffer();
if (line.indexOf("\"") <= -1) {
return keywordFilter(line);
}
int start = 0;
int startStringIndex = -1;
int endStringIndex = -1;
int tempIndex;
//Keep moving through String characters until we want to stop...
while ((tempIndex = line.indexOf("\"")) > -1 && !isInsideString(line, tempIndex)) {
//We found the beginning of a string
if (startStringIndex == -1) {
startStringIndex = 0;
buf.append( stringFilter(line.substring(start,tempIndex)) );
buf.append("</font>");
buf.append(literal).append("\"");
line = line.substring(tempIndex+1);
}
//Must be at the end
else {
startStringIndex = -1;
endStringIndex = tempIndex;
buf.append(line.substring(0,endStringIndex+1));
buf.append("</font>");
buf.append(normal);
line = line.substring(endStringIndex+1);
}
}
buf.append( keywordFilter(line) );
return buf.toString();
}
EDIT
in response to the first few comments and answers, here's what I tried:
A snippet from htmlFilter(String), but it doesn't work :(
//replace '&' i.e. ampersands with HTML escape sequence for ampersand.
line = line.replaceAll("&", "&amp;");
//line = line.replaceAll(" ", "&nbsp;");
line = line.replaceAll("" + (char)35, "&#35;");
// replace less-than signs which might be confused
// by HTML as tag angle-brackets;
line = line.replaceAll("<", "&lt;");
// replace greater-than signs which might be confused
// by HTML as tag angle-brackets;
line = line.replaceAll(">", "&gt;");
line = multiLineCommentFilter(line);
//replace the '\\' i.e. escape for backslash with HTML escape sequences.
//fixes a problem when backslashes preceed quotes.
//line = line.replaceAll("\\\"", "\"");
//line = line.replaceAll("" + (char)92 + (char)92, "\\");
return line;

My idea is that when a backslash is met, ignore the next character.
String str = "blah\"blah\\blah\n";
int index = 0;
while (true) {
// find the beginning
while (index < str.length() && str.charAt(index) != '\"')
index++;
int beginIndex = index;
if (index == str.length()) // no string found
break;
index++;
// find the ending
while (index < str.length()) {
if (str.charAt(index) == '\\') {
// escape, ignore the next character
index += 2;
} else if (str.charAt(index) == '\"') {
// end of string found
System.out.println(beginIndex + " " + index);
break;
} else {
// plain content
index++;
}
}
if (index >= str.length())
throw new IllegalArgumentException(
"String literal is not properly closed by a double-quote");
index++;
}

Check the character at tempIndex-1; if it is \, don't treat the quote as the beginning or end of a string.
String originalLine=line;
if ((tempIndex = originalLine.indexOf("\"", tempIndex + 1)) > -1) {
if (tempIndex==0 || originalLine.charAt(tempIndex - 1) != '\\') {
...

Steps to follow:
First, replace every \" with a temporary string such as
String tempStr="forward_slash_followed_by_double_quote";
line = line.replaceAll("\\\\\"", tempStr);
//line = line.replaceAll("\\\"", tempStr);
Then do whatever processing you need.
Finally, replace the temporary string with \"
line = line.replaceAll(tempStr, "\\\\\"");
//line = line.replaceAll(tempStr, "\\\"");

The trouble with finding a quote and then trying to work out whether it's escaped is that it's not enough to simply look at the previous character to see if it's a backslash - consider
String basedir = "C:\\Users\\";
where the \" isn't an escaped quote, but is actually an escaped backslash followed by an unescaped quote. In general a quote preceded by an odd number of backslashes is escaped, one preceded by an even number of backslashes isn't.
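The odd/even rule can be sketched as a small helper. The names QuoteEscapeCheck and isEscapedQuote are hypothetical, and the helper assumes the quote at the given index is already known to lie inside a string literal.

```java
class QuoteEscapeCheck {
    // Returns true if the quote at quoteIndex is preceded by an odd number
    // of consecutive backslashes, which means the quote itself is escaped.
    static boolean isEscapedQuote(String source, int quoteIndex) {
        int backslashes = 0;
        for (int i = quoteIndex - 1; i >= 0 && source.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 1;
    }

    public static void main(String[] args) {
        // One backslash before the quote: the quote is escaped
        String escaped = "say \\\"hi";
        // Two backslashes before the quote: an escaped backslash, then a real quote
        String unescaped = "C:\\\\Users\\\\\"";
        System.out.println(isEscapedQuote(escaped, escaped.indexOf('"')));     // true
        System.out.println(isEscapedQuote(unescaped, unescaped.indexOf('"'))); // false
    }
}
```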
A more sensible approach would be to parse through the string one character at a time from left to right rather than trying to jump ahead to quote characters. If you don't want to have to learn a proper parser generator like JavaCC or ANTLR, you can tackle this case with regular expressions using the \G anchor (which forces each subsequent match to start at the end of the previous one with no gaps). If we assume that str is a substring of your input starting with the character following the opening quote of a string literal, then
Pattern p = Pattern.compile("\\G(?:\\\\u[0-9A-Fa-f]{4}|\\\\.|[^\"\\\\])");
StringBuilder buf = new StringBuilder();
Matcher m = p.matcher(str);
while(m.find()) buf.append(m.group());
will leave buf containing the content of the string literal up to but not including the closing quote, and will handle escapes like \", \\ and unicode escapes \uNNNN.
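For a runnable check, the snippet above can be wrapped as follows. The class name LiteralScanner, the method name literalContent and the sample input are assumptions for illustration; the pattern is the one from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralScanner {
    // One piece of literal content: a unicode escape, any other backslash
    // escape, or a single char that is neither a quote nor a backslash.
    private static final Pattern PIECE =
            Pattern.compile("\\G(?:\\\\u[0-9A-Fa-f]{4}|\\\\.|[^\"\\\\])");

    // rest is source text starting right after the opening quote of a literal.
    // Matching stops at the first unescaped quote because no alternative
    // matches it and the \G anchor forbids skipping ahead.
    static String literalContent(String rest) {
        StringBuilder buf = new StringBuilder();
        Matcher m = PIECE.matcher(rest);
        while (m.find()) {
            buf.append(m.group());
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Source text after the opening quote of the basedir example
        String rest = "C:\\\\Users\\\\\" + name;";
        System.out.println(literalContent(rest));
    }
}
```

The length of the returned content gives the offset of the closing quote, which is where a highlighter would end the literal's markup.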

Use a double backslash "\\"" instead of "\""... Maybe it works...

Using Java Normalizer to convert accent ascii to non-accent but to exclude some symboles

I have a set of data containing accented characters. I want to convert the accented letters to plain English letters. I do that with the following code:
import java.text.Normalizer;
import java.util.regex.Pattern;
public String deAccent(String str) {
String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
return pattern.matcher(nfdNormalizedString).replaceAll("");
}
What this code is missing is a way to exclude characters. I don't know how to exclude certain characters from the conversion; for example, I want to keep the letter "ü" in the word Düsseldorf so that it doesn't turn into Dusseldorf. Is there a way to pass an exclude list to the method or the matcher so that certain accented characters are not converted?

Do not use normalization to remove accents!
For example, the following letters are not asciified using your method:
ł
đ
ħ
You may also want to split ligatures like œ into separate letters (i.e. oe).
Try this:
private static final String TAB_00C0 = "" +
"AAAAAAACEEEEIIII" +
"DNOOOOO×OUUUÜYTs" + // <-- note an accented letter you wanted
// and preserved multiplication sign
"aaaaaaaceeeeiiii" +
"dnooooo÷ouuuüyty" + // <-- note an accented letter and preserved division sign
"AaAaAaCcCcCcCcDd" +
"DdEeEeEeEeEeGgGg" +
"GgGgHhHhIiIiIiIi" +
"IiJjJjKkkLlLlLlL" +
"lLlNnNnNnnNnOoOo" +
"OoOoRrRrRrSsSsSs" +
"SsTtTtTtUuUuUuUu" +
"UuUuWwYyYZzZzZzs";
public static String toPlain(String source) {
StringBuilder sb = new StringBuilder(source.length());
for (int i = 0; i < source.length(); i++) {
char c = source.charAt(i);
switch (c) {
case 'ß':
sb.append("ss");
break;
case 'Œ':
sb.append("OE");
break;
case 'œ':
sb.append("oe");
break;
// insert more ligatures you want to support
// or other letters you want to convert in a non-standard way here
// I recommend to take a look at: æ þ ð ﬂ ﬁ
default:
if (c >= 0xc0 && c <= 0x17f) {
c = TAB_00C0.charAt(c - 0xc0);
}
sb.append(c);
}
}
return sb.toString();
}
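If normalization is still acceptable for your data, the exclusion list asked about in the question could be sketched as below. This is not the table-based method above but the question's Normalizer approach applied one character at a time; the names DeAccentWithExclusions and deAccentExcept are hypothetical, and the sketch assumes the text contains no characters outside the Basic Multilingual Plane. It also shows the limitation mentioned earlier: ł has no decomposition and passes through unchanged.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class DeAccentWithExclusions {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Strips diacritics from every char except those listed in keep.
    static String deAccentExcept(String str, String keep) {
        // Compose first so that each kept letter is a single char
        String composed = Normalizer.normalize(str, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            if (keep.indexOf(c) >= 0) {
                sb.append(c);
            } else {
                // Decompose this one char and drop its combining marks
                String decomposed =
                        Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
                sb.append(MARKS.matcher(decomposed).replaceAll(""));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The u-umlaut (U+00FC) is kept, the e-acute (U+00E9) is stripped
        System.out.println(deAccentExcept("D\u00FCsseldorf caf\u00E9", "\u00FC"));
        // The l-with-stroke (U+0142) has no decomposition, so it is unchanged
        System.out.println(deAccentExcept("\u0142", ""));
    }
}
```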

Replace non-ascii character by ascii code using java regex

I have a string like this: T 8.ESTÜTESTतुम मेरी. Using a Java regex, I want to replace the non-ASCII characters Ü and तुम मेरी with their equivalent codes.
How can I achieve this?
So far I can only replace them with some other fixed string:
String str = "T 8.ESTÜTESTतुम मेरी";
String resultString = str.replaceAll("[^\\p{ASCII}]", "");
System.out.println(resultString);
It prints T 8.ESTTEST

Sorry, I don't know how to do this using a single regex. Please check whether this works for you:
String str = "T 8.ESTÜTESTतुम मेरी";
StringBuffer sb = new StringBuffer();
for(int i=0;i<str.length();i++){
if (String.valueOf(str.charAt(i)).matches("[^\\p{ASCII}]")){
sb.append("[CODE #").append((int)str.charAt(i)).append("]");
}else{
sb.append(str.charAt(i));
}
}
System.out.println(sb.toString());
prints
T 8.EST[CODE #220]TEST[CODE #2340][CODE #2369][CODE #2350] [CODE #2350][CODE #2375][CODE #2352][CODE #2368]
The problem seems to be how to tell the regex how to convert what it finds into the code.
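One way to let the regex drive the conversion is Matcher.appendReplacement, which accepts a replacement computed for each match. A minimal sketch follows; the names NonAsciiCodes and replaceWithCodes are illustrative, and the [CODE #n] format is simply copied from the loop above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonAsciiCodes {
    private static final Pattern NON_ASCII = Pattern.compile("[^\\p{ASCII}]");

    // Replaces each non-ASCII character with [CODE #n], n being its code point.
    static String replaceWithCodes(String str) {
        Matcher m = NON_ASCII.matcher(str);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String code = "[CODE #" + m.group().codePointAt(0) + "]";
            // quoteReplacement guards against '$' or '\' in the replacement text
            m.appendReplacement(sb, Matcher.quoteReplacement(code));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00DC is the U-umlaut from the question, written as an escape
        System.out.println(replaceWithCodes("T 8.EST\u00DCTEST")); // T 8.EST[CODE #220]TEST
    }
}
```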
