Convert escaped Unicode character back to actual character - java

I have the following value in a string variable in Java which has UTF-8 characters encoded like below
Dodd\u2013Frank
instead of
Dodd–Frank
(Assume that I don't have control over how this value is assigned to this string variable)
Now how do I convert (encode) it properly and store it back in a String variable?
I found the following code
Charset.forName("UTF-8").encode(str);
But this returns a ByteBuffer, and I want a String back.
Edit:
Some more additional information.
When I use System.out.println(str); I get
Dodd\u2013Frank
I am not sure what the correct terminology is (UTF-8 or Unicode). Pardon me for that.

try
str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);
from Apache Commons Lang
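For example (a minimal sketch - note the doubled backslash in the literal, so the compiler itself does not interpret the escape):

String raw = "Dodd\\u2013Frank"; // contains the six characters \u2013
String fixed = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(raw);
System.out.println(fixed); // Dodd–Frank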

java.util.Properties
You can take advantage of the fact that java.util.Properties supports strings with '\uXXXX' escape sequences and do something like this:
Properties p = new Properties();
p.load(new StringReader("key="+yourInputString));
System.out.println("Escaped value: " + p.getProperty("key"));
Inelegant, but functional.
To handle the possible IOException, you may want a try-catch.
Properties p = new Properties();
try {
    p.load(new StringReader("key=" + input));
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("Escaped value: " + p.getProperty("key"));

try
str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str);
as org.apache.commons.lang3.StringEscapeUtils is deprecated.

Suppose you have a Unicode value, such as 00B0 (degree symbol, or superscript 'o', as in Spanish abbreviation for 'primero')
Here is a function that does just what you want:
public static String unicodeToString(char charValue)
{
    // Character.toString avoids the deprecated Character(char) constructor
    return Character.toString(charValue);
}
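For example:

System.out.println(unicodeToString('\u00B0')); // prints °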

I used StringEscapeUtils.unescapeXml to unescape a string loaded from an API that returns an XML result.
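A minimal, hypothetical example (assuming org.apache.commons.text.StringEscapeUtils; its unescapeXml also resolves numeric character references):

String s = StringEscapeUtils.unescapeXml("Dodd&#x2013;Frank");
System.out.println(s); // Dodd–Frank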

UnicodeUnescaper from org.apache.commons:commons-text also works. Note the doubled backslash, so the compiler does not process the escape before the unescaper sees it:
new UnicodeUnescaper().translate("Dodd\\u2013Frank") // "Dodd–Frank"

Perhaps the following solution, which decodes the string correctly without any additional dependencies.
This works in a Scala REPL, though it should work just as well in plain Java.
import java.nio.charset.StandardCharsets
import java.nio.charset.Charset
> StandardCharsets.UTF_8.decode(Charset.forName("UTF-8").encode("Dodd\u2013Frank"))
res: java.nio.CharBuffer = Dodd–Frank

You can convert that byte buffer to a String like this:
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
public static CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
public static String byteBufferToString(ByteBuffer buffer)
{
    String data = "";
    try
    {
        // remember the buffer's position so it can be restored afterwards
        int oldPosition = buffer.position();
        data = decoder.decode(buffer).toString();
        // reset buffer's position to its original so it is not altered:
        buffer.position(oldPosition);
    }
    catch (Exception e)
    {
        e.printStackTrace();
        return "";
    }
    return data;
}
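For example, a quick round trip (assuming the decoder above):

ByteBuffer buffer = Charset.forName("UTF-8").encode("Dodd–Frank");
System.out.println(byteBufferToString(buffer)); // Dodd–Frank

Note that a shared CharsetDecoder is stateful and not thread-safe; if this can run concurrently, create a decoder per call instead.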

Related

Regular expression for matching "Shift-JIS" string against given set of ranges

Problem Statement :-
Let's call 0x8140~0x84BE, 0x889F~0x9872, 0x989F~0x9FFC, 0xE040~0xEAA4, 0x8740~0x879C, 0xED40~0xEEFC, 0xFA40~0xFC4B, 0xF040~0xF9FC the allowed range.
I want to validate whether an input String contains a kanji which is not in the above range.
Here are examples of input kanji characters not in the above range, with the results my code currently produces:
龔 --> OK
鑫 --> OK
璐 --> Need Change
The expected result should be "Need Change" for all of them.
Please help.
Here is the code:
import java.io.UnsupportedEncodingException;
import java.util.regex.*;

public class RegExpDemo2 {

    private boolean validateMnpName(String name) {
        try {
            byte[] utf8Bytes = name.getBytes("UTF-8");
            String string = new String(utf8Bytes, "UTF-8");
            byte[] shiftJisBytes = string.getBytes("Shift-JIS");
            String strName = new String(shiftJisBytes, "Shift-JIS");
            System.out.println("ShiftJIS Str name : " + strName);
            final String regex = "([\\x{8140}-\\x{84BE}]+)|([\\x{889F}-\\x{9872}]+)|([\\x{989F}-\\x{9FFC}]+)|([\\x{E040}-\\x{EAA4}]+)|([\\x{8740}-\\x{879C}]+)|([\\x{ED40}-\\x{EEFC}]+)|([\\x{FA40}-\\x{FC4B}]+)|([\\x{F040}-\\x{F9FC}]+)";
            if (Pattern.compile(regex).matcher(strName).find()) {
                return true;
            } else {
                return false;
            }
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String args[]) {
        RegExpDemo2 obj = new RegExpDemo2();
        if (obj.validateMnpName("ロ")) {
            System.out.println("OK");
        } else {
            System.out.println("Need Change");
        }
    }
}
Your approach cannot work, because a String is Unicode in Java.
As observed by @VGR and myself, a round trip through a Shift-JIS byte array does not change that. You simply converted Unicode to Shift-JIS and back to Unicode.
There are two approaches possible:
Convert the Java String (which is Unicode) into an array of bytes (in Shift-JIS encoding), and then examine the byte array for the allowed/forbidden values, as sketched below.
Convert the 'allowed' ranges into Unicode (a single range in Shift-JIS may not be a single range in Unicode) and work with the String representation in Unicode.
Neither way is pretty, but if you have to use old character codes instead of the not-quite-so-old (only 30 years!) Unicode, this is necessary.
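A minimal sketch of the first approach (the ranges are copied from the question; treating characters that Shift-JIS cannot encode at all as "Need Change" is an assumption about the requirement):

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class ShiftJisRangeCheck {

    // The allowed double-byte Shift-JIS ranges from the question, as {min, max} pairs.
    private static final int[][] ALLOWED = {
            {0x8140, 0x84BE}, {0x889F, 0x9872}, {0x989F, 0x9FFC},
            {0xE040, 0xEAA4}, {0x8740, 0x879C}, {0xED40, 0xEEFC},
            {0xFA40, 0xFC4B}, {0xF040, 0xF9FC}
    };

    // True if every double-byte character in the Shift-JIS encoding of
    // 'name' falls inside one of the allowed ranges.
    static boolean isInAllowedRanges(String name) {
        Charset sjisCharset = Charset.forName("Shift_JIS");
        CharsetEncoder encoder = sjisCharset.newEncoder();
        if (!encoder.canEncode(name)) {
            // getBytes would silently replace such characters with '?',
            // so reject them explicitly here
            return false;
        }
        byte[] sjis = name.getBytes(sjisCharset);
        for (int i = 0; i < sjis.length; i++) {
            int lead = sjis[i] & 0xFF;
            // bytes outside the lead-byte ranges are single-byte characters
            if (!((lead >= 0x81 && lead <= 0x9F) || (lead >= 0xE0 && lead <= 0xFC))) {
                continue;
            }
            if (i + 1 >= sjis.length) {
                return false; // truncated double-byte sequence
            }
            int value = (lead << 8) | (sjis[++i] & 0xFF);
            boolean ok = false;
            for (int[] range : ALLOWED) {
                if (value >= range[0] && value <= range[1]) {
                    ok = true;
                    break;
                }
            }
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "龔", "鑫", "璐", "ロ" }) {
            System.out.println(s + " --> " + (isInAllowedRanges(s) ? "OK" : "Need Change"));
        }
    }
}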

Java not writing "\u" to properties file

I have a properties file that maps German characters to their hex value (00E4). I had to encode this file with "iso-8859-1", as it was the only way to get the German characters to display. What I'm trying to do is go through German words, check whether these characters appear anywhere in the string, and if they do, replace that value with the hex format. For instance, replace the German character with \u00E4.
The code replaces the character fine, but instead of one backslash I'm getting two, like so: \\u00E4. You can see in the code that I'm using "\\u" to try to print \u, but that's not what happens. Any ideas where I'm going wrong here?
private void createPropertiesMaps(String result) throws FileNotFoundException, IOException
{
    Properties importProps = new Properties();
    Properties encodeProps = new Properties();

    // This props file contains a map of german strings
    importProps.load(new InputStreamReader(new FileInputStream(new File(result)), "iso-8859-1"));
    // This props file contains the german character mappings.
    encodeProps.load(new InputStreamReader(
            new FileInputStream(new File("encoding.properties")),
            "iso-8859-1"));

    // Loop through the german characters
    encodeProps.forEach((k, v) ->
    {
        importProps.forEach((key, val) ->
        {
            String str = (String) val;
            // Find the index of the character if it exists.
            int index = str.indexOf((String) k);
            if (index != -1)
            {
                // create new string, replacing the german character
                String newStr = str.substring(0, index) + "\\u" + v + str.substring(index + 1);
                // set the new property value
                importProps.setProperty((String) key, newStr);
                if (hasUpdated == false)
                {
                    hasUpdated = true;
                }
            }
        });
    });

    if (hasUpdated == true)
    {
        // Write new file
        writeNewPropertiesFile(importProps);
    }
}

private void writeNewPropertiesFile(Properties importProps) throws IOException
{
    File file = new File("import_test.properties");
    OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
    importProps.store(writer, "Unicode Translations");
    writer.close();
}
The point is that you are not writing a simple text file but a Java properties file. In a properties file the backslash character is an escape character, so if your property value contains a backslash, Java is so kind as to escape it for you - which is not what you want in your case.
You might try to circumvent Java's properties-file mechanism by writing a plain text file that can be read back in as a properties file, but that would mean doing manually all the formatting that the Properties class otherwise provides automatically.
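As a side note: if the goal is a properties file containing \uXXXX escapes, one alternative (a sketch, assuming the unescaped characters are still at hand) is to let Properties do the escaping itself. The OutputStream overload of store writes ISO-8859-1 and escapes every character outside the printable ASCII range as \uXXXX, whereas the Writer overload used above writes the raw characters:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

public class StoreWithEscapes {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // raw German characters, no manual \u replacement needed
        props.setProperty("greeting", "Grüße");

        // store(OutputStream, ...) escapes non-ASCII as \uXXXX automatically,
        // producing: greeting=Gr\u00FC\u00DFe
        try (OutputStream out = new FileOutputStream("import_test.properties")) {
            props.store(out, "Unicode Translations");
        }
    }
}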

Get Character Representation of a Unicode value In Java

I want the character representation of a Unicode value in Java.
Can this be done ?
Some characters (an example is the character whose Unicode value is \u001b) are not supported in XML. So I am escaping them in the XML by putting the Unicode value '\u001b', and after unmarshalling I want the character representation of \u001b to be displayed.
Can this be done in Java ?
Suggestions are welcome.
try this
String s = "\\u0031";
char c = (char)Integer.parseInt(s.substring(2), 16);
System.out.print(c);
output
1
though I would suggest using XML numeric character references (http://en.wikipedia.org/wiki/Numeric_character_reference), like &#27; for \u001b - then it would be decoded by the XML parser automatically
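To handle more than one escape in a string, the same parseInt idea can be generalized with a regex (a sketch, not from the original answer):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapes {
    // Replaces every \uXXXX escape with the character it denotes.
    static String unescapeUnicode(String s) {
        Matcher m = Pattern.compile("\\\\u([0-9A-Fa-f]{4})").matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescapeUnicode("Dodd\\u2013Frank")); // Dodd–Frank
    }
}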
String fileName = "outputFile.txt";
String str = "String with unicode";
try {
    FileOutputStream fos = new FileOutputStream(fileName);
    Writer out = new OutputStreamWriter(fos, "UTF-8");
    out.write(str);
    out.close();
} catch (IOException e) {
    e.printStackTrace(System.err);
}
This should do it.

Decode a string in Java

How do I properly decode the following string in Java?
http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru
When I use URLDecoder.decode() I get the following error:
java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u0"
Thanks,
Dave
According to Wikipedia, "there exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value".
Continuing: "This behavior is not specified by any RFC and has been rejected by the W3C".
Your URL contains such tokens, and the Java URLDecoder implementation doesn't support them.
%uXXXX encoding is non-standard and was actually rejected by the W3C, so it's natural that URLDecoder does not understand it.
You can write a small function which fixes this by replacing each occurrence of %uXXYY with %XX%YY in your encoded string. Then you can proceed and decode the fixed string normally.
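A minimal sketch of such a function (an illustration, assuming the input contains only literal characters, standard %XX tokens and non-standard %uXXXX tokens; the longer answers below are more defensive). Since the whole string is ultimately decoded as UTF-16, standard single-byte tokens have to be widened to two bytes as well:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentUDecoder {
    static String decodePercentU(String s) throws UnsupportedEncodingException {
        // widen standard %XX tokens to %00%XX so they survive UTF-16 decoding
        // ('u' is not a hex digit, so %uXXYY tokens are left untouched here)
        String fixed = s.replaceAll("%([0-9A-Fa-f]{2})", "%00%$1");
        // rewrite each non-standard %uXXYY token as %XX%YY
        fixed = fixed.replaceAll("%u([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})", "%$1%$2");
        return URLDecoder.decode(fixed, "UTF-16");
    }
}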
We started with Vartec's solution but found additional issues. This solution works for UTF-16, but it can be changed to return UTF-8. The replaceAll calls are left in for clarity, and you can read more at http://www.cogniteam.com/wiki/index.php?title=DecodeEncodeJavaScript
static public String unescape(String escaped) throws UnsupportedEncodingException
{
    // This code is needed so that the UTF-16 won't be malformed
    String str = escaped.replaceAll("%0", "%u000");
    str = str.replaceAll("%1", "%u001");
    str = str.replaceAll("%2", "%u002");
    str = str.replaceAll("%3", "%u003");
    str = str.replaceAll("%4", "%u004");
    str = str.replaceAll("%5", "%u005");
    str = str.replaceAll("%6", "%u006");
    str = str.replaceAll("%7", "%u007");
    str = str.replaceAll("%8", "%u008");
    str = str.replaceAll("%9", "%u009");
    str = str.replaceAll("%A", "%u00A");
    str = str.replaceAll("%B", "%u00B");
    str = str.replaceAll("%C", "%u00C");
    str = str.replaceAll("%D", "%u00D");
    str = str.replaceAll("%E", "%u00E");
    str = str.replaceAll("%F", "%u00F");

    // Here we split the 4 byte to 2 byte, so that decode won't fail
    String[] arr = str.split("%u");
    Vector<String> vec = new Vector<String>();
    if (!arr[0].isEmpty())
    {
        vec.add(arr[0]);
    }
    for (int i = 1; i < arr.length; i++)
    {
        if (!arr[i].isEmpty())
        {
            vec.add("%" + arr[i].substring(0, 2));
            vec.add("%" + arr[i].substring(2));
        }
    }
    str = "";
    for (String string : vec)
    {
        str += string;
    }

    // Here we return the decoded string
    return URLDecoder.decode(str, "UTF-16");
}
After having had a good look at the solution presented by @ariy, I created a Java based solution that is also resilient against encoded characters that have been chopped into two parts (i.e. half of the encoded character is missing). This happens in my use case, where I need to decode long urls that are sometimes chopped at a length of 2000 chars. See What is the maximum length of a URL in different browsers?
public class Utils {
    private static Pattern validStandard = Pattern.compile("%([0-9A-Fa-f]{2})");
    private static Pattern choppedStandard = Pattern.compile("%[0-9A-Fa-f]{0,1}$");
    private static Pattern validNonStandard = Pattern.compile("%u([0-9A-Fa-f][0-9A-Fa-f])([0-9A-Fa-f][0-9A-Fa-f])");
    private static Pattern choppedNonStandard = Pattern.compile("%u[0-9A-Fa-f]{0,3}$");

    public static String resilientUrlDecode(String input) {
        String cookedInput = input;

        if (cookedInput.indexOf('%') > -1) {
            // Transform all existing UTF-8 standard into UTF-16 standard.
            cookedInput = validStandard.matcher(cookedInput).replaceAll("%00%$1");

            // Discard chopped encoded char at the end of the line (there is no way to know what it was)
            cookedInput = choppedStandard.matcher(cookedInput).replaceAll("");

            // Handle non standard (rejected by W3C) encoding that is used anyway by some
            // See: https://stackoverflow.com/a/5408655/114196
            if (cookedInput.contains("%u")) {
                // Transform all existing non standard into UTF-16 standard.
                cookedInput = validNonStandard.matcher(cookedInput).replaceAll("%$1%$2");

                // Discard chopped encoded char at the end of the line
                cookedInput = choppedNonStandard.matcher(cookedInput).replaceAll("");
            }
        }

        try {
            return URLDecoder.decode(cookedInput, "UTF-16");
        } catch (UnsupportedEncodingException e) {
            // Will never happen because the encoding is hardcoded
            return null;
        }
    }
}

Converting UTF-8 to ISO-8859-1 in Java

I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, there are a few characters that are not displayed correctly, such as “, – and ’ (they display as ?).
Is it possible to convert these characters from UTF-8 to ISO-8859-1?
Here is a snippet of code I have written to attempt this:
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with
byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");
I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.
I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:
public final class HtmlEncoder {
    private HtmlEncoder() {}

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (int i = 0; i < sequence.length(); i++) {
            char ch = sequence.charAt(i);
            if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append(ch);
            } else {
                int codepoint = Character.codePointAt(sequence, i);
                // handle supplementary range chars
                i += Character.charCount(codepoint) - 1;
                // emit entity
                out.append("&#x");
                out.append(Integer.toHexString(codepoint));
                out.append(";");
            }
        }
        return out;
    }
}
Example usage:
String foo = "This is Cyrillic Ya: \u044F\n"
+ "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";
StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());
Above, the character LEFT DOUBLE QUOTATION MARK (U+201C “) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.
Care needs to be taken with this approach. If your text needs to be escaped for HTML, that needs to be done before the above code or the ampersands end up being escaped.
Depending on your default encoding, the following lines could cause problems:
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
In Java, String/char is always UTF-16BE. A different encoding is only involved when you convert the characters to bytes. Say your default encoding is UTF-8: the latin1 buffer is then treated as UTF-8, and some Latin-1 sequences may form invalid UTF-8 sequences, so you will get ?.
With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):
public final class HtmlEncoder {
    private HtmlEncoder() {
    }

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
            int codePoint = iterator.nextInt();
            if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append((char) codePoint);
            } else {
                out.append("&#x");
                out.append(Integer.toHexString(codePoint));
                out.append(";");
            }
        }
        return out;
    }
}
When you instantiate your String object, you need to indicate which encoding to use.
So replace:
return new String(latin1);
with
return new String(latin1, "ISO-8859-1");
