Get character representation of a Unicode value in Java

I want the character representation of a Unicode value in Java. Can this be done?
Some characters (for example, the character whose Unicode value is \u001b) are not supported in XML, so I am escaping them in the XML by writing the literal text '\u001b'. After unmarshalling, I want the character representation of \u001b to be displayed.
Can this be done in Java? Suggestions are welcome.

Try this:
String s = "\\u0031"; // the literal six-character text \u0031
char c = (char) Integer.parseInt(s.substring(2), 16);
System.out.print(c);
Output:
1
Though I would suggest using XML numeric character references (http://en.wikipedia.org/wiki/Numeric_character_reference) such as &#x1b;, which the XML parser would then decode automatically.
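If you also control the escaping side, a minimal sketch of emitting such a reference for an arbitrary code point could look like this (the helper name toNcr is mine, not from the answer). Note that XML 1.0 disallows most control characters such as U+001B even in reference form; XML 1.1 does accept them as references.
// Hypothetical helper: format a code point as an XML numeric character reference.
static String toNcr(int codePoint) {
    return "&#x" + Integer.toHexString(codePoint) + ";";
}
// toNcr(0x2013) yields "&#x2013;", which an XML parser decodes back to the character.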

String fileName = "outputFile.txt";
String str = "String with unicode";
try {
    // requires java.io.FileOutputStream, OutputStreamWriter, Writer and IOException
    FileOutputStream fos = new FileOutputStream(fileName);
    Writer out = new OutputStreamWriter(fos, "UTF-8"); // "UTF8" is accepted too, but "UTF-8" is the canonical name
    out.write(str);
    out.close();
} catch (IOException e) {
    e.printStackTrace(System.err);
}
This should do it.

Related

Remove illegal xml characters from UTF-16LE encoded file

I have a Java application that parses an XML file encoded in UTF-16LE. Parsing has been failing because the file contains illegal XML characters. My solution is to read the file into a Java String, remove the illegal characters, and then parse it. This works 99% of the time, but there are some slight differences between input and output, not caused by removing the illegal characters, but (I think) by going from the UTF-16LE encoding to Java's internal UTF-16 String representation.
BufferedReader reader = null;
String fileText = ""; // stored as UTF-16
try {
    reader = new BufferedReader(new InputStreamReader(in, "UTF-16LE"));
    for (String line; (line = reader.readLine()) != null; ) {
        fileText += line;
    }
} catch (Exception ex) {
    logger.log(Level.WARNING, "Error removing illegal xml characters", ex);
} finally {
    if (reader != null) {
        reader.close();
    }
}
// code to remove illegal chars from string here, irrelevant to problem
ByteArrayInputStream inStream = new ByteArrayInputStream(fileText.getBytes("UTF-16LE"));
Document doc = XmlUtil.openDocument(inStream, XML_ROOT_NODE_ELEM);
Do characters get changed/lost when going from UTF-16LE to UTF-16? Is there a way to do this in Java while ensuring the input is exactly the same as the output?
Certainly one problem is that readLine throws away the line ending.
You would need to do something like:
fileText += line + "\r\n";
Otherwise XML attributes, DTD entities, or something else could get glued together where at least a space was required. Also, you do not want the text content to be altered when it contains a line break.
Performance (speed and memory) can be improved by using a StringBuilder:
StringBuilder fileText = new StringBuilder();
... fileText.append(line).append("\n");
... fileText.toString();
Then there might be a problem with the first character of the file: a redundant BOM character sometimes gets added there. You can strip it with
line = line.replace("\uFEFF", "");
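Putting these pieces together, a minimal sketch of the corrected read loop (the method name readFileText is mine; it assumes the same in stream as the question):
static String readFileText(InputStream in) throws IOException {
    StringBuilder fileText = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-16LE"))) {
        for (String line; (line = reader.readLine()) != null; ) {
            // preserve the line ending and drop any stray BOM
            fileText.append(line.replace("\uFEFF", "")).append("\r\n");
        }
    }
    return fileText.toString();
}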

How to convert escape-decimal text back to unicode in Java

A third-party library in our stack is munging strings containing emoji etc. like so:
"Ben \240\159\144\144\240\159\142\169"
That is, decimal bytes, not hexadecimal shorts.
Surely there is an existing routine to turn this back into a proper Unicode string, but all the discussion I've found about this expects the format \u12AF, not \123.
I am not aware of any existing routine, but something simple like this should do the job (assuming the input is available as a string):
public static String unEscapeDecimal(String s) {
    try {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(baos, "utf-8");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                // flush pending chars, then emit the raw byte from the
                // three-digit decimal escape, e.g. "\240" -> 0xF0
                writer.flush();
                baos.write(Integer.parseInt(s.substring(i + 1, i + 4)));
                i += 3;
            } else {
                writer.write(c);
            }
        }
        writer.flush();
        return new String(baos.toByteArray(), "utf-8");
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
The writer is just used to make sure existing characters in the string with code points > 127 are encoded correctly, should they occur unescaped. If all non-ASCII characters are escaped, the ByteArrayOutputStream alone would be sufficient.
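A quick usage sketch with the string from the question (the escaped bytes decode to U+1F410 and U+1F3A9):
String munged = "Ben \\240\\159\\144\\144\\240\\159\\142\\169"; // runtime value: Ben \240\159\144\144\240\159\142\169
System.out.println(unEscapeDecimal(munged)); // Ben 🐐🎩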

Convert escaped Unicode character back to actual character

I have the following value in a String variable in Java, which has a Unicode character escaped like below
Dodd\u2013Frank
instead of
Dodd–Frank
(Assume that I don't have control over how this value is assigned to this string variable)
Now how do I convert (encode) it properly and store it back in a String variable?
I found the following code
Charset.forName("UTF-8").encode(str);
But this returns a ByteBuffer, whereas I want a String back.
Edit:
Some more additional information.
When I use System.out.println(str); I get
Dodd\u2013Frank
I am not sure what the correct terminology is (UTF-8 or Unicode). Pardon me for that.
try
str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);
from Apache Commons Lang
java.util.Properties
You can take advantage of the fact that java.util.Properties supports strings with '\uXXXX' escape sequences and do something like this:
Properties p = new Properties();
p.load(new StringReader("key="+yourInputString));
System.out.println("Escaped value: " + p.getProperty("key"));
Inelegant, but functional.
To handle the possible IOException, you may want a try-catch.
Properties p = new Properties();
try {
    p.load(new StringReader("key=" + input));
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("Escaped value: " + p.getProperty("key"));
try
str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str);
as org.apache.commons.lang3.StringEscapeUtils is deprecated.
Suppose you have a Unicode value, such as 00B0 (the degree sign; the similar-looking masculine ordinal indicator used in the Spanish abbreviation for 'primero' is 00BA).
Here is a function that does just what you want:
public static String unicodeToString(char charValue) {
    // new Character(charValue).toString() also works, but that constructor is deprecated
    return Character.toString(charValue);
}
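Note that this only covers values that fit in a single char. For supplementary code points (above U+FFFF) a variant taking an int code point is needed; a sketch:
public static String codePointToString(int codePoint) {
    // Character.toChars handles surrogate pairs for supplementary code points;
    // on Java 11+, Character.toString(int) does the same in one call.
    return new String(Character.toChars(codePoint));
}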
I used StringEscapeUtils.unescapeXml to unescape a string loaded from an API that returns XML.
UnicodeUnescaper from org.apache.commons:commons-text is also acceptable.
new UnicodeUnescaper().translate("Dodd\\u2013Frank") // doubled backslash so the source literal actually contains \u2013
Perhaps the following solution, which decodes the string correctly without any additional dependencies.
This works in a Scala REPL, though it should work just as well in plain Java.
import java.nio.charset.StandardCharsets
import java.nio.charset.Charset
> StandardCharsets.UTF_8.decode(Charset.forName("UTF-8").encode("Dodd\u2013Frank"))
res: java.nio.CharBuffer = Dodd–Frank
You can convert that ByteBuffer to a String like this:
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public static CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();

public static String byteBufferToString(ByteBuffer buffer) {
    String data = "";
    try {
        // remember the buffer's position so the decode does not alter it
        int oldPosition = buffer.position();
        data = decoder.decode(buffer).toString();
        // reset the position to its original value
        buffer.position(oldPosition);
    } catch (Exception e) {
        e.printStackTrace();
        return "";
    }
    return data;
}
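Combined with the encode call from the question, a usage sketch (note this simply round-trips text through bytes; it does not interpret \uXXXX escape sequences):
ByteBuffer buffer = Charset.forName("UTF-8").encode("Dodd–Frank");
System.out.println(byteBufferToString(buffer)); // Dodd–Frank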

Error reading UTF-8 file in Java

I am trying to read some sentences from a file that contains Unicode characters. It does print out a string, but for some reason it messes up the Unicode characters.
This is the code I have:
public static String readSentence(String resourceName) {
    String sentence = null;
    try {
        InputStream refStream = ClassLoader
                .getSystemResourceAsStream(resourceName);
        BufferedReader br = new BufferedReader(new InputStreamReader(
                refStream, Charset.forName("UTF-8")));
        sentence = br.readLine();
    } catch (IOException e) {
        throw new RuntimeException("Cannot read sentence: " + resourceName);
    }
    return sentence.trim();
}
The problem is probably in the way that the string is being output.
I suggest that you confirm that you are correctly reading the Unicode characters by doing something like this:
for (char ch : sentence.toCharArray()) {
    System.err.println("char '" + ch + "' is Unicode codepoint " + ((int) ch));
}
and see if the Unicode codepoints are correct for the characters that are being messed up. If they are correct, the problem is on the output side; if not, on the input side.
First, you could create the InputStreamReader as
new InputStreamReader(refStream, "UTF-8")
Also, you should verify that the resource really contains UTF-8 content.
One of the most annoying reasons could be... your IDE settings.
If your IDE's default console encoding is something like Latin-1, you can struggle for a very long time with different variations of Java code, and nothing will help until you set the relevant IDE options correctly.
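For example, a sketch of forcing the JVM's stdout encoding regardless of the platform or IDE default (the console must still be set to display UTF-8):
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // re-wrap System.out so it encodes output as UTF-8
        System.setOut(new PrintStream(System.out, true, "UTF-8"));
        System.out.println("Dodd\u2013Frank"); // prints Dodd–Frank
    }
}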

Converting UTF-8 to ISO-8859-1 in Java

I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, a few characters are not displayed correctly, such as “, – and ’ (they display as ?).
Is it possible to convert these characters from UTF-8 to ISO-8859-1?
Here is a snippet of code I have written to attempt this:
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with
byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");
I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.
I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:
public final class HtmlEncoder {
    private HtmlEncoder() {}

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (int i = 0; i < sequence.length(); i++) {
            char ch = sequence.charAt(i);
            if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append(ch);
            } else {
                int codepoint = Character.codePointAt(sequence, i);
                // handle supplementary range chars
                i += Character.charCount(codepoint) - 1;
                // emit entity
                out.append("&#x");
                out.append(Integer.toHexString(codepoint));
                out.append(";");
            }
        }
        return out;
    }
}
Example usage:
String foo = "This is Cyrillic Ya: \u044F\n"
+ "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";
StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());
Above, the character LEFT DOUBLE QUOTATION MARK (U+201C “) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.
Care needs to be taken with this approach. If your text needs to be escaped for HTML, that needs to be done before the above code runs, or the ampersands of the emitted entities will themselves be escaped.
Depending on your default encoding, the following lines could cause problems:
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
In Java, String/char data is always UTF-16. A specific encoding is only involved when you convert between characters and bytes. Say your default encoding is UTF-8: the latin1 buffer is then treated as UTF-8, some Latin-1 byte sequences form invalid UTF-8, and you get ?.
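A minimal sketch of that failure mode (assuming a UTF-8 default charset; the explicit charsets below make the mismatch visible):
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    public static void main(String[] args) {
        byte[] latin1 = "café".getBytes(StandardCharsets.ISO_8859_1); // 'é' -> single byte 0xE9
        // decoding the same bytes with the wrong charset mangles the text:
        System.out.println(new String(latin1, StandardCharsets.UTF_8));      // caf? (0xE9 alone is invalid UTF-8)
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // café
    }
}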
With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):
public final class HtmlEncoder {
    private HtmlEncoder() {}

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
            int codePoint = iterator.nextInt();
            if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append((char) codePoint);
            } else {
                out.append("&#x");
                out.append(Integer.toHexString(codePoint));
                out.append(";");
            }
        }
        return out;
    }
}
When you instantiate your String object, you need to indicate which encoding to use.
So replace:
return new String(latin1);
with
return new String(latin1, "ISO-8859-1");
