Java UTF-8 to ASCII conversion with supplements

We accept all sorts of national characters in a UTF-8 string on the input, and we need to convert them to an ASCII string on the output for some legacy use. (We don't accept Chinese and Japanese chars, only European languages.)
We have a small utility to get rid of all the diacritics:
public static final String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;
    final char[] chars = sText.toCharArray();
    final int iSize = chars.length;
    final StringBuilder sb = new StringBuilder(iSize);
    for (int i = 0; i < iSize; i++) {
        String sLetter = new String(new char[] { chars[i] });
        sLetter = Normalizer.normalize(sLetter, Normalizer.Form.NFC);
        try {
            byte[] bLetter = sLetter.getBytes("UTF-8");
            sb.append((char) bLetter[0]);
        } catch (UnsupportedEncodingException e) {
        }
    }
    return sb.toString();
}
The question is how to replace the German sharp s (ß) and other characters that get through the above normalization method, such as Đ and đ, with their supplements (in the case of ß the supplement would probably be "ss", and in the case of Đ it would be either "D" or "Dj").
Is there some simple way to do it, without a million .replaceAll() calls?
So for example: Đonardan = Djonardan, Blaß = Blass and so on.
We can replace all "problematic" chars with empty space, but would like to avoid this to make the output as similar to the input as possible.
Thank you for your answers,
Bozo

You want to use ICU4J. It includes the com.ibm.icu.text.Transliterator class, which apparently can do what you are looking for.
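For illustration, a minimal sketch of that approach (untested; it assumes ICU4J is on the classpath and that the built-in Latin-ASCII transform covers the characters you need — for per-character choices such as Đ -> "Dj" you would still define your own rules via Transliterator.createFromRules):

import com.ibm.icu.text.Transliterator;

public class IcuFolding {
    // "Any-Latin" first transliterates any script to Latin, then
    // "Latin-ASCII" folds the result to plain ASCII
    // (ß should become "ss", Đ should become "D", é becomes "e", ...).
    private static final Transliterator TO_ASCII =
            Transliterator.getInstance("Any-Latin; Latin-ASCII");

    public static void main(String[] args) {
        System.out.println(TO_ASCII.transliterate("Đonardan")); // expected: Donardan
        System.out.println(TO_ASCII.transliterate("Blaß"));     // expected: Blass
    }
}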

Here's my converter, which uses Lucene...
private final KeywordTokenizer keywordTokenizer = new KeywordTokenizer(new StringReader(""));
private final ASCIIFoldingFilter asciiFoldingFilter = new ASCIIFoldingFilter(keywordTokenizer);
private final TermAttribute termAttribute = (TermAttribute) asciiFoldingFilter.getAttribute(TermAttribute.class);

public String process(String line)
{
    if (line != null)
    {
        try
        {
            keywordTokenizer.reset(new StringReader(line));
            if (asciiFoldingFilter.incrementToken())
            {
                return termAttribute.term();
            }
        }
        catch (IOException e)
        {
            logger.warn("Failed to parse: " + line, e);
        }
    }
    return null;
}
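A hedged usage note: process folds one line at a time, and with the characters from the question it should behave like this (the API shown above is the old Lucene 2.x/3.x one; later versions replaced TermAttribute with CharTermAttribute):

// Expected behaviour, assuming ASCIIFoldingFilter's mapping tables cover
// these characters (they cover Latin-1 Supplement and Latin Extended-A):
String folded = process("Blaß Đonardan"); // expected: "Blass Donardan"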

I'm using something like this:
Transliterator transliterator = Transliterator.getInstance("Any-Latin; Upper; Lower; NFD; [:Nonspacing Mark:] Remove; NFC", Transliterator.FORWARD);
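For context, a hedged sketch of what that instance does with the question's examples (note the "Upper; Lower" steps: Upper maps ß to SS and Lower brings it back down as ss, so everything ends up lowercase; đ has no canonical decomposition, so it survives unless you add a rule for it):

// Assumes com.ibm.icu.text.Transliterator from ICU4J.
String folded = transliterator.transliterate("Blaß Đonardan");
// expected: "blass đonardan" - lowercased, combining marks stripped,
// but the stroked đ untouched.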

Is there some simple way to do it, without a million .replaceAll() calls?
If you just support European, Latin-based languages, around 100 should be enough; that's definitely doable: Grab the Unicode charts for Latin-1 Supplement and Latin Extended-A and get the String.replace party started. :-)
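If you go that route, a single lookup table beats a million .replaceAll() calls — a minimal sketch (the few mappings shown are illustrative only; populate the table from the charts):

import java.util.HashMap;
import java.util.Map;

public final class LatinFolder {
    private static final Map<Character, String> MAP = new HashMap<Character, String>();
    static {
        // Illustrative entries only - fill in from the Latin-1 Supplement
        // and Latin Extended-A charts.
        MAP.put('ß', "ss");
        MAP.put('Đ', "D");
        MAP.put('đ', "d");
        MAP.put('Æ', "AE");
    }

    public static String fold(String s) {
        final StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            final char c = s.charAt(i);
            final String mapped = MAP.get(c);
            sb.append(mapped != null ? mapped : Character.toString(c));
        }
        return sb.toString();
    }
}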

Related

How to convert escape-decimal text back to unicode in Java

A third-party library in our stack is munging strings containing emoji etc. like so:
"Ben \240\159\144\144\240\159\142\169"
That is, decimal bytes, not hexadecimal shorts.
Surely there is an existing routine to turn this back into a proper Unicode string, but all the discussion I've found about this expects the format \u12AF, not \123.
I am not aware of any existing routine, but something simple like this should do the job (assuming the input is available as a string):
public static String unEscapeDecimal(String s) {
    try {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(baos, "utf-8");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                // flush pending chars, then emit the raw byte value
                writer.flush();
                baos.write(Integer.parseInt(s.substring(i + 1, i + 4)));
                i += 3;
            } else {
                writer.write(c);
            }
        }
        writer.flush();
        return new String(baos.toByteArray(), "utf-8");
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
The writer is just used to make sure existing characters in the string with code points > 127 are encoded correctly, should they occur unescaped. If all non-ascii characters are escaped, the byte array output stream should be sufficient.
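A quick usage sketch with the string from the question (the backslashes must themselves be escaped in a Java literal):

String munged = "Ben \\240\\159\\144\\144\\240\\159\\142\\169";
System.out.println(unEscapeDecimal(munged)); // should print "Ben " followed by the two emoji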

Decode an escaped string from VBScript in Java

I tried to decode the following string,
String str = "AT%26amp%3BT%20Network%20Client%20%u2013%20IBM";
System.out.println(StringEscapeUtils.unescapeHtml(str));
try {
    System.out.println("res:" + java.net.URLDecoder.decode(str, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Both methods fail, as shown below:
AT%26amp%3BT%20Network%20Client%20%u2013%20IBM
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u2"
at java.net.URLDecoder.decode(URLDecoder.java:173)
at decrypt.DecryptHtml.main(DecryptHtml.java:19)
The source of the string is a VBS script that uses the Escape function. How can I decode this string?
Unfortunately, from reading the documentation, it appears that Microsoft Has Done It Again (tm): "non standard xxx", where here "xxx" is "escaping format".
Specifically, in the documentation of the VBScript function, it is said that:
[...]Unicode characters that have a value greater than 255 are stored using the %uxxxx format.
(Hey, MS: there is no such thing as "Unicode characters"; those are called code points)
Great. So you need your own decoding function.
Fortunately, we use Java. And since this proprietary escape sequence only covers Unicode code points in the Basic Multilingual Plane (U+0000 to U+FFFF), and since char is a UTF-16 code unit, and since there is a 1 to 1 mapping between BMP and UTF-16, this makes our job a little easier.
Here is the code:
public final class MSUnescaper
{
    private static final char PERCENT = '%';
    private static final char NONSTANDARD_PCT_ESCAPE = 'u';

    private MSUnescaper()
    {
    }

    public static String unescape(final String input)
    {
        final StringBuilder sb = new StringBuilder(input.length());
        final CharBuffer buf = CharBuffer.wrap(input);
        char c;
        while (buf.hasRemaining()) {
            c = buf.get();
            if (c != PERCENT) {
                sb.append(c);
                continue;
            }
            if (!buf.hasRemaining())
                throw new IllegalArgumentException();
            c = buf.get();
            sb.append(c == NONSTANDARD_PCT_ESCAPE
                ? msEscape(buf) : standardEscape(buf, c));
        }
        return sb.toString();
    }

    private static char standardEscape(final CharBuffer buf, final char c)
    {
        if (!buf.hasRemaining())
            throw new IllegalArgumentException();
        final char[] array = { c, buf.get() };
        return (char) Integer.parseInt(new String(array), 16);
    }

    private static char msEscape(final CharBuffer buf)
    {
        if (buf.remaining() < 4)
            throw new IllegalArgumentException();
        final char[] array = new char[4];
        buf.get(array);
        return (char) Integer.parseInt(new String(array), 16);
    }

    public static void main(final String... args)
    {
        final String input = "AT%26amp%3BT%20Network%20Client%20%u2013%20IBM";
        System.out.println(unescape(input));
    }
}
Output:
AT&amp;T Network Client – IBM
String str = "AT%26amp%3BT%20Network%20Client%20%u2013%20IBM"
I think this string is invalid: the %u2013 is not a valid escape sequence. If you remove the u from your string, you can decode it.
For reference: the w3schools page on HTML URL encoding.

Decode a string in Java

How do I properly decode the following string in Java
http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru
When I use URLDecoder.decode() I get the following error
java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u0"
Thanks,
Dave
According to Wikipedia, "there exist a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value".
Continuing: "This behavior is not specified by any RFC and has been rejected by the W3C".
Your URL contains such tokens, and the Java URLDecoder implementation doesn't support those.
%uXXXX encoding is non-standard and was actually rejected by the W3C, so it's natural that URLDecoder does not understand it.
You can write a small function which fixes it by replacing each occurrence of %uXXYY with %XX%YY in your encoded string, as sketched below. Then you can proceed and decode the fixed string normally.
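A rough sketch of that rewrite (the regex is an assumption; see the answers below for the edge cases around mixing it with plain %XX escapes):

// Rewrite non-standard %uXXYY escapes as two standard %XX escapes.
// Note: if the string also contains ordinary single-byte %XX escapes,
// they need widening to %00%XX before a UTF-16 decode - see below.
static String fixNonStandard(String s) {
    return s.replaceAll("%u([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})", "%$1%$2");
}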
We started with Vartec's solution but found additional issues. This solution works for UTF-16, but it can be changed to return UTF-8. The replaceAll calls are left in for clarity, and you can read more at http://www.cogniteam.com/wiki/index.php?title=DecodeEncodeJavaScript
static public String unescape(String escaped) throws UnsupportedEncodingException
{
    // This code is needed so that the UTF-16 won't be malformed
    String str = escaped.replaceAll("%0", "%u000");
    str = str.replaceAll("%1", "%u001");
    str = str.replaceAll("%2", "%u002");
    str = str.replaceAll("%3", "%u003");
    str = str.replaceAll("%4", "%u004");
    str = str.replaceAll("%5", "%u005");
    str = str.replaceAll("%6", "%u006");
    str = str.replaceAll("%7", "%u007");
    str = str.replaceAll("%8", "%u008");
    str = str.replaceAll("%9", "%u009");
    str = str.replaceAll("%A", "%u00A");
    str = str.replaceAll("%B", "%u00B");
    str = str.replaceAll("%C", "%u00C");
    str = str.replaceAll("%D", "%u00D");
    str = str.replaceAll("%E", "%u00E");
    str = str.replaceAll("%F", "%u00F");
    // Here we split each 4-hex-digit escape into two 2-digit pieces, so that decode won't fail
    String[] arr = str.split("%u");
    Vector<String> vec = new Vector<String>();
    if (!arr[0].isEmpty()) {
        vec.add(arr[0]);
    }
    for (int i = 1; i < arr.length; i++) {
        if (!arr[i].isEmpty()) {
            vec.add("%" + arr[i].substring(0, 2));
            vec.add("%" + arr[i].substring(2));
        }
    }
    str = "";
    for (String string : vec) {
        str += string;
    }
    // Here we return the decoded string
    return URLDecoder.decode(str, "UTF-16");
}
After having had a good look at the solution presented by ariy, I created a Java-based solution that is also resilient against encoded characters that have been chopped into two parts (i.e. half of the encoded character is missing). This happens in my use case, where I need to decode long URLs that are sometimes chopped at a length of 2000 chars. See What is the maximum length of a URL in different browsers?
public class Utils {
    private static Pattern validStandard = Pattern.compile("%([0-9A-Fa-f]{2})");
    private static Pattern choppedStandard = Pattern.compile("%[0-9A-Fa-f]{0,1}$");
    private static Pattern validNonStandard = Pattern.compile("%u([0-9A-Fa-f][0-9A-Fa-f])([0-9A-Fa-f][0-9A-Fa-f])");
    private static Pattern choppedNonStandard = Pattern.compile("%u[0-9A-Fa-f]{0,3}$");

    public static String resilientUrlDecode(String input) {
        String cookedInput = input;
        if (cookedInput.indexOf('%') > -1) {
            // Transform all existing UTF-8 standard into UTF-16 standard.
            cookedInput = validStandard.matcher(cookedInput).replaceAll("%00%$1");
            // Discard chopped encoded char at the end of the line (there is no way to know what it was)
            cookedInput = choppedStandard.matcher(cookedInput).replaceAll("");
            // Handle non-standard (rejected by W3C) encoding that is used anyway by some
            // See: https://stackoverflow.com/a/5408655/114196
            if (cookedInput.contains("%u")) {
                // Transform all existing non-standard into UTF-16 standard.
                cookedInput = validNonStandard.matcher(cookedInput).replaceAll("%$1%$2");
                // Discard chopped encoded char at the end of the line
                cookedInput = choppedNonStandard.matcher(cookedInput).replaceAll("");
            }
        }
        try {
            return URLDecoder.decode(cookedInput, "UTF-16");
        } catch (UnsupportedEncodingException e) {
            // Will never happen because the encoding is hardcoded
            return null;
        }
    }
}

Binary to text in Java

I have a String with binary data in it (1110100). I want to get the text out so I can print it (1110100 would print "t"). I tried this; it is similar to what I used to transform my text to binary, but it's not working at all:
public static String toText(String info) throws UnsupportedEncodingException {
    byte[] encoded = info.getBytes();
    String text = new String(encoded, "UTF-8");
    System.out.println("print: " + text);
    return text;
}
Any corrections or suggestions would be much appreciated.
Thanks!
You can use Integer.parseInt with a radix of 2 (binary) to convert the binary string to an integer:
int charCode = Integer.parseInt(info, 2);
Then if you want the corresponding character as a string:
String str = new Character((char)charCode).toString();
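Checking that against the question's input: 1110100 in base 2 is 116, which is the code point of 't'.

int charCode = Integer.parseInt("1110100", 2); // 116
System.out.println((char) charCode);           // prints: t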
This is mine (working fine on Java 8):
String input = "01110100"; // Binary input as String
StringBuilder sb = new StringBuilder(); // Some place to store the chars

Arrays.stream( // Create a Stream
    input.split("(?<=\\G.{8})") // Split the input string into 8-char sections (a char is 8 bits = 1 byte)
).forEach(s -> // Go through each 8-char section...
    sb.append((char) Integer.parseInt(s, 2)) // ...and turn it into an int and then a char
);

String output = sb.toString(); // Output text ("t")
and the compressed method printing to console:
Arrays.stream(input.split("(?<=\\G.{8})")).forEach(s -> System.out.print((char) Integer.parseInt(s, 2)));
System.out.print('\n');
I am sure there are "better" ways to do this but this is the smallest one you can probably get.
I know the OP stated that their binary was in a String format, but for the sake of completeness I thought I would add a solution to convert directly from a byte[] to an alphabetic String representation.
As casablanca stated, you basically need to obtain the numerical representation of the alphabetic character. If you are trying to convert anything longer than a single character, it will probably come as a byte[]; instead of converting that to a string and then using a for loop to append the characters of each byte, you can use ByteBuffer and CharBuffer to do the lifting for you:
public static String bytesToAlphabeticString(byte[] bytes) {
    CharBuffer cb = ByteBuffer.wrap(bytes).asCharBuffer();
    return cb.toString();
}
N.B. This interprets the byte array as UTF-16 (two bytes per char, big-endian by default).
Alternatively using the String constructor:
String text = new String(bytes, 0, bytes.length, "ASCII");
public static String binaryToText(String binary) {
    return Arrays.stream(binary.split("(?<=\\G.{8})")) // regex to split the bit string into groups of 8
        .parallel()
        .map(eightBits -> (char) Integer.parseInt(eightBits, 2))
        .collect(
            StringBuilder::new,
            StringBuilder::append,
            StringBuilder::append
        ).toString();
}
Here is the answer.
private String[] splitByNumber(String s, int size) {
    return s.split("(?<=\\G.{" + size + "})");
}
The other way around (where "info" is the input text and "s" is the binary version of it):
byte[] bytes = info.getBytes();
BigInteger bi = new BigInteger(bytes);
String s = bi.toString(2);
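One caveat worth adding here (a hedged note): BigInteger.toString(2) drops leading zero bits, so "s" will usually not be a multiple of 8 characters and won't split cleanly back into bytes. Left-pad it first:

// Pad the bit string to a multiple of 8 so an 8-char splitter works
// (String.repeat assumes Java 11+).
int pad = (8 - s.length() % 8) % 8;
String padded = "0".repeat(pad) + s;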
Look at the parseInt function. You may also need a cast and the Character.toString function.
You can also use an alternative solution without streams and regular expressions (based on casablanca's answer):
public static String binaryToText(String binaryString) {
    StringBuilder stringBuilder = new StringBuilder();
    int charCode;
    for (int i = 0; i < binaryString.length(); i += 8) {
        charCode = Integer.parseInt(binaryString.substring(i, i + 8), 2);
        String returnChar = Character.toString((char) charCode);
        stringBuilder.append(returnChar);
    }
    return stringBuilder.toString();
}
You just need to append each decoded character, as a string, to the resulting character sequence.

Converting UTF-8 to ISO-8859-1 in Java

I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, there are a few characters that are not displayed correctly, such as “, – and ’ (they display as ?).
Is it possible to convert these characters from UTF-8 to ISO-8859-1?
Here is a snippet of code I have written to attempt this:
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = br.readLine()) != null) {
    sb.append(line);
}
br.close();
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with
byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");
I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.
I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using escape sequences, as shown here:
public final class HtmlEncoder {
    private HtmlEncoder() {}

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (int i = 0; i < sequence.length(); i++) {
            char ch = sequence.charAt(i);
            if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append(ch);
            } else {
                int codepoint = Character.codePointAt(sequence, i);
                // handle supplementary range chars
                i += Character.charCount(codepoint) - 1;
                // emit entity
                out.append("&#x");
                out.append(Integer.toHexString(codepoint));
                out.append(";");
            }
        }
        return out;
    }
}
Example usage:
String foo = "This is Cyrillic Ya: \u044F\n"
    + "This is fraktur G: \uD835\uDD0A\n"
    + "This is a smart quote: \u201C";
StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());
Above, the character LEFT DOUBLE QUOTATION MARK (U+201C “) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.
Care needs to be taken with this approach. If your text needs to be escaped for HTML, that needs to be done before the above code or the ampersands end up being escaped.
Depending on your default encoding, the following lines could cause a problem:
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
In Java, String/char is always UTF-16. A specific encoding is only involved when you convert the characters to bytes. Say your default encoding is UTF-8: the latin1 buffer is then treated as UTF-8, some Latin-1 byte sequences form invalid UTF-8 sequences, and you will get ?.
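A minimal sketch of that pitfall (it assumes a UTF-8 platform default for the "wrong" branch; the explicit-charset constructor is the same fix the last answer below gives):

import java.nio.charset.StandardCharsets;

public class CharsetPitfall {
    public static void main(String[] args) {
        byte[] latin1 = "café".getBytes(StandardCharsets.ISO_8859_1); // 'é' -> single byte 0xE9
        String wrong = new String(latin1); // decoded with the platform default charset
        String right = new String(latin1, StandardCharsets.ISO_8859_1);
        // On a UTF-8 default, the lone 0xE9 byte is an invalid sequence,
        // so "wrong" ends in the replacement character instead of 'é'.
        System.out.println(wrong); // likely "caf" + replacement character
        System.out.println(right); // always "café"
    }
}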
With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):
public final class HtmlEncoder {
    private HtmlEncoder() {
    }

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
            int codePoint = iterator.nextInt();
            if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append((char) codePoint);
            } else {
                out.append("&#x");
                out.append(Integer.toHexString(codePoint));
                out.append(";");
            }
        }
        return out;
    }
}
When you instantiate your String object, you need to indicate which encoding to use.
So replace:
return new String(latin1);
by
return new String(latin1, "ISO-8859-1");
