Can any character be encoded in UTF-16 (using Java)?
I thought it could, but my code, which encodes as
CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder();
ByteBuffer bb = encoder.encode(CharBuffer.wrap((String) value + '\0'));
has thrown a CharacterCodingException.
Unfortunately, as this only occurred for a customer and not for me, I don't have details of the offending character.
There are possible values of char that are not valid UTF-16 sequences. For example:
CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder();
ByteBuffer bb = encoder.encode(CharBuffer.wrap("\uDFFF"));
This code will throw an exception. U+DFFF is an unpaired surrogate.
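If you need to find the offending character in customer data, or would rather substitute it than fail, something along these lines may help. This is only a sketch: the class and method names (SurrogateCheck, firstUnpairedSurrogate, encodeLenient) are made up for illustration, and it assumes that replacing bad input is acceptable for your use case.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class SurrogateCheck {

    // Returns the index of the first unpaired surrogate in s, or -1 if there is none.
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair, skip the low surrogate
                } else {
                    return i; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate without a preceding high surrogate
            }
        }
        return -1;
    }

    // Encodes to UTF-16LE, substituting the encoder's replacement bytes
    // for malformed input instead of throwing.
    static byte[] encodeLenient(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_16LE.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            ByteBuffer bb = encoder.encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions, but encode() declares it.
            throw new IllegalStateException(e);
        }
    }
}
```

Logging the index returned by firstUnpairedSurrogate before encoding would at least tell you which character caused the exception on the customer's machine.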
Related
I have a string in UTF-8 format. I want to convert it to clean ANSI format. How to do that?
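You could use a Java function like this one to convert from UTF-8 to ISO_8859_1 (whose printable characters are a subset of Windows-1252, the code page usually meant by "ANSI"):
private static String convertFromUtf8ToIso(String s1) {
    if (s1 == null) {
        return null;
    }
    // Decode with the same charset that produced the bytes;
    // omitting it would fall back to the platform default charset.
    String s = new String(s1.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    // getBytes replaces characters outside ISO-8859-1 with '?'.
    byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
    return new String(b, StandardCharsets.ISO_8859_1);
}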
Here is a simple test:
String s1 = "your utf8 stringáçﬠ";
String res = convertFromUtf8ToIso(s1);
System.out.println(res);
This prints out:
your utf8 stringáç?
The ﬠ character gets lost because it cannot be represented in ISO_8859_1, which only covers code points U+0000 to U+00FF (ﬠ is U+FB20 and takes 3 bytes when encoded in UTF-8). ISO_8859_1 can represent á and ç.
You can do something like this:
new String("your utf8 string".getBytes(Charset.forName("utf-8")));
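Note that this decodes the UTF-8 bytes with the platform default charset, so where the default is a single-byte ANSI code page, each byte of a multi-byte UTF-8 sequence becomes a separate character.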
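Converting UTF-8 to ANSI is not possible in general, because an ANSI code page is a single-byte encoding with at most 256 characters (ASCII has only 128, i.e. 7 bits), whereas UTF-8 uses up to 4 bytes per character and can encode all of Unicode. That's like converting long to int: you lose information in most cases.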
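To make the information loss visible before it happens, you can ask the target charset whether it can represent the text at all. A minimal sketch, assuming that "ANSI" means the Windows-1252 code page; the class and method names (AnsiConversion, fitsInAnsi, utf8ToAnsi) are invented for this example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AnsiConversion {

    // Assumption: "ANSI" means the Windows-1252 code page.
    static final Charset ANSI = Charset.forName("windows-1252");

    // True if every character of s has a Windows-1252 representation.
    static boolean fitsInAnsi(String s) {
        return ANSI.newEncoder().canEncode(s);
    }

    // Re-encodes UTF-8 bytes as Windows-1252.
    // String.getBytes(Charset) replaces unmappable characters with '?'.
    static byte[] utf8ToAnsi(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(ANSI);
    }
}
```

Calling fitsInAnsi first lets you decide whether to convert, reject the input, or transliterate it some other way.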
What is the most time efficient way to create and Base32 encode a random UUID in Java? I would like to use Base32 encoding to store globally unique IDs that are usable in URLs.
Base32 still pads with the = character, so you'll need to do something with that if you really want to avoid URL escaping.
If you really want to avoid Base16, I recommend you use Base64 instead of Base32. If you want to use an RFC standard, try base64url. However, that standard also uses "=" for the trailing padding, so you need to escape that. Its substitutions are:
+ -> -
/ -> _
= -> =
Personally, I use a variant called Y64. Its substitutions are:
+ -> .
/ -> _
= -> -
It's not an RFC standard, but at least you don't have to worry about escaping the trailing "=".
Apache Commons Codec provides both Base64 and Base32. Here's an example with Base64 using the Y64 variant.
To encode:
UUID uuid = UUID.randomUUID();
ByteBuffer uuidBuffer = ByteBuffer.allocate(16);
LongBuffer longBuffer = uuidBuffer.asLongBuffer();
longBuffer.put(uuid.getMostSignificantBits());
longBuffer.put(uuid.getLeastSignificantBits());
// Commons Codec's static helper is Base64.encodeBase64(byte[])
String encoded = new String(Base64.encodeBase64(uuidBuffer.array()),
        Charset.forName("US-ASCII"));
encoded = encoded.replace('+', '.')
        .replace('/', '_')
        .replace('=', '-');
And decode:
String encoded; // from your request parameters or whatever
encoded = encoded.replace('.', '+')
        .replace('_', '/')
        .replace('-', '=');
// Commons Codec's static helper is Base64.decodeBase64(byte[])
ByteBuffer uuidBuffer = ByteBuffer.wrap(Base64.decodeBase64(
        encoded.getBytes(Charset.forName("US-ASCII"))));
LongBuffer longBuffer = uuidBuffer.asLongBuffer();
UUID uuid = new UUID(longBuffer.get(), longBuffer.get());
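On Java 8 or later you may not need a third-party codec at all: java.util.Base64 has a URL-safe encoder whose padding can be switched off, which sidesteps the "=" escaping problem. A rough sketch, with a made-up class name (UuidBase64Url):

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

final class UuidBase64Url {

    // Encodes the 16 UUID bytes as 22 URL-safe characters (no '=' padding).
    static String encode(UUID uuid) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        return Base64.getUrlEncoder().withoutPadding().encodeToString(buf.array());
    }

    // Reverses encode(); the JDK decoder does not require padding.
    static UUID decode(String encoded) {
        ByteBuffer buf = ByteBuffer.wrap(Base64.getUrlDecoder().decode(encoded));
        return new UUID(buf.getLong(), buf.getLong());
    }
}
```

The output uses only A-Z, a-z, 0-9, "-" and "_", so it can go into a URL path or query string without escaping.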
I agree with the discussion that Base64Url might be more suitable, but I also see the benefit of generating unique values in Base32. It's case-insensitive, which makes it easier to handle when humans are involved.
This is my code to convert UUID to Base32 using Guava's BaseEncoding.
public static String toBase32(UUID uuid) {
    ByteBuffer bb = ByteBuffer.wrap(new byte[16]);
    bb.putLong(uuid.getMostSignificantBits());
    bb.putLong(uuid.getLeastSignificantBits());
    return BaseEncoding.base32().omitPadding().encode(bb.array());
}
The codec Base32Codec can encode UUIDs efficiently to base-32.
// Returns a base-32 string
// uuid::: 01234567-89AB-4DEF-A123-456789ABCDEF
// base32: aerukz4jvng67ijdivtytk6n54
String string = Base32Codec.INSTANCE.encode(uuid);
There are codecs for other encodings in the same package of uuid-creator.
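If you'd rather not pull in a library, unpadded lowercase Base32 with the RFC 4648 alphabet is short enough to write by hand. This is just a sketch with an invented class name (UuidBase32), not the library's implementation; for the UUID in the comment above it produces the same 26-character string.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class UuidBase32 {

    // RFC 4648 Base32 alphabet, lowercased.
    private static final char[] ALPHABET =
            "abcdefghijklmnopqrstuvwxyz234567".toCharArray();

    // Encodes arbitrary bytes as Base32 without '=' padding.
    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder((data.length * 8 + 4) / 5);
        int buffer = 0; // bits read but not yet emitted (only the low 'bits' bits matter)
        int bits = 0;   // number of pending bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                sb.append(ALPHABET[(buffer >>> (bits - 5)) & 0x1F]);
                bits -= 5;
            }
        }
        if (bits > 0) {
            // Left-align the remaining bits and pad with zeros on the right.
            sb.append(ALPHABET[(buffer << (5 - bits)) & 0x1F]);
        }
        return sb.toString();
    }

    // Encodes the 16 bytes of a UUID, most significant bits first.
    static String encode(UUID uuid) {
        ByteBuffer bb = ByteBuffer.allocate(16);
        bb.putLong(uuid.getMostSignificantBits());
        bb.putLong(uuid.getLeastSignificantBits());
        return encode(bb.array());
    }
}
```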
I have the code of a character in the Windows-1251 code table.
How can I get the code of this character in the UTF-8 code table?
For example, the character 'А' has code 192 in Windows-1251; the corresponding UTF-8 code is 1040.
How can I initialize a Character or char in Java with code 192 from the Windows-1251 code table?
char c = (char)192; //how to specify the encoding ?
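To convert a byte[] encoded in one character encoding to another, you can do:
public static byte[] convertEncoding(byte[] bytes, String from, String to)
        throws UnsupportedEncodingException {
    // Decode with the source charset, then re-encode with the target charset.
    return new String(bytes, from).getBytes(to);
}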
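For the single-character case in the question, the same idea can be sketched as below. Note that 1040 is the Unicode code point of 'А' (U+0410), which is also the value of the Java char; its UTF-8 encoding is the two bytes 0xD0 0x90. The class and method names (Cp1251Demo, fromCp1251, cp1251ToUtf8) are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1251Demo {

    static final Charset CP1251 = Charset.forName("windows-1251");

    // Decodes a single Windows-1251 code (0..255) to a Java char.
    static char fromCp1251(int code) {
        return new String(new byte[] { (byte) code }, CP1251).charAt(0);
    }

    // Returns the UTF-8 bytes of the character with the given Windows-1251 code.
    static byte[] cp1251ToUtf8(int code) {
        return String.valueOf(fromCp1251(code)).getBytes(StandardCharsets.UTF_8);
    }
}
```

So char c = Cp1251Demo.fromCp1251(192); gives the Cyrillic 'А', and (int) c is 1040.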
I know it's a very general question, but I'm going mad.
I used this code:
String ucs2Content = new String(bufferToConvert, inputEncoding);
byte[] outputBuf = ucs2Content.getBytes(outputEncoding);
return outputBuf;
But I read that it is better to use CharsetDecoder and CharsetEncoder (my content probably contains some characters outside the destination encoding). I've just written this code, but it has some problems:
// Create the encoder and decoder for Win1252
Charset charsetInput = Charset.forName(inputEncoding);
CharsetDecoder decoder = charsetInput.newDecoder();
Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();
// Convert the byte array from starting inputEncoding into UCS2
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));
// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
return bbuf.array();
In fact, this code appends a sequence of null characters to the buffer!
Could someone tell me where is the problem? I'm not so skilled with encoding conversion in Java.
Is there a better way to convert encoding in Java?
Your problem is that ByteBuffer.array() returns a direct reference to the array used as backing store for the ByteBuffer and not a copy of the backing array's valid range. You have to obey bbuf.limit() (as Peter did in his response) and just use the array content from index 0 to bbuf.limit()-1.
The reason for the extra 0 values in the backing array is a slight flaw in how the resulting ByteBuffer is created by the CharsetEncoder. Each CharsetEncoder has an "average bytes per character" value, which for the UCS2 encoder seems to be simple and correct (2 bytes/char). Obeying this fixed value, the CharsetEncoder initially allocates a ByteBuffer with "string length * average bytes per character" bytes, in this case e.g. 20 bytes for a 10-character string. The UCS2 CharsetEncoder, however, starts with a BOM (byte order mark), which also occupies 2 bytes, so that only 9 of the 10 characters fit in the allocated ByteBuffer. The CharsetEncoder detects the overflow and allocates a new ByteBuffer with a length of 2*n+1 (n being the original length of the ByteBuffer), in this case 2*20+1 = 41 bytes. Since only 2 of the 21 new bytes are required to encode the remaining character, the array you get from bbuf.array() will have a length of 41 bytes, but bbuf.limit() will indicate that only the first 22 entries are actually used.
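In code, obeying the limit could look roughly like this. It is only a sketch; EncodedBytes, validBytes and encode are hypothetical names, not part of the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

final class EncodedBytes {

    // Copies only the valid range [position, limit) of the buffer,
    // ignoring any unused tail of the backing array.
    static byte[] validBytes(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate() leaves the caller's position untouched
        return out;
    }

    // Encodes text with a CharsetEncoder and returns exactly the bytes written.
    static byte[] encode(String text, Charset charset) {
        try {
            ByteBuffer buf = charset.newEncoder().encode(CharBuffer.wrap(text));
            return validBytes(buf);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Cannot encode text as " + charset, e);
        }
    }
}
```

For a 10-character string and the UTF-16 charset this returns 22 bytes (the 2-byte BOM plus 20 bytes of character data), regardless of how large the backing array happens to be.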
I am not sure how you get a sequence of null characters. Try this:
String inputEncoding = "windows-1252";
String outputEncoding = "UTF-8";
Charset charsetInput = Charset.forName(inputEncoding);
CharsetDecoder decoder = charsetInput.newDecoder();
Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();
// Convert the byte array from starting inputEncoding into UCS2
byte[] bufferToConvert = "Hello World! £€".getBytes(charsetInput);
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));
// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
// Only the bytes up to bbuf.limit() are valid
System.out.println(new String(bbuf.array(), 0, bbuf.limit(), charsetOutput));
prints
Hello World! £€
How can I convert so-called "PHP Unicode" to normal characters in Java? Example: \xEF\xBC\xA1 -> A. Are there any built-in methods in the JDK, or should I use a regex for this conversion?
You first need to get the bytes out of the string into a byte-array without changing them and then decode the byte-array as a UTF-8 string.
The simplest way to get the string into a byte array is to encode it using ISO-8859-1, which maps every character with a Unicode value less than 256 to a byte with the same value (or the equivalent negative value, since Java bytes are signed):
String phpUnicode = "\u00EF\u00BC\u00A1";
byte[] bytes = phpUnicode.getBytes("ISO-8859-1"); // maps to bytes with the same ordinal value
String javaString = new String(bytes, "UTF-8");
System.out.println(javaString);
Edit
The above converts the UTF-8 to the Unicode character. If you then want to convert it to a reasonable ASCII equivalent, there's no standard way of doing that: but see this question
Edit
I assumed that you had a string containing characters that had the same ordinal value as the UTF-8 sequence but you indicate that your string literally contains the escape sequence, as in:
String phpUnicode = "\\xEF\\xBC\\xA1";
The JDK doesn't have any built-in methods to convert Strings like this so you'll need to use your own regex. Since we ultimately want to convert a utf-8 byte-sequence into a String, we need to set up a byte-array, using maybe:
Pattern oneChar = Pattern.compile("\\\\x([0-9A-F]{2})|(.)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher matcher = oneChar.matcher(phpUnicode);
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
while (matcher.find()) {
    int ch;
    if (matcher.group(1) == null) {
        ch = matcher.group(2).charAt(0);
    }
    else {
        ch = Integer.parseInt(matcher.group(1), 16);
    }
    bytes.write((int) ch);
}
String javaString = new String(bytes.toByteArray(), "UTF-8");
System.out.println(javaString);
This will generate a UTF-8 stream by converting \xAB sequences. This UTF-8 stream is then converted to a Java string. It's important to note that any character that is not part of an escape sequence will be converted to a byte equivalent to the low-order 8 bits of the Unicode character. This works fine for ASCII but can cause transcoding problems for non-ASCII characters.
@McDowell:
The sequence:
String phpUnicode = "\u00EF\u00BC\u00A1";
byte[] bytes = phpUnicode.getBytes("ISO-8859-1");
creates a byte array containing as many bytes as the original string has characters and for each character with a unicode value below 256, the same numeric value is stored in the byte-array.
The character FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) is not present in the original String, so the fact that it is not in ISO-8859-1 is irrelevant.
I know that transcoding bugs can occur when you convert characters to bytes; that's why I said that ISO-8859-1 would only "map every character with a unicode value less than 256 to a byte with the same value".
The character in question is U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A). The PHP form (\xEF\xBC\xA1) is a UTF-8 encoded octet sequence.
In order to decode this sequence to a Java String (which is always UTF-16), you would use the following code:
// \xEF\xBC\xA1
byte[] utf8 = { (byte) 0xEF, (byte) 0xBC, (byte) 0xA1 };
String utf16 = new String(utf8, Charset.forName("UTF-8"));
// print the char as hex
for (char ch : utf16.toCharArray()) {
    System.out.format("%02x%n", (int) ch);
}
If you want to decode the data from a string literal you could use code of this form:
public static void main(String[] args) {
    String utf16 = transformString("This is \\xEF\\xBC\\xA1 string");
    for (char ch : utf16.toCharArray()) {
        System.out.format("%s %02x%n", ch, (int) ch);
    }
}

// Match runs of \xHH escapes; XDigit restricts HH to hex digits so that
// Integer.parseInt below cannot fail on input such as \xZZ.
private static final Pattern SEQ
        = Pattern.compile("(\\\\x\\p{XDigit}\\p{XDigit})+");

private static String transformString(String encoded) {
    StringBuilder decoded = new StringBuilder();
    Matcher matcher = SEQ.matcher(encoded);
    int last = 0;
    while (matcher.find()) {
        // Copy the literal text before the escape run unchanged.
        decoded.append(encoded.substring(last, matcher.start()));
        byte[] utf8 = toByteArray(encoded.substring(matcher.start(), matcher.end()));
        decoded.append(new String(utf8, Charset.forName("UTF-8")));
        last = matcher.end();
    }
    return decoded.append(encoded.substring(last, encoded.length())).toString();
}

private static byte[] toByteArray(String hexSequence) {
    // Each escape (\xHH) is 4 characters long and yields one byte.
    byte[] utf8 = new byte[hexSequence.length() / 4];
    for (int i = 0; i < utf8.length; i++) {
        int offset = i * 4;
        String hex = hexSequence.substring(offset + 2, offset + 4);
        utf8[i] = (byte) Integer.parseInt(hex, 16);
    }
    return utf8;
}