Windows-1251 to UTF-8 codes - java

I have the code of a character in the Windows-1251 code table.
How can I get the code of this character in the UTF-8 (Unicode) code table?
For example, I have the character 'А', whose Windows-1251 code is 192; the corresponding Unicode code point is 1040.
How can I initialize a Character or char in Java from code 192 of the Windows-1251 code table?
char c = (char) 192; // how to specify the encoding?

To convert a byte[] from one character encoding to another you can do:
public static byte[] convertEncoding(byte[] bytes, String from, String to)
        throws UnsupportedEncodingException {
    return new String(bytes, from).getBytes(to);
}
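For the specific case in the question, a minimal sketch (my own example, assuming the value 192 really is a single Windows-1251 byte) is to decode that byte with the windows-1251 charset and read back the Unicode code point:
import java.nio.charset.Charset;

public class Win1251Demo {
    public static void main(String[] args) {
        byte[] cp1251 = { (byte) 192 };                            // 'А' in Windows-1251
        String s = new String(cp1251, Charset.forName("windows-1251"));
        char c = s.charAt(0);                                      // 'А'
        System.out.println((int) c);                               // 1040 (U+0410)
    }
}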

Related

Java String to UCS2 encoding for Letters with Accents

I have a requirement to encode a String that contains foreign characters, e.g. letters with accents, as UCS-2, and the following piece of code works for normal English letters.
String encodeAsUCS2(String test) throws UnsupportedEncodingException {
    byte[] bytes = test.getBytes("UTF-16BE");
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
        sb.append(String.format("%02X", b));
    }
    return sb.toString();
}
That outputs the hexadecimal sequence of UCS-2/UTF-16 bytes,
e.g. hello = 00680065006C006C006F.
It runs into an issue with letters that have accents/foreign characters: they come out as FFFD, which is in the Specials block and is used to indicate that a system was unable to decode a stream of data to a correct symbol.
Any work around for this?
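For reference, a quick self-contained test (my own sketch, not part of the question) suggests the UTF-16BE step itself handles accented letters fine; if FFFD appears in the output, the replacement character U+FFFD was most likely already present in the input String, i.e. it was decoded with the wrong charset before reaching this method:
import java.io.UnsupportedEncodingException;

public class Ucs2Demo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // 'é' is U+00E9, so the expected UCS-2 hex is 00E9
        String input = "héllo";
        byte[] bytes = input.getBytes("UTF-16BE");
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        System.out.println(sb);   // 006800E9006C006C006F
    }
}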

Can any character be encoded in UTF-16 (using Java 8)

Can any character be encoded in UTF-16 (using Java)?
I thought it could, but my code that encodes as
CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder();
ByteBuffer bb = encoder.encode(CharBuffer.wrap((String) value + '\0'));
has thrown a CharacterCodingException.
Unfortunately, as this only occurred for a customer and not for me, I don't have details of the offending character.
There are possible values of char that are not valid UTF-16 sequences. For example:
CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder();
ByteBuffer bb = encoder.encode(CharBuffer.wrap("\uDFFF"));
This code will throw an exception. U+DFFF is an unpaired surrogate.
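If throwing is not acceptable, one possible mitigation (a sketch of my own, not from the original answer) is to configure the encoder to replace malformed or unmappable input instead of reporting an error:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class LenientUtf16Encode {
    public static void main(String[] args) throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)       // unpaired surrogates are malformed input
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bb = encoder.encode(CharBuffer.wrap("\uDFFF")); // no exception; the surrogate is replaced
        System.out.println(bb.remaining());                        // length of the replacement sequence
    }
}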

How to convert a string UTF-8 to ANSI in java?

I have a string in UTF-8 format. I want to convert it to clean ANSI format. How do I do that?
You could use a Java function like the one below to convert from UTF-8 to ISO-8859-1 (which is close to being a subset of the Windows 'ANSI' code page):
private static String convertFromUtf8ToIso(String s1) {
    if (s1 == null) {
        return null;
    }
    // Encoding to ISO-8859-1 replaces characters outside the charset with '?'
    byte[] b = s1.getBytes(StandardCharsets.ISO_8859_1);
    return new String(b, StandardCharsets.ISO_8859_1);
}
Here is a simple test:
String s1 = "your utf8 stringáçﬠ";
String res = convertFromUtf8ToIso(s1);
System.out.println(res);
This prints out:
your utf8 stringáç?
The ﬠ character gets lost because it cannot be represented with ISO_8859_1 (it has 3 bytes when encoded in UTF-8). ISO_8859_1 can represent á and ç.
You can do something like this:
new String("your utf8 string".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
Viewed this way, each multi-byte UTF-8 sequence turns into several single-byte 'ANSI' characters.
Converting UTF-8 to ANSI is not possible in general, because an ANSI code page is a single-byte encoding with at most 256 characters, while UTF-8 can encode the entire Unicode range (code points take up to 4 bytes). It's like converting long to int: in most cases you lose information.
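If you want the loss to be explicit and under your control rather than relying on getBytes silently substituting '?', a sketch along these lines could be used (my own example; windows-1252 is assumed here as a typical 'ANSI' code page):
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class ToAnsiDemo {
    public static void main(String[] args) throws CharacterCodingException {
        String input = "your utf8 stringáçﬠ";
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)  // ﬠ becomes '?'
                .onMalformedInput(CodingErrorAction.REPLACE);
        ByteBuffer bb = encoder.encode(CharBuffer.wrap(input));
        byte[] ansi = new byte[bb.remaining()];
        bb.get(ansi);
        System.out.println(new String(ansi, Charset.forName("windows-1252")));
        // prints: your utf8 stringáç?
    }
}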

byte[] to String {100,25,28,-122,-26,94,-3,-26}

How can I convert this byte[] to a String:
byte[] mytest = new byte[] {100, 25, 28, -122, -26, 94, -3, -26};
I get this: "d��^�" when I use:
new String(mytest, "UTF-8")
Here is the Java code for creating the key:
m_key = new javax.crypto.spec.SecretKeySpec(new byte[] {100, 25, 28, -122, -26, 94, -3, -26}, "DES");
Thanks.
In order to decode the byte array into something like ASCII, you need to know its original encoding. Otherwise you would need to treat it as binary.
Note: Base64 is intended for transferring binary data across networks.
I would suggest Base64 encoding your byte array. Then in your PHP code decoding the Base64 string back into a UTF-8 string.
In Java, here's how to Base64 encode your byte array and then decode it back to UTF-8:
import org.apache.commons.codec.binary.Base64;
public class MyTest {
    public static void main(String[] args) throws Exception {
        byte[] byteArray = new byte[] {100, 25, 28, -122, -26, 94, -3, -26};
        System.out.println("To UTF-8 string: " + new String(byteArray, "UTF-8"));
        byte[] base64 = Base64.encodeBase64(byteArray);
        System.out.println("To Base64 string: " + new String(base64, "UTF-8"));
        byte[] decoded = Base64.decodeBase64(base64);
        System.out.println("Back to UTF-8 string: " + new String(decoded, "UTF-8"));
        /* the decoded byte array is the same as the original byte array */
        for (int i = 0; i < decoded.length; i++) {
            assert byteArray[i] == decoded[i];
        }
    }
}
The output from the above code is:
To UTF-8 string: d��^�
To Base64 string: ZBkchuZe/eY=
Back to UTF-8 string: d��^�
So if you wanted to use the same binary data in your PHP code, cut and paste the Base64 string into your PHP code and decode it back to UTF-8. Something like this:
<?php
$str = 'ZBkchuZe/eY=';
$key = base64_decode($str);
echo $key;
?>
I don't code in PHP, but you should be able to decode Base64 using this method:
http://php.net/manual/en/function.base64-decode.php
The above code should echo back the original binary data as UTF-8 (albeit with funny characters). The point is that the funny-looking string in the $key variable is representing the same binary data you had in the Java byte array:
d��^�
You should be able to pass the $key variable into your PHP encryption method.
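As a side note, on Java 8 or later the commons-codec dependency is not strictly needed; here is a minimal sketch using the built-in java.util.Base64 (my own variant, not the original answer's code):
import java.util.Base64;

public class Base64Builtin {
    public static void main(String[] args) {
        byte[] byteArray = { 100, 25, 28, -122, -26, 94, -3, -26 };
        String base64 = Base64.getEncoder().encodeToString(byteArray);
        System.out.println(base64);                      // ZBkchuZe/eY=
        byte[] decoded = Base64.getDecoder().decode(base64);
        // decoded contains the same bytes as byteArray
    }
}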
The way you are doing it makes no sense, IMO. You are creating a new String with the byte[] as an argument, but that constructor is not meant to parse arbitrary binary data, so what you end up with is a lot of junk. A little googling got me this: http://www.mkyong.com/java/how-do-convert-byte-array-to-string-in-java/
Would m_key.getEncoded() give you the desired result?
Javadocs - SecretKeySpec
If not, you have to identify the key provider that was used for the encoding (which produced the byte array you have now) and decode accordingly.

Converting from Windows 1252 to UTF8 in Java: null characters with CharsetDecoder/Encoder

I know it's a very general question, but I'm going mad.
I used this code:
String ucs2Content = new String(bufferToConvert, inputEncoding);
byte[] outputBuf = ucs2Content.getBytes(outputEncoding);
return outputBuf;
But I read that it is better to use CharsetDecoder and CharsetEncoder (I have content with some characters that are probably outside the destination encoding). I've just written this code, but it has some problems:
// Create the encoder and decoder for Win1252
Charset charsetInput = Charset.forName(inputEncoding);
CharsetDecoder decoder = charsetInput.newDecoder();
Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();
// Convert the byte array from starting inputEncoding into UCS2
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));
// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
return bbuf.array();
Indeed, this code appends a sequence of null characters to the buffer!
Could someone tell me where the problem is? I'm not so skilled with encoding conversion in Java.
Is there a better way to convert encodings in Java?
Your problem is that ByteBuffer.array() returns a direct reference to the array used as the backing store for the ByteBuffer, not a copy of the backing array's valid range. You have to obey bbuf.limit() (as Peter did in his answer) and only use the array content from index 0 to bbuf.limit() - 1.
The reason for the extra 0 values in the backing array is a slight flaw in how the resulting ByteBuffer is created by the CharsetEncoder. Each CharsetEncoder has an "average bytes per character", which for the UCS-2 encoder seems to be simple and correct (2 bytes/char). Obeying this fixed value, the CharsetEncoder initially allocates a ByteBuffer with "string length * average bytes per character" bytes, in this case e.g. 20 bytes for a 10-character string. The UCS-2 CharsetEncoder, however, starts with a BOM (byte order mark), which also occupies 2 bytes, so only 9 of the 10 characters fit in the allocated ByteBuffer. The CharsetEncoder detects the overflow and allocates a new ByteBuffer with a length of 2*n+1 (n being the original length of the ByteBuffer), in this case 2*20+1 = 41 bytes. Since only 2 of the 21 new bytes are required to encode the remaining character, the array you get from bbuf.array() will have a length of 41 bytes, but bbuf.limit() will indicate that only the first 22 entries are actually used.
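In code, a safe way to extract exactly the encoded bytes (a small sketch based on the explanation above) is to copy only the buffer's valid region instead of returning bbuf.array() directly:
import java.nio.ByteBuffer;

static byte[] toByteArray(ByteBuffer bbuf) {
    // Copy only the valid region [position, limit) instead of using bbuf.array(),
    // which may contain unused trailing bytes from the encoder's over-allocation.
    byte[] result = new byte[bbuf.remaining()];
    bbuf.get(result);
    return result;
}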
I am not sure how you get a sequence of null characters. Try this:
String inputEncoding = "windows-1252";
String outputEncoding = "UTF-8";
Charset charsetInput = Charset.forName(inputEncoding);
CharsetDecoder decoder = charsetInput.newDecoder();
Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();
// Convert the byte array from the starting inputEncoding into UCS2
byte[] bufferToConvert = "Hello World! £€".getBytes(charsetInput);
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));
// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
System.out.println(new String(bbuf.array(), 0, bbuf.limit(), charsetOutput));
prints
Hello World! £€
