If I have some binary data D And I convert it to string S. I expect than on converting it back to binary I will get D. But It's wrong.
public class A {
public static void main(String[] args) throws IOException {
final byte[] bytes = new byte[]{-114, 104, -35};// In hex: 8E 68 DD
System.out.println(bytes.length); //prints 3
System.out.println(new String(bytes, "UTF-8").getBytes("UTF-8").length); //prints 7
}
}
Why does this happens?
Converting between a byte array to a String and back again is not a one-to-one mapping operation. Reading the docs, the String implmentation uses the CharsetDecoder to convert the incoming byte array into unicode. The first and last bytes in your input byte array must not map to a valid unicode character, thus it replaces it with some replacement string.
It's likely that the bytes you're converting to a string don't actually form a valid string. If java can't figure out what you mean by each byte, it will attempt to fix them. This means that when you convert back to the byte array, it won't be the same as when you started. If you try with a valid set of bytes, then you should be more successful.
Your data can't be decoded into valid Unicode characters using UTF-8 encoding. Look at decoded string. It consists of 3 characters: 0xFFFD, 0x0068 and 0xFFFD. First and last are "�" - Unicode replacement characters. I think you need to choose other encoding. I.e. "CP866" produces valid string and converts back into same array.
Related
I am converting a String from UTF-8 to CP1047 and then performing hex encoding on it, which works great. Next what I am doing is converting back, using decoding the hex String and displaying it on console in UTF-8 format. Problem is I am not getting the proper String what I passed to encoding method. Below is the piece of code I coded:
public class HexEncodeDecode {
public static void main(String[] args) throws UnsupportedEncodingException,
DecoderException {
String reqMsg = "ISO0150000150800C220000080000000040000050000000215102190000000014041615141800001427690161 0B0 000123450041234";
char[] hexed = getHex(reqMsg, "UTF-8", "Cp1047");
System.out.println(hexed);
System.out.println(getString(hexed));
}
public static char[] getHex(String source, String inputCharacterCoding,
String outputCharacterCoding) throws UnsupportedEncodingException {
return Hex.encodeHex(new String(source.getBytes(inputCharacterCoding),
outputCharacterCoding).getBytes(), false);
}
public static String getString(char[] source) throws DecoderException,
UnsupportedEncodingException {
return new String(Hex.decodeHex(source), Charset.forName("UTF-8"));
}
}
Output I am getting is :
C3B1C3AB7CC290C291C295C290C290C290C290C291C295C290C298C290C290C3A41616C290C290C290C290C290C298C290C290C290C290C290C290C290C290C294C290C290C290C290C290C295C290C290C290C290C290C290C29016C291C295C291C29016C291C299C290C290C290C290C290C290C290C290C291C294C290C294C291C296C291C295C291C294C291C298C290C290C290C290C291C2941604C296C299C290C291C296C291C280C290C3A2C290C280C280C280C280C290C290C290C29116C293C294C295C290C290C294C29116C293C294
ñë|äâ
So, need help in printing the input String back.
Expected output would be:
C3B1C3AB7CC290C291C295C290C290C290C290C291C295C290C298C290C290C3A41616C290C290C290C290C290C298C290C290C290C290C290C290C290C290C294C290C290C290C290C290C295C290C290C290C290C290C290C29016C291C295C291C29016C291C299C290C290C290C290C290C290C290C290C291C294C290C294C291C296C291C295C291C294C291C298C290C290C290C290C291C2941604C296C299C290C291C296C291C280C290C3A2C290C280C280C280C280C290C290C290C29116C293C294C295C290C290C294C29116C293C294
ISO0150000150800C220000080000000040000050000000215102190000000014041615141800001427690161 0B0 000123450041234
new String(source.getBytes(inputCharacterCoding), outputCharacterCoding)
.getBytes()
This probably does not do what you think it does.
First things first: a String has no encoding. Repeat after me: a String has no encoding.
A String is simply a sequence of tokens which aim to represent characters. It just happens that for this purpose Java uses a sequence of chars. They could just as well be carrier pigeons.
UTF8, CP1047 and others are just character codings; two operations can be performed:
encoding: turn a stream of carrier pigeons (chars) into a stream of bytes;
decoding: turn a stream of bytes into a stream of carrier pigeons (chars).
Basically, your base assumption is wrong; you cannot associate an encoding with a String. Your real input should be a byte stream (more often than not a byte array) which you know is the result of a particular encoding (in your case, UTF-8), which you want to re-encode using another charset (in your case, CP1047).
The "secret" behing a real answer here would be the code of your Hex.encodeHex() method but you don't show it, so this is as good an answer that I can muster.
reqMsg no longer has an encoding so it's pointless (and damaging) to try to convert if from UTF-8 to "Cp1047".
If reqMsg is going to be coming from an external source in the future, such as from disk or network, then you will have to decode - perhaps this is where the confusion comes from. Perhaps you'll being doing: UTF-8->Unicode(String)->CP1047->HEX. When you write it to stdout, the HEX will likely to be ASCII encoded.
The follow example creates an ASCII hex string based on your original string after conversion to CP1047 (Unicode->CP1047->HEX):
String reqMsg = "ISO0150000150800C220000080000000040000050000000215102190000000014041615141800001427690161 0B0 000123450041234";
// encode to cp1047 represented as Hex
byte[] reqMsqBytes = reqMsg.getBytes("Cp1047");
char[] hex = Hex.encodeHex(reqMsqBytes);
System.out.println(hex);
// decode
String respMsqBytes = new String(Hex.decodeHex(hex), "Cp1047");
System.out.println(respMsqBytes);
A quick fix (though a little ugly) would be to change getString() to:
public static String getString(char[] source) throws DecoderException, UnsupportedEncodingException {
return new String(new String(Hex.decodeHex(source), Charset.forName("UTF-8")).getBytes("Cp1047"),"UTF-8");
}
As fge already mentioned, you switch transforming between chars and bytes, which are different pairs of shoes. So in this quick solution you first get your hex decode assuming UTF-8, then encoding it to a Cp1047 byte array and finally, decode it back to a String by using the UTF-8 charset.
As I already said, this is simply a quick one-liner workaround and not the cleanest solution, as the error is already done during the hex encoding.
I am trying to encode/decode a ByteArray to String, but input/output are not matching. Am I doing something wrong?
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));
The output is:
130021000061f8f0001a
130021000061efbfbd
Complete code:
String[] arr = {"13", "00", "21", "00", "00", "61", "F8", "F0", "00", "1A"};
byte[] by = new byte[arr.length];
for (int i = 0; i < arr.length; i++) {
by[i] = (byte)(Integer.parseInt(arr[i],16) & 0xff);
}
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));
The problem here is that f8f0001a isn't a valid UTF-8 byte sequence.
First of all, the f8 opening byte denotes a 5 byte sequence and you've only got four. Secondly, f8 can only be followed by a byte of 8x, 9x, ax or bx form.
Therefore it gets replaced with a unicode replacement character (U+FFFD), whose byte sequence in UTF-8 is efbfbd.
And there (rightly) is no guarantee that the conversion of an invalid byte sequence to and from a string will result in the same byte sequence. (Note that even with two, seemingly identical strings, you might get different bytes representing them in Unicode, see Unicode equivalence. )
The moral of the story is: if you want to represent bytes, don't convert them to characters, and if you want to represent text, don't use byte arrays.
My UTF-8 is a bit rusty :-), but the sequence F8 F0 is imho not a valid utf-8 encoding.
Look at http://en.wikipedia.org/wiki/Utf-8#Description.
When you build the String from the array of bytes, the bytes are decoded.
Since the bytes from your code does not represent valid characters, the bytes that finally composes the String are not the same your passed as parameter.
public String(byte[] bytes)
Constructs a new String by decoding the
specified array of bytes using the platform's default charset. The
length of the new String is a function of the charset, and hence may
not be equal to the length of the byte array.
The behavior of this
constructor when the given bytes are not valid in the default charset
is unspecified. The CharsetDecoder class should be used when more
control over the decoding process is required.
I'm having some trouble decoding some encoding char.
What i need to decode is the %E9, i have a string like this D%E9bardeur and degr%E9
What i do in my java class, is the following:
try
{
System.out.println(o);// test
o = URLDecoder.decode((String) o, "UTF-8");
}
catch (UnsupportedEncodingException e)
{
e.printStackTrace();
}
After this operation, what i get is
D�bardeur and degr�
The very same happens when i dont decode to utf-8
Any advice?
thx
%E9 is not UTF-8.
The correct way to decode this would be:
URLDecoder.decode((String) o, "ISO-8859-1")
By %E9, could you mean there is a byte in your string that evaluates to hex E9? Because if so, that flags as "multibyte" in UTF-8, and there are 2 more "continuation bytes" (within the correct range) that follow.
Because remember, UTF-8 is a variable length encoding, so some code points (character values) are represented by 1 byte, some by 2, 3, etc.
If you have a string you're treating as UTF-8 and E9 is encountered, the next 2 bytes need to be in the correct range. For example, in this string, 00, which follows E9 is not a valid continuation byte:
http://hexutf8.com/?q=0x640x650x670x720xe90x00
Here's an example where E9 in a string is followed by the correct 2 bytes:
http://hexutf8.com/?q=0xc20xa90xe90x810xaa
And the appropriate character is represented.
I have a String say String a = "abc";. Now I want to convert it into a byte array say byte b[];, so that when I print b it should show "abc".
How can I do that?
getBytes() method is giving different result.
My program looks like that so far:
String a="abc";
byte b[]=a.getBytes();
what I want is I have two methods made in a class one is
public byte[] encrypt(String a)
and another is
public String decrypt(byte[] b)
after doing encryption i saved the data into database but when i am getting it back then byte methods are not giving the correct output but i got the same data using String method but now I have to pass it into decrypt(byte[] b)
How to do it this is the real scenario.
Well, your first problem is that a String in Java is not an array of bytes, but of chars, where each of them takes 16bit. This is to cover for unicode characters, instead of only ascii that you'd get with bytes. That means that if you use the getBytes method, you won't be able to print the string one array position at a time, since it takes two array positions (two bytes) to represent one character.
What you could do is use getChars and then cast each char to a byte, with the corresponding precision los. This is not a good idea since it won't work outside of normal English characters! You asked, though, so here you go ;)
EDIT: as #PeterLawerey mentions,Unicode characters make it even harder, with some unicode characters needing more than one char. There's a good discussion in StackOverflow and it links to an detailed article from Oracle.
byte b[]=a.getBytes();
System.out.println(new String(b));
You could use this constructor to build your string back again:
String a="abc";
byte b[]=a.getBytes("UTF-8");
System.out.println(new String(b, "UTF-8"));
Other than that, you can't do System.out.println(b) and expect to see abc.
A byte is value between -128 and 127. When you print it, it will be a number by default.
If you want to print it as an ASCII char, you can cast it to a (char)
byte[] bytes = "abc".getBytes();
for(byte b: bytes)
System.out.println((char) b);
prints
a
b
c
It seems like you are implementing encryption and decryption code.
String constructors are for text data. you should not use it to convert byte array
which contains encrypted data to string value.
You should use base64 instead, which encodes any binary data into ASCII.
this one is good public domain one
http://iharder.sourceforge.net/current/java/base64/
base64 apache commons
http://commons.apache.org/codec/download_codec.cgi
String msg ="abc";
byte[] data = Base64.decode(msg);
String convert = Base64.encodeBytes(data);
This will convert "abc" to byte and then the code will print "abc" in respective ASCII code (ie. 97 98 99).
byte a[]=new byte[160];
String s="abc";
a=s.getBytes();
for(int i=0;i<s.length();i++)
{
System.out.print(a[i]+" ");
}
If you add these lines it will again change the ASCII code to String (ie. abc)
String s1=new String(a);
System.out.print("\n"+s1);
Hope it Helpes.
Modified Code:
To send byte array as an argument:
public static void another_method_name(byte b1[])
{
String s1=new String(b1);
System.out.print("\n"+s1);
}
public static void main(String[] args)
{
byte a[]=new byte[160];
String s="abc";
a=s.getBytes();
for(int i=0;i<s.length();i++)
{
System.out.print(a[i]+" ");
}
another_method_name(a);
}
Hope it helps again.
If I use any ASCII characters from 33 to 127, the codePointAt method gives the correct decimal value, for example:
String s1 = new String("#");
int val = s1.codePointAt(0);
This returns 35 which is the correct value.
But if I try use ASCII characters from 128 to 255 (extended ASCII/ISO-8859-1), this method gives wrong value, for example:
String s1 = new String("ƒ") // Latin small letter f with hook
int val = s1.codePointAt(0);
This should return 159 as per this reference table, but instead returns 409, why is this?
But if I try use ASCII characters from 128 to 255
ASCII doesn't have values in this range. It only uses 7 bits.
Java chars are UTF-16 (and nothing else!). If you want to represent ASCII using Java, you need to use a byte array.
The codePointAt method returns the 32-bit codepoint. 16-bit chars can't contain the entire Unicode range, so some code points must be split across two chars (as per the encoding scheme for UTF-16). The codePointAt method helps resolve to chars code points.
I wrote a rough guide to encoding in Java here.
Java chars are not encoded in ISO-8859-1. They use UTF-16 which has the same values for 7bit ASCII characters (only values from 0-127).
To get the correct value for ISO-8859-1 you have to convert your string into a byte[] with String.getBytes("ISO-8859-1"); and look in the byte array.
Update
ISO-8859-1 is not the extended ASCII encoding, use String.getBytes("Cp437"); to get the correct values.
in Unicode
ƒ 0x0192 LATIN SMALL LETTER F WITH HOOK
String.codePointAt returns the Unicode-Codepoint at this specified index.
The Unicode-Codepoint of ƒ is 402, see
http://www.decodeunicode.org/de/u+0192/properties
So
System.out.println("ƒ".codePointAt(0));
printing 402 is correct.
If you are interested in the representation in other charsets, you can printout the bytes representaion of the character in other charsets via getBytes(String charsetName):
final String s = "ƒ";
for (final String csName : Charset.availableCharsets().keySet()) {
try {
final Charset cs = Charset.forName(csName);
final CharsetEncoder encode = cs.newEncoder();
if (encode.canEncode(s))
{
System.out.println(csName + ": " + Arrays.toString(s.getBytes(csName)));
}
} catch (final UnsupportedOperationException uoe) {
} catch (final UnsupportedEncodingException e) {
}
}