If I read a binary stream into a String using an ISO-8859-1 encoding, and subsequently convert it back to a binary stream, would I always get exactly the same bytes? And if not, when would I not get the same bytes?
public byte[] toStringAndBack(byte[] binaryData) throws Exception {
String s = new String(binaryData, "ISO-8859-1");
return s.getBytes("ISO-8859-1");
}
=== EDIT ===
Test:
byte[] d = {0, 1, 2, 3, 4, (byte)128, (byte)129, (byte)130}; // values that ISO-8859-1 does not define
byte[] dd = toStringAndBack(d);
for (byte b : dd)
System.out.print((b&0xFF) + " ");
Output:
0 1 2 3 4 128 129 130
So even bytes with undefined values seem to be converted properly.
The constructor you're using says:
The behavior of this constructor when the given bytes are not valid in the given charset is unspecified.
So in theory it could fail for any value the ISO-8859-1 standard doesn't assign a character to, such as 0-31 and 127-159 (160 is the no-break space, which is assigned).
That means even if it works on a given JVM's String implementation (or Charset implementation for ISO-8859-1), you cannot rely on it working on another JVM's String/Charset implementation (whether that's just a different dot-rev of a JVM from the same vendor, or a different vendor's JVM).
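If you want behaviour that is specified rather than left to the implementation, one option is to go through CharsetDecoder and CharsetEncoder with CodingErrorAction.REPORT, which throw instead of silently substituting. A minimal sketch; the class and method names (StrictRoundTrip, roundTrips) are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StrictRoundTrip {
    // Returns true only if every byte decodes and re-encodes to itself.
    // REPORT makes the coders throw on malformed or unmappable input
    // instead of substituting a replacement.
    public static boolean roundTrips(byte[] data, Charset cs) {
        try {
            CharBuffer chars = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            ByteBuffer back = cs.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(chars);
            byte[] out = new byte[back.remaining()];
            back.get(out);
            return Arrays.equals(data, out);
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // all possible byte values
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        System.out.println("ISO-8859-1: " + roundTrips(all, StandardCharsets.ISO_8859_1));
        System.out.println("US-ASCII:   " + roundTrips(all, StandardCharsets.US_ASCII));
    }
}
```

With this approach a charset that cannot represent some byte (US-ASCII for anything above 127, for example) is reported as a failure rather than producing quietly altered data.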
Let's test it:
// all possible bytes
byte[] bin = new byte[256];
for (int i=0; i<bin.length; i++)
bin[i] = (byte)i;
// convert to string
String s = new String(bin, "ISO-8859-1");
for (int i=0; i<s.length(); i++)
{
if (s.charAt(i) != i)
System.out.println(i + " s[i]=" + s.charAt(i));
}
// convert back to byte[]
byte[] bout = s.getBytes("ISO-8859-1");
for (int i=0; i<bin.length; i++)
{
if (bin[i] != bout[i])
System.out.println(i + " in=" + bin[i] + " bout=" + bout[i]);
}
System.out.println("done");
It prints only done.
Therefore at least for the current ISO-8859-1 implementation the operations are binary safe as defined in the question.
EDIT:
the current implementation is sun.nio.cs.ISO_8859_1.
Looking at the source it only checks if a char is < 256 to decide if it can be encoded.
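The flip side of that check shows up in the other direction: a String that contains a char of 256 or above cannot be encoded, and String.getBytes then substitutes the charset's replacement byte ('?' for ISO-8859-1) rather than failing. A small sketch of how one might test for that up front; Latin1Encodable and fitsLatin1 are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

public class Latin1Encodable {
    // True if every char of s can be encoded in ISO-8859-1,
    // i.e. getBytes will not have to substitute anything.
    public static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String ok = "\u00e8";      // è, char value 232
        String tooBig = "\u0100";  // Ā, char value 256
        System.out.println(fitsLatin1(ok));
        System.out.println(fitsLatin1(tooBig));
        // getBytes never throws here; the unmappable char is replaced
        byte[] lossy = tooBig.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(lossy.length + " byte, value " + lossy[0]);
    }
}
```

So the bytes-to-String-to-bytes direction from the question is safe, but an arbitrary String-to-bytes-to-String round trip through ISO-8859-1 is not.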
Related
I'm using this code to convert a UTF-8 String to binary:
public String toBinary(String str) {
byte[] buf = str.getBytes(StandardCharsets.UTF_8);
StringBuilder result = new StringBuilder();
for (int i = 0; i < buf.length; i++) {
int ch = (int) buf[i];
String binary = Integer.toBinaryString(ch);
result.append(("00000000" + binary).substring(binary.length()));
result.append(' ');
}
return result.toString().trim();
}
Before I was using this code:
private String toBinary2(String str) {
StringBuilder result = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
int ch = (int) str.charAt(i);
String binary = Integer.toBinaryString(ch);
if (ch<256)
result.append(("00000000" + binary).substring(binary.length()));
else {
binary = ("0000000000000000" + binary).substring(binary.length());
result.append(binary.substring(0, 8));
result.append(' ');
result.append(binary.substring(8));
}
result.append(' ');
}
return result.toString().trim();
}
These two methods can return different results; for example:
toBinary("è") = "11000011 10101000"
toBinary2("è") = "11101000"
I think that is because the bytes of è are negative while the corresponding char is not (char is a 2-byte unsigned integer).
What I want to know is: which of the two approaches is the correct one and why?
Thanks in advance.
Whenever you want to convert text into binary data (or into text representing binary data, as you do here) you have to use some encoding.
Your toBinary uses UTF-8 for that encoding.
Your toBinary2 uses something that's not a standard encoding: it encodes every UTF-16 code unit * below 256 in a single byte and all others in 2 bytes. Unfortunately that is not a useful encoding, since for decoding you'd have to know whether a given byte stands alone or is part of a 2-byte sequence (UTF-8/UTF-16 solve that by indicating it in the highest-order bits).
tl;dr toBinary seems correct, toBinary2 will produce output that can't uniquely be decoded back to the original string.
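To make the ambiguity concrete, here is a small sketch that re-implements the toBinary2 scheme in compact form (the class name AmbiguousEncoding is made up) and feeds it two different strings that end up with identical output:

```java
public class AmbiguousEncoding {
    // Same scheme as toBinary2: one 8-bit group for chars below 256,
    // two 8-bit groups for everything else.
    public static String toBinary2(String str) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            int ch = str.charAt(i);
            int width = ch < 256 ? 8 : 16;
            String bits = Integer.toBinaryString(ch);
            // left-pad with zeros to the chosen width
            StringBuilder padded = new StringBuilder();
            for (int p = bits.length(); p < width; p++) {
                padded.append('0');
            }
            padded.append(bits);
            // emit the padded bits in groups of 8
            for (int g = 0; g < width; g += 8) {
                result.append(padded, g, g + 8).append(' ');
            }
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        String one = "\u0141";    // a single char, Ł (0x0141)
        String two = "\u0001A";   // two chars: 0x01 followed by 'A' (0x41)
        System.out.println(toBinary2(one));
        System.out.println(toBinary2(two));
        System.out.println(toBinary2(one).equals(toBinary2(two)));
    }
}
```

Both inputs produce "00000001 01000001", so a decoder has no way of telling which string was meant.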
* You might be wondering where the mention of UTF-16 comes from: that's because all String objects in Java are implicitly encoded in UTF-16. So if you use charAt you get UTF-16 code units (which happen to be equal to the Unicode code point for all characters that fit into the Basic Multilingual Plane).
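A quick way to see the difference between code units and code points is to compare a character inside the Basic Multilingual Plane with one outside it. A minimal sketch (the class name CodeUnits is just for illustration):

```java
public class CodeUnits {
    // Number of Unicode code points, as opposed to length(), which counts UTF-16 code units
    public static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u00e8";           // è, inside the BMP
        String astral = "\uD83D\uDE00";  // U+1F600, outside the BMP (a surrogate pair)
        System.out.println(bmp.length() + " code unit(s), " + codePointCount(bmp) + " code point(s)");
        System.out.println(astral.length() + " code unit(s), " + codePointCount(astral) + " code point(s)");
        // charAt sees only the first surrogate, codePointAt sees the whole character
        System.out.println(Integer.toHexString(astral.charAt(0)));
        System.out.println(Integer.toHexString(astral.codePointAt(0)));
    }
}
```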
This code snippet might help.
String s = "Some String";
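// explicit charset (java.nio.charset.StandardCharsets) instead of the platform default
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
StringBuilder binary = new StringBuilder();
for (byte b : bytes) {
    int val = b;
    // emit the 8 bits of this byte, most significant bit first
    for (int i = 0; i < 8; i++) {
        binary.append((val & 128) == 0 ? 0 : 1);
        val <<= 1;
    }
    binary.append(' ');
}
System.out.println("'" + s + "' to binary: " + binary);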
I am trying to serialize and deserialize an object in java using proto3. Here is what my object in proto looks like
option java_multiple_files = true;
option java_package = "com.project.dataModel";
option java_outer_classname = "FlowProto";
// The request message containing the user's name.
message Flow {
string subscriberIMSEI = 1;
string destinationIP = 2;
uint64 txBytes = 3;
uint64 rxBytes = 4;
uint64 txPkts = 5;
uint64 rxPkts = 6;
uint64 startTimeInMillis = 7;
uint64 endTimeInMillis = 8;
string asnNumber = 9;
string asnName = 10;
string asnCountryCode = 11;
}
Here is what my serialization and deserialization in Java look like
public class Test {
public static void main(String[] args) throws Exception {
Flow flow =
Flow.newBuilder().setAsnName("abc")
.setEndTimeInMillis(123456789L)
.setStartTimeInMillis(123456789L)
.setDestinationIP("1.1.1.1")
.setTxBytes(1L)
.setRxBytes(1L)
.setTxPkts(1L)
.setRxPkts(1L)
.setAsnName("blah")
.setAsnCountryCode("blah")
.build();
byte[] flowByteArray = flow.toByteArray();
String flowString = flow.toByteString().toStringUtf8();
System.out.println("Parsed from ByteArray:" + Flow.parseFrom(flowByteArray).getEndTimeInMillis());
System.out.println("Parsed from ByteString:" + Flow.parseFrom(ByteString.copyFromUtf8(flowString))
.getEndTimeInMillis());
}
}
My output is as follows
Parsed from ByteArray:123456789
Parsed from ByteString:-4791902657223630865
Where am I going wrong when I am trying to go the ByteString and the utf-8 route for serialization and deserialization?
Thanks!
You are seeing this because your serialized byte array is being corrupted. Serialized protobuf data is arbitrary binary, not UTF-8 text. When you call flow.toByteString().toStringUtf8(), every byte sequence that is not valid UTF-8 is replaced with the replacement character U+FFFD, so a single byte in the original ByteString may turn into three bytes with different values. ByteString.copyFromUtf8(flowString) cannot undo this: it simply returns the UTF-8 encoding of the already-altered string, not the original bytes you put in.
Here is a small test that illustrates the issue you are seeing
@Test
public void byteConsistency() {
byte[] vals = new byte[] {0, 110, -1};
ByteString original = ByteString.copyFrom(vals);
ByteString newString = ByteString.copyFromUtf8(original.toStringUtf8());
for (int index = 0; index < newString.size(); index++) {
System.out.println(newString.byteAt(index));
}
}
You would expect this code to output
0
110
-1
But it actually outputs
0
110
-17
-65
-67
That's because 0xFF (-1) can never appear in valid UTF-8, so the decoder replaces it with the replacement character U+FFFD, which is then encoded as the three bytes [-17, -65, -67] (0xEF 0xBF 0xBD).
In summary, when dealing with protobuf don't convert serialized objects into UTF-8 strings. Only use the raw bytes for serialization and deserialization. If you try converting to UTF-8 strings the serialized bytes will become corrupted and you will not be able to deserialize them.
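If the serialized bytes really do have to travel as text, an encoding designed for binary data, such as Base64, avoids the corruption. The sketch below uses a plain byte array as a stand-in for flow.toByteArray() so that it runs without the protobuf library; the class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BinaryAsText {
    // Lossy: byte sequences that are not valid UTF-8 are replaced while decoding
    public static byte[] viaUtf8(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Lossless: Base64 maps arbitrary bytes to ASCII text and back
    public static byte[] viaBase64(byte[] raw) {
        String s = Base64.getEncoder().encodeToString(raw);
        return Base64.getDecoder().decode(s);
    }

    public static void main(String[] args) {
        byte[] raw = {0, 110, -1}; // stand-in for serialized message bytes
        System.out.println("UTF-8:  " + Arrays.toString(viaUtf8(raw)));
        System.out.println("Base64: " + Arrays.toString(viaBase64(raw)));
    }
}
```

The Base64 string is about a third longer than the raw bytes, which is the price for being able to store or send it wherever only text is accepted.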
I have a hex string (sA) that was converted from a UTF-8 string.
When I convert sA back to a UTF-8 string, the text is not displayed correctly in my form UI when I run the built .jar file, but it is displayed correctly when I run the project in run mode or debug mode.
I use NetBeans IDE 7.3.1.
My code is below:
public String hexToString(String txtInHex) {
byte[] txtInByte = new byte[txtInHex.length() / 2];
int j = 0;
for (int i = 0; i < txtInHex.length(); i += 2) {
txtInByte[j++] = Byte.parseByte(txtInHex.substring(i, i + 2), 16);
}
return new String(txtInByte);
}
private String asHex(byte[] buf) {
char[] chars = new char[2 * buf.length];
for (int i = 0; i < buf.length; ++i) {
chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
}
return new String(chars);
}
There are multiple problems with this code.
The valid range for byte values is -128 to 127, or -80 to 7F in hex, and Byte.parseByte enforces this. If your asHex method has to process a byte whose unsigned value is greater than 127 (80 to FF in hex), it will produce a hex pair that hexToString can't decode: Byte.parseByte throws a NumberFormatException.
The asHex method processes only the second byte of the input characters, so it will work correctly only for the first 256 Unicode characters and produce bogus output for the rest of them.
The hexToString method decodes a string from a byte array assuming some platform-specific default encoding, which will give incorrect results if the data was supposedly encoded in UTF-8 and the default encoding is something else.
Why are you trying to create your own methods for encoding and decoding hex strings instead of using a well known and tested library?
new String(txtInByte, "UTF-8");
Without an explicit encoding the platform default encoding is used, for instance Windows-1252. The same holds for its inverse, String.getBytes:
String s = "....";
byte[] b = s.getBytes("UTF-8");
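Putting the points from both answers together, a corrected pair of methods might look like the sketch below. HEX_CHARS is not shown in the question, so its definition here is an assumption, and stringToHex is a hypothetical helper added to make the round trip visible:

```java
import java.nio.charset.StandardCharsets;

public class HexText {
    // Assumed definition; the question does not show HEX_CHARS
    private static final char[] HEX_CHARS = "0123456789abcdef".toCharArray();

    public static String asHex(byte[] buf) {
        char[] chars = new char[2 * buf.length];
        for (int i = 0; i < buf.length; ++i) {
            chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
            chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
        }
        return new String(chars);
    }

    public static String stringToHex(String text) {
        return asHex(text.getBytes(StandardCharsets.UTF_8));
    }

    public static String hexToString(String txtInHex) {
        byte[] bytes = new byte[txtInHex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // Integer.parseInt accepts 00..ff; the cast wraps 80..ff into negative bytes
            bytes[i] = (byte) Integer.parseInt(txtInHex.substring(2 * i, 2 * i + 2), 16);
        }
        // explicit charset, so the result does not depend on how the program is launched
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e8\u20ac"; // è and the euro sign
        String hex = stringToHex(original);
        System.out.println(hex);
        System.out.println(hexToString(hex).equals(original));
    }
}
```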
Is it possible to convert a string to a byte array and then convert it back to the original string in Java or Android?
My objective is to send some strings to a microcontroller (Arduino) and store them in EEPROM (which is only 1 KB). I tried to use an MD5 hash, but it seems that is only a one-way operation. What can I do to deal with this issue?
I would suggest using the members of String, but with an explicit encoding:
byte[] bytes = text.getBytes("UTF-8");
String text = new String(bytes, "UTF-8");
By using an explicit encoding (and one which supports all of Unicode) you avoid the problems of just calling text.getBytes() etc:
You're explicitly using a specific encoding, so you know which encoding to use later, rather than relying on the platform default.
You know it will support all of Unicode (as opposed to, say, ISO-Latin-1).
EDIT: Even though UTF-8 is the default encoding on Android, I'd definitely be explicit about this. For example, this question only says "in Java or Android" - so it's entirely possible that the code will end up being used on other platforms.
Basically given that the normal Java platform can have different default encodings, I think it's best to be absolutely explicit. I've seen way too many people using the default encoding and losing data to take that risk.
EDIT: In my haste I forgot to mention that you don't have to use the encoding's name - you can use a Charset instead. Using Guava I'd really use:
byte[] bytes = text.getBytes(Charsets.UTF_8);
String text = new String(bytes, Charsets.UTF_8);
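Since Java 7 the JDK itself ships the same constants in java.nio.charset.StandardCharsets, so Guava is not needed just for this, and the Charset overloads do not throw UnsupportedEncodingException. A minimal sketch (ExplicitCharset, toBytes and fromBytes are illustrative names only):

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    public static byte[] toBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static String fromBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // "héllo"
        byte[] bytes = toBytes(text);
        // the byte count, not text.length(), is what matters for a 1 KB EEPROM
        System.out.println(text.length() + " chars, " + bytes.length + " bytes");
        System.out.println(fromBytes(bytes).equals(text));
    }
}
```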
You can do it like this.
String to byte array
String stringToConvert = "This String is 76 characters long and will be converted to an array of bytes";
byte[] theByteArray = stringToConvert.getBytes();
http://www.javadb.com/convert-string-to-byte-array
Byte array to String
byte[] byteArray = new byte[] {87, 79, 87, 46, 46, 46};
String value = new String(byteArray);
http://www.javadb.com/convert-byte-array-to-string
Use String.getBytes() to convert to bytes and the String(byte[] data) constructor to convert back to a string.
byte[] pdfBytes = Base64.decode(myPdfBase64String, Base64.DEFAULT);
import java.io.InputStream;
import java.security.MessageDigest;
import org.apache.commons.codec.binary.Hex;
public class FileHashStream
{
    // reads the whole input stream and returns its hash as a byte array;
    // the caller opens the stream, e.g. a FileInputStream on the absolute path of 'commons-codec-1.10-bin.zip'
    public static byte[] read(InputStream is) throws Exception
    {
        // the original never declares "md"; MD5 is an assumption here, substitute the algorithm you need
        MessageDigest md = MessageDigest.getInstance("MD5");
        // we need a byte buffer; we will use 16 kilobytes
        byte[] buf = new byte[1024 * 16];
        int len = 0;
        // use the buffer to update our "MessageDigest" instance
        while(true)
        {
            len = is.read(buf);
            if(len < 0) break;
            md.update(buf, 0, len);
        }
        // close the input stream
        is.close();
        // call the "digest" method for obtaining the final hash-result
        byte[] ret = md.digest();
        System.out.println("Length of Hash: " + ret.length);
        for(byte b : ret)
        {
            System.out.println(b + ", ");
        }
        String compare = "49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d";
        // Hex comes from Apache Commons Codec
        String verification = Hex.encodeHexString(ret);
        System.out.println();
        System.out.println("===");
        System.out.println(verification);
        System.out.println("Equals? " + verification.equals(compare));
        return ret;
    }
}
How can I check if a string is in valid UTF-8 format?
Only byte data can be checked. Once you have constructed a String, it's already in UTF-16 internally.
Likewise, only byte arrays can be said to be UTF-8 encoded.
Here is a common case of UTF-8 conversions.
String myString = "\u0048\u0065\u006C\u006C\u006F World";
System.out.println(myString);
byte[] myBytes = null;
try
{
myBytes = myString.getBytes("UTF-8");
}
catch (UnsupportedEncodingException e)
{
e.printStackTrace();
System.exit(-1);
}
for (int i=0; i < myBytes.length; i++) {
System.out.println(myBytes[i]);
}
If you don't know the encoding of your byte array, juniversalchardet is a library to help you detect it.
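For the byte-array case, one way to do the actual check is a CharsetDecoder configured to report errors instead of replacing them. A minimal sketch; Utf8Check and isValidUtf8 are hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if data is a well-formed UTF-8 byte sequence
    public static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xC3, (byte) 0xA8}; // è encoded in UTF-8
        byte[] bad = {(byte) 0xE8};               // è encoded in ISO-8859-1
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

Note that a positive result only means the bytes are well-formed UTF-8; plain ASCII data, for instance, passes the check too.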
The following post is taken from the official Java tutorials available at: https://docs.oracle.com/javase/tutorial/i18n/text/string.html.
The StringConverter program starts by creating a String containing
Unicode characters:
String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
When printed, the String named original appears as:
AêñüC
To convert the String object to UTF-8, invoke the getBytes method and
specify the appropriate encoding identifier as a parameter. The
getBytes method returns an array of bytes in UTF-8 format. To create a
String object from an array of non-Unicode bytes, invoke the String
constructor with the encoding parameter. The code that makes these
calls is enclosed in a try block, in case the specified encoding is
unsupported:
try {
byte[] utf8Bytes = original.getBytes("UTF8");
byte[] defaultBytes = original.getBytes();
String roundTrip = new String(utf8Bytes, "UTF8");
System.out.println("roundTrip = " + roundTrip);
System.out.println();
printBytes(utf8Bytes, "utf8Bytes");
System.out.println();
printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
The StringConverter program prints out the values in the utf8Bytes and
defaultBytes arrays to demonstrate an important point: The length of
the converted text might not be the same as the length of the source
text. Some Unicode characters translate into single bytes, others into
pairs or triplets of bytes.
The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file,
UnicodeFormatter.java. Here is the printBytes method:
public static void printBytes(byte[] array, String name) {
for (int k = 0; k < array.length; k++) {
System.out.println(name + "[" + k + "] = " + "0x" +
UnicodeFormatter.byteToHex(array[k]));
}
}
The output of the printBytes method follows. Note that only the first
and last bytes, the A and C characters, are the same in both arrays:
utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
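The defaultBytes values above depend on the platform's default charset, so they will not look the same on every machine. To reproduce both listings deterministically, one can name both charsets explicitly. In the sketch below, byteToHex is a stand-in for the tutorial's UnicodeFormatter.byteToHex (which is not shown here), and ISO-8859-1 is assumed as the default charset that produced the second listing:

```java
import java.nio.charset.StandardCharsets;

public class ByteLengths {
    // Stand-in for UnicodeFormatter.byteToHex: two lowercase hex digits per byte
    public static String byteToHex(byte b) {
        return String.format("%02x", b & 0xFF);
    }

    public static String hexDump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(byteToHex(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC"; // AêñüC
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length + " bytes: " + hexDump(utf8));
        System.out.println(latin1.length + " bytes: " + hexDump(latin1));
    }
}
```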