Shortest String encoding for a byte array


I have this code that generates a UBJSON byte array:
UBObject obj = UBValueFactory.createObject();
obj.put("appId", UBValueFactory.createString("70cce8adb93c4c968a7b1483f2edf5c1"));
obj.put("apiKey", UBValueFactory.createString("a65d8f147fa741b0a6d7fc43e18363c9"));
obj.put("entityType", UBValueFactory.createString("Todo"));
obj.put("entityId", UBValueFactory.createString("2-0"));
obj.put("blobName", UBValueFactory.createString("blobName"));
ByteArrayOutputStream out = new ByteArrayOutputStream();
UBWriter writer = new UBWriter(out);
try {
    writer.write(obj);
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
// Byte array of UBJSON
byte[] ubjsonBytes = out.toByteArray();
The question is: what is the shortest String encoding of this byte array that can be transmitted in an HTTP URL? Base64 works perfectly as a URL path or query parameter, but it yields quite a long String.

Depending on the input length and other properties you might want to try compressing the input with gzip before encoding the byte[] with Base64. Often a URL friendly variant of Base64 is used:
For this reason, modified Base64 for URL variants exist (such as base64url in RFC 4648), where the + and / characters of standard Base64 are respectively replaced by - and _, so that using URL encoders/decoders is no longer necessary and have no impact on the length of the encoded value, leaving the same encoded form intact for use in relational databases, web forms, and object identifiers in general.
Some variants allow or require omitting the padding = signs to avoid them being confused with field separators, or require that any such padding be percent-encoded. Some libraries will encode = to ., potentially exposing applications to relative path attacks when a folder name is encoded from user data.
You could attempt to use Base85; however, it encodes with characters that can change the meaning of a URL, e.g. &. This might or might not work with your setup and might depend on things like reverse proxy configuration. Because of that, it's often better to use a safe encoding like Base64.
All in all, long data should go into request body and not URL.
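As a rough illustration of the gzip-then-base64url suggestion, here is a minimal sketch. It assumes Java 8+ (java.util.Base64); the class name UrlSafeCodec and its method names are made up for this example and are not part of any UBJSON library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip the bytes, then encode them as unpadded base64url (RFC 4648).
final class UrlSafeCodec {

    static String encode(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The URL-safe alphabet uses '-' and '_' instead of '+' and '/'; padding '=' is dropped.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(buffer.toByteArray());
    }

    static byte[] decode(String text) {
        byte[] compressed = Base64.getUrlDecoder().decode(text);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Note that gzip adds a fixed header and trailer, so for a payload as small as the one in the question the compressed form may well be longer than the input. It is worth comparing the length of encode(ubjsonBytes) against plain base64url before committing to compression.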

Related

Uploading image to server corrupts the image

I have in my application an image upload method that needs to send an image and a string to my server.
The problem is that the server receives the content (image and string), but when it saves the image to disk the file is corrupted and can't be opened.
This is the relevant part of the script.
HttpPost httpPost = new HttpPost(url);
Bitmap bmp = ((BitmapDrawable) imageView.getDrawable()).getBitmap();
ByteArrayOutputStream stream = new ByteArrayOutputStream();
bmp.compress(Bitmap.CompressFormat.PNG, 100, stream);
byte[] byteArray = stream.toByteArray();
String byteStr = new String(byteArray);
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.append("--"+boundary+"\r\n");
stringBuilder.append("Content-Disposition: form-data; name=\"content\"\r\n\r\n");
stringBuilder.append(message+"\r\n");
stringBuilder.append("--"+boundary+"\r\n");
stringBuilder.append("Content-Disposition: form-data; name=\"image\"; filename=\"image.jpg\"\r\n");
stringBuilder.append("Content-Type: image/jpeg\r\n\r\n");
stringBuilder.append(byteStr);
stringBuilder.append("\r\n");
stringBuilder.append("--"+boundary+"--\r\n");
StringEntity entity = new StringEntity(stringBuilder.toString());
httpPost.setEntity(entity);
I can't change the server because other clients use it and it works for them. I just need to understand why the image is being corrupted.

When you do new String(byteArray), it's converting binary into the default character set (which is typically UTF-8). Most character sets aren't a suitable encoding for binary data. In other words if you were to encode certain binary strings to UTF-8 and then decode back to binary, you would not get the same binary string.
Since you're using multipart encoding, you need to write directly to the stream of the entity. Apache HTTP Client has helpers for doing this. See this guide, or this Android guide to uploading with multipart.
If you NEED to use strings only, you can safely convert your byte array to a string with
String byteStr = android.util.Base64.encodeToString(byteArray, android.util.Base64.DEFAULT);
But it's important to note that your server will need to Base64 decode the string back to a byte array and save it to an image. Further, the transfer size will be greater because Base64 encoding isn't as space efficient as raw binary.
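To make the "write bytes, not a String" point concrete, here is a minimal sketch that assembles the same multipart body as the question, but as a byte[] so the image bytes are never decoded as text. The class name MultipartBody and its parameters are invented for illustration, and it hard-codes the part names from the question.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a two-part multipart/form-data body without
// ever turning the image bytes into a String.
final class MultipartBody {

    static byte[] build(String boundary, String message, byte[] imageBytes) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        writeText(body, "--" + boundary + "\r\n");
        writeText(body, "Content-Disposition: form-data; name=\"content\"\r\n\r\n");
        writeText(body, message + "\r\n");
        writeText(body, "--" + boundary + "\r\n");
        writeText(body, "Content-Disposition: form-data; name=\"image\"; filename=\"image.jpg\"\r\n");
        writeText(body, "Content-Type: image/jpeg\r\n\r\n");
        // The binary payload goes in untouched.
        body.write(imageBytes, 0, imageBytes.length);
        writeText(body, "\r\n");
        writeText(body, "--" + boundary + "--\r\n");
        return body.toByteArray();
    }

    // Only the textual framing is encoded with a charset.
    private static void writeText(ByteArrayOutputStream out, String text) {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        out.write(encoded, 0, encoded.length);
    }
}
```

With Apache HttpClient, the result could presumably be sent with httpPost.setEntity(new ByteArrayEntity(body)) instead of a StringEntity, together with a Content-Type header naming the same boundary. Separately, the question compresses the bitmap as PNG but labels the part image/jpeg; whether the server cares about that mismatch is not something the post tells us.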

Your solution above is not working because you are using new String(byteArray). The constructor decodes the byte array using the default encoding (see What is the default encoding), and it is very likely that your data contains byte sequences that cannot be decoded into a character.
To be more precise, a charset defines how characters are represented as bytes.
Many charsets define more than 256 characters, which is why they need more than one byte to represent some characters. UTF-8 and UTF-16 use up to four bytes per character.
So you have a mapping between byte sequences and characters, and this mapping is not necessarily bijective. It is therefore very likely that some byte sequences have no character mapped to them.
The solution @Samuel suggested is foolproof because Base64 uses only A–Z, a–z, 0–9, + and /, with = as padding, to represent the bytes. I would prefer this solution!
If you don't want to or cannot use Base64, then you can try just throwing every byte as it is into the StringBuilder, hoping that nothing re-encodes the text before the server gets it.
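for (byte b : byteArray) {
    // Mask to 0..255: a plain (char) cast sign-extends negative bytes to 0xFF80..0xFFFF
    stringBuilder.append((char) (b & 0xFF));
}
I do not recommend that solution in general, but it may help you to get your stuff done. Note that it only works if the string is later sent as ISO-8859-1, so that every char becomes exactly one byte again.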
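The byte-per-char trick above is the same thing as decoding with ISO-8859-1, which maps each of the 256 byte values to exactly one character. The following small sketch (class and method names invented for this example) compares that with UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo: which charsets survive bytes -> String -> bytes unchanged?
final class Latin1RoundTrip {

    static boolean survivesRoundTrip(byte[] data, Charset charset) {
        return Arrays.equals(data, new String(data, charset).getBytes(charset));
    }

    // Every possible byte value once, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println("ISO-8859-1: " + survivesRoundTrip(all, StandardCharsets.ISO_8859_1)); // true
        System.out.println("UTF-8:      " + survivesRoundTrip(all, StandardCharsets.UTF_8));      // false
    }
}
```

So if a String really has to carry the image, both the decode and the later encode (for example the charset passed to the entity) would need to be ISO-8859-1. Writing the bytes directly, or using Base64, avoids having to get that right.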

Converting string to byte[] returns wrong value (encoding?)

I read a byte[] from a file and convert it to a String:
byte[] bytesFromFile = Files.readAllBytes(...);
String stringFromFile = new String(bytesFromFile, "UTF-8");
I want to compare this to another byte[] I get from a web service:
String stringFromWebService = webService.getMyByteString();
byte[] bytesFromWebService = stringFromWebService.getBytes("UTF-8");
So I read a byte[] from a file and convert it to a String and I get a String from my web service and convert it to a byte[]. Then I do the following tests:
// works!
org.junit.Assert.assertEquals(stringFromFile, stringFromWebService);
// fails!
org.junit.Assert.assertArrayEquals(bytesFromFile, bytesFromWebService);
Why does the second assertion fail?

Other answers have covered the likely fact that the file is not UTF-8 encoded giving rise to the symptoms described.
However, I think the most interesting aspect of this is not that the byte[] assert fails, but that the assert that the string values are the same passes. I'm not 100% sure why this is, but I think the following trawl through the source code might give us the answer:
Looking at how new String(bytesFromFile, "UTF-8"); works - we see that the constructor calls through to StringCoding.decode()
This in turn, if supplied with the UTF-8 character set, calls through to StringDecoder.decode()
This calls through to CharsetDecoder.decode() which decides what to do if the character is unmappable (which I guess will be the case if a non-UTF-8 character is presented)
In this case it uses an action defined by
private CodingErrorAction unmappableCharacterAction
= CodingErrorAction.REPORT;
On its own, REPORT would make the decoder throw a CharacterCodingException. The decoder that StringCoding builds, however, is configured with CodingErrorAction.REPLACE, so malformed or unmappable input is replaced with the replacement character (U+FFFD) instead of raising an error.
I think this means that the String is built without complaint even when the bytes are not valid UTF-8, so the String representations can be the same under comparison while the byte[] are no longer the same.
This hypothesis is kind of supported by the fact that the catch block for CharacterCodingException in StringCoding.decode() says:
} catch (CharacterCodingException x) {
// Substitution is always enabled,
// so this shouldn't happen

I don't understand it fully, but here's what I've got so far:
The problem is that the data contains some bytes which are not valid UTF-8, as the following check shows:
// returns false for my data!
public static boolean isValidUTF8(byte[] input) {
    CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();
    try {
        cs.decode(ByteBuffer.wrap(input));
        return true;
    }
    catch (CharacterCodingException e) {
        return false;
    }
}
When I change the encoding to ISO-8859-1 everything works fine. The strange thing (which I don't understand yet) is why my conversion (new String(bytesFromFile, "UTF-8");) doesn't throw any exception (like my isValidUTF8 method does), although the data is not valid UTF-8.
However, I think I will go another way and encode my byte[] as a Base64 string, as I don't want more trouble with encoding.
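On the "why no exception" puzzle: as far as I can tell, the String constructor decodes leniently and substitutes U+FFFD for bytes it cannot decode, while a CharsetDecoder used directly reports errors by default. The sketch below (class and method names are mine) puts the two behaviours side by side, using a Latin-1 encoded "café" as the assumed bad input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo: lenient String decoding versus strict CharsetDecoder decoding.
final class LenientVsStrictDecoding {

    // Never throws; undecodable bytes become U+FFFD.
    static String lenient(byte[] input) {
        return new String(input, StandardCharsets.UTF_8);
    }

    // Throws internally on the first malformed sequence, so invalid input yields false.
    static boolean isStrictlyValid(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the final byte 0xE9 is not valid UTF-8 on its own.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String decoded = lenient(latin1);
        System.out.println(isStrictlyValid(latin1));
        System.out.println(decoded.indexOf('\uFFFD') >= 0);
        // Re-encoding the lenient result does not give back the original bytes.
        System.out.println(Arrays.equals(latin1, decoded.getBytes(StandardCharsets.UTF_8)));
    }
}
```

That would also account for the failing assertArrayEquals: once a byte has been replaced by U+FFFD, getBytes("UTF-8") can only produce the encoding of U+FFFD, not the original byte.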

The real problem in your code is that you don't know the actual file encoding.
When you read the string from the web service you get a sequence of chars; when you convert the string from chars to bytes, the conversion is done correctly because you specify how to transform chars into bytes with a specific encoding ("UTF-8"). When you read a text file you face a different problem: you have a sequence of bytes that needs to be converted to chars. To do it properly you must know how the chars were converted to bytes, i.e. what the file encoding is. For files (unless specified) it's a platform default; on Windows, files are typically encoded in windows-1252 (which is very close to ISO-8859-1); on Linux/Unix it depends, but I think UTF-8 is the default.
By the way, the web service call did a decode operation under the hood; the HTTP call uses a header that defines how chars are encoded, i.e. how to read the bytes from the socket and transform them into chars. So calling a SOAP web service gives you back XML (which can be unmarshalled into a Java object) with all the encoding operations done properly.
So if you must read chars from a file you must face the encoding issue; you can use Base64 as you stated, but you lose one of the main benefits of text files: they are human readable, which eases debugging and development.

Base64 String to Windows1251 (cyrillic symbols)

I am having trouble converting an email attachment (a simple text file in windows-1251 encoding with Latin and Cyrillic symbols) to a String, i.e. I have a problem with converting the Cyrillic characters.
I get the attachment file as a Base64-encoded String like this:
Base64Encoded email Attachment
Original file
So when I try to decode it, I get "?" instead of Cyrillic symbols.
How can I get the right Cyrillic (Russian) symbols instead of "?"?
I've already tried this code with all encodings, but nothing helps to get the correct Russian symbols.
BASE64Decoder dec = new BASE64Decoder();
for (String key : Charset.availableCharsets().keySet()) {
    System.out.println("K=" + key + " Value:" +
            Charset.availableCharsets().get(key));
    try {
        System.out.println(new String(dec.decodeBuffer(encoded), key));
    } catch (Exception e) {
        continue;
    }
}
Thank you in advance.

I am not very familiar with BPEL and the protocols it uses. If you communicate between nodes using some binary protocol, then you must 1) ensure that client and receiver use the same charset and 2) convert the Java string into the proper bytes in this encoding. Java stores strings internally in UTF-16 format. So when you execute String correct = new String(commonName.getBytes("ISO-8859-1"), "ISO-8859-5") you will get the correct string in UTF-16. Then you need to export it to bytes in the requested encoding, e.g. byte[] buff = correct.getBytes("UTF-8"), assuming the encoding you use between nodes is UTF-8. If the encoding happens to be different, then you must make sure it actually supports Cyrillic characters (e.g. ISO-8859-1 does not).
If you use XML for data exchange, make sure it declares a suitable encoding, as in <?xml version="1.0" encoding="UTF-8"?>. You then don't need to play with bytes; you just need to correctly "import" the string (see the correct variable). Writing to XML converts characters automatically, but the encoding must support the characters you want to write. So if you set encoding="ISO-8859-1", then you will get those question marks again.
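For the attachment case in the question itself, a minimal sketch of "Base64-decode, then interpret the bytes as windows-1251" might look like the following. It assumes Java 8+ and a JRE that ships the windows-1251 charset; Cp1251Attachment and decodeAttachment are names invented for this example.

```java
import java.nio.charset.Charset;
import java.util.Base64;

// Hypothetical helper: turns a Base64-encoded windows-1251 text attachment into a String.
final class Cp1251Attachment {

    // Throws UnsupportedCharsetException if the runtime lacks this charset.
    private static final Charset WINDOWS_1251 = Charset.forName("windows-1251");

    static String decodeAttachment(String base64Text) {
        // The MIME decoder tolerates the line breaks that mail bodies usually contain.
        byte[] raw = Base64.getMimeDecoder().decode(base64Text);
        return new String(raw, WINDOWS_1251);
    }

    public static void main(String[] args) {
        // Build a sample attachment body from Cyrillic text written with Unicode escapes.
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442";
        String encoded = Base64.getMimeEncoder().encodeToString(original.getBytes(WINDOWS_1251));
        System.out.println(decodeAttachment(encoded).equals(original));
    }
}
```

If the decoded String compares equal to the expected text but still prints as "?", the console or log encoding is a likely suspect rather than the decoding step, since System.out encodes with its own charset.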

Read special characters ( æ ø å ) with Java from Oracle database

I have a problem reading special characters from an Oracle database (using the JDBC driver and GlassFish TopLink).
I store the name "GRØNLÅEN KJÆTIL" in the database through a web service, and in the database the data is stored correctly.
But when I read this String, print it to the log file and convert it to a byte array with this code:
int pos = 0;
byte[] msg=new byte[1024];
String F = "F" + passenger.getName();
logger.debug("Add " + F + " " + F.length());
msg = addStringToArrayBytePlusSeparator(msg, F,pos);
..............
private byte[] addStringToArrayBytePlusSeparator(byte[] arrDest,String strToAdd,int destPosition)
{
System.arraycopy(strToAdd.getBytes(Charset.forName("ISO-8859-1")), 0, arrDest, destPosition, strToAdd.getBytes().length);
arrDest = addSeparator(arrDest,destPosition+strToAdd.getBytes().length,1);
return arrDest;
}
1) In the log file there is:"Add FGRÃNLÃ " (the name isn't correct and the F.length() are not printed).
2) The code throws:
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at it.edea.ebooking.business.chi.control.VingCardImpl.addStringToArrayBytePlusSeparator(Test.java:225).
Any solution?
Thanks

You're calling strToAdd.getBytes() without specifying the character encoding, within the System.arraycopy call - that will be using the system default encoding, which may well not be ISO-8859-1. You should be consistent in which encoding you use. Frankly I'd also suggest that you use UTF-8 rather than ISO-8859-1 if you have the choice, but that's a different matter.
Why are you dealing with byte arrays anyway at this point? Why not just use strings?
Also note that your addStringToArrayBytePlusSeparator method doesn't give any indication of how many bytes it's copied, which means the caller won't have any idea what to do with it afterwards. If you must use byte arrays like this, I'd suggest making addStringToArrayBytePlusSeparator return either the new "end of logical array" or the number of bytes copied. For example:
private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
/**
 * (Insert fuller description here.)
 * Returns the number of bytes written to the array
 */
private static int addStringToArrayBytePlusSeparator(byte[] arrDest,
                                                     String strToAdd,
                                                     int destPosition)
{
    // Encode once, with an explicit charset, and reuse the result for the length
    byte[] encodedText = strToAdd.getBytes(ISO_8859_1);
    // TODO: Verify that there's enough space in the array
    System.arraycopy(encodedText, 0, arrDest, destPosition, encodedText.length);
    return encodedText.length;
}
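One way the TODO could be filled in is sketched below. The original addSeparator method is not shown in the question, so the single separator byte here is purely an assumption, as are the names FieldWriter and append. This variant returns the next free position rather than the byte count, so the caller can chain calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: copies an ISO-8859-1 encoded field plus one separator byte
// into a fixed-size buffer and returns the next free position.
final class FieldWriter {

    static int append(byte[] dest, int pos, String text, byte separator) {
        byte[] encoded = text.getBytes(StandardCharsets.ISO_8859_1);
        int end = pos + encoded.length + 1; // +1 for the separator
        if (pos < 0 || end > dest.length) {
            throw new IllegalArgumentException(
                    "need " + (encoded.length + 1) + " bytes at position " + pos
                            + " but the buffer holds only " + dest.length);
        }
        System.arraycopy(encoded, 0, dest, pos, encoded.length);
        dest[pos + encoded.length] = separator;
        return end;
    }

    public static void main(String[] args) {
        byte[] msg = new byte[1024];
        int pos = 0;
        pos = append(msg, pos, "FGR\u00D8NL\u00C5EN", (byte) ';');
        pos = append(msg, pos, "KJ\u00C6TIL", (byte) ';');
        System.out.println("bytes used: " + pos);
    }
}
```

Because the length always comes from the same encoded array that is copied, the mismatch between the ISO-8859-1 byte count and the default-charset byte count, which is what triggers the ArrayIndexOutOfBoundsException in the question, cannot occur.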

Encoding/decoding problems are hard. In every processing step you have to do the correct encoding/decoding. So:
Familiarize yourself with the difference between bytes (InputStream) and characters (Reader, String).
Choose the character encoding in which you want to store your data in the database, and the character encoding in which you want to expose your web service. Make sure that when you load initial data into the database, it's in the right encoding.
Connect with the right database properties. MySQL requires an addition to the connection URL: ?useUnicode=true&characterEncoding=UTF-8 when using UTF-8; I don't know about Oracle.
If you print/debug at a certain step and it looks OK, you can't be sure you did it right. The logger can write with the wrong encoding (sometimes making something look OK while in fact it's broken). Your terminal might not handle unusual byte encodings correctly. The same holds for command-line database clients. Your data might be stored wrongly, but your wrongly configured terminal interprets/shows the data as correct.
In XML, it's not only the stream encoding that matters, but also the encoding attribute of the XML declaration.

How do you convert binary data to Strings and back in Java?

I have binary data in a file that I can read into a byte array and process with no problem. Now I need to send parts of the data over a network connection as elements in an XML document. My problem is that when I convert the data from an array of bytes to a String and back to an array of bytes, the data is getting corrupted. I've tested this on one machine to isolate the problem to the String conversion, so I now know that it isn't getting corrupted by the XML parser or the network transport.
What I've got right now is
byte[] buffer = ...; // read from file
// a few lines that prove I can process the data successfully
String element = new String(buffer);
byte[] newBuffer = element.getBytes();
// a few lines that try to process newBuffer and fail because it is not the same data anymore
Does anyone know how to convert binary to String and back without data loss?
Answered: Thanks Sam. I feel like an idiot. I had this answered yesterday because my SAX parser was complaining. For some reason when I ran into this seemingly separate issue, it didn't occur to me that it was a new symptom of the same problem.
EDIT: Just for the sake of completeness, I used the Base64 class from the Apache Commons Codec package to solve this problem.

String(byte[]) treats the data as the default character encoding. So, how bytes get converted from 8-bit values to 16-bit Java Unicode chars will vary not only between operating systems, but can even vary between different users using different codepages on the same machine! This constructor is only good for decoding one of your own text files. Do not try to convert arbitrary bytes to chars in Java!
Encoding as base64 is a good solution. This is how files are sent over SMTP (e-mail). The (free) Apache Commons Codec project will do the job.
byte[] bytes = loadFile(file);
//all chars in encoded are guaranteed to be 7-bit ASCII
byte[] encoded = Base64.encodeBase64(bytes);
String printMe = new String(encoded, "US-ASCII");
System.out.println(printMe);
byte[] decoded = Base64.decodeBase64(encoded);
Alternatively, you can use the Java 6 DatatypeConverter:
import java.io.*;
import java.nio.channels.*;
import javax.xml.bind.DatatypeConverter;
public class EncodeDecode {
    public static void main(String[] args) throws Exception {
        File file = new File("/bin/ls");
        byte[] bytes = loadFile(file, new ByteArrayOutputStream()).toByteArray();
        String encoded = DatatypeConverter.printBase64Binary(bytes);
        System.out.println(encoded);
        byte[] decoded = DatatypeConverter.parseBase64Binary(encoded);
        // check (assertions only run when the JVM is started with -ea)
        assert bytes.length == decoded.length;
        for (int i = 0; i < bytes.length; i++) {
            assert bytes[i] == decoded[i];
        }
    }
    private static <T extends OutputStream> T loadFile(File file, T out)
            throws IOException {
        FileChannel in = new FileInputStream(file).getChannel();
        try {
            // Transfer outside the assert: with assertions disabled (the default),
            // an expression inside assert is never evaluated and nothing would be copied.
            long transferred = in.transferTo(0, in.size(), Channels.newChannel(out));
            assert transferred == in.size();
            return out;
        } finally {
            in.close();
        }
    }
}
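On newer JDKs the javax.xml.bind module is no longer bundled (it was removed in Java 11), so the same round trip can be sketched with java.util.Base64, available since Java 8. The class and method names below are mine, not from either library mentioned above.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper: lossless bytes -> String -> bytes using the JDK's own Base64.
final class Base64RoundTrip {

    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Includes values that new String(byte[]) would typically mangle.
        byte[] original = {0, (byte) 0x80, (byte) 0xFF, 0x7F, 0x0A};
        String text = toText(original);
        System.out.println(text);
        System.out.println(Arrays.equals(original, fromText(text)));
    }
}
```

The resulting text contains only ASCII letters, digits, '+', '/' and '=', none of which needs escaping inside an XML element.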

If you encode it in Base64, this will turn any data into ASCII-safe text, but Base64-encoded data is larger than the original data (about a third larger, since every 3 bytes become 4 characters).

See this question, How do you embed binary data in XML?
Instead of converting the byte[] into a String directly and then pushing it into XML somewhere, convert the byte[] to a String via BASE64 encoding (some XML libraries have a type to do this for you). Then BASE64-decode once you get the String back from the XML.
Use http://commons.apache.org/codec/
Your data may be getting messed up due to all sorts of weird character set restrictions and the presence of non-printing characters. Stick with BASE64.

How are you building your XML document? If you use java's built in XML classes then the string encoding should be handled for you.
Take a look at the javax.xml and org.xml packages. That's what we use for generating XML docs, and it handles all the string encoding and decoding quite nicely.
---EDIT:
Hmm, I think I misunderstood the problem. You're not trying to encode a regular string, but some set of arbitrary binary data? In that case the Base64 encoding suggested in an earlier comment is probably the way to go. I believe that's a fairly standard way of encoding binary data in XML.
