Java: convert UTF8 String to byte array in another encoding

I have a UTF-8 encoded String, but I need to pass parameters to a Runtime process in cp1251. How can I convert the String or byte array?
I need something like: bytesInCp1251 = encodeTo(stringInUtf8, "cp1251");
Thanks to all! This is my own solution:
OutputStreamWriter writer = new OutputStreamWriter(out, "cp1251");
writer.write(s);

There is no such thing as a "UTF-8 encoded String" in Java. Java Strings use UTF-16 internally, but should be seen as an abstraction without a specific encoding. If you have a String, it's already decoded. If you want to encode it, use string.getBytes(encoding). If your original data is UTF-8, you have to take that into account when you convert that data from bytes to String.

byte[] bytesInCp1251 = stringInUtf8.getBytes("cp1251");
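A minimal sketch of the getBytes approach, for illustration only. The class and method names are hypothetical, and it assumes the running JDK provides the windows-1251 charset (for which "cp1251" is an alias). Passing a Charset object instead of a name also avoids the checked UnsupportedEncodingException:

```java
import java.nio.charset.Charset;

class Cp1251Encoding {
    // Assumption: the running JDK ships windows-1251 (alias "cp1251").
    static final Charset CP1251 = Charset.forName("windows-1251");

    // A String carries no encoding of its own; the encoding is chosen here,
    // at the moment the text is turned into bytes.
    static byte[] toCp1251(String text) {
        return text.getBytes(CP1251);
    }

    public static void main(String[] args) {
        // "\u041f\u0440\u0438\u0432\u0435\u0442" is the Russian word "Привет".
        byte[] bytes = toCp1251("\u041f\u0440\u0438\u0432\u0435\u0442");
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Each Cyrillic letter becomes a single byte in cp1251, whereas the same text takes two bytes per letter in UTF-8.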


Related

Setting Charset for String

I use forbiddenapis to check my code. It gives an error:
[forbiddenapis] Forbidden class/interface use: java.lang.String#<init>(byte[])
[forbiddenapis] in org.a.b.MyObject (MyObject.java:14)
Which points to:
String finalString = new String(((ByteArrayOutputStream) out).toByteArray());
How can I resolve it? I know that I can set a Charset, i.e.:
Charset.forName("UTF-8").encode(myString);
However, since the input here is a byte array, which charset should I use to avoid problems with different characters?
You'll need to know the charset with which the bytes were encoded in the first place. If you're confident it will always be UTF-8, you can just use the String constructor:
new String(bytes, StandardCharsets.UTF_8)
Do not use FileReader. It is an old utility class that reads files in the default platform encoding, which makes it unsuitable for portable files; code that relies on it is not portable.
String / Reader / Writer hold Unicode text. When converting from byte[] / InputStream / OutputStream, one needs to indicate the encoding of those bytes, which are binary data.
String s = new String(bytes, charset);
byte[] bytes = s.getBytes(charset);
It seems that the message mentions FileReader and complains about its
new String(bytes);
which uses the default encoding, as would:
string.getBytes();
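As a rough sketch of how the flagged line could be rewritten: the class and method names below are hypothetical, and the choice of UTF-8 is an assumption that only holds if the bytes in the stream were written as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDecoding {
    // Decode the collected bytes with a named charset instead of the
    // platform default. Assumption: the bytes were produced as UTF-8.
    static String toText(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] encoded = "na\u00efve".getBytes(StandardCharsets.UTF_8);
        out.write(encoded, 0, encoded.length);
        System.out.println(toText(out));
    }
}
```

Because the charset is named on both the encoding and the decoding side, the result does not change when the code runs on a machine with a different default encoding.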

Read byte array into base64 encoded image

I'm trying to read an image into a ByteArrayOutputStream and then encode the array into Base64 to send as part of a JSON payload to my API. I want to avoid saving it anywhere and just read it, encode it, and send it. Unfortunately, when I pass ByteArrayOutputStream.toByteArray() to Base64.getEncoder().encodeToString(), it returns a String that contains extra break characters '\' compared to a successful test that reads from a File into Base64.
Is it possible to read directly from the byte array into base 64? Or will I have to translate into an image then to base 64?
Any help is appreciated.
Getting Image from base64:
byte[] b = DatatypeConverter.parseBase64Binary(base64Img);
ByteArrayInputStream s = new ByteArrayInputStream(b);
return new Image(s);
Maybe it can help you to do the reverse.
Apparently, passing the outputstream directly into the encoder was the issue. I added a local variable to reference the byte[] and then pass it into the encoder and it now works.
byte[] array = outputStream.toByteArray();
String base64String = Base64.getEncoder().encodeToString(array);
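For reference, a small self-contained sketch of the byte[] to Base64 round trip with java.util.Base64. The class and method names are hypothetical; the bytes stand in for image data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Base64RoundTrip {
    // The basic encoder produces a single line using A-Z, a-z, 0-9, '+', '/'
    // and '=' padding; it never emits backslashes or line breaks.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] raw = "hello".getBytes(StandardCharsets.US_ASCII);
        String text = encode(raw);
        System.out.println(text);
        System.out.println(Arrays.equals(raw, decode(text)));
    }
}
```

If backslashes still appear in the final JSON, one possible cause (an assumption, not something the post confirms) is the JSON serializer escaping '/' as '\/', which a JSON parser turns back into the same Base64 text.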

Uploading image to server corrupts the image

In my application I have an image upload method that needs to send an image and a string to my server.
The problem is that the server receives the content (image and string), but when it saves the image to disk it is corrupted and can't be opened.
This is the relevant part of the code.
HttpPost httpPost = new HttpPost(url);
Bitmap bmp = ((BitmapDrawable) imageView.getDrawable()).getBitmap();
ByteArrayOutputStream stream = new ByteArrayOutputStream();
bmp.compress(Bitmap.CompressFormat.PNG, 100, stream);
byte[] byteArray = stream.toByteArray();
String byteStr = new String(byteArray);
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.append("--"+boundary+"\r\n");
stringBuilder.append("Content-Disposition: form-data; name=\"content\"\r\n\r\n");
stringBuilder.append(message+"\r\n");
stringBuilder.append("--"+boundary+"\r\n");
stringBuilder.append("Content-Disposition: form-data; name=\"image\"; filename=\"image.jpg\"\r\n");
stringBuilder.append("Content-Type: image/jpeg\r\n\r\n");
stringBuilder.append(byteStr);
stringBuilder.append("\r\n");
stringBuilder.append("--"+boundary+"--\r\n");
StringEntity entity = new StringEntity(stringBuilder.toString());
httpPost.setEntity(entity);
I can't change the server because other clients use it and it works for them. I just need to understand why the image is being corrupted.
When you do new String(byteArray), it decodes the binary data using the default character set (which is typically UTF-8). Most character sets aren't a suitable encoding for binary data. In other words, if you decode certain byte sequences as UTF-8 and then encode the result back to bytes, you do not get the same bytes.
Since you're using multipart encoding, you need to write directly to the stream of the entity. Apache HTTP Client has helpers for doing this. See this guide, or this Android guide to uploading with multipart.
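One way to picture "writing directly to the stream" is to assemble the multipart body as bytes, so the image never passes through a String. This is only a sketch under assumptions: the class and method names are hypothetical, the field names are copied from the question, and PNG is used as the content type because the question compresses the bitmap as PNG.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class MultipartBodySketch {
    // Builds the request body as raw bytes. Text parts are encoded explicitly;
    // the image bytes are copied through untouched.
    static byte[] build(String boundary, String message, byte[] imageBytes) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        writeText(body, "--" + boundary + "\r\n");
        writeText(body, "Content-Disposition: form-data; name=\"content\"\r\n\r\n");
        writeText(body, message + "\r\n");
        writeText(body, "--" + boundary + "\r\n");
        writeText(body, "Content-Disposition: form-data; name=\"image\"; filename=\"image.png\"\r\n");
        writeText(body, "Content-Type: image/png\r\n\r\n");
        body.write(imageBytes, 0, imageBytes.length); // binary data, no charset involved
        writeText(body, "\r\n--" + boundary + "--\r\n");
        return body.toByteArray();
    }

    private static void writeText(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        byte[] fakeImage = {(byte) 0x89, 'P', 'N', 'G', (byte) 0xFF, 0};
        System.out.println(build("boundary123", "hello", fakeImage).length);
    }
}
```

With Apache HttpClient 4.x the resulting array could then be wrapped in a ByteArrayEntity instead of a StringEntity, with the request's Content-Type header set to multipart/form-data and the same boundary.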
If you must use strings only, you can safely convert your byte array to a string with
String byteStr = android.util.Base64.encodeToString(byteArray, android.util.Base64.DEFAULT);
But it's important to note that your server will need to Base64 decode the string back to a byte array and save it to an image. Further, the transfer size will be greater because Base64 encoding isn't as space efficient as raw binary.
Your solution above does not work because you are using new String(byteArray). The constructor decodes the byte array using the default encoding (see "What is the default encoding"), and it is very likely that your data contains byte sequences that cannot be decoded into a character.
To be more precise, a charset defines how characters are represented as bytes.
Most charsets have more than 256 characters, which is why they need more than one byte to represent a character. UTF-8 and UTF-16 use up to four bytes.
So you have a mapping between the number space and the character space, and this mapping is not bijective a priori. It is very likely that there are numbers in the number space that have no character mapped to them.
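The non-bijective mapping can be seen in a few lines. In this sketch (hypothetical class and method names), the first bytes of a PNG file do not survive a trip through a UTF-8 String, while ISO-8859-1, which maps every byte value to exactly one character, happens to preserve them:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyRoundTrip {
    // bytes -> String -> bytes using UTF-8; invalid sequences are replaced
    // with U+FFFD while decoding, so the original bytes can be lost.
    static boolean survivesUtf8(byte[] data) {
        byte[] back = new String(data, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, back);
    }

    // Same trip using ISO-8859-1, a one-byte-per-character charset.
    static boolean survivesLatin1(byte[] data) {
        byte[] back = new String(data, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        byte[] pngSignature = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
        System.out.println("UTF-8 round trip intact: " + survivesUtf8(pngSignature));
        System.out.println("ISO-8859-1 round trip intact: " + survivesLatin1(pngSignature));
    }
}
```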
The solution @Samuel suggested is foolproof because Base64 uses only A–Z, a–z, 0–9, + and /, with = as padding, to represent the bytes. I would prefer this solution!
If you don't want to or cannot use Base64, then you can try to just put every byte as-is into the StringBuilder, hoping that the server does not do any encoding before you get it.
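for (byte b : byteArray) {
// Mask to 0..255; a plain (char) cast sign-extends negative bytes.
stringBuilder.append((char) (b & 0xFF));
}
I do not recommend that solution in general, but it may help you get your stuff done. It also only works if the string is later written out as ISO-8859-1, which maps chars 0 to 255 back to the same byte values.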

How to convert String encoded in windows-1250/Cp1250 to utf-8?

As the title says...
I read content from an HTTP response:
InputStream is = response.getEntity().getContent();
String cw = IOUtils.toString(is);
byte[] b = cw.getBytes("Cp1250");
String x = StringUtils.newStringUtf8(b);
String content = new String(b, "UTF-8");
System.out.println(content);
I have tried plenty of variations. I am a little confused about which encoding names are correct when passed as strings: windows-1250 or Cp1250? UTF-8, utf-8, or utf8?
You seem to think that a String object has an encoding. That's not correct. An encoding is used as part of the translation from binary data (a byte[] or InputStream) to text data (a String or char[] etc).
It's not clear what IOUtils.toString is doing, but it's almost certainly losing data or at least handling it inappropriately. If your data is originally in Windows-1250, then you should use an InputStreamReader wrapping the InputStream, specifying the charset in the InputStreamReader constructor call.
It's not clear where UTF-8 comes in - you might want to write out the data in UTF-8 afterwards, but the result of that would be byte[], not a string.
You're converting backwards. You need to get the input data as a byte array and then use String(byteArray, "Cp1250") to create the String object. Then if you want UTF-8, use String.getBytes("UTF-8").
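The direction described here (bytes in the source encoding, then a String, then bytes in the target encoding) might look like this. The class and method names are hypothetical, and the sketch assumes the JDK provides windows-1250:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1250ToUtf8 {
    // Assumption: the running JDK ships windows-1250 (alias "Cp1250").
    static final Charset CP1250 = Charset.forName("windows-1250");

    static byte[] transcode(byte[] cp1250Bytes) {
        String text = new String(cp1250Bytes, CP1250);  // bytes -> text (decode)
        return text.getBytes(StandardCharsets.UTF_8);   // text -> bytes (encode)
    }

    public static void main(String[] args) {
        // 0xE8 is the letter "č" (U+010D) in windows-1250.
        byte[] utf8 = transcode(new byte[] {(byte) 0xE8});
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

The String in the middle is just text; only the two byte arrays have an encoding.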
Encodings have a canonical (unique) name plus a number of aliases, and the names are case-insensitive. For instance, "UTF-8" is the canonical name, but some Java versions back it was "UTF8"; the name was changed to match common usage. The same goes for "Windows-1250", which you might also see in HTML pages. "Cp1250" (code page) is a Java-internal name.
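To check which spellings a given JDK accepts, the names can be resolved through java.nio.charset.Charset. This is a small sketch with hypothetical class and method names; the exact alias list depends on the JDK in use.

```java
import java.nio.charset.Charset;

class CharsetNames {
    // Resolves any accepted spelling or alias to the canonical name.
    // Charset.forName ignores case and throws if the name is unknown.
    static String canonical(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "utf-8", "UTF8", "windows-1250", "Cp1250"}) {
            System.out.println(name + " -> " + canonical(name));
        }
        // Lists the other names this JDK accepts for the same charset.
        System.out.println(Charset.forName("windows-1250").aliases());
    }
}
```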
In Java, byte[] is binary data and String (internally Unicode) is text.
Converting between the two requires an encoding. Many APIs make it optional and fall back to the operating system default.
byte, InputStream, OutputStream <-> String, char, Reader, Writer
String cw = IOUtils.toString(is, "UTF-8"); // an InputStream is binary (byte[]), so name its encoding
byte[] b = cw.getBytes("Cp1250");
String x = new String(b, "Cp1250");
String content = x;
System.out.println(content);
To make String universal with respect to encoding, it internally uses char values in UTF-16.
String constants are stored in the .class file as UTF-8 (more compact).
Assuming Apache Commons IO, use one of the methods that specifies an encoding:
String cw = IOUtils.toString(is, "windows-1250");
All strings are implicitly UTF-16 in Java. Other encodings are generally represented using byte arrays.
I find it better to use Scanner for reading in different charsets.
FileInputStream is = new FileInputStream(fileOrPath);
Scanner scanner = new Scanner(is, "cp1250");
String out = scanner.next();
The next() method then returns an ordinary Java String, decoded from cp1250.
Tested with Czech text, converting from "cp1250" to "UTF-8".

Java character conversion to UTF-8

I am using:
InputStreamReader isr = new InputStreamReader(fis, "UTF8");
to read characters from a text file and convert them to UTF-8 characters.
My question is: what happens if one of the characters being read cannot be converted to UTF-8? Will there be an exception, or will the character get dropped?
You are not converting from one charset to another. You are just indicating that the file is UTF-8 encoded so that you can read it correctly.
If you want to convert from one encoding to another, then you should do something like the following:
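File infile = new File("x-utf8.txt");
File outfile = new File("x-utf16.txt");
String fromEncoding = "UTF-8";
String toEncoding = "UTF-16";
// Decode with the source encoding, encode with the target encoding.
try (Reader in = new InputStreamReader(new FileInputStream(infile), fromEncoding);
Writer out = new OutputStreamWriter(new FileOutputStream(outfile), toEncoding)) {
char[] buffer = new char[8192];
for (int n; (n = in.read(buffer)) != -1; ) {
out.write(buffer, 0, n);
}
}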
After going through David Gelhar's response, I feel this code can be improved a bit. If you don't know the encoding of the input file, use the GuessEncoding library to detect it, and then construct the reader with the detected encoding.
If the input file contains bytes that are not valid utf-8, read() will by default replace the invalid characters with a value of U+FFFD (65533 decimal; the Unicode "replacement character").
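The default behaviour can be observed with an in-memory stream standing in for the file. This is a sketch with hypothetical class and method names:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Reads all characters with the default (replacing) UTF-8 decoder.
    static String readUtf8(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            for (int c; (c = in.read()) != -1; ) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0xFF can never appear in valid UTF-8.
        String text = readUtf8(new byte[] {'a', (byte) 0xFF, 'b'});
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

No exception is thrown and nothing is dropped silently; the bad byte shows up as U+FFFD in the resulting text.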
If you need more control over this behavior, you can use:
InputStreamReader(InputStream in, CharsetDecoder dec)
and supply a CharsetDecoder configured to your liking.
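For example, a decoder configured with CodingErrorAction.REPORT makes the reader throw instead of substituting U+FFFD. This is a sketch with hypothetical class and method names, again using an in-memory stream in place of the file:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Reader {
    // Throws a CharacterCodingException (an IOException) on invalid input.
    static String readStrict(byte[] data) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), decoder)) {
            char[] buffer = new char[1024];
            for (int n; (n = in.read(buffer)) != -1; ) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    // Convenience wrapper: true if the bytes decode cleanly as UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            readStrict(data);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[] {'o', 'k'}));
        System.out.println(isValidUtf8(new byte[] {'a', (byte) 0xFF}));
    }
}
```

CodingErrorAction.IGNORE is the third option; it drops the offending bytes instead of replacing or reporting them.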
