Base64.Decoder returning foreign characters - java

I am building a small application that converts the text in a text file to Base64 and then back to normal. The decoded text always comes back with some Chinese characters at the beginning of the first line.
public EncryptionEngine(File appFile) {
    this.appFile = appFile;
}

public void encrypt() {
    try {
        byte[] fileText = Files.readAllBytes(appFile.toPath()); // get file text as bytes
        Base64.Encoder encoder = Base64.getEncoder();
        PrintWriter writer = new PrintWriter(appFile);
        writer.print("");                                        // erase old, readable text
        writer.print(encoder.encodeToString(fileText));          // insert encoded text
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
public void decrypt() {
    try {
        byte[] fileText = Files.readAllBytes(appFile.toPath());
        String s = new String(fileText, StandardCharsets.UTF_8); // String s = new String(fileText);
        Base64.Decoder decoder = Base64.getDecoder();
        byte[] decodedByteArray = decoder.decode(s);
        PrintWriter writer = new PrintWriter(appFile);
        writer.print("");
        writer.print(new String(decodedByteArray, StandardCharsets.UTF_8)); // writer.print(new String(decodedByteArray));
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Text file before encrypt():
cheese
tomatoes
potatoes
hams
yams
Text file after encrypt():
//5jAGgAZQBlAHMAZQANAAoAdABvAG0AYQB0AG8AZQBzAA0ACgBwAG8AdABhAHQAbwBlAHMADQAKAGgAYQBtAHMADQAKAHkAYQBtAHMA
Text file after decrypt():
뿯붿cheese
tomatoes
potatoes
hams
yams

Your input file is UTF-16, not UTF-8. It begins with FF FE, the little-endian byte order mark. StandardCharsets.UTF_16 will handle this correctly. (Or set your text editor to save the file as UTF-8 instead of UTF-16.)
When you decoded FF FE as UTF-8, you got two replacement characters "��", one for each of the two bytes, neither of which is valid in UTF-8. Then when you printed this out, each replacement character '�' was encoded as EF BF BD in UTF-8. Then you interpreted the result as UTF-16, taking the bytes in groups of two and reading them as EFBF BDEF BFBD. The remainder of the file was UTF-16 the whole time, and its null bytes round-trip safely, which is why the rest of the text still reads correctly.
(If the file were ASCII text encoded as UTF-16 without a byte order mark, you would not have noticed how broken this was!)
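For example, in decrypt() (a minimal sketch of just that one change, assuming the rest of the method stays as in the question), interpreting the Base64-decoded bytes as UTF-16 lets the decoder detect and consume the FF FE mark:
byte[] decodedByteArray = decoder.decode(s);
// UTF_16 inspects the leading BOM, picks little- or big-endian accordingly, and drops it
String originalText = new String(decodedByteArray, StandardCharsets.UTF_16);
writer.print(originalText);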

Your encrypt and decrypt functions don't make the same assumptions. encrypt Base64-encodes any file and is just fine except for the variable names and comments that suggest that the file is a text file. It need not be.
decrypt reverses the Base64-encoded data back to bytes but then "overprocesses" by assuming that the bytes were text encoded with UTF-8, decoding them, and re-encoding them before writing them to the file. If the assumption were true, it would just be a no-op; it's clearly not true in your case, and it mangles the data.
Perhaps you did that because you were trying to use a PrintWriter. In Java (and .NET), the many stream and file I/O classes are often confusing, especially considering their decades-long evolution. Sometimes there is one that does exactly what you need but it can be hard to find; other times, there isn't. And, sometimes, a commonly used library like Apache Commons fills the gap.
So, just write the bytes to the file. There are lots of modern and historical options, as explained in the answers to the directly related question "byte[] to file in Java". Here's one with Files.write:
Files.write(appFile.toPath(), decodedByteArray, StandardOpenOption.CREATE);
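Putting that together, decrypt() might look something like this (a sketch only, reusing the appFile field from the question; note that Base64.getDecoder().decode can work directly on the ASCII bytes read from the file):
public void decrypt() {
    try {
        byte[] base64Bytes = Files.readAllBytes(appFile.toPath());
        byte[] decoded = Base64.getDecoder().decode(base64Bytes); // back to the original file bytes
        // write the raw bytes back; no charset is involved, so nothing can get mangled
        Files.write(appFile.toPath(), decoded,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    } catch (IOException e) {
        e.printStackTrace();
    }
}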
Note: While Base64 might have passed for encryption (and been cracked) a couple of hundred years ago, it is not intended for that purpose. It's a bit dangerous (and confusing) to call it that.

Related

How to detect encoding mismatch

I have a bunch of old AES-encrypted Strings encrypted roughly like this:
String is converted to bytes with ISO-8859-1 encoding
Bytes are encrypted with AES
Result is converted to BASE64 encoded char array
Now I would like to change the encoding to UTF-8 for new values (e.g. '€' does not work with ISO-8859-1). This will of course cause problems if I try to decrypt the old ISO-8859-1 encoded values with UTF-8 encoding:
org.junit.ComparisonFailure: expected:<!#[¤%&/()=?^*ÄÖÖÅ_:;>½§#${[]}<|'äöå-.,+´¨]'-Lorem ipsum dolor ...> but was:<!#[�%&/()=?^*����_:;>��#${[]}<|'���-.,+��]'-Lorem ipsum dolor ...>
I'm thinking of creating some automatic encoding fallback for this.
So the main question is: is it enough to inspect the decrypted char array for '�' characters to detect an encoding mismatch? And what is the 'correct' way to write that '�' symbol in the comparison?
if (new String(utf8decryptedCharArray).contains("�")) {
    // Revert to doing the decrypting with ISO-8859-1
    decryptAsISO...
}
When decrypting, you get back the original byte sequence (result of your step 1), and then you can only guess whether these bytes denote characters according to the ISO-8859-1 or the UTF-8 encoding.
From a byte sequence, there's no way to clearly tell how it is to be interpreted.
A few ideas:
You could migrate all the old encrypted strings (decrypt, decode to string using ISO-8859-1, encode to byte array using UTF-8, encrypt; see the sketch after this list). Then the problem is solved once and forever.
You could try to decode the byte array in both encodings, see if one version is illegal, or if both versions are equal, and if it is still ambiguous, take the one with the higher probability according to the expected characters. I wouldn't recommend going that way, as it needs a lot of work and still leaves some probability of error.
For new entries, you could prepend the string / byte sequence with some marker that doesn't appear in ISO-8859-1 text. E.g. some people follow the convention of prepending a Byte Order Mark at the beginning of UTF-8 encoded files. Although the resulting bytes (EF BB BF) aren't strictly illegal in ISO-8859-1 (being read as ï»¿), they are highly unlikely. Then, when your decrypted bytes start with EF BB BF, decode to string using UTF-8, otherwise using ISO-8859-1. Still, there's a non-zero probability of error.
If at all possible, I'd go for migrating the existing entries. Otherwise, you'll have to carry "old-format compatibility stuff" in your code base forever, and still can't absolutely guarantee correct behaviour.
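A minimal sketch of that migration step, assuming hypothetical decryptAes/encryptAes helpers standing in for whatever AES code the system already uses, and storedBase64Value standing for one existing entry (these names are placeholders, not from the question):
// For each stored value: undo the old pipeline, re-encode the text as UTF-8, redo the pipeline.
byte[] oldCipherBytes = Base64.getDecoder().decode(storedBase64Value);
byte[] isoBytes = decryptAes(oldCipherBytes);                      // hypothetical helper; step 2 reversed
String text = new String(isoBytes, StandardCharsets.ISO_8859_1);   // step 1 reversed (old encoding)
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);          // new encoding
String migratedValue = Base64.getEncoder().encodeToString(encryptAes(utf8Bytes));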
When decoding bytes to text, don't rely on the � character to detect malformed input. Use a strict decoder. Here is a helper method for that:
static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
    return charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
}
Here is the corresponding strict encoder helper method, in case you need it:
static byte[] encodeStrict(String str, Charset charset) throws CharacterCodingException {
    ByteBuffer buf = charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(CharBuffer.wrap(str));
    byte[] bytes = buf.array();
    if (bytes.length == buf.limit())
        return bytes;
    return Arrays.copyOfRange(bytes, 0, buf.limit());
}
Since ISO-8859-1 maps every possible byte to a character, you can't use it to detect malformed input. UTF-8, however, is validating, so it is very likely to catch malformed input. That is not 100% guaranteed, but it's the best we can do.
So, try decoding using strict UTF-8, and then fall back to ISO-8859-1 if it fails:
static String decode(byte[] bytes) {
    try {
        return decodeStrict(bytes, StandardCharsets.UTF_8);
    } catch (CharacterCodingException e) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
Test
System.out.println(decode("señor".getBytes(StandardCharsets.ISO_8859_1))); // prints: señor
System.out.println(decode("señor".getBytes(StandardCharsets.UTF_8))); // prints: señor
System.out.println(decode("€100".getBytes(StandardCharsets.UTF_8))); // prints: €100

How to make a frequency table from file content using fileInputStream

My assignment is to create a program that does compression using the Huffman algorithm. My program must be able to compress any type of file, which is why I'm not using a Reader, since that works with characters.
I don't understand how to build a frequency table when encoding a binary file.
EDIT!! Problem solved.
public static void main(String[] args) {
    try {
        FileInputStream in = new FileInputStream("./src/hello.jpg");
        int currentByte;
        while ((currentByte = in.read()) != -1) {
            // read every byte in the file and build a frequency table
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
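For what it's worth, a minimal sketch of the counting itself: since in.read() returns each byte as an unsigned value from 0 to 255, an int[256] array indexed by that value works for any kind of file (the hello.jpg path is taken from the question):
int[] frequencies = new int[256];              // one counter per possible byte value
try (FileInputStream in = new FileInputStream("./src/hello.jpg")) {
    int currentByte;
    while ((currentByte = in.read()) != -1) {
        frequencies[currentByte]++;            // read() returns 0-255, never a negative byte
    }
} catch (IOException e) {
    e.printStackTrace();
}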
I'm not sure what you mean by "reading from an image and looking at the characters", but for text files (as in your code example) this usually works by casting the read byte to char:
char charVal = (char) currentByte;
This mostly works because most data is ASCII and most charsets contain ASCII. It gets more complicated with non-ASCII characters, because a simple cast is equivalent to decoding with ISO-8859-1. That will still produce correct results most of the time, because e.g. Windows cp1252 (on German systems) differs from ISO-8859-1 only in a few spots such as the Euro sign.
Things start to go haywire with charsets like UTF-8, where non-ASCII characters are encoded with multiple bytes, so you will see things like Ã¤ instead of ä. The same goes for files encoded as "Unicode" (UTF-16), where every second byte is most likely a binary zero.
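A one-line illustration of that effect (a sketch, not part of the original answer):
// UTF-8 bytes of "ä" (C3 A4), read back as ISO-8859-1, turn into "Ã¤"
String mojibake = new String("ä".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
System.out.println(mojibake); // prints: Ã¤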
You could use Files.readAllBytes and then iterate over this array.
Path path = Paths.get("hello.txt");
try {
    byte[] array = Files.readAllBytes(path);
} catch (IOException e) {
    e.printStackTrace();
}

Convert UTF-16 to UTF-8 strings with data loss using Java

I have to insert text which is 99.9% UTF-8 but has 0.01% UTF-16 characters. So when I try to save it in my MySQL database using Hibernate and Spring, an exception occurs. I can even remove these characters, that's no problem, so I want to convert all my text to UTF-8 and save it to my database accepting data loss, so that the problem characters get removed. I tried
String string = "😈 Devil Emoji";
byte[] converttoBytes = string.getBytes("UTF-16");
string = new String(converttoBytes, "UTF-8");
System.out.println(string);
But nothing happens.
😈 Devil Emoji
Is there any external library in order to do that?
😈 probably has nothing to do with UTF-16. Its hex is F0 9F 98 88. Notice that that is 4 bytes, and that it is a UTF-8 encoding, not a "Unicode" code point (U+1F608 or \u1F608). UTF-16 would be none of the above. More (scarfboy).
MySQL's utf8 handles only 3-byte (or shorter) UTF-8 characters. MySQL's utf8mb4 also handles 4-byte characters like that little devil.
You need to change the CHARACTER SET of the column you are storing him into. And you need to establish that your connection is charset=UTF-8.
Note: things outside MySQL call it UTF-8, but MySQL calls it utf8mb4.
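As a hedged sketch of what that looks like from the Java side (the table and column names here are made up for illustration; adjust them to your own schema and Connector/J version):
// Hypothetical example: the column must use utf8mb4 on the MySQL side, e.g.
//   ALTER TABLE messages MODIFY body TEXT CHARACTER SET utf8mb4;
// and the JDBC connection should be opened as UTF-8:
String url = "jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=UTF-8";
try (Connection conn = DriverManager.getConnection(url, "user", "password");
     PreparedStatement ps = conn.prepareStatement("INSERT INTO messages (body) VALUES (?)")) {
    ps.setString(1, "😈 Devil Emoji"); // stored intact once the column is utf8mb4
    ps.executeUpdate();
}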
String holds Unicode in Java, so all scripts can be combined.
byte[] converttoBytes = string.getBytes("UTF-16");
These bytes are binary data, but actually used to store text, encoded in UTF-16.
string = new String(converttoBytes, "UTF-8");
Now the String constructor assumes that those bytes represent text encoded in UTF-8 and converts them accordingly. This is wrong.
Detecting the encoding, either UTF-8 or UTF-16, is best done on the bytes, not on a String, as the String has already been through an erroneous conversion with possible loss.
As UTF-8 has the stricter format of the two, we'll check that one.
Also, UTF-16 encodes ASCII characters with a zero byte, which almost never occurs in normal text.
So something like
public static String string(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        String s = decoder.decode(buffer).toString();
        if (!s.contains("\u0000")) { // no zero bytes, so not likely UTF-16
            return s;
        }
    } catch (CharacterCodingException e) {
        // not valid UTF-8, fall through to UTF-16
    }
    return new String(bytes, StandardCharsets.UTF_16LE);
}
If you only have a String (for instance from the database), then
if (!s.contains("\u0000")) { // Could be UTF-16
s = new String(s.getBytes("Windows-1252"), "UTF-16LE");
}
might work or make a larger mess.
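A quick sanity check of the byte-based helper above (a sketch; the emoji string is borrowed from the question):
// Valid UTF-8 passes straight through; UTF-16LE input fails strict UTF-8 decoding and falls back.
System.out.println(string("😈 Devil Emoji".getBytes(StandardCharsets.UTF_8)));    // prints: 😈 Devil Emoji
System.out.println(string("😈 Devil Emoji".getBytes(StandardCharsets.UTF_16LE))); // prints: 😈 Devil Emoji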

Reading from UTF-16 encoded text file, þÿ is prepended on the front

I'm outputting a byte array to a text file using the following method:
try {
    FileOutputStream fos = new FileOutputStream(filePath + ".8102");
    fos.write(concatenatedIVCipherMAC);
    fos.close();
} catch (Exception e) {
    e.printStackTrace();
}
which outputs UTF-16 encoded data to the file, for example:
¢¬6î)ªÈP~m˜LïiƟê•Àe»/#Ó ö¹¥‘þ²XhÃ&¼lG:Öé )GU3«´DÃ{+í—Ã]íò
However when I'm reading it back in I get þÿ prepended to the front of the data, e.g:
þÿ¢¬6î)ªÈP~m˜LïiƟê•Àe»/?#Ó ö¹¥‘þ²XhÃ&¼lG:Öé )GU3«´DÃ{+í—Ã]íò
This is the method I'm using to read in the file:
private String getFilesContents() {
    String fileContents = "";
    Scanner sc = null;
    try {
        sc = new Scanner(file, "UTF-16");
        System.out.println("Can read file: " + file.canRead());
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    while (sc.hasNextLine()) {
        fileContents += sc.nextLine();
    }
    sc.close();
    return fileContents;
}
and then byte[] contentsOfFile = fileContents.getBytes("UTF-16"); to convert the String into a byte array.
A quick Google told me that þÿ represents the byte order mark, but is it Java putting it there or Windows? How can I avoid having the þÿ prepended to the start of the data I'm reading in? I was thinking of just ignoring the first two bytes, but if it is Windows then this will obviously break the program on other platforms.
edit: changed appended to prepended.
The file is the IV+data+MAC. It's not meant to be readable text. Should I be doing something differently?
Yes. You shouldn't be trying to treat it as text anywhere.
If you really need to convert arbitrary binary data into text, use Base64 to convert it. Other than that, stick to byte arrays, InputStream and OutputStream.
I don't know exactly why you're supposedly getting extra characters, but the fact that you haven't got real text to start with suggests that it's not really worth diagnosing that side. Just start handling binary data as binary data instead.
EDIT: Have a look at Guava's IO helpers for simplicity...
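If the file really does need to be text, a minimal sketch of that Base64 round trip using java.util.Base64 (Java 8+; filePath and concatenatedIVCipherMAC are the names from the question):
// Write the binary blob out as Base64 text.
String base64 = Base64.getEncoder().encodeToString(concatenatedIVCipherMAC);
Files.write(Paths.get(filePath + ".8102"), base64.getBytes(StandardCharsets.US_ASCII));

// Read it back and recover the exact original bytes.
String stored = new String(Files.readAllBytes(Paths.get(filePath + ".8102")), StandardCharsets.US_ASCII);
byte[] roundTripped = Base64.getDecoder().decode(stored);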
þÿ is the Unicode byte order mark (BOM) character as saved in UTF-16BE, then interpreted as ISO-8859-1.
You shouldn't treat binary data as text (in whatever encoding), if you want to avoid such errors.

How do you convert binary data to Strings and back in Java?

I have binary data in a file that I can read into a byte array and process with no problem. Now I need to send parts of the data over a network connection as elements in an XML document. My problem is that when I convert the data from an array of bytes to a String and back to an array of bytes, the data is getting corrupted. I've tested this on one machine to isolate the problem to the String conversion, so I now know that it isn't getting corrupted by the XML parser or the network transport.
What I've got right now is
byte[] buffer = ...; // read from file
// a few lines that prove I can process the data successfully
String element = new String(buffer);
byte[] newBuffer = element.getBytes();
// a few lines that try to process newBuffer and fail because it is not the same data anymore
Does anyone know how to convert binary to String and back without data loss?
Answered: Thanks Sam. I feel like an idiot. I had this answered yesterday because my SAX parser was complaining. For some reason when I ran into this seemingly separate issue, it didn't occur to me that it was a new symptom of the same problem.
EDIT: Just for the sake of completeness, I used the Base64 class from the Apache Commons Codec package to solve this problem.
The String(byte[]) constructor decodes the data using the platform's default character encoding. So how bytes get converted from 8-bit values to 16-bit Java Unicode chars will vary not only between operating systems, but can even vary between different users using different code pages on the same machine! This constructor is only good for decoding one of your own text files. Do not try to convert arbitrary bytes to chars in Java!
Encoding as base64 is a good solution. This is how files are sent over SMTP (e-mail). The (free) Apache Commons Codec project will do the job.
byte[] bytes = loadFile(file);
//all chars in encoded are guaranteed to be 7-bit ASCII
byte[] encoded = Base64.encodeBase64(bytes);
String printMe = new String(encoded, "US-ASCII");
System.out.println(printMe);
byte[] decoded = Base64.decodeBase64(encoded);
Alternatively, you can use the Java 6 DatatypeConverter:
import java.io.*;
import java.nio.channels.*;
import javax.xml.bind.DatatypeConverter;

public class EncodeDecode {

    public static void main(String[] args) throws Exception {
        File file = new File("/bin/ls");
        byte[] bytes = loadFile(file, new ByteArrayOutputStream()).toByteArray();
        String encoded = DatatypeConverter.printBase64Binary(bytes);
        System.out.println(encoded);
        byte[] decoded = DatatypeConverter.parseBase64Binary(encoded);
        // check
        for (int i = 0; i < bytes.length; i++) {
            assert bytes[i] == decoded[i];
        }
    }

    private static <T extends OutputStream> T loadFile(File file, T out)
            throws IOException {
        FileChannel in = new FileInputStream(file).getChannel();
        try {
            assert in.size() == in.transferTo(0, in.size(), Channels.newChannel(out));
            return out;
        } finally {
            in.close();
        }
    }
}
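On Java 8 and later, java.util.Base64 covers the same ground without any extra dependency; a minimal sketch (reusing /bin/ls as the sample file, as in the example above):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get("/bin/ls"));
        String encoded = Base64.getEncoder().encodeToString(bytes); // ASCII-safe text
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(Arrays.equals(bytes, decoded));          // prints: true
    }
}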
If you encode it in Base64, this will turn any data into ASCII-safe text, but Base64 encoded data is larger than the original data.
See this question, How do you embed binary data in XML?
Instead of converting the byte[] into String then pushing into XML somewhere, convert the byte[] to a String via BASE64 encoding (some XML libraries have a type to do this for you). The BASE64 decode once you get the String back from XML.
Use http://commons.apache.org/codec/
Your data may be getting messed up due to all sorts of weird character set restrictions and the presence of non-printing characters. Stick with Base64.
How are you building your XML document? If you use java's built in XML classes then the string encoding should be handled for you.
Take a look at the javax.xml and org.xml packages. That's what we use for generating XML docs, and it handles all the string encoding and decoding quite nicely.
---EDIT:
Hmm, I think I misunderstood the problem. You're not trying to encode a regular string, but some set of arbitrary binary data? In that case the Base64 encoding suggested in an earlier comment is probably the way to go. I believe that's a fairly standard way of encoding binary data in XML.
