I'm trying to convert the content of a plain text file (UTF-8) into ASCII (ISO-8859-15) to write it to an output file. I've written a few lines of code (see below) which read the content of the file into a byte array, decode it with the UTF-8 charset, encode it with the ISO-8859-15 charset, and write the result to a file. This works just fine, except for a question mark (hex 3F) that suddenly appears at the very beginning of the output file.
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class Main {
    public static void main(String[] args) {
        /* Read the file into a byte array */
        File input = new File("input.txt");
        byte[] bytes = new byte[(int) input.length()];
        try (FileInputStream fileInput = new FileInputStream(input)) {
            fileInput.read(bytes); // note: read() is not guaranteed to fill the array in one call
        } catch (IOException e) {
            if (e instanceof FileNotFoundException) {
                System.err.println("File not found.");
            } else {
                e.printStackTrace();
            }
        }
        /* Getting the charsets */
        Charset utf8charset = Charset.forName("UTF-8");
        Charset iso885915charset = Charset.forName("ISO-8859-15");
        /* Wrapping the bytes from the file into a buffer */
        ByteBuffer inputBuffer = ByteBuffer.wrap(bytes);
        /* Decoding the bytes from UTF-8 into characters */
        CharBuffer data = utf8charset.decode(inputBuffer);
        /* Encoding the characters to ISO-8859-15 and writing them to an array */
        ByteBuffer outputBuffer = iso885915charset.encode(data);
        byte[] outputData = outputBuffer.array();
        System.out.println(new String(outputData));
        File output = new File("output.txt");
        /* Writing the output to a file */
        try (BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(output))) {
            out.write(outputData);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Input file:
ABC
DEF
GHI
Output file:
?ABC
DEF
GHI
If you have an idea what might be causing this odd behavior, please let me know. Also, if there is anything questionable about my code in general, please point it out, since I'm not very experienced with Java yet.
Thanks :)
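A likely cause, offered as an assumption since the raw input bytes aren't shown: a UTF-8 byte-order mark. If the file starts with the BOM bytes EF BB BF, they decode to the single character U+FEFF, which ISO-8859-15 cannot represent, so the encoder substitutes its replacement byte 3F ('?'). A minimal sketch that skips a leading BOM before decoding, reusing the names from the code above:

/* Skip the three UTF-8 BOM bytes, if present, before handing the buffer to the decoder */
ByteBuffer inputBuffer = ByteBuffer.wrap(bytes);
if (bytes.length >= 3
        && (bytes[0] & 0xFF) == 0xEF
        && (bytes[1] & 0xFF) == 0xBB
        && (bytes[2] & 0xFF) == 0xBF) {
    inputBuffer.position(3); // decode everything after the BOM
}
CharBuffer data = utf8charset.decode(inputBuffer);

Unrelated to the question mark: outputBuffer.array() exposes the encoder's whole backing array, which can be longer than the encoded content, so copying only outputBuffer.remaining() bytes avoids writing trailing zero bytes.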
Related
I have a byte-array file that I am trying to convert into something human-readable. I tried the approaches below:
public static void main(String args[]) throws IOException
{
    //System.out.println("Platform Encoding : " + System.getProperty("file.encoding"));
    FileInputStream fis = new FileInputStream("<Path>");
    // Using Apache Commons IOUtils to read the file into a byte array
    byte[] filedata = IOUtils.toByteArray(fis);
    String str = new String(filedata, "UTF-8");
    System.out.println(str);
}
Another approach:
public static void main(String[] args) {
    File file = new File("<Path>");
    readContentIntoByteArray(file);
}

private static byte[] readContentIntoByteArray(File file) {
    FileInputStream fileInputStream = null;
    byte[] bFile = new byte[(int) file.length()];
    try {
        fileInputStream = new FileInputStream(file);
        fileInputStream.read(bFile);
        fileInputStream.close();
        for (int i = 0; i < bFile.length; i++) {
            System.out.print((char) bFile[i]);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return bFile;
}
Both snippets compile, but neither yields output in a human-readable fashion. Excuse me if this is a repeated or basic question.
Could someone please point out where I am going wrong?
Your code (from the first snippet) for decoding a byte file into UTF-8 text looks correct to me (assuming FileInputStream fis = new FileInputStream("Path") is yielding the correct fileInputStream).
If you're expecting a text file but are not sure which encoding it is in (perhaps it's not UTF-8), you can use a library like the one below to find out:
https://code.google.com/archive/p/juniversalchardet/
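A sketch of typical detector usage (the API details are quoted from memory, so treat them as an assumption and check the project page; <Path> is the same placeholder as above):

import org.mozilla.universalchardet.UniversalDetector;
import java.io.FileInputStream;

public class DetectEncoding {
    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[4096];
        UniversalDetector detector = new UniversalDetector(null);
        try (FileInputStream fis = new FileInputStream("<Path>")) {
            int nread;
            // Feed the detector until it is confident or the file ends
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        detector.dataEnd();
        System.out.println("Detected encoding: " + detector.getDetectedCharset()); // null if unknown
    }
}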
or just explore some of the different charsets in java.nio.charset and see what each one produces in your String initialization line:
new String(byteArray, Charset.defaultCharset()) // try other Charsets here.
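For example, a minimal probe (assuming the file fits in memory; <Path> is the same placeholder as above):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharsetProbe {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get("<Path>"));
        Charset[] candidates = {StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_16LE, StandardCharsets.UTF_16BE};
        for (Charset cs : candidates) {
            String decoded = new String(data, cs);
            // Print a short prefix of each decoding; the readable one wins
            System.out.println(cs + " -> " + decoded.substring(0, Math.min(80, decoded.length())));
        }
    }
}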
The second method you show has pitfalls associated with byte-to-char conversion, depending on the characters, as discussed here (Byte and char conversion in Java).
Chances are, if you cannot find a valid encoding for this file, it was not human-readable to begin with, before the byte conversion, or the byte-array file passed to you lost something that made it decodable along the way.
I'm getting Base64-encoded data as a String. I'm trying to decode the Base64 and want to download it as a file. I have commented the line of code below where I am getting the error.
I'm not sure how to decode the data.
String contentByte = null;
for (SearchHit contenthit : contentSearchHits) {
    Map<String, Object> sourceAsMap = contenthit.getSourceAsMap();
    fileName = sourceAsMap.get("Name").toString();
    System.out.println("FileName ::::" + fileName);
    contentByte = sourceAsMap.get("resume").toString();
}
System.out.println("Bytes --->" + contentByte);
File file = File.createTempFile("Testing", ".pdf", new File("D:/"));
file.deleteOnExit();
BufferedWriter out = new BufferedWriter(new FileWriter(file));
out.write(Base64.getDecoder().decode(contentByte)); // getting error on this line
Please find below the compilation error I am getting:
The method write(int) in the type BufferedWriter is not applicable for the arguments (byte[])
I'm using Java 8.
Writers are used for writing characters, not bytes. To write bytes, you should use some flavor of OutputStream. See Writer or OutputStream?
But if all you want is to write a byte array to a file, the java.nio.file.Files class provides a Files.write method that does just that:
byte[] bytes = Base64.getDecoder().decode(contentByte);
Files.write(file.toPath(), bytes);
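Put together, a minimal sketch (contentByte and file play the same roles as in the question):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Base64;

public class SaveDecoded {
    // Decode the Base64 text and write the raw bytes straight to the file,
    // with no Reader/Writer (and hence no charset) involved
    static void save(String contentByte, File file) throws IOException {
        byte[] bytes = Base64.getDecoder().decode(contentByte);
        Files.write(file.toPath(), bytes);
    }
}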
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Example {
    public static void main(String[] args) {
        String contentByte = "Simple text send from server";
        byte[] bytes = Base64.getEncoder().encode(contentByte.getBytes(StandardCharsets.UTF_8));
        // Data received by you at the server end (Base64-encoded data as a string)
        contentByte = new String(bytes);
        System.out.println(new String(bytes));
        BufferedWriter out = null;
        System.out.println("Bytes --->" + contentByte);
        try {
            File file = File.createTempFile("Testing", ".pdf", new File("/tmp/"));
            // file.deleteOnExit(); // this line would delete the file, so the data would not be saved; remove it
            out = new BufferedWriter(new FileWriter(file));
            byte[] decodedImg = Base64.getDecoder().decode(contentByte.getBytes(StandardCharsets.UTF_8));
            out.write(new String(decodedImg)); // OK for this text demo; binary data should go through an OutputStream instead
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (out != null) {
                try {
                    out.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
The above solution might help you.
import java.io.*;
import java.nio.*;
import java.util.Base64;
import java.util.UUID;
import java.io.UnsupportedEncodingException;

public class Abc {
    public static String readFileAsString(String filePath) throws IOException {
        DataInputStream dis = new DataInputStream(new FileInputStream(filePath));
        try {
            long len = new java.io.File(filePath).length();
            if (len > Integer.MAX_VALUE) throw new IOException("File " + filePath + " too large");
            byte[] bytes = new byte[(int) len];
            dis.readFully(bytes);
            String ans = new String(bytes, "UTF-8");
            return ans;
        } finally {
            dis.close();
        }
    }

    public static void main(String args[]) throws IOException {
        String base64encodedString = null;
        FileOutputStream stream = new FileOutputStream("C:\\Users\\EMP142738\\Desktop\\New folder\\Readhjbdsdsefd.pdf");
        String filePath = new String("C:\\Users\\EMP142738\\Desktop\\New folder\\Readers Quick Ref Card.pdf");
        try {
            base64encodedString = java.util.Base64.getUrlEncoder().encodeToString(new Abc().readFileAsString(filePath).getBytes("utf-8"));
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            byte[] base64decodedBytes = java.util.Base64.getUrlDecoder().decode(base64encodedString);
            stream.write(base64decodedBytes);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            stream.close();
        }
    }
}
I'm trying to encode and decode a PDF file using Base64. What I'm doing is converting a PDF (a binary file) to a byte array, then returning the byte array as a String. I then encode this String in Base64, using java.util.Base64. When I try to backtrack through the process, I'm able to convert back to a PDF, but the file is corrupted/damaged. Also, the output file after the entire encode-decode round trip is significantly larger than the input file; I expected both to be the same size. What am I doing wrong here?
Edit 1 (7/13/16):
In the main method, I modified the code as per Jim's suggestion.
I tried using Base64.encode(byte[] src) after reading its documentation. However, it keeps giving the error "cannot find symbol Base64.encode(byte[])", even though I've used the encodeToString method from the same class (java.util.Base64.Encoder). I'm unable to understand the issue here.
Here's the modified main method, used after changing readFileAsString to return a byte[].
public void main(String args[]) throws IOException {
    String filePath = new String("C:\\Users\\EMP142738\\Desktop\\New folder\\Readers Quick Ref Card.pdf");
    byte[] src = new Abc().readFileAsString(filePath);
    byte[] destination = Base64.encode(src); // "cannot find symbol" here
}
The problem is in your flow
byte[] -> String -> base64 string
You need to omit the conversion to String and go directly:
byte[] -> base64 string
Converting to String will corrupt a binary stream, as it involves a decode operation from the input character set into 16-bit Unicode characters.
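Concretely, a minimal sketch of the direct flow, with the paths carried over from the question (as an aside, java.util.Base64 only exposes its encoder via getEncoder(), which is why the bare Base64.encode(src) call from Edit 1 did not compile):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) throws Exception {
        Path in = Paths.get("C:\\Users\\EMP142738\\Desktop\\New folder\\Readers Quick Ref Card.pdf");
        Path out = Paths.get("C:\\Users\\EMP142738\\Desktop\\New folder\\Readhjbdsdsefd.pdf");
        byte[] src = Files.readAllBytes(in);                     // byte[] straight from the file
        String base64 = Base64.getEncoder().encodeToString(src); // byte[] -> base64 string
        byte[] decoded = Base64.getDecoder().decode(base64);     // base64 string -> byte[]
        Files.write(out, decoded);                               // same bytes, same size as the input
    }
}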
How to read a UTF8 encoded file in Java into a String accurately?
When I change the encoding of this .java file to UTF-8 (Eclipse > right-click on App.java > Properties > Resource > Text file encoding), it works fine from within Eclipse, but not from the command line. It seems Eclipse sets the file.encoding parameter when running App.
Why should the encoding of the source file have any impact on creating a String from bytes? What is the fool-proof way to create a String from bytes when the encoding is known?
I may have files with different encodings. Once the encoding of a file is known, I should be able to read it into a String regardless of the value of file.encoding, right?
The content of the UTF-8 file is below:
English Hello World.
Korean 안녕하세요.
Japanese 世界こんにちは。
Russian Привет мир.
German Hallo Welt.
Spanish Hola mundo.
Hindi हैलो वर्ल्ड।
Gujarati હેલો વર્લ્ડ.
Thai สวัสดีชาวโลก.
-end of file-
The code is below. My observations are in the comments within.
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class App {
    public static void main(String[] args) {
        String slash = System.getProperty("file.separator");
        File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
        File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
        File outputUtfByteWrittenFile = new File(
                "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
        outputUtfFile.delete();
        outputUtfByteWrittenFile.delete();
        try {
            /*
             * Read a UTF-8 text file with internationalized strings into bytes.
             * There should be no information loss here, when read into raw bytes.
             * We are sure that this file is UTF-8 encoded.
             * Input file created using Notepad++. Text copied from Google Translate.
             */
            byte[] fileBytes = readBytes(inputUtfFile);
            /*
             * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
             */
            String str = new String(fileBytes, StandardCharsets.UTF_8);
            /*
             * The console is incapable of displaying this string,
             * so we write it into another file. Open in Notepad++ to check.
             */
            ArrayList<String> list = new ArrayList<>();
            list.add(str);
            writeLines(list, outputUtfFile);
            /*
             * Works fine when I read bytes and write bytes.
             * Open the other output file in Notepad++ and check.
             */
            writeBytes(fileBytes, outputUtfByteWrittenFile);
            /*
             * I am using JDK 8u60.
             * I tried running this on the command line instead of Eclipse. Does not work.
             * I tried using the Apache Commons IO library. Does not work.
             *
             * This means that new String(bytes, charset) does not work correctly.
             * There is no real effect of specifying a charset for the string.
             */
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void writeLines(List<String> lines, File file) throws IOException {
        BufferedWriter writer = null;
        OutputStreamWriter osw = null;
        OutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            osw = new OutputStreamWriter(fos);
            writer = new BufferedWriter(osw);
            String lineSeparator = System.getProperty("line.separator");
            for (int i = 0; i < lines.size(); i++) {
                String line = lines.get(i);
                writer.write(line);
                if (i < lines.size() - 1) {
                    writer.write(lineSeparator);
                }
            }
        } catch (IOException e) {
            throw e;
        } finally {
            close(writer);
            close(osw);
            close(fos);
        }
    }

    public static byte[] readBytes(File file) {
        FileInputStream fis = null;
        byte[] b = null;
        try {
            fis = new FileInputStream(file);
            b = readBytesFromStream(fis);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            close(fis);
        }
        return b;
    }

    public static void writeBytes(byte[] inBytes, File file) {
        FileOutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            writeBytesToStream(inBytes, fos);
            fos.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            close(fos);
        }
    }

    public static void close(InputStream inStream) {
        try {
            inStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        inStream = null;
    }

    public static void close(OutputStream outStream) {
        try {
            outStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        outStream = null;
    }

    public static void close(Writer writer) {
        if (writer != null) {
            try {
                writer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
            writer = null;
        }
    }

    public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
        int bytesread = -1;
        byte[] b = new byte[4096]; // 4096 is the default cluster size in Windows for < 2 TB NTFS partitions
        long count = 0;
        bytesread = readStream.read(b);
        while (bytesread != -1) {
            writeStream.write(b, 0, bytesread);
            count += bytesread;
            bytesread = readStream.read(b);
        }
        return count;
    }

    public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
        ByteArrayOutputStream writeStream = null;
        byte[] byteArr = null;
        writeStream = new ByteArrayOutputStream();
        try {
            copy(readStream, writeStream);
            writeStream.flush();
            byteArr = writeStream.toByteArray();
        } finally {
            close(writeStream);
        }
        return byteArr;
    }

    public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
        ByteArrayInputStream bis = null;
        bis = new ByteArrayInputStream(inBytes);
        try {
            copy(bis, writeStream);
        } finally {
            close(bis);
        }
    }
}
Edit: for @JB Nizet, and everyone :)
//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work.
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works
I need to specify the encoding of the bytes when reading bytes into a String.
I need to specify the encoding of the bytes when writing a String out to a file.
Once I have a String in the JVM, I do not need to remember the source byte encoding, am I right?
When I write to a file, it should convert the String into the default charset of my machine (be it UTF-8, ASCII, or cp1252). That is failing.
It fails for UTF-16BE too. Why does it fail for some charsets?
The Java source file encoding is indeed irrelevant. And the reading part of your code is correct (although inefficient). What is incorrect is the writing part:
osw = new OutputStreamWriter(fos);
should be changed to
osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
Otherwise, you use the default encoding (which doesn't seem to be UTF8 on your system) instead of using UTF8.
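For reference, a sketch of a writeLines overload with an explicit Charset, like the one the Edit above calls (the actual overload isn't shown in the question, so the body here is an assumption; it needs java.nio.charset.Charset imported):

public static void writeLines(List<String> lines, File file, Charset charset) throws IOException {
    // Same as writeLines(List, File), but the writer's charset is explicit instead of the platform default
    try (BufferedWriter writer = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(file), charset))) {
        String lineSeparator = System.getProperty("line.separator");
        for (int i = 0; i < lines.size(); i++) {
            writer.write(lines.get(i));
            if (i < lines.size() - 1) {
                writer.write(lineSeparator);
            }
        }
    }
}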
Note that Java allows using forward slashes in file paths, even on Windows. You could simply write
File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");
EDIT:
Once I have a String in JVM, I do not need to remember the source byte encoding, am I right?
Yes, you're right.
When I write to file, it should convert the String into the default Charset of my machine (be it UTF8 or ASCII or cp1252). That is failing.
If you don't specify any encoding, Java will indeed use the platform default encoding to transform the characters into bytes. If you specify an encoding (as suggested in the beginning of this answer), then it uses the encoding you tell it to use.
But not all encodings can represent the full range of Unicode characters the way UTF-8 can. ASCII, for example, only supports 128 different characters, and Cp1252, AFAIK, only 256. So the encoding succeeds, but it replaces unencodable characters with a special one (typically a question mark), which means: I can't encode this Thai or Russian character because it's not part of my supported character set.
UTF-16 encoding should be fine. But make sure to also configure your text editor to use UTF-16 when reading and displaying the content of the file; if it's configured to use another encoding, the displayed content won't be correct.
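As a quick illustration of that replacement behavior, a standalone sketch (String.getBytes(Charset) silently substitutes unencodable characters):

import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        String s = "Hello สวัสดี";                                        // ASCII plus Thai
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);             // Thai chars are unencodable in ASCII
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints "Hello ??????"
    }
}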
On the server (C++), binary data is compressed using the ZLib function
compress2()
and sent over to the client (Java).
On the client side (Java), the data should be decompressed using the following code snippet:
public static String unpack(byte[] packedBuffer) {
    InflaterInputStream inStream = new InflaterInputStream(new ByteArrayInputStream(packedBuffer));
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    int readByte;
    try {
        while ((readByte = inStream.read()) != -1) {
            outStream.write(readByte);
        }
    } catch (Exception e) {
        JMDCLog.logError(" unpacking buffer of size: " + packedBuffer.length);
        e.printStackTrace();
        // ... the rest of the code follows
    }
The problem is that when it tries to read in the while loop, it always throws:
java.util.zip.ZipException: invalid stored block lengths
Before I check for other possible causes, can someone please tell me whether I can compress on one side with compress2() and decompress on the other side using the above code, so I can eliminate this as a problem? Also, if someone has a clue about what might be wrong here, please share (I know I didn't provide much of the code, but the projects are rather big).
Thanks.
I think the problem is not with the unpack method but with the packedBuffer content. unpack works fine:
public static byte[] pack(String s) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DeflaterOutputStream dout = new DeflaterOutputStream(out);
    dout.write(s.getBytes());
    dout.close();
    return out.toByteArray();
}

public static void main(String[] args) throws Exception {
    byte[] a = pack("123");
    String s = unpack(a); // calls your unpack
    System.out.println(s);
}
output
123
public static String unpack(byte[] packedBuffer) {
    try (InflaterInputStream inStream = new InflaterInputStream(
            new ByteArrayInputStream(packedBuffer));
            ByteArrayOutputStream outStream = new ByteArrayOutputStream()) {
        inStream.transferTo(outStream);
        //...
        return new String(outStream.toByteArray(), StandardCharsets.UTF_8);
    } catch (Exception e) {
        JMDCLog.logError(" unpacking buffer of size: " + packedBuffer.length);
        e.printStackTrace();
        throw new IllegalArgumentException(e);
    }
}
Note that compress2() produces the zlib format (RFC 1950), which InflaterInputStream reads by default; GZIPInputStream is for the gzip format (RFC 1952), so it would not accept this data.
As you seem to expect the bytes to represent text, hence to be in some encoding, pass that encoding (a Charset) to the conversion to String (which always holds Unicode).
Note, UTF-8 is assumed as the encoding of the bytes here; in your case it might be another encoding.
The try-with-resources syntax closes the streams even on an exception or, as here, on the return.
I rethrew the exception wrapped in a RuntimeException, as it seems dangerous to continue with no result.
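For completeness, a minimal round-trip sketch in the zlib format (assuming, per the question, that the C++ side really uses compress2(), i.e. the zlib wrapper):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class ZlibRoundTrip {
    public static void main(String[] args) throws IOException {
        byte[] original = "hello zlib".getBytes(StandardCharsets.UTF_8);
        // Compress into the zlib format (RFC 1950), the same wrapper compress2() emits
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        try (DeflaterOutputStream def = new DeflaterOutputStream(packed)) {
            def.write(original);
        }
        // Decompress; InflaterInputStream expects the zlib wrapper by default
        ByteArrayOutputStream unpacked = new ByteArrayOutputStream();
        try (InflaterInputStream inf = new InflaterInputStream(
                new ByteArrayInputStream(packed.toByteArray()))) {
            int b;
            while ((b = inf.read()) != -1) {
                unpacked.write(b);
            }
        }
        System.out.println(unpacked.toString(StandardCharsets.UTF_8.name())); // hello zlib
    }
}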