Problem
I am trying to encode the contents of doc/pdf files as a Base64 string in Java.
The encoded string is almost double the length of the original (115k -> 230k).
Encoding the same file contents in Python/PHP or any online tool gives only a one-third increase (115k -> 154k).
What causes this increase in size in Java, and is there a way to get the same result as the other sources?
Code
import java.util.Base64;
...
//String content;
System.out.println(content.length());
String encodedStr = new String(Base64.getEncoder().encode(content.getBytes()));
System.out.println(encodedStr.length());
String urlEncodedStr = new String(Base64.getUrlEncoder().encode(content.getBytes()));
System.out.println(urlEncodedStr.length());
String mimeEncodedStr = new String(Base64.getMimeEncoder().encode(content.getBytes()));
System.out.println(mimeEncodedStr.length());
Output
For pdf file
115747
230816
230816
236890
For doc file
13685
26392
26392
27086
First, never use bare new String(bytes) or a bare getBytes() call: both silently use the platform-default charset. Second, pass an explicit encoding to String.getBytes(String) (e.g. content.getBytes(encoding)). For example,
String encodedStr = Base64.getEncoder()
.encodeToString(content.getBytes("UTF-8"));
or
String encodedStr = Base64.getEncoder()
.encodeToString(content.getBytes("US-ASCII"));
Related
I'm trying to convert a file from UTF-8 to UTF-16 with a Java application
But my output turned out to be like this
蓘Ꟙ괠��Ꟙ돘ꨊ䥎潴楦楣慴楯渮瑩瑬攮佲摥牁摤敤乯瑩晩捡瑩潮偬畧楮㷘께뇛賘꼠���藙蘊啉乯瑩晩捡瑩潮慢敬⹏牤敲䅤摥摎潴楦楣慴楯湐汵杩渽��藘귘뗙裙萠��藘꿛賘뇛賘ꨠ
Eventually, the output should be the same
utf8= سلام utf16=\u0633\u0644\u0627\u0645
import java.io.*;
class WriteUTF8Data {
public static void main(String[] args) throws IOException {
System.setProperty("file.encoding","UTF-8");
byte[] inbytes = new byte[1024];
FileInputStream fis = new FileInputStream("/home/mehrad/Desktop/PerkStoreNotification(1).properties");
fis.read(inbytes);
FileOutputStream fos = new FileOutputStream("/home/mehrad/Desktop/PerkStoreNotification(2).properties");
String in = new String(inbytes, "UTF16");
fos.write(in.getBytes());
}
}
You're currently converting from UTF-16 into whatever your system default encoding is. If you want to convert from UTF-8, you need to specify that when you're converting the binary data. There are other issues with your code, though: you're assuming that InputStream.read fills the whole buffer, and that 1024 bytes is all that's in the file. You'd probably be better off using a Reader and a Writer, looping round, reading into a char array and then writing the relevant part of that array to the writer.
Here's some sample code that does that. It may well not be the best way of doing it these days, but it should at least work:
import java.io.*;
import java.nio.charset.*;
import java.nio.file.*;
public class ConvertUtf8ToUtf16 {
public static void main(String[] args) throws IOException {
Path inputPath = Paths.get(args[0]);
Path outputPath = Paths.get(args[1]);
char[] buffer = new char[4096];
// UTF-8 is actually the default for Files.newBufferedReader,
// but let's be explicit.
try (Reader reader = Files.newBufferedReader(inputPath, StandardCharsets.UTF_8)) {
try (Writer writer = Files.newBufferedWriter(outputPath, StandardCharsets.UTF_16)) {
int charsRead;
while ((charsRead = reader.read(buffer)) != -1) {
writer.write(buffer, 0, charsRead);
}
}
}
}
}
First of all, the answer by Jon Skeet is correct and will work. The problem with your code is that you create a String from the incoming bytes using the wrong encoding: the bytes were produced as UTF-8, but you try to build a String from them as UTF-16, and that's why you get garbled output.

Java keeps Strings internally in its own encoding (UTF-16). When you have a String, you can tell Java to produce bytes from it in whatever charset you want, so for the same String, getBytes("UTF-8") and getBytes("UTF-16") produce different byte sequences. If you read your original content and know that it is UTF-8, then you need to create the String as UTF-8: String inString = new String(inbytes, "UTF-8"); and then, when writing, produce your byte array from that String: fos.write(inString.getBytes("UTF-16"));

Also, I would suggest a tool that will help you understand the internal workings of Strings: a utility that converts any String into a Unicode escape sequence and vice versa.
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library that contains this utility is called MgntUtils and can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and javadoc. Here is the javadoc for the class StringUnicodeEncoderDecoder, and here is a link to an article that describes the MgntUtils open-source library: Open Source Java library with stack trace filtering, Silent String parsing, Unicode converter and Version comparison
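Putting that fix into the question's code gives the following minimal sketch (my adaptation, not code from either answer; as Jon Skeet noted, it still assumes one read() call is enough for this small file):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

class WriteUTF16Data {
    public static void main(String[] args) throws IOException {
        byte[] inbytes = new byte[1024];
        try (FileInputStream fis = new FileInputStream("/home/mehrad/Desktop/PerkStoreNotification(1).properties")) {
            int len = fis.read(inbytes); // may be less than the file size
            try (FileOutputStream fos = new FileOutputStream("/home/mehrad/Desktop/PerkStoreNotification(2).properties")) {
                // Decode with the encoding the file actually uses (UTF-8),
                // then encode the String as UTF-16 for the output file.
                String in = new String(inbytes, 0, len, "UTF-8");
                fos.write(in.getBytes("UTF-16"));
            }
        }
    }
}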
After a week of work I designed a binary file format and wrote a Java reader for it. It's just an experiment, and it works fine unless I use the GZip compression function.
I called my binary type MBDF (Minimal Binary Database Format), and it can store 8 different types:
Integer (there is no separate byte, short or long type, since integers are stored in a variable-length encoding: bigger numbers take more space)
Float-32 (32-bits floating point format, like java's float type)
Float-64 (64-bits floating point format, like java's double type)
String (A string in UTF-16 format)
Boolean
Null (Just specifies a null value)
Array (Something like java's ArrayList<Object>)
Compound (A String - Object map)
I used this data as test data:
COMPOUND {
float1: FLOAT_32 3.3
bool2: BOOLEAN true
float2: FLOAT_64 3.3
int1: INTEGER 3
compound1: COMPOUND {
xml: STRING "two length compound"
int: INTEGER 23
}
string1: STRING "Hello world!"
string2: STRING "3"
arr1: ARRAY [
STRING "Hello world!"
INTEGER 3
STRING "3"
FLOAT_32 3.29
FLOAT_64 249.2992
BOOLEAN true
COMPOUND {
str: STRING "one length compound"
}
BOOLEAN false
NULL null
]
bool1: BOOLEAN false
null1: NULL null
}
The xml key in a compound does matter!!
I made a file from it using this Java code:
MBDFFile.writeMBDFToFile(
"/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf",
b.makeMBDF(false)
);
Here, the variable b is an MBDFBinary object containing all the data given above. The makeMBDF function generates the ISO 8859-1 encoded string, and if the given boolean is true, it compresses that string using GZip. Then, when writing, an extra information character is added at the beginning of the file, describing how to read the file back.
Then, after writing the file, I read it back into Java and parse it:
MBDF mbdf = MBDFFile.readMBDFFromFile("/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf");
System.out.println(mbdf.getBinaryObject().parse());
This prints exactly the information mentioned above.
Then I try to use compression:
MBDFFile.writeMBDFToFile(
"/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf",
b.makeMBDF(true)
);
I do exactly the same to read it back as I did with the uncompressed file, which should work. It prints this information:
COMPOUND {
float1: FLOAT_32 3.3
bool2: BOOLEAN true
float2: FLOAT_64 3.3
int1: INTEGER 3
compound1: COMPOUND {
xUT: STRING 'two length compound'
int: INTEGER 23
}
string1: STRING 'Hello world!'
string2: STRING '3'
arr1: ARRAY [
STRING 'Hello world!'
INTEGER 3
STRING '3'
FLOAT_32 3.29
FLOAT_64 249.2992
BOOLEAN true
COMPOUND {
str: STRING 'one length compound'
}
BOOLEAN false
NULL null
]
bool1: BOOLEAN false
null1: NULL null
}
Comparing it to the initial information, the name xml changed into xUT for some reason...
After some research I found small differences in the binary data between before and after the compression: patterns such as 110011 change into 101010.
When I make the name longer, like xmldm, it is parsed back as xmldm correctly.
So far I have only seen the problem occur with three-character names.
Directly compressing and decompressing the generated string (without saving it to a file and reading it back) does work, so the bug may be caused by the file encoding.
As far as I know, the string output is in ISO 8859-1 format, but I couldn't get the file encoding right: when the file is read back, all the characters are read as ISO 8859-1 characters.
I have a few guesses about the cause, but I don't know how to test them:
The GZip output has a different encoding than the uncompressed output, causing small differences when storing to a file.
The file is stored in UTF-8 format, ignoring the instruction to use ISO 8859-1 encoding (I don't know how to explain it better :) )
There is a small bug in the Java GZip libraries.
But which one is true? And if none of them is, what is the real cause of this bug?
I haven't been able to figure it out.
The MBDFFile class, reading and storing the files:
/* MBDFFile.java */
package com.redgalaxy.mbdf;
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class MBDFFile {
public static MBDF readMBDFFromFile(String filename) throws IOException {
// FileInputStream is = new FileInputStream(filename);
// InputStreamReader isr = new InputStreamReader(is, "ISO-8859-1");
// BufferedReader br = new BufferedReader(isr);
//
// StringBuilder builder = new StringBuilder();
//
// String currentLine;
//
// while ((currentLine = br.readLine()) != null) {
// builder.append(currentLine);
// builder.append("\n");
// }
//
// builder.deleteCharAt(builder.length() - 1);
//
//
// br.close();
Path path = Paths.get(filename);
byte[] data = Files.readAllBytes(path);
return new MBDF(new String(data, "ISO-8859-1"));
}
private static void writeToFile(String filename, byte[] txt) throws IOException {
// BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
//// FileWriter writer = new FileWriter(filename);
// writer.write(txt.getBytes("ISO-8859-1"));
// writer.close();
// PrintWriter pw = new PrintWriter(filename, "ISO-8859-1");
FileOutputStream stream = new FileOutputStream(filename);
stream.write(txt);
stream.close();
}
public static void writeMBDFToFile(String filename, MBDF info) throws IOException {
writeToFile(filename, info.pack().getBytes("ISO-8859-1"));
}
}
The pack function generates the final string for the file, in ISO 8859-1 format.
For all the other code, see my MBDF GitHub repository.
The commented-out code shows what I have already tried.
My workspace:
- Macbook Air '11 (High Sierra)
- IntelliJ IDEA Community 2017.3
- JDK 1.8
I hope this is enough information; this seemed like the only way to make clear what I'm doing and what exactly isn't working.
Edit: MBDF.java
/* MBDF.java */
package com.redgalaxy.mbdf;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
public class MBDF {
private String data;
private InfoTag tag;
public MBDF(String data) {
this.tag = new InfoTag((byte) data.charAt(0));
this.data = data.substring(1);
}
public MBDF(String data, InfoTag tag) {
this.tag = tag;
this.data = data;
}
public MBDFBinary getBinaryObject() throws IOException {
String uncompressed = data;
if (tag.isCompressed) {
uncompressed = GZipUtils.decompress(data);
}
Binary binary = getBinaryFrom8Bit(uncompressed);
return new MBDFBinary(binary.subBit(0, binary.getLen() - tag.trailing));
}
public static Binary getBinaryFrom8Bit(String s8bit) {
try {
byte[] bytes = s8bit.getBytes("ISO-8859-1");
return new Binary(bytes, bytes.length * 8);
} catch( UnsupportedEncodingException ignored ) {
// This is not gonna happen because encoding 'ISO-8859-1' is always supported.
return new Binary(new byte[0], 0);
}
}
public static String get8BitFromBinary(Binary binary) {
try {
return new String(binary.getByteArray(), "ISO-8859-1");
} catch( UnsupportedEncodingException ignored ) {
// This is not gonna happen because encoding 'ISO-8859-1' is always supported.
return "";
}
}
/*
* Adds leading zeroes to the binary string, so that the final amount of bits is 16
*/
private static String addLeadingZeroes(String bin, boolean is16) {
int len = bin.length();
long amount = (long) (is16 ? 16 : 8) - len;
// Create zeroes and append binary string
StringBuilder zeroes = new StringBuilder();
for( int i = 0; i < amount; i ++ ) {
zeroes.append(0);
}
zeroes.append(bin);
return zeroes.toString();
}
public String pack(){
return tag.getFilePrefixChar() + data;
}
public String getData() {
return data;
}
public InfoTag getTag() {
return tag;
}
}
This class contains the pack() method. data is already compressed here (if it should be).
For the other classes, please see the GitHub repository; I don't want to make my question too long.
Solved it myself!
It turned out to be the reading and writing system. When I exported a file, I built a string using the ISO 8859-1 table to turn bytes into characters, and then wrote that string to a text file, which is UTF-8. The big problem was that I used FileWriter instances to write it, and those are meant for text files.
Reading used the inverse system: the complete file was read into memory as a string (memory consuming!!) and then decoded.
I hadn't realized that a file is just binary data, and that text is only one specific interpretation of those bytes; ISO 8859-1 and UTF-8 are two such interpretations. I had problems with UTF-8 because it splits some characters into two bytes, which I couldn't control...
My solution was to use streams. Java has FileInputStream and FileOutputStream, which can be used for reading and writing binary files. I hadn't used the streams because I thought there was no big difference ("files are text, so what's the problem?"), but there is... I implemented this (by writing a new, similar library), and I'm now able to pass any input stream to the decoder and any output stream to the encoder. To make uncompressed files, you pass a FileOutputStream; GZipped files can use a GZIPOutputStream wrapping a FileOutputStream; and if someone wants the binary data as an in-memory array, a ByteArrayOutputStream can be used. The same rules apply to reading, using the InputStream variants of those streams.
No more UTF-8 or ISO 8859-1 problems, and it works, even with GZip!
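For illustration, here is a minimal sketch of that stream-based design (not the actual MBDF code; the class and method names are made up). The payload is written and read as raw bytes, with GZip optionally layered on top, so no charset is ever involved and the xml -> xUT corruption cannot happen:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MbdfStreams {
    // Write raw payload bytes, optionally GZip-compressed.
    static void write(Path file, byte[] payload, boolean compress) throws IOException {
        OutputStream out = Files.newOutputStream(file);
        if (compress) {
            out = new GZIPOutputStream(out);
        }
        try (OutputStream o = out) {
            o.write(payload);
        }
    }

    // Read the payload back; the compressed flag would come from the info tag.
    static byte[] read(Path file, boolean compressed) throws IOException {
        InputStream in = Files.newInputStream(file);
        if (compressed) {
            in = new GZIPInputStream(in);
        }
        try (InputStream i = in; ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = i.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        }
    }
}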
I'm writing a Java program that saves data to UTF8 text files. However, I'd also like to provide the option to save to IBM437 for compatibility with an old program that uses the same sort of data files.
How can I check to see if the data the user is trying to save isn't representable in IBM437? At the moment the file saves without complaining but results in unusual characters being replaced with question marks.
I'd prefer it if I could show a warning to the user that the data they are saving isn't supported in IBM437. The user could then have the option of manually replacing characters with the nearest ASCII equivalent.
Current code for saving is:
String encoding = "UTF-8";
if (forceLegacySupport)
{
// Force character encoding to IBM437
encoding = "IBM437";
}
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(saveFile.getAbsoluteFile()), encoding));
IOController.writeFileToDisk(bw);
bw.close();
As mentioned by JB Nizet in the comments, you can use a charset encoder to detect characters that are not representable before saving.
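A minimal sketch of that check (my code, not from the comments): CharsetEncoder.canEncode reports whether an entire string can be mapped to the target charset, so you can warn the user before writing.

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Ibm437Check {
    public static void main(String[] args) {
        CharsetEncoder encoder = Charset.forName("IBM437").newEncoder();
        String data = "日本語"; // sample text that IBM437 cannot represent
        if (!encoder.canEncode(data)) {
            // Warn the user instead of silently writing '?' replacement characters.
            System.out.println("Some characters cannot be represented in IBM437.");
        }
    }
}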
And for creating the text/String as UTF-8, just a suggestion from my end:
public static char[] cookie = "HEADER_COOKIE".toCharArray();

// Copy the cookie characters into a byte array (this assumes they are all ASCII).
byte[] cookieInBytes = new byte[cookie.length];
for (int i = 0; i < cookie.length; i++) {
    cookieInBytes[i] = (byte) cookie[i];
}
String headerStr = new String(cookieInBytes, StandardCharsets.UTF_8);
In Java, how can I output UTF-8 escape sequences as a real string?
我们
\u6211\u4eec
String str = new String("\u6211\u4eec");
System.out.println(str); // still outputs \u6211\u4eec, but I expect 我们 as the output
-----
String tmp = request.getParameter("tag");
System.out.println("request:"+tmp);
System.out.println("character set :"+request.getCharacterEncoding());
String tmp1 = new String("\u6211\u4eec");
System.out.println("string equal:"+(tmp.equalsIgnoreCase(tmp1)));
String tag = new String(tmp);
System.out.println(tag);
request:\u6211\u4eec
character set :UTF-8
string equal:false
\u6211\u4eec
From the output, the value from the request looks identical to the string value of tmp1, so why does equalsIgnoreCase return false?
Did you try to display just one of them? For example:
String str = new String("\u6211");
System.out.println(str);
I bet there is a problem in how you create that string.
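If the string really contains the literal twelve characters \u6211\u4eec (which the string equal:false output in the question suggests), the escapes were never interpreted: javac only processes \uXXXX escapes in source code, so text arriving at runtime, e.g. from a request parameter, must be unescaped manually. A rough sketch of such a decoder (my code, assuming well-formed escapes):

public class UnicodeUnescape {
    // Replace each literal backslash-uXXXX sequence with the character it names.
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '\\' && i + 6 <= s.length() && s.charAt(i + 1) == 'u') {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(s.charAt(i++));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("\\u6211\\u4eec")); // prints 我们
    }
}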
Java Strings are encoded in UTF-16 internally. I don't see any problem in your code; I would suspect the problem comes from your console not displaying the content of the String correctly.
If you are using Eclipse, change your console encoding to UTF-8 here:
Eclipse > Preferences > General > Workspace > Text file encoding
I am trying to read Japanese string values from a .properties file with the code:
Properties properties = new Properties();
InputStream in = MyClass.class.getResourceAsStream(fileName);
properties.load(in);
The problem is apparently with the above code not recognizing the encoding of the file. It reads only the English portions and replaces the Japanese characters with question marks. Incidentally, this is not a problem with displaying Japanese text in Swing or reading/writing a UTF-8 encoded .properties file in an editor. Both things work.
Is the Properties class encoding-unaware? Is there an encoding-aware alternative that does not violate the security manager settings normally found in applets?
In my opinion you have to convert the Japanese characters to Java Unicode escape sequences.
For example, this is the way I did it with Vietnamese:
Currency_Converter = Chuyen doi tien te
Enter_Amount = Nh\u1eadp v\u00e0o s\u1ed1 l\u01b0\u1ee3ng
Source_Currency = \u0110\u01a1n v\u1ecb g\u1ed1c
Target_Currency = \u0110\u01a1n v\u1ecb chuy\u1ec3n
Converted_Amount = K\u1ebft qu\u1ea3
Convert = Chuy\u1ec3n \u0111\u1ed5i
Alert_Mess = Vui l\u00f2ng nh\u1eadp m\u1ed9t s\u1ed1 h\u1ee3p l\u1ec7
Alert_Title = Thong bao
load expects ISO 8859-1 encoding, as noted in the docs.
In general you'll want to use native2ascii to convert property files, load using a reader, or use XML where you can specify encoding.
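For example, a sketch of the reader approach, reusing MyClass and fileName from the question (Properties.load(Reader) has existed since Java 6):

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

Properties properties = new Properties();
try (InputStream in = MyClass.class.getResourceAsStream(fileName);
     Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
    // load(Reader) respects the reader's encoding, unlike load(InputStream).
    properties.load(reader);
}

Alternatively, native2ascii -encoding UTF-8 input.properties output.properties rewrites the file into the escaped form shown in the previous answer (the tool shipped with the JDK through Java 8).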
It is possible to read any language from a properties file. Get the value by key from the Properties object, then create a new string with new String(keyValue.getBytes("ISO-8859-1"), "UTF-8"); that re-decodes the raw bytes as UTF-8 and gives you the correct string.
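// Note: getPropertiesByLocale is the answerer's helper (not shown) that loads the properties file for the given locale.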
public static String getLocalizedPropertyValue(String fileName, String key, Locale locale) throws UnsupportedEncodingException {
Properties props = getPropertiesByLocale(fileName, locale);
String keyValue = props.getProperty(key);
return keyValue != null ? new String(keyValue.getBytes("ISO-8859-1"), "UTF-8") : "";
}
Have you considered using ResourceBundle?
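On Java 9 and later, a sketch of that route works directly with UTF-8 .properties files, because properties-based resource bundles are read as UTF-8 by default (on Java 8 and earlier they are still ISO 8859-1). The bundle base name messages is hypothetical, standing for a messages_ja.properties file on the classpath:

import java.util.Locale;
import java.util.ResourceBundle;

public class BundleDemo {
    public static void main(String[] args) {
        // Resolves messages_ja.properties for a Japanese locale.
        ResourceBundle bundle = ResourceBundle.getBundle("messages", Locale.JAPANESE);
        System.out.println(bundle.getString("greeting")); // "greeting" is a hypothetical key
    }
}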