Java buffered Base64 encoder for streams

I have lots of PDF files whose contents I need to encode in Base64. I have an Akka app which fetches the files as streams and distributes them to many workers that encode the files and return the Base64 string for each one. I have a basic solution for the encoding:
import org.apache.commons.codec.binary.Base64InputStream;
...
Base64InputStream b64IStream = null;
InputStreamReader reader = null;
BufferedReader br = null;
StringBuilder sb = new StringBuilder();
try {
    b64IStream = new Base64InputStream(input, true);
    reader = new InputStreamReader(b64IStream);
    br = new BufferedReader(reader);
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line);
    }
} finally {
    // close the outermost wrapper first; it closes the wrapped streams too
    if (br != null) {
        br.close();
    } else if (reader != null) {
        reader.close();
    } else if (b64IStream != null) {
        b64IStream.close();
    }
}
It works, but I would like to know the best way to encode the files using a buffer, and whether there is a faster alternative.
I tested some other approaches such as:
Base64.getEncoder
sun.misc.BASE64Encoder
Base64.encodeBase64
javax.xml.bind.DatatypeConverter.printBase64
com.google.guava.BaseEncoding.base64
They are faster, but they need the entire file in memory, correct? Also, I do not want to block other threads while encoding one PDF file.
Any input is really helpful. Thank you!

Fun fact about Base64: It takes three bytes, and converts them into four letters. This means that if you read binary data in chunks that are divisible by three, you can feed the chunks to any Base64 encoder, and it will encode it in the same way as if you fed it the entire file.
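To see that property concretely, here is a quick sketch using the Java 8 encoder (the 6-byte input is an arbitrary example):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

byte[] data = "abcdef".getBytes(StandardCharsets.US_ASCII);
Base64.Encoder enc = Base64.getEncoder();
String whole = enc.encodeToString(data);
String chunked = enc.encodeToString(Arrays.copyOfRange(data, 0, 3))
        + enc.encodeToString(Arrays.copyOfRange(data, 3, 6));
System.out.println(whole.equals(chunked)); // true: 3-byte chunks concatenate cleanly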
Now, if you want your output stream to just be one long line of Base64 data - which is perfectly legal - then all you need to do is something along the lines of:
private static final int BUFFER_SIZE = 3 * 1024;

try (BufferedInputStream in = new BufferedInputStream(input, BUFFER_SIZE)) {
    Base64.Encoder encoder = Base64.getEncoder();
    StringBuilder result = new StringBuilder();
    byte[] chunk = new byte[BUFFER_SIZE];
    int len = 0;
    int n;
    // Fill each chunk completely before encoding it: a single read() may
    // return fewer bytes than requested even before EOF, and encoding a
    // short chunk mid-stream would insert padding and corrupt the output.
    while ((n = in.read(chunk, len, BUFFER_SIZE - len)) != -1) {
        len += n;
        if (len == BUFFER_SIZE) {
            result.append(encoder.encodeToString(chunk));
            len = 0;
        }
    }
    if (len > 0) {
        result.append(encoder.encodeToString(Arrays.copyOf(chunk, len)));
    }
}
This means that only the last chunk may have a length that is not divisible by three and will therefore contain the padding characters.
The above example is with Java 8 Base64, but you can really use any encoder that takes a byte array of an arbitrary length and returns the base64 string of that byte array.
This means that you can play around with the buffer size as you wish.
If you want your output to be MIME compatible, however, you need to have the output separated into lines. In this case, I would set the chunk size in the above example to something that, when multiplied by 4/3, gives you a round number of lines. For example, if you want to have 64 characters per line, each line encodes 64 / 4 * 3, which is 48 bytes. If you encode 48 bytes, you'll get one line. If you encode 480 bytes, you'll get 10 full lines.
So modify the above BUFFER_SIZE to something like 4800. Instead of Base64.getEncoder(), use Base64.getMimeEncoder(64, new byte[] { 13, 10 }). Then each chunk except the last encodes to 100 full-sized lines. Note that the MIME encoder does not emit a line separator after its final line, so you will need a result.append("\r\n") between chunks in the while loop.
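Putting that together, a sketch of the MIME-compatible variant could look like this (the encodeMime name is illustrative; input is assumed to be the source stream, as above):
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Base64;

static String encodeMime(InputStream input) throws IOException {
    final int CHUNK = 4800; // 4800 bytes encode to exactly 100 lines of 64 characters
    Base64.Encoder encoder = Base64.getMimeEncoder(64, new byte[] { 13, 10 });
    StringBuilder result = new StringBuilder();
    try (BufferedInputStream in = new BufferedInputStream(input, CHUNK)) {
        byte[] chunk = new byte[CHUNK];
        int len = 0;
        int n;
        while ((n = in.read(chunk, len, CHUNK - len)) != -1) {
            len += n;
            if (len == CHUNK) {
                // the MIME encoder omits the trailing separator, so add one between chunks
                if (result.length() > 0) {
                    result.append("\r\n");
                }
                result.append(encoder.encodeToString(chunk));
                len = 0;
            }
        }
        if (len > 0) {
            if (result.length() > 0) {
                result.append("\r\n");
            }
            result.append(encoder.encodeToString(Arrays.copyOf(chunk, len)));
        }
    }
    return result.toString();
}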

Related

Decompressing a buffer from a SOAP response

This might be a dumb problem but I'm really not able to figure it out.
I'm making a SOAP request in SoapUI that retrieves a GZIP-compressed buffer for a certain file.
My issue is that I'm not able to decompress the buffer obtained (I'm not that experienced with Java). The only results I obtained until now are some random 10-11 character strings ([B@6d03e736) or errors like "Not in GZIP format".
The buffer looks like this: "1f8b0800000000000000a58e4d0ac2400c85f78277e811f2e665329975bbae500f2022dd2978ff95715ae82cdcf9415efec823c6710247582d5965c32c65aab0f5fc0a5204c415855e7c190ef61b34710bcdc7486d2bab8a7a4910d022d5e107d211ed345f2f37a103da2ddb1f619ab8acefe7fdb1beb6394998c7dfbde3dcac3acf3f399f3eeae152012e010000"
I've looked in many similar threads and only ran into examples where someone takes a random string, compresses it and then decompresses it (mostly through GZIPInputStream/GZIPOutputStream).
String stringBuffer = "buffer_from_above";
byte[] buffer = stringBuffer.getBytes(StandardCharsets.ISO_8859_1); // also tried UTF-8
System.out.println(decompress(buffer));

public static String decompress(final byte[] compressed) throws IOException {
    final StringBuilder outStr = new StringBuilder();
    if ((compressed == null) || (compressed.length == 0)) {
        return "";
    }
    if (isCompressed(compressed)) {
        final GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(compressed));
        final BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(gis, "UTF-8"));
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            outStr.append(line);
        }
    } else {
        // note: appending a byte[] to a StringBuilder appends its toString(),
        // i.e. something like "[B@6d03e736" - this branch produces the "random string"
        outStr.append(compressed);
    }
    return outStr.toString();
}
I would highly appreciate any tips or advice on this matter.
Thanks for your time, and I wish you a great day!
Decompress function stolen from this thread: compression and decompression of string data in java
If that's really the string, then you need to convert the hexadecimal to binary before feeding it to the gzip decoder. See Convert a string representation of a hex dump to a byte array using Java?
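For example, a minimal sketch along those lines (the decompressHex name is just for illustration; it assumes the response body really is the hex string shown above and that the decompressed payload is UTF-8 text):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

public static String decompressHex(String hex) throws IOException {
    // 1. convert the hex dump to the raw compressed bytes
    byte[] compressed = new byte[hex.length() / 2];
    for (int i = 0; i < compressed.length; i++) {
        compressed[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    // 2. gunzip the raw bytes and decode the result once, at the end
    try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(compressed));
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        byte[] buf = new byte[4096];
        int n;
        while ((n = gis.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toString("UTF-8");
    }
}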

Java Reading large files into byte array chunk by chunk

So I've been trying to make a small program that reads a file into a byte array, then turns that byte array into hex, then binary. It will then play with the binary values (I haven't thought of what to do when I get to this stage) and then save it as a custom file.
I studied a lot of internet code and I can turn a file into a byte array and into hex, but the problem is I can't turn huge files into byte arrays (out of memory).
This is the code that is not a complete failure:
public void rundis(Path pp) {
    byte[] bb = null;
    try {
        bb = Files.readAllBytes(pp); // Files.toByteArray(pathhold);
        System.out.println("byte array made");
    } catch (Exception e) {
        e.printStackTrace();
    }
    // check for null before dereferencing, and use && rather than ||
    if (bb != null && bb.length != 0) {
        System.out.println("byte array filled");
        // send to method to turn into hex
    } else {
        System.out.println("byte array NOT filled");
    }
}
I know how the process should go, but I don't know how to code it properly.
The process, if you are interested:
Input the file using File
Read the file chunk by chunk into a byte array, e.g. each byte array holds 600 bytes
Send that chunk to be turned into a hex value --> Integer.toHexString
Send that hex value chunk to be made into a binary value --> Integer.toBinaryString
Mess around with the binary value
Save to custom file line by line
Problem: I don't know how to turn a huge file into a byte array chunk by chunk to be processed.
Any and all help will be appreciated, thank you for reading :)
To chunk your input use a FileInputStream:
Path pp = FileSystems.getDefault().getPath("logs", "access.log");
final int BUFFER_SIZE = 1024 * 1024; // this is in bytes
FileInputStream fis = new FileInputStream(pp.toFile());
byte[] buffer = new byte[BUFFER_SIZE];
int read = 0;
while ((read = fis.read(buffer)) > 0) {
    // call your other methods here...
}
fis.close();
To stream a file, you need to step away from Files.readAllBytes(). It's a nice utility for small files, but as you noticed not so much for large files.
In pseudocode it would look something like this:
while there are more bytes available
read some bytes
process those bytes
(write the result back to a file, if needed)
In Java, you can use a FileInputStream to read a file byte by byte or chunk by chunk. Let's say we want to write back our processed bytes. First we open the files:
FileInputStream is = new FileInputStream(new File("input.txt"));
FileOutputStream os = new FileOutputStream(new File("output.txt"));
We need the FileOutputStream to write back our results - we don't want to just drop our precious processed data, right? Next we need a buffer which holds a chunk of bytes:
byte[] buf = new byte[4096];
How many bytes is up to you; I kinda like chunks of 4096 bytes. Then we need to actually read some bytes:
int read = is.read(buf);
This will read up to buf.length bytes, store them in buf, and return the number of bytes read in this call. Then we process the bytes:
//Assuming the processing function looks like this:
//byte[] process(byte[] data, int bytes);
byte[] ret = process(buf, read);
process() in the above example is your processing method. It takes a byte array and the number of bytes it should process, and returns the result as a byte array.
Last, we write the result back to a file:
os.write(ret);
We have to execute this in a loop until there are no bytes left in the file, so let's write a loop for it:
int read = 0;
while ((read = is.read(buf)) > 0) {
    byte[] ret = process(buf, read);
    os.write(ret);
}
and finally close the streams
is.close();
os.close();
And that's it. We processed the file in 4096-byte chunks and wrote the result back to a file. It's up to you what to do with the result: you could send it over TCP, drop it if it's not needed, or even read from TCP instead of a file; the basic logic is the same.
This still needs some proper error handling to deal with missing files or wrong permissions, but that's up to you to implement.
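As a hedged sketch (nothing new, just the pieces above assembled), try-with-resources takes care of closing both streams even if processing throws:
import java.io.*;

try (FileInputStream is = new FileInputStream(new File("input.txt"));
     FileOutputStream os = new FileOutputStream(new File("output.txt"))) {
    byte[] buf = new byte[4096];
    int read;
    while ((read = is.read(buf)) > 0) {
        os.write(process(buf, read)); // process as defined below
    }
}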
An example implementation of the process method:
// returns the hex representation of the first `length` bytes
public static byte[] process(byte[] bytes, int length) {
    final char[] hexchars = "0123456789ABCDEF".toCharArray();
    // hex digits are plain ASCII, so each one fits in a single byte
    byte[] ret = new byte[length * 2];
    for (int i = 0; i < length; ++i) {
        int b = bytes[i] & 0xFF;
        ret[i * 2] = (byte) hexchars[b >>> 4];
        ret[i * 2 + 1] = (byte) hexchars[b & 0x0F];
    }
    return ret;
}

efficient text loading without line separators

So I have a fairly big (4 MB) txt file with data for a monolingual dictionary. Because the explanations of the words are split across multiple lines, I can't read it line by line. On the other hand, I have "###" separators which I can use.
My question is: what is the most efficient way to load this text into a map in Java/Android?
Load the file into a single String and use the split("###") method on it. It gives you an array of strings split by your separator. 4 MB is OK to load into memory at once.
byte[] encoded = Files.readAllBytes(Paths.get(filePath));
String fileContents = new String(encoded, encoding);
String[] lines = fileContents.split("###");
Update: not sure you can use that code to read a file on Android - it's for Java SE 7. On Android you can use code like this:
FileInputStream fis = openFileInput(filePath);
StringBuilder fileContent = new StringBuilder();
byte[] buffer = new byte[1024];
int n;
while ((n = fis.read(buffer)) != -1) {
    // note: this may split multi-byte characters at buffer boundaries
    fileContent.append(new String(buffer, 0, n));
}
String[] lines = fileContent.toString().split("###");
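From there, loading the entries into a map could look roughly like the sketch below. Note that the record layout (headword on the first line of each entry) is purely an assumption for illustration; adjust it to the actual file format:
import java.util.HashMap;
import java.util.Map;

Map<String, String> dictionary = new HashMap<>();
for (String entry : lines) {
    String trimmed = entry.trim();
    int firstBreak = trimmed.indexOf('\n');
    if (firstBreak < 0) {
        continue; // no headword/explanation split found; skip this entry
    }
    String word = trimmed.substring(0, firstBreak).trim();
    String explanation = trimmed.substring(firstBreak + 1).trim();
    dictionary.put(word, explanation);
}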

Search for a string in a large file and save its position in Java

I'm searching for a way to parse large files (about 5-10 GB) and find the positions (in bytes) of some recurring strings, as fast as possible.
I've tried to use a RandomAccessFile reader by doing something like below:
RandomAccessFile lecteurFichier = new RandomAccessFile(<MyFile>, "r");
while (currentPointeurPosition < lecteurFichier.length()) {
    char currentFileChar = (char) lecteurFichier.readByte();
    // Test each char for matching my string (by appending chars until I found my string)
    // and keep a trace of all found string's position
}
The problem is this code is too slow (maybe because I read byte by byte?).
I also tried the solution below, which is perfect in terms of speed, but I can't get my strings' positions.
FileInputStream is = new FileInputStream(fichier.getFile());
FileChannel f = is.getChannel();
ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
long len = 0;
while ((len = f.read(buf)) != -1) {
    buf.flip();
    String data = "";
    try {
        int old_position = buf.position();
        data = decoder.decode(buf).toString();
        // reset buffer's position to its original so it is not altered:
        buf.position(old_position);
    } catch (Exception e) {
        e.printStackTrace();
    }
    buf.clear();
}
f.close();
Does anyone have a better solution to propose?
Thank you in advance (and sorry for my spelling, I'm French).
Since your input data is encoded in an 8-bit encoding*, you can speed up the search by encoding the search string rather than decoding the file:
byte[] encoded = searchString.getBytes("ISO-8859-1");
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
int b;
long pos = -1;
while ((b = bis.read()) != -1) {
    pos++;
    if (encoded[0] == b) {
        // see if rest of string matches
    }
}
A BufferedInputStream should be pretty fast. Using ByteBuffer might be faster, but this is going to make the search logic more complicated because of the possibility of a string match that spans a buffer boundary.
Then there are various clever ways to optimize string searches that could be adapted to this situation ... where you are searching a stream of bytes / characters rather than an array of bytes / characters. The Wikipedia page on String Searching is a good place to start.
Note that since we are reading and matching in a byte-wise fashion, the position is just the count of bytes read (or skipped), so there is no need to use a random access file.
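As a hedged sketch of the "see if rest of string matches" part: BufferedInputStream supports mark()/reset(), so the loop above could be completed like this (resetting after each candidate so overlapping matches are not skipped):
while ((b = bis.read()) != -1) {
    pos++;
    if (encoded[0] == b) {
        bis.mark(encoded.length); // remember where we are; the readlimit covers the lookahead
        boolean match = true;
        for (int i = 1; i < encoded.length; i++) {
            int c = bis.read();
            if (c == -1 || (byte) c != encoded[i]) {
                match = false;
                break;
            }
        }
        if (match) {
            System.out.println("match at byte " + pos);
        }
        bis.reset(); // rewind to just after the first byte of the candidate
    }
}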
* In fact this trick will work with many multibyte encodings too.
Searching for a 'needle' in a 'haystack' is a well-studied problem - here's a related link on StackOverflow itself. I am sure Java implementations of the algorithms discussed are available too. Why not try some of them, to see if they fit the job?

Is there any way to get the size in bytes of a string in Java?

I need the size in bytes of each line in a file, so I can get a percentage of the file read. I already got the size of the file with file.length(), but how do I get each line's size?
final String hello_str = "Hello World";
hello_str.getBytes().length is the "byte size", i.e. the number of bytes (in the platform default encoding)
You need to know the encoding - otherwise it's a meaningless question. For example, "foo" is 6 bytes in UTF-16, but 3 bytes in ASCII. Assuming you're reading a line at a time (given your question) you should know which encoding you're using as you should have specified it when you started to read.
You can call String.getBytes(charset) to get the encoded representation of a particular string.
Do not just call String.getBytes() as that will use the platform default encoding.
Note that all of this is somewhat make-work... you've read the bytes, decoded them to text, then you're re-encoding them into bytes...
You probably use something like the following to read the file:
FileInputStream fis = new FileInputStream(path);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
    /* process line */
    /* report percentage */
}
You need to specify the encoding right at the beginning. If you don't, you should get UTF-8 on Android. It is the default, but that can be changed. I would assume that no device does that, though.
To repeat what the other answers already stated: the character count is not always the same as the byte count. The UTF encodings especially are tricky. There are currently 249,764 assigned Unicode characters, with room for over a million (WP), and UTF uses 1 to 4 bytes to be able to encode all of them. UTF-32 is the simplest case, since it always uses 4 bytes. UTF-8 does it dynamically, with 1 to 4 bytes; simple ASCII characters use just 1 byte. (source: UTF & BOM FAQ)
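A quick illustration of that difference, using an arbitrary accented string as the example:
import java.nio.charset.StandardCharsets;

String s = "héllo"; // 'é' (U+00E9) encodes to 2 bytes in UTF-8
System.out.println(s.length());                                // 5 characters
System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 6 bytes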
To get the number of bytes you can use e.g. line.getBytes("UTF-8").length (length is a field on arrays, not a method). One big disadvantage is that this is very inefficient, since it creates a copy of the String's internal array each time and throws it away afterwards. That is addressed as performance tip #1 at Android | Performance Tips.
It is also not 100% accurate in terms of actual bytes read from the file, for the following reasons:
UTF-16 text files, for example, often start with a special 2-byte BOM (Byte Order Mark) to signal whether they have to be interpreted little or big endian. Those 2 (UTF-8: 3, UTF-32: 4) bytes are not reported when you just look at the String you get from your reader. So you are already some bytes off here.
Turning every line of a file into a UTF-16 String will include those BOM bytes for each line. So getBytes will report 2 bytes too much for each line.
Line-ending characters are not part of the resulting line String. To make things worse, you have different ways of signaling the end of a line: usually the Unix-style '\n', which is only 1 character, or the Windows-style '\r''\n', which is two characters. The BufferedReader will simply skip those. Here your calculation is missing a variable number of bytes: from 1 byte for Unix/UTF-8 to 8 bytes for Windows/UTF-32.
The last two reasons would negate each other if you have Unix/UTF-16, but that is probably not the typical case. The effect of the error also depends on line length: if you have an error of 4 bytes for each line that is only 10 bytes long in total, your progress will be considerably wrong (if my math is right, your progress would be at 140% or 60% after the last line, depending on whether your calculation assumes -4 or +4 bytes per line).
That means, so far, that regardless of what you do you get no more than an approximation.
Getting the actual byte count could probably be done if you write your own special byte-counting Reader, but that would be quite a lot of work.
An alternative would be to use a custom InputStream that counts how many bytes are actually read from the underlying stream. That's not too hard to do, and it does not care about encodings.
The big disadvantage is that it does not increase linearly with the lines you read, since BufferedReader will fill its internal buffer and read lines from there, then read the next chunk from the file, and so on. If the buffer is large enough you are at 100% at the first line already. But I assume your files are big enough, or you would not want to find out about the progress.
This, for example, would be such an implementation. It works, but I can't guarantee that it is perfect. It won't work if streams use mark() and reset(). File reading should not do that, though.
static class CountingInputStream extends FilterInputStream {
    private long bytesRead;

    protected CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int result = super.read();
        if (result != -1) bytesRead += 1;
        return result;
    }

    @Override
    public int read(byte[] b) throws IOException {
        int result = super.read(b);
        if (result != -1) bytesRead += result;
        return result;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int result = super.read(b, off, len);
        if (result != -1) bytesRead += result;
        return result;
    }

    @Override
    public long skip(long n) throws IOException {
        long result = super.skip(n);
        if (result != -1) bytesRead += result;
        return result;
    }

    public long getBytesRead() {
        return bytesRead;
    }
}
Using the following code
File file = new File("mytestfile.txt");
int linesRead = 0;
long progress = 0;
long fileLength = file.length();
String line;
CountingInputStream cis = new CountingInputStream(new FileInputStream(file));
BufferedReader br = new BufferedReader(new InputStreamReader(cis, "UTF-8"), 8192);
while ((line = br.readLine()) != null) {
    long newProgress = cis.getBytesRead();
    if (progress != newProgress) {
        progress = newProgress;
        int percent = (int) ((progress * 100) / fileLength);
        System.out.println(String.format("At line: %4d, bytes: %6d = %3d%%", linesRead, progress, percent));
    }
    linesRead++;
}
System.out.println("Total lines: " + linesRead);
System.out.println("Total bytes: " + fileLength);
br.close();
I get output like
At line: 0, bytes: 8192 = 5%
At line: 82, bytes: 16384 = 10%
At line: 178, bytes: 24576 = 15%
....
At line: 1621, bytes: 155648 = 97%
At line: 1687, bytes: 159805 = 100%
Total lines: 1756
Total bytes: 159805
or in case of the same file UTF-16 encoded
At line: 0, bytes: 24576 = 7%
At line: 82, bytes: 40960 = 12%
At line: 178, bytes: 57344 = 17%
.....
At line: 1529, bytes: 303104 = 94%
At line: 1621, bytes: 319488 = 99%
At line: 1687, bytes: 319612 = 100%
Total lines: 1756
Total bytes: 319612
Instead of printing that, you could update your progress.
So, what is the best approach?
If you know that you have simple ASCII text in an encoding that uses only 1 byte for those characters: just use String#length() (and maybe add +1 or +2 for the line ending).
String#length() is fast and simple, and as long as you know what files you have, you should have no problems.
If you have international text where the simple approach won't work:
for smaller files where processing each line takes rather long: String#getBytes(); the longer processing one line takes, the lower the impact of the temporary arrays and their garbage collection. The inaccuracy should be within acceptable bounds. Just make sure not to crash if progress ends up above or below 100% at the end.
for larger files: the above approach. The larger the file, the better. Updating progress in 0.001% steps just slows things down. Decreasing the reader's buffer size would increase the accuracy, but it also decreases the read performance.
If you have enough time: write your own Reader that tells you the exact byte position. Maybe a combination of InputStreamReader and BufferedReader, since Reader already operates on characters. Android's implementation may help as a starting point.
If the file is an ASCII file, then you can use String.length();
otherwise it gets more complex.
Consider you have a string variable called hello_str:
final String hello_str = "Hello World";

// check character length
hello_str.length() // output will be 11

// check encoded sizes
final byte[] utf8Bytes = hello_str.getBytes("UTF-8");
utf8Bytes.length // output will be 11

final byte[] utf16Bytes = hello_str.getBytes("UTF-16");
utf16Bytes.length // output will be 24 (includes a 2-byte BOM)

final byte[] utf32Bytes = hello_str.getBytes("UTF-32");
utf32Bytes.length // output will be 44
