Issue with streaming formatted input stream to server - java

I am trying to write a "formatted" input stream to a tomcat servlet (with Guice).
The underlying problem is the following: I want to stream data from a database directly to a server. I load the data, convert it to JSON and upload it to the server. For performance reasons I don't want to write the JSON to a temporary file first; I want to bypass the hard drive and stream directly to the server.
EDIT: Similar to Sending a stream of documents to a Jersey @POST endpoint
But a comment on that answer says it is losing data, and I seem to have the same problem.
I wrote a "ModelInputStream" that
Loads the next model from the database when the previous is streamed
Writes one byte for the type (enum ordinal)
Writes 4 bytes for the length of the next byte array (int)
Writes a string (refId)
Writes 4 bytes for the length of the next byte array (int)
Writes the actual json
Repeat until all models are streamed
I also wrote a "ModelStreamReader" that knows that logic and reads accordingly.
When I test this directly it works fine, but once I create the ModelInputStream on the client side and feed the incoming input stream on the server into the ModelStreamReader, the actual JSON bytes are fewer than specified in the 4 bytes defining the length. I guessed this was due to deflating or compression.
I tried different content headers to disable compression etc., but nothing worked.
java.io.IOException: Unexpected length, expected 8586, received 7905
So on the client the JSON byte array is 8586 bytes long, and when it arrives at the server it is 7905 bytes long, which breaks the whole concept.
Also it seems that it does not really stream, but first caches the whole content that is returned from the input stream.
How would I need to adjust the calling code to get the result I described?
ModelInputStream
package *;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import ***.Daos;
import ***.IDatabase;
import ***.CategorizedEntity;
import ***.CategorizedDescriptor;
import ***.JsonExport;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
public class ModelInputStream extends InputStream {
private final Gson gson = new Gson();
private final IDatabase db;
private final Queue<CategorizedDescriptor> descriptors;
private byte[] buffer = new byte[0];
private int position = 0;
public ModelInputStream(IDatabase db, List<CategorizedDescriptor> descriptors) {
this.db = db;
this.descriptors = new LinkedList<>();
this.descriptors.addAll(descriptors);
}
@Override
public int read() throws IOException {
if (position == buffer.length) {
if (descriptors.size() == 0)
return -1;
loadNext();
position = 0;
}
return buffer[position++];
}
private void loadNext() throws IOException {
CategorizedDescriptor descriptor = descriptors.poll();
byte type = (byte) descriptor.getModelType().ordinal();
byte[] refId = descriptor.getRefId().getBytes();
byte[] json = getData(descriptor);
buildBuffer(type, refId, json);
}
private byte[] getData(CategorizedDescriptor d) {
CategorizedEntity entity = Daos.createCategorizedDao(db, d.getModelType()).getForId(d.getId());
JsonObject object = JsonExport.toJson(entity);
String json = gson.toJson(object);
return json.getBytes();
}
private void buildBuffer(byte type, byte[] refId, byte[] json) throws IOException {
buffer = new byte[1 + 4 + refId.length + 4 + json.length];
int index = put(buffer, 0, type);
index = put(buffer, index, asByteArray(refId.length));
index = put(buffer, index, refId);
index = put(buffer, index, asByteArray(json.length));
put(buffer, index, json);
}
private byte[] asByteArray(int i) {
return ByteBuffer.allocate(4).putInt(i).array();
}
private int put(byte[] array, int index, byte... bytes) {
for (int i = 0; i < bytes.length; i++) {
array[index + i] = bytes[i];
}
return index + bytes.length;
}
}
ModelStreamReader
package *;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import *.ModelType;
public class ModelStreamReader {
private InputStream stream;
public ModelStreamReader(InputStream stream) {
this.stream = stream;
}
public Model next() throws IOException {
int modelType = stream.read();
if (modelType == -1)
return null;
Model next = new Model();
next.type = ModelType.values()[modelType];
next.refId = readNextPart();
next.data = readNextPart();
return next;
}
private String readNextPart() throws IOException {
int length = readInt();
byte[] bytes = readBytes(length);
return new String(bytes);
}
private int readInt() throws IOException {
byte[] bytes = readBytes(4);
return ByteBuffer.wrap(bytes).getInt();
}
private byte[] readBytes(int length) throws IOException {
byte[] buffer = new byte[length];
int read = stream.read(buffer);
if (read != length)
throw new IOException("Unexpected length, expected " + length + ", received " + read);
return buffer;
}
public class Model {
public ModelType type;
public String refId;
public String data;
}
}
Calling Code
ModelInputStream stream = new ModelInputStream(db, getAll(db));
URL url = new URL("http://localhost:8080/ws/test/streamed");
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setDoOutput(true);
con.setRequestMethod("POST");
con.connect();
int read = -1;
while ((read = stream.read()) != -1) {
con.getOutputStream().write(read);
}
con.getOutputStream().flush();
System.out.println(con.getResponseCode());
System.out.println(con.getResponseMessage());
con.disconnect();
Server part (Jersey WebResource)
package *.webservice;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.List;
import java.util.UUID;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;
import *.ModelStreamReader;
import *.ModelStreamReader.Model;
@Path("test")
public class TestResource {
@POST
@Path("streamed")
public Response streamed(InputStream modelStream) throws IOException {
ModelStreamReader reader = new ModelStreamReader(modelStream);
writeDatasets(reader);
return Response.ok(new HashMap<>()).build();
}
private void writeDatasets(ModelStreamReader reader) throws IOException {
String commitId = UUID.randomUUID().toString();
File dir = new File("/opt/tests/streamed/" + commitId);
dir.mkdirs();
Model dataset = null;
while ((dataset = reader.next()) != null) {
File file = new File(dir, dataset.refId);
writeDataset(file, dataset.data);
}
}
private void writeDataset(File file, String data) {
try {
if (data == null)
file.createNewFile();
else
Files.write(file.toPath(), data.getBytes(Charset.forName("utf-8")));
} catch (IOException e) {
e.printStackTrace();
}
}
}

The bytes read have to be masked into the 0..255 range (see ByteArrayInputStream).
ModelInputStream
@Override
public int read() throws IOException {
...
return buffer[position++] & 0xff;
}
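The mask matters because Java's byte is signed: without it, the byte value 0xFF comes back from read() as -1 and looks like end-of-stream to the caller. A minimal standalone demonstration:
byte b = (byte) 0xFF; // the byte value -1
int wrong = b; // sign-extended to the int -1, indistinguishable from EOF
int right = b & 0xff; // masked into 0..255, yields 255
System.out.println(wrong + " vs " + right); // prints: -1 vs 255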
Finally this line has to be added to the calling code (for chunking):
...
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setChunkedStreamingMode(1024 * 1024);
...

I found the problem, which was of a totally different nature.
First, the input stream was not compressed at all. The bytes returned from read() have to be mapped into the 0..255 range instead of -128..127, because a -1 byte value otherwise looks like end-of-stream and interrupts the reading.
ModelInputStream
@Override
public int read() throws IOException {
...
return buffer[position++] + 128;
}
Second, the data has to be transferred chunked to be actually "streaming". Therefore the ModelStreamReader.readBytes(int) method must additionally be adjusted to:
ModelStreamReader
private byte[] readBytes(int length) throws IOException {
byte[] result = new byte[length];
int totalRead = 0;
int position = 0;
int previous = -1;
while (totalRead != length) {
int read = stream.read();
if (read != -1) {
result[position++] = (byte) (read - 128); // undo the +128 shift applied by the writer
totalRead++;
} else if (previous == -1) {
break; // two -1 reads in a row: treat as a real end of stream
}
previous = read;
}
return result;
}
Finally this line has to be added to the calling code:
...
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setChunkedStreamingMode(1024 * 1024);
...
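As an aside, the hand-rolled framing can be expressed more compactly with DataInputStream, whose readInt() matches the 4-byte big-endian lengths written via ByteBuffer and whose readFully() already loops until all requested bytes arrive. A sketch of such a reader, assuming the & 0xff variant of read() (unshifted bytes); FramedReader is a hypothetical name, not part of the original code:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class FramedReader {
    private final DataInputStream in;

    public FramedReader(InputStream stream) {
        this.in = new DataInputStream(stream);
    }

    // Reads [type byte][int length][refId][int length][json]; returns null at a clean end of stream.
    public String[] next() throws IOException {
        int type = in.read(); // the ModelType ordinal, or -1 at end of stream
        if (type == -1)
            return null;
        String refId = readLengthPrefixed();
        String json = readLengthPrefixed();
        return new String[] { String.valueOf(type), refId, json };
    }

    private String readLengthPrefixed() throws IOException {
        int length = in.readInt(); // big-endian, matching ByteBuffer.putInt on the writer
        byte[] bytes = new byte[length];
        in.readFully(bytes); // blocks until all bytes arrive; throws EOFException if the stream ends early
        return new String(bytes, StandardCharsets.UTF_8);
    }
}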

Related

Writing a binary file and rereading a part of it - Expected data size has changed

I'm receiving a motion JPEG stream via a servlet's post method. I cut out the JPEG images from the stream as they come and then write them to a Motion JPEG file (mjpeg), using the following structure:
--END
Content-Type: image/jpeg
Content-Length: <Length of following JPEG image in bytes>
<JPEG image data>
--END
After the last --END part, the next image starts.
The mjpeg file is created in the following way:
private static final String BOUNDARY = "END";
private static final Charset CHARSET = StandardCharsets.UTF_8;
final byte[] PRE_BOUNDARY = ("--" + BOUNDARY).getBytes(CHARSET);
final byte[] CONTENT_TYPE = "Content-Type: image/jpeg".getBytes(CHARSET);
final byte[] CONTENT_LENGTH = "Content-Length: ".getBytes(CHARSET);
final byte[] LINE_FEED = "\n".getBytes(CHARSET);
try (OutputStream out = new BufferedOutputStream(new FileOutputStream(targetFile))) {
for every image received {
onImage(image, out);
}
}
public void onImage(byte[] image, OutputStream out) {
try {
out.write(PRE_BOUNDARY);
out.write(LINE_FEED);
out.write(CONTENT_TYPE);
out.write(LINE_FEED);
out.write(CONTENT_LENGTH);
out.write(String.valueOf(image.length).getBytes(CHARSET));
out.write(LINE_FEED);
out.write(LINE_FEED);
out.write(image);
out.write(LINE_FEED);
out.write(LINE_FEED);
} catch (IOException e) {
e.printStackTrace();
}
}
Here is an example file.
Now, I'd like to read the mjpeg files again and do some processing on the contained images. For this, I build the following reader:
package de.supportgis.stream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
public class MediaConverter {
private static final String BOUNDARY = "END";
private static final Charset CHARSET = StandardCharsets.UTF_8;
private static final int READ_PRE_BOUNDARY = 1;
private static final int READ_CONTENT_TYPE = 2;
private static final int READ_CONTENT_LENGTH = 3;
private static final int READ_CONTENT = 4;
public static void createMovieFromMJPEG(String file) throws FileNotFoundException, IOException {
char LINE_FEED = '\n';
char[] PRE_BOUNDARY = new String("--" + BOUNDARY + LINE_FEED).toCharArray();
try (InputStream in = new FileInputStream(file);
Reader reader = new InputStreamReader(in, CHARSET);
Reader buffer = new BufferedReader(reader)) {
int r;
StringBuffer content_buf = new StringBuffer();
int mode = READ_PRE_BOUNDARY;
long content_length = 0;
int[] cmdBuf = new int[PRE_BOUNDARY.length];
int boundaryPointer = 0;
int counter = 0;
while ((r = reader.read()) != -1) {
System.out.print((char)r);
counter++;
if (mode == READ_PRE_BOUNDARY) {
if (r == PRE_BOUNDARY[boundaryPointer]) {
boundaryPointer++;
if (boundaryPointer >= PRE_BOUNDARY.length - 1) {
// Read a PRE_BOUNDARY
mode = READ_CONTENT_TYPE;
boundaryPointer = 0;
}
}
} else if (mode == READ_CONTENT_TYPE) {
if (r != LINE_FEED) {
content_buf.append((char)r);
} else {
if (content_buf.length() == 0) {
// leading line break, ignore...
} else {
mode = READ_CONTENT_LENGTH;
content_buf.setLength(0);
}
}
} else if (mode == READ_CONTENT_LENGTH) {
if (r != LINE_FEED) {
content_buf.append((char)r);
} else {
if (content_buf.length() == 0) {
// leading line break, ignore...
} else {
String number = content_buf.substring(content_buf.lastIndexOf(":") + 1).trim();
content_length = Long.valueOf(number);
content_buf.setLength(0);
mode = READ_CONTENT;
}
}
} else if (mode == READ_CONTENT) {
char[] fileBuf = new char[(int)content_length];
reader.read(fileBuf);
System.out.println(fileBuf);
mode = READ_PRE_BOUNDARY;
}
}
}
}
public static void main(String[] args) {
try {
createMovieFromMJPEG("video.mjpeg");
} catch (IOException e) {
e.printStackTrace();
}
}
}
Note that this reader may not produce working JPEGs yet, as I'm still trying to debug the following error:
I read the value given by Content-Length. I then expect that when I read <content-length> bytes after the Content-Length part into fileBuf (in the READ_CONTENT branch), I end up with exactly the bytes of the image that I wrote in the prior step. However, fileBuf contains the whole image, as well as the metadata and half the bytes of the next image, which means it reads way too much.
I know that when it comes to saving, reading and encoding binary data, there are many things that can go wrong. Which mistake do I have the pleasure of making here?
Thanks in advance.
The mistake was, as several comments suggested, using a Reader instead of an InputStream. I assumed InputStreamReader would provide the bytes returned by an InputStream through a Reader interface, but instead it returns characters in a specific encoding, as seemingly all Readers do.
So, the saving logic for the mjpeg stream was ok and the corrected reader looks like this:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
public class MediaConverter {
private static final String BOUNDARY = "END";
private static final Charset CHARSET = StandardCharsets.UTF_8;
private static final int READ_PRE_BOUNDARY = 1;
private static final int READ_CONTENT_TYPE = 2;
private static final int READ_CONTENT_LENGTH = 3;
private static final int READ_CONTENT = 4;
public static void createMovieFromMJPEG(String file) throws FileNotFoundException, IOException {
char LINE_FEED = '\n';
char[] PRE_BOUNDARY = new String("--" + BOUNDARY + LINE_FEED).toCharArray();
try (InputStream in = new FileInputStream(file);
BufferedInputStream reader = new BufferedInputStream(in);) {
int r;
StringBuffer content_buf = new StringBuffer();
int mode = READ_PRE_BOUNDARY;
long content_length = 0;
int[] cmdBuf = new int[PRE_BOUNDARY.length];
int boundaryPointer = 0;
int counter = 0;
while ((r = reader.read()) != -1) {
System.out.print((char)r);
counter++;
if (mode == READ_PRE_BOUNDARY) {
if (r == PRE_BOUNDARY[boundaryPointer]) {
boundaryPointer++;
if (boundaryPointer >= PRE_BOUNDARY.length - 1) {
// Read a PRE_BOUNDARY
mode = READ_CONTENT_TYPE;
boundaryPointer = 0;
}
}
} else if (mode == READ_CONTENT_TYPE) {
if (r != LINE_FEED) {
content_buf.append((char)r);
} else {
if (content_buf.length() == 0) {
// leading line break, ignore...
} else {
mode = READ_CONTENT_LENGTH;
content_buf.setLength(0);
}
}
} else if (mode == READ_CONTENT_LENGTH) {
if (r != LINE_FEED) {
content_buf.append((char)r);
} else {
if (content_buf.length() == 0) {
// leading line break, ignore...
} else {
String number = content_buf.substring(content_buf.lastIndexOf(":") + 1).trim();
content_length = Long.valueOf(number);
content_buf.setLength(0);
mode = READ_CONTENT;
}
}
} else if (mode == READ_CONTENT) {
byte[] fileBuf = new byte[(int) content_length];
// read(byte[]) may return fewer bytes than requested, so loop until the image is complete
int filled = 0;
while (filled < fileBuf.length) {
int n = reader.read(fileBuf, filled, fileBuf.length - filled);
if (n == -1) break;
filled += n;
}
System.out.println("read image of " + filled + " bytes");
mode = READ_PRE_BOUNDARY;
}
}
}
}
public static void main(String[] args) {
try {
createMovieFromMJPEG("video.mjpeg");
} catch (IOException e) {
e.printStackTrace();
}
}
}

What is the fastest way to compare the content of two BIG text files in JAVA

I have two text files that are more than 600MB each, and I want to compare their content for equality, ignoring any spaces at the start or end of each line (i.e. trim() each line).
I am thinking of reading each line of both files as a string, trimming it, and then comparing.
Is there a better idea, and if not, what is the fastest implementation of this idea?
Thanks in advance.
If you just want to check whether the files are identical, compute and compare their MD5 hashes:
import java.io.FileInputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
public class MainServer {
public static void main(String[] args) {
String filePath1 = "D:\\Download\\a.mp3";
String filePath2 = "D:\\Download\\b.mp3";
String file1_md5 = md5HashCode(filePath1);
String file2_md5 = md5HashCode(filePath2);
System.out.println(file1_md5);
System.out.println(file2_md5);
if(file1_md5.equals(file2_md5)){
System.out.println("Two files are the same ");
}
}
/**
* get file md5 value
*/
public static String md5HashCode(String filePath) {
try {
InputStream fis =new FileInputStream(filePath);
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] buffer = new byte[1024];
int length = -1;
while ((length = fis.read(buffer, 0, 1024)) != -1) {
md.update(buffer, 0, length);
}
fis.close();
byte[] md5Bytes = md.digest();
BigInteger bigInt = new BigInteger(1, md5Bytes);
return bigInt.toString(16);
} catch (Exception e) {
e.printStackTrace();
return "";
}
}
}
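Note that a whole-file hash compares the exact bytes, so it will not ignore the leading/trailing whitespace the question wants to trim away. A hedged variant that hashes each trimmed line instead, so only the trimmed content matters:

import java.io.BufferedReader;
import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TrimmedMd5 {
    // Streams the file line by line, trimming each line before feeding it to the digest.
    public static String trimmedMd5(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                md.update(line.trim().getBytes(StandardCharsets.UTF_8));
                md.update((byte) '\n'); // keep line boundaries significant
            }
        }
        return new BigInteger(1, md.digest()).toString(16);
    }
}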
If you need to read each line of the file for comparison:
List<String> file1_lines = null;
List<String> file2_lines = null;
try {
file1_lines = Files.readAllLines(Paths.get("D:/a.txt"), StandardCharsets.UTF_8);
file2_lines = Files.readAllLines(Paths.get("D:/b.txt"), StandardCharsets.UTF_8);
} catch (IOException e) {
e.printStackTrace();
}
for (int i = 0; i < file1_lines.size(); i++) {
String file1_line = file1_lines.get(i).trim();
String file2_line = file2_lines.get(i).trim();
if (file1_line.equals(file2_line)) {
//do some
}
}
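Be aware that Files.readAllLines loads the entire file into memory, which is risky at 600 MB and up, and the index-based loop assumes both files have the same number of lines. A streaming sketch that compares trimmed lines in constant memory:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingCompare {
    public static boolean sameTrimmedContent(String path1, String path2) throws IOException {
        try (BufferedReader r1 = Files.newBufferedReader(Paths.get(path1), StandardCharsets.UTF_8);
             BufferedReader r2 = Files.newBufferedReader(Paths.get(path2), StandardCharsets.UTF_8)) {
            while (true) {
                String l1 = r1.readLine();
                String l2 = r2.readLine();
                if (l1 == null || l2 == null)
                    return l1 == l2; // equal only if both files ended together
                if (!l1.trim().equals(l2.trim()))
                    return false;
            }
        }
    }
}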

How to read data serialized with Chronicle Wire from InputStream?

Some data is serialized to an OutputStream via Chronicle Wire.
Object m = ... ;
OutputStream out = ... ;
WireType.RAW //
.apply(Bytes.elasticByteBuffer()) //
.getValueOut().object(m) //
.bytes().copyTo(out)
;
I want to get it back from an InputStream.
InputStream in = ... ;
WireType.RAW
.apply(Bytes.elasticByteBuffer())
.getValueIn()
???
;
Object m = ???; // How to initialize m ?
How to read my initial object m from in ?
There is an assumption that you will have some idea of how long the data is and read it in one go. It is also assumed you will want to reuse the buffers to avoid creating garbage. To minimise latency, data is typically read to/from NIO Channels.
I have raised an issue to create this example: Improve support for Input/OutputStream and non-Marshallable objects, https://github.com/OpenHFT/Chronicle-Wire/issues/111
This should do what you want efficiently without creating garbage each time.
package net.openhft.chronicle.wire;
import net.openhft.chronicle.bytes.Bytes;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
public class WireToOutputStream {
private final Bytes<ByteBuffer> bytes = Bytes.elasticHeapByteBuffer(128);
private final Wire wire;
private final DataOutputStream dos;
public WireToOutputStream(WireType wireType, OutputStream os) {
wire = wireType.apply(bytes);
dos = new DataOutputStream(os);
}
public Wire getWire() {
wire.clear();
return wire;
}
public void flush() throws IOException {
int length = Math.toIntExact(bytes.readRemaining());
dos.writeInt(length);
dos.write(bytes.underlyingObject().array(), 0, length);
}
}
package net.openhft.chronicle.wire;
import net.openhft.chronicle.bytes.Bytes;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StreamCorruptedException;
import java.nio.ByteBuffer;
public class InputStreamToWire {
private final Bytes<ByteBuffer> bytes = Bytes.elasticHeapByteBuffer(128);
private final Wire wire;
private final DataInputStream dis;
public InputStreamToWire(WireType wireType, InputStream is) {
wire = wireType.apply(bytes);
dis = new DataInputStream(is);
}
public Wire readOne() throws IOException {
wire.clear();
int length = dis.readInt();
if (length < 0) throw new StreamCorruptedException();
bytes.ensureCapacity(length);
byte[] array = bytes.underlyingObject().array();
dis.readFully(array, 0, length);
bytes.readPositionRemaining(0, length);
return wire;
}
}
You can then do the following
package net.openhft.chronicle.wire;
import net.openhft.chronicle.core.util.ObjectUtils;
import org.junit.Test;
import java.io.IOException;
import java.io.Serializable;
import java.net.ServerSocket;
import java.net.Socket;
import static org.junit.Assert.assertEquals;
public class WireToOutputStreamTest {
@Test
public void testVisSocket() throws IOException {
ServerSocket ss = new ServerSocket(0);
Socket s = new Socket("localhost", ss.getLocalPort());
Socket s2 = ss.accept();
WireToOutputStream wtos = new WireToOutputStream(WireType.RAW, s.getOutputStream());
Wire wire = wtos.getWire();
AnObject ao = new AnObject();
ao.value = 12345;
ao.text = "Hello";
// writing the type is needed.
wire.getValueOut().typeLiteral(AnObject.class);
Wires.writeMarshallable(ao, wire);
wtos.flush();
InputStreamToWire istw = new InputStreamToWire(WireType.RAW, s2.getInputStream());
Wire wire2 = istw.readOne();
Class type = wire2.getValueIn().typeLiteral();
Object ao2 = ObjectUtils.newInstance(type);
Wires.readMarshallable(ao2, wire2, true);
System.out.println(ao2);
ss.close();
s.close();
s2.close();
assertEquals(ao.toString(), ao2.toString());
}
public static class AnObject implements Serializable {
long value;
String text;
@Override
public String toString() {
return "AnObject{" +
"value=" + value +
", text='" + text + '\'' +
'}';
}
}
}
Sample code
// On Sender side
Object m = ... ;
OutputStream out = ... ;
WireToOutputStream wireToOutputStream = new
WireToOutputStream(WireType.TEXT, out);
Wire wire = wireToOutputStream.getWire();
wire.getValueOut().typeLiteral(m.getClass());
Wires.writeMarshallable(m, wire);
wireToOutputStream.flush();
// On Receiver side
InputStream in = ... ;
InputStreamToWire inputStreamToWire = new InputStreamToWire(WireType.TEXT, in);
Wire wire2 = inputStreamToWire.readOne();
Class type = wire2.getValueIn().typeLiteral();
Object m = ObjectUtils.newInstance(type);
Wires.readMarshallable(m, wire2, true);
This code is a lot simpler if your DTO extends Marshallable, but it will work whether you implement an interface or not, i.e. you don't need to implement Serializable.
Also, if you know what the type will be, you don't need to write it each time; see the sketch below.
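For example, with the type agreed out of band, the receiver can unmarshal straight into a preallocated instance (a sketch reusing AnObject and the helper objects from the test above):
// Sender: no type literal written
Wire wire = wireToOutputStream.getWire();
Wires.writeMarshallable(m, wire);
wireToOutputStream.flush();
// Receiver: reuse one instance instead of reading a type and allocating
AnObject reused = new AnObject();
Wire wire2 = inputStreamToWire.readOne();
Wires.readMarshallable(reused, wire2, true);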
The helper classes above have been added to the latest SNAPSHOT

How to read file chunk by chunk from S3 using aws-java-sdk

I am trying to read a large file from S3 in chunks without cutting any line, for parallel processing.
Let me explain by example:
There is a file of size 1G on S3. I want to divide this file into chunks of 64 MB. That part is easy; I can do it like:
S3Object s3object = s3.getObject(new GetObjectRequest(bucketName, key));
InputStream stream = s3object.getObjectContent();
byte[] content = new byte[64*1024*1024];
while (stream.read(content) != -1) {
//process content here
}
But the problem with a chunk is that it may have 100 complete lines and one incomplete one. I cannot process an incomplete line, and I don't want to discard it.
Is there any way to handle this situation, so that no chunk has a partial line?
My usual approach (InputStream -> BufferedReader.lines() -> batches of lines -> CompletableFuture) won't work here because the underlying S3ObjectInputStream times out eventually for huge files.
So I created a new class S3InputStream, which doesn't care how long it's open for and reads byte blocks on demand using short-lived AWS SDK calls. You provide a byte[] that will be reused. new byte[1 << 24] (16 MB) appears to work well.
package org.harrison;
import java.io.IOException;
import java.io.InputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
/**
* An {@link InputStream} for S3 files that does not care how big the file is.
*
* @author stephen harrison
*/
public class S3InputStream extends InputStream {
private static class LazyHolder {
private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();
}
private final String bucket;
private final String file;
private final byte[] buffer;
private long lastByteOffset;
private long offset = 0;
private int next = 0;
private int length = 0;
public S3InputStream(final String bucket, final String file, final byte[] buffer) {
this.bucket = bucket;
this.file = file;
this.buffer = buffer;
this.lastByteOffset = LazyHolder.S3.getObjectMetadata(bucket, file).getContentLength() - 1;
}
@Override
public int read() throws IOException {
if (next >= length) {
fill();
if (length <= 0) {
return -1;
}
next = 0;
}
if (next >= length) {
return -1;
}
return buffer[this.next++] & 0xff; // mask to 0..255 so values >= 0x80 aren't returned as negative (as the v2 adaptation below does)
}
public void fill() throws IOException {
if (offset >= lastByteOffset) {
length = -1;
} else {
try (final InputStream inputStream = s3Object()) {
length = 0;
int b;
while ((b = inputStream.read()) != -1) {
buffer[length++] = (byte) b;
}
if (length > 0) {
offset += length;
}
}
}
}
private InputStream s3Object() {
final GetObjectRequest request = new GetObjectRequest(bucket, file).withRange(offset,
offset + buffer.length - 1);
return LazyHolder.S3.getObject(request).getObjectContent();
}
}
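Usage is then just a matter of handing the stream a reusable block buffer; a sketch (the bucket and key names are placeholders):
byte[] block = new byte[1 << 24]; // 16 MB, reused across range requests
try (InputStream in = new S3InputStream("my-bucket", "big-file.csv", block);
     BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // process the line; the reader buffers across block boundaries, so lines are never split
    }
}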
The aws-java-sdk already provides streaming functionality for your S3 objects. You have to call "getObject" and the result will be an InputStream.
1) AmazonS3Client.getObject(GetObjectRequest getObjectRequest) -> S3Object
2) S3Object.getObjectContent()
Note: The method is a simple getter and does not actually create a
stream. If you retrieve an S3Object, you should close this input
stream as soon as possible, because the object contents aren't
buffered in memory and stream directly from Amazon S3. Further,
failure to close this stream can cause the request pool to become
blocked.
aws java docs
100 complete line and one incomplete
Do you mean you need to read the stream line by line? If so, instead of using an InputStream, try reading the S3 object stream with a BufferedReader so you can process it line by line, though I think this will be a little slower than reading by chunk.
S3Object s3object = s3.getObject(new GetObjectRequest(bucketName, key));
BufferedReader in = new BufferedReader(new InputStreamReader(s3object.getObjectContent()));
String line;
while ((line = in.readLine()) != null) {
//process line here
}
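If the requirement really is fixed-size chunks that never split a line, one hedged approach is to fill a chunk to roughly the target size and then keep reading single bytes up to the next newline, so every chunk ends on a line boundary. A sketch, meant to wrap s3object.getObjectContent():

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class LineAlignedChunker {
    private static final int CHUNK_SIZE = 64 * 1024 * 1024;

    // Returns the next chunk, extended to the next '\n', or null at end of stream.
    public static byte[] nextChunk(InputStream in) throws IOException {
        ByteArrayOutputStream chunk = new ByteArrayOutputStream(8192);
        byte[] buf = new byte[8192];
        int read;
        while (chunk.size() < CHUNK_SIZE && (read = in.read(buf)) != -1) {
            chunk.write(buf, 0, read);
        }
        if (chunk.size() == 0)
            return null;
        int b;
        while ((b = in.read()) != -1) { // extend past the target size to the next line break
            chunk.write(b);
            if (b == '\n')
                break;
        }
        return chunk.toByteArray();
    }
}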
You can read all the files in the bucket by iterating with the continuation tokens. And you can read the files with other Java libs, e.g. PDFBox for PDF.
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.AmazonS3Exception;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;
//..
// in your main class
private static AWSCredentials credentials = null;
private static AmazonS3 amazonS3Client = null;
public static void intializeAmazonObjects() {
credentials = new BasicAWSCredentials(ACCESS_KEY, SECRET_ACCESS_KEY);
amazonS3Client = new AmazonS3Client(credentials);
}
public void mainMethod() throws IOException, AmazonS3Exception{
// connect to aws
intializeAmazonObjects();
ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(bucketName);
ListObjectsV2Result listObjectsResult;
do {
listObjectsResult = amazonS3Client.listObjectsV2(req);
int count = 0;
for (S3ObjectSummary objectSummary : listObjectsResult.getObjectSummaries()) {
System.out.printf(" - %s (size: %d)\n", objectSummary.getKey(), objectSummary.getSize());
// Date lastModifiedDate = objectSummary.getLastModified();
// String bucket = objectSummary.getBucketName();
String key = objectSummary.getKey();
String newKey = "";
String newBucket = "";
String resultText = "";
// only try to read pdf files
if (!key.contains(".pdf")) {
continue;
}
// Read the source file as text
String pdfFileInText = readAwsFile(objectSummary.getBucketName(), objectSummary.getKey());
if (pdfFileInText.isEmpty())
continue;
}//end of current bulk
// If there are more than maxKeys(in this case 999 default) keys in the bucket,
// get a continuation token
// and list the next objects.
String token = listObjectsResult.getNextContinuationToken();
System.out.println("Next Continuation Token: " + token);
req.setContinuationToken(token);
} while (listObjectsResult.isTruncated());
}
public String readAwsFile(String bucketName, String keyName) {
S3Object object;
String pdfFileInText = "";
try {
// AmazonS3 s3client = getAmazonS3ClientObject();
object = amazonS3Client.getObject(new GetObjectRequest(bucketName, keyName));
InputStream objectData = object.getObjectContent();
PDDocument document = PDDocument.load(objectData);
if (!document.isEncrypted()) {
PDFTextStripper tStripper = new PDFTextStripper();
pdfFileInText = tStripper.getText(document);
}
document.close(); // release the document and the underlying S3 stream
} catch (Exception e) {
e.printStackTrace();
}
return pdfFileInText;
}
The @stephen-harrison answer works well. I updated it for v2 of the SDK. I made a couple of tweaks: mainly the connection can now be authorized and the LazyHolder class is no longer static -- I couldn't figure out how to authorize the connection and still keep the class static.
For another approach using Scala, see https://alexwlchan.net/2019/09/streaming-large-s3-objects/
package foo.whatever;
import java.io.IOException;
import java.io.InputStream;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;
/**
* Adapted for AWS Java SDK v2 by jomofrodo@gmail.com
*
* An {@link InputStream} for S3 files that does not care how big the file is.
*
* @author stephen harrison
*/
public class S3InputStreamV2 extends InputStream {
private class LazyHolder {
String appID;
String secretKey;
Region region = Region.US_WEST_1;
public S3Client S3 = null;
public void connect() {
AwsBasicCredentials awsCreds = AwsBasicCredentials.create(appID, secretKey);
S3 = S3Client.builder().region(region).credentialsProvider(StaticCredentialsProvider.create(awsCreds))
.build();
}
private HeadObjectResponse getHead(String keyName, String bucketName) {
HeadObjectRequest objectRequest = HeadObjectRequest.builder().key(keyName).bucket(bucketName).build();
HeadObjectResponse objectHead = S3.headObject(objectRequest);
return objectHead;
}
// public static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();
}
private LazyHolder lazyHolder = new LazyHolder();
private final String bucket;
private final String file;
private final byte[] buffer;
private long lastByteOffset;
private long offset = 0;
private int next = 0;
private int length = 0;
public S3InputStreamV2(final String bucket, final String file, final byte[] buffer, String appID, String secret) {
this.bucket = bucket;
this.file = file;
this.buffer = buffer;
lazyHolder.appID = appID;
lazyHolder.secretKey = secret;
lazyHolder.connect();
this.lastByteOffset = lazyHolder.getHead(file, bucket).contentLength();
}
@Override
public int read() throws IOException {
if (next >= length || (next == buffer.length && length == buffer.length)) {
fill();
if (length <= 0) {
return -1;
}
next = 0;
}
if (next >= length) {
return -1;
}
return buffer[this.next++] & 0xFF;
}
public void fill() throws IOException {
if (offset >= lastByteOffset) {
length = -1;
} else {
try (final InputStream inputStream = s3Object()) {
length = 0;
int b;
while ((b = inputStream.read()) != -1) {
buffer[length++] = (byte) b;
}
if (length > 0) {
offset += length;
}
}
}
}
private InputStream s3Object() {
final Long rangeEnd = offset + buffer.length - 1;
final String rangeString = "bytes=" + offset + "-" + rangeEnd;
final GetObjectRequest getObjectRequest = GetObjectRequest.builder().bucket(bucket).key(file).range(rangeString)
.build();
return lazyHolder.S3.getObject(getObjectRequest);
}
}
We got puzzled while migrating from the AWS SDK v1 to v2 and realised that in the v2 SDK the range is not defined the same way.
With AWS V1 SDK
S3Object currentS3Obj = client.getObject(new GetObjectRequest(bucket, key).withRange(start, end));
return currentS3Obj.getObjectContent();
With AWS V2 SDK
var range = String.format("bytes=%d-%d", start, end);
ResponseBytes<GetObjectResponse> currentS3Obj = client.getObjectAsBytes(GetObjectRequest.builder().bucket(bucket).key(key).range(range).build());
return currentS3Obj.asInputStream();
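Note that getObjectAsBytes buffers the whole ranged response in memory. If a stream per range is preferable, the v2 client can return one directly (a sketch using the same bucket/key/start/end variables as above):
// AWS SDK v2: stream the ranged response instead of buffering it as bytes
// ResponseInputStream is from software.amazon.awssdk.core, GetObjectResponse from the S3 model
GetObjectRequest request = GetObjectRequest.builder()
        .bucket(bucket)
        .key(key)
        .range(String.format("bytes=%d-%d", start, end))
        .build();
ResponseInputStream<GetObjectResponse> stream = client.getObject(request);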

What's the most standard Java way to store raw binary data along with XML?

I need to store a huge amount of binary data in a file, but I also want to read/write the header of that file in XML format.
Yes, I could just store the binary data into some XML value and let it be serialized using base64 encoding. But this wouldn't be space-efficient.
Can I "mix" XML data and raw binary data in a more-or-less standardized way?
I was thinking about two options:
Is there a way to do this using JAXB?
Or is there a way to take some existing XML data and append binary data to it, in such a way that the boundary is recognized?
Isn't the concept I'm looking for somehow used by / for SOAP?
Or is it used in the email standard? (Separation of binary attachments)
Scheme of what I'm trying to achieve:
[meta-info-about-boundary][XML-data][boundary][raw-binary-data]
Thank you!
You can leverage AttachmentMarshaller & AttachmentUnmarshaller for this. This is the bridge used by JAXB/JAX-WS to pass binary content as attachments. You can leverage the same mechanism to do what you want.
http://download.oracle.com/javase/6/docs/api/javax/xml/bind/attachment/package-summary.html
PROOF OF CONCEPT
Below is how it could be implemented. This should work with any JAXB impl (it works for me with EclipseLink JAXB (MOXy), and the reference implementation).
Message Format
[xml_length][xml][attach1_length][attach1]...[attachN_length][attachN]
Root
This is an object with multiple byte[] properties.
import javax.xml.bind.annotation.XmlRootElement;
@XmlRootElement
public class Root {
private byte[] foo;
private byte[] bar;
public byte[] getFoo() {
return foo;
}
public void setFoo(byte[] foo) {
this.foo = foo;
}
public byte[] getBar() {
return bar;
}
public void setBar(byte[] bar) {
this.bar = bar;
}
}
Demo
This class is used to demonstrate how MessageWriter and MessageReader are used:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.bind.JAXBContext;
public class Demo {
public static void main(String[] args) throws Exception {
JAXBContext jc = JAXBContext.newInstance(Root.class);
Root root = new Root();
root.setFoo("HELLO WORLD".getBytes());
root.setBar("BAR".getBytes());
MessageWriter writer = new MessageWriter(jc);
FileOutputStream outStream = new FileOutputStream("file.xml");
writer.write(root, outStream);
outStream.close();
MessageReader reader = new MessageReader(jc);
FileInputStream inStream = new FileInputStream("file.xml");
Root root2 = (Root) reader.read(inStream);
inStream.close();
System.out.println(new String(root2.getFoo()));
System.out.println(new String(root2.getBar()));
}
}
MessageWriter
Is responsible for writing the message to the desired format:
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import javax.activation.DataHandler;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.attachment.AttachmentMarshaller;
public class MessageWriter {
private JAXBContext jaxbContext;
public MessageWriter(JAXBContext jaxbContext) {
this.jaxbContext = jaxbContext;
}
/**
* Write the message in the following format:
* [xml_length][xml][attach1_length][attach1]...[attachN_length][attachN]
*/
public void write(Object object, OutputStream stream) {
try {
Marshaller marshaller = jaxbContext.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);
BinaryAttachmentMarshaller attachmentMarshaller = new BinaryAttachmentMarshaller();
marshaller.setAttachmentMarshaller(attachmentMarshaller);
ByteArrayOutputStream xmlStream = new ByteArrayOutputStream();
marshaller.marshal(object, xmlStream);
byte[] xml = xmlStream.toByteArray();
xmlStream.close();
ObjectOutputStream messageStream = new ObjectOutputStream(stream);
messageStream.writeInt(xml.length); // [xml_length] as a full 4-byte int; write(int) would truncate it to one byte
messageStream.write(xml); // [xml]
for(Attachment attachment : attachmentMarshaller.getAttachments()) {
messageStream.writeInt(attachment.getLength()); // [attachX_length]
messageStream.write(attachment.getData(), attachment.getOffset(), attachment.getLength()); // [attachX]
}
messageStream.flush();
} catch(Exception e) {
throw new RuntimeException(e);
}
}
private static class BinaryAttachmentMarshaller extends AttachmentMarshaller {
private static final int THRESHOLD = 10;
private List<Attachment> attachments = new ArrayList<Attachment>();
public List<Attachment> getAttachments() {
return attachments;
}
@Override
public String addMtomAttachment(DataHandler data, String elementNamespace, String elementLocalName) {
return null;
}
@Override
public String addMtomAttachment(byte[] data, int offset, int length, String mimeType, String elementNamespace, String elementLocalName) {
if(data.length < THRESHOLD) {
return null;
}
int id = attachments.size() + 1;
attachments.add(new Attachment(data, offset, length));
return "cid:" + String.valueOf(id);
}
@Override
public String addSwaRefAttachment(DataHandler data) {
return null;
}
@Override
public boolean isXOPPackage() {
return true;
}
}
public static class Attachment {
private byte[] data;
private int offset;
private int length;
public Attachment(byte[] data, int offset, int length) {
this.data = data;
this.offset = offset;
this.length = length;
}
public byte[] getData() {
return data;
}
public int getOffset() {
return offset;
}
public int getLength() {
return length;
}
}
}
MessageReader
Is responsible for reading the message:
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;
import javax.activation.DataHandler;
import javax.activation.DataSource;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.attachment.AttachmentUnmarshaller;
public class MessageReader {
private JAXBContext jaxbContext;
public MessageReader(JAXBContext jaxbContext) {
this.jaxbContext = jaxbContext;
}
/**
* Read the message from the following format:
* [xml_length][xml][attach1_length][attach1]...[attachN_length][attachN]
*/
public Object read(InputStream stream) {
try {
ObjectInputStream inputStream = new ObjectInputStream(stream);
int xmlLength = inputStream.readInt(); // [xml_length], matching writeInt on the writer side
byte[] xmlIn = new byte[xmlLength];
inputStream.readFully(xmlIn); // [xml]; plain read(byte[]) could return fewer bytes
BinaryAttachmentUnmarshaller attachmentUnmarshaller = new BinaryAttachmentUnmarshaller();
int id = 1;
while(inputStream.available() > 0) {
int length = inputStream.readInt(); // [attachX_length]
byte[] data = new byte[length]; // [attachX]
inputStream.readFully(data);
attachmentUnmarshaller.getAttachments().put("cid:" + String.valueOf(id++), data);
}
Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
unmarshaller.setAttachmentUnmarshaller(attachmentUnmarshaller);
ByteArrayInputStream byteInputStream = new ByteArrayInputStream(xmlIn);
Object object = unmarshaller.unmarshal(byteInputStream);
byteInputStream.close();
inputStream.close();
return object;
} catch(Exception e) {
throw new RuntimeException(e);
}
}
private static class BinaryAttachmentUnmarshaller extends AttachmentUnmarshaller {
private Map<String, byte[]> attachments = new HashMap<String, byte[]>();
public Map<String, byte[]> getAttachments() {
return attachments;
}
@Override
public DataHandler getAttachmentAsDataHandler(String cid) {
byte[] bytes = attachments.get(cid);
return new DataHandler(new ByteArrayDataSource(bytes));
}
@Override
public byte[] getAttachmentAsByteArray(String cid) {
return attachments.get(cid);
}
@Override
public boolean isXOPPackage() {
return true;
}
}
private static class ByteArrayDataSource implements DataSource {
private byte[] bytes;
public ByteArrayDataSource(byte[] bytes) {
this.bytes = bytes;
}
public String getContentType() {
return "application/octet-stream";
}
public InputStream getInputStream() throws IOException {
return new ByteArrayInputStream(bytes);
}
public String getName() {
return null;
}
public OutputStream getOutputStream() throws IOException {
return null;
}
}
}
For More Information
http://bdoughan.blogspot.com/2011/03/jaxb-web-services-and-binary-data.html
I followed the concept suggested by Blaise Doughan, but without attachment marshallers:
I let an XmlAdapter convert a byte[] to a URI reference and back, while the references point to separate files where the raw data is stored. The XML file and all binary files are then put into a zip.
It is similar to the approach of OpenOffice and the ODF format, which in fact is a zip with few XMLs and binary files.
(In the example code, no actual binary files are written, and no zip is created.)
Bindings.java
import java.net.*;
import java.util.*;
import javax.xml.bind.annotation.*;
import javax.xml.bind.annotation.adapters.*;
final class Bindings {
static final String SCHEME = "storage";
static final Class<?>[] ALL_CLASSES = new Class<?>[]{
Root.class, RawRef.class
};
static final class RawRepository
extends XmlAdapter<URI, byte[]> {
final SortedMap<String, byte[]> map = new TreeMap<>();
final String host;
private int lastID = 0;
RawRepository(String host) {
this.host = host;
}
@Override
public byte[] unmarshal(URI o) {
if (!SCHEME.equals(o.getScheme())) {
throw new Error("scheme is: " + o.getScheme()
+ ", while expected was: " + SCHEME);
} else if (!host.equals(o.getHost())) {
throw new Error("host is: " + o.getHost()
+ ", while expected was: " + host);
}
String key = o.getPath();
if (!map.containsKey(key)) {
throw new Error("key not found: " + key);
}
byte[] ret = map.get(key);
return Arrays.copyOf(ret, ret.length);
}
@Override
public URI marshal(byte[] o) {
++lastID;
String key = String.valueOf(lastID);
map.put(key, Arrays.copyOf(o, o.length));
try {
return new URI(SCHEME, host, "/" + key, null);
} catch (URISyntaxException ex) {
throw new Error(ex);
}
}
}
@XmlRootElement
@XmlType
static final class Root {
@XmlElement
final List<RawRef> element = new LinkedList<>();
}
@XmlType
static final class RawRef {
@XmlJavaTypeAdapter(RawRepository.class)
@XmlElement
byte[] raw = null;
}
}
_Run.java
import java.io.*;
import javax.xml.bind.*;
public class _Run {
public static void main(String[] args)
throws Exception {
JAXBContext context = JAXBContext.newInstance(Bindings.ALL_CLASSES);
Marshaller marshaller = context.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
Unmarshaller unmarshaller = context.createUnmarshaller();
Bindings.RawRepository adapter = new Bindings.RawRepository("myZipVFS");
marshaller.setAdapter(adapter);
Bindings.RawRef ta1 = new Bindings.RawRef();
ta1.raw = "THIS IS A STRING".getBytes();
Bindings.RawRef ta2 = new Bindings.RawRef();
ta2.raw = "THIS IS AN OTHER STRING".getBytes();
Bindings.Root root = new Bindings.Root();
root.element.add(ta1);
root.element.add(ta2);
StringWriter out = new StringWriter();
marshaller.marshal(root, out);
System.out.println(out.toString());
}
}
Output
<root>
<element>
<raw>storage://myZipVFS/1</raw>
</element>
<element>
<raw>storage://myZipVFS/2</raw>
</element>
</root>
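To complete the picture, the adapter's map plus the marshalled XML could then be packaged into a zip, in the spirit of the ODF approach. A minimal sketch using ZipOutputStream; the entry names are a convention of this example, not part of any standard:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipPackager {
    // Writes content.xml plus one entry per stored binary blob.
    public static void writePackage(OutputStream out, String xml, Map<String, byte[]> blobs)
            throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            for (Map.Entry<String, byte[]> blob : blobs.entrySet()) {
                zip.putNextEntry(new ZipEntry("storage/" + blob.getKey()));
                zip.write(blob.getValue());
                zip.closeEntry();
            }
        }
    }
}

In the example above this would be called with the marshaller output and the adapter's map.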
This is not natively supported by JAXB, as you do not want to serialize the binary data to XML, but it can usually be done at a higher level when using JAXB.
The way I do this with webservices (SOAP and REST) is using MIME multipart/mixed messages (check the multipart specification). Initially designed for emails, it works great for sending XML together with binary data, and most webservice frameworks such as Axis or Jersey support it in an almost transparent way.
Here is an example of sending an object in XML together with a binary file with REST webservice using Jersey with the jersey-multipart extension.
XML object
@XmlRootElement
public class Book {
private String title;
private String author;
private int year;
//getter and setters...
}
Client
byte[] bin = some binary data...
Book b = new Book();
b.setAuthor("John");
b.setTitle("wild stuff");
b.setYear(2012);
MultiPart multiPart = new MultiPart();
multiPart.bodyPart(new BodyPart(b, MediaType.APPLICATION_XML_TYPE));
multiPart.bodyPart(new BodyPart(bin, MediaType.APPLICATION_OCTET_STREAM_TYPE));
response = service.path("rest").path("multipart").
type(MultiPartMediaTypes.MULTIPART_MIXED).
post(ClientResponse.class, multiPart);
Server
@POST
@Consumes(MultiPartMediaTypes.MULTIPART_MIXED)
public Response post(MultiPart multiPart) {
for(BodyPart part : multiPart.getBodyParts()) {
System.out.println(part.getMediaType());
}
return Response.status(Response.Status.ACCEPTED).
entity("Attachements processed successfully.").
type(MediaType.TEXT_PLAIN).build();
}
I tried to send a file with 110917 bytes. Using wireshark, you can see that the data is sent directly over HTTP like this:
Hypertext Transfer Protocol
POST /org.etics.test.rest.server/rest/multipart HTTP/1.1\r\n
Content-Type: multipart/mixed; boundary=Boundary_1_353042220_1343207087422\r\n
MIME-Version: 1.0\r\n
User-Agent: Java/1.7.0_04\r\n
Host: localhost:8080\r\n
Accept: text/html, image/gif, image/jpeg\r\n
Connection: keep-alive\r\n
Content-Length: 111243\r\n
\r\n
[Full request URI: http://localhost:8080/org.etics.test.rest.server/rest/multipart]
MIME Multipart Media Encapsulation, Type: multipart/mixed, Boundary: "Boundary_1_353042220_1343207087422"
[Type: multipart/mixed]
First boundary: --Boundary_1_353042220_1343207087422\r\n
Encapsulated multipart part: (application/xml)
Content-Type: application/xml\r\n\r\n
eXtensible Markup Language
<?xml
<book>
<author>
John
</author>
<title>
wild stuff
</title>
<year>
2012
</year>
</book>
Boundary: \r\n--Boundary_1_353042220_1343207087422\r\n
Encapsulated multipart part: (application/octet-stream)
Content-Type: application/octet-stream\r\n\r\n
Media Type
Media Type: application/octet-stream (110917 bytes)
Last boundary: \r\n--Boundary_1_353042220_1343207087422--\r\n
As you can see, the binary data is sent as octet-stream, with no waste of space, contrary to what happens when sending binary data inline in the XML. There is just the very low-overhead MIME envelope.
With SOAP the principle is the same (just that it will have the SOAP envelope).
I don't think so -- XML libraries generally aren't designed to work with XML+extra-data.
But you might be able to get away with something as simple as a special stream wrapper -- it would expose an "XML"-containing stream and a binary stream (from the special "format"). Then JAXB (or whatever else XML library) could play with the "XML" stream and the binary stream is kept separate.
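A minimal sketch of that wrapper idea, assuming the layout [4-byte XML length][XML bytes][raw binary data...]; the class and method names are made up for illustration:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class XmlPlusBinary {
    // Writer: length-prefixed XML header, then the raw payload follows untouched.
    public static void write(OutputStream out, String xml, InputStream binary) throws IOException {
        DataOutputStream dos = new DataOutputStream(out);
        byte[] xmlBytes = xml.getBytes(StandardCharsets.UTF_8);
        dos.writeInt(xmlBytes.length);
        dos.write(xmlBytes);
        binary.transferTo(dos); // Java 9+; use a manual copy loop on older JDKs
        dos.flush();
    }

    // Reader: returns the XML; afterwards the stream is positioned at the first binary byte.
    public static String readXml(InputStream in) throws IOException {
        DataInputStream dis = new DataInputStream(in);
        byte[] xmlBytes = new byte[dis.readInt()];
        dis.readFully(xmlBytes);
        return new String(xmlBytes, StandardCharsets.UTF_8);
    }
}

The returned XML string can then be handed to JAXB, and the remaining stream bytes read separately.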
Also remember to take "binary" vs. "text" files into account.
Happy coding.
