Given a byte[] object, when we operate on it we often need only pieces of it. In my particular example I get a byte[] from the wire, where the first 4 bytes describe the length of the message, the next 4 bytes the type of the message (an integer that maps to a concrete protobuf class), and the remaining bytes are the actual content of the message... like this
length|type|content
In order to parse this message I have to pass the content part to a specific class which knows how to parse an instance from it... the problem is that often no methods are provided that let you specify from where to where the parser shall read the array...
So what we end up doing is copying the remaining chunk of that array, which is not efficient...
As far as I know, in Java it is not possible to create another byte[] reference that actually refers to a section of some bigger byte[] array via just two indexes (this was the approach String once took, and it led to memory leaks)...
I wonder how we solve situations like this? I suppose giving up on protobuf just because it does not provide some parseFrom(byte[], int, int) does not make sense... protobuf is just an example, anything could lack that API...
So does this force us to write inefficient code, or is there something that can be done (apart from adding that method)?
Normally you would tackle this kind of thing with streams.
A stream is an abstraction for reading just what you need to process the current block of data. So you can read the correct number of bytes into a byte array and pass it to your parse function.
You ask 'So does this force us to write inefficient code or there is something that can be done?'
Usually you get your data in the form of a stream, and then the technique demonstrated below will be more performant because you skip making one copy (two copies instead of three: once by the OS and once by you; you skip copying the whole byte array before you start parsing). If you actually start out with a byte[] but it is constructed by yourself, then you may want to construct an object such as { int length, int type, byte[] contentBytes } instead and pass contentBytes to your parse function, as sketched below.
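For reference, a minimal sketch of such a holder object (the class and field names are made up for illustration, not part of any existing API):

// Plain holder for one decoded wire message; names are illustrative only.
final class WireMessage {
    final int length;
    final int type;
    final byte[] contentBytes;

    WireMessage(int length, int type, byte[] contentBytes) {
        this.length = length;
        this.type = type;
        this.contentBytes = contentBytes;
    }
}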
If you really, really have to start out with a byte[], then the technique below is just a more convenient way to parse it; it would not be more performant.
So suppose you got a buffer of bytes from somewhere and you want to read the contents of that buffer. First you convert it to a stream:
private static List<Content> read(byte[] buffer) {
    try {
        ByteArrayInputStream bytesStream = new ByteArrayInputStream(buffer);
        return read(bytesStream);
    } catch (IOException e) {
        e.printStackTrace();
        return new ArrayList<Content>(); // make sure every path returns a value
    }
}
The above function wraps the byte array with a stream and passes it to the function that does the actual reading.
If you can start out from a stream then obviously you can skip the above step and just pass that stream into the below function directly:
private static List<Content> read(InputStream bytesStream) throws IOException {
    List<Content> results = new ArrayList<Content>();
    try {
        // read the content...
        Content content1 = readContent(bytesStream);
        results.add(content1);

        // I don't know if there's more than one content block but assuming
        // that there is, you can just continue reading the stream...
        //
        // If it's a fixed number of content blocks then just read them one
        // after the other... Otherwise make this a loop
        Content content2 = readContent(bytesStream);
        results.add(content2);
    } finally {
        bytesStream.close();
    }
    return results;
}
Since your byte-array contains content you will want to read Content blocks from the stream. Since you have a length and a type field, I am assuming that you have different kinds of content blocks. The next function reads the length and type and passes the processing of the content bytes on to the proper class depending on the read type:
private static Content readContent(InputStream stream) throws IOException {
    final int CONTENT_TYPE_A = 10;
    final int CONTENT_TYPE_B = 11;

    // wrap the InputStream in a DataInputStream because the latter has
    // convenience functions to convert bytes to integers, etc.
    // Note that DataInputStream handles the stream in a BigEndian way,
    // so check that your bytes are in the same byte order. If not you'll
    // have to find another stream reader that can convert to ints from
    // LittleEndian byte order.
    DataInputStream data = new DataInputStream(stream);
    int length = data.readInt();
    int type = data.readInt();

    // I'm assuming that the above length field is the number of bytes of
    // content. So, read exactly that many bytes into a buffer and pass it
    // to your `parseFrom(byte[])` function. readFully blocks until the
    // buffer is filled and throws EOFException if the stream ends early
    // (a single read() call may return fewer bytes even when the stream
    // is not yet at its end).
    byte[] contentBytes = new byte[length];
    data.readFully(contentBytes);

    switch (type) {
        case CONTENT_TYPE_A:
            return ContentTypeA.parseFrom(contentBytes);
        case CONTENT_TYPE_B:
            return ContentTypeB.parseFrom(contentBytes);
        default:
            throw new UnsupportedOperationException();
    }
}
I have made up the below Content classes. I don't know what protobuf is but it can apparently convert from a byte array to an actual object with its parseFrom(byte[]) function, so take this as pseudocode:
class Content {
    // common functionality
}

class ContentTypeA extends Content {
    public static ContentTypeA parseFrom(byte[] contentBytes) {
        return null; // do the actual parsing of a type A content
    }
}

class ContentTypeB extends Content {
    public static ContentTypeB parseFrom(byte[] contentBytes) {
        return null; // do the actual parsing of a type B content
    }
}
In Java, an array is not just a section of memory; it is an object that has some additional fields (at least a length). So you cannot make a reference point at part of an array; you should either:
Use array-copy functions, or
Implement and use an algorithm that works on only part of the byte array (see the sketch below).
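For illustration, a minimal sketch of both options; parseFrom(byte[], int, int) here is a hypothetical method signature (the kind the question wishes protobuf exposed), not an existing API:

import java.util.Arrays;

class SubArrayOptions {

    static void handle(byte[] message) {
        // Option 1: array-copy - copies the content part (bytes 8..end) into a new array
        byte[] content = Arrays.copyOfRange(message, 8, message.length);
        parse(content);

        // Option 2: no copy - pass the whole array plus an offset and a length,
        // and make the parsing code read only that window
        parseFrom(message, 8, message.length - 8);
    }

    static void parse(byte[] content) {
        // parse a complete array
    }

    static void parseFrom(byte[] array, int offset, int length) {
        // hypothetical: read only array[offset] .. array[offset + length - 1]
    }
}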
The concern seems to be that there is no way to create a view over an array (e.g., an array equivalent of List#subList()). A workaround might be making your parsing methods take in the reference to the entire array plus two indices (or an index and a length) to specify the sub-array the method should work on.
This would not prevent the methods from reading or modifying sections of the array they should not touch. Perhaps a ByteArrayView class could be made to add a little bit of safety if this is a concern:
public class ByteArrayView {
    private final byte[] array;
    private final int start;
    private final int length;

    public ByteArrayView(byte[] array, int start, int length) {
        this.array = array;
        this.start = start;
        this.length = length;
    }

    public byte get(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException(); // or some other relevant exception
        }
        return array[start + index];
    }
}
But if, on the other hand, performance is a concern, then a method call to get() for fetching each byte is probably undesirable.
The code is for illustration; it's not tested or anything.
EDIT
On a second reading of my own answer, I realized that I should point this out: having a ByteArrayView will copy each byte you read from the original array -- just byte by byte rather than as a chunk. It would be inadequate for the OP's concerns.
I have a reporting web application which generates reports. The application gets data from a database and stores it in a StringWriter object. I have to get this data in byte array format to create a CSV file and send it to the browser.
Below is the code snippet
return new FileTransfer(fileName, reportType.getMimeType(),
new ByteArrayInputStream(generateCSV(reportType, grid, new DataList(), params).toString().getBytes("UTF-8")));
where generateCSV returns a StringWriter object; to convert it into a byte array I call toString() and then getBytes(). Below is what the generateCSV method looks like:
StringWriter generateCSV(ReportType reportType, GridConfig grid, DataList dataList, String params) {......}
The problem is that when my report has a huge number of records (more than 1 million), the getBytes() method fails with
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
The whole report data, when converted to a String object, contains a huge number of characters (billions of them). The .getBytes("UTF-8") call then converts it into a byte array, and for 1 million records the array size exceeds the maximum JVM array size limit (https://plumbr.io/outofmemoryerror/requested-array-size-exceeds-vm-limit).
Now how can I avoid the toString().getBytes("UTF-8") call and the resulting OOM error? Is there a better approach to get a byte array out of a StringWriter?
It’s strange to receive the result of generateCSV as a StringWriter; the preferred solution would be to let the method write to a target while generating, so you don’t have the entire contents in memory.
In either case, you should resort to the FileTransfer(String, String mimeType, OutputStreamLoader) constructor, to receive the target OutputStream when it is time to write the actual data.
When you can’t avoid the intermediate StringWriter, you should at least avoid calling toString on it, as constructing a String implies creating a copy of the entire buffer.
So a solution could look like:
return new FileTransfer(fileName, reportType.getMimeType(), new OutputStreamLoader() {
    public void close() {}

    public void load(OutputStream out) throws IOException {
        // the best would be to let generateCSV write to out directly
        // otherwise use:
        StringBuffer sb = generateCSV(reportType, grid, new DataList(), params).getBuffer();
        Writer w = new OutputStreamWriter(out, "UTF-8");
        final int bufSize = 8192;
        for (int s = 0, e; s < sb.length(); s = e) {
            e = Math.min(sb.length(), s + bufSize);
            w.write(sb.substring(s, e));
        }
        w.flush(); // let the caller close the OutputStream
    }
});
An alternative to StringWriter would be CharArrayWriter, which has a writeTo(Writer out) method that eliminates the need for a manual copying loop and might be even more efficient. But, as said, refactoring generateCSV to write directly to a target would be even better.
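For illustration, a minimal sketch of the CharArrayWriter variant, assuming generateCSV were changed to return a CharArrayWriter:

import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

class CharArrayWriterVariant {
    static void writeCsv(CharArrayWriter csv, OutputStream out) throws IOException {
        Writer w = new OutputStreamWriter(out, "UTF-8");
        // writeTo copies the buffered characters straight to the target writer,
        // without building an intermediate String
        csv.writeTo(w);
        w.flush(); // let the caller close the OutputStream
    }
}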
The StringWriter holds its content in memory, so it is not a good approach for large files.
You should try to stream the data directly to the output without the StringWriter in the middle.
What about your own InputStream implementation which reads and converts the data to CSV on the fly?
Can you show us the generateCSV method?
Suppose we have some binary data byte[] data that only contains Integers. If I wanted to read this data utilizing a DataInputStream, the only approach I can come up with is the following:
DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
try {
    while (true) {
        int i = in.readInt();
    }
} catch (EOFException e) {
    // we're done!
} catch (IOException e) {
    throw new RuntimeException(e);
}
What bugs me about this is that reaching the end of the stream is expected, and it would be exceptional only if no exception were thrown, which IMO defeats the purpose of exceptions in the first place.
When using Java NIO's IntBuffer, there's no such problem.
IntBuffer in = ByteBuffer.wrap(data).asIntBuffer();
while (in.hasRemaining()) {
    int i = in.get();
}
Coming from C# and being in the process of learning Java I refuse to believe that this is the intended way of doing this.
Moreover, I just came across Java NIO, which seems to be "quite new". Using IntBuffer here instead would just be my way of putting the matter off. Regardless, I want to know how this is properly done in Java.
You can't. readInt() can return any integer value, so an out-of-band mechanism is required to signal end of stream, so an exception is thrown. That's how the API was designed. Nothing you can do about it.
Since you are coming from .NET, Java's DataInputStream is roughly equivalent to BinaryReader of .NET.
Just like its .NET equivalent, DataInputStream class and its main interface, DataInput, have no provision for determining if a primitive of any given type is available for retrieval at the current position of the stream.
You can gain valuable insight of how the designers of the API expect you to use it by looking at designer's own usage of the API.
For example, look at ObjectInputStream.java source, which is used for object deserialization. The code that reads arrays of various types calls type-specific readXYZ methods of DataInput in a loop. In order to figure out where the primitives end, the code retrieves the number of items (line 1642):
private Object readArray(boolean unshared) throws IOException {
    if (bin.readByte() != TC_ARRAY) {
        throw new InternalError();
    }
    ObjectStreamClass desc = readClassDesc(false);
    int len = bin.readInt();
    ...
    if (ccl == Integer.TYPE) {
        bin.readInts((int[]) array, 0, len);
        ...
    }
    ...
}
Above, bin is a BlockDataInputStream, which is another implementation of DataInput interface. Note how len, the number of items in the array stored by array serialization counterpart, is passed to readInts, which calls readInt in a loop len times (line 2918).
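Applied to the question's scenario, the same idea works whenever the count is known out-of-band; since the array is said to contain only ints, the count is simply data.length / 4. A minimal sketch:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

class CountedReadExample {
    static int[] readAllInts(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = data.length / 4;      // each int occupies 4 bytes
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = in.readInt();     // no EOFException expected here
        }
        return values;
    }
}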
I am attempting to use Gson to take some Java object, serialize it to JSON and get a byte array that represents that JSON. I need a byte array because I am passing the output on to an external dependency that requires it to be a byte array.
public byte[] serialize(Object object) {
    return gson.toJson(object).getBytes();
}
I have 2 questions:
If the input is a String gson seems to return the String as is. It doesn't do any validation of the input. Is this expected? I'd like to use Gson in a way that it would validate that the input object is actually Json. How could I do this?
I'm gonna be invoking this serialize function several thousands of times over a short period. Converting to String and then to byte[] could be some unwanted overhead. Is there a more optimal way to get the byte[]?
edit: my answer on point 1 was misinformed.
2) There will be a lot of unnecessary overhead from reflection if you just use the vanilla Gson converter. It would very much be a performance benefit in your case to write a custom adapter (a sketch follows below the link). Here is one article with more info on that:
https://open.blogs.nytimes.com/2016/02/11/improving-startup-time-in-the-nytimes-android-app/?_r=0
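For illustration, a minimal custom TypeAdapter might look like the sketch below; the Person class is made up, and the adapter is registered via GsonBuilder so Gson does not have to fall back to reflection for that type:

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.TypeAdapter;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;
import java.io.IOException;

class Person {
    String name;
    int age;
}

class PersonAdapter extends TypeAdapter<Person> {
    @Override
    public void write(JsonWriter out, Person value) throws IOException {
        out.beginObject();
        out.name("name").value(value.name);
        out.name("age").value(value.age);
        out.endObject();
    }

    @Override
    public Person read(JsonReader in) throws IOException {
        Person p = new Person();
        in.beginObject();
        while (in.hasNext()) {
            switch (in.nextName()) {
                case "name": p.name = in.nextString(); break;
                case "age":  p.age = in.nextInt();     break;
                default:     in.skipValue();
            }
        }
        in.endObject();
        return p;
    }
}

// registration:
// Gson gson = new GsonBuilder()
//         .registerTypeAdapter(Person.class, new PersonAdapter())
//         .create();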
If the input is a String gson seems to return the String as is. It doesn't do any validation of the input. Is this expected?
Yes, this is fine. It just returns a JSON string representation of the given string.
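For example (a plain new Gson() instance; the output is the input value wrapped in JSON quotes, not the raw string):

import com.google.gson.Gson;

class ToJsonStringExample {
    public static void main(String[] args) {
        Gson gson = new Gson();
        String json = gson.toJson("foo");
        // json is the five characters "foo" - i.e. the string value
        // surrounded by JSON double quotes
        System.out.println(json);
    }
}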
I'd like to use Gson in a way that it would validate that the input object is actually Json. How could I do this?
No need per se. The Gson.toJson() method accepts objects to be serialized and always generates valid JSON. If you mean deserialization, then Gson fails fast on invalid JSON documents during reading/parsing/deserialization (actually reading, since that is the bottom-most layer of Gson).
I'm gonna be invoking this serialize function several thousands of times over a short period. Converting to String and then to byte[] could be some unwanted overhead. Is there a more optimal way to get the byte[]?
Yes, accumulating a JSON string just in order to expose a clone of its internal char[] is a waste of memory, of course. Gson is basically a stream-oriented tool: note that there are Gson.toJson method overloads accepting an Appendable, and these are basically the Gson core (just take a quick look at how Gson.toJson(Object) works -- it simply creates a StringWriter instance, which is an Appendable, to accumulate the string). It would be extremely cool if Gson could emit JSON tokens via a Reader rather than writing to an Appendable, but that idea was refused and will most likely never be implemented in Gson, unfortunately. Since Gson cannot emit JSON in a pull/read manner (from your code's perspective), you have to buffer the whole result:
private static byte[] serializeToBytes(final Object object)
        throws IOException {
    final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // specify the charset explicitly rather than relying on the platform default
    final OutputStreamWriter writer = new OutputStreamWriter(outputStream, StandardCharsets.UTF_8);
    gson.toJson(object, writer);
    writer.flush();
    return outputStream.toByteArray();
}
This one does not use a StringWriter, and thus does not accumulate an intermediate string with its cloned-array ping-pong. I don't know whether there are writers/output streams that can re-use existing byte arrays, but I believe there should be some, because that fits the performance goals you mention in your question.
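One option along those lines (a sketch, assuming it is acceptable in your code to reuse a buffer between calls, e.g. one instance per thread) is to keep a single ByteArrayOutputStream and reset() it, so its internal array is reused instead of reallocated on every call; toByteArray() still makes one copy of the content:

import com.google.gson.Gson;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

class ReusableSerializer {
    private final Gson gson = new Gson();
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream(8192);

    // not thread-safe: use one instance per thread
    byte[] serializeToBytes(Object object) throws IOException {
        buffer.reset(); // keeps the internal byte array, just rewinds the count
        OutputStreamWriter writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8);
        gson.toJson(object, writer);
        writer.flush();
        return buffer.toByteArray(); // one copy of the actual content
    }
}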
If possible, you should also check whether your library's interface/API can expose or accept OutputStreams somehow -- then you could probably pass such output streams straight to the serializeToBytes method, or even remove the method. If it can consume input streams, not just byte arrays, you could also look at converting output streams to input streams, so that the serializeToBytes method could return an InputStream or a Reader (this requires some overhead, but can process data of unbounded size -- you need to find the balance):
private static InputStream serializeToByteStream(final Object object)
        throws IOException {
    final PipedInputStream inputStream = new PipedInputStream();
    final OutputStream outputStream = new PipedOutputStream(inputStream);
    new Thread(() -> {
        try {
            final OutputStreamWriter writer = new OutputStreamWriter(outputStream);
            gson.toJson(object, writer);
            writer.flush();
        } catch ( final IOException ex ) {
            throw new RuntimeException(ex);
        } finally {
            try {
                outputStream.close();
            } catch ( final IOException ex ) {
                throw new RuntimeException(ex);
            }
        }
    }).start();
    return inputStream;
}
Example of use:
final String value = "foo";
System.out.println(Arrays.toString(serializeToBytes(value)));
try ( final InputStream inputStream = serializeToByteStream(value) ) {
int b;
while ( (b = inputStream.read()) != -1 ) {
System.out.print(b);
System.out.print(' ');
}
System.out.println();
}
Output:
[34, 102, 111, 111, 34]
34 102 111 111 34
Both represent the ASCII codes of the serialized string "foo", quotes included (34 is the code of the double-quote character).
I am reading an array of bytes passed in to me (not my choice, but I have to use it this way). I need to get the data into a LinkedBlockingQueue, and ultimately step through the bytes to build one or more XML messages (the data may contain partial messages). So my question is this:
What generic type should I use for the LBQ?
What is the most efficient way to get the byte[] into that generic type?
Here is my example code:
void parsebytes(byte[] bytes, int length)
{
    // assume that I am doing other checks on the data
    if (length > 0)
    {
        myThread.putBytes(bytes, length);
    }
}
in my thread:
void putBytes(byte[] bytes, int length) throws InterruptedException
{
    for (int i = 0; i < length; i++)
    {
        blockingQueue.put(bytes[i]);
    }
}
I also do not want to have to pull off the blocking queue byte-by-byte either. I would rather grab everything that is in the queue and process it.
There is no such thing as a ListBlockingQueue. However, any BlockingQueue<Object> will accept byte[] since Java arrays are objects.
In the absence of other design considerations, the simplest option might be to just stick the arrays into the queue as they arrive and let the consumer stitch them together (see the consumer sketch after the snippet below).
Consider this:
BlockingQueue<byte[]> q = new LinkedBlockingQueue<>();
q.put(new byte[] {1,2,3});
byte[] bytes = q.take();
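On the consumer side, a sketch of "grab everything that is in the queue" rather than taking it byte by byte, using drainTo and stitching the chunks together (names are illustrative):

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

class ChunkConsumer {
    static byte[] drainAll(BlockingQueue<byte[]> queue) throws InterruptedException {
        List<byte[]> chunks = new ArrayList<>();
        chunks.add(queue.take());      // block until at least one chunk arrives
        queue.drainTo(chunks);         // then grab whatever else is already queued
        ByteArrayOutputStream stitched = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            stitched.write(chunk, 0, chunk.length);
        }
        return stitched.toByteArray(); // feed this to the XML message builder
    }
}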
I want to create an InputStream that is limited to a certain range of bytes in a file, e.g. to bytes from position 0 to 100, so that the client code sees EOF once the 100th byte is reached.
The read() method of InputStream reads a single byte at a time. You could write a subclass of InputStream that maintains an internal counter; each time read() is called, update the counter. If you have hit your maximum, do not allow any further reads (return -1 or something like that).
You will also need to make sure the other reading methods (such as the bulk read(byte[], int, int) overload) are handled as well, for example by overriding them too, or by just throwing UnsupportedOperationException() from them.
I don't know what your use case is, but as a bonus you may want to implement buffering as well.
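The simplest form of that buffering is probably just wrapping the limited stream; a sketch, where the limited stream stands for whatever counting subclass you end up writing:

import java.io.BufferedInputStream;
import java.io.InputStream;

class Buffering {
    static InputStream buffered(InputStream limitedStream) {
        // the wrapper only ever reads through the limited stream,
        // so the byte limit is still respected
        return new BufferedInputStream(limitedStream);
    }
}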
As danben says, just decorate your stream and enforce the constraint:
public class ConstrainedInputStream extends InputStream {

    private final InputStream decorated;
    private long length;

    public ConstrainedInputStream(InputStream decorated, long length) {
        this.decorated = decorated;
        this.length = length;
    }

    @Override public int read() throws IOException {
        return (length-- <= 0) ? -1 : decorated.read();
    }

    // TODO: override other methods if you feel it's necessary
    // optionally, extend FilterInputStream instead
}
Consider using http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/io/LimitInputStream.html
If you only need 100 bytes, then simple is probably best: I'd read them into an array and wrap that in a ByteArrayInputStream. E.g.
int length = 100;
byte[] data = new byte[length];
InputStream in = ...; // your input stream
DataInputStream din = new DataInputStream(in);
din.readFully(data);
ByteArrayInputStream first100Bytes = new ByteArrayInputStream(data);
// pass first100Bytes to your clients
If you don't want to use DataInputStream.readFully, there is IOUtils.readFully from Apache commons-io, or you can implement the read loop explicitly.
If you have more advanced needs, such as reading a segment from the middle of the file or handling larger amounts of data, then extending InputStream and overriding read(byte[], int, int) as well as read() will give you better performance than overriding the read() method alone.
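For illustration, a sketch of that approach: a decorator that caps the readable bytes and overrides both read() and read(byte[], int, int). This mirrors the ConstrainedInputStream above but adds the bulk-read override; the class name is made up:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class BoundedInputStream extends FilterInputStream {
    private long remaining;

    BoundedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;
        int b = in.read();
        if (b != -1) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        // never ask the underlying stream for more than the remaining budget
        int n = in.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }
}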
You can use guava's ByteStreams.
Notice that you should use skipFully() before limit, for example:
ByteStreams.skipFully(tmpStream, range.start());
tmpStream = ByteStreams.limit(tmpStream, range.length());
In addition to this solution, by using the skip method of an InputStream you can also read a range starting in the middle of the file.
public class RangeInputStream extends InputStream
{
    private InputStream parent;
    private long remaining;

    public RangeInputStream(InputStream parent, long start, long end) throws IOException
    {
        if (end < start)
        {
            throw new IllegalArgumentException("end < start");
        }
        if (parent.skip(start) < start)
        {
            throw new IOException("Unable to skip leading bytes");
        }
        this.parent = parent;
        remaining = end - start;
    }

    @Override
    public int read() throws IOException
    {
        return --remaining >= 0 ? parent.read() : -1;
    }
}
I solved a similar problem for my project; you can see the working code here: PartInputStream.
I used it for asset and file input streams. But it is not suitable for streams whose length is not available initially, such as network streams.