I am attempting to use Gson to take some Java object, serialize it to JSON, and get a byte array that represents that JSON. I need a byte array because I am passing the output on to an external dependency that requires a byte array.
public byte[] serialize(Object object){
return gson.toJson(object).getBytes();
}
I have 2 questions:
If the input is a String gson seems to return the String as is. It doesn't do any validation of the input. Is this expected? I'd like to use Gson in a way that it would validate that the input object is actually Json. How could I do this?
I'm gonna be invoking this serialize function several thousands of times over a short period. Converting to String and then to byte[] could be some unwanted overhead. Is there a more optimal way to get the byte[]?
edit: my answer on point 1 was misinformed.
2) There will be a lot of unnecessary reflection overhead if you just use the vanilla Gson converter. Writing a custom adapter would very likely be a performance benefit in your case. Here is one article with more info on that:
https://open.blogs.nytimes.com/2016/02/11/improving-startup-time-in-the-nytimes-android-app/?_r=0
If the input is a String gson seems to return the String as is. It doesn't do any validation of the input. Is this expected?
Yes, this is fine. It just returns a JSON string representation of the given string.
I'd like to use Gson in a way that it would validate that the input object is actually Json. How could I do this?
No need per se. The Gson.toJson() method accepts objects to be serialized and always generates valid JSON. If you mean deserialization, then Gson fails fast on invalid JSON documents during reading/parsing/deserialization (actually during reading, which is the bottom-most layer of Gson).
I'm gonna be invoking this serialize function several thousands of times over a short period. Converting to String and then to byte[] could be some unwanted overhead. Is there a more optimal way to get the byte[]?
Yes, accumulating a JSON string just to expose a clone of its internal char[] is a waste of memory, of course. Gson is basically a stream-oriented tool, and note that there are Gson.toJson method overloads accepting Appendable that are basically the Gson core (just take a quick look at how Gson.toJson(Object) works -- it just creates a StringWriter instance to accumulate a string because of the Appendable interface). It would be extremely cool if Gson could emit JSON tokens via a Reader rather than writing to an Appendable, but this idea was refused and most likely will never be implemented in Gson, unfortunately. Since Gson does not emit JSON tokens during serialization in a read-semantics manner (from your code's perspective), you have to buffer the whole result:
private static byte[] serializeToBytes(final Object object)
throws IOException {
final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
final OutputStreamWriter writer = new OutputStreamWriter(outputStream, "UTF-8"); // explicit charset instead of the platform default
gson.toJson(object, writer);
writer.flush();
return outputStream.toByteArray();
}
This one does not use a StringWriter, so it does not accumulate an intermediate string and avoids copying arrays back and forth. I don't know if there are writers/output streams that can utilize/re-use existing byte arrays, but I believe there should be some, because that would serve the performance goals you mentioned in your question.
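On the question of re-using buffers: ByteArrayOutputStream.reset() discards the accumulated output but keeps the already allocated buffer, so one stream can serve many calls. Below is a minimal sketch of that idea, not a definitive implementation. JsonEmitter and ReusableByteSerializer are hypothetical names introduced here; the emitter stands in for something like a lambda that calls gson.toJson(object, writer).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for "something that writes JSON to a Writer".
interface JsonEmitter {
    void emit(Writer writer) throws IOException;
}

// Not thread-safe: the buffer is shared state, so use one instance per thread.
final class ReusableByteSerializer {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream(1024);

    byte[] serialize(final JsonEmitter emitter) {
        buffer.reset(); // drops old content, keeps the allocated backing array
        try {
            final Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8);
            emitter.emit(writer);
            writer.flush();
        } catch (final IOException ex) {
            throw new UncheckedIOException(ex); // not expected for an in-memory stream
        }
        return buffer.toByteArray(); // still one copy, sized exactly to the output
    }
}
```

The final toByteArray() call still copies, so this only saves the repeated growth of the internal buffer; whether that matters for a few thousand calls is something to measure rather than assume.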
If possible, you can also check your library interface/API for exposing/accepting OutputStreams somehow -- then you could probably easily pass such output streams to the serializeToBytes method or even remove the method. If it can use input streams, not just byte arrays, you could also take a look at converting output streams to input streams so the serializeToBytes method could return an InputStream or a Reader (requires some overhead, but can process infinite data -- need to find the balance):
private static InputStream serializeToByteStream(final Object object)
throws IOException {
final PipedInputStream inputStream = new PipedInputStream();
final OutputStream outputStream = new PipedOutputStream(inputStream);
new Thread(() -> {
try {
final OutputStreamWriter writer = new OutputStreamWriter(outputStream, "UTF-8"); // explicit charset instead of the platform default
gson.toJson(object, writer);
writer.flush();
} catch ( final IOException ex ) {
throw new RuntimeException(ex);
} finally {
try {
outputStream.close();
} catch ( final IOException ex ) {
throw new RuntimeException(ex);
}
}
}).start();
return inputStream;
}
Example of use:
final String value = "foo";
System.out.println(Arrays.toString(serializeToBytes(value)));
try ( final InputStream inputStream = serializeToByteStream(value) ) {
int b;
while ( (b = inputStream.read()) != -1 ) {
System.out.print(b);
System.out.print(' ');
}
System.out.println();
}
Output:
[34, 102, 111, 111, 34]
34 102 111 111 34
Both are the ASCII codes of the JSON string literal "foo", including the surrounding double quotes (code 34).
I have a reporting web application which generates reports. The application gets data from a database and stores data into a StringWriter object. I have to get this data in a byte array format to create a csv file and send it to browser.
Below is the code snippet
return new FileTransfer(fileName, reportType.getMimeType(),
new ByteArrayInputStream(generateCSV(reportType, grid, new DataList(), params).toString().getBytes("UTF-8")));
where generateCSV returns a StringWriter object, then to convert it into byte array I am calling toString and then getBytes() method. Below is what the generateCSV method looks like
StringWriter generateCSV(ReportType reportType, GridConfig grid, DataList dataList, String params) {......}
The problem is that when my report has huge records (more than 1 million), the getBytes() method fails with
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
The whole report, when converted to a String object, has a huge number of characters (billions of them). The .getBytes("UTF-8") method then encodes it into a byte array with at least one element per character, and for 1 million records the result exceeds the maximum JVM array size (https://plumbr.io/outofmemoryerror/requested-array-size-exceeds-vm-limit).
Now how can I avoid use of toString().getBytes("UTF-8") to avoid OOM error? Is there any better approach to convert to byte array from StringWriter?
It’s strange to receive the result of generateCSV as a StringWriter; the preferred solution would be to let the method write to a target while generating, so you don’t have the entire contents in memory.
In either case, you should resort to the FileTransfer(String, String mimeType, OutputStreamLoader) constructor, to receive the target OutputStream when it is time to write the actual data.
When you can’t avoid the intermediate StringWriter, you should at least avoid calling toString on it, as constructing a String implies creating a copy of the entire buffer.
So a solution could look like:
return new FileTransfer(fileName, reportType.getMimeType(), new OutputStreamLoader() {
public void close() {}
public void load(OutputStream out) throws IOException {
// the best would be to let generateCSV write to out directly
// otherwise use:
StringBuffer sb = generateCSV(reportType, grid, new DataList(), params).getBuffer();
Writer w = new OutputStreamWriter(out, "UTF-8");
final int bufSize = 8192;
for(int s = 0, e; s < sb.length(); s = e) {
e = Math.min(sb.length(), s + bufSize);
w.write(sb.substring(s, e));
}
w.flush(); // let the caller close the OutputStream
}
});
An alternative to StringWriter would be CharArrayWriter, which has a writeTo(Writer out), which eliminates the need to implement a manual copying loop and might be even more efficient. But, as said, refactoring generateCSV to write directly to a target would be even better.
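The CharArrayWriter variant could be sketched as follows. This assumes generateCSV can be changed to return (or fill) a CharArrayWriter; CsvBufferCopy and copyTo are hypothetical names used only for illustration.

```java
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class CsvBufferCopy {
    // Encodes the buffered characters straight into `out`; no String is built.
    static void copyTo(final CharArrayWriter csv, final OutputStream out) throws IOException {
        final Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        csv.writeTo(writer); // hands the internal char[] to the writer in one call
        writer.flush();      // closing `out` is left to the caller
    }
}
```

Inside load(OutputStream out) this would replace the manual substring loop with a single copyTo call. The whole report still sits in memory once, as characters, so it reduces copies but does not remove the underlying size limit.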
The StringWriter holds its content in the memory. So it's not a good approach to use it with large files.
You should try to chunk the File directly to the InputStream without the StringWriter in the middle.
What about your own InputStream implementation that reads the data and converts it to CSV on the fly?
Can you show us the generateCSV method?
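Without seeing generateCSV, the streaming idea can only be sketched under assumptions. Here the rows are assumed to be available as an Iterator of String arrays (for example, backed by a database cursor); StreamingCsv and writeCsv are hypothetical names, and the sketch does no CSV quoting or escaping.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

final class StreamingCsv {
    // Writes one row at a time, so only the current row is held in memory.
    static void writeCsv(final Iterator<String[]> rows, final OutputStream out) throws IOException {
        final Writer writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        while (rows.hasNext()) {
            writer.write(String.join(",", rows.next())); // no quoting/escaping in this sketch
            writer.write('\n');
        }
        writer.flush(); // the caller owns and closes `out`
    }
}
```

Because nothing is accumulated, the memory used stays roughly constant regardless of how many records the report has.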
Suppose we have some binary data byte[] data that only contains Integers. If I wanted to read this data utilizing a DataInputStream, the only approach I can come up with is the following:
DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
try {
while (true){
int i = in.readInt();
}
} catch (EOFException e) {
// we're done!
} catch (IOException e){
throw new RuntimeException(e);
}
What bugs me about this is that reaching the end of the stream is expected, and it would be exceptional only if no exception was thrown, which IMO defeats the purpose of exceptions in the first place.
When using Java NIO's IntBuffer, there's no such problem.
IntBuffer in = ByteBuffer.wrap(data).asIntBuffer();
while (in.hasRemaining()){
int i = in.get();
}
Coming from C# and being in the process of learning Java I refuse to believe that this is the intended way of doing this.
Moreover, I just came across Java NIO which seems to be "quite new". Using IntBuffer here instead would be my way of procrastinating the matter. Regardless, I wanna know how this is properly done in Java.
You can't. readInt() can return any integer value, so an out-of-band mechanism is required to signal end of stream, so an exception is thrown. That's how the API was designed. Nothing you can do about it.
Since you are coming from .NET, Java's DataInputStream is roughly equivalent to BinaryReader of .NET.
Just like its .NET equivalent, DataInputStream class and its main interface, DataInput, have no provision for determining if a primitive of any given type is available for retrieval at the current position of the stream.
You can gain valuable insight of how the designers of the API expect you to use it by looking at designer's own usage of the API.
For example, look at ObjectInputStream.java source, which is used for object deserialization. The code that reads arrays of various types calls type-specific readXYZ methods of DataInput in a loop. In order to figure out where the primitives end, the code retrieves the number of items (line 1642):
private Object readArray(boolean unshared) throws IOException {
if (bin.readByte() != TC_ARRAY) {
throw new InternalError();
}
ObjectStreamClass desc = readClassDesc(false);
int len = bin.readInt();
...
if (ccl == Integer.TYPE) {
bin.readInts((int[]) array, 0, len);
...
}
...
}
Above, bin is a BlockDataInputStream, which is another implementation of DataInput interface. Note how len, the number of items in the array stored by array serialization counterpart, is passed to readInts, which calls readInt in a loop len times (line 2918).
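Applied to the byte[] from the question, the same count-first idea might look like the sketch below. No length prefix is stored there, so the count is derived from the array length instead; IntArrayReader is a hypothetical name, and the data is assumed to be big-endian, which is what readInt expects.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class IntArrayReader {
    // Reads big-endian ints; the count is known up front, so EOF is never reached.
    static int[] readInts(final byte[] data) {
        if (data.length % Integer.BYTES != 0) {
            throw new IllegalArgumentException("length is not a multiple of 4: " + data.length);
        }
        final int[] result = new int[data.length / Integer.BYTES];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < result.length; i++) {
                result[i] = in.readInt();
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return result;
    }
}
```

With the loop bounded by a known count, an EOFException goes back to meaning what it should: the data was shorter than promised.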
I am using google GSON API to parse a JSON file for my Android project but I have an issue of performance.
Here is the source code I use for parsing the JSON with google GSON API :
public void loadJsonInDb(String path){
InputStream isJson = context.getAssets().open(path);
if (isJson != null) {
int sizeJson = isJson.available();
byte[] bufferJson = new byte[sizeJson];
isJson.read(bufferJson);
isJson.close();
String jsonStr = new String(bufferJson, "UTF-8");
JsonParser parser = new JsonParser();
JsonObject object = parser.parse(jsonStr).getAsJsonObject();
JsonArray array = object.getAsJsonArray("datas");
Gson gson = new Gson();
for(JsonElement jsonElement : array){
MyEntity entity = gson.fromJson(jsonElement, MyEntity.class);
// Do insert into Db stuffs
}
}
}
The problem with this is that after parsing I have to go through the JsonArray with a for loop and perform the desired action (an insertion into an SQLite DB with ORMLite for each element in the array). I would like to know if it is possible to perform the insertion on the fly during parsing, instead of waiting for the array to be computed. I have seen in the documentation that maybe JsonStreamParser can do the job, but I am not sure how to use it.
I have a few notes regarding the use of Gson and other stuff.
You should close I/O resources in finally blocks to ensure you don't have resource leaks (available and read may throw an exception that prevents the resource from being closed). (Also I'm not sure if using available is a good idea here.)
You just don't have to use Strings in this case. Strings are generally a performance/memory killer for such a scenario (much depends on their resulting sizes) since strings are accumulated in memory, so you lose your on-the-fly idea by having it all collected into memory first. In the worst case, it can end your application with an OutOfMemoryError.
You can read input streams with a specified encoding, so no string-buffering is necessary.
JsonParser is designed to return JSON trees: JsonElement contains the whole JSON tree in memory. Sounds similar to the strings case above, right? Another performance penalty here.
Creating Gson instances may be somewhat expensive (depending on what you compare it with, of course), and you can instantiate it once: it's thread-safe.
JsonStreamParser is not an option either, because each next() will produce another JSON tree branch in memory (again, it depends on how big your JSON documents, the $.datas array and its elements are).
Gson.fromJson uses a lookup to find the best type adapter; you can ask a Gson instance for a type adapter once and then waste no more time on lookups. Type adapters are usually perfectly thread-safe too, and thus can be cached.
Summing up the above, you could implement it as follows:
private static final Gson gson = new Gson();
private static final TypeAdapter<MyEntity> myEntityTypeAdapter = gson.getAdapter(MyEntity.class);
private static void loadJsonInDb(final String path)
throws IOException {
// Java 7 language features can be easily converted to Java 6 try/finally
// Note the way how you can decorate (wrap) everything: an input stream (byte streams) to a reader (character streams, UTF-8 here) to a JSON reader (more high-level character reader)
try ( final JsonReader jsonReader = new JsonReader(new InputStreamReader(context.getAssets().open(path), "UTF-8")) ) {
// Ensure that we're about to open the root object
jsonReader.beginObject();
// And iterate each object property
while ( jsonReader.hasNext() ) {
// And check its name
final String name = jsonReader.nextName();
// Another Java 7 language feature
switch ( name ) {
// Is it datas?
case "datas":
// Then consume its opening array token
jsonReader.beginArray();
// And iterate each array element
while ( jsonReader.hasNext() ) {
// Read the current value as a MyEntity instance
final MyEntity myEntity = myEntityTypeAdapter.read(jsonReader);
// Now do what you want here
}
// "Close" the array
jsonReader.endArray();
break;
default:
// If it's something other than "datas" -- just skip the entire value -- Gson will do it efficiently (I hope, not sure)
jsonReader.skipValue();
break;
}
}
// "Close" the object
jsonReader.endObject();
}
}
Simply speaking, you just have to write a parser to consume each token. Now, having the following JSON document:
{
"object": {
},
"number": 2,
"array": [
],
"datas": [
{
"k": "v1"
},
{
"k": "v2"
},
{
"k": "v3"
}
]
}
the parser above would extract only $.datas.*, consuming as few resources as possible. Substituting // Now do what you want here with System.out.println(myEntity.k); would produce:
v1
v2
v3
assuming that MyEntity is final class MyEntity{final String k=null;}. Note that you can process infinite JSON documents using this approach too.
I have 2 suggestions here:
Deserialize the entire collection in 3 lines:
Gson gson = new Gson();
Type listType = new TypeToken<ArrayList<MyEntity>>(){}.getType();
List<MyEntity> listOf = gson.fromJson(jsonStr, listType);
Once you have the whole list of entities, use bulkInsert with a single transaction.
P.S.
To use bulkInsert you have to create a list of ContentValues from your entities.
Given a byte[] object, when we want to operate on it we often need only pieces of it. In my particular example I get a byte[] from the wire where the first 4 bytes describe the length of the message, the next 4 bytes the type of the message (an integer that maps to a concrete protobuf class), and the remaining bytes are the actual content of the message, like this:
length|type|content
In order to parse this message I have to pass the content part to the specific class that knows how to parse an instance from it. The problem is that often no methods are provided that let you specify from where to where the parser should read the array.
So what we end up doing is copying the remaining chunks of that array, which is not efficient.
As far as I know, in Java it is not possible to create another byte[] reference that actually refers to some original, bigger byte[] array with just 2 indexes (this was the approach with String that led to memory leaks).
I wonder how we solve situations like this? I suppose giving up on protobuf just because it does not provide some parseFrom(byte[], int, int) does not make sense. Protobuf is just an example; anything could lack that API.
So does this force us to write inefficient code or there is something that can be done? (apart from adding that method)
Normally you would tackle this kind of thing with streams.
A stream is an abstraction for reading just what you need to process the current block of data. So you can read the correct number of bytes into a byte array and pass it to your parse function.
You ask 'So does this force us to write inefficient code or there is something that can be done?'
Usually you get your data in the form of a stream and then using the technique demonstrated below will be more performant because you skip making one copy. (Two copies instead of three; once by the OS and once by you. You skip making a copy of the total byte array before you start parsing.) If you actually start out with a byte[] but it is constructed by yourself then you may want to change to constructing an object such as { int length, int type, byte[] contentBytes } instead and pass contentBytes to your parse function.
If you really, really have to start out with byte[] then the below technique is just a more convenient way to parse it, it would not be more performant.
So suppose you got a buffer of bytes from somewhere and you want to read the contents of that buffer. First you convert it to a stream:
private static List<Content> read(byte[] buffer) throws IOException {
// Wrapping does not copy the array; the stream reads from it directly.
ByteArrayInputStream bytesStream = new ByteArrayInputStream(buffer);
return read(bytesStream);
}
The above function wraps the byte array with a stream and passes it to the function that does the actual reading.
If you can start out from a stream then obviously you can skip the above step and just pass that stream into the below function directly:
private static List<Content> read(InputStream bytesStream) throws IOException {
List<Content> results = new ArrayList<Content>();
try {
// read the content...
Content content1 = readContent(bytesStream);
results.add(content1);
// I don't know if there's more than one content block but assuming
// that there is, you can just continue reading the stream...
//
// If it's a fixed number of content blocks then just read them one
// after the other... Otherwise make this a loop
Content content2 = readContent(bytesStream);
results.add(content2);
} finally {
bytesStream.close();
}
return results;
}
Since your byte-array contains content you will want to read Content blocks from the stream. Since you have a length and a type field, I am assuming that you have different kinds of content blocks. The next function reads the length and type and passes the processing of the content bytes on to the proper class depending on the read type:
private static Content readContent(InputStream stream) throws IOException {
final int CONTENT_TYPE_A = 10;
final int CONTENT_TYPE_B = 11;
// wrap the InputStream in a DataInputStream because the latter has
// convenience functions to convert bytes to integers, etc.
// Note that DataInputStream handles the stream in a BigEndian way,
// so check that your bytes are in the same byte order. If not you'll
// have to find another stream reader that can convert to ints from
// LittleEndian byte order.
DataInputStream data = new DataInputStream(stream);
int length = data.readInt();
int type = data.readInt();
// I'm assuming that above length field was the number of bytes for the
// content. So, read length number of bytes into a buffer and pass that
// to your `parseFrom(byte[])` function
byte[] contentBytes = new byte[length];
// readFully keeps reading until the buffer is full and throws an
// EOFException if the stream ends first; a plain read() may legally
// return fewer bytes than requested even when more data is coming
data.readFully(contentBytes);
switch (type) {
case CONTENT_TYPE_A:
return ContentTypeA.parseFrom(contentBytes);
case CONTENT_TYPE_B:
return ContentTypeB.parseFrom(contentBytes);
default:
throw new UnsupportedOperationException();
}
}
I have made up the below Content classes. I don't know what protobuf is but it can apparently convert from a byte array to an actual object with its parseFrom(byte[]) function, so take this as pseudocode:
class Content {
// common functionality
}
class ContentTypeA extends Content {
public static ContentTypeA parseFrom(byte[] contentBytes) {
return null; // do the actual parsing of a type A content
}
}
class ContentTypeB extends Content {
public static ContentTypeB parseFrom(byte[] contentBytes) {
return null; // do the actual parsing of a type B content
}
}
In Java, an array is not just a section of memory - it is an object that has some additional fields (at least length). So you cannot reference part of an array - you should either:
Use array-copy functions or
Implement and use some algorithm that uses only part of byte array.
The concern seems that there is no way to create a view over an array (e.g., an array equivalent of List#subList()). A workaround might be making your parsing methods take in the reference to the entire array and two indices (or an index and a length) to specify the sub-array the method should work on.
This would not prevent the methods from reading or modifying sections of the array they should not touch. Perhaps a ByteArrayView class could be made to add a little bit of safety if this is a concern:
public class ByteArrayView {
private final byte[] array;
private final int start;
private final int length;
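public ByteArrayView(byte[] array, int start, int length) {
if (start < 0 || length < 0 || length > array.length - start) {
throw new IndexOutOfBoundsException();
}
this.array = array;
this.start = start;
this.length = length;
}
public byte get(int index) {
if (index < 0 || index >= length) {
throw new IndexOutOfBoundsException("index: " + index);
}
return array[start + index];
}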
}
But if, on the other hand, performance is a concern, then a method call to get() for fetching each byte is probably undesirable.
The code is for illustration; it's not tested or anything.
EDIT
On a second reading of my own answer, I realized that I should point this out: having a ByteArrayView will copy each byte you read from the original array -- just byte by byte rather than as a chunk. It would be inadequate for the OP's concerns.
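The JDK already ships a view of this kind: ByteBuffer.wrap(array, offset, length) creates a buffer backed by the original array without copying it. Below is a minimal sketch for the length|type|content layout from the question, assuming big-endian integers and that the length field counts content bytes only. Frame is a hypothetical class, and whether a given parser accepts a ByteBuffer is something to check in its own API.

```java
import java.nio.ByteBuffer;

final class Frame {
    final int type;
    final ByteBuffer content; // view onto the original array, no bytes copied

    private Frame(final int type, final ByteBuffer content) {
        this.type = type;
        this.content = content;
    }

    // Assumed layout: 4-byte length, 4-byte type, then `length` content bytes.
    static Frame parse(final byte[] message) {
        final ByteBuffer header = ByteBuffer.wrap(message); // big-endian by default
        final int length = header.getInt();
        final int type = header.getInt();
        if (length < 0 || length > header.remaining()) {
            throw new IllegalArgumentException("bad content length: " + length);
        }
        // slice() makes index 0 of the view the first content byte
        final ByteBuffer content = ByteBuffer.wrap(message, header.position(), length).slice();
        return new Frame(type, content.asReadOnlyBuffer());
    }
}
```

The read-only view keeps consumers from modifying the array, but it still shares storage with it, so later changes to the original array are visible through the view.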
Consider a generic byte reader implementing the following simple API to read an unspecified number of bytes from a data structure that is otherwise inaccessible:
public interface ByteReader
{
public byte[] read() throws IOException; // Returns null only at EOF
}
How could the above be efficiently converted to a standard Java InputStream, so that an application using all methods defined by the InputStream class, works as expected?
A simple solution would be subclassing InputStream to
Call the read() method of the ByteReader as much as needed by the read(...) methods of the InputStream
Buffer the bytes retrieved in a byte[] array
Return part of the byte array as expected, e.g., 1 byte at a time whenever the InputStream read() method is called.
However, this requires more work to be efficient (e.g., for avoiding multiple byte array allocations). Also, for the application to scale to large input sizes, reading everything into memory and then processing is not an option.
Any ideas or open source implementations that could be used?
Create multiple ByteArrayInputStream instances around the returned arrays and use them in a stream that provides for concatenation. You could for instance use SequenceInputStream for this.
The trick is to implement an Enumeration<ByteArrayInputStream> that uses the ByteReader class.
EDIT: I've implemented this answer, but it is probably better to create your own InputStream instance instead. Unfortunately, this solution does not let you handle IOException gracefully.
final Enumeration<ByteArrayInputStream> basEnum = new Enumeration<ByteArrayInputStream>() {
ByteArrayInputStream baos;
boolean ended;
@Override
public boolean hasMoreElements() {
if (ended) {
return false;
}
if (baos == null) {
getNextBA();
if (ended) {
return false;
}
}
return true;
}
@Override
public ByteArrayInputStream nextElement() {
if (ended) {
throw new NoSuchElementException();
}
if (baos.available() != 0) {
return baos;
}
getNextBA();
return baos;
}
private void getNextBA() {
byte[] next;
try {
next = byteReader.read();
} catch (IOException e) {
throw new IllegalStateException("Issues reading byte arrays", e); // keep the cause
}
if (next == null) {
ended = true;
return;
}
this.baos = new ByteArrayInputStream(next);
}
};
SequenceInputStream sis = new SequenceInputStream(basEnum);
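For comparison, a custom InputStream along the lines suggested in the EDIT above might look like this. It is a sketch, not a tested library class: ByteReaderInputStream is a hypothetical name, and the ByteReader interface is repeated from the question only to keep the example self-contained. It holds one chunk at a time and lets IOException propagate normally.

```java
import java.io.IOException;
import java.io.InputStream;

// Same shape as the ByteReader interface from the question.
interface ByteReader {
    byte[] read() throws IOException; // returns null only at EOF
}

final class ByteReaderInputStream extends InputStream {
    private final ByteReader reader;
    private byte[] chunk = new byte[0];
    private int pos;
    private boolean eof;

    ByteReaderInputStream(final ByteReader reader) {
        this.reader = reader;
    }

    // Fetches chunks until one has unread bytes; returns false at EOF.
    private boolean fill() throws IOException {
        while (!eof && pos >= chunk.length) {
            final byte[] next = reader.read();
            if (next == null) {
                eof = true;
            } else {
                chunk = next; // empty chunks are skipped by the loop
                pos = 0;
            }
        }
        return pos < chunk.length;
    }

    @Override
    public int read() throws IOException {
        return fill() ? (chunk[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(final byte[] b, final int off, final int len) throws IOException {
        if (off < 0 || len < 0 || len > b.length - off) {
            throw new IndexOutOfBoundsException();
        }
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        final int n = Math.min(len, chunk.length - pos); // never crosses a chunk boundary
        System.arraycopy(chunk, pos, b, off, n);
        pos += n;
        return n;
    }

    @Override
    public int available() {
        return chunk.length - pos; // bytes readable without calling the ByteReader
    }
}
```

Overriding the bulk read matters for efficiency, because the inherited implementation would call the single-byte read() once per byte.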
I assume, by your use of "convert", that a replacement is acceptable.
The easiest way to do this is to just use a ByteArrayInputStream, which already provides all the features you are looking for (but must wrap an existing array), or to use any of the other already provided InputStream for reading data from various sources.
It seems like you may be running the risk of reinventing wheels here. If possible, I would consider scrapping your ByteReader interface entirely, and instead going with one of these options:
Replace with ByteArrayInputStream.
Use the various other InputStream classes (depending on the source of the data).
Extend InputStream with your custom implementation.
I'd stick to the existing InputStream class everywhere. I have no idea how your code is structured but you could, for example, add a getInputStream() method to your current data sources, and have them return an appropriate already-existing InputStream (or a custom subclass if necessary).
By the way, I recommend avoiding the term Reader in your own IO classes, as Reader is already heavily used in the Java SDK to indicate stream readers that operate on encoded character data (as opposed to InputStream which generally operates on raw byte data).