Consider a generic byte reader implementing the following simple API to read an unspecified number of bytes from a data structure that is otherwise inaccessible:
public interface ByteReader
{
    public byte[] read() throws IOException; // Returns null only at EOF
}
How could the above be efficiently converted to a standard Java InputStream, so that an application using all the methods defined by the InputStream class works as expected?
A simple solution would be subclassing InputStream to:
Call the read() method of the ByteReader as often as needed by the read(...) methods of the InputStream
Buffer the bytes retrieved in a byte[] array
Return part of the byte array as expected, e.g., 1 byte at a time whenever the InputStream read() method is called.
However, this requires more work to be efficient (e.g., for avoiding multiple byte array allocations). Also, for the application to scale to large input sizes, reading everything into memory and then processing is not an option.
Any ideas or open source implementations that could be used?
Create multiple ByteArrayInputStream instances around the returned arrays and use them in a stream that provides for concatenation. You could for instance use SequenceInputStream for this.
The trick is to implement an Enumeration<ByteArrayInputStream> that uses the ByteReader class.
EDIT: I've implemented this answer, but it is probably better to create your own InputStream instance instead. Unfortunately, this solution does not let you handle IOException gracefully.
final Enumeration<ByteArrayInputStream> basEnum = new Enumeration<ByteArrayInputStream>() {
    ByteArrayInputStream baos;
    boolean ended;

    @Override
    public boolean hasMoreElements() {
        if (ended) {
            return false;
        }
        if (baos == null) {
            getNextBA();
            if (ended) {
                return false;
            }
        }
        return true;
    }

    @Override
    public ByteArrayInputStream nextElement() {
        if (ended) {
            throw new NoSuchElementException();
        }
        if (baos.available() != 0) {
            return baos;
        }
        getNextBA();
        return baos;
    }

    private void getNextBA() {
        byte[] next;
        try {
            next = byteReader.read();
        } catch (IOException e) {
            throw new IllegalStateException("Issues reading byte arrays", e);
        }
        if (next == null) {
            ended = true;
            return;
        }
        this.baos = new ByteArrayInputStream(next);
    }
};
SequenceInputStream sis = new SequenceInputStream(basEnum);
I assume, by your use of "convert", that a replacement is acceptable.
The easiest way to do this is to just use a ByteArrayInputStream, which already provides all the features you are looking for (but must wrap an existing array), or to use any of the other InputStream implementations already provided for reading data from various sources.
It seems like you may be running the risk of reinventing wheels here. If possible, I would consider scrapping your ByteReader interface entirely, and instead going with one of these options:
Replace it with ByteArrayInputStream.
Use the various other InputStream classes (depending on the source of the data).
Extend InputStream with your custom implementation (a sketch follows at the end of this answer).
I'd stick to the existing InputStream class everywhere. I have no idea how your code is structured but you could, for example, add a getInputStream() method to your current data sources, and have them return an appropriate already-existing InputStream (or a custom subclass if necessary).
By the way, I recommend avoiding the term Reader in your own IO classes, as Reader is already heavily used in the Java SDK to indicate stream readers that operate on encoded character data (as opposed to InputStream which generally operates on raw byte data).
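If you go the custom-subclass route (the third option above), here is a minimal sketch of what it could look like. The class name ByteReaderInputStream is made up, and it assumes the ByteReader interface from the question:
import java.io.IOException;
import java.io.InputStream;

// Minimal sketch: wrap the ByteReader in a custom InputStream.
// Buffers the most recently returned array and refills it lazily.
public class ByteReaderInputStream extends InputStream {
    private final ByteReader reader;
    private byte[] buffer = new byte[0];
    private int pos;
    private boolean eof;

    public ByteReaderInputStream(ByteReader reader) {
        this.reader = reader;
    }

    // Pulls another array from the ByteReader when the current one is exhausted.
    private boolean ensureData() throws IOException {
        while (!eof && pos >= buffer.length) {
            byte[] next = reader.read();
            if (next == null) {
                eof = true;
            } else {
                buffer = next;
                pos = 0;
            }
        }
        return !eof || pos < buffer.length;
    }

    @Override
    public int read() throws IOException {
        if (!ensureData()) {
            return -1;
        }
        return buffer[pos++] & 0xFF;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (!ensureData()) {
            return -1;
        }
        int n = Math.min(len, buffer.length - pos);
        System.arraycopy(buffer, pos, b, off, n);
        pos += n;
        return n;
    }
}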
Suppose we have some binary data byte[] data that contains only integers. If I wanted to read this data using a DataInputStream, the only approach I can come up with is the following:
DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
try {
    while (true) {
        int i = in.readInt();
    }
} catch (EOFException e) {
    // we're done!
} catch (IOException e) {
    throw new RuntimeException(e);
}
What bugs me about this is that reaching the end of the stream is expected; it would be exceptional only if no exception were thrown, which IMO defeats the purpose of exceptions in the first place.
When using Java NIO's IntBuffer, there's no such problem.
IntBuffer in = ByteBuffer.wrap(data).asIntBuffer();
while (in.hasRemaining()) {
    int i = in.get();
}
Coming from C# and being in the process of learning Java, I refuse to believe that this is the intended way of doing this.
Moreover, I just came across Java NIO, which seems to be "quite new". Using IntBuffer here instead would be my way of procrastinating the matter. Regardless, I want to know how this is properly done in Java.
You can't. readInt() can return any integer value, so an out-of-band mechanism is required to signal the end of the stream, and that mechanism is the exception. That's how the API was designed. Nothing you can do about it.
Since you are coming from .NET, Java's DataInputStream is roughly equivalent to BinaryReader of .NET.
Just like its .NET equivalent, DataInputStream class and its main interface, DataInput, have no provision for determining if a primitive of any given type is available for retrieval at the current position of the stream.
You can gain valuable insight into how the designers of the API expect you to use it by looking at the designers' own usage of the API.
For example, look at ObjectInputStream.java source, which is used for object deserialization. The code that reads arrays of various types calls type-specific readXYZ methods of DataInput in a loop. In order to figure out where the primitives end, the code retrieves the number of items (line 1642):
private Object readArray(boolean unshared) throws IOException {
    if (bin.readByte() != TC_ARRAY) {
        throw new InternalError();
    }
    ObjectStreamClass desc = readClassDesc(false);
    int len = bin.readInt();
    ...
    if (ccl == Integer.TYPE) {
        bin.readInts((int[]) array, 0, len);
        ...
    }
    ...
}
Above, bin is a BlockDataInputStream, which is another implementation of the DataInput interface. Note how len, the number of items written by the array serialization counterpart, is passed to readInts, which calls readInt in a loop len times (line 2918).
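If you control the data format, you can apply the same idea yourself: write the element count before the elements, then read exactly that many back, so no EOFException handling is needed. A minimal sketch (not from the quoted source; the class and method names are made up):
import java.io.*;

public class LengthPrefixedInts {
    // Writes the count first, then the values.
    static byte[] write(int[] values) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(values.length);
        for (int v : values) {
            out.writeInt(v);
        }
        out.flush();
        return bos.toByteArray();
    }

    // Reads the count, then exactly that many values; no EOFException involved.
    static int[] read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int len = in.readInt();
        int[] values = new int[len];
        for (int i = 0; i < len; i++) {
            values[i] = in.readInt();
        }
        return values;
    }
}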
Java 8's stream API is convenient and has gained popularity. For file I/O, I found that two APIs are provided to generate stream output: Files.lines(path) and bufferedReader.lines().
I did not find a stream API that provides a Stream of fixed-size buffers for reading files, though.
My concern is: in the case of files with very long lines, e.g. a 4GB file with only a single line, aren't these line-based APIs very inefficient?
The line-based reader will need at least 4GB of memory to keep that line.
Compared to a fixed-size buffer reader (fileInputStream.read(byte[] b, int off, int len)), which takes at most the buffer size of memory.
If the above concern is true, are there any Stream APIs for file I/O which are more efficient?
If you have a 4GB text file with a single line, and you're processing it "line by line", then you've made a serious error in your programming by not understanding the data you're working with.
The line-based methods are convenience methods for when you need to do simple work with data like CSV or another such format, and the line sizes are manageable.
A real life example of a 4GB text file with a single line would be an XML file without line breaks. You would use a streaming XML parser to read that, not roll your own solution that reads line by line.
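For illustration, a hedged sketch of such streaming parsing with StAX (not part of the original answer; the file name and element name are made up):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {
    public static void main(String[] args) throws Exception {
        // "records.xml" and "record" are made-up names; the parser pulls events
        // incrementally, so a single-line 4GB document is never held in memory.
        try (InputStream in = Files.newInputStream(Paths.get("records.xml"))) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // process one record element at a time
                }
            }
            reader.close();
        }
    }
}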
Which method of delivery is appropriate depends on how you want to process the data. So if your processing requires handling the data line by line, there is no way around doing it that way.
If you really want fixed-size chunks of character data, you can use the following method(s):
public static Stream<String> chunks(Path path, int chunkSize) throws IOException {
    return chunks(path, chunkSize, StandardCharsets.UTF_8);
}

public static Stream<String> chunks(Path path, int chunkSize, Charset cs)
        throws IOException {
    Objects.requireNonNull(path);
    Objects.requireNonNull(cs);
    if (chunkSize <= 0) throw new IllegalArgumentException();
    CharBuffer cb = CharBuffer.allocate(chunkSize);
    BufferedReader r = Files.newBufferedReader(path, cs);
    return StreamSupport.stream(
        new Spliterators.AbstractSpliterator<String>(
            Files.size(path)/chunkSize, Spliterator.ORDERED|Spliterator.NONNULL) {
            @Override public boolean tryAdvance(Consumer<? super String> action) {
                try { do {} while(cb.hasRemaining() && r.read(cb) > 0); }
                catch (IOException ex) { throw new UncheckedIOException(ex); }
                if (cb.position() == 0) return false;
                action.accept(cb.flip().toString());
                return true;
            }
        }, false).onClose(() -> {
            try { r.close(); } catch(IOException ex) { throw new UncheckedIOException(ex); }
        });
}
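A short usage sketch of the method above (the file name and chunk size are made up):
// Hypothetical usage of chunks(...) defined above; "big.txt" and 8192 are placeholders.
try (Stream<String> s = chunks(Paths.get("big.txt"), 8192)) {
    s.forEach(chunk -> System.out.println("read " + chunk.length() + " chars"));
}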
But I wouldn't be surprised if your next question is "how can I merge adjacent stream elements", as these fixed-size chunks are rarely the natural data unit for your actual task.
More often than not, the subsequent step is to perform pattern matching on the contents, and in that case it is better to use Scanner in the first place. Scanner is capable of performing pattern matching while streaming the data, and it can do so efficiently because the regex engine tells it whether buffering more data could change the outcome of a match operation (see hitEnd() and requireEnd()). Unfortunately, generating a stream of matches from a Scanner has only been added in Java 9, but see this answer for a back-port of that feature to Java 8.
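For illustration only (not from the original answer), a Java 8 sketch that iterates over matches with Scanner.findWithinHorizon; the pattern and file name are made up:
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class ScannerExample {
    public static void main(String[] args) throws Exception {
        Pattern number = Pattern.compile("\\d+"); // made-up pattern
        try (Scanner sc = new Scanner(Paths.get("big.txt"), StandardCharsets.UTF_8.name())) {
            // findWithinHorizon(pattern, 0) searches without a horizon limit,
            // buffering only as much input as the match requires.
            while (sc.findWithinHorizon(number, 0) != null) {
                MatchResult m = sc.match();
                System.out.println(m.group());
            }
        }
    }
}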
I am attempting to use Gson to take some Java Object, serialize it to JSON, and get a byte array that represents that JSON. I need a byte array because I am passing the output on to an external dependency that requires it to be a byte array.
public byte[] serialize(Object object){
    return gson.toJson(object).getBytes();
}
I have 2 questions:
If the input is a String, Gson seems to return the String as is. It doesn't do any validation of the input. Is this expected? I'd like to use Gson in a way that it would validate that the input object is actually JSON. How could I do this?
I'm going to be invoking this serialize function several thousand times over a short period. Converting to a String and then to byte[] could be some unwanted overhead. Is there a more optimal way to get the byte[]?
edit: my answer on point 1 was misinformed.
2) There will be a lot of unnecessary reflection overhead if you just use the vanilla Gson converter. It would very much benefit performance in your case to write a custom adapter. Here is one article with more info on that:
https://open.blogs.nytimes.com/2016/02/11/improving-startup-time-in-the-nytimes-android-app/?_r=0
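As a rough sketch of what such a custom adapter could look like (the Person class and its fields are made up for illustration; this is not taken from the linked article):
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.TypeAdapter;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;
import java.io.IOException;

// Hypothetical domain class used only for illustration.
class Person {
    String name;
    int age;
}

class PersonAdapter extends TypeAdapter<Person> {
    @Override
    public void write(JsonWriter out, Person value) throws IOException {
        out.beginObject();
        out.name("name").value(value.name);
        out.name("age").value(value.age);
        out.endObject();
    }

    @Override
    public Person read(JsonReader in) throws IOException {
        Person p = new Person();
        in.beginObject();
        while (in.hasNext()) {
            switch (in.nextName()) {
                case "name": p.name = in.nextString(); break;
                case "age":  p.age = in.nextInt();     break;
                default:     in.skipValue();
            }
        }
        in.endObject();
        return p;
    }
}

// Register the adapter so Gson skips reflection for Person:
// Gson gson = new GsonBuilder().registerTypeAdapter(Person.class, new PersonAdapter()).create();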
If the input is a String gson seems to return the String as is. It doesn't do any validation of the input. Is this expected?
Yes, this is fine. It just returns a JSON string representation of the given string.
I'd like to use Gson in a way that it would validate that the input object is actually Json. How could I do this?
No need per se. The Gson.toJson() method accepts objects to be serialized and always generates valid JSON. If you mean deserialization, then Gson fails fast on invalid JSON documents during reading/parsing/deserialization (actually reading, which is the bottom-most layer of Gson).
I'm gonna be invoking this serialize function several thousands of times over a short period. Converting to String and then to byte[] could be some unwanted overhead. Is there a more optimal way to get the byte[]?
Yes, accumulating a JSON string just in order to expose a clone of its internal char[] is a memory waste, of course. Gson is basically a stream-oriented tool; note that there are Gson.toJson method overloads accepting an Appendable, and these are basically the Gson core (just take a quick look at how Gson.toJson(Object) works -- it simply creates a StringWriter instance to accumulate the string, because StringWriter implements Appendable). It would be extremely cool if Gson could emit JSON tokens via a Reader rather than writing to an Appendable, but this idea was refused and most likely will never be implemented in Gson, unfortunately. Since Gson does not emit JSON tokens in a pull/read manner (from your code's perspective), you have to buffer the whole result:
private static byte[] serializeToBytes(final Object object)
        throws IOException {
    final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    final OutputStreamWriter writer = new OutputStreamWriter(outputStream);
    gson.toJson(object, writer);
    writer.flush();
    return outputStream.toByteArray();
}
This one does not use StringWriter, thus avoiding the intermediate string and the array-cloning ping-pong. I don't know if there are writers/output streams that can utilize/re-use existing byte arrays, but I believe there should be some, because that would serve the performance purposes you mentioned in your question.
If possible, you can also check whether your library's interface/API exposes or accepts OutputStreams somehow -- then you could probably just pass such output streams to the serializeToBytes method, or even remove the method. If it can use input streams, not just byte arrays, you could also take a look at converting output streams to input streams, so that the method could return an InputStream or a Reader (this requires some overhead, but can process infinite data -- you need to find the balance):
private static InputStream serializeToByteStream(final Object object)
        throws IOException {
    final PipedInputStream inputStream = new PipedInputStream();
    final OutputStream outputStream = new PipedOutputStream(inputStream);
    new Thread(() -> {
        try {
            final OutputStreamWriter writer = new OutputStreamWriter(outputStream);
            gson.toJson(object, writer);
            writer.flush();
        } catch ( final IOException ex ) {
            throw new RuntimeException(ex);
        } finally {
            try {
                outputStream.close();
            } catch ( final IOException ex ) {
                throw new RuntimeException(ex);
            }
        }
    }).start();
    return inputStream;
}
Example of use:
final String value = "foo";
System.out.println(Arrays.toString(serializeToBytes(value)));
try ( final InputStream inputStream = serializeToByteStream(value) ) {
    int b;
    while ( (b = inputStream.read()) != -1 ) {
        System.out.print(b);
        System.out.print(' ');
    }
    System.out.println();
}
Output:
[34, 102, 111, 111, 34]
34 102 111 111 34
Both represent the ASCII codes of the JSON string "foo", including the surrounding quote characters (34).
I currently need to serialize arbitrary Java objects, since I would like to use their hash as a key for a hash table. After I read various warnings that the default hashCode creates collisions way too often, I wanted to switch to hashing via MessageDigest to use alternative algorithms (e.g. SHA1, ...) that are said to allow more entries without collisions. [As a side note: I am aware that even here collisions can occur early on, yet I want to increase the likelihood of remaining collision free.]
To achieve this I tried a method proposed in this StackOverflow post. It uses the following code to obtain a byte[] necessary for MessageDigest:
public static byte[] convertToHashableByteArray(Object obj) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutput out = null;
    byte[] byteOutput = null;
    try {
        out = new ObjectOutputStream(bos);
        out.writeObject(obj);
        byteOutput = bos.toByteArray();
    } catch (IOException io) {
        io.printStackTrace();
    } finally {
        try {
            if (out != null) { out.close(); }
        } catch (IOException io) {
            io.printStackTrace();
        }
        try {
            bos.close();
        } catch (IOException io) {
            io.printStackTrace();
        }
    }
    return byteOutput;
}
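For context, the resulting byte[] is then fed to MessageDigest roughly like this (an illustrative sketch, not part of the cited post; someObject is a placeholder and the checked NoSuchAlgorithmException from getInstance is not handled here):
import java.security.MessageDigest;

// Hash the serialized form with SHA-1 (or "SHA-512").
byte[] serialized = convertToHashableByteArray(someObject);
byte[] sha1 = MessageDigest.getInstance("SHA-1").digest(serialized);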
The ObjectOutputStream approach, however, has the problem that only objects implementing the Serializable interface will be serialized/converted into a byte[]. To circumvent this issue I applied toString() to the given obj in the catch clause, to ensure a byte[] is produced in all cases:
public static byte[] convertToHashableByteArray(Object obj) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutput out = null;
    byte[] byteOutput = null;
    try {
        out = new ObjectOutputStream(bos);
        out.writeObject(obj);
        byteOutput = bos.toByteArray();
    } catch (IOException io) {
        String stringed = obj.toString();
        byteOutput = stringed.getBytes();
    } finally {
        try {
            if (out != null) { out.close(); }
        } catch (IOException io) {
            io.printStackTrace();
        }
        try {
            bos.close();
        } catch (IOException io) {
            io.printStackTrace();
        }
    }
    return byteOutput;
}
However, this still feels utterly wrong to me. So my question is whether there is a better alternative for converting arbitrary objects to byte[] in order to compute hashes. Preferably a solution that works without additional libraries, or one using well-established ones like Apache Commons.
(Besides that, I am also open to other approaches for obtaining SHA1/SHA512 hashes of arbitrary Java objects.)
Perhaps you can use UUIDs for your objects as immutable unique identifiers?
There are so many things wrong here...
You should have proper key classes with equals and hashCode implemented, instead of using random objects.
Serialization performance overhead can easily mean that such a map will be slower than even a trivial iterative search.
The default hashCode should not be used in most cases, as it might differ for objects which are 'equal' from a business point of view. You should reimplement hashCode together with equals (which comes back to point 1). Whether it has collisions due to pointer aliasing is irrelevant if it won't work properly.
This is a vastly overcomplicated way of closing in-memory streams. Just close them one after another; they are not external resources - if closing fails, just let it fail, you don't need to close everything 100% in case of failures. You can also use one of the closeable utilities (or try-with-resources) to avoid some boilerplate.
You don't need a complicated digest of that byte array - use Arrays.hashCode; it WILL be good enough for your use case (but remember - don't do it anyway, see point 1).
If you are still reading and still not willing to implement point 1, go back to point 1. And again. And again.
And to finally answer your question, use Hessian serialization.
http://hessian.caucho.com/doc/hessian-overview.xtp
It is very similar to the Java one, just faster and with shorter output, and it allows serializing objects which do not implement the Serializable interface (at the risk of messing things up; you need to set a special flag to allow that).
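A hedged sketch of what that could look like, based on my understanding of the Hessian API (verify the exact calls against the documentation):
import com.caucho.hessian.io.HessianOutput;
import com.caucho.hessian.io.SerializerFactory;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Sketch only: serialize an arbitrary object with Hessian.
static byte[] hessianBytes(Object obj) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    HessianOutput out = new HessianOutput(bos);
    // The "special flag" mentioned above: permit objects that
    // do not implement java.io.Serializable (use with care).
    SerializerFactory factory = new SerializerFactory();
    factory.setAllowNonSerializable(true);
    out.setSerializerFactory(factory);
    out.writeObject(obj);
    out.flush();
    return bos.toByteArray();
}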
If you want to serialize a given object, I suggest you change your method like this:
public static byte[] convertToHashableByteArray(Serializable obj){
    ..........
    ..........
}
I need to return the byte array of a ByteArrayOutputStream from a called method. I see two ways to achieve the same thing: first, return the ByteArrayOutputStream and let the caller use its toByteArray() method; second, call baos.toByteArray() inside the method and return the byte array.
Which one should I use?
To illustrate by example:
Method 1
void parentMethod(){
    byte[] result = process();
}

byte[] process(){
    ByteArrayOutputStream baos;
    .....
    .....
    .....
    return baos.toByteArray();
}
Method 2
void parentMethod(){
    ByteArrayOutputStream baos = process();
}

ByteArrayOutputStream process(){
    ByteArrayOutputStream baos;
    .....
    .....
    .....
    return baos;
}
There's another alternative: return an InputStream. The idea is presumably that you're returning the data resulting from the operation. As such, returning an output stream seems very odd to me. To return data, you'd normally either return the raw byte[], or an InputStream wrapping it - the latter is more flexible in that it could be reading from a file or something similar, but does require the caller to close the stream afterwards.
It partly depends on what callers want to do with the data, too - there are some operations which are easier to perform if you've already got a stream; there are others which are easier with a byte array. I'd let that influence the decision quite a lot.
If you do want to return a stream, that's easy:
return new ByteArrayInputStream(baos.toByteArray());
So to summarize:
Don't return ByteArrayOutputStream. The use of that class in coming up with the data is an implementation detail, and it's not really the logical result of the operation.
Consider returning an InputStream if callers are likely to find that easier to use or if you may later want to read the data from a file (or network connection, or whatever); ByteArrayInputStream is suitable for the current implementation.
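Putting those two points together, a minimal sketch (class and method names are placeholders):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

class Producer {
    // Prefer exposing the data itself; the ByteArrayOutputStream stays an implementation detail.
    byte[] processToBytes() {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(1); // placeholder for the real work
        return baos.toByteArray();
    }

    // ...or a stream over it, if callers prefer reading incrementally.
    InputStream processToStream() {
        return new ByteArrayInputStream(processToBytes());
    }
}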