I keep running into the case where I want some structure, say a buffer of size 4096, that I can:
write bytes into,
read bytes from,
reset the read position back to a previous mark,
and, MOST IMPORTANT, not have to copy data around as the data window gets near the end of the byte array! (This is basically a circular buffer with wrap-around, or something like it.)
ByteBuffer seems just as much of a headache as byte[]: with both, as you write to and read from them, the beginning of the array empties out and has to be reclaimed. I almost want something like a List of bytes... I just want it all managed for me (or I may have to write my own structure). Some kind of InputStream with mark and reset would be nice, so I can mark before I read and then reset if there isn't enough data in the buffer yet.
This is extremely useful in nearly all asynchronous programming, where data arrives piecemeal: you fill the buffer, try to read and parse, and have to reset and wait until more data shows up.
ByteBuffer seems right for this, and ByteBuffer.compact() is exactly what you want when you need to move the remaining bytes back to the start of the buffer.
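Here is a minimal sketch of the usual fill / flip / parse / compact cycle; the length-prefixed framing and the tryParse() helper are just illustrative, not part of the NIO API:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

public class CompactingReader {
    private final ByteBuffer buf = ByteBuffer.allocate(4096);

    public void pump(ReadableByteChannel channel) throws IOException {
        while (channel.read(buf) > 0) {   // fill: buffer is in "write" mode
            buf.flip();                   // switch to "read" mode
            while (tryParse(buf)) {
                // a complete message was consumed; try for the next one
            }
            buf.compact();                // shift leftovers to the front, back to "write" mode
        }
    }

    // Returns false (leaving the position untouched) when a full message isn't available yet.
    private boolean tryParse(ByteBuffer b) {
        if (b.remaining() < 4) {
            return false;                 // not even a length prefix yet
        }
        b.mark();                         // remember where this attempt started
        int length = b.getInt();          // hypothetical 4-byte length prefix
        if (b.remaining() < length) {
            b.reset();                    // roll back and wait for more data
            return false;
        }
        byte[] payload = new byte[length];
        b.get(payload);
        // ... handle payload ...
        return true;
    }
}

The key point is that compact() copies only the unread tail to the front of the array, so partial messages survive between reads without any manual shuffling.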
You might be able to use this circular byte buffer - use the getAvailable() method rather than reading and then resetting.
Hmmm, I just found this non-GPL one... Apache license, it looks like:
https://svn.apache.org/repos/asf/etch/releases/release-1.0.0/util/src/main/java/etch/util/CircularByteBuffer.java
Anyone used this? Looks OK to me.
Good afternoon everyone,
First of all, I'll say this is only for personal use in a way: it's for small projects to improve my Java knowledge. My idea is to build this kind of thing to better understand how developers work with sockets and bytes, since I really want to understand these things for my future ideas.
I'm currently writing a lightweight HTTP server in Java to understand how it works. I've been reading the documentation, but I still have trouble understanding parts of it. The main problem I'm facing (and I'd like to know whether this is related or not) is that the Content-Length header reports a higher length than what I actually get from the BufferedReader. I don't know if the issue is the way bytes are decoded into chars by the BufferedReader, so that it ends up with less data; if so, I probably have to treat that part as binary and read the bytes from the InputStream directly. But that brings me to the real problem I'm facing.
Since a Reader reads a certain number of bytes from the stream and keeps them in its own buffer, the data from the InputStream has already been consumed by the Reader and is no longer in the stream, so calling read() on it would just return -1 because there are no more bytes to read. A multipart body is divided into multiple parts separated by a boundary, with a blank line delimiting the part headers from the content. I still have to read the headers as a String to process them, but the content should be handled as binary data. Without adjusting the buffer length, which would require knowing the exact length of just the header portion, the most likely result is that the binary content also gets pulled into the BufferedReader's buffer. Is it possible to recover that content from the data already consumed by the BufferedReader, or should I find a way to read it as binary without it being processed at all?
As I said, I'm new to working with sockets and services, so I don't know all the possibilities, or whether this is a different kind of issue entirely. Any help would be appreciated; thank you in advance.
Here is the answer from Remy Lebeau, found in the comments, which proved useful for me:
Since multipart data is both textual and binary, you are going to have to do your own buffering of the socket data so you have more control and know where the data switches back and forth. At the very least, since you can read binary data directly from a BufferedInputStream, and access its internal buffer, you can let it handle the actual buffering for you, and it is not difficult to write a custom readLine() method that can read a line of text from a BufferedInputStream without using BufferedReader.
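A rough sketch of what such a custom readLine() could look like, assuming ISO-8859-1 for header text and treating CRLF or a bare LF as the line terminator (adapt the charset and terminator handling as needed):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Reads one line of text from a BufferedInputStream without pulling the
// following binary data into a Reader. Returns null at end of stream.
public static String readLine(BufferedInputStream in) throws IOException {
    ByteArrayOutputStream line = new ByteArrayOutputStream();
    int b;
    while ((b = in.read()) != -1) {
        if (b == '\n') {                  // LF ends the line
            break;
        }
        if (b == '\r') {                  // swallow the LF of a CRLF pair, if present
            in.mark(1);
            int next = in.read();
            if (next != '\n' && next != -1) {
                in.reset();               // lone CR: push the peeked byte back
            }
            break;
        }
        line.write(b);
    }
    if (b == -1 && line.size() == 0) {
        return null;                      // end of stream, nothing read
    }
    return new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
}

After the blank line that ends the part headers, you can keep reading raw bytes from the same BufferedInputStream, so the binary content never goes through a Reader.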
I am receiving files through a socket and saving them to a database. So I'm receiving the byte stream and passing it to a back-end process, say Process1, for the DB save.
I'm looking to do this without saving the stream to disk. Rather than storing the incoming stream as a file on disk and then passing that file to Process1, I'd like to pass it along while it's still in memory. This is to eliminate the costly disk read and write.
One way I can do this is to pass the byte[] to Process1. I'm wondering whether there's a better way of doing this.
TIA.
You can use a ByteArrayOutputStream. It is, essentially, a growable byte[] which you can write into at will, within the limits of your available heap space.
After you have written to it, flushed it, and closed it (the last two operations are essentially no-ops, but that's no reason to ditch sane practices), you can obtain a copy of the accumulated bytes using the class's .toByteArray() method.
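For instance, a minimal sketch of buffering a socket's InputStream fully in memory before handing it off to the back-end process (the 8 KB chunk size is arbitrary):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Drains an InputStream into memory and returns the accumulated bytes.
public static byte[] readFully(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
        out.write(chunk, 0, n);           // append only the bytes actually read
    }
    return out.toByteArray();             // copy of everything written so far
}

The resulting byte[] can then be handed straight to Process1 without ever touching the disk.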
Socket sounds like what you are looking for.
I'm under the impression that a ByteArrayOutputStream is not memory efficient, since all its contents are stored in memory.
Similarly, calling toByteArray on a large stream seems like it "scales poorly".
Why, then, does the example in Tom White's book Hadoop: The Definitive Guide use both of them:
ByteArrayOutputStream out = new ByteArrayOutputStream();
Decoder decoder = DecoderFactory.defaultFactory().createBinaryDecoder(out.toByteArray(), null);
Isn't "Big Data" the norm for Avro? What am I missing?
Edit 1: What I'm trying to do - say I'm streaming Avro records over a websocket. What would the example look like if I wanted to deserialize multiple records, not just one that was put in its own ByteArrayOutputStream?
Is there a better way to supply BinaryDecoder with a byte[]? Or perhaps a different type of stream? Or should I be sending one record per stream instead of loading streams with multiple records?
ByteArrayOutputStream makes sense when dealing with small objects like small-to-medium images, or fixed-size requests/responses. It lives in memory and doesn't touch the disk, so this can be great for performance. It doesn't make any sense to use it for a terabyte of data. This is possibly a case of keeping an example in a book small and self-contained so as not to detract from the main point.
EDIT: Now that I see where you're going, I'd look at setting up a pipeline. Pull a message off the stream (I'm assuming you can get an InputStream from your HTTP object) and either process it with a memory-light method or throw it at a queue and have a thread pool process the queue with a memory-light method. The requirements for this are 1) being able to detect the boundary between Avro messages as you pull them off the stream, and 2) having a method for decoding each one.
The way to decode appears to be to read the bytes for each message into a byte array and hand that to your BinaryDecoder (after you find the message boundary).
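A rough sketch of that idea, assuming a simple 4-byte length prefix marks each message boundary on the wire; the framing and the handleRecord() callback are illustrative, not part of Avro itself:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class AvroStreamDecoder {

    // Pulls length-prefixed Avro messages off a stream and decodes each one
    // from its own byte[], reusing the decoder and record between messages.
    public static void decodeAll(InputStream raw, Schema schema) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        BinaryDecoder decoder = null;
        GenericRecord record = null;
        while (true) {
            int length;
            try {
                length = in.readInt();        // hypothetical 4-byte length prefix
            } catch (EOFException eof) {
                break;                        // clean end of stream
            }
            byte[] message = new byte[length];
            in.readFully(message);            // one complete Avro message
            decoder = DecoderFactory.get().binaryDecoder(message, decoder);
            record = reader.read(record, decoder);
            handleRecord(record);
        }
    }

    private static void handleRecord(GenericRecord record) {
        System.out.println(record);           // placeholder for real processing
    }
}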
Currently I'm working with an SSH client API that provides stdout and stderr as InputStreams. I have to read all the data from these streams on the client side and provide an API for implementers to work with that data however they want (just drop it, write it to a DB, process it, etc.). At first I tried to keep everything that was read in byte arrays, but with huge amounts of data (which can happen sometimes) this can cause serious memory problems. On the other hand, I don't want to write the data of every call to a file if that isn't really necessary.
Does anyone know of a solution that reads data into memory until it reaches a limit (say 1 MB), then writes what is in memory to a file and appends all the remaining data from the InputStream to that same file?
Commons IO has a workable solution: DeferredFileOutputStream.
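A rough sketch of how that might be used (the 1 MB threshold and the spill file name are just examples):

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.output.DeferredFileOutputStream;

// Buffers up to 1 MB in memory; anything beyond that spills to the given file.
public static void capture(InputStream in) throws IOException {
    File spillFile = new File("stream-output.tmp");             // example path
    DeferredFileOutputStream out =
            new DeferredFileOutputStream(1024 * 1024, spillFile);
    try {
        IOUtils.copy(in, out);                                  // drain the stream
    } finally {
        out.close();
    }
    if (out.isInMemory()) {
        byte[] data = out.getData();                            // small result: stayed in memory
        // ... hand data to the implementer ...
    } else {
        File result = out.getFile();                            // large result: spilled to disk
        // ... hand the file to the implementer ...
    }
}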
Can you avoid reading the stream until you know what you are going to do with it?
If you take this approach you can drop the data, read portions and write them to a database as you go, or read and process the data as you go.
This way you would never need to hold more than 1 MB (or less) in memory at any one time.
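A tiny sketch of that kind of incremental processing; handleChunk() is a placeholder for whatever the implementer does with each piece:

import java.io.IOException;
import java.io.InputStream;

// Processes the stream in bounded chunks so no more than 1 MB is held at once.
public static void processIncrementally(InputStream in) throws IOException {
    byte[] chunk = new byte[1024 * 1024];     // 1 MB working buffer
    int n;
    while ((n = in.read(chunk)) != -1) {
        handleChunk(chunk, n);                // e.g. write these n bytes to the DB
    }
}

private static void handleChunk(byte[] buf, int len) {
    // placeholder: write buf[0..len) to the database, process it, or drop it
}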
So, here is the situation:
I have to read big .gz archives (gigabytes) and kind of "index" them so that later on I can retrieve specific pieces using random access.
In other words, I wish to read the archive line by line and be able to get the specific location in the file for any such line, so that I can jump directly to those locations on request. (PS: it's UTF-8, so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly, since the file position it implies corresponds to what has been buffered so far; in other words, a multiple of the internal buffer size rather than the location of the current line.
I cannot use InputStreamReader directly for performance reasons. Unbuffered reading would be way too slow, and, by the way, it lacks convenience methods for reading lines.
I cannot use RandomAccessFile since 1. the file is zipped, and 2. RandomAccessFile uses "modified" UTF-8.
I guess the best option would be a kind of buffered reader that keeps track of the file location and buffer offset... but this sounds quite cumbersome. Maybe I missed something. Perhaps something already exists to read a file line by line while keeping track of the location (even if zipped).
Thanks for any tips,
Arnaud
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip().
These methods are declared in both InputStream and Reader, so you are welcome to use them.
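For illustration, a small sketch of how mark()/reset() and skip() behave on a BufferedReader over a gzip stream. Note that reset() only works within the read-ahead limit passed to mark(), and the offsets here are character positions in the decompressed text, not byte positions in the .gz file:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public static void demo(String path) throws IOException {
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(
                new GZIPInputStream(new FileInputStream(path)),
                StandardCharsets.UTF_8))) {
        if (reader.markSupported()) {          // BufferedReader supports mark/reset
            reader.mark(8192);                 // remember this spot, up to 8192 chars ahead
            String firstLine = reader.readLine();
            reader.reset();                    // jump back to the marked position
            reader.skip(100);                  // or skip forward over 100 characters
        }
    }
}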
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...