I'm under the impression that a ByteArrayOutputStream is not memory efficient, since all its contents are stored in memory.
Similarly, calling toByteArray on a large stream seems like it "scales poorly".
Why, then, does the example in Tom White's book Hadoop: The Definitive Guide use them both:
ByteArrayOutputStream out = new ByteArrayOutputStream();
Decoder decoder = DecoderFactory.defaultFactory().createBinaryDecoder(out.toByteArray(), null);
Isn't "Big Data" the norm for Avro? What am I missing?
Edit 1: What I'm trying to do: say I'm streaming Avro records over a websocket. What would the example look like if I wanted to deserialize multiple records, not just one that was put in its own ByteArrayOutputStream?
Is there a better way to supply BinaryDecoder with a byte[]? Or perhaps a different type of stream? Or should I be sending 1 record per stream instead of loading streams with multiple records?
ByteArrayOutputStream makes sense when dealing with small objects like small to medium images, or fixed-size requests/responses. It is in memory and doesn't touch the disk, so this can be great for performance. It doesn't make any sense to use it for 1 terabyte of data. Possibly this is a case of trying to keep an example in a book small and self-contained so as not to detract from the main point.
EDIT: Now that I see where you're going, I'd look at setting up a pipeline. Pull a message off the stream (I'm assuming you can get an InputStream from your HTTP object) and either process it with a memory-less method or put it on a queue and have a thread pool process the queue with a memory-less method. The requirements for this are 1) being able to detect the boundary between Avro messages as you pull them off the stream and 2) having a method for decoding them.
The way to decode appears to be to read the bytes for each message into a byte array and hand that to your BinaryDecoder (after you find the message boundary).
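The boundary-detection step could be sketched roughly as follows. This assumes, and it is only an assumption not stated in the question, that the sender writes a 4-byte big-endian length before each record; `FramedReader` and `readFrame` are hypothetical names. Each returned byte[] would then be handed to the BinaryDecoder.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: splits a stream into length-prefixed messages.
// Assumes the sender writes a 4-byte big-endian length before each record.
final class FramedReader {
    private final DataInputStream in;

    FramedReader(InputStream source) {
        this.in = new DataInputStream(source);
    }

    /** Returns the next message body, or null when the stream ends before another length prefix. */
    byte[] readFrame() throws IOException {
        int length;
        try {
            length = in.readInt();
        } catch (EOFException endOfStream) {
            return null; // no further messages
        }
        if (length < 0) {
            throw new IOException("Negative frame length: " + length);
        }
        byte[] body = new byte[length];
        in.readFully(body); // throws EOFException if the message is truncated
        return body;
    }
}
```

Only one message is held in memory at a time, so memory use is bounded by the largest record rather than by the whole stream.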
Related
Good afternoon everyone,
First of all, this is only for personal use: it's a small project to improve my Java knowledge and to better understand how developers work with sockets and bytes, which I'd like to understand for future ideas.
I'm currently writing a lightweight HTTP server in Java to understand how it works. I've been reading the documentation but still have difficulty understanding parts of it. The main problem I'm facing is that the Content-Length seems to be larger than the amount of data I get from the BufferedReader. I don't know whether the cause is the way bytes are decoded to chars by the BufferedReader, so that it ends up with less data. If so, I probably have to treat this part as binary and read the bytes from the InputStream directly, but that is where the real problem starts.
A Reader reads a certain number of bytes ahead and keeps them in its buffer. That data has been consumed from the InputStream and is no longer on the stream, so calling read() on the stream returns -1 because there are no more bytes to read. A multipart body is divided into multiple parts separated by a boundary, with a newline separating each part's information (headers) from its content. I still have to read the information as a String to process it, but the content should be handled as binary data. Without changing the buffer size, which would require knowing the exact length of the information in advance, the content will most probably already have been pulled into the BufferedReader's buffer. Is it possible to recover it from the data the BufferedReader has already processed, or should I find a way to read that content as binary without it going through the Reader?
As I said, I'm new to working with sockets and services, so I don't know exactly what the options are or whether this is a different kind of issue. Any help would be appreciated. Thank you in advance.
Answer from Remy Lebeau, found in the comments, which was useful for me:
since multipart data is both textual and binary, you are going to have to do your own buffering of the socket data so you have more control and know where the data switches back and forth. At the very least, since you can read binary data directly from a BufferedInputStream, and access its internal buffer, you can let it handle the actual buffering for you, and it is not difficult to write a custom readLine() method that can read a line of text from a BufferedInputStream without using BufferedReader
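A minimal sketch of such a custom readLine(), assuming lines end in LF or CRLF and that header text can be decoded as ISO-8859-1 (both are assumptions; `HttpLines` is a hypothetical name). Because it reads bytes straight from the stream, nothing past the line terminator is consumed, so the binary body can be read from the same stream afterwards.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads text lines from a byte stream without a Reader.
final class HttpLines {
    /** Reads up to the next LF, strips an optional trailing CR; returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                return decode(line);
            }
            line.write(b);
        }
        // End of stream: return any unterminated final line, else null.
        return line.size() == 0 ? null : decode(line);
    }

    private static String decode(ByteArrayOutputStream line) {
        byte[] bytes = line.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--; // drop the CR of a CRLF pair
        }
        return new String(bytes, 0, len, StandardCharsets.ISO_8859_1);
    }
}
```

Passing a BufferedInputStream wrapped around the socket's stream keeps the single-byte reads cheap, since they are served from its buffer.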
I am receiving files through a socket and saving them to a database. So I'm receiving the byte stream and passing it to a back-end process, say Process1, for the DB save.
I'm looking to do this without saving the stream on disk. So, rather than storing the incoming stream as a file on disk and then passing that file to Process1, I'm looking to pass it while it's still in memory. This is to eliminate the time-costly disk read and write.
One way I can do this is to pass the byte[] to Process1. I'm wondering whether there's a better way of doing this.
TIA.
You can use a ByteArrayOutputStream. It is, essentially, a growable byte[] which you can write into at will, within the limit of your available heap space.
After having written to it/flushed it/closed it (although those last two operations are essentially no-ops, that's no reason for ditching sane practices), you can obtain a copy of its contents as a byte array using this class's .toByteArray().
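A small sketch of that approach, draining an InputStream (for example the socket's) into a ByteArrayOutputStream. `InMemoryCopy` and `readAll` are hypothetical names, and the 8 KB chunk size is an arbitrary choice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: buffers an entire stream in memory.
final class InMemoryCopy {
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 once the peer has closed its side of the stream
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray(); // a copy of the accumulated bytes
    }
}
```

The resulting byte[] can be passed to Process1 directly, or wrapped in a ByteArrayInputStream if Process1 expects a stream.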
Socket sounds like what you are looking for.
I can see there are a number of posts about reusing an InputStream. I understand an InputStream is a one-time thing and cannot be reused.
However, I have a use case like this:
I have downloaded the file from DropBox by obtaining the DropBoxInputStream using DropBox's Java SDK. I then need to upload the file to another system by passing the InputStream. However, as part of the upload, I have to provide the MD5 of the file. So I have to read the file from the stream before uploading the file. Because the DropBoxInputStream I received can only be used once, I have to get another DropBoxInputStream after I have calculated the MD5 and before uploading the file. The procedure is like:
Get first DropBoxInputStream
Read from the DropBoxInputStream and calculate MD5
Get the second DropBoxInputStream
Upload file using the MD5 and the second DropBoxInputStream.
I am wondering whether there is any way for me to "cache" or "back up" the InputStream before I calculate the MD5, so that I can skip step 3 of obtaining the same DropBoxInputStream again?
Many thanks
EDIT:
Sorry I missed some information.
What I am currently doing is that I use a MD5DigestOutputStream to calculate MD5. I stream data across the MD5DigestOutputStream and save them locally as a temp file. Once the data goes through the MD5DigestOutputStream, it will calculate the MD5.
I then call a third party library to upload the file using the calculated md5 and a FileInputStream which reads from the temp file.
However, this sometimes requires a huge amount of disk space, and I want to remove the need for a temp file. The library I use only accepts an MD5 and an InputStream. This means I have to calculate the MD5 on my end. My plan is to use my MD5DigestOutputStream to write data to /dev/null (not keeping the file) so that I can calculate the MD5, and then get the InputStream from DropBox again and pass that to the library I use. I assume the library will be able to get the file directly from DropBox without the need for me to cache the file either in memory or on disk. Will it work?
Input streams aren't really designed for copying or re-use; they're specifically for situations where you don't want to read everything into a byte array and use array operations on that (this is especially useful when the whole array isn't available, as in, e.g., socket communication). You could buffer up into a byte array, which is the process of reading sections from the stream into a byte array buffer until you have enough information.
But that's unnecessary for calculating an MD5. Notice that InputStream is abstract, so it needs to be implemented in a subclass. It has many implementations: GZIPInputStream, FileInputStream, etc. Many of these are, in design-pattern speak, decorators of the IO stream: they add extra functionality to the abstract base IO classes. For example, GZIPInputStream decompresses gzip data as you read from the stream.
So, what you need is a stream to do this for md5. There is, joyfully, a well documented similar thing: see this answer. So you should just be able to pass your dropbox input stream (as it will be itself an input stream) to create a new DigestInputStream, and then you can both take the md5 and continue to read as before.
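As a rough sketch of that idea: the source stream is wrapped in a java.security.DigestInputStream, copied to some sink, and the MD5 is read off afterwards. `Md5WhileStreaming` and `copyAndDigest` are hypothetical names, and the sink stands in for wherever the bytes go. Note that the digest is only complete once the stream has been read to the end.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes an MD5 while the data streams past.
final class Md5WhileStreaming {
    /** Copies source to sink and returns the MD5 of the copied bytes as lowercase hex. */
    static String copyAndDigest(InputStream source, OutputStream sink)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (DigestInputStream in = new DigestInputStream(source, md5)) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                sink.write(chunk, 0, n); // every byte read also updates the digest
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```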
Worried about type casting? The idea with decorators in Java is that, since the InputStream base class interfaces all the methods and 'beef' you need to do your IO, there's no harm in passing instances of objects inheriting from InputStream in the constructor of each stream implementation, and you can still do the same core IO.
Finally, I should probably answer your actual question: say you still want to "cache" or "back up" the stream anyway? Well, you could just write it to a byte array. This is well documented, but can become a faff when your streams get more complicated. Alternatively, try looking at a PushbackInputStream. Here, you can easily write a function to read off n bytes, perform an operation on them, and then restore them to the stream. It's generally good to avoid these implementations of streams in Java, as they're bad for memory use, but no worse than buffering everything up, which you'd otherwise have to do.
Or, of course, I would have a go with DigestInputStream.
Hope this helps,
Best.
You don't need to open a new InputStream from DropBox.
Once you have read the file from DropBox, you have it locally. So it is either in memory (in a byte array) or you stored it in a local file. Now you can create an InputStream that reads the data from memory (ByteArrayInputStream) or disk (FileInputStream) in order to upload the file.
So instead of caching the InputStream (which you can't) you cache the contents (which you can).
Currently I'm working with an SSH client api providing me stdout and stderr as InputStreams. I have to read all the data from these streams at client side and provide an api for implementors to be able to work with these data the way they want (just drop it, write it to DB, process it etc). First I tried to keep the whole data read in byte arrays, but with huge amount of data (could happen sometimes) this can cause serious memory problems. But I don't want to write all the data of every call into files if that isn't really necessary.
Does anyone know of a solution which reads data into memory until it reaches a limit (like 1 MB), after which it writes the data from memory to a file and appends all the remaining data of the InputStream to the same file?
Apache Commons IO has a workable solution: DeferredFileOutputStream.
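To illustrate the idea without the library, here is a hand-rolled stand-in using only the JDK. It is not Commons IO's DeferredFileOutputStream, and `SpillingOutputStream` and its methods are hypothetical names: data stays in memory until a write would exceed the threshold, then everything moves to a temp file.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of "memory first, file after a threshold".
final class SpillingOutputStream extends OutputStream {
    private final int threshold;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream file; // non-null once the data has spilled to disk
    private File spillFile;

    SpillingOutputStream(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] data, int off, int len) throws IOException {
        if (file == null && memory.size() + len > threshold) {
            // Threshold would be exceeded: move buffered bytes to a temp file.
            spillFile = File.createTempFile("spill", ".tmp");
            file = new FileOutputStream(spillFile);
            memory.writeTo(file);
            memory = null;
        }
        if (file != null) {
            file.write(data, off, len);
        } else {
            memory.write(data, off, len);
        }
    }

    @Override
    public void close() throws IOException {
        if (file != null) {
            file.close();
        }
    }

    boolean isInMemory() { return file == null; }

    /** In-memory contents, or null once the data has spilled to the file. */
    byte[] getData() { return memory == null ? null : memory.toByteArray(); }

    /** The temp file, or null while the data is still in memory. */
    File getFile() { return spillFile; }
}
```

After close(), a caller would check isInMemory() and read from either getData() or getFile(); cleaning up the temp file is left to the caller in this sketch.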
Can you avoid reading the stream until you know what you are going to do with it?
If you use this approach you can dump them, read portions of data and write them to a database as you read it, or read and process the data as you read it.
This way you would not need to read more than 1 MB (or less) at any one time.
I have a server which waits for a connection from a client then sends an image to that client using the Socket class.
Several clients will be connecting over a short time, so I would like to compress the image before sending it.
The images are 1000 by 1000 pixel BufferedImages, and my current way of sending them is to iterate over all pixels and send that value, then reconstruct it on the other side. I suspect this is not the best way to do things.
Can anyone give any advice on compression and a better method for sending images over a network?
Thanks.
Compression is very much horses for courses: which method will actually work better depends on where your image came from in the first place, and what your requirements are (principally, whether you allow lossy compression, and if so, what constraints you place on it).
To get started, try using ImageIO.write() to write the image in JPEG or PNG format to a ByteArrayOutputStream, whose resulting byte array you can then send down the socket[1]. If that gives you an acceptable result, then the advantage is it'll involve next-to-no development time.
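A minimal sketch of that first step, using PNG so the round trip is lossless. `ImageCodec`, `toPng` and `fromPng` are hypothetical names; the receiving side is assumed to get the same byte array back from the socket.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper: encodes/decodes a BufferedImage as PNG bytes.
final class ImageCodec {
    static byte[] toPng(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // ImageIO.write returns false if no writer for the format is found
        if (!ImageIO.write(image, "png", out)) {
            throw new IOException("No PNG writer available");
        }
        return out.toByteArray();
    }

    static BufferedImage fromPng(byte[] data) throws IOException {
        return ImageIO.read(new ByteArrayInputStream(data));
    }
}
```

Swapping "png" for "jpg" gives the lossy variant; the length of the returned array is what you would compare against the raw pixel count to judge the saving.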
If that doesn't give an acceptable result (either because you can't use lossy compression or because PNG compression doesn't give an acceptable compression ratio), then you may have to come up with something custom to suit your data. Only you know your data at the end of the day, but a general technique is to try and get your data into a form where it works well with a Deflater or some other standard algorithm. So with a Deflater, for example, you transform/re-order your data so that repeating patterns and runs of similar bytes are likely to occur close to one another. That might mean sending all of the top bits of pixels, then all the next-top bits, etc., and actually not sending the bottom few bits of each component if they're effectively just noise.
Hopefully the JPEG/PNG option will get you the result you need, though, and you won't have to worry much further.
[1] Sorry, should have said -- you can obviously make the socket output stream the one that the image data is written into if you don't need to first query it for length, take a hash code...
JPEG is lossy, so if you need the exact same image on the other side, you can use a GZIPOutputStream on top of the socket's OutputStream to send the compressed data, and receive it on the other side through a GZIPInputStream on top of the socket's InputStream.
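A rough sketch of that wrapping, with plain streams standing in for the socket's getOutputStream() and getInputStream(). `GzipTransport` and its method names are hypothetical, and the payload is assumed to be the raw pixel bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip-compresses data on its way through a stream pair.
final class GzipTransport {
    static void sendCompressed(byte[] payload, OutputStream rawOut) throws IOException {
        GZIPOutputStream gzip = new GZIPOutputStream(rawOut);
        gzip.write(payload);
        gzip.finish(); // writes the gzip trailer without closing rawOut
        rawOut.flush();
    }

    static byte[] receiveCompressed(InputStream rawIn) throws IOException {
        GZIPInputStream gzip = new GZIPInputStream(rawIn);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = gzip.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

Calling finish() rather than close() leaves the underlying socket stream open, which matters if the connection is reused after the image is sent.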
It's been a long time since I did any image processing in Java, but you can save the image on the server as a JPEG, and then send them a URI and let them retrieve it themselves.
If you are using the getInputStream and getOutputStream methods on Socket, try wrapping the streams with java.util.zip.GZIPInputStream and java.util.zip.GZIPOutputStream.