There is a string (message body) and 3 different headers to be sent to 3 users using a Java NIO socket.
One way is to create a large byte buffer, put the message body at some position, and put the header in front of the message body.
That way I still need one copy for the message body, plus rewriting the headers. In my project the message body is around 14 KB. If the memory page is 2 KB, this is not efficient memory management.
My question: is there any way to avoid copying the large message string into the byte buffer? I guess C can support this using pointers. Is that true?
Thanks.
I wouldn't create the String at all, but build the ByteBuffer with the text you would have placed in the String.
Note: String is immutable, so it will always be a copy of some other source, e.g. a StringBuilder. Using a ByteBuffer instead saves you two copies.
You can place the message body in the ByteBuffer with enough padding at the start to put in the header later. This way the message body won't need to be copied again.
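A minimal sketch of that idea, assuming a blocking SocketChannel per user; HEADER_MAX, body, headers and channelFor() are placeholder names, not part of the original question:

ByteBuffer buf = ByteBuffer.allocate(HEADER_MAX + body.length);
buf.position(HEADER_MAX);
buf.put(body);                        // body is copied into the buffer exactly once

for (byte[] header : headers) {       // one header per user
    int start = HEADER_MAX - header.length;
    buf.position(start);
    buf.put(header);                  // overwrite only the header region
    buf.position(start);              // rewind to the start of this message
    channelFor(header).write(buf);    // blocking write assumed to drain the buffer
}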
This is a job for gathering writes: the write(ByteBuffer[], ...) method.
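Roughly, assuming blocking SocketChannels and an illustrative buildHeaderFor() helper, it could look like this:

ByteBuffer body = ByteBuffer.wrap(bodyBytes);              // 14 KB body, wrapped once, never copied
for (SocketChannel channel : channels) {                   // one channel per user
    ByteBuffer[] parts = { buildHeaderFor(channel), body.duplicate() };  // duplicate() = independent position
    while (parts[0].hasRemaining() || parts[1].hasRemaining()) {
        channel.write(parts);                              // gathering write: header first, then body
    }
}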
I got a requirement to deliver emails to a legacy system that needs to read the attachments.
For each part in a multipart email I need to provide the byte offset for where the attachment starts in the email, so the legacy system doesn't need to know how to parse emails.
Performance and memory usage are a concern, so the solution can't load the entire email into memory. To my eyes that rules out javax.mail.
How would you go about it in Java?
My first idea was to use mime4j, but the library does not keep track of byte offsets or even line numbers.
I investigated making a PR to mime4j to add tracking of line numbers and byte offsets. But it is not very easy, since it is a very mature project and it uses lots of buffering internally.
Now I am thinking that maybe I am going about this the wrong way, so I would very much appreciate any ideas for how to solve this in a simple manner.
You're going to run into issues just sending the byte offsets along with the full email, because bodies can still be base64 or quoted-printable encoded.
You'll want to use a MimeStreamParser and give it your own ContentHandler that overrides the body method. You can then send the BodyDescriptor and InputStream directly to the legacy system. The InputStream is the "decoded" body (i.e. any Content-Transfer-Encoding has already been handled). The BodyDescriptor is useful for extracting things you may care about from the part's headers (MimeType and Charset are the most useful ones).
This does not buffer the whole email, and it allows you to stream out just the body parts. I'm not sure how you're communicating with the legacy system (over the network, or as an in-process subcomponent), but hopefully that works!
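A rough sketch of that approach (package names vary between mime4j releases; sendToLegacySystem() is a placeholder, not part of the library):

import java.io.IOException;
import java.io.InputStream;

import org.apache.james.mime4j.MimeException;
import org.apache.james.mime4j.parser.AbstractContentHandler;
import org.apache.james.mime4j.parser.MimeStreamParser;
import org.apache.james.mime4j.stream.BodyDescriptor;

void streamBodyParts(InputStream rawEmail) throws MimeException, IOException {
    MimeStreamParser parser = new MimeStreamParser();
    parser.setContentDecoding(true);      // hand back decoded part bodies, not base64/quoted-printable
    parser.setContentHandler(new AbstractContentHandler() {
        @Override
        public void body(BodyDescriptor bd, InputStream decoded) throws IOException {
            // Each part is consumed as a stream; the whole email is never buffered.
            sendToLegacySystem(bd.getMimeType(), bd.getCharset(), decoded);
        }
    });
    parser.parse(rawEmail);
}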
I'm under the impression that a ByteArrayOutputStream is not memory efficient, since all of its contents are stored in memory.
Similarly, calling toByteArray on a large stream seems like it "scales poorly".
Why, then, does the example in Tom White's book Hadoop: The Definitive Guide use them both:
ByteArrayOutputStream out = new ByteArrayOutputStream();
Decoder decoder = DecoderFactory.defaultFactory().createBinaryDecoder(out.toByteArray(), null);
Isn't "Big Data" the norm for Avro? What am I missing?
Edit 1: What I'm trying to do - say I'm streaming Avro records over a websocket. What would the example look like if I wanted to deserialize multiple records, not just one that was put in its own ByteArrayOutputStream?
Is there a better way to supply BinaryDecoder with a byte[]? Or perhaps a different type of stream? Or I should be sending 1 record per stream instead of loading streams with multiple records?
ByteArrayOutputStream makes sense when dealing with small objects, such as small-to-medium images or fixed-size request/response payloads. It stays in memory and doesn't touch the disk, so this can be great for performance. It doesn't make any sense to use it for a terabyte of data. This is likely a case of keeping a book example small and self-contained so as not to detract from the main point.
EDIT: Now that I see where you're going, I'd look at setting up a pipeline. Pull a message off the stream (I'm assuming you can get an InputStream from your HTTP object) and either process it with a method that doesn't hold everything in memory, or throw it on a queue and have a thread pool process the queue with such a method. The requirements for this are (1) being able to detect the boundary between Avro messages as you pull them off the stream, and (2) having a method for decoding them.
The way to decode appears to be to read the bytes for each message into a byte array and hand that to your BinaryDecoder (once you've found the message boundary).
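Alternatively, as a sketch assuming the standard Avro API and a known record schema, you can point a stream-backed BinaryDecoder at the InputStream and loop; handle(), schema and socketInput are placeholders:

DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);           // schema assumed known
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(socketInput, null);  // stream-backed, no byte[] copy
GenericRecord record = null;
while (!decoder.isEnd()) {                   // works for blocking, stream-backed decoders
    record = reader.read(record, decoder);   // reusing the record object cuts allocation
    handle(record);
}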
We have a bunch of threads that each take a block of data, compress it, and eventually concatenate the results into one large byte array. If anyone can expand on this idea or recommend another method, that'd be awesome. I've currently got two methods that I'm trying, but neither works the way it should:
The first: I have each thread's run() function take the input data and just use GZIPOutputStream to compress it and write it to the buffer.
The problem with this approach is that each thread's block is only part of a longer, complete piece of data, yet GZIPOutputStream treats that little block as a complete piece of data to zip. That means it adds a header and trailer to every block (I also use a custom dictionary, so I have no idea how many bits the header is now, nor how to find out).
I think you could manually cut off the headers and trailers and be left with just the compressed data (keeping the header of the first block and the trailer of the last block). The other thing I'm not sure about with this method is whether I can even do that. If I leave the header on the first block of data, will it still decompress correctly? Doesn't that header contain information for ONLY the first block of the data, and not the other concatenated blocks?
The second method is to use the Deflater class. In that case, I can simply set the input, set the dictionary, and then call deflate().
The problem is, that's not gzip format. That's just "raw" compressed data. I have no idea how to make it so that gzip can recognize the final output.
You need a method that writes to a single GZIPOutputStream that is called by the other threads, with suitable co-ordination between them so the data doesn't get mixed up. Or else have the threads write to temporary files, and assemble and zip it all in a second phase.
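The first option could be as simple as a shared, synchronized sink (names are illustrative; the order in which threads hand over their blocks still has to be coordinated by the callers):

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

class GzipCollector {
    private final GZIPOutputStream gzip;

    GzipCollector(OutputStream sink) throws IOException {
        this.gzip = new GZIPOutputStream(sink);
    }

    // Worker threads hand over complete blocks; synchronization stops them interleaving.
    synchronized void append(byte[] block) throws IOException {
        gzip.write(block);
    }

    synchronized void finish() throws IOException {
        gzip.finish();    // writes the single gzip trailer
    }
}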
I have a server that sends bytes back to a client app. When the client app receives the finished response from the server, I want to gather together the bytes that arrived before that finished response. How do I append these bytes back together again?
The bytes sent to the server are split up into segments of, say, 100 bytes, and when the server sends the bytes back to the client I want to gather those segments back into their original form again.
I have had a look at concatenating two arrays, but is there a simpler way?
You could create a ByteArrayOutputStream, then write() the arrays to it, and finally use toByteArray().
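For example (receivedSegments stands in for however you collect each 100-byte chunk):

ByteArrayOutputStream out = new ByteArrayOutputStream();
for (byte[] segment : receivedSegments) {
    out.write(segment, 0, segment.length);   // append each chunk in arrival order
}
byte[] whole = out.toByteArray();            // the reassembled message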
Guava's Bytes class provides a Bytes.concat method, though it's more useful when you have a fixed number of arrays you want to concatenate than if you're gathering a variable number of arrays to concatenate. ByteArrayOutputStream is probably what you want here, though, based on your description, because it doesn't require you to keep each individual array you receive around in order to concatenate them... you can just add them to the output stream.
I could use some hints or tips for a decent interface for reading a file with special characteristics.
The files in question have a header (~120 bytes), a body (1 byte - 3 GB) and a footer (4 bytes).
The header contains information about the body, and the footer is only a simple CRC32 value of the body.
I use Java, so my idea was to extend the "InputStream" class and add a constructor such as "public MyInStream(InputStream in)", where I immediately read the header and then direct the overridden read()s to the body.
Problem is, I can't give the user of the class the CRC32-value until the whole body has been read.
Because the file can be 3 GB large, putting it all in memory is a bad idea.
Reading it all into a temporary file is going to be a performance hit if there are many small files.
I don't know how large the file is because the InputStream doesn't have to be a file, it could be a socket.
Looking at it again, maybe extending InputStream is a bad idea.
Thank you for reading the confused thoughts of a tired programmer. :)
Looking at it again, maybe extending InputStream is a bad idea.
If users of the class need to access the body as a stream, it's IMO not a bad choice. Java's ObjectOutput/InputStream works like this.
I don't know how large the file is because the InputStream doesn't have to be a file, it could be a socket.
Um, then your problem is not with the choice of Java class, but with the design of the file format. If you can't change the format, there isn't really anything you can do to make the data at the end of the file available before all of it is read.
But perhaps you could encapsulate the processing of the checksum completely? Presumably it's a checksum for the body, so your class could always "read ahead" 4 bytes to see when the stream ends, withhold those last 4 bytes from the client instead of returning them as part of the body, and compare them with a CRC computed while reading the body, throwing an exception when they do not match.
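A sketch of that read-ahead idea, assuming the trailer is a big-endian CRC32 of the body; only the single-byte read() is shown, a real version would also override read(byte[], int, int):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

class CrcTrailerInputStream extends FilterInputStream {
    private final CRC32 crc = new CRC32();
    private final int[] tail = new int[4];   // 4-byte look-ahead window
    private int buffered = 0;
    private boolean checked = false;

    CrcTrailerInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        while (buffered < 4) {                   // fill the window before handing anything out
            int b = in.read();
            if (b < 0) { check(); return -1; }
            tail[buffered++] = b;
        }
        int next = in.read();
        if (next < 0) { check(); return -1; }    // window now holds the trailer, not body
        int out = tail[0];                       // oldest byte is definitely part of the body
        System.arraycopy(tail, 1, tail, 0, 3);
        tail[3] = next;
        crc.update(out);
        return out;
    }

    private void check() throws IOException {
        if (checked) return;
        checked = true;
        if (buffered != 4) throw new IOException("stream too short for CRC32 trailer");
        long stored = ((long) tail[0] << 24) | (tail[1] << 16) | (tail[2] << 8) | tail[3];
        if (stored != crc.getValue()) throw new IOException("CRC32 mismatch");
    }
}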