I want to add ByteBuffers to a queue in Java, so I have the following code:
public class foo {
    private Queue<ByteBuffer> messageQueue = new LinkedList<ByteBuffer>();

    protected boolean queueInit(ByteBuffer bbuf)
    {
        if (bbuf.capacity() > 10000)
        {
            int limit = bbuf.limit();
            bbuf.position(0);
            for (int i = 0; i < limit; i = i + 10000)
            {
                int kb = 1024;
                for (int j = 0; j < kb; j++)
                {
                    ByteBuffer temp = ByteBuffer.allocate(kb);
                    temp.array()[j] = bbuf.get(j);
                    System.out.println(temp.get(j));
                    addQueue(temp);
                }
            }
        }
        System.out.println(messageQueue.peek().get(1));
        return true;
    }

    private void addQueue(ByteBuffer bbuf)
    {
        messageQueue.add(bbuf);
    }
}
The inner workings of the for loop appear to be correct: temp is set to the right value and should then be added to the queue by calling the addQueue method. However, only the first byte of the ByteBuffer gets added to the queue and nothing else. When I peek at the head of the queue I get the number 116 for the first value, as I should, but when I try to get other values from the head they are 0, which is not correct. Why might no values except the first value of the ByteBuffer be getting added to the head of the queue?
ByteBuffer.allocate creates a new ByteBuffer. In each iteration of your inner j loop, you are creating a new buffer, placing a single byte in it, and passing that buffer to addQueue. You are doing this 1024 times (in each iteration of the outer loop), so you are creating 1024 buffers which have a single byte set; in each buffer, all other bytes will be zero.
You are not using the i loop variable of your outer loop at all. I'm not sure why you'd want to skip over 10000 bytes anyway, if your buffers are only 1024 bytes in size.
The slice method can be used to create smaller ByteBuffers from a larger one:
int kb = 1024;
while (bbuf.remaining() >= kb) {
    ByteBuffer temp = bbuf.slice();
    temp.limit(kb);
    addQueue(temp);
    bbuf.position(bbuf.position() + kb);
}
if (bbuf.hasRemaining()) {
    ByteBuffer temp = bbuf.slice();
    addQueue(temp);
}
It's important to remember that the new ByteBuffers will be sharing content with bbuf. That means changing any byte in bbuf also changes exactly one of the sliced buffers. That is probably what you want, as it's more efficient than making copies of the buffer. (Potentially much more efficient, if your original buffer is large; would you really want two copies of a one-gigabyte buffer in memory?)
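A minimal demonstration of that sharing (the values here are made up purely for illustration):
ByteBuffer big = ByteBuffer.allocate(16);
ByteBuffer view = big.slice();      // shares the same backing storage as big
big.put(0, (byte) 42);              // absolute put into the original buffer
System.out.println(view.get(0));    // prints 42: the slice sees the change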
If you truly need to copy all the bytes into independent buffers, regardless of the extra memory usage, you could have your addQueue method copy each buffer's contents:
private void addQueue(ByteBuffer bbuf)
{
    bbuf = ByteBuffer.allocate(bbuf.remaining()).put(bbuf); // copy
    bbuf.flip();
    messageQueue.add(bbuf);
}
Related
I'm doing some aggregations on keys, on a global window. I'm buffering events so that I only batch-process them once the buffer size is reached, in order to reduce the number of iterations.
But some keys never reach the buffer size, and so never get calculated.
Here is some pseudo-code:
DoFn<...> {
    @StateId("count")
    private final StateSpec<ValueState<Integer>> countState = StateSpecs.value();

    private static final int MAX_BUFFER_SIZE = 1000;

    @ProcessElement
    public void processElement(...) {
        int count = firstNonNull(countState.read(), 0);
        count = count + 1;
        countState.write(count);
        if (count >= MAX_BUFFER_SIZE) {
            (...)
            countState.clear();
        }
    }
}
That works well, but I'm losing some data, since some keys never reach 1000 records to trigger the batch processing.
I think the proper solution here is to add a @FinishBundle method and, inside it, finalize the keys that did not reach the buffer size.
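A rough sketch of that idea, in the same pseudo-code spirit as the question (note it buffers in an instance field that is reset per bundle, since keyed @StateId state is not readable from @FinishBundle; the field and element types are illustrative):
private transient Map<String, List<MyEvent>> bufferedByKey;   // hypothetical per-bundle buffer

@StartBundle
public void startBundle() {
    bufferedByKey = new HashMap<>();
}

@FinishBundle
public void finishBundle(FinishBundleContext context) {
    // flush every key that never reached MAX_BUFFER_SIZE during this bundle
    for (Map.Entry<String, List<MyEvent>> entry : bufferedByKey.entrySet()) {
        (...) // run the aggregation for the partial batch and emit it via context.output(...)
    }
    bufferedByKey.clear();
}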
I have a method which takes a Partition enum value as a parameter. This method will be called by multiple background threads (15 max) around the same time period, each passing a different value of partition. Here dataHoldersByPartition is a map of Partition to ConcurrentLinkedQueue<DataHolder>.
private final ImmutableMap<Partition, ConcurrentLinkedQueue<DataHolder>> dataHoldersByPartition;
//... some code to populate entry in `dataHoldersByPartition`

private void validateAndSend(final Partition partition) {
    ConcurrentLinkedQueue<DataHolder> dataHolders = dataHoldersByPartition.get(partition);
    Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder = new HashMap<>();
    int totalSize = 0;
    DataHolder dataHolder;
    while ((dataHolder = dataHolders.poll()) != null) {
        byte[] clientKeyBytes = dataHolder.getClientKey().getBytes(StandardCharsets.UTF_8);
        if (clientKeyBytes.length > 255)
            continue;
        byte[] processBytes = dataHolder.getProcessBytes();
        int clientKeyLength = clientKeyBytes.length;
        int processBytesLength = processBytes.length;
        int additionalLength = clientKeyLength + processBytesLength;
        if (totalSize + additionalLength > 50000) {
            Message message = new Message(clientKeyBytesAndProcessBytesHolder, partition);
            // here size of `message.serialize()` byte array should always be less than 50k at all cost
            sendToDatabase(message.getAddress(), message.serialize());
            clientKeyBytesAndProcessBytesHolder = new HashMap<>();
            totalSize = 0;
        }
        clientKeyBytesAndProcessBytesHolder.put(clientKeyBytes, processBytes);
        totalSize += additionalLength;
    }
    // sending the remaining values only if clientKeyBytesAndProcessBytesHolder is not empty
    if (!clientKeyBytesAndProcessBytesHolder.isEmpty()) {
        Message message = new Message(clientKeyBytesAndProcessBytesHolder, partition);
        // here size of `message.serialize()` byte array should always be less than 50k at all cost
        sendToDatabase(message.getAddress(), message.serialize());
    }
}
And below is my Message class:
public final class Message {
    private final byte dataCenter;
    private final byte recordVersion;
    private final Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder;
    private final long address;
    private final long addressFrom;
    private final long addressOrigin;
    private final byte recordsPartition;
    private final byte replicated;

    public Message(Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder, Partition recordPartition) {
        this.clientKeyBytesAndProcessBytesHolder = clientKeyBytesAndProcessBytesHolder;
        this.recordsPartition = (byte) recordPartition.getPartition();
        this.dataCenter = Utils.CURRENT_LOCATION.get().datacenter();
        this.recordVersion = 1;
        this.replicated = 0;
        long packedAddress = new Data().packAddress();
        this.address = packedAddress;
        this.addressFrom = 0L;
        this.addressOrigin = packedAddress;
    }

    // Output of this method should always be less than 50k
    public byte[] serialize() {
        int bufferCapacity = getBufferCapacity(clientKeyBytesAndProcessBytesHolder); // 36 + dataSize + 1 + 1 + keyLength + 8 + 2;
        ByteBuffer byteBuffer = ByteBuffer.allocate(bufferCapacity).order(ByteOrder.BIG_ENDIAN);
        // header layout
        byteBuffer.put(dataCenter).put(recordVersion).putInt(clientKeyBytesAndProcessBytesHolder.size())
                .putInt(bufferCapacity).putLong(address).putLong(addressFrom).putLong(addressOrigin)
                .put(recordsPartition).put(replicated);
        // now the data layout
        for (Map.Entry<byte[], byte[]> entry : clientKeyBytesAndProcessBytesHolder.entrySet()) {
            byte keyType = 0;
            byte[] key = entry.getKey();
            byte[] value = entry.getValue();
            byte keyLength = (byte) key.length;
            short valueLength = (short) value.length;
            ByteBuffer dataBuffer = ByteBuffer.wrap(value);
            long timestamp = valueLength > 10 ? dataBuffer.getLong(2) : System.currentTimeMillis();
            byteBuffer.put(keyType).put(keyLength).put(key).putLong(timestamp).putShort(valueLength)
                    .put(value);
        }
        return byteBuffer.array();
    }

    private int getBufferCapacity(Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder) {
        int size = 36;
        for (Entry<byte[], byte[]> entry : clientKeyBytesAndProcessBytesHolder.entrySet()) {
            size += 1 + 1 + 8 + 2;
            size += entry.getKey().length;
            size += entry.getValue().length;
        }
        return size;
    }

    // getters and to string method here
}
Basically, what I have to make sure is that whenever the sendToDatabase method is called, the size of the message.serialize() byte array is always less than 50k, at all costs. My sendToDatabase method sends the byte array coming out of the serialize method. Because of that condition I am doing the validation below, plus a few other things. In the method, I iterate over the dataHolders ConcurrentLinkedQueue and extract clientKeyBytes and processBytes from each element. Here is the validation I am doing:
If the clientKeyBytes length is greater than 255, I skip it and continue iterating.
I keep incrementing the totalSize variable, which is the running sum of clientKeyLength and processBytesLength; this totalSize should always stay below 50000 bytes.
As soon as it reaches the 50000 limit, I send the clientKeyBytesAndProcessBytesHolder map to the sendToDatabase method, clear out the map, reset totalSize to 0 and start populating again.
If it doesn't reach that limit and dataHolders becomes empty, it sends whatever it has.
I believe there is some bug in my current code because of which some records are not being sent properly, or are dropped somewhere by my condition, and I am not able to figure it out. It looks like, to properly enforce this 50k condition, I may have to use the getBufferCapacity method to figure out the real serialized size before calling the sendToDatabase method?
I checked your code, and it looks right as far as your own logic goes, but it does not quite match what you said: rather than always keeping the batch below 50K, it actually keeps filling it up to 50K. To make it strictly less than 50K you have to change the condition to if (totalSize + additionalLength >= 50000).
If your code still does not fulfil the requirement, i.e. it still sends batches where totalSize + additionalLength is greater than 50k, I can advise a few things.
Since multiple threads call this method, there are two sections of your code you should consider synchronizing.
One is the global dataHoldersByPartition container. If multiple concurrent, parallel lookups happen on this container, the outcome might not be correct. Check whether the container type is thread-safe; if it is not, wrap the lookup in a block like this:
synchronized (this) {
    ConcurrentLinkedQueue<DataHolder> dataHolders = dataHoldersByPartition.get(partition);
}
Beyond that, I can give only two suggestions for finding the issue. One is, instead of if (totalSize + additionalLength > 50000), check the actual size of the clientKeyBytesAndProcessBytesHolder contents, something like if (sizeOf(clientKeyBytesAndProcessBytesHolder) >= 50000) (you will have to write an appropriate size calculation for this in Java yourself). The second is to narrow down whether or not this is a side effect of multithreading. All these suggestions are about locating exactly where the problem is; the actual fix has to come from your end.
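If you do want to enforce the limit against the real serialized size rather than just the payload lengths, here is a minimal sketch of that check, reusing the same arithmetic as the question's getBufferCapacity (the helper name and where it is called from are illustrative, not from the original code):
// Hypothetical helper mirroring Message.getBufferCapacity: 36 header bytes plus,
// per entry, keyType (1) + keyLength (1) + timestamp (8) + valueLength (2) + payload.
private static int estimatedSerializedSize(Map<byte[], byte[]> holder) {
    int size = 36;
    for (Map.Entry<byte[], byte[]> entry : holder.entrySet()) {
        size += 1 + 1 + 8 + 2 + entry.getKey().length + entry.getValue().length;
    }
    return size;
}

// Inside the poll loop, before putting the next pair into the holder:
if (estimatedSerializedSize(clientKeyBytesAndProcessBytesHolder)
        + 1 + 1 + 8 + 2 + clientKeyLength + processBytesLength > 50000) {
    // send the current batch first, then reset the holder and totalSize
}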
First check whether your validateAndSend method satisfies your requirement at all. To do that, synchronize the whole validateAndSend method and check whether everything is fine or you still get the same result. If you still get the same result, the problem is not multithreading: your code simply does not implement the requirement. If it works fine, the problem is multithreading. If synchronizing the whole method fixes the issue but degrades performance, remove that synchronization and instead make each small block of code that might cause the issue a synchronized block, one at a time, removing it again if it does not help. That way you will eventually locate the block that is actually causing the issue and can leave it synchronized as the final fix.
For example, first attempt:
`private synchronized void validateAndSend`
Second attempt: remove the synchronized keyword from the method and instead do the step below:
synchronized (this) {
    Message message = new Message(clientKeyBytesAndProcessBytesHolder, partition);
    sendToDatabase(message.getAddress(), message.serialize());
}
If you think I did not understand you correctly, please let me know.
In validateAndSend I would put the whole data onto a queue and do all of the processing in a separate thread; please consider the command model. That way all threads simply put their load onto the queue, while a single consumer thread has all the data and all the information in place and can process it quite effectively. The only complicated part is sending a response or result back to the calling thread; since that is not a problem in your case, all the better. There are some more benefits of this pattern; please look at netflix/hystrix.
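A rough sketch of that producer/consumer arrangement (method and field names are illustrative, not from the original code):
// All background threads only enqueue; one consumer thread batches and sends.
private final BlockingQueue<DataHolder> workQueue = new LinkedBlockingQueue<>();

// called by the background threads
void submit(DataHolder holder) {
    workQueue.offer(holder);
}

// run by a single consumer thread
void consumeLoop() throws InterruptedException {
    while (true) {
        DataHolder next = workQueue.take();   // blocks until work is available
        // batch, validate and call sendToDatabase(...) here, single-threaded
    }
}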
I'm reading about buffered streams. I searched and found many answers that clear up my concepts, but I still have a few more questions.
After searching, I have come to understand that a buffer is temporary memory (RAM) which helps a program read data quickly instead of going to the hard disk, and that when the buffer is empty the native input API is called.
After reading a little more I found this answer:
Reading data from disk byte-by-byte is very inefficient. One way to speed it up is to use a buffer: instead of reading one byte at a time, you read a few thousand bytes at once, and put them in a buffer, in memory. Then you can look at the bytes in the buffer one by one.
I have two points of confusion:
1: How, and by whom, is the data filled into the buffer? (By the native API, but how?) As the quote above says, a few thousand bytes are read at once; won't that consume the same time? Suppose I have 5 MB of data, and the 5 MB is loaded into the buffer in 5 seconds, and then the program uses this data from the buffer in another 5 seconds: 10 seconds total. But if I skip buffering, the program gets the data directly from the hard disk at 1 MB per 2 seconds, which is the same 10 seconds total. Please clear up this confusion.
2: The second one is how this line works:
BufferedReader inputStream = new BufferedReader(new FileReader("xanadu.txt"));
As I understand it, the FileReader writes data to a buffer, and then the BufferedReader reads data from that buffer memory? Please also explain this.
Thanks.
As for the performance of using buffering during read/write, the impact is probably minimal since the OS caches too; however, buffering reduces the number of calls into the OS, which does have an impact.
When you add other operations on top, such as character encoding/decoding or compression/decompression, the impact is greater as those operations are more efficient when done in blocks.
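A tiny illustration of that difference (using the same xanadu.txt as below; every read() on the raw stream goes down to the OS, while the buffered one fetches a whole block and serves most read() calls from memory):
try (InputStream raw = new FileInputStream("xanadu.txt")) {
    int b;
    while ((b = raw.read()) != -1) { /* roughly one OS-level read per byte */ }
}

try (InputStream buffered = new BufferedInputStream(new FileInputStream("xanadu.txt"))) {
    int b;
    while ((b = buffered.read()) != -1) { /* the OS is asked for a block at a time */ }
}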
Your second question said:
As I understand it, the FileReader writes data to a buffer, and then the BufferedReader reads data from that buffer memory? Please also explain this.
I believe your thinking is wrong. Yes, technically the FileReader will write data to a buffer, but the buffer is not defined by the FileReader, it's defined by the caller of the FileReader.read(buffer) method.
The operation is initiated from outside, when some code calls BufferedReader.read() (any of the overloads). BufferedReader will then check its buffer, and if enough data is available in the buffer, it will return the data without involving the FileReader. If more data is needed, the BufferedReader will call the FileReader.read(buffer) method to get the next chunk of data.
It's a pull operation, not a push, meaning the data is pulled out of the readers by the caller.
All of this is done by a private method named fill(), which I include here for educational purposes; any Java IDE will let you look at the source code yourself:
private void fill() throws IOException {
    int dst;
    if (markedChar <= UNMARKED) {
        /* No mark */
        dst = 0;
    } else {
        /* Marked */
        int delta = nextChar - markedChar;
        if (delta >= readAheadLimit) {
            /* Gone past read-ahead limit: Invalidate mark */
            markedChar = INVALIDATED;
            readAheadLimit = 0;
            dst = 0;
        } else {
            if (readAheadLimit <= cb.length) {
                /* Shuffle in the current buffer */
                // here the already-read chars are copied within the in-memory buffer named cb
                System.arraycopy(cb, markedChar, cb, 0, delta);
                markedChar = 0;
                dst = delta;
            } else {
                /* Reallocate buffer to accommodate read-ahead limit */
                char ncb[] = new char[readAheadLimit];
                System.arraycopy(cb, markedChar, ncb, 0, delta);
                cb = ncb;
                markedChar = 0;
                dst = delta;
            }
            nextChar = nChars = delta;
        }
    }
    int n;
    do {
        // this is the actual refill: pull the next chunk from the underlying Reader into cb
        n = in.read(cb, dst, cb.length - dst);
    } while (n == 0);
    if (n > 0) {
        nChars = dst + n;
        nextChar = dst;
    }
}
I send a long number via UDP.
LinkedQueue<ByteBuffer> Q = new LinkedQueue<ByteBuffer>();
while (this._run) {
    udp_socket.receive(packet);
    if (packet.getLength() > 0) {
        ByteBuffer bb = ByteBuffer.wrap(buf, 0, packet.getLength());
        Q.add(bb);
    }
}
// UDP socket closed. I remove data from the queue, but all ByteBuffers have the same value.
while (!Q.isEmpty()) {
    ByteBuffer b = Q.remove();
    b.getLong(); // same value
}
Why do I receive the same value? Any suggestions?
Does your byte buffer consist of just one long?
Probably not; my guess is that you put a bit more than just one long in there.
And that's why it gives you the same values for the first sizeof(long) bytes.
What you need to do is keep calling .getLong() until you hit the end of the buffer.
See the docs.
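A minimal sketch of that loop over each dequeued buffer (reads every complete long the datagram contained):
while (b.remaining() >= Long.BYTES) {
    long value = b.getLong();   // advances the position by 8 bytes per call
    // handle value
}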
I've got a 40MB file in the disk and I need to "map" it into memory using a byte array.
At first, I thought writing the file to a ByteArrayOutputStream would be the best way, but I find it takes about 160MB of heap space at some moment during the copy operation.
Does somebody know a better way to do this without using three times the file size of RAM?
Update: Thanks for your answers. I noticed I could reduce memory consumption a little by giving ByteArrayOutputStream an initial size a bit greater than the original file size (using the exact size with my code forces a reallocation; I have to check why).
There's another high-memory spot: when I get the byte[] back with ByteArrayOutputStream.toByteArray. Taking a look at its source code, I can see it is cloning the array:
public synchronized byte toByteArray()[] {
return Arrays.copyOf(buf, count);
}
I'm thinking I could just extend ByteArrayOutputStream and rewrite this method so as to return the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?
MappedByteBuffer might be what you're looking for.
I'm surprised it takes so much RAM to read a file in memory, though. Have you constructed the ByteArrayOutputStream with an appropriate capacity? If you haven't, the stream could allocate a new byte array when it's near the end of the 40 MB, meaning that you would, for example, have a full buffer of 39MB, and a new buffer of twice the size. Whereas if the stream has the appropriate capacity, there won't be any reallocation (faster), and no wasted memory.
ByteArrayOutputStream should be okay so long as you specify an appropriate size in the constructor. It will still create a copy when you call toByteArray, but that's only temporary. Do you really mind the memory briefly going up a lot?
Alternatively, if you already know the size to start with you can just create a byte array and repeatedly read from a FileInputStream into that buffer until you've got all the data.
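A minimal sketch of that read-into-a-preallocated-array approach (the path is illustrative; assumes the file fits in an int-indexed array):
File file = new File("data.bin");
byte[] data = new byte[(int) file.length()];
try (FileInputStream in = new FileInputStream(file)) {
    int off = 0;
    while (off < data.length) {
        int n = in.read(data, off, data.length - off);
        if (n < 0) throw new EOFException("file shrank while reading");
        off += n;
    }
}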
If you really want to map the file into memory, then a FileChannel is the appropriate mechanism.
If all you want to do is read the file into a simple byte[] (and don't need changes to that array to be reflected back to the file), then simply reading into an appropriately-sized byte[] from a normal FileInputStream should suffice.
Guava has Files.toByteArray() which does all that for you.
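For the FileChannel route mentioned a couple of paragraphs above, a hedged sketch of a read-only mapping of the whole file (the path is illustrative):
try (FileChannel channel = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
    MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    // read from 'mapped' like any other ByteBuffer; no 40 MB byte[] on the Java heap
}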
For an explanation of the buffer growth behavior of ByteArrayOutputStream, please read this answer.
In answer to your question, it is safe to extend ByteArrayOutputStream. In your situation, it is probably better to override the write methods so that the maximum additional allocation is limited to, say, 16MB. You should not override toByteArray to expose the protected buf[] member, because a stream is not just a buffer: it is a buffer with a position pointer and boundary protection. So it is dangerous to access, and potentially manipulate, the buffer from outside the class.
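A rough sketch of that capped-growth idea (the class name is illustrative, and 16 MB is just the example figure from the paragraph above):
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class CappedGrowthOutputStream extends ByteArrayOutputStream {
    private static final int MAX_STEP = 16 * 1024 * 1024;   // grow by at most 16 MB at a time

    public CappedGrowthOutputStream(int initialSize) { super(initialSize); }

    @Override
    public synchronized void write(byte[] b, int off, int len) {
        ensureRoom(len);
        super.write(b, off, len);
    }

    @Override
    public synchronized void write(int b) {
        ensureRoom(1);
        super.write(b);
    }

    private void ensureRoom(int extra) {
        int needed = count + extra;               // 'count' and 'buf' are protected in the superclass
        if (needed > buf.length) {
            // double like the superclass would, but never jump by more than MAX_STEP
            int newSize = Math.max(needed, buf.length + Math.min(buf.length, MAX_STEP));
            buf = Arrays.copyOf(buf, newSize);
        }
    }
}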
I'm thinking I could just extend ByteArrayOutputStream and rewrite this method so as to return the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?
You shouldn't change the specified behavior of the existing method, but it's perfectly fine to add a new method. Here's an implementation:
/** Subclasses ByteArrayOutputStream to give access to the internal raw buffer. */
public class ByteArrayOutputStream2 extends java.io.ByteArrayOutputStream {
    public ByteArrayOutputStream2() { super(); }
    public ByteArrayOutputStream2(int size) { super(size); }

    /** Returns the internal buffer of this ByteArrayOutputStream, without copying. */
    public synchronized byte[] buf() {
        return this.buf;
    }
}
An alternative but hackish way of getting the buffer from any ByteArrayOutputStream is to use the fact that its writeTo(OutputStream) method passes the buffer directly to the provided OutputStream:
/**
 * Returns the internal raw buffer of a ByteArrayOutputStream, without copying.
 */
public static byte[] getBuffer(ByteArrayOutputStream bout) {
    final byte[][] result = new byte[1][];
    try {
        bout.writeTo(new OutputStream() {
            @Override
            public void write(byte[] buf, int offset, int length) {
                result[0] = buf;
            }

            @Override
            public void write(int b) {}
        });
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return result[0];
}
(That works, but I'm not sure if it's useful, given that subclassing ByteArrayOutputStream is simpler.)
However, from the rest of your question it sounds like all you want is a plain byte[] of the complete contents of the file. As of Java 7, the simplest and fastest way to do that is call Files.readAllBytes. In Java 6 and below, you can use DataInputStream.readFully, as in Peter Lawrey's answer. Either way, you will get an array that is allocated once at the correct size, without the repeated reallocation of ByteArrayOutputStream.
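For example, on Java 7 and later (the path is illustrative):
byte[] bytes = Files.readAllBytes(Paths.get("somefile"));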
If you have 40 MB of data I don't see any reason why it would take more than 40 MB to create a byte[]. I assume you are using a growing ByteArrayOutputStream which creates a byte[] copy when finished.
You can try the old read-the-file-at-once approach.
File file = ...;
DataInputStream is = new DataInputStream(new FileInputStream(file));
byte[] bytes = new byte[(int) file.length()];
is.readFully(bytes);
is.close();
Using a MappedByteBuffer is more efficient and avoids a copy of the data (or using the heap much), provided you can use the ByteBuffer directly; however, if you have to use a byte[], it's unlikely to help much.
... but I find it takes about 160MB of heap space at some moment during the copy operation
I find this extremely surprising ... to the extent that I have my doubts that you are measuring the heap usage correctly.
Let's assume that your code is something like this:
BufferedInputStream bis = new BufferedInputStream(
        new FileInputStream("somefile"));
ByteArrayOutputStream baos = new ByteArrayOutputStream(); /* no hint !! */
int b;
while ((b = bis.read()) != -1) {
    baos.write((byte) b);
}
byte[] stuff = baos.toByteArray();
Now the way that a ByteArrayOutputStream manages its buffer is to allocate an initial size, and (at least) double the buffer when it fills it up. Thus, in the worst case baos might use up to 80Mb buffer to hold a 40Mb file.
The final step allocates a new array of exactly baos.size() bytes to hold the buffer's contents. That's 40Mb. So the peak amount of memory that is actually in use should be 120Mb.
So where are those extra 40Mb being used? My guess is that they are not, and that you are actually reporting the total heap size, not the amount of memory that is occupied by reachable objects.
So what is the solution?
You could use a memory mapped buffer.
You could give a size hint when you allocate the ByteArrayOutputStream; e.g.
ByteArrayOutputStream baos = new ByteArrayOutputStream((int) file.length());
You could dispense with the ByteArrayOutputStream entirely and read directly into a byte array.
byte[] buffer = new byte[(int) file.length()];
FileInputStream fis = new FileInputStream(file);
int nosRead = fis.read(buffer);
/* check that nosRead == buffer.length and repeat if necessary */
Both options 1 and 2 should have a peak memory usage of 40Mb while reading a 40Mb file; i.e. no wasted space.
It would be helpful if you posted your code, and described your methodology for measuring memory usage.
I'm thinking I could just extend ByteArrayOutputStream and rewrite this method so as to return the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?
The potential danger is that your assumptions are incorrect, or become incorrect due to someone else modifying your code unwittingly ...
Google Guava's ByteSource seems to be a good choice for buffering in memory. Unlike implementations like ByteArrayOutputStream or ByteArrayList (from the Colt library), it does not merge the data into one huge byte array but stores every chunk separately. An example:
List<ByteSource> result = new ArrayList<>();
try (InputStream source = httpRequest.getInputStream()) {
    byte[] cbuf = new byte[CHUNK_SIZE];
    while (true) {
        int read = source.read(cbuf);
        if (read == -1) {
            break;
        } else {
            result.add(ByteSource.wrap(Arrays.copyOf(cbuf, read)));
        }
    }
}
ByteSource body = ByteSource.concat(result);
The ByteSource can be read as an InputStream anytime later:
InputStream data = body.openBufferedStream();
... I came here with the same observation when reading a 1 GB file: Oracle's ByteArrayOutputStream has lazy memory management.
A byte array is indexed by an int and is therefore limited to 2 GB anyway. Without depending on third-party libraries, you might find this useful:
static public byte[] getBinFileContent(String aFile)
{
    try
    {
        final int bufLen = 32768;
        final long fs = new File(aFile).length();
        final long maxInt = ((long) 1 << 31) - 1;
        if (fs > maxInt)
        {
            System.err.println("file size out of range");
            return null;
        }
        final byte[] res = new byte[(int) fs];
        final byte[] buffer = new byte[bufLen];
        final InputStream is = new FileInputStream(aFile);
        int n;
        int pos = 0;
        while ((n = is.read(buffer)) > 0)
        {
            System.arraycopy(buffer, 0, res, pos, n);
            pos += n;
        }
        is.close();
        return res;
    }
    catch (final IOException e)
    {
        e.printStackTrace();
        return null;
    }
    catch (final OutOfMemoryError e)
    {
        e.printStackTrace();
        return null;
    }
}