Sending data to a database in size-limited chunks - java

I have a method that takes a Partition enum as its parameter. It is called by multiple background threads (15 at most) at around the same time, each passing a different partition value. Here dataHoldersByPartition is a map from Partition to ConcurrentLinkedQueue<DataHolder>.
private final ImmutableMap<Partition, ConcurrentLinkedQueue<DataHolder>> dataHoldersByPartition;
//... some code to populate entry in `dataHoldersByPartition`
private void validateAndSend(final Partition partition) {
    ConcurrentLinkedQueue<DataHolder> dataHolders = dataHoldersByPartition.get(partition);
    Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder = new HashMap<>();
    int totalSize = 0;
    DataHolder dataHolder;
    while ((dataHolder = dataHolders.poll()) != null) {
        byte[] clientKeyBytes = dataHolder.getClientKey().getBytes(StandardCharsets.UTF_8);
        if (clientKeyBytes.length > 255)
            continue;
        byte[] processBytes = dataHolder.getProcessBytes();
        int clientKeyLength = clientKeyBytes.length;
        int processBytesLength = processBytes.length;
        int additionalLength = clientKeyLength + processBytesLength;
        if (totalSize + additionalLength > 50000) {
            Message message = new Message(clientKeyBytesAndProcessBytesHolder, partition);
            // here size of `message.serialize()` byte array should always be less than 50k at all cost
            sendToDatabase(message.getAddress(), message.serialize());
            clientKeyBytesAndProcessBytesHolder = new HashMap<>();
            totalSize = 0;
        }
        clientKeyBytesAndProcessBytesHolder.put(clientKeyBytes, processBytes);
        totalSize += additionalLength;
    }
    // calling again with remaining values only if clientKeyBytesAndProcessBytesHolder is not empty
    if (!clientKeyBytesAndProcessBytesHolder.isEmpty()) {
        // argument order matches the Message(Map, Partition) constructor
        Message message = new Message(clientKeyBytesAndProcessBytesHolder, partition);
        // here size of `message.serialize()` byte array should always be less than 50k at all cost
        sendToDatabase(message.getAddress(), message.serialize());
    }
}
And below is my Message class:
public final class Message {
    private final byte dataCenter;
    private final byte recordVersion;
    private final Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder;
    private final long address;
    private final long addressFrom;
    private final long addressOrigin;
    private final byte recordsPartition;
    private final byte replicated;
    public Message(Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder, Partition recordPartition) {
        this.clientKeyBytesAndProcessBytesHolder = clientKeyBytesAndProcessBytesHolder;
        this.recordsPartition = (byte) recordPartition.getPartition();
        this.dataCenter = Utils.CURRENT_LOCATION.get().datacenter();
        this.recordVersion = 1;
        this.replicated = 0;
        long packedAddress = new Data().packAddress();
        this.address = packedAddress;
        this.addressFrom = 0L;
        this.addressOrigin = packedAddress;
    }
    // Output of this method should always be less than 50k
    public byte[] serialize() {
        int bufferCapacity = getBufferCapacity(clientKeyBytesAndProcessBytesHolder); // 36 + dataSize + 1 + 1 + keyLength + 8 + 2;
        ByteBuffer byteBuffer = ByteBuffer.allocate(bufferCapacity).order(ByteOrder.BIG_ENDIAN);
        // header layout
        byteBuffer.put(dataCenter).put(recordVersion).putInt(clientKeyBytesAndProcessBytesHolder.size())
                .putInt(bufferCapacity).putLong(address).putLong(addressFrom).putLong(addressOrigin)
                .put(recordsPartition).put(replicated);
        // now the data layout
        for (Map.Entry<byte[], byte[]> entry : clientKeyBytesAndProcessBytesHolder.entrySet()) {
            byte keyType = 0;
            byte[] key = entry.getKey();
            byte[] value = entry.getValue();
            byte keyLength = (byte) key.length;
            short valueLength = (short) value.length;
            ByteBuffer dataBuffer = ByteBuffer.wrap(value);
            long timestamp = valueLength > 10 ? dataBuffer.getLong(2) : System.currentTimeMillis();
            byteBuffer.put(keyType).put(keyLength).put(key).putLong(timestamp).putShort(valueLength)
                    .put(value);
        }
        return byteBuffer.array();
    }
    private int getBufferCapacity(Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder) {
        int size = 36;
        for (Entry<byte[], byte[]> entry : clientKeyBytesAndProcessBytesHolder.entrySet()) {
            size += 1 + 1 + 8 + 2;
            size += entry.getKey().length;
            size += entry.getValue().length;
        }
        return size;
    }
    // getters and to string method here
}
Basically, I have to make sure that whenever sendToDatabase is called, the byte array returned by message.serialize() is always smaller than 50k. sendToDatabase sends exactly the byte array that serialize produces. Because of that constraint I do the validation below, plus a few other things. In the method, I iterate over the dataHolders CLQ and extract clientKeyBytes and processBytes from each element. Here is the validation I am doing:
If the clientKeyBytes length is greater than 255, I skip the record and continue iterating.
I keep incrementing totalSize, which is the sum of clientKeyLength and processBytesLength, and totalSize should always stay below 50000 bytes.
As soon as it reaches the 50000 limit, I send the clientKeyBytesAndProcessBytesHolder map to sendToDatabase, start over with an empty map, reset totalSize to 0 and start populating again.
If it doesn't reach that limit and dataHolders becomes empty, I send whatever I have.
I believe there is a bug in my current code that causes some records to be sent incorrectly or dropped somewhere because of my condition, and I am not able to figure it out. It looks like, to enforce the 50k condition properly, I may have to use the getBufferCapacity method to work out the size correctly before calling sendToDatabase?

I checked your code, and it looks good as far as your logic goes. You said it should always send less than 50K, but as written it allows up to and including 50K. To keep it strictly below 50K, change the if condition to if (totalSize + additionalLength >= 50000).
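Note that the loop in the question counts only key and value bytes, while getBufferCapacity adds a 36-byte header plus 1 + 1 + 8 + 2 bytes per entry, so a batch of many small records can serialize to more than the limit even when totalSize stays below it. A minimal sketch of batching against the serialized size instead is shown below. The class SizeLimitedBatcher, its method names, and the decision to skip a record that cannot fit even on its own are assumptions made for illustration, not part of the original code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups key/value pairs into batches whose
// serialized size (as computed by the question's getBufferCapacity)
// stays strictly below a limit.
class SizeLimitedBatcher {
    static final int HEADER_BYTES = 36;                  // fixed header from the question
    static final int PER_ENTRY_OVERHEAD = 1 + 1 + 8 + 2; // keyType, keyLength, timestamp, valueLength
    static final int MAX_KEY_BYTES = 255;

    // Each element of entries is {key, value}.
    static List<Map<byte[], byte[]>> split(List<byte[][]> entries, int limit) {
        List<Map<byte[], byte[]>> batches = new ArrayList<>();
        Map<byte[], byte[]> current = new LinkedHashMap<>();
        int size = HEADER_BYTES;
        for (byte[][] entry : entries) {
            byte[] key = entry[0];
            byte[] value = entry[1];
            if (key.length > MAX_KEY_BYTES) {
                continue; // same skip rule as the question
            }
            int entrySize = PER_ENTRY_OVERHEAD + key.length + value.length;
            if (HEADER_BYTES + entrySize >= limit) {
                continue; // assumption: a record that cannot fit alone is skipped
            }
            if (size + entrySize >= limit) {
                batches.add(current);            // flush the full batch
                current = new LinkedHashMap<>(); // new map, so the flushed one is not mutated
                size = HEADER_BYTES;
            }
            current.put(key, value);
            size += entrySize;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    // Mirrors getBufferCapacity from the question.
    static int serializedSize(Map<byte[], byte[]> batch) {
        int size = HEADER_BYTES;
        for (Map.Entry<byte[], byte[]> e : batch.entrySet()) {
            size += PER_ENTRY_OVERHEAD + e.getKey().length + e.getValue().length;
        }
        return size;
    }

    public static void main(String[] args) {
        List<byte[][]> entries = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            entries.add(new byte[][] { new byte[10], new byte[40] });
        }
        for (Map<byte[], byte[]> batch : split(entries, 50000)) {
            System.out.println(batch.size() + " entries, " + serializedSize(batch) + " bytes");
        }
    }
}
```

In validateAndSend, each returned batch would be wrapped in a Message and sent. Because the running size starts at the header size and includes the per-entry overhead, the check tracks what serialize() actually writes.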
If your code still does not meet your requirement, i.e. it still sends data when totalSize + additionalLength is greater than 50k, I can suggest a few things.
Since up to 15 threads call this method, there are two sections of your code you should consider synchronizing.
One is the shared field dataHoldersByPartition, which is a container object. If multiple concurrent lookups hit this container, the outcome might not be correct. Check whether the container type is thread-safe. If it is not, guard the lookup like this:
ConcurrentLinkedQueue<DataHolder> dataHolders;
synchronized (this) {
    dataHolders = dataHoldersByPartition.get(partition);
}
Beyond that, I can offer only two suggestions. First, instead of checking if (totalSize + additionalLength > 50000), you could check the size of the clientKeyBytesAndProcessBytesHolder object itself, along the lines of if (sizeof(clientKeyBytesAndProcessBytesHolder) >= 50000) (Java has no built-in sizeof, so you would need to find or write an appropriate method). Second, narrow down whether the problem is a side effect of multithreading or not. These suggestions are meant to help you find where exactly the problem is; the fix itself will have to come from your end.
First, check whether validateAndSend on its own satisfies your requirement. To do that, synchronize the whole validateAndSend method and see whether everything works or you still get the same result. If the result is the same, the cause is not multithreading; your logic simply does not match the requirement. If it works, the problem is multithreading. If synchronizing the method fixes the issue but degrades performance, remove the method-level synchronization and instead wrap one small block of suspect code at a time in a synchronized block, removing it again if it does not fix the issue. That way you eventually locate the block that actually causes the problem, and you can leave just that block synchronized.
For example, first attempt:
`private synchronized void validateAndSend`
Second attempt: remove the synchronized keyword from the method and do the following:
synchronized (this) {
    Message message = new Message(clientKeyBytesAndProcessBytesHolder, partition);
    sendToDatabase(message.getAddress(), message.serialize());
}
If you think I have misunderstood you, please let me know.

In your validateAndSend I would put all of the data on a queue and do all of the processing in a separate thread. Consider the command pattern: every thread puts its load on the queue, and a single consumer thread, which has all the data and all the information in one place, can process it quite efficiently. The only complicated part is sending a response or result back to the calling thread; since that is not needed in your case, so much the better. There are more benefits to this pattern; have a look at Netflix/Hystrix.
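A rough sketch of that idea, using only java.util.concurrent: several producer threads add items to a shared BlockingQueue, and one consumer drains it and groups the items into batches. The class name SingleConsumerDemo, the string payloads, and the count-based batch limit are placeholders chosen for illustration; a real consumer would apply the byte-size rule from the question instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: many producers, one consumer that owns all batching state.
class SingleConsumerDemo {
    static List<List<String>> run(int producers, int perProducer, int batchSize) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<Thread> threads = new ArrayList<>();
        for (int p = 0; p < producers; p++) {
            final int id = p;
            Thread t = new Thread(() -> {
                for (int i = 0; i < perProducer; i++) {
                    queue.add("p" + id + "-" + i); // producers only enqueue
                }
            });
            threads.add(t);
            t.start();
        }

        // Single consumer (here: the calling thread). No locking is needed
        // around the batching state because only this thread touches it.
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int total = producers * perProducer;
        try {
            for (int seen = 0; seen < total; seen++) {
                String item = queue.poll(1, TimeUnit.SECONDS);
                if (item == null) {
                    break; // producers stalled; this sketch simply stops
                }
                current.add(item);
                if (current.size() == batchSize) {
                    batches.add(current); // a real consumer would send the batch here
                    current = new ArrayList<>();
                }
            }
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(run(3, 10, 8).size() + " batches");
    }
}
```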

Related

In stateful processing, how should you deal with data that was never processed because the buffer size was never reached?

I'm doing some aggregations on keys, on a global window. I'm buffering events so that batch processing only happens once the buffer size is reached, in order to reduce the number of iterations.
But some keys never reach the buffer size, and are never calculated.
Here is some pseudo-code:
DoFn<> {
    @StateId("count")
    private final StateSpec<ValueState<Integer>> countState = StateSpecs.value();
    private static final int MAX_BUFFER_SIZE = 1000;
    @ProcessElement () {
        int count = firstNonNull(countState.read(), 0);
        count = count + 1;
        countState.write(count);
        if (count >= MAX_BUFFER_SIZE) {
            (...)
            countState.clear();
        }
    }
}
That works well, but I'm losing some data, since some keys never reach the 1000 records needed to trigger the batch processing.
I think the proper solution here is to add a FinishBundle method and, in it, finalize the keys that did not reach the buffer size.

Using final variable in lambda expression Java

I have an app that fetches a lot of data, so I would like to paginate the data into chunks and process those chunks individually rather than dealing with all the data at once. So I wrote a function that I call every n seconds to check whether a chunk is done and, if so, process it. My problem is that I have no way of keeping track of the fact that I just processed a chunk and should move on to the next chunk when it is available. I was thinking of something along the lines of the code below, but I cannot call multiplier++; because the compiler complains that the variable is no longer effectively final. I would like to use something like multiplier so that once the code processes a chunk it 1) doesn't process the same chunk again and 2) moves on to the next chunk. Is it possible to do this? Is there a modifier one can put on multiplier to help avoid race conditions?
int multiplier = 1;
CompletableFuture<String> completionFuture = new CompletableFuture<>();
final ScheduledFuture<?> checkFuture = executor.scheduleAtFixedRate(() -> {
    // parse json response
    String response = getJSONResponse();
    JsonObject jsonObject = ConverterUtils.parseJson(response, true)
            .getAsJsonObject();
    int pages = jsonObject.get("stats").getAsJsonObject().get("pages").getAsInt();
    // if we have a chunk of n pages records then process them with dataHandler function
    if (pages > multiplier * bucketSize) {
        dataHandler.apply(getResponsePaginated((multiplier - 1) * bucketSize, bucketSize));
        multiplier++;
    }
    if (jsonObject.has("finishedAt") && !jsonObject.get("finishedAt").isJsonNull()) {
        // we are done!
        completionFuture.complete("");
    }
}, 0, sleep, TimeUnit.SECONDS);
You can use an AtomicInteger. Since this is a mutable type, you can assign it to a final variable while still being able to change its value. This also addresses the synchronization issue between the callbacks:
final AtomicInteger multiplier = new AtomicInteger(1);
executor.scheduleAtFixedRate(() -> {
    //...
    multiplier.incrementAndGet();
}, 0, sleep, TimeUnit.SECONDS);
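For a version that can be run on its own, here is a small sketch of the same pattern. The class ChunkPoller, the fixed totalPages value standing in for the JSON "pages" field, and the 10 ms polling interval are all assumptions made for the demo. It also uses >= rather than > in the chunk check so that an exactly full chunk is processed; that is a choice of this sketch, not something the original states.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: a scheduled task that advances through chunks
// using a final AtomicInteger captured by the lambda.
class ChunkPoller {
    // Returns the start offset of every full chunk that was processed.
    static List<Integer> pollChunks(int totalPages, int bucketSize) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        final AtomicInteger multiplier = new AtomicInteger(1);
        final List<Integer> processedOffsets = new CopyOnWriteArrayList<>();
        final CompletableFuture<Void> done = new CompletableFuture<>();

        executor.scheduleAtFixedRate(() -> {
            int m = multiplier.get();
            if (totalPages >= m * bucketSize) {
                // "process" the chunk that starts at (m - 1) * bucketSize
                processedOffsets.add((m - 1) * bucketSize);
                multiplier.incrementAndGet(); // move on to the next chunk
            } else {
                done.complete(null); // no further full chunk available
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        try {
            done.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
        return processedOffsets;
    }

    public static void main(String[] args) {
        System.out.println(pollChunks(25, 10)); // prints [0, 10]
    }
}
```

The reference held in multiplier never changes, which is what the lambda capture rule requires; only the value inside the AtomicInteger does. In this sketch the last partial chunk (pages 20 to 24) is left unprocessed, so real code would need a final pass once the job reports that it has finished.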

How to make sure total bytes of a map (sum of all keys and values length) stays within the limit?

I am getting a lot of records from a particular source and I need to send those records to our database. Below is what I am doing:
I am storing all these records in a ConcurrentHashMap where the key is an Integer and the value is a ConcurrentLinkedQueue, and this CHM gets populated by multiple threads in a thread-safe way.
Now I have a single background thread (it runs every minute) which reads from this map and sends those events to another method, which validates them and sends them to our database.
Below is my method, which is called by the single background thread every minute.
private void validateAndSend(final int partition,
        final ConcurrentLinkedQueue<DataHolder> dataHolders) {
    Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder = new HashMap<>();
    int totalSize = 0;
    while (!dataHolders.isEmpty()) {
        DataHolder dataHolder = dataHolders.poll();
        byte[] clientKeyBytes = dataHolder.getClientKey().getBytes(StandardCharsets.UTF_8);
        if (clientKeyBytes.length > 255)
            continue;
        byte[] processBytes = dataHolder.getProcessBytes();
        int clientKeyLength = clientKeyBytes.length;
        int processBytesLength = processBytes.length;
        totalSize += clientKeyLength + processBytesLength;
        if (totalSize > 64000) {
            sendToDatabase(partition, clientKeyBytesAndProcessBytesHolder);
            clientKeyBytesAndProcessBytesHolder.clear(); // watch out for gc
            totalSize = 0;
        }
        clientKeyBytesAndProcessBytesHolder.put(clientKeyBytes, processBytes);
    }
    // calling again with remaining values
    sendToDatabase(partition, clientKeyBytesAndProcessBytesHolder);
}
In the above method, I iterate over the dataHolders CLQ and extract clientKeyBytes and processBytes from each element. Here is the validation that I am supposed to do:
If the clientKeyBytes length is greater than 255, I skip the record and continue iterating.
Then I keep incrementing totalSize, which is the sum of clientKeyLength and processBytesLength, and totalSize should always be less than 64000.
As soon as it reaches the 64000 limit, I send the clientKeyBytesAndProcessBytesHolder map to sendToDatabase, clear the map, reset totalSize to 0 and start populating again.
If it doesn't reach that limit and dataHolders becomes empty, I send whatever I have.
Basically, I have to make sure that whenever sendToDatabase is called, the clientKeyBytesAndProcessBytesHolder map has a size of less than 64000 (the sum of all key and value lengths). It should never be called with a size greater than 64000.
Is this the best and most efficient way to do this, or is there a better way to accomplish the same thing?
Update:
Is this how it should be?
private void validateAndSend(final int partition,
        final ConcurrentLinkedQueue<DataHolder> dataHolders) {
    Map<byte[], byte[]> clientKeyBytesAndProcessBytesHolder = new HashMap<>();
    int totalSize = 0;
    while (!dataHolders.isEmpty()) {
        DataHolder dataHolder = dataHolders.poll();
        byte[] clientKeyBytes = dataHolder.getClientKey().getBytes(StandardCharsets.UTF_8);
        if (clientKeyBytes.length > 255)
            continue;
        byte[] processBytes = dataHolder.getProcessBytes();
        int clientKeyLength = clientKeyBytes.length;
        int processBytesLength = processBytes.length;
        int additionalLength = clientKeyLength + processBytesLength;
        if (totalSize + additionalLength > 64000) {
            Message message = new Message(partition, clientKeyBytesAndProcessBytesHolder);
            sendToDatabase(message.getAddress(), message.getLocation());
            clientKeyBytesAndProcessBytesHolder.clear(); // watch out for gc
            totalSize = 0;
        }
        clientKeyBytesAndProcessBytesHolder.put(clientKeyBytes, processBytes);
        totalSize += additionalLength;
    }
    // calling again with remaining values
    Message message = new Message(partition, clientKeyBytesAndProcessBytesHolder);
    sendToDatabase(message.getAddress(), message.getLocation());
}
Looks good, but there is a small bug: totalSize is reset to 0 where it should be set to clientKeyLength + processBytesLength -- the bytes for the current key are ignored when the data is sent, although the entry is added after the if statement.
I'd change the code as follows (the whole question might be better suited to Code Review Stack Exchange):
int additionalLength = clientKeyLength + processBytesLength;
if (totalSize + additionalLength > 64000) {
    sendToDatabase(partition, clientKeyBytesAndProcessBytesHolder);
    clientKeyBytesAndProcessBytesHolder.clear(); // watch out for gc
    totalSize = 0;
}
clientKeyBytesAndProcessBytesHolder.put(clientKeyBytes, processBytes);
totalSize += additionalLength;
P.S.: What is the expected behavior when the same key is inserted multiple times? Your code currently inserts all instances...
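The reason all instances are inserted is that Java arrays inherit equals and hashCode from Object, so two byte[] keys with identical contents are still different HashMap keys. A small sketch of the difference follows; wrapping keys in ByteBuffer is one possible way to get content-based equality, shown here as an illustration rather than as what the original code should necessarily do, and the class name ByteArrayKeyDemo is made up for the demo.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo: byte[] keys compare by identity, ByteBuffer keys by content.
class ByteArrayKeyDemo {
    static int countWithArrayKeys(String... keys) {
        Map<byte[], byte[]> map = new HashMap<>();
        for (String k : keys) {
            // every getBytes() call returns a new array, hence a new map key
            map.put(k.getBytes(StandardCharsets.UTF_8), new byte[0]);
        }
        return map.size();
    }

    static int countWithBufferKeys(String... keys) {
        Map<ByteBuffer, byte[]> map = new HashMap<>();
        for (String k : keys) {
            // ByteBuffer.equals/hashCode depend on the remaining bytes
            map.put(ByteBuffer.wrap(k.getBytes(StandardCharsets.UTF_8)), new byte[0]);
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(countWithArrayKeys("a", "a", "b"));  // prints 3
        System.out.println(countWithBufferKeys("a", "a", "b")); // prints 2
    }
}
```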

Queue of byte buffers in Java

I want to add ByteBuffers to a queue in Java, so I have the following code:
public class foo{
    private Queue <ByteBuffer> messageQueue = new LinkedList<ByteBuffer>();
    protected boolean queueInit(ByteBuffer bbuf)
    {
        if(bbuf.capacity() > 10000)
        {
            int limit = bbuf.limit();
            bbuf.position(0);
            for(int i = 0;i<limit;i=i+10000)
            {
                int kb = 1024;
                for(int j = 0;j<kb;j++)
                {
                    ByteBuffer temp = ByteBuffer.allocate(kb);
                    temp.array()[j] = bbuf.get(j);
                    System.out.println(temp.get(j));
                    addQueue(temp);
                }
            }
        }
        System.out.println(messageQueue.peek().get(1));
        return true;
    }
    private void addQueue(ByteBuffer bbuf)
    {
        messageQueue.add(bbuf);
    }
}
The inner workings of the for loop appear to be correct: temp is set to the right value, and it should then be added to the queue by the addQueue call. However, only the first byte of the ByteBuffer ends up in the queue and nothing else. When I peek at the first value in the head of the queue I get 116, as I should, but when I try to get other values from the head they are 0, which is not correct. Why are no values other than the first byte of the ByteBuffer being added to the head of the queue?
ByteBuffer.allocate creates a new ByteBuffer. In each iteration of your inner j loop, you are creating a new buffer, placing a single byte in it, and passing that buffer to addQueue. You are doing this 1024 times (in each iteration of the outer loop), so you are creating 1024 buffers which have a single byte set; in each buffer, all other bytes will be zero.
You are not using the i loop variable of your outer loop at all. I'm not sure why you'd want to skip over 10000 bytes anyway, if your buffers are only 1024 bytes in size.
The slice method can be used to create smaller ByteBuffers from a larger one:
int kb = 1024;
while (bbuf.remaining() >= kb) {
    ByteBuffer temp = bbuf.slice();
    temp.limit(kb);
    addQueue(temp);
    bbuf.position(bbuf.position() + kb);
}
if (bbuf.hasRemaining()) {
    ByteBuffer temp = bbuf.slice();
    addQueue(temp);
}
It's important to remember that the new ByteBuffers will be sharing content with bbuf. Meaning, making changes to any byte in bbuf would also change exactly one of the sliced buffers. That is probably what you want, as it's more efficient than making copies of the buffer. (Potentially much more efficient, if your original buffer is large; would you really want two copies of a one-gigabyte buffer in memory?)
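To make the sharing concrete, here is a self-contained sketch that splits a buffer into slices and then shows that a write to the source is visible through the matching slice. The class name SliceDemo and the use of duplicate() to leave the caller's position untouched are choices made for this example only.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical demo: slices are views onto the same backing bytes.
class SliceDemo {
    static Queue<ByteBuffer> split(ByteBuffer source, int chunkSize) {
        Queue<ByteBuffer> chunks = new ArrayDeque<>();
        ByteBuffer view = source.duplicate(); // independent position, shared content
        while (view.hasRemaining()) {
            int len = Math.min(chunkSize, view.remaining());
            ByteBuffer chunk = view.slice(); // starts at view's current position
            chunk.limit(len);                // restrict the slice to one chunk
            chunks.add(chunk);
            view.position(view.position() + len);
        }
        return chunks;
    }

    public static void main(String[] args) {
        ByteBuffer source = ByteBuffer.wrap(new byte[2500]);
        Queue<ByteBuffer> chunks = split(source, 1024);
        ByteBuffer first = chunks.poll();
        ByteBuffer second = chunks.poll();
        source.put(1024, (byte) 7);        // absolute write into the source
        System.out.println(second.get(0)); // prints 7: the slice sees the change
        System.out.println(first.get(0));  // prints 0: other slices are unaffected
    }
}
```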
If you truly need to copy all the bytes into independent buffers, regardless of incurred memory usage, you could have your addQueue method copy each buffer space:
private void addQueue(ByteBuffer bbuf)
{
    bbuf = ByteBuffer.allocate(bbuf.remaining()).put(bbuf); // copy
    bbuf.flip();
    messageQueue.add(bbuf);
}

Java wordcount: a mediocre implementation

I implemented a wordcount program with Java. Basically, the program takes a large file (in my tests, I used a 10 gb data file that contained numbers only), and counts the number of times each 'word' appears - in this case, a number (23723 for example might appear 243 times in the file).
Below is my implementation. I seek to improve it, with mainly performance in mind, but a few other things as well, and I am looking for some guidance. Here are a few of the issues I wish to correct:
Currently, the program is threaded and works properly. However, what I do is pass a chunk of memory (500MB/NUM_THREADS) to each thread, and each thread proceeds to wordcount. The problem here is that I have the main thread wait for ALL the threads to complete before passing more data to each thread. It isn't too much of a problem, but there is a period of time where a few threads will wait and do nothing for a while. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance). Currently, I use the fact that I know the input is an integer, and just store the temporary variables as an int, so no memory problems there. I want to be able to use some sort of delimiter, whether that delimiter be a space, or several characters.
I am using a global ConcurrentHashMap to store key-value pairs. For example, if a thread finds the number "24624", it searches for that number in the map. If it exists, it increases the value for that key by one. The values at the end represent the number of occurrences of each key. So is this the proper design? Would I gain performance by giving each thread its own hashmap and then merging them all at the end?
Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
I am open to other possibilities as well, this is just what comes to mind.
Note: Splitting the file is not an option I want to explore, as I might be deploying this on a server in which I should not be creating my own files, but if it would really be a performance boost, I might listen.
Other Note: I am new to java threading, as well as new to StackOverflow. Be gentle.
public class BigCount2 {
    public static void main(String[] args) throws IOException, InterruptedException {
        int num, counter;
        long i, j;
        String delimiterString = " ";
        ArrayList<Character> delim = new ArrayList<Character>();
        for (char c : delimiterString.toCharArray()) {
            delim.add(c);
        }
        int counter2 = 0;
        num = Integer.parseInt(args[0]);
        int bytesToRead = 1024 * 1024 * 1024 / 2; // ~500 MB (512 MiB) handled per outer-loop iteration
        int remainder = bytesToRead % num;
        int k = 0;
        bytesToRead = bytesToRead - remainder;
        int byr = bytesToRead / num;
        String filepath = "C:/Users/Daniel/Desktop/int-dataset-10g.dat";
        RandomAccessFile file = new RandomAccessFile(filepath, "r");
        Thread[] t = new Thread [num];//array of threads
        ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<Integer, Integer>(25000);
        byte [] byteArray = new byte [byr]; // buffer for one thread's share (byr bytes) of the block
        char[] newbyte;
        for (i = 0; i < file.length(); i += bytesToRead) {
            counter = 0;
            for (j = 0; j < bytesToRead; j += byr) {
                file.seek(i + j);
                file.read(byteArray, 0, byr);
                newbyte = new String(byteArray).toCharArray();
                t[counter] = new Thread(
                        new BigCountThread2(counter,
                                newbyte,
                                delim,
                                wordCountMap));//giving each thread t[i] different file fileReader[i]
                t[counter].start();
                counter++;
                newbyte = null;
            }
            for (k = 0; k < num; k++){
                t[k].join(); //main thread continues after ALL threads have finished.
            }
            counter2++;
            System.gc();
        }
        file.close();
        System.exit(0);
    }
}
class BigCountThread2 implements Runnable {
    private final ConcurrentMap<Integer, Integer> wordCountMap;
    char [] newbyte;
    private ArrayList<Character> delim;
    private int threadId; //use for later
    BigCountThread2(int tid,
            char[] newbyte,
            ArrayList<Character> delim,
            ConcurrentMap<Integer, Integer> wordCountMap) {
        this.delim = delim;
        threadId = tid;
        this.wordCountMap = wordCountMap;
        this.newbyte = newbyte;
    }
    public void run() {
        int intCheck = 0;
        int counter = 0; int i = 0; Integer check; int j =0; int temp = 0; int intbuilder = 0;
        for (i = 0; i < newbyte.length; i++) {
            intCheck = Character.getNumericValue(newbyte[i]);
            if (newbyte[i] == ' ' || intCheck == -1) { //once a delimiter is found, the current tempArray needs to be added to the MAP
                check = wordCountMap.putIfAbsent(intbuilder, 1);
                if (check != null) { //if returns null, then it is the first instance
                    wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1);
                }
                intbuilder = 0;
            }
            else {
                intbuilder = (intbuilder * 10) + intCheck;
                counter++;
            }
        }
    }
}
Some thoughts on most of these points:
"... I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet)."
If all the threads take about the same time to process the same amount of data, then there really isn't that much of a "problem" here.
However, one nice thing about a Thread Pool is it allows one to rather trivially adjust some basic parameters such as number of concurrent workers. Furthermore, using an executor service and Futures can provide an additional level of abstraction; in this case it could be especially handy if each thread returned a map as the result.
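As an illustration of that last point, the sketch below submits one task per chunk to a fixed thread pool, has each task return its own map, and merges the maps once the futures complete. The class name PooledWordCount, the in-memory string chunks, and whitespace splitting are simplifications assumed for the demo; the original program reads its chunks from a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo: thread pool + Futures, one private map per task.
class PooledWordCount {
    static Map<String, Integer> countChunk(String chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : chunk.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    static Map<String, Integer> countAll(List<String> chunks, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (String chunk : chunks) {
                Callable<Map<String, Integer>> task = () -> countChunk(chunk);
                futures.add(pool.submit(task)); // idle workers pick up the next task
            }
            Map<String, Integer> total = new HashMap<>();
            for (Future<Map<String, Integer>> future : futures) {
                // merge each per-task map on the calling thread only
                future.get().forEach((word, n) -> total.merge(word, n, Integer::sum));
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countAll(Arrays.asList("1 2 2", "2 3", "1"), 2));
    }
}
```

Because the pool hands the next chunk to whichever worker is free, no thread has to wait for the slowest one in a round, which is the idle time described in the question.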
"The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance) ..."
This sounds like an implementation issue. While I would first try a StreamTokenizer (because it's already written), if doing it manually, I would check out the source - a good bit of that can be omitted when simplifying the notion of a "token". (It uses a temporary array to build the token.)
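For reference, a minimal StreamTokenizer sketch that treats every run of printable ASCII characters as a word, so it is not limited to integers. The syntax table configured here (space and control characters as delimiters) and the class name TokenizerWordCount are assumptions for the demo; a different delimiter set would need different wordChars and whitespaceChars calls.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo: word counting with java.io.StreamTokenizer.
class TokenizerWordCount {
    static Map<String, Integer> count(Reader reader) {
        StreamTokenizer tokens = new StreamTokenizer(reader);
        tokens.resetSyntax();           // drop default number, quote and comment handling
        tokens.wordChars('!', '~');     // printable ASCII characters form words
        tokens.whitespaceChars(0, ' '); // control characters and space separate words
        Map<String, Integer> counts = new HashMap<>();
        try {
            while (tokens.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokens.ttype == StreamTokenizer.TT_WORD) {
                    counts.merge(tokens.sval, 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(new StringReader("23723 42 23723\n7 42 23723")));
    }
}
```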
"I am using a global ConcurrentHashMap to store key-value pairs. ... So is this the proper design? Would I gain performance by giving each thread its own hashmap and then merging them all at the end?"
It would reduce locking and may increase performance to use a separate map per thread and merge strategy. Furthermore, the current implementation is broken as wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1) is not atomic and thus the operation might under count. I would use a separate map simply because reducing mutable shared state makes a threaded program much easier to reason about.
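If the shared map is kept, the increment itself can be made atomic with ConcurrentHashMap's merge method, which performs the read-modify-write as a single operation. The sketch below is only meant to show that call; the class name AtomicCountDemo and the fixed key 42 are arbitrary choices for the demo.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical demo: atomic per-key increments on a shared ConcurrentHashMap.
class AtomicCountDemo {
    static int countConcurrently(int threads, int incrementsPerThread) {
        ConcurrentMap<Integer, Integer> counts = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    // inserts 1 if absent, otherwise adds 1, as one atomic step
                    counts.merge(42, 1, Integer::sum);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counts.get(42);
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 10000)); // prints 40000
    }
}
```

With the putIfAbsent, get, put sequence from the question, two threads can read the same old value and both write old + 1, losing one increment; merge closes that window. A separate map per thread avoids the contention altogether.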
"Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else."
Consider using a FileReader (and BufferedReader) per thread on the same file. This will avoid having to first copy the file into the array and slice it out for individual threads which, while the same amount of total reading, avoids having to soak up so much memory. The reading done is actually not random access, but merely sequential (with a "skip") starting from different offsets - each thread still works on a mutually exclusive range.
Also, the original code with the slicing is broken if an integer value is "cut" in half, as each of the two threads would read half of the word. One workaround is to have each thread skip the first word if it is a continuation from the previous block (i.e. look one byte earlier) and then read past the end of its range as required to complete the last word.
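That workaround can be sketched on an in-memory byte array as follows. The rule used here is that a word belongs to the range containing its first byte. The class name RangeWordCount, the space and newline delimiters, and the ASCII decoding are assumptions for the demo.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo: counting words in [start, end) without splitting
// a word that straddles a range boundary.
class RangeWordCount {
    static boolean isDelimiter(byte b) {
        return b == ' ' || b == '\n' || b == '\r' || b == '\t';
    }

    static Map<String, Integer> countRange(byte[] data, int start, int end) {
        Map<String, Integer> counts = new HashMap<>();
        int i = start;
        // Look one byte earlier: if it is not a delimiter, the word at
        // 'start' began in the previous range, so skip the rest of it.
        if (start > 0 && !isDelimiter(data[start - 1])) {
            while (i < data.length && !isDelimiter(data[i])) {
                i++;
            }
        }
        while (i < end) {
            while (i < end && isDelimiter(data[i])) {
                i++; // skip delimiters
            }
            if (i >= end) {
                break;
            }
            int wordStart = i;
            // Read past 'end' if needed to finish the last word.
            while (i < data.length && !isDelimiter(data[i])) {
                i++;
            }
            String word = new String(data, wordStart, i - wordStart, StandardCharsets.US_ASCII);
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Counts the whole array as two ranges split at 'split' and merges the results.
    static Map<String, Integer> countSplit(byte[] data, int split) {
        Map<String, Integer> total = new HashMap<>(countRange(data, 0, split));
        countRange(data, split, data.length).forEach((w, n) -> total.merge(w, n, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        byte[] data = "12 345 67 12".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countSplit(data, 5)); // split falls inside "345"
    }
}
```

Whatever split point is chosen, merging the two range counts gives the same totals as counting the whole array at once, because every word is counted by exactly one range.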
