How to Wait for Completion of ANY Worker Thread?

I want a dispatcher thread that executes and retrieves results from a pool of worker threads. The dispatcher needs to continuously feed work to the worker threads. When ANY of the worker threads completes, the dispatcher needs to gather its results and re-dispatch (or create a new) worker thread. It seems to me like this should be obvious, but I have been unable to find an example of a suitable pattern. A Thread.join() loop would be inadequate because that is really "AND" logic and I am looking for "OR" logic.
The best I could come up with is to have the dispatcher thread wait() and have the worker threads notify() when they are done. It seems, though, that I would have to guard against two worker threads ending at the same time and causing the dispatcher thread to miss a notify(). Plus, this seems a little inelegant to me.
Even less elegant is the idea of the dispatcher thread periodically waking up, polling the worker thread pool, and checking each thread via isAlive() to see if it has completed.
I took a look at java.util.concurrent and didn't see anything that looked like it fit this pattern.
I feel that implementing what I mention above would involve a lot of defensive programming and reinventing the wheel. There's got to be something that I am missing. What can I leverage to implement this pattern?
This is the single-threaded version. putMissingToS3() would become the dispatcher thread, and the capability represented by uploadFileToBucket() would become the worker threads.
private void putMissingToS3()
{
    int reqFilesToUpload = 0;
    long reqSizeToUpload = 0L;
    int totFilesUploaded = 0;
    long totSizeUploaded = 0L;
    int totFilesSkipped = 0;
    long totSizeSkipped = 0L;
    int rptLastFilesUploaded = 0;
    long rptSizeInterval = 1000000000L;
    long rptLastSize = 0L;
    StopWatch rptTimer = new StopWatch();
    long rptLastMs = 0L;
    StopWatch globalTimer = new StopWatch();
    StopWatch indvTimer = new StopWatch();
    for (FileSystemRecord fsRec : fileSystemState.toList())
    {
        String reqKey = PathConverter.pathToKey(PathConverter.makeRelativePath(fileSystemState.getRootPath(), fsRec.getFullpath()));
        LocalS3MetadataRecord s3Rec = s3Metadata.getRecord(reqKey);
        // Just get a rough estimate of what the size of this upload will be
        if (s3Rec == null)
        {
            ++reqFilesToUpload;
            reqSizeToUpload += fsRec.getSize();
        }
    }
    long uploadTimeGuessMs = (long)((double)reqSizeToUpload/estUploadRateBPS*1000.0);
    printAndLog("Estimated upload: " + natFmt.format(reqFilesToUpload) + " files, " + Utils.readableFileSize(reqSizeToUpload) +
                ", Estimated time " + Utils.readableElapsedTime(uploadTimeGuessMs));
    globalTimer.start();
    rptTimer.start();
    for (FileSystemRecord fsRec : fileSystemState.toList())
    {
        String reqKey = PathConverter.pathToKey(PathConverter.makeRelativePath(fileSystemState.getRootPath(), fsRec.getFullpath()));
        if (PathConverter.validate(reqKey))
        {
            LocalS3MetadataRecord s3Rec = s3Metadata.getRecord(reqKey);
            //TODO compare and deal with size mismatches. Maybe go and look at last-mod dates.
            if (s3Rec == null)
            {
                indvTimer.start();
                uploadFileToBucket(s3, syncParms.getS3Bucket(), fsRec.getFullpath(), reqKey);
                indvTimer.stop();
                ++totFilesUploaded;
                totSizeUploaded += fsRec.getSize();
                logOnly("Uploaded: Size=" + fsRec.getSize() + ", " + indvTimer.stopDeltaMs() + " ms, File=" + fsRec.getFullpath() + ", toKey=" + reqKey);
                if (totSizeUploaded > rptLastSize + rptSizeInterval)
                {
                    long invSizeUploaded = totSizeUploaded - rptLastSize;
                    long nowMs = rptTimer.intervalMs();
                    long invElapMs = nowMs - rptLastMs;
                    long remSize = reqSizeToUpload - totSizeUploaded;
                    double progressPct = (double)totSizeUploaded/reqSizeToUpload*100.0;
                    double mbps = (invElapMs > 0) ? invSizeUploaded/1e6/(invElapMs/1000.0) : 0.0;
                    long remMs = (long)((double)remSize/((double)invSizeUploaded/invElapMs));
                    printOnly("Progress: " + d2Fmt.format(progressPct) + "%, " + Utils.readableFileSize(totSizeUploaded) + " of " +
                              Utils.readableFileSize(reqSizeToUpload) + ", Rate " + d3Fmt.format(mbps) + " MB/s, " +
                              "Time rem " + Utils.readableElapsedTime(remMs));
                    rptLastMs = nowMs;
                    rptLastFilesUploaded = totFilesUploaded;
                    rptLastSize = totSizeUploaded;
                }
            }
        }
        else
        {
            ++totFilesSkipped;
            totSizeSkipped += fsRec.getSize();
            logOnly("Skipped (Invalid chars): Size=" + fsRec.getSize() + ", " + fsRec.getFullpath() + ", toKey=" + reqKey);
        }
    }
    globalTimer.stop();
    double mbps = 0.0;
    if (globalTimer.stopDeltaMs() > 0)
        mbps = totSizeUploaded/1e6/(globalTimer.stopDeltaMs()/1000.0);
    printAndLog("Actual upload: " + natFmt.format(totFilesUploaded) + " files, " + Utils.readableFileSize(totSizeUploaded) +
                ", Time " + Utils.readableElapsedTime(globalTimer.stopDeltaMs()) + ", Rate " + d3Fmt.format(mbps) + " MB/s");
    if (totFilesSkipped > 0)
        printAndLog("Skipped Files: " + natFmt.format(totFilesSkipped) + " files, " + Utils.readableFileSize(totSizeSkipped));
}

private void uploadFileToBucket(AmazonS3 amazonS3, String bucketName, String filePath, String fileKey)
{
    File inFile = new File(filePath);
    ObjectMetadata objectMetadata = new ObjectMetadata();
    objectMetadata.addUserMetadata(Const.LAST_MOD_KEY, Long.toString(inFile.lastModified()));
    objectMetadata.setLastModified(new Date(inFile.lastModified()));
    PutObjectRequest por = new PutObjectRequest(bucketName, fileKey, inFile).withMetadata(objectMetadata);
    // Amazon S3 never stores partial objects; if during this call an exception wasn't thrown, the entire object was stored.
    amazonS3.putObject(por);
}

I think you are looking in the right package: you should use the ExecutorService API.
This removes the burden of waiting for and watching your threads' notifications.
Example:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.Executors;

public class ExecutorEx {
    static class ThreadA implements Runnable {
        int id;

        public ThreadA(int id) {
            this.id = id;
        }

        public void run() {
            // To simulate some work
            try { Thread.sleep(Math.round(Math.random() * 100)); } catch (Exception e) {}
            // To show a message
            System.out.println(this.id + "--Test Message" + System.currentTimeMillis());
        }
    }

    public static void main(String args[]) throws Exception {
        int poolSize = 10;
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        int i = 0;
        while (i < 100) {
            pool.submit(new ThreadA(i));
            i++;
        }
        pool.shutdown();
        while (!pool.isTerminated()) {
            pool.awaitTermination(60, TimeUnit.SECONDS);
        }
    }
}
And if you want to return something from your threads, you will need to implement Callable instead of Runnable (call() instead of run()) and collect the returned values in an array of Future objects, which you can iterate over later.
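To sketch that Callable variant — and since the question asks for "OR" logic — java.util.concurrent also provides ExecutorCompletionService, whose take() hands back whichever Future finishes first, regardless of submission order. A minimal sketch (the Integer result and the simulated work are placeholders, not from the question's code):

import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DispatcherEx {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        // The completion service wraps the pool and queues up finished tasks.
        CompletionService<Integer> ecs = new ExecutorCompletionService<Integer>(pool);
        for (int i = 0; i < 100; i++) {
            final int id = i;
            ecs.submit(new Callable<Integer>() {
                public Integer call() throws Exception {
                    Thread.sleep(Math.round(Math.random() * 100)); // simulate some work
                    return id; // the worker's result
                }
            });
        }
        for (int i = 0; i < 100; i++) {
            // take() blocks until ANY submitted task finishes
            Future<Integer> done = ecs.take();
            System.out.println("Worker finished: " + done.get());
            // the dispatcher could gather results and submit a replacement task here
        }
        pool.shutdown();
    }
}

The second loop is the dispatcher: it wakes as soon as any worker completes, which is exactly the "OR" behavior a Thread.join() loop can't give you.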


How to write 1000 records per file, or wait to have more records to write, then break the file?

I am generating data of users with an auto-increment ID, then writing it to a file following these rules:
Name the file with the structure (FileCounter)_(StartID)_(EndID)
Maximum 1000 records per file
If there aren't 1000 records to write, wait at most 10s; if any were added, write them all to the file; otherwise write the remaining list (fewer than 1000) to the file; if there is still nothing to write after the wait, create an empty file named (FileCounter)_0_0
My approach is to use 2 threads: 1 thread to generate the data and push it to the queue, and 1 thread to take from the queue, add to a list, and then write the list to the file.
//Generate function
public void generatedata() {
    int capacity = 1678;
    synchronized (users) {
        for (int index = 0; index < capacity; index++) {
            users.add(generateUser());
            // notify the read thread
            users.notifyAll();
        }
    }
}
//Write function
public void writeToFile(ArrayList<User> u) {
    String fileName = "";
    if (!u.isEmpty()) {
        String filename = "" + (++FileCounter) + "_" + u.get(0).getId() + "_" +
                u.get(u.size() - 1).getId() + ".txt";
        try {
            FileWriter writer = new FileWriter(filename, true);
            for (User x : u) {
                System.out.println(x.toString());
                writer.write(x.getId() + " | " + x.getFormatedDate() + " | " +
                        x.getSex() + " | " + x.getPhoneNum().getPhoneNumber() + " | " +
                        x.getPhoneNum().getProvider() + "\r\n");
            }
            writer.close();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    } else {
        try {
            fileName = "" + (++FileCounter) + "_0_0.txt";
            File f = new File(fileName);
            f.createNewFile();
        } catch (IOException ex) {
            Logger.getLogger(UsersManager.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
//Read function
public ArrayList<User> ReadFromQueue(ArrayList<User> u) {
    while (true) {
        try {
            int size = users.size();
            if (users.isEmpty() && u.size() < 1000) {
                users.wait(10000);
                if (isChanged(size)) {
                    System.out.println("Size changed here");
                    u.add(users.take());
                }
                else return u;
            }
            if (u.size() == 1000) {
                System.out.println("Check the size is 1000");
                return u;
            }
            u.add(users.take());
        } catch (InterruptedException ex) {
            Logger.getLogger(UsersManager.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
It works fine when I run 1 thread to generate data and 1 thread to read and write the data to files, but when I use 2+ threads for each of the generate and write threads, there is 1 problem:
The list written in each file still has 1000 records as expected, but it is not sequential at all, only in ascending order.
My output is like:
1_2_1999.txt
2_1_2000.txt
3_2001_3000.txt
My expected output is like:
1_1_1000.txt
2_1001_2000.txt
....
Thanks in advance!
Using the thread approach is best for when you do not want to control the amount per file, but since you have a constraint of 1000 records, it's probably easier to use a counter:
public class DataReaderWriter {
    //keeps track of where you left off at, which row in source data.
    static int currentRowInSourceData = 0;

    public static void main(String[] args) {
        List<ContactRecord> contacts = getMoreData();
        writeRecords(contacts);
    }

    static void writeRecords(List<ContactRecord> contacts) {
        int maxRecords = currentRowInSourceData + 1000;
        for (int i = currentRowInSourceData; i < maxRecords; i++) {
            ContactRecord c = contacts.get(i);
            writeToFile(c);
            currentRowInSourceData++;
        }
    }
}
I had a project where I needed to create 90 second previews from larger MP4 files. What I did was to have multiple threads start up with access to a shared Queue of file names. Each thread consumes work from the Queue by using queue.poll().
Here is the Constructor:
public Worker(Queue<String> queue, String conferenceYear, CountDownLatch startSignal, CountDownLatch doneSignal) {
    this.queue = queue;
    this.startSignal = startSignal;
    this.doneSignal = doneSignal;
}
Then, as I said above, I keep polling for data:
public void run() {
    while (!queue.isEmpty()) {
        String fileName = queue.poll() + ".mp4";
        File f = new File("/home/ubuntu/preview_" + fileName);
        if (fileName != null && !f.exists()) {
            System.out.println("Processing File " + fileName + "....");
I started these threads in another class called WorkLoad:
public static void main(String[] args) {
    long startTime = System.currentTimeMillis();
    BlockingQueue<String> filesToDownload = new LinkedBlockingDeque<String>(1024);
    BlockingQueue<String> filesToPreview = new LinkedBlockingDeque<String>(1024);
    BlockingQueue<String> filesToUpload = new LinkedBlockingDeque<String>(1024);
    for (int x = 0; x < NUMBER_OF_THREADS; x++) {
        workers[x] = new Thread(new Worker(filesToPreview, currentYear, startSignal, doneSignal));
        workers[x].start();
    }
In your specific case, you could provide each thread its own file name, or a handle on a file. If you want the file names and entries in a chronological sequence, then just start 2 threads: 1 for acquiring data and placing it on a queue, with a barrier/limit of 1000 records, and the other thread as a consumer (a sketch of that arrangement follows below).
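A rough sketch of that two-thread arrangement, using a BlockingQueue so no manual wait()/notify() is needed. User, generateUser() and writeToFile() stand in for the asker's own types, and the 1000-record and 10-second rules come from the question above:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BatchWriter {
    private static final BlockingQueue<User> queue = new LinkedBlockingQueue<>();

    public static void main(String[] args) {
        new Thread(BatchWriter::produce).start();
        new Thread(BatchWriter::consume).start();
    }

    static void produce() {
        for (int i = 0; i < 1678; i++) {
            queue.add(generateUser()); // no notify needed; the queue handles signaling
        }
    }

    static void consume() {
        List<User> batch = new ArrayList<>();
        while (true) { // a real implementation needs a stop condition
            try {
                // wait up to 10 seconds for the next record
                User u = queue.poll(10, TimeUnit.SECONDS);
                if (u == null) {            // nothing arrived within 10s:
                    writeToFile(batch);     // flush what we have (empty -> the _0_0 file)
                    batch = new ArrayList<>();
                } else {
                    batch.add(u);
                    if (batch.size() == 1000) { // full batch: write and start over
                        writeToFile(batch);
                        batch = new ArrayList<>();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Placeholders for the asker's own code:
    static User generateUser() { return new User(); }
    static void writeToFile(List<User> batch) { /* see writeToFile above */ }
    static class User {}
}

With a single consumer draining the queue, the records keep their generation order, which is why a chronological file sequence needs exactly one producer/consumer pair.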
The original code creates multiple threads. I am able to create 90 second snippets from over 1000 MP4 videos in about 30 minutes.
Here I am creating a thread per processor; I usually end up with at least 4 threads on my AWS EC2 instance:
/**
 * Here we can find out how many cores we have.
 * Then make the number of threads NUMBER_OF_THREADS = the number of cores.
 */
NUMBER_OF_THREADS = Runtime.getRuntime().availableProcessors();
System.out.println("Thread Count: " + NUMBER_OF_THREADS);
for (int x = 0; x < NUMBER_OF_THREADS; x++) {
    workers[x] = new Thread(new MyClass(param1, param2));
    workers[x].start();
}

How to find the method which calls the thread pool

I have a bug in a production application, but I can't find the cause of it. I am trying to get a log to find the method which calls my method(). But because I use a thread pool, I can't just call Thread.currentThread().getStackTrace() and iterate through the StackTraceElements; it shows only some lines before the thread pool.
If I use the following code, I'll get every method which I need, but it is so expensive. A single call of method() costs 400+ KB in a text file in my test environment. In production it would be about 1 MB per second, I think.
private final ExecutorService threadPool =
        new ThreadPoolExecutor(10, 2000, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());

public void firstMethod() {
    secondMethod();
}

private CompletableFuture<Void> secondMethod() {
    return CompletableFuture.runAsync(() -> method(), threadPool);
}

void method() {
    Map<Thread, StackTraceElement[]> map = Thread.getAllStackTraces();
    for (Thread thread : map.keySet()) {
        printLog(thread);
    }
}
private void printLog(Thread thread) {
    StringBuilder builder = new StringBuilder();
    for (StackTraceElement s : thread.getStackTrace()) {
        builder.append("\n getClass = " + s.getClass());
        builder.append("\n getClassName = " + s.getClassName());
        builder.append("\n getFileName = " + s.getFileName());
        builder.append("\n getLineNumber = " + s.getLineNumber());
        builder.append("\n getMethodName = " + s.getMethodName());
        builder.append("\n ---------------------------- \n ");
    }
    ownLogger.info("SomeThread = {} ", builder);
}
How can I find the firstMethod() that calls secondMethod()?
As I haven't found any good solution, my own is to put a logger before and after the CompletableFuture call.
It looks like this:
Logger beforeAsync = LoggerFactory.getLogger("beforeAsync");
Logger afterAsync = LoggerFactory.getLogger("afterAsync");

private CompletableFuture<Void> secondMethod() {
    printLongerTrace(Thread.currentThread(), beforeAsync);
    return CompletableFuture.runAsync(() -> method(), threadPool);
}

private void methodWithException() {
    try {
        //do something
    } catch (Exception e) {
        printLongerTrace(e, "methodWithException", afterAsync);
    }
}

public void printLongerTrace(Throwable t, String methodName, Logger ownlogger) {
    if (t.getCause() != null) {
        printLongerTrace(t.getCause(), methodName, ownlogger);
    }
    StringBuilder builder = new StringBuilder();
    builder.append("\n Thread = " + Thread.currentThread().getName());
    builder.append("ERROR CAUSE = " + t.getCause() + "\n");
    builder.append("ERROR MESSAGE = " + t.getMessage() + "\n");
    printLog(t.getStackTrace(), builder);
    ownlogger.info(methodName + "Trace ----- {}", builder);
}

public void printLongerTrace(Thread t, Logger ownlogger) {
    StringBuilder builder = new StringBuilder();
    builder.append("\n Thread = " + Thread.currentThread().getName());
    printLog(t.getStackTrace(), builder);
    ownlogger.info("Trace ----- {}", builder);
}

private StringBuilder printLog(StackTraceElement[] elements, StringBuilder builder) {
    int size = elements.length > 15 ? 15 : elements.length;
    for (int i = 0; i < size; i++) {
        builder.append("Line " + i + " = " + elements[i] + " with method = " + elements[i].getMethodName() + "\n");
    }
    return builder;
}
printLongerTrace(Throwable t, String methodName, Logger ownlogger) prints the exception with every cause, recursively.
printLongerTrace(Thread t, Logger ownlogger) prints which method was called before the CompletableFuture.
Just dump the stack by calling Thread.dumpStack(), but this is only for debugging and has a big overhead, since dumping the stack is CPU intensive.
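For reference, a minimal sketch of that call (Thread.dumpStack() just prints the current thread's stack trace to standard error):

public class DumpStackEx {
    static void firstMethod() {
        secondMethod();
    }

    static void secondMethod() {
        // Prints the current thread's stack trace to System.err,
        // including firstMethod() as the caller.
        Thread.dumpStack();
    }

    public static void main(String[] args) {
        firstMethod();
    }
}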

Amazon S3 AWS upload progress listener

Does anyone know how to see the progress (in percent) of an upload in a multipart upload in Amazon S3?
I would do it like this:
MultipleFileUpload transfer = transferManager.uploadDirectory(mybucket, null, new File(localSourceDataFilesPath), false);
// blocks the thread until the upload is completed
showTransferProgress(transfer);
Then in showTransferProgress, I would block using a sleep, and do the math every X seconds:
private void showTransferProgress(MultipleFileUpload xfer) {
    while (!xfer.isDone()) {
        // some logic to wait so you don't do the math every cycle, e.g. a Thread.sleep
        TransferProgress progress = xfer.getProgress();
        long bytesTransferred = progress.getBytesTransferred();
        long total = progress.getTotalBytesToTransfer();
        Double percentDone = progress.getPercentTransferred();
        LOG.debug("S3 xml upload progress...{}%", percentDone.intValue());
        LOG.debug("{} bytes transferred to S3 out of {}", bytesTransferred, total);
    }
    // print the final state of the transfer.
    TransferState xferState = xfer.getState();
    LOG.debug("Final transfer state: " + xferState);
}
this line is what you are looking for:
Double percentDone = progress.getPercentTransferred();
Here is my final version:
private void awsHoldUntilCompletedAndShowTransferProgress(Upload upload) throws InterruptedException, AmazonClientException {
    TransferProgress tProgress = upload.getProgress();
    long totalSize = tProgress.getTotalBytesToTransfer();
    long bPrevious = 0;
    int timerSec = Math.toIntExact(Math.round(Math.sqrt(totalSize / 1024 / 1024) / 4));// calculate based on file size
    if (timerSec < 1) {// guard for small files; avoids division by zero below
        timerSec = 1;
    } else if (timerSec > 60) {// no longer than 60 seconds per loop
        timerSec = 60;
    }
    while (!upload.isDone()) {
        long bTransferred = tProgress.getBytesTransferred();
        String strMbps = Double.valueOf((((bTransferred - bPrevious) / timerSec) / 1024) / 1024).toString() + " MBps";
        String strTransfered = bTransferred + " bytes";
        if (bTransferred > 1024) {
            strTransfered = Double.valueOf((bTransferred / 1024) / 1024).toString() + " MB";
        }
        log.info("Upload progress: "
                + strTransfered
                + " / "
                + FileUtils.byteCountToDisplaySize(totalSize) + " - "
                + Math.round(tProgress.getPercentTransferred()) + "% "
                + strMbps);
        bPrevious = bTransferred;
        TimeUnit.SECONDS.sleep(timerSec);
    }
    Transfer.TransferState transferState = upload.getState();
    log.info("Final transfer state: " + transferState);
    if (transferState == Transfer.TransferState.Failed) {
        throw upload.waitForException();
    }
}
And here is where I call the code above from:
..stuff...
TransferManager tm = TransferManagerBuilder
        .standard()
        .withS3Client(s3Client)
        .build();
LocalDateTime uploadStartedAt = LocalDateTime.now();
log.info("Starting to upload " + FileUtils.byteCountToDisplaySize(fileSize));
Upload up = tm.upload(bucketName, file.getName(), file);
awsHoldUntilCompletedAndShowTransferProgress(up);
log.info("Time consumed: " + DurationFormatUtils.formatDuration(Duration.between(uploadStartedAt, LocalDateTime.now()).toMillis(), "dd HH:mm:ss"));
This works, but there may be a better way to do it:
TransferManager tm = TransferManagerBuilder
        .standard()
        .withS3Client(s3Client)
        .build();
PutObjectRequest request = new PutObjectRequest(bucketName, file.getName(), file);
// bytesUploaded, byteTrigger, sizeRatio and fileSize are fields declared elsewhere in the class
ProgressListener progressListener = new ProgressListener() {
    @Override
    public void progressChanged(com.amazonaws.event.ProgressEvent progressEvent) {
        bytesUploaded += progressEvent.getBytesTransferred();// add to the counter
        if (bytesUploaded > byteTrigger) {
            if ((bytesUploaded + sizeRatio) < fileSize) {
                byteTrigger = bytesUploaded + sizeRatio;
            } else {
                byteTrigger = bytesUploaded + (sizeRatio / 6);// increase precision approaching the end
            }
            String percent = new DecimalFormat("###.##").format(bytesUploaded * 100.0 / fileSize);
            log.info("Uploaded: " + FileUtils.byteCountToDisplaySize(bytesUploaded) + " - " + percent + "%");
        }
    }
};
request.setGeneralProgressListener(progressListener);
Upload upload = tm.upload(request);
log.info("starting upload");
upload.waitForUploadResult();
log.info("Upload completed");

Looping Main Method while Writing Output Data to TXT File

So, I am working on a Java project concerned with genetic algorithms.
I have a main method that calls a function (let's call it function 1) that calculates through until the specified iterations. I wanted to run the main method 100 times and collect the data, so I decided to use a FileWriter inside function 1, which I call in my main method.
public static int Runcnt = 0;
static int o = 0;
public static File statText = new File("C:\\Users\\ShadyAF\\Desktop\\StatTest.txt");

public static void main(String[] args) {
    while (Runcnt <= 100) {
        final long startTime = System.nanoTime();
        MainAlgorithm mA = new MainAlgorithm("config.xml");
        mA.cMA();
        final long duration = System.nanoTime() - startTime;
        System.out.println(duration / 1000000000 + " seconds");
        o = 0;
    }
The above snippet of code is the main method that I'm trying to loop (function 1).
System.out.println("best = "+Main.indx+" = "+Main.val);
System.out.println("max_cnt: " + Main.max_cnt);
try {
FileOutputStream is = new FileOutputStream(Main.statText);
OutputStreamWriter osw = new OutputStreamWriter(is);
Writer w = new BufferedWriter(osw);
w.write("#" + Main.Runcnt + " Best#: " + Main.indx + " BestScore: " + Main.val + " MaxCnt: " + Main.max_cnt + "\n");
w.close();
} catch (IOException e) {
System.err.println("Problem writing to file.");
}
The above snippet of code is from the mA.cMA() function that is inside the main loop.
I ran the code for a while, and it appears that the program writes to the file only for the first loop and does not do anything for the rest of the loops.
Any help is much appreciated!
Edit: Why am I getting downvoted? At least leave a helpful comment :/
You should change your pattern from scratch: new FileOutputStream(Main.statText) opens the file without the append flag, so every iteration truncates it and at most one record survives. Anyway, you can try something like this in your Main:
public static Path pathFile = Paths.get("C:\\Users\\..blah..\\stats.txt");
Then use in your loop
try {
    String log = "#" + Main.Runcnt + " Best#: " + Main.indx + " BestScore: " + Main.val + " MaxCnt: " + Main.max_cnt + "\n";
    Files.write(Main.pathFile, log.getBytes(), Files.exists(Main.pathFile) ? StandardOpenOption.APPEND : StandardOpenOption.CREATE);
} catch (IOException e) {
    // exception handling
}
It is not very efficient, in particular with a lot of records, but the whole code you wrote needs strong refactoring anyway :)

DynamoDB Parallel Scan - Java Synchronization

I'm trying to use the DynamoDB Parallel Scan Example:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LowLevelJavaScanning.html
I have 200,000 items, and I've taken the sequential scan code and modified it slightly for my usage:
Map<String, AttributeValue> lastKeyEvaluated = null;
do
{
    ScanRequest scanRequest = new ScanRequest()
        .withTableName(tableName)
        .withExclusiveStartKey(lastKeyEvaluated);
    ScanResult result = client.scan(scanRequest);
    double counter = 0;
    for (Map<String, AttributeValue> item : result.getItems())
    {
        itemSerialize.add("Set:" + counter);
        for (Map.Entry<String, AttributeValue> getItem : item.entrySet())
        {
            String attributeName = getItem.getKey();
            AttributeValue value = getItem.getValue();
            itemSerialize.add(attributeName
                + (value.getS() == null ? "" : ":" + value.getS())
                + (value.getN() == null ? "" : ":" + value.getN())
                + (value.getB() == null ? "" : ":" + value.getB())
                + (value.getSS() == null ? "" : ":" + value.getSS())
                + (value.getNS() == null ? "" : ":" + value.getNS())
                + (value.getBS() == null ? "" : ":" + value.getBS()));
        }
        counter += 1;
    }
    lastKeyEvaluated = result.getLastEvaluatedKey();
}
while (lastKeyEvaluated != null);
The counter gives exactly 200,000 when this code has finished. However, I also wanted to try the parallel scan.
Function Call:
ScanSegmentTask task = null;
ArrayList<String> list = new ArrayList<String>();
try
{
    ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
    int totalSegments = numberOfThreads;
    for (int segment = 0; segment < totalSegments; segment++)
    {
        // Runnable task that will only scan one segment
        task = new ScanSegmentTask(tableName, itemLimit, totalSegments, segment, list);
        // Execute the task
        executor.execute(task);
    }
    shutDownExecutorService(executor);
}
.......Catches something if error
return list;
Class:
I have a static list that is shared by all the threads. I was able to retrieve the lists and output the amount of data.
// Runnable task for scanning a single segment of a DynamoDB table
private static class ScanSegmentTask implements Runnable
{
    // DynamoDB table to scan
    private String tableName;
    // Number of items each scan request should return
    private int itemLimit;
    // Total number of segments
    // Equals the total number of threads scanning the table in parallel
    private int totalSegments;
    // Segment that will be scanned by this task
    private int segment;
    static ArrayList<String> list_2;
    Object lock = new Object();

    public ScanSegmentTask(String tableName, int itemLimit, int totalSegments, int segment, ArrayList<String> list)
    {
        this.tableName = tableName;
        this.itemLimit = itemLimit;
        this.totalSegments = totalSegments;
        this.segment = segment;
        list_2 = list;
    }

    public void run()
    {
        System.out.println("Scanning " + tableName + " segment " + segment + " out of " + totalSegments + " segments " + itemLimit + " items at a time...");
        Map<String, AttributeValue> exclusiveStartKey = null;
        int totalScannedItemCount = 0;
        int totalScanRequestCount = 0;
        int counter = 0;
        try
        {
            while (true)
            {
                ScanRequest scanRequest = new ScanRequest()
                    .withTableName(tableName)
                    .withLimit(itemLimit)
                    .withExclusiveStartKey(exclusiveStartKey)
                    .withTotalSegments(totalSegments)
                    .withSegment(segment);
                ScanResult result = client.scan(scanRequest);
                totalScanRequestCount++;
                totalScannedItemCount += result.getScannedCount();
                synchronized (lock)
                {
                    for (Map<String, AttributeValue> item : result.getItems())
                    {
                        list_2.add("Set:" + counter);
                        for (Map.Entry<String, AttributeValue> getItem : item.entrySet())
                        {
                            String attributeName = getItem.getKey();
                            AttributeValue value = getItem.getValue();
                            list_2.add(attributeName
                                + (value.getS() == null ? "" : ":" + value.getS())
                                + (value.getN() == null ? "" : ":" + value.getN())
                                + (value.getB() == null ? "" : ":" + value.getB())
                                + (value.getSS() == null ? "" : ":" + value.getSS())
                                + (value.getNS() == null ? "" : ":" + value.getNS())
                                + (value.getBS() == null ? "" : ":" + value.getBS()));
                        }
                        counter += 1;
                    }
                }
                exclusiveStartKey = result.getLastEvaluatedKey();
                if (exclusiveStartKey == null)
                {
                    break;
                }
            }
        }
        catch (AmazonServiceException ase)
        {
            System.err.println(ase.getMessage());
        }
        finally
        {
            System.out.println("Scanned " + totalScannedItemCount + " items from segment " + segment + " out of " + totalSegments + " of " + tableName + " with " + totalScanRequestCount + " scan requests");
        }
    }
}
Executor Service Shut Down:
public static void shutDownExecutorService(ExecutorService executor)
{
    executor.shutdown();
    try
    {
        if (!executor.awaitTermination(10, TimeUnit.SECONDS))
        {
            executor.shutdownNow();
        }
    }
    catch (InterruptedException e)
    {
        executor.shutdownNow();
        Thread.currentThread().interrupt();
    }
}
However, the number of items changes every time I run this piece of code (it varies around 60,000 in total, 6,000 per thread, with 10 created threads). Removing the synchronization does not change the result either.
Is there a bug in the synchronization, or in the Amazon AWS API?
Thanks all
EDIT:
The new function call:
ScanSegmentTask task = null;
ArrayList<String> list = new ArrayList<String>();
try
{
    ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
    int totalSegments = numberOfThreads;
    for (int segment = 0; segment < totalSegments; segment++)
    {
        // Callable task that will only scan one segment
        task = new ScanSegmentTask(tableName, itemLimit, totalSegments, segment);
        // Submit the task
        Future<ArrayList<String>> future = executor.submit(task);
        list.addAll(future.get());
    }
    shutDownExecutorService(executor);
}
The new class:
// Callable task for scanning a single segment of a DynamoDB table
private static class ScanSegmentTask implements Callable<ArrayList<String>>
{
    // DynamoDB table to scan
    private String tableName;
    // Number of items each scan request should return
    private int itemLimit;
    // Total number of segments
    // Equals the total number of threads scanning the table in parallel
    private int totalSegments;
    // Segment that will be scanned by this task
    private int segment;
    ArrayList<String> list_2 = new ArrayList<String>();
    static int counter = 0;

    public ScanSegmentTask(String tableName, int itemLimit, int totalSegments, int segment)
    {
        this.tableName = tableName;
        this.itemLimit = itemLimit;
        this.totalSegments = totalSegments;
        this.segment = segment;
    }

    @SuppressWarnings("finally")
    public ArrayList<String> call()
    {
        System.out.println("Scanning " + tableName + " segment " + segment + " out of " + totalSegments + " segments " + itemLimit + " items at a time...");
        Map<String, AttributeValue> exclusiveStartKey = null;
        try
        {
            while (true)
            {
                ScanRequest scanRequest = new ScanRequest()
                    .withTableName(tableName)
                    .withLimit(itemLimit)
                    .withExclusiveStartKey(exclusiveStartKey)
                    .withTotalSegments(totalSegments)
                    .withSegment(segment);
                ScanResult result = client.scan(scanRequest);
                for (Map<String, AttributeValue> item : result.getItems())
                {
                    list_2.add("Set:" + counter);
                    for (Map.Entry<String, AttributeValue> getItem : item.entrySet())
                    {
                        String attributeName = getItem.getKey();
                        AttributeValue value = getItem.getValue();
                        list_2.add(attributeName
                            + (value.getS() == null ? "" : ":" + value.getS())
                            + (value.getN() == null ? "" : ":" + value.getN())
                            + (value.getB() == null ? "" : ":" + value.getB())
                            + (value.getSS() == null ? "" : ":" + value.getSS())
                            + (value.getNS() == null ? "" : ":" + value.getNS())
                            + (value.getBS() == null ? "" : ":" + value.getBS()));
                    }
                    counter += 1;
                }
                exclusiveStartKey = result.getLastEvaluatedKey();
                if (exclusiveStartKey == null)
                {
                    break;
                }
            }
        }
        catch (AmazonServiceException ase)
        {
            System.err.println(ase.getMessage());
        }
        finally
        {
            return list_2;
        }
    }
}
Final EDIT:
Function Call:
ScanSegmentTask task = null;
ArrayList<String> list = new ArrayList<String>();
ArrayList<Future<ArrayList<String>>> holdFuture = new ArrayList<Future<ArrayList<String>>>();
try
{
    ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
    int totalSegments = numberOfThreads;
    for (int segment = 0; segment < totalSegments; segment++)
    {
        // Callable task that will only scan one segment
        task = new ScanSegmentTask(tableName, itemLimit, totalSegments, segment);
        // Submit the task and hold on to its Future
        Future<ArrayList<String>> future = executor.submit(task);
        holdFuture.add(future);
    }
    for (int i = 0; i < holdFuture.size(); i++)
    {
        boolean flag = false;
        while (flag == false)
        {
            Thread.sleep(1000);
            if (holdFuture.get(i).isDone())
            {
                list.addAll(holdFuture.get(i).get());
                flag = true;
            }
        }
    }
    shutDownExecutorService(executor);
}
Class:
private static class ScanSegmentTask implements Callable<ArrayList<String>>
{
    // DynamoDB table to scan
    private String tableName;
    // Number of items each scan request should return
    private int itemLimit;
    // Total number of segments
    // Equals the total number of threads scanning the table in parallel
    private int totalSegments;
    // Segment that will be scanned by this task
    private int segment;
    ArrayList<String> list_2 = new ArrayList<String>();
    static AtomicInteger counter = new AtomicInteger(0);

    public ScanSegmentTask(String tableName, int itemLimit, int totalSegments, int segment)
    {
        this.tableName = tableName;
        this.itemLimit = itemLimit;
        this.totalSegments = totalSegments;
        this.segment = segment;
    }

    @SuppressWarnings("finally")
    public ArrayList<String> call()
    {
        System.out.println("Scanning " + tableName + " segment " + segment + " out of " + totalSegments + " segments " + itemLimit + " items at a time...");
        Map<String, AttributeValue> exclusiveStartKey = null;
        try
        {
            while (true)
            {
                ScanRequest scanRequest = new ScanRequest()
                    .withTableName(tableName)
                    .withLimit(itemLimit)
                    .withExclusiveStartKey(exclusiveStartKey)
                    .withTotalSegments(totalSegments)
                    .withSegment(segment);
                ScanResult result = client.scan(scanRequest);
                for (Map<String, AttributeValue> item : result.getItems())
                {
                    list_2.add("Set:" + counter);
                    for (Map.Entry<String, AttributeValue> getItem : item.entrySet())
                    {
                        String attributeName = getItem.getKey();
                        AttributeValue value = getItem.getValue();
                        list_2.add(attributeName
                            + (value.getS() == null ? "" : ":" + value.getS())
                            + (value.getN() == null ? "" : ":" + value.getN())
                            + (value.getB() == null ? "" : ":" + value.getB())
                            + (value.getSS() == null ? "" : ":" + value.getSS())
                            + (value.getNS() == null ? "" : ":" + value.getNS())
                            + (value.getBS() == null ? "" : ":" + value.getBS()));
                    }
                    counter.addAndGet(1);
                }
                exclusiveStartKey = result.getLastEvaluatedKey();
                if (exclusiveStartKey == null)
                {
                    break;
                }
            }
        }
        catch (AmazonServiceException ase)
        {
            System.err.println(ase.getMessage());
        }
        finally
        {
            return list_2;
        }
    }
}
OK, I believe the issue is in the way you synchronized.
In your case, your lock is pretty much pointless, as each thread has its own lock, so synchronizing never actually blocks one thread from running the same piece of code. I believe this is the reason that removing synchronization does not change the result -- it never had an effect in the first place.
I believe your issue is in fact due to the static ArrayList<String> that's shared by your threads. ArrayList is not thread-safe, so operations on it are not guaranteed to succeed; as a result, you have to synchronize operations on it. Without proper synchronization, two threads could each add something to an empty ArrayList, yet the resulting ArrayList could have a size of 1! (or at least, if my memory hasn't failed me; I believe this is the case for non-thread-safe objects)
As I said before, while you do have a synchronized block, it really isn't doing anything. You could synchronize on list_2, but all that would do is effectively make all your threads run in sequence, as the lock on the ArrayList wouldn't be released until one of your threads was done.
There are a few solutions to this. You can use Collections.synchronizedList(list_2) to create a synchronized wrapper for your ArrayList. This way, adding to the list is guaranteed to succeed. However, this imposes a synchronization cost per operation, and so isn't ideal; both ideas are sketched below.
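Two small sketches of the points above, reusing the names from the question (a lock only guards anything if all threads share it, and the synchronized wrapper is a one-line change):

// A per-task lock guards nothing; a shared lock would have to be static:
static final Object lock = new Object();

// Or wrap the ArrayList once and share the wrapper; add() is then thread-safe.
// (Iterating over the wrapper still requires manual synchronization on it.)
static List<String> list_2 = Collections.synchronizedList(new ArrayList<String>());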
What I would do is actually have ScanSegmentTask implement Callable (technically Callable<ArrayList<String>>). The Callable interface is almost exactly like the Runnable interface, except its method is call(), which returns a value.
Why is this important? I think that what would produce the best results for you is this:
Make list_2 an instance variable, initialized to a blank list
Have each thread add to this list exactly as you have done
Return list_2 when you are done
Concatenate each resulting ArrayList<String> to the original ArrayList using addAll()
This way, you have no synchronization overhead to deal with!
This will require a few changes to your executor code. Instead of calling execute(), you'll need to call submit(). This returns a Future object (Future<ArrayList<String>> in your case) that holds the results of the call() method. You'll need to store this into some collection -- an array, ArrayList, doesn't matter.
To retrieve the results, simply loop through the collection of Future objects and call get() (I think). This call will block until the thread that the Future object corresponds to is complete.
I think that's it. While this is more complicated, I think this is the best performance you're going to get, as with enough threads either CPU contention or your network link will become the bottleneck. Please ask if you have any questions, and I'll update as needed.
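For what it's worth, a condensed sketch of that retrieval loop, using the names from the code above. Future.get() already blocks until the task completes, so the sleep/isDone polling in the final edit isn't needed:

// Submit all segments, keeping the Futures in submission order.
List<Future<ArrayList<String>>> futures = new ArrayList<>();
for (int segment = 0; segment < totalSegments; segment++) {
    futures.add(executor.submit(new ScanSegmentTask(tableName, itemLimit, totalSegments, segment)));
}
// get() blocks until each segment's scan is done; no polling required.
ArrayList<String> list = new ArrayList<>();
for (Future<ArrayList<String>> f : futures) {
    list.addAll(f.get());
}
shutDownExecutorService(executor);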
