I receive about 20k small XML files (1 KB to 3 KB each) per minute.
I have to write each file to S3 as it arrives in the directory.
Sometimes the incoming rate rises to 100k files per minute.
Is there anything in Java or the AWS API that can help me keep up with the incoming rate?
I am using the uploadFileList() API to upload all the files.
I have also tried a watch event, so that whenever a file arrives in the folder it is uploaded to S3, but that is slow compared to the incoming rate and creates a huge backlog.
I have tried multithreading as well, but if I spin up more threads, S3 returns a "reduce your request rate" error.
Sometimes I also get the error below:
com.amazonaws.services.s3.model.AmazonS3Exception: Your socket
connection to the server was not read from or written to within the
timeout period. Idle connections will be closed.
But when I don't use threading, I do not get this error.
Another approach I tried is to combine the files into one big file, upload it to S3, and then split it back into small files in S3. This works, but it delays the arrival of the files in S3 and affects the users who access them there.
I know that uploading many small files to S3 is not ideal, but that is my use case.
The speed I observe is about 5k files uploaded per minute.
Can someone please suggest an alternative so that my upload speed increases to at least 15k files per minute?
I am sharing my full code, where I try to upload using a multithreaded application.
Class one, where I collect the files to hand to the threads:
public class FileProcessThreads {
    public ArrayList<File> process(String fileLocation) {
        File dir = new File(fileLocation);
        File[] directoryListing = dir.listFiles();
        ArrayList<File> files = new ArrayList<File>();
        // listFiles() returns null if the path is not a readable directory
        if (directoryListing != null && directoryListing.length > 0) {
            for (File path : directoryListing) {
                files.add(path);
            }
        }
        return files;
    }
}
Class two, where I create the thread pool and executor:
public class UploadExecutor {
    private static String fileLocation = "C:\\Users\\u6034690\\Desktop\\ONEFILE";
    // private static String fileLocation="D:\\TRFAudits_Moved\\";
    private static final String _logFileName = "s3FileUploader.log";
    private static Logger _logger = Logger.getLogger(UploadExecutor.class);

    public static void main(String[] args) {
        _logger.info("----------Starting application's main method----------------- ");
        AWSCredentials credential = new ProfileCredentialsProvider("TRFAuditability-Prod-ServiceUser").getCredentials();
        final ClientConfiguration config = new ClientConfiguration();
        AmazonS3Client s3Client = (AmazonS3Client) AmazonS3ClientBuilder.standard().withRegion("us-east-1")
                .withCredentials(new AWSStaticCredentialsProvider(credential)).withForceGlobalBucketAccessEnabled(true)
                .build();
        TransferManager tm = new TransferManager(s3Client);
        while (true) {
            FileProcessThreads fp = new FileProcessThreads();
            List<File> records = fp.process(fileLocation);
            while (records.size() <= 0) {
                try {
                    _logger.info("No records found, will wait for 10 seconds");
                    Thread.sleep(10000);
                    records = fp.process(fileLocation);
                } catch (InterruptedException e) {
                    _logger.error("InterruptedException: " + e.toString());
                }
            }
            _logger.info("Total no of Audit files = " + records.size());
            ExecutorService es = Executors.newFixedThreadPool(2);
            int recordsInEachThread = (int) (records.size() / 2);
            _logger.info("No of records in each thread = " + recordsInEachThread);
            UploadObject my1 = new UploadObject(records.subList(0, recordsInEachThread), tm);
            UploadObject my2 = new UploadObject(records.subList(recordsInEachThread, records.size()), tm);
            // Run both halves, then stop accepting new tasks so awaitTermination can return
            es.execute(my1);
            es.execute(my2);
            es.shutdown();
            try {
                boolean finshed = es.awaitTermination(1, TimeUnit.MINUTES);
                if (!finshed) {
                    // the pool did not finish within one minute
                }
            } catch (InterruptedException e) {
                _logger.error("InterruptedException: " + e.toString());
            }
        }
    }
}
Last class, where I upload the files to S3:
public class UploadObject implements Runnable {
    static String bucketName = "a205381-auditxml/S3UPLOADER";
    private String fileLocation = "C:\\Users\\u6034690\\Desktop\\ONEFILE";
    //private String fileLocation="D:\\TRFAudits\\";
    //static String bucketName = "a205381-auditxml/S3UPLOADER";
    private static Logger _logger;
    List<File> records;
    TransferManager tm;

    UploadObject(List<File> list, TransferManager tm) {
        this.records = list;
        this.tm = tm;
        _logger = Logger.getLogger(UploadObject.class);
    }

    public void run() {
        System.out.println(Thread.currentThread().getName() + " : ");
        uploadToToS3();
    }

    public void uploadToToS3() {
        _logger.info("Number of record to be processed in current thread: : " + records.size());
        MultipleFileUpload xfer = tm.uploadFileList(bucketName, "TEST", new File(fileLocation), records);
        try {
            // Block until every file in this batch has been transferred
            xfer.waitForCompletion();
            TransferState xfer_state = xfer.getState();
            _logger.info("Upload status -----------------" + xfer_state);
            // Remove the local copies once the batch is uploaded
            for (File file : records) {
                try {
                    Files.delete(file.toPath());
                } catch (IOException e) {
                    _logger.error("IOException: " + e.toString());
                }
            }
            _logger.info("Successfully completed file cleanse");
        } catch (AmazonServiceException e) {
            _logger.error("AmazonServiceException: " + e.toString());
        } catch (AmazonClientException e) {
            _logger.error("AmazonClientException: " + e.toString());
        } catch (InterruptedException e) {
            _logger.error("InterruptedException: " + e.toString());
        }
        _logger.info("Upload completed");
        _logger.info("Calling Transfer manager shutdown");
    }
}
It sounds like you're tripping the built-in protections for S3 (quoted docs below). I've also listed some similar questions below; some of these advise rearchitecting using SQS to even out and distribute the load on S3.
Aside from introducing more moving pieces, you can reuse your S3Client and TransferManager. Move them up out of your runnable object and pass them into its constructor. TransferManager itself uses multithreading according to the javadoc.
When possible, TransferManager attempts to use multiple threads to upload multiple parts of a single upload at once. When dealing with large content sizes and high bandwidth, this can have a significant increase on throughput.
You can also increase the max number of simultaneous connections that the S3Client uses.
s3Client.getClientConfiguration().setMaxConnections(75) or even higher.
Lastly, you could try to upload to different prefixes/folders under the bucket, as noted below to allow scaling for high request rates.
The current AWS Request Rate and Performance Guidelines
Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/POST/DELETE and 5,500 GET requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. It is simple to increase your read or write performance exponentially. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second.
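Since the limits quoted above apply per prefix, one way to use them is to derive a prefix from each file name so that uploads are spread over several prefixes. This is only a minimal sketch using the JDK: the class name `PrefixSharder`, the two-digit prefix layout, and the shard count of 10 are my own assumptions, and readers of the bucket would have to compute the prefix the same way.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: spread object keys over a fixed number of key prefixes (hypothetical helper). */
final class PrefixSharder {
    private final int shards;

    PrefixSharder(int shards) {
        if (shards <= 0) {
            throw new IllegalArgumentException("shards must be positive");
        }
        this.shards = shards;
    }

    /** Builds a key of the form "NN/fileName"; the prefix depends only on the file name. */
    String keyFor(String fileName) {
        // floorMod keeps the shard index non-negative even for negative hash codes
        int shard = Math.floorMod(fileName.hashCode(), shards);
        return String.format("%02d/%s", shard, fileName);
    }

    public static void main(String[] args) {
        PrefixSharder sharder = new PrefixSharder(10);
        Map<String, Integer> perPrefix = new HashMap<>();
        for (int i = 0; i < 20000; i++) {
            String key = sharder.keyFor("audit-" + i + ".xml");
            perPrefix.merge(key.substring(0, 2), 1, Integer::sum);
        }
        // Prints how many of the 20,000 keys landed under each prefix
        System.out.println(perPrefix);
    }
}
```

The computed key (or just its prefix part) would then be used as the object key when uploading, instead of one fixed prefix for every file.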
The current AWS S3 Error Best Practices
Tune Application for Repeated SlowDown errors
As with any distributed system, S3 has protection mechanisms which detect intentional or unintentional resource over-consumption and react accordingly. SlowDown errors can occur when a high request rate triggers one of these mechanisms. Reducing your request rate will decrease or eliminate errors of this type. Generally speaking, most users will not experience these errors regularly; however, if you would like more information or are experiencing high or unexpected SlowDown errors, please post to our Amazon S3 developer forum https://forums.aws.amazon.com/ or sign up for AWS Premium Support https://aws.amazon.com/premiumsupport/.
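If you still hit SlowDown errors, the usual way to "reduce your request rate" on the client side is to retry with exponential backoff and jitter. Here is a rough sketch using only the JDK; `BackoffRetry` and `callWithBackoff` are names I made up, and treating every exception as retryable is a simplifying assumption (real code would retry only throttling errors). As far as I know the AWS SDK also applies its own retry policy, so this would sit on top of that.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: retry a call with exponential backoff and full jitter (hypothetical helper). */
final class BackoffRetry {
    static <T> T callWithBackoff(Callable<T> call, int maxAttempts, long baseDelayMillis)
            throws Exception {
        if (maxAttempts < 1 || baseDelayMillis < 0) {
            throw new IllegalArgumentException("maxAttempts >= 1 and baseDelayMillis >= 0 required");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Assumption: every exception is retryable
                last = e;
                if (attempt == maxAttempts - 1) {
                    break;
                }
                // Upper bound doubles with each failed attempt: base, 2*base, 4*base, ...
                long cap = baseDelayMillis << Math.min(attempt, 20);
                // Full jitter: sleep a random time in [0, cap]
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
        throw last;
    }
}
```

A call site might look like `BackoffRetry.callWithBackoff(() -> uploadOneFile(file), 5, 100)`, where `uploadOneFile` is a placeholder for whatever performs a single upload.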
Similar questions:
S3 SlowDown: Please reduce your request rate exception
Amazon Web Services S3 Request Limit
AWS Forums - Maximizing Connection Reuse for S3 getObjectMetadata() Calls
S3 Transfer Acceleration does not necessarily give faster upload speeds; it is sometimes slower than a normal upload when used from within the same region. Amazon S3 Transfer Acceleration uses the AWS edge infrastructure around the world to get data onto the AWS backbone more quickly. When you use it, your request is routed to the best AWS edge location based on latency. Transfer Acceleration then sends your uploads to S3 over the AWS-managed backbone network using optimized network protocols, persistent connections from edge to origin, fully-open send and receive windows, and so forth. If you are already within the region, you would not see any benefit from this. Still, it is better to test the speed at https://s3-accelerate-speedtest.s3-accelerate.amazonaws.com/en/accelerate-speed-comparsion.html
I am writing a service that fetches data from a large SQL query (over 100,000 records) and streams it to an API as a CSV file. Is there any Java library function that does this, or any way to make the code below more efficient? I am currently using Java 8 in a Spring Boot environment.
The code is below, with the SQL repository method and the CSV service. Preferably, I would like to write to the CSV file while the data is still being fetched from SQL, as the query may take 2-3 minutes for the user.
We are using Snowflake DB.
public class ProductService {
    private final ProductRepository productRepository;
    private final ExecutorService executorService;

    public ProductService(ProductRepository productRepository) {
        this.productRepository = productRepository;
        this.executorService = Executors.newFixedThreadPool(20);
    }

    public InputStream getproductExportFile(productExportFilters filters) throws IOException {
        PipedInputStream is = new PipedInputStream();
        PipedOutputStream os = new PipedOutputStream(is);
        // Write on a separate thread so the caller can read from the pipe concurrently
        executorService.execute(() -> {
            try {
                Stream<productExport> productStream = productRepository.getproductExportStream(filters);
                Field[] fields = Stream.of(productExport.class.getDeclaredFields())
                        .peek(f -> f.setAccessible(true))
                        .toArray(Field[]::new);
                String[] headers = Stream.of(fields)
                        .map(Field::getName)
                        .toArray(String[]::new);
                CSVFormat csvFormat = CSVFormat.DEFAULT.builder()
                        .setHeader(headers)
                        .build();
                OutputStreamWriter outputStreamWriter = new OutputStreamWriter(os);
                CSVPrinter csvPrinter = new CSVPrinter(outputStreamWriter, csvFormat);
                productStream.forEach(productExport -> writeProductExportToCsv(productExport, csvPrinter, fields));
                csvPrinter.flush();
            } catch (Exception e) {
                logger.warn("Unable to complete writing to csv stream.", e);
            } finally {
                // Closing the pipe signals end-of-stream to the reader
                try {
                    os.close();
                } catch (IOException ignored) { }
            }
        });
        return is;
    }

    private void writeProductExportToCsv(productExport productExport, CSVPrinter csvPrinter, Field[] fields) {
        Object[] values = Stream.of(fields).
                map(f -> {
                    try {
                        return f.get(productExport);
                    } catch (IllegalAccessException e) {
                        return null;
                    }
                })
                .toArray();
        try {
            csvPrinter.printRecord(values);
        } catch (IOException e) {
            logger.warn("Unable to write record to file.", e);
        }
    }
}
public Stream<PatientExport> getProductExportStream(ProductExportFilters filters) {
MapSqlParameterSource parameterSource = new MapSqlParameterSource();
parameterSource.addValue("customerId", filters.getCustomerId().toString());
parameterSource.addValue("practiceId", filters.getPracticeId().toString());
StringBuilder sqlQuery = new StringBuilder("SELECT * FROM dbo.Product ");
sqlQuery.append("\nWHERE CUSTOMERID = :customerId\n" +
"AND PRACTICEID = :practiceId\n"
Streaming lets you transfer the data little by little, without having to load it all into the server's memory. You can do your processing in the extractData() method of ResultSetExtractor. You can find the javadoc for ResultSetExtractor here.
You can view an example using ResultSetExtractor here.
You can also easily run your JPA queries as a ResultSet using JdbcTemplate, so that you can use ResultSetExtractor. You can take a look at an example here.
There is a product that we bought some time ago for our company; we even got the source code back then: https://northconcepts.com/. We also evaluated Apache Camel, which had similar support, but it didn't suit our goal. If you really need speed, you should go to the lowest level possible: pure JDBC and as simple a CSV writer as possible.
The North Concepts library itself can read from JDBC and write to CSV at a lower level. We found a few tweaks that sped up the transmission and processing. With a single thread we are able to stream 100,000 records (with 400 columns) within 1-2 minutes.
Given that you didn't specify which database you use I can give you only generic answers.
In general, code like this is network limited, because a JDBC result set is usually transferred in packets of n rows, and the database fetches the next packet only once you have exhausted the current one. This setting is often called the fetch size, and you should greatly increase it. With default settings, most databases transfer 10-100 rows per fetch. In Spring you can use the setFetchSize property. Some benchmarks here.
There is other similar low-level tuning you could do. For example, the Oracle JDBC driver has "InsensitiveResultSetBufferSize", the size in bytes of the buffer that holds the result set. But those things tend to be database specific.
That being said, the best way to really increase the speed of your transfer is to launch multiple queries. Partition your data on some column value, and then launch multiple parallel queries. Essentially, if you can design the data to support parallel queries working on easily distinguished subsets, the bottleneck moves to network or CPU throughput.
For example, one of your columns might be 'timestamp'. Instead of having one query fetch all rows, fetch multiple subsets of rows with a query like this:
SELECT * FROM dbo.Product
WHERE CUSTOMERID = :customerId
AND PRACTICEID = :practiceId
AND :lowerLimit <= timestamp AND timestamp < :upperLimit
Launch this query in parallel with different timestamp ranges. Aggregate the results of those subqueries in a shared ConcurrentLinkedQueue and build the CSV from there.
With a similar approach I regularly read 100,000 rows/sec from an 80-column table in an Oracle DB. That is a sustained transfer rate of 40-60 MB/sec from a table which is not even locked.
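To make the range-splitting idea concrete, here is a minimal JDK-only sketch. The class `RangePartitionedFetch` and the `queryRange` callback are hypothetical; the callback stands in for running the query above with `:lowerLimit` and `:upperLimit` bound to the given values. Note that it collects results through futures in range order rather than through a shared ConcurrentLinkedQueue, which keeps the rows ordered by range at the cost of holding them in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

/** Sketch: run one query per timestamp range in parallel (hypothetical helper). */
final class RangePartitionedFetch {

    /** Splits [from, to) into up to `parts` contiguous half-open ranges. */
    static List<long[]> split(long from, long to, int parts) {
        List<long[]> ranges = new ArrayList<>();
        long span = to - from;
        for (int i = 0; i < parts; i++) {
            long lo = from + span * i / parts;
            long hi = from + span * (i + 1) / parts;
            if (lo < hi) {
                ranges.add(new long[] {lo, hi});
            }
        }
        return ranges;
    }

    /** Runs queryRange(lower, upper) for each range on its own thread and concatenates the rows. */
    static <T> List<T> fetchAll(long from, long to, int parts,
                                BiFunction<Long, Long, List<T>> queryRange) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (long[] r : split(from, to, parts)) {
                Callable<List<T>> task = () -> queryRange.apply(r[0], r[1]);
                futures.add(pool.submit(task));
            }
            List<T> rows = new ArrayList<>();
            for (Future<List<T>> f : futures) {
                rows.addAll(f.get()); // waits for each range in submission order
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }
}
```

The half-open ranges match the `:lowerLimit <= timestamp AND timestamp < :upperLimit` condition, so no row is fetched twice or skipped at a boundary.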
In my Java application I need to write data to S3 whose size I don't know in advance, and the sizes are usually big, so, as recommended in the AWS S3 documentation, I am using the AWS SDK for Java low-level API to write data to the S3 bucket.
In my application I provide S3BufferedOutputStream, an implementation of OutputStream that other classes in the app can use to write to the S3 bucket.
I store the data in a buffer in a loop, and once the data is bigger than the buffer size I upload the buffered data as a single UploadPartRequest.
Here is the implementation of the write method of S3BufferedOutputStream:
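public void write(byte[] b, int off, int len) throws IOException {
    int o = off, l = len;
    int size;
    // While the input does not fit in the remaining buffer space, fill the buffer and flush it
    while (l > (size = this.buf.length - position)) {
        System.arraycopy(b, o, this.buf, this.position, size);
        this.position += size;
        flushBufferAndRewind(); // helper name assumed: uploads the full buffer as one part and resets position
        o += size;
        l -= size;
    }
    // Copy whatever is left; it now fits in the buffer
    System.arraycopy(b, o, this.buf, this.position, l);
    this.position += l;
}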
The whole implementation is similar to this: code repo
My problem here is that each UploadPartRequest is executed synchronously, so we have to wait for one part to finish uploading before we can upload the next one. And because I am using the AWS S3 low-level API, I cannot benefit from the parallel uploading provided by the TransferManager.
Is there a way to achieve parallel uploads using the low-level SDK?
Or are there code changes that would let this operate asynchronously without corrupting the uploaded data, while maintaining the order of the data?
Here's some example code from a class that I have. It submits the parts to an ExecutorService and holds onto the returned Future. This is written for the v1 Java SDK; if you're using the v2 SDK you could use an async client rather than the explicit threadpool:
// WARNING: data must not be updated by caller; make a defensive copy if needed
public synchronized void uploadPart(byte[] data, boolean isLastPart)
{
    // assumes partNumber starts at 0 and uploadId was returned by initiateMultipartUpload()
    partNumber++;
    logger.debug("submitting part {} for s3://{}/{}", partNumber, bucket, key);
    final UploadPartRequest request = new UploadPartRequest()
            .withBucketName(bucket).withKey(key).withUploadId(uploadId)
            .withPartNumber(partNumber).withPartSize(data.length)
            .withInputStream(new ByteArrayInputStream(data))
            .withLastPart(isLastPart);
    futures.add(executor.submit(new Callable<PartETag>()
    {
        public PartETag call() throws Exception
        {
            int localPartNumber = request.getPartNumber();
            logger.debug("uploading part {} for s3://{}/{}", localPartNumber, bucket, key);
            UploadPartResult response = client.uploadPart(request);
            String etag = response.getETag();
            logger.debug("uploaded part {} for s3://{}/{}; etag is {}", localPartNumber, bucket, key, etag);
            return new PartETag(localPartNumber, etag);
        }
    }));
}
Note: this method is synchronized to ensure that parts are not submitted out of order.
Once you've submitted all of the parts, you use this method to wait for them to finish and then complete the upload:
public void complete()
{
    logger.debug("waiting for upload tasks of s3://{}/{}", bucket, key);
    List<PartETag> partTags = new ArrayList<>();
    for (Future<PartETag> future : futures)
    {
        try { partTags.add(future.get()); }
        catch (Exception e)
        {
            throw new RuntimeException(String.format("failed to complete upload task for s3://%s/%s", bucket, key), e);
        }
    }
    logger.debug("completing multi-part upload for s3://{}/{}", bucket, key);
    CompleteMultipartUploadRequest request = new CompleteMultipartUploadRequest()
            .withBucketName(bucket).withKey(key).withUploadId(uploadId).withPartETags(partTags);
    client.completeMultipartUpload(request);
    logger.debug("completed multi-part upload for s3://{}/{}", bucket, key);
}
You'll also need an abort() method that cancels outstanding parts and aborts the upload. This, and the rest of the class, are left as an exercise for the reader.
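For anyone who wants a starting point for the pattern as a whole, here is a stripped-down sketch that uses only the JDK, so it can be run without the AWS SDK. Everything in it is an assumption for illustration: `OrderedPartUploader` is not the class from this answer, and the `uploadOnePart` callback stands in for the real `client.uploadPart(...)` call. The `abort()` here only cancels outstanding tasks; a real one would also have to abort the multipart upload on the S3 side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

/** Sketch: upload parts in parallel while keeping results in part-number order. */
final class OrderedPartUploader {
    private final ExecutorService executor;
    // Hypothetical stand-in for the SDK call: (partNumber, data) -> etag
    private final BiFunction<Integer, byte[], String> uploadOnePart;
    private final List<Future<String>> futures = new ArrayList<>();
    private int partNumber = 0;

    OrderedPartUploader(int threads, BiFunction<Integer, byte[], String> uploadOnePart) {
        this.executor = Executors.newFixedThreadPool(threads);
        this.uploadOnePart = uploadOnePart;
    }

    /** Assigns the next part number under the lock, then uploads in the background. */
    synchronized void submitPart(byte[] data) {
        final int n = ++partNumber;      // part numbers start at 1
        final byte[] copy = data.clone(); // defensive copy so the caller may reuse its buffer
        Callable<String> task = () -> uploadOnePart.apply(n, copy);
        futures.add(executor.submit(task));
    }

    /** Waits for all parts and returns their etags in part-number order. */
    synchronized List<String> awaitAll() throws Exception {
        try {
            List<String> etags = new ArrayList<>();
            for (Future<String> f : futures) {
                etags.add(f.get());
            }
            return etags;
        } catch (Exception e) {
            abort();
            throw e;
        } finally {
            executor.shutdown();
        }
    }

    /** Cancels parts that have not finished yet. */
    synchronized void abort() {
        for (Future<String> f : futures) {
            f.cancel(true);
        }
        executor.shutdownNow();
    }
}
```

The order of the final object is determined by the part numbers, not by the order in which uploads finish, which is why it is enough to assign numbers under a lock and keep the futures in submission order.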
You should look at using the AWS SDK for Java V2. You are referencing V1, not the newest Amazon S3 Java API. If you are not familiar with V2, start here:
Get started with the AWS SDK for Java 2.x
To perform Async operations via the Amazon S3 Java API, you use S3AsyncClient.
Now to learn how to upload an object using this client, see this code example:
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectResponse;
import java.nio.file.Paths;
import java.util.concurrent.CompletableFuture;

/**
 * To run this AWS code example, ensure that you have setup your development environment, including your AWS credentials.
 * For information, see this documentation topic:
 * https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
 */
public class S3AsyncOps {
    public static void main(String[] args) {
        final String USAGE = "\n" +
                "Usage:\n" +
                " S3AsyncOps <bucketName> <key> <path>\n\n" +
                "Where:\n" +
                " bucketName - the name of the Amazon S3 bucket (for example, bucket1). \n\n" +
                " key - the name of the object (for example, book.pdf). \n" +
                " path - the local path to the file (for example, C:/AWS/book.pdf). \n" ;
        if (args.length != 3) {
            System.out.println(USAGE);
            System.exit(1);
        }
        String bucketName = args[0];
        String key = args[1];
        String path = args[2];
        Region region = Region.US_WEST_2;
        S3AsyncClient client = S3AsyncClient.builder()
                .region(region)
                .build();
        PutObjectRequest objectRequest = PutObjectRequest.builder()
                .bucket(bucketName)
                .key(key)
                .build();
        // Put the object into the bucket
        CompletableFuture<PutObjectResponse> future = client.putObject(objectRequest,
                AsyncRequestBody.fromFile(Paths.get(path))
        );
        future.whenComplete((resp, err) -> {
            try {
                if (resp != null) {
                    System.out.println("Object uploaded. Details: " + resp);
                } else {
                    // Handle error
                    err.printStackTrace();
                }
            } finally {
                // Only close the client when you are completely done with it
                client.close();
            }
        });
        // Block until the upload has finished so the JVM does not exit early
        future.join();
    }
}
That is uploading an object using the S3AsyncClient client. To perform a multi-part upload, you need to use this method:
To see an example of a multipart upload using the S3 sync client, see:
That is your solution - use S3AsyncClient object's createMultipartUpload method.
I am trying to parse the contents of a large number of emails in a Gmail account. My code works fine on Google App Engine for up to ~4000 emails, but I get the following error when the number is higher:
Uncaught exception from servlet com.google.apphosting.runtime.HardDeadlineExceededError
My sample space has about 4500 emails and the code below will take a little over a minute to get all the emails. I am looking to lower the execution time to fetch the emails.
My code is:
final List<Message> messages = new ArrayList<Message>();
BatchRequest batchRequest = gmail.batch();
JsonBatchCallback<Message> callback = new JsonBatchCallback<Message>() {
public void onSuccess(Message message, HttpHeaders responseHeaders) {
synchronized (messages) {
public void onFailure(GoogleJsonError e, HttpHeaders responseHeaders)
throws IOException {
int batchCount=0;
for(Message message : messageList){
gmail.users().messages().get("me", message.getId()).set("format", "metadata").set("fields", "payload").queue(batchRequest, callback);
log.info("No of Emails Read : " + noOfEmailsRead);
catch(Exception e){
log.info("No of Emails Read : " + noOfEmailsRead);
As explained here (RuntimeError), the error occurs because you must finish your task within 30 seconds.
To finish the whole task in about 30 seconds, you can use a divide-and-conquer approach: break the task into smaller tasks and run them in parallel, using all the parallel power of your processor. Determining the best number of tasks can be a little hard because it depends on your OS, processor, and so on, so you will need to run some tests and benchmarks.
Java has the java.util.concurrent package, which can help with this. You can use the Fork/Join framework.
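As a rough illustration of the Fork/Join approach, the sketch below splits a list of ids in half recursively until each piece is small enough to fetch in one go. `ChunkedFetchTask` and the `fetchChunk` callback are hypothetical names; the callback stands in for executing one small batch request. It also assumes the hosting environment lets you start a ForkJoinPool, which you would need to verify for your runtime.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.function.Function;

/** Sketch: divide a list of ids into small chunks and fetch the chunks in parallel. */
final class ChunkedFetchTask<I, O> extends RecursiveTask<List<O>> {
    private final List<I> ids;
    private final int threshold;
    private final Function<List<I>, List<O>> fetchChunk; // hypothetical: fetches one small batch

    ChunkedFetchTask(List<I> ids, int threshold, Function<List<I>, List<O>> fetchChunk) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.ids = ids;
        this.threshold = threshold;
        this.fetchChunk = fetchChunk;
    }

    @Override
    protected List<O> compute() {
        // Small enough: fetch directly
        if (ids.size() <= threshold) {
            return fetchChunk.apply(ids);
        }
        // Otherwise split in half, run the left half asynchronously and the right half here
        int mid = ids.size() / 2;
        ChunkedFetchTask<I, O> left = new ChunkedFetchTask<>(ids.subList(0, mid), threshold, fetchChunk);
        ChunkedFetchTask<I, O> right = new ChunkedFetchTask<>(ids.subList(mid, ids.size()), threshold, fetchChunk);
        left.fork();
        List<O> rightResult = right.compute();
        List<O> result = new ArrayList<>(left.join());
        result.addAll(rightResult); // left results first, so the input order is preserved
        return result;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 4500; i++) {
            ids.add(i);
        }
        List<String> messages = new ForkJoinPool(8).invoke(
                new ChunkedFetchTask<Integer, String>(ids, 100, batch -> {
                    List<String> out = new ArrayList<>();
                    for (Integer id : batch) {
                        out.add("message-" + id); // placeholder for a real fetch
                    }
                    return out;
                }));
        System.out.println(messages.size());
    }
}
```

The pool size (8) and chunk size (100) are arbitrary here; as the answer says, the right values need benchmarking.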
You will need to break up the work into smaller tasks that can each complete in under 30 seconds.
A simple google search would have revealed this to you.
I know Apache Curator provides a distributed lock feature that is built on top of ZooKeeper. Based on the documentation on the official Apache Curator website, it looks very easy to use. For example:
RetryPolicy retryPolicy = new ExponentialBackoffRetry(1000, 3);
CuratorFramework client = CuratorFrameworkFactory.newClient("host:ip",retryPolicy);
InterProcessSemaphoreMutex lock = new InterProcessSemaphoreMutex(client, path);
if(lock.acquire(10, TimeUnit.SECONDS))
try { /*do something*/ }
finally { lock.release(); }
But what does the second parameter, "path", of "InterProcessSemaphoreMutex" mean? According to the API it is "the path for the lock", but what exactly is it? Can anybody give me an example?
If I have millions of locks, do I have to create millions of lock paths? Is there a limit on the number of locks (znodes) a ZooKeeper cluster can hold? Or can we remove a lock when a process releases it?
ZooKeeper presents what looks like a distributed file system. For any ZooKeeper operation, recipe, etc., you write "znodes" to a particular path and watch for changes. See here: http://zookeeper.apache.org/doc/trunk/zookeeperOver.html#Simple+API (regarding znodes).
For Curator recipes, it needs to know the base path you want to use to perform the recipe. For InterProcessSemaphoreMutex, the path is what every participant should use. i.e. Process 1 and Process 2 want to both contend for the lock. So, they both allocate InterProcessSemaphoreMutex instances with the same path, say "/my/lock". Think of the path as the lock identifier. In the same ZooKeeper cluster, you could have multiple locks by using different paths.
Hope this helps (disclaimer: I'm the main author of Curator).
Some examples about Reaper.
public void testSomeNodes() throws Exception
Timing timing = new Timing();
ChildReaper reaper = null;
CuratorFramework client = CuratorFrameworkFactory.newClient(server.getConnectString(), timing.session(), timing.connection(), new RetryOneTime(1));
Random r = new Random();
int nonEmptyNodes = 0;
for ( int i = 0; i < 10; ++i )
client.create().creatingParentsIfNeeded().forPath("/test/" + Integer.toString(i));
if ( r.nextBoolean() )
client.create().forPath("/test/" + Integer.toString(i) + "/foo");
reaper = new ChildReaper(client, "/test", Reaper.Mode.REAP_UNTIL_DELETE, 1);
Stat stat = client.checkExists().forPath("/test");
Assert.assertEquals(stat.getNumChildren(), nonEmptyNodes);
Java Code Examples for org.apache.curator.framework.recipes.locks.Reaper
I've been testing out DynamoDB as a potential option for a scalable and steady throughput database for a site that will be hit pretty frequently and requires a very fast response time (< 50ms). I'm seeing pretty slow responses (both locally and on an EC2 instance) for the following code:
public static void main(String[] args) {
    try {
        AWSCredentials credentials = new PropertiesCredentials(new File("aws_credentials.properties"));
        long start = System.currentTimeMillis();
        AmazonDynamoDBClient client = new AmazonDynamoDBClient(credentials);
        System.out.println((System.currentTimeMillis() - start) + " (ms) to connect");
        DynamoDBMapper mapper = new DynamoDBMapper(client);
        start = System.currentTimeMillis();
        Model model = mapper.load(Model.class, "hashkey1", "rangekey1");
        System.out.println((System.currentTimeMillis() - start) + " (ms) to load Model");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The connection to the DB alone takes about 800 (ms) on average and the loading using the mapper takes an additional 200 (ms). According to Amazon's page about DynamoDB we should expect "Average service-side latencies...typically single-digit milliseconds." I wouldn't expect the full round-trip HTTP request to add that much overhead. Are these expected numbers even on an EC2 instance?
I think a better test would be to avoid the initial costs/latency incurred in starting up the JVM and loading the classes. Something like:
public class TestDynamoDBMain {
    public static void main(String[] args) {
        try {
            AWSCredentials credentials = new PropertiesCredentials(new File("aws_credentials.properties"));
            AmazonDynamoDBClient client = new AmazonDynamoDBClient(credentials);
            DynamoDBMapper mapper = new DynamoDBMapper(client);
            // Warm up
            for (int i=0; i < 10; i++) {
                testrun(mapper, false);
            }
            // Time it
            for (int i=0; i < 10; i++) {
                testrun(mapper, true);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void testrun(DynamoDBMapper mapper, boolean timed) {
        long start = System.nanoTime();
        Model model = mapper.load(Model.class, "hashkey1", "rangekey1");
        if (timed)
            System.out.println(
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
                + " (ms) to load Model");
    }
}
Furthermore, you may consider enabling the default metrics of the AWS SDK for Java to see the fine grain time allocation in Amazon CloudWatch. For more details, see:
Hope this helps.
DynamoDB is located in a specific region (they don't yet support cross-region replication). You choose the region when you create a table. Unless you are calling the APIs from the same region, it is bound to be slow.
It looks like you are trying to call DynamoDB from your development desktop. You can redo the same test from an EC2 instance started in the same region. This will considerably speed up the responses. It is also a more realistic test, since your production system will be deployed in the same region as DynamoDB anyway.
Again, if you really need very quick responses, consider using ElastiCache between your code and DynamoDB. On every read, store the result in the cache before returning it. The next read should then come from the cache (say, with an expiry time of 10 minutes). For read-heavy apps this is the suggested route. I have seen many-fold better response times using this approach.
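The read path described above is usually called a read-through cache. The sketch below shows the idea with a plain in-process map instead of ElastiCache, purely so it can run on its own; `ExpiringReadThroughCache`, the `loader` callback (standing in for the DynamoDB read), and the injectable clock are all my own assumptions for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Sketch: read-through cache with a fixed time-to-live (in-process stand-in for a real cache). */
final class ExpiringReadThroughCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // hypothetical: performs the real database read
    private final long ttlMillis;
    private final LongSupplier clock;    // injectable so expiry can be tested without sleeping

    ExpiringReadThroughCache(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value if it is still fresh; otherwise loads it and caches it. */
    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, old) ->
                (old != null && old.expiresAtMillis > now)
                        ? old
                        : new Entry<>(loader.apply(k), now + ttlMillis));
        return entry.value;
    }
}
```

With a 10-minute expiry it could be created as `new ExpiringReadThroughCache<>(key -> loadFromDynamo(key), 600_000L, System::currentTimeMillis)`, where `loadFromDynamo` is a placeholder for the actual mapper call. The trade-off is that readers may see data up to 10 minutes old.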