Read a large number of emails using Gmail API - java

I am trying to parse the contents of a large number of emails in a Gmail account. My code works fine on Google App Engine for up to ~4000 emails, but above that I get the following error:
Uncaught exception from servlet com.google.apphosting.runtime.HardDeadlineExceededError
My sample space has about 4500 emails, and the code below takes a little over a minute to fetch them all. I am looking to lower the execution time.
My code is:
final List<Message> messages = new ArrayList<Message>();
BatchRequest batchRequest = gmail.batch();
JsonBatchCallback<Message> callback = new JsonBatchCallback<Message>() {
    @Override
    public void onSuccess(Message message, HttpHeaders responseHeaders) {
        synchronized (messages) {
            messages.add(message);
        }
    }

    @Override
    public void onFailure(GoogleJsonError e, HttpHeaders responseHeaders)
            throws IOException {
    }
};

int batchCount = 0;
if (noOfEmails > 0) {
    for (Message message : messageList) {
        gmail.users().messages().get("me", message.getId())
                .set("format", "metadata").set("fields", "payload")
                .queue(batchRequest, callback);
        batchCount++;
        if (batchCount == 1000) {
            try {
                noOfEmailsRead += batchCount;
                log.info("No of Emails Read : " + noOfEmailsRead);
                batchRequest.execute();
            } catch (Exception e) {
                // errors are currently swallowed
            }
            batchCount = 0;
        }
    }
    noOfEmailsRead += batchCount;
    log.info("No of Emails Read : " + noOfEmailsRead);
    batchRequest.execute();
}

As said here: RuntimeError
HardDeadlineExceededError occurs because you must finish your task within about 30 seconds.
To accomplish the whole task in that window, you can use a divide-and-conquer approach: break the task into smaller tasks and use the parallel power of your processor. Determining the best number of tasks can be tricky because it depends on your OS, processor, and so on, so you will need to run some tests and benchmarks.
Java has java.util.concurrent, which can help here; in particular, you can use the Fork/Join framework (a sketch follows).
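A minimal Fork/Join sketch of that idea, reusing the gmail client and messageList from the question; the chunk size of 250 is an arbitrary example, and App Engine may restrict how you create threads, so treat this as a plain-Java illustration rather than drop-in code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import com.google.api.services.gmail.Gmail;
import com.google.api.services.gmail.model.Message;

class FetchMessagesTask extends RecursiveTask<List<Message>> {

    private static final int THRESHOLD = 250; // example chunk size, tune by benchmarking

    private final Gmail gmail;
    private final List<Message> ids;

    FetchMessagesTask(Gmail gmail, List<Message> ids) {
        this.gmail = gmail;
        this.ids = ids;
    }

    @Override
    protected List<Message> compute() {
        if (ids.size() <= THRESHOLD) {
            // Small enough: fetch this chunk directly.
            List<Message> result = new ArrayList<>();
            for (Message m : ids) {
                try {
                    result.add(gmail.users().messages().get("me", m.getId())
                            .setFormat("metadata").execute());
                } catch (IOException e) {
                    // log and skip the failed message
                }
            }
            return result;
        }
        // Otherwise split in half and let the pool work both halves in parallel.
        int mid = ids.size() / 2;
        FetchMessagesTask left = new FetchMessagesTask(gmail, ids.subList(0, mid));
        FetchMessagesTask right = new FetchMessagesTask(gmail, ids.subList(mid, ids.size()));
        left.fork();
        List<Message> result = new ArrayList<>(right.compute());
        result.addAll(left.join());
        return result;
    }
}

// Usage:
// List<Message> messages = new ForkJoinPool().invoke(new FetchMessagesTask(gmail, messageList));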

You will need to break the work up into smaller tasks that can each complete in under 30 seconds.
A simple Google search would have revealed this to you.

Related

Is there any way to write small files into S3 as they arrive (live streaming)?

I am receiving around 20k small XML files (1 KB to 3 KB each) per minute.
I have to upload every file as it arrives in the directory.
Sometimes the rate of incoming files increases to 100k per minute.
Is there anything in Java or the AWS API that can help me match the incoming speed?
I am using the uploadFileList() API to upload all the files.
I have also tried a watch event, so that whenever a file arrives in the folder it is uploaded to S3, but that is slow compared to the incoming rate and creates a huge backlog.
I have tried multithreading as well, but if I spin up more threads I get a "reduce your request rate" error from S3, and sometimes I also get the error below:
AmazonServiceException:
com.amazonaws.services.s3.model.AmazonS3Exception: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
When I don't use threading I do not get this error.
Another approach I tried is to create one big file, upload it to S3, and then split it back into small files inside S3. That works, but it delays the upload and impacts the users who access the files from S3.
I know uploading small files into S3 is not ideal, but I have a use case that requires it.
The speed I observed is about 5k files uploaded per minute.
Can someone please suggest an alternative so that my upload speed reaches at least 15k files per minute?
I am sharing my full code, where I am trying to upload using a multithreaded application.
Class 1, which builds the list of files to hand to the threads:
public class FileProcessThreads {

    public ArrayList process(String fileLocation) {
        File dir = new File(fileLocation);
        File[] directoryListing = dir.listFiles();
        ArrayList<File> files = new ArrayList<File>();
        if (directoryListing.length > 0) {
            for (File path : directoryListing) {
                files.add(path);
            }
        }
        return files;
    }
}
Class 2, where I create the thread pool and executor:
public class UploadExecutor {

    private static String fileLocation = "C:\\Users\\u6034690\\Desktop\\ONEFILE";
    // private static String fileLocation="D:\\TRFAudits_Moved\\";
    private static final String _logFileName = "s3FileUploader.log";
    private static Logger _logger = Logger.getLogger(UploadExecutor.class);

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        _logger.info("----------Starting application's main method----------------- ");
        AWSCredentials credential = new ProfileCredentialsProvider("TRFAuditability-Prod-ServiceUser").getCredentials();
        final ClientConfiguration config = new ClientConfiguration();
        AmazonS3Client s3Client = (AmazonS3Client) AmazonS3ClientBuilder.standard().withRegion("us-east-1")
                .withCredentials(new AWSStaticCredentialsProvider(credential)).withForceGlobalBucketAccessEnabled(true)
                .build();
        s3Client.getClientConfiguration().setMaxConnections(100);
        TransferManager tm = new TransferManager(s3Client);
        while (true) {
            FileProcessThreads fp = new FileProcessThreads();
            List<File> records = fp.process(fileLocation);
            while (records.size() <= 0) {
                try {
                    _logger.info("No records found, will wait for 10 seconds");
                    TimeUnit.SECONDS.sleep(10);
                    records = fp.process(fileLocation);
                } catch (InterruptedException e) {
                    _logger.error("InterruptedException: " + e.toString());
                }
            }
            _logger.info("Total no of Audit files = " + records.size());
            ExecutorService es = Executors.newFixedThreadPool(2);
            int recordsInEachThread = (int) (records.size() / 2);
            _logger.info("No of records in each thread = " + recordsInEachThread);
            UploadObject my1 = new UploadObject(records.subList(0, recordsInEachThread), tm);
            UploadObject my2 = new UploadObject(records.subList(recordsInEachThread, records.size()), tm);
            es.execute(my1);
            es.execute(my2);
            es.shutdown();
            try {
                boolean finished = es.awaitTermination(1, TimeUnit.MINUTES);
                if (!finished) {
                    Thread.sleep(1000);
                }
            } catch (InterruptedException e) {
                _logger.error("InterruptedException: " + e.toString());
            }
        }
    }
}
The last class, where I upload the files to S3:
public class UploadObject implements Runnable {

    static String bucketName = "a205381-auditxml/S3UPLOADER";
    private String fileLocation = "C:\\Users\\u6034690\\Desktop\\ONEFILE";
    // private String fileLocation="D:\\TRFAudits\\";
    // static String bucketName = "a205381-auditxml/S3UPLOADER";
    private static Logger _logger;
    List<File> records;
    TransferManager tm;

    UploadObject(List<File> list, TransferManager tm) {
        this.records = list;
        this.tm = tm;
        _logger = Logger.getLogger(UploadObject.class);
    }

    public void run() {
        System.out.println(Thread.currentThread().getName() + " : ");
        uploadToS3();
    }

    public void uploadToS3() {
        _logger.info("Number of records to be processed in current thread: " + records.size());
        MultipleFileUpload xfer = tm.uploadFileList(bucketName, "TEST", new File(fileLocation), records);
        try {
            xfer.waitForCompletion();
            TransferState xfer_state = xfer.getState();
            _logger.info("Upload status -----------------" + xfer_state);
            for (File file : records) {
                try {
                    Files.delete(FileSystems.getDefault().getPath(file.getAbsolutePath()));
                } catch (IOException e) {
                    _logger.error("IOException: " + e.toString());
                    System.exit(1);
                }
            }
            _logger.info("Successfully completed file cleanse");
        } catch (AmazonServiceException e) {
            _logger.error("AmazonServiceException: " + e.toString());
            System.exit(1);
        } catch (AmazonClientException e) {
            _logger.error("AmazonClientException: " + e.toString());
            System.exit(1);
        } catch (InterruptedException e) {
            _logger.error("InterruptedException: " + e.toString());
            System.exit(1);
        }
        System.out.println("Completed");
        _logger.info("Upload completed");
        _logger.info("Calling Transfer manager shutdown");
        // tm.shutdownNow();
    }
}
It sounds like you're tripping the built-in protections for S3 (quoted docs below). I've also listed some similar questions below; some of these advise rearchitecting using SQS to even out and distribute the load on S3.
Aside from introducing more moving pieces, you can reuse your S3Client and TransferManager. Move them up out of your runnable object and pass them into its constructor. TransferManager itself uses multithreading according to the javadoc.
When possible, TransferManager attempts to use multiple threads to upload multiple parts of a single upload at once. When dealing with large content sizes and high bandwidth, this can have a significant increase on throughput.
You can also increase the maximum number of simultaneous connections that the S3Client uses, for example:
s3Client.getClientConfiguration().setMaxConnections(75)
or even higher; DEFAULT_MAX_CONNECTIONS is set to 50.
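A minimal sketch of creating the client and TransferManager once and sharing them across workers; the TransferManagerBuilder usage and connection count here are illustrative assumptions, not a drop-in replacement for your setup:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

public class S3Clients {

    // Build one client and one TransferManager for the whole application and
    // pass the TransferManager into each worker; TransferManager is thread-safe
    // and maintains its own upload thread pool.
    public static TransferManager sharedTransferManager() {
        ClientConfiguration config = new ClientConfiguration()
                .withMaxConnections(100); // raised from the default of 50
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")
                .withClientConfiguration(config)
                .build();
        return TransferManagerBuilder.standard()
                .withS3Client(s3)
                .build();
    }
}

Each UploadObject then keeps receiving that single TransferManager through its constructor, rather than anything being rebuilt per batch or per thread.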
Lastly, you could try uploading to different prefixes/folders under the bucket, as noted below, to allow scaling for high request rates (a sketch follows the quoted guidelines).
The current AWS Request Rate and Performance Guidelines
Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/POST/DELETE and 5,500 GET requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. It is simple to increase your read or write performance exponentially. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second.
The current AWS S3 Error Best Practices
Tune Application for Repeated SlowDown errors
As with any distributed system, S3 has protection mechanisms which detect intentional or unintentional resource over-consumption and react accordingly. SlowDown errors can occur when a high request rate triggers one of these mechanisms. Reducing your request rate will decrease or eliminate errors of this type. Generally speaking, most users will not experience these errors regularly; however, if you would like more information or are experiencing high or unexpected SlowDown errors, please post to our Amazon S3 developer forum https://forums.aws.amazon.com/ or sign up for AWS Premium Support https://aws.amazon.com/premiumsupport/.
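To illustrate the prefix idea mentioned above, here is a rough sketch that splits one batch of files across several key prefixes, using the question's own tm, bucketName, fileLocation and records; the prefix names and chunking are made up for illustration:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.services.s3.transfer.MultipleFileUpload;
import com.amazonaws.services.s3.transfer.TransferManager;

public class PrefixedUploader {

    // Spreads one batch over several prefixes so that each prefix gets its own
    // share of S3's per-prefix request rate.
    static void uploadAcrossPrefixes(TransferManager tm, String bucketName,
                                     String fileLocation, List<File> records)
            throws InterruptedException {
        String[] prefixes = { "TEST/part-0", "TEST/part-1", "TEST/part-2", "TEST/part-3" };
        int chunk = (records.size() + prefixes.length - 1) / prefixes.length;

        List<MultipleFileUpload> uploads = new ArrayList<>();
        for (int i = 0; i < prefixes.length; i++) {
            int from = i * chunk;
            int to = Math.min(records.size(), from + chunk);
            if (from >= to) {
                break;
            }
            uploads.add(tm.uploadFileList(bucketName, prefixes[i],
                    new File(fileLocation), records.subList(from, to)));
        }
        for (MultipleFileUpload upload : uploads) {
            upload.waitForCompletion(); // transfers for each prefix run concurrently inside TransferManager
        }
    }
}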
Similar questions:
S3 SlowDown: Please reduce your request rate exception
Amazon Web Services S3 Request Limit
AWS Forums - Maximizing Connection Reuse for S3 getObjectMetadata() Calls
S3 Transfer Acceleration does not necessarily give faster upload speeds; it is sometimes slower than a normal upload when used from the same region. Amazon S3 Transfer Acceleration uses the AWS edge infrastructure around the world to get data onto the AWS backbone more quickly. When you use Transfer Acceleration, your request is routed to the best AWS edge location based on latency; the edge then sends your uploads back to S3 over the AWS-managed backbone network using optimized network protocols, persistent connections from edge to origin, fully open send and receive windows, and so forth. Since you are already within the region, you wouldn't see any benefit from using this, but it's better to test the speed from https://s3-accelerate-speedtest.s3-accelerate.amazonaws.com/en/accelerate-speed-comparsion.html
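If you do want to test it, accelerate mode is typically switched on when building the client; this sketch assumes your SDK version exposes accelerate mode on the builder and that the bucket already has Transfer Acceleration enabled:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class AcceleratedClient {

    static AmazonS3 build() {
        // Requests are then routed through the s3-accelerate endpoints
        // described above instead of the regular regional endpoint.
        return AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")
                .withAccelerateModeEnabled(true)
                .build();
    }
}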

How to perform different actions when thread takes different amount of time to execute?

There is a REST service that performs some calculations and writes the results to an Excel file, which is returned as the response. As the data grows, we want the following behaviour: if the file is ready within 3 seconds, return the Excel file; otherwise, send a text message and email the file to the user later.
Any suggestions on how to implement this functionality in Java?
Use a Future with a timeout:
public String getExcel() {
    CompletableFuture<String> getter = CompletableFuture.supplyAsync(() -> "result");
    try {
        return getter.get(3, TimeUnit.SECONDS);
    } catch (TimeoutException ex) {
        // not ready within 3 seconds: mail the result once it completes
        getter.thenAcceptAsync(result -> sendEmail(result));
        // I will send you an email later
        return "XXx to indicate you will send him later";
    } catch (InterruptedException | ExecutionException ex) {
        throw new RuntimeException(ex);
    }
}

public void sendEmail(String resultFromGetExcel) {
}

Streaming in jersey 2?

I've been trying to get JSON streaming to work in Jersey 2. For the life of me, nothing streams until the stream is complete.
I've tried this example, trying to simulate a slow producer of data.
#Path("/foo")
#GET
public void getAsyncStream(#Suspended AsyncResponse response) {
StreamingOutput streamingOutput = output -> {
JsonGenerator jg = new ObjectMapper().getFactory().createGenerator(output, JsonEncoding.UTF8);
jg.writeStartArray();
for (int i = 0; i < 100; i++) {
jg.writeObject(i);
try {
Thread.sleep(100);
}
catch (InterruptedException e) {
logger.error(e, "Error");
}
}
jg.writeEndArray();
jg.flush();
jg.close();
};
response.resume(Response.ok(streamingOutput).build());
}
And yet Jersey just sits there until the JSON generator is done before returning the results. I'm watching the results come through in Charles Proxy.
Do I need to enable something? Not sure why this won't stream out.
Edit:
This may actually be working, just not how I expected. I don't think streaming writes things out in real time, which is what I wanted; it is more about not having to buffer responses before writing them out to the client. If I run a loop of a million with no thread sleep, the data does get written out in chunks without being buffered in memory.
Your edit is correct. It is working as expected. StreamingOutput is just a wrapper that lets us write directly to the response stream; it does not mean the response is streamed on each server-side write to the stream. Also, AsyncResponse does not provide any different response as far as the client is concerned. It simply helps increase throughput with long-running tasks; the long-running task should actually be done in another thread so the method can return.
See more at Asynchronous Server API.
What you seem to be looking for instead is ChunkedOutput.
Jersey offers a facility for sending response to the client in multiple more-or-less independent chunks using a chunked output. Each response chunk usually takes some (longer) time to prepare before sending it to the client. The most important fact about response chunks is that you want to send them to the client immediately as they become available without waiting for the remaining chunks to become available too.
Not sure how it will work for your particular use case, as the JsonGenerator expects an OutputStream (which the ChunkedOutput we use is not), but here is a simpler example:
#Path("async")
public class AsyncResource {
#GET
public ChunkedOutput<String> getChunkedStream() throws Exception {
final ChunkedOutput<String> output = new ChunkedOutput<>(String.class);
new Thread(() -> {
try {
String chunk = "Message";
for (int i = 0; i < 10; i++) {
output.write(chunk + "#" + i);
Thread.sleep(1000);
}
} catch (Exception e) {
} finally {
try {
output.close();
} catch (IOException ex) {
Logger.getLogger(AsyncResource.class.getName())
.log(Level.SEVERE, null, ex);
}
}
}).start();
return output;
}
}
Note: I had a problem getting this to work at first; I would only get the delayed, complete result. The problem turned out to be something completely separate from the program: my AVG antivirus. A feature called "LinkScanner" was preventing the chunking from occurring. I disabled that feature and it started working.
I haven't explored chunking much, and I am not sure of the security implications, so I am not sure why the AVG application has a problem with it.
EDIT
It seems the real problem is that Jersey buffers the response in order to calculate the Content-Length header. You can see this post for how to change that behavior.
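For reference, a minimal sketch of turning that buffering off in a Jersey 2 ResourceConfig; the package name is hypothetical, and setting the buffer to 0 simply tells Jersey not to buffer entities in order to compute Content-Length:

import org.glassfish.jersey.server.ResourceConfig;
import org.glassfish.jersey.server.ServerProperties;

public class AppConfig extends ResourceConfig {

    public AppConfig() {
        packages("your.resource.package"); // hypothetical package containing AsyncResource
        // With a 0-byte buffer Jersey cannot compute Content-Length,
        // so it writes the entity out as it is produced instead of buffering it.
        property(ServerProperties.OUTBOUND_CONTENT_LENGTH_BUFFER, 0);
    }
}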

How to improve speed of DynamoDB requests

I've been testing DynamoDB as a potential option for a scalable, steady-throughput database for a site that will be hit pretty frequently and requires a very fast response time (< 50 ms). I'm seeing pretty slow responses (both locally and on an EC2 instance) with the following code:
public static void main(String[] args) {
    try {
        AWSCredentials credentials = new PropertiesCredentials(new File("aws_credentials.properties"));
        long start = System.currentTimeMillis();
        AmazonDynamoDBClient client = new AmazonDynamoDBClient(credentials);
        System.out.println((System.currentTimeMillis() - start) + " (ms) to connect");
        DynamoDBMapper mapper = new DynamoDBMapper(client);
        start = System.currentTimeMillis();
        Model model = mapper.load(Model.class, "hashkey1", "rangekey1");
        System.out.println((System.currentTimeMillis() - start) + " (ms) to load Model");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The connection to the DB alone takes about 800 ms on average, and the load through the mapper takes an additional 200 ms. According to Amazon's page about DynamoDB, we should expect "Average service-side latencies...typically single-digit milliseconds." I wouldn't expect the full round-trip HTTP request to add that much overhead. Are these numbers expected, even on an EC2 instance?
I think a better test would be to avoid the initial costs/latency incurred in starting up the JVM and loading the classes. Something like:
public class TestDynamoDBMain {

    public static void main(String[] args) {
        try {
            AWSCredentials credentials = new PropertiesCredentials(new File("aws_credentials.properties"));
            AmazonDynamoDBClient client = new AmazonDynamoDBClient(credentials);
            DynamoDBMapper mapper = new DynamoDBMapper(client);
            // Warm up
            for (int i = 0; i < 10; i++) {
                testrun(mapper, false);
            }
            // Time it
            for (int i = 0; i < 10; i++) {
                testrun(mapper, true);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void testrun(DynamoDBMapper mapper, boolean timed) {
        long start = System.nanoTime();
        Model model = mapper.load(Model.class, "hashkey1", "rangekey1");
        if (timed)
            System.out.println(
                    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
                    + " (ms) to load Model");
    }
}
Furthermore, you may consider enabling the default metrics of the AWS SDK for Java to see the fine-grained time allocation in Amazon CloudWatch. For more details, see:
http://java.awsblog.com/post/Tx1O0S3I51OTZWT/Taste-of-JMX-Using-the-AWS-SDK-for-Java
Hope this helps.
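As a small illustration of enabling those default metrics programmatically (the system-property route described in the linked post works as well):

import com.amazonaws.metrics.AwsSdkMetrics;

public class EnableSdkMetrics {

    public static void main(String[] args) {
        // Equivalent to starting the JVM with -Dcom.amazonaws.sdk.enableDefaultMetrics;
        // request-level metrics are then published to CloudWatch.
        AwsSdkMetrics.enableDefaultMetrics();
        // ... then run the DynamoDB test from above
    }
}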
DynamoDB is located in a specific region (they don't yet support cross-region replication), which you choose when you create a table. Unless you are calling the APIs from the same region, it is bound to be slow.
It looks like you are trying to call DynamoDB from your development desktop. Re-run the same test from an EC2 instance started in the same region; this will speed up the responses considerably. It is also a more realistic test, since your production system will be deployed in the same region as DynamoDB anyway.
If you really need a very quick response, consider putting ElastiCache between your code and DynamoDB: on every read, store the result in the cache before returning it, and serve subsequent reads from the cache (say, with an expiry of 10 minutes). For read-heavy apps this is the suggested route; I have seen many-fold better response times using this approach (a sketch follows).
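A rough read-through sketch of that pattern; the Cache interface below is a hypothetical stand-in for whatever ElastiCache client (Memcached or Redis) you use, and the 10-minute TTL is the example expiry from above:

// Hypothetical cache facade standing in for a Memcached/Redis client.
interface Cache {
    Model get(String key);
    void put(String key, Model value, int ttlSeconds);
}

class ModelReader {

    private static final int TTL_SECONDS = 600; // 10-minute expiry, per the suggestion above

    private final DynamoDBMapper mapper;
    private final Cache cache;

    ModelReader(DynamoDBMapper mapper, Cache cache) {
        this.mapper = mapper;
        this.cache = cache;
    }

    Model load(String hashKey, String rangeKey) {
        String key = hashKey + "|" + rangeKey;
        Model cached = cache.get(key);          // fast path: served from the cache
        if (cached != null) {
            return cached;
        }
        Model fresh = mapper.load(Model.class, hashKey, rangeKey); // miss: go to DynamoDB
        if (fresh != null) {
            cache.put(key, fresh, TTL_SECONDS); // populate the cache for later reads
        }
        return fresh;
    }
}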

Single-threaded Java Websocket for Testing

We are developing an application with Scala and Websockets. For the latter we use Java-Websocket. The application itself works great and we are in the middle of writing unit tests.
We use a WebSocket class as follows
class WebSocket(uri : URI) extends WebSocketClient(uri) {
  connectBlocking()
  var response = ""

  def onOpen(handshakedata : ServerHandshake) {
    println("onOpen")
  }

  def onMessage(message : String) {
    println("Received: " + message)
    response = message
  }

  def onClose(code : Int, reason : String, remote : Boolean) {
    println("onClose")
  }

  def onError(ex : Exception) {
    println("onError")
  }
}
A test might look like this (pseudo code)
websocketTest {
ws = new WebSocket("ws://example.org")
ws.send("foo")
res = ws.getResponse()
....
}
Sending and receiving data works. However, the problem is that connecting to the websocket creates a new thread, and only that new thread has access to response via the onMessage handler. What is the best way to either make the websocket implementation single-threaded or connect the two threads so that we can access the response in the test case? Or is there another, even better way of doing it? In the end, we should be able to somehow test the response of the websocket.
There are a number of ways you could try to do this. The issue will be that you might get an error or a successful response from the server. As a result, the best way is probably to use some sort of timeout. In the past I have used a pattern like (note, this is untested code):
...
use response in the onMessage handler like you did
...
long start = System.currentTimeMillis();
long timeout = 5000; // 5 seconds
while ((System.currentTimeMillis() - start) < timeout && response == null) {
    Thread.sleep(100);
}
if (response == null) {
    // timed out
} else {
    // do something with the response
}
If you want to be especially safe you can use an AtomicReference for the response.
Of course the timeout and sleep can be minimized based on your test case.
Moreover, you can wrap this in a utility method, for example:
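As a sketch of that AtomicReference-plus-utility-method idea in Java: it assumes the WebSocket class is adapted to publish its last message through an AtomicReference (the names and timeouts are illustrative):

import java.util.concurrent.atomic.AtomicReference;

public final class WebSocketTestUtil {

    private WebSocketTestUtil() {
    }

    // Polls the shared reference until a message arrives or the timeout elapses.
    // Returns null if nothing was received in time.
    public static String awaitResponse(AtomicReference<String> response,
                                       long timeoutMillis,
                                       long pollMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            String value = response.get();
            if (value != null) {
                return value;
            }
            Thread.sleep(pollMillis);
        }
        return null;
    }
}

// Usage in a test (sketch):
// ws.send("foo");
// String res = WebSocketTestUtil.awaitResponse(ws.response, 5000, 100);
// assert res != null;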
