Design suggestions for handling large mailboxes using the JavaMail API (IMAP) - java

We use the JavaMail API with IMAP and fetch messages from folders containing millions of messages. There are some rules and limitations:
We do not keep always-open connections to the mail server, so we cannot add listeners.
The messages will be stored in a local database with all properties: subject, body, received date, from, etc.
We cannot use multiple threads.
To keep performance at acceptable levels and prevent out-of-memory crashes, I am planning:
1. During the initial fetch, where all messages have to be fetched, store only the message headers and skip the body and attachments. The body and attachments of a message will be fetched only when requested by the client. The initialization can take hours; that is not a problem.
2. When fetching all messages at the start, use an appropriate fetch profile to make it faster, but process in blocks, for example:
Message m1[] = f.getMessages(1, 10000);
f.fetch(m1, fp);
//process m1 array
Message m2[] = f.getMessages(10001, 20000);
f.fetch(m2, fp);
//process m2 array
instead of
Message m_all[] = f.getMessages(1, NUMALLMESSAGES);
f.fetch(m_all, fp);
//process m_all array, may throw out of memory errors
3. After we have all the messages, store the UID of the most recent message in the DB, and on the next fetch perform:
f.getMessagesByUID(LASTUIDREADFROMDB, UIDMAX)
Do you have additional suggestions, or do you see any points we have to take care of (memory, performance)?
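For reference, a rough (untested) sketch of points 2 and 3 combined; the chunk size, the variable f from above, and lastUidFromDb are placeholders, not working code:
import javax.mail.*;
import com.sun.mail.imap.IMAPFolder;

// Initial fetch: headers only, in fixed-size chunks.
FetchProfile fp = new FetchProfile();
fp.add(FetchProfile.Item.ENVELOPE);        // subject, from, dates
fp.add(UIDFolder.FetchProfileItem.UID);    // so the last UID can be stored

int total = f.getMessageCount();
int chunk = 10000;
for (int start = 1; start <= total; start += chunk) {
    int end = Math.min(start + chunk - 1, total);
    Message[] block = f.getMessages(start, end);
    f.fetch(block, fp);                    // one bulk IMAP FETCH per block
    // ... store the headers in the DB, remember the UID of the last message ...
    // letting the block go out of scope keeps memory bounded
}

// Later runs: fetch only messages newer than the stored UID.
IMAPFolder imap = (IMAPFolder) f;
Message[] fresh = imap.getMessagesByUID(lastUidFromDb + 1, UIDFolder.LASTUID);
imap.fetch(fresh, fp);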

Related

JavaMail - search and sort together for a reduced set of emails

I currently have a huge problem that I need help with.
I'm not loading all emails at once.
I found the following function here for that:
Message[] messages = emailFolder.getMessages(start, end);
I know I can use SortTerm to sort the emails:
SortTerm sortTerm[] = new SortTerm[] { SortTerm.REVERSE, SortTerm.DATE };
Message[] messages = ((IMAPFolder) emailFolder).getSortedMessages(sortTerm);
But then I will again load all emails.
How can I use together:
- search
- sort
- and use getMessages(start, end)
A sample code would be very helpful.
Many thanks
To be clear, when using IMAP no messages are "loaded" when you call getMessages. All that happens is that the JavaMail client creates a Message object that refers to the message on the server, and sets it up so that the Message object will fetch the data for the message on the server when you ask for it.
You could create a SearchTerm that uses a pair of MessageNumberTerms to constrain the messages to a certain range, just as you were doing with "start, end". But you should ask yourself whether you really want to sort all the messages in the mailbox first by message number (effectively a forward sort by received date) and then reverse-sort them by sent date. What exactly are you trying to accomplish?
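If it helps, here is a rough, untested sketch of combining a search term with server-side sorting via IMAPFolder.getSortedMessages and then paging over the result; the FlagTerm is just a stand-in for whatever search you actually need:
import java.util.Arrays;
import javax.mail.*;
import javax.mail.search.FlagTerm;
import com.sun.mail.imap.IMAPFolder;
import com.sun.mail.imap.SortTerm;

IMAPFolder folder = (IMAPFolder) emailFolder;
SortTerm[] sortTerm = new SortTerm[] { SortTerm.REVERSE, SortTerm.DATE };
FlagTerm unseen = new FlagTerm(new Flags(Flags.Flag.SEEN), false);

// The server filters and sorts; only lightweight Message stubs come back.
Message[] sorted = folder.getSortedMessages(sortTerm, unseen);

// Page over the result; no content is fetched until you ask for it.
int start = 0, pageSize = 50;
Message[] page = Arrays.copyOfRange(sorted, start,
        Math.min(start + pageSize, sorted.length));
Note that this relies on the IMAP server advertising the SORT capability; if it does not, the getSortedMessages call will fail and you have to sort client-side.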

How to save a message into a database and send a response into a topic in an eventually consistent way?

I have the following RabbitMQ consumer:
Consumer consumer = new DefaultConsumer(channel) {
    @Override
    public void handleDelivery(String consumerTag, Envelope envelope, AMQP.BasicProperties properties, byte[] body) throws IOException {
        String message = new String(body, "UTF-8");
        sendNotificationIntoTopic(message);
        saveIntoDatabase(message);
    }
};
The following situation can occur:
The message was sent into the topic successfully.
The connection to the database was lost, so the database insert failed.
As a result we have data inconsistency.
The expected result is that either both actions are executed successfully or neither is executed at all.
Are there any solutions for how I can achieve this?
P.S.
Currently I have the following idea (please comment on it):
We can assume that the broker doesn't lose any messages.
We have to be subscribed to the topic we want to send to.
1. Save the entry into the database with a status field set to 'pending'.
2. Attempt to send the data to the topic. If the send was successful, update the status field to 'success'.
3. We need a scheduled job that checks rows with 'pending' status. At that moment two cases are possible:
3.1 The notification wasn't sent at all.
3.2 The notification was sent but the database save failed (the probability is very low, but it is possible).
So we have to distinguish those two cases somehow: we could store the messages from the topic in a collection, and the job could check whether each message was accepted or not. If the job finds a message that corresponds to a database row, we update the status to 'success'; otherwise we remove the entry from the database.
I think my idea has some weaknesses (for example, in a multi-node application we would have to store the messages in Hazelcast (or an analog), but that is an additional point of hypothetical failure).
Here is an example of the Try-Cancel/Confirm pattern https://servicecomb.apache.org/docs/distributed_saga_3/ that should be capable of dealing with your problem. You should tolerate some chance of double submission of the data via the queue. Here is an example (a rough sketch follows below):
1. Define an abstraction Operation and assign an ID to the operation, plus a timestamp.
2. Write status Pending to the database (you can do this in the same step as 1).
3. Write a listener that polls the database for all operations with status Pending that are older than a timeout.
4. For each pending operation, send the data via the queue with the assigned ID.
5. The recipient side should be aware of the ID, and if the ID has already been processed, nothing should happen.
6A. If you need to be 100% sure that the operation has completed, you need a second queue where the recipient side posts a message "ID - DONE". If such consistency is not necessary, skip this step. Alternatively it can post "ID - Failed" with the reason for the failure.
6B. The submitting side either waits for a message from 6A or completes the operation by writing status DONE to the database.
7. Once a certain timeout or retry limit has passed, write status FAIL to the operation.
8. You can potentially send a rollback message with the operation ID to the recipient side.
Notice that none of these steps involves a technical transaction. You can do this with a non-transactional database.
What I have written is a variation of the Try-Cancel/Confirm pattern, where each recipient of a message should know how to manage its own data.
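A very rough sketch of the operation table and poller described above; the table layout, DB_URL, payload, sendToQueue and the Postgres-style SQL are all assumptions:
import java.sql.*;
import java.util.UUID;

// Steps 1-2: record the operation with an ID, a timestamp and status PENDING.
String id = UUID.randomUUID().toString();
try (Connection c = DriverManager.getConnection(DB_URL);
     PreparedStatement ps = c.prepareStatement(
         "INSERT INTO operation (id, payload, status, created_at) VALUES (?, ?, 'PENDING', now())")) {
    ps.setString(1, id);
    ps.setString(2, payload);
    ps.executeUpdate();
}

// Steps 3-4: a poller re-sends every PENDING operation older than the timeout.
try (Connection c = DriverManager.getConnection(DB_URL);
     PreparedStatement ps = c.prepareStatement(
         "SELECT id, payload FROM operation WHERE status = 'PENDING' " +
         "AND created_at < now() - interval '1 minute'");
     ResultSet rs = ps.executeQuery()) {
    while (rs.next()) {
        // The ID travels with the message so the recipient can deduplicate (step 5).
        sendToQueue(rs.getString("id"), rs.getString("payload"));
    }
}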
1. In the listener, save the database row with the field status='pending'.
2. Another job (a separate thread) will obtain all pending rows from the DB and, for each row:
2.1 send the data to the topic
2.2 update the row in the database (mark it as sent)
If we fail at step 1, everything is OK: the data is in a consistent state because the job won't know anything about it.
If we fail at step 2.1, no problem: the next job invocation will attempt to handle it.
If we fail at step 2.2, it means the next job invocation will handle the same data again. At first glance you might think that is a problem, but your consumer has to be idempotent: it has to recognize that a message was already processed and skip it. This requirement is a consequence of the fact that all message brokers only guarantee that a message will be delivered AT LEAST ONCE, so our consumers have to be ready for duplicated messages anyway. No problem again (see the idempotent-consumer sketch below).
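A minimal sketch of such an idempotent consumer; the processed_message table (with id as primary key), DB_URL, process() and the Postgres-style upsert are assumptions:
import java.sql.*;

void handle(String messageId, String body) throws SQLException {
    try (Connection c = DriverManager.getConnection(DB_URL);
         PreparedStatement ps = c.prepareStatement(
             // inserts 0 rows if this ID was already seen
             "INSERT INTO processed_message (id) VALUES (?) ON CONFLICT DO NOTHING")) {
        ps.setString(1, messageId);
        if (ps.executeUpdate() == 0) {
            return; // duplicate delivery, already processed: skip
        }
    }
    process(body); // the actual business logic
}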
Here's the pseudocode for how I'd do it (assuming the DAO layer has transactional capability and your messaging layer doesn't):
// Start a transaction
try {
    String message = new String(body, "UTF-8");
    // Ordering is important here: the database has commit and rollback
    // capabilities, but the messaging system doesn't.
    saveIntoDatabase(message);
    sendNotificationIntoTopic(message);
} catch (MessageDeliveryException e) {
    // Roll back the transaction
    // Throw a domain-specific exception
}
// Commit the transaction
Scenarios:
1. If the database fails, the message won't be sent, as the exception will break the code flow.
2. If the database call succeeds and the messaging system fails to deliver, catch the exception and roll back the database changes.
All the actions necessary for logging and replaying the failures can live outside this method.
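Spelled out with plain JDBC (still assuming only the database is transactional; dataSource and the saveIntoDatabase overload taking a Connection are hypothetical), the pseudocode might look like:
import java.sql.Connection;
import javax.sql.DataSource;

void handleDelivery(byte[] body) throws Exception {
    Connection conn = dataSource.getConnection();
    try {
        conn.setAutoCommit(false);
        String message = new String(body, "UTF-8");
        saveIntoDatabase(conn, message);     // DB first: this write can still be rolled back
        sendNotificationIntoTopic(message);  // the broker send cannot be rolled back
        conn.commit();                       // commit only after the send succeeded
    } catch (Exception e) {
        conn.rollback();                     // undo the DB write if the send failed
        throw e;
    } finally {
        conn.close();
    }
}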
If there is enough time to modify the design, it is recommended to use JTA-like APIs to manage two-phase commit. Even WebLogic and WebSphere support XA resources for two-phase commit.
If the timeline is short, it is suggested to proceed as below to reduce the failure window (a JMS sketch follows below):
1. Send the data to the topic (no commit); in case the topic is down, retry at an interval.
2. Write the data into the DB.
3. Commit the DB.
4. Commit the topic.
Here a failure matters only when step 4 fails, and it will result in the same message being sent again, so the receiving system will receive a duplicate message. Each message has a unique messageID and correlationID in the JMS 2.0 structure, so finding duplicates is fairly straightforward (but this has to be handled on the receiving system).
Both cases will work in a clustered environment as well.
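With JMS, the "send uncommitted, commit DB, then commit topic" ordering above might look roughly like this; connection, topic, payload, writeDataIntoDb and dbConnection are assumptions:
import javax.jms.*;

// Transacted session: send() is not visible to consumers until commit()
Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
MessageProducer producer = session.createProducer(topic);

producer.send(session.createTextMessage(payload)); // 1. send, no commit
writeDataIntoDb(payload);                          // 2. write DB
dbConnection.commit();                             // 3. commit DB
session.commit();                                  // 4. commit topic; a failure here leads
                                                   //    to a re-send, so receivers deduplicate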
Specific to your case, the steps below might help overcome the issue:
Subscribe a listener, listener-1, to your topic.
Process-1:
1. Add a DB entry with status 'to be sent' for message msg-1.
2. Send message msg-1 to the topic; retry sending in case of any topic failure.
3. If step 2 failed after a certain number of retries, process-1 has to resend msg-1 before sending any new messages, OR step 1 has to be rolled back.
Listener-1:
Using the subscribed listener, read the reference (messageID/correlationID) from the topic, update the DB status to SENT, and read/remove the message from the topic. In case the reference read succeeds and the DB update fails, the topic still has the message, so the next read will update the DB. In case the DB update succeeds and the message removal fails, the listener will read the message again and try to update a row that is already done, so it can be ignored after validation.
In case the listener itself is down, the topic will retain messages until the listener reads them. Until then, messages that were actually sent will still have status 'to be sent'.

Bulk Mail Sending Through AWS SES

I am using Amazon AWS SES to send my email campaigns. I have around 35,000 subscribers on my list. At present I am using code similar to the following.
for (Entry<Integer, String> emailEntry : email_ids.entrySet()) {
    MimeMessage msg = getMimeMessage(emailEntry.getKey(), emailEntry.getValue());
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    msg.writeTo(outputStream);
    RawMessage rawMessage = new RawMessage(ByteBuffer.wrap(outputStream.toByteArray()));
    ses.sendRawEmail(new SendRawEmailRequest(rawMessage));
}
This way I was able to send email to all my subscribers the way I wanted. But there was a huge bill for data transfer: each MimeMessage is about 150 KB in size, and sending it to 35,000 subscribers resulted in 5.5 GB of data transfer.
So I decided to use BulkTemplateEmail in my application: create the template once and send it to the 35,000 email addresses. This way the content has to be sent to SES only once, and there will be a significant gain in terms of data transfer.
Can anyone provide me a sample of how to do this via the Java AWS SDK? I also want to add a List-Unsubscribe header for each Destination. This is where I am actually stuck: I couldn't find any methods to add custom email headers per Destination. Is this possible with BulkTemplateEmail?
Any info is highly appreciated.
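For the templated part, a minimal sketch with the AWS SDK for Java v1 is below. As far as I can tell, SendBulkTemplatedEmail only takes per-destination ReplacementTemplateData, not custom headers, so a per-recipient List-Unsubscribe value would have to live inside the template itself; the region setup, template name and template data here are assumptions:
import com.amazonaws.services.simpleemail.AmazonSimpleEmailService;
import com.amazonaws.services.simpleemail.AmazonSimpleEmailServiceClientBuilder;
import com.amazonaws.services.simpleemail.model.*;

AmazonSimpleEmailService ses = AmazonSimpleEmailServiceClientBuilder.defaultClient();

SendBulkTemplatedEmailRequest request = new SendBulkTemplatedEmailRequest()
    .withSource("newsletter@example.com")
    .withTemplate("MyCampaignTemplate")          // created once via CreateTemplate
    .withDefaultTemplateData("{\"name\":\"subscriber\"}")
    .withDestinations(
        new BulkEmailDestination()
            .withDestination(new Destination().withToAddresses("user1@example.com"))
            .withReplacementTemplateData("{\"name\":\"User One\"}"),
        new BulkEmailDestination()
            .withDestination(new Destination().withToAddresses("user2@example.com"))
            .withReplacementTemplateData("{\"name\":\"User Two\"}"));

ses.sendBulkTemplatedEmail(request);             // up to 50 destinations per call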
When sending emails using SES, Amazon charges for data transfer out. The current price is $0.12 per GB. For large volumes of email this can result in serious charges.
Amazon SES pricing
For embedded images, attachments, etc., another solution is to use links instead of embedded objects. This way you can mitigate and reduce data transfer fees. The impact can be moderate to high for email campaigns where many emails are never opened, saving on the data transfer charges.
If your links reference files on your EC2 instances, remember that you will still be charged for data out from your EC2 instances. S3 will provide a lower cost.

Entity & Entity Properties. Database design for effective searching

For the last two days I've been searching for a suitable solution to the problem described below.
In my standalone notification-service module I have an abstract Message entity. Message has 'to', 'from', 'sentAt', 'receivedAt' and other attributes. The responsibilities of the notification-service are to:
send new messages using different registered message providers (SMS, EMAIL, Skype, etc.)
receive new messages from registered message providers
update the status of already sent messages.
The notification-service module is developed as a standalone module that is exposed via the SOAP protocol. Many clients can use this module to send messages or to search through already received ones.
Clients want to attach some properties (something like tags) when sending messages, for later searching of messages by these properties. These properties make sense only in the client's environment.
For example, client A might want to send a message and save the following custom properties:
1. The internal system ID of the user to whom the system sends the message
2. A distinguishing flag (whether the ID relates to users, admins or clients)
3. A notification flag (notification/alert/...)
Client B might want to send a message and save another set of custom properties:
1. The internal system operator ID (who sends the SMS)
2. The template ID that was used to send the message
The custom properties can be used by the clients to search already sent messages.
For example:
Client A could find SMS messages sent to administrator users in the period [Date 1; Date 2] that have 'alert' status.
Client B could find all notifications sent with a specified template.
Of course, the data should be fetched page by page.
At first I created the following database model:
Database scheme
To find all messages with the specified properties I tried to use this query:
SELECT * FROM (
    SELECT message_id FROM custom_message_properties
    WHERE CONCAT(CONCAT(key, ':'), value) IN ('property1:value1', 'property2:value2')
    GROUP BY message_id HAVING count(*) = 2
) AS cmp
JOIN message m ON cmp.message_id = m.id
ORDER BY id LIMIT 100 OFFSET 0
The query worked fine (although it doesn't seem very good to me) on a database with a small amount of data. Then I decided to check the results against realistic data volumes.
So I generated 10,000,000 messages with 40,000,000 custom properties and checked the result. The execution time was ~2 minutes. The most time-consuming operation was the following sub-select:
SELECT message_id FROM custom_message_properties
WHERE CONCAT(CONCAT(key, ':'), value) IN ('property1:value1', 'property2:value2')
I understand that string comparison is very slow because the database index is not used. So I decided to change the database structure and merge the 'key' and 'value' columns into a single one. So I updated my database schema:
Updated database scheme
I checked the result again. Now the execution time was ~20 seconds. That's much better, but still not suitable for production use.
So now I have no idea how to improve performance without significant changes to the application architecture.
The only thought I have is to create a separate table for each client with the required client properties:
CREATE TABLE client_i_custom_properties (  -- one such table per client
    mid bigint REFERENCES message (id),
    p1  type1,
    p2  type2,
    ......
    pn  type(n)
);
I have spent a lot of time trying to find any useful information. I have also analyzed the 'stackoverflow' database, because it seemed to me that it should be quite similar. But on 'stackoverflow' there are ~50,000 different tags, which is not far from the number my database could have.
Any help is appreciated. Thanks in advance!
The project environment that I use:
Postgres database (9.6)
Java 1.8
Spring modules (spring-boot, spring-data-jpa + hibernate, spring-ws, etc).
I have not found any suitable solution except creating an additional table with the client's properties for each client.
I know that this solution is not very flexible,
but now the search query time is less than 1 second.
In the future, I will try to solve the same problem using a NoSQL data store.

Reading from JavaMail takes a long time

I use JavaMail to read mails from an Exchange account using the IMAP protocol. Those mails are in plain text format and their contents are XML.
Almost all of those mails are small (usually under 100 KB). However, sometimes I have to deal with large mails (about 10-15 MB). For example, yesterday I received an email of 13 MB. It took more than 50 minutes just to read it. Is that normal? Is there a way to improve the performance?
The code is:
Session sesion = Session.getInstance(System.getProperties());
Store store = sesion.getStore("imap");
store.connect(host, user, passwd);
Folder inbox = store.getFolder("INBOX");
inbox.open(Folder.READ_WRITE);
Message[] messages = inbox.search(new FlagTerm(new Flags(Flags.Flag.SEEN), false));
for (int i = 0; i < messages.length; i++) {
    Object contents = messages[i].getContent(); // Here it takes 50 min on a 13 MB mail
    // ...
}
The method that takes such a long time is messages[i].getContent(). What am I doing wrong? Any hints?
Thanks a lot, and sorry for my English! ;)
I finally solved this issue and wanted to share.
The solution, at least the one that worked for me, was found on this site: http://www.oracle.com/technetwork/java/faq-135477.html#imapserverbug
So my original code from the first post becomes this:
Session sesion = Session.getInstance(System.getProperties());
Store store = sesion.getStore("imap");
store.connect(host, user, passwd);
Folder inbox = store.getFolder("INBOX");
inbox.open(Folder.READ_WRITE);
// Search as before; each message in an IMAP folder is a MimeMessage under the hood
Message[] messages = inbox.search(new FlagTerm(new Flags(Flags.Flag.SEEN), false));
for (int i = 0; i < messages.length; i++) {
    // Create a new message using the MimeMessage copy constructor
    MimeMessage cmsg = new MimeMessage((MimeMessage) messages[i]);
    // Use this copy to read the contents
    Object obj = cmsg.getContent();
    // ....
}
The trick is to use the MimeMessage copy constructor to create a new MimeMessage and read its contents, instead of reading the original message.
Note that such an object is not actually connected to the server, so any changes you make to it, such as setting flags, won't take effect. Any change to the message has to be made on the original message.
To sum up: this solution works for reading large plain text mails (up to 15 MB) from an Exchange server over the IMAP protocol. The time dropped from 51-55 minutes to read a 13 MB mail to 9 seconds for the same mail. Unbelievable.
Hope this helps someone, and sorry for any English mistakes ;)
It will always be messages[i].getContent() that is the slowest part of the code, because normally the IMAP server does not cache this part of the message data. Nevertheless, you can try this:
FetchProfile fp = new FetchProfile();
fp.add(FetchProfile.Item.ENVELOPE);
fp.add(FetchProfile.Item.FLAGS);
fp.add(FetchProfile.Item.CONTENT_INFO);
fp.add("X-mailer");
and after you have specified the fetch profile, do your search/fetch of messages.
Basically the concept is that the IMAP provider fetches the data for a message from the server only when necessary. (The javax.mail.FetchProfile is used to optimize this). The header and body structure information, once fetched, is always cached within the Message object. However, the content of a bodypart is not cached. So each time the content is requested by the client (either using getContent() or using getInputStream()), a new FETCH request is issued to the server. The reason for this is that the content of a message could be potentially large, and if we cache this content for a large number of messages, there is the possibility that the system may run out of memory soon since the garbage collector cannot free the referenced objects. Clients should be aware of this and must hold on to the retrieved content themselves if needed.
By using the above code snippet you can hope for some speed improvement, but it depends entirely on your IMAP server whether it works or not. Many of the big IMAP servers do not support this behaviour because of the load issue mentioned in the previous paragraph, so you may not gain any speed.
Using the Folder.fetch method you can prefetch, in one operation, the metadata for multiple messages. That will reduce the time to process each message, but won't help that much with a huge message.
To handle huge message parts efficiently, you'll generally want to use the getInputStream method to process the data incrementally, rather than using the getContent method to read all the data in and create a huge String object with all the data.
You can also tune the fetching by setting the "mail.imap.fetchsize" property, which defaults to 16384. If most of your messages are less than 100K and you always need to read all of the data in a message, you might set fetchsize to 100K. That will make small messages much faster and large messages more efficient.
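A sketch of both suggestions combined; the property value and the stream handling are illustrative, and messages[i] refers to a result of the earlier search:
import java.io.InputStream;
import java.util.Properties;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;

Properties props = System.getProperties();
props.setProperty("mail.imap.fetchsize", "102400"); // ~100K instead of the 16K default

Session session = Session.getInstance(props);
// ... connect, open the folder and search as before ...

MimeMessage msg = (MimeMessage) messages[i];
try (InputStream in = msg.getInputStream()) {       // stream the body instead of
    byte[] buf = new byte[8192];                    // building one huge String
    for (int n; (n = in.read(buf)) != -1; ) {
        // ... feed buf[0..n) to the XML parser, a file, etc. ...
    }
}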
I had a similar issue: fetching mails via IMAP was very slow. Furthermore, I had another issue downloading large attachments. After a look at the JavaMail FAQ I found the solution to the latter issue in this question, which advises setting mail.imap.partialfetch (respectively mail.imaps.partialfetch) to false. This not only fixes the download issue but the slow reading of messages as well.
The referenced JavaMail notes.txt says:
Due to a problem in the Microsoft Exchange IMAP server, insufficient
number of bytes may be retrieved when reading big messages. There
are two ways to workaround this Exchange bug:
(a) The Exchange IMAP server provides a configuration option called
"fast message retrieval" to the UI. Simply go to the site, server
or recipient, click on the "IMAP4" tab, and one of the check boxes
is "enable fast message retrieval". Turn it off and the octet
counts will be exact. This is fully described at
http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q191504
(b) Set the "mail.imap.partialfetch" property to false. You'll have
to set this property in the Properties object that you provide to
your Session.
Certain IMAP servers do not implement the IMAP Partial FETCH
functionality properly. This problem typically manifests as corrupt
email attachments when downloading large messages from the IMAP
server. To workaround this server bug, set the
"mail.imap.partialfetch"
property to false. You'll have to set this property in the Properties
object that you provide to your Session.
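Concretely, that amounts to something like the following; the property has to be set before the Session is created:
import java.util.Properties;
import javax.mail.Session;
import javax.mail.Store;

Properties props = System.getProperties();
props.setProperty("mail.imap.partialfetch", "false");   // plain IMAP
props.setProperty("mail.imaps.partialfetch", "false");  // IMAP over SSL
Session session = Session.getInstance(props);
Store store = session.getStore("imap");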
