JavaMail API - Reading a large outlook mailbox (>3000) for message content - java

I have a requirement to read a mailbox containing more than 3000 mails. I need to read these mails, fetch their contents, and feed each body into another API. It's easy to do with a few mails (for me, approximately 250), but beyond that it slows down significantly. Is the accepted answer in this link the only choice, or is there an alternative?
NOTE: I have purposely not pasted any snippet, as I used the straightforward approach, and yes, I did use FetchProfile too.

JavaMail IMAP performance is usually controlled by the speed of the server, the number of network round trips required, and the amount of data being read. Using a FetchProfile is essential to reducing the number of round trips. Don't forget to consider the IMAP-specific FetchProfile items.
JavaMail will fetch message contents a buffer at a time. Large messages will obviously require many buffer fetches, and thus many round trips. You can change the size of the buffer (default 16K) by setting the mail.imap.fetchsize property. Or you can disable these partial fetches and require it to fetch the entire contents in one operation by setting the mail.imap.partialfetch property to false. Obviously the latter will require significant memory on the client if large messages are being read.
The JavaMail IMAP provider does not (usually; see below) cache message contents on the client, but it does cache message headers. When processing a very large number of messages it is sometimes helpful to invalidate the cache of headers when done processing a message by calling the IMAPMessage.invalidateHeaders method. When using IMAPFolder.FetchProfileItem.MESSAGE, the message contents are cached, and will also be invalidated by the above call.
Beyond that, you should examine the JavaMail debug output to ensure only the expected IMAP commands are being issued and that you're not doing something in your program that would cause it to issue unnecessary IMAP commands. You can also look at time-stamps for the protocol commands to determine whether the time is being spent on the server or the client.
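The tuning knobs mentioned above are plain session properties, so they can be sketched without any JavaMail classes. The class name, method name, and the 1 MB value below are illustrative assumptions; the property names are the ones named in this answer plus mail.debug for the debug output.

```java
import java.util.Properties;

class ImapTuning {
    /**
     * Builds session properties for the options discussed above.
     * fetchSizeBytes <= 0 means "disable partial fetch entirely".
     */
    static Properties imapProperties(int fetchSizeBytes, boolean debug) {
        Properties props = new Properties();
        if (fetchSizeBytes > 0) {
            // Larger buffer: fewer round trips per large message (default is 16K).
            props.setProperty("mail.imap.fetchsize", Integer.toString(fetchSizeBytes));
        } else {
            // Fetch each message body in one operation; needs enough client memory.
            props.setProperty("mail.imap.partialfetch", "false");
        }
        // Protocol trace, useful for spotting unexpected IMAP commands.
        props.setProperty("mail.debug", Boolean.toString(debug));
        return props;
    }

    public static void main(String[] args) {
        // 1 MB is an arbitrary illustrative value, not a recommendation.
        Properties props = imapProperties(1024 * 1024, true);
        props.list(System.out);
        // With JavaMail on the classpath: Session session = Session.getInstance(props);
    }
}
```

If the store is opened with the "imaps" protocol, the same properties are read under the mail.imaps. prefix instead.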
Only after all of that has failed to yield acceptable performance, and you're sure the performance problems are not on the server (which you can't fix), would you need to look into custom IMAP commands as suggested in the link you referred to.

Related

Find byte offsets for e-mail attachments

I got a requirement to deliver emails to a legacy system that needs to read the attachments.
For each part in a multipart email I need to provide the byte offset for where the attachment starts in the email, so the legacy system doesn't need to know how to parse emails.
Performance and memory usage is an issue, so the solution can't load the entire email into memory. And to my eyes that leaves out javax.mail.
How would you go about it in Java?
My first idea was to use mime4j, but the library does not keep track of byte offsets or even line numbers.
I investigated making a PR to mime4j to add tracking of line numbers and byte offsets. But it is not very easy, since it is a very mature project and it uses lots of buffering internally.
Now I am thinking that maybe I am going about this the wrong way. So I would very much appreciate any ideas on how to solve this in a simple manner.
You're going to run into issues if you just send the byte offsets and the full email, as the parts can still be base64 or quoted-printable encoded.
You'll want to use a MimeStreamParser and give your own ContentHandler and override the body method. You can then directly send the BodyDescriptor and InputStream to the legacy system. The InputStream is the "decoded" email (IE handles any Content-Transfer-Encoding). The BodyDescriptor is useful to extract stuff from the headers of the part that you may care about (MimeType and Charset are the most useful ones).
This does not buffer the whole email, and allows you to stream out just the body parts. I'm not sure how you're communicating with the legacy system (via the network or if it's an inprocess subcomponent) but hopefully that works!
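To illustrate why a raw byte range is not enough for an encoded part, here is a minimal sketch that uses only java.util.Base64 from the JDK, not mime4j. The class and method names are hypothetical. It streams a base64 body through a decoder without buffering the whole part and shows that the bytes on the wire differ in length from the decoded attachment.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Base64;

class EncodedPartDemo {
    /** Streams a base64-encoded part through a decoder and returns the decoded size. */
    static long decodedLength(InputStream encodedPart) {
        // The MIME decoder ignores the line breaks that mail encoders insert.
        try (InputStream decoded = Base64.getMimeDecoder().wrap(encodedPart)) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = decoded.read(buf)) != -1) {
                total += n; // a real handler would forward buf[0..n) to the legacy system
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] attachment = new byte[100];                       // stand-in for attachment bytes
        byte[] onTheWire = Base64.getMimeEncoder().encode(attachment);
        System.out.println("bytes inside the email: " + onTheWire.length);
        System.out.println("decoded attachment bytes: "
                + decodedLength(new ByteArrayInputStream(onTheWire)));
    }
}
```

A legacy system that is handed only an offset into the raw message would see the encoded form, which is why streaming the decoded InputStream per part is the safer contract.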

How to Implement Huawei's Chunked File Upload Using Java

I need to implement a deployment pipeline, and at the end of the pipeline we upload a file, in this case to Huawei's app store. For a file larger than 5 megabytes, we have to use a chunked API. I'm not familiar with how chunked uploads work. Can someone give me an implementation guideline, preferably in Java, for such a mechanism? The API parameters are as follows:
Edit :
In response to the comment below, let me clarify my question. Looking for references on how to do a chunked request, I found that libraries such as httpclient and okhttp simply set the chunk flag to true and seem to hide the details from the library's client:
https://www.java-tips.org/other-api-tips-100035/147-httpclient/1359-how-to-use-unbuffered-chunk-encoded-post-request.html
Yet the input parameters of the API seem to expect that I manage the chunks manually, since it expects a ChunkSize and a sequence number. I'm thinking that I might need to use the plain Java HTTP interface to work with the API, yet I failed to find any good source to get me started. If anyone could give me a reference or implementation guidance, that would definitely help.
More updates :
I tried to manually chunk my file into several parts, each 1 megabyte in size. Then I thought I could try calling the API for every chunk, using multipart/form-data. But the server side always closes the connection before writing even begins, causing: Connection reset by peer: socket write error.
It shouldn't be a proxy issue, since I have set it up, and I could get the token, url and auth code without problem.
File segmentation: suppose a file of more than a few gigabytes is uploaded to the server. If you can get away with the simplest approach (upload, receive, process, succeed), I can only say that your server is very good. Even if the server is good enough, this kind of operation is usually not allowed, so we have to find another way to solve the problem.
First of all, we have to solve the problem of large files. There is no way around cutting them into blocks of a few megabytes, sending the blocks to the server one at a time, and saving them there. Name these blocks with the MD5 of the source file + index. Some people use UUID + index to name them instead. The differences between the two will be described in detail below. When you upload these small files to the server separately, it is better to save these records to the database.
(1) When the first block upload is completed, write the name, type, MD5, upload date, address and an unfinished status of the source file to a table; once splicing is complete, change the status to finished. Call this the file table for now.
(2) After each block upload, save a record in the database: the MD5 + index name of the source file, the MD5 of the block (this is a key point), the upload time and the file address. Save it into the database and name the table file__ TEM.
Instant upload ("second transmission"): many online disks implement this function. At the beginning of an upload, send an Ajax request to check whether the file to be uploaded already exists. HTML5 provides a way to obtain the MD5 of the file; then use Ajax to ask whether that MD5 exists in the file table and whether its status is finished. If it exists, also verify that the local file still exists. If both hold, you can return that status to the front end, and then you can proudly tell the customer that the upload finished in seconds.
here is the link:
https://blog.csdn.net/weixin_42584752/article/details/80873376
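The block naming scheme described above (MD5 of the source file + index, plus an MD5 per block) can be sketched as follows. This is not Huawei's API: the class, field names, and the "_" separator are assumptions for illustration, and the source is held in a byte array only to keep the sketch short. A real uploader would read each block from the file instead.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkPlanner {
    /** One block of the source file. */
    static final class Chunk {
        final String name;  // MD5 of the whole source + "_" + index
        final int index;    // sequence number, starting at 0
        final String md5;   // MD5 of this block only, for server-side verification
        final byte[] data;

        Chunk(String name, int index, String md5, byte[] data) {
            this.name = name;
            this.index = index;
            this.md5 = md5;
            this.data = data;
        }
    }

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Splits the source into blocks of at most chunkSize bytes. */
    static List<Chunk> split(byte[] source, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        String sourceMd5 = md5Hex(source);
        List<Chunk> chunks = new ArrayList<>();
        int index = 0;
        for (int offset = 0; offset < source.length; offset += chunkSize) {
            int end = Math.min(source.length, offset + chunkSize);
            byte[] block = Arrays.copyOfRange(source, offset, end);
            chunks.add(new Chunk(sourceMd5 + "_" + index, index, md5Hex(block), block));
            index++;
        }
        return chunks;
    }
}
```

Each Chunk would then be sent in its own request, with index as the sequence number, and the server can compare the per-block MD5 before recording the block.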

Ways to buffer REST response

There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
The application processes the data at its own pace, and as incoming data volumes grow, I'm starting to hit the REST endpoint timeout.
Meaning, processing speed is less than network throughput.
Unfortunately, there's no way to raise processing speed enough, as there's no "enough" - incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing, in order to release REST endpoint connection before timeout occurs.
What I've come up with so far is downloading incoming data to a temporary file and reading (processing) said file simultaneously using OutputStream/InputStream.
Sort of buffering, using a file.
This brings its own problems:
what if processing speed becomes faster than downloading speed for some time and I get EOF?
the file parser operates with ObjectInputStream, and it behaves weirdly on an empty file/EOF
and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Upd:
I'd like to point out: http server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app processing speed is far greater than their sample client performance metrics. Still, we cannot match our app performance with network throughput.
Server does not support http range requests or pagination.
There's no way to divide data in chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
Shortly: we can download all the data in a given time before timeout occurs, but can not process it.
Having an adapter between the InputStream and the OutputStream that performs as a blocking queue will help a ton.
You're using something like new ObjectInputStream(new FileInputStream(...)), and the solution for EOF could be wrapping the FileInputStream first in a WriterAwareStream, which would block when hitting EOF as long as the writer is still writing.
Anyway, in case latency doesn't matter much, I would not bother to start processing before the download has finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream internally using a queue, which reads from its input stream, and, in case it has a lot of data, it spits them out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, automatically switching from using memory to using a file when getting big, but I'm afraid, it's optimized for small sizes (with tens of gigabytes expected, there's no point of trying to use memory).
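The WriterAwareStream idea above could look roughly like this. It is a minimal sketch, not an existing library class: the name WriterAwareInputStream, the polling interval, and the BooleanSupplier used to learn that the download has finished are all assumptions. It relies on the fact that reading a FileInputStream again after EOF returns new data once the file has grown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.util.function.BooleanSupplier;

/** Treats EOF as "no data yet" until the writer reports that it is done. */
class WriterAwareInputStream extends InputStream {
    private final InputStream in;             // e.g. a FileInputStream on the temp file
    private final BooleanSupplier writerDone; // set by the download thread after its last write
    private final long pollMillis;

    WriterAwareInputStream(InputStream in, BooleanSupplier writerDone, long pollMillis) {
        this.in = in;
        this.writerDone = writerDone;
        this.pollMillis = pollMillis;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) < 0 ? -1 : one[0] & 0xff;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            // Sample the flag before reading, so bytes written just before
            // the writer finished are still picked up on this pass.
            boolean done = writerDone.getAsBoolean();
            int n = in.read(b, off, len);
            if (n > 0) {
                return n;
            }
            if (done) {
                return -1; // real end of data
            }
            try {
                Thread.sleep(pollMillis); // writer still active: wait and retry
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InterruptedIOException("interrupted while waiting for writer");
            }
        }
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

The parser side would then be built as new ObjectInputStream(new WriterAwareInputStream(new FileInputStream(file), done::get, 50)), where done is, for example, an AtomicBoolean that the download thread sets after closing its OutputStream. Polling is the simplest option; a lock and condition shared with the writer would avoid the sleep.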
Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes the Range Requests
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.
This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.
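The record layout described above (a shared series ID, an ordinal per chunk, and a final "done" record acting as EOF) can be sketched independently of any broker. Nothing here touches RabbitMQ or Kafka; the class and field names are hypothetical and the messages are kept in a plain list to show only the partitioning and reassembly logic.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ChunkedSeries {
    static final class Message {
        final String seriesId; // ties all chunks of one download together
        final int ordinal;     // position within the series
        final boolean last;    // "done" flag, acts as EOF for the consumer
        final byte[] payload;

        Message(String seriesId, int ordinal, boolean last, byte[] payload) {
            this.seriesId = seriesId;
            this.ordinal = ordinal;
            this.last = last;
            this.payload = payload;
        }
    }

    /** Cuts the data into chunkSize pieces and appends an empty "done" record. */
    static List<Message> partition(String seriesId, byte[] data, int chunkSize) {
        List<Message> out = new ArrayList<>();
        int ordinal = 0;
        for (int off = 0; off < data.length; off += chunkSize) {
            int end = Math.min(data.length, off + chunkSize);
            out.add(new Message(seriesId, ordinal++, false, Arrays.copyOfRange(data, off, end)));
        }
        out.add(new Message(seriesId, ordinal, true, new byte[0]));
        return out;
    }

    /** Puts the chunks back in order; fails if a chunk or the "done" record is missing. */
    static byte[] reassemble(List<Message> received) {
        List<Message> sorted = new ArrayList<>(received);
        sorted.sort(Comparator.comparingInt((Message m) -> m.ordinal));
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < sorted.size(); i++) {
            Message m = sorted.get(i);
            if (m.ordinal != i) {
                throw new IllegalStateException("missing chunk " + i);
            }
            buf.write(m.payload, 0, m.payload.length);
        }
        if (sorted.isEmpty() || !sorted.get(sorted.size() - 1).last) {
            throw new IllegalStateException("series not finished");
        }
        return buf.toByteArray();
    }
}
```

With tens of gigabytes the consumer would process each chunk as it arrives rather than reassembling in memory; the ordinal and "done" checks stay the same.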
not sure if this would help in your case because you haven't mentioned what structure & format the data are coming to you in, however, i'll assume a beautifully normalised, deeply nested hierarchical xml (ie. pretty much the worst case for streaming, right? ... pega bix?)
i propose a partial solution that could allow you to sidestep the limitation of your not being able to control how your client interacts with the http data server -
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in http querying library, a command-line util such as aria2c curl wget et. al, an etl (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (ie. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying xml on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this) but this would take effort - i could discuss this more if a sample of the data structure was provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version 'new data' are available
^
in trade you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
last (two?) good result(s) compressed (in my experience, gigabytes of xml packs down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?
You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/
Camel could maybe help you regulate the network load between the REST producer and the consumer?
You might for instance introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, apply some throttling policy, before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
.throttle(...)
.to("http://myserver.com:8080/myrealwebservice");
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.
If you have enough memory, Maybe you can use in-memory data store like Redis.
When you get data from your Rest endpoint you can save your data into Redis list (or any other data structure which is appropriate for you).
Your consumer will consume data from the list.

Java Chat system protocol design, how to determine message type?

I have a chat program implemented in Java. The client can send lots of different types of information to the server (i.e, Joins the server and sends username, password; requests a private chat with another user on the server, disconnects from the server, etc).
I'm looking for the correct way to have the server/client differentiate between 'text' messages that are just meant to be chat text messages sent from one client to the others, and 'command' messages (disconnect, request private chat, request file transfer, etc) that are meant for the server or the client.
I see two options:
Use serialized objects, and determine what they are on the receiving end by doing an 'instanceof'
Send the data as a byte array, reserving the first N bytes of the array to specify the 'type' of the incoming data.
What is the 'correct' way to do this? How do real protocols (oscar, irc) handle this situation?
I've googled around on this topic and only found examples/discussions centering on simple java chat applications. None that go into detail about protocol design (which I ultimately intend to practice).
Thanks to any help...
Second approach is much better, because serialization is a complex mechanism, that can be easily used in a wrong way (for example you may bind yourself to internal content of a concrete serialized class). Plus your protocol will be bound to JVM mechanism.
Using some "protocol header" for message differentiation is a common way in network protocols (FTP, HTTP, etc). It is even better when it is in a text form (people will be able to read it).
You typically have a little message header identifying the type of content in all messages, including standard text/chat messages.
Either of your two suggestions is fine. (In your second approach, you probably want to reserve some bytes for the length of the array as well.)
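A minimal sketch of the second approach, with a one-byte type and a four-byte length in front of every payload. The type codes, the size cap, and the class names are assumptions made up for this example and do not come from any real chat protocol.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class Frames {
    // Hypothetical message types; a real protocol would define its own table.
    static final byte TYPE_TEXT = 1;
    static final byte TYPE_COMMAND = 2;
    static final int MAX_PAYLOAD = 1 << 20; // arbitrary cap against bogus lengths

    static final class Frame {
        final byte type;
        final byte[] payload;

        Frame(byte type, byte[] payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    /** Writes: 1 byte type, 4 bytes big-endian length, then the payload. */
    static void write(DataOutputStream out, byte type, byte[] payload) throws IOException {
        out.writeByte(type);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
    }

    /** Reads exactly one frame, blocking until it is complete. */
    static Frame read(DataInputStream in) throws IOException {
        byte type = in.readByte();
        int length = in.readInt();
        if (length < 0 || length > MAX_PAYLOAD) {
            throw new IOException("invalid frame length: " + length);
        }
        byte[] payload = new byte[length];
        in.readFully(payload);
        return new Frame(type, payload);
    }
}
```

The receiver then switches on frame.type: text frames are relayed to the other clients, command frames are handled by the server itself. Wrapping socket.getInputStream() and socket.getOutputStream() in the Data streams is all the transport setup this needs.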

Rate Limit Exceeded - Custom Twitter app

I am working with a java Twitter app (using Twitter4J api). I have created the app and can view the current users timeline, user's profiles, etc..
However, when using the app it seems to quite quickly exceed the 150 requests an hour rate limit set on Twitter clients (i know developers can increase this to 350 on given accounts, but that would not resolve for other users).
Surely this is not affecting all clients, any ideas as to how to get around this?
Does anyone know what counts as a request? For example, when I view a user's profile, I load the User object (twitter4j) and then get the screen name, username, user description, user status, etc. to put into a JSON object - would this be a single call to get the object, or would it be several to include all the user.get... calls?
Thanks in advance
You really do need to keep track of what your current request count is when dealing with Twitter.
However, twitter does not seem to drop the count for 304 Not Modified (at least it didn't the last time I dealt with it), so make sure there isn't something breaking your normal use of HTTP caching, and your practical request per hour goes up.
Note that twitter suffers from a bug in mod_gzip on apache where the e-tag is mal-formed in changing it to reflect that the content-encoding is different to that of the non-gzipped entity (this is the Right Thing to do, there's just a bug in the implementation). Because of this, accepting gzipped content from twitter means it'll never send a 304, which increases your request count, and in many cases undermines the efficiency gains of using gzip.
Hence, if you are accepting gzip (your web library may do so by default; see what you can find with a tool like Fiddler. I'm a .NET guy with only a little Java knowledge, answering at the level of how Twitter deals with HTTP, so I don't know the details of Java web libraries), try turning that off and see if it improves things.
Almost every type of read from Twitter's servers (i.e. anything that calls HTTP GET) counts as a request. Getting user timelines, retweets, direct messages, getting user data all count as 1 request each. Pretty much the only Twitter API call that reads from the server without counting against your API limit is checking to see the rate limit status.
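Keeping track of the request count on the client can be as simple as a counter per time window. The sketch below is an assumption-laden illustration, not part of Twitter4J: the class name is made up, it uses a fixed window that starts when the object is created, and it only estimates the budget locally. The authoritative numbers are whatever the rate limit status call reports.

```java
import java.util.function.LongSupplier;

/** Local estimate of how many requests are left in the current window. */
class RequestBudget {
    private final int limit;
    private final long windowMillis;
    private final LongSupplier clock; // injectable so the logic can be tested
    private long windowStart;
    private int used;

    RequestBudget(int limit, long windowMillis, LongSupplier clock) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Starts a fresh window once the current one has elapsed. */
    private void rollWindow() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            windowStart = now;
            used = 0;
        }
    }

    /** Call before every HTTP GET; returns false when the budget is spent. */
    synchronized boolean tryAcquire() {
        rollWindow();
        if (used >= limit) {
            return false;
        }
        used++;
        return true;
    }

    synchronized int remaining() {
        rollWindow();
        return limit - used;
    }
}
```

For the limit mentioned in the question this would be created as new RequestBudget(150, 60 * 60 * 1000L, System::currentTimeMillis), and the app would serve cached data whenever tryAcquire() returns false.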
