Failing to parse this multi-part mime message body in Java

I'm not writing a mail application, so I don't have access to all the headers and such. All I have is something like the block at the end of this question. I've tried using the JavaMail API to parse this, using something like
Session s = Session.getDefaultInstance(new Properties());
InputStream is = new ByteArrayInputStream(<< String to parse >>);
MimeMessage message = new MimeMessage(s, is);
Multipart multipart = (Multipart) message.getContent();
But it just tells me that message.getContent() returns a String, not a Multipart or MimeMultipart. Plus, I don't really need all the overhead of the whole JavaMail API; I just need to parse the text into its parts. Here's an example:
This is a multi-part message in MIME format.\n\n------=_NextPart_000_005D_01CC73D5.3BA43FB0\nContent-Type: text/plain;\n\tcharset="iso-8859-1"\nContent-Transfer-Encoding: quoted-printable\n\nStuff:\n\n Please read this stuff at the beginning of each week. =\nFeel free to discuss it throughout the week.\n\n\n--=20\n\nMrs. Suzy M. Smith\n555-555-5555\nsuzy#suzy.com\n------=_NextPart_000_005D_01CC73D5.3BA43FB0\nContent-Type: text/html;\n\tcharset="iso-8859-1"\nContent-Transfer-Encoding: quoted-printable\n\n\n\n\n\n\n\n\n\nStuff:\n =20\nPlease read this stuff at the beginning of each =\nweek. Feel=20\nfree to discuss it throughout the week.\n-- Mrs. Suzy M. Smith555-555-5555suzy#suzy.com\n\n------=_NextPart_000_005D_01CC73D5.3BA43FB0--\n\n

First I took your example message and replaced all occurrences of \n with newlines and \t with tabs.
Then I downloaded the JARs from the Mime4J project, a subproject of Apache James, and executed the GUI parsing example org.apache.james.mime4j.samples.tree.MessageTree with the transformed message above as input. And apparently Mime4J was able to parse the message and to extract the HTML message part.
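The first step above, turning the literal \n and \t escape sequences into real characters, might look like this in Java. This is only a minimal sketch; the class and method names are made up for illustration and are not part of Mime4J.

```java
class EscapeFixer {
    /** Replaces the two-character sequences \n and \t with a real newline and a real tab. */
    static String unescape(String raw) {
        return raw.replace("\\n", "\n").replace("\\t", "\t");
    }

    public static void main(String[] args) {
        // A fragment of the message from the question, with escapes still literal.
        String raw = "Content-Type: text/plain;\\n\\tcharset=\"iso-8859-1\"\\n";
        System.out.print(unescape(raw));
    }
}
```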

There are a few things wrong with the text you posted.
It is not a valid multi-part MIME message. Check out the Wikipedia reference, which, while non-normative, is still correct.
The mime boundary is not defined. From the wikipedia example: Content-Type: multipart/mixed; boundary="frontier" shows that the boundary is "frontier". In your example, "----=_NextPart_000_005D_01CC73D5.3BA43FB0" is the boundary, but that can only be determined by scanning the text (i.e. the mime is malformed). You need to instruct the goofball that is passing you the mime content that you also need to know the mime boundary value, which is not defined in a message header. If you get the entire body of the message you will have enough because the body of the message starts with MIME-Version: 1.0 followed by Content-Type: multipart/mixed; boundary="frontier" where frontier will be replaced with the value of the boundary for the encoded mime.
If the person who is sending the body is a goofball (changed from monkey because monkey is too judgemental - my bad DwB), and will not (more likely does not know how to) send the full body, you can derive the boundary by scanning the text for a line that starts and ends with "--" (i.e. --boundary--). Note that I mentioned a "line". The terminal boundary is actually "--boundary--\n".
Finally, the stuff you posted has 2 parts. The first part appears to define substitutions to take place in the second part. If this is true, the Content-Type: of the first part should probably be something other than "text/plain". Perhaps "companyname/substitution-definition" or something like that. This will allow for multiple (as in future enhancements) substitution formats.
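The boundary scan described above can be sketched with plain Java, without JavaMail. The class and method names here are hypothetical, and the code assumes the closing delimiter is the last line that both starts and ends with "--". Each returned part still carries its own headers and its quoted-printable encoding; decoding those is out of scope for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class BoundaryScanner {
    /** Derives the boundary from the closing delimiter line ("--boundary--"), or returns null. */
    static String findBoundary(String body) {
        String[] lines = body.split("\r?\n");
        for (int i = lines.length - 1; i >= 0; i--) {
            String t = lines[i].trim();
            if (t.length() > 4 && t.startsWith("--") && t.endsWith("--")) {
                return t.substring(2, t.length() - 2);
            }
        }
        return null;
    }

    /** Splits the body into its parts; the preamble before the first delimiter is skipped. */
    static List<String> splitParts(String body, String boundary) {
        List<String> parts = new ArrayList<>();
        String delimiter = "--" + boundary;
        StringBuilder current = null; // stays null until the first delimiter is seen
        for (String line : body.split("\r?\n", -1)) {
            String t = line.trim();
            boolean closing = t.equals(delimiter + "--");
            if (closing || t.equals(delimiter)) {
                if (current != null) {
                    parts.add(current.toString());
                }
                if (closing) {
                    break;
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String body = "preamble\n--XYZ\nContent-Type: text/plain\n\nhello\n--XYZ--\n";
        String boundary = findBoundary(body);
        System.out.println(boundary + ": " + splitParts(body, boundary).size() + " part(s)");
    }
}
```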

You can create a MimeMultipart from an HTTP request by wrapping the request in a DataSource:
javax.mail.internet.MimeMultipart m = new MimeMultipart(new ServletMultipartDataSource(httpRequest));
public class ServletMultipartDataSource implements DataSource {
    String contentType;
    InputStream inputStream;
    public ServletMultipartDataSource(ServletRequest request) throws IOException {
        // A line break is placed in front of the request body before it is handed to MimeMultipart
        inputStream = new SequenceInputStream(new ByteArrayInputStream("\n".getBytes()), request.getInputStream());
        // The boundary is taken from the Content-Type header of the request
        contentType = request.getContentType();
    }
    public InputStream getInputStream() throws IOException {
        return inputStream;
    }
    public OutputStream getOutputStream() throws IOException {
        // This data source is only ever read from
        return null;
    }
    public String getContentType() {
        return contentType;
    }
    public String getName() {
        return "ServletMultipartDataSource";
    }
}
To get a submitted form parameter, parse the Content-Disposition header of each BodyPart:
public String getStringParameter(String name) throws MessagingException, IOException {
    for (int i = 0; i < m.getCount(); i++) {
        BodyPart bodyPart = m.getBodyPart(i);
        String[] nameHeader = bodyPart.getHeader("Content-Disposition");
        Object content = bodyPart.getContent();
        // Only plain string parts are treated as form parameters
        if (nameHeader != null && content instanceof String) {
            for (String bodyName : nameHeader) {
                if (bodyName.contains("name=\"" + name + "\"")) return (String) content;
            }
        }
    }
    return null;
}

If you are using javax.servlet.http.HttpServlet to receive the message, you will have to use HttpServletRequest.getHeaders to obtain the value of the HTTP Content-Type header. You then pass that value to org.apache.james.mime4j.stream.MimeConfig.setHeadlessParsing so that the parser can properly process the MIME message.
It appears that you are using HttpServletRequest.getInputStream to read the contents of the request. The input stream returned only has the content of the message after the HTTP headers (terminated by a blank line). That is why you have to extract content-type from the HTTP headers and feed it to the parser using setHeadlessParsing.
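A different way to hand the missing Content-Type to a parser is to put a small synthetic header block in front of the headless body stream, so that a header-aware parser can read the declaration itself. This is not what setHeadlessParsing does internally; it is a stand-alone sketch using only the JDK, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class HeaderPrepender {
    /** Returns a stream that yields a minimal MIME header block followed by the original body. */
    static InputStream withContentType(String contentType, InputStream body) {
        String header = "MIME-Version: 1.0\r\n"
                + "Content-Type: " + contentType + "\r\n"
                + "\r\n"; // the blank line separates headers from body
        return new SequenceInputStream(
                new ByteArrayInputStream(header.getBytes(StandardCharsets.US_ASCII)), body);
    }

    /** Helper for demonstration: reads a whole stream as US-ASCII text. */
    static String readAscii(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream body = new ByteArrayInputStream(
                "--frontier--\r\n".getBytes(StandardCharsets.US_ASCII));
        System.out.print(readAscii(withContentType("multipart/mixed; boundary=\"frontier\"", body)));
    }
}
```

In a servlet, the contentType argument would come from request.getContentType() and the body from request.getInputStream().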

Related

Parsing curl response with Java

Before writing something like "why don't you use Java HTTP client such as apache, etc", I need you to know that the reason is SSL. I wish I could, they are very convenient, but I can't.
None of the available HTTP clients support the GOST cipher suite, and I get a handshake exception all the time. The ones that do support the suite don't support SNI (they are also proprietary): I'm served the wrong cert and get a handshake exception over and over again.
The only solution was to configure openssl (with gost engine) and curl and finally execute the command with Java.
Having said that, I wrote a simple snippet for executing a command and getting input stream response:
public static InputStream executeCurlCommand(String finalCurlCommand) throws IOException
{
return Runtime.getRuntime().exec(finalCurlCommand).getInputStream();
}
Additionally, I can convert the returned IS to a string like that:
public static String convertResponseToString(InputStream isToConvertToString) throws IOException
{
StringWriter writer = new StringWriter();
IOUtils.copy(isToConvertToString, writer, "UTF-8");
return writer.toString();
}
However, I can't see a pattern by which I could reliably pick out the response body or a particular response header.
After executing a command (with the -i flag), the output can contain a great deal of information.
At first, I thought that I could just split it on '\n', but a required response header or the response itself may not fit on one line (prettified JSON or a long redirect URL breaks the rule).
Also, the static line GOST engine already loaded is a bit annoying (but I hope that I'll be able to get rid of it and that no unrelated info like that will emerge).
I do believe that there's a pattern which I can use.
For now I can only do that:
public static String getLocationRedirectHeaderValue(String curlResponse)
{
String locationHeaderValue = curlResponse.substring(curlResponse.indexOf("Location: "));
locationHeaderValue = locationHeaderValue.substring(0, locationHeaderValue.indexOf("\n")).replace("Location: ", "");
return locationHeaderValue;
}
Which is not nice, obviously.
Thanks in advance.

Instead of reading the whole result as a single string you might want to consider reading it line by line using a scanner.
Then keep a few status variables around. The main task would be to separate header from body. In the body you might have a payload you want to treat differently (e.g. use GSON to make a JSON object).
The nice thing: Header and Body are separated by an empty line. So your code would be along these lines:
boolean inHeader = true;
StringBuilder b = new StringBuilder();
String lastLine = "";
// Technically you would need a Multimap, since a header name can repeat
Map<String, String> headers = new HashMap<>();
Scanner scanner = new Scanner(yourInputStream);
while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    if (inHeader && line.isEmpty()) {
        // the first empty line ends the header block
        inHeader = false;
    } else if (inHeader) {
        // if the line starts with a space it is a
        // continuation of the previous header
        treatHeader(line, lastLine); // your own method that fills "headers"
        lastLine = line;
    } else {
        b.append(line);
        b.append(System.lineSeparator());
    }
}
String body = b.toString();
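A self-contained variant of the same idea, with one possible treatHeader filled in, could look like the following. It is a sketch under a few assumptions: the class name is invented, header names are lower-cased and stored in a map of lists, lines without a colon that are not an HTTP status line (such as the GOST engine notice) are ignored, and only the first header block is treated as headers. If curl follows redirects with -i, each response brings its own header block, which this sketch does not separate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;

class CurlResponseParser {
    final Map<String, List<String>> headers = new LinkedHashMap<>();
    String statusLine;
    String body;

    static CurlResponseParser parse(String response) {
        CurlResponseParser result = new CurlResponseParser();
        StringBuilder bodyBuilder = new StringBuilder();
        boolean inHeader = true;
        String lastName = null; // name of the header a continuation line belongs to
        try (Scanner scanner = new Scanner(response)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                if (inHeader && line.isEmpty()) {
                    inHeader = false; // the first empty line ends the header block
                } else if (inHeader) {
                    lastName = result.treatHeader(line, lastName);
                } else {
                    bodyBuilder.append(line).append('\n');
                }
            }
        }
        result.body = bodyBuilder.toString();
        return result;
    }

    /** Stores one header line and returns the header name that a following continuation extends. */
    private String treatHeader(String line, String lastName) {
        boolean continuation = line.startsWith(" ") || line.startsWith("\t");
        if (continuation && lastName != null) {
            List<String> values = headers.get(lastName);
            int last = values.size() - 1;
            values.set(last, values.get(last) + " " + line.trim());
            return lastName;
        }
        int colon = line.indexOf(':');
        if (colon < 0) {
            if (line.startsWith("HTTP/")) {
                statusLine = line;
            }
            return null; // status line or unrelated noise: nothing to continue
        }
        String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
        headers.computeIfAbsent(name, k -> new ArrayList<>()).add(line.substring(colon + 1).trim());
        return name;
    }

    public static void main(String[] args) {
        CurlResponseParser r = parse("HTTP/1.1 302 Found\r\nLocation: https://example.com/\r\n\r\n");
        System.out.println(r.headers.get("location"));
    }
}
```

With this in place, the Location lookup from the question becomes a map access instead of substring arithmetic.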

Extracting attachment name from eml file using Content-Type header

I'm using Tika-server to parse a bunch of eml files. Extracting both the content and the metadata of emls and attachments works fine with the /rmeta endpoint.
The problem is getting the proper attachment file name. When the attachment part in the raw eml file has the following structure:
Content-Type: application/pdf; name="filename_a.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="filename_a.pdf"
everything works fine: extracted filename path in metadata object (in api response) is:
"X-TIKA:embedded_resource_path": "/filename_a.pdf"
However, some of my emails have a malformed header structure (missing filename in Content-Disposition), i.e.:
Content-Type: application/pdf; name="filename_a.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
Then after parsing the whole eml I obtain:
"X-TIKA:embedded_resource_path": "/embedded-1"
I checked in Tika's source code that filename meta is defined in \org\apache\tika\parser\RecursiveParserWrapper.class here:
private String getResourceName(Metadata metadata, RecursiveParserWrapper.ParserState state) {
String objectName = "";
if (metadata.get("resourceName") != null) {
objectName = metadata.get("resourceName");
} else if (metadata.get("embeddedRelationshipId") != null) {
objectName = metadata.get("embeddedRelationshipId");
} else {
objectName = "embedded-" + ++state.unknownCount;
}
objectName = FilenameUtils.getName(objectName);
return objectName;
}
I tried to access the name attribute mentioned above by inspecting the Content-Type key in the metadata object, but it's not there. (I assume that Tika derives the Content-Type key not just from the header itself, hence the needed filename is absent.)
Therefore my question (since I'm not able to figure it out): is there a way to modify the Tika source code to force filename extraction from the Content-Type header when the proper filename attribute in the Content-Disposition header is missing?

OK, so I managed on my own. The workaround is pretty simple and straightforward.
One has to extend one of the conditions in \org\apache\tika\parser\mail\MailContentHandler.class. In line 129 we have:
if (contentDispositionFileName != null) {
submd.set("resourceName", contentDispositionFileName);
}
By extending it with an additional else block:
if (contentDispositionFileName != null) {
submd.set("resourceName", contentDispositionFileName);
} else {
Map<String, String> contentTypeParameters = ((MaximalBodyDescriptor)body).getContentTypeParameters();
String contentTypeFilename = (String)contentTypeParameters.get("name");
submd.set("resourceName", contentTypeFilename);
}
we force the handler to look for an additional filename property in the Content-Type parameters.
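The fallback rule itself (prefer filename from Content-Disposition, otherwise use name from Content-Type) can be illustrated outside Tika with plain header strings. This is a simplified sketch with invented names; it uses regular expressions rather than Tika's or Mime4J's own header parsing, and it ignores encoded or continued parameter values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttachmentNameResolver {
    // Matches filename="x" or filename=x in a Content-Disposition value.
    private static final Pattern FILENAME =
            Pattern.compile("(?i)\\bfilename\\s*=\\s*\"?([^\";]+)\"?");
    // Matches name="x" or name=x in a Content-Type value.
    private static final Pattern NAME =
            Pattern.compile("(?i)(?:^|[;\\s])name\\s*=\\s*\"?([^\";]+)\"?");

    /** Returns the attachment name, or null when neither header provides one. */
    static String resolve(String contentDisposition, String contentType) {
        String name = firstGroup(FILENAME, contentDisposition);
        if (name == null) {
            name = firstGroup(NAME, contentType); // fall back to the Content-Type parameter
        }
        return name;
    }

    private static String firstGroup(Pattern pattern, String header) {
        if (header == null) {
            return null;
        }
        Matcher m = pattern.matcher(header);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(resolve("attachment;", "application/pdf; name=\"filename_a.pdf\""));
    }
}
```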

multipart form data in javascript with java server using FileUpload

I am trying to send multipart form data with a file using only JavaScript. I write the request myself, so my JavaScript code is the following:
var data =
'------------f8n51w2QYCsvNftihodgfJ\n' +
'Content-Disposition: form-data; name="upload-id"\n' +
'\n' +
'uploadedFiles\n' +
'------------f8n51w2QYCsvNftihodgfJ\n' +
'Content-Disposition: form-data; name="file"; filename="doc1.txt"\n' +
'Content-Type: text/plain\n' +
'\n' +
'azerty\n' +
'------------f8n51w2QYCsvNftihodgfJ--\n';
var xhr = new XMLHttpRequest();
xhr.open('POST', '/upload');
xhr.setRequestHeader('Content-Type', 'multipart/form-data; boundary=----------f8n51w2QYCsvNftihodgfJ');
xhr.sendAsBinary(data);
I run this JavaScript on Firefox 18.
I have a servlet on /upload. Here's the code:
protected void service(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
RequestContext request_context = new ServletRequestContext(request);
boolean is_multipart = ServletFileUpload.isMultipartContent(request_context);
if (is_multipart) {
FileUpload file_upload = new FileUpload(fileItemFactory);
List<FileItem> file_items = file_upload.parseRequest(request_context); // This line crashes
}
}
As the comment says, the line file_upload.parseRequest(request_context); crashes and throws the following exception:
org.apache.commons.fileupload.MultipartStream$MalformedStreamException: Stream ended unexpectedly
at org.apache.commons.fileupload.MultipartStream.readHeaders(MultipartStream.java:539)
at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.findNextItem(FileUploadBase.java:976)
at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.<init>(FileUploadBase.java:942)
at org.apache.commons.fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:349)
And I just don't know why I get this exception... Any idea?
It seems like MultipartStream can't find the request headers. But if I log the headers, they are all there and they are correct.
My servlet code works with a "normal" form. I tried to log the request body and headers of a normal form, and they are the same (except the boundary, of course).
I also tried changing the data variable to invalid content. The error is still the same, so there's definitely a problem with my headers, but I don't see what.

I found the solution.
\n IS NOT a valid separator for multipart form. You must use \r\n. Now my code works properly.
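To make the CRLF rule concrete, here is the same request body assembled in Java with \r\n after every line, including the part headers, the blank line before each value, and the closing delimiter. The class and method names are hypothetical; the field names and values are the ones from the question.

```java
class MultipartFormBody {
    private static final String CRLF = "\r\n";

    /** Builds a body with one plain form field and one text file part. */
    static String build(String boundary, String fieldName, String fieldValue,
                        String fileField, String fileName, String fileContent) {
        StringBuilder sb = new StringBuilder();
        // First part: a simple form field.
        sb.append("--").append(boundary).append(CRLF)
          .append("Content-Disposition: form-data; name=\"").append(fieldName).append('"').append(CRLF)
          .append(CRLF)
          .append(fieldValue).append(CRLF);
        // Second part: the file, with its own Content-Type header.
        sb.append("--").append(boundary).append(CRLF)
          .append("Content-Disposition: form-data; name=\"").append(fileField)
          .append("\"; filename=\"").append(fileName).append('"').append(CRLF)
          .append("Content-Type: text/plain").append(CRLF)
          .append(CRLF)
          .append(fileContent).append(CRLF);
        // Closing delimiter: the boundary followed by two extra hyphens.
        sb.append("--").append(boundary).append("--").append(CRLF);
        return sb.toString();
    }

    public static void main(String[] args) {
        String boundary = "----------f8n51w2QYCsvNftihodgfJ";
        System.out.print(build(boundary, "upload-id", "uploadedFiles", "file", "doc1.txt", "azerty"));
    }
}
```

Note that each delimiter line is the boundary from the Content-Type header prefixed with two hyphens, which is why the lines in the body carry two more hyphens than the header value.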

I don't understand why you use sendAsBinary. If not absolutely necessary I wouldn't assemble the payload (data variable) myself but use FormData.
https://developer.mozilla.org/en-US/docs/DOM/XMLHttpRequest/FormData/Using_FormData_Objects
var oMyForm = new FormData();
oMyForm.append("username", "Groucho");
oMyForm.append("accountnum", 123456); // number 123456 is immediately converted to string "123456"
// HTML file input user's choice...
oMyForm.append("userfile", fileInputElement.files[0]);
// JavaScript file-like object...
var oFileBody = '<a id="a"><b id="b">hey!</b></a>'; // the body of the new file...
var oBlob = new Blob([oFileBody], { type: "text/xml"});
oMyForm.append("webmasterfile", oBlob);
var oReq = new XMLHttpRequest();
oReq.open("POST", "http://foo.com/submitform.php");
oReq.send(oMyForm);

Try changing f8n51w2QYCsvNftihodgfJ to f8n51w2QYCsvNftihodgfM.
I've tried running your code with different random boundaries, and it turns out that only f8n51w2QYCsvNftihodgfJ\n has the issue. I reckon you can try a different boundary, since it is really just a random string.

How to handle multipart/alternative mail with JavaMail?

I wrote an application which gets all emails from an inbox, filters the emails which contain a specific string and then puts those emails in an ArrayList.
After the emails are put in the List, I am doing some stuff with the subject and content of said emails. This works all fine for e-mails without an attachment. But when I started to use e-mails with attachments it all didn't work as expected anymore.
This is my code:
public void getInhoud(Message msg) throws IOException {
try {
cont = msg.getContent();
} catch (MessagingException ex) {
Logger.getLogger(ReadMailNew.class.getName()).log(Level.SEVERE, null, ex);
}
if (cont instanceof String) {
String body = (String) cont;
} else if (cont instanceof Multipart) {
try {
Multipart mp = (Multipart) msg.getContent();
int mp_count = mp.getCount();
for (int b = 0; b < 1; b++) {
dumpPart(mp.getBodyPart(b));
}
} catch (Exception ex) {
System.out.println("Exception arise at get Content");
ex.printStackTrace();
}
}
}
public void dumpPart(Part p) throws Exception {
email = null;
String contentType = p.getContentType();
System.out.println("dumpPart" + contentType);
InputStream is = p.getInputStream();
if (!(is instanceof BufferedInputStream)) {
is = new BufferedInputStream(is);
}
int c;
final StringWriter sw = new StringWriter();
while ((c = is.read()) != -1) {
sw.write(c);
}
if (!sw.toString().contains("<div>")) {
mpMessage = sw.toString();
getReferentie(mpMessage);
}
}
The content from the e-mail is stored in a String.
This code works all fine when I try to read mails without attachment. But if I use an e-mail with attachment the String also contains HTML code and even the attachment coding. Eventually I want to store the attachment and the content of an e-mail, but my first priority is to get just the text without any HTML or attachment coding.
Now I tried a different approach to handle the different parts:
public void getInhoud(Message msg) throws IOException {
try {
Object contt = msg.getContent();
if (contt instanceof Multipart) {
System.out.println("Met attachment");
handleMultipart((Multipart) contt);
} else {
handlePart(msg);
System.out.println("Zonder attachment");
}
} catch (MessagingException ex) {
ex.printStackTrace();
}
}
public static void handleMultipart(Multipart multipart)
throws MessagingException, IOException {
for (int i = 0, n = multipart.getCount(); i < n; i++) {
handlePart(multipart.getBodyPart(i));
System.out.println("Count "+n);
}
}
public static void handlePart(Part part)
throws MessagingException, IOException {
String disposition = part.getDisposition();
String contentType = part.getContentType();
if (disposition == null) { // When just body
System.out.println("Null: " + contentType);
// Check if plain
if ((contentType.length() >= 10)
&& (contentType.toLowerCase().substring(
0, 10).equals("text/plain"))) {
part.writeTo(System.out);
} else if ((contentType.length() >= 9)
&& (contentType.toLowerCase().substring(
0, 9).equals("text/html"))) {
part.writeTo(System.out);
} else if ((contentType.length() >= 9)
&& (contentType.toLowerCase().substring(
0, 9).equals("text/html"))) {
System.out.println("Ook html gevonden");
part.writeTo(System.out);
}else{
System.out.println("Other body: " + contentType);
part.writeTo(System.out);
}
} else if (disposition.equalsIgnoreCase(Part.ATTACHMENT)) {
System.out.println("Attachment: " + part.getFileName()
+ " : " + contentType);
} else if (disposition.equalsIgnoreCase(Part.INLINE)) {
System.out.println("Inline: "
+ part.getFileName()
+ " : " + contentType);
} else {
System.out.println("Other: " + disposition);
}
}
This is what is returned from the System.out.println calls:
Null: multipart/alternative; boundary=047d7b6220720b499504ce3786d7
Other body: multipart/alternative; boundary=047d7b6220720b499504ce3786d7
Content-Type: multipart/alternative; boundary="047d7b6220720b499504ce3786d7"
--047d7b6220720b499504ce3786d7
Content-Type: text/plain; charset="ISO-8859-1"
'Text of the message here in normal text'
--047d7b6220720b499504ce3786d7
Content-Type: text/html; charset="ISO-8859-1"
Content-Transfer-Encoding: quoted-printable
'HTML code of the message'
This approach returns the normal text of the e-mail but also the HTML coding of the mail. I really don't understand why this happens, I've googled it but it seems like there is no one else with this problem.
Any help is appreciated,
Thanks!

I found reading e-mail with the JavaMail library much more difficult than expected. I don't blame the JavaMail API, rather I blame my poor understanding of RFC-5322 -- the official definition of Internet e-mail.
As a thought experiment: Consider how complicated an e-mail message can become in the real world. It is possible to "infinitely" embed messages within messages. Each message itself may have multiple attachments (binary or human-readable text). Now imagine how complicated this structure becomes in the JavaMail API after parsing.
A few tips that may help when traversing e-mail with JavaMail:
Message and BodyPart both implement Part.
MimeMessage and MimeBodyPart both implement MimePart.
Where possible, treat everything as a Part or MimePart. This will allow generic traversal methods to be built more easily.
These Part methods will help to traverse:
    String getContentType(): Starts with the MIME type. You may be tempted to treat this as a MIME type (with some hacking/cutting/matching), but don't. Better to only use this method inside the debugger for inspection.
        Oddly, MIME type cannot be extracted directly. Instead use boolean isMimeType(String) to match. Read docs carefully to learn about powerful wildcards, such as "multipart/*".
    Object getContent(): Might be instanceof:
        Multipart -- container for more Parts
            Cast to Multipart, then iterate as zero-based index with int getCount() and BodyPart getBodyPart(int)
            Note: BodyPart implements Part
            In my experience, Microsoft Exchange servers regularly provide two copies of the body text: plain text and HTML.
                To match plain text, try: Part.isMimeType("text/plain")
                To match HTML, try: Part.isMimeType("text/html")
        Message (implements Part) -- embedded or attached e-mail
        String (just the body text -- plain text or HTML)
            See note above about Microsoft Exchange servers.
        InputStream (probably a BASE64-encoded attachment)
    String getDisposition(): Value may be null
        If Part.ATTACHMENT.equalsIgnoreCase(getDisposition()), then call getInputStream() to get raw bytes of the attachment.
Finally, I found the official Javadocs exclude everything in the com.sun.mail package (and possibly more). If you need these, read the code directly, or generate the unfiltered Javadocs by downloading the source and running mvn javadoc:javadoc in the mail project module of the project.
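The traversal these tips describe is a recursion: match the wanted type with isMimeType, and descend into anything that matches "multipart/*". The output in the question suggests that the second approach printed both copies because the body part was itself a multipart/alternative container and was written out whole instead of being descended into. The sketch below shows the recursion on a hypothetical Node class that stands in for javax.mail.Part so that it runs without the JavaMail jar; with JavaMail, the same shape would use Part.isMimeType, Multipart.getCount and Multipart.getBodyPart.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class MailTreeWalker {
    /** Hypothetical stand-in for javax.mail.Part: a content type plus either text or child parts. */
    static class Node {
        final String contentType;
        final String text;
        final List<Node> children;

        Node(String contentType, String text, Node... children) {
            this.contentType = contentType;
            this.text = text;
            this.children = Arrays.asList(children);
        }

        /** Compares only the type/subtype, ignoring parameters; "type/*" matches any subtype. */
        boolean isMimeType(String pattern) {
            String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            String p = pattern.toLowerCase(Locale.ROOT);
            if (p.endsWith("/*")) {
                return type.startsWith(p.substring(0, p.length() - 1));
            }
            return type.equals(p);
        }
    }

    /** Depth-first search for the first part of the wanted type; returns null if there is none. */
    static String findText(Node part, String wantedType) {
        if (part.isMimeType(wantedType)) {
            return part.text;
        }
        if (part.isMimeType("multipart/*")) {
            for (Node child : part.children) {
                String found = findText(child, wantedType);
                if (found != null) {
                    return found;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Node alternative = new Node("multipart/alternative; boundary=047d7b6220720b499504ce3786d7", null,
                new Node("text/plain; charset=\"ISO-8859-1\"", "Text of the message"),
                new Node("text/html; charset=\"ISO-8859-1\"", "<div>Text of the message</div>"));
        System.out.println(findText(alternative, "text/plain"));
    }
}
```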

Did you find these JavaMail FAQ entries?
How do I read a message with an attachment and save the attachment?
How do I tell if a message has attachments?
How do I find the main message body in a message that has attachments?

Following up on Kevin's helpful advice, analyzing your email content Java object types with respect to their canonical names (or simple names) can be helpful too. For example, looking at one inbox I've got right now, of 486 messages 399 are Strings, and 87 are MimeMultipart. This suggests that - for my typical email - a strategy that uses instanceof to first peel off Strings is best.
Of the Strings, 394 are text/plain, and 5 are text/html. This will not be the case for most; it's reflective of my email feeds into this particular inbox.
But wait - there's more!!! :-) The HTML sneaks in there nevertheless: of the 87 Multiparts, 70 are multipart/alternative. No guarantees, but most (if not all) of these are TEXT + HTML.
Of the other 17 multipart, incidentally, 15 are multipart/mixed, and 2 are multipart/signed.
My use case with this inbox (and one other) is primarily to aggregate and analyze known mailing list content. I can't ignore any of the messages, but an analysis of this sort helps me make my processing more efficient.

how can I detect charset of a web page

I just want to get the source of a web page in Java, with the correct encoding. I am able to get the content of a web page so far, but for some web pages the content comes back with garbled characters. So I need to detect the charset of that web page.
According to my little research, there is a jChardet library to do this, but I couldn't import it into my project. Can someone please help me?
By the way, the code below is what I use to read the web page content:
StringBuilder builder = new StringBuilder();
InputStream is = fURL.openStream();
BufferedReader buffer = null;
buffer = new BufferedReader(new InputStreamReader(is, encodingType));
int byteRead;
while ((byteRead = buffer.read()) != -1) {
builder.append((char) byteRead);
}
buffer.close();
return builder;

Read the Content-Type header of the HTTP response; it's the best way to get the charset. Only apply guessing when you have no alternatives, and here you do.
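A minimal sketch of pulling the charset out of a Content-Type header value follows. The class name and the fallback parameter are assumptions for illustration; in the question's code the header value could be obtained from fURL.openConnection().getContentType() before reading the stream.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HeaderCharset {
    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive check for a "charset=" parameter.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
    }
}
```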

You can also use jChardet (http://jchardet.sourceforge.net/):
private static String detectCharset(byte[] body) {
nsDetector det = new nsDetector(nsPSMDetector.ALL);
det.Init(new nsICharsetDetectionObserver() {
public void Notify(String charset) {
HtmlCharsetDetector.found = true;
}
});
boolean done = false;
boolean isAscii = true;
if (isAscii) {
isAscii = det.isAscii(body, body.length);
}
// DoIt if non-ascii and not done yet.
if (!isAscii && !done) {
done = det.DoIt(body, body.length, false);
}
return det.getProbableCharsets()[0];
}

Minimally, you would need to read and parse the HTTP headers to see whether they declare the encoding in HTTP headers and, in the absence of such a declaration (rather common), parse the document itself to find a meta tag that declares the encoding. For XHTML documents, you would need to check the XML declaration and default to utf-8. This would still leave a considerable amount of pages with undeclared encoding, so some heuristics would be needed. You might check the section on encodings in the HTML5 draft, which contains some heuristic overrides too (e.g., treating iso-8859-1 as windows-1252).
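The meta tag step can be sketched as follows, for the case where the HTTP header declares nothing. The code is an approximation, not the HTML5 algorithm: it assumes the declaration sits in the first 1024 bytes, decodes that prefix as ISO-8859-1 so every byte maps to a character, searches with a regular expression instead of a real HTML parser, and applies only the iso-8859-1 to windows-1252 override mentioned above. The class name is invented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {
    // Covers <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
    private static final Pattern META_CHARSET =
            Pattern.compile("(?i)<meta[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_\\-:.]+)");

    /** Looks for a charset declaration in the start of the page; returns the fallback if none is found. */
    static Charset sniff(byte[] page, Charset fallback) {
        int limit = Math.min(page.length, 1024);
        String head = new String(page, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return fallback;
        }
        String name = m.group(1);
        if (name.equalsIgnoreCase("iso-8859-1")) {
            name = "windows-1252"; // heuristic override: treat iso-8859-1 as windows-1252
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return fallback; // illegal or unsupported charset name
        }
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(page, StandardCharsets.ISO_8859_1));
    }
}
```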
