docx4j generate PDF with mail merge fields - java

I am using docx4j to generate PDF documents from Microsoft Word templates.
The Word template contains some mail merge fields, which should be replaced.
I am able to replace the mail merge fields, but they are displayed incorrectly in the generated PDF.
The output PDF always contains text like MERGEFIELD ContractNo \* MERGEFORMAT.
In Word, you can switch between field views with ALT+F9, but how can I make the generated PDF show the other view of the mail merge fields?
Instead of MERGEFIELD ContractNo \* MERGEFORMAT I want to show only ContractNo.

This should "just work" with a current docx4j nightly build (as opposed to 2.8.1).

Use content controls instead of MERGEFIELDs. I've posted an example on GitHub, complete with a sample template and a sample XML data file: https://github.com/sylnsr/docx4j-ws ...
MergeFields are deprecated and not (IMHO) recommended for continued use.
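For anyone who still has to post-process raw field codes by hand, the sketch below shows one way to pull the field name out of an instruction string such as MERGEFIELD ContractNo \* MERGEFORMAT. It is plain Java with a regular expression, not how docx4j handles fields internally; the class name MergeFieldName and the assumptions that the keyword is case-insensitive and that names containing spaces are quoted are mine.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MergeFieldName {
    // MERGEFIELD keyword, then either a quoted name or a single unquoted token.
    // Anything after the name (switches such as \* MERGEFORMAT) is ignored.
    private static final Pattern INSTRUCTION = Pattern.compile(
            "^\\s*MERGEFIELD\\s+(?:\"([^\"]+)\"|(\\S+))",
            Pattern.CASE_INSENSITIVE);

    /** Returns the field name, or empty if the text is not a MERGEFIELD instruction. */
    static Optional<String> extract(String instruction) {
        Matcher m = INSTRUCTION.matcher(instruction);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) != null ? m.group(1) : m.group(2));
    }

    public static void main(String[] args) {
        String raw = " MERGEFIELD  ContractNo  \\* MERGEFORMAT ";
        System.out.println(extract(raw).orElse("(not a merge field)"));
    }
}
```

Running main prints ContractNo. Other field types (for example PAGE) yield an empty result, so they can be left untouched.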

Related

PDFBox - Accessible PDF - How to check if PDF Tags have properties as per Accessibility guidelines

I need to check whether PDF tags have the properties required by accessibility guidelines.
Examples:
H1 - validate that an H1 exists in the PDF
Image (Figure tag) - validate that the image/figure has alt text
Language - validate that the language property is set so that a screen reader reads the document properly. For Spanish and English documents, the respective language codes should be set
Tables - access the table object and validate that the table structure is proper (header columns match the row columns, etc.)
So far I was able to:
Extract the metadata and validate that the document has proper Title, Subject and Producer info via PDDocument.getDocumentInformation().getMetadataKeys();
Validate whether the PDF is accessible by checking the PDDocument.getDocumentCatalog().getMarkInfo().isMarked(); flag
To access the tags, I have tried these options:
getDocumentCatalog().getAcroForm() returns null
PDDocument.getDocumentCatalog().getPages().get(0).getAnnotations(); returns null
I tried looping through PDDocument.getDocumentCatalog().getStructureTreeRoot().getKids() but it returns only one StructElem type object
The accessible PDF is created using OpenText, so the dev team doesn't know about PDFBox.
I am lost as to how to get access to the tags/objects (use MarkedContent or something else?).
Please suggest how to extract the individual objects (tags) such as P, H1, Table, Figure/Image and validate their properties.
Note: Manual validation of these properties is performed using Adobe Acrobat Pro.
Based upon https://issues.apache.org/jira/browse/PDFBOX-7, it appears that you can use PDFMarkedContentExtractor to get the information that you need.

Create a DOCX reading data from Oracle database

I have a student database (Oracle 11g) and need to create a separate module that generates a student's details as a well-formatted Word document. Given a student ID, I need all of the student's information (a kind of biodata) in a presentable docx file. I'm not sure how to start; I was exploring python-docx and the Java library docx4j. I need suggestions on how to achieve this. Is there a tool that can do this?
Your help is highly appreciated.
You could extract the data from Oracle into an XML format, then use content control data binding in your Word document to bind elements in the XML.
All you need to do is inject the XML into the docx as a custom xml part, and Word will display the results automatically.
docx4j can help you inject the XML. If you don't want to rely on Word to display the results, you can also use docx4j to apply the bindings.
Or you could try simple variable replacement: https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/VariableReplace.java
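As a rough sketch of the first step in the XML route (turning one database row into an XML document that content controls could bind to), something like the following would do. It uses only the JDK's DOM and Transformer APIs; the class name StudentXml, the root element student and the column names are made up for illustration, and the row is passed in as a map so that the JDBC query against Oracle stays out of the picture. Injecting the result into the docx as a custom XML part is a separate step and is not shown.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class StudentXml {
    /**
     * Builds <student><column>value</column>...</student> from one row.
     * Keys must be valid XML element names; values are escaped by the DOM serializer.
     */
    static String toXml(Map<String, String> row) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = dom.createElement("student"); // hypothetical root name
        dom.appendChild(root);
        for (Map.Entry<String, String> column : row.entrySet()) {
            Element child = dom.createElement(column.getKey());
            child.setTextContent(column.getValue());
            root.appendChild(child);
        }
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // In a real module these values would come from a JDBC ResultSet.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "42");
        row.put("name", "Ada & Co");
        System.out.println(toXml(row));
    }
}
```

Building the XML through the DOM rather than by string concatenation means special characters in the data (such as & or <) are escaped automatically.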
If you want a simple way to format your Word document directly from Java, you can try pxDoc.
The screenshot below provides an example of code and the document generated from an Authors/Books model: whichever way you request the data from your database, it is easy to render it in a well-formatted document.
[Screenshot: simple document generation example]
Regarding your use case, you could also generate a document for all students at once. In the context of the screenshot example:
for (author:library.authors) {
    var filename = 'c:/MyDocuments/'+author.name+'.docx'
    document fileName:filename {
        /** Content of my document */
    }
}

How to copy page from one docx to another in java

I am trying to automate a docx report generation process. For this I am using Java and docx4j. I have a template document containing only a single page. I would like to copy that page, modify it, and save it in another docx document. The output report consists of multiple similar pages, each a modified copy of the template. How do I go about it?
PS : java and docx4j are my first choice but I am open to solutions apart from java and docx4j.
Leaving it up to you to modify the template, here is how you could add one document to the end of another document. Suppose base.docx contains "This is the base document." and template.docx contains "The time is:", then after executing this code:
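import java.io.File;
import java.time.LocalDateTime;
import org.docx4j.Docx4J;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.Br;
import org.docx4j.wml.STBrType;

WordprocessingMLPackage doc = Docx4J.load(new File("base.docx"));
WordprocessingMLPackage template = Docx4J.load(new File("template.docx"));
MainDocumentPart main = doc.getMainDocumentPart();
// Start the appended content on a new page.
Br pageBreak = Context.getWmlObjectFactory().createBr();
pageBreak.setType(STBrType.PAGE);
main.addObject(pageBreak);
// Append every top-level object of the template body to the base document.
for (Object obj : template.getMainDocumentPart().getContent()) {
    main.addObject(obj);
}
main.addParagraphOfText(LocalDateTime.now().toString());
doc.save(new File("result.docx"));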
Then result.docx will contain something like:
This is the base document.
^L
The time is:
2018-04-16T17:37:13.541984200
(Where ^L represents a page break.)
To be more precise my original template is containing only header and some styling component.
This kind of information can be stored in a Word stylesheet (.dotx file).
PS : java and docx4j are my first choice but I am open to solutions apart from java and docx4j.
A good tool would be pxDoc: you can specify a dedicated stylesheet in your document generator, or use "variable styles" and specify the stylesheet only when you launch the document generation.

Missing fields when rendering Lotus Notes document to RTF, DXL with Java API

I'm attempting to render a notes document to RTF, then DXL using the Java API. Once I have the DXL, I'm converting it to HTML with an XSL stylesheet. My goal is to produce an HTML document that displays as close as possible to the document rendering in the notes client.
However, computed fields are missing from the rendered RTF and DXL.
Here is the code used to generate the DXL:
private String renderDocumentToDxl(lotus.domino.Document lotusDocument)
        throws Exception {
    Database db = getDatabase();
    lotus.domino.Document tmp = db.createDocument();
    RichTextItem rti = tmp.createRichTextItem("Body");
    lotusDocument.computeWithForm(true, false);
    lotusDocument.save();
    lotusDocument.renderToRTItem(rti);
    DxlExporter dxlExporter = getSession().createDxlExporter();
    dxlExporter.setOutputDOCTYPE(false);
    dxlExporter.setConvertNotesBitmapsToGIF(true);
    return dxlExporter.exportDxl(tmp);
}
Fields added to the document by the call to computeWithForm are not present in the generated DXL.
Is there any way to get the computed fields into the generated DXL with the Java API? Or is there a better way to generate an HTML representation of a notes document using the domino Java API?
I'm not quite clear on your objective. There are two possibilities:
1) You want the items from lotusDocument to exist in tmp, and to be exported as actual tag data in the DXL. Your code does not do this.
2) You want the values of the non-hidden Items from lotusDocument to exist as text within the rich text Body item in tmp, and you want those values to be included within the DXL that is exported from tmp - as text within the tag for the Body item. This should be what your code is doing.
If you expected the former, then that's not what renderToRTItem does. What it does is the latter, i.e., it gives you a snapshot of the values of the items in lotusDocument, but only those that would be displayed to a user who opens the document. You do not get the items themselves, and they won't appear separately in the DXL. If that's all you expected and it's not happening, then something else is going wrong and you haven't given enough information here to figure it out.
If you wanted the former, i.e., the actual items from lotusDocument to exist as separate tag elements within the DXL exported from tmp, then you should be using
lotusDocument.copyAllItems(tmp,true);,
or sequences of
Item tmpItem = lotusDocument.getFirstItem(itemName);
tmp.copyItem(tmpItem,"");
You can get the HTML representation of a RichText field with the URL
http://server/db.nsf/view/docunid/RichTextFieldname?OpenField
So, save your tmp document, get the docunid and read the result via http from URL
http://server/db.nsf/0/tmpdocunid/Body?OpenField
You don't need to call lotusDocument.computeWithForm, as lotusDocument.renderToRTItem already executes the form's input translation and validation formulas.
Be aware that with both methods the form's LotusScript code won't be executed, in case your fields are calculated that way.
In case you can use XPages this would be an alternative: http://linqed.eu/2014/07/11/getting-html-from-any-richtext-item/

Running a JavaScript command from MATLAB to fetch a PDF file

I'm currently writing some MATLAB code to interact with my company's internal reports database. So far I can access the HTML abstract page using code which looks like this:
import com.mathworks.mde.desk.*;
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.setCurrentLocation(ReportURL(8:end));
pause(1);
s={};
while isempty(s)
    s=char(wb.getHtmlText);
    pause(.1);
end
desk=MLDesktop.getInstance;
desk.removeClient(wb);
I can extract out various bits of information from the HTML text which ends up in the variable s, however the PDF of the report is accessed via what I believe is a JavaScript command (onClick="gotoFulltext('','[Report Number]')").
Any ideas as to how I execute this JavaScript command and get the contents of the PDF file into a MATLAB variable?
(MATLAB sits on top of Java, so I believe a Java solution would work...)
I think you should take a look at the JavaScript that is being called and see what the final request to the webserver looks like.
You can do this quite easily in Firefox using the FireBug plugin.
https://addons.mozilla.org/en-US/firefox/addon/1843
Once you have found the real server request then you can just request this URL or post to this URL instead of trying to run the JavaScript.
Once you have gotten the correct URL (a la the answer from pjp), your next problem is to "get the contents of the PDF file into a MATLAB variable". Whether or not this is possible may depend on what you mean by "contents"...
If you want to get the raw data in the PDF file, I don't think there is a way currently to do this in MATLAB. The URLREAD function was the first thing I thought of to read content from a URL into a string, but it has this note in the documentation:
"s = urlread('url') reads the content at a URL into the string s. If the server returns binary data, s will be unreadable."
Indeed, if you try to read a PDF as in the following example, s contains some text intermingled with mostly garbage:
s = urlread('http://samplepdf.com/sample.pdf');
If you want to get the text from the PDF file, you have some options. First, you can use URLWRITE to save the contents of the URL to a file:
urlwrite('http://samplepdf.com/sample.pdf','temp.pdf');
Then you should be able to use one of two submissions on The MathWorks File Exchange to extract the text from the PDF:
Extract text from a PDF document by Dimitri Shvorob
PDF Reader by Tom Gaudette
If you simply want to view the PDF, you can just open it in Adobe Acrobat with the OPEN function:
open('temp.pdf');
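Since the question mentions that a Java solution would be acceptable, here is a minimal sketch of reading the raw bytes of a URL in Java, which avoids the string conversion that makes urlread's result unreadable for binary data. The class name PdfFetch and its methods are illustrative, not part of MATLAB or any library, and it assumes the real PDF URL has already been worked out and needs no login or cookies. The check for the %PDF- header is just a sanity test that the server returned a PDF rather than, say, an HTML error page.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class PdfFetch {
    /** Reads a stream to the end and returns its bytes unchanged. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    /** True if the data starts with the "%PDF-" file header. */
    static boolean looksLikePdf(byte[] data) {
        return data.length >= 5
                && new String(data, 0, 5, StandardCharsets.US_ASCII).equals("%PDF-");
    }

    /** Downloads the resource at the given URL as raw bytes. */
    static byte[] fetch(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the real PDF URL as the first argument.
        byte[] data = fetch(args[0]);
        System.out.println(data.length + " bytes, PDF header: " + looksLikePdf(data));
    }
}
```

With the compiled class on MATLAB's Java class path, the static fetch method could be called from MATLAB code, and the returned byte array would hold the file contents.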
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.executeScript('javascript:alert(''Some code from a link'')');
desk=com.mathworks.mde.desk.MLDesktop.getInstance;
desk.removeClient(wb);
