PDFBox how to clone a page - java

I'm working on code which is supposed to take data, insert it into a form, and fill the form with that data.
I'm trying to create a simple pdf multi-page document with the given data in it.
My problem is that my code only works intermittently. I can run it and generate a multi-page document 6 times in a row, but on the 7th attempt I get an error like this:
java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
at org.apache.pdfbox.cos.COSStream.checkClosed(COSStream.java:77)
at org.apache.pdfbox.cos.COSStream.createOutputStream(COSStream.java:198)
at org.apache.pdfbox.cos.COSStream.createOutputStream(COSStream.java:186)
at org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.writeToStream(AppearanceGeneratorHelper.java:641)
at org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceContent(AppearanceGeneratorHelper.java:303)
at org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceValue(AppearanceGeneratorHelper.java:170)
at org.apache.pdfbox.pdmodel.interactive.form.PDTextField.constructAppearances(PDTextField.java:263)
at org.apache.pdfbox.pdmodel.interactive.form.PDTerminalField.applyChange(PDTerminalField.java:228)
at org.apache.pdfbox.pdmodel.interactive.form.PDTextField.setValue(PDTextField.java:218)
at app.components.FieldPopulator.insertData(FieldPopulator.java:153)
at app.components.FieldPopulator.populateGoodsFields(FieldPopulator.java:69)
at app.Main.populateData(Main.java:55)
at app.Main.main(Main.java:25)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:143)
Error: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
Document has been created
java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
at org.apache.pdfbox.cos.COSStream.checkClosed(COSStream.java:77)
at org.apache.pdfbox.cos.COSStream.createRawInputStream(COSStream.java:125)
at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1200)
at org.apache.pdfbox.cos.COSStream.accept(COSStream.java:383)
at org.apache.pdfbox.cos.COSObject.accept(COSObject.java:158)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:522)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObjects(COSWriter.java:460)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:444)
at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1096)
at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:419)
at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1367)
at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1254)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1232)
at app.Main.saveDocument(Main.java:62)
at app.Main.main(Main.java:26)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:143)
I would like to know why this error keeps coming up, and how it can be that the code sometimes works and sometimes doesn't.
Here is the main class (I removed irrelevant parts of the code):
package app;

import app.components.*;
import org.apache.pdfbox.pdmodel.PDDocument;

import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

/**
 * Created by David on 11/01/2017.
 */
public class Main {
    private static FormMetaHolder metaHolder = new FormMetaHolder();
    private static HashMap<String, Object> properties = new HashMap<>();
    private static List<GoodObject> objects = new ArrayList();
    private static PDDocument document = new PDDocument();

    public static void main(String[] args) throws IOException {
        initiateData(); // This method generates sample data. I removed it for simplicity
        initiatePdf();
        populateData();
        saveDocument();
        System.out.print("Document has been created");
    }

    public static void initiatePdf() {
        DocumentBuilder builder = new DocumentBuilder();
        builder.setObjectsData(objects);
        document = builder.build();
    }

    public static void populateData() throws IOException {
        FieldPopulator populator = new FieldPopulator(document.getDocumentCatalog().getAcroForm(), metaHolder);
        populator.numberPages();
        populator.populateStaticFields(properties);
        populator.populateGoodsFields(objects);
    }

    public static void saveDocument() {
        try {
            FileOutputStream out = new FileOutputStream(Constants.Paths.RESULT_FILE);
            document.save(out);
            document.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
And here is the build() method from the DocumentBuilder class. This is the method which generates an empty, ready-to-be-populated document based on how much data needs to be inserted:
public PDDocument build() {
    PDDocument document = new PDDocument();
    try {
        int numPages = evaulator.getNumOfPages(objectsToInsert);
        // A FieldRenamer for renaming the DG fields sequentially
        FieldRenamer renamer = new FieldRenamer();
        PDResources res = new PDResources();
        // First one here is a list of the unique fields of the document
        List<PDField> parentFields = new ArrayList<>();
        for (int i = 1; i <= numPages; i++) {
            try {
                InputStream stream = Main.class.getClassLoader().getResourceAsStream(Constants.Paths.EMPTY_DOC);
                templateDoc = new PDDocument().load(stream);
                PDDocumentCatalog catalog = templateDoc.getDocumentCatalog();
                PDAcroForm acroForm = catalog.getAcroForm();
                if (i == 1)
                    res = acroForm.getDefaultResources();
                // Renaming the PDFields
                renamer.setAcroForm(acroForm);
                renamer.renameFields(i * 10 + 1 - 10);
                PDPage page = catalog.getPages().get(0);
                // Iterating over the acro form to find new fields that are not
                // In the main collection of fields - the parentFields object
                // If a field is already inside the parentFields list, add new
                // PDAnnotationWidget to that field, for visual representation
                // In the new page
                List<PDAnnotation> widgets = new ArrayList<>();
                for (PDField field : acroForm.getFields()) {
                    boolean flag = true;
                    for (PDField parentField : parentFields) {
                        String currentFieldName = field.getFullyQualifiedName();
                        String parentFieldName = parentField.getFullyQualifiedName();
                        if (currentFieldName.equals(parentFieldName)) {
                            flag = false;
                            PDAnnotationWidget existWidget = parentField.getWidgets().get(0);
                            PDAnnotationWidget widget = new PDAnnotationWidget();
                            widget.setPage(page);
                            widget.getCOSObject().setItem(COSName.PARENT, parentField);
                            widget.setRectangle(existWidget.getRectangle());
                            parentField.getWidgets().add(widget);
                            widgets.add(widget);
                            page.getAnnotations().add(widget);
                            break;
                        }
                    }
                    if (flag) {
                        parentFields.add(field);
                    }
                }
                document.addPage(page);
                stream.close();
            } catch (IOException e) {
                e.printStackTrace();
                System.out.print("Error: could not load the template file");
            }
        }
        PDAcroForm acroForm = new PDAcroForm(document);
        acroForm.setFields(parentFields);
        acroForm.setDefaultResources(res);
        document.getDocumentCatalog().setAcroForm(acroForm);
    } catch (NullPointerException e) {
        e.printStackTrace();
        System.out.print("Error: no data has been set to insert");
    }
    return document;
}
Now, I think the problem lies with the way I'm dealing with the PDFields and their annotations inside the build() method, but I can't pinpoint the source of the problem.
Please help me with this, guys. I've been struggling with it for some time and I just want to proceed with my project knowing that this part of the program works properly.
Thanks in advance!
UPDATE:
Ok, so the main cause of the problem is in the build() method inside the DocumentBuilder class. I will try to clarify my code a little:
First, I create a new document which is intended to be the result document.
Next, I use my Evaluator class to figure out how many pages are needed to hold the data; inside the Evaluator I perform certain calculations. A small note: I use the FormMetaHolder object, whose job is to provide metadata about acro form fields, such as field font, font size, field width and field name. The field font variable holds a reference to the actual PDFont object I took from the form resources.
Next, I loop over the number of pages and for each iteration I do the following:
1. Create a new PDDocument from the template file and assign it to templateDoc.
2. Get the acro form from that document.
   2.1. If it's the first iteration, assign the acro form's default resources to the res variable.
3. Rename the identical fields which I want to contain different values. Because the multi-line feature in PDFBox is full of bugs, I divided certain fields of the same size into 10 smaller fields. Each small field is numbered; for instance, if the big field is called "UN" then the smaller fields are called UN1, UN2, UN3...
   Now, if I have a lot of data to insert, I want to be able to duplicate the template document, with the only exception of renaming those smaller fields appropriately. For example, if I need 2 pages to store the data provided by the user, I need 20 smaller fields, 10 per page, so all the UN fields need to be renamed UN1 - UN20.
4. After I rename all the smaller fields accordingly, I try to find new annotations for my list of parentFields. The reason I do that is that besides the "smaller" fields, I also have static fields which don't change across pages.
5. After I find all the correct widgets, I add the current page to the result document and close the InputStream which was used to instantiate the templateDoc object.
After the for loop is over, I create a new PDAcroForm object with the result document as the argument. I set its fields to be parentFields and its default resources to be the res object. Finally, I set the document's acro form to be the newly created acro form.
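
For reference, here is a heavily trimmed sketch of the page-collecting loop described above (field renaming, widget handling and field population are omitted; the Constants are the ones used in the code above). It is only a sketch, but it keeps every loaded template document open until the result has been saved; closing or garbage-collecting a template too early is exactly what produces the COSStream error shown at the top.
// Sketch only: collect pages from the templates while keeping the template documents open.
List<PDDocument> openTemplates = new ArrayList<>();
PDDocument result = new PDDocument();
for (int i = 1; i <= numPages; i++) {
    InputStream stream = Main.class.getClassLoader().getResourceAsStream(Constants.Paths.EMPTY_DOC);
    PDDocument template = PDDocument.load(stream); // load() is static; no need for new PDDocument().load(...)
    openTemplates.add(template);                   // keep the template referenced and open
    result.addPage(template.getDocumentCatalog().getPages().get(0));
    stream.close();
}
// ... rename fields, build the combined PDAcroForm, populate the values ...
result.save(Constants.Paths.RESULT_FILE);
result.close();
for (PDDocument template : openTemplates) {
    template.close();                              // only safe after result.save()
}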

Related

Stop Bullet number to be updated automatically when merging word docs using docx4j

I am trying to merge 2 docx files which each have their own bullet numbering; after merging the word docs, the bullet numbers are automatically updated.
E.g.:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering is updated to 1 2 3 4 5 6.
How do I stop this?
I am using the following code:
if(counter==1)
{
FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
main = FirstWordFile.getMainDocumentPart();
//Add page break for Table of Content
main.addObject(objBr);
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Table of contents - End
}
else
{
FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FileIS = new java.io.ByteArrayInputStream(FileByteStream);
byte[] bytes = IOUtils.toByteArray(FileIS);
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
afiPart.setContentType(new ContentType(CONTENT_TYPE));
afiPart.setBinaryData(bytes);
Relationship altChunkRel = main.addTargetPart(afiPart);
CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
chunk.setId(altChunkRel.getId());
main.addObject(objBr);
htmlCode = new StringBuilder();
htmlCode.append("<html>");
htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">"+ReqName+"</p></h2>");
htmlCode.append("</html>");
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Add Page Break before new content
main.addObject(objBr);
//Add new content
main.addObject(chunk);
}
Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks, and have docx4j-ImportXHTML convert them via main.convertAltChunks().
If the same problem occurs when you try that, well, at least we can address it.
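A minimal sketch of that XHTML route might look like the following; the file names and the markup string are placeholders, exception handling is omitted, and docx4j-ImportXHTML has to be on the classpath for convertAltChunks() to do the conversion.
import java.io.File;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;

// Add the chunk as XHTML instead of HTML and let docx4j convert it before saving,
// so the numbering is produced by docx4j-ImportXHTML rather than by Word on open.
WordprocessingMLPackage pkg = WordprocessingMLPackage.load(new File("first.docx"));
MainDocumentPart main = pkg.getMainDocumentPart();
String xhtml = "<html><body><p>Some heading</p></body></html>"; // your generated markup
main.addAltChunk(AltChunkType.Xhtml, xhtml.getBytes());
main.convertAltChunks();                                        // converts all altChunks in place
pkg.save(new File("merged.docx"));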
I was able to fix my issue using the following code. I found it at http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml. You can also generate your own custom code; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir")+ "/";
public final static String DIR_OUT = System.getProperty("user.dir")+ "/";
public static void main(String[] args) throws Exception
{
String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
List blockRanges = new ArrayList();
for (int i=0 ; i< files.length; i++) {
BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
blockRanges.add( block );
block.setStyleHandler(StyleHandler.RENAME_RETAIN);
block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
block.setRestartPageNumbering(false);
block.setHeaderBehaviour(HfBehaviour.DEFAULT);
block.setFooterBehaviour(HfBehaviour.DEFAULT);
block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
}
// Perform the actual merge
DocumentBuilder documentBuilder = new DocumentBuilder();
WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
// Save the result
SaveToZipFile saver = new SaveToZipFile(output);
saver.save(DIR_OUT+"OUT_MergeWholeDocumentsUsingBlockRange.docx");
}

PDFbox saying PDDocument closed when its not

I am trying to populate repeated forms with PDFBox. I am using a TreeMap and populating the forms with individual records. The format of the pdf form is such that there are six records listed on page one and a static page inserted on page two. (For a TreeMap larger than six records, the process repeats.) The error I'm getting is specific to the size of the TreeMap. Therein lies my problem. I can't figure out why, when I populate the TreeMap with more than 35 entries, I get this warning:
Apr 23, 2018 2:36:25 AM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document
public class test {
public static void main(String[] args) throws IOException, IOException {
// TODO Auto-generated method stub
File dataFile = new File("dataFile.csv");
File fi = new File("form.pdf");
Scanner fileScanner = new Scanner(dataFile);
fileScanner.nextLine();
TreeMap<String, String[]> assetTable = new TreeMap<String, String[]>();
int x = 0;
while (x <= 36) {
String lineIn = fileScanner.nextLine();
String[] elements = lineIn.split(",");
elements[0] = elements[0].toUpperCase().replaceAll(" ", "");
String key = elements[0];
key = key.replaceAll(" ", "");
assetTable.put(key, elements);
x++;
}
PDDocument newDoc = new PDDocument();
int control = 1;
PDDocument doc = PDDocument.load(fi);
PDDocumentCatalog cat = doc.getDocumentCatalog();
PDAcroForm form = cat.getAcroForm();
for (String s : assetTable.keySet()) {
if (control <= 6) {
PDField IDno1 = (form.getField("IDno" + control));
PDField Locno1 = (form.getField("locNo" + control));
PDField serno1 = (form.getField("serNo" + control));
PDField typeno1 = (form.getField("typeNo" + control));
PDField maintno1 = (form.getField("maintNo" + control));
String IDnoOne = assetTable.get(s)[1];
//System.out.println(IDnoOne);
IDno1.setValue(assetTable.get(s)[0]);
IDno1.setReadOnly(true);
Locno1.setValue(assetTable.get(s)[1]);
Locno1.setReadOnly(true);
serno1.setValue(assetTable.get(s)[2]);
serno1.setReadOnly(true);
typeno1.setValue(assetTable.get(s)[3]);
typeno1.setReadOnly(true);
String type = "";
if (assetTable.get(s)[5].equals("1"))
type += "Hydrotest";
if (assetTable.get(s)[5].equals("6"))
type += "6 Year Maintenance";
String maint = assetTable.get(s)[4] + " - " + type;
maintno1.setValue(maint);
maintno1.setReadOnly(true);
control++;
} else {
PDField dateIn = form.getField("dateIn");
dateIn.setValue("1/2019 Yearlies");
dateIn.setReadOnly(true);
PDField tagDate = form.getField("tagDate");
tagDate.setValue("2019 / 2020");
tagDate.setReadOnly(true);
newDoc.addPage(doc.getPage(0));
newDoc.addPage(doc.getPage(1));
control = 1;
doc = PDDocument.load(fi);
cat = doc.getDocumentCatalog();
form = cat.getAcroForm();
}
}
PDField dateIn = form.getField("dateIn");
dateIn.setValue("1/2019 Yearlies");
dateIn.setReadOnly(true);
PDField tagDate = form.getField("tagDate");
tagDate.setValue("2019 / 2020");
tagDate.setReadOnly(true);
newDoc.addPage(doc.getPage(0));
newDoc.addPage(doc.getPage(1));
newDoc.save("PDFtest.pdf");
Desktop.getDesktop().open(new File("PDFtest.pdf"));
}
I can't figure out for the life of me what I'm doing wrong. This is my first week working with PDFBox, so I'm hoping it's something simple.
Updated Error Message
WARNING: Warning: You did not close a PDF Document
Exception in thread "main" java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
at org.apache.pdfbox.cos.COSStream.checkClosed(COSStream.java:77)
at org.apache.pdfbox.cos.COSStream.createRawInputStream(COSStream.java:125)
at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1200)
at org.apache.pdfbox.cos.COSStream.accept(COSStream.java:383)
at org.apache.pdfbox.cos.COSObject.accept(COSObject.java:158)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:522)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObjects(COSWriter.java:460)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:444)
at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1096)
at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:419)
at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1367)
at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1254)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1232)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1204)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1192)
at test.test.main(test.java:87)
The warning by itself
You appear to get the warning wrong. It says:
Warning: You did not close a PDF Document
So in contrast to what you think, "PDFbox saying PDDocument closed when its not", PDFBox says that you did not close a document!
After your edit one sees that it actually says that a COSStream has been closed and that a possible cause is that the enclosing PDDocument already has been closed. This is a mere possibility!
The warning in your case
That being said, by adding pages from one document to another you probably end up with references to those pages from both documents. In that case, in the course of closing both documents (e.g. automatically via garbage collection), the second one to be closed may indeed stumble across some already closed COSStream instances.
So my first advice, to simply close the documents at the end by
doc.close();
newDoc.close();
probably won't remove the warnings, merely change their timing.
Actually you don't merely create two documents doc and newDoc, you even create new PDDocument instances and assign them to doc again and again, in the process setting the former document objects in that variable free for garbage collection. So you eventually have a big bunch of documents to be closed as soon as not referenced anymore.
I don't think it would be a good idea to close all those documents in doc early, in particular not before saving newDoc.
But if your code will eventually be run as part of a larger application instead of as a small, one-shot test application, you should collect all those PDDocument instances in some Collection and explicitly close them right after saving newDoc and then clear the collection.
Actually your exception looks like one of those lost PDDocument instances has already been closed by garbage collection, so you should collect the documents even in the case of a simple one-shot utility, to keep them from being disposed by GC.
(#Tilman, please correct me if I'm wrong...)
Importing pages
To prevent problems with different documents sharing pages, you can try and import the pages to the target document and thereafter add the imported page to the target document page tree. I.e. replace
newDoc.addPage(doc.getPage(0));
newDoc.addPage(doc.getPage(1));
by
newDoc.addPage(newDoc.importPage(doc.getPage(0)));
newDoc.addPage(newDoc.importPage(doc.getPage(1)));
This should allow you to close each PDDocument instance in doc before losing it.
There are certain drawbacks to this, though, cf. the method JavaDoc and this answer here.
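Applied to the else branch of the loop in the question, that might look roughly like this (only the changed lines are shown; everything else stays as in the question, and whether it fits depends on the drawbacks mentioned above):
// Import the two filled pages; the old document can then be closed before reloading.
newDoc.addPage(newDoc.importPage(doc.getPage(0)));
newDoc.addPage(newDoc.importPage(doc.getPage(1)));
doc.close();               // safe now: the imported copies belong to newDoc
control = 1;
doc = PDDocument.load(fi); // fresh template for the next batch of six records
cat = doc.getDocumentCatalog();
form = cat.getAcroForm();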
An actual issue in your code
In your combined document you will have many fields with the same name (at least in case of a sufficiently high number of entries in your CSV file) which you initially set to different values. And you access the fields from the PDAcroForm of the respective original document but don't add them to the PDAcroForm of the combined result document.
This is asking for trouble! The PDF format does consider forms to be document-wide with all fields referenced (directly or indirectly) from the AcroForm dictionary of the document, and it expects fields with the same name to effectively be different visualizations of the same field and therefore to all have the same value.
Thus, PDF processors might handle your document fields in unexpected ways, e.g.
by showing the same value in all fields with the same name (as they are expected to have the same value) or
by ignoring your fields (as they are not in the document AcroForm structure).
In particular, programmatic reading of your PDF field values will fail because in that context the form is definitively considered document-wide and based on AcroForm. PDF viewers, on the other hand, might at first show your set values and make things look ok.
To prevent this you should rename the fields before merging. You might consider using the PDFMergerUtility which does such a renaming under the hood. For an example usage of that utility class have a look at the PDFMergerExample.
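For illustration, a rough PDFMergerUtility sketch (PDFBox 2.x, exception handling omitted). The per-batch file names are invented; each is assumed to be a filled copy of the form saved from its own PDDocument, and the utility renames clashing field names while merging, as described above.
import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

PDFMergerUtility merger = new PDFMergerUtility();
merger.addSource(new File("filledBatch1.pdf")); // hypothetical intermediate files
merger.addSource(new File("filledBatch2.pdf"));
merger.setDestinationFileName("PDFtest.pdf");
merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());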
Even though the above answer was marked as the solution to the problem, since the solution is buried in the comments, I wanted to add this answer at this level. I spent several hours searching for the solution.
My code snippets and comments.
// Collection solely for purpose of preventing premature garbage collection
List<PDDocument> sourceDocuments = new ArrayList<>( );
...
// Source document (actually inside a loop)
PDDocument docIn = PDDocument.load( artifactBytes );
// Add document to collection before using it to prevent the problem
sourceDocuments.add( docIn );
// Extract from source document
PDPage extractedPage = docIn.getPage( 0 );
// Add page to destination document
docOut.addPage( extractedPage );
...
// This was failing with "COSStream has been closed and cannot be read."
// Now it works.
docOut.save( bundleStream );

iText 7.0.5: How to combine PDF and have existing bookmarks indented under new bookmarks for each document?

Problem:
com.itextpdf.kernel.PdfException: Pdf indirect object belongs to other PDF document. Copy object to current pdf document.
I want to combine PDF documents with an edited set of bookmarks that keeps a clear pairing of the bookmarks with each original document. I also want a new top level bookmark describing the set as a whole to improve combining with yet other documents later if the user chooses. The number of documents combined and the number of bookmarks in each is unknown and some documents might not have any bookmarks.
For simplicity, assume I have two documents with two pages and a bookmark to the second page in each. I would want the combined document to have a bookmark structure like this where "NEW" are the ones I am creating based on meta data I have about each source document and "EXISTING" are whatever I copy from the individual documents:
-- NEW Combined Document meta(page 1)
---- NEW Document one meta (page 1)
------ EXISTING Doc one link (page 2)
---- NEW Document two meta (page 3)
------ EXISTING Doc two link (page 4)
Code:
private static String combinePdf(List<String> allFile, LinkedHashMap<String, String> bookmarkMetaMap, Connection conn) throws IOException {
System.out.println("=== combinePdf() ENTER"); // TODO REMOVE
File outFile = File.createTempFile("combinePdf", "pdf", new File(DocumentObj.TEMP_DIR_ON_SERVER));
if (!outFile.exists() || !outFile.canWrite()) {
throw new IOException("Unable to create writeable file in " + DocumentObj.TEMP_DIR_ON_SERVER);
}
if (bookmarkMetaMap == null || bookmarkMetaMap.isEmpty()) {
bookmarkMetaMap = new LinkedHashMap<>(); // prevent NullPointer below
bookmarkMetaMap.put("Documents", "Documents");
}
try ( PdfDocument allPdfDoc = new PdfDocument(new PdfWriter(outFile)) ) {
allPdfDoc.initializeOutlines();
allPdfDoc.getCatalog().setPageMode(PdfName.UseOutlines);
PdfMerger allPdfMerger = new PdfMerger(allPdfDoc, true, false); // build own outline
Iterator<Map.Entry<String, String>> itr = bookmarkMetaMap.entrySet().iterator();
PdfOutline rootOutline = allPdfDoc.getOutlines(false);
PdfOutline mainOutline;
mainOutline = rootOutline.addOutline(itr.next().getValue());
mainOutline.addDestination(PdfExplicitDestination.createFit(allPdfDoc.getNumberOfPages() + 1));
int fileNum = 0;
for (String oneFile : allFile) {
PdfDocument onePdfDoc = new PdfDocument(new PdfReader(oneFile));
PdfAcroForm oneForm = PdfAcroForm.getAcroForm(onePdfDoc, false);
if (oneForm != null) {
oneForm.flattenFields();
}
allPdfMerger.merge(onePdfDoc, 1, onePdfDoc.getNumberOfPages());
fileNum++;
String bookmarkLabel = itr.hasNext() ? itr.next().getKey() : "Document " + fileNum;
PdfOutline linkToDoc = mainOutline.addOutline(bookmarkLabel);
linkToDoc.addDestination(PdfExplicitDestination.createFit(allPdfDoc.getNumberOfPages() + 1));
PdfOutline srcDocOutline = onePdfDoc.getOutlines(false);
if (srcDocOutline != null) {
List<PdfOutline> outlineList = srcDocOutline.getAllChildren();
if (!outlineList.isEmpty()) {
for (PdfOutline p : outlineList) {
linkToDoc.addOutline(p); // if I comment this out, no error, but links wrong order
}
}
}
onePdfDoc.close();
}
System.out.println("=== combinePdf() DONE ADDING PAGES ==="); //TODO REMOVE
}
return outFile.getAbsolutePath();
}
Problem:
com.itextpdf.kernel.PdfException: Pdf indirect object belongs to other PDF document. Copy object to current pdf document.
The error occurs after the debug line "=== combinePdf() DONE ADDING PAGES ===", so the for loop completes as expected.
This means the error occurs when allPdfDoc is automagically closed.
If I remove the line linkToDoc.addOutline(p); I get all of my links and they go to the correct pages but they are not nested/ordered as I want:
-- NEW Combined Document meta(page 1)
---- NEW Document one meta (page 1)
---- NEW Document two meta (page 3)
-- EXISTING Doc one link (page 2)
-- EXISTING Doc two link (page 4)
With the aforementioned line commented out, I am not even sure how the EXISTING links are included at all. I have the mergeOutlines flag set to false in the PdfMerger constructor since I thought I had to construct my own outline. I get similar results whether I pass true or false to getOutlines(), and also if I take out my arbitrary new top-level bookmark.
I know how to create a flattened list of new and existing bookmarks in the desired order. So my question is about how to get both the indenting and ordering as desired.
Thanks for taking a look!
Rather than shift bookmarks in the combined PDF, I did it in the component PDF before merging.
Feedback welcome, especially if something is horribly inefficient as PDF size increases:
private static void shiftPdfBookmarksUnderNewBookmark(PdfDocument pdfDocument, String bookmarkLabel) {
if (pdfDocument == null || pdfDocument.getWriter() == null) {
log.warn("shiftPdfBookmarksUnderNewBookmark(): no writer linked to PDFDocument, cannot modify bookmarks");
return;
}
pdfDocument.initializeOutlines();
try {
PdfOutline rootOutline = pdfDocument.getOutlines(false);
PdfOutline subOutline = rootOutline.addOutline(bookmarkLabel);
subOutline.addDestination(PdfExplicitDestination.createFit(pdfDocument.getFirstPage())); // Not sure why this is needed, but problems if omitted.
List<PdfOutline> pdfOutlineChildren = rootOutline.getAllChildren();
if (pdfOutlineChildren.size() == 1) {
return;
}
int i = 0;
for (PdfOutline p : rootOutline.getAllChildren()) {
if (p != subOutline) {
if (p.getDestination() == null) {
continue;
}
subOutline.addOutline(p);
}
}
rootOutline.getAllChildren().clear();
rootOutline.addOutline(subOutline);
subOutline.addDestination(PdfExplicitDestination.createFit(pdfDocument.getFirstPage())); // not sure why duplicate line above seems to be needed
}
catch (Exception logAndIgnore) {
log.warn("shiftPdfBookmarksUnderNewBookmark ignoring error and not shifting bookmarks: " +logAndIgnore, logAndIgnore);
}
}
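A possible way to call it (the temp file name is invented; the component document has to be opened with both a reader and a writer, otherwise the method returns immediately):
// Rewrite each component PDF with its bookmarks already nested, then merge the rewritten copies.
String shifted = oneFile + ".shifted.pdf"; // hypothetical temporary file
try (PdfDocument component = new PdfDocument(new PdfReader(oneFile), new PdfWriter(shifted))) {
    shiftPdfBookmarksUnderNewBookmark(component, "Document one meta");
}
// then pass "shifted" instead of oneFile to the merge loop in combinePdf()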

JaxB marshaler overwriting file contents

I am trying to use JAXB to marshal objects I create to XML. What I want is to create a list, print it to the file, then create a new list and print it to the same file, but every time I do, it overwrites the first. I want the final XML file to look like I had one big list of objects. I would just build one big list, but there are so many objects that I quickly max out my heap size.
So, my main creates a bunch of threads, each of which iterates through a list of objects it receives and calls create_Log on each object. Once it is finished, it calls printToFile, which is where it marshals the list to the file.
public class LogThread implements Runnable {
//private Thread myThread;
private Log_Message message = null;
private LinkedList<Log_Message> lmList = null;
LogServer Log = null;
private String Username = null;
public LogThread(LinkedList<Log_Message> lmList){
this.lmList = lmList;
}
public void run(){
//System.out.println("thread running");
LogServer Log = new LogServer();
//create iterator for list
final ListIterator<Log_Message> listIterator = lmList.listIterator();
while(listIterator.hasNext()){
message = listIterator.next();
CountTrans.addTransNumber(message.TransactionNumber);
Username = message.input[2];
Log.create_Log(message.input, message.TransactionNumber, message.Message, message.CMD);
}
Log.printToFile();
init_LogServer.threadCount--;
init_LogServer.doneList();
init_LogServer.doneUser();
System.out.println("Thread "+ Thread.currentThread().getId() +" Completed user: "+ Username+"... Number of Users Complete: " + init_LogServer.getUsersComplete());
//Thread.interrupt();
}
}
The above calls the create_Log function below to build a new object generated from the XSD I was given (SystemEventType, QuoteServerType, etc.). These objects are all added to an ArrayList using the function below and attached to the Root object. Once the LogThread loop is finished, it calls printToFile, which takes the list from the Root object and marshals it to the file... overwriting what was already there. How can I add to the same file without overwriting it and without creating one master list in the heap?
public class LogServer {
public log Root = null;
public static String fileName = "LogFile.xml";
public static File XMLfile = new File(fileName);
public LogServer(){
this.Root = new log();
}
//output LogFile.xml
public synchronized void printToFile(){
System.out.println("Printing XML");
//write to xml file
try {
init_LogServer.marshaller.marshal(Root,XMLfile);
} catch (JAXBException e) {
e.printStackTrace();
}
System.out.println("Done Printing XML");
}
private BigDecimal ConvertStringtoBD(String input){
DecimalFormatSymbols symbols = new DecimalFormatSymbols();
symbols.setGroupingSeparator(',');
symbols.setDecimalSeparator('.');
String pattern = "#,##0.0#";
DecimalFormat decimalFormat = new DecimalFormat(pattern, symbols);
decimalFormat.setParseBigDecimal(true);
// parse the string
BigDecimal bigDecimal = new BigDecimal("0");
try {
bigDecimal = (BigDecimal) decimalFormat.parse(input);
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return bigDecimal;
}
public QuoteServerType Log_Quote(String[] input, int TransactionNumber){
BigDecimal quote = ConvertStringtoBD(input[4]);
BigInteger TransNumber = BigInteger.valueOf(TransactionNumber);
BigInteger ServerTimeStamp = new BigInteger(input[6]);
Date date = new Date();
long timestamp = date.getTime();
ObjectFactory factory = new ObjectFactory();
QuoteServerType quoteCall = factory.createQuoteServerType();
quoteCall.setTimestamp(timestamp);
quoteCall.setServer(input[8]);
quoteCall.setTransactionNum(TransNumber);
quoteCall.setPrice(quote);
quoteCall.setStockSymbol(input[3]);
quoteCall.setUsername(input[2]);
quoteCall.setQuoteServerTime(ServerTimeStamp);
quoteCall.setCryptokey(input[7]);
return quoteCall;
}
public SystemEventType Log_SystemEvent(String[] input, int TransactionNumber, CommandType CMD){
BigInteger TransNumber = BigInteger.valueOf(TransactionNumber);
Date date = new Date();
long timestamp = date.getTime();
ObjectFactory factory = new ObjectFactory();
SystemEventType SysEvent = factory.createSystemEventType();
SysEvent.setTimestamp(timestamp);
SysEvent.setServer(input[8]);
SysEvent.setTransactionNum(TransNumber);
SysEvent.setCommand(CMD);
SysEvent.setFilename(fileName);
return SysEvent;
}
public void create_Log(String[] input, int TransactionNumber, String Message, CommandType Command){
switch(Command.toString()){
case "QUOTE": //Quote_Log
QuoteServerType quote_QuoteType = Log_Quote(input,TransactionNumber);
Root.getUserCommandOrQuoteServerOrAccountTransaction().add(quote_QuoteType);
break;
case "QUOTE_CACHED":
SystemEventType Quote_Cached_SysType = Log_SystemEvent(input, TransactionNumber, CommandType.QUOTE);
Root.getUserCommandOrQuoteServerOrAccountTransaction().add(Quote_Cached_SysType);
break;
}
}
EDIT: The code below shows how the objects are added to the ArrayList:
public List<Object> getUserCommandOrQuoteServerOrAccountTransaction() {
if (userCommandOrQuoteServerOrAccountTransaction == null) {
userCommandOrQuoteServerOrAccountTransaction = new ArrayList<Object>();
}
return this.userCommandOrQuoteServerOrAccountTransaction;
}
Jaxb is about mapping a java object tree to an xml document or vice versa. So in principle, you need the complete object model before you can save it to xml.
Of course that is not possible for very large data, for example a DB dump, so jaxb allows marshalling an object tree in fragments, letting the user control the moment of object creation and marshalling. A typical use case would be fetching records from a DB one by one and marshalling them one by one to a file, so there is no problem with the heap.
However, you are asking about appending one object tree to another (one fresh in memory, the second one already represented in an xml file). That is not normally possible, as it is not really appending but creating a new object tree that contains the content of both (there is only one document root element, not two).
So what you could do is:
- create a new xml representation with a manually initiated root element,
- copy the existing xml content to the new xml, either using XMLStreamWriter/XMLStreamReader read/write operations or by unmarshalling the log objects and marshalling them one by one,
- marshal your new log objects into the same xml stream,
- complete the xml with the root closing element.
Vaguely, something like this:
XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(new FileOutputStream(...), StandardCharsets.UTF_8.name());
//"mannually" output the beginign of the xml document == its declaration and the root element
writer.writeStartDocument();
writer.writeStartElement("YOUR_ROOT_ELM");
Marshaller mar = ...
mar.setProperty(Marshaller.JAXB_FRAGMENT, true); //instructs jaxb to output only objects not the whole xml document
PartialUnmarshaler existing = ...; // allows reading the xml content one by one from the existing file
while (existing.hasNext()) {
YourObject obj = existing.next();
mar.marshal(obj, writer);
writer.flush();
}
List<YourObject> toAppend = ...
for (YourObject obj : toAppend) {
mar.marshal(obj,writer);
writer.flush();
}
//finishing the document, closing the root element
writer.writeEndElement();
writer.writeEndDocument();
Reading objects one by one from a large xml file, and a complete implementation of PartialUnmarshaler, are described in this answer:
https://stackoverflow.com/a/9260039/4483840
That is the 'elegant' solution.
Less elegant is to have your threads write their log lists to individual files and then append them yourself. You only need to read and copy the header of the first file, then copy all of its content apart from the last closing tag, copy the content of the other files while ignoring their document opening and closing tags, and finally output the closing tag.
If your marshaller is set to marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
each opening/closing tag will be on its own line, so the ugly hack is to copy all the lines from the 3rd to the one before last, then output the closing tag.
It is an ugly hack, because it is sensitive to your output format (if you for example change your container root element), but it is faster to implement than the full Jaxb solution.
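As an illustration of that hack, a small sketch; the file names and the root element are invented, and it assumes formatted output with the XML declaration on line 1, the root opening tag on line 2 and the root closing tag on the last line.
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class MergeLogFiles {
    public static void main(String[] args) throws Exception {
        List<String> parts = Arrays.asList("LogFile_thread1.xml", "LogFile_thread2.xml");
        try (PrintWriter out = new PrintWriter("MergedLog.xml", StandardCharsets.UTF_8.name())) {
            out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            out.println("<log>");                           // invented root element
            for (String part : parts) {
                List<String> lines = Files.readAllLines(Paths.get(part));
                // skip the declaration and the root opening tag, drop the root closing tag
                for (String line : lines.subList(2, lines.size() - 1)) {
                    out.println(line);
                }
            }
            out.println("</log>");
        }
    }
}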

How to access ElasticSearch parent-doc fields from custom facet over child docs

I have most of a parent/child-doc solution for a problem I'm working on, but I ran into a hitch: from inside a facet that iterates over the child docs I need to access the value of a parent doc field. I have (or I can get) the parent doc ID (from the _parent field of the child doc, or worst case by indexing it again as a normal field) but that's an "external" ID, not the node-internal ID that I need to load the field value from the field cache. (I'm using default routing so the parent doc is definitely in the same shard as the children.)
More concretely, here's what I have in the FacetCollector so far (ES 0.20.6):
protected void doSetNextReader(IndexReader reader, int docBase) throws IOException {
/* not sure this will work, otherwise I can index the field separately */
parentFieldData = (LongFieldData) fieldDataCache.cache(FieldDataType.DefaultTypes.LONG, reader, "_parent");
parentSpringinessFieldData = (FloatFieldData) fieldDataCache.cache(FieldDataType.DefaultTypes.FLOAT, "springiness");
/* ... */
protected void doCollect(int doc) throws IOException {
long parentID = parentFieldData.value(doc); // or whatever the correct equivalent here is
// here's the problem:
parentSpringiness = parentSpringinessFieldData.value(parentID)
// type error: expected int (node-internal ID), got long (external ID)
Any suggestions? (I can't upgrade to 0.90 yet but would be interested to hear if that would help.)
Honking great disclaimer: (1) I ended up not using this approach at all, so this is only slightly-tested code, and (2) as far as I can see it will be pretty horribly inefficient, and it has the same memory overhead as parent queries. If another approach will work for you, do consider it (for my use case I ended up using nested documents, with a custom facet collector that iterates over both the nested and the parent documents, to have easy access to the field values of both).
The example within the ES code to work from is org.elasticsearch.index.search.child.ChildCollector. The first element you need is in the Collector initialisation:
try {
context.idCache().refresh(context.searcher().subReaders());
} catch (Exception e) {
throw new FacetPhaseExecutionException(facetName, "Failed to load parent-ID cache", e);
}
This makes possible the following line in doSetNextReader():
typeCache = context.idCache().reader(reader).type(parentType);
which gives you a lookup of the parent doc's UId in doCollect(int childDocId):
HashedBytesArray postingUid = typeCache.parentIdByDoc(childDocId);
The parent document won't necessarily be found in the same reader as the child doc: when the Collector initialises you also need to store all readers (needed to access the field value) and for each reader an IdReaderTypeCache (to resolve the parent doc's UId to a reader-internal docId).
this.readers = new Tuple[context.searcher().subReaders().length];
for (int i = 0; i < readers.length; i++) {
IndexReader reader = context.searcher().subReaders()[i];
readers[i] = new Tuple<IndexReader, IdReaderTypeCache>(reader, context.idCache().reader(reader).type(parentType));
}
this.context = context;
Then when you need the parent doc field, you have to iterate over the reader/typecache pairs looking for the right one:
int parentDocId = -1;
for (Tuple<IndexReader, IdReaderTypeCache> tuple : readers) {
IndexReader indexReader = tuple.v1();
IdReaderTypeCache idReaderTypeCache = tuple.v2();
if (idReaderTypeCache == null) { // might be if we don't have that doc with that type in this reader
continue;
}
parentDocId = idReaderTypeCache.docById(postingUid);
if (parentDocId != -1 && !indexReader.isDeleted(parentDocId)) {
FloatFieldData parentSpringinessFieldData = (FloatFieldData) fieldDataCache.cache(
FieldDataType.DefaultTypes.FLOAT,
indexReader,
"springiness");
parentSpringiness = parentSpringinessFieldData.value(parentDocId);
break;
}
}
if (parentDocId == -1) {
throw new FacetPhaseExecutionException(facetName, "Parent doc " + postingUid + " could not be found!");
}
