My application uses an RTF file with merge fields as source and creates a PDF file with it using Aspose.Words. The users of this application give that resulting document to their clients, so copies of same document will be printed for each of their client. There is only one difference on those copies however, and that is copy number at the end of each document copy.
For now; lets say there are 4 clients so 4 copies of the same document will be printed with only copy numbers different. I achieve this by creating same document for 4 times and each time I insert my html text, merge fields, and add copy number then append the documents. In the end, I have one big document in which all 4 created documents appended.
Here is my code block for it, there were lots of code there, so I tried to downsize them to only related parts:
import com.aspose.words.*
Document docAllAppended = new Document(loadDocument("/documents/" + RTFFileName));
Document docTemp=null;
for (int i = 1; i <= copyNumber; i++) {
docTemp = new Document(loadDocument("/documents/" + RTFFileName));
DocumentBuilder builder = new DocumentBuilder(docTemp);
//insert html which includes file context
builder.insertHtml(htmlText);
//insert Copy number
builder.moveToBookmark("sayfa");
Font font = builder.getFont();
font.setBold(true);
font.setSize(8);
builder.write("Copy Number-" + i+ " / ");
font.setBold(false);
docAllAppended.appendDocument(docTemp,ImportFormatMode.USE_DESTINATION_STYLES);
}
This looks so unnecessary and has low performance. Also each time my users try to change copy number to be printed, my application calculates whole thing from the start. What I am asking is, is there a way to make this faster or how not to create whole thing again when copy number to be printed changes? So far I haven't found much.
Thanks in advance.
If the only difference is the copy number, then you can just prepare the document once by inserting HTML, merging etc.
Then, in a for loop, set the copy number and save the document as docx or pdf. Appending the document in the loop is not necessary, you can save each copy as different name.
import com.aspose.words.*
Document docAllAppended = new Document(loadDocument("/documents/" + RTFFileName));
Document docTemp=null;
docTemp = new Document(loadDocument("/documents/" + RTFFileName));
DocumentBuilder builder = new DocumentBuilder(docTemp);
//insert html which includes file context
builder.insertHtml(htmlText);
// In for loop, only update the copy number
for (int i = 1; i <= copyNumber; i++) {
// Use DocumentBuilder for font setting
builder.moveToBookmark("sayfa");
Font font = builder.getFont();
font.setBold(true);
font.setSize(8);
builder.write("dummy value");
font.setBold(false);
// Use Bookmark for setting the actual value
Bookmark bookmark = docAllAppended.getRange().getBookmarks().get("sayfa");
bookmark.setText("Copy Number-" + i + " / ");
// Save the document for each client
docAllAppended.save(Common.DATA_DIR + "Letter-Client-" + i + ".docx");
}
I work with Aspose as Developer Evangelist.
Related
I am trying to convert PDF file to CSV or EXCEL format.
Here is the code I use to convert to CSV format:
public void convert() throws Exception {
PdfReader pdfReader = new PdfReader("example.pdf");
PdfDocument pdf = new PdfDocument(pdfReader);;
int pages = pdf.getNumberOfPages();
FileWriter csvWriter = new FileWriter("student.csv");
for (int i = 1; i <= pages; i++) {
PdfPage page = pdf.getPage(i);
String content = PdfTextExtractor.getTextFromPage(page);
String[] splitContents = content.split("\n");
boolean isTitle = true;
for (int j = 0; j < splitContents.length; j++) {
if (isTitle) {
isTitle = false;
continue;
}
csvWriter.append(splitContents[j].replaceAll(" ", " "));
csvWriter.append("\n");
}
}
csvWriter.flush();
csvWriter.close();
}
This code works correctly, but the fact is that the CSV format groups rows without taking into account existing columns (some of them are empty), so I would like to convert this file (PDF) to EXCEL format.
The PDF file itself is formed as a table.
What do I mean about spaces. For example, in a PDF file, in a table
| name | some data | | | some data 1 | |
+----------+----------------+------------+-------------+-------------------+--------------+
After converting to a CSV file, the line looks like this:
name some data some data 1
How can I get the same result as a PDF table?
I'd suggest to use PDFBox, like here: Parsing PDF files (especially with tables) with PDFBox
or another library that will allow you to check the data in the Table point by point, and will allow you to create a table by column width (something like Table table = page.getTable(dividers)); ).
If the width of the columns changes, you'll have to implement it based on the headers/first data column ([e.g. position.x of the last character of the first word] minus [position.x of the first character of the new word] - you'll have to figure it out yourself), it's hard so you could make it hardcoded in the beginning. Using Foxit Reader PDF App you can easily measure column width. Then, if you don't find any data in a particular column, you will be able to add an empty column in the CSV file. I know from my own experience that it is not easy, so I wish you good luck.
I am evaluating to replace our pdf processing from itext to pdfbox. I did some tests with 200 pdfs with a single page (94KB, 469KB, 937KB) and merged them to one pdf in our application. PDFBox version: 2.0.23.
itext version: 2.1.7. Here are the test results:
Here is the itext implementation:
byte[] l_PDFPage = null;
PdfReader l_PDFReader = null;
PdfCopy l_Copier = null;
Document l_PDFDocument = null;
OutputStream l_Stream = new FileOutputStream(m_File);
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
l_Page = m_Editor.getPageAt(i);
l_PDFPage = l_Page.getAsPdf();
l_PDFReader = new PdfReader(l_PDFPage);
l_PDFReader.getPageN(1).put(PdfName.ROTATE, new PdfNumber(l_PDFReader.getPageRotation(1) + l_Page.getRotation() % 360));
l_PDFReader.consolidateNamedDestinations();
if( i == 0 ) {
l_PDFDocument = new Document(l_PDFReader.getPageSizeWithRotation(1));
l_Copier = new PdfCopy(l_PDFDocument, l_Stream);
l_PDFDocument.open();
}
l_Copier.addPage(l_Copier.getImportedPage(l_PDFReader, 1));
if( l_PDFReader.getAcroForm() != null )
l_Copier.copyAcroForm(l_PDFReader);
l_Copier.flush();
l_Copier.freeReader(l_PDFReader);
}
l_PDFDocument.close();
l_Stream.close();
Here is the pdfbox implementation:
byte[] l_PDFPage = null;
List<PDDocument> pageDocuments = new ArrayList<>();
PDDocument saveDocument = new PDDocument();
try {
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
// our wrapper object for a page
l_Page = m_Editor.getPageAt(i);
// page as byte[]
l_PDFPage = l_Page.getAsPdf();
PDDocument document = PDDocument.load(l_PDFPage);
// save page document to close it later
pageDocuments.add(document);
PDPage page = document.getPage(0);
saveDocument.addPage(saveDocument.importPage(page));
}
saveDocument.save(l_Stream);
}
finally {
// close every page document
for(PDDocument doc : pageDocuments) {
doc.close();
}
saveDocument.close();
}
I have also tried using pdfmerger of pdfbox. The performance was nearly the same as the other pdfbox implementation. But with the 937KB files I run in an outofmemory exception with this implementation:
byte[] l_PDFPage = null;
OutputStream l_Stream = new FileOutputStream(m_File);
PDFMergerUtility merger = new PDFMergerUtility();
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
l_Page = m_Editor.getPageAt(i);
// page as byte[]
l_PDFPage = l_Page.getAsPdf();
merger.addSource(new ByteArrayInputStream(l_PDFPage));
}
merger.setDestinationStream(l_Stream);
merger.mergeDocuments(null);
So my questions:
Why is the performance (needed time AND memory usage) of pdfbox so bad in comparison to itext?
Am I missing something in our pdfbox implementation?
Why I can't close the "page document" after I added the page in "saveDocument"? If i close it there I'd get an error while saving so I have to store the "page documents" and close them at the end.
PDFBox and iText are architecturally different and, therefore, perform differently well for different tasks.
In particular iText attempts to write out new contents early, in your case much of the page is written to the output already during
l_Copier.addPage(l_Copier.getImportedPage(l_PDFReader, 1));
and
l_PDFDocument.close();
eventually only finalizes the PDF and writes last remaining objects and the file trailer.
PDFBox on the other hand saves everything in the end at once:
saveDocument.save(l_Stream);
The approach of iText has the advantage of a smaller memory footprint (as you observed) and the disadvantage that you cannot change data of a page once it is written.
(As an aside: the iText architecture has changed from iText 5 to iText 7, in iText 7 you have the choice and can keep everything in memory, but the price here also is a big memory footprint.)
Thus,
Why is the performance (needed time AND memory usage) of pdfbox so bad in comparison to itext?
The difference in memory use can partially be explained by the above. Also in iText after
l_Copier.freeReader(l_PDFReader);
the PdfReader can be closed (which you leave to the garbage collection to do for you) to free its resources while in your PDFBox code you keep all the source documents open, holding the resources up to the end. (Actually I would have assumed that when you're using importPage, you needn't keep them.)
Concerning the time I'm not sure now. You should do some finer clocking and determine where exactly the extra time is used in PDFBox; thus, I second #Tilman's request for profiling data. I assume it's during the final save but that's only a hunch. Also such time differences might depend on structural details of the PDF in question and may be less extreme for other documents.
How can i generate pdf report of multiple pages with same content on each page. Following is the code for single page report. Multiple pages should be in a single pdf file.
<%
response.setContentType( "application/pdf" );
response.setHeader ("Content-Disposition","attachment;filename=TEST1.pdf");
Document document=new Document(PageSize.A4,25,25,35,0);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
PdfWriter writer=PdfWriter.getInstance( document, buffer);
document.open();
Font fontnormalbold = FontFactory.getFont("Arial", 10, Font.BOLD);
Paragraph p1=new Paragraph("",fontnormalbold);
float[] iwidth = {1f,1f,1f,1f,1f,1f,1f,1f};
float[] iwidth1 = {1f};
PdfPTable table1 = new PdfPTable(iwidth);
table1.setWidthPercentage(100);
PdfPCell cell = new PdfPCell(new Paragraph("Testing Page",fontnormalbold));
cell.setHorizontalAlignment(1);
cell.setColspan(8);
cell.setPadding(5.0f);
table1.addCell(cell);
PdfPTable outerTable = new PdfPTable(iwidth1);
outerTable.setWidthPercentage(100);
PdfPCell containerCell = new PdfPCell();
containerCell.addElement(table1);
outerTable.addCell(containerCell);
p1.add(outerTable);
document.add(new Paragraph(p1));
document.close();
DataOutput output = new DataOutputStream( response.getOutputStream() );
byte[] bytes = buffer.toByteArray();
response.setContentLength(bytes.length);
for( int i = 0; i < bytes.length; i++ ) { output.writeByte( bytes[i] ); }
response.getOutputStream().flush();
response.getOutputStream().close();
%>
There are different way to solve this problem. Not all of the solutions are elegant.
Approach 1: add the same table many times.
I see that you are creating a PdfPTable object named outerTable. I'm going to ignore the silly things you do with this table (e.g. why are you adding this table to a Paragraph? Why are you adding a single cell with colspan 8 to a table with 8 columns? Why are you nesting this table into a table with a single column? All of these shenanigans are really weird), but having that outertable, you could do this:
for (int i = 0; i < x; i++) {
document.add(outerTable);
document.newPage();
}
This will add the table x times and it will start a new page for every table. This is also what the people in the comments advised you, and although the code looks really elegant, it doesn't result in an elegant PDF. That is: if you were my employee, I'd fire you if you did this.
Why? Because adding a table requires CPU and you are using x times the CPU you need. Moreover, with every table you create, you create new content streams. The same content will be added x times to your document. Your PDF will be about x times bigger than it should be.
Why would this be a reason to fire a developer? Because applications like this usually live in the cloud. In the cloud, one usually pays for CPU and bandwidth. A developer who writes code that requires a multiple of CPU and bandwidth, causes a cost that is unacceptable. In many cases, it is more cost-efficient to fire bad developers, hire slightly more expensive developers and buy slightly more expensive software, and then save plenty of money on the long term thanks to code that is more efficient in terms of CPU and band-width.
Approach 2: add the table to a PdfTemplate, reuse the PdfTemplate.
Please take a look at my answer to the StackOverflow question How to resize a PdfPTable to fit the page?
In this example, I create a PdfPTable named table. I know how wide I want the table to be (PageSize.A4.getWidth()), but I don't know in advance how high it will be. So I lock the width, I add the cells I need to add, and then I can calculate the height of the table like this: table.getTotalHeight().
I create a PdfTemplate that is exactly as big as the table:
PdfContentByte canvas = writer.getDirectContent();
PdfTemplate template = canvas.createTemplate(
table.getTotalWidth(), table.getTotalHeight());
I now add the table to this template:
table.writeSelectedRows(0, -1, 0, table.getTotalHeight(), template);
I wrap the table inside an Image object. This doesn't mean we're rasterizing the table, all text and lines are preserved as vector-data.
Image img = Image.getInstance(template);
I scale the img so that it fits the page size I have in mind:
img.scaleToFit(PageSize.A4.getWidth(), PageSize.A4.getHeight());
Now I position the table vertically in the middle.
img.setAbsolutePosition(
0, (PageSize.A4.getHeight() - table.getTotalHeight()) / 2);
If you want to add the table multiple times, this is how you'd do it:
for (int i = 0; i < x; i++) {
document.add(img);
document.newPage();
}
What is the difference with Approach 1? Well, by using PdfTemplate, you are creating a Form XObject. A Form XObject is a content stream that is external to the page stream. A Form XObject is stored in the PDF file only once, and it can be reused many times, e.g. on every page of a document.
Approach 3: create a PDF document with a single page; concatenate the file many times
You are creating your PDF in memory. The PDF is stored in the buffer object. You could read this PDF using PdfReader like this:
PdfReader reader = new PdfReader(buffer.toByteArray());
Then you reuse this content like this:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Document doc = new Document();
PdfSmartCopy copy = new PdfSmartCopy(doc, baos);
doc.open();
for (int i = 0; i < x; i++) {
copy.addDocument(reader);
}
doc.close();
reader.close();
Now you can send the bytes stored in baos to the OutputStream of your response object. Make sure that you use PdfSmartCopy instead of PdfCopy. PdfCopy just copies the pages AS-IS without checking if there is redundant information. The result is a bloated PDF similar to the one you'd get if you'd use Approach 1. PdfSmartCopy looks at the bytes of the content streams and will detect that you're adding the same page over and over again. That page will be reused the same way as is done in Approach 2.
we have a PDF with some fields in order to collect some data, and I have to fill it programmatically with iText on Android by adding some text in those positions. I've been thinking about different ways to achieve this, with little success in each one.
Note: I'm using the Android version of iText (iTextG 5.5.4) and a Samsung Galaxy Note 10.1 2014 (Android 4.4) for most of my tests.
The approach I took from the start was to "draw" the text on a given coordinates, for a given page. This has some problems with the management of the fields (I have to be aware of the length of the strings, and it could be hard to position each text in the exact coordinate of the pdf). But most importantly, the performance of the process is really slow in some devices/OSVersions (it works great in Nexus 5 with 5.0.2, but takes several minutes with a 5MB Pdf on the Note 10.1).
pdfReader = new PdfReader(is);
document = new Document();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
pdfCopy = new PdfCopy(document, baos);
document.open();
PdfImportedPage page;
PdfCopy.PageStamp stamp;
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
page = pdfCopy.getImportedPage(pdfReader, i); // First page = 1
stamp = pdfCopy.createPageStamp(page);
for (int i=0; i<10; i++) {
int posX = i*50;
int posY = i*100;
Phrase phrase = new Phrase("Example text", FontFactory.getFont(FontFactory.HELVETICA, 12, BaseColor.RED));
ColumnText.showTextAligned(stamp.getOverContent(), Element.ALIGN_CENTER, phrase, posX, posY, 0);
}
stamp.alterContents();
pdfCopy.addPage(page);
}
We though about adding "forms fields" instead of drawing. That way I can configure a TextField and avoid managing the texts myself. However, the final PDF shouldn't have any annotations, so I would need to copy it into a new Pdf without annotations and with those "forms fields" drawn. I don't have an example of this because I wasn't able to perform this, I don't even know if this is possible/worthwhile.
The third option would be to receive a Pdf with the "forms fields" already added, that way I only have to fill them. However I still need to create a new Pdf with all those fields and without annotations...
I'd like to know what's be the best way in performance to do this process, and any help about achieving it. I am really newbie with iText and any help would be really appreciated.
Thanks!
EDIT
At the end I used the third option: a PDF with editable fields that we fill, and then we use the "flattening" to create a non-editable PDF with all texts already there.
The code is as follows:
pdfReader = new PdfReader(is);
FileOutputStream fios = new FileOutputStream(outPdf);
PdfStamper pdfStamper = new PdfStamper(pdfReader, fios);
//Filling the PDF (It's totally necessary that the PDF has Form fields)
fillPDF(pdfStamper);
//Setting the PDF to uneditable format
pdfStamper.setFormFlattening(true);
pdfStamper.close();
and the method to fill the forms:
public static void fillPDF(PdfStamper stamper) throws IOException, DocumentException{
//Getting the Form fields from the PDF
AcroFields form = stamper.getAcroFields();
Set<String> fields = form.getFields().keySet();
for(String field : fields){
form.setField("name", "Ernesto");
form.setField("surname", "Lage");
}
}
}
The only thing about this approach is that you need to know the name of each field in order to fill it.
There is a process in iText known as 'flattening', which takes the form fields, and replaces them with the text that the fields contain.
I haven't used iText in a few years (and not at all on Android), but if you search the manual or online examples for 'flattening', you should find how to do it.
how can i get the number of pages in a word document(.doc or .docx) using Aspose java?
or maybe get the number of pages in a word document in java without using Aspose.
You can use Document.getPageCount method to get the page count of a doc / docx file in Aspose.Words for Java. Following is the sample code:
//Open the Word Document
Document doc = new Document("C:\\Data\\Image2.doc");
//Get page count
int pageCount = doc.getPageCount();
//Print Page Count
System.out.println(pageCount);
Hope this helps.
To open from a stream simply pass a stream object that contains a document to the Document constructor. The code sample below shows how to open a document from a stream and get number of Pages.
String dataDir = "D:\\Temp\\";
String filename = "input.docx";
InputStream in = new FileInputStream(dataDir + filename);
Document doc = new Document(in);
System.out.println("Document opened. Total pages are " + doc.getPageCount());
in.close();
I work with Aspose as Developer Evangelist.