I have a business requirement that requires me to split PDFs into multiple documents.
Let's say I have a 100MB PDF; for simplicity's sake, I need to split it into multiple PDFs no larger than 10MB apiece.
I am using iText.
I am going to take the original PDF and loop through the pages, but how can I determine the file size of each page without writing it separately to disk?
Sample code, simplified:
int numPages = reader.getNumberOfPages();
PdfImportedPage page;
for (int currentPage = 1; currentPage <= numPages; currentPage++) {
    // Get page from reader
    page = writer.getImportedPage(reader, currentPage);
    // I need the size in bytes of the page here
}
I think the easiest way is to write it to the disk and delete it afterwards:
Document document = new Document();
File f = new File("C:\\delete.pdf"); // for instance
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(f));
document.open();
// A PdfImportedPage is not an Element, so wrap it in an Image before adding it
document.add(Image.getInstance(page));
document.close();
long filesize = f.length(); // this is the file size in bytes
f.delete();
I'm not absolutely sure, I admit, but I don't see how you could determine the file size if the file does not exist.
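One way that might avoid the disk entirely is to write the single page to a ByteArrayOutputStream and use its size() instead of File.length(). A rough sketch (assuming iText 5.x and PdfCopy, with reader and currentPage taken from the question's loop):
// Sketch: measure the size of one copied page entirely in memory.
ByteArrayOutputStream pageBytes = new ByteArrayOutputStream();
Document pageDoc = new Document(reader.getPageSizeWithRotation(currentPage));
PdfCopy pageCopy = new PdfCopy(pageDoc, pageBytes);
pageDoc.open();
pageCopy.addPage(pageCopy.getImportedPage(reader, currentPage));
pageDoc.close();
pageCopy.close();
long pageSize = pageBytes.size(); // size in bytes of the single-page PDF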
Related
I am using iText to split a PDF document into separate pages as PDF files. Each file seems to be too large, as all fonts used in the input PDF are saved into every resulting page, which is apparently not very clean.
The splitting code is below. Note that PdfSmartCopy and setFullCompression don't help reduce the size (and I have no idea why).
public List<byte[]> split(byte[] input) throws IOException, DocumentException {
    PdfReader pdfReader = new PdfReader(input);
    List<byte[]> pdfFiles = new ArrayList<>();
    int pageCount = pdfReader.getNumberOfPages();
    int pageIndex = 0;
    while (++pageIndex <= pageCount) {
        Document document = new Document(pdfReader.getPageSizeWithRotation(pageIndex));
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        PdfCopy pdfCopy = new PdfSmartCopy(document, byteArrayOutputStream);
        pdfCopy.setFullCompression();
        PdfImportedPage pdfImportedPage = pdfCopy.getImportedPage(pdfReader, pageIndex);
        document.open();
        pdfCopy.addPage(pdfImportedPage);
        document.close();
        pdfCopy.close();
        pdfFiles.add(byteArrayOutputStream.toByteArray());
    }
    return pdfFiles;
}
So is there a way in Java (iText or not) to solve this problem?
Update with demo PDF
Here is a 377KB PDF using multiple CJK fonts, where any single page uses only 1 or 2 of those fonts. The combined size of the sub-PDFs is 1.2MB. Considering that CJK fonts are very bloated, I would like to find a way to remove unused fonts and even remove unused characters from the fonts that are used.
So my idea is to keep only the used characters of the used fonts, embed those subsets in the sub-files, and un-embed all other fonts. Any advice?
How can I generate a PDF report of multiple pages with the same content on each page? The following is the code for a single-page report. The multiple pages should be in a single PDF file.
<%
response.setContentType( "application/pdf" );
response.setHeader ("Content-Disposition","attachment;filename=TEST1.pdf");
Document document=new Document(PageSize.A4,25,25,35,0);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
PdfWriter writer=PdfWriter.getInstance( document, buffer);
document.open();
Font fontnormalbold = FontFactory.getFont("Arial", 10, Font.BOLD);
Paragraph p1=new Paragraph("",fontnormalbold);
float[] iwidth = {1f,1f,1f,1f,1f,1f,1f,1f};
float[] iwidth1 = {1f};
PdfPTable table1 = new PdfPTable(iwidth);
table1.setWidthPercentage(100);
PdfPCell cell = new PdfPCell(new Paragraph("Testing Page",fontnormalbold));
cell.setHorizontalAlignment(1);
cell.setColspan(8);
cell.setPadding(5.0f);
table1.addCell(cell);
PdfPTable outerTable = new PdfPTable(iwidth1);
outerTable.setWidthPercentage(100);
PdfPCell containerCell = new PdfPCell();
containerCell.addElement(table1);
outerTable.addCell(containerCell);
p1.add(outerTable);
document.add(new Paragraph(p1));
document.close();
DataOutput output = new DataOutputStream( response.getOutputStream() );
byte[] bytes = buffer.toByteArray();
response.setContentLength(bytes.length);
for( int i = 0; i < bytes.length; i++ ) { output.writeByte( bytes[i] ); }
response.getOutputStream().flush();
response.getOutputStream().close();
%>
There are different ways to solve this problem. Not all of them are elegant.
Approach 1: add the same table many times.
I see that you are creating a PdfPTable object named outerTable. I'm going to ignore the silly things you do with this table (e.g. why are you adding this table to a Paragraph? Why are you adding a single cell with colspan 8 to a table with 8 columns? Why are you nesting this table inside a table with a single column? All of these shenanigans are really weird), but having that outerTable, you could do this:
for (int i = 0; i < x; i++) {
    document.add(outerTable);
    document.newPage();
}
This will add the table x times and it will start a new page for every table. This is also what the people in the comments advised you, and although the code looks really elegant, it doesn't result in an elegant PDF. That is: if you were my employee, I'd fire you if you did this.
Why? Because adding a table requires CPU and you are using x times the CPU you need. Moreover, with every table you create, you create new content streams. The same content will be added x times to your document. Your PDF will be about x times bigger than it should be.
Why would this be a reason to fire a developer? Because applications like this usually live in the cloud, and in the cloud one usually pays for CPU and bandwidth. A developer who writes code that requires a multiple of the CPU and bandwidth actually needed causes an unacceptable cost. In many cases, it is more cost-efficient to fire bad developers, hire slightly more expensive developers, and buy slightly more expensive software, and then save plenty of money in the long term thanks to code that is more efficient in terms of CPU and bandwidth.
Approach 2: add the table to a PdfTemplate, reuse the PdfTemplate.
Please take a look at my answer to the StackOverflow question How to resize a PdfPTable to fit the page?
In this example, I create a PdfPTable named table. I know how wide I want the table to be (PageSize.A4.getWidth()), but I don't know in advance how high it will be. So I lock the width, I add the cells I need to add, and then I can calculate the height of the table like this: table.getTotalHeight().
I create a PdfTemplate that is exactly as big as the table:
PdfContentByte canvas = writer.getDirectContent();
PdfTemplate template = canvas.createTemplate(
table.getTotalWidth(), table.getTotalHeight());
I now add the table to this template:
table.writeSelectedRows(0, -1, 0, table.getTotalHeight(), template);
I wrap the table inside an Image object. This doesn't mean we're rasterizing the table; all text and lines are preserved as vector data.
Image img = Image.getInstance(template);
I scale the img so that it fits the page size I have in mind:
img.scaleToFit(PageSize.A4.getWidth(), PageSize.A4.getHeight());
Now I position the table vertically in the middle.
img.setAbsolutePosition(
0, (PageSize.A4.getHeight() - table.getTotalHeight()) / 2);
If you want to add the table multiple times, this is how you'd do it:
for (int i = 0; i < x; i++) {
    document.add(img);
    document.newPage();
}
What is the difference with Approach 1? Well, by using PdfTemplate, you are creating a Form XObject. A Form XObject is a content stream that is external to the page stream. A Form XObject is stored in the PDF file only once, and it can be reused many times, e.g. on every page of a document.
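For reference, the snippets above might be combined roughly as follows. This is only a sketch, assuming an open Document named document, its PdfWriter named writer, a filled PdfPTable named table, and a repeat count x; the two width-locking calls correspond to the "lock the width" step that is described in prose but not shown as code above:
// Lock the table width so that getTotalHeight() can be computed
table.setTotalWidth(PageSize.A4.getWidth());
table.setLockedWidth(true);
// Render the table once into a Form XObject
PdfContentByte canvas = writer.getDirectContent();
PdfTemplate template = canvas.createTemplate(
        table.getTotalWidth(), table.getTotalHeight());
table.writeSelectedRows(0, -1, 0, table.getTotalHeight(), template);
// Wrap the XObject in an Image, scale and position it, and reuse it on every page
Image img = Image.getInstance(template);
img.scaleToFit(PageSize.A4.getWidth(), PageSize.A4.getHeight());
img.setAbsolutePosition(0, (PageSize.A4.getHeight() - table.getTotalHeight()) / 2);
for (int i = 0; i < x; i++) {
    document.add(img);
    document.newPage();
}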
Approach 3: create a PDF document with a single page; concatenate the file many times
You are creating your PDF in memory. The PDF is stored in the buffer object. You could read this PDF using PdfReader like this:
PdfReader reader = new PdfReader(buffer.toByteArray());
Then you reuse this content like this:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Document doc = new Document();
PdfSmartCopy copy = new PdfSmartCopy(doc, baos);
doc.open();
for (int i = 0; i < x; i++) {
    copy.addDocument(reader);
}
doc.close();
reader.close();
Now you can send the bytes stored in baos to the OutputStream of your response object. Make sure that you use PdfSmartCopy instead of PdfCopy. PdfCopy just copies the pages as-is without checking whether there is redundant information; the result is a bloated PDF similar to the one you'd get if you used Approach 1. PdfSmartCopy looks at the bytes of the content streams and will detect that you're adding the same page over and over again, so that page will be reused the same way as in Approach 2.
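Sending the bytes stored in baos back to the browser could then look roughly like this (a sketch, assuming the same response object as in the JSP above):
// Write the merged PDF from the ByteArrayOutputStream to the servlet response
byte[] pdfBytes = baos.toByteArray();
response.setContentType("application/pdf");
response.setHeader("Content-Disposition", "attachment;filename=TEST1.pdf");
response.setContentLength(pdfBytes.length);
response.getOutputStream().write(pdfBytes);
response.getOutputStream().flush();
response.getOutputStream().close();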
Here I am combining 2 PDF documents using the iText packages.
Merging was done successfully using the code below:
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
PdfContentByte cb = writer.getDirectContent();
for (InputStream in : list)
{
    PdfReader reader = new PdfReader(in);
    for (int i = 1; i <= reader.getNumberOfPages(); i++)
    {
        document.newPage();
        // import the page from the source pdf
        PdfImportedPage page = writer.getImportedPage(reader, i);
        // add the page to the destination pdf
        cb.addTemplate(page, 0, 0);
    }
}
outputStream.flush();
document.close();
outputStream.close();
Here list is a List of InputStreams, and outputStream is an OutputStream.
The problem I am having is that I want the PDF documents in the list to be appended right after the 1st PDF's content
(i.e. if the 1st PDF has 4 lines, I want the 2nd PDF to continue on the same page after the 4th line).
What I am getting is that the 2nd PDF is added on the second page.
Is there any alternative to document.newPage()?
Can anyone help me with it?
Thanks, I would like to hear any responses :)
It depends on the requirements you have. As long as you are only interested in the page contents of the merged PDFs (not in the page annotations), and the pages have no content other than the text lines you mention (in particular no background graphics, watermarks, or header/footer lines), you can use either the PdfDenseMergeTool from this answer or the PdfVeryDenseMergeTool from this answer.
If you are interested in annotations, it should be no problem to extend those classes accordingly. If your PDFs have background graphics, watermarks, headers, or footers, they should be removed beforehand.
I have to display a PDF document in a JSP page. The PDF document has 25 pages, but I want to display only 10 pages of the PDF file. How can I achieve this with the help of iText?
Assuming you already have the PDF file, you can use PdfReader and PdfCopy to slice it up:
PdfReader reader = new PdfReader("THE PDF SOURCE");
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Document document = new Document();
PdfCopy copy = new PdfCopy(document, outputStream);
document.open();
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Select what pages you need here
    PdfImportedPage importedPage = copy.getImportedPage(reader, i);
    copy.addPage(importedPage);
}
copy.freeReader(reader);
outputStream.flush();
document.close();
// Now you can send the byte array to your user
// set content type to application/pdf
As for sending the PDF to be displayed, that depends on how you display it. At the end of the code above, the output stream contains the pages you copied in the loop; in the example that is all of the pages.
This is essentially a new PDF file, but in memory. If it is the same 10 pages of the same file every time, you may consider saving it as a file.
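If only the first 10 pages are wanted, the copy loop and the delivery to the browser might look roughly like this sketch (the 10-page limit and the response handling are assumptions based on the question, reusing the reader, copy, document and outputStream from the code above and the JSP's response object):
// Copy at most the first 10 pages
int pagesToCopy = Math.min(10, reader.getNumberOfPages());
for (int i = 1; i <= pagesToCopy; i++) {
    copy.addPage(copy.getImportedPage(reader, i));
}
copy.freeReader(reader);
document.close();
// Send the sliced PDF to the client
byte[] slice = outputStream.toByteArray();
response.setContentType("application/pdf");
response.setContentLength(slice.length);
response.getOutputStream().write(slice);
response.getOutputStream().flush();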
The code below merges PDF files and returns the combined PDF data. When I run it to combine 100 files, each approximately 500KB, I get an OutOfMemoryError at the line document.close(). This code runs in a web environment; is the memory available to the WebSphere server the problem? I read in an article to use the freeReader method, but I cannot figure out how to use it in my scenario.
protected ByteArrayOutputStream joinPDFs(List<InputStream> pdfStreams,
        boolean paginate) {
    Document document = new Document();
    ByteArrayOutputStream mergedPdfStream = new ByteArrayOutputStream();
    try {
        //List<InputStream> pdfs = pdfStreams;
        List<PdfReader> readers = new ArrayList<PdfReader>();
        int totalPages = 0;
        //Iterator<InputStream> iteratorPDFs = pdfs.iterator();
        Iterator<InputStream> iteratorPDFs = pdfStreams.iterator();
        // Create Readers for the pdfs.
        while (iteratorPDFs.hasNext()) {
            InputStream pdf = iteratorPDFs.next();
            if (pdf == null)
                continue;
            PdfReader pdfReader = new PdfReader(pdf);
            readers.add(pdfReader);
            totalPages += pdfReader.getNumberOfPages();
        }
        // clear this
        pdfStreams = null;
        //WeakReference ref = new WeakReference(pdfs);
        //ref.clear();
        // Create a writer for the outputstream
        PdfWriter writer = PdfWriter.getInstance(document, mergedPdfStream);
        writer.setFullCompression();
        document.open();
        BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA,
                BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
        PdfContentByte cb = writer.getDirectContent(); // Holds the PDF data
        PdfImportedPage page;
        int currentPageNumber = 0;
        int pageOfCurrentReaderPDF = 0;
        Iterator<PdfReader> iteratorPDFReader = readers.iterator();
        // Loop through the PDF files and add to the output.
        while (iteratorPDFReader.hasNext()) {
            PdfReader pdfReader = iteratorPDFReader.next();
            // Create a new page in the target for each source page.
            while (pageOfCurrentReaderPDF < pdfReader.getNumberOfPages()) {
                pageOfCurrentReaderPDF++;
                document.setPageSize(pdfReader
                        .getPageSizeWithRotation(pageOfCurrentReaderPDF));
                document.newPage();
                // pageOfCurrentReaderPDF++;
                currentPageNumber++;
                page = writer.getImportedPage(pdfReader,
                        pageOfCurrentReaderPDF);
                cb.addTemplate(page, 0, 0);
                // Code for pagination.
                if (paginate) {
                    cb.beginText();
                    cb.setFontAndSize(bf, 9);
                    cb.showTextAligned(PdfContentByte.ALIGN_CENTER, ""
                            + currentPageNumber + " of " + totalPages, 520,
                            5, 0);
                    cb.endText();
                }
            }
            pageOfCurrentReaderPDF = 0;
            System.out.println("now the size is: " + pdfReader.getFileLength());
        }
        mergedPdfStream.flush();
        document.close();
        mergedPdfStream.close();
        return mergedPdfStream;
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (document.isOpen())
            document.close();
        try {
            if (mergedPdfStream != null)
                mergedPdfStream.close();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
    return mergedPdfStream;
}
Thanks
V
This code merges all the PDFs into an array in memory (the heap), so yes, memory usage will grow linearly with the number of files merged.
I don't know about the freeReader method, but maybe you could try writing the merged PDF into a temporary file instead of a byte array? mergedPdfStream would then be a FileOutputStream instead of a ByteArrayOutputStream, and you would return e.g. a File reference to the client code.
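A rough sketch of that idea (the temporary-file handling here is an assumption, not code from the question):
// Sketch: write the merged PDF to a temporary file instead of keeping it on the heap
File mergedFile = File.createTempFile("merged", ".pdf");
OutputStream out = new BufferedOutputStream(new FileOutputStream(mergedFile));
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, out);
writer.setFullCompression();
document.open();
// ... add the imported pages exactly as in the question's loop ...
document.close(); // also finishes writing to the file
out.close();
// hand mergedFile (a java.io.File) back to the caller instead of a ByteArrayOutputStream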
Or you could increase the quantity of memory Java can use (-Xmx JVM parameter), but if the number of files to merge eventually increases, you will find yourself with the same problem.
First, why do you clutter your code with all that Iterator<> boilerplate?
Have you ever heard of the enhanced for statement?
For example:
for (PdfReader pdfReader : readers) {
    // code for each single PDF reader in readers
}
Second, consider closing each pdfReader as soon as you are done with it. This will hopefully flush some buffers and free the memory occupied by the original PDF.
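For the freeReader method mentioned in the question, one possible pattern (a sketch, assuming the document, writer and cb objects from the question's code) is to release each reader from the writer and close it right after its pages have been copied:
for (PdfReader pdfReader : readers) {
    for (int p = 1; p <= pdfReader.getNumberOfPages(); p++) {
        document.setPageSize(pdfReader.getPageSizeWithRotation(p));
        document.newPage();
        cb.addTemplate(writer.getImportedPage(pdfReader, p), 0, 0);
    }
    // Release the reader's resources held by the writer, then close the reader
    writer.freeReader(pdfReader);
    pdfReader.close();
}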
This is not the proper way to do this kind of file operation. You are merging the files using an ArrayList and arrays entirely in memory. You should rather use file I/O with buffering techniques.
Do you want to show the final merged file at the end? Then you can open the file after all your merging is done.
Do not rely only on in-memory buffering as you have shown. Use file I/O with buffering (a byte[] buffer, I mean).
Close each file after you read it and append it.
Java has only the limited amount of memory you allocate at startup, so merging a large number of files at once like this can crash the application. You should try running this merge operation in a separate thread using a thread pool, so that your application does not get stuck on it.
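A minimal sketch of handing the merge off to a thread pool (assuming a joinPDFs method like the one in the question; the executor setup is illustrative only):
// Run the merge off the request thread and pick up the result later
ExecutorService executor = Executors.newFixedThreadPool(2);
Future<ByteArrayOutputStream> mergedFuture =
        executor.submit(() -> joinPDFs(pdfStreams, true));
// ... do other work, then block only when the result is actually needed ...
ByteArrayOutputStream merged = mergedFuture.get();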
thanks.
100 files * 500 kB is around 50 MB. If the maximum heap size is 64 MB, I'm pretty sure this code won't work under such conditions.