Remain only used font subsets while splitting PDF in Java

Remain only used font subsets while splitting PDF in Java - java

I am using iText to split a PDF document into separate pages as PDF files. Each file seems to be too large as all fonts used in the input PDF are saved into all result pages, which is apparently not very clean.
Code of splitting is as below. Notice PdfSmartCopy and setFullCompression doesn't help to reduce size (which I have no idea why).
public List<byte[]> split(byte[] input) throws IOException, DocumentException {
PdfReader pdfReader = new PdfReader(input);
List<byte[]> pdfFiles = new ArrayList<>();
int pageCount = pdfReader.getNumberOfPages();
int pageIndex = 0;
while (++pageIndex <= pageCount) {
Document document = new Document(pdfReader.getPageSizeWithRotation(pageIndex));
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
PdfCopy pdfCopy = new PdfSmartCopy(document, byteArrayOutputStream);
pdfCopy.setFullCompression();
PdfImportedPage pdfImportedPage = pdfCopy.getImportedPage(pdfReader, pageIndex);
document.open();
pdfCopy.addPage(pdfImportedPage);
document.close();
pdfCopy.close();
pdfFiles.add(byteArrayOutputStream.toByteArray());
}
return pdfFiles;
}
So is there a way in Java (iText or not) to solve these problem?
Update with demo PDF
Here is a 377KB PDF using multiple CJK fonts where any page in it using 1 or 2 fonts. The summary size of sub-PDFs is 1.2MB. Considering CJK fonts are very bloated, I would like to find a way to remove unused font and even remove unused characters in used fonts.
So my idea is to remain only used characters in used fonts and embed them in sub files and then un-embed all other fonts. Any advice?

Related

The PDF has been Split using PDF Box library and the resultant PDF has almost same size as source PDF file

I'm using the below code to split the huge PDF into two different PDF. The PDF is splitting properly. The first PDF will be generated with first 2 pages of PDF and 2nd PDF will be generated with the rest of the pages of PDF.
The problem is the size, the source PDF is 17 MB. The 2 PDF's that are generated are also of 15MB each. Logically it should be less in size, I searched the forum, they said PDFont has to be used properly. I haven't used PDFont here not sure If Im doing it wrongly
public static void main(String[] args) throws IOException, COSVisitorException {
File input = new File("sourceFile.pdf");
// pdPage and pdPage1 will be used to get first and second page of entire PDF
//pdPageMedRec will get the rest of the pages
PDPage pdPage = null;
PDPage pdPage1 = null;
PDPage pdPageMedRec = null;
PDDocument firstOutputDocument = null;
PDDocument secondOutputDocument = null;
PDDocument inputDocument = PDDocument.loadNonSeq(input, null);
List<PDPage> list = inputDocument.getDocumentCatalog().getAllPages();
// I wanted two documents to be generated from the big PDF
//firstOutputDocument is document 1 and it will be having first 2 pages of the big pdf
//secondOutputDocument is document 2 and it will be having the rest of the pages of the PDF
firstOutputDocument = new PDDocument();
secondOutputDocument = new PDDocument();
// Taking first page and second page
pdPage = list.get(0);
pdPage1 = list.get(1);
// Appending them as one document
firstOutputDocument.importPage(pdPage);
firstOutputDocument.importPage(pdPage1);
// Looping the rest of the pages
for (int page = 3; page <= inputDocument.getNumberOfPages(); ++page) {
pdPageMedRec = (PDPage) inputDocument.getDocumentCatalog().getAllPages().get(page - 1);
// append page to current document
secondOutputDocument.importPage(pdPageMedRec);
}
// Saving first document
File f = new File("document1.pdf");
firstOutputDocument.save(f);
firstOutputDocument.close();
// Saving second document
File g = new File("document2.pdf");
secondOutputDocument.save(g);
secondOutputDocument.close();
inputDocument.close();
}

How to copy/move AcroForm fields from one document to new blank one using IText5 or IText7?

I need to copy whole AcroForm including field positions and values from template PDF to a new blank PDF file. How can I do that?
In short words - I need to get rid of "background" from the template and leave only filed forms.
The whole point of this is to create a PDF with content that would be printed on pre-printed templates.
I am using IText 5 but I can switch to 7 if usefull examples would be provided

After a lot of trial and error I have found the solution to "How to copy AcfroForm fields into another PDF". It is a iText v7 version. I hope it will help somebody someday.
private byte[] copyFormElements(byte[] sourceTemplate) throws IOException {
PdfReader completeReader = new PdfReader(new ByteArrayInputStream(sourceTemplate));
PdfDocument completeDoc = new PdfDocument(completeReader);
ByteArrayOutputStream out = new ByteArrayOutputStream();
PdfWriter offsetWriter = new PdfWriter(out);
PdfDocument offsetDoc = new PdfDocument(offsetWriter);
offsetDoc.initializeOutlines();
PdfPage blank = offsetDoc.addNewPage();
PdfAcroForm originalForm = PdfAcroForm.getAcroForm(completeDoc, false);
// originalForm.getPdfObject().copyTo(offsetDoc,false);
PdfAcroForm offsetForm = PdfAcroForm.getAcroForm(offsetDoc, true);
for (String name : originalForm.getFormFields().keySet()) {
PdfFormField field = originalForm.getField(name);
PdfDictionary copied = field.getPdfObject().copyTo(offsetDoc, false);
PdfFormField copiedField = PdfFormField.makeFormField(copied, offsetDoc);
offsetForm.addField(copiedField, blank);
}
offsetDoc.close();
completeDoc.close();
return out.toByteArray();
}

Did you check the PdfCopyForms object:
Allows you to add one (or more) existing PDF document(s) to create a new PDF and add the form of another PDF document to this new PDF.
I didn't find an example, but you could try something like this:
PdfReader reader1 = new PdfReader(src1); // a document with a form
PdfReader reader2 = new PdfReader(src2); // a document without a form
PdfCopyForms copy = new PdfCopyForms(new FileOutputStream(dest));
copy.AddDocument(reader1); // add the document without the form
copy.CopyDocumentFields(reader2); // add the fields of the document with the form
copy.close();
reader1.close();
reader2.close();
I see that the class is deprecated. I'm not sure of that's because iText 7 makes it much easier to do this, or if it's because there were technical problems with the class.

Can't remove whiteSpace in Java itext

Here i am combining 2 pdf documents using the Itext packages.
Merging was done successfully using the code below
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
PdfContentByte cb = writer.getDirectContent();
for (InputStream in : list)
{
PdfReader reader = new PdfReader(in);
for (int i = 1; i <= reader.getNumberOfPages(); i++)
{
document.newPage();
//import the page from source pdf
PdfImportedPage page = writer.getImportedPage(reader, i);
//add the page to the destination pdf
cb.addTemplate(page, 0, 0);
}
}
outputStream.flush();
document.close();
outputStream.close();
Here the list is an InputStream List.
And outputStream is an output stream
The problem i am having is i want to append the PDFdocuments in the list after the 1st PDF is added
(i.e 1st PDF has 4 lines...i want the 2nd PDF to continue in the same page after the 4th line).
What i am getting is the 2nd PDF is added in the second page.
Is there any alternate keyword for document.newPage();
Can anyone help me with it.
Thanks would like to hear any responses:)

It depends on the requirements you have. As long as
you only are interested in the page contents of the merged PDFs, not in the page annotations and
the pages have no content but the text lines you mention, in particular no background graphics, watermarks, or header/footer lines,
you can you use either the
PdfDenseMergeTool from this answer or the
PdfVeryDenseMergeTool from this answer.
If you are interested in annotations, it should be no problem to extend those classes accordingly. If your PDDFs have background graphics or watermarks, headers or footers, they should be removed beforehand.

Replace fonts in a PDF using iText (Java)

I'd like to convert all the fonts, embedded or otherwise, of a PDF to another font using iText. I understand that line-height, kerning and a bunch of other things would be bungled up, but this I truly don't mind how ugly the output is.
I have seen how to embed fonts into existing pdfs here, but I don't know how to set ALL EXISTING text in the document to that font.
I understand that this isn't as straightforward as I make it out to be. Perhaps it would be easier just to take all the raw text from the document, and create a new document using the new font (again, layout/readability is a non-issue to me)

The example EmbedFontPostFacto.java from chapter 16 of iText in Action — 2nd Edition shows how to embed an originally not embedded font. The central method is this:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
// the font file
RandomAccessFile raf = new RandomAccessFile(FONT, "r");
byte fontfile[] = new byte[(int)raf.length()];
raf.readFully(fontfile);
raf.close();
// create a new stream for the font file
PdfStream stream = new PdfStream(fontfile);
stream.flateCompress();
stream.put(PdfName.LENGTH1, new PdfNumber(fontfile.length));
// create a reader object
PdfReader reader = new PdfReader(RESULT1);
int n = reader.getXrefSize();
PdfObject object;
PdfDictionary font;
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(RESULT2));
PdfName fontname = new PdfName(FONTNAME);
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isDictionary())
continue;
font = (PdfDictionary)object;
if (PdfName.FONTDESCRIPTOR.equals(font.get(PdfName.TYPE))
&& fontname.equals(font.get(PdfName.FONTNAME))) {
PdfIndirectObject objref = stamper.getWriter().addToBody(stream);
font.put(PdfName.FONTFILE2, objref.getIndirectReference());
}
}
stamper.close();
reader.close();
}
This (without the fontname.equals(font.get(PdfName.FONTNAME)) test) may be a starting point for the simple cases of your task.
You'll have to do quite a lot of tests concerning encoding and add some individual translations for a more generic solution. You may want to study section 9 Text of the PDF specification ISO 32000-1 for this.

iText Pdf Page Byte Size

I have a business requirement that requires me to splits pdfs into multiple documents.
Lets say I have a 100MB pdf, I need to split that into for simplicity sake, into multiple pdfs no larger than 10MB a piece.
I am using iText.
I am going to get the original pdf, and loop through the pages, but how can I determine the file size of each page without writing it separately to the disk?
Sample code for simplicity
int numPages = reader.getNumberOfPages();
PdfImportedPage page;
for (int currentPage = 0; currentPage &lt numPages; ){
++currentPage;
//Get page from reader
page = writer.getImportedPage(reader, currentPage);
// I need the size in bytes here of the page
}

I think the easiest way is to write it to the disk and delete it afterwards:
Document document = new Document();
File f= new File("C:\\delete.pdf"); //for instance
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(f));
document.open();
document.add(page);
document.close();
long filesize = f.length(); //this is the filesize in byte
f.delete();
I'm not absolutely sure, I admit, but I don't know how it should be possible to figure out the filesize if the file is not existing.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Remain only used font subsets while splitting PDF in Java - java

Related

The PDF has been Split using PDF Box library and the resultant PDF has almost same size as source PDF file

How to copy/move AcroForm fields from one document to new blank one using IText5 or IText7?

Can't remove whiteSpace in Java itext

Replace fonts in a PDF using iText (Java)

iText Pdf Page Byte Size

Categories

Resources