I want to parse PDF files from websites.
Can anyone tell me how to extract all the words (word by word) from a PDF file using Java?
The code below extracts the content from a PDF file and writes it to another PDF file. I want the program to write it to a text file instead.
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;
public class pdf {
private static String INPUTFILE = "http://www.britishcouncil.org/learning-infosheets-medicine.pdf" ;
private static String OUTPUTFILE = "c:/new3.pdf";
public static void main(String[] args) throws DocumentException,
IOException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
PdfImportedPage page;
for (int i = 1; i <= n; i++) {
page = writer.getImportedPage(reader, i);
Image instance = Image.getInstance(page);
document.add(instance);
}
document.close();
}
}
Thanks in advance
Take a look at this:
How to Read PDF File in Java (uses Apache PDF Box library)
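Using org.apache.pdfbox (the package names below assume PDFBox 2.x):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public static String convertPDFToTxt(String filePath) throws IOException {
    byte[] thePDFFileBytes = readFileAsBytes(filePath);
    // try-with-resources closes the document even if text extraction fails
    try (PDDocument pddDoc = PDDocument.load(thePDFFileBytes)) {
        PDFTextStripper reader = new PDFTextStripper();
        return reader.getText(pddDoc);
    }
}
private static byte[] readFileAsBytes(String filePath) throws IOException {
    // Files.readAllBytes opens and closes the file itself
    return Files.readAllBytes(Paths.get(filePath));
}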
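To get from the single String returned by convertPDFToTxt to individual words in a text file, something along these lines might work. This is only a sketch using the standard library: the class name WordWriter, its method names, and the one-word-per-line output format are assumptions and not part of the answer above. Splitting on whitespace is also a simplification, since punctuation stays attached to the neighbouring word.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or iText.
class WordWriter {

    // Splits extracted text on runs of whitespace and drops empty tokens.
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Writes one word per line to a UTF-8 text file.
    static void writeWords(List<String> words, String outputPath) throws IOException {
        Files.write(Paths.get(outputPath), words, StandardCharsets.UTF_8);
    }
}
```

It could be called as WordWriter.writeWords(WordWriter.splitWords(convertPDFToTxt("input.pdf")), "words.txt"), where both file names are placeholders.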
I want to merge many PDF files into one using PDFBox and this is what I've done:
PDDocument document = new PDDocument();
for (String pdfFile: pdfFiles) {
PDDocument part = PDDocument.load(pdfFile);
List<PDPage> list = part.getDocumentCatalog().getAllPages();
for (PDPage page: list) {
document.addPage(page);
}
part.close();
}
document.save("merged.pdf");
document.close();
Where pdfFiles is an ArrayList<String> containing all the PDF files.
When I'm running the above, I'm always getting:
org.apache.pdfbox.exceptions.COSVisitorException: Bad file descriptor
Am I doing something wrong? Is there any other way of doing it?
Why not use the PDFMergerUtility of pdfbox?
PDFMergerUtility ut = new PDFMergerUtility();
ut.addSource(...);
ut.addSource(...);
ut.addSource(...);
ut.setDestinationFileName(...);
ut.mergeDocuments();
A quick Google search returned this bug: "Bad file descriptor while saving a document w. imported PDFs".
It looks like you need to keep the PDFs to be merged open, until after you have saved and closed the combined PDF.
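This is ready-to-use code that merges four PDF files with itext.jar from http://central.maven.org/maven2/com/itextpdf/itextpdf/5.5.0/itextpdf-5.5.0.jar, more on http://tutorialspointexamples.com/
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;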
/**
* This class is used to merge two or more
* existing pdf file using iText jar.
*/
public class PDFMerger {
static void mergePdfFiles(List<InputStream> inputPdfList,
OutputStream outputStream) throws Exception{
//Create document and pdfReader objects.
Document document = new Document();
List<PdfReader> readers =
new ArrayList<PdfReader>();
int totalPages = 0;
//Create pdf Iterator object using inputPdfList.
Iterator<InputStream> pdfIterator =
inputPdfList.iterator();
// Create reader list for the input pdf files.
while (pdfIterator.hasNext()) {
InputStream pdf = pdfIterator.next();
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages = totalPages + pdfReader.getNumberOfPages();
}
// Create writer for the outputStream
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
//Open document.
document.open();
//Contain the pdf data.
PdfContentByte pageContentByte = writer.getDirectContent();
PdfImportedPage pdfImportedPage;
int currentPdfReaderPage = 1;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Iterate and process the reader list.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
//Create page and add content.
while (currentPdfReaderPage <= pdfReader.getNumberOfPages()) {
document.newPage();
pdfImportedPage = writer.getImportedPage(
pdfReader,currentPdfReaderPage);
pageContentByte.addTemplate(pdfImportedPage, 0, 0);
currentPdfReaderPage++;
}
currentPdfReaderPage = 1;
}
//Close document and outputStream.
outputStream.flush();
document.close();
outputStream.close();
System.out.println("Pdf files merged successfully.");
}
public static void main(String args[]){
try {
//Prepare input pdf file list as list of input stream.
List<InputStream> inputPdfList = new ArrayList<InputStream>();
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_1.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_2.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_3.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_4.pdf"));
//Prepare output stream for merged pdf file.
OutputStream outputStream =
new FileOutputStream("..\\pdf\\MergeFile_1234.pdf");
//call method to merge pdf files.
mergePdfFiles(inputPdfList, outputStream);
} catch (Exception e) {
e.printStackTrace();
}
}
}
A method for merging multiple PDFs using org.apache.pdfbox:
public void mergePDFFiles(List<File> files,
        String mergedFileName) {
    try {
        PDFMergerUtility pdfmerger = new PDFMergerUtility();
        pdfmerger.setDestinationFileName(mergedFileName);
        // Register every source first
        for (File file : files) {
            pdfmerger.addSource(file);
        }
        // Merge once, after all sources have been added
        pdfmerger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
    } catch (IOException e) {
        logger.error("Error merging files. Error: " + e.getMessage());
    }
}
From the main program, call the mergePDFFiles method with the list of files and the target file name:
String mergedFileName = "Merged.pdf";
mergePDFFiles(files, mergedFileName);
After calling mergePDFFiles, load the merged file:
File mergedFile = new File(mergedFileName);
package article14;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFMergerUtility;
public class Pdf
{
public static void main(String args[])
{
new Pdf().createNew();
new Pdf().combine();
}
public void combine()
{
try
{
PDFMergerUtility mergePdf = new PDFMergerUtility();
String folder ="pdf";
File _folder = new File(folder);
File[] filesInFolder;
filesInFolder = _folder.listFiles();
for (File string : filesInFolder)
{
mergePdf.addSource(string);
}
mergePdf.setDestinationFileName("Combined.pdf");
mergePdf.mergeDocuments();
}
catch(Exception e)
{
// Report the failure instead of swallowing it silently
e.printStackTrace();
}
}
public void createNew()
{
PDDocument document = null;
try
{
String filename="test.pdf";
document=new PDDocument();
PDPage blankPage = new PDPage();
document.addPage( blankPage );
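document.save( filename );
// Close the document to release its resources once it has been saved
document.close();
}
catch(Exception e)
{
e.printStackTrace();
}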
}
}
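One thing to watch in combine() above: File.listFiles() returns null when the folder does not exist, makes no promise about ordering, and returns every entry, not just PDFs. Below is a sketch of a helper that could prepare the list before the addSource loop. The class and method names (PdfFileList, pdfsInOrder) are made up for illustration, sorting by file name is an assumption about the desired merge order, and the filter looks at the name only, so it does not check that an entry is a regular file.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox.
class PdfFileList {

    // Keeps only entries whose name ends in ".pdf" (case-insensitive) and sorts them by name.
    // Accepts null because File.listFiles() returns null for a missing folder.
    static List<File> pdfsInOrder(File[] listing) {
        List<File> pdfs = new ArrayList<>();
        if (listing == null) {
            return pdfs;
        }
        for (File f : listing) {
            if (f.getName().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                pdfs.add(f);
            }
        }
        pdfs.sort(Comparator.comparing(File::getName));
        return pdfs;
    }
}
```

With this in place, the loop in combine() could iterate over PdfFileList.pdfsInOrder(_folder.listFiles()) instead of the raw array.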
If you want to combine two files where one overlays the other (example: document A is a template and document B has the text you want to put on the template), this works:
After creating "doc", write your template (templateFile) on top of it:
PDDocument watermarkDoc = PDDocument.load(getServletContext()
.getRealPath(templateFile));
Overlay overlay = new Overlay();
overlay.overlay(watermarkDoc, doc);
Using iText (existing PDF in bytes)
public static byte[] mergePDF(List<byte[]> pdfFilesAsByteArray) throws DocumentException, IOException {
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    Document document = null;
    PdfCopy writer = null;
    for (byte[] pdfByteArray : pdfFilesAsByteArray) {
        try {
            PdfReader reader = new PdfReader(pdfByteArray);
            int numberOfPages = reader.getNumberOfPages();
            if (document == null) {
                // Size the output document after the first page of the first PDF
                document = new Document(reader.getPageSizeWithRotation(1));
                writer = new PdfCopy(document, outStream);
                document.open();
            }
            // iText page numbers are 1-based
            for (int i = 1; i <= numberOfPages; i++) {
                writer.addPage(writer.getImportedPage(reader, i));
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
    // document is still null when the input list is empty
    if (document != null) {
        document.close();
    }
    outStream.close();
    return outStream.toByteArray();
}
I want to make a PDF file password protected. I googled it and found the solution given below. It works, but it wipes out all the data that is already in my PDF once I secure it with the code below.
The jar files used for this code are:
itextpdf-5.2.1.jar
bcmail-jdk16-1.46.jar
bcprov-jdk16-1.46.jar
bctsp-jdk16-1.46.jar
Code to secure the PDF:
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
public class Secure_file {
private static String USER_PASSWORD = "password";
private static String OWNER_PASSWORD = "secured";
public static void main(String[] args) throws IOException {
Document document = new Document();
try {
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("E:\\sample.pdf"));
writer.setEncryption(USER_PASSWORD.getBytes(),OWNER_PASSWORD.getBytes(), PdfWriter.ALLOW_PRINTING,PdfWriter.ENCRYPTION_AES_128);
document.open();
document.add(new Paragraph("This is Password Protected PDF document."));
document.close();
writer.close();
} catch (DocumentException e) {
e.printStackTrace();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}
What changes do I need to make in this program?
If you look up the iText in Action keywords you'll find encryption pointing to the sample part3.chapter12.EncryptionPdf. That sample's method createPdf essentially is equivalent to your code but the method encryptPdf is what you want:
/** User password. */
public static byte[] USER = "Hello".getBytes();
/** Owner password. */
public static byte[] OWNER = "World".getBytes();
...
public void encryptPdf(String src, String dest) throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
stamper.setEncryption(USER, OWNER,
PdfWriter.ALLOW_PRINTING, PdfWriter.ENCRYPTION_AES_128 | PdfWriter.DO_NOT_ENCRYPT_METADATA);
stamper.close();
reader.close();
}
An example using iText 5. With iText 7 it is very similar, but uses another class instead of the stamper.
PdfReader reader = new PdfReader(dp.getStream());
File tempFile = File.createTempFile("someFilename", FILE_EXTENSION_PDF);
tempFile.deleteOnExit();
FileOutputStream os = new FileOutputStream(tempFile);
PdfStamper stamper = new PdfStamper(reader, os);
String pdfPassword = "1234";
String pdfAdminPassword = "5678";
stamper.setEncryption(
pdfPassword.getBytes(),
pdfAdminPassword.getBytes(),
PdfWriter.ALLOW_PRINTING,
PdfWriter.ENCRYPTION_AES_128);
// Close the stamper first: it still reads from the reader while writing the output
stamper.close();
reader.close();
InputStream encryptedFileIs = new FileInputStream(tempFile);
Or with the Apache PDFBox library:
PDDocument document = PDDocument.load(dp.getStream());
AccessPermission ap = new AccessPermission();
StandardProtectionPolicy spp = new StandardProtectionPolicy("1234", "1234", ap);
spp.setEncryptionKeyLength(128);
spp.setPermissions(ap);
document.protect(spp);
File tempFile = File.createTempFile("someFilename", FILE_EXTENSION_PDF);
tempFile.deleteOnExit();
FileOutputStream os = new FileOutputStream(tempFile);
document.save(os);
document.close();
InputStream encryptedFileIs = new FileInputStream(tempFile);
Good luck and happy coding :)
stamper.setEncryption(USER, OWNER,PdfWriter.ALLOW_PRINTING, PdfWriter.ENCRYPTION_AES_128);
I've used this code to add a password to the PDF. It will ask for the password when the PDF is opened.
I have used FOP; refer to this document.
FOUserAgent userAgent = fopFactory.newFOUserAgent();
userAgent.getRendererOptions().put("encryption-params", new PDFEncryptionParams(
null, "password", false, false, true, true));
Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, userAgent);
I need to extract text (word by word) from a pdf file.
import java.io.*;
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;
import com.itextpdf.text.pdf.parser.*;
public class pdf {
private static String INPUTFILE = "http://ontology.buffalo.edu/ontology%28PIC%29.pdf" ;
private static String OUTPUTFILE = "c:/new3.pdf";
public static void main(String[] args) throws DocumentException,
IOException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
PdfImportedPage page;
// Go through all pages
for (int i = 1; i <= n; i++) {
page = writer.getImportedPage(reader, i);
System.out.println(i);
Image instance = Image.getInstance(page);
document.add(instance);
}
document.close();
PdfReader readerN = new PdfReader(OUTPUTFILE);
PdfTextExtractor parse = new PdfTextExtractor();
for (int i = 1; i <= n; i++)
System.out.println(parser.getTextFromPage(reader,i));
}
When I compile the code, I have this error:
the constructor PdfTextExtractor is undefined
How do I fix this?
PdfTextExtractor (iText) only contains static methods and its constructor is private, so it cannot be instantiated with new.
You can call it like so:
String myLine = PdfTextExtractor.getTextFromPage(reader, pageNumber);
If you want to get all the text from the PDF file and save it to a text file, you can use the code below.
It uses the pdfutil.jar library.
import java.io.IOException;
import java.io.PrintWriter;
import com.testautomationguru.utility.PDFUtil;
public class PDFToText{
public static void main(String[] args) {
try {
String pdfFilePath = "C:\\abc.pdf";
PDFUtil pdfUtil = new PDFUtil();
String content = pdfUtil.getText(pdfFilePath);
PrintWriter out = new PrintWriter("C:\\abc.txt");
out.println(content);
out.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
// Try Apache PDFBox (package names as in PDFBox 1.8)
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
// Your PDF file
String filePath = "";
// Holds the extracted text, or stays null if extraction fails
String statementPDF = null;
String errorMessage = "";
InputStream inputStream = null;
try
{
    inputStream = new FileInputStream(filePath);
    PDFParser parser = new PDFParser(inputStream);
    // This will parse the stream and populate the COSDocument object.
    parser.parse();
    // Get the document that was parsed.
    COSDocument cosDoc = parser.getDocument();
    // This class will take a pdf document and strip out all of the text and
    // ignore the formatting and such.
    PDFTextStripper pdfStripper = new PDFTextStripper();
    // This is the in-memory representation of the PDF document
    PDDocument pdDoc = new PDDocument(cosDoc);
    pdfStripper.setStartPage(1);
    pdfStripper.setEndPage(pdDoc.getNumberOfPages());
    // This will return the text of a document.
    statementPDF = pdfStripper.getText(pdDoc);
    pdDoc.close();
}
catch(Exception e)
{
    errorMessage += "\nUnexpected Exception: " + e.getClass() + "\n" + e.getMessage();
    for (StackTraceElement trace : e.getStackTrace())
    {
        errorMessage += "\n\t" + trace;
    }
}
finally
{
    if (inputStream != null)
    {
        inputStream.close();
    }
}