How to read raw text from pdf file using java [duplicate]

How to read raw text from pdf file using java [duplicate] - java

I need to extract text (word by word) from a pdf file.
import java.io.*;
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;
import com.itextpdf.text.pdf.parser.*;
public class pdf {
private static String INPUTFILE = "http://ontology.buffalo.edu/ontology%28PIC%29.pdf" ;
private static String OUTPUTFILE = "c:/new3.pdf";
public static void main(String[] args) throws DocumentException,
IOException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
PdfImportedPage page;
// Go through all pages
for (int i = 1; i <= n; i++) {
page = writer.getImportedPage(reader, i);
System.out.println(i);
Image instance = Image.getInstance(page);
document.add(instance);
}
document.close();
PdfReader readerN = new PdfReader(OUTPUTFILE);
PdfTextExtractor parse = new PdfTextExtractor();
for (int i = 1; i <= n; i++)
System.out.println(parser.getTextFromPage(reader,i));
}
When I compile the code, I have this error:
the constructor PdfTextExtractor is undefined
How do I fix this?

PDFTextExtractor only contains static methods and the constructor is private. itext
You can call it like so:
String myLine = PDFTextExtractor.getTextFromPage(reader, pageNumber)

If you want to get all the text from the PDF file and save it to a text file you can use below code.
Use pdfutil.jar library.
import java.io.IOException;
import java.io.PrintWriter;
import com.testautomationguru.utility.PDFUtil;
public class PDFToText{
public static void main(String[] args) {
try {
String pdfFilePath = "C:\\abc.pdf";
PDFUtil pdfUtil = new PDFUtil();
String content = pdfUtil.getText(pdfFilePath);
PrintWriter out = new PrintWriter("C:\\abc.txt");
out.println(content);
out.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}

// Try Apache PDF Box
import java.io.FilterInputStream;
import java.io.InputStream;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
// Your PDF file
String filePath = "";
InputStream inputStream = null;
try
{
inputStream = new FileInputStream(filePath);
PDFParser parser = new PDFParser(inputStream);
// This will parse the stream and populate the COSDocument object.
parser.parse();
// Get the document that was parsed.
COSDocument cosDoc = parser.getDocument();
// This class will take a pdf document and strip out all of the text and
// ignore the formatting and such.
PDFTextStripper pdfStripper = new PDFTextStripper();
// This is the in-memory representation of the PDF document
PDDocument pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(pdDoc.getNumberOfPages());
// This will return the text of a document.
def statementPDF = pdfStripper.getText(pdDoc);
}
catch(Exception e)
{
String errorMessage += "\nUnexpected Exception: " + e.getClass() + "\n" + e.getMessage();
for (trace in e.getStackTrace())
{
errorMessage += "\n\t" + trace;
}
}
finally
{
if (inputStream != null)
{
inputStream.close();
}
}

Related

How to merge or concat to byte [ ] arrays of PDF's type in java but am getting pdf1 as output [duplicate]

I want to merge many PDF files into one using PDFBox and this is what I've done:
PDDocument document = new PDDocument();
for (String pdfFile: pdfFiles) {
PDDocument part = PDDocument.load(pdfFile);
List<PDPage> list = part.getDocumentCatalog().getAllPages();
for (PDPage page: list) {
document.addPage(page);
}
part.close();
}
document.save("merged.pdf");
document.close();
Where pdfFiles is an ArrayList<String> containing all the PDF files.
When I'm running the above, I'm always getting:
org.apache.pdfbox.exceptions.COSVisitorException: Bad file descriptor
Am I doing something wrong? Is there any other way of doing it?

Why not use the PDFMergerUtility of pdfbox?
PDFMergerUtility ut = new PDFMergerUtility();
ut.addSource(...);
ut.addSource(...);
ut.addSource(...);
ut.setDestinationFileName(...);
ut.mergeDocuments();

A quick Google search returned this bug: "Bad file descriptor while saving a document w. imported PDFs".
It looks like you need to keep the PDFs to be merged open, until after you have saved and closed the combined PDF.

This is a ready to use code, merging four pdf files with itext.jar from http://central.maven.org/maven2/com/itextpdf/itextpdf/5.5.0/itextpdf-5.5.0.jar, more on http://tutorialspointexamples.com/
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
/**
* This class is used to merge two or more
* existing pdf file using iText jar.
*/
public class PDFMerger {
static void mergePdfFiles(List<InputStream> inputPdfList,
OutputStream outputStream) throws Exception{
//Create document and pdfReader objects.
Document document = new Document();
List<PdfReader> readers =
new ArrayList<PdfReader>();
int totalPages = 0;
//Create pdf Iterator object using inputPdfList.
Iterator<InputStream> pdfIterator =
inputPdfList.iterator();
// Create reader list for the input pdf files.
while (pdfIterator.hasNext()) {
InputStream pdf = pdfIterator.next();
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages = totalPages + pdfReader.getNumberOfPages();
}
// Create writer for the outputStream
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
//Open document.
document.open();
//Contain the pdf data.
PdfContentByte pageContentByte = writer.getDirectContent();
PdfImportedPage pdfImportedPage;
int currentPdfReaderPage = 1;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Iterate and process the reader list.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
//Create page and add content.
while (currentPdfReaderPage <= pdfReader.getNumberOfPages()) {
document.newPage();
pdfImportedPage = writer.getImportedPage(
pdfReader,currentPdfReaderPage);
pageContentByte.addTemplate(pdfImportedPage, 0, 0);
currentPdfReaderPage++;
}
currentPdfReaderPage = 1;
}
//Close document and outputStream.
outputStream.flush();
document.close();
outputStream.close();
System.out.println("Pdf files merged successfully.");
}
public static void main(String args[]){
try {
//Prepare input pdf file list as list of input stream.
List<InputStream> inputPdfList = new ArrayList<InputStream>();
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_1.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_2.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_3.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_4.pdf"));
//Prepare output stream for merged pdf file.
OutputStream outputStream =
new FileOutputStream("..\\pdf\\MergeFile_1234.pdf");
//call method to merge pdf files.
mergePdfFiles(inputPdfList, outputStream);
} catch (Exception e) {
e.printStackTrace();
}
}
}

Multiple pdf merged method using org.apache.pdfbox:
public void mergePDFFiles(List<File> files,
String mergedFileName) {
try {
PDFMergerUtility pdfmerger = new PDFMergerUtility();
for (File file : files) {
PDDocument document = PDDocument.load(file);
pdfmerger.setDestinationFileName(mergedFileName);
pdfmerger.addSource(file);
pdfmerger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
document.close();
}
} catch (IOException e) {
logger.error("Error to merge files. Error: " + e.getMessage());
}
}
From main program, call mergePDFFiles method using list of files and target file name.
String mergedFileName = "Merged.pdf";
mergePDFFiles(files, mergedFileName);
After calling mergePDFFiles, load merged file
File mergedFile = new File(mergedFileName);

package article14;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFMergerUtility;
public class Pdf
{
public static void main(String args[])
{
new Pdf().createNew();
new Pdf().combine();
}
public void combine()
{
try
{
PDFMergerUtility mergePdf = new PDFMergerUtility();
String folder ="pdf";
File _folder = new File(folder);
File[] filesInFolder;
filesInFolder = _folder.listFiles();
for (File string : filesInFolder)
{
mergePdf.addSource(string);
}
mergePdf.setDestinationFileName("Combined.pdf");
mergePdf.mergeDocuments();
}
catch(Exception e)
{
}
}
public void createNew()
{
PDDocument document = null;
try
{
String filename="test.pdf";
document=new PDDocument();
PDPage blankPage = new PDPage();
document.addPage( blankPage );
document.save( filename );
}
catch(Exception e)
{
}
}
}

If you want to combine two files where one overlays the other (example: document A is a template and document B has the text you want to put on the template), this works:
after creating "doc", you want to write your template (templateFile) on top of that -
PDDocument watermarkDoc = PDDocument.load(getServletContext()
.getRealPath(templateFile));
Overlay overlay = new Overlay();
overlay.overlay(watermarkDoc, doc);

Using iText (existing PDF in bytes)
public static byte[] mergePDF(List<byte[]> pdfFilesAsByteArray) throws DocumentException, IOException {
ByteArrayOutputStream outStream = new ByteArrayOutputStream();
Document document = null;
PdfCopy writer = null;
for (byte[] pdfByteArray : pdfFilesAsByteArray) {
try {
PdfReader reader = new PdfReader(pdfByteArray);
int numberOfPages = reader.getNumberOfPages();
if (document == null) {
document = new Document(reader.getPageSizeWithRotation(1));
writer = new PdfCopy(document, outStream); // new
document.open();
}
PdfImportedPage page;
for (int i = 0; i < numberOfPages;) {
++i;
page = writer.getImportedPage(reader, i);
writer.addPage(page);
}
}
catch (Exception e) {
e.printStackTrace();
}
}
document.close();
outStream.close();
return outStream.toByteArray();
}

How can I append an existing PDF to a generated PDF with flying saucer for Java [duplicate]

I want to merge many PDF files into one using PDFBox and this is what I've done:
PDDocument document = new PDDocument();
for (String pdfFile: pdfFiles) {
PDDocument part = PDDocument.load(pdfFile);
List<PDPage> list = part.getDocumentCatalog().getAllPages();
for (PDPage page: list) {
document.addPage(page);
}
part.close();
}
document.save("merged.pdf");
document.close();
Where pdfFiles is an ArrayList<String> containing all the PDF files.
When I'm running the above, I'm always getting:
org.apache.pdfbox.exceptions.COSVisitorException: Bad file descriptor
Am I doing something wrong? Is there any other way of doing it?

Why not use the PDFMergerUtility of pdfbox?
PDFMergerUtility ut = new PDFMergerUtility();
ut.addSource(...);
ut.addSource(...);
ut.addSource(...);
ut.setDestinationFileName(...);
ut.mergeDocuments();

A quick Google search returned this bug: "Bad file descriptor while saving a document w. imported PDFs".
It looks like you need to keep the PDFs to be merged open, until after you have saved and closed the combined PDF.

This is a ready to use code, merging four pdf files with itext.jar from http://central.maven.org/maven2/com/itextpdf/itextpdf/5.5.0/itextpdf-5.5.0.jar, more on http://tutorialspointexamples.com/
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
/**
* This class is used to merge two or more
* existing pdf file using iText jar.
*/
public class PDFMerger {
static void mergePdfFiles(List<InputStream> inputPdfList,
OutputStream outputStream) throws Exception{
//Create document and pdfReader objects.
Document document = new Document();
List<PdfReader> readers =
new ArrayList<PdfReader>();
int totalPages = 0;
//Create pdf Iterator object using inputPdfList.
Iterator<InputStream> pdfIterator =
inputPdfList.iterator();
// Create reader list for the input pdf files.
while (pdfIterator.hasNext()) {
InputStream pdf = pdfIterator.next();
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages = totalPages + pdfReader.getNumberOfPages();
}
// Create writer for the outputStream
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
//Open document.
document.open();
//Contain the pdf data.
PdfContentByte pageContentByte = writer.getDirectContent();
PdfImportedPage pdfImportedPage;
int currentPdfReaderPage = 1;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Iterate and process the reader list.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
//Create page and add content.
while (currentPdfReaderPage <= pdfReader.getNumberOfPages()) {
document.newPage();
pdfImportedPage = writer.getImportedPage(
pdfReader,currentPdfReaderPage);
pageContentByte.addTemplate(pdfImportedPage, 0, 0);
currentPdfReaderPage++;
}
currentPdfReaderPage = 1;
}
//Close document and outputStream.
outputStream.flush();
document.close();
outputStream.close();
System.out.println("Pdf files merged successfully.");
}
public static void main(String args[]){
try {
//Prepare input pdf file list as list of input stream.
List<InputStream> inputPdfList = new ArrayList<InputStream>();
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_1.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_2.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_3.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_4.pdf"));
//Prepare output stream for merged pdf file.
OutputStream outputStream =
new FileOutputStream("..\\pdf\\MergeFile_1234.pdf");
//call method to merge pdf files.
mergePdfFiles(inputPdfList, outputStream);
} catch (Exception e) {
e.printStackTrace();
}
}
}

Multiple pdf merged method using org.apache.pdfbox:
public void mergePDFFiles(List<File> files,
String mergedFileName) {
try {
PDFMergerUtility pdfmerger = new PDFMergerUtility();
for (File file : files) {
PDDocument document = PDDocument.load(file);
pdfmerger.setDestinationFileName(mergedFileName);
pdfmerger.addSource(file);
pdfmerger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
document.close();
}
} catch (IOException e) {
logger.error("Error to merge files. Error: " + e.getMessage());
}
}
From main program, call mergePDFFiles method using list of files and target file name.
String mergedFileName = "Merged.pdf";
mergePDFFiles(files, mergedFileName);
After calling mergePDFFiles, load merged file
File mergedFile = new File(mergedFileName);

package article14;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFMergerUtility;
public class Pdf
{
public static void main(String args[])
{
new Pdf().createNew();
new Pdf().combine();
}
public void combine()
{
try
{
PDFMergerUtility mergePdf = new PDFMergerUtility();
String folder ="pdf";
File _folder = new File(folder);
File[] filesInFolder;
filesInFolder = _folder.listFiles();
for (File string : filesInFolder)
{
mergePdf.addSource(string);
}
mergePdf.setDestinationFileName("Combined.pdf");
mergePdf.mergeDocuments();
}
catch(Exception e)
{
}
}
public void createNew()
{
PDDocument document = null;
try
{
String filename="test.pdf";
document=new PDDocument();
PDPage blankPage = new PDPage();
document.addPage( blankPage );
document.save( filename );
}
catch(Exception e)
{
}
}
}

If you want to combine two files where one overlays the other (example: document A is a template and document B has the text you want to put on the template), this works:
after creating "doc", you want to write your template (templateFile) on top of that -
PDDocument watermarkDoc = PDDocument.load(getServletContext()
.getRealPath(templateFile));
Overlay overlay = new Overlay();
overlay.overlay(watermarkDoc, doc);

Using iText (existing PDF in bytes)
public static byte[] mergePDF(List<byte[]> pdfFilesAsByteArray) throws DocumentException, IOException {
ByteArrayOutputStream outStream = new ByteArrayOutputStream();
Document document = null;
PdfCopy writer = null;
for (byte[] pdfByteArray : pdfFilesAsByteArray) {
try {
PdfReader reader = new PdfReader(pdfByteArray);
int numberOfPages = reader.getNumberOfPages();
if (document == null) {
document = new Document(reader.getPageSizeWithRotation(1));
writer = new PdfCopy(document, outStream); // new
document.open();
}
PdfImportedPage page;
for (int i = 0; i < numberOfPages;) {
++i;
page = writer.getImportedPage(reader, i);
writer.addPage(page);
}
}
catch (Exception e) {
e.printStackTrace();
}
}
document.close();
outStream.close();
return outStream.toByteArray();
}

Create a new Pdf from an existing Pdf and (HTML + CSS)

My use case is that I am generating pdf on the fly. Also I have a pdf with single page. I want to concatenate the newly generated PDF after/before the existing pdf page.
I was already able to generate PDF from HTML (this may result in 2-3 pages) Pdf from HTML with CSS
I tried looking up at the examples one of which is to concatenate existing PDFs pagewise Working with existing PDFs - Concatenate

This page shows exactly what you request with prepend and append static PDF with composed HTML and CSS content.
http://cloudformatter.com/CSS2Pdf.CustomTipsTricks.InjectPDF
Use instructions are here
http://cloudformatter.com/CSS2Pdf.APIDoc.Usage

Try this example:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
public class UtilPDF {
public static void main(String[] args) {
try {
List<InputStream> pdfs = new ArrayList<InputStream>();
File pdfDir = new File("C:\\PDF");
boolean pdfDirectoryExists = true;
if (!pdfDir.exists()) {
pdfDirectoryExists = pdfDir.mkdir();
}
if (pdfDirectoryExists) {
pdfs.add(new FileInputStream("C:\\PDF\\Document1.pdf"));
pdfs.add(new FileInputStream("C:\\PDF\\Document2.pdf"));
OutputStream output = new FileOutputStream("C:\\Projects\\FinalDocument_1_2.pdf");
UtilPDF.concatPDFs(pdfs, output, true);
}
} catch (Exception e) {
e.printStackTrace();
}
}
public static void concatPDFs(List<InputStream> streamOfPDFFiles, OutputStream outputStream, boolean paginate) {
Document document = new Document();
try {
List<InputStream> pdfs = streamOfPDFFiles;
List<PdfReader> readers = new ArrayList<PdfReader>();
int totalPages = 0;
Iterator<InputStream> iteratorPDFs = pdfs.iterator();
// Create Readers for the pdfs.
while (iteratorPDFs.hasNext()) {
InputStream pdf = iteratorPDFs.next();
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages += pdfReader.getNumberOfPages();
}
// Create a writer for the outputstream
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
PdfContentByte cb = writer.getDirectContent(); // Holds the PDF
// data
PdfImportedPage page;
int currentPageNumber = 0;
int pageOfCurrentReaderPDF = 0;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Loop through the PDF files and add to the output.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
// Create a new page in the target for each source page.
while (pageOfCurrentReaderPDF < pdfReader.getNumberOfPages()) {
document.newPage();
pageOfCurrentReaderPDF++;
currentPageNumber++;
page = writer.getImportedPage(pdfReader, pageOfCurrentReaderPDF);
cb.addTemplate(page, 0, 0);
// Code for pagination.
if (paginate) {
cb.beginText();
cb.setFontAndSize(bf, 9);
cb.showTextAligned(PdfContentByte.ALIGN_CENTER, "" + currentPageNumber + " of " + totalPages,
520, 5, 0);
cb.endText();
}
}
pageOfCurrentReaderPDF = 0;
}
outputStream.flush();
document.close();
outputStream.close();
} catch (Exception e) {
e.printStackTrace();
} finally {
if (document.isOpen())
document.close();
try {
if (outputStream != null)
outputStream.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
}
}

read pdf files using java

I want to parse pdf websites.
Can anyone say how to extract all the words (word by word) from a pdf file using java.
The code below extract content from a pdf file and write it in another pdf file. I want that the program write it in a text file.
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;
public class pdf {
private static String INPUTFILE = "http://www.britishcouncil.org/learning-infosheets-medicine.pdf" ;
private static String OUTPUTFILE = "c:/new3.pdf";
public static void main(String[] args) throws DocumentException,
IOException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
PdfImportedPage page;
for (int i = 1; i <= n; i++) {
page = writer.getImportedPage(reader, i);
Image instance = Image.getInstance(page);
document.add(instance);
}
document.close();
}
}
Thanks in advance

Take a look at this:
How to Read PDF File in Java (uses Apache PDF Box library)

using org.apache.pdfbox
import org.apache.pdfbox.*;
public static String convertPDFToTxt(String filePath) {
byte[] thePDFFileBytes = readFileAsBytes(filePath);
PDDocument pddDoc = PDDocument.load(thePDFFileBytes);
PDFTextStripper reader = new PDFTextStripper();
String pageText = reader.getText(pddDoc);
pddDoc.close();
return pageText;
}
private static byte[] readFileAsBytes(String filePath) {
FileInputStream inputStream = new FileInputStream(filePath);
return IOUtils.toByteArray(inputStream);
}

How to merge two PDF files into one in Java?

I want to merge many PDF files into one using PDFBox and this is what I've done:
PDDocument document = new PDDocument();
for (String pdfFile: pdfFiles) {
PDDocument part = PDDocument.load(pdfFile);
List<PDPage> list = part.getDocumentCatalog().getAllPages();
for (PDPage page: list) {
document.addPage(page);
}
part.close();
}
document.save("merged.pdf");
document.close();
Where pdfFiles is an ArrayList<String> containing all the PDF files.
When I'm running the above, I'm always getting:
org.apache.pdfbox.exceptions.COSVisitorException: Bad file descriptor
Am I doing something wrong? Is there any other way of doing it?

Why not use the PDFMergerUtility of pdfbox?
PDFMergerUtility ut = new PDFMergerUtility();
ut.addSource(...);
ut.addSource(...);
ut.addSource(...);
ut.setDestinationFileName(...);
ut.mergeDocuments();

A quick Google search returned this bug: "Bad file descriptor while saving a document w. imported PDFs".
It looks like you need to keep the PDFs to be merged open, until after you have saved and closed the combined PDF.

This is a ready to use code, merging four pdf files with itext.jar from http://central.maven.org/maven2/com/itextpdf/itextpdf/5.5.0/itextpdf-5.5.0.jar, more on http://tutorialspointexamples.com/
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
/**
* This class is used to merge two or more
* existing pdf file using iText jar.
*/
public class PDFMerger {
static void mergePdfFiles(List<InputStream> inputPdfList,
OutputStream outputStream) throws Exception{
//Create document and pdfReader objects.
Document document = new Document();
List<PdfReader> readers =
new ArrayList<PdfReader>();
int totalPages = 0;
//Create pdf Iterator object using inputPdfList.
Iterator<InputStream> pdfIterator =
inputPdfList.iterator();
// Create reader list for the input pdf files.
while (pdfIterator.hasNext()) {
InputStream pdf = pdfIterator.next();
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages = totalPages + pdfReader.getNumberOfPages();
}
// Create writer for the outputStream
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
//Open document.
document.open();
//Contain the pdf data.
PdfContentByte pageContentByte = writer.getDirectContent();
PdfImportedPage pdfImportedPage;
int currentPdfReaderPage = 1;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Iterate and process the reader list.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
//Create page and add content.
while (currentPdfReaderPage <= pdfReader.getNumberOfPages()) {
document.newPage();
pdfImportedPage = writer.getImportedPage(
pdfReader,currentPdfReaderPage);
pageContentByte.addTemplate(pdfImportedPage, 0, 0);
currentPdfReaderPage++;
}
currentPdfReaderPage = 1;
}
//Close document and outputStream.
outputStream.flush();
document.close();
outputStream.close();
System.out.println("Pdf files merged successfully.");
}
public static void main(String args[]){
try {
//Prepare input pdf file list as list of input stream.
List<InputStream> inputPdfList = new ArrayList<InputStream>();
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_1.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_2.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_3.pdf"));
inputPdfList.add(new FileInputStream("..\\pdf\\pdf_4.pdf"));
//Prepare output stream for merged pdf file.
OutputStream outputStream =
new FileOutputStream("..\\pdf\\MergeFile_1234.pdf");
//call method to merge pdf files.
mergePdfFiles(inputPdfList, outputStream);
} catch (Exception e) {
e.printStackTrace();
}
}
}

Multiple pdf merged method using org.apache.pdfbox:
public void mergePDFFiles(List<File> files,
String mergedFileName) {
try {
PDFMergerUtility pdfmerger = new PDFMergerUtility();
for (File file : files) {
PDDocument document = PDDocument.load(file);
pdfmerger.setDestinationFileName(mergedFileName);
pdfmerger.addSource(file);
pdfmerger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
document.close();
}
} catch (IOException e) {
logger.error("Error to merge files. Error: " + e.getMessage());
}
}
From main program, call mergePDFFiles method using list of files and target file name.
String mergedFileName = "Merged.pdf";
mergePDFFiles(files, mergedFileName);
After calling mergePDFFiles, load merged file
File mergedFile = new File(mergedFileName);

package article14;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFMergerUtility;
public class Pdf
{
public static void main(String args[])
{
new Pdf().createNew();
new Pdf().combine();
}
public void combine()
{
try
{
PDFMergerUtility mergePdf = new PDFMergerUtility();
String folder ="pdf";
File _folder = new File(folder);
File[] filesInFolder;
filesInFolder = _folder.listFiles();
for (File string : filesInFolder)
{
mergePdf.addSource(string);
}
mergePdf.setDestinationFileName("Combined.pdf");
mergePdf.mergeDocuments();
}
catch(Exception e)
{
}
}
public void createNew()
{
PDDocument document = null;
try
{
String filename="test.pdf";
document=new PDDocument();
PDPage blankPage = new PDPage();
document.addPage( blankPage );
document.save( filename );
}
catch(Exception e)
{
}
}
}

If you want to combine two files where one overlays the other (example: document A is a template and document B has the text you want to put on the template), this works:
after creating "doc", you want to write your template (templateFile) on top of that -
PDDocument watermarkDoc = PDDocument.load(getServletContext()
.getRealPath(templateFile));
Overlay overlay = new Overlay();
overlay.overlay(watermarkDoc, doc);

Using iText (existing PDF in bytes)
public static byte[] mergePDF(List<byte[]> pdfFilesAsByteArray) throws DocumentException, IOException {
ByteArrayOutputStream outStream = new ByteArrayOutputStream();
Document document = null;
PdfCopy writer = null;
for (byte[] pdfByteArray : pdfFilesAsByteArray) {
try {
PdfReader reader = new PdfReader(pdfByteArray);
int numberOfPages = reader.getNumberOfPages();
if (document == null) {
document = new Document(reader.getPageSizeWithRotation(1));
writer = new PdfCopy(document, outStream); // new
document.open();
}
PdfImportedPage page;
for (int i = 0; i < numberOfPages;) {
++i;
page = writer.getImportedPage(reader, i);
writer.addPage(page);
}
}
catch (Exception e) {
e.printStackTrace();
}
}
document.close();
outStream.close();
return outStream.toByteArray();
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to read raw text from pdf file using java [duplicate] - java

PDFTextExtractor only contains static methods and the constructor is private. itext You can call it like so: String myLine = PDFTextExtractor.getTextFromPage(reader, pageNumber)

Related

How to merge or concat to byte [ ] arrays of PDF's type in java but am getting pdf1 as output [duplicate]

How can I append an existing PDF to a generated PDF with flying saucer for Java [duplicate]

Create a new Pdf from an existing Pdf and (HTML + CSS)

read pdf files using java

How to merge two PDF files into one in Java?

Categories

Resources