Merge PDF documents and images into one PDF - java

I have read the examples in the merging PDF documents section, but I couldn't come up with a more efficient solution for the following task:
I would like to merge a series of PDF and image files arriving in any order (original post). The inefficiency comes from the fact that I need to create a dummy one-page PDF for each image using PdfWriter and then read it back from a byte array using PdfReader.
Question: Is there a more efficient way of doing the same (maybe via PdfCopy#addPage())?
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfSmartCopy;
import com.itextpdf.text.pdf.PdfWriter;
/**
 * Helper class that creates a PDF from the given images (JPEG, PNG, ...) and/or PDFs.
 */
public class MergeToPdf {
    public static void main(String[] args) throws IOException, DocumentException {
        if (args.length < 2) {
            System.err.println("At least two arguments are required: in1.pdf [, image2.jpg ...], out.pdf");
            System.exit(1);
        }

        Document mergedDocument = new Document();
        PdfSmartCopy pdfCopy = new PdfSmartCopy(mergedDocument, new FileOutputStream(args[args.length - 1]));
        mergedDocument.open();

        for (int i = 0; i < args.length - 1; i++) {
            PdfReader reader;

            if (args[i].toLowerCase().endsWith(".pdf")) {
                System.out.println("Adding PDF " + args[i] + "...");
                // Copy PDF document:
                reader = new PdfReader(args[i]);
            }
            else {
                System.out.println("Adding image " + args[i] + "...");
                final ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
                final Document imageDocument = new Document();
                PdfWriter.getInstance(imageDocument, byteStream);
                imageDocument.open();
                // Create a single page with the same dimensions as the source image and no margins:
                Image image = Image.getInstance(args[i]);
                image.setAbsolutePosition(0, 0);
                imageDocument.setPageSize(image);
                imageDocument.newPage();
                imageDocument.add(image);
                imageDocument.close();
                // Copy the PDF document with only one page carrying the image:
                reader = new PdfReader(byteStream.toByteArray());
            }

            pdfCopy.addDocument(reader);
            reader.close();
        }

        mergedDocument.close();
    }
}
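For reference, a sample invocation of the class above (the file names here are made up):

java MergeToPdf in1.pdf scan2.jpg in3.pdf merged.pdf

Every argument ending in .pdf is copied page by page; any other argument is treated as an image and wrapped in a one-page PDF first, as described above.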

Related

how to convert html file to ppt using java spring boot project

I have tried Aspose and GroupDocs, but those are built-in APIs that generate only one slide and add a watermark to the PPTs. Can anyone help me write code that converts HTML file content to a PowerPoint presentation?
I tried it like this:
package com.example.demo.config;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Aspose.Slides imports (the question uses the Aspose.Slides API)
import com.aspose.slides.FillType;
import com.aspose.slides.IAutoShape;
import com.aspose.slides.ISlide;
import com.aspose.slides.Presentation;
import com.aspose.slides.SaveFormat;
import com.aspose.slides.ShapeType;

public class ConPptHtml {
    public static void main(String[] args) throws Exception {
        // The path to the documents directory.
        String dataDir = "C:\\Downloads\\";
        // Create an empty presentation instance
        Presentation pres = new Presentation();
        // Access the default first slide of the presentation
        ISlide slide = pres.getSlides().get_Item(0);
        // Add an AutoShape sized to the slide to accommodate the HTML content
        IAutoShape ashape = slide.getShapes().addAutoShape(ShapeType.Rectangle, 10, 10,
                (float) pres.getSlideSize().getSize().getWidth(),
                (float) pres.getSlideSize().getSize().getHeight());
        ashape.getFillFormat().setFillType(FillType.NoFill);
        // Add a text frame to the shape and clear its default paragraphs
        ashape.addTextFrame("");
        ashape.getTextFrame().getParagraphs().clear();
        // Read the HTML file into a string
        String content = new String(Files.readAllBytes(new File(dataDir + "sample.html").toPath()),
                StandardCharsets.UTF_8);
        // Add the text from the HTML content to the text frame
        ashape.getTextFrame().getParagraphs().addFromHtml(content);
        // Save the presentation
        pres.save(dataDir + "hppt.pptx", SaveFormat.Pptx);
    }
}
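To compile and run this, the Aspose.Slides JAR has to be on the classpath (the JAR name below is a placeholder for whichever version you have; classpath separator shown for Windows):

javac -d . -cp aspose-slides.jar ConPptHtml.java
java -cp ".;aspose-slides.jar" com.example.demo.config.ConPptHtml

Note that the evaluation mode of Aspose.Slides is most likely what adds the watermark the question complains about; that is a licensing restriction, not something fixable in code.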

How to convert a PDF to a JSON/EXCEL/WORD file?

I need to get the data from the PDF file together with its headers for further comparison with DB data.
I tried to use pdfbox, google vision ocr and itext, but all of these libraries gave me a row without structure or headers.
Example: Date\nNumber\nStatus\n12/12/2020\n442334\ndelivered
I am going to try converting the PDF to Excel/Word and getting the data from there, but for that I need to read the PDF and write the data into Excel/Word.
How can I get the data with headers?
"Date\nNumber\nStatus\n12/12/2020\n442334\ndelivered" looks structured enough to me. You could just split it at the "\n"s. That would require some knowledge of the table structure, though.
I've had good experience with Google Vision OCR. How are you calling it?
I have not found an answer to my question.
I'm using this code for my task:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.awt.*;
import java.io.File;
import java.io.IOException;
public class ExtractTextByArea {
    public String getTextFromCoordinate(String filepath, int x, int y, int width, int height) {
        String result = "";
        try (PDDocument document = PDDocument.load(new File(filepath))) {
            if (!document.isEncrypted()) {
                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition(true);
                // Rectangle rect = new Rectangle(260, 35, 70, 10);
                Rectangle rect = new Rectangle(x, y, width, height);
                stripper.addRegion("class1", rect);
                PDPage firstPage = document.getPage(0);
                stripper.extractRegions(firstPage);
                // System.out.println("Text in the area:" + rect);
                result = stripper.getTextForRegion("class1");
            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
        return result;
    }
}
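A rough sketch of how this class could answer the headers question (the file name and coordinates are hypothetical; you would measure the real ones from your PDF): extract each header region and the value region below it separately, then pair them up.

ExtractTextByArea extractor = new ExtractTextByArea();
String header = extractor.getTextFromCoordinate("report.pdf", 260, 35, 70, 10).trim();
String value  = extractor.getTextFromCoordinate("report.pdf", 260, 45, 70, 10).trim();
System.out.println(header + " = " + value);   // e.g. "Date = 12/12/2020"

This only works if the table layout is fixed, which is the same caveat as splitting the raw string.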

PDFBox IO Exception: COSStream has been closed and cannot be read

I am having an issue with some code I'm writing in Java using PDFBox. I am attempting to populate a PDF with particular forms based on values read from an Excel spreadsheet. Below is my class file.
import java.io.FileInputStream;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.hssf.usermodel.*;
/**
 * This is a test file for reading and populating a PDF with specific forms
 */
public class JU_TestFile {

    PDPage Stick_Form;
    PDPage IKE_Form;
    PDPage BO_Form;

    /**
     * Constructor.
     */
    public JU_TestFile() throws IOException
    {
        this.BO_Form = (PDPage) PDDocument.load(new File("C:\\Users\\saf\\Desktop\\JavaTest\\BO Pole Form.pdf")).getPage(0);
        this.IKE_Form = (PDPage) PDDocument.load(new File("C:\\Users\\saf\\Desktop\\JavaTest\\IKE Form.pdf")).getPage(0);
        this.Stick_Form = (PDPage) PDDocument.load(new File("C:\\Users\\saf\\Desktop\\JavaTest\\Sticking Form.pdf")).getPage(0);
    }

    public void buildFile(String fileName, String excelSheet) throws IOException {
        // Create a Blank PDF Document and load in JU Excel Spreadsheet
        PDDocument workingDocument = new PDDocument();
        FileInputStream fis = new FileInputStream(new File(excelSheet));
        // Load in the workbook
        HSSFWorkbook JU_XML = new HSSFWorkbook(fis);
        int sheetNumber = 0;
        int rowNumber = 0;
        String cellValue = "Starting Value";
        HSSFSheet currentSheet = JU_XML.getSheetAt(sheetNumber);
        // While we have not reached the 25th row in our current sheet
        while (rowNumber <= 24) {
            // Get the value in the current row, on the 8th column in the xls file
            cellValue = currentSheet.getRow(rowNumber + 6).getCell(7).getStringCellValue();
            // If it has stuff in it,
            if (cellValue != "") {
                // Check if it has the letters "IKE" and append the IKE form to our PDF
                if (cellValue != "IKE") {
                    workingDocument.importPage(IKE_Form);
                // If it is anything else (other than empty), append the Stick Form to our PDF
                } else {
                    workingDocument.importPage(Stick_Form);
                }
                // Let's move on to the next row
                rowNumber++;
                // If the next row number is the "26th" row, we know we need to move on to the
                // next sheet, and also reset the rows to the first row of that next sheet
                if (rowNumber == 25) {
                    rowNumber = 0;
                    currentSheet = JU_XML.getSheetAt(++sheetNumber);
                }
            // if the 9th row is empty, we should break out of the loop and save/close our PDF, we are done
            } else {
                break;
            }
        }
        workingDocument.save(fileName);
        workingDocument.close();
    }
}
I am getting the following error:
Exception in thread "main" java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
I've done research and it seems like a PDDocument is closing before I run the workingDocument.save(fileName) command. I'm not quite sure how to fix this, and I'm also a bit lost on how to find a workaround. I'm a bit rusty on my programming, so any help would be super appreciated! Also any feedback on how to make future posts more informative would be great.
Thanks in advance
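For what it's worth, the usual fix for this error (a minimal sketch adapted from the class above, not a complete rewrite): a page obtained via importPage() still reads its content streams from the PDDocument it came from, so the three source documents must stay open until after workingDocument.save(...). In the constructor above they are loaded inline and immediately become unreachable, so they get closed before the save. Holding them in fields and closing them explicitly avoids that:

private PDDocument boDoc, ikeDoc, stickDoc;

public JU_TestFile() throws IOException {
    boDoc = PDDocument.load(new File("C:\\Users\\saf\\Desktop\\JavaTest\\BO Pole Form.pdf"));
    BO_Form = boDoc.getPage(0);
    // ... same pattern for ikeDoc/IKE_Form and stickDoc/Stick_Form
}

public void close() throws IOException {
    // Call this only after workingDocument.save(fileName) has run
    boDoc.close();
    ikeDoc.close();
    stickDoc.close();
}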
Please try this:
PDFMergerUtility merger = new PDFMergerUtility();
PDDocument combine = PDDocument.load(file);
// getDocument() here stands for the destination PDDocument you are building
merger.appendDocument(getDocument(), combine);
merger.mergeDocuments();
combine.close();
Update:
Since merger.mergeDocuments() is deprecated in recent APIs, use one of the following overloads instead:
merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
or
merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
Depending on your memory usage, you can further fine-tune this method by passing an appropriately configured MemoryUsageSetting object.
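For completeness, a self-contained sketch of the same idea (the file names are made up; this merges whole files rather than individual pages, using the PDFBox 2.x API):

import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class MergePdfs {
    public static void main(String[] args) throws Exception {
        PDFMergerUtility merger = new PDFMergerUtility();
        // Register the source files and the destination
        merger.addSource(new File("first.pdf"));
        merger.addSource(new File("second.pdf"));
        merger.setDestinationFileName("merged.pdf");
        // Merge entirely in memory; use setupTempFileOnly() for very large inputs
        merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
    }
}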

How to read raw data, say only text, from a file(word document, excel) without format? [duplicate]

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.
When using the following code:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.
The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
System.out.println(paragraph);
}
I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?
If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?
This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:
/*
 * This class is used to read .doc and .docx files
 *
 * @author Developer
 */
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
class TextExtractor {
    private OutputStream outputstream;
    private ParseContext context;
    private Detector detector;
    private Parser parser;
    private Metadata metadata;
    private String extractedText;

    public TextExtractor() {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    public void process(String filename) throws Exception {
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            url = new URL(filename);
        }
        InputStream input = TikaInputStream.get(url, metadata);
        ContentHandler handler = new BodyContentHandler(outputstream);
        parser.parse(input, handler, metadata, context);
        input.close();
    }

    public void getString() {
        // Get the text into a String object
        extractedText = outputstream.toString();
        // Do whatever you want with this String object.
        System.out.println(extractedText);
    }

    public static void main(String args[]) throws Exception {
        if (args.length == 1) {
            TextExtractor textExtractor = new TextExtractor();
            textExtractor.process(args[0]);
            textExtractor.getString();
        } else {
            throw new Exception();
        }
    }
}
To compile:
javac -cp ".:tika-app-1.2.jar" TextExtractor.java
To run:
java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc
There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).
The first option is to use WordExtractor, but wrap it in a call to stripFields(String) when calling it. That will remove the text based fields included in the text, things like HYPERLINK that you've seen. Your code would become:
NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());

for (String rawText : extractor.getParagraphText()) {
    String text = extractor.stripFields(rawText);
    System.out.println(text);
}
The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:
TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
tika.getParser().parse(stream, handler, metadata, new ParseContext());
String text = handler.toString();
Try this; it works for me and is purely a POI solution. Note that my example handles .docx files: if your document is in the older binary .doc format (Word 97-2003), you will have to use the HWPFDocument counterpart instead of XWPFDocument.
InputStream inputstream = new FileInputStream(m_filepath);
// read the file
XWPFDocument adoc = new XWPFDocument(inputstream);
// and place it in a xwpf format
aString = new XWPFWordExtractor(adoc).getText();
// gets the full text

Now if you only want certain parts, you can use getParagraphText(), but don't use the text extractor; use it directly on the paragraphs, like this:

for (XWPFParagraph p : adoc.getParagraphs()) {
    System.out.println(p.getParagraphText());
}

Creating PDF from TIFF image using iText

I'm currently generating PDF files from TIFF images using iText.
Basically the procedure is as follows:
1. Read the TIFF file.
2. For each "page" of the TIFF, instantiate an Image object and write that to a Document instance, which is the PDF file.
I'm having a hard time understanding how to add those images to the PDF keeping the original resolution.
I've tried to scale the Image to the dimensions in pixels of the original image of the TIFF, for instance:
// Pixel Dimensions 1728 × 2156 pixels
// Resolution 204 × 196 ppi
RandomAccessFileOrArray tiff = new RandomAccessFileOrArray("/path/to/tiff/file");
Document pdf = new Document(PageSize.LETTER);
Image temp = TiffImage.getTiffImage(tiff, page);
temp.scaleAbsolute(1728f, 2156f);
pdf.add(temp);
I would really appreciate if someone can shed some light on this. Perhaps I'm missing the functionality of the Image class methods...
Thanks in advance!
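For what it's worth (my own arithmetic, not taken from the answers below): PDF page dimensions are in points (1/72 inch), not pixels, so to keep the stated 204 × 196 ppi you would scale the image to its physical size in points rather than to its pixel dimensions:

// 1728 px / 204 ppi * 72 = 609.88 pt wide; 2156 px / 196 ppi * 72 = 792 pt tall
temp.scaleAbsolute(1728f / 204f * 72f, 2156f / 196f * 72f);

iText's TiffImage usually picks the resolution up from the TIFF tags, so image.getDpiX() and image.getDpiY() can replace the hard-coded values.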
I think if you scale the image then you cannot retain the original resolution (please correct me if I am wrong :)).
What you can try instead is to create a PDF document with different-sized pages (if the images in the TIFF have different resolutions).
Try the following code. It sets the size of each PDF page equal to that of the corresponding image and then creates that page. The page size varies with the image size, so the resolution is maintained :)
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.codec.TiffImage;
public class Tiff2Pdf {

    /**
     * @param args
     * @throws DocumentException
     * @throws IOException
     */
    public static void main(String[] args) throws DocumentException, IOException {
        String imageFilename = "/home/saurabh/Downloads/image.tif";
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document,
                new FileOutputStream("/home/saurabh/Desktop/out" + Math.random() + ".pdf"));
        writer.setStrictImageSequence(true);
        document.open();
        document.add(new Paragraph("Multipages tiff file"));
        Image image;
        RandomAccessFileOrArray ra = new RandomAccessFileOrArray(imageFilename);
        int pages = TiffImage.getNumberOfPages(ra);
        for (int i = 1; i <= pages; i++) {
            image = TiffImage.getTiffImage(ra, i);
            Rectangle pageSize = new Rectangle(image.getWidth(), image.getHeight());
            document.setPageSize(pageSize);
            document.add(image);
            document.newPage();
        }
        document.close();
    }
}
I've found that this line doesn't work well:
document.setPageSize(pageSize);
If your TIFF files only contain one image then you're better off using this instead:
RandomAccessFileOrArray ra = new RandomAccessFileOrArray(imageFilePath);
Image image = TiffImage.getTiffImage(ra, 1);
Rectangle pageSize = new Rectangle(image.getWidth(), image.getHeight());
Document document = new Document(pageSize);
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(outputFileName));
writer.setStrictImageSequence(true);
document.open();
document.add(image);
document.newPage();
document.close();
This will result in a page size that fits the image size exactly, so no scaling is required.
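One caveat (an assumption on my part, depending on your iText version): the Document still applies its default margins when the image is added, so for a pixel-exact fit you may also want to zero the margins and pin the image to the page origin:

Document document = new Document(pageSize, 0, 0, 0, 0);
image.setAbsolutePosition(0, 0);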
Another example, not deprecated as of iText 5.5, with the first-page issue fixed. I'm using iText 5.5.11.
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import com.itextpdf.text.Document;
import com.itextpdf.text.Image;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.io.FileChannelRandomAccessSource;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.codec.TiffImage;
public class Test1 {

    public static void main(String[] args) throws Exception {
        RandomAccessFile aFile = new RandomAccessFile("/myfolder/origin.tif", "r");
        FileChannel inChannel = aFile.getChannel();
        FileChannelRandomAccessSource fcra = new FileChannelRandomAccessSource(inChannel);
        Document document = new Document();
        PdfWriter.getInstance(document, new FileOutputStream("/myfolder/destination.pdf"));
        document.open();
        RandomAccessFileOrArray rafa = new RandomAccessFileOrArray(fcra);
        int pages = TiffImage.getNumberOfPages(rafa);
        Image image;
        for (int i = 1; i <= pages; i++) {
            image = TiffImage.getTiffImage(rafa, i);
            Rectangle pageSize = new Rectangle(image.getWidth(), image.getHeight());
            document.setPageSize(pageSize);
            document.newPage();
            document.add(image);
        }
        document.close();
        aFile.close();
    }
}
