How to convert a PDF to a JSON/EXCEL/WORD file?

I need to extract data from a PDF file together with its column headers, so that I can compare it with data in a database.
I tried PDFBox, Google Vision OCR, and iText, but every library returned the text as a flat sequence of lines without any table structure or headers.
Example: Date\nNumber\nStatus\n12/12/2020\n442334\ndelivered
I am now going to try converting the PDF to Excel or Word and reading the data from there, but that approach still requires reading the PDF and writing the data into the Excel or Word file.
How can I get the data together with its headers?

"Date\nNumber\nStatus\n12/12/2020\n442334\ndelivered" looks structured enough to me. You could just split it at the "\n"s. That would require some knowledge of the table structure, though.
I've had good experiences with Google Vision OCR. How are you calling it?
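The split suggested above could look roughly like the following sketch. It assumes the column count is known in advance and that the values arrive in row-major order after the headers. `FlatTableParser` and `parse` are hypothetical names for illustration, not part of PDFBox, iText, or Google Vision.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of any PDF or OCR library.
class FlatTableParser {

    // Assumption: the first columnCount lines are the headers and the
    // remaining lines are cell values in row-major order.
    static List<Map<String, String>> parse(String flatText, int columnCount) {
        if (columnCount <= 0) {
            throw new IllegalArgumentException("columnCount must be positive");
        }
        // \R matches any line break (\n, \r\n, ...)
        String[] tokens = flatText.trim().split("\\R");
        if (tokens.length < columnCount || tokens.length % columnCount != 0) {
            throw new IllegalArgumentException("Token count " + tokens.length
                    + " does not fill whole rows of " + columnCount + " columns");
        }
        List<Map<String, String>> rows = new ArrayList<>();
        // Start after the header tokens and consume one row per iteration
        for (int start = columnCount; start < tokens.length; start += columnCount) {
            Map<String, String> row = new LinkedHashMap<>();
            for (int col = 0; col < columnCount; col++) {
                row.put(tokens[col].trim(), tokens[start + col].trim());
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Calling `FlatTableParser.parse(text, 3)` on the example string would give one row that maps Date, Number, and Status to their values, which can then be compared with the DB record. The sketch breaks down if a cell is empty or spans several lines, which is the "knowledge of the table structure" caveat mentioned above.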

I did not find an answer to my question.
I am using this code for my task:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.awt.*;
import java.io.File;
import java.io.IOException;

public class ExtractTextByArea {
    public String getTextFromCoordinate(String filepath, int x, int y, int width, int height) {
        String result = "";
        try (PDDocument document = PDDocument.load(new File(filepath))) {
            if (!document.isEncrypted()) {
                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition(true);
                // Rectangle rect = new Rectangle(260, 35, 70, 10);
                Rectangle rect = new Rectangle(x, y, width, height);
                stripper.addRegion("class1", rect);
                PDPage firstPage = document.getPage(0);
                stripper.extractRegions(firstPage);
                // System.out.println("Text in the area:" + rect);
                result = stripper.getTextForRegion("class1");
            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
        return result;
    }
}

Related

Merge PDF documents and images into one PDF

I have read the examples in the section on merging PDF documents, but I couldn't come up with a better solution for the following task:
I would like to merge a series of PDF and image files that arrive in any order (original post). The inefficiency comes from the fact that I need to create a dummy one-page PDF file for each image using PdfWriter and then read it back from a byte array using PdfReader.
Question: Is there a more efficient way of doing the same (maybe via PdfCopy#addPage())?
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfSmartCopy;
import com.itextpdf.text.pdf.PdfWriter;
/**
* Helper class that creates PDF from given image(s) (JPEG, PNG, ...) or PDFs.
*/
public class MergeToPdf {
public static void main(String[] args) throws IOException, DocumentException {
if (args.length < 2) {
System.err.println("At least two arguments are required: in1.pdf [, image2.jpg ...], out.pdf");
System.exit(1);
}
Document mergedDocument = new Document();
PdfSmartCopy pdfCopy = new PdfSmartCopy(mergedDocument, new FileOutputStream(args[args.length - 1]));
mergedDocument.open();
for (int i = 0; i < args.length - 1; i++) {
PdfReader reader;
if (args[i].toLowerCase().endsWith(".pdf")) {
System.out.println("Adding PDF " + args[i] + "...");
// Copy PDF document:
reader = new PdfReader(args[i]);
}
else {
System.out.println("Adding image " + args[i] + "...");
final ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
final Document imageDocument = new Document();
PdfWriter.getInstance(imageDocument, byteStream);
imageDocument.open();
// Create single page with the dimensions as source image and no margins:
Image image = Image.getInstance(args[i]);
image.setAbsolutePosition(0, 0);
imageDocument.setPageSize(image);
imageDocument.newPage();
imageDocument.add(image);
imageDocument.close();
// Copy PDF document with only one page carrying the image:
reader = new PdfReader(byteStream.toByteArray());
}
pdfCopy.addDocument(reader);
reader.close();
}
mergedDocument.close();
}
}

PDFBox IO Exception: COSStream has been closed and cannot be read

I am having an issue with some code I'm writing in Java using PDFBox. I am attempting to populate a PDF with particular forms based on values read from an Excel spreadsheet. Below is my class file.
import java.io.FileInputStream;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.hssf.usermodel.*;
/**
* This is a test file for reading and populating a PDF with specific forms
*/
public class JU_TestFile {
PDPage Stick_Form;
PDPage IKE_Form;
PDPage BO_Form;
/**
* Constructor.
*/
public JU_TestFile() throws IOException
{
this.BO_Form = (PDPage) PDDocument.load(new File("C:\\Users\\saf\\Desktop\\JavaTest\\BO Pole Form.pdf")).getPage(0);
this.IKE_Form = (PDPage) PDDocument.load(new File("C:\\Users\\saf\\Desktop\\JavaTest\\IKE Form.pdf")).getPage(0);
this.Stick_Form = (PDPage) PDDocument.load(new File("C:\\Users\\saf\\Desktop\\JavaTest\\Sticking Form.pdf")).getPage(0);
}
public void buildFile(String fileName, String excelSheet) throws IOException {
// Create a Blank PDF Document and load in JU Excel Spreadsheet
PDDocument workingDocument = new PDDocument();
FileInputStream fis = new FileInputStream(new File(excelSheet));
// Load in the workbook
HSSFWorkbook JU_XML = new HSSFWorkbook(fis);
int sheetNumber = 0;
int rowNumber = 0;
String cellValue = "Starting Value";
HSSFSheet currentSheet = JU_XML.getSheetAt(sheetNumber);
// While we have not reached the 25th row in our current sheet
while (rowNumber <= 24) {
// Get the value in the current row, on the 8th column in the xls file
cellValue = currentSheet.getRow(rowNumber + 6).getCell(7).getStringCellValue();
// If it has stuff in it,
if (!cellValue.isEmpty()) {
// Check if it has the letters "IKE" and append the IKE form to our PDF
// (strings must be compared by content, not with == or !=)
if (cellValue.contains("IKE")) {
workingDocument.importPage(IKE_Form);
// If it is anything else (other than empty), append the Stick Form to our PDF
} else {
workingDocument.importPage(Stick_Form);
}
// Let's move on to the next row
rowNumber++;
// If the next row number is the "26th" row, we know we need to move on to the
// next sheet, and also reset the rows to the first row of that next sheet
if (rowNumber == 25) {
rowNumber = 0;
currentSheet = JU_XML.getSheetAt(++sheetNumber);
}
// if the 9th row is empty, we should break out of the loop and save/close our PDF, we are done
} else {
break;
}
}
workingDocument.save(fileName);
workingDocument.close();
}
}
I am getting the following error:
Exception in thread "main" java.io.IOException: COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?
I've done research and it seems like a PDDocument is closing before I run the workingDocument.save(fileName) command. I'm not quite sure how to fix this, and I'm also a bit lost on how to find a workaround. I'm a bit rusty on my programming, so any help would be super appreciated! Also any feedback on how to make future posts more informative would be great.
Thanks in advance

Try merging with PDFMergerUtility instead:
PDFMergerUtility merger = new PDFMergerUtility();
PDDocument combine = PDDocument.load(file);
merger.appendDocument(getDocument(), combine);
merger.mergeDocuments();
combine.close();
Update:
Since merger.mergeDocuments() without arguments is deprecated in recent API versions, use one of the following overloads instead:
merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
or
merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
Depending on your memory usage, you can fine-tune this further by passing a different MemoryUsageSetting object.

Convert Tiff to Pdf in java using itext

I am using the code below to convert TIFF to PDF.
It works fine for TIFF images with dimensions 850*1100. But when I use an input TIFF image with different dimensions (e.g. 1574*732 or 684*353, anything other than 850*1100), I get the error below. How can I convert TIFF images of different dimensions to PDF?
The code below fails with this error: Compression JPEG is only supported with a single strip. This image has 45 strips.
RandomAccessFileOrArray myTifFile = null;
com.itextpdf.text.Document tiffToPDF= null;
PdfWriter pdfWriter = null;
try{
myTifFile = new RandomAccessFileOrArray(fileName);
int numberOfPages = TiffImage.getNumberOfPages(myTifFile);
tiffToPDF = new com.itextpdf.text.Document(PageSize.LETTER_LANDSCAPE);
String temp = fileName.substring(0, fileName.lastIndexOf("."));
pdfWriter = PdfWriter.getInstance(tiffToPDF, new FileOutputStream(temp+".pdf"));
pdfWriter.setStrictImageSequence(true);
tiffToPDF.open();
for(int tiffImageCounter = 1;tiffImageCounter <= numberOfPages;tiffImageCounter++)
{
Image img = TiffImage.getTiffImage(myTifFile, tiffImageCounter);
img.setAbsolutePosition(0,0);
img.scaleToFit(612,792);
tiffToPDF.add(img);
tiffToPDF.newPage();
}
tiffToPDF.close();
} catch (Exception e) {
e.printStackTrace();
}

This code shows how you can convert a TIFF file to a PDF.
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
//Read Tiff File, Get number of Pages
import com.itextpdf.text.pdf.codec.TiffImage;
//We need the library below to write the final
//PDF file which has our image converted to PDF
import java.io.FileOutputStream;
//The image class to extract separate images from Tiff image
import com.itextpdf.text.Image;
//PdfWriter object to write the PDF document
import com.itextpdf.text.pdf.PdfWriter;
//Document object to add logical image files to PDF
import com.itextpdf.text.Document;
public class TiffToPDF {
public static void main(String args[]){
try{
//Read the Tiff File
RandomAccessFileOrArray myTiffFile=new RandomAccessFileOrArray("c:\\java\\test.tif");
//Find number of images in Tiff file
int numberOfPages=TiffImage.getNumberOfPages(myTiffFile);
System.out.println("Number of Images in Tiff File" + numberOfPages);
Document TifftoPDF=new Document();
PdfWriter.getInstance(TifftoPDF, new FileOutputStream("c:\\java\\tiff2Pdf.pdf"));
TifftoPDF.open();
//Run a for loop to extract each image from the Tiff file
//into an Image object and add it to the PDF
for(int i=1;i<=numberOfPages;i++){
Image tempImage=TiffImage.getTiffImage(myTiffFile, i);
TifftoPDF.add(tempImage);
}
TifftoPDF.close();
System.out.println("Tiff to PDF Conversion in Java Completed" );
}
catch (Exception i1){
i1.printStackTrace();
}
}
}

Writing image into pdf file in java

I'm writing code that converts Microsoft PowerPoint (ppt) slides into images and writes the generated images into a PDF file. The following code generates the images and writes them into the PDF, but each image exceeds the PDF page size, so only about 75% of the image is visible and the rest is cut off. The images in the PDF also look zoomed in or stretched. Take a look at the following snippet:
for (int i = 0; i < slide.length; i++) {
BufferedImage img = new BufferedImage(pgsize.width, pgsize.height, BufferedImage.TYPE_INT_RGB);
Graphics2D graphics = img.createGraphics();
graphics.setPaint(Color.white);
graphics.fill(new Rectangle(0, 0, pgsize.width, pgsize.height));
slide[i].draw(graphics);
fileName="C:/DATASTORE/slide-"+(i+1)+".png";
FileOutputStream out = new FileOutputStream(fileName);
javax.imageio.ImageIO.write(img, "png", out);
out.flush();
out.close();
com.lowagie.text.Image image =com.lowagie.text.Image.getInstance(fileName);
image.setWidthPercentage(40.0f);
doc.add((image));
}
doc.close();
} catch(DocumentException de) {
System.err.println(de.getMessage());
}
If anybody knows the solution, please help me fix this. Thank you.
Here is the code that accomplishes what I wanted. I now get the desired results after following Bruno Lowagie's recommendations.
But, as Bruno Lowagie pointed out earlier, there is a problem in the generated PNG image. It is not correct because shapes or images in the slide overlap with the slide's text. Can you help me identify and fix the error?
import java.awt.Color;
import java.awt.Dimension;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import com.itextpdf.text.Image;
import java.awt.image.BufferedImage;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.usermodel.SlideShow;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.pdf.PdfWriter;
public class ConvertSlidesIntoImages {
public static void main(String[] args){
try {
FileInputStream is = new FileInputStream("C:/DATASTORE/testPPT.ppt");
SlideShow ppt = new SlideShow(is);
is.close();
String fileName;
Dimension pgsize = ppt.getPageSize();
Slide[] slide = ppt.getSlides();
Document doc=new Document();
PdfWriter.getInstance(doc, new FileOutputStream("c:/DATASTORE/convertPPTSlidesIntoPDFImages.pdf"));
doc.open();
for (int i = 0; i < slide.length; i++) {
BufferedImage img = new BufferedImage(pgsize.width, pgsize.height, BufferedImage.TYPE_INT_RGB);
Graphics2D graphics = img.createGraphics();
graphics.setPaint(Color.white);
graphics.fill(new Rectangle(0, 0, pgsize.width, pgsize.height));
slide[i].draw(graphics);
fileName="C:/DATASTORE/slide-"+(i+1)+".png";
FileOutputStream out = new FileOutputStream(fileName);
javax.imageio.ImageIO.write(img, "png", out);
out.flush();
out.close();
com.itextpdf.text.Image image =com.itextpdf.text.Image.getInstance(fileName);
doc.setPageSize(new com.itextpdf.text.Rectangle(image.getScaledWidth(), image.getScaledHeight()));
doc.newPage();
image.setAbsolutePosition(0, 0);
doc.add(image);
}
doc.close();
}catch(DocumentException de) {
System.err.println(de.getMessage());
}
catch(Exception ex) {
ex.printStackTrace();
}
}
}
Thank you

First this: If the png stored as "C:/DATASTORE/slide-"+(i+1)+".png" isn't correct, the slide in the PDF won't be correct either.
And this: Your code snippet doesn't show us how you create the Document object. By default, the page size is A4 in portrait. It goes without saying that images that are bigger than 595 x 842 don't fit that page.
Now the answer: There are two ways to solve this.
Either you change the size of the image (not with setWidthPercentage() unless you've calculated the actual percentage) and you add it at the position (0, 0) so that it doesn't take into account the margins. For instance:
image.scaleToFit(595, 842);
image.setAbsolutePosition(0, 0);
doc.add(image);
doc.newPage();
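To make the scaling step concrete, here is a minimal sketch of the arithmetic behind fitting an image into a box while keeping its aspect ratio. `FitScale` is a hypothetical helper written for illustration, not an iText class, and it is an assumption that iText's scaleToFit behaves the same way in every edge case.

```java
// Hypothetical helper for illustration; not part of iText.
class FitScale {

    // Returns {width, height} of the image after scaling it uniformly so
    // that it fits inside the box. The smaller of the two ratios is the
    // limiting one, which preserves the aspect ratio.
    // Note: this sketch also enlarges images smaller than the box.
    static double[] scaleToFit(double imageWidth, double imageHeight,
                               double boxWidth, double boxHeight) {
        if (imageWidth <= 0 || imageHeight <= 0) {
            throw new IllegalArgumentException("image dimensions must be positive");
        }
        double factor = Math.min(boxWidth / imageWidth, boxHeight / imageHeight);
        return new double[] { imageWidth * factor, imageHeight * factor };
    }
}
```

For example, a 720 x 540 image placed on a 595 x 842 page is limited by its width (595 / 720), so it ends up roughly 595 x 446.25 and leaves the rest of the page empty.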
A better solution would be to adapt the size of the page to the size of the image.
Document doc = new Document(new Rectangle(image.getScaledWidth(), image.getScaledHeight()));
// create a writer, open the document
image.setAbsolutePosition(0, 0);
doc.add(image);
doc.newPage();
If the size of the images varies, you can change the page size while adding images like this:
doc.setPageSize(new Rectangle(image.getScaledWidth(), image.getScaledHeight()));
doc.newPage();
image.setAbsolutePosition(0, 0);
doc.add(image);
It is important to understand that the new page size will only come into effect after doc.newPage();
CAVEAT 1:
If your PDF only holds the last slide, you're probably putting all the slides on the same page, and the last slide covers them all. You need to invoke the newPage() method each time you add an image (as done in a code snippet in my answer).
CAVEAT 2:
Your allegation is wrong. According to the API docs, there is a method setPageSize(Rectangle rect), maybe you're using the wrong Rectangle class. If you didn't follow my advice (which IMHO wouldn't be wise), you're probably looking for com.lowagie.text.Rectangle instead of java.awt.Rectangle.
CAVEAT 3:
This is similar to CAVEAT 2: there are indeed no such methods in the class java.awt.Image, but as documented in the API docs, the class com.itextpdf.text.Image has a getScaledWidth() method and a getScaledHeight() method.

Itext rectangle will not bleed to edge of page

I am attempting to modify the background-color of a single page of a multi-page PDF document created using iText.
The easiest way to do this appeared to be by creating a Rectangle the entire size of the page, with the specified background color, and applying it to the page in question using the PdfContentByte utility. (having explored using the Document API, this seemed not to be the best option, since this applied the styling to ALL pages in the document, which I did not want).
When run, on close inspection, I can see that there is a single pixel along the upper, right and bottom margins, which remains white, the rest of the page being the correct color. I have played with the rectangle to ensure no margins were created, but to no avail. Find the code I am using below.
Rectangle r = new Rectangle(0, 0, helper.getPageWidth(), helper.getPageHeight());
r.setBackgroundColor(Constants.GREEN);
PdfContentByte cb = helper.getWriter().getDirectContent();
cb.rectangle(r);
cb.setColorFill(Constants.GREEN);
cb.setColorStroke(Constants.GREEN);
cb.fillStroke();
It seems whatever I try, I cannot get rid of the single white pixel row along these 3 sides of the page. Does anyone have any idea how to bleed to the VERY edge of an iText page?

First: please mention the iText version you are using. I used your code snippet with some changes and it worked well. A full code snippet might help me find out what is wrong in your code.
(My prime suspect is this line: Rectangle r = new Rectangle(0,0,helper.getPageWidth(),helper.getPageHeight()))
I've attached the output and the code I used.
package com.pra.itext;
import com.lowagie.text.DocumentException;
import com.lowagie.text.Rectangle;
import com.lowagie.text.pdf.PdfContentByte;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStamper;
import java.awt.Color;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
/**
*
* @author Prajit
*/
public class ItextRect {
public static void main(String[] args) {
PdfReader rdrPdf = null;
PdfStamper stmprPdf = null;
try {
rdrPdf = new PdfReader("E:/Head.First.Servlets&Jsp.pdf");
stmprPdf = new PdfStamper(rdrPdf, new FileOutputStream(new File("D:/Example.pdf")));
for (int pgCnt = 1; pgCnt <= rdrPdf.getNumberOfPages(); pgCnt++) {
if (pgCnt == 1) {
PdfContentByte pdfCntntByt = stmprPdf.getUnderContent(pgCnt);
Rectangle r = new Rectangle(rdrPdf.getPageSize(pgCnt));
r.setBackgroundColor(Color.red);
pdfCntntByt.rectangle(r);
pdfCntntByt.setColorFill(Color.red);
pdfCntntByt.setColorStroke(Color.red);
pdfCntntByt.fillStroke();
}
}
stmprPdf.close();
rdrPdf.close();
} catch (DocumentException de) {
System.err.println(de.getMessage());
} catch (IOException ioe) {
System.err.println(ioe.getMessage());
}
}
}
