Apache POI and XDOCREPORT NullPointerException

I am replacing placeholders in a .docx file and then need to convert the file to PDF. Every attempt ends in:
fr.opensagres.poi.xwpf.converter.core.XWPFConverterException: java.lang.NullPointerException
at fr.opensagres.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:71)
at fr.opensagres.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:39)
at fr.opensagres.poi.xwpf.converter.core.AbstractXWPFConverter.convert(AbstractXWPFConverter.java:46)
I am using these dependencies:
implementation("org.apache.poi:poi-ooxml:3.17")
implementation("fr.opensagres.xdocreport:fr.opensagres.xdocreport.converter.docx.xwpf:2.0.1")
If I convert the unchanged source .docx file, everything works as it should, but when I replace the placeholders and save the document, the conversion crashes.
A piece of my code:
FileInputStream fis = new FileInputStream(COPIED);
XWPFDocument doc = new XWPFDocument(fis);
doc.createStyles();
for (XWPFParagraph p : doc.getParagraphs()) {
List<XWPFRun> runs = p.getRuns();
if (runs != null) {
for (XWPFRun r : runs) {
String text = r.getText(0);
StringSubstitutor substitutor = new StringSubstitutor(fieldsForReport);
String replacedText = substitutor.replace(text);
r.setText(replacedText, 0);
}
}
}
for (XWPFTable tbl : doc.getTables()) {
for (XWPFTableRow row : tbl.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph p : cell.getParagraphs()) {
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
StringSubstitutor substitutor = new StringSubstitutor(fieldsForReport);
String replacedText = substitutor.replace(text);
r.setText(replacedText, 0);
}
}
}
}
}
FileOutputStream fos = new FileOutputStream(COPIED);
doc.write(fos);
doc.close();
FileInputStream fis = new FileInputStream(COPIED);
XWPFDocument document = new XWPFDocument(fis);
PdfOptions options = PdfOptions.create();
PdfConverter converter = (PdfConverter) PdfConverter.getInstance();
converter.convert(document, new FileOutputStream(DEST), options);
document.close();

The following works for me using Apache POI 4.0.1 and version 2.0.2 of fr.opensagres.poi.xwpf.converter.core and its companion jars, the newest versions at the time of writing.
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.File;
//needed jars: fr.opensagres.poi.xwpf.converter.core-2.0.2.jar,
// fr.opensagres.poi.xwpf.converter.pdf-2.0.2.jar,
// fr.opensagres.xdocreport.itext.extension-2.0.2.jar,
// itext-4.2.1.jar
import fr.opensagres.poi.xwpf.converter.pdf.PdfOptions;
import fr.opensagres.poi.xwpf.converter.pdf.PdfConverter;
//needed jars: apache poi and its dependencies
//inclusive ooxml-schemas-1.4.jar
import org.apache.poi.xwpf.usermodel.*;
public class DOCXToPDFConverterSampleMin {
public static void main(String[] args) throws Exception {
String docPath = "./WordDocument.docx";
String pdfPath = "./WordDocument.pdf";
InputStream in = new FileInputStream(new File(docPath));
XWPFDocument document = new XWPFDocument(in);
for (XWPFParagraph paragraph : document.getParagraphs()) {
for (XWPFRun run : paragraph.getRuns()) {
String text = run.getText(0);
if (text != null && text.contains("$name$")) {
text = text.replace("$name$", "Axel Richter");
run.setText(text, 0);
} else if (text != null && text.contains("$date$")) {
text = text.replace("$date$", "2019-02-28");
run.setText(text, 0);
}
}
}
for (XWPFTable table : document.getTables()) {
for (XWPFTableRow row : table.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph paragraph : cell.getParagraphs()) {
for (XWPFRun run : paragraph.getRuns()) {
String text = run.getText(0);
if (text != null && text.contains("$name$")) {
text = text.replace("$name$", "Axel Richter");
run.setText(text,0);
} else if (text != null && text.contains("$date$")) {
text = text.replace("$date$", "2019-02-28");
run.setText(text, 0);
}
}
}
}
}
}
XWPFParagraph paragraph = document.createParagraph();
XWPFRun run = paragraph.createRun();
run.setText("This is new Text in this document.");
PdfOptions options = PdfOptions.create();
OutputStream out = new FileOutputStream(new File(pdfPath));
PdfConverter.getInstance().convert(document, out, options);
document.close();
out.close();
}
}
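The answer above guards every run with `text != null` before replacing, because `run.getText(0)` can return null for a run that carries no text. The same guard can be factored into one helper. This is a minimal sketch in plain Java, not part of the original answer: the class name `PlaceholderReplacer` and the `$key$` token convention (taken from the `$name$` and `$date$` examples above) are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: null-safe replacement of $key$ tokens in a run's text.
final class PlaceholderReplacer {

    // Returns null unchanged so the caller can skip runs without text.
    static String replace(String text, Map<String, String> values) {
        if (text == null) {
            return null;
        }
        String result = text;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("$" + entry.getKey() + "$", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("name", "Axel Richter");
        values.put("date", "2019-02-28");
        System.out.println(replace("Dear $name$, today is $date$.", values));
    }
}
```

In the loops above, the caller would then only write back when there is something to write, for example `if (replaced != null) run.setText(replaced, 0);`. Whether a null text is what triggers the NullPointerException in the question is not confirmed by the thread, so treat this as a defensive measure rather than a diagnosis.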

Related

PDFBox Customized PDFTextStripper

I'm a rookie, really. I'm building my first project (if I can finish it).
I want to extract PDF text with its formatting and location, and then write it to a .docx file. I checked the PDFBox API documentation, but I'm not sure whether, to get the location of the text, I should traverse the rows or the characters. I studied these three questions carefully:
Text coordinates when stripping from PDFBox
Get font of each line using PDFBox
How to extract font styles of text contents using pdfbox?
And here is my demo:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.IOException;
import java.util.List;
public class PDFTextExtractor extends PDFTextStripper {
/**
* Instantiate a new PDFTextStripper object.
*
* @throws IOException If there is an error loading the properties.
*/
public PDFTextExtractor() throws IOException {
}
String prevFont = "";
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
StringBuilder sb = new StringBuilder();
for (TextPosition position : textPositions){
String font = position.getFont().getName();
float x = position.getX();
float y = position.getY();
float fontSize = position.getFontSizeInPt();
if (font != null && !font.equals(prevFont)){
sb.append("[").append(font.split("-")[0]).append("+").append(font.split("-")[1]).append("+").append(fontSize).append("]");
prevFont = font;
}
sb.append(position.getUnicode());
}
writeString(sb.toString());
}
@Override
public String getText(PDDocument doc) throws IOException {
return super.getText(doc);
}
}
And I call it like this:
FileOutputStream outputStream = new FileOutputStream(EXPORT_PATH + file.getName().split("\\.")[0] + ".docx");
try (PDDocument originalPDF = PDDocument.load(file);
XWPFDocument doc = new XWPFDocument()) {
//get All pages
PDPageTree pageList = originalPDF.getDocumentCatalog().getPages();
for (PDPage page : pageList){
//Parse Content
PDFTextStripper stripper = new PDFTextExtractor();
stripper.setSortByPosition(true);
String ss = stripper.getText(originalPDF);
System.out.println(ss);
//Write Content
XWPFParagraph paragraph = doc.createParagraph();
XWPFRun run = paragraph.createRun();
run.setText(ss);
run.addBreak(BreakType.PAGE);
}
doc.write(outputStream);
originalPDF.close();
outputStream.close();
}
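One fragile spot in the demo above is `font.split("-")[1]`, which throws an ArrayIndexOutOfBoundsException for font names without a hyphen (for example `Helvetica`). Embedded subset fonts also usually carry a six-letter tag followed by `+` (for example `ABCDEF+Arial-BoldMT`). The following is a minimal sketch of a more forgiving split, in plain Java; the class name `FontNameParts` and its fields are hypothetical and not part of PDFBox.

```java
// Hypothetical helper: splits a PDF font name into family and style without
// assuming that a hyphen is present.
final class FontNameParts {
    final String family;
    final String style; // empty when the name has no hyphen

    private FontNameParts(String family, String style) {
        this.family = family;
        this.style = style;
    }

    static FontNameParts parse(String fontName) {
        if (fontName == null || fontName.isEmpty()) {
            return new FontNameParts("", "");
        }
        String name = stripSubsetTag(fontName);
        int dash = name.indexOf('-');
        if (dash < 0) {
            return new FontNameParts(name, "");
        }
        return new FontNameParts(name.substring(0, dash), name.substring(dash + 1));
    }

    // A subset tag is six uppercase letters followed by '+'.
    private static String stripSubsetTag(String name) {
        if (name.length() > 7 && name.charAt(6) == '+') {
            for (int i = 0; i < 6; i++) {
                char c = name.charAt(i);
                if (c < 'A' || c > 'Z') {
                    return name;
                }
            }
            return name.substring(7);
        }
        return name;
    }

    public static void main(String[] args) {
        FontNameParts parts = parse("ABCDEF+Arial-BoldMT");
        System.out.println(parts.family + " / " + parts.style);
    }
}
```

Inside `writeString`, the result could replace the two `split` calls, so a font without a style suffix yields an empty style instead of an exception.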

Read embedded object from Excel sheet row by row [duplicate]

I am using Apache POI with Java 1.8 in my application. In my application, I try to read Excel and get the embedded objects.
I need to know how to get the row number and file extensions for each embedded OLE object.
Workbook workbook = WorkbookFactory.create(new File(path));
XSSFWorkbook fWorkbook = (XSSFWorkbook) workbook;
List<PackagePart> embeddedDocs = fWorkbook.getAllEmbedds();
Calling getContentType() on each entry of embeddedDocs returns application/vnd.openxmlformats-officedocument.oleObject.
But is there any way to get the file extension (i.e. pdf, ppt, mp3) that corresponds to the MIME type? And how can I get the row number of each embedded object? Any ideas or coding logic to resolve this would be very useful.

The following works for, I guess, the usual suspects. I tested it with .xls/.xlsx on the POI trunk, which will become POI 4.1.0, but it should work with POI 4.0.1 too.
Known issues:
If an object wasn't embedded from a file, you don't get a filename. This probably also applies to most .xls files.
If a .xlsx only contains a vmlDrawing*.xml, the DrawingPatriarch can't be extracted and no shapes can be determined.
If a shape in a .xlsx wasn't anchored via twoCellAnchor, you don't get a ClientAnchor.
Code:
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.hpsf.ClassIDPredefined;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.ss.usermodel.ChildAnchor;
import org.apache.poi.ss.usermodel.ClientAnchor;
import org.apache.poi.ss.usermodel.Drawing;
import org.apache.poi.ss.usermodel.ObjectData;
import org.apache.poi.ss.usermodel.Shape;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.junit.Test;
public class TestEmbed {
@Test
public void extract() throws IOException {
// String xlsName = "test-data/spreadsheet/WithEmbeddedObjects.xls";
String xlsName = "embed.xlsx";
try (FileInputStream fis = new FileInputStream(xlsName);
Workbook xls = WorkbookFactory.create(fis)) {
for (Sheet s : xls) {
Drawing<?> dp = s.getDrawingPatriarch();
if (dp != null) {
for (Shape sh : dp) {
if (sh instanceof ObjectData) {
ObjectData od = (ObjectData)sh;
String filename = od.getFileName();
String ext = null;
if (filename != null && !filename.isEmpty()) {
int i = filename.lastIndexOf('.');
ext = (i > 0) ? filename.substring(i) : ".bin";
} else {
String ct = null;
try {
DirectoryEntry de = od.getDirectory();
if (de != null) {
ClassIDPredefined ctcls = ClassIDPredefined.lookup(de.getStorageClsid());
if (ctcls != null) {
ext = ctcls.getFileExtension();
}
}
} catch (Exception ignore) {
}
}
if (ext == null) {
ext = ".bin";
}
ChildAnchor chAnc = sh.getAnchor();
if (chAnc instanceof ClientAnchor) {
ClientAnchor anc = (ClientAnchor) chAnc;
System.out.println("Rows: " + anc.getRow1() + " to " + anc.getRow2() + " - filename: "+filename+" - ext: "+ext);
}
}
}
}
}
}
}
}
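The filename branch of the code above derives the extension with `lastIndexOf('.')` and falls back to `.bin`. That step can be isolated and exercised without any workbook. This is a minimal sketch in plain Java: the class name `EmbeddedFileExtension` is hypothetical, and lower-casing the result and rejecting a trailing dot are additions of this sketch, not behaviour of the answer's code.

```java
import java.util.Locale;

// Hypothetical helper: derives a file extension from an embedded object's
// filename, falling back to ".bin" when none can be determined.
final class EmbeddedFileExtension {
    static final String FALLBACK = ".bin";

    static String of(String filename) {
        if (filename == null || filename.isEmpty()) {
            return FALLBACK;
        }
        int dot = filename.lastIndexOf('.');
        // dot <= 0 covers "no dot" and names such as ".hidden";
        // a dot in the last position leaves no extension to return.
        if (dot <= 0 || dot == filename.length() - 1) {
            return FALLBACK;
        }
        return filename.substring(dot).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(of("report.PDF"));
        System.out.println(of("unnamed"));
    }
}
```

When the filename is missing, the answer's second branch (looking up the storage class ID) is still needed; this helper only covers the filename case.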

convert html text into pdf without losing formatting

I use xmlworker-5.5.9.jar and itextpdf-5.5.13.jar.
My web application uses CKEditor.
When the submit button is clicked, I want to convert the CKEditor content to PDF.
I used this code and it works properly:
public void createPDF(String text) throws DocumentException, IOException
{
String fileName="f:\\test.pdf";
Document document=new Document();
PdfWriter pdfWriter=PdfWriter.getInstance(document, new FileOutputStream(fileName));
document.open();
String finall=text;
InputStream is = new ByteArrayInputStream(finall.getBytes());
XMLWorkerHelper.getInstance().parseXHtml(pdfWriter,document, is);
document.close();
}
But this code does not work with Arabic text.
I tried this solution without success:
public void createPdf(String htmlContentAr)
{
Charset CHARSET_UTF8 = Charset.forName("UTF-8");
try {
Document pdfDoc = new Document();
PdfWriter writer = PdfWriter.getInstance(pdfDoc, new FileOutputStream("f:\\test.pdf"));
writer.setRgbTransparencyBlending(true);
pdfDoc.open();
StyleAttrCSSResolver cssResolver = new StyleAttrCSSResolver();
ElementsCollector elementsHandler = new ElementsCollector();
HtmlPipelineContext htmlContext = new HtmlPipelineContext(new CssAppliersImpl(
new UnicodeFontProvider()));
htmlContext.charSet(CHARSET_UTF8);
htmlContext.setAcceptUnknown(true).autoBookmark(true)
.setTagFactory(Tags.getHtmlTagProcessorFactory());
CssResolverPipeline pipeline = new CssResolverPipeline(cssResolver, new HtmlPipeline(htmlContext,
new ElementHandlerPipeline(elementsHandler, null)));
XMLWorker worker = new XMLWorker(pipeline, true);
XMLParser parser = new XMLParser();
parser.addListener(worker);
parser.parse(new StringReader(htmlContentAr));
PdfPTable mainTable = new PdfPTable(1);
mainTable.setWidthPercentage(100);
PdfPCell cell = new PdfPCell();
cell.setBorder(0);
cell.setHorizontalAlignment(Element.ALIGN_LEFT);
cell.addElement(elementsHandler.getParagraph());
mainTable.addCell(cell);
pdfDoc.add(mainTable);
pdfDoc.close();
} catch (Exception e) {
e.printStackTrace();
}
}
and this is the code of ElementsCollector.java :
import java.util.Iterator;
import java.util.List;
import com.itextpdf.text.Chunk;
import com.itextpdf.text.Element;
import com.itextpdf.text.Font;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfPCell;
import com.itextpdf.text.pdf.PdfPRow;
import com.itextpdf.text.pdf.PdfPTable;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.ElementHandler;
import com.itextpdf.tool.xml.Writable;
import com.itextpdf.tool.xml.html.pdfelement.NoNewLineParagraph;
import com.itextpdf.tool.xml.pipeline.WritableElement;
public class ElementsCollector implements ElementHandler {
private Paragraph _paragraph;
public ElementsCollector() {
_paragraph = new Paragraph();
_paragraph.setAlignment(Element.ALIGN_LEFT);
}
public Paragraph getParagraph() {
return _paragraph;
}
@Override
public void add(Writable htmlElement) {
WritableElement writableElement = (WritableElement) htmlElement;
if (writableElement == null) {
return;
}
for (Element element : writableElement.elements()) {
if (element instanceof NoNewLineParagraph) {
NoNewLineParagraph para = (NoNewLineParagraph) element;
Iterator<Element> it = para.iterator();
while (it.hasNext()) {
Element divChildElement = (Element) it.next();
fixNestedTablesRunDirection(divChildElement);
_paragraph.add(divChildElement);
}
} else {
fixNestedTablesRunDirection(element);
_paragraph.add(element);
}
}
}
private void fixNestedTablesRunDirection(Element element) {
if (element == null) {
return;
}
if (element instanceof PdfPTable) {
PdfPTable table = (PdfPTable) element;
for (PdfPRow row : table.getRows()) {
for (PdfPCell cell : row.getCells()) {
if (cell.getCompositeElements() != null) {
for (Element item : cell.getCompositeElements()) {
List<Chunk> chunks = item.getChunks();
if (chunks != null) {
for (Chunk chunk : chunks) {
Font font = chunk.getFont();
if (font != null) {
String name = font.getFamilyname() != null ? font.getFamilyname()
.toLowerCase() : null;
if (name != null && !name.isEmpty() && name.contains("arabic")) {
cell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
if (item instanceof Paragraph
&& ((Paragraph) item).getAlignment() == 2) {
((Paragraph) item).setAlignment(0);
}
continue;
}
}
}
}
}
}
}
}
}
}
}
and this is the code of UnicodeFontProvider.java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import com.itextpdf.text.BaseColor;
import com.itextpdf.text.Font;
import com.itextpdf.text.FontFactory;
import com.itextpdf.text.FontFactoryImp;
import com.itextpdf.text.pdf.BaseFont;
public class UnicodeFontProvider extends FontFactoryImp {
public UnicodeFontProvider() {
String root = System.getenv("SystemRoot");
FileSystems.getDefault();
Path path = Paths.get(root, "fonts");
FontFactory.getFontImp().registerDirectory(path.toString());
// TODO test, works on windows so far
}
public Font getFont(String fontname, String encoding, boolean embedded, float size, int style,
BaseColor color, boolean cached) {
if (fontname!= null && !fontname.isEmpty()) {
return new Font(Font.FontFamily.UNDEFINED, size, style, color);
}
return FontFactory.getFont(fontname, BaseFont.IDENTITY_H, BaseFont.EMBEDDED, size, style, color);
}
}
But nothing is displayed in the PDF file.
I think the problem is in this line:
parser.parse(new StringReader(htmlContentAr));
Update:
I also tried this code:
import java.io.File;
import java.io.IOException;
import com.itextpdf.text.FontProvider;
import com.itextpdf.html2pdf.ConverterProperties;
import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.html2pdf.resolver.font.DefaultFontProvider;
import com.itextpdf.io.font.FontProgram;
import com.itextpdf.io.font.FontProgramFactory;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfWriter;
public class TestHTML {
public static final String[] FONTS = {
"src/main/resources/fonts/noto/NotoSans-Regular.ttf",
"src/main/resources/fonts/noto/NotoNaskhArabic-Regular.ttf",
"src/main/resources/fonts/noto/NotoSansHebrew-Regular.ttf"
};
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
}
public void createPdf(String src, String[] fonts, String dest) throws IOException {
ConverterProperties properties = new ConverterProperties();
FontProvider fontProvider = (FontProvider) new DefaultFontProvider(false, false, false);
for (String font : fonts) {
FontProgram fontProgram = FontProgramFactory.createFont(font);
fontProvider.addFont(fontProgram);
}
properties.setFontProvider(fontProvider);
HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}
}
But I get errors that appear to be related to the jars.
Error in this line:
fontProvider.addFont(fontProgram);
The method addFont(FontProgram) is undefined for the type FontProvider
and error in :
properties.setFontProvider(fontProvider);
The method addFont(FontProgram) is undefined for the type FontProvider
also error in :
Multiple markers at this line
- The type com.itextpdf.layout.font.FontProvider cannot be resolved. It is indirectly referenced from
required .class files
- The type com.itextpdf.layout.font.FontProvider cannot be resolved. It is indirectly referenced from
required .class files
I used these jars:
kernel-7.0.0.jar, io-7.0.0.jar, html2pdf-1.0.2.jar, itextpdf-5.5.13.jar, xmlworker-5.5.9.jar
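One detail in the first `createPDF` method is worth checking independently of fonts: `finall.getBytes()` with no argument encodes with the platform default charset, which on some systems cannot represent Arabic. Below is a minimal sketch, in plain Java, of building the input stream with an explicit UTF-8 encoding. The class name `Utf8Html` is hypothetical. XMLWorkerHelper also appears to offer a `parseXHtml` overload that accepts a Charset; check the Javadoc of your version before relying on it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: always encodes HTML as UTF-8, independent of the
// platform default charset.
final class Utf8Html {

    static byte[] utf8Bytes(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    static InputStream toUtf8Stream(String html) {
        return new ByteArrayInputStream(utf8Bytes(html));
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Arabic text written with Unicode escapes so the source file
        // encoding does not matter.
        String html = "<p>\u0645\u0631\u062d\u0628\u0627</p>";
        System.out.println(decode(utf8Bytes(html)).equals(html));
    }
}
```

An explicit charset only guarantees that the characters reach the parser intact. Rendering Arabic in iText 5 additionally needs a font that contains Arabic glyphs and a right-to-left run direction, which is what the second attempt in the question tries to set up.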

Extract Only Images from PDF File in java using Apache Tika or PDFBox? [duplicate]

I'm trying to extract images from a PDF using PDFBox. The example PDF is here.
But I'm getting only blank images.
The code I'm trying:
public static void main(String[] args) {
PDFImageExtract obj = new PDFImageExtract();
try {
obj.read_pdf();
} catch (IOException ex) {
System.out.println("" + ex);
}
}
void read_pdf() throws IOException {
PDDocument document = null;
try {
document = PDDocument.load("C:\\Users\\Pradyut\\Documents\\MCS-034.pdf");
} catch (IOException ex) {
System.out.println("" + ex);
}
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
int i =1;
String name = null;
while (iter.hasNext()) {
PDPage page = (PDPage) iter.next();
PDResources resources = page.getResources();
Map pageImages = resources.getImages();
if (pageImages != null) {
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext()) {
String key = (String) imageIter.next();
PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i);
i ++;
}
}
}
}
Thanks

Here is code using PDFBox 2.0.1 that will get a list of all images from the PDF. This is different than the other code in that it will recurse through the document instead of trying to get the images from the top level.
public List<RenderedImage> getImagesFromPDF(PDDocument document) throws IOException {
List<RenderedImage> images = new ArrayList<>();
for (PDPage page : document.getPages()) {
images.addAll(getImagesFromResources(page.getResources()));
}
return images;
}
private List<RenderedImage> getImagesFromResources(PDResources resources) throws IOException {
List<RenderedImage> images = new ArrayList<>();
for (COSName xObjectName : resources.getXObjectNames()) {
PDXObject xObject = resources.getXObject(xObjectName);
if (xObject instanceof PDFormXObject) {
images.addAll(getImagesFromResources(((PDFormXObject) xObject).getResources()));
} else if (xObject instanceof PDImageXObject) {
images.add(((PDImageXObject) xObject).getImage());
}
}
return images;
}

The GetImagesFromPDF class below gets all images in the 04-Request-Headers.pdf file and saves them into the destination folder PDFCopy.
import java.io.File;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
@SuppressWarnings({ "unchecked", "rawtypes", "deprecation" })
public class GetImagesFromPDF {
public static void main(String[] args) {
try {
String sourceDir = "C:/PDFCopy/04-Request-Headers.pdf";// Paste pdf files in PDFCopy folder to read
String destinationDir = "C:/PDFCopy/";
File oldFile = new File(sourceDir);
if (oldFile.exists()) {
PDDocument document = PDDocument.load(sourceDir);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
String fileName = oldFile.getName().replace(".pdf", "_cover");
int totalImages = 1;
for (PDPage page : list) {
PDResources pdResources = page.getResources();
Map pageImages = pdResources.getImages();
if (pageImages != null) {
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext()) {
String key = (String) imageIter.next();
PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
pdxObjectImage.write2file(destinationDir + fileName+ "_" + totalImages);
totalImages++;
}
}
}
} else {
System.err.println("File not exists");
}
} catch (Exception e) {
e.printStackTrace();
}
}
}

For PDFBox 2.0.1, pudaykiran's answer must be slightly modified since some APIs have been changed.
public static void testPDFBoxExtractImages() throws Exception {
PDDocument document = PDDocument.load(new File("D:/Temp/Test.pdf"));
PDPageTree list = document.getPages();
for (PDPage page : list) {
PDResources pdResources = page.getResources();
for (COSName c : pdResources.getXObjectNames()) {
PDXObject o = pdResources.getXObject(c);
if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
File file = new File("D:/Temp/" + System.nanoTime() + ".png");
ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject)o).getImage(), "png", file);
}
}
}
}

Just add the .jpeg to the end of your path:
image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i + ".jpeg");
That works for me.

You can use the PDPage.convertToImage() function, which converts the PDF page into a BufferedImage. You can then use the BufferedImage to create an Image.
For further detail, see the Apache PDFBox 1.8.3 API documentation, which covers all PDF-related classes in PDFBox, including PDPage.
Look for the convertToImage() function in the PDPage class.

This is a Kotlin version of @Matt's answer.
fun <R> PDResources.onImageResources(block: (RenderedImage) -> (R)): List<R> =
this.xObjectNames.flatMap {
when (val xObject = this.getXObject(it)) {
is PDFormXObject -> xObject.resources.onImageResources(block)
is PDImageXObject -> listOf(block(xObject.image))
else -> emptyList()
}
}
You can use it on PDPage Resources like this:
page.resources.onImageResources { image ->
// the suffix includes the dot so the temp file gets a real extension
Files.createTempFile("image", ".xxx").also { path ->
if (!ImageIO.write(image, "xxx", path.toFile()))
throw IllegalStateException("Couldn't write image to file")
}
}
Where "xxx" is the format you need (like "jpeg")

For anyone who just wants to copy and paste ready-to-use code:
import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.UUID;
public class ExtractImagesUseCase extends PDFStreamEngine{
private final String filePath;
private final String outputDir;
// Constructor
public ExtractImagesUseCase(String filePath,
String outputDir){
this.filePath = filePath;
this.outputDir = outputDir;
}
// Execute
public void execute(){
File file = new File(filePath);
// try-with-resources closes the document even if a page fails
try(PDDocument document = PDDocument.load(file)){
for(PDPage page : document.getPages()){
processPage(page);
}
}catch(IOException e){
e.printStackTrace();
}
}
@Override
protected void processOperator(Operator operator, List<COSBase> operands) throws IOException{
String operation = operator.getName();
if("Do".equals(operation)){
COSName objectName = (COSName) operands.get(0);
PDXObject pdxObject = getResources().getXObject(objectName);
if(pdxObject instanceof PDImageXObject){
// Image
PDImageXObject image = (PDImageXObject) pdxObject;
BufferedImage bImage = image.getImage();
// File
String randomName = UUID.randomUUID().toString();
File outputFile = new File(outputDir,randomName + ".png");
// Write image to file
ImageIO.write(bImage, "PNG", outputFile);
}else if(pdxObject instanceof PDFormXObject){
PDFormXObject form = (PDFormXObject) pdxObject;
showForm(form);
}
}
else super.processOperator(operator, operands);
}
}
Demo
public class ExtractImageDemo{
public static void main(String[] args){
String filePath = "C:\\Users\\John\\Downloads\\Documents\\sample-file.pdf";
String outputDir = "C:\\Users\\John\\Downloads\\Documents\\Output";
ExtractImagesUseCase useCase = new ExtractImagesUseCase(
filePath,
outputDir
);
useCase.execute();
}
}

Instead of calling
image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i);
You can use the ImageIO.write() static method to write the RGB image out in whatever format you need. Here I've used PNG:
File outputFile = new File( "C:\\Users\\Pradyut\\Documents\\image" + i + ".png");
ImageIO.write( image.getRGBImage(), "png", outputFile);
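`ImageIO.write()` returns false instead of throwing when no writer exists for the requested format, so a failed write can pass unnoticed if the return value is ignored. The following minimal sketch wraps the call and checks the result. It uses only the JDK and a synthetic BufferedImage in place of the image obtained from PDFBox; the class name `ImageFileWriter` is hypothetical.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper: writes an image to <dir>/<baseName>.<format> and
// fails loudly when ImageIO has no writer for the format.
final class ImageFileWriter {

    static File write(BufferedImage image, File dir, String baseName, String format) {
        File out = new File(dir, baseName + "." + format);
        try {
            if (!ImageIO.write(image, format, out)) {
                throw new IllegalStateException("No ImageIO writer for format: " + format);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the image extracted from the PDF.
        BufferedImage image = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        File dir = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(write(image, dir, "image1", "png").getName());
    }
}
```

In the answer's code, the first argument would be the BufferedImage returned by `image.getRGBImage()`.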

itext Converting PDF to csv

I am trying to use the iText framework to convert a PDF file into a CSV for import into Excel.
The output is garbled, and I presume I am missing a format-conversion step, but I can't find the information on the iText site and am looking for assistance.
My current code is below.
package com.pdf.convert;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
public class ThirdPDF {
private static String INPUTFILE = "/location/test.pdf";
private static String OUTPUTFILE = "/location/test.csv";
public static void main(String[] args) throws DocumentException,
IOException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
PdfImportedPage page;
// Go through all pages
for (int i = 1; i <= n; i++) {
// Only page number 2 will be included
if (i == 2) {
page = writer.getImportedPage(reader, i);
Image instance = Image.getInstance(page);
document.add(instance);
}
}
document.close();
}
}

Converting a PDF file to a CSV file.
The directory and file creation here is based on the Android framework.
Change the path and directory to suit your own framework.
private void convertPDFToCSV(String pdfFilePath) {
String myfolder = Environment.getExternalStorageDirectory() + "/Mycsv";
if (createFolder(myfolder)) {
try {
// try-with-resources closes the output file even if extraction fails
try (FileOutputStream fos = new FileOutputStream(myfolder + "/MyCSVFile.csv")) {
StringBuilder parsedText = new StringBuilder();
PdfReader reader1 = new PdfReader(pdfFilePath);
int n = reader1.getNumberOfPages();
for (int i = 0; i < n; i++) {
// Extract the text of each page (iText page numbers start at 1)
parsedText.append(PdfTextExtractor.getTextFromPage(reader1, i + 1).trim()).append("\n");
}
reader1.close();
// Write the text as UTF-8 bytes instead of one truncated char at a time
fos.write(parsedText.toString().getBytes("UTF-8"));
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
private boolean createFolder(String myfolder) {
File f = new File(myfolder);
if (!f.exists()) {
if (!f.mkdir()) {
return false;
} else {
return true;
}
}else{
return true;
}
}
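The extracted text is still plain lines, not comma-separated values, so one more step is needed before Excel can import it as columns. The sketch below shows one possible heuristic in plain Java: treat runs of two or more whitespace characters (or a tab) as a column break and quote fields in the usual CSV way. The class name `TextToCsv` and the column-break rule are assumptions of this sketch; real PDFs often need a layout-aware extraction strategy instead.

```java
// Hypothetical helper: turns one line of extracted text into one CSV row.
final class TextToCsv {

    // Quotes a field when it contains a comma, quote, or line break,
    // doubling any embedded quotes.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Assumption: columns are separated by two or more whitespace
    // characters or by a tab; single spaces stay inside a cell.
    static String toCsvLine(String line) {
        String[] cells = line.trim().split("\\s{2,}|\\t");
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < cells.length; i++) {
            if (i > 0) {
                row.append(',');
            }
            row.append(escape(cells[i].trim()));
        }
        return row.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine("Widget, large   4   19.99"));
    }
}
```

Each line of the extracted page text could be passed through `toCsvLine` before it is written to the .csv file.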
