Extracted pdf text is not getting displayed in console

I am trying to extract text from a PDF using Tabula. The code compiles without errors, but when I run it, the extracted text is not displayed in the console. Could someone help?
I have been using PDFBox, and after some research I found that Tabula is newer and wanted to try it.
File file = new File(pdfFilePath);
PDDocument document = PDDocument.load(file);
ObjectExtractor oe = new ObjectExtractor(document);
Page page = oe.extract(1); // 1st page
TextStripper textStripper = new TextStripper(document,1);
System.out.println(textStripper.getText(document));

You are not using the page variable. Try the following code.
File file = new File(pdfFilePath);
PDDocument document = PDDocument.load(file);
ObjectExtractor oe = new ObjectExtractor(document);
Page page = oe.extract(1); // 1st page
for (TextElement textElement : page.getText()) {
    System.out.print(textElement.getText());
}

Related

Apache POI convert HTML/XHTML to DOC/DOCX

I need to convert HTML to a .doc file. The HTML is filled with custom information, and the images and CSS change depending on what is requested.
I'm trying to use Apache POI for this, but I'm getting this error:
org.apache.poi.xwpf.converter.core.XWPFConverterException: java.lang.IllegalStateException: Expecting one Styles document part, but found 0
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.convert(XHTMLConverter.java:72)
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.doConvert(XHTMLConverter.java:58)
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.doConvert(XHTMLConverter.java:38)
at org.apache.poi.xwpf.converter.core.AbstractXWPFConverter.convert(AbstractXWPFConverter.java:45)
My code is this:
// Load the HTML file
//Doc file
String htmlFile = "pathToHtml/file.html";
//String htmlFile = parseHTMLTemplate(disputeLetterDetails, template, fileExtension);
//new File(htmlFile);
//File file = new FileReader(htmlFile);
Path path = Path.of(htmlFile);
OutputStream in = new FileOutputStream(htmlFile, true);
// Create a new XWPFDocument
XWPFDocument document = new XWPFDocument();
// Set up the XHTML options
XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(new File("./images/")));
options.setExtractor(new FileImageExtractor(new File("./images/")));
// Convert the HTML to XWPFDocument
XHTMLConverter.getInstance().convert(document, in, options);
// Save the document to a .docx file
FileOutputStream out = new FileOutputStream("pathToHtml/OUT_from_XHTML_TEST.docx");
document.write(out);
out.close();
I want to get a .docx file from an HTML file with the same styles, but I keep getting the error shown above.

unable to read pdf document using selenium web driver

I am writing code to read a PDF file in Selenium using a Java PDF library.
My code is:
URL url = new URL(str);
InputStream is = url.openStream();
BufferedInputStream fileParse = new BufferedInputStream(is);
PDDocument document = null;
document = PDDocument.load(fileParse);
String pdfContent = new PDFTextStripper().getText(document);
But I get an error at the line document = PDDocument.load(fileParse), with this stack trace:
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1119)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2017)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:1988)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:269)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1143)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1040)
I need to verify the content of the PDF file.
I'd appreciate any help.
Thanks

You can simply use the code below; it works:
// Load an existing document
File file = new File("yourPdfFilepath");
PDDocument document = PDDocument.load(file);
// Instantiate the PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
// Retrieve the text from the PDF document
String pdfcontent = pdfStripper.getText(document);
I hope this helps.
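The stack trace in the question fails inside parseHeader, which suggests the bytes read from the URL did not begin with a PDF header (for example, an empty response or an HTML page returned instead of the file). A minimal sketch of a check that could be run on the stream before handing it to PDDocument.load is shown below. It uses only the Java standard library; the class and method names are hypothetical and not part of PDFBox or Selenium, and the check is deliberately strict (a lenient parser may accept files that do not start exactly with the marker).

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of PDFBox or Selenium.
final class PdfHeaderCheck {

    // A PDF file normally begins with this ASCII marker, e.g. "%PDF-1.7".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes begin with the "%PDF-" marker.
    static boolean startsWithPdfMagic(byte[] prefix) {
        if (prefix == null || prefix.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (prefix[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Peeks at the first bytes of the stream, then rewinds it so the same
    // stream can still be passed on to a PDF parser afterwards.
    static boolean looksLikePdf(BufferedInputStream in) throws IOException {
        in.mark(PDF_MAGIC.length);
        byte[] prefix = new byte[PDF_MAGIC.length];
        int read = 0;
        while (read < prefix.length) {
            int n = in.read(prefix, read, prefix.length - read);
            if (n < 0) {
                break; // stream ended before a full header could be read
            }
            read += n;
        }
        in.reset();
        return read == prefix.length && startsWithPdfMagic(prefix);
    }
}
```

With the question's variables, PdfHeaderCheck.looksLikePdf(fileParse) could be called just before PDDocument.load(fileParse); a false result would point at the download step rather than at the PDF parsing.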

The PDF has been Split using PDF Box library and the resultant PDF has almost same size as source PDF file

I'm using the code below to split a large PDF into two separate PDFs. The split itself works: the first PDF contains the first 2 pages of the source, and the second PDF contains the remaining pages.
The problem is the size. The source PDF is 17 MB, and the two generated PDFs are about 15 MB each. Logically they should be smaller. I searched the forum and found suggestions that PDFont has to be used properly. I haven't used PDFont here, so I'm not sure whether I'm doing something wrong.
public static void main(String[] args) throws IOException, COSVisitorException {
    File input = new File("sourceFile.pdf");
    // pdPage and pdPage1 will be used to get the first and second page of the entire PDF
    // pdPageMedRec will get the rest of the pages
    PDPage pdPage = null;
    PDPage pdPage1 = null;
    PDPage pdPageMedRec = null;
    PDDocument firstOutputDocument = null;
    PDDocument secondOutputDocument = null;
    PDDocument inputDocument = PDDocument.loadNonSeq(input, null);
    List<PDPage> list = inputDocument.getDocumentCatalog().getAllPages();
    // I wanted two documents to be generated from the big PDF
    // firstOutputDocument is document 1 and will have the first 2 pages of the big PDF
    // secondOutputDocument is document 2 and will have the rest of the pages
    firstOutputDocument = new PDDocument();
    secondOutputDocument = new PDDocument();
    // Taking first page and second page
    pdPage = list.get(0);
    pdPage1 = list.get(1);
    // Appending them as one document
    firstOutputDocument.importPage(pdPage);
    firstOutputDocument.importPage(pdPage1);
    // Looping over the rest of the pages
    for (int page = 3; page <= inputDocument.getNumberOfPages(); ++page) {
        pdPageMedRec = (PDPage) inputDocument.getDocumentCatalog().getAllPages().get(page - 1);
        // append page to current document
        secondOutputDocument.importPage(pdPageMedRec);
    }
    // Saving first document
    File f = new File("document1.pdf");
    firstOutputDocument.save(f);
    firstOutputDocument.close();
    // Saving second document
    File g = new File("document2.pdf");
    secondOutputDocument.save(g);
    secondOutputDocument.close();
    inputDocument.close();
}
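The page selection in the code above (first two pages into one document, everything else into the other) can be expressed independently of PDFBox as a split of a list at a fixed index. The following is only a sketch of that indexing, assuming pages are held in a plain java.util.List; the PageSplit class and splitAt method are hypothetical names, and the sketch does not address the file-size problem itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; works on any list, not on PDFBox types specifically.
final class PageSplit {

    // Splits pages into two lists: indexes [0, headCount) and [headCount, size).
    // headCount is clamped so that short inputs do not throw.
    static <T> List<List<T>> splitAt(List<T> pages, int headCount) {
        int cut = Math.max(0, Math.min(headCount, pages.size()));
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(pages.subList(0, cut)));
        parts.add(new ArrayList<>(pages.subList(cut, pages.size())));
        return parts;
    }
}
```

Calling PageSplit.splitAt(list, 2) on the page list from the question would give the two groups that are then imported into firstOutputDocument and secondOutputDocument, without the 1-based to 0-based conversion in the loop.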

How to copy/move AcroForm fields from one document to new blank one using IText5 or IText7?

I need to copy the whole AcroForm, including field positions and values, from a template PDF to a new blank PDF file. How can I do that?
In short, I need to get rid of the "background" from the template and leave only the form fields.
The whole point of this is to create a PDF with content that will be printed on pre-printed templates.
I am using iText 5, but I can switch to 7 if useful examples are provided.

After a lot of trial and error I have found a solution to "How to copy AcroForm fields into another PDF". It is an iText 7 version. I hope it will help somebody someday.
private byte[] copyFormElements(byte[] sourceTemplate) throws IOException {
    PdfReader completeReader = new PdfReader(new ByteArrayInputStream(sourceTemplate));
    PdfDocument completeDoc = new PdfDocument(completeReader);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    PdfWriter offsetWriter = new PdfWriter(out);
    PdfDocument offsetDoc = new PdfDocument(offsetWriter);
    offsetDoc.initializeOutlines();
    PdfPage blank = offsetDoc.addNewPage();
    PdfAcroForm originalForm = PdfAcroForm.getAcroForm(completeDoc, false);
    // originalForm.getPdfObject().copyTo(offsetDoc,false);
    PdfAcroForm offsetForm = PdfAcroForm.getAcroForm(offsetDoc, true);
    for (String name : originalForm.getFormFields().keySet()) {
        PdfFormField field = originalForm.getField(name);
        PdfDictionary copied = field.getPdfObject().copyTo(offsetDoc, false);
        PdfFormField copiedField = PdfFormField.makeFormField(copied, offsetDoc);
        offsetForm.addField(copiedField, blank);
    }
    offsetDoc.close();
    completeDoc.close();
    return out.toByteArray();
}

Did you check the PdfCopyForms class? It is described as follows:
Allows you to add one (or more) existing PDF document(s) to create a new PDF and add the form of another PDF document to this new PDF.
I didn't find an example, but you could try something like this:
PdfReader reader1 = new PdfReader(src1); // a document with a form
PdfReader reader2 = new PdfReader(src2); // a document without a form
PdfCopyForms copy = new PdfCopyForms(new FileOutputStream(dest));
copy.addDocument(reader2); // add the document without the form
copy.copyDocumentFields(reader1); // add the fields of the document with the form
copy.close();
reader1.close();
reader2.close();
I see that the class is deprecated. I'm not sure if that's because iText 7 makes it much easier to do this, or because there were technical problems with the class.

How to add text watermark to pdf in Java using Apache PDFBox?

I can't find any tutorial for adding a text watermark to a PDF file. Can you please guide me? I am very new to PDFBox.
It's not a duplicate; the link in the comment didn't help me. I want to add text, not an image, to the PDF.

Here is an example using PDFBox 2.0.2. It loads a PDF and writes some text at the page origin (the bottom left corner on a typical page) in a semi-transparent red font. If the PDF has multiple pages, the watermark appears on every page. It might not be production ready, as I am not sure whether additional null checks are needed, but it should get you going in the right direction.
Keep in mind that this block of code does not modify the original PDF; it creates a new PDF named Tmp_(filename) as the output.
private static void watermarkPDF(File fileStored) throws IOException {
    // Output goes to Tmp_<filename> next to the original, which is left untouched
    File tmpPDF = new File(fileStored.getParent(), "Tmp_" + fileStored.getName());
    try (PDDocument doc = PDDocument.load(fileStored)) {
        for (PDPage page : doc.getPages()) {
            // Append to the existing page content; close the stream once per page
            try (PDPageContentStream cs = new PDPageContentStream(doc, page, AppendMode.APPEND, true, true)) {
                String ts = "Some sample text";
                PDFont font = PDType1Font.HELVETICA_BOLD;
                float fontSize = 14.0f;
                // 50% opacity for the watermark text
                PDExtendedGraphicsState r0 = new PDExtendedGraphicsState();
                r0.setNonStrokingAlphaConstant(0.5f);
                cs.setGraphicsStateParameters(r0);
                cs.setNonStrokingColor(255, 0, 0); // Red
                cs.beginText();
                cs.setFont(font, fontSize);
                cs.setTextMatrix(Matrix.getTranslateInstance(0f, 0f));
                cs.showText(ts);
                cs.endText();
            }
        }
        doc.save(tmpPDF);
    }
}
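The translation (0f, 0f) in the code above puts the text at the page origin. If the watermark should sit against the right edge instead, the x offset has to be derived from the page width and the rendered text width. Below is a minimal sketch of that arithmetic only, assuming the text width is given in glyph units (thousandths of the font size), which is the unit PDFBox's PDFont.getStringWidth returns. The WatermarkPosition class, its method names, and the margin parameter are hypothetical.

```java
// Hypothetical helper for illustration; plain arithmetic, no PDFBox types involved.
final class WatermarkPosition {

    // Converts a width in glyph units (1/1000 of the font size) to points.
    static float textWidthInPoints(float glyphUnits, float fontSize) {
        return glyphUnits / 1000f * fontSize;
    }

    // x coordinate at which the text must start so that it ends
    // `margin` points before the right edge of the page.
    static float rightAlignedX(float pageWidth, float glyphUnits, float fontSize, float margin) {
        return pageWidth - textWidthInPoints(glyphUnits, fontSize) - margin;
    }
}
```

In the watermark method, the glyph-unit width would come from font.getStringWidth(ts) and the page width from page.getMediaBox().getWidth(); the result could then replace the first 0f passed to Matrix.getTranslateInstance.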
