PDFBox search for text on specific page in new PDF

I'm looking for a way to check every page of my new PDF for a specific string.
The idea is to go through every page and, if the project name is missing from a page, add it before saving the PDF with doc.save(new FileOutputStream(new File(pathToFile))).
I already tried:
document.save(new FileOutputStream(new File(pathToFile)));
// Reload the file that was just saved
document = PDDocument.load(new File(pathToFile));
// PDFTextStripper page numbers are 1-based, so the loop must include the last page
for (int i = 1; i <= document.getNumberOfPages(); i++) {
    PDFTextStripper reader = new PDFTextStripper();
    reader.setStartPage(i);
    reader.setEndPage(i);
    String pageText = reader.getText(document);
    System.out.println(pageText);
}
The result is "Hello World", which is correct.
However, this only works if the document has already been saved and then loaded again.
In my case, the document has not been saved yet:
// PDFTextStripper page numbers are 1-based, so the loop must include the last page
for (int i = 1; i <= document.getNumberOfPages(); i++) {
    PDFTextStripper reader = new PDFTextStripper();
    reader.setStartPage(i);
    reader.setEndPage(i);
    String pageText = reader.getText(document);
    System.out.println(pageText);
}
The result is an empty string.

I could not find a way to extract the text before saving the document, so I tried a different approach.
int oldPagesCount = document.getNumberOfPages();
addTableInformation(informationToAdd); // List<String>; may add pages automatically
if (oldPagesCount < document.getNumberOfPages()) {
    // Pages were auto-generated, so add the project name/number to each new page
    for (int i = oldPagesCount; i < document.getNumberOfPages(); i++) {
        page = document.getPage(i); // getPage() is 0-based
        addProjectInfo(project);    // assumed to write onto the current page
    }
}
In this case, if the table information spills onto multiple pages, the code goes through every newly added page and adds the project information. I hope this helps anyone who needs to do something similar.
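
For the original idea (checking each page for the project name), the decision step itself can be sketched without any PDF library. This is only a sketch: it assumes the per-page text has already been collected somehow, for example with one PDFTextStripper call per page, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MissingProjectName {
    /**
     * Returns the 1-based numbers of the pages whose text does not contain
     * projectName. pageTexts.get(0) is assumed to hold the text of page 1.
     */
    static List<Integer> pagesMissing(List<String> pageTexts, String projectName) {
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            // A null or empty page text counts as "name missing"
            if (text == null || !text.contains(projectName)) {
                missing.add(i + 1);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical page texts, standing in for text stripper output
        List<String> pages = Arrays.asList("Project X-42\nHello World", "Hello World", "");
        System.out.println(pagesMissing(pages, "Project X-42")); // prints [2, 3]
    }
}
```

The returned page numbers are 1-based to match PDFTextStripper, so subtract 1 before passing them to document.getPage().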

Related

Java PDFbox 2.0.25 - After copying fillable fields to another page, browser can see the copied field values but not Acrobat

I have a form-filling program that takes a PDF template with rows of fillable fields, plus JSON data for the field values, and populates those values into the form. If there are more rows than fit on one page, the page is duplicated and the extra rows are added to the duplicate page. The page is deep cloned, then the T values (field names) of the page annotations are changed, the annotations are pointed at the new page, and new fields are created for the new page. When I open the exported PDF in Chrome I can see the field values on the new page, but not in Acrobat. The PDF can be re-saved from Chrome and then opened in Acrobat with the values, but this strips away the fillable fields, and I don't want to make users do a workaround for what is likely developer error, since I don't have an in-depth understanding of the PDF spec.
Below is my code for duplicating the page. I'm hoping someone more knowledgeable about PDFs can identify what I'm doing wrong so that I can fix it. I haven't included the other helper functions called at the end, but they essentially find the correct PDField instance and call setValue() on it.
private void createContent(PDDocument document, JSONObject jsonObj) throws IOException {
    final PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
    final JSONArray rowsArr = (JSONArray)jsonObj.get("rows");
    final int pages = (int)Math.ceil(rowsArr.size() / 34.0);
    if (pages > 1)
    {
        // Add additional pages
        final PDFCloneUtility cloner = new PDFCloneUtility(document);
        final PDPage oldPage = document.getPage(0);
        for (int i=1; i < pages; i++)
        {
            final COSDictionary dupPageDict = (COSDictionary)cloner.cloneForNewDocument(oldPage);
            final PDPage dupPage = new PDPage(dupPageDict);
            final List<PDAnnotation> dupAnnoList = dupPage.getAnnotations();
            for (PDAnnotation anno : dupAnnoList)
            {
                final COSDictionary annoDict = anno.getCOSObject();
                final String oldTStr = annoDict.getString(COSName.T); // Field name, ex: INC0
                // Change annotation to new page and create a field for it
                if (oldTStr.endsWith(String.valueOf(i-1)))
                {
                    // Change page link and add to dupPage
                    anno.setPage(dupPage);
                    dupPage.getAnnotations().add(anno);
                    // Update anno name for new page
                    final String dupTStr = oldTStr.substring(0, oldTStr.length() - 1) + i; // ex: INC1
                    annoDict.setItem(COSName.T, new COSString(dupTStr));
                    annoDict.setItem(COSName.AP, null);
                    // All the fields should be text fields.
                    COSBase ftBase = annoDict.getItem(COSName.FT);
                    if (ftBase instanceof COSName && COSName.TX == ftBase)
                    {
                        final PDTextField oldField = (PDTextField) acroForm.getField(oldTStr);
                        if (oldField != null)
                        {
                            // Create new field for anno
                            final PDTextField dupField = new PDTextField(acroForm);
                            dupField.setPartialName(dupTStr);
                            dupField.setDefaultAppearance(oldField.getDefaultAppearance());
                            if (anno instanceof PDAnnotationWidget)
                            {
                                dupField.getWidgets().add((PDAnnotationWidget) anno);
                                acroForm.getFields().add(dupField);
                            }
                        }
                    }
                }
            }
            // Append dupPage before instructions page (which is at the end of the doc)
            final PDPageTree pgTree = document.getDocumentCatalog().getPages();
            pgTree.insertBefore(dupPage, pgTree.get(pgTree.getCount() - 1));
        }
    }
    PDFont font = PDType1Font.HELVETICA;
    PDResources resources = new PDResources();
    resources.put(COSName.getPDFName("Helv"), font);
    acroForm.setDefaultResources(resources);
    for (int i = 0; i < pages; i++)
    {
        addHeaderInfo(acroForm, jsonObj, i);
        addMainInfo(acroForm, rowsArr, i);
        addFooterInfo(acroForm, jsonObj, i);
    }
}
Example output file: https://www.mediafire.com/file/7zu7xxo2fflpdnw/example_out_73130945.pdf/file
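
A side note on the arithmetic in the code above, not a fix for the Acrobat problem: the page count is a ceiling division by 34 rows, and the renaming step drops exactly one trailing character, which only works while the page index is a single digit. A minimal sketch of both steps follows. It assumes that all trailing digits of a field name are the page index (as in INC0 and INC1); the class and method names are hypothetical.

```java
final class FieldNaming {
    // Rows per page, taken from the question's code
    static final int ROWS_PER_PAGE = 34;

    /** Integer ceiling division: pages needed to hold rowCount rows. */
    static int pagesNeeded(int rowCount) {
        return (rowCount + ROWS_PER_PAGE - 1) / ROWS_PER_PAGE;
    }

    /** Replaces all trailing digits of fieldName with pageIndex. */
    static String forPage(String fieldName, int pageIndex) {
        int start = fieldName.length();
        // Walk backwards over the numeric suffix, however many digits it has
        while (start > 0 && Character.isDigit(fieldName.charAt(start - 1))) {
            start--;
        }
        if (start == fieldName.length()) {
            throw new IllegalArgumentException("No numeric suffix: " + fieldName);
        }
        return fieldName.substring(0, start) + pageIndex;
    }

    public static void main(String[] args) {
        System.out.println(pagesNeeded(35));      // prints 2
        System.out.println(forPage("INC0", 1));   // prints INC1
        System.out.println(forPage("INC10", 11)); // prints INC11
    }
}
```

With the single-character substring approach, renaming INC10 for page 11 would produce INC111, so this only matters once a document needs more than ten pages.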

org.apache.pdfbox.pdfparser.NonSequentialPDFParser checkXrefOffsets SEVERE: Can't find the object 8 0 (origin offset 0)

We are using PDFBox 1.8.8 to extract text from PDF files in our application. PDFBox logs the error "org.apache.pdfbox.pdfparser.NonSequentialPDFParser checkXrefOffsets SEVERE: Can't find the object 8 0 (origin offset 0)" but does not throw an exception, so we cannot handle the problem. The extracted text then comes back garbled, as if in an unknown language. We cannot upgrade PDFBox because the application does not have adequate test coverage.
Here is the code:
PDDocument pdDocument = null;
File tmpfile = File.createTempFile(String.format("txttmp-%s", UUID.randomUUID().toString()), null);
// pdDocument= PDDocument.load(new FileInputStream(sourceFile),true);
pdDocument = PDDocument.loadNonSeq(sourceFile, new RandomAccessFile(tmpfile, "rw"));
PDFTextStripper pdfTextStripper = new PDFTextStripper();
int pages = pdDocument.getNumberOfPages();
for (int page = 1; page <= pages; page++) {
    // Set up the text stripper to grab just one page worth of text
    pdfTextStripper.setSortByPosition(true);
    pdfTextStripper.setStartPage(page);
    pdfTextStripper.setEndPage(page);
    String pageText = pdfTextStripper.getText(pdDocument);
}

How do I save all the pages of a PDDocument in separate .pdf files?

I want to save each page of a PDDocument in a separate PDF file.
I programmed it like this:
int numberOfPages = pdDocument.getNumberOfPages();
for (int i = 0; i < numberOfPages; i++) {
    PDDocument pageDocument = new PDDocument();
    PDPage page = pdDocument.getPage(i);
    pageDocument.addPage(page); // PDDocument has addPage(), not add()
    // Backslashes must be escaped in Java string literals
    pageDocument.save("c:\\temp\\page" + (i + 1) + ".pdf");
}
Is this the correct way to do it? Do I have to create a new PDDocument each time and add the page to it, or is there a better way to save the pages of a PDDocument individually?
To be more clear:
I want to save each page in a PDDocument in separate pdfs.
So, if I have a PDDocument with 25 pages in it, I want to save each page in a separate pdf.
Like this:
-page1.pdf
-page2.pdf
-page3.pdf
...
-page25.pdf
I'm just wondering if I have to make a new PDDocument object for each page to save it to a pdf.

Please try (untested):
PDDocument pageDocument = new PDDocument();
for (int i = 0; i < pdDocument.getNumberOfPages(); i++) {
    pageDocument.addPage(pdDocument.getPage(i));
}
pageDocument.save("c:\\temp\\page");
It should be possible to add multiple pages to a PDDocument.

Use Splitter to split the PDDocument into separate PDDocuments. A simple example:
Splitter splitter = new Splitter();
splitter.setStartPage(start); // page to start from
splitter.setEndPage(end); // page to end at
List<PDDocument> splittedItems = splitter.split(doc); // doc is the original document
for (int index = 0; index < splittedItems.size(); index++) {
    splittedItems.get(index).save("destination to save");
}
In your case it would be:
int numberOfPages = pdDocument.getNumberOfPages();
Splitter splitter = new Splitter();
splitter.setStartPage(1);
splitter.setEndPage(numberOfPages); // page to end at
List<PDDocument> splittedItems = splitter.split(pdDocument);
for (int index = 0; index < splittedItems.size(); index++) {
    // Backslashes must be escaped in Java string literals
    splittedItems.get(index).save("c:\\temp\\page" + (index + 1) + ".pdf");
}
For more information, see the documentation:
https://pdfbox.apache.org/docs/2.0.3/javadocs/org/apache/pdfbox/multipdf/Splitter.html
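
A note on the output names: in a Java string literal a backslash starts an escape sequence, so Windows paths need doubled backslashes. Building the names with java.nio.file avoids the issue altogether. A minimal sketch follows; the class name, method name, and the "out" directory are made up for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class PageFiles {
    /** Returns dir/page<N>.pdf, where N is the 1-based number of a 0-based index. */
    static Path forIndex(Path dir, int zeroBasedIndex) {
        return dir.resolve("page" + (zeroBasedIndex + 1) + ".pdf");
    }

    public static void main(String[] args) {
        Path dir = Paths.get("out"); // hypothetical output directory
        for (int i = 0; i < 3; i++) {
            // Prints page1.pdf, page2.pdf, page3.pdf on separate lines
            System.out.println(forIndex(dir, i).getFileName());
        }
    }
}
```

The result can then be handed to the save call, for example save(PageFiles.forIndex(dir, index).toFile()).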

Concatenate tagged PDF files using the older version iText 4.2.0

I am maintaining code that uses the older iText 4.2. I am trying to merge multiple tagged PDF files into one using the following code:
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileOutputStream(RESULT));
document.open();
PdfReader reader;
int n;
for (int i = 0; i < files.length; i++) {
    reader = new PdfReader(files[i]);
    n = reader.getNumberOfPages();
    // iText page numbers are 1-based; ++page advances the loop
    for (int page = 0; page < n; ) {
        copy.addPage(copy.getImportedPage(reader, ++page));
    }
    copy.freeReader(reader);
    reader.close();
}
document.close();
However, the tags are not copied over at all. Because this is an old version, the overload getImportedPage(reader, ++page, true) with the third boolean parameter is not available.
Is it impossible to achieve this with this version of iText? Note that I cannot upgrade iText to a newer version.
Thanks for any help!

Identify hidden text Word 2003/2007 using Apache POI

I am converting Word (2003 and 2007) documents to HTML. I have managed to read the text, formatting, etc. from the Word document, but the document contains some hidden text, such as 'Header Change History', which should not be displayed on the page. Is there any way to identify hidden text in a Word document?
Any help would be much appreciated.

I am not sure if this is a complete (or even accurate) solution, but for the files in the DOCX format, it seems that you can check if a character run is hidden by
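XWPFRun cr; // a run taken from an XWPFParagraph
// getRPr() returns null when the run has no run properties
if (cr.getCTR().getRPr() != null && cr.getCTR().getRPr().getVanish() != null) {
    // it is hidden
}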
Got this from reverse-engineering the XML, and at least in my usage it seems to work. Would be very glad for additional (more informed) input, and a way to do the same thing in the old binary file format.

The following code snippet helps identify whether the text is hidden:
POIFSFileSystem fs = null;
boolean isHidden = false;
try {
    fs = new POIFSFileSystem(new FileInputStream(filesname));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    String[] paragraphs = we.getParagraphText();
    System.out.println("Word Document has " + paragraphs.length
            + " paragraphs");
    Range range = doc.getRange();
    for (int k = 0; k < range.numParagraphs(); k++) {
        org.apache.poi.hwpf.usermodel.Paragraph paragraph = range
                .getParagraph(k);
        for (int j = 0; j < paragraph.numCharacterRuns(); j++) {
            org.apache.poi.hwpf.usermodel.CharacterRun cr = paragraph
                    .getCharacterRun(j);
            if (cr.isVanished()) {
                // it is hidden
                System.out.println("text is hidden ");
                isHidden = true;
                break; // leaves only the inner loop; later paragraphs are still checked
            }
        }
    }
} catch (IOException e) {
    // The original snippet ends before its catch block; this one is a placeholder
    e.printStackTrace();
}
