Split PDF by delimiter? - java

I have invoices combined in a single pdf file. Some invoices are half a page in size, while some invoices are larger than one page. How can I separate all these invoices as separate files by using static texts at the beginning of each invoice as separators? Or I can use a different method you suggest.
Sample file.

You need to extend the Splitter.splitAtPage method to indicate where you want to split your PDF.
Here is a working example:
public class PdfBoxSplitter {
private static String DELIMITER = "Efactory Inc";
public static void main(String[] args) throws IOException {
File file = new File("document.pdf");
try (PDDocument document = PDDocument.load(file)) {
// First find the list of pages where we need to split the PDF
List<Integer> splitPages = new ArrayList<>();
for (int page = 1; page <= document.getNumberOfPages(); page++) {
PDFTextStripper pdfStripper = new PDFTextStripper();
pdfStripper.setStartPage(page);
pdfStripper.setEndPage(page);
String parsedText = pdfStripper.getText(document);
if (parsedText.contains(DELIMITER)) splitPages.add(page - 1);
}
// Instantiate the custom splitter
Splitter splitter = new Splitter() {
protected boolean splitAtPage(int pageNumber) {
return splitPages.contains(pageNumber);
}
};
// Split the document and save each part
List<PDDocument> docs = splitter.split(document);
int cpt = 1;
for (PDDocument doc : docs) {
File f = new File("Document" + (cpt++) + ".pdf");
doc.save(f);
}
}
}
}

Related

How to remove/replace a specific text in pdf, If the text to replace is drawn using multiple instructions following each other?

I've tried the following code. It works fine, but only for limited cases like if the text is added with a single instruction. How to do this if the text is added in multiple instructions. Can anyone help me out with this?
for (PDPage page : document.getDocumentCatalog().getPages()) {
PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
final StringBuilder recentChars = new StringBuilder();
#Override
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector displacement)
throws IOException {
String string = font.toUnicode(code);
if (string != null)
recentChars.append(string);
super.showGlyph(textRenderingMatrix, font, code, displacement);
}
#Override
protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
String recentText = recentChars.toString();
recentChars.setLength(0);
String operatorString = operator.getName();
if (TEXT_SHOWING_OPERATORS.contains(operatorString) && "Text which is to be replace".equals(recentText))
{
return;
}
super.write(contentStreamWriter, operator, operands);
}
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};
editor.processPage(page);
}
document.save("watermark-RemoveByText.pdf");```
You can use pdfSweep (iText 7 add-on) that removes or redacts information from a PDF document.
For more information see https://itextpdf.com/en/products/itext-7/pdf-redaction-pdfsweep
The code which removes "Text which is to be replace" text from PDF document on Java with using of PdfSweep looks like this:
try (PdfDocument pdf = new PdfDocument(new PdfReader("Path to source file"), new PdfWriter("Path to out file"))) {
ICleanupStrategy cleanupStrategy = new RegexBasedCleanupStrategy("Text which is to be replace")
.setRedactionColor(ColorConstants.WHITE);
PdfCleaner.autoSweepCleanUp(pdf, cleanupStrategy);
}

Merge two PDF pages into one showing different halves of each

I have a folder of single paged PDF files where I want to merge pairs of two into a new single paged PDF so that always the lower half of each page is not shown. It contains useless information that's why I need to blank it.
My code is now this but I get it wrong because I need to rotate only one of the input page. But now it seems there are both pages rotated:
(If I could just move one half it would be great, too)
private void clearLowerHalfOfPage(PDDocument doc) throws IOException {
PDPage page = doc.getPage(0);
PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true);
contentStream.setNonStrokingColor(Color.WHITE);
contentStream.addRect(0, 0, 600, 440);
contentStream.fill();
contentStream.close();
}
public void testMerge() throws IOException {
File path = new File("/Users/me/Documents/PDF/");
List<File> files = Arrays.asList(path.listFiles());
for (int i = 0; i < files.size() - 1; i += 2) {
PDDocument doc1 = PDDocument.load(files.get(i));
PDDocument doc2 = PDDocument.load(files.get(i + 1));
clearLowerHalfOfPage(doc2);
clearLowerHalfOfPage(doc1);
PDPage page = doc2.getPage(0);
page.setRotation(180);
Map<Integer, PDDocument> mapping = new HashMap<>();
mapping.put(1, doc1);
Overlay overlayObj = new Overlay();
overlayObj.setInputPDF(doc2);
PDDocument pdDocument = overlayObj.overlayDocuments(mapping);
pdDocument.save(new File(path, "x.pdf"));
pdDocument.close();
overlayObj.close();
doc1.close();
doc2.close();
}
}
How can I achieve this construction?

Edit header in dotx/docx file

I am currently trying to generate a new docx file from an existing template in dotx format. I want to change the firstname, lastname, etc. in the header but I'm not able to access them for some reason...
My approach is the following:
public void generateDocX(Long id) throws IOException, InvalidFormatException {
//Get user per id
EmployeeDTO employeeDTO = employeeService.getEmployee(id);
//Location where the new docx file will be saved
FileOutputStream outputStream = new FileOutputStream(new File("/home/user/Documents/project/src/main/files/" + employeeDTO.getId() + "header.docx"));
//Get the template for generating the new docx file
File template = new File("/home/user/Documents/project/src/main/files/template.dotx");
OPCPackage pkg = OPCPackage.open(template);
XWPFDocument document = new XWPFDocument(pkg);
for (XWPFHeader header : document.getHeaderList()) {
List<XWPFParagraph> paragraphs = header.getParagraphs();
System.out.println("Total paragraphs in header are: " + paragraphs.size());
System.out.println("Total elements in the header are: " + header.getBodyElements().size());
for (XWPFParagraph paragraph : paragraphs) {
System.out.println("Paragraph text is: " + paragraph.getText());
List<XWPFRun> runs = paragraph.getRuns();
for (XWPFRun run : runs) {
String runText = run.getText(run.getTextPosition());
System.out.println("Run text is: " + runText);
}
}
}
//Write the changes to the new docx file and close the document
document.write(outputStream);
document.close();
}
The output in the console is either 1, null or empty string... I've tried several approaches from here, here and here but without any luck...
Here is what's inside the template.dotx
IBody.getParagraphs and IBody.getBodyElements- only get the paragraphs or body elements which are directly in that IBody. But your paragraphs are not directly in there but are in a separate text box or text frame. That's why they cannot be got this way.
Since *.docx is a ZIP archive containijg XML files for document, headers and footers, one could get all text runs of one IBody by creating a XmlCursor which selects all w:r XML elements. For a XWPFHeader this could look like so:
private List<XmlObject> getAllCTRs(XWPFHeader header) {
CTHdrFtr ctHdrFtr = header._getHdrFtr();
XmlCursor cursor = ctHdrFtr.newCursor();
cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:r");
List<XmlObject> ctrInHdrFtr = new ArrayList<XmlObject>();
while (cursor.hasNextSelection()) {
cursor.toNextSelection();
XmlObject obj = cursor.getObject();
ctrInHdrFtr.add(obj);
}
return ctrInHdrFtr;
}
Now we have a list of all XML elements in that header which are text-run-elements in Word.
We could have a more general getAllCTRs which gets all CTR elements from any kind of IBody like so:
private List<XmlObject> getAllCTRs(IBody iBody) {
XmlCursor cursor = null;
List<XmlObject> ctrInIBody = new ArrayList<XmlObject>();
if (iBody instanceof XWPFHeaderFooter) {
XWPFHeaderFooter headerFooter = (XWPFHeaderFooter)iBody;
CTHdrFtr ctHdrFtr = headerFooter._getHdrFtr();
cursor = ctHdrFtr.newCursor();
} else if (iBody instanceof XWPFDocument) {
XWPFDocument document = (XWPFDocument)iBody;
CTDocument1 ctDocument1 = document.getDocument();
cursor = ctDocument1.newCursor();
} else if (iBody instanceof XWPFAbstractFootnoteEndnote) {
XWPFAbstractFootnoteEndnote footEndnote = (XWPFAbstractFootnoteEndnote)iBody;
CTFtnEdn ctFtnEdn = footEndnote.getCTFtnEdn();
cursor = ctFtnEdn.newCursor();
}
if (cursor != null) {
cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:r");
while (cursor.hasNextSelection()) {
cursor.toNextSelection();
XmlObject obj = cursor.getObject();
ctrInIBody.add(obj);
}
}
return ctrInIBody ;
}
Now we have a list of all XML elements in that IBody which are text-run-elements in Word.
Having that we can get the text out of them like so:
private void printAllTextInTextRunsOfIBody(IBody iBody) throws Exception {
List<XmlObject> ctrInIBody = getAllCTRs(iBody);
for (XmlObject obj : ctrInIBody) {
CTR ctr = CTR.Factory.parse(obj.xmlText());
for (CTText ctText : ctr.getTList()) {
String text = ctText.getStringValue();
System.out.println(text);
}
}
}
This probably shows the next challenge. Because Word is very messy in creating text-run-elements. For example your placeholder <<Firstname>> can be split into text-runs << + Firstname + >>. The reason migt be different formatting or spell checking or something else. Even this is possible: << + Lastname + >>; << + YearOfBirth + >>. Or even this: <<Firstname + >> << + Lastname>>; << + YearOfBirth>>. You see, replacing the placeholders with text is nearly impossible because the placeholders may be split into multiple tex-runs.
To avoid this the template.dotx needs to be created from users who know what they are doing.
At first turn spell check off. Grammar check as well. If not, all found possible spell errors or grammar violations are in separate text-runs to mark them accordingly.
Second make sure the whole placeholder is eaqual formatted. Different formatted text also must be in separate text-runs.
I am really skeptic that this will work properly. But try it yourself.
Complete example:
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlCursor;
import java.util.List;
import java.util.ArrayList;
public class WordEditAllIBodys {
private List<XmlObject> getAllCTRs(IBody iBody) {
XmlCursor cursor = null;
List<XmlObject> ctrInIBody = new ArrayList<XmlObject>();
if (iBody instanceof XWPFHeaderFooter) {
XWPFHeaderFooter headerFooter = (XWPFHeaderFooter)iBody;
CTHdrFtr ctHdrFtr = headerFooter._getHdrFtr();
cursor = ctHdrFtr.newCursor();
} else if (iBody instanceof XWPFDocument) {
XWPFDocument document = (XWPFDocument)iBody;
CTDocument1 ctDocument1 = document.getDocument();
cursor = ctDocument1.newCursor();
} else if (iBody instanceof XWPFAbstractFootnoteEndnote) {
XWPFAbstractFootnoteEndnote footEndnote = (XWPFAbstractFootnoteEndnote)iBody;
CTFtnEdn ctFtnEdn = footEndnote.getCTFtnEdn();
cursor = ctFtnEdn.newCursor();
}
if (cursor != null) {
cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:r");
while (cursor.hasNextSelection()) {
cursor.toNextSelection();
XmlObject obj = cursor.getObject();
ctrInIBody.add(obj);
}
}
return ctrInIBody ;
}
private void printAllTextInTextRunsOfIBody(IBody iBody) throws Exception {
List<XmlObject> ctrInIBody = getAllCTRs(iBody);
for (XmlObject obj : ctrInIBody) {
CTR ctr = CTR.Factory.parse(obj.xmlText());
for (CTText ctText : ctr.getTList()) {
String text = ctText.getStringValue();
System.out.println(text);
}
}
}
private void replaceTextInTextRunsOfIBody(IBody iBody, String placeHolder, String textValue) throws Exception {
List<XmlObject> ctrInIBody = getAllCTRs(iBody);
for (XmlObject obj : ctrInIBody) {
CTR ctr = CTR.Factory.parse(obj.xmlText());
for (CTText ctText : ctr.getTList()) {
String text = ctText.getStringValue();
if (text != null && text.contains(placeHolder)) {
text = text.replace(placeHolder, textValue);
ctText.setStringValue(text);
obj.set(ctr);
}
}
}
}
public void generateDocX() throws Exception {
FileOutputStream outputStream = new FileOutputStream(new File("./" + 1234 + "header.docx"));
//Get the template for generating the new docx file
File template = new File("./template.dotx");
XWPFDocument document = new XWPFDocument(new FileInputStream(template));
//traverse all headers
for (XWPFHeader header : document.getHeaderList()) {
printAllTextInTextRunsOfIBody(header);
replaceTextInTextRunsOfIBody(header, "<<Firstname>>", "Axel");
replaceTextInTextRunsOfIBody(header, "<<Lastname>>", "Richter");
replaceTextInTextRunsOfIBody(header, "<<ProfessionalTitle>>", "Skeptic");
}
//traverse all footers
for (XWPFFooter footer : document.getFooterList()) {
printAllTextInTextRunsOfIBody(footer);
replaceTextInTextRunsOfIBody(footer, "<<Firstname>>", "Axel");
replaceTextInTextRunsOfIBody(footer, "<<Lastname>>", "Richter");
replaceTextInTextRunsOfIBody(footer, "<<ProfessionalTitle>>", "Skeptic");
}
//traverse document body; note: tables needs not be traversed separately because they are in document body
printAllTextInTextRunsOfIBody(document);
replaceTextInTextRunsOfIBody(document, "<<Firstname>>", "Axel");
replaceTextInTextRunsOfIBody(document, "<<Lastname>>", "Richter");
replaceTextInTextRunsOfIBody(document, "<<ProfessionalTitle>>", "Skeptic");
//traverse all footnotes
for (XWPFFootnote footnote : document.getFootnotes()) {
printAllTextInTextRunsOfIBody(footnote);
replaceTextInTextRunsOfIBody(footnote, "<<Firstname>>", "Axel");
replaceTextInTextRunsOfIBody(footnote, "<<Lastname>>", "Richter");
replaceTextInTextRunsOfIBody(footnote, "<<ProfessionalTitle>>", "Skeptic");
}
//traverse all endnotes
for (XWPFEndnote endnote : document.getEndnotes()) {
printAllTextInTextRunsOfIBody(endnote);
replaceTextInTextRunsOfIBody(endnote, "<<Firstname>>", "Axel");
replaceTextInTextRunsOfIBody(endnote, "<<Lastname>>", "Richter");
replaceTextInTextRunsOfIBody(endnote, "<<ProfessionalTitle>>", "Skeptic");
}
//since document was opened from *.dotx the content type needs to be changed
document.getPackage().replaceContentType(
"application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml");
//Write the changes to the new docx file and close the document
document.write(outputStream);
outputStream.close();
document.close();
}
public static void main(String[] args) throws Exception {
WordEditAllIBodys app = new WordEditAllIBodys();
app.generateDocX();
}
}
Btw.: Since your document was opened from *.dotx the content type needs to be changed from wordprocessingml.template to wordprocessingml.document. Else Word will not open the resulting *.docx document. See Converting a file with ".dotx" extension (template) to "docx" (Word File).
As I am skeptic about the replacing-placeholder-text-approach, my preferred way is filling forms. See Problem with processing word document java. Of course such form fields cannot be used in header or footer. So headers or footers schould be created form scratch at whole.

Apache POI Table of contents not updating

I am using Apache POI XWPF components and java, to extract data from a .xml file into a word document. So far so good, but I am struggling to create a table of contents.
I have to create a table of contents at the start of the method and then I update it at the end to get all the new headers. Currently I use doc.createTOC(), where doc is a variable created from XWPFDocument, to create the table at the start and then I use doc.enforceUpdateFields() to update everything at the end of the document. But when I open the document after I ran the program, the table of contents is empty, but the navigation panel does include some of the headers I specified.
A comment recommended that I include some code. So i started off by create the document from a template:
XWPFDocument doc = new XWPFDocument(new FileInputStream("D://Template.docx"));
I then create a table of contents:
doc.createTOC();
Then throughout the method I add headers to the document:
XWPFParagraph documentControlHeading = doc.createParagraph();
documentControlHeading.setPageBreak(true);
documentControlHeading.setAlignment(ParagraphAlignment.LEFT);
documentControlHeading.setStyle("Tier1Header");
After all the headers are added, I want to update the document so that all the new headers will appear in the table of contents. I do this buy using the following command:
doc.enforceUpdateFields();
Hmmm... I am looking at the createTOC() method code, and it appears that it looks for styles that look like Heading #. So Tier1Header would not be found. Try creating your text first, and use styles like Heading 1 for your headings. Then add the TOC using createTOC(). It should find all the headings when the TOC is created. I do not know if enforceUpdateFields() affects the TOC.
//Your docx template should contain the following or something similar text //which will be searched for and replaced with a WORD TOC.
//${TOC}
public static void main(String[] args) throws IOException, OpenXML4JException {
XWPFDocument docTemplate = null;
try {
File file = new File(PATH_TO_FILE); //"C:\\Reports\\Template.docx";
FileInputStream fis = new FileInputStream(file);
docTemplate = new XWPFDocument(fis);
generateTOC(docTemplate);
saveDocument(docTemplate);
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (docTemplate != null) {
docTemplate.close();
}
}
}
private static void saveDocument(XWPFDocument docTemplate) throws FileNotFoundException, IOException {
FileOutputStream outputFile = null;
try {
outputFile = new FileOutputStream(OUTFILENAME);
docTemplate.write(outputFile);
} finally {
if (outputFile != null) {
outputFile.close();
}
}
}
public static void generateTOC(XWPFDocument document) throws InvalidFormatException, FileNotFoundException, IOException {
String findText = "${TOC}";
String replaceText = "";
for (XWPFParagraph p : document.getParagraphs()) {
for (XWPFRun r : p.getRuns()) {
int pos = r.getTextPosition();
String text = r.getText(pos);
if (text != null && text.contains(findText)) {
text = text.replace(findText, replaceText);
r.setText(text, 0);
addField(p, "TOC \\o \"1-3\" \\h \\z \\u");
break;
}
}
}
}
private static void addField(XWPFParagraph paragraph, String fieldName) {
CTSimpleField ctSimpleField = paragraph.getCTP().addNewFldSimple();
// ctSimpleField.setInstr(fieldName + " \\* MERGEFORMAT ");
ctSimpleField.setInstr(fieldName);
ctSimpleField.addNewR().addNewT().setStringValue("<<fieldName>>");
}
This is the code of createTOC(), obtained by inspecting XWPFDocument.class:
public void createTOC() {
CTSdtBlock block = getDocument().getBody().addNewSdt();
TOC toc = new TOC(block);
for (XWPFParagraph par : this.paragraphs) {
String parStyle = par.getStyle();
if ((parStyle != null) && (parStyle.startsWith("Heading"))) try {
int level = Integer.valueOf(parStyle.substring("Heading".length())).intValue();
toc.addRow(level, par.getText(), 1, "112723803");
} catch (NumberFormatException e) {
e.printStackTrace();
}
}
}
As you can see, it adds to the TOC all paragraphs having styles named "HeadingX", with X being a number. But, unfortunately, that's not sufficent. The method, in fact, is bugged/uncomplete in its implementation.
The page number passed to addRow() is always 1, it's not even calculated.
So, at the end, you will have a TOC with all your paragraphs and the trailing dots giving the proper indentation, but the pages will be always equal to "1".
EDIT
...but, there's a solution here.

How to Convert splitted pdf to Excel file using java pdfbox

Im new to this PDFBOX. I hve one pdf file, which contain 60 pages. Im using Apache PDFBox-app-1.8.10. jar splitting up the PDF files.
public class SplitDemo {
public static void main(String[] args) throws IOException {
JButton open = new JButton();
JFileChooser fc = new JFileChooser();
fc.setCurrentDirectory(new java.io.File("C:/Users"));
fc.setDialogTitle("Select PDF");
if(fc.showOpenDialog(open)== JFileChooser.APPROVE_OPTION)
{
}
String a = null;
a = fc.getSelectedFile().getAbsolutePath();
PDDocument document = new PDDocument();
document = PDDocument.load(a);
// Create a Splitter object
Splitter splitter = new Splitter();
// We need this as split method returns a list
List<PDDocument> listOfSplitPages;
// We are receiving the split pages as a list of PDFs
listOfSplitPages = splitter.split(document);
// We need an iterator to iterate through them
Iterator<PDDocument> iterator = listOfSplitPages.listIterator();
// I am using variable i to denote page numbers.
int i = 1;
while(iterator.hasNext()){
PDDocument pd = iterator.next();
try{
// Saving each page with its assumed page no.
pd.save("G://PDFCopy/Page " + i++ + ".pdf");
} catch (COSVisitorException anException){
// Something went wrong with a PDF object
System.out.println("Something went wrong with page " + (i-1) + "\n Here is the error message" + anException);
}
}
}
}
**In PDFCopy Folder i hve list of pdf files. How can I convert all pdf to excel format and need to save it in the target folder. i am fully confused in this conversion. **

Categories

Resources