Apache Poi XWPF - How do we split a docx into two sections?

I have an existing document (in bytes) that I parsed into XWPFDocument using
InputStream is = new ByteArrayInputStream(docuByte);
XWPFDocument docx = new XWPFDocument(OPCPackage.open(is));
This document has at least 5 pages. I plan to set blank footers on the first two pages (title and TOC pages) and a page footer from the third page onward.
In order to do this, I understand that I need to separate the document into two different sections.
section 1 - first and second page
section 2 - third page and up
However, I could not find a method that would enable me to split the document into two sections. Would anyone know how to implement this?

There is no special method to add section breaks in XWPFDocument up to now. So one needs to use the underlying org.openxmlformats.schemas.wordprocessingml.x2006.main.* classes.
A section break in Office Open XML Word documents (*.docx) is a paragraph that has section properties set in its paragraph properties. So the need is to insert such a paragraph into the document. To insert a paragraph, XWPFDocument provides the method insertNewParagraph(org.apache.xmlbeans.XmlCursor cursor). But to get this cursor position, one needs to know where the paragraph shall be inserted. This can be an already present paragraph containing a certain text, for example.
The inserted section properties are then relevant for the section above that paragraph.
The document body also has section properties, which are relevant for the last section.
The following code shows that. It searches for a paragraph containing a certain text. Then it inserts a paragraph having section properties, which are a copy of the former last section properties, before that found paragraph. Then it removes all header/footer settings from the newly inserted section properties. After that, the section above the newly inserted paragraph has no header/footer settings, while the former header/footer settings remain for the last section.
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
public class WordInsertSectionbreak {
static org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr getDocumentBodySectPr(XWPFDocument document) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDocument1 ctDocument = document.getDocument();
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBody ctBody = ctDocument.getBody();
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr ctSectPrDocumentBody = ctBody.getSectPr();
return ctSectPrDocumentBody;
}
static org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr getNextSectPr(XWPFParagraph paragraph) {
// get the section settings of next section in document
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr ctSectPrNextSect = null;
// maybe next section settings are in a paragraph
XWPFDocument document = paragraph.getDocument();
int pos = document.getPosOfParagraph(paragraph);
for (int p = pos; p < document.getParagraphs().size(); p++) {
paragraph = document.getParagraphArray(p);
if (paragraph.getCTP().getPPr() != null) {
ctSectPrNextSect = paragraph.getCTP().getPPr().getSectPr();
}
if (ctSectPrNextSect != null) break;
}
// if not found in a paragraph, the next section settings are in the document body
if (ctSectPrNextSect == null) {
ctSectPrNextSect = getDocumentBodySectPr(document);
}
return ctSectPrNextSect;
}
static XWPFParagraph insertSectionbreak(XWPFDocument document, org.apache.xmlbeans.XmlCursor cursor) {
XWPFParagraph paragraph = null;
// insert a paragraph for section settings for new section above and section break.
paragraph = document.insertNewParagraph(cursor);
// get next section properties, which were section properties for previous section above
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr ctSectPrNextSect = getNextSectPr(paragraph);
// set a copy of section properties for previous section above as section properties for new section
if (ctSectPrNextSect != null) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr ctSectPrNewSect = (org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr)ctSectPrNextSect.copy();
paragraph.getCTP().addNewPPr().setSectPr(ctSectPrNewSect);
return paragraph;
}
return null;
}
static XWPFParagraph getParagraphByText(XWPFDocument document, String text) {
for (XWPFParagraph paragraph : document.getParagraphs()) {
String paragraphText = paragraph.getText();
if (paragraphText.contains(text)) {
return paragraph;
}
}
return null;
}
static void removeHeadersAndFooters(XWPFParagraph sectionBreakParagraph) {
if (sectionBreakParagraph == null) return;
if (sectionBreakParagraph.getCTP().getPPr() != null) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr ctSectPr = sectionBreakParagraph.getCTP().getPPr().getSectPr();
if (ctSectPr == null) return; // paragraph carries no section properties
// remove headers and footers from section, iterating backwards so the remaining indexes stay valid
for (int i = ctSectPr.getHeaderReferenceArray().length - 1; i >= 0; i--) {
ctSectPr.removeHeaderReference(i);
}
for (int i = ctSectPr.getFooterReferenceArray().length - 1; i >= 0; i--) {
ctSectPr.removeFooterReference(i);
}
}
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("./WordDocument.docx"));
XWPFParagraph paragraph = getParagraphByText(document, "Some text to mark where section break shall be inserted");
if (paragraph != null) {
XWPFParagraph sectionBreakParagraph = insertSectionbreak(document, paragraph.getCTP().newCursor());
if (sectionBreakParagraph != null) {
removeHeadersAndFooters(sectionBreakParagraph);
}
}
FileOutputStream out = new FileOutputStream("./WordDocumentResult.docx");
document.write(out);
out.close();
document.close();
}
}
The code is tested and works with Apache POI 5.2.2.
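The rule described in this answer (section properties stored in a paragraph govern the section that ends at that paragraph, and the body-level section properties govern the last section) can be modelled without POI. The following is only a sketch with plain Java collections: the class SectionModel, the method governingSectPr and the labels "no-footer" and "with-footer" are made up for illustration and are not part of the POI API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stdlib-only model, not POI: each list entry is the sectPr label attached to
// a paragraph, or null when the paragraph carries no section properties.
final class SectionModel {

    // For every paragraph, return the section properties that govern it:
    // the first sectPr found at or after the paragraph, else the body sectPr.
    static List<String> governingSectPr(List<String> paragraphSectPr, String bodySectPr) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < paragraphSectPr.size(); i++) {
            String governing = bodySectPr; // fallback: the last section
            for (int j = i; j < paragraphSectPr.size(); j++) {
                if (paragraphSectPr.get(j) != null) {
                    governing = paragraphSectPr.get(j);
                    break;
                }
            }
            result.add(governing);
        }
        return result;
    }

    public static void main(String[] args) {
        // Index 1 plays the role of the inserted section-break paragraph.
        List<String> paragraphs = Arrays.asList(null, "no-footer", null, null);
        System.out.println(governingSectPr(paragraphs, "with-footer"));
    }
}
```

In this model the first two paragraphs resolve to "no-footer" and the remaining ones to "with-footer", which mirrors what getNextSectPr looks up in the real document.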

Related

How to delete first character after table using POI

I am attempting to format a Word document that has multiple tables. I need to delete the line breaks that occur after each table. How do I achieve this programmatically in Java?
I am currently trying it with the following code, and it does not work:
org.apache.xmlbeans.XmlCursor cursor = xwpfTable.getCTTbl().newCursor();
cursor.toEndToken();
cursor.toNextToken();
cursor.removeChars(2);
Further clarification: We are receiving unformatted Word files from an external source. We need to eliminate the paragraph (the extra line in between tables) when the table has only 1 row. Currently I am using a macro and achieving this with the following code:
For Each t In doc.Tables
Set myrange = doc.Characters(t.Range.End + 1)
If myrange.Text = Chr(13) Then
myrange.Delete
End If
Next t
Thanks in advance
What I am trying to remove:

According to your screenshot, you want to remove empty paragraphs which are placed immediately after tables.
This is possible, although I am wondering why those paragraphs are there. After removing those paragraphs, the tables are no longer editable in Word as single tables but only as rows within one table. Is this what you want?
Anyway, as said, removing the empty paragraphs after the tables is possible. To do so, you could traverse the body elements of the document. If there is an XWPFTable immediately followed by an XWPFParagraph and this XWPFParagraph does not have any text runs in it, then remove that XWPFParagraph from the document.
Example:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
public class WordRemoveEmptyParagraphs {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("./WordTables.docx"));
int thisBodyElementPos = 0;
int nextBodyElementPos = 1;
IBodyElement thisBodyElement = null;
IBodyElement nextBodyElement = null;
if (document.getBodyElements().size() > 1) { // document must have at least two body elements
do {
thisBodyElement = document.getBodyElements().get(thisBodyElementPos);
nextBodyElement = document.getBodyElements().get(nextBodyElementPos);
if (thisBodyElement instanceof XWPFTable && nextBodyElement instanceof XWPFParagraph) {
XWPFParagraph paragraph = (XWPFParagraph)nextBodyElement;
if (paragraph.getRuns().size() == 0) { // if paragraph does not have any text runs in it
document.removeBodyElement(nextBodyElementPos);
}
}
thisBodyElementPos++;
nextBodyElementPos = thisBodyElementPos + 1;
} while (nextBodyElementPos < document.getBodyElements().size());
}
FileOutputStream out = new FileOutputStream("./WordTablesChanged.docx");
document.write(out);
out.close();
document.close();
}
}

Removing an XWPFParagraph keeps the paragraph symbol (¶) for it

I am trying to remove a set of contiguous paragraphs from a Microsoft Word document, using Apache POI.
From what I have understood, deleting a paragraph is possible by removing all of its runs, this way:
/*
* Deletes the given paragraph.
*/
public static void deleteParagraph(XWPFParagraph p) {
if (p != null) {
List<XWPFRun> runs = p.getRuns();
//Delete all the runs
for (int i = runs.size() - 1; i >= 0; i--) {
p.removeRun(i);
}
p.setPageBreak(false); //Remove the eventual page break
}
}
In fact, it works, but there's something strange. The block of removed paragraphs does not disappear from the document; it is converted into a set of empty lines. It's as if every paragraph were converted into a new line.
By printing the paragraphs' content from code I can see, in fact, a space (for each one removed). Looking at the content directly from the document, with the formatting mark's visualization enabled, I can see this:
The vertical column of ¶ corresponds to the block of deleted elements.
Do you have an idea for that? I'd like my paragraphs to be completely removed.
I also tried by replacing the text (with setText()) and by removing eventual spaces that could be added automatically, this way:
p.setSpacingAfter(0);
p.setSpacingAfterLines(0);
p.setSpacingBefore(0);
p.setSpacingBeforeLines(0);
p.setIndentFromLeft(0);
p.setIndentFromRight(0);
p.setIndentationFirstLine(0);
p.setIndentationLeft(0);
p.setIndentationRight(0);
But with no luck.

I would delete paragraphs by deleting paragraphs, not by deleting only the runs in these paragraphs. Deleting paragraphs is not part of the apache poi high level API. But using XWPFDocument.getDocument().getBody() we can get the low level CTBody, which has a removeP(int i) method.
Example:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import java.awt.Desktop;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
public class WordRemoveParagraph {
/*
* Deletes the given paragraph.
*/
public static void deleteParagraph(XWPFParagraph p) {
XWPFDocument doc = p.getDocument();
int pPos = doc.getPosOfParagraph(p);
//doc.getDocument().getBody().removeP(pPos);
doc.removeBodyElement(pPos);
}
public static void main(String[] args) throws IOException, InvalidFormatException {
XWPFDocument doc = new XWPFDocument(new FileInputStream("source.docx"));
int pNumber = doc.getParagraphs().size() -1;
while (pNumber >= 0) {
XWPFParagraph p = doc.getParagraphs().get(pNumber);
if (p.getParagraphText().contains("delete")) {
deleteParagraph(p);
}
pNumber--;
}
FileOutputStream out = new FileOutputStream("result.docx");
doc.write(out);
out.close();
doc.close();
System.out.println("Done");
Desktop.getDesktop().open(new File("result.docx"));
}
}
This deletes all paragraphs from the document source.docx where the text contains "delete" and saves the result in result.docx.
Edited:
Although doc.getDocument().getBody().removeP(pPos); works, it will not update the XWPFDocument's paragraphs list. So it will break paragraph iterators and other accesses to that list, since the list is only updated when the document is read again.
So the better approach is using doc.removeBodyElement(pPos); instead. removeBodyElement(int pos) does exactly the same as doc.getDocument().getBody().removeP(pos); if pos is pointing to a paragraph in the document body, since that paragraph is a BodyElement too. But in addition, it will update the XWPFDocument's paragraphs list.
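The main loop of the example walks the paragraph list from the last index down to 0 while removing entries. Why that order is safe can be seen in a small stdlib-only sketch, where a plain ArrayList of strings stands in for the document's paragraph list; the class BackwardRemoval and the method removeContaining are hypothetical names used only here, not POI API.

```java
import java.util.ArrayList;
import java.util.List;

final class BackwardRemoval {

    // Removes every entry containing the marker, walking from the last index
    // down so that a removal never shifts an index that is still to be visited.
    static List<String> removeContaining(List<String> paragraphs, String marker) {
        List<String> result = new ArrayList<>(paragraphs);
        for (int i = result.size() - 1; i >= 0; i--) {
            if (result.get(i).contains(marker)) {
                result.remove(i); // remove(int index), not remove(Object)
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of("keep", "delete me", "delete me too", "keep as well");
        System.out.println(removeContaining(paragraphs, "delete"));
    }
}
```

With a forward loop, removing index i would move the following entry into position i, and incrementing i would then skip it; two adjacent matches such as the ones above would leave one behind.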

When you are inside of a table you need to use the functions of the XWPFTableCell instead of the XWPFDocument:
cell.removeParagraph(cell.getParagraphs().indexOf(para));

Apache POI Table of contents not updating

I am using Apache POI XWPF components and java, to extract data from a .xml file into a word document. So far so good, but I am struggling to create a table of contents.
I have to create a table of contents at the start of the method and then I update it at the end to get all the new headers. Currently I use doc.createTOC(), where doc is a variable created from XWPFDocument, to create the table at the start and then I use doc.enforceUpdateFields() to update everything at the end of the document. But when I open the document after I ran the program, the table of contents is empty, but the navigation panel does include some of the headers I specified.
A comment recommended that I include some code. So I started off by creating the document from a template:
XWPFDocument doc = new XWPFDocument(new FileInputStream("D://Template.docx"));
I then create a table of contents:
doc.createTOC();
Then throughout the method I add headers to the document:
XWPFParagraph documentControlHeading = doc.createParagraph();
documentControlHeading.setPageBreak(true);
documentControlHeading.setAlignment(ParagraphAlignment.LEFT);
documentControlHeading.setStyle("Tier1Header");
After all the headers are added, I want to update the document so that all the new headers will appear in the table of contents. I do this by using the following command:
doc.enforceUpdateFields();

Hmmm... I am looking at the createTOC() method code, and it appears that it looks for styles that look like Heading #. So Tier1Header would not be found. Try creating your text first, and use styles like Heading 1 for your headings. Then add the TOC using createTOC(). It should find all the headings when the TOC is created. I do not know if enforceUpdateFields() affects the TOC.

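// Your docx template should contain the following or some similar text,
// which will be searched for and replaced with a Word TOC field:
//${TOC}
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSimpleField;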
public static void main(String[] args) throws IOException, OpenXML4JException {
XWPFDocument docTemplate = null;
try {
File file = new File(PATH_TO_FILE); //"C:\\Reports\\Template.docx";
FileInputStream fis = new FileInputStream(file);
docTemplate = new XWPFDocument(fis);
generateTOC(docTemplate);
saveDocument(docTemplate);
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (docTemplate != null) {
docTemplate.close();
}
}
}
private static void saveDocument(XWPFDocument docTemplate) throws FileNotFoundException, IOException {
FileOutputStream outputFile = null;
try {
outputFile = new FileOutputStream(OUTFILENAME);
docTemplate.write(outputFile);
} finally {
if (outputFile != null) {
outputFile.close();
}
}
}
public static void generateTOC(XWPFDocument document) throws InvalidFormatException, FileNotFoundException, IOException {
String findText = "${TOC}";
String replaceText = "";
for (XWPFParagraph p : document.getParagraphs()) {
for (XWPFRun r : p.getRuns()) {
// getTextPosition() is the vertical raise/lower offset of the run, not a text index, so read the first text element directly
String text = r.getText(0);
if (text != null && text.contains(findText)) {
text = text.replace(findText, replaceText);
r.setText(text, 0);
addField(p, "TOC \\o \"1-3\" \\h \\z \\u");
break;
}
}
}
}
private static void addField(XWPFParagraph paragraph, String fieldName) {
CTSimpleField ctSimpleField = paragraph.getCTP().addNewFldSimple();
// ctSimpleField.setInstr(fieldName + " \\* MERGEFORMAT ");
ctSimpleField.setInstr(fieldName);
ctSimpleField.addNewR().addNewT().setStringValue("<<fieldName>>");
}

This is the code of createTOC(), obtained by inspecting XWPFDocument.class:
public void createTOC() {
CTSdtBlock block = getDocument().getBody().addNewSdt();
TOC toc = new TOC(block);
for (XWPFParagraph par : this.paragraphs) {
String parStyle = par.getStyle();
if ((parStyle != null) && (parStyle.startsWith("Heading"))) try {
int level = Integer.valueOf(parStyle.substring("Heading".length())).intValue();
toc.addRow(level, par.getText(), 1, "112723803");
} catch (NumberFormatException e) {
e.printStackTrace();
}
}
}
As you can see, it adds to the TOC all paragraphs having styles named "HeadingX", with X being a number. But, unfortunately, that's not sufficient. The method, in fact, is buggy/incomplete in its implementation.
The page number passed to addRow() is always 1; it's not even calculated.
So, in the end, you will have a TOC with all your paragraphs and the trailing dots giving the proper indentation, but the page numbers will always be "1".
EDIT
...but, there's a solution here.
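The style check in the createTOC() listing can be reproduced in isolation to see which style names it accepts. The following stdlib-only sketch copies just the startsWith/substring/parse step; the class HeadingLevel and the method levelOf are hypothetical names, and returning an empty OptionalInt where the listing prints a stack trace is a choice made for this sketch.

```java
import java.util.OptionalInt;

final class HeadingLevel {

    // Mirrors the check in the createTOC() listing: the style must start with
    // "Heading" and the remainder must parse as an integer with no extra characters.
    static OptionalInt levelOf(String styleId) {
        if (styleId == null || !styleId.startsWith("Heading")) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(styleId.substring("Heading".length())));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // e.g. "Heading 1": the leading space is not trimmed
        }
    }

    public static void main(String[] args) {
        for (String style : new String[] {"Heading1", "Heading 1", "Tier1Header"}) {
            System.out.println(style + " -> " + levelOf(style));
        }
    }
}
```

Only "Heading1" yields a level here. "Heading 1" fails to parse because of the space, and a custom style such as "Tier1Header" is rejected by the prefix check, so such paragraphs would not be added as rows.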

How to set plain header in docx file using apache poi?

I would like to create a header for docx document using apache poi but I have difficulties. I have no working code to show. I would like to ask for some piece of code as starting point.

There's an Apache POI unit test that covers your very case: you're looking for TestXWPFHeader#testSetHeader(). It covers starting with a document with no headers or footers set, then adding them.
Your code would basically be something like:
XWPFHeaderFooterPolicy policy = sampleDoc.getHeaderFooterPolicy();
if (policy.getDefaultHeader() == null && policy.getFirstPageHeader() == null
&& policy.getDefaultFooter() == null) {
// Need to create some new headers
// The easy way, gives a single empty paragraph
XWPFHeader headerD = policy.createHeader(policy.DEFAULT);
headerD.getParagraphs().get(0).createRun().setText("Hello Header World!");
// Or the full control way
CTP ctP1 = CTP.Factory.newInstance();
CTR ctR1 = ctP1.addNewR();
CTText t = ctR1.addNewT();
t.setStringValue("Paragraph in header");
XWPFParagraph p1 = new XWPFParagraph(ctP1, sampleDoc);
XWPFParagraph[] pars = new XWPFParagraph[1];
pars[0] = p1;
policy.createHeader(policy.FIRST, pars);
} else {
// Already has a header, change it
}
See the XWPFHeaderFooterPolicy JavaDocs for a bit more on creating headers and footers.
It isn't the nicest, so it could ideally use some kind soul submitting a patch to make it nicer (hint hint...!), but it can work, as the unit tests show.

Based on the previous answer, just copy and paste:
public void test1() throws IOException{
XWPFDocument sampleDoc = new XWPFDocument();
XWPFHeaderFooterPolicy policy = sampleDoc.getHeaderFooterPolicy();
//in an empty document always will be null
if(policy==null){
CTSectPr sectPr = sampleDoc.getDocument().getBody().addNewSectPr();
policy = new XWPFHeaderFooterPolicy( sampleDoc, sectPr );
}
if (policy.getDefaultHeader() == null && policy.getFirstPageHeader() == null
&& policy.getDefaultFooter() == null) {
XWPFHeader headerD = policy.createHeader(policy.DEFAULT);
headerD.getParagraphs().get(0).createRun().setText("Hello Header World!");
}
FileOutputStream out = new FileOutputStream(System.currentTimeMillis()+"_test1_header.docx");
sampleDoc.write(out);
out.close();
sampleDoc.close();
}

Read .doc file content and write into pdf file in java

I'm writing Java code that uses Apache POI to read an MS Office .doc file and the iText API to create and write a PDF file. Reading the text and tables in the .doc file already works. Now I'm looking for a way to read the images contained in the document. I have written the following code to read them. Why is this code not working?
public static void main(String[] args) {
POIFSFileSystem fs = null;
Document document = new Document();
WordExtractor extractor = null ;
try {
fs = new POIFSFileSystem(new FileInputStream("C:\\DATASTORE\\tableandImage.doc"));
HWPFDocument hdocument=new HWPFDocument(fs);
extractor = new WordExtractor(hdocument);
OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/tableandImage.pdf"));
PdfWriter.getInstance(document, fileOutput);
document.open();
Range range=hdocument.getRange();
String readText=null;
PdfPTable createTable;
CharacterRun run;
PicturesTable picture;
for(int i=0;i<range.numParagraphs();i++) {
Paragraph par = range.getParagraph(i);
readText=par.text();
if(!par.isInTable()) {
if(readText.endsWith("\n")) {
readText=readText+"\n";
document.add(new com.itextpdf.text.Paragraph(readText));
} if(readText.endsWith("\r")) {
readText += "\n";
document.add(new com.itextpdf.text.Paragraph(readText));
}
run =range.getCharacterRun(i);
picture=hdocument.getPicturesTable();
if(picture.hasPicture(run)) {
//if(run.isSpecialCharacter()) {
Picture pic=picture.extractPicture(run, true);
byte[] picturearray=pic.getContent();
com.itextpdf.text.Image image=com.itextpdf.text.Image.getInstance(picturearray);
document.add(image);
}
} else if (par.isInTable()) {
Table table = range.getTable(par);
TableRow tRow1= table.getRow(0);
int numColumns=tRow1.numCells();
createTable=new PdfPTable(numColumns);
for (int rowId=0;rowId<table.numRows();rowId++) {
TableRow tRow = table.getRow(rowId);
for (int cellId=0;cellId<tRow.numCells();cellId++) {
TableCell tCell = tRow.getCell(cellId);
PdfPCell c1 = new PdfPCell(new Phrase(tCell.text()));
createTable.addCell(c1);
}
}
document.add(createTable);
}
}
}catch(IOException e) {
System.out.println("IO Exception");
e.printStackTrace();
}
catch(Exception exep) {
exep.printStackTrace();
}finally {
document.close();
}
}
The problems are:
1. The condition if(picture.hasPicture(run)) is never satisfied, although the document contains a JPEG image.
2. I'm getting the following exception while reading the table:
java.lang.IllegalArgumentException: This paragraph is not the first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:876)
at pagecode.ReadDocxOrDocFile.main(ReadDocxOrDocFile.java:113)
Can anybody help me solve the problem?
Thank you.

Regarding your exception:
Your code iterates over all paragraphs and calls isInTable() for each one of them. Since tables are commonly composed of several such paragraphs, your call to getTable() also gets executed several times for a single table.
However, what your code should do instead is to find the first paragraph of a table, then process all paragraphs therein (via getRow(m).getCell(n)) and ultimately continue with the outer loop in the first paragraph after the table. Codewise this may look roughly like the following (assuming no merged cells, no nested tables and no other funny edge cases):
if (par.isInTable()) {
Table table = range.getTable(par);
for (int rn=0; rn<table.numRows(); rn++) {
TableRow row = table.getRow(rn);
for (int cn=0; cn<row.numCells(); cn++) {
TableCell cell = row.getCell(cn);
for (int pn=0; pn<cell.numParagraphs(); pn++) {
Paragraph cellParagraph = cell.getParagraph(pn);
// your PDF conversion code goes here
}
}
}
i += table.numParagraphs()-1; // skip the already processed (table-)paragraphs in the outer loop
}
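The index arithmetic in the last line of that snippet (advance the outer loop by the number of table paragraphs minus one) can be checked with a stdlib-only model. In this sketch each list entry is a table label, or null for a paragraph outside any table, and the length of a run of equal labels stands in for table.numParagraphs(); the class TableSkip and the method visit are hypothetical names, not HWPF API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TableSkip {

    // Visits each non-table paragraph once and each table once, however many
    // paragraphs the table spans.
    static List<String> visit(List<String> tableLabels) {
        List<String> log = new ArrayList<>();
        for (int i = 0; i < tableLabels.size(); i++) {
            String label = tableLabels.get(i);
            if (label == null) {
                log.add("paragraph " + i);
                continue;
            }
            int n = 1; // stand-in for table.numParagraphs()
            while (i + n < tableLabels.size() && label.equals(tableLabels.get(i + n))) {
                n++;
            }
            log.add("table " + label + " (" + n + " paragraphs)");
            i += n - 1; // the loop's own i++ then lands on the first paragraph after the table
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList(null, "A", "A", "A", null, "B", "B");
        visit(labels).forEach(System.out::println);
    }
}
```

Without the skip, the table branch would run again for the second and third paragraph of table A, which is the situation in which getTable() throws in the question.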
Regarding the pictures issue:
Am I guessing right that you are trying to obtain the picture which is anchored within a given paragraph? Unfortunately, the predefined methods of POI only work if the picture is not embedded within a field (which is rather rare, actually). For field-based images (i.e. preview images of embedded OLEs) you should do something like the following (untested!):
PictureStore pictureStore = new PictureStore(hdocument);
// bla bla ...
for (int cr=0; cr < par.numCharacterRuns(); cr++) {
CharacterRun characterRun = par.getCharacterRun(cr);
Field field = hdocument.getFields().getFieldByStartOffset(FieldsDocumentPart.MAIN, characterRun.getStartOffset());
if (field != null && field.getType() == 0x3A) { // 0x3A is type "EMBED"
Picture pic = pictureStore.getPicture(field.secondSubrange(characterRun));
}
}
For a list of possible values of Field.getType() see here.
