How to delete first character after table using POI - java

I am attempting to format a Word document that has multiple tables. I need to delete the line breaks that occur after each table. How do I achieve this programmatically in Java?
I am currently trying it with the following code, and it does not work:
org.apache.xmlbeans.XmlCursor cursor = xwpfTable.getCTTbl().newCursor();
cursor.toEndToken();
cursor.toNextToken();
cursor.removeChars(2);
Further clarification: We are receiving unformatted Word files from an external source. We need to eliminate the paragraph (the extra line between tables) when the table has only one row. Currently I am using a macro that achieves this with the following code:
For Each t In doc.Tables
Set myrange = doc.Characters(t.Range.End + 1)
If myrange.Text = Chr(13) Then
myrange.Delete
End If
Thanks in advance
What I am trying to remove:

According to your screenshot, you want to remove the empty paragraphs that are placed immediately after tables.
This is possible, although I wonder why those paragraphs are there. After removing them, the tables are no longer editable in Word as single tables but only as rows within one table. Is this what you want?
Anyway, as said, removing the empty paragraphs after the tables is possible. To do so, you could traverse the body elements of the document. If an XWPFTable is immediately followed by an XWPFParagraph and this XWPFParagraph does not have any text runs in it, then remove that XWPFParagraph from the document.
Example:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
public class WordRemoveEmptyParagraphs {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("./WordTables.docx"));
int thisBodyElementPos = 0;
int nextBodyElementPos = 1;
IBodyElement thisBodyElement = null;
IBodyElement nextBodyElement = null;
if (document.getBodyElements().size() > 1) { // document must have at least two body elements
do {
thisBodyElement = document.getBodyElements().get(thisBodyElementPos);
nextBodyElement = document.getBodyElements().get(nextBodyElementPos);
if (thisBodyElement instanceof XWPFTable && nextBodyElement instanceof XWPFParagraph) {
XWPFParagraph paragraph = (XWPFParagraph)nextBodyElement;
if (paragraph.getRuns().size() == 0) { // if paragraph does not have any text runs in it
document.removeBodyElement(nextBodyElementPos);
}
}
thisBodyElementPos++;
nextBodyElementPos = thisBodyElementPos + 1;
} while (nextBodyElementPos < document.getBodyElements().size());
}
FileOutputStream out = new FileOutputStream("./WordTablesChanged.docx");
document.write(out);
out.close();
document.close();
}
}

Related

Apache Poi XWPF - How do we split a docx into two sections?

I have an existing document (in bytes) that I parsed into XWPFDocument using
InputStream is = new ByteArrayInputStream(docuByte);
XWPFDocument docx = new XWPFDocument(OPCPackage.open(is));
This document has at least 5 pages. I am planning to set blank footers on first two pages (title and TOC page), and a page footer from third page and up.
In order to do this, I understand that I need to separate the document into two different sections.
section 1 - first and second page
section 2 - third page and up
However, I could not find a method that would enable me to split the document into two sections. Would anyone know how to implement this?
There is no special method to add section breaks in XWPFDocument up to now. So one needs to use the underlying org.openxmlformats.schemas.wordprocessingml.x2006.main.* classes.
A section break in Office Open XML Word documents (*.docx) is a paragraph that has section properties set in its paragraph properties. So the need is to insert such a paragraph into the document. To insert a paragraph, XWPFDocument provides the method insertNewParagraph(org.apache.xmlbeans.XmlCursor cursor). But to get this cursor position, one needs to know where the paragraph shall be inserted. This can be an already present paragraph containing a certain text, for example.
The inserted section properties are then relevant for the section above that paragraph.
The document body also has section properties which are relevant for the last section.
The following code shows that. It searches for a paragraph containing a certain text. Then it inserts a paragraph having section properties, which are a copy of the former last section properties, before the found paragraph. Then it removes all header/footer settings from the newly inserted section properties. After that, the section above the newly inserted paragraph has no header/footer settings, while the former header/footer settings remain for the last section.
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
public class WordInsertSectionbreak {
static org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr getDocumentBodySectPr(XWPFDocument document) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDocument1 ctDocument = document.getDocument();
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBody ctBody = ctDocument.getBody();
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr ctSectPrDocumentBody = ctBody.getSectPr();
return ctSectPrDocumentBody;
}
static org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr getNextSectPr(XWPFParagraph paragraph) {
// get the section settings of next section in document
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr ctSectPrNextSect = null;
// maybe next section settings are in a paragraph
XWPFDocument document = paragraph.getDocument();
int pos = document.getPosOfParagraph(paragraph);
for (int p = pos; p < document.getParagraphs().size(); p++) {
paragraph = document.getParagraphArray(p);
if (paragraph.getCTP().getPPr() != null) {
ctSectPrNextSect = paragraph.getCTP().getPPr().getSectPr();
}
if (ctSectPrNextSect != null) break;
}
// if not in a paragraph, next section settings are in document body
if (ctSectPrNextSect == null) {
ctSectPrNextSect = getDocumentBodySectPr(document);
}
return ctSectPrNextSect;
}
static XWPFParagraph insertSectionbreak(XWPFDocument document, org.apache.xmlbeans.XmlCursor cursor) {
XWPFParagraph paragraph = null;
// insert a paragraph for section settings for new section above and section break.
paragraph = document.insertNewParagraph(cursor);
// get next section properties, which were section properties for previous section above
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr ctSectPrNextSect = getNextSectPr(paragraph);
// set a copy of section properties for previous section above as section properties for new section
if (ctSectPrNextSect != null) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr ctSectPrNewSect = (org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr)ctSectPrNextSect.copy();
paragraph.getCTP().addNewPPr().setSectPr(ctSectPrNewSect);
return paragraph;
}
return null;
}
static XWPFParagraph getParagraphByText(XWPFDocument document, String text) {
for (XWPFParagraph paragraph : document.getParagraphs()) {
String paragraphText = paragraph.getText();
if (paragraphText.contains(text)) {
return paragraph;
}
}
return null;
}
static void removeHeadersAndFooters(XWPFParagraph sectionBreakParagraph) {
if (sectionBreakParagraph == null) return;
if (sectionBreakParagraph.getCTP().getPPr() != null) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr ctSectPr = sectionBreakParagraph.getCTP().getPPr().getSectPr();
// remove headers and footers from section
for (int i = ctSectPr.getHeaderReferenceArray().length-1; i >= 0; i--) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTHdrFtrRef ctHdrFtrRef = ctSectPr.getHeaderReferenceArray(i);
ctSectPr.removeHeaderReference(i);
}
for (int i = ctSectPr.getFooterReferenceArray().length-1; i >= 0; i--) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTHdrFtrRef ctHdrFtrRef = ctSectPr.getFooterReferenceArray(i);
ctSectPr.removeFooterReference(i);
}
}
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("./WordDocument.docx"));
XWPFParagraph paragraph = getParagraphByText(document, "Some text to mark where section break shall be inserted");
if (paragraph != null) {
XWPFParagraph sectionBreakParagraph = insertSectionbreak(document, paragraph.getCTP().newCursor());
if (sectionBreakParagraph != null) {
removeHeadersAndFooters(sectionBreakParagraph);
}
}
FileOutputStream out = new FileOutputStream("./WordDocumentResult.docx");
document.write(out);
out.close();
document.close();
}
}
Code is tested and works using current apache poi 5.2.2.

How to compute =SUM(Above) function in docx using apache poi

I am trying to work with Apache POI for a docx file and I am stuck at using formulas in a table. For instance, see the image:
I tried setting the text to "=SUM(ABOVE)" but it doesn't work this way.
I think I might need to set custom XML data here, but I am not sure how to proceed. I tried the following piece of code:
XWPFTable table = document.createTable();
//create first row
XWPFTableRow tableRowOne = table.getRow(0);
table.getRow(0).createCell();
table.getRow(0).getCell(0).setText("10");
table.getRow(0).createCell();
table.getRow(0).getCell(1).setText("=SUM(ABOVE)");
What I do in case of such requirements is as follows:
First, I create the simplest possible Word document having the required things in it using the Word GUI. Then I have a look at what Word has created to get an idea of what needs to be created using apache poi.
Concretely, here:
Create the simplest possible table in Word which has a field {=SUM(ABOVE)} in it. Save that as *.docx. Now unzip that *.docx (Office Open XML files like *.docx are simply ZIP archives). Have a look at /word/document.xml in that archive. There you will find something like:
<w:tc>
<w:p>
<w:fldSimple w:instr="=SUM(ABOVE)"/>
...
</w:p>
</w:tc>
This is XML for a table cell having a paragraph having a fldSimple element in it where instr attribute contains the formula.
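The inspection step can also be done from Java instead of unzipping by hand. The following is a minimal sketch that uses only the JDK (java.util.zip and the built-in DOM parser), not Apache POI. The class name DocxFieldScan and the method simpleFieldInstructions are made up for this illustration. It lists the instr attribute of every fldSimple element found in word/document.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DocxFieldScan {
    // WordprocessingML main namespace (the "w:" prefix in document.xml)
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the w:instr value of every w:fldSimple in word/document.xml. */
    static List<String> simpleFieldInstructions(InputStream docx) throws Exception {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) continue;
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // readAllBytes() stops at the end of the current zip entry
                byte[] xmlBytes = zip.readAllBytes();
                Document xml = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmlBytes));
                NodeList fields = xml.getElementsByTagNameNS(W_NS, "fldSimple");
                for (int i = 0; i < fields.getLength(); i++) {
                    result.add(((Element) fields.item(i)).getAttributeNS(W_NS, "instr"));
                }
                break; // there is only one main document part to inspect
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a *.docx file saved by Word
        try (InputStream in = new FileInputStream(args[0])) {
            simpleFieldInstructions(in).forEach(System.out::println);
        }
    }
}
```

Fields that Word stored in the complex form (w:fldChar together with w:instrText) are not found by this sketch; it only looks for fldSimple elements.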
Now we know that we need the table cell XWPFTableCell and the XWPFParagraph in it. Then we need to set a fldSimple element in this paragraph whose instr attribute contains the formula.
This would be as simple as
paragraphInCell.getCTP().addNewFldSimple().setInstr("=SUM(ABOVE)");
But of course something must tell Word to calculate the formula when the document opens. The simplest solution is to mark the field "dirty". This makes Word update the field while opening the document. It also leads to a confirmation dialog about the need for updating.
Complete example using apache poi 4.1.0:
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSimpleField;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STOnOff;
public class CreateWordTableSumAbove {
public static void main(String[] args) throws Exception {
XWPFDocument document= new XWPFDocument();
XWPFParagraph paragraph = document.createParagraph();
XWPFRun run=paragraph.createRun();
run.setText("The table:");
//create the table
XWPFTable table = document.createTable(4,3);
table.setWidth("100%");
for (int row = 0; row < 3; row++) {
for (int col = 0; col < 3; col++) {
if (col < 2) table.getRow(row).getCell(col).setText("row " + row + ", col " + col);
else table.getRow(row).getCell(col).setText("" + ((row + 1) * 1234));
}
}
//set Sum row
table.getRow(3).getCell(0).setText("Sum:");
//get paragraph from cell where the sum field shall be contained
XWPFParagraph paragraphInCell = null;
if (table.getRow(3).getCell(2).getParagraphs().size() == 0) paragraphInCell = table.getRow(3).getCell(2).addParagraph();
else paragraphInCell = table.getRow(3).getCell(2).getParagraphs().get(0);
//set sum field in
CTSimpleField sumAbove = paragraphInCell.getCTP().addNewFldSimple();
sumAbove.setInstr("=SUM(ABOVE)");
//set sum field dirty, so it must be calculated while opening the document
sumAbove.setDirty(STOnOff.TRUE);
paragraph = document.createParagraph();
FileOutputStream out = new FileOutputStream("create_table.docx");
document.write(out);
out.close();
document.close();
}
}
All of this only works properly when the document is opened using Microsoft Word. LibreOffice Writer is not able to store such formula fields in Office Open XML (*.docx) format, nor is it able to read such Office Open XML formula fields properly.

Removing an XWPFParagraph keeps the paragraph symbol (¶) for it

I am trying to remove a set of contiguous paragraphs from a Microsoft Word document, using Apache POI.
From what I have understood, deleting a paragraph is possible by removing all of its runs, this way:
/*
* Deletes the given paragraph.
*/
public static void deleteParagraph(XWPFParagraph p) {
if (p != null) {
List<XWPFRun> runs = p.getRuns();
//Delete all the runs
for (int i = runs.size() - 1; i >= 0; i--) {
p.removeRun(i);
}
p.setPageBreak(false); //Remove the eventual page break
}
}
In fact, it works, but there's something strange. The block of removed paragraphs does not disappear from the document; it is converted into a set of empty lines. It's as if every paragraph were converted into a new line.
By printing the paragraphs' content from code I can see, in fact, a space (for each one removed). Looking at the content directly from the document, with the formatting mark's visualization enabled, I can see this:
The vertical column of ¶ corresponds to the block of deleted elements.
Do you have an idea for that? I'd like my paragraphs to be completely removed.
I also tried replacing the text (with setText()) and removing any spacing that might be added automatically, this way:
p.setSpacingAfter(0);
p.setSpacingAfterLines(0);
p.setSpacingBefore(0);
p.setSpacingBeforeLines(0);
p.setIndentFromLeft(0);
p.setIndentFromRight(0);
p.setIndentationFirstLine(0);
p.setIndentationLeft(0);
p.setIndentationRight(0);
But with no luck.
I would delete paragraphs by deleting paragraphs, not by deleting only the runs in these paragraphs. Deleting paragraphs is not part of the apache poi high level API. But using XWPFDocument.getDocument().getBody() we can get the low level CTBody, and there is a removeP(int i).
Example:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import java.awt.Desktop;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
public class WordRemoveParagraph {
/*
* Deletes the given paragraph.
*/
public static void deleteParagraph(XWPFParagraph p) {
XWPFDocument doc = p.getDocument();
int pPos = doc.getPosOfParagraph(p);
//doc.getDocument().getBody().removeP(pPos);
doc.removeBodyElement(pPos);
}
public static void main(String[] args) throws IOException, InvalidFormatException {
XWPFDocument doc = new XWPFDocument(new FileInputStream("source.docx"));
int pNumber = doc.getParagraphs().size() -1;
while (pNumber >= 0) {
XWPFParagraph p = doc.getParagraphs().get(pNumber);
if (p.getParagraphText().contains("delete")) {
deleteParagraph(p);
}
pNumber--;
}
FileOutputStream out = new FileOutputStream("result.docx");
doc.write(out);
out.close();
doc.close();
System.out.println("Done");
Desktop.getDesktop().open(new File("result.docx"));
}
}
This deletes all paragraphs from the document source.docx where the text contains "delete" and saves the result in result.docx.
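The main loop above walks from the last paragraph to the first. A minimal sketch of why that direction is convenient, using a plain java.util.List<String> as a stand-in for the paragraph list (the class BackwardRemovalSketch and its method are hypothetical and do not use Apache POI):

```java
import java.util.ArrayList;
import java.util.List;

class BackwardRemovalSketch {
    /** Removes every element containing 'marker', walking from the last index to the first. */
    static void removeMatching(List<String> paragraphs, String marker) {
        for (int i = paragraphs.size() - 1; i >= 0; i--) {
            if (paragraphs.get(i).contains(marker)) {
                // Removing index i only shifts elements after i, which were already visited
                paragraphs.remove(i);
            }
        }
    }

    public static void main(String[] args) {
        List<String> paragraphs =
            new ArrayList<>(List.of("keep", "delete me", "delete me too", "keep as well"));
        removeMatching(paragraphs, "delete");
        System.out.println(paragraphs); // prints [keep, keep as well]
    }
}
```

Walking forward with an index that always increments would skip the element that slides into the position just removed, which matters here because two matching entries are adjacent.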
Edited:
Although doc.getDocument().getBody().removeP(pPos); works, it will not update the XWPFDocument's paragraphs list. So it will break paragraph iterators and other accesses to that list, since the list is only updated when the document is read again.
So the better approach is using doc.removeBodyElement(pPos); instead. removeBodyElement(int pos) does exactly the same as doc.getDocument().getBody().removeP(pos); if pos points to a paragraph in the document body, since that paragraph is a BodyElement too. But in addition, it will update the XWPFDocument's paragraphs list.
When you are inside of a table you need to use the functions of the XWPFTableCell instead of the XWPFDocument:
cell.removeParagraph(cell.getParagraphs().indexOf(para));

How to edit docx using Java

I need to replace certain words or phrases in a docx file and save it under another name. I know that my problem is not unique, and I tried to find a solution on the web. But I still can't get the result that I need.
I found two ways to solve my task but reached a dead end in each case.
1. Unpack the docx like a zip file, change the XML with the main content and pack it into an archive again. But after these manipulations I can't open the changed docx in MS Word. This is odd because I can do the same steps by hand (without Java, using WinRAR) and get a correct result file.
So can you explain to me how to archive docx content with Java to get a correct file?
2. Using an external API. I was advised to use the docx4j Java library. But all that I can do with it is replace a label (like ${label}) in a template with any words (I used the VariableReplace sample). But I want to change arbitrary words without using a template with labels.
I hope for help.
I had this code. I hope it helps you resolve your problem. With it, you can read from a .docx, find the word that you want to change, change this word and save the new paragraphs in a new document.
//WriteDocx.java
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.*;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
public class WriteDocx
{
public static void main(String[] args) throws Exception {
int count = 0;
XWPFDocument document = new XWPFDocument();
XWPFDocument docx = new XWPFDocument(new FileInputStream("Bonjour1.docx"));
XWPFWordExtractor we = new XWPFWordExtractor(docx);
String text = we.getText() ;
if(text.contains("SMS")){
text = text.replace("SMS", "sms");
System.out.println(text);
}
char[] c = text.toCharArray();
for(int i= 0; i < c.length;i++){
if(c[i] == '\n'){
count ++;
}
}
System.out.println(c[0]);
StringTokenizer st = new StringTokenizer(text,"\n");
XWPFParagraph para = document.createParagraph();
para.setAlignment(ParagraphAlignment.CENTER);
XWPFRun run = para.createRun();
run.setBold(true);
run.setFontSize(36);
run.setText("Apache POI works well!");
List<XWPFParagraph>paragraphs = new ArrayList<XWPFParagraph>();
List<XWPFRun>runs = new ArrayList<XWPFRun>();
int k = 0;
for(k=0;k<count+1;k++){
paragraphs.add(document.createParagraph());
}
k=0;
while(st.hasMoreElements()){
paragraphs.get(k).setAlignment(ParagraphAlignment.LEFT);
paragraphs.get(k).setSpacingAfter(0);
paragraphs.get(k).setSpacingBefore(0);
run = paragraphs.get(k).createRun();
run.setText(st.nextElement().toString());
k++;
}
document.write(new FileOutputStream("test2.docx"));
}
}
PS: XWPFDocument docx = new XWPFDocument(new FileInputStream("Bonjour1.docx"))
You must replace "Bonjour1.docx" with the name of the file in which you want to replace certain words or phrases.
I use the Apache POI library.
I took some code from this site: HANDLING MS WORD DOCUMENTS USING APACHE POI
UPDATE
If you want to change arbitrary words, you can do that easily enough with docx4j.
But first you need to find them.
You can find your words using an XPath query, or by traversing the document tree in Java.
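For the first approach from the question (unpack, change the XML, pack again), here is a minimal sketch using only java.util.zip. The class DocxZipRewrite and its rewrite method are made up for this illustration, and the sketch assumes that word/document.xml is UTF-8 encoded. It copies every entry under its original name and applies a plain string replacement to the main document part only.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

class DocxZipRewrite {
    /** Copies all entries of a .docx; in word/document.xml, replaces 'search' with 'replacement'. */
    static void rewrite(InputStream in, OutputStream out, String search, String replacement)
            throws IOException {
        try (ZipInputStream zin = new ZipInputStream(in);
             ZipOutputStream zout = new ZipOutputStream(out)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // New entry with the same name; size and CRC are recomputed while writing
                zout.putNextEntry(new ZipEntry(entry.getName()));
                byte[] data = zin.readAllBytes(); // reads the current entry only
                if (entry.getName().equals("word/document.xml")) {
                    // Assumption: the part is UTF-8 encoded
                    String xml = new String(data, StandardCharsets.UTF_8);
                    data = xml.replace(search, replacement).getBytes(StandardCharsets.UTF_8);
                }
                zout.write(data);
                zout.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args: source.docx target.docx search replacement
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            rewrite(in, out, args[2], args[3]);
        }
    }
}
```

The entry names have to stay exactly as they were (for example word/document.xml, with no extra top-level folder). A plain string replacement only finds text that sits inside a single text element; Word often splits a phrase across several runs, and the replacement text must already be XML-escaped.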

Read .doc file content and write into pdf file in java

I'm writing Java code that uses Apache POI to read an MS Office .doc file and the iText API to create and write a PDF file. I have finished reading the text and tables in the .doc file. Now I'm looking for a solution that reads images embedded in the document. I have coded the following to read images in the document file. Why is this code not working?
public static void main(String[] args) {
POIFSFileSystem fs = null;
Document document = new Document();
WordExtractor extractor = null ;
try {
fs = new POIFSFileSystem(new FileInputStream("C:\\DATASTORE\\tableandImage.doc"));
HWPFDocument hdocument=new HWPFDocument(fs);
extractor = new WordExtractor(hdocument);
OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/tableandImage.pdf"));
PdfWriter.getInstance(document, fileOutput);
document.open();
Range range=hdocument.getRange();
String readText=null;
PdfPTable createTable;
CharacterRun run;
PicturesTable picture;
for(int i=0;i<range.numParagraphs();i++) {
Paragraph par = range.getParagraph(i);
readText=par.text();
if(!par.isInTable()) {
if(readText.endsWith("\n")) {
readText=readText+"\n";
document.add(new com.itextpdf.text.Paragraph(readText));
} if(readText.endsWith("\r")) {
readText += "\n";
document.add(new com.itextpdf.text.Paragraph(readText));
}
run =range.getCharacterRun(i);
picture=hdocument.getPicturesTable();
if(picture.hasPicture(run)) {
//if(run.isSpecialCharacter()) {
Picture pic=picture.extractPicture(run, true);
byte[] picturearray=pic.getContent();
com.itextpdf.text.Image image=com.itextpdf.text.Image.getInstance(picturearray);
document.add(image);
}
} else if (par.isInTable()) {
Table table = range.getTable(par);
TableRow tRow1= table.getRow(0);
int numColumns=tRow1.numCells();
createTable=new PdfPTable(numColumns);
for (int rowId=0;rowId<table.numRows();rowId++) {
TableRow tRow = table.getRow(rowId);
for (int cellId=0;cellId<tRow.numCells();cellId++) {
TableCell tCell = tRow.getCell(cellId);
PdfPCell c1 = new PdfPCell(new Phrase(tCell.text()));
createTable.addCell(c1);
}
}
document.add(createTable);
}
}
}catch(IOException e) {
System.out.println("IO Exception");
e.printStackTrace();
}
catch(Exception exep) {
exep.printStackTrace();
}finally {
document.close();
}
}
The problems are:
1. The condition if(picture.hasPicture(run)) is not satisfied although the document has a JPEG image.
2. I'm getting the following exception while reading a table:
java.lang.IllegalArgumentException: This paragraph is not the first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:876)
at pagecode.ReadDocxOrDocFile.main(ReadDocxOrDocFile.java:113)
Can anybody help me solve the problem?
Thank you.
Regarding your exception:
Your code iterates over all paragraphs and calls isInTable() for each one of them. Since tables are commonly composed of several such paragraphs, your call to getTable() also gets executed several times for a single table.
However, what your code should do instead is to find the first paragraph of a table, then process all paragraphs therein (via getRow(m).getCell(n)) and ultimately continue with the outer loop in the first paragraph after the table. Codewise this may look roughly like the following (assuming no merged cells, no nested tables and no other funny edge cases):
if (par.isInTable()) {
Table table = range.getTable(par);
for (int rn=0; rn<table.numRows(); rn++) {
TableRow row = table.getRow(rn);
for (int cn=0; cn<row.numCells(); cn++) {
TableCell cell = row.getCell(cn);
for (int pn=0; pn<cell.numParagraphs(); pn++) {
Paragraph cellParagraph = cell.getParagraph(pn);
// your PDF conversion code goes here
}
}
}
i += table.numParagraphs()-1; // skip the already processed (table-)paragraphs in the outer loop
}
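The skip-ahead pattern in the snippet above can be tried without POI. This is only a model under stated assumptions: the class TableSkipSketch and the int[] tableIds input are hypothetical, where tableIds[i] is the id of the table that flat paragraph i belongs to, or -1 for a paragraph outside any table.

```java
import java.util.ArrayList;
import java.util.List;

class TableSkipSketch {
    /** Visits plain paragraphs one by one and each table exactly once. */
    static List<String> visit(int[] tableIds) {
        List<String> visited = new ArrayList<>();
        for (int i = 0; i < tableIds.length; i++) {
            if (tableIds[i] < 0) {
                visited.add("paragraph " + i);
                continue;
            }
            // i is the first paragraph of a table: count the paragraphs that belong to it
            int length = 1;
            while (i + length < tableIds.length && tableIds[i + length] == tableIds[i]) {
                length++;
            }
            visited.add("table " + tableIds[i] + " (" + length + " paragraphs)");
            i += length - 1; // same idea as i += table.numParagraphs() - 1
        }
        return visited;
    }

    public static void main(String[] args) {
        // one plain paragraph, a table made of three paragraphs, one plain paragraph
        System.out.println(visit(new int[] {-1, 0, 0, 0, -1}));
        // prints [paragraph 0, table 0 (3 paragraphs), paragraph 4]
    }
}
```

Because the outer index jumps past the table, the table is looked up only from its first paragraph, which is the situation the exception message complains about when it is not the case.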
Regarding the pictures issue:
Am I guessing right that you are trying to obtain the picture which is anchored within a given paragraph? Unfortunately, the predefined methods of POI only work if the picture is not embedded within a field (which is rather rare, actually). For field-based images (i.e. preview images of embedded OLEs) you should do something like the following (untested!):
PictureStore pictureStore = new PictureStore(hdocument);
// bla bla ...
for (int cr=0; cr < par.numCharacterRuns(); cr++) {
CharacterRun characterRun = par.getCharacterRun(cr);
Field field = hdocument.getFields().getFieldByStartOffset(FieldsDocumentPart.MAIN, characterRun.getStartOffset());
if (field != null && field.getType() == 0x3A) { // 0x3A is type "EMBED"
Picture pic = pictureStore.getPicture(field.secondSubrange(characterRun));
}
}
For a list of possible values of Field.getType() see here.
