I am working with Apache POI on a docx file and I am stuck on using formulas in a table.
I tried setting the cell text to "=SUM(ABOVE)", but it doesn't work that way.
I think I might need to set custom XML data here, but I am not sure how to proceed. I tried the following piece of code:
XWPFTable table = document.createTable();
//create first row
XWPFTableRow tableRowOne = table.getRow(0);
table.getRow(0).createCell();
table.getRow(0).getCell(0).setText("10");
table.getRow(0).createCell();
table.getRow(0).getCell(1).setText("=SUM(ABOVE)");
What I do for such requirements is as follows:
First, create the simplest possible Word document containing the required things using the Word GUI. Then look at what Word has created to get an idea of what needs to be created using Apache POI.
Concretely, here:
Create the simplest possible table in Word which has a field {=SUM(ABOVE)} in it. Save that as *.docx. Now unzip that *.docx (Office Open XML files like *.docx are simply ZIP archives). Have a look at /word/document.xml in that archive. There you will find something like:
<w:tc>
<w:p>
<w:fldSimple w:instr="=SUM(ABOVE)"/>
...
</w:p>
</w:tc>
This is the XML for a table cell containing a paragraph that has a fldSimple element in it, whose instr attribute contains the formula.
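The unzip-and-look step can also be done from Java itself. The following is a minimal sketch using only java.util.zip. The class name DocxXmlDump, the method readPart and the file name table_with_sum.docx are made up for illustration, and it assumes the part is UTF-8 encoded, which is what Word normally writes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class DocxXmlDump {

    // Returns the raw XML of one part of a *.docx, which is a ZIP archive.
    // Entry names inside the archive have no leading slash.
    static String readPart(String docxPath, String partName) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry(partName);
            if (entry == null) {
                throw new IOException("No such part in archive: " + partName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // assumption: the part is UTF-8 encoded
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // "table_with_sum.docx" is a placeholder for the file saved from Word
        String path = args.length > 0 ? args[0] : "table_with_sum.docx";
        System.out.println(readPart(path, "word/document.xml"));
    }
}
```

Searching the printed XML for fldSimple (or for instrText, if Word stored the field in its complex form) shows where the formula lives.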
Now we know we need the table cell XWPFTableCell and the XWPFParagraph in it. Then we need to set a fldSimple element in this paragraph whose instr attribute contains the formula.
This would be as simple as
paragraphInCell.getCTP().addNewFldSimple().setInstr("=SUM(ABOVE)");
But of course something must tell Word that it needs to calculate the formula when the document opens. The simplest solution is to set the field "dirty". Word then updates the field when it opens the document, and it shows a confirmation dialog about the need for updating.
Complete example using Apache POI 4.1.0:
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSimpleField;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STOnOff;
public class CreateWordTableSumAbove {

    public static void main(String[] args) throws Exception {

        XWPFDocument document = new XWPFDocument();

        XWPFParagraph paragraph = document.createParagraph();
        XWPFRun run = paragraph.createRun();
        run.setText("The table:");

        //create the table
        XWPFTable table = document.createTable(4, 3);
        table.setWidth("100%");
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 3; col++) {
                if (col < 2) table.getRow(row).getCell(col).setText("row " + row + ", col " + col);
                else table.getRow(row).getCell(col).setText("" + ((row + 1) * 1234));
            }
        }

        //set Sum row
        table.getRow(3).getCell(0).setText("Sum:");

        //get paragraph from cell where the sum field shall be contained
        XWPFParagraph paragraphInCell = null;
        if (table.getRow(3).getCell(2).getParagraphs().size() == 0) paragraphInCell = table.getRow(3).getCell(2).addParagraph();
        else paragraphInCell = table.getRow(3).getCell(2).getParagraphs().get(0);

        //set sum field in
        CTSimpleField sumAbove = paragraphInCell.getCTP().addNewFldSimple();
        sumAbove.setInstr("=SUM(ABOVE)");
        //set sum field dirty, so it must be calculated while opening the document
        sumAbove.setDirty(STOnOff.TRUE);

        paragraph = document.createParagraph();

        FileOutputStream out = new FileOutputStream("create_table.docx");
        document.write(out);
        out.close();
        document.close();
    }
}
All of this only works properly when the document is opened using Microsoft Word. LibreOffice Writer can neither store such formula fields in Office Open XML (*.docx) format nor read such Office Open XML formula fields properly.
I am trying to convert PDF file to CSV or EXCEL format.
Here is the code I use to convert to CSV format:
public void convert() throws Exception {
    PdfReader pdfReader = new PdfReader("example.pdf");
    PdfDocument pdf = new PdfDocument(pdfReader);
    int pages = pdf.getNumberOfPages();
    FileWriter csvWriter = new FileWriter("student.csv");
    for (int i = 1; i <= pages; i++) {
        PdfPage page = pdf.getPage(i);
        String content = PdfTextExtractor.getTextFromPage(page);
        String[] splitContents = content.split("\n");
        // skip the first line of every page (the title)
        boolean isTitle = true;
        for (int j = 0; j < splitContents.length; j++) {
            if (isTitle) {
                isTitle = false;
                continue;
            }
            csvWriter.append(splitContents[j].replaceAll(" ", " "));
            csvWriter.append("\n");
        }
    }
    csvWriter.flush();
    csvWriter.close();
}
This code works correctly, but the fact is that the CSV format groups rows without taking into account existing columns (some of them are empty), so I would like to convert this file (PDF) to EXCEL format.
The PDF file itself is formed as a table.
Here is what I mean about spaces. For example, the PDF file contains this table row:
| name | some data | | | some data 1 | |
+----------+----------------+------------+-------------+-------------------+--------------+
After converting to a CSV file, the line looks like this:
name some data some data 1
How can I get the same result as a PDF table?
I'd suggest using PDFBox, as in this question: Parsing PDF files (especially with tables) with PDFBox, or another library that lets you inspect the data in the table point by point and build the table from column widths (something like Table table = page.getTable(dividers);).
If the column widths vary, you'll have to derive them from the headers or the first data column (e.g. [position.x of the last character of the first word] minus [position.x of the first character of the new word]; you'll have to figure out the details yourself). That is hard, so you could hardcode the widths to begin with. With the Foxit Reader PDF app you can easily measure column widths. Then, if you don't find any data in a particular column, you can add an empty column to the CSV file. I know from my own experience that this is not easy, so I wish you good luck.
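To illustrate the divider idea, here is a minimal sketch in plain Java that does not depend on any PDF library. It assumes you have already extracted the words of one table row, in reading order, together with the x position of their first character (for example from PDFBox text positions). The Word class, the method names and the divider values are hypothetical, and the measurements are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnSplitter {

    // Hypothetical holder for one extracted word and the x position of its first character.
    static final class Word {
        final String text;
        final double x;
        Word(String text, double x) { this.text = text; this.x = x; }
    }

    // dividers: ascending x positions of the vertical lines between columns
    // (n dividers give n + 1 columns). Returns one CSV line for the row.
    static String toCsvLine(List<Word> words, double[] dividers) {
        StringBuilder[] cells = new StringBuilder[dividers.length + 1];
        for (int i = 0; i < cells.length; i++) cells[i] = new StringBuilder();
        for (Word w : words) {
            // the column index is the number of dividers at or left of the word
            int col = 0;
            while (col < dividers.length && w.x >= dividers[col]) col++;
            if (cells[col].length() > 0) cells[col].append(' ');
            cells[col].append(w.text);
        }
        List<String> out = new ArrayList<>();
        for (StringBuilder cell : cells) out.add(quote(cell.toString()));
        return String.join(",", out);
    }

    // Minimal CSV quoting: wrap in quotes when the cell contains a comma, quote or line break.
    static String quote(String s) {
        if (s.contains(",") || s.contains("\"") || s.contains("\n")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    public static void main(String[] args) {
        double[] dividers = {100, 200, 300, 400, 500}; // made-up measurements
        List<Word> row = Arrays.asList(
            new Word("name", 10), new Word("some", 110), new Word("data", 140),
            new Word("some", 410), new Word("data", 440), new Word("1", 470));
        System.out.println(toCsvLine(row, dividers));
    }
}
```

For the sample row in main this prints name,some data,,,some data 1, so the empty columns survive as empty CSV fields.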
I am attempting to format a Word document that has multiple tables. I need to delete the line breaks that occur after each table. How do I achieve this programmatically in Java?
I am currently trying the following code, and it does not work:
org.apache.xmlbeans.XmlCursor cursor = xwpfTable.getCTTbl().newCursor();
cursor.toEndToken();
cursor.toNextToken();
cursor.removeChars(2);
Further clarification: we receive unformatted Word files from an external source. We need to eliminate the paragraphs (extra lines in between tables) when the table has only 1 row. Currently we use a macro and achieve this with the following code:
For Each t In doc.Tables
    Set myrange = doc.Characters(t.Range.End + 1)
    If myrange.Text = Chr(13) Then
        myrange.Delete
    End If
Next t
Thanks in advance
What I am trying to remove:
According to your screenshot you want to remove the empty paragraphs which are placed immediately after tables.
This is possible, although I am wondering why those paragraphs are there. After removing those paragraphs, the tables are no longer editable in Word as single tables but only as rows within one table. Is this what you want?
Anyway, as said, removing the empty paragraphs after the tables is possible. To do so, you can traverse the body elements of the document. If there is an XWPFTable immediately followed by an XWPFParagraph and this XWPFParagraph does not have any text runs in it, then remove that XWPFParagraph from the document.
Example:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
public class WordRemoveEmptyParagraphs {

    public static void main(String[] args) throws Exception {

        XWPFDocument document = new XWPFDocument(new FileInputStream("./WordTables.docx"));

        int thisBodyElementPos = 0;
        int nextBodyElementPos = 1;
        IBodyElement thisBodyElement = null;
        IBodyElement nextBodyElement = null;

        if (document.getBodyElements().size() > 1) { // document must have at least two body elements
            do {
                thisBodyElement = document.getBodyElements().get(thisBodyElementPos);
                nextBodyElement = document.getBodyElements().get(nextBodyElementPos);
                if (thisBodyElement instanceof XWPFTable && nextBodyElement instanceof XWPFParagraph) {
                    XWPFParagraph paragraph = (XWPFParagraph)nextBodyElement;
                    if (paragraph.getRuns().size() == 0) { // if paragraph does not have any text runs in it
                        document.removeBodyElement(nextBodyElementPos);
                    }
                }
                thisBodyElementPos++;
                nextBodyElementPos = thisBodyElementPos + 1;
            } while (nextBodyElementPos < document.getBodyElements().size());
        }

        FileOutputStream out = new FileOutputStream("./WordTablesChanged.docx");
        document.write(out);
        out.close();
        document.close();
    }
}
I need to replace certain words or phrases in a docx file and save it under another name. I know that my problem is not unique, and I tried to find a solution on the web, but I still can't get the result I need.
I found two ways to solve my task but reached a dead end in each case.
1. Unpack the docx like a zip file, change the XML with the main content and pack it into an archive again. But after these manipulations I can't open the changed docx in MS Word. This is odd, because I can do the same steps by hand (without Java, using WinRar) and get a correct result file.
So can you explain to me how to archive the docx content with Java so that I get a correct file?
2. Use an external API. I was advised to use the docx4j Java library. But all that I can do with it is replace a label (like ${label}) in a template with some words (I used the VariableReplace sample). I want to change arbitrary words without using a template with labels.
I hope you can help.
I had this code; I hope it helps you to solve your problem. With it, you can read the text from a .docx, find the word you want to change, change it and save the resulting paragraphs in a new document.
//WriteDocx.java
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.*;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
public class WriteDocx
{
    public static void main(String[] args) throws Exception {
        int count = 0;
        XWPFDocument document = new XWPFDocument();
        XWPFDocument docx = new XWPFDocument(new FileInputStream("Bonjour1.docx"));
        XWPFWordExtractor we = new XWPFWordExtractor(docx);
        // extract the plain text and replace the word in it
        String text = we.getText();
        if (text.contains("SMS")) {
            text = text.replace("SMS", "sms");
            System.out.println(text);
        }
        // count the line breaks to know how many paragraphs are needed
        char[] c = text.toCharArray();
        for (int i = 0; i < c.length; i++) {
            if (c[i] == '\n') {
                count++;
            }
        }
        StringTokenizer st = new StringTokenizer(text, "\n");
        // heading paragraph of the new document
        XWPFParagraph para = document.createParagraph();
        para.setAlignment(ParagraphAlignment.CENTER);
        XWPFRun run = para.createRun();
        run.setBold(true);
        run.setFontSize(36);
        run.setText("Apache POI works well!");
        List<XWPFParagraph> paragraphs = new ArrayList<XWPFParagraph>();
        int k = 0;
        for (k = 0; k < count + 1; k++) {
            paragraphs.add(document.createParagraph());
        }
        k = 0;
        // write one line of the extracted text into each paragraph
        while (st.hasMoreElements()) {
            paragraphs.get(k).setAlignment(ParagraphAlignment.LEFT);
            paragraphs.get(k).setSpacingAfter(0);
            paragraphs.get(k).setSpacingBefore(0);
            run = paragraphs.get(k).createRun();
            run.setText(st.nextElement().toString());
            k++;
        }
        // close the output stream once the document is written
        try (FileOutputStream out = new FileOutputStream("test2.docx")) {
            document.write(out);
        }
    }
}
PS: XWPFDocument docx = new XWPFDocument(new FileInputStream("Bonjour1.docx"))
You must replace "Bonjour1.docx" with the name of the file in which you want to replace certain words or phrases.
I use the Apache POI library.
I took some of the code from this site: HANDLING MS WORD DOCUMENTS USING APACHE POI
UPDATE
If you want to change arbitrary words, you can do that easily enough with docx4j.
But first you need to find them.
You can find your words using an XPath query, or by traversing the document tree in Java.
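Regarding the first approach from the question (unpack, change the XML, pack again): a minimal sketch using only java.util.zip could look like the following. It copies every entry under its original name and only rewrites word/document.xml. The class name, the method name and the file names are made up for illustration. Note the assumptions: a plain string replace only finds text that Word has not split across several runs, and the replacement must not contain characters that need XML escaping.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class DocxRepack {

    // Copies source to target entry by entry and replaces text in word/document.xml.
    static void replaceInDocx(String source, String target,
                              String search, String replacement) throws IOException {
        try (ZipFile zip = new ZipFile(source);
             OutputStream fileOut = Files.newOutputStream(Paths.get(target));
             ZipOutputStream zos = new ZipOutputStream(fileOut)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // new entry with the same name; size and CRC are recomputed on write
                zos.putNextEntry(new ZipEntry(entry.getName()));
                if (!entry.isDirectory()) {
                    try (InputStream in = zip.getInputStream(entry)) {
                        if ("word/document.xml".equals(entry.getName())) {
                            // assumption: the main document part is UTF-8 encoded
                            String xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                            zos.write(xml.replace(search, replacement)
                                         .getBytes(StandardCharsets.UTF_8));
                        } else {
                            in.transferTo(zos); // copy all other parts unchanged
                        }
                    }
                }
                zos.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // file names are placeholders
        replaceInDocx("source.docx", "changed.docx", "SMS", "sms");
    }
}
```

Entry names inside the archive must use forward slashes and must not sit below an extra top-level folder. I cannot tell whether that is what went wrong in your case, but it is a common reason for Word rejecting a repacked file, and copying the names from the source archive as above avoids it.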
I'm writing Java code that uses Apache POI to read an MS Office .doc file and the iText jar APIs to create and write a PDF file. I have finished reading the text and tables in the .doc file. Now I'm looking for a solution that reads the images in the document. I have coded the following to read images in the document file. Why is this code not working?
public static void main(String[] args) {
    POIFSFileSystem fs = null;
    Document document = new Document();
    WordExtractor extractor = null ;
    try {
        fs = new POIFSFileSystem(new FileInputStream("C:\\DATASTORE\\tableandImage.doc"));
        HWPFDocument hdocument=new HWPFDocument(fs);
        extractor = new WordExtractor(hdocument);
        OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/tableandImage.pdf"));
        PdfWriter.getInstance(document, fileOutput);
        document.open();
        Range range=hdocument.getRange();
        String readText=null;
        PdfPTable createTable;
        CharacterRun run;
        PicturesTable picture;
        for(int i=0;i<range.numParagraphs();i++) {
            Paragraph par = range.getParagraph(i);
            readText=par.text();
            if(!par.isInTable()) {
                if(readText.endsWith("\n")) {
                    readText=readText+"\n";
                    document.add(new com.itextpdf.text.Paragraph(readText));
                } if(readText.endsWith("\r")) {
                    readText += "\n";
                    document.add(new com.itextpdf.text.Paragraph(readText));
                }
                run =range.getCharacterRun(i);
                picture=hdocument.getPicturesTable();
                if(picture.hasPicture(run)) {
                    //if(run.isSpecialCharacter()) {
                    Picture pic=picture.extractPicture(run, true);
                    byte[] picturearray=pic.getContent();
                    com.itextpdf.text.Image image=com.itextpdf.text.Image.getInstance(picturearray);
                    document.add(image);
                }
            } else if (par.isInTable()) {
                Table table = range.getTable(par);
                TableRow tRow1= table.getRow(0);
                int numColumns=tRow1.numCells();
                createTable=new PdfPTable(numColumns);
                for (int rowId=0;rowId<table.numRows();rowId++) {
                    TableRow tRow = table.getRow(rowId);
                    for (int cellId=0;cellId<tRow.numCells();cellId++) {
                        TableCell tCell = tRow.getCell(cellId);
                        PdfPCell c1 = new PdfPCell(new Phrase(tCell.text()));
                        createTable.addCell(c1);
                    }
                }
                document.add(createTable);
            }
        }
    }catch(IOException e) {
        System.out.println("IO Exception");
        e.printStackTrace();
    }
    catch(Exception exep) {
        exep.printStackTrace();
    }finally {
        document.close();
    }
}
The problems are:
1. The condition if(picture.hasPicture(run)) is never satisfied, although the document contains a JPEG image.
2. I'm getting the following exception while reading the table:
java.lang.IllegalArgumentException: This paragraph is not the first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:876)
at pagecode.ReadDocxOrDocFile.main(ReadDocxOrDocFile.java:113)
Can anybody help me solve the problem?
Thank you.
Regarding your exception:
Your code iterates over all paragraphs and calls isInTable() for each one of them. Since tables are commonly composed of several such paragraphs, your call to getTable() also gets executed several times for a single table.
However, what your code should do instead is to find the first paragraph of a table, then process all paragraphs therein (via getRow(m).getCell(n)) and ultimately continue with the outer loop in the first paragraph after the table. Codewise this may look roughly like the following (assuming no merged cells, no nested tables and no other funny edge cases):
if (par.isInTable()) {
    Table table = range.getTable(par);
    for (int rn=0; rn<table.numRows(); rn++) {
        TableRow row = table.getRow(rn);
        for (int cn=0; cn<row.numCells(); cn++) {
            TableCell cell = row.getCell(cn);
            for (int pn=0; pn<cell.numParagraphs(); pn++) {
                Paragraph cellParagraph = cell.getParagraph(pn);
                // your PDF conversion code goes here
            }
        }
    }
    i += table.numParagraphs()-1; // skip the already processed (table-)paragraphs in the outer loop
}
Regarding the pictures issue:
Am I guessing right that you are trying to obtain the picture which is anchored within a given paragraph? Unfortunately, the predefined methods of POI only work if the picture is not embedded within a field (which is rather rare, actually). For field-based images (i.e. preview images of embedded OLEs) you should do something like the following (untested!):
PictureStore pictureStore = new PictureStore(hdocument);
// bla bla ...
for (int cr=0; cr < par.numCharacterRuns(); cr++) {
    CharacterRun characterRun = par.getCharacterRun(cr);
    Field field = hdocument.getFields().getFieldByStartOffset(FieldsDocumentPart.MAIN, characterRun.getStartOffset());
    if (field != null && field.getType() == 0x3A) { // 0x3A is type "EMBED"
        Picture pic = pictureStore.getPicture(field.secondSubrange(characterRun));
    }
}
For a list of possible values of Field.getType() see here.
I have been able to loop through all paragraphs in a document and get at the text and everything and I have read and understood how you can create a document from scratch. But how can I update and replace the text in a paragraph? I can do createRun in a paragraph but that will just create a new piece of text in it.
...
FileInputStream fis = new FileInputStream("Muu.docx");
XWPFDocument myDoc = new XWPFDocument(fis);
XWPFParagraph[] myParas = myDoc.getParagraphs();
...
My theory is that I need to get at the existing "run" in the paragraph I want to change (or delete the paragraph and add it again), but I cannot find methods to do that.
You can't change the text on an XWPFParagraph directly. An XWPFParagraph is made up of one or more XWPFRun instances. These provide the way to set the text.
To change the text, your code would want to be something like:
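public void changeText(XWPFParagraph p, String newText) {
    List<XWPFRun> runs = p.getRuns();
    // remove every run except the first one, back to front
    for (int i = runs.size() - 1; i > 0; i--) {
        p.removeRun(i);
    }
    // an empty paragraph has no runs yet, so create one in that case
    XWPFRun run = runs.isEmpty() ? p.createRun() : runs.get(0);
    run.setText(newText, 0);
}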
That will ensure you only have one text run (the first one), and will replace all the text with what you provided.