Table Content Extraction Section wise in .docx file

Table Content Extraction Section wise in .docx file - java

I am using Apache POI 3.9 to extract table contents from a .docx file.This doc contains multiple tables under different sections.I could extract all the table contents irrespective of the sections , but i want to extract table content under particular sections only.Can anyone help ?
.docx outline:
Section 1: ABC
Table 1:
Table 2:
Section 2 :CDE
Table 3:
Table 4:
Table Extraction Code:
XWPFDocument documentContent = new XWPFDocument(inputStream);
Iterator<IBodyElement> bodyElementIterator = documentContent.getBodyElementsIterator();
while(bodyElementIterator.hasNext())
{
IBodyElement element = bodyElementIterator.next();
if("TABLE".equalsIgnoreCase(element.getElementType().name()))
{
List<XWPFTable> tableList = element.getBody().getTables();
//Extract the table row name and their corresponding values from the word stream content
tableRowValues = getTableRowValues(tableList);
}
}
Method:
private static ArrayList getTableRowValues(List tableList) {
ArrayList<String> tableValues = new ArrayList<String>();
for (XWPFTable xwpfTable : tableList)
{
List<XWPFTableRow> row = xwpfTable.getRows();
for (XWPFTableRow xwpfTableRow : row)
{
List<XWPFTableCell> cell = xwpfTableRow.getTableCells();
for (XWPFTableCell xwpfTableCell : cell)
{
List<XWPFParagraph> para = xwpfTableCell.getParagraphs();
for (XWPFParagraph xwpfTablePara : para)
{
if(xwpfTablePara!=null)
{
tableValues.add( xwpfTablePara.getText());
}
}
}
}
}
return tableValues;
}

I did about the same thing.
With this code I extract all the sections with tables underneath it :
Iterator<IBodyElement> iter = xdoc.getBodyElementsIterator();
while (iter.hasNext())
{
IBodyElement elem = iter.next();
if (elem instanceof XWPFParagraph)
{
relevantText.setText(((XWPFParagraph) elem).getText());
relevantText.addBreak();
relevantText.addCarriageReturn();
}
else if (elem instanceof XWPFTable)
{
relevantText.addBreak();
relevantText.setText(((XWPFTable) elem).getText());
relevantText.addCarriageReturn();
}
}
You can create an if-statement before the getText() so it only extract text when the right conditions are true.
In example you can check on; styles, text etc.
paragraph.getStyle() //filters on word styles, eg ""header1"
paragraph.getNumFmt() //filters on bullet text
For more see the documentation from Apache
https://poi.apache.org/

Related

How can I duplicate a table in word with apache poi? [duplicate]

I have a table in the docx template.
Depending on the number of objects, I have to duplicate the table as many times as I have objects. Duplicate tables must be after the table from the template.
I have several tables in the template that should behave like this.
XmlCursor take the place of the first table from the template and put the next one there. I want to insert the next table after the previous one, which I added myself, but xmlcursor does not return the table item I added, but returns "STARTDOC"
XmlCursor cursor = docx.getTables().get(pointer).getCTTbl().newCursor();
cursor.toEndToken();
while (cursor.toNextToken() != XmlCursor.TokenType.START) ;
XWPFParagraph newParagraph = docx.insertNewParagraph(cursor);
newParagraph.createRun().setText("", 0);
cursor.toParent();
cursor.toEndToken();
while (cursor.toNextToken() != XmlCursor.TokenType.START) ;
docx.insertNewTbl(cursor);
CTTbl ctTbl = CTTbl.Factory.newInstance();
ctTbl.set(docx.getTables().get(numberTableFromTemplate).getCTTbl());
XWPFTable tableCopy = new XWPFTable(ctTbl, docx);
docx.setTable(index + 1, tableCopy);

Not clear what you are aiming for with the cursor.toParent();. And I also cannot reproduce the issue having only your small code snippet. But having a complete working example may possible help you.
Assuming we have following template:
Then following code:
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlCursor;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTTbl;
public class WordCopyTableAfterTable {
static XmlCursor setCursorToNextStartToken(XmlObject object) {
XmlCursor cursor = object.newCursor();
cursor.toEndToken(); //Now we are at end of the XmlObject.
//There always must be a next start token.
while(cursor.hasNextToken() && cursor.toNextToken() != org.apache.xmlbeans.XmlCursor.TokenType.START);
//Now we are at the next start token and can insert new things here.
return cursor;
}
static void removeCellValues(XWPFTableCell cell) {
for (XWPFParagraph paragraph : cell.getParagraphs()) {
for (int i = paragraph.getRuns().size()-1; i >= 0; i--) {
paragraph.removeRun(i);
}
}
}
public static void main(String[] args) throws Exception {
//The data. Each row a new table.
String[][] data= new String[][] {
new String[] {"John Doe", "5/23/2019", "1234.56"},
new String[] {"Jane Doe", "12/2/2019", "34.56"},
new String[] {"Marie Template", "9/20/2019", "4.56"},
new String[] {"Hans Template", "10/2/2019", "4567.89"}
};
String value;
XWPFDocument document = new XWPFDocument(new FileInputStream("WordTemplate.docx"));
XWPFTable tableTemplate;
CTTbl cTTblTemplate;
XWPFTable tableCopy;
XWPFTable table;
XWPFTableRow row;
XWPFTableCell cell;
XmlCursor cursor;
XWPFParagraph paragraph;
XWPFRun run;
//get first table (the template)
tableTemplate = document.getTableArray(0);
cTTblTemplate = tableTemplate.getCTTbl();
cursor = setCursorToNextStartToken(cTTblTemplate);
//fill in first data in first table (the template)
for (int c = 0; c < data[0].length; c++) {
value = data[0][c];
row = tableTemplate.getRow(1);
cell = row.getCell(c);
removeCellValues(cell);
cell.setText(value);
}
paragraph = document.insertNewParagraph(cursor); //insert new empty paragraph
cursor = setCursorToNextStartToken(paragraph.getCTP());
//fill in next data, each data row in one table
for (int t = 1; t < data.length; t++) {
table = document.insertNewTbl(cursor); //insert new empty table at position t
cursor = setCursorToNextStartToken(table.getCTTbl());
tableCopy = new XWPFTable((CTTbl)cTTblTemplate.copy(), document); //copy the template table
//fill in data in tableCopy
for (int c = 0; c < data[t].length; c++) {
value = data[t][c];
row = tableCopy.getRow(1);
cell = row.getCell(c);
removeCellValues(cell);
cell.setText(value);
}
document.setTable(t, tableCopy); //set tableCopy at position t instead of table
paragraph = document.insertNewParagraph(cursor); //insert new empty paragraph
cursor = setCursorToNextStartToken(paragraph.getCTP());
}
paragraph = document.insertNewParagraph(cursor);
run = paragraph.createRun();
run.setText("Inserted new text below last table.");
cursor = setCursorToNextStartToken(paragraph.getCTP());
FileOutputStream out = new FileOutputStream("WordResult.docx");
document.write(out);
out.close();
document.close();
}
}
leads to following result:
Is that about what you wanted to achieve?
Please note how I insert the additional tables.
Using table = document.insertNewTbl(cursor); a new empty table is inserted at position t. This table is placed into the document body. So this table must be taken for adjusting the cursor.
Then tableCopy = new XWPFTable((CTTbl)cTTblTemplate.copy(), document); copys the template table. Then this copy is filled with data. And then it is set into the document at position t using document.setTable(t, tableCopy);.
Unfortunately apache poi is incomplete here. XWPFDocument.setTable only sets the internally ArrayLists but not the underlying XML. XWPFDocument.insertNewTbl sets the underlying XML but only using an empty table. So we must do it that ugly complicated way.

How to fetch data of multiple HTML tables through Web Scraping in Java

I was trying to scrape the data of a website and to some extents I succeed in my goal. But, there is a problem that the web page I am trying to scrape have got multiple HTML tables in it. Now, when I execute my program it only retrieves the data of the first table in the CSV file and not retrieving the other tables. My java class code is as follows.
public static void parsingHTML() throws Exception {
//tbodyElements = doc.getElementsByTag("tbody");
for (int i = 1; i <= 1; i++) {
Elements table = doc.getElementsByTag("table");
if (table.isEmpty()) {
throw new Exception("Table is not found");
}
elements = table.get(0).getElementsByTag("tr");
for (Element trElement : elements) {
trElement2 = trElement.getElementsByTag("tr");
tdElements = trElement.getElementsByTag("td");
File fold = new File("C:\\convertedCSV9.csv");
fold.delete();
File fnew = new File("C:\\convertedCSV9.csv");
FileWriter sb = new FileWriter(fnew, true);
//StringBuilder sb = new StringBuilder(" ");
//String y = "<tr>";
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
//Element tdElement1 = it.next();
//final String content2 = tdElement1.text();
if (it.hasNext()) {
sb.append("\r\n");
}
for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
Element tdElement2 = it.next();
final String content = tdElement2.text();
//stringjoiner.add(content);
//sb.append(formatData(content));
if (it2.hasNext()) {
sb.append(formatData(content));
sb.append(" , ");
}
if (!it.hasNext()) {
String content1 = content.replaceAll(",$", " ");
sb.append(formatData(content1));
//it2.next();
}
}
System.out.println(sb.toString());
sb.flush();
sb.close();
}
System.out.println(sampleList.add(tdElements));
}
}
}
What I analyze is that there is a loop which is only checking tr tds. So, after first table there is a style sheet on the HTML page. May be due to style sheet loop is breaking. I think that's the reason it is proceeding to the next table.
P.S: here's the link which I am trying to scrap
http://www.mufap.com.pk/nav_returns_performance.php?tab=01

What you do just at the beginning of your code will not work:
// loop just once, why
for (int i = 1; i <= 1; i++) {
Elements table = doc.getElementsByTag("table");
if (table.isEmpty()) {
throw new Exception("Table is not found");
}
elements = table.get(0).getElementsByTag("tr");
Here you loop just once, read all table elements and then process all tr elements for the first table you find. So even if you would loop more than once, you would always process the first table.
You will have to iterate all table elements, e.g.
for(Element table : doc.getElementsByTag("table")) {
for (Element trElement : table.getElementsByTag("tr")) {
// process "td"s and so on
}
}
Edit Since you're having troubles with the code above, here's a more thorough example. Note that I'm using Jsoup to read and parse the HTML (you didn't specify what you are using)
Document doc = Jsoup
.connect("http://www.mufap.com.pk/nav_returns_performance.php?tab=01")
.get();
for (Element table : doc.getElementsByTag("table")) {
for (Element trElement : table.getElementsByTag("tr")) {
// skip header "tr"s and process only data "tr"s
if (trElement.hasClass("tab-data1")) {
StringJoiner tdj = new StringJoiner(",");
for (Element tdElement : trElement.getElementsByTag("td")) {
tdj.add(tdElement.text());
}
System.out.println(tdj);
}
}
}
This will concat and print all data cells (those having the class tab-data1). You will still have to modify it to write to your CSV file though.
Note: in my tests this processes 21 tables, 243 trs and 2634 tds.

Create table in word doc using docx4j in specific bookmark:

I’m creating a table in specific bookmark location (means it's adisplay table in word doc), but after converting word to PDF it can't show table in PDF because of bookmark is inside a w:p!!
<w:p w:rsidR="00800BD9" w:rsidRDefault="00800BD9">
<w:bookmarkStart w:id="0" w:name="abc"/>
<w:bookmarkEnd w:id="0"/>
</w:p>
Now I want to find the paragraph (using the bookmark), then replace the paragraph with the table.
Does anybody have any suggestion?
Here is my code:
private void replaceBookmarkContents(List<Object> paragraphs, Map<DataFieldName, String> data) throws Exception {
Tbl table = TblFactory.createTable(3,3,9600);
RangeFinder rt = new RangeFinder("CTBookmark", "CTMarkupRange");
new TraversalUtil(paragraphs, rt);
for (CTBookmark bm : rt.getStarts()) {
// do we have data for this one?
if (bm.getName()==null) continue;
String value = data.get(new DataFieldName(bm.getName()));
if (value==null) continue;
try {
// Can't just remove the object from the parent,
// since in the parent, it may be wrapped in a JAXBElement
List<Object> theList = null;
if (bm.getParent() instanceof P) {
System.out.println("OK!");
theList = ((ContentAccessor)(bm.getParent())).getContent();
} else {
continue;
}
int rangeStart = -1;
int rangeEnd=-1;
int i = 0;
for (Object ox : theList) {
Object listEntry = XmlUtils.unwrap(ox);
if (listEntry.equals(bm)) {
if (DELETE_BOOKMARK) {
rangeStart=i;
} else {
rangeStart=i+1;
}
} else if (listEntry instanceof CTMarkupRange) {
if ( ((CTMarkupRange)listEntry).getId().equals(bm.getId())) {
if (DELETE_BOOKMARK) {
rangeEnd=i;
} else {
rangeEnd=i-1;
}
break;
}
}
i++;
}
if (rangeStart>0) {
// Delete the bookmark range
for (int j=rangeStart; j>0; j--) {
theList.remove(j);
}
// now add a run
org.docx4j.wml.R run = factory.createR();
run.getContent().add(table);
theList.add(rangeStart, run);
}
} catch (ClassCastException cce) {
log.error(cce.getMessage(), cce);
}
}
}

The problem with blindly replacing a bookmark inside a paragraph with a table is that you'll end up with w:p/w:tbl (or with your code, w:p/w:r/w:tbl!) which is not allowed by the file format.
To avoid this issue, you could make the bookmark a sibling of the paragraph, or you could change your code so that once you have found a bookmark inside a paragraph, if you are replacing it with a table, you replace the parent p instead.
Note that bookmark find/replace is a brittle basis for document generation. If your requirements are other than modest, you'd be better off using content controls instead.

how to Create table in word doc using docx4j in specific bookmark without overwritting the word doc

I need to create a table at the location of particular bookmark. ie i need to find the bookmark and insert the table . how can i do this using docx4j
Thanks in Advance
Sorry Jason, I am new to Stackoverflow so i couldnt write my problem clearly, here is my situation and problem.
I made changes in that code as you suggested and to my needs, and the code is here
//loop through the bookmarks
for (CTBookmark bm : rt.getStarts()) {
// do we have data for this one?
String bmname =bm.getName();
// find the right bookmark (in this case i have only one bookmark so check if it is not null)
if (bmname!=null) {
String value = "some text for testing run";
//if (value==null) continue;
List<Object> theList = null;
//create bm list
theList = ((ContentAccessor)(bm.getParent())).getContent();
// I set the range as 1 (I assume this start range is to say where the start the table creating)
int rangeStart = 1;
WordprocessingMLPackage wordPackage = WordprocessingMLPackage.createPackage();
// create the table
Tbl table = factory.createTbl();
//add boards to the table
addBorders(table);
for(int rows = 0; rows<1;rows++)
{// create a row
Tr row = factory.createTr();
for(int colm = 0; colm<1;colm++)
{
// create a cell
Tc cell = factory.createTc();
// add the content to cell
cell.getContent().add(wordPackage.getMainDocumentPart()
.createParagraphOfText("cell"+colm));
// add the cell to row
row.getContent().add(cell);
}
// add the row to table
table.getContent().add(row);
// now add a run (to test whether run is working or not)
org.docx4j.wml.R run = factory.createR();
org.docx4j.wml.Text t = factory.createText();
run.getContent().add(t);
t.setValue(value);
//add table to list
theList.add(rangeStart, table);
//add run to list
//theList.add(rangeStart, run);
}
I dont need to delete text in bookmark so i removed it.
I dont know whats the problem, program is compiling but I cannot open the word doc , it says "unknown error". I test to write some string "value" it writes perfectly in that bookmark and document is opening but not in the case of table. Please help me
Thanks in advance

You can adapt the sample code BookmarksReplaceWithText.java
In your case:
line 89: the parent won't be p, it'll be body or tc. You could remove the test.
line 128: instead of adding a run, you want to insert a table
You can use TblFactory to create your table, or the docx4j webapp to generate code from a sample docx.

For some reason bookmark replacement with table didn't workout for me, so I relied on text replacement with table. I created my tables from HTML using XHTML importer for my use case
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
String xhtml= <your table HTML>;
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
int ct = 0;
List<Integer> tableIndexes = new ArrayList<>();
List<Object> documentContents = documentPart.getContent();
for (Object o: documentContents) {
if (o.toString().contains("PlaceholderForTable1")) {
tableIndexes.add(ct);
}
ct++;
}
for (Integer i: tableIndexes) {
documentPart.getContent().remove(i.intValue());
documentPart.getContent().addAll(i.intValue(), XHTMLImporter.convert( xhtml, null));
}
In my input word doc, I defined text 'PlaceholderForTable1' where I want to insert my table.

Java (Apache POI) How to stretch XWPFTable row across whole page

I have a simple XWPFTable I made using Apache POI for docx, but if the text of the cell is short the cell/row resizes to fit around the text. How do I set it so the table/rows stay stretched from left to right of the whole page no matter the length of the text? Thanks.
Here's the table part of the code if It helps:
XWPFTable header = document.createTable(1, 3);
header.getRow(0).getCell(0).setText("LCode");
header.getRow(0).getCell(1).setText("QTY");
header.getRow(0).getCell(2).setText("Description/Justification");
XWPFTable table = document.createTable(LCodes.size(), 3);
int x = 0;
for (Map.Entry entry : new TreeMap<Integer, List<String>>(LCodes).entrySet()) {
Object LCode = entry.getKey();
table.getRow(x).getCell(0).setText("L" + LCode);
table.getRow(x).getCell(1).setText(LCodes.get(LCode).get(2));
XWPFParagraph para = table.getRow(x).getCell(2).getParagraphs().get(0);
for (int i = 0; i < 2; i++) {
if (i == 1 && LCodes.get(LCode).get(i).trim().equals("")) {
continue;
}
XWPFRun leading = para.createRun();
leading.setItalic(true);
if (i == 0) {
leading.setText("Description: ");
} else {
leading.setText("Justification: ");
}
XWPFRun trailing = para.createRun();
trailing.setText(LCodes.get(LCode).get(i));
if (i == 0 && !(LCodes.get(LCode).get(i).trim().equals(""))) {
trailing.addBreak();
}
}
x++;
}

To stretch width of xwpf table you need to perform following.
1. Create table in sample xwpf document.
2. Download the same document as zip and extract document.xml file.
3. Find
Another solution: If you dont want to go from all these and your table borders are hidden. You can add below XWPFparagraph in first column of your table and this will do the trick.
//add this para in first column to get table width, 84pts total
private XWPFParagraph getFirstColumnText(){
XWPFParagraph para=new XWPFDocument().createParagraph();
XWPFRun run=para.createRun();
run.setColor("ffffff");
run.setText("...................................................................................");
return para;
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Table Content Extraction Section wise in .docx file - java

Related

How can I duplicate a table in word with apache poi? [duplicate]

How to fetch data of multiple HTML tables through Web Scraping in Java

Create table in word doc using docx4j in specific bookmark:

how to Create table in word doc using docx4j in specific bookmark without overwritting the word doc

Java (Apache POI) How to stretch XWPFTable row across whole page

Categories

Resources