I check apache poi documentation but didn't find anything related to number list which can help to add number list to pptx, only thing i found was bullet points.
XSLFTextParagraph paragraph = textShape.addNewTextParagraph();
paragraph.setBullet(true);
is there any method which we can use as a replacement of setBullet(true) method for number list?
Any help would be appreciated.
Numbered lists are not yet implemented in XSLF. But the underlying org.openxmlformats.schemas.drawingml.x2006.main.* classes are there. So one could use this classes.
But what and how?
All Office Open XML files are ZIP archives. So one can simply unzip a *.pptx file and have a look into the XML of /ppt/slides/slide1.xml. There one will find something like:
<a:p>
<a:pPr marL="406400" indent="-406400">
<a:buAutoNum type="arabicPeriod"/>
</a:pPr>
<a:r>
<a:rPr sz="2200"/>
<a:t>First list item</a:t>
</a:r>
</a:p>
for a paragraph in a numbered list.
So the paragraph p contains a paragraph properties pPr element, having margin left set and a negative indent for the first text line (hanging indent). In pPr is a bullet auto number (sic) buAutoNum element having the type Arabic numbers followed by period arabicPeriod.
Unfortunately there is no API documentation for the org.openxmlformats.schemas.drawingml.x2006.main.* classes public available. So one would need getting the sources of ooxml-schemas or poi-ooxml-full to create the API documentation using javadoc.
Complete example:
import java.io.FileOutputStream;
import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.sl.usermodel.*;
public class CreatePPTXNunberings {
static void createBulletpointList(XSLFTextShape textShape, String[] items) {
XSLFTextParagraph paragraph = null;
XSLFTextRun run = null;
double fontSize = 22d;
double indent = 22d;
for (String item : items) {
paragraph = textShape.addNewTextParagraph();
paragraph.setBullet(true); //XSLFTextParagraph.setBullet sets pPr too
paragraph.getXmlObject().getPPr().setMarL(org.apache.poi.util.Units.toEMU(indent));
paragraph.setIndent(-indent);
run = paragraph.addNewTextRun();
run.setFontSize(fontSize);
run.setText(item);
}
}
static void createNumberedList(XSLFTextShape textShape, String[] items) {
XSLFTextParagraph paragraph = null;
XSLFTextRun run = null;
double fontSize = 22d;
double indent = 32d;
for (String item : items) {
paragraph = textShape.addNewTextParagraph();
if (paragraph.getXmlObject().getPPr() == null) paragraph.getXmlObject().addNewPPr();
if (paragraph.getXmlObject().getPPr().getBuAutoNum() == null) paragraph.getXmlObject().getPPr().addNewBuAutoNum();
paragraph.getXmlObject().getPPr().getBuAutoNum().setType(org.openxmlformats.schemas.drawingml.x2006.main.STTextAutonumberScheme.ARABIC_PERIOD);
paragraph.getXmlObject().getPPr().setMarL(org.apache.poi.util.Units.toEMU(indent));
paragraph.setIndent(-indent);
run = paragraph.addNewTextRun();
run.setFontSize(fontSize);
run.setText(item);
}
}
public static void main(String[] args) throws Exception {
XMLSlideShow slideShow = new XMLSlideShow();
XSLFSlide slide = slideShow.createSlide();
XSLFTextShape textShape = slide.createAutoShape();
java.awt.Rectangle rect = new java.awt.Rectangle(100, 100, 500, 300);
textShape.setAnchor(rect.getBounds2D());
textShape.setShapeType(ShapeType.RECT);
XSLFTextParagraph paragraph = null;
XSLFTextRun run = null;
double fontSize = 22d;
paragraph = textShape.addNewTextParagraph();
run = paragraph.addNewTextRun();
run.setFontSize(fontSize);
run.setText("Following is a bullet point list:");
createBulletpointList(textShape,
new String[]{"First list item", "Second list item, a little bit longer to show automatic line breaks", "Third list item"}
);
paragraph = textShape.addNewTextParagraph();
paragraph = textShape.addNewTextParagraph();
run = paragraph.addNewTextRun();
run.setFontSize(fontSize);
run.setText("Following is a numbered list:");
createNumberedList(textShape,
new String[]{"First list item", "Second list item, a little bit longer to show automatic line breaks", "Third list item"}
);
FileOutputStream out = new FileOutputStream("./CreatePPTXNunberings.pptx");
slideShow.write(out);
out.close();
slideShow.close();
}
}
This code needs the full jar of all of the schemas, which is poi-ooxml-full-5.2.2.jar for apache poi 5.2.2, as mentioned in FAQ. Note, since apache poi 5.* the formerly used ooxml-schemas-*.jar cannot be used anymore. There must not be any ooxml-schemas-*.jar in class path when using apache poi 5.*.
Related
I am attempting to format a Word document that has multiple tables. I need to delete line breaks that occur after table. How to i achieve this programatically in Java ?
I am currently trying it with the following code and it does not work
org.apache.xmlbeans.XmlCursor cursor = xwpfTable.getCTTbl().newCursor();
cursor.toEndToken();
cursor.toNextToken();
cursor.removeChars(2);
Further Clarification : We are receiving non-formatted word files from external source. We need to eliminate paragraph (extra lines in-between tables) when the table has only 1 row. Currently I are using a macro and achieving this by code :
For Each t In doc.Tables
Set myrange = doc.Characters(t.Range.End + 1)
If myrange.Text = Chr(13) Then
myrange.Delete
End If
Thanks in advance
What I am trying to remove:
According to your screenshot you wants to remove empty paragraphs which are placed immediately after tables.
This is possible, although i am wondering why those paragraphs are there. After removing those paragraphs, in Word the tables are not more editable as single tables but only as rows within one table. Is this what you want?
Anyway, as said removing the empty paragraphs after the tables is possible. To do so, you could traversing the body elements of the document. If there is a XWPFTable immediately followed by a XWPFParagraph and this XWPFParagraph does not have any text runs in it, then remove that XWPFParagraph from the document.
Example:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
public class WordRemoveEmptyParagraphs {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("./WordTables.docx"));
int thisBodyElementPos = 0;
int nextBodyElementPos = 1;
IBodyElement thisBodyElement = null;
IBodyElement nextBodyElement = null;
if (document.getBodyElements().size() > 1) { // document must have at least two body elements
do {
thisBodyElement = document.getBodyElements().get(thisBodyElementPos);
nextBodyElement = document.getBodyElements().get(nextBodyElementPos);
if (thisBodyElement instanceof XWPFTable && nextBodyElement instanceof XWPFParagraph) {
XWPFParagraph paragraph = (XWPFParagraph)nextBodyElement;
if (paragraph.getRuns().size() == 0) { // if paragraph does not have any text runs in it
document.removeBodyElement(nextBodyElementPos);
}
}
thisBodyElementPos++;
nextBodyElementPos = thisBodyElementPos + 1;
} while (nextBodyElementPos < document.getBodyElements().size());
}
FileOutputStream out = new FileOutputStream("./WordTablesChanged.docx");
document.write(out);
out.close();
document.close();
}
}
I'm making a Java program to find occurrrences of a particular keyword in documents. I want to read many types of file format, including all Microsoft Office documents.
I already made it with all of them except for PowerPoint ones, I'm using Apache POI code snippets found on StackOverflow or on other sources.
I discovered all slides are made of shapes (XSLFTextShape) but many of them are objects of class XSLFGraphicFrame or XSLFTable for which I can't use simply the toString() methods. How can I extract all of the text contained in them using Java.
This is the piece of code\pseudocode:
File f = new File("C:\\Users\\Windows\\Desktop\\Modulo 9.pptx");
PrintStream out = System.out;
FileInputStream is = new FileInputStream(f);
XMLSlideShow ppt = new XMLSlideShow(is);
for (XSLFSlide slide : ppt.getSlides()) {
for (XSLFShape shape : slide) {
if (shape instanceof XSLFTextShape) {
XSLFTextShape txShape = (XSLFTextShape) shape;
out.println(txShape.getText());
} else if (shape instanceof XSLFPictureShape) {
//do nothing
} else if (shape instanceof XSLFGraphicFrame or XSLFTable ) {
//print all text in it or in its children
}
}
}
If your requirement "to find occurrences of a particular keyword in documents" needs simply searching in all text content of SlideShows, then simply using SlideShowExtractor could be an approach. This also can act as entry point to an POITextExtractor for getting textual content of the document metadata / properties, such as author and title.
Example:
import java.io.FileInputStream;
import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.sl.extractor.SlideShowExtractor;
import org.apache.poi.extractor.POITextExtractor;
public class SlideShowExtractorExample {
public static void main(String[] args) throws Exception {
SlideShow<XSLFShape,XSLFTextParagraph> slideshow
= new XMLSlideShow(new FileInputStream("Performance_Out.pptx"));
SlideShowExtractor<XSLFShape,XSLFTextParagraph> slideShowExtractor
= new SlideShowExtractor<XSLFShape,XSLFTextParagraph>(slideshow);
slideShowExtractor.setCommentsByDefault(true);
slideShowExtractor.setMasterByDefault(true);
slideShowExtractor.setNotesByDefault(true);
String allTextContentInSlideShow = slideShowExtractor.getText();
System.out.println(allTextContentInSlideShow);
System.out.println("===========================================================================");
POITextExtractor textExtractor = slideShowExtractor.getMetadataTextExtractor();
String metaData = textExtractor.getText();
System.out.println(metaData);
}
}
Of course there are kinds of XSLFGraphicFrame which are not read by SlideShowExtractor because they are not supported by apache poi until now. For example all kinds of SmartArt graphic. The text content of those is stored in /ppt/diagrams/data*.xml document parts which are referenced from the slides. Since apache poi does not supporting this until now, it only can be read using low level underlying methods.
For example to additionally get all text out of all /ppt/diagrams/data which are texts in SmartArt graphics we could do:
...
System.out.println("===========================================================================");
//additionally get all text out of all /ppt/diagrams/data which are texts in SmartArt graphics:
StringBuilder sb = new StringBuilder();
for (XSLFSlide slide : ((XMLSlideShow)slideshow).getSlides()) {
for (org.apache.poi.ooxml.POIXMLDocumentPart part : slide.getRelations()) {
if (part.getPackagePart().getPartName().getName().startsWith("/ppt/diagrams/data")) {
org.apache.xmlbeans.XmlObject xmlObject = org.apache.xmlbeans.XmlObject.Factory.parse(part.getPackagePart().getInputStream());
org.apache.xmlbeans.XmlCursor cursor = xmlObject.newCursor();
while(cursor.hasNextToken()) {
if (cursor.isText()) {
sb.append(cursor.getTextValue() + "\r\n");
}
cursor.toNextToken();
}
sb.append(slide.getSlideNumber() + "\r\n\r\n");
}
}
}
String allTextContentInDiagrams = sb.toString();
System.out.println(allTextContentInDiagrams);
...
I am trying to remove a set of contiguous paragraphs from a Microsoft Word document, using Apache POI.
From what I have understood, deleting a paragraph is possible by removing all of its runs, this way:
/*
* Deletes the given paragraph.
*/
public static void deleteParagraph(XWPFParagraph p) {
if (p != null) {
List<XWPFRun> runs = p.getRuns();
//Delete all the runs
for (int i = runs.size() - 1; i >= 0; i--) {
p.removeRun(i);
}
p.setPageBreak(false); //Remove the eventual page break
}
}
In fact, it works, but there's something strange. The block of removed paragraphs does not disappear from the document, but it's converted in a set of empty lines. It's just like every paragraph would be converted into a new line.
By printing the paragraphs' content from code I can see, in fact, a space (for each one removed). Looking at the content directly from the document, with the formatting mark's visualization enabled, I can see this:
The vertical column of ¶ corresponds to the block of deleted elements.
Do you have an idea for that? I'd like my paragraphs to be completely removed.
I also tried by replacing the text (with setText()) and by removing eventual spaces that could be added automatically, this way:
p.setSpacingAfter(0);
p.setSpacingAfterLines(0);
p.setSpacingBefore(0);
p.setSpacingBeforeLines(0);
p.setIndentFromLeft(0);
p.setIndentFromRight(0);
p.setIndentationFirstLine(0);
p.setIndentationLeft(0);
p.setIndentationRight(0);
But with no luck.
I would delete paragraphs by deleting paragraphs, not by deleting only the runs in this paragraphs. Deleting paragraphs is not part of the apache poi high level API. But using XWPFDocument.getDocument().getBody() we can get the low level CTBody and there is a removeP(int i).
Example:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import java.awt.Desktop;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
public class WordRemoveParagraph {
/*
* Deletes the given paragraph.
*/
public static void deleteParagraph(XWPFParagraph p) {
XWPFDocument doc = p.getDocument();
int pPos = doc.getPosOfParagraph(p);
//doc.getDocument().getBody().removeP(pPos);
doc.removeBodyElement(pPos);
}
public static void main(String[] args) throws IOException, InvalidFormatException {
XWPFDocument doc = new XWPFDocument(new FileInputStream("source.docx"));
int pNumber = doc.getParagraphs().size() -1;
while (pNumber >= 0) {
XWPFParagraph p = doc.getParagraphs().get(pNumber);
if (p.getParagraphText().contains("delete")) {
deleteParagraph(p);
}
pNumber--;
}
FileOutputStream out = new FileOutputStream("result.docx");
doc.write(out);
out.close();
doc.close();
System.out.println("Done");
Desktop.getDesktop().open(new File("result.docx"));
}
}
This deletes all paragraphs from the document source.docx where the text contains "delete" and saves the result in result.docx.
Edited:
Although doc.getDocument().getBody().removeP(pPos); works, it will not update the XWPFDocument's paragraphs list. So it will destroy paragraph iterators and other accesses to that list since the list is only updated while reading the document again.
So the better approach is using doc.removeBodyElement(pPos); instead. removeBodyElement(int pos) does exactly the same as doc.getDocument().getBody().removeP(pos); if the pos is pointing to a pagagraph in the document body since that paragraph is an BodyElement too. But in addition, it will update the XWPFDocument's paragraphs list.
When you are inside of a table you need to use the functions of the XWPFTableCell instead of the XWPFDocument:
cell.removeParagraph(cell.getParagraphs().indexOf(para));
I am trying to copy contents from a docx file to the clipboard eventually. The code I have come up with so far is:
package config;
public class buffer {
public static void main(String[] args) throws IOException, XmlException {
XWPFDocument srcDoc = new XWPFDocument(new FileInputStream("D:\\rules.docx"));
XWPFDocument destDoc = new XWPFDocument();
OutputStream out = new FileOutputStream("D:\\test.docx");
for (IBodyElement bodyElement : srcDoc.getBodyElements()) {
XWPFParagraph srcPr = (XWPFParagraph) bodyElement;
XWPFParagraph dstPr = destDoc.createParagraph();
dstPr.createRun();
int pos = destDoc.getParagraphs().size() - 1;
destDoc.setParagraph(srcPr, pos);
}
destDoc.write(out);
out.close();
}
}
This does fetch the bullets but numbers them. I want to retain the original bullet format. Is there a way to do this?
You'll need to handle the numbering definition (in the numbering part) correctly.
The most reliable thing to do would be to copy the definition (both the instance list and the abstract one) across, and renumber it (ie give it a new ID) so that it is unique.
Then of course you'll need to update the ID's in your paragraph to match.
Note that the above is a solution only for the question you have asked.
You'll run into problems if your content contains a rel to some other part (eg an image). And you'tr not handling the style definition etc.
I need to add a watermark to every page that has certain text, such as "PROCEDURE DELETED".
Based on Bruno Lowagie's suggestion in Adding watermark directly to the stream
So far have the PdfWatermark Class with:
protected Phrase watermark = new Phrase("DELETED", new Font(FontFamily.HELVETICA, 60, Font.NORMAL, BaseColor.PINK));
ArrayList<Integer> arrPages = new ArrayList<Integer>();
boolean pdfChecked = false;
#Override
public void onEndPage(PdfWriter writer, Document document) {
if(pdfChecked == false) {
detectPages(writer, document);
pdfChecked = true;
}
int pageNum = writer.getPageNumber();
if(arrPages.contains(pageNum)) {
PdfContentByte canvas = writer.getDirectContentUnder();
ColumnText.showTextAligned(canvas, Element.ALIGN_CENTER, watermark, 298, 421, 45);
}
}
And this works fine if I add, say, the number 3 to the arrPages ArrayList in my custom detectPages method - it shows the desired watermark on page 3.
What I am having trouble with is how to search through the document for the text string, which I have access to here, only from the PdfWriter writer or the com.itextpdf.text.Document document sent to onEndPage method.
Here is what I have tried, unsuccessfully:
private void detectPages(PdfWriter writer, Document document) {
try {
//arrPages.add(3);
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
PdfWriter.getInstance(document, byteArrayOutputStream);
//following code no work
PdfReader reader = new PdfReader(writer.getDirectContent().getPdfDocument());
PdfContentByte canvas = writer.getDirectContent();
PdfImportedPage page;
for (int i = 0; i < reader.getNumberOfPages(); ) {
page = writer.getImportedPage(reader, ++i);
canvas.addTemplate(page, 1f, 0, 0.4f, 0.4f, 72, 50 * i);
canvas.beginText();
canvas.showTextAligned(Element.ALIGN_CENTER,
//search String
String.valueOf((char)(181 + i)), 496, 150 + 50 * i, 0);
//if(detected) arrPages.add(i);
canvas.endText();
}
Am I on the right track with this as a solution or do I need to back out?
Can anyone supply the missing link needed to scan the doc and pick out "PROCEDURE DELETED" pages?
EDIT: I am using iText 5.0.4 - cannot upgrade to 5.5.X at this time, but could probably upgrade to latest version below that.
EDIT2: More information: This is approach to adding text to the document (doc):
String processed = processText(template);
List<Element> objects = HTMLWorker.parseToListLog(new StringReader(processed),
styles, interfaceProps, errors);
for (Element elem : objects) {
doc.add(elem);
}
That is called in an addText method I control. The template is simply html from a database LOB. The processText checks the html for custom markers contained by curlies as in ${replaceMe}.
This seems to be the place to identify the "PROCEDURE DELETED" string during generation of the document, but I don't see the path to Chunk.setGenericTags().
EDIT3: Table difficulties
List<Element> objects = HTMLWorker.parseToListLog(new StringReader(processed),
styles, interfaceProps, errors);
for (Element elem : objects) {
//Code below no work
if (elem instanceof PdfPTable) {
PdfPTable table = (PdfPTable) elem;
ArrayList<Chunk> chks = table.getChunks();
for(Chunk chk : chks){
if(chk.toString().contains("TEST DELETED")) {
chk.setGenericTag("delete_tag");
}
}
}
doc.add(elem);
}
Commenters mlk and Bruno suggested to detect the "PROCEDURE DELETED" keywords at the time they are added to the doc. However, since the keywords are necessarily inside a table, they have to be detected through PdfPTable rather than the simpler Element.
I could not do it with the code above. Any suggestions exactly how to find text inside a table cell and do a string comparison on it?
EDIT4: Based on some experimentation, I would like to make some assertions and please show me the way through them:
Using Chunk.setGenericTag() is required to trigger the handler onGenericTag
For some reason (PdfPTable) table.getChunks() does not return chunks, at least that my system picks up. This is counterintuitive and possibly there is a setup, version, or code bug causing this behavior.
Therefore, a selection text string inside a table cannot be used to trigger a watermark.
Thanks to all the help from #mkl and #Bruno, I finally got a workable solution. Anybody interested who can give a more elegant approach is welcome - there is certainly room.
Two classes were in play in all the snippets given in the original question. Below is partial code that embodies the working solution.
public class PdfExporter
//a custom class to build content
//create some content to add to pdf
String text = "<div>Lots of important content here</div>";
//assume the section is known to be deleted
if(section_deleted) {
//add a special marker to the text, in white-on-white
text = "<span style=\"color:white;\">.!^DEL^?.</span>" + text + "</span>";
}
//based on some indicator, we know we want to force a new page for a new section
boolean ensureNewPage = true;
//use an instance of PdfDocHandler to add the content
pdfDocHandler.addText(text, ensureNewPage);
public class PdfDocHandler extends PdfPageEventHelper
private boolean isDEL;
public boolean addText(String text, boolean ensureNewPage)
if (ensureNewPage) {
//turn isDEL off: a forced pagebreak indicates a new section
isDEL = false;
}
//attempt to find the special DELETE marker in first chunk
//NOTE: this can be done several ways
ArrayList<Chunk> chks = elem.getChunks();
if(chks.size()>0) {
if(chks.get(0).getContent().contains(".!^DEL^?.")) {
//special delete marker found in text
isDEL = true;
}
}
//doc is an iText Document
doc.add(elem);
public void onEndPage(PdfWriter writer, Document document) {
if(isDEL) {
//set the watermark
Phrase watermark = new Phrase("DELETED", new Font(Font.FontFamily.HELVETICA, 60, Font.NORMAL, BaseColor.PINK));
PdfContentByte canvas = writer.getDirectContentUnder();
ColumnText.showTextAligned(canvas, Element.ALIGN_CENTER, watermark, 298, 421, 45);
}
}