How to read text in XSLFGraphicFrame with Apache POI for PowerPoint - java

I'm writing a Java program that finds occurrences of a particular keyword in documents. It should read many file formats, including all Microsoft Office documents.
I have this working for all of them except PowerPoint. I'm using Apache POI code snippets found on StackOverflow and other sources.
I discovered that slides are made of shapes (XSLFShape). Some are XSLFTextShape objects, but many are objects of class XSLFGraphicFrame or XSLFTable, for which I can't simply use the toString() method. How can I extract all of the text contained in them using Java?
This is the code/pseudocode:
File f = new File("C:\\Users\\Windows\\Desktop\\Modulo 9.pptx");
PrintStream out = System.out;
FileInputStream is = new FileInputStream(f);
XMLSlideShow ppt = new XMLSlideShow(is);
for (XSLFSlide slide : ppt.getSlides()) {
    for (XSLFShape shape : slide) {
        if (shape instanceof XSLFTextShape) {
            XSLFTextShape txShape = (XSLFTextShape) shape;
            out.println(txShape.getText());
        } else if (shape instanceof XSLFPictureShape) {
            // do nothing
        } else if (shape instanceof XSLFGraphicFrame || shape instanceof XSLFTable) {
            // TODO: print all text in it or in its children
        }
    }
}

If your requirement "to find occurrences of a particular keyword in documents" only needs a search over all text content of a slide show, then SlideShowExtractor could be an approach. It can also act as the entry point to a POITextExtractor that returns the textual content of the document metadata / properties, such as author and title.
Example:
import java.io.FileInputStream;
import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.sl.extractor.SlideShowExtractor;
import org.apache.poi.extractor.POITextExtractor;

public class SlideShowExtractorExample {
    public static void main(String[] args) throws Exception {
        SlideShow<XSLFShape,XSLFTextParagraph> slideshow
            = new XMLSlideShow(new FileInputStream("Performance_Out.pptx"));
        SlideShowExtractor<XSLFShape,XSLFTextParagraph> slideShowExtractor
            = new SlideShowExtractor<XSLFShape,XSLFTextParagraph>(slideshow);
        slideShowExtractor.setCommentsByDefault(true);
        slideShowExtractor.setMasterByDefault(true);
        slideShowExtractor.setNotesByDefault(true);
        String allTextContentInSlideShow = slideShowExtractor.getText();
        System.out.println(allTextContentInSlideShow);
        System.out.println("===========================================================================");
        POITextExtractor textExtractor = slideShowExtractor.getMetadataTextExtractor();
        String metaData = textExtractor.getText();
        System.out.println(metaData);
    }
}
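Once the whole text is available as a single String, the keyword search itself needs nothing from apache poi. The following is only a minimal sketch using the JDK; the class name, the method name, and the sample text are hypothetical and not part of the answer above. It assumes that a case-insensitive, non-overlapping count is what "occurrences" means here.

```java
import java.util.Locale;

public class KeywordCountSketch {

    // Counts non-overlapping, case-insensitive occurrences of keyword in text.
    static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0; // an empty keyword would otherwise match at every position
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length(); // continue after the current match
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by slideShowExtractor.getText()
        String extracted = "Module 9\nPerformance overview\nperformance goals";
        System.out.println(countOccurrences(extracted, "Performance"));
    }
}
```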
Of course there are kinds of XSLFGraphicFrame which are not read by SlideShowExtractor because apache poi does not support them so far. All kinds of SmartArt graphics are an example. Their text content is stored in /ppt/diagrams/data*.xml document parts which are referenced from the slides. Since apache poi does not support this yet, it can only be read using the low-level underlying methods.
For example, to additionally get all text out of all /ppt/diagrams/data parts, which hold the texts of SmartArt graphics, we could do:
...
System.out.println("===========================================================================");
//additionally get all text out of all /ppt/diagrams/data which are texts in SmartArt graphics:
StringBuilder sb = new StringBuilder();
for (XSLFSlide slide : ((XMLSlideShow)slideshow).getSlides()) {
    for (org.apache.poi.ooxml.POIXMLDocumentPart part : slide.getRelations()) {
        if (part.getPackagePart().getPartName().getName().startsWith("/ppt/diagrams/data")) {
            org.apache.xmlbeans.XmlObject xmlObject = org.apache.xmlbeans.XmlObject.Factory.parse(part.getPackagePart().getInputStream());
            org.apache.xmlbeans.XmlCursor cursor = xmlObject.newCursor();
            while(cursor.hasNextToken()) {
                if (cursor.isText()) {
                    sb.append(cursor.getTextValue() + "\r\n");
                }
                cursor.toNextToken();
            }
            sb.append(slide.getSlideNumber() + "\r\n\r\n");
        }
    }
}
String allTextContentInDiagrams = sb.toString();
System.out.println(allTextContentInDiagrams);
...
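Because a *.pptx file is a ZIP archive, the same diagram data parts can also be read without apache poi. What follows is a hedged, JDK-only sketch and not the approach shown above: it scans every entry whose name starts with ppt/diagrams/data, regardless of which slide references it, and collects the content of the DrawingML a:t elements. The class and method names are hypothetical, and it assumes the SmartArt text runs are stored as a:t elements in the DrawingML main namespace.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SmartArtTextSketch {

    // Namespace of the DrawingML "a:" prefix used for text runs
    static final String DRAWINGML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Returns the content of all a:t elements found in ppt/diagrams/data* entries.
    static List<String> textsInDiagramData(InputStream pptx) throws Exception {
        List<String> texts = new ArrayList<>();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // entry names inside the archive have no leading slash
                if (!entry.getName().startsWith("ppt/diagrams/data")) {
                    continue;
                }
                // copy the entry first so the XML parser cannot close the zip stream
                byte[] bytes = zip.readAllBytes();
                Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
                NodeList runs = doc.getElementsByTagNameNS(DRAWINGML_NS, "t");
                for (int i = 0; i < runs.getLength(); i++) {
                    texts.add(runs.item(i).getTextContent());
                }
            }
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: SmartArtTextSketch <file.pptx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            textsInDiagramData(in).forEach(System.out::println);
        }
    }
}
```

Unlike the relation-based loop, this sketch cannot tell which slide a diagram belongs to; for a plain keyword search that may not matter.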

Related

Java Aspose Slides find and replace text cannot keep text style

I'm working with: Aspose.Slides lib to read PPT and PPTX files.
When I replace text with another text, the font size is broken.
Origin:
After replace text:
public void asposeTranslate(String fileName) throws IOException {
    Locale.setDefault(new Locale("en-us"));
    // Load presentation
    Presentation pres = new Presentation(URL + "/" + fileName);
    // Loop through each slide
    for (ISlide slide : pres.getSlides()) {
        // Get all text frames in the slide
        ITextFrame[] tf = SlideUtil.getAllTextBoxes(slide);
        for (int i = 0; i < tf.length; i++) {
            for (IParagraph para : tf[i].getParagraphs()) {
                for (IPortion port : para.getPortions()) {
                    String originText = port.getText();
                    String newText = translateText(originText); // method that makes a new text
                    port.setText(newText); // replace with new text
                }
            }
        }
    }
    pres.save(URL + "/new_" + fileName, SaveFormat.Pptx);
}
I read from blogs: https://blog.aspose.com/slides/find-and-replace-text-in-powerpoint-using-java/#API-to-Find-and-Replace-Text-in-PowerPoint
After replacing the text, how can I keep all the styles of the original text?
I used aspose-slides-21.7
Thanks,
You can post the issue on Aspose.Slides forum, provide a sample presentation and get help. I am working as a Support Developer at Aspose.

How to add number list in ppt using apache poi?

I checked the apache poi documentation but didn't find anything about numbered lists that would help me add one to a pptx; the only thing I found was bullet points.
XSLFTextParagraph paragraph = textShape.addNewTextParagraph();
paragraph.setBullet(true);
Is there any method we can use in place of setBullet(true) for a numbered list?
Any help would be appreciated.
Numbered lists are not yet implemented in XSLF. But the underlying org.openxmlformats.schemas.drawingml.x2006.main.* classes are there. So one could use these classes.
But what and how?
All Office Open XML files are ZIP archives. So one can simply unzip a *.pptx file and have a look into the XML of /ppt/slides/slide1.xml. There one will find something like:
<a:p>
<a:pPr marL="406400" indent="-406400">
<a:buAutoNum type="arabicPeriod"/>
</a:pPr>
<a:r>
<a:rPr sz="2200"/>
<a:t>First list item</a:t>
</a:r>
</a:p>
for a paragraph in a numbered list.
So the paragraph p contains a paragraph properties pPr element, having margin left set and a negative indent for the first text line (hanging indent). In pPr is a bullet auto number (sic) buAutoNum element having the type Arabic numbers followed by period arabicPeriod.
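The attribute values in that XML are not points: marL and indent are given in English Metric Units (EMU), and sz is given in hundredths of a point. The following small sketch shows the arithmetic, assuming the usual 12700 EMU per point (914400 EMU per inch, 72 points per inch). The class and method names are hypothetical; apache poi offers org.apache.poi.util.Units.toEMU for the same conversion.

```java
public class EmuSketch {

    // 914400 EMU per inch divided by 72 points per inch
    static final int EMU_PER_POINT = 12700;

    // Converts points to EMU, rounding to the nearest whole EMU.
    static int pointsToEmu(double points) {
        return (int) Math.rint(EMU_PER_POINT * points);
    }

    // Converts EMU back to points.
    static double emuToPoints(int emu) {
        return (double) emu / EMU_PER_POINT;
    }

    public static void main(String[] args) {
        // marL="406400" from the XML above corresponds to 32 points
        System.out.println(emuToPoints(406400)); // prints 32.0
        // sz="2200" is in hundredths of a point, i.e. a 22 point font
        System.out.println(2200 / 100.0); // prints 22.0
    }
}
```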
Unfortunately there is no publicly available API documentation for the org.openxmlformats.schemas.drawingml.x2006.main.* classes. So one would need to get the sources of ooxml-schemas or poi-ooxml-full and create the API documentation using javadoc.
Complete example:
import java.io.FileOutputStream;
import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.sl.usermodel.*;

public class CreatePPTXNunberings {

    static void createBulletpointList(XSLFTextShape textShape, String[] items) {
        XSLFTextParagraph paragraph = null;
        XSLFTextRun run = null;
        double fontSize = 22d;
        double indent = 22d;
        for (String item : items) {
            paragraph = textShape.addNewTextParagraph();
            paragraph.setBullet(true); //XSLFTextParagraph.setBullet sets pPr too
            paragraph.getXmlObject().getPPr().setMarL(org.apache.poi.util.Units.toEMU(indent));
            paragraph.setIndent(-indent);
            run = paragraph.addNewTextRun();
            run.setFontSize(fontSize);
            run.setText(item);
        }
    }

    static void createNumberedList(XSLFTextShape textShape, String[] items) {
        XSLFTextParagraph paragraph = null;
        XSLFTextRun run = null;
        double fontSize = 22d;
        double indent = 32d;
        for (String item : items) {
            paragraph = textShape.addNewTextParagraph();
            if (paragraph.getXmlObject().getPPr() == null) paragraph.getXmlObject().addNewPPr();
            if (paragraph.getXmlObject().getPPr().getBuAutoNum() == null) paragraph.getXmlObject().getPPr().addNewBuAutoNum();
            paragraph.getXmlObject().getPPr().getBuAutoNum().setType(org.openxmlformats.schemas.drawingml.x2006.main.STTextAutonumberScheme.ARABIC_PERIOD);
            paragraph.getXmlObject().getPPr().setMarL(org.apache.poi.util.Units.toEMU(indent));
            paragraph.setIndent(-indent);
            run = paragraph.addNewTextRun();
            run.setFontSize(fontSize);
            run.setText(item);
        }
    }

    public static void main(String[] args) throws Exception {
        XMLSlideShow slideShow = new XMLSlideShow();
        XSLFSlide slide = slideShow.createSlide();
        XSLFTextShape textShape = slide.createAutoShape();
        java.awt.Rectangle rect = new java.awt.Rectangle(100, 100, 500, 300);
        textShape.setAnchor(rect.getBounds2D());
        textShape.setShapeType(ShapeType.RECT);
        XSLFTextParagraph paragraph = null;
        XSLFTextRun run = null;
        double fontSize = 22d;
        paragraph = textShape.addNewTextParagraph();
        run = paragraph.addNewTextRun();
        run.setFontSize(fontSize);
        run.setText("Following is a bullet point list:");
        createBulletpointList(textShape,
            new String[]{"First list item", "Second list item, a little bit longer to show automatic line breaks", "Third list item"}
        );
        paragraph = textShape.addNewTextParagraph();
        paragraph = textShape.addNewTextParagraph();
        run = paragraph.addNewTextRun();
        run.setFontSize(fontSize);
        run.setText("Following is a numbered list:");
        createNumberedList(textShape,
            new String[]{"First list item", "Second list item, a little bit longer to show automatic line breaks", "Third list item"}
        );
        FileOutputStream out = new FileOutputStream("./CreatePPTXNunberings.pptx");
        slideShow.write(out);
        out.close();
        slideShow.close();
    }
}
This code needs the full jar of all of the schemas, which is poi-ooxml-full-5.2.2.jar for apache poi 5.2.2, as mentioned in FAQ. Note, since apache poi 5.* the formerly used ooxml-schemas-*.jar cannot be used anymore. There must not be any ooxml-schemas-*.jar in class path when using apache poi 5.*.

How to Extract Diagonal watermark from pdf using PDFBOX and Extract Text by maintaining alignment

How can I extract diagonal watermark text from PDF using PDFBox ?
After referring to ExtractText's rotationMagic option, I am now extracting vertical and horizontal watermarks but not diagonal. This is my code so far.
class AngleCollector extends PDFTextStripper {
    private final Set<Integer> angles = new TreeSet<>();

    AngleCollector() throws IOException {}

    Set<Integer> getAngles() {
        return angles;
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        int angle = ExtractText.getAngle(text);
        angle = (angle + 360) % 360;
        angles.add(angle);
    }
}

class FilteredTextStripper extends PDFTextStripper {
    FilteredTextStripper() throws IOException {
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        int angle = ExtractText.getAngle(text);
        if (angle == 0) {
            super.processTextPosition(text);
        }
    }
}
final class ExtractText {
    static int getAngle(TextPosition text) {
        //The Matrix containing the starting text position
        Matrix m = text.getTextMatrix().clone();
        m.concatenate(text.getFont().getFontMatrix());
        return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
    }

    private List<String> getAnnots(PDPage page) throws IOException {
        List<String> returnList = new ArrayList<>();
        for (PDAnnotation pdAnnot : page.getAnnotations()) {
            if(pdAnnot.getContents() != null && !pdAnnot.getContents().isEmpty()) {
                returnList.add(pdAnnot.getContents());
            }
        }
        return returnList;
    }

    public void extractPages(int startPage, int endPage, PDFTextStripper stripper, PDDocument document, Writer output) {
        for (int p = startPage; p <= endPage; ++p) {
            stripper.setStartPage(p);
            stripper.setEndPage(p);
            try {
                PDPage page = document.getPage(p - 1);
                for (var annot : getAnnots(page)) {
                    output.write(annot);
                }
                int rotation = page.getRotation();
                page.setRotation(0);
                var angleCollector = new AngleCollector();
                angleCollector.setStartPage(p);
                angleCollector.setEndPage(p);
                angleCollector.writeText(document, output);
                for (int angle : angleCollector.getAngles()) {
                    // prepend a transformation
                    try (var cs = new PDPageContentStream(document, page,
                            PDPageContentStream.AppendMode.PREPEND, false)) {
                        cs.transform(Matrix.getRotateInstance(-Math.toRadians(angle), 0, 0));
                    }
                    stripper.writeText(document, output);
                    // remove prepended transformation
                    ((COSArray) page.getCOSObject().getItem(COSName.CONTENTS)).remove(0);
                }
                page.setRotation(rotation);
            } catch (IOException ex) {
                System.err.println("Failed to process page " + p + ex);
            }
        }
    }
}

public class pdfTest {
    private pdfTest() {
    }

    public static void main(String[] args) throws IOException {
        var pdfFile = "test-resources/pdf/pdf_sample_2.pdf";
        Writer output = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        var etObj = new ExtractText();
        var rawDoc = PDDocument.load(new File(pdfFile));
        PDFTextStripper stripper = new FilteredTextStripper();
        if(rawDoc.getDocumentCatalog().getAcroForm() != null) {
            rawDoc.getDocumentCatalog().getAcroForm().flatten();
        }
        etObj.extractPages(1, rawDoc.getNumberOfPages(), stripper, rawDoc, output);
        output.flush();
    }
}
Edit 1:
I am also unable to detect form (Acro, XFA) field contents via TextExtractor code with correct Alignment. How can I do that ?
I am attaching the sample PDFs for references.
Sample PDF 1
Sample PDF 2
I require the following using PDFBox:
1. Diagonal text detection (including watermarks).
2. Form field extraction that maintains proper alignment.
In your "question" you actually ask multiple distinct questions. I'll look into each of them. The answers will be less specific than you'd probably wish because your questions are based on assumptions that are not all true.
"How can I extract diagonal watermark text from PDF using PDFBox ?"
First of all, PDF text extraction works by inspecting the instructions in content streams of a page and contained XObjects, finding text drawing instructions therein, taking the coordinates and orientations and the string parameters thereof, mapping the strings to Unicode, and arranging the many individual Unicode strings by their coordinates and orientations in a single content string.
In case of PDFBox the PDFTextStripper as-is does this with a limited support for orientation processing, but it can be extended to filter the text pieces by orientation for better orientation support as shown in the ExtractText example with rotation magic activated.
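Filtering by orientation comes down to reading an angle out of the text matrix, as the getAngle method in the question does with atan2(shearY, scaleY). The following JDK-only sketch illustrates that arithmetic; it uses java.awt.geom.AffineTransform as a stand-in for PDFBox's Matrix, which is an assumption made for illustration, and the class and method names are hypothetical. For genuine text drawn diagonally, this calculation would yield a value such as 45 rather than 0, 90, 180 or 270.

```java
import java.awt.geom.AffineTransform;

public class TextAngleSketch {

    // Derives the rotation in whole degrees, normalized to 0..359,
    // from the shearY and scaleY components of a transformation matrix.
    static int angleOf(AffineTransform m) {
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
        return (angle + 360) % 360;
    }

    public static void main(String[] args) {
        // A 12 point font rotated by 45 degrees: the scaling does not affect the angle
        AffineTransform diagonal = AffineTransform.getRotateInstance(Math.toRadians(45));
        diagonal.scale(12, 12);
        System.out.println(angleOf(diagonal));

        // Text running top to bottom
        AffineTransform vertical = AffineTransform.getRotateInstance(Math.toRadians(-90));
        System.out.println(angleOf(vertical));
    }
}
```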
double_watermark.pdf
In case of your double_watermark.pdf example PDF, though, the diagonal text "Top Secret" is not created using text drawing instructions but instead path construction and painting instructions, as Tilman already remarked. (Actually the paths here all are sequences of very short lines, no curves are used, which you can see using a high zoom factor.) Thus, PDF text extraction cannot extract this text.
To answer your question
How can I extract diagonal watermark text from PDF using PDFBox ?
in this context, therefore: You cannot.
(You can of course use PDFBox as a PDF processing framework on top of which you collect the paths and try to match them to characters, but that would be a larger project in itself. Or you can use PDFBox to render the pages as bitmaps and apply OCR to those bitmaps.)
"I am also unable to detect form (Acro, XFA) field contents via TextExtractor code with correct Alignment. How can I do that ?"
Form data in AcroForm or XFA form definitions are not part of the page content streams or the XObject content streams referenced from therein. Thus, they are not immediately subject to text extraction.
AcroForm forms
AcroForm form fields are abstract PDF data objects which may or may not have associated content streams for display. To include them into the content streams text extraction operates on, you can first flatten the form. As you mentioned in your own answer, you also have to activate sorting to extract the field contents in context.
Beware, PDF renderers do have certain freedoms when creating the visualization of a form field. Thus, text extraction order may be slightly different from what you expect.
XFA forms
XFA form definitions are a cuckoo's egg in PDF. They are XML streams which are not related to regular PDF objects; furthermore, XFA in PDFs was deprecated a number of years ago. Thus, most PDF libraries don't support XFA forms.
PDFBox only allows extracting or replacing the XFA XML stream. Thus, there is no immediate support for XFA form contents during text extraction.
Form fields extraction by maintaining Proper alignment.
This is solved by calling setSortByPosition(true) on the PDFTextStripper.

How to test PDF generated with PDFbox

I wrote a class that reads a PDF template and adds some lines to it, like this:
List<PdfTextLine> textLines = pdfBankActPage.getTextLines();
if (textLines != null) {
    for (PdfTextLine textLine : textLines) {
        contentStream.setFont(cyrillicFont, textLine.getFontSize());
        contentStream.beginText();
        contentStream.newLineAtOffset(textLine.getOffsetX(), textLine.getOffsetY());
        contentStream.showText(textLine.getText());
        contentStream.endText();
    }
}
My util class to store one-line info:
public class PdfTextLine {
    private Integer offsetX;
    private Integer offsetY;
    private String text;
    private Integer fontSize;
    // getters (getOffsetX(), getOffsetY(), getText(), getFontSize()) omitted
}
What is the common approach to testing that the PDF was generated correctly?
Since your question is lacking detail it is difficult to answer.
There are different possibilities to verify that your content was added:
Extract the text from the PDF and check whether your text is in there
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(pageNumber);
stripper.setEndPage(pageNumber);
stripper.setAddMoreFormatting(false);
String text = stripper.getText(this.document);
Use a library like pdfcompare (which itself uses Pdfbox) to compare the pdf visually...
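For the text extraction option, the extracted text may not match the written lines character for character, because the stripper adds its own line separators and spacing. The following is a minimal, JDK-only sketch of a whitespace-tolerant check. The class and method names are hypothetical, and the sample string merely stands in for what stripper.getText(...) would return.

```java
public class ExtractedTextCheckSketch {

    // Collapses every run of whitespace (including line breaks) to a single space.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // True if the expected line occurs in the extracted text, ignoring whitespace differences.
    static boolean containsLine(String extracted, String expected) {
        return normalize(extracted).contains(normalize(expected));
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by stripper.getText(document)
        String extracted = "Act No. 42\r\nBank:   Example Bank\r\n";
        System.out.println(containsLine(extracted, "Bank: Example Bank"));
    }
}
```

In a unit test, one such check per PdfTextLine would confirm that each line's text reached the page; it says nothing about position or font, which is where a visual comparison helps.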

apache pdfbox - how to test if a document is flattened?

I have written the following small Java main method. It takes in a PDF document (hardcoded for testing purposes!) that I know contains active form elements, and it needs to flatten that document.
public static void main(String [] args) {
    try {
        // for testing
        Tika tika = new Tika();
        String filePath = "<path-to>/<pdf-document-with-active-elements>.pdf";
        String fileName = filePath.substring(0, filePath.length() -4);
        File file = new File(filePath);
        if (tika.detect(file).equalsIgnoreCase("application/pdf")) {
            PDDocument pdDocument = PDDocument.load(file);
            PDAcroForm pdAcroForm = pdDocument.getDocumentCatalog().getAcroForm();
            if (pdAcroForm != null) {
                pdAcroForm.flatten();
                pdAcroForm.refreshAppearances();
                pdDocument.save(fileName + "-flattened.pdf");
            }
            pdDocument.close();
        }
    }
    catch (Exception e) {
        System.err.println("Exception: " + e.getLocalizedMessage());
    }
}
What kind of test would assert the File(<path-to>/<pdf-document-with-active-elements>-flattened.pdf) generated by this code would, in fact, be flat?
What kind of test would assert that the file generated by this code would, in fact, be flat?
Load that document anew and check whether it has any form fields in its PDAcroForm (if there is a PDAcroForm at all).
If you want to be thorough, also iterate through the pages and ensure that there are no Widget annotations associated with them anymore.
And to really be thorough, additionally determine the field positions and contents before flattening and apply text extraction at those positions to the flattened pdf. This verifies that the form has not merely been dropped but indeed flattened.
