Replacing all text in powerpoint using Apache POI - java

I looked at the Apache POI documentation and created a function that redacts all the text in a PowerPoint. The function works well for replacing text in slides, but not for text found in grouped text boxes. Is there a separate object that handles the grouped items?
private static void redactText(XMLSlideShow ppt) {
    for (XSLFSlide slide : ppt.getSlides()) {
        System.out.println("REDACT Slide: " + slide.getTitle());
        XSLFTextShape[] shapes = slide.getPlaceholders();
        for (XSLFTextShape textShape : shapes) {
            List<XSLFTextParagraph> textparagraphs = textShape.getTextParagraphs();
            for (XSLFTextParagraph para : textparagraphs) {
                List<XSLFTextRun> textruns = para.getTextRuns();
                for (XSLFTextRun incomingTextRun : textruns) {
                    String text = incomingTextRun.getRawText();
                    System.out.println(text);
                    if (text.toLowerCase().contains("test")) {
                        String newText = text.replaceAll("(?i)" + "test", "XXXXXXXX");
                        incomingTextRun.setText(newText);
                    }
                }
            }
        }
    }
}

If the need is simply to get all text content, regardless of which objects contain it, then one could do exactly that. Text content is contained in org.apache.xmlbeans.XmlString elements; in the PowerPoint XML these are a:t tags, with namespace a="http://schemas.openxmlformats.org/drawingml/2006/main".
So the following code gets all text in all objects in all slides and replaces the case-insensitive string "test" with "XXXXXXXX".
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xslf.usermodel.*;
import org.openxmlformats.schemas.presentationml.x2006.main.CTSlide;
import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlString;

public class ReadPPTXAllText {

    public static void main(String[] args) throws Exception {
        XMLSlideShow slideShow = new XMLSlideShow(new FileInputStream("MicrosoftPowerPoint.pptx"));
        for (XSLFSlide slide : slideShow.getSlides()) {
            CTSlide ctSlide = slide.getXmlObject();
            XmlObject[] allText = ctSlide.selectPath(
                    "declare namespace a='http://schemas.openxmlformats.org/drawingml/2006/main' " +
                    ".//a:t"
            );
            for (int i = 0; i < allText.length; i++) {
                if (allText[i] instanceof XmlString) {
                    XmlString xmlString = (XmlString) allText[i];
                    String text = xmlString.getStringValue();
                    System.out.println(text);
                    if (text.toLowerCase().contains("test")) {
                        String newText = text.replaceAll("(?i)" + "test", "XXXXXXXX");
                        xmlString.setStringValue(newText);
                    }
                }
            }
        }
        FileOutputStream out = new FileOutputStream("MicrosoftPowerPointChanged.pptx");
        slideShow.write(out);
        slideShow.close();
        out.close();
    }
}

If one doesn't like the approach of replacing via the XML directly, it is possible to iterate over all slides and their shapes. If a shape is an XSLFTextShape, get the paragraphs and handle them as you did.
If you encounter an XSLFGroupShape, iterate over its getShapes() as well. Since groups can contain different types of shapes, including nested groups, recursion is the natural fit, as in the sketch below. You might want to handle the shape type XSLFTable too.
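For illustration, a minimal sketch of that recursive traversal, assuming a reasonably recent POI where getShapes() returns List<XSLFShape> (tables are left as a comment):

private static void redactShapes(List<XSLFShape> shapes) {
    for (XSLFShape shape : shapes) {
        if (shape instanceof XSLFTextShape) {
            for (XSLFTextParagraph para : ((XSLFTextShape) shape).getTextParagraphs()) {
                for (XSLFTextRun run : para.getTextRuns()) {
                    String text = run.getRawText();
                    if (text != null && text.toLowerCase().contains("test")) {
                        run.setText(text.replaceAll("(?i)test", "XXXXXXXX"));
                    }
                }
            }
        } else if (shape instanceof XSLFGroupShape) {
            // groups can nest arbitrarily deep, hence the recursion
            redactShapes(((XSLFGroupShape) shape).getShapes());
        }
        // an XSLFTable would need its own branch: iterate its rows and cells;
        // each XSLFTableCell is itself a text shape
    }
}

// called once per slide:
// redactShapes(slide.getShapes());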
But the real trouble starts when you realize that something you want to replace is divided into several text runs ;-)
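One common workaround, sketched here with the caveat that it flattens per-run formatting differences within a paragraph: join the runs' raw text, run the replacement on the joined string, write the result into the first run, and blank the rest.

private static void replaceInParagraph(XSLFTextParagraph para, String regex, String replacement) {
    List<XSLFTextRun> runs = para.getTextRuns();
    if (runs.isEmpty()) return;
    StringBuilder joined = new StringBuilder();
    for (XSLFTextRun run : runs) {
        joined.append(run.getRawText());
    }
    // the first run keeps its formatting and receives the whole result
    runs.get(0).setText(joined.toString().replaceAll(regex, replacement));
    for (int i = 1; i < runs.size(); i++) {
        runs.get(i).setText(""); // blank the remaining runs
    }
}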

Related

Stop Bullet number to be updated automatically when merging word docs using docx4j

I am trying to merge two docx files, each of which has its own bullet numbering; after merging, the bullets are automatically renumbered.
E.g.:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering is updated to 1 2 3 4 5 6.
How do I stop this?
I am using the following code:
if (counter == 1)
{
    FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
    FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
    main = FirstWordFile.getMainDocumentPart();
    //Add page break for Table of Content
    main.addObject(objBr);
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    //Table of contents - End
}
else
{
    FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
    FileIS = new java.io.ByteArrayInputStream(FileByteStream);
    byte[] bytes = IOUtils.toByteArray(FileIS);
    AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
    afiPart.setContentType(new ContentType(CONTENT_TYPE));
    afiPart.setBinaryData(bytes);
    Relationship altChunkRel = main.addTargetPart(afiPart);
    CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
    chunk.setId(altChunkRel.getId());
    main.addObject(objBr);
    htmlCode = new StringBuilder();
    htmlCode.append("<html>");
    htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">" + ReqName + "</p></h2>");
    htmlCode.append("</html>");
    if (htmlCode != null) {
        main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html, htmlCode.toString().getBytes());
    }
    //Add Page Break before new content
    main.addObject(objBr);
    //Add new content
    main.addObject(chunk);
}
Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks and have docx4j-ImportXHTML convert them via main.convertAltChunks().
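A rough sketch of that route (assuming docx4j-ImportXHTML is on the classpath; wordMLPackage and xhtml here stand in for your package and your well-formed XHTML):

MainDocumentPart main = wordMLPackage.getMainDocumentPart();
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Xhtml,
        xhtml.getBytes());
// docx4j-ImportXHTML converts the altChunks to real docx content,
// returning a new package you control before saving
WordprocessingMLPackage converted = main.convertAltChunks();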
If the same problem occurs when you try that, well, at least we can address it.
I was able to fix my issue using the following code, which I found at http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml. You can also generate your own custom code; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir") + "/";
public final static String DIR_OUT = System.getProperty("user.dir") + "/";

public static void main(String[] args) throws Exception
{
    String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
    List<BlockRange> blockRanges = new ArrayList<>();
    for (int i = 0; i < files.length; i++) {
        BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
        blockRanges.add(block);
        block.setStyleHandler(StyleHandler.RENAME_RETAIN);
        block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
        block.setRestartPageNumbering(false);
        block.setHeaderBehaviour(HfBehaviour.DEFAULT);
        block.setFooterBehaviour(HfBehaviour.DEFAULT);
        block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
    }
    // Perform the actual merge
    DocumentBuilder documentBuilder = new DocumentBuilder();
    WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
    // Save the result
    SaveToZipFile saver = new SaveToZipFile(output);
    saver.save(DIR_OUT + "OUT_MergeWholeDocumentsUsingBlockRange.docx");
}

Apache POI doesn't find highlighted text

I have a file saved in doc format, and I need to extract highlighted text.
I have code like the following:
HWPFDocument document = new HWPFDocument(fis);
Range r = document.getRange();
for (int i = 0; i < 5; i++) {
    CharacterRun t = r.getCharacterRun(i);
    System.out.println(t.isHighlighted());
    System.out.println(t.getHighlightedColor());
    System.out.println(r.getCharacterRun(i).SPRM_HIGHLIGHT);
    System.out.println(r.getCharacterRun(i));
}
None of the above methods shows that the text is highlighted, but when I open the file, it is highlighted.
What can be the reason, and how can I find out whether the text is highlighted or not?
Highlighting text in Word is possible using two different methods. The first is applying highlighting to text runs. The second is applying shading to words or paragraphs.
For the first, and using *.doc, the Word binary file format, Apache POI provides methods in CharacterRun. For the second, Apache POI provides Paragraph.getShading, but this is only set if the shading applies to the whole paragraph. If the shading is applied only to single runs, then Apache POI provides nothing for that, so using the underlying SprmOperations is needed.
Microsoft's documentation 2.6.1 Character Properties describes sprmCShd80 (0x4866), which is "A Shd80 structure that specifies the background shading for the text." So we need to search for that.
Example:
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.*;
import org.apache.poi.hwpf.sprm.*;
import java.lang.reflect.Field;

public class HWPFInspectBgColor {

    private static void showCharacterRunInternals(CharacterRun run) throws Exception {
        Field _chpx = CharacterRun.class.getDeclaredField("_chpx");
        _chpx.setAccessible(true);
        SprmBuffer sprmBuffer = (SprmBuffer) _chpx.get(run);
        for (SprmIterator sprmIterator = sprmBuffer.iterator(); sprmIterator.hasNext(); ) {
            SprmOperation sprmOperation = sprmIterator.next();
            System.out.println(sprmOperation);
        }
    }

    static SprmOperation getCharacterRunShading(CharacterRun run) throws Exception {
        SprmOperation shd80Operation = null;
        Field _chpx = CharacterRun.class.getDeclaredField("_chpx");
        _chpx.setAccessible(true);
        Field _value = SprmOperation.class.getDeclaredField("_value");
        _value.setAccessible(true);
        SprmBuffer sprmBuffer = (SprmBuffer) _chpx.get(run);
        for (SprmIterator sprmIterator = sprmBuffer.iterator(); sprmIterator.hasNext(); ) {
            SprmOperation sprmOperation = sprmIterator.next();
            short sprmValue = (short) _value.get(sprmOperation);
            if (sprmValue == (short) 0x4866) { // we have a Shd80 structure, see https://msdn.microsoft.com/en-us/library/dd947480(v=office.12).aspx
                shd80Operation = sprmOperation;
            }
        }
        return shd80Operation;
    }

    public static void main(String[] args) throws Exception {
        HWPFDocument document = new HWPFDocument(new FileInputStream("sample.doc"));
        Range range = document.getRange();
        for (int p = 0; p < range.numParagraphs(); p++) {
            Paragraph paragraph = range.getParagraph(p);
            System.out.println(paragraph);
            if (!paragraph.getShading().isEmpty()) {
                System.out.println("Paragraph's shading: " + paragraph.getShading());
            }
            for (int r = 0; r < paragraph.numCharacterRuns(); r++) {
                CharacterRun run = paragraph.getCharacterRun(r);
                System.out.println(run);
                if (run.isHighlighted()) {
                    System.out.println("Run's highlighted color: " + run.getHighlightedColor());
                }
                if (getCharacterRunShading(run) != null) {
                    System.out.println("Run's Shd80 structure: " + getCharacterRunShading(run));
                }
            }
        }
    }
}

ArrayList<String> in PDF from a new row

I want to write some survey answers into a PDF from Java. I tried different methods, with a StringBuffer and without, but I always see the text in the PDF in one row.
public void writePdf(OutputStream outputStream) throws Exception {
    Paragraph paragraph = new Paragraph();
    Document document = new Document();
    PdfWriter.getInstance(document, outputStream);
    document.open();
    document.addTitle("Survey PDF");
    ArrayList nameArrays = new ArrayList();
    StringBuffer sb = new StringBuffer();
    int i = -1;
    for (String properties : textService.getAnswer()) {
        nameArrays.add(properties);
        i++;
    }
    for (int a = 0; a <= i; a++) {
        System.out.println("nameArrays.get(a) -" + nameArrays.get(a));
        sb.append(nameArrays.get(a));
    }
    paragraph.add(sb.toString());
    document.add(paragraph);
    document.close();
}
textService.getAnswer() is an ArrayList<String>.
Could you please advise how to separate the text so that each new sentence starts on a new row?
Right now, all of the text appears in one row.
You forgot the newline character \n and your code seems a bit overcomplicated.
Try this:
StringBuffer sb = new StringBuffer();
for (String property : textService.getAnswer()) {
    sb.append(property);
    sb.append('\n');
}
What about:
nameArrays.add(properties + "\n");
You might be able to fix that by simply appending "\n" to the strings that you are collecting in your list; but I think that very much depends on the PDF library you are using.
You see, "newlines" or "paragraphs" are to a certain degree about formatting. It seems like a conceptual problem to add that "formatting" information to the data that you are processing.
Meaning: you might want to check whether your library allows you to provide plain strings, and then have the library do the formatting for you!
In other words: instead of giving strings with newlines, check whether you can keep using strings without newlines and have the PDF library add line breaks where appropriate.
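For example, with the iText classes already used in the question, adding each answer as its own Paragraph lets the library handle the line breaks (a minimal sketch):

for (String answer : textService.getAnswer()) {
    document.add(new Paragraph(answer)); // each Paragraph starts on a new line
}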
Side note on code quality: you are using raw types:
ArrayList nameArrays = new ArrayList();
should better be
ArrayList<String> names = new ArrayList<>();
[ I also changed the name - there is no point in putting the type of a collection into the variable name! ]
This method saves the values in an ArrayList into a PDF document. In the mFilePath variable you can insert a folder name, for example "/example/".
For mFileName I use the date and time at which the document is created; don't use a static name, otherwise your values will keep overwriting the same PDF.
private void savePDF()
{
    com.itextpdf.text.Document mDoc = new com.itextpdf.text.Document();
    // note the lowercase pattern letters; "YYYY-MM-DD-HH-MM-SS" would mix up
    // week-year, day-of-year, and minutes vs. months
    String mFileName = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss", Locale.getDefault()).format(System.currentTimeMillis());
    String mFilePath = Environment.getExternalStorageDirectory() + "/" + mFileName + ".pdf";
    try
    {
        PdfWriter.getInstance(mDoc, new FileOutputStream(mFilePath));
        mDoc.open();
        // "answers" is the ArrayList<String> to save; "g" is its element count
        for (int d = 0; d < g; d++)
        {
            String mtext = answers.get(d);
            mDoc.add(new Paragraph(mtext));
        }
        mDoc.close();
    }
    catch (Exception e)
    {
        e.printStackTrace(); // don't swallow exceptions silently
    }
}

Solr custom Tokenizer Factory works randomly

I am new to Solr and I have to write a filter that lemmatizes text, both when indexing documents and when handling queries.
I created a custom Tokenizer Factory that lemmatizes text before passing it to the Standard Tokenizer.
Tests in Solr's analysis section work fairly well (on index it is OK, though on query it sometimes analyzes the text twice). But when indexing documents it analyzes only the first document, and on queries it behaves erratically: it only analyzes the first one, and to have another analyzed you have to wait a while. It's not a performance problem, because I tried simply modifying the text instead of lemmatizing.
Here is the code:
package test.solr.analysis;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
//import test.solr.analysis.TestLemmatizer;

public class TestLemmatizerTokenizerFactory extends TokenizerFactory {
    //private TestLemmatizer lemmatizer = new TestLemmatizer();
    private final int maxTokenLength;

    public TestLemmatizerTokenizerFactory(Map<String,String> args) {
        super(args);
        assureMatchVersion();
        maxTokenLength = getInt(args, "maxTokenLength", StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    public String readFully(Reader reader) {
        char[] arr = new char[8 * 1024]; // 8K at a time
        StringBuffer buf = new StringBuffer();
        int numChars;
        try {
            while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
                buf.append(arr, 0, numChars);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("### READFULLY ### => " + buf.toString());
        /*
        The original return with lemmatized text would be this:
        return lemmatizer.getLemma(buf.toString());
        To test it I only change the text by adding the word "lemmatized"
        */
        return buf.toString() + " lemmatized";
    }

    @Override
    public StandardTokenizer create(AttributeFactory factory, Reader input) {
        // I print this to see when the tokenizer is entered
        System.out.println("### Standard tokenizer ###");
        StandardTokenizer tokenizer = new StandardTokenizer(luceneMatchVersion, factory, new StringReader(readFully(input)));
        tokenizer.setMaxTokenLength(maxTokenLength);
        return tokenizer;
    }
}
With this, it only indexes the first text, adding the word "lemmatized" to it.
Then, on the first query, if I search for the word "example", it looks for "example" and "lemmatized", so it returns the first document.
On subsequent searches it doesn't modify the query. To get a new query with the word "lemmatized" added, I have to wait a few minutes.
What is happening?
Thank you all.
I highly doubt that the create method is invoked on each query (for starters, performance issues come to mind). I would take the safe route: create a Tokenizer that wraps a StandardTokenizer, then just override the setReader method and do my work there.
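Note that, depending on the Lucene version, setReader may be declared final on Tokenizer. Lucene's purpose-built hook for rewriting the character stream before any tokenizer runs is a CharFilter, so that is an alternative worth naming. A rough sketch (it buffers the whole input, uses a placeholder lemmatize method, and does not correct offsets):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharFilter;

public class LemmatizingCharFilter extends CharFilter {
    private Reader transformed; // built lazily from the whole input

    public LemmatizingCharFilter(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (transformed == null) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[8 * 1024];
            int n;
            while ((n = input.read(buf, 0, buf.length)) > 0) {
                sb.append(buf, 0, n);
            }
            transformed = new StringReader(lemmatize(sb.toString()));
        }
        return transformed.read(cbuf, off, len);
    }

    @Override
    protected int correct(int currentOff) {
        return currentOff; // offsets are not corrected in this sketch
    }

    private String lemmatize(String s) {
        return s + " lemmatized"; // placeholder for the real lemmatizer
    }
}

Registered through a CharFilterFactory in the field type's analyzer chain, it would run on every analysis, both index- and query-side.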

Extracting anchor tag from html using Java

I have several anchor tags in a text,
Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a>
Output:
http://stackoverflow.com
How can I find all those input strings and convert them to the output string in Java, without using a 3rd-party API?
There are classes in the core API that you can use to get all href attributes from anchor tags (if present!):
import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class HtmlParseDemo {
    public static void main(String[] args) throws Exception {
        String html =
                "<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> " +
                "<!-- " +
                "<a href=\"http://ignoreme.com\" >...</a> " +
                "--> " +
                "<a href=\"http://www.google.com\" >Take me to Google</a> " +
                "<a>NOOOoooo!</a> ";
        Reader reader = new StringReader(html);
        HTMLEditorKit.Parser parser = new ParserDelegator();
        final List<String> links = new ArrayList<String>();
        parser.parse(reader, new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    Object link = a.getAttribute(HTML.Attribute.HREF);
                    if (link != null) {
                        links.add(String.valueOf(link));
                    }
                }
            }
        }, true);
        reader.close();
        System.out.println(links);
    }
}
which will print:
[http://stackoverflow.com, http://www.google.com]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {
    // the scraped example lost its markup; the test string presumably
    // contained anchor tags like these
    String test = "qazwsx<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a>fdgfdhgfd"
            + "<a href=\"http://stackoverflow2.com\" >Take me to StackOverflow2</a>dcgdf";
    String regex = "<a href=(\"[^\"]*\")[^<]*</a>";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(test);
    System.out.println(m.replaceAll("$1"));
}
NOTE: All of Andrzej Doyle's points are valid, and if you have more than a simple <a href="X">Y</a> in your input, and you are sure it is parsable HTML, then you are better off with an HTML parser.
To summarize:
The regex I posted doesn't work if you have an <a> inside a comment. (You can treat it as a special case.)
It doesn't work if you have other attributes in the <a> tag. (Again, you can treat it as a special case.)
There are many other cases the regex won't handle, and you cannot cover all of them with a regex, since HTML is not a regular language.
However, if your requirement is always to replace <a href="X">Y</a> with "X" without considering the context, then the code I've posted will work.
You can use JSoup
String html = "<p>An <a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href"); // "http://stackoverflow.com"
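select("a").first() picks only the first anchor; since the question asks for all of them, you can iterate the selection instead (a small sketch using the same Jsoup API):

// all links rather than just the first
for (Element link : doc.select("a[href]")) {
    System.out.println(link.attr("href"));
}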
The above example works perfectly. If you want to parse an HTML document, say, instead of concatenated strings, write something like this to complement the code above.
The existing code is modified slightly: HtmlParser.java (HtmlParseDemo.java above) now gets its input from the complementary HtmlPage.java below. The content of the HtmlPage.properties file is at the bottom of this page.
The main.url property in the HtmlPage.properties file is:
main.url=http://www.whatever.com/
That way you can just parse the URL you're after. :-)
Happy coding :-D
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlParser
{
    public static void main(String[] args) throws Exception
    {
        String html = HtmlPage.getPage();
        Reader reader = new StringReader(html);
        HTMLEditorKit.Parser parser = new ParserDelegator();
        final List<String> links = new ArrayList<String>();
        parser.parse(reader, new HTMLEditorKit.ParserCallback()
        {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
            {
                if (t == HTML.Tag.A)
                {
                    Object link = a.getAttribute(HTML.Attribute.HREF);
                    if (link != null)
                    {
                        links.add(String.valueOf(link));
                    }
                }
            }
        }, true);
        reader.close();
        // create the header
        System.out.println("<html>\n<head>\n  <title>Link City</title>\n</head>\n<body>");
        // spit out the links and create href
        for (String l : links)
        {
            System.out.print("  " + l + "\n");
        }
        // create footer
        System.out.println("</body>\n</html>");
    }
}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ResourceBundle;

public class HtmlPage
{
    public static String getPage()
    {
        StringWriter sw = new StringWriter();
        ResourceBundle bundle = ResourceBundle.getBundle(HtmlPage.class.getName().toString());
        try
        {
            URL url = new URL(bundle.getString("main.url"));
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");
            connection.setDoOutput(true);
            InputStream content = (InputStream) connection.getInputStream();
            BufferedReader in = new BufferedReader(new InputStreamReader(content));
            String line;
            while ((line = in.readLine()) != null)
            {
                sw.append(line).append("\n");
            }
        } catch (Exception e)
        {
            e.printStackTrace();
        }
        return sw.getBuffer().toString();
    }
}
For example, this will output links from http://ebay.com.au/ if viewed in a browser.
This is a subset, as there are a lot of links
Link City
#mainContent
http://realestate.ebay.com.au/
The most robust way (as has been suggested already) is to use regular expressions (java.util.regex), if you are required to build this without using 3rd-party libs.
The alternative is to parse the html as XML, either using a SAX parser to capture and handle each instance of an "a" element or as a DOM Document and then searching it using XPATH (see http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/package-summary.html). This is problematic though, since it requires the HTML page to be fully XML compliant in markup, a very dangerous assumption and not an approach I would recommend since most "real" html pages are not XML compliant.
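For reference, a minimal sketch of that DOM + XPath variant, under the stated (and risky) assumption that the input is well-formed XHTML:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathAnchorDemo {
    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><a href=\"http://stackoverflow.com\">Take me to StackOverflow</a></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // every href attribute of every <a> element
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < hrefs.getLength(); i++) {
            System.out.println(hrefs.item(i).getNodeValue());
        }
    }
}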
Still, I would recommend also looking at existing frameworks out there built for this purpose (like JSoup, also mentioned above). No need to reinvent the wheel.
