Copy contents from docx with bullets intact with Apache POI

I am trying to copy contents from a docx file to the clipboard eventually. The code I have come up with so far is:
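package config;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.xmlbeans.XmlException;

public class buffer {
    public static void main(String[] args) throws IOException, XmlException {
        XWPFDocument srcDoc = new XWPFDocument(new FileInputStream("D:\\rules.docx"));
        XWPFDocument destDoc = new XWPFDocument();
        OutputStream out = new FileOutputStream("D:\\test.docx");
        for (IBodyElement bodyElement : srcDoc.getBodyElements()) {
            // Assumes the body contains only paragraphs; a table would fail this cast
            XWPFParagraph srcPr = (XWPFParagraph) bodyElement;
            // Create an empty paragraph in the destination, then overwrite it with the source one
            XWPFParagraph dstPr = destDoc.createParagraph();
            dstPr.createRun();
            int pos = destDoc.getParagraphs().size() - 1;
            destDoc.setParagraph(srcPr, pos);
        }
        destDoc.write(out);
        out.close();
    }
}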
This does fetch the bullets but numbers them. I want to retain the original bullet format. Is there a way to do this?

You'll need to handle the numbering definition (in the numbering part) correctly.
The most reliable approach is to copy the definition (both the numbering instance and the abstract definition it refers to) across and renumber it (i.e. give it a new ID) so that it is unique in the destination.
Then, of course, you'll need to update the IDs in your paragraphs to match.
Note that the above is a solution only for the question you have asked.
You'll run into problems if your content contains a relationship to some other part (e.g. an image). And you're not handling the style definitions, etc.
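The renumbering step can be sketched without any POI types. What follows is a library-independent model, not POI code: NumberingIdRemapper and remap are hypothetical names, and plain integers stand in for the numbering IDs. It gives each source ID the lowest ID not yet used in the destination and returns the mapping you would then apply to the copied paragraphs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: models only the ID bookkeeping, not the docx parts themselves.
final class NumberingIdRemapper {

    // Maps every source numbering ID to an ID that is unused in the destination.
    static Map<Integer, Integer> remap(List<Integer> sourceIds, Set<Integer> usedInDestination) {
        // Work on a copy so the caller's set is left untouched
        Set<Integer> used = new TreeSet<>(usedInDestination);
        Map<Integer, Integer> mapping = new LinkedHashMap<>();
        int candidate = 1;
        for (Integer sourceId : sourceIds) {
            if (mapping.containsKey(sourceId)) {
                continue; // this source ID already has a new ID
            }
            while (used.contains(candidate)) {
                candidate++;
            }
            mapping.put(sourceId, candidate);
            used.add(candidate);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Source paragraphs reference IDs 1, 2 and 5; the destination already uses 1 and 3
        Map<Integer, Integer> mapping = remap(List.of(1, 2, 1, 5), Set.of(1, 3));
        System.out.println(mapping);
    }
}
```

The same mapping has to be applied twice: once when the copied definitions are written to the destination's numbering part, and once when the ID stored on each copied paragraph is updated.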

Related

apache pdfbox - how to test if a document is flattened?

I have written the following small Java main method. It takes a (hardcoded for testing purposes!) PDF document that I know contains active form elements and flattens it.
public static void main(String[] args) {
    try {
        // for testing
        Tika tika = new Tika();
        String filePath = "<path-to>/<pdf-document-with-active-elements>.pdf";
        String fileName = filePath.substring(0, filePath.length() - 4);
        File file = new File(filePath);
        if (tika.detect(file).equalsIgnoreCase("application/pdf")) {
            PDDocument pdDocument = PDDocument.load(file);
            PDAcroForm pdAcroForm = pdDocument.getDocumentCatalog().getAcroForm();
            if (pdAcroForm != null) {
                pdAcroForm.flatten();
                pdAcroForm.refreshAppearances();
                pdDocument.save(fileName + "-flattened.pdf");
            }
            pdDocument.close();
        }
    }
    catch (Exception e) {
        System.err.println("Exception: " + e.getLocalizedMessage());
    }
}
What kind of test would assert the File(<path-to>/<pdf-document-with-active-elements>-flattened.pdf) generated by this code would, in fact, be flat?

What kind of test would assert that the file generated by this code would, in fact, be flat?
Load that document anew and check whether it has any form fields in its PDAcroForm (if there is a PDAcroForm at all).
If you want to be thorough, also iterate through the pages and make sure that there are no Widget annotations associated with them anymore.
And to really be thorough, additionally determine the field positions and contents before flattening and apply text extraction at those positions to the flattened pdf. This verifies that the form has not merely been dropped but indeed flattened.

PDF Generation using iText7 with Java

I am trying to add content to an existing PDF using iText7. I have been able to create new PDFs and add content to them using Paragraphs and Tables. However, once I go to reopen a PDF that I have created and attempt to write more content to it, the new content starts overwriting the old content. I want the new content to be appended to the Document after the old content. How can I achieve this?
Edit
This is the Class which sets up some common methods that will be executed with each change done to a PDF document.
public class PDFParent {
    private static Document document;
    private static PdfWriter writer;
    private static PdfReader reader;
    private static PageSize ps;
    private static PdfDocument pdfDoc;

    public static Document getDocument() {
        return document;
    }

    public static void setDocument(Document document) {
        PDFParent.document = document;
    }

    public static void setupPdf(byte[] inParamInPDFBinary){
        writer = new PdfWriter(new ByteArrayOutputStream());
        try {
            reader = new PdfReader(new ByteArrayInputStream(inParamInPDFBinary));
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdfDoc = new PdfDocument(reader, writer);
        ps = PageSize.A4;
        document = new Document(pdfDoc, ps);
    }

    public static byte[] writePdf(){
        ByteArrayOutputStream stream = (ByteArrayOutputStream) writer.getOutputStream();
        return stream.toByteArray();
    }

    public static void closePdf(){
        pdfDoc.close();
    }
}
And this is how I am adding the content to the pdf
public class ActAddParagraphToPDF extends PDFParent{
    // output parameters
    public static byte[] outParamOutPDFBinary;

    public static ActAddParagraphToPDF mosAddParagraphToPDF(byte[] inParamInPDFBinary, String inParamParagraph) throws IOException {
        ActAddParagraphToPDF result = new ActAddParagraphToPDF();
        setupPdf(inParamInPDFBinary);
        //---------------------begin content-------------------//
        getDocument().add((Paragraph) new Paragraph(inParamParagraph));
        //---------------------end content-------------------//
        closePdf();
        outParamOutPDFBinary = writePdf();
        return result;
    }
}
When I go to execute this second class, it appears to be treating the original document as if it is blank. Then writes the new Paragraph on top of the original content. I know that I am missing something, just not sure what that is.

Is reopening the document every time a requirement? If you keep the document open, you can append as much content as you want and you won't have to deal with content overlapping problems.
If it is a requirement, then you will have to track the last free content position yourself and set it on the new DocumentRenderer each time you reopen the document.
A Rectangle would be enough to store the free area that is left on the last page. Right before closing the document, save the free area in some Rectangle in the following way:
Rectangle savedBbox = document.getRenderer().getCurrentArea().getBBox();
After that, when you have to reopen the document, first jump to the last page:
document.add(new AreaBreak(AreaBreakType.LAST_PAGE));
Then restore the free area that was left over the last time you worked with the document:
document.getRenderer().getCurrentArea().setBBox(savedBbox);
After that you are free to add new content to the document and it will appear at the saved position:
document.add(new Paragraph("Hello again"));
Please note that this approach works if you know which documents you are dealing with (i.e. you can associate last "free" position with the document's ID) and this document is not changed outside of your environment. If this is not the case, I recommend that you look into content extraction and in particular PdfDocumentContentParser. It can help you to extract the content you have on the page and determine which positions it occupies. Then you can calculate the free area on a page and use document.getRenderer().getCurrentArea().setBBox approach I described above to point DocumentRenderer to the correct place to write content to.
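The "calculate the free area on a page" step can be sketched independently of iText. This is a model under stated assumptions, not iText code: FreeAreaCalculator and its nested Box class are hypothetical names, the occupied boxes are assumed to come from whatever content extraction you run, the origin is taken to be the bottom-left corner as in PDF user space, and content is assumed to flow top-down in a single column.

```java
import java.util.List;

// Hypothetical helper: computes the area left below the lowest piece of existing content.
final class FreeAreaCalculator {

    // Minimal stand-in for a PDF rectangle (x, y is the lower-left corner).
    static final class Box {
        final float x, y, width, height;

        Box(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Returns the part of contentArea that lies below every occupied box.
    static Box freeAreaBelow(Box contentArea, List<Box> occupied) {
        // With no content at all, the free area reaches up to the top of the content area
        float lowest = contentArea.y + contentArea.height;
        for (Box box : occupied) {
            lowest = Math.min(lowest, box.y);
        }
        // Content that extends below the bottom margin leaves no free space
        lowest = Math.max(lowest, contentArea.y);
        return new Box(contentArea.x, contentArea.y, contentArea.width, lowest - contentArea.y);
    }

    public static void main(String[] args) {
        Box contentArea = new Box(36f, 36f, 523f, 770f);
        List<Box> occupied = List.of(new Box(36f, 700f, 523f, 106f), new Box(36f, 650f, 300f, 40f));
        Box free = freeAreaBelow(contentArea, occupied);
        System.out.println(free.x + " " + free.y + " " + free.width + " " + free.height);
    }
}
```

The resulting box plays the role of savedBbox in the snippets above: it is the rectangle you would hand to the renderer before adding new content.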

Removing an XWPFParagraph keeps the paragraph symbol (¶) for it

I am trying to remove a set of contiguous paragraphs from a Microsoft Word document, using Apache POI.
From what I have understood, deleting a paragraph is possible by removing all of its runs, this way:
/*
 * Deletes the given paragraph.
 */
public static void deleteParagraph(XWPFParagraph p) {
    if (p != null) {
        List<XWPFRun> runs = p.getRuns();
        // Delete all the runs
        for (int i = runs.size() - 1; i >= 0; i--) {
            p.removeRun(i);
        }
        p.setPageBreak(false); // Remove the page break, if any
    }
}
It does work, but there's something strange. The block of removed paragraphs does not disappear from the document; it is converted into a set of empty lines. It's as if every paragraph were converted into an empty line.
By printing the paragraphs' content from code I can see, in fact, a space (one for each paragraph removed). Looking at the document itself with formatting marks shown, I can see a vertical column of ¶ marks that corresponds to the block of deleted elements.
Do you have any idea why? I'd like my paragraphs to be completely removed.
I also tried replacing the text (with setText()) and removing any spacing that might be added automatically, this way:
p.setSpacingAfter(0);
p.setSpacingAfterLines(0);
p.setSpacingBefore(0);
p.setSpacingBeforeLines(0);
p.setIndentFromLeft(0);
p.setIndentFromRight(0);
p.setIndentationFirstLine(0);
p.setIndentationLeft(0);
p.setIndentationRight(0);
But with no luck.

I would delete paragraphs by deleting the paragraphs themselves, not by deleting only the runs in these paragraphs. Deleting paragraphs is not part of the Apache POI high-level API. But using XWPFDocument.getDocument().getBody() we can get the low-level CTBody, which has a removeP(int i) method.
Example:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import java.awt.Desktop;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;

public class WordRemoveParagraph {
    /*
     * Deletes the given paragraph.
     */
    public static void deleteParagraph(XWPFParagraph p) {
        XWPFDocument doc = p.getDocument();
        int pPos = doc.getPosOfParagraph(p);
        //doc.getDocument().getBody().removeP(pPos);
        doc.removeBodyElement(pPos);
    }

    public static void main(String[] args) throws IOException, InvalidFormatException {
        XWPFDocument doc = new XWPFDocument(new FileInputStream("source.docx"));
        int pNumber = doc.getParagraphs().size() - 1;
        while (pNumber >= 0) {
            XWPFParagraph p = doc.getParagraphs().get(pNumber);
            if (p.getParagraphText().contains("delete")) {
                deleteParagraph(p);
            }
            pNumber--;
        }
        FileOutputStream out = new FileOutputStream("result.docx");
        doc.write(out);
        out.close();
        doc.close();
        System.out.println("Done");
        Desktop.getDesktop().open(new File("result.docx"));
    }
}
This deletes all paragraphs from the document source.docx where the text contains "delete" and saves the result in result.docx.
Edited:
Although doc.getDocument().getBody().removeP(pPos); works, it will not update the XWPFDocument's paragraphs list. So it will break paragraph iterators and other accesses to that list, because the list is only rebuilt when the document is read again.
So the better approach is to use doc.removeBodyElement(pPos); instead. removeBodyElement(int pos) does exactly the same as doc.getDocument().getBody().removeP(pos); if pos points to a paragraph in the document body, since that paragraph is a BodyElement too. But in addition, it will update the XWPFDocument's paragraphs list.
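The main method in the example walks the paragraph list from the last index down to zero. That order matters whenever elements are removed by index inside a loop, and a plain list is enough to show why. BackwardRemoval and removeMatching are hypothetical names used only for this illustration; no POI types are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of index-based removal on a plain list.
final class BackwardRemoval {

    // Removes every element containing the marker, iterating from the end so that
    // removing index i never shifts an element that has not been visited yet.
    static void removeMatching(List<String> items, String marker) {
        for (int i = items.size() - 1; i >= 0; i--) {
            if (items.get(i).contains(marker)) {
                items.remove(i);
            }
        }
    }

    public static void main(String[] args) {
        List<String> paragraphs = new ArrayList<>(List.of("keep a", "delete b", "delete c", "keep d"));
        removeMatching(paragraphs, "delete");
        System.out.println(paragraphs);
    }
}
```

A forward loop that increments the index after every removal would skip "delete c" here, because removing "delete b" moves it into the slot that was just visited.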

When you are inside of a table you need to use the functions of the XWPFTableCell instead of the XWPFDocument:
cell.removeParagraph(cell.getParagraphs().indexOf(para));

Extracting an embedded object from a pdf

I had embedded a byte array into a pdf file (Java).
Now I am trying to extract that same array.
The array was embedded as a "MOVIE" file.
I couldn't find any clue on how to do that...
Any ideas?
Thanks!
EDIT
I used this code to embed the byte array:
public static void pack(byte[] file) throws IOException, DocumentException{
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
    writer.setPdfVersion(PdfWriter.PDF_VERSION_1_7);
    writer.addDeveloperExtension(PdfDeveloperExtension.ADOBE_1_7_EXTENSIONLEVEL3);
    document.open();
    RichMediaAnnotation richMedia = new RichMediaAnnotation(writer, new Rectangle(0,0,0,0));
    PdfFileSpecification fs
        = PdfFileSpecification.fileEmbedded(writer, null, "test.avi", file);
    PdfIndirectReference asset = richMedia.addAsset("test.avi", fs);
    RichMediaConfiguration configuration = new RichMediaConfiguration(PdfName.MOVIE);
    RichMediaInstance instance = new RichMediaInstance(PdfName.MOVIE);
    RichMediaParams flashVars = new RichMediaParams();
    instance.setAsset(asset);
    configuration.addInstance(instance);
    RichMediaActivation activation = new RichMediaActivation();
    richMedia.setActivation(activation);
    PdfAnnotation richMediaAnnotation = richMedia.createAnnotation();
    richMediaAnnotation.setFlags(PdfAnnotation.FLAGS_PRINT);
    writer.addAnnotation(richMediaAnnotation);
    document.close();
}

I have written a brute force method to extract all streams in a PDF and store them as a file without an extension:
public static final String SRC = "resources/pdfs/image.pdf";
public static final String DEST = "results/parse/stream%s";

public static void main(String[] args) throws IOException {
    File file = new File(DEST);
    file.getParentFile().mkdirs();
    new ExtractStreams().parse(SRC, DEST);
}

public void parse(String src, String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    PdfObject obj;
    for (int i = 1; i <= reader.getXrefSize(); i++) {
        obj = reader.getPdfObject(i);
        if (obj != null && obj.isStream()) {
            PRStream stream = (PRStream)obj;
            byte[] b;
            try {
                b = PdfReader.getStreamBytes(stream);
            }
            catch(UnsupportedPdfException e) {
                b = PdfReader.getStreamBytesRaw(stream);
            }
            FileOutputStream fos = new FileOutputStream(String.format(dest, i));
            fos.write(b);
            fos.flush();
            fos.close();
        }
    }
}
Note that I get all PDF objects that are streams as a PRStream object. I also use two different methods:
When I use PdfReader.getStreamBytes(stream), iText will look at the filter. For instance: page content streams consist of PDF syntax that is compressed using /FlateDecode. By using PdfReader.getStreamBytes(stream), you will get the uncompressed PDF syntax.
Not all filters are supported in iText. Take for instance /DCTDecode which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that's when we use PdfReader.getStreamBytesRaw(stream) which is also the method you need to get your AVI-bytes from your PDF.
This example already gives you the methods you'll certainly need to extract PDF streams. Now it's up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: All links of existing pdf change the action property to inherit zoom - iText library
You loop over the page dictionaries, then loop over the /Annots array of this dictionary (if it's present), but instead of checking for /Link annotations (which is what was asked in the question I refer to), you have to check for /RichMedia annotations and from there examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.
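To make the difference between decoded and raw stream bytes concrete: /FlateDecode is zlib/deflate compression, which the JDK can apply and undo on its own. The sketch below uses only java.util.zip and no iText classes; FlateDemo and its two methods are hypothetical names, and the sample content stream is made up for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration of what decoding a /FlateDecode stream amounts to.
final class FlateDemo {

    // Compresses data the way a /FlateDecode stream is stored (zlib format).
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Turns the raw (compressed) bytes back into the original content.
    static byte[] inflate(byte[] raw) {
        Inflater inflater = new Inflater();
        inflater.setInput(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("Truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("Not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] raw = deflate(content);     // comparable to the raw stream bytes
        byte[] decoded = inflate(raw);     // comparable to the decoded stream bytes
        System.out.println(new String(decoded, StandardCharsets.US_ASCII));
    }
}
```

For an embedded file such as the AVI, the bytes you want are the file's own bytes, so check in RUPS which filter, if any, the stream dictionary declares before deciding which of the two iText methods to call.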

Replace fonts in a PDF using iText (Java)

I'd like to convert all the fonts, embedded or otherwise, of a PDF to another font using iText. I understand that line-height, kerning and a bunch of other things would be bungled up, but I truly don't mind how ugly the output is.
I have seen how to embed fonts into existing pdfs here, but I don't know how to set ALL EXISTING text in the document to that font.
I understand that this isn't as straightforward as I make it out to be. Perhaps it would be easier just to take all the raw text from the document, and create a new document using the new font (again, layout/readability is a non-issue to me)

The example EmbedFontPostFacto.java from chapter 16 of iText in Action — 2nd Edition shows how to embed an originally not embedded font. The central method is this:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    // the font file
    RandomAccessFile raf = new RandomAccessFile(FONT, "r");
    byte fontfile[] = new byte[(int)raf.length()];
    raf.readFully(fontfile);
    raf.close();
    // create a new stream for the font file
    PdfStream stream = new PdfStream(fontfile);
    stream.flateCompress();
    stream.put(PdfName.LENGTH1, new PdfNumber(fontfile.length));
    // create a reader object
    PdfReader reader = new PdfReader(RESULT1);
    int n = reader.getXrefSize();
    PdfObject object;
    PdfDictionary font;
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(RESULT2));
    PdfName fontname = new PdfName(FONTNAME);
    for (int i = 0; i < n; i++) {
        object = reader.getPdfObject(i);
        if (object == null || !object.isDictionary())
            continue;
        font = (PdfDictionary)object;
        if (PdfName.FONTDESCRIPTOR.equals(font.get(PdfName.TYPE))
            && fontname.equals(font.get(PdfName.FONTNAME))) {
            PdfIndirectObject objref = stamper.getWriter().addToBody(stream);
            font.put(PdfName.FONTFILE2, objref.getIndirectReference());
        }
    }
    stamper.close();
    reader.close();
}
This (without the fontname.equals(font.get(PdfName.FONTNAME)) test) may be a starting point for the simple cases of your task.
You'll have to do quite a lot of tests concerning encoding and add some individual translations for a more generic solution. You may want to study section 9 Text of the PDF specification ISO 32000-1 for this.
