Apache POI (Java) : Display embedded files on Microsoft Word (.docx)


It's my first time posting here :)
I want to use Apache POI to embed a file inside a .docx and reference it via an icon or a link inside the document.
I think I've managed to embed the file.
My problem: I can't display a reference to the embedded file.
To illustrate my problem:
With the following code, I've managed to embed the file "first.txt" inside myWord.docx at the location "/word/embeddings/first.txt".
I don't know how to reference it.
This is my code:
public void saveToDocx(OutputStream myOutputStream) {
    Resource r = new ClassPathResource("/myWord.docx");
    try (FileInputStream fis = new FileInputStream(r.getFile())) {
        OPCPackage pkg = OPCPackage.open(fis);
        XWPFDocument docx = new XWPFDocument(pkg);
        // first.txt: create the embedded part and write its content
        final PackagePartName partName = PackagingURIHelper.createPartName("/word/embeddings/first.txt");
        final PackagePart pkgPart = pkg.createPart(partName, "application/vnd.openxmlformats-officedocument.oleobject");
        final OutputStream partOutputStream = pkgPart.getOutputStream();
        partOutputStream.write("test test test".getBytes());
        partOutputStream.close();
        pkgPart.addRelationship(partName, TargetMode.INTERNAL, "http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject");
        // image.jpg
        String imageName = "C:/image.jpg";
        // add a simple picture to my document
        XWPFParagraph myParagraph = docx.createParagraph();
        XWPFRun run = myParagraph.createRun();
        // the image stream must stay open until addPicture has read it
        try (InputStream imageIS = new FileInputStream(imageName)) {
            run.addPicture(imageIS, XWPFDocument.PICTURE_TYPE_JPEG, imageName, Units.toEMU(77.25), Units.toEMU(49.5));
        }
        docx.write(myOutputStream);
        //pkg.save(myOutputStream); is there a difference between this and docx.write?
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Can someone please help me? I've been stuck on this since last Friday. Thanks, guys!
(Please forgive my grammar, I'm not a native speaker :/)
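Since a .docx is a ZIP package, one way to check that the part really ended up at "/word/embeddings/first.txt" is to list the ZIP entries of the written file. The following is only a sketch using the JDK's java.util.zip; the class DocxPartLister and the helper sampleZip are hypothetical names, and sampleZip just builds a stand-in archive so the example runs without a real document. Note that ZIP entry names carry no leading slash.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxPartLister {

    // Returns the names of all ZIP entries whose name starts with the given prefix.
    static List<String> listParts(InputStream docx, String prefix) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().startsWith(prefix)) {
                    names.add(e.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Hypothetical stand-in for a real .docx: a ZIP with the given entry names.
    static byte[] sampleZip(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("test test test".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] pkg = sampleZip("[Content_Types].xml", "word/document.xml", "word/embeddings/first.txt");
        // Prints the entries found under word/embeddings/
        System.out.println(listParts(new ByteArrayInputStream(pkg), "word/embeddings/"));
    }
}
```

With a real file, the stream passed to listParts would be a FileInputStream on the document that was written. This only confirms that the part exists; it says nothing about whether the document body references it.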

Related

Convert word to pdf java

I'm trying to convert Word to PDF. My code is:
public static void main(String[] args) {
    try {
        XWPFDocument document = new XWPFDocument();
        document.createStyles();
        XWPFParagraph paragraph = document.createParagraph();
        XWPFRun title = paragraph.createRun();
        title.setText("gLETS GO");
        PdfOptions options = PdfOptions.create();
        OutputStream out = new FileOutputStream(new File("C:/Users/pepe/Desktop/DocxToPdf1.pdf"));
        PdfConverter.getInstance().convert(document, out, options);
        System.out.println("Done");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I'm getting this error:
fr.opensagres.poi.xwpf.converter.core.XWPFConverterException: org.apache.xmlbeans.XmlException: error: Unexpected end of file after null
at fr.opensagres.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:71)
at fr.opensagres.poi.xwpf.converter.pdf.PdfConverter.doConvert(PdfConverter.java:39)
Caused by: org.apache.xmlbeans.XmlException: error: Unexpected end of file
I have tried other solutions, but they don't work. I created a Java project; can someone help me, or suggest another way to do this?

This is probably a duplicate of "Trying to make simple PDF document with Apache poi". But let's have a complete example again that shows how to create a new XWPFDocument from scratch using the latest apache poi 4.1.2, which can then be converted to PDF using PdfConverter of fr.opensagres.poi.xwpf.converter version 2.0.2 and iText.
As mentioned there, the default *.docx documents created by apache poi lack some content that PdfConverter needs.
There must be a styles document, even if it is empty.
And there must be section properties for the page having at least the page size set. To fulfill this, we must add some additional code to our program. Unfortunately, this then needs the full jar of all of the schemas, ooxml-schemas-1.4.jar, as mentioned in Faq-N10025.
And because we need to change the underlying low-level objects, the document must be written so that the underlying objects are committed. Otherwise the XWPFDocument which we hand over to the PdfConverter will be incomplete.
Minimal complete working example:
import java.io.*;
import java.math.BigInteger;
//needed jars: fr.opensagres.poi.xwpf.converter.core-2.0.2.jar,
//             fr.opensagres.poi.xwpf.converter.pdf-2.0.2.jar,
//             fr.opensagres.xdocreport.itext.extension-2.0.2.jar,
//             itext-4.2.1.jar
import fr.opensagres.poi.xwpf.converter.pdf.PdfOptions;
import fr.opensagres.poi.xwpf.converter.pdf.PdfConverter;
//needed jars: apache poi and its dependencies
//             and additionally: ooxml-schemas-1.4.jar
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.util.Units;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
public class XWPFToPDFConverterSampleMin {
    public static void main(String[] args) throws Exception {
        XWPFDocument document = new XWPFDocument();
        // there must be a styles document, even if it is empty
        XWPFStyles styles = document.createStyles();
        // there must be section properties for the page having at least the page size set
        CTSectPr sectPr = document.getDocument().getBody().addNewSectPr();
        CTPageSz pageSz = sectPr.addNewPgSz();
        pageSz.setW(BigInteger.valueOf(12240)); //12240 Twips = 12240/20 = 612 pt = 612/72 = 8.5"
        pageSz.setH(BigInteger.valueOf(15840)); //15840 Twips = 15840/20 = 792 pt = 792/72 = 11"
        // filling the body
        XWPFParagraph paragraph = document.createParagraph();
        XWPFRun title = paragraph.createRun();
        title.setText("gLETS GO");
        //document must be written so underlying objects will be committed
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        document.write(out);
        document.close();
        document = new XWPFDocument(new ByteArrayInputStream(out.toByteArray()));
        PdfOptions options = PdfOptions.create();
        PdfConverter converter = (PdfConverter)PdfConverter.getInstance();
        converter.convert(document, new FileOutputStream("XWPFToPDFConverterSampleMin.pdf"), options);
        document.close();
    }
}
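The page size values in the example are given in twips (1/20 of a point, with 72 points per inch), as the comments next to setW and setH show. For other page sizes, the arithmetic can be wrapped in a small helper. This is a minimal sketch using only the JDK; the class PageSizeTwips and its method names are hypothetical and not part of apache poi.

```java
import java.math.BigInteger;

class PageSizeTwips {

    static final int TWIPS_PER_POINT = 20;
    static final int POINTS_PER_INCH = 72;

    // points -> twips, rounded to the nearest whole twip
    static BigInteger fromPoints(double points) {
        return BigInteger.valueOf(Math.round(points * TWIPS_PER_POINT));
    }

    // inches -> points -> twips
    static BigInteger fromInches(double inches) {
        return fromPoints(inches * POINTS_PER_INCH);
    }

    // millimeters -> inches -> twips (25.4 mm per inch)
    static BigInteger fromMillimeters(double millimeters) {
        return fromInches(millimeters / 25.4);
    }

    public static void main(String[] args) {
        // US Letter, the size used in the example above: 8.5" x 11"
        System.out.println("Letter: " + fromInches(8.5) + " x " + fromInches(11));
        // A4: 210 mm x 297 mm
        System.out.println("A4: " + fromMillimeters(210) + " x " + fromMillimeters(297));
    }
}
```

The results are BigInteger values so they could be passed to pageSz.setW and pageSz.setH in the same way as the literals in the example.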

I would not suggest using Apache POI for this, since its library for converting Word to PDF has been discontinued. As of today, I don't think there is any open source library that does the conversion on its own (they require some dependencies; some need MS Word to be installed, for example). The best way I can think of (it will only work if you deploy the project on a Linux machine) is to install LibreOffice (open source) on the Linux machine and run this:
String command = "libreoffice --headless --convert-to pdf " + inputPath + " --outdir " + outputPath;
try {
    Runtime.getRuntime().exec(command);
} catch (IOException e) {
    e.printStackTrace();
}
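One caveat with the snippet above: Runtime.exec(String) splits the command on whitespace, so paths containing spaces break, and the call does not wait for LibreOffice to finish. A variant with ProcessBuilder passes each argument separately and waits for the process. This is a sketch only; the class LibreOfficePdfExport, its method names and the timeout parameter are hypothetical, and it assumes a libreoffice executable on the PATH, as in the answer.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class LibreOfficePdfExport {

    // Same flags as the command string above, but one list element per argument,
    // so paths with spaces are passed through unchanged.
    static List<String> buildCommand(String inputPath, String outputDir) {
        return Arrays.asList("libreoffice", "--headless", "--convert-to", "pdf",
                inputPath, "--outdir", outputDir);
    }

    // Runs the conversion and reports whether LibreOffice exited with code 0 in time.
    static boolean convert(String inputPath, String outputDir, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(inputPath, outputDir))
                .inheritIO() // show LibreOffice's own output on our console
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // give up on a hanging conversion
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: LibreOfficePdfExport <input file> <output dir>");
            return;
        }
        System.out.println(convert(args[0], args[1], 120) ? "Done" : "Conversion failed");
    }
}
```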

Problem about font encoding in PDF/A generation

So here is my problem:
I'm currently working on a Java application that archives documents as PDF/A-1. I'm using PDFBox for PDF generation, and I can't generate a valid PDF/A-1 file because of the font. The font is embedded inside the PDF file, but this website: https://www.pdf-online.com/osa/validate.aspx tells me that it is not a valid PDF/A because of:
The key Encoding has a value Identity-H which is prohibited.
I looked on the internet for what this Identity-H encoding is, and it seems to be the way the font is encoded, like the ANSI encoding.
I've already tried different fonts like Helvetica or Arial Unicode MS, but nothing works; there is always this Identity-H encoding. I'm a bit lost with all this mess in encoding, so if someone can explain it to me, that would be great. Also, here is the code I wrote to embed a font in the PDF:
// load the font as this needs to be embedded
PDFont font = PDType0Font.load(doc, getClass().getClassLoader().getResourceAsStream(fontfile), true);
if (!font.isEmbedded())
{
    throw new IllegalStateException("PDF/A compliance requires that all fonts used for"
            + " text rendering in rendering modes other than rendering mode 3 are embedded.");
}
Thanks for your help :)

Problem solved:
I used the Apache example CreatePDFA (I have no clue why that works and my code does not). The example is in examples/src/main/java/org/apache/pdfbox/examples
I added the following to fit the PDF/A-3 requirements:
doc.getDocumentCatalog().setLanguage("en-US");
PDMarkInfo mark = new PDMarkInfo(); // new PDMarkInfo(page.getCOSObject());
PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
doc.getDocumentCatalog().setMarkInfo(mark);
doc.getDocumentCatalog().setStructureTreeRoot(treeRoot);
doc.getDocumentCatalog().getMarkInfo().setMarked(true);
PDDocumentInformation info = doc.getDocumentInformation();
info.setCreationDate(date);
info.setModificationDate(date);
info.setAuthor("KairosPDF");
info.setProducer("KairosPDF");
info.setCreator("KairosPDF");
info.setTitle("Generated PDf");
info.setSubject("PDF/A3-A");
Here is my code to embed a file in the PDF:
private final PDDocument doc = new PDDocument();
private final PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
private final PDDocumentNameDictionary names = new PDDocumentNameDictionary(doc.getDocumentCatalog());
private final Map<String, PDComplexFileSpecification> efMap = new HashMap<>();
public void addFile(PDDocument doc, File child) throws IOException {
    File file = new File(child.getPath());
    Calendar date = Calendar.getInstance();
    //first create the file specification, which holds the embedded file
    PDComplexFileSpecification fs = new PDComplexFileSpecification();
    fs.setFileUnicode(child.getName());
    fs.setFile(child.getName());
    InputStream is = new FileInputStream(file);
    PDEmbeddedFile ef = new PDEmbeddedFile(doc, is);
    //Setting
    ef.setSubtype("application/octet-stream");
    ef.setSize((int) file.length() + 1);
    ef.setCreationDate(date);
    ef.setModDate(date);
    COSDictionary dictionary = fs.getCOSObject();
    dictionary.setItem(COSName.getPDFName("AFRelationship"), COSName.getPDFName("Data"));
    fs.setEmbeddedFile(ef);
    efMap.put(child.getName(), fs);
    efTree.setNames(efMap);
    names.setEmbeddedFiles(efTree);
    doc.getDocumentCatalog().setNames(names);
    is.close();
}
The only problem left is this error from the validation :
File specification 'Test.txt' not associated with an object.
Hope it'll help someone.

How to delete a particular paragraph in docx containing specific string using Apache POI

I am trying to delete a particular paragraph of a document containing specific string using Apache POI. I found a piece of code in the 2nd answer here:
https://stackoverflow.com/a/44326734/8504176
FileInputStream fis = new FileInputStream(fileName);
XWPFDocument doc = new XWPFDocument(fis);
fis.close();
// Find a paragraph with todelete text inside
XWPFParagraph toDelete = doc.getParagraphs().stream()
        .filter(p -> StringUtils.equalsIgnoreCase("todelete", p.getParagraphText()))
        .findFirst().orElse(null);
if (toDelete != null)
{
    doc.removeBodyElement(doc.getPosOfParagraph(toDelete));
    OutputStream fos = new FileOutputStream(fileName);
    doc.write(fos);
    fos.close();
}
But equalsIgnoreCase() is throwing an error that I don't understand:
Exception in thread "main" java.lang.RuntimeException: Uncompilable
source code - Erroneous sym type:
com.sun.xml.internal.ws.util.StringUtils.equalsIgnoreCase
How can I solve this issue? Thanks in advance.
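The error message names com.sun.xml.internal.ws.util.StringUtils, which suggests that a JDK-internal class got imported instead of the StringUtils the linked answer had in mind (presumably the one from Apache Commons Lang; this is an assumption, since the import is not shown). The same comparison can also be written with the JDK alone. Below is a minimal sketch of that; the class ParagraphTextMatch and its methods are hypothetical names, and plain strings stand in for the paragraph texts returned by getParagraphText().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

class ParagraphTextMatch {

    // Null-safe, case-insensitive equality using only String.equalsIgnoreCase.
    static boolean equalsIgnoreCase(String a, String b) {
        return a == null ? b == null : a.equalsIgnoreCase(b);
    }

    // Case-insensitive "contains", for paragraphs that merely include the text.
    static boolean containsIgnoreCase(String text, String wanted) {
        return text != null && wanted != null
                && text.toLowerCase(Locale.ROOT).contains(wanted.toLowerCase(Locale.ROOT));
    }

    // Mirrors the stream pipeline of the question on plain strings.
    static Optional<String> findFirstMatch(List<String> paragraphTexts, String wanted) {
        return paragraphTexts.stream()
                .filter(text -> equalsIgnoreCase(wanted, text))
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("Keep me", "TODELETE", "Also keep me");
        System.out.println(findFirstMatch(texts, "todelete").orElse("no match"));
    }
}
```

In the code from the question, the filter would then read p -> ParagraphTextMatch.equalsIgnoreCase("todelete", p.getParagraphText()), or use containsIgnoreCase if the paragraph only has to contain the string.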

Converting a pdf to word document using java

I've successfully converted JPEG to PDF using Java, but I don't know how to convert PDF to Word using Java. The code for converting JPEG to PDF is given below.
Can anyone tell me how to convert PDF to Word (.doc/.docx) using Java?
import java.io.FileOutputStream;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.Document;
public class JpegToPDF {
    public static void main(String[] args) {
        try {
            Document convertJpgToPdf = new Document();
            PdfWriter.getInstance(convertJpgToPdf, new FileOutputStream(
                    "c:\\java\\ConvertImagetoPDF.pdf"));
            convertJpgToPdf.open();
            Image convertJpg = Image.getInstance("c:\\java\\test.jpg");
            convertJpgToPdf.add(convertJpg);
            convertJpgToPdf.close();
            System.out.println("Successfully Converted JPG to PDF in iText");
        } catch (Exception i1) {
            i1.printStackTrace();
        }
    }
}

In fact, you need two libraries. Both libraries are open source. The first one is iText, which is used to extract the text from the PDF file. The second one is POI, which is used to create the Word document.
The code is quite simple:
//Create the word document
XWPFDocument doc = new XWPFDocument();
// Open the pdf file
String pdf = "myfile.pdf";
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
// Read the PDF page by page
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
    // Extract the text
    String text = strategy.getResultantText();
    // Create a new paragraph in the word document, adding the extracted text
    XWPFParagraph p = doc.createParagraph();
    XWPFRun run = p.createRun();
    run.setText(text);
    // Adding a page break
    run.addBreak(BreakType.PAGE);
}
// Write the word document
FileOutputStream out = new FileOutputStream("myfile.docx");
doc.write(out);
// Close all open files
out.close();
reader.close();
Beware: with the extraction strategy used here, you will lose all formatting. But you can fix this by plugging in your own, more complex extraction strategy.

You can use the 7-PDF library.
Have a look at this, it may help:
http://www.7-pdf.de/sites/default/files/guide/manuals/library/index.html
PS: iText has some issues when the given file is a non-RGB image, so try this out!

Although it's far from being a pure Java solution, OpenOffice/LibreOffice allows one to connect to it through a TCP port, and it's possible to use that to convert documents. If this looks like an acceptable solution, JODConverter can help you.

get thumbnail of word in java using Apache POI

I am working on a web sharing project in JSF. In this project users can upload documents such as .doc, .pdf, .ppt, etc. I want to show the first page of these documents as a thumbnail. After some googling around I found Apache POI. Does anybody have a suggestion for my problem? How can I get a thumbnail image of the first page of a Word document? I tried this code, but it just gets the first picture that the Word document contains:
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("d:\\test.doc"));
HWPFDocument doc = new HWPFDocument(fs);
PicturesTable pt=doc.getPicturesTable();
List<Picture> p=pt.getAllPictures();
BufferedImage image=ImageIO.read(new ByteArrayInputStream(p.get(0).getContent()));
ImageIO.write(image, "JPG", new File("d:\\test.jpg"));

What you are doing does not produce a page thumbnail. HWPFDocument can only extract a thumbnail that is embedded in the document (when saving the file, check the 'add preview' option). So HWPFDocument can extract thumbnails only from documents that have one.
Even then, to do that, you have to do the following:
static byte[] process(File docFile) throws Exception {
    final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
    SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
    System.out.println(summaryInformation.getAuthor());
    System.out.println(summaryInformation.getApplicationName() + ":" + summaryInformation.getTitle());
    Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
    System.out.println(thumbnail.getClipboardFormat());
    System.out.println(thumbnail.getClipboardFormatTag());
    return thumbnail.getThumbnailAsWMF();
}
After that, you will probably have to convert the WMF file to a more common format (JPEG, PNG...). ImageMagick can help.
