Java: parsing ms-word document using POI/HWPF

I have an MS Word document (MS Office 2003; non-XML). Within this
document there is a string associated with a bookmark. Furthermore,
the Word document contains Word macros. My goal is to read the
document with Java, replace the string associated with the bookmark,
and save the document back to Word format.
My first approach was using Apache POI HWPF:
HWPFDocument doc = new HWPFDocument(new FileInputStream("Test.doc"));
doc.write(new FileOutputStream("Test_generated.doc"));
The problem with this solution is that the generated file no longer
contains the macro (file size of the original document: 32k;
file size of the generated document: 19k).
Does anybody know if it's possible to retain all the original info
using POI/HWPF?

I never found a solution. The customer had to either pay for an Aspose license (expensive) or refrain from using macros.

Related

PDDocument.load(file) isn't a method (PDFBox)

I wanted to make a simple program to get text content from a pdf file through Java. Here is the code:
PDFTextStripper ts = new PDFTextStripper();
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = PDDocument.load(file);
String allText = ts.getText(doc1);
String gradeText = allText.substring(allText.indexOf("GRADE 10B"), allText.indexOf("GRADE 10C"));
System.out.println("Meeting ID for English: "
+ gradeText.substring(gradeText.indexOf("English") + 7, gradeText.indexOf("English") + 20));
This is just part of the code, but it is the part with the problem.
The error is: The method load(File) is undefined for the type PDDocument
I learned to use PDFBox from JavaTPoint. I followed the instructions for installing the PDFBox libraries and adding them to the Build Path.
My PDFBox version is 3.0.0.
I have also searched the source files and their methods, and I am unable to find the load method there.
Thank you in advance.

As per the 3.0 migration guide, the PDDocument.load method has been replaced with the Loader methods:
For loading a PDF PDDocument.load has been replaced with the Loader
methods. The same is true for loading a FDF document.
When saving a PDF this will now be done in compressed mode per
default. To override that use PDDocument.save with
CompressParameters.NO_COMPRESSION.
PDFBox now loads a PDF Document incrementally reducing the initial
memory footprint. This will also reduce the memory needed to consume a
PDF if only certain parts of the PDF are accessed. Note that, due to
the nature of PDF, uses such as iterating over all pages, accessing
annotations, signing a PDF etc. might still load all parts of the PDF
over time, leading to a similar memory consumption as with PDFBox 2.0.
The input file must not be used as output for saving operations. It
will corrupt the file and throw an exception as parts of the file are
read the first time when saving it.
So you can either switch to an earlier 2.x version of PDFBox or use the new Loader class. I believe this should work:
import org.apache.pdfbox.Loader;
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = Loader.loadPDF(file);

Create a DOCX reading data from Oracle database

I have a student database (Oracle 11g), and I need to create a separate module that generates a student's details as a well-formatted Word document. Given a student ID, it should produce a presentable docx file with all of the student's information (a kind of biodata). I'm not sure how to start; I have been exploring python-docx and the Java library docx4j. I would like suggestions on how to achieve this. Is there a tool that can do it?
Your help is highly appreciated.

You could extract the data from Oracle into an XML format, then use content control data binding in your Word document to bind elements in the XML.
All you need to do is inject the XML into the docx as a custom xml part, and Word will display the results automatically.
docx4j can help you inject the XML. If you don't want to rely on Word to display the results, you can also use docx4j to apply the bindings.
Or you could try simple variable replacement: https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/VariableReplace.java
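As a rough illustration of the first step (getting one student's data into XML), here is a minimal sketch that uses only the JDK's DOM and Transformer classes. The StudentXml class and the element names (student, id, name) are hypothetical, and the map stands in for whatever a JDBC query against the Oracle database returns. Injecting the result into the docx as a custom XML part would still be done with docx4j and is not shown here.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: turns one student's column/value pairs into an XML string.
final class StudentXml {

    // Keys must be valid XML element names (assumption: they come from column names).
    static String toXml(Map<String, String> fields)
            throws ParserConfigurationException, TransformerException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("student"); // hypothetical root element name
        doc.appendChild(root);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            Element child = doc.createElement(field.getKey());
            child.setTextContent(field.getValue()); // special characters are escaped on output
            root.appendChild(child);
        }
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // In a real module these values would come from a query by student ID.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "42");
        fields.put("name", "Ada & Co");
        System.out.println(toXml(fields));
    }
}
```

Building the XML through DOM rather than string concatenation means characters such as & and < in the data are escaped automatically.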

If you want a simple way to format your Word document directly from Java, you can try pxDoc.
The screenshot below provides an example of code and the document generated from an Authors/Books model: however you request the data from your database, it is easy to render it in a well-formatted document.
[Screenshot: simple document generation example]
For your use case, you could also generate a document for all students at once. In the context of the screenshot example:
for (author:library.authors) {
    var filename = 'c:/MyDocuments/'+author.name+'.docx'
    document fileName:filename {
        /** Content of my document */
    }
}

How to copy page from one docx to another in java

I am trying to automate a docx report generation process using Java and docx4j. I have a template document containing only a single page. I would like to copy that page, modify it, and save it in another docx document. The output report consists of multiple similar pages, each a modified copy of the template. How do I go about it?
PS: Java and docx4j are my first choice, but I am open to other solutions.

Leaving it up to you to modify the template, here is how you could add one document to the end of another document. Suppose base.docx contains "This is the base document." and template.docx contains "The time is:", then after executing this code:
WordprocessingMLPackage doc = Docx4J.load(new File("base.docx"));
WordprocessingMLPackage template = Docx4J.load(new File("template.docx"));
MainDocumentPart main = doc.getMainDocumentPart();
Br pageBreak = Context.getWmlObjectFactory().createBr();
pageBreak.setType(STBrType.PAGE);
main.addObject(pageBreak);
for (Object obj : template.getMainDocumentPart().getContent()) {
main.addObject(obj);
}
main.addParagraphOfText(LocalDateTime.now().toString());
doc.save(new File("result.docx"));
Then result.docx will contain something like:
This is the base document.
^L
The time is:
2018-04-16T17:37:13.541984200
(Where ^L represents a page break.)

To be more precise, my original template contains only a header and some styling components.
This kind of information can be stored in a Word stylesheet (.dotx file).
PS: Java and docx4j are my first choice, but I am open to other solutions.
A good tool would be pxDoc: you can specify a dedicated stylesheet in your document generator, or use "variable styles" and specify the stylesheet only when you launch the document generation.

How to extract data from a .docx file including image, table, formula etc?

I am working on a task in which I have to extract data from a Word document, mainly images, tables, and special text (formulas etc.).
I am able to save an image from a Word file downloaded from the web, but when I apply the same code to my own .docx file, it gives an error.
The code is:
//create file inputstream to read from a binary file
FileInputStream fs=new FileInputStream(filename);
//create office word 2007+ document object to wrap the word file
XWPFDocument docx=new XWPFDocument(fs);
//get all images from the document and store them in the list piclist
List<XWPFPictureData> piclist=docx.getAllPictures();
//traverse through the list and write each image to a file
Iterator<XWPFPictureData> iterator=piclist.iterator();
System.out.println(piclist.size());
while(iterator.hasNext()){
    XWPFPictureData pic=iterator.next();
    byte[] bytepic=pic.getData();
    int i=0;
    BufferedImage imag=ImageIO.read(new ByteArrayInputStream(bytepic));
    //captureimage(imag,i,flag,j);
    if(imag != null)
    {
        ImageIO.write(imag, "jpg", new File("D:/imagefromword"+i+".jpg"));
    }else{
        System.out.println("imag is empty");
    }
}
It gives an incorrect format error, but I cannot change the document file.
Secondly, if the document has more than one image, the code above saves the same image every time. Suppose we have 3 images: it saves 3 images, but all three are the latest one.
Any help will be appreciated.

Without the actual error, one can only guess.
But there are two POI implementations, HWPF and XWPF, and which one to use depends on the version of the Word document you read: the old binary .doc format or the newer XML-based .docx format. Typically the format error occurs when you try to open the document with the wrong one.
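One way to check which container a file really uses, regardless of its extension, is to look at its first bytes. The sketch below uses only the JDK; the WordFormatSniffer class and its names are hypothetical. It relies on two well-known signatures: OLE2 compound files (used by the old binary .doc) start with D0 CF 11 E0 A1 B1 1A E1, and ZIP archives (used by .docx) start with "PK" 03 04. Note that this only identifies the container, since other Office formats share the same signatures.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: guesses the container format from the leading bytes.
final class WordFormatSniffer {

    enum Container { OLE2, ZIP, UNKNOWN }

    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Pure check on an in-memory header (the first bytes of the file).
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2; // old binary format: a candidate for HWPF
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;  // zipped XML format: a candidate for XWPF
        }
        return Container.UNKNOWN;
    }

    // Reads up to 8 bytes from the stream and classifies them.
    static Container sniff(InputStream in) throws IOException {
        byte[] header = new byte[8];
        int filled = 0;
        while (filled < header.length) {
            int read = in.read(header, filled, header.length - filled);
            if (read < 0) {
                break; // file shorter than 8 bytes
            }
            filled += read;
        }
        byte[] actual = new byte[filled];
        System.arraycopy(header, 0, actual, 0, filled);
        return sniff(actual);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WordFormatSniffer path/to/file
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(sniff(in));
        }
    }
}
```

If the result is OLE2 for a file named .docx, that would explain a format error from XWPFDocument, and HWPFDocument would be the class to try instead.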
Also, you need the full poi-ooxml-schemas jar to read more complicated documents.
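On the second problem in the question (every saved file ending up as the same image): in the posted loop, int i=0 is declared inside the while body, so each picture is written to imagefromword0.jpg and overwrites the previous one. A minimal sketch of the fix follows, using only the JDK. The ImageDump class is hypothetical, and the byte arrays stand in for the values returned by pic.getData() in the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: writes each image to its own numbered file.
final class ImageDump {

    // Returns the paths written: prefix0.ext, prefix1.ext, ...
    static List<Path> writeAll(List<byte[]> images, Path dir, String prefix, String ext)
            throws IOException {
        List<Path> written = new ArrayList<>();
        int i = 0; // declared once, outside the loop, so it keeps counting
        for (byte[] data : images) {
            Path target = dir.resolve(prefix + i + "." + ext);
            Files.write(target, data); // raw bytes, no re-encoding
            written.add(target);
            i++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder data; with POI these would be the picture byte arrays.
        List<byte[]> images = Arrays.asList(new byte[] {1}, new byte[] {2}, new byte[] {3});
        Path dir = Files.createTempDirectory("imagefromword");
        for (Path p : writeAll(images, dir, "imagefromword", "bin")) {
            System.out.println(p);
        }
    }
}
```

The same change applies to the original loop: move the counter declaration above the while and increment it after each write.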

Specify document Fields using Lucene library

I'm using the Lucene library to create an index from a number of documents. For example, the name of the first document is file1.txt and it contains the following text:
.T (title of the document) .A (author of the document) .S (summary of the document)
If I want to define all the contents of the document as a Field, I write this:
doc.add(new TextField("contents", new BufferedReader(
new InputStreamReader(fis, "UTF-8"))));
What if I want to specify only the summary of the document as a Field? I'm new to Java and I can't find a way.

You need to read the file manually until you reach the summary, collect it in a String (e.g. via a StringBuilder), and then add a TextField as you listed.
For reading files line by line you could use Scanner (http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html); for String concatenation you could use StringBuilder (http://docs.oracle.com/javase/7/docs/api/java/lang/StringBuilder.html).
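A minimal sketch of that idea, using only the JDK, might look like this. It assumes that each marker (.T, .A, .S) starts its own line and that the summary runs until the next marker or the end of the file; the real layout of file1.txt may differ. The SummaryExtractor class is hypothetical, and a BufferedReader is used here in place of Scanner.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: pulls the text of the ".S" section out of a document.
final class SummaryExtractor {

    static String extractSummary(Reader source) throws IOException {
        StringBuilder summary = new StringBuilder();
        boolean inSummary = false;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (isMarker(line)) {
                    // A new section starts; collect only while inside ".S".
                    inSummary = line.startsWith(".S");
                    String rest = line.substring(2).trim();
                    if (inSummary && !rest.isEmpty()) {
                        append(summary, rest); // text on the marker line itself
                    }
                } else if (inSummary) {
                    append(summary, line);
                }
            }
        }
        return summary.toString();
    }

    // Assumption: a marker is a dot plus one uppercase letter, alone or followed by whitespace.
    private static boolean isMarker(String line) {
        return line.length() >= 2
                && line.charAt(0) == '.'
                && Character.isUpperCase(line.charAt(1))
                && (line.length() == 2 || Character.isWhitespace(line.charAt(2)));
    }

    private static void append(StringBuilder target, String text) {
        if (target.length() > 0) {
            target.append('\n');
        }
        target.append(text);
    }

    public static void main(String[] args) throws IOException {
        String sample = ".T\nSome title\n.A\nSome author\n.S\nFirst summary line\nSecond summary line\n";
        System.out.println(extractSummary(new StringReader(sample)));
    }
}
```

The returned String can then be indexed on its own, for example with doc.add(new TextField("summary", summary, Field.Store.NO)), where the field name "summary" is only a suggestion.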
