Adding content to MS Word, including a macro - Java

I have a JSP file that writes all data from a database into an MS Word document by setting the content type.
I need to add a header and footer to the same document. I couldn't find a direct way to do this from JSP without using an API like POI, so I created a macro, which works locally.
How do I add this to a dynamically generated Word file?

I had a similar problem with POI and Excel.
The solution is to manually create a template .doc file, with the macro present. Then in your code, load that document, amend it with your data, and save it. The macro will be preserved from the template document.

I'd use POI or docx4j to create a docx file on the server, and add the header/footer as part of that process.

Related

PDF text extraction in Java

I have a PDF file that was produced with iText and created with JasperReports (I don't know if it's relevant) and I was wondering if I can find some API or anything to see the structure because I need to extract text from it.
I tried iText, PDFBox and other Java libraries, but I only get the text line by line, and that's not what I need.
I also tried converting to HTML, XML and DOM, but I get the same result as with text extraction: no structure is parsed.
If I open the file as DOCX, I see that Word recognizes some sort of structure; for example, an area that looks like a table in the PDF is an actual table after conversion to DOCX.
I need to understand how the PDF was created, if this is possible. I know that working with PDFs is not easy, but I need to start with something useful. Thanks!
PDFTron PDFGenie can do full semantic table and paragraph extraction from a PDF file. It can generate a reflowable HTML file containing all the appropriate HTML tags for tables and paragraphs.
See this blog for more details.
https://www.pdftron.com/blog/parsing-extraction/table-extraction-and-pdf-to-xml-with-pdfgenie/#a-idpart7aevaluating-accuracy-of-pdf-table-recognition
You can download the PDFGenie command line tool for Windows/macOS/Linux here.
https://www.pdftron.com/downloads/linux
Another option is Aspose.PDF, which can also extract text by paragraph; see the link below.
https://blog.aspose.com/2018/02/28/extract-text-by-paragraphs-and-convert-files-to-pdf-with-aspose.pdf/

Generate PDF files using iText and apache velocity template(.vm)

What is the general workflow to generate a PDF using iText and an Apache Velocity template file (.vm) in Java?
I am interested in knowing steps like: parse template file, put Java object in context and steps to be performed to generate pdf etc.
I know this is a very basic question. But I am not able to find even a single example of this type on the web. I found XDocReport, but I am interested to know other alternatives as well.
Please help me with some sample project link or at least the steps to get started.
Yes, you can.
It all depends on how complex you want the PDFs to be.
Here are the steps for basic functionality:
Generate an HTML file using an Apache Velocity template file (.vm).
Use com.itextpdf.text.html.simpleparser.HTMLWorker (deprecated) to parse/convert that HTML file into a PDF.
Additionally, you can use com.itextpdf.text.pdf.PdfCopy.PageStamp to add content (borders, stamps, notes, annotations etc.) to an existing PDF.
There is also com.itextpdf.tool.xml.XMLWorker for more advanced HTML conversion (adding style sheets etc.).
Generating a PDF directly from an Apache Velocity template file (.vm) with iText in Java is not possible because:
PDF is a binary format,
Velocity generates plain text content.
In other words, Velocity cannot generate PDF.
XDocReport is able to generate a docx/odt report by merging a docx/odt template, which contains some Velocity/Freemarker syntax, with a Java context. The generated docx/odt report can then be converted to pdf/xhtml.
It works because a docx/odt file is a zip which contains several xml entries. If you unzip a docx you will see word/document.xml. In this entry, you will see the content that you have typed with MS Word. word/document.xml is plain text, so Velocity can be used in this case.
Here is the XDocReport process to generate a report from a docx template which uses Velocity:
Load the docx template. This step consists of unzipping the docx and storing each xml entry in a map (entry name as key and byte array as value). For instance, the map contains a key word/document.xml with the xml content of this entry as value.
Loop over each xml entry which must be merged with the Java context. For instance, word/document.xml is merged with the Java context by using Velocity, and the result of the merge replaces the word/document.xml value in the map.
Rebuild a new docx by zipping each entry of the map.
At this step we have a generated docx (the report).
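The three steps above can be sketched with nothing but java.util.zip. This is only an illustration of the unzip/merge/rezip idea, not XDocReport's actual code: the class and method names are made up for the example, and a plain ${key} string replacement stands in for the Velocity merge.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class; the names are not taken from XDocReport.
class DocxMergeSketch {

    // Step 1: unzip the docx into a map (entry name -> entry bytes).
    static Map<String, byte[]> load(byte[] docx) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(docx))) {
            ZipEntry e;
            while ((e = zin.getNextEntry()) != null) {
                if (!e.isDirectory()) {
                    entries.put(e.getName(), readAll(zin));
                }
            }
        }
        return entries;
    }

    // Step 2: merge one xml entry with the context.
    // Assumption: simple ${key} substitution replaces the real Velocity engine.
    static void merge(Map<String, byte[]> entries, String name, Map<String, String> context) {
        String xml = new String(entries.get(name), StandardCharsets.UTF_8);
        for (Map.Entry<String, String> var : context.entrySet()) {
            xml = xml.replace("${" + var.getKey() + "}", var.getValue());
        }
        entries.put(name, xml.getBytes(StandardCharsets.UTF_8));
    }

    // Step 3: zip every entry of the map back into a new docx.
    static byte[] save(Map<String, byte[]> entries) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                zout.putNextEntry(new ZipEntry(e.getKey()));
                zout.write(e.getValue());
                zout.closeEntry();
            }
        }
        return out.toByteArray();
    }

    // Reads the current zip entry (or any stream) to its end.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }
}
```

A caller would chain load, merge (for word/document.xml) and save. Values inserted this way are not XML-escaped, and Word sometimes splits a placeholder across several runs, so a real template engine has more work to do than this sketch shows.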
To convert it to another format, XDocReport provides a docx-to-pdf converter based on Apache POI and iText. Here is the XDocReport process to convert a docx to pdf:
Load the docx with Apache POI.
Loop over each POI structure (XWPFParagraph, etc.) to create the matching iText structure (iText Paragraph).
Note that XDocReport is modular and you can use other converters as well.
First, we use a FreeMarker template to generate an HTML file, and then render the HTML to a PDF file with ITextRenderer. Finally, we can view the PDF file in the browser; there is a very useful JavaScript tool for that called pdf.js. Maybe you can try it.

Searching Docx files in java

I am writing an application for searching the content of documents.
I have already written the code for searching documents that can be edited in Notepad.
I also wish to do the same for docx files. After some research I have come up with these two options:
http://www.infoq.com/articles/cracking-office-2007-with-java
This method requires me to extract the docx file and then search the xml files. However, this involves extra overhead for the extraction, and frankly I don't know how to process an xml file (discarding attribute content etc.).
http://www.javadocx.com/download
This method lets me import a jar library into my project, and supposedly I can create docx files with it. What I don't understand is how to open docx files using it.
Can anyone recommend an alternative method to do the same thing, or help with the two methods mentioned above?
Try http://tika.apache.org/ or docx4j or POI.
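For comparison, the first method from the question (unzip the docx and read its xml) needs less code than it sounds, even without those libraries. The sketch below uses only the JDK; the class name is invented for the example, and it assumes the main text lives in word/document.xml with paragraphs tagged w:p, which is the usual layout but is not checked here. Headers, footers and footnotes are stored in other entries and are ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper class for illustration only.
class DocxSearchSketch {

    // Returns the text of word/document.xml, one line per paragraph,
    // or an empty string if the entry is not present.
    static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zin = new ZipInputStream(docx)) {
            ZipEntry e;
            while ((e = zin.getNextEntry()) != null) {
                if (!e.getName().equals("word/document.xml")) {
                    continue;
                }
                Document dom = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(new ByteArrayInputStream(readAll(zin)));
                // getTextContent() drops all tags and attributes and keeps only text.
                NodeList paragraphs = dom.getElementsByTagName("w:p");
                StringBuilder text = new StringBuilder();
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    text.append(paragraphs.item(i).getTextContent()).append('\n');
                }
                return text.toString();
            }
        }
        return "";
    }

    // Case-insensitive substring search over the extracted text.
    static boolean contains(InputStream docx, String needle) throws Exception {
        return extractText(docx).toLowerCase(Locale.ROOT)
                .contains(needle.toLowerCase(Locale.ROOT));
    }

    // Reads the current zip entry to its end.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }
}
```

With a file on disk, the call would be something like contains(new FileInputStream(path), "keyword"). The libraries mentioned above handle many more cases (old .doc files, embedded objects, other entries in the package), so this is only worth doing for simple searches.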

Java: Placing a Header in MS Word Document

We are converting a C++ project to Java in which we generate reports with a ".doc" extension. The problem is that we don't use any third-party library to generate a real MS Word document; we just write a file with a .doc extension. Everything works fine except that we can't seem to find a way to add a header at the top of every page. Using line numbers is not an option. Is there any other way it can be done?
Thank you.
The Apache POI library might be of some help.
It has facilities to read and modify Microsoft proprietary file formats like MS-Word .doc and MS-Excel .xls.

What's the best way to extract table content from a group of HTML files?

After cleaning a folder full of HTML files with TIDY, how can the table content be extracted for further processing?
I've used BeautifulSoup for such things in the past with great success.
Depends on what sort of processing you want to do. You can tell Tidy to generate XHTML, which is a type of XML, which means you can use all the usual XML tools like XSLT and XQuery on the results.
If you want to process them in Microsoft Excel, then you should be able to slice the table out of the HTML and put it in a file, then open that file in Excel: it will happily convert an HTML table into a spreadsheet page. You could then save it as CSV or as an Excel workbook etc. (You can even use this on a web server -- return an HTML table but set the Content-Type header to application/vnd.ms-excel: Excel will open and import the table and turn it into a spreadsheet.)
If you want CSV to feed into a database then you could go via Excel as before, or if you want to automate the process, you could write a program that uses the XML-navigating API of your choice to iterate over the table rows and save them as CSV. Python's ElementTree and csv modules would make this pretty easy.
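The automated route (walk the table rows with an XML API and write CSV) could look roughly like this. It is written in Java rather than Python to match the other answers on this page, and uses only the JDK's DOM parser. The class and method names are invented for the sketch, and the input is assumed to be well-formed XHTML without a DOCTYPE line; real Tidy output usually has one, which the parser may try to resolve, so that line may need to be stripped or the parser configured first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class TableToCsvSketch {

    // Converts every <tr> in the document into one CSV line.
    static String toCsv(String xhtml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder csv = new StringBuilder();
        NodeList rows = dom.getElementsByTagName("tr");
        for (int r = 0; r < rows.getLength(); r++) {
            NodeList cells = rows.item(r).getChildNodes();
            boolean first = true;
            for (int c = 0; c < cells.getLength(); c++) {
                Node cell = cells.item(c);
                if (!(cell instanceof Element)) {
                    continue; // skip whitespace between cells
                }
                String tag = cell.getNodeName();
                if (!tag.equals("td") && !tag.equals("th")) {
                    continue;
                }
                if (!first) {
                    csv.append(',');
                }
                csv.append(quote(cell.getTextContent().trim()));
                first = false;
            }
            csv.append('\n');
        }
        return csv.toString();
    }

    // Quotes a field only when it contains a comma, a quote or a line break.
    static String quote(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }
}
```

If a file holds several tables, all of their rows end up in the same output; selecting one table first (for example by walking from a specific table element) is left out to keep the sketch short.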
After reviewing the suggestions, I wound up using HtmlUnit.
With HtmlUnit, I was able to customize the Java code to open each HTML file in the folder, navigate to the TABLE tag, query each column's content and extract the data I needed to create a CSV file.
In .NET you could use HTMLAgilityPack.
See this previous question on StackOverflow for more information.
If you want to extract the content from the HTML markup, you should use some type of HTML parser. There are plenty out there; here are two that might suit your needs:
http://jtidy.sourceforge.net/
http://htmlparser.sourceforge.net/
Iterate through the text and use regular expressions :)
http://www.knowledgehouse.sg
