I am looking for a convenient method to export some data from my database into a form that would be editable afterwards. The perfect scenario would be to export a Word document, and perhaps a brutally simple solution would be to generate HTML and copy/paste it into Word.
I've looked at several open source libraries for generating Word documents, but they seem a bit too simple or incomplete. I need support for tables and embedded images, and control over formatting the fonts, table borders etc. (too much formatting seems to be lost when copying HTML and pasting into Word).
Although Word is the end format, it'd be fine to generate it in any format that Word would be able to open and subsequently save as DOCX.
I really haven't been able to find anything about generating ODT files (server side without client installation).
I would just dive into the ASPOSE libraries, but it'll take ages (and significant pain) to get a purchase order sorted out, so I need to make sure it's the only viable option before taking that route.
I could generate it as an Excel file and copy that to Word - this is looking like the best option currently.
I am currently working on a project which generates contracts. The idea is that I put the data in a form and save it in a simple database.
So far, this has been my favorite place to look for good ideas and simple solutions.
Now I am facing another problem and I don't know how to solve it. I want to create a PDF and replace some placeholders with data from my form.
One idea was to use an existing Word template with some bookmarks and replace them with the data from my form. Maybe there is a way to do that, and I am just too stupid to find it.
Another idea was to use XML. I thought I was clever and converted the Word template to a PDF, so that I could convert that PDF to XML. Attached, you will find the XML file. But now I need the XSL file - is there an easy way to create the XSL file?
Or maybe there is another simple solution to solve my problem.
In these attachments you find the PDF file, the Word template and the XML:
Thank you a lot :)
Using a template is a good idea - it makes some changes much quicker to make and then deploy. The comments above are focused on conversion, but don't forget you need to merge your data in (population) first.
If you can use Adobe tools, you can have a PDF template and use the Adobe tools to populate. This saves a "conversion" stage.
You mentioned using Word for templates. This means you have to run through two stages of processing:
population - docx is a zipped set of XML files - so you can process them with your own code or using a library.
conversion - you need pdf, so you have to convert the docx to pdf. You also have to watch out for fonts at this stage (ie make sure they are available on your host).
The population stage you could do yourself since you are familiar with XML. But it is definitely complicated. The conversion needs to use a tool that is ideal for it. There are a few mentioned in the comments already.
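As a rough illustration of the do-it-yourself population stage, here is a minimal Python sketch using only the standard library. It rests on some assumptions: the template marks its fields with a hypothetical ${name} syntax, each placeholder sits inside a single text run (Word often splits text across several runs, in which case a plain string replace will miss it), and the function name populate_docx is made up for this example.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def populate_docx(template_bytes, values):
    """Return a copy of a .docx whose ${name} placeholders are replaced.

    Only word/document.xml is touched; every other part is copied as-is.
    The ${name} placeholder syntax is an assumption of this sketch.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")
                for name, value in values.items():
                    # Escape the value so characters like & and < stay valid XML.
                    text = text.replace("${" + name + "}", escape(str(value)))
                data = text.encode("utf-8")
            dst.writestr(item.filename, data)
    return out.getvalue()
```

Repeating sections, conditions and placeholders split across runs are where this simple approach stops working and a dedicated library becomes worthwhile. The result still has to go through the conversion stage to become a PDF.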
There are some free/open source and commercial tools that can do both parts:
docx4j
JOD Reports
LibreOffice (using the Java UNO API) (I blogged this once - Java Convert Word to PDF with UNO)
Docmosis (please note I work for Docmosis)
I suggest starting with the simple example you have attached and proving that you can both populate and convert it. Then switch to a more complicated example to see if you can do the other things that might be required (e.g. repeating sections, conditions or other logic) during the population stage.
I have a .xlsx file containing my university's timetable. I'm working on an application that makes use of the timetable. I don't want to "copy" the timetable contents from this Excel spreadsheet into a more "programmer-friendly" format by hand. Instead, I'd like to write a program/script that parses this .xlsx table and automatically converts it into the format I need (e.g. into some objects in code).
There's no trouble for me in reading "normal" cells of the spreadsheet. However, instead of simply putting 1 text entry in each cell, the person who created this timetable file manually "divided" some cells into "subcells" and manually inserted some text in each of them. This looks like:
How should this be interpreted: students are divided into 4 groups. At 15.20-16.50 only groups number 1 and 2 will have a specific class. At 17.00-18.30 only groups 1, 3, and 4 will have that class.
As one can see, these "cells" are not real cells — they seem to have been created ("divided") manually, just like the text that is selected in the picture.
The question is: how do I find and read such "cells" (manually inserted text components) like in the picture (preferably also knowing their position so that I can not only read what classes exist, but also when they start (time is stated in the very left of the spreadsheet))?
I tried using Python's xlrd module but haven't been able to achieve what I need. Neither have I had any success with Java's Apache POI: I just can't find how to read such text entries. Solutions in either language, no matter what libraries and approaches are used, will be fine for me.
Both xls and xlsx are proprietary formats. Microsoft went out of their way to explain in court that xlsx is open, but unfortunately not one of the judges involved knew anything significant about computer science and the lawyers knew it, so don't get distracted by their misleading case. XLSX has the option for the 'vendor' to add a block of 'custom binary blobs' and the vast majority of the Excel features that aren't the most common, lowest level stuff imaginable are in these binary blobs. No doubt this 'stick a text table object into a single cell' thing that's going on here is exactly like that.
Microsoft has never released any documentation on these binary blobs, nor any library that can parse them.
Therefore, Apache POI, xlrd, and all other libraries to read XLS files that do not explicitly require Excel to be installed and running on the computer that's running the 'library' (kind of a tricky thing to pull off if you have e.g. a Linux-based server!) are based on reverse engineering it, and it's a horrible format. Literally - look up what Apache POI's 'HSSF' stands for. Officially nothing, but etymologically, that H is for Horrible. (Horrible Spread Sheet Format - HSSF).
That's the long way around of saying: Sorry - you probably can't. And it's not the fault of POI or xlrd, it's on Microsoft. It is not appropriate to use such a closed, proprietary and undocumented format to transfer anything meaningful. The error lies in whatever process led to the situation that you're now stuck trying to write software to parse a weird Excel file.
If you must, most likely a script running within Excel can untangle this mess and write out a CSV file or JSON or something in a documented format. Alternatively, you can write something in C#, but it would just be farming out the work to Excel, so you still would not be able to port this code to other platforms.
Apache POI does give you the option of a more low-level approach where you can read the binary blobs. You can attempt to reverse engineer whatever's going on in that 'cell-with-a-table-in-it' yourself, but as neither the xlrd team nor the Apache POI team has bothered, and at least the POI team is on record as saying the format seems to be designed to be obfuscated - that sounds like a job that will take you many, many weeks.
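Before investing weeks, a cheap first inspection is possible without Excel. The Python sketch below rests on assumptions that may not hold for this particular file: that it really is an .xlsx (which is a ZIP package of XML parts), and that the manually placed "subcells" are text boxes or shapes stored in the package's drawing parts under xl/drawings/. The function name drawing_texts is made up for this example. If it finds nothing, the text lives somewhere else and this approach does not apply.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by SpreadsheetML drawing parts.
NS = {
    "xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}


def drawing_texts(xlsx_bytes):
    """Yield (part, row, col, text) for each anchored shape that holds text.

    row and col are the zero-based indexes of the cell the shape's
    top-left corner is anchored to.
    """
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as z:
        for part in sorted(z.namelist()):
            if not (part.startswith("xl/drawings/") and part.endswith(".xml")):
                continue
            root = ET.fromstring(z.read(part))
            for anchor in root:
                start = anchor.find("xdr:from", NS)
                text = "".join(t.text or "" for t in anchor.iterfind(".//a:t", NS))
                if start is None or not text:
                    continue  # no cell anchor, or a shape without text
                row = int(start.findtext("xdr:row", namespaces=NS))
                col = int(start.findtext("xdr:col", namespaces=NS))
                yield part, row, col, text
```

The anchor row could then be matched against the time column at the very left of the sheet. Mapping a drawing part back to its worksheet would additionally need the relationship files in the package, which this sketch leaves out.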
That gets me back to the solution I advised earlier: Unless spending many weeks building an incredibly fragile stack that requires a full blown Windows and an Excel license is the lesser evil compared to a simple change in human behaviour (unlikely), the fix lies in addressing the process (as in, address that Excel is used to transfer this info, or at least make the Excel sheet muuuch simpler than this thing), and not in finding out how to read this mess in Java or Python.
I have been given a Microsoft Word document with some tables and spots to fill in automatically. I am not sure if this can be done with Java, which is my preferred language.
I am looking for a way to implement a function that I can give the Word file to, and that fills in the required spots for me. Is it possible to do this? A hint or a link to a tutorial would definitely suffice. Thanks.
Newer versions of Word store documents as zipped XML. Have you filled out the form manually in Word and done a before/after comparison on the XML? Depending on the extent of the changes you could use the standard Java XML APIs to do the same thing programmatically.
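The before/after comparison itself can be scripted. Here is a minimal sketch, written in Python rather than Java purely for brevity, and assuming that the interesting changes are in the main document part, word/document.xml. The function names are made up for this example.

```python
import difflib
import io
import re
import zipfile


def document_xml_lines(docx_bytes):
    """Return word/document.xml split so that each tag starts a new line."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    # Word writes the XML as one long line; break it up so the diff is readable.
    return re.sub(r">\s*<", ">\n<", xml).splitlines()


def diff_docx(before_bytes, after_bytes):
    """Unified diff of the main document part of two .docx files."""
    return list(difflib.unified_diff(
        document_xml_lines(before_bytes),
        document_xml_lines(after_bytes),
        fromfile="before", tofile="after", lineterm=""))
```

Calling diff_docx with the bytes of the empty form and of the manually filled form shows which elements Word changed, which is a reasonable guide to what the program has to write.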
A bit of googling and I found docx4j and Apache POI. I haven't used either personally, but it appears that what you're asking for is certainly possible. See this example from the POI SVN repo on how to manipulate tables.
I have been bumping my head against the wall with this one, have researched and tried pretty much every library suggested to me. I am currently trying to write a program in Java that will extract text AND images from a PDF file and allow me to write the extracted content to a Word file. I have managed to extract the content using the ICEpdf library, however the problem is that I need to be able to write the content in the exact same order as it was read. So, to clarify, I need a library that will help me keep track of where exactly on the page the text and images are situated, so I can put them in the same place in my Word file.
A PDF to Word converter is a horribly complex proposition.
Your best bet will probably be to use Open Office to do it for you and not even try to handle the intermediate steps.
http://www.openoffice.org/api/
Look at this: Advanced PDF parser for Java
OFF:
Also, to my knowledge there is a Python parser that sorta converts the PDF to HTML (that way you can keep track of the ordering of the objects within the PDF). I know it's not Java, but you might be able to use the output.
http://www.unixuser.org/~euske/python/pdfminer/index.html
I wrote a web app that generates PDFs by filling data into a pre-saved PDF template. The template was edited in Acrobat and has some text fields. But the content of those text fields seems to sit in a different layer and cannot affect the other existing words in the template.
... But I want it to affect the existing words and make them flow based on how much data is filled into the text fields.
The solution may be to use a program to generate the whole PDF instead of using a template. But the template changes really often in my case, and I don't want to waste a lot of time adjusting the position and format in code...
Does anyone know how to use a text field with auto flow in a PDF template, just like in a Word document?
PDF doesn't work like that. You need to generate the whole PDF.
Ah... but from what?
There are quite a few HTML->PDF converters out there. You could fill in your template HTML, and convert it that way.
You could develop your own input format (for your template), and write an app that reads it and builds a PDF.
The latter is similar enough to HTML->PDF that, unless you can't find a converter that handles some PDF feature you need, I'd just go the HTML route. There are LOTS of html->pdf apps out there. You can search SO, google, whatever. Lots.
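The "fill in your template HTML" step can be very small. Here is a minimal Python sketch using only the standard library; the template text, the field names customer and items, and the function name render are all made up for this example, and the HTML->PDF conversion itself is left to whichever converter you pick.

```python
from html import escape
from string import Template

# Hypothetical template; in practice this would live in a file that
# can be edited without touching the code.
TEMPLATE = Template("""<html><body>
<h1>Report for $customer</h1>
<ul>
$items
</ul>
</body></html>""")


def render(customer, items):
    """Fill the HTML template; the list grows with the amount of data."""
    rows = "\n".join("<li>" + escape(item) + "</li>" for item in items)
    return TEMPLATE.substitute(customer=escape(customer), items=rows)
```

Because the converter lays out the page from the filled HTML, the surrounding text reflows according to how much data was inserted, which is what fixed PDF text fields cannot do.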