Background:
I cannot install a third-party library such as iText or anything similar, so I have to write a PDF package myself.
I am looking for a resource that explains how to embed fonts and text in the PDF file format. So far, I have pretty good coverage of adding rectangles, ISO-8859-1 text, and n-hedrons using path elements.
Now, the next step is supporting UTF-8 (or, more generally, different character sets) with different fonts. Reading ISO 32000-1:2008, I cannot understand how to do that (the document is very technical and I am still a junior developer).
I found PDFBox, but I am having a hard time understanding the overcomplicated principles and decisions made.
If anyone has a reference to simple code examples or cookbooks on how to handle text properly in the PDF file format, I would appreciate a link.
If the language matters:
I am using Java, but I am really looking for a general text or article covering the topic, with examples in any language.
All the documents in our application are generated with java-11 + opensagres/xdocreport-2.0.2 + the Freemarker template engine.
The documents are generated correctly in multiple languages, such as Russian and Chinese.
We've observed that when the input is in Cambodian, the generated Word document contains boxes instead of Cambodian characters.
I've explained the issue in more detail here: https://github.com/opensagres/xdocreport/issues/575 , but I haven't received an answer so far.
Did anyone manage to generate documents containing this language with opensagres?
Thanks in advance!
The answer was to use the Aspose framework (which, unlike opensagres, is not free).
The biggest advantages are that with Aspose you can force the framework to use a set of fonts from the application resources, plus other great features (like smooth and simple PDF conversions).
The only trouble was that Aspose doesn't integrate with Freemarker templates. In our case that meant changing a lot of large, complex existing documents.
After some analysis, and with really kind help from Aspose support, we decided to use a hybrid solution:
Documents are still generated in memory with Opensagres and Freemarker.
After that, the documents are loaded with Aspose and rendered with the fonts from the application resources. The native font for Cambodian characters is DaunPenh; this font was placed in the application resources.
The full topic can be found here: https://forum.aspose.com/t/support-cambodian-language/252057
I am currently working on a project that generates contracts. The idea is that I enter the data in a form and save it in a simple database.
So far, this has been my favorite place to search for good ideas and simple solutions.
Now I am facing another problem and I don't know how I can solve that. I want to create a PDF and replace some placeholders with some data from my form.
One idea was to use an existing Word template with some bookmarks and replace them with the data from my form. Maybe there is a way to do that, and I am just too stupid to find it.
Another idea was to use XML. I thought I was clever and converted the Word template to a PDF so that I could convert that PDF to XML. Attached, you will find the XML file. But now I need the XSL file - is there an easy way to create it?
Or maybe there is another simple solution to solve my problem.
In these attachments you find the PDF file, the Word template and the XML:
Thank you a lot :)
Using a template is a good idea - it makes some changes much quicker to make and then deploy. The comments above are focused on conversion, but don't forget you need to merge your data in (population) first.
If you can use Adobe tools, you can have a PDF template and use the Adobe tools to populate. This saves a "conversion" stage.
You mentioned using Word for templates. This means you have to run through two stages of processing:
population - docx is a zipped set of XML files - so you can process them with your own code or using a library.
conversion - you need a PDF, so you have to convert the docx to PDF. You also have to watch out for fonts at this stage (i.e. make sure they are available on your host).
You could do the population stage yourself since you are familiar with XML, but it is definitely complicated. The conversion needs a tool built for the job. There are a few mentioned in the comments already.
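To make the population stage more concrete, here is a minimal sketch that uses only the JDK (Java 9+ because of readAllBytes). It copies a docx archive entry by entry and replaces ${key} placeholders in word/document.xml. The class name DocxPopulator, the populate method and the ${key} placeholder syntax are assumptions made for illustration; they do not come from any of the libraries mentioned here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of docx4j, JOD Reports, LibreOffice or Docmosis.
final class DocxPopulator {

    /** Copies a docx archive, replacing ${key} placeholders in word/document.xml. */
    static byte[] populate(byte[] docx, Map<String, String> values) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(docx));
             ZipOutputStream out = new ZipOutputStream(result)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                // On a ZipInputStream this reads only the current entry.
                byte[] content = in.readAllBytes();
                if (entry.getName().equals("word/document.xml")) {
                    String xml = new String(content, StandardCharsets.UTF_8);
                    for (Map.Entry<String, String> e : values.entrySet()) {
                        xml = xml.replace("${" + e.getKey() + "}", escapeXml(e.getValue()));
                    }
                    content = xml.getBytes(StandardCharsets.UTF_8);
                }
                // Every other entry (styles, images, relationships) is copied unchanged.
                out.putNextEntry(new ZipEntry(entry.getName()));
                out.write(content);
                out.closeEntry();
            }
        }
        // The zip stream is closed at this point, so the archive is complete.
        return result.toByteArray();
    }

    // Inserted values must not break the surrounding XML.
    private static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Be aware that Word often splits a placeholder across several runs in the XML, in which case a plain string replacement will not find it. That is a large part of why this stage gets complicated and why a library is usually the safer choice.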
There are some free/open-source and commercial tools that can do both parts:
docx4j
JOD Reports
LibreOffice (using the Java UNO API) (I blogged this once - Java Convert Word to PDF with UNO)
Docmosis (please note I work for Docmosis)
I suggest starting with the simple example you have attached and proving that you can both populate and convert it. Then switch to a more complicated example to see if you can do the other things that might be required (e.g. repeating, conditions or other logic) during the population stage.
I want to implement a feature for viewing PowerPoint files on the web.
You can do it simply by converting the PowerPoint to images, but I think video and audio can then no longer be used.
So the idea was to convert the PowerPoint to HTML and place it where I want. However, I am not able to implement the PowerPoint-to-HTML conversion myself. I have been looking for open-source projects or libraries that do this, but I have not found any yet.
The development environment is java8 + Spring Boot.
If you are OK with converting your PPT files to PDF before converting them to HTML, then pdf2htmlEX could be worth looking at. It is the best tool I could find for this kind of work, as it is capable of converting PDFs to HTML very precisely (have a look at the examples 1, 2, 3, 4). You should be able to find wrapper libraries in the Maven repository so that you can call it from your Java applications.
If you are OK with using an iframe, you may use a Microsoft solution: https://products.office.com/it-IT/office-online/view-office-documents-online
You may use this code:
<iframe src='https://view.officeapps.live.com/op/embed.aspx?src=[you_ppt_url]' width='100%' height='600px' frameborder='0'></iframe>
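Since the question mentions Java 8 and Spring Boot, here is a small sketch of how the snippet could be produced on the server side. The presentation URL is passed as a query parameter, so it should be URL-encoded first. The class OfficeViewerEmbed and the method iframeFor are hypothetical names, and as far as I know the viewer can only display files that are reachable from the public internet.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building the embed snippet shown above.
final class OfficeViewerEmbed {

    private static final String VIEWER = "https://view.officeapps.live.com/op/embed.aspx?src=";

    /** Returns the iframe markup for a presentation URL. */
    static String iframeFor(String pptUrl) {
        String encoded;
        try {
            // The String overload is used because it is available on Java 8.
            encoded = URLEncoder.encode(pptUrl, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
        return "<iframe src='" + VIEWER + encoded
                + "' width='100%' height='600px' frameborder='0'></iframe>";
    }
}
```

A controller could then put the result of OfficeViewerEmbed.iframeFor("https://example.com/deck.pptx") into the page model; the example.com address is only a placeholder.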
There's an older node package called PPTX2HTML. It outputs a bunch of garbled code on a canvas element, but it might work. They even have a demo website to try it out. They seem to have broken the PowerPoint up into parseable XML and rendered the elements.
I want to create a PDF using pdfbox (https://pdfbox.apache.org/cookbook/documentcreation.html). However, pdfbox does not seem to provide dynamic text layout mechanisms like those a text editor like OpenOffice provides (automatic text flow using predefined text formattings like block format, centered text, line breaks etc.).
Is there any Java library that provides that functionality on top of pdfbox or separate from it? Or do you have any free code available?
I had the same problem, that's why I started PDFBox-Layout. It has support for simple word wrapping, text and paragraph alignment, pagination, vertical and column layout, and markup for easy bold/italic highlighting.
See the Wiki for more information. Maybe you will find it useful :-)
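For anyone wondering what simple word wrapping boils down to, below is a minimal greedy line-breaking sketch in plain Java. It is not PDFBox-Layout's actual implementation; the class GreedyWrap and its wrap method are hypothetical names, and the width function is passed in as a parameter so the example runs without PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative only: a greedy word wrap with a pluggable width measure.
final class GreedyWrap {

    /** Breaks text into lines whose measured width does not exceed maxWidth. */
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> widthOf) {
        List<String> lines = new ArrayList<>();
        String line = "";
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            String candidate = line.isEmpty() ? word : line + " " + word;
            if (!line.isEmpty() && widthOf.applyAsDouble(candidate) > maxWidth) {
                // The word does not fit: close the current line and start a new one.
                lines.add(line);
                line = word;
            } else {
                line = candidate;
            }
        }
        if (!line.isEmpty()) {
            lines.add(line);
        }
        return lines;
    }
}
```

With PDFBox, the width function would presumably be built on PDFont.getStringWidth, which reports the width in thousandths of a text-space unit (so multiply by fontSize / 1000) and throws IOException, so it needs a small wrapper. A word wider than maxWidth simply ends up on a line of its own; this sketch does no hyphenation.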
BlockFrame (on GitHub) is another layout framework for PDFBox, filling a different space to PDFBox-Layout. PDFBox-Layout seems oriented to text, but BlockFrame is designed for complex data structures. It's also designed with extensibility in mind.
I needed something to print crosswords I'd generated, and wound up coding a framework. If there's interest, I'll extend and maintain it. It should be possible to use BlockFrame to draw small, complex sections of larger PDF documents, as well as generate an entire PDF.
Feedback would be appreciated.
I had a similar problem in Ruby. I used Prawn in the past, which has a syntax similar to pdfbox. Lots of primitives.
I found it was a lot better to use an HTML+CSS to PDF solution. I'm already generating HTML and CSS, and it's easy to make print-specific CSS. Then I use either wkhtmltopdf or princexml to generate the PDF. Both are command-line tools that run on a variety of platforms.
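Because these are command-line tools, the same approach also works from Java. Here is a rough sketch that shells out to wkhtmltopdf with its basic "input output" invocation. It assumes wkhtmltopdf is installed and on the PATH; the class HtmlToPdf and its methods are illustrative names, not an existing wrapper library.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the wkhtmltopdf command-line tool.
final class HtmlToPdf {

    /** Builds the command line: wkhtmltopdf <input.html> <output.pdf>. */
    static List<String> command(Path html, Path pdf) {
        return Arrays.asList("wkhtmltopdf", html.toString(), pdf.toString());
    }

    /** Runs the conversion and fails if the tool reports an error. */
    static void convert(Path html, Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(html, pdf))
                .inheritIO() // show the tool's own output in the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("wkhtmltopdf exited with code " + exitCode);
        }
    }
}
```

Calling HtmlToPdf.convert(Paths.get("contract.html"), Paths.get("contract.pdf")) would then write the PDF next to the HTML file; both file names are placeholders.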
Are there any Java APIs or tools that can convert handwritten scanned documents to text files?
I have tried Google Tesseract and a few other tools, but I am not getting satisfactory results for handwritten scanned documents.
Strange that other answers here point to OCR tools while the question clearly asks about handwriting recognition.
Handwriting recognition is an even more difficult area than OCR, and the number of available technologies is very small. I don't think you will be able to find any open-source tool for it, but there are a few commercial vendors:
http://www.a2ia.com
http://www.parascript.com/
I don't know if they have a Java API, but the best way to start is by contacting them.
You can try the Java OCR Project. I think you might have to write the text-file output part yourself, though.
Also, handwriting tends to vary from one individual to another, so I guess you will need to select some good training data to get good results.
Have a look at these:
Java OCR
Java OCR is a suite of pure Java libraries for image processing and character recognition. It provides a modular structure for easier deployment.
GOCR
GOCR is an OCR program, developed under the GNU Public License. It converts scanned images of text back to text files.