Creating documents programmatically

I have an Android app that generates documents from a template, filled in with each client's information, which the client can then sign. The signature is merged into the document, and the whole document is converted to an image and uploaded to a server.
Even though it ends up as an image, the result should look as close as possible to an A4 page.
For this I use a WebView, and then I convert it to a Bitmap based on the width and height of a ScrollView.
For the signature I use Canvas.
But I'm not sure this approach is the best, as it is very difficult to simulate an A4 document. Depending on the device, the dimensions of the document are not proportional, and I would have to adjust them for each device's display size. Because of that, this part of the app is deliberately unavailable on some devices. But now we want to make it available on every device.
What approach do you recommend? Is there a way to build one document that fits all devices, with correct proportions and a consistent appearance?
Thanks in advance.

Print to a PDF file. You can use Flying Saucer for that, or similar libraries.
PDF is known for rendering exactly as defined, which makes it well suited to consistent scaling and display.
You could even drop the document template and use only (X)HTML to design and fill the page.
Simply fill in some HTML, link an image (<img src='rel.jpg'>), send all that to FlyingSaucer, and you're done.
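
Whichever output format is used, the page proportions can be fixed up front rather than derived from the device's view size. Here is a minimal sketch, assuming the document is always laid out on a fixed A4 canvas (210 x 297 mm) and only scaled for on-screen display. The class and method names are made up for illustration, and the DPI values are just examples:

```java
final class A4Page {
    // ISO 216 A4 dimensions in millimetres.
    static final double WIDTH_MM = 210.0;
    static final double HEIGHT_MM = 297.0;
    static final double MM_PER_INCH = 25.4;

    /** A4 width in pixels at the given resolution. */
    static int widthPx(int dpi) {
        return (int) Math.round(WIDTH_MM / MM_PER_INCH * dpi);
    }

    /** A4 height in pixels at the given resolution. */
    static int heightPx(int dpi) {
        return (int) Math.round(HEIGHT_MM / MM_PER_INCH * dpi);
    }

    /**
     * Scale factor that fits the whole page inside a view of the given size
     * while preserving the A4 aspect ratio.
     */
    static double fitScale(int viewWidthPx, int viewHeightPx, int dpi) {
        double scaleX = (double) viewWidthPx / widthPx(dpi);
        double scaleY = (double) viewHeightPx / heightPx(dpi);
        return Math.min(scaleX, scaleY);
    }

    public static void main(String[] args) {
        int dpi = 150; // example resolution for the uploaded image
        System.out.println(widthPx(dpi) + " x " + heightPx(dpi)); // prints "1240 x 1754"
    }
}
```

The idea is that the uploaded image always has the same pixel size, and the device's display size only affects the preview scale.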

Related

Best way to extract text from PDF in java

I want to make a program that is able to read PDF files and parse their contents.
Thus I need to extract the text using some kind of library. I found 3 ways to do so.
OCR libraries (like Tesseract)
ScanPdf libraries (like iText)
Converters from PDF to text.
I fail to understand the big differences between them since all of them will produce in the end a text file from the PDF. So which is the best way to go about this?

PDF is a complex format. If you open a PDF and you're staring at a bunch of text, that doesn't really tell you much. It could be that you're staring at an image file someone decided to wrap into a PDF file. That is almost certainly (99%+) what you have if someone scanned a document and told their scanner to 'scan to PDF', and it is definitely what you have if you took a PNG or JPG and chose 'save as PDF' or 'print to PDF'.
There is no text in the PDF then. There are pixels.
To turn pixels into text, that's where OCR libraries come in. That's what they do. That is all they do. It's an AI bonanza and error prone. No guarantees.
However, PDF is more complex than that, it isn't like PNG/JPG: It's more like HTML. You can put actual text in there.
This has different issues, though. You can place text blobs (i.e. a 'rectangle', with coordinates, and then the text that is supposed to go inside). Again a lot like HTML: You can do something like:
<p class="foo">
World!
</p>
<p class="bar">
Hello,
</p>
and then create CSS so that the foo is rendered after the bar block (can be as simple as .foo, .bar { display: block; } .foo {float: right}).
Turning that HTML into "World! Hello," is not all that tricky. Realizing that during a render, you end up seeing "Hello, World!", and thus writing code that returns "Hello, World!", that's way more complicated.
The same problem applies to PDF. For simple PDFs, extracting the raw text inside is not too difficult, but be aware that for even mildly complex PDFs, the text can arrive in a jumbled mess.
iText tries to give you at least enough power to tackle the harder case: it gives you a full hierarchical breakdown. It returns 'here is a text box, here is its positioning, and here is the text inside; and now here is another text box', and so on. It does not return one big string.
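
To illustrate why positions matter, here is a minimal sketch that rebuilds reading order from positioned text boxes. The TextChunk type is hypothetical and is not part of iText's API. The sketch also assumes a top-left origin with y growing downward, whereas PDF's native coordinate system has its origin at the bottom left:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.StringJoiner;

final class ReadingOrder {
    /** Hypothetical positioned text box; not an iText type. */
    static final class TextChunk {
        final float x;
        final float y;
        final String text;

        TextChunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Joins chunks top-to-bottom, then left-to-right, regardless of storage order. */
    static String join(List<TextChunk> chunks) {
        List<TextChunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingDouble((TextChunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        StringJoiner joined = new StringJoiner(" ");
        for (TextChunk c : sorted) {
            joined.add(c.text);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        List<TextChunk> chunks = new ArrayList<>();
        chunks.add(new TextChunk(60f, 10f, "World!")); // stored first, rendered second
        chunks.add(new TextChunk(10f, 10f, "Hello,"));
        System.out.println(join(chunks)); // prints "Hello, World!"
    }
}
```

A real document would need more than this, for example tolerance for chunks whose baselines differ slightly on the same line, and handling of multiple columns.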
In other words: The answer depends a lot on what PDFs you have / what PDFs you expect to be able to read, and how complex they are. If they are scans, you need an OCR library. If they are simple, a basic pdf2text converter will do fine. If you want to attempt to take into account fancily positioned PDFs with forms inside and 'popups' that can be opened and closed, oof. Probably all these tools are insufficient and you're signing up for many person-weeks' worth of effort.

There definitely IS text embedded in PDFs; it is NOT just pixels.
It depends on whether the PDF is a "true" PDF (i.e. you can highlight the text and copy and paste it elsewhere) or a scanned image.
With scanned images, you'll have to use an OCR API. All of the major cloud providers have OCR APIs (e.g. Amazon Textract, Google Document AI, Microsoft Form Recognizer, etc). If it's a true PDF, then I've found the pdf.js library (https://mozilla.github.io/pdf.js/) quite helpful in doing a direct text extraction.
Just know that doing this only gets you the text that is literally on the page, and there's quite a bit of work still to do to get key/value data fields programmatically across many documents.
This is something that my startup is working on (www.sensible.so/) too if you're interested in something more powerful!

PDFBox. Generate multipage document with the same image

I am using PDFBox and Java to generate a PDF document. The document has several pages with text and images. Every page has the same images in the header and footer. I am currently creating a new PDImageXObject and calling drawImage() with the new object every time I add a new page. The resulting document is very large, and I suppose that is because it contains repeated copies of the same image.
What would be the most effective way to do this? Most probably, PDFBox has a much better way of managing document-wide resources. I am new to PDFBox and frankly I could not find documentation or examples about this specific use case.
Many thanks

You answered the question yourself. You don't have to create a new PDImageXObject for every page; once per file is enough. You still have to call drawImage on every page, though. (If the header and footer are 100% identical you could save slightly more space by using a form XObject, but you won't save very much unless the header/footer is very complex.)

How to add a non-printable image in xsl / apache-fop

I'm using XSL to generate output that is printed on a pre-printed sheet, and that works fine.
Now I want to give the user a better preview (in the on-screen PDF version) by adding a background image that emulates the "pre-printed" stuff on the sheet, so the user has some context for what they are printing.
The question is: Is there any way I can set a background image in XSL (using Apache FOP) that is visible only in the PDF but not in the printed version of it?
Thank you all for reading or giving any advice.

Although, as the comments state, you can't have content in the PDF that does not come out in a physical printed copy, here is one possible workaround. Depending on how your users will ultimately be using FOP for PDF rendering and how you are driving the workflow, you can pass a parameter into the XSLT file before the transformation phase runs. You could then do a dual rendering of the same PDF: one that is presented to the user with the background image enabled, and one that gets printed without it. Set a variable similar to how they do in this Example, call it something like $isPreview, and use a simple if or choose statement to check for 'Y' or 'N'.
Since you are sending to a printer, you may even want to take advantage of FOP's ability to generate PostScript rather than PDF. I've used this feature quite extensively for print documents while also producing a PDF copy for electronic delivery via email or hosted services. I've yet to find any discrepancy between the PDF rendering and what is printed from the rendered PostScript file, so it should work well for you too.
As I said, this is not truly a solution to the problem as you've presented it, but as a workaround it could get you the desired results if you're clever about how you implement it.
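
The parameter-based dual rendering could look roughly like the sketch below. It uses only the JDK's built-in XSLT processor rather than FOP itself. The stylesheet, the generic page element, and the preprinted-sheet.png file name are all hypothetical stand-ins for a real XSL-FO stylesheet:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class PreviewToggle {
    // Hypothetical stylesheet: adds the background only when isPreview is 'Y'.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:param name='isPreview' select=\"'N'\"/>"
        + "<xsl:template match='/'>"
        + "<page>"
        + "<xsl:if test=\"$isPreview = 'Y'\">"
        + "<xsl:attribute name='background-image'>url(preprinted-sheet.png)</xsl:attribute>"
        + "</xsl:if>"
        + "<xsl:value-of select='/doc/text'/>"
        + "</page>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the same stylesheet for either the on-screen preview or the print copy. */
    static String render(String xml, boolean preview) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            // The parameter is set before the transformation runs.
            transformer.setParameter("isPreview", preview ? "Y" : "N");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><text>Hello</text></doc>";
        System.out.println(render(xml, true));  // preview copy, with background
        System.out.println(render(xml, false)); // print copy, without background
    }
}
```

In an actual FOP setup the same setParameter call would be made on the Transformer that feeds FOP, and the conditional would wrap whatever XSL-FO property carries the background image.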

I don't think the statement that it is not possible is true; I am just not sure how to create such a PDF with FOP. You can certainly add an image field: use a button field and place the image in the button, then set the properties of that button so it does not print (printable false).
PDF supports images in fields: https://answers.acrobatusers.com/adding-image-field-form-q41825.aspx
RenderX supports PDF Form fields but I do not see where they support an image inside the button, only text: http://www.renderx.com/reference.html#PDF%20Forms. But they do support setting a field to "printable".

Fetching correct news image - JAVA

I am trying to make a small news crawler.
I got everything working after many tries.
The problem is that nearly every HTML news page has more than 50 images.
Many of them are too small, so I filter them simply by checking their size.
Only images larger than 200x200 are kept.
But a single page can still contain many large images,
and some news articles don't have any related image at all.
Let's take an example:
Link to News - http://timesofindia.indiatimes.com/india/Over-9-3-lakh-TB-patients-in-India-undetected-Report/articleshow/24600851.cms
My code got this image - Image no. 0 http://timesofindia.indiatimes.com/photo/10905539.cms
Image height - 300
Image width - 450
But this image has nothing to do with the article's topic.
In simple words: how do I get the correct image dynamically?
I do not want to write code for each website.
A blank image is better than a wrong image.

I would recommend an approach where you judge an image's relevance by its position: if an image appears inside the article body, it is probably about the article itself (except for ads, which are very wide).
You can also find out the source of the image and decide whether it should interest you. For instance, ad images usually come from a different server that doesn't belong to the site (in your case, indiatimes.com).
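
The source check can be sketched with the standard library alone. This is only an illustration: the class name, method names, and the minSide parameter are assumptions, and the position-in-article check is left out because it depends on how the page is parsed:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class ImageSourceFilter {
    /** True if the image is served from the site's own domain or one of its subdomains. */
    static boolean isFromSite(String imageUrl, String siteDomain) {
        String host;
        try {
            host = new URI(imageUrl).getHost();
        } catch (URISyntaxException e) {
            return false; // unparseable URL: treat as not belonging to the site
        }
        if (host == null) {
            return false; // relative or malformed URL without a host part
        }
        host = host.toLowerCase(Locale.ROOT);
        String domain = siteDomain.toLowerCase(Locale.ROOT);
        return host.equals(domain) || host.endsWith("." + domain);
    }

    /** Combines the size filter from the question with the source check. */
    static boolean isCandidate(String imageUrl, int width, int height,
                               String siteDomain, int minSide) {
        return width >= minSide && height >= minSide && isFromSite(imageUrl, siteDomain);
    }
}
```

For example, isFromSite("http://timesofindia.indiatimes.com/photo/10905539.cms", "indiatimes.com") is true, while an image hosted on an unrelated ad server would be rejected. Note that this check alone would still accept the unrelated image from the question, so it is best treated as one signal among several.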

Consider the alt text. It usually contains either the complete title or some words from it.
Also, the article you linked does not have any image that is relevant to its title.
I also suggest jsoup:
jsoup: Java HTML Parser
jsoup is a Java library for working with real-world HTML. It provides
a very convenient API for extracting and manipulating data, using the
best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to
the same DOM as modern browsers do.
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe white-list, to prevent XSS attacks
output tidy HTML
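
The alt-text idea can be turned into a simple score. Below is a minimal sketch, assuming the title and alt text have already been extracted from the page (for instance with jsoup). The class name, the three-character minimum word length, and the scoring formula are all arbitrary choices for illustration:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class AltTextScore {
    /** Lower-cased words of at least three characters; shorter tokens are ignored. */
    private static Set<String> words(String s) {
        Set<String> out = new HashSet<>();
        if (s == null) {
            return out;
        }
        for (String w : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (w.length() >= 3) {
                out.add(w);
            }
        }
        return out;
    }

    /** Fraction of the title's distinct words that also occur in the alt text (0.0 to 1.0). */
    static double overlap(String title, String altText) {
        Set<String> titleWords = words(title);
        if (titleWords.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = words(altText);
        shared.retainAll(titleWords);
        return (double) shared.size() / titleWords.size();
    }
}
```

An image whose score stays at 0.0 could be skipped, which matches the preference for a blank image over a wrong one. Any threshold above that would need tuning against real pages.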

If an image is tampered with some additional content, how to remove that additional content from the image in Java?

I want to know if there is any solution for the following scenario:
I have an application that uploads files to a server after scanning and transcoding them. Suppose an image file is uploaded that has been tampered with by adding extra content on top of it. Since the uploaded file is illegitimate, I want to remove the added content and upload only the original part of the image. Is it possible to do so in Java?
Thanks.

It's not possible to detect in the general case, but there are some heuristic methods available to determine whether an image has been edited. Try using the tools at http://imageedited.com/ to get an idea of what's possible.
Removing the edit is a much more difficult problem, which is probably impossible with current methods.

I'm just speculating here, and I don't know how well it would work in practice, but you could do it if you limit to specific sources of tampering. E.g., suppose you want to remove the logo added to an image by memegenerator.net.
You know in advance what the text looks like and where it is. Create a transparent png template that matches the text. Then sum the differences between the image and template pixel colors, multiplying each by the alpha of the template pixel. Since for this particular logo, it's basically white (although it seems to have a thin black shadow) you would get false positives for a picture with a white part there, so you'd also need to verify that the surrounding pixels are (within a tolerance) not white. It's not clever but it could work for certain sites.
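
The alpha-weighted comparison could be sketched as follows. Like the idea itself, this is speculative: the class name and the offset parameters are assumptions, the template is expected to be an ARGB image that is transparent wherever the logo is absent, and the check on surrounding pixels is not included:

```java
import java.awt.image.BufferedImage;

final class WatermarkMatcher {
    /**
     * Mean absolute per-channel difference (0 to 255) between the template and the
     * image region at the given offset, weighted by the template's alpha channel.
     * Low values suggest the known logo is present at that position.
     */
    static double weightedDifference(BufferedImage image, BufferedImage template,
                                     int offsetX, int offsetY) {
        if (offsetX < 0 || offsetY < 0
                || offsetX + template.getWidth() > image.getWidth()
                || offsetY + template.getHeight() > image.getHeight()) {
            throw new IllegalArgumentException("template does not fit inside image at offset");
        }
        double weightedSum = 0.0;
        double alphaSum = 0.0;
        for (int y = 0; y < template.getHeight(); y++) {
            for (int x = 0; x < template.getWidth(); x++) {
                int t = template.getRGB(x, y);
                int alpha = (t >>> 24) & 0xFF;
                if (alpha == 0) {
                    continue; // fully transparent template pixel: not part of the logo
                }
                int p = image.getRGB(x + offsetX, y + offsetY);
                int diff = Math.abs(((p >> 16) & 0xFF) - ((t >> 16) & 0xFF))
                         + Math.abs(((p >> 8) & 0xFF) - ((t >> 8) & 0xFF))
                         + Math.abs((p & 0xFF) - (t & 0xFF));
                weightedSum += (diff / 3.0) * alpha;
                alphaSum += alpha;
            }
        }
        // A template with no visible pixels can never match anything.
        return alphaSum == 0.0 ? Double.MAX_VALUE : weightedSum / alphaSum;
    }
}
```

A caller would compare the result against a tolerance chosen by experiment, and separately check that the pixels around the logo are not white, to avoid the false positives mentioned above.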
For anything more flexible (e.g., logos on images which have subsequently been resized) you're into the territory of OCR and TinEye-like image matching, which are more advanced than I could advise you on.
To correctly detect all kinds of "tampering" and filter "illegitimate" from "legitimate" in general, you'd need an artificial intelligence that could understand the meaning and context of what it's seeing. The short answer is: you can't. That's what humans are for.
If this is for a website, probably the best thing you can do is a report button that lets users of your site report images that don't fit with your site's rules.
