I am using PDFBox and Java to generate a PDF document. The document has several pages with text and images. Every page has the same images in the header and footer. I am currently creating a new PDImageXObject and calling drawImage() with the new object every time I add a new page. The resulting document is very large, and I suppose that is because it contains repeated copies of the same image.
What would be the most effective way to do this? Most probably PDFBox has a much better way of managing document-wide resources. I am new to PDFBox, and frankly I could not find documentation or examples for this specific use case.
Many thanks
You answered the question yourself. You don't have to create a new PDImageXObject every time; once per file is enough. However, you still have to call drawImage on every page. (If the header and footer are 100% identical, you could save slightly more space by using a form XObject, but you won't save very much unless the header/footer is very complex.)
I have an Android app that generates documents from a template filled with each client's info, which the client can then sign. The signature is merged into the document, and the whole document is converted to an image and uploaded to a server.
Despite being converted to image, the objective is to be as similar as possible to the A4 format.
For this I use a WebView, and then I convert it to a Bitmap based on the width and height of a ScrollView.
For the signature I use Canvas.
But I'm not sure this approach is the best, as it is very difficult to simulate an A4 document. Depending on the device, the dimensions of the document are not proportional, and I would have to adjust them for each device's display size. Because of that, this component of the app is deliberately unavailable on some devices. But now we want to make it available on every device.
What approach do you recommend? Is there some way to develop a single document that fits all devices with correct proportions and a similar appearance?
Thanks in advance.
Print to a PDF file. You can use Flying Saucer for that, or similar libraries.
PDF is known for preserving the layout exactly as defined, and it is especially well suited to proper scaling and display.
You could even get away from the doc template and only use (X)HTML to design and fill the page.
Simply fill in some HTML, link an image (<img src='rel.jpg'>), send all that to FlyingSaucer, and you're done.
I'm attempting to perform some string validation against individual PDF pages in a file using Apache PDFBox.
I'm going to be using PDFTextStripper for the majority of this, so the first issue to tackle was that all the PDFs I'm validating are generated 2-up: page 1 of 2 and page 2 of 2 sit side by side on the same page, as if you had scanned an open book face down. In addition, they were oriented incorrectly and needed rotating by 90 degrees so that PDFTextStripper could read them properly.
Using elements of the questions/solutions below, I have built a method which first crops each page exactly in half, exports the cropped pages in order to a new file, rotates each page to the correct orientation, and then saves the file:
Rotate PDF around its center using PDFBox in java
Split a PDF page in two parts [duplicate]
Visually, my method seems to work as expected until I run PDFTextStripper against the result: it returns the text of not just the half page I want, but also of the half I cropped away.
To confirm the issue, I extracted a single page from the document and saved it as a new file. When running PDFTextStripper I still get the same result, even though all I can see is literally one page. Searching in Adobe doesn't bring up the hidden, legacy data either.
I can only assume that during my transform method I need to redefine the cropped page so that it holds only the contents of the cropped page.
My question is, how can I do this?
P.S. I haven't posted my code, as it's basically an amalgamation of the solutions provided in the links above. However, if it is needed, I can provide it.
The PDFTextStripper ignores the CropBox you set to crop the pages. It also ignores whether text is covered by some filled rectangle or image, or whether the text is invisible; it extracts all text (except text in patterns or contained in Type 3 font characters).
You might want to try the PDFTextStripperByArea instead. This class (which is derived from PDFTextStripper) restricts itself to regions you can define.
(Unfortunately these regions have to be defined using a different coordinate system than the one used for the CropBox, so usually you will have to transform the coordinates first.)
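The coordinate transformation mentioned above can be sketched with plain Java. This is only an illustration under stated assumptions: the page is unrotated, the rectangle to extract is known in PDF user space (origin at the bottom left, y pointing up), and the region is expected relative to the top-left corner of the page box with y pointing down. The class and method names (`RegionCoordinates`, `toTopLeftOrigin`) are hypothetical, and which page box PDFBox measures from should be checked against the version in use.

```java
import java.awt.geom.Rectangle2D;

final class RegionCoordinates {

    /**
     * Converts a rectangle from PDF user space (x/y = lower-left corner, y up)
     * into a rectangle measured from the top-left corner of pageBox (y down).
     * pageBox is also given in PDF user space (x/y = its lower-left corner).
     * Assumption: the page is not rotated.
     */
    static Rectangle2D.Double toTopLeftOrigin(Rectangle2D.Double pdfRect,
                                              Rectangle2D.Double pageBox) {
        double x = pdfRect.x - pageBox.x;
        double pageTop = pageBox.y + pageBox.height;
        // Distance from the top edge of the page box to the top edge of the rectangle.
        double y = pageTop - (pdfRect.y + pdfRect.height);
        return new Rectangle2D.Double(x, y, pdfRect.width, pdfRect.height);
    }

    public static void main(String[] args) {
        // Hypothetical 2-up sheet of 842 x 595 points; take the left half.
        Rectangle2D.Double sheet = new Rectangle2D.Double(0, 0, 842, 595);
        Rectangle2D.Double leftHalf = new Rectangle2D.Double(0, 0, 421, 595);
        System.out.println(toTopLeftOrigin(leftHalf, sheet));
    }
}
```

The resulting rectangle is the kind of value one would pass to `PDFTextStripperByArea.addRegion(name, rect)` before calling `extractRegions(page)` and `getTextForRegion(name)`. For rotated pages an additional rotation step would be needed, which this sketch does not cover.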
I have to create a PDF using iText which will contain a button that, when clicked, should add a row to an existing PdfPTable. I wrote some code to create a PushbuttonField. While trying to set the action, I can only find PdfAction.javaScript. I am not able to figure out how to add a row to a table. I tried searching online, but all I could find was PdfAction.javaScript.
Any help would be greatly appreciated. Thank you.
When you create a PDF file, you draw text, lines and shapes to a canvas. That is also what happens when you add a PdfPTable to a Document. If you look at the syntax of the PDF page, you won't recognize a table. You'll find text (the content of the cells), lines (the borders), and shapes (the backgrounds), but you won't find a table. If the table is distributed over different pages, the "table" on one page won't know that it is related to the "table" on the other page.
Sure, you can add semantic structure to the document by introducing marked content, and by creating a structure tree, but that mechanism which we call Tagged PDF can't be used to make the PDF "editable" the same way a Word document is editable. Tagged PDF is (among others) used to allow assistive technology to present the content to the visually impaired (e.g. in the context of PDF/UA). The presence of structure doesn't change the fact that all text, all lines, and all shapes are added at absolute positions.
This is very different from HTML where the position on a page of a <table>, <tr>, <th>, or <td> is calculated at the moment the page is rendered. In HTML this position can even change when you resize the browser window.
There is no such thing in PDF (except if you use XFA (*), a technology that has been deprecated since ISO 32000-2). All content on a page has a fixed position, hardcoded into the page's content stream. Changing the size of the PDF viewer window won't change the position of the page content.
Because of all of this, your question is invalid. It is impossible to create a button in PDF that adds a row to a table, because:
In many cases there is no table: there is just a bunch of text, lines, and shapes at absolute positions,
Even if there is the notion of a table (using Tagged PDF): the visual representation of that table is fixed at creation time; it can't be changed at consumption time.
You want to use an ordinary PDF viewer as if it were a PDF editor. That is impossible for all the reasons listed above.
(*) XFA was deprecated for different reasons. One of the most important is the lack of support for it: there aren't many viewers that support XFA. If you were to post a follow-up question asking "How can I create an XFA document?", the answer would be: "Don't do this!" Creating XFA is extremely complex, and once you've succeeded in creating an XFA form, you'll discover that many of your customers won't be able to consume the file because their viewer doesn't support the format.
I am trying to make a small news crawler.
I got everything working after many tries.
The problem is that almost every HTML news page has more than 50 images.
Many of them are too small, so I am filtering them simply by checking their size.
Only images larger than 200x200 are kept.
But there are often several large images on a single page,
and some news articles do not have any related image at all.
Let's take an example:
Link to News - http://timesofindia.indiatimes.com/india/Over-9-3-lakh-TB-patients-in-India-undetected-Report/articleshow/24600851.cms
My code got this image - Image no. 0 http://timesofindia.indiatimes.com/photo/10905539.cms
Image height - 300
Image width - 450
But this image has nothing to do with the article's topic.
In simple words: how can I get the correct image dynamically?
I do not want to write code for each website.
A blank image is better than a wrong image.
I would recommend an approach where you judge the relevance of an image by its position. If an image appears inside the article body, it is probably an image about the article itself (except for ads, which are usually very wide).
You can find out the source of the image and decide whether it should interest you or not. For instance, ad images usually come from a different server which doesn't belong to the site (in your case indiatimes.com).
Consider the alt text. The alt text usually contains either the complete title or some words from the title.
Also, the article does not have any relevant image associated with the title.
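The three hints above (position, image host, alt text) could be combined into a simple score. The following is only a sketch: the `Candidate` holder, the weights, and the minimum score are assumptions chosen for illustration, not tuned values, and the crawler is assumed to have collected each image's `src`, `alt`, dimensions, and whether it sits inside the article body.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

final class ArticleImagePicker {

    /** Hypothetical holder for what the crawler already knows about one img element. */
    static final class Candidate {
        final String src;
        final String alt;
        final int width;
        final int height;
        final boolean insideArticle; // true if the img sits inside the article body

        Candidate(String src, String alt, int width, int height, boolean insideArticle) {
            this.src = src;
            this.alt = alt;
            this.width = width;
            this.height = height;
            this.insideArticle = insideArticle;
        }
    }

    /** Lower-cased words longer than three characters, so short filler words are ignored. */
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        if (text == null) {
            return result;
        }
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (w.length() > 3) {
                result.add(w);
            }
        }
        return result;
    }

    /** True if the image is served from siteDomain or one of its subdomains. */
    static boolean sameSite(String src, String siteDomain) {
        try {
            String host = URI.create(src).getHost();
            return host != null && (host.equals(siteDomain) || host.endsWith("." + siteDomain));
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: treat it as foreign
        }
    }

    /** Returns -1 for images that are too small, otherwise a relevance score. */
    static int score(Candidate c, String siteDomain, String title) {
        if (c.width <= 200 || c.height <= 200) {
            return -1; // the size filter from the question
        }
        int score = 0;
        if (c.insideArticle) {
            score += 2; // position hint
        }
        if (sameSite(c.src, siteDomain)) {
            score += 1; // ad images usually come from another server
        }
        Set<String> shared = words(c.alt);
        shared.retainAll(words(title));
        score += shared.size(); // alt text sharing words with the title
        return score;
    }

    /** Best candidate scoring at least minScore, or empty: no image beats a wrong one. */
    static Optional<Candidate> pick(List<Candidate> candidates, String siteDomain,
                                    String title, int minScore) {
        Candidate best = null;
        int bestScore = minScore - 1;
        for (Candidate c : candidates) {
            int s = score(c, siteDomain, title);
            if (s > bestScore) {
                best = c;
                bestScore = s;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Returning an empty `Optional` when nothing reaches the threshold matches the stated preference for a blank image over a wrong one. The threshold would have to be found by trying it on real pages.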
I also suggest JSoup:
jsoup: Java HTML Parser
jsoup is a Java library for working with real-world HTML. It provides
a very convenient API for extracting and manipulating data, using the
best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to
the same DOM as modern browsers do.
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe white-list, to prevent XSS attacks
output tidy HTML
I am working on a billing program - right now when you click the appropriate button it generates a frame that shows the various charges etc, basically an invoice. Is there a way to give the user an option of saving that frame as a document, either Microsoft Word, Microsoft Works or PDF?
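One approach would be to save the frame as an image. You can paint the component into a BufferedImage like this:
// Render the component into an off-screen image of the same size
Dimension size = myComponent.getSize();
BufferedImage myImage = new BufferedImage(size.width, size.height,
        BufferedImage.TYPE_INT_RGB);
Graphics2D g2 = myImage.createGraphics();
myComponent.paint(g2);
g2.dispose(); // release the graphics context when done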
You can then save this image and pass it into a JasperReports report. From the JasperPrint object you can then export to a few different formats, including PDF. A better but similar approach would be to pass the Graphics context into JasperReports (there is a renderer to do this in Jasper, and the quality is much better).
Paint the JFrame into a BufferedImage using the JFrame's paint() method
Save the image as jpg or png or whatever image format
Take some pdf library and create a blank pdf (e.g. iText)
Insert the image into the PDF document
Save it - done
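The first two steps above can be sketched with the standard library alone. This is a minimal sketch, assuming the component has already been laid out (as it is when the invoice frame is visible on screen); the class and method names are hypothetical. The PDF steps are left out because they depend on the PDF library chosen.

```java
import java.awt.Dimension;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.swing.JComponent;

final class ComponentSnapshot {

    /** Step 1: paint the component into an off-screen image of the same size. */
    static BufferedImage render(JComponent component) {
        Dimension size = component.getSize();
        if (size.width <= 0 || size.height <= 0) {
            // Not sized yet (never shown): fall back to the preferred size.
            size = component.getPreferredSize();
            component.setSize(size);
        }
        component.doLayout();
        BufferedImage image = new BufferedImage(size.width, size.height,
                BufferedImage.TYPE_INT_RGB);
        Graphics2D g2 = image.createGraphics();
        try {
            component.paint(g2);
        } finally {
            g2.dispose();
        }
        return image;
    }

    /** Step 2: write the image to disk as PNG (lossless, so text stays sharp). */
    static File savePng(BufferedImage image, File target) throws IOException {
        if (!ImageIO.write(image, "png", target)) {
            throw new IOException("No PNG writer available");
        }
        return target;
    }
}
```

For a JFrame, one would typically pass the frame's content pane (cast to `JComponent`) so that the window decorations are not part of the picture. PNG is used here rather than JPEG because compression artifacts around text tend to look poor in an invoice.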
Instead of generating a Word document, I'd rather use a Java library like iText to produce a PDF document (more portable) or, even better, the JasperReports library, which can output reports in a wide range of formats (PDF, XML, HTML, CSV, XLS, RTF, TXT), as suggested by bigbrother82 in a comment. This looks cleaner to me than using an image, especially for printing (not to mention that your invoice may be a multi-page document).
I'd likely look at this from a slightly different direction and instead of asking how to splat the GUI form as-is into a PDF or word document I'd ask how to get that content into a Word/PDF document.
The answer to that question is Apache FOP. Generate an XSL-FO file and ask FOP to convert it into an RTF document (with a .DOC extension) or a PDF.
Normally one does this by generating an XML file containing the data you need printed, then using an XSLT stylesheet to convert that XML to XSL-FO. I, however, found it easier to generate the XSL-FO file directly using a templating language (such as FreeMarker).
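The XML-to-XSL-FO step can be sketched with the XSLT support built into the JDK. The invoice XML layout (`invoice`, `customer`, `charge` with `label` and `amount` attributes) and the stylesheet are made up for illustration; a real invoice would need a richer layout. Handing the result to FOP is not shown, since that part depends on the FOP version in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class InvoiceToFo {

    // Hypothetical stylesheet: one A4 page with a heading and one line per charge.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:template match='/invoice'>"
        + "<fo:root>"
        + "<fo:layout-master-set>"
        + "<fo:simple-page-master master-name='A4' page-height='29.7cm'"
        + " page-width='21cm' margin='2cm'>"
        + "<fo:region-body/>"
        + "</fo:simple-page-master>"
        + "</fo:layout-master-set>"
        + "<fo:page-sequence master-reference='A4'>"
        + "<fo:flow flow-name='xsl-region-body'>"
        + "<fo:block font-weight='bold'>Invoice for <xsl:value-of select='customer'/></fo:block>"
        + "<xsl:for-each select='charge'>"
        + "<fo:block><xsl:value-of select='@label'/>: <xsl:value-of select='@amount'/></fo:block>"
        + "</xsl:for-each>"
        + "</fo:flow>"
        + "</fo:page-sequence>"
        + "</fo:root>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms the invoice data XML into an XSL-FO document, returned as a string. */
    static String toXslFo(String invoiceXml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(invoiceXml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<invoice><customer>ACME</customer>"
                + "<charge label='Setup' amount='100'/>"
                + "<charge label='Support' amount='40'/></invoice>";
        System.out.println(toXslFo(xml));
    }
}
```

The templating-language route mentioned above skips the XSLT entirely and writes the `fo:` elements straight from a template, which can be easier to read when the layout is simple.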
You might want to look at the online demo for Docmosis as an example; it lets the user choose the document format up front. That demo does a download, but it could direct the document into a frame instead and leave it to the browser to display. This style of working (as mentioned by others) looks at the problem from a different angle: decide the format up front, rather than trying to save the frame contents after the fact.