How to get line count with iText - java

I am using iText library for writing a PDF file.
I want to give page numbers and page header on every page of file
How can I do that?

If you don't break lines manually then it's really hard almost impossible to precisely get the line count. This number depends on font measurement and the text layout on a page. iText is a great tool for PDF generation not for parsing.

I suggest you buy the iText book "iText in Action" which is very helpful. If you don't generate your page breaks yourself, you could checkout the PdfPageEvent methods such as onStartPage() and the getPageNumber() method.

Related

Best way to extract text from PDF in java

I want to make a program that is able to read PDF files and parse it's contents.
Thus I need to extract the text using some kind of library. I found 3 ways to do so.
OCR libraries (like Tesseract)
ScanPdf libraries (like iText)
Converters from PDF to text.
I fail to understand the big differences between them since all of them will produce in the end a text file from the PDF. So which is the best way to go about this?
PDF is a complex format. If you open a PDF and you're staring at a bunch of text, that doesn't really tell you much. It could be that you're staring at an image file someone decided to wrap into a PDF file. This is 99%+ certain what you have if someone scanned a document and told their scanner to 'scan to PDF', and 100% certain what you got if you have a PNG or JPG and 'save as PDF', or try to 'print to PDF' such a thing.
There is no text in the PDF then. There are pixels.
To turn pixels into text, that's where OCR libraries come in. That's what they do. That is all they do. It's an AI bonanza and error prone. No guarantees.
However, PDF is more complex than that, it isn't like PNG/JPG: It's more like HTML. You can put actual text in there.
This has different issues, though. You can place text blobs (i.e. a 'rectangle', with coordinates, and then the text that is supposed to go inside). Again a lot like HTML: You can do something like:
<p class="foo">
World!
</p>
<p class="bar">
Hello,
</p>
and then create CSS so that the foo is rendered after the bar block (can be as simple as .foo, .bar { display: block; } .foo {float: right}).
Turning that HTML into "World! Hello," is not all that tricky. Realizing that during a render, you end up seeing "Hello, World!", and thus writing code that returns "Hello, World!", that's way more complicated.
The same problem applies to PDF. For simple PDFs, extracting the raw text inside is not too difficult, but be aware that for even mildly complex PDFs, the text can arrive in a jumbled mess.
iText is trying to give you enough power, at least, to provide the latter: To give you a full hierarchical breakdown. It returns 'here is a text box, here is its positioning, and here is the text inside. and now here is another text box, etc'. It does not return a big string.
In other words: The answer depends a lot on what PDFs you have / what PDFs you expect to be able to read, and how complex they are. If they are scans, you need an OCR library. If they are simple, a basic pdf2text converter will do fine. If you want to attempt to take into account fancily positioned PDFs with forms inside and 'popups' that can be opened and closed, oof. Probably all these tools are insufficient and you're signing up to many personweeks worth of effort.
There definitely IS text embedded PDFs, it is NOT just pixels.
It depends on if the PDF is a "true" PDF (ie you can highlight the text and copy and paste it elsewhere) or if the PDF is a scanned image.
With scanned images, you'll have to use an OCR API. All of the major cloud providers have OCR APIs (ie Amazon Textract, Google Document AI, Microsoft Form Recognizer, etc). If it's a true PDF, then I've found the pdf.js library (https://mozilla.github.io/pdf.js/) quite helpful in doing a direct text extraction.
Just know that doing this only gets you the text that is literally on the page, and there's quite a bit of work still to do to get key/value data fields programmatically across many documents.
This is something that my startup is working on (www.sensible.so/) too if you're interested in something more powerful!

How to add a non-printable image in xsl / apache-fop

I'm trying to generate an xsl to be printed in a pre-printed sheet which works fine.
Now i want to give the user a better previsualization (in the pdf screen version) adding a background image which emulates the "pre-printed" stuf on the sheet to give the user a "context" of what is he printing.
The question is: Is there any way I can set a background image in xsl (using apache fop) visible only in pdf but not in the printed version of it?
Thank you all for reading or givin any advice.
Although as the comments state, you can't have content in the PDF that does not come out in a physical printed copy, here is one possible work around for you. Depending on how your users are ultimately going to be using FOP for PDF rendering and how your a driving the work flow, it's possible to pass a parameter into an xslt file before the transofrmation phase is run, so potentially, you could do a dual rendering of the same PDF, one that is presented to the user where the background image is enabled, and one that gets printed, you could just set a variable similar to how they do in this Example, and call it something like $isPreview, and just use a simple if or choose statement to check for 'Y' or 'N'.
Since you are sending to a printer, you may even want to take advantage of FOP's ability to generate to Postscript rather than PDF, I've used this feature quite extensively for print documents using FOP while also producing a PDF copy for electronic delivery via email or hosted services, and I've yet to find any discrepancy between the PDF rendering and what is printed after sending a rendered postscript file, so it should work well for you as well.
As I said, this is not truly a solution to your problem as you've presented it, but as a work around, it could get you the desired results if your clever about how you implement it.
I don;t think the statement that it is not possible is true, I am just not sure how to create such a PDF with FOP. Certainly you can add an image field. One would use a button field and place the image in the button. Then you would set the properties of that button to not print (printable false).
PDF support images in fields: https://answers.acrobatusers.com/adding-image-field-form-q41825.aspx
RenderX supports PDF Form fields but I do not see where they support an image inside the button, only text: http://www.renderx.com/reference.html#PDF%20Forms. But they do support setting a field to "printable".

Workaround for known Bug in PDFBox

first of all: My goal is to just load a PDF, highlight words from that PDF (Page) and show that Page / PDF to the user as Image.
Till now i parse the PDF with a custom Text-Stripper to get all word-positions with their coordinates ( needed to generate a rectangle for highlighting later)
After that i started to generate PDAnnotationTextMarkup's so. Now i'm at this point where i can see my annotations well if i save the pdf to a file and view it with a PDFReader by choice. But if i use the convertToImage Method given by PDFBox, i only get a normal page rendered without annotations.
After a little time on google i found: PDFBOX-2019 which was mentioned in another stackoverflow question
Now im looking for a workaround because i think the ticket history is showing that no one will fix that issue in about a year.
Anybody a good idea to fix that and achieve my goal?
thanks in advance
ben

How to get given paragraph content of pdf file using iText library?

Is there any way to get number of paragraphs or content of given paragraph in pdf file using iText library ?. I saw some classes like Paragraph, Chunk in some code to create new pdf file but I can not find any way to get these classes in reading file. Every idea is appreciated
Is the PDF you're talking about a Tagged PDF? If not, you are making the wrong assumptions about PDF. In a PDF, content is drawn on a page. For instance: an iText PdfPTable is converted into text state operators that draw snippets of text to a canvas, as well as graphics state operators that draw paths and shapes. If the PDF isn't tagged, the lines don't know that they are borders of a table; a word doesn't know to which cell it belongs.
The same goes for paragraphs: a snippet of text doesn't know whether it belongs to a sentence, to a paragraph, to a title line,...
Due to the very nature of PDF, what you're looking for may be impossible (using iText or any other software product), or may require heuristics (artificial intelligence) to examine all text state operators and the semantics of the content to get a result that mimics how humans would interpret text.
It's very easy to achieve if your PDF is tagged correctly. See the ParseTaggedPdf example.

how can get highligted word from pdf file?

I develope new program but i need to allow user to highlighting word in pdf file then i want to process the file to get list of highlighted words with place
how can do that by java
thank in advance
PDF files are PostScript, which is very difficult to process. I doubt there's an easy way.
Take a look at http://java-source.net/open-source/pdf-libraries , but be aware you might have some difficulty.
Also, read http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf for the specs of the highlight format. Depending on what "place" information you need, that might be enough.
How are you displaying the PDF? If you are displaying the image, you just need the word co-ordinates. Something like PdfBox or JPedal or maybe IText can do this.

Categories

Resources