Sorry for my English.
Using the following code, I want to add a text field to a PDF document that is split into a comb of several character cells and filled in, but the comb layout only appears after I click the field in Adobe Reader or fill it in again from Adobe Reader.
At first the field is rendered without the comb cells; only after I click it does it display correctly (see the attached screenshots).
What am I doing wrong?
public class TestClass {
    public static void main(String[] args) throws IOException {
        PdfWriter writer = new PdfWriter("D:\\TestPdf.pdf");
        PdfDocument pdf = new PdfDocument(writer);
        PdfAcroForm myAcro = PdfAcroForm.getAcroForm(pdf, true);
        pdf.addNewPage();
        PdfFormField text = PdfFormField.createText(pdf, new Rectangle(0f, 800f, 200f, 50f), "textValue", "")
                .setFileSelect(false)
                .setMultiline(false)
                .setPassword(false)
                .setMaxLen(5)
                // Comb is meaningful only if MaxLen is present and the Multiline, Password
                // and FileSelect flags are clear: the field is divided into MaxLen equally
                // spaced positions (combs) and the text is laid out into those combs.
                .setComb(true);
        text.setValue("What not good!");
        text.regenerateField();
        myAcro.addField(text);
        pdf.close();
    }
}
Try to set myAcro.setNeedAppearances(true).
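A minimal sketch of where that call could go in the code above (assuming the same iText 7 setup; untested against your document):

PdfAcroForm myAcro = PdfAcroForm.getAcroForm(pdf, true);
// ask viewers such as Adobe Reader to (re)generate the field appearances,
// so the comb layout is rendered immediately instead of only after a click
myAcro.setNeedAppearances(true);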
How can I extract diagonal watermark text from a PDF using PDFBox?
After referring to the ExtractText example's rotationMagic option, I can now extract vertical and horizontal watermarks, but not diagonal ones. This is my code so far.
class AngleCollector extends PDFTextStripper {
    private final Set<Integer> angles = new TreeSet<>();

    AngleCollector() throws IOException {}

    Set<Integer> getAngles() {
        return angles;
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        int angle = ExtractText.getAngle(text);
        angle = (angle + 360) % 360;
        angles.add(angle);
    }
}
class FilteredTextStripper extends PDFTextStripper {
    FilteredTextStripper() throws IOException {}

    @Override
    protected void processTextPosition(TextPosition text) {
        int angle = ExtractText.getAngle(text);
        if (angle == 0) {
            super.processTextPosition(text);
        }
    }
}
final class ExtractText {
    static int getAngle(TextPosition text) {
        // the matrix containing the starting text position
        Matrix m = text.getTextMatrix().clone();
        m.concatenate(text.getFont().getFontMatrix());
        return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
    }

    private List<String> getAnnots(PDPage page) throws IOException {
        List<String> returnList = new ArrayList<>();
        for (PDAnnotation pdAnnot : page.getAnnotations()) {
            if (pdAnnot.getContents() != null && !pdAnnot.getContents().isEmpty()) {
                returnList.add(pdAnnot.getContents());
            }
        }
        return returnList;
    }

    public void extractPages(int startPage, int endPage, PDFTextStripper stripper, PDDocument document, Writer output) {
        for (int p = startPage; p <= endPage; ++p) {
            stripper.setStartPage(p);
            stripper.setEndPage(p);
            try {
                PDPage page = document.getPage(p - 1);
                for (var annot : getAnnots(page)) {
                    output.write(annot);
                }
                int rotation = page.getRotation();
                page.setRotation(0);
                var angleCollector = new AngleCollector();
                angleCollector.setStartPage(p);
                angleCollector.setEndPage(p);
                angleCollector.writeText(document, output);
                for (int angle : angleCollector.getAngles()) {
                    // prepend a transformation
                    try (var cs = new PDPageContentStream(document, page,
                            PDPageContentStream.AppendMode.PREPEND, false)) {
                        cs.transform(Matrix.getRotateInstance(-Math.toRadians(angle), 0, 0));
                    }
                    stripper.writeText(document, output);
                    // remove the prepended transformation
                    ((COSArray) page.getCOSObject().getItem(COSName.CONTENTS)).remove(0);
                }
                page.setRotation(rotation);
            } catch (IOException ex) {
                System.err.println("Failed to process page " + p + ": " + ex);
            }
        }
    }
}
public class pdfTest {
    private pdfTest() {
    }

    public static void main(String[] args) throws IOException {
        var pdfFile = "test-resources/pdf/pdf_sample_2.pdf";
        Writer output = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        var etObj = new ExtractText();
        var rawDoc = PDDocument.load(new File(pdfFile));
        PDFTextStripper stripper = new FilteredTextStripper();
        if (rawDoc.getDocumentCatalog().getAcroForm() != null) {
            rawDoc.getDocumentCatalog().getAcroForm().flatten();
        }
        etObj.extractPages(1, rawDoc.getNumberOfPages(), stripper, rawDoc, output);
        output.flush();
    }
}
Edit 1:
I am also unable to extract form field contents (AcroForm, XFA) with correct alignment via the text extraction code. How can I do that?
I am attaching the sample PDFs for reference.
Sample PDF 1
Sample PDF 2
I need the following using PDFBox:
Diagonal text detection (including watermarks).
Form field extraction with proper alignment.
In your "question" you actually ask multiple distinct questions. I'll look into each of them. The answers will be less specific than you'd probably wish because your questions are based on assumptions that are not all true.
"How can I extract diagonal watermark text from PDF using PDFBox ?"
First of all, PDF text extraction works by inspecting the instructions in the content streams of a page and contained XObjects, finding text drawing instructions therein, taking their coordinates, orientations and string parameters, mapping the strings to Unicode, and arranging the many individual Unicode strings by their coordinates and orientations into a single content string.
In the case of PDFBox, the PDFTextStripper does this as-is with limited support for orientation; it can be extended to filter the text pieces by orientation for better orientation support, as shown in the ExtractText example with rotationMagic activated.
double_watermark.pdf
In the case of your double_watermark.pdf example PDF, though, the diagonal text "Top Secret" is not created with text drawing instructions but with path construction and painting instructions, as Tilman already remarked. (Actually the paths here are all sequences of very short lines, no curves are used, which you can see using a high zoom factor.) Thus, PDF text extraction cannot extract this text.
To answer your question
How can I extract diagonal watermark text from PDF using PDFBox?
in this context, therefore: you cannot.
(You can of course use PDFBox as a PDF processing framework on top of which you also collect paths and try to match them to characters, but that would be a larger project in itself. Or you can use PDFBox to render the pages as bitmaps and apply OCR to those bitmaps.)
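If you go the rendering route, a minimal sketch using PDFBox's PDFRenderer could look like the following (the file name is just an example, and the OCR step itself, e.g. with an external engine, is left out):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class RenderForOcr {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("double_watermark.pdf"))) {
            PDFRenderer renderer = new PDFRenderer(document);
            for (int i = 0; i < document.getNumberOfPages(); i++) {
                // 300 DPI usually is a reasonable resolution for OCR engines
                BufferedImage image = renderer.renderImageWithDPI(i, 300);
                ImageIO.write(image, "png", new File("page-" + (i + 1) + ".png"));
                // feed page-N.png to the OCR engine of your choice
            }
        }
    }
}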
"I am also unable to detect form (Acro, XFA) field contents via TextExtractor code with correct Alignment. How can I do that ?"
Form data in AcroForm or XFA form definitions are not part of the page content streams or the XObject content streams referenced from therein. Thus, they are not immediately subject to text extraction.
AcroForm forms
AcroForm form fields are abstract PDF data objects which may or may not have associated content streams for display. To include them into the content streams text extraction operates on, you can first flatten the form. As you mentioned in your own answer, you also have to activate sorting to extract the field contents in context.
Beware, PDF renderers do have certain freedoms when creating the visualization of a form field. Thus, text extraction order may be slightly different from what you expect.
XFA forms
XFA form definitions are a cuckoo's egg in PDF: they are XML streams which are not related to regular PDF objects; furthermore, XFA in PDF was deprecated a number of years ago. Thus, most PDF libraries don't support XFA forms.
PDFBox only allows you to extract or replace the XFA XML stream, so there is no direct support for XFA form contents during text extraction.
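If the raw XFA data is all you need, a sketch along these lines extracts the XML stream (assuming PDFBox 2.x and that getXFA() returns a resource; the file names are placeholders):

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;

public class DumpXfa {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("pdf_sample_2.pdf"))) {
            PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
            if (acroForm != null && acroForm.getXFA() != null) {
                // dump the XFA XML stream; the form data can then be read as plain XML
                Files.write(Paths.get("xfa.xml"), acroForm.getXFA().getBytes());
            }
        }
    }
}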
"Form field extraction with proper alignment."
This is solved by setSortByPosition.
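Putting the two points together, a minimal sketch (assuming PDFBox 2.x and one of the attached sample files) could look like this:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractWithFormFields {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("pdf_sample_1.pdf"))) {
            PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
            if (acroForm != null) {
                // merge the field appearance streams into the page content
                // so the field values become part of regular text extraction
                acroForm.flatten();
            }
            PDFTextStripper stripper = new PDFTextStripper();
            // sort text pieces by position so field contents appear in context
            stripper.setSortByPosition(true);
            System.out.println(stripper.getText(document));
        }
    }
}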
Objective: fill out a PDF form in which the characters must be inscribed into individual squares (using PDFBox 2.0.24), i.e. the fields carry the "divide into N characters" (comb) setting.
template.pdf
Problem: the horizontal alignment doesn't work. If you open the template in any viewer, you can enter the information by hand and it is displayed correctly,
but after filling the form from the program the alignment is wrong: filled-form.pdf
Some screenshots:
In the viewer:
After filling the form from the program:
An example that reproduces the problem:
template.pdf
code:
public static byte[] testFillPdf(byte[] pdf, Map<String, String> data) throws Exception {
    PDDocument inDoc = PDDocument.load(pdf);
    PDDocumentCatalog docCatalog = inDoc.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    acroForm.setXFA(null);
    acroForm.setNeedAppearances(false);
    PDTextField field1 = (PDTextField) acroForm.getField("field1");
    field1.setValue("1");
    PDTextField field2 = (PDTextField) acroForm.getField("field2");
    field2.setValue("2");
    acroForm.refreshAppearances();
    acroForm.flatten();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    inDoc.setAllSecurityToBeRemoved(true);
    inDoc.save(baos);
    inDoc.close();
    return baos.toByteArray();
}
I could not find a solution to this problem. Is this a bug? Or am I missing something?
Tilman Hausherr - "After looking at the code, I see it's a TODO. issues.apache.org/jira/browse/PDFBOX-5256"
I am trying to add content to an existing PDF using iText7. I have been able to create new PDFs and add content to them using Paragraphs and Tables. However, once I go to reopen a PDF that I have created and attempt to write more content to it, the new content starts overwriting the old content. I want the new content to be appended to the Document after the old content. How can I achieve this?
Edit
This is the class that sets up some common methods which are executed with each change made to a PDF document.
public class PDFParent {
    private static Document document;
    private static PdfWriter writer;
    private static PdfReader reader;
    private static PageSize ps;
    private static PdfDocument pdfDoc;

    public static Document getDocument() {
        return document;
    }

    public static void setDocument(Document document) {
        PDFParent.document = document;
    }

    public static void setupPdf(byte[] inParamInPDFBinary) {
        writer = new PdfWriter(new ByteArrayOutputStream());
        try {
            reader = new PdfReader(new ByteArrayInputStream(inParamInPDFBinary));
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdfDoc = new PdfDocument(reader, writer);
        ps = PageSize.A4;
        document = new Document(pdfDoc, ps);
    }

    public static byte[] writePdf() {
        ByteArrayOutputStream stream = (ByteArrayOutputStream) writer.getOutputStream();
        return stream.toByteArray();
    }

    public static void closePdf() {
        pdfDoc.close();
    }
}
And this is how I am adding the content to the pdf
public class ActAddParagraphToPDF extends PDFParent {
    // output parameters
    public static byte[] outParamOutPDFBinary;

    public static ActAddParagraphToPDF mosAddParagraphToPDF(byte[] inParamInPDFBinary, String inParamParagraph) throws IOException {
        ActAddParagraphToPDF result = new ActAddParagraphToPDF();
        setupPdf(inParamInPDFBinary);
        //---------------------begin content-------------------//
        getDocument().add(new Paragraph(inParamParagraph));
        //---------------------end content-------------------//
        closePdf();
        outParamOutPDFBinary = writePdf();
        return result;
    }
}
When I execute this second class, it appears to treat the original document as if it were blank and writes the new Paragraph on top of the original content. I know that I am missing something, I'm just not sure what it is.
Is reopening the document every time a requirement? If you keep the document open, you can append as much content as you want and you won't have to deal with content overlapping problems.
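For illustration, with a single Document kept open this is just a matter of calling add repeatedly (a minimal sketch; the class and file names are placeholders):

import com.itextpdf.kernel.geom.PageSize;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;

public class KeepDocumentOpen {
    public static void main(String[] args) throws Exception {
        // one writer/document pair for the whole session
        PdfDocument pdfDoc = new PdfDocument(new PdfWriter("out.pdf"));
        Document document = new Document(pdfDoc, PageSize.A4);
        document.add(new Paragraph("First batch of content"));
        // ... later, same Document instance ...
        document.add(new Paragraph("Second batch, appended below the first"));
        document.close(); // closes the underlying PdfDocument as well
    }
}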
If it is a requirement, then you will have to track the last free content position yourself and reset it on the new DocumentRenderer.
A Rectangle would be enough to store the free area that is left on the last page. Right before closing the document, save the free area in some Rectangle in the following way:
Rectangle savedBbox = document.getRenderer().getCurrentArea().getBBox();
After that, when you have to reopen the document, first jump to the last page:
document.add(new AreaBreak(AreaBreakType.LAST_PAGE));
And then reset the free occupied area left from the previous time you dealt with the document:
document.getRenderer().getCurrentArea().setBBox(savedBbox);
After that you are free to add new content to the document and it will appear at the saved position:
document.add(new Paragraph("Hello again"));
Please note that this approach works if you know which documents you are dealing with (i.e. you can associate last "free" position with the document's ID) and this document is not changed outside of your environment. If this is not the case, I recommend that you look into content extraction and in particular PdfDocumentContentParser. It can help you to extract the content you have on the page and determine which positions it occupies. Then you can calculate the free area on a page and use document.getRenderer().getCurrentArea().setBBox approach I described above to point DocumentRenderer to the correct place to write content to.
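A sketch of the whole reopen cycle under the assumptions above (you persist savedBbox yourself between runs, e.g. keyed by the document's ID; file names are placeholders and the imports assume iText 7.1-style package names):

import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.AreaBreak;
import com.itextpdf.layout.element.Paragraph;
import com.itextpdf.layout.property.AreaBreakType;

public class AppendAfterReopen {
    public static void main(String[] args) throws Exception {
        // first pass: add content and remember the free area on the last page
        Document first = new Document(new PdfDocument(new PdfWriter("step1.pdf")));
        first.add(new Paragraph("Original content"));
        Rectangle savedBbox = first.getRenderer().getCurrentArea().getBBox();
        first.close();

        // second pass: reopen, jump to the last page and restore the saved free area
        Document second = new Document(
                new PdfDocument(new PdfReader("step1.pdf"), new PdfWriter("step2.pdf")));
        second.add(new AreaBreak(AreaBreakType.LAST_PAGE));
        second.getRenderer().getCurrentArea().setBBox(savedBbox);
        second.add(new Paragraph("Hello again"));
        second.close();
    }
}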
This question already has an answer here:
iText : Unable to print mathematical characters like ∈, ∩, ∑, ∫, ∆ √, ∠
Hiyas
I'm trying to display this string:
λλλλλλλλλλλλλλλλλλλλλλλλ
which is read from an RTF file, parsed and put into this variable. It is NOT used as a constant in the code.
Font pdfFont = FontFactory.getFont(font.getFont().getName(), BaseFont.IDENTITY_H, embed, font.getFont().getSize2D(), style);
Phrase phrase = new Phrase("λλλλλλλλλλλλλλλλλλλλλλλλ", pdfFont);
ColumnText.showTextAligned(content[i], alignment, phrase, x, y, rotation);
I also tried CP1252 (and basically all the other encodings I found) together with a plain ArialMT.ttf font, but that damn string is never displayed. I can see that the conversion to a byte array inside iText (we use 5.5.0) always returns a zero-length byte array, which explains why the text is not shown, but I don't understand why it happens. What encoding do I need to use to make this visible in a PDF?
Thanks a lot
I suppose that you want to get a result that looks like this:
That's easy. I first tried the SunCharacter example from the official documentation. That example was written in answer to the question: iText : Unable to print mathematical characters like ∈, ∩, ∑, ∫, ∆ √, ∠
I then changed the TEXT to:
public static final String TEXT = "Always use the Unicode notation for special characters: \u03bb";
As you can see, I don't use λ in my source code (that's bad practice). Instead I use \u03bb which is the Unicode notation of λ.
The result looked like this:
That's not what you want; you want ArialMT. So I changed the FONT to:
public static final String FONT = "c:/windows/fonts/arial.ttf";
This gave me the desired PDF.
This is the full code sample:
public class LambdaCharacter {
public static final String DEST = "results/fonts/lambda_character.pdf";
public static final String FONT = "c:/windows/fonts/arial.ttf";
public static final String TEXT = "Always use the Unicode notation for special characters: \u03bb";
public static void main(String[] args) throws IOException, DocumentException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new LambdaCharacter().createPdf(DEST);
}
public void createPdf(String dest) throws IOException, DocumentException {
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream(dest));
document.open();
BaseFont bf = BaseFont.createFont(FONT, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font f = new Font(bf, 12);
Paragraph p = new Paragraph(TEXT, f);
document.add(p);
document.close();
}
}
It works just fine.
Maybe you aren't really using Arial. Maybe font.getFont().getName() doesn't give you the correct name of the font. Or maybe it gives you the correct name of the font, but you forgot to register the font. In that case, you will see that Helvetica is used. Helvetica can't render a lambda. You need Arial or Cardo-Regular or Arial Unicode or another font, as long as that font knows how to render a lambda.
If you don't know how to register a font, read:
How to load custom font in FontFactory.register in iText or
Creating fonts from *.ttf files using iText or
Using Fonts in System with iTextSharp or
Get list of supported fonts in ITextSharp or
Why is my font not applied when I create a PDF document? or... (there are just too many hits when I search for an answer to that question)
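As a minimal illustration of that registration step (iText 5 API; the font path and alias are just examples):

import com.itextpdf.text.Font;
import com.itextpdf.text.FontFactory;
import com.itextpdf.text.pdf.BaseFont;

public class RegisterArial {
    public static Font arial(float size) {
        // register the TTF once under an alias, then request it with Identity-H and embedding
        FontFactory.register("c:/windows/fonts/arial.ttf", "arial");
        return FontFactory.getFont("arial", BaseFont.IDENTITY_H, BaseFont.EMBEDDED, size);
    }
}

A font created this way, instead of one silently falling back to Helvetica, can render the lambda.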
I have gone through Java and PDF forums looking for a way to extract a text value from a table in a PDF file, but couldn't find any solution except JPedal (which is commercially licensed, not open source).
So I would like to know of any open source APIs, like PDFBox or iText, to achieve the same result as JPedal.
Ref. Example:
In comments the OP clarified that he locates the text value from the table in a PDF file he wants to extract "by providing X and Y co-ordinates".
Thus, while the question initially sounded like generic extraction of tabular data from PDFs (which can be difficult, to say the least), it essentially is about extracting the text from a rectangular region on a page given by coordinates.
This is possible using either of the libraries you mentioned (and surely others, too).
iText
To restrict the region from which you want to extract text, you can use the RegionTextRenderFilter in a FilteredTextRenderListener, e.g.:
/**
 * Parses a specific area of a PDF to a plain text file.
 * @param pdf the original PDF
 * @param txt the resulting text
 * @throws IOException
 */
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));
    Rectangle rect = new Rectangle(70, 80, 490, 580);
    RenderFilter filter = new RegionTextRenderFilter(rect);
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
    }
    out.flush();
    out.close();
    reader.close();
}
(ExtractPageContentArea sample from iText in Action, 2nd edition)
Beware, though, iText extracts text based on the basic text chunks in the content stream, not based on each individual glyph in such a chunk. Thus, the whole chunk is processed if only the tiniest part of it is in the area.
This may or may not suit you.
If you run into the problem that more is extracted than you wanted, you should split the chunks into their constituent glyphs beforehand, e.g. as sketched below. This stackoverflow answer explains how to do that.
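In outline, such splitting can be done with a wrapper strategy that forwards each glyph separately to the filtered strategy, so the region filter judges every glyph by its own position (a sketch based on the iText 5 parser API; the class name is made up):

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class GlyphSplittingStrategy implements TextExtractionStrategy {
    private final TextExtractionStrategy delegate;

    public GlyphSplittingStrategy(TextExtractionStrategy delegate) {
        this.delegate = delegate;
    }

    @Override
    public void beginTextBlock() { delegate.beginTextBlock(); }

    @Override
    public void endTextBlock() { delegate.endTextBlock(); }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) { delegate.renderImage(renderInfo); }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // split the chunk into per-glyph TextRenderInfo objects and forward them one by one
        for (TextRenderInfo glyph : renderInfo.getCharacterRenderInfos()) {
            delegate.renderText(glyph);
        }
    }

    @Override
    public String getResultantText() { return delegate.getResultantText(); }
}

In the parsePdf method above, the strategy would then be created as new GlyphSplittingStrategy(new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter)).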
PDFBox
To restrict the region from which you want to extract text, you can use the PDFTextStripperByArea, e.g.:
PDDocument document = PDDocument.load( args[0] );
if( document.isEncrypted() )
{
    document.decrypt( "" );
}
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
Rectangle rect = new Rectangle( 10, 280, 275, 60 );
stripper.addRegion( "class1", rect );
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 0 );
stripper.extractRegions( firstPage );
System.out.println( "Text in the area:" + rect );
System.out.println( stripper.getTextForRegion( "class1" ) );
(ExtractTextByArea from the PDFBox 1.8.8 examples)
Try PDFTextStream. At least I was able to identify the column values with it. Earlier I was using iText and got stuck defining an extraction strategy; it's hard.
This API separates column cells by inserting extra spaces, and it does so consistently, so you can build your own logic on top of it (this was missing in iText).
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class PDFText {
    public static void main(String[] args) throws java.io.IOException {
        String pdfFilePath = "xyz.pdf";
        Document pdf = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        pdf.pipe(new OutputTarget(text));
        pdf.close();
        System.out.println(text);
    }
}
A related question has already been asked on Stack Overflow.