How to test PDF generated with PDFbox

I wrote a class that reads a PDF template and adds some lines of text, like this:
List<PdfTextLine> textLines = pdfBankActPage.getTextLines();
if (textLines != null) {
for (PdfTextLine textLine : textLines) {
contentStream.setFont(cyrillicFont, textLine.getFontSize());
contentStream.beginText();
contentStream.newLineAtOffset(textLine.getOffsetX(), textLine.getOffsetY());
contentStream.showText(textLine.getText());
contentStream.endText();
}
}
My utility class that stores the information for one line:
public class PdfTextLine {
private Integer offsetX;
private Integer offsetY;
private String text;
private Integer fontSize;
// getters and setters omitted
}
What is the common approach to testing that the PDF was generated correctly?

Since your question is lacking detail, it is difficult to answer.
There are different ways to verify that your content was added:
Extract the text from the PDF and check whether your text is in there:
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(pageNumber);
stripper.setEndPage(pageNumber);
stripper.setAddMoreFormatting(false);
String text = stripper.getText(this.document);
Use a library like pdfcompare (which itself uses PDFBox) to compare the PDF visually...
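The first option can be wrapped in a small helper for unit tests. This is only a minimal sketch, assuming the text has already been extracted into a String (for example with the PDFTextStripper call above). The class and method names are made up for illustration, and whitespace is normalized because extracted text often differs from the input in spacing and line breaks:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: checks that every expected line occurs in text extracted from a PDF.
final class ExtractedTextCheck {

    // Collapse all whitespace runs to single spaces so line breaks and padding do not matter.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Returns the expected lines that were NOT found in the extracted text.
    static List<String> missingLines(String extractedText, List<String> expectedLines) {
        String haystack = normalize(extractedText);
        List<String> missing = new ArrayList<>();
        for (String line : expectedLines) {
            if (!haystack.contains(normalize(line))) {
                missing.add(line);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // In a real test this string would come from stripper.getText(document).
        String extracted = "Bank  act\r\nAccount:   12345\nTotal: 100.00\n";
        List<String> expected = List.of("Account: 12345", "Total: 100.00", "Signature");
        System.out.println("Missing lines: " + missingLines(extracted, expected));
    }
}
```

In a test you would assert that the returned list is empty. Note that this only verifies the text content, not the position or the font; for those, a visual comparison as in the second option is the better fit.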

Related

How to Extract Diagonal watermark from pdf using PDFBOX and Extract Text by maintaining alignment

How can I extract diagonal watermark text from PDF using PDFBox ?
After referring to ExtractText's rotationMagic option, I am now extracting vertical and horizontal watermarks but not diagonal. This is my code so far.
class AngleCollector extends PDFTextStripper {
private final Set<Integer> angles = new TreeSet<>();
AngleCollector() throws IOException {}
Set<Integer> getAngles() {
return angles;
}
@Override
protected void processTextPosition(TextPosition text) {
int angle = ExtractText.getAngle(text);
angle = (angle + 360) % 360;
angles.add(angle);
}
}
class FilteredTextStripper extends PDFTextStripper {
FilteredTextStripper() throws IOException {
}
@Override
protected void processTextPosition(TextPosition text) {
int angle = ExtractText.getAngle(text);
if (angle == 0) {
super.processTextPosition(text);
}
}
}
final class ExtractText {
static int getAngle(TextPosition text) {
//The Matrix containing the starting text position
Matrix m = text.getTextMatrix().clone();
m.concatenate(text.getFont().getFontMatrix());
return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
}
private List<String> getAnnots(PDPage page) throws IOException {
List<String> returnList = new ArrayList<>();
for (PDAnnotation pdAnnot : page.getAnnotations()) {
if(pdAnnot.getContents() != null && !pdAnnot.getContents().isEmpty()) {
returnList.add(pdAnnot.getContents());
}
}
return returnList;
}
public void extractPages(int startPage, int endPage, PDFTextStripper stripper, PDDocument document, Writer output) {
for (int p = startPage; p <= endPage; ++p) {
stripper.setStartPage(p);
stripper.setEndPage(p);
try {
PDPage page = document.getPage(p - 1);
for (var annot : getAnnots(page)) {
output.write(annot);
}
int rotation = page.getRotation();
page.setRotation(0);
var angleCollector = new AngleCollector();
angleCollector.setStartPage(p);
angleCollector.setEndPage(p);
angleCollector.writeText(document, output);
for (int angle : angleCollector.getAngles()) {
// prepend a transformation
try (var cs = new PDPageContentStream(document, page,
PDPageContentStream.AppendMode.PREPEND, false)) {
cs.transform(Matrix.getRotateInstance(-Math.toRadians(angle), 0, 0));
}
stripper.writeText(document, output);
// remove prepended transformation
((COSArray) page.getCOSObject().getItem(COSName.CONTENTS)).remove(0);
}
page.setRotation(rotation);
} catch (IOException ex) {
System.err.println("Failed to process page " + p + ": " + ex);
}
}
}
}
public class pdfTest {
private pdfTest() {
}
public static void main(String[] args) throws IOException {
var pdfFile = "test-resources/pdf/pdf_sample_2.pdf";
Writer output = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
var etObj = new ExtractText();
var rawDoc = PDDocument.load(new File(pdfFile));
PDFTextStripper stripper = new FilteredTextStripper();
if(rawDoc.getDocumentCatalog().getAcroForm() != null) {
rawDoc.getDocumentCatalog().getAcroForm().flatten();
}
etObj.extractPages(1, rawDoc.getNumberOfPages(), stripper, rawDoc, output);
output.flush();
}
}
Edit 1:
I am also unable to detect form (Acro, XFA) field contents via TextExtractor code with correct Alignment. How can I do that ?
I am attaching the sample PDFs for references.
Sample PDF 1
Sample PDF 2
I require following things using PDFBox
Diagonal text detection. (including watermarks).
Form fields extraction by maintaining Proper alignment.

In your "question" you actually ask multiple distinct questions. I'll look into each of them. The answers will be less specific than you'd probably wish because your questions are based on assumptions that are not all true.
"How can I extract diagonal watermark text from PDF using PDFBox ?"
First of all, PDF text extraction works by inspecting the instructions in content streams of a page and contained XObjects, finding text drawing instructions therein, taking the coordinates and orientations and the string parameters thereof, mapping the strings to Unicode, and arranging the many individual Unicode strings by their coordinates and orientations in a single content string.
In the case of PDFBox, the PDFTextStripper as-is does this with limited support for orientation processing, but it can be extended to filter the text pieces by orientation for better orientation support, as shown in the ExtractText example with rotation magic activated.
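To illustrate what "filtering by orientation" means: the angle of a text piece can be derived from its transformation matrix, and a stripper can then keep only the pieces at a given angle. The following is only a sketch of that calculation. It uses java.awt.geom.AffineTransform as a stand-in for the PDF text matrix, and the class and method names are hypothetical. For a pure rotation with uniform scaling, getScaleX and getScaleY both hold the cosine term, so either can be used in the atan2 call:

```java
import java.awt.geom.AffineTransform;

// Illustration only: derives a text rotation angle from a 2D transformation matrix,
// using java.awt.geom.AffineTransform as a stand-in for the text matrix.
final class TextAngleSketch {

    // Returns the rotation in whole degrees, normalized to the range 0..359.
    static int angleOf(AffineTransform m) {
        // For a pure rotation, shearY = sin(theta) and scaleX = cos(theta).
        double radians = Math.atan2(m.getShearY(), m.getScaleX());
        int degrees = (int) Math.round(Math.toDegrees(radians));
        return (degrees + 360) % 360;
    }

    // Same idea as a filtering stripper: accept only text drawn at the wanted angle.
    static boolean accept(AffineTransform m, int wantedAngle) {
        return angleOf(m) == wantedAngle;
    }

    public static void main(String[] args) {
        AffineTransform diagonal = AffineTransform.getRotateInstance(Math.toRadians(45));
        AffineTransform upright = new AffineTransform(); // identity, no rotation
        System.out.println("diagonal angle: " + angleOf(diagonal));
        System.out.println("upright accepted at 0 degrees: " + accept(upright, 0));
    }
}
```

A diagonal text piece would therefore only be picked up by a pass that filters for its specific angle, not by a pass that accepts angle 0 only.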
double_watermark.pdf
In case of your double_watermark.pdf example PDF, though, the diagonal text "Top Secret" is not created using text drawing instructions but instead path construction and painting instructions, as Tilman already remarked. (Actually the paths here all are sequences of very short lines, no curves are used, which you can see using a high zoom factor.) Thus, PDF text extraction cannot extract this text.
To answer your question
How can I extract diagonal watermark text from PDF using PDFBox ?
in this context, therefore: You cannot.
(You can of course use PDFBox as a PDF processing framework on top of which you collect paths and try to match them to characters, but that would be a sizable project in itself. Alternatively, you can use PDFBox to draw the pages as bitmaps and apply OCR to those bitmaps.)
"I am also unable to detect form (Acro, XFA) field contents via TextExtractor code with correct Alignment. How can I do that ?"
Form data in AcroForm or XFA form definitions are not part of the page content streams or the XObject content streams referenced from therein. Thus, they are not immediately subject to text extraction.
AcroForm forms
AcroForm form fields are abstract PDF data objects which may or may not have associated content streams for display. To include them into the content streams text extraction operates on, you can first flatten the form. As you mentioned in your own answer, you also have to activate sorting to extract the field contents in context.
Beware, PDF renderers do have certain freedoms when creating the visualization of a form field. Thus, text extraction order may be slightly different from what you expect.
XFA forms
XFA form definitions are a cuckoo's egg in PDF. They are XML streams which are not related to regular PDF objects; furthermore, XFA in PDFs was deprecated a number of years ago. Thus, most PDF libraries don't support XFA forms.
PDFBox only allows extracting or replacing the XFA XML stream. Thus, there is no immediate support for XFA form contents during text extraction.

Form fields extraction by maintaining Proper alignment.
This is solved by calling setSortByPosition(true) on the text stripper.
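Why sorting helps can be shown with a purely conceptual sketch (this is not PDFBox's implementation; the Fragment class and the coordinates are made up for illustration). The assumption is that the flattened field values are drawn after the regular page content, so in content-stream order they come after all the labels. Sorting by position brings each value back next to its label:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Conceptual sketch only; this is not how PDFBox implements position sorting.
final class PositionSortSketch {

    // Hypothetical text fragment with the coordinates of its start point
    // (y grows downwards here, as on screen).
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Orders fragments top to bottom, then left to right, and joins them with spaces.
    static String readInVisualOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stream order: both labels first, the flattened field values last.
        List<Fragment> streamOrder = List.of(
                new Fragment(10, 100, "Name:"),
                new Fragment(10, 120, "City:"),
                new Fragment(60, 100, "Alice"),
                new Fragment(60, 120, "Paris"));
        System.out.println(readInVisualOrder(streamOrder));
    }
}
```

Real text strippers additionally have to tolerate slightly different baselines on the same visual line, which this sketch ignores.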

How to read text in XSLFGraphicFrame with Apache POI for PowerPoint

I'm writing a Java program to find occurrences of a particular keyword in documents. I want to read many file formats, including all Microsoft Office documents.
I have already done this for all of them except PowerPoint, using Apache POI code snippets found on Stack Overflow and other sources.
I discovered that all slides are made of shapes (XSLFTextShape), but many of them are objects of class XSLFGraphicFrame or XSLFTable, for which I can't simply use the toString() method. How can I extract all of the text contained in them using Java?
This is the code/pseudocode:
File f = new File("C:\\Users\\Windows\\Desktop\\Modulo 9.pptx");
PrintStream out = System.out;
FileInputStream is = new FileInputStream(f);
XMLSlideShow ppt = new XMLSlideShow(is);
for (XSLFSlide slide : ppt.getSlides()) {
for (XSLFShape shape : slide) {
if (shape instanceof XSLFTextShape) {
XSLFTextShape txShape = (XSLFTextShape) shape;
out.println(txShape.getText());
} else if (shape instanceof XSLFPictureShape) {
//do nothing
} else if (shape instanceof XSLFGraphicFrame || shape instanceof XSLFTable) {
//print all text in it or in its children
}
}
}

If your requirement "to find occurrences of a particular keyword in documents" simply means searching all text content of slide shows, then using SlideShowExtractor could be an approach. It can also act as an entry point to a POITextExtractor for getting the textual content of the document metadata / properties, such as author and title.
Example:
import java.io.FileInputStream;
import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.sl.extractor.SlideShowExtractor;
import org.apache.poi.extractor.POITextExtractor;
public class SlideShowExtractorExample {
public static void main(String[] args) throws Exception {
SlideShow<XSLFShape,XSLFTextParagraph> slideshow
= new XMLSlideShow(new FileInputStream("Performance_Out.pptx"));
SlideShowExtractor<XSLFShape,XSLFTextParagraph> slideShowExtractor
= new SlideShowExtractor<XSLFShape,XSLFTextParagraph>(slideshow);
slideShowExtractor.setCommentsByDefault(true);
slideShowExtractor.setMasterByDefault(true);
slideShowExtractor.setNotesByDefault(true);
String allTextContentInSlideShow = slideShowExtractor.getText();
System.out.println(allTextContentInSlideShow);
System.out.println("===========================================================================");
POITextExtractor textExtractor = slideShowExtractor.getMetadataTextExtractor();
String metaData = textExtractor.getText();
System.out.println(metaData);
}
}
Of course there are kinds of XSLFGraphicFrame which are not read by SlideShowExtractor because Apache POI does not support them yet, for example all kinds of SmartArt graphics. Their text content is stored in /ppt/diagrams/data*.xml document parts which are referenced from the slides. Since Apache POI does not support this yet, it can only be read using the low-level underlying methods.
For example, to additionally get all the text out of all /ppt/diagrams/data parts, which hold the texts of SmartArt graphics, we could do:
...
System.out.println("===========================================================================");
//additionally get all text out of all /ppt/diagrams/data which are texts in SmartArt graphics:
StringBuilder sb = new StringBuilder();
for (XSLFSlide slide : ((XMLSlideShow)slideshow).getSlides()) {
for (org.apache.poi.ooxml.POIXMLDocumentPart part : slide.getRelations()) {
if (part.getPackagePart().getPartName().getName().startsWith("/ppt/diagrams/data")) {
org.apache.xmlbeans.XmlObject xmlObject = org.apache.xmlbeans.XmlObject.Factory.parse(part.getPackagePart().getInputStream());
org.apache.xmlbeans.XmlCursor cursor = xmlObject.newCursor();
while(cursor.hasNextToken()) {
if (cursor.isText()) {
sb.append(cursor.getTextValue() + "\r\n");
}
cursor.toNextToken();
}
sb.append(slide.getSlideNumber() + "\r\n\r\n");
}
}
}
String allTextContentInDiagrams = sb.toString();
System.out.println(allTextContentInDiagrams);
...

Render Type3 font character as image using PDFBox

In my project, I need to parse a PDF file that contains some characters rendered with Type3 fonts. What I need to do is render such characters into a BufferedImage for further processing.
I'm not sure whether I'm looking in the right direction, but I'm trying to get the PDType3CharProc for such characters:
PDType3Font font = (PDType3Font)textPosition.getFont();
PDType3CharProc charProc = font.getCharProc(textPosition.getCharacterCodes()[0]);
and the input stream of this procedure contains following data:
54 0 1 -1 50 43 d1
q
49 0 0 44 1.1 -1.1 cm
BI
/W 49
/H 44
/BPC 1
/IM true
ID
<some binary data here>
EI
Q
but unfortunately I have no idea how I can use this data to render the character into an image using PDFBox (or any other Java library).
Am I looking in the right direction, and what can I do with this data?
If not, are there other tools that can solve this problem?

Unfortunately PDFBox out-of-the-box does not provide a class to render contents of arbitrary XObjects (like the type 3 font char procs), at least as far as I can see.
But it does provide a class for rendering complete PDF pages; thus, to render a given type 3 font glyph, one can simply create a page containing only that glyph and render this temporary page!
Assuming, for example, the type 3 font is defined on the first page of a PDDocument document and has name F1, all its char procs can be rendered like this:
PDPage page = document.getPage(0);
PDResources pageResources = page.getResources();
COSName f1Name = COSName.getPDFName("F1");
PDType3Font fontF1 = (PDType3Font) pageResources.getFont(f1Name);
Map<String, Integer> f1NameToCode = fontF1.getEncoding().getNameToCodeMap();
COSDictionary charProcsDictionary = fontF1.getCharProcs();
for (COSName key : charProcsDictionary.keySet())
{
COSStream stream = (COSStream) charProcsDictionary.getDictionaryObject(key);
PDType3CharProc charProc = new PDType3CharProc(fontF1, stream);
PDRectangle bbox = charProc.getGlyphBBox();
if (bbox == null)
bbox = charProc.getBBox();
Integer code = f1NameToCode.get(key.getName());
if (code != null)
{
PDDocument charDocument = new PDDocument();
PDPage charPage = new PDPage(bbox);
charDocument.addPage(charPage);
charPage.setResources(pageResources);
PDPageContentStream charContentStream = new PDPageContentStream(charDocument, charPage);
charContentStream.beginText();
charContentStream.setFont(fontF1, bbox.getHeight());
charContentStream.getOutput().write(String.format("<%02X> Tj\n", code).getBytes());
charContentStream.endText();
charContentStream.close();
File result = new File(RESULT_FOLDER, String.format("4700198773-%s-%s.png", key.getName(), code));
PDFRenderer renderer = new PDFRenderer(charDocument);
BufferedImage image = renderer.renderImageWithDPI(0, 96);
ImageIO.write(image, "PNG", result);
charDocument.close();
}
}
(RenderType3Character.java test method testRender4700198773)
Considering the textPosition variable in the OP's code, he quite likely attempts this from a text extraction use case. Thus, he'll have to either pre-generate the bitmaps as above and simply look them up by name or adapt the code to match the available information in his use case (e.g. he might not have the original page at hand, only the font object; in that case he cannot copy the resources of the original page but instead may create a new resources object and add the font object to it).
Unfortunately the OP did not provide a sample PDF. Thus, for my test I used one from another Stack Overflow question, 4700198773.pdf from extract text with custom font result non readble. There obviously might remain issues with the OP's own files.
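As a side note on the char proc shown in the question: it consists of a single inline image mask (/IM true, 1 bit per component, 49 x 44 samples). If the image data is not compressed (no /F filter entry appears in the dump, but that is an assumption), such a mask can also be turned into a BufferedImage directly, without rendering a page. A minimal sketch using only the JDK; the class and method names are made up for illustration, and it assumes the default Decode array, where a sample value of 0 means "paint" and 1 means "leave untouched":

```java
import java.awt.image.BufferedImage;

// Sketch: converts raw 1-bit image mask samples into a black-on-white image.
final class ImageMaskSketch {

    // Assumes each row is padded to a whole number of bytes, with the most
    // significant bit holding the leftmost pixel, and the default Decode array.
    static BufferedImage decode(byte[] data, int width, int height) {
        int bytesPerRow = (width + 7) / 8;
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int b = data[y * bytesPerRow + x / 8] & 0xFF;
                int bit = (b >> (7 - (x % 8))) & 1;
                // 0 = painted (black here), 1 = unpainted (white here)
                image.setRGB(x, y, bit == 0 ? 0x000000 : 0xFFFFFF);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        // Tiny 3 x 2 example: first row is "paint, skip, paint", second row is all "skip".
        byte[] samples = {(byte) 0x40, (byte) 0xE0};
        BufferedImage image = decode(samples, 3, 2);
        System.out.println(image.getWidth() + " x " + image.getHeight());
    }
}
```

For the glyph in the question the call would use a width of 49 and a height of 44. Char procs that contain anything other than one uncompressed mask still need the page rendering approach above.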

I stumbled upon the same issue and I was able to render Type3 font by modifying PDFRenderer and the underlying PageDrawer:
class Type3PDFRenderer extends PDFRenderer
{
private PDFont font;
public Type3PDFRenderer(PDDocument document, PDFont font)
{
super(document);
this.font = font;
}
@Override
protected PageDrawer createPageDrawer(PageDrawerParameters parameters) throws IOException
{
FontType3PageDrawer pd = new FontType3PageDrawer(parameters, this.font);
pd.setAnnotationFilter(super.getAnnotationsFilter());//as done in the super class
return pd;
}
}
class FontType3PageDrawer extends PageDrawer
{
private PDFont font;
public FontType3PageDrawer(PageDrawerParameters parameters, PDFont font) throws IOException
{
super(parameters);
this.font = font;
}
@Override
public PDGraphicsState getGraphicsState()
{
PDGraphicsState gs = super.getGraphicsState();
gs.getTextState().setFont(this.font);
return gs;
}
}
Simply use Type3PDFRenderer instead of PDFRenderer. Of course, if you have multiple fonts, this needs some more modification to handle them.
Edit: tested with pdfbox 2.0.9

Reading a table or cell value in a pdf file using java?

I have gone through Java and PDF forums looking for a way to extract a text value from a table in a PDF file, but couldn't find any solution except JPedal (which is not open source and requires a license).
So I would like to know of any open-source APIs, like PDFBox or iText, that achieve the same result as JPedal.
Ref. Example:

In comments the OP clarified that he locates the text value he wants to extract from the table in the PDF file "By providing X and Y co-ordinates".
Thus, while the question initially sounded like generic extraction of tabular data from PDFs (which can be difficult at least), it actually is essentially about extracting the text from a rectangular region on a page given by coordinates.
This is possible using either of the libraries you mentioned (and surely others, too).
iText
To restrict the region from which you want to extract text, you can use the RegionTextRenderFilter in a FilteredTextRenderListener, e.g.:
/**
* Parses a specific area of a PDF to a plain text file.
* @param pdf the original PDF
* @param txt the resulting text
* @throws IOException
*/
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
Rectangle rect = new Rectangle(70, 80, 490, 580);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();
}
(ExtractPageContentArea sample from iText in Action, 2nd edition)
Beware, though, iText extracts text based on the basic text chunks in the content stream, not based on each individual glyph in such a chunk. Thus, the whole chunk is processed if only the tiniest part of it is in the area.
This may or may not suit you.
If you run into the problem that more is extracted than you wanted, you should split the chunks into their constituent glyphs beforehand. This Stack Overflow answer explains how to do that.
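The difference between chunk-level and glyph-level filtering can be sketched without iText. The following is a conceptual illustration only, not iText's implementation: the class and method names are hypothetical, bounding boxes are plain java.awt.geom.Rectangle2D objects, and all glyphs of a chunk are assumed to have the same width for simplicity:

```java
import java.awt.geom.Rectangle2D;

// Conceptual sketch: chunk-level versus glyph-level region filtering.
final class RegionFilterSketch {

    // Chunk level: the whole string is kept as soon as its box overlaps the region at all.
    static String filterByChunk(String text, Rectangle2D chunkBox, Rectangle2D region) {
        return chunkBox.intersects(region) ? text : "";
    }

    // Glyph level: each character gets its own box and is tested individually.
    // Simplifying assumption: every glyph has the same width.
    static String filterByGlyph(String text, Rectangle2D chunkBox, Rectangle2D region) {
        double glyphWidth = chunkBox.getWidth() / text.length();
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            Rectangle2D glyphBox = new Rectangle2D.Double(
                    chunkBox.getX() + i * glyphWidth, chunkBox.getY(),
                    glyphWidth, chunkBox.getHeight());
            if (glyphBox.intersects(region)) {
                kept.append(text.charAt(i));
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        Rectangle2D chunkBox = new Rectangle2D.Double(0, 0, 100, 10);
        Rectangle2D region = new Rectangle2D.Double(70, 0, 100, 10);
        System.out.println("chunk level: " + filterByChunk("ABCDEFGHIJ", chunkBox, region));
        System.out.println("glyph level: " + filterByGlyph("ABCDEFGHIJ", chunkBox, region));
    }
}
```

With chunk-level filtering the whole string comes back although only its tail lies in the region, which is exactly the "more is extracted than you wanted" situation described above.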
PDFBox
To restrict the region from which you want to extract text, you can use the PDFTextStripperByArea, e.g.:
PDDocument document = PDDocument.load( args[0] );
if( document.isEncrypted() )
{
document.decrypt( "" );
}
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
Rectangle rect = new Rectangle( 10, 280, 275, 60 );
stripper.addRegion( "class1", rect );
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 0 );
stripper.extractRegions( firstPage );
System.out.println( "Text in the area:" + rect );
System.out.println( stripper.getTextForRegion( "class1" ) );
(ExtractTextByArea from the PDFBox 1.8.8 examples)

Try PDFTextStream. At least I was able to identify the column values with it. Earlier I was using iText and got stuck defining a strategy; it's hard.
This API separates column cells by inserting additional spaces. The spacing is fixed, so you can build your logic on it (this was missing in iText).
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;
public class PDFText {
public static void main(String[] args) throws java.io.IOException {
String pdfFilePath = "xyz.pdf";
Document pdf = PDF.open(pdfFilePath);
StringBuilder text = new StringBuilder(1024);
pdf.pipe(new OutputTarget(text));
pdf.close();
System.out.println(text);
}
}
Question has been asked related to this on stackoverflow!
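The "logic" mentioned above can be as simple as splitting each extracted line on runs of spaces. A minimal sketch, assuming that cells are separated by at least two spaces and that a single space only occurs inside a cell (the class name, the threshold, and the sample line are assumptions, not something PDFTextStream guarantees):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: recover table cells from a line in which columns are separated by several spaces.
final class ColumnSplitSketch {

    // Splits on runs of two or more whitespace characters; single spaces stay inside a cell.
    static List<String> splitColumns(String line) {
        return Arrays.asList(line.trim().split("\\s{2,}"));
    }

    public static void main(String[] args) {
        // Made-up example line, as it might appear in the extracted text.
        String line = "Widget A      3     9.99";
        System.out.println(splitColumns(line));
    }
}
```

Empty cells cannot be detected this way, because a missing value just looks like a longer run of spaces.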

Creating an SVGIcon using a valid svg xml with SVGSalamander

I have an .xml document which is a valid SVG image.
I want to load this icon with different colors, but I could not.
As a workaround, I first read the XML file as a string and replaced the colors using a simple String.replace().
Now I must create an SVGIcon from my new XML content.
Is it possible to do this with the SVG Salamander library?

With SVG Salamander:
Get the diagram from the cache and call a recursive search and replace:
SVGDiagram diagram = SVGCache.getSVGUniverse().getDiagram(uri);
setStroke(Color.BLACK, getHexString(Color.GREEN), diagram.getRoot());
Code for the functions:
private void setStroke(Color fromColor, String toColor, SVGElement node) throws SVGException {
if (node.hasAttribute("stroke", AnimationElement.AT_CSS)) {
StyleAttribute abs = node.getStyleAbsolute("stroke");
Color was = abs.getColorValue();
if (was.equals(fromColor)) {
abs.setStringValue(toColor);
}
}
for (int i = 0; i < node.getNumChildren(); ++i) {
setStroke(fromColor, toColor, node.getChild(i));
}
}
private String getHexString(Color color) {
return String.format("#%06x", (0xFFFFFF & color.getRGB()));
}
