Using PDFBox to get location of line of text


I'm using PDFBox to extract information from a PDF, and the information I'm currently trying to find is the x-position of the first character in each line. I can't find anything on how to get that information, though. I know PDFBox has a class called TextPosition, but I can't find out how to get a TextPosition object from the PDDocument either. How do I get the location information of a line of text from a PDF?

In general
To extract text (with or without extra information like positions, colors, etc.) using PDFBox, you instantiate a PDFTextStripper or a class derived from it and use it like this:
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
(There are a number of PDFTextStripper attributes allowing you to restrict the pages text is extracted from.)
In the course of the execution of getText, the content streams of the pages in question (and those of form XObjects referenced from those pages) are parsed and text drawing commands are processed.
If you want to change the text extraction behavior, you have to change this processing of text drawing commands, which you should most often do by overriding this method:
/**
* Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
* and just calls {@link #writeString(String)}.
*
* @param text The text to write to the stream.
* @param textPositions The TextPositions belonging to the text.
* @throws IOException If there is an error when writing the text.
*/
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
writeString(text);
}
If you additionally need to know when a new line starts, you may also want to override
/**
* Write the line separator value to the output stream.
* @throws IOException
* If there is a problem writing out the lineseparator to the document.
*/
protected void writeLineSeparator( ) throws IOException
{
output.write(getLineSeparator());
}
writeString can be overridden to channel the text information into separate members (e.g. if you might want a result in a more structured format than a mere String) or it can be overridden to simply add some extra information into the result String.
writeLineSeparator can be overridden to trigger some specific output between lines.
There are more methods which can be overridden but you are less likely to need them in general.
In the case at hand
I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line.
This can be implemented as follows (simply adding the information at the start of each line):
PDFTextStripper stripper = new PDFTextStripper()
{
    @Override
    protected void startPage(PDPage page) throws IOException
    {
        // every page starts with a new line
        startOfLine = true;
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        // the next chunk of text written will be the start of a line
        startOfLine = true;
        super.writeLineSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        if (startOfLine)
        {
            // prefix the line with the x-position of its first character
            TextPosition firstPosition = textPositions.get(0);
            writeString(String.format("[%s]", firstPosition.getXDirAdj()));
            startOfLine = false;
        }
        super.writeString(text, textPositions);
    }

    boolean startOfLine = true;
};
text = stripper.getText(document);
(ExtractText.java method extractLineStart tested by testExtractLineStartFromSampleFile)
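If the x-positions are needed as numbers rather than as part of the text, the bracketed prefix written by the stripper above can be parsed back out of the result. The following is only a minimal sketch, assuming each extracted line starts with a prefix such as `[72.0]`. The names `LineStartParser`, `Line` and `parse` and the sample values are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: splits "[x]text" lines into
// a numeric x-position and the remaining line text.
class LineStartParser {
    // Matches a prefix like "[72.0]" or "[1.0E-4]" followed by the rest of the line.
    private static final Pattern PREFIX =
            Pattern.compile("^\\[(-?\\d+(?:\\.\\d+)?(?:[eE][-+]?\\d+)?)\\](.*)$");

    static final class Line {
        final float x;
        final String text;

        Line(float x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    static List<Line> parse(String extracted) {
        List<Line> lines = new ArrayList<>();
        for (String raw : extracted.split("\\R")) {
            Matcher m = PREFIX.matcher(raw);
            if (m.matches()) {
                lines.add(new Line(Float.parseFloat(m.group(1)), m.group(2)));
            }
            // Lines without a prefix (for example empty lines) are skipped.
        }
        return lines;
    }

    public static void main(String[] args) {
        // Invented sample; real input would come from stripper.getText(document).
        String sample = "[72.0]First line\n[90.5]Indented line\n";
        for (Line line : parse(sample)) {
            System.out.println(line.x + " -> " + line.text);
        }
    }
}
```

Collecting the positions in a member of the stripper subclass instead of encoding them into the text would avoid the parsing step altogether; the sketch only covers the case where the annotated string is all that is available.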

Related

Extracting answers to a flattened PDF form with iText 7

We have a few forms created with Adobe LiveCycle. Users fill in the dynamic forms and submit the document to our office, where we stamp it with our signature and flatten it (at least most of the time; I've seen a few documents in our system that haven't been flattened yet, but that can be a separate question. I'll focus on the flattened documents here because that's most of what we have).
I'm trying to use iText 7 to parse/extract the users' answers to our forms for a migration to an electronic solution that will happen a few months from now. I was able to make the example work in Java, but I don't understand the process.
/*
This file is part of the iText (R) project.
Copyright (c) 1998-2020 iText Group NV
Authors: iText Software.
For more information, please contact iText Software at this address:
sales@itextpdf.com
*/
/**
* Example written by Bruno Lowagie in answer to:
* http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
*/
package ca.umanitoba.ad.research;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.filter.TextRegionEventFilter;
import com.itextpdf.kernel.pdf.canvas.parser.listener.FilteredEventListener;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.FileOutputStream;
import java.io.Writer;
import java.io.BufferedWriter;
public class Main {
public static final String DEST = "./target/txt/parse_custom.txt";
public static final String SRC = "./src/main/resources/pdfs/nameddestinations.pdf";
public static void main(String[] args) throws IOException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new Main().manipulatePdf(DEST);
}
protected void manipulatePdf(String dest) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
Rectangle rect = new Rectangle(36, 750, 523, 56);
CustomFontFilter fontFilter = new CustomFontFilter(rect);
FilteredEventListener listener = new FilteredEventListener();
// Create a text extraction renderer
LocationTextExtractionStrategy extractionStrategy = listener
.attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
// Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.processPageContent(pdfDoc.getFirstPage());
// Get the resultant text after applying the custom filter
String actualText = extractionStrategy.getResultantText();
pdfDoc.close();
// See the resultant text in the console
System.out.println(actualText);
try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dest)))) {
writer.write(actualText);
}
}
/*
* The custom filter filters only the text of which the font name ends with Bold or Oblique.
*/
protected class CustomFontFilter extends TextRegionEventFilter {
public CustomFontFilter(Rectangle filterRect) {
super(filterRect);
}
@Override
public boolean accept(IEventData data, EventType type) {
if (type.equals(EventType.RENDER_TEXT)) {
TextRenderInfo renderInfo = (TextRenderInfo) data;
PdfFont font = renderInfo.getFont();
if (null != font) {
String fontName = font.getFontProgram().getFontNames().getFontName();
return fontName.endsWith("Bold") || fontName.endsWith("Oblique");
}
}
return false;
}
}
}
Why is there a need to specify a Rectangle? Our forms are dynamic, so users can add more fields as needed, and we also accept paragraphs on some of the questions. The length will therefore always vary, and it's unlikely that the coordinates of the texts will be the same.
How can I change the flow so that I can perhaps just search for the question and then get the text right after it (presumably the answer)? I don't really know what the best way to parse a PDF is. If there's no other way except providing a Rectangle, can I programmatically determine the coordinates/dimensions of the rectangles?
From the example, it looks like it's filtering the text based on whether it's bold or italic, which I probably don't need, but that looks easy enough to fix by modifying/removing the accept() method.

Please take a look at what that example is for: In the JavaDoc comment you can read
/**
* Example written by Bruno Lowagie in answer to:
* http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
*/
and that Stack Overflow question starts with
I used the following code to get data in PDF from a particular location. I want to get bold text present in that location
When you wonder, therefore,
Why is there a need to specify a Rectangle?
the answer is: because the example is about finding bold text in a particular location.
You mention your forms were dynamic before flattening and fields don't have fixed positions. Thus, this filter probably is not optimal for your use case.
How can I change the flow so that I can perhaps just search for the question and then get the text right after it
In that case simply don't filter at all but use a plain LocationTextExtractionStrategy to extract text, search for the question text in the extracted text, and use the text thereafter up to the next question text.
Alternatively, if you still have the unflattened dynamic forms, you may consider extracting the XFA XML and reading the filled-in data from that XML.
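The "search for the question text and take what follows up to the next question" step is plain string processing once the text has been extracted. Here is a minimal sketch, assuming the full page text is already available as a String (for example from a LocationTextExtractionStrategy) and the question texts are known in advance. The class name `AnswerExtractor`, the method `extractAnswers` and the sample questions are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps each known question to the text that follows it,
// up to the start of the next known question (or the end of the text).
class AnswerExtractor {
    static Map<String, String> extractAnswers(String text, List<String> questions) {
        Map<String, String> answers = new LinkedHashMap<>();
        for (String question : questions) {
            int start = text.indexOf(question);
            if (start < 0) {
                continue; // question not present in this document
            }
            int answerStart = start + question.length();
            int answerEnd = text.length();
            // The answer ends where the nearest following question begins.
            for (String other : questions) {
                int pos = text.indexOf(other, answerStart);
                if (pos >= 0 && pos < answerEnd) {
                    answerEnd = pos;
                }
            }
            answers.put(question, text.substring(answerStart, answerEnd).trim());
        }
        return answers;
    }

    public static void main(String[] args) {
        // Invented sample text standing in for the extracted page text.
        String text = "Name of applicant:\nJane Doe\nProject summary:\nA study of\nriver ice.\n";
        List<String> questions = Arrays.asList("Name of applicant:", "Project summary:");
        extractAnswers(text, questions)
                .forEach((q, a) -> System.out.println(q + " => " + a));
    }
}
```

Extracted text may break a long question across lines or insert extra spaces, so exact matching as done here may need whitespace normalization on real documents.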

Correct way to distinguish .xls from .doc file?

I searched for a way to detect whether a file is an .xls file and found a solution like this (but I need one that is not deprecated):
POIFSFileSystem:
@Deprecated
@Removal(version="4.0")
public static boolean hasPOIFSHeader(InputStream inp) throws IOException {
return FileMagic.valueOf(inp) == FileMagic.OLE2;
}
But this one also returns true for Microsoft Word documents, for example .doc files.
Is there a way to detect an .xls document specifically?

Both .doc and .xls documents are stored in the OLE2 storage format. org.apache.poi.poifs.filesystem.FileMagic only detects the file storage format, which alone is not sufficient to distinguish .doc from .xls files.
Also, there does not appear to be any direct API in the POI library to determine the document type (spreadsheet or word-processing document) for a given input stream or file.
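To see why a header check cannot tell .doc from .xls, it helps to look at what such a check actually inspects. The sketch below uses only the JDK and is not POI's implementation: it compares the first bytes of a stream against the well-known OLE2 signature (D0 CF 11 E0 A1 B1 1A E1) and the ZIP signature ("PK") used by the newer Office formats. The class name `ContainerSniffer` and the returned labels are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reports the container format suggested by the first bytes.
// Both .doc and .xls yield "OLE2"; both .docx and .xlsx yield "ZIP".
class ContainerSniffer {
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // Note: this consumes up to 8 bytes from the stream.
    static String sniff(InputStream in) throws IOException {
        byte[] header = new byte[8];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                break; // stream ended early
            }
            read += n;
        }
        if (read == header.length && matches(header, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (read >= 2 && header[0] == 'P' && header[1] == 'K') {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    private static boolean matches(byte[] header, int[] magic) {
        for (int i = 0; i < magic.length; i++) {
            if ((header[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Since the signature only identifies the container, telling a workbook from a Word document requires looking inside the container, which is what the approaches below do by actually opening the file or by delegating to a detector library.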
The example below may help to determine whether a given stream is a valid .xls (or .xlsx) file, with the caveat that it reads the given input stream and closes it.
// consumes the content of the given input and closes it
public static boolean isExcelFile(InputStream in) throws IOException {
try {
// this reads the whole input stream
Workbook workbook = org.apache.poi.ss.usermodel.WorkbookFactory.create(in);
workbook.close();
return true;
} catch (java.lang.IllegalArgumentException | org.apache.poi.openxml4j.exceptions.InvalidFormatException e) {
return false;
}
}
You may find more information on the Excel file format at this link
Update
Solution based on Apache Tika as suggested by gagravarr:
public class TikaBasedFileTypeDetector {
private Tika tika;
private TemporaryResources temporaryResources;
public void init() {
this.tika = new Tika();
this.temporaryResources = new TemporaryResources();
}
// clean up all the temporary resources
public void destroy() throws IOException {
temporaryResources.close();
}
// return content mime type
public String detectType(InputStream in) throws IOException {
TikaInputStream tikaInputStream = TikaInputStream.get(in, temporaryResources);
return tika.detect(tikaInputStream);
}
public boolean isExcelFile(InputStream in) throws IOException{
// see https://stackoverflow.com/a/4212908/1700467 for information on mimetypes
String type = detectType(in);
return type.startsWith("application/vnd.ms-excel") || // legacy Microsoft Excel formats
type.startsWith("application/vnd.openxmlformats-officedocument.spreadsheetml"); // Office Open XML spreadsheet format
}
}
See this answer on mime types.

You can work with Apache POI's HSSF module.
That module is written to read and write .xls files (POI also has a newer module for .xlsx, although the two are different formats).
With this code...
InputStream ExcelFileToRead = new FileInputStream("FileNameWithLink.xls");
HSSFWorkbook wb = new HSSFWorkbook(ExcelFileToRead);
HSSFSheet sheet = wb.getSheetAt(0);
...you can detect whether it is a readable .xls file.
Going deeper, you can use this code to try reading the contents, etc. The module is really easy to use.
There can be situations where a file technically is an .xls file but is not readable (there can be various problems with it).
Extra - XSSF is for .xlsx and HSSF is for .xls.
I haven't used other techniques, as I always want to be sure that I will be able to read the file later.

You can use docx4j. Load the file with OpcPackage.load() and then check the content type.
OpcPackage.load()
* Convenience method to create a WordprocessingMLPackage
* or PresentationMLPackage
* from an inputstream (.docx/.docxm, .ppxtx or Flat OPC .xml).
* It detects the convenient format inspecting two first bytes of stream (magic bytes).
* For office 2007 'x' formats, these two bytes are 'PK' (same as zip file)
load() returns a OpcPackage which is the abstract class that GloxPackage, PresentationMLPackage, SpreadsheetMLPackage, WordprocessingMLPackage are based on. So this should work for word, excel and powerpoint docs.
A basic check
public final String XLSX_FILE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml";
public final String WORD_FILE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";
public final String UNKNOWN_FILE = "UNKNOWN";
public boolean isFileXLSX(String fileLocation) {
return getContentTypeFromFile(fileLocation).equals(XLSX_FILE);
}
public String getContentTypeFromFile(String fileLocation) {
try {
return OpcPackage.load(new File(fileLocation)).getContentType();
} catch (Docx4JException e) {
return UNKNOWN_FILE;
}
}
You should see values like
application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml

Error while retrieving images from pdf using Itext

I have an existing PDF from which I want to retrieve images
NOTE:
In the Documentation, this is the RESULT variable
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
I don't understand why this image is needed. I just want to extract the images from my PDF file.
Now, when I use MyImageRenderListener listener = new MyImageRenderListener(RESULT);
I am getting the error:
results\part4\chapter15\Img16.jpg (The system
cannot find the path specified)
This is the code that I am having.
package part4.chapter15;
import java.io.IOException;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
/**
* Extracts images from a PDF file.
*/
public class ExtractImages {
/** The new document to which we've added a border rectangle. */
public static final String RESOURCE = "resources/pdfs/samplefile.pdf";
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
/**
* Parses a PDF and extracts all the images.
* @param filename the source PDF
*/
public void extractImages(String filename)
throws IOException, DocumentException {
PdfReader reader = new PdfReader(filename);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
MyImageRenderListener listener = new MyImageRenderListener(RESULT);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
parser.processContent(i, listener);
}
reader.close();
}
/**
* Main method.
* @param args no arguments needed
* @throws DocumentException
* @throws IOException
*/
public static void main(String[] args) throws IOException, DocumentException {
new ExtractImages().extractImages(RESOURCE);
}
}

You have two questions and the answer to the first question is the key to the answer of the second.
Question 1:
You refer to:
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
And you ask: why is this image needed?
That question is wrong, because Img%s.%s is not a filename of an image, it's a pattern of the filename of an image. While parsing, iText will detect images in the PDF. These images are stored in numbered objects (e.g. object 16) and these images can be exported in different formats (e.g. jpg, png,...).
Suppose that an image is stored in object 16 and that this image is a jpg, then the pattern will resolve to Img16.jpg.
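The way the pattern resolves can be checked with String.format alone, without iText. This is just an illustration: the class name `ImagePathPattern` and the method `resolve` are hypothetical, and the object number 16 and the type "jpg" are the values from the example above.

```java
// Hypothetical demo class: shows how the two %s placeholders in RESULT are filled.
class ImagePathPattern {
    static final String RESULT = "results/part4/chapter15/Img%s.%s";

    static String resolve(String pattern, int objectNumber, String fileType) {
        // First %s receives the object number, second %s the file extension.
        return String.format(pattern, objectNumber, fileType);
    }

    public static void main(String[] args) {
        System.out.println(resolve(RESULT, 16, "jpg"));
        // prints results/part4/chapter15/Img16.jpg
    }
}
```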
Question 2:
Why do I get an error:
results\part4\chapter15\Img16.jpg (The system cannot find the path specified)
In your PDF, there's a jpg stored in object 16. You are asking iText to store that image using this path: results\part4\chapter15\Img16.jpg (as explained in my answer to Question 1). However, your working directory doesn't have the subdirectories results\part4\chapter15\, hence an IOException (or a FileNotFoundException?) is thrown.
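Besides the two options given further down, the missing subdirectories could also be created before the file is written. The following sketch uses only java.nio.file and is not taken from the book example; the class name `ImageFileWriter` and the method `write` are hypothetical, and the byte array stands in for whatever image bytes the listener obtains.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: writes bytes to a file, creating missing parent directories first.
class ImageFileWriter {
    static Path write(String filename, byte[] bytes) throws IOException {
        Path target = Paths.get(filename);
        Path parent = target.getParent();
        if (parent != null) {
            // No-op if the directories already exist.
            Files.createDirectories(parent);
        }
        return Files.write(target, bytes);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes; a real listener would pass the extracted image data.
        Path written = write("results/part4/chapter15/Img16.jpg", new byte[] {1, 2, 3});
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```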
What is the general problem?
You have copy/pasted the ExtractImages example I wrote for my book "iText in Action - Second Edition", but:
You didn't read that book, so you have no idea what that code is supposed to do.
You aren't telling the readers on Stack Overflow that this example depends on the MyImageRenderListener class, which is where all the magic happens.
How can you solve your problem?
Option 1:
Change RESULT like this:
public static final String RESULT = "Img%s.%s";
Now the images will be stored in your working directory.
Option 2:
Adapt the MyImageRenderListener class, more specifically this method:
public void renderImage(ImageRenderInfo renderInfo) {
try {
String filename;
FileOutputStream os;
PdfImageObject image = renderInfo.getImage();
if (image == null) return;
filename = String.format(path,
renderInfo.getRef().getNumber(), image.getFileType());
os = new FileOutputStream(filename);
os.write(image.getImageAsBytes());
os.flush();
os.close();
} catch (IOException e) {
System.out.println(e.getMessage());
}
}
iText calls this method whenever an image is encountered. It passes an ImageRenderInfo to this method that contains plenty of information about that image.
In this implementation, we store the image bytes as a file. This is how we create the path to that file:
String.format(path,
renderInfo.getRef().getNumber(), image.getFileType())
As you can see, the pattern stored in RESULT is used in such a way that the first occurrence of %s is replaced with a number and the second occurrence with a file extension.
You could easily adapt this method so that it stores the images as byte[] in a List if that is what you want.

How to replace text in Powerpoint file with Java

I have a requirement where I need to replace some text in a PowerPoint file at runtime. (The PowerPoint file is being used as a template with some placeholders/tokens, e.g. {{USER_NAME}}.)
I have tried using POI but with no luck.
I referred to the other links on the forum and started with 'docx4j' but am not able to go beyond a point and the documentation is not very clear (at least for me).
Here is what I have done so far:
Got the PPTX loaded to 'PresentationMLPackage'
Got the 'MainPresentationPart' and the slides (Using mainPresentationPart.getSlide(n);)
But I am not sure of the next steps from here (or if this is the right approach in the first place).
Any suggestions will be greatly appreciated.
Thanks a Lot,
-Vini

SlidePart extends JaxbPmlPart<Sld>
JaxbPmlPart<E> extends JaxbXmlPartXPathAware<E>
JaxbXmlPartXPathAware<E> extends JaxbXmlPart<E>
JaxbXmlPart contains:
/**
* unmarshallFromTemplate. Where jaxbElement has not been
* unmarshalled yet, this is more efficient (3 times
* faster, in some testing) than calling
* XmlUtils.marshaltoString directly, since it avoids
* some JAXB processing.
*
* @param mappings
* @throws JAXBException
* @throws Docx4JException
*
* @since 3.0.0
*/
public void variableReplace(java.util.HashMap<String, String> mappings) throws JAXBException, Docx4JException {
// Get the contents as a string
String wmlTemplateString = null;
if (jaxbElement==null) {
PartStore partStore = this.getPackage().getSourcePartStore();
String name = this.getPartName().getName();
InputStream is = partStore.loadPart(
name.substring(1));
if (is==null) {
log.warn(name + " missing from part store");
throw new Docx4JException(name + " missing from part store");
} else {
log.info("Lazily unmarshalling " + name);
// This seems to be about 5% faster than the Scanner approach
try {
wmlTemplateString = IOUtils.toString(is, "UTF-8");
} catch (IOException e) {
throw new Docx4JException(e.getMessage(), e);
}
}
} else {
wmlTemplateString = XmlUtils.marshaltoString(jaxbElement, true, false, jc);
}
// Do the replacement
jaxbElement = (E)XmlUtils.unwrap(
XmlUtils.unmarshallFromTemplate(wmlTemplateString, mappings, jc));
}
So once you have the slide part, you can invoke variableReplace on it. You'll need your variables to be in the format expected by XmlUtils.unmarshallFromTemplate
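As far as I know, docx4j's variable-replacement samples write placeholders as ${name}, so a {{USER_NAME}} token in the template would have to become ${USER_NAME}; treat that as an assumption to verify against your docx4j version. The sketch below is not docx4j code. It is a JDK-only illustration of that kind of substitution applied to a slide's XML string, with a hypothetical class `PlaceholderSketch` and an invented XML fragment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of ${key} substitution on an XML string; not docx4j's implementation.
class PlaceholderSketch {
    static String replace(String template, Map<String, String> mappings) {
        StringBuilder out = new StringBuilder();
        int offset = 0;
        while (true) {
            int start = template.indexOf("${", offset);
            if (start < 0) {
                break;
            }
            int end = template.indexOf('}', start);
            if (end < 0) {
                break; // unterminated placeholder: leave the rest untouched
            }
            out.append(template, offset, start);
            String key = template.substring(start + 2, end);
            String value = mappings.get(key);
            // Unknown keys are kept as they are.
            out.append(value != null ? value : template.substring(start, end + 1));
            offset = end + 1;
        }
        out.append(template.substring(offset));
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> mappings = new HashMap<>();
        mappings.put("USER_NAME", "Vini");
        // Invented fragment standing in for a text run of a slide.
        System.out.println(replace("<a:t>Hello ${USER_NAME}</a:t>", mappings));
    }
}
```

One practical caveat with this kind of string-level replacement: a placeholder is only found if it sits in a single text run of the slide XML, so a token that PowerPoint has split across several runs would need to be retyped in one go in the template.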

load RTF into JTextPane

I created a class extending JTextPane in my text editor program. It has text and rich-text subclasses that inherit from my main JTextPane class. However, I'm unable to load RTF into my rich-text class because the method for reading from a file input stream isn't in the superclass JTextPane. So how do I read rich text into a JTextPane? This seems very simple, so I must be missing something. I see lots of examples that use RTFEditorKit to fill a JTextPane, but not when it is subclassed.
public class RichTextEditor extends TextEditorPane {
private final String extension = ".rtf";
private final String filetype = "text/richtext";
public RichTextEditor() {
// super( null, "", "Untitled", null );
super();
// this.setContentType( "text/richtext" );
}
/**
* Constructor for tabs with content.
*
* @param stream
* @param path
* @param fileName
* @param color
*/
public RichTextEditor( FileInputStream stream, String path, String fileName, Color color, boolean saveEligible ) {
super( path, fileName, color, saveEligible );
super.getScrollableTracksViewportWidth();
//RTFEditorKit rtf = new RTFEditorKit();
//this.setEditorKit( rtf );
setEditor();
this.read(stream, this.getDocument(), 0);
//this.read( stream, "RTFEditorKit" );
this.getDocument().putProperty( "file name", fileName );
}
private void setEditor() {
this.setEditorKit( new RTFEditorKit() );
}
the line:
this.read(stream, this.getDocument(), 0);
tells me
The method read(InputStream, Document) in the type JEditorPane is not applicable for the arguments (FileInputStream, Document, int)

To be able to access your editor kit, you should keep a reference to it. In fact, your setEditor() method's name is setXXX so this should be a setter (in fact, I'm not convinced that you need to set it more than once, so it may be that this method should not exist at all). Define a field:
private RTFEditorKit kit = new RTFEditorKit();
Then in the constructor,
setEditorKit( kit );
kit.read(...);
If you insist on keeping the method, its code should be
kit = new RTFEditorKit();
setEditorKit( kit );
And if you use this from the constructor, remember to initialize kit to null so as not to create an extra object that will be immediately discarded.
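The three-argument read belongs to the editor kit, not to the pane, which is why keeping a reference to the kit matters. Here is a minimal, self-contained sketch of that call using only the JDK; the class name `RtfLoader`, the method `load` and the sample RTF string are made up for illustration. It loads into a standalone document so that it can run without showing a window.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.StyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper: reads RTF from a stream into a styled document via RTFEditorKit.
class RtfLoader {
    private static final RTFEditorKit KIT = new RTFEditorKit();

    static StyledDocument load(InputStream in) throws IOException, BadLocationException {
        StyledDocument doc = new DefaultStyledDocument();
        // read(InputStream, Document, int) is defined on the editor kit.
        KIT.read(in, doc, 0);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample; a real editor would pass its FileInputStream instead.
        String rtf = "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}"
                + "\\f0\\pard This is some {\\b bold} text.\\par}";
        StyledDocument doc = load(new ByteArrayInputStream(rtf.getBytes(StandardCharsets.US_ASCII)));
        System.out.println(doc.getText(0, doc.getLength()));
    }
}
```

Inside the editor class from the question, the equivalent would be calling setEditorKit(kit) and then kit.read(stream, getDocument(), 0), or loading into a fresh document as above and passing it to setDocument.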

I've been looking for a Java implementation for loading an RTF document into a JTextPane. Besides this thread, I couldn't find anything else. Thus, I'll post my solution here in case it helps other developers:
private static final RTFEditorKit RTF_KIT = new RTFEditorKit();
(...)
_txtHelp.setContentType("text/rtf");
final InputStream inputStream = new FileInputStream(_helpFile);
final DefaultStyledDocument styledDocument = new DefaultStyledDocument(new StyleContext());
RTF_KIT.read(inputStream, styledDocument, 0);
_txtHelp.setDocument(styledDocument);
