PDFBox Customized PDFTextStripper


I'm a rookie, really. I'm building my first project (if I can finish it).
I want to extract text from a PDF together with its formatting and position, and then write it to a .docx file. I checked the PDFBox API documentation, but I'm not sure how to get the position of the text: should I iterate over the lines, or over the individual characters? I studied these three questions carefully:
Text coordinates when stripping from PDFBox
Get font of each line using PDFBox
How to extract font styles of text contents using pdfbox?
And here is my demo:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.IOException;
import java.util.List;
public class PDFTextExtractor extends PDFTextStripper {
/**
* Instantiate a new PDFTextStripper object.
*
* @throws IOException If there is an error loading the properties.
*/
public PDFTextExtractor() throws IOException {
}
String prevFont = "";
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
StringBuilder sb = new StringBuilder();
for (TextPosition position : textPositions){
String font = position.getFont().getName();
float x = position.getX();
float y = position.getY();
float fontSize = position.getFontSizeInPt();
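if (font != null && !font.equals(prevFont)){
// Not every font name contains a "-", so do not index the split result blindly
String[] parts = font.split("-", 2);
sb.append("[").append(parts[0]);
if (parts.length > 1) {
sb.append("+").append(parts[1]);
}
sb.append("+").append(fontSize).append("]");
prevFont = font;
}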
sb.append(position.getUnicode());
}
writeString(sb.toString());
}
@Override
public String getText(PDDocument doc) throws IOException {
return super.getText(doc);
}
}
And I call it like this:
// All three resources are closed automatically by try-with-resources
try (PDDocument originalPDF = PDDocument.load(file);
XWPFDocument doc = new XWPFDocument();
FileOutputStream outputStream = new FileOutputStream(EXPORT_PATH + file.getName().split("\\.")[0] + ".docx")) {
PDFTextStripper stripper = new PDFTextExtractor();
stripper.setSortByPosition(true);
// Extract one page per iteration; page numbers are 1-based
for (int pageNo = 1; pageNo <= originalPDF.getNumberOfPages(); pageNo++) {
//Parse Content
stripper.setStartPage(pageNo);
stripper.setEndPage(pageNo);
String ss = stripper.getText(originalPDF);
System.out.println(ss);
//Write Content
XWPFParagraph paragraph = doc.createParagraph();
XWPFRun run = paragraph.createRun();
run.setText(ss);
run.addBreak(BreakType.PAGE);
}
doc.write(outputStream);
}
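On the lines-versus-characters question: one possible approach (an assumption here, not something the PDFBox documentation prescribes) is to walk the characters and merge neighbours that share a font and size into runs, since a run in a .docx file also carries a single style. Below is a minimal JDK-only sketch. `StyleRuns`, `Glyph` and `Run` are hypothetical names. In the extractor above, the `Glyph` values would come from `TextPosition.getUnicode()`, `getFont().getName()` and `getFontSizeInPt()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical helper: merges per-character data into runs that share one font and size. */
final class StyleRuns {

    /** One character plus the style attributes read from a text position (illustrative type). */
    static final class Glyph {
        final String text;
        final String font;
        final float size;

        Glyph(String text, String font, float size) {
            this.text = text;
            this.font = font;
            this.size = size;
        }
    }

    /** A stretch of consecutive characters with identical font name and size. */
    static final class Run {
        final String font;
        final float size;
        final StringBuilder text = new StringBuilder();

        Run(String font, float size) {
            this.font = font;
            this.size = size;
        }
    }

    /** Starts a new run whenever the font name or the size changes. */
    static List<Run> group(List<Glyph> glyphs) {
        List<Run> runs = new ArrayList<>();
        Run current = null;
        for (Glyph g : glyphs) {
            boolean sameStyle = current != null
                    && Objects.equals(current.font, g.font)
                    && Float.compare(current.size, g.size) == 0;
            if (!sameStyle) {
                current = new Run(g.font, g.size);
                runs.add(current);
            }
            current.text.append(g.text);
        }
        return runs;
    }
}
```

Each resulting `Run` could then be written as its own styled run in the target document, instead of embedding `[font+style+size]` markers in the text.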

Related

How to add a hyperlink to the footer of a XWPFDocument using Apache POI?

The appendExternalHyperlink() method (source) does not work in the footer of an XWPFDocument. In the footer, the result is not recognised as a hyperlink.
I am new to Apache POI and have no experience with the low-level API. Can someone explain what the problem is here, please?
public class FooterProblem {
public static void main(final String[] args) throws Exception {
final XWPFDocument docx = new XWPFDocument();
final XWPFParagraph para = docx.createParagraph();
final XWPFRun paraRun = para.createRun();
paraRun.setText("Email: ");
appendExternalHyperlink("mailto:me@example.com", "me@example.com", para);
final XWPFParagraph footer = docx.createFooter(HeaderFooterType.DEFAULT).createParagraph();
final XWPFRun footerRun = footer.createRun();
footerRun.setText("Email: ");
appendExternalHyperlink("mailto:me@example.com", "me@example.com", footer);
final FileOutputStream out = new FileOutputStream("FooterProblem.docx");
docx.write(out);
out.close();
docx.close();
}
public static void appendExternalHyperlink(final String url, final String text, final XWPFParagraph paragraph) {
// Add the link as External relationship
final String id = paragraph.getDocument().getPackagePart()
.addExternalRelationship(url, XWPFRelation.HYPERLINK.getRelation()).getId();
// Append the link and bind it to the relationship
final CTHyperlink cLink = paragraph.getCTP().addNewHyperlink();
cLink.setId(id);
// Create the linked text
final CTText ctText = CTText.Factory.newInstance();
ctText.setStringValue(text);
final CTR ctr = CTR.Factory.newInstance();
ctr.setTArray(new CTText[] { ctText });
// Insert the linked text into the link
cLink.setRArray(new CTR[] { ctr });
}
}

Each footer[n].xml is its own package part and needs its own relationships. Your code always calls paragraph.getDocument(), so it creates the external hyperlink relationship on the document.xml package part even when the paragraph lives in a footer. That is why the link in the footer is not recognised.
The following code provides a method that creates an XWPFHyperlinkRun in a given XWPFParagraph and puts the relationship on the correct package part, which it obtains via paragraph.getPart(). The method therefore works for paragraphs in the document body as well as in headers and footers.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.wp.usermodel.HeaderFooterType;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTHyperlink;
public class CreateWordHyperlinks {
static XWPFHyperlinkRun createHyperlinkRun(XWPFParagraph paragraph, String uri) throws Exception {
String rId = paragraph.getPart().getPackagePart().addExternalRelationship(
uri,
XWPFRelation.HYPERLINK.getRelation()
).getId();
CTHyperlink cthyperLink=paragraph.getCTP().addNewHyperlink();
cthyperLink.setId(rId);
cthyperLink.addNewR();
return new XWPFHyperlinkRun(
cthyperLink,
cthyperLink.getRArray(0),
paragraph
);
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument();
XWPFParagraph paragraph = document.createParagraph();
XWPFRun run = paragraph.createRun();
run.setText("This is a text paragraph having a link to Google ");
XWPFHyperlinkRun hyperlinkrun = createHyperlinkRun(paragraph, "https://www.google.de");
hyperlinkrun.setText("https://www.google.de");
hyperlinkrun.setColor("0000FF");
hyperlinkrun.setUnderline(UnderlinePatterns.SINGLE);
run = paragraph.createRun();
run.setText(" in it.");
XWPFFooter footer = document.createFooter(HeaderFooterType.DEFAULT);
paragraph = footer.createParagraph();
run = paragraph.createRun();
run.setText("Email: ");
hyperlinkrun = createHyperlinkRun(paragraph, "mailto:me@example.com");
hyperlinkrun.setText("me@example.com");
hyperlinkrun.setColor("0000FF");
hyperlinkrun.setUnderline(UnderlinePatterns.SINGLE);
FileOutputStream out = new FileOutputStream("CreateWordHyperlinks.docx");
document.write(out);
out.close();
document.close();
}
}
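Because a .docx file is a zip package, one way to check which part a relationship ended up on is to list the `.rels` entries in the generated file. The following is a minimal sketch using only `java.util.zip`. `RelsLister` and `listRelationshipParts` are hypothetical names, not part of Apache POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper for inspecting the relationship parts of a .docx package. */
final class RelsLister {

    /** Returns the names of all zip entries ending in ".rels", in archive order. */
    static List<String> listRelationshipParts(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().endsWith(".rels")) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }
}
```

Called with a `FileInputStream` on the generated file, an entry such as `word/_rels/footer1.xml.rels` next to `word/_rels/document.xml.rels` would indicate that the footer carries its own relationships. The exact footer file name is an assumption and may differ.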

PDFBox - read text from multiple PDFs and load it into multiple Text files

I have more than 1000 PDF files in a folder, and each one needs to be converted and saved to its own text file.
I'm fairly new to Java and I'm using PDFBox for the conversion. I got the code working for a single PDF, but I'm stuck on how to convert all the PDFs in a folder. Can someone help me achieve that in Java?
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
public final class ExtractPdf
{
public static void main( String[] args ) throws IOException
{
String fileName = "sample.pdf";
PDDocument document = null;
try (PrintWriter out = new PrintWriter("out.txt"))
{
document = PDDocument.load( new File(fileName));
PDFTextStripper stripper = new PDFTextStripper();
String pdfText = stripper.getText(document).toString();
System.out.println( "Text in the area:" + pdfText);
out.println(pdfText);
}
finally
{
if( document != null )
{
document.close();
}
}
}
}
Thanks, Free

Basically your question is how to go through a directory…
public static void main(String[] args) throws IOException
{
File dir = new File("....");
File[] files = dir.listFiles(new FilenameFilter()
{
// use anonymous inner class
@Override
public boolean accept(File dir, String name)
{
return name.toLowerCase().endsWith(".pdf");
}
});
// null check omitted!
for (File file : files)
{
int len = file.getAbsolutePath().length();
String txtFilename = file.getAbsolutePath().substring(0, len - 4) + ".txt";
// check whether txt file exists omitted
try (OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(txtFilename), java.nio.charset.StandardCharsets.UTF_8);
PDDocument document = PDDocument.load(file))
{
PDFTextStripper stripper = new PDFTextStripper();
stripper.writeText(document, out);
}
}
// exception catch omitted. Add code here to avoid your whole job
// dying if only one file is broken
}
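The directory walk above can also be written with `java.nio.file`, which reports a missing or unreadable directory as an exception instead of a `null` array. This is a minimal sketch of just the path handling, without the PDFBox calls. `PdfBatchPaths`, `listPdfs` and `toTxtPath` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: finds PDF files in a folder and derives the matching .txt paths. */
final class PdfBatchPaths {

    /** Lists regular files in dir whose name ends with ".pdf", ignoring case. */
    static List<Path> listPdfs(Path dir) throws IOException {
        List<Path> pdfs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                if (Files.isRegularFile(p) && name.endsWith(".pdf")) {
                    pdfs.add(p);
                }
            }
        }
        return pdfs;
    }

    /** Replaces the last extension of the file name with ".txt", keeping the folder. */
    static Path toTxtPath(Path pdf) {
        String name = pdf.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot >= 0 ? name.substring(0, dot) : name;
        return pdf.resolveSibling(base + ".txt");
    }
}
```

Inside the loop, each path from `listPdfs` would be loaded and stripped as shown above, and the text written to `toTxtPath(pdf)`. Cutting at the last dot keeps names such as `report.v2.pdf` intact.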

java poi XWPF word - create bookmark in new document

Many examples exist for reading and editing/replacing bookmarks in an XWPF Word document.
But I want to create a document and add new bookmarks to it.
Creating the document is no problem:
private void createWordDoc() throws IOException {
XWPFDocument document = new XWPFDocument();
File tempDocFile = new File(pathName+"\\temp.docx");
FileOutputStream out = new FileOutputStream(tempDocFile);
XWPFParagraph paragraph = document.createParagraph();
XWPFRun run = paragraph.createRun();
run.setText("testing string ");
document.write(out);
out.close();
}
How can I make a bookmark on text "testing string"?

The high-level classes of Apache POI do not implement this so far. Therefore the low-level CTP and CTBookmark classes are needed.
Example:
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBookmark;
import java.math.BigInteger;
public class CreateWordBookmark {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument();
XWPFParagraph paragraph = document.createParagraph();
//bookmark before the run
CTBookmark bookmark = paragraph.getCTP().addNewBookmarkStart();
bookmark.setName("before_testing_string");
bookmark.setId(BigInteger.valueOf(0));
paragraph.getCTP().addNewBookmarkEnd().setId(BigInteger.valueOf(0));
//bookmark the run
bookmark = paragraph.getCTP().addNewBookmarkStart();
bookmark.setName("testing_string");
bookmark.setId(BigInteger.valueOf(1));
XWPFRun run = paragraph.createRun();
run.setText("testing string ");
paragraph.getCTP().addNewBookmarkEnd().setId(BigInteger.valueOf(1));
//bookmark after the run
bookmark = paragraph.getCTP().addNewBookmarkStart();
bookmark.setName("after_testing_string");
bookmark.setId(BigInteger.valueOf(2));
paragraph.getCTP().addNewBookmarkEnd().setId(BigInteger.valueOf(2));
// Close the output stream once the document has been written
try (FileOutputStream out = new FileOutputStream("CreateWordBookmark.docx")) {
document.write(out);
}
document.close();
}
}

Reading text of a pdf using PDFBOX occasionally returns \r\n

I’m currently using PDFBox to read the text of a set of pdfs that I’ve inherited.
I’m only interested in reading the text, not making any changes to the file.
The code that works for most of the files is:
File pdfFile = myPath.toFile();
PDDocument document = PDDocument.load(pdfFile );
Writer sw = new StringWriter();
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage( 1 );
stripper.writeText( document, sw );
String documentText = sw.toString();
For most files, I wind up with the text in the documentText field.
But for 3 of 24 files, the documentText content is “\r\n” for the first file, “\r\n\r\n” for the second, and “\r\n\r\n\r\n” for the third. The three files are not consecutive; multiple good files lie between them.
The File is derived from a java.nio.Path. The WindowsFileAttribute that is part of the Path has a size of 279K, so the file is not empty on disk.
I can open the file and view the data, and it looks like the other files that my code reads.
I’m using Java 8.0.121, and PDFBox 2.0.4. (this is the latest version, I believe.)
Any suggestions? Is there a better way to read the text? (I’m not interested in the formatting, or fonts used, just the text.)
Thanks.
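One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the affected pages contain no extractable text at all, for example because they are scanned images. Whatever the cause, such files can at least be flagged automatically by testing whether the result is whitespace only. A minimal sketch follows. `ExtractionCheck` and `isEffectivelyEmpty` are hypothetical names.

```java
/** Hypothetical helper: flags extraction results that consist of whitespace only. */
final class ExtractionCheck {

    /** True for null, for the empty string, and for text such as "\r\n\r\n". */
    static boolean isEffectivelyEmpty(String extracted) {
        // trim() drops leading and trailing characters up to U+0020, which covers \r and \n
        return extracted == null || extracted.trim().isEmpty();
    }
}
```

Calling `ExtractionCheck.isEffectivelyEmpty(documentText)` after `writeText` would let the batch log those three files for manual inspection or a different extraction route.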

Reading multiple PDF docs using pdfbox in java
package readwordfile;
import java.io.BufferedReader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
/**
* This is an example on how to extract words from PDF document
*
* @author saravanan
*/
public class GetWordsFromPDF extends PDFTextStripper {
static List<String> words = new ArrayList<String>();
public GetWordsFromPDF() throws IOException {
}
/**
* @param args
* @throws IOException If there is an error parsing the document.
*/
public static void main(String[] args) throws IOException {
String files;
// FileWriter fs = new FileWriter("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");
// FileInputStream fstream1 = new FileInputStream("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");
// DataInputStream in1 = new DataInputStream(fstream1);
// BufferedReader br1 = new BufferedReader(new InputStreamReader(in1));
String path = "C:\\Users\\saravanan\\Desktop\\New folder\\"; //local folder path name
File folder = new File(path);
File[] listOfFiles = folder.listFiles();
for (int i = 0; i < listOfFiles.length; i++) {
if (listOfFiles[i].isFile()) {
files = listOfFiles[i].getName();
if (files.endsWith(".pdf") || files.endsWith(".PDF")) {
String nfiles = "C:\\Users\\saravanan\\Desktop\\New folder\\";
String fileName1 = nfiles + files;
System.out.print("\n\n" + files+"\n");
PDDocument document = null;
try {
document = PDDocument.load(new File(fileName1));
PDFTextStripper stripper = new GetWordsFromPDF();
stripper.setSortByPosition(true);
stripper.setStartPage(0);
stripper.setEndPage(document.getNumberOfPages());
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
int x = 0;
System.out.println("");
for (String word : words) {
if (word.startsWith("xxxxxx")) { //here you can give your pdf doc starting word
x = 1;
}
if (x == 1) {
if (!(word.endsWith("YYYYYY"))) { //here you can give your pdf doc ending word
System.out.print(word + " ");
// fs.write(word);
} else {
x = 0;
break;
}
}
}
} finally {
if (document != null) {
document.close();
words.clear();
}
}
}
}
}
}
/**
* Override the default functionality of PDFTextStripper.writeString()
*
* @param str
* @param textPositions
* @throws java.io.IOException
*/
@Override
protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
String[] wordsInStream = str.split(getWordSeparator());
if (wordsInStream != null) {
for (String word : wordsInStream) {
words.add(word); //store the pdf content into the List
}
}
}
}
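The start-word/end-word filtering inside `main` above can be isolated into a small method, which makes it testable without a PDF. This is a minimal sketch that mirrors the same flag-and-break logic. `WordRange` and `between` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: selects the words between a start marker and an end marker. */
final class WordRange {

    /**
     * Collects words beginning at the first word that starts with startPrefix and
     * stopping before the first word, from there on, that ends with endSuffix.
     */
    static List<String> between(List<String> words, String startPrefix, String endSuffix) {
        List<String> result = new ArrayList<>();
        boolean inside = false;
        for (String word : words) {
            if (!inside && word.startsWith(startPrefix)) {
                inside = true;
            }
            if (inside) {
                if (word.endsWith(endSuffix)) {
                    break; // end marker reached; it is not included
                }
                result.add(word);
            }
        }
        return result;
    }
}
```

With the static `words` list filled by `writeString`, the printing loop would reduce to iterating over `WordRange.between(words, "xxxxxx", "YYYYYY")`.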

How to solve this iText PDF error?

I'm merging two PDF files (test1.pdf and test2.pdf) using iText and writing the output to test_result.pdf. But the output pages do not look like the input pages: they are cropped to half of the actual size. How can I fix this? Here is my code:
public class MergePDF {
public static void main(String[] args) {
try {
List<InputStream> pdfs = new ArrayList<InputStream>();
pdfs.add(new FileInputStream("test1.pdf"));
pdfs.add(new FileInputStream("test2.pdf/"));
// pdfs.add(new FileInputStream("test_result.pdf/"));
OutputStream output = new FileOutputStream("/home/ant000112/merge_result.pdf");
MergePDF.concatPDFs(pdfs, output, true);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void concatPDFs(List<InputStream> streamOfPDFFiles,
OutputStream outputStream, boolean paginate) {
Document document = new Document();
try {
List<InputStream> pdfs = streamOfPDFFiles;
List<PdfReader> readers = new ArrayList<PdfReader>();
int totalPages = 0;
Iterator<InputStream> iteratorPDFs = pdfs.iterator();
// Create Readers for the pdfs.
while (iteratorPDFs.hasNext()) {
InputStream pdf = iteratorPDFs.next();
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages += pdfReader.getNumberOfPages();
}
// Create a writer for the outputstream
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
//BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA,BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA,BaseFont.CP1257, BaseFont.NOT_EMBEDDED);
PdfContentByte cb = writer.getDirectContent(); // Holds the PDF
// data
PdfImportedPage page;
int currentPageNumber = 0;
int pageOfCurrentReaderPDF = 0;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Loop through the PDF files and add to the output.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
// Create a new page in the target for each source page.
while (pageOfCurrentReaderPDF < pdfReader.getNumberOfPages()) {
document.newPage();
pageOfCurrentReaderPDF++;
currentPageNumber++;
page = writer.getImportedPage(pdfReader,
pageOfCurrentReaderPDF);
cb.addTemplate(page, 0, 0);
// Code for pagination.
if (paginate) {
cb.beginText();
cb.setFontAndSize(bf, 9);
cb.showTextAligned(PdfContentByte.ALIGN_CENTER, ""+ currentPageNumber + " of " + totalPages, 520,5, 0);
cb.endText();
}
}
pageOfCurrentReaderPDF = 0;
}
outputStream.flush();
document.close();
outputStream.close();
} catch (Exception e) {
e.printStackTrace();
} finally {
if (document.isOpen())
document.close();
System.out.println("ghghklh");
try {
if (outputStream != null)
outputStream.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
}
}

Did you try reading the examples from the iText site? It appears you're doing it in a different way; for instance, you're not using PdfCopy.
/*
* This class is part of the book "iText in Action - 2nd Edition"
* written by Bruno Lowagie (ISBN: 9781935182610)
* For more info, go to: http://itextpdf.com/examples/
* This example only works with the AGPL version of iText.
*/
package part2.chapter06;
import java.io.FileOutputStream;
import java.io.IOException;
import java.sql.SQLException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import part1.chapter02.MovieHistory;
import part1.chapter02.MovieLinks1;
public class Concatenate {
/** The resulting PDF file. */
public static final String RESULT
= "results/part2/chapter06/concatenated.pdf";
/**
* Main method.
* @param args no arguments needed
* @throws DocumentException
* @throws IOException
* @throws SQLException
*/
public static void main(String[] args)
throws IOException, DocumentException, SQLException {
// using previous examples to create PDFs
MovieLinks1.main(args);
MovieHistory.main(args);
String[] files = { MovieLinks1.RESULT, MovieHistory.RESULT };
// step 1
Document document = new Document();
// step 2
PdfCopy copy = new PdfCopy(document, new FileOutputStream(RESULT));
// step 3
document.open();
// step 4
PdfReader reader;
int n;
// loop over the documents you want to concatenate
for (int i = 0; i < files.length; i++) {
reader = new PdfReader(files[i]);
// loop over the pages in that document
n = reader.getNumberOfPages();
for (int page = 0; page < n; ) {
copy.addPage(copy.getImportedPage(reader, ++page));
}
copy.freeReader(reader);
}
// step 5
document.close();
}
}
http://itextpdf.com/examples/iia.php?id=123
EDIT: Just to be fair I downloaded the library and I tried the example. It works like a charm.
