I work on a web-based tool where we offer customized prints.
Currently we build an XML structure with Java, feed it to the XMLmind XSL-FO Converter along with customized XSL-FO, which then produces an RTF document.
This works fine on simple layouts, but there's some problem areas where I'd like greater control, or where I can't do what I want at all. F.ex: tables in header, footers (e.g., page numbers), columns, having a separate column setup or different page number info on the first page, etc.
Do any of you know of better alternatives, either to XMLmind or to the way we get from data to RTF, i.e., Java-> XML, XML+XSL-> RTF? (The only practical limitation for us is the JVM.)
You can take a look at a new library called jRTF. It allows you to create new RTF documents and to fill RTF templates.
Have you had a look at the iText library? It's touted primarily as a PDF generator, though it can also generate RTF. I haven't had cause to use it personally, but the general feeling I get is that it's good, and the interface looks comprehensive and easy to work to in the abstract. Whether it would fit in well with your existing data model is another question.
If you could afford spending some money, you could use Aspose.Words, a professional library for creating Word and RTF documents for Java and .NET.
iText supports RTF.
import com.lowagie.text.*;
import com.lowagie.text.html.simpleparser.HTMLWorker;
import com.lowagie.text.html.simpleparser.StyleSheet;
import com.lowagie.text.rtf.*;
import java.io.*;
import java.util.ArrayList;
public class HTMLtoRTF {
public static void main(String[] args) throws DocumentException {
Document document = new Document();
try {
Reader htmlreader = new BufferedReader((new InputStreamReader((new FileInputStream("C:\\Users\\asrikantan\\Desktop\\sample.htm")))));
RtfWriter2 rtfWriter = RtfWriter2.getInstance(document, new FileOutputStream(("C:\\Users\\asrikantan\\Desktop\\sample12.rtf")));
document.open();
document.add(new Paragraph("Testing simple paragraph addition."));
//ByteArrayOutputStream out = new ByteArrayOutputStream();
StyleSheet styles = new StyleSheet();
styles.loadTagStyle("body", "font", "Bitstream Vera Sans");
ArrayList htmlParser = HTMLWorker.parseToList(htmlreader, styles);
//fetch HTML line by line
for (int htmlDatacntr = 0; htmlDatacntr < htmlParser.size(); htmlDatacntr++) {
Element htmlDataElement = (Element) htmlParser.get(htmlDatacntr);
document.add((htmlDataElement));
}
htmlreader.close();
document.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (Exception e) {
System.out.println(e);
}
}
}
Related
We have a few forms created from Adobe LiveCycle where users fill the dynamic forms and submits the document to our office where we stamp it with our signature and flatten it (at least most of the time - I've seen a few documents in our system that haven't been flattened yet but that can be a separate question, I'll focus on the flattened documents here because that's most of what we have).
I'm trying to use iText 7 to parse/extract the user's answers to our forms for migrating to an electronic solution that will happen a few months from now. I was able to make the example work in Java but I don't understand the process.
/*
This file is part of the iText (R) project.
Copyright (c) 1998-2020 iText Group NV
Authors: iText Software.
For more information, please contact iText Software at this address:
sales#itextpdf.com
*/
/**
* Example written by Bruno Lowagie in answer to:
* http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
*/
package ca.umanitoba.ad.research;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.filter.TextRegionEventFilter;
import com.itextpdf.kernel.pdf.canvas.parser.listener.FilteredEventListener;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.FileOutputStream;
import java.io.Writer;
import java.io.BufferedWriter;
public class Main {
public static final String DEST = "./target/txt/parse_custom.txt";
public static final String SRC = "./src/main/resources/pdfs/nameddestinations.pdf";
public static void main(String[] args) throws IOException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new Main().manipulatePdf(DEST);
}
protected void manipulatePdf(String dest) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
Rectangle rect = new Rectangle(36, 750, 523, 56);
CustomFontFilter fontFilter = new CustomFontFilter(rect);
FilteredEventListener listener = new FilteredEventListener();
// Create a text extraction renderer
LocationTextExtractionStrategy extractionStrategy = listener
.attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
// Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.processPageContent(pdfDoc.getFirstPage());
// Get the resultant text after applying the custom filter
String actualText = extractionStrategy.getResultantText();
pdfDoc.close();
// See the resultant text in the console
System.out.println(actualText);
try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dest)))) {
writer.write(actualText);
}
}
/*
* The custom filter filters only the text of which the font name ends with Bold or Oblique.
*/
protected class CustomFontFilter extends TextRegionEventFilter {
public CustomFontFilter(Rectangle filterRect) {
super(filterRect);
}
#Override
public boolean accept(IEventData data, EventType type) {
if (type.equals(EventType.RENDER_TEXT)) {
TextRenderInfo renderInfo = (TextRenderInfo) data;
PdfFont font = renderInfo.getFont();
if (null != font) {
String fontName = font.getFontProgram().getFontNames().getFontName();
return fontName.endsWith("Bold") || fontName.endsWith("Oblique");
}
}
return false;
}
}
}
Why is there a need to specify a Rectangle? Our forms are dynamic so users can add more fields as needed and we also accept paragraphs on some of the questions so the length will always vary so it's unlikely that the coordinates of the texts will be the same.
How can I change the flow so that I can perhaps just search for the question and then get the text right after it (presumably the answer) - I don't really know what the best way to parse a PDF is. If there's no other way except providing a Rectangle, can I programmatically determine the coordinates/dimensions of the rectangles?
From the example it looks like it's filtering the text based on whether it's bolded or italicized which I probably don't need but it looks to be easy enough to fix by modifying/removing the accept() method.
Please take a look at what that example is for: In the JavaDoc comment you can read
/**
* Example written by Bruno Lowagie in answer to:
* http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
*/
and that stack overflow question starts with
I used the following code to get data in PDF from a particular location. I want to get bold text present in that location
When you wonder, therefore,
Why is there a need to specify a Rectangle?
the answer is: because the example is about finding bold text in a particular location.
You mention your forms were dynamic before flattening and fields don't have fixed positions. Thus, this filter probably is not optimal for your use case.
How can I change the flow so that I can perhaps just search for the question and then get the text right after it
In that case simply don't filter at all but use a plain LocationTextExtractionStrategy to extract text, search for the question text in the extracted text, and use the text thereafter up to the next question text.
Alternatively, if you still have the unflattened dynamic forms, you may consider extracting the xfa xml and extract the filled-in data from that xml.
I'm trying to generate a PDF that contains Arabic text using PDFBox Apache but the text is generated as separated characters because Apache parses given Arabic string to a sequence of general 'official' Unicode characters that is equivalent to the isolated form of Arabic characters.
Here is an example:
Target text to Write in PDF "Should be expected output in PDF File" -> جملة بالعربي
What I get in PDF File ->
I tried some methods but it's no use here are some of them:
1. Converting String to Stream of bits and trying to extract right values
2. Treating String a sequence of bytes with UTF-8 && UTF-16 and extracting values from them
There is some approach seems very promising to get the value "Unicode" of each character But it generate general "official Unicode" Here is what I mean
System.out.println( Integer.toHexString( (int)(new String("كلمة").charAt(1))) );
output is 644 but fee0 was the expected output because this character is in middle from then I should get the middle Unicode fee0
so what I want is some method that generates the correct Unicode not the just the official one
The very Left column in the first table in the following link represents the general Unicode
Arabic Unicode Tables Wikipedia
Notice:
The sample code in this answer might be outdated please refer to h q's answer for the working sample code
At First I will thank Tilman Hausherr and M.Prokhorov for showing me the library that made writing Arabic possible using PDFBox Apache.
This Answer will be divided into two Sections:
Downloading the library and installing it
How to use the library
Downloading the library and installing it
We are going to use ICU Library.
ICU stands for International Components for Unicode and it is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.
To download the Library go to the downloads page from here.
Choose the latest version of ICU4J as shown in the following image.
You will be transferred to another page and you will find a box with direct links of the needed components .Go ahead and download three Files you will find the highlighted in next image.
icu4j-docs.jar
icu4j-src.jar
icu4j.jar
The following explanation for creating and adding a library in Netbeans IDE
Navigate to the Toolbar and Click tools
Choose Libraries
At the bottom left you will find new Library button Create yours
Navigate to the library that you created in libraries list
Click it and add jar folders like that
Add icu4j.jar in class path
Add icu4j-src.jar in Sources
Add icu4j-docs.jar in Javadoc
View your opened projects from the very right
Expand the project that you want to use the library in
Right Click on the libraries folder and choose add library
Finally choose the library that you had just created.
Now you are ready to use the library just import what you want like that
import com.ibm.icu.What_You_Want_To_Import;
How to use the library
With ArabicShaping Class and reversing the String we can write a correct attached Arabic LINE
Here is the Code Notice the comments in the following code
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;
public class Main {
public static void main(String[] args) throws IOException , ArabicShapingException
{
File f = new File("Arabic Font File of format.ttf");
PDDocument doc = new PDDocument();
PDPage Page = new PDPage();
doc.addPage(Page);
PDPageContentStream Writer = new PDPageContentStream(doc, Page);
Writer.beginText();
Writer.setFont(PDType0Font.load(doc, f), 20);
Writer.newLineAtOffset(0, 700);
//The Trick in the next Line of Code But Here is some few Notes first
//We have to reverse the string because PDFBox is Writting from the left but Arabic is RTL Language
//The output will be perfect except every line will be justified to the left "It's not hard to resolve this"
// So we have to write arabic string to pdf line by line..It will be like this
String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
Writer.showText(new StringBuilder(new ArabicShaping(reverseNumbersInString(ArabicShaping.LETTERS_SHAPE).shape(s))).reverse().toString());
// Note the previous line of code throws ArabicShapingExcpetion
Writer.endText();
Writer.close();
doc.save(new File("File_Test.pdf"));
doc.close();
}
}
Here is the output
I hope that I had gone over everything.
Update : After reversing make sure to reverse the numbers again in order to get the same proper number
Here is a couple of functions that could help
public static boolean isInt(String Input)
{
try{Integer.parseInt(Input);return true;}
catch(NumberFormatException e){return false;}
}
public static String reverseNumbersInString(String Input)
{
char[] Separated = Input.toCharArray();int i = 0;
String Result = "",Hold = "";
for(;i<Separated.length;i++ )
{
if(isInt(Separated[i]+"") == true)
{
while(i < Separated.length && (isInt(Separated[i]+"") == true || Separated[i] == '.' || Separated[i] == '-'))
{
Hold += Separated[i];
i++;
}
Result+=reverse(Hold);
Hold="";
}
else{Result+=Separated[i];}
}
return Result;
}
Here is a code that works. Download a sample font, e.g. trado.ttf
Make sure the pdfbox-app and icu4j jar files are in your classpath.
import java.io.File;
import java.io.IOException;
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;
public class Main {
public static void main(String[] args) throws IOException , ArabicShapingException
{
File f = new File("trado.ttf");
PDDocument doc = new PDDocument();
PDPage Page = new PDPage();
doc.addPage(Page);
PDPageContentStream Writer = new PDPageContentStream(doc, Page);
Writer.beginText();
Writer.setFont(PDType0Font.load(doc, f), 20);
Writer.newLineAtOffset(0, 700);
String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
Writer.showText(bidiReorder(s));
Writer.endText();
Writer.close();
doc.save(new File("File_Test.pdf"));
doc.close();
}
private static String bidiReorder(String text)
{
try {
Bidi bidi = new Bidi((new ArabicShaping(ArabicShaping.LETTERS_SHAPE)).shape(text), 127);
bidi.setReorderingMode(0);
return bidi.writeReordered(2);
}
catch (ArabicShapingException ase3) {
return text;
}
}
}
I'm trying to generate a PDF that contains Arabic text using PDFBox Apache but the text is generated as separated characters because Apache parses given Arabic string to a sequence of general 'official' Unicode characters that is equivalent to the isolated form of Arabic characters.
Here is an example:
Target text to Write in PDF "Should be expected output in PDF File" -> جملة بالعربي
What I get in PDF File ->
I tried some methods but it's no use here are some of them:
1. Converting String to Stream of bits and trying to extract right values
2. Treating String a sequence of bytes with UTF-8 && UTF-16 and extracting values from them
There is some approach seems very promising to get the value "Unicode" of each character But it generate general "official Unicode" Here is what I mean
System.out.println( Integer.toHexString( (int)(new String("كلمة").charAt(1))) );
output is 644 but fee0 was the expected output because this character is in middle from then I should get the middle Unicode fee0
so what I want is some method that generates the correct Unicode not the just the official one
The very Left column in the first table in the following link represents the general Unicode
Arabic Unicode Tables Wikipedia
Notice:
The sample code in this answer might be outdated please refer to h q's answer for the working sample code
At First I will thank Tilman Hausherr and M.Prokhorov for showing me the library that made writing Arabic possible using PDFBox Apache.
This Answer will be divided into two Sections:
Downloading the library and installing it
How to use the library
Downloading the library and installing it
We are going to use ICU Library.
ICU stands for International Components for Unicode and it is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.
To download the Library go to the downloads page from here.
Choose the latest version of ICU4J as shown in the following image.
You will be transferred to another page and you will find a box with direct links of the needed components .Go ahead and download three Files you will find the highlighted in next image.
icu4j-docs.jar
icu4j-src.jar
icu4j.jar
The following explanation for creating and adding a library in Netbeans IDE
Navigate to the Toolbar and Click tools
Choose Libraries
At the bottom left you will find new Library button Create yours
Navigate to the library that you created in libraries list
Click it and add jar folders like that
Add icu4j.jar in class path
Add icu4j-src.jar in Sources
Add icu4j-docs.jar in Javadoc
View your opened projects from the very right
Expand the project that you want to use the library in
Right Click on the libraries folder and choose add library
Finally choose the library that you had just created.
Now you are ready to use the library just import what you want like that
import com.ibm.icu.What_You_Want_To_Import;
How to use the library
With ArabicShaping Class and reversing the String we can write a correct attached Arabic LINE
Here is the Code Notice the comments in the following code
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;
public class Main {
public static void main(String[] args) throws IOException , ArabicShapingException
{
File f = new File("Arabic Font File of format.ttf");
PDDocument doc = new PDDocument();
PDPage Page = new PDPage();
doc.addPage(Page);
PDPageContentStream Writer = new PDPageContentStream(doc, Page);
Writer.beginText();
Writer.setFont(PDType0Font.load(doc, f), 20);
Writer.newLineAtOffset(0, 700);
//The Trick in the next Line of Code But Here is some few Notes first
//We have to reverse the string because PDFBox is Writting from the left but Arabic is RTL Language
//The output will be perfect except every line will be justified to the left "It's not hard to resolve this"
// So we have to write arabic string to pdf line by line..It will be like this
String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
Writer.showText(new StringBuilder(new ArabicShaping(reverseNumbersInString(ArabicShaping.LETTERS_SHAPE).shape(s))).reverse().toString());
// Note the previous line of code throws ArabicShapingExcpetion
Writer.endText();
Writer.close();
doc.save(new File("File_Test.pdf"));
doc.close();
}
}
Here is the output
I hope that I had gone over everything.
Update : After reversing make sure to reverse the numbers again in order to get the same proper number
Here is a couple of functions that could help
public static boolean isInt(String Input)
{
try{Integer.parseInt(Input);return true;}
catch(NumberFormatException e){return false;}
}
public static String reverseNumbersInString(String Input)
{
char[] Separated = Input.toCharArray();int i = 0;
String Result = "",Hold = "";
for(;i<Separated.length;i++ )
{
if(isInt(Separated[i]+"") == true)
{
while(i < Separated.length && (isInt(Separated[i]+"") == true || Separated[i] == '.' || Separated[i] == '-'))
{
Hold += Separated[i];
i++;
}
Result+=reverse(Hold);
Hold="";
}
else{Result+=Separated[i];}
}
return Result;
}
Here is a code that works. Download a sample font, e.g. trado.ttf
Make sure the pdfbox-app and icu4j jar files are in your classpath.
import java.io.File;
import java.io.IOException;
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;
public class Main {
public static void main(String[] args) throws IOException , ArabicShapingException
{
File f = new File("trado.ttf");
PDDocument doc = new PDDocument();
PDPage Page = new PDPage();
doc.addPage(Page);
PDPageContentStream Writer = new PDPageContentStream(doc, Page);
Writer.beginText();
Writer.setFont(PDType0Font.load(doc, f), 20);
Writer.newLineAtOffset(0, 700);
String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
Writer.showText(bidiReorder(s));
Writer.endText();
Writer.close();
doc.save(new File("File_Test.pdf"));
doc.close();
}
private static String bidiReorder(String text)
{
try {
Bidi bidi = new Bidi((new ArabicShaping(ArabicShaping.LETTERS_SHAPE)).shape(text), 127);
bidi.setReorderingMode(0);
return bidi.writeReordered(2);
}
catch (ArabicShapingException ase3) {
return text;
}
}
}
I've got a docx4j generated file which contains several tables, titles and, finally, an excel-generated curve chart.
I have tried many approaches in order to convert this file to PDF, but did not get to any successful result.
Docx4j with xsl-fo did not work, most of the things included in the docx file are not yet implemented and show up in red text as "not implemented".
JODConverter did not work either, I got a resulting PDF in which everything was pretty good (just little formatting/styling issues) BUT the graph did not show up.
Finally, the closest approach was using Apache POI: The resulting PDF was identical to my docx file, but still no chart showing up.
I already know Aspose would solve this pretty easily, but I am looking for an open source, free solution.
The code I am using with Apache POI is as follows:
public static void convert(String inputPath, String outputPath)
throws XWPFConverterException, IOException {
PdfConverter converter = new PdfConverter();
converter.convert(new XWPFDocument(new FileInputStream(new File(
inputPath))), new FileOutputStream(new File(outputPath)),
PdfOptions.create());
}
I do not know what to do to get the chart inside the PDF, could anybody tell me how to proceed?
Thanks in advance.
I don't know if this helps you but you could use "jacob" (I don't know if its possible with apache poi or docx4j)
With this solution you open "Word" yourself and export it as pdf.
!Word needs to be installed on the computer!
Heres the download-page: http://sourceforge.net/projects/jacob-project/
try {
if (System.getProperty("os.arch").contains("64")) {
System.load(DLL_64BIT_PATH);
} else {
System.load(DLL_32BIT_PATH);
}
} catch (UnsatisfiedLinkError e) {
//TODO
} catch (IOException e) {
//TODO
}
ActiveXComponent oleComponent = new ActiveXComponent("Word.Application");
oleComponent.setProperty("Visible", false);
Variant var = Dispatch.get(oleComponent, "Documents");
Dispatch document = var.getDispatch();
Dispatch activeDoc = Dispatch.call(document, "Open", fileName).toDispatch();
// https://msdn.microsoft.com/EN-US/library/office/ff845579.aspx
Dispatch.call(activeDoc, "ExportAsFixedFormat", new Object[] { "path to pdfFile.pdf", new Integer(17), false, 0 });
Object args[] = { new Integer(0) };//private static final int DO_NOT_SAVE_CHANGES = 0;
Dispatch.call(activeDoc, "Close", args);
Dispatch.call(oleComponent, "Quit");
I need to extract data from some PDF documents (using Java). I need to know what would be the easiest way to do it.
I tried iText. It's fairly complicated for my needs. Besides I guess it is not available for free for commercial projects. So it is not an option. I also gave a try to PDFBox, and ran into various NoClassDefFoundError errors.
I googled and came across several other options such as PDF Clown, jPod, but I do not have time to experiment with all of these libraries. I am relying on community's experience with PDF reading thru Java.
Note that I do not need to create or manipulate PDF documents. I just need to exrtract textual data from PDF documents with moderate level layout complexity.
Please suggest the quickest and easiest way to extract text from PDF documents. Thanks.
I recommend trying Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs.
The benefits of Tika (besides being free), is that is used to be a subproject of Apache Lucene, which is a very robust open-source search engine. Tika includes a built-in PDF parser that uses a SAX Content Handler to pass PDF data to your application. It can also extract data from encrypted PDFs and it allows you to create or subclass an existing parser to customize the behavior.
The code is simple. To extract the data from a PDF, all you need to do is create a Parser class that implements the Parser interface and define a parse() method:
public void parse(
InputStream stream, ContentHandler handler,
Metadata metadata, ParseContext context)
throws IOException, SAXException, TikaException {
metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
metadata.set("Hello", "World");
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
xhtml.startDocument();
xhtml.endDocument();
}
Then, to run the parser, you could do something like this:
InputStream input = new FileInputStream(new File(resourceLocation));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
parser.parse(input, textHandler, metadata);
input.close();
out.println("Title: " + metadata.get("title"));
out.println("Author: " + metadata.get("Author"));
out.println("content: " + textHandler.toString());
I am using JPedal and I'm really happy with the results. It isn't free but it's high quality and the output for image generation from pdfs or text extraction is really nice.
And as a paid library, the support is always there to answer.
I have used PDFBox to extract text for Lucene indexing without too many issues. Its error/warning logging is quite verbose if I remember right - what was the cause for those errors you received?
I understand this post is pretty old but I would recommend using itext from here:
http://sourceforge.net/projects/itext/
If you are using maven you can pull the jars in from maven central:
http://mvnrepository.com/artifact/com.itextpdf/itextpdf
I can't understand how using it can be difficult:
PdfReader pdf = new PdfReader("path to your pdf file");
PdfTextExtractor parser = new PdfTextExtractor();
String output = parser.getTextFromPage(pdf, pageNumber);
assert output.contains("whatever you want to validate on that page");
Import this Classes and add Jar Files 1.- pdfbox-app- 2.0.
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.FindBy;
import org.testng.Assert;
import org.testng.annotations.Test;
import java.io.File;
import java.io.IOException;
import java.text.ParseException;
import java.util.List;
import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.chrome.ChromeDriver;
import com.coencorp.selenium.framework.BasePage;
import com.coencorp.selenium.framework.ExcelReadWrite;
import com.relevantcodes.extentreports.LogStatus;
Add this code inside the class.
public void showList() throws InterruptedException, IOException {
showInspectionsLink.click();
waitForElement(hideInspectionsLink);
printButton.click();
Thread.sleep(10000);
String downloadPath = "C:\\Users\\Updoer\\Downloads";
File getLatestFile = getLatestFilefromDir(downloadPath);
String fileName = getLatestFile.getName();
Assert.assertTrue(fileName.equals("Inspections.pdf"), "Downloaded file name is not
matching with expected file name");
Thread.sleep(10000);
//testVerifyPDFInURL();
PDDocument pd;
pd= PDDocument.load(new File("C:\\Users\\Updoer\\Downloads\\Inspections.pdf"));
System.out.println("Total Pages:"+ pd.getNumberOfPages());
PDFTextStripper pdf=new PDFTextStripper();
System.out.println(pdf.getText(pd));
Add this Method in same class.
public void testVerifyPDFInURL() {
WebDriver driver = new ChromeDriver();
driver.get("C:\\Users\\Updoer\\Downloads\\Inspections.pdf");
driver.findElement(By.linkText("Adeeb Khan")).click();
String getURL = driver.getCurrentUrl();
Assert.assertTrue(getURL.contains(".pdf"));
}
private File getLatestFilefromDir(String dirPath){
File dir = new File(dirPath);
File[] files = dir.listFiles();
if (files == null || files.length == 0) {
return null;
}
File lastModifiedFile = files[0];
for (int i = 1; i < files.length; i++) {
if (lastModifiedFile.lastModified() < files[i].lastModified()) {
lastModifiedFile = files[i];
}
}
return lastModifiedFile;
}