Arabic Text in PdfBox [duplicate]

I'm trying to generate a PDF that contains Arabic text using Apache PDFBox, but the letters come out disconnected. The Arabic string is written as a sequence of general ("official") Unicode code points, which render as the isolated forms of the Arabic letters.
Here is an example:
Expected output in the PDF file -> جملة بالعربي
What I get in the PDF file ->
I tried several approaches without success, including:
1. Converting the String to a stream of bits and trying to extract the right values
2. Treating the String as a sequence of UTF-8 and UTF-16 bytes and extracting values from them
One approach looked promising: reading the Unicode value of each character. However, it returns the general ("official") code point. Here is what I mean:
System.out.println( Integer.toHexString( (int)(new String("كلمة").charAt(1))) );
The output is 644, but I expected fee0: this character sits in the middle of the word, so I should get the medial-form code point U+FEE0.
So what I want is a method that produces the contextual (presentation-form) code point, not just the official one.
The leftmost column of the first table at the following link lists the general code points:
Arabic Unicode Tables Wikipedia
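To make the relationship between the two code points concrete, here is a small dependency-free sketch. The class and method names are made up for illustration. It relies only on JDK classes: Character.UnicodeBlock reports which block a code point belongs to, and java.text.Normalizer with NFKC folds a presentation form back to its general letter. It does not do any shaping itself; going the other way, from U+0644 to U+FEE0, requires a shaping step that looks at the neighbouring letters.

```java
import java.text.Normalizer;

final class ArabicFormsDemo {
    /** Name of the Unicode block that contains the given character. */
    static String blockOf(char c) {
        return String.valueOf(Character.UnicodeBlock.of(c));
    }

    /** Folds presentation forms back to general letters using NFKC normalization. */
    static String toGeneralForm(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The word from the question, written with escapes: KAF, LAM, MEEM, TEH MARBUTA
        String word = "\u0643\u0644\u0645\u0629";
        char general = word.charAt(1);   // LAM as stored in the String (U+0644)
        char medial = '\uFEE0';          // LAM in its medial presentation form

        System.out.println(Integer.toHexString(general) + " is in block " + blockOf(general));
        System.out.println(Integer.toHexString(medial) + " is in block " + blockOf(medial));
        System.out.println("NFKC folds the medial form back to the general letter: "
                + (toGeneralForm(String.valueOf(medial)).charAt(0) == general));
    }
}
```

In other words, charAt returns exactly what the String stores; the joined shapes are separate compatibility code points that a shaping step has to choose.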

Notice:
The sample code in this answer might be outdated; please refer to h q's answer for working sample code.
First, I would like to thank Tilman Hausherr and M.Prokhorov for pointing me to the library that made writing Arabic with Apache PDFBox possible.
This answer is divided into two sections:
1. Downloading the library and installing it
2. How to use the library
Downloading the library and installing it
We are going to use the ICU library.
ICU stands for International Components for Unicode and it is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.
To download the library, go to the ICU downloads page.
Choose the latest version of ICU4J.
You will be taken to another page with a box of direct links to the available components. Download these three files:
icu4j-docs.jar
icu4j-src.jar
icu4j.jar
The following steps create the library and add it to a project in the NetBeans IDE:
1. In the toolbar, click Tools.
2. Choose Libraries.
3. Click the New Library button at the bottom left and create your library.
4. Select the library you just created in the Libraries list and add the jar files as follows:
   Add icu4j.jar under Classpath
   Add icu4j-src.jar under Sources
   Add icu4j-docs.jar under Javadoc
5. Find your open projects in the panel on the far right.
6. Expand the project that should use the library.
7. Right-click the Libraries folder and choose Add Library.
8. Finally, choose the library that you just created.
Now you are ready to use the library. Import what you need like this:
import com.ibm.icu.What_You_Want_To_Import;
How to use the library
With the ArabicShaping class, plus reversing the string, we can write a correctly joined LINE of Arabic text.
Here is the code; note the comments in it:
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;
public class Main {
public static void main(String[] args) throws IOException , ArabicShapingException
{
File f = new File("Arabic Font File of format.ttf");
PDDocument doc = new PDDocument();
PDPage Page = new PDPage();
doc.addPage(Page);
PDPageContentStream Writer = new PDPageContentStream(doc, Page);
Writer.beginText();
Writer.setFont(PDType0Font.load(doc, f), 20);
Writer.newLineAtOffset(0, 700);
// The trick is in the showText call below, but first a few notes:
// We have to reverse the string because PDFBox writes from the left, while Arabic is an RTL language.
// The output will be correct except that every line is left-aligned (this is not hard to resolve).
// So the Arabic string has to be written to the PDF line by line, like this:
String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
// Shape the letters into their joined forms, reverse the line for left-to-right drawing,
// then restore the digit order with reverseNumbersInString (defined in the update at the end of this answer).
String shaped = new ArabicShaping(ArabicShaping.LETTERS_SHAPE).shape(s);
Writer.showText(reverseNumbersInString(new StringBuilder(shaped).reverse().toString()));
// Note: ArabicShaping.shape throws ArabicShapingException
Writer.endText();
Writer.close();
doc.save(new File("File_Test.pdf"));
doc.close();
}
}
Here is the output
I hope I have covered everything.
Update: After reversing the line, make sure to reverse the numbers again so that they read correctly.
Here are a couple of functions that could help:
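// Returns true if the input parses as an integer (used below on single characters to detect digits).
public static boolean isInt(String Input)
{
    try { Integer.parseInt(Input); return true; }
    catch (NumberFormatException e) { return false; }
}
// Reverses every run of digits (including '.' and '-') inside the string and leaves all other characters in place.
public static String reverseNumbersInString(String Input)
{
    char[] Separated = Input.toCharArray();
    StringBuilder Result = new StringBuilder();
    StringBuilder Hold = new StringBuilder();
    for (int i = 0; i < Separated.length; i++)
    {
        if (isInt(Separated[i] + ""))
        {
            // Collect the whole number, then append it in reversed order
            while (i < Separated.length && (isInt(Separated[i] + "") || Separated[i] == '.' || Separated[i] == '-'))
            {
                Hold.append(Separated[i]);
                i++;
            }
            Result.append(Hold.reverse());
            Hold.setLength(0);
            i--; // step back so the for loop's i++ does not skip the character after the number
        }
        else { Result.append(Separated[i]); }
    }
    return Result.toString();
}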
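The reverse-then-fix-numbers idea can also be tried without PDFBox or ICU. The following is a minimal sketch under that assumption; DigitAwareReverse and reverseKeepingNumbers are illustrative names, not part of any library. It reverses the whole line and then flips each digit run back. Note that a plain character reversal also moves combining marks (such as Arabic diacritics) to the wrong side of their base letter, so this is only suitable for simple text.

```java
final class DigitAwareReverse {
    /** Characters that may continue a number once a digit has started it. */
    private static boolean isNumberChar(char c) {
        return Character.isDigit(c) || c == '.' || c == '-';
    }

    /** Reverses the line, then restores the original order of each number. */
    static String reverseKeepingNumbers(String line) {
        String reversed = new StringBuilder(line).reverse().toString();
        StringBuilder out = new StringBuilder(reversed.length());
        int i = 0;
        while (i < reversed.length()) {
            if (Character.isDigit(reversed.charAt(i))) {
                int start = i;
                while (i < reversed.length() && isNumberChar(reversed.charAt(i))) {
                    i++;
                }
                // Flip the run back so that "321" reads "123" again
                out.append(new StringBuilder(reversed.substring(start, i)).reverse());
            } else {
                out.append(reversed.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseKeepingNumbers("abc 123 def"));
        System.out.println(reverseKeepingNumbers("x 3.14 y"));
    }
}
```

Working with an index-based while loop avoids the off-by-one that is easy to introduce when a for loop's counter is also advanced inside the body.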

Here is code that works. Download a sample font, e.g. trado.ttf
Make sure the pdfbox-app and icu4j jar files are in your classpath.
import java.io.File;
import java.io.IOException;
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;
public class Main {
public static void main(String[] args) throws IOException , ArabicShapingException
{
File f = new File("trado.ttf");
PDDocument doc = new PDDocument();
PDPage Page = new PDPage();
doc.addPage(Page);
PDPageContentStream Writer = new PDPageContentStream(doc, Page);
Writer.beginText();
Writer.setFont(PDType0Font.load(doc, f), 20);
Writer.newLineAtOffset(0, 700);
String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
Writer.showText(bidiReorder(s));
Writer.endText();
Writer.close();
doc.save(new File("File_Test.pdf"));
doc.close();
}
private static String bidiReorder(String text)
{
try {
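// Shape the letters first, then let the bidi algorithm produce the visual (left-to-right) order.
// The named constants have the same values as the literals 127, 0 and 2.
Bidi bidi = new Bidi((new ArabicShaping(ArabicShaping.LETTERS_SHAPE)).shape(text), Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT);
bidi.setReorderingMode(Bidi.REORDER_DEFAULT);
return bidi.writeReordered(Bidi.DO_MIRRORING);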
}
catch (ArabicShapingException ase3) {
return text;
}
}
}
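To see why a bidi pass handles more than a plain reversal, it can help to look at the directional runs of a line. The sketch below deliberately uses the JDK's own java.text.Bidi instead of ICU4J's class so that it runs without extra jars; it only inspects the runs and does not reorder or shape anything. BidiRunsDemo and describeRuns are illustrative names.

```java
import java.text.Bidi;

final class BidiRunsDemo {
    /** Lists the directional runs as "[start,limit)@level", separated by spaces. */
    static String describeRuns(String line) {
        Bidi bidi = new Bidi(line, Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT);
        StringBuilder sb = new StringBuilder();
        for (int run = 0; run < bidi.getRunCount(); run++) {
            if (run > 0) {
                sb.append(' ');
            }
            sb.append('[').append(bidi.getRunStart(run)).append(',')
              .append(bidi.getRunLimit(run)).append(")@").append(bidi.getRunLevel(run));
        }
        return sb.toString();
    }

    /** True if the line contains both right-to-left and left-to-right runs. */
    static boolean isMixed(String line) {
        return new Bidi(line, Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT).isMixed();
    }

    /** Embedding level of the character at the given index (odd = RTL, even = LTR). */
    static int levelAt(String line, int index) {
        return new Bidi(line, Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT).getLevelAt(index);
    }

    public static void main(String[] args) {
        // Four Arabic letters, a space, then the digits 123
        String line = "\u0639\u0631\u0628\u064A 123";
        System.out.println("mixed: " + isMixed(line));
        System.out.println("runs:  " + describeRuns(line));
    }
}
```

Odd levels mark right-to-left runs and even levels mark left-to-right runs, which is why the digits keep their order while the letters around them are reversed.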

Related

Extracting answers to a flattened PDF form with iText 7

We have a few forms created with Adobe LiveCycle. Users fill in the dynamic forms and submit the document to our office, where we stamp it with our signature and flatten it (at least most of the time; I've seen a few documents in our system that haven't been flattened yet, but that can be a separate question. I'll focus on the flattened documents here because that's most of what we have).
I'm trying to use iText 7 to parse/extract the user's answers to our forms for migrating to an electronic solution that will happen a few months from now. I was able to make the example work in Java but I don't understand the process.
/*
This file is part of the iText (R) project.
Copyright (c) 1998-2020 iText Group NV
Authors: iText Software.
For more information, please contact iText Software at this address:
sales#itextpdf.com
*/
/**
* Example written by Bruno Lowagie in answer to:
* http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
*/
package ca.umanitoba.ad.research;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.filter.TextRegionEventFilter;
import com.itextpdf.kernel.pdf.canvas.parser.listener.FilteredEventListener;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.FileOutputStream;
import java.io.Writer;
import java.io.BufferedWriter;
public class Main {
public static final String DEST = "./target/txt/parse_custom.txt";
public static final String SRC = "./src/main/resources/pdfs/nameddestinations.pdf";
public static void main(String[] args) throws IOException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new Main().manipulatePdf(DEST);
}
protected void manipulatePdf(String dest) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
Rectangle rect = new Rectangle(36, 750, 523, 56);
CustomFontFilter fontFilter = new CustomFontFilter(rect);
FilteredEventListener listener = new FilteredEventListener();
// Create a text extraction renderer
LocationTextExtractionStrategy extractionStrategy = listener
.attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
// Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.processPageContent(pdfDoc.getFirstPage());
// Get the resultant text after applying the custom filter
String actualText = extractionStrategy.getResultantText();
pdfDoc.close();
// See the resultant text in the console
System.out.println(actualText);
try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dest)))) {
writer.write(actualText);
}
}
/*
* The custom filter filters only the text of which the font name ends with Bold or Oblique.
*/
protected class CustomFontFilter extends TextRegionEventFilter {
public CustomFontFilter(Rectangle filterRect) {
super(filterRect);
}
@Override
public boolean accept(IEventData data, EventType type) {
if (type.equals(EventType.RENDER_TEXT)) {
TextRenderInfo renderInfo = (TextRenderInfo) data;
PdfFont font = renderInfo.getFont();
if (null != font) {
String fontName = font.getFontProgram().getFontNames().getFontName();
return fontName.endsWith("Bold") || fontName.endsWith("Oblique");
}
}
return false;
}
}
}
Why is there a need to specify a Rectangle? Our forms are dynamic so users can add more fields as needed and we also accept paragraphs on some of the questions so the length will always vary so it's unlikely that the coordinates of the texts will be the same.
How can I change the flow so that I can perhaps just search for the question and then get the text right after it (presumably the answer) - I don't really know what the best way to parse a PDF is. If there's no other way except providing a Rectangle, can I programmatically determine the coordinates/dimensions of the rectangles?
From the example it looks like it's filtering the text based on whether it's bolded or italicized which I probably don't need but it looks to be easy enough to fix by modifying/removing the accept() method.

Please take a look at what that example is for: In the JavaDoc comment you can read
/**
* Example written by Bruno Lowagie in answer to:
* http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
*/
and that stack overflow question starts with
I used the following code to get data in PDF from a particular location. I want to get bold text present in that location
When you wonder, therefore,
Why is there a need to specify a Rectangle?
the answer is: because the example is about finding bold text in a particular location.
You mention your forms were dynamic before flattening and fields don't have fixed positions. Thus, this filter probably is not optimal for your use case.
How can I change the flow so that I can perhaps just search for the question and then get the text right after it
In that case simply don't filter at all but use a plain LocationTextExtractionStrategy to extract text, search for the question text in the extracted text, and use the text thereafter up to the next question text.
Alternatively, if you still have the unflattened dynamic forms, you may consider extracting the xfa xml and extract the filled-in data from that xml.
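The search-for-the-question approach described above could be sketched as follows, assuming the page text has already been extracted into a String and the question labels are known in advance. AnswerSlicer and sliceAnswers are hypothetical names for illustration; nothing here is iText API, and real forms may need fuzzier matching than exact indexOf lookups.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AnswerSlicer {
    /**
     * For each known question label, returns the text that follows it
     * up to the next known label (or the end of the text).
     */
    static Map<String, String> sliceAnswers(String text, List<String> questions) {
        Map<String, String> answers = new LinkedHashMap<>();
        for (String question : questions) {
            int start = text.indexOf(question);
            if (start < 0) {
                continue; // label not present in this document
            }
            int answerStart = start + question.length();
            int answerEnd = text.length();
            // The answer ends where the nearest following label begins
            for (String other : questions) {
                int pos = text.indexOf(other, answerStart);
                if (pos >= 0 && pos < answerEnd) {
                    answerEnd = pos;
                }
            }
            answers.put(question, text.substring(answerStart, answerEnd).trim());
        }
        return answers;
    }

    public static void main(String[] args) {
        String extracted = "Name:\nJane Doe\nProject summary:\nLine one\nline two\n";
        List<String> labels = Arrays.asList("Name:", "Project summary:");
        sliceAnswers(extracted, labels)
                .forEach((q, a) -> System.out.println(q + " -> " + a));
    }
}
```

Because the slicing works on text order rather than coordinates, it is not affected by answers of varying length pushing later questions down the page.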

Get all available AcroFields in a PDF document with iText 7

I am trying to automate the modification of a PDF template based on some computed data (using Java).
I have no experience with PDF modification, and I have been trying to use iText 7 to do this.
I have been reading about how to add text to a PDF, and I even saw here how to fill AcroFields, if they exist, using a "key".
However, I didn't create the PDF template I am using (it is editable), so I don't know whether the fields that can be filled in manually are AcroFields or some other technology, and I don't know the key of each field, if they have one.
I saw this question, which explains how to get all the fields and their values, but when I try the code from its only answer I get:
main.java:[40,0] error: illegal start of type
main.java:[40,19] error: ')' expected
main.java:[40,30] error: <identifier> expected
3 errors
In this part:
for (String fldName : fldNames) {
System.out.println( fldName + ": " + fields.getField( fldName ) );
}
After trying for a while, I have found more information, but I still can't find a way to get these "keys", if that is even possible.
------- EDIT -------
I wrote this code to make a copy of my PDF template in which each field is filled with its own AcroFields key:
package novgenrs;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Set;
public class MakePDF {
public static void MakePDF(String[] args) throws IOException, DocumentException{
PdfReader reader = new PdfReader("template.pdf");
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("result.pdf"));
//AcroFields form = stamper.getAcroFields();
AcroFields fields = reader.getAcroFields();
AcroFields wrt = stamper.getAcroFields();
Set<String> fldNames = fields.getFields().keySet();
for (String fldName : fldNames) {
wrt.setField(fldName, fldName) ;
}
stamper.close();
reader.close();
}
}
NOTE: this only works with iText 5. For some reason I couldn't make it work with iText 7, so I tried iText 5 instead and it worked!

If you want a full answer to your question, you will have to provide the PDF so that we can inspect it, but these are already some answers that will get you in the right direction.
When you refer to How to fill out a pdf file programmatically? (AcroForm technology), you refer to the iText 7 version of How to fill out a pdf file programmatically? (AcroForm technology) which is the answer to the same question, but for developers who use iText 5. As you can see, there's a big difference between iText 5 and iText 7.
However, when you refer to How do I get all the fields and value from a PDF file using iText? you get an answer that is to be used with iText 5 only. If you are using iText 7, that code won't work because it's iText 5 code.
The code you need can be found here: How to get specific types from AcroFields? Like PushButtonField, RadioCheckField, etc
PdfReader reader = new PdfReader(src);
PdfDocument pdfDoc = new PdfDocument(reader);
// Get the fields from the reader (read-only!!!)
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
// Loop over the fields and get info about them
Set<String> fields = form.getFormFields().keySet();
for (String key : fields) {
writer.print(key + ": ");
PdfName type = form.getField(key).getFormType();
if (0 == PdfName.Btn.compareTo(type)) {
if(((PdfButtonFormField)form.getField(key)).isPushButton()){
writer.println("Pushbutton");
} else {
if(((PdfButtonFormField)form.getField(key)).isRadio()){
writer.println("Radiobutton");
}else {
writer.println("Checkbox");
}
}
} else if (0 == PdfName.Ch.compareTo(type)) {
writer.println("Choicebox");
} else if (0 == PdfName.Sig.compareTo(type)) {
writer.println("Signature");
} else if (0 == PdfName.Tx.compareTo(type)) {
writer.println("Text");
}else {
writer.println("?");
}
}
This code loops over all the fields and writes each key, together with the type of field that corresponds to that key, to writer (for example, a PrintWriter wrapping System.out). You can also use RUPS to inspect your PDF.
You mention:
main.java:[40,0] error: illegal start of type
main.java:[40,19] error: ')' expected
main.java:[40,30] error: <identifier> expected
It isn't clear to me if this is a compiler error or a run-time error.
If it's a compiler error, you are simply missing a ) somewhere, in which case your problem is totally unrelated to iText (which is what I suspect: you just have a very simple programming error).
If it's a run-time error, there is something wrong with your PDF. In PDF, a string is delimited by parentheses. Maybe a bracket is missing somewhere in your PDF (but I doubt that: you'd get a different type of error).
In short: try a good IDE, and that IDE will give you an indication of where a parenthesis is missing. If you don't find that location immediately, clean up your code by adding indentation and spaces. That should make it clear where you forgot a ).
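One common way to get "illegal start of type" on a for line is to place the loop directly in the class body instead of inside a method; whether that is what happened here cannot be determined from the excerpt in the question. The sketch below shows the loop in a position where it compiles. A plain Map stands in for the form fields, and FieldDump and describe are illustrative names rather than iText API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldDump {
    // A for statement is only legal inside a method, constructor or initializer block.
    // Writing it directly at class level is rejected by the compiler.
    static String describe(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String fldName : fields.keySet()) {
            sb.append(fldName).append(": ").append(fields.get(fldName)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "Jane");
        fields.put("city", "Oslo");
        System.out.print(describe(fields));
    }
}
```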

What is the easiest way to extract data from a PDF?

I need to extract data from some PDF documents (using Java). I need to know what would be the easiest way to do it.
I tried iText. It's fairly complicated for my needs. Besides, I believe it is not free for commercial projects, so it is not an option. I also gave PDFBox a try and ran into various NoClassDefFoundError errors.
I googled and came across several other options such as PDF Clown and jPod, but I do not have time to experiment with all of these libraries. I am relying on the community's experience with reading PDFs in Java.
Note that I do not need to create or manipulate PDF documents. I just need to extract textual data from PDF documents with a moderate level of layout complexity.
Please suggest the quickest and easiest way to extract text from PDF documents. Thanks.

I recommend trying Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs.
The benefit of Tika (besides being free) is that it used to be a subproject of Apache Lucene, which is a very robust open-source search engine. Tika includes a built-in PDF parser that uses a SAX Content Handler to pass PDF data to your application. It can also extract data from encrypted PDFs and it allows you to create or subclass an existing parser to customize the behavior.
The code is simple. To extract the data from a PDF, all you need to do is create a Parser class that implements the Parser interface and define a parse() method:
public void parse(
InputStream stream, ContentHandler handler,
Metadata metadata, ParseContext context)
throws IOException, SAXException, TikaException {
metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
metadata.set("Hello", "World");
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
xhtml.startDocument();
xhtml.endDocument();
}
Then, to run the parser, you could do something like this:
InputStream input = new FileInputStream(new File(resourceLocation));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
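// Parser.parse takes a ParseContext as its fourth argument (org.apache.tika.parser.ParseContext)
parser.parse(input, textHandler, metadata, new ParseContext());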
input.close();
out.println("Title: " + metadata.get("title"));
out.println("Author: " + metadata.get("Author"));
out.println("content: " + textHandler.toString());

I am using JPedal and I'm really happy with the results. It isn't free but it's high quality and the output for image generation from pdfs or text extraction is really nice.
And as a paid library, the support is always there to answer.

I have used PDFBox to extract text for Lucene indexing without too many issues. Its error/warning logging is quite verbose if I remember right - what was the cause for those errors you received?

I understand this post is pretty old but I would recommend using itext from here:
http://sourceforge.net/projects/itext/
If you are using maven you can pull the jars in from maven central:
http://mvnrepository.com/artifact/com.itextpdf/itextpdf
I can't understand how using it can be difficult:
PdfReader pdf = new PdfReader("path to your pdf file");
// In iText 5, PdfTextExtractor (com.itextpdf.text.pdf.parser) exposes getTextFromPage as a static method
String output = PdfTextExtractor.getTextFromPage(pdf, pageNumber);
assert output.contains("whatever you want to validate on that page");

Import these classes and add the jar file to your classpath: pdfbox-app-2.0.
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.FindBy;
import org.testng.Assert;
import org.testng.annotations.Test;
import java.io.File;
import java.io.IOException;
import java.text.ParseException;
import java.util.List;
import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.chrome.ChromeDriver;
import com.coencorp.selenium.framework.BasePage;
import com.coencorp.selenium.framework.ExcelReadWrite;
import com.relevantcodes.extentreports.LogStatus;
Add this code inside the class.
public void showList() throws InterruptedException, IOException {
showInspectionsLink.click();
waitForElement(hideInspectionsLink);
printButton.click();
Thread.sleep(10000);
String downloadPath = "C:\\Users\\Updoer\\Downloads";
File getLatestFile = getLatestFilefromDir(downloadPath);
String fileName = getLatestFile.getName();
Assert.assertTrue(fileName.equals("Inspections.pdf"), "Downloaded file name is not matching with expected file name");
Thread.sleep(10000);
//testVerifyPDFInURL();
PDDocument pd;
pd= PDDocument.load(new File("C:\\Users\\Updoer\\Downloads\\Inspections.pdf"));
System.out.println("Total Pages:"+ pd.getNumberOfPages());
PDFTextStripper pdf=new PDFTextStripper();
System.out.println(pdf.getText(pd));
pd.close();
}
Add this Method in same class.
public void testVerifyPDFInURL() {
WebDriver driver = new ChromeDriver();
driver.get("C:\\Users\\Updoer\\Downloads\\Inspections.pdf");
driver.findElement(By.linkText("Adeeb Khan")).click();
String getURL = driver.getCurrentUrl();
Assert.assertTrue(getURL.contains(".pdf"));
}
private File getLatestFilefromDir(String dirPath){
File dir = new File(dirPath);
File[] files = dir.listFiles();
if (files == null || files.length == 0) {
return null;
}
File lastModifiedFile = files[0];
for (int i = 1; i < files.length; i++) {
if (lastModifiedFile.lastModified() < files[i].lastModified()) {
lastModifiedFile = files[i];
}
}
return lastModifiedFile;
}

How do I generate RTF from Java?

I work on a web-based tool where we offer customized prints.
Currently we build an XML structure with Java, feed it to the XMLmind XSL-FO Converter along with customized XSL-FO, which then produces an RTF document.
This works fine for simple layouts, but there are some problem areas where I'd like greater control, or where I can't do what I want at all. For example: tables in headers, footers (e.g., page numbers), columns, having a separate column setup or different page number info on the first page, etc.
Do any of you know of better alternatives, either to XMLmind or to the way we get from data to RTF, i.e., Java-> XML, XML+XSL-> RTF? (The only practical limitation for us is the JVM.)

You can take a look at a new library called jRTF. It allows you to create new RTF documents and to fill RTF templates.

Have you had a look at the iText library? It's touted primarily as a PDF generator, though it can also generate RTF. I haven't had cause to use it personally, but the general feeling I get is that it's good, and the interface looks comprehensive and easy to work to in the abstract. Whether it would fit in well with your existing data model is another question.

If you could afford spending some money, you could use Aspose.Words, a professional library for creating Word and RTF documents for Java and .NET.

iText supports RTF.

import com.lowagie.text.*;
import com.lowagie.text.html.simpleparser.HTMLWorker;
import com.lowagie.text.html.simpleparser.StyleSheet;
import com.lowagie.text.rtf.*;
import java.io.*;
import java.util.ArrayList;
public class HTMLtoRTF {
public static void main(String[] args) throws DocumentException {
Document document = new Document();
try {
Reader htmlreader = new BufferedReader((new InputStreamReader((new FileInputStream("C:\\Users\\asrikantan\\Desktop\\sample.htm")))));
RtfWriter2 rtfWriter = RtfWriter2.getInstance(document, new FileOutputStream(("C:\\Users\\asrikantan\\Desktop\\sample12.rtf")));
document.open();
document.add(new Paragraph("Testing simple paragraph addition."));
//ByteArrayOutputStream out = new ByteArrayOutputStream();
StyleSheet styles = new StyleSheet();
styles.loadTagStyle("body", "font", "Bitstream Vera Sans");
ArrayList htmlParser = HTMLWorker.parseToList(htmlreader, styles);
//fetch HTML line by line
for (int htmlDatacntr = 0; htmlDatacntr < htmlParser.size(); htmlDatacntr++) {
Element htmlDataElement = (Element) htmlParser.get(htmlDatacntr);
document.add((htmlDataElement));
}
htmlreader.close();
document.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (Exception e) {
System.out.println(e);
}
}
}
