What is the easiest way to extract data from a PDF?

What is the easiest way to extract data from a PDF? - java

I need to extract data from some PDF documents (using Java). I need to know what would be the easiest way to do it.
I tried iText. It's fairly complicated for my needs. Besides I guess it is not available for free for commercial projects. So it is not an option. I also gave a try to PDFBox, and ran into various NoClassDefFoundError errors.
I googled and came across several other options such as PDF Clown, jPod, but I do not have time to experiment with all of these libraries. I am relying on community's experience with PDF reading thru Java.
Note that I do not need to create or manipulate PDF documents. I just need to exrtract textual data from PDF documents with moderate level layout complexity.
Please suggest the quickest and easiest way to extract text from PDF documents. Thanks.

I recommend trying Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs.
The benefits of Tika (besides being free), is that is used to be a subproject of Apache Lucene, which is a very robust open-source search engine. Tika includes a built-in PDF parser that uses a SAX Content Handler to pass PDF data to your application. It can also extract data from encrypted PDFs and it allows you to create or subclass an existing parser to customize the behavior.
The code is simple. To extract the data from a PDF, all you need to do is create a Parser class that implements the Parser interface and define a parse() method:
public void parse(
InputStream stream, ContentHandler handler,
Metadata metadata, ParseContext context)
throws IOException, SAXException, TikaException {
metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
metadata.set("Hello", "World");
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
xhtml.startDocument();
xhtml.endDocument();
}
Then, to run the parser, you could do something like this:
InputStream input = new FileInputStream(new File(resourceLocation));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
parser.parse(input, textHandler, metadata);
input.close();
out.println("Title: " + metadata.get("title"));
out.println("Author: " + metadata.get("Author"));
out.println("content: " + textHandler.toString());

I am using JPedal and I'm really happy with the results. It isn't free but it's high quality and the output for image generation from pdfs or text extraction is really nice.
And as a paid library, the support is always there to answer.

I have used PDFBox to extract text for Lucene indexing without too many issues. Its error/warning logging is quite verbose if I remember right - what was the cause for those errors you received?

I understand this post is pretty old but I would recommend using itext from here:
http://sourceforge.net/projects/itext/
If you are using maven you can pull the jars in from maven central:
http://mvnrepository.com/artifact/com.itextpdf/itextpdf
I can't understand how using it can be difficult:
PdfReader pdf = new PdfReader("path to your pdf file");
PdfTextExtractor parser = new PdfTextExtractor();
String output = parser.getTextFromPage(pdf, pageNumber);
assert output.contains("whatever you want to validate on that page");

Import this Classes and add Jar Files 1.- pdfbox-app- 2.0.
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.FindBy;
import org.testng.Assert;
import org.testng.annotations.Test;
import java.io.File;
import java.io.IOException;
import java.text.ParseException;
import java.util.List;
import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.chrome.ChromeDriver;
import com.coencorp.selenium.framework.BasePage;
import com.coencorp.selenium.framework.ExcelReadWrite;
import com.relevantcodes.extentreports.LogStatus;
Add this code inside the class.
public void showList() throws InterruptedException, IOException {
showInspectionsLink.click();
waitForElement(hideInspectionsLink);
printButton.click();
Thread.sleep(10000);
String downloadPath = "C:\\Users\\Updoer\\Downloads";
File getLatestFile = getLatestFilefromDir(downloadPath);
String fileName = getLatestFile.getName();
Assert.assertTrue(fileName.equals("Inspections.pdf"), "Downloaded file name is not
matching with expected file name");
Thread.sleep(10000);
//testVerifyPDFInURL();
PDDocument pd;
pd= PDDocument.load(new File("C:\\Users\\Updoer\\Downloads\\Inspections.pdf"));
System.out.println("Total Pages:"+ pd.getNumberOfPages());
PDFTextStripper pdf=new PDFTextStripper();
System.out.println(pdf.getText(pd));
Add this Method in same class.
public void testVerifyPDFInURL() {
WebDriver driver = new ChromeDriver();
driver.get("C:\\Users\\Updoer\\Downloads\\Inspections.pdf");
driver.findElement(By.linkText("Adeeb Khan")).click();
String getURL = driver.getCurrentUrl();
Assert.assertTrue(getURL.contains(".pdf"));
}
private File getLatestFilefromDir(String dirPath){
File dir = new File(dirPath);
File[] files = dir.listFiles();
if (files == null || files.length == 0) {
return null;
}
File lastModifiedFile = files[0];
for (int i = 1; i < files.length; i++) {
if (lastModifiedFile.lastModified() < files[i].lastModified()) {
lastModifiedFile = files[i];
}
}
return lastModifiedFile;
}

Related

Writing Arabic with PDFBOX with correct characters presentation form without being separated

I'm trying to generate a PDF that contains Arabic text using PDFBox Apache but the text is generated as separated characters because Apache parses given Arabic string to a sequence of general 'official' Unicode characters that is equivalent to the isolated form of Arabic characters.
Here is an example:
Target text to Write in PDF "Should be expected output in PDF File" -> جملة بالعربي
What I get in PDF File ->
I tried some methods but it's no use here are some of them:
1. Converting String to Stream of bits and trying to extract right values
2. Treating String a sequence of bytes with UTF-8 && UTF-16 and extracting values from them
There is some approach seems very promising to get the value "Unicode" of each character But it generate general "official Unicode" Here is what I mean
System.out.println( Integer.toHexString( (int)(new String("كلمة").charAt(1))) );
output is 644 but fee0 was the expected output because this character is in middle from then I should get the middle Unicode fee0
so what I want is some method that generates the correct Unicode not the just the official one
The very Left column in the first table in the following link represents the general Unicode
Arabic Unicode Tables Wikipedia

Notice:
The sample code in this answer might be outdated please refer to h q's answer for the working sample code
At First I will thank Tilman Hausherr and M.Prokhorov for showing me the library that made writing Arabic possible using PDFBox Apache.
This Answer will be divided into two Sections:
Downloading the library and installing it
How to use the library
Downloading the library and installing it
We are going to use ICU Library.
ICU stands for International Components for Unicode and it is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.
To download the Library go to the downloads page from here.
Choose the latest version of ICU4J as shown in the following image.
You will be transferred to another page and you will find a box with direct links of the needed components .Go ahead and download three Files you will find the highlighted in next image.
icu4j-docs.jar
icu4j-src.jar
icu4j.jar
The following explanation for creating and adding a library in Netbeans IDE
Navigate to the Toolbar and Click tools
Choose Libraries
At the bottom left you will find new Library button Create yours
Navigate to the library that you created in libraries list
Click it and add jar folders like that
Add icu4j.jar in class path
Add icu4j-src.jar in Sources
Add icu4j-docs.jar in Javadoc
View your opened projects from the very right
Expand the project that you want to use the library in
Right Click on the libraries folder and choose add library
Finally choose the library that you had just created.
Now you are ready to use the library just import what you want like that
import com.ibm.icu.What_You_Want_To_Import;
How to use the library
With ArabicShaping Class and reversing the String we can write a correct attached Arabic LINE
Here is the Code Notice the comments in the following code
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;
public class Main {
public static void main(String[] args) throws IOException , ArabicShapingException
{
File f = new File("Arabic Font File of format.ttf");
PDDocument doc = new PDDocument();
PDPage Page = new PDPage();
doc.addPage(Page);
PDPageContentStream Writer = new PDPageContentStream(doc, Page);
Writer.beginText();
Writer.setFont(PDType0Font.load(doc, f), 20);
Writer.newLineAtOffset(0, 700);
//The Trick in the next Line of Code But Here is some few Notes first
//We have to reverse the string because PDFBox is Writting from the left but Arabic is RTL Language
//The output will be perfect except every line will be justified to the left "It's not hard to resolve this"
// So we have to write arabic string to pdf line by line..It will be like this
String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
Writer.showText(new StringBuilder(new ArabicShaping(reverseNumbersInString(ArabicShaping.LETTERS_SHAPE).shape(s))).reverse().toString());
// Note the previous line of code throws ArabicShapingExcpetion
Writer.endText();
Writer.close();
doc.save(new File("File_Test.pdf"));
doc.close();
}
}
Here is the output
I hope that I had gone over everything.
Update : After reversing make sure to reverse the numbers again in order to get the same proper number
Here is a couple of functions that could help
public static boolean isInt(String Input)
{
try{Integer.parseInt(Input);return true;}
catch(NumberFormatException e){return false;}
}
public static String reverseNumbersInString(String Input)
{
char[] Separated = Input.toCharArray();int i = 0;
String Result = "",Hold = "";
for(;i<Separated.length;i++ )
{
if(isInt(Separated[i]+"") == true)
{
while(i < Separated.length && (isInt(Separated[i]+"") == true || Separated[i] == '.' || Separated[i] == '-'))
{
Hold += Separated[i];
i++;
}
Result+=reverse(Hold);
Hold="";
}
else{Result+=Separated[i];}
}
return Result;
}

Here is a code that works. Download a sample font, e.g. trado.ttf
Make sure the pdfbox-app and icu4j jar files are in your classpath.
import java.io.File;
import java.io.IOException;
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;
public class Main {
public static void main(String[] args) throws IOException , ArabicShapingException
{
File f = new File("trado.ttf");
PDDocument doc = new PDDocument();
PDPage Page = new PDPage();
doc.addPage(Page);
PDPageContentStream Writer = new PDPageContentStream(doc, Page);
Writer.beginText();
Writer.setFont(PDType0Font.load(doc, f), 20);
Writer.newLineAtOffset(0, 700);
String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
Writer.showText(bidiReorder(s));
Writer.endText();
Writer.close();
doc.save(new File("File_Test.pdf"));
doc.close();
}
private static String bidiReorder(String text)
{
try {
Bidi bidi = new Bidi((new ArabicShaping(ArabicShaping.LETTERS_SHAPE)).shape(text), 127);
bidi.setReorderingMode(0);
return bidi.writeReordered(2);
}
catch (ArabicShapingException ase3) {
return text;
}
}
}

Arabic Text in PdfBox [duplicate]

I'm trying to generate a PDF that contains Arabic text using PDFBox Apache but the text is generated as separated characters because Apache parses given Arabic string to a sequence of general 'official' Unicode characters that is equivalent to the isolated form of Arabic characters.
Here is an example:
Target text to Write in PDF "Should be expected output in PDF File" -> جملة بالعربي
What I get in PDF File ->
I tried some methods but it's no use here are some of them:
1. Converting String to Stream of bits and trying to extract right values
2. Treating String a sequence of bytes with UTF-8 && UTF-16 and extracting values from them
There is some approach seems very promising to get the value "Unicode" of each character But it generate general "official Unicode" Here is what I mean
System.out.println( Integer.toHexString( (int)(new String("كلمة").charAt(1))) );
output is 644 but fee0 was the expected output because this character is in middle from then I should get the middle Unicode fee0
so what I want is some method that generates the correct Unicode not the just the official one
The very Left column in the first table in the following link represents the general Unicode
Arabic Unicode Tables Wikipedia

Notice:
The sample code in this answer might be outdated please refer to h q's answer for the working sample code
At First I will thank Tilman Hausherr and M.Prokhorov for showing me the library that made writing Arabic possible using PDFBox Apache.
This Answer will be divided into two Sections:
Downloading the library and installing it
How to use the library
Downloading the library and installing it
We are going to use ICU Library.
ICU stands for International Components for Unicode and it is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.
To download the Library go to the downloads page from here.
Choose the latest version of ICU4J as shown in the following image.
You will be transferred to another page and you will find a box with direct links of the needed components .Go ahead and download three Files you will find the highlighted in next image.
icu4j-docs.jar
icu4j-src.jar
icu4j.jar
The following explanation for creating and adding a library in Netbeans IDE
Navigate to the Toolbar and Click tools
Choose Libraries
At the bottom left you will find new Library button Create yours
Navigate to the library that you created in libraries list
Click it and add jar folders like that
Add icu4j.jar in class path
Add icu4j-src.jar in Sources
Add icu4j-docs.jar in Javadoc
View your opened projects from the very right
Expand the project that you want to use the library in
Right Click on the libraries folder and choose add library
Finally choose the library that you had just created.
Now you are ready to use the library just import what you want like that
import com.ibm.icu.What_You_Want_To_Import;
How to use the library
With ArabicShaping Class and reversing the String we can write a correct attached Arabic LINE
Here is the Code Notice the comments in the following code
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;
public class Main {
public static void main(String[] args) throws IOException , ArabicShapingException
{
File f = new File("Arabic Font File of format.ttf");
PDDocument doc = new PDDocument();
PDPage Page = new PDPage();
doc.addPage(Page);
PDPageContentStream Writer = new PDPageContentStream(doc, Page);
Writer.beginText();
Writer.setFont(PDType0Font.load(doc, f), 20);
Writer.newLineAtOffset(0, 700);
//The Trick in the next Line of Code But Here is some few Notes first
//We have to reverse the string because PDFBox is Writting from the left but Arabic is RTL Language
//The output will be perfect except every line will be justified to the left "It's not hard to resolve this"
// So we have to write arabic string to pdf line by line..It will be like this
String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
Writer.showText(new StringBuilder(new ArabicShaping(reverseNumbersInString(ArabicShaping.LETTERS_SHAPE).shape(s))).reverse().toString());
// Note the previous line of code throws ArabicShapingExcpetion
Writer.endText();
Writer.close();
doc.save(new File("File_Test.pdf"));
doc.close();
}
}
Here is the output
I hope that I had gone over everything.
Update : After reversing make sure to reverse the numbers again in order to get the same proper number
Here is a couple of functions that could help
public static boolean isInt(String Input)
{
try{Integer.parseInt(Input);return true;}
catch(NumberFormatException e){return false;}
}
public static String reverseNumbersInString(String Input)
{
char[] Separated = Input.toCharArray();int i = 0;
String Result = "",Hold = "";
for(;i<Separated.length;i++ )
{
if(isInt(Separated[i]+"") == true)
{
while(i < Separated.length && (isInt(Separated[i]+"") == true || Separated[i] == '.' || Separated[i] == '-'))
{
Hold += Separated[i];
i++;
}
Result+=reverse(Hold);
Hold="";
}
else{Result+=Separated[i];}
}
return Result;
}

Here is a code that works. Download a sample font, e.g. trado.ttf
Make sure the pdfbox-app and icu4j jar files are in your classpath.
import java.io.File;
import java.io.IOException;
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;
public class Main {
public static void main(String[] args) throws IOException , ArabicShapingException
{
File f = new File("trado.ttf");
PDDocument doc = new PDDocument();
PDPage Page = new PDPage();
doc.addPage(Page);
PDPageContentStream Writer = new PDPageContentStream(doc, Page);
Writer.beginText();
Writer.setFont(PDType0Font.load(doc, f), 20);
Writer.newLineAtOffset(0, 700);
String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
Writer.showText(bidiReorder(s));
Writer.endText();
Writer.close();
doc.save(new File("File_Test.pdf"));
doc.close();
}
private static String bidiReorder(String text)
{
try {
Bidi bidi = new Bidi((new ArabicShaping(ArabicShaping.LETTERS_SHAPE)).shape(text), 127);
bidi.setReorderingMode(0);
return bidi.writeReordered(2);
}
catch (ArabicShapingException ase3) {
return text;
}
}
}

Category Tree extraction in Wikipedia using Java

Basically I intend to extract the entire category tree in Wikipedia under the root node "Economics" using Wikipedia API sandbox. I don't need the content of the articles, I just need few basic details like pageid, title, revision history (at some later stage of my work). As of now I can extract it level by level but what I want is a recursive/iterative function which does it.
Each category contains a categories and articles (like each root contains nodes and leaves).
I wrote one code to extract the first level into files. one file contains the articles, second folder contains the name of categories (daughters of the root which can be further sub-classified).
Then I went into level and extracted their categories and articles and sub-categories using similar code.
The code remains similar in each case but its the scalability. I need to reach the lowest leaves of all nodes. So i need a recursion which continuously checks till the end.
I labelled files which contains categories as 'c_', so I can provide the condition while extracting different levels.
Now for some reason it has entered into a deadlock and keeps adding same things again and again. I need a way out of the deadlock.
package wikiCrawl;
import java.awt.List;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;
import org.apache.commons.io.FileUtils;
import org.json.CDL;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
public class SubCrawl
{
public static void main(String[] args) throws IOException, InterruptedException, JSONException
{ File file = new File("C:/Users/User/Desktop/Root/Economics_2.txt");
crawlfile(file);
}
public static void crawlfile(File food) throws JSONException, IOException ,InterruptedException
{
ArrayList<String> cat_list =new ArrayList <String>();
Scanner scanner_cat = new Scanner(food);
scanner_cat.useDelimiter("\n");
while (scanner_cat.hasNext())
{
String scan_n = scanner_cat.next();
if(scan_n.indexOf(":")>-1)
cat_list.add(scan_n.substring(scan_n.indexOf(":")+1));
}
System.out.println(cat_list);
//get the categories in different languages
URL category_json;
for (int i_cat=0; i_cat<cat_list.size();i_cat++)
{
category_json = new URL("https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3A"+cat_list.get(i_cat).replaceAll(" ", "%20").trim()+"&cmlimit=500"); //.trim() removes trailing and following whitespaces
System.out.println(category_json);
HttpURLConnection urlConnection = (HttpURLConnection) category_json.openConnection(); //Opens the connection to the URL so clients can communicate with the resources.
BufferedReader reader = new BufferedReader (new InputStreamReader(category_json.openStream()));
String line;
String diff = "";
while ((line = reader.readLine()) != null)
{
System.out.println(line);
diff=diff+line;
}
urlConnection.disconnect();
reader.close();
JSONArray jsonarray_cat = new JSONArray (diff.substring(diff.indexOf("[{\"pageid\"")));
System.out.println(jsonarray_cat);
//Loop categories
for (int i_url = 0; i_url<jsonarray_cat.length();i_url++) //jSONarray is an array of json objects, we are looping through each object
{
//Get the URL _part (Categorie isn't correct)
int pageid=Integer.parseInt(jsonarray_cat.getJSONObject(i_url).getString("pageid")); //this can be written in a much better way
System.out.println(pageid);
String title=jsonarray_cat.getJSONObject(i_url).getString("title");
System.out.println(title);
File food_year= new File("C:/Users/User/Desktop/Root/"+cat_list.get(i_cat).replaceAll(" ", "_").trim()+".txt");
File food_year2= new File("C:/Users/User/Desktop/Root/c_"+cat_list.get(i_cat).replaceAll(" ", "_").trim()+".txt");
food_year.createNewFile();
food_year2.createNewFile();
BufferedWriter writer = new BufferedWriter (new OutputStreamWriter(new FileOutputStream(food_year, true)));
BufferedWriter writer2 = new BufferedWriter (new OutputStreamWriter(new FileOutputStream(food_year2, true)));
if (title.contains("Category:"))
{
writer2.write(pageid+";"+title);
writer2.newLine();
writer2.flush();
crawlfile(food_year2);
}
else
{
writer.write(pageid+";"+title);
writer.newLine();
writer.flush();
}
}
}
}
}

For starters this might be too big a demand on the wikimedia servers. There are over a million categories (1) and you need to read Wikipedia:Database download - Why not just retrieve data from wikipedia.org at runtime. You would need to throttle your uses to about 1 per second or risk getting blocked. This means it would take about 11 days to get the full tree.
It would be much better to use the standard dumps at https://dumps.wikimedia.org/enwiki/ these will be easier to read and process and you don't need to put a big load on the server.
Still better is to get a Wikimedia Labs account, which allow you to run queries on a replication of the database servers or scripts on the dumps without having to download some very big files.
To get just the economics categories then its easiest to go via https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Economics this has 1242 categories. You may find it easier to use the list of categories there and build the tree from there.
This will be better than a recursive approach. The problem with the wikipedia category system is that it is not really a tree, with plenty of loops. I would not be surprised if you keep following categories you will end up getting the most of wikipedia.

Is it possible to create PDF file of Images using Ghost4j?

I need to create a PDF file having multiple images using Ghost4j ?Is it really possible?I didn't find and any related documentation in their site...Any valuable suggestions are welcomed..

Ghostscript processes PostScript and PDF files as input, not image file formats. That said, PostScript is a programming language, and so it is possible to write an import facility in PostScript. As standard Ghostscript ships with code to import GIF, JPEG, BMP and PCX file formats (ghostpdl/gs/lib/view___.ps)
However, I have no idea what Ghost4j exposes (and besides, I'm not a Java programmer) so I can't tell you how to do this.

Not sure about Ghost4j, I did it using PDFBox ImageToPDF
The actual code can be found here, also you may want to adapt this according to your requirement.

Here is a working example of using Ghost4j for converting pdf to image:
import org.ghost4j.document.DocumentException;
import org.ghost4j.document.PDFDocument;
import org.ghost4j.renderer.RendererException;
import org.ghost4j.renderer.SimpleRenderer;
import java.awt.Image;
import java.awt.image.RenderedImage;
import java.io.File;
import java.util.List;
import javax.imageio.ImageIO;
import java.io.IOException;
public class PdfToIm_G4J {
public void convertPdfToIm( String pdfFilePath, String imExtension ) throws
IOException,DocumentException,RendererException
// load the pdf
document.load( new File( pdfFilePath ) );
// create renderer
SimpleRenderer renderer = new SimpleRenderer();
// set resolution (in DPI)
renderer.setResolution( dpi );
// render the images
List<Image> images = renderer.render( document );
// write the images to file
for (int iPage = 0; iPage < images.size(); iPage++) {
ImageIO.write( (RenderedImage) images.get( iPage ), imExtension,
new File( "" + iPage + "." + imExtension ) );
}
}
}

How do I generate RTF from Java?

I work on a web-based tool where we offer customized prints.
Currently we build an XML structure with Java, feed it to the XMLmind XSL-FO Converter along with customized XSL-FO, which then produces an RTF document.
This works fine on simple layouts, but there's some problem areas where I'd like greater control, or where I can't do what I want at all. F.ex: tables in header, footers (e.g., page numbers), columns, having a separate column setup or different page number info on the first page, etc.
Do any of you know of better alternatives, either to XMLmind or to the way we get from data to RTF, i.e., Java-> XML, XML+XSL-> RTF? (The only practical limitation for us is the JVM.)

You can take a look at a new library called jRTF. It allows you to create new RTF documents and to fill RTF templates.

Have you had a look at the iText library? It's touted primarily as a PDF generator, though it can also generate RTF. I haven't had cause to use it personally, but the general feeling I get is that it's good, and the interface looks comprehensive and easy to work to in the abstract. Whether it would fit in well with your existing data model is another question.

If you could afford spending some money, you could use Aspose.Words, a professional library for creating Word and RTF documents for Java and .NET.

iText supports RTF.

import com.lowagie.text.*;
import com.lowagie.text.html.simpleparser.HTMLWorker;
import com.lowagie.text.html.simpleparser.StyleSheet;
import com.lowagie.text.rtf.*;
import java.io.*;
import java.util.ArrayList;
public class HTMLtoRTF {
public static void main(String[] args) throws DocumentException {
Document document = new Document();
try {
Reader htmlreader = new BufferedReader((new InputStreamReader((new FileInputStream("C:\\Users\\asrikantan\\Desktop\\sample.htm")))));
RtfWriter2 rtfWriter = RtfWriter2.getInstance(document, new FileOutputStream(("C:\\Users\\asrikantan\\Desktop\\sample12.rtf")));
document.open();
document.add(new Paragraph("Testing simple paragraph addition."));
//ByteArrayOutputStream out = new ByteArrayOutputStream();
StyleSheet styles = new StyleSheet();
styles.loadTagStyle("body", "font", "Bitstream Vera Sans");
ArrayList htmlParser = HTMLWorker.parseToList(htmlreader, styles);
//fetch HTML line by line
for (int htmlDatacntr = 0; htmlDatacntr < htmlParser.size(); htmlDatacntr++) {
Element htmlDataElement = (Element) htmlParser.get(htmlDatacntr);
document.add((htmlDataElement));
}
htmlreader.close();
document.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (Exception e) {
System.out.println(e);
}
}
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

What is the easiest way to extract data from a PDF? - java

I am using JPedal and I'm really happy with the results. It isn't free but it's high quality and the output for image generation from pdfs or text extraction is really nice. And as a paid library, the support is always there to answer.

I have used PDFBox to extract text for Lucene indexing without too many issues. Its error/warning logging is quite verbose if I remember right - what was the cause for those errors you received?

Related

Writing Arabic with PDFBOX with correct characters presentation form without being separated

Arabic Text in PdfBox [duplicate]

Category Tree extraction in Wikipedia using Java

Is it possible to create PDF file of Images using Ghost4j?

How do I generate RTF from Java?

Categories

Resources