Font problems in parsing PDF to text using PDFBox,FontBox etc

I am using the PDFBox API to extract text from a PDF.
My program runs and does extract text, but the text in the PDF is set in CDAC-GISTSurekh (a Hindi font), while the output of my program appears in Mangal.
The output does not even match the text in the PDF.
I downloaded the same font, CDAC-GISTSurekh, and installed it on my computer, but the output is still shown in Mangal.
Is there any way to change the font of the output while parsing?
I would appreciate any help.
The code I have written:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
public class PDFTextParser {
static String pdftoText(String fileName) {
PDFParser parser;
String parsedText = null;
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(fileName);
if (!file.isFile()) {
System.out.println("File " + fileName + " does not exist.");
return null;
}
try {
parser = new PDFParser(new FileInputStream(file));
} catch (IOException e) {
System.out.println("Unable to open PDF Parser. " + e.getMessage());
return null;
}
try {
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
parsedText = pdfStripper.getText(pdDoc);
} catch (Exception e) {
e.printStackTrace();
System.out.println("An exception occured in parsing the PDF Document."+ e.getMessage());
} finally {
try {
if (cosDoc != null)
cosDoc.close();
if (pdDoc != null)
pdDoc.close();
} catch (Exception e) {
e.printStackTrace();
}
}
return parsedText;
}
public static void main(String args[]){
System.out.println(pdftoText("J:\\Users\\Shantanu\\Documents\\NetBeansProjects\\Pdf\\src\\PDfman\\A0410001.pdf"));
}
}

When you create a new PDFTextStripper object, use the syntax below to specify an encoding for it:
PDFTextStripper pdfStripper = new PDFTextStripper("ISO-XXXX");
Here "ISO-XXXX" is a placeholder for the name of the character encoding used in the PDF.
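A related point: the String returned by getText() is plain Unicode text and carries no font, so the font you see is chosen by whatever displays the output. What can be controlled on the Java side is the charset used when the text is written out. The following is a minimal sketch using only the standard library; the class name ExtractedTextWriter and the sample string are made up for illustration, and the string stands in for the text returned by the stripper.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExtractedTextWriter {
    // Writes already-extracted text to a file as UTF-8, so the result does not
    // depend on the platform default charset.
    public static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the file back, decoding it as UTF-8.
    public static String readUtf8(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: placeholder for the String returned by pdfStripper.getText(pdDoc).
        String parsedText = "\u0939\u093F\u0928\u094D\u0926\u0940";
        Path out = Files.createTempFile("extracted", ".txt");
        writeUtf8(out, parsedText);
        // Checks that the Devanagari characters survive the round trip.
        System.out.println(readUtf8(out).equals(parsedText));
    }
}
```

Opening the resulting file in an editor that has a Devanagari font installed should then show the text in whichever font the editor is configured to use.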

Related

JSF Primefaces p:fileDownload file name contains UTF-8 characters

I am working on Java 8, JSF 2, Primefaces 5.1.
Conversion to PDF or DOCX works, but when the file name is displayed it simply drops the UTF-8 encoded letters, in my case Lithuanian letters such as ą, č, ę, ė, į, š, ų, ū.
What I have tried so far:
<h:form enctype="multipart/form-data;charset=UTF-8">
Charset.forName("UTF-8").encode(myString)
or
byte[] bytes = templateTitle.getBytes(Charset.forName("UTF-8"));
String title = new String(bytes, Charset.forName("UTF-8"));
or
UTF-8 text is garbled when form is posted as multipart/form-data
I also checked some tutorials about encoding, to no avail.
I checked this question as well, but I do not understand its example:
Primefaces fileDownload non-english file names corrupt
My code:
Code for the .docx file download:
public void downloadTemplateAsDocx() throws Exception {
try {
InputStream content = null;
String objID = this.actData.getMainActs().get(0).getId();
ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
content = cmisStream.getStream();
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
afiPart.setBinaryData(content);
afiPart.setContentType(new ContentType("text/html"));
Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
ac.setId(altChunkRel.getId());
wordMLPackage.getMainDocumentPart().addObject(ac);
wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");
File fileTmp = File.createTempFile("tempDocFile", "docx");
wordMLPackage.save(fileTmp);
streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
templateTitle + ".docx", "UTF-8");
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (InvalidFormatException eInv) {
eInv.printStackTrace();
} catch (IOException ioEx) {
ioEx.printStackTrace();
} catch (Docx4JException docxEx) {
docxEx.printStackTrace();
}
}
Code for the .pdf file download:
public void downloadTemplateAsPdf() {
try {
InputStream content = null;
String objID = this.actData.getMainActs().get(0).getId();
ContentStream cmisStream = folderCatalogue.getDocumentContentStream(objID);
content = cmisStream.getStream();
File fileTmp = File.createTempFile("tempFile", "pdf");
OutputStream fileStream = new FileOutputStream(fileTmp);
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, fileStream);
document.open();
XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
worker.parseXHtml(writer, document, content, Charset.forName("UTF-8"));
document.close();
fileStream.close();
streamedContent = new DefaultStreamedContent(new FileInputStream(fileTmp), cmisStream.getMimeType(),
templateTitle + ".pdf");
} catch (FileNotFoundException e) {
e.printStackTrace();
System.out.println("File was not found");
} catch (IOException ex) {
ex.printStackTrace();
} catch (Exception exeption) {
exeption.printStackTrace();
}
}
EDIT:
<p:fileDownload value="#{controller.streamedContent}" />
private StreamedContent streamedContent;

Solution:
String title = URLEncoder.encode(templateTitle, "UTF-8");
StringBuilder fileName = new StringBuilder(title);
if (title.contains("+")) {
for (int i = 0; i < title.length(); i++) {
if (title.charAt(i) == '+') {
fileName.setCharAt(i, ' ');
}
}
}
This encoding works fine, except that it replaces every space with +, which is why I loop over the result and turn each + back into a space.
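The same approach can be written more compactly with String.replace. This is only a sketch of the idea described above; the class name FileNameEncoder and the sample title are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FileNameEncoder {
    // URL-encodes the title as UTF-8, then turns the '+' that URLEncoder emits
    // for each space back into a space. A literal '+' in the title is encoded
    // as %2B, so it is not affected by the replacement.
    public static String encodeFileName(String title) throws UnsupportedEncodingException {
        return URLEncoder.encode(title, "UTF-8").replace('+', ' ');
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "\u0105" is the Lithuanian letter a-ogonek; the title is a made-up example.
        System.out.println(encodeFileName("\u0105 b+c")); // prints "%C4%85 b%2Bc"
    }
}
```

The returned value can then be used in place of templateTitle when building the file name passed to DefaultStreamedContent.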

Create a new Pdf from an existing Pdf and (HTML + CSS)

My use case is that I generate a PDF on the fly, and I also have an existing single-page PDF. I want to concatenate the newly generated PDF before or after the existing page.
I was already able to generate a PDF from HTML, which may result in 2-3 pages (see "Pdf from HTML with CSS").
I looked at the examples, one of which concatenates existing PDFs page by page (see "Working with existing PDFs - Concatenate").

This page shows exactly what you are asking for: prepending and appending a static PDF to composed HTML and CSS content.
http://cloudformatter.com/CSS2Pdf.CustomTipsTricks.InjectPDF
Usage instructions are here:
http://cloudformatter.com/CSS2Pdf.APIDoc.Usage

Try this example:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
public class UtilPDF {
public static void main(String[] args) {
try {
List<InputStream> pdfs = new ArrayList<InputStream>();
File pdfDir = new File("C:\\PDF");
boolean pdfDirectoryExists = true;
if (!pdfDir.exists()) {
pdfDirectoryExists = pdfDir.mkdir();
}
if (pdfDirectoryExists) {
pdfs.add(new FileInputStream("C:\\PDF\\Document1.pdf"));
pdfs.add(new FileInputStream("C:\\PDF\\Document2.pdf"));
OutputStream output = new FileOutputStream("C:\\Projects\\FinalDocument_1_2.pdf");
UtilPDF.concatPDFs(pdfs, output, true);
}
} catch (Exception e) {
e.printStackTrace();
}
}
public static void concatPDFs(List<InputStream> streamOfPDFFiles, OutputStream outputStream, boolean paginate) {
Document document = new Document();
try {
List<InputStream> pdfs = streamOfPDFFiles;
List<PdfReader> readers = new ArrayList<PdfReader>();
int totalPages = 0;
Iterator<InputStream> iteratorPDFs = pdfs.iterator();
// Create Readers for the pdfs.
while (iteratorPDFs.hasNext()) {
InputStream pdf = iteratorPDFs.next();
PdfReader pdfReader = new PdfReader(pdf);
readers.add(pdfReader);
totalPages += pdfReader.getNumberOfPages();
}
// Create a writer for the outputstream
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
BaseFont bf = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
PdfContentByte cb = writer.getDirectContent(); // Holds the PDF
// data
PdfImportedPage page;
int currentPageNumber = 0;
int pageOfCurrentReaderPDF = 0;
Iterator<PdfReader> iteratorPDFReader = readers.iterator();
// Loop through the PDF files and add to the output.
while (iteratorPDFReader.hasNext()) {
PdfReader pdfReader = iteratorPDFReader.next();
// Create a new page in the target for each source page.
while (pageOfCurrentReaderPDF < pdfReader.getNumberOfPages()) {
document.newPage();
pageOfCurrentReaderPDF++;
currentPageNumber++;
page = writer.getImportedPage(pdfReader, pageOfCurrentReaderPDF);
cb.addTemplate(page, 0, 0);
// Code for pagination.
if (paginate) {
cb.beginText();
cb.setFontAndSize(bf, 9);
cb.showTextAligned(PdfContentByte.ALIGN_CENTER, "" + currentPageNumber + " of " + totalPages,
520, 5, 0);
cb.endText();
}
}
pageOfCurrentReaderPDF = 0;
}
outputStream.flush();
document.close();
outputStream.close();
} catch (Exception e) {
e.printStackTrace();
} finally {
if (document.isOpen())
document.close();
try {
if (outputStream != null)
outputStream.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
}
}

How to read .doc file line by line in java using all necessary jar file?

I want to display the differences between two .doc files line by line. I have done this with .txt files and it works perfectly, using the following code:
FileReader File1Reader = new FileReader(File1.getPath());
FileReader File2Reader = new FileReader(File2.getPath());
// Create Buffered Object.
BufferedReader File1BufRdr = new BufferedReader(File1Reader);
BufferedReader File2BufRdr = new BufferedReader(File2Reader);
// Get the file contents into String Variables.
String File1Content = File1BufRdr.readLine();
String File2Content = File2BufRdr.readLine();
//New String Builder
StringBuilder buffer = new StringBuilder();
Is there any way to read the .doc files line by line?
I am using the following code to read from a .doc file, but it does not read line by line:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class read_From_Doc_Docx {
public static void main(String[] args) {
//Alternate between the two to check what works.
//String FilePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
String FilePath = "/Users/esna786/Removal of Redundancy.docx";
FileInputStream fis;
if (FilePath.substring(FilePath.length() - 1).equals("x")) { //is a docx
try {
fis = new FileInputStream(new File(FilePath).getAbsolutePath());
XWPFDocument doc = new XWPFDocument(fis);
XWPFWordExtractor extract = new XWPFWordExtractor(doc);
System.out.println(extract.getText());
} catch (IOException e) {
e.printStackTrace();
}
} else { //is not a docx
try {
fis = new FileInputStream(new File(FilePath));
HWPFDocument doc = new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(doc);
System.out.println(extractor.getText());
} catch (IOException e) {
e.printStackTrace();
}
}
}
}

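Just use WordExtractor's getParagraphText() method instead of getText(); it returns the text as an array with one entry per paragraph. For .docx files, XWPFDocument.getParagraphs() gives the paragraphs, and each one has its own getText().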
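Once each document is available as an array of paragraphs (getParagraphText() on the HWPF WordExtractor returns a String[]), the comparison itself needs only plain Java. The sketch below is an assumption about what "difference line by line" should mean here: it compares paragraphs at the same position and is not a full diff algorithm. The class name ParagraphDiff and the sample arrays are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphDiff {
    // Compares two paragraph arrays position by position and reports each index
    // where they differ. A paragraph missing from the shorter array is shown as "<none>".
    public static List<String> diff(String[] first, String[] second) {
        List<String> differences = new ArrayList<>();
        int max = Math.max(first.length, second.length);
        for (int i = 0; i < max; i++) {
            // trim() drops the line terminator that extracted paragraphs usually end with.
            String a = i < first.length ? first[i].trim() : "<none>";
            String b = i < second.length ? second[i].trim() : "<none>";
            if (!a.equals(b)) {
                differences.add("Paragraph " + (i + 1) + ": \"" + a + "\" vs \"" + b + "\"");
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Stand-ins for the arrays returned by extractor.getParagraphText().
        String[] doc1 = {"Title\r\n", "Same line\r\n", "Only in first\r\n"};
        String[] doc2 = {"Title\r\n", "Same line\r\n"};
        for (String line : diff(doc1, doc2)) {
            System.out.println(line);
        }
    }
}
```

Because the comparison is positional, one inserted paragraph makes every later paragraph show up as different; a longest-common-subsequence diff would be needed to avoid that.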

pdfbox class cast exception

I want to read the text from the following PDF file. I am using PDFBox version 1.8.8 and I get the following error:
2014-12-18 15:02:59 WARN XrefTrailerResolver:203 - Did not found XRef object at specified startxref position 4268142
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
at org.apache.pdfbox.pdmodel.common.COSStreamArray.<init>(COSStreamArray.java:68)
at org.apache.pdfbox.pdmodel.common.PDStream.createFromCOS(PDStream.java:185)
at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:639)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:380)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:275)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:288)
at com.algotree.pdf.test.PdfBoxTest.pdftoText(PdfBoxTest.java:53)
at com.algotree.pdf.test.PdfBoxTest.main(PdfBoxTest.java:71)
Yes, I have seen many posts about this error, but I still could not find a way to read this file.
Thanks
file.pdf
This is my code:
static String pdftoText(String fileName) throws IOException {
PDFParser parser;
String parsedText = null;;
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(fileName);
if (!file.isFile()) {
System.err.println("File " + fileName + " does not exist.");
return null;
}
try {
parser = new PDFParser(new FileInputStream(file));
} catch (IOException e) {
System.err.println("Unable to open PDF Parser. " + e.getMessage());
return null;
}
try {
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdfStripper.setSuppressDuplicateOverlappingText(false);
pdDoc = new PDDocument(cosDoc);
int endPage=pdDoc.getPageCount();
if(endPage>300)
endPage=300;
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(endPage);
parsedText = pdfStripper.getText(cosDoc);
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
if (cosDoc != null)
cosDoc.close();
if (pdDoc != null)
pdDoc.close();
} catch (Exception e) {
e.printStackTrace();
}
}
return parsedText;
}

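This one works. It loads the document with PDDocument.loadNonSeq(), which uses the non-sequential parser, instead of creating a PDFParser directly: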
static String pdftoText(String fileName) throws IOException {
String parsedText = null;;
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = null;
File file = new File(fileName);
if (!file.isFile()) {
System.err.println("File " + fileName + " does not exist.");
return null;
}
try {
pdDoc=PDDocument.loadNonSeq(file, null);
} catch (IOException e) {
System.err.println("Unable to open PDF Parser. " + e.getMessage());
return null;
}
try {
pdfStripper = new PDFTextStripper();
int endPage=pdDoc.getPageCount();
if(endPage>300)
endPage=300;
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(endPage);
parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
if (pdDoc != null)
pdDoc.close();
} catch (Exception e) {
e.printStackTrace();
}
}
return parsedText;
}

How to read raw text from pdf file using java [duplicate]

I need to extract text (word by word) from a pdf file.
import java.io.*;
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;
import com.itextpdf.text.pdf.parser.*;
public class pdf {
private static String INPUTFILE = "http://ontology.buffalo.edu/ontology%28PIC%29.pdf" ;
private static String OUTPUTFILE = "c:/new3.pdf";
public static void main(String[] args) throws DocumentException,
IOException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
PdfImportedPage page;
// Go through all pages
for (int i = 1; i <= n; i++) {
page = writer.getImportedPage(reader, i);
System.out.println(i);
Image instance = Image.getInstance(page);
document.add(instance);
}
document.close();
PdfReader readerN = new PdfReader(OUTPUTFILE);
PdfTextExtractor parse = new PdfTextExtractor();
for (int i = 1; i <= n; i++)
System.out.println(parser.getTextFromPage(reader,i));
}
When I compile the code, I get this error:
the constructor PdfTextExtractor is undefined
How do I fix this?

PdfTextExtractor (iText) contains only static methods, and its constructor is private.
You can call it like so:
String myLine = PdfTextExtractor.getTextFromPage(reader, pageNumber);
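Since the question asks for the text word by word, the String returned for each page still has to be split. A minimal sketch using only the standard library follows; the class name WordSplitter and the sample text are made up for illustration, and splitting on whitespace is an assumption about what counts as a word.

```java
import java.util.ArrayList;
import java.util.List;

public class WordSplitter {
    // Splits page text on runs of whitespace (spaces, tabs, line breaks)
    // and returns the non-empty tokens in order.
    public static List<String> words(String pageText) {
        List<String> result = new ArrayList<>();
        for (String token : pageText.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by getTextFromPage(reader, pageNumber).
        String pageText = "Ontology is the\nstudy  of being.";
        System.out.println(words(pageText)); // prints [Ontology, is, the, study, of, being.]
    }
}
```

Punctuation stays attached to the neighbouring word with this approach; stripping it would need an extra step.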

If you want to get all the text from the PDF file and save it to a text file, you can use the code below.
It uses the pdfutil.jar library.
import java.io.IOException;
import java.io.PrintWriter;
import com.testautomationguru.utility.PDFUtil;
public class PDFToText{
public static void main(String[] args) {
try {
String pdfFilePath = "C:\\abc.pdf";
PDFUtil pdfUtil = new PDFUtil();
String content = pdfUtil.getText(pdfFilePath);
PrintWriter out = new PrintWriter("C:\\abc.txt");
out.println(content);
out.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}

// Try Apache PDFBox
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
// Your PDF file
String filePath = "";
// Holds the extracted text, or stays null if extraction fails.
String statementPDF = null;
// Collects the exception details if something goes wrong.
String errorMessage = "";
InputStream inputStream = null;
try
{
inputStream = new FileInputStream(filePath);
PDFParser parser = new PDFParser(inputStream);
// This will parse the stream and populate the COSDocument object.
parser.parse();
// Get the document that was parsed.
COSDocument cosDoc = parser.getDocument();
// This class will take a pdf document and strip out all of the text and
// ignore the formatting and such.
PDFTextStripper pdfStripper = new PDFTextStripper();
// This is the in-memory representation of the PDF document
PDDocument pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(pdDoc.getNumberOfPages());
// This will return the text of a document.
statementPDF = pdfStripper.getText(pdDoc);
}
catch(Exception e)
{
errorMessage += "\nUnexpected Exception: " + e.getClass() + "\n" + e.getMessage();
for (StackTraceElement trace : e.getStackTrace())
{
errorMessage += "\n\t" + trace;
}
}
finally
{
if (inputStream != null)
{
inputStream.close();
}
}
