I am writing code to read a PDF file in Selenium using a Java PDF library (PDFBox).
My code is:
URL url = new URL(str);
InputStream is=url.openStream();
BufferedInputStream fileParse=new BufferedInputStream(is);
PDDocument document=null;
document=PDDocument.load(fileParse);
String pdfContent=new PDFTextStripper().getText(document);
But I get the following error at the line document=PDDocument.load(fileParse):
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1119)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2017)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:1988)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:269)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1143)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1040)
I need to verify the content of the PDF file.
I would appreciate any help.
Thanks
You can simply use the code below; it works:
//Loading an existing document
File file = new File("yourPdfFilepath");
PDDocument document = PDDocument.load(file);
//Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
//Retrieving text from PDF document
String pdfcontent = pdfStripper.getText(document);
I hope this helps.
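One possible cause of the "End-of-File, expected line" error in parseHeader is that the stream handed to PDFBox is empty or is not a PDF at all (for example, a login page returned because the URL needs the browser session). A minimal sketch for checking this before parsing, using only the standard library; the class and method names here are hypothetical and not part of PDFBox:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PdfHeaderCheck {

    // Every PDF file starts with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns true if the stream starts with "%PDF-".
     * The stream must support mark/reset (BufferedInputStream does),
     * so that it is rewound and can still be passed to a parser afterwards.
     */
    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Stream must support mark/reset");
        }
        in.mark(PDF_MAGIC.length);
        byte[] head = new byte[PDF_MAGIC.length];
        int read = in.readNBytes(head, 0, head.length); // Java 9+
        in.reset();
        return read == PDF_MAGIC.length && Arrays.equals(head, PDF_MAGIC);
    }
}
```

With the code from the question, this could be called as looksLikePdf(fileParse) right before PDDocument.load(fileParse). If it returns false, the problem is in what the URL returns rather than in the PDF parsing.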
I need to convert HTML to a .docx file. The HTML is filled with custom information, and the images and CSS change depending on what is requested.
I'm trying to use Apache POI for this, but I'm getting an error:
`
org.apache.poi.xwpf.converter.core.XWPFConverterException: java.lang.IllegalStateException: Expecting one Styles document part, but found 0
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.convert(XHTMLConverter.java:72)
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.doConvert(XHTMLConverter.java:58)
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.doConvert(XHTMLConverter.java:38)
at org.apache.poi.xwpf.converter.core.AbstractXWPFConverter.convert(AbstractXWPFConverter.java:45)
My code is this:
// Load the HTML file
//Doc file
String htmlFile = "pathToHtml/file.html";
//String htmlFile = parseHTMLTemplate(disputeLetterDetails, template, fileExtension);
//new File(htmlFile);
//File file = new FileReader(htmlFile);
Path path = Path.of(htmlFile);
OutputStream in = new FileOutputStream(htmlFile, true);
// Create a new XWPFDocument
XWPFDocument document = new XWPFDocument();
// Set up the XHTML options
XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(new File("./images/")));
options.setExtractor(new FileImageExtractor(new File("./images/")));
// Convert the HTML to XWPFDocument
XHTMLConverter.getInstance().convert(document, in, options);
// Save the document to a .docx file
FileOutputStream out = new FileOutputStream("pathToHtml/OUT_from_XHTML_TEST.docx");
document.write(out);
out.close();
`
I want to get a docx file from an HTML file with the same styles, but I keep getting the error shown above.
I have a PDF file whose content I am trying to read.
Code I tried:
String userDir = System.getProperty("user.home")+"\\Downloads";
//Loading an existing document
String s=userDir+"\\PDFStatement.pdf";
File file = new File(s);
//File file = new File(s);
System.out.println("file"+file);
PDDocument document = PDDocument.load(file);// this is where I am getting error
//Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
//Retrieving text from PDF document
String text = pdfStripper.getText(document);
System.out.println(text);
//Closing the document
document.close();
}
Error :
java.io.IOException: Error: Expected an integer type, actual='statement'
at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:531)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1187)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1154)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1125)
The same code works when I pass the file path with forward slashes (/):
File file = new File("C:/abc/abc/abc/PDFStatement.pdf");
I am using PDFBox 1.8.
Thanks.
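Since the forward-slash path works, one thing worth ruling out is the hand-built separator. A minimal sketch, assuming the file sits in the user's Downloads folder as in the question, that lets the JDK join the path segments; the class and method names are hypothetical. This only removes the separator as a variable and does not by itself explain the parse error:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class DownloadPath {

    /** Builds <user.home>/Downloads/<fileName> with the platform's own separator. */
    static Path inDownloads(String fileName) {
        return Paths.get(System.getProperty("user.home"), "Downloads", fileName);
    }
}
```

The result can be passed on as PDDocument.load(DownloadPath.inDownloads("PDFStatement.pdf").toFile()). Printing the path and comparing it with the hard-coded one shows whether both really point at the same file.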
I am trying to extract PDF text using Tabula. The code has no errors, but when I run it the extracted text is not displayed in the console. Could someone help?
I have been using PDFBox and, after some research, found that Tabula is newer and wanted to try it.
File file = new File(pdfFilePath);
PDDocument document = PDDocument.load(file);
ObjectExtractor oe = new ObjectExtractor(document);
Page page = oe.extract(1); // 1st page
TextStripper textStripper = new TextStripper(document,1);
System.out.println(textStripper.getText(document));
Expected: the text of the PDF printed to the console.
You are not using the page variable. Try the following code.
File file = new File(pdfFilePath);
PDDocument document = PDDocument.load(file);
ObjectExtractor oe = new ObjectExtractor(document);
Page page = oe.extract(1); // 1st page
for (TextElement textElement: page.getText()) {
System.out.print(textElement.getText());
}
My goal is to verify that a link to a PDF loads properly. I'm new to Selenium, Java, etc.
I tried both URL and File in the load call:
PDDocument doc = PDDocument.load(new File("https://xxxxxcx/iPledgeUI/rems/pdf/resources/iPledge_REMS_Non_Compliance_Action_Policy.pdf"));
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(doc);
doc.close();
The expected result is that the file gets loaded from the URL.
Here is what I get instead:
https:/xxxxxxxx/iPledgeUI/rems/pdf/resources/iPledge_REMS_Non_Compliance_Action_Policy.pdf (No such file or directory)
That isn't a file, it is a URL. Load it like this:
import java.net.URL;
import java.io.InputStream;
…
InputStream is = new URL("....").openStream(); // will throw here if URL doesn't work
PDDocument doc = PDDocument.load(is); // will throw here if PDF malformed or empty file
…
is.close();
If the URL doesn't exist you'll get an exception. (Exception handling code not included here)
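The exception handling left out above could be sketched as follows, using only the standard library: the stream is closed by try-with-resources, and the content is copied to a temporary file that can then be opened with PDDocument.load(File). The class and method names are hypothetical, and treating an empty response as an error is an assumption of this sketch:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class PdfDownload {

    /** Copies the content behind the URL to a temp file and returns its path. */
    static Path toTempFile(URL url) throws IOException {
        Path target = Files.createTempFile("download-", ".pdf");
        // openStream() throws if the URL cannot be reached;
        // try-with-resources closes the stream even when copying fails.
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        // Assumption: an empty body means the link is broken.
        if (Files.size(target) == 0) {
            Files.delete(target);
            throw new IOException("Empty response from " + url);
        }
        return target;
    }
}
```

A test could then call PDDocument.load(PdfDownload.toTempFile(new URL("....")).toFile()) and fail with a clear message if either step throws.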
Converting a .doc file to PDF
I am using the following code:
POIFSFileSystem fs = null;
Document Pdfdocument = new Document();
fs = new POIFSFileSystem(new FileInputStream(srcFile));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
PdfWriter writer = PdfWriter.getInstance(Pdfdocument, new FileOutputStream(targetFile));
Pdfdocument.open();
writer.setPageEmpty(true);
Pdfdocument.newPage();
writer.setPageEmpty(true);
String[] paragraphs = we.getParagraphText();
for (int i = 0; i < paragraphs.length; i++) {
Pdfdocument.add(new Paragraph(paragraphs[i]));
}
This generates a PDF without formatting or images; even the fonts are missing, since WordExtractor extracts only text.
Is there any other way to convert that preserves fonts and images?
I need conversion from .doc (HWPFDocument), not from .docx.
I have referred to these SO links:
Convert doc to pdf using Apache POI
https://stackoverflow.com/a/6210694/6032482
how to convert doc,docx files to pdf in java programatically
and many more, but they all use WordExtractor.
Note: I can't use LibreOffice or Aspose.
Can it be done using:
Apache POI
docx4j
iText