Apache Commons IO download only first PDF page

Apache Commons IO download only first PDF page - java

I'm using Java with Apache Commons-IO to download a PDF but I only want to get the first page, is there a way I can do it?
Here's the piece of code that gets the whole doc:
public void getPDF(String route) throws IOException {
URL url = new URL(route);
File file = new File("file.pdf");
FileUtils.copyURLToFile(url, file);
}

In continuation to your code, you may use a new Document to hold only first page of given PDF file.
URL url = new URL(route);
File file = new File("file.pdf");
FileUtils.copyURLToFile(url, file);
PDDocument pdDoc = PDDocument.load(file);
PDDocument document = null;
int pageNumberToRead=0;
try {
document = new PDDocument();
document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(pageNumberToRead));
document.save("basepath/first_page.pdf");
document.close();
}catch(Exception e){}

Related

Apache POI convert HTML/XHTML to DOC/DOCX

I need to transform HTML to a doc file, the HTML is filled with custom information and the images and CSS change depending on what is request.
I'm trying to use Apache POI for this, but I'm having an error
`
org.apache.poi.xwpf.converter.core.XWPFConverterException: java.lang.IllegalStateException: Expecting one Styles document part, but found 0
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.convert(XHTMLConverter.java:72)
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.doConvert(XHTMLConverter.java:58)
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.doConvert(XHTMLConverter.java:38)
at org.apache.poi.xwpf.converter.core.AbstractXWPFConverter.convert(AbstractXWPFConverter.java:45)
My code is this:
// Load the HTML file
//Doc file
String htmlFile = "pathToHtml/file.html";
//String htmlFile = parseHTMLTemplate(disputeLetterDetails, template, fileExtension);
//new File(htmlFile);
//File file = new FileReader(htmlFile);
Path path = Path.of(htmlFile);
OutputStream in = new FileOutputStream(htmlFile, true);
// Create a new XWPFDocument
XWPFDocument document = new XWPFDocument();
// Set up the XHTML options
XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(new File("./images/")));
options.setExtractor(new FileImageExtractor(new File("./images/")));
// Convert the HTML to XWPFDocument
XHTMLConverter.getInstance().convert(document, in, options);
// Save the document to a .doc file
FileOutputStream out = new FileOutputStream("pathToHtml/OUT_from_XHTML_TEST.docx");
document.write(out);
out.close();
`
I want to get a docx file from an HTML file with the same styles but I'm getting this error `
org.apache.poi.xwpf.converter.core.XWPFConverterException: java.lang.IllegalStateException: Expecting one Styles document part, but found 0
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.convert(XHTMLConverter.java:72)
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.doConvert(XHTMLConverter.java:58)
at org.apache.poi.xwpf.converter.xhtml.XHTMLConverter.doConvert(XHTMLConverter.java:38)
at org.apache.poi.xwpf.converter.core.AbstractXWPFConverter.convert(AbstractXWPFConverter.java:45)
`

unable to read pdf document using selenium web driver

I am writing code to read pdf file in selenium using Java PDF Library.
I wrote my code as
URL url = new URL(str);
InputStream is=url.openStream();
BufferedInputStream fileParse=new BufferedInputStream(is);
PDDocument document=null;
document=PDDocument.load(fileParse);
String pdfContent=new PDFTextStripper().getText(document);
But I am getting error at line document=PDDocument.load(fileParse) along with
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1119)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2017)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:1988)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:269)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1143)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1040)
I need to verify the content on the pdf file .
Appreciate the help.
Thanks

Simply you can use below line of code and its working:
//Loading an existing document
File file = new File("yourPdfFilepath");
PDDocument document = PDDocument.load(file);
//Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
//Retrieving text from PDF document
String pdfcontent = pdfStripper.getText(document);
I hope it helps you

Removing all embedded files in iTextSharp

As I've become interested in iTextSharp I need to learn C#. Since I know a bit of AutoHotkey (simple yet powerful script programming language for Windows) it is easer for me. However, I often come across code written in Java which is said to be easily converted to C#. Unfortunetly, I have some problems with it. Let's have a look at original code written by Bruno Lowagie.
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
PdfDictionary root = reader.getCatalog();
PdfDictionary names = root.getAsDict(PdfName.NAMES);
names.remove(PdfName.EMBEDDEDFILES);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
stamper.close();
}
This is what I have managed to write on my own:
static void removeFiles(string sourceFilePath, string destFilePath)
{
try
{
// read src file
FileStream inputStream = new FileStream(sourceFilePath, FileMode.Open, FileAccess.Read, FileShare.None);
Document source = new Document();
// open for reading
PdfWriter reader = PdfReader.GetInstance(inputStream);
source.Open();
// create dest file
FileStream outputStream = new FileStream(destFilePath, FileMode.Create, FileAccess.ReadWrite, FileShare.None);
Document dest = new Document();
PdfWriter writer = PdfWriter.GetInstance(dest, inputStream); // open stream from src
// remove embedded files from dest
PdfDictionary root = dest.getCatalog().getPdfObject();
PdfDictionary names = root.getAsDictionary(PdfName.Names);
names.remove(PdfName.EmbeddedFiles);
// close all
source.Close();
dest.Close();
}
catch (Exception ex)
{
}
}
Unfortunately, there are many errors such as:
'Document' does not contain a definition for 'getCatalog' and no extension method 'getCatalog'
'PdfReader' does not contain a definition for 'GetInstance'
This is what I have managed to do after countless hours of coding and googling.

There are some iTextSharp examples available. See for instance: How to read a PDF Portfolio using iTextSharp
I don't know much about C#, but this is a first attempt to fix your code:
static void RemoveFiles(string sourceFilePath, string destFilePath)
{
// read src file
FileStream inputStream = new FileStream(sourceFilePath, FileMode.Open, FileAccess.Read, FileShare.None);
// open for reading
PdfReader reader = new PdfReader(inputStream);
FileStream outputStream = new FileStream(destFilePath, FileMode.Create, FileAccess.ReadWrite, FileShare.None);
PdfStamper stamper = new PdfStamper(reader, outputStream);
// remove embedded files
PdfDictionary root = reader.Catalog;
PdfDictionary names = root.GetAsDict(PdfName.NAMES);
names.Remove(PdfName.EMBEDDEDFILES);
// close all
stamper.Close();
reader.Close();
}
Note that I don't understand why you were using Document and PdfWriter. You should use PdfStamper instead. Also: you are only removing the document-level attachments. If there are file attachment annotations, they will still be present in the PDF.

Vaadin Convert and display image as PDF

Does anyone know how image file can be easily converted into PDF format. What I need is to get the image from database and display it on the screen as PDF. What am I doing wrong? I tried to use iText but with no results.
My code:
StreamResource resource = file.downloadFromDatabase();//get file from db
Document converToPdf=new Document();//Create Document Object
PdfWriter.getInstance(convertToPdf, new FileOutputStream(""));//Create PdfWriter for Document to hold physical file
convertToPdf.open();
Image convertJpg=Image.getInstance(resource); //Get the input image to Convert to PDF
convertToPdf.add(convertJpg);//Add image to Document
Embedded pdf = new Embedded("", convertToPdf);//display document
pdf.setMimeType("application/pdf");
pdf.setType(Embedded.TYPE_BROWSER);
pdf.setSizeFull();
Thanks.

You're not using iText correctly:
You never close your writer, so the addition of the image never gets written to the outputstream.
You pass an empty string to your FileOutputStream. If you want to keep the pdf in memory, use a ByteArrayOutputStream. If not, define a temporary name instead.
You pass your Document object, which is a iText-specific object to your Embedded object and treat it like a file. It is not a pdf-file or byte[]. You'll probably want to pass either your ByteArrayOutputStream or read the temp file as a ByteArrayOutputStream into memory and pass that to Embedded.

Maybe someone will use (Vaadin + iText)
Button but = new Button("FV");
StreamResource myResource = getPDFStream();
FileDownloader fileDownloader = new FileDownloader(myResource);
fileDownloader.extend(but);
hboxBottom.addComponent( but );
private StreamResource getPDFStream() {
StreamResource.StreamSource source = new StreamResource.StreamSource() {
public InputStream getStream() {
// step 1
com.itextpdf.text.Document document = new com.itextpdf.text.Document();
// step 2
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try {
com.itextpdf.text.pdf.PdfWriter.getInstance(document, baos);
// step 3
document.open();
document.add(Chunk.NEWLINE); //Something like in HTML :-)
document.add(new Paragraph("TEST" ));
document.add(Chunk.NEWLINE); //Something like in HTML :-)
document.newPage(); //Opened new page
//document.add(list); //In the new page we are going to add list
document.close();
//file.close();
System.out.println("Pdf created successfully..");
} catch (DocumentException ex) {
Logger.getLogger(WndOrderZwd.class.getName()).log(Level.SEVERE, null, ex);
}
ByteArrayOutputStream stream = baos;
InputStream input = new ByteArrayInputStream(stream.toByteArray());
return input;
}
};
StreamResource resource = new StreamResource ( source, "test.pdf" );
return resource;
}

CSS style is not taken while generating pdf from html using itext

I can successfully generate a pdf from an html string, but the problem is that it doesn't take the css script. How can I generate the pdf with css style?
Please help! I have tried cssresolver als
My code is here:
{String result = "failed";
try
{
String html2 ="<html>"+.....+"</html>" ;
long timemilli = System.currentTimeMillis();
String filename = "EastAfriPack2014_Ticket_"+timemilli;
String writePath = Global.PDF_SAVE_PATH + filename ;
System.out.println("----------writePath--------------"+ writePath);
OutputStream file = new FileOutputStream(new File(writePath+".pdf"));
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, file);
document.open();
InputStream is = new ByteArrayInputStream(k.getBytes());
CSSResolver cssResolver = XMLWorkerHelper.getInstance().getDefaultCssResolver(false);
cssResolver.addCss("table {color: red; background-color: blue; } ", true);
XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
document.close();
file.close();
System.out.println("pdf created");
result = filename;
return filename;
} catch (Exception e) {
e.printStackTrace();
return result;
}
}

I don't think your approach works. I tried it before because, its the easiest way to create a PDF from HTML, but got bitten by same problem.
You either provide the styles inline via style attribute for the table
or
use the HTML, CSS files separately and send them to the HelperClass
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
new FileInputStream("myhtmlFile.html"),
new FileInputStream("myCSSFile.css"));
the HTML part can also be an inputStream you made above in the code.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Apache Commons IO download only first PDF page - java

Related

Apache POI convert HTML/XHTML to DOC/DOCX

unable to read pdf document using selenium web driver

Removing all embedded files in iTextSharp

Vaadin Convert and display image as PDF

CSS style is not taken while generating pdf from html using itext

Categories

Resources