Apache PDFBox to open temporary created PDF file - java

I'm using Apache PDFBox 2.x and I am trying to read a temporary file that I have just created.
Below is my code to create the temp file and read it:
Path mergedTempFile = null;
try {
    mergedTempFile = Files.createTempFile("merge_", ".pdf");
    PDDocument pdDocument = PDDocument.load(mergedTempFile.toFile());
But it throws this error:
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1098)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2577)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2560)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1006)
at com.howtodoinjava.demo.PdfboxApi.test(PdfboxApi.java:326)
at com.howtodoinjava.demo.PdfboxApi.main(PdfboxApi.java:317)
I found a reference to a similar issue, but it did not help.
I still cannot get rid of this error. Please help me with this.

PDDocument.load(...) is used to parse an existing PDF.
The temporary file you pass in (mergedTempFile) has only just been created and is empty, hence the exception. Instead, create a PDDocument with its constructor (the document lives in memory) and later save it with PDDocument.save(...).
Path mergedTempFile = null;
try {
    mergedTempFile = Files.createTempFile("merge_", ".pdf");
    try (PDDocument pdDocument = new PDDocument()) {
        // add content
        pdDocument.addPage(new PDPage()); // empty page as an example
        pdDocument.save(mergedTempFile.toFile());
    }
} catch (IOException e) {
    // exception handling
}
// use mergedTempFile for further logic
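To see why the original load call hits end-of-file immediately, here is a minimal JDK-only sketch that checks the size of a freshly created temp file. The class and method names (TempFileCheck, freshTempFileSize) are made up for illustration; only Files.createTempFile comes from the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class TempFileCheck {
    // Creates a temp file the same way the question does and returns its size in bytes.
    static long freshTempFileSize() {
        try {
            Path tmp = Files.createTempFile("merge_", ".pdf");
            try {
                return Files.size(tmp);
            } finally {
                // Clean up so the check leaves nothing behind.
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Size of new temp file in bytes: " + freshTempFileSize());
    }
}
```

The file has no content at all, so a PDF parser has no header line to read from it.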

Related

How can I merge two pdf using pdfBox

Hi there. I am trying to generate a PDF by merging two PDFs, one from my local machine and one from S3, and I am not sure how to do this.
Here is what I tried:
S3Object object = amazonS3.getObject(new GetObjectRequest(bucket, "name.pdf"));
InputStream ins = object.getObjectContent();
try {
    PDDocument doc = PDDocument.load(ins);
    File file1 = new File(
            "C:\\Users\\admin\\example.pdf");
    PDFMergerUtility obj = new PDFMergerUtility();
    obj.setDestinationFileName(
            "C:\\Users\\admin\\newMerged.pdf");
    obj.addSource(file1);
    obj.addSource(String.valueOf(doc));
    obj.mergeDocuments();
    System.out.println("PDF Documents merged to a single file");
} catch (Exception e) {
    e.printStackTrace();
}
Error - org.apache.pdfbox.pdmodel.PDDocument@4d33940d (The system cannot find the file specified)
I know this is probably not the right way; I would like to know how to do it properly.

Consider using the absolute pathname of each file you want to merge.
PDFMergerUtility result = new PDFMergerUtility();
result.addSource("C:\\...\\FileRead\\first.pdf");
result.addSource("C:\\...\\FileRead\\second.pdf");
result.setDestinationFileName("the destination path");
result.mergeDocuments();
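The error message in the question also hints at the underlying cause: addSource(String) expects a file name, but String.valueOf(doc) produces the object's default toString() text (class name plus hash code), which is not a path on disk. The sketch below reproduces that with the JDK only; FakeDocument is a hypothetical stand-in for PDDocument, and asSourceName is an illustrative helper, not a PDFBox method.

```java
import java.io.File;

class ToStringIsNotAPath {
    // Hypothetical stand-in for PDDocument: it does not override toString().
    static class FakeDocument {}

    // What obj.addSource(String.valueOf(doc)) effectively hands over as the "file name".
    static String asSourceName(Object doc) {
        return String.valueOf(doc);
    }

    public static void main(String[] args) {
        String name = asSourceName(new FakeDocument());
        System.out.println("Passed as file name: " + name);
        System.out.println("Exists on disk: " + new File(name).exists());
    }
}
```

PDFMergerUtility in PDFBox 2.x also has an addSource(InputStream) overload, so passing the S3 stream directly, instead of loading it into a PDDocument first, may be the simpler route here.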

How to fix an error during generating PDF: PdfBoxTextRenderer.getWidth(PdfBoxTextRenderer.java:300)

I use the openhtmltopdf library to convert my HTML templates to PDF:
try (OutputStream os = new FileOutputStream(filePath);
     PDDocument doc = new PDDocument()) {
    for (String html : htmlPagesWithValues) {
        PdfRendererBuilder builder = new PdfRendererBuilder();
        builder.defaultTextDirection(BaseRendererBuilder.TextDirection.LTR);
        builder.useDefaultPageSize(210, 297, BaseRendererBuilder.PageSizeUnits.MM);
        builder.useProtocolsStreamImplementation(new InternalFSStreamFactory(), "localProtocol");
        builder.withHtmlContent(html, "");
        builder.useSVGDrawer(new BatikSVGDrawer());
        builder.usePDDocument(doc);
        PdfBoxRenderer renderer = builder.buildPdfRenderer();
        renderer.createPDFWithoutClosing();
    }
    doc.save(os);
} catch (Exception ex) {
    log.debug("Stacktrace: ", ex);
}
While generating the PDF file I get the following stack trace:
java.lang.NullPointerException: null
at com.openhtmltopdf.pdfboxout.PdfBoxTextRenderer.getWidth(PdfBoxTextRenderer.java:300)
at com.openhtmltopdf.layout.Breaker.doBreakText(Breaker.java:147)
at com.openhtmltopdf.layout.Breaker.doBreakText(Breaker.java:115)
at com.openhtmltopdf.layout.Breaker.breakText(Breaker.java:109)
at com.openhtmltopdf.layout.InlineBoxing.layoutText(InlineBoxing.java:959)
...

I found the issue. My PDF uses images that are hosted on our server. The openhtmltopdf library tried to fetch the images through our public URL, but that URL was not accessible. After access was granted, the PDF was created successfully. I've also opened an issue on GitHub: https://github.com/danfickle/openhtmltopdf/issues/267

PDF Box - Unable to renameTo or Delete files

I'm fairly new to programming and I've been trying to use PDFBox for a personal project. I'm basically trying to check whether a PDF contains specific keywords; if it does, I want to move the file to an "approved" folder.
I know the code below is poorly written, but I'm not able to move or delete the file correctly:
try (Stream<Path> filePathStream = Files.walk(Paths.get("C://pdfbox_teste"))) {
    filePathStream.forEach(filePath -> {
        if (Files.isRegularFile(filePath)) {
            String arquivo = filePath.toString();
            File file = new File(arquivo);
            try {
                // Loading an existing document
                PDDocument document = PDDocument.load(file);
                // Instantiate PDFTextStripper class
                PDFTextStripper pdfStripper = new PDFTextStripper();
                String text = pdfStripper.getText(document);
                String[] words = text.split("\\.|,|\\s");
                for (String word : words) {
                    // System.out.println(word);
                    if (word.equals("Revisão") || word.equals("Desenvolvimento")) {
                        // System.out.println(word);
                        if (file.renameTo(new File("C://pdfbox_teste//Aprovados//" + file.getName()))) {
                            document.close();
                            // "File transferred successfully"
                            System.out.println("Arquivo transferido corretamente");
                            file.delete();
                        };
                    }
                }
                // "End of document"
                System.out.println("Fim do documento: " + arquivo);
                System.out.println("----------------------------");
                document.close();
            } catch (Exception e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    });
}
I wanted the files to be moved into the new folder. Instead, sometimes they only get deleted and sometimes nothing happens. I imagine the error is in the forEach, but I can't seem to find a way to fix it.

You are trying to rename the file while it is still open, and you only close it afterwards:
// your code, does not work
if (file.renameTo(new File("C://pdfbox_teste//Aprovados//" + file.getName()))) {
    document.close();
    System.out.println("Arquivo transferido corretamente");
    file.delete();
};
Close the document first, so that your process no longer holds the file open; after that it should be possible to rename it:
// fixed code:
document.close();
if (file.renameTo(new File("C://pdfbox_teste//Aprovados//" + file.getName()))) {
    System.out.println("Arquivo transferido corretamente");
}
And as Mahesh K pointed out, you don't have to delete the (original) file after you have renamed it. Renaming does not create a duplicate that would leave the original behind to be deleted; it just renames the file.

After calling renameTo, you shouldn't call delete. As I understand it, renameTo works like a move command.
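As an alternative to File.renameTo, which only returns false when it fails, java.nio.file.Files.move throws an exception that says what went wrong. The following is a minimal sketch under that assumption; the class and method names (MoveApproved, moveToApproved, demo) are made up for illustration, and the demo runs in a temporary directory rather than in C://pdfbox_teste. With PDFBox you would still close the PDDocument before moving the file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class MoveApproved {
    // Moves 'file' into 'approvedDir' (created if missing) and returns the new path.
    static Path moveToApproved(Path file, Path approvedDir) {
        try {
            Files.createDirectories(approvedDir);
            Path target = approvedDir.resolve(file.getFileName());
            return Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-contained demo: true if the file ends up in the target folder
    // and is gone from the source folder, with no delete call needed.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("pdfbox_teste");
            Path pdf = Files.createTempFile(dir, "doc_", ".pdf");
            Path moved = moveToApproved(pdf, dir.resolve("Aprovados"));
            return Files.exists(moved) && Files.notExists(pdf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Moved without an extra delete: " + demo());
    }
}
```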

Apache POI fails to save (HWPFDocument.write) large word doc files

I want to remove Word metadata from .doc files. My .docx files work fine with XWPFDocument, but the following code for removing metadata fails for large (> 1MB) .doc files. For example, given a 6MB .doc file with images, it outputs a 4.5MB file in which some images are missing.
public static InputStream removeMetaData(InputStream inputStream) throws IOException {
    POIFSFileSystem fss = new POIFSFileSystem(inputStream);
    HWPFDocument doc = new HWPFDocument(fss);
    // **it even fails on large files if you remove from here to 'until' below**
    SummaryInformation si = doc.getSummaryInformation();
    si.removeAuthor();
    si.removeComments();
    si.removeLastAuthor();
    si.removeKeywords();
    si.removeSubject();
    si.removeTitle();
    doc.getDocumentSummaryInformation().removeCategory();
    doc.getDocumentSummaryInformation().removeCompany();
    doc.getDocumentSummaryInformation().removeManager();
    try {
        doc.getDocumentSummaryInformation().removeCustomProperties();
    } catch (Exception e) {
        // can not remove above
    }
    // until
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    doc.write(os);
    os.flush();
    os.close();
    return new ByteArrayInputStream(os.toByteArray());
}
Related posts:
How to save the Word Document using POI API?
https://stackoverflow.com/questions/9758955/saving-poi-document-correctly

Which version of Apache POI are you using?
This looks like Bug 46220 - Regression: Some embedded images being lost.
Please upgrade to the latest release of POI (3.8) and try again.
Hope that helps.

Java: Read in text files from a directory, from the internet

Does anybody know how to recursively read in files from a specific directory on the internet, in Java?
I want to read in all the text files from this web directory: http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/
I know how to read in multiple files that are in a folder on my computer, and I know how to read in a single file from the internet. But how can I read in multiple files from the internet without hardcoding the URLs?
Stuff I tried:
// List the files on my Desktop
final File folder = new File("/Users/crystal/Desktop");
File[] listOfFiles = folder.listFiles();
for (int i = 0; i < listOfFiles.length; i++) {
    File fileEntry = listOfFiles[i];
    if (!fileEntry.isDirectory()) {
        System.out.println(fileEntry.getName());
    }
}
Another thing I tried:
// Reading data from the web
try {
    // Create a URL object
    URL url = new URL("http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/5_1_1.txt");
    // Read all of the text returned by the HTTP server
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    String htmlText; // String that holds current file line
    // Read through file one line at a time. Print line
    while ((htmlText = in.readLine()) != null) {
        System.out.println(htmlText);
    }
    in.close();
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    // If another exception is generated, print a stack trace
    e.printStackTrace();
}
Thanks!

Since the URL you mentioned has directory indexes enabled, you're in luck.
You've got a few options here:
1. Parse the HTML to find the href attribute of the a tags, using SAX2 or any other XML parser. htmlunit would also work, I think.
2. Use a little regexp magic to match every string between <a href=" and "> and use those as the URLs to read from.
Once you've got a list of all the URLs you need, the second piece of code should work just fine. Just iterate over your list and construct each URL from it.
Here's a sample regex that should match what you want. It does catch a few extra links, but you should be able to filter those out.
<a\ href="(.+?)">
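A rough sketch of the regex option in Java, using only the JDK, could look like this. The class and method names (IndexLinkExtractor, extractTextFileLinks, resolve) and the sample index HTML are assumptions made for illustration; a real index page may format its links differently. The extra links are filtered out by keeping only entries that end in ".txt".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexLinkExtractor {
    // The sample regex from above, escaped as a Java string literal.
    private static final Pattern HREF = Pattern.compile("<a\\ href=\"(.+?)\">");

    // Returns the href values in the HTML that end in ".txt".
    static List<String> extractTextFileLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            if (link.endsWith(".txt")) {
                links.add(link);
            }
        }
        return links;
    }

    // Turns a relative link from the index page into an absolute URL string.
    static String resolve(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.cs.ucdavis.edu/~davidson/courses/170-S11/Female/";
        // Hypothetical excerpt of a directory index page.
        String sample = "<a href=\"?C=N;O=D\">Name</a>\n"
                + "<a href=\"5_1_1.txt\">5_1_1.txt</a>\n";
        for (String link : extractTextFileLinks(sample)) {
            System.out.println(resolve(base, link));
        }
    }
}
```

Each resolved URL can then be passed to the reading code from the question in place of the hardcoded one.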
