I am developing a module where i am supposed to print documents from the server. Following are the requirements :
the module should be able to print a pdf from a url, with & without saving
the module should be able to accept page numbers as parameters and only print/save those page numbers.
the module should be able to accept the printer name as a parameter and use only that printer
Is there any library available for this? How should i go about implementing this?
The answer was Apache PDFBox . I was able to load the PDF into a PDDocument object like this :
PDDocument pdf = PDDocument.load(new URL(download_pdf_from).openStream());
Splitting the document was as easy as :
Splitter splitter = new Splitter();
List<PDDocument> splittedDocuments = splitter.split(pdf);
Now, to get a reference to any particular page:
splittedDocuments.get(pageNo);
Saving the entire document or even a given page number :
pdf.save(path); //saving the entire document to device
splittedDocuments.get(pageNo).save(path); //saving a particular page number to device
For the printing part, this answer helped me.
Related
I wanted to make a simple program to get text content from a pdf file through Java. Here is the code:
PDFTextStripper ts = new PDFTextStripper();
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = PDDocument.load(file);
String allText = ts.getText(doc1);
String gradeText = allText.substring(allText.indexOf("GRADE 10B"), allText.indexOf("GRADE 10C"));
System.out.println("Meeting ID for English: "
+ gradeText.substring(gradeText.indexOf("English") + 7, gradeText.indexOf("English") + 20));
This is just part of the code, but this is the part with the problem.
The error is: The method load(File) is undefined for the type PDDocument
I have learnt using PDFBox from JavaTPoint. I have followed the correct instructions for installing the PDFBox libraries and adding them to the Build Path.
My PDFBox version is 3.0.0
I have also searched the source files and their methods, and I am unable to find the load method there.
Thank you in advance.
As per the 3.0 migration guide the PDDocument.load method has been replaced with the Loader method:
For loading a PDF PDDocument.load has been replaced with the Loader
methods. The same is true for loading a FDF document.
When saving a PDF this will now be done in compressed mode per
default. To override that use PDDocument.save with
CompressParameters.NO_COMPRESSION.
PDFBox now loads a PDF Document incrementally reducing the initial
memory footprint. This will also reduce the memory needed to consume a
PDF if only certain parts of the PDF are accessed. Note that, due to
the nature of PDF, uses such as iterating over all pages, accessing
annotations, signing a PDF etc. might still load all parts of the PDF
overtime leading to a similar memory consumption as with PDFBox 2.0.
The input file must not be used as output for saving operations. It
will corrupt the file and throw an exception as parts of the file are
read the first time when saving it.
So you can either swap to an earlier 2.x version of PDFBox, or you need to use the new Loader method. I believe this should work:
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = Loader.loadPDF(file);
I need to find a way to ignore pictures and photos from PDF document during conversion to DOCX file.
I am creating an instance of FineReader Engine:
IEngine engine = Engine.InitializeEngine(
engineConfig.getDllFolder(), engineConfig.getCustomerProjectId(),
engineConfig.getLicensePath(), engineConfig.getLicensePassword(), "", "", false);
After that, I am converting a document:
IFRDocument document = engine.CreateFRDocument();
document.AddImageFile(file.getAbsolutePath(), null, null);
document.Process(null);
String exportPath = FileUtil.prepareExportPath(file, resultFolder);
document.Export(exportPath, FileExportFormatEnum.FEF_DOCX, null);
As a result, it converts all images from the initial pdf document.
When you exporting pdf to docx you should use some export params. In this way you can use IRTFExportParams. You can get this object:
IRTFExportParams irtfExportParams = engine.CreateRTFExportParams();
and there you can set writePicture property like this:
irtfExportParams.setWritePictures(false);
there: IEngine engine is main interface. I think u know how to initialize it;)))
Also you have to set in method document.Process() property. (document is from IFRDocument document).
In Process() method you have to give IDocumentProcessingParams iDocumentProcessingParams. This object has method setPageProcessingParams() and there you have to put IPageProcessingParams iPageProcessingParams params(You can get this object by engine.CreatePageProcessingParams()). And this object has methods:
iPageProcessingParams.setPerformAnalysis(true);
iPageProcessingParams.setPageAnalysisParams(iPageAnalysisParams);
In the first method set true,
and in the second one we give iPageAnalysisParams(IPageAnalysisParams iPageAnalysisParams = engine.CreatePageAnalysisParams()).
Last step, you have to set false value in setDetectPictures(false) method from iPageAnalysisParams like this. Thats all:)
And when you are going to export document you should put this param like this:
IFRDocument document = engine.CreateFRDocument();
document.Export(filePath, FileExportFormatEnum.FEF_DOCX, irtfExportParams);
I hope my answer will help to everyone)))
I'm not really familiar with PDF to DOCX conversion, but i think you could try custom profiles according to your needs.
At some point in your code you should create a Engine object, and then create a Document object (or IFRDocument object depending of your application). Add this line just before giving your document to your engine for processing:
engine.LoadProfile(PROFILE_FILENAME);
Then create your file with some processing parameters described in the documentation packaged with your FRE installation under "Working with Profiles" section.
Do not forget to add in your file:
... some params under other sections
[PageAnalysisParams]
DetectText = TRUE --> force text detection
DetectPictures = FALSE --> ignore pictures
... other params under PageAnalysisParams
... some params under other sections
It works the same way for Barcodes, etc... But keep in mind to benchmark your results when adding or removing things from this file as it may alter processing speed and global quality of your result.
What do PDF input pages contain? What is expected in MS Word?
It would be great if you would attach an example of an input PDF file and an example of the desired result in MS Word format.
Then give a useful recommendation will be much easier.
I am trying to automate docx report generation process. For this I am using java and docx4j. I have a template document containing only single page.I would like to copy that page modify it and save it in another docx document.The output report is of multiple similar pages with modification from the template. How do I go about it.
PS : java and docx4j are my first choice but I am open to solutions apart from java and docx4j.
Leaving it up to you to modify the template, here is how you could add one document to the end of another document. Suppose base.docx contains "This is the base document." and template.docx contains "The time is:", then after executing this code:
WordprocessingMLPackage doc = Docx4J.load(new File("base.docx"));
WordprocessingMLPackage template = Docx4J.load(new File("template.docx"));
MainDocumentPart main = doc.getMainDocumentPart();
Br pageBreak = Context.getWmlObjectFactory().createBr();
pageBreak.setType(STBrType.PAGE);
main.addObject(pageBreak);
for (Object obj : template.getMainDocumentPart().getContent()) {
main.addObject(obj);
}
main.addParagraphOfText(LocalDateTime.now().toString());
doc.save(new File("result.docx"));
Then result.docx will contain something like:
This is the base document.
^L
The time is:
2018-04-16T17:37:13.541984200
(Where ^L represents a page break.)
To be more precise my original template is containing only header and some styling component.
This kind of information can be stored in a Word stylesheet (.dotx file).
PS : java and docx4j are my first choice but I am open to solutions apart from java and docx4j.
A good tool would be pxDoc: you can specify a dedicated stylesheet in your document generator, or use "variable styles"and specify the stylesheet only when you launch the document generation
I have some questions about parsing pdf anfd how to:
what is the purpose of using
PDDocument.loadNonSeq method that include a scratch/temporary file?
I have big pdf and i need to parse it and get text contents. I use PDDocument.load() and then PDFTextStripper to extract data page by page (pdfstripper have got setStartPage(n) and setEndPage(n)
where n=n+1 every page loop ). Is more efficient for memory using loadNonSeq instead load?
For example
File pdfFile = new File("mypdf.pdf");
File tmp_file = new File("result.tmp");
PDDocument doc = PDDocument.loadNonSeq(pdfFile, new RandomAccessFile(tmp_file, READ_WRITE));
int index=1;
int numpages = doc.getNumberOfPages();
for (int index = 1; index <= numpages; index++){
PDFTextStripper stripper = new PDFTextStripper();
Writer destination = new StringWriter();
String xml="";
stripper.setStartPage(index);
stripper.setEndPage(index);
stripper.writeText(this.doc, destination);
.... //filtering text and then convert it in xml
}
Is this code above a right loadNonSeq use and is it a good practice to read PDF page per page without vaste in memory?
I use page per page reading because I need to write text in XML using DOM memory (using stripping technique, I decide to produce an XML for every page)
what is the purpose of using PDDocument.loadNonSeq method that include a scratch/temporary file?
PDFBox implements two ways to read a PDF file.
loadNonSeq is the way documents should be loaded
load is the way documents should not be loaded but one might try to repair flles with broken cross references this way
In the 2.0.0 development branch, the algorithm formerly used for loadNonSeq is now used for load and the algorithm formerly used for load is not used anymore.
I have big pdf and i need to parse it and get text contents. I use PDDocument.load() and then PDFTextStripper to extract data page by page (pdfstripper have got setStartPage(n) and setEndPage(n) where n=n+1 every page loop ). Is more efficient for memory using loadNonSeq instead load?
Using loadNonSeq instead of load may improve memory usage for multi-revision PDFs because it only reads objects still referenced from the reference table while load can keep more in memory.
I don't know, though, whether using a scratch file makes a big difference.
is it a good practice to read PDF page per page without vaste in memory?
Internally PDFBox parses the given range page after page, too. Thus, if you process the stripper output page-by-page, it certainly is ok to parse it page by page.
I have been using OLE automation from java to access methods for word.
I managed to do the following using the OLE automation:
Open word document template file.
Mail merge the word document template with a csv datasource file.
Save mail merged file to a new word document file.
What i need to do now is to be able to open the mail merged file and then using OLE programmatically split it into multiple files. Meaning if the original mail merged file has 6000 pages and my max pages per file property is set to 3000 pages i need to create two new word document files and place the 1st 3000 pages in the one and the last 3000 pages into the other one.
On my first attempts i took the amount of rows in the csv file and multiplied it by the number of pages in the template to get the total amount of pages after it will be merged. Then i used the merging to create the multiple files. The problem however is that i cannot exactly calculate how many pages the merged document will be because in some case all say 9 pages of the template will not be used because of the data and the mergefields used. So in some cases one row will only create 3 pages (using the 9 page template) and others might create 9 pages (using the 9 page template) during mail merge time.
So the only solution is to merge all rows into one document and then split it into multiple documents therafter to ensure that the exact amount of pages like the 3000 pages property is indeed in each file until there are no more pages left from the original merged file.
I have tried a few things already by using the msdn site to get methods and their properties etc but have been unable to this.
On my last attempts now i have been trying to use GoTo to get to a specific page number and the remove the page. I was going to try do this one by one for each page until i get to where i want the file to start from and then save it as a new file but have been unable to do so as well.
Please can anyone suggest something that could help me out?
Thanks and Regards
Sean
An example to open a word file using the OLE AUTOMATION from jave is included below:
Code sample
OleAutomation documentsAutomation = this.getChildAutomation(this.wordAutomation, "Documents");
int [ ] id = documentsAutomation.getIDsOfNames(new String[]{"Open"});
Variant[] arguments = new Variant[1];
arguments[0] = new Variant(fileName); // where filename is the absolute path to the docx file
Variant invokeResult = documentsAutomation.invoke(id[0], arguments);
private OleAutomation getChildAutomation(OleAutomation automation, String childName) {
int[] id = automation.getIDsOfNames(new String[]{childName});
Variant pVarResult = automation.getProperty(id[0]);
return(pVarResult.getAutomation());
}
Code sample
Sounds like you've pegged it already. Another approach you could take which would avoid building then deleting would be to look at the parts of your template that can make the biggest difference to the number of your template (that is where the data can be multi-line). If you then take these fields and look at the font, line-spacing and line-width type of properties you'll be able to calculate the room your data will take in the template and limit your data at that point. Java FontMetrics can help you with that.