I am using PD4ML to create PDF documents, but I don't want users to be able to edit those documents in MS Word 2013.
Here is what I have tried so far:
pd4ml = new PD4ML();
pd4ml.setPageSize(PD4Constants.A4);
pd4ml.setPageInsetsMM(new Insets(TOPVALUE, LEFTVALUE, BOTTOMVALUE, RIGHTVALUE));
pd4ml.setHtmlWidth(USERSPACEWIDTH);
pd4ml.enableImgSplit(false);
pd4ml.disableHyperlinks();
//some more code
pd4ml.render(arrayOfURLs, byteArrayOutputStream);
//some more code
Then I read the PD4ML API documentation and added this line of code: pd4ml.generatePdfa(true); I thought the problem was solved when I opened the document in Adobe Reader and saw the message "This file claims compliance with the PDF/A standard and has been opened read-only", but of course it was still editable.
Any suggestions on how to do this in PD4ML, or a reference to an API I can use to add this restriction to the generated PDF, would be more than welcome.
Have you tried this?
This is from the documentation, by the way:
AllowModify
public static final int AllowModify
Document access permission (bit 4, value = 8).
Modify the contents of the document by operations other than those controlled by bits 6, 9, and 11.
See Also:
PD4ML.setPermissions(String, int, boolean), Constant Field Values
Also you may want to do:
pd4ml.setPermissions("", 0xffffffff ^ PD4Constants.AllowModify, false);
to disable the modification.
More information here: http://pd4ml.com/cookbook/pd4ml_pdf_security.htm
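The mask arithmetic in the setPermissions call above can be checked in isolation. The following is a minimal sketch in plain Java that does not call PD4ML at all; the class and method names are hypothetical, and the constant 8 is copied from the AllowModify documentation quoted above.

```java
/** Illustrates the permission arithmetic only; it does not call PD4ML. */
final class PermissionMaskDemo {
    // Value quoted from the AllowModify documentation above (bit 4, value = 8).
    static final int ALLOW_MODIFY = 8;

    /** Returns a mask with every permission bit set except the given ones. */
    static int allExcept(int... deniedBits) {
        int mask = 0xffffffff;
        for (int bit : deniedBits) {
            mask &= ~bit; // clear the bit; same result as XOR when starting from all ones
        }
        return mask;
    }

    /** Reports whether a permission bit is still set in the mask. */
    static boolean isAllowed(int mask, int permissionBit) {
        return (mask & permissionBit) != 0;
    }

    public static void main(String[] args) {
        int mask = allExcept(ALLOW_MODIFY);
        System.out.println(Integer.toHexString(mask));     // fffffff7
        System.out.println(isAllowed(mask, ALLOW_MODIFY)); // false
    }
}
```

The value computed here equals 0xffffffff ^ 8, which is what the answer passes as the second argument of setPermissions.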
I wanted to make a simple program to get text content from a pdf file through Java. Here is the code:
PDFTextStripper ts = new PDFTextStripper();
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = PDDocument.load(file);
String allText = ts.getText(doc1);
String gradeText = allText.substring(allText.indexOf("GRADE 10B"), allText.indexOf("GRADE 10C"));
System.out.println("Meeting ID for English: "
+ gradeText.substring(gradeText.indexOf("English") + 7, gradeText.indexOf("English") + 20));
This is just part of the code, but this is the part with the problem.
The error is: The method load(File) is undefined for the type PDDocument
I learned PDFBox from JavaTPoint, and I followed the instructions for installing the PDFBox libraries and adding them to the build path.
My PDFBox version is 3.0.0
I have also searched the source files and their methods, and I am unable to find the load method there.
Thank you in advance.
As per the 3.0 migration guide the PDDocument.load method has been replaced with the Loader method:
For loading a PDF PDDocument.load has been replaced with the Loader
methods. The same is true for loading a FDF document.
When saving a PDF this will now be done in compressed mode per
default. To override that use PDDocument.save with
CompressParameters.NO_COMPRESSION.
PDFBox now loads a PDF Document incrementally reducing the initial
memory footprint. This will also reduce the memory needed to consume a
PDF if only certain parts of the PDF are accessed. Note that, due to
the nature of PDF, uses such as iterating over all pages, accessing
annotations, signing a PDF etc. might still load all parts of the PDF
overtime leading to a similar memory consumption as with PDFBox 2.0.
The input file must not be used as output for saving operations. It
will corrupt the file and throw an exception as parts of the file are
read the first time when saving it.
So you can either swap to an earlier 2.x version of PDFBox, or you need to use the new Loader method. I believe this should work:
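// PDFBox 3.x: the Loader class lives in the org.apache.pdfbox package
import org.apache.pdfbox.Loader;
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = Loader.loadPDF(file);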
I need a way to ignore pictures and photos in a PDF document when converting it to a DOCX file.
I am creating an instance of FineReader Engine:
IEngine engine = Engine.InitializeEngine(
engineConfig.getDllFolder(), engineConfig.getCustomerProjectId(),
engineConfig.getLicensePath(), engineConfig.getLicensePassword(), "", "", false);
After that, I am converting a document:
IFRDocument document = engine.CreateFRDocument();
document.AddImageFile(file.getAbsolutePath(), null, null);
document.Process(null);
String exportPath = FileUtil.prepareExportPath(file, resultFolder);
document.Export(exportPath, FileExportFormatEnum.FEF_DOCX, null);
As a result, it converts all images from the initial pdf document.
When exporting PDF to DOCX you should pass export parameters; for this you can use IRTFExportParams. You can create the object like this:
IRTFExportParams irtfExportParams = engine.CreateRTFExportParams();
and then set its WritePictures property like this:
irtfExportParams.setWritePictures(false);
Here engine is the main IEngine interface; I assume you know how to initialize it.
You also have to pass parameters to the document.Process() method (document is the IFRDocument).
Process() takes an IDocumentProcessingParams object. That object has a setPageProcessingParams() method, which takes an IPageProcessingParams object (you can create one with engine.CreatePageProcessingParams()). The IPageProcessingParams object has these methods:
iPageProcessingParams.setPerformAnalysis(true);
iPageProcessingParams.setPageAnalysisParams(iPageAnalysisParams);
Pass true to the first method,
and pass an IPageAnalysisParams object to the second one (IPageAnalysisParams iPageAnalysisParams = engine.CreatePageAnalysisParams()).
As the last step, call setDetectPictures(false) on iPageAnalysisParams. That's all.
Then, when you export the document, pass the export parameters like this:
IFRDocument document = engine.CreateFRDocument();
document.Export(filePath, FileExportFormatEnum.FEF_DOCX, irtfExportParams);
I hope this answer helps.
I'm not really familiar with PDF to DOCX conversion, but I think you could try custom profiles suited to your needs.
At some point in your code you create an Engine object and then a Document object (or an IFRDocument object, depending on your application). Add this line just before giving your document to the engine for processing:
engine.LoadProfile(PROFILE_FILENAME);
Then create your profile file with the processing parameters described in the "Working with Profiles" section of the documentation packaged with your FRE installation.
Do not forget to add this to your file:
... some params under other sections
[PageAnalysisParams]
DetectText = TRUE --> force text detection
DetectPictures = FALSE --> ignore pictures
... other params under PageAnalysisParams
... some params under other sections
It works the same way for barcodes and other elements. Keep in mind that you should benchmark your results when adding or removing settings in this file, as they may affect processing speed and the overall quality of the result.
What do PDF input pages contain? What is expected in MS Word?
It would be great if you would attach an example of an input PDF file and an example of the desired result in MS Word format.
Then it will be much easier to give a useful recommendation.
I'm trying to extract some information from a set of PDFs. This works so far, but one PDF is giving me trouble.
I'm using PDFBox 1.8.8, with Java 7.
PDDocument document = PDDocument.load(pdfFile);
PDFTextStripper stripper = new PDFTextStripper();
System.out.println("File: "+pdfFile.getAbsolutePath()+" readable: "+pdfFile.canRead()+" size: "+pdfFile.length());
System.out.println(stripper.getText(document));
It just prints
File: /foo/bar/mypdf.pdf readable: true size: 1267743
Then it terminates. Usually I use the writeText method and funnel the text through a stream, but the code above is simplified. I've tried converting the PDF with pdftotext, and it works just like it does for the others.
I get no exception, no nothing. Any ideas?
EDIT:
Additional Info: Created with Acrobat Distiller 9.0.0 (Windows), Format PDF-1.6; The other PDFs are Version 1.4 and 1.5
Doesn't seem to contain exotic characters. I can mark/copy text in Evince PDF-viewer
EDIT2:
Dang it. File property dialog (Nautilus) said "Security: No", but pdfinfo gives me:
Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:AES)
Any way to circumvent that? After all, pdftotext could get the text out.
The document was "encrypted" (write protected), but with no user password set. This Stack Overflow answer shows how you can remove the encryption and simply read the file: "remove encryption from pdf with pdfbox, like qpdf".
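Since the file property dialog reported "Security: No" while pdfinfo reported encryption, it may help to inspect the pdfinfo line before handing a file to PDFBox. The following is a small sketch in plain Java, assuming the line format matches the sample in the question; PdfInfoEncryption and parse are hypothetical names, and running pdfinfo itself is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses the "Encrypted:" line printed by pdfinfo (format assumed from the sample above). */
final class PdfInfoEncryption {
    private static final Pattern FLAG = Pattern.compile("(\\w+):(\\w+)");

    /** Turns "Encrypted: yes (print:yes copy:no ...)" into a flag-to-value map. */
    static Map<String, String> parse(String line) {
        Map<String, String> flags = new LinkedHashMap<>();
        // Everything after the first colon, e.g. "yes (print:yes copy:no ...)".
        String rest = line.substring(line.indexOf(':') + 1).trim();
        flags.put("encrypted", rest.split("\\s+")[0]);
        int open = rest.indexOf('(');
        int close = rest.lastIndexOf(')');
        if (open >= 0 && close > open) {
            // Collect each name:value pair inside the parentheses.
            Matcher m = FLAG.matcher(rest.substring(open + 1, close));
            while (m.find()) {
                flags.put(m.group(1), m.group(2));
            }
        }
        return flags;
    }
}
```

For the line from the question, parse(...).get("copy") returns "no", which is consistent with text extraction being refused while the file still opens without a password.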
I'm attempting to render a Notes document to RTF, then DXL, using the Java API. Once I have the DXL, I'm converting it to HTML with an XSL stylesheet. My goal is to produce an HTML document that displays as close as possible to how the document renders in the Notes client.
However, computed fields are missing from the rendered RTF and DXL.
Here is the code used to generate the DXL:
private String renderDocumentToDxl(lotus.domino.Document lotusDocument)
throws Exception {
Database db = getDatabase();
lotus.domino.Document tmp = db.createDocument();
RichTextItem rti = tmp.createRichTextItem("Body");
lotusDocument.computeWithForm(true, false);
lotusDocument.save();
lotusDocument.renderToRTItem(rti);
DxlExporter dxlExporter = getSession().createDxlExporter();
dxlExporter.setOutputDOCTYPE(false);
dxlExporter.setConvertNotesBitmapsToGIF(true);
return dxlExporter.exportDxl(tmp);
}
Fields added to the document by the call to computeWithForm are not present in the generated DXL.
Is there any way to get the computed fields into the generated DXL with the Java API? Or is there a better way to generate an HTML representation of a Notes document using the Domino Java API?
I'm not quite clear on your objective. There are two possibilities:
1) You want the items from lotusDocument to exist in tmp, and to be exported as actual tag data in the DXL. Your code does not do this.
2) You want the values of the non-hidden Items from lotusDocument to exist as text within the rich text Body item in tmp, and you want those values to be included within the DXL that is exported from tmp - as text within the tag for the Body item. This should be what your code is doing.
If you expected the former, then that's not what renderToRTItem does. What it does is the latter. I.e., it gives you a snapshot of the values of the items in lotusDocument, but if and only if they would be displayed to a user who opens the document. You do not get the items themselves, and they won't appear separately in the DXL. If that's all you expected, and it's not happening, then there's something else going wrong and you haven't given enough information here to figure it out.
If you wanted the former, i.e., the actual items from lotusDocument to exist as separate tag elements within the DXL exported from tmp, then you should be using
lotusDocument.copyAllItems(tmp,true);,
or sequences of
Item tmpItem = lotusDocument.getFirstItem(itemName);
tmp.copyItem(tmpItem,"");
You can get the HTML representation of a RichText field with the URL
http://server/db.nsf/view/docunid/RichTextFieldname?OpenField
So, save your tmp document, get the docunid and read the result via http from URL
http://server/db.nsf/0/tmpdocunid/Body?OpenField
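Reading the field over HTTP could look roughly like the sketch below. It assumes Java 11 or later, that the server answers this request without a login, and that the path parts need no URL encoding; OpenFieldFetcher and its method names are hypothetical. The UNID would come from tmp.getUniversalID() after tmp.save().

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Builds and fetches an ?OpenField URL following the pattern quoted above. */
final class OpenFieldFetcher {
    /** Returns http://server/dbPath/0/unid/fieldName?OpenField. */
    static URI openFieldUri(String server, String dbPath, String unid, String fieldName) {
        return URI.create("http://" + server + "/" + dbPath + "/0/" + unid + "/" + fieldName + "?OpenField");
    }

    /** Fetches the rendered HTML; assumes anonymous access is allowed (an assumption). */
    static String fetchHtml(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

A call such as OpenFieldFetcher.fetchHtml(OpenFieldFetcher.openFieldUri("server", "db.nsf", unid, "Body")) would then return the HTML of the Body item, provided the assumptions above hold for your server.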
You don't need to call lotusDocument.computeWithForm, as lotusDocument.renderToRTItem already executes the form's input translation and validation formulas.
Be aware that with both methods the form's LotusScript code won't be executed, in case your fields get calculated that way.
In case you can use XPages this would be an alternative: http://linqed.eu/2014/07/11/getting-html-from-any-richtext-item/
We have an Oracle BPM 10g activity that:
Reads a form-fill protected Word document template.
Merges data into the fields.
Saves the merged/filled copy to the filesystem.
Prints the document to a selected, pre-defined printer, OR to the default printer.
All of this works fine when printing to a "real" printer. However, there is now a need to output the Word document to TIFF. Attempting to use "Microsoft Document Image Writer" as one of the printer selections does not work as expected. Normally, when printing to the Microsoft Document Image Writer from Word (or any other application) directly, you're prompted for a location to save the resultant file. This prompting does not occur when attempting to print from this particular activity in BPM 10g.
Ideally, we actually would like to bypass the dialog and output the TIFF directly to the filesystem. However, I have not found a way to control this programmatically. That is, being able to specify the destination filename in code. Right now, I'm just trying to get output to the Microsoft Document Image Writer at all, to make sure it works.
So, the bottom line question(s) is/are:
Can this be done? I.e., printing to Microsoft Document Image Writer
If yes, can the file location dialog be suppressed?
How?
You said nothing about the way you're automating Word.
In Word VBA, you may use this sample to print out the active document immediately without showing the print dialog:
Public Sub PrintToXPS()
'Presume that Microsoft XPS Document Writer was already
'set up as ActivePrinter
Dim strFilePath As String
strFilePath = "C:\temp\helloworld.xps"
ActiveDocument.PrintOut Background:=False, outputfilename:=strFilePath
End Sub
There's no need to use the print dialog. However, if you want to operate through the dialog object, that can be done in Word using a variable of type Word.Dialog and providing the necessary parameters, e.g.
Dim dlgFilePrint As Word.Dialog
Set dlgFilePrint = Application.Dialogs(wdDialogFilePrint)
dlgFilePrint.Update
dlgFilePrint.PrToFileName = strFilePath
dlgFilePrint.printtofile = True
'add other parameters as needed ...
'look up parameter names in Word VBA Online Help using "WdWordDialog-Enumeration"
'as the keyword
dlgFilePrint.Execute
What I did here with the XPS printer can of course be done with any other printer as well.
Thank you, domke consulting.
After more searching, I found this forum post on MSDN.
Adding these registry entries to suppress the dialog box and suppress post-generation output seemed to do the trick:
In HKEY_CURRENT_USER\Software\Microsoft\Office\12.0\MODI\MDI Writer
PrivateFlags = 17 (Decimal)
OpenInMODI = 0 (Decimal)
For our purposes, this seems to work fine if we call the printOut() method with the following relevant arguments (other arguments omitted here for brevity):
document.printOut(outputFileName : "C:\\temp\\fileName.tif", printToFile : true);