Extracting PDF annotations/comments [duplicate]

This question already has answers here:
How to extract Highlighted Parts from PDF files
(2 answers)
Closed 2 years ago.
We have a fairly complex print workflow in which the controlling department adds comments and annotations to draft versions of generated PDF documents using Adobe Reader or Adobe Acrobat. As part of the workflow, imported PDF documents should be parsed and their annotations and comments imported into a CMS (together with the PDF).
Q: Are there any reliable tools (preferably Python or Java) for extracting such data from PDF files in a clean and reliable way?

This code should do the job. One of the answers to the question "Parse annotations from a pdf" was very helpful in getting me to write the code below. It uses the poppler library to parse the annotations. This is a link to annotations.pdf.
code
import poppler, os.path

# poppler expects a file:// URI rather than a plain path
path = 'file://%s' % os.path.realpath('annotations.pdf')
doc = poppler.document_new_from_file(path, None)
pages = [doc.get_page(i) for i in range(doc.get_n_pages())]
for page_no, page in enumerate(pages):
    # collect the text of every annotation on the page, skipping empty ones
    items = [i.annot.get_contents() for i in page.get_annot_mapping()]
    items = [i for i in items if i]
    print("page: %s comments: %s " % (page_no + 1, items))
output
page: 1 comments: ['This is an annotation']
page: 2 comments: [' Please note ', ' Please note ', 'This is a comment in the text']
installation
On Ubuntu the installation is as follows.
apt-get install python-poppler

Related

Apache PDFBox wrong access permission [duplicate]

This question already has an answer here:
Security Method is No Security but Page Extraction and Document Assembly is not Allowed
(1 answer)
Closed 2 years ago.
I'm trying to extract access permissions with Apache PDFBox. The problem is that all the permissions are set to true.
For example, I extracted the Document Assembly property as follows:
PDDocument doc = PDDocument.load(new File(filePath));
AccessPermission ap = doc.getCurrentAccessPermission();
boolean documentAssembly = ap.canAssembleDocument();
The documentAssembly variable is true. However, when I check the permissions in Adobe Reader, I find that the document assembly property is set to not allowed.
Is there a way to extract the correct information, as Adobe Reader shows it?

What you see on the security tab is a summary of all document restrictions that apply. In particular, some restrictions depend only on the PDF viewer you use. If I look at the same dialog in Adobe Acrobat (not Reader), for example, I see different values.
Obviously PDFBox does not know which viewer you will use, so it cannot take viewer-specific restrictions into account.

Reading pdf ebook's contents and split pdf file accordingly

I have some huge technical pdf ebooks and I would like to split them in a way that helps me find and read exactly the parts I want from each book. I am talking about indexed pdf files, with contents (parts and chapters). I have come up with the following splitting scheme, based on the pdf's contents:
1. Read the book's contents.
2. Create a root folder for the entire book.
3. Create one subfolder for each part of the book.
4. Split the book into one pdf file per chapter and place the pdfs (chapters) in the corresponding subfolder (part).
How can this be done using a Java or Python pdf library?

You can use PyPDF2 to read and split your PDF files.
Here is how you can export PDF pages:
import PyPDF2

def export_pdf_pages(input_pdf_path, page_first, page_last, output_pdf_path):
    # page_first and page_last are 1-based and inclusive
    with open(input_pdf_path, "rb") as input_stream:
        input_pdf = PyPDF2.PdfFileReader(input_stream)
        output = PyPDF2.PdfFileWriter()
        for index in range(page_first - 1, page_last):
            try:
                page = input_pdf.getPage(index)
            except IndexError:
                fmt = 'Missing page {page_num} in "{input_pdf_path}"'
                msg = fmt.format(page_num=index + 1, input_pdf_path=input_pdf_path)
                raise IndexError(msg)
            output.addPage(page)
        # write while the input stream is still open, since pages are read lazily
        with open(output_pdf_path, "wb") as output_stream:
            output.write(output_stream)
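To connect this to the splitting scheme in the question, the page range of each chapter can be derived from the book's contents before any PDF library is involved. The following is a minimal sketch, assuming the contents have already been read into a list of entries; the `TocEntry` and `ChapterRange` types, the `plan` method and the folder layout are hypothetical names chosen for illustration, not part of any PDF library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns table-of-contents entries into per-chapter page ranges.
class ChapterRanges {

    /** One contents entry: part name, chapter title and 1-based first page. */
    static final class TocEntry {
        final String part;
        final String chapter;
        final int firstPage;

        TocEntry(String part, String chapter, int firstPage) {
            this.part = part;
            this.chapter = chapter;
            this.firstPage = firstPage;
        }
    }

    /** Target path of a chapter file plus its inclusive, 1-based page range. */
    static final class ChapterRange {
        final String relativePath;
        final int firstPage;
        final int lastPage;

        ChapterRange(String relativePath, int firstPage, int lastPage) {
            this.relativePath = relativePath;
            this.firstPage = firstPage;
            this.lastPage = lastPage;
        }
    }

    /** Assumes the entries are sorted by firstPage. */
    static List<ChapterRange> plan(String bookTitle, List<TocEntry> toc, int totalPages) {
        List<ChapterRange> ranges = new ArrayList<>();
        for (int i = 0; i < toc.size(); i++) {
            TocEntry entry = toc.get(i);
            // A chapter ends one page before the next one starts; the last ends with the book.
            int lastPage = (i + 1 < toc.size()) ? toc.get(i + 1).firstPage - 1 : totalPages;
            // Layout from the question: root folder / part subfolder / chapter file.
            String path = bookTitle + "/" + entry.part + "/" + entry.chapter + ".pdf";
            ranges.add(new ChapterRange(path, entry.firstPage, lastPage));
        }
        return ranges;
    }
}
```

Each resulting range uses the same 1-based, inclusive convention as the `page_first` and `page_last` arguments of `export_pdf_pages` above, so the two can be combined one chapter at a time.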

Splitting word file into multiple smaller word files using OLE Automation from java

I have been using OLE automation from Java to access methods for Word.
I managed to do the following using the OLE automation:
Open a Word document template file.
Mail merge the Word document template with a csv datasource file.
Save the mail merged file to a new Word document file.
What I need to do now is open the mail merged file and then use OLE to programmatically split it into multiple files. That is, if the original mail merged file has 6000 pages and my max pages per file property is set to 3000 pages, I need to create two new Word document files and place the first 3000 pages in one and the last 3000 pages in the other.
On my first attempts I took the number of rows in the csv file and multiplied it by the number of pages in the template to get the total number of pages after merging. Then I used the merge to create the multiple files. The problem, however, is that I cannot calculate exactly how many pages the merged document will have, because in some cases not all of, say, 9 pages of the template will be used, depending on the data and the merge fields. So during the mail merge one row might create only 3 pages (using the 9 page template) and another might create 9 pages.
So the only solution is to merge all rows into one document and then split it into multiple documents thereafter, to ensure that each file contains exactly the number of pages given by the property (for example 3000) until there are no more pages left from the original merged file.
I have already tried a few things, using the MSDN site to look up methods and their properties, but have been unable to do this.
On my last attempts I have been trying to use GoTo to get to a specific page number and then remove the page. I was going to do this one by one for each page until I got to where I want the file to start, and then save it as a new file, but I have been unable to do that either.
Can anyone suggest something that could help me out?
An example of opening a Word file using OLE automation from Java is included below:
Code sample
OleAutomation documentsAutomation = this.getChildAutomation(this.wordAutomation, "Documents");
int[] id = documentsAutomation.getIDsOfNames(new String[]{"Open"});
Variant[] arguments = new Variant[1];
arguments[0] = new Variant(fileName); // where fileName is the absolute path to the docx file
Variant invokeResult = documentsAutomation.invoke(id[0], arguments);

private OleAutomation getChildAutomation(OleAutomation automation, String childName) {
    int[] id = automation.getIDsOfNames(new String[]{childName});
    Variant pVarResult = automation.getProperty(id[0]);
    return pVarResult.getAutomation();
}

Sounds like you've pegged it already. Another approach, which would avoid building and then deleting, is to look at the parts of your template that can make the biggest difference to the number of pages it produces (that is, where the data can be multi-line). If you take these fields and look at properties such as the font, line spacing and line width, you can calculate the room your data will take in the template and limit your data at that point. Java FontMetrics can help you with that.
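Whichever way the merged document ends up being cut, the page ranges for the output files follow from the total page count and the max pages per file property. The following is a minimal sketch of that arithmetic only; the `PageChunks` class and its `split` method are hypothetical names, and obtaining the page count or copying the pages through OLE is deliberately not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes which pages go into which output file.
class PageChunks {

    /** Returns inclusive, 1-based {first, last} page ranges of at most maxPagesPerFile pages. */
    static List<int[]> split(int totalPages, int maxPagesPerFile) {
        if (totalPages < 0 || maxPagesPerFile <= 0) {
            throw new IllegalArgumentException("totalPages must be >= 0 and maxPagesPerFile > 0");
        }
        List<int[]> chunks = new ArrayList<>();
        for (int first = 1; first <= totalPages; first += maxPagesPerFile) {
            // The final chunk may be shorter when the total is not an exact multiple.
            int last = Math.min(first + maxPagesPerFile - 1, totalPages);
            chunks.add(new int[] {first, last});
        }
        return chunks;
    }
}
```

For the example in the question, `PageChunks.split(6000, 3000)` yields the ranges 1 to 3000 and 3001 to 6000; a 7000 page document would add a third, shorter range from 6001 to 7000.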

How can I convert a Word document to PDF? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
How can I convert a Word document to PDF when the document contains various things, such as tables? When I try to use iText, the converted PDF looks different from the original document. Is there an open source API / library, rather than calling out to an executable, that I can use?

This is quite a hard task, even harder if you want perfect results (impossible without using Word). As such, I believe the number of open source, pure Java APIs that just do it all for you is zero (Update: I am wrong, see below).
Your basic options are as follows:
Script MS Office using JNI, a C# web service, etc. (the only option for 100% perfect results).
Script Open Office using the available APIs (90+% perfect).
Use Apache POI & iText (a very large job that will never be perfect).
Update - 2016-02-11
Here is a cut down copy of my blog post on this subject which outlines existing products that support Word-to-PDF in Java.
Converting Microsoft Office (Word, Excel) documents to PDFs in Java
Products that I know of that can render Office documents:
yeokm1/docs-to-pdf-converter
Irregularly maintained, Pure Java, Open Source
Ties together a number of libraries to perform the conversion.
xdocreport
Actively developed, Pure Java, Open Source
It's a Java API to merge an XML document created with MS Office (docx), OpenOffice (odt) or LibreOffice (odt) with a Java model to generate a report and, if needed, convert it to another format (PDF, XHTML...).
Snowbound Imaging SDK
Closed Source, Pure Java
Snowbound appears to be a 100% Java solution and costs over $2,500. It contains samples describing how to convert documents in the evaluation download.
OpenOffice API
Open Source, Not Pure Java - Requires Open Office installed
OpenOffice is a native Office suite which supports a Java API. This supports reading Office documents and writing PDF documents. The SDK contains an example of document conversion (examples/java/DocumentHandling/DocumentConverter.java). To write PDFs you need to pass the "writer_pdf_Export" writer rather than the "MS Word 97" one.
Or you can use the wrapper API JODConverter.
JDocToPdf - Dead as of 2016-02-11
Uses Apache POI to read the Word document and iText to write the PDF. Completely free, 100% Java but has some limitations.

You can use JODConverter for this purpose. It converts documents between different office formats, such as:
Microsoft Office to OpenDocument, and vice versa
Any format to PDF
It supports many more conversions as well.
It can also convert MS Office 2007 documents to PDF, covering almost all formats.
More details about it can be found here:
http://www.artofsolving.com/opensource/jodconverter

Docx4j is open source and the best API for converting Docx to PDF without any alignment or font issues.
Maven Dependencies:
<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-JAXB-Internal</artifactId>
    <version>8.0.0</version>
</dependency>
<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
    <version>8.0.0</version>
</dependency>
<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-JAXB-MOXy</artifactId>
    <version>8.0.0</version>
</dependency>
<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-export-fo</artifactId>
    <version>8.0.0</version>
</dependency>
Code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.docx4j.Docx4J;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class DocToPDF {
    public static void main(String[] args) {
        String inputFilePath = "D:\\Workspace\\New\\Sample.docx";
        String outputFilePath = "D:\\Workspace\\New\\Sample.pdf";
        // try-with-resources closes both streams even if the conversion fails
        try (InputStream in = new FileInputStream(inputFilePath);
             OutputStream out = new FileOutputStream(outputFilePath)) {
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(in);
            Docx4J.toPDF(wordMLPackage, out);
            out.flush();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Check out docs-to-pdf-converter on GitHub. It's a lightweight solution designed specifically for converting documents to PDF.
Why?
I wanted a simple program that can convert Microsoft Office documents
to PDF but without dependencies like LibreOffice or expensive
proprietary solutions. Seeing as how code and libraries to convert
each individual format is scattered around the web, I decided to
combine all those solutions into one single program. Along the way, I
decided to add ODT support as well since I encountered the code too.

It's already 2019, and I can't believe there is still no easy, convenient way to convert the most popular Microsoft Word documents to Adobe PDF format in the Java world.
I tried almost every method mentioned in the answers above, and found that the best and only way to satisfy my requirements is to use OpenOffice or LibreOffice. I don't know the exact difference between them, but both seem to provide the soffice command line.
My requirements are:
It must run on Linux, more specifically CentOS, not on Windows, so we cannot install Microsoft Office on it;
It must support Chinese characters, so the ISO-8859-1 character encoding is not an option; it must support Unicode.
The first thing that came to mind was docs-to-pdf-converter, but it lacks maintenance: the last update happened 4 years ago, and I will not use an unmaintained solution. Xdocreport seems a promising choice, but it can only convert docx, not the binary doc format, which is mandatory for me. Using Java to call the OpenOffice API seems good, but too complicated for such a simple requirement.
Finally I found the best solution: use the OpenOffice command line to do the job:
Runtime.getRuntime().exec("soffice --convert-to pdf --outdir . /path/some.doc");
I always believe the shortest code is the best code (as long as it is understandable), and that's it.
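A one-line `Runtime.exec` call returns as soon as the process is started, so the caller cannot tell whether the PDF was actually written. The following is a minimal sketch of the same command run through `ProcessBuilder`, waiting for the exit code. It assumes `soffice` is on the `PATH`; the `SofficeConverter` class, its method names and the timeout parameter are hypothetical, and the added `--headless` option is an assumption that suits a server without a display.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around the soffice command line; assumes soffice is on the PATH.
class SofficeConverter {

    /** Builds the argument list; passing arguments separately avoids problems with spaces in paths. */
    static List<String> buildCommand(Path input, Path outputDir) {
        return Arrays.asList(
                "soffice", "--headless",
                "--convert-to", "pdf",
                "--outdir", outputDir.toString(),
                input.toString());
    }

    /** Runs the conversion and reports whether soffice exited with code 0 within the timeout. */
    static boolean convert(Path input, Path outputDir, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, outputDir))
                .inheritIO() // forward soffice's own messages to this process's console
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // give up on a hung conversion
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

A call such as `SofficeConverter.convert(Paths.get("/path/some.doc"), Paths.get("."), 60)` would then block until the conversion finishes or the timeout expires.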

You can use Cloudmersive native Java library. It is free for up to 50,000 conversions/month and is much higher fidelity in my experience than other things like iText or Apache POI-based methods. The documents actually look the same as they do in Microsoft Word which for me is the key. Incidentally it can also do XLSX, PPTX, and the legacy DOC, XLS and PPT conversion to PDF.
Here is what the code looks like, first add your imports:
import java.io.File;
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
import com.cloudmersive.client.ConvertDocumentApi;
Then convert a file:
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
ConvertDocumentApi apiInstance = new ConvertDocumentApi();
File inputFile = new File("/path/to/input.docx"); // File to perform the operation on.
try {
byte[] result = apiInstance.convertDocumentDocxToPdf(inputFile);
System.out.println(result);
} catch (ApiException e) {
System.err.println("Exception when calling ConvertDocumentApi#convertDocumentDocxToPdf");
e.printStackTrace();
}
You can get a document conversion API key for free from the portal.

I agree with the posters listing OpenOffice as a high-fidelity import/export facility for Word / PDF docs with a Java API; it also works across platforms. OpenOffice import/export filters are pretty powerful and preserve most formatting during conversion to various formats, including PDF. Docmosis and JODReports add value by making life easier than learning the OpenOffice API directly, which can be challenging because of the style of the UNO API and its crash-related bugs.

Using JACOB to call Office Word is a 100% perfect solution, but it is only supported on Windows because it needs Office Word installed.
Download the JACOB archive (the latest version is 1.19);
Add jacob.jar to your project classpath;
Add jacob-1.19-x32.dll or jacob-1.19-x64.dll (depending on your JDK version) to ...\Java\jdk1.x.x_xxx\jre\bin
Use the JACOB API to call Office Word to convert doc/docx to pdf.
import java.io.File;
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.ComThread;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public void convertDocx2pdf(String docxFilePath) {
    File docxFile = new File(docxFilePath);
    String pdfFile = docxFilePath.substring(0, docxFilePath.lastIndexOf(".docx")) + ".pdf";
    if (docxFile.exists()) {
        if (!docxFile.isDirectory()) {
            ActiveXComponent app = null;
            long start = System.currentTimeMillis();
            try {
                ComThread.InitMTA(true);
                app = new ActiveXComponent("Word.Application");
                Dispatch documents = app.getProperty("Documents").toDispatch();
                Dispatch document = Dispatch.call(documents, "Open", docxFilePath, false, true).toDispatch();
                // remove a stale PDF from an earlier run before saving
                File target = new File(pdfFile);
                if (target.exists()) {
                    target.delete();
                }
                // 17 is Word's wdFormatPDF save format
                Dispatch.call(document, "SaveAs", pdfFile, 17);
                Dispatch.call(document, "Close", false);
                long end = System.currentTimeMillis();
                logger.info("============Convert Finished: " + (end - start) + "ms");
            } catch (Exception e) {
                logger.error(e.getLocalizedMessage(), e);
                throw new RuntimeException("pdf convert failed.");
            } finally {
                if (app != null) {
                    app.invoke("Quit", new Variant[] {});
                }
                ComThread.Release();
            }
        }
    }
}
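One detail in the listing above is worth a closer look: the output name is built with `lastIndexOf(".docx")`, which returns -1 for a `.doc` path, so `substring(0, -1)` throws a `StringIndexOutOfBoundsException` even though the steps mention both doc and docx. The following is a minimal sketch of a name helper that accepts both extensions; `PdfNames` and `toPdfPath` are hypothetical names, not part of JACOB.

```java
import java.util.Locale;

// Hypothetical helper: derives the PDF file name for a .doc or .docx path.
class PdfNames {

    static String toPdfPath(String wordFilePath) {
        // Compare case-insensitively so REPORT.DOCX is accepted as well.
        String lower = wordFilePath.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".docx") || lower.endsWith(".doc")) {
            // Cut at the final dot, which is where the extension starts.
            return wordFilePath.substring(0, wordFilePath.lastIndexOf('.')) + ".pdf";
        }
        throw new IllegalArgumentException("Not a .doc or .docx file: " + wordFilePath);
    }
}
```

With this, `PdfNames.toPdfPath("C:\\docs\\report.docx")` gives `C:\docs\report.pdf`, and a `.doc` path is handled the same way instead of failing.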

unoconv is a Python tool that works on UNIX.
I use Java to invoke it from the shell on UNIX, and it works perfectly for me. My source code: UnoconvTool.java. Both JODConverter and unoconv are said to use OpenOffice/LibreOffice.
docx4j/docxreport, POI and PDFBox are good, but they are missing some formats in conversion.

Open source java library for HTML to text conversion [duplicate]

This question already has answers here:
Remove HTML tags from a String
(35 answers)
Closed 1 year ago.
Can you recommend an open source Java library (preferably ASL/BSD/LGPL license) that converts HTML to plain text: strips all the tags, converts entities (&amp;, &nbsp;, etc.) and handles <br> and tables properly?
More Info
I have the HTML as a string, so there's no need to fetch it from the web. Also, what I'm looking for is a method like this:
String convertHtmlToPlainText(String html)

Try Jericho.
The TextExtractor class sounds like it will do what you want. Sorry can't post a 2nd link as I'm a new user but scroll down the homepage a bit and there's a link to it.

HtmlUnit, it even shows the page after processing JavaScript / Ajax.

The bliki engine can do this, in two steps. See info.bliki.wiki / Home
How to convert HTML to Mediawiki text -- Mediawiki text is already a rather plain text format, but you can convert it further
How to convert Mediawiki text to plain text -- your goal.
It will be some 7-8 lines of code, like this:
// html to wiki
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
String sbodyhtml = readFile( infilepath ); //get content as string
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML( sbodyhtml );
String resultwiki = conv.toWiki(new ToWikipedia());
WikiModel wikiModel = new WikiModel("${image}", "${title}");
String plainStr = wikiModel.render(new PlainTextConverter(false), resultwiki );
System.out.println( plainStr );
Jsoup can do this more simply:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();
but in the result you lose all paragraph formatting -- there will be no newlines at all.
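If keeping line breaks matters and an extra dependency is not wanted, the HTML parser that ships with the JDK (in the `javax.swing.text.html` packages) can be used through a callback. The following is a minimal sketch with the method signature from the question; the `HtmlToText` class name is hypothetical, and treating only `<br>` and `</p>` as newlines is an assumption made here. Tables and other block elements would need further handling.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper built on the HTML parser bundled with the JDK.
class HtmlToText {

    static String convertHtmlToPlainText(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data); // text arrives with entities already decoded
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.BR) {
                    out.append('\n'); // keep explicit line breaks
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.P) {
                    out.append('\n'); // end each paragraph with a newline
                }
            }
        };
        try {
            // true: ignore any charset directive, since the input is already a String
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }
}
```

Calling `HtmlToText.convertHtmlToPlainText("<p>Hello &amp; goodbye</p><p>Line<br>two</p>")` keeps the two paragraphs and the `<br>` on separate lines, which the one-line Jsoup call above does not.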

I use TagSoup, it is available for several languages and does a really good job with HTML found "in the wild". It produces either a cleaned up version of the HTML or XML, that you can then process with some DOM/SAX parser.

I've used Apache Commons Lang to go the other way. But it looks like it can do what you need via StringEscapeUtils.
