Converting docx file to pdf - java

I want to convert word document(docx) to pdf format using apache.poi.xwpf.It convert fine.But cover pages and diagrams not converting.I mention my code following.I want to know what are the jar and how to convert docx to pdf. So please be kind enough to solve my problem.
package javaapplication1;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
/**
*
* #author Manos_T
*/
public class JavaApplication1 {
/**
* #param args the command line arguments
*/
public static void main(String[] args) throws FileNotFoundException, IOException {
String filePath = "C:/Users/manos_t/Desktop/777.docx";
FileInputStream fInputStream = new FileInputStream(new File(filePath));
// XWPFDocument document = new XWPFDocument(Data.class.getResourceAsStream(filePath));
XWPFDocument document = new XWPFDocument(fInputStream);
File outFile = new File("C:/Users/manos_t/Desktop/777.pdf");
outFile.getParentFile().mkdirs();
OutputStream out = new FileOutputStream(outFile);
PdfOptions options = PdfOptions.create().fontEncoding("windows-1250");
PdfConverter.getInstance().convert(document, out, options);
System.out.println("Sucess");
}
}

Related

How to fix a java error that states that a package is accessible to more than one module

I am doing a coding project where I am trying to input a file into java and output information about the file. I have found code online that does this for PDF's. The line "import org.xml.sax.SAXException;" keeps giving me an error and stating that the package org.xml.sax is accessible to more than one module. Can someone help me with this?
Sorry to bother you all, I am a new coder just trying to figure this out.
Here is the code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class PDFTika
{
public static void main(final String[] args) throws
IOException,TikaException
{
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new
File("/Users/relli/OneDrive/Documents/Asparta/example.pdf"));
ParseContext pcontext = new ParseContext();
//parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
//getting the content of the document
System.out.println("Contents of the PDF :" +
handler.toString());
//getting metadata of the document
System.out.println("Metadata of the PDF:");
String[] metadataNames = metadata.names();
for(String name : metadataNames)
{
System.out.println(name+ " : " + metadata.get(name));
}
}
}
Method 1: code is a copy of the code provided by Gabriel Katz. I have managed to fix the error just by adding another exception (SAXException) in code.
Method 2: is a simplified version of parsing the PDF content only.
Code Snippet Info:
This code is used to parse PDF data using the Apache Tika package. It will display the pdf content as string and print metadata of PDF file
Method 1: parse PDF and print PDF content and metadata
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
public class PDFTika {
public static void main(final String[] args) throws IOException, TikaException, SAXException {
File file = new File("example.pdf");
FileInputStream inputstream = new FileInputStream(file);
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
//parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
//getting the content of the document
System.out.println("Contents of the PDF :" + handler.toString());
//getting metadata of the document
System.out.println("Metadata of the PDF:");
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
System.out.println(name + " : " + metadata.get(name));
}
}
}
Method 2: parse PDF data and print content as a string
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
public class TikaParser {
public static void main(String[] args) throws IOException, TikaException {
File file = new File("example.pdf");
FileInputStream inputstream = new FileInputStream(file);
Tika tika = new Tika();
String fileContent = tika.parseToString(inputstream);
System.out.println(fileContent);
}
}
<!--Please add following dependencies for testng-->
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.24.1</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.24.1</version>
</dependency>
</dependencies>

Unprotect word document using java

how can we unprotect the word document using java apache poi? I have protected the document as read-only using password pro-grammatically.Now I want to unprotect it. How can we do ? Is there any method to unprotect the document. I have used removePasswordProtection() but that document is not editable even after using that method.
The sample code that I have used for protection is
XWPFDocument document=new XWPFDocument();
document.enforceReadonlyProtection(strPassword,HashAlgorithm.sha1);
The document is getting protected successfully.
But when I am unprotecting document using the below code snippet it is not working.
if(document.isEnforcedReadonlyProtection())
{
if(document.validateProtectionPassword(strPassword))
{
document.removeProtectionEnforcement();
}
}
Can anyone help me what method that I can use to unprotect the document?
Cannot reproducing.
Following code produces two Word documents. One, WordProtected.docx, which is protected and one, WordUnprotected.docx in which protection is removed.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.poifs.crypt.HashAlgorithm;
class XWPFReadOnlyProtection {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument();
String strPassword = "password";
document.enforceReadonlyProtection(strPassword, HashAlgorithm.sha1);
FileOutputStream fileout = new FileOutputStream("WordProtected.docx");
document.write(fileout);
fileout.close();
document.close();
document = new XWPFDocument(new FileInputStream("WordProtected.docx"));
document.removeProtectionEnforcement();
fileout = new FileOutputStream("WordUnprotected.docx");
document.write(fileout);
fileout.close();
document.close();
}
}
use this code for Word to Protect
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class WordTest {
public static void main(String[] args) throws IOException {
FileInputStream in = new FileInputStream("D:\\govind.doc");
BufferedInputStream bin = new BufferedInputStream(in);
POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
Biff8EncryptionKey.setCurrentUserPassword("P#ssw0rd");
HWPFDocument doc = new HWPFDocument(poiFileSystem);
Range range = doc.getRange();
FileOutputStream out = new FileOutputStream("D:\\govind.doc");
doc.write(out);
out.close();
}
}
this is use for protected word File unportected
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class wordFileTest {
public static void main(String[] args) throws IOException {
geenrateUnprotectedFile("D:\\","govind","1234");
}
public static void geenrateUnprotectedFile(String filePath,String fileName,String pwdtxt) {
try {
FileInputStream in = new FileInputStream(filePath+fileName+".doc");
BufferedInputStream bin = new BufferedInputStream(in);
POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
Biff8EncryptionKey.setCurrentUserPassword(pwdtxt);
HWPFDocument doc = new HWPFDocument(poiFileSystem);
String docType=doc.getDocumentText();
FileOutputStream out = new FileOutputStream(filePath+fileName+"12.doc");
out.write(docType.getBytes());
System.out.println("don");
}catch (Exception e) {
e.printStackTrace();
}
}
}

Printing results from PDF box output to a text file

I am working on a class which Parses a PDF document with PDF Box, its purpose is to create a text file (its name is PdfTestFile.txt) with the results. We have gotten it to print the parsed text to the console, but I don't know how to make it write the results to the .txt file that the class creates (name is PdfTestFile.txt).
I tried to use out.print(Text); but it gives me an error saying that:
out cannot be resolved
The class PdfEasyManager calls the class EasySearch in which we see the error mentioned above.
Below is the code that I have where the String Text is what I would like to print to the file PdfTestFile.txt:
Class " PdfEasyManager":
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
public class PdfEasyManager {
static BufferedWriter writer;
public static void main(String[] args) throws IOException {
//writer = new BufferedWriter(new FileWriter("Evergreen.txt"));
EasySearch easysearch = new EasySearch();
// pdfManager.setFilePath("PDFextTEST.pdf");
System.out.println(easysearch.ToText());
//out.println(easysearch.ToText());
}
}
Class "EasySearch" :
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.IOException;
public class EasySearch {
private PDFParser parser;
private PDFTextStripper pdfStripper;
private PDDocument pdDoc ;
private COSDocument cosDoc ;
private String Text ;
private String filePath;
private File file;
static BufferedWriter writer;
//writer = new BufferedWriter(new FileWriter(BLnumber + (date.toString().substring(4, 10))+ ".org"));
public EasySearch() {
}
//public static void main(String args[]) throws Exception{
public String ToText() throws IOException
{
this.pdfStripper = null;
this.pdDoc = null;
this.cosDoc = null;
writer = new BufferedWriter(new FileWriter("PdfTestFile.txt"));
file = new File("C:/Users/Jon Smith/Desktop/Sample.pdf");
parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdDoc.getNumberOfPages();
pdfStripper.setStartPage(1);// reading text from page 1
// pdfStripper.setEndPage(10);// to 10
pdfStripper.setEndPage(pdDoc.getNumberOfPages());// if you want to get text from full pdf file use this code
Text = pdfStripper.getText(pdDoc);
out.print(Text); //this is the line that gives me the error
return Text;
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
}
You are using out which is not present in your class. Use System.out.print(Text).
Thanks for the help but
writer.write(Text);
solves the issue I was having

Tika detect custom metadat fileds

I created a simple class that using tika library to extract metadata from files like PDF, html, XLS, DOC,..
files can have custom metadata. I need to detect that and ignore for first step!
But i can see how to do that with Tika!
this is my simple code to extract all metadata:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
public class TikaParse {
public static String resPFldMeta = new String();
public static String resPFldMetaValue = new String();
#SuppressWarnings("deprecation")
public static String ParseFieldMetadata(String filename) throws Exception {
int j;
FileInputStream is = null;
File f = new File(filename);
is = new FileInputStream(f);
ContentHandler contenthandler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
AutoDetectParser parser = new AutoDetectParser();
parser.parse(is, contenthandler, metadata,new ParseContext());
String[] metadataNames = metadata.names();
// get field name of all metadata
for(j=0;j<metadataNames.length-1; j++){
resPFldMeta += "\""+(metadataNames[j]).trim()+"\",";
}
resPFldMeta += "\""+(metadataNames[j]).trim()+"\"";
return resPFldMeta;
}
//.....
}
SO, My question is : how to check if the metadat detected is custom metadata or is normalized metadata??

how to read comments in word document from apache poi?

How to Read word comments (Annotation) from microsoft word document ?
please provide some example code if possible ...
Thanking you ...
Finally, I found the answer
here is the code snippet ...
File file = null;
FileInputStream fis = null;
HWPFDocument document = null;
Range commentRange = null;
try {
file = new File(fileName);
fis = new FileInputStream(file);
document = new HWPFDocument(fis);
commentRange = document.getCommentsRange();
int numComments = commentRange.numParagraphs();
for (int i = 0; i < numComments; i++) {
String comments = commentRange.getParagraph(i).text();
comments = comments.replaceAll("\\cM?\r?\n", "").trim();
if (!comments.equals("")) {
System.out.println("comment :- " + comments);
}
}
} catch (Exception e) {
e.printStackTrace();
}
I am using Poi poi-3.5-beta7-20090719.jar, poi-scratchpad-3.5-beta7-20090717.jar. The other archives - poi-ooxml-3.5-beta7-20090717.jar and poi-dependencies-3.5-beta7-20090717.zip - will be needed if you are hoping to work on the OpenXML based file formats.
I appreciate the help of Mark B who actually found this solution ....
Get the HWPFDocument object (by passing a Word document in an input stream, say).
Then you can get the summary via getSummaryInformation(), and that will give you a SummaryInformation object via getSummary()
Please refer the following link,it may fulfill yr requirements...
http://bihag.wordpress.com/2009/11/04/how-to-read-comments-from-word-with-poi-jav/#comment-13
Am also new to apache poi. Hear is my program its working fine this program extract word form doc to text...I hope this program will help u before u run this program u can set corresponding lib files in your classpath.
/*
* FileExtract.java
*
* Created on April 12, 2010, 9:46 AM
*
* To change this template, choose Tools | Template Manager
* and open the template in the editor.
*/
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
import java.io.*;
import org.apache.poi.POIOLE2TextExtractor.*;
import org.apache.poi.POIOLE2TextExtractor;
import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.hdgf.extractor.VisioTextExtractor;
import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.ss.extractor.ExcelExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import javax.swing.text.Document;
/**
*
* #author ChandraMouil V
*/
public class RtfDocTextExtract {
/** Creates a new instance of FileExtract */
static String filePath;
static String rtfFile;
static FileInputStream fis;
static int x=0;
public RtfDocTextExtract() {
}
//This function for .DOC File
public static void meth(String filePath) {
try {
if(x!=0){
fis = new FileInputStream("D:/DummyRichTextFormat.doc");
POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
WordExtractor oleTextExtractor = (WordExtractor) ExtractorFactory.createExtractor(fileSystem);
String[] paragraphText = oleTextExtractor.getParagraphText();
FileWriter fw = new FileWriter("E:/resume-template.txt");
for (String paragraph : paragraphText) {
fw.write(paragraph);
}
fw.flush();
}
}catch(Exception e){
e.printStackTrace();
}
}
}

Categories

Resources