How to read comments in a Word document with Apache POI?

How can I read comments (annotations) from a Microsoft Word document?
Please provide some example code if possible.
Thank you.

Finally, I found the answer.
Here is the code snippet:
File file = null;
FileInputStream fis = null;
HWPFDocument document = null;
Range commentRange = null;
try {
    file = new File(fileName);
    fis = new FileInputStream(file);
    document = new HWPFDocument(fis);
    // The comments (annotations) are exposed as their own range.
    commentRange = document.getCommentsRange();
    int numComments = commentRange.numParagraphs();
    for (int i = 0; i < numComments; i++) {
        String comments = commentRange.getParagraph(i).text();
        // Strip line endings and surrounding whitespace.
        comments = comments.replaceAll("\\cM?\r?\n", "").trim();
        if (!comments.equals("")) {
            System.out.println("comment :- " + comments);
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
I am using POI with poi-3.5-beta7-20090719.jar and poi-scratchpad-3.5-beta7-20090717.jar. The other archives, poi-ooxml-3.5-beta7-20090717.jar and poi-dependencies-3.5-beta7-20090717.zip, are needed only if you want to work with the OpenXML-based file formats.
My thanks go to Mark B, who actually found this solution.
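
The replaceAll("\\cM?\r?\n", "") call in the snippet above is easy to misread. In a Java regex, \cM is control-M, which is the carriage return, so the pattern removes a line feed together with up to two carriage returns directly before it. A carriage return with no line feed after it is not matched by the pattern; at the start or end of the string it is removed by trim() instead. The following is a small sketch using only the standard library. The class name, method names and sample strings are made up for illustration and do not come from a real document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CommentTextCleaner {
    // Same pattern as the snippet above: optional control-M (carriage return),
    // optional literal carriage return, then a mandatory line feed.
    private static final String LINE_END = "\\cM?\r?\n";

    /** Removes every match of LINE_END, then trims leading and trailing whitespace. */
    static String clean(String raw) {
        return raw.replaceAll(LINE_END, "").trim();
    }

    /** Cleans each paragraph and drops the ones that end up empty. */
    static List<String> nonEmpty(List<String> rawParagraphs) {
        List<String> result = new ArrayList<>();
        for (String raw : rawParagraphs) {
            String cleaned = clean(raw);
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample paragraphs with different line endings.
        List<String> raw = Arrays.asList("First comment\r\n", "\r", "Second comment\r");
        System.out.println(nonEmpty(raw)); // prints [First comment, Second comment]
    }
}
```

Note that the second and third samples are cleaned by trim(), not by the regex, because they contain no line feed.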

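Get the HWPFDocument object (for example, by passing the Word document in as an input stream).
You can then call getSummaryInformation(), which gives you a SummaryInformation object holding the document summary. Note that the comments stored there are the Comments document property, which is not the same thing as comments (annotations) attached to the text.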

Please refer to the following link; it may meet your requirements:
http://bihag.wordpress.com/2009/11/04/how-to-read-comments-from-word-with-poi-jav/#comment-13

I am also new to Apache POI. Here is my program; it works fine and extracts the text of a .doc file into a text file. I hope it helps. Before you run it, add the corresponding library files to your classpath.
/*
 * RtfDocTextExtract.java
 *
 * Created on April 12, 2010, 9:46 AM
 */
import java.io.FileInputStream;
import java.io.FileWriter;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

/**
 * @author ChandraMouil V
 */
public class RtfDocTextExtract {

    // Extracts the paragraphs of the .doc file at filePath into a text file.
    public static void meth(String filePath) {
        try {
            FileInputStream fis = new FileInputStream(filePath);
            POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
            WordExtractor oleTextExtractor = (WordExtractor) ExtractorFactory.createExtractor(fileSystem);
            String[] paragraphText = oleTextExtractor.getParagraphText();
            FileWriter fw = new FileWriter("E:/resume-template.txt");
            for (String paragraph : paragraphText) {
                fw.write(paragraph);
            }
            fw.close(); // close() also flushes the writer
            fis.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        meth("D:/DummyRichTextFormat.doc");
    }
}
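
The extraction program above writes each paragraph to a FileWriter. As a sketch of the same write step, the version below uses try-with-resources so that the output file is flushed and closed even if a write fails. It uses only the standard library: the String array stands in for the result of WordExtractor.getParagraphText(), and the names ParagraphWriter and writeParagraphs are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ParagraphWriter {
    /** Writes every paragraph to target; the writer is closed even when a write fails. */
    static void writeParagraphs(String[] paragraphs, Path target) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String paragraph : paragraphs) {
                writer.write(paragraph);
            }
        } // leaving the try block flushes and closes the writer
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the String[] returned by WordExtractor.getParagraphText().
        String[] paragraphs = {"First paragraph\r\n", "Second paragraph\r\n"};
        Path target = Files.createTempFile("extracted", ".txt");
        writeParagraphs(paragraphs, target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding, which a plain FileWriter uses on older Java versions.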

Related

Reading a Docx/Doc File in Java

I tried to read a .docx file in Java, but I get two compile errors: "The constructor XWPFDocument(FileInputStream) is undefined" at line 16 (the new XWPFDocument(fis) call) and "Type mismatch: cannot convert from XWPFParagraph[] to List" at line 18 (the document.getParagraphs() call).
My code is below.
Imported classes:
org.apache.poi.xwpf.usermodel.XWPFDocument;
org.apache.poi.xwpf.usermodel.XWPFParagraph;
Can anyone tell me why I am getting these errors and how to resolve them?
Thanks in advance!
package com.readindDocx;

import java.io.File;
import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ReadingDocument {
    public static void main(String[] args) {
        try {
            File file = new File("D:/SampleWordFile.docx");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            XWPFDocument document = new XWPFDocument(fis);
            List<XWPFParagraph> paragraphs = document.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                System.out.println(para.getText());
            }
            fis.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
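
The question title covers both .doc and .docx, and the two formats need different POI classes: HWPFDocument reads the binary OLE2 .doc format, while XWPFDocument reads the ZIP-based .docx format. The sketch below does not address the compile errors in the question; it only shows one way to tell the two containers apart by their first four bytes before choosing a class. It uses only the standard library, and the names WordFormatSniffer and Format are hypothetical. A ZIP signature alone does not prove the file is a .docx, since other formats such as .xlsx and .jar are ZIP files too.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class WordFormatSniffer {
    enum Format { OLE2_DOC, ZIP_PACKAGE, UNKNOWN }

    // First four bytes of an OLE2 compound file (binary .doc).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0};
    // First four bytes of a ZIP local file header (.docx is a ZIP package).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a buffer holding the first bytes of a file. */
    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Format.OLE2_DOC;
        if (startsWith(header, ZIP_MAGIC)) return Format.ZIP_PACKAGE;
        return Format.UNKNOWN;
    }

    /** Reads the first four bytes of the file and classifies them. */
    static Format sniff(Path file) throws IOException {
        byte[] header = new byte[4];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < header.length
                    && (n = in.read(header, filled, header.length - filled)) != -1) {
                filled += n;
            }
        }
        return filled == header.length ? sniff(header) : Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more file paths on the command line.
        for (String arg : args) {
            System.out.println(arg + ": " + sniff(Paths.get(arg)));
        }
    }
}
```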

Unprotect word document using java

How can I unprotect a Word document using Java and Apache POI? I protected the document as read-only with a password programmatically, and now I want to unprotect it. Is there a method for this? I used removePasswordProtection(), but the document is still not editable afterwards.
The sample code I used for protection is:
XWPFDocument document = new XWPFDocument();
document.enforceReadonlyProtection(strPassword, HashAlgorithm.sha1);
The document is protected successfully.
But when I try to unprotect the document with the code snippet below, it does not work.
if (document.isEnforcedReadonlyProtection()) {
    if (document.validateProtectionPassword(strPassword)) {
        document.removeProtectionEnforcement();
    }
}
Can anyone tell me which method I can use to unprotect the document?

I cannot reproduce this.
The following code produces two Word documents: WordProtected.docx, which is protected, and WordUnprotected.docx, in which the protection has been removed.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.poifs.crypt.HashAlgorithm;

class XWPFReadOnlyProtection {
    public static void main(String[] args) throws Exception {
        // Create a new document and enforce read-only protection.
        XWPFDocument document = new XWPFDocument();
        String strPassword = "password";
        document.enforceReadonlyProtection(strPassword, HashAlgorithm.sha1);
        FileOutputStream fileout = new FileOutputStream("WordProtected.docx");
        document.write(fileout);
        fileout.close();
        document.close();
        // Reopen the protected document and remove the protection.
        document = new XWPFDocument(new FileInputStream("WordProtected.docx"));
        document.removeProtectionEnforcement();
        fileout = new FileOutputStream("WordUnprotected.docx");
        document.write(fileout);
        fileout.close();
        document.close();
    }
}

Use this code to protect a Word file:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class WordTest {
    public static void main(String[] args) throws IOException {
        FileInputStream in = new FileInputStream("D:\\govind.doc");
        BufferedInputStream bin = new BufferedInputStream(in);
        POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
        Biff8EncryptionKey.setCurrentUserPassword("P#ssw0rd");
        HWPFDocument doc = new HWPFDocument(poiFileSystem);
        Range range = doc.getRange();
        FileOutputStream out = new FileOutputStream("D:\\govind.doc");
        doc.write(out);
        out.close();
    }
}
Use this code to unprotect a protected Word file:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class wordFileTest {
    public static void main(String[] args) throws IOException {
        geenrateUnprotectedFile("D:\\", "govind", "1234");
    }

    public static void geenrateUnprotectedFile(String filePath, String fileName, String pwdtxt) {
        try {
            FileInputStream in = new FileInputStream(filePath + fileName + ".doc");
            BufferedInputStream bin = new BufferedInputStream(in);
            POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
            Biff8EncryptionKey.setCurrentUserPassword(pwdtxt);
            HWPFDocument doc = new HWPFDocument(poiFileSystem);
            String docType = doc.getDocumentText();
            FileOutputStream out = new FileOutputStream(filePath + fileName + "12.doc");
            out.write(docType.getBytes());
            out.close();
            System.out.println("don");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Reading text of a pdf using PDFBOX occasionally returns \r\n

I’m currently using PDFBox to read the text of a set of pdfs that I’ve inherited.
I’m only interested in reading the text, not making any changes to the file.
The code that works for most of the files is:
File pdfFile = myPath.toFile();
PDDocument document = PDDocument.load(pdfFile);
Writer sw = new StringWriter();
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.writeText(document, sw);
String documentText = sw.toString();
For most files, I wind up with the text in the documentText variable.
But for 3 of 24 files, the documentText content is “\r\n” for the first file, “\r\n\r\n” for the second, and “\r\n\r\n\r\n” for the third. The three files are not consecutive; multiple good files lie between each of them.
The File is derived from a java.nio.Path. The WindowsFileAttribute that is part of the Path has a size of 279K, so the file is not empty on disk.
I can open the file and view the data, and it looks like the other files that my code reads.
I’m using Java 8.0.121, and PDFBox 2.0.4. (this is the latest version, I believe.)
Any suggestions? Is there a better way to read the text? (I’m not interested in the formatting, or fonts used, just the text.)
Thanks.

Reading multiple PDF docs using pdfbox in java
package readwordfile;

import java.io.BufferedReader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

/**
 * This is an example on how to extract words from PDF document
 *
 * @author saravanan
 */
public class GetWordsFromPDF extends PDFTextStripper {
    static List<String> words = new ArrayList<String>();

    public GetWordsFromPDF() throws IOException {
    }

    /**
     * @param args
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException {
        String files;
        // FileWriter fs = new FileWriter("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");
        // FileInputStream fstream1 = new FileInputStream("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");
        // DataInputStream in1 = new DataInputStream(fstream1);
        // BufferedReader br1 = new BufferedReader(new InputStreamReader(in1));
        String path = "C:\\Users\\saravanan\\Desktop\\New folder\\"; //local folder path name
        File folder = new File(path);
        File[] listOfFiles = folder.listFiles();
        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
                files = listOfFiles[i].getName();
                if (files.endsWith(".pdf") || files.endsWith(".PDF")) {
                    String nfiles = "C:\\Users\\saravanan\\Desktop\\New folder\\";
                    String fileName1 = nfiles + files;
                    System.out.print("\n\n" + files + "\n");
                    PDDocument document = null;
                    try {
                        document = PDDocument.load(new File(fileName1));
                        PDFTextStripper stripper = new GetWordsFromPDF();
                        stripper.setSortByPosition(true);
                        stripper.setStartPage(0);
                        stripper.setEndPage(document.getNumberOfPages());
                        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
                        stripper.writeText(document, dummy);
                        int x = 0;
                        System.out.println("");
                        for (String word : words) {
                            if (word.startsWith("xxxxxx")) { //here you can give your pdf doc starting word
                                x = 1;
                            }
                            if (x == 1) {
                                if (!(word.endsWith("YYYYYY"))) { //here you can give your pdf doc ending word
                                    System.out.print(word + " ");
                                    // fs.write(word);
                                } else {
                                    x = 0;
                                    break;
                                }
                            }
                        }
                    } finally {
                        if (document != null) {
                            document.close();
                            words.clear();
                        }
                    }
                }
            }
        }
    }

    /**
     * Override the default functionality of PDFTextStripper.writeString()
     *
     * @param str
     * @param textPositions
     * @throws java.io.IOException
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        String[] wordsInStream = str.split(getWordSeparator());
        if (wordsInStream != null) {
            for (String word : wordsInStream) {
                words.add(word); //store the pdf content into the List
            }
        }
    }
}
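
The loop in the program above prints the words from a start marker up to an end marker. Isolated from PDFBox, that filtering step can be sketched as follows. This uses only the standard library; the class name MarkerFilter, the method name between, and the sample words and markers are made up for illustration. It follows the same rules as the loop above: the word that starts with the start marker is included, and the word that ends with the end marker is not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MarkerFilter {
    /**
     * Returns the words from the first one that starts with startPrefix up to,
     * but not including, the first word at or after it that ends with endSuffix.
     */
    static List<String> between(List<String> words, String startPrefix, String endSuffix) {
        List<String> selected = new ArrayList<>();
        boolean inside = false;
        for (String word : words) {
            if (word.startsWith(startPrefix)) {
                inside = true;
            }
            if (inside) {
                if (word.endsWith(endSuffix)) {
                    break; // end marker reached; it is not included
                }
                selected.add(word);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Invented sample words standing in for the list filled by writeString().
        List<String> words = Arrays.asList("header", "BEGIN", "alpha", "beta", "END", "footer");
        System.out.println(between(words, "BEGIN", "END")); // prints [BEGIN, alpha, beta]
    }
}
```

Returning a list instead of printing inside the loop makes the selection easy to test and to write to a file later.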

Converting docx file to pdf

I want to convert a Word document (docx) to PDF using apache.poi.xwpf. The conversion works, but cover pages and diagrams are not converted. My code follows. I would like to know which jars are needed and how to convert docx to PDF. Please help me solve this problem.
package javaapplication1;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

/**
 *
 * @author Manos_T
 */
public class JavaApplication1 {
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws FileNotFoundException, IOException {
        String filePath = "C:/Users/manos_t/Desktop/777.docx";
        FileInputStream fInputStream = new FileInputStream(new File(filePath));
        // XWPFDocument document = new XWPFDocument(Data.class.getResourceAsStream(filePath));
        XWPFDocument document = new XWPFDocument(fInputStream);
        File outFile = new File("C:/Users/manos_t/Desktop/777.pdf");
        outFile.getParentFile().mkdirs();
        OutputStream out = new FileOutputStream(outFile);
        PdfOptions options = PdfOptions.create().fontEncoding("windows-1250");
        PdfConverter.getInstance().convert(document, out, options);
        System.out.println("Sucess");
    }
}

Converting doc into PDF in Android: Unable to execute dex

I am converting a doc file into PDF format on Android using the following libraries:
itext-1.4.8.jar
poi-3.0-FINAL.jar
poi-scratchpad-3.2-FINAL.jar
Here is my sample code:
package com.example.converter;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import android.content.Context;
import android.os.Environment;
import android.widget.LinearLayout;
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class TestCon extends LinearLayout {
    FileInputStream infile;
    private static String FILE = Environment.getExternalStorageDirectory()
            + "/MyReport.pdf";

    public TestCon(Context context) {
        super(context);
        my_method(context);
    }

    public void my_method(Context context) {
        POIFSFileSystem fs = null;
        Document document = new Document();
        try {
            infile = (FileInputStream) context.getApplicationContext().getAssets().open("test.doc");
        } catch (IOException e1) {
            // TODO Auto-generated catch block
            e1.printStackTrace();
        }
        try {
            System.out.println("Starting the test");
            fs = new POIFSFileSystem(infile);
            HWPFDocument doc = new HWPFDocument(fs);
            WordExtractor we = new WordExtractor(doc);
            OutputStream file = new FileOutputStream(FILE);
            PdfWriter writer = PdfWriter.getInstance(document, file);
            Range range = doc.getRange();
            document.open();
            writer.setPageEmpty(true);
            document.newPage();
            writer.setPageEmpty(true);
            String[] paragraphs = we.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                org.apache.poi.hwpf.usermodel.Paragraph pr = range
                        .getParagraph(i);
                // CharacterRun run = pr.getCharacterRun(i);
                // run.setBold(true);
                // run.setCapitalized(true);
                // run.setItalic(true);
                paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
                System.out.println("Length:" + paragraphs[i].length());
                System.out.println("Paragraph" + i + ": "
                        + paragraphs[i].toString());
                // add the paragraph to the document
                document.add(new Paragraph(paragraphs[i]));
            }
            System.out.println("Document testing completed");
        } catch (Exception e) {
            System.out.println("Exception during test");
            e.printStackTrace();
        } finally {
            // close the document
            document.close();
        }
    }
}
But I am getting this error:
[2013-05-10 12:39:12 - Dex Loader] Unable to execute dex: Multiple dex files define Lorg/apache/poi/generator/FieldIterator;
[2013-05-10 12:39:12 - converter] Conversion to Dalvik format failed: Unable to execute dex: Multiple dex files define Lorg/apache/poi/generator/FieldIterator;
I removed android-support-v4.jar from the lib folder, according to this answer about the error, but I am still getting the same error :(
Please help me solve this issue.
If anyone has done the doc to PDF conversion, please share your code.
I will be very thankful :)
Regards

The problem is that you are including something two or more times:
Multiple dex files define Lorg/apache/poi/generator/FieldIterator
Review your build path for duplicated libraries.
In addition, once this is resolved, you will probably have to add this line to the project.properties file:
dex.force.jumbo=true
This lets you work around the 65535 methods limit problem for some time.
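
Reviewing the build path by hand gets tedious with several jars. The sketch below lists the class files that appear in more than one jar, which is the situation the "Multiple dex files define" message describes. It uses only java.util.zip from the standard library. The names DuplicateClassFinder and findDuplicates are hypothetical, and the jar locations in the usage line are assumptions about the project layout.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class DuplicateClassFinder {
    /** Maps each .class entry found in more than one jar to the jars that contain it. */
    static Map<String, List<Path>> findDuplicates(List<Path> jars) throws IOException {
        Map<String, List<Path>> owners = new TreeMap<>();
        for (Path jar : jars) {
            try (ZipFile zip = new ZipFile(jar.toFile())) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    String name = entries.nextElement().getName();
                    if (name.endsWith(".class")) {
                        owners.computeIfAbsent(name, k -> new ArrayList<>()).add(jar);
                    }
                }
            }
        }
        // Keep only the classes that are defined by two or more jars.
        owners.values().removeIf(paths -> paths.size() < 2);
        return owners;
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is the path of one jar to scan.
        List<Path> jars = new ArrayList<>();
        for (String arg : args) {
            jars.add(Paths.get(arg));
        }
        findDuplicates(jars).forEach((cls, paths) -> System.out.println(cls + " -> " + paths));
    }
}
```

For example, assuming the jars sit in a libs directory: java DuplicateClassFinder libs/poi-3.0-FINAL.jar libs/poi-scratchpad-3.2-FINAL.jar libs/itext-1.4.8.jar. Any class it prints is defined by more than one jar, and one of those jars has to be removed or replaced by a matching version.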
