Reading a Docx/Doc File in Java

I tried to read a .docx file in Java, but I get the error "The constructor XWPFDocument(FileInputStream) is undefined" at LINE NO: 16 and "Type mismatch: cannot convert from XWPFParagraph[] to List" at LINE NO: 18.
My code is below.
Classes used (from the Apache POI jars):
org.apache.poi.xwpf.usermodel.XWPFDocument;
org.apache.poi.xwpf.usermodel.XWPFParagraph;
Can anyone tell me why I am getting these errors and how to resolve them?
Thanks in advance!
package com.readindDocx;
import java.io.File;
import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
public class ReadingDocument {
public static void main(String[] args) {
try {
File file = new File("D:/SampleWordFile.docx");
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
for (XWPFParagraph para : paragraphs) {
System.out.println(para.getText());
}
fis.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}

Related

How to fix a java error that states that a package is accessible to more than one module

I am working on a coding project where I read a file into Java and output information about it. I found code online that does this for PDFs. The line "import org.xml.sax.SAXException;" keeps giving me an error stating that the package org.xml.sax is accessible to more than one module. Can someone help me with this?
Sorry to bother you all; I am a new coder just trying to figure this out.
Here is the code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class PDFTika
{
public static void main(final String[] args) throws
IOException,TikaException
{
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new
File("/Users/relli/OneDrive/Documents/Asparta/example.pdf"));
ParseContext pcontext = new ParseContext();
//parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
//getting the content of the document
System.out.println("Contents of the PDF :" +
handler.toString());
//getting metadata of the document
System.out.println("Metadata of the PDF:");
String[] metadataNames = metadata.names();
for(String name : metadataNames)
{
System.out.println(name+ " : " + metadata.get(name));
}
}
}

Method 1 is a copy of the code provided by Gabriel Katz. I managed to fix the error just by adding one more exception (SAXException) to the throws clause of main.
Method 2 is a simplified version that parses the PDF content only.
Code snippet info:
This code parses PDF data using the Apache Tika package. It displays the PDF content as a string and prints the metadata of the PDF file.
Method 1: parse the PDF and print its content and metadata
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
public class PDFTika {
public static void main(final String[] args) throws IOException, TikaException, SAXException {
File file = new File("example.pdf");
FileInputStream inputstream = new FileInputStream(file);
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
//parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
//getting the content of the document
System.out.println("Contents of the PDF :" + handler.toString());
//getting metadata of the document
System.out.println("Metadata of the PDF:");
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
System.out.println(name + " : " + metadata.get(name));
}
}
}
Method 2: parse PDF data and print content as a string
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
public class TikaParser {
public static void main(String[] args) throws IOException, TikaException {
File file = new File("example.pdf");
FileInputStream inputstream = new FileInputStream(file);
Tika tika = new Tika();
String fileContent = tika.parseToString(inputstream);
System.out.println(fileContent);
}
}
<!-- Please add the following Apache Tika dependencies -->
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.24.1</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.24.1</version>
</dependency>
</dependencies>

Unprotect word document using java

How can we unprotect a Word document using Java and Apache POI? I have protected the document as read-only with a password programmatically, and now I want to unprotect it. Is there any method to do this? I have used removePasswordProtection(), but the document is still not editable after calling that method.
The sample code that I used for protection is:
XWPFDocument document=new XWPFDocument();
document.enforceReadonlyProtection(strPassword,HashAlgorithm.sha1);
The document is getting protected successfully.
But when I am unprotecting document using the below code snippet it is not working.
if(document.isEnforcedReadonlyProtection())
{
if(document.validateProtectionPassword(strPassword))
{
document.removeProtectionEnforcement();
}
}
Can anyone tell me which method I can use to unprotect the document?

Cannot reproduce.
The following code produces two Word documents: WordProtected.docx, which is protected, and WordUnprotected.docx, in which the protection has been removed.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.poifs.crypt.HashAlgorithm;
class XWPFReadOnlyProtection {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument();
String strPassword = "password";
document.enforceReadonlyProtection(strPassword, HashAlgorithm.sha1);
FileOutputStream fileout = new FileOutputStream("WordProtected.docx");
document.write(fileout);
fileout.close();
document.close();
document = new XWPFDocument(new FileInputStream("WordProtected.docx"));
document.removeProtectionEnforcement();
fileout = new FileOutputStream("WordUnprotected.docx");
document.write(fileout);
fileout.close();
document.close();
}
}

Use this code to protect a Word file:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class WordTest {
public static void main(String[] args) throws IOException {
FileInputStream in = new FileInputStream("D:\\govind.doc");
BufferedInputStream bin = new BufferedInputStream(in);
POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
Biff8EncryptionKey.setCurrentUserPassword("P#ssw0rd");
HWPFDocument doc = new HWPFDocument(poiFileSystem);
Range range = doc.getRange();
FileOutputStream out = new FileOutputStream("D:\\govind.doc");
doc.write(out);
out.close();
}
}
Use this to unprotect a protected Word file:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class wordFileTest {
public static void main(String[] args) throws IOException {
geenrateUnprotectedFile("D:\\","govind","1234");
}
public static void geenrateUnprotectedFile(String filePath,String fileName,String pwdtxt) {
try {
FileInputStream in = new FileInputStream(filePath+fileName+".doc");
BufferedInputStream bin = new BufferedInputStream(in);
POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
Biff8EncryptionKey.setCurrentUserPassword(pwdtxt);
HWPFDocument doc = new HWPFDocument(poiFileSystem);
String docType=doc.getDocumentText();
FileOutputStream out = new FileOutputStream(filePath+fileName+"12.doc");
out.write(docType.getBytes());
// close the stream so the written bytes are flushed to disk
out.close();
System.out.println("done");
}catch (Exception e) {
e.printStackTrace();
}
}
}

The source attachment does not contain the source for the file JSON.class

Is there anything wrong with my code? I'm new to Java and I'm trying to import a file into MongoDB. However, there is an error and I have no idea what it means. I am using Eclipse.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import com.mongodb.util.JSON;
import com.mongodb.util.JSONParseException;
public class readwrite {
public static void main(String[] args) throws FileNotFoundException,IOException,JSONParseException{
Mongo mongo = new Mongo("localhost", 27017);
DB db = mongo.getDB("actualdata");
DBCollection collection = db.getCollection("metadata");
String line = null;
StringBuilder sb = new StringBuilder();
try {
FileInputStream fstream = null;
try {
fstream = new FileInputStream("/home/Output/json1-100000-all");
} catch (FileNotFoundException e) {
e.printStackTrace();
System.out.println("File does not exist, exiting");
return;
}
BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader(fstream));
while((line = bufferedReader.readLine()) != null) {
System.out.println(line);
DBObject dbObject;
sb.append(dbObject = (DBObject) JSON.parse(bufferedReader.readLine()));
collection.insert(dbObject);
DBCursor cursorDoc = collection.find();
while (cursorDoc.hasNext()) {
System.out.println(cursorDoc.next());
}
}
bufferedReader.close();
}
catch(FileNotFoundException ex) {
System.out.println("Unable to open file");
}
catch(IOException ex) {
System.out.println(
"Error reading file");
}
}
}
This is the error that is displayed
[
Exception in thread "main" com.mongodb.util.JSONParseException:
{
^
at com.mongodb.util.JSONParser.read(JSON.java:272)
at com.mongodb.util.JSONParser.parseObject(JSON.java:230)
at com.mongodb.util.JSONParser.parse(JSON.java:195)
at com.mongodb.util.JSONParser.parse(JSON.java:145)
at com.mongodb.util.JSON.parse(JSON.java:81)
at com.mongodb.util.JSON.parse(JSON.java:66)
at readwrite.main(readwrite.java:45)
Eclipse shows me this when I click on "at com.mongodb.util.JSONParser.read(JSON.java:272)": it says that the source is not found and that the source attachment does not contain the source for the file JSON.class.
I can print the output of the BufferedReader if I leave out the conversion to DBObject. Thanks in advance!

1) Didn't you mean to write JSON.parse(line)
instead of JSON.parse(bufferedReader.readLine())?
Calling readLine() a second time inside the loop skips every other line and may make it try to parse null on the last iteration.
2) If that doesn't help, could you get the exact string value of line on the failing iteration? (This should be easy with a debugger or by simply printing to System.out.)
Regards
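To illustrate point 1, here is a minimal sketch that uses only the standard library. The class and method names are hypothetical, a StringReader stands in for the input file, and no MongoDB driver is involved. It only shows which lines each loop shape would hand to the parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the question's code.
class ReadLineDemo {

    // Mirrors the loop in the question: readLine() is called twice per iteration.
    static List<String> buggy(String text) {
        List<String> handedToParser = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Second read: drops the line read above and can be null at end of input.
                handedToParser.add(reader.readLine());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return handedToParser;
    }

    // Corrected loop: use the line that the while condition already read.
    static List<String> fixed(String text) {
        List<String> handedToParser = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                handedToParser.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return handedToParser;
    }

    public static void main(String[] args) {
        String input = "{\"a\":1}\n{\"b\":2}\n{\"c\":3}";
        System.out.println("buggy: " + buggy(input));
        System.out.println("fixed: " + fixed(input));
    }
}
```

With three input lines, the buggy loop passes on only the second line followed by null, while the fixed loop passes on all three lines.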

Converting doc into PDF in Android, Unable to execute dex

I am converting a doc file into PDF format on Android using the following libraries:
itext-1.4.8.jar
poi-3.0-FINAL.jar
poi-scratchpad-3.2-FINAL.jar
Here is my sample code:
package com.example.converter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import android.content.Context;
import android.os.Environment;
import android.widget.LinearLayout;
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class TestCon extends LinearLayout {
FileInputStream infile;
private static String FILE = Environment.getExternalStorageDirectory()
+ "/MyReport.pdf";
public TestCon(Context context) {
super(context);
my_method(context);
}
public void my_method(Context context) {
POIFSFileSystem fs = null;
Document document = new Document();
try {
infile = (FileInputStream) context.getApplicationContext().getAssets().open("test.doc");
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
try {
System.out.println("Starting the test");
fs = new POIFSFileSystem(infile);
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
OutputStream file = new FileOutputStream(FILE);
PdfWriter writer = PdfWriter.getInstance(document, file);
Range range = doc.getRange();
document.open();
writer.setPageEmpty(true);
document.newPage();
writer.setPageEmpty(true);
String[] paragraphs = we.getParagraphText();
for (int i = 0; i < paragraphs.length; i++) {
org.apache.poi.hwpf.usermodel.Paragraph pr = range
.getParagraph(i);
// CharacterRun run = pr.getCharacterRun(i);
// run.setBold(true);
// run.setCapitalized(true);
// run.setItalic(true);
paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
System.out.println("Length:" + paragraphs[i].length());
System.out.println("Paragraph" + i + ": "
+ paragraphs[i].toString());
// add the paragraph to the document
document.add(new Paragraph(paragraphs[i]));
}
System.out.println("Document testing completed");
} catch (Exception e) {
System.out.println("Exception during test");
e.printStackTrace();
} finally {
// close the document
document.close();
}
}
}
but I am getting this error
[2013-05-10 12:39:12 - Dex Loader] Unable to execute dex: Multiple dex files define Lorg/apache/poi/generator/FieldIterator;
[2013-05-10 12:39:12 - converter] Conversion to Dalvik format failed: Unable to execute dex: Multiple dex files define Lorg/apache/poi/generator/FieldIterator;
I removed android-support-v4.jar from the lib folder, as suggested in this answer about the error, but I am still getting the same error :(
Please help me solve this issue.
If anyone has done doc to PDF conversion, please share your code.
I will be very thankful :)
Regards

The problem is that you are including something two or more times:
Multiple dex files define Lorg/apache/poi/generator/FieldIterator
Review your build path for duplicated libraries.
In addition, once this is resolved, you'll probably have to add this line to the project.properties file:
dex.force.jumbo=true
This will let you work around the 65535 methods limit for some time.
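One way to review a build path for duplicates is to list the class entries that occur in more than one jar. The following is only a sketch under stated assumptions: DuplicateClassFinder and findDuplicates are hypothetical names, not part of the Android tools or POI, and the jar paths are whatever you pass on the command line. It relies on the fact that a jar is a ZIP archive, so java.util.zip can read its entry names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper; not part of any Android or POI tooling.
class DuplicateClassFinder {

    // Maps each .class entry name to the jars that contain it,
    // keeping only the entries found in two or more jars.
    static Map<String, List<String>> findDuplicates(List<Path> jars) {
        Map<String, List<String>> owners = new TreeMap<>();
        for (Path jar : jars) {
            try (ZipFile zip = new ZipFile(jar.toFile())) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    String name = entries.nextElement().getName();
                    if (name.endsWith(".class")) {
                        owners.computeIfAbsent(name, k -> new ArrayList<>())
                              .add(jar.getFileName().toString());
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        // Drop every class that lives in exactly one jar.
        owners.values().removeIf(list -> list.size() < 2);
        return owners;
    }

    public static void main(String[] args) {
        List<Path> jars = new ArrayList<>();
        for (String arg : args) {
            jars.add(Paths.get(arg));
        }
        findDuplicates(jars).forEach(
                (cls, where) -> System.out.println(cls + " -> " + where));
    }
}
```

Run it with the jars from the lib folder as arguments; any class it prints is defined by more than one jar and is a candidate for the "Multiple dex files define" error.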

how to read comments in word document from apache poi?

How can I read comments (annotations) from a Microsoft Word document?
Please provide some example code if possible.
Thank you.

Finally, I found the answer.
Here is the code snippet:
File file = null;
FileInputStream fis = null;
HWPFDocument document = null;
Range commentRange = null;
try {
file = new File(fileName);
fis = new FileInputStream(file);
document = new HWPFDocument(fis);
commentRange = document.getCommentsRange();
int numComments = commentRange.numParagraphs();
for (int i = 0; i < numComments; i++) {
String comments = commentRange.getParagraph(i).text();
comments = comments.replaceAll("\\cM?\r?\n", "").trim();
if (!comments.equals("")) {
System.out.println("comment :- " + comments);
}
}
} catch (Exception e) {
e.printStackTrace();
}
I am using Poi poi-3.5-beta7-20090719.jar, poi-scratchpad-3.5-beta7-20090717.jar. The other archives - poi-ooxml-3.5-beta7-20090717.jar and poi-dependencies-3.5-beta7-20090717.zip - will be needed if you are hoping to work on the OpenXML based file formats.
I appreciate the help of Mark B who actually found this solution ....
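As a side note on the replaceAll pattern in the snippet above, here is a small standard-library sketch (LineEndingDemo and strip are hypothetical names) of what the pattern does and does not remove. In Java regular expressions \cM is the control character Ctrl-M, which is the carriage return U+000D, so the pattern matches a line feed preceded by up to two carriage returns. A carriage return that is not followed by a line feed is left alone by replaceAll and is only removed by the trim() call.

```java
// Hypothetical demo class; the pattern string is copied from the snippet above.
class LineEndingDemo {

    // Optional Ctrl-M (\cM, i.e. U+000D), optional \r, then a mandatory \n.
    static final String PATTERN = "\\cM?\r?\n";

    // Removes every match of PATTERN from the text.
    static String strip(String text) {
        return text.replaceAll(PATTERN, "");
    }

    public static void main(String[] args) {
        // Windows and Unix line endings are both removed.
        System.out.println(strip("first\r\nsecond\n").equals("firstsecond"));
        // A lone carriage return does not match, because the \n is mandatory.
        System.out.println(strip("comment\r").equals("comment\r"));
        // trim() removes the remaining trailing carriage return.
        System.out.println(strip("comment\r").trim().equals("comment"));
    }
}
```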

Get the HWPFDocument object (by passing it a Word document as an input stream, say).
Then call getSummaryInformation() on it, which gives you a SummaryInformation object holding the document's summary properties.

Please refer to the following link; it may fulfill your requirements:
http://bihag.wordpress.com/2009/11/04/how-to-read-comments-from-word-with-poi-jav/#comment-13

I am also new to Apache POI. Here is my program, which works fine: it extracts the text of a .doc file to a text file. I hope it helps you. Before you run it, add the corresponding library files to your classpath.
/*
* FileExtract.java
*
* Created on April 12, 2010, 9:46 AM
*
* To change this template, choose Tools | Template Manager
* and open the template in the editor.
*/
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
import java.io.*;
import org.apache.poi.POIOLE2TextExtractor.*;
import org.apache.poi.POIOLE2TextExtractor;
import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.hdgf.extractor.VisioTextExtractor;
import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.ss.extractor.ExcelExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import javax.swing.text.Document;
/**
*
* @author ChandraMouil V
*/
public class RtfDocTextExtract {
/** Creates a new instance of FileExtract */
static String filePath;
static String rtfFile;
static FileInputStream fis;
static int x=0;
public RtfDocTextExtract() {
}
//This function for .DOC File
public static void meth(String filePath) {
try {
if(x!=0){
fis = new FileInputStream("D:/DummyRichTextFormat.doc");
POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
WordExtractor oleTextExtractor = (WordExtractor) ExtractorFactory.createExtractor(fileSystem);
String[] paragraphText = oleTextExtractor.getParagraphText();
FileWriter fw = new FileWriter("E:/resume-template.txt");
for (String paragraph : paragraphText) {
fw.write(paragraph);
}
fw.flush();
}
}catch(Exception e){
e.printStackTrace();
}
}
}
