Printing results from PDFBox output to a text file - Java

I am working on a class that parses a PDF document with PDFBox. Its purpose is to create a text file (PdfTestFile.txt) containing the parsed text. I have gotten it to print the parsed text to the console, but I don't know how to write it to the .txt file that the class creates.
I tried out.print(Text); but it gives this error:
out cannot be resolved
The class PdfEasyManager calls the class EasySearch, which is where the error occurs.
Below is my code; the String Text is what I would like to write to PdfTestFile.txt:
Class "PdfEasyManager":
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
public class PdfEasyManager {
static BufferedWriter writer;
public static void main(String[] args) throws IOException {
//writer = new BufferedWriter(new FileWriter("Evergreen.txt"));
EasySearch easysearch = new EasySearch();
// pdfManager.setFilePath("PDFextTEST.pdf");
System.out.println(easysearch.ToText());
//out.println(easysearch.ToText());
}
}
Class "EasySearch" :
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
public class EasySearch {
private PDFParser parser;
private PDFTextStripper pdfStripper;
private PDDocument pdDoc ;
private COSDocument cosDoc ;
private String Text ;
private String filePath;
private File file;
static BufferedWriter writer;
//writer = new BufferedWriter(new FileWriter(BLnumber + (date.toString().substring(4, 10))+ ".org"));
public EasySearch() {
}
//public static void main(String args[]) throws Exception{
public String ToText() throws IOException
{
this.pdfStripper = null;
this.pdDoc = null;
this.cosDoc = null;
writer = new BufferedWriter(new FileWriter("PdfTestFile.txt"));
file = new File("C:/Users/Jon Smith/Desktop/Sample.pdf");
parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdDoc.getNumberOfPages();
pdfStripper.setStartPage(1);// reading text from page 1
// pdfStripper.setEndPage(10);// to 10
pdfStripper.setEndPage(pdDoc.getNumberOfPages());// if you want to get text from full pdf file use this code
Text = pdfStripper.getText(pdDoc);
out.print(Text); //this is the line that gives me the error
return Text;
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
}

You are using out, which is not declared anywhere in your class. Use System.out.print(Text) instead.

Thanks for the help, but
writer.write(Text);
solved the issue I was having.
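
For reference, the fix above amounts to sending the extracted string through the BufferedWriter and then closing the writer so its buffer is flushed to disk. The following is only a minimal sketch under some assumptions: the class name TextFileDemo and the helper writeText are made up for illustration, and the sample string stands in for the value returned by pdfStripper.getText(pdDoc).

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

class TextFileDemo {
    // Hypothetical helper (not part of the original code):
    // writes text to the given path, replacing any existing content.
    static void writeText(String path, String text) throws IOException {
        // try-with-resources closes the writer, which also flushes its buffer
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(path))) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the string returned by pdfStripper.getText(pdDoc)
        String text = "text extracted from the PDF";
        writeText("PdfTestFile.txt", text);
    }
}
```

Note that the code in the question never closes its writer. A BufferedWriter holds data in memory until it is flushed or closed, so calling writer.write(Text) without a later close() or flush() may leave the file empty or incomplete.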

Related

Unprotect Word document using Java

How can we unprotect a Word document using Java and Apache POI? I have protected the document as read-only with a password programmatically, and now I want to unprotect it. Is there any method to do this? I have used removePasswordProtection(), but the document is still not editable after calling that method.
The sample code that I used for protection is:
XWPFDocument document=new XWPFDocument();
document.enforceReadonlyProtection(strPassword,HashAlgorithm.sha1);
The document is protected successfully.
But when I try to unprotect it using the code snippet below, it does not work.
if(document.isEnforcedReadonlyProtection())
{
if(document.validateProtectionPassword(strPassword))
{
document.removeProtectionEnforcement();
}
}
Can anyone tell me which method I can use to unprotect the document?

I cannot reproduce this.
The following code produces two Word documents: WordProtected.docx, which is protected, and WordUnprotected.docx, in which the protection is removed.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.poifs.crypt.HashAlgorithm;
class XWPFReadOnlyProtection {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument();
String strPassword = "password";
document.enforceReadonlyProtection(strPassword, HashAlgorithm.sha1);
FileOutputStream fileout = new FileOutputStream("WordProtected.docx");
document.write(fileout);
fileout.close();
document.close();
document = new XWPFDocument(new FileInputStream("WordProtected.docx"));
document.removeProtectionEnforcement();
fileout = new FileOutputStream("WordUnprotected.docx");
document.write(fileout);
fileout.close();
document.close();
}
}
Use this code to protect a Word file:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class WordTest {
public static void main(String[] args) throws IOException {
FileInputStream in = new FileInputStream("D:\\govind.doc");
BufferedInputStream bin = new BufferedInputStream(in);
POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
Biff8EncryptionKey.setCurrentUserPassword("P#ssw0rd");
HWPFDocument doc = new HWPFDocument(poiFileSystem);
Range range = doc.getRange();
FileOutputStream out = new FileOutputStream("D:\\govind.doc");
doc.write(out);
out.close();
}
}
Use this code to produce an unprotected copy of a protected Word file:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class wordFileTest {
public static void main(String[] args) throws IOException {
geenrateUnprotectedFile("D:\\","govind","1234");
}
public static void geenrateUnprotectedFile(String filePath,String fileName,String pwdtxt) {
try {
FileInputStream in = new FileInputStream(filePath+fileName+".doc");
BufferedInputStream bin = new BufferedInputStream(in);
POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
Biff8EncryptionKey.setCurrentUserPassword(pwdtxt);
HWPFDocument doc = new HWPFDocument(poiFileSystem);
String docType=doc.getDocumentText();
FileOutputStream out = new FileOutputStream(filePath+fileName+"12.doc");
out.write(docType.getBytes());
System.out.println("don");
}catch (Exception e) {
e.printStackTrace();
}
}
}

Converting docx file to pdf

I want to convert a Word document (docx) to PDF using apache.poi.xwpf. The conversion itself works, but cover pages and diagrams are not converted. My code follows. I would like to know which JARs are needed and how to convert docx to PDF correctly. Any help would be appreciated.
package javaapplication1;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
/**
*
* @author Manos_T
*/
public class JavaApplication1 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) throws FileNotFoundException, IOException {
String filePath = "C:/Users/manos_t/Desktop/777.docx";
FileInputStream fInputStream = new FileInputStream(new File(filePath));
// XWPFDocument document = new XWPFDocument(Data.class.getResourceAsStream(filePath));
XWPFDocument document = new XWPFDocument(fInputStream);
File outFile = new File("C:/Users/manos_t/Desktop/777.pdf");
outFile.getParentFile().mkdirs();
OutputStream out = new FileOutputStream(outFile);
PdfOptions options = PdfOptions.create().fontEncoding("windows-1250");
PdfConverter.getInstance().convert(document, out, options);
System.out.println("Sucess");
}
}

Learning PDFBox; Trouble with Sample Code

I'm trying to learn how to use PDFBox and found some sample code that I'm working through here.
I've attached the code in the post-script.
When I compile the code in Dr. Java, I get the following error:
File: C:\Users\Dick Hurtz from Hold\Desktop\Java Programs\JavaStuff\PDFManager.java [line: 30]
Error: The constructor org.apache.pdfbox.pdfparser.PDFParser(org.apache.pdfbox.io.RandomAccessFile) is undefined
I'm not sure what to do about this, and any help would be greatly appreciated. Thanks everyone!
Here are the classes:
MAIN:
import java.io.IOException;
public class JavaPDFTest {
public static void main(String[] args) throws IOException {
PDFManager pdfManager = new PDFManager();
pdfManager.setFilePath("E:\\test.pdf"); // backslash escaped; "\t" alone would be read as a tab character
System.out.println(pdfManager.ToText());
}
}
PDFManager:
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class PDFManager {
private PDFParser parser;
private PDFTextStripper pdfStripper;
private PDDocument pdDoc;
private COSDocument cosDoc;
private String Text;
private String filePath;
private File file;
public PDFManager() {
}
public String ToText() throws IOException
{
this.pdfStripper = null;
this.pdDoc = null;
this.cosDoc = null;
file = new File(filePath);
parser = new PDFParser(new RandomAccessFile(file,"r")); // update for PDFBox V 2.0
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdDoc.getNumberOfPages();
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(10);
// reading text from page 1 to 10
// if you want to get text from full pdf file use this code
// pdfStripper.setEndPage(pdDoc.getNumberOfPages());
Text = pdfStripper.getText(pdDoc);
return Text;
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
}
Getting the PDDocument directly using
PDDocument pdDoc = PDDocument.load(file);
is the recommended way to load a PDF document from a file.

Trying to append to a text file using Java PrintWriter

I have a few other classes like this one, and I call each class's method through an object in the run file. I want to write the output of every class into the same text file. At the moment, however, only one output is saved to the text file, because the file is overwritten each time. How do I do this using a PrintWriter, as seen below?
Any guidance is much appreciated!
Class:
package cw;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Scanner;
import javax.swing.JFileChooser;
import java.io.IOException;
public class LineCounter {
public static void TotalLines() throws IOException {
Scanner sc = new Scanner(TextAnalyser.class.getResourceAsStream("test.txt"));
PrintWriter out = new PrintWriter(new FileWriter("C:\\Users\\Sam\\Desktop\\Report.txt"));
int linetotal = 0;
while (sc.hasNextLine()) {
sc.nextLine();
linetotal++;
}
out.println("The total number of lines in the file = " + linetotal);
out.close();
System.out.println("The total number of lines in the file = " + linetotal);
}
}
Run File:
package cw;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Scanner;
import javax.swing.JFileChooser;
import java.io.IOException;
public class TextAnalyser {
public static void main(String[] args) throws IOException {
Scanner sc = new Scanner(TextAnalyser.class.getResourceAsStream("test.txt"));
LineCounter Lineobject = new LineCounter();
WordCounter Wordobject = new WordCounter();
NumberCounter Numberobject = new NumberCounter();
DigitCounter Digitobject = new DigitCounter();
SpaceCounter Spaceobject = new SpaceCounter();
NumberAverage Noavgobject = new NumberAverage();
WordAverage Wordavgobject = new WordAverage();
Palindromes Palindromeobject = new Palindromes();
VowelCounter Vowelobject = new VowelCounter();
ConsonantCounter Consonantobject = new ConsonantCounter();
WordOccurenceTotal RepeatsObject = new WordOccurenceTotal();
Lineobject.TotalLines();
Wordobject.TotalWords();
Numberobject.TotalNumbers();
Digitobject.TotalDigits();
Spaceobject.TotalSpaces();
Noavgobject.NumberAverage();
Wordavgobject.WordAverage();
Vowelobject.TotalVowels();
Consonantobject.TotalConsonant();
Palindromeobject.TotalPalindromes();
//RepeatsObject.TotalRepeats();
}
}
You want to use the second argument of the FileWriter constructor to set the append mode:
new FileWriter("name_of_your_file.txt", true);
instead of:
new FileWriter("name_of_your_file.txt");
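
As a rough illustration of the append flag, the sketch below writes two lines to the same file in two separate calls without the second overwriting the first. The class name ReportAppender, the helper appendLine, the file name Report.txt, and the counts in the sample lines are all placeholders chosen for this example, not part of the original code.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

class ReportAppender {
    // Hypothetical helper: appends one line to the file at the given path.
    static void appendLine(String path, String line) throws IOException {
        // The second FileWriter argument (true) opens the file in append mode
        try (PrintWriter out = new PrintWriter(new FileWriter(path, true))) {
            out.println(line);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values; each counter class would pass its own result
        appendLine("Report.txt", "The total number of lines in the file = 12");
        appendLine("Report.txt", "The total number of words in the file = 87");
    }
}
```

One consequence of append mode is that the file also keeps its content between program runs, so the report grows every time the program is executed. If a fresh report is wanted per run, one option is to have the first writer open the file without the append flag and only the later ones append.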

Tika: detect custom metadata fields

I created a simple class that uses the Tika library to extract metadata from files such as PDF, HTML, XLS, DOC, ...
Files can have custom metadata. As a first step, I need to detect it and ignore it.
But I can't see how to do that with Tika.
This is my simple code to extract all metadata:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
public class TikaParse {
public static String resPFldMeta = new String();
public static String resPFldMetaValue = new String();
@SuppressWarnings("deprecation")
public static String ParseFieldMetadata(String filename) throws Exception {
int j;
FileInputStream is = null;
File f = new File(filename);
is = new FileInputStream(f);
ContentHandler contenthandler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
AutoDetectParser parser = new AutoDetectParser();
parser.parse(is, contenthandler, metadata,new ParseContext());
String[] metadataNames = metadata.names();
// get field name of all metadata
for(j=0;j<metadataNames.length-1; j++){
resPFldMeta += "\""+(metadataNames[j]).trim()+"\",";
}
resPFldMeta += "\""+(metadataNames[j]).trim()+"\"";
return resPFldMeta;
}
//.....
}
So, my question is: how can I check whether a detected metadata field is custom metadata or normalized metadata?
