Learning PDFBox; Trouble with Sample Code

Learning PDFBox; Trouble with Sample Code - java

I'm trying to learn how to use PDFBox and found some sample code that I'm working through here.
I've attached the code in the post-script.
When I compile the code in Dr. Java, I get the following error:
File: C:\Users\Dick Hurtz from Hold\Desktop\Java Programs\JavaStuff\PDFManager.java [line: 30]
Error: The constructor org.apache.pdfbox.pdfparser.PDFParser(org.apache.pdfbox.io.RandomAccessFile) is undefined
I'm not sure what to do about this, and any help would be greatly appreciated. Thanks everyone!
Here are the classes:
MAIN:
import java.io.IOException;
public class JavaPDFTest {
public static void main(String[] args) throws IOException {
PDFManager pdfManager = new PDFManager();
pdfManager.setFilePath("E:\test.pdf");
System.out.println(pdfManager.ToText());
}
}
PDFManager:
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class PDFManager {
private PDFParser parser;
private PDFTextStripper pdfStripper;
private PDDocument pdDoc;
private COSDocument cosDoc;
private String Text;
private String filePath;
private File file;
public PDFManager() {
}
public String ToText() throws IOException
{
this.pdfStripper = null;
this.pdDoc = null;
this.cosDoc = null;
file = new File(filePath);
parser = new PDFParser(new RandomAccessFile(file,"r")); // update for PDFBox V 2.0
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdDoc.getNumberOfPages();
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(10);
// reading text from page 1 to 10
// if you want to get text from full pdf file use this code
// pdfStripper.setEndPage(pdDoc.getNumberOfPages());
Text = pdfStripper.getText(pdDoc);
return Text;
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
}

get the PDDocument directly using
PDDocument pdDoc = PDDocument.load(file);
is the recommended way to load a PDF document from a file.

Related

Unprotect word document using java

how can we unprotect the word document using java apache poi? I have protected the document as read-only using password pro-grammatically.Now I want to unprotect it. How can we do ? Is there any method to unprotect the document. I have used removePasswordProtection() but that document is not editable even after using that method.
The sample code that I have used for protection is
XWPFDocument document=new XWPFDocument();
document.enforceReadonlyProtection(strPassword,HashAlgorithm.sha1);
The document is getting protected successfully.
But when I am unprotecting document using the below code snippet it is not working.
if(document.isEnforcedReadonlyProtection())
{
if(document.validateProtectionPassword(strPassword))
{
document.removeProtectionEnforcement();
}
}
Can anyone help me what method that I can use to unprotect the document?

Cannot reproducing.
Following code produces two Word documents. One, WordProtected.docx, which is protected and one, WordUnprotected.docx in which protection is removed.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.poifs.crypt.HashAlgorithm;
class XWPFReadOnlyProtection {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument();
String strPassword = "password";
document.enforceReadonlyProtection(strPassword, HashAlgorithm.sha1);
FileOutputStream fileout = new FileOutputStream("WordProtected.docx");
document.write(fileout);
fileout.close();
document.close();
document = new XWPFDocument(new FileInputStream("WordProtected.docx"));
document.removeProtectionEnforcement();
fileout = new FileOutputStream("WordUnprotected.docx");
document.write(fileout);
fileout.close();
document.close();
}
}

use this code for Word to Protect
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class WordTest {
public static void main(String[] args) throws IOException {
FileInputStream in = new FileInputStream("D:\\govind.doc");
BufferedInputStream bin = new BufferedInputStream(in);
POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
Biff8EncryptionKey.setCurrentUserPassword("P#ssw0rd");
HWPFDocument doc = new HWPFDocument(poiFileSystem);
Range range = doc.getRange();
FileOutputStream out = new FileOutputStream("D:\\govind.doc");
doc.write(out);
out.close();
}
}
this is use for protected word File unportected
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class wordFileTest {
public static void main(String[] args) throws IOException {
geenrateUnprotectedFile("D:\\","govind","1234");
}
public static void geenrateUnprotectedFile(String filePath,String fileName,String pwdtxt) {
try {
FileInputStream in = new FileInputStream(filePath+fileName+".doc");
BufferedInputStream bin = new BufferedInputStream(in);
POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
Biff8EncryptionKey.setCurrentUserPassword(pwdtxt);
HWPFDocument doc = new HWPFDocument(poiFileSystem);
String docType=doc.getDocumentText();
FileOutputStream out = new FileOutputStream(filePath+fileName+"12.doc");
out.write(docType.getBytes());
System.out.println("don");
}catch (Exception e) {
e.printStackTrace();
}
}
}

PDFBox - read text from multiple PDFs and load it into multiple Text files

I have more than 1000 pdf files in a folder , each one to be converted and saved in its corresponding text file .
I'm a bit new to Java and i'm using PDFBox to make the conversion ; I successfully got the code for one single pdf , but I'm stuck on how to do the conversion for all the PDFS in a single Folder. Can someone help me to achieve that in Java? .
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
public final class ExtractPdf
{
public static void main( String[] args ) throws IOException
{
String fileName = "sample.pdf";
PDDocument document = null;
try (PrintWriter out = new PrintWriter("out.txt"))
{
document = PDDocument.load( new File(fileName));
PDFTextStripper stripper = new PDFTextStripper();
String pdfText = stripper.getText(document).toString();
System.out.println( "Text in the area:" + pdfText);
out.println(pdfText);
}
finally
{
if( document != null )
{
document.close();
}
}
}
}
Thanks, Free

Basically your question is how to go through a directory…
public static void main(String[] args) throws IOException
{
File dir = new File("....");
File[] files = dir.listFiles(new FilenameFilter()
{
// use anonymous inner class
#Override
public boolean accept(File dir, String name)
{
return name.toLowerCase().endsWith(".pdf");
}
});
// null check omitted!
for (File file : files)
{
int len = file.getAbsolutePath().length();
String txtFilename = file.getAbsolutePath().substring(0, len - 4) + ".txt";
// check whether txt file exists omitted
try (OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(txtFilename), Charsets.UTF_8);
PDDocument document = PDDocument.load(file))
{
PDFTextStripper stripper = new PDFTextStripper();
stripper.writeText(document, out);
}
}
// exception catch omitted. Add code here to avoid your whole job
// dying if only one file is broken
}

Search for a string in html file using Jsoup

Can anyone help me with searching for a particular string in HTML file using Jsoup or any other method. There are inbuilt methods but they help in extracting title or script texts inside a specific tags and not string in general.
In this code I have used one such inbuilt method to extract title from the html page.
But I want to search a string instead.
package dynamic_tester;
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class tester {
public static void main(String args[])
{
Document htmlFile = null;
{
try {
htmlFile = Jsoup.parse(new File("x.html"), "ISO-8859-1");
}
catch (IOException e)
{
e.printStackTrace();
}
String title = htmlFile.title();
System.out.println("Title = "+title);
}
}
}

Here's a sample. It reads the HTML file as text String and then performs search on that String.
package com.example;
import java.io.FileInputStream;
import java.nio.charset.Charset;
public class SearchTest {
public static void main(String[] args) throws Exception {
StringBuffer htmlStr = getStringFromFile("test.html", "ISO-8859-1");
boolean isPresent = htmlStr.indexOf("hello") != -1;
System.out.println("is Present ? : " + isPresent);
}
private static StringBuffer getStringFromFile(String fileName, String charSetOfFile) {
StringBuffer strBuffer = new StringBuffer();
try(FileInputStream fis = new FileInputStream(fileName)) {
byte[] buffer = new byte[10240]; //10K buffer;
int readLen = -1;
while( (readLen = fis.read(buffer)) != -1) {
strBuffer.append( new String(buffer, 0, readLen, Charset.forName(charSetOfFile)));
}
} catch(Exception ex) {
ex.printStackTrace();
strBuffer = new StringBuffer();
}
return strBuffer;
}
}

Printing results from PDF box output to a text file

I am working on a class which Parses a PDF document with PDF Box, its purpose is to create a text file (its name is PdfTestFile.txt) with the results. We have gotten it to print the parsed text to the console, but I don't know how to make it write the results to the .txt file that the class creates (name is PdfTestFile.txt).
I tried to use out.print(Text); but it gives me an error saying that:
out cannot be resolved
The class PdfEasyManager calls the class EasySearch in which we see the error mentioned above.
Below is the code that I have where the String Text is what I would like to print to the file PdfTestFile.txt:
Class " PdfEasyManager":
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
public class PdfEasyManager {
static BufferedWriter writer;
public static void main(String[] args) throws IOException {
//writer = new BufferedWriter(new FileWriter("Evergreen.txt"));
EasySearch easysearch = new EasySearch();
// pdfManager.setFilePath("PDFextTEST.pdf");
System.out.println(easysearch.ToText());
//out.println(easysearch.ToText());
}
}
Class "EasySearch" :
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.IOException;
public class EasySearch {
private PDFParser parser;
private PDFTextStripper pdfStripper;
private PDDocument pdDoc ;
private COSDocument cosDoc ;
private String Text ;
private String filePath;
private File file;
static BufferedWriter writer;
//writer = new BufferedWriter(new FileWriter(BLnumber + (date.toString().substring(4, 10))+ ".org"));
public EasySearch() {
}
//public static void main(String args[]) throws Exception{
public String ToText() throws IOException
{
this.pdfStripper = null;
this.pdDoc = null;
this.cosDoc = null;
writer = new BufferedWriter(new FileWriter("PdfTestFile.txt"));
file = new File("C:/Users/Jon Smith/Desktop/Sample.pdf");
parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdDoc.getNumberOfPages();
pdfStripper.setStartPage(1);// reading text from page 1
// pdfStripper.setEndPage(10);// to 10
pdfStripper.setEndPage(pdDoc.getNumberOfPages());// if you want to get text from full pdf file use this code
Text = pdfStripper.getText(pdDoc);
out.print(Text); //this is the line that gives me the error
return Text;
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
}

You are using out which is not present in your class. Use System.out.print(Text).

Thanks for the help but
writer.write(Text);
solves the issue I was having

Tika detect custom metadat fileds

I created a simple class that using tika library to extract metadata from files like PDF, html, XLS, DOC,..
files can have custom metadata. I need to detect that and ignore for first step!
But i can see how to do that with Tika!
this is my simple code to extract all metadata:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
public class TikaParse {
public static String resPFldMeta = new String();
public static String resPFldMetaValue = new String();
#SuppressWarnings("deprecation")
public static String ParseFieldMetadata(String filename) throws Exception {
int j;
FileInputStream is = null;
File f = new File(filename);
is = new FileInputStream(f);
ContentHandler contenthandler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
AutoDetectParser parser = new AutoDetectParser();
parser.parse(is, contenthandler, metadata,new ParseContext());
String[] metadataNames = metadata.names();
// get field name of all metadata
for(j=0;j<metadataNames.length-1; j++){
resPFldMeta += "\""+(metadataNames[j]).trim()+"\",";
}
resPFldMeta += "\""+(metadataNames[j]).trim()+"\"";
return resPFldMeta;
}
//.....
}
SO, My question is : how to check if the metadat detected is custom metadata or is normalized metadata??

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Learning PDFBox; Trouble with Sample Code - java

get the PDDocument directly using PDDocument pdDoc = PDDocument.load(file); is the recommended way to load a PDF document from a file.

Related

Unprotect word document using java

PDFBox - read text from multiple PDFs and load it into multiple Text files

Search for a string in html file using Jsoup

Printing results from PDF box output to a text file

Tika detect custom metadat fileds

Categories

Resources