I created a simple class that uses the Tika library to extract metadata from files such as PDF, HTML, XLS, and DOC.
Files can have custom metadata. As a first step, I need to detect it and ignore it.
But I can't see how to do that with Tika.
Here is my simple code to extract all metadata names:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
public class TikaParse {
    public static String resPFldMeta = "";
    public static String resPFldMetaValue = "";

    @SuppressWarnings("deprecation")
    public static String ParseFieldMetadata(String filename) throws Exception {
        File f = new File(filename);
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream is = new FileInputStream(f)) {
            ContentHandler contenthandler = new BodyContentHandler(-1);
            AutoDetectParser parser = new AutoDetectParser();
            parser.parse(is, contenthandler, metadata, new ParseContext());
        }
        // build a comma-separated list of the quoted metadata field names;
        // an empty names() array yields an empty string instead of an exception
        StringBuilder sb = new StringBuilder();
        for (String name : metadata.names()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append('"').append(name.trim()).append('"');
        }
        // reset the static field so repeated calls do not accumulate old names
        resPFldMeta = sb.toString();
        return resPFldMeta;
    }
//.....
}
So, my question is: how can I check whether a detected metadata field is custom metadata or normalized (standard) metadata?
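A possible starting point, not a confirmed answer: Tika's Office parsers are generally understood to report user-defined document properties under names that start with custom: (the prefix exposed in Tika 1.x as Office.USER_DEFINED_METADATA_NAME_PREFIX). Assuming that holds for your Tika version and file types, the array returned by metadata.names() can be split on that prefix. The sketch below uses plain Java so it runs without Tika on the classpath; the class name MetadataNameFilter and the sample names are hypothetical, and other formats such as PDF may not follow this convention, so check it against your own files.

```java
import java.util.ArrayList;
import java.util.List;

final class MetadataNameFilter {
    // Assumption: user-defined properties are reported with this name prefix.
    public static final String CUSTOM_PREFIX = "custom:";

    // True when the metadata name looks like a user-defined (custom) property.
    public static boolean isCustom(String name) {
        return name.trim().startsWith(CUSTOM_PREFIX);
    }

    // Returns the trimmed names that do NOT carry the custom prefix.
    public static List<String> standardNames(String[] names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (!isCustom(name)) {
                result.add(name.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical names, standing in for the result of metadata.names()
        String[] names = {"Content-Type", "dc:title", "custom:ProjectCode"};
        System.out.println(String.join(",", standardNames(names)));
    }
}
```

In the original class, the result of metadata.names() would be passed to standardNames before the quoted list is built.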
I am working on a coding project in which I read a file in Java and output information about the file. I found code online that does this for PDFs. The line "import org.xml.sax.SAXException;" keeps giving me an error stating that the package org.xml.sax is accessible from more than one module. Can someone help me with this?
Sorry to bother you all, I am a new coder just trying to figure this out.
Here is the code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class PDFTika {
    public static void main(final String[] args) throws IOException, TikaException {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        FileInputStream inputstream = new FileInputStream(
                new File("/Users/relli/OneDrive/Documents/Asparta/example.pdf"));
        ParseContext pcontext = new ParseContext();
        //parsing the document using PDF parser
        PDFParser pdfparser = new PDFParser();
        pdfparser.parse(inputstream, handler, metadata, pcontext);
        //getting the content of the document
        System.out.println("Contents of the PDF :" + handler.toString());
        //getting metadata of the document
        System.out.println("Metadata of the PDF:");
        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}
Method 1 is a copy of the code provided by Gabriel Katz. I fixed the error simply by adding another exception (SAXException) to the throws clause.
Method 2 is a simplified version that parses only the PDF content.
Code snippet info:
This code parses PDF data using the Apache Tika package. It prints the PDF content as a string and prints the metadata of the PDF file.
Method 1: parse the PDF and print its content and metadata
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
public class PDFTika {
public static void main(final String[] args) throws IOException, TikaException, SAXException {
File file = new File("example.pdf");
FileInputStream inputstream = new FileInputStream(file);
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
//parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
//getting the content of the document
System.out.println("Contents of the PDF :" + handler.toString());
//getting metadata of the document
System.out.println("Metadata of the PDF:");
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
System.out.println(name + " : " + metadata.get(name));
}
}
}
Method 2: parse PDF data and print content as a string
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
public class TikaParser {
public static void main(String[] args) throws IOException, TikaException {
File file = new File("example.pdf");
FileInputStream inputstream = new FileInputStream(file);
Tika tika = new Tika();
String fileContent = tika.parseToString(inputstream);
System.out.println(fileContent);
}
}
<!-- Please add the following dependencies for Tika -->
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.24.1</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.24.1</version>
</dependency>
</dependencies>
How can we unprotect a Word document using Java and Apache POI? I have programmatically protected the document as read-only with a password. Now I want to unprotect it. Is there a method to do that? I have used removePasswordProtection(), but the document is still not editable after calling it.
The sample code that I used for protection is:
XWPFDocument document=new XWPFDocument();
document.enforceReadonlyProtection(strPassword,HashAlgorithm.sha1);
The document is protected successfully.
But when I try to unprotect the document using the code snippet below, it does not work.
if(document.isEnforcedReadonlyProtection())
{
if(document.validateProtectionPassword(strPassword))
{
document.removeProtectionEnforcement();
}
}
Can anyone tell me which method I can use to unprotect the document?
I cannot reproduce this.
The following code produces two Word documents: WordProtected.docx, which is protected, and WordUnprotected.docx, in which the protection has been removed.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.poifs.crypt.HashAlgorithm;
class XWPFReadOnlyProtection {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument();
String strPassword = "password";
document.enforceReadonlyProtection(strPassword, HashAlgorithm.sha1);
FileOutputStream fileout = new FileOutputStream("WordProtected.docx");
document.write(fileout);
fileout.close();
document.close();
document = new XWPFDocument(new FileInputStream("WordProtected.docx"));
document.removeProtectionEnforcement();
fileout = new FileOutputStream("WordUnprotected.docx");
document.write(fileout);
fileout.close();
document.close();
}
}
Use this code to protect a Word (.doc) file:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class WordTest {
public static void main(String[] args) throws IOException {
FileInputStream in = new FileInputStream("D:\\govind.doc");
BufferedInputStream bin = new BufferedInputStream(in);
POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
Biff8EncryptionKey.setCurrentUserPassword("P#ssw0rd");
HWPFDocument doc = new HWPFDocument(poiFileSystem);
Range range = doc.getRange();
FileOutputStream out = new FileOutputStream("D:\\govind.doc");
doc.write(out);
out.close();
}
}
Use this code to unprotect a protected Word file:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class wordFileTest {
    public static void main(String[] args) throws IOException {
        generateUnprotectedFile("D:\\", "govind", "1234");
    }

    public static void generateUnprotectedFile(String filePath, String fileName, String pwdtxt) {
        // try-with-resources closes the input stream even if reading fails
        try (BufferedInputStream bin = new BufferedInputStream(
                new FileInputStream(filePath + fileName + ".doc"))) {
            POIFSFileSystem poiFileSystem = new POIFSFileSystem(bin);
            // password used to open the protected file
            Biff8EncryptionKey.setCurrentUserPassword(pwdtxt);
            HWPFDocument doc = new HWPFDocument(poiFileSystem);
            String docType = doc.getDocumentText();
            // note: only the extracted plain text is written, not a Word binary document
            try (FileOutputStream out = new FileOutputStream(filePath + fileName + "12.doc")) {
                out.write(docType.getBytes());
            }
            System.out.println("done");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I have used TikaParser to extract plain text from '.doc' files
public static void main(String[] args) throws Exception {
ContentHandler handler = new ToHTMLContentHandler();
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
FileInputStream content = new FileInputStream("file.doc");
parser.parse(content, handler, metadata, context);
System.out.println(handler.toString());
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
System.out.println(name + " : " + metadata.get(name));
}
FileOutputStream outStream = new FileOutputStream("file.doc.txt");
outStream.write(handler.toString().getBytes());
outStream.close();
content.close();
}
This works for most files, but for one specific file it throws the following exception:
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@7c417213
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.goarya.app.resumestorage.migration.TikaParser.main(TikaParser.java:29)
Caused by: java.lang.IllegalArgumentException: The end (7161) must not be before the start (7162)
at org.apache.poi.hwpf.usermodel.Range.sanityCheckStartEnd(Range.java:208)
at org.apache.poi.hwpf.usermodel.Range.<init>(Range.java:194)
at org.apache.poi.hwpf.usermodel.Paragraph.<init>(Paragraph.java:165)
at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:144)
at org.apache.poi.hwpf.usermodel.Range.getParagraph(Range.java:766)
at org.apache.poi.hwpf.extractor.WordExtractor.getParagraphText(WordExtractor.java:168)
at org.apache.poi.hwpf.extractor.WordExtractor.getMainTextboxText(WordExtractor.java:145)
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:183)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:169)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:130)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 3 more
The .doc file opens in Microsoft Word without any error.
Also, extracting it in C# with Microsoft.Office.Interop.Word returns the plain text.
How can I overcome this issue using Apache Tika?
Edit: adding sample doc for this scenario
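This does not fix the parser itself, but if the extraction runs over many files, one general pattern is to isolate each file so that a single document that trips the parser is recorded and skipped instead of aborting the whole run. The sketch below is plain Java with no Tika dependency; SafeBatchExtractor and its Extractor interface are hypothetical names, and Extractor stands in for whatever method wraps the real parser.parse call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SafeBatchExtractor {
    // Hypothetical stand-in for the real extraction call.
    public interface Extractor {
        String extract(String path) throws Exception;
    }

    public final Map<String, String> extracted = new LinkedHashMap<>();
    public final List<String> failed = new ArrayList<>();

    // Runs the extractor on every path; failures are recorded, not rethrown.
    public void run(List<String> paths, Extractor extractor) {
        for (String path : paths) {
            try {
                extracted.put(path, extractor.extract(path));
            } catch (Exception e) {
                // remember which file failed and why, then continue with the next one
                failed.add(path + ": " + e);
            }
        }
    }

    public static void main(String[] args) {
        SafeBatchExtractor batch = new SafeBatchExtractor();
        // The lambda simulates a parser that fails on one specific file.
        batch.run(Arrays.asList("good.doc", "broken.doc"), path -> {
            if (path.startsWith("broken")) {
                throw new IllegalArgumentException("simulated parser failure");
            }
            return "text of " + path;
        });
        System.out.println("extracted: " + batch.extracted.keySet());
        System.out.println("failed: " + batch.failed);
    }
}
```

The list of failed files can then be retried with a different Tika or POI version, or handled by another tool.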
I am using the tika-core 1.2 jar, and my program runs successfully with the following code.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.SAXException;
public class Exmple2 {
public static void main(final String[] args) throws IOException,TikaException, SAXException {
ToHTMLContentHandler handler = new ToHTMLContentHandler();
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
FileInputStream content = new FileInputStream("/home/ist/FTRDocuments/taableDis.docx");
parser.parse(content, handler, metadata, context);
System.out.println(handler.toString());
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
System.out.println(name + " : " + metadata.get(name));
}
FileOutputStream outStream = new FileOutputStream("/home/ist/file.doc.txt");
outStream.write(handler.toString().getBytes());
outStream.close();
content.close();
}
}
The only change with Tika 1.2 is that I declare the handler as ToHTMLContentHandler where you are using ContentHandler.
I am getting a ClassNotFoundException, even though I have included the fontbox and pdfbox jar files in my classpath.
package com.KyaHub.action;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import javax.servlet.http.HttpServletRequest;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.fontbox.cmap.*;
import org.xml.sax.SAXException;
public class PdfParser {
private HttpServletRequest request;
public String execute() throws IOException,TikaException, SAXException {
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("C:/Users/admin/Downloads/cmp_column_width_example.pdf"));
ParseContext pcontext = new ParseContext();
try{
//parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata,pcontext);
//getting the content of the document
System.out.println("Contents of the PDF :" + handler.toString());
//getting metadata of the document
System.out.println("Metadata of the PDF:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name+ " : " + metadata.get(name));
}
}
catch(Exception e)
{
e.printStackTrace();
}
return "success";
}
//getter and setter
public HttpServletRequest getRequest() {
return request;
}
public void setRequest(HttpServletRequest request) {
this.request = request;
}
}
Whenever I change the file name to APJ.AbdulKalamAzad.pdf, I get output. But when I change it to another PDF file, I get the exception mentioned above.
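Since the question does not say which class is missing, a small diagnostic can narrow it down by checking whether specific classes are visible to the class loader that runs the action. This is only a sketch: ClasspathCheck is a hypothetical helper, and the class names in main are examples of classes shipped in the pdfbox, fontbox, and Tika parser jars. In a servlet container the jars usually have to be in the web application's WEB-INF/lib, so the check should run inside the web application rather than from a standalone main.

```java
final class ClasspathCheck {
    // Returns true if the named class can be located by this class's loader.
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Example class names; replace them with the one named in your stack trace.
        String[] candidates = {
            "org.apache.pdfbox.pdmodel.PDDocument",
            "org.apache.fontbox.cmap.CMap",
            "org.apache.tika.parser.pdf.PDFParser"
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

The class named in the ClassNotFoundException message is the most useful one to test first.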
I am working on a class that parses a PDF document with PDFBox. Its purpose is to create a text file (PdfTestFile.txt) containing the results. We have gotten it to print the parsed text to the console, but I don't know how to make it write the results to the .txt file that the class creates.
I tried to use out.print(Text); but it gives me an error saying:
out cannot be resolved
The class PdfEasyManager calls the class EasySearch, in which we see the error mentioned above.
Below is the code that I have, where the String Text is what I would like to write to the file PdfTestFile.txt:
Class "PdfEasyManager":
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
public class PdfEasyManager {
static BufferedWriter writer;
public static void main(String[] args) throws IOException {
//writer = new BufferedWriter(new FileWriter("Evergreen.txt"));
EasySearch easysearch = new EasySearch();
// pdfManager.setFilePath("PDFextTEST.pdf");
System.out.println(easysearch.ToText());
//out.println(easysearch.ToText());
}
}
Class "EasySearch" :
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.IOException;
public class EasySearch {
private PDFParser parser;
private PDFTextStripper pdfStripper;
private PDDocument pdDoc ;
private COSDocument cosDoc ;
private String Text ;
private String filePath;
private File file;
static BufferedWriter writer;
//writer = new BufferedWriter(new FileWriter(BLnumber + (date.toString().substring(4, 10))+ ".org"));
public EasySearch() {
}
//public static void main(String args[]) throws Exception{
public String ToText() throws IOException
{
this.pdfStripper = null;
this.pdDoc = null;
this.cosDoc = null;
writer = new BufferedWriter(new FileWriter("PdfTestFile.txt"));
file = new File("C:/Users/Jon Smith/Desktop/Sample.pdf");
parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdDoc.getNumberOfPages();
pdfStripper.setStartPage(1);// reading text from page 1
// pdfStripper.setEndPage(10);// to 10
pdfStripper.setEndPage(pdDoc.getNumberOfPages());// if you want to get text from full pdf file use this code
Text = pdfStripper.getText(pdDoc);
out.print(Text); //this is the line that gives me the error
return Text;
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
}
You are using out, which is not declared anywhere in your class. Use System.out.print(Text) instead.
Thanks for the help, but
writer.write(Text);
solved the issue I was having.
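One caveat with writer.write(Text): a BufferedWriter keeps data in memory until it is flushed or closed, and the code in the question never closes the writer, so the file may end up empty or truncated. A minimal sketch of writing the text with try-with-resources, which closes the writer automatically, is shown below. TextFileWriter and writeText are hypothetical names, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TextFileWriter {
    // Writes the text to the target file, replacing any existing content.
    public static void writeText(Path target, String text) throws IOException {
        // try-with-resources flushes and closes the writer even if write() throws
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // In EasySearch, the second argument would be the extracted Text
        writeText(Paths.get("PdfTestFile.txt"), "text extracted from the PDF");
    }
}
```

Inside ToText(), calling a helper like this after pdfStripper.getText(pdDoc) would replace both the static writer field and the out.print(Text) line.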