I get the error "Pdf indirect object belongs to other PDF document" when the service is called from SoapUI. Below is my code for PDF generation:
Main Method from where execution of print starts:
printFolioActivity:
public void printFolioActivity(final FolioActivityTO folio,
final String printerName, final MediaTray mediaTray,
final ServiceContext context) throws ServiceException {
// S-845164 - Changes done to use new itext library
String fileName = null; // stays null unless the PDF is saved locally
folioActivityDocumentAssemblerTemplate = getTemplateBean();
final String method = "printFolioActivity";
logEntering(method, context);
LOGGER.info("Print Services printFolioActivity using " + "folio : "
+ folio + " printerName : " + printerName + " mediaTray : "
+ mediaTray);
if (!DocumentServiceUtil.isNullOrEmpty(folio)
&& !DocumentServiceUtil.isNullOrEmpty(printerName)
&& !DocumentServiceUtil.isNullOrEmpty(mediaTray)) {
//FolioActivityDocumentAssemblerTemplate template = new FolioActivityDocumentAssemblerTemplate();
// Call the Folio Activity template to generate the PDF
final byte[] document = folioActivityDocumentAssemblerTemplate
.generateFolioActivityDocument(folio);
if (save_gen_pdf) {
fileName = saveLocalService.save(
DocumentServiceConstant.FILE_NAME_FOLIO_ACTIVITY,
document);
}
PrintServiceFacade.getInstance().print(printerName, mediaTray,
document, context);
if(Objects.nonNull(fileName)) {
deleteFile(fileName);
}
logExiting(method);
} else {
LOGGER.error("In printFolioActivity FolioActivityTO is null or Empty OR printerName is null or Empty OR mediaTray is null or Empty");
}
}
generateFolioActivityDocument:
public byte[] generateFolioActivityDocument(FolioActivityTO folioActivityTO) {
LOGGER.info("FolioActivityDocumentAssemblerTemplate generateFolioActivityDocument :: START");
FacilityServiceFacade facilityServiceFacade = getBeanFacade();
currentDate =facilityServiceFacade.getCurrentDate();
OutputStream out = new ByteArrayOutputStream();
PdfWriter writer = new PdfWriter(out);
PdfDocument pdfDoc = new PdfDocument(writer);
Document document = new Document(pdfDoc);
pdfDoc.setDefaultPageSize(FolioActivityTemplate.PAGE_SIZE_LANDSCAPE);
PdfCanvas pdfCanvas = new PdfCanvas(
document.getPdfDocument().addNewPage(FolioActivityTemplate.PAGE_SIZE_LANDSCAPE));
FolioActivityTemplate template = new FolioActivityTemplate(false);
generateFolioActivity(pdfCanvas, document, folioActivityTO, template);
document.close(); // throwing exception here
totalPDFPages = 0;
LOGGER.info("FolioActivityDocumentAssemblerTemplate generateFolioActivityDocument :: END");
return ((ByteArrayOutputStream) out).toByteArray();
}
The FolioActivityTemplate class declares all the constants for styles and fonts.
GenerateFolioActivity:
public void generateFolioActivity(PdfCanvas pdfCanvas, Document document, FolioActivityTO folioActivityTO,
FolioActivityTemplate template) {
LOGGER.info("FolioActivityDocumentAssemblerTemplate generateFolioActivity :: START");
insertTitle(pdfCanvas, folioActivityTO.getGuestInformation(), template.activityTitleStyle,
template.packageInfoStyle);
LOGGER.info("FolioActivityDocumentAssemblerTemplate generateFolioActivity :: END");
}
insertTitle:
protected static void insertTitle(PdfCanvas pdfCanvas, FolioActivityGuestInformation guestInfo, Style titleStyle,
Style packageStyle) {
LOGGER.info("FolioActivityDocumentAssemblerTemplate insertTitle :: START : GuestInfo : " + guestInfo);
drawElementFromStyle(pdfCanvas, FolioActivityTemplate.ACTIVITY_TITLE, titleStyle, 0);
if (guestInfo == null) {
return;
}
drawElementFromStyle(pdfCanvas, guestInfo.getPackageCode() + FolioActivityTemplate.TEXT_PACKAGE_CODE_SEPARATOR
+ guestInfo.getPackageCodeDescription(), packageStyle, 0);
drawElementFromStyle(pdfCanvas,
FolioActivityTemplate.GI_ARRIVAL_TEXT + getPackageEffectiveDate(guestInfo.getPackageDateEffectiveFrom())
+ FolioActivityTemplate.TEXT_DATE_EFFECTIVE_SEPARATOR + FolioActivityTemplate.GI_DEPARTURE_TEXT
+ getPackageEffectiveDate(guestInfo.getPackageDateEffectiveTo()),
packageStyle, 1);
LOGGER.info("FolioActivityDocumentAssemblerTemplate insertTitle :: END");
}
drawElementFromStyle:
protected static void drawElementFromStyle(final PdfCanvas at, final String element, final Style style,
final int offsetCount) {
at.beginText();
float fontSize = style.getFontSize().floatValue();
if (!DocumentServiceUtil.isNullOrEmpty(element) && element.length() > 50 && fontSize > 17) {
fontSize = 17.0f;
}
at.setFontAndSize(style.getFontType(), fontSize);
if (element != null) {
at.moveText(style.getX().floatValue(),
(style.getY().floatValue() - (offsetCount * style.getCapHeight().floatValue()))).showText(element);
}
at.endText();
}
Exception:
soap:Fault
faultcode: soap:Server
faultstring: Unexpected Error occurred : printFolioActivity : Pdf indirect object belongs to other PDF document. Copy object to current pdf document.
detail: ServiceException (xmlns:ns2="http://exception.service.com/"): EXCEPTION printFolioActivity : Pdf indirect object belongs to other PDF document. Copy object to current pdf document.
Related
I am parsing a PDF document with iText, and I want to know the colors for lines and rectangles in the pages. I am using this code which does the parsing:
private PdfDictionary getColorDictionary(PdfDictionary resourcesDic) {
PdfDictionary colorDict = resourcesDic.getAsDict(PdfName.COLORSPACE);
return colorDict;
}
public void decode(File file) throws IOException {
PdfReader reader = new PdfReader(file.toURI().toURL());
int numberOfPages = reader.getNumberOfPages();
ProcessorListener listener = new ProcessorListener ();
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
for (int pageNumber = 1; pageNumber <= numberOfPages; pageNumber++) {
PdfDictionary pageDic = reader.getPageN(pageNumber);
PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
PdfDictionary colorSpaceDic = getColorDictionary(resourcesDic);
listener.setResources(colorSpaceDic);
processor.processContent(ContentByteUtils.getContentBytesForPage(reader, pageNumber), resourcesDic);
}
}
My listener has the following code, I simplified it to show only the part which gets the graphics elements in each page:
public class ProcessorListener implements ExtRenderListener {
private PdfDictionary colorSpaceDic = null;
public void setResources(PdfDictionary colorSpaceDic) {
this.colorSpaceDic = colorSpaceDic;
}
@Override
public void beginTextBlock() {
}
@Override
public void renderText(TextRenderInfo tri) {
}
@Override
public void renderImage(ImageRenderInfo iri) {
}
@Override
public Path renderPath(PathPaintingRenderInfo renderInfo) {
GraphicsState graphicsState;
try {
graphicsState = getGraphicsState(renderInfo);
} catch (NoSuchFieldException | SecurityException | IllegalArgumentException | IllegalAccessException e) {
return null;
}
if ((renderInfo.getOperation() & PathPaintingRenderInfo.STROKE) != 0) {
PdfName resource = graphicsState.getColorSpaceStroke();
if (resource != null && colorSpaceDic != null) {
PdfObject obj = colorSpaceDic.get(resource);
System.err.println("STROKE: " + obj);
}
}
if ((renderInfo.getOperation() & PathPaintingRenderInfo.FILL) != 0) {
PdfName resource = graphicsState.getColorSpaceFill(); // fill branch: read the fill color space, not the stroke one
if (resource != null && colorSpaceDic != null) {
PdfObject obj = colorSpaceDic.get(resource);
System.err.println("FILL: " + obj);
}
}
return null;
}
}
This code executes correctly, but each PdfObject associated with a fill or stroke is a PRIndirectReference. How do I get the BaseColor associated with this reference?
Also I tried to use the following code (for example for the Fill):
BaseColor fillColor = graphicsState.getFillColor();
But the color is always null. There are not only black shapes in the document (which I assume is the default), but also green or blue lines as well.
Following mkl's remark, I saved the PDF to a new document using Acrobat Reader (print option) or Edge, and the colors are not null in the resulting document.
I am trying to get the current page number using the PDFBox reader.
Here is the code I have written:
public class PDFTextExtractor{
ArrayList extractText(String fileName) throws Exception {
PDDocument document = null;
try {
document = PDDocument.load( new File(fileName) );
PDFTextAnalyzer stripper = new PDFTextAnalyzer();
stripper.setSortByPosition( true );
stripper.setStartPage( 0 );
stripper.setEndPage( document.getNumberOfPages() );
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
return stripper.getCharactersList();
}
finally {
if( document != null ) {
document.close();
}
}
}
}
To collect the details, I wrote the following class:
public class PDFTextAnalyzer extends PDFTextStripper {
public PDFTextAnalyzer() throws IOException {
super();
// TODO Auto-generated constructor stub
}
private ArrayList<CharInfo> charactersList = new ArrayList<CharInfo>();
public ArrayList<CharInfo> getCharactersList() {
return charactersList;
}
public void setCharactersList(ArrayList<CharInfo> charactersList) {
this.charactersList = charactersList;
}
@Override
protected void writeString(String string, List<TextPosition> textPositions)
throws IOException {
System.out.println("----->"+document.getPages().getCount());
/* for(int i = 0 ; i < document.getPages().getCount();i++)
{
*/
float docHeight = +document.getPage(1).getMediaBox().getHeight();
for (TextPosition text : textPositions) {
/*
* System.out.println((int)text.getUnicode().charAt(0)+" "+text.
* getUnicode()+ " [(X=" + text.getXDirAdj()+" "+text.getX() + ",Y="
* + text.getYDirAdj() + ") height=" + text.getHeightDir() +
* " width=" + text.getWidthDirAdj() + "]");
*/
System.out.println("<-->"+text.toString());
charactersList.add(new CharInfo(
text.getUnicode(),
text.getXDirAdj(),
docHeight - text.getYDirAdj(),
text.getWidthDirAdj(),
text.getHeightDir(),
text.getFontSizeInPt(),
1, // Page number of current text
text.getFont().getFontDescriptor().getFontName(),
text.getFont().getFontDescriptor().getFontFamily()
)
);
}
}
}
But I am unable to fetch the page number (see the line commented "Page number of current text"). Is there any way to get it?
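How about this.getCurrentPageNo()? PDFTextStripper keeps track of the page it is currently processing, so calling it inside writeString can replace the hard-coded 1.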
I need to create a PDF file from plain text files. I supposed that the simplest method would be to read these files and print them to a PDF printer.
My problem is that if I print to a PDF printer, the result is an empty PDF file. If I print to Microsoft XPS Document Writer, the file is created in plain text format, not in OXPS format.
I would be satisfied with a two- or three-step solution (e.g. converting to XPS first and then to PDF using Ghostscript, or something similar).
I have tried a couple of PDF printers, such as CutePDF, Microsoft PDF writer and Bullzip PDF. The result is the same for each one.
The environment is Java 1.7/1.8 on Windows 10.
private void print() {
try {
DocFlavor flavor = DocFlavor.SERVICE_FORMATTED.PRINTABLE;
PrintRequestAttributeSet patts = new HashPrintRequestAttributeSet();
PrintService[] ps = PrintServiceLookup.lookupPrintServices(flavor, patts);
if (ps.length == 0) {
throw new IllegalStateException("No Printer found");
}
System.out.println("Available printers: " + Arrays.asList(ps));
PrintService myService = null;
for (PrintService printService : ps) {
if (printService.getName().equals("Microsoft XPS Document Writer")) { //
myService = printService;
break;
}
}
if (myService == null) {
throw new IllegalStateException("Printer not found");
}
myService.getSupportedDocFlavors();
DocPrintJob job = myService.createPrintJob();
FileInputStream fis1 = new FileInputStream("o:\\k\\t1.txt");
Doc pdfDoc = new SimpleDoc(fis1, DocFlavor.INPUT_STREAM.AUTOSENSE, null);
HashPrintRequestAttributeSet pr = new HashPrintRequestAttributeSet();
pr.add(OrientationRequested.PORTRAIT);
pr.add(new Copies(1));
pr.add(MediaSizeName.ISO_A4);
PrintJobWatcher pjw = new PrintJobWatcher(job);
job.print(pdfDoc, pr);
pjw.waitForDone();
fis1.close();
} catch (PrintException ex) {
Logger.getLogger(Docparser.class.getName()).log(Level.SEVERE, null, ex);
} catch (Exception ex) {
Logger.getLogger(Docparser.class.getName()).log(Level.SEVERE, null, ex);
}
}
class PrintJobWatcher {
boolean done = false;
PrintJobWatcher(DocPrintJob job) {
job.addPrintJobListener(new PrintJobAdapter() {
public void printJobCanceled(PrintJobEvent pje) {
allDone();
}
public void printJobCompleted(PrintJobEvent pje) {
allDone();
}
public void printJobFailed(PrintJobEvent pje) {
allDone();
}
public void printJobNoMoreEvents(PrintJobEvent pje) {
allDone();
}
void allDone() {
synchronized (PrintJobWatcher.this) {
done = true;
System.out.println("Printing done ...");
PrintJobWatcher.this.notify();
}
}
});
}
public synchronized void waitForDone() {
try {
while (!done) {
wait();
}
} catch (InterruptedException e) {
}
}
}
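A possible explanation, stated here as an assumption rather than something confirmed in this thread: a document sent as DocFlavor.INPUT_STREAM.AUTOSENSE is handed to the printer driver as raw bytes, and a virtual PDF or XPS printer does not lay out raw text by itself. One way around that is to draw the text yourself through a Printable and submit it with DocFlavor.SERVICE_FORMATTED.PRINTABLE. The sketch below uses only the JDK; the class name TextPrintable, the 12 pt line height and the font size are illustrative choices, not part of the original code.

```java
import java.awt.Font;
import java.awt.Graphics;
import java.awt.print.PageFormat;
import java.awt.print.Printable;
import java.util.List;

// Hypothetical helper, not part of the original post: draws plain text
// lines page by page so the driver receives rendered text, not raw bytes.
final class TextPrintable implements Printable {
    private static final int LINE_HEIGHT = 12; // points per line (assumed)
    private final List<String> lines;

    TextPrintable(List<String> lines) {
        this.lines = lines;
    }

    // Number of text lines that fit into the imageable area of one page.
    static int linesPerPage(PageFormat pf) {
        return Math.max(1, (int) (pf.getImageableHeight() / LINE_HEIGHT));
    }

    @Override
    public int print(Graphics g, PageFormat pf, int pageIndex) {
        int perPage = linesPerPage(pf);
        int first = pageIndex * perPage;
        if (first >= lines.size()) {
            return NO_SUCH_PAGE; // tells the print system there is nothing left
        }
        g.setFont(new Font(Font.MONOSPACED, Font.PLAIN, 10));
        int x = (int) pf.getImageableX();
        int y = (int) pf.getImageableY() + LINE_HEIGHT;
        int last = Math.min(first + perPage, lines.size());
        for (int i = first; i < last; i++) {
            g.drawString(lines.get(i), x, y);
            y += LINE_HEIGHT;
        }
        return PAGE_EXISTS;
    }
}
```

In print() above, the document would then be created as new SimpleDoc(new TextPrintable(lines), DocFlavor.SERVICE_FORMATTED.PRINTABLE, null), with lines read from the text file beforehand.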
If you can install LibreOffice, it is possible to use the Java UNO API to do this.
There is a similar example here which will load and save a file: Java Convert Word to PDF with UNO. This could be used to convert your text file to PDF.
Alternatively, you could take the text file and send it directly to the printer using the same API.
The following JARs give access to the UNO API. Ensure these are in your class path:
[Libre Office Dir]/URE/java/juh.jar
[Libre Office Dir]/URE/java/jurt.jar
[Libre Office Dir]/URE/java/ridl.jar
[Libre Office Dir]/program/classes/unoil.jar
[Libre Office Dir]/program
The following code will then take your sourceFile and print to the printer named "Local Printer 1".
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import com.sun.star.beans.PropertyValue;
import com.sun.star.frame.XComponentLoader;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.view.XPrintable;
public class DirectPrintTest
{
public static void main(String args[])
{
// set to the correct name of your printers
String printer = "Local Printer 1";// "Microsoft Print to PDF";
File sourceFile = new File("c:/projects/WelcomeTemplate.doc");
if (!sourceFile.canRead()) {
throw new RuntimeException("Can't read:" + sourceFile.getPath());
}
com.sun.star.uno.XComponentContext xContext = null;
try {
// get the remote office component context
xContext = com.sun.star.comp.helper.Bootstrap.bootstrap();
System.out.println("Connected to a running office ...");
// get the remote office service manager
com.sun.star.lang.XMultiComponentFactory xMCF = xContext
.getServiceManager();
Object oDesktop = xMCF.createInstanceWithContext(
"com.sun.star.frame.Desktop", xContext);
com.sun.star.frame.XComponentLoader xCompLoader = (XComponentLoader) UnoRuntime
.queryInterface(com.sun.star.frame.XComponentLoader.class,
oDesktop);
StringBuffer sUrl = new StringBuffer("file:///");
sUrl.append(sourceFile.getCanonicalPath().replace('\\', '/'));
List<PropertyValue> loadPropsList = new ArrayList<PropertyValue>();
PropertyValue pv = new PropertyValue();
pv.Name = "Hidden";
pv.Value = Boolean.TRUE;
loadPropsList.add(pv);
PropertyValue[] loadProps = new PropertyValue[loadPropsList.size()];
loadPropsList.toArray(loadProps);
// Load a Writer document, which will be automatically displayed
com.sun.star.lang.XComponent xComp = xCompLoader
.loadComponentFromURL(sUrl.toString(), "_blank", 0,
loadProps);
// Querying for the interface XPrintable on the loaded document
com.sun.star.view.XPrintable xPrintable = (XPrintable) UnoRuntime
.queryInterface(com.sun.star.view.XPrintable.class, xComp);
// Setting the property "Name" for the favoured printer (name of
// IP address)
com.sun.star.beans.PropertyValue propertyValue[] = new com.sun.star.beans.PropertyValue[1]; // only one property is set per call
propertyValue[0] = new com.sun.star.beans.PropertyValue();
propertyValue[0].Name = "Name";
propertyValue[0].Value = printer;
// Setting the name of the printer
xPrintable.setPrinter(propertyValue);
propertyValue[0] = new com.sun.star.beans.PropertyValue();
propertyValue[0].Name = "Wait";
propertyValue[0].Value = Boolean.TRUE;
// Printing the loaded document
System.out.println("sending print");
xPrintable.print(propertyValue);
System.out.println("closing doc");
((com.sun.star.util.XCloseable) UnoRuntime.queryInterface(
com.sun.star.util.XCloseable.class, xPrintable))
.close(true);
System.out.println("closed");
System.exit(0);
} catch (Exception e) {
e.printStackTrace(System.err);
System.exit(1);
}
}
}
Thank you all. After two days of struggling with various types of printers (I also gave the CUPS PDF printer a chance, but I could not make it print in landscape mode), I ended up using Apache PDFBox.
It is only a proof-of-concept solution, but it works and fits my needs. I hope it will be useful to somebody.
(The cleanTextContent() method removes some ESC control characters from the line to be printed.)
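The cleanTextContent() helper is not shown in this post. As a rough idea of what such a method could look like, here is a minimal sketch that assumes the unwanted characters are ANSI-style escape sequences (ESC, '[', parameters, a final letter) plus any other control characters. The class name TextCleaner and both patterns are assumptions; printer-specific escape codes would need their own patterns.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the cleanTextContent() helper mentioned above.
final class TextCleaner {
    // ESC '[' parameters final-letter, the ANSI CSI form (assumed input format).
    private static final Pattern CSI = Pattern.compile("\\x1B\\[[0-9;]*[A-Za-z]");
    // Any remaining control character, including a lone ESC.
    private static final Pattern CONTROL = Pattern.compile("\\p{Cntrl}");

    static String cleanTextContent(String line) {
        String withoutCsi = CSI.matcher(line).replaceAll("");
        return CONTROL.matcher(withoutCsi).replaceAll("");
    }
}
```

Removing control characters before showText() also matters because a font usually has no glyph for them.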
public void txt2pdf() {
float POINTS_PER_INCH = 72;
float POINTS_PER_MM = 1 / (10 * 2.54f) * POINTS_PER_INCH;
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy.MM.dd HH:m.ss");
PDDocument doc = null;
try {
doc = new PDDocument();
PDPage page = new PDPage(new PDRectangle(297 * POINTS_PER_MM, 210 * POINTS_PER_MM));
doc.addPage(page);
PDPageContentStream content = new PDPageContentStream(doc, page);
//PDFont pdfFont = PDType1Font.HELVETICA;
PDFont pdfFont = PDTrueTypeFont.loadTTF(doc, new File("c:\\Windows\\Fonts\\lucon.ttf"));
float fontSize = 10;
float leading = 1.1f * fontSize;
PDRectangle mediabox = page.getMediaBox();
float margin = 20;
float startX = mediabox.getLowerLeftX() + margin;
float startY = mediabox.getUpperRightY() - margin;
content.setFont(pdfFont, fontSize);
content.beginText();
content.setLeading(leading);
content.newLineAtOffset(startX, startY);
BufferedReader fis1 = new BufferedReader(new InputStreamReader(new FileInputStream("o:\\k\\t1.txt"), "cp852"));
String inString;
//content.setRenderingMode(RenderingMode.FILL_STROKE);
float currentY = startY + 60;
float hitOsszesenOffset = 0;
int pageNumber = 1;
while ((inString = fis1.readLine()) != null) {
currentY -= leading;
if (currentY <= margin) {
content.newLineAtOffset(0, (mediabox.getLowerLeftX()-35));
content.showText("Date Generated: " + dateFormat.format(new Date()));
content.newLineAtOffset((mediabox.getUpperRightX() / 2), (mediabox.getLowerLeftX()));
content.showText(String.valueOf(pageNumber++)+" page"); // "lap" is Hungarian for "page"
content.endText();
float yCordinate = currentY+30;
float sX = mediabox.getLowerLeftY()+ 35;
float endX = mediabox.getUpperRightX() - 35;
content.moveTo(sX, yCordinate);
content.lineTo(endX, yCordinate);
content.stroke();
content.close();
PDPage new_Page = new PDPage(new PDRectangle(297 * POINTS_PER_MM, 210 * POINTS_PER_MM));
doc.addPage(new_Page);
content = new PDPageContentStream(doc, new_Page);
content.beginText();
content.setFont(pdfFont, fontSize);
content.newLineAtOffset(startX, startY);
currentY = startY;
}
String ss = new String(inString.getBytes(), "UTF8");
ss = cleanTextContent(ss);
if (!ss.isEmpty()) {
if (ss.contains("JAN") || ss.contains("SUMMARY")) {
content.setRenderingMode(RenderingMode.FILL_STROKE);
}
content.newLineAtOffset(0, -leading);
content.showText(ss);
}
content.setRenderingMode(RenderingMode.FILL);
}
content.newLineAtOffset((mediabox.getUpperRightX() / 2), (mediabox.getLowerLeftY()));
content.showText(String.valueOf(pageNumber++));
content.endText();
fis1.close();
content.close();
doc.save("o:\\k\\t1.pdf");
} catch (IOException ex) {
Logger.getLogger(Document_Creation.class.getName()).log(Level.SEVERE, null, ex);
} finally {
if (doc != null) {
try {
doc.close();
} catch (IOException ex) {
Logger.getLogger(Document_Creation.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
}
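For readers puzzled by the page-size arithmetic in txt2pdf(): PDF user space is measured in points (72 per inch), so millimetres have to be converted before building the PDRectangle. The small JDK-only sketch below repeats that conversion and estimates how many lines fit between the margins. The class and method names are illustrative, and the line estimate is only an approximation of what the loop tracks with currentY.

```java
// Illustrative helper, not part of the original code.
final class PageGeometry {
    static final float POINTS_PER_INCH = 72f;
    static final float MM_PER_INCH = 25.4f;

    // Convert millimetres to PDF user-space units (points).
    static float mmToPoints(float mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    // Whole text lines that fit between the top and bottom margins.
    static int linesPerPage(float pageHeight, float margin, float leading) {
        return (int) Math.floor((pageHeight - 2 * margin) / leading);
    }

    public static void main(String[] args) {
        float width = mmToPoints(297);  // A4 landscape width
        float height = mmToPoints(210); // A4 landscape height
        System.out.printf("%.2f x %.2f pt%n", width, height);
        System.out.println(linesPerPage(height, 20, 1.1f * 10) + " lines per page");
    }
}
```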
I'd like to get all filenames of attachments/embedded files of a PDF document. I've been searching for a long time now, but my code still doesn't work.
What I tried:
File input = new File(inputfile); // Input File Path, Given as param from args[]
pd = PDDocument.load(input);
PDDocumentNameDictionary names = new PDDocumentNameDictionary(pd.getDocumentCatalog());
PDEmbeddedFilesNameTreeNode efTree = names.getEmbeddedFiles();
Map<String, COSObjectable> existedNames = efTree.getNames();
System.out.println(existedNames);//Print Embedded-Filenames to console
pd.close();
I don't know if it is even possible to print the contents of a Map to the console. I'm coding in Eclipse, which doesn't report any errors. But when I run the JAR file I always get: NullPointerException at org.apache.pdfbox.pdmodel.PDDocument.getDocumentCatalog(PDDocument.java:778)
Any ideas or help? Many thanks...
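On the side question of printing a Map: yes, that works. Map.toString() prints all entries in one line, and iterating over entrySet() gives one line per entry. A small JDK-only sketch follows; the class name MapPrinting and the placeholder values are made up for illustration and stand in for the real PDFBox objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: shows two ways to print a Map's contents.
final class MapPrinting {
    // One "key -> value" line per entry, in the map's iteration order.
    static String describe(Map<String, ?> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, ?> entry : map.entrySet()) {
            sb.append(entry.getKey()).append(" -> ").append(entry.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> names = new LinkedHashMap<>();
        names.put("invoice.xml", "spec-1"); // placeholder values, not real PDFBox objects
        names.put("readme.txt", "spec-2");
        System.out.println(names);          // Map.toString() prints every entry
        System.out.print(describe(names));  // one entry per line
    }
}
```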
Finally found a solution. For anyone with the same problem, the following code worked for me:
PDDocument pd;
File input = new File(inputfile); // Input File
pd = PDDocument.load(input);
//Writes all embedded Filenames (from pdf document) into Logfile
try{
PDDocumentCatalog catalog = pd.getDocumentCatalog();
PDDocumentNameDictionary names = catalog.getNames();
PDEmbeddedFilesNameTreeNode embeddedFiles = names.getEmbeddedFiles();
Map<String, COSObjectable> embeddedFileNames = embeddedFiles.getNames();
//For-Each Loop is used to list all embedded files (if there is more than one)
for (Map.Entry<String, COSObjectable> entry : embeddedFileNames.entrySet())
{
//You might need to configure the logger first
logger.info("Inputfile: " + inputfile +"Found embedded File: " + entry.getKey() + ":");
}
}
catch (Exception e){
System.out.println("Document has no attachments. ");
}
Here's the ExtractEmbeddedFiles example from the source code download:
public final class ExtractEmbeddedFiles
{
private ExtractEmbeddedFiles()
{
}
/**
* This is the main method.
*
* @param args The command line arguments.
*
* @throws IOException If there is an error parsing the document.
*/
public static void main( String[] args ) throws IOException
{
if( args.length != 1 )
{
usage();
System.exit(1);
}
else
{
PDDocument document = null;
try
{
File pdfFile = new File(args[0]);
String filePath = pdfFile.getParent() + System.getProperty("file.separator");
document = PDDocument.load(pdfFile );
PDDocumentNameDictionary namesDictionary =
new PDDocumentNameDictionary( document.getDocumentCatalog() );
PDEmbeddedFilesNameTreeNode efTree = namesDictionary.getEmbeddedFiles();
if (efTree != null)
{
Map<String, PDComplexFileSpecification> names = efTree.getNames();
if (names != null)
{
extractFiles(names, filePath);
}
else
{
List<PDNameTreeNode<PDComplexFileSpecification>> kids = efTree.getKids();
for (PDNameTreeNode<PDComplexFileSpecification> node : kids)
{
names = node.getNames();
extractFiles(names, filePath);
}
}
}
// extract files from annotations
for (PDPage page : document.getPages())
{
for (PDAnnotation annotation : page.getAnnotations())
{
if (annotation instanceof PDAnnotationFileAttachment)
{
PDAnnotationFileAttachment annotationFileAttachment = (PDAnnotationFileAttachment) annotation;
PDComplexFileSpecification fileSpec = (PDComplexFileSpecification) annotationFileAttachment.getFile();
PDEmbeddedFile embeddedFile = getEmbeddedFile(fileSpec);
extractFile(filePath, fileSpec.getFilename(), embeddedFile);
}
}
}
}
finally
{
if( document != null )
{
document.close();
}
}
}
}
private static void extractFiles(Map<String, PDComplexFileSpecification> names, String filePath)
throws IOException
{
for (Entry<String, PDComplexFileSpecification> entry : names.entrySet())
{
String filename = entry.getKey();
PDComplexFileSpecification fileSpec = entry.getValue();
PDEmbeddedFile embeddedFile = getEmbeddedFile(fileSpec);
extractFile(filePath, filename, embeddedFile);
}
}
private static void extractFile(String filePath, String filename, PDEmbeddedFile embeddedFile)
throws IOException
{
String embeddedFilename = filePath + filename;
File file = new File(filePath + filename);
System.out.println("Writing " + embeddedFilename);
FileOutputStream fos = null;
try
{
fos = new FileOutputStream(file);
fos.write(embeddedFile.toByteArray());
}
finally
{
IOUtils.closeQuietly(fos);
}
}
private static PDEmbeddedFile getEmbeddedFile(PDComplexFileSpecification fileSpec )
{
// search for the first available alternative of the embedded file
PDEmbeddedFile embeddedFile = null;
if (fileSpec != null)
{
embeddedFile = fileSpec.getEmbeddedFileUnicode();
if (embeddedFile == null)
{
embeddedFile = fileSpec.getEmbeddedFileDos();
}
if (embeddedFile == null)
{
embeddedFile = fileSpec.getEmbeddedFileMac();
}
if (embeddedFile == null)
{
embeddedFile = fileSpec.getEmbeddedFileUnix();
}
if (embeddedFile == null)
{
embeddedFile = fileSpec.getEmbeddedFile();
}
}
return embeddedFile;
}
/**
* This will print the usage for this program.
*/
private static void usage()
{
System.err.println( "Usage: java " + ExtractEmbeddedFiles.class.getName() + " <input-pdf>" );
}
}
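One thing to keep in mind with the example above: it reads names from the root node or from the root's direct kids only, while a PDF name tree may nest intermediate nodes more deeply. The sketch below shows the recursive walk in isolation. NameNode is a hypothetical stand-in written for this illustration, not a PDFBox class, and its String values take the place of file specifications.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a PDF name-tree node; not a PDFBox class.
final class NameNode {
    final Map<String, String> names = new LinkedHashMap<>(); // entries held by this node
    final List<NameNode> kids = new ArrayList<>();           // child nodes

    // Collect the entries of this node and of every descendant.
    static Map<String, String> collect(NameNode node) {
        Map<String, String> all = new LinkedHashMap<>();
        collectInto(node, all);
        return all;
    }

    private static void collectInto(NameNode node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        out.putAll(node.names);
        for (NameNode kid : node.kids) {
            collectInto(kid, out); // descend as deep as the tree goes
        }
    }
}
```

With the real API the same shape would apply to getNames() and getKids(), with a null check on both, since an intermediate node may have no names of its own.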
I wrote (mostly copied from the Lucene in Action ebook) an indexing example using Tika, but it doesn't index the documents at all. There are no errors at compile time or at run time. I tried indexing .pdf, .ppt, .doc and even .txt documents, to no avail: every search returns 0 hits, even though I made sure the search words occur in my documents. Please take a look at the code:
public class TikaIndexer extends Indexer {
private boolean DEBUG = false;
static Set textualMetadataFields = new HashSet();
static {
textualMetadataFields.add(Metadata.TITLE);
textualMetadataFields.add(Metadata.AUTHOR);
textualMetadataFields.add(Metadata.COMMENTS);
textualMetadataFields.add(Metadata.KEYWORDS);
textualMetadataFields.add(Metadata.DESCRIPTION);
textualMetadataFields.add(Metadata.SUBJECT);
}
public TikaIndexer(String indexDir) throws IOException {
super(indexDir);
}
@Override
protected boolean acceptFile(File f) {
return true;
}
@Override
protected Document getDocument(File f) throws Exception {
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY,
f.getCanonicalPath());
InputStream is = new FileInputStream(f);
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(10*1024*1024);
try {
parser.parse(is, handler, metadata, new ParseContext());
} finally {
is.close();
}
Document doc = new Document();
doc.add(new Field("contents", handler.toString(), Field.Store.NO, Field.Index.ANALYZED));
if (DEBUG) {
System.out.println(" full text: " + handler.toString());
}
for (String name : metadata.names()) {
String value = metadata.get(name);
if (textualMetadataFields.contains(name)) {
doc.add(new Field("contents", value,
Field.Store.NO, Field.Index.ANALYZED));
}
doc.add(new Field(name, value, Field.Store.YES, Field.Index.NO));
if (DEBUG) {
System.out.println(" " + name + ": " + value);
}
}
if (DEBUG) {
System.out.println();
}
return doc;
}
}
And the main class:
public static void main(String args[])
{
String indexDir = "src/indexDirectory";
String dataDir = "src/filesDirectory";
try
{
TikaConfig config = TikaConfig.getDefaultConfig();
List<MediaType> parsers = new ArrayList(config.getParser().getSupportedTypes(new ParseContext())); //3
Collections.sort(parsers);
Iterator<MediaType> it = parsers.iterator();
System.out.println(parsers.size());
System.out.println("Parser types:");
while (it.hasNext()) {
System.out.println(" " + it.next());
}
System.out.println();
long start = new Date().getTime();
TikaIndexer indexer = new TikaIndexer(indexDir);
int numIndexed = indexer.index(dataDir);
long end = new Date().getTime();
System.out.println("Indexing " + numIndexed + " files took "
+ (end - start) + " milliseconds.");
System.out.println();
System.out.println("--------------------------------------------------------------");
System.out.println();
}
catch (Exception ex)
{
System.out.println("Could not complete indexing: ");
ex.printStackTrace();
Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
}
}