I want to use Tika 2.6 to detect files with the MIME type 'application/illustrator'. When I use the following code snippet, I always get the MIME type 'application/pdf':
public MediaType detectMimeTypeFromContent(@NonNull File file) throws IOException {
    TikaConfig config = TikaConfig.getDefaultConfig();
    Detector detector = config.getDetector();
    Metadata metadata = new Metadata();
    TikaInputStream tikaStream = TikaInputStream.get(file, metadata);
    MediaType mediaType = detector.detect(tikaStream, metadata);
    tikaStream.close();
    return mediaType;
}
I use these dependencies:
implementation 'org.apache.tika:tika-core:2.6.0'
implementation 'org.apache.tika:tika-parsers:2.6.0'
How can I detect Adobe Illustrator files correctly?
Adobe documentation shows they internally use three application/type settings:
A list of document MIME types that are considered to be PDF or Illustrator documents:
PDF
Postscript
Illustrator
The documentation also goes on to say:
Adobe Illustrator’s file format is a variant of PDF. The main differences, in the context of Experience Manager Assets, are the following:
Adobe Illustrator documents consist of a single page with multiple layers. Each layer is extracted as a PNG subasset under the main Illustrator asset.
PDF documents consist of one or more pages. Each page is extracted as a single page PDF subasset under the main multi-page PDF document.
So Adobe applications have an in-house type set to distinguish application/illustrator; however, that is not a registered MIME type (AI is a subset of PDF, as above).
Other applications may struggle with hybrids where one format wraps the other. As one example, Linguist reports a content type of application/postscript for *.ai, whilst others report application/pdf, which may be due to:
"Early versions [over 24 years ago] of the AI file format [Illustrator versions 3 through to 8 saved artwork as specialised EPS files,] are true EPS files with a restricted, compact syntax, with additional semantics represented by Illustrator-specific DSC comments that conform to DSC's Open Structuring Conventions."
Confused? Don't be. Simply, like the current MIME type registry, accept that AI files that are PDF-like are application/pdf.
I often refer to text/pdf as the legacy format for ansi/pdf, but those are not registered either.
If a file starts with the 40-bit signature %PDF- then, irrespective of version or content, it is treated as application/pdf.
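Here is a minimal sketch of that signature check, using plain java.io (the helper name hasPdfSignature is hypothetical):
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// A file is "PDF-like" when it begins with the 40-bit (5-byte)
// signature "%PDF-", irrespective of version or content.
static boolean hasPdfSignature(File file) throws IOException {
    byte[] sig = new byte[5];
    try (InputStream in = new FileInputStream(file)) {
        return in.read(sig) == 5
                && "%PDF-".equals(new String(sig, StandardCharsets.US_ASCII));
    }
}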
The RFC (https://www.rfc-editor.org/rfc/rfc8118.html) says:
PDF Versions
The PDF format has gone through several revisions, primarily for the
addition of features. PDF features have generally been added in a
way that older viewers "fail gracefully", because they can just
ignore features they do not recognize. Even so, the older the PDF
version produced, the more legacy viewers will support that version,
but the fewer features will be enabled. The "application/pdf" media
type is used for all versions. See [ISOPDF2] Annex I, "PDF Versions
and Compatibility".
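Putting that together with the question's code: one pragmatic approach (a sketch of this answer's advice, not official Tika behaviour; the .ai extension check, the helper name detectWithAiHint and the application/illustrator label are assumptions of this example) is to accept Tika's application/pdf verdict and only refine it by file name:
// Sketch: keep Tika's content-based verdict, and only relabel
// PDF-like files whose name suggests Illustrator. Note that
// "application/illustrator" is not a registered MIME type.
MediaType detectWithAiHint(File file) throws IOException {
    MediaType detected = detectMimeTypeFromContent(file); // method from the question
    if (MediaType.application("pdf").equals(detected)
            && file.getName().toLowerCase().endsWith(".ai")) {
        return MediaType.parse("application/illustrator");
    }
    return detected;
}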
Related
I wanted to make a simple program to get text content from a PDF file through Java. Here is the code:
PDFTextStripper ts = new PDFTextStripper();
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = PDDocument.load(file);
String allText = ts.getText(doc1);
String gradeText = allText.substring(allText.indexOf("GRADE 10B"), allText.indexOf("GRADE 10C"));
System.out.println("Meeting ID for English: "
+ gradeText.substring(gradeText.indexOf("English") + 7, gradeText.indexOf("English") + 20));
This is just part of the code, but this is the part with the problem.
The error is: The method load(File) is undefined for the type PDDocument
I learnt how to use PDFBox from JavaTPoint. I have followed the correct instructions for installing the PDFBox libraries and adding them to the Build Path.
My PDFBox version is 3.0.0
I have also searched the source files and their methods, and I am unable to find the load method there.
Thank you in advance.
As per the 3.0 migration guide the PDDocument.load method has been replaced with the Loader method:
For loading a PDF PDDocument.load has been replaced with the Loader
methods. The same is true for loading a FDF document.
When saving a PDF this will now be done in compressed mode per
default. To override that use PDDocument.save with
CompressParameters.NO_COMPRESSION.
PDFBox now loads a PDF Document incrementally reducing the initial
memory footprint. This will also reduce the memory needed to consume a
PDF if only certain parts of the PDF are accessed. Note that, due to
the nature of PDF, uses such as iterating over all pages, accessing
annotations, signing a PDF etc. might still load all parts of the PDF
over time, leading to a similar memory consumption as with PDFBox 2.0.
The input file must not be used as output for saving operations. It
will corrupt the file and throw an exception as parts of the file are
read the first time when saving it.
So you can either swap to an earlier 2.x version of PDFBox, or use the new Loader method. I believe this should work:
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = Loader.loadPDF(file);
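For completeness, a fuller sketch of the corrected program (assuming PDFBox 3.x on the classpath; the class name ReadPdfText is arbitrary):
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ReadPdfText {
    public static void main(String[] args) throws IOException {
        File file = new File("C:\\Meeting IDs.pdf");
        // PDFBox 3.x: Loader.loadPDF replaces PDDocument.load
        try (PDDocument doc1 = Loader.loadPDF(file)) {
            PDFTextStripper ts = new PDFTextStripper();
            System.out.println(ts.getText(doc1));
        }
    }
}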
When I try to use org.apache.tika.parser.Parser and DefaultDetector() to detect and parse the .doc and .docx file formats, I get an error (not an exception) thrown from the Tika jars, and it doesn't have any helpful stack trace for me to put here. I can confirm that it is happening for .doc and .docx only; PDF, JPEG and text files are fine. Has anyone come across this problem with the .doc and .docx file formats? Is there any solution that you have adopted?
My Code is the following:
unzippedBytes = loadUnzippedByteCode(attachment.getContents()); /* Utility method written using the native Java zip library - returns byte[] */
/* All the objects below were declared beforehand, but not initialised until now */
parseContextObj = new ParseContext();
dObj = new DefaultDetector();
detectedParser = new AutoDetectParser(dObj);
parseContextObj.set(Parser.class, detectedParser); // was: context.set(Parser.class, parser) - names didn't match the declarations above
OutputStream outputstream = new ByteArrayOutputStream();
metadata = new Metadata();
InputStream input = TikaInputStream.get(unzippedBytes, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
detectedParser.parse(input, handler, metadata, parseContextObj); // This is where it throws NoSuchMethodError - cannot understand why, and cannot get the stack trace - using Tika 1.10
input.close();
The code above was something that I found in some other SO question and decided to use for my work. Also, the byte[] that I am using is something I receive from the very old Struts 1.0 FormFile interface (getFileData(), which returns byte[]). I used to have Bullhorn's irex parser, but decided to use Tika for numerous reasons. The byte[] works fine with irex, but has issues whenever I try to parse .docx and .doc contents.
The following is the stack trace, certain parts of which I masked for privacy reasons:
2016-01-15 16:21:06,947 [http-apr-80-exec-3] [ERROR] XXXXX.XXXX.XXXXService - java.lang.NoSuchMethodError: org.apache.poi.util.POILogger.log(I[Ljava/lang/Object;)V
at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.parseRelationshipsPart(PackageRelationshipCollection.java:313)
at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:163)
at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:131)
at org.apache.poi.openxml4j.opc.PackagePart.loadRelationships(PackagePart.java:561)
at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:109)
at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:80)
at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:125)
at org.apache.poi.openxml4j.opc.ZipPackagePart.<init>(ZipPackagePart.java:78)
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:245)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:684)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:227)
at org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:208)
at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:145)
at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
I realised that my classpath has POI jar version 2.5.1, and according to the Maven Central repo I am a dinosaur (it seems). Is that possibly why? I am also getting the error after switching to versions 3.13 and 2.6.0 for the POI and xmlbeans artifacts respectively (suggested by @venkyreddy in that answer).
UPDATE
I tried building a new project separately from my original work, and used tika-app-1.10.jar ONLY in my classpath. I also investigated tika-app-1.10.jar and found that all the POI dependencies are actually there, including xmlbeans and 'xml-schema'. After keeping only tika-app-1.10.jar in my classpath, I am getting the following Error (not Exception):
java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader
at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:167)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at xxx.xxx.xxx.xxx.xxxxxAttachmentWithTika(xxxService.java:792)
I browsed the package and couldn't find any POIXMLTypeLoader class. Is this a known issue? Could someone please respond?
Make sure there are no outdated POI jars on the classpath, and use the version of POI which matches the version of Tika that you are trying to use.
The class POIXMLTypeLoader was added to POI after POI 3.13 was released, so it seems you are somehow mixing in newer versions. Only the release POI 3.14-beta1 knows about this class! Make sure you do not include that version somehow.
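If it is unclear where the stray version comes from, one generic JVM diagnostic (not Tika- or POI-specific; OPCPackage is just a convenient class from the stack trace) is to print which jar the class was actually loaded from:
import org.apache.poi.openxml4j.opc.OPCPackage;

// Prints the jar that OPCPackage was loaded from; a path to an old
// poi-*.jar here pinpoints the stray dependency on the classpath.
System.out.println(OPCPackage.class.getProtectionDomain()
        .getCodeSource().getLocation());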
I am trying to get details of the media contents (video, audio) present in a LibreOffice Impress document through the LibreOffice API in Java. The detail I want to extract is the type of media content present in the document, and also ways to export it. I have gone through the Java examples given on the website, but could not find anything relevant to the type of video or audio present in a file, or to the extraction of video files. I have gone through the example for exporting images from Impress documents using GraphicExportFilter, but it is not able to export video or audio files present in the document. I also tried to extract the type of media content by using XShape (code below), but it only gives the name of the media content and not its type (audio/video/media extension).
For exporting, I am also aware of the method of converting documents to pptx, then renaming and extracting all types of media files. But I suppose that would take more time (correct me if I am wrong) in a practical application, so I was trying to do the same through the LibreOffice API.
XComponent xDrawDoc = Helper.loadDocument(xOfficeContext, fileName, "_blank", 0, pPropValues);
XDrawPage xPage = PageHelper.getDrawPageByIndex(xDrawDoc, nPageIndex);
XIndexAccess xIndexAccess = UnoRuntime.queryInterface(XIndexAccess.class, xPage);
long shapeNumber = xIndexAccess.getCount();
for (int j = 0; j < shapeNumber; j++)
{
    XShape xShape = UnoRuntime.queryInterface(XShape.class, xPage.getByIndex(j));
    XNamed xShapeNamed = UnoRuntime.queryInterface(XNamed.class, xShape);
    System.out.println(j + ": " + xShapeNamed.getName());
}
(This code gives me the names of the media contents present in Impress, but not their type or extension.)
Thanks in advance.
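One direction worth sketching, dropped into the loop from the question (hedged: it assumes media shapes support the com.sun.star.drawing.MediaShape service and expose a string MediaURL property, which you should verify against your LibreOffice version; exception handling is omitted):
import com.sun.star.beans.XPropertySet;
import com.sun.star.lang.XServiceInfo;
import com.sun.star.uno.UnoRuntime;

// For each shape, check whether it is a media shape and, if so,
// read its MediaURL; the URL's extension hints at the media type,
// and the URL itself points at the content to export.
Object shapeObj = xPage.getByIndex(j);
XServiceInfo info = UnoRuntime.queryInterface(XServiceInfo.class, shapeObj);
if (info != null && info.supportsService("com.sun.star.drawing.MediaShape")) {
    XPropertySet props = UnoRuntime.queryInterface(XPropertySet.class, shapeObj);
    String mediaUrl = (String) props.getPropertyValue("MediaURL");
    System.out.println(j + ": media shape, MediaURL=" + mediaUrl);
}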
I am trying to fill in a PDF form using Java, but when I try to get the fields using the code below, the list is empty.
PDDocument pdDoc = PDDocument.load(filename);
PDAcroForm pdform = pdDoc.getDocumentCatalog().getAcroForm();
List<PDField> field = pdform.getFields();
Then I tried to read the file using PDFTextStripper:
PDFTextStripper stripper = new PDFTextStripper();
System.out.println(stripper.getText(pdDoc));
and the output was as follows:
"Please wait...
If this message is not eventually replaced by the proper contents of the document, your PDF
viewer may not be able to display this type of document.
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by
visiting http://www.adobe.com/go/reader_download.
For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other
countries."
But I'm able to open the file manually and fill in the fields. I've also tried other tools like iText, but again I wasn't able to get the fields.
How can I resolve this issue?
Maybe it is too late to answer, but anyway, why not. You can get an empty list if your PDF file has an XFA structure.
PDDocument pdDoc = PDDocument.load(filename);
PDAcroForm pdform = pdDoc.getDocumentCatalog().getAcroForm();
List<PDField> field = pdform.getFields();
Use these code lines to start working with the PDF:
PDXFAResource xfa = pdform.getXFA(); // PDAcroForm.getXFA() returns a PDXFAResource, not "PDXFA"
Document xfaDocument = xfa.getDocument();
NodeList elements = xfaDocument.getElementsByTagName("SomeElement");
While struggling with Alfresco's content search abilities, I've had some trouble with PDFBox (used by Alfresco to extract text and metadata) reading PDF files written by old applications (like QuarkXPress) that use the old Acrobat 4.0 format. PDFBox seems unable to extract metadata or text from this old format, although the files were perfectly viewable with any PDF reader application.
The solution was having all the old PDF files re-printed (saved as...) using a more modern PDF format (like 10.0, for instance). This can be done in bulk using some bash scripting.
I didn't directly try intermediate Acrobat versions between 4.0 and 10.0.
Is there a parser for the application/octet-stream type within Apache Tika? I suppose it's a non-parsable stream.
I just need to parse ODS documents, MS Office documents and PDF files. It seems that new Tika().parseToString(file); is enough. But I can't figure out what happens when the content type is not detected and application/octet-stream is the default - whether I still have a chance to extract text from documents that are actually one of those types, but whose type the detector didn't recognise.
What else should I try, instead of returning the document to the user telling him that it is an unsupported format?
Or is a resulting application/octet-stream content type really a signal that we can't read this? Or "you must figure out your own way how to deal with this"?
If the detector doesn't know what the file is, it'll return application/octet-stream.
And if the detector doesn't know what it is, then Tika won't be able to pick a suitable Parser for it. (You'll end up with the EmptyParser, which does nothing.)
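A minimal sketch of guarding on that fallback type (the helper name extractOrNull is hypothetical; the Tika facade class is used for brevity):
import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Returns extracted text, or null when detection fell back to the
// generic application/octet-stream type.
static String extractOrNull(File file) throws IOException, TikaException {
    Tika tika = new Tika();
    String type = tika.detect(file);
    if ("application/octet-stream".equals(type)) {
        return null; // unknown format: report "unsupported" to the user
    }
    return tika.parseToString(file);
}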
If you can, pass in the name of your file when you do the detection and parsing, as that'll help with the detection in some cases:
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, context); // pass the context created above
Also, it's worth checking the supported formats part of the Tika website to ensure that the documents you have are ones where there's a Parser - http://tika.apache.org/0.9/formats.html
If your documents are in a format that isn't currently supported, then you have two choices (neither is an immediate fix). One is to help write a new parser (this requires finding a suitable Java library for the format). The other is to use a command-line-based parser (this requires finding an executable for your platform that can do the XHTML generation, then wiring that in).