CSV Detector in Apache Tika


I'm using the Java library Tika by Apache (tika-core ver. 1.10).
Is there an org.apache.tika.detect.Detector for CSV files?
The MIME type should be text/csv, but I cannot find anything like that.
I would like to use the nice detect method.

Currently (v1.10) tika-mimetypes.xml defines text/csv like this:
<mime-type type="text/csv">
<glob pattern="*.csv"/>
<sub-class-of type="text/plain"/>
</mime-type>
This means that Apache Tika detects text/csv only by filename. If you use Tika#detect(File), Tika adds the filename (under the Metadata.RESOURCE_NAME_KEY key) to the Metadata object passed to the detector. URLs are handled in a similar way.
If you only have a stream and want to supply the filename yourself, you can use something like:
new Tika().detect(is, fileName)
If you want content-based heuristics, feel free to check Tika's JIRA for an existing ticket and file one if there is none.
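Until something like that exists in Tika, a content-based check could be written by hand. The following is only a minimal sketch of one possible heuristic, not anything Tika ships: the class CsvSniffer and the method looksLikeCsv are hypothetical names, and the rule (every sampled line has the same, non-zero number of commas) is an assumption that ignores quoted fields and other delimiters.

```java
// Hypothetical helper, not part of Tika: guesses whether text looks like CSV.
class CsvSniffer {

    /**
     * Returns true when at least two non-empty lines were sampled (up to maxLines)
     * and all of them contain the same number of commas, that number being at least one.
     * Quoted fields that contain commas or line breaks are NOT handled.
     */
    static boolean looksLikeCsv(String text, int maxLines) {
        int expected = -1; // comma count of the first sampled line
        int checked = 0;   // number of lines sampled so far
        for (String line : text.split("\r?\n")) {
            if (line.isEmpty()) {
                continue;
            }
            if (checked == maxLines) {
                break;
            }
            int commas = (int) line.chars().filter(c -> c == ',').count();
            if (commas == 0) {
                return false;
            }
            if (expected == -1) {
                expected = commas;
            } else if (commas != expected) {
                return false;
            }
            checked++;
        }
        return checked >= 2;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCsv("id,name\n1,Ada\n2,Linus\n", 10)); // true
        System.out.println(looksLikeCsv("One line, with a comma.\nA second line.\n", 10)); // false
    }
}
```

A check like this could be combined with the filename hint: trust the *.csv glob when a name is available, and fall back to the heuristic when there is only a stream.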

Related

Getting MIME type of a File

I want to get the MIME type of a file. Can anyone please help me?
I want the MIME type like this:
File file = new File("example.jpeg");
String mimeTypeOfFile = /* file's MIME type */;
Thank you in advance.

You can use the Apache Tika library. It detects and extracts metadata and text from over a thousand different file types:
http://tika.apache.org/0.7/detection.html
It offers several detection methods, such as checking the file extension or reading the file data. Using it is easier and more reliable than writing the detection yourself.
Example:
System.out.println(new Tika().detect(new File(PATH_TO_FILE)));

Uploading File in DAM Programmatically using AssetManager? What MimeType should I use?

I have a form that uploads a file to a SlingServlet. The SlingServlet receives the file and tries to save it in DAM using com.day.cq.dam.api.AssetManager (i.e. it saves the file in DAM programmatically).
The problem arises with MIME types. The user may upload a pdf, xls, doc, etc., so the type is not fixed. I don't know what to set the MIME type to (see the third parameter, xxx): assetMgr.createAsset(newFile, is,"xxx", true);
I tried "application/octet-stream", but CQ ignores the type and logs "asset ignored".
Log:
27.11.2014 18:58:48.595 *INFO* [JobHandler: /etc/workflow/instances/2014-11-27/model_879500607401687:/content/dam/videojetdocuments/videojetdocuments/offerletters/Präsentation_Dominik_Suess.pdf/jcr:content/renditions/original] com.day.cq.dam.video.FFMpegThumbnailProcess execute: asset [/content/dam/videojetdocuments/videojetdocuments/offerletters/Präsentation_Dominik_Suess.pdf] is not of a video mime type, asset ignored.
27.11.2014 18:58:48.596 *INFO* [JobHandler: /etc/workflow/instances/2014-11-27/model_879500607401687:/content/dam/videojetdocuments/videojetdocuments/offerletters/Präsentation_Dominik_Suess.pdf/jcr:content/renditions/original] com.day.cq.dam.video.FFMpegTranscodeProcess execute: asset [/content/dam/videojetdocuments/videojetdocuments/offerletters/Präsentation_Dominik_Suess.pdf] is not of a video mime type, asset ignored.
I tried this using the following link
Is there any generic MIME Type for such type of Files?

You can use the Apache Sling MimeTypeService to compute the mimetype based on an incoming filename. See also http://sling.apache.org/documentation/bundles/mime-type-support-commons-mime.html
If you don't have the filename you'll need something like the Apache Tika Detector, which analyzes the binary to try to guess its mimetype. I don't know if CQ provides such a service out of the box, but if it doesn't you could integrate it yourself.
Edit:
API that checks the MIMEType based on Magic headers Link
Helpful link for understanding the above mentioned problem Link
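To illustrate what detection by magic headers means, here is a minimal sketch. It is not the API linked above and not how Tika or CQ implement it: MagicSniffer and guessMimeType are hypothetical names, and the table covers only a handful of well-known signatures.

```java
// Hypothetical helper: maps a few well-known magic headers to MIME types.
class MagicSniffer {

    /** Returns true when data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Guesses a MIME type from the first bytes of a file; falls back to a generic type. */
    static String guessMimeType(byte[] header) {
        if (startsWith(header, 0x25, 0x50, 0x44, 0x46, 0x2D)) { // "%PDF-"
            return "application/pdf";
        }
        if (startsWith(header, 0xFF, 0xD8, 0xFF)) {
            return "image/jpeg";
        }
        if (startsWith(header, 0x89, 0x50, 0x4E, 0x47)) { // 0x89 followed by "PNG"
            return "image/png";
        }
        if (startsWith(header, 0x50, 0x4B, 0x03, 0x04)) { // "PK": ZIP container
            return "application/zip";
        }
        return "application/octet-stream";
    }
}
```

Note that ZIP-based formats such as docx or xlsx need a look inside the container to be told apart, which is one reason to prefer a full detector. If the header is read from the upload stream, the stream has to be buffered or reopened before it is handed to createAsset.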

Apache Tika, cannot retrieve 'subject' metadata value

I am using Java and I am trying to extract some metadata with Apache Tika, but I cannot extract the expected value for the 'subject' metadata. The file is a jpg image. Here is my code.
First I parse the file like this:
inputStream = new FileInputStream(fileToExtract);
Parser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(inputStream, contentHandler, metadata, new ParseContext());
and then I try to print these:
metadata.get(Metadata.AUTHOR) --> "MyAuthor"
metadata.get(TikaCoreProperties.CREATOR) --> "MyCreator"
metadata.get(TikaCoreProperties.TITLE) --> "MyTitle"
metadata.get(Metadata.SUBJECT) --> **null**
metadata.get(TikaCoreProperties.KEYWORDS) --> **null**
So I get the author, creator and title correctly, but I get null for the subject. I added the metadata manually (right click -> Properties on Windows).
Am I doing something wrong?
PS: Note that TikaCoreProperties.KEYWORDS is another way to retrieve the subject according to the Apache Tika documentation.

Apache Tika tries to return consistent metadata across all file formats. It shouldn't matter if one format calls it Author, another Creator, another Created By and another Creator[0]; Tika maps those all onto a consistent key. Typically, those keys are based on well-known external standards, such as Dublin Core.
If you want to see the mappings that Tika applies to Microsoft Office documents, you'll need to look in SummaryExtractor. If you want to know what all the metadata keys and values are that Tika can extract from a given file, either use the tika-app cli tool with --metadata, or call names() on the Metadata object to get the list of metadata keys Tika found.

Detect Content-Type Based on FileName

I'm trying to use Apache Tika to determine the content type (e.g. application/pdf for .pdf files). I would like to use Apache Tika's org.apache.tika.detect.NameDetector class. My problem is that its detect method only accepts an InputStream. I do not have access to the file's InputStream. I only have the file's name (e.g. myFile.pdf).
Is there any good way to use Apache Tika to determine the content type based only on the extension/name of the file? (Note: I would like to avoid creating a temp file with the desired name to determine its content type.)
Thanks.

You can use the normal Apache Tika Detector interface passing in null for the InputStream, and supplying the filename.
Your code would look something like:
TikaConfig config = new TikaConfig();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
// Detector#detect returns a MediaType, so convert it to a String
String mimetype = config.getDetector().detect(null, metadata).toString();
To simplify things even more, if you use the Tika facade class you can just do:
Tika tika = new Tika();
String mimetype = tika.detect(filename);
In both cases you get back the mimetype guessed from the filename alone.
For more information, see the "Ways of triggering Detection" documentation on the Apache Tika website.
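Conceptually, name-only detection is a lookup from the file extension to a type. The sketch below shows that idea with the JDK alone. It is a hypothetical stand-in, not Tika's implementation: ExtensionLookup, guessFromName and the tiny hand-written table are assumptions for illustration, whereas Tika consults its full registry of glob patterns.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for name-based detection; the table is a small
// hand-written sample, not Tika's registry.
class ExtensionLookup {

    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("csv", "text/csv");
        TYPES.put("pdf", "application/pdf");
        TYPES.put("jpg", "image/jpeg");
        TYPES.put("jpeg", "image/jpeg");
        TYPES.put("txt", "text/plain");
    }

    /** Guesses a MIME type from the extension only; the file need not exist. */
    static String guessFromName(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return "application/octet-stream"; // no usable extension
        }
        String extension = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(extension, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(guessFromName("myFile.pdf")); // application/pdf
    }
}
```

In real code the Tika facade is the better choice, since it already knows far more extensions than any hand-written table.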

I did some searching and found a blog post which contains a code example that determines the type using the org.apache.tika.Tika class's detect method.
So I could write something like this:
org.apache.tika.Tika tika = new org.apache.tika.Tika();
String mimeType = tika.detect("abc.pdf"); // replace abc.pdf with a string variable

How to convert .doc or .docx files to .txt

I'm wondering how you can convert Word .doc/.docx files to text files through Java. I understand that there's an option where I can do this through Word itself but I would like to be able to do something like this:
java DocConvert somedocfile.doc converted.txt
Thanks.

If you're interested in a Java library that deals with Word document files, you might want to look at e.g. Apache POI. A quote from the website:
Why should I use Apache POI?
A major use of the Apache POI api is
for Text Extraction applications such
as web spiders, index builders, and
content management systems.
P.S.: If, on the other hand, you're simply looking for a conversion utility, Stack Overflow may not be the most appropriate place to ask for this.
Edit: If you don't want to use an existing library but do all the hard work yourself, you'll be glad to hear that Microsoft has published the required file format specifications. (The Microsoft Open Specification Promise lists the available specifications. Just google for any of them that you're interested in. In your case, you'd need e.g. the OLE2 Compound File Format, the Word 97 binary file format, and the Open XML formats.)
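To give an idea of what the do-it-yourself route involves for the Open XML case only: a .docx file is a ZIP archive whose main text lives in word/document.xml. The following is a rough sketch under that assumption, using only the JDK (Java 9 or later for readAllBytes). DocxTextSketch and its methods are hypothetical names, the tag stripping is deliberately crude (tabs, line breaks, headers, footnotes and tables are ignored), and the binary .doc format is not covered at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch: pulls plain text out of a .docx without POI or Tika.
class DocxTextSketch {

    /** Reads word/document.xml from a .docx (a ZIP archive) and returns its text. */
    static String extractText(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    // readAllBytes stops at the end of the current ZIP entry
                    String xml = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                    return xmlToText(xml);
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    /** Crude conversion: paragraph ends become newlines, all other tags are dropped. */
    static String xmlToText(String xml) {
        return xml
                .replaceAll("</w:p>", "\n")
                .replaceAll("<[^>]+>", "")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;"
    }
}
```

Even this small sketch shows why a library is usually the better choice: everything beyond plain paragraphs needs additional handling.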

Use the Apache Tika command line utility. Tika supports a wide range of formats (e.g. doc, docx, pdf, html, rtf ...)
java -jar tika-app-1.3.jar -t somedocfile.doc > converted.txt
Programmatically:
File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);
You can use Apache POI too. It has tools for extracting text from doc/docx files (see its Text Extraction documentation). If you want to extract only the text, you can use the code below. If you want to extract rich text (such as formatting and styling), you can use Apache Tika.
Extract doc:
InputStream fis = new FileInputStream(...);
POITextExtractor extractor;
// if docx
if (fileName.toLowerCase().endsWith(".docx")) {
    XWPFDocument doc = new XWPFDocument(fis);
    extractor = new XWPFWordExtractor(doc);
} else {
    // if doc
    POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
    extractor = ExtractorFactory.createExtractor(fileSystem);
}
String extractedText = extractor.getText();

You should consider using the Apache POI library.
Excerpt from the website:
In short, you can read and write MS
Excel files using Java. In addition,
you can read and write MS Word and MS
PowerPoint files using Java. Apache
POI is your Java Excel solution (for
Excel 97-2008). We have a complete API
for porting other OOXML and OLE2
formats and welcome others to
participate.

Docmosis can read a doc and spit out the text in it. Requires some infrastructure to be installed (such as OpenOffice).
You can also use JODConverter.
