Apache Tika extracting metadata via jar but not in sample code - java

I've been able to extract metadata via the tika-app executable jar using the following line:
java -jar tika-app-1.13.jar --metadata example_received_regular.msg
It prints out all of the metadata. But when I try to execute a simple extraction of the same file in a Java program, I don't get any of it.
public static void main(String[] args) throws Exception {
    Class<?> clazz = Class.forName("org.apache.tika.parser.ocr.TesseractOCRParser");
    FileInputStream des = new FileInputStream("/Users/jason/docstore/example_received_regular.msg");
    Tika tika = new Tika();
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1);
    String detected = tika.detect(des);
    Metadata tikaMetadata = new Metadata();
    parser.parse(des, handler, tikaMetadata, new ParseContext());
    String[] names = tikaMetadata.names();
    for (String name : names) {
        System.out.println(name + ": " + tikaMetadata.get(name));
    }
    System.out.println(detected);
}
My first thought was that the tika-parsers library was somehow unavailable at runtime, which is why I attempt to load TesseractOCRParser on the first line, but that class loads just fine. Executing this program results in the following output:
X-Parsed-By: org.apache.tika.parser.EmptyParser
Content-Type: application/octet-stream
application/x-tika-msoffice
This seems like the most basic example of Tika metadata extraction I can find anywhere. The extraction runs fine with the jar but not in this example. Am I missing something?

The TikaCLI program utilizes a special TikaInputStream object which populates the metadata (unlike the FileInputStream in your example above).
You can make the following changes in order to print the metadata values:
public static void main(String[] args) throws Exception {
    File file = new File("/Users/jason/docstore/example_received_regular.msg");
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1);
    Metadata tikaMetadata = new Metadata();
    InputStream input = TikaInputStream.get(file, tikaMetadata);
    parser.parse(input, handler, tikaMetadata, new ParseContext());
    String[] names = tikaMetadata.names();
    Arrays.sort(names);
    for (String name : names) {
        System.out.println(name + ": " + tikaMetadata.get(name));
    }
}
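As a side note (not part of the fix itself, assuming Tika 1.x), you can wrap the stream in try-with-resources so it is closed even if parsing throws, since TikaInputStream is a regular InputStream:
File file = new File("/Users/jason/docstore/example_received_regular.msg");
Metadata tikaMetadata = new Metadata();
try (InputStream input = TikaInputStream.get(file, tikaMetadata)) {
    new AutoDetectParser().parse(input, new BodyContentHandler(-1),
            tikaMetadata, new ParseContext());
}
// tikaMetadata is now populated and can be printed as above.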

Related

java unknown protocol: e when downloading a file

I'm a beginner at Java file handling. I tried to get a bin file (en-parser-chunking.bin) from my hard disk partition into my web application. So far I have tried the code below, and it gives me the following output in my console:
unknown protocol: e
These are the code samples I have tried so far:
// download file
public void download(String url, File destination) throws IOException {
    URL website = new URL(url);
    ReadableByteChannel rbc = Channels.newChannel(website.openStream());
    FileOutputStream fos = new FileOutputStream(destination);
    fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
}

public void parserAction() throws Exception {
    //InputStream is = new FileInputStream("en-parser-chunking.bin");
    File modelFile = new File("en-parser-chunking.bin");
    if (!modelFile.exists()) {
        System.out.println("Downloading model.");
        download("E:\\Final Project\\Softwares and tools\\en-parser-chunking.bin", modelFile);
    }
    ParserModel model = new ParserModel(modelFile);
    Parser parser = ParserFactory.create(model);
    Parse topParses[] = ParserTool.parseLine(line, parser, 1);
    for (Parse p : topParses) {
        //p.show();
        getNounPhrases(p);
    }
}
Is getting a file this way possible, or have I done it wrong?
Note: I need to get this from my hard disk, not download it from the internet.
The correct URL for a local file is:
file:///E:/Final Project/Softwares and tools/en-parser-chunking.bin
where file is the protocol.
You can also use:
new File("E:/Final Project/Softwares and tools/en-parser-chunking.bin").toURL()
to create a URL from your file.
I also recommend using forward slashes as file separators instead of backslashes.
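Note that File.toURL() is deprecated because it does not escape characters such as the spaces in this path; a small sketch of the usual replacement, going through toURI() first:
File model = new File("E:/Final Project/Softwares and tools/en-parser-chunking.bin");
// toURI() percent-encodes the spaces, producing a valid URL such as
// file:/E:/Final%20Project/Softwares%20and%20tools/en-parser-chunking.bin
URL url = model.toURI().toURL();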

Get document name of embedded file in xls (Apache POI)

I would like to save all embedded files of a .xls (POI type: HSSF) file, no matter which embedded file type they are. So I'm happy if I can save all embedded files without an extension. I'm using the Apache POI library 3.7 on Java 7.
Now, I'm having trouble using createDocumentInputStream(document). I don't know how to get this expected parameter. Can anyone help me?
public static void saveEmbeddedXLS(InputStream fis_param, String outputfile) throws IOException, InvalidFormatException {
    // HSSF - XLS
    int i = 0;
    System.out.println("Starting Embedded Search in xls...");
    POIFSFileSystem fs = new POIFSFileSystem(fis_param); // create FileSystem using the FileInputStream
    HSSFWorkbook workbook = new HSSFWorkbook(fs);
    for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
        System.out.println("Objects : " + obj.getOLE2ClassName()); // the OLE2 class name of the object
        String oleName = obj.getOLE2ClassName(); // document type
        DirectoryNode dn = (DirectoryNode) obj.getDirectory(); // get directory node
        // Trying to create an input stream with the embedded document.
        // The argument of createDocumentInputStream should be a String;
        // where/how can I get the correct parameter for this function?
        InputStream is = dn.createDocumentInputStream(oleName); // oleName = document type, but not its name (wrong!)
        FileOutputStream fos = new FileOutputStream(outputfile + "_" + i); // output file path + number
        IOUtils.copy(is, fos); // save file without extension
        i++;
    }
}
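For reference, the String expected by createDocumentInputStream(...) is the name of an entry stored inside the embedded object's OLE2 directory, not its class name. A minimal sketch (not from this thread, POI version permitting) of listing the available entry names through the POIFS API, so you can see what can be opened:
for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
    DirectoryNode dn = (DirectoryNode) obj.getDirectory();
    // getEntries() walks every entry in this embedded object's storage;
    // the entry names (they vary by embedded file type) are what
    // createDocumentInputStream() expects.
    for (Iterator<Entry> it = dn.getEntries(); it.hasNext();) {
        System.out.println("Entry: " + it.next().getName());
    }
}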

How to list all embedded files from a microsoft office document, using Apache POI?

Is there any way to list all embedded objects (doc, ..., txt) in an Office file (doc, docx, xls, xlsx, ppt, pptx, ...)?
I am using the Apache POI (Java) library to extract text from Office files. I don't need to extract all the text from the embedded objects; a log file with the file names of all embedded documents would be nice (something like: String objectFileNames = getEmbeddedFileNames(fileInputStream)).
Example: I have a Word document "test.doc" which contains another file called "excel.xls". I'd like to write the file name of excel.xls (in this case) into a log file.
I tried this using some sample code from the Apache homepage (https://poi.apache.org/text-extraction.html), but my code always returns the same thing ("Footer text: Header text").
What I tried is:
private static void test(String inputfile, String outputfile) throws Exception {
    String[] extractedText = new String[100];
    int emb = 0; // counter for embedded objects
    InputStream fis = new FileInputStream(inputfile);
    PrintWriter out = new PrintWriter(outputfile); // write text to a txt file
    System.out.println("Embedded search started. Inputfile: " + inputfile);
    // Based on Apache sample code
    emb = 0; // reset counter
    POIFSFileSystem emb_fileSystem = new POIFSFileSystem(fis);
    // Firstly, get an extractor for the Workbook
    POIOLE2TextExtractor oleTextExtractor =
            ExtractorFactory.createExtractor(emb_fileSystem);
    // Then a List of extractors for any embedded Excel, Word, PowerPoint
    // or Visio objects embedded into it.
    POITextExtractor[] embeddedExtractors =
            ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
    for (POITextExtractor textExtractor : embeddedExtractors) {
        // If the embedded object was an Excel spreadsheet.
        if (textExtractor instanceof ExcelExtractor) {
            ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
            extractedText[emb] = excelExtractor.getText();
        }
        // A Word document
        else if (textExtractor instanceof WordExtractor) {
            WordExtractor wordExtractor = (WordExtractor) textExtractor;
            String[] paragraphText = wordExtractor.getParagraphText();
            for (String paragraph : paragraphText) {
                extractedText[emb] = paragraph;
            }
            // Display the document's header and footer text
            System.out.println("Footer text: " + wordExtractor.getFooterText());
            System.out.println("Header text: " + wordExtractor.getHeaderText());
        }
        // A PowerPoint presentation.
        else if (textExtractor instanceof PowerPointExtractor) {
            PowerPointExtractor powerPointExtractor =
                    (PowerPointExtractor) textExtractor;
            extractedText[emb] = powerPointExtractor.getText();
            emb++;
            extractedText[emb] = powerPointExtractor.getNotes();
        }
        // A Visio drawing
        else if (textExtractor instanceof VisioTextExtractor) {
            VisioTextExtractor visioTextExtractor =
                    (VisioTextExtractor) textExtractor;
            extractedText[emb] = visioTextExtractor.getText();
        }
        emb++; // count embedded objects
    } // close for-each loop over the extractors
    for (int x = 0; x < extractedText.length; x++) { // write results to txt
        if (extractedText[x] != null) {
            System.out.println(extractedText[x]);
            out.println(extractedText[x]);
        } else {
            break;
        }
    }
    out.close();
}
The input file is an xls which contains a doc file as an embedded object, and the output file is a txt.
Thanks if anyone can help me.
I don't think embedded OLE objects keep their original file name, so I don't think what you want is really possible.
I believe what Microsoft writes about embedded images also applies to OLE objects:
You might notice that the file name of the image file has been changed from Eagle1.gif to image1.gif. This is done to address privacy concerns, in that a malicious person could derive a competitive advantage from the name of parts in a document, such as an image file. For example, an author might choose to protect the contents of a document by encrypting the textual part of the document file. However, if two images are inserted named old_widget.gif and new_reenforced_widget.gif, even though the text is protected, a malicious person could learn the fact that the widget is being upgraded. Using generic image file names such as image1 and image2 adds another layer of protection to Office Open XML Formats files.
However, you could try the following (for Word 2007 files, aka XWPFDocument, aka ".docx"; other MS Office formats work similarly):
try (FileInputStream fis = new FileInputStream("mydoc.docx")) {
    XWPFDocument document = new XWPFDocument(fis);
    listEmbeds(document);
}

private static void listEmbeds(XWPFDocument doc) throws OpenXML4JException {
    List<PackagePart> embeddedDocs = doc.getAllEmbedds();
    if (embeddedDocs != null && !embeddedDocs.isEmpty()) {
        Iterator<PackagePart> pIter = embeddedDocs.iterator();
        while (pIter.hasNext()) {
            PackagePart pPart = pIter.next();
            System.out.print(pPart.getPartName() + ", ");
            System.out.print(pPart.getContentType() + ", ");
            System.out.println();
        }
    }
}
The pPart.getPartName() is the closest I could find to a file name of an embedded file.
Using Apache POI, you cannot get the original names of the embedded files.
However, if you really need the original names, you can use the Aspose API.
You can use Aspose.Cells for Excel files, Aspose.Slides for presentation files, and Aspose.Words for Word files to extract the embedded files.
You'll get the file name only if the OLE object is linked; otherwise you won't get the original file name with Aspose either.
See the example below:
public void getDocEmbedded(InputStream stream) throws Exception { // Aspose's Document constructor throws Exception
    Document doc = new Document(stream);
    NodeCollection<?> shapes = doc.getChildNodes(NodeType.SHAPE, true);
    System.out.println(shapes.getCount());
    int itemcount = 0;
    for (int i = 0; i < shapes.getCount(); i++) {
        Shape shape = (Shape) shapes.get(i);
        OleFormat oleFormat = shape.getOleFormat();
        if (oleFormat != null) {
            if (!oleFormat.isLink() && oleFormat.getOleIcon()) {
                itemcount++;
                String progId = oleFormat.getProgId();
                System.out.println("Extension: " + oleFormat.getSuggestedExtension()
                        + " file name: " + oleFormat.getIconCaption());
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                byte[] bytearray = oleFormat.getRawData();
                if (bytearray == null) {
                    oleFormat.save(baos);
                    bytearray = baos.toByteArray();
                }
                // TODO: do with the byte array whatever you want to
            }
        }
    }
}
I'm using oleFormat.getSuggestedExtension() to get the embedded file extension and oleFormat.getIconCaption() to get the embedded file names.
public class GetEmbedded {
    public static void main(String[] args) throws Exception {
        String path = "SomeExcelFile.xlsx";
        XSSFWorkbook workbook = new XSSFWorkbook(new FileInputStream(new File(path)));
        for (PackagePart pPart : workbook.getAllEmbedds()) {
            String contentType = pPart.getContentType();
            System.out.println("Embedded content in the Excel file: " + contentType);
        }
    }
}

How to parse large text file with Apache Tika 1.5?

Problem:
For my test, I want to extract text data from a 335 MB text file which is wikipedia's "pagecounts-20140701-060000.txt" with Apache Tika.
My solution:
I tried to use TikaInputStream since it provides buffering; then I tried BufferedInputStream, but that didn't solve my problem. Here is my test class:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class Printer {

    public void readMyFile(String fname) throws IOException, SAXException, TikaException {
        System.out.println("Working...");
        File f = new File(fname);
        // InputStream stream = TikaInputStream.get(new File(fname));
        InputStream stream = new BufferedInputStream(new FileInputStream(fname));
        Metadata meta = new Metadata();
        ContentHandler content = new BodyContentHandler(Integer.MAX_VALUE);
        AutoDetectParser parser = new AutoDetectParser();
        String mime = new Tika().detect(f);
        meta.set(Metadata.CONTENT_TYPE, mime);
        System.out.println("trying to parse...");
        try {
            parser.parse(stream, content, meta, new ParseContext());
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) {
        Printer p = new Printer();
        try {
            p.readMyFile("test/pagecounts-20140701-060000.txt");
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}
Problem:
Upon invoking the parse method of the parser I am getting:
Working...
trying to parse...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
    at java.lang.StringBuffer.append(StringBuffer.java:322)
    at java.io.StringWriter.write(StringWriter.java:94)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:92)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:135)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:88)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.tastyminerals.cli.Printer.readMyFile(Printer.java:37)
    at com.tastyminerals.cli.Printer.main(Printer.java:46)
I tried increasing the JVM memory up to -Xms512M -Xmx1024M; that didn't work, and I don't want to use any bigger values.
Questions:
What is wrong with my code?
How should I modify my class to make it extract text from a test file >300 MB with Apache Tika?
You can construct the handler like this to avoid the size limit:
BodyContentHandler bodyHandler = new BodyContentHandler(-1);
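Note that -1 only disables the handler's internal character limit; the BodyContentHandler still buffers all extracted text in memory, so a sufficiently large file can still trigger an OutOfMemoryError (see the next answer for a streaming alternative).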
Pass BodyContentHandler a Writer or OutputStream instead of int
As Gagravarr mentioned, the BodyContentHandler you've used is building an internal string buffer of the file's content. Because Tika is trying to store the entire content in memory at once, this approach will hit an OutOfMemoryError for large files.
If your goal is to write out the Tika parse results to another file for later processing, you can construct BodyContentHandler with a Writer (or OutputStream directly) instead of passing an int:
Path outputFile = Path.of("output.txt"); // Paths.get() if not using Java 11
PrintWriter printWriter = new PrintWriter(Files.newOutputStream(outputFile));
BodyContentHandler content = new BodyContentHandler(printWriter);
And then call Tika parse:
Path inputFile = Path.of("input.txt");
TikaInputStream inputStream = TikaInputStream.get(inputFile);
AutoDetectParser parser = new AutoDetectParser();
Metadata meta = new Metadata();
ParseContext context = new ParseContext();
parser.parse(inputStream, content, meta, context);
By doing this, Tika will automatically write the content to the outputFile as it parses, instead of trying to keep it all in memory. Using a PrintWriter will buffer the output, reducing the number of writes to disk.
Note that Tika will not automatically close your input or output streams for you.
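A minimal sketch of the same flow (same hypothetical file names as above) with try-with-resources, so both ends are closed even if parsing fails:
Path inputFile = Path.of("input.txt");
Path outputFile = Path.of("output.txt");
try (TikaInputStream inputStream = TikaInputStream.get(inputFile);
     PrintWriter printWriter = new PrintWriter(Files.newOutputStream(outputFile))) {
    BodyContentHandler content = new BodyContentHandler(printWriter);
    new AutoDetectParser().parse(inputStream, content, new Metadata(), new ParseContext());
}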
You can use incremental parsing:
Tika tika = new Tika();
Reader fulltext = null;
String contentStr = null;
try {
    fulltext = tika.parse(response.getEntityInputStream());
    contentStr = IOUtils.toString(fulltext);
} finally {
    fulltext.close();
}
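Note, though, that IOUtils.toString(fulltext) materializes the whole document as a single String, which reintroduces the memory problem for very large files. A minimal sketch (file name taken from the question) of staying incremental by consuming the Reader in fixed-size chunks:
try (Reader fulltext = new Tika().parse(new FileInputStream("test/pagecounts-20140701-060000.txt"))) {
    char[] buffer = new char[8192];
    int n;
    // Only one small buffer is held in memory at a time.
    while ((n = fulltext.read(buffer)) != -1) {
        // e.g. write the chunk to a file, feed a tokenizer, update counters...
        System.out.print(new String(buffer, 0, n));
    }
}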
Solution with ByteArrayInputStream
I had a similar problem with CSV files: if they were read in Java with the wrong charset, only some of the records could be imported. The method from my library detects the file's correct encoding and prevents reading errors.
public static String lib_getCharset(String fullFile) {
    // Initialize variables.
    String returnValue = "";
    BodyContentHandler handler = new BodyContentHandler(-1);
    Metadata meta = new Metadata();
    // Convert the BufferedInputStream to a ByteArrayInputStream.
    try (final InputStream is = new BufferedInputStream(new FileInputStream(fullFile))) {
        InputStream bais = new ByteArrayInputStream(is.readAllBytes());
        ParseContext context = new ParseContext();
        TXTParser parser = new TXTParser();
        // Run the Tika TXTParser and read the metadata.
        try {
            parser.parse(bais, handler, meta, context);
            // Fill the metadata names into an array ...
            String[] metaNames = meta.names();
            // ... and iterate over it.
            for (String metaName : metaNames) {
                // Check whether a charset is described.
                if (metaName.equals("Content-Encoding")) {
                    returnValue = meta.get(metaName);
                }
            }
        } catch (SAXException | TikaException se_te) {
            se_te.printStackTrace();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return returnValue;
}
Using Scanner, the file can then be imported as follows.
Scanner scanner = null;
String charsetChar = TrnsLib.lib_getCharset(fullFileName);
try {
    // Scan the file, e.g. with UTF-8 or
    // ISO8859-1 or windows-1252 for ANSI.
    scanner = new Scanner(new File(fullFileName), charsetChar);
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
Don't forget to declare the two dependencies in your pom.xml:
https://repo1.maven.org/maven2/org/apache/tika/tika-core/2.4.1/
https://repo1.maven.org/maven2/org/apache/tika/tika-parser-text-module/2.4.1/
and the requires directives in module-info.java:
module org.wnt.wnt94lib {
    requires transitive org.apache.tika.core;
    requires transitive org.apache.tika.parser.txt;
}
My solution works fine with small files (up to about 100 lines of 300 characters each). Larger files need more attention. The Babylonian confusion around CR and LF led to inconsistencies under Apache Tika: even if the parameter is set to -1 so that BodyContentHandler reads the whole text file, only the first 100 or so lines are used to find the correct charset. And especially in CSV files, exotic characters like ä, ö or ü are rare. But, as luck would have it, Apache Tika finds the combined CR and LF characters and concludes that it must be an ANSI file instead of a UTF-8 one.
So, what can you do? Quick and dirty, you can add the letters ÄÖÜ to the file's first line. However, the following solution is better: load the file with Notepad++. Show all characters under View, Show Symbol. Under Search, Replace... delete all CRs: activate Extended under Search Mode, enter \r\n under Find what and \n under Replace with, set the cursor on the file's first line and press Replace All. This frees the file from the burden of remembering the good old typewriter and converts it into a proper Unix file with UTF-8.
Afterwards, however, do not edit the CSV file with Excel. The program, which I otherwise really appreciate, converts your file back into one with CR ballast. For correct saving without CR you have to use VBA; Ekkehard Horner describes how at: VBA : save a file with UTF-8 without BOM

is this the right way to create a java file (Programmatically)?

Below is my code to create a Java file programmatically.
Is this the right way, or is there another way?
public static void main(String[] args) throws IOException {
    File atlas = new File("D:/WIP/pac/n/sample.txt");
    if (!atlas.exists()) {
        System.out.println("File not exist");
    }
    FileHandle mAtlasHandle = new FileHandle(atlas);
    BufferedReader reader = mAtlasHandle.reader(1024);
    String line = null;
    ArrayList<String> mArrayList = new ArrayList<String>();
    while ((line = reader.readLine()) != null) {
        mArrayList.add(line);
    }
    File file = new File("D:/WIP/pac/n/Sample.java");
    if (!file.exists()) {
        file.createNewFile();
    }
    String packageName = "package com.atom.lib;";
    PrintWriter writer = new PrintWriter(file);
    String mString = new String(file.getName());
    String name = mString.replaceFirst("[.][^.]+$", "");
    String output = Character.toUpperCase(name.charAt(0)) + name.substring(1);
    writer.println(packageName);
    writer.println("public class " + output);
    writer.println("{");
    for (String obj : mArrayList) {
        writer.println("public String " + obj + "=\"" + obj + "\";");
    }
    writer.println("}");
    writer.close();
}
Yes, there is a better way. Use CodeModel to create Java files programmatically, instead of using println() or String-appending methods. From their website:
CodeModel is a Java library for code generators; it provides a way to generate Java programs in a way much nicer than PrintStream.println(). This project is a spin-off from the JAXB RI for its schema compiler to generate Java source files.
But the main problem with CodeModel is that there is very little documentation; the API docs are the only bible you can have for this.
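For illustration, a minimal CodeModel sketch generating roughly the class from the question (the field names are illustrative; the package and class name follow the question; com.sun.codemodel is the package shipped with the JAXB RI):
import java.io.File;
import com.sun.codemodel.JCodeModel;
import com.sun.codemodel.JDefinedClass;
import com.sun.codemodel.JExpr;
import com.sun.codemodel.JMod;

public class CodeModelDemo {
    public static void main(String[] args) throws Exception {
        JCodeModel codeModel = new JCodeModel();
        // Declare com.atom.lib.Sample (throws JClassAlreadyExistsException if declared twice).
        JDefinedClass sample = codeModel._class("com.atom.lib.Sample");
        // One public String field per input line, initialized to its own name.
        for (String fieldName : new String[] { "alpha", "beta" }) {
            sample.field(JMod.PUBLIC, String.class, fieldName, JExpr.lit(fieldName));
        }
        // Writes com/atom/lib/Sample.java under the given source root.
        codeModel.build(new File("D:/WIP/pac/n"));
    }
}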
If you are comfortable with Eclipse plugin development, you can use Eclipse's AST to create Java files programmatically. It can even work standalone, without Eclipse.
I have used FreeMarker templates to generate Java source files. It is a very powerful tool which lets you define a source template plus a model in which you define variables, methods and everything else, and render the result to multiple formats (txt, java, etc.).
You can start by defining a template with placeholders for your required Java source file, and then programmatically apply the values for the placeholders.
The example linked below illustrates how this API works; you can extend it to create a Java source file:
http://viralpatel.net/blogs/freemaker-template-hello-world-tutorial/
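For illustration, a minimal FreeMarker sketch in the same spirit (the inline template and model values are made up for this example; normally the template would live in its own .ftl file):
import java.io.File;
import java.io.FileWriter;
import java.io.StringReader;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import freemarker.template.Configuration;
import freemarker.template.Template;

public class FreeMarkerDemo {
    public static void main(String[] args) throws Exception {
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
        // Inline template: a package declaration, a class, and one String field per entry.
        String templateSource =
                "package com.atom.lib;\n\n"
                + "public class ${className} {\n"
                + "<#list fields as f>"
                + "    public String ${f} = \"${f}\";\n"
                + "</#list>"
                + "}\n";
        Template template = new Template("sample", new StringReader(templateSource), cfg);
        Map<String, Object> model = new HashMap<>();
        model.put("className", "Sample");
        model.put("fields", List.of("alpha", "beta"));
        // Render the template to Sample.java.
        try (FileWriter out = new FileWriter(new File("Sample.java"))) {
            template.process(model, out);
        }
    }
}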
