Get document name of embedded file in xls (Apache POI)

I would like to save all embedded files of an .xls file (POI type: HSSF), regardless of the embedded file type. I'm happy if I can save all embedded files without an extension. I'm using the Apache POI library 3.7 on Java 7.
Now I'm having trouble using createDocumentInputStream(document): I don't know how to obtain the parameter it expects. Can anyone help me?
public static void saveEmbeddedXLS(InputStream fis_param, String outputfile) throws IOException, InvalidFormatException {
    // HSSF - XLS
    int i = 0;
    System.out.println("Starting Embedded Search in xls...");
    POIFSFileSystem fs = new POIFSFileSystem(fis_param); // create FileSystem using fileInputStream
    HSSFWorkbook workbook = new HSSFWorkbook(fs);
    for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
        System.out.println("Objects : " + obj.getOLE2ClassName()); // the OLE2 class name of the object
        String oleName = obj.getOLE2ClassName(); // document type
        DirectoryNode dn = (DirectoryNode) obj.getDirectory(); // get directory node
        // Trying to create an input stream for the embedded document. createDocumentInputStream expects a String: where/how can I get the correct value?
        InputStream is = dn.createDocumentInputStream(oleName); // oleName = document type, but not its name (wrong!)
        FileOutputStream fos = new FileOutputStream(outputfile + "_" + i); // output file path + number
        IOUtils.copy(is, fos); // input stream > file output stream (save file without extension)
        i++;
    }
}

Related

java unknown protocol: e when downloading a file

I'm a beginner at Java file handling. I tried to load a bin file (en-parser-chunking.bin) from my hard disk partition into my web application. So far I have tried the code below, and it prints the following to my console:
unknown protocol: e
These are the code samples I have tried so far:
//download file
public void download(String url, File destination) throws IOException {
URL website = new URL(url);
ReadableByteChannel rbc = Channels.newChannel(website.openStream());
FileOutputStream fos = new FileOutputStream(destination);
fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
}
public void parserAction() throws Exception {
//InputStream is = new FileInputStream("en-parser-chunking.bin");
File modelFile = new File("en-parser-chunking.bin");
if (!modelFile.exists()) {
System.out.println("Downloading model.");
download("E:\\Final Project\\Softwares and tools\\en-parser-chunking.bin", modelFile);
}
ParserModel model = new ParserModel(modelFile);
Parser parser = ParserFactory.create(model);
Parse topParses[] = ParserTool.parseLine(line, parser, 1);
for (Parse p : topParses){
//p.show();
getNounPhrases(p);
}
}
Is it possible to get a file this way, or have I done it wrong?
Note: I need to read this file from my hard disk, not download it from the internet.

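The error occurs because URL parses the drive letter in "E:\..." as the protocol. The correct URL for a local file is:
file:///E:/Final%20Project/Softwares%20and%20tools/en-parser-chunking.bin
where file is the protocol and spaces are escaped as %20.
You can also use:
new File("E:/Final Project/Softwares and tools/en-parser-chunking.bin").toURI().toURL()
to create a URL from your file. (File.toURL() is deprecated because it does not escape characters such as spaces.)
I also recommend using a forward slash as the file separator instead of a backslash.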
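To illustrate the File-to-URL conversion described above, here is a minimal sketch using only the JDK. The class name LocalFileUrl and the temp-file names are made up for the demo; the point is only that toURI().toURL() yields a file: URL with spaces escaped, and that such a URL can then be opened like any other.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class LocalFileUrl {
    // Builds a file: URL for a local file; toURI() escapes spaces as %20
    static URL toUrl(File file) throws IOException {
        return file.toURI().toURL();
    }

    // Copies whatever the URL points at into the destination file
    static void download(URL source, Path destination) throws IOException {
        try (InputStream in = source.openStream()) {
            Files.copy(in, destination, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo files in the temp directory (note the spaces in the name)
        Path src = Files.createTempFile("model with space", ".bin");
        Files.write(src, new byte[] {1, 2, 3});
        Path dst = Files.createTempFile("copy", ".bin");

        URL url = toUrl(src.toFile());
        download(url, dst);
        System.out.println("Copied from " + url + " to " + dst);
    }
}
```

If the file is always local, it may be simpler to skip the URL entirely and pass the File (or a FileInputStream) straight to whatever needs it.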

How to list all embedded files from a microsoft office document, using Apache POI?

Is there any way to list all embedded objects (doc, ..., txt) in an Office file (doc, docx, xls, xlsx, ppt, pptx, ...)?
I am using Apache POI (Java) Library, to extract text from office files. I don't need to extract all the text from embedded objects, a log file with the file names of all embedded documents would be nice (something like: string objectFileNames = getEmbeddedFileNames(fileInputStream)).
Example: I have a Word Document "test.doc" which contains another file called "excel.xls". I'd like to write the file name of excel.xls (in this case) into a log file.
I tried this using some sample code from the apache homepage (https://poi.apache.org/text-extraction.html). But my Code always returns the same ("Footer Text: Header Text").
What I tried is:
private static void test(String inputfile, String outputfile) throws Exception {
String[] extractedText = new String[100];
int emb = 0;//used for counter of embedded objects
InputStream fis = new FileInputStream(inputfile);
PrintWriter out = new PrintWriter(outputfile);//write text to file (txt)
System.out.println("Embedded Search started. Inputfile: " + inputfile);
//Based on Apache sample Code
emb = 0;//Reset Counter
POIFSFileSystem emb_fileSystem = new POIFSFileSystem(fis);
// Firstly, get an extractor for the Workbook
POIOLE2TextExtractor oleTextExtractor =
ExtractorFactory.createExtractor(emb_fileSystem);
// Then a List of extractors for any embedded Excel, Word, PowerPoint
// or Visio objects embedded into it.
POITextExtractor[] embeddedExtractors =
ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
for (POITextExtractor textExtractor : embeddedExtractors) {
// If the embedded object was an Excel spreadsheet.
if (textExtractor instanceof ExcelExtractor) {
ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
extractedText[emb] = (excelExtractor.getText());
}
// A Word Document
else if (textExtractor instanceof WordExtractor) {
WordExtractor wordExtractor = (WordExtractor) textExtractor;
String[] paragraphText = wordExtractor.getParagraphText();
for (String paragraph : paragraphText) {
extractedText[emb] = paragraph;
}
// Display the document's header and footer text
System.out.println("Footer text: " + wordExtractor.getFooterText());
System.out.println("Header text: " + wordExtractor.getHeaderText());
}
// PowerPoint Presentation.
else if (textExtractor instanceof PowerPointExtractor) {
PowerPointExtractor powerPointExtractor =
(PowerPointExtractor) textExtractor;
extractedText[emb] = powerPointExtractor.getText();
emb++;
extractedText[emb] = powerPointExtractor.getNotes();
}
// Visio Drawing
else if (textExtractor instanceof VisioTextExtractor) {
VisioTextExtractor visioTextExtractor =
(VisioTextExtractor) textExtractor;
extractedText[emb] = visioTextExtractor.getText();
}
emb++;//Count Embedded Objects
}//Close For Each Loop POIText...
for(int x = 0; x < extractedText.length; x++){//Write Results to TXT
if (extractedText[x] != null){
System.out.println(extractedText[x]);
out.println(extractedText[x]);
}
else {
break;
}
}
out.close();
}
The input file is an xls that contains a doc file as an embedded object, and the output file is a txt.
Thanks to anyone who can help me.

I don't think embedded OLE objects keep their original file name, so I don't think what you want is really possible.
I believe what Microsoft writes about embedded images also applies to OLE-Objects:
You might notice that the file name of the image file has been changed from Eagle1.gif to image1.gif. This is done to address privacy concerns, in that a malicious person could derive a competitive advantage from the name of parts in a document, such as an image file. For example, an author might choose to protect the contents of a document by encrypting the textual part of the document file. However, if two images are inserted named old_widget.gif and new_reenforced_widget.gif, even though the text is protected, a malicious person could learn the fact that the widget is being upgraded. Using generic image file names such as image1 and image2 adds another layer of protection to Office Open XML Formats files.
However, you could try the following (for Word 2007 files, aka XWPFDocument, aka ".docx"; other MS Office files work similarly):
try (FileInputStream fis = new FileInputStream("mydoc.docx")) {
    XWPFDocument document = new XWPFDocument(fis);
    listEmbeds(document);
}

private static void listEmbeds(XWPFDocument doc) throws OpenXML4JException {
    List<PackagePart> embeddedDocs = doc.getAllEmbedds();
    if (embeddedDocs != null && !embeddedDocs.isEmpty()) {
        Iterator<PackagePart> pIter = embeddedDocs.iterator();
        while (pIter.hasNext()) {
            PackagePart pPart = pIter.next();
            System.out.print(pPart.getPartName() + ", ");
            System.out.print(pPart.getContentType() + ", ");
            System.out.println();
        }
    }
}
The pPart.getPartName() is the closest I could find to a file name of an embedded file.

Using Apache POI, you cannot get the original names of the embedded files.
However, if you really need the original names, you can use the Aspose API.
You can use Aspose.Cells for Excel files, Aspose.Slides for presentation files, and Aspose.Words for Word files to extract the embedded files.
You will get the file name if the OLE object is linked; otherwise Aspose will not give you the original file name either.
See the example below:
public void getDocEmbedded(InputStream stream) throws Exception {
    Document doc = new Document(stream);
    NodeCollection<?> shapes = doc.getChildNodes(NodeType.SHAPE, true);
    System.out.println(shapes.getCount());
    int itemcount = 0;
    for (int i = 0; i < shapes.getCount(); i++) {
        Shape shape = (Shape) shapes.get(i);
        OleFormat oleFormat = shape.getOleFormat();
        if (oleFormat != null) {
            // Only embedded (not linked) objects that are shown as an icon
            if (!oleFormat.isLink() && oleFormat.getOleIcon()) {
                itemcount++;
                String progId = oleFormat.getProgId();
                System.out.println("Extension: " + oleFormat.getSuggestedExtension() + " file name: " + oleFormat.getIconCaption());
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                byte[] bytearray = oleFormat.getRawData();
                if (bytearray == null) {
                    oleFormat.save(baos);
                    bytearray = baos.toByteArray();
                }
                // TODO: do whatever you want with the byte array
            }
        }
    }
}
I'm using oleFormat.getSuggestedExtension() to get the embedded file extension and oleFormat.getIconCaption() to get the embedded file names.

public class GetEmbedded {
    public static void main(String[] args) throws Exception {
        String path = "SomeExcelFile.xlsx";
        XSSFWorkbook workbook = new XSSFWorkbook(new FileInputStream(new File(path)));
        // Print the content type of every embedded part in the workbook
        for (PackagePart pPart : workbook.getAllEmbedds()) {
            String contentType = pPart.getContentType();
            System.out.println("Embedded content in the Excel file: " + contentType);
        }
    }
}

Java: Overwrite the existing file on server with modified file(excel sheet) using File API

I have code that is designed to open a local master file, make additions, and save the file both by overwriting the master file and by overwriting a write-protected copy on an accessible network location.
But I am unable to replace the existing file on the server. I have gone through other links on Stack Overflow as well, but still without success.
Kindly assist me. The code is:
public class UploadAndSaveExcelAction extends Action
{
public ActionForward execute(
ActionMapping mapping,
ActionForm form,
HttpServletRequest request,
HttpServletResponse response) throws Exception{
UploadAndSaveExcelForm myForm = (UploadAndSaveExcelForm)form;
String target = null;
if (myForm.getTheExcel().getFileName().length() > 0) {
FormFile myFile = myForm.getTheExcel();
System.out.println("" +myFile);
String fileName = myFile.getFileName();
byte[] fileData = myFile.getFileData();
//Get the servers upload directory real path name
String filePath = getServlet().getServletContext().getRealPath("/") +"Sheet\SparesUsed.xls";
/* Save file on the server */
//create the upload folder if not exists
File folder = new File(filePath);
if(folder.exists()){
System.out.println("Excel Sheet folder is existed therefore deleted");
folder.deleteOnExit();
}
String filePath1 = getServlet().getServletContext().getRealPath("/") +"Sheet";
File folder1 = new File(filePath1+"\\" + FileName);
System.out.println("Excel Sheet afterr delete folder is "+folder1);
boolean makedirectory=folder1.mkdir();
System.out.println(" Making Directory "+makedirectory);
if(!fileName.equals("")){
System.out.println("Server path for Excel :" +filePath);
//Create file
File fileToCreate = new File(filePath, fileName);
//If file does not exists create file
if(!fileToCreate.exists()){
FileOutputStream fileOutStream = new FileOutputStream(fileToCreate);
fileOutStream.write(fileData);
fileOutStream.flush();
fileOutStream.close();
target ="success";
} return mapping.findForward(target);}
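Two details in the code above are worth noting: File.deleteOnExit() only deletes when the JVM terminates, not immediately, and the file is written only when it does not exist yet, so an existing file is never replaced. As a rough sketch of the overwrite step alone, assuming the upload arrives as a byte array (as FormFile.getFileData() returns), the JDK's java.nio.file.Files can do it in one call. The class name OverwriteUpload and the directory and file names are placeholders, not part of the original code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class OverwriteUpload {
    // Writes data to uploadDir/fileName, replacing any existing file of that name
    static Path save(Path uploadDir, String fileName, byte[] data) throws IOException {
        Files.createDirectories(uploadDir); // no-op if the directory already exists
        Path target = uploadDir.resolve(fileName);
        // With no options given, Files.write uses CREATE, TRUNCATE_EXISTING and WRITE,
        // so an existing file is overwritten in place
        return Files.write(target, data);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical upload directory for the demo
        Path dir = Files.createTempDirectory("Sheet");
        save(dir, "SparesUsed.xls", "first version".getBytes(StandardCharsets.UTF_8));
        Path saved = save(dir, "SparesUsed.xls", "second version".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(Files.readAllBytes(saved), StandardCharsets.UTF_8));
    }
}
```

In the servlet, uploadDir would be built from getRealPath("/") plus the folder name, and the write will still fail if the target is write-protected at the file-system level.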

You can use Spoon library.
I know it's been a while since the original post, but one of the more accessible looking Java transformation libraries appears to be Spoon(http://spoon.gforge.inria.fr/).
From the Spoon Homepage(http://spoon.gforge.inria.fr/):
Spoon enables you to transform (see below) and analyze (see example) source code. Spoon provides a complete and fine-grained Java metamodel where any program element (classes, methods, fields, statements, expressions...) can be accessed both for reading and modification. Spoon takes as input source code and produces transformed source code ready to be compiled.

Java POI - Error: Unable to read entire header

I'm trying to read a .doc file with java through the POI library. Here is my code:
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument document = new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
And I have this exception:
java.io.IOException: Unable to read entire header; 162 bytes read; expected 512 bytes
at org.apache.poi.poifs.storage.HeaderBlock.alertShortRead(HeaderBlock.java:226)
at org.apache.poi.poifs.storage.HeaderBlock.readFirst512(HeaderBlock.java:207)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:138)
at MicrosoftWordParser.getDocString(MicrosoftWordParser.java:277)
at MicrosoftWordParser.main(MicrosoftWordParser.java:86)
My file is not corrupted; I can open it with Microsoft Word.
I'm using POI 3.9 (the latest stable version).
Do you have any idea how to solve the problem?
Thank you.

readFirst512() reads the first 512 bytes of your InputStream and throws an exception if there are not enough bytes to read. I think your file is too small to be read by POI.

It is probably not a correct Word file. Is it really only 162 bytes long? Check in your filesystem.
I'd recommend creating a new Word file using Word or LibreOffice, and then try to read it using your program.

You should try this program.
package file_opration;

import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile {
    public static void main(String[] args) {
        File file = null;
        WordExtractor extractor = null;
        try {
            file = new File("filepath location");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            for (int i = 0; i < fileData.length; i++) {
                if (fileData[i] != null)
                    System.out.println(fileData[i]);
            }
        }
        catch (Exception exep) {
            // Report the problem instead of silently swallowing it
            exep.printStackTrace();
        }
    }
}

Ahh, you've got a file, but then you're spending loads of memory buffering the whole thing by hiding your file behind an InputStream... Don't! If you have a File, give that to POI. Only give POI an InputStream if that's all you have.
Your code should be something like:
NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("myfile.doc"));
HWPFDocument document = new HWPFDocument(fs.getRoot());
That'll be quicker and use less memory than reading it through an InputStream, and if there are problems with the file you should normally get slightly more helpful error messages too.

A 162-byte MS Word .doc is probably an "owner file": a temporary file that Word uses to signal that the document is locked/owned.
Such files have a .doc file extension, but they are not MS Word documents.
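One way to guard against this before handing a file to POI is a quick pre-check, sketched below with the JDK only. It relies on three things: Word names its owner files with a "~$" prefix, the OLE2 header block is 512 bytes (which is what the exception message refers to), and every OLE2 file starts with the signature D0 CF 11 E0 A1 B1 1A E1. The class name Ole2Check is made up, and this is a heuristic rather than a full validity test.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

class Ole2Check {
    // Signature at the start of every OLE2 compound file (.doc, .xls, .ppt)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Size of the OLE2 header block that POI reads first
    private static final int HEADER_SIZE = 512;

    // Word's lock ("owner") files carry the document name prefixed with "~$"
    static boolean isOwnerFile(File file) {
        return file.getName().startsWith("~$");
    }

    // Heuristic: not an owner file, at least one header block long, correct signature
    static boolean looksLikeOle2(File file) throws IOException {
        if (isOwnerFile(file) || file.length() < HEADER_SIZE) {
            return false;
        }
        byte[] head = new byte[OLE2_MAGIC.length];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            in.readFully(head);
        }
        return Arrays.equals(head, OLE2_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // "myfile.doc" is a placeholder path
        File candidate = new File(args.length > 0 ? args[0] : "myfile.doc");
        System.out.println(candidate + " looks like OLE2: " + looksLikeOle2(candidate));
    }
}
```

When scanning a whole directory of .doc files, skipping everything for which looksLikeOle2 returns false avoids the "Unable to read entire header" exception for lock files.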

Write to different file instead of overwriting file

I am wondering whether there is a way in Java to read a file from a specific path, e.g. C:\test1.txt, change its content in memory, and write the result to D:\test2.txt, so that C:\test1.txt stays unchanged and only D:\test2.txt is affected.
Thanks

As a basic solution, you can read in chunks from one FileInputStream and write to a FileOutputStream:
import java.io.*;

class Test {
    public static void main(String[] args) throws Exception {
        FileInputStream inFile = new FileInputStream("test1.txt");
        FileOutputStream outFile = new FileOutputStream("test2.txt");
        byte[] buffer = new byte[128];
        int count;
        // read() returns the number of bytes read, or -1 at end of file
        while (-1 != (count = inFile.read(buffer))) {
            // Dumb example: upper-case every byte before writing it out
            for (int i = 0; i < count; ++i) {
                buffer[i] = (byte) Character.toUpperCase(buffer[i]);
            }
            outFile.write(buffer, 0, count);
        }
        inFile.close();
        outFile.close();
    }
}
If you explicitly want the entire file in memory, you can also wrap your input in a DataInputStream and use readFully(byte[]) after using File.length() to figure out the size of the file.
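The whole-file variant mentioned above could look roughly like this. It is a sketch under the assumption that the file fits in a single byte array (under 2 GB); the class name WholeFileCopy and the temp-file names are placeholders.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class WholeFileCopy {
    // Reads the complete file into one byte array, sized via File.length()
    static byte[] readAll(File file) throws IOException {
        byte[] data = new byte[(int) file.length()];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            in.readFully(data); // blocks until the array is filled, or throws EOFException
        }
        return data;
    }

    // Changes the content in memory and writes it to a different file;
    // the source file is only read, never written
    static void copyUpperCased(File source, File target) throws IOException {
        byte[] data = readAll(source);
        for (int i = 0; i < data.length; i++) {
            if (data[i] >= 'a' && data[i] <= 'z') {
                data[i] -= 32; // ASCII lower case to upper case
            }
        }
        try (FileOutputStream out = new FileOutputStream(target)) {
            out.write(data);
        }
    }

    public static void main(String[] args) throws IOException {
        // Temp files stand in for C:\test1.txt and D:\test2.txt
        File source = File.createTempFile("test1", ".txt");
        File target = File.createTempFile("test2", ".txt");
        Files.write(source.toPath(), "hello".getBytes(StandardCharsets.US_ASCII));
        copyUpperCased(source, target);
        System.out.println(new String(readAll(target), StandardCharsets.US_ASCII));
    }
}
```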

I think the easiest thing you can do is to use the Scanner class to read the file and then write it out with a writer.
Here are some nice examples for different Java versions.
Alternatively, you can use the Apache Commons IO library to read/write/copy files.
// FileUtils is org.apache.commons.io.FileUtils from Apache Commons IO
public static void main(String args[]) throws IOException {
    // absolute path of the source file to be copied
    String source = "C:/sample.txt";
    // directory where the file will be copied
    String target = "C:/Test/";
    // name of the source file
    File sourceFile = new File(source);
    String name = sourceFile.getName();
    File targetFile = new File(target + name);
    System.out.println("Copying file : " + sourceFile.getName() + " from Java Program");
    // copy the file from one location to the other
    FileUtils.copyFile(sourceFile, targetFile);
    System.out.println("copying of file from Java program is completed");
}
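For the Scanner-plus-writer approach mentioned at the start of this answer, a minimal JDK-only sketch might look like the following. It assumes a UTF-8 text file and uses upper-casing as a stand-in for whatever change you need; the class name ScannerCopy and the temp-file names are placeholders.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Locale;
import java.util.Scanner;

class ScannerCopy {
    // Reads the source line by line, changes each line in memory,
    // and writes the result to a different file; the source is never modified
    static void copyUpperCased(File source, File target) throws IOException {
        try (Scanner in = new Scanner(source, "UTF-8");
             PrintWriter out = new PrintWriter(target, "UTF-8")) {
            while (in.hasNextLine()) {
                out.println(in.nextLine().toUpperCase(Locale.ROOT));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Temp files stand in for C:\test1.txt and D:\test2.txt
        File source = File.createTempFile("test1", ".txt");
        File target = File.createTempFile("test2", ".txt");
        Files.write(source.toPath(), "hello\nworld\n".getBytes(StandardCharsets.UTF_8));
        copyUpperCased(source, target);
        for (String line : Files.readAllLines(target.toPath(), StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

Note that println writes the platform line separator, so line endings in the target may differ from those in the source.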
