Reading XLSB file with Apache POI

I am reading an XLSB file with the code below, but it prints everything as plain text. I want to get the headers and the value of each cell so that I can set them on Java objects. Which classes should I use? Thanks in advance.
File file = null;
OPCPackage pkg = null;
XSSFEventBasedExcelExtractor ext = null;
try {
file = Paths.get("C:/Users/U574564/Ref/source/Q4_GL_Assignment v1.xlsb").toFile();
pkg = OPCPackage.open(file,PackageAccess.READ);
ZipSecureFile.setMaxTextSize(110485766);
ext = new XSSFBEventBasedExcelExtractor(pkg);
System.out.println(ext.getText());
} catch(Exception ex) {
System.out.println(ex.getMessage());
}

Related

Corrupted TAR File Error Upon Access From Google Cloud Storage in Java

I am storing a TAR file in Google Cloud Storage. The file can be downloaded successfully via gsutil and extracted on my computer using macOS Archive Utility. However, the Java program that I implemented always encounters java.io.IOException: Corrupted TAR archive when accessing the file. I have tried several approaches, all of them using the org.apache.commons:commons-compress library. Can you give me some insight into how to fix this problem, or something else I can try?
Here are the implementations that I have tried:
Blob blob = storage.get(BUCKET_NAME, FILE_PATH);
blob.downloadTo(Paths.get("filename.tar"));
String contentType = blob.getContentType(); // application/x-tar
InputStream is = Channels.newInputStream(blob.reader());
String mime = URLConnection.guessContentTypeFromStream(is); // null
TarArchiveInputStream ais = new TarArchiveInputStream(is);
ais.getNextEntry(); // raise java.io.IOException: Corrupted TAR archive
InputStream is2 = new ByteArrayInputStream(blob.getContent());
String mime2 = URLConnection.guessContentTypeFromStream(is2); // null
TarArchiveInputStream ais2 = new TarArchiveInputStream(is2);
ais2.getNextEntry(); // raise java.io.IOException: Corrupted TAR archive
InputStream is3 = new FileInputStream("filename.tar");
String mime3 = URLConnection.guessContentTypeFromStream(is3); // null
TarArchiveInputStream ais3 = new TarArchiveInputStream(is3);
ais3.getNextEntry(); // raise java.io.IOException: Corrupted TAR archive
TarFile file = new TarFile(blob.getContent()); // raise java.io.IOException: Corrupted TAR archive
TarFile tarFile = new TarFile(Paths.get("filename.tar")); // raise java.io.IOException: Corrupted TAR archive
Addition: I have tried parsing a JSON file from GCS and it works fine.
Blob blob = storage.get(BUCKET_NAME, FILE_PATH);
JSONTokener jt = new JSONTokener(Channels.newInputStream(blob.reader()));
JSONObject jo = new JSONObject(jt);

The problem is that your tar is compressed: it is actually a tgz (gzip-compressed tar) file.
For that reason, you need to decompress the file when processing your tar contents.
Please consider the following example; note the use of the GzipCompressorInputStream class built into Commons Compress:
public static void main(String... args) {
    // Despite the .tar extension, the archive is gzip-compressed.
    final File archiveFile = new File("latest.tar");
    try (
        FileInputStream in = new FileInputStream(archiveFile);
        // Decompress the gzip layer first, then read the tar entries.
        GzipCompressorInputStream gzIn = new GzipCompressorInputStream(in);
        TarArchiveInputStream tarIn = new TarArchiveInputStream(gzIn)
    ) {
        TarArchiveEntry tarEntry = tarIn.getNextTarEntry();
        while (tarEntry != null) {
            final File path = new File("/tmp", tarEntry.getName());
            if (!path.getParentFile().exists()) {
                path.getParentFile().mkdirs();
            }
            if (!tarEntry.isDirectory()) {
                try (OutputStream out = new FileOutputStream(path)) {
                    IOUtils.copy(tarIn, out);
                }
            }
            tarEntry = tarIn.getNextTarEntry();
        }
    } catch (IOException e) {
        // FileNotFoundException is a subclass of IOException, so one catch covers both.
        e.printStackTrace();
    }
}
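A quick way to check this diagnosis before changing the extraction code is to look at the first two bytes of the file, since gzip streams always start with 0x1f 0x8b. The following is a minimal sketch using only the JDK; the class name GzipSniffer and its methods are illustrative and not part of Commons Compress.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not a Commons Compress class): detects the gzip magic bytes.
public class GzipSniffer {

    /** Returns true if the given leading bytes start with the gzip magic number 0x1f 0x8b. */
    public static boolean looksGzipped(byte[] head) {
        return head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    /**
     * Peeks at the first two bytes of a stream and rewinds it afterwards.
     * The stream must support mark/reset (for example a BufferedInputStream).
     */
    public static boolean looksGzipped(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(2);
        int first = in.read();
        int second = in.read();
        in.reset();
        return first == 0x1f && second == 0x8b;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java GzipSniffer <file>");
            return;
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(looksGzipped(in) ? "gzip-compressed" : "not gzip-compressed");
        }
    }
}
```

If the check reports gzip, wrap the stream in GzipCompressorInputStream before handing it to TarArchiveInputStream; otherwise the plain tar reader can be used directly.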

IE11 converting .xlsx files to .xls and .docx files to .doc

I have a Java application where users can upload and download files. Recently, we found out that whenever users click a link in IE11 to download a .docx or .xlsx file, it downloads a .doc or .xls file instead. In the process, it warns the users that the file format and extension do not match and that they should only open the file if they trust its source. There is no such issue in Microsoft Edge or other browsers.
Is there some setting in IE11, or some IE11-specific coding, that makes it download .xlsx and .docx files as they are without showing annoying warning messages to users?
try {
byte[] fileContent = getFileContent(id, fName);
if (fileContent != null) {
OutputStream out = null;
try {
System.out.println("content type: "+getContentType(fName)); //prints application/vnd.ms-excel
res.reset();
out = res.getOutputStream();
res.setContentType(getContentType(fName));
res.setHeader("Content-Disposition", "inline; filename=" + fName + "; size=" + String.valueOf(fileContent.length));
res.setContentLength(fileContent.length);
out.write(fileContent);
setDestination(req, RESPONSE_NO_REDIRECT);
} catch (Exception ex) {
ex.printStackTrace();
} finally {
flushCloseOutputStream(out);
}
} else {
setDestination(req, "/404.jsp");
}
} catch (Exception ex) {
ex.printStackTrace();
}
public byte[] getFileContent(int id, String fileName) {
byte[] bytes = null;
Transaction tx = null;
Session s = null;
try {
GenericDAO dao = HibernateDAOFactory.getInstance().getDAO(GenericClassDAO.class, Files.class);
s = SessionAndTransactionManagementService.createNewSession(dao);
tx = SessionAndTransactionManagementService.startNewTransaction(s);
Criteria cr = s.createCriteria(Files.class)
.add(Restrictions.eq("id", id))
.add(Restrictions.eq("fileName", fileName))
.setProjection(Projections.property("fileContent"));
bytes = (byte[]) cr.uniqueResult();
SessionAndTransactionManagementService.commitTransaction(s);
} catch (Exception e) {
HibernateUtil.rollback(tx);
}finally{
HibernateUtil.cleanupResources(s);
}
return bytes;
}

IE11 was setting the content type for all Excel files to 'application/vnd.ms-excel' (I don't know why). That made it show warnings and download .xlsx files as .xls.
I changed my code to set the content type manually to 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' when the file name contains .xlsx, and this solved my problem.
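As a rough sketch of that fix, the content type can be chosen from the file extension. The class name OfficeContentTypes and the method forFileName are hypothetical and stand in for the getContentType(fName) helper from the question, whose implementation was not shown. The .docx mapping is an addition based on the question; the answer itself only mentions .xlsx.

```java
import java.util.Locale;

// Hypothetical replacement for the question's getContentType(fName) helper.
public class OfficeContentTypes {

    /** Picks a MIME type from the file extension, checking the OOXML extensions first. */
    public static String forFileName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".xlsx")) {
            return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
        }
        if (lower.endsWith(".docx")) {
            return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
        }
        if (lower.endsWith(".xls")) {
            return "application/vnd.ms-excel";
        }
        if (lower.endsWith(".doc")) {
            return "application/msword";
        }
        // Generic fallback for anything this sketch does not know about.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "report.xlsx";
        System.out.println(name + " -> " + forFileName(name));
    }
}
```

In the servlet code from the question, the result would be passed to res.setContentType(...) in place of the current getContentType(fName) call.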

Create instance of file using value of a property in properties file

I'm trying to create a File instance from a property value in order to parse HTML records. The problem is with the file URL that I have to put in the properties file. Here is my example:
The corresponding code for reading the file:
public void extraxtElementWithoutId() {
Map<String,List<List<Element>>> uniciteIds = new HashMap<String,List<List<Element>>>();
FileReader fileReader = null;
Document doc = null;
try {
fileReader = new FileReader(new ClassPathResource(FILEPROPERTYNAME).getFile());
Properties prop = new Properties();
prop.load(fileReader);
Enumeration<?> enumeration = prop.propertyNames();
List<List<Element>> fiinalReturn = null;
while (enumeration.hasMoreElements()) {
String path = (String) enumeration.nextElement();
System.out.println("Fichier en question : " + prop.getProperty(path));
URL url = getClass().getResource(prop.getProperty(path));
System.out.println(url.getPath());
File inputFile = new File(url.getPath());
doc = Jsoup.parse(inputFile, "UTF-8");
//fiinalReturn = getListofElements(doc);
//System.out.println(fiinalReturn);
fiinalReturn = uniciteIds.put("Duplicat Id", getUnicityIds(doc));
System.out.println(fiinalReturn);
}
} catch (IOException e) {
e.printStackTrace();
}finally {
try{
fileReader.close();
}catch(Exception e) {
e.printStackTrace();
}
}
}
Thank you in advance,
Best Regards.

You are making a very common mistake on this line:
URL url = getClass().getResource(prop.getProperty(path));
Try a property value without the leading src, such as /testHtmlFile/test.html. Don't change the code.
That is, use UrlEnterer1=/testHtmlFile/test.html instead of preceding the path with src.
The value returned by prop.getProperty(path) should match the file's location in your build output. Check your build directory to see how these files are stored: they are not stored under src but directly under the build directory.
This answer explains a little bit about path values for reading files from the classpath.
Also, as a side note (not related to the question), try not to call prop.getProperty(path) yourself; instead, inject the property value directly into your class using the org.springframework.beans.factory.annotation.Value annotation.
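Class.getResource returns null when the path does not match anything on the classpath, which is why the original code fails at url.getPath() when the property value still starts with src. A minimal sketch of a guard that makes the failure explicit follows; the class name ResourceLocator and the method require are hypothetical names chosen for illustration.

```java
import java.net.URL;

// Hypothetical helper: fails with a clear message instead of a NullPointerException.
public class ResourceLocator {

    /**
     * Looks up a classpath resource relative to the given class.
     * A leading "/" makes the path absolute from the classpath root.
     */
    public static URL require(Class<?> anchor, String resourcePath) {
        URL url = anchor.getResource(resourcePath);
        if (url == null) {
            throw new IllegalArgumentException(
                    "Resource not found on classpath: " + resourcePath);
        }
        return url;
    }

    public static void main(String[] args) {
        // Assumed example path, taken from the property value suggested above.
        String path = args.length > 0 ? args[0] : "/testHtmlFile/test.html";
        try {
            System.out.println(require(ResourceLocator.class, path));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a guard like this, a wrong property value shows up as a message naming the missing path rather than as a NullPointerException further down.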

How to list all embedded files from a microsoft office document, using Apache POI?

Is there any way to list all embedded objects (doc, ..., txt) in an Office file (doc, docx, xls, xlsx, ppt, pptx, ...)?
I am using the Apache POI (Java) library to extract text from Office files. I don't need to extract all the text from the embedded objects; a log file with the file names of all embedded documents would be enough (something like: string objectFileNames = getEmbeddedFileNames(fileInputStream)).
Example: I have a Word document "test.doc" which contains another file called "excel.xls". I'd like to write the file name of excel.xls (in this case) into a log file.
I tried this using some sample code from the Apache homepage (https://poi.apache.org/text-extraction.html), but my code always returns the same output ("Footer Text: Header Text").
What I tried is:
private static void test(String inputfile, String outputfile) throws Exception {
String[] extractedText = new String[100];
int emb = 0;//used for counter of embedded objects
InputStream fis = new FileInputStream(inputfile);
PrintWriter out = new PrintWriter(outputfile);// Write text to file (txt)
System.out.println("Embedded Search started. Inputfile: " + inputfile);
//Based on Apache sample Code
emb = 0;//Reset Counter
POIFSFileSystem emb_fileSystem = new POIFSFileSystem(fis);
// Firstly, get an extractor for the Workbook
POIOLE2TextExtractor oleTextExtractor =
ExtractorFactory.createExtractor(emb_fileSystem);
// Then a List of extractors for any embedded Excel, Word, PowerPoint
// or Visio objects embedded into it.
POITextExtractor[] embeddedExtractors =
ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
for (POITextExtractor textExtractor : embeddedExtractors) {
// If the embedded object was an Excel spreadsheet.
if (textExtractor instanceof ExcelExtractor) {
ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
extractedText[emb] = (excelExtractor.getText());
}
// A Word Document
else if (textExtractor instanceof WordExtractor) {
WordExtractor wordExtractor = (WordExtractor) textExtractor;
String[] paragraphText = wordExtractor.getParagraphText();
for (String paragraph : paragraphText) {
extractedText[emb] = paragraph;
}
// Display the document's header and footer text
System.out.println("Footer text: " + wordExtractor.getFooterText());
System.out.println("Header text: " + wordExtractor.getHeaderText());
}
// PowerPoint Presentation.
else if (textExtractor instanceof PowerPointExtractor) {
PowerPointExtractor powerPointExtractor =
(PowerPointExtractor) textExtractor;
extractedText[emb] = powerPointExtractor.getText();
emb++;
extractedText[emb] = powerPointExtractor.getNotes();
}
// Visio Drawing
else if (textExtractor instanceof VisioTextExtractor) {
VisioTextExtractor visioTextExtractor =
(VisioTextExtractor) textExtractor;
extractedText[emb] = visioTextExtractor.getText();
}
emb++;//Count Embedded Objects
}//Close For Each Loop POIText...
for(int x = 0; x < extractedText.length; x++){//Write Results to TXT
if (extractedText[x] != null){
System.out.println(extractedText[x]);
out.println(extractedText[x]);
}
else {
break;
}
}
out.close();
}
The input file is an xls which contains a doc file as an embedded object, and the output file is a txt.
Thanks to anyone who can help me.

I don't think embedded OLE objects keep their original file name, so I don't think what you want is really possible.
I believe what Microsoft writes about embedded images also applies to OLE-Objects:
You might notice that the file name of the image file has been changed from Eagle1.gif to image1.gif. This is done to address privacy concerns, in that a malicious person could derive a competitive advantage from the name of parts in a document, such as an image file. For example, an author might choose to protect the contents of a document by encrypting the textual part of the document file. However, if two images are inserted named old_widget.gif and new_reenforced_widget.gif, even though the text is protected, a malicious person could learn the fact that the widget is being upgraded. Using generic image file names such as image1 and image2 adds another layer of protection to Office Open XML Formats files.
However, you could try the following (for Word 2007 files, aka XWPFDocument, aka ".docx"; other MS Office files work similarly):
try (FileInputStream fis = new FileInputStream("mydoc.docx")) {
    XWPFDocument document = new XWPFDocument(fis);
    listEmbeds(document);
}

private static void listEmbeds(XWPFDocument doc) throws OpenXML4JException {
    List<PackagePart> embeddedDocs = doc.getAllEmbedds();
    if (embeddedDocs != null && !embeddedDocs.isEmpty()) {
        // Print the part name and content type of each embedded part.
        for (PackagePart pPart : embeddedDocs) {
            System.out.print(pPart.getPartName() + ", ");
            System.out.print(pPart.getContentType() + ", ");
            System.out.println();
        }
    }
}
The pPart.getPartName() is the closest I could find to a file name of an embedded file.
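For the OOXML formats (.docx, .xlsx, .pptx), the part name is the path of an entry inside the zip package, so the same generic names can be listed without POI. The sketch below swaps in plain java.util.zip for the POI call above and assumes that embedded objects sit under an embeddings/ folder, which is where Office typically stores them but is not guaranteed. The class name EmbeddedPartLister is hypothetical. It does not work for the binary .doc/.xls/.ppt formats.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists zip entries that look like embedded objects in an OOXML file.
public class EmbeddedPartLister {

    /** Returns the names of all entries stored under an "embeddings" folder. */
    public static List<String> listEmbeddedParts(InputStream ooxml) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: embedded objects live in word/embeddings/, xl/embeddings/, etc.
                if (!entry.isDirectory() && entry.getName().contains("/embeddings/")) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java EmbeddedPartLister <file.docx|file.xlsx|file.pptx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            for (String name : listEmbeddedParts(in)) {
                System.out.println(name);
            }
        }
    }
}
```

The names printed this way are the generic ones chosen by Office (the same values getPartName() reports), not the original file names.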

With Apache POI, you cannot get the original names of the embedded files.
However, if you really need the original names, you can use the Aspose API.
You can use Aspose.Cells for Excel files, Aspose.Slides for presentation files, and Aspose.Words for Word files to extract the embedded files.
You'll get the file name if the OLE object is linked; otherwise you won't get the original file with Aspose either.
See the example below:
public void getDocEmbedded(InputStream stream) {
    Document doc = new Document(stream);
    NodeCollection<?> shapes = doc.getChildNodes(NodeType.SHAPE, true);
    System.out.println(shapes.getCount());
    int itemcount = 0;
    for (int i = 0; i < shapes.getCount(); i++) {
        Shape shape = (Shape) shapes.get(i);
        OleFormat oleFormat = shape.getOleFormat();
        if (oleFormat != null) {
            // Only handle embedded (not linked) objects that are shown as an icon.
            if (!oleFormat.isLink() && oleFormat.getOleIcon()) {
                itemcount++;
                String progId = oleFormat.getProgId();
                System.out.println("Extension: " + oleFormat.getSuggestedExtension() + " file Name " + oleFormat.getIconCaption());
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                byte[] bytearray = oleFormat.getRawData();
                if (bytearray == null) {
                    oleFormat.save(baos);
                    bytearray = baos.toByteArray();
                }
                //TO DO : do with the byte array whatever you want to
            }
        }
    }
}
I'm using oleFormat.getSuggestedExtension() to get the embedded file extension and oleFormat.getIconCaption() to get the embedded file names.

public class GetEmbedded {
    public static void main(String[] args) throws Exception {
        String path = "SomeExcelFile.xlsx";
        XSSFWorkbook workbook = new XSSFWorkbook(new FileInputStream(new File(path)));
        // Print the content type of every embedded part in the workbook.
        for (PackagePart pPart : workbook.getAllEmbedds()) {
            String contentType = pPart.getContentType();
            System.out.println("List of all the embedded contents in the Excel: " + contentType);
        }
    }
}

Write some part of a bson file into another bson file in java

I have a BSON file (a.bson). I want to read this file, extract some parts of it, and then save these parts into another BSON file (b.bson).
Currently, I can read my source file (a.bson) using org.bson.BSONEncoder and extract the parts I'm interested in (e.g., key1 and key2 for each row of the source file). Now I want to save this data in another BSON file (b.bson). I need a BSON file because it is structured, so I can easily check whether rows contain a special value or not. I wrote this code:
import org.bson.BSONEncoder;
public static void createmyFile(File sourceFile) throws FileNotFoundException, IOException {
InputStream inputStream = new BufferedInputStream(new FileInputStream(sourceFile));
BSONDecoder decoder = new BasicBSONDecoder();
try {
while (inputStream.available() > 0) {
BSONObject bsonSingleRow = decoder.readObject(inputStream);
// ---------------------------------------------------------------------
// Write bsonSingleRow.get(key1) & bsonSingleRow.get(key2) into new file
// ---------------------------------------------------------------------
}
} catch (IOException e) {
...
}
}
Please help me to complete above code.

For example, if you want a randomly selected 2% of the data from the source file:
File file = new File("a.bson");
String path = "b.bson";
BasicBSONEncoder encoder = new BasicBSONEncoder();
BSONDecoder decoder = new BasicBSONDecoder();
// try-with-resources closes the input stream when done.
try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
    while (inputStream.available() > 0) {
        BSONObject bsonSingleRow = decoder.readObject(inputStream);
        String c = bsonSingleRow.get("yourKey").toString();
        // Keep roughly 2% of the rows, chosen at random.
        if (Math.random() > .98) {
            Files.write(Paths.get(path), encoder.encode(bsonSingleRow), StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
} catch (IOException e) {
    ...
}
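If whole rows are copied unchanged, as in the sampling snippet above, they do not strictly need to be decoded and re-encoded. Every BSON document begins with a little-endian int32 giving its total size in bytes (including those four bytes), so a file of concatenated documents can be split into raw byte arrays with the JDK alone. The following is a sketch under that assumption; the class name BsonSplitter is hypothetical and not part of the org.bson library, and selecting individual keys such as key1 and key2 would still require the decoder.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits concatenated BSON documents without decoding them.
public class BsonSplitter {

    /** Reads all documents from the stream and returns each one as its raw bytes. */
    public static List<byte[]> split(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<byte[]> docs = new ArrayList<>();
        byte[] prefix = new byte[4];
        while (true) {
            int first = data.read();
            if (first < 0) {
                break; // clean end of stream between documents
            }
            prefix[0] = (byte) first;
            data.readFully(prefix, 1, 3);
            // The size prefix is a little-endian int32 that counts the whole document.
            int size = ByteBuffer.wrap(prefix).order(ByteOrder.LITTLE_ENDIAN).getInt();
            if (size < 5) {
                throw new IOException("Invalid BSON document size: " + size);
            }
            byte[] doc = new byte[size];
            System.arraycopy(prefix, 0, doc, 0, 4);
            data.readFully(doc, 4, size - 4);
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BsonSplitter <file.bson>");
            return;
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(split(in).size() + " documents");
        }
    }
}
```

Each returned array can be written as-is to a single OutputStream opened once for b.bson, which avoids reopening the output file for every selected row.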
