Correct way to distinguish .xls from .doc file? - java

I searched for how to detect that a file is .xls and found a solution like this (although it is deprecated):
POIFSFileSystem:
@Deprecated
@Removal(version = "4.0")
public static boolean hasPOIFSHeader(InputStream inp) throws IOException {
    return FileMagic.valueOf(inp) == FileMagic.OLE2;
}
But this one returns true for all Microsoft Office OLE2 documents, for example .doc files.
Is there a way to detect an .xls document specifically?

Both .doc and .xls documents are stored in the OLE2 storage format. org.apache.poi.poifs.filesystem.FileMagic only helps you detect the file storage format; it is not sufficient on its own to distinguish between .doc and .xls files.
There also does not appear to be any direct API in the POI library to determine the document type (spreadsheet or word processing document) for a given input stream or file.
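That said, one low-level possibility (a sketch, not an official POI API for this purpose; note it consumes the stream) is to open the OLE2 container and inspect its directory entries, since .xls and .doc files store their main streams under different names:
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public static boolean isXls(InputStream in) throws IOException {
    try (POIFSFileSystem fs = new POIFSFileSystem(in)) {
        DirectoryNode root = fs.getRoot();
        // .xls files store their main stream as "Workbook" (very old files: "Book"),
        // .doc files store theirs as "WordDocument"
        return root.hasEntry("Workbook") || root.hasEntry("Book");
    }
}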
The example below may be helpful to determine whether a given stream is a valid .xls (or .xlsx) file, with the caveat that it reads the given input stream and closes it.
// slurps the content of the given input stream and closes it
public static boolean isExcelFile(InputStream in) throws IOException {
    try {
        // WorkbookFactory consumes the stream while detecting the format
        Workbook workbook = org.apache.poi.ss.usermodel.WorkbookFactory.create(in);
        workbook.close();
        return true;
    } catch (java.lang.IllegalArgumentException | org.apache.poi.openxml4j.exceptions.InvalidFormatException e) {
        return false;
    }
}
You may find more information on the Excel file format at this link.
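For example (an illustrative usage; note the method consumes and closes the stream, so it cannot be reused afterwards):
try (InputStream in = new FileInputStream("report.xls")) {
    System.out.println(isExcelFile(in) ? "Excel file" : "not an Excel file");
}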
Update
Solution based on Apache Tika as suggested by gagravarr:
public class TikaBasedFileTypeDetector {
    private Tika tika;
    private TemporaryResources temporaryResources;

    public void init() {
        this.tika = new Tika();
        this.temporaryResources = new TemporaryResources();
    }

    // clean up all the temporary resources
    public void destroy() throws IOException {
        temporaryResources.close();
    }

    // return the content MIME type
    public String detectType(InputStream in) throws IOException {
        TikaInputStream tikaInputStream = TikaInputStream.get(in, temporaryResources);
        return tika.detect(tikaInputStream);
    }

    public boolean isExcelFile(InputStream in) throws IOException {
        // see https://stackoverflow.com/a/4212908/1700467 for information on MIME types
        String type = detectType(in);
        return type.startsWith("application/vnd.ms-excel") || // the binary .xls format
               type.startsWith("application/vnd.openxmlformats-officedocument.spreadsheetml"); // the Office Open XML (.xlsx) format
    }
}
See this answer on mime types.
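A possible usage sketch for the detector above (the file name is illustrative):
TikaBasedFileTypeDetector detector = new TikaBasedFileTypeDetector();
detector.init();
try (InputStream in = new FileInputStream("report.xls")) {
    System.out.println(detector.detectType(in)); // e.g. application/vnd.ms-excel
}
detector.destroy();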

You can work with Apache POI's HSSF module.
That module (library) is written to read and write .xls files (and XSSF, its newer counterpart, does the same for .xlsx, although these are different formats).
With this code...
InputStream excelFileToRead = new FileInputStream("FileNameWithLink.xls");
HSSFWorkbook wb = new HSSFWorkbook(excelFileToRead);
HSSFSheet sheet = wb.getSheetAt(0);
...you can detect whether it is a readable .xls file.
Going deeper, you can use this code to try actually reading the content; the module is really easy to use.
There can be situations where a file technically is an .xls file but is still not readable (there can be various problems with it).
Extra: XSSF is for .xlsx and HSSF is for .xls.
I haven't used other techniques, as I always want to be sure that I will be able to read the file later.
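Putting that together, a minimal sketch of such a check; the broad catch is deliberate, since HSSF throws different exceptions for non-OLE2 input, .xlsx input and corrupt files:
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

public static boolean isReadableXls(String path) {
    try (InputStream in = new FileInputStream(path);
         HSSFWorkbook wb = new HSSFWorkbook(in)) {
        return true;  // parsed as a binary (.xls) workbook
    } catch (Exception e) {
        return false; // not OLE2, an .xlsx file, or unreadable
    }
}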

You can use docx4j. Load the file with OpcPackage.load() and then check the content type.
OpcPackage.load()
* Convenience method to create a WordprocessingMLPackage
* or PresentationMLPackage
* from an inputstream (.docx/.docxm, .pptx or Flat OPC .xml).
* It detects the format by inspecting the first two bytes of the stream (magic bytes).
* For Office 2007 'x' formats, these two bytes are 'PK' (same as a zip file)
load() returns an OpcPackage, the abstract class that GloxPackage, PresentationMLPackage, SpreadsheetMLPackage and WordprocessingMLPackage are based on, so this should work for Word, Excel and PowerPoint documents.
A basic check
public final String XLSX_FILE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml";
public final String WORD_FILE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";
public final String UNKNOWN_FILE = "UNKNOWN";

public boolean isFileXLSX(String fileLocation) {
    return getContentTypeFromFile(fileLocation).equals(XLSX_FILE);
}

public String getContentTypeFromFile(String fileLocation) {
    try {
        return OpcPackage.load(new File(fileLocation)).getContentType();
    } catch (Docx4JException e) {
        return UNKNOWN_FILE;
    }
}
You should see values like:
application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml
application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml

Related

ContentHandler is taking a lot of time to go through 3MB XML parsed xlsx file

I'm using a SAX parser and the XSSFReader of Apache POI to parse an .xlsx file. My sheet contains up to 650 columns and 2000 rows (file size about 2.5 MB). My code looks like this:
public class MyClass {
    public static void main(String[] args) {
        String path = args[0];
        try {
            OPCPackage pkg = OPCPackage.open(new FileInputStream(path));
            XSSFReader reader = new XSSFReader(pkg);
            InputStream sheetData = reader.getSheet("rId3"); // the needed sheet
            MyHandler handler = new MyHandler();
            XMLReader parser = SAXHelper.newXMLReader();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(sheetData));
        } catch (Exception e) {
            // or other catches with required exceptions
        }
    }
}

class MyHandler extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String name, Attributes attributes) {
        if ("row".equals(name)) {
            System.out.println("row: " + attributes.getValue("r"));
        }
    }
}
But unfortunately I saw that it takes 2 or 3 seconds to go over one row, which means going over the whole sheet would take more than 30 minutes(!).
Well, I am sure this is not how it is supposed to be; if it were, nobody would be suggesting the Apache POI event API for large files, would they?
I also want to get to the <mergeCell> values at the end of the XML (after the closing </sheetData>). Is there a better way to do that? (I was thinking of handling the XML as a string and simply searching for the required values with a regular expression; is that possible?)
So I have two questions:
1. What's wrong with my code / why does it take so long? (Although when I think about it, it almost sounds like a normal situation: 600 cells, why shouldn't processing take a few seconds?)
2. Is there a way to treat the XML as a text file and simply search it using a regex?
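(For reference on the second question: the <mergeCell> elements after </sheetData> arrive through the same startElement callback as the rows, so they can be captured in the same SAX pass without resorting to regex. A sketch, with the ref attribute holding the merged range per the SpreadsheetML schema:)
class MergeCellHandler extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String name, Attributes attributes) {
        if ("mergeCell".equals(name)) {
            System.out.println("merged range: " + attributes.getValue("ref")); // e.g. A1:B2
        }
    }
}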

Error while retrieving images from pdf using Itext

I have an existing PDF from which I want to retrieve images
NOTE:
In the Documentation, this is the RESULT variable
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
I don't get why this path is needed; I just want to extract the images from my PDF file.
So now, when I use MyImageRenderListener listener = new MyImageRenderListener(RESULT);
I am getting the error:
results\part4\chapter15\Img16.jpg (The system
cannot find the path specified)
This is the code that I am using.
package part4.chapter15;

import java.io.IOException;

import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;

/**
 * Extracts images from a PDF file.
 */
public class ExtractImages {

    /** The source PDF file. */
    public static final String RESOURCE = "resources/pdfs/samplefile.pdf";
    /** Pattern for the paths of the extracted images. */
    public static final String RESULT = "results/part4/chapter15/Img%s.%s";

    /**
     * Parses a PDF and extracts all the images.
     * @param filename the source PDF
     */
    public void extractImages(String filename)
            throws IOException, DocumentException {
        PdfReader reader = new PdfReader(filename);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        MyImageRenderListener listener = new MyImageRenderListener(RESULT);
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            parser.processContent(i, listener);
        }
        reader.close();
    }

    /**
     * Main method.
     * @param args no arguments needed
     * @throws DocumentException
     * @throws IOException
     */
    public static void main(String[] args) throws IOException, DocumentException {
        new ExtractImages().extractImages(RESOURCE);
    }
}
You have two questions and the answer to the first question is the key to the answer of the second.
Question 1:
You refer to:
public static final String RESULT = "results/part4/chapter15/Img%s.%s";
And you ask: why is this image needed?
That question is wrong, because Img%s.%s is not the filename of an image; it's a pattern for the filename of an image. While parsing, iText will detect images in the PDF. These images are stored in numbered objects (e.g. object 16) and these images can be exported in different formats (e.g. jpg, png, ...).
Suppose that an image is stored in object 16 and that this image is a jpg, then the pattern will resolve to Img16.jpg.
Question 2:
Why do I get an error:
results\part4\chapter15\Img16.jpg (The system cannot find the path specified)
In your PDF, there's a jpg stored in object 16. You are asking iText to store that image using this path: results\part4\chapter15\Img16.jpg (as explained in my answer to Question 1). However: your working directory doesn't have the subdirectories results\part4\chapter15\, hence an IOException (or a FileNotFoundException?) is thrown.
What is the general problem?
You have copy/pasted the ExtractImages example I wrote for my book "iText in Action - Second Edition", but:
You didn't read that book, so you have no idea what that code is supposed to do.
You aren't telling the readers on Stack Overflow that this example depends on the MyImageRenderListener class, which is where all the magic happens.
How can you solve your problem?
Option 1:
Change RESULT like this:
public static final String RESULT = "Img%s.%s";
Now the images will be stored in your working directory.
Option 2:
Adapt the MyImageRenderListener class, more specifically this method:
public void renderImage(ImageRenderInfo renderInfo) {
    try {
        String filename;
        FileOutputStream os;
        PdfImageObject image = renderInfo.getImage();
        if (image == null) return;
        filename = String.format(path,
            renderInfo.getRef().getNumber(), image.getFileType());
        os = new FileOutputStream(filename);
        os.write(image.getImageAsBytes());
        os.flush();
        os.close();
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}
iText calls this method whenever an image is encountered. It passes an ImageRenderInfo object that contains plenty of information about that image.
In this implementation, we store the image bytes as a file. This is how we create the path to that file:
String.format(path,
renderInfo.getRef().getNumber(), image.getFileType())
As you can see, the pattern stored in RESULT is used in such a way that the first occurrence of %s is replaced with a number and the second occurrence with a file extension.
You could easily adapt this method so that it stores the images as byte[] in a List if that is what you want.
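A sketch of that adaptation (the class name is illustrative; the RenderListener interface is iText 5's):
public class ImageCollectingListener implements RenderListener {

    private final List<byte[]> images = new ArrayList<>();

    @Override
    public void beginTextBlock() { }

    @Override
    public void endTextBlock() { }

    @Override
    public void renderText(TextRenderInfo renderInfo) { }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
        try {
            PdfImageObject image = renderInfo.getImage();
            if (image == null) return;
            // keep the raw image bytes in memory instead of writing a file
            images.add(image.getImageAsBytes());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }

    public List<byte[]> getImages() {
        return images;
    }
}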

Converting a docx containing a chart to PDF

I've got a docx4j generated file which contains several tables, titles and, finally, an excel-generated curve chart.
I have tried many approaches to convert this file to PDF, but none were successful.
Docx4j with XSL-FO did not work: most of the things included in the docx file are not yet implemented and show up in red text as "not implemented".
JODConverter did not work either: I got a resulting PDF in which everything was pretty good (just little formatting/styling issues), BUT the graph did not show up.
Finally, the closest approach was using Apache POI: the resulting PDF was identical to my docx file, but still no chart showed up.
I already know Aspose would solve this pretty easily, but I am looking for an open source, free solution.
The code I am using with Apache POI is as follows:
public static void convert(String inputPath, String outputPath)
        throws XWPFConverterException, IOException {
    PdfConverter converter = new PdfConverter();
    converter.convert(new XWPFDocument(new FileInputStream(new File(inputPath))),
            new FileOutputStream(new File(outputPath)), PdfOptions.create());
}
I do not know what to do to get the chart inside the PDF, could anybody tell me how to proceed?
Thanks in advance.
I don't know if this helps you, but you could use Jacob (I don't know if it's possible with Apache POI or docx4j).
With this solution you open Word yourself and export the document as PDF.
Word needs to be installed on the computer!
Here's the download page: http://sourceforge.net/projects/jacob-project/
try {
    if (System.getProperty("os.arch").contains("64")) {
        System.load(DLL_64BIT_PATH);
    } else {
        System.load(DLL_32BIT_PATH);
    }
} catch (UnsatisfiedLinkError e) {
    //TODO
}

ActiveXComponent oleComponent = new ActiveXComponent("Word.Application");
oleComponent.setProperty("Visible", false);
Variant var = Dispatch.get(oleComponent, "Documents");
Dispatch document = var.getDispatch();
Dispatch activeDoc = Dispatch.call(document, "Open", fileName).toDispatch();
// https://msdn.microsoft.com/EN-US/library/office/ff845579.aspx
// 17 = wdExportFormatPDF
Dispatch.call(activeDoc, "ExportAsFixedFormat", new Object[] { "path to pdfFile.pdf", new Integer(17), false, 0 });
Object args[] = { new Integer(0) }; // 0 = wdDoNotSaveChanges
Dispatch.call(activeDoc, "Close", args);
Dispatch.call(oleComponent, "Quit");

Reading and writing files using Java 7 nio

I have files which consist of JSON elements in an array (several files; each file has a JSON array of elements).
I have a process that knows how to take each JSON element as a line from a file and process it.
So I created a small program that reads the JSON array and then writes the elements, one per line, to another file. The output of this utility will be the input of the other process.
I used Java 7 NIO (and Gson), and tried to use as much Java 7 NIO as possible.
Is there any improvement I can make?
What about the filter? Which approach is better?
Thanks,
public class TransformJsonsUsers {

    public TransformJsonsUsers() {
    }

    public static void main(String[] args) throws IOException {
        final Gson gson = new Gson();
        Path path = Paths.get("C:\\work\\data\\resources\\files");
        final Path outputDirectory = Paths
                .get("C:\\work\\data\\resources\\files\\output");

        DirectoryStream.Filter<Path> filter = new DirectoryStream.Filter<Path>() {
            @Override
            public boolean accept(Path entry) throws IOException {
                // which is better?
                // BasicFileAttributeView attView = Files.getFileAttributeView(entry, BasicFileAttributeView.class);
                // return attView.readAttributes().isRegularFile();
                return !Files.isDirectory(entry);
            }
        };

        DirectoryStream<Path> directoryStream = Files.newDirectoryStream(path, filter);
        directoryStream.forEach(new Consumer<Path>() {
            @Override
            public void accept(Path filePath) {
                String fileOutput = outputDirectory.toString() + File.separator + filePath.getFileName();
                Path fileOutputPath = Paths.get(fileOutput);
                try {
                    BufferedReader br = Files.newBufferedReader(filePath);
                    User[] users = gson.fromJson(br, User[].class);
                    BufferedWriter writer = Files.newBufferedWriter(fileOutputPath, Charset.defaultCharset());
                    for (User user : users) {
                        writer.append(gson.toJson(user));
                        writer.newLine();
                    }
                    writer.flush();
                } catch (IOException e) {
                    throw new RuntimeException(filePath.toString(), e);
                }
            }
        });
    }
}
There is no point in using a Filter if you want to read all the files from the directory. A Filter is primarily designed to apply some filter criteria and read a subset of files. Either way, the two variants should make no real difference in overall performance.
If you are looking to improve performance, you can try a couple of different approaches.
Multi-threading
Depending on how many files exist in the directory and how powerful your CPU is, you can apply multi-threading to process more than one file at a time, as shown in the sketch below.
Queuing
Right now you are reading and writing to another file synchronously. You can queue the content of each file using a Queue and create an asynchronous writer.
You can combine both of these approaches as well to improve performance further.
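A minimal sketch of the multi-threading idea, staying with Java 7 constructs; processFile is a hypothetical helper holding the per-file read/convert/write logic from the question:
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

static void processAll(Path dir, DirectoryStream.Filter<Path> filter)
        throws IOException, InterruptedException {
    // one worker per core; each file is processed independently
    ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, filter)) {
        for (final Path filePath : stream) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    processFile(filePath); // hypothetical per-file helper
                }
            });
        }
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
}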
Don't put the I/O into the filter; that's not what it's for. You should get the complete list of files and then process it. For example, if the I/O creates another file in the directory, the behaviour is undefined: you might miss a file, or see the new file in the accept() method.

Java: CSV File Easy Read/Write

I'm working on a program that requires quick access to a CSV comma-delimited spreadsheet file.
So far I've been able to read from it easily using a BufferedReader.
However, now I want to be able to edit the data it reads, then export it BACK to the CSV.
The spreadsheet contains names, phone numbers, email addresses, etc. And the program lists everyone's data, and when you click on them it brings up a page with more detailed information, also pulled from the CSV. On that page you can edit the data, and I want to be able to click a "Save Changes" button, then export the data back to its appropriate line in the CSV--or delete the old one, and append the new.
I'm not very familiar with using a BufferedWriter, or whatever it is I should be using.
What I started to do is create a custom class called FileIO. It contains both a BufferedReader and a BufferedWriter. So far it has a method called read() that returns bufferedReader.readLine(). Now I want a method called write(String line).
public static class FileIO {
    BufferedReader read;
    BufferedWriter write;

    public FileIO(String file) throws MalformedURLException, IOException {
        read = new BufferedReader(new InputStreamReader(getUrl(file).openStream()));
        write = new BufferedWriter(new FileWriter(file));
    }

    public static URL getUrl(String file) throws IOException {
        return // new URL (fileServer + file).openStream()));
            FileIO.class.getResource(file);
    }

    public String read() throws IOException {
        return read.readLine();
    }

    public void write(String line) {
        String[] data = line.split("\\|");
        String firstName = data[0];
        // int lineNum = findLineThatStartsWith(firstName);
        // write.writeLine(lineNum, line);
    }
}
I'm hoping somebody has an idea as to how I can do this?
Rather than reinventing the wheel you could have a look at OpenCSV, which supports reading and writing CSV files.
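A minimal read/write sketch (file names and columns are illustrative):
import java.io.FileReader;
import java.io.FileWriter;
import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;

public class OpenCsvExample {
    public static void main(String[] args) throws Exception {
        // read all rows from an existing file
        try (CSVReader reader = new CSVReader(new FileReader("contacts.csv"))) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                System.out.println(String.join(" | ", row));
            }
        }
        // write rows back out; quoting is handled for you
        try (CSVWriter writer = new CSVWriter(new FileWriter("contacts-out.csv"))) {
            writer.writeNext(new String[] { "name", "phone", "email" });
            writer.writeNext(new String[] { "Alice", "555-0100", "alice@example.com" });
        }
    }
}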
Please consider Apache Commons CSV.
To quickly understand the API, there are four important classes:
CSVFormat
Specifies the format of a CSV file and parses input.
CSVParser
Parses CSV files according to the specified format.
CSVPrinter
Prints values in a CSV format.
CSVRecord
A CSV record parsed from a CSV file.
A minimal usage sketch of these four classes (file and column names are illustrative):
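import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class CommonsCsvExample {
    public static void main(String[] args) throws Exception {
        // parse an existing file, treating the first record as the header
        try (Reader in = Files.newBufferedReader(Paths.get("contacts.csv"));
             CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in)) {
            for (CSVRecord record : parser) {
                System.out.println(record.get("name") + " -> " + record.get("email"));
            }
        }
        // print records in CSV format
        try (Writer out = Files.newBufferedWriter(Paths.get("contacts-out.csv"));
             CSVPrinter printer = new CSVPrinter(out, CSVFormat.DEFAULT.withHeader("name", "email"))) {
            printer.printRecord("Alice", "alice@example.com");
        }
    }
}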
The spreadsheet contains names, phone numbers, email addresses, etc. And the program lists everyone's data, and when you click on them it brings up a page with more detailed information, also pulled from the CSV. On that page you can edit the data, and I want to be able to click a "Save Changes" button, then export the data back to its appropriate line in the CSV--or delete the old one, and append the new.
The content of a file is a sequence of bytes. CSV is a text based file format, i.e. the sequence of byte is interpreted as a sequence of characters, where newlines are delimited by special newline characters.
Consequently, if the length of a line increases, the characters of all following lines need to be moved to make room for the new characters. Likewise, to delete a line you must move the later characters to fill the gap. That is, you can not update a line in a csv (at least not when changing its length) without rewriting all following lines in the file. For simplicity, I'd rewrite the entire file.
Since you already have code to write and read the CSV file, adapting it should be straightforward (see the sketch below). But before you do that, it might be worth asking yourself if you're using the right tool for the job. If the goal is to keep a list of records and edit individual records in a form, programs such as Microsoft Access or its Open Office equivalent might be a more natural fit. If your UI needs go beyond what these programs provide, using a relational database to keep your data is probably a better fit (more efficient and flexible than a CSV).
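A sketch of the rewrite-the-entire-file approach using java.nio; the assumption that the first comma-separated field is the record key is illustrative:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public static void updateRecord(Path csvPath, String key, String updatedLine) throws IOException {
    // read the whole file into memory
    List<String> lines = Files.readAllLines(csvPath, StandardCharsets.UTF_8);
    for (int i = 0; i < lines.size(); i++) {
        // locate the record by its first field
        if (lines.get(i).startsWith(key + ",")) {
            lines.set(i, updatedLine);
        }
    }
    // rewrite the entire file with the modified content
    Files.write(csvPath, lines, StandardCharsets.UTF_8);
}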
Add the dependency:
implementation 'com.opencsv:opencsv:4.6'
Then add the code below in onCreate():
InputStreamReader is = null;
try {
    String path = "storage/emulated/0/Android/media/in.bioenabletech.imageProcessing/MLkit/countries_image_crop.csv";
    CSVReader reader = new CSVReader(new FileReader(path));
    String[] nextLine;
    int lineNumber = 0;
    while ((nextLine = reader.readNext()) != null) {
        lineNumber++;
        // print one column of the CSV file (nextLine is indexed from 0)
        Log.e(TAG, "onCreate: " + nextLine[2]);
    }
} catch (Exception e) {
    Log.e(TAG, "onCreate: " + e);
}
I solved it using
<dependency>
    <groupId>com.fasterxml.jackson.dataformat</groupId>
    <artifactId>jackson-dataformat-csv</artifactId>
    <version>2.8.6</version>
</dependency>
and
private static final CsvMapper mapper = new CsvMapper();

public static <T> List<T> readCsvFile(MultipartFile file, Class<T> clazz) throws IOException {
    InputStream inputStream = file.getInputStream();
    CsvSchema schema = mapper.schemaFor(clazz).withHeader().withColumnReordering(true);
    ObjectReader reader = mapper.readerFor(clazz).with(schema);
    return reader.<T>readValues(inputStream).readAll();
}
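The same mapper can also go the other way. A short sketch of writing a list of POJOs back out as CSV (the method name is illustrative):
public static <T> String writeCsv(List<T> rows, Class<T> clazz) throws IOException {
    // the schema is derived from the POJO's properties, with a header row
    CsvSchema schema = mapper.schemaFor(clazz).withHeader();
    return mapper.writer(schema).writeValueAsString(rows);
}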
