Easiest way to compare two Excel files in Java?

I'm writing a JUnit test for some code that produces an Excel file (which is binary). I have another Excel file that contains my expected output. What's the easiest way to compare the actual file to the expected file?
Sure I could write the code myself, but I was wondering if there's an existing method in a trusted third-party library (e.g. Spring or Apache Commons) that already does this.

You might consider using my project simple-excel which provides a bunch of Hamcrest Matchers to do the job.
When you do something like the following,
assertThat(actual, WorkbookMatcher.sameWorkbook(expected));
You'd see, for example,
java.lang.AssertionError:
Expected: entire workbook to be equal
but: cell at "C14" contained <"bananas"> expected <nothing>,
cell at "C15" contained <"1,850,000 EUR"> expected <"1,850,000.00 EUR">,
cell at "D16" contained <nothing> expected <"Tue Sep 04 06:30:00">
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
That way, you can run it from your automated tests and get meaningful feedback whilst you're developing.
You can read more about it in this article on my site.

Here's what I ended up doing (with the heavy lifting being done by DBUnit):
// Needs on the test classpath: org.dbunit.Assertion,
// org.dbunit.dataset.excel.XlsDataSet and org.apache.commons.io.IOUtils.

/**
 * Compares the data in the two Excel files represented by the given input
 * streams, closing them on completion.
 *
 * @param expected can't be <code>null</code>
 * @param actual can't be <code>null</code>
 * @throws Exception
 */
private void compareExcelFiles(InputStream expected, InputStream actual)
        throws Exception
{
    try {
        Assertion.assertEquals(new XlsDataSet(expected), new XlsDataSet(actual));
    }
    finally {
        IOUtils.closeQuietly(expected);
        IOUtils.closeQuietly(actual);
    }
}
This compares the data in the two files, with no risk of false negatives from any irrelevant metadata that might be different. Hope this helps someone.

A simple file comparison can easily be done using some checksumming (like MD5) or just reading both files.
However, as Excel files contain loads of metadata, the files will probably never be identical byte-for-byte, as James Burgess pointed out.
So you'll need another kind of comparison for your test.
I'd recommend somehow generating a "canonical" form from the Excel file, i.e. reading the generated Excel file and converting it to a simpler format (CSV or something similar), which will only retain the information you want to check. Then you can use the "canonical form" to compare with your expected result (also in canonical form, of course).
Apache POI might be useful for reading the file.
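To make that concrete, here is a minimal sketch of the canonical-form idea using POI (assumptions: a reasonably recent POI where Workbook is Iterable and Closeable; the helper name toCanonicalCsv and the use of DataFormatter are my own choices, not something from this answer):

import java.io.InputStream;
import org.apache.poi.ss.usermodel.*;

// Flatten a workbook to a CSV-like string that keeps only the formatted
// cell text, so two workbooks can be compared as plain strings.
// Note: iterating rows/cells skips blank cells entirely.
static String toCanonicalCsv(InputStream in) throws Exception {
    DataFormatter fmt = new DataFormatter();
    StringBuilder sb = new StringBuilder();
    try (Workbook wb = WorkbookFactory.create(in)) {
        for (Sheet sheet : wb) {
            for (Row row : sheet) {
                for (Cell cell : row) {
                    sb.append(fmt.formatCellValue(cell)).append(',');
                }
                sb.append('\n');
            }
        }
    }
    return sb.toString();
}

A test would then just assertEquals(toCanonicalCsv(expectedStream), toCanonicalCsv(actualStream)).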
BTW: Reading a whole file to check its correctness would generally not be considered a unit test. That's an integration test...

I needed to do something similar and was already using the Apache POI library in my project to create Excel files. So I opted to use the included ExcelExtractor interface to export both workbooks as a string of text and asserted that the strings were equal. There are implementations for both HSSF (.xls) and XSSF (.xlsx).
Dump to string:
XSSFWorkbook xssfWorkbookA = ...;
String workbookA = new XSSFExcelExtractor(xssfWorkbookA).getText();
ExcelExtractor has some options for what should be included in the string dump. I found its defaults useful: it includes sheet names as well as the text contents of the cells.
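Roughly, the comparison then looks like this (a sketch of the approach described above, not code from a particular project):

import static org.junit.Assert.assertEquals;
import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// Compare two .xlsx workbooks by their extracted text dumps.
static void assertSameText(XSSFWorkbook expected, XSSFWorkbook actual) {
    String expectedText = new XSSFExcelExtractor(expected).getText();
    String actualText = new XSSFExcelExtractor(actual).getText();
    assertEquals(expectedText, actualText);
}

For .xls you'd do the same with org.apache.poi.hssf.extractor.ExcelExtractor and an HSSFWorkbook.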

The easiest way I find is to use Tika.
I use it like this:
private void compareXlsx(File expected, File result) throws IOException, TikaException {
    Tika tika = new Tika();
    String expectedText = tika.parseToString(expected);
    String resultText = tika.parseToString(result);
    assertEquals(expectedText, resultText);
}
The Maven dependency:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.13</version>
    <scope>test</scope>
</dependency>

You could use javaxdelta to check whether the two files are the same. It's available from here:
http://javaxdelta.sourceforge.net/

Just found out there's something in commons-io's FileUtils: FileUtils.contentEquals(file1, file2). Thanks for the other answers.
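That method is a byte-for-byte comparison, so note that it still suffers from the metadata problem discussed above:

import java.io.File;
import org.apache.commons.io.FileUtils;

// true only if the two files are byte-for-byte identical (throws IOException)
boolean same = FileUtils.contentEquals(new File("expected.xls"), new File("actual.xls"));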

Please take a look at this page on comparing binary files: http://www.velocityreviews.com/forums/t123770-re-java-code-for-determining-binary-file-equality.html

You may use Beyond Compare 3, which can be started from the command line and supports different ways to compare Excel files, including:
Comparing Excel sheets as database tables
Checking all textual content
Checking textual content with some formatting

To test only the content of the first sheets, in Kotlin (can easily be converted to Java):
private fun checkEqualityExcelDocs(doc: XSSFWorkbook, doc1: XSSFWorkbook): Boolean {
    // map every cell of the first sheet: (row, col) -> string value
    // (note: stringCellValue assumes the cells hold text)
    val mapOfCellDoc = doc.toList().first().toList()
        .flatMap { row -> row.map { Pair(IndexInThePivotTable(it.rowIndex, it.columnIndex), it.stringCellValue) } }
        .toMap()
    val mapOfCellDoc1 = doc1.toList().first().toList()
        .flatMap { row -> row.map { Pair(IndexInThePivotTable(it.rowIndex, it.columnIndex), it.stringCellValue) } }
        .toMap()
    if (mapOfCellDoc.size == mapOfCellDoc1.size) {
        return mapOfCellDoc.entries.all { mapOfCellDoc1.containsKey(it.key) && mapOfCellDoc[it.key] == mapOfCellDoc1[it.key] }
    }
    return false
}
data class IndexInThePivotTable(val row: Int, val col: Int)
and in your code add an assertion:
assertTrue(checkEqualityExcelDocs(expected, actual), "Docs aren't equal!")
As you can see, doc.toList().first() takes only the first sheet of the document; if you need to compare each sheet respectively, change the code a little.
It is also quite a good idea to ignore "" (empty string) cells; I didn't need that functionality myself, so simply add that part if you need it.
This may also be useful information:
// the first doc I've got from an output stream this way:
val out = ByteArrayOutputStream()
// ...some method which writes the Excel workbook to the output stream...
val firstDoc = XSSFWorkbook(ByteArrayInputStream(out.toByteArray()))
and the second doc from the file to compare with:
val secondDoc = XSSFWorkbook(Test::class.java.getClassLoader().getResource("yourfile.xlsx").path)

Maybe... compare MD5 digests of each file? I'm sure there are a lot of ways to do it. You could just open both files and compare each byte.
EDIT: James stated how the XLS format might have differences in the metadata. Perhaps you should use the same interface you used to generate the xls files to open them and compare the values from cell to cell?
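A rough sketch of that cell-by-cell idea, assuming POI was used to generate the files (the question doesn't say which library, so this is just one way to do it):

import org.apache.poi.ss.usermodel.*;

// Compare the formatted text of every cell present in the expected sheet.
// (Cells present only in the actual sheet are not checked in this sketch.)
static void assertSameCells(Sheet expected, Sheet actual) {
    DataFormatter fmt = new DataFormatter();
    for (Row expRow : expected) {
        Row actRow = actual.getRow(expRow.getRowNum());
        for (Cell expCell : expRow) {
            Cell actCell = (actRow == null) ? null : actRow.getCell(expCell.getColumnIndex());
            String exp = fmt.formatCellValue(expCell);
            String act = (actCell == null) ? "" : fmt.formatCellValue(actCell);
            if (!exp.equals(act)) {
                throw new AssertionError("Mismatch at row " + expRow.getRowNum()
                        + ", col " + expCell.getColumnIndex()
                        + ": expected <" + exp + "> but was <" + act + ">");
            }
        }
    }
}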

Related

FineReader Engine Java SDK. How to ignore pictures during conversion from PDF to DOCX

I need to find a way to ignore pictures and photos in a PDF document during conversion to a DOCX file.
I am creating an instance of FineReader Engine:
IEngine engine = Engine.InitializeEngine(
engineConfig.getDllFolder(), engineConfig.getCustomerProjectId(),
engineConfig.getLicensePath(), engineConfig.getLicensePassword(), "", "", false);
After that, I am converting a document:
IFRDocument document = engine.CreateFRDocument();
document.AddImageFile(file.getAbsolutePath(), null, null);
document.Process(null);
String exportPath = FileUtil.prepareExportPath(file, resultFolder);
document.Export(exportPath, FileExportFormatEnum.FEF_DOCX, null);
As a result, it converts all images from the initial pdf document.
When you export PDF to DOCX you should use some export params. In this case you can use IRTFExportParams. You can get this object:
IRTFExportParams irtfExportParams = engine.CreateRTFExportParams();
and there you can set writePicture property like this:
irtfExportParams.setWritePictures(false);
Here IEngine engine is the main interface; I assume you already know how to initialize it.
You also have to pass a parameter to the document.Process() method (document is the IFRDocument). Process() takes an IDocumentProcessingParams iDocumentProcessingParams. This object has a setPageProcessingParams() method, where you put IPageProcessingParams iPageProcessingParams (you can get this object via engine.CreatePageProcessingParams()). And this object has the methods:
iPageProcessingParams.setPerformAnalysis(true);
iPageProcessingParams.setPageAnalysisParams(iPageAnalysisParams);
In the first method set true, and in the second one pass iPageAnalysisParams (IPageAnalysisParams iPageAnalysisParams = engine.CreatePageAnalysisParams()).
As the last step, set setDetectPictures(false) on iPageAnalysisParams. That's all.
And when you export the document, you should pass this param like this:
IFRDocument document = engine.CreateFRDocument();
document.Export(filePath, FileExportFormatEnum.FEF_DOCX, irtfExportParams);
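Putting the steps of this answer together, the whole flow looks roughly like this. It is only a sketch assembled from the calls named above; in particular, engine.CreateDocumentProcessingParams() is my assumption, since the answer doesn't say how the IDocumentProcessingParams instance is obtained:

// Sketch only -- assembled from the FineReader Engine calls described above.
IRTFExportParams irtfExportParams = engine.CreateRTFExportParams();
irtfExportParams.setWritePictures(false);        // don't write pictures to the DOCX

IPageAnalysisParams pageAnalysisParams = engine.CreatePageAnalysisParams();
pageAnalysisParams.setDetectPictures(false);     // don't detect pictures during analysis

IPageProcessingParams pageProcessingParams = engine.CreatePageProcessingParams();
pageProcessingParams.setPerformAnalysis(true);
pageProcessingParams.setPageAnalysisParams(pageAnalysisParams);

// Assumption: a factory like this exists for IDocumentProcessingParams.
IDocumentProcessingParams docParams = engine.CreateDocumentProcessingParams();
docParams.setPageProcessingParams(pageProcessingParams);

IFRDocument document = engine.CreateFRDocument();
document.AddImageFile(file.getAbsolutePath(), null, null);
document.Process(docParams);
document.Export(exportPath, FileExportFormatEnum.FEF_DOCX, irtfExportParams);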
I hope my answer helps everyone.
I'm not really familiar with PDF to DOCX conversion, but I think you could try custom profiles according to your needs.
At some point in your code you should create an Engine object, and then create a Document object (or IFRDocument object, depending on your application). Add this line just before giving your document to your engine for processing:
engine.LoadProfile(PROFILE_FILENAME);
Then create your profile file with the processing parameters described in the documentation packaged with your FRE installation, under the "Working with Profiles" section.
Do not forget to add to your file:
... some params under other sections
[PageAnalysisParams]
DetectText = TRUE --> force text detection
DetectPictures = FALSE --> ignore pictures
... other params under PageAnalysisParams
... some params under other sections
It works the same way for barcodes, etc. But keep in mind to benchmark your results when adding or removing things from this file, as it may alter processing speed and the overall quality of your results.
What do the PDF input pages contain? What is expected in MS Word?
It would be great if you could attach an example of an input PDF file and an example of the desired result in MS Word format.
Then giving a useful recommendation will be much easier.

How to keep zero begin string when export data using opencsv library

I'm using the opencsv library in Java to export a CSV, but I have a problem. When I have a string beginning with zero, such as 0123456, the export removes the 0 and my CSV shows 123456: the zero is missing. I tried this:
"\"\t"+"0123456"+ "\"";
but then the exported CSV shows "0123456". I don't want that; I want 0123456. I don't want to edit it in Excel, because some end users don't know how. How can I export a CSV using opencsv and keep the leading zero? Please help.
I think it is not really a problem with generating the CSV but with the way Excel treats the data when the file is opened via Explorer.
Try this code and view the CSV in a text editor (not Excel); notice that it shows up correctly, though when opened in Excel, the leading 0s are lost!
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"));
// feed in your array (or convert your data to an array)
String[] entries = "0123131#21212#021213".split("#");
List<String[]> a = new ArrayList<>();
a.add(entries);
//don't apply quotes
writer.writeAll(a,false);
writer.close();
If you are really sure that you want users to see the leading 0s for numeric values when the file is opened in Excel, then each cell entry should be in the ="dataHere" format; see the code below:
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"));
// feed in your array (or convert your data to an array)
String[] entries = "=\"0123131\"#=\"21212\"#=\"021213\"".split("#");
List<String[]> a = new ArrayList<>();
a.add(entries);
writer.writeAll(a);
writer.close();
This is how Excel now shows the data when the file is opened from Windows Explorer (double-clicking): the leading zeros are kept.
But if you view the CSV in a text editor, with the data modified to "suit" Excel viewing, it shows up as ="0123131", ="21212", ="021213".
Also see this link: format-number-as-text-in-csv-when-open-in-both-excel-and-notepad
Have you tried using a String like this: "'" + "0123456"? The ' char will mark the number as text when parsed into Excel.
For me OpenCsv works correctly (vers. 5.6).
For example, my CSV file has a row like the following extract:
"999739059";;;"abcdefgh";"001024";
and opencsv reads the "001024" field as 001024 correctly. Of course I have mapped the field to a String, not a Double.
But if you still have problems, you can grab a simple yet powerful parser that fully adheres to the RFC 4180 standard:
mykong.com
Mykong shows some examples using opencsv directly and, at the end, writes a simple parser to use if you don't want to import OpenCSV; the parser works very well, and you can use it if you still have any problems.
So you have the easy-to-understand source code of a simple parser that you can modify as you want, if you still have any problems or if you want to customize it for your needs.

Is there a uniform ExcelExtractor class and a factory for both xls and xlsx files?

Is there a common class and an implementation of the ExcelExtractor interface that handles, uniformly, extraction of text from xls and xlsx sources?
Maybe something in the ss package.
I am looking for something that would allow me to do something like the example below, but by getting the right implementation from a factory, based on the file type.
Right now, I am having to explicitly use the org.apache.poi.hssf.extractor.ExcelExtractor
for the xls files and org.apache.poi.xssf.extractor.XSSFExcelExtractor for xlsx.
For example, the explicit approach for xls:
InputStream inp = new FileInputStream(path);
HSSFWorkbook wb = new HSSFWorkbook(new POIFSFileSystem(inp));
ExcelExtractor extractor = new ExcelExtractor(wb);
extractor.setFormulasNotResults(true);
extractor.setIncludeSheetNames(false);
String text = extractor.getText();
I can implement my own factory, but before I do that I thought I'd ask to see if there is a common approach that handles both formats (that is what the ss package is for).
Two options:
First, if you really want to stick with the old Apache POI text extractors, use the ExtractorFactory class. That will identify the type and create an extractor for you.
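A minimal sketch of the ExtractorFactory route (note: package names vary across POI versions; in the versions I've used, the factory lives in org.apache.poi.extractor):

import java.io.File;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.extractor.POITextExtractor;

// The factory sniffs the file type and returns the matching extractor
// behind the common POITextExtractor interface.
POITextExtractor extractor = ExtractorFactory.createExtractor(new File(path));
String text = extractor.getText();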
However, the better option is Apache Tika. Tika builds on top of POI (and lots of other libraries), and gives you plain text extraction (plus detection, XHTML and more!) from a wide range of file formats. You'd just call Tika, ask for the text, and get it back no matter the type. See Tika examples like this one to get started.
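With Tika the whole thing collapses to a couple of lines, and the type detection is done for you (the same approach as the Tika answer further up):

import java.io.File;
import org.apache.tika.Tika;

// Works for .xls and .xlsx (and many other formats) alike.
String text = new Tika().parseToString(new File(path));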

Trailing null (\x00) characters when writing text to Accumulo

I am trying to write the name of a file into Accumulo. I am using accumulo-core-1.43.
For some reason, certain files seem to be written into Accumulo with trailing \x00 characters at the end of the name. The upload is coming through a Java servlet (using the jquery file upload plugin). In the servlet, I check the name of the file with a System.out.println and it looks normal, and I even tried unescaping the string with
org.apache.commons.lang.StringEscapeUtils.unescapeJava(...);
The actual writing to accumulo looks like this:
Mutation mut = new Mutation(new Text(checkSum));
Value val = new Value(new Text(filename).getBytes());
long timestamp = System.currentTimeMillis();
mut.put(new Text(colFam), new Text(EMPTY_BYTES), timestamp, val);
but nothing unusual showed up there (perhaps \x00 isn't escaped?). But then if I do a scan on my table in Accumulo, there will be one or more \x00 in the file name.
The problem this seems to cause is that I return that string within XML when I retrieve a list of files (where it shows up) and pass that back to the browser, and then the XSL that is supposed to render the information in the XML no longer works when these extra characters are present (not sure why that is the case either).
In chrome, for the response on these calls, I see that there's three red dots after the file name, and when I hover over it, \u0 pops up (which I think is a different representation of 0/null?).
Anyway, I'm just trying to figure out why this happens, or at the very least, how I can filter out \x00 characters before returning the file in Java. any ideas?
You are likely incorrectly using the Hadoop Text class -- this is not an error with Accumulo. Specifically, you make the mistake in your above example:
Value val = new Value(new Text(filename).getBytes());
You must adhere to the length provided by the Text class. See the Text javadoc for more information. If you're using Hadoop 2.2.0, you can use the provided copyBytes method on Text. If you're on an older version of Hadoop where this method doesn't exist yet, you can use something like the ByteBuffer class or the System.arraycopy method to get a copy of the byte[] with the proper limits enforced.
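Concretely, the fix might look like this (a sketch; copyBytes() is the Hadoop 2.2.0+ shortcut mentioned above, Arrays.copyOf the fallback for older versions):

import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.accumulo.core.data.Value;

Text t = new Text(filename);
// Text.getBytes() exposes the backing array, which may be longer than the
// actual content; trim it to getLength() to avoid trailing \x00 bytes.
Value val = new Value(Arrays.copyOf(t.getBytes(), t.getLength()));

// On Hadoop 2.2.0 or later, Text#copyBytes() does the trimming for you:
// Value val = new Value(t.copyBytes());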

Java/ImageIO Validate format before reading the entire file?

I'm developing a Web application that will let users upload images.
My concern is the files' size, especially if they are invalid formats.
I'm wondering if there's a way in Java (or a third-party library) to check the allowed file formats (jpg, gif and png) before reading the entire file.
If you wish to support only a few types of images, you can start (up)loading the image and at some point use the first few bytes to check whether you wish to continue the upload.
Quite a lot of image formats can be recognized by the first few bytes, the magic number. If the number matches you still don't know whether the file is valid, of course, but it can be used to check that the extension and the magic number correspond, and to reject files where they really do not correspond at all.
Have a look at this page to check out some Java which checks mime-types. Do read the docs or source to check whether any given method requires the entire file, or can operate on the first few bytes. I've not used those libraries :)
Also check out this page which also lists some java libraries, and some papers on which detection is based.
Don't forget to put in some feedback if you managed to find something you like!
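To make the magic-number idea concrete, here is a minimal sketch in Java (the byte signatures are the standard PNG/JPEG/GIF headers; the helper name sniffImageType is made up for illustration):

import java.io.IOException;
import java.io.InputStream;

// Sniff the first bytes of an upload before reading the rest. A matching
// header doesn't guarantee a valid image, only a plausible one.
static String sniffImageType(InputStream in) throws IOException {
    byte[] head = new byte[8];
    int n = in.read(head);
    if (n >= 4 && (head[0] & 0xFF) == 0x89
            && head[1] == 'P' && head[2] == 'N' && head[3] == 'G') {
        return "png";   // PNG: 89 50 4E 47 ...
    }
    if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xD8) {
        return "jpg";   // JPEG: FF D8 ...
    }
    if (n >= 3 && head[0] == 'G' && head[1] == 'I' && head[2] == 'F') {
        return "gif";   // GIF: "GIF87a" / "GIF89a"
    }
    return null;        // not one of the allowed types
}

The JDK's java.net.URLConnection.guessContentTypeFromStream does similar sniffing out of the box, if you'd rather not hard-code the signatures.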
You don't need 3rd party libraries. The code you have to write is simple.
At the point you are handling your uploads, filter the files by their extension. This isn't perfect, but will account for most of the cases.
However, this would mean files are already uploaded to the server. You can use a bit of JavaScript on the client side to perform the same operation: check whether the value of the file-upload component contains an allowed file type (.jpg, .png, etc.):
function extensionsOkay(fval) {
    var extension = new Array();
    extension[0] = ".png";
    extension[1] = ".gif";
    extension[2] = ".jpg";
    extension[3] = ".jpeg";
    extension[4] = ".bmp";
    // No other customization needed.
    var thisext = fval.substr(fval.lastIndexOf('.')).toLowerCase();
    for (var i = 0; i < extension.length; i++) {
        if (thisext == extension[i]) {
            $('#support-documents').hide();
            return true;
        }
    }
    // show client side error message
    $('#span.failed').show();
    return false;
}
