Read manually inserted text from an Excel spreadsheet - java

I have a .xlsx file containing my university's timetable, and I'm working on an application that uses it. I don't want to copy the timetable contents by hand from this Excel spreadsheet into a more programmer-friendly format. Instead, I'd like to write a program or script that parses this .xlsx table and automatically converts it into the format I need (e.g. into objects in code).
I have no trouble reading the "normal" cells of the spreadsheet. However, instead of simply putting one text entry in each cell, the person who created this timetable file manually "divided" some cells into "subcells" and manually inserted some text into each of them. This looks like:
How this should be interpreted: students are divided into 4 groups. At 15.20-16.50 only groups 1 and 2 have a specific class. At 17.00-18.30 only groups 1, 3, and 4 have that class.
As one can see, these "cells" are not real cells — they seem to have been created ("divided") manually, just like the text that is selected in the picture.
The question is: how do I find and read such "cells" (manually inserted text components) like the ones in the picture? Preferably I would also get their position, so that I can read not only which classes exist but also when they start (the times are listed in the leftmost column of the spreadsheet).
I tried using Python's xlrd module but haven't been able to achieve what I need. I haven't had any success with Java's Apache POI either: I just can't find how to read such text entries. Solutions in either language, no matter what libraries and approaches are used, will be fine for me.

Both xls and xlsx are proprietary formats. Microsoft went out of their way to explain in court that xlsx is open, but unfortunately not one of the judges involved knew anything significant about computer science and the lawyers knew it, so don't get distracted by their misleading case. XLSX has the option for the 'vendor' to add a block of 'custom binary blobs', and the vast majority of the Excel features that aren't the most common, lowest level stuff imaginable are in these binary blobs. No doubt this 'stick a text table object into a single cell' thing that's going on here is exactly like that.
Microsoft has never released any documentation on these binary blobs, nor any library that can parse them.
Therefore, Apache POI, xlrd, and all other libraries that read XLS files without explicitly requiring Excel to be installed and running on the same computer as the 'library' (kind of a tricky thing to pull off if you have e.g. a Linux-based server!) are based on reverse engineering the format, and it's a horrible format. Literally - look up what Apache POI's 'HSSF' stands for. Officially nothing, but etymologically, that H is for Horrible. (Horrible Spread Sheet Format - HSSF).
That's the long way around of saying: Sorry - you probably can't. And it's not the fault of POI or xlrd, it's on Microsoft. It is not appropriate to use such a closed, proprietary and undocumented format to transfer anything meaningful. The error lies in whatever process led to the situation that you're now stuck trying to write software to parse a weird Excel file.
If you must, most likely a script running within Excel can untangle this mess and write out a CSV file or JSON or something else in a documented format. Alternatively, you can write something in C#, but it would just be farming out the work to Excel, so you still would not be able to port this code to other platforms.
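If you go the "let a script inside Excel write a CSV" route, the Java side becomes simple. Here is a minimal sketch of that side only. Everything about the file layout is an assumption of mine: I'm pretending the Excel script writes one line per group and time slot in the form `time;group;subject`, and the names `TimetableCsv` and `Slot` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TimetableCsv {

    /** One class session for one student group (hypothetical shape). */
    static final class Slot {
        final String time;
        final int group;
        final String subject;

        Slot(String time, int group, String subject) {
            this.time = time;
            this.group = group;
            this.subject = subject;
        }
    }

    /** Parses lines of the assumed form "time;group;subject", skipping blank lines. */
    static List<Slot> parse(List<String> lines) {
        List<Slot> slots = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            // Limit of 3 keeps any further ';' inside the subject text.
            String[] parts = line.split(";", 3);
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected time;group;subject but got: " + line);
            }
            slots.add(new Slot(parts[0].trim(), Integer.parseInt(parts[1].trim()), parts[2].trim()));
        }
        return slots;
    }

    public static void main(String[] args) {
        // Sample rows modelled on the question; "Physics" is a placeholder subject.
        List<String> lines = List.of(
                "15.20-16.50;1;Physics",
                "15.20-16.50;2;Physics",
                "17.00-18.30;1;Physics");
        for (Slot s : parse(lines)) {
            System.out.println(s.time + " group " + s.group + ": " + s.subject);
        }
    }
}
```

The point of the design is that all the Excel-specific untangling stays inside Excel, and the portable code only ever sees a flat, documented text format.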
Apache POI does give you the option of a more low-level approach where you can read the binary blobs. You can attempt to reverse engineer whatever's going on in that 'cell-with-a-table-in-it' yourself, but as neither the xlrd team nor the Apache POI team has bothered, and at least the POI team is on record as saying the format seems to be designed to be obfuscated - that sounds like a job that will take you many, many weeks.
That gets me back to the solution I advised earlier: Unless spending many weeks building an incredibly fragile stack that requires a full-blown Windows installation and an Excel license is the lesser evil compared to a simple change in human behaviour (unlikely), the fix lies in addressing the process (as in, address that Excel is used to transfer this info, or at least make the Excel sheet muuuch simpler than this thing), and not in finding out how to read this mess in Java or Python.

Related

How to fill tables of a word document programmatically

I have been given a Microsoft Word document with some tables and spots to fill in automatically. I am not sure if this can be done with Java, which is my preferred language.
I am looking for a way to implement a function that takes the Word file and fills in the required spots for me. Is it possible to do it? A hint or a link to a tutorial would definitely suffice. Thanks.
Newer versions of Word store documents as zipped XML. Have you filled out the form manually in Word and done a before/after comparison on the XML? Depending on the extent of the changes you could use the standard Java XML APIs to do the same thing programmatically.
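To give an idea of what "zipped XML plus the standard Java XML APIs" looks like, here is a rough sketch that opens a .docx as a zip archive, parses `word/document.xml` with the JDK's DOM parser, and lists the text of every `<w:t>` run. It only reads. Filling in the spots would mean changing those nodes and writing the archive back out, which I have left out. The class and method names are my own, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class DocxTextRuns {

    // WordprocessingML namespace that the <w:t> text-run elements live in.
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of every <w:t> element in word/document.xml, in document order. */
    static List<String> readTextRuns(InputStream docx) throws Exception {
        List<String> runs = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!"word/document.xml".equals(e.getName())) {
                    continue;
                }
                // Copy the entry first so the XML parser cannot close the zip stream.
                byte[] xml = zip.readAllBytes();
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
                NodeList nodes = doc.getElementsByTagNameNS(W_NS, "t");
                for (int i = 0; i < nodes.getLength(); i++) {
                    runs.add(nodes.item(i).getTextContent());
                }
                break;
            }
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxTextRuns form.docx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            readTextRuns(in).forEach(System.out::println);
        }
    }
}
```

Running this on the empty form and on the manually filled copy is a quick way to do the before/after comparison and see which runs you would need to change.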
A bit of googling and I found docx4j and Apache POI. I haven't used either personally, but it appears that what you're asking for is certainly possible. See this example from the POI SVN repo on how to manipulate tables.

Creating an editable document via java web application

I am looking for a convenient method to export some data from my database into a form that would be editable afterwards. The perfect scenario would be to export a word document, and perhaps a brutally simple solution would be to generate HTML and copy/paste it into Word.
I've looked at several open source libraries for generating word documents, but they seem a bit too simple or incomplete. I need support for tables and embedded images and control over formatting the fonts, table borders etc. (too much formatting seems to be lost when copying html and pasting into word).
Although Word is the end format, it'd be fine to generate it in any format that word would be able to open and subsequently save as DOCX.
I really haven't been able to find anything about generating ODT files (server side without client installation).
I would just dive into the ASPOSE libraries, but it'll take ages (and significant pain) to get a purchase order sorted out, so I need to make sure it's the only viable option before taking that route.
I could generate it as an Excel file and copy that to Word - this is looking like the best option currently.

Import for Java or Other Languages that will Generate Flowcharts, Given Data

I'm trying to create an automated "spider diagram" like the ones created by VUE:
http://vue.tufts.edu/
VUE is open source, but the issue is that you create the maps in the program. I want to have a program that will pull the data from an excel sheet and display the map automatically when run.
I know how to open and parse the data in files, so reading the file isn't the issue. I can program the behavior of how I want everything to "link up", but I just don't want to have to create an applet, then develop the software from scratch.
If I made anything unclear, let me know. I'm very tired today, so it's difficult to stay focused very long.
Many thanks!
-Justian
JGraph is a library to do that. You give it the node and edges and it figures out how to present them in a meaningful way. It is kind of like using graphviz but in Java.
For visualization of production runs we use graphviz out of process and show the images generated from that. It works fine, but a single process solution would be better.
Reading an Excel file as CSV should be straightforward. POI lets you read Excel files directly.
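For the graphviz route mentioned above, the Java part can be as small as turning the rows into DOT text and handing that to the `dot` tool. A minimal sketch, assuming the sheet has been exported as a two-column CSV (source node, target node) with no quoted commas. The class name and the sample node names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class DotExport {

    /** Splits naive "source,target" lines into edge pairs; lines with fewer than two columns are skipped. */
    static List<String[]> parseEdges(List<String> csvLines) {
        List<String[]> edges = new ArrayList<>();
        for (String line : csvLines) {
            String[] cols = line.split(",");
            if (cols.length >= 2) {
                edges.add(new String[] {cols[0].trim(), cols[1].trim()});
            }
        }
        return edges;
    }

    /** Builds a Graphviz DOT digraph with one "a" -> "b" statement per edge. */
    static String toDot(List<String[]> edges) {
        StringBuilder sb = new StringBuilder("digraph G {\n");
        for (String[] e : edges) {
            sb.append("  ").append(quote(e[0])).append(" -> ").append(quote(e[1])).append(";\n");
        }
        return sb.append("}\n").toString();
    }

    /** Wraps a node name in double quotes, escaping embedded double quotes only. */
    static String quote(String id) {
        return "\"" + id.replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Idea,Research", "Idea,Budget", "Research,Prototype");
        System.out.print(toDot(parseEdges(lines)));
    }
}
```

Saving the output as `map.dot` and running `dot -Tpng map.dot -o map.png` produces the image. Graphviz picks the layout, which is the part you did not want to write from scratch.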

How to programmatically extract and manipulate images from an Office file?

How to extract some images from PowerPoint and Word documents, in order to manipulate them, and after that, put the images back in the MS Office files?
Apache has a project called "POI" explicitly made for interacting with MS Office formats from Java. Hopefully that does it for you!
http://poi.apache.org/
Apache POI can handle Word documents via its HWPF module, and extract or insert images from these. Although it's not well documented, check out the POI unit tests for image manipulation within Word (the unit tests seem to be the best documentation for this module).
Failing that, the COM interface is accessible via (say) JACOB. That's probably more work, but will make available APIs not exposed via POI.
In terms of C++, Word exposes a COM API to allow you to manipulate its document format, so as long as you have Word installed on the machine, you can do this in C++ quite easily. Word isn't open source, but you probably have the license anyway.
The company I work for, SoftArtisans, has a product called OfficeWriter that allows you to do that, among other things, for Word and Excel (PowerPoint is planned to be added in the future). It is not free or open source though.
On the other hand, if you are working strictly with 2007 format (XML based) you can probably use OpenXML.
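For the 2007 (XML-based) formats specifically, the files are zip packages and the embedded pictures sit as ordinary files under `word/media/` (for .docx) or `ppt/media/` (for .pptx), so you can pull them out with nothing but `java.util.zip`. A minimal sketch. `OoxmlMedia` is just a name I made up.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class OoxmlMedia {

    /** Returns entry name -> raw bytes for every file under word/media/ or ppt/media/. */
    static Map<String, byte[]> extractMedia(InputStream ooxml) throws IOException {
        Map<String, byte[]> media = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                String name = e.getName();
                boolean isMedia = name.startsWith("word/media/") || name.startsWith("ppt/media/");
                if (isMedia && !e.isDirectory()) {
                    // readAllBytes stops at the end of the current zip entry.
                    media.put(name, zip.readAllBytes());
                }
            }
        }
        return media;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java OoxmlMedia slides.pptx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            extractMedia(in).forEach((name, bytes) ->
                    System.out.println(name + " (" + bytes.length + " bytes)"));
        }
    }
}
```

Putting a manipulated image back should then be a matter of rewriting the zip with the same entry name and new bytes, as long as the image type stays the same. I have not shown that half here.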

Using excel as UI without VB

I think every business person would like to have excel UI, however they are forced into using web applications that sometimes look like really bad excel.
Are there any frameworks that help build an Excel UI without VB? I don't mean frameworks like POI or JExcel that allow you to generate Excel reports.
I've seen many applications built using Excel. All of them were clumsy, error prone, and next to impossible to keep up-to-date.
If the end user needs an application to work like Excel for some grid calculations, then give them a tool to do so, or let them use Excel for that portion.
However using Excel / VBA exclusively to develop big Enterprise worthy applications is heading down the wrong road. It might work well for a while, but it won't be long before issues expose the weak points.
Since you ended talking about reports... yes, by all means have your application export to CSV, HTML, PDF, Excel etc. That way the user that wants to use Excel to generate pretty pie charts, and reformat/search/scan/crop the data can do so with the tool they feel comfortable with.
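CSV is the cheapest of those exports to add, since Excel opens it directly. A minimal sketch of a writer, assuming RFC 4180 style quoting (fields containing a comma, a double quote, or a line break are wrapped in quotes, and embedded quotes are doubled). The class name and sample data are only for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CsvExport {

    /** Quotes a field only when it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins each row with commas and terminates every row with CRLF. */
    static String toCsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            sb.append(row.stream().map(CsvExport::escape).collect(Collectors.joining(",")));
            sb.append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<String>> report = List.of(
                List.of("Region", "Comment", "Total"),
                List.of("North", "on target, mostly", "1200"));
        System.out.print(toCsv(report));
    }
}
```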
A combination of the two can work quite well... Excel is not great for inputting data, this is where an app (desktop or web) works better, but excel is great for dynamic reports and analyzing data.
The best approach for dynamic reports I've seen is to write add-ins that add new functions to Excel (e.g. to pull in real-time data). In the Java space you could try XLLoop - this allows you to expose POJO functions in Excel (full disclosure: I work on this project).
Obba is an Excel add-in which allows you to instantiate Java objects and work with them directly in Excel (without VBA or any other glue code).
The nice part is that it is fully transparent what the Excel Sheet (UI) does to your Java classes.
I am not sure what you mean by UI here, but if it is for data presentation (as opposed to data input) you could e.g. use SQL Server Reporting Services and export the results to Excel format. Alternatively you can convert your data into the Excel XML format and let the user open it as an Excel file (that is a bit painful though if your data is more complex than a simple table).
EDIT
I went through the pain of presenting and processing data with Excel when creating a web system that was replacing an old paperwork-based one - that was a requirement for a transition period.
It is a real pain: all the data validation, ensuring that what is submitted back has not been modified structure-wise, etc.
My conclusion would be:
use the web system for inputting data
if required, provide the Excel format for reporting
if really, really required, you could implement parsing Excel into the web system for inputting data, but then add some human validation, as it is humanly impossible to predict all the possible errors one can create in Excel
You can look into embedding Excel as an ActiveX control into your application. It will allow you to manipulate the control from your language of choice.
This may point you in the right direction: http://j-integra.intrinsyc.com/support/kb/Article.aspx?id=30421
For Java, this one is pretty good.
http://www.jxcell.net
