Excel Application to Web based Application - java

I have been trying to find the right design/toolset to help our business users. They have enormous amounts of data in Excel files, which they push through roughly 400+ Excel formulas and calculations, usually on a row-by-row basis, with VLOOKUPs into other sheets. In designing a system for them, I want to enable them to define the business rules themselves, so that we can stick to designing and implementing the system, which will change state according to the business rules they define. What current stack of technologies would be able to support this?
The basic requirements are:
It should be able to handle and process millions of rows of data (the millions of rows need not be processed at the same time; they can be processed sequentially).
It should convert the existing Excel formulas into rules that a business user can edit and maintain. (These Excel formulas are quite complex: they deal with multiple sheets, make decisions based on row data from multiple sheets, and use VLOOKUP, MATCH and INDEX to find the corresponding matching row in a different sheet.)
I am planning to use Drools and Guvnor for it.
What do you all suggest? Is there any other better option?
Even with Drools, my major concern is whether a business user will be able to create the rules as easily as they can in Excel.

The "millions" won't be a problem for sequential processing, if there's a reasonably fast way of input and output of the data itself.
Lookups into other sheets can be transformed into sets of static facts, loaded once when the session is started - just a technicality.
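For illustration only, here is a minimal sketch of how those lookup rows might be inserted as facts into a Drools session before the per-row data is processed; the LookupRow class and the session name "ksession-rules" are hypothetical and would come from your own kmodule.xml:

    import org.kie.api.KieServices;
    import org.kie.api.runtime.KieContainer;
    import org.kie.api.runtime.KieSession;

    public class LookupLoader {

        // Hypothetical fact class standing in for one row of a lookup sheet.
        public static class LookupRow {
            private final String key;
            private final double rate;
            public LookupRow(String key, double rate) { this.key = key; this.rate = rate; }
            public String getKey() { return key; }
            public double getRate() { return rate; }
        }

        public static void main(String[] args) {
            KieServices ks = KieServices.Factory.get();
            KieContainer container = ks.getKieClasspathContainer();
            KieSession session = container.newKieSession("ksession-rules"); // defined in kmodule.xml

            // Load the "other sheet" once, as static facts, when the session starts.
            session.insert(new LookupRow("A-100", 0.15));
            session.insert(new LookupRow("B-200", 0.25));

            // Then insert the rows to be processed (batch by batch) and fire the rules.
            session.fireAllRules();
            session.dispose();
        }
    }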
The transformation of the Excel formulas: Ay, there's the rub. The Business User (BU) will not be able to transform them off the cuff. Rules aren't any more complicated than Excel formulas, but the BUs will need some formal training, ideally tailored to the subset they'll have to use. This also applies if they should use Guvnor for editing the formulas, which is just a more convenient writing tool but no silver bullet.
BTW: Excel formulas do require a certain amount of technical knowledge, even if their domain doesn't have that look and feel.

Related

Best way to process lots of POJOs

I have an ever-growing data set (stored in a Google spreadsheet from day one) which I now want to do some analysis on. I have some basic spreadsheet processing done, which worked fine when the data set was < 10,000 rows, but now that I have over 30,000 rows it takes a painful length of time to refresh the sheet when I make any changes.
So basically each data entry contains the following fields (among other things):
Name, time, score, initial value, final value
My spreadsheet was OK as a data analysis solution for things like giving me all rows where Name contained the string "abc" and score was < 100.
However, as the number of rows increases it takes google sheets longer and longer to generate a result.
So I want to load all my data into a Java program (Java because it is the language I am most familiar with, and I want to use this as a meaningful way to refresh my Java skills too).
I also have an input variable which my spreadsheet uses when processing the data, and which I adjust in incremental steps to see how the output is affected. But getting a result for each incremental change to this input variable takes far too long. This is something I want to automate, so I can set the range of the input value and the increment step and then have the system generate the output for each incremental value.
My question is: what is the best way to load this data into a Java program? I have the data in a txt file, so I figured I could read each line into its own POJO and, once all 30,000 rows are loaded into an ArrayList, start crunching through them. Is there a more efficient data container or method I could be using?
If you have a bunch of arbitrary (unspecified, probably ad-hoc) data processing to do, and using a spreadsheet is proving too slow, you would be better off looking for a better tool or a more applicable language.
Here are some of the many possibilities:
Load the data into an SQL database and perform your analysis using SQL queries. There are many interactive database tools out there.
OpenRefine. Never used it, but I am told it is powerful and easy to use.
Learn Python or R and their associated data analysis libraries.
It would be possible to implement this all in Java and make it go really fast, but for a dataset of 30,000 records it is (IMO) not worth the development effort.
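That said, if you do want the Java refresher, here is a minimal sketch of the ArrayList-of-POJOs approach from the question. The tab-separated layout, the file name and the field order are assumptions about your txt file:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;

    public class ScoreAnalysis {

        // One row per entry; field names taken from the question (Java 16+ record).
        record Entry(String name, long time, int score, double initialValue, double finalValue) {}

        public static void main(String[] args) throws IOException {
            // Assumes one tab-separated record per line: name, time, score, initial value, final value.
            List<Entry> entries = Files.lines(Path.of("data.txt"))
                    .map(line -> line.split("\t"))
                    .map(f -> new Entry(f[0], Long.parseLong(f[1]), Integer.parseInt(f[2]),
                            Double.parseDouble(f[3]), Double.parseDouble(f[4])))
                    .collect(Collectors.toList());

            // Example query: all rows where the name contains "abc" and the score is below 100.
            List<Entry> matches = entries.stream()
                    .filter(e -> e.name().contains("abc") && e.score() < 100)
                    .collect(Collectors.toList());

            System.out.println(matches.size() + " matching rows");
        }
    }

For 30,000 rows a plain ArrayList like this is more than enough; iterating it once per increment of your input variable should still finish in seconds.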

What is a good framework to implement data transformation rules through UI

Let me describe the problem. A lot of suppliers send us data files in various formats (with various headers). We do not have any control over the data format (i.e. what columns the suppliers send us). This data then needs to be converted to our standard transactions (this standard is constant and defined by us).
The challenge here is that we have no control over what columns the suppliers send us in their files, while the destination standard is constant. I have now been asked to develop a framework through which end users can define their own data transformation rules through a UI (say, field A in the destination transaction equals columnX + columnY, or the first 3 characters of columnZ from the input file). There will be many such data transformation rules.
The goal is that users should be able to add all these supplier files (and convert all their data to my company's format) from a front-end UI with minimal code change. Please suggest some frameworks for this (preferably Java-based).
Worked in a similar field before. Not sure I would trust customers/suppliers to use such a tool correctly and design 100% bulletproof transformations. Mapping columns is one thing, but how about formatting problems in dates, monetary values and the like? You'd probably need to manually check their creations anyway, or you'll end up with some really nasty data consistency issues. Errors caused by faulty data transformation are little beasts hiding in the dark and jumping at you when you need them the least.
If all you need is a relatively simple, graphical way to design data conversions, check out something like Talend Open Studio (just google it). It calls itself an ETL tool, but we used it for all kinds of stuff.
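If you end up rolling your own instead of (or on top of) an ETL tool, here is a minimal sketch of what a user-defined transformation rule could look like in Java; all class, field and column names here are hypothetical:

    import java.util.Map;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class TransformationRules {

        // A rule maps one input row (column name -> value) to one destination field value.
        interface FieldRule extends Function<Map<String, String>, String> {}

        // Hypothetical rules mirroring the examples in the question:
        // destination field A = columnX + columnY, destination field B = first 3 characters of columnZ.
        static final Map<String, FieldRule> RULES = Map.of(
                "A", row -> row.getOrDefault("columnX", "") + row.getOrDefault("columnY", ""),
                "B", row -> {
                    String z = row.getOrDefault("columnZ", "");
                    return z.length() < 3 ? z : z.substring(0, 3);
                }
        );

        static Map<String, String> transform(Map<String, String> supplierRow) {
            return RULES.entrySet().stream()
                    .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().apply(supplierRow)));
        }

        public static void main(String[] args) {
            Map<String, String> row = Map.of("columnX", "foo", "columnY", "bar", "columnZ", "abcdef");
            System.out.println(transform(row)); // {A=foobar, B=abc} (key order may vary)
        }
    }

The UI part would then just be a form that builds these rules (or an expression string parsed into them) per supplier, instead of hard-coding them.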

Map multiple columns from multiple files which are slightly different

I am looking for a good, practical method of tackling metadata normalization between multiple files that have slightly different schemas, for a batch ETL job in Talend.
I have a few hundred historical reports (around 25K to 200K records each) with about 100 to 150 columns per Excel file. Most of the column names are the same for all the files (98% overlap), however there are subtle, evil differences:
Different Column orders
Different Column names (sometimes using and sometimes not using abbreviations)
Different counts of columns
Sometimes columns have spaces between words, sometimes dots, dashes or underscores
etc.
Short of writing a specialized application or brute forcing all the files by manually correcting them, are there any good free tools or methods that would provide a diff and correction between file column names in an intelligent or semi-automated fashion?
You could use Talend Open Studio to achieve that. But I do see one caveat.
The official way
In order to make Talend understand your Excel files, you will first need to load their metadata. The caveat is that you will need to load all the metadata by hand (one by one). In the free version of Talend (Open Studio for Data Integration), there is no support for dynamic metadata.
Using components like tMap, you can then map your input metadata onto your desired output metadata (which could be an Excel file, a database, or something else). During this step you can shape your input data into your desired output (fixing / ignoring / transforming it / etc.).
The unofficial way
There seems to exist a user-contributed component that offers support for dynamic Excel metadata. I did not test it, but it is worth trying:
http://www.talendforge.org/exchange/?eid=663&product=tos&action=view&nav=1,1,1
This may evolve, as components are released and updated frequently.
My answer reflects the status as of version 5.3.1.
I write this tentatively as an "answer" because I don't have the link to hand to demonstrate how exactly it can be done. However, Pentaho Data Integration provides a very good way to load files like this: you can read the metadata of the file (by that I mean the column names) in the first transformation, and then use the "metadata injection" functionality to inject that metadata into the next transformation, which reads the file.
Now, in the scenario where your column names are slightly different, you'll have to do some additional mapping somehow. Perhaps you can store a lookup table somewhere of "alias" column names and real column names.
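To illustrate that alias-table idea in plain Java (all of the names below are made up), the header normalization could be as simple as:

    import java.util.Map;

    public class ColumnNormalizer {

        // Hypothetical alias table: normalised header variant -> canonical column name.
        private static final Map<String, String> ALIASES = Map.of(
                "cust_id", "customer_id",
                "qty", "quantity"
        );

        // Collapse spaces, dots, dashes and underscores into "_", lower-case, then apply the alias table.
        static String canonical(String header) {
            String key = header.trim().toLowerCase().replaceAll("[\\s.\\-_]+", "_");
            return ALIASES.getOrDefault(key, key);
        }

        public static void main(String[] args) {
            System.out.println(canonical("Cust ID"));      // customer_id
            System.out.println(canonical("Customer-Id"));  // customer_id
            System.out.println(canonical("Qty"));          // quantity
        }
    }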
Either way, this sounds like a pretty complex / nasty task to automate!
I've not seen any way to handle varying metadata of a file in Talend, although I'm happy to be corrected on this point!

Dynamic data storage with search and mapping for java project

I'm looking for a good dynamic data store, mainly Java-based, or at least one that is really easy to use from Java.
The main problem in my project is the fact that our data structures will not remain stable; structures will be changed from time to time, so a conventional relational database basically loses the fight at that level, because dropping and adding columns is pretty risky. Which means that some NoSQL, XML-based, or even file-based storage would be usable here.
All inputs come from other sources, which could be a SOAP callback, a JSON callback, an import from a CSV file, or manual input; based on that I have to create entities and then fill them with data.
The last thing I have to keep an eye on is bringing unstructured, semi-structured and differently structured data into a unified form. Besides this, it would be nice to be able to maintain a huge amount of data within an acceptable time.
Any ideas?
HyperSQL is open source and free.

What database to use?

I'm new to databases, but I think I finally have a situation where flat files won't work.
I'm writing a program to analyze the outcomes of multiplayer games, where each game could have any number of players grouped into any number of teams. I want to allow players to win, tie, or leave partway through the game (and win/lose based on team performance).
I also might want to store historical player ratings (unless it's faster to just recompute that from their game history), so I don't know if that means storing each player's rating alongside each game played, or having a separate table for each player, or what.
I don't see any criteria that impact database choice, but I'll list the free ones:
PostgreSQL
MySQL
SQL Server Express
Oracle Express
I don't recommend an embedded database like SQLite, because embedded databases make trade-offs in features to accommodate space and size concerns. I don't agree with their belief that data typing should be relaxed - it has led to numerous questions on SO about how to deal with date/time filtration, among other things...
You'll want to learn about normalization, getting data to Third Normal Form (3NF), because it enforces referential integrity, which also minimizes data redundancy. For example, your player stats would not be stored in the database - they'd be calculated at the time of the request based on the data on hand.
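As a concrete (and entirely hypothetical) illustration of deriving a stat at request time instead of storing it, assuming a game_results table with player_id and outcome columns:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class PlayerStats {

        // Hypothetical schema: game_results(player_id, game_id, outcome) with outcome 'WIN'/'LOSS'/'TIE'.
        // The win rate is computed from the raw results on every request, never stored.
        static double winRate(Connection conn, long playerId) throws SQLException {
            String sql = "SELECT COALESCE(SUM(CASE WHEN outcome = 'WIN' THEN 1 ELSE 0 END) * 1.0"
                       + " / NULLIF(COUNT(*), 0), 0) FROM game_results WHERE player_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, playerId);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getDouble(1) : 0.0;
                }
            }
        }

        public static void main(String[] args) throws SQLException {
            // Connection URL and credentials are placeholders; any of the databases listed above would do.
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/games", "user", "pass")) {
                System.out.println("Win rate: " + winRate(conn, 42L));
            }
        }
    }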
You didn't mention any need for locking mechanisms, where multiple users may be competing to write the same data to the same resource (a database record, or a file in the case of flat files) simultaneously. What I would suggest is to get a good book on database design and try to understand the normalization rules in depth. Distributing data across separate tables has a performance impact, but it also affects the ease of query construction. This is a very involved topic, and there's no simple answer to it. That's why companies hire database administrators to keep their data structures optimized.
You might want to look at SQLite, if you need a lightweight database engine.
Some good options were mentioned already, but I really think that on the Java platform H2 is a very good choice. It is perfect for testing (as an in-memory test database), but it also works very well for embedded use cases and as a stand-alone "real database". Plus it is easy to export as a dump file, import from that, and move around. And it works efficiently too.
It is developed by a very good Java DB guy, it is not his first take, and you can see this from the maturity of the project. On top of that, it is still being actively developed as well as supported.
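A minimal sketch of what getting started with an in-memory H2 database looks like over plain JDBC (the table and data are made up; you only need the h2 jar on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class H2Demo {

        public static void main(String[] args) throws SQLException {
            // In-memory database; it disappears when the last connection is closed.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:testdb", "sa", "");
                 Statement st = conn.createStatement()) {

                st.execute("CREATE TABLE player (id BIGINT PRIMARY KEY, name VARCHAR(100))");
                st.execute("INSERT INTO player VALUES (1, 'alice'), (2, 'bob')");

                try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM player")) {
                    rs.next();
                    System.out.println("players: " + rs.getLong(1));
                }
            }
        }
    }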
A word on why nobody has even mentioned any of the "NoSQL" databases, even though you used it as a tag:
Non-SQL databases are getting a lot of attention (or even outright hype) recently, because of some high-profile use cases, because they're new (and therefore interesting), and because of their promise of incredible scalability (which is "sexy" to programmers). However, only a very few, very big players actually need that kind of scalability - and you certainly don't.
Another factor is that SQL databases require you to define your DB schema (the structure of tables and columns) beforehand, and changing it is somewhat problematic (especially if you already have a very large database). Non-SQL databases are more flexible in that regard, but you pay for it with more complex code (e.g. after you introduce a new field, your code needs to be able to deal with elements where it's not yet present). It doesn't sound like you need this kind of flexibility either.
Try also OrientDB. It's free (Apache 2 license), runs everywhere, supports SQL, and it's really fast. It can insert 1,000,000 records in 6 seconds on common hardware.
