Map multiple columns from multiple files which are slightly different - java

I am looking for a good, practical method of tackling metadata normalization between multiple files that have slightly different schemas, for a batch ETL job in Talend.
I have a few hundred historical reports (around 25K to 200K records each) with about 100 to 150 columns per Excel file. Most of the column names are the same for all the files (98% overlap); however, there are subtle, evil differences:
Different column orders
Different column names (sometimes using abbreviations, sometimes not)
Different numbers of columns
Words in column names separated sometimes by spaces, sometimes by dots, dashes or underscores
etc.
Short of writing a specialized application or brute-forcing all the files by manually correcting them, are there any good free tools or methods that would provide a diff and correction between file column names in an intelligent or semi-automated fashion?

You could use Talend Open Studio to achieve that. But I do see one caveat.
The official way
In order to make Talend understand your Excel files, you will first need to load their metadata. The caveat is that you will need to load all of the metadata by hand (one by one). In the free version of Talend (Open Studio for Data Integration), there is no support for dynamic metadata.
Using components like tMap, you can then map your input metadata onto your desired output metadata (which could be an Excel file, a database or something else). During this step you can shape your input data into your desired output (fixing / ignoring / transforming it, etc.).
The unofficial way
There seems to be a user-contributed component that offers support for dynamic Excel metadata. I have not tested it, but it is worth trying:
http://www.talendforge.org/exchange/?eid=663&product=tos&action=view&nav=1,1,1
This may change, as contributed components are released and updated frequently.
My answer describes the situation as of version 5.3.1.

I write this tentatively as an "answer" because I don't have the link to hand to demonstrate exactly how it can be done. However, Pentaho Data Integration provides a very good way to load files like this: you can read the metadata of the file (that is, the column names) in a first transformation, and then use the "metadata injection" functionality to inject that metadata into the next transformation, which reads the file.
Now, in the scenario where your column names are slightly different, you'll have to do some additional mapping somehow. Perhaps you could store a lookup table somewhere of "alias" column name and real column name, along the lines of the sketch below.
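For the alias idea, here is a rough, hypothetical Java sketch (the alias entries and column names are made up, not taken from your reports): normalize each raw header by lower-casing it and stripping spaces, dots, dashes and underscores, then fall back to a hand-maintained alias table for abbreviations.
```java
import java.util.HashMap;
import java.util.Map;

// Normalize a raw header (spaces / dots / dashes / underscores, mixed case),
// then fall back to a hand-maintained alias table for abbreviations.
// The alias entries below are made-up examples, not real report columns.
public class HeaderNormalizer {
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("custno", "customernumber");
        ALIASES.put("qty", "quantity");
    }

    static String normalize(String rawHeader) {
        String key = rawHeader.toLowerCase().replaceAll("[\\s._-]+", "");
        return ALIASES.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Customer Number")); // customernumber
        System.out.println(normalize("Cust_No"));         // customernumber
        System.out.println(normalize("QTY"));             // quantity
    }
}
```
Whatever normalization fails to match would still need a manual decision, but it should shrink the number of columns you have to map by hand.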
Either way, this sounds like a pretty complex / nasty task to automate!
I've not seen any way to handle varying metadata of a file in Talend - Although happy to be corrected on this point!

Related

Best way to process lots of POJOs

I have an ever-growing data set (stored in a Google spreadsheet from day one) which I now want to do some analysis on. I have some basic spreadsheet processing done which worked fine when the data set was < 10,000 rows, but now that I have over 30,000 rows it takes a painful length of time to refresh the sheet when I make any changes.
So basically each data entry contains the following fields (among other things):
Name, time, score, initial value, final value
My spreadsheet was OK as a data analysis solution for things like giving me all rows where Name contained the string "abc" and score was < 100.
However, as the number of rows increases, it takes Google Sheets longer and longer to generate a result.
So I want to load all my data into a Java program (Java because it is the language I am most familiar with, and I want to use this as a meaningful way to refresh my Java skills too).
I also have an input variable which my spreadsheet uses when processing the data, and which I adjust in incremental steps to see how the output is affected. Getting a result for each incremental change to this input variable takes far too long. This is something I want to automate, so I can set the range of the input value and the increment step, and then have the system generate the output for each incremental value.
My question is: what is the best way to load this data into a Java program? I have the data in a txt file, so I figured I could read each line into its own POJO and, when all 30,000 rows are loaded into an ArrayList, start crunching through them. Is there a more efficient data container or method I could be using?
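As a rough illustration of the POJO approach described above (only a sketch: the field names come from the question, while the tab-separated layout, the file name and the Java 16+ record syntax are assumptions), 30,000 rows fit comfortably in memory and can be filtered with streams:
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// One row: Name, time, score, initial value, final value (names from the question;
// the tab-separated layout of data.txt is an assumption)
record Entry(String name, long time, double score, double initialValue, double finalValue) {
    static Entry parse(String line) {
        String[] f = line.split("\t");
        return new Entry(f[0], Long.parseLong(f[1]), Double.parseDouble(f[2]),
                Double.parseDouble(f[3]), Double.parseDouble(f[4]));
    }
}

public class Loader {
    public static void main(String[] args) throws IOException {
        List<Entry> entries;
        try (Stream<String> lines = Files.lines(Path.of("data.txt"))) {
            entries = lines.map(Entry::parse).collect(Collectors.toList());
        }

        // Example query from the question: Name contains "abc" and score < 100
        long matches = entries.stream()
                .filter(e -> e.name().contains("abc") && e.score() < 100)
                .count();
        System.out.println(matches + " matching rows out of " + entries.size());
    }
}
```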
If you have a bunch of arbitrary (unspecified, probably ad-hoc) data processing to do, and using a spreadsheet is proving too slow, you would be better off looking for a better tool or a more applicable language.
Here are some of the many possibilities:
Load the data into an SQL database and perform your analysis using SQL queries. There are many interactive database tools out there.
OpenRefine. Never used it, but I am told it is powerful and easy to use.
Learn Python or R and their associated data analysis libraries.
It would be possible to implement this all in Java and make it go really fast, but for a dataset of 30,000 records it is (IMO) not worth the development effort.

Excel Application to Web based Application

I have been trying to find the right design/toolset to help our business users. They have enormous amounts of data in Excel files which they push through various Excel formulas (nearly 400+), with calculations usually done on a row-by-row basis plus VLOOKUPs from other sheets. In trying to design a system for them, I want to enable them to define the business rules themselves, so that we can stick to designing and implementing the system, which will change state according to the business rules defined. What current stack of technologies would be able to support this?
The basic requirements are:
Should be able to handle millions of rows of data and process them (the millions of rows need not be processed at the same time; they can be processed sequentially).
Convert existing Excel formulas into rules which a business user can edit and maintain (these Excel formulas are quite complex: they deal with multiple sheets, make decisions based on row data from multiple sheets, and use VLOOKUP, MATCH and INDEX to find the corresponding matching row in a different sheet).
I am planning to use Drools and Guvnor for it.
What do you all suggest? Is there any other, better option?
Even with Drools, my major concern is whether a business user will be able to create the rules as easily as they can in Excel.
The "millions" won't be a problem for sequential processing, if there's a reasonably fast way of input and output of the data itself.
Lookups into other sheets can be transformed into sets of static facts, loaded once when the session is started - just a technicality.
The transformation of the Excel formulas: Ay, there's the rub. The Business User (BU) will not be able to transform them off the cuff. Rules aren't any more complicated than Excel formulas, but the BUs will need some formal training, ideally tailored to the subset they'll have to use. This also applies if they should use Guvnor for editing the formulas, which is just a more convenient writing tool but no silver bullet.
BTW: Excel formulas do require a certain amount of technical knowledge, even if their domain doesn't have that look and feel.
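To make the "static facts" point concrete, here is a minimal, hypothetical sketch. It assumes the Drools 6+ KIE API, a kmodule.xml that defines a default session, and an invented RateLookup fact class standing in for one VLOOKUP sheet; none of this comes from the question itself.
```java
import java.util.List;

import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

// Hypothetical lookup fact standing in for one VLOOKUP sheet: (productCode -> rate)
class RateLookup {
    final String productCode;
    final double rate;
    RateLookup(String productCode, double rate) { this.productCode = productCode; this.rate = rate; }
    public String getProductCode() { return productCode; }
    public double getRate() { return rate; }
}

public class SessionBootstrap {
    public static void main(String[] args) {
        KieServices ks = KieServices.Factory.get();
        KieContainer container = ks.getKieClasspathContainer();
        KieSession session = container.newKieSession(); // default session from kmodule.xml (assumed)

        // Load the lookup "sheet" once, as static facts the rules can join against
        List<RateLookup> lookupRows = List.of(new RateLookup("A-100", 0.15),
                                              new RateLookup("B-200", 0.25));
        lookupRows.forEach(session::insert);

        // Row-by-row data facts would then be inserted and fireAllRules() called per batch
        session.fireAllRules();
        session.dispose();
    }
}
```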

What is a good framework to implement data transformation rules through UI

Let me describe the problem. A lot of suppliers send us data files in various formats (with various headers). We do not have any control over the data format (what columns the suppliers send us). This data then needs to be converted into our standard transactions (the standard is constant and defined by us).
The challenge here is that we do not have any control over what columns the suppliers send us in their files, while the destination standard is constant. Now I have been asked to develop a framework through which end users can define their own data transformation rules through a UI (say, field A in the destination transaction equals columnX + columnY, or the first 3 characters of columnZ from the input file). There will be many such data transformation rules.
The goal is that the users should be able to add all these supplier files (and convert all their data to our company's standard) from a front-end UI, with minimum code change. Please suggest some frameworks for this (preferably Java based).
I've worked in a similar field before. I'm not sure I would trust customers/suppliers to use such a tool correctly and design 100% bulletproof transformations. Mapping columns is one thing, but what about formatting problems in dates, monetary values and the like? You'd probably need to manually check their creations anyway, or you'll end up with some really nasty data consistency issues. Errors caused by faulty data transformation are little beasts hiding in the dark and jumping at you when you need them the least.
If all you need is a relatively simple, graphical way to design data conversions, check out something like Talend Open Studio (just google it). It calls itself an ETL tool, but we used it for all kinds of stuff.
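Whatever framework you pick, the rules themselves boil down to small functions from an input row to an output field. A minimal sketch in plain Java (the field and column names are just the examples from the question; a real framework would build these functions from an expression language edited in the UI rather than hard-coding them):
```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// One transformation rule computes a destination field from a supplier row (header -> value).
public class RowMapper {

    public static void main(String[] args) {
        // Rules the UI would produce, keyed by destination field name (names are illustrative)
        Map<String, Function<Map<String, String>, String>> rules = new LinkedHashMap<>();
        rules.put("fieldA", row -> row.getOrDefault("columnX", "") + row.getOrDefault("columnY", ""));
        rules.put("fieldB", row -> {
            String z = row.getOrDefault("columnZ", "");
            return z.length() >= 3 ? z.substring(0, 3) : z;
        });

        // A supplier row as read from their file
        Map<String, String> supplierRow = new HashMap<>();
        supplierRow.put("columnX", "AB");
        supplierRow.put("columnY", "123");
        supplierRow.put("columnZ", "ONTARIO");

        // Apply every rule to produce the standard transaction
        Map<String, String> standardTransaction = new LinkedHashMap<>();
        rules.forEach((field, rule) -> standardTransaction.put(field, rule.apply(supplierRow)));
        System.out.println(standardTransaction); // {fieldA=AB123, fieldB=ONT}
    }
}
```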

CSV file, faster in binary format? Fastest search?

If I have a CSV file, is it faster to keep the file as plain text or to convert it to some other format? (for searching)
In terms of searching a CSV file, what is the fastest method of retrieving a particular row (by key)? I'm not referring to sorting the file, sorry; what I mean is looking up an arbitrary key in the file.
Some updates:
the file will be read-only
the file can be read and kept in memory
There are several things to consider for this:
What kind of data do you store? Does it actually make sense to convert it to a binary format? Will the binary format take up less space (the time it takes to read the file depends on its size)?
Do you have multiple queries for the same file while the system is running, or do you have to load the file each time someone does a query?
Do you need to efficiently transfer the file between different systems?
All these factors are very important for a decision. The common case is that you only need to load the file once and then do many queries. In that case it hardly matters what format you store the data in, because it will be stored in memory afterwards anyway. Spend more time thinking about good data structures to handle the queries.
Another common case is that you cannot keep the main application running, and hence you cannot keep the file in memory. In that case, get rid of the file and use a database. Any database you can use will most likely be faster than anything you could come up with. However, it is not easy to transfer a database between systems.
Most likely though, the file format will not be a real issue to consider. I've read quite a few very long CSV files and most often the time it took to read the file was negligible compared to what I needed to do with the data afterwards.
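As a concrete example of "good data structures to handle the queries": since the updates say the file can be kept in memory, a single pass building a key-to-row map makes every subsequent lookup O(1). A rough sketch, assuming a simple CSV with the key in the first column and no quoted or embedded commas (the file name and key are placeholders):
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Build a key -> row index once, then every lookup is a constant-time map get.
public class CsvIndex {
    public static void main(String[] args) throws IOException {
        Map<String, String[]> byKey = new HashMap<>();
        for (String line : Files.readAllLines(Path.of("data.csv"))) {
            String[] fields = line.split(",");
            byKey.put(fields[0], fields);
        }
        String[] row = byKey.get("someKey"); // "someKey" is a placeholder
        System.out.println(row == null ? "not found" : String.join(",", row));
    }
}
```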
If you have a lot of data and this is a serious production system, then use Apache Lucene.
If it's a small dataset, or this is about learning, then read up on suffix trees and tries.
"Convert" it (i.e. import it) into a database table (or preferably normalised tables) with indexes on the searchable columns and a primary key on the column that has the highest cardinality - no need to re-invent the wheel... you'll save yourself a lot of issues - transaction management, concurrency.... really - if it will be in production, the chance that you will want to keep it in csv format is slim to zero.
If the file is too large to keep in memory, then just keep the keys in memory. Some number of rows can also be keep in memory, with least-recently-accessed rows paged out as additional rows are needed. Use fseeks (directed by keys) with the file to find the row in the file itself. Then load that row into memory in case other entries on that row might be needed.
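In Java, the same idea can be sketched with a RandomAccessFile: keep only a key-to-byte-offset map in memory and seek to the row on demand. Assumptions here: the key is in the first column, every row has at least two columns, there are no embedded newlines, and the file uses a single-byte charset so readLine() offsets line up.
```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Index only the keys (key -> byte offset of the row), then seek to rows on demand.
public class OffsetIndex {
    public static void main(String[] args) throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        try (RandomAccessFile file = new RandomAccessFile("data.csv", "r")) {
            long pos = file.getFilePointer();
            String line;
            while ((line = file.readLine()) != null) {
                offsets.put(line.substring(0, line.indexOf(',')), pos);
                pos = file.getFilePointer();
            }

            // Lookup: seek straight to the row and re-read just that line
            Long offset = offsets.get("someKey"); // "someKey" is a placeholder
            if (offset != null) {
                file.seek(offset);
                System.out.println(file.readLine());
            }
        }
    }
}
```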

Efficiently store easily parsable data in a file?

I need to store easily parsable data in a file as an alternative to a database-backed solution (not up for debate). Since it's going to be storing lots of data, a lightweight syntax would be preferable. It does not necessarily need to be human readable, but it should be parsable. Note that there are going to be multiple types of fields/columns, some of which might be used and some of which won't be.
From my limited experience without a database, I see several options, all with issues:
CSV - I could technically do this, and it is very light. However, parsing would be an issue, and it would suck if I wanted to add a column. Multi-language support is iffy, mainly people's own custom parsers.
XML - This is the perfect solution on many fronts, except when it comes to parsing and overhead. That's a lot of tags, it would generate a giant file, and parsing would be very resource-consuming. However, virtually every language supports XML.
JSON - This is the middle ground, but I don't really want to do this, as it's an awkward syntax and parsing is non-trivial. Language support is iffy.
So all have their disadvantages. But what would be best when aiming for language support AND a somewhat small file size?
How about SQLite? This would allow you to basically embed the "DB" in your application, without requiring a separate DB backend.
Also, if you end up using a DB backend later, it should be fairly easy to switch over.
If that's not suitable, I'd suggest one of the DBM-like stores for key-value lookups, such as Berkeley DB or tdb.
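For illustration, embedding SQLite from Java is just standard JDBC. A minimal sketch, assuming the sqlite-jdbc driver (org.xerial:sqlite-jdbc) is on the classpath; the table and file names are made up:
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Single-file embedded "DB" via plain JDBC; no separate server process required.
public class SqliteStore {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS record (id INTEGER PRIMARY KEY, name TEXT, value REAL)");
            }
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO record(name, value) VALUES (?, ?)")) {
                ps.setString(1, "example");
                ps.setDouble(2, 42.0);
                ps.executeUpdate();
            }
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT name, value FROM record")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " = " + rs.getDouble("value"));
                }
            }
        }
    }
}
```
Switching to a "real" DB backend later is then mostly a matter of changing the JDBC URL.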
If you're just using the basics of all these formats, all of the parsers are trivial. If CSV is an option, then for XML and JSON you're talking blocks of name/value pairs, so there's not even a recursive structure involved. json.org has support for pretty much any language.
That said.
I don't see what the problem is with CSV. If people write bad parsers, then too bad. If you're concerned about compatibility, adopt the default CSV model from Excel. Anyone who can't parse CSV from Excel isn't going to get far in this world. The weakest support you find in CSV is for embedded newlines and carriage returns. If your data doesn't have these, then it's not a problem. The only other issue is embedded quotation marks, and those are escaped in CSV. If you don't have those either, then it's even more trivial.
As for "adding a column", you have that problem with all of these. If you add a column, you get to rewrite the entire file. I don't see this being a big issue either.
If space is your concern, CSV is the most compact, followed by JSON, followed by XML. None of the resulting files can be easily updated. They pretty much all would need to be rewritten for any change in the data. CSV has the advantage that it's easily appended to, as there's no closing element (like JSON and XML).
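To illustrate both points (Excel-style escaping and cheap appends), a small sketch using the common Excel/RFC 4180 quoting convention; the file name and row contents are arbitrary:
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;

// Excel-style CSV quoting: wrap a field in quotes if it contains a comma, quote or newline,
// doubling any embedded quotes. Appending a row never requires rewriting the file.
public class CsvAppend {
    static String quote(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) throws IOException {
        List<String> row = List.of("widget, large", "say \"hi\"", "42");
        String line = row.stream().map(CsvAppend::quote).collect(Collectors.joining(","));
        Files.writeString(Path.of("data.csv"), line + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```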
JSON is probably your best bet (it's lightish, fast to parse, and self-descriptive, so you can add your new columns as time goes by). You've said parsable - do you mean using Java? There are JSON libraries for Java that take the pain out of most of the work. There are also various lightweight in-memory databases that can persist to a file (in case "not an option" means you don't want a big separate database).
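A quick sketch of the JSON route with one such library (Gson here is just an example dependency; Jackson or others work much the same way). Each row is a map, so a "column" added later is simply a new key:
```java
import java.lang.reflect.Type;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

// Round-trips rows through JSON using Gson (assumed to be on the classpath).
public class JsonRows {
    public static void main(String[] args) {
        Map<String, Object> row1 = new LinkedHashMap<>();
        row1.put("name", "a");
        row1.put("score", 10);

        Map<String, Object> row2 = new LinkedHashMap<>();
        row2.put("name", "b");
        row2.put("score", 20);
        row2.put("newColumn", "added later"); // new column, old rows unaffected

        Gson gson = new Gson();
        String json = gson.toJson(List.of(row1, row2));

        Type rowsType = new TypeToken<List<Map<String, Object>>>() {}.getType();
        List<Map<String, Object>> parsed = gson.fromJson(json, rowsType);
        System.out.println(parsed.size() + " rows: " + json);
    }
}
```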
If this is just for logging some data quickly to a file, I find tab-delimited files easier to parse than CSV, so if it's a flat text file you're looking for, I'd go with that (as long as you don't have tabs in the feed, of course). If you have fixed-size columns, you could use fixed-length fields. That is even quicker, because you can seek.
If it's unstructured data that might need some analysis, I'd go for JSON.
If it's structured data and you envision ever doing any querying on it... I'd go with sqlite.
When I needed a solution like this, I wrote up a simple representation of the data where each value is prefixed with its length. For example, "Hi" would be represented (in hex) as 02 48 69.
To form rows, just nest this operation (the first number is the number of fields, followed by the fields). For example, if field 0 contains "Hi" and field 1 contains "abc", it would be:
Num of fields Field Length Data Field Length Data
02 02 48 69 03 61 62 63
You can also use the first row for the names of the columns.
(I have to say this is kind of a DB backend).
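A short sketch of writing one row in this layout (assuming single-byte lengths as in the hex example above, i.e. fewer than 256 fields, each shorter than 256 bytes):
```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Writes one row as: [field count][len][bytes][len][bytes]...
public class LengthPrefixed {
    static byte[] encodeRow(String... fields) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeByte(fields.length);
        for (String field : fields) {
            byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
            out.writeByte(bytes.length);
            out.write(bytes);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] row = encodeRow("Hi", "abc");
        StringBuilder hex = new StringBuilder();
        for (byte b : row) hex.append(String.format("%02X ", b));
        System.out.println(hex.toString().trim()); // 02 02 48 69 03 61 62 63
    }
}
```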
You can use CSV, and if you only add columns to the end, this is simple to handle; i.e. if a row has fewer columns than you expect, use the default value for the "missing" fields.
If you want to be able to change the order/use of the fields, you can add a heading row, i.e. the first row holds the names of the columns. This can be useful when you are trying to read the data.
If you are forced to use a flat file, why not develop your own format? You should be able to tweak the overhead and customize it as much as you want (which is good if you are parsing lots of data).
Data entries will be either of fixed or variable length; there are advantages to forcing some entries to a fixed length, but you will need to create a method for delimiting both. If you have different "types" of rows, write all the rows of each type in a chunk. Each chunk of rows will have a header: use one header to describe the type of the chunk, and another header to describe the columns and their sizes. Decide how you will use the headers to describe each chunk.
e.g. (H is a header, C is the column descriptions and D is a data entry):
H Phone Numbers
C num(10) type
D 1234567890 Home
D 2223334444 Cell
H Addresses
C house(5) street postal(6) province
D 1234_ "some street" N1G5K6 Ontario
I'd say that if you want to store rows and columns, you've got to use a DB. The reason is simple: modifying the structure with any approach other than an RDBMS will require significant effort, and you mentioned that you want to change the structure in the future.
