Custom InputFormat or InputReader for Excel files (xls) - Java

I need to read an Excel (xls) file stored on a Hadoop cluster. I did some research and found out that I need to create a custom InputFormat for that. I read many articles, but none of them was helpful from a programming point of view. Could someone help me with sample code for writing a custom InputFormat, so that I can understand the basics of programming an InputFormat and can use the Apache POI library to read the Excel file?
I have already written a MapReduce program for reading a text file. Now I need help with the following: even if I somehow manage to code my own custom InputFormat, where would that code go in relation to the MapReduce program I have already written?
PS: converting the .xls file into a .csv file is not an option.

Yes, you should create a RecordReader to read each record from your Excel document. Inside that RecordReader you should use a POI-like API to read from Excel docs. More precisely, please follow these steps (a sketch follows after the link below):
Extend FileInputFormat to create your own CustomInputFormat and override getRecordReader (createRecordReader in the new API).
Create a CustomRecordReader by extending RecordReader; here you have to define how to generate a key/value pair from a given FileSplit.
So first read bytes from the FileSplit, and from those buffered bytes read out the desired key and value using POI.
You can check my own CustomInputFormat and RecordReader for dealing with custom data objects here:
myCustomInputFormat
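For reference, here is a minimal sketch of these steps, assuming the new mapreduce API (where the method to override is createRecordReader rather than getRecordReader) and Apache POI's HSSF classes for .xls. The class names are illustrative, and the reader simply loads all rows into memory, which assumes reasonably small workbooks:

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;

// Hypothetical input format: emits one key/value pair per spreadsheet row.
public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // a workbook cannot be parsed from a partial byte range
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new ExcelRecordReader();
    }
}

class ExcelRecordReader extends RecordReader<LongWritable, Text> {

    private final List<String> rows = new ArrayList<>();
    private int current = -1;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        Path path = ((FileSplit) split).getPath();
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        try (InputStream in = fs.open(path)) {
            Sheet sheet = new HSSFWorkbook(in).getSheetAt(0); // first sheet only
            for (Row row : sheet) {
                StringBuilder sb = new StringBuilder();
                for (Cell cell : row) {
                    if (sb.length() > 0) sb.append('\t');
                    sb.append(cell.toString());
                }
                rows.add(sb.toString());
            }
        }
    }

    @Override
    public boolean nextKeyValue() { return ++current < rows.size(); }

    @Override
    public LongWritable getCurrentKey() { return new LongWritable(current); }

    @Override
    public Text getCurrentValue() { return new Text(rows.get(current)); }

    @Override
    public float getProgress() {
        return rows.isEmpty() ? 1f : (float) (current + 1) / rows.size();
    }

    @Override
    public void close() { }
}

Because the whole workbook has to be read at once, isSplitable() returns false so each file arrives at the RecordReader in one piece.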

Your research is correct: you need a custom InputFormat for Hadoop. If you are lucky, somebody has already created one for your use case.
If not, I would suggest looking for a Java library that is able to read Excel files.
Since Excel is a proprietary file format, it is unlikely that you will find an implementation that works perfectly.
Once you have found a library that is able to read Excel files, integrate it with the InputFormat.
To do that, you have to extend Hadoop's FileInputFormat. The RecordReader returned by your ExcelInputFormat must return the rows from your Excel file. You probably also have to override isSplitable() (or getSplits()) to tell the framework not to split the file at all.
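Regarding the asker's follow-up about where the new code goes relative to the already-written job: only the driver changes. A minimal sketch, assuming the new mapreduce API and the hypothetical ExcelInputFormat sketched above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of the driver: the only change compared to an ordinary
// text-file job is the setInputFormatClass line. The identity
// Mapper/Reducer below stand in for the classes from the job you
// have already written.
public class ExcelJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "excel-job");
        job.setJarByClass(ExcelJobDriver.class);
        job.setInputFormatClass(ExcelInputFormat.class); // was TextInputFormat
        job.setMapperClass(Mapper.class);   // replace with your existing mapper
        job.setReducerClass(Reducer.class); // replace with your existing reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}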

Related

Spring Batch read different csv files and xml output

I have a requirement where I need to read 4 different CSV files. These have to be read line by line. The files have different numbers of columns and values.
After processing, I have to generate the output in XML.
Could someone please shed some light on how to achieve this?
Thanks
Spring Batch has a reader interface. You can write your own reader class that holds 4 individual FlatFileItemReaders and reads from them until all are done; a sketch follows below the links.
The writer is also an interface that you can implement yourself: you override the write method and do whatever you need to do.
http://docs.spring.io/spring-batch/trunk/apidocs/org/springframework/batch/item/ItemReader.html
http://docs.spring.io/spring-batch/trunk/apidocs/org/springframework/batch/item/ItemWriter.html
http://docs.spring.io/spring-batch/reference/html/readersAndWriters.html
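As a sketch of such a composite reader (CompositeSequentialReader is a hypothetical name; the FlatFileItemReader delegates must still be opened separately, e.g. by registering them as streams on the step):

import java.util.Iterator;
import java.util.List;

import org.springframework.batch.item.ItemReader;

// Hypothetical composite reader: drains each delegate in turn, moving to
// the next file when a delegate returns null (its end-of-input signal).
public class CompositeSequentialReader<T> implements ItemReader<T> {

    private final Iterator<ItemReader<T>> delegates;
    private ItemReader<T> current;

    public CompositeSequentialReader(List<ItemReader<T>> readers) {
        this.delegates = readers.iterator();
        this.current = delegates.hasNext() ? delegates.next() : null;
    }

    @Override
    public T read() throws Exception {
        while (current != null) {
            T item = current.read();
            if (item != null) {
                return item; // still records left in the current file
            }
            // current file exhausted, move on to the next reader
            current = delegates.hasNext() ? delegates.next() : null;
        }
        return null; // all files done; this ends the step
    }
}

For the XML output side, you may also want to look at Spring Batch's StaxEventItemWriter.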

Best way to update an existing FDF (Form Data Format) file

I need to update an existing FDF file programmatically, from the server side. For this I'm looking for a Java library with which we can manipulate an existing FDF file. I've tried libraries from iText and Adobe so far. It seems that iText's FDFWriter will only let you create a new FDF file and will not help you update an existing one.
With Adobe's FDFDoc class I somehow managed to update an FDF file, but this API seems to be very old and looks ugly (method and field names are not very elegant and do not follow the camelCase convention). My question is whether there is a known better library.
P.S.: FDF is a data format for collecting input data from editable PDF forms.
FDF is a simple structured text file, which means that you should be able to do simple text manipulation to modify it. You may even be able to create an XSLT (even though FDF is not XML).
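As an illustration of that text-manipulation approach, here is a minimal sketch that rewrites the /V value of one named field. It assumes a classic text-encoded FDF where /T precedes /V inside the field dictionary and values are plain parenthesized strings without escaped parentheses; real-world files may need more care:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FdfFieldUpdater {

    // Rewrites "/T (field) ... /V (oldValue)" to carry the new value.
    public static void setFieldValue(Path fdf, String field, String newValue)
            throws IOException {
        String content = new String(Files.readAllBytes(fdf), StandardCharsets.ISO_8859_1);
        Pattern p = Pattern.compile(
                "(/T\\s*\\(" + Pattern.quote(field) + "\\)\\s*/V\\s*\\()[^)]*(\\))");
        String updated = p.matcher(content)
                .replaceFirst("$1" + Matcher.quoteReplacement(newValue) + "$2");
        Files.write(fdf, updated.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) throws IOException {
        setFieldValue(Paths.get(args[0]), args[1], args[2]);
    }
}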
Adobe has the FDFToolkit, which is quite old, but there have been no changes to the FDF format for a long time. Although the FDFToolkit reads and writes FDF, it brings advantages only for reading; writing is so basic that you don't really need a specific library…

Using multiple InputFormat classes while configuring MapReduce job

I want to write a MapReduce application that can process both text and ZIP files. For this I want to use two different input formats, one for text and another for ZIP. Is it possible to do so?
Extending a bit on @ChrisWhite's answer, what you need is a custom InputFormat and RecordReader that work with ZIP files. You can find here a sample ZipFileInputFormat and here a sample ZipFileRecordReader.
Given this, as Chris suggested, you should use MultipleInputs. Here is how I would do it if you don't need custom mappers for each type of file:
MultipleInputs.addInputPath(job, new Path("/path/to/zip"), ZipFileInputFormat.class);
MultipleInputs.addInputPath(job, new Path("/path/to/txt"), TextInputFormat.class);
Look at the API docs for MultipleInputs (old API, new API). They are not hugely self-explanatory, but you should be able to see that you call the addInputPath methods on your job configuration to register an input path (which can be a glob) together with its input format and an associated mapper.
You should be able to Google for some examples; in fact, here's an SO question/answer that shows some usage.
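If you do need a different mapper per input type, the four-argument overload of addInputPath binds one to each format. A sketch under the new API, where ZipFileInputFormat, ZipMapper and TxtMapper are hypothetical stand-ins for your own classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch: ZipFileInputFormat, ZipMapper and TxtMapper are placeholders
// for your own classes; everything else is standard job wiring.
public class MixedInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mixed-input");
        job.setJarByClass(MixedInputDriver.class);
        MultipleInputs.addInputPath(job, new Path("/path/to/zip"),
                ZipFileInputFormat.class, ZipMapper.class);
        MultipleInputs.addInputPath(job, new Path("/path/to/txt"),
                TextInputFormat.class, TxtMapper.class);
        // ... set reducer, output key/value classes and output path as usual
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}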
Consider writing a custom InputFormat where you can check what kind of input is being read and then, based on that check, delegate to the required InputFormat.

Spring Batch: Reading database and writing into multi record Flat files

Hi, I am doing a POC/design baseline for reading from a database and writing into flat files. I am struggling with a couple of issues here, but first I will describe the output format of the flat file.
Please let me know how to design the reader, where I need to read the transactions from different tables, process the records and figure out the summary fields, and also how to design the ItemWriter, which has such a complex layout. Please advise. I am able to read from a single table and write to a file successfully, but the task above looks complex.
Extend the FlatFileItemWriter so that it opens the file only once and appends to it instead of overwriting it. Then pass that same writer to the multiple readers in the order you would like their output to appear. (Make sure that each object read by the readers extends something the writer understands! Maybe an interface named BatchWriteable would be a good name.)
Some back-of-the-envelope pseudocode:
Before everything starts:
  open the file
  write the file headers
Batch step (repeated as many times as necessary):
  read a batch section
  process the batch section
  write the batch section
When done:
  write the file footer
  close the file
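Much of that pseudocode maps onto features FlatFileItemWriter already has. A minimal sketch, using String items as a stand-in for the processed record sections:

import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.core.io.FileSystemResource;

public class ReportWriterFactory {

    // One writer instance shared by every section of the job.
    public static FlatFileItemWriter<String> reportWriter() {
        FlatFileItemWriter<String> writer = new FlatFileItemWriter<>();
        writer.setResource(new FileSystemResource("out/report.txt"));
        writer.setAppendAllowed(true); // append sections instead of overwriting
        writer.setLineAggregator(new PassThroughLineAggregator<>());
        writer.setHeaderCallback(w -> w.write("FILE HEADER")); // written on file creation
        writer.setFooterCallback(w -> w.write("FILE FOOTER")); // written when the writer closes
        return writer;
    }
}

Note that the header callback only fires when the file is newly created, which fits the open-once-then-append design above.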

Parse through text files and write into CSV

I have about 100 different text files in the same format. Each file has observations about certain events at certain time periods for a particular person. I am trying to parse out the relevant information for a few individuals from each of the text files. Ideally, I want to parse all this information and create a CSV file (eventually to be imported into Excel) with all the information I need from ALL the text files. Any help/ideas would be greatly appreciated. I would prefer to use Java, but any simpler methods are appreciated.
The log files are structured as below (data changed to preserve private information):
|||ID||NAME||TIME||FIRSTMEASURE^DATA||SECONDMEASURE^DATA|| etc...
TIME appears as 20110825111501, meaning 2011-08-25 11:15:01 AM.
Here are the steps in Java:
Open the file using FileReader.
You can also wrap the FileReader in a BufferedReader and use its readLine() method to read the file line by line.
Parse each line. You know the data definition of each line best; various String functions or Java regex can help you here.
Do the same for the date: check whether you can utilize DateFormat.
Once you have parsed the data, you can start building your CSV file, either with a CSV library such as the Apache CSVParser mentioned below or by writing it yourself using FileOutputStream.
When you are ready to convert to Excel, you can use Apache POI for Excel.
Let me know if you need further clarification.
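Putting those steps together, here is a small sketch for the |||ID||NAME||TIME||MEASURE^DATA|| layout shown in the question (class and column names are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Usage: java LogToCsv input.log output.csv
public class LogToCsv {

    private static final SimpleDateFormat IN = new SimpleDateFormat("yyyyMMddHHmmss");
    private static final SimpleDateFormat OUT = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public static void main(String[] args) throws IOException, ParseException {
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]));
             PrintWriter csv = new PrintWriter(args[1])) {
            csv.println("id,name,time,measure,data");
            String line;
            while ((line = reader.readLine()) != null) {
                // splitting on runs of pipes leaves an empty first token
                String[] fields = line.split("\\|\\|+");
                if (fields.length < 4) continue; // skip malformed lines
                String id = fields[1], name = fields[2];
                Date time = IN.parse(fields[3]);
                // one CSV row per MEASURE^DATA pair
                for (int i = 4; i < fields.length; i++) {
                    String[] md = fields[i].split("\\^", 2);
                    csv.printf("%s,%s,%s,%s,%s%n", id, name,
                            OUT.format(time), md[0],
                            md.length > 1 ? md[1] : "");
                }
            }
        }
    }
}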
Just parse through the text file and use CSVParser from Apache to write to a CSV file. Additionally, if you want to write to Excel, you can simply use Apache POI or JXL for that.
You can use SuperCSV to parse the file into a bean and also to create a CSV file.
