I need to parse an EBCDIC input file format. Using Java, I am able to read it like below:
InputStreamReader rdr = new InputStreamReader(new FileInputStream("/Users/rr/Documents/workspace/EBCDIC_TO_ASCII/ebcdic.txt"), java.nio.charset.Charset.forName("ibm500"));
But in Hadoop MapReduce I need to parse it via a RecordReader, which has not worked so far.
Can anyone provide a solution to this problem?
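For reference, a minimal sketch of the kind of RecordReader being asked about here, assuming the EBCDIC file is plain line-oriented text and that it would be returned from a FileInputFormat subclass whose isSplitable() returns false; the class name and details are illustrative, not a tested solution:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical reader: decodes each line of the split's file from EBCDIC (ibm500) into Unicode text.
public class EbcdicLineRecordReader extends RecordReader<LongWritable, Text> {

    private BufferedReader reader;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private long lineNumber;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;   // assumes the corresponding InputFormat is not splittable
        Path path = fileSplit.getPath();
        // Same idea as the plain-Java snippet above: decode the raw bytes with the ibm500 charset.
        reader = new BufferedReader(new InputStreamReader(
                path.getFileSystem(context.getConfiguration()).open(path),
                Charset.forName("ibm500")));
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        String line = reader.readLine();
        if (line == null) {
            return false;
        }
        key.set(lineNumber++);
        value.set(line);   // the value handed to the mapper is now ordinary text
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() { return 0; }   // progress reporting omitted in this sketch

    @Override
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        }
    }
}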
You could try parsing it through Spark instead, using Cobrix, which is an open-source COBOL data source for Spark.
The best thing you can do is to convert the data to ASCII first and then load it into HDFS.
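A minimal sketch of that conversion step in plain Java, assuming the file is simple text with no packed-decimal or binary fields (file names are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EbcdicToAscii {
    public static void main(String[] args) throws IOException {
        // Decode with the EBCDIC code page, re-encode as ASCII, then put the result into HDFS.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new FileInputStream("ebcdic.txt"), Charset.forName("ibm500")));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream("ascii.txt"), StandardCharsets.US_ASCII))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
            }
        }
    }
}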
Why is the file in EBCDIC? Does it need to be?
If it is just text data, why not convert it to ASCII when you send / pull the file from the mainframe / AS400?
If the file contains binary or COBOL numeric fields, then you have several options:
Convert the file to normal text on the mainframe (the mainframe sort utility is good at this), then send the file and convert it to ASCII.
If it is a COBOL file, there are some open source projects you could look at: https://github.com/tmalaska/CopybookInputFormat or https://github.com/ianbuss/CopybookHadoop
There are also commercial packages for loading mainframe COBOL data into Hadoop.
Related
I want to extract the data present inside a PDF file and present it in the format of a CSV/Excel sheet. I got to know that this can be done using the Tika library in Java. But I only found solutions for extracting the data as simple text; I want to know how to store it in an Excel sheet.
If someone has done this kind of work before, please help me.
The first part (and the hard one) is to parse the original data and interpret it as a table. Apache Tika will give you an XHTML representation (or call your own handler with SAX events), but it usually won't construct a table for you from a PDF file, since PDF isn't a tabular format by itself.
So you'll have to take the Tika-produced paragraphs, split them, and pass the resulting cells to some csv/xls/xlsx writer.
It might work if you have a regular table in your PDF (one line per table row, clean logical cell separation, etc.). But it will look like parsing plain text, of course.
In case that doesn't work, you'll have to take a PDF parser (like Apache PDFBox) and try to interpret its output.
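As an illustration of that first part, a rough sketch using Tika's plain-text output and a naive whitespace-based cell split (the file name and the split heuristic are assumptions):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class PdfTextDump {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream("table.pdf")) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        // From here on it really is plain-text parsing: split into lines and guess the cells.
        for (String line : handler.toString().split("\n")) {
            String[] cells = line.trim().split("\\s{2,}"); // naive: two or more spaces as a separator
            System.out.println(String.join(";", cells));
        }
    }
}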
The second part (output) is simple. If csv/ssv/tsv is suitable for you -- use your preferred library to produce it (I can recommend Apache commons-csv).
But take into account that MS Excel requires a BOM for UTF-8 and UTF-16 CSV files to understand that the file isn't in a one-byte encoding (like CP-1252, etc.).
If you want Excel xls or xlsx format -- just use Apache POI to write it.
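A small sketch of the CSV output side with commons-csv, including the UTF-8 BOM mentioned above (column names are made up):

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

public class CsvForExcel {
    public static void main(String[] args) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                 Files.newOutputStream(Paths.get("out.csv")), StandardCharsets.UTF_8)) {
            writer.write('\uFEFF'); // BOM so Excel recognises the file as UTF-8
            try (CSVPrinter printer = new CSVPrinter(writer, CSVFormat.DEFAULT)) {
                printer.printRecord("name", "amount");   // header row
                printer.printRecord("Müller", "1,5");    // data rows go here
            }
        }
    }
}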
I have a text file and I need to convert all of its data to XML format to make it more readable.
Text file
How can I convert it to XML format?
Is there any Java library or any other way I can do it?
Your question is rather vague (and you could probably find the answer yourself with just a little research), but I'll give you a hint.
Your sample appears to be an INI file (as traditionally used for configuration files on Windows & DOS). So, look for an "INI file parser." If you can't find one, you should be able to write a simple parser yourself using regular expressions. It's a simple file format, consisting of section headings like [SectionTitle] and data fields like Key=Value. That's all.
As for generating XML ... it shouldn't be hard, but "xml format" is not a useful description. Can you be more specific? E.g., what will the XML be used for?
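Purely as an illustration of that hint, a bare-bones regex parser that turns [SectionTitle] / Key=Value lines into nested XML elements; the element layout is just one possible choice, and there is no XML escaping here:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IniToXml {
    private static final Pattern SECTION = Pattern.compile("^\\[(.+)]\\s*$");
    private static final Pattern KEY_VALUE = Pattern.compile("^([^=;#]+)=(.*)$");

    public static void main(String[] args) throws IOException {
        StringBuilder xml = new StringBuilder("<config>\n");
        String openSection = null;
        for (String line : Files.readAllLines(Paths.get("settings.ini"))) {
            Matcher s = SECTION.matcher(line);
            Matcher kv = KEY_VALUE.matcher(line);
            if (s.matches()) {
                if (openSection != null) {
                    xml.append("  </").append(openSection).append(">\n");
                }
                openSection = s.group(1).trim();
                xml.append("  <").append(openSection).append(">\n");
            } else if (kv.matches()) {
                // note: no XML escaping or attribute handling in this sketch
                xml.append("    <").append(kv.group(1).trim()).append(">")
                   .append(kv.group(2).trim())
                   .append("</").append(kv.group(1).trim()).append(">\n");
            }
        }
        if (openSection != null) {
            xml.append("  </").append(openSection).append(">\n");
        }
        xml.append("</config>");
        System.out.println(xml);
    }
}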
Try this: http://www.smooks.org/mediawiki/index.php?title=Main_Page. I've used it and it's great.
A more sophisticated solution would be to use Mule Data Mapper. On the server side, obviously.
I need to read an Excel (.xls) file stored on a Hadoop cluster. I did some research and found out that I need to create a custom InputFormat for that. I read many articles, but none of them was helpful from a programming point of view. Could someone help me with sample code for writing a custom InputFormat, so that I can understand the basics of programming an InputFormat and use the Apache POI library to read the Excel file?
I have already written a MapReduce program for reading a text file. Even if I somehow manage to code my own custom InputFormat, where would that code go in relation to the MapReduce program I have already written?
PS: converting the .xls file into a .csv file is not an option.
Yes, you should create a RecordReader to read each record from your Excel document. Inside that record reader you should use a POI-like API to read from the Excel document. More precisely, follow these steps (a sketch follows below):
Extend FileInputFormat to create your own CustomInputFormat and override getRecordReader.
Create a CustomRecordReader by extending RecordReader; here you have to write how to generate a key/value pair from a given FileSplit.
So first read the bytes from the FileSplit, and from those buffered bytes read out the desired key and value using POI.
You can check my own CustomInputFormat and RecordReader for dealing with custom data objects here:
myCustomInputFormat
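To make the steps above concrete, here is a condensed, untested sketch of such an InputFormat and RecordReader using Apache POI; it simply loads the whole workbook in initialize() and emits one key/value pair per spreadsheet row (class names are illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

// One key/value pair per spreadsheet row: key = row index, value = tab-separated cell text.
public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // an .xls file cannot be cut at arbitrary byte offsets
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new ExcelRecordReader();
    }

    public static class ExcelRecordReader extends RecordReader<LongWritable, Text> {
        private final List<String> rows = new ArrayList<>();
        private int current = -1;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            FileSplit fileSplit = (FileSplit) split;
            Path path = fileSplit.getPath();
            try (InputStream in = path.getFileSystem(context.getConfiguration()).open(path)) {
                Workbook workbook = new HSSFWorkbook(in); // .xls; use XSSFWorkbook for .xlsx
                Sheet sheet = workbook.getSheetAt(0);
                for (Row row : sheet) {
                    StringBuilder sb = new StringBuilder();
                    for (Cell cell : row) {
                        if (sb.length() > 0) sb.append('\t');
                        sb.append(cell.toString());
                    }
                    rows.add(sb.toString());
                }
            }
        }

        @Override
        public boolean nextKeyValue() {
            if (++current >= rows.size()) return false;
            key.set(current);
            value.set(rows.get(current));
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() {
            return rows.isEmpty() ? 0f : Math.min(1f, (float) (current + 1) / rows.size());
        }
        @Override public void close() { }
    }
}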
Your research is correct. You need a custom InputFormat for Hadoop. If you are lucky, somebody has already created one for your use case.
If not, I would suggest looking for a Java library that is able to read Excel files.
Since Excel is a proprietary file format, it is unlikely that you will find an implementation that works perfectly.
Once you have found a library that is able to read Excel files, integrate it with the InputFormat.
To do that, you have to extend Hadoop's FileInputFormat. The RecordReader returned by your ExcelInputFormat's getRecordReader() must return the rows from your Excel file. You probably also have to override isSplitable() or getSplits() to tell the framework not to split the file at all.
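As for where this plugs into the existing MapReduce program: the mapper and reducer stay as they are, and only the job driver changes to reference the custom format, for example (using the hypothetical ExcelInputFormat sketched above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExcelJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-excel");
        job.setJarByClass(ExcelJobDriver.class);
        job.setInputFormatClass(ExcelInputFormat.class);  // the only change to the existing driver
        // job.setMapperClass(...); job.setReducerClass(...);  // the existing mapper/reducer stay as they are
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}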
I have about 100 different text files in the same format. Each file has observations about certain events at certain time periods for a particular person. I am trying to parse out the relevant information for a few individuals from each of the different text files. Ideally, I want to parse through all this information and create a CSV file (eventually to be imported into Excel) with all the information I need from ALL the text files. Any help/ideas would be greatly appreciated. I would prefer to use Java, but any simpler methods are appreciated.
The log files are structured as below (data changed to preserve private information):
|||ID||NAME||TIME||FIRSTMEASURE^DATA||SECONDMEASURE^DATA|| etc...
TIME appears like 20110825111501 for 2011, 08/25 11:15:01 AM
Here are the steps in Java:
Open the file using FileReader
You could also wrap the FileReader in a BufferedReader and use the readLine() method to read the file line by line
Parse each line. You know the data definition of each line best; various String functions or Java regex can help
You could do the same thing for the date. Check whether you can use DateFormat
Once you have parsed the data, you can start building your CSV file using a CSV library such as Apache's CSVParser, or write it yourself using FileOutputStream
When you are ready to convert to Excel, you could use Apache POI
Let me know if you need further clarification.
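A rough sketch of those steps against the format shown above, assuming "||" separates the fields and the timestamp is yyyyMMddHHmmss (field positions and file names are guesses):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogToCsv {
    public static void main(String[] args) throws IOException, ParseException {
        SimpleDateFormat inFormat = new SimpleDateFormat("yyyyMMddHHmmss");
        SimpleDateFormat outFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        try (BufferedReader reader = new BufferedReader(new FileReader("observations.log"));
             PrintWriter csv = new PrintWriter(new FileWriter("observations.csv"))) {
            csv.println("id,name,time,measure,data");
            String line;
            while ((line = reader.readLine()) != null) {
                // strip the leading pipes, then split on the "||" separators
                String[] fields = line.replaceFirst("^\\|+", "").split("\\|\\|");
                if (fields.length < 4) continue;            // skip malformed lines
                String id = fields[0], name = fields[1];
                Date time = inFormat.parse(fields[2]);      // e.g. 20110825111501
                for (int i = 3; i < fields.length; i++) {   // one CSV row per MEASURE^DATA pair
                    String[] measure = fields[i].split("\\^", 2);
                    csv.println(String.join(",", id, name, outFormat.format(time),
                            measure[0], measure.length > 1 ? measure[1] : ""));
                }
            }
        }
    }
}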
Just parse through the text file and use CSVParser from Apache to write to a CSV file. Additionally, if you want to write to Excel, you can simply use Apache POI or JXL for that.
You can use SuperCSV to parse the file into a bean and also to create a CSV file.
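A minimal SuperCSV sketch of the writing side, assuming a simple bean whose getters match the header names (bean and field names are made up):

import java.io.FileWriter;
import java.io.IOException;

import org.supercsv.io.CsvBeanWriter;
import org.supercsv.io.ICsvBeanWriter;
import org.supercsv.prefs.CsvPreference;

public class SuperCsvExample {

    // Bean populated while parsing the log files; SuperCSV maps it via its getters.
    public static class Observation {
        private final String id;
        private final String name;
        public Observation(String id, String name) { this.id = id; this.name = name; }
        public String getId() { return id; }
        public String getName() { return name; }
    }

    public static void main(String[] args) throws IOException {
        String[] header = {"id", "name"};
        ICsvBeanWriter writer = new CsvBeanWriter(new FileWriter("out.csv"),
                CsvPreference.STANDARD_PREFERENCE);
        try {
            writer.writeHeader(header);
            writer.write(new Observation("42", "John Doe"), header);
        } finally {
            writer.close();
        }
    }
}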
I want to be able to create a new Tika parser to extract metadata from a file. We're already using Tika, so the metadata extraction will be done consistently.
I think that I've run into this problem/enhancement request for Tika:
Allow passing of files or memory buffers to parsers
I have a console C++ executable that accepts the path to a file as input and then outputs the metadata that it finds, one name/value pair per line.
The C++ code relies on libraries that expect a file path when accessing the data.
It's not going to be possible to rewrite this executable in Java.
I thought that it would be fairly easy to plug this into Tika. But the Tika parser needs to be in Java and the Tika parser method that needs to be overridden takes an open input stream:
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
So I guess my only solution is to take the input stream, write it to a temporary file, process the file that gets written, and then finally clean up the file. I hate messing with a temporary file and potentially having to worry about cleaning up temp files should something go wrong and the file doesn't get deleted.
Does anyone have a clever idea about how to cleanly deal with something like this?
There's TikaInputStream which should help. It handles wrapping a File or an InputStream, and converting between them internally as parsers require. It does all the temp file bits as needed for you.
Several Java parsers already make use of it because they need a File rather than an InputStream. What's more, users who have a file can pass it to the Parser wrapped as an InputStream, and the parser can read it as either a File or an InputStream as its needs suit.
So I'd suggest you just turn the InputStream into a TikaInputStream (which is just a cast if it's already one), then get the file and pass that path to your C++ executable.
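For illustration, a hedged sketch of what that could look like inside the parse method, shelling out to the native tool with the path TikaInputStream provides; the executable name and its "name: value" output format are assumptions:

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class NativeMetadataParser extends AbstractParser {

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return Collections.singleton(MediaType.OCTET_STREAM); // placeholder type for this sketch
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        // TikaInputStream spools the stream to a temp file only if it has to,
        // and cleans that file up when the stream is closed by the caller.
        TikaInputStream tis = TikaInputStream.get(stream);
        File file = tis.getFile(); // a real path we can hand to the native executable

        Process process = new ProcessBuilder("extract_metadata", file.getAbsolutePath()) // hypothetical exe
                .redirectErrorStream(true)
                .start();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = out.readLine()) != null) {
                String[] pair = line.split(":", 2); // assumed "name: value" output lines
                if (pair.length == 2) {
                    metadata.add(pair[0].trim(), pair[1].trim());
                }
            }
        }
        // A full parser would normally also emit XHTML events via the handler; omitted here.
    }
}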
If I understand correctly, and assuming you're launching the C++ program using Runtime.exec, you could parse the Process's standard output stream as the InputStream that Tika wants. Would that work?