I have about 100 different text files in the same format. Each file has observations about certain events at certain time periods for a particular person. I am trying to parse out the relevant information for a few individuals from each of the different text files. Ideally, I want to parse through all this information and create a CSV file (eventually to be imported into Excel) with all the information I need from ALL the text files. Any help/ideas would be greatly appreciated. I would prefer to use Java, but any simpler methods are welcome.
The log files are structured as below (data changed to preserve private information):
|||ID||NAME||TIME||FIRSTMEASURE^DATA||SECONDMEASURE^DATA|| etc...
TIME appears like 20110825111501, which stands for 2011-08-25 11:15:01 AM.
Here are the steps in Java:
Open the file using FileReader
You could also wrap the FileReader in a BufferedReader and use its readLine() method to read the file line by line
You need to parse each line. You know best the data definition of each line; to help, you can use the various String functions or Java Regex
You could do the same thing for the date. Check if you can utilize DateFormat; a SimpleDateFormat with the pattern yyyyMMddHHmmss matches the TIME field above
Once you parse the data, you could start building your CSV file using a CSV library (such as the CSVParser mentioned in another answer) or write it yourself using FileOutputStream
When you are ready to convert to Excel, you could use Apache POI for Excel
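A minimal sketch tying these steps together follows. It is only an illustration: the logs directory, the output file name, the CSV columns, and the split on runs of '|' are assumptions based on the sample line above.

    import java.io.BufferedReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class LogsToCsv {
        // TIME fields look like 20110825111501, i.e. yyyyMMddHHmmss
        static final SimpleDateFormat IN = new SimpleDateFormat("yyyyMMddHHmmss");
        static final SimpleDateFormat OUT = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

        public static void main(String[] args) throws IOException, ParseException {
            try (PrintWriter csv = new PrintWriter(new FileWriter("output.csv"));
                 DirectoryStream<Path> files =
                         Files.newDirectoryStream(Paths.get("logs"), "*.txt")) {
                csv.println("id,name,time,measure,value"); // header row
                for (Path file : files) {
                    try (BufferedReader in = Files.newBufferedReader(file)) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            // fields are separated by "||"; the leading "|||" yields one empty token
                            String[] f = line.split("\\|\\|+");
                            if (f.length < 4) continue; // skip malformed lines
                            String id = f[1], name = f[2];
                            // filter here if you only need a few individuals
                            Date time = IN.parse(f[3]);
                            for (int i = 4; i < f.length; i++) { // MEASURE^DATA pairs
                                String[] md = f[i].split("\\^", 2);
                                csv.printf("%s,%s,%s,%s,%s%n", id, name, OUT.format(time),
                                        md[0], md.length > 1 ? md[1] : "");
                            }
                        }
                    }
                }
            }
        }
    }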
Let me know if you need further clarification
Just parse through the text file and use CSVParser from Apache to write to a CSV file. Additionally, if you want to write to Excel, you can simply use Apache POI or JXL for that.
You can use SuperCSV to parse the file into a bean and also to create a CSV file.
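A short sketch of the writing side with SuperCSV's CsvBeanWriter (SuperCSV 2.x); the Observation bean and its properties are hypothetical, and SuperCSV maps the columns through the bean's getters and setters:

    import java.io.FileWriter;
    import java.io.IOException;
    import org.supercsv.io.CsvBeanWriter;
    import org.supercsv.io.ICsvBeanWriter;
    import org.supercsv.prefs.CsvPreference;

    public class SuperCsvExample {
        // hypothetical bean holding one parsed observation
        public static class Observation {
            private String id;
            private String name;
            private String value;
            public String getId() { return id; }
            public void setId(String id) { this.id = id; }
            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
            public String getValue() { return value; }
            public void setValue(String value) { this.value = value; }
        }

        public static void main(String[] args) throws IOException {
            Observation obs = new Observation();
            obs.setId("42");
            obs.setName("JOHN");
            obs.setValue("120");

            // column order in the file; the names must match the bean properties
            String[] header = { "id", "name", "value" };
            try (ICsvBeanWriter writer = new CsvBeanWriter(
                    new FileWriter("out.csv"), CsvPreference.STANDARD_PREFERENCE)) {
                writer.writeHeader(header);
                writer.write(obs, header); // one call per bean/row
            }
        }
    }

CsvBeanReader works the same way in reverse: read(Observation.class, header) fills one bean per row.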
I want to extract the data present inside a PDF file and present it in the format of a CSV/Excel sheet. I got to know that this can be done using the Tika library in Java, but I only found the solution for extracting the data as simple text; I want to know how to store it in an Excel sheet.
If someone has done this type of work earlier, then please help me.
The first part (and the hard one) is to parse the original data and interpret it as a table. Apache Tika will give you an XHTML representation (or call your own handler with SAX events), but it usually won't construct a table for you, since PDF isn't a tabular format by itself.
So, you'll have to take the Tika-produced paragraphs, split them, and pass the resulting cells to some csv/xls/xlsx writer.
It might work if you have a reasonably regular table in your PDF (one line per table row, clean logical cell separation, etc.). But it will look like parsing plain text, of course.
In case that doesn't work, you'll have to take a PDF parser (like Apache PDFBox) and try to interpret its output.
The second part (output) is simple. If csv/ssv/tsv is suitable for you -- use your preferred library to produce it (I can recommend Apache commons-csv).
But take into account that MS Excel requires a BOM for UTF-8 and UTF-16 CSV to understand that the file isn't in a one-byte encoding (like CP-1252).
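A small sketch of that with commons-csv, writing the BOM by hand before handing the writer to the printer (the file name and rows are made up):

    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVPrinter;

    public class BomCsv {
        public static void main(String[] args) throws IOException {
            try (Writer out = new OutputStreamWriter(
                    Files.newOutputStream(Paths.get("table.csv")), StandardCharsets.UTF_8)) {
                out.write('\uFEFF'); // BOM, so Excel detects UTF-8 instead of CP-1252
                try (CSVPrinter printer = new CSVPrinter(out, CSVFormat.DEFAULT)) {
                    printer.printRecord("col1", "col2");
                    printer.printRecord("värde", "数值"); // non-ASCII now survives in Excel
                }
            }
        }
    }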
If you want Excel xls or xlsx format -- just use Apache POI to write it.
I need to parse an EBCDIC input file format. Using Java, I am able to read it like below:
    InputStreamReader rdr = new InputStreamReader(
            new FileInputStream("/Users/rr/Documents/workspace/EBCDIC_TO_ASCII/ebcdic.txt"),
            java.nio.charset.Charset.forName("ibm500"));
But in Hadoop MapReduce, I need to parse it via a RecordReader, which has not worked so far.
Can any one provide a solution to this problem?
You can try to parse it through Spark by using Cobrix, which is an open-source COBOL data source for Spark.
The best thing you can do is to convert data to ASCII first and then load to HDFS.
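A minimal sketch of that pre-conversion in plain Java, assuming the file really is plain text with EBCDIC line breaks; binary or packed-decimal (COMP-3) fields would be silently corrupted by a character-level conversion like this:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EbcdicToAscii {
        public static void main(String[] args) throws IOException {
            // decode EBCDIC (code page IBM500, as in the snippet above) and re-encode as ASCII
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream("ebcdic.txt"), Charset.forName("ibm500")));
                 BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream("ascii.txt"), StandardCharsets.US_ASCII))) {
                String line;
                while ((line = in.readLine()) != null) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }

The resulting ASCII file can then be put into HDFS and read with the standard TextInputFormat.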
Why is the file in EBCDIC? Does it need to be?
If it is just text data, why not convert it to ASCII when you send / pull the file from the mainframe / AS400?
If the file contains binary or Cobol numeric fields, then you have several options:
Convert the file to normal text on the mainframe (the mainframe sort utility is good at this), then send the file and convert it to ASCII.
If it is a Cobol file, there are some open source projects you could look at: https://github.com/tmalaska/CopybookInputFormat or https://github.com/ianbuss/CopybookHadoop
There are also commercial packages for loading mainframe Cobol data into Hadoop.
I want to color the first row of a CSV file, as the first row contains the header, and I want to show it separately for the user's convenience.
First of all, CSV has no way of specifying formatting options; it's a plain text file containing just the data. In order to add formatting you need to choose another format (be it XLS, XLSX, or just plain HTML).
In your code snippet you use a JSP; that can be a good solution if you decide to emit HTML, but then you should use iteration tags over the data and hardcode the HTML markup in the JSP instead of emitting the document entirely.
If you plan to provide a download, e.g. in Excel format, a servlet will probably be a better choice.
If you decide on an Excel-compatible format, you can use Apache POI to emit the document.
You can read here for a sample of how to use a servlet to emit a CSV.
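For illustration, a minimal servlet along those lines (the header names and rows are invented):

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CsvDownloadServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            resp.setContentType("text/csv");
            resp.setCharacterEncoding("UTF-8");
            // suggest a file name so the browser offers a download
            resp.setHeader("Content-Disposition", "attachment; filename=\"report.csv\"");
            PrintWriter out = resp.getWriter();
            out.println("name,amount"); // header row; no styling is possible in CSV
            out.println("Alice,100");
            out.println("Bob,200");
        }
    }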
CSV (comma-separated values) is indeed just that: values. There's no way to include formatting in a CSV file.
CSV is a plain text file. We cannot add formatting like color, font, etc.
If you want to add color, use xls or xlsx.
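For example, with Apache POI (assuming a recent version that has the FillPatternType enum) you can give the header row a bold font on a grey background; the sheet name and data are made up:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import org.apache.poi.ss.usermodel.*;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;

    public class ColoredHeader {
        public static void main(String[] args) throws IOException {
            try (Workbook wb = new XSSFWorkbook()) {
                Sheet sheet = wb.createSheet("data");

                // style for the header row: bold text on a grey fill
                Font bold = wb.createFont();
                bold.setBold(true);
                CellStyle headerStyle = wb.createCellStyle();
                headerStyle.setFont(bold);
                headerStyle.setFillForegroundColor(IndexedColors.GREY_25_PERCENT.getIndex());
                headerStyle.setFillPattern(FillPatternType.SOLID_FOREGROUND);

                String[] headers = { "name", "amount" };
                Row header = sheet.createRow(0);
                for (int i = 0; i < headers.length; i++) {
                    Cell cell = header.createCell(i);
                    cell.setCellValue(headers[i]);
                    cell.setCellStyle(headerStyle);
                }

                // a plain, unstyled data row
                Row data = sheet.createRow(1);
                data.createCell(0).setCellValue("Alice");
                data.createCell(1).setCellValue(100);

                try (FileOutputStream out = new FileOutputStream("report.xlsx")) {
                    wb.write(out);
                }
            }
        }
    }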
I have a text file and I need to convert all of its data into XML format to make it more readable.
Text file
How can I convert it into XML format?
Is there any Java library or any other way that I can do it?
Your question is rather vague (and you could probably find the answer yourself with just a little research), but I'll give you a hint.
Your sample appears to be an INI file (as traditionally used for configuration files on Windows & DOS). So, look for an "INI file parser." If you can't find one, you should be able to write a simple parser yourself using regular expressions. It's a simple file format, consisting of section headings like [SectionTitle] and data fields like Key=Value. That's all.
As for generating XML ... it shouldn't be hard, but "xml format" is not a useful description. Can you be more specific? E.g., what will the XML be used for?
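To show how little is needed, here is a rough hand-rolled sketch that turns [Section] headings and Key=Value pairs into ad-hoc XML; the element names (config, section, entry) are arbitrary choices, not any standard:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class IniToXml {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("config.ini"));
                 PrintWriter out = new PrintWriter("config.xml", "UTF-8")) {
                out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
                out.println("<config>");
                String section = null;
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    // skip blank lines and INI comments
                    if (line.isEmpty() || line.startsWith(";") || line.startsWith("#")) continue;
                    if (line.startsWith("[") && line.endsWith("]")) {
                        if (section != null) out.println("  </section>");
                        section = line.substring(1, line.length() - 1);
                        out.printf("  <section name=\"%s\">%n", escape(section));
                    } else if (line.contains("=")) {
                        String[] kv = line.split("=", 2);
                        out.printf("    <entry key=\"%s\">%s</entry>%n",
                                escape(kv[0].trim()), escape(kv[1].trim()));
                    }
                }
                if (section != null) out.println("  </section>");
                out.println("</config>");
            }
        }

        // minimal escaping for the five predefined XML entities
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                    .replace("\"", "&quot;").replace("'", "&apos;");
        }
    }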
Try this: http://www.smooks.org/mediawiki/index.php?title=Main_Page. I've used it and it's great.
A more sophisticated solution would be to use Mule Data Mapper. On the server side, obviously.
I need to read an Excel (xls) file stored on a Hadoop cluster. I did some research and found out that I need to create a custom InputFormat for that. I read many articles, but none of them was helpful from a programming point of view. Can someone help me with sample code for writing a custom InputFormat, so that I can understand the basics of programming an InputFormat and can use the Apache POI library to read the Excel file?
I have made a MapReduce program for reading a text file. Now I need help with the following: even if I somehow manage to code my own custom InputFormat, where would I put that code in respect to the MapReduce program I have already written?
PS: converting the .xls file into a .csv file is not an option.
Yes, you should create a RecordReader to read each record from your Excel document. Inside that RecordReader you should use a POI-like API to read from Excel docs. More precisely, please do the following steps:
Extend FileInputFormat to create your own CustomInputFormat and override getRecordReader.
Create a CustomRecordReader by extending RecordReader; here you have to write how to generate a key/value pair from a given FileSplit.
So first read the bytes from the FileSplit, and from those buffered bytes read out the desired key and value using POI.
You can check my own CustomInputFormat and RecordReader for dealing with custom data objects here: myCustomInputFormat
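For orientation, here is a bare-bones skeleton of those steps against the newer org.apache.hadoop.mapreduce API (in the old mapred API the method is getRecordReader, as mentioned above). The class names are invented, and it naively loads the whole workbook into memory, which is only acceptable for modest file sizes:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;
    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;

    public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // the binary .xls format cannot be processed in pieces
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new ExcelRecordReader();
        }

        public static class ExcelRecordReader extends RecordReader<LongWritable, Text> {
            private final List<String> rows = new ArrayList<>();
            private int current = -1;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException {
                // open the file from HDFS and flatten each row to a tab-separated string via POI
                Path path = ((FileSplit) split).getPath();
                FileSystem fs = path.getFileSystem(context.getConfiguration());
                try (InputStream in = fs.open(path); Workbook wb = new HSSFWorkbook(in)) {
                    Sheet sheet = wb.getSheetAt(0);
                    for (Row row : sheet) {
                        StringBuilder sb = new StringBuilder();
                        for (Cell cell : row) {
                            if (sb.length() > 0) sb.append('\t');
                            sb.append(cell.toString());
                        }
                        rows.add(sb.toString());
                    }
                }
            }

            @Override
            public boolean nextKeyValue() { return ++current < rows.size(); }

            @Override
            public LongWritable getCurrentKey() { return new LongWritable(current); }

            @Override
            public Text getCurrentValue() { return new Text(rows.get(current)); }

            @Override
            public float getProgress() {
                return rows.isEmpty() ? 1f : Math.min(1f, (current + 1f) / rows.size());
            }

            @Override
            public void close() { }
        }
    }

In the driver of your existing MapReduce program you wire it in with job.setInputFormatClass(ExcelInputFormat.class); your mapper then receives one spreadsheet row per map() call.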
Your research is correct: you need a custom InputFormat for Hadoop. If you are lucky, somebody has already created one for your use case.
If not, I would suggest looking for a Java library that is able to read Excel files.
Since Excel is a proprietary file format, it is unlikely that you will find an implementation that works perfectly.
Once you have found a library that is able to read Excel files, integrate it with the InputFormat.
To do so, you have to extend the FileInputFormat of Hadoop. The RecordReader that is returned by your ExcelInputFormat must return the rows from your Excel file. You probably also have to override the isSplitable() method to tell the framework not to split the file at all.