I want to build a parser for fixed-position text files.
What I want to achieve is to make it dynamic, so that I can pass in an external configuration file describing the format of the file to be parsed.
Example of a configuration file that the application would load:
Field; Position
Name;0-20
Surname;21-40
Age;40-42
Sex;42-43
...
Example of a file to parse:
John William Hoover23M
Deborah Foobar33F
...
Googling around, I found a lot of libraries for parsing fixed-length files.
The problem is that all of them rely on creating classes with annotated fields that declare the fixed positions in the text file.
I want to make a generic parser, so these classes would have to be automatically generated and annotated based on some external configuration file.
Do you know of any library, or a different kind of approach, that I could follow?
I'm talking about parsing relatively big files, around ~500 MB, so efficiency and speed are also important factors.
Thank you all!
You don't need to "parse" the big file. You only need to extract substrings at given positions:
1. Parse the "format" file with a classic regex and store the names and positions in an array. Time doesn't matter there.
2. Open your big file, read the lines, and extract the substrings at the positions you want. That is the fastest you can do.
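Here is a minimal sketch of both steps, assuming the "Name;0-20" format-file syntax from the question; the file names and the Field holder are illustrative, not part of any library:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedWidthExtractor {

    // One entry per configuration line: field name plus start/end offsets.
    // (Java 16+ record; a plain class works equally well.)
    record Field(String name, int start, int end) {}

    public static void main(String[] args) throws IOException {
        // Step 1: parse the format file once; speed is irrelevant here.
        // Lines that don't match (e.g. the "Field; Position" header) are skipped.
        Pattern format = Pattern.compile("(\\w+);(\\d+)-(\\d+)");
        List<Field> fields = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("format.conf"))) {
            Matcher m = format.matcher(line);
            if (m.matches()) {
                fields.add(new Field(m.group(1),
                        Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3))));
            }
        }

        // Step 2: stream the big file line by line and cut out substrings.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("data.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (Field f : fields) {
                    // Guard against short lines so substring() never overruns.
                    int end = Math.min(f.end(), line.length());
                    String value = f.start() < end ? line.substring(f.start(), end) : "";
                    // ...do something with f.name() and value...
                }
            }
        }
    }
}

Streaming with a BufferedReader keeps memory flat regardless of file size, which matters for the ~500 MB files mentioned in the question.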
Try uniVocity-parsers' FixedWidthParser:
//define field lengths
FixedWidthFields fields = new FixedWidthFields();
accountFields.addField("ID", 10);
accountFields.addField("Bank", 8);
accountFields.addField("AccountNumber", 15);
accountFields.addField("Swift", 12);
//configure the parser
FixedWidthParserSettings settings = new FixedWidthParserSettings(fields); //many options here, check the tutorial
settings.getFormat().setLineSeparator("\n");
//We can now parse all rows
FixedWidthParser parser = new FixedWidthParser(settings);
List<String[]> rows = parser.parseAll(new File("/path/to/file.txt"));
This is just a rough example. There are many other examples here.
Disclosure: I'm the author of this library, it's open-source and free (Apache 2.0 License)
Related
I need to automatically generate 4 different types of CVs using Java/Spring. The information is already in the database in a structured way. However, we need to generate a Word document for 4 different types of CVs. If you have looked at the Europass format, there are sections like work experience and education and training that need to be duplicated more than once.
I have seen a docx4j approach where creating an XML file and adjusting the Word document to comply with that XML can make it work. However, what I can't seem to figure out for now is how to add repeating sections, for example a list of experiences. Not only do I have to repeat the actual data, but I also have to duplicate the text in the existing template.
If any of you know of any other library/plug-in/technology that might help me dynamically create a Word document (the CV) using Java, please let me know.
So I am working on a GAE project. I need to look up cities, country names and country codes for sign-ups, LBS, etc.
Now I figured that putting all the information in the Datastore is rather stupid, as it will be used quite frequently and it's going to eat up my Datastore quota for no reason, especially since these lists aren't going to change, so it's pointless to put them in the Datastore.
Now that leaves me with a few options:
API - No budget for paid services, and the free ones are not exactly reliable.
Upload a parseable file - the favorable option, as I like the certainty that the data will always be there.
So I got the files I needed from GeoNames (the link has source files for all countries in case someone needs them). The file for each country is a regular UTF-8 tab-delimited file, which is great.
However, now that I have the option to choose how to format and access the data, the question is:
What is the best way to format and retrieve data systematically from a static file in a Java servlet container?
The best way being the fastest and least resource-hungry method.
Valid options:
Tab-delimited TXT file
Static XML file
Java class with tons of enums
I know that importing the country files as Java enums and going through their values will be very fast, but do you think this is going to affect memory beyond reasonable limits? On the other hand, every time I need to access a record, the loop would have to go through a few thousand lines until it finds the required record. Reading line by line means no memory issues, but it is incredibly slow. I have had some experience with parsing an Excel file in a Java servlet, and it took something like 20 seconds just to parse 250 records; at a larger scale, the response time WILL time out (no doubt about it). So, is XML anything like Excel in this respect?
Thank you very much, guys! Please provide opinions; anything and everything is appreciated!
The easiest and fastest way would be to have the file as a static web resource under the WEB-INF folder and, on application startup, have a context listener load the file into memory.
In memory, it should be a Map, keyed by whatever you want to search by. This gives you close to constant access time.
Memory consumption only matters if the data set is really big. A hundred thousand records, for example, are not worth optimizing for if you need to access them that many times.
The static file should be in plain text or CSV format; those are read and parsed most efficiently. There is no need for XML formatting, as parsing it would be slow.
If the list is really big, you can break it up into multiple smaller files and only parse those when they are required. A reasonable, easy partitioning would be by country, but any other partitioning would work (for example, based on the first few characters of the name).
You could also consider building this Map in memory once, then serializing it to a binary file and including that binary file as a static resource. That way you would only have to deserialize the Map, with no need to parse/process it as a text file and build the objects yourself.
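A rough sketch of the context-listener idea, assuming the javax.servlet API and a "code<TAB>name" file layout; the file name and context-attribute key are made up for illustration:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

public class CountryDataListener implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        Map<String, String> countriesByCode = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                sce.getServletContext().getResourceAsStream("/WEB-INF/countries.txt"),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    countriesByCode.put(parts[0], parts[1]);
                }
            }
        } catch (Exception e) {
            throw new RuntimeException("Failed to load country data", e);
        }
        // Published once; servlets look it up via getServletContext().getAttribute(...).
        sce.getServletContext().setAttribute("countriesByCode", countriesByCode);
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) { }
}

The listener is registered once in web.xml (or with @WebListener), so the parsing cost is paid only at startup.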
Improvements on the data file
An alternative to having the static resource file as a text/CSV file or a serialized Map would be a binary data file in a custom format of your own design.
Using DataOutputStream you can write data to a binary file in a very compact and efficient way. Then you could use DataInputStream to load data from this custom file.
This solution has the advantage that the file can be much smaller (compared to plain text / CSV / a serialized Map), and loading it will be much faster (because DataInputStream doesn't parse numbers from text, for example; it reads the bytes of a number directly).
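A small illustration of the idea; the record layout (a count header followed by code/name pairs) is just an assumption for the example:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class BinaryCountryFile {

    static void write(File f, String[][] rows) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(rows.length);          // record count header
            for (String[] row : rows) {
                out.writeUTF(row[0]);           // country code
                out.writeUTF(row[1]);           // country name
            }
        }
    }

    static Map<String, String> read(File f) throws IOException {
        Map<String, String> byCode = new HashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            int count = in.readInt();           // bytes read directly, no text parsing
            for (int i = 0; i < count; i++) {
                byCode.put(in.readUTF(), in.readUTF());
            }
        }
        return byCode;
    }
}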
Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:
(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.
(b) your in-memory form is something like a Java HashMap
If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
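For instance, option (a) could look something like this; the countries.xml structure assumed by the XPath expression is hypothetical:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XmlLookup {
    public static void main(String[] args) throws Exception {
        // Parse once at start of day and keep the Document in memory.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("countries.xml"));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Ad-hoc queries don't require any code restructuring:
        String name = xpath.evaluate("/countries/country[@code='DE']/name", doc);
        System.out.println(name);
    }
}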
I need to parse complex (non-fixed-length) CSV files into Java objects in order to compare their values.
I first tried the Flatform Parsing Framework; I liked the approach of describing the values in an extra (XML) document. Maybe it's the right tool for simple CSV (and also flat) files. Nevertheless, my CSV files contain lines that vary in the number of fields, and sometimes they span multiple lines. There are also dependencies among those fields.
Here's a little sample (each type has a certain number of extra parameters):
; <COMMENTS (to be ignored)>
<NAME>,<TYPE_A>,<DESCRIPTION>,<PARAMETER>
<NAME>,<TYPE_B>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_C>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_D>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>,<PARAMETER>,<PARAMETER>, -
<PARAMETER>,<PARAMETER>, -
<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_B>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_A>,<DESCRIPTION>,<PARAMETER>
So I need something to describe and parse the CSV file in a more complex manner. I'm new to this; I've heard about parser generators - is that what I need?
Try OpenCSV (see http://opencsv.sourceforge.net/#what-features). It handles embedded carriage returns just fine.
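A minimal usage sketch; the file name is made up, and the com.opencsv package is the one used by recent releases (older versions lived under au.com.bytecode.opencsv):

import java.io.FileReader;
import com.opencsv.CSVReader;

public class CsvDemo {
    public static void main(String[] args) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader("records.csv"))) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                // In the sample above, row[1] holds <TYPE_A>/<TYPE_B>/..., so you
                // can branch on it to handle the varying parameter counts.
                System.out.println(String.join(" | ", row));
            }
        }
    }
}

Note that OpenCSV treats line breaks inside quoted fields as part of the record; the trailing "-" continuation style in the sample would still need a pre-processing pass of your own.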
One option is to use the Scanner class, or you might want to check out Spring Batch. I've never actually used Spring Batch, but given that batch jobs often read from simple text files, I believe I've read that it caters for this, including all sorts of object mapping.
You may also try japaki
I'm a novice Java programmer who's trying to create a small Java app.
In the program I'm working on, I want to load the configurations from different INI files.
The basic idea is that I would have a directory containing all the config files; the parser should read all of them and create configurations named after their filenames.
The parser should be built to work dynamically, so it can read different types of configs.
example
House.ini
-> type0
-> id name height width length price_based_on_dimensions
-> id1 name1 height width length price_based_on_dimensions
This data should be saved to a config object named config.house. The tricky part is that a different config file can have its type = type0 but with a different number of attributes.
I realise that there is no simple solution to this, but any help and/or guides on creating a dynamic parser are welcome.
I'm not really clear on the output you want to produce, but a Java INI parsing library might be a good place to start. For that, you should use ini4j.
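A small ini4j sketch against the House.ini example from the question; since the exact file layout isn't pinned down, this just walks whatever sections and keys it finds, which keeps the parser dynamic:

import java.io.File;
import org.ini4j.Ini;

public class IniDemo {
    public static void main(String[] args) throws Exception {
        Ini ini = new Ini(new File("House.ini"));
        // Sections and keys are discovered at runtime,
        // so no attribute list has to be hard-coded.
        for (String sectionName : ini.keySet()) {
            Ini.Section section = ini.get(sectionName);
            for (String key : section.keySet()) {
                System.out.println(sectionName + "." + key + " = " + section.get(key));
            }
        }
    }
}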
Other than ini4j (which is a really good library), I personally prefer to use XML config files. To me they are easier to use and allow for easier configuration.
I am working on a project here that ingests internal resumes from people at my company, strips out the skills and relevant content from them, and stores it in a database. This was all done using docx4j and Grails. It required the resumes to first be submitted via a template that formatted everything just right, so that the ingest tool knew what to look for when stripping the data.
The second portion of this is: what if we want to get a "reduced" resume out of the database? In other words, I want to search the uploaded content I now have and only print out new resumes for people who have Java programming experience, let's say. So I can go into my database, find the people who originally had Java as a skill, and output a new set of resumes that are still in a nicely templated format and only contain the relevant info, instead of ALL the content.
I have been writing some software in Java to do this. It basically uses a docx template, overwriting the items in customXML which are bound to the content controls in the doc, so the new data shows up and can be saved as a new docx with that custom data.
This seems really cumbersome to me, and it has some limitations. For one, let's say my template has a place for 3 skills and a particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I don't want to have to go back into my software and edit the source code to change that additional data-input XML tag to bold instead of italic.
I was doing some reading on using InfoPath to create a form that I could use to get the input, connecting to some SharePoint data source or something to store the stripped-out data. However, I can't seem to find out whether it is possible, using SharePoint, to get the data back out in a nicely formatted way. What would the general steps for this be? It seems like I couldn't find very much about this topic with any quick googling.
Thanks
You could set up the skills like this:
<skills>
    <skill>..</skill>
    <skill>..</skill>
</skills>
and use a "repeat" content control pointing to the container. This would handle any number of <skill> entries.