I have two XML files (one around 50 MB, the other around 120 KB) which can be linked using a common field. I need to read the files, link them on the common field and produce a combined output in CSV.
Can someone please advise what would be the most efficient way to do that using Java?
Here is a good example to look at: http://www.developerfusion.com/code/2064/a-simple-way-to-read-an-xml-file-in-java/. You will have to tweak it a little, but it shows you how to gain access to the file (in your case, files); as you process each parent node, you can get the child nodes, do your processing, and then write out a CSV. Let me know if you need more information or get stuck.
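As a rough illustration of the shape of that approach (a sketch, not code from the linked article; the file, element and attribute names small.xml, large.xml, record, key, value and data are placeholders for your actual schema): index the small file into a Map with DOM, then stream the 50 MB file with StAX so it never sits fully in memory, writing matched rows as CSV.

```java
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlJoinToCsv {
    public static void main(String[] args) throws Exception {
        // Index the small (~120 KB) file: common field -> payload value.
        Map<String, String> lookup = new HashMap<>();
        Document small = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("small.xml");          // placeholder file name
        NodeList records = small.getElementsByTagName("record");   // placeholder element name
        for (int i = 0; i < records.getLength(); i++) {
            Element e = (Element) records.item(i);
            lookup.put(e.getAttribute("key"), e.getAttribute("value"));
        }

        // Stream the large (~50 MB) file with StAX so it is never fully in memory.
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("large.xml"));
        try (PrintWriter csv = new PrintWriter("combined.csv")) {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(r.getLocalName())) {
                    String key = r.getAttributeValue(null, "key");
                    String matched = lookup.get(key);
                    if (matched != null) {
                        // Join: fields from the large record plus the looked-up value.
                        csv.println(key + "," + r.getAttributeValue(null, "data") + "," + matched);
                    }
                }
            }
        }
    }
}
```

Streaming the large file keeps the memory footprint bounded by the lookup table built from the small file.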
How to write RowLoader Java code to inject data from a sample.csv file into a GemFireXD database.
The GemFireXD distribution includes a JDBCRowLoader source example. Look in the examples directory. In your case you will have to determine which field of your CSV you want to treat as the primary key, parse the CSV and return rows as needed.
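To sketch the shape of such a loader (a rough outline, not the bundled example; the package, interface and method signature below are recalled from the GemFireXD callback API, so verify them against the JDBCRowLoader source in the examples directory):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.SQLException;

// Assumed interface: GemFireXD row loaders implement a callback with init(...)
// and getRow(schema, table, primaryKey) methods; check the bundled example
// for the exact package name and signatures.
public class CsvRowLoader implements com.pivotal.gemfirexd.callbacks.RowLoader {

    public void init(String initStr) {
        // initStr could carry the CSV path, configured when registering the loader.
    }

    public Object getRow(String schemaName, String tableName, Object[] primaryKey)
            throws SQLException {
        // Scan the CSV for the row whose first field matches the requested key.
        try (BufferedReader in = new BufferedReader(new FileReader("sample.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",");
                if (fields[0].equals(String.valueOf(primaryKey[0]))) {
                    return fields; // the column values for the missed row
                }
            }
        } catch (java.io.IOException e) {
            throw new SQLException(e);
        }
        return null; // no matching row
    }
}
```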
You can check the IMPORT_DATA_EX and IMPORT_TABLE_EX procedures to load data into GemFireXD.
Since you mentioned CSV format, IMPORT_DATA_EX might be the recommended way to do it, since you can also tweak the number of threads and constraints while loading the data. It's definitely one of the fastest ways to do it, but please note that the CSV file must be available from the node where you're issuing the command.
You might also want to consider starting a peer member with host-data=false.
Reference: http://gemfirexd.docs.pivotal.io/latest/userguide/index.html#reference/system_procedures/derby/rrefimportdataproc_ex.html
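Invocation is a plain JDBC procedure call; here is a rough sketch. The exact parameter list is an assumption (modelled on Derby's import procedures plus the GemFireXD extras such as the thread count), so verify it against the reference page above before using.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class ImportCsvExample {
    public static void main(String[] args) throws Exception {
        // Connection URL, schema, table and file path are placeholders.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:gemfirexd://localhost:1527/");
             CallableStatement cs = conn.prepareCall(
                 // Assumed signature; consult the linked reference for the real one.
                 "CALL SYSCS_UTIL.IMPORT_DATA_EX(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)")) {
            cs.setString(1, "APP");                // schema name
            cs.setString(2, "MYTABLE");            // table name
            cs.setString(3, null);                 // insert column list (null = all)
            cs.setString(4, null);                 // column indexes (null = all)
            cs.setString(5, "/data/sample.csv");   // file, visible from this node
            cs.setString(6, ",");                  // column delimiter
            cs.setString(7, null);                 // character delimiter
            cs.setString(8, null);                 // codeset
            cs.setShort(9, (short) 0);             // replace (0 = append)
            cs.setShort(10, (short) 0);            // lock table
            cs.setInt(11, 4);                      // number of threads
            cs.setBoolean(12, false);              // case-sensitive names
            cs.setString(13, null);                // custom import class
            cs.setString(14, null);                // error file
            cs.execute();
        }
    }
}
```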
If I have 100 short XML files in a folder and I'd like to know which of them contain the text aaabbbccc (for later, accurate parsing), is it a good idea to read them as Strings one after another and use the contains method to determine which files do not contain this text?
As far as I know, the contains method is very fast.
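For what it's worth, with short files this is only a few lines; a minimal sketch (Files.readString requires Java 11+, otherwise use new String(Files.readAllBytes(file), charset)):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class GrepXmlFiles {
    public static void main(String[] args) throws IOException {
        // Scan every .xml file in the folder and report which ones contain the marker.
        try (DirectoryStream<Path> dir =
                Files.newDirectoryStream(Paths.get("xml-folder"), "*.xml")) {
            for (Path file : dir) {
                if (Files.readString(file).contains("aaabbbccc")) {
                    System.out.println(file + " contains the text");
                }
            }
        }
    }
}
```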
So I am working on a GAE project. I need to look up cities, country names and country codes for sign-ups, LBS, etc.
Now I figured that putting all the information in the Datastore is rather stupid, as it will be used quite frequently and it's going to eat up my Datastore quota for no reason, especially since these lists aren't going to change, so it's pointless to put them in the Datastore.
Now that leaves me with a few options:
API - No budget for paid services, free ones are not exactly reliable.
Upload a parseable file - the favorable option, as I like the certainty that the data will always be there.
So I got the files needed from GeoNames (the link has source files for all countries in case someone needs them). The file for each country is a regular UTF-8 tab-delimited file, which is great.
However, now that I have the option to choose how to format and access the data, the question is:
What is the best way to format and systematically retrieve data from a static file in a Java servlet container?
By "best" I mean the fastest and least resource-hungry method.
Valid options:
TXT file, tab delimited
Static XML file
Java class with tons of enums
I know that importing country files as Java enums and iterating over their values will be very fast, but do you think this will push memory use beyond reasonable limits? On the other hand, with a text file, every time I need to access a record the loop has to go through a few thousand lines until it finds the required one. Reading line by line avoids memory issues but is incredibly slow: I have had some experience parsing an Excel file in a Java servlet, and it took something like 20 seconds just to parse 250 records. At scale, the response time WILL time out (no doubt about it), so is XML anything like Excel in that respect?
Thank you very much, guys! Please provide opinions; any and all input is appreciated!
The easiest and fastest way would be to have the file as a static web resource under the WEB-INF folder and, on application startup, have a context listener load the file into memory.
In memory it should be a Map, keyed by the field you want to search by. This gives you roughly constant access time.
Memory consumption only matters if the data set is really big; a hundred thousand records, for example, is not worth optimizing if you need to access the data this many times.
The static file should be plain text or CSV, which are read and parsed most efficiently. There is no need for XML formatting; parsing it would be slow.
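A minimal sketch of such a listener (the resource name countries.csv and the choice of key are assumptions; the servlet API calls are standard):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

public class CountryDataListener implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        Map<String, String[]> byCode = new HashMap<>();
        // countries.csv is a hypothetical tab-delimited resource under WEB-INF.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                sce.getServletContext().getResourceAsStream("/WEB-INF/countries.csv"),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t");
                byCode.put(fields[0], fields); // key by country code, for example
            }
        } catch (Exception e) {
            throw new RuntimeException("Could not load country data", e);
        }
        // Publish the map; servlets read it back via getServletContext().getAttribute(...).
        sce.getServletContext().setAttribute("countries", byCode);
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) { }
}
```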
If the list is really big, you can break it up into multiple smaller files and only parse those that are required, when they are required. A reasonable, easy partitioning would be by country, but any other partitioning would work (for example, based on the first few characters of the name).
You could also consider building this Map in memory once, then serializing it to a binary file and including that binary file as a static resource. That way you would only have to deserialize the Map, with no need to parse and process a text file and build the objects yourself.
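That round-trip is only a few lines with plain Java serialization (a sketch; HashMap is Serializable, so this works as long as your value type is too):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class MapSnapshot {
    // Build once (e.g. in a build step) and ship countries.bin as a resource.
    static void save(HashMap<String, String[]> map) throws Exception {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream("countries.bin"))) {
            out.writeObject(map);
        }
    }

    // At startup, deserialize instead of re-parsing the text file.
    @SuppressWarnings("unchecked")
    static HashMap<String, String[]> load() throws Exception {
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream("countries.bin"))) {
            return (HashMap<String, String[]>) in.readObject();
        }
    }
}
```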
Improvements on the data file
An alternative to having the static resource file as a text/CSV file or a serialized Map data file would be to have it as a binary data file where you create your own custom file format.
Using DataOutputStream you can write data to a binary file in a very compact and efficient way, and you can then use DataInputStream to load the data back from this custom file.
This solution has the advantages that the file can be much smaller (compared to plain text / CSV / a serialized Map) and loading it can be much faster, because DataInputStream doesn't parse numbers from text, for example; it reads the bytes of a number directly.
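A sketch of such a custom format (the record layout of code, name and population is made up):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class BinaryCountryFile {
    // Write: record count first, then fixed-order fields per record.
    static void write(String[][] records) throws Exception {
        try (DataOutputStream out =
                new DataOutputStream(new FileOutputStream("countries.dat"))) {
            out.writeInt(records.length);
            for (String[] r : records) {
                out.writeUTF(r[0]);                  // country code
                out.writeUTF(r[1]);                  // name
                out.writeLong(Long.parseLong(r[2])); // population, stored as raw bytes
            }
        }
    }

    // Read: no text parsing at load time, just direct byte reads.
    static void read() throws Exception {
        try (DataInputStream in =
                new DataInputStream(new FileInputStream("countries.dat"))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String code = in.readUTF();
                String name = in.readUTF();
                long population = in.readLong();
                System.out.println(code + " " + name + " " + population);
            }
        }
    }
}
```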
Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:
(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.
(b) your in-memory form is something like a Java HashMap.
If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
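If you go with (a), standard Java already ships the pieces (javax.xml.xpath over a DOM parsed once at startup); a small sketch, with a made-up document structure:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class InMemoryXmlQuery {
    public static void main(String[] args) throws Exception {
        // Parse once at startup; all later queries hit the in-memory tree.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("data.xml"); // hypothetical file

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Hypothetical query: all <city> elements under a given country code.
        NodeList cities = (NodeList) xpath.evaluate(
                "/countries/country[@code='FR']/city", doc, XPathConstants.NODESET);
        for (int i = 0; i < cities.getLength(); i++) {
            System.out.println(cities.item(i).getTextContent());
        }
    }
}
```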
I am trying to download a file from a server in a user-specified number of parts (n). So there is a file of x bytes divided into n parts, with each part downloading a piece of the whole file at the same time. I am using threads to implement this, but I have not worked with HTTP before and do not really understand how downloading a file actually works. I have read up on it and it seems the "Range" header needs to be used, but I do not know how to download the different parts and append them without corrupting the data.
(Since it's a homework assignment I will only give you a hint)
Appending to a single file will not help you at all, since this will mess up the data. You have two alternatives:
Download from each thread to a separate temporary file and then merge the temporary files in the right order to create the final file. This is probably easier to conceive, but it is a rather ugly and inefficient approach.
Do not stick to the usual stream-style semantics; use random access (e.g. Java's RandomAccessFile) to write data from each thread straight to the right location within the output file, as in the fragment below.
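Since it's homework, here is only the mechanics of that hint as a fragment (the URL, byte ranges and file name are placeholders):

```java
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.URL;

public class PartDownloader implements Runnable {
    private final String url;
    private final long start, end; // inclusive byte range of this part
    private final String outFile;

    PartDownloader(String url, long start, long end, String outFile) {
        this.url = url; this.start = start; this.end = end; this.outFile = outFile;
    }

    @Override
    public void run() {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            // Ask the server for just this slice of the file.
            conn.setRequestProperty("Range", "bytes=" + start + "-" + end);
            try (InputStream in = conn.getInputStream();
                 RandomAccessFile out = new RandomAccessFile(outFile, "rw")) {
                out.seek(start); // write the slice at its final position
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```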
I need to parse complex (non-fixed-length) CSV files into Java objects in order to compare their values.
I first tried the Flatform Parsing Framework; I liked the approach of describing the values in an extra (XML) document. Maybe it's the right tool for simple CSV (and also flat) files. Nevertheless, my CSV files contain lines that vary in the number of fields, and sometimes a record spans multiple lines. There are also dependencies among those fields.
Here's a little sample: (each type has a certain amount of extra parameters)
; <COMMENTS (to be ignored)>
<NAME>,<TYPE_A>,<DESCRIPTION>,<PARAMETER>
<NAME>,<TYPE_B>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_C>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_D>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>,<PARAMETER>,<PARAMETER>, -
<PARAMETER>,<PARAMETER>, -
<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_B>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_A>,<DESCRIPTION>,<PARAMETER>
So I need something to describe and parse the CSV file in a more complex manner. I'm new to this; I've heard about parser generators - are they what I need?
Try OpenCSV (see http://opencsv.sourceforge.net/#what-features). It handles embedded carriage returns just fine.
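A minimal sketch with the classic OpenCSV API (the package is au.com.bytecode.opencsv in older releases and com.opencsv in newer ones):

```java
import java.io.FileReader;
import com.opencsv.CSVReader;

public class ReadRecords {
    public static void main(String[] args) throws Exception {
        // Each call to readNext() returns one logical record, even when a
        // quoted field inside it spans several physical lines.
        try (CSVReader reader = new CSVReader(new FileReader("records.csv"))) {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                String name = fields[0];
                String type = fields[1];
                // fields.length varies per record (TYPE_A has one parameter,
                // TYPE_B has two, and so on), so branch on the type here.
                System.out.println(name + " (" + type + "): "
                        + (fields.length - 3) + " parameter(s)");
            }
        }
    }
}
```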
One option is to use the Scanner class, or you might want to check out Spring Batch. I've never actually used Spring Batch, but given that batch jobs often read from simple text files, I believe I read that it caters for this, including all sorts of object mapping.
You may also try japaki.