Searching and storing values from CSV - java

I'm a Java beginner and want to learn how to read in files and store data in a way that makes it easy to manipulate.
I have a pretty big CSV file (18000 rows). The data represents the assortment of all the different beverages sold by a liquor shop. It consists of some 16 columns with headers like "article number", "name", "producer", "amount of alcohol", etc. The columns are separated by "\t".
I now want to do some searching in this file to find things like how many products are produced in Sweden, or the most expensive liquor per liter.
Since I really want to learn how to program and not just find the answer, I'm not looking for any exact code here. I'm instead looking for the pseudo-code behind this, a good way of thinking when dealing with large sets of data, and what kinds of data structures are best suited for the task.
Let's take the "How many products are from Sweden" example.
Since the data consists of strings, ints and floats, I can't put everything in a list. What is the best way of storing it so it can be manipulated later? Or can I find the answer as soon as it's parsed, so maybe I don't have to store it at all?

If you're new to Java and programming in general I'd recommend a library to help you view and use your data, without getting into databases and learning SQL. One that I've used in the past is Commons CSV.
https://commons.apache.org/proper/commons-csv/user-guide.html#Parsing_files
It lets you easily parse a whole CSV file into CSVRecord objects. For example:
import java.io.FileReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

// CSVFormat.TDF handles tab-separated files like yours; withFirstRecordAsHeader()
// lets you fetch values by the column names given in the first row.
Iterable<CSVRecord> records = CSVFormat.TDF.withFirstRecordAsHeader()
        .parse(new FileReader("path/to/file.csv"));
for (CSVRecord record : records) {
    String name = record.get("name");
    String producer = record.get("producer");
}
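For your "How many products are from Sweden" example you don't even need to store the rows; you can count while you parse. A minimal sketch, assuming your file has a country-of-origin column (adjust the header name and file name to whatever your file actually uses):

import java.io.FileReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class SwedishCount {
    public static void main(String[] args) throws Exception {
        int swedish = 0;
        // Stream the records one at a time; nothing is held in memory.
        for (CSVRecord record : CSVFormat.TDF.withFirstRecordAsHeader()
                .parse(new FileReader("beverages.csv"))) {
            if ("Sweden".equalsIgnoreCase(record.get("country"))) {
                swedish++;
            }
        }
        System.out.println(swedish + " products are from Sweden");
    }
}

If you later need several different searches over the same data, parse each row into a small POJO (e.g. a Beverage class with typed fields) and keep them in a List<Beverage>; mixed column types are not a problem, because each field of the class has its own type.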

If you have a CSV file in particular, you may use a database to store this data.
Go through this link to read a CSV in Java.
Make use of an ORM framework like Hibernate together with a Spring application. Use this link to create the application.
By using this you can write queries to fetch data like "How many products are from Sweden" and make use of the Collections framework. This link shows how to use HQL queries in the same application.
Create JSP pages to show the results in the UI.

It seems you are looking for an in-memory SQL engine over your CSV file. I would suggest using CQEngine, which provides indexed views on top of the Java collections framework with SQL-like queries.
You are basically treating a Java collection as a database table. Assuming that each CSV line maps to some POJO class like Beverage:
import com.googlecode.cqengine.ConcurrentIndexedCollection;
import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.index.navigable.NavigableIndex;

IndexedCollection<Beverage> table = new ConcurrentIndexedCollection<Beverage>();
// Index the attribute you will query on most often.
table.addIndex(NavigableIndex.onAttribute(Beverage.BEVERAGE_ID));
table.add(new Beverage(...));
table.add(new Beverage(...));
table.add(new Beverage(...));
What you need to do now is read the CSV file, load it into the IndexedCollection, and then build a proper index on some fields. After that, you can query the table like a regular SQL database. At the end, write the collection back out to a new CSV file (if you made any modifications).
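As a rough sketch of what the Sweden query could look like, assuming Beverage declares a COUNTRY attribute (the attribute and getter names here are made up):

import static com.googlecode.cqengine.query.QueryFactory.equal;

import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.query.option.QueryOptions;
import com.googlecode.cqengine.resultset.ResultSet;

// Hypothetical attribute declared inside the Beverage class:
public static final SimpleAttribute<Beverage, String> COUNTRY =
        new SimpleAttribute<Beverage, String>("country") {
            public String getValue(Beverage b, QueryOptions options) {
                return b.getCountry();
            }
        };

// Counting the products from Sweden is then a one-liner:
ResultSet<Beverage> swedish = table.retrieve(equal(Beverage.COUNTRY, "Sweden"));
System.out.println(swedish.size() + " products are from Sweden");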

Related

Efficient data import PostgreSQL DB

I just designed a Pg database and need to choose a way of populating my DB with data. The data consists of txt and csv files but can generally be any type of file containing characters with delimiters. I'm programming in Java so that the data ends up with the same structure (there are lots of different kinds of files, and I need to find out what each column of a file represents so I can associate it with a column of my DB). I thought of two ways:
Convert the files into one common type of file (JSON) and then have the DB regularly check the JSON file and import its content.
Directly connect to the database via JDBC and send the strings to the DB (I still need to create a backup file containing what was inserted into the DB, so in both cases there is a file created and written to).
Which would you go with, time-efficiency-wise? I'm kind of tempted by the first one, as it would be easier to handle a JSON file in the DB.
If you have any other suggestion that would also be welcome!
JSON or CSV
If you have the liberty of converting your data either to CSV or JSON format, CSV is the one to choose. This is because you will then be able to use COPY FROM to bulk load large amounts of data at once into postgresql.
CSV is supported by COPY but JSON is not.
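Because you're in Java, you can drive COPY FROM through the PostgreSQL JDBC driver's CopyManager. A minimal sketch; the connection string, table name and file name are placeholders:

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");
             FileReader csv = new FileReader("data.csv")) {
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            // Streams the whole file to the server in a single COPY operation.
            long rows = copy.copyIn(
                    "COPY my_table FROM STDIN WITH (FORMAT csv)", csv);
            System.out.println(rows + " rows loaded");
        }
    }
}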
Directly inserting values.
This is the approach to take if you only need to insert a few (or maybe even a few thousand) records, but it is not suited to large numbers of records because it will be slow.
If you choose this approach, you can create the backup using COPY TO. However, if you feel that you need to create the backup file with your Java code, choosing CSV as the format means you will be able to bulk load it as discussed above.

Saving big amount of data (words): Serialization or DB

I need to permanently save a big vocabulary and associate some information with each word (and use it to search words efficiently).
Is it better to store it in a DB (in a simple table, letting the DBMS do the work of structuring the data based on the key), or is it better to create a trie data structure and serialize it to a file, deserializing it once the program starts? Or maybe use an XML file instead of serialization?
Edit: the vocabulary would be on the order of 5,000 to 10,000 words, and for each word the metadata is structured as an array of 10 integers. Access to the words is very frequent (this is why I thought of the trie data structure, whose search time is roughly O(word length), instead of a DB that uses a B-tree or something similar, where the search is ~O(log n)).
P.S. I'm using Java.
Thanks!
Using a DB is better.
Many companies have migrated to a DB; for example, the ERP Divalto used to use serialization and moved to a DB to get better performance.
You have many choices of DBMS. If you want to keep all the data in one file, the simple way is to use SQLite. Its advantage is that it does not need any DBMS server running.
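A minimal sketch of the SQLite route using the org.xerial sqlite-jdbc driver; the schema and names are made up, with the 10 integers encoded in a single text column for brevity:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class VocabStore {
    public static void main(String[] args) throws Exception {
        // One file on disk, no server process needed.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:vocab.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS words ("
                        + "word TEXT PRIMARY KEY, meta TEXT)");
            }
            // The PRIMARY KEY gives indexed (B-tree) lookups by word.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT meta FROM words WHERE word = ?")) {
                ps.setString(1, "example");
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) System.out.println(rs.getString("meta"));
                }
            }
        }
    }
}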

Insert Query Builder for java

I have a use case wherein I need to read rows from a file, transform them using an engine, and then write the output to a database (which can be configured).
While I could write a query builder of my own, I was interested in knowing if there's already an available solution (library).
I searched online and found the jOOQ library, but it is type-safe and has a code-gen tool, so it is probably suited to static database schemas. In my use case, DBs can be configured dynamically and the metadata is programmatically read and made available for write purposes (so a list of tables is made available, the user can select the columns to write, and the insert script for these columns needs to be dynamically created).
Is there any library that could help me with the use case?
If I understand correctly, you need to query the database structure, display the result via a GUI, and have the user map data from a file to that structure?
Assuming this is the case, you're not looking for a 'library', you're looking for an ETL tool.
Alternatively, if you're set on writing something yourself, the (very) basic way to do this is:
Read the structure of the database using Connection.getMetaData(). The exact usage can vary between drivers, so you'll need to create an abstraction layer that meets your needs - I'd assume you're just interested in the table structure here.
Map the format of the file to a structure similar to the tables.
Provide a GUI that allows the user to connect elements from the file to columns in the table, including any type mapping that is needed.
Create a parametrized insert statement based on the file-element-to-column mapping - this is just a simple bit of string concatenation (see the sketch below).
Loop through the rows in the file, performing a batch insert for each.
My advice: get an ETL tool. This sounds like a simple problem, but it's full of idiosyncrasies - getting even an 80% solution will be tough and time-consuming.
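If you do roll your own, here is a hedged sketch of the metadata and statement-building steps (table name handling and error handling simplified for illustration):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class InsertBuilder {
    // Builds "INSERT INTO t (c1, c2, ...) VALUES (?, ?, ...)" from live metadata.
    public static PreparedStatement insertFor(Connection conn, String table)
            throws Exception {
        DatabaseMetaData meta = conn.getMetaData();
        List<String> cols = new ArrayList<>();
        try (ResultSet rs = meta.getColumns(null, null, table, null)) {
            while (rs.next()) {
                cols.add(rs.getString("COLUMN_NAME"));
            }
        }
        String placeholders =
                String.join(", ", Collections.nCopies(cols.size(), "?"));
        String sql = "INSERT INTO " + table + " (" + String.join(", ", cols)
                + ") VALUES (" + placeholders + ")";
        return conn.prepareStatement(sql);
    }
}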
jOOQ (the library you referenced in your question) can be used without code generation as indicated in the jOOQ manual:
http://www.jooq.org/doc/latest/manual/getting-started/use-cases/jooq-as-a-standalone-sql-builder
http://www.jooq.org/doc/latest/manual/sql-building/plain-sql
When searching through the user group, you'll find other users leveraging jOOQ in the way you intend.
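A small sketch of jOOQ as a standalone SQL builder, with table and column names as runtime strings (the names used here are invented):

import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.name;
import static org.jooq.impl.DSL.table;

import java.util.List;
import org.jooq.DSLContext;
import org.jooq.Query;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

public class DynamicInsert {
    public static void main(String[] args) {
        // No code generation: names come from strings discovered at runtime.
        DSLContext create = DSL.using(SQLDialect.POSTGRES);
        Query insert = create.insertInto(table(name("products")),
                                         field(name("name")), field(name("price")))
                             .values("Absolut", 219.0);
        String sql = insert.getSQL();                // parametrized INSERT string
        List<Object> binds = insert.getBindValues(); // values for the ? placeholders
        System.out.println(sql + " " + binds);
    }
}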
The steps you need to do are:
read the rows
build each row into an object
transform the above object into the target object
insert the target object into the DB
Among the above 4 steps, the only thing you really need to work on is step 3.
And for the above purpose, you can use Transmorph, EZMorph, Commons-BeanUtils, Dozer, etc.
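For step 3, a hedged sketch using Commons BeanUtils to copy matching properties between two made-up classes:

import org.apache.commons.beanutils.BeanUtils;

public class MappingDemo {
    // Hypothetical source and target types with matching property names.
    public static class FileRow {
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }
    public static class TargetRecord {
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static void main(String[] args) throws Exception {
        FileRow row = new FileRow();          // populated from the file
        row.setName("Absolut");
        TargetRecord target = new TargetRecord();
        // Copies every matching property; note the (destination, source) order.
        BeanUtils.copyProperties(target, row);
        System.out.println(target.getName()); // Absolut
    }
}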

Compare Huge XML Rows with Database Table Records - Custom Requirement

Problem
We have XML data like the following (it has some non-Unicode characters which need to be filtered out):
<row><div>1234</div><dept>ABCD</dept></row>
<row><div>5678</div><dept>EFGH</dept></row>
I'm mentioning only 2 column tags for ease of understanding; it actually has more than 20 column tags in each row.
The XML data is inserted directly as records into an Oracle schema table, as:
div_c qdept
1234 ABCD
5678 EFGH
More information
The XML file is more than 9 gigs and is available on FTP.
Database table column names might be different from XML column tag names.
Might have to add/define some Rules to filter out the rows.
Question
What would be the appropriate way to parse this huge XML and find out whether that record exists in the database table? Any open source tools available to utilize?
What I Am Trying
Wrote a StAX parser using the default implementation (XMLInputFactory) with an invalid-character filter (FilterReader)
Planning to split the XML into chunks
Have concurrent threads processing each of the chunks
Each thread will generate a query to check whether that record exists in the database or not (I know it's absurd)
Have a connection pool created and execute those queries from each of the threads
I know what I am doing is really bad and it would take ages to complete. I really need some advice, like whether to go with an ORM to make the checking easy and the XML parsing fast.
Some suggestions like that would really help me.
Yeah, I think you were right to use StAX. You definitely want to stream, and StAX seems to have the simplest API for streaming XML.
I wouldn't go to ORM right away. Most ORM is for round-tripping data: it saves you work on mechanical transformations, which makes it good when you have very structured data and the mapping between the two schemas is not very complicated. Here you are trying to import data from one format into another. It sounds like your large dataset has a fairly simple schema, but the mapping is the more complicated part. Go with custom code.
Pawel's suggestion of the temporary table sounds good. Try to do as much processing as you can in stored procedures that operate on the whole dataset at once (old and imported). You don't want to keep transferring those rows back and forth between the database and your app.
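A minimal StAX sketch of the streaming side, assuming the rows sit under a single root element (the file name and the staging step are placeholders):

import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class RowStreamer {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Wrap the FileReader with your invalid-character FilterReader here.
        XMLStreamReader r =
                factory.createXMLStreamReader(new FileReader("rows.xml"));
        Map<String, String> row = new HashMap<>();
        boolean inRow = false;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if (r.getLocalName().equals("row")) {
                    inRow = true;
                } else if (inRow) {
                    row.put(r.getLocalName(), r.getElementText()); // e.g. div -> 1234
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && r.getLocalName().equals("row")) {
                stageForBatchInsert(row); // batch into a temp table, compare in SQL
                row = new HashMap<>();
                inRow = false;
            }
        }
        r.close();
    }

    static void stageForBatchInsert(Map<String, String> row) {
        // Placeholder: add to a JDBC batch INSERT into a staging table.
    }
}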

dynamic object relation mapping

I am trying to create an application in Java which pulls records from a database and maps them to objects. It does that without knowing what the schema of the database looks like. All I want to do is fetch all rows from all tables and store them somewhere. There could be a thousand tables with thousands of records each. The application doesn't know the name of any table or attribute; it should map "on the fly". I looked at Hibernate but it doesn't give me what I want for this app. I don't want to create hard-coded XML files and classes for the mapping. Any ideas how I can accomplish this?
Thanks
Oracle has a bunch of data dictionary views for metadata.
ALL_TABLES and ALL_TAB_COLUMNS would be the first places to start. Then you'd build ad-hoc queries based on what you get out of there. Not sure whether you have to deal with all data types (dates, blobs, spatial, user-defined....).
Not sure what you mean by "store them somewhere". If you start thinking CSV or XML files, you'll need to escape various characters from VARCHAR2 columns.
If you are looking for some generic extract/unload routines, you should look at what is already available in the database or open-source/commercially.
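If you do write your own extract routine, plain JDBC metadata plus ResultSetMetaData is enough to pull every table generically. A rough sketch (no error handling; note it holds everything in memory, which won't work for thousands of large tables):

import java.sql.*;
import java.util.*;

public class GenericDump {
    public static Map<String, List<Map<String, Object>>> dumpAll(Connection conn)
            throws SQLException {
        Map<String, List<Map<String, Object>>> db = new HashMap<>();
        DatabaseMetaData meta = conn.getMetaData();
        try (ResultSet tables = meta.getTables(null, null, "%",
                new String[] {"TABLE"})) {
            while (tables.next()) {
                String table = tables.getString("TABLE_NAME");
                List<Map<String, Object>> rows = new ArrayList<>();
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT * FROM " + table)) {
                    ResultSetMetaData cols = rs.getMetaData();
                    while (rs.next()) {
                        Map<String, Object> row = new LinkedHashMap<>();
                        for (int i = 1; i <= cols.getColumnCount(); i++) {
                            row.put(cols.getColumnName(i), rs.getObject(i));
                        }
                        rows.add(row);
                    }
                }
                db.put(table, rows);
            }
        }
        return db;
    }
}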
MyBatis provides a pretty simple way to map data results to objects and back, maybe check that out?
http://code.google.com/p/mybatis/
Not to be flip, but for this task you might want to check out Ruby on Rails and its ActiveRecord approach.
