I just designed a PostgreSQL database and need to choose a way of populating it with data. The data consists of txt and csv files, but can generally be any type of character-delimited file. I'm writing a Java program to give the data a consistent structure (there are lots of different kinds of files, and I need to work out what each column of a file represents so I can associate it with a column of my DB). I thought of two ways:
Convert all the files into one common format (JSON) and then have the DB regularly check the JSON file and import its contents.
Connect directly to the database via JDBC and send the strings to the DB (I still need to create a backup file containing what was inserted into the DB, so in both cases a file is created and written to).
Which would you go with in terms of time efficiency? I'm tempted by the first one, as it seems easier to handle a JSON file in the DB.
If you have any other suggestion that would also be welcome!
JSON or CSV
If you have the liberty of converting your data to either CSV or JSON format, CSV is the one to choose. This is because you will then be able to use COPY FROM to bulk load large amounts of data at once into PostgreSQL.
CSV is supported by COPY but JSON is not.
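For the Java side, a minimal sketch of bulk loading a CSV file through the PostgreSQL JDBC driver's CopyManager (the connection details and the table name mytable are placeholders, not from the question):

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvBulkLoad {
    public static void main(String[] args) throws Exception {
        // Connection URL, credentials and table name are placeholders; adjust to your setup.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
            try (Reader reader = new FileReader("data.csv")) {
                // Streams the whole CSV file through a single COPY command.
                long rowsLoaded = copyManager.copyIn(
                        "COPY mytable FROM STDIN WITH (FORMAT csv, HEADER)", reader);
                System.out.println("Loaded " + rowsLoaded + " rows");
            }
        }
    }
}

COPY pushes the entire file through one command, which is typically much faster than row-by-row INSERTs.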
Directly inserting values.
This is the approach to take if you only need to insert a few (or maybe even a few thousand) records, but it is not suited to large numbers of records because it will be slow.
If you choose this approach you can create the backup using COPY TO. However, if you feel that you need to create the backup file with your Java code, choosing CSV as the format means you would be able to bulk load it as discussed above.
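If you let the database produce the backup, a hedged sketch of COPY TO through the same CopyManager (again with placeholder table and file names, and a connection opened as in the earlier sketch) would be:

import java.io.FileWriter;
import java.io.Writer;
import java.sql.Connection;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvBackup {
    // Assumes 'conn' is an open PostgreSQL JDBC connection.
    static void backup(Connection conn) throws Exception {
        CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
        try (Writer writer = new FileWriter("backup.csv")) {
            // Dumps the table as CSV; the resulting file can be re-loaded later with COPY FROM.
            copyManager.copyOut("COPY mytable TO STDOUT WITH (FORMAT csv, HEADER)", writer);
        }
    }
}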
Related
I'm a Java beginner and want to learn how to read in files and store data in a way that makes it easy to manipulate.
I have a pretty big CSV file (18,000 rows). The data represents the assortment of all the different beverages sold by a liquor shop. It has around 16 columns, with headers like "article number", "name", "producer", "amount of alcohol", etc. The columns are separated by "\t".
I now want to do some searching in this file to find things like how many products are produced in Sweden, or the most expensive liquor per liter.
Since I really want to learn how to program and not just find the answer, I'm not looking for exact code here. I'm instead looking for the pseudo-code behind this, a good way of thinking when dealing with large sets of data, and what kinds of data structures are best suited to the task.
Let's take the "How many products are from Sweden" example.
Since the data consists of strings, ints and floats, I can't put everything in a single list. What is the best way of storing it so it can be manipulated later? Or can I compute the answer as soon as it's parsed, so maybe I don't have to store it at all?
If you're new to Java and programming in general, I'd recommend a library to help you view and use your data without getting into databases and learning SQL. One that I've used in the past is Commons CSV.
https://commons.apache.org/proper/commons-csv/user-guide.html#Parsing_files
It lets you easily parse a whole CSV file into CSVRecord objects. For example:
import java.io.FileReader;
import java.io.Reader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

// Named access like record.get("Last Name") requires treating the first row as a header.
Reader in = new FileReader("path/to/file.csv");
Iterable<CSVRecord> records = CSVFormat.EXCEL.withFirstRecordAsHeader().parse(in);
for (CSVRecord record : records) {
    String lastName = record.get("Last Name");
    String firstName = record.get("First Name");
}
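Since the file in the question is tab-separated rather than comma-separated, Commons CSV's predefined tab-delimited format (CSVFormat.TDF) is probably a better starting point than CSVFormat.EXCEL. And for the "how many products are from Sweden" question you don't necessarily need to store anything: you can simply increment a counter inside the loop whenever the relevant country column equals "Sweden".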
If you have a CSV file in particular, then you may use a database to store this data.
Go through this link to learn how to read a CSV in Java.
Make use of an ORM framework like Hibernate along with a Spring application. Use this link to create the application.
By using this you can create queries to fetch data like "How many products are from Sweden" and make use of the Collection framework. This link shows how to use HQL queries in the same application (see the sketch after this answer).
Create JSP pages to show the results in the UI.
Sorry for my English.
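A minimal, hedged sketch of the kind of HQL count query that step describes, assuming a configured SessionFactory and a mapped Beverage entity with a country field (the entity and field names are hypothetical):

import org.hibernate.Session;
import org.hibernate.SessionFactory;

public class SwedenCounter {
    // Counts mapped Beverage rows whose (hypothetical) country field is "Sweden".
    static long countSwedishProducts(SessionFactory sessionFactory) {
        Session session = sessionFactory.openSession();
        try {
            return session.createQuery(
                    "select count(b) from Beverage b where b.country = :country", Long.class)
                .setParameter("country", "Sweden")
                .uniqueResult();
        } finally {
            session.close();
        }
    }
}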
It seems you are looking for an in-memory SQL-like engine over your CSV file. I would suggest using CQEngine, which provides an indexed view on top of the Java collections framework with SQL-like queries.
You are basically treating a Java collection as a database table. Assuming that each CSV line maps to some POJO class like Beverage:
import com.googlecode.cqengine.ConcurrentIndexedCollection;
import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.index.navigable.NavigableIndex;

IndexedCollection<Beverage> table = new ConcurrentIndexedCollection<Beverage>();
// Index on an attribute defined on the Beverage class (e.g. its id) for fast lookups.
table.addIndex(NavigableIndex.onAttribute(Beverage.BEVERAGE_ID));
table.add(new Beverage(...));
table.add(new Beverage(...));
table.add(new Beverage(...));
What you need to do now is read the CSV file, load it into the IndexedCollection, and build a proper index on some fields. After that, you can query the table like a usual SQL database. At the end, write the collection back out to a new CSV file (if you made any modifications).
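Continuing from the snippet above, a hedged sketch of what such a query could look like with a recent CQEngine version, assuming a Beverage attribute for the country column (the attribute, getter and field names here are hypothetical):

import static com.googlecode.cqengine.query.QueryFactory.equal;
import com.googlecode.cqengine.attribute.Attribute;
import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.index.hash.HashIndex;
import com.googlecode.cqengine.query.option.QueryOptions;
import com.googlecode.cqengine.resultset.ResultSet;

// Hypothetical attribute defined on the Beverage class, mapping to its country field.
public static final Attribute<Beverage, String> COUNTRY =
        new SimpleAttribute<Beverage, String>("country") {
            public String getValue(Beverage beverage, QueryOptions queryOptions) {
                return beverage.getCountry();
            }
        };

// Build an index on the country attribute, then count the Swedish products.
table.addIndex(HashIndex.onAttribute(COUNTRY));
ResultSet<Beverage> swedish = table.retrieve(equal(COUNTRY, "Sweden"));
int howManyFromSweden = swedish.size();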
I need to permanently store a big vocabulary and associate some information with each word (and use it to search for words efficiently).
Is it better to store it in a DB (in a simple table, letting the DBMS do the work of structuring the data based on the key), or is it better to create a trie data structure and then serialize it to a file and deserialize it once the program starts, or maybe use an XML file instead of serialization?
Edit: the vocabulary would be on the order of 5 thousand to 10 thousand words, and for each word the metadata is structured as an array of 10 Integers. Access to the words is very frequent (this is why I thought of a trie data structure, which has roughly O(1) search time, instead of a DB that uses a B-tree or something like that, where the search is ~O(log n)).
P.S. I'm using Java.
Thanks!
Using a DB is better.
Many products have migrated to a DB; for example, the Divalto ERP used serialization and has now migrated to a DB to get better performance.
You have many choices of DBMS. If you want to keep all the data in one file, the simple way is to use SQLite. Its advantage is that it doesn't need any DBMS server running.
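A minimal sketch of that approach with the sqlite-jdbc driver on the classpath, assuming a words table with the word as primary key and the 10 integers stored in a text column (the schema and file name are just illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class VocabularyStore {
    public static void main(String[] args) throws Exception {
        // A single-file database; no server process needed.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:vocabulary.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, metadata TEXT)");
            }
            try (PreparedStatement insert =
                     conn.prepareStatement("INSERT OR REPLACE INTO words (word, metadata) VALUES (?, ?)")) {
                insert.setString(1, "example");
                insert.setString(2, "1,2,3,4,5,6,7,8,9,10"); // the 10 integers, serialized simply
                insert.executeUpdate();
            }
            try (PreparedStatement query =
                     conn.prepareStatement("SELECT metadata FROM words WHERE word = ?")) {
                query.setString(1, "example");
                try (ResultSet rs = query.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString("metadata"));
                    }
                }
            }
        }
    }
}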
I have a program that creates multiple text files of RDF triples. I need to compare the triples, and do it fast; what is the best way to do this? I thought of putting the triples into an array and comparing them, but there could potentially be hundreds of thousands of triples per file and that would take forever. I need it to be as close to real time as possible, since the triples will be generated constantly among the files. Any help would be great. The files are also in AllegroGraph repositories, if it's easier to compare them there somehow.
A thought: if I stored the triples in Excel (one triple per row, and one sheet per repository):
A: How could I find the duplicates among the sheets?
B: Would it be fast?
C: How could I automate that from Java?
You need to build a master index that stores each triple, how many files it appears in, and the exact file name and location of the triple within each file. You can then search the master index to answer queries in real time.
As you update, delete or create RDF files, you need to update the master index.
You need to store the master index so that it can be updated and searched efficiently.
A simple choice would be to use a relational database (like MySQL) to store the master index. It can answer queries like finding common triples with a simple select statement: select * from rdfindex where triplecount > 2.
EDIT: You cannot store hundreds of thousands of triples in memory using a HashMap or similar data structure. That's why I suggested using a database, which can store the data and respond to your queries efficiently. You can look at an embedded database like SQLite to store the data.
Read up on these topics:
How to create an SQLite database, create tables, access tables, etc. Create a simple table to store triple, triplecount and filenames.
Convert all your Excel files to CSV files. You can use opencsv to parse the files in Java (check out the samples that come with opencsv).
Parse the CSV files and load the data into SQLite. If the triple is already in the database, just update the count; if not, insert the triple (sketched below).
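A hedged sketch of that load-and-count step, assuming the sqlite-jdbc driver and opencsv are on the classpath and an rdfindex table with triple, triplecount and filenames columns (all names here are illustrative):

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import com.opencsv.CSVReader;

public class TripleIndexLoader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:rdfindex.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS rdfindex "
                         + "(triple TEXT PRIMARY KEY, triplecount INTEGER, filenames TEXT)");
            }
            loadFile(conn, "repository1.csv");
        }
    }

    static void loadFile(Connection conn, String csvFile) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader(csvFile));
             PreparedStatement select = conn.prepareStatement(
                 "SELECT triplecount, filenames FROM rdfindex WHERE triple = ?");
             PreparedStatement insert = conn.prepareStatement(
                 "INSERT INTO rdfindex (triple, triplecount, filenames) VALUES (?, 1, ?)");
             PreparedStatement update = conn.prepareStatement(
                 "UPDATE rdfindex SET triplecount = triplecount + 1, filenames = ? WHERE triple = ?")) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                // Assumes subject, predicate and object are the first three columns.
                String triple = row[0] + " " + row[1] + " " + row[2];
                select.setString(1, triple);
                try (ResultSet rs = select.executeQuery()) {
                    if (rs.next()) {
                        // Triple already indexed: bump the count and append this file name.
                        update.setString(1, rs.getString("filenames") + "," + csvFile);
                        update.setString(2, triple);
                        update.executeUpdate();
                    } else {
                        insert.setString(1, triple);
                        insert.setString(2, csvFile);
                        insert.executeUpdate();
                    }
                }
            }
        }
    }
}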
As far as I know there is a function to delete duplicate entries in AllegroGraph; this may be an option if all the triples come from there.
Problem
We have XML data like the following (it contains some non-Unicode characters that need to be filtered out):
<row><div>1234</div><dept>ABCD</dept></row>
<row><div>5678</div><dept>EFGH</dept></row>
I'm only showing 2 column tags for ease of understanding; actually there are more than 20 column tags in each row.
The XML data is inserted directly as records into an Oracle schema table as:
div_c   qdept
1234    ABCD
5678    EFGH
More information
The XML file is more than 9 GB and is available over FTP.
Database table column names might be different from XML column tag names.
We might have to add/define some rules to filter out rows.
Question
What would be an appropriate way to parse this huge XML and find out whether each record already exists in the database table? Are there any open-source tools available for this?
What I Am Trying
Wrote a StAX parser using the default implementation (XMLInputFactory) with an invalid-character filter (FilterReader)
Planning to split the XML into chunks
Have concurrent threads process each of the chunks
Each thread will generate a query to check whether that record exists in the database or not (I know it's absurd)
Have a connection pool created and have each thread execute those queries
I know what I am doing is really bad and will take ages to complete. I really need some advice on this, for example whether to go with an ORM to make the checking easy and the XML parsing fast.
Some suggestions like that would really help me.
Yeah, I think you were right to use StAX. You definitely want to stream, and StAX seems to have the simplest API for streaming XML. I wouldn't go to an ORM right away. Most ORM is for round-tripping data: it saves you work on mechanical transformations, which makes it good when you have very structured data but the mapping between the two schemas is not very complicated. Here you are trying to import data from one format into another. It sounds like your large dataset has a fairly simple schema but the mapping is the more complicated part, so go with custom code. Pawel's suggestion of the temporary table sounds good. Try to do as much processing as you can in stored procedures that operate on the whole dataset at once (old and imported); you don't want to keep transferring those rows back and forth between the database and your app.
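A hedged sketch of the streaming side, assuming the full file has a single enclosing root element and each record is a <row> element with child tags named after the columns (tag, file and method names are placeholders; your invalid-character FilterReader would wrap the file reader):

import java.io.FileReader;
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class RowStreamer {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (Reader fileReader = new FileReader("huge.xml")) {
            // Wrap fileReader with the invalid-character FilterReader here if needed.
            XMLStreamReader reader = factory.createXMLStreamReader(fileReader);
            String div = null, dept = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("div".equals(name)) {
                        div = reader.getElementText();
                    } else if ("dept".equals(name)) {
                        dept = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "row".equals(reader.getLocalName())) {
                    // One complete record: hand it off, e.g. batch-insert into a staging table.
                    handleRow(div, dept);
                    div = null;
                    dept = null;
                }
            }
            reader.close();
        }
    }

    static void handleRow(String div, String dept) {
        // Placeholder: collect rows and flush them to the staging table in batches.
        System.out.println(div + " -> " + dept);
    }
}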
We have a large table of approximately 1 million rows, and a data file with millions of rows. We need to regularly merge a subset of the data in the text file into a database table.
The main reason it is slow is that the data in the file has references to other JPA objects, meaning the other JPA objects need to be read back for each row in the file. I.e. imagine we have 100,000 people and 1,000,000 asset objects:
Person object --> Asset list
Our application currently uses pure JPA for all of its data manipulation requirements. Is there an efficient way to do this using JPA/ORM methodologies, or am I going to need to revert to pure SQL and vendor-specific commands?
Why not use the age-old technique of divide and conquer? Split the file into small chunks and then have parallel processes work on those small files concurrently.
And use the batch inserts/updates offered by JPA and Hibernate; more details here.
The ideal way, in my opinion, though, is to use the batch support provided by plain JDBC and commit at regular intervals.
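A minimal sketch of that plain-JDBC batching, assuming a tab-delimited input file, a person_asset target table and a commit every 1,000 rows (the connection URL, table, column names and batch size are all illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder connection URL and credentials; use your own database's values.
        try (Connection conn = DriverManager.getConnection("jdbc:...", "user", "password");
             BufferedReader file = new BufferedReader(new FileReader("data.txt"));
             PreparedStatement insert = conn.prepareStatement(
                 "INSERT INTO person_asset (person_id, asset_id) VALUES (?, ?)")) {
            conn.setAutoCommit(false);
            int pending = 0;
            String line;
            while ((line = file.readLine()) != null) {
                String[] cols = line.split("\t"); // assumes a tab-delimited file
                insert.setLong(1, Long.parseLong(cols[0]));
                insert.setLong(2, Long.parseLong(cols[1]));
                insert.addBatch();
                if (++pending % 1000 == 0) {
                    insert.executeBatch(); // send the accumulated statements
                    conn.commit();         // commit at regular intervals
                }
            }
            insert.executeBatch();
            conn.commit();
        }
    }
}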
You might also want to look at Spring Batch, as it provides splitting/parallelization/iterating through files, etc. out of the box. I have used all of these successfully in an application of considerable size.
One possible answer, which is painfully slow, is to do the following:
For each line in the file:
Read data line
fetch reference object
check if data is attached to reference object
if not add data to reference object and persist
So slow it is not worth considering.