We have a large table of approximately 1 million rows, and a data file with millions of rows. We need to regularly merge a subset of the data in the text file into a database table.
The main reason it is slow is that the data in the file contains references to other JPA objects, meaning those JPA objects need to be read back for each row in the file. I.e., imagine we have 100,000 people and 1,000,000 asset objects:
Person object --> Asset list
Our application currently uses pure JPA for all of its data manipulation requirements. Is there an efficient way to do this using JPA/ORM methodologies, or am I going to need to fall back to pure SQL and vendor-specific commands?
Why not use the age-old technique of divide and conquer? Split the file into small chunks and then have parallel processes work on these small files concurrently.
And use the batch inserts/updates offered by JPA and Hibernate (more details here).
The ideal way, in my opinion, though, is to use the batch support provided by plain JDBC and commit at regular intervals.
You might also want to look at Spring Batch, as it provides splitting, parallelization, iterating through files, etc. out of the box. I have used all of these successfully for an application of considerable size.
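To illustrate the JPA/Hibernate batching idea, here is a minimal sketch that persists entities in chunks and flushes/clears the persistence context at regular intervals; the batch size is an arbitrary example value, and Hibernate only batches the inserts at the JDBC level if hibernate.jdbc.batch_size is also configured:

import javax.persistence.EntityManager;
import java.util.List;

public class BatchImport {

    private static final int BATCH_SIZE = 50; // keep in sync with hibernate.jdbc.batch_size

    /** Persists the given entities in chunks, flushing and clearing the
     *  persistence context every BATCH_SIZE rows so it does not grow without bound. */
    public static void persistInBatches(EntityManager em, List<?> entities) {
        em.getTransaction().begin();
        try {
            int i = 0;
            for (Object entity : entities) {
                em.persist(entity);
                if (++i % BATCH_SIZE == 0) {
                    em.flush();  // send the current batch of inserts to the database
                    em.clear();  // detach the flushed entities to free memory
                }
            }
            em.getTransaction().commit();
        } catch (RuntimeException e) {
            if (em.getTransaction().isActive()) {
                em.getTransaction().rollback();
            }
            throw e;
        }
    }
}

The flush/clear pair is what keeps the persistence context small; without it, every persisted entity stays managed until the commit.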
One possible answer, which is painfully slow, is to do the following:
For each line in the file:
    read the data line
    fetch the reference object
    check whether the data is already attached to the reference object
    if not, add the data to the reference object and persist it
So slow it is not worth considering.
Let's consider a scenario:
Accounts.csv
Transaction.csv
We have a mapping of each account number to transaction details, so one account number can have multiple transactions. Using these details we have to generate a PDF for each account.
Suppose the transaction CSV file is very large (>1 GB); loading and parsing all the details could be a memory issue. What would be the best approach to parse the transaction file? Reading it chunk by chunk also leads to memory consumption. Please advise.
As others have said, a database would be a good solution.
Alternatively, you could sort the two files on the account number. Most operating systems provide efficient file-sorting programs, e.g. on Linux (sorting on the 5th column):
LC_ALL=C sort -t, -k5 file.csv > sorted.csv
taken from Sorting csv file by 5th column using bash
You can then read the two files sequentially. Your programming logic is:
if (Accounts.accountNumber < Transaction.accountNumber) {
    read Accounts file
} else if (Accounts.accountNumber == Transaction.accountNumber) {
    process transaction
    read Transaction file
} else {
    read Transaction file
}
The memory requirements will be tiny; you only need to hold one record from each file in memory.
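Here is a minimal Java sketch of that merge step, assuming both files are already sorted by account number and that the account number is the first comma-separated field (adjust the key extraction to your real column layout):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SortedMerge {

    public static void merge(String accountsCsv, String transactionsCsv) throws IOException {
        try (BufferedReader accounts = Files.newBufferedReader(Paths.get(accountsCsv));
             BufferedReader transactions = Files.newBufferedReader(Paths.get(transactionsCsv))) {

            String account = accounts.readLine();
            String transaction = transactions.readLine();

            while (account != null && transaction != null) {
                int cmp = key(account).compareTo(key(transaction));
                if (cmp < 0) {
                    account = accounts.readLine();          // this account has no more transactions
                } else if (cmp == 0) {
                    process(account, transaction);          // matching pair
                    transaction = transactions.readLine();
                } else {
                    transaction = transactions.readLine();  // transaction with no matching account
                }
            }
        }
    }

    // Assumes the account number is the first CSV column.
    private static String key(String csvLine) {
        int comma = csvLine.indexOf(',');
        return comma < 0 ? csvLine : csvLine.substring(0, comma);
    }

    private static void process(String accountLine, String transactionLine) {
        // accumulate the transaction for this account's PDF here
    }
}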
Let's say you are using Oracle as the database.
You could load the data into the corresponding tables using the Oracle SQL*Loader tool.
Once the data is loaded, you could use simple SQL queries to join and query the data from the loaded tables.
This will work with all types of databases, but you will need to find the appropriate tool for loading the data.
Of course, importing the data into a database first would be the most elegant way.
Besides that, your question leaves the impression that this isn't an option.
So I recommend reading transactions.csv line by line (for instance with a BufferedReader). Because in CSV format each line is a record, you can filter out (while reading) and forget every record that is not for the current account.
After one traversal of the file you have all transactions for one account, and that should usually fit into memory.
A downside of this approach is that you end up reading the transactions multiple times, once for each account's PDF generation. But if your application needed to be highly optimized, I suspect you would already be using a database.
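A minimal sketch of that per-account filter, assuming (hypothetically) that the account number is the first comma-separated column of transactions.csv:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TransactionFilter {

    /** Collects only the lines of transactions.csv that belong to the given account. */
    public static List<String> transactionsFor(String transactionsCsv, String accountNumber)
            throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(transactionsCsv))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // assumes the account number is the first CSV column
                if (line.startsWith(accountNumber + ",")) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }
}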
I need to permanently store a big vocabulary and associate some information with each word (and use it to search for words efficiently).
Is it better to store it in a DB (in a simple table and let the DBMS do the work of structuring the data based on the key), or is it better to create a
trie data structure and serialize it to a file, deserializing it once when the program starts, or maybe use an XML file instead of serialization?
Edit: the vocabulary would be on the order of 5 to 10 thousand words, and for each word the metadata is structured as an array of 10 Integers. Access to the words is very frequent (this is why I thought of a trie data structure, which has a search time of ~O(1), instead of a DB that uses a B-tree or something like that where the search is ~O(log n)).
p.s. using java.
Thanks!
Using a DB is better.
Many products have migrated to a DB; for example, the ERP Divalto used to rely on serialization and has now moved to a DB for performance.
You have many choices of DBMS. If you want to keep all the data in one file, the simple way is to use SQLite; its advantage is that it does not need any DBMS server running.
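For illustration, a minimal sketch of the SQLite route over JDBC, assuming the Xerial sqlite-jdbc driver is on the classpath; the table layout and the encoding of the 10-integer metadata are made up for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class VocabularyStore {

    public static void main(String[] args) throws SQLException {
        // A single local file, no database server needed.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:vocabulary.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS vocabulary ("
                         + "word TEXT PRIMARY KEY, "   // indexed key for fast lookups
                         + "metadata TEXT)");          // e.g. the 10 integers, comma separated
            }

            try (PreparedStatement insert =
                     conn.prepareStatement("INSERT OR REPLACE INTO vocabulary VALUES (?, ?)")) {
                insert.setString(1, "example");
                insert.setString(2, "1,2,3,4,5,6,7,8,9,10");
                insert.executeUpdate();
            }

            try (PreparedStatement query =
                     conn.prepareStatement("SELECT metadata FROM vocabulary WHERE word = ?")) {
                query.setString(1, "example");
                try (ResultSet rs = query.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString("metadata"));
                    }
                }
            }
        }
    }
}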
I am working on a project that involves parsing through a LARGE amount of data rapidly. Currently this data is on disk and broken down into a directory hierarchy:
(Folder: DataSource) -> (Files: Day1, Day2, Day3...Day1000...)
(Folder: DataSource2) -> (Files: Day1, Day2, Day3...Day1000...)
...
(Folder: DataSource1000) -> ...
...
Each Day file consists of entries that need to be accessed very quickly.
My initial plan was to use traditional file I/O in Java to access these files, but upon further reading I began to fear that this might be too slow.
In short, what is the fastest way I can selectively load entries from my filesystem from varying DataSources and Days?
The issue could be solved both ways, but it depends on a few factors.
Go for file I/O (see the sketch after this list):
if the volume is less than millions of rows
if you don't do complicated queries, as Jon Skeet said
if your reference for fetching a row is the folder name "DataSource" used as the key
Go for a DB:
if you see your program reading through millions of records
if you can do complicated selections, even fetching multiple rows with a single select
if you have the knowledge to create a basic table structure for the DB
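For the file I/O route, here is a minimal sketch of selectively loading entries by data source and day, assuming the directory layout described in the question (one folder per DataSource, one file per Day, one entry per line; path and method names are placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class DayFileReader {

    private final Path root;  // the directory that contains DataSource1, DataSource2, ...

    public DayFileReader(Path root) {
        this.root = root;
    }

    /** Reads all entries (assumed one per line) for the given data source and day. */
    public List<String> readEntries(String dataSource, int day) throws IOException {
        Path dayFile = root.resolve(dataSource).resolve("Day" + day);
        return Files.readAllLines(dayFile);
    }
}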
Depending on the architecture you are using, you can implement different kinds of caching. In JBoss there is the built-in JBoss Cache, and there is also third-party open-source software that provides caching, like Redis or EhCache, depending on your needs. Basically, a cache stores objects in memory; some are passivated/activated on demand, and when memory is exhausted entries are written to a physical file on disk, from which the caching mechanism can easily reactivate and unmarshal them. This lowers the database connectivity held by your program. There are other caches, but here are some of the ones I've worked with:
JBoss Cache: http://www.jboss.org/jbosscache/
Redis: http://redis.io/
EhCache: http://ehcache.org/
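As an example of the caching idea, here is a minimal EhCache sketch, assuming the classic net.sf.ehcache 2.x API is on the classpath; the cache name, sizes, and key are made up:

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EntryCache {

    public static void main(String[] args) {
        CacheManager manager = CacheManager.getInstance();

        // name, max in-memory elements, overflow to disk, eternal, TTL seconds, TTI seconds
        Cache cache = new Cache("dayEntries", 10_000, true, false, 3600, 1800);
        manager.addCache(cache);

        cache.put(new Element("DataSource1/Day42", "parsed entries for that day"));

        Element hit = cache.get("DataSource1/Day42");
        if (hit != null) {
            System.out.println(hit.getObjectValue());  // avoids re-reading the file or DB
        }

        manager.shutdown();
    }
}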
what is the fastest way I can selectively load entries from my filesystem from varying DataSources and Days?
Selectively means filtering, so my answer is a localhost database. Generally speaking, if you filter, sort, paginate, or extract distinct records from a large number of records, it's hard to beat a localhost SQL server. You get a query optimizer (nobody does that in Java), a cache (which requires effort in Java, especially the invalidation), database indexes (I haven't seen that done in Java either), etc. It's possible to implement these things manually, but then you are writing a database in Java.
On top of this you gain access to higher-level SQL functions like window aggregates etc., so in most cases there is no need to post-process data in Java.
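To make this concrete, here is a small sketch using an embedded H2 database over JDBC; the schema is hypothetical, and it assumes a recent H2 version that supports window functions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class LocalDbExample {

    public static void main(String[] args) throws SQLException {
        // In-memory database; use "jdbc:h2:./entries" for an on-disk file instead.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:entries");
             Statement st = conn.createStatement()) {

            st.execute("CREATE TABLE entry (src VARCHAR(50), day_no INT, amount DOUBLE)");
            st.execute("CREATE INDEX idx_entry ON entry (src, day_no)");  // the index does the selective part

            st.execute("INSERT INTO entry VALUES ('DataSource1', 1, 10.0), ('DataSource1', 2, 12.5)");

            // A window aggregate: each row plus a running total per source, no Java post-processing.
            try (ResultSet rs = st.executeQuery(
                    "SELECT src, day_no, amount, "
                  + "SUM(amount) OVER (PARTITION BY src ORDER BY day_no) AS running_total "
                  + "FROM entry WHERE src = 'DataSource1' ORDER BY day_no")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("day_no") + " -> " + rs.getDouble("running_total"));
                }
            }
        }
    }
}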
I am trying to import data from a spreadsheet into a database using Java. There are two ways I could do this: 1) Read and extract the data from the spreadsheets and organize it into data structures, such as ArrayLists, Vectors, or maps of different objects, so that I can get rid of redundant entries etc., then write the data structures into the database. 2) Extract the data and put it into the database directly as the cells are read. I think the first way is probably better, but would the second way be faster? Any other considerations I should think of?
Thanks.
You would want to use executeBatch() here, which is similar to approach #1. Basically you read data from the spreadsheet up to a batch size (e.g. 1,000 records) and then commit the transaction to the DB one batch at a time. After that, move on to the next batch, and so on. With this approach you use the database efficiently, save yourself network round trips, and also do not end up hoarding a lot of data in memory, which could lead to out-of-memory exceptions. You should also re-use the same connection and prepared statement objects.
Regarding the data clean-up process, you should definitely sanitize your data before putting it into persistent storage such as a table. You may need to generate reports or use the data in other applications in the future, so having clean and well-structured tables will help you in the long run. For batch applications, the performance requirements are usually not as high as for transactional systems.
You should also use a helper library like Apache POI for reading Excel documents. As far as the data structure is concerned, it will depend on your data, but generally an ArrayList should suffice here.
Another point to consider is that typically most ETL tools offer these kinds of data-loading tasks out of the box. If your situation allows for it, I highly recommend looking at an ETL tool like Kettle to load the data. You may be able to save yourself some time and learn a new tool.
Hope this helps!
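A rough sketch combining the two suggestions, reading rows with Apache POI and inserting them with JDBC batches; the spreadsheet layout, table, and embedded H2 connection URL are assumptions for the example:

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class SpreadsheetImporter {

    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(new File("data.xlsx"));
             Connection conn = DriverManager.getConnection("jdbc:h2:./importdb")) {

            try (Statement ddl = conn.createStatement()) {
                ddl.execute("CREATE TABLE IF NOT EXISTS person (name VARCHAR(255), age INT)");
            }

            conn.setAutoCommit(false);
            Sheet sheet = workbook.getSheetAt(0);
            int count = 0;

            try (PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO person (name, age) VALUES (?, ?)")) {
                for (Row row : sheet) {
                    if (row.getRowNum() == 0) {
                        continue;                           // skip the header row
                    }
                    insert.setString(1, row.getCell(0).getStringCellValue());
                    insert.setInt(2, (int) row.getCell(1).getNumericCellValue());
                    insert.addBatch();

                    if (++count % BATCH_SIZE == 0) {
                        insert.executeBatch();              // one round trip per batch
                        conn.commit();
                    }
                }
                insert.executeBatch();                      // flush the final partial batch
                conn.commit();
            }
        }
    }
}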
You can consider using an ETL tool (Extraction, Transformation and Loading) for the kind of task you are referring to.
My Java application uses a read-only lookup table, which is stored in an XML file. When the application starts it just reads the file into a HashMap. So far, so good, but since the table is growing I don't like loading the entire table into the memory at once. RDBMS and NoSQL key-value stores seem overkill to me. What would you suggest?
Makes you wish Java would let you allocate infinite amounts of heap as a memory-mapped file :-)
If you use Java 5, then use Java DB; it's a database engine written in Java, based on Apache Derby. If you know SQL, then setting up an embedded database takes only a couple of minutes. Since you can create the database again every time your app is started, you don't have to worry about permissions, DB schema migration, stale caches, etc.
Or you could use an OO database like db4o, but many people find it hard to make the mental transition to using queries to iterate over internal data structures. To take your example: you have a huge HashMap. Instead of using map.get(), you have to build a db4o query and then run that query on your map to locate items; otherwise db4o would be forced to load the whole map at once.
Another alternative is to create your own minimal system: Read the data from the XML file and save it as a large random access file plus an index + caching so you can quickly look up items. If your objects are all serializable, then you can use ObjectInputStream to read the individual entries after seeking to the right place using the RandomAccessFile.
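A rough sketch of that idea, writing each entry as an independently serialized block into a random access file and keeping only a key-to-offset index in memory (class and file names are placeholders):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.RandomAccessFile;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class IndexedStore implements AutoCloseable {

    private final RandomAccessFile file;
    private final Map<String, Long> index = new HashMap<>();  // key -> offset; only this stays in memory

    public IndexedStore(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    /** Appends one entry as [length][serialized bytes] and records its offset. */
    public void put(String key, Serializable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        byte[] bytes = buffer.toByteArray();

        long offset = file.length();
        file.seek(offset);
        file.writeInt(bytes.length);
        file.write(bytes);
        index.put(key, offset);
    }

    /** Seeks to the recorded offset and deserializes just that one entry. */
    public Object get(String key) throws IOException, ClassNotFoundException {
        Long offset = index.get(key);
        if (offset == null) {
            return null;
        }
        file.seek(offset);
        byte[] bytes = new byte[file.readInt()];
        file.readFully(bytes);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}

The index itself could be persisted alongside the data file or rebuilt by scanning the file once at startup.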