I am trying to import data from speadsheet into a database using Java. There are two ways that I could do this: 1) Read and extract the data from speardsheets and organize them into data structures, such as ArrayLists, Vectors or maps of different objects, so that I could get rid of redundant entries etc, then write the data structures into the database. 2) Extract the data and put them into the database directly as the cells are read and extracted. I think the first way is probably better but would the second way be faster? Any other considerations i should think of?
Thank.
You would want to do a executeBatch() here which is similar to approach #1. So basically you read data from the spread sheet for a batch size (ie. 1000 records) and then you do a commit for transactions a batch at a time to the DB. After that move on to the next batch and so on and so forth. With this approach you utilize database efficiently, save yourself network trips, and also you do not end up hoarding a lot of data in memory which could lead to out of memory exceptions. You should also re-use the same connection and prepared statement objects.
Regarding the data clean up process, you should definitely sanitize your data before putting into a persistent storage such as a table. You may need to generate reports or use the data in other applications in the future, so having clean & well structured tables will help you in the long run. For batch applications, usually the performance requirements are not as high as the transactional systems.
You should also utilize a helper library like apache poi for reading excel documents. As far as the data structure is concerned it will depend on your data, but generally an ArrayList should suffice here.
Another point you might consider is that ypically most ETL tools offer these kinds of data loading tasks out of the box. If your situation allows for it, I highly recommend looking at an ETL tool like Kettle to load the data. You may be able to save yourself some time and learn a new tool.
Hope this helps!
You can consider using an ETL tools (Extraction, Transformation and Loading) for the kind of task you are referring
Related
I have a requirement to store CSV data in an Oracle database for later retrieval by dynamic query scripts. The data needs to be stored such that any column of the CSV data can be queried using SQL and performance is key (some CSV files are 100k+ lines).
The content of the CSV files (number of columns, headings, data types) is not known ahead of time and the system needs to be able to handle multiple file structures (which are added to a config file so the system knows how to read them, by people who don't know SQL).
My current solution, in order to avoid an EAV model, is to have my code create new tables every time a new CSV structure is added to the config file. I'm curious to know if there is a better way to achieve what I'm trying to do. I'm not particularly fond of having my code create new tables in production at run-time.
The system is written in groovy, in case it matters.
I am inclined to go with your current solution, which is a separate table for each type. Somehow, I'm most comfortable with storing data in well-defined tables with well-defined types.
An EAV (entity-attribute-value) solution is also viable. With 100k rows of data, the EAV solution should perform pretty well, unless you have lots of tables. One downside is the types of the columns. Without a lot of extra work, you are pretty much limited to strings for all the values.
Oracle does offer another possibility, which is an XML solution. This can give you the flexibility of dynamic column names along with the "simplicity" of not having to define a separate table for each one. You can read more about it in the documentation here.
It comes down to what you want to model. If you need to handle adhoc queries against any of the columns in the CSV file, then I guess you need to model them all as Oracle columns. If you need to only retrieve a whole line based on a particular key, then you could model as two columns: the key and the line. If you need to model the individual columsn that such a thing would not be in first normal form.
When you create an EAV model, you are making a flexible system that allows for additional columns to be added/removed easily. Oracle is already a flexible system that allows for additional columns to be added/removed easily. They've just put more thought into locking, performance, scalability and tool support that your naive EAV model might have.
Overall, I think what you are probably doing is best. It's not an easy problem and it's not exactly what Oracle was designed for so you might have issues with statistics and which indexes to create and so on.
I am working on a project that involves parsing through a LARGE amount of data rapidly. Currently this data is on disk and broken down into a directory hierarchy:
(Folder: DataSource) -> (Files: Day1, Day2, Day3...Day1000...)
(Folder: DataSource2) -> (Files: Day1, Day2, Day3...Day1000...)
...
(Folder: DataSource1000) -> ...
...
Each Day file consists of entries that need to be accessed very quickly.
My initial plans were to use traditional FileIO in java to access these files, but upon further reading, I began to fear that this might be too slow.
In short, what is the fastest way I can selectively load entries from my filesystem from varying DataSources and Days?
The issue could be solved both ways but it depends on few factors
go for FileIO.
if the volume is < millons of rows
if your dont do a complicated query like Jon Skeet said
if your referance for fetching the row is by using hte Folder Name: "DataSource" as the key
go for DB
if you see your program reading through millions of records
you can do complicated selection, even multiple rows using a single select.
if you have knowledge of creating a basic table structure for DB
Depending on architecture you are using you can implement different ways of caching, in the Jboss there is a built-in Jboss Caching, there are also third party opensource software that lets utilizes caching, like Redis, or EhCache depending on your needs. Basically Caching stores objects in their memory, some are passivated/activated upon demand, when memory is exhausted it is stored as a physical IO file, which are also easily activated marshalled by the caching mechanism. It lowers the database connectivity held by your program. There are other caches but here are some of them that I've worked with:
Jboss:http://www.jboss.org/jbosscache/
Redis:http://redis.io/
EhCache:http://ehcache.org/
what is the fastest way I can selectively load entries from my filesystem from varying DataSources and Days?
selectively means filtering, so my answer is a localhost database. Generally speaking if you filter, sort, paginate or extract distinct records from a large number of records, it's hard to beat a localhost SQL server. You get a query optimizer (nobody does that Java), a cache (which requires effort in Java, especially the invalidation), database indexes (have not seen that being done in Java either) etc. It's possible to implement these things manually, but then your are writing a database in Java.
On top of this you gain access to higher level SQL functions like window aggegrates etc., so in most cases there is no need to post-process data in Java.
Hy,
I have to make a query on a Facebook table witch will return xml information about number of likes and so on.I have to keep those info in database or in a xml file on the disc and every day at a certain hour i have to update those dates.
How can i make those updates at a certain time?
If the information is very large and i can't store in database,can i store it in a xml file?
If the amount of data or its volume would prove troublesome for a database, you certainly won't benefit from using XML for storage! Quite the contrary. Find out if perhaps your database supports XML as a column type. If it does, it might supply XPath-based indexing and maybe even updates. If you get the info as XML, maybe some manner of bridging the XML to relational DB gap would be of use. Using EclipseLink for persistance would provide an excellent bridge in the form of using JAXB together with JPA.
As for scheduled updates, maybe try to find out if you always need all the info or just a subset. Even if you can't request partial data, maybe filtering out some stuff you don't need (like with an XSLT transform) could reduce memory footprint and processing time further down the line. Using JPA entities would certainly make synching and updates easier.
My Java application uses a read-only lookup table, which is stored in an XML file. When the application starts it just reads the file into a HashMap. So far, so good, but since the table is growing I don't like loading the entire table into the memory at once. RDBMS and NoSQL key-value stores seem overkill to me. What would you suggest?
Makes you wish Java would allow to allocate infinite amounts of heap as memory mapped file :-)
If you use Java 5, then use Java DB; it's a database engine written in Java, based on Apache Derby. If you know SQL, then setting up an embedded database takes only a couple of minutes. Since you can create the database again every time your app is started, you don't have to worry about permissions, DB schema migration, stale caches, etc.
Or you could use an OO database like db4o but many people find it hard to make the mental transition to use queries to iterate over internal data structures. To take your example: You have a huge HashMap. Instead of using map.get(), you have to build a query using DB4o and then run that query on your map to locate items; otherwise DB4o would be forced to load the whole map at once.
Another alternative is to create your own minimal system: Read the data from the XML file and save it as a large random access file plus an index + caching so you can quickly look up items. If your objects are all serializable, then you can use ObjectInputStream to read the individual entries after seeking to the right place using the RandomAccessFile.
We have a large table of approximately 1 million rows, and a data file with millions of rows. We need to regularly merge a subset of the data in the text file into a database table.
The main reason for it being slow is that the data in the file has references to other JPA objects, meaning the other jpa objects need to be read back for each row in the file. ie Imagine we have 100,000 people, and 1,000,000 asset objects
Person object --> Asset list
Our application currently uses pure JPA for all of its data manipulation requirements. Is there an efficient way to do this using JPA/ORM methodologies or am I going to need to revert back to pure SQL and vendor specific commands?
why doesnt use age old technique: divide and conquer? Split the file into small chunks and then have parallel processes work on these small files concurrently.
And use batch inserts/updates that are offered by JPA and Hibernate. more details here
The ideal way in my opinion though is to use batch support provided by plain JDBC and then commit at regular intervals.
You might also wants to look at spring batch as it provided split/parallelization/iterating through files etc out of box. I have used all of these successfully for an application of considerable size.
One possible answer which is painfully slow is to do the following
For each line in the file:
Read data line
fetch reference object
check if data is attached to reference object
if not add data to reference object and persist
So slow it is not worth considering.