How can I update an XML file periodically? - Java

Hi,
I have to run a query against a Facebook table which will return XML information about the number of likes and so on. I have to keep that information in a database or in an XML file on disk, and every day at a certain hour I have to update that data.
How can I make those updates run at a certain time?
If the information is very large and I can't store it in a database, can I store it in an XML file?

If the amount of data or its volume would prove troublesome for a database, you certainly won't benefit from using XML for storage! Quite the contrary. Find out whether your database supports XML as a column type; if it does, it may provide XPath-based indexing and possibly even in-place updates. Since you receive the information as XML, some way of bridging the gap between XML and a relational database would be useful. Using EclipseLink for persistence gives you an excellent bridge in the form of JAXB combined with JPA.
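As a rough illustration of that kind of bridge, a single class can carry both JPA and JAXB annotations (the class, element and column names below are made up for the example, not taken from the question):

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.xml.bind.annotation.XmlAccessType;
    import javax.xml.bind.annotation.XmlAccessorType;
    import javax.xml.bind.annotation.XmlElement;
    import javax.xml.bind.annotation.XmlRootElement;

    // Illustrative entity mapped for both JAXB (XML) and JPA (database).
    // All names here are assumptions, not taken from the question.
    @Entity
    @XmlRootElement(name = "page")
    @XmlAccessorType(XmlAccessType.FIELD)
    public class PageLikes {

        @Id
        @XmlElement(name = "id")
        private long pageId;

        @XmlElement(name = "likes")
        private int likeCount;

        // getters and setters omitted for brevity
    }

With something like this, the XML you receive can be unmarshalled straight into objects that JPA can persist, instead of maintaining two separate models.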
As for the scheduled updates, first try to find out whether you always need all of the information or just a subset. Even if you can't request partial data, filtering out what you don't need (for example with an XSLT transform) could reduce the memory footprint and processing time further down the line. Using JPA entities would certainly make syncing and updates easier.
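For the "run every day at a certain hour" part, one simple option in plain Java is a ScheduledExecutorService; here is a minimal sketch, where the update method is just a placeholder for your Facebook query and storage code:

    import java.time.Duration;
    import java.time.LocalDateTime;
    import java.time.LocalTime;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class DailyUpdater {

        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

            // Run every day at 03:00 local time; the hour is just an example.
            LocalTime runAt = LocalTime.of(3, 0);
            LocalDateTime now = LocalDateTime.now();
            LocalDateTime firstRun = now.toLocalDate().atTime(runAt);
            if (!firstRun.isAfter(now)) {
                firstRun = firstRun.plusDays(1);   // today's slot has already passed
            }
            long initialDelaySeconds = Duration.between(now, firstRun).getSeconds();

            scheduler.scheduleAtFixedRate(
                    DailyUpdater::updateData,       // placeholder for the real update
                    initialDelaySeconds,
                    TimeUnit.DAYS.toSeconds(1),     // repeat every 24 hours
                    TimeUnit.SECONDS);
        }

        private static void updateData() {
            // Placeholder: run the Facebook query here and update the database or XML file.
            System.out.println("Updating data at " + LocalDateTime.now());
        }
    }

A scheduler library such as Quartz (with a cron expression) or a plain system cron job invoking the jar would achieve the same thing with less code to maintain.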

Related

Hibernate bulk operations migrate databases

I wrote a small executable jar using Spring & Spring Data JPA to migrate data from one database to another: it converts objects from the original database (spread across several tables) into valid objects for the new database and then inserts the new objects into the new database.
The problem is that I process a large amount of data (200,000 records) and doing my inserts one by one is really time consuming (about 1 hour, and all of that time is spent on the INSERT operations which happen after validating/transforming the incoming data; it is not spent on retrieval from the original database nor on validation/conversion).
I have already received some suggestions about it:
[Edit because I didn't explain it well] Since I am doing extract-validate-transform-insert, do the inserts (which are valid because they are verified first) X objects at a time instead of one by one. That is the suggestion from the first answer: I tried it, but it is not that much more efficient and is still time consuming.
Instead of saving directly to the database, write the inserts to a .sql file and then import that file directly into the database. But how do I translate myDao.save() into its final SQL output so I can write it to a file?
Use Talend: known as probably the best way, but it would take too long to redo everything. I'd like to find a way in Java and refactor my jar.
Any other ideas?
Note: one important point is that if one validation fails I want to continue processing the other data; I only log an error.
Thanks
You should pause and think for a minute: what could cause an error when inserting your data into the database? Short of "your database is hosed", there are two possibilities:
There is a bug in your code;
The data coming in is bad.
If you have a bug in your code, you would be better off if the whole data load is reverted. You will get another chance to transfer the data after you fix your code.
If the data coming in is bad, or is suspected to be bad, you should add a step for validating your data. So, your process workflow might look like this: Extract --> Validate --> Transform --> Load. If the incoming data is invalid, write it to the log or load it into a separate table for erroneous data.
You should keep the whole process running in the same transaction, using the same Hibernate session. However, keeping all 200K records in memory would be pushing it, so I suggest using batching (see http://docs.jboss.org/hibernate/orm/3.3/reference/en-US/html/batch.html). In short: after a predetermined number of records, say 1000, flush and clear your Hibernate session.
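As a minimal sketch of that flush/clear batching pattern (the SourceRecord and NewEntity types and the convert method are placeholders, not from the question):

    import java.util.List;
    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    public class BatchMigrator {

        private static final int BATCH_SIZE = 1000;

        public void migrate(SessionFactory sessionFactory, List<SourceRecord> validatedRecords) {
            Session session = sessionFactory.openSession();
            Transaction tx = session.beginTransaction();
            try {
                int count = 0;
                for (SourceRecord source : validatedRecords) {
                    NewEntity entity = convert(source);   // your transform step
                    session.save(entity);

                    if (++count % BATCH_SIZE == 0) {
                        session.flush();  // push this batch of INSERTs to the database
                        session.clear();  // detach saved objects to free memory
                    }
                }
                tx.commit();
            } catch (RuntimeException e) {
                tx.rollback();
                throw e;
            } finally {
                session.close();
            }
        }

        private NewEntity convert(SourceRecord source) {
            // Placeholder for your validate/transform step.
            throw new UnsupportedOperationException("implement the transformation here");
        }
    }

Setting the hibernate.jdbc.batch_size property to a matching value also lets Hibernate group the underlying JDBC statements into batches.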

Compare Huge XML Rows with Database Table Records - Custom Requirement

Problem
We have XML data like the following (it contains some non-Unicode characters which need to be filtered out):
<row><div>1234</div><dept>ABCD</dept></row>
<row><div>5678</div><dept>EFGH</dept></row>
Only 2 column tags are shown for ease of understanding; each row actually has more than 20 column tags.
The XML data is inserted directly as records into an Oracle schema table as:
div_c qdept
1234 ABCD
5678 EFGH
More information
The XML file is more than 9 GB and is available over FTP.
The database table column names might be different from the XML column tag names.
We might have to add/define some rules to filter out rows.
Question
What would be the appropriate way to parse this huge XML file and find out whether each record already exists in the database table? Are there any open source tools available for this?
What I Am Trying
Wrote a StAX parser using the default implementation (XMLInputFactory) with an invalid-character filter (a FilterReader)
Planning to split the XML into chunks
Have concurrent threads process each of the chunks
Each thread will generate a query to check whether the record exists in the database or not (I know this is absurd)
Have a connection pool created and have each thread execute those queries
I know what I am doing is really bad and will take forever to complete. I really need some advice, for example whether to go with an ORM to make the checking easier and the XML parsing faster.
Suggestions along those lines would really help me.
Yeah, I think you were right to use StAX. You definitely want to stream, and StAX has the simplest API for streaming XML. I wouldn't go to an ORM right away; most ORMs are built for round-tripping data. They save you work on mechanical transformations, which makes them good when you have very structured data and the mapping between the two schemas is not very complicated. Here you are trying to import data from one format into another. It sounds like your large dataset has a fairly simple schema but the mapping is the more complicated part, so go with custom code. Pawel's suggestion of a temporary table sounds good. Try to do as much processing as you can in stored procedures that operate on the whole dataset at once (old and imported); you don't want to keep transferring those rows back and forth between the database and your app.
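A rough StAX sketch of that streaming approach, assuming the rows are wrapped in a single root element and that each row goes into a staging/temporary table (the loading step is left as a stub):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    // Streams the file row by row instead of loading 9 GB into memory.
    // Tag names match the snippet in the question; loading is left as a stub.
    public class RowStreamer {

        public static void main(String[] args) throws Exception {
            Reader raw = new InputStreamReader(new FileInputStream("rows.xml"), StandardCharsets.UTF_8);
            // Wrap "raw" with the invalid-character FilterReader mentioned in the question if needed.
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(raw);

            String div = null, dept = null, current = null;

            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    current = reader.getLocalName();
                } else if (event == XMLStreamConstants.CHARACTERS && current != null) {
                    if ("div".equals(current)) {
                        div = reader.getText().trim();
                    } else if ("dept".equals(current)) {
                        dept = reader.getText().trim();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if ("row".equals(reader.getLocalName())) {
                        stageRow(div, dept);   // e.g. add to a JDBC batch for a staging table
                        div = dept = null;
                    }
                    current = null;
                }
            }
            reader.close();
        }

        private static void stageRow(String div, String dept) {
            // Placeholder: batch-insert into a temporary/staging table, then compare
            // against the real table with one set-based query or stored procedure.
        }
    }

Comparing staged rows against the existing table in a single set-based statement avoids the one-query-per-record pattern described in the question.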

Which way would be better to import spreadsheet data?

I am trying to import data from a spreadsheet into a database using Java. There are two ways I could do this: 1) Read and extract the data from the spreadsheets and organize it into data structures, such as ArrayLists, Vectors or maps of different objects, so that I can get rid of redundant entries etc., and then write the data structures into the database. 2) Extract the data and put it into the database directly as the cells are read. I think the first way is probably better, but would the second way be faster? Any other considerations I should think of?
Thanks.
You would want to do an executeBatch() here, which is similar to approach #1. So basically you read data from the spreadsheet up to a batch size (e.g. 1,000 records) and then commit the transactions a batch at a time to the DB. After that, move on to the next batch, and so on. With this approach you use the database efficiently, save yourself network round trips, and also do not end up hoarding a lot of data in memory, which could lead to out-of-memory exceptions. You should also re-use the same connection and prepared statement objects.
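A minimal sketch of that batched-insert idea with plain JDBC; the table, column names and the SpreadsheetRow type are invented for the example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    // Reuses one connection and one PreparedStatement, sends a batch every
    // BATCH_SIZE rows, and commits once per batch instead of once per row.
    public class SpreadsheetLoader {

        private static final int BATCH_SIZE = 1000;

        public void load(List<SpreadsheetRow> rows) throws Exception {
            try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pass");
                 PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO imported_data (name, amount) VALUES (?, ?)")) {

                con.setAutoCommit(false);
                int count = 0;

                for (SpreadsheetRow row : rows) {
                    ps.setString(1, row.getName());
                    ps.setInt(2, row.getAmount());
                    ps.addBatch();

                    if (++count % BATCH_SIZE == 0) {
                        ps.executeBatch();   // send the accumulated inserts
                        con.commit();        // one commit per batch
                    }
                }
                ps.executeBatch();           // flush the final partial batch
                con.commit();
            }
        }
    }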
Regarding the data clean-up process, you should definitely sanitize your data before putting it into persistent storage such as a table. You may need to generate reports or use the data in other applications in the future, so having clean, well-structured tables will help you in the long run. For batch applications, the performance requirements are usually not as high as for transactional systems.
You should also use a helper library like Apache POI for reading Excel documents. As far as the data structure is concerned, it will depend on your data, but generally an ArrayList should suffice here.
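For the reading side, a rough Apache POI sketch; the cell layout (name in column 0, amount in column 1) is an assumption:

    import java.io.FileInputStream;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;

    // Reads an .xlsx file and walks the rows of the first sheet.
    public class SpreadsheetReader {

        public static void main(String[] args) throws Exception {
            try (FileInputStream in = new FileInputStream("data.xlsx");
                 Workbook workbook = new XSSFWorkbook(in)) {

                Sheet sheet = workbook.getSheetAt(0);
                for (Row row : sheet) {
                    String name = row.getCell(0).getStringCellValue();
                    double amount = row.getCell(1).getNumericCellValue();
                    // Collect into your list here, or feed the batched insert shown above.
                    System.out.println(name + " -> " + amount);
                }
            }
        }
    }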
Another point to consider is that typically most ETL tools offer these kinds of data loading tasks out of the box. If your situation allows for it, I highly recommend looking at an ETL tool like Kettle to load the data. You may be able to save yourself some time and learn a new tool.
Hope this helps!
You could also consider using an ETL (Extraction, Transformation and Loading) tool for the kind of task you are describing.

Merging a large table with a large text file using JPA?

We have a large table of approximately 1 million rows, and a data file with millions of rows. We need to regularly merge a subset of the data in the text file into a database table.
The main reason it is slow is that the data in the file has references to other JPA objects, meaning those other JPA objects need to be read back for each row in the file. For example, imagine we have 100,000 people and 1,000,000 asset objects:
Person object --> Asset list
Our application currently uses pure JPA for all of its data manipulation requirements. Is there an efficient way to do this using JPA/ORM methodologies or am I going to need to revert back to pure SQL and vendor specific commands?
Why not use the age-old technique of divide and conquer? Split the file into small chunks and then have parallel processes work on those smaller files concurrently.
And use the batch inserts/updates offered by JPA and Hibernate (more details here; a rough sketch follows below).
The ideal way, in my opinion, is to use the batch support provided by plain JDBC and then commit at regular intervals.
You might also want to look at Spring Batch, as it provides splitting/parallelization/iteration through files etc. out of the box. I have used all of these successfully for an application of considerable size.
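A rough sketch of JPA-level batching for this case; Asset, Person and the parsing step are placeholders. Using getReference() avoids reading the full referenced object back for every row, and periodic flush/clear keeps the persistence context small:

    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;

    public class AssetImporter {

        private static final int BATCH_SIZE = 500;

        public void importLines(EntityManagerFactory emf, Iterable<String> lines) {
            EntityManager em = emf.createEntityManager();
            em.getTransaction().begin();
            try {
                int count = 0;
                for (String line : lines) {
                    Asset asset = parseAsset(line);

                    // getReference() returns a lazy proxy, so the Person row is not
                    // actually read back from the database just to set the link.
                    asset.setOwner(em.getReference(Person.class, asset.getOwnerId()));
                    em.persist(asset);

                    if (++count % BATCH_SIZE == 0) {
                        em.flush();   // push the pending INSERTs
                        em.clear();   // detach managed entities to free memory
                    }
                }
                em.getTransaction().commit();
            } catch (RuntimeException e) {
                em.getTransaction().rollback();
                throw e;
            } finally {
                em.close();
            }
        }

        private Asset parseAsset(String line) {
            // Placeholder: build an Asset from one line of the data file.
            throw new UnsupportedOperationException("implement for your file format");
        }
    }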
One possible answer, which is painfully slow, is to do the following.
For each line in the file:
read the data line;
fetch the reference object;
check whether the data is already attached to the reference object;
if not, add the data to the reference object and persist.
So slow it is not worth considering.

caching readonly data for java application

I have a database with around 150K records, with a primary key on the table. The data size for each record is less than 1 KB. The processing time for constructing a POJO from a DB record is about 1-2 seconds (there is some business logic that takes too much time). This is read-only data, so I'm planning to cache it. What I'm thinking of doing is: load the data in subsets (200 records each time) and create a thread that will construct the POJOs and keep them in a hashtable. While the cache is being loaded (when I start the application) the user will see a wait sign. If storing the data in a HashTable becomes an issue, I'll actually store the processed data in another DB table (marshalling the POJOs to XML).
I use a third-party API to load the data from the database. Once I load a record, I have to load the associations for that record, and then the associations of those associations, and so on. It's like loading a family tree.
I can't use Hibernate or any ORM framework, because I'm using a third-party API to load the data which is shipped with the database itself (it's a product). Moreover, I don't think loading the data once is a big issue.
If there were a possibility to fine-tune the business logic, I wouldn't have asked this question here.
Caching the data on demand is an option, but I'm trying to see if I can do anything better.
Suggest a better idea if you are aware of one. Thank you.
Suggest a better idea if you are aware of one.
Yes, fix the business logic so that it doesn't take 1 to 2 seconds per record. That's a ridiculously long time.
Before you do that, profile your application to make sure that it is really the business logic that is causing the slow record loading, and not something else. (For example, it could be a pathological data structure, or a database issue.)
Once you've fixed the root cause of the slow record loading, it is still a good idea to cache the read-only records, but you probably don't need to preload the cache. Instead, just load the records on demand.
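As a rough illustration of loading on demand, independent of any particular caching library, a computeIfAbsent-based cache ensures each record's expensive construction happens at most once (the type parameter and loader are placeholders for your POJO and your existing load/build code):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.LongFunction;

    // Minimal load-on-demand cache sketch. The loader function is a placeholder
    // that wraps the vendor API call plus the expensive business logic.
    public class RecordCache<T> {

        private final Map<Long, T> cache = new ConcurrentHashMap<>();
        private final LongFunction<T> loader;

        public RecordCache(LongFunction<T> loader) {
            this.loader = loader;
        }

        public T get(long id) {
            // The expensive construction runs only the first time this id is requested.
            return cache.computeIfAbsent(id, loader::apply);
        }
    }

Usage would look something like new RecordCache<MyPojo>(id -> buildPojo(id)), where buildPojo is your existing construction code.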
It sounds like you are reinventing the wheel. I'd be looking to use Hibernate. Apart from simplifying the code that accesses the database, Hibernate has built-in caching and lazy loading of data, so it only creates objects as you request them. Ergo, a lot of what you describe above is already in place, and you can concentrate on sorting out your business logic. I suspect that once you solve the business logic performance issue, there will be no need for such a complicated caching system, and the Hibernate defaults will be sufficient.
As maximdim said in a comment, preloading the whole thing will take a lot of time. If your system is not very unusual, the user won't need all the data at once, so just cache on demand instead. I would also recommend using an established caching solution, such as EHCache, which has persistence via DiskStore; the only issue is that whatever you cache in this case has to be Serializable. Since you can marshal it to XML, I'm betting you can serialize it too, which should be faster.
In a past project, we had to query a very busy, very sluggish service running on an off-site mainframe in order to assemble one of the entities. Average response times from our app were dominated by this query. Since the data we retrieved was mostly read-only, caching with EHCache solved our problems.
jdbm has a nice, persistent map implementation (http://code.google.com/p/jdbm2/) - that may help you do local caching - it would certainly be a lot faster than serializing your POJOs to XML and writing them back into a SQL database.
If your data is truly read-only, then I'd think that the best solution would be to treat the source database as an input queue that feeds your app database. Create a background process (heck, a service would be better), and have it monitor the source database and keep your app database synced.
