Saving a large amount of data (words): serialization or DB - Java

I need to persist a large vocabulary, associate some information with each word, and search the words efficiently.
Is it better to store it in a DB (in a simple table, letting the DBMS structure the data based on the key), or to build a
trie data structure, serialize it to a file, and deserialize it when the program starts? Or should I use an XML file instead of serialization?
Edit: the vocabulary would be on the order of 5 thousand to 10 thousand words, and the metadata for each word is an array of 10 Integers. Words are accessed very frequently. This is why I thought of a trie, whose search time is ~O(1) in the vocabulary size (it depends on the word length), instead of a DB that uses a B-tree or something similar, where search is ~O(log n).
P.S. I'm using Java.
Thanks!

Using a DB is better.
Many companies have migrated to a DB. For example, the ERP Divalto used serialization and has since migrated to a DB to get better performance.
You have many DBMSs to choose from. If you want to keep all the data in one file, the simplest way is to use SQLite. Its advantage is that it does not need a DBMS server running.
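For comparison with the DB option, here is a minimal sketch of the alternative raised in the question: a trie that maps each word to an int array and is written out with standard Java serialization. The class name WordTrie and its methods are hypothetical, and the sketch serializes to a byte array rather than a file to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical trie mapping each word to an int[] of metadata. */
final class WordTrie implements Serializable {
    private static final long serialVersionUID = 1L;

    private static final class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<Character, Node> children = new HashMap<>();
        int[] metadata; // stays null unless a word ends at this node
    }

    private final Node root = new Node();

    /** Stores a copy of the metadata under the given word. */
    void put(String word, int[] metadata) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.metadata = metadata.clone();
    }

    /** Returns the metadata for the word, or null if the word is absent. */
    int[] get(String word) {
        Node node = root;
        // The number of steps depends on the word length, not on the vocabulary size.
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.children.get(word.charAt(i));
        }
        return node == null ? null : node.metadata;
    }

    /** Serializes the whole trie with standard Java serialization. */
    byte[] toBytes() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        }
        return buffer.toByteArray();
    }

    /** Restores a trie previously produced by toBytes(). */
    static WordTrie fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (WordTrie) in.readObject();
        }
    }
}
```

Writing the bytes to a file at shutdown and reading them back at startup would give the persistence described in the question. At 5 to 10 thousand words, a plain HashMap from word to int array would probably behave similarly, so the trie mainly pays off if prefix searches are needed.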

Related

How to Store Data To Show at Chart using Java?

I have a Spring-based Java application with two types of data.
The first is the number of documents indexed by my application. Documents are indexed only 2 or 3 times a week.
The second is the number of searches. Many users search in my application, and I want to visualize the search terms. This data flows in constantly.
What do you suggest for storing this kind of data using Java?
For the first, I think I can use RRD or something like it, or even write the data into a MySQL table.
For the second, I could use a more sophisticated database, with an in-memory database such as H2 between it and the user interface.
Any ideas?
Have you considered using Redis? It has great support for atomic increments if you want to track search counts, and it's also very fast since data is stored in memory.
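To illustrate what an atomic per-key increment buys you, here is a small in-process stand-in written with the JDK only. It is not Redis and does not use a Redis client; it mimics the idea of the Redis INCR command inside a single JVM, and the class name SearchTermCounter is hypothetical. Counts in this sketch are lost on restart.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical thread-safe counter of search terms, kept in memory only. */
final class SearchTermCounter {
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Atomically adds one to the count for the term; safe to call from many threads. */
    void record(String term) {
        counts.computeIfAbsent(term, t -> new LongAdder()).increment();
    }

    /** Returns the current count for the term, or 0 if it was never recorded. */
    long count(String term) {
        LongAdder adder = counts.get(term);
        return adder == null ? 0L : adder.sum();
    }
}
```

With Redis, the same record step would become one increment command per search, and the counts would be shared across application instances.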

What is the best way to find common elements of multiple text files with java?

I have a program that creates multiple text files of RDF triples. I need to compare the triples quickly; what is the best way to do this? I thought of putting the triples into an array and comparing them, but there could be hundreds of thousands of triples per file, and that would take forever. I need it to be as close to real time as possible, since triples will be generated constantly across the files. Any help would be great. The files are also in AllegroGraph repositories, if it's easier to compare them there somehow.
A thought: if I stored the triples in Excel (one triple per row, one sheet per repository):
A: How could I find the duplicates among the sheets?
B: Would it be fast?
C: How could I automate that from Java?
You need to build a master index that stores, for each triple, how many files it appears in, along with the exact file names and the triple's location within each file. You can then search the master index to answer queries in real time.
As you update, delete, or create RDF files, you need to update the master index.
The master index must be stored so that it can be updated and searched efficiently.
A simple choice is a relational database (like MySQL) to store the master index. It can answer queries such as finding common triples with a simple select statement: select * from rdfindex where triplecount > 2.
EDIT: Keeping hundreds of thousands of triples per file in memory in a HashMap or similar data structure may not be practical. That's why I suggested using a database, which can store the data and respond to your queries efficiently. You can look at an embedded database like SQLite to store the data.
Read up on these topics:
How to create a SQLite database, create tables, access tables, etc. Create a simple table to store triple, triplecount, and filenames.
Convert all your Excel files to CSV files. You can use opencsv to parse the file in Java (check out the samples that come with opencsv).
Parse the CSV files and load the data into SQLite. If the triple is already in the database, then just update the count, if not insert the triple.
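The shape of the master index can be sketched in memory as follows. This is only a small-scale illustration of the structure, not the database-backed version recommended above; the class name TripleIndex and its methods are hypothetical, and triples are treated as opaque strings. In a database, the map key would correspond to the triple column and the size of the file set to triplecount.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory master index: triple text -> files that contain it. */
final class TripleIndex {
    private final Map<String, Set<String>> filesByTriple = new HashMap<>();

    /** Records that the triple occurs in the given file; repeated calls are harmless. */
    void add(String triple, String fileName) {
        filesByTriple.computeIfAbsent(triple, t -> new LinkedHashSet<>()).add(fileName);
    }

    /** Returns the triples found in at least minFiles distinct files, in sorted order. */
    Set<String> commonTriples(int minFiles) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : filesByTriple.entrySet()) {
            if (entry.getValue().size() >= minFiles) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Each file is read once and every triple is added with its file name, so finding duplicates no longer requires comparing files pairwise.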
As far as I know, AllegroGraph has a function to delete duplicate entries; this may be an option if all the triples come from there.

Compare Huge XML Rows with Database Table Records - Custom Requirement

Problem
We have XML data like the following (it contains some non-Unicode characters that need to be filtered out):
<row><div>1234</div><dept>ABCD</dept></row>
<row><div>5678</div><dept>EFGH</dept></row>
Only 2 column tags are shown for ease of understanding; each row actually has more than 20 column tags.
The XML data is inserted directly as records into an Oracle schema table, as follows:
div_c qdept
1234 ABCD
5678 EFGH
More information
The XML file is larger than 9 GB and is available over FTP.
Database table column names might differ from the XML column tag names.
We might have to add/define some rules to filter out rows.
Question
What would be an appropriate way to parse this huge XML and find out whether each record exists in the database table? Are there any open source tools for this?
What I Am Trying
Wrote a StAX parser using the default implementation (XMLInputFactory) with an invalid-character filter (FilterReader)
Planning to split the XML into chunks
Have concurrent threads process each chunk
Each thread will generate a query to check whether the record exists in the database or not (I know it's absurd)
Have a connection pool and execute those queries from each thread
I know this approach is really bad and will take years to complete. I really need some advice, such as whether an ORM would make the checking easier and how to make the XML parsing fast.
Suggestions like that would really help me.
Yeah. I think you were right to use StAX. You definitely want to stream, and StAX seems to have the simplest API for streaming XML. I wouldn't go to an ORM right away. Most ORMs are meant to round-trip data, and they save you work on mechanical transformations. That makes them a good fit when you have very structured data and the mapping between the two schemas is not very complicated. Here you are trying to import data from one format into another. It sounds like your large dataset has a fairly simple schema, and the mapping is the more complicated part. Go with custom code. Pawel's suggestion of the temporary table sounds good. Try to do as much processing as you can in stored procedures that operate on the whole dataset at once (old and imported). You don't want to keep transferring those rows back and forth between the database and your app.
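A minimal sketch of the streaming side, assuming the row elements sit inside a single root element (the question shows only the rows) and that each column tag contains plain text. The class name RowStreamer and the batching callback are hypothetical. Each batch could then be bulk-loaded into a temporary table instead of issuing one query per row; that step is not shown.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper that streams <row> elements and hands them over in batches. */
final class RowStreamer {
    /** Parses the XML without loading it whole and returns the number of rows seen. */
    static int stream(Reader xml, int batchSize, Consumer<List<Map<String, String>>> handler)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        List<Map<String, String>> batch = new ArrayList<>();
        Map<String, String> row = null; // non-null only while inside a <row>
        int total = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("row")) {
                        row = new LinkedHashMap<>();
                    } else if (row != null) {
                        // Column tag: read its text and advance to its end tag.
                        row.put(name, reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("row")) {
                    batch.add(row);
                    row = null;
                    total++;
                    if (batch.size() == batchSize) {
                        handler.accept(batch);
                        batch = new ArrayList<>();
                    }
                }
            }
            if (!batch.isEmpty()) {
                handler.accept(batch); // final, possibly smaller batch
            }
        } finally {
            reader.close();
        }
        return total;
    }
}
```

Only one batch of rows is held in memory at a time, so the size of the input file does not affect heap usage. The reader passed in could be the FilterReader mentioned in the question.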

How to implement persistent lookup table

My Java application uses a read-only lookup table, which is stored in an XML file. When the application starts, it just reads the file into a HashMap. So far, so good, but since the table is growing, I don't like loading the entire table into memory at once. An RDBMS or a NoSQL key-value store seems like overkill to me. What would you suggest?
Makes you wish Java would allow to allocate infinite amounts of heap as memory mapped file :-)
If you use Java 5, then use Java DB; it's a database engine written in Java, based on Apache Derby. If you know SQL, then setting up an embedded database takes only a couple of minutes. Since you can create the database again every time your app is started, you don't have to worry about permissions, DB schema migration, stale caches, etc.
Or you could use an OO database like db4o but many people find it hard to make the mental transition to use queries to iterate over internal data structures. To take your example: You have a huge HashMap. Instead of using map.get(), you have to build a query using DB4o and then run that query on your map to locate items; otherwise DB4o would be forced to load the whole map at once.
Another alternative is to create your own minimal system: Read the data from the XML file and save it as a large random access file plus an index + caching so you can quickly look up items. If your objects are all serializable, then you can use ObjectInputStream to read the individual entries after seeking to the right place using the RandomAccessFile.
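The last alternative could look roughly like this. It is a sketch under several assumptions: values are strings written with writeUTF rather than serialized objects read through ObjectInputStream, the key-to-offset index is kept in memory, no cache is included, and the class name FileLookupTable is hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lookup table that keeps values on disk and only offsets in memory. */
final class FileLookupTable implements AutoCloseable {
    private final RandomAccessFile file;
    private final Map<String, Long> offsets = new HashMap<>(); // key -> start of its value

    /** Opens (or creates) the data file and empties it, since the table is rebuilt on startup. */
    FileLookupTable(Path path) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "rw");
        this.file.setLength(0);
    }

    /** Appends the value to the file and remembers where it starts. */
    void put(String key, String value) throws IOException {
        long position = file.length();
        file.seek(position);
        file.writeUTF(value); // writeUTF is limited to 65535 encoded bytes per value
        offsets.put(key, position);
    }

    /** Seeks to the stored offset and reads the value back, or returns null if unknown. */
    String get(String key) throws IOException {
        Long position = offsets.get(key);
        if (position == null) {
            return null;
        }
        file.seek(position);
        return file.readUTF();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

The application would fill the table once from the XML file at startup and then call get for lookups. Memory use is then proportional to the number of keys rather than to the size of the values.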

Merging a large table with a large text file using JPA?

We have a large table of approximately 1 million rows, and a data file with millions of rows. We need to regularly merge a subset of the data in the text file into a database table.
The main reason it is slow is that the data in the file has references to other JPA objects, meaning those objects need to be read back for each row in the file. For example, imagine we have 100,000 people and 1,000,000 asset objects:
Person object --> Asset list
Our application currently uses pure JPA for all of its data manipulation requirements. Is there an efficient way to do this using JPA/ORM methodologies or am I going to need to revert back to pure SQL and vendor specific commands?
Why not use the age-old technique of divide and conquer? Split the file into small chunks and then have parallel processes work on these small files concurrently.
Also use the batch inserts/updates offered by JPA and Hibernate.
The ideal way, in my opinion, is to use the batch support provided by plain JDBC and commit at regular intervals.
You might also want to look at Spring Batch, as it provides splitting, parallelization, iterating through files, etc. out of the box. I have used all of these successfully for an application of considerable size.
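A minimal sketch of the plain JDBC variant: rows are split into chunks, each chunk is sent as one batch, and the transaction is committed after every chunk. The table asset, its columns person_id and name, and the class name BatchLoader are hypothetical, and the sketch assumes the file already carries the person's id so that no entity has to be loaded per row.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical loader that inserts rows in JDBC batches. */
final class BatchLoader {
    /** Splits items into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> chunks(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            result.add(items.subList(start, Math.min(start + chunkSize, items.size())));
        }
        return result;
    }

    /** Inserts {personId, assetName} rows, committing once per chunk. */
    static void insertAssets(Connection connection, List<String[]> rows, int chunkSize)
            throws SQLException {
        String sql = "INSERT INTO asset (person_id, name) VALUES (?, ?)"; // hypothetical table
        boolean previousAutoCommit = connection.getAutoCommit();
        connection.setAutoCommit(false);
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            for (List<String[]> chunk : chunks(rows, chunkSize)) {
                for (String[] row : chunk) {
                    statement.setLong(1, Long.parseLong(row[0]));
                    statement.setString(2, row[1]);
                    statement.addBatch();
                }
                statement.executeBatch(); // one round trip per chunk
                connection.commit();
            }
        } catch (SQLException e) {
            connection.rollback(); // discard the chunk that failed
            throw e;
        } finally {
            connection.setAutoCommit(previousAutoCommit);
        }
    }
}
```

The chunks helper could also be used to hand separate chunks to parallel workers, each with its own connection.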
One possible answer, which is painfully slow, is to do the following.
For each line in the file:
Read the data line
Fetch the reference object
Check whether the data is attached to the reference object
If not, add the data to the reference object and persist
It is so slow that it is not worth considering.
