I have a large collection of roughly 3.2 million records. The collection is updated monthly, but the source data is fetched as-is, meaning I don't get just the updated records but everything.
In terms of performance, is it better to simply drop the collection and re-insert everything, or to update each record?
Also, is there a good way to compare an existing record with the one read from the source to check whether anything has changed?
Thanks.
Also, is there a good way to compare an existing record with the one read from the source to check whether anything has changed?
You're looking for a change detection system: it's a problem commonly described for ETL systems. I suggest you read up on the ETL process (Kimball's The Data Warehouse ETL Toolkit is a good source). In general, detecting changes is a hard problem and involves the use of snapshots to calculate differences. If you're sure that your collection will always remain in MongoDB, you can see whether it's possible to work with the MongoDB oplog.
Furthermore, consider that change detection is tightly coupled to the structure and meaning of your data: e.g., if you have an insert-only collection, you can identify the new data by _id.
The problem is too complex to give answers like "do this and that and you'll get it"; you have to analyze your data and work out which method is better: refer to the literature for known solutions and avoid reinventing the wheel.
In terms of performance, is it better to simply drop the collection and re-insert everything, or to update each record?
Once again, you have to know how your data is structured. If more of the collection changes than stays the same, you're better off reloading the entire collection rather than tracking changes. If the changeset is considerably smaller than the whole collection, updating the existing documents gives better performance.
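If you do go the update route, a minimal sketch with the MongoDB Java driver might look like the following. It keeps a content hash per document so unchanged records can be skipped and upserts the rest in bulk. The database, collection and field names, as well as readSourceDump() and contentHash(), are assumptions for illustration, not your actual schema:

```java
import com.mongodb.client.*;
import com.mongodb.client.model.*;
import org.bson.Document;

import java.util.*;

public class MonthlyLoad {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("mydb").getCollection("records");

            // 1. Load only _id + contentHash of the existing documents (small projection).
            Map<Object, String> existing = new HashMap<>();
            for (Document d : coll.find().projection(Projections.include("_id", "contentHash"))) {
                existing.put(d.get("_id"), d.getString("contentHash"));
            }

            // 2. Upsert only documents that are new or whose hash changed.
            List<WriteModel<Document>> batch = new ArrayList<>();
            for (Document incoming : readSourceDump()) {           // hypothetical reader
                String hash = contentHash(incoming);               // e.g. a hash of the business fields
                incoming.put("contentHash", hash);
                if (!hash.equals(existing.get(incoming.get("_id")))) {
                    batch.add(new ReplaceOneModel<>(
                            Filters.eq("_id", incoming.get("_id")),
                            incoming,
                            new ReplaceOptions().upsert(true)));
                }
                if (batch.size() == 1000) {                        // flush in chunks
                    coll.bulkWrite(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                coll.bulkWrite(batch);
            }
        }
    }

    private static Iterable<Document> readSourceDump() { /* read the monthly dump */ return List.of(); }

    private static String contentHash(Document d) { /* placeholder hash */ return Integer.toHexString(d.hashCode()); }
}
```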
Hope this helps.
Recently I came across a schema model like this.
The structure looks exactly the same; I have just renamed the entities with generic names like Table (*).
Starting from Table C, all the tables (C through L) have close to 200 columns.
The reason for posting this is that I have never come across a structure like this before. If anyone has worked with something similar or more complex, please share your thoughts:
Is having a structure like this good or bad, and why?
Assume we need an API to save data for a table structure like this:
How should the API be designed?
How do we manage transactions across all these tables?
In the service code, there are a few cases where we need to read data from these tables and transfer the data to an external system.
The catch is that the external system accepts requests in a flattened structure, not in the hierarchy we have as described above. If this data needs to be transferred to the external system, how do we manage the marshalling and unmarshalling?
Last but not least, the API that manages this data will be called at least 2,000 times a day.
What are your thoughts on this? I don't know exactly why we need it; it needs a detailed discussion, and we need to break things up.
If I go with Spring Data JPA and Hibernate, what are all the things I need to consider?
More importantly, the rows in all these tables are scoped by ownerId/tenantId, so the data needs to be consistent across all the tables.
I cannot comment on the general aspect of the structure, as that is pretty domain specific; one would need to know why this structure was chosen to be able to say whether it's good or not. Either way, you probably can't change it anyway, so why bother asking?
Having said that, with such a model there are a few aspects that you should consider:
When updating data, it is pretty important to update only the columns that really changed, to avoid index thrashing and to allow the DB to use the spare storage in its pages. This is a performance concern that usually comes up when using Hibernate with such models, because Hibernate by default updates all "updatable" columns, not just the dirty ones; there is an option to enable dynamic updates, though (see the sketch after this list). Without dynamic updates, you may produce a few more IOs per update and thus hold locks for longer, which affects overall scalability.
When reading data, it is very important not to use join fetching by default, as that can result in a result-set size explosion.
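As a rough illustration of both points, a minimal entity sketch might look like this; the entity, table and column names are invented, and @DynamicUpdate is the Hibernate option mentioned above for updating only the dirty columns:

```java
import org.hibernate.annotations.DynamicUpdate;

import javax.persistence.*;

// Sketch only: entity, table and column names are made up for illustration.
@Entity
@Table(name = "TABLE_C")
@DynamicUpdate   // Hibernate generates UPDATE statements containing only the dirty columns
public class TableC {

    @Id
    private Long id;

    private String col1;
    private String col2;
    // ... close to 200 columns in the real model

    // Keep the association lazy so loading a TableC row does not join-fetch the parent by default.
    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "table_b_id")
    private TableB parent;

    // getters and setters omitted
}

@Entity
@Table(name = "TABLE_B")
class TableB {
    @Id
    private Long id;
}
```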
Now I have a situation where I need to do some comparisons and result filtering that are not very simple. What I want is something like Lucene's search, except that I will have to develop it myself; that is not my decision, though, I would have gone with Lucene.
What I will do is:
Find the element by a full-word match on a certain field; if there is no match, check whether the field starts with the term, and then whether it merely contains it.
Every field has its own weight according to the match type (full -> begins with -> contains) and its priority to me.
Once one field has matched, I also check the weights of the other fields to produce a final total row weight.
Then I return a Map of the rows and their weights.
Now, I realized this is not easily done with Hibernate's HQL, meaning I would have to run multiple queries to do it.
So my question is: should I do it in Java, i.e. retrieve all records and do my calculations there to find my target, or should I do it in Hibernate by executing multiple queries? Which is better in terms of performance and speed?
Unfortunately, I think the right answer is "it depends": how many words, what data structure, whether the data fits in memory, how often you have to do the search, etc.
I am inclined to think that a database is a better solution, even if Hibernate is not part of it. You might need to learn how to write better SQL. Perhaps the dynamic SQL that Hibernate generates for you isn't sufficient. Proper JOINs and indexing might make this perform nicely.
There might be a third way to consider: Lucene and indexing. I'd need to know more about your problem to decide.
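If you do end up doing the scoring in Java, a minimal sketch of the weighting described in the question could look like this; the match levels, field weights and data shapes are assumptions, not an existing API:

```java
import java.util.*;

// Rough sketch of weighted field matching: full match > begins-with > contains,
// each field contributes matchLevel * fieldWeight to the row's total weight.
public class RowScorer {

    static final int FULL = 3, BEGINS = 2, CONTAINS = 1;

    static int matchLevel(String fieldValue, String term) {
        if (fieldValue == null) return 0;
        String v = fieldValue.toLowerCase(), t = term.toLowerCase();
        if (v.equals(t))     return FULL;
        if (v.startsWith(t)) return BEGINS;
        if (v.contains(t))   return CONTAINS;
        return 0;
    }

    // rows: one map of field name -> value per row; fieldWeights: importance of each field
    static Map<Map<String, String>, Integer> score(List<Map<String, String>> rows,
                                                   Map<String, Integer> fieldWeights,
                                                   String term) {
        Map<Map<String, String>, Integer> result = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            int total = 0;
            for (Map.Entry<String, Integer> e : fieldWeights.entrySet()) {
                total += matchLevel(row.get(e.getKey()), term) * e.getValue();
            }
            if (total > 0) {
                result.put(row, total);   // the row together with its total weight
            }
        }
        return result;
    }
}
```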
I have a table called Token in my database that represents tokenized texts.
Each row has attributes like textblock, sentence and position (to identify which text the token comes from) and logical fields like text, category, chartype, etc.
What I want to do is iterate over all tokens to find patterns and perform some operations; for example, merging two adjacent tokens whose category is Name into one (and resetting the positions afterwards). I think I will need some kind of list.
What is the best way to do this: with SQL queries to find the patterns, or by iterating over all tokens in the table? I think the queries would get very complex, and iterating over a list might be simpler, but I don't know which way to go (for example, retrieving into a Java list, or using a language that lets me iterate and make changes directly in the database).
So that this question doesn't get closed: what is the recommended way to do this? I'm using Java, but if another language is better, that's no problem; I think I will also need to use R for some statistical calculations.
Edit: The table is large (millions of rows); loading it entirely into memory is not possible.
If you are working with a small table, or proving out a merge strategy, then just set up a query that finds all of the candidate duplicate lines and dump the relevant columns out to a table. Then view that table in a text editor or spreadsheet to see whether your hypothesis about the duplication is correct.
Keep in mind that any time you try to merge two rows into one, you will be deleting data. Worst case is that you might merge ALL of your rows into one. Proceed with caution!
This is an engineering decision to be made, based mostly on the size of the corpus you want to maintain, and the kind of operations you want to perform on them.
If the size gets bigger than "what fits in the editor", you'll need some kind of database. That may or may not be an SQL database. But there is also the code part: if you want to perform non-trivial operations on the data, you might need a real programming language (could be anything: C, Java, Python; anything goes). In that case, the communication with the database becomes a bottleneck: you need to generate queries that produce results that fit in the application program's memory. SQL is powerful enough to represent and store N-grams and do some calculations on them, but that is about as far as you are going to get. In any case, the database has to be fully normalised, and that will make it harder to understand for non-DBAs.
My own toy project, http://sourceforge.net/projects/wakkerbot/, used a hybrid approach:
the data was obtained by a Python crawler
the corpus was stored as-is in the database
the actual (modified MegaHal) Markov code stores its own version of the corpus in a (binary) flat file, containing the dictionary, the N-grams, and the associated coefficients.
the training and text generation are done by a highly optimised C program
the output was picked up by another Python script and submitted to the target.
[In another life, I would probably have done some more normalisation and stored N-grams or trees in the database. That would possibly cause the performance to drop to only a few generated sentences per second. It is now about 4000/sec.]
My gut feeling is that what you want is more like a "linguistic workbench" than a program that does exactly one task efficiently (like wakkerbot). In any case, you'll need to normalise a bit more: store the tokens as {tokennumber, tokentext} and refer to them only by number. Basically, a text is just a table (or array) containing a bunch of token numbers, and an N-gram is just a couple of token numbers plus the corresponding coefficients.
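A minimal sketch of that normalisation in Java; the class and method names are invented for illustration:

```java
import java.util.*;

// Tokens live in one dictionary, a text is just a sequence of token numbers,
// and an N-gram is a few token numbers plus a coefficient.
public class Corpus {

    private final Map<String, Integer> tokenIds = new HashMap<>(); // tokentext -> tokennumber
    private final List<String> tokenTexts = new ArrayList<>();     // tokennumber -> tokentext

    public int tokenId(String text) {
        return tokenIds.computeIfAbsent(text, t -> {
            tokenTexts.add(t);
            return tokenTexts.size() - 1;
        });
    }

    // Encode a tokenized text as an array of token numbers.
    public int[] encode(List<String> tokens) {
        return tokens.stream().mapToInt(this::tokenId).toArray();
    }

    // An N-gram: token numbers plus the associated coefficient.
    public record NGram(int[] tokenNumbers, double coefficient) { }
}
```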
This is not the most optimized approach, but it's a design that lets you write the code easily.
write an entity class that represents a row in your table.
write a factory method that lets you get the entity object for a given row id, i.e. a method that creates an object of the entity class with the values from the specified row.
write methods that remove a given row object from the table and insert one into it.
write a row-counting method.
Now you can iterate over your table from your Java code. Remember that if you merge two rows, you need to adjust the next index correctly (see the sketch below).
This approach keeps memory usage small, but you will be issuing a lot of queries to build the rows.
The concept is very similar, if not identical, to ORM (Object-Relational Mapping). If you know how to use Hibernate or another ORM, try those libraries.
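A sketch of that iteration, merging adjacent Name tokens; the Token shape and the DAO methods are assumptions for illustration, not an existing API:

```java
// Merges adjacent tokens whose category is "Name" and re-packs the positions.
public class TokenMerger {

    private final TokenDao dao;   // hypothetical factory/remove/insert/count methods

    public TokenMerger(TokenDao dao) {
        this.dao = dao;
    }

    public void mergeAdjacentNames() {
        int i = 0;
        while (i < dao.countRows() - 1) {
            Token current = dao.findByPosition(i);
            Token next = dao.findByPosition(i + 1);
            if ("Name".equals(current.category()) && "Name".equals(next.category())) {
                // merge the two rows into one and shift the positions that follow
                dao.remove(next);
                dao.update(current.withText(current.text() + " " + next.text()));
                dao.shiftPositionsAfter(i + 1, -1);
                // do NOT advance: the new neighbour might also be a Name token
            } else {
                i++;
            }
        }
    }

    // Minimal shapes for the sketch.
    public record Token(long id, int position, String text, String category) {
        Token withText(String t) { return new Token(id, position, t, category); }
    }

    public interface TokenDao {
        int countRows();
        Token findByPosition(int position);
        void update(Token t);
        void remove(Token t);
        void shiftPositionsAfter(int position, int delta);
    }
}
```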
IMO it'd be easier, and likely faster overall, to load everything into Java and do your operations there to avoid continually re-querying the DB.
There are some pretty strong numerical libs for Java and statistics, too; I wouldn't dismiss it out-of-hand until you're sure what you need isn't available (or is too slow).
This sounds like you're designing a text search engine. You should first see whether PostgreSQL's full-text search engine is right for you.
If you do it without full-text search, loading PL/R into PostgreSQL and learning to drive it is likely to be the fastest and most efficient solution. It'll allow you to put all this work into a few well-thought-out lines of R and do it all in the database, where access to the data is closest. The only time to avoid such a plan is when it would make the database server work VERY hard, like holding the dataset in memory and cranking a single CPU core across it; then it's OK to do it on the application side.
Whether you use PL/R or not, access large data sets through a cursor; it's by far the most efficient way to get single rows or smaller subsets of rows. If you do it with a SELECT with a WHERE clause for each thing you want to process, you don't have to hold all those rows in memory at once. You can grab and discard parts of result sets while doing things like computing running averages.
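From the Java side, the same cursor-style access can be had with plain JDBC, as in this sketch; the connection details and query are placeholders. With the PostgreSQL driver, disabling autocommit plus a positive fetch size makes it stream rows from a server-side cursor instead of loading the whole result set:

```java
import java.sql.*;

public class StreamTokens {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret")) {
            con.setAutoCommit(false);                       // required for cursor-based fetching
            try (PreparedStatement ps = con.prepareStatement(
                     "SELECT id, text, category FROM token ORDER BY textblock, sentence, position")) {
                ps.setFetchSize(5000);                      // rows per round trip
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // process one row at a time, e.g. feed a running aggregate
                        String category = rs.getString("category");
                        // ...
                    }
                }
            }
        }
    }
}
```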
Think about scale here. If you had a 5 TB database, how would you access it to do this the fastest? A poor scaling solution will come back to bite you even if it's only accessing 1% of the data set. And if you're already starting on a pretty big dataset today, it'll just get worse with time.
pl/R http://www.joeconway.com/plr/
The problem: I want to render a news stream of short messages based on localized texts. In various places in these messages I have to insert parameters to "customize" them. I guess you know what I mean ;)
My question probably falls into the "Which is the best style to do it?" category: How would you store these parameters (they may be Strings and Numbers that need to be formatted according to Locale) in the database? I'm using Hibernate to do the ORM and I can think of the following solutions:
build a combined String and save it as such (ugly and hard to maintain I think)
do some kind of fancy normalization and make every parameter a single row in the database (clean, I guess, but a performance nightmare)
put the params into an Array, Map or other Java data structure and save it in binary format (probably causes a lot of size overhead)
I tend towards option #3, but I'm afraid that it might be too costly in terms of size in the database. What do you think?
If you can afford the performance hit of the normalized approach with a separate table, I would go with it. At work we use the same approach as your first suggestion, and it gets messy, especially when you reach the column length limit and keys/values start getting truncated!
Do the normalization.
I would suggest something like:
Table Message
    id

Table Params
    message_id
    key
    value
Storing serialized Java objects in the database is quite a bad idea in most cases, as they are hard to maintain and you cannot access them with "simple" SQL tools.
The performance impact of normalizing is not that big, as you can fetch everything together in a single select using a join.
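A minimal JPA sketch of that normalized mapping; the entity, table and column names are assumptions, and "key" is renamed because it is a reserved word in some databases:

```java
import javax.persistence.*;
import java.util.*;

@Entity
@Table(name = "message")
public class Message {

    @Id
    @GeneratedValue
    private Long id;

    private String templateCode;   // which localized text to render

    // The Params table mapped as a key/value collection; one join fetches everything.
    @ElementCollection(fetch = FetchType.EAGER)   // or LAZY plus a join fetch when rendering
    @CollectionTable(name = "params", joinColumns = @JoinColumn(name = "message_id"))
    @MapKeyColumn(name = "param_key")
    @Column(name = "param_value")
    private Map<String, String> params = new HashMap<>();

    // getters and setters omitted
}
```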
It depends a bit. Is the number of parameters per entity huge? If it is not, the second option is probably the best.
If you don't want the extra queries caused by lazy loading, you can always change the fetch type for the parameter collection; that would only add one join to a query you were doing anyway. Under normal conditions it is not a big price to pay.
Also, the first and third options forever rule out any kind of query over the parameters: a huge technical debt for the future that I would not be willing to take on.
Directly put it as a string and save it.
Let's say you have a large text file. Each row contains an email id and some other information (say, some product id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
Insane number of rows
Use a MapReduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data. (j/k :))
Unable to fit all rows in memory
Even the result won't fit: use a merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (this sample probably helps). This can be multi-threaded if you want.
The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or similar and keep adding to the HashSet. You can use a ConcurrentHashMap and more than one thread to read the files and add to the map. Another multi-threaded option is a ConcurrentSkipListSet; in this case you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to this SortedSet.
Fits in memory
Design an object that holds your data, implement a good equals()/hashCode() method and put them all in a HashSet.
Or use the methods given above (you probably don't want to persist to disk though).
Oh, and if I were you, I would put a unique constraint on the DB anyway...
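A minimal sketch of the line-iterator-plus-HashSet variant described above, streaming the file and keeping only the first occurrence of each email id; the file names, delimiter and column layout are assumptions:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class Dedup {
    public static void main(String[] args) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader in = new BufferedReader(
                 new FileReader("input.txt", StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                 new FileWriter("deduped.txt", StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // assume the email id is the first comma-separated column
                String email = line.split(",", 2)[0].trim().toLowerCase();
                if (seen.add(email)) {          // add() returns false for duplicates
                    out.println(line);
                }
            }
        }
    }
}
```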
I will start with the obvious answer: make a HashMap, put the email id in as the key, and put the rest of the information into the value (or make an object to hold all the information). When you get to a new line, check whether the key already exists; if it does, move on to the next line. At the end, write out all your SQL statements using the HashMap. I agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
You have two options:
do it in Java: you could put together something like a HashSet for the de-dup test, adding an email id for each incoming item only if it doesn't already exist in the set.
do it in the database: put a unique constraint on the table, so that dups will not be added. An added bonus is that you can repeat the process and remove dups from previous runs.
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record-linkage engine written in Java. It uses Lucene for indexing to reduce the number of comparisons (avoiding an unacceptable Cartesian-product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and it is extremely extensible and configurable.
Can you not index the table by email and product ID? Then, reading by index, duplicates of either email or email+prodId are readily identified via sequential reads by simply comparing each record with the previous one.
Your problem can be solved with an Extract, Transform, Load (ETL) approach:
Load your data into an import schema;
Do whatever transformations you like on the data;
Then load it into the target database schema.
You can do this manually or use an ETL tool.
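If you do it by hand, the final step can be a single set-based statement run over the import schema, as in this rough JDBC sketch; the connection details, schema, table and column names are all placeholders:

```java
import java.sql.*;

public class StagingLoad {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");
             Statement st = con.createStatement()) {
            // Raw file is assumed to have been bulk-loaded into import.contacts_raw first;
            // the de-duplicated rows are then moved into the target table in one statement.
            st.executeUpdate(
                "INSERT INTO target.contacts (email, product_id) " +
                "SELECT DISTINCT email, product_id FROM import.contacts_raw");
        }
    }
}
```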