Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I'm new to MongoDB and trying to figure out some solutions for basic requirements.
Here is the scenario:
I have an application that saves info to MongoDB through a scheduler. Each time the scheduler runs, I need to fetch the last inserted document from the db to get the last info id, which I send with my request to get the next set of info. Also, when I save an object, I need to check whether a specific value for a field is already in the db; if it is, I just need to update a count in that document, otherwise I save a new one.
My Questions are:
I'm using the MongoDB Java driver. Is it better to generate the object id myself, or to use the one generated by the driver or MongoDB itself?
What is the best way to retrieve the last inserted document to find the last processed info id?
If I do a find for the specific value in a column before each insert, I'm worried about the performance of the application, since it is supposed to have thousands of records. What is the best way to do this validation?
I saw in some places that people talk about doing two writes on insert: one to write the document to the collection, and another to write it to a second collection as "last updated", to keep track of the last inserted entry. Is this a good way? We don't normally do this with relational databases.
Can I expect the generated object ids to be unique for every record inserted in the future too? (I have a doubt whether they can repeat or not.)
Thanks.
Can I expect the generated object ids to be unique for every record inserted in the future too? (I have a doubt whether they can repeat or not.)
I'm using the MongoDB Java driver. Is it better to generate the object id myself, or to use the one generated by the driver or MongoDB itself?
If you don't provide an object id for the document, MongoDB will automatically generate one for you. Every document must have an _id. Here is the relevant reference.
The relevant part is
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
http://docs.mongodb.org/manual/reference/object-id/
I guess this is more than enough to keep generated ids from repeating (although it makes a poor hash key, since the leading date/time bytes increase monotonically).
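To make that layout concrete, here is a minimal Java sketch that packs the four parts into 12 bytes and reads the timestamp back. The class ObjectIdSketch and its method names are made up for illustration; it only mirrors the layout quoted above and is not the driver's ObjectId class, which is what you would use in practice.

```java
import java.nio.ByteBuffer;

// Illustrative only: mirrors the 12-byte layout quoted above, not the driver's ObjectId class.
class ObjectIdSketch {
    // Packs the parts as 4 bytes (seconds) + 3 (machine) + 2 (pid) + 3 (counter).
    static byte[] pack(int seconds, int machine, int pid, int counter) {
        ByteBuffer buf = ByteBuffer.allocate(12); // big-endian by default
        buf.putInt(seconds);
        buf.put((byte) (machine >> 16)).put((byte) (machine >> 8)).put((byte) machine);
        buf.putShort((short) pid);
        buf.put((byte) (counter >> 16)).put((byte) (counter >> 8)).put((byte) counter);
        return buf.array();
    }

    // Reads the creation time (seconds since the Unix epoch) from the first four bytes.
    static int secondsOf(byte[] id) {
        return ByteBuffer.wrap(id).getInt();
    }

    // Lowercase hex, the form in which ObjectIds are usually printed.
    static String toHex(byte[] id) {
        StringBuilder sb = new StringBuilder();
        for (byte b : id) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int now = (int) (System.currentTimeMillis() / 1000L);
        byte[] id = pack(now, 0x010203, 0x0405, 0x060708); // made-up machine, pid and counter
        System.out.println(toHex(id) + " created at " + secondsOf(id));
    }
}
```

Because the seconds come first, an id generated in a later second compares greater than one generated earlier, which is why sorting on _id roughly follows insertion order.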
If I do a find for the specific value in a column before each insert, I'm worried about the performance of the application, since it is supposed to have thousands of records. What is the best way to do this validation?
First, index the field (there are no columns in MongoDB; two documents can have different fields).
db.collection_name.createIndex({fieldName : 1}) // ensureIndex() is a deprecated alias since MongoDB 3.0
What is the best way to retrieve the last inserted document to find the last processed info id?
Fun fact: we don't need a separate info id field if each document is used once and then deleted. The _id field sorts by creation date, because an ObjectId starts with its timestamp. But if the document is regularly updated, then we need to modify it atomically with the find-and-modify operation.
http://api.mongodb.org/java/2.6/com/mongodb/DBCollection.html#findAndModify%28com.mongodb.DBObject,%20com.mongodb.DBObject,%20com.mongodb.DBObject,%20boolean,%20com.mongodb.DBObject,%20boolean,%20boolean%29
Now you have updated the date of the last inserted/modified document. Make sure this field is indexed. Then, using the link above, look for the sort parameter and populate it with
new BasicDBObject("infoId", -1); // -1 is for descending order
I saw in some places that people talk about doing two writes on insert: one to write the document to the collection, and another to write it to a second collection as "last updated", to keep track of the last inserted entry. Is this a good way? We don't normally do this with relational databases.
Terrible idea! Welcome to Mongo. You can do better than this.
Related
I've been attempting to do the equivalent of an UPSERT (insert, or update if it already exists) in Solr. I only know what does not work, and the Solr/Lucene documentation I have read has not been helpful. Here's what I have tried:
curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","name":{"set":"steve"}}]'
{"responseHeader":{"status":409,"QTime":2},"error":{"msg":"Document not found for update. id=1","code":409}}
I do up to 50 updates in one request, and a request may contain the same id more than once with mutually exclusive fields (title_en and title_es, for example). If there were a way of querying whether or not a list of ids exists, I could split the data and perform separate insert and update commands... This would be an acceptable alternative, but is there already a handler that does this? I would like to avoid writing any in-house routines at this point.
Thanks.
With Solr 4.0 you can do a partial update of all those documents with just the fields that have changed, while keeping the rest of each document the same. The id should match.
Solr does not support UPSERT mechanics out of the box. You can create a record or you can update a record, and the syntax is different.
And if you update the record, you must make sure all your other pre-inserted fields are stored (not just indexed). Under the covers, an update creates a completely new record, just pre-populated with the previously stored values. But that functionality is buried very deep (probably in Lucene itself).
Have you looked at DataImportHandler? You reverse the control flow (start from Solr), but it does have support for checking which records need to be updated and which records need to be created.
Or you can just run a Solr query like http://solr.example.com:8983/solr/select?q=id%3A(ID1+ID2+ID3)&fl=id&wt=csv where you ask Solr to look for your ID records and return only the IDs of the records it does find. Then you could post-process that to split your batch into updates and inserts.
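The post-processing step could look roughly like the Java sketch below. UpsertSplitter is a hypothetical helper, not part of Solr or SolrJ, and it assumes you have already parsed the returned ids into a set. It also treats a repeated id within the same batch as an update once the first occurrence has been queued as an insert, which matches the case in the question where one request carries the same id with different fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Solr or SolrJ.
class UpsertSplitter {
    final List<String> inserts = new ArrayList<>();
    final List<String> updates = new ArrayList<>();

    // Builds the q parameter in the form used above, e.g. id:(ID1 ID2 ID3).
    // The caller still has to URL-encode it and escape any special characters in the ids.
    static String existenceQuery(List<String> ids) {
        return "id:(" + String.join(" ", ids) + ")";
    }

    // existingIds is the set of ids Solr returned for existenceQuery(batchIds).
    UpsertSplitter(List<String> batchIds, Set<String> existingIds) {
        Set<String> known = new HashSet<>(existingIds);
        for (String id : batchIds) {
            if (known.contains(id)) {
                updates.add(id);
            } else {
                inserts.add(id);
                known.add(id); // a repeat of this id later in the batch becomes an update
            }
        }
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("1", "2", "2", "3");
        UpsertSplitter s = new UpsertSplitter(batch, new HashSet<>(Arrays.asList("1")));
        System.out.println(existenceQuery(batch));
        System.out.println("insert: " + s.inserts + ", update: " + s.updates);
    }
}
```

Note that this is only safe if nothing else writes the same ids between the existence query and the update request.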
I'm stuck on one question in my understanding of the ElasticSearch indexing process. I've already read this article, which says that the inverted index stores all tokens of all documents and is immutable. So, to update it we must remove it and reindex all the data to keep all documents searchable.
But I've also read about partially updating documents (automatically marking them as "deleted" and inserting and indexing a new one). In those articles there was no mention of reindexing all previous data.
So what I do not properly understand is this: when I update a document (a text document with 100 000 words) and already have some other indexed documents in storage, is it true that every UPDATE or INSERT operation will reindex all my documents?
Basically, I rely on the default ElasticSearch settings (5 primary shards with one replica per shard, and 2 nodes in the cluster).
You can just have a document updated (that is reindexed, which is basically the same as removing from index and adding it again), see: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/update-doc.html This will take care of the whole index, so you won't need to reindex every other document.
I'm not sure what you mean by "save" operation, you may want to clarify it with an example.
As for the time required to update a document of 100K words, I suggest you try it out.
I've never used CouchDB/MongoDB/Couchbase before and am evaluating them for my application. Generally speaking, they seem to be a very interesting technology that I would like to use. However, coming from an RDBMS background, I am hung up on the lack of transactions. But at the same time, I know that there is going to be much less a need for transactions as I would have in an RDBMS given the way data is organized.
That being said, I have the following requirement and not sure if/how I can use a NoSQL DB.
I have a list of clients
Each client can have multiple files
Each file must be sequentially numbered for that specific client
Given an RDBMS this would be fairly simple. One table for clients, one (or more) for files. In the client table, keep a counter of the last file number, and increment it by one when inserting a new record into the file table. Wrap everything in a transaction and you are assured that there are no inconsistencies. Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that there is never the same filenumber used twice for a client.
How can I accomplish something similar in MongoDB or CouchDB/base? Is it even feasible? I keep reading about two-phase commits, but I can't seem to wrap my head around how that works in this kind of instance. Is there anything in Spring/Java that provides two-phase commit that would work with these DBs, or does it need to be custom code?
CouchDB is transactional at the level of a single document. Every document in CouchDB contains a _rev key, and all updates to a document are performed against this _rev key:
1. Get the document.
2. Send it for update using the _rev property.
3. If the update succeeds, then you have updated the latest _rev of the document.
4. If the update fails, the document you read was not the most recent. Repeat steps 1-3.
Check out this answer by MrKurt for a more detailed explanation.
The CouchDB recipes have a banking example that shows how transactions are done in CouchDB.
And there is also this atomic bank transfers article that illustrates transactions in CouchDB.
Anyway, the common theme in all of these links is that if you follow the CouchDB pattern of updating against a _rev, you can't end up with an inconsistent state in your database.
Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that there is never the same filenumber used twice for a client.
All CouchDB documents are unique, since the _id fields of two documents can't be the same. Check out the view cookbook:
This is an easy one: within a CouchDB database, each document must have a unique _id field. If you require unique values in a database, just assign them to a document’s _id field and CouchDB will enforce uniqueness for you.
There’s one caveat, though: in the distributed case, when you are running more than one CouchDB node that accepts write requests, uniqueness can be guaranteed only per node or outside of CouchDB. CouchDB will allow two identical IDs to be written to two different nodes. On replication, CouchDB will detect a conflict and flag the document accordingly.
Edit based on comment
In a case where you want to increment a field in one document based on the successful insert of another document
You could use separate documents in this case. You insert a document, wait for the success response. Then add another document like
{_id:'some_id','count':1}
With this you can set up a map reduce view that simply counts the results of these documents and you have an update counter. All you are doing is instead of updating a single document for updates you are inserting a new document to reflect a successful insert.
I always end up with the case where a failed file insert would leave the DB in an inconsistent state especially with another client successfully inserting a file at the same time.
Okay, so I already described how you can do updates over separate documents, but even when updating a single document you can avoid inconsistency if you:
Insert a new file.
When CouchDB gives a success message, attempt to update the counter.
Why does this work?
This works because when you try to update the counter document you must supply a _rev string. You can think of _rev as a local state for your document. Consider this scenario:
1. You read the document that is to be updated.
2. You change some fields.
3. Meanwhile, another request has already changed the original document. This means the document now has a new _rev.
4. But you ask CouchDB to update the document with the stale _rev that you read in step 1.
5. CouchDB rejects the update with a conflict error.
6. You read the document again, get the latest _rev, and attempt the update again.
So if you do this you will always have to update against the latest revision of the document. I hope this makes things a bit clearer.
Note:
As pointed out by Daniel the _rev rules don't apply to bulk updates.
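The read, modify, retry loop described in this answer can be sketched in plain Java. This is an in-memory stand-in for a single document, not a CouchDB client: RevisionedCounter and its methods are hypothetical names, and the integer rev only plays the role of the _rev string.

```java
// In-memory stand-in for one CouchDB document; hypothetical, not a CouchDB client.
class RevisionedCounter {
    private int rev = 1;   // plays the role of _rev
    private int value = 0; // the counter field

    // What a reader gets back: the value plus the revision it was read at.
    static final class Snapshot {
        final int rev;
        final int value;
        Snapshot(int rev, int value) { this.rev = rev; this.value = value; }
    }

    synchronized Snapshot read() {
        return new Snapshot(rev, value);
    }

    // Accepts the write only if the caller read the latest revision.
    synchronized boolean update(int expectedRev, int newValue) {
        if (expectedRev != rev) {
            return false; // stale revision: someone else updated first
        }
        value = newValue;
        rev++;
        return true;
    }

    // Read, modify, and retry until the write lands on top of the latest revision.
    int incrementWithRetry() {
        while (true) {
            Snapshot s = read();
            if (update(s.rev, s.value + 1)) {
                return s.value + 1;
            }
        }
    }

    public static void main(String[] args) {
        RevisionedCounter doc = new RevisionedCounter();
        Snapshot stale = doc.read();    // first client reads
        doc.incrementWithRetry();       // second client updates in between
        System.out.println("stale write accepted: " + doc.update(stale.rev, 99));
        System.out.println("after retry: " + doc.incrementWithRetry());
    }
}
```

The point of the sketch is that a write based on an old read can never silently overwrite a newer one; the worst case is an extra round trip.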
Yes, you can do the same with MongoDB and Couchbase/CouchDB using the proper approach.
First of all, MongoDB has unique indexes, which take care of part of the problem:
- http://docs.mongodb.org/manual/tutorial/create-a-unique-index/
There are also patterns for implementing a sequence properly:
- http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
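The idea behind such a sequence is a counter document per client that is incremented and read back in one atomic step. The Java sketch below is only an in-memory analogue of that idea, so that the behaviour is easy to see; FileNumberSequence is a hypothetical class, and in MongoDB the increment would be a find-and-modify with $inc on a counters collection rather than an AtomicLong.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// In-memory analogue of a per-client counter document; hypothetical, not a MongoDB client.
class FileNumberSequence {
    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    // Atomically increments the client's counter and returns the new value.
    long next(String clientId) {
        return counters.computeIfAbsent(clientId, id -> new AtomicLong()).incrementAndGet();
    }

    public static void main(String[] args) {
        FileNumberSequence seq = new FileNumberSequence();
        System.out.println("clientA file " + seq.next("clientA"));
        System.out.println("clientA file " + seq.next("clientA"));
        System.out.println("clientB file " + seq.next("clientB"));
    }
}
```

One thing to keep in mind with this approach: if the file insert fails after a number has been drawn, that number is skipped, so the sequence is unique per client but may have gaps. A unique index on (clientId, filenumber) still guards against duplicates.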
You have many options for implementing cross-document/collection transactions; you can find some good information about this in this blog post:
http://edgystuff.tumblr.com/post/93523827905/how-to-implement-robust-and-scalable-transactions (the 2 phase commit is documented in detail here: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/ )
Since you are talking about Couchbase, you can find some patterns here too:
http://docs.couchbase.com/couchbase-devguide-2.5/#providing-transactional-logic
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I'm writing a Java program right now that reads files and writes their content (after some modifications) into a relational database.
My problem right now is that the program should support a wide range of databases and not only one.
So in my program I create SQL statements and commit them to the DB - no problem. (SAP HANA)
Now I want to add another DB (MySQL) and have to slightly change the SQL syntax of the query before committing.
My solution right now is copying the code block that creates the statements and making the DB-specific changes to it. But that obviously can't be it (too many databases -> 80% of the code never used). I probably need some kind of mapper that converts my SQL to a dialect that the chosen DB understands.
Now, I found out about Hibernate and other mappers, but I don't think they fit my needs. The problem is that they expect a Java object (POJO) and convert it. But since I don't know what kind of data my program is going to load, I cannot create static objects for each column, for example.
Sometimes I need to create 4 columns, sometimes 10. sometimes they are Integer, sometimes Strings / varchar. And all of the time they have different names. So all tutorials I found on hibernate are starting from a point where the program is certain what kind of data is going to be inserted into the db which my program is not.
Moreover, I need to insert a large number of lines per table (like a billion+), and I think it might be slow to create an object for each insert.
I hope someone understands my problem and can give me some hints. Maybe a mapper that just converts SQL without the need to create an object first.
Thank you very much! : )
Edit: to make it clearer: the purpose of the program is to fill up a relational db with data that is stored / described in files (like CSV and XML). So the db is not used as a tool to store the data; storing the data there is the main aim. I need a relational db filled with data that the user provides, and not only one db but different kinds of RDBs.
I think you are describing a perfect use for a file system. Or if you want to go with a filesystem abstraction:
have a look at the apache jackrabbit project
So basically you want to write a tool that writes an arbitrary text file (some kind of CSV, I assume) into an arbitrary database system? Creating tables and content on the fly, depending on the structure of the text file?
Using a high-level abstraction layer like Hibernate is not going to take you anywhere soon. What you want to do is low-level database interaction. As long as you don't need any DBMS-specific features, you should get a long way with ANSI SQL. If that is not enough, I don't see an easy way out of this. Maybe it is an option to write your own abstraction layer that handles the DBMS-specific formatting of the SQL statements. Doesn't sound nice, though.
A different thing to think about is the large number of lines per table (like a billion+). Using single-row INSERT statements is not a good idea. You have to make use of efficient mass data interfaces, which are strongly DBMS dependent! Prepared statements are the bare minimum here.
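As a rough idea of what such a thin abstraction layer could look like, here is a Java sketch that builds a parameterized INSERT for whatever columns the input file happens to have. InsertSqlBuilder and its Dialect enum are hypothetical, and the only dialect difference modelled is the identifier quote character (backticks for MySQL, double quotes for ANSI-style databases); real dialects differ in more than that.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of a tiny dialect layer; not a complete SQL abstraction.
class InsertSqlBuilder {
    // Only the identifier quote character differs between the dialects modelled here.
    enum Dialect {
        ANSI('"'), MYSQL('`');
        final char quote;
        Dialect(char quote) { this.quote = quote; }
    }

    // Quotes an identifier, doubling any embedded quote character.
    static String quote(String name, Dialect d) {
        String q = String.valueOf(d.quote);
        return q + name.replace(q, q + q) + q;
    }

    // Builds "INSERT INTO t (c1, c2) VALUES (?, ?)" for a column list known only at runtime.
    static String insertSql(String table, List<String> columns, Dialect d) {
        String cols = columns.stream().map(c -> quote(c, d)).collect(Collectors.joining(", "));
        String params = columns.stream().map(c -> "?").collect(Collectors.joining(", "));
        return "INSERT INTO " + quote(table, d) + " (" + cols + ") VALUES (" + params + ")";
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("id", "name"); // e.g. taken from a CSV header
        System.out.println(insertSql("people", columns, Dialect.MYSQL));
        System.out.println(insertSql("people", columns, Dialect.ANSI));
    }
}
```

The resulting text can be handed to Connection.prepareStatement once per table, with rows then queued through addBatch() and sent with executeBatch(), so no object per row is needed. For a billion rows a DBMS-specific bulk loader will probably still be faster.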
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I'm designing a database and a Java application to do the following:
1. Allow user to query the database via an API.
2. Allow a user to save a query and identify the query via a 'query-id'. User can then pass-in 'query-id' on next call to API, which will execute the query associated with id but it will only retrieve data from the last time the specific query was requested.
- Along with this, I would also need to save the query-id information for each UserID.
Information regarding the Database
The database of choice is PostgreSQL and the information to be requested by user will be stored in various tables.
My question: Any suggestions/advice/tips on how to go about implementing requirement No. 2?
Is there an existing design pattern, SQL query, or built-in db function for saving a query and fetching information from multiple tables since the last returned results?
Note:
My initial thought so far is to store the last row read from each table (each row in all the tables will have a primary key) in a data structure, then save this data structure for each saved query and use it when retrieving data again.
For storing the user and query-id information, I was thinking of creating a separate table to store the UserName, UserUUID, SavedQuery, LastInfoRetrieved.
Thanks.
This is quite a question. The obvious tool to use here would be prepared statements, but since these are planned on first run, they can run into problems when run multiple times with different parameters. Consider the difference, assuming that id ranges from 1 to 1000000, between:
SELECT * FROM mytable WHERE id > 999900;
and
SELECT * FROM mytable WHERE id > 10;
The first should use an index while the second should do a physical-order scan of the table.
A second possibility would be to have functions which return refcursors. This would mean the query is actually run when the refcursor is returned.
A third possibility would be to have a schema of tables that could be used for this, per session, holding results. Ideally these would be temporary tables in pg_temp, but if you have to preserve across sessions, that may be less desirable. Building such a solution is a lot more work and adds a lot of complexity (read: things that can go wrong) so it is really a last choice.
From what you say, refcursors sound like the way to do this, but keep in mind that PostgreSQL needs to know what data types to return, so you can run into some difficulties in this regard (read the documentation thoroughly before proceeding). If prepared statements get you where you need to go, that might be simpler.