Archiving a MongoDB collection - Java

I'm using Java Spring and Spring Data for MongoDB.
I have a collection that should only contain documents from the last 3 months, but all documents should be preserved somewhere (maybe exported to a file?). I've been looking for a solution, but everything I can find talks about full DB backups.
What is the best way to keep the collection limited to the last 3 months? (a weekly cron?)
How should I save the collection archive? I think mongodump is overkill.

Both mongoexport and mongodump support a -q option to specify a query that limits the documents to be exported or dumped. The choice between them is mostly a question of what format you'd like the data stored in.
Let's assume that you have a collection with a timestamp field. You could run either one of these (filling in the required names and times in the angle brackets):
mongoexport -d <yourdatabase> -c <yourcollection> -q "{ timestamp: { \$gt: <yourtimestamp>}}" -o <yourcollection_export_yourtimestamp>.json
mongodump -d <yourdatabase> -c <yourcollection> -q "{ timestamp: { \$gt: <yourtimestamp>}}"
And then delete the old data.
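The deletion can then be a remove with the inverted query in the mongo shell (same placeholders as above); this is just a sketch of the idea:
db.<yourcollection>.remove({ timestamp: { $lt: <yourtimestamp> } })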
Alternatively, you could take periodic snapshots via cron with either method on a collection that has a TTL index, so that you don't have to prune it yourself - MongoDB will automatically delete the older data:
db.collectionname.ensureIndex( { "createdAt": 1 }, { expireAfterSeconds: 7862400 } )
This will keep deleting any document older than 91 days (7862400 seconds), based on the createdAt field in the document:
http://docs.mongodb.org/manual/tutorial/expire-data/
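Since the question mentions Spring Data for MongoDB, the same TTL index can also be declared with the @Indexed annotation. A minimal sketch, assuming a hypothetical Event document class with a createdAt field (the collection and class names are placeholders):
import java.util.Date;

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.index.Indexed;
import org.springframework.data.mongodb.core.mapping.Document;

@Document(collection = "yourcollection")
public class Event {

    @Id
    private String id;

    // TTL index: MongoDB removes the document roughly 91 days (7862400 seconds)
    // after the createdAt value, once the index has been created.
    @Indexed(expireAfterSeconds = 7862400)
    private Date createdAt;

    // getters and setters omitted
}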

With mongoexport you can back up a single collection instead of the whole database. I would recommend a cron job (like you said) to export the data and keep the database limited to the documents of the last 3 months by removing the older documents.
mongoexport -d databasename -c collectionname -o savefilename.json
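If you prefer to drive the weekly export from the Spring application itself rather than an OS cron job, a sketch along these lines shells out to mongoexport on a schedule (database, collection and output path are placeholders; @EnableScheduling must be active):
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class CollectionArchiver {

    // Every Sunday at 02:00: export the collection to a timestamped JSON file.
    @Scheduled(cron = "0 0 2 * * SUN")
    public void exportCollection() throws Exception {
        String outFile = "/backups/collectionname-" + System.currentTimeMillis() + ".json";
        new ProcessBuilder("mongoexport",
                "-d", "databasename",
                "-c", "collectionname",
                "-o", outFile)
                .inheritIO()
                .start()
                .waitFor();
    }
}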

Related

Get all items of the last 15 minutes in DynamoDB

As I come from an RDBMS background I am a bit confused with DynamoDB and how to write this query.
Problem: I need to filter out the data that is older than 15 minutes.
I have created a GSI with hash key materialType and sort key createTime (the create time format is Instant.now().toEpochMilli()).
Now I have to write a Java query that returns only the items from the last 15 minutes.
Here is an example using the CLI.
:v1 should be the material type id that you are searching on. :v2 should be the epoch time in milliseconds for 15 minutes ago, which you will have to calculate.
aws dynamodb query \
  --table-name mytable \
  --index-name myindex \
  --key-condition-expression "materialType = :v1 AND createTime > :v2" \
  --expression-attribute-values '{
    ":v1": {"S": "some id"},
    ":v2": {"N": "766677876567"}
  }'
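In Java, the same query can be expressed with the AWS SDK's QueryRequest; a sketch, with the table and index names taken from the CLI example above and the 15-minute cut-off computed in code:
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

public class RecentItemsQuery {

    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Epoch millis for 15 minutes ago, matching the createTime format
        long fifteenMinutesAgo = Instant.now().minusSeconds(15 * 60).toEpochMilli();

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":v1", new AttributeValue().withS("some id"));
        values.put(":v2", new AttributeValue().withN(Long.toString(fifteenMinutesAgo)));

        QueryRequest request = new QueryRequest()
                .withTableName("mytable")
                .withIndexName("myindex")
                .withKeyConditionExpression("materialType = :v1 AND createTime > :v2")
                .withExpressionAttributeValues(values);

        QueryResult result = client.query(request);
        result.getItems().forEach(System.out::println);
    }
}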

Use the same connection for multiple Logstash configurations

I'm using Logstash 2.4.1 to load data to Elasticsearch 2.4.6.
I have the following Logstash config:
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:@database:1521:db1"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => "ojdbc6-11.2.0.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    parameters => { "id" => 1 }
    statement => "SELECT modify_date, userName from user where id = :id AND modify_date >= :sql_last_value"
    schedule => "*/1 * * * *"
    tracking_column => "modify_date"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "index1"
    document_type => "USER"
  }
  stdout { codec => rubydebug }
}
So, every minute it goes to the database to check if there is new data for Elastic.
It works perfectly, but there is one problem:
We have around 100 clients, and they are all in the same database instance.
That means I have 100 scripts and will have 100 instances of Logstash running, meaning 100 open connections:
nohup ./logstash -f client-1.conf
nohup ./logstash -f client-2.conf
nohup ./logstash -f client-3.conf
nohup ./logstash -f client-4.conf
nohup ./logstash -f client-5.conf
and so on...
This is just bad.
Is there any way I can use the same connection for all my scripts ?
The only difference between all those scripts is the id parameter and the index name; each client will have a different id and a different index:
parameters => { "id" => 1 }
index => "index1"
Any ideas ?
I don't have experience with the JDBC input, but I assume that it will index each column in its own field inside of each document (one document per row).
So then you don't have to filter on a specific user in the query, but can just add all the rows to the same index. After that you can filter on a specific user with Kibana (assuming you want to use Kibana to analyze the data). Filtering can also be done with an ES query.
With that approach you only need one Logstash configuration.
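A minimal sketch of that single-configuration approach, reusing the connection settings from the question (the shared index name is an assumption):
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:@database:1521:db1"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => "ojdbc6-11.2.0.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    # no per-client "id" parameter: pull the rows for all clients in one query
    statement => "SELECT id, modify_date, userName from user WHERE modify_date >= :sql_last_value"
    schedule => "*/1 * * * *"
    tracking_column => "modify_date"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "users"          # one shared index; filter on id in Kibana or with an ES query
    document_type => "USER"
  }
}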
There are a few ways to implement this, you will need to play with them to find the one that works for you.
In my experience with JDBC input, all columns in your select become fields in a document. Each row returned by the JDBC input will result in a new document.
If you select the client id instead of using it as a parameter/predicate, you can then use that id in your Elasticsearch output and append it to the index name.
Each document (row) will then be routed to an index based on the client id.
This is very similar to their date based index strategy and is supported by default in logstash.
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-index
Just remember that all column names are automatically lowercased when Logstash brings them in, and they are case sensitive once inside Logstash.
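For example, if the select also returns the client id in a column named id (remember: lowercased by Logstash), the output can route each row with a sprintf reference to that field; a sketch, with an assumed index prefix:
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "client-%{id}"   # the id field from the row becomes part of the index name
    document_type => "USER"
  }
}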
Instead of using one index per customer, I decided to use one index for all clients and to do the client filtering in the query.
Works well, no performance issues.
Check this for more info: My post on elastic forum

Search Queries for Gerrit

I am doing some basic search queries for Gerrit but could not find much.
I want a Gerrit query which can fetch a list of changes based on multiple authors, or based on the created and updated dates.
I am running the following query in Java, but I need a query which can fetch the list based on multiple authors:
gerritApi.changes().query("project:" + "Test-Project"+"+"+"status:merged").get();
To search for all open changes from AUTHOR-1 or AUTHOR-2, do the following:
curl --request GET --user USER https://GERRIT-SERVER/a/changes/?q=\(owner:AUTHOR-1+OR+owner:AUTHOR-2\)+AND+status:open
Or in java:
gerritApi.changes().query("(owner:AUTHOR-1 OR owner:AUTHOR-2) + AND status:open").get();

ElasticSearch index deletion

I'm using Java API to delete old indexes from ElasticSearch.
Client client = searchConnection.client;
DeleteIndexResponse delete = client.admin().indices().delete(new DeleteIndexRequest("location")).actionGet();
During the deletion the cluster goes red for a minute and does not index new data - the reason being "missing indices/replicas etc".
How can I tell Elasticsearch that I'm going to delete them, to prevent the "red state"?
You could use aliases in order to abstract from the real indices underneath. The idea would be to read from an alias and write to an alias instead. That way you can create a new index, swap the write alias to the new index (so that the indexing process is not disrupted) and then delete the old index. Process-wise, it would go like this:
Context: Your current location index has the location_active alias and the indexing process writes to the location_active alias instead of directly to the location index.
Step 1: Create the new location_112015 index
curl -XPUT localhost:9200/location_112015
Step 2: Swap the location_active alias from the "old" location index to the "new" one created in step 1
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    { "remove" : { "index" : "location", "alias" : "location_active" } },
    { "add" : { "index" : "location_112015", "alias" : "location_active" } }
  ]
}'
Note that this operation is atomic, so if the indexing process keeps sending new documents to location_active, it will be transparent for it and no docs will be lost, no errors will be raised.
Step 3: Remove the old index
curl -XDELETE localhost:9200/location
Step 4: Rinse and repeat as often as needed
Note: these operations can easily be performed with the Java client library as well.
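With the transport client used in the question, the same three steps look roughly like this (a sketch, using the index and alias names from the example above):
// Step 1: create the new index
client.admin().indices().prepareCreate("location_112015").get();

// Step 2: atomically swap the alias from the old index to the new one
client.admin().indices().prepareAliases()
        .removeAlias("location", "location_active")
        .addAlias("location_112015", "location_active")
        .get();

// Step 3: delete the old index
client.admin().indices().prepareDelete("location").get();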

Mongo DB, Java Mongo API, how to add hint into aggregate command

I am stuck with Mongo and the $hint command.
I have a collection and I have indexed it. But the problem is that I query the collection with the aggregation framework and I want to temporarily disable use of the index, so I use the hint command like this:
db.runCommand(
  { aggregate: "MyCollectionName",
    pipeline: [ { $match : { ...something... } },
                { $project : { ...something... } } ]
  },
  { $hint: { $natural: 1 } }
)
Please note that I use {$hint:{$natural:1}} to disable index usage for this query.
I have run this command SUCCESSFULLY on the MongoDB command line, but I don't know how to map it to the Mongo Java API (Java code).
I am using the mongo-2.10.1.jar library.
Currently you can't; it is on the backlog. Please vote for SERVER-7944.
