I'm using Logstash 2.4.1 to load data to Elasticsearch 2.4.6.
I have the following Logstash config:
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:#database:1521:db1"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => "ojdbc6-11.2.0.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    parameters => { "id" => 1 }
    statement => "SELECT modify_date, userName from user where id = :id AND modify_date >= :sql_last_value"
    schedule => "*/1 * * * *"
    tracking_column => modify_date
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "index1"
    document_type => "USER"
  }
  stdout { codec => rubydebug }
}
So every minute it queries the database to check whether there is new data for Elasticsearch.
It works perfectly, but there is one problem:
We have around 100 clients, and they are all in the same database instance.
That means I have 100 scripts and will have 100 instances of Logstash running, meaning 100 open connections:
nohup ./logstash -f client-1.conf Logstash startup
nohup ./logstash -f client-2.conf Logstash startup
nohup ./logstash -f client-3.conf Logstash startup
nohup ./logstash -f client-4.conf Logstash startup
nohup ./logstash -f client-5.conf Logstash startup
and so on...
This is just bad.
Is there any way I can use the same connection for all my scripts?
The only difference between all those scripts is the parameter id and the index name; each client has a different id and a different index:
parameters => { "id" => 1 }
index => "index1"
Any ideas?
I don't have experience with the JDBC input, but I assume it will index each column into its own field inside each document (one document per row).
So then you don't have to filter on a specific client in the query; just add all the rows to the same index. After that you can filter on a specific client in Kibana (assuming you want to use Kibana to analyze the data). Filtering can also be done with an ES query.
With that approach you only need one Logstash configuration.
There are a few ways to implement this; you will need to play with them to find the one that works for you.
In my experience with the JDBC input, all columns in your SELECT become fields in a document, and each row returned will result in a new document.
If you select the client id as a column instead of using it only as a parameter/predicate, you can then reference that id in your Elasticsearch output and interpolate it into the index name.
Each document (row) will then be routed to an index based on the client id.
This is very similar to the date-based index strategy and is supported by default in Logstash:
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-index
Just remember that all column names are automatically lowercased when Logstash brings them in. They are case sensitive once in Logstash.
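For example, a minimal sketch of the elasticsearch output under that approach, assuming the SELECT also returns the client id and that it arrives in Logstash as a lowercased field named client_id (the field name here is illustrative):
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # one index per client, built from the client_id field of each event
    index => "client-%{client_id}"
  }
}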
Instead of using one index per customer, I decided to use one index for all clients and to filter by client in the query.
It works well, with no performance issues.
Check this for more info: my post on the Elastic forum.
I want to connect Hive to Elasticsearch. I followed the instructions from here.
I did the following steps:
1. start-dfs.sh
2. start-yarn.sh
3. launch elasticsearch
4. launch kibana
5. launch hive
Inside Hive:
a- create a database
b- create a table
c- load data into the table (LOAD DATA LOCAL INPATH '/home/myuser/Documents/datacsv/myfile.csv' OVERWRITE INTO TABLE students; )
d- add jar /home/myuser/elasticsearch-hadoop-7.10.1/dist/elasticsearch-hadoop-hive-7.10.1.jar
e- create a table for Elastic.
create table students_es (stt int not null, mahocvien varchar(10), tenho string, ten string, namsinh date, gioitinh string, noisinh string, namvaodang date, trinhdochuyenmon string, hesoluong float, phucaptrachnhiem float, chucvudct string, chucdqh string, dienuutien int, ghichu int) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.nodes' = '127.0.0.1', 'es.port' = '9201', 'es.resource' = 'students/student');
f- insert overwrite table students_es select * from students;
Then the error I got is the following:
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org/apache/commons/httpclient/protocol/ProtocolSocketFactory
I used the following components:
kibana: 7.10.1
hive: 3.1.2
hadoop: 3.1.2
I finally found how to solve it.
You need to download the jar file commons-httpclient-3.1.jar and put it into your Hive lib directory.
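For example (assuming a standard Maven Central download and that $HIVE_HOME points to your Hive installation):
wget https://repo1.maven.org/maven2/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar
cp commons-httpclient-3.1.jar $HIVE_HOME/lib/
Then restart the Hive session so the jar is picked up on the classpath.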
Recently, one of our clients reported not being able to create a table based on a query against a view. That said, they were able to save the result of a query against a table into another table. This issue spawned a more implementation-focused question about the Java client libraries. Specifically, is there any way to save the result set of a query against a view to a table using the Java client library? I will be digging and will post anything that I find. That said, any early guidance would be appreciated!
To be specific and add more context, I note that the following process failed when the query was run against a union view.
java -jar BigQueryToCloudExporter.jar ./GAFastAccessKey.p12 '' "
Select date(date_add('2014-08-09',floor(datediff(date(sec_to_timestamp(visitstarttime)),'2014-08-03')/7)*7,"DAY")) WeekEndDate
, hits.eventinfo.eventaction GA_XXXXXX
, count(distinct visitID) PDP_PPC
FROM (TABLE_DATE_RANGE([Union_View.GA],
TIMESTAMP('2014-08-30'),
TIMESTAMP('2014-09-13')))
where hits.eventinfo.eventcategory='property attributes'
and brandId=121
--hits.eventinfo.eventcategory='property inquiry'
and trafficsource.medium like '%cpc%'
--and trafficsource.campaign not like '%ppb%'
and trafficsource.campaign like '%mpm%'
group each by WeekEndDate, GA_XXXXXX
order by WeekEndDate, GA_XXXXXX limit 100" StagingQueryTable QueryTable AVRO gs://XXXXXX/QueryTable*.avro
On the other hand, the following process succeeded when the query was run against a BigQuery table (keeping everything else the same).
java -jar BigQueryToCloudExporter.jar ./GAFastAccessKey.p12 '' "
Select date(date_add('2014-08-09',floor(datediff(date(sec_to_timestamp(visitstarttime)),'2014-08-03')/7)*7,"DAY")) WeekEndDate
, hits.eventinfo.eventaction GA_XXXXXX
, count(distinct visitID) PDP_PPC
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_],
TIMESTAMP('2014-08-30'),
TIMESTAMP('2014-09-13')))
where hits.eventinfo.eventcategory='property attributes'
and brandId=121
--hits.eventinfo.eventcategory='property inquiry'
and trafficsource.medium like '%cpc%'
--and trafficsource.campaign not like '%ppb%'
and trafficsource.campaign like '%mpm%'
group each by WeekEndDate, GA_XXXXXX
order by WeekEndDate, GA_XXXXXX limit 100" StagingQueryTable QueryTable AVRO gs://XXXXXX/QueryTable*.avro
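For what it's worth, the usual way to persist a query's result set with the Java client is to set a destination table on the query job. Below is a minimal sketch using the newer google-cloud-bigquery library; the dataset, table, and query strings are illustrative, and this does not by itself explain the view-specific failure above.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class SaveViewQueryToTable {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical query against a (union) view; legacy SQL to match the example above.
        String query = "SELECT ... FROM [Union_View.GA]";

        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(query)
            .setUseLegacySql(true)
            .setAllowLargeResults(true)                         // requires a destination table
            .setDestinationTable(TableId.of("my_dataset", "QueryTable"))  // hypothetical names
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
            .setCreateDisposition(JobInfo.CreateDisposition.CREATE_IF_NEEDED)
            .build();

        // Run the job and wait for it; the result set lands in the destination table.
        Job job = bigquery.create(JobInfo.of(config)).waitFor();
        if (job.getStatus().getError() != null) {
            throw new RuntimeException(job.getStatus().getError().toString());
        }
    }
}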
I'm using Java Spring and Spring Data for MongoDB.
I have a collection that needs to contain only documents from the last 3 months, but all the documents should be saved in some way (maybe export to a file?). I'm looking for a solution, but everything I can find talks about full DB backups.
What is the best way to keep the collection limited to only the last 3 months? (A weekly cron job?)
How should I save the collection archive? I think mongodump is overkill.
Both mongoexport and mongodump support a -q option to specify a query that limits the documents they will export or dump. The choice between them is mostly a question of what format you'd like the data to be stored in.
Let's assume that you have a collection with a timestamp field. You could run either one of these (filling in the required names and times in the angle brackets):
mongoexport -d <yourdatabase> -c <yourcollection> -q "{ timestamp: { \$gt: <yourtimestamp>}}" -o <yourcollection_export_yourtimestamp>.json
mongodump -d <yourdatabase> -c <yourcollection> -q "{ timestamp: { \$gt: <yourtimestamp>}}"
And then delete the old data.
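The deletion itself could look something like this in the shell, where <cutofftimestamp> is a hypothetical placeholder for the three-month boundary (ideally a point already covered by an earlier export):
db.yourcollection.remove({ timestamp: { $lt: <cutofftimestamp> } })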
Alternatively, you could take periodic snapshots via cron with either method on a collection that has a TTL index, so that you don't have to prune it yourself - MongoDB will automatically delete the older data:
db.collectionname.ensureIndex( { "createdAt": 1 }, { expireAfterSeconds: 7862400 } )
This will keep deleting any document older than 91 days, based on a createdAt field in the document.
http://docs.mongodb.org/manual/tutorial/expire-data/
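Since the question mentions Spring Data MongoDB, the same TTL index can also be declared on the mapped document class. A minimal sketch, assuming a hypothetical Event class with a createdAt field (newer Spring Data versions also require automatic index creation to be enabled for the annotation to take effect):
import java.util.Date;

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.index.Indexed;
import org.springframework.data.mongodb.core.mapping.Document;

// Hypothetical mapped class; MongoDB removes documents ~91 days after createdAt.
@Document(collection = "yourcollection")
public class Event {

    @Id
    private String id;

    // TTL index, equivalent to the ensureIndex call above (7862400 seconds = 91 days)
    @Indexed(expireAfterSeconds = 7862400)
    private Date createdAt;

    // getters and setters omitted
}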
With mongoexport you can back up a single collection instead of the whole database. I would recommend a cron job (like you said) to export the data and keep the database limited to the documents of the last 3 months by removing older documents.
mongoexport -d databasename -c collectionname -o savefilename.json
I'm stuck with the Mongo $hint command.
I have a collection and I have indexed it. The problem is that I query the collection with the aggregation framework, but I want to temporarily disable the index, so I use the hint command like this:
db.runCommand(
  {
    aggregate: "MyCollectionName",
    pipeline: [
      { $match: { ...something... } },
      { $project: { ...something... } }
    ]
  },
  { $hint: { $natural: 1 } }
)
Please note that I use {$hint: {$natural: 1}} to disable the index for this query.
I have run this command successfully on the MongoDB command line, but I don't know how to map it to the Mongo Java API (Java code).
I am using the mongo-2.10.1.jar library.
Currently you can't - it is on the backlog - please vote for SERVER-7944
I have to generate a random number within a range (0-100,000) in a cluster environment (many stateless Java-based app servers + MongoDB), so every user request will get some unique number and keep it across the next few requests.
As I understand it, my options are:
1. Have some number persisted in Mongo and incrementAndGet it - but it's not atomic - bad choice.
2. Use Redis - it's atomic and supports counters.
3. Any other idea? Is it safe to use a UUID and set a range for it?
4. Hazelcast?
Any other thoughts?
Thanks
I would leverage the existing MongoDB infrastructure and use the MongoDB findAndModify command to do an atomic increment and get operation.
For the shell, the command would look like this:
var result = db.ids.findAndModify( {
  query: { _id: "counter" },
  sort: { rating: 1 },
  new: true,
  update: { $inc: { counter: 1 } },
  upsert: true
} );
The 'new: true' option returns the document after the update; 'upsert: true' creates the document if it is missing.
The 10gen-supported driver and the Asynchronous Driver both contain helper methods/builders for the findAndModify command.
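As a rough sketch with the legacy 10gen Java driver (database and collection names are illustrative; the seven-argument findAndModify overload maps directly onto the shell call above):
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class CounterExample {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost");
        DB db = client.getDB("mydb");                    // hypothetical database name
        DBCollection ids = db.getCollection("ids");

        DBObject query = new BasicDBObject("_id", "counter");
        DBObject update = new BasicDBObject("$inc", new BasicDBObject("counter", 1));

        // findAndModify(query, fields, sort, remove, update, returnNew, upsert)
        DBObject result = ids.findAndModify(query, null, null, false, update, true, true);

        long nextId = ((Number) result.get("counter")).longValue();
        System.out.println("Next id: " + nextId);

        client.close();
    }
}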