The problem I have is whether it is possible to use ElasticSearch on top of a relational database.
1. When I insert or delete a record in the relational database, will it reflect in the elastic search?
2. If I insert a document in the elastic search will it be persisted in the database?
3. Does it use a cache or an in-memory database to facilitate search? If so, what does it use?
There is no direct connection between Elasticsearch and relational databases - ES has its own datastore based on Apache Lucene.
That said, you can, as others have noted, use the Elasticsearch JDBC River plugin to load data from a relational database into Elasticsearch (a sketch of how a river was registered appears after the list below). Keep in mind there are a number of limitations to this approach:
- It's one-way only: the JDBC River only reads from the source database; it does not push data from ES into the source database.
- Deletes are not handled: if you delete data in your source database after it has been indexed into ES, that deletion will not be reflected in ES. See "ElasticSearch river JDBC MySQL not deleting records" and https://github.com/jprante/elasticsearch-river-jdbc/issues/213
- It was not intended as a production, scalable solution for relational database and Elasticsearch integration. Per the JDBC River author's comment from January 2014, it was designed as "a single node (non-scalable) solution" "for demonstration purposes":
http://elasticsearch-users.115913.n3.nabble.com/Strategy-for-keeping-Elasticsearch-updated-with-MySQL-td4047253.html
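For reference, a river was registered by PUTting a _meta document into Elasticsearch's special _river index. Here is a minimal Java sketch of that registration; the river name, MySQL connection details and SQL statement are placeholders, and the exact jdbc options differ between river versions, so check the plugin's README rather than taking this verbatim.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RegisterJdbcRiver {
    public static void main(String[] args) throws Exception {
        // River definitions live in the special _river index; the river name
        // "my_jdbc_river" and the MySQL details below are placeholders.
        String meta = "{"
                + "\"type\":\"jdbc\","
                + "\"jdbc\":{"
                + "\"url\":\"jdbc:mysql://localhost:3306/mydb\","
                + "\"user\":\"dbuser\","
                + "\"password\":\"dbpass\","
                + "\"sql\":\"select * from articles\""
                + "}}";

        URL url = new URL("http://localhost:9200/_river/my_jdbc_river/_meta");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(meta.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Elasticsearch responded with HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}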
To answer your questions directly (assuming you use the JDBC River):
1. New document inserts can be handled by the JDBC River, but deletes of existing data are not.
2. Data does not flow from Elasticsearch into your relational database; that would need to be custom development work (a rough sketch of that application-level approach follows this list).
3. Elasticsearch is built on top of Apache Lucene. Lucene in turn depends a great deal on file system caching at the OS level (which is why ES recommends keeping heap size down to no more than 50% of total memory, to leave plenty for the file system cache). In addition, the ES/Lucene stack makes use of a number of internal caches (like the Lucene field cache and the filter cache):
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-cache.html
and
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html
Internally the filter cache is implemented using a bitset:
http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/
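On point 2, if you do need writes to reach both stores, the usual workaround is to do it in the application layer: write to the relational database first (it stays the source of truth), then index the same record into Elasticsearch over its REST API. A rough Java sketch, with placeholder table, index and connection details, and none of the error handling, retries or proper JSON building a real implementation would need:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class DualWrite {
    public static void main(String[] args) throws Exception {
        long id = 42L;
        String title = "Example document";

        // 1) Persist in the relational database (the source of truth).
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "dbuser", "dbpass");
             PreparedStatement ps = db.prepareStatement(
                "INSERT INTO documents (id, title) VALUES (?, ?)")) {
            ps.setLong(1, id);
            ps.setString(2, title);
            ps.executeUpdate();
        }

        // 2) Index the same record in Elasticsearch via its document API.
        //    A real implementation would use a JSON library, not string concatenation.
        String json = "{\"title\":\"" + title + "\"}";
        URL url = new URL("http://localhost:9200/documents/doc/" + id);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Indexed, HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}

A delete would follow the same pattern, issuing the delete against both stores from the application.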
1) You should take a look at the ElasticSearch JDBC river here for inserts (I believe deleted rows aren't managed any more, see the developer's comment).
2) Unless you do it manually, it is not natively managed by ElasticSearch.
3) Indeed, ElasticSearch uses caches to improve performance, especially when using filters. Bitsets (arrays of 0/1) are stored.
Came across this question while looking for a similar thing. Thought an update was due.
My Findings:
Elasticsearch has now deprecated Rivers, though the above-mentioned jprante's River lives on...
Another option I found was the Scotas Push Connector which pushes inserts, updates and deletes from an RDBMS to Elasticsearch. Details here: http://www.scotas.com/product-scotas-push-connector.
Example implementation here: http://www.scotas.com/blog/?p=90
Related
We have a source Oracle database with a lot of tables (let's say 100) which we need to mirror to a target database, so we need to copy data increments periodically to the target tables. The target database is currently Oracle, but in the near future it will probably be changed to a different database technology.
Currently we can create a PL/SQL procedure which dynamically generates DML (insert, update or merge statements) for each table from Oracle metadata (assuming that the source and target tables have exactly the same attributes).
But we would rather create a database-technology-independent solution, so that when we change the target database to another one (e.g. MS SQL or Postgres) we will not need to change the whole data-mirroring logic.
Does anyone have a suggestion for how to do this differently (preferably in Java)?
Thanks for any advice.
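For reference, the same metadata-driven generation is possible in plain JDBC: DatabaseMetaData exposes the column and primary-key lists for any database, so the DML can be built without Oracle-specific dictionary views. A rough sketch with hypothetical schema, table and connection details; the generated MERGE syntax itself still has to match the target database (e.g. PostgreSQL would need INSERT ... ON CONFLICT instead), and it assumes the increment has been loaded into a staging table named <TABLE>_STG.

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class MergeGenerator {

    // Builds a MERGE statement for one table purely from JDBC metadata,
    // so the same code works against any driver that implements
    // DatabaseMetaData (Oracle, MS SQL, Postgres, ...).
    static String buildMerge(Connection con, String schema, String table) throws Exception {
        DatabaseMetaData md = con.getMetaData();

        List<String> columns = new ArrayList<>();
        try (ResultSet rs = md.getColumns(null, schema, table, null)) {
            while (rs.next()) {
                columns.add(rs.getString("COLUMN_NAME"));
            }
        }

        List<String> keys = new ArrayList<>();
        try (ResultSet rs = md.getPrimaryKeys(null, schema, table)) {
            while (rs.next()) {
                keys.add(rs.getString("COLUMN_NAME"));
            }
        }

        List<String> nonKeys = new ArrayList<>(columns);
        nonKeys.removeAll(keys);

        String on = keys.stream().map(k -> "t." + k + " = s." + k)
                .collect(Collectors.joining(" AND "));
        String set = nonKeys.stream().map(c -> "t." + c + " = s." + c)
                .collect(Collectors.joining(", "));
        String cols = String.join(", ", columns);
        String vals = columns.stream().map(c -> "s." + c)
                .collect(Collectors.joining(", "));

        // "<TABLE>_STG" is a placeholder staging table holding the increment.
        return "MERGE INTO " + table + " t USING " + table + "_STG s"
                + " ON (" + on + ")"
                + " WHEN MATCHED THEN UPDATE SET " + set
                + " WHEN NOT MATCHED THEN INSERT (" + cols + ") VALUES (" + vals + ")";
    }

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/ORCL", "scott", "tiger")) {
            System.out.println(buildMerge(con, "MYSCHEMA", "CUSTOMERS"));
        }
    }
}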
The problem you have is called CDC - change data capture. In the case of Oracle this is complicated, because Oracle usually charges money for it.
So you can use:
PL/SQL or Java with plain SQL to incrementally detect changes in the data. It requires plenty of work and performance is poor (see the sketch below).
Use tools based on Oracle triggers, which detect data changes and push them into some queue.
Use a tool which can parse the content of the Oracle archive logs. These are commercial products: GoldenGate (from Oracle) and Shareplex (from Quest). GoldenGate also ships a Java technology (XStream) which allows you to inject a Java visitor into the data stream. These technologies also support sending data changes into a Kafka stream.
There are plenty of tools like Debezium, Informatica and Tibco which cannot parse the archived logs themselves; instead they use Oracle's built-in tool LogMiner. These tools usually do not scale well and cannot cope with higher data volumes.
As a summary: if you have money, pick GoldenGate or Shareplex. If you don't, pick Debezium or any other Java CDC project based on LogMiner.
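For the first (pure SQL) option, the usual pattern is to keep a watermark per table, typically a LAST_MODIFIED timestamp or a sequence value, and poll for rows above it. A minimal Java sketch, assuming the source tables carry a LAST_MODIFIED column (which is exactly the assumption that makes this option fragile); table names, connection details and the watermark value are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class IncrementalPoller {
    public static void main(String[] args) throws Exception {
        // Last watermark; normally persisted somewhere between runs.
        Timestamp lastSync = Timestamp.valueOf("2024-01-01 00:00:00");

        try (Connection src = DriverManager.getConnection(
                "jdbc:oracle:thin:@//src-host:1521/ORCL", "user", "password");
             PreparedStatement ps = src.prepareStatement(
                "SELECT ID, NAME, LAST_MODIFIED FROM CUSTOMERS "
                + "WHERE LAST_MODIFIED > ? ORDER BY LAST_MODIFIED")) {

            ps.setTimestamp(1, lastSync);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Apply the row to the target here (insert/update/merge).
                    System.out.println("changed row id=" + rs.getLong("ID"));
                    lastSync = rs.getTimestamp("LAST_MODIFIED");
                }
            }
        }
        // Persist lastSync for the next run. Note that deletes are invisible
        // to this approach unless rows are soft-deleted, which is one reason
        // the log-based CDC tools above exist.
    }
}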
I just recently switched from MySQL to MongoDB. With MySQL I stored the player data inside a hashmap and retrieved name, coins etc. from it, so I didn't have to constantly query the database for the data.
Now with MongoDB, would I need to do the same thing - store the values inside a hashmap and retrieve them the same way I did with MySQL?
It depends on your requirements. You have migrated from MySQL to MongoDB, but that doesn't mean your reads will be superfast. If there were some significant I/O improvement in MongoDB, the MySQL developers would have adopted it as well. MongoDB provides flexibility over MySQL and has some other advantages, but in terms of raw I/O it has no technical edge over MySQL. So if your load remains the same, you should keep a caching layer in front of the MongoDB layer. Both MySQL and MongoDB come with built-in caching that caches results per query, much like a hashmap, but the rest of the data sits on disk. So keep a caching layer to avoid excessive querying of the database.
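In practice that is the same pattern you already used with MySQL: load the player document into a map on login, read from the map while the player is online, write back and evict on logout. A minimal sketch, assuming the MongoDB Java sync driver (3.7+) and hypothetical database, collection and field names.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PlayerCache {
    private final MongoCollection<Document> players;
    private final Map<String, Document> cache = new ConcurrentHashMap<>();

    public PlayerCache(MongoClient client) {
        this.players = client.getDatabase("game").getCollection("players");
    }

    // Hit MongoDB only on a cache miss (e.g. on player login).
    public Document get(String name) {
        return cache.computeIfAbsent(name,
                n -> players.find(Filters.eq("name", n)).first());
    }

    // Write back and drop from the cache (e.g. on player logout).
    public void save(String name) {
        Document doc = cache.remove(name);
        if (doc != null) {
            players.replaceOne(Filters.eq("name", name), doc);
        }
    }

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            PlayerCache cache = new PlayerCache(client);
            Document p = cache.get("Steve");
            System.out.println(p == null ? "no such player" : p.toJson());
        }
    }
}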
As part of our application, we are using Jackrabbit (1.6.4) to store documents. Each document that is retrieved by our application is put into a folder structure in Jackrabbit, which is created if not existing.
Our DBA has noticed that the following query is executed a lot against the Oracle (11.2.0.2.0) database holding the Jackrabbit schema - more than 50000 times per hour, causing a lot of IO on the database. In fact, it is one of the top 5 SQL statements in terms of IO over elapsed time (97% IO):
select BUNDLE_DATA from VERSION_BUNDLE where NODE_ID = :1
Taking a look at the database, one notices that this table initially only contains a single record, comprising the node_id (data type RAW) key with the DEADBEEFFACEBABECAFEBABECAFEBABE value and then a couple of bytes in the bundle_data BLOB column. Later on, more records are added with additional data.
The SQL for the table looks like this:
CREATE TABLE "VERSION_BUNDLE"
(
"NODE_ID" RAW(16) NOT NULL ENABLE,
"BUNDLE_DATA" BLOB NOT NULL ENABLE
);
I have the following questions:
Why is Jackrabbit accessing this table so frequently?
Any Jackrabbit tuning options to make this faster?
Is the BUNDLE_DATA value changed by Jackrabbit at all or is it just read for every access to the repository?
Is there any way to tune the database schema to make it deal better with this scenario?
Update: The table initially contains only one record; additional records are added over time as decided internally by Jackrabbit. Access still seems to be read-only in most cases, as insert or update statements are not reported as being run in high numbers.
Is this physical I/O or logical I/O? With the data being read that often, I'd be surprised if the blocks are being aged out of the cache fast enough for physical I/O to be required.
If the JCR store is based on an Oracle database, you could reorganize the underlying table:
Build a hash-cluster of that table to prevent index accesses
Check whether you have licenses to use the partitioning option
By deleting unnecessary versions in your application, rows will get deleted (version pruning); see the sketch after this list
If you're storing binary objects like pictures or documents, also have a look at VERSION_BINVAL.
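If version pruning is the route you take, it can be done through the plain JCR API rather than at the database level. A rough sketch, assuming an already-obtained Session and the path of a mix:versionable node (the path is a placeholder); it keeps jcr:rootVersion and the current base version, which the repository would refuse to remove anyway.

import java.util.ArrayList;
import java.util.List;

import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.version.Version;
import javax.jcr.version.VersionHistory;
import javax.jcr.version.VersionIterator;

public class VersionPruner {

    // Removes all versions of a versionable node except jcr:rootVersion
    // and the current base version.
    public static void prune(Session session, String nodePath) throws Exception {
        Node node = (Node) session.getItem(nodePath);
        VersionHistory history = node.getVersionHistory();
        String baseVersion = node.getBaseVersion().getName();

        // Collect the names first, then remove, so we don't modify the
        // version storage while iterating over it.
        List<String> removable = new ArrayList<String>();
        VersionIterator it = history.getAllVersions();
        while (it.hasNext()) {
            Version v = it.nextVersion();
            String name = v.getName();
            if (!"jcr:rootVersion".equals(name) && !name.equals(baseVersion)) {
                removable.add(name);
            }
        }
        for (String name : removable) {
            history.removeVersion(name);
        }
    }
}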
Why is Jackrabbit accessing this table so frequently?
That's a sign that you're creating versions in your repository. Is that something your application is supposed to do?
Any Jackrabbit tuning options to make this faster?
Not that I'm aware of; one option to investigate is upgrading to a more recent Jackrabbit version. Version 2.4.2 was just released and 1.6.4 is almost two years old. It's possible that there were performance improvements between those releases.
Is the BUNDLE_DATA value changed by Jackrabbit at all or is it just read for every access to the repository?
By the looks of it, that's the GUID of the root repository node.
Is there any way to tune the database schema to make it deal better with this scenario?
As far as I know the schema is auto-generated by Jackrabbit, so the only option is to modify the table definition in a compatible way after it has been created. But that's a topic for a DBA, which I am not.
Why is Jackrabbit accessing this table so frequently?
We have seen that this table is accessed very often even if you are not asking for versions.
Take a look at this thread from the Jackrabbit users mailing list.
Is there a database out there that I can use for a really basic project that stores the schema in terms of documents representing an individual database table?
For example, if I have a schema made up of 5 tables (one, two, three, four and five), then the database would be made up of 5 documents in some sort of "simple" encoding (e.g. json, xml etc)
I'm writing a Java based app so I would need it to have a JDBC driver for this sort of database if one exists.
CouchDB, and you can use it with Java.
dbslayer is also lightweight, with a MySQL adapter. I guess this will make life a little easier.
I haven't used it for a bit, but HyperSQL has worked well in the past, and it's quite quick to set up:
"... offers a small, fast multithreaded and transactional database engine which offers in-memory and disk-based tables and supports embedded and server modes."
CouchDB works well (#zengr). You may also want to look at MongoDB.
Comparing Mongo DB and Couch DB
Java Tutorial - MongoDB
Also check http://jackrabbit.apache.org/ - not quite a DB, but it should also work.
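Going back to the HyperSQL suggestion above: it ships with a standard JDBC driver, so getting an in-memory database running from Java takes very little setup. A minimal sketch, assuming the hsqldb jar is on the classpath; the database name "testdb" and the table are arbitrary.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HsqldbDemo {
    public static void main(String[] args) throws Exception {
        // "mem:" keeps everything in memory; use "file:" for a disk-backed database.
        try (Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "SA", "");
             Statement st = con.createStatement()) {

            st.execute("CREATE TABLE one (id INTEGER PRIMARY KEY, name VARCHAR(50))");
            st.execute("INSERT INTO one VALUES (1, 'first row')");

            try (ResultSet rs = st.executeQuery("SELECT name FROM one WHERE id = 1")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}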
Imagine a classifieds website, a very simple one where users don't have login details.
I have this currently with MySql as a db. The db has several tables, because of the categories, but one main table for the classified itself. Total of 7 tables in my case.
I want to use only Solr as a "db" because some people on SO think it would be better, and I agree - if it works, that is.
Now, I have some quick questions about doing this:
Should I have multiple schema.xml files or config.xml files?
How do I query multiple indices?
How would this (having multiple indices) affect performance and do I need a more powerful machine (memory, cpu etc...) for managing this?
Would you eventually go with only Solr instead of what I planned to do, which is to use Solr to search and return ID numbers which I use to query and find the classifieds in MySql?
I have some 300,000 records today, and they probably won't be increasing.
I have not tested how the records would affect performance when using Solr with MySql, because I am still creating the website, but when using only MySql it is quite slow.
I am hoping it will be better with Solr + MySql, but as I said, if it is possible I will go with only Solr.
Thanks
4: If an item has status fields that get updated much more frequently than the rest of the record, then it's better to store that information in a database and retrieve it when you access the item. For example, if you stored your library book holdings in a Solr index, you would store the 'borrowed' status in a database. Updating Solr can take a fair bit of resources, and if you don't need to search on a field it doesn't really need to be in Solr.
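Concretely, that hybrid setup (Solr for search, MySQL for the volatile fields, as planned in the question) looks roughly like the following SolrJ + JDBC sketch. The core name "classifieds", the field and column names, and the connection details are placeholders, and the HttpSolrClient class assumes SolrJ 6 or later.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HybridSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/classifieds").build();

        // 1) Full-text search in Solr, returning only the matching ids.
        SolrQuery query = new SolrQuery("title:bicycle");
        query.setFields("id");
        query.setRows(20);
        QueryResponse response = solr.query(query);

        // 2) Look up the frequently-changing columns (e.g. status) in MySQL.
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/classifieds", "dbuser", "dbpass");
             PreparedStatement ps = db.prepareStatement(
                "SELECT status FROM classified WHERE id = ?")) {

            for (SolrDocument doc : response.getResults()) {
                long id = Long.parseLong(String.valueOf(doc.getFieldValue("id")));
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(id + " -> " + rs.getString("status"));
                    }
                }
            }
        }
        solr.close();
    }
}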