Tuning the Jackrabbit data model (VERSION_BUNDLE table)

As part of our application, we are using Jackrabbit (1.6.4) to store documents. Each document retrieved by our application is put into a folder structure in Jackrabbit, which is created if it does not already exist.
Our DBA has noticed that the following query is executed a lot against the Oracle (11.2.0.2.0) database holding the Jackrabbit schema - more than 50000 times per hour, causing a lot of IO on the database. In fact, it is one of the top 5 SQL statements in terms of IO over elapsed time (97% IO):
select BUNDLE_DATA from VERSION_BUNDLE where NODE_ID = :1
Taking a look at the database, one notices that this table initially contains only a single record, consisting of the NODE_ID key (data type RAW) with the value DEADBEEFFACEBABECAFEBABECAFEBABE and a couple of bytes in the BUNDLE_DATA BLOB column. Later on, more records with additional data are added.
The SQL for the table looks like this:
CREATE TABLE "VERSION_BUNDLE"
(
"NODE_ID" RAW(16) NOT NULL ENABLE,
"BUNDLE_DATA" BLOB NOT NULL ENABLE
);
I have the following questions:
Why is Jackrabbit accessing this table so frequently?
Any Jackrabbit tuning options to make this faster?
Is the BUNDLE_DATA value changed by Jackrabbit at all or is it just read for every access to the repository?
Is there any way to tune the database schema to make it deal better with this scenario?
Update: The table initially contains only one record; additional records are added over time as decided internally by Jackrabbit. Access still appears to be mostly read-only, since insert or update statements are not reported as running in high numbers.

Is this physical I/O or logical? With the data being read that often, I'd be surprised if the blocks are aged out of the buffer cache fast enough for physical I/O to be required.
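If it helps to check, one rough way is to compare BUFFER_GETS (logical I/O) against DISK_READS (physical I/O) for that statement in V$SQL. A minimal JDBC sketch, assuming the user has been granted access to V$SQL; the connection details and the LIKE filter are placeholders:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class VersionBundleIoCheck {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "monitor", "secret");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT sql_id, executions, buffer_gets, disk_reads "
               + "FROM v$sql WHERE sql_text LIKE '%VERSION_BUNDLE%'")) {
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // buffer_gets ~ logical I/O, disk_reads ~ physical I/O
                    System.out.printf("%s exec=%d gets=%d reads=%d%n",
                            rs.getString("sql_id"), rs.getLong("executions"),
                            rs.getLong("buffer_gets"), rs.getLong("disk_reads"));
                }
            }
        }
    }
}
Your DBA would normally run the same query directly; the numbers tell the same story either way.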

If the JCR store is backed by an Oracle database, you could reorganize the underlying table:
Build a hash cluster for that table to avoid index accesses (see the sketch after this list)
Check whether you are licensed to use the Partitioning option
Delete unnecessary versions from your application so that rows get removed (version prune)
If you're storing binary objects like pictures or documents, also have a look at VERSION_BINVAL
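As a sketch of the hash-cluster idea: a single-table hash cluster keyed on NODE_ID lets Oracle compute the block address directly from the key, so the lookup needs no index access. Your DBA would normally run this DDL directly; it is wrapped in JDBC here only to keep the example in Java, and the cluster name, SIZE and HASHKEYS values are purely illustrative. Since Jackrabbit generates its own schema, the data also has to be copied and the tables swapped during a maintenance window:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HashClusterMigrationSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "jackrabbit", "secret");
             Statement st = con.createStatement()) {
            // Single-table hash cluster keyed on NODE_ID; SIZE/HASHKEYS are illustrative.
            st.execute("CREATE CLUSTER version_bundle_cluster (node_id RAW(16)) "
                     + "SIZE 512 SINGLE TABLE HASHKEYS 100000");
            // New table stored in the cluster, so NODE_ID lookups avoid the index.
            st.execute("CREATE TABLE version_bundle_new ("
                     + "node_id RAW(16) NOT NULL, bundle_data BLOB NOT NULL) "
                     + "CLUSTER version_bundle_cluster (node_id)");
            // Copy the existing rows, then rename the tables while Jackrabbit is stopped.
            st.execute("INSERT INTO version_bundle_new "
                     + "SELECT node_id, bundle_data FROM version_bundle");
        }
    }
}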

Why is Jackrabbit accessing this table so frequently?
It's a sign that you're creating versions in your repository. Is that something your application is supposed to do?
Any Jackrabbit tuning options to make this faster?
Not that I'm aware of; one option to investigate is upgrading to a more recent Jackrabbit version. Version 2.4.2 was just released, and 1.6.4 is almost two years old, so it's possible there have been performance improvements between these releases.
Is the BUNDLE_DATA value changed by Jackrabbit at all or is it just read for every access to the repository?
By the looks of it, that's the GUID of the root repository node.
Is there any way to tune the database schema to make it deal better with this scenario?
As far as I know, the schema is auto-generated by Jackrabbit, so the only option is to modify the table definition in a compatible way after it has been created. But that's a topic for a DBA, which I am not.

Why is Jackrabbit accessing this table so frequently?
We have seen that this table is accessed very often even if you are not asking for versions.
Take a look at this thread from the Jackrabbit users mailing list.

Related

Run Elasticsearch on top of a relational database

The question I have is whether it is possible to use Elasticsearch on top of a relational database.
1. When I insert or delete a record in the relational database, will it be reflected in Elasticsearch?
2. If I insert a document in Elasticsearch, will it be persisted in the database?
3. Does it use a cache or an in-memory database to facilitate search? If so, what does it use?
There is no direct connection between Elasticsearch and relational databases - ES has its own datastore based on Apache Lucene.
That said, you can, as others have noted, use the Elasticsearch JDBC River plugin to load data from a relational database into Elasticsearch. Keep in mind there are a number of limitations to this approach (a configuration sketch follows the list):
It's one way only - the JDBC River only reads from the source database; it does not push data from ES back into the source database.
Deletes are not handled - if you delete data in your source database after it has been indexed into ES, that deletion will not be reflected in ES. See ElasticSearch river JDBC MySQL not deleting records and https://github.com/jprante/elasticsearch-river-jdbc/issues/213
It was not intended as a production, scalable solution for relational database and Elasticsearch integration. According to the JDBC River author's comment in January 2014, it was designed as "a single node (non-scalable) solution" "for demonstration purposes": http://elasticsearch-users.115913.n3.nabble.com/Strategy-for-keeping-Elasticsearch-updated-with-MySQL-td4047253.html
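For illustration, a JDBC river instance was typically registered by PUTting a JSON document to the _river endpoint, roughly as in the sketch below; the river name, JDBC URL and SQL are placeholders, and the exact options depend on the river version:
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RegisterJdbcRiver {
    public static void main(String[] args) throws Exception {
        // Placeholder river definition -- adjust the JDBC URL, credentials and SQL.
        String body = "{"
                + "\"type\":\"jdbc\","
                + "\"jdbc\":{"
                + "\"url\":\"jdbc:mysql://localhost:3306/mydb\","
                + "\"user\":\"dbuser\",\"password\":\"dbpass\","
                + "\"sql\":\"select *, id as _id from documents\""
                + "}}";
        URL url = new URL("http://localhost:9200/_river/my_jdbc_river/_meta");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("PUT");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = con.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        // 200/201 means Elasticsearch accepted the river definition.
        System.out.println("River registration returned HTTP " + con.getResponseCode());
    }
}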
To answer your questions directly (assuming you use the JDBC River):
New document inserts can be handled by the JDBC River, but deletes of existing data are not.
Data does not flow from Elasticsearch into your relational database. That would require custom development work.
Elasticsearch is built on top of Apache Lucene. Lucene in turn depends a great deal on file-system caching at the OS level (which is why ES recommends keeping the heap size to no more than 50% of total memory, leaving plenty for the file-system cache). In addition, the ES/Lucene stack makes use of a number of internal caches, such as the Lucene field cache and the filter cache:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-cache.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html
Internally, the filter cache is implemented using a bitset:
http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/
1) You should take a look at the Elasticsearch JDBC river for inserts (I believe deleted rows aren't managed any more; see the developer's comment).
2) Unless you do it manually, it is not natively managed by Elasticsearch.
3) Indeed, Elasticsearch uses caches to improve performance, especially when using filters. Bitsets (arrays of 0/1) are stored.
Came across this question while looking for a similar thing. Thought an update was due.
My Findings:
Elasticsearch has now deprecated rivers, though jprante's river mentioned above lives on...
Another option I found was the Scotas Push Connector, which pushes inserts, updates and deletes from an RDBMS to Elasticsearch. Details here: http://www.scotas.com/product-scotas-push-connector.
Example implementation here: http://www.scotas.com/blog/?p=90

How to copy huge amount of data from one Oracle database to another with good performance

I need to copy about 50 million rows, produced by joins among 3-4 tables, from one Oracle DB into a single table in another. This is a repeating process that happens from time to time. I copy only active data (meaning there is some outdated, archived data that is no longer needed). We have a special Java service which does this procedure via JDBC, but it is too slow for our needs.
You can use Transportable Modules:
The fundamental functionality of a Transportable Module is to copy a group of related database objects from one database to another using the fastest possible mechanisms.
You can use the Data Pump utility, available from Oracle 10g onwards. It gives you the ability to use direct-path export. To know more, here is the link:
http://docs.oracle.com/cd/B19306_01/server.102/b14215/dp_export.htm
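For illustration only: a direct-path export boils down to an expdp invocation like the one below, shown here launched from Java since a Java service is already in place. The connect string, directory object and table name are placeholders, a QUERY= clause (or a parameter file) can restrict the export to active rows, and a matching impdp run is needed on the target database:
import java.util.Arrays;

public class ExportActiveDataSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder expdp parameters -- adjust credentials, directory and tables.
        ProcessBuilder pb = new ProcessBuilder(Arrays.asList(
                "expdp", "app_user/app_pass@SRCDB",
                "TABLES=ACTIVE_DOCS",
                "DIRECTORY=DATA_PUMP_DIR",
                "DUMPFILE=active_docs.dmp",
                "LOGFILE=active_docs.log"));
        pb.inheritIO(); // stream expdp output to this process's console
        int exit = pb.start().waitFor();
        System.out.println("expdp finished with exit code " + exit);
    }
}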

Loading a Database Table to Memory to Use

In a search application, I need to keep track of the files and their locations. Currently I am using a database table for this, but since I have to connect to the DB every time I need to retrieve this data, it is obviously not efficient. Is there a way I can load the table into memory and use it? I won't need to modify it while it's in memory.
Thank You!
If all you want to do is retrieve one table into memory, you can do this with a single SELECT statement. You can build a collection such as a Map from the ResultSet, and after that get the information you want from the Map.
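A minimal sketch of that approach; the table and column names are made up for illustration:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class FileLocationCache {
    public static Map<String, String> load() throws Exception {
        Map<String, String> locations = new HashMap<>();
        // Hypothetical FILE_INDEX table with FILE_NAME and LOCATION columns.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/searchdb", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT file_name, location FROM file_index")) {
            while (rs.next()) {
                locations.put(rs.getString("file_name"), rs.getString("location"));
            }
        }
        return locations; // keep this map around and serve lookups from it
    }
}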
You could populate any of the several Java databases out there that have an in-memory mode, like HSQLDB, Derby, or H2. You might also look at SQLite, which isn't specifically Java but has various Java connectors as described in this Q&A here on StackOverflow.
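For example, with H2 an in-memory database is just a JDBC URL away. A rough sketch with an illustrative table; DB_CLOSE_DELAY=-1 keeps the in-memory database alive for the lifetime of the JVM:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InMemoryCopySketch {
    public static void main(String[] args) throws Exception {
        try (Connection mem = DriverManager.getConnection("jdbc:h2:mem:files;DB_CLOSE_DELAY=-1");
             Statement st = mem.createStatement()) {
            st.execute("CREATE TABLE file_index (file_name VARCHAR(255) PRIMARY KEY, "
                     + "location VARCHAR(1024))");
            // In a real application you would copy the rows from the source database here.
            st.execute("INSERT INTO file_index VALUES ('report.pdf', '/data/docs/report.pdf')");
            try (ResultSet rs = st.executeQuery(
                    "SELECT location FROM file_index WHERE file_name = 'report.pdf'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("location"));
                }
            }
        }
    }
}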
But you don't have to connect to the DB each time you need to query it; you can use a connection pool to manage a set of connections that you reuse. Since the main delay is usually in establishing a connection, opening a new one for every query adds a lot of per-query overhead, which a pool avoids.
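A rough sketch of the pooling idea, using HikariCP as one example pool library (the library choice, connection details and query are assumptions for illustration):
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PooledLookup {
    // One pool for the whole application; connections are reused instead of re-created.
    private static final HikariDataSource POOL;
    static {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:mysql://localhost:3306/searchdb"); // placeholder URL
        cfg.setUsername("user");
        cfg.setPassword("pass");
        cfg.setMaximumPoolSize(10);
        POOL = new HikariDataSource(cfg);
    }

    public static String findLocation(String fileName) throws Exception {
        // Borrowing a connection is cheap; close() just returns it to the pool.
        try (Connection con = POOL.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "SELECT location FROM file_index WHERE file_name = ?")) {
            ps.setString(1, fileName);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}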
You could also use one of the caching products like Ehcache, Memcached, Coherence, and many others. I have some experience using Ehcache: you can configure Hibernate to cache a particular query, entity object, or POJO, and all subsequent searches with the same criteria will be fetched from the cache.
I believe similar features are provided by other products as well.
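As a rough sketch of the Hibernate + Ehcache route: the FileEntry entity and query below are made up, and the exact region factory class name varies between Hibernate/Ehcache versions:
// Hibernate configuration (e.g. hibernate.cfg.xml), assuming the hibernate-ehcache module:
//   hibernate.cache.use_second_level_cache = true
//   hibernate.cache.use_query_cache        = true
//   hibernate.cache.region.factory_class   = org.hibernate.cache.ehcache.EhCacheRegionFactory

import org.hibernate.Session;
import org.hibernate.SessionFactory;

import java.util.List;

public class CachedFileLookup {
    private final SessionFactory sessionFactory;

    public CachedFileLookup(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public List<?> findByFolder(String folder) {
        Session session = sessionFactory.openSession();
        try {
            // setCacheable(true) puts the result in the query cache; repeated calls
            // with the same 'folder' value are then served from Ehcache, not the DB.
            return session.createQuery("from FileEntry f where f.folder = :folder")
                          .setParameter("folder", folder)
                          .setCacheable(true)
                          .list();
        } finally {
            session.close();
        }
    }
}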
Your sentence "I won't need to modify it while it's in the memory." does not reflect the title of your question, where you apparently want to modify an commit back your data after using it.
If you simply want to speed up your app, why don't you store your data in some kind of variable? Depending on your development tool, it could be some kind of session variable.

Opinion on data storage

I have an upcoming project where the core of it will be storing a mapping between two integers (1234 in column A maps to 4567 in column B). There are roughly 1000 mappings. A lookup on the mappings will be done every time a user hits a certain URL on the site.
It seems like inserting it into our relational database is overkill: the overhead of selecting it out on every hit seems high. On the other hand, storing it in an XML file and loading that flat file from disk every time there's a hit also seems less than optimal.
So my question is this: what is the ideal data structure and method to persist this mapping?
The system architecture is tomcat + apache + mysql. The code will be running in tomcat.
EDIT:
Mappings are static; I won't need to change them. Seems like loading the XML file into a HashMap is the way to go.
I would use a properties file or an XML file, load it into memory (as a HashMap<Integer, Integer>) on startup and then just serve from the hashmap.
If you need to change the mapping at execution time, you could either write it back immediately or potentially just write changes incrementally (and update the in-memory map), with a process to unify the original file and the changes on startup. This doesn't work terribly well if you need to scale to multiple servers, of course - at that point you need to work out what sort of consistency you need etc. A database is probably the simplest way of proceeding, but it depends on the actual requirements.
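A minimal sketch of the load-at-startup idea, assuming a simple .properties file with lines like 1234=4567 (the file name and wiring are illustrative):
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class MappingStore {
    private final Map<Integer, Integer> mapping = new HashMap<>();

    // Load once at startup, e.g. from a ServletContextListener in Tomcat.
    public MappingStore(String path) throws Exception {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(path)) {
            props.load(in); // lines of the form 1234=4567
        }
        for (String key : props.stringPropertyNames()) {
            mapping.put(Integer.valueOf(key), Integer.valueOf(props.getProperty(key)));
        }
    }

    public Integer lookup(int a) {
        return mapping.get(a); // O(1) per request, no I/O
    }
}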
I agree a relational database seems a bit of overkill. You may want to look at a NoSQL database. MongoDB is my personal favourite, but there are plenty out there. Do a search on NoSQL databases.
A NoSQL database will allow you to store this mapping as a simple document, with extremely fast searching and updating of the data. Obviously it's another technology in your stack, though, so that's something for you to consider.
You could try using an in-memory database like H2 or HSQLDB. The memory footprint will likely be larger than with an in-memory HashMap and a file, but on the upside you can use SQL for querying and updating, and you don't need to worry about concurrent access.

"Should I use multiple indices in Solr?", and some other quick Q

Imagine a classifieds website, a very simple one where users don't have login details.
I have this currently with MySql as a db. The db has several tables, because of the categories, but one main table for the classified itself. Total of 7 tables in my case.
I want to use only Solr as a "db" because some people on SO think it would be better, and I agree - if it works, that is.
Now, I have some quick questions about doing this:
Should I have multiple schema.xml files or config.xml files?
How do I query multiple indices?
How would this (having multiple indices) affect performance and do I need a more powerful machine (memory, cpu etc...) for managing this?
Would you eventually go with only Solr instead of what I planned to do, which is to use Solr to search and return ID numbers which I use to query and find the classifieds in MySql?
I have some 300,000 records today, and they probably won't be increasing.
I have not tested how the records would affect performance when using Solr with MySql, because I am still creating the website, but when using only MySql it is quite slow.
I am hoping it will be better with Solr + MySql, but as I said, if it is possible I will go with only Solr.
Thanks
4: If an item has status fields that get updated much more frequently than the rest of the record, then it's better to store that information in a database and retrieve it when you access the item. For example, if you stored your library book holdings in a Solr index, you would store the 'borrowed' status in a database. Updating Solr can take a fair bit of resources, and if you don't need to search on a field it doesn't really need to be in Solr.
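As a rough sketch of the hybrid pattern from point 4 and from the question itself (search in Solr, fetch the volatile fields from MySQL); the core name, fields and SQL are made up, and HttpSolrServer is the SolrJ client class of that era:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ClassifiedSearch {
    private final HttpSolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/classifieds");

    public void search(String text, Connection mysql) throws Exception {
        // 1) Full-text search in Solr, returning only the document IDs.
        SolrQuery query = new SolrQuery(text);
        query.setFields("id");
        query.setRows(50);
        QueryResponse response = solr.query(query);

        // 2) Fetch the frequently updated fields (e.g. status) from MySQL by ID.
        try (PreparedStatement ps = mysql.prepareStatement(
                "SELECT title, status FROM classified WHERE id = ?")) {
            for (SolrDocument doc : response.getResults()) {
                ps.setString(1, String.valueOf(doc.getFieldValue("id")));
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString("title")
                                + " [" + rs.getString("status") + "]");
                    }
                }
            }
        }
    }
}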
