Imagine a classifieds website, a very simple one where users don't have login details.
I currently have this with MySQL as the DB. The DB has several tables because of the categories, but one main table for the classifieds themselves: seven tables in total in my case.
I want to use only Solr as a "db" because some people on SO think it would be better, and I agree, if it works that is.
Now, I have some quick questions about doing this:
Should I have multiple schema.xml or solrconfig.xml files?
How do I query multiple indices?
How would having multiple indices affect performance, and would I need a more powerful machine (memory, CPU, etc.) to manage this?
Would you eventually go with only Solr instead of what I planned to do, which is to use Solr for searching and returning ID numbers, which I then use to query and find the classifieds in MySQL?
I have some 300,000 records today, and they probably won't be increasing.
I have not tested how this number of records affects performance when using Solr with MySQL, because I am still building the website, but with MySQL alone it is quite slow.
I am hoping it will be better with Solr + MySql, but as I said, if it is possible I will go with only Solr.
Thanks
4: If an item has status fields that are updated much more frequently than the rest of the record, it's better to store that information in a database and retrieve it when you access the item. For example, if you stored your library's book holdings in a Solr index, you would store the 'borrowed' status in a database. Updating Solr can take a fair bit of resources, and if you don't need to search on a field, it doesn't really need to be in Solr.
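As a rough illustration of that split (a sketch only: the book_status table and all column names here are made up), the search side would fetch matching IDs from Solr, and the volatile 'borrowed' flag would then be read from the database:

import java.sql.*;
import java.util.*;

public class StatusLookup {
    // Look up the frequently-changing 'borrowed' flag for ids that a Solr
    // query has already returned; only stable, searchable fields live in
    // the Solr index itself. The book_status table is hypothetical.
    public static Map<Long, Boolean> borrowedStatus(Connection con, List<Long> ids)
            throws SQLException {
        Map<Long, Boolean> status = new HashMap<>();
        if (ids.isEmpty()) return status;
        // Build a placeholder list (?, ?, ...) matching the number of ids.
        String marks = String.join(",", Collections.nCopies(ids.size(), "?"));
        String sql = "SELECT id, borrowed FROM book_status WHERE id IN (" + marks + ")";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (int i = 0; i < ids.size(); i++) {
                ps.setLong(i + 1, ids.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    status.put(rs.getLong("id"), rs.getBoolean("borrowed"));
                }
            }
        }
        return status;
    }
}

This way the index is only rewritten when searchable fields change, not on every status flip.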
I have a relational database with few tables. Some of them have columns that I want to enable autocompletion / autocorrection on (e.g. titles, tags, categories).
I have seen that Apache Solr, which builds on Lucene indexing, can offer such functionality. Data can also be fed into Solr from a relational database.
My question is: is this the best way I can get autocomplete and autocorrect services for my entities? Or am I killing a mosquito with a bazooka here?
Solr requires a lot of resources, memory in particular, and I wonder if something far simpler can do the trick for me.
How many unique values do you have in titles, tags, and categories? A few thousand? Then I think you can get away with using a trie data structure. A few million records in those columns? Then Solr/Elasticsearch might be a good option.
I have used a trie for autosuggestion. Building a trie is expensive, but you can store the trie in Memcached or even SQL and update it periodically when new data is added to your columns.
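As a sketch of that approach (illustrative code, not tied to any particular schema), a minimal trie supporting prefix completion could look like this:

import java.util.*;

// Minimal prefix tree for autocompletion; not thread-safe.
class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean isWord;

    void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.isWord = true;
    }

    // Collect up to 'limit' stored strings that start with 'prefix'.
    List<String> complete(String prefix, int limit) {
        Trie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList();
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out, limit);
        return out;
    }

    private void collect(Trie node, StringBuilder sb, List<String> out, int limit) {
        if (out.size() >= limit) return;
        if (node.isWord) out.add(sb.toString());
        for (Map.Entry<Character, Trie> e : node.children.entrySet()) {
            sb.append(e.getKey());
            collect(e.getValue(), sb, out, limit);
            sb.deleteCharAt(sb.length() - 1);
        }
    }
}

You would rebuild it periodically from the title/tag/category columns; a few thousand unique values easily fit in memory.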
I thought about this solution: get the data from the web service, insert it into a table, and then join it with the other table. But this will affect performance, and afterwards I must delete all that data.
Are there other ways to do this?
You don't return a record set from a web service. HTTP knows nothing about your database or result sets.
HTTP requests and responses are strings. You'll have to parse out the data, turn it into queries, and manipulate it.
Performance depends a great deal on things like having proper indexes on columns in WHERE clauses, the nature of the queries, and a lot of details that you don't provide here.
This sounds like a classic case of "client versus server". Why don't you write a stored procedure that does all that work on the database server? You are describing a lot of work: bringing a chunk of data to the middle tier, manipulating it, putting it back, and then deleting it. I'd figure out how to have the database do it if I could.
No, you don't need to save anything into the database; there are a number of ways to convert XML into a table without persisting it. For example, in an Oracle database you can use XMLTable/XMLType/XQuery/the DBMS_XML* packages to convert the XML result from the web service into a table and then use it in your queries.
for example:
if you use Oracle 12c you can use JSON_QUERY: Oracle 12c JSON
XMLTable: oracle-xmltable-tutorial
this week's discussion about converting XML into table data
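To make the XMLTable route concrete, here is a hedged JDBC sketch (the connection string, credentials, and the shape of the XML are placeholders): the web-service response is bound as a string and turned into rows on the fly, with nothing persisted:

import java.sql.*;

public class XmlTableExample {
    public static void main(String[] args) throws SQLException {
        // Pretend this came back from the web service.
        String xml = "<rows><row><id>1</id><name>Alice</name></row></rows>";
        String sql =
            "SELECT x.id, x.name " +
            "FROM XMLTABLE('/rows/row' PASSING XMLTYPE(?) " +
            "     COLUMNS id NUMBER PATH 'id', name VARCHAR2(100) PATH 'name') x";
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//localhost:1521/ORCL", "user", "password");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, xml);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}

The inner SELECT can of course be joined to your real tables instead of printed.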
It is common to think of applications as having a three-tier structure: user interface, "business logic"/middleware, and backend data management. The idea of pulling records from a web service and temporarily inserting them into a table in your SQL database has some advantages, as the "join" you wish to perform can then be implemented quickly in SQL.
Oracle, like other SQL DBMSs, features temporary tables that are optimized for just such tasks; see the sketch below.
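As a sketch of that idea (all table and column names here are invented), a global temporary table declared with ON COMMIT DELETE ROWS makes the cleanup step disappear, since the inserted rows vanish at commit:

import java.sql.*;

public class TempTableJoin {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/ORCL", "user", "password")) {
            try (Statement st = con.createStatement()) {
                // One-time DDL; in practice this is created once, not per run.
                st.execute("CREATE GLOBAL TEMPORARY TABLE ws_data ("
                         + "  id NUMBER, name VARCHAR2(100)"
                         + ") ON COMMIT DELETE ROWS");
            }
            con.setAutoCommit(false);
            try (PreparedStatement ins = con.prepareStatement(
                     "INSERT INTO ws_data (id, name) VALUES (?, ?)")) {
                ins.setLong(1, 1L);          // values would come from
                ins.setString(2, "Alice");   // the web-service response
                ins.executeUpdate();
            }
            // Join against a permanent table (orders is a stand-in here)
            // while the temporary rows are still visible.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT o.order_id FROM orders o "
                   + "JOIN ws_data w ON o.customer_id = w.id")) {
                while (rs.next()) System.out.println(rs.getLong(1));
            }
            con.commit();  // temporary rows are discarded here; no DELETE needed
        }
    }
}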
However, this might not be the best approach given your concerns about performance. It's a guess that your "middleware" layer is written in Java, given the tags placed on the question, and the lack of any explicit description suggests you may be attempting a two-tier design, where user-interface programs connect directly to the backend data-management resources.
Given your apparent investment in Oracle products, you might find it worthwhile to incorporate Oracle Middleware elements in your design. In particular Oracle Fusion Middleware promises to enable "data integration" between web services and databases.
As part of our application, we are using Jackrabbit (1.6.4) to store documents. Each document that is retrieved by our application is put into a folder structure in Jackrabbit, which is created if not existing.
Our DBA has noticed that the following query is executed a lot against the Oracle (11.2.0.2.0) database holding the Jackrabbit schema - more than 50000 times per hour, causing a lot of IO on the database. In fact, it is one of the top 5 SQL statements in terms of IO over elapsed time (97% IO):
select BUNDLE_DATA from VERSION_BUNDLE where NODE_ID = :1
Taking a look at the database, one notices that this table initially contains only a single record, comprising the NODE_ID key (data type RAW) with the value DEADBEEFFACEBABECAFEBABECAFEBABE and a couple of bytes in the BUNDLE_DATA BLOB column. Later on, more records are added with additional data.
The SQL for the table looks like this:
CREATE TABLE "VERSION_BUNDLE"
(
"NODE_ID" RAW(16) NOT NULL ENABLE,
"BUNDLE_DATA" BLOB NOT NULL ENABLE
);
I have the following questions:
Why is Jackrabbit accessing this table so frequently?
Any Jackrabbit tuning options to make this faster?
Is the BUNDLE_DATA value changed by Jackrabbit at all or is it just read for every access to the repository?
Is there any way to tune the database schema to make it deal better with this scenario?
Update: the table initially contains only one record; additional records are added over time as decided internally by Jackrabbit. Access still seems to be mostly read-only, as INSERT or UPDATE statements are not reported as running in high numbers.
Is this physical i/o or logical? With the data being read that often I'd be surprised if the blocks are being aged out of the cache fast enough for physical i/o to be required.
If the JCR store is backed by an Oracle database, you could reorganize the underlying table:
Build a hash cluster of that table to avoid index accesses (see the sketch after this list)
Check whether you have a license for the partitioning option
Deleting unnecessary versions in your application will delete rows (version pruning)
If you're storing binary objects like pictures or documents, also have a look at VERSION_BINVAL.
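For the hash-cluster suggestion above, the general shape is sketched below (all names are illustrative). One caveat to check with a DBA first: Oracle does not allow LOB columns in clustered tables, so the BLOB column may rule this out for VERSION_BUNDLE itself, and the schema is Jackrabbit-generated anyway:

import java.sql.*;

public class HashClusterSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//localhost:1521/ORCL", "user", "password");
             Statement st = con.createStatement()) {
            // Single-table hash cluster: rows are located by hashing the key,
            // so a lookup by node_id needs no index access at all.
            st.execute("CREATE CLUSTER node_lookup_cluster (node_id RAW(16)) "
                     + "SINGLE TABLE HASHKEYS 1024");
            st.execute("CREATE TABLE node_lookup ("
                     + "  node_id RAW(16) NOT NULL,"
                     + "  payload VARCHAR2(4000)"
                     + ") CLUSTER node_lookup_cluster (node_id)");
        }
    }
}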
Why is Jackrabbit accessing this table so frequently?
It's a sign that you're creating versions in your repository. Is that something your application is supposed to do?
Any Jackrabbit tuning options to make this faster?
Not that I'm aware of; one option to investigate is upgrading to a more recent Jackrabbit version. Version 2.4.2 was just released, and 1.6.4 is almost two years old, so there may have been performance improvements between those releases.
Is the BUNDLE_DATA value changed by Jackrabbit at all or is it just read for every access to the repository?
By the looks of it, that value is the GUID of the root repository node.
Is there any way to tune the database schema to make it deal better with this scenario?
As far as I know the schema is auto-generated by Jackrabbit, so the only option is to modify the table definition in a compatible way after it has been created. But that's a topic for a DBA, which I am not.
Why is Jackrabbit accessing this table so frequently?
We have seen this table accessed very often even when no versions are requested.
Take a look at this thread from the Jackrabbit users mailing list.
In a search application, I need to keep track of files and their locations. Currently I am using a database table for this, but since I have to connect to the DB every time I need to retrieve this data, it is obviously not efficient. Is there a way to load the table into memory and use it there? I won't need to modify it while it's in memory.
Thank You!
If all you want to do is read one table into memory, you can do this with a single SELECT statement: build a collection such as a Map from the ResultSet, and then get the information you need from the Map.
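A minimal sketch of that (assuming a hypothetical file_locations table with id and path columns):

import java.sql.*;
import java.util.*;

public class FileLocationCache {
    // Load the whole lookup table once at startup; read-only afterwards,
    // which matches the "won't need to modify it" requirement.
    public static Map<Long, String> load(Connection con) throws SQLException {
        Map<Long, String> locations = new HashMap<>();
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, path FROM file_locations")) {
            while (rs.next()) {
                locations.put(rs.getLong("id"), rs.getString("path"));
            }
        }
        return Collections.unmodifiableMap(locations);
    }
}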
You could populate any of the several Java databases out there that have an in-memory mode, like HSQLDB, Derby, or H2. You might also look at SQLite, which isn't specifically Java but has various Java connectors as described in this Q&A here on StackOverflow.
But you don't have to connect to the DB each time you query it: you can use a connection pool to manage a set of reusable connections. Since the main delay is usually establishing the connection, pooling removes most of the per-query overhead.
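For example, with a pooling library such as HikariCP (the JDBC URL and credentials are placeholders):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.SQLException;

public class Pool {
    private static final HikariDataSource DS;
    static {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:mysql://localhost:3306/searchapp"); // placeholder
        cfg.setUsername("app");
        cfg.setPassword("secret");
        cfg.setMaximumPoolSize(10);
        DS = new HikariDataSource(cfg);
    }

    // Borrow a pooled connection; close() returns it to the pool
    // instead of tearing down the underlying connection.
    public static Connection get() throws SQLException {
        return DS.getConnection();
    }
}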
You could also use one of the caching products such as Ehcache, Memcached, or Coherence. I have some experience with Ehcache: you configure Hibernate to cache a particular query, entity object, or POJO, and all subsequent searches with the same criteria are fetched from the cache.
I believe similar features are provided by other products as well.
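A hedged sketch of the Hibernate route (FileEntry is a made-up mapped entity; this also assumes hibernate.cache.use_query_cache=true and a cache provider such as Ehcache are configured):

import java.util.List;
import org.hibernate.Session;

public class CachedLookup {
    // FileEntry is a hypothetical mapped @Entity (id, folder, path).
    // Repeated calls with the same folder are served from the query cache
    // until the underlying table changes.
    public static List<FileEntry> byFolder(Session session, String folder) {
        return session
            .createQuery("from FileEntry f where f.folder = :folder", FileEntry.class)
            .setParameter("folder", folder)
            .setCacheable(true)   // opt this query into the query cache
            .list();
    }
}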
Your sentence "I won't need to modify it while it's in the memory." does not reflect the title of your question, where you apparently want to modify an commit back your data after using it.
If you simply want to speedup your app, why don't you store your data in some kind of variable? Depending on your development tool, it could be some kind of session variable.
Website: classifieds website (users may post ads, search ads, etc.)
I plan to use SOLR for searching and then return the results as IDs only, then use those IDs to query MySQL, and finally display the results for those IDs.
Currently I have around 30 tables in MySQL, one for each category.
1- Do you think I should do it differently than above?
2- Should I use only one SOLR document, or multiple documents? Also, is document the same as a SOLR index?
3- Would it be better to only use SOLR and skip MySQL, knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
4- Say the user wants to search for cars in a specific region, how is this type of querying performed/done in SOLR? Ex: q=cars&region=washington possible?
You may think there is a lot of info about SOLR out there, but there isn't, and especially not about using PHP with SOLR and a SOLR PHP client... Maybe I will write something when I have learned all this... Or maybe one of you could write something up!
Thanks again for all help...
First, the definitions: a Solr/Lucene document is roughly the equivalent of a database row. An index is roughly the same as a database table.
I recommend trying to store all the classified-related information in Solr. Querying Solr and then the database is inefficient and very likely unnecessary.
Querying in a specific region would be something like q=cars+region:washington assuming you have a region field in Solr.
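A sketch of issuing such a query over HTTP (host, core name, and field names are assumptions; fq is a filter query, which restricts results without affecting relevance scoring):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SolrRegionQuery {
    public static void main(String[] args) throws Exception {
        String q = URLEncoder.encode("cars", StandardCharsets.UTF_8.name());
        String fq = URLEncoder.encode("region:washington", StandardCharsets.UTF_8.name());
        URL url = new URL("http://localhost:8983/solr/classifieds/select"
                        + "?q=" + q + "&fq=" + fq + "&wt=json&rows=20");
        // Print the raw JSON response; a real app would parse it instead.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) System.out.println(line);
        }
    }
}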
The Solr wiki has tons of good information and a pretty good basic tutorial. Of course this can always be improved, so if you find anything that isn't clear please let the Solr team know about it.
I can't comment on the PHP client since I don't use PHP.
Solr is going to return its results in a syntax easily parsable using SimpleXML. You could also use the SolPHP client library: http://wiki.apache.org/solr/SolPHP.
Solr is really quite efficient. I suggest putting as much data into your Solr index as necessary to retrieve everything in one hit from Solr. This could mean much less database traffic for you.
If you've installed the example Solr application (comes with Jetty), then you can develop Solr queries using the admin interface. The URI of the result is pretty much what you'd be constructing in PHP.
The most difficult part when beginning with Solr is getting the solrconfig.xml and the schema.xml files correct. I suggest starting with a very basic config, and restart your web app each time you add a field. Starting off with the whole schema.xml can be confusing.
2- Should I use only one SOLR document, or multiple documents? Also, is document the same as a SOLR index?
3- Would it be better to only use SOLR and skip MySQL, knowing that I have a lot of columns in each table? Personally I am much better at using MySQL than SOLR.
A document is a single entry in a Solr index (roughly, one record). Take into account that you can build only one Solr index per Solr core. A core acts as an independent Solr server within the same Solr installation.
http://wiki.apache.org/solr/CoreAdmin
You can build one index merging the contents of several tables, and other indexes to perform second-level searches...
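If you do end up with multiple cores, note that a single distributed query can fan out across them via the shards parameter; a hedged sketch (host and core names are placeholders, and unique keys must not collide across cores):

public class ShardsQueryUrl {
    public static void main(String[] args) {
        // The core named in the path coordinates the query and
        // merges the per-core results.
        String shards = "localhost:8983/solr/cars,localhost:8983/solr/boats";
        String url = "http://localhost:8983/solr/cars/select"
                   + "?q=engine&wt=json&shards=" + shards;
        System.out.println(url);  // fetch with any HTTP client
    }
}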
Could you give more details about your architecture and data?
As suggested by others, you can store and index your MySQL data in Solr and run your queries against the Solr index, making MySQL unnecessary.
You don't need to store and index only IDs, query Solr for those IDs, and then run a MySQL query to get the additional data for each ID; you can store the other data corresponding to the IDs in Solr itself.
Regarding the Solr PHP client: you don't need to use it; it is recommended to use the REST-like Solr web API directly. You can use a PHP function like file_get_contents("http://IP:port/solr/core/select?q=query&start=0&rows=100&wt=json") or use curl with PHP if you need to. Both ways are almost the same and efficient. This will return the data as JSON (because of wt=json); then use the PHP function json_decode($returned_data) to get that data as an object.
If you need to ask anything just reply.