Lucene and External DB

Lucene and External DB - java

I am working with the Lucene and Derby databases. Lucene contains the text index, and Derby has information regarding additional user data. For example, each document has a tag. For this purpose the Derby database has two tables
TAGS:
ID
Name
LUCENETAGS:
ID
LUCENEID (docID in Lucene, not a field)
TAGID
I want a user to be able to search something like:
very interesting text AND tag:fun
Changing the structure in a way that tag is a Lucene field is not an option.
Thank you!

I believe you'll have to simply perform your text search in Lucene, and then filter your results based on the result of a query into a Derby.
If few documents will match a particular tag, you could also query the database for the IDs to be queried, and rewrite the query like:
(very interesting text) AND id:(1 2 3 etc.)
Probably not feasible, but in the case that tags are pretty sparse, it might be worth considering.
I do wonder, though, why a field can't be added to the index, duplicating the stored value in the Derby Database. In any implementation you choose to get what you want from your stated structure, you will see much poorer performance, and more complexity for you to deal with, than if the data were available in the index as well.

Related

Adding custom fields in my application

I have a SAAS product, which is build by Spring MVC and Hibernate. Generally SAAS products allow user's to customize the product like adding extra fields to the table. So i want to give the flexibility to users, to create custom fields in the tables for themselves. Please provide all the viable solutions to achieve it. Thank you so much for your help.

I'm guessing your trying to back this to a Relational database. The primary problem is that relational databases store things in tables, and tables don't really handle free form data well.
So one solution is to use a document structure that is flexible, like XML (and perhaps ditch the database) but databases have features which are nice, so let's also consider the database-using approaches.
You could create a "custom field" table which would have columns (composite primary key) for
ExtendedTable
ColumnName
but you'd also have to store the data somewhere
(ExtendedKey)
DataItem
And now we get into the really nasty bits. How would you apply constraints to this data? I mean, what would the type be of a DataItem? A general solution would be quite complex (being a type of free form database). Hopefully you could limit the solution to solve only the problems you require solved.
Another approach is to use a single "extra" column that contains an XML record which embeds it's own "column and value" extensions, but if you wanted to display a table of the efficiently, you'd have to parse out every XML document in every field, which is not ideal.
Neither one of these approaches will work well with the existing SQL query language, so you'll then start building your own query language.
I suggest you go back and look at real data requirements, instead of sweeping them under the table with a "and anything else one might want" set of columns on your table.

Your requirement is best suited use case for NoSQL databases (like MongoDB).
Dynamically creating relational database tables & columns (modifying schemas) upon user requests in an application is not a best practice as these involve DDL operations, which are very powerful and in case if you don't handle them carefully, the whole application's database goes to the inconsistent state.

Should I use multiple Lucene directories/indexes to search different types of data?

I have many MySQL tables to store different types of data like goods, catagories, brands, suppliers, etc. Each of them needs to implement full-text search via Lucene.
So I plan to build one Lucene Directory (and one IndexWriter + one IndexReader corresponding to this Directory) for each table, e.g.
HashMap<String, Directory> = ...;
put("goods", FSDirectory.open(luceneDirRoot + "/goods"));
put("catagories", FSDirectory.open(luceneDirRoot + "/catagories"));
...
Is this a good practice to use Lucene?
Furthur more, how can I know how many directories I made by Lucene, like MySQL command "SHOW TABLES"? new File(luceneDirRoot).listFiles() can be a choice but I am not sure whether there are other non-Lucene folders.

I would implement one Lucene index pro MySQL table provided you do not need to perform search over several tables. Alternative would be to write everything into one index and add table name into each lucene document, that way you could limit the search to particular table.
AFAIK Lucene does not support SHOW TABLES equivalent the way you desire it, but you might easily do that by yourself, e.g. by using naming convention for the directories.
I would recommend to look at Hibernate Search, this is a good match for your needs, it builds one index directory pro table and allows you to perform full text search while handling the low-level lucene issues for you. You just configure the index by annotating the JPA entities corresponding to your tables and have to implement the full text queries. This is much easier then doing naked Lucene with data from MySQL on your own, Hibernate Search builds the index for you and integrates well with data from relational DB such as MySQL.

ElasticSearch: Secondary indecies on field values using Java-API

I'm considering to use ElasticSearch as a search engine for large objects. There are about 500 millions objects on a single machine. For far is Elasticsearch a good solution for executing advanced queries. But a have the problem that i did find any technique to create secondary index on the document fields. Is in elasticsearch a possibility to create a secondary indecies like in MySQL on columns? Or are there any other technologies implemented to accelerate searches on field values? I'm using an single server enviroment and I have to store about 300 fields per row/object. At the moment there are about 500 million object in my database.

I apologize in advance it I don't understand the question. Elasticsearch is itself an index based technology (it's built on top of Lucene which is a build for index based search). You put documents into Elasticsearch and the individual fields on those documents are indexed and searchable. You should not have to worry about creating secondary indexes; the fields will be indexed by default (in most cases).
One of the differences between Elasticsearch and Solr is that in Solr, you have to specify a schema defining what the fields are on the documents and whether that field will be indexed (available to search against), stored (available as the result of a search) or both. Elasticsearch does not require an upfront schema, and in lieu of specific mappings for fields, then reasonable defaults are used instead. I believe that the core fields (string, number, etc..._) are indexed by default, meaning they are available to search against.
Now in your case, you have a document with a lot of fields on it. You will probably need to tweak the mappings a bit to only index the fields that you know you might search against. If you index too much, the size of the index itself will balloon and will not be as fast as if you had a trim index of only the fields you know you will search against. Also, Lucene loads parts of the index into memory to really enable fast searches. With a bloated index, you won't be able to keep as much stuff in memory and your searches will suffer as a result. You should look at the Mappings API and the Core Types section for more info on how to do this.

Keeping query statistics using lucene

I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).
To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.
But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?
Thanks for the help.

"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"
You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms, and the index itself. So, it might not be a large overhead to store each query as a unique Document...this way you will not be throwing away any information.

First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:
Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries. Build query statistics using post-processing.As this is a web application, you can probably use a servlet container, such as Tomcat's, logs for this.
Second, Auto-Suggest From Popular Queries Using EdgeNGrams suggests an alternative implementation of query suggestion using Solr.

Should I use Lucene only for search?

Our website needs to give out data to the world. This is open-source data that we have stored, and we want it to make it publicly available. It's about 2 million records.
We've implemented the search of these records using Lucene, which is fine, however we'd like to show an individual record (say the user clicks on it after the search is done) and provide more detailed information for that record.
This more detailed information however isn't stored in the index directly... there are like many-to-many relationships and we use our relational database (MySQL) to provide this information.
So like a single record belongs to a category, we want the user to click on that category and show the rest of the records within that category (lots more associations like this).
My question is, should we use Lucene also to store this sort of information and retrieve it through simple search (category:apples), or should MySQL continue doing this logical job? Should I use Lucene only for the search part?
EDIT
I would like to point out that all of our records are pretty static.... changes are made to this data once every week or so.

Lucene's strength lies in rapidly building an index of a set of documents and allowing you to search over them. If this "detailed information" does not need to be indexed or searched over, then don't store it in Lucene.
Lucene is not a database, it's an index.

You want to use Lucene to store data?, I thing it's ok, I've used Solr http://lucene.apache.org/solr/
which built on top of Lucene to work as search engine and store more data relate to the record that maybe use for front end display. It worked with 500k records for me, and 2mil records I think it should be fine.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.