Should I use Lucene only for search? - java

Our website needs to give out data to the world. This is open-source data that we have stored, and we want to make it publicly available. It's about 2 million records.
We've implemented search over these records using Lucene, which works fine. However, we'd also like to show an individual record (say, when the user clicks on it after a search) and provide more detailed information for that record.
This detailed information, however, isn't stored in the index directly. It involves many-to-many relationships, and we use our relational database (MySQL) to provide it.
For example, a single record belongs to a category, and we want the user to be able to click on that category and see the rest of the records in it (there are many more associations like this).
My question is, should we use Lucene also to store this sort of information and retrieve it through simple search (category:apples), or should MySQL continue doing this logical job? Should I use Lucene only for the search part?
EDIT
I would like to point out that all of our records are fairly static: changes are made to this data once a week or so.

Lucene's strength lies in rapidly building an index of a set of documents and allowing you to search over them. If this "detailed information" does not need to be indexed or searched over, then don't store it in Lucene.
Lucene is not a database, it's an index.

You want to use Lucene to store data? I think that's OK. I've used Solr (http://lucene.apache.org/solr/), which is built on top of Lucene, as a search engine that also stores additional data related to each record for front-end display. It worked with 500k records for me, and I think it should be fine with 2 million records.

Related

How to Store Data To Show at Chart using Java?

I have a Spring-based Java application. I have two types of data.
The first is the number of indexed documents in my application. Documents are indexed only 2 or 3 times a week.
The second is the number of searches. Many users search for something in my application, and I want to visualize the search terms. This data flows in constantly.
What do you suggest for storing this kind of data using Java?
For the first, I think I can use RRD or something similar, or I could even write the data to a MySQL table.
For the second, I could use a more sophisticated database, with an in-memory database such as H2 between that database and the user interface.
Any ideas?
Have you considered using Redis? It has great support for atomic increments if you want to track search counts, and it's also very fast since data is stored in memory.
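To illustrate the counting idea, here is a minimal sketch of per-term counters with atomic increments. Note that this is an in-process stand-in written with plain Java collections, not Redis itself; the class name SearchTermCounter and its methods are made up for the example. With Redis, the increment would be a call to its INCR command instead.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

// Hypothetical in-process stand-in for Redis-style atomic counters.
final class SearchTermCounter {
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    private static String normalize(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    // Thread-safe increment of the counter for one search term.
    void record(String term) {
        String key = normalize(term);
        if (key.isEmpty()) {
            return; // ignore blank searches
        }
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Current count for a term, 0 if it was never searched.
    long count(String term) {
        LongAdder adder = counts.get(normalize(term));
        return adder == null ? 0L : adder.sum();
    }

    // The n most searched terms, highest count first, ties in alphabetical order.
    List<String> top(int n) {
        return counts.entrySet().stream()
            .sorted((a, b) -> {
                int byCount = Long.compare(b.getValue().sum(), a.getValue().sum());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            })
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

The top(n) result is the kind of list a chart of popular search terms could be fed from. Unlike Redis, these counters are lost on restart, so they would need to be flushed to a table periodically.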

Lucene and External DB

I am working with the Lucene and Derby databases. Lucene contains the text index, and Derby has information regarding additional user data. For example, each document has a tag. For this purpose the Derby database has two tables
TAGS:
ID
Name
LUCENETAGS:
ID
LUCENEID (docID in Lucene, not a field)
TAGID
I want a user to be able to search something like:
very interesting text AND tag:fun
Changing the structure in a way that tag is a Lucene field is not an option.
Thank you!
I believe you'll have to simply perform your text search in Lucene, and then filter your results based on the result of a query against the Derby database.
If few documents will match a particular tag, you could also query the database for the IDs to be queried, and rewrite the query like:
(very interesting text) AND id:(1 2 3 etc.)
Probably not feasible, but in the case that tags are pretty sparse, it might be worth considering.
I do wonder, though, why a field can't be added to the index, duplicating the stored value in the Derby Database. In any implementation you choose to get what you want from your stated structure, you will see much poorer performance, and more complexity for you to deal with, than if the data were available in the index as well.
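Both options above can be sketched in a few lines of plain Java. This is only an illustration under assumptions: the class and method names are hypothetical, the ID lists are presumed to have been fetched already (from the Lucene hits and from the LUCENETAGS table), and the query rewrite presumes the index has a searchable ID field, called id as in the rewritten query above. The question describes LUCENEID as the internal docID rather than a field, so that last point may not hold in the asker's setup.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helpers for combining a Lucene text search with tag data held in Derby.
final class TagQueryRewriter {

    // Option 1: keep only the Lucene hits whose ID the database reported for the tag.
    // The ranking order of the hits is preserved.
    static List<Integer> keepTagged(List<Integer> rankedHitIds, Set<Integer> taggedIds) {
        List<Integer> kept = new ArrayList<>();
        for (Integer id : rankedHitIds) {
            if (taggedIds.contains(id)) {
                kept.add(id);
            }
        }
        return kept;
    }

    // Option 2: rewrite the query string so it only matches the tagged IDs.
    // idField is an assumption: it must be a field that is actually indexed.
    static String restrictToIds(String textQuery, String idField, Collection<Integer> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("no documents carry this tag");
        }
        String idClause = ids.stream()
            .map(id -> Integer.toString(id))
            .collect(Collectors.joining(" "));
        return "(" + textQuery + ") AND " + idField + ":(" + idClause + ")";
    }
}
```

The rewrite only stays practical for sparse tags: Lucene caps the number of clauses in a boolean query (1024 by default), so a tag attached to thousands of documents would need the filtering approach instead.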

how to create "has many" between two documents in couchdb?

Basically, I am wondering how you would do in CouchDB what you would do in MySQL: store username and password in one table and link the user ID as a foreign key in another table of tasks.
Should I just use MySQL for the user authentication part and CouchDB to store lots of user-submitted documents? That is, create a random unique token to link each user to their "documents" in CouchDB?
Also, I am looking to store Java objects in CouchDB and retrieve them for direct use in my application. Which Java CouchDB library does this? Ektorp's example seems more complicated than couchdb4j's.
I do not know Java very well, but I suggest using the simplest tool you can find. CouchDB is very simple, and it is usually most beneficial to access it with simple tools too.
Yes, if you will have many relationships in the data, MySQL will help. However CouchDB can do some simple has-many queries.
First, there is view collation. You use map/reduce, and for every "child" document, you emit a key pointing to the parent document. When you query for ?key=parent then you get a long list of children. (The wiki explains it pretty well.)
Secondly, I suggest the article What's new in CouchDB 0.11 which shows how to use document _ids to link between two documents.
Good luck!

Keeping query statistics using lucene

I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).
To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.
But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?
Thanks for the help.
"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"
You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms, and the index itself. So, it might not be a large overhead to store each query as a unique Document...this way you will not be throwing away any information.
First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:
Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries. Build query statistics using post-processing. As this is a web application, you can probably use the logs of a servlet container, such as Tomcat, for this.
Second, Auto-Suggest From Popular Queries Using EdgeNGrams suggests an alternative implementation of query suggestion using Solr.
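Whichever storage option is chosen, the co-occurrence analysis described in the question could look roughly like the sketch below. It is a plain-Java illustration rather than anything Lucene- or Solr-specific: the class name is hypothetical, and queries are split naively on whitespace, whereas a real setup would presumably run them through the same analyzer as the index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: count how often two terms appear in the same logged query.
final class QueryCooccurrence {
    // term -> (other term -> number of queries containing both)
    private final Map<String, Map<String, Integer>> pairs = new HashMap<>();

    // Record one user query taken from the query log or query index.
    void addQuery(String query) {
        Set<String> terms = new TreeSet<>(); // de-duplicates repeated terms
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        for (String a : terms) {
            for (String b : terms) {
                if (!a.equals(b)) {
                    pairs.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                }
            }
        }
    }

    // Terms most often used together with the given term, most frequent first,
    // ties in alphabetical order.
    List<String> suggestFor(String term, int limit) {
        Map<String, Integer> related =
            pairs.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(related.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> suggestions = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (suggestions.size() == limit) {
                break;
            }
            suggestions.add(entry.getKey());
        }
        return suggestions;
    }
}
```

Because this works on the raw query strings, it fits the post-processing route: it can be run over a query log after the fact without touching the document index.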

Do I need to normalize this MySQL db?

I have a classifieds website which uses Solr to search for whatever ads the user wants to find. Solr then returns the IDs of all the matches found. I then use the IDs to fetch and display the ads from a MySQL table.
Currently I have one huge table containing everything in MySQL.
Sometimes some of the fields are empty because for instance an apartment has no "model" but a car does.
Is this a problem for me if I use SOLR like I do?
Thanks
Ask yourself these questions:
Is your current implementation slow or prone to error?
Are you adding a lot of "hacks" in order to display content or fetch data correctly due to the de-normalization of your database?
In the long run, will you benefit from normalizing the table?
Hope that helps. It all depends on your situation! Personally, I build databases normalized and then de-normalize as needed to keep things speedy.
If you are using Solr, why don't you just serve the complete ad from Solr instead of MySQL to save DB time?
One huge table is usually not a good option at all.
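If the ads do stay in MySQL, the lookup described in the question (Solr returns IDs, MySQL supplies the rows) might be sketched like this. The table name ads, the id column, and the class itself are assumptions for illustration; only the SQL text is built here, to be passed to a JDBC PreparedStatement with the IDs bound as parameters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper for fetching ads by the IDs a search engine returned.
final class AdLookup {

    // Builds e.g. "SELECT * FROM ads WHERE id IN (?, ?, ?)" with one placeholder per ID.
    static String selectByIdsSql(String table, int idCount) {
        if (idCount < 1) {
            throw new IllegalArgumentException("need at least one id");
        }
        String placeholders = String.join(", ", Collections.nCopies(idCount, "?"));
        return "SELECT * FROM " + table + " WHERE id IN (" + placeholders + ")";
    }

    // An IN (...) query gives no ordering guarantee, so put the fetched rows
    // back into the order of the search ranking. IDs with no row are skipped.
    static <T> List<T> inRankOrder(List<Long> rankedIds, Map<Long, T> rowsById) {
        List<T> ordered = new ArrayList<>();
        for (Long id : rankedIds) {
            T row = rowsById.get(id);
            if (row != null) {
                ordered.add(row);
            }
        }
        return ordered;
    }
}
```

Fetching all matches in one IN query rather than one query per ID keeps the database round trips constant per results page, which matters more here than whether the table is normalized.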
