Does Lucene store the actual documents in its index? - java

I am planning to use Lucene to index a very large corpus of text documents. I know how an inverted index and all that works.
Question: Does Lucene store the actual source documents in its index (in addition to the terms)? So if I search for a term and want all the documents that contain the term, do the documents come out of Lucene, or does Lucene just return pointers (e.g. the file path to the matched documents)?

This is up to you. Lucene represents documents as collections of fields, and for each field you can configure whether it is stored. Typically, when handling largish documents, you would store the title field but not the body field, and you'd add a stored (but not indexed) identifier field that can be used to retrieve the actual document.
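As a sketch, the three combinations look like this (assuming Lucene 4.x APIs; the field names are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public class FieldStorageSketch {
    /** Per-field storage choices for a largish document. */
    static Document toDocument(String title, String body, String path) {
        Document doc = new Document();
        // Indexed and stored: searchable, and returned with each hit.
        doc.add(new TextField("title", title, Field.Store.YES));
        // Indexed but not stored: searchable, but the text is not kept in the index.
        doc.add(new TextField("body", body, Field.Store.NO));
        // Stored but not indexed: returned with hits, used to fetch the source document.
        doc.add(new StoredField("path", path));
        return doc;
    }
}
```

With this setup, a search hit gives you `title` and `path` directly, and you read the full text from the file at `path` yourself.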

Related

Internal working of GAE Search API (Java)

I would like to know how the Search API stores Documents internally. Does it store a Document in the Datastore with a "Document" kind, or something else? Also, where are the indexes stored? In memcache?
Documents and indexes are stored in the HR Datastore: they are saved in a separate persistent store optimized for search operations. The Document class represents documents; each document has a document identifier and a list of fields. It's all in Google's documentation.

ElasticSearch: Secondary indices on field values using the Java API

I'm considering using ElasticSearch as a search engine for large objects. There are about 500 million objects on a single machine. So far, Elasticsearch looks like a good solution for executing advanced queries, but I haven't found any technique for creating a secondary index on document fields. Does Elasticsearch offer a way to create secondary indices on fields, like MySQL does on columns? Or is there some other mechanism for accelerating searches on field values? I'm using a single-server environment and have to store about 300 fields per row/object. At the moment there are about 500 million objects in my database.
I apologize in advance if I don't understand the question. Elasticsearch is itself an index-based technology (it's built on top of Lucene, which is built for index-based search). You put documents into Elasticsearch and the individual fields on those documents are indexed and searchable. You should not have to worry about creating secondary indexes; the fields will be indexed by default (in most cases).
One of the differences between Elasticsearch and Solr is that in Solr, you have to specify a schema defining what the fields are on the documents and whether each field will be indexed (available to search against), stored (available as the result of a search), or both. Elasticsearch does not require an upfront schema; in the absence of specific mappings for fields, reasonable defaults are used instead. I believe that the core field types (string, number, etc.) are indexed by default, meaning they are available to search against.
Now in your case, you have a document with a lot of fields on it. You will probably need to tweak the mappings a bit to only index the fields that you know you might search against. If you index too much, the size of the index itself will balloon, and it will not be as fast as a trimmed index containing only the fields you know you will search against. Also, Lucene loads parts of the index into memory to make searches really fast. With a bloated index, you won't be able to keep as much of it in memory, and your searches will suffer as a result. You should look at the Mappings API and the Core Types section for more info on how to do this.
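For the Elasticsearch versions this question dates from (pre-2.x), a mapping that keeps one field searchable while leaving another out of the index might look like this (the type and field names here are hypothetical; check the Mappings API documentation for the syntax of your version):

```json
{
  "item": {
    "properties": {
      "name":    { "type": "string" },
      "payload": { "type": "string", "index": "no" }
    }
  }
}
```

Setting `"index": "no"` keeps `payload` out of the inverted index entirely (it is still returned as part of the document source), which shrinks the index for fields you never search against.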

Best practices handling large number of Strings from RSS feeds in Java and Lucene

I have a situation where I have an hourly batch job which has to parse a large number of RSS feeds and extract the text of the title and description elements from each item in each feed into strings, which will then have their word frequencies calculated by Lucene.
But, not knowing how many feeds or items per feed, each string may potentially consist of thousands of words.
I suppose the basic pseudocode I'm looking at is something like this:
for each feed
    for each item within date/time window
        get text from title element, concatenate it to title_string
        get text from description element, concatenate it to description_string
calculate top x keywords from title_string
for each keyword y in x
    calculate frequency of keyword y in description_string
Can anyone suggest how to handle this data to reduce memory usage? That is apart from using StringBuilders as the data is read from each feed.
Though the contents of the feeds will be stored in a database, I want to calculate the word frequencies 'on the fly' to avoid all the IO necessary where each feed has its own database table.
First, I don't understand why you want to store the text in a database if you already have Lucene. Lucene is a kind of database with indexes built on words rather than record ids; that's the only difference for text documents. For example, you can store each item in a feed as a separate document with fields like "title" and "description". If you need to store information about the feed itself, create one more type of document for feeds, generate an id, and put this id as a reference in all of that feed's items.
If you do this, you can look up word frequencies in (approximately) constant time. Yes, it will cause IO, but so will using a database to save the text. And reading word-frequency information is extremely fast: Lucene uses a data structure called an inverted index, i.e. it stores a map of word -> vector of <doc_number, frequency> pairs. When searching, Lucene doesn't read the documents themselves; it reads the index and retrieves this map, which is small enough to be read very quickly.
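The map described above can be illustrated with a toy version in plain Java. This is only a conceptual sketch of the inverted-index idea, not how Lucene actually stores its postings:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class TinyInvertedIndex {
    // word -> (docId -> term frequency): the shape of data an
    // inverted index conceptually holds for each indexed field.
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Frequency of a word in one document, read without touching the document text. */
    public int freq(String word, int docId) {
        return postings.getOrDefault(word, Collections.emptyMap())
                       .getOrDefault(docId, 0);
    }
}
```

Looking up a frequency never re-reads the original text; that is why the answer describes the lookup as approximately constant time.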
If storing text in the Lucene index is not an option and you only need word-frequency information, use an in-memory index to analyze each separate batch of feeds, save the frequency information somewhere, and erase the index. Also, when adding fields to documents, set the store parameter to Field.Store.NO so that only frequency information is kept, not the field value itself.
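A sketch of that batch flow, assuming Lucene 4.x (RAMDirectory for the throwaway in-memory index, and the terms enumeration to read frequencies straight from the inverted index):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class BatchFrequencies {
    /** Index one batch of text in memory and read back word -> total frequency. */
    static Map<String, Long> frequencies(String text) throws IOException {
        RAMDirectory dir = new RAMDirectory(); // in-memory; discarded after the batch
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));
        Document doc = new Document();
        // Field.Store.NO: only the postings (term frequencies) are kept.
        doc.add(new TextField("description", text, Field.Store.NO));
        writer.addDocument(doc);
        writer.close();

        Map<String, Long> freq = new HashMap<>();
        IndexReader reader = DirectoryReader.open(dir);
        // Walk the inverted index directly; the documents are never re-read.
        TermsEnum terms = MultiFields.getTerms(reader, "description").iterator(null);
        BytesRef term;
        while ((term = terms.next()) != null) {
            freq.put(term.utf8ToString(), terms.totalTermFreq());
        }
        reader.close();
        return freq;
    }
}
```

Note that StandardAnalyzer drops English stop words, so words like "an" or "is" won't appear in the frequency map at all.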

Any good way to handling repeats when using Lucene indexing?

I am using Lucene to index my documents. In my case, each document is rather small, but there are a large number of them (~2GB in total). And in each document there are many repeating words or terms. I am wondering whether indexing with Lucene is the right way for me to go, or what preprocessing I should do on the documents before indexing.
The following are a couple of examples of my documents (each column is a field, the first row is the field name, and starting from 2nd row, each row is one document):
ID    category  track   keywords
id1   cat1      track1  mode=heat treatment;repeat=true;Note=This is an apple
id2   cat1      track2  mode=cold treatment;repeat=true;Note=This is an orange
I want to index all documents, perform a search on the 3 fields (category, track and keywords) and return the unique id1.
If I index this directly, will the repeating terms affect search performance? Do you have a good idea of how I should do the indexing and searching? Thanks a lot in advance.
Repeated terms may affect the search performance by forcing the scorer to consider a large set of documents. If you have terms that are not that discriminating between documents, I suggest preprocessing the documents in order to remove these terms. However, you may want to start by indexing everything (say for a sample of 10000-20000 documents) and see how you fare with regard to relevance and performance.
From the way you describe this, you will need to index the category, track and keywords fields, maybe using a KeywordAnalyzer for the category and track fields. You only need to store the id field. You may want a custom analyzer for the keywords field, or alternatively to preprocess it before the actual indexing.
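One way to get per-field analysis like that is PerFieldAnalyzerWrapper. A minimal sketch, assuming Lucene 4.x (a custom analyzer for the keywords field would replace the StandardAnalyzer default here):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class AnalyzerSetup {
    /** category/track are matched as whole values; other fields are tokenized normally. */
    static Analyzer build() {
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("category", new KeywordAnalyzer()); // whole value = one token
        perField.put("track", new KeywordAnalyzer());
        return new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(Version.LUCENE_47), perField);
    }
}
```

Pass the returned analyzer to both the IndexWriterConfig and the query parser so that indexing and searching tokenize each field the same way.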

Java: from Lucene Hits to original objects

I'd like to implement a filter/search feature in my application using Lucene.
Querying Lucene index gives me a Hits instance, which is nothing more than a list of Documents matching my criteria.
Since I generate the indexed Documents from my objects, what is the best way to find the original object related to a specific Lucene Document?
A better description of my situation:
Three model classes for now: Folder (can have other Folders or Lists as children), List (can have Tasks as children) and Task (can have other Tasks as children). They are all DefaultMutableTreeNode subclasses. I'll add the Tag entity in the future.
Each Task has a text, a start date, a due date, and some boolean flags.
They are displayed in a JTree.
The whole tree is saved in an XML file.
I'd like to do things like these:
search Tasks with Google-like queries.
Find all Tasks that start today.
Filter Tasks by Tag.
You can't, not with vanilla Lucene. You said yourself that you converted your objects into Documents and then stored the Documents in Lucene, how would you imagine that process would be reversible?
If you want to store and retrieve your own objects in Lucene, I strongly recommend that you use Compass instead. Compass is to Lucene what Hibernate is to JDBC - you define a mapping between your objects and Lucene documents, Compass takes care of the conversion.
Add a "stored" field that contains an object identifier. For each hit, lookup the original object via the identifier.
Without knowing more context, it's hard to be more specific.
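The stored-identifier approach can be sketched like this (assuming Lucene 4.x; the field names and the id-to-object map are illustrative, standing in for however your application keys its Task objects):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class HitResolver {
    /** Build the indexable form of an object: searchable text plus a stored id. */
    static Document toDocument(String id, String text) {
        Document doc = new Document();
        doc.add(new TextField("text", text, Field.Store.NO)); // searchable, not stored
        doc.add(new StoredField("id", id));                   // stored, not searchable
        return doc;
    }

    /** Map hits back to the original objects via the stored id. */
    static <T> List<T> resolve(IndexSearcher searcher, TopDocs hits,
                               Map<String, T> objectsById) throws IOException {
        List<T> out = new ArrayList<>();
        for (ScoreDoc sd : hits.scoreDocs) {
            out.add(objectsById.get(searcher.doc(sd.doc).get("id")));
        }
        return out;
    }
}
```

The index then only answers "which ids match"; the objects themselves stay in your own in-memory tree (or XML file), keeping the index small.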
