I am using Lucene to index my documents. In my case, each document is rather small, but there are a great many of them (~2GB in total), and each document contains many repeated words or terms. I am wondering whether indexing with Lucene as-is is the right approach, or what preprocessing I should do on the documents before indexing.
The following are a couple of examples of my documents (each column is a field, the first row is the field names, and starting from the 2nd row, each row is one document):
ID category track keywords
id1 cat1 track1 mode=heat treatment;repeat=true;Note=This is an apple
id2 cat1 track2 mode=cold treatment;repeat=true;Note=This is an orange
I want to index all documents, perform a search on the 3 fields (category, track and keywords), and get back the unique ID (e.g. id1).
If I index this directly, will the repeated terms hurt search performance? Do you have any suggestions for how I should do the indexing and searching? Thanks a lot in advance.
Repeated terms may affect the search performance by forcing the scorer to consider a large set of documents. If you have terms that are not that discriminating between documents, I suggest preprocessing the documents in order to remove these terms. However, you may want to start by indexing everything (say for a sample of 10000-20000 documents) and see how you fare with regard to relevance and performance.
From the way you describe this, you will need to index the category, track and keywords fields, maybe using a KeywordAnalyzer for the category and track fields. You only need to store the id field. You may want a custom analyzer for the keywords field, or alternatively to preprocess it before the actual indexing.
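For illustration, here is a minimal sketch of that setup (a recent Lucene is assumed; the index path and class name are made up, and the field names mirror the example documents above):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;
import java.util.Map;

public class Indexer {
    public static void main(String[] args) throws Exception {
        // KeywordAnalyzer treats an entire field value as a single token; mapping it
        // to category and track keeps query parsing consistent with StringField below,
        // which indexes the whole value as one token without analysis.
        Analyzer analyzer = new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(),
                Map.of("category", new KeywordAnalyzer(),
                       "track", new KeywordAnalyzer()));

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),
                new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StoredField("id", "id1"));                        // stored only, not searched
            doc.add(new StringField("category", "cat1", Field.Store.NO)); // indexed as one token
            doc.add(new StringField("track", "track1", Field.Store.NO));
            doc.add(new TextField("keywords",
                    "mode=heat treatment;repeat=true;Note=This is an apple",
                    Field.Store.NO));                                     // tokenized by StandardAnalyzer
            writer.addDocument(doc);
        }
    }
}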
Related
I have a vocabulary file with different words and information about them. It's about 100MB in size. Searching this file takes a very long time, however. Is there any way to improve the speed at which I can look up the data? For example, I was thinking of writing a program that would split the text file into 26 different text files (by the first letter of the word); the program would then only need to check the first letter of the given word and would have a much smaller file to search. Will this improve the execution time of the program? Are there any efficient data structures I could store the file in? Like JSON, for example. Also, what about databases? I'm using Kotlin/Java.
Edit: So far, I've just brute-force searched the entire file until I find a match. But, as I said, the file is >100MB, and the program takes about 5 seconds to search for just one word. In the future, I want it to search for 100 words in milliseconds, ideally, the way text editors like Word look up words in their dictionaries.
Perhaps save the map (key = word, value = information about word) in a JSON file. Then, you can load the JSON in the program, extract the HashMap, and find the word you want (since hash lookups are very fast).
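As a rough sketch of that idea - assuming a Jackson dependency, and a hypothetical vocabulary.json file holding a single JSON object of word-to-description pairs:

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.util.HashMap;

public class Vocabulary {
    public static void main(String[] args) throws Exception {
        // Parse the whole JSON object into a HashMap once, up front
        ObjectMapper mapper = new ObjectMapper();
        HashMap<String, String> vocab = mapper.readValue(
                new File("vocabulary.json"),
                new TypeReference<HashMap<String, String>>() {});

        // After that, each lookup is an O(1) hash probe rather than a file scan
        System.out.println(vocab.get("apple"));
    }
}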
It depends on the available memory. If the whole vocabulary can fit in memory with no performance decrease, then a HashMap (if each word has an associated value) or a HashSet (if it has not) is optimized precisely for fast lookup access. If keeping everything in memory is not an option, you could use a database with an index on the words that you want to look up. Apache Derby is a lightweight database nicely interfaced with Java, but HSQLDB, H2 or SQLite are good choices too.
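If memory is the constraint, a minimal sketch of the database route could look like this (assuming an H2 dependency on the classpath; the table and file names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class VocabularyDb {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./vocab")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS vocab(word VARCHAR PRIMARY KEY, info VARCHAR)");
                // The index on the word column is what turns a full scan into a B-tree lookup;
                // a PRIMARY KEY is indexed automatically, shown explicitly here for clarity
                st.execute("CREATE INDEX IF NOT EXISTS idx_word ON vocab(word)");
            }
            try (PreparedStatement ps = conn.prepareStatement("SELECT info FROM vocab WHERE word = ?")) {
                ps.setString(1, "apple");
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) System.out.println(rs.getString("info"));
                }
            }
        }
    }
}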
There are multiple ways to achieve this:
Load the data into a relational database (MySQL, Postgres, etc.) with one column representing the word and other columns containing information about it. Add an index on the word column. This caters to the case where your dataset grows beyond the available memory in the future.
Load the data into an in-memory hash table with the word as key and the information about it as value.
If you want to write your own logic, you can load the data into a list, sort it by word and perform a binary search (see the sketch after this list).
You can use a text search engine such as Elasticsearch or Apache Solr.
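Here is the sketch of the sort-then-binary-search option mentioned above, using only the standard library (the file name and one-word-per-line format are assumptions):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public class BinarySearchLookup {
    public static void main(String[] args) throws Exception {
        // Assume one word per line; sort once at startup
        List<String> words = Files.readAllLines(Paths.get("vocabulary.txt"));
        Collections.sort(words);

        // Each lookup is now O(log n) comparisons instead of a full scan
        int idx = Collections.binarySearch(words, "apple");
        System.out.println(idx >= 0 ? "found at " + idx : "not found");
    }
}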
You have a file, and in this file you search character by character and word by word.
Assuming that you have n words in the file,
a full "scan" will take n * time_for_one_word_check.
Assuming that time_for_one_word_check is constant, we will just focus on n.
Searching a sorted list of words using binary search (or some form of it) will take at most roughly log2(n) checks.
This means that if you have n = 10, the full scan will take 10 checks while binary search will take about 3 (log2 10 ≈ 3.3).
For n = 1,000,000, a full scan will take 1,000,000 checks while binary search will take only about 20 (log2 1,000,000 ≈ 20).
So: sort the data, save it, and then search the sorted data.
This can be done in multiple ways
Saving the data in a sorted format
You can either save the data to a single file or have a database manage saving, indexing and querying this data
You should choose a database if your data will grow or gain complexity later, or if you intend to be able to look up (index) both the words and their information.
You should choose a simple file if the data is not expected to grow in volume or complexity.
There are different file formats; I suggest you try saving the data in a JSON format where the keys are the sorted words and the values are their descriptions (this allows you to search through only the keys).
Load this data once, on application startup, into an immutable Map.
Query that map every time you need to perform a search.
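A small sketch of those last two steps, assuming the vocabulary has already been flattened into word<TAB>description lines (the file name and delimiter are made up):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

public class StartupVocabulary {
    // Loaded once at startup; unmodifiable thereafter
    private static final NavigableMap<String, String> VOCAB = load();

    private static NavigableMap<String, String> load() {
        TreeMap<String, String> map = new TreeMap<>(); // keeps keys sorted
        try {
            for (String line : Files.readAllLines(Paths.get("vocabulary.tsv"))) {
                String[] parts = line.split("\t", 2); // word <TAB> description
                if (parts.length == 2) map.put(parts[0], parts[1]);
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not load vocabulary", e);
        }
        return Collections.unmodifiableNavigableMap(map);
    }

    public static String lookup(String word) {
        return VOCAB.get(word); // O(log n) in a TreeMap
    }

    public static void main(String[] args) {
        System.out.println(lookup("apple"));
    }
}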
Helpful research keywords
binary search
table scan and index
Also, what about databases?
You can use an index if you have a big table and don't want to search through all of its rows. When you create an index on a table, the DBMS usually creates a B-tree. A B-tree is useful for storing large amounts of data where you need search or range search. Check this post (link) and the MySQL reference (link). If you want to learn more about how to implement structures like B-trees or B+-trees, you can use this book (link). It contains implementations of structures that are used for searching data; B-trees are not covered there, but the author is the creator of red-black trees (B-trees are a generalization of these). There is also something here (link).
I have an abstract superclass and some subclasses. My question is: what is the best way to keep objects of those classes so I can easily find them using all the different parameters?
For example, if I want to look up by resourceCode (every object has a unique resource code) I can use a HashMap with resourceCode as the key. But what happens if I want to look up by genre - there are many games with the same genre, so I will get all those games. My first idea was an ArrayList of those objects, but isn't that too slow if we have 1,000,000 games (about 1,000,000 operations)?
My other idea is to have a HashTable with the product code as the key; the complexity of that search is constant. After that, I create as many HashSets as there are fields in the classes, and for each field I get the productCode/productCodes of the objects that sit in the HashSet under that certain field (for example, game promoter). With those unique codes I can get everything I want from the HashTable. Is this a good idea? It seems a lot of space will be needed to store the data, but it will be fast.
So my question is: what data structure should I use to implement fast lookup of custom objects, searching by their attributes (fields)?
Please see the attachment: Classes Example
Thank you in advance.
Stefan Stefanov
You can use sorted or ordered data structures to optimize search complexity.
You can introduce your own search index for custom data.
But it is better to use database or search engine.
Have a look at Elasticsearch, Apache Solr, PostgreSQL
It sounds like most of your fields can be mapped to a string (name, genre, promoter, description, year of release, ...). You could put all these strings in a single large index that maps each keyword to all objects that contain the word in any of their fields. Then if you search for certain keywords it will return a list of all entries that contain that word. For example searching for 'mine' should return 'minecraft' (because of title), as well as all mine craft clones (having 'minecraft-like' as genre) and all games that use the word 'mine' in the 'info text' field.
You can code this yourself, but I suppose some full-text indexer such as Lucene may be useful. I haven't used Lucene myself, but I suppose it would also allow you to search for multiple keywords at once, even if they occur in different fields.
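If you do roll it yourself, a minimal sketch of such a keyword index might look like this (the Game class and its fields are invented for the example):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class GameIndex {
    record Game(String resourceCode, String name, String genre) {}

    // keyword -> every game whose fields contain that keyword
    private final Map<String, Set<Game>> index = new HashMap<>();

    public void add(Game game) {
        for (String field : List.of(game.name(), game.genre())) {
            for (String word : field.toLowerCase(Locale.ROOT).split("\\W+")) {
                index.computeIfAbsent(word, k -> new HashSet<>()).add(game);
            }
        }
    }

    public Set<Game> search(String keyword) {
        return index.getOrDefault(keyword.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        GameIndex idx = new GameIndex();
        idx.add(new Game("id1", "Minecraft", "sandbox"));
        idx.add(new Game("id2", "Mineclone", "minecraft-like"));
        System.out.println(idx.search("minecraft")); // hits both entries
    }
}

The memory cost is what the asker suspects: every object is referenced once per distinct keyword it contains, in exchange for constant-time lookup per keyword.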
This is not a very appealing answer.
Start with a database. Maybe an embedded database (like h2database).
An easy set of fixed develop/test data that can be changed easily (the database dump).
Too many indices (hash maps) do harm.
Developing and optimizing queries is easier (declarative) than with data structures.
Database tables are less coupled than data structures with helper structures (maps).
The resulting system is far less complex and scales better.
After development has stabilized the set of queries, you can think of doing away with the DB part. Use at least a two-tier separation of the database and the classes.
Then you might find a stable and best-fitting data model.
Should you still intend to do it all with pure objects, then work them out in detail as design documentation before you start programming: example stories, and how one solves them.
What is an index in Elasticsearch? Does one application have multiple indexes or just one?
Let's say you built a system for some car manufacturer. It deals with people, cars, spare parts, etc. Do you have one index named manufacturer, or do you have one index for people, one for cars and a third for spare parts? Could someone explain?
Good question, and the answer is a lot more nuanced than one might expect. You can use indices for several different purposes.
Indices for Relations
The easiest and most familiar layout clones what you would expect from a relational database. You can (very roughly) think of an index like a database.
MySQL => Databases => Tables => Rows/Columns
ElasticSearch => Indices => Types => Documents with Properties
An ElasticSearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns).
So in your car manufacturing scenario, you may have a SubaruFactory index. Within this index, you have three different types:
People
Cars
Spare_Parts
Each type then contains documents that correspond to that type (e.g. a Subaru Imprezza doc lives inside of the Cars type. This doc contains all the details about that particular car).
Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]
So to retrieve the Subaru document, I may do this:
$ curl -XGET localhost:9200/SubaruFactory/Cars/SubaruImprezza
Indices for Logging
Now, the reality is that Indices/Types are much more flexible than the Database/Table abstractions we are used to in RDBMSs. They can be considered convenient data organization mechanisms, with added performance benefits depending on how you set up your data.
To demonstrate a radically different approach, a lot of people use ElasticSearch for logging. A standard format is to assign a new index for each day. Your list of indices may look like this:
logs-2013-02-22
logs-2013-02-21
logs-2013-02-20
ElasticSearch allows you to query multiple indices at the same time, so it isn't a problem to do:
$ curl -XGET 'localhost:9200/logs-2013-02-22,logs-2013-02-21/Errors/_search?q="Error Message"'
Which searches the logs from the last two days at the same time. This format has advantages due to the nature of logs - most logs are never looked at, and they are organized in a linear flow of time. Making an index per day is more logical and offers better performance for searching.
Indices for Users
Another radically different approach is to create an index per user. Imagine you have some social networking site, and each user has a large amount of random data. You can create a single index for each user. Your structure may look like:
Zach's Index
Hobbies Type
Friends Type
Pictures Type
Fred's Index
Hobbies Type
Friends Type
Pictures Type
Notice how this setup could easily be done in a traditional RDBMS fashion (e.g. a "Users" index, with hobbies/friends/pictures as types). All users would then be thrown into a single, giant index.
Instead, it sometimes makes sense to split data apart for data organization and performance reasons. In this scenario, we are assuming each user has a lot of data, and we want them separate. ElasticSearch has no problem letting us create an index per user.
Zach's answer above is valid for Elasticsearch 5.x and below. Since Elasticsearch 6.x, Type has been deprecated, and it will be completely removed in 7.x. Quoting the Elasticsearch docs:
Initially, we spoke about an “index” being similar to a “database” in an SQL database, and a “type” being equivalent to a “table”.
This was a bad analogy that led to incorrect assumptions.
To explain further: in SQL, two columns with the same name in two different tables can be independent of each other. But in an Elasticsearch index that is not possible, since fields with the same name are backed by the same Lucene field. Thus, an "index" in Elasticsearch is not quite the same as a "database" in SQL; any identically named fields in an index would end up having conflicting field types. To avoid this, the Elasticsearch documentation recommends storing one document type per index.
Refer: Removal of mapping types
An index is a data structure for storing the mapping of fields to the corresponding documents. The objective is to allow faster searches, often at the expense of increased memory usage and preprocessing time.
The number of indexes you create is a design decision that you should make according to your application requirements. You can have an index for each business concept... You can have an index for each month of the year...
You should invest some time getting acquainted with Lucene and Elasticsearch concepts.
Take a look at the introductory video and at this one with some data design patterns.
The answer above is quite detailed; put very briefly, it could be defined as:
Index: a collection of different types of documents and document properties. An index also uses the concept of shards to improve performance. For example, a set of documents may contain the data of a social networking application.
(Answer from tutorialspoint.com)
Since an index is a collection of different types of documents, how you categorize depends, as per the question.
Do you have one index named manufacturer?
Yes - we would keep one index for the manufacturer.
Do you have one index for people, one for cars and a third for spare parts? Could someone explain?
Think of a car instance given by the same manufacturer to many people driving it on the road. So there could be many indices, depending on the number of uses.
If we think it through, we find that all but the first option are impractical.
Elasticsearch documents are very different from SQL rows or CSV/spreadsheet documents; from one index, with its good, powerful query language, you can create millions of kinds of categorized documents in CSV style.
Because it is blazingly fast and indexed, we create only one index for one customer, and from that we create as many types of documents as we need.
For example:
all old people using the same model, or one old person using all models.
The permutations are infinite.
I have a situation where an hourly batch job has to parse a large number of RSS feeds and extract the text of the title and description elements of each item per feed into strings, which will then have their word frequencies calculated by Lucene.
But, not knowing how many feeds or items per feed there will be, each string may potentially consist of thousands of words.
I suppose the basic pseudocode I'm looking at is something like this:
for each feed
    for each item within date/time window
        get text from title element, concatenate it to title_string
        get text from description element, concatenate it to description_string
calculate top x keywords from title_string
for each keyword y in x
    calculate frequency of keyword y in description_string
Can anyone suggest how to handle this data to reduce memory usage? That is, apart from using StringBuilders as the data is read from each feed.
Though the contents of the feeds will be stored in a database, I want to calculate the word frequencies 'on the fly', to avoid all the IO that would be necessary if each feed had its own database table.
First, I don't understand why you want to store text in a database if you already have Lucene. Lucene is a kind of database with indexes built on words rather than record ids, and that's the only difference for text documents. For example, you can store each item of a feed as a separate document with fields "title", "description", etc. If you need to store information about the feed itself, create one more type of document for feeds, generate an id, and put this id into all of the feed's items as a reference.
If you do this, you can count word frequency in (approximately) constant time. Yes, it will cause IO, but using a database to save the text will do so too. And reading word frequency information is extremely fast: Lucene uses a data structure called an inverted index, i.e. it stores a map of word -> vector of <doc_number, frequency> pairs. When searching, Lucene doesn't read the documents themselves, but instead reads the index and retrieves this map - it is small enough to be read very quickly.
If storing the text in a Lucene index is not an option and you only need the word frequency information, use an in-memory index to analyze each separate batch of feeds, save the frequency information somewhere, and erase the index. Also, when adding fields to documents, set the store parameter to Field.Store.NO to keep only the frequency information and not the fields themselves.
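As a minimal sketch of that in-memory approach with a recent Lucene (8+) - the class, field names and sample text are assumptions, not the asker's code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

public class BatchFrequencies {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory, discarded after the batch

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Field.Store.NO: tokens are indexed for the statistics, raw text is not kept
            doc.add(new TextField("description", "an apple is an apple", Field.Store.NO));
            writer.addDocument(doc);
        }

        try (IndexReader reader = DirectoryReader.open(dir)) {
            Terms terms = MultiTerms.getTerms(reader, "description");
            TermsEnum it = terms.iterator();
            BytesRef term;
            while ((term = it.next()) != null) {
                // totalTermFreq = occurrences of the term across the whole batch
                System.out.println(term.utf8ToString() + " -> " + it.totalTermFreq());
            }
        }
    }
}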
My application indexes discussion threads. Each entry in the discussion is indexed as a separate Lucene document with a common_id field which can be used to group search hits into one discussion.
Currently, when the search is performed, if a thread has 3 entries, then 3 separate hits are returned. Even though this is correct, from the user's point of view the same entry appears in the search results multiple times.
Is there a way to tell Lucene to group its search results by the common_id field before returning them?
I believe what you are asking for is Field Collapsing, which is a feature of Solr (and I believe Elasticsearch as well).
If you want to roll your own, one possible way to do this is:
Add a "series id" field to each document that is a member of a series. You will have to ensure that this gets incremented for every new series.
Make an initial query to Lucene, and get a hit list.
For each hit, check to see if it has a series id; if it does, make another query by the series id in order to retrieve all the members of the series.
An alternative is to store the ids of all the series members in a field inside each member's document.
There is nothing built into Lucene that collapses results based on a field. You will need to implement that yourself.
However, they've recently built this feature into Solr.
See http://www.lucidimagination.com/blog/2010/09/16/2446/
Since version 3.2, Lucene supports grouping search results by a field:
http://lucene.apache.org/core/4_1_0/grouping/org/apache/lucene/search/grouping/package-summary.html
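For reference, a minimal sketch with the grouping module (this assumes common_id was indexed as a SortedDocValuesField, which field grouping requires; searcher and query are whatever your application already has):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.grouping.GroupDocs;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.util.BytesRef;

public class GroupedSearch {
    static void searchByDiscussion(IndexSearcher searcher, Query query) throws Exception {
        GroupingSearch grouping = new GroupingSearch("common_id");
        grouping.setGroupDocsLimit(3); // keep at most 3 entries per discussion

        // One group per distinct common_id, i.e. one result per discussion
        TopGroups<BytesRef> groups = grouping.search(searcher, query, 0, 10);
        for (GroupDocs<BytesRef> group : groups.groups) {
            System.out.println("discussion: " + group.groupValue.utf8ToString());
            for (ScoreDoc hit : group.scoreDocs) {
                System.out.println("  entry doc id " + hit.doc);
            }
        }
    }
}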