How Indexing improve query performance in mongodb

How Indexing improve query performance in mongodb - java

I need to know abt how indexing in mongo improve query performance. And currently my db is not indexed. How can i index an existing DB.? Also is i need to create a new field only for indexing.?.

Fundamentally, indexes in MongoDB are similar to indexes in other database systems. MongoDB supports indexes on any field or sub-field contained in documents within a MongoDB collection.
Indexes are covered in detail here and I highly recommend reading this documentation.
There are sections on indexing operations, strategies and creation options as well as a detailed explanations on the various indexes such as compound indexes (i.e. an index on multiple fields).
One thing to note is that by default, creating an index is a blocking operation. Creating an index is as simple as:
db.collection.ensureIndex( { zip: 1})
Something like this will be returned, indicating the index was correctly inserted:
Inserted 1 record(s) in 7ms
Building an index on a large collection of data, the operation can take a long time to complete. To resolve this issue, the background option can allow you to continue to use your mongod instance during the index build.
Limitations on indexing in MongoDB is covered here.

Related

Is it possible in Hibernate Search to predefined Query or conditions before the indexing process?

I am using MassIndexer with #Indexed interceptor and it works just fine I am able to filter the entities. but the problem is that I have thousands of soft-deleted records, I don't want these objects to be in the indexing process since they are not important anymore.
so, is it possible in Hibernate Search to predefined query or conditions before the indexing process?

You can't do indexing with predefined HQL. Rather you can intercept the indexing process and instruct indexer whether it should index, skip or remove index for entity.
Please refer to Conditional Indexing topic in Reference Guilde.
Under your conditions: when > 95% of data is not to be indexed I would suggest the following:
Consider manual reindexing by running query and pushing items to index as described in Manual index changes
Consider splitting full data table and only active data table. This is a bit of data duplication but should give you considerable performance gains when working with active records only.

ElasticSearch: Secondary indecies on field values using Java-API

I'm considering to use ElasticSearch as a search engine for large objects. There are about 500 millions objects on a single machine. For far is Elasticsearch a good solution for executing advanced queries. But a have the problem that i did find any technique to create secondary index on the document fields. Is in elasticsearch a possibility to create a secondary indecies like in MySQL on columns? Or are there any other technologies implemented to accelerate searches on field values? I'm using an single server enviroment and I have to store about 300 fields per row/object. At the moment there are about 500 million object in my database.

I apologize in advance it I don't understand the question. Elasticsearch is itself an index based technology (it's built on top of Lucene which is a build for index based search). You put documents into Elasticsearch and the individual fields on those documents are indexed and searchable. You should not have to worry about creating secondary indexes; the fields will be indexed by default (in most cases).
One of the differences between Elasticsearch and Solr is that in Solr, you have to specify a schema defining what the fields are on the documents and whether that field will be indexed (available to search against), stored (available as the result of a search) or both. Elasticsearch does not require an upfront schema, and in lieu of specific mappings for fields, then reasonable defaults are used instead. I believe that the core fields (string, number, etc..._) are indexed by default, meaning they are available to search against.
Now in your case, you have a document with a lot of fields on it. You will probably need to tweak the mappings a bit to only index the fields that you know you might search against. If you index too much, the size of the index itself will balloon and will not be as fast as if you had a trim index of only the fields you know you will search against. Also, Lucene loads parts of the index into memory to really enable fast searches. With a bloated index, you won't be able to keep as much stuff in memory and your searches will suffer as a result. You should look at the Mappings API and the Core Types section for more info on how to do this.

What is an index in Elasticsearch

What is an index in Elasticsearch? Does one application have multiple indexes or just one?
Let's say you built a system for some car manufacturer. It deals with people, cars, spare parts, etc. Do you have one index named manufacturer, or do you have one index for people, one for cars and a third for spare parts? Could someone explain?

Good question, and the answer is a lot more nuanced than one might expect. You can use indices for several different purposes.
Indices for Relations
The easiest and most familiar layout clones what you would expect from a relational database. You can (very roughly) think of an index like a database.
MySQL => Databases => Tables => Rows/Columns
ElasticSearch => Indices => Types => Documents with Properties
An ElasticSearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns).
So in your car manufacturing scenario, you may have a SubaruFactory index. Within this index, you have three different types:
People
Cars
Spare_Parts
Each type then contains documents that correspond to that type (e.g. a Subaru Imprezza doc lives inside of the Cars type. This doc contains all the details about that particular car).
Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]
So to retrieve the Subaru document, I may do this:
$ curl -XGET localhost:9200/SubaruFactory/Cars/SubaruImprezza
.
Indices for Logging
Now, the reality is that Indices/Types are much more flexible than the Database/Table abstractions we are used to in RDBMs. They can be considered convenient data organization mechanisms, with added performance benefits depending on how you set up your data.
To demonstrate a radically different approach, a lot of people use ElasticSearch for logging. A standard format is to assign a new index for each day. Your list of indices may look like this:
logs-2013-02-22
logs-2013-02-21
logs-2013-02-20
ElasticSearch allows you to query multiple indices at the same time, so it isn't a problem to do:
$ curl -XGET localhost:9200/logs-2013-02-22,logs-2013-02-21/Errors/_search=q:"Error Message"
Which searches the logs from the last two days at the same time. This format has advantages due to the nature of logs - most logs are never looked at and they are organized in a linear flow of time. Making an index per log is more logical and offers better performance for searching.
.
Indices for Users
Another radically different approach is to create an index per user. Imagine you have some social networking site, and each users has a large amount of random data. You can create a single index for each user. Your structure may look like:
Zach's Index
Hobbies Type
Friends Type
Pictures Type
Fred's Index
Hobbies Type
Friends Type
Pictures Type
Notice how this setup could easily be done in a traditional RDBM fashion (e.g. "Users" Index, with hobbies/friends/pictures as types). All users would then be thrown into a single, giant index.
Instead, it sometimes makes sense to split data apart for data organization and performance reasons. In this scenario, we are assuming each user has a lot of data, and we want them separate. ElasticSearch has no problem letting us create an index per user.

#Zach's answer is valid for elasticsearch 5.X and below. Since elasticsearch 6.X Type has been deprecated and will be completely removed in 7.X. Quoting the elasticsearch docs:
Initially, we spoke about an “index” being similar to a “database” in an SQL database, and a “type” being equivalent to a “table”.
This was a bad analogy that led to incorrect assumptions.
Further to explain, two columns with the same name in SQL from two different tables can be independent of each other. But in an elasticsearch index that is not possible since they are backed by the same Lucene field. Thus, "index" in elasticsearch is not quite same as a "database" in SQL. If there are any same fields in an index they will end up having conflicts of field types. To avoid this the elasticsearch documentation recommends storing index per document type.
Refer: Removal of mapping types

An index is a data structure for storing the mapping of fields to the corresponding documents. The objective is to allow faster searches, often at the expense of increased memory usage and preprocessing time.
The number of indexes you create is a design decision that you should take according to your application requirements. You can have an index for each business concept... You can an index for each month of the year...
You should invest some time getting acquainted with lucene and elasticsearch concepts.
Take a look at the introductory video and to this one with some data design patterns

Above one is too detailed in very short it could be defined as
Index: It is a collection of different type of documents and document properties. Index also uses the concept of shards to improve the performance. For example, a set of document contains data of a social networking application.
Answer from tutorialpoints.com
Since index is collection of different type of documents as per question depends how you want to categorize.
Do you have one index named manufacturer?
Yes , we will keep one document with manufacturer thing.
do you have one index for people, one for cars and a third for spare parts? Could someone explain?
Think of instance car given by same manufacturer to many people driving it on road .So there could be many indices depending upon number of use.
If we think deeply we will found except first question all are invalid ones.
Elastic-search documents are much different that SQL docs or csv or spreadsheet docs ,from one indices and by good powerful query language you can create millions type of data categorised documents in CSV style.
Due to its blazingly fast and indexed capability we create one index only for one customer , from that we create many type of documnets as per our need .
For example:
All old people using same model.Or One Old people using all model .
Permutation is inifinite.

How does Hibernate Search (Lucene indexing) work?

I am using Hibernate Search built on top of Lucene indexing. If indexes are created against database table the performance will be good in returning the results.
My question is, once indexes are created, if we query for the results does Hibernate Search fetch results from the original database table using the created indexes? or does it not need to hit the database to fetch the results?
Thanks!

Unless you use Projections the indexes are used only to identify the set of primary keys matching the query, these are then used to load the entities from the Database.
There are many good reasons for this:
As you pointed out, we don't store all data in the index: a larger index is a slower index
Adding all needed metadata to the index would make indexing a very expensive operation
Value extraction from the index is not efficient at all: it's good at queries, no more
Relational databases are very good at loading data by primary key
If you DB isn't good enough, second level cache is excellent to load by primary key
By loading from the DB we guarantee consistency especially with async indexing
By loading from the DB you have entities participate in Transactions and isolation
That said, if you don't need fully managed entities you can use Projections to load the fields you annotated as Stored.YES. A common pattern is to provide preview of matches using projections, and then when the user clicks for details to load the full entity matching that result.

By default, every time an object is inserted, updated or deleted through Hibernate, Hibernate Search updates the according Lucene index as per documentation
Hence, the further searches will yeild the data through lucene indexes only.
Another Question explaining how Indexes work

Keeping query statistics using lucene

I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).
To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.
But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?
Thanks for the help.

"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"
You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms, and the index itself. So, it might not be a large overhead to store each query as a unique Document...this way you will not be throwing away any information.

First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:
Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries. Build query statistics using post-processing.As this is a web application, you can probably use a servlet container, such as Tomcat's, logs for this.
Second, Auto-Suggest From Popular Queries Using EdgeNGrams suggests an alternative implementation of query suggestion using Solr.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.