ElasticSearch: Secondary indices on field values using Java-API

I'm considering using ElasticSearch as a search engine for large objects. So far, Elasticsearch looks like a good solution for executing advanced queries, but I haven't found any technique for creating a secondary index on the document fields. Is it possible in Elasticsearch to create secondary indices, like the ones MySQL offers on columns? Or are there other techniques for accelerating searches on field values? I'm using a single-server environment and have to store about 300 fields per row/object. At the moment there are about 500 million objects in my database.

I apologize in advance if I don't understand the question. Elasticsearch is itself an index-based technology (it's built on top of Lucene, which is built for index-based search). You put documents into Elasticsearch, and the individual fields on those documents are indexed and searchable. You should not have to worry about creating secondary indexes; the fields will be indexed by default (in most cases).
One of the differences between Elasticsearch and Solr is that in Solr you have to specify a schema defining what the fields on the documents are and whether each field will be indexed (available to search against), stored (available as the result of a search), or both. Elasticsearch does not require an upfront schema; in lieu of specific mappings for fields, reasonable defaults are used instead. I believe that the core field types (string, number, etc.) are indexed by default, meaning they are available to search against.
Now in your case, you have a document with a lot of fields on it. You will probably need to tweak the mappings a bit to only index the fields that you know you might search against. If you index too much, the size of the index itself will balloon and will not be as fast as if you had a trim index of only the fields you know you will search against. Also, Lucene loads parts of the index into memory to really enable fast searches. With a bloated index, you won't be able to keep as much stuff in memory and your searches will suffer as a result. You should look at the Mappings API and the Core Types section for more info on how to do this.
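As a rough illustration of trimming the index, the plain-Java sketch below builds a mapping body in which only a chosen set of fields stays searchable. It assumes the `"index": false` mapping parameter of recent Elasticsearch versions (older releases used `"index": "no"`), and the `keyword` type, the field names, and the `MappingSketch` class are made up for the example. It only produces the JSON string; sending it through the Mappings API is left out.

```java
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical helper, not part of any Elasticsearch client library.
final class MappingSketch {
    // Builds a mapping body in which only the fields listed in `searchable` are indexed.
    // Assumption: every field is mapped as "keyword"; real mappings would pick a type per field.
    static String buildMapping(List<String> allFields, Set<String> searchable) {
        StringJoiner props = new StringJoiner(",");
        for (String field : allFields) {
            boolean index = searchable.contains(field);
            props.add("\"" + field + "\":{\"type\":\"keyword\",\"index\":" + index + "}");
        }
        return "{\"properties\":{" + props + "}}";
    }
}
```

For example, `MappingSketch.buildMapping(List.of("title", "genre", "internalNotes"), Set.of("title", "genre"))` marks `internalNotes` as not indexed, so it can still be returned with the document but cannot be searched.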

Related

Data structure for fast searching of custom object using its attributes (fields) in Java

I have an abstract superclass and some subclasses. My question is what the best way is to store objects of those classes so that I can easily find them by any of their parameters.
For example, if I want to look up by resourceCode (every object has a unique resource code), I can use a HashMap keyed by resourceCode. But what happens if I want to look up by genre? There are many games with the same genre, so I would get all of those games. My first idea was an ArrayList of those objects, but isn't that too slow if we have 1 000 000 games (about 1 000 000 operations per lookup)?
My other idea is to have a HashTable keyed by the product code, so the complexity of that lookup is constant. I would then create as many HashSets as there are fields in the classes, and for each field value collect the product codes of the objects that have that value (for example, a given game promoter). With those unique codes I can get everything I want from the HashTable. Is this a good idea? It seems it will need a lot of space to store the data, but it will be fast.
So my question is: what data structure should I use to implement fast lookup of custom objects by their attributes (fields)?
Please see the attachment: Classes Example
Thank you in advance.
Stefan Stefanov

You can use Sorted or Ordered data structures to optimize search complexity.
You can introduce your own search index for custom data.
But it is better to use a database or a search engine.
Have a look at Elasticsearch, Apache Solr, or PostgreSQL.
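If you do want to try a hand-rolled search index first, a minimal sketch could look like the following. It keeps one primary map keyed by the unique code plus one secondary map per field, which is essentially the structure described in the question. The class name `AttributeIndex` and the idea of passing attributes as a `Map<String, String>` are assumptions made for the example, and updates or removals of existing objects are not handled.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory index: exact-match lookups only.
final class AttributeIndex {
    // Primary index: unique code -> attributes of that object.
    private final Map<String, Map<String, String>> byCode = new HashMap<>();
    // Secondary indexes: field name -> field value -> codes of matching objects.
    private final Map<String, Map<String, Set<String>>> byField = new HashMap<>();

    // Registers an object; re-adding an existing code would leave stale entries behind.
    void add(String code, Map<String, String> attributes) {
        byCode.put(code, attributes);
        attributes.forEach((field, value) ->
                byField.computeIfAbsent(field, f -> new HashMap<>())
                       .computeIfAbsent(value, v -> new HashSet<>())
                       .add(code));
    }

    // Constant-time lookup by the unique code; null if unknown.
    Map<String, String> get(String code) {
        return byCode.get(code);
    }

    // Codes of all objects whose `field` equals `value`; empty set if none.
    Set<String> find(String field, String value) {
        return byField.getOrDefault(field, Collections.emptyMap())
                      .getOrDefault(value, Collections.emptySet());
    }
}
```

The memory cost grows with the number of indexed fields, which is the trade-off the question already anticipates, and range or full-text queries would need different structures.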

It sounds like most of your fields can be mapped to a string (name, genre, promoter, description, year of release, ...). You could put all these strings in a single large index that maps each keyword to all objects that contain the word in any of their fields. Then, if you search for certain keywords, it will return a list of all entries that contain that word. For example, searching for 'mine' should return 'minecraft' (because of the title), as well as all Minecraft clones (having 'minecraft-like' as genre) and all games that use the word 'mine' in the 'info text' field.
You can code this yourself, but I suppose a full-text indexer such as Lucene may be useful. I haven't used Lucene myself, but I suppose it would also allow you to search for multiple keywords at once, even if they occur in different fields.
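A hand-coded version of such a keyword index might be sketched as below. It is not how Lucene works internally; it simply lowercases every field value, splits it into alphanumeric tokens, and uses a sorted map so that a prefix such as 'mine' also finds the token 'minecraft'. The `KeywordIndex` name and the prefix-matching choice are assumptions for the example.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical single index over all string fields of all objects.
final class KeywordIndex {
    // token -> ids of the objects that contain the token in any field
    private final TreeMap<String, Set<String>> postings = new TreeMap<>();

    // Indexes every token of every given field value under the object's id.
    void add(String id, String... fieldValues) {
        for (String value : fieldValues) {
            for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
                }
            }
        }
    }

    // Returns ids of all objects that contain a token starting with `prefix`.
    Set<String> searchPrefix(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        Set<String> hits = new TreeSet<>();
        // All keys in [p, p + '\uffff') start with p in the map's natural ordering.
        for (Set<String> ids : postings.subMap(p, p + Character.MAX_VALUE).values()) {
            hits.addAll(ids);
        }
        return hits;
    }
}
```

Searching for several keywords at once could then be done by intersecting the sets returned for each keyword.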

This is not a very appealing answer.
Start with a database. Maybe an embedded database (like h2database).
An easy set of fixed development/test data that can be changed easily (the database dump).
Too many indices (hash maps) do harm.
Developing and optimizing queries is easier (declarative) than with data structures.
Database tables are less coupled than data structures with helper structures (maps).
The resulting system is far less complex and scales better.
After development has stabilized the set of queries, you can think of doing away with the DB part. Use at least a two-tier separation of the database and the classes.
Then you might find a stable and best fitting data model.
Should you still intend to do it all with pure objects, then work them out in detail as design documentation before you start programming. Example stories, and how one solves them.

Hibernate Search Result Ranking

I am using Hibernate Search along with Lucene to implement full-text search on my database. I want to know whether a Hibernate Search query or a Lucene query returns the top-ranked, most relevant results. The documentation says:
Apache Lucene provides a very flexible and powerful way to sort
results. While the default sorting (by relevance) is appropriate most
of the time
Link: http://docs.jboss.org/hibernate/search/4.2/reference/en-US/html_single/#search-query
Section: 5.1.3.3. Sorting
But I am very confused by the results, as they are always ordered by the IDs of the objects. I just need the top 100 most relevant records.

See Customizing Lucene's scoring formula

Sorting by relevance is affected by your Analyzer choices. If you are getting results in the order of primary keys, it is likely that they all have the same score. That is normally very unlikely, so my guess is that you're not enabling tokenization on any searched field.
Make sure you're tokenizing the fields used in the Query and they are using an appropriate Analyzer. To pick an appropriate one you'll have to experiment a bit as it depends on the language (if it's natural language) or on what kind of data you're indexing.
To actually debug the sort order applied by Relevance sort, see usage of Projections in the Hibernate Search documentation: both FullTextQuery.SCORE and FullTextQuery.EXPLANATION can be very useful to understand what's going on.
A handy utility for quickly experimenting with the effect of different Analyzers is org.hibernate.search.util.AnalyzerUtils. You can either write unit tests that create the Analyzer instance yourself, or retrieve the analyzers by name using org.hibernate.search.engine.SearchFactory.getAnalyzer(String), or retrieve the base one used for a specific indexed entity by entity type using org.hibernate.search.engine.SearchFactory.getAnalyzer(Class).

What is an index in Elasticsearch

What is an index in Elasticsearch? Does one application have multiple indexes or just one?
Let's say you built a system for some car manufacturer. It deals with people, cars, spare parts, etc. Do you have one index named manufacturer, or do you have one index for people, one for cars and a third for spare parts? Could someone explain?

Good question, and the answer is a lot more nuanced than one might expect. You can use indices for several different purposes.
Indices for Relations
The easiest and most familiar layout clones what you would expect from a relational database. You can (very roughly) think of an index like a database.
MySQL => Databases => Tables => Rows/Columns
ElasticSearch => Indices => Types => Documents with Properties
An ElasticSearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns).
So in your car manufacturing scenario, you may have a SubaruFactory index. Within this index, you have three different types:
People
Cars
Spare_Parts
Each type then contains documents that correspond to that type (e.g. a Subaru Imprezza doc lives inside of the Cars type. This doc contains all the details about that particular car).
Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]
So to retrieve the Subaru document, I may do this:
$ curl -XGET localhost:9200/SubaruFactory/Cars/SubaruImprezza
Indices for Logging
Now, the reality is that Indices/Types are much more flexible than the Database/Table abstractions we are used to in RDBMSs. They can be considered convenient data organization mechanisms, with added performance benefits depending on how you set up your data.
To demonstrate a radically different approach, a lot of people use ElasticSearch for logging. A standard format is to assign a new index for each day. Your list of indices may look like this:
logs-2013-02-22
logs-2013-02-21
logs-2013-02-20
ElasticSearch allows you to query multiple indices at the same time, so it isn't a problem to do:
$ curl -XGET 'localhost:9200/logs-2013-02-22,logs-2013-02-21/Errors/_search?q=%22Error%20Message%22'
This searches the logs from the last two days at the same time. This format has advantages due to the nature of logs: most logs are never looked at, and they are organized in a linear flow of time. Making an index per day is more logical and offers better performance for searching.
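On the Java side, the list of daily indices to query can be generated rather than typed by hand. The sketch below assumes the `logs-yyyy-MM-dd` naming shown above; the `DailyIndices` class and its method names are hypothetical, and the type segment of the URL is left out for brevity.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for index-per-day naming.
final class DailyIndices {
    // Returns the index names for the last `days` days, newest first.
    static List<String> lastDays(String prefix, LocalDate today, int days) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < days; i++) {
            // LocalDate.toString() yields the ISO form yyyy-MM-dd.
            names.add(prefix + "-" + today.minusDays(i));
        }
        return names;
    }

    // Builds the path of a multi-index search request: comma-separated index names.
    static String searchPath(List<String> indices) {
        return "/" + String.join(",", indices) + "/_search";
    }
}
```

Calling `DailyIndices.searchPath(DailyIndices.lastDays("logs", LocalDate.of(2013, 2, 22), 2))` produces the same index list as the curl command above.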
Indices for Users
Another radically different approach is to create an index per user. Imagine you have some social networking site, and each user has a large amount of random data. You can create a single index for each user. Your structure may look like:
Zach's Index
    Hobbies Type
    Friends Type
    Pictures Type
Fred's Index
    Hobbies Type
    Friends Type
    Pictures Type
Notice how this setup could easily be done in a traditional RDBMS fashion (e.g. a "Users" Index, with hobbies/friends/pictures as types). All users would then be thrown into a single, giant index.
Instead, it sometimes makes sense to split data apart for data organization and performance reasons. In this scenario, we are assuming each user has a lot of data, and we want them separate. ElasticSearch has no problem letting us create an index per user.

@Zach's answer is valid for Elasticsearch 5.X and below. Since Elasticsearch 6.X, types have been deprecated, and they will be completely removed in 7.X. Quoting the Elasticsearch docs:
Initially, we spoke about an “index” being similar to a “database” in an SQL database, and a “type” being equivalent to a “table”.
This was a bad analogy that led to incorrect assumptions.
To explain further: two columns with the same name in two different SQL tables can be independent of each other. In an Elasticsearch index that is not possible, since fields with the same name are backed by the same Lucene field. Thus, an "index" in Elasticsearch is not quite the same as a "database" in SQL. If different types in one index share a field name, their field types can end up conflicting. To avoid this, the Elasticsearch documentation recommends one index per document type.
Refer: Removal of mapping types

An index is a data structure for storing the mapping of fields to the corresponding documents. The objective is to allow faster searches, often at the expense of increased memory usage and preprocessing time.
The number of indexes you create is a design decision that you should make according to your application requirements. You can have an index for each business concept... You can have an index for each month of the year...
You should invest some time in getting acquainted with Lucene and Elasticsearch concepts.
Take a look at the introductory video and at this one on data design patterns.

The answer above is very detailed; in short, an index can be defined as follows:
Index: a collection of different types of documents and document properties. An index also uses the concept of shards to improve performance. For example, a set of documents contains the data of a social networking application.
Answer from tutorialpoints.com
Since an index is a collection of different types of documents, the answer to the question depends on how you want to categorize your data.
Do you have one index named manufacturer?
Yes, we would keep one document with the manufacturer data.
Do you have one index for people, one for cars and a third for spare parts? Could someone explain?
Think of a car model that the same manufacturer sells to many people who drive it on the road. There could be many indices, depending on the number of uses.
If we think about it carefully, we find that all the questions except the first are invalid.
Elasticsearch documents are quite different from SQL rows, CSV files, or spreadsheets: from one index, with a powerful query language, you can derive millions of differently categorized sets of documents in CSV style.
Because Elasticsearch is so fast at indexed lookups, we create only one index per customer, and from it we derive many types of documents as needed.
For example: all old people using the same model, or one old person using all models.
The permutations are infinite.

Lucene indexing strategy for documents that change often

I'm integrating search functionality into a desktop application and I'm using vanilla Lucene to do so. The application handles (potentially thousands of) POJOs, each with its own set of key/value(s) properties. When mapping models between my application and Lucene, I originally thought of assigning each POJO a Document and adding the properties as Fields. This approach works great as far as indexing and searching go, but the main downside is that whenever a POJO changes its properties I have to reindex ALL the properties again, even the ones that didn't change, in order to update the index. I have been thinking of changing my approach and instead creating a Document per property and assigning the same id to all the Documents from the same POJO. This way, when a POJO property changes, I only update its corresponding Document without reindexing all the other unchanged properties. I think that the graph db Neo4J follows a similar approach when it comes to indexing, but I'm not completely sure. Could anyone comment on the possible impact on performance, querying, etc.?

It depends fundamentally on what you want to return as a Document in a search result.
But indexing is pretty cheap. Does a changed POJO really have so many properties that reindexing them all is a major problem?

If you only search one field in every search request, splitting one POJO into several documents will speed up reindexing. But it causes another problem: if you search on multiple fields, a POJO may appear many times in the results.
Actually, I agree with EJP: building an index is very fast on a small dataset.
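The duplicate problem can be seen in a small plain-Java model of the document-per-property layout. This is only an illustration and does not use the Lucene API; the `PerPropertyDocs` and `Doc` names are invented, and matching is a simple substring test.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: one "document" per POJO property.
final class PerPropertyDocs {
    static final class Doc {
        final String pojoId;   // shared by all documents of the same POJO
        final String property;
        final String value;

        Doc(String pojoId, String property, String value) {
            this.pojoId = pojoId;
            this.property = property;
            this.value = value;
        }
    }

    // Raw hits: every property document whose value contains the term.
    static List<Doc> search(List<Doc> index, String term) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : index) {
            if (d.value.contains(term)) {
                hits.add(d);
            }
        }
        return hits;
    }

    // Collapses raw hits to distinct POJO ids, keeping first-seen order.
    static Set<String> distinctPojos(List<Doc> hits) {
        Set<String> ids = new LinkedHashSet<>();
        for (Doc d : hits) {
            ids.add(d.pojoId);
        }
        return ids;
    }
}
```

A POJO whose title and notes both contain the search term yields two raw hits, so the application needs an extra collapsing step like `distinctPojos` that the document-per-POJO layout would not require.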

Keeping query statistics using lucene

I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).
To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.
But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?
Thanks for the help.

"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"
You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms, and the index itself. So, it might not be a large overhead to store each query as a unique Document...this way you will not be throwing away any information.

First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:
Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries. Build query statistics using post-processing. As this is a web application, you can probably use the logs of a servlet container, such as Tomcat, for this.
Second, Auto-Suggest From Popular Queries Using EdgeNGrams suggests an alternative implementation of query suggestion using Solr.
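For the query-log option, the post-processing step could be sketched roughly as follows: count how often each normalized query occurs, and count how often two terms appear in the same query. The `QueryStats` class, the whitespace tokenization, and the `a|b` pair key are assumptions made for the example; a real setup would read the queries from the log file and might reuse the analyzer of the search index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical post-processor for a query log.
final class QueryStats {
    // normalized query -> number of times it was issued
    final Map<String, Integer> queryCounts = new HashMap<>();
    // "termA|termB" (terms in sorted order) -> number of queries containing both
    final Map<String, Integer> pairCounts = new HashMap<>();

    void record(String query) {
        String normalized = query.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return;
        }
        queryCounts.merge(normalized, 1, Integer::sum);
        // Distinct terms in sorted order, so each pair has exactly one key.
        String[] terms = new TreeSet<>(Arrays.asList(normalized.split("\\s+")))
                .toArray(new String[0]);
        for (int i = 0; i < terms.length; i++) {
            for (int j = i + 1; j < terms.length; j++) {
                pairCounts.merge(terms[i] + "|" + terms[j], 1, Integer::sum);
            }
        }
    }
}
```

Sorting `queryCounts` by value gives the most often used queries, and the highest entries in `pairCounts` for a given term are candidates for suggesting related terms.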
