Java: from Lucene Hits to original objects

Java: from Lucene Hits to original objects - java

I'd like to implement a filter/search feature in my application using Lucene.
Querying Lucene index gives me a Hits instance, which is nothing more than a list of Documents matching my criteria.
Since I generate the indexed Documents from my objects, which is the best way to find the original object related to a specific Lucene Document?
A better description of my situation:
Three model classes for now: Folder (can have other Folders or
Lists as children), List (can have Tasks as children) and
Task (can have other Tasks as children). They are all
DefaultMutableTreeNode subclasses. I'll add the Tag entity in the
future.
Each Task has a text, a start date, a due date, some boolean flags.
They are displayed in a JTree.
The hole tree is saved in an XML file.
I'd like to do things like these:
search Tasks with Google-like queries.
Find all Tasks that start today.
Filter Tasks by Tag.

You can't, not with vanilla Lucene. You said yourself that you converted your objects into Documents and then stored the Documents in Lucene, how would you imagine that process would be reversible?
If you want to store and retrieve your own objects in Lucene, I strongly recommend that you use Compass instead. Compass is to Lucene what Hibernate is to JDBC - you define a mapping between your objects and Lucene documents, Compass takes care of the conversion.

Add a "stored" field that contains an object identifier. For each hit, lookup the original object via the identifier.
Without knowing more context, it's hard to be more specific.

Related

Data structure for fast searching of custom object using its attributes (fields) in Java

I have abstract super class and some sub classes. My question is how is the best way to keep objects of those classes so I can easily find them using all the different parameters.
For example if I want to look up with resourceCode (every object is with unique resource code) I can use HashMap with key value resourceCode. But what happens if I want to look up with genre - there are many games with the same genre so I will get all those games. My first idea was with ArrayList of those objects, but isn’t it too slow if we have 1 000 000 games (about 1 000 000 operations).
My other idea is to have a HashTable with key value the product code. Complexity of the search is constant. After that I create that many HashSets as I have fields in the classes and for each field I get the productCode/product Codes of the objects, that are in the HashSet under that certain filed (for example game promoter). With those unique codes I can get everything I want from the HashTable. Is this a good idea? It seems there will be needed a lot of space for the date to be stored, but it will be fast.
So my question is what Data Structure should I use so I can implement fast finding of custom object, searching by its attributes (fields)
Please see the attachment: Classes Example
Thank you in advanced.
Stefan Stefanov

You can use Sorted or Ordered data structures to optimize search complexity.
You can introduce your own search index for custom data.
But it is better to use database or search engine.
Have a look at Elasticsearch, Apache Solr, PostgreSQL

It sounds like most of your fields can be mapped to a string (name, genre, promoter, description, year of release, ...). You could put all these strings in a single large index that maps each keyword to all objects that contain the word in any of their fields. Then if you search for certain keywords it will return a list of all entries that contain that word. For example searching for 'mine' should return 'minecraft' (because of title), as well as all mine craft clones (having 'minecraft-like' as genre) and all games that use the word 'mine' in the 'info text' field.
You can code this yourself, but I suppose some fulltext indexer, such as Lucene may be useful. I haven't used Lucene myself, but I suppose it would also allow you to search for multiple keyword at once, even if they occur in different fields.

This is not a very appealing answer.
Start with a database. Maybe an embedded database (like h2database).
Easy set of fixed develop/test data; can be easily changed. (The database dump.)
. Too many indices (hash maps) harm
Developing and optimizing queries is easier (declarative) than with data structures
Database tables are less coupled than data structures with help structures (maps)
The resulting system is far less complex and better scalable
After development has stabilized the set of queries, you can think of doing away of the DB part. Use at least a two tier separation of database and the classes.
Then you might find a stable and best fitting data model.
Should you still intend to do it all with pure objects, then work them out in detail as design documentation before you start programming. Example stories, and how one solves them.

Should I use multiple Lucene directories/indexes to search different types of data?

I have many MySQL tables to store different types of data like goods, catagories, brands, suppliers, etc. Each of them needs to implement full-text search via Lucene.
So I plan to build one Lucene Directory (and one IndexWriter + one IndexReader corresponding to this Directory) for each table, e.g.
HashMap<String, Directory> = ...;
put("goods", FSDirectory.open(luceneDirRoot + "/goods"));
put("catagories", FSDirectory.open(luceneDirRoot + "/catagories"));
...
Is this a good practice to use Lucene?
Furthur more, how can I know how many directories I made by Lucene, like MySQL command "SHOW TABLES"? new File(luceneDirRoot).listFiles() can be a choice but I am not sure whether there are other non-Lucene folders.

I would implement one Lucene index pro MySQL table provided you do not need to perform search over several tables. Alternative would be to write everything into one index and add table name into each lucene document, that way you could limit the search to particular table.
AFAIK Lucene does not support SHOW TABLES equivalent the way you desire it, but you might easily do that by yourself, e.g. by using naming convention for the directories.
I would recommend to look at Hibernate Search, this is a good match for your needs, it builds one index directory pro table and allows you to perform full text search while handling the low-level lucene issues for you. You just configure the index by annotating the JPA entities corresponding to your tables and have to implement the full text queries. This is much easier then doing naked Lucene with data from MySQL on your own, Hibernate Search builds the index for you and integrates well with data from relational DB such as MySQL.

ElasticSearch: Secondary indecies on field values using Java-API

I'm considering to use ElasticSearch as a search engine for large objects. There are about 500 millions objects on a single machine. For far is Elasticsearch a good solution for executing advanced queries. But a have the problem that i did find any technique to create secondary index on the document fields. Is in elasticsearch a possibility to create a secondary indecies like in MySQL on columns? Or are there any other technologies implemented to accelerate searches on field values? I'm using an single server enviroment and I have to store about 300 fields per row/object. At the moment there are about 500 million object in my database.

I apologize in advance it I don't understand the question. Elasticsearch is itself an index based technology (it's built on top of Lucene which is a build for index based search). You put documents into Elasticsearch and the individual fields on those documents are indexed and searchable. You should not have to worry about creating secondary indexes; the fields will be indexed by default (in most cases).
One of the differences between Elasticsearch and Solr is that in Solr, you have to specify a schema defining what the fields are on the documents and whether that field will be indexed (available to search against), stored (available as the result of a search) or both. Elasticsearch does not require an upfront schema, and in lieu of specific mappings for fields, then reasonable defaults are used instead. I believe that the core fields (string, number, etc..._) are indexed by default, meaning they are available to search against.
Now in your case, you have a document with a lot of fields on it. You will probably need to tweak the mappings a bit to only index the fields that you know you might search against. If you index too much, the size of the index itself will balloon and will not be as fast as if you had a trim index of only the fields you know you will search against. Also, Lucene loads parts of the index into memory to really enable fast searches. With a bloated index, you won't be able to keep as much stuff in memory and your searches will suffer as a result. You should look at the Mappings API and the Core Types section for more info on how to do this.

What is an index in Elasticsearch

What is an index in Elasticsearch? Does one application have multiple indexes or just one?
Let's say you built a system for some car manufacturer. It deals with people, cars, spare parts, etc. Do you have one index named manufacturer, or do you have one index for people, one for cars and a third for spare parts? Could someone explain?

Good question, and the answer is a lot more nuanced than one might expect. You can use indices for several different purposes.
Indices for Relations
The easiest and most familiar layout clones what you would expect from a relational database. You can (very roughly) think of an index like a database.
MySQL => Databases => Tables => Rows/Columns
ElasticSearch => Indices => Types => Documents with Properties
An ElasticSearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns).
So in your car manufacturing scenario, you may have a SubaruFactory index. Within this index, you have three different types:
People
Cars
Spare_Parts
Each type then contains documents that correspond to that type (e.g. a Subaru Imprezza doc lives inside of the Cars type. This doc contains all the details about that particular car).
Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]
So to retrieve the Subaru document, I may do this:
$ curl -XGET localhost:9200/SubaruFactory/Cars/SubaruImprezza
.
Indices for Logging
Now, the reality is that Indices/Types are much more flexible than the Database/Table abstractions we are used to in RDBMs. They can be considered convenient data organization mechanisms, with added performance benefits depending on how you set up your data.
To demonstrate a radically different approach, a lot of people use ElasticSearch for logging. A standard format is to assign a new index for each day. Your list of indices may look like this:
logs-2013-02-22
logs-2013-02-21
logs-2013-02-20
ElasticSearch allows you to query multiple indices at the same time, so it isn't a problem to do:
$ curl -XGET localhost:9200/logs-2013-02-22,logs-2013-02-21/Errors/_search=q:"Error Message"
Which searches the logs from the last two days at the same time. This format has advantages due to the nature of logs - most logs are never looked at and they are organized in a linear flow of time. Making an index per log is more logical and offers better performance for searching.
.
Indices for Users
Another radically different approach is to create an index per user. Imagine you have some social networking site, and each users has a large amount of random data. You can create a single index for each user. Your structure may look like:
Zach's Index
Hobbies Type
Friends Type
Pictures Type
Fred's Index
Hobbies Type
Friends Type
Pictures Type
Notice how this setup could easily be done in a traditional RDBM fashion (e.g. "Users" Index, with hobbies/friends/pictures as types). All users would then be thrown into a single, giant index.
Instead, it sometimes makes sense to split data apart for data organization and performance reasons. In this scenario, we are assuming each user has a lot of data, and we want them separate. ElasticSearch has no problem letting us create an index per user.

#Zach's answer is valid for elasticsearch 5.X and below. Since elasticsearch 6.X Type has been deprecated and will be completely removed in 7.X. Quoting the elasticsearch docs:
Initially, we spoke about an “index” being similar to a “database” in an SQL database, and a “type” being equivalent to a “table”.
This was a bad analogy that led to incorrect assumptions.
Further to explain, two columns with the same name in SQL from two different tables can be independent of each other. But in an elasticsearch index that is not possible since they are backed by the same Lucene field. Thus, "index" in elasticsearch is not quite same as a "database" in SQL. If there are any same fields in an index they will end up having conflicts of field types. To avoid this the elasticsearch documentation recommends storing index per document type.
Refer: Removal of mapping types

An index is a data structure for storing the mapping of fields to the corresponding documents. The objective is to allow faster searches, often at the expense of increased memory usage and preprocessing time.
The number of indexes you create is a design decision that you should take according to your application requirements. You can have an index for each business concept... You can an index for each month of the year...
You should invest some time getting acquainted with lucene and elasticsearch concepts.
Take a look at the introductory video and to this one with some data design patterns

Above one is too detailed in very short it could be defined as
Index: It is a collection of different type of documents and document properties. Index also uses the concept of shards to improve the performance. For example, a set of document contains data of a social networking application.
Answer from tutorialpoints.com
Since index is collection of different type of documents as per question depends how you want to categorize.
Do you have one index named manufacturer?
Yes , we will keep one document with manufacturer thing.
do you have one index for people, one for cars and a third for spare parts? Could someone explain?
Think of instance car given by same manufacturer to many people driving it on road .So there could be many indices depending upon number of use.
If we think deeply we will found except first question all are invalid ones.
Elastic-search documents are much different that SQL docs or csv or spreadsheet docs ,from one indices and by good powerful query language you can create millions type of data categorised documents in CSV style.
Due to its blazingly fast and indexed capability we create one index only for one customer , from that we create many type of documnets as per our need .
For example:
All old people using same model.Or One Old people using all model .
Permutation is inifinite.

Lucene indexing strategy for documents that change often

I'm integrating search functionality into a desktop application and I'm using vanilla Lucene to do so. The application handles (potentially thousands) of POJOs each with its own set of key/value(s) properties. When mapping models between my application and Lucene I originally thought of assigning each POJO a Document and add the properties as Fields. This approach works great as far as indexing and searching goes but the main downside is that whenever a POJO changes its properties I have to reindex ALL the properties again, even the ones that didn't change, in order to update the index. I have been thinking of changing my approach and instead create a Document per property and assign the same id to all the Documents from the same POJO. This way when a POJO property changes I only update its corresponding Document without reindexing all the other unchanged properties. I think that the graph db Neo4J follows a similar approach when comes to indexing, but I'm not completely sure. Could anyone comment on possible impact on performance, querying, etc?

It depends fundamentally on what you want to return as a Document in a search result.
But indexing is pretty cheap. Does a changed POJO really have so many properties that reindexing them all is a major problem?

If you only search one field in every search request, splitting one POJO to several documents will speed up reindexing. But it will cause another problem if search one multiple fields, a POJO may appear many times.
Actually, I agree with EJP, building index is very fast in small dataset.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.