My question is: in which situations should we choose MongoDB, and in which should we choose Elasticsearch?
If you want to search for a whole word that you know appears in your data, MongoDB's text search is enough. But if you need partial (prefix or substring) matching, go for Elasticsearch.
Example: suppose you create a text index on some fields of your documents. MongoDB's text search matches whole (stemmed) words. If a document's test field contains "I am testing it", a text search for "testing" will return that document. But a search for a partial fragment like "testi" will return no data, because it is not a complete word.
In Elasticsearch, by contrast, the index can be configured (for example with edge n-grams) so that even partial inputs such as "tes" or "testi" return the document.
Reference: http://blog.mpayetta.com/elasticsearch/mongodb/2016/08/04/full-text-indexing-with-elastic-search-and-mongodb/
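To make the distinction concrete, here is a small, purely illustrative Java sketch contrasting whole-token matching (roughly what a plain text index gives you) with prefix matching (roughly what an edge-n-gram analyzer enables). The method names and tokenization are made up for illustration; this is not MongoDB or Elasticsearch API code.

```java
import java.util.Arrays;
import java.util.List;

public class MatchModes {
    // Whole-token match: the query must equal one of the indexed tokens.
    static boolean wholeTokenMatch(String text, String query) {
        List<String> tokens = Arrays.asList(text.toLowerCase().split("\\W+"));
        return tokens.contains(query.toLowerCase());
    }

    // Prefix match: the query only needs to be a prefix of some token,
    // which is roughly what an edge-n-gram analyzer makes possible.
    static boolean prefixMatch(String text, String query) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.startsWith(query.toLowerCase())) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "I am testing it";
        System.out.println(wholeTokenMatch(doc, "testing")); // true
        System.out.println(wholeTokenMatch(doc, "testi"));   // false
        System.out.println(prefixMatch(doc, "testi"));       // true
    }
}
```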
Related
I have a vocabulary file with different words and information about them. It's about 100 MB in size, and searching this file takes a very long time. Is there any way to improve the speed at which I can look up the data? For example, I was thinking of writing a program that splits the text file into 26 files (by the first letter of the word); the program would then only need to check the first letter of the given word and search a much smaller file. Would this improve the execution time? Are there any efficient data structures I could store the file in, like JSON, for example? Also, what about databases? I'm using Kotlin/Java.
Edit: So far, I've just brute-force searched the entire file until I found a match. But, as I said, the file is over 100 MB, and execution takes about 5 seconds when searching for just one word. In the future, I want the program to look up 100 words in milliseconds, the way text editors like Word check words against their dictionaries.
Perhaps save the map (key = word, value = information about word) in a JSON file. Then, you can load the JSON in the program, extract the HashMap, and find the word you want (since hash lookups are very fast).
It depends on the available memory. If the whole vocabulary fits in memory without degrading performance, then a HashMap (if each word has an associated value) or a HashSet (if it does not) is optimized precisely for fast lookups. If keeping everything in memory is not an option, you could use a database with an index on the word column. Apache Derby is a lightweight database with a nice Java interface, but HSQLDB, H2 and SQLite are good choices too.
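A minimal sketch of the in-memory HashMap approach described above, assuming the vocabulary has already been parsed into word/information pairs (the sample entries here are made up):

```java
import java.util.HashMap;
import java.util.Map;

public class Vocabulary {
    private final Map<String, String> entries = new HashMap<>();

    void add(String word, String info) {
        entries.put(word, info);
    }

    // O(1) expected time per lookup, regardless of vocabulary size.
    String lookup(String word) {
        return entries.get(word); // null if the word is absent
    }

    public static void main(String[] args) {
        Vocabulary v = new Vocabulary();
        v.add("apple", "a fruit");
        v.add("arrow", "a projectile");
        System.out.println(v.lookup("apple")); // prints "a fruit"
    }
}
```

In practice the `add` calls would be driven by a parser reading the 100 MB file once at startup; after that, each of the 100 lookups is a constant-time hash probe.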
There are multiple ways to achieve this:
Load the data into a relational database (MySQL, Postgres, etc.) with one column for the word and other columns for the information about it. Add an index on the word column. This covers the case where your dataset later grows beyond the available memory
Load the data into an in-memory hash table with the word as the key and the information about it as the value
If you want to write your own logic, load the data into a list, sort it by word, and perform binary search
You can use text search databases like ElasticSearch or Apache Solr
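The sorted-list option above can be sketched with the standard library; `Collections.binarySearch` requires the list to be sorted first (the sample words are made up):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedLookup {
    // Answer each lookup in O(log n); the list must already be sorted.
    static boolean contains(List<String> sortedWords, String word) {
        return Collections.binarySearch(sortedWords, word) >= 0;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(List.of("zebra", "apple", "mango"));
        Collections.sort(words);                      // sort once, up front
        System.out.println(contains(words, "mango")); // true
        System.out.println(contains(words, "pear"));  // false
    }
}
```

Sorting costs O(n log n) once; every subsequent lookup is O(log n), which is what makes this competitive with a database index for read-only data.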
Currently you have a file, and you search it character by character, word by word
Assuming the file contains n words, a full "scan" takes n * time_for_one_word_check
Assuming time_for_one_word_check is constant, we can focus on n
Searching a sorted list of words with binary search (or some variant of it) takes at most about log2(n) checks
This means that for n = 10, a full scan costs 10 checks while binary search costs about 3 or 4
For n = 1,000,000, a full scan costs 1,000,000 checks while binary search costs only about 20
So: sort the data, save it sorted, then search the sorted data
This can be done in multiple ways
Saving the data in a sorted format
You can either save the data to a single file or have a database manage saving, indexing and querying this data
You should choose a database if your data will grow or gain complexity later, or if you want to be able to index (and look up) both the words and their information
You should choose a simple file if the volume and complexity of the data are not expected to increase
There are different file formats; I suggest saving the data as JSON where the keys are the sorted words and the values are their descriptions (this lets you search only through the keys)
Load this data once on application startup into an immutable Map implementation variable
Query that variable every time you need to perform a search
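The last two steps, loading once into an immutable map and querying it afterwards, can be sketched like this (the loading step is a stub with made-up entries; in practice it would parse your JSON file):

```java
import java.util.Map;

public class WordStore {
    // Loaded once at application startup; Map.copyOf returns an immutable copy.
    private static final Map<String, String> WORDS = Map.copyOf(loadWords());

    // Stand-in for parsing the real vocabulary file.
    private static Map<String, String> loadWords() {
        return Map.of("test", "a trial", "word", "a unit of language");
    }

    // Query the immutable map on every search.
    static String describe(String word) {
        return WORDS.getOrDefault(word, "unknown");
    }

    public static void main(String[] args) {
        System.out.println(describe("test")); // prints "a trial"
    }
}
```

Because the map is immutable and built once, it is safe to query from multiple threads without locking.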
Helpful research keywords
binary search
table scan and index
Also, what about databases?
You can use an index if your table is big and you don't want to search through all its rows. When you create an index on a table, the DBMS usually builds a B-tree. B-trees are useful for storing large amounts of data when you need exact or range searches. Check this post link and the MySQL reference link. If you want to learn how to implement a structure like a B-tree or B+-tree, you can use this book link. It contains implementations of structures used for searching data; it doesn't cover B-trees, but the author is the creator of red-black trees (B-trees are a generalization of them). There is also something here link.
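As a small illustration of tree-based lookup without installing a database: Java's `TreeMap` is itself backed by a red-black tree and supports both exact lookups and range queries, which is the same capability a B-tree index provides inside a DBMS (the dictionary entries are made up):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class TreeLookup {
    // Build a small dictionary backed by a red-black tree.
    static TreeMap<String, String> dict() {
        TreeMap<String, String> d = new TreeMap<>();
        d.put("ant", "an insect");
        d.put("apple", "a fruit");
        d.put("bear", "an animal");
        return d;
    }

    public static void main(String[] args) {
        TreeMap<String, String> dict = dict();
        // Exact lookup in O(log n).
        System.out.println(dict.get("ant"));        // an insect
        // Range search over all keys in [a, b), like a B-tree range scan.
        SortedMap<String, String> range = dict.subMap("a", "b");
        System.out.println(range.keySet());         // [ant, apple]
    }
}
```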
I am using Hibernate Search and Lucene for full-text search on the content field of my document database. I have a search text box that takes the user's query, and I have currently fixed the search to phrase matching. I want to support a combination of search types. To explain: say the user searches for "United States". With phrase-based search I get every occurrence of the exact phrase, ignoring individual occurrences of "United" and "States". With field (term) matching I get all results containing the individual query words. My question: is there any direct way to make Hibernate Search apply phrase-based search when the user wraps the query in quotation marks (or some other marker), return word-based results otherwise, apply a Boolean search when the user separates two query words with a Boolean operator, and so on? For example:
Example Query | Description
United States | Search for all occurrences of two words: United and States
"United States" | Search for phrase "United States"
United NOT States | Apply Boolean not query on United and States
etc
I want to implement something like Google. I know Google is far more powerful, but at least a little bit of it can be done. I just want to know whether there is any built-in functionality in Hibernate Search and Lucene for this kind of thing, or whether I need to give the user a set of operators, parse the user's query manually, detect the operators and other symbols, and then build the query based on what I find. Kindly help.
There is nothing like that directly in Hibernate Search, but Lucene has a query parser. For its syntax, have a look at http://lucene.apache.org/core/4_10_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description.
If you are happy with its functionality and syntax you could just pass the user input to the Lucene query parser. If not, you will need to write your own syntax and syntax parser which will translate the query into an appropriate Hibernate Search / Lucene query.
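If you do decide to roll your own syntax, the detection step can be quite small. Below is a hypothetical sketch that classifies the raw input before you build the corresponding Hibernate Search / Lucene query; the classification rules mirror the table in the question and are not part of any Lucene API:

```java
public class QueryClassifier {
    enum Kind { PHRASE, BOOLEAN, TERMS }

    static Kind classify(String input) {
        String q = input.trim();
        if (q.startsWith("\"") && q.endsWith("\"") && q.length() > 1) {
            return Kind.PHRASE;            // "United States" -> phrase query
        }
        if (q.matches(".*\\b(AND|OR|NOT)\\b.*")) {
            return Kind.BOOLEAN;           // United NOT States -> boolean query
        }
        return Kind.TERMS;                 // United States -> per-word query
    }

    public static void main(String[] args) {
        System.out.println(classify("\"United States\"")); // PHRASE
        System.out.println(classify("United NOT States")); // BOOLEAN
        System.out.println(classify("United States"));     // TERMS
    }
}
```

Each branch would then dispatch to the appropriate query builder (phrase, boolean, or keyword). Lucene's own QueryParser already implements this syntax and more, so this only makes sense if you want a deliberately restricted grammar.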
I have stored a set of JSON objects in a Lucene index and also want to retrieve them from it. I am using Lucene 3.4.
Is there any library or easy mechanism in Lucene to make this happen?
Sample JSON object:
{
  "BOOKNAME1": {
    "id": 1,
    "name": "bname1",
    "price": "p1"
  },
  "BOOKNAME2": {
    "id": 2,
    "name": "bname2",
    "price": "p2"
  },
  "BOOKNAME3": {
    "id": 3,
    "name": "bname3",
    "price": "p3"
  }
}
Any sort of help will be appreciated.
Thanks in advance,
I would recommend indexing your JSON object like this:
1) Parse your JSON file. I usually use json-simple.
2) Open an IndexWriter using an IndexWriterConfig
3) Add documents to the index
4) Commit the changes and close the index
5) Run your queries
If you would like to use Lucene Core instead of Elasticsearch, I have created a sample project that takes a file with JSON objects as input and builds an index. I have also added a test that queries the index.
I am using the latest Lucene version (4.8); please have a look here:
http://ignaciosuay.com/getting-started-with-lucene-and-json-indexing/
If you have time, I think it is worth reading "Lucene in Action".
Hope it helps.
If you don't want to search within the JSON but only store it, you just need to extract the id, which should be unique. Then your Lucene document would have two fields:
the id (indexed, not necessarily stored)
the json itself, as it is (only stored)
Once you have stored your JSON in Lucene, you can retrieve it by filtering on the id.
On the other hand this is pretty much what elasticsearch does with your documents. You just send some json to it via a REST api. elasticsearch will keep the json as it is and also make it searchable by default. That means you can either retrieve the json by id or search against it, out of the box without having to write any code.
Also, with lucene your documents wouldn't be visible to searches until you commit them or reopen the index reader, while elasticsearch adds a handy transaction log to it, so that a GET by id is always real time.
Also, elasticsearch offers a lot more: a nice distributed infrastructure, faceting, scripting and more. Check it out!
I am working with Lucene and a Derby database. Lucene contains the text index, and Derby holds additional user data. For example, each document has a tag; for this purpose the Derby database has two tables:
TAGS:
ID
Name
LUCENETAGS:
ID
LUCENEID (docID in Lucene, not a field)
TAGID
I want a user to be able to search something like:
very interesting text AND tag:fun
Changing the structure so that the tag becomes a Lucene field is not an option.
Thank you!
I believe you'll have to simply perform your text search in Lucene and then filter the results based on a query against Derby.
If few documents match a particular tag, you could instead query the database first for the matching IDs and rewrite the query like:
(very interesting text) AND id:(1 2 3 etc.)
This is probably not feasible in general, but if tags are sparse it might be worth considering.
I do wonder, though, why a field can't be added to the index, duplicating the value stored in the Derby database. Any implementation built on your stated structure will perform much worse, and be more complex for you to maintain, than one where the data is available in the index as well.
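The post-filtering approach can be sketched in plain Java. Here the Lucene hits and the Derby result set are stood in by simple collections; in a real application the ID set would come from a SQL query against LUCENETAGS (e.g. selecting LUCENEID for the wanted TAGID):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TagFilter {
    // Keep only the Lucene doc IDs whose documents carry the wanted tag.
    static List<Integer> filterByTag(List<Integer> luceneHits, Set<Integer> taggedIds) {
        return luceneHits.stream()
                .filter(taggedIds::contains)   // drop hits without the tag
                .collect(Collectors.toList()); // preserves Lucene's ranking order
    }

    public static void main(String[] args) {
        List<Integer> hits = List.of(1, 2, 3, 4); // pretend Lucene results, ranked
        Set<Integer> fun = Set.of(2, 4, 9);       // pretend IDs tagged 'fun' from Derby
        System.out.println(filterByTag(hits, fun)); // [2, 4]
    }
}
```

One caveat of this approach: if the filter discards most hits, you may need to fetch more results from Lucene than the page size you intend to show.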
I have a SolR index where each record is a page from a file. So for every record we have the full text, the page number and the file ID.
When we do a search, a single file often overwhelms the results because it contains the search term repeatedly.
What I would like to do is to have the search query only return a maximum of two hits per document and then offer the user a "see more hits from this document" which would do another, more limited query. I.e. similar to how Google will only show you a handful of results from any given domain, with the option of seeing more from each.
Is there any way to structure a Solr query to accomplish this?
Which Solr version are you using? If it's 4.0 (i.e. a nightly build at the moment), then you can use field collapsing (result grouping) on the filename field.
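With result grouping enabled, the request parameters would look roughly like the following. This is a sketch: `fileId` stands in for whatever your file-ID field is actually called, and `group.limit=2` caps the hits shown per file at two, matching the "two hits per document" requirement.

```
q=searchterm&group=true&group.field=fileId&group.limit=2
```

The "see more hits from this document" link can then issue a follow-up query filtered to that one file, e.g. with `fq=fileId:<id>`, without grouping.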