Customising the search algorithm of Elasticsearch - java

I originally tried posting a similar post to the elasticsearch mailing list (https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/BZLFJSEpl78) but didn't get any helpful responses, so I thought I'd give Stack Overflow a try. This is my first post on SO, so apologies if it doesn't quite fit the mould it is meant to.
I'm currently working with a university helping them to implement a test suite to further refine some research they have been conducting. Their research is based around dynamic schema searching. After spending some time evaluating the various open source search solutions I settled on elasticsearch as the base platform and I am wondering what the best way to proceed would be. I have spent about a week looking into the elasticsearch documentation and the code itself and also reading the documentation of Lucene but I am struggling to see a clear way forward.
The goal of the project is to provide the researchers with a piece of software they can use to plug in revisions of the searching algorithm to test and refine. They would like to be able to write the pluggable algorithm in languages other than Java that are supported by the JVM, like Groovy, Python or Clojure, but that isn't a hard requirement. Part of that will be to provide them with a front end to run queries and see output, and an admin interface to add documents to an index. I am comfortable with all of that thanks to the very powerful and complete REST API. What I am not so sure about is how to proceed with implementing the pluggable search algorithm.
The researchers' algorithm requires 4 inputs to function:
The query term(s).
A Word (term) x Document matrix across an index.
A Document x Word (term) matrix across an index.
A Word (term) frequency list across an index, i.e. how many times each word appears across the entire index.
For their purposes, a document doesn't correspond to an actual real-world document (they actually call them text events). Rather, for now, it corresponds to one sentence (having that configurable might also be useful). I figure the best way to handle this is to break documents down into their sentences (using Apache Tika or something similar) and put each sentence in as its own document in the index. I am confident I can do this in the admin UI I provide, using the mapper-attachments plugin as a starting point. The downside is that breaking up the document before giving it to elasticsearch isn't a very configurable way of doing it. If they want to change the resolution of their algorithm, they would need to re-add all documents to the index again. If the index stored the full documents as-is and the search algorithm could choose what resolution to work at per query, that would be perfect. I'm not sure whether that is possible, though.
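To illustrate what I mean by sentence-per-document (just a sketch using the JDK's BreakIterator, not the researchers' actual pipeline; Tika would only be used to extract the raw text first), the splitting step could look something like this:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Split extracted text into sentences; each sentence would then be indexed
    // as its own "text event" document in Elasticsearch.
    public static List<String> splitIntoSentences(String text) {
        BreakIterator boundary = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundary.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```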
The next problem is how to get the three index-derived inputs they require and pass them into their pluggable search algorithm. I'm really struggling to work out where to start with this one. It seems from looking at Lucene that I need to provide my own search/query implementation, but I'm not sure if this is right or not. There also don't seem to be any search plugins listed on the elasticsearch site, so I'm not even sure if it is possible. The important things here are that the algorithm needs to operate at the index level, with the query terms available, to generate its schema before using the schema to score each document in the index. From what I can tell, this means that the scripting interface provided by elasticsearch won't be of any use. The description of the scripting interface in the elasticsearch guide makes it sound like a script operates at the document level and not the index level. Other concerns/considerations are the ability to program this algorithm in a range of languages (just like the scripting interface) and the ability to augment what is returned by the REST API for a search to include the schema the algorithm generated (which I assume means I will need to define my own REST endpoint(s)).
Can anybody give me some advice on where to get started here? It seems like I am going to have to write my own search plugin that can accept scripts as its core algorithm. The plugin will be responsible for organising the 4 inputs that I outlined earlier before passing control to the script. It will also be responsible for getting the output from the script and returning it via its own REST API. Does this seem logical? If so, how do I get started with doing this? What parts of the code do I need to look at?

You should store 1 sentence per document if that's how their algorithm works. You can always reindex if they change their model.
Lucene is pretty good at finding matches, so I suspect your co-workers' algorithm will be dealing with scoring. Elasticsearch supports custom scoring scripts; you can pass params to a given scoring script, and you can use Groovy for scripting in ES.
http://www.elasticsearch.org/guide/reference/modules/scripting.html
If your search algorithm needs larger data structures, it does not make sense to pass them as params; instead, you might find it useful to pull them from other data sources inside the scoring script.
For example Redis: http://java.dzone.com/articles/connecting-redis-elasticsearch .
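To make the scripting suggestion concrete, here is a rough sketch (my own, not tied to the asker's algorithm) of sending a scoring script with params over the REST API. The JSON uses the pre-1.0 custom_score query shape from the linked scripting guide; the index name and popularity field are placeholders, so adjust both the shape and the names for your Elasticsearch version.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ScriptScoringExample {
    public static void main(String[] args) throws Exception {
        // custom_score wraps a query and rescores every matching document with the script.
        // 'myindex' and the 'popularity' field are placeholder names.
        String body = "{ \"query\": { \"custom_score\": {"
                + "  \"query\": { \"match_all\": {} },"
                + "  \"script\": \"_score * doc['popularity'].value * factor\","
                + "  \"params\": { \"factor\": 1.5 }"
                + "} } }";

        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9200/myindex/_search").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (Scanner in = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}
```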

Related

Score a query with JaroWinkler in ElasticSearch with Java API

I'm working with ElasticSearch using Java API.
Currently, I'm doing some match query. Now, I would like to calculate the _score value for my queries using the Jaro Winkler distance for strings.
Does ElasticSearch allow you to use other, user-defined scoring functions?
Elasticsearch uses Lucene under the hood for all scoring. Lucene versions before 6.0 use TF/IDF for scoring; from 6.0 onwards the default is the BM25 algorithm.
Elasticsearch allows you to write scripts to modify the scores of the hits you have already obtained from Lucene, but there is no other way of supplying a scoring function that is applied during the initial search. Modifying the scores you get back also has limitations because of result pagination: under your algorithm, a result on the second page might score better than all of the results on the first page.
So the only thing you can really do is write a plugin for elasticsearch/lucene. You should also keep in mind that elasticsearch/lucene use an inverted index, so your results might still not be what you want.
Also, since access to the server is not available, the short answer to your question is no, it can't be done.
The best you can do is ask for a lot of results and then boost them using scripting.
EDIT: After doing some more research, I found that you might be able to do something very similar to what you want using elasticsearch's function score query, with the help of fuzziness. It still wouldn't change how documents are found (you still have to deal with inverted indexes, analyzers, etc.), but you could definitely adjust the scoring of the results. Also look at this
Elasticsearch uses that algorithm for its term suggesters. If you want custom scoring like that, you may need to build a plugin for it, and if you don't have access to the server where you can install the plugin, that might be difficult. Alternatively, if you have a Groovy script implementation, you may be able to do it at search time using scripts.
Quick scan of the web: https://github.com/ucidentity/id-match-engine/blob/master/grails-app/services/dolphin/JaroWinklerDistanceService.groovy
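For reference, Lucene itself ships a Jaro-Winkler implementation in its spellchecker/suggest module (org.apache.lucene.search.spell.JaroWinklerDistance), so one option that needs no server-side plugin is to fetch a generous number of hits and re-rank them client-side. A minimal sketch (the candidate strings and query text are hypothetical):

```java
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.search.spell.JaroWinklerDistance;

public class JaroWinklerRerank {

    // Re-order already-fetched candidate strings by Jaro-Winkler similarity to the query.
    // getDistance() returns a similarity in [0,1], where 1 means identical, so sort descending.
    public static void rerank(List<String> candidates, String queryText) {
        JaroWinklerDistance jw = new JaroWinklerDistance();
        candidates.sort(
                Comparator.comparingDouble((String s) -> jw.getDistance(queryText, s)).reversed());
    }
}
```

As noted above, this only re-orders the results you already have; it does not change which documents Lucene retrieves in the first place.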

Information retrieval in dbpedia using spotlight

I have recently come across dbpedia-spotlight and I want to do information retrieval: I have a set of queries and the DBpedia data, and using information retrieval I need to produce the output. I was not able to understand the documentation, so can you give me some sample code to start working with?
I have tried Terrier, but that was equally difficult.
Terrier is more popular as a research tool, where you can try out various standard IR models against standard test collections, e.g. TREC, ClueWeb etc.
If you want to quickly develop a reasonably functional search system, Lucene is the best thing to try. Go through the "Lucene in 5 minutes" tutorial; I guess it should be fairly simple to use.
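If it helps, a minimal indexing-and-searching example along those lines might look like the following. This is a rough sketch against a recent Lucene release; exact class names such as ByteBuffersDirectory vary between versions, so check the API for the version you use.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneQuickStart {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory(); // in-memory index for the demo

        // Index a couple of documents.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            for (String title : new String[] {"Lucene in Action", "Managing Gigabytes"}) {
                Document doc = new Document();
                doc.add(new TextField("title", title, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Parse a query and print the matching titles with their scores.
        Query query = new QueryParser("title", analyzer).parse("lucene");
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title") + " (score=" + hit.score + ")");
            }
        }
    }
}
```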

Java crawl web and store in cassandra

I have a Java project for which I'd like to use a pre-built web crawler that gives me enough flexibility to control which URLs are crawled, and then, once the crawler has the output, to control where to put it (Cassandra with my own schema).
The big picture is that I want to feed in a list of URLs (Google and Bing searches) and then filter the URLs that are returned. I want it to then crawl the filtered URLs (I may possibly want to change the URL query string, but that's not a hard requirement). I want to take the resulting HTML, parse it using Tika, then pull the data out and store it.
I'm looking at Apache Droids; it seems like a good fit since it appears to do everything I've mentioned, but there isn't any real documentation. I'd consider Nutch or Heritrix, but their use cases seem to be more of a full end-to-end solution, and after skimming the docs I don't see anything that talks about how to do what I want.
Does anyone have any experience with this type of thing? I mostly need some recommendations, but if you know of examples doing this sort of thing that'd be nice as well, since I'm still pretty new to Java.
I wouldn't say Droids is a well-established framework yet. If you compare it to Nutch, which has a lot of history behind it, I would expect it to be less stable and less documented. I have no experience with Droids, though.
As far as storing data in Cassandra goes, I would recommend either Astyanax (https://github.com/Netflix/astyanax) or Hector (https://github.com/hector-client/hector).
I have used Hector extensively in the last year and have found it extremely simple and easy to use. It is faster to develop with Hector than with its predecessors (pure Thrift, Pelops), but Hector is still flexible enough to let you do the nitty-gritty things you would expect from Thrift.
Recently I have also been eyeing Astyanax, as it is developed/supported by a larger team and tested at a larger scale, which is important for my current field of work. However, Hector is usually faster at picking up new features from new Cassandra releases, so both libraries have their benefits.
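To make the fetch-parse-store pipeline concrete, here is a very rough sketch (my own assumptions, not tested code) that fetches one page, extracts its text with Tika's facade class, and writes it to a hypothetical "pages" column family with Hector; the cluster, keyspace and column names are placeholders.

```java
import java.net.URL;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import org.apache.tika.Tika;

public class CrawlAndStore {
    public static void main(String[] args) throws Exception {
        String pageUrl = "http://example.com/";

        // Tika detects the content type and returns the extracted plain text.
        String text = new Tika().parseToString(new URL(pageUrl));

        // Write the extracted text to Cassandra, keyed by URL.
        Cluster cluster = HFactory.getOrCreateCluster("crawler-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("crawl", cluster);
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert(pageUrl, "pages", HFactory.createStringColumn("body", text));
    }
}
```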

Add faceting over multivalued fields to an application using Hibernate Search

We use Hibernate Search in our application, including its faceting support. Recently we have found a big limitation: faceting over fields that can have multiple values doesn't work properly with Hibernate Search - if a document has multiple values for a faceted field (e.g. multiple categories), only one of the values is taken into account.
I can currently think of two solutions:
use bobo-browse (http://code.google.com/p/bobo-browse/)
solr (http://lucene.apache.org/solr/)
In both solutions we would continue to maintain the index using Hibernate Search and make queries as we did before (using Hibernate Search), and run an additional bobo-browse or Solr query for faceting where required (bobo-browse or Solr would use the index in a kind of read-only manner). The problem is that we update the index quite often and would like to get really fresh data in faceting queries. Bobo-browse doesn't automatically integrate with Hibernate Search, and keeping the search up to date might run into some problems (e.g. https://groups.google.com/forum/?fromgroups=#!topic/bobo-browse/sn_Efc-YClU). The documentation looks a bit untidy and not yet complete. Solr, on the other hand, seems like a really big thing to add just to get faceting working properly, and I'm still afraid I might run into problems with updating/refreshing the index.
Do you have any experience in that matter? Any suggestions?
As a Hibernate Search developer, I'd suggest joining us and helping implement what you need.
None of us has actually needed multivalued faceting, so we're not really sure which solution to pick either; it seems you have a real need, which makes you the perfect person to explore the alternatives and try them out.
Hibernate Search already depends on many Solr modules especially because of the large collection of excellent analysers. I'm confident we could find a way to embed the faceting logic of Solr and package it nicely in our consistent API, without the need to actually start Solr in server mode.
I guess we could do the same with Bobo-Browse; I'd prefer Solr so as not to add other dependencies, but if Bobo-Browse proves to be the superior solution, why not.. you can help us make this choice.
What would you get in exchange?
we'll maintain it: it will stay compatible with any future version (hopefully you'll help a bit)
eternal gratitude from other users ;)
rock solid testing from thousands of other users
bugfixes and improvements from ..
a rock star badge on your CV
What is required?
unit tests
documentation updates
sensible code
https://community.jboss.org/wiki/ContributingToHibernateSearch
I also use Bobo Browse in combination with Hibernate Search, and I have the same problem with regular updates and the read-only issue. Bobo is not the easiest library out there; I've looked several times at ways to integrate it with Hibernate Search and just gave up because of the complexity.
I use timed reloads of the index in order to ensure freshness, but that creates a lot of garbage to be collected. Lucene has over time optimized the process of reopening IndexReaders, but the Bobo team is not really focused on supporting that. https://linkedin.jira.com/browse/BOBO-31 describes this issue.
The Hibernate Search infrastructure should provide enough flexibility to integrate. Zoie is a real-time indexing system, like Hibernate Search, that is integrated with Bobo (https://linkedin.jira.com/wiki/display/BOBO/Realtime+Faceting+with+Zoie). Perhaps it can inspire your efforts.
This is something of a solution to the multi-value facet-count problem for hibernate-search.
Blog: http://outbottle.com/hibernate-search-multivalue-facet-counts/
The blog is complete with a Java Class that can be reused to generate facet-counts for single-value and multi-value fields.
The solution provided is based on the BitSet solution provided here: http://sujitpal.blogspot.ie/2007/04/lucene-search-within-search-with.html
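For context, the "search within search" BitSet technique from that post amounts to intersecting the set of documents matching the main query with the set matching each facet value, and taking the size of the intersection as that value's facet count. A stripped-down equivalent directly against the Lucene API (field and value names are hypothetical, and it counts via a boolean conjunction rather than explicit BitSets) might look like this:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;

public class MultiValueFacetCount {

    // Count of documents matching both the main query and one facet value.
    // Because every value of a multi-valued field is indexed as its own term,
    // a document with several categories is counted once for each of them.
    public static int facetCount(Directory index, Query mainQuery, String field, String value)
            throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query combined = new BooleanQuery.Builder()
                    .add(mainQuery, BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term(field, value)), BooleanClause.Occur.MUST)
                    .build();
            return searcher.count(combined);
        }
    }
}
```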
The blog has a Maven project which demonstrates the solution quite comprehensively. The project demonstrates using the hibernate-search faceting API to filter on a date range AND a 1-to-many (single-value) facet group AND a many-to-many (multi-value) facet group combined.
The solution is then invoked to correctly derive facet-counts for each facet-group.
The solution facilitates results similar to this jsFiddle emulation: http://goo.gl/y5C9UO (except that the emulation does not demo the range faceting).
The jsFiddle is part of a larger blog which explores the concept of facet searching in general: http://outbottle.com/understanding-faceted-searching/. If you’re like me and are finding the whole notion of facet-searching quite confusing then this will help.
It may not be the best solution in the world, so feel free to give feedback.

Machine learning challenge: diagnosing program in java/groovy (datamining, machine learning)

I'm planning to develop a program in Java which will provide a diagnosis. The data set is divided into two parts: one for training and the other for testing. My program should learn to classify from the training data (which contains the answers to 30 questions, each in its own column, with each record on a new line; the last column is the diagnosis, 0 or 1; in the testing part of the data the diagnosis column is empty; the data set contains about 1000 records) and then make predictions on the testing part of the data.
I've never done anything similar, so I'll appreciate any advice or information about solutions to similar problems.
I was thinking about the Java Machine Learning Library or the Java Data Mining Package, but I'm not sure if that's the right direction, and I'm still not sure how to tackle this challenge...
Please advise.
All the best!
I strongly recommend you use Weka for your task.
It's a collection of machine learning algorithms with a user-friendly front end which facilitates a lot of different kinds of feature and model selection strategies.
You can do a lot of really complicated stuff using it without really having to do any coding or maths.
The makers have also published a pretty good textbook that explains the practical aspects of data mining.
Once you get the hang of it, you can use its API to integrate any of its classifiers into your own Java programs.
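A minimal sketch of that API (assuming the 30 answers plus the 0/1 diagnosis column have been exported to ARFF files; the J48 decision tree is just one example classifier):

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DiagnosisClassifier {
    public static void main(String[] args) throws Exception {
        // Load the training data; the last column is the 0/1 diagnosis.
        Instances train = new DataSource("training.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        // Train a decision tree on it.
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Predict the diagnosis for every record in the test file.
        Instances test = new DataSource("testing.arff").getDataSet();
        test.setClassIndex(test.numAttributes() - 1);
        for (int i = 0; i < test.numInstances(); i++) {
            double predicted = tree.classifyInstance(test.instance(i));
            System.out.println("record " + i + " -> " + test.classAttribute().value((int) predicted));
        }
    }
}
```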
As Gann Bierner said, this is a classification problem. The best classification algorithm for your needs that I know of is Ross Quinlan's algorithm. It's conceptually very easy to understand.
For off-the-shelf implementations of the classification algorithms, the best bet is Weka: http://www.cs.waikato.ac.nz/ml/weka/. I have studied Weka but not used it, as I discovered it a little too late.
I used a much simpler implementation called JadTi. It works pretty well for smaller data sets such as yours. I have used it quite a bit, so I can say that with some confidence. JadTi can be found at:
http://www.run.montefiore.ulg.ac.be/~francois/software/jaDTi/
Having said all that, your challenge will be building a usable interface over the web. For that, the dataset will be of limited use: it basically works on the premise that you already have the training set, you feed in the new test dataset in one step, and you get the answer(s) immediately.
But my application, and probably yours too, was a step-by-step user discovery process, with features to go back and forth between the decision tree nodes.
To build such an application, I created a PMML document from my training set and built a Java engine that traverses each node of the tree, asking the user to give an input (text/radio/list) and using the values as inputs to the next possible node predicate.
The PMML standard can be found here: http://www.dmg.org/ Here you need the TreeModel only. NetBeans XML Plugin is a good schema-aware editor for PMML authoring. Altova XML can do a better job, but costs $$.
It is also possible to use an RDBMS to store your dataset and create the PMML automagically! I have not tried that.
Good luck with your project, please feel free to let me know if you need further inputs.
There are various algorithms that fall into the category of "machine learning", and which is right for your situation depends on the type of data you're dealing with.
If your data essentially consists of mappings of a set of questions to a set of diagnoses each of which can be yes/no, then I think methods that could potentially work include neural networks and methods for automatically building a decision tree based on the test data.
I'd have a look at some of the standard texts such as Russel & Norvig ("Artificial Intelligence: A Modern Approach") and other introductions to AI/machine learning and see if you can easily adapt the algorithms they mention to your particular data. See also O'Reilly, "Programming Collective Intelligence" for some sample Python code of one or two algorithms that might be adaptable to your case.
If you can read Spanish, the Mexican publishing house Alfaomega have also published various good AI-related introductions in recent years.
This is a classification problem, not really data mining. The general approach is to extract features from each data instance and let the classification algorithm learn a model from the features and the outcome (which for you is 0 or 1). Presumably each of your 30 questions would be its own feature.
There are many classification techniques you can use. Support vector machines are popular, as is maximum entropy. I haven't used the Java Machine Learning Library, but at a glance I don't see either of these. The OpenNLP project has a maximum entropy implementation, and LibSVM has a support vector machine implementation. You'll almost certainly have to massage your data into something the library can understand.
Good luck!
Update: I agree with the other commenter that Russel and Norvig is a great AI book which discusses some of this. Bishop's "Pattern Recognition and Machine Learning" discusses classification issues in depth if you're interested in the down and dirty details.
Your task is a classic one for neural networks, which are intended first of all to solve exactly this kind of classification task. A neural network has a rather simple implementation in any language, and it is the "mainstream" of machine learning, closer to AI than anything else.
You just implement (or get an existing implementation of) a standard neural network, for example a multilayer network with learning by error back-propagation, and feed it learning examples in a loop. After some time of such training, it will start working on real examples.
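Not a full multilayer back-propagation network, but the training loop has the same shape even for a single sigmoid unit trained by gradient descent (30 inputs, one 0/1 output; loading the actual data is left out):

```java
public class SingleSigmoidUnit {
    private final double[] weights;
    private double bias;

    public SingleSigmoidUnit(int numInputs) {
        this.weights = new double[numInputs]; // e.g. 30 question answers
    }

    // Sigmoid activation of the weighted sum of the inputs.
    private double predict(double[] x) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Feed the learning examples in a loop, nudging the weights after each one.
    public void train(double[][] inputs, double[] targets, int epochs, double learningRate) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int n = 0; n < inputs.length; n++) {
                double out = predict(inputs[n]);
                double delta = (targets[n] - out) * out * (1 - out); // error * sigmoid derivative
                for (int i = 0; i < weights.length; i++) {
                    weights[i] += learningRate * delta * inputs[n][i];
                }
                bias += learningRate * delta;
            }
        }
    }

    public int classify(double[] x) {
        return predict(x) >= 0.5 ? 1 : 0;
    }
}
```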
You can read more about neural networks starting from here:
http://en.wikipedia.org/wiki/Neural_network
http://en.wikipedia.org/wiki/Artificial_neural_network
You can also find links to many ready-made implementations here:
http://en.wikipedia.org/wiki/Neural_network_software
