How to use Lucene for next token recommendation? - java

I am trying to build a recommender that suggests the next token based on some context. An example would be recommending method calls where the context would be the method calls that were seen before.
I would need Lucene to build a language model (e.g. the n-gram model) from a prebuilt list of strings. It should then support queries that contain a list of tokens (the context) and return the token that has the highest probability to come next.
What would be the best way to implement this with the Lucene API?
Edit:
I tried femtoRgon's suggestion to use the Suggesters from the suggest package. Unfortunately, they don't completely solve my use case. AnalyzingSuggester and AnalyzingInfixSuggester need to be built with query data, while I want to build the model from a list of strings or text.
The FreeTextSuggester is pretty close to what I need, but it does not support context. It builds an n-gram model from the text and the autocompletion works, but I would also like to input a context. The context would be a list of strings; for example, with method calls it would be the tokens seen before the point that triggered the code completion.
Would there be a way to use such context in these suggesters or is there another method in Lucene?
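To make the requirement concrete, here is a minimal sketch of the kind of model described above: it counts which token follows each context and returns the most frequent follower. It is plain Java and does not use the Lucene API; the class name NextTokenModel and its methods are hypothetical, and the back-off to shorter contexts is an assumption about the desired behaviour, not something the Suggesters provide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: counts which token follows each context of up to n-1 tokens. */
final class NextTokenModel {
    private final int order; // the n in "n-gram"
    private final Map<List<String>, Map<String, Integer>> counts = new HashMap<>();

    NextTokenModel(int order) {
        if (order < 2) {
            throw new IllegalArgumentException("order must be >= 2");
        }
        this.order = order;
    }

    /** Records every (context, next token) pair found in one training sequence. */
    void train(List<String> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            // Store contexts of length 1..order-1 so shorter ones can serve as back-off.
            for (int len = 1; len < order && len <= i; len++) {
                List<String> context = new ArrayList<>(tokens.subList(i - len, i));
                counts.computeIfAbsent(context, k -> new HashMap<>())
                      .merge(tokens.get(i), 1, Integer::sum);
            }
        }
    }

    /** Returns the most frequent follower of the longest known suffix of the context. */
    Optional<String> suggest(List<String> context) {
        int maxLen = Math.min(order - 1, context.size());
        for (int len = maxLen; len >= 1; len--) {
            List<String> suffix = context.subList(context.size() - len, context.size());
            Map<String, Integer> followers = counts.get(suffix);
            if (followers != null) {
                return followers.entrySet().stream()
                        .max(Map.Entry.comparingByValue())
                        .map(Map.Entry::getKey);
            }
        }
        return Optional.empty();
    }
}
```

For method-call recommendation, each training sequence would be the list of calls seen in one method body, and the query context would be the calls seen before the completion point. Ties between equally frequent followers are broken arbitrarily in this sketch.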

Related

How to create a simple Italian Model for a Named Entity Extraction of Persons using OpenNLP?

I have to do a project with OpenNLP, strictly in Italian. Since it's almost impossible to find existing resources for this language, my idea is to create a simple model myself. After reading some posts on this platform, my plan is to try doing this with the model-builder addon.
First of all, is it possible to achieve my goal with this addon?
If so, referring to this other post, what kind of file is meant by "modelOutFile"? In my case I don't have an existing model.
N.B.: the addon uses some deprecated functions (such as nameFinderME.train()).
Naively, I tried to pass a simple empty file, "model.bin", as the "modelOutFile", but of course I ran into an error:
Cannot invoke "java.util.Properties.getProperty(String)" because "manifest" is null
Furthermore, I used a few names and sentences for the test (I only wanted to know if this worked), not the large amount requested (15000 sentences at least).
I'm open to other suggestions besides the model-builder addon.
Hope someone can help me.

How to manage a crawler URL frontier?

Guys
I have the following code to record visited links in my crawler.
After extracting the links, I have a for loop that loops through each individual href tag.
After I have visited a link and opened it, I add the URL to a visited-links collection variable defined above.
private final Collection<String> urlForntier = Collections.synchronizedSet(new HashSet<String>());
The crawler implementation is multithreaded. Assume I have visited 100,000 URLs: if I don't terminate the crawler, the set will keep growing day by day. Will it create memory issues? What options do I have to refresh the variable without creating inconsistency across threads?
Thanks in advance!
If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.
Luckily, you don't need to write this yourself: just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.
See https://github.com/crawler-commons/url-frontier
The most practical approach for modern crawling systems is to use a NoSQL database.
This solution is notably slower than a HashSet. That is why you can add a caching layer such as Redis, or even Bloom filters.
But given the specific nature of URLs, I'd recommend a trie data structure, which gives you a lot of options to manipulate and search by URL string. (A discussion of Java implementations can be found in this Stack Overflow topic.)
As per the question, I would recommend using Redis to replace the Collection. It is an in-memory data structure store and is very fast at inserting and retrieving data, with support for all the standard data structures. In your case that is a Set, and you can check whether a key exists in the set with the SISMEMBER command.
Apache Nutch is also good to explore.
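As an illustration of the Bloom filter option mentioned above, here is a minimal sketch of a fixed-size "probably seen" set in plain Java. Its memory use does not grow with the number of URLs, at the cost of occasional false positives (a new URL reported as already seen). The class name UrlBloomFilter, the sizing parameters and the choice of hash functions are assumptions made for this example; a production crawler would more likely use a library implementation or one of the frontier services named above.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical, minimal Bloom filter for "have I probably seen this URL?" checks. */
final class UrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    UrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Marks the URL as seen; returns true if it was definitely not seen before. */
    synchronized boolean add(String url) {
        boolean isNew = false;
        for (int i = 0; i < hashCount; i++) {
            int index = index(url, i);
            if (!bits.get(index)) {
                isNew = true;
                bits.set(index);
            }
        }
        return isNew;
    }

    /** Returns false if the URL was never added; true means "probably added". */
    synchronized boolean mightContain(String url) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(url, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: combines two simple hashes to derive hashCount bit positions.
    private int index(String url, int i) {
        int h1 = url.hashCode();
        int h2 = fnv1a(url);
        return Math.floorMod(h1 + i * h2, size);
    }

    // 32-bit FNV-1a over the UTF-8 bytes of the string.
    private static int fnv1a(String s) {
        int hash = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }
}
```

A crawler thread would call add(url) and only fetch the page when it returns true. The synchronized methods keep the check-and-set atomic across threads, which is the simplest choice here rather than the fastest one.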

Why parsing Gremlin query in Java isn't generic?

I'm parsing a Gremlin query in Java (well, actually I'm writing Scala, and using the Groovy compiled JARs like it was Java).
The query is a String variable that is given by user input. In other words - I cannot tell what the query will be, I'm only assuming it's a valid Gremlin query (syntactically and logically).
I started with a simple Gremlin.compile(query), which returns a Pipe that I iterate over. However, according to the example, one must invoke .setStarts before iterating the Pipe, and I must know the runtime type S of my Pipe<S,E>.
It feels like this API isn't generic enough, the following line from the example
pipe.setStarts(new SingleIterator<Vertex>(graph.getVertex(1)));
will work in some cases, but for a vertex iteration, for example (g.V()), it will throw a ClassCastException.
Is there a way to work around it?
Perhaps using the underlying Script Engine (like the next examples in the link above) will help me to achieve more generic code?
I found a workaround. It feels a bit ugly but it does the job.
I'm using ScriptEngine with a binding of 'g' for the Graph, so the user can start his/her queries with g. (this doesn't help with generics, but it is more user-friendly because the user doesn't have to put the identity pipe (_()) at the beginning of his/her queries).
(Kind of ugly, I know.) I extract the starting vertex (if one exists) from the query string using a regex, look it up programmatically and, if it is found, invoke setStarts with it. If it's not found, I pass the Graph itself as the parameter to setStarts, assuming it's a vertex iteration query.
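The regex step of this workaround could look roughly like the sketch below. It assumes that a query with a single starting vertex begins with g.v(<numeric id>), which is a guess at the query shape and not something stated above; the class name StartVertexExtractor is hypothetical, and the graph lookup and setStarts call are left out because they depend on the Gremlin and Blueprints versions in use.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls a numeric start vertex id out of a query string. */
final class StartVertexExtractor {
    // Assumed query shape: the query begins with g.v(<digits>), e.g. "g.v(1).out('knows')".
    // The digit count is capped so Long.parseLong cannot overflow.
    private static final Pattern START_VERTEX =
            Pattern.compile("^\\s*g\\.v\\(\\s*(\\d{1,18})\\s*\\)");

    /** Returns the id if the query starts from a single vertex, otherwise empty. */
    static OptionalLong startVertexId(String query) {
        Matcher m = START_VERTEX.matcher(query);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }
}
```

When the result is empty, the caller would fall back to passing the Graph itself, as described above.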

Manipulating HTML nodes with java javascript scripting API

I'm using the Java Scripting API which is working quite well. Now I have a function where I want to get all <a> tags from a String and then add/remove attributes before returning the manipulated String. The problem of course is, that I can't just use document.getElementsByTagName. Is there any easy option that comes to your mind without going through regex-hell?
Please note that I'm currently running on Java 7 (with Rhino), planning to update to Java 8 (with Nashorn), so I don't want to use any Rhino specific APIs.
In the book "Learning JavaScript Design Patterns" by Addy Osmani, the author mentions three alternatives for a similar problem, with getElementById() obviously being the fastest.
Excerpt from book:
Imagine that we have a script where for each DOM element found on a page with class "foo," we wish to increment a counter. What's the most efficient way to query for this collection of elements? Well, there are a few different ways this problem could be tackled:
1. Select all of the elements in the page and then store references to them. Next, filter this collection and use regular expressions (or another means) to store only those with the class "foo."
2. Use a modern native browser feature such as querySelectorAll() to select all of the elements with the class "foo."
3. Use a native feature such as getElementsByClassName() to similarly...
Alternatively, since you're using Nashorn/Rhino, you could use the Java implementation of the Xerces library to manipulate the DOM.
Hope this helps you find a solution.
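A minimal sketch of the DOM approach, using the XML parser and transformer that ship with the JDK (javax.xml), might look like this. It assumes the input string is well-formed XML/XHTML; real-world HTML often is not, and would need a lenient HTML parser instead. The class name AnchorRewriter and the specific attribute changes (adding rel, removing target) are only examples.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: edits attributes on every <a> element of a well-formed fragment. */
final class AnchorRewriter {
    static String rewriteAnchors(String xhtml) {
        try {
            // Parse the string into a DOM tree (fails on markup that is not well-formed).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                a.setAttribute("rel", "nofollow"); // example of adding an attribute
                a.removeAttribute("target");       // example of removing an attribute
            }

            // Serialize the tree back to a string without the XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Because this is ordinary Java with no Rhino-specific API, the same helper should be usable from a script running on either engine, although the syntax for reaching Java classes from the script differs between Rhino and Nashorn.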

Using a common query convention for multiple search fields

Imagine that I am building a hashtag search. My main indexed type is called Post, which has a list of Hashtag items, which are marked as IndexedEmbedded. Separately, every post has a list of Comment objects, each of which, again, contains a list of Hashtag objects.
On the search side, I am using a MultiFieldQueryParser, to which I pass a long list of possible search fields, including some nested fields like:
hashTags.value and
comments.hashTags.value
Now, the interesting thing happens when I want to search for something, say #architecture. I know where the hashtags are, so the simplest logical thing to do would be to convert a query of the type #architecture into one of the type hashTags.value:architecture or comments.hashTags.value:architecture. Although possible, this is very inflexible. What if I come up with yet another field that contains hashtags? I'd have to include that too.
Is there a general way to do this?
P.S. Please keep in mind that the root type I am searching for is Post, because that is the kind of result I'd like to get.
Hashtags are keywords, and you should let Lucene handle the text analysis to extract the hashtags from your main text and store them in a custom field.
You can do this very easily with Hibernate Search by indexing your text in two different @Field definitions (using the @Fields annotation). You could have one field named comments and another named commentsHashtags.
You then apply a custom Analyzer to commentsHashtags that does some standard tokenization and discards any term not starting with #; you can define one easily by taking the standard tokenizer and applying a custom filter.
When you run a query, you don't have to write custom code to look for hashtags in the query input. Let it be processed by the same Analyzer (which is the default anyway) and target both fields; you can even boost the hashtags more if that makes sense.
With this solution you
take advantage of the high efficiency of Search's text analysis
avoid entities and tables on the database containing the hashtags: useless overhead
avoid messing with free text extraction
It also gives you another strong advantage:
you can then open a raw IndexReader and load the termvector from commentsHashtags to get both a list of all used tags, and metrics about them. Cool to do some data mining, or just visualize a tag cloud.
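The filtering rule described in this answer (tokenize, then keep only terms that start with #) can be sketched in plain Java as follows. This is not a Lucene TokenFilter or a Hibernate Search analyzer definition, only the logic such a filter would apply; the class name HashtagTerms is hypothetical, and stripping the leading # and lower-casing the term are assumptions made so that the result matches a query like hashTags.value:architecture.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: keeps only the hashtag terms of a text. */
final class HashtagTerms {
    /** Splits on whitespace and keeps terms starting with '#', lower-cased and without the '#'. */
    static List<String> extract(String text) {
        List<String> tags = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            // Strip trailing punctuation such as "," or "." left over from whitespace splitting.
            String term = token.replaceAll("[^\\p{L}\\p{N}_#]+$", "");
            if (term.length() > 1 && term.startsWith("#")) {
                tags.add(term.substring(1).toLowerCase(Locale.ROOT));
            }
        }
        return tags;
    }
}
```

Applying the same rule to both the indexed text and the query input is what lets a search for #architecture match without any field-specific query rewriting.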
Instead of treating the fields as different and the top-level document as Post, why not store both Posts and Comments as Lucene documents? That way, you can just have a single field called "hashtags" that you search. You should also have a field called "type" or something to differentiate between comments and posts.
Search results may be either comments or posts. You can filter by type if users want to search only posts or only comments. Or you can show them differently in your UI.
If you want to add another concept that also uses hashtags (like ... I dunno... splanks or whatever silly name we all give to Internet communications in the future), then you can add it alongside the existing Post and Comment documents simply by indexing your new type with a "hashtags" field. You'll have to do plenty of work to add the splanks anyway, so adding a handler for that new type of search result shouldn't be too much of an inconvenience.
