I have a Java (Lucene 4) based application and a set of keywords fed into the application as a search query (a keyword may consist of more than one word, e.g. “memory”, “old house”, “European Union law”, etc.).
I need a way to get the list of matched keywords out of an indexed document, and possibly also the keyword positions in the document (including for the multi-word keywords).
I tried the Lucene highlight package, but I need to get only the keywords, without any surrounding text. It also returns multi-word keywords in separate fragments.
I would greatly appreciate any help.
There's a similar (possibly same) question here:
Get matched terms from Lucene query
Did you see this?
The solution suggested there is to break a complicated query down into simpler queries until you reach a TermQuery, and then check each one via searcher.explain(query, docId) (if it matches, you know that term matched).
I don't think it's very efficient, but it worked for me until I ran into SpanQueries. It might be enough for you.
I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the original content.
I have to store the index of the previously existing markups.
So here's an example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a string of 35 characters, marked up with a strong tag. As we know, an HTML element has a start tag and an end tag, and if we treat each tag as a sequence of characters, each one also has a start and an end (a character index).
Again, in the previous example, the beginning index of the opening tag is 5 (counting from index 0), and its end index is 13. The same logic applies to the closing tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
Given this stripped sequence, how can I remember the places where the markup should be inserted again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag in the X position/index, and the closing tag in the Y position/index... Like so:
This is a message.
     5   9
index 5 = <strong>
index 9 = </strong>
I must remember that it is possible to find the following situation:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best example (the best solution) in the world, but only that it works the right way for any kind of situation.
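For illustration, the bookkeeping described above could be sketched roughly as follows. This is only a minimal sketch, not a definitive solution: the class and method names (TagStripper, Marker, strip, restore) are made up for this example, and the tag pattern is a naive assumption that would break on things like a ">" inside an attribute value, comments, or script blocks. Each removed tag is stored together with the index in the stripped text where it would have to be re-inserted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class TagStripper {

    /** A removed tag plus the index in the stripped text where it belongs. */
    static final class Marker {
        final int index;
        final String tag;
        Marker(int index, String tag) { this.index = index; this.tag = tag; }
        @Override public String toString() { return "index " + index + " = " + tag; }
    }

    /** The stripped text and the markers, in document order. */
    static final class Result {
        final String text;
        final List<Marker> markers;
        Result(String text, List<Marker> markers) { this.text = text; this.markers = markers; }
    }

    // Naive assumption: a tag is "<", an optional "/", a letter, then anything up to ">".
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

    static Result strip(String html) {
        StringBuilder text = new StringBuilder();
        List<Marker> markers = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            text.append(html, last, m.start());                // copy content before the tag
            markers.add(new Marker(text.length(), m.group())); // position in the stripped text
            last = m.end();
        }
        text.append(html, last, html.length());
        return new Result(text.toString(), markers);
    }

    static String restore(String text, List<Marker> markers) {
        // Markers are in document order, so their indexes never decrease.
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Marker mk : markers) {
            out.append(text, last, mk.index);
            out.append(mk.tag);
            last = mk.index;
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Result r = strip("This <strong>is a</strong> message.");
        System.out.println(r.text);
        for (Marker mk : r.markers) {
            System.out.println(mk);
        }
        System.out.println(restore(r.text, r.markers));
    }
}
```

Because the tags are kept in their original order, several tags that land on the same index (as in the overlapping example above) are re-inserted in the order they appeared, so the round trip reproduces the input even when the markup is not properly nested.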
If anything in my question is unclear, please comment and I will clarify it.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I do not care whether the solution uses regular expressions or not; I just want to solve the problem, no matter how (but, of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that gives the same result. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, Thanks in advance.
EDIT 2
I'm making this edit to explain where my problem comes from (as asked). Before I start, I want to say that what I'm trying to do is something I've never done before and is not in my area, so this may not be the most appropriate way to do it. Anyway...
I'm developing a site where users are allowed to read content but cannot edit it (edit or remove text). However, users can still mark/highlight excerpts (ranges) of the content (with some styling). That is the short summary.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save the styled text to a database, but this would take up a lot of space, since every client is allowed to do this and there are many clients. So if a client marks snippets of a paragraph, saving the whole paragraph back to the database for each client in the system is rather costly in terms of storage.
So I thought of just saving the range (indexes) of the markups made by users in a database. It is much easier to save just a few numbers than all the text with the styling required. In the case, for example, I could save a line / record in a table that says:
In paragraph X, from index Y to index Z, user P defined an ABC style.
This would require a translation / conversion, from database to HTML, and HTML to database. Setting a converter can be easy (I guess), but I do not know how to get the indexes (following this logic). And then we stop again at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
Similarly, I know it would be ideal to do this processing with JavaScript (client-side). However, I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is ready-made, easy and quick to use. Do you know of any ready-made, easy-to-use library that can do this in a simple way? Does one exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but in short, HTML isn't a regular language because it has a recursive structure.
Any language that allows arbitrarily deep nesting of matched open/close pairs isn't regular, so you can't parse it with a regex.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
I'm in trouble with a simple query to get strings from Realm engine in Java for an Android app.
As the title says, I want to get diacritic-insensitive results from my query.
Example:
If the user types the word "securite", I want my query to return both "securite" and "sécurité".
How can I do that?
Thanks a lot in advance for your help !
Realm doesn't support that currently. However, depending on how much of the data you control, you can add a "normalized" field and use that in your search. There is an approach described here: Remove diacritics from string in Java
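As a rough sketch of how such a normalized value could be computed with the standard java.text.Normalizer class: the class name DiacriticFolding and the method fold are made up for this example, and lowercasing is an extra assumption so that the comparison also ignores case. The same function would be applied both when writing the normalized field and to the user's input before querying.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class DiacriticFolding {

    /** Returns the input without combining marks, lowercased. */
    static String fold(String input) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then dropped.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "\u00e9" is the letter e with an acute accent.
        System.out.println(fold("s\u00e9curit\u00e9"));
    }
}
```

Note that letters which do not decompose into a base letter plus a mark (for example "ø" or "ß") are left unchanged by this approach and would need their own mapping.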
This is not possible in Realm at the moment. Your only option is to maintain tables containing all the variants of each letter of the alphabet you are interested in, something like [a, á, à, å, etc.], then compute all the possible combinations for each search string and build a huge query with equalTo() and or(). It would probably take longer to build such a query than to execute it, but that's a very interesting use case! If you end up implementing it, I would love to know the results!
Imagine that I am building a hashtag search. My main indexed type is called Post, which has a list of Hashtag items, which are marked as IndexedEmbedded. Separately, every post has a list of Comment objects, each of which, again, contains a list of Hashtag objects.
On the search side, I am using a MultiFieldQueryParser, to which I pass a long list of possible search fields, including some nested fields like:
hashTags.value and
comments.hashTags.value
Now, the interesting thing happens when I want to search for something, say #architecture. I know where the hashtags are, so the simplest logical thing to do would be to convert a query of the form #architecture into one of the form hashTags.value:architecture or comments.hashTags.value:architecture. Although possible, this is very inflexible. What if I come up with yet another field that contains hashtags? I'd have to include that too.
Is there a general way to do this?
P.S. Please keep in mind that the root type I am searching for is Post, because this is the kind of result I'd like to get back.
Hashtags are keywords, and you should let Lucene handle the text analysis to extract the hashtags from your main text and store them in a custom field.
You can do this very easily with Hibernate Search by indexing your text into two different @Field definitions (using the @Fields annotation). You could have one field named comments and another named commentsHashtags.
You then apply a custom Analyzer to commentsHashtags which does some standard tokenization and discards any term not starting with #; you can define one easily by taking the standard tokenizer and applying a custom filter.
When you run a query, you don't have to write custom code to look for hashtags in the query input. Let it be processed by the same Analyzer (which is the default anyway) and target both fields; you can even boost the hashtags field more if that makes sense.
With this solution you
take advantage of the high efficiency of Search's text analysis
avoid entities and tables on the database containing the hashtags: useless overhead
avoid messing with free text extraction
It also gives you another big win:
you can then open a raw IndexReader and load the term vectors from commentsHashtags to get both a list of all used tags and metrics about them. That is handy for some data mining, or just to visualize a tag cloud.
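To make the "discard any term not starting with #" step from this answer concrete, here is a plain-Java sketch of that filtering logic only. It is not a Lucene Analyzer or TokenFilter and uses no Lucene or Hibernate Search API; the class name HashtagFilter and the whitespace-based tokenization are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only; not a Lucene Analyzer.
class HashtagFilter {

    /** Keeps only the tokens that start with '#', returned lowercased without the '#'. */
    static List<String> hashtags(String text) {
        List<String> tags = new ArrayList<>();
        for (String token : text.split("\\s+")) {      // simple whitespace tokenization
            if (!token.startsWith("#")) {
                continue;                               // discard ordinary words
            }
            // Drop the '#' and any trailing punctuation such as "," or "!".
            String tag = token.substring(1).replaceAll("\\W+$", "").toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(hashtags("Love #Architecture, and #design!"));
    }
}
```

In a real setup the same idea would live in a custom token filter inside the analyzer chain, so that the indexed text and the query input are processed identically.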
Instead of treating the fields as different and the top-level document as Post, why not store both Posts and Comments as Lucene documents? That way, you can just have a single field called "hashtags" that you search. You should also have a field called "type" or something to differentiate between comments and posts.
Search results may be either comments or posts. You can filter by type if users want to search only posts or only comments. Or you can show them differently in your UI.
If you want to add another concept that also uses hashtags (like ... I dunno... splanks or whatever silly name we all give to Internet communications in the future), then you can add it alongside the existing Post and Comment documents simply by indexing your new type with a "hashtags" field. You'll have to do plenty of work to add the splanks anyway, so adding a handler for that new type of search result shouldn't be too much of an inconvenience.
I have a search autocomplete on my site, and I'm using Solr to find matching documents. I am trying to get partial matches on page titles, so for example Java* would match Java, Javascript, etc. As of right now, the autocomplete is set up to give me partial matches on all of the text in the page, which gives some weird results, so I've decided to switch over to using the page title. However, when I try to switch the search term from text for the page text to title, the query suddenly does not pick up partial matches any more. Here is an example of my original query:
q=text:java^2+text:"java"
&hl=true&hl.snippets=1&hl.fragsize=25&hl.fl=title&start=0&rows=3
Unfortunately, the guy who set this up for me does not work with me any more, so I have little idea what's going on 'under the hood'. I'm using Spring/J2EE for my backend, if that makes any difference.
You need to make sure that the field is not a string-based field. You can check this in your schema.xml. If you search with Java* in a string field, it will only match titles that start with Java.
Another thing to be aware of is that wildcard queries are case sensitive (see this).
It depends on how the title field was analyzed. Look at schema.xml to see what type the field is and how it is analyzed to create terms. An easy way to do that is to go to the Solr admin page at http://localhost:8983/solr/admin/analysis.jsp, choose the name option, type in the field name (I am guessing 'title'), and enter some sample text and a query to see what terms are created and matched.
Is there any way to query GAE datastore with filter similar to SQL LIKE statement? For example, if a class has a string field, and I want to find all classes that have some specific keyword in that string, how can I do that?
It looks like JDOQL's matches() doesn't work... Am I missing something?
Any comments, links or code fragments are welcome
As the GAE/J docs say, BigTable doesn't have such native support. You can use JDOQL String.matches for "something%" (i.e. startsWith). That's all there is. Otherwise, evaluate it in memory.
If you have a lot of items to examine, you want to avoid loading them at all. The best way would probably be to break down the inputs at write time. If you are only searching by whole words, that is easy.
For example, "Hello world" becomes "Hello", "world": just add both to a multi-valued property. If you have a lot of text, you want to avoid loading the multi-valued property, because you only need it for the index lookup. You can do this by creating a "Relation Index Entity"; see Brett Slatkin's Google I/O talk for details.
You may also want to break the input down into 3-character, 4-character, etc. strings, or stem the words, perhaps with a Lucene stemmer.
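The write-time breakdown described above could look roughly like this in plain Java. It is only a sketch: the class name SearchTokens and its methods are made up for this example, lowercasing is an added assumption so that lookups ignore case, and storing the resulting values in a multi-valued datastore property is left out.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only.
class SearchTokens {

    /** Splits text into distinct lowercase words, e.g. for a multi-valued property. */
    static Set<String> words(String text) {
        Set<String> out = new LinkedHashSet<>();
        // Split on anything that is neither a letter nor a digit.
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    /** Returns all substrings of the word with a length between minLen and maxLen. */
    static Set<String> ngrams(String word, int minLen, int maxLen) {
        Set<String> out = new LinkedHashSet<>();
        for (int n = minLen; n <= maxLen; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                out.add(word.substring(i, i + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello world"));
        System.out.println(ngrams("hello", 3, 4));
    }
}
```

A keyword search then becomes an equality filter on the multi-valued property, at the cost of extra index entries for every stored word or fragment.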