Imagine that I am building a hashtag search. My main indexed type is called Post, which has a list of Hashtag items, which are marked as IndexedEmbedded. Separately, every post has a list of Comment objects, each of which, again, contains a list of Hashtag objects.
On the search side, I am using a MultiFieldQueryParser, to which I pass a long list of possible search fields, including some nested fields like:
hashTags.value and
comments.hashTags.value
Now the interesting part: when I want to search for something, say #architecture, I know where the hashtags live, so the simplest logical thing to do would be to rewrite a query like #architecture into hashTags.value:architecture or comments.hashTags.value:architecture. Although possible, this is very inflexible. What if I come up with yet another field that contains hashtags? I'd have to include that too.
Is there a general way to do this?
P.S. Please bear in mind that the root type I am searching for is Post, because that is the kind of result I'd like to get.
Hashtags are keywords, and you should let Lucene handle the text analysis to extract the hashtags from your main text and store them in a custom field.
You can do this very easily with Hibernate Search by mapping your text to two different fields with the @Field annotation (using the plural @Fields annotation). You could have one field named comments and another commentsHashtags.
You then apply a custom analyzer to commentsHashtags which does some standard tokenization and discards any term not starting with #; you can define one easily by taking a standard tokenizer and applying a custom filter.
When you run a query, you don't have to write custom code to look for hashtags in the query input: let it be processed by the same analyzer (which is the default anyway) and target both fields. You can even boost the hashtags field more if that makes sense.
With this solution you
take advantage of the high efficiency of Lucene's text analysis
avoid entities and tables in the database containing the hashtags: useless overhead
avoid messing with free text extraction
It also gets you another strong win:
you can then open a raw IndexReader and load the term vector (or enumerate the terms) of commentsHashtags to get both a list of all used tags and metrics about them. Cool for doing some data mining, or just visualizing a tag cloud.
Instead of treating the fields as different and the top-level document as Post, why not store both Posts and Comments as Lucene documents? That way, you can just have a single field called "hashtags" that you search. You should also have a field called "type" or something to differentiate between comments and posts.
Search results may be either comments or posts. You can filter by type if users want to search only posts or only comments, or you can show them differently in your UI.
If you want to add another concept that also uses hashtags (like... I dunno... splanks, or whatever silly name we all give to Internet communications in the future), then you can add it alongside the existing Post and Comment documents simply by indexing your new type with a "hashtags" field. You'll have to do plenty of work to add the splanks anyway, so adding a handler for that new type of search result shouldn't be too much of an inconvenience.
Related
I'm trying to build hashmaps using values from XML, for the purpose of sending the list of maps off through an API call to build a table in another app. I need to find the most efficient way, whether a library, a pattern, or even another language, to find the values in the XML and build this structure.
As of now, I am building a class for each table type. I am doing this because there are dozens of types of tables and they all look for different values. For example, imagine I am working for a supermarket and there is an XML document holding all the items in the store, along with various details about each item. I need to build a table of items from the XML for each section of the supermarket, so I would have a GroceryBuilder class, a ClothingBuilder class, and so on. In the GroceryBuilder class, I would traverse the XML, find all the grocery items, and add the items, along with other data related to those items, to a hashmap looking like this, where [n] is a row:
("Grocery[1].Item", "Apple"),
("Grocery[1].Description", "Granny Smith"),
("Grocery[1].Color", "Green"),
("Grocery[2].Item", "Paper Plates"),
("Grocery[2].PricePerEach", ".03"),
("Grocery[2].Purpose", "Eating"),
("Grocery[3].Item", "Bologna"),
("Grocery[3].Description", "Meat-like"),
("Grocery[3].Purpose", "Sustainence"),
As you can see above, each row can have different column values because not every cell in the table is populated.
Here is an example of what the XML could look like:
<grocery>
<food>
<produce>
<apple>
<description>Granny Smith</description>
<itemCd>93jfu4n</itemCd>
<color>Green</color>
</apple>
<pear>
<description>Concorde</description>
<itemCd>0272ve6dg3</itemCd>
<color>Yellow</color>
</pear>
<banana>
<description>Regular</description>
<itemCd>2je7c3</itemCd>
<color>Yellow</color>
</banana>
...
<insert 50 types of produce here/>
...
</produce>
<meat>
<bologna>
<description>Meat-like</description>
<itemCd>9dmd623</itemCd>
<purpose>Sustenance</purpose>
</bologna>
...
<insert 50 types of meat here/>
...
</meat>
</food>
<sportingGoods>
...
<insert 50 types of sporting goods here/>
...
</sportingGoods>
<clothing>
...
<insert 50 types of clothing here/>
...
</clothing>
</grocery>
The problem I am facing is that there are potentially 100+ table types (using the example above, imagine a table for every section in the store), each looking for specific values, so I would potentially have to build 100+ different classes. I am looking for a more generic way to build these structures.
The challenge is that there are many conditions on the values I am getting from the XML. For instance, insert the value into the map only if the itemCd is a certain value; or if the apple's description equals some particular value, insert a different value instead.
So far, I've been building these maps manually, looping over each item in a section (e.g. "produce"), checking conditions, and inserting the values based on those conditions. But this is going to be a ton of effort if I must do it for 100+ tables. Is there an established pattern or library that could handle this better? Or even a language other than Java?
XPath approach
To select parts of an XML document, use XPath; to transform one XML document into another, or even construct an XML document from text or JSON, use XSLT (which embeds XPath). XPath supports the powerful conditionality you describe, and more.
See also
How to read XML using XPath in Java
XSLT processing with Java?
Data binding approach
There are data binding tools such as Jakarta XML Binding that can help automate the mapping between XML and Java objects.
See also
XML data binding
Java XML Binding
Simple, structurally typed XML data binding (without code generation or reflection)
so I would potentially have to build 100+ different classes
I am looking for a more generic way to build these structures
and maybe the supermarket may introduce a new product type at any time (assumption)
In this case you are right: defining a class for each type is not effective. You had good intuition in asking for a more generic approach.
I don't see other requirements or constraints, so it is hard to tell what exactly a good solution would be.
For a start, I'd propose a common parameterized TypeBuilder - a parser or filter (e.g. pass the required type name as a parameter and get back the properties as a map extracted from the parsed input, or a parser returning a list of TypeBuilder instances).
The challenge is that there are many conditions on the values I am getting from the XML.
Your intuition is right again: it is considered bad practice to put (business) rules into the code.
If you cannot find a generic set of rules, then you may need to keep the rules somewhere external. Maybe a rule engine is overkill (though feasible for enterprises such as supermarkets); maybe a list of regular expressions would be good enough. Without more requirements or constraints it is hard to propose a better answer.
I have a Java (Lucene 4) based application and a set of keywords fed into the application as a search query (the terms may include more than one word, e.g. “memory”, “old house”, “European Union law”, etc.).
I need a way to get the list of matched keywords out of an indexed document and possibly also get keyword positions in the document (also for the multi-word keywords).
I tried the Lucene highlight package, but I need to get only the keywords without any surrounding portion of text, and it also returns multi-word keywords in separate fragments.
I would greatly appreciate any help.
There's a similar (possibly the same) question here:
Get matched terms from Lucene query
Did you see this?
The solution suggested there is to disassemble a complicated query into simpler queries until you get down to a TermQuery, and then check each via searcher.explain(query, docId) (because if it matches, you know that term matched).
I don't think it's very efficient, but it worked for me until I ran into SpanQueries; it might be enough for you.
I'm looking for a Java library that can do named entity recognition (NER) with a custom controlled vocabulary, without needing labeled training data first. I searched around on SE, but most existing questions are rather unspecific.
Consider the following use-case:
an editor is inputting articles in a CMS (about 500 words).
the text may contain references (in plain text) to entities of a specific domain. e.g:
names of points of interest, like bars, restaurants, as well as neighborhoods, etc.
a controlled vocabulary of these entities exists (about 5,000 entities).
I imagine an entity to be a tuple in the vocabulary.
after finishing the text, the user should be able to save the document.
This triggers a workflow that scans the piece of text against the vocabulary by comparing against the names of the entities. A 100% match is not required: 97% on Jaro-Winkler or whatever (I'm not familiar with which algorithms NER uses) may be enough, and I need this to be configurable.
Hits are returned to the controller server-side, which in turn returns JSON to the client containing the entities, represented as suggested crosslinks to the editor.
Ideally, I'm looking for a project that uses NER to suggest crosslinks within a CMS environment that I could piggyback on (I'm sure plugins for WordPress exist, for example); I'm not so sure whether something similar exists in Java.
Any other, more general pointers to NER libraries which work with controlled custom vocabularies are welcome as well.
For people looking this up in the future:
"Approximate Dictionary-Based Chunking"
see: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
Unsure if these might be helpful:
http://www-nlp.stanford.edu/software/CRF-NER.shtml
http://cogcomp.cs.illinois.edu/page/software
I have a search autocomplete on my site, and I'm using Solr to find matching documents. I am trying to get partial matches on page titles, so for example Java* would match Java, Javascript, etc. Right now the autocomplete is set up to give partial matches on all of the text in the page, which gives some weird results, so I've decided to switch over to using the page title. However, when I try to switch the search field from text (the page text) to title, the query suddenly stops picking up partial matches. Here is an example of my original query:
q=text:java^2+text:"java"
&hl=true&hl.snippets=1&hl.fragsize=25&hl.fl=title&start=0&rows=3
Unfortunately, the guy who set this up for me does not work with me any more, so I have little idea what's going on 'under the hood'. I'm using Spring/J2EE for my backend, if that makes any difference.
You need to make sure that the field is not a string-based field; you can check this by looking at your schema.xml. A string field indexes the whole title as a single term, so searching it with Java* will only match titles that start with Java.
Another thing: be aware that wildcard queries are case sensitive (see this).
It depends on how the title field was analyzed. Look at schema.xml to see what type the field is and how it is analyzed to create terms. An easy way to check is to go to the Solr admin analysis page at http://localhost:8983/solr/admin/analysis.jsp, choose the field name option, type in the field name (I'm guessing 'title'), put in some sample text and a query, and see which terms are created and matched.
I have a Java-based web application and a new requirement to allow users to place variables into text fields; the variables are replaced when a document or other output is produced. How have others gone about this?
I was thinking of having a pre-defined set of variables such as :
#BOOKING_NUMBER#
#INVOICE_NUMBER#
Then when a user enters some text they can specify a variable inline (select it from a modal or similar). For example:
"This is some text for Booking #BOOKING_NUMBER# that is needed by me"
When producing output (e.g. a PDF) that uses this text, I would run a regex, find all variables, and replace them with the correct values:
"This is some text for Booking 10001 that is needed by me"
My initial thought was something like FreeMarker, but I think that is too complex for my users and would require them to know my data model (eww).
Thanks for reading!
D.
Have a look at java.text.MessageFormat - particularly the format method - as this is designed for exactly what you are looking for.
e.g.
MessageFormat.format("This is some text for booking {0} that is needed by me, for use with invoice {1}", bookingNumber, invoiceNumber);
You may even want to get the template text from a resource bundle, to allow for support of multiple languages, with the added ability to cope with the fact that {0} and {1} may appear in a different order in some languages.
UPDATE:
I just read your original post properly, and noticed the comment about the PDF.
This suggests that the template text is going to be significantly larger than a line or two.
In such cases, you may want to explore something like StringTemplate which seems better suited for this purpose - this comment is based solely on initial investigations, as I've not used it in anger.
I have used a similar replacement token system before. I personally like something like:
[MYVALUE]
As it is easy for the user to type, and then I just use replacements to swap out the tokens for the real data.