Just wanted to know how you would do it.
I have a web service that lets me complete a user's address while they are typing it.
When the suggestions are shown, I'd like the part of the suggestion label that matches the user input to be surrounded with bold tags.
I want the matching to be clever, not just a simple search/replace, since the web service we use is clever too (but I don't have that code).
For example:
Input: 3 OxFôr sTrE
Ws result: 3 Oxford Street
Formatted: <b>3 Oxford Stre</b>et
I can do it in JS or Java.
I'd rather do it in JS but with Java perhaps Lucene can help?
Do you see how it can be handled?
Index your text as n-grams, using a search engine or a custom data structure. I am implementing auto-recommendation by indexing around 1 billion query words as n-grams and then, at display time, sorting them by the frequency of each typed query. Lucene/Solr can help you here: its highlighting encloses the matched parts in tags by default (as you asked), and you can also use the n-gram indexing that Lucene/Solr provides.
LinkedIn Engineering recently open sourced Cleo (the open source technology behind LinkedIn's typeahead search).
Great stuff by LinkedIn. Check it out for the clever matching and highlighting you are after.
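For the simple case, where the input is a prefix of the suggestion apart from case and accents, something like the following might be enough without any search engine. This is only a minimal sketch: `PrefixHighlighter`, `fold` and `highlight` are names made up for illustration, and it does not tolerate typos (the missing "d" in the question's sample input would not match) or escape HTML in the suggestion.

```java
import java.text.Normalizer;
import java.util.Locale;

final class PrefixHighlighter {

    // Folds text for comparison: decompose accents, drop the combining marks, lower-case.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) != Character.NON_SPACING_MARK) {
                sb.appendCodePoint(Character.toLowerCase(cp));
            }
        }
        return sb.toString();
    }

    // Wraps the leading part of the suggestion that matches the folded input in <b> tags.
    // Returns the suggestion unchanged when the input is not a folded prefix of it.
    static String highlight(String input, String suggestion) {
        String needle = fold(input.trim());
        if (needle.isEmpty()) {
            return suggestion;
        }
        StringBuilder folded = new StringBuilder();
        int end = 0;
        // Consume the suggestion one code point at a time until enough folded text is collected.
        while (end < suggestion.length() && folded.length() < needle.length()) {
            int cp = suggestion.codePointAt(end);
            end += Character.charCount(cp);
            folded.append(fold(new String(Character.toChars(cp))));
        }
        if (!folded.toString().equals(needle)) {
            return suggestion;
        }
        return "<b>" + suggestion.substring(0, end) + "</b>" + suggestion.substring(end);
    }

    public static void main(String[] args) {
        // \u00f4 is "o" with a circumflex, written as an escape to avoid source-encoding issues.
        System.out.println(highlight("3 OxF\u00f4rd sTrE", "3 Oxford Street"));
    }
}
```

Walking the original suggestion while folding it keeps the highlight boundary in terms of the original characters, so the label is displayed exactly as the web service returned it.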
I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the original content.
I have to store the index of the previously existing markups.
So here's an example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a string of 35 characters, marked up with a strong tag. As we know, HTML markup has a start and an end, and if we treat the start and end tags as sequences of characters, each of them also has a start and an end (a character index).
Again, in the previous example, the beginning index of the opening tag is 5 (counting from 0), and its end index is 13. The same logic applies to the closing tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
Given this sequence, how can I remember the places where the markup has to be inserted again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag in the X position/index, and the closing tag in the Y position/index... Like so:
This is a message.
5 9
index 5 = <strong>
index 9 = </strong>
I must remember that it is possible to find the following situation:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best example (the best solution) in the world, but only that it works the right way for any kind of situation.
If anything in my question is unclear, please comment and I will explain it better.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I do not mind whether the solution uses regular expressions or not; I just want to solve the problem, no matter how (but, of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that gives the same result. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, Thanks in advance.
EDIT 2
I'm adding this edit to explain where my problem comes from (as asked). Before I start, I want to say that I have never done anything like this before and it is outside my area, so my approach may not be the most appropriate one. Anyway...
I'm developing a site where users are allowed to read content but cannot edit it (edit or remove text). However, users can still mark/highlight excerpts (ranges) of the content (with some styling). That is the big picture.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save the styled text to a database, but this would take up a lot of space, since every client is allowed to do this and there are many clients. So if a client marks snippets of a paragraph, saving the whole paragraph back to the database for each client in the system is costly in terms of storage.
So I thought of saving just the ranges (indexes) of the markup made by users in the database. It is much easier to save a few numbers than all the text with the required styling. For example, I could save a row in a table that says:
In paragraph X, from index Y to index Z, user P applied style ABC.
This would require a translation / conversion, from database to HTML, and HTML to database. Setting a converter can be easy (I guess), but I do not know how to get the indexes (following this logic). And then we stop again at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
In a similar way, I know it would be ideal to do this processing with JavaScript (client-side). It turns out that I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is already ready, easy and quick to use. Would you know of any ready-made, easy-to-use library that can do it in a simple way? Does it exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but in short, HTML isn't a regular language because it has a recursive structure.
A language that allows arbitrarily deep nesting isn't regular, so you can't parse it with a regex.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
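For the index bookkeeping itself (remembering where each tag sat in the tag-free text), a character scan may be enough, even for the badly nested example in the question, because nothing needs to be matched up. The following is a minimal sketch, not a real HTML parser: `MarkupOffsets`, `Tag`, `strip` and `restore` are made-up names, and it assumes every `<` that is followed by a `>` starts a tag, so comments, scripts and `>` inside attribute values are not handled. Entities such as `&amp;` are counted as raw characters.

```java
import java.util.ArrayList;
import java.util.List;

final class MarkupOffsets {

    /** A removed tag and the index in the tag-free text where it has to be put back. */
    static final class Tag {
        final int textIndex;
        final String markup;

        Tag(int textIndex, String markup) {
            this.textIndex = textIndex;
            this.markup = markup;
        }
    }

    /** Removes every <...> run from html and appends each one to tags in document order. */
    static String strip(String html, List<Tag> tags) {
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            int close = html.charAt(i) == '<' ? html.indexOf('>', i) : -1;
            if (close >= 0) {
                // The tag's position is the length of the plain text collected so far.
                tags.add(new Tag(text.length(), html.substring(i, close + 1)));
                i = close + 1;
            } else {
                text.append(html.charAt(i++));
            }
        }
        return text.toString();
    }

    /** Re-inserts the tags; they must be in the order in which strip recorded them. */
    static String restore(String text, List<Tag> tags) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Tag tag : tags) {
            out.append(text, pos, tag.textIndex).append(tag.markup);
            pos = tag.textIndex;
        }
        return out.append(text.substring(pos)).toString();
    }

    public static void main(String[] args) {
        List<Tag> tags = new ArrayList<>();
        String text = strip("This <strong>is a</strong> message.", tags);
        System.out.println(text);
        for (Tag tag : tags) {
            System.out.println("index " + tag.textIndex + " = " + tag.markup);
        }
        System.out.println(restore(text, tags));
    }
}
```

Because several tags can share the same text index, the recorded order matters; storing the list position alongside the index in the database would keep a round trip exact.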
I want to write a script that checks a document for keywords and identifies the HTML document nodes in which they are contained (possibly assigning each a unique identifier).
I am not a professional programmer and do not know the strengths of low-level languages and things such as PLO. I'm afraid of building something very bad and unsupported.
How is it possible to isolate the desired nodes?
My experience is with JS and PHP (PHP only for very simple things). Also, I do not want to rely on working with nodes in JS. My thoughts:
Turn the HTML into a string.
Verify that the words exist on the page.
If a word exists on the page, get the first and last positions of each node in the body element (for example, when we see an opening tag we know the position of each character, so we can calculate the first position, where the tag is opened, and the last, where it is closed; and so on for all nodes).
We know the position of the word (e.g. 192, 199) and check which range it falls in (in this case, the ranges are the nodes of the HTML document).
I need ideas from experienced programmers. It does not matter which language you program in (except for web-oriented ones); every opinion is important to me. It is likely that there are libraries that solve such problems. I very much hope that you will understand me. English is not my native language.
I always recommend Beautiful Soup for this kind of thing. It is a Python library that lets you parse XML/HTML documents really quickly. I would have thought you could fairly quickly get something running that extracts the text from each div element. Then, using Python's built-in string manipulation tools, searching for particular words should be fairly simple.
You need to use an HTML parser. Refer to "Which HTML Parser is the best?"
After that, use XPath to extract whichever nodes you need.
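As a rough illustration of the parser-plus-XPath idea, here is a sketch that uses only the XML parser and XPath engine shipped with the JDK. It assumes the page is well-formed XML without a namespace declaration; real-world HTML would need a lenient HTML parser in front of it. `KeywordNodes`, `pathsContaining` and `pathOf` are made-up names, and the positional path is just one possible choice of unique identifier.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class KeywordNodes {

    /** Returns a positional path for every element whose own text contains the keyword. */
    static List<String> pathsContaining(String xhtml, String keyword) {
        if (keyword.contains("'")) {
            throw new IllegalArgumentException("this sketch does not escape quotes in the keyword");
        }
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Elements that have at least one direct text-node child containing the keyword.
            String expression = "//*[text()[contains(., '" + keyword + "')]]";
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> paths = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                paths.add(pathOf(hits.item(i)));
            }
            return paths;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse or query the document", e);
        }
    }

    /** Builds a path such as /html[1]/body[1]/p[2] by counting same-named preceding siblings. */
    static String pathOf(Node node) {
        StringBuilder path = new StringBuilder();
        for (Node n = node; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            int index = 1;
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><div><p>red apple</p><p>green pear</p></div>"
                + "<p>an apple a day</p></body></html>";
        System.out.println(pathsContaining(page, "apple"));
    }
}
```

Letting the parser locate the nodes avoids the manual position arithmetic described in the question; the keyword match here is case-sensitive.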
I have been working on information extraction and was able to run standAloneAnnie.java
http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/StandAloneAnnie.java
My question is: how can I use GATE ANNIE to get similar words? For example, if I input (dine), I would like to get results like (food, eat, dinner, restaurant).
More Information:
I am doing a project where I was assigned to develop a simple web page that takes user input and passes it to GATE components, which tokenize the query and return a semantic grouping for each phrase in order to make some recommendation.
For example, the user would enter "I want to have dinner in Kuala Lumpur" and the system would break it down to (Search for: dinner - Required: restaurant, dinner, eat, food - Location: Kuala Lumpur).
By default ANNIE has about 15 annotations; see the demo:
http://services.gate.ac.uk/annie/
I have already implemented everything shown in the demo, but my question is: can I do that using GATE ANNIE? I mean, is it possible to find synonyms of words, or to group words based on their type (nouns, verbs)?
Plain vanilla ANNIE doesn't support this kind of thing but there are third party plugins such as Phil Gooch's WordNet Suggester that might help. Or if your domain is fairly restricted you might get better results with less effort by simply creating your own gazetteer lists of related terms and a few simple JAPE rules. You may find the training materials available on the GATE Wiki useful if you haven't done much of this before.
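To make the "own lists of related terms" idea concrete, here is a sketch of the lookup in plain Java. It does not use GATE's gazetteer or JAPE at all; `RelatedTerms`, `addGroup` and `related` are made-up names, and the "dining" group is only sample data taken from the question.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class RelatedTerms {

    private final Map<String, String> termToGroup = new HashMap<>();
    private final Map<String, Set<String>> groupToTerms = new HashMap<>();

    /** Registers a hand-made list of terms that should be treated as related. */
    void addGroup(String group, String... terms) {
        Set<String> members = groupToTerms.computeIfAbsent(group, g -> new LinkedHashSet<>());
        for (String term : terms) {
            String key = term.toLowerCase(Locale.ROOT);
            members.add(key);
            termToGroup.put(key, group);
        }
    }

    /** Returns the other terms in the word's group, or an empty set for unknown words. */
    Set<String> related(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        String group = termToGroup.get(key);
        if (group == null) {
            return Collections.emptySet();
        }
        Set<String> result = new LinkedHashSet<>(groupToTerms.get(group));
        result.remove(key);
        return result;
    }

    public static void main(String[] args) {
        RelatedTerms lists = new RelatedTerms();
        lists.addGroup("dining", "dine", "food", "eat", "dinner", "restaurant");
        System.out.println(lists.related("dine"));
    }
}
```

In this simple form each term belongs to at most one group; a later `addGroup` call that reuses a term would move its lookup to the newer group.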
Currently, I am using Lucene version 3.0.2 to create a search application that is similar to a dictionary. One of the objects that I want to display is a sort of "example", where Lucene would look for a word in a book and then the sentences where the words were used are displayed.
I've been reading the Lucene in Action book and it mentions something like this, but looking through it I can't find other mentions. Is this something you can do with Lucene? If it is, how can you do it?
I believe what you are looking for is a Highlighter.
One possibility is to use the org.apache.lucene.search.highlight package, specifically the Highlighter.
Another option is to use the org.apache.lucene.search.vectorhighlight package, specifically the FastVectorHighlighter.
Both classes search a text document, choose relevant snippets and display them with the matching terms highlighted. I have only used the first one, which worked fine for my use-case. If you can pre-divide the book into shorter parts, it would make highlighting faster.
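To show what "pre-divide the book and pick the relevant snippets" amounts to, here is a plain-Java stand-in. It is not Lucene's Highlighter and does no analysis or scoring; it just splits the text into sentences with the JDK's `BreakIterator` and keeps those that contain the word. `ExampleSentences` and `sentencesContaining` are made-up names, and the text is not HTML-escaped.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExampleSentences {

    /** Returns every sentence that contains the word, with each occurrence wrapped in <b> tags. */
    static List<String> sentencesContaining(String text, String word) {
        // Whole-word, case-insensitive match; Pattern.quote keeps the word literal.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);

        List<String> hits = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            String sentence = text.substring(start, end).trim();
            Matcher matcher = pattern.matcher(sentence);
            if (matcher.find()) {
                hits.add(matcher.replaceAll("<b>$0</b>"));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String book = "The cat sat. A dog barked loudly. My cat is asleep.";
        for (String sentence : sentencesContaining(book, "cat")) {
            System.out.println(sentence);
        }
    }
}
```

With Lucene itself, the sentences (or paragraphs) would typically be indexed as separate documents or passed to the highlighter as the stored text, which is where the speed-up from pre-dividing comes from.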
Given:
A text (optionally with HTML tags)
a database table with abbreviations and acronyms (like "etc.", "s.o.", ...)
Goals:
Build a parser that finds all occurrences in the given text
Build a small GUI to let the user choose whether a found occurrence is a real match (this will be Swing, as required)
User has the option to ignore a match (must also be marked as "to be ignored")
Replace any accepted occurrence with a special XML construct
My main problem is the parser, I've mentioned the GUI just for giving a complete overview.
The task is to build a parser that analyzes the text for, e.g., an acronym and marks it for later postprocessing. Any "mark" must be in the form of XML tags, as the surrounding environment does not accept anything else (we are in a DOM editor of a CMS that ends with "Spirit" ;) ).
Does anybody have a hint for a library, or has anybody built something like this? How did you, or would you, handle things like:
Two or more words are one entity
fullstop - part of the sentence or part of the token you are looking for
iterative replacement - user accepts the first occurrence - instant replace or buffering?
Any idea, library hint, wikipedia article, whatever - is helpful. I didn't find any related question that answered all of the aspects mentioned above.
I've read many good things about Apache Lucene and I'd look at it first if I had a similar project. It can index the source document and help to find all occurrences of your acronyms (that's what you want as a result from the 'parsing' step, if I got it right).
Use a SAX parser of some sort that runs on the input. For every hit you pause the parsing, show it in the GUI and let the user choose what to do. While parsing, you build a DOM tree in the background.
Every time the user replaces something, you replace the given element in that DOM tree (you know which one it is, since you're holding the element that the user needs to react to).
When the whole thing is parsed and replaced you simply print out the DOM tree.
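A simplified sketch of the replace-in-the-DOM step is shown below. It departs from the description above in that it parses straight into a DOM with the JDK's `DocumentBuilder` rather than pausing a SAX parse, and the `accept` predicate is only a hypothetical stand-in for the user's decision in the GUI. `AbbreviationMarker` and `mark` are made-up names, and the input is assumed to be well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class AbbreviationMarker {

    /** Wraps every accepted abbreviation occurrence in an element named tagName. */
    static String mark(String xml, List<String> abbreviations, String tagName, Predicate<String> accept) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Collect the text nodes first so the tree is not modified while it is being walked.
            List<Text> texts = new ArrayList<>();
            collect(doc.getDocumentElement(), texts);
            for (Text text : texts) {
                wrap(text, abbreviations, tagName, accept);
            }
            Transformer printer = TransformerFactory.newInstance().newTransformer();
            printer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            printer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process the document", e);
        }
    }

    private static void collect(Node parent, List<Text> out) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                out.add((Text) child);
            } else {
                collect(child, out);
            }
        }
    }

    private static void wrap(Text node, List<String> abbreviations, String tagName, Predicate<String> accept) {
        Text current = node;
        while (true) {
            String data = current.getData();
            int best = -1;
            String hit = null;
            // Earliest occurrence wins; on a tie the longer abbreviation is preferred.
            for (String abbreviation : abbreviations) {
                int at = data.indexOf(abbreviation);
                if (at >= 0 && (best < 0 || at < best
                        || (at == best && abbreviation.length() > hit.length()))) {
                    best = at;
                    hit = abbreviation;
                }
            }
            if (hit == null) {
                return;
            }
            Text match = current.splitText(best);      // text before the hit stays in current
            Text rest = match.splitText(hit.length()); // text after the hit
            if (accept.test(hit)) {
                Element wrapper = match.getOwnerDocument().createElement(tagName);
                match.getParentNode().replaceChild(wrapper, match);
                wrapper.appendChild(match);
            }
            current = rest;
        }
    }

    public static void main(String[] args) {
        List<String> abbreviations = new ArrayList<>();
        abbreviations.add("etc.");
        System.out.println(mark("<p>Bring pens, paper, etc. to class.</p>", abbreviations, "abbr", a -> true));
    }
}
```

Because the abbreviation list is matched as literal strings, multi-word entries and entries that end in a full stop are found as they are written; deciding whether a trailing full stop also ends the sentence is left to the user's accept/ignore choice.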