How is web browser search implemented? - java

I want to implement, in a Java desktop application, searching and highlighting of multiple phrases in HTML files, the way web browsers do it: HTML tags (between < and >) are ignored for matching, but not all tags are equal. An inline tag like <b> does not interrupt a phrase, while a tag like <p> does. For example, when searching for "each table", the text ...each <b>table</b> has name... will be highlighted, but the text ...has each</p><p> Table is... will not be highlighted, because the <p> tag interrupts the text.
This is somehow implemented in web browsers. How can I get at that implementation? Or is there some source on the net? I tried Google, but without success :(

Instead of searching inside the actual HTML file, browsers search the rendered output of that HTML.
Get a suitable HTML renderer and get its output as text. Then search that text output using an appropriate string searching algorithm.
The example in your question would produce a newline character in the rendered output at the </p><p> boundary, so a normal string searching algorithm will behave as you expect.
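As a rough illustration of that idea, here is a minimal sketch in plain Java. It is not how any browser actually does it: it does not render anything, it only drops inline tags and turns a hand-picked set of block-level tags into newlines before searching. The class name RenderedTextSearch, the BLOCK_TAGS list and the case-insensitive comparison are all assumptions made for the example, and the scanner assumes no '>' appears inside attribute values.

```java
import java.util.Locale;
import java.util.Set;

class RenderedTextSearch {
    // Assumption: a small, hand-picked set of tags that interrupt the text flow.
    private static final Set<String> BLOCK_TAGS = Set.of(
            "p", "br", "div", "li", "tr", "td", "th", "table", "ul", "ol", "h1", "h2", "h3");

    /** Drops inline tags and replaces block-level tags with '\n'. */
    static String toText(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close < 0) {              // ordinary character (or a stray '<')
                out.append(c);
                i++;
                continue;
            }
            if (BLOCK_TAGS.contains(tagName(html.substring(i + 1, close)))) {
                out.append('\n');         // a block tag breaks the phrase
            }
            i = close + 1;                // skip the whole tag
        }
        return out.toString();
    }

    /** Extracts "p" from "p", "/p" or "p class=...". */
    private static String tagName(String inside) {
        String s = inside.trim();
        if (s.startsWith("/")) {
            s = s.substring(1);
        }
        int end = 0;
        while (end < s.length() && Character.isLetterOrDigit(s.charAt(end))) {
            end++;
        }
        return s.substring(0, end).toLowerCase(Locale.ROOT);
    }

    /** Case-insensitive phrase search on the extracted text (an assumption). */
    static boolean containsPhrase(String html, String phrase) {
        return toText(html).toLowerCase(Locale.ROOT)
                .contains(phrase.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("...each <b>table</b> has name...", "each table")); // true
        System.out.println(containsPhrase("...has each</p><p> Table is...", "each table"));   // false
    }
}
```

To highlight the hits in the original HTML you would additionally have to remember, for each character of the extracted text, the offset it came from in the source. That bookkeeping is left out here.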

As Faisal stated, browsers search in rendered content only. For doing so you'll need to remove the HTML tags before doing the actual search:
This code might help you:
http://www.dotnetperls.com/remove-html-tags
Of course you'll need to add some checks/exclusions for script tags and other content that the browser does not render.

This seems pretty easy.
1) Search for the last word in the string.
2) Look at what's before the last word.
3) Decide if what's before the last word constitutes an interruption (<p>, <br />, <div>).
4) If interruption, continue
5) Else evaluate previous word against the search query.
I don't know if this is how browsers perform this operation, but this approach should work.

Try using the javax.swing.text.html package in Java.

Related

How to get all the texts displayed on a Web Page using Robot Framework?

I'm using Robot Framework to automate tests. It uses Selenium2Library and can be extended with many libraries (Java, Python, AngularJS, etc.).
Here's my question.
Is there a way to get all the texts displayed on a page?
I can get any specific text by the element locator, but currently I need to write a function which gets all the texts displayed on the page.
Does anyone know a way? Or a hint how to get things going?
You can do that by getting the text content of the <body> tag:
${text}=    Get Text    //body
Log    ${text}    # a very long string, with newlines as delimiters between the different tags
${text as list}=    Split To Lines    ${text}
Log    ${text as list}    # a list, each member is the different tag's text
Another way (which does not work with Selenium) is to go after each element, with a locator like //body//*, producing webelements with Get Webelements on it.
But when you call Get Text on each produced webelement, it returns that element's text plus the text of all its children, thus duplicating the data. Getting only an element's own text can be done in pure xpath/xslt (with text(), . and normalize-space()), but regretfully not through webdriver/selenium (it always expects a node as argument).
The purpose of that ^ de-tour from the answer was to present the outcome of a 2 minute research :), and to get any feedback from someone that might have actually accomplished it with Get Text on each element of the page.

Is it possible to remove tags (or sequences) and relate or remember them as indexes?

I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the original content.
I have to store the index of the previously existing markups.
So here's an example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a String sequence with 35 characters, marked up with a strong tag. As we know, an HTML markup has a start and an end, and if we interpret the start and end markup as a sequence of characters, each also has a start and an end (a character index).
Again, in the previous example, the beginning index of the open/start tag is 5 (counting from index 0), and the end index is 13. The same logic goes for the close tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
Given this stripped sequence, how can I remember the places where the markup has to be inserted again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag in the X position/index, and the closing tag in the Y position/index... Like so:
This is a message.
     5   9
index 5 = <strong>
index 9 = </strong>
I should point out that the following situation (overlapping, improperly nested tags) is also possible:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best example (the best solution) in the world, but only that it works the right way for any kind of situation.
If anything in my question is unclear, please comment and I will improve it.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I do not care whether or not I use regular expressions to solve this problem, I just want to solve it, no matter how (but of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that gives the same result. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, Thanks in advance.
EDIT 2
I'm making this edit now to explain the applicability of my problem (as asked). Well, before I start, I want to say that what I'm trying to do is something I've never done before and it's not in my area, so it may not be the most appropriate way to do it. Anyway...
I'm developing a site where users are allowed to read content but can not edit it (edit or remove text). However, users can still mark/highlight excerpts (ranges) of the content present (with some stylization). This is the big summary.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save the stylized text to a database, but this would take up a lot of space, since every client is allowed to do this and there are many clients. So if a client marks snippets of a paragraph, saving the whole paragraph back to the database for each client in the system is somewhat costly in terms of storage.
So I thought of just saving the range (indexes) of the markups made by users in a database. It is much easier to save just a few numbers than all the text with the styling required. In the case, for example, I could save a line / record in a table that says:
In paragraph X, from index Y to index Z, user P applied style ABC.
This would require a translation / conversion, from database to HTML, and HTML to database. Setting a converter can be easy (I guess), but I do not know how to get the indexes (following this logic). And then we stop again at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
In a similar way, I know it would be ideal to do this processing with JavaScript (client-side). It turns out that I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is already ready, easy and quick to use. Would you know of any ready-made, easy-to-use library that can do it in a simple way? Does it exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but in short HTML isn't a regular language because it has a recursive structure.
A language that allows arbitrarily deep nesting cannot be recognized by a finite automaton, so it isn't regular, and you can't parse it with a (true) regular expression.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
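For the narrower task in the question (remove the tags, remember where they were, put them back), a full parser is not strictly required, because nothing has to be understood about nesting. The following is a minimal sketch under stated assumptions, not a general HTML solution: it scans the string once by hand, records each removed tag together with the index it occupied in the plain text, and re-inserts the tags in recorded order. The names TagIndexer, Mark and Stripped are made up for this example, and it assumes that '<' and '>' only ever delimit tags (no '>' inside attribute values, no comments or scripts).

```java
import java.util.ArrayList;
import java.util.List;

class TagIndexer {
    /** A removed tag and the index in the plain text where it used to sit. */
    static class Mark {
        final int index;
        final String tag;
        Mark(int index, String tag) {
            this.index = index;
            this.tag = tag;
        }
    }

    /** The plain text plus the removed tags, in document order. */
    static class Stripped {
        final String text;
        final List<Mark> marks;
        Stripped(String text, List<Mark> marks) {
            this.text = text;
            this.marks = marks;
        }
    }

    static Stripped strip(String html) {
        StringBuilder text = new StringBuilder();
        List<Mark> marks = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int close = (html.charAt(i) == '<') ? html.indexOf('>', i) : -1;
            if (close < 0) {
                text.append(html.charAt(i));   // ordinary character
                i++;
            } else {
                // The tag sits at the current length of the plain text.
                marks.add(new Mark(text.length(), html.substring(i, close + 1)));
                i = close + 1;
            }
        }
        return new Stripped(text.toString(), marks);
    }

    static String restore(Stripped s) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Mark m : s.marks) {               // marks are already sorted by index
            out.append(s.text, pos, m.index);
            out.append(m.tag);
            pos = m.index;
        }
        out.append(s.text, pos, s.text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Stripped s = strip("This <strong>is a</strong> message.");
        System.out.println(s.text);            // This is a message.
        for (Mark m : s.marks) {
            System.out.println("index " + m.index + " = " + m.tag);
        }
        System.out.println(restore(s));
    }
}
```

Because tags are restored in the order they were recorded, the overlapping example from the question round-trips unchanged as well. Storing user highlights would then mean persisting (index, tag) pairs per user instead of the full marked-up paragraph.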

How to abbreviate HTML with Java?

A user enters text as HTML in a form, for example:
<p>this is my <strong>blog</strong> post,
very <i>long</i> and written in <b>HTML</b></p>
I want to be able to output only a part of the string ( for example the first 20 characters ) without breaking the HTML structure of the user's input. In this case:
<p>this is my <strong>blog</strong> post, very <i>l</i>...</p>
which renders as
this is my <strong>blog</strong> post, very <i>lo</i>...
Is there a Java library able to do this, or a simple method to use?
MyLibrary.abbreviateHTML(string,20) ?
Since it's not very easy to do this correctly I usually strip all tags and truncate. This gives great control on the text size and appearance which usually needs to be placed in places where you do need control.
Note that you may find my proposal very conservative and it actually is not a proper answer to your question. But most of the times the alternatives are:
strip all tags and truncate
provide an alternate content manageable rich text which will serve as the truncated text. This of course only works in the case of CMSes etc
The reason that truncating HTML would be hard is that you don't know how truncating would affect the structure of the HTML. How would you truncate in the middle of a <ul> or, even worse, in the middle of a complex <table>?
So the problem here is that HTML can contain not only content and styling (bold, italics) but also structure (lists, tables, divs etc). So a good and safe implementation would be to strip everything out apart from inline "styling" tags (bold, italics etc) and truncate while keeping track of unclosed tags.
I don't know any library but it should not be so complicated (for 80% of cases).
You only need a simple "parser" that understands 4 types of tokens:
opening tags - everything that starts with < but not </ and ends with > but not />
closing tags - everything that starts with </ and ends with >
self-closing tags (like <br/>) - everything that starts with < but not </ and ends with />
normal character - everything that is none of the other types
Then you must walk through your input string and count the "normal characters". As you walk along the string and count, you copy every token to the output as long as the number of normal characters counted is less than or equal to the amount you want to have.
You also need to build a stack of currently open tags while you walk through the input. Every time you walk through an "opening tag" you push it (its name) onto the stack; every time you find a closing tag, you remove the topmost tag name from the stack (hopefully the input is correct XHTML).
When you reach the required amount of normal chars, you only need to write closing HTML tags for the tag names remaining on the stack.
But be careful, this works only if the input is well-formed XML.
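The token-and-stack procedure described above can be sketched roughly like this. It is only an illustration under the same caveat: it assumes well-formed, XML-like input with no '>' inside attribute values, and the names HtmlAbbreviator and abbreviate are hypothetical, not taken from any library.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class HtmlAbbreviator {
    /** Keeps at most maxChars "normal" characters and closes any tags left open. */
    static String abbreviate(String html, int maxChars) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // names of currently open tags
        int count = 0;                             // normal characters copied so far
        int i = 0;
        while (i < html.length() && count < maxChars) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close < 0) {                       // normal character
                out.append(c);
                count++;
                i++;
                continue;
            }
            String tag = html.substring(i, close + 1);
            if (tag.startsWith("</")) {            // closing tag
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if (!tag.endsWith("/>")) {      // opening tag (not self-closing)
                open.push(tagName(tag));
            }
            out.append(tag);                       // tags do not count as characters
            i = close + 1;
        }
        while (!open.isEmpty()) {                  // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    /** Extracts "b" from "<b>" or "<b attribute=...>". */
    private static String tagName(String tag) {
        int end = 1;
        while (end < tag.length() && Character.isLetterOrDigit(tag.charAt(end))) {
            end++;
        }
        return tag.substring(1, end);
    }

    public static void main(String[] args) {
        String html = "<p>this is my <strong>blog</strong> post,\n"
                + "very <i>long</i> and written in <b>HTML</b></p>";
        // prints: <p>this is my <strong>blog</strong> post</p>
        System.out.println(abbreviate(html, 20));
    }
}
```

Appending an ellipsis before the closing tags, or cutting at a word boundary instead of mid-word, would be small additions on top of this. Input that is not well-formed would first need to go through a tolerant HTML parser or cleaner.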
I don't know what you want to do with this piece of code, but you should pay attention to HTML/JavaScript injection attacks.
If you really want to abbreviate HTML then just do it (cut the text at desired length), pass the abbreviated result through http://jtidy.sourceforge.net/ and hope for the best.
It seems that there are a lot of libs and tools for this common task:
truncateNicely from Jakarta Taglibs String (Jakarta Taglibs has been retired)
org.displaytag.util.HtmlTagUtil#abbreviateHtmlString from Display tag library 1.2 (already mentioned by Marnix van Bochove in his comment.)

Avoid spaceless concatenation with JSoup

Suppose I have a div as such:
<div>
This is a paragraph
written by someone
on the internet.
</div>
The problem is that when JSoup parses this, it puts it all on one line, so that when I call text() it reads as such:
This is a paragraphwritten by someoneon the internet.
Now, I realize this isn't really a JSoup problem, in that the actual html doesn't contain a space. However, is there any way to use JSoup (perhaps some override or maybe an option I haven't seen) so that as it parses it will add a space between lines? I imagine it must be possible (as I can inspect element in Chrome and unselect word wrap and it gets what I want) but I'm not sure JSoup can do this.
Any thoughts?
Can you provide a full example of your code? What version of jsoup are you using?
In the current version (1.6.1), this code:
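import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Parse the multi-line div and print its text content
Document doc = Jsoup.parse("<div>\n" +
        "This is a paragraph\n" +
        "written by someone\n" +
        "on the internet.\n" +
        "</div>");
System.out.println(doc.text());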
Produces:
This is a paragraph written by someone on the internet.
I.e., \n (and \r\n etc) are converted to text as spaces.
Happy to fix or improve it, if I can replicate :)
The following post shows how to get everything, including the line breaks:
Removing HTML entities while preserving line breaks with JSoup
The answer in the following post, and the comment on it, show another way:
Remove HTML tags from a String
And this one has yet another way, if you check all the answers and the comments:
How do I preserve line breaks when using jsoup to convert html to plain text?

Text Processing - Detecting if you are inside an HTML tag in Java

I have a program that does text processing on an HTML-formatted document based on information from the same document without the HTML. Basically, I locate a word or phrase in the unformatted document, then find the corresponding word in the formatted document and alter the appearance of the word or phrase using HTML tags to make it stick out (e.g. bold it or change its color).
Here is my problem. Occasionally, I want to do formatting to a word or phrase which might be part of a html tag (for example perhaps I want to do some formatting to the word "font" but only if is a word that is not inside an html tag). Is there an easy way to detect whether a string is part of an html tag in a block of text or not?
By the way, I can't just strip out the html tags in the document and do my processing on the remaining text because I need to preserve the html in the result. I need to add to the existing html but I need to reliably distinguish between strings that are part of tags and strings that are not.
Any ideas?
Thank you,
Elliott
You could do a few things
Write a regular expression for what you're doing. There are plenty of prewritten ones you can find on Google
Find a library to parse the document (e.g., http://htmlparser.sourceforge.net/) and only replace text
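The first is likely to be the fastest and easiest, but the second will be more reliable.
Use the following regex to detect whether a string contains HTML tags: "<.*?>" (the angle brackets need no escaping, and "\<" would not even compile inside a Java string literal)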
And here you can learn how to effectively use regex in your java code.
Happy coding ;)
If you are doing this properly, you have already parsed the document into a DOM. Then ask for the parent of the current node, and keep walking up as long as the parent is not the tag you are looking for.
If you use some custom search or regex to parse HTML, then check the best answer to this question:
RegEx match open tags except XHTML self-contained tags (It has +4000 upvotes for a reason)
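For the specific check asked about (is this occurrence of a word part of a tag or part of the text?), a simple scan can be enough when the markup is reasonably clean. Below is a minimal sketch, assuming that '<' and '>' appear only as tag delimiters (they are escaped as &lt; and &gt; in text, and attribute values contain no '>'). The class name TagAwareHighlighter and the choice of <b> for highlighting are made up for the example. A real HTML parser remains the more reliable route.

```java
class TagAwareHighlighter {
    /**
     * True if the character at 'index' lies inside a tag, i.e. the nearest
     * '<' to its left is closer than the nearest '>' to its left.
     */
    static boolean isInsideTag(String html, int index) {
        return html.lastIndexOf('<', index) > html.lastIndexOf('>', index);
    }

    /** Wraps every occurrence of 'word' that is not inside a tag in <b>...</b>. */
    static String highlight(String html, String word) {
        if (word.isEmpty()) {
            return html;                       // nothing to search for
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit;
        while ((hit = html.indexOf(word, from)) >= 0) {
            out.append(html, from, hit);       // copy text before the hit
            if (isInsideTag(html, hit)) {
                out.append(word);              // part of a tag: leave untouched
            } else {
                out.append("<b>").append(word).append("</b>");
            }
            from = hit + word.length();
        }
        out.append(html, from, html.length()); // copy the remainder
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<font color=\"red\">font size</font>";
        // prints: <font color="red"><b>font</b> size</font>
        System.out.println(highlight(html, "font"));
    }
}
```

This matches plain substrings only, so it will also hit "font" inside "fonts", and it will not find a phrase that is split by an inline tag. Both limitations would need extra handling.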
