A user enters text as HTML in a form, for example:
<p>this is my <strong>blog</strong> post,
very <i>long</i> and written in <b>HTML</b></p>
I want to output only part of the string (for example, the first 20 characters) without breaking the HTML structure of the user's input. In this case:
<p>this is my <strong>blog</strong> post, very <i>l</i>...</p>
which renders as
this is my <strong>blog</strong> post, very <i>l</i>...
Is there a Java library able to do this, or a simple method to use?
MyLibrary.abbreviateHTML(string,20) ?
Since this is not easy to do correctly, I usually strip all tags and then truncate. That gives full control over the length and appearance of the text, which usually ends up in places where you do need that control.
Note that you may find my proposal very conservative, and it is not really a proper answer to your question. But most of the time the alternatives are:
strip all tags and truncate
provide a separate, content-managed rich-text field that serves as the truncated text. This of course only works for CMSes and similar systems
The reason truncating HTML is hard is that you don't know how the truncation will affect the structure of the HTML. How would you truncate in the middle of a <ul> or, even worse, in the middle of a complex <table>?
So the problem here is that HTML can contain not only content and styling (bold, italics) but also structure (lists, tables, divs etc). A good and safe implementation would therefore strip everything apart from inline "styling" tags (bold, italics etc) and truncate while keeping track of unclosed tags.
I don't know of any library, but it should not be that complicated (for 80% of cases).
You only need a simple "parser" that understands 4 types of tokens:
opening tags - everything that starts with < but not </ and ends with > but not />
closing tags - everything that starts with </ and ends with >
self-closing tags (like <br/>) - everything that starts with < but not </ and ends with />
normal characters - everything that is none of the other types
Then you walk through your input string and count the "normal characters". While you walk along the string and count, you copy every token to the output as long as the number of normal characters counted is less than or equal to the amount you want to have.
You also need to build a stack of currently open tags while you walk through the input. Every time you pass an "opening tag", you push its name onto the stack; every time you find a closing tag, you remove the topmost tag name from the stack (hopefully the input is correct XHTML).
When you reach the required number of normal characters, you only need to write closing HTML tags for the tag names remaining on the stack.
But be careful: this works only if the input is well-formed XML.
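The walk described above could be sketched roughly as follows. This is only an illustration under the same assumption (well-formed XHTML); the class name HtmlAbbreviator and its method are made up, every '<' is assumed to start a tag that ends at the next '>', and entities such as &amp; are counted character by character, so one could be cut in half.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class HtmlAbbreviator {

    /** Copies tokens until maxChars text characters are written, then closes open tags. */
    static String abbreviate(String html, int maxChars) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // names of currently open tags
        int count = 0;                           // "normal characters" copied so far
        boolean truncated = false;
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int end = html.indexOf('>', i);
                if (end < 0) break;              // malformed input: stop copying
                String tag = html.substring(i, end + 1);
                if (tag.startsWith("</")) {
                    if (!open.isEmpty()) open.pop();   // closing tag
                } else if (!tag.endsWith("/>")) {
                    open.push(tagName(tag));           // opening tag
                }                                      // self-closing tags need no bookkeeping
                out.append(tag);
                i = end + 1;
            } else {
                if (count == maxChars) {         // limit reached and more text follows
                    truncated = true;
                    break;
                }
                out.append(c);
                count++;
                i++;
            }
        }
        if (truncated) out.append("...");
        while (!open.isEmpty()) {                // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    /** Extracts "p" from "<p>" or "a" from "<a href='x'>". */
    private static String tagName(String tag) {
        int end = 1;
        while (end < tag.length()
                && !Character.isWhitespace(tag.charAt(end))
                && tag.charAt(end) != '>') {
            end++;
        }
        return tag.substring(1, end);
    }

    public static void main(String[] args) {
        String html = "<p>this is my <strong>blog</strong> post, very <i>long</i></p>";
        System.out.println(abbreviate(html, 20));
        // prints <p>this is my <strong>blog</strong> post...</p>
    }
}
```

In this sketch the ellipsis is written inside the innermost open tag; placing it elsewhere is a matter of taste.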
I don't know what you want to do with this piece of code, but you should pay attention to HTML/JavaScript injection attacks.
If you really want to abbreviate HTML then just do it (cut the text at desired length), pass the abbreviated result through http://jtidy.sourceforge.net/ and hope for the best.
It seems that there are a lot of libs and tools for this common task:
truncateNicely from Jakarta Taglibs String (Jakarta Taglibs has been retired)
org.displaytag.util.HtmlTagUtil#abbreviateHtmlString from Display tag library 1.2 (already mentioned by Marnix van Bochove in his comment.)
I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the original content.
I have to store the index of the previously existing markups.
So here's an example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a character sequence of 35 characters, marked up with a strong tag. As we know, HTML markup has a start and an end, and if we treat the start and end tags as sequences of characters, each of them also has a start and an end (a character index).
Again, in the previous example, the beginning index of the opening tag is 5 (counting from index 0), and its end index is 13. The same logic applies to the closing tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
How can I remember, for this sequence, the places where the markup would have to be inserted again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag in the X position/index, and the closing tag in the Y position/index... Like so:
This is a message.
     5   9
index 5 = <strong>
index 9 = </strong>
I must remember that it is possible to find the following situation:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best example (the best solution) in the world, but only that it works the right way for any kind of situation.
If anyone has not understood my question, please comment and I will try to explain it better.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I do not care whether I use regular expressions or not to solve this problem; I just want to solve it, no matter how (but, of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that achieves the same result. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, Thanks in advance.
EDIT 2
I'm making this edit to explain what my problem is for (as asked). Before I start, I want to say that what I'm trying to do is something I've never done before; it's not my area, so this may not be the most appropriate way to do it. Anyway...
I'm developing a site where users are allowed to read content but cannot edit it (edit or remove text). However, users can still mark/highlight excerpts (ranges) of the content (with some styling). That is the short summary.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to allow styling of the content without text editing. I could save the styled text to a database, but this would take up a lot of space, since every client is allowed to do this and there are many clients. So if a client marks snippets of a paragraph, saving that paragraph back to the database for each client in the system is rather costly in terms of space.
So I thought of just saving the range (indexes) of the markups made by users in a database. It is much easier to save just a few numbers than all the text with the styling required. In the case, for example, I could save a line / record in a table that says:
In X paragraph, from Y to Z index, the user P defined a ABC stylization.
This would require a translation / conversion, from database to HTML, and from HTML to database. Writing a converter may be easy (I guess), but I do not know how to get the indexes (following this logic). And that brings us back to the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
Similarly, I know it would be ideal to do this processing in JavaScript (client-side). However, I do not have a specialized JavaScript team, so it needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is ready-made, easy and quick to use. Do you know of any ready-made, easy-to-use library that can do this in a simple way? Does one exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but HTML isn't a regular language because it has a recursive structure.
A language that allows arbitrarily deep nesting of matched open/close pairs is not regular, so you can't parse it with a regex.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
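For the narrower task in the question (removing tags and remembering where they were), a full parse may not be required: a single left-to-right scan can record, for each tag, the offset it had in the stripped text. Below is a minimal sketch without regular expressions. It assumes every '<' starts a tag that ends at the next '>', so comments, script content and '>' inside attribute values are not handled; the names TagStripper, Mark, strip and restore are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TagStripper {

    /** A removed tag together with its position in the stripped text. */
    static final class Mark {
        final int offset;   // index into the stripped text
        final String tag;   // the tag exactly as it appeared, e.g. "<strong>"

        Mark(int offset, String tag) {
            this.offset = offset;
            this.tag = tag;
        }
    }

    final StringBuilder text = new StringBuilder();
    final List<Mark> marks = new ArrayList<>();

    /** Splits the input into plain text plus a list of (offset, tag) marks. */
    static TagStripper strip(String html) {
        TagStripper result = new TagStripper();
        int i = 0;
        while (i < html.length()) {
            int end = html.charAt(i) == '<' ? html.indexOf('>', i) : -1;
            if (end >= 0) {
                // The tag sits where the next text character will be written.
                result.marks.add(new Mark(result.text.length(), html.substring(i, end + 1)));
                i = end + 1;
            } else {
                result.text.append(html.charAt(i++));
            }
        }
        return result;
    }

    /** Puts the tags back; working from the last mark keeps earlier offsets valid. */
    String restore() {
        StringBuilder sb = new StringBuilder(text);
        for (int k = marks.size() - 1; k >= 0; k--) {
            sb.insert(marks.get(k).offset, marks.get(k).tag);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TagStripper r = TagStripper.strip("This <strong>is a</strong> message.");
        System.out.println(r.text);                 // prints This is a message.
        for (Mark m : r.marks) {
            System.out.println("index " + m.offset + " = " + m.tag);
        }
        System.out.println(r.restore());
    }
}
```

Because the marks are kept in document order, overlapping tags like those in the question's last example are restored in their original order as well. Storing only the offsets per user, as described in the question, would then be a matter of persisting the marks list.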
I need a regular expression that can be used with replaceAll to replace all HTML tags with an empty string, except any variation of br, so that line breaks are maintained.
I found the following to replace all html tags
<\s*br\s*\[^>]
You might get some answers that claim to work.
Those answers might even work for the particular cases you try them against.
But know that regular expressions (which I'm fond of in general) are the wrong tool for the job in this case.
And as your project evolves and needs to cover more complex HTML inputs, the regular expression will get more and more convoluted, and there may well come a time when it simply cannot solve your problem anymore, period.
Do it the right way from the beginning. Use an HTML parser, not a regex.
For reference, here are some related SO posts:
Regex to match all HTML tags except <p> and </p>
Regex to replace all \n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:
If the HTML is known to be valid, then you can use this regex (case-insensitive):
<(?!br\b)/?[a-z]([^"'>]|"[^"]*"|'[^']*')*>
but it can fail in interesting ways if you give it invalid HTML. Also, I took "HTML tags" pretty literally; the above won't cover <!-- HTML comments --> and <!DOCTYPE declarations>, and won't convert <![CDATA[ blocks ]]> and &entity;s to plain text.
It's probably better to take a step back, think about why you want to strip out these HTML tags — that is, what you're actually trying to achieve — and then find an HTML-handling library that offers a better way to achieve that goal. HTML cleaning is really a solved problem; you shouldn't need to reinvent it.
UPDATE: I've just realized that, even for valid HTML, the above has some major limitations. For example, it will mishandle something like <!--<yes--> (converting it to just <!--), and also something like <script><foo></script> (since HTML proper has a small number of tags with CDATA content, that is, everything after the start-tag until the first </ is taken to be character data, not containing HTML tags; fortunately, XHTML was forced to get rid of this concept due to XML's lack of support for it). Both of these limitations can be addressed, of course — using more regexes! — but they should help reinforce the point that you should use a well-tested HTML-handling library rather than trying to roll your own regexes. If you have a lot of guarantees about the nature of the HTML you're trying to handle, then regexes can be useful; but if what you're trying to do is strip out arbitrary tags, then that's a good sign that you don't have these sorts of guarantees.
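For completeness, this is roughly how the regex quoted in this answer could be applied from Java, with all the caveats listed above still in force. It is only a sketch; the class and method names are made up, and the CASE_INSENSITIVE flag stands in for the "case-insensitive" remark.

```java
import java.util.regex.Pattern;

final class StripTagsExceptBr {

    // The regex from the answer, escaped for a Java string literal.
    private static final Pattern TAG_EXCEPT_BR = Pattern.compile(
            "<(?!br\\b)/?[a-z]([^\"'>]|\"[^\"]*\"|'[^']*')*>",
            Pattern.CASE_INSENSITIVE);

    /** Removes every tag except <br> variants such as <br>, <BR/> or <br />. */
    static String strip(String html) {
        return TAG_EXCEPT_BR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("a<br>b<p class=\"x>y\">c</p><BR/>d"));
        // prints a<br>bc<BR/>d
    }
}
```

Note that a stray </br> is removed by this pattern, because the lookahead only inspects the characters directly after the '<'.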
I want to implement, in a Java desktop application, searching for and highlighting multiple phrases in HTML files, the way it is done in web browsers: HTML tags (everything between < and >) are ignored, except for tags that interrupt the text. For example, when searching for each table, the text ...each <b>table</b> has name... will be highlighted, but the text ...has each</p><p> Table is... will not be highlighted, because the <p> tag interrupts the meaning of the text.
Web browsers implement this somehow; how can I get at that implementation? Or is there some source for it on the net? I tried Google, but without success :(
Instead of searching inside the actual HTML file the browsers search on the rendered output of that HTML.
Get a suitable HTML renderer and get its output as text. Then search on that text output using appropriate string searching algorithms.
The example that you highlighted in your question would result in a newline character in the rendered HTML output and hence a normal string searching algorithm will behave as you expect.
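As a rough illustration of that idea without a real renderer, the sketch below turns tags that break the text flow into newlines, drops all other tags, and then runs a plain case-insensitive substring search. The set of flow-breaking tags is a hand-picked assumption, the class and method names are invented, and entities, scripts and comments are not handled; mapping a hit back to an offset in the HTML for highlighting is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class RenderedTextSearch {

    // Assumption: a small, hand-picked set of tags that interrupt the text flow.
    private static final Set<String> BREAKING = new HashSet<>(Arrays.asList(
            "p", "br", "div", "li", "ul", "ol", "table", "tr", "td", "th",
            "h1", "h2", "h3", "h4", "h5", "h6"));

    /** Approximates rendered text: flow-breaking tags become '\n', other tags vanish. */
    static String toText(String html) {
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            int end = html.charAt(i) == '<' ? html.indexOf('>', i) : -1;
            if (end < 0) {
                text.append(html.charAt(i++));   // ordinary character
                continue;
            }
            if (BREAKING.contains(tagName(html.substring(i + 1, end)))) {
                text.append('\n');
            }
            i = end + 1;
        }
        return text.toString();
    }

    /** Case-insensitive search on the approximated rendered text. */
    static boolean contains(String html, String phrase) {
        return toText(html).toLowerCase(Locale.ROOT)
                .contains(phrase.toLowerCase(Locale.ROOT));
    }

    /** Returns the lower-cased tag name for the text between '<' and '>'. */
    private static String tagName(String inside) {
        int start = inside.startsWith("/") ? 1 : 0;
        int end = start;
        while (end < inside.length() && Character.isLetterOrDigit(inside.charAt(end))) {
            end++;
        }
        return inside.substring(start, end).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(contains("...each <b>table</b> has name...", "each table"));  // prints true
        System.out.println(contains("...has each</p><p> Table is...", "each table"));    // prints false
    }
}
```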
As Faisal stated, browsers search in rendered content only. For doing so you'll need to remove the HTML tags before doing the actual search:
This code might help you:
http://www.dotnetperls.com/remove-html-tags
Of course you'll need to add some checks/exclusions for script tags and other things that are not rendered in the browser.
This seems pretty easy.
1) Search for the last word in the string.
2) Look at what's before the last word.
3) Decide if what's before the last word constitutes an interruption (<p>, <br />, <div>).
4) If interruption, continue
5) Else evaluate previous word against the search query.
I don't know if this is how browsers perform this operation, but this approach should work.
Try using the javax.swing.text.html package in Java.
I want to parse a document that is not pure xml. For example
my name is <j> <b> mike</b> </j>
example 2
my name is <mytag1 attribute="val" >mike</mytag1> and yours is <mytag2> john</mytag2>
That is, my input is not pure XML. It is similar to HTML, but the tags are not HTML tags.
How can I parse it in Java?
Your examples are valid XML, except for the lack of a document element. If you know this to always be the case, then you could just wrap a set of dummy tags around the whole thing and use a standard parser (SAX, DOM...)
On the other hand if you get something uglier (e.g. tags don't match up, or are spaced out in an overlapping fashion), you'll have to do something custom which will involve a number of rules that you have to decide on that will be unique to your application. (e.g. How do I handle an opening tag that has no close? What do I do if the closing tag is outside the parent?)
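The dummy-wrapper idea for the well-formed case might look like this with the DOM parser that ships with the JDK. It is a sketch only: the wrapper element name root and the class name FragmentParser are arbitrary choices, and input that is not well-formed is simply rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class FragmentParser {

    /** Wraps the fragment in a dummy <root> element and parses it as XML. */
    static Document parse(String fragment) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String wrapped = "<root>" + fragment + "</root>";
            return builder.parse(new InputSource(new StringReader(wrapped)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Mismatched or overlapping tags end up here.
            throw new IllegalArgumentException("fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("my name is <mytag1 attribute=\"val\" >mike</mytag1>"
                + " and yours is <mytag2> john</mytag2>");
        Element first = (Element) doc.getElementsByTagName("mytag1").item(0);
        System.out.println(first.getAttribute("attribute")); // prints val
        System.out.println(first.getTextContent());          // prints mike
    }
}
```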
There are a few parsers that take HTML that is not well-formed and turn it into well-formed XML; here is a comparison with examples that includes the most popular ones, except maybe HTMLParser. That is probably what you need.
I have to compare different versions of HTML pages for formatting and text changes. Unfortunately, the guy/company who creates them uses some kind of HTML editor that re-wraps all the HTML every time (and adds tons of whitespace), which makes it hard to diff them. So I am looking for a tool (preferably a Java library) that can reformat my HTML so that all insignificant spaces and newlines are removed.
That means, in
<h1>First Headline</h1> <h2>Second headline</h2>
the space between </h1> and <h2> should be removed, but in
<b>formatted</b> <i>text</i>
the whitespace may not be removed. I do not care about <pre>, <textarea> or <script> blocks, and also not about CSS whitespace attributes that can change the behavior - I am just looking for a solution that strips most of the unnecessary whitespace (and better leave too much whitespace in than too little).
(I am already collapsing multiple whitespaces and re-adding newlines instead of whitespaces before tags to make the text more readable - but there are still too many cases where for example a new newline between headlines or table cells/rows breaks my simple "solution".)
JTidy may be of use here. It's an HTML parser that parses the HTML (and is tolerant of ill-formed HTML) and presents the HTML as a DOM, and you can override the writing out of this to remove whatever you're not interested in.
If this is for internal use only, then consider using a converter to XHTML, and then canonicalize the XML. Then it is much easier to compare the results.
Tidy: http://tidy.sourceforge.net/ (output-xhtml option - http://tidy.sourceforge.net/docs/quickref.html#output-xhtml)
Canonicalize: http://en.wikipedia.org/wiki/Canonical_XML
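If running the pages through Tidy is more than is needed, the conservative rule from the question (drop whitespace next to block-level tags, keep it between inline tags) can also be approximated by hand. The following is a sketch under assumptions, not a replacement for a real HTML library: the list of block-level tags is hand-picked and incomplete, every '<' is assumed to start a tag ending at the next '>', and pre, textarea, script and CSS whitespace rules are ignored, as the question allows. All names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class BlockWhitespaceStripper {

    // Assumption: hand-picked block-level tags; extend as needed.
    private static final Set<String> BLOCK = new HashSet<>(Arrays.asList(
            "html", "head", "body", "div", "p", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "li", "table", "thead", "tbody", "tr", "td", "th"));

    /** Drops whitespace-only text that sits between two tags when either is block-level. */
    static String strip(String html) {
        List<String> tokens = tokenize(html);
        StringBuilder out = new StringBuilder();
        for (int k = 0; k < tokens.size(); k++) {
            String token = tokens.get(k);
            boolean droppable = token.trim().isEmpty()
                    && k > 0 && k < tokens.size() - 1
                    && (isBlockTag(tokens.get(k - 1)) || isBlockTag(tokens.get(k + 1)));
            if (!droppable) {
                out.append(token);
            }
        }
        return out.toString();
    }

    /** Splits the input into alternating tag tokens and text tokens. */
    private static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int end;
            if (html.charAt(i) == '<' && html.indexOf('>', i) >= 0) {
                end = html.indexOf('>', i) + 1;      // a complete tag
            } else {
                end = html.indexOf('<', i + 1);      // text up to the next tag
                if (end < 0) end = html.length();
            }
            tokens.add(html.substring(i, end));
            i = end;
        }
        return tokens;
    }

    private static boolean isBlockTag(String token) {
        if (!token.startsWith("<")) return false;
        int start = token.startsWith("</") ? 2 : 1;
        int end = start;
        while (end < token.length() && Character.isLetterOrDigit(token.charAt(end))) {
            end++;
        }
        return BLOCK.contains(token.substring(start, end).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(strip("<h1>First Headline</h1> <h2>Second headline</h2>"));
        // prints <h1>First Headline</h1><h2>Second headline</h2>
        System.out.println(strip("<b>formatted</b> <i>text</i>"));
        // prints <b>formatted</b> <i>text</i>
    }
}
```

Whitespace inside text runs is left untouched here, so it errs on the side of leaving too much in rather than too little.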