How to strip insignificant whitespace out of HTML - java

I have to compare different versions of HTML pages for formatting and text changes. Unfortunately the guy/company who creates them uses some kind of HTML editor that re-wraps all the HTML every time (and adds tons of whitespace), which makes it hard to diff them. So I am looking for a tool (preferably a Java library) that can reformat my HTML in a way that all insignificant spaces and newlines get removed.
That means, in
<h1>First Headline</h1> <h2>Second headline</h2>
the space between </h1> and <h2> should be removed, but in
<b>formatted</b> <i>text</i>
the whitespace may not be removed. I do not care about <pre>, <textarea> or <script> blocks, and also not about CSS whitespace attributes that can change the behavior - I am just looking for a solution that strips most of the unnecessary whitespace (and better leave too much whitespace in than too little).
(I am already collapsing multiple whitespaces and re-adding newlines instead of whitespaces before tags to make the text more readable - but there are still too many cases where for example a new newline between headlines or table cells/rows breaks my simple "solution".)

JTidy may be of use here. It's an HTML parser that parses the HTML (and is tolerant of ill-formed HTML) and presents the HTML as a DOM, and you can override the writing out of this to remove whatever you're not interested in.

If this is for internal use only, then consider using a converter to XHTML, and then canonicalize the XML. Then it is much easier to compare the results.
Tidy: http://tidy.sourceforge.net/ (output-xhtml option - http://tidy.sourceforge.net/docs/quickref.html#output-xhtml)
Canonicalize: http://en.wikipedia.org/wiki/Canonical_XML
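Once the pages have been converted to well-formed XHTML (for example with Tidy's output-xhtml option), the whitespace rule from the question can be sketched with the JDK's own DOM parser. This is only a rough sketch: the class name, the method names and the list of block-level elements are my own assumptions, the list is incomplete, and input with a DOCTYPE may make the parser try to load the DTD. It drops whitespace-only text nodes that touch a block-level boundary and leaves everything else alone, so it errs on the side of keeping too much whitespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XhtmlWhitespaceStripper {
    // Assumption: a hand-picked, incomplete set of block-level element names.
    private static final Set<String> BLOCK = new HashSet<>(Arrays.asList(
            "html", "head", "body", "div", "p", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "li", "table", "thead", "tbody", "tr", "td", "th"));

    /** Parses well-formed XHTML, removes insignificant whitespace, serializes it again. */
    static String strip(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            stripNode(doc.getDocumentElement());
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    private static boolean isBlock(Node n) {
        return n != null && n.getNodeType() == Node.ELEMENT_NODE
                && BLOCK.contains(n.getNodeName().toLowerCase());
    }

    // True if the text node sits directly next to a block-level start or end tag.
    private static boolean touchesBlockBoundary(Node text) {
        Node prev = text.getPreviousSibling();
        Node next = text.getNextSibling();
        Node parent = text.getParentNode();
        boolean before = (prev == null) ? isBlock(parent) : isBlock(prev);
        boolean after = (next == null) ? isBlock(parent) : isBlock(next);
        return before || after;
    }

    private static void stripNode(Node node) {
        List<Node> toRemove = new ArrayList<>();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            boolean blank = c.getNodeType() == Node.TEXT_NODE
                    && c.getNodeValue().trim().isEmpty();
            if (blank && touchesBlockBoundary(c)) {
                toRemove.add(c);      // decide on the original tree, remove afterwards
            } else {
                stripNode(c);
            }
        }
        for (Node n : toRemove) {
            node.removeChild(n);
        }
    }
}
```

With the two examples from the question wrapped in a <div>, the space between </h1> and <h2> is removed while the space between </b> and <i> is kept. Whitespace at the start or end of a text node that also contains other characters is not touched by this sketch.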

Related

java regex replace all html tags except br

I need a regular expression that can be used with replaceall to replace all the html tags with empty string except any variations of br to maintain the line breaks.
I found the following to replace all html tags
<\s*br\s*\[^>]
You might get some answers that claim to work.
Those answers might even work for the particular cases you try them against.
But know that regular expressions (which I'm fond of in general) are the wrong tool for the job in this case.
And as your project evolves and needs to cover more complex HTML inputs, the regular expression will get more and more convoluted, and there may well come a time when it simply cannot solve your problem anymore, period.
Do it the right way from the beginning. Use an HTML parser, not a regex.
For reference, here are some related SO posts:
Regex to match all HTML tags except <p> and </p>
Regex to replace all \n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:
If the HTML is known to be valid, then you can use this regex (case-insensitive):
<(?!br\b)/?[a-z]([^"'>]|"[^"]*"|'[^']*')*>
but it can fail in interesting ways if you give it invalid HTML. Also, I took "HTML tags" pretty literally; the above won't cover <!-- HTML comments --> and <!DOCTYPE declarations>, and won't convert <![CDATA[ blocks ]]> and &entity;s to plain text.
It's probably better to take a step back, think about why you want to strip out these HTML tags — that is, what you're actually trying to achieve — and then find an HTML-handling library that offers a better way to achieve that goal. HTML cleaning is really a solved problem; you shouldn't need to reinvent it.
UPDATE: I've just realized that, even for valid HTML, the above has some major limitations. For example, it will mishandle something like <!--<yes--> (converting it to just <!--), and also something like <script><foo></script> (since HTML proper has a small number of tags with CDATA content, that is, everything after the start-tag until the first </ is taken to be character data, not containing HTML tags; fortunately, XHTML was forced to get rid of this concept due to XML's lack of support for it). Both of these limitations can be addressed, of course — using more regexes! — but they should help reinforce the point that you should use a well-tested HTML-handling library rather than trying to roll your own regexes. If you have a lot of guarantees about the nature of the HTML you're trying to handle, then regexes can be useful; but if what you're trying to do is strip out arbitrary tags, then that's a good sign that you don't have these sorts of guarantees.
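For completeness, here is roughly how the regex above could be used from Java with replaceAll. The class and method names are just placeholders I made up, and all the limitations described above (comments, <script> content, invalid HTML) still apply.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // The regex from the answer, escaped for a Java string literal.
    // The negative lookahead (?!br\b) skips <br>, <br/> and <br ...>.
    private static final Pattern NON_BR_TAG = Pattern.compile(
            "<(?!br\\b)/?[a-z]([^\"'>]|\"[^\"]*\"|'[^']*')*>",
            Pattern.CASE_INSENSITIVE);

    /** Removes every tag except variations of the br start tag. */
    static String stripTagsExceptBr(String html) {
        return NON_BR_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello<br/>World <a href=\"x>y\">link</a></p>";
        System.out.println(stripTagsExceptBr(html)); // prints: Hello<br/>World link
    }
}
```

Note that the lookahead sits directly after the <, so a stray </br> end tag is not protected and gets removed like any other tag.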

How to abbreviate HTML with Java?

A user enters text as HTML in a form, for example:
<p>this is my <strong>blog</strong> post,
very <i>long</i> and written in <b>HTML</b></p>
I want to be able to output only a part of the string (for example, the first 20 characters) without breaking the HTML structure of the user's input. In this case:
<p>this is my <strong>blog</strong> post, very <i>l</i>...</p>
which renders as
this is my <strong>blog</strong> post, very <i>lo</i>...
Is there a Java library able to do this, or a simple method to use?
MyLibrary.abbreviateHTML(string,20) ?
Since it's not very easy to do this correctly I usually strip all tags and truncate. This gives great control on the text size and appearance which usually needs to be placed in places where you do need control.
Note that you may find my proposal very conservative, and it is actually not a proper answer to your question. But most of the time the alternatives are:
strip all tags and truncate
provide an alternate content manageable rich text which will serve as the truncated text. This of course only works in the case of CMSes etc
The reason that truncating HTML is hard is that you don't know how truncating would affect the structure of the HTML. How would you truncate in the middle of a <ul> or, even worse, in the middle of a complex <table>?
So the problem here is that HTML can contain not only content and styling (bold, italics) but also structure (lists, tables, divs etc). So a good and safe implementation would be to strip everything out apart from inline "styling" tags (bold, italics etc) and truncate while keeping track of unclosed tags.
I don't know any library but it should not be so complicated (for 80%).
You only need a simple "parser" that understands four types of tokens:
opening tags - everything that starts with < (but not </) and ends with > (but not />)
closing tags - everything that starts with </ and ends with >
self-closing tags (like <br/>) - everything that starts with < (but not </) and ends with />
normal characters - everything that is none of the other types
Then walk through your input string and count the "normal characters". As you walk and count, copy every token to the output as long as the number of normal characters counted is less than or equal to the amount you want to keep.
You also need to maintain a stack of the currently open tags while you walk through the input. Every time you pass an opening tag, push its name onto the stack; every time you find a closing tag, pop the topmost tag name from the stack (hopefully the input is correct XHTML).
When you reach the required number of normal characters, you only need to write closing HTML tags for the tag names remaining on the stack.
But be careful: this works only if the input is well-formed XML.
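The steps above could look something like this in Java. It is a minimal sketch, not a library: the class and method names are made up, it assumes well-formed XHTML with no > inside attribute values, it counts an entity such as &amp; as several characters, and it puts the "..." inside the innermost tag that is still open.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class HtmlAbbreviator {
    /** Copies tokens until maxChars normal characters are written, then closes open tags. */
    static String abbreviate(String html, int maxChars) {
        StringBuilder out = new StringBuilder();
        Deque<String> openTags = new ArrayDeque<>(); // names of currently open tags
        int count = 0;                               // normal characters copied so far
        boolean truncated = false;
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int end = html.indexOf('>', i);
                if (end < 0) {
                    break;                           // malformed input: stop copying
                }
                String tag = html.substring(i, end + 1);
                if (tag.startsWith("</")) {          // closing tag
                    if (!openTags.isEmpty()) {
                        openTags.pop();
                    }
                } else if (!tag.endsWith("/>")) {    // opening tag (not self-closing)
                    openTags.push(tagName(tag));
                }
                out.append(tag);
                i = end + 1;
            } else {                                 // normal character
                if (count == maxChars) {
                    truncated = true;
                    break;
                }
                out.append(c);
                count++;
                i++;
            }
        }
        if (truncated) {
            out.append("...");
        }
        while (!openTags.isEmpty()) {                // close whatever is still open
            out.append("</").append(openTags.pop()).append('>');
        }
        return out.toString();
    }

    // Extracts "a" from "<a href='x'>".
    private static String tagName(String tag) {
        int end = 1;
        while (end < tag.length() && tag.charAt(end) != '>'
                && !Character.isWhitespace(tag.charAt(end))) {
            end++;
        }
        return tag.substring(1, end);
    }
}
```

For the blog post example (written on one line) and a limit of 20, this gives <p>this is my <strong>blog</strong> post...</p>, since "this is my blog post" is exactly 20 normal characters.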
I don't know what you want to do with this piece of code, but you should pay attention to HTML/JavaScript injection attacks.
If you really want to abbreviate HTML then just do it (cut the text at desired length), pass the abbreviated result through http://jtidy.sourceforge.net/ and hope for the best.
It seems that there are a lot of libs and tools for this common task:
truncateNicely from Jakarta Taglibs String (Jakarta Taglibs has been retired)
org.displaytag.util.HtmlTagUtil#abbreviateHtmlString from Display tag library 1.2 (already mentioned by Marnix van Bochove in his comment.)

Regex remove only certain tags from html

I want to remove only a set of html tags (b,i,p, end of tags) from a given html.
Pattern p = Pattern.compile("<[^bip/](.*?)>");
However, this also removes the img tag because of the .*. What should I change to prevent removal of img?
EDIT: I'm doing this in an Android app. I know regex is the worst way, but the built-in spannable classes are not working as expected and I can't import a library just for HTML parsing. My purpose is just to detect whether other tags exist or not. Also, the HTML is pretty small (up to 10 lines max), so performance shouldn't be a problem.
This has been said a million times on stackoverflow.
Don't process HTML, XHTML or XML with regexes. They aren't regular languages, they are context free languages and can't be correctly processed with regular expressions.
Trying to process XML (or HTML) with regexes is a bad idea: you definitely want to use a parser.
In your case, you want to match:
<\s*/?\s*[bip]\s*>
This removes the single-letter tags (and their closing tags) and allows for whitespace inside the tag. Because \s also matches line breaks, a tag that is split across lines is matched without needing a multiline flag.
It might work, but it's dangerous and you might have unexpected side effects
EDIT:
I understood you just want to remove the tags, not the actual content inside the tag
EDIT2:
current pattern matches the 3 tags, not their content. In a substitution regexp (replacing by nothing), it would remove these formatting tags, not the embedded content.
If you want to remove only <b>,<p>,<i> and </b>,</p>,</i> tags then you can use following regex :
(</?b>|</?p>|</?i>)
I am not sure I understand your regex, seems very different from what you say you want. Use something like below:
<([bip])>.*?</\1>
And if possible, don't use the above or any other regexes. There are various other better ways to do this. Search here or on google.
Most of the sample regexes only check that a tag starts with a given name. For instance, you may want to remove <b> but not <br>; with most of the samples, adding b to the tag list automatically removes <br> as well. I use /<\/?(font|div|b)(\/|>|\s.*?>)/g, which avoids this prefix problem: it matches only font, div and b, and does not match br.
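Putting the points from the answers together for the original b/i/p case, a Java version might look like the sketch below. The pattern is my own variant, not one of the regexes quoted above, and the class and method names are placeholders. The lookahead makes sure the tag name ends right after b, i or p, so <br> and <img> are left alone. It carries the usual regex caveats, for example a > inside an attribute value will confuse it.

```java
import java.util.regex.Pattern;

final class SelectiveTagRemover {
    // Matches <b>, </b>, <i>, </i>, <p>, </p>, with or without attributes.
    // (?=[\s/>]) requires the tag name to end here, so <br> and <img> do not match.
    private static final Pattern BIP_TAG = Pattern.compile(
            "</?(?:b|i|p)(?=[\\s/>])[^>]*>", Pattern.CASE_INSENSITIVE);

    // Anything that still looks like a start or end tag.
    private static final Pattern ANY_TAG = Pattern.compile(
            "</?[a-z][^>]*>", Pattern.CASE_INSENSITIVE);

    /** Removes only the b, i and p tags, keeping their content. */
    static String removeBip(String html) {
        return BIP_TAG.matcher(html).replaceAll("");
    }

    /** Reports whether any tag other than b, i or p occurs in the input. */
    static boolean hasOtherTags(String html) {
        return ANY_TAG.matcher(removeBip(html)).find();
    }

    public static void main(String[] args) {
        String html = "<p>Hi <b>there</b><br><img src=\"a.png\"> <i>x</i></p>";
        System.out.println(removeBip(html));    // prints: Hi there<br><img src="a.png"> x
        System.out.println(hasOtherTags(html)); // prints: true
    }
}
```

Since the stated goal was only to detect whether other tags exist, hasOtherTags may be all that is needed.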

Text Processing - Detecting if you are inside an HTML tag in Java

I have a program that does text processing on a html formatted document based on information on the same document without the html information. I basically, locate a word or phrase in the unformatted document, then find the corresponding word in the formatted document and alter the appearance of the word or phrase using HTML tags to make it stick out (e.g. bold it or change its color).
Here is my problem. Occasionally, I want to do formatting to a word or phrase which might be part of a html tag (for example perhaps I want to do some formatting to the word "font" but only if is a word that is not inside an html tag). Is there an easy way to detect whether a string is part of an html tag in a block of text or not?
By the way, I can't just strip out the html tags in the document and do my processing on the remaining text because I need to preserve the html in the result. I need to add to the existing html but I need to reliably distinguish between strings that are part of tags and strings that are not.
Any ideas?
Thank you,
Elliott
You could do a few things
Write a regular expression for what you're doing. There are plenty of prewritten ones you can find on Google
Find a library to parse the document (e.g., http://htmlparser.sourceforge.net/) and only replace text
The first is likely to be the fastest and easiest, but the second will be more reliable.
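As a rough illustration of the first option, the sketch below splits the document into tags and the text between them, and only touches the text. Everything in it is my own assumption (class name, method names, wrapping the word in <b>), and it treats anything between < and > as a tag, so comments, <script> bodies and a > inside an attribute value will trip it up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextOnlyHighlighter {
    // Anything between < and > is treated as a tag and copied unchanged.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Wraps whole-word occurrences of word in <b>...</b>, but only outside tags. */
    static String highlight(String html, String word) {
        Pattern target = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(mark(html.substring(last, m.start()), target)); // text before the tag
            out.append(m.group());                                     // the tag itself
            last = m.end();
        }
        out.append(mark(html.substring(last), target));                // trailing text
        return out.toString();
    }

    private static String mark(String text, Pattern target) {
        return target.matcher(text).replaceAll("<b>$0</b>");
    }

    public static void main(String[] args) {
        String html = "<font color=\"red\">the font is red</font>";
        // prints: <font color="red">the <b>font</b> is red</font>
        System.out.println(highlight(html, "font"));
    }
}
```

The existing markup is preserved as-is; only the word "font" in the visible text gets wrapped.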
Use the following regex to detect whether a string contains HTML tags: "<.*?>"
And here you can learn how to effectively use regex in your java code.
Happy coding ;)
If you have parsed the document into a DOM (which is what you should have, if you are doing this correctly), then ask the current node for the parent tag that contains it, and keep walking up until you reach the tag you are looking for.
If you use some custom search or regex to parse HTML, then check the best answer to this question:
RegEx match open tags except XHTML self-contained tags (It has +4000 upvotes for a reason)

parsing a non xml file in java

I want to parse a document that is not pure xml. For example
my name is <j> <b> mike</b> </j>
example 2
my name is <mytag1 attribute="val" >mike</mytag1> and yours is <mytag2> john</mytag2>
That is, my input is not pure XML. It's similar to HTML, but the tags are not HTML tags.
How can I parse it in Java?
Your examples are valid XML, except for the lack of a document element. If you know this to always be the case, then you could just wrap a set of dummy tags around the whole thing and use a standard parser (SAX, DOM...)
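A minimal sketch of the dummy-tag idea, using the DOM parser that ships with the JDK. The wrapper element name "root" and the class and method names are arbitrary choices of mine. It only works while the fragment is otherwise well-formed, so HTML-style entities such as &nbsp; or unclosed tags will make the parser reject it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class FragmentParser {
    /** Wraps the fragment in a dummy document element and parses it as XML. */
    static Document parse(String fragment) {
        String wrapped = "<root>" + fragment + "</root>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("fragment is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "my name is <mytag1 attribute=\"val\" >mike</mytag1> and yours is <mytag2> john</mytag2>");
        Element first = (Element) doc.getElementsByTagName("mytag1").item(0);
        System.out.println(first.getAttribute("attribute")); // prints: val
        System.out.println(first.getTextContent());          // prints: mike
    }
}
```

The plain text between the custom tags ends up as text nodes directly under the dummy root element.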
On the other hand if you get something uglier (e.g. tags don't match up, or are spaced out in an overlapping fashion), you'll have to do something custom which will involve a number of rules that you have to decide on that will be unique to your application. (e.g. How do I handle an opening tag that has no close? What do I do if the closing tag is outside the parent?)
There are a few parsers that take HTML that is not well-formed and turn it into well-formed XML. Here is a comparison with examples that includes the most popular ones, except maybe HTMLParser. That is probably what you need.
