Regex for an html text code in Java - java

I have an HTML text file that contains headings, and I would like to extract only the text inside them.
Example:
<h1 class="title">Fire Safety</h1>
<h1>About this book</h1>
<h1>1</h1>
<h1>Contents of this book</h1>
I would like to extract only the following text from the HTML code:
Fire Safety,
About this book,
1,
Contents of this book
I tried a lot of things, such as:
Pattern pattern = Pattern.compile("<a[^>]href\\s=\\s*\"\\s*([^\"]*)");
Matcher matcher = pattern.matcher(input);
where input is the HTML data.
I get no results on the console, or sometimes I get only the href :(
How do I fix this?
Let me know!
Thanks!

I would strongly recommend using an HTML parser, such as TagSoup, Jericho, NekoHTML, HTML Parser, etc.
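If adding a parser dependency is not an option, the flat headings shown in the question are simple enough for java.util.regex. Note that the pattern in the question targets <a href=...> rather than <h1>, so it cannot return the heading text. Below is a minimal sketch, assuming the headings are never nested and contain no inner tags; HeadingExtractor and h1Texts are names made up for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingExtractor {
    // <h1 ...>text</h1>; (?i) ignores case, (?s) lets the text span several lines.
    private static final Pattern H1 =
            Pattern.compile("(?is)<h1\\b[^>]*>(.*?)</h1>");

    /** Returns the trimmed text of every h1 element, in document order. */
    static List<String> h1Texts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = H1.matcher(html);
        while (m.find()) {              // find() must be called to advance to each match
            texts.add(m.group(1).trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String input = "<h1 class=\"title\">Fire Safety</h1>\n"
                + "<h1>About this book</h1>\n"
                + "<h1>1</h1>\n"
                + "<h1>Contents of this book</h1>";
        // prints: Fire Safety, About this book, 1, Contents of this book
        System.out.println(String.join(", ", h1Texts(input)));
    }
}
```

For anything less regular than this (nested markup, comments, malformed tags), one of the parsers above is the safer choice.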

Related

String builder - remove all html tags except <br>

I have a string builder object "sb" that looks like -
Hello. How can I help you?<br>I don't know<br>Use the link Google<br>
This is just a sample; it can contain any kind of HTML tag. How do I remove all possible HTML tags from the object except the br tag?
I have been trying to use the code below to remove all HTML tags from the sb object, but it doesn't seem to work. I'm also not sure how to make an exception for the br tag.
sb.replaceAll("<.*?>", "");
As we all know, parsing HTML with regex is strongly discouraged if you are trying to capture full tag data and manipulate it. But if you just want to strip out all tags, or all but a few (as in this case, where every tag except <br> should be removed), you can use this regex:
<\/?(?!br>)\w+[^>]*>
Explanation:
< - Match starting of tag
\/? - Optionally match / for matching closing tag
(?!br>) - Reject the match if tag name is br
\w+ - Match any tag name consisting of word characters
[^>]* - Optionally allow tag attributes to match
> - Match closing of tag
Sample Java code:
String s = "Hello. How can I help you?<br>I don't know<br>Use the link <a \r\n" +
"href=\"www.google.com\" target=\"_blank\">Google</a></br>Hello. <sometag>somedata</sometag> hey <br1>somedata</br2> hello <1br>somedata</1br> How can I help you?<br>I don't know<br>Use the link <a \r\n" +
"href=\"www.google.com\" target=\"_blank\">Google</a></br>";
System.out.println(s.replaceAll("</?(?!br>)\\w+[^>]*>", ""));
This prints the following, with every tag removed except <br> and </br>:
Hello. How can I help you?<br>I don't know<br>Use the link Google</br>Hello. somedata hey somedata hello somedata How can I help you?<br>I don't know<br>Use the link Google</br>
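Since the question keeps the text in a StringBuilder, which has no replaceAll method, one way to apply the pattern is to go through a String and write the result back. A minimal sketch; BrPreservingStripper and stripTags are hypothetical names used only for illustration:

```java
class BrPreservingStripper {
    // Any opening or closing tag except exactly <br> and </br>.
    // Caveat: variants such as <br/> or <br class="x"> do not match the
    // lookahead's "br>" and are therefore removed like any other tag.
    private static final String NON_BR_TAG = "</?(?!br>)\\w+[^>]*>";

    /** Removes every tag except <br> and </br> from sb, in place. */
    static void stripTags(StringBuilder sb) {
        String cleaned = sb.toString().replaceAll(NON_BR_TAG, "");
        sb.setLength(0);     // empty the builder
        sb.append(cleaned);  // refill it with the cleaned text
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder(
                "Use the link <a href=\"www.google.com\">Google</a><br>");
        stripTags(sb);
        System.out.println(sb); // prints: Use the link Google<br>
    }
}
```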
Edit: As Lino mentioned in his comment, if the tag may contain optional whitespace around the br name, you can use the following regex, which allows for it:
<\s*\/?\s*(?!br\s*>)\w+[^>]*>
Parsing HTML with regex is not a good idea. If you are sure the input is always HTML, I would suggest using Jsoup; it parses your HTML and gives you back a document.
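import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

Document doc = Jsoup.parse(sb.toString());
printChilds(doc.body().childNodes());

// Recursively prints leaf nodes only: text nodes and childless elements such as <br>.
public static void printChilds(List<Node> nodes) {
    for (Node n : nodes) {
        if (n.childNodeSize() == 0) {
            System.out.print(n.toString());
        } else {
            printChilds(n.childNodes());
        }
    }
}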
This will output Hello. How can I help you?<br>I don't know<br>Use the link Google<br>

Strip HTML from text but also specific content wrapped in html in Java

Let's say I have the following text:
<blockquote>
<div>This is text and html in a blockquote</div>
More text in a block quote.
</blockquote>
Here's some content <b> bolded </b> and <i> other random HTML tags </i>
I'd like to strip the entire blockquote out, and keep the content in other html tags. So the output would be:
Here's some bolded and other random HTML tags.
I know there are a hundred or more answers to "Stripping HTML from content", but I can't find one that strips HTML tags and also removes the content wrapped in specific HTML tags.
How can I get the desired output in Java?
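You could use simple regular expressions: .replaceAll("(?s)<blockquote>.*?</blockquote>", "").replaceAll("<[^>]+>", ""). The (?s) flag is needed so that . also matches the line breaks inside the blockquote, and the lazy .*? stops at the first closing tag. For input like the example, this should be enough.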
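Here is a runnable sketch of the regex route. By default . does not match line terminators, so a multi-line blockquote is only matched when DOTALL mode, (?s), is switched on. The sketch assumes blockquotes are not nested; BlockquoteStripper is a made-up name, and the whitespace cleanup at the end is an addition for readable output:

```java
class BlockquoteStripper {
    /** Drops each blockquote with its content, then strips the remaining tags. */
    static String strip(String html) {
        return html
                // (?s): '.' also matches newlines; (?i): ignore tag case;
                // .*? is lazy, so each match ends at the first </blockquote>.
                .replaceAll("(?is)<blockquote\\b[^>]*>.*?</blockquote>", "")
                // Remove every remaining tag but keep its text.
                .replaceAll("<[^>]+>", "")
                // Collapse the whitespace left behind by the removed tags.
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String text = "<blockquote>\n"
                + "<div>This is text and html in a blockquote</div>\n"
                + "More text in a block quote.\n"
                + "</blockquote>\n"
                + "Here's some content <b> bolded </b> and <i> other random HTML tags </i>";
        // prints: Here's some content bolded and other random HTML tags
        System.out.println(strip(text));
    }
}
```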
It may seem like overkill, but you could use Jsoup to parse the HTML and operate on the Elements.
There may be something more lightweight for your problem, but Jsoup should do the job just fine. You can select elements using CSS selectors, remove the ones you don't need, and get the plain text (without tags) out of the rest.
Here is a simple example:
final String html = "<html><div><bq>i do not want this</bq>but this <b>should</b> all <i>get</i> read</div></html>";
final Document document = Jsoup.parse(html);
final Elements div = document.select("div");
div.select("bq").remove();
System.out.println(div.text()); // prints but this should all get read
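You could also use Jsoup this way. Note that Jsoup.clean with Whitelist.none() strips every tag but keeps the text inside it, so the blockquote's text stays in the result and would still need to be removed separately: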
String text = "<blockquote>\n" +
" <div>This is text and html in a blockquote</div>\n" +
" More text in a block quote.\n" +
"</blockquote> \n" +
"Here's some content <b> bolded </b> and <i> other random HTML tags </i>";
// Whitelist.none() allows no tags at all (newer jsoup releases rename Whitelist to Safelist)
Whitelist whitelist = Whitelist.none();
String cleanStr = Jsoup.clean(text, whitelist);

Using JSoup to parse text between two different tags

I have the following HTML...
<h3 class="number">
<span class="navigation">
6:55 <b>»</b>
</span>**This is the text I need to parse!**</h3>
I can use the following code to extract the text from h3 tag.
Element h3 = doc.select("h3").get(0);
Unfortunately, that gives me everything in that tag.
6:55 » This is the text I need to parse!
Can I use Jsoup to parse between different tags? Is there a best practice for doing this (regex?)
Regarding "(regex?)": no. As you can read in the answers to this question, you can't reliably parse HTML using a regular expression.
Try this:
Element h3 = doc.select("h3").get(0);
String h3Text = h3.text();
String spanText = h3.select("span").get(0).text();
String textBetweenSpanEndAndH3End = h3Text.replace(spanText, "");
No, Jsoup wasn't made for this. It's meant to parse something hierarchical. Searching for text that sits between an end tag and a start tag, or the other way around, doesn't make sense for Jsoup. That's what regular expressions are for.
But you should of course narrow it down as much as you can using Jsoup first, before you fire a regex at the string.
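As a rough illustration of that last step, here is a sketch that takes the h3 markup as a plain string and picks out the text between </span> and </h3>. It assumes no further tags appear in that stretch; TrailingTextExtractor and textAfterSpan are hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingTextExtractor {
    // Text between a closing </span> and the closing </h3>.
    // [^<]* refuses to cross any other tag, so nested markup yields no match.
    private static final Pattern AFTER_SPAN =
            Pattern.compile("(?i)</span>([^<]*)</h3>");

    /** Returns the trimmed trailing text, or an empty string if nothing matches. */
    static String textAfterSpan(String h3Html) {
        Matcher m = AFTER_SPAN.matcher(h3Html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String sample = "<h3 class=\"number\">\n"
                + "<span class=\"navigation\">\n"
                + "6:55 <b>»</b>\n"
                + "</span>**This is the text I need to parse!**</h3>";
        // prints: **This is the text I need to parse!**
        System.out.println(textAfterSpan(sample));
    }
}
```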
Just use ownText()
@Test
void innerTextCase() {
String sample = "<h3 class=\"number\">\n" +
"<span class=\"navigation\">\n" +
"6:55 <b>»</b>\n" +
"</span>**This is the text I need to parse!**</h3>\n";
Assertions.assertEquals("**This is the text I need to parse!**",
Jsoup.parse(sample).select("h3").first().ownText());
}

How to parse HTML and get CSS styles

I need to parse HTML and find the corresponding CSS styles. I can parse HTML and CSS separately, but I can't combine them. For example, I have an XHTML page like this:
<html>
<head>
<title></title>
</head>
<body>
<div class="abc">Hello World</div>
</body>
</html>
I have to search for "hello world" and find its class name, and after that I need to find its style from an external CSS file. Answers using Java, JavaScript, and PHP are all okay.
Use the jsoup library, which is an HTML parser for Java.
For example, you can do something like this:
String html = "<<your html content>>";
Document doc = Jsoup.parse(html);
// first element whose own text contains "Hello World"
Element ele = doc.getElementsContainingOwnText("Hello World").first();
// class names of that element
Set<String> classNames = ele.classNames();
You can explore the library further to fit your needs.
There is a similar question: "Can jQuery get all CSS styles associated with an element?". CSS optimizers may also do what you want; take a look at unused-css.com, which is an online tool but also lists other tools.
As I understand it, you are able to parse the style sheet from an external file, which makes your task easy to solve. First, parse the HTML file with jsoup, which supports a jQuery-like selector syntax that makes complicated HTML files easier to handle. Then check this previous solution for parsing the CSS file. I'm not going to give a full solution; as I said, these libraries do all the work internally, and the only thing you have to write is the glue code that combines the two.
Using Java's java.util.regex:
String s = "<body>...<div class=\"abc\">Hello World</div></body>";
Pattern p = Pattern.compile("<div.+?class\\s*?=\\s*['\"]?([^ '\"]+).*?>Hello World</div>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(1));
}
prints abc
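With the class name in hand, the remaining step from the question is finding its rule in the external stylesheet. Below is a rough sketch using only java.util.regex. It assumes the CSS has already been read into a string and uses a plain single-class selector such as .abc { ... } (no grouped selectors, media queries, or comments). CssRuleLookup and declarationsFor are hypothetical names, and a real CSS parser is the more robust choice:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssRuleLookup {
    /** Returns the declarations of the first ".className { ... }" rule, if any. */
    static Optional<String> declarationsFor(String css, String className) {
        // Pattern.quote guards against regex metacharacters in the class name.
        Pattern rule = Pattern.compile(
                "\\." + Pattern.quote(className) + "\\s*\\{([^}]*)\\}");
        Matcher m = rule.matcher(css);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the contents of the external CSS file.
        String css = "body { margin: 0; }\n"
                + ".abc { color: red; font-weight: bold; }";
        // prints: color: red; font-weight: bold;
        System.out.println(declarationsFor(css, "abc").orElse("(no rule found)"));
    }
}
```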

Regular expression for getting HREF based on span tag [duplicate]

I have a requirement to get the last HREF in the HTML code, that is, the HREF in the footer of the page.
Is there a regular expression that does this directly?
No regex, use the :last jQuery selector instead.
Demo (the page contains two links, "foo" and "bar"):
var link = $("a:last");
You could use plain JavaScript for this (if you don't need it to be a jQuery object):
var links = document.links;
var lastLink = links[links.length - 1];
var lastHref = lastLink.href;
alert(lastHref);
Disclaimer: the above code only works in JavaScript, as HTML itself has no capacity for regex or DOM manipulation. If you need to use a different technology, please leave a comment or edit your question to include the relevant tags.
It's not a good idea to parse HTML with regular expressions. Have a look at HtmlParser to parse HTML.
