I am trying to remove all HTML elements from a String. Unfortunately, I cannot use regular expressions because I am developing on the Blackberry platform and regular expressions are not yet supported.
Is there any other way that I can remove HTML from a string? I read somewhere that you can use a DOM Parser, but I couldn't find much on it.
Text with HTML:
<![CDATA[As a massive asteroid hurtles toward Earth, NASA head honcho Dan Truman (Billy Bob Thornton) hatches a plan to split the deadly rock in two before it annihilates the entire planet, calling on Harry Stamper (Bruce Willis) -- the world's finest oil driller -- to head up the mission. With time rapidly running out, Stamper assembles a crack team and blasts off into space to attempt the treacherous task. Ben Affleck and Liv Tyler co-star.]]>
Text without HTML:
As a massive asteroid hurtles toward Earth, NASA head honcho Dan Truman (Billy Bob Thornton) hatches a plan to split the deadly rock in two before it annihilates the entire planet, calling on Harry Stamper (Bruce Willis) -- the world's finest oil driller -- to head up the mission. With time rapidly running out, Stamper assembles a crack team and blasts off into space to attempt the treacherous task.Ben Affleck and Liv Tyler co-star.
Thanks!
There are a lot of nuances to parsing HTML in the wild, one of the funnier ones being that many pages out there do not follow any standard. This said, if all your HTML is going to be as simple as your example, something like this is more than enough:
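// StringBuffer instead of StringBuilder: CLDC (BlackBerry/J2ME) has no StringBuilder.
char[] cs = s.toCharArray();
StringBuffer sb = new StringBuffer();
boolean tag = false;
for (int i = 0; i < cs.length; i++) {
    char c = cs[i];
    if (tag) {
        // Inside a tag: skip everything up to the closing '>'.
        if (c == '>') tag = false;
    } else if (c == '<') {
        tag = true;
    } else if (c == '&') {
        // Appends the decoded character and returns how many extra chars to skip.
        i += interpretEscape(cs, i, sb);
    } else {
        sb.append(c);
    }
}
System.err.println(sb);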
Where interpretEscape() is supposed to know how to convert HTML escapes such as &gt; to their character counterparts, and to skip all characters up to the ending ;.
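For illustration, here is a minimal sketch of what such an interpretEscape() could look like. The entity table, the 10-character search window, and the choice to leave unknown entities untouched are all assumptions, not part of the answer above. It takes a StringBuffer, which CLDC provides; on Java SE a StringBuilder would work the same way.

```java
// Hypothetical helper: decodes one HTML entity without regex or generics.
final class EscapeSketch {

    /**
     * Decodes the entity starting at cs[i] (which must be '&'), appends the
     * result to sb, and returns how many extra characters the caller should
     * skip so that its own i++ lands just past the ';'.
     */
    static int interpretEscape(char[] cs, int i, StringBuffer sb) {
        // Look for the terminating ';' within a short window.
        int end = -1;
        for (int j = i + 1; j < cs.length && j <= i + 10; j++) {
            if (cs[j] == ';') { end = j; break; }
        }
        if (end == -1) {            // a bare '&', not an entity
            sb.append('&');
            return 0;
        }
        String name = new String(cs, i + 1, end - i - 1);
        char decoded;
        if (name.equals("lt")) decoded = '<';
        else if (name.equals("gt")) decoded = '>';
        else if (name.equals("amp")) decoded = '&';
        else if (name.equals("quot")) decoded = '"';
        else if (name.equals("apos")) decoded = '\'';
        else if (name.equals("nbsp")) decoded = ' ';
        else if (name.startsWith("#")) {
            try {
                boolean hex = name.length() > 1
                        && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
                int code = hex ? Integer.parseInt(name.substring(2), 16)
                               : Integer.parseInt(name.substring(1));
                // Assumption: only code points that fit in one char are decoded.
                if (code < 0 || code > 0xFFFF) { sb.append('&'); return 0; }
                decoded = (char) code;
            } catch (NumberFormatException e) {
                sb.append('&');     // malformed numeric entity: keep as written
                return 0;
            }
        } else {                    // unknown entity: keep as written
            sb.append('&');
            return 0;
        }
        sb.append(decoded);
        return end - i;
    }
}
```

When an entity is not recognised, only the '&' is appended and 0 is returned, so the calling loop copies the rest of the text through unchanged.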
"I cannot use regular expressions because I am developing on the Blackberry platform"
You shouldn't use regular expressions for this anyway: HTML allows arbitrarily nested elements, and regular expressions cannot match nested structures in general.
You need a parser.
If you can add external jars, you can try one of these two small libraries:
TagSoup, a SAX parser
Jericho HTML, another small HTML parser
Both let you strip all the markup.
I have used Jericho many times. To strip tags, you define an extractor that behaves the way you like:
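class HTMLStripExtractor extends TextExtractor
{
    public HTMLStripExtractor(Source src)
    {
        super(src);
        src.setLogger(null); // silence parser log output
    }

    // Return true to leave an element's content out of the extracted text.
    // As written, this excludes every element except <a>.
    public boolean excludeElement(StartTag startTag)
    {
        return !HTMLElementName.A.equals(startTag.getName());
    }
}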
I'd try to tackle this the other way around: create a DOM tree from the HTML and then extract the string from the tree:
Use a library like TagSoup to parse the HTML while cleaning it up to be close to XHTML.
As you stream the cleaned-up XHTML, extract the text you want.
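The second step can be sketched with a plain SAX handler that keeps only character data. This sketch assumes Java SE (not BlackBerry) and input that is already well-formed XHTML, and it uses the JDK's own SAX parser. The names TextOnlyHandler and extract are made up for the example. TagSoup's parser is, as far as I know, a regular XMLReader, so the same handler should be usable with it for messy HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: ignores all element events and collects text only.
final class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called for every run of character data between tags.
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Parses well-formed XHTML with the JDK's SAX parser and returns its text.
    static String extract(String xhtml) throws Exception {
        TextOnlyHandler handler = new TextOnlyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }
}
```

Because the parser resolves entities itself, text such as &amp;amp; arrives in characters() already decoded.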
Related
I recently discovered the Stanford NLP parser and it seems quite amazing. I currently have a working instance of it running in our project, but I am facing the two problems below.
How can I parse text and then extract only specific part-of-speech labels from the parsed data? For example, how can I extract only NNPS and PRP from a sentence?
Our platform works in both English and German, so the text may be in either language. How can I accommodate this scenario? Thank you.
Code :
private final String PCG_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
private final TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "invertible=true");

public void testParser() {
    LexicalizedParser lp = LexicalizedParser.loadModel(PCG_MODEL);
    String sent = "Complete Howto guide to install EC2 Linux server in Amazon Web services cloud.";
    Tree parse = lp.parse(sent);
    // Each TaggedWord pairs a token with its part-of-speech tag.
    List<TaggedWord> taggedWords = parse.taggedYield();
    System.out.println(taggedWords);
}
The above example works, but as you can see I am loading the English data. Thank you.
Try this:
for (Tree subTree : parse) { // traverse every node of the sentence's parse tree
    if (subTree.label().value().equals("NNPS")) { // the node's label is NNPS
        // Do what you want
    }
}
For Query 1, I don't think stanford-nlp has an option to extract only specific POS tags.
However, we can achieve the same thing using custom-trained models. I tried something similar for NER (named entity recognition) with custom models.
My overall goal is to return only clean sentences from a Wikipedia article without any markup. Obviously, there are ways to return JSON, XML, etc., but these are full of markup. My best approach so far is to return what Wikipedia calls raw. For example, the following link returns the raw format for the page "Iron Man":
http://en.wikipedia.org/w/index.php?title=Iron%20Man&action=raw
Here is a snippet of what is returned:
...//I am truncating some markup at the beginning here.
|creative_team_month =
|creative_team_year =
|creators_series =
|TPB =
|ISBN =
|TPB# =
|ISBN# =
|nonUS =
}}
'''Iron Man''' is a fictional character, a [[superhero]] that appears in\\
[[comic book]]s published by [[Marvel Comics]].
...//I am truncating here everything until the end.
I have stuck to the raw format because I have found it the easiest to clean up. Although what I have written so far in Java cleans up this pretty well, there are a lot of cases that slip by. These cases include markup for Wikipedia timelines, Wikipedia pictures, and other Wikipedia properties which do not appear on all articles. Again, I am working in Java (in particular, I am working on a Tomcat web application).
Question: Is there a better way to get clean, human-readable sentences from Wikipedia articles? Maybe someone already built a library for this which I just can't find?
I will be happy to edit my question to provide details about what I mean by clean and human-readable if it is not clear.
My current Java method which cleans up the raw formatted text is as follows:
public String cleanRaw(String input){
//Next three lines attempt to get rid of references.
input= input.replaceAll("<ref>.*?</ref>","");
input= input.replaceAll("<ref .*?</ref>","");
input= input.replaceAll("<ref .*?/>","");
input= input.replaceAll("==[^=]*==", "");
//I found that anything between curly braces is not needed.
while (input.indexOf("{{") >= 0){
int prevLength= input.length();
input= input.replaceAll("\\{\\{[^{}]*\\}\\}", "");
if (prevLength == input.length()){
break;
}
}
//Next line gets rid of links to other Wikipedia pages.
input= input.replaceAll("\\[\\[([^]]*[|])?([^]]*?)\\]\\]", "$2");
input= input.replaceAll("<!--.*?-->","");
input= input.replaceAll("[^A-Za-z0-9., ]", "");
return input;
}
I found a couple of projects that might help. You might be able to run the first one by embedding a JavaScript engine in your Java code.
txtwiki.js
A javascript library to convert MediaWiki markup to plaintext.
https://github.com/joaomsa/txtwiki.js
WikiExtractor
A Python script that extracts and cleans text from a Wikipedia database dump
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
Source:
http://www.mediawiki.org/wiki/Alternative_parsers
I've a String from html web page like this:
String htmlString =
<span style="mso-bidi-font-family:Gautami;mso-bidi-theme-font:minor-bidi">President Pranab pay great
tributes to Motilal Nehru on occasion of
</span>
150th birth anniversary. Pranab said institutions evolved by
leaders like him should be strengthened instead of being destroyed.
<span style="mso-spacerun:yes">
</span>
He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of
Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly,
the first set of coins and postal stamps released at the function to commemorate the event.
</p>
I need to extract the text from the above String. After extraction, my output should look like this:
Output:
President Pranab pay great tributes to Motilal Nehru on occasion of 150th birth anniversary. Pranab said institutions evolved by leaders like him should be strengthened instead of being destroyed. He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly, now Parliament. Calling himself a student of history, he said Motilal's Swaraj Party acted as a disciplined assault force in the Legislative Assembly and he was credited with evolving the system of a Public Accounts Committee which is now one of the most effective watchdogs over executive in matters of money and finance. Mukherjee also received the first set of coins and postal stamps released at the function to commemorate the event.
For this I have used the logic below:
int spanIndex = content.indexOf("<span");
spanIndex = content.indexOf(">", spanIndex);
int endspanndex = content.indexOf("</span>", spanIndex);
content = content.substring(spanIndex + 1, endspanndex);
and my resulting output is:
President Pranab pay great tributes to Motilal Nehru on occasion of
I have tried different HTML parsers, but none of them work on J2ME.
Can anyone help me get the full description text? Thanks.
If you are using BlackBerry OS 5.0 or later you can use the BrowserField to parse HTML into a DOM document.
You may continue the same way as you propose with the rest of the string. Alternatively, a simple finite-state automaton would solve this. I have seen such a solution in the moJab project (you can download the sources here). In the mojab.xml package there is a minimalistic XML parser designed for J2ME, which should parse your example as well. Take a look at the sources; it's just three simple classes, and it seems to be usable without modifications.
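As a rough sketch of the finite-state idea (this is not the moJab code, which I have not reproduced here), three states are enough for input like the example: plain text, inside a tag, and inside a quoted attribute value. The class name TagStripper and the decision to collapse whitespace runs into single spaces are assumptions made for this example. It does not handle comments, scripts, or entities.

```java
// Hypothetical three-state tag stripper; avoids generics and regex so it
// stays close to what J2ME-era Java accepts.
final class TagStripper {
    private static final int TEXT = 0;    // outside any tag
    private static final int TAG = 1;     // between '<' and '>'
    private static final int QUOTED = 2;  // inside a quoted attribute value

    static String strip(String html) {
        StringBuffer out = new StringBuffer();
        int state = TEXT;
        char quote = 0;
        boolean pendingSpace = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
            case TEXT:
                if (c == '<') {
                    state = TAG;
                } else if (c == ' ' || c == '\n' || c == '\r' || c == '\t') {
                    pendingSpace = true;  // collapse whitespace runs
                } else {
                    if (pendingSpace && out.length() > 0) out.append(' ');
                    pendingSpace = false;
                    out.append(c);
                }
                break;
            case TAG:
                if (c == '>') {
                    state = TEXT;
                } else if (c == '"' || c == '\'') {
                    quote = c;            // a '>' inside quotes must not end the tag
                    state = QUOTED;
                }
                break;
            default: // QUOTED
                if (c == quote) state = TAG;
                break;
            }
        }
        return out.toString();
    }
}
```

Calling TagStripper.strip(content) on the string from the question should give the text as one line, with the line breaks between the spans folded into single spaces.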
Since J2ME does not support the usual HTML parsers, we can extract the text like this:
private String removeHtmlTags(String content) {
    int beginTag;
    while ((beginTag = content.indexOf('<')) != -1) {
        // Look for the closing '>' after the '<', not from the start of the string.
        int endTag = content.indexOf('>', beginTag);
        if (endTag == -1) {
            break; // unterminated tag: stop instead of looping forever
        }
        content = content.substring(0, beginTag) + content.substring(endTag + 1);
    }
    return content;
}
JSoup is a very popular library for extracting text from HTML documents. Here is one such example of the same.
I am trying to get all the noun phrases using the edu.stanford.nlp.* package. I got all the subtrees of label value "NP", but I am not able to get the normal original String format (not Penn Tree format).
E.g. for the subtree.toString() gives (NP (ND all)(NSS times))) but I want the string "all times". Can anyone please help me. Thanks in advance.
I believe what you want is something like:
final StringBuilder sb = new StringBuilder();
for ( final Tree t : tree.getLeaves() ) {
sb.append(t.toString()).append(" ");
}
While I'm not 100% sure, I seem to recall this being the solution used for some software I worked on a few years back.
This can be accomplished using the yield() method of the subtree, instead of creating a separate StringBuilder object.
if (subtree.label().value().equals("NP")) {
out.println(subtree); //print subtree
out.println(Sentence.listToString(subtree.yield())); //print phrase
break;
}
If I post a comment like "hello there dog" it works great, but if there are any special characters like ' or " the comment is posted successfully to the database but the jQuery code is not displaying the comment in the list.
Thanks for any tips.
function feedbacksubmit () {
// Show the Ajax Loader
$("#ajaxloader").css("display","inline");
var textsubmitted = $("#feedbackinput").val();
if (textsubmitted.length < 5) {
alert("Don't forget to write something!");
// Hide the Ajax Loader
$("#ajaxloader").css("display","none");
}
else {
$.post("/feedback/ajax/insert/", {feedback: textsubmitted},
function(data) {
// Place the comment in the top of the list
$('<li></li>').prependTo("#comment-list").hide().prepend(data.commenttext2insert).fadeIn('slow');
// Hide the Ajax Loader
$("#ajaxloader").css("display","none");
// Clear out the textarea
$("#feedbackinput").val('');
}, "json");
}
}
Here is an example response that is not working with the jQuery code above:
{"returnmessage":"The Ajax operation was successful.","returncode":"0","commenttext2insert":"\n\t<div class=\"comment-header\">\n\t\t<span class=\"comment-avatar\">\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t<img src=\"/_images/users/photos//17941/nobosh.jpg\" />\t\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t</span>\n\t\t<span class=\"comment-author\">\n\t\t\t\n\t\t\t\t<b>BOB Man</b>\n\t\t\t\n\t\t</span>\n\t\t<span class=\"comment-timestamp\">just now</span>\n\t</div>\n\t<div class=\"comment-body\">\n\t\t<p>12wsa\'</p>\n\t</div>\n"}
What does your ColdFusion look like (at least the part where you return something to the browser)? If BalusC is right about needing to escape html characters in your return data, you can just wrap your text with the HTMLEditFormat function rather than needing to write anything in Java.
Try something like:
$.post("/feedback/ajax/insert/", {feedback: escape(textsubmitted)},
...
If you're preparing HTML on the server side, you need to make sure that all reserved HTML characters in user-controlled input are properly escaped, or else they may make the JS code syntactically invalid (and leave your website prone to XSS). You need to escape at least the reserved HTML characters <, >, &, " and ' into the HTML entities &lt;, &gt;, &amp;, &quot; and &#39; respectively.
You mentioned that you're using Coldfusion, which is Java based. As the standard Java SE/EE API doesn't provide builtin facilities to escape them, you'll need to either write one yourself, e.g.
public static final String escapeHTML(String string) {
    StringBuilder builder = new StringBuilder();
    for (char c : string.toCharArray()) {
        switch (c) {
            case '<': builder.append("&lt;"); break;
            case '>': builder.append("&gt;"); break;
            case '&': builder.append("&amp;"); break;
            case '"': builder.append("&quot;"); break;
            case '\'': builder.append("&#39;"); break;
            default: builder.append(c); break;
        }
    }
    return builder.toString();
}
..which can be used as
input = escapeHTML(input);
..or to grab for example Apache Commons Lang StringEscapeUtils#escapeHtml4 which can be used as
input = StringEscapeUtils.escapeHtml4(input);
Once again, only do this for user-controlled input. You don't need to do this for any HTML code which you hardcoded in the server side code (else it would get displayed plain as-is). Thus do something like:
StringBuilder comment = new StringBuilder();
comment.append("<div class=\"comment\">");
comment.append(escapeHTML(input));
comment.append("</div>");
That said, I already commented on your question with the hint that you'd better do this in jQuery, because that's after all much better for maintainability and reusability. You don't want to have raw HTML hidden somewhere in the depths of Java code. You also don't want to make the JSON result dependent on its purpose. Just return a generic (and HTML-sanitized) JSON result and let jQuery build the HTML.