How to clean JTextPane/JEditorPane HTML content to a string in Java?

I'm trying to get clean (pretty) text content from a JTextPane. Here is example code:
JTextPane textPane = new JTextPane();
textPane.setContentType("text/html");
textPane.setText("This <b>is</b> a <b>test</b>.");
String text = textPane.getText();
System.out.println(text);
The text looks like this in the JTextPane:
This is a test.
and I get this kind of output on the console:
<html>
<head>
</head>
<body>
This <b>is</b> a <b>test</b>.
</body>
</html>
I've used substring() and/or replace(), but they are uncomfortable to use:
String text = textPane.getText().replace("<html> ... <body>\n", "");
Is there a simple way to remove all tags other than the <b> (content) tags from the string?
Sometimes JTextPane also adds <p> tags around the content, so I want to get rid of them too.
Like this:
<html>
<head>
</head>
<body>
<p style="margin-top: 0">
hdfhdfgh
</p>
</body>
</html>
I want to get only the text content with its tags:
This <b>is</b> a <b>test</b>.

I subclassed HTMLWriter and overrode startTag and endTag to skip all tags outside of <body>.
I did not test it much, but it seems to work OK. One drawback is that the output string has quite a lot of whitespace; getting rid of that shouldn't be too hard (see the note after the code).
import java.io.*;
import javax.swing.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public class Foo {
    public static void main(String[] args) throws Exception {
        JTextPane textPane = new JTextPane();
        textPane.setContentType("text/html");
        textPane.setText("<p>This</p> <b>is</b> a <b>test</b>.");

        StringWriter writer = new StringWriter();
        HTMLDocument doc = (HTMLDocument) textPane.getStyledDocument();
        HTMLWriter htmlWriter = new OnlyBodyHTMLWriter(writer, doc);
        htmlWriter.write();
        System.out.println(writer.toString());
    }

    private static class OnlyBodyHTMLWriter extends HTMLWriter {

        public OnlyBodyHTMLWriter(Writer w, HTMLDocument doc) {
            super(w, doc);
        }

        private boolean inBody = false;

        private boolean isBody(Element elem) {
            // copied from HTMLWriter.startTag()
            AttributeSet attr = elem.getAttributes();
            Object nameAttribute = attr.getAttribute(StyleConstants.NameAttribute);
            HTML.Tag name = null;
            if (nameAttribute instanceof HTML.Tag) {
                name = (HTML.Tag) nameAttribute;
            }
            return name == HTML.Tag.BODY;
        }

        @Override
        protected void startTag(Element elem) throws IOException, BadLocationException {
            if (inBody) {
                super.startTag(elem);
            }
            if (isBody(elem)) {
                inBody = true;
            }
        }

        @Override
        protected void endTag(Element elem) throws IOException {
            if (isBody(elem)) {
                inBody = false;
            }
            if (inBody) {
                super.endTag(elem);
            }
        }
    }
}
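For the extra whitespace mentioned above, a simple post-processing step may be enough (my own untested sketch, collapsing every run of whitespace into a single space):
String compact = writer.toString().replaceAll("\\s+", " ").trim();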

You could use the HTML parser that the JEditorPane uses itself, HTMLEditorKit.ParserDelegator.
See the API docs for HTMLEditorKit.ParserCallback.
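A rough, untested sketch of that approach (my own illustration, not from the original answer): it keeps the body text and the <b> tags and drops everything else.

import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class BodyTextExtractor {
    public static void main(String[] args) throws Exception {
        String html = "<html><head></head><body>This <b>is</b> a <b>test</b>.</body></html>";
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data); // plain text content
            }
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.B) sb.append("<b>"); // keep only <b>
            }
            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.B) sb.append("</b>");
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, false);
        System.out.println(sb.toString().trim()); // This <b>is</b> a <b>test</b>.
    }
}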

I found a solution to this problem using the substring and replace methods:
// Get the textPane content as a string
String text = textPane.getText();
// Then take a substring to remove the surrounding tags (html, head, body)
text = text.substring(44, text.length() - 19);
// Sometimes the program adds <p style="margin-top: 0"> and </p> tags, so I remove them too.
// This isn't always necessary.
text = text.replace("<p style=\"margin-top: 0\">\n ", "").replace("\n </p>", "");
// Convert possible escaped characters back to normal, e.g. &amp; -> &
text = StringEscapeUtils.unescapeHtml(text);
Here is a link to StringEscapeUtils, the library that converts escape sequences back to normal characters. Thanks to Ozhan Duz for the suggestion.
(commons-lang - download)

String text = textPane.getDocument().getText(0, textPane.getDocument().getLength()); // plain text only, without any tags

Related

Reading an html page content and parsing the content in JSP

In this Java web application project I'm first trying to read the content of a page with the getUrlContentString() method (which seems to be working), and second, to display only the content between <a> tags using the proccessString() method. The second method does not respond as expected and returns a blank page. What is causing the problem?
index.jsp
<%@page contentType="text/html" pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>JSP Page</title>
</head>
<body>
<%= cookiePac.CookieJar.getUrlContentString("http://help.websiteos.com/"
+ "websiteos/example_of_a_simple_html_page.htm")%>
<p>
<%= cookiePac.CookieJar.proccessString()%>
</p>
</body>
</html>
CookieJar.java
package cookiePac;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CookieJar {

    private final List<String> cookies;
    private static String rawCookiesString = "";
    private static String rawCookiesString_1 = "";

    public CookieJar() {
        this.cookies = new ArrayList<>();
    }

    /* read the page, store into rawCookiesString */
    public static String getUrlContentString(String theUrl) {
        StringBuilder content = new StringBuilder();
        try {
            URL url = new URL(theUrl);
            URLConnection urlConnection = url.openConnection();
            BufferedReader bufferedReader = new BufferedReader(
                    new InputStreamReader(urlConnection.getInputStream()));
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                content.append(line + "\n");
            }
            bufferedReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        rawCookiesString = content.toString();
        return " ";
    }

    /* select the content between <a> */
    public static String proccessString() {
        Pattern p = Pattern.compile("<a>(.*?)</a>");
        Matcher m = p.matcher(rawCookiesString);
        if (m.find()) {
            rawCookiesString_1 = m.group(1);
        }
        return rawCookiesString_1;
    }
}
I've created a project with your code and saw some problems there. Here they are.
First of all, the static HTML that you get for the URL you've specified - not the one you see in your browser's console window, but the one returned without scripts being executed - does not contain any anchor tags. That's why you cannot get any content for that tag. Take, for example, this URL: http://www.cssdesignawards.com/ - instead of yours, http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm.
Secondly, you're trying to match the tag in this fashion: "<a>(.*?)</a>". In fact it's very hard to match any anchor tag's content with this regex, because anchors usually carry CSS classes and other attributes, so a pattern that increases the chances of matching anchor content is "<a(.*?)</a>" instead of "<a>(.*?)</a>".
Next, your getUrlContentString method is named as if it returns the HTML as a string, but it always returns just a blank string. Consider renaming this method or returning rawCookiesString.
Moreover, you have a lot of static methods. Java is an object-oriented language, and it's much better to use non-static methods for the primary logic of an application.
And finally, to parse HTML I recommend the JSoup library. It's not very hard to get acquainted with, and it provides really great facilities for HTML parsing. For example, here is a cookbook to extract information from tags; a minimal sketch follows.
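As an illustration (my own minimal sketch, assuming jsoup is on the classpath), extracting the text of every anchor on a page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AnchorExtractor {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page (this URL is just the example from above)
        Document doc = Jsoup.connect("http://www.cssdesignawards.com/").get();
        // Select every <a> element, regardless of its attributes
        for (Element a : doc.select("a")) {
            System.out.println(a.text());
        }
    }
}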

Jsoup WhiteList to allow comments

I am using jsoup 1.7.3 with a custom Whitelist configuration.
Apparently it sanitizes all the HTML comments (<!-- ... -->) inside the document.
It also sanitizes the <!DOCTYPE ...> element.
How can I get jsoup Whitelist to allow comments as is?
How can I define the !DOCTYPE element as an allowed element with any attributes?
This is not possible with the standard JSoup classes, and it's not dependent on the Whitelist; it's the org.jsoup.safety.Cleaner. The cleaner uses a node traverser that allows only elements and text nodes, and only the body is parsed, so the head and doctype are ignored completely. To achieve this you'll have to create a custom cleaner. For example, if you have HTML like
<!DOCTYPE html>
<html>
<head>
<!-- This is a script -->
<script type="text/javascript">
function newFun() {
alert(1);
}
</script>
</head>
<body>
<map name="diagram_map">
<area id="area1" />
<area id="area2" />
</map>
<!-- This is another comment. -->
<div>Test</div>
</body>
</html>
You will first create a custom cleaner by copying the original one. Please note the package should be org.jsoup.safety, as the cleaner uses some of the protected methods of the associated Whitelist. Also, there is no point in extending Cleaner, as almost all of its methods are private and the inner node traverser is final.
package org.jsoup.safety;

import org.jsoup.helper.Validate;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.DocumentType;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Tag;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class CustomCleaner {

    private Whitelist whitelist;

    public CustomCleaner(Whitelist whitelist) {
        Validate.notNull(whitelist);
        this.whitelist = whitelist;
    }

    public Document clean(Document dirtyDocument) {
        Validate.notNull(dirtyDocument);
        Document clean = Document.createShell(dirtyDocument.baseUri());
        copyDocType(dirtyDocument, clean);
        if (dirtyDocument.head() != null)
            copySafeNodes(dirtyDocument.head(), clean.head());
        if (dirtyDocument.body() != null) // frameset documents won't have a body; the clean doc will have an empty body
            copySafeNodes(dirtyDocument.body(), clean.body());
        return clean;
    }

    private void copyDocType(Document dirtyDocument, final Document clean) {
        dirtyDocument.traverse(new NodeVisitor() {
            public void head(Node node, int depth) {
                if (node instanceof DocumentType) {
                    clean.prependChild(node);
                }
            }
            public void tail(Node node, int depth) { }
        });
    }

    public boolean isValid(Document dirtyDocument) {
        Validate.notNull(dirtyDocument);
        Document clean = Document.createShell(dirtyDocument.baseUri());
        int numDiscarded = copySafeNodes(dirtyDocument.body(), clean.body());
        return numDiscarded == 0;
    }

    private final class CleaningVisitor implements NodeVisitor {
        private int numDiscarded = 0;
        private final Element root;
        private Element destination; // current element to append nodes to

        private CleaningVisitor(Element root, Element destination) {
            this.root = root;
            this.destination = destination;
        }

        public void head(Node source, int depth) {
            if (source instanceof Element) {
                Element sourceEl = (Element) source;
                if (whitelist.isSafeTag(sourceEl.tagName())) { // safe, clone and copy safe attrs
                    ElementMeta meta = createSafeElement(sourceEl);
                    Element destChild = meta.el;
                    destination.appendChild(destChild);
                    numDiscarded += meta.numAttribsDiscarded;
                    destination = destChild;
                } else if (source != root) { // not a safe tag, so don't add. don't count root against discarded.
                    numDiscarded++;
                }
            } else if (source instanceof TextNode) {
                TextNode sourceText = (TextNode) source;
                TextNode destText = new TextNode(sourceText.getWholeText(), source.baseUri());
                destination.appendChild(destText);
            } else if (source instanceof Comment) {
                Comment sourceComment = (Comment) source;
                Comment destComment = new Comment(sourceComment.getData(), source.baseUri());
                destination.appendChild(destComment);
            } else if (source instanceof DataNode) {
                DataNode sourceData = (DataNode) source;
                DataNode destData = new DataNode(sourceData.getWholeData(), source.baseUri());
                destination.appendChild(destData);
            } else { // else, we don't care about xml proc instructions, etc
                numDiscarded++;
            }
        }

        public void tail(Node source, int depth) {
            if (source instanceof Element && whitelist.isSafeTag(source.nodeName())) {
                destination = destination.parent(); // would have descended, so pop destination stack
            }
        }
    }

    private int copySafeNodes(Element source, Element dest) {
        CleaningVisitor cleaningVisitor = new CleaningVisitor(source, dest);
        NodeTraversor traversor = new NodeTraversor(cleaningVisitor);
        traversor.traverse(source);
        return cleaningVisitor.numDiscarded;
    }

    private ElementMeta createSafeElement(Element sourceEl) {
        String sourceTag = sourceEl.tagName();
        Attributes destAttrs = new Attributes();
        Element dest = new Element(Tag.valueOf(sourceTag), sourceEl.baseUri(), destAttrs);
        int numDiscarded = 0;

        Attributes sourceAttrs = sourceEl.attributes();
        for (Attribute sourceAttr : sourceAttrs) {
            if (whitelist.isSafeAttribute(sourceTag, sourceEl, sourceAttr))
                destAttrs.put(sourceAttr);
            else
                numDiscarded++;
        }
        Attributes enforcedAttrs = whitelist.getEnforcedAttributes(sourceTag);
        destAttrs.addAll(enforcedAttrs);
        return new ElementMeta(dest, numDiscarded);
    }

    private static class ElementMeta {
        Element el;
        int numAttribsDiscarded;

        ElementMeta(Element el, int numAttribsDiscarded) {
            this.el = el;
            this.numAttribsDiscarded = numAttribsDiscarded;
        }
    }
}
Once you have the custom cleaner, you can clean as usual. For example:
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.CustomCleaner;
import org.jsoup.safety.Whitelist;

public class CustomJsoupSanitizer {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.parse(new File("t2.html"), "UTF-8");
            CustomCleaner cleaner = new CustomCleaner(Whitelist.relaxed().addTags("script"));
            Document doc2 = cleaner.clean(doc);
            System.out.println(doc2.html());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This will give you the sanitized output for the above HTML:
<!DOCTYPE html>
<html>
<head>
<!-- This is a script -->
<script>
function newFun() {
alert(1);
}
</script>
</head>
<body>
<!-- This is another comment. -->
<div>
Test
</div>
</body>
</html>
You can customize the cleaner to match your requirements, e.g. to skip the head node or the script tag, etc.
The Jsoup Cleaner doesn't give you a chance here (l. 100 of its source):
} else { // else, we don't care about comments, xml proc instructions, etc
numDiscarded++;
}
Only instances of Element and TextNode may remain in the cleaned HTML.
Your only chance may be something horrible like parsing the document, replacing the comments and the doctype with a special whitelisted tag, cleaning the document, and then parsing and replacing the special tags again. A rough sketch of that idea follows.
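This is my own illustration, not from the original answer; the span placeholder tag and the data-comment attribute are arbitrary choices, and the doctype would need the same treatment:

import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Tag;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

public class CommentPreservingClean {
    public static void main(String[] args) {
        Document dirty = Jsoup.parse("<div>Test</div><!-- keep me -->");
        // 1. Replace every comment with a whitelisted placeholder element.
        for (Element el : dirty.getAllElements()) {
            for (Node n : new ArrayList<Node>(el.childNodes())) {
                if (n instanceof Comment) {
                    Element placeholder = new Element(Tag.valueOf("span"), "");
                    placeholder.attr("data-comment", ((Comment) n).getData());
                    n.replaceWith(placeholder);
                }
            }
        }
        // 2. Clean with a whitelist that lets the placeholder through.
        Whitelist wl = Whitelist.relaxed().addTags("span").addAttributes("span", "data-comment");
        Document clean = new Cleaner(wl).clean(dirty);
        // 3. Turn the placeholders back into comments.
        for (Element ph : clean.select("span[data-comment]")) {
            ph.replaceWith(new Comment(ph.attr("data-comment"), ""));
        }
        System.out.println(clean.html());
    }
}

Any real <span data-comment> in the input would be mangled by the round trip, so this is as fragile as it sounds.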

HTML to PDF using iText : How can produce a checkbox

I have a simple HTML page from which iText is able to produce a PDF. That works fine, but the checkbox is ignored. What can I do about it?
import java.io.FileOutputStream;
import java.io.StringReader;
import com.itextpdf.text.Document;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.html.simpleparser.HTMLWorker;
import com.itextpdf.text.pdf.PdfWriter;

public class HtmlToPDF {
    public static void main(String... args) {
        try {
            Document document = new Document(PageSize.LETTER);
            PdfWriter pdfWriter = PdfWriter.getInstance(document, new FileOutputStream("c://temp//testpdf.pdf"));
            document.open();
            HTMLWorker htmlWorker = new HTMLWorker(document); // this declaration was missing from the original snippet
            String str = "<HTML><HEAD></HEAD><BODY><H1>Testing</H1><FORM>" +
                    "check : <INPUT TYPE='checkbox' CHECKED/><br/>" +
                    "</FORM></BODY></HTML>";
            htmlWorker.parse(new StringReader(str));
            document.close();
            System.out.println("Done.");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I got it working with YAHP ( http://www.allcolor.org/YaHPConverter/ ).
import java.io.File;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
// http://www.allcolor.org/YaHPConverter/
import org.allcolor.yahp.converter.CYaHPConverter;
import org.allcolor.yahp.converter.IHtmlToPdfTransformer;

public class HtmlToPdf_yahp {
    public static void main(String... args) throws Exception {
        htmlToPdfFile();
    }

    public static void htmlToPdfFile() throws Exception {
        CYaHPConverter converter = new CYaHPConverter();
        File fout = new File("c:/temp/x.pdf");
        FileOutputStream out = new FileOutputStream(fout);
        Map properties = new HashMap();
        List headerFooterList = new ArrayList();
        String str = "<HTML><HEAD></HEAD><BODY><H1>Testing</H1><FORM>" +
                "check : <INPUT TYPE='checkbox' checked=checked/><br/>" +
                "</FORM></BODY></HTML>";
        properties.put(IHtmlToPdfTransformer.PDF_RENDERER_CLASS,
                IHtmlToPdfTransformer.FLYINGSAUCER_PDF_RENDERER);
        //properties.put(IHtmlToPdfTransformer.FOP_TTF_FONT_PATH, fontPath);
        converter.convertToPdf(str,
                IHtmlToPdfTransformer.A4P, headerFooterList, "file://c:/temp/", out,
                properties);
        out.flush();
        out.close();
    }
}
Are you generating the HTML?
If so, then instead of using an HTML checkbox you could use the Unicode 'ballot box' character, which is &#9744; (U+2610, ☐). It's just a box; you can't electronically tick or untick it, but if the PDF is intended for printing then of course people can tick it using a pen or pencil.
For example:
String str = "<HTML><HEAD></HEAD><BODY><H1>Testing</H1><FORM>" +
        "check : &#9744;<br/>" +
        "</FORM></BODY></HTML>";
Note that this will only work if you're using a Unicode font in your PDF; I think that iText won't use a Unicode font unless you tell it to.
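For reference, a minimal sketch of telling iText to use a Unicode-capable font directly, sidestepping the HTML path ("arialuni.ttf" is an assumed font file; substitute one you actually have):

import com.itextpdf.text.Font;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;

// ... inside the try block, after document.open():
BaseFont bf = BaseFont.createFont("arialuni.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font font = new Font(bf, 12);
document.add(new Paragraph("check : \u2610", font)); // U+2610 BALLOT BOX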
You may be out of luck here.
The HTMLWorker that is used to parse the HTML tags doesn't seem to support the input tag:
public static final String tagsSupportedString = "ol ul li a pre font span br p div body table td th tr i b u sub sup em strong s strike h1 h2 h3 h4 h5 h6 img";
You can access the source code for HTMLWorker here:
http://www.java2s.com/Open-Source/Java-Document/PDF/pdf-itext/com/lowagie/text/html/simpleparser/HTMLWorker.java.htm
It is from this source that I figured that out.
public void startElement(String tag, HashMap h) {
    if (!tagsSupported.containsKey(tag))
        return; // return if tag not supported
    // ...
}
Creating PDFs from HTML with iText is a bit troublesome. I advise using the Flying Saucer library for this; it also uses iText in the background.
The only alternative I'm aware of at that point is to hack iText. The new XMLWorker should be considerably more extensible than the old way (HTMLWorker), but it'll still be non-trivial.
There might be some magic style tag you can pass in that will show up in a "generic tag" for a PdfPageEvent handler... let's see here...
Reading the code, it looks like a style or attribute "generictag" will be propagated to the ...text.Chunk object via setGenericTag().
So what you need to do is XSLT your unsupported tags into div/p/whatever with a "generictag" attribute that is a string which encodes the information you need to recreate the original element.
In your page event handler's onGenericTag function, you then have to parse that tag and recreate whatever it is you're trying to recreate.
That's just crazy enough to work!
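To make that last step concrete, here is a minimal sketch of such a page event handler against the iText 5 API (my own illustration; the tag string "checkbox" is an arbitrary choice):

import com.itextpdf.text.Document;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfPageEventHelper;
import com.itextpdf.text.pdf.PdfWriter;

public class CheckboxTagHandler extends PdfPageEventHelper {
    @Override
    public void onGenericTag(PdfWriter writer, Document document, Rectangle rect, String text) {
        if ("checkbox".equals(text)) {
            // Draw an empty square where the tagged chunk was laid out.
            PdfContentByte cb = writer.getDirectContent();
            cb.rectangle(rect.getLeft(), rect.getBottom(), rect.getHeight(), rect.getHeight());
            cb.stroke();
        }
    }
}

Wiring it up would look like writer.setPageEvent(new CheckboxTagHandler()) plus chunk.setGenericTag("checkbox") on the chunk standing in for the checkbox.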

Retrieve KEYWORDS from META tag in a HTML WebPage using JAVA

I want to retrieve all the content words from an HTML web page, and all the keywords contained in the META tag of the same page, using Java.
For example, consider this html source code:
<html>
<head>
<meta name = "keywords" content = "deception, intricacy, treachery">
</head>
<body>
My very short html document.
<br>
It has just 2 'lines'.
</body>
</html>
The CONTENT WORDS here are: my, very, short, html, document, it, has, just, lines
Note: The punctuation and the number '2' are ruled out.
The KEYWORDS here are: deception, intricacy, treachery
I have created a class for this purpose called WebDoc, but this is as far as I have been able to get.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;

public class WebDoc {

    protected URL _url;
    protected Set<String> _contentWords;
    protected Set<String> _keyWords;

    public WebDoc(URL paramURL) {
        _url = paramURL;
    }

    public Set<String> getContents() throws IOException {
        //URL url = new URL(url);
        Set<String> contentWords = new TreeSet<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Process each line.
            contentWords.add(RemoveTag(inputLine));
            //System.out.println(RemoveTag(inputLine));
        }
        in.close();
        System.out.println(contentWords);
        _contentWords = contentWords;
        return contentWords;
    }

    public String RemoveTag(String html) {
        html = html.replaceAll("\\<.*?>", "");
        html = html.replaceAll("&", "");
        return html;
    }

    public Set<String> getKeywords() {
        //NO IDEA !
        return null;
    }

    public URL getURL() {
        return _url;
    }

    @Override
    public String toString() {
        return null;
    }
}
So, after the answer from RedSoxFan about the meta keywords, you only need to split your content lines.
You can use a similar method there:
Instead of
contentWords.add(RemoveTag(inputLine));
use
contentWords.addAll(Arrays.asList(RemoveTag(inputLine).split("[^\\p{L}]+")));
.split(...) splits your line at all runs of non-letters (I hope this works, please try and report), giving back an array of substrings which each consist only of letters, possibly with some empty strings in between.
Arrays.asList(...) wraps this array in a list.
addAll(...) adds all the elements of this list to the set (duplicates are ignored).
At the end you should delete the empty string "" from your contentWords set. A tiny demo follows.
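A standalone demo of that split (my own example input):

import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String line = "It has just 2 'lines'.";
        System.out.println(Arrays.asList(line.split("[^\\p{L}]+")));
        // prints [It, has, just, lines]
    }
}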
Process each line and use
public Set<String> getKeywords(String str) {
    Set<String> s = new HashSet<String>();
    str = str.trim();
    if (str.toLowerCase().startsWith("<meta ")) {
        if (str.toLowerCase().matches("<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\".*\"/?>")) {
            // Returns only what's in the content attribute (case-insensitive)
            str = str.replaceAll("(?i)<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\"(.*)\"/?>", "$1");
            for (String st : str.split(",")) s.add(st.trim());
            return s;
        }
    }
    return null;
}
If you need an explanation, let me know.
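Hypothetical usage with the meta line from the question:

Set<String> kw = getKeywords("<meta name = \"keywords\" content = \"deception, intricacy, treachery\">");
// kw now contains: deception, intricacy, treachery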

How to get HTML DOM path by text content?

An HTML file:
<html>
<body>
<div class="main">
<p id="tID">content</p>
</div>
</body>
</html>
I have a String == "content", and I want to use "content" to get the HTML DOM path:
html body div.main p#tID
Chrome's developer tools have this feature (Elements tab, bottom bar); I want to know how to do it in Java.
Thanks for your help :)
Have fun :)
JAVA CODE
import java.io.File;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class Teste {
    public static void main(String[] args) {
        try {
            // read and clean document
            TagNode tagNode = new HtmlCleaner().clean(new File("test.xml"));
            Document document = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
            // use XPath to find target node
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate("//*[text()='content']", document, XPathConstants.NODE);
            // assembles jquery/css selector
            String result = "";
            while (node != null && node.getParentNode() != null) {
                result = readPath(node) + " " + result;
                node = node.getParentNode();
            }
            System.out.println(result);
            // returns html body div#myDiv.foo.bar p#tID
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Gets id and class attributes of this node
    private static String readPath(Node node) {
        NamedNodeMap attributes = node.getAttributes();
        String id = readAttribute(attributes.getNamedItem("id"), "#");
        String clazz = readAttribute(attributes.getNamedItem("class"), ".");
        return node.getNodeName() + id + clazz;
    }

    // Read attribute
    private static String readAttribute(Node node, String token) {
        String result = "";
        if (node != null) {
            result = token + node.getTextContent().replace(" ", token);
        }
        return result;
    }
}
XML EXAMPLE
<html>
<body>
<br>
<div id="myDiv" class="foo bar">
<p id="tID">content</p>
</div>
</body>
</html>
EXPLANATIONS
The document object points to the evaluated XML.
The XPath //*[text()='content'] finds everything with text = 'content' and returns that node.
The while loop walks up to the first node, collecting the id and classes of the current element.
MORE EXPLANATIONS
In this new solution I'm using HtmlCleaner, so you can have <br>, for example, and the cleaner will replace it with <br/>.
To use HtmlCleaner, just download the newest jar here.
