Reading an html page content and parsing the content in JSP - java

In this Java web application project I'm first, trying to read the content of a page with getUrlContentString() method (seem to be working) and second, only display the content between tags using the method proccessString (). The second method does not seem to be responding as expected and it returns a blank page. What is causing the problem?
index.jsp
<%#page contentType="text/html" pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>JSP Page</title>
</head>
<body>
<%= cookiePac.CookieJar.getUrlContentString("http://help.websiteos.com/"
+ "websiteos/example_of_a_simple_html_page.htm")%>
<p>
<%= cookiePac.CookieJar.proccessString()%>
</p>
</body>
</html>
CookieJar.java
package cookiePac;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CookieJar {
private final List<String> cookies;
private static String rawCookiesString = "";
private static String rawCookiesString_1 = "";
public CookieJar () {
this.cookies = new ArrayList<>();
}
/* read the page, store into rawCookiesString */
public static String getUrlContentString (String theUrl) {
StringBuilder content = new StringBuilder();
try {
URL url = new URL(theUrl);
URLConnection urlConnection = url.openConnection();
BufferedReader bufferedReader = new BufferedReader(
new InputStreamReader(urlConnection.getInputStream()));
String line;
while ((line = bufferedReader.readLine()) != null) {
content.append(line + "\n");
}
bufferedReader.close();
} catch (Exception e) {
e.printStackTrace();
}
rawCookiesString = content.toString();
return " ";
}
/* select the content between <a> */
public static String proccessString () {
Pattern p = Pattern.compile("<a>(.*?)</a>");
Matcher m = p.matcher(rawCookiesString);
if (m.find()) {
rawCookiesString_1 = m.group(1);
}
return rawCookiesString_1.toString();
}
}

I've created a project with your code. I saw some problems there. Here they are.
First of all, a static html that you get with the url you've specified - not the one you see in your browser console
window, but the one without scripts being executed - does not
contain anchor tags. That's why you cannot get any content of this
tag. Take, for example, this URL: http://www.cssdesignawards.com/ - instead
of yours http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm.
Secondly, you're trying to match a tag in this fashion:
"<a>(.*?)</a>". But in fact it's very hard to match any anchor tag
content with this regex, because usually CSS classes are used, so
the way that increases chances to match anchor content is to use
"<a(.*?)</a>" instead of "<a>(.*?)</a>".
Next, your getUrlContentString method is named to return html as a string,
but it always returns just a blank string. Consider renaming this method or
returning rawCookiesString.
Moreover, you have a lot of static methods. Java is object-oriented
language, and it's much better to use non-static methods for primary logic of
application.
And finally, to parse html, I recommend you to use JSoup
library. It's not very hard to get acquainted
with it, and it provides really great opportunities for html
parsing. For example, here is a cookbook to extract information from tags.

Related

Jsoup HTML extraction without script code [duplicate]

One block on the page is filled with content by JavaScript and after loading page with Jsoup there is none of that inforamtion. Is there a way to get also JavaScript generated content when parsing page with Jsoup?
Can't paste page code here, since it is too long: http://pastebin.com/qw4Rfqgw
Here's element which content I need: <div id='tags_list'></div>
I need to get this information in Java. Preferably using Jsoup. Element is field with help of JavaScript:
<div id="tags_list">
разведчик
Sr
стратегический
</div>
Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Test
{
public static void main( String[] args )
{
try
{
Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
Elements Tags = Doc.select( "#tags_list a" );
for ( Element Tag : Tags )
{
System.out.println( Tag.text() );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}
}
JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by Javascript after the initial page load.
To get access to that type of content you will need an embedded browser component, there are a number of discussions on SO regarding that kind of component, eg Is there a way to embed a browser in Java?
Solved in my case with com.codeborne.phantomjsdriver
NOTE: it is groovy code.
pom.xml
<dependency>
<groupId>com.codeborne</groupId>
<artifactId>phantomjsdriver</artifactId>
<version> <here goes last version> </version>
</dependency>
PhantomJsUtils.groovy
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver
class PhantomJsUtils {
private static String filePath = 'data/temp/';
public static Document renderPage(String filePath) {
System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
WebDriver ghostDriver = new PhantomJSDriver();
try {
ghostDriver.get(filePath);
return Jsoup.parse(ghostDriver.getPageSource());
} finally {
ghostDriver.quit();
}
}
public static Document renderPage(Document doc) {
String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
FileUtils.writeToFile(tmpFileName, doc.toString());
return renderPage(tmpFileName);
}
}
ClassInProject.groovy
Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))
You need to understand what is happening :
When you query a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
However, most websites include Javascript in that HTML, or linked from that HTML, which will populate the page with content. Your browser is able to execute the Javascript, and thus populate the page. Jsoup is not.
The way to understand this is the following : parsing HTML code is easy. Executing Javascript code and updating corresponding HTML code is a lot more complex, and is the work of a browser.
Here are some solutions for this kind of problems:
If you can find what are the Ajax calls that Javascript code is making, that is loading content, you might be able to use the URL of these calls with Jsoup. In order to do that, use Developer Tools from your browser. But this is not guaranteed to work:
it might be that the url is dynamic, and depends on what is on the page at that time
if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It works with Javascript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
You can use a combination of JSoup and HtmlUnit to get the page contents after JavaScript scripts are done loading.
pom.xml
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>3.35</version>
</dependency>
Simple Example From file https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit
// load page using HTML Unit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// iterate row and col
for (Element row : doc.select("table#data > tbody > tr"))
for (Element col : row.select("td"))
// print results
System.out.println(col.ownText());
// clean up resources
webClient2.close();
A Complex Example: Load login, get Session and CSRF, then post and wait for home page to finish loading (15 seconds)
import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();
String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");
HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");
WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
new URL("https://url"),
"https://referrer");
// Add other cookies/ Session ...
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// Wait time
webClient.waitForBackgroundJavaScript(15000);
webClient.getOptions().setThrowExceptionOnScriptError(false);
URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);
// Wait
synchronized (page) {
try {
page.wait(15000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());
I fact there is a "way"! Maybe it is more "a workaround" than a "way... The code below checks both for meta attribute "REFRESH" and javascript redirects... If either of them exists RedirectedUrl variable is set. So you know your target... Then you can retrieve the target page and go on...
String RedirectedUrl=null;
Elements meta = page.select("html head meta");
if (meta.attr("http-equiv").contains("REFRESH")) {
RedirectedUrl = meta.attr("content").split("=")[1];
} else {
if (page.toString().contains("window.location.href")) {
meta = page.select("script");
for (Element script:meta) {
String s = script.data();
if (!s.isEmpty() && s.startsWith("window.location.href")) {
int start = s.indexOf("=");
int end = s.indexOf(";");
if (start>0 && end >start) {
s = s.substring(start+1,end);
s =s.replace("'", "").replace("\"", "");
RedirectedUrl = s.trim();
break;
}
}
}
}
}
... now retrieve the redirected page again...
It is possible by combining JSoup with another framework to interpret the webpage, in my example here I'm using HtmlUnit.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
...
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);
Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");
After specifying user agent, my problem is solved.
https://github.com/jhy/jsoup/issues/287#issuecomment-12769155
Try:
Document Doc = Jsoup.connect(url)
.header("Accept-Encoding", "gzip, deflate")
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
.maxBodySize(0)
.timeout(600000)
.get();

Page content is loaded with JavaScript and Jsoup doesn't see it

One block on the page is filled with content by JavaScript and after loading page with Jsoup there is none of that inforamtion. Is there a way to get also JavaScript generated content when parsing page with Jsoup?
Can't paste page code here, since it is too long: http://pastebin.com/qw4Rfqgw
Here's element which content I need: <div id='tags_list'></div>
I need to get this information in Java. Preferably using Jsoup. Element is field with help of JavaScript:
<div id="tags_list">
разведчик
Sr
стратегический
</div>
Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Test
{
public static void main( String[] args )
{
try
{
Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
Elements Tags = Doc.select( "#tags_list a" );
for ( Element Tag : Tags )
{
System.out.println( Tag.text() );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}
}
JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by Javascript after the initial page load.
To get access to that type of content you will need an embedded browser component, there are a number of discussions on SO regarding that kind of component, eg Is there a way to embed a browser in Java?
Solved in my case with com.codeborne.phantomjsdriver
NOTE: it is groovy code.
pom.xml
<dependency>
<groupId>com.codeborne</groupId>
<artifactId>phantomjsdriver</artifactId>
<version> <here goes last version> </version>
</dependency>
PhantomJsUtils.groovy
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver
class PhantomJsUtils {
private static String filePath = 'data/temp/';
public static Document renderPage(String filePath) {
System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
WebDriver ghostDriver = new PhantomJSDriver();
try {
ghostDriver.get(filePath);
return Jsoup.parse(ghostDriver.getPageSource());
} finally {
ghostDriver.quit();
}
}
public static Document renderPage(Document doc) {
String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
FileUtils.writeToFile(tmpFileName, doc.toString());
return renderPage(tmpFileName);
}
}
ClassInProject.groovy
Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))
You need to understand what is happening :
When you query a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
However, most websites include Javascript in that HTML, or linked from that HTML, which will populate the page with content. Your browser is able to execute the Javascript, and thus populate the page. Jsoup is not.
The way to understand this is the following : parsing HTML code is easy. Executing Javascript code and updating corresponding HTML code is a lot more complex, and is the work of a browser.
Here are some solutions for this kind of problems:
If you can find what are the Ajax calls that Javascript code is making, that is loading content, you might be able to use the URL of these calls with Jsoup. In order to do that, use Developer Tools from your browser. But this is not guaranteed to work:
it might be that the url is dynamic, and depends on what is on the page at that time
if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It works with Javascript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
You can use a combination of JSoup and HtmlUnit to get the page contents after JavaScript scripts are done loading.
pom.xml
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>3.35</version>
</dependency>
Simple Example From file https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit
// load page using HTML Unit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// iterate row and col
for (Element row : doc.select("table#data > tbody > tr"))
for (Element col : row.select("td"))
// print results
System.out.println(col.ownText());
// clean up resources
webClient2.close();
A Complex Example: Load login, get Session and CSRF, then post and wait for home page to finish loading (15 seconds)
import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();
String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");
HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");
WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
new URL("https://url"),
"https://referrer");
// Add other cookies/ Session ...
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// Wait time
webClient.waitForBackgroundJavaScript(15000);
webClient.getOptions().setThrowExceptionOnScriptError(false);
URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);
// Wait
synchronized (page) {
try {
page.wait(15000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());
I fact there is a "way"! Maybe it is more "a workaround" than a "way... The code below checks both for meta attribute "REFRESH" and javascript redirects... If either of them exists RedirectedUrl variable is set. So you know your target... Then you can retrieve the target page and go on...
String RedirectedUrl=null;
Elements meta = page.select("html head meta");
if (meta.attr("http-equiv").contains("REFRESH")) {
RedirectedUrl = meta.attr("content").split("=")[1];
} else {
if (page.toString().contains("window.location.href")) {
meta = page.select("script");
for (Element script:meta) {
String s = script.data();
if (!s.isEmpty() && s.startsWith("window.location.href")) {
int start = s.indexOf("=");
int end = s.indexOf(";");
if (start>0 && end >start) {
s = s.substring(start+1,end);
s =s.replace("'", "").replace("\"", "");
RedirectedUrl = s.trim();
break;
}
}
}
}
}
... now retrieve the redirected page again...
It is possible by combining JSoup with another framework to interpret the webpage, in my example here I'm using HtmlUnit.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
...
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);
Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");
After specifying user agent, my problem is solved.
https://github.com/jhy/jsoup/issues/287#issuecomment-12769155
Try:
Document Doc = Jsoup.connect(url)
.header("Accept-Encoding", "gzip, deflate")
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
.maxBodySize(0)
.timeout(600000)
.get();

Javascript to Java Applet communication

I am trying to pass a selected value from HTML drop-down to an Applet method, using setter method in the Applet. But every time the Javascript is invoked it shows "object doesn't support this property or method" as an exception.
My javascript code :
function showSelected(value){
alert("the value given from"+value);
var diseasename=value;
alert(diseasename);
document.decisiontreeapplet.setDieasename(diseasename);
alert("i am after value set ");
}
My applet code :
package com.vaannila.utility;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import prefuse.util.ui.JPrefuseApplet;
public class dynamicTreeApplet extends JPrefuseApplet {
private static final long serialVersionUID = 1L;
public static int i = 1;
public String dieasenameencode;
//System.out.println("asjdjkhcd"+dieasenameencode);
public void init() {
System.out.println("asjdjkhcd"+dieasenameencode);
System.out.println("the value of i is " + i);
URL url = null;
//String ashu=this.getParameter("dieasenmae");
//System.out.println("the value of the dieases is "+ashu);
//Here dieasesname is important to make the page refresh happen
//String dencode = dieasenameencode.trim();
try {
//String dieasename = URLEncoder.encode(dencode, "UTF-8");
// i want this piece of the code to be called
url = new URL("http://localhost:8080/docRuleToolProtocol/appletRefreshAction.do?dieasename="+dieasenameencode);
URLConnection con = url.openConnection();
con.setDoOutput(true);
con.setDoInput(true);
con.setUseCaches(false);
InputStream ois = con.getInputStream();
this.setContentPane(dynamicView.demo(ois, "name"));
ois.close();
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (FileNotFoundException f) {
f.printStackTrace();
} catch (IOException io) {
io.printStackTrace();
}
++i;
}
public void setDieasename(String message){
System.out.println("atleast i am here and call is made ");
this.dieasenameencode=message;
System.out.println("the final value of the dieasenmae"+dieasenameencode);
}
}
My appletdeployment code :
<applet id="decisiontreeapplet" code="com.vaannila.utility.dynamicTreeApplet.class" archive="./appletjars/dynamictree.jar, ./appletjars/prefuse.jar" width ="1000" height="500" >
</applet>
Change..
document.decisiontreeapplet
..to..
document.getElementById('decisiontreeapplet')
..and it will most likely work.
E.G.
HTML
<html>
<body>
<script type='text/javascript'>
function callApplet() {
msg = document.getElementById('input').value;
applet = document.getElementById('output');
applet.setMessage(msg);
}
</script>
<input id='input' type='text' size=20 onchange='callApplet()'>
<br>
<applet
id='output'
code='CallApplet'
width=120
height=20>
</applet>
</body>
</html>
Java
import javax.swing.*;
public class CallApplet extends JApplet {
JTextField output;
public void init() {
output = new JTextField(20);
add(output);
validate();
}
public void setMessage(String message) {
output.setText(message);
}
}
Please also consider posting a short complete example next time. Note that the number of lines in the two sources shown above, is shorter that your e.g. applet, and it took me longer to prepare the source so I could check my answer.
Try changing the id parameter in your applet tag to name instead.
<applet name="decisiontreeapplet" ...>
</applet>
Try passing parameters using the param tag:
http://download.oracle.com/javase/tutorial/deployment/applet/param.html
I think the <applet> tag is obsolete and <object> tag shoudl be used instead. I recall there was some boolean param named scriptable in the object tag.
Why you do not use deployment toolkit ? It would save you a lot of trying - see http://rostislav-matl.blogspot.com/2011/10/java-applets-building-with-maven.html for more info.

How to clean JTextPanes/JEditorPanes html content to string in Java?

I try to get pretty (cleaned) text content from JTextPane. Here is example code from JTextPane:
JTextPane textPane = new JTextPane ();
textPane.setContentType ("text/html");
textPane.setText ("This <b>is</b> a <b>test</b>.");
String text = textPane.getText ();
System.out.println (text);
Text look like this in JTexPane:
This is a test.
I get this kind of print to console:
<html>
<head>
</head>
<body>
This <b>is</b> a <b>test</b>.
</body>
</html>
I've used substring() and/or replace() code, but it is uncomfortable to use:
String text = textPane.getText ().replace ("<html> ... <body>\n , "");
Is there any simple function to remove all other tags than <b> tags (content) from string?
Sometimes JTextPane add <p> tags around content so I want to get rid of them also.
Like this:
<html>
<head>
</head>
<body>
<p style="margin-top: 0">
hdfhdfgh
</p>
</body>
</html>
I want to get only text content with tags:
This <b>is</b> a <b>test</b>.
I subclassed HTMLWriter and overrode startTag and endTag to skip all tags outside of <body>.
I did not test much, it seems to work ok. One drawback is that the output string has quite a lot of whitespace. Getting rid of that shouldn't be too hard.
import java.io.*;
import javax.swing.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
public class Foo {
public static void main(String[] args) throws Exception {
JTextPane textPane = new JTextPane();
textPane.setContentType("text/html");
textPane.setText("<p>This</p> <b>is</b> a <b>test</b>.");
StringWriter writer = new StringWriter();
HTMLDocument doc = (HTMLDocument) textPane.getStyledDocument();
HTMLWriter htmlWriter = new OnlyBodyHTMLWriter(writer, doc);
htmlWriter.write();
System.out.println(writer.toString());
}
private static class OnlyBodyHTMLWriter extends HTMLWriter {
public OnlyBodyHTMLWriter(Writer w, HTMLDocument doc) {
super(w, doc);
}
private boolean inBody = false;
private boolean isBody(Element elem) {
// copied from HTMLWriter.startTag()
AttributeSet attr = elem.getAttributes();
Object nameAttribute = attr
.getAttribute(StyleConstants.NameAttribute);
HTML.Tag name = null;
if (nameAttribute instanceof HTML.Tag) {
name = (HTML.Tag) nameAttribute;
}
return name == HTML.Tag.BODY;
}
#Override
protected void startTag(Element elem) throws IOException,
BadLocationException {
if (inBody) {
super.startTag(elem);
}
if (isBody(elem)) {
inBody = true;
}
}
#Override
protected void endTag(Element elem) throws IOException {
if (isBody(elem)) {
inBody = false;
}
if (inBody) {
super.endTag(elem);
}
}
}
}
You could use the HTML parser that the JEditorPane uses itself, HTMLEditorKit.ParserDelegator.
See this example, and the API docs.
I find solution to this problem by using substring and replace -methods:
// Get textPane content to string
String text = textPane.getText();
// Then I take substring to remove tags (html, head, body)
text = text.substring(44, text.length() - 19);
// Sometimes program sets <p style="margin-top: 0"> and </p> -tags so I remove them
// This isn't necessary to use.
text = text.replace("<p style=\"margin-top: 0\">\n ", "").replace("\n </p>", ""));
// This is for convert possible escape characters example & -> &
text = StringEscapeUtils.unescapeHtml(text);
There is link to StringEscapeUtils -libraries which convert escape characters back to normal view. Thanks to Ozhan Duz for the suggestion.
(commons-lang - download)
String text = textPane.getDocument.getText (0,textPane.getText().length());

Retrieve KEYWORDS from META tag in a HTML WebPage using JAVA

I want to retrieve all the content words from a HTML WebPage and all the keywords contained in the META TAG of the same HTML webpage using Java.
For example, consider this html source code:
<html>
<head>
<meta name = "keywords" content = "deception, intricacy, treachery">
</head>
<body>
My very short html document.
<br>
It has just 2 'lines'.
</body>
</html>
The CONTENT WORDS here are: my, very, short, html, document, it, has, just, lines
Note: The punctuation and the number '2' are ruled out.
The KEYWORDS here are: deception, intricacy, treachery
I have created a class for this purpose called WebDoc, this is as far as I have been able to get.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;
public class WebDoc {
protected URL _url;
protected Set<String> _contentWords;
protected Set<String> _keyWords
public WebDoc(URL paramURL) {
_url = paramURL;
}
public Set<String> getContents() throws IOException {
//URL url = new URL(url);
Set<String> contentWords = new TreeSet<String>();
BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
// Process each line.
contentWords.add(RemoveTag(inputLine));
//System.out.println(RemoveTag(inputLine));
}
in.close();
System.out.println(contentWords);
_contentWords = contentWords;
return contentWords;
}
public String RemoveTag(String html) {
html = html.replaceAll("\\<.*?>","");
html = html.replaceAll("&","");
return html;
}
public Set<String> getKeywords() {
//NO IDEA !
return null;
}
public URL getURL() {
return _url;
}
#Override
public String toString() {
return null;
}
}
So, after the answer from RedSoxFan about the meta-keywords, you only need to split your content lines.
You can use a similar method there:
Instead of
contentWords.add(RemoveTag(inputLine));
use
contentWords.addAll(Arrays.asList(RemoveTag(inputLine).split("[^\\p{L}]+")));
.split(...) splits your line at all non-letters (I hope this works, please try and report), giving back an array of substrings, which each should contain only of letters, and some empty strings between.
Arrays.asList(...) wraps this array in a list.
addAll(...) adds all the elements of this array to the set, but not duplicates).
At the end you should delete the empty string "" from your contentWords-Set.
Process each line and use
public Set<String> getKeywords(String str) {
Set<String> s = new HashSet<String>();
str = str.trim();
if (str.toLowerCase().startsWith("<meta ")) {
if (str.toLowerCase().matches("<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\".*\"/?>")) {
// Returns only whats in the content attribute (case-insensitive)
str = str.replaceAll("(?i)<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\"(.*)\"/?>","$1");
for (String st:str.split(",")) s.add(st.trim());
return s;
}
}
return null;
}
If you need an explanation, let me know.

Categories

Resources