Formatting text output of html with jSoup - java

I have a document to parse that contains HTML. I want to convert it from HTML to plain text while keeping some formatting.
Example extract:
<p>My simple paragragh</p>
<p>My paragragh with <a>Link</a></p>
<p>My paragragh with an <img/></p>
I can handle the simple case quite easily (though maybe not efficiently) with:
StringBuilder sb = new StringBuilder();
for (Element element : doc.getAllElements()) {
    if (element.tag().getName().equals("p")) {
        sb.append(element.text());
        sb.append("\n\n");
    }
}
Is it possible to insert output for an inline element at the correct place, and if so, how? For example:
<p>My paragragh with <a>Link</a> in the middle</p>
would become:
My paragragh with (Location: http://mylink.com) in the middle

You can replace each link-tag with a TextNode:
final String html = "<p>My simple paragragh</p>\n"
        + "<p>My paragragh with <a>Link</a></p>\n"
        + "<p>My paragragh with an <img/></p>";
Document doc = Jsoup.parse(html, "");
// Select all link-tags and replace them with TextNodes
for( Element element : doc.select("a") )
{
    element.replaceWith(new TextNode("(Location: http://mylink.com)", ""));
}
StringBuilder sb = new StringBuilder();
// Format as needed
for( Element element : doc.select("*") )
{
    // An alternative to the 'if'-statement
    switch(element.tagName())
    {
        case "p":
            sb.append(element.text()).append("\n\n");
            break;
        // Maybe you have to format some other tags here too ...
    }
}
System.out.println(sb);

Related

Text extract using Jsoup and wordcount

I am crawling websites using crawler4j. I am using jsoup to extract the content and save it to a text file. Then I use OmegaT to count the words in those text files.
The problem I am having is with the text extraction. I am using the following function to extract the text from the HTML.
public static String cleanTagPerservingLineBreaks(String html) {
    String result = "";
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    document.outputSettings(new Document.OutputSettings()
            .prettyPrint(false));
    // Insert a literal backslash-n placeholder after <br> and before <p>
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // Turn the placeholders into real newlines
    result = document.html().replaceAll("\\\\n", "\n");
    result = result.replaceAll("&nbsp;", " ");
    result = result.trim();
    result = Jsoup.clean(result, "", Whitelist.none(),
            new Document.OutputSettings().prettyPrint(false));
    return result;
}
In the line result = document.html().replaceAll("\\\\n", "\n"); when I use document.text() instead, I get well-formatted text with appropriate spaces, but when I run the word count in OmegaT, the unique words are not shown properly. If I keep using document.html(), I get a proper word count, but there are no spaces between some of the text (e.g. WomenNew ArrivalsTops & BlousesPants & DenimDresses & SkirtsMenView All MenNew), and tags like strong and em are not removed by Jsoup.
Is there a way to put spaces between all the tags and strip the content properly? An explanation of why the word count fluctuates would also be appreciated, if possible.

Can I include white space between all html text() elements in Jsoup

I want to use Jsoup to extract all text from an HTML page and return a single string of all the text without any HTML. The code I am using half works, but has the effect of joining elements which affects my keyword searches against the string.
This is the Java code I am using:
String resultText = scrapePage(htmldoc);
private String scrapePage(Document doc) {
    Element allHTML = doc.select("html").first();
    return allHTML.text();
}
Run against the following HTML:
<html>
<body>
<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>
</body>
</html>
Outputting resultText gives "Titlehere is para1here is para2" meaning I can't search for the word "para1" as the only word is "para1here".
I don't want to split the document into more elements than necessary (for example, by collecting the text of all h1 and p elements separately), as there is such a wide range of tags I could be matching.
For example, data1data2 would come from:
<td>data1</td><td>data2</td>
Is there a way I can get all the text from the page but also include a space between the tags? I don't want to preserve whitespace otherwise; there is no need to keep line breaks etc., as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.
I don't have this issue using jsoup 1.7.3.
Here's the full code I used for testing:
final String html = "<html>\n"
        + " <body>\n"
        + " <h1>Title</h1>\n"
        + " <p>here is para1</p>\n"
        + " <p>here is para2</p>\n"
        + " </body>\n"
        + "</html>";
Document doc = Jsoup.parse(html);
Element element = doc.select("html").first();
System.out.println(element.text());
And the output:
Title here is para1 here is para2
Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.
The previous answer is not right: it only works thanks to the "\n" line endings added to each line of the test HTML, and real HTML may not have a line break at the end of each line...
void example2text() throws IOException {
    String url = "http://www.example.com/";
    // Read the whole response body into one string
    String out = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A").next();
    org.jsoup.nodes.Document doc = Jsoup.parse(out);
    StringBuilder text = new StringBuilder();
    Elements tags = doc.select("*");
    for (Element tag : tags) {
        // textNodes() returns only the direct text children of this element
        for (TextNode tn : tag.textNodes()) {
            String tagText = tn.text().trim();
            if (tagText.length() > 0) {
                // Separate every text fragment with a single space
                text.append(tagText).append(" ");
            }
        }
    }
    System.out.println(text);
}
By using answer: https://stackoverflow.com/a/35798214/4642669
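Once the text fragments are joined, the keyword string described in the question can be reduced to single spaces. The following is a minimal sketch using only the JDK; the class and method names (KeywordText, normalize, containsWord) are hypothetical, and treating the non-breaking space as whitespace is an assumption about the input.

```java
import java.util.Arrays;

class KeywordText {

    // Collapses every run of whitespace (including the non-breaking space U+00A0)
    // into one plain space and trims both ends.
    static String normalize(String text) {
        return text.replaceAll("[\\s\\u00A0]+", " ").trim();
    }

    // Checks for a whole-word match, so "para1" does not match "para1here".
    static boolean containsWord(String text, String word) {
        return Arrays.asList(normalize(text).split(" ")).contains(word);
    }

    public static void main(String[] args) {
        String joined = "Title \n here is para1   here is para2 ";
        System.out.println(normalize(joined));
        System.out.println(containsWord(joined, "para1"));
        System.out.println(containsWord("Titlehere is para1here is para2", "para1"));
    }
}
```

The last call shows why the separator matters: without a space between elements, "para1" is not found as a word of its own.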

Getting substring from a given string in Java

I am reading the content of a web page and parsing it with the Jsoup parser to get only the hyperlinks that exist in the body section. I am getting output such as:
<a href="/sports/sports.asp"><font color="#0000FF">Sports</font></a>
<a href="/titanic/titanic.asp"><font color="#0000FF">Titanic</font></a>
<a href="gastheft.asp">license plates</a>
<a href="miracle.asp">miracle cars</a>
<a href="/crime/warnings/clear.asp">Clear</a>
and even more hyperlinks.
From all of them, all I am interested in is data like
/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp
How can I do this using Strings, or is there another way to extract this information using the Jsoup parser itself?
You can try this; it works.
public class AttributeParsing {
    /**
     * @param args
     */
    public static void main(String[] args) {
        final String html = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        // Select the first anchor that actually carries an href attribute
        Element th = doc.select("a[href]").first();
        String href = th.attr("href");
        System.out.println("th : " + th);
        System.out.println("href : " + href);
    }
}
Output:
th : <a href="/sports/sports.asp"><font color="#0000FF">Sports</font></a>
href : /sports/sports.asp
Try this; it may help:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
// attr("href") already returns the bare value, so no indexOf search for quotes is needed
This should be a basic bit of parsing using
String.indexOf
as in
index = jsoupOutput.indexOf("href=\"") + 6; // skip past href="
and
nextIndex = jsoupOutput.indexOf("\"", index);
with the necessary checks in place (indexOf returns -1 when nothing is found).
Let's assume that the String anchor contains one of these links. The substring then begins right after href=" and ends at the first quotation mark after that position:
String anchor = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
int beginIndex = anchor.indexOf("href=\"") + 6; // start right after href="
int endIndex = anchor.indexOf("\"", beginIndex);
String desiredPart = anchor.substring(beginIndex, endIndex);
That is all you need if the anchor always has this shape. A better option is a regular expression, and the best is an XML parser.
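The "necessary checks" mentioned for the indexOf approach could look roughly like this. It is a minimal sketch using only the JDK; the class name HrefByIndexOf and the method firstHref are hypothetical, and it assumes the attribute is written exactly as href="..." with double quotes.

```java
import java.util.Optional;

class HrefByIndexOf {

    private static final String MARKER = "href=\"";

    // Returns the value of the first double-quoted href attribute,
    // or an empty Optional when the markup has none.
    static Optional<String> firstHref(String markup) {
        if (markup == null) {
            return Optional.empty();
        }
        int start = markup.indexOf(MARKER);
        if (start < 0) {
            return Optional.empty();          // no href attribute at all
        }
        start += MARKER.length();             // move past href="
        int end = markup.indexOf('"', start);
        if (end < 0) {
            return Optional.empty();          // opening quote is never closed
        }
        return Optional.of(markup.substring(start, end));
    }

    public static void main(String[] args) {
        System.out.println(firstHref("<a href=\"/sports/sports.asp\">Sports</a>").orElse("(none)"));
        System.out.println(firstHref("<font color=\"#0000FF\">Sports</font>").orElse("(none)"));
    }
}
```

Without the two checks, a string that has no href would make indexOf return -1, and the substring call would then either fail or return unrelated text.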
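Use this as a reference:
import java.util.regex.*;
public class HelloWorld {
    public static void main(String[] args) {
        String s = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>" +
                "<a href=\"/titanic/titanic.asp\"><font color=\"#0000FF\">Titanic</font></a>" +
                "<a href=\"gastheft.asp\">license plates</a>" +
                "<a href=\"miracle.asp\">miracle cars</a>" +
                "<a href=\"/crime/warnings/clear.asp\">Clear</a>";
        // Lazily match each href="..." attribute
        Pattern p = Pattern.compile("href=\".+?\"");
        Matcher m = p.matcher(s);
        while (m.find()) {
            // Keep the part after '=' and strip the surrounding quotes
            System.out.println(m.group().split("=")[1].replace("\"", ""));
        }
    }
}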
Output
/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp
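One caveat with splitting the match on "=": a URL that itself contains "=" (for example a query string) would be cut short. A capturing group avoids that. The sketch below uses only java.util.regex; the class name HrefByRegex, the method hrefs, and the sample URL with a query string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefByRegex {

    // Group 1 captures everything between the quotes,
    // so an '=' inside the URL does not matter.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    static List<String> hrefs(String markup) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(markup);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "<a href=\"/search.asp?q=cars\">search</a>"
                + "<a href=\"miracle.asp\">miracle cars</a>";
        for (String href : hrefs(s)) {
            System.out.println(href);
        }
    }
}
```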
You can do it in one line:
String[] paths = str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");
The first method call removes everything except the target from each line, and the second splits on newlines (will work on all OS terminators).
FYI (?m) turns on "multiline mode" and (?ms) also turns on the "dotall" flag.

java : generating xpath using string matcher regex

I want to generate an XPath from an HTML file. So far, I have succeeded in storing the HTML source in a String and generating a basic XPath using a regex Matcher, as follows:
String text = "<html><body><table><tr id=\"x\"><td>abc</td><td></td><td>xyz</td></tr></table></body></html>";
// I want the XPath up to the label "xyz"
String unwanted = "xyz";
// so split and keep the part before it
String[] neededString = text.split(unwanted);
String a = "";
// pattern for extracting tags
String patternString1 = "<(.+?)>";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(neededString[0]);
while (matcher.find()) {
    a = a.concat(matcher.group(1) + "/");
    System.out.println(a);
}
This code works for a basic tag structure without multiple child nodes, such as multiple <td>s in a <tr>. Can anyone improve the code above to generate XPaths for multiple children and also capture attributes like id, class, etc.?
Any help is much appreciated.
Thanks in advance.
Regex is not very accurate for extracting HTML content.
Use the Jsoup HTML parser:
public static void main(String[] args) {
    String html = "<html><body><table><tr id=\"x\"><td>abc</td><td></td>" +
            "<td>xyz</td></tr></table></body></html>";
    Document doc = Jsoup.parse(html);
    for (Element table : doc.select("table")) {
        for (Element row : table.select("tr[id=x]")) {
            Elements tds = row.select("td");
            // The third cell (index 2) holds "xyz"
            System.out.println(tds.get(2).text());
        }
    }
}
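For the XPath itself, one option when the markup is well-formed XML (as the sample string in the question is) is to let the JDK's DOM parser find the element and build the path from its ancestors. This is only a sketch under that assumption; real-world HTML is often not well-formed and would need an HTML parser first. The class and method names are hypothetical. An id attribute is used as the predicate when present (assumed to contain no apostrophe), otherwise the 1-based position among same-named siblings.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Returns an XPath for the first element whose own text equals label, or null if none does.
    static String xpathForText(String xml, String label) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        NodeList all = doc.getElementsByTagName("*");   // every element, in document order
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            if (label.equals(ownText(el))) {
                return xpathOf(el);
            }
        }
        return null;
    }

    // Concatenates only the direct text children of an element.
    static String ownText(Element el) {
        StringBuilder sb = new StringBuilder();
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                sb.append(c.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Walks from the element up to the root, prepending one step per ancestor.
    static String xpathOf(Element el) {
        StringBuilder path = new StringBuilder();
        for (Node n = el; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            path.insert(0, "/" + step((Element) n));
        }
        return path.toString();
    }

    // One path step: tag[@id='...'] when an id exists, else tag[position].
    static String step(Element el) {
        String name = el.getTagName();
        if (el.hasAttribute("id")) {
            return name + "[@id='" + el.getAttribute("id") + "']";
        }
        int position = 1;
        for (Node s = el.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(name)) {
                position++;
            }
        }
        return name + "[" + position + "]";
    }

    public static void main(String[] args) {
        String text = "<html><body><table><tr id=\"x\"><td>abc</td><td></td>"
                + "<td>xyz</td></tr></table></body></html>";
        System.out.println(xpathForText(text, "xyz"));
    }
}
```

For the sample string, this prints /html[1]/body[1]/table[1]/tr[@id='x']/td[3]: the third <td> is told apart from its siblings by position, and the row by its id.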

Why Jsoup cannot select td element?

I have made a little test (with Jsoup 1.6.1):
String s = "" +Jsoup.parse("<td></td>").select("td").size();
System.out.println("Selected elements count : " + s);
It outputs:
Selected elements count : 0
But it should return 1, because I have parsed HTML with a td element. What is wrong with my code, or is there a bug in Jsoup?
Because Jsoup is an HTML5-compliant parser and you fed it invalid HTML. A <td> has to go inside at least a <table>.
int size = Jsoup.parse("<table><td></td></table>").select("td").size();
System.out.println("Selected elements count : " + size);
String url = "http://foobar.com";
Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td");
Jsoup 1.6.2 allows parsing with different parsers, and a simple XML parser is provided. With the following code I could solve my problem. You can later parse your fragment with the HTML parser to get valid HTML.
// Jsoup 1.6.2
String s = "" + Jsoup.parse("<td></td>", "", Parser.xmlParser()).select("td").size();
System.out.println("Selected elements count : " + s);
