Can I include white space between all html text() elements in Jsoup - java

I want to use Jsoup to extract all text from an HTML page and return a single string of all the text without any HTML. The code I am using half works, but it joins the text of adjacent elements together, which breaks my keyword searches against the string.
This is the Java code I am using:
String resultText = scrapePage(htmldoc);

private String scrapePage(Document doc) {
    Element allHTML = doc.select("html").first();
    return allHTML.text();
}
Run against the following HTML:
<html>
<body>
<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>
</body>
</html>
Outputting resultText gives "Titlehere is para1here is para2" meaning I can't search for the word "para1" as the only word is "para1here".
I don't want to split the document into more elements than necessary (for example, by getting the text of every h1 and p element separately), as there is such a wide range of tags I could be matching. For example, data1data2 would come from:
<td>data1</td><td>data2</td>
Is there a way I can get all the text from the page but also include a space between the tags? I don't need to preserve whitespace otherwise (no need to keep line breaks etc.), as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.
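To make the goal concrete, here is a minimal sketch of the post-processing described above: collapsing whitespace to single spaces and then matching whole words. It uses only the Java standard library; the class and method names (KeywordText, normalize, containsWord) are made up for illustration and are not part of Jsoup.

```java
import java.util.Arrays;
import java.util.List;

class KeywordText {
    // Collapse every run of whitespace (spaces, tabs, line breaks) to one space.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // True only if keyword appears as a whole whitespace-delimited word.
    static boolean containsWord(String text, String keyword) {
        String normalized = normalize(text);
        if (normalized.isEmpty()) {
            return false;
        }
        List<String> words = Arrays.asList(normalized.split(" "));
        return words.contains(keyword);
    }

    public static void main(String[] args) {
        String joined = "Titlehere is para1here is para2";
        String spaced = "Title \n here is para1\n\n here is   para2";
        System.out.println(containsWord(joined, "para1")); // no whole word "para1"
        System.out.println(containsWord(spaced, "para1")); // found once spaces exist
        System.out.println(normalize(spaced));
    }
}
```

With the joined string the lookup fails because the only candidate word is "para1here"; once the extracted text has any whitespace between elements, the normalization step is enough.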

I don't have this issue using Jsoup 1.7.3.
Here's the full code I used for testing:
final String html = "<html>\n"
+ " <body>\n"
+ " <h1>Title</h1>\n"
+ " <p>here is para1</p>\n"
+ " <p>here is para2</p>\n"
+ " </body>\n"
+ "</html>";
Document doc = Jsoup.parse(html);
Element element = doc.select("html").first();
System.out.println(element.text());
And the output:
Title here is para1 here is para2
Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.

The previous answer is not right: it only works because of the "\n" line ending added to each line, but in reality you may not have a line break at the end of each HTML line...

void example2text() throws IOException {
    String url = "http://www.example.com/";
    // Read the whole response body into one string ("\\A" matches the start of input)
    String out = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A").next();
    org.jsoup.nodes.Document doc = Jsoup.parse(out);
    StringBuilder text = new StringBuilder();
    Elements tags = doc.select("*");
    for (Element tag : tags) {
        // Visit only the text nodes that are direct children of this element
        for (TextNode tn : tag.textNodes()) {
            String tagText = tn.text().trim();
            if (tagText.length() > 0) {
                text.append(tagText).append(" ");
            }
        }
    }
    System.out.println(text);
}
By using answer: https://stackoverflow.com/a/35798214/4642669
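The joining step in the loop above can be looked at in isolation. The following is only a sketch using the standard library: it assumes the text fragments have already been collected into a list (by whatever means), and the names FragmentJoiner and joinFragments are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

class FragmentJoiner {
    // Join non-blank fragments with exactly one space between them.
    static String joinFragments(List<String> fragments) {
        StringJoiner joiner = new StringJoiner(" ");
        for (String fragment : fragments) {
            String trimmed = fragment.trim();
            if (!trimmed.isEmpty()) {
                joiner.add(trimmed);
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        List<String> fragments =
                Arrays.asList("Title", "  ", "here is para1", "\n", "here is para2");
        System.out.println(joinFragments(fragments));
    }
}
```

Unlike appending a space after every fragment, a StringJoiner leaves no trailing space at the end of the result.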

Related

Replace some HTML attributes with Jsoup without changing rest of input

I work with incoming html text blocks, like this:
String html = "<p>Some text here with already existing tags and it's escaped symbols.\n" +
" More text here:<br/>\\r\\n---<br/>\\r\\n" +
" <img src=\"/attachments/a0d4789a-1575-4b70-b57f-9e8fe21df46b\" sha256=\"2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c\"></a>" +
" It was img tag with attr to replace above</p>\\r\\n\\r\\n<p>More text here\n" +
" and here.<br/>\\r\\n---</p>";
I need to replace src attribute value in img tags with slightly modified sha256 attribute value in the same tag. I can do it easily with Jsoup like this:
Document doc = Jsoup.parse(html);
Elements elementsByAttribute = doc.select("img[src]");
elementsByAttribute.forEach(x -> x.attr("src", "/usr/myfolder/" + x.attr("sha256") + ".zip"));
But there is a problem. Incoming text already has some formatting, html tags, escaping etc that need to be preserved. But Jsoup removes tags / adds tags / unescapes / escapes and does some other modifications to the original input.
For example, System.out.println(doc); or System.out.println(doc.html()); gives me following:
<html>
<head></head>
<body>
<p>Some text here with already existing tags and it's escaped symbols. More text here:<br>\r\n---<br>\r\n <img src="/usr/myfolder/2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c.zip" sha256="2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c"> It was img tag with attr to replace above</p>\r\n\r\n
<p>More text here and here.<br>\r\n---</p>
</body>
</html>
My src attribute is replaced, but many more HTML tags are added, and the escaped apostrophe in "it's" is converted to a plain apostrophe.
If I use System.out.println(doc.text()); I receive the following:
Some text here with already existing tags and it's escaped symbols. More text here: \r\n--- \r\n It was img tag with attr to replace above\r\n\r\n More text here and here. \r\n---
My tags are removed here, and the escaped apostrophe in "it's" is unescaped again.
I tried some other Jsoup features but didn't find how to solve this problem.
Question: is there any way to replace only some attributes with Jsoup without changing other tags? Maybe there is some other library for that purpose? Or is regex my only option?
I encountered the same problem, and in my case this code doesn't change the original formatting.
Try this:
public static void m() {
    String html = "<p>Some text here with already existing tags and it's escaped symbols.\n" +
            " More text here:<br/>\r\n---<br/>\r\n" +
            " <img src=\"/attachments/a0d4789a-1575-4b70-b57f-9e8fe21df46b\" sha256=\"2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c\"></a>" +
            " It was img tag with attr to replace above</p>\r\n\r\n<p>More text here\n" +
            " and here.<br/>\r\n---</p>";
    // Parse as a body fragment and turn off pretty-printing to keep the input layout
    Document doc = Jsoup.parseBodyFragment(html);
    doc.outputSettings().prettyPrint(false);
    Elements elementsByAttribute = doc.select("img[src]");
    elementsByAttribute.forEach(x -> x.attr("src", "/usr/myfolder/" + x.attr("sha256") + ".zip"));
    String result = doc.body().html();
    System.out.println(result);
}
It prints the following to the console (your example has a dangling </a> after the <img>, so the library removes it):
<p>Some text here with already existing tags and it's escaped symbols.
More text here:<br>
---<br>
<img src="/usr/myfolder/2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c.zip" sha256="2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c"> It was img tag with attr to replace above</p>
<p>More text here
and here.<br>
---</p>
And here is my case in Kotlin (it replaces the src content of <img> tags and removes <script> tags; text is a String? input variable from the outer scope):
val content: String get() {
    var c = text ?: ""
    //val document = Jsoup.parse(c)
    val document = Jsoup.parseBodyFragment(c)
    document.outputSettings().prettyPrint(false)
    val elementsByAttr = document.select("img[src]")
    elementsByAttr.forEach {
        val srcContent = it.attr("src")
        val (type, value) = srcContent.let {
            val eqIdx = it.indexOf('=')
            it.substring(0, max(0, eqIdx)) to it.substring(eqIdx + 1)
        }
        if (type == "path") {
            it.attr("src", ArticleRepo.imgPathPrefix + value)
        }
    }
    document.select("script").remove()
    c = document.body().html()
    return c
}
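For readers following along in Java, the src-splitting step of the Kotlin snippet can be sketched as below. This is an illustration only: the "type=value" shape of the src attribute is inferred from the Kotlin code, and SrcRewriter, rewriteSrc and the "/static/" prefix are hypothetical names and values.

```java
class SrcRewriter {
    // Split "type=value" at the first '='; a src without '=' yields an empty type.
    static String rewriteSrc(String src, String pathPrefix) {
        int eqIdx = src.indexOf('=');
        String type = src.substring(0, Math.max(0, eqIdx));
        String value = src.substring(eqIdx + 1);
        // Only sources tagged with the "path" type get the prefix; others stay as-is.
        return type.equals("path") ? pathPrefix + value : src;
    }

    public static void main(String[] args) {
        System.out.println(rewriteSrc("path=img/a.png", "/static/"));
        System.out.println(rewriteSrc("/attachments/x.png", "/static/"));
    }
}
```

The Math.max(0, eqIdx) guard mirrors the Kotlin code: when indexOf returns -1, the type becomes an empty string instead of causing an exception.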

Text extract using Jsoup and wordcount

I am crawling websites using crawler4j. I am using jsoup to extract content and save it in a text format file. Then I use omegaT to find the number of words in those text files.
The problem I am having is with text extraction. I am using the following function to extract the text from html.
public static String cleanTagPerservingLineBreaks(String html) {
    String result = "";
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    document.outputSettings(new Document.OutputSettings()
            .prettyPrint(false));
    // Mark line breaks with a literal \n placeholder, converted to real newlines below
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    result = document.html().replaceAll("\\\\n", "\n");
    result = result.replaceAll("&nbsp;", " ");
    result = result.trim();
    result = Jsoup.clean(result, "", Whitelist.none(),
            new Document.OutputSettings().prettyPrint(false));
    return result;
}
In the line result = document.html().replaceAll("\\\\n", "\n"); when I use document.text() instead, it gives me well-formatted text with appropriate spaces. But when I do the word count in OmegaT, the unique words are not shown properly. If I keep using document.html(), I get a proper word count, but there are no spaces between some of the text (e.g. WomenNew ArrivalsTops & BlousesPants & DenimDresses & SkirtsMenView All MenNew), and tags like strong and em are not removed by Jsoup.
Is there a way to put spaces between all the tags and properly strip the content? And an explanation of why the fluctuation in word count is happening, if possible.
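One plausible contributor to the fluctuation (a guess, not a confirmed diagnosis of OmegaT's counting) is that glued-together words change both the total and the unique counts. The sketch below shows the effect with a naive whitespace tokenizer; WordStats and its methods are hypothetical names, and real word-count tools may tokenize differently.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class WordStats {
    // Split on runs of whitespace; blank input yields no words.
    static String[] words(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    static int totalWords(String text) {
        return words(text).length;
    }

    // Count distinct words, ignoring case.
    static int uniqueWords(String text) {
        Set<String> unique = new LinkedHashSet<>();
        for (String word : words(text)) {
            unique.add(word.toLowerCase(Locale.ROOT));
        }
        return unique.size();
    }

    public static void main(String[] args) {
        String spaced = "Men View All Men";
        String glued = "MenView All MenNew";
        System.out.println(totalWords(spaced) + " total, " + uniqueWords(spaced) + " unique");
        System.out.println(totalWords(glued) + " total, " + uniqueWords(glued) + " unique");
    }
}
```

In the spaced text "Men" is counted twice but is unique once; in the glued text "MenView" and "MenNew" become two different words, so the totals and unique counts both shift.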

Getting substring from a given string in Java

I am reading the content from a web page and then parsing it with the Jsoup parser to get only the hyperlinks that exist in the body section. I am getting output like this:
<a href="/sports/sports.asp"><font color="#0000FF">Sports</font></a>
<a href="/titanic/titanic.asp"><font color="#0000FF">Titanic</font></a>
<a href="gastheft.asp">license plates</a>
<a href="miracle.asp">miracle cars</a>
<a href="/crime/warnings/clear.asp">Clear</a>
and even more hyperlinks.
From all of them, all I am interested in is data like
/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp
How can I do this using Strings, or is there any other way or method to extract this information using the Jsoup parser itself?
You can try this; it works.
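import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class AttributeParsing {
    /**
     * @param args unused
     */
    public static void main(String[] args) {
        // The anchor markup is assumed from the hrefs listed in the question
        final String html = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        // Select the first anchor that carries an href attribute
        Element th = doc.select("a[href]").first();
        String href = th.attr("href");
        System.out.println("th : " + th);
        System.out.println("href : " + href);
    }
}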
Output:
th : <a href="/sports/sports.asp"><font color="#0000FF">Sports</font></a>
href : /sports/sports.asp
Try this, it may help:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
int nextIndex = linkHref.indexOf("\""); // -1 here: attr() returns the value without the surrounding quotes
This should be a basic bit of parsing using String.indexOf, as in
index = jsoupOutput.indexOf("href=\"");
and
nextIndex = jsoupOutput.indexOf("\"", index + 6); // start after href=" so the opening quote is skipped
with the necessary checks in place.
Let's assume that the String anchor contains one of these links. The substring then begins right after href=" and ends at the first quotation mark after that point:
String anchor = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
int beginIndex = anchor.indexOf("href=\"") + 6; // start right after href="
int endIndex = anchor.indexOf("\"", beginIndex);
String desiredPart = anchor.substring(beginIndex, endIndex);
And that's it, as long as the anchor always has this shape. Better options are regular expressions, and best would be using an XML parser.
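The "necessary checks" mentioned in these indexOf answers could look roughly like the sketch below. It assumes the anchor markup carries an href attribute in double quotes; HrefIndexOf and firstHref are hypothetical names, and an empty Optional is just one possible way to signal "not found".

```java
import java.util.Optional;

class HrefIndexOf {
    private static final String MARKER = "href=\"";

    // Return the first double-quoted href value, or empty if the markup lacks one.
    static Optional<String> firstHref(String anchor) {
        int markerIndex = anchor.indexOf(MARKER);
        if (markerIndex < 0) {
            return Optional.empty(); // no href attribute at all
        }
        int beginIndex = markerIndex + MARKER.length();
        int endIndex = anchor.indexOf('"', beginIndex);
        if (endIndex < 0) {
            return Optional.empty(); // closing quote is missing
        }
        return Optional.of(anchor.substring(beginIndex, endIndex));
    }

    public static void main(String[] args) {
        String anchor = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>";
        System.out.println(firstHref(anchor).orElse("(none)"));
        System.out.println(firstHref("<font color=\"#0000FF\">Sports</font>").orElse("(none)"));
    }
}
```

Without the first check, indexOf returns -1 and adding 6 to it silently points the substring at an unrelated position in the string.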
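Use this as a reference:
import java.util.regex.*;

public class HelloWorld {
    public static void main(String[] args) {
        // Anchor markup assumed from the hrefs listed in the question
        String s = "<a href=\"/sports/sports.asp\"><font color=\"#0000FF\">Sports</font></a>" +
                "<a href=\"/titanic/titanic.asp\"><font color=\"#0000FF\">Titanic</font></a>" +
                "<a href=\"gastheft.asp\">license plates</a>" +
                "<a href=\"miracle.asp\">miracle cars</a>" +
                "<a href=\"/crime/warnings/clear.asp\">Clear</a>";
        Pattern p = Pattern.compile("href=\".+?\"");
        Matcher m = p.matcher(s);
        while (m.find()) {
            // m.group() is href="..."; keep the part after '=' and drop the quotes
            System.out.println(m.group().split("=")[1].replace("\"", ""));
        }
    }
}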
Output
/sports/sports.asp
/titanic/titanic.asp
gastheft.asp
miracle.asp
/crime/warnings/clear.asp
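A small variation on the regex approach above, offered as a sketch: a capturing group returns the href value directly, so a link whose value itself contains "=" (such as a query string) is not cut short by split("="). The names HrefRegex and extractHrefs are hypothetical, and the pattern assumes double-quoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRegex {
    // Group 1 captures everything between the quotes of href="...".
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            hrefs.add(matcher.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/sports/sports.asp\">Sports</a>"
                + "<a href=\"page.asp?id=7\">Details</a>";
        for (String href : extractHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

As the answers here already note, a real parser remains the more robust choice when the markup is not guaranteed to have this exact shape.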
You can do it in one line:
String[] paths = str.replaceAll("(?m)^.*?\"(.*?)\".*?$", "$1").split("(?ms)$.*?^");
The first method call removes everything except the target from each line, and the second splits on newlines (will work on all OS terminators).
FYI (?m) turns on "multiline mode" and (?ms) also turns on the "dotall" flag.

Formatting text output of html with jSoup

I have a document that contains HTML, and I want to convert it from HTML to plain text while keeping some formatting.
Example extract
<p>My simple paragragh</p>
<p>My paragragh with <a>Link</a></p>
<p>My paragragh with an <img/></p>
I can handle the simple case quite easily (though maybe not efficiently) with:
StringBuilder sb = new StringBuilder();
for (Element element : doc.getAllElements()) {
    if (element.tag().getName().equals("p")) {
        sb.append(element.text());
        sb.append("\n\n");
    }
}
Is it possible (and how would I do it) to insert output for an inline element in the correct place? An example:
<p>My paragragh with <a>Link</a> in the middle</p>
would become:
My paragragh with (Location: http://mylink.com) in the middle
You can replace each link-tag with a TextNode:
final String html = "<p>My simple paragragh</p>\n"
        + "<p>My paragragh with <a>Link</a></p>\n"
        + "<p>My paragragh with an <img/></p>";
Document doc = Jsoup.parse(html, "");

// Select all link-tags and replace them with TextNodes
for (Element element : doc.select("a")) {
    element.replaceWith(new TextNode("(Location: http://mylink.com)", ""));
}

StringBuilder sb = new StringBuilder();

// Format as needed
for (Element element : doc.select("*")) {
    // An alternative to the 'if'-statement
    switch (element.tagName()) {
        case "p":
            sb.append(element.text()).append("\n\n");
            break;
        // Maybe you have to format some other tags here too ...
    }
}
System.out.println(sb);

Why Jsoup cannot select td element?

I made a little test (with Jsoup 1.6.1):
String s = "" +Jsoup.parse("<td></td>").select("td").size();
System.out.println("Selected elements count : " + s);
It outputs:
Selected elements count : 0
But it should return 1, because I parsed HTML with a td element. What is wrong with my code, or is there a bug in Jsoup?
Because Jsoup is an HTML5-compliant parser and you fed it invalid HTML. A <td> has to go inside at least a <table>.
int size = Jsoup.parse("<table><td></td></table>").select("td").size();
System.out.println("Selected elements count : " + size);
String url = "http://foobar.com";
Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td");
Jsoup 1.6.2 allows parsing with a different parser, and a simple XML parser is provided. With the following code I could solve my problem. You can later parse your fragment with the HTML parser to get valid HTML.
// Jsoup 1.6.2
String s = "" + Jsoup.parse("<td></td>", "", Parser.xmlParser()).select("td").size();
System.out.println("Selected elements count : " + s);
