Replace some HTML attributes with Jsoup without changing rest of input - java

I work with incoming html text blocks, like this:
String html = "<p>Some text here with already existing tags and it's escaped symbols.\n" +
" More text here:<br/>\\r\\n---<br/>\\r\\n" +
" <img src=\"/attachments/a0d4789a-1575-4b70-b57f-9e8fe21df46b\" sha256=\"2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c\"></a>" +
" It was img tag with attr to replace above</p>\\r\\n\\r\\n<p>More text here\n" +
" and here.<br/>\\r\\n---</p>";
I need to replace the src attribute value in img tags with a slightly modified version of the sha256 attribute value from the same tag. I can do it easily with Jsoup like this:
Document doc = Jsoup.parse(html);
Elements elementsByAttribute = doc.select("img[src]");
elementsByAttribute.forEach(x -> x.attr("src", "/usr/myfolder/" + x.attr("sha256") + ".zip"));
But there is a problem. The incoming text already has formatting, HTML tags, escaping, etc. that need to be preserved, and Jsoup removes tags, adds tags, unescapes, escapes, and makes other modifications to the original input.
For example, System.out.println(doc); or System.out.println(doc.html()); gives me the following:
<html>
<head></head>
<body>
<p>Some text here with already existing tags and it's escaped symbols. More text here:<br>\r\n---<br>\r\n <img src="/usr/myfolder/2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c.zip" sha256="2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c"> It was img tag with attr to replace above</p>\r\n\r\n
<p>More text here and here.<br>\r\n---</p>
</body>
</html>
My src attribute is replaced, but extra HTML tags are added, and the escaped apostrophe in "it's" is unescaped.
If I use System.out.println(doc.text()); I get the following:
Some text here with already existing tags and it's escaped symbols. More text here: \r\n--- \r\n It was img tag with attr to replace above\r\n\r\n More text here and here. \r\n---
My tags are removed here, and the escaped apostrophe in "it's" is unescaped again.
I tried some other Jsoup features but didn't find a way to solve this problem.
Question: is there any way to replace only some attributes with Jsoup without changing other tags? Maybe there is some other library for that purpose? Or is regex my only option?

I encountered the same problem, and in my case this code doesn't change the original formatting.
Try this:
public static void m(){
String html = "<p>Some text here with already existing tags and it's escaped symbols.\n" +
" More text here:<br/>\r\n---<br/>\r\n" +
" <img src=\"/attachments/a0d4789a-1575-4b70-b57f-9e8fe21df46b\" sha256=\"2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c\"></a>" +
" It was img tag with attr to replace above</p>\r\n\r\n<p>More text here\n" +
" and here.<br/>\r\n---</p>";
Document doc = Jsoup.parseBodyFragment(html);
doc.outputSettings().prettyPrint(false);
Elements elementsByAttribute = doc.select("img[src]");
elementsByAttribute.forEach(x -> x.attr("src", "/usr/myfolder/" + x.attr("sha256") + ".zip"));
String result = doc.body().html();
System.out.println(result);
}
It prints the following to the console (your example has a dangling </a> after the <img> tag, so the library removes it):
<p>Some text here with already existing tags and it's escaped symbols.
More text here:<br>
---<br>
<img src="/usr/myfolder/2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c.zip" sha256="2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c"> It was img tag with attr to replace above</p>
<p>More text here
and here.<br>
---</p>
And here is my case in Kotlin, which rewrites the src attribute of <img> tags and removes <script> tags (text is an input String? variable from the outer scope):
val content: String get(){
var c = text ?: ""
//val document = Jsoup.parse(c)
val document = Jsoup.parseBodyFragment(c)
document.outputSettings().prettyPrint(false)
val elementsByAttr = document.select("img[src]")
elementsByAttr.forEach {
val srcContent = it.attr("src")
val (type,value) = srcContent.let {
val eqIdx = it.indexOf('=')
it.substring(0, max(0,eqIdx)) to it.substring(eqIdx+1)
}
if (type=="path"){
it.attr("src", ArticleRepo.imgPathPrefix+value)
}
}
document.select("script").remove()
c = document.body().html()
return c
}
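If even the remaining normalisation shown in the output above is unacceptable (Jsoup still turns <br/> into <br> and drops the dangling </a>), the regex route raised in the question can be sketched as follows. This is only a sketch under assumptions: the class and method names are hypothetical, attribute values are assumed to be double-quoted and to contain no ">" character, and only img tags that carry both src and sha256 are rewritten. Everything outside the matched src value is copied through byte for byte.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: rewrites img src values with regexes only.
final class ImgSrcRewriter {
    // A whole <img ...> tag (assumes no '>' inside attribute values).
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // The double-quoted sha256 attribute; group 1 is its value.
    private static final Pattern SHA =
            Pattern.compile("\\ssha256=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // The double-quoted src attribute; groups 1 and 2 keep the text around the value.
    private static final Pattern SRC =
            Pattern.compile("(\\ssrc=\")[^\"]*(\")", Pattern.CASE_INSENSITIVE);

    static String rewrite(String html) {
        Matcher img = IMG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (img.find()) {
            String tag = img.group();
            Matcher sha = SHA.matcher(tag);
            if (sha.find()) {
                String newSrc = "/usr/myfolder/" + sha.group(1) + ".zip";
                tag = SRC.matcher(tag)
                        .replaceFirst("$1" + Matcher.quoteReplacement(newSrc) + "$2");
            }
            // quoteReplacement keeps '$' and '\' in the tag from being interpreted.
            img.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        img.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>text<br/>\r\n<img src=\"/attachments/a0d4\" sha256=\"2957\"></a> tail</p>";
        System.out.println(rewrite(html));
    }
}
```

The trade-off is the usual one for regexes on HTML: the input is left untouched, but unusual markup (single-quoted or unquoted attributes, a ">" inside a value) is silently skipped or mishandled.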

Related

Turn off automatic close tag in jsoup

I am trying to turn off automatic generation of close tags and I referred to this link
How to turn off automatic generation of close tags </tagName> in Jsoup?
String html="<A HREF=\"#Item1\">\n"
+ "<p style=\"font-family:times;margin-top:12pt;margin-left:0pt;\">\n"
+ "<FONT SIZE=2>Item 1.</FONT>\n"
+ "</A>";
Document document = Jsoup.parse(html,"",Parser.xmlParser());
But when I try it, I get no output, and I think it is going into an infinite loop or something.
This is the code I am trying (no output, and it hangs):
String html = "<table>"
+ "<tr align='top'>"
+ "<th><font>Link</th>"
+ "</tr>"
+ "</table>";
Document document = Jsoup.parse(html,"",Parser.xmlParser());
System.out.println(document.toString());
Can someone tell me what the error is?
What I need is some sort of output saying that the end tag is missing.
EDIT - Sorry, there was some problem with my Eclipse. Anyway, now there is no infinite loop, but my output is as follows:
String html = "<table>"
+ "<tr align='top'>"
+ "<th><font>Link</th>"
+ "</tr>"
+ "</table>";
Document document = Jsoup.parse(html,"",Parser.xmlParser());
System.out.println("UNPARSED = \n"+html + "\n---------------");
System.out.println("parsed:" + document.toString());
Output
UNPARSED =
<table><tr align='top'><th><font>Link</th></tr></table>
---------------
parsed:<table>
<tr align="top">
<th><font>Link</font></th>
</tr>
</table>
I don't want the </font> to be added.
Edit -
I fixed it by checking using Regular expressions before parsing by Jsoup.
@Abi I don't think this example can avoid the close tag. Even if you use the xmlParser to parse your HTML, Jsoup will still add a close tag for the unclosed tag, because in XML or HTML a node with an open tag must also have a close tag. Your example demonstrates this.
I think you can use a regexp to do this.
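The regular-expression pre-check mentioned above is not shown in the thread. One way to get "some sort of output saying that the end tag is missing" before handing the markup to Jsoup is a strict, XML-style stack scan. The following is only a sketch: the class name, method name, and the small list of void elements are assumptions for illustration, and it treats every non-void tag as requiring an end tag, which is stricter than HTML itself (where, for example, </p> is optional).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-check, independent of Jsoup: reports tags that are never closed.
final class UnclosedTagChecker {
    // Group 1: "/" for an end tag; group 2: tag name; group 3: "/" for <x/>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)\\b[^>]*?(/?)>");
    // Assumed subset of void elements that never take an end tag.
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "img", "hr", "input", "meta", "link"));

    static List<String> findUnclosed(String html) {
        Deque<String> open = new ArrayDeque<>();
        List<String> missing = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (VOID.contains(name) || selfClosing) {
                continue;
            }
            if (!closing) {
                open.push(name);
                continue;
            }
            if (!open.contains(name)) {
                continue; // stray end tag with no matching start tag: ignored here
            }
            // Everything opened after the matching start tag was never closed.
            while (!open.peek().equals(name)) {
                missing.add(open.pop());
            }
            open.pop();
        }
        // Tags still open at the end of the input are also missing their end tag.
        while (!open.isEmpty()) {
            missing.add(open.pop());
        }
        return missing;
    }

    public static void main(String[] args) {
        String html = "<table><tr align='top'><th><font>Link</th></tr></table>";
        System.out.println("Missing end tags: " + findUnclosed(html));
    }
}
```

For the table example in the question, the scan reports font as the tag whose end tag is missing.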

Can I include white space between all html text() elements in Jsoup

I want to use Jsoup to extract all text from an HTML page and return a single string of all the text without any HTML. The code I am using half works, but has the effect of joining elements which affects my keyword searches against the string.
This is the Java code I am using:
String resultText = scrapePage(htmldoc);
private String scrapePage(Document doc) {
Element allHTML = doc.select("html").first();
return allHTML.text();
}
Run against the following HTML:
<html>
<body>
<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>
</body>
</html>
Outputting resultText gives "Titlehere is para1here is para2" meaning I can't search for the word "para1" as the only word is "para1here".
I don't want to split the document into more elements than necessary (for example, getting all h1 and p text elements), as there is such a wide range of tags I could be matching (e.g. data1data2 would come from):
<td>data1</td><td>data2</td>
Is there a way I can get all the text from the page but also include a space between the tags? I don't need to preserve whitespace otherwise (no need to keep line breaks etc.), as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.
I don't have this issue using JSoup 1.7.3.
Here's the full code I used for testing:
final String html = "<html>\n"
+ " <body>\n"
+ " <h1>Title</h1>\n"
+ " <p>here is para1</p>\n"
+ " <p>here is para2</p>\n"
+ " </body>\n"
+ "</html>";
Document doc = Jsoup.parse(html);
Element element = doc.select("html").first();
System.out.println(element.text());
And the output:
Title here is para1 here is para2
Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.
The previous answer is not right: it only works thanks to the "\n" line endings added to each line, but in reality you may not have a line ending at the end of each HTML line.
void example2text() throws IOException {
String url = "http://www.example.com/";
String out = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A").next();
org.jsoup.nodes.Document doc = Jsoup.parse(out);
String text = "";
Elements tags = doc.select("*");
for (Element tag : tags) {
for (TextNode tn : tag.textNodes()) {
String tagText = tn.text().trim();
if (tagText.length() > 0) {
text += tagText + " ";
}
}
}
System.out.println(text);
}
Based on this answer: https://stackoverflow.com/a/35798214/4642669
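Since the goal stated in the question is only a keyword string, a much cruder alternative that does not use Jsoup at all is to replace every tag with a space and then collapse whitespace. This is a sketch of a different technique from the answers above, not an HTML parser: the class and method names are hypothetical, entities such as &amp; are left as-is, and the contents of script or style elements are kept as text.

```java
// Hypothetical helper: turns markup into a space-separated keyword string.
final class KeywordText {
    static String toKeywordString(String html) {
        return html
                .replaceAll("<[^>]*>", " ") // every tag becomes a single space
                .replaceAll("\\s+", " ")    // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>here is para1</p>"
                + "<p>here is para2</p></body></html>";
        System.out.println(toKeywordString(html));
    }
}
```

Note that this also splits words that are interrupted by inline tags (for example, <b>bo</b>ld becomes "bo ld"), which the text-node approach above handles the same way.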

JSoup $ sign in id tag

How can I use special characters in a Jsoup tag or attribute selector?
For example:
id=HRS_CE_JO_EXT_I_HRS_JOB_OPENING_ID$1
The usual selection syntax doesn't work:
element.select("span#HRS_CE_JO_EXT_I_HRS_JOB_OPENING_ID$0");
Of course, as long as the special characters are towards the end, the "starts with" syntax can be used, but that is a rather ugly workaround.
You can try the attribute selector instead:
final String html = "<div id=HRS_CE_JO_EXT_I_HRS_JOB_OPENING_ID$1>A</div>";
Document doc = Jsoup.parse(html);
// whatever tag
// |
Element element = doc.select("div[id=HRS_CE_JO_EXT_I_HRS_JOB_OPENING_ID$1]").first();
// | |
// attribute = id attribute value
System.out.println(element);
Output:
<div id="HRS_CE_JO_EXT_I_HRS_JOB_OPENING_ID$1">
A
</div>

Javascript for extracting anchor text from anchor tag

I need help with the following.
In JavaScript, I need to pass an input, for example:
str="<a href=www.google.com>Google</a>"; // just an example; the actual input varies
// str is passed as parameter for javascript function
The output should be 'Google'.
I have a regex in Java, and it works fine there.
String regex = "<a[^>]*>(.*?)</a>";
Pattern p = Pattern.compile(regex, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
But in JavaScript it is not working.
How can I do this in JavaScript? Can anyone help me with a JavaScript implementation?
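For reference, the Java side described in the question can be made runnable as below. This is a minimal sketch, assuming the pattern is meant to be read without the spaces; the class and method names are hypothetical, and a \b has been added after "a" so that tags such as <abbr> are not matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the text of the first anchor with a regex.
final class AnchorText {
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text between the first <a ...> and </a>, or null if there is none.
    static String extract(String html) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("<a href=www.google.com>Google</a>"));
    }
}
```

If the anchor contains nested markup, group 1 holds that markup as well, so this only yields plain text for simple anchors like the one in the example.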
I don't think you want to use a regex for this. You can simply try this:
<a id="myLink" href="http://www.google.com">Google</a>
var anchor = document.getElementById("myLink");
alert(anchor.getAttribute("href")); // Extract link
alert(anchor.innerHTML); // Extract Text
EDIT:-(As rightly commented by Patrick Evans)
var str = "<a href=www.google.com>Google</a>";
var str1 = document.createElement('str1');
str1.innerHTML = str;
alert(str1.textContent);
alert( str1.innerText);
Insert the HTML string into an element, and then just get the text:
var str = "<a href=www.google.com>Google</a>";
var div = document.createElement('div');
div.innerHTML = str;
var txt = div.textContent ? div.textContent : div.innerText;
In jQuery this would be:
var str = "<a href=www.google.com>Google</a>";
var txt = $(str).text();
From the suggestions you all gave, I arrived at an answer that works for me:
function extractText(){
var anchText = "<a href=www.google.com>Google</a>";
var str1 = document.createElement('str1');
str1.innerHTML = anchText;
alert("hi "+str1.innerText);
return str1.innerText;
}
Thanks everyone for the support
Just taking an initial stab at this; I can update it if you add more test cases or details to your question:
\w+="<.*>(.*)</.*>"
This matches your provided example. In addition, it doesn't matter if:
the variable name is different
the tag or contents of the tag wrapping the text are different
What will break this, specifically, is if there are angle brackets inside your html tag, which is possible.
Note: It is a much better idea to do this using the DOM, as other answers have attempted; I only answered with a regex because that was what the OP asked for. To the OP: if you can do this without a regex, do that instead. You should avoid parsing HTML with a regex whenever possible, and this regex is not comparable to a full HTML parser.
No need for a regex. Just parse the string with DOMParser, get the element, and then use the DOM object's methods and attributes:
var parser = new DOMParser();
var str = "<a href='www.google.com'>Google</a>";
var dom = parser.parseFromString(str,"text/xml");
//From there use dom like you would use document
var atags = dom.getElementsByTagName("a");
console.log( atags[0].textContent );
//Or
var atag = dom.querySelector("a");
console.log( atag.textContent );
//Or
var atag = dom.childNodes[0];
console.log( atag.textContent );
The only catch is that DOMParser is not supported in IE versions lower than 9.
Well, if you're using jQuery, this should be an easy task.
I would just create an invisible div and render this anchor (<a>) in it. Afterwards you can simply select the anchor and get its inner text.
$('body').append('<div id="invisibleDiv" style="display:none;"></div>'); //create a new invisible div
$('#invisibleDiv').html(str); //Put the content of your "str" into the invisible div
console.log($('a', '#invisibleDiv').html()); //And this should output the text of any anchor inside that invisible DIV.
Remember, to do it this way you must have jQuery loaded on your page.
EDIT: Use this only if you already have jQuery in your project since, as stated below, something as simple as this should not be a reason to include the entire library.
Assuming from the provided code that you are using Java, I would recommend using Jsoup to extract the text inside the anchor tag.
Here's the reason why: Using regular expressions to parse HTML: why not?
String html = "<a href='www.google.com'>Google</a>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href"); // "www.google.com"
String linkText = link.text(); // "Google"
String linkOuterH = link.outerHtml();
// <a href="www.google.com">Google</a>
String linkInnerH = link.html(); // "Google"

Why Jsoup cannot select td element?

I made a little test (with Jsoup 1.6.1):
String s = "" +Jsoup.parse("<td></td>").select("td").size();
System.out.println("Selected elements count : " + s);
It outputs:
Selected elements count : 0
But it should return 1, because I parsed HTML containing a td element. Is something wrong with my code, or is there a bug in Jsoup?
Because Jsoup is an HTML5-compliant parser and you fed it invalid HTML. A <td> has to go inside at least a <table>.
int size = Jsoup.parse("<table><td></td></table>").select("td").size();
System.out.println("Selected elements count : " + size);
String url = "http://foobar.com";
Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td");
Jsoup 1.6.2 lets you choose a different parser, and a simple XML parser is provided. With the following code I could solve my problem. You can later parse your fragment with the HTML parser to get valid HTML.
// Jsoup 1.6.2
String s = "" + Jsoup.parse("<td></td>", "", Parser.xmlParser()).select("td").size();
System.out.println("Selected elements count : " + s);
