I am trying to extract img elements using Jsoup. It works fine for images without spaces in the filename, but it extracts only the first part of the name if the filename contains whitespace.
I tried the following:
String result = Jsoup.clean(content,"https://rally1.rallydev.com/", Whitelist.relaxed().preserveRelativeLinks(true), new Document.OutputSettings().prettyPrint(false));
Document doc = Jsoup.parse(result);
Elements images = doc.select("img");
Example HTML content:
Description:<div>some text content<br /></div>
<div><img src=/slm/attachment/43647556403/My file with space.png /></div>
<div><img src=/slm/attachment/43648152373/my_file_without_space.png/></div>
result content is:
Description:Some text content<br> <img src="/slm/attachment/43647556403/My"><img src="/slm/attachment/43648152373/my_file_without_space.png/">
In "result", the src of the image with spaces in its filename contains only the first part, "My". Everything after the first whitespace is ignored.
How can I extract the filename if it contains spaces?
This can't be solved easily in Jsoup, because the parser correctly identifies the src attribute value of the example with spaces as ending at My. The file, with and space.png parts are, in this example, parsed as further attributes without values. You can, however, use Jsoup to append the keys of the attributes that follow src to its value, for example like this:
String test =""
+ "<div><img src=/slm/attachment/43647556403/My file with space.png /></div>"
+ "<div><img src=/slm/attachment/43647556403/My file with space.png name=whatever/></div>"
+ "<div><img src=/slm/attachment/43647556403/This breaks it.png name=whatever/></div>"
+ "<div><img src=\"/slm/attachment/43647556403/This works.png\" name=whatever/></div>"
+ "<div><img src=/slm/attachment/43648152373/my_file_without_space.png/></div>";
Document doc = Jsoup.parse(test);
Elements imgs = doc.select("img");
for (Element img : imgs){
Attribute src = null;
StringBuffer newSrcVal = new StringBuffer();
List<String> toRemove = new ArrayList<>();
for (Attribute a : img.attributes()){
if (a.getKey().equals("src")){
newSrcVal.append(a.getValue());
src = a;
}
else if (newSrcVal.length()>0){
//we already found the src attribute
if (a.getValue().isEmpty()){
newSrcVal.append(" ").append(a.getKey());
toRemove.add(a.getKey());
}
else{
//the empty attributes, i.e. file name parts are over
break;
}
}
}
for (String toRemAttr : toRemove){
img.removeAttr(toRemAttr);
}
src.setValue(newSrcVal.toString());
}
System.out.println(doc);
This algorithm loops over all img elements and, within each img, over its attributes. When it finds the src attribute, it keeps a reference to it and starts filling the newSrcVal StringBuffer. All following value-less attributes are appended to newSrcVal until either an attribute with a value is found or there are no more attributes. Finally, the former empty attributes are removed from the DOM and the src attribute value is set to the contents of newSrcVal.
Note that this will not work when your filename contains two or more consecutive spaces. Jsoup discards the whitespace between attributes, so you can't get it back after parsing. If you need that, you have to manipulate the input HTML before parsing.
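One way to do that pre-processing is sketched below, as a rough illustration rather than a general solution. It assumes that the unquoted src values end in a known image extension (png, jpg, jpeg, gif); the SrcQuoter class and quoteUnquotedSrc method are names made up for this example and are not part of Jsoup. The regular expression wraps such values in double quotes, so the parser then sees the whole filename, including runs of spaces, as a single attribute value.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): quotes unquoted src values before parsing.
class SrcQuoter {
    // "src=" that is not part of a longer attribute name and is not followed by a quote,
    // then the shortest run of characters ending in an assumed image extension
    // that is itself followed by whitespace, '/' or '>'.
    private static final Pattern UNQUOTED_SRC = Pattern.compile(
            "(?<![\\w-])src=(?![\"'])([^>\"']*?\\.(?:png|jpe?g|gif))(?=[\\s/>])",
            Pattern.CASE_INSENSITIVE);

    static String quoteUnquotedSrc(String html) {
        return UNQUOTED_SRC.matcher(html).replaceAll("src=\"$1\"");
    }

    public static void main(String[] args) {
        String html = "<div><img src=/slm/attachment/43647556403/My  file with space.png /></div>";
        // The value is now quoted, so the double space survives parsing.
        System.out.println(quoteUnquotedSrc(html));
    }
}
```

The output of quoteUnquotedSrc can then be passed to Jsoup.parse as usual. Values that are already quoted are left untouched.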
You can do something like this:
Elements images = doc.select("img");
for(Element image: images){
String imgSrc = image.attr("src");
imgSrc = imgSrc.substring(imgSrc.lastIndexOf("/") + 1); // this will give you name.png
}
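A slightly more defensive variant of the same idea is sketched below. It is only an illustration: the ImageNames class and fileNameOf method are invented for this example, and it additionally assumes that a trailing slash (as in the second sample img of the question) should be ignored.

```java
// Hypothetical helper: extracts the file name from a src value.
class ImageNames {
    // Returns the part after the last '/', ignoring surrounding whitespace and trailing slashes.
    static String fileNameOf(String src) {
        String s = src.trim();
        while (s.endsWith("/")) {
            s = s.substring(0, s.length() - 1);
        }
        // lastIndexOf returns -1 when there is no slash, so the whole string is kept.
        return s.substring(s.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(fileNameOf("/slm/attachment/43647556403/My file with space.png"));
        System.out.println(fileNameOf("/slm/attachment/43648152373/my_file_without_space.png/"));
    }
}
```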
I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text (it doesn't matter what the text is, it just needs to be non-empty, not just an image, etc.).
Example of links I want:
Link to Some Page
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: text() returns clean text, so if there are any HTML fragments inside the link, it won't return them.
Document doc = // get the doc
Elements linksOnPage = doc.select("a");
for (Element pageElem : linksOnPage){
String link = "";
if(pageElem.text().trim().equals(""))
continue;
// do smth with it
}
I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
I am playing around with Nutch. I am trying to write something that detects specific nodes in the DOM structure and extracts text data from around each node, e.g. text from parent nodes, sibling nodes, etc. I researched and read some examples and then tried writing a plugin that does this for an image node. Some of the code:
if("img".equalsIgnoreCase(nodeName) && nodeType == Node.ELEMENT_NODE){
String imageUrl = "No Url";
String altText = "No Text";
String imageName = "No Image Name"; //For the sake of simpler code, default values set to
//avoid nullpointerException in findMatches method
NamedNodeMap attributes = currentNode.getAttributes();
List<String>ParentNodesText = new ArrayList<String>();
ParentNodesText = getSurroundingText(currentNode);
//Analyze the attributes values inside the img node. <img src="xxx" alt="myPic">
for(int i = 0; i < attributes.getLength(); i++){
Attr attr = (Attr)attributes.item(i);
if("src".equalsIgnoreCase(attr.getName())){
imageUrl = getImageUrl(base, attr);
imageName = getImageName(imageUrl);
}
else if("alt".equalsIgnoreCase(attr.getName())){
altText = attr.getValue().toLowerCase();
}
}
private List<String> getSurroundingText(Node currentNode){
List<String> SurroundingText = new ArrayList<String>();
while(currentNode != null){
if(currentNode.getNodeType() == Node.TEXT_NODE){
String text = currentNode.getNodeValue().trim();
SurroundingText.add(text.toLowerCase());
}
if(currentNode.getPreviousSibling() != null && currentNode.getPreviousSibling().getNodeType() == Node.TEXT_NODE){
String text = currentNode.getPreviousSibling().getNodeValue().trim();
SurroundingText.add(text.toLowerCase());
}
currentNode = currentNode.getParentNode();
}
return SurroundingText;
}
This doesn't seem to work properly. The img tag gets detected and the image name and URL get retrieved, but nothing more. The getSurroundingText method looks too ugly; I tried but couldn't improve it. I don't have a clear idea of where and how I can extract text that could be related to the image. Any help, please?
You're on the right track. However, take a look at this example HTML:
<div>
<span>test1</span>
<img src="http://example.com" alt="test image" title="awesome title">
<span>test2</span>
</div>
In your case, I think the problem lies in the sibling nodes of the img node. You're looking at the direct siblings, and you may think that in the previous example these would be the span nodes. In fact they are whitespace-only text nodes, so when you ask for the sibling node of the img you get an empty node with no actual text.
If we rewrite the previous HTML as: <div><span>test1</span><img src="http://example.com" alt="test image" title="awesome title"><span>test2</span></div> then the sibling nodes of the img would be the span nodes that you want.
I'm assuming that in the previous example you want to get both "test1" and "test2". In that case you need to keep moving until you find a Node.ELEMENT_NODE and then fetch the text inside that node. One good practice would be not to grab everything you find, but to limit your scope to p, span and div to improve accuracy.
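The sibling walk described above can be sketched with the plain org.w3c.dom API as follows. This is a minimal illustration under a few assumptions: the sample markup is written as well-formed XML so that the JDK parser can read it, the SiblingText class and its methods are hypothetical names, and the set of accepted tags (p, span, div) simply follows the suggestion above.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: finds the text of the nearest element sibling of a node.
class SiblingText {
    private static final Set<String> TEXT_TAGS = Set.of("p", "span", "div");

    // Skips whitespace-only text nodes and other non-matching siblings.
    static String nearestSiblingText(Node node, boolean forward) {
        Node n = forward ? node.getNextSibling() : node.getPreviousSibling();
        while (n != null) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && TEXT_TAGS.contains(n.getNodeName().toLowerCase(Locale.ROOT))) {
                return n.getTextContent().trim();
            }
            n = forward ? n.getNextSibling() : n.getPreviousSibling();
        }
        return "";
    }

    // Parses the sample markup; only used to get a DOM for the demonstration.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div>\n  <span>test1</span>\n"
                + "  <img src=\"http://example.com\" alt=\"test image\"/>\n"
                + "  <span>test2</span>\n</div>";
        Node img = parse(xml).getElementsByTagName("img").item(0);
        System.out.println(nearestSiblingText(img, false)); // test1
        System.out.println(nearestSiblingText(img, true));  // test2
    }
}
```

In the sample, the direct previous sibling of the img is a whitespace text node, which is why the loop has to keep moving before it reaches the span.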
Using Jsoup.clean, is it possible to convert this string:
Here is some <b>important</b> stuff that can't have
<script>javascript</script> or the following embed tag
<embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"> movie
in the output
to this:
Here is some <b>important</b> stuff that can't have
&lt;script&gt;javascript&lt;/script&gt; or the following embed tag
&lt;embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"&gt;
movie in the output
so that it renders as:
Here is some important stuff that can't have
<script>javascript</script> or the following embed tag
<embed src="helloworld.swf" type="application/vnd.adobe.flash-movie">
movie in the output
Here the bold tag is allowed and left alone, but the delimiters of the script and embed tags change from < and > to &lt; and &gt;, so they are treated as plain text and not as real HTML elements.
What settings are necessary to accomplish this? I have:
private static String limitHtml(String value) {
String result = value;
if (value != null && !value.isEmpty()) {
Document.OutputSettings settings = new Document.OutputSettings();
settings.prettyPrint(false);
// what other settings ???
Whitelist whitelist = Whitelist.none().addTags(ALLOWED_HTML_TAGS);
whitelist.addAttributes(":all", ALLOWED_HTML_ATTRIBUTES);
result = Jsoup.clean(value, "", whitelist, settings);
}
return result;
}
Is there a similar Java library that can accomplish this if Jsoup can't?
Jsoup definitely has your back here. The trick is to use a dummy document (the transitional variable in the code) with a single pre element in it.
We simply load each disallowed element we find into this pre element.
Then we replace the disallowed element in the initial value with its escaped HTML code.
CODE
// Comma separated list of allowed tags.
private static String ALLOWED_HTML_TAGS_CSS_QUERY = "b,span";
private static String limitHtml(String value) {
String result = value;
if (value != null && !value.isEmpty()) {
// Build a sided document. It will help us escape unallowed tags.
Document transitional = Jsoup.parse("<pre></pre>");
// Parse the actual value for finding unallowed tags
Document doc = Jsoup.parseBodyFragment(value, "");
Elements unallowedElements = doc.select("*:not("+ALLOWED_HTML_TAGS_CSS_QUERY+")");
for (Element e : unallowedElements) {
switch (e.tagName()) {
case "#root": case "html": case "head": case "body":
// Those tags are added automatically by Jsoup. Nothing to do...
break;
default:
// Load the unallowed element to escape its html code in the transitional document
Element pre = transitional.select("pre").first().text(e.outerHtml());
// Replace unallowed element with its escape html code
e.replaceWith(new TextNode(pre.text(), ""));
}
}
// Get the final sanitized value
Document.OutputSettings settings = new Document.OutputSettings();
settings.prettyPrint(false);
Whitelist whitelist = Whitelist.none().addTags(ALLOWED_HTML_TAGS);
whitelist.addAttributes(":all", ALLOWED_HTML_ATTRIBUTES);
result = Jsoup.clean(doc.body().html(), "", whitelist, settings);
}
return result;
}
SAMPLE USAGE
String unsanitizedHtml = "Here is some <b>important</b> stuff that can't have " + //
"<script>javascript</script> or the following embed tag " + //
"<embed src=\"helloworld.swf\" type=\"application/vnd.adobe.flash-movie\"> movie" + //
"in the output";
System.out.println("BEFORE:\n" + unsanitizedHtml);
System.out.println();
System.out.println("AFTER:\n" + limitHtml(unsanitizedHtml));
OUTPUT
BEFORE:
Here is some <b>important</b> stuff that can't have <script>javascript</script> or the following embed tag <embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"> moviein the output
AFTER:
Here is some <b>important</b> stuff that can't have &lt;script&gt;javascript&lt;/script&gt; or the following embed tag &lt;embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"&gt; moviein the output
I have found several topics with similar questions and valuable answers, but I am still struggling with this:
I want to parse some html with Jsoup so I can replace, for example,
"changeme"
with
<changed>changeme</changed>
, but only if it appears in a text portion of the HTML, not if it is part of a tag. So, starting with this HTML:
<body>
<p>test changeme app</p>
</BODY>
</HTML>
I would want to get to this:
<body>
<p>test <changed>changeme</changed> app</p>
</BODY>
</HTML>
I have tried several approaches; this is the one that brings me closest to the desired result:
Document doc = null;
try {
doc = Jsoup.parse(new File("tmp1450348256397.txt"), "UTF-8");
} catch (Exception ex) {
}
Elements els = doc.body().getAllElements();
for (Element e : els) {
if (e.text().contains("changeme")) {
e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
}
}
html = doc.toString();
System.out.println(html);
But with this approach I find two problems:
<body>
<p><a href="http://<changed>changeme</changed> .html">test
<changed>
changeme
</changed>
app</a></p>
</BODY>
</HTML>
Line breaks are inserted before and after the new element I am introducing. This is not a real problem, as I could get rid of them by using #changed# for the replacement and, after doc.toString(), replacing it again with the desired value (with < >).
The real problem: the URL in the href has been modified, and I don't want that to happen.
Ideas? Thx.
Here is my solution:
String html=""
+"<p><a href=\"http://changeme.html\">"
+ "test changeme "
+ "<div class=\"changeme\">"
+ "inner text changeme"
+ "</div>"
+ " app</a>"
+"</p>";
Document doc = Jsoup.parse(html);
Elements els = doc.body().getAllElements();
for (Element e : els) {
List<TextNode> tnList = e.textNodes();
for (TextNode tn : tnList){
String orig = tn.text();
tn.text(orig.replaceAll("changeme","<changed>changeme</changed>"));
}
}
html = doc.toString();
System.out.println(html);
TextNodes are always leaf nodes, i.e. they do not contain further HTML elements. In your original approach you replace the HTML of an element with new HTML in which the changeme strings are replaced. You only check that changeme is part of the element's text, but you then replace every occurrence in the HTML string of the element, including all occurrences outside TextNodes.
My solution basically works like yours, but I use the Jsoup method textNodes(). This way I don't need to typecast.
P.S.
Of course, my solution as well as yours will contain &lt;changed&gt;changeme&lt;/changed&gt; instead of <changed>changeme</changed> in the end. This may or may not be what you want. If you do not want this, then your result is no longer valid HTML, since changed is not a valid HTML tag. Jsoup will not help you in this case. However, you can of course turn every &lt;changed&gt;changeme&lt;/changed&gt; in the resulting string back into <changed>changeme</changed> outside Jsoup.
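That post-processing step could look roughly like the following sketch. It assumes that the only escaped markup to restore is the changed tag pair from the question; the MarkerUnescape class and restoreChangedTags method are names invented for this example, and the code is plain string replacement, not a Jsoup feature.

```java
// Hypothetical helper: turns the escaped <changed> markers back into real tags.
class MarkerUnescape {
    static String restoreChangedTags(String html) {
        // String.replace works on literal text, so no regex escaping is needed.
        return html.replace("&lt;changed&gt;", "<changed>")
                   .replace("&lt;/changed&gt;", "</changed>");
    }

    public static void main(String[] args) {
        String serialized = "<p>test &lt;changed&gt;changeme&lt;/changed&gt; app</p>";
        System.out.println(restoreChangedTags(serialized));
        // <p>test <changed>changeme</changed> app</p>
    }
}
```

Attribute values such as the href are untouched, because only the escaped marker sequences are replaced.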
I think your issue is that you're replacing the element's HTML rather than just its text. Change:
e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
to
e.text(e.text().replaceAll("changeme","<changed>changeme</changed>"));
The line-break issue can probably be solved by calling doc.outputSettings().prettyPrint(false); before html = doc.toString();
Finally I tried this solution (at the end of the question), using TextNodes:
How I can replace "text" in the each tag using Jsoup
This is the resulting code:
Elements els = doc.body().getAllElements();
for (Element e : els) {
for (Node child : e.childNodes()){
if (child instanceof TextNode && !((TextNode) child).isBlank()) {
((TextNode)child).text(((TextNode)child).text().replaceAll("changeme","<changed>changeme</changed>"));
}
}
}
Now the output is as expected, and it does not even introduce extra line breaks. In this case prettyPrint must be set to true.
The only problem is that I don't really understand the difference between using TextNode and Element.text(). If someone wants to provide some info, it will be much appreciated.
Thanks.
Document doc = Jsoup.connect("http://reviews.opentable.com/0938/9/reviews.htm").get();
Element part = doc.body();
Elements parts = part.getElementsByTag("span");
String attValue;
String html;
for(Element ent : parts)
{
if(ent.hasAttr("class"))
{
attValue = ent.attr("class");
if(attValue=="BVRRReviewText description")
{
System.out.println("\n");
html=ent.text();
System.out.println(html);
}
}
}
I am using Jsoup.jar for the above program.
I am accessing the web page, and my aim is to print the text found within the tag <span class="BVRRReviewText description">text</span>.
But nothing is printed as output; nothing is ever assigned to the String html in the program, although attValue does receive all the class attribute values of the span tags.
Where did I go wrong? Please advise.
if(attValue=="BVRRReviewText description")
should be
if(attValue.equals("...")) surely?
This is Java, not Javascript.
Change
attValue=="BVRRReviewText description"
for
attValue.matches("...")
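The difference between == and equals can be seen in a small stand-alone sketch. The StringCompare class and its methods are made up for this illustration; new String(...) stands in for a value that is built at runtime, as an attribute value read from a parsed page would be. The hasClass helper is an extra assumption, in case the order or number of class names in the attribute varies.

```java
// Hypothetical demonstration of reference comparison versus content comparison.
class StringCompare {
    // == on objects compares references, not characters.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    // Checks whether a whitespace-separated class attribute contains a given class name.
    static boolean hasClass(String classAttr, String name) {
        for (String token : classAttr.trim().split("\\s+")) {
            if (token.equals(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String expected = "BVRRReviewText description";
        String attValue = new String(expected); // distinct object, same characters

        System.out.println(sameObject(attValue, expected));     // false
        System.out.println(attValue.equals(expected));          // true
        System.out.println(hasClass(attValue, "description"));  // true
    }
}
```

Note that String.matches treats its argument as a regular expression, so equals is the more direct choice for a literal comparison.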