Get URL and remove first image in an HTML String with Java - java

I want to get the URL of the first image in an HTML String and then replace it with an empty String.
The images can appear in these two forms in my String:
<img src="http://www.mywebsite.de/wp-content/uploads/2014/11/picture.jpg" alt="MyImage" width="635" height="311" class="aligncenter size-full wp-image-32729" />
<img class="aligncenter size-full wp-image-38590" src="http://www.mywebsite.de/wp-content/uploads/2014/11/picture2.jpg" alt="MyImage2" width="635" height="303" />
I want to extract the URL as String http://www.mywebsite.de/wp-content/uploads/2014/11/picture.jpg and replace it with an empty string.
At the moment I use this code to get the URL:
/**
 * Method to get the URL of the first image
 */
public String getFirstImageURL(String description) {
    Document doc = Jsoup.parse(description);
    Element imageElement = doc.select("img").first();
    String absoluteUrl = imageElement.absUrl("src"); // absolute URL on src
    // String srcValue = imageElement.attr("src"); // exact content value of the attribute
    return absoluteUrl;
}
This way I can retrieve the correct URL, but I cannot replace the complete HTML tag with an empty String. If I use
// Get description string
String imageURL = getFirstImageURL(HTMLString);
HTMLString = HTMLString.replaceAll(imageURL, "");
I still have <img src="" alt="MyImage" width="635" height="317" class="aligncenter size-full wp-image-13794" /> in the HTMLString.
Does anyone have an idea how I can completely remove the HTML tag?
SOLUTION
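/**
 * Stores the URL of the first image in the imageURL field and returns
 * the description with that image removed.
 */
public String getFirstImageURL(String description) {
    Document doc = Jsoup.parse(description);
    Element imageElement = doc.select("img").first();
    if (imageElement == null) {
        return description; // no image found, nothing to remove
    }
    imageURL = imageElement.absUrl("src"); // absolute URL on src; imageURL is assumed to be a field
    imageElement.remove();
    description = doc.toString();
    return description;
}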

From what I understand, you want to remove the entire <img> element:
Element imageElement = doc.select("img").first();
imageElement.remove();
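
To see why the replaceAll attempt in the question leaves the tag behind, here is a minimal plain-Java sketch. It only deletes the characters of the URL itself, so the surrounding markup survives with an empty src. The class and method names (ReplaceUrlDemo, stripUrl) are made up for illustration. Note also that replaceAll compiles its first argument as a regular expression, while String.replace matches literally.

```java
class ReplaceUrlDemo {
    /** Removes every literal occurrence of url from html; the markup around it is untouched. */
    static String stripUrl(String html, String url) {
        // String.replace matches literally, unlike replaceAll, which treats its first argument as a regex.
        return html.replace(url, "");
    }

    public static void main(String[] args) {
        String url = "http://www.mywebsite.de/wp-content/uploads/2014/11/picture.jpg";
        String html = "<p>Text</p><img src=\"" + url + "\" alt=\"MyImage\" />";
        // The img tag is still there, only its src value is gone.
        System.out.println(stripUrl(html, url));
    }
}
```

Removing the element from the parsed document, as in the answer above, avoids this because it operates on the whole tag rather than on a substring.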

Related

How to extract the 'src' attribute as per the html through Selenium and Java

I have a html page:
<video crossorigin="anonymous" class="" id="video" playsinline="true"
src="https://df.dfs.bnt.com/
DEEAB832j06E9j413FjAA8Dj2zxc535DA2072E3jW01j15579/mp4-
hi/jFNf2IbBoGF28IzyU_WqkA,1535144598/zxxx/
contents/1/8/1a57ae021173751b468cca136e0192.mp4?
rnd=0.38664488150364296">
</video>
Through Selenium WebDriver I tried get video url:
By videoLocator = By.id("video");
WebElement videoElement = driver.findElement(videoLocator);
String videoUrl = videoElement.getAttribute("src");
But videoUrl is always empty ("").
However, for example,
videoElement.getAttribute("crossorigin")
returns the correct value: "anonymous".
I have also tried to get this URL from the src attribute using JavaScript:
String videoUrl = (String) js.executeScript("return document.getElementById( 'video' ).getAttribute( 'src' );");
But the result is still the same: "".
I guess that the problem is in crossorigin="anonymous" but what to do with it? How can I get src value?
Sorry, for my poor English.
As per the HTML you have provided, you need to use a WebDriverWait so that the element is visible before you read it. You can use the following solution:
WebElement videoElement = new WebDriverWait(driver, 20).until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//video[@id='video' and @crossorigin='anonymous'][starts-with(@src,'http')]")));
System.out.println(videoElement.getAttribute("src"));
Try fetching the innerHTML attribute instead:
By videoLocator = By.id("video");
WebElement videoElement = driver.findElement(videoLocator);
String videoUrl = videoElement.getAttribute("innerHTML");

how can I fetch outer div text only with JSoup?

I have the following html code:
<div class="description">
<div class='daterange'>
Hello
<span itemprop='startDate'>
June 3, 2011
</span>
</div>
This is some description <i>that</i> I want to fetch
</div><br/>
and I want to extract only the part:
This is some description <i>that</i> I want to fetch
How can I do it with jsoup?
I tried using String description = doc.select("div.description").text() but then I get all of the content inside, including the nested div.
What you need is to create a String that holds the words of the HTML file.
The following code does this; doc.body().text() returns the text without any of the HTML tags.
public String getWords(String url) {
    String text = "";
    try {
        Document doc = Jsoup.connect(url).get();
        text = doc.body().text();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
    return text;
}
Try this
String description = doc.select("div").remove().first().html();

Jsoup : how to extract img with space in filename?

I am trying to extract img elements using Jsoup. It works fine for images without any space in the filename, but it extracts only the first part if the filename contains whitespace.
I tried the following:
String result = Jsoup.clean(content,"https://rally1.rallydev.com/", Whitelist.relaxed().preserveRelativeLinks(true), new Document.OutputSettings().prettyPrint(false));
Document doc = Jsoup.parse(result);
Elements images = doc.select("img");
Example HTML content:
Description:<div>some text content<br /></div>
<div><img src=/slm/attachment/43647556403/My file with space.png /></div>
<div><img src=/slm/attachment/43648152373/my_file_without_space.png/></div>
The resulting content is:
Description:Some text content<br> <img src="/slm/attachment/43647556403/My"><img src="/slm/attachment/43648152373/my_file_without_space.png/">
In "result", the image with a space in its file name keeps only the first part, "My". Everything after the whitespace is ignored.
How can I extract the filename if it contains spaces?
The problem can't easily be solved in Jsoup, because the parser correctly identifies the src attribute value of the example with spaces as ending at My. The parts file, with and space.png are parsed as further attributes without values. You can, however, use Jsoup to concatenate the attribute keys that follow the src attribute onto its value, for example like this:
String test =""
+ "<div><img src=/slm/attachment/43647556403/My file with space.png /></div>"
+ "<div><img src=/slm/attachment/43647556403/My file with space.png name=whatever/></div>"
+ "<div><img src=/slm/attachment/43647556403/This breaks it.png name=whatever/></div>"
+ "<div><img src=\"/slm/attachment/43647556403/This works.png\" name=whatever/></div>"
+ "<div><img src=/slm/attachment/43648152373/my_file_without_space.png/></div>";
Document doc = Jsoup.parse(test);
Elements imgs = doc.select("img");
for (Element img : imgs) {
    Attribute src = null;
    StringBuffer newSrcVal = new StringBuffer();
    List<String> toRemove = new ArrayList<>();
    for (Attribute a : img.attributes()) {
        if (a.getKey().equals("src")) {
            newSrcVal.append(a.getValue());
            src = a;
        } else if (newSrcVal.length() > 0) {
            // we already found the src attribute
            if (a.getValue().isEmpty()) {
                newSrcVal.append(" ").append(a.getKey());
                toRemove.add(a.getKey());
            } else {
                // the empty attributes, i.e. file name parts, are over
                break;
            }
        }
    }
    for (String toRemAttr : toRemove) {
        img.removeAttr(toRemAttr);
    }
    if (src != null) { // skip img elements without a src attribute
        src.setValue(newSrcVal.toString());
    }
}
System.out.println(doc);
This algorithm cycles over all img elements, and within each img it cycles over its attributes. When it finds the src attribute, it keeps it for reference and starts to fill the newSrcVal StringBuffer. All following value-less attributes are appended to newSrcVal until either another attribute with a value is found or there are no more attributes. Finally, the src attribute value is reset to the contents of newSrcVal and the former empty attributes are removed from the DOM.
Note that this will not work when your filename contains two or more consecutive spaces. Jsoup discards those spaces between attributes, so you can't get them back after parsing. If you need that, you have to manipulate the input HTML before parsing.
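
One way to manipulate the input before parsing is to put quotes around unquoted src values with a regular expression, so the parser sees the whole file name as one attribute value. This is only a rough sketch using plain java.util.regex, not part of Jsoup. It assumes that an unquoted src value ends at the first .png, .jpg, .jpeg or .gif extension, and the names QuoteImgSrc and quoteUnquotedSrc are made up for illustration.

```java
import java.util.regex.Pattern;

class QuoteImgSrc {
    // Assumption: an unquoted src value ends at the first image extension listed here.
    private static final Pattern UNQUOTED_SRC = Pattern.compile(
            "(<img\\b[^>]*?\\bsrc=)(?![\"'])([^>]*?\\.(?:png|jpe?g|gif))",
            Pattern.CASE_INSENSITIVE);

    /** Wraps unquoted img src values in double quotes; already quoted values are left alone. */
    static String quoteUnquotedSrc(String html) {
        return UNQUOTED_SRC.matcher(html).replaceAll("$1\"$2\"");
    }

    public static void main(String[] args) {
        // Two consecutive spaces in the file name survive because nothing has been parsed yet.
        String html = "<div><img src=/slm/attachment/43647556403/My  file with space.png /></div>";
        System.out.println(quoteUnquotedSrc(html));
    }
}
```

The result of quoteUnquotedSrc can then be passed to Jsoup.parse as usual. File names that contain an extension-like fragment in the middle, or attributes such as data-src, would need a stricter pattern.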
You can do something like this:
Elements images = doc.select("img");
for (Element image : images) {
    String imgSrc = image.attr("src");
    imgSrc = imgSrc.substring(imgSrc.lastIndexOf("/") + 1); // this will give you name.png
}
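
The substring step above can be wrapped in a small helper that also copes with a src that has no slash or carries a query string. This is a sketch in plain Java; ImageFileName and fromSrc are hypothetical names, and it assumes the src value has already been read from the element.

```java
class ImageFileName {
    /** Returns the last path segment of an image src, ignoring any query string or fragment. */
    static String fromSrc(String src) {
        int cut = src.length();
        int query = src.indexOf('?');
        int fragment = src.indexOf('#');
        if (query >= 0) cut = Math.min(cut, query);
        if (fragment >= 0) cut = Math.min(cut, fragment);
        String path = src.substring(0, cut);
        // lastIndexOf returns -1 when there is no slash, so +1 keeps the whole string.
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(fromSrc("/slm/attachment/43647556403/My file with space.png"));
    }
}
```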

jSoup get title from img tag

I have a scenario where I need to pull the title from a img tag like below.
<img alt="Bear" border="0" src="/images/teddy/5433.gif" title="Bear"/>
I was able to get the image URL, but how do I get the title from the img tag?
In the example above, title="Bear"; that is the value I want to extract.
Use Element#attr() to extract arbitrary element attributes.
Element img = selectItSomehow();
String title = img.attr("title");
// ...
See also:
Jsoup Cookbook - Extract attributes, text, and HTML from elements
String html = "<img alt='Bear' border='0' src='/images/teddy/5433.gif' title='Bear'/>";
Document doc = Jsoup.parse(html);
Element e = doc.select("img[title]").first();
String title = e.attr("title");
System.out.println(title);

How to extract Dynamic text from a webpage

I want to get some text from a web page that changes frequently. What technologies can I use for this? As an example, I want to extract a currency rate that changes every day from a web page and save it in a DB. Please let me know if anyone knows how to do this.
Thanks
You can use JSoup to parse the HTML.
Example :
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
String linkOuterH = link.outerHtml();
// "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"
You can look up a particular div or any other tag the same way; check the example.
