How can I fetch only the outer div's text with Jsoup? - java

I have the following HTML code:
<div class="description">
<div class='daterange'>
Hello
<span itemprop='startDate'>
June 3, 2011
</span>
</div>
This is some description <i>that</i> I want to fetch
</div><br/>
and I want to extract only the part:
This is some description <i>that</i> I want to fetch
How can I do it with jsoup?
I tried using String description = doc.select("div.description").text() but then I get all of the content inside it, including the text of the nested div.

You need to build a String that holds the text of the HTML document.
The following code does that: doc.body().text() returns the text of the whole body with all HTML tags stripped.
public String getWords(String url) {
    String text = "";
    try {
        Document doc = Jsoup.connect(url).get();
        text = doc.body().text();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
    return text;
}

Try this (note that remove() detaches the matched div elements from doc):
String description = doc.select("div").remove().first().html();
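A non-destructive variant, as a minimal sketch: it assumes Jsoup is on the classpath and that the unwanted block is always a direct child div of div.description. It works on a copy of the element, so the parsed document itself is left unchanged. The class name OuterDivHtml and the method name outerDescription are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class OuterDivHtml {

    // Returns the inner HTML of the first div.description without its direct child divs.
    // Returns an empty string when no such div exists.
    static String outerDescription(String html) {
        Document doc = Jsoup.parse(html);
        Element description = doc.select("div.description").first();
        if (description == null) {
            return "";
        }
        // Work on a detached copy so the original document is not modified
        Element copy = description.clone();
        for (Element child : copy.children()) {
            if ("div".equals(child.tagName())) {
                child.remove();
            }
        }
        // html() keeps inline markup such as <i>; text() would strip it
        return copy.html().trim();
    }

    public static void main(String[] args) {
        String html = "<div class=\"description\">"
                + "<div class='daterange'>Hello <span itemprop='startDate'>June 3, 2011</span></div>"
                + "This is some description <i>that</i> I want to fetch"
                + "</div><br/>";
        System.out.println(outerDescription(html));
    }
}
```

Element.ownText() is shorter, but it skips the contents of child elements, so the word inside <i> would be lost along with the nested div.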

Related

I can't use Jsoup in Java

I want to extract the four fields I marked in the table in the picture, and the same fields for the rows that follow, with Jsoup. But I couldn't work out which HTML elements to select.
Here are the website and my code:
https://www.ilan.gov.tr/ilan/kategori/9/ihale-duyurulari
Document doc = Jsoup.connect("https://www.ilan.gov.tr/ilan/kategori/9/ihale-duyurulari").get();
//System.out.println(doc.outerHtml());
for (Element row : doc.select("search-results-content row ng-tns-c146-3")) {
    final String title = row.select(".list-desc ng-tns-c152-3").text();
    final String title1 = row.select(".col col-4 col-lg-4 col-border ng-star-inserted").text();
    System.out.println(title);
}
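One likely problem is the selector syntax. In Jsoup's CSS selectors a class needs a leading dot, classes on the same element are chained without spaces (.a.b), and a space means "descendant of", so a bare word such as row is read as a tag name. The sketch below shows this on a small hand-written HTML string; the markup is an assumption based on the class names in the question, not the real page. The ng-tns-... and ng-star-inserted classes are generated by Angular, so the listing is probably rendered by JavaScript, and Jsoup does not execute JavaScript; in that case the downloaded document would not contain these rows at all.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ClassSelectorSketch {

    // Collects the text of .list-desc inside every element that has
    // both the classes "search-results-content" and "row".
    static List<String> titles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        // ".a.b" matches one element carrying both classes
        for (Element row : doc.select(".search-results-content.row")) {
            // generated classes like ng-tns-c152-3 are left out; they can change between builds
            result.add(row.select(".list-desc").text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical markup, written by hand for this example
        String html =
                "<div class=\"search-results-content row ng-tns-c146-3\">"
              + "<div class=\"list-desc ng-tns-c152-3\">First notice</div></div>"
              + "<div class=\"search-results-content row ng-tns-c146-3\">"
              + "<div class=\"list-desc ng-tns-c152-3\">Second notice</div></div>";
        for (String title : titles(html)) {
            System.out.println(title);
        }
    }
}
```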

Jsoup clean title tag failure

I am using Jsoup 1.9.2 to process and clean some XML input of specific tags. During this, I noticed that Jsoup behaves strangely when it is asked to clean title tags. Specifically, other XML tags within the title tag do not get removed, and in fact get replaced by their escaped forms.
I created a short unit test for this as below. The test fails, because output comes out as CuCl&lt;sub&gt;2&lt;/sub&gt;.
@Test
public void stripXmlSubInTitle() {
    final String input = "<title>CuCl<sub>2</sub></title>";
    final String output = Jsoup.clean(input, Whitelist.none());
    assertEquals("CuCl2", output);
}
If the title tag is replaced with other tags (e.g., p or div), then everything works as expected. Any explanation and workaround will be appreciated.
The title tag belongs inside the head tag (or, in HTML5, directly inside the html tag). Because it holds the title of the HTML document, which browsers mostly show in the window or tab, it is not supposed to have child tags.
Jsoup therefore treats it differently from actual content tags such as p or div; the same applies to textarea.
Edit:
You could do something like this:
public static void main(String[] args) {
    try {
        final String input = "<content><title>CuCl<sub>2</sub></title><othertag>blabla</othertag><title>title with no subtags</title></content>";
        Document document = Jsoup.parse(input);
        Elements titles = document.getElementsByTag("title");
        for (Element element : titles) {
            element.text(Jsoup.clean(element.ownText(), Whitelist.none()));
        }
        System.out.println(document.body().toString());
    } catch (Exception e) {
        e.printStackTrace();
    }
}
That would return:
<body>
<content>
<title>CuCl2</title>
<othertag>
blabla
</othertag>
<title>title with no subtags</title>
</content>
</body>
Depending on your needs, some adjustments need to be made, e.g.
System.out.println(Jsoup.clean(document.body().toString(), Whitelist.none()));
That would return:
CuCl2 blabla title with no subtags

Adding text before and after a link with Jsoup

I've just started learning Jsoup from the cookbook on their website, but I'm a bit stuck with adding text to an element I've parsed.
try {
    Document doc = Jsoup.connect(url).get();
    Element add = doc.prependText("a href");
    Elements links = add.select("a[href]");
    for (Element link : links) {
        PrintStream sb = System.out.format("%n %s", link.attr("abs:href"));
        System.out.print("<br>");
    }
} catch (Exception e) {
    System.out.print("error --> " + e);
}
For example, when I run it against google.com I get:
http://www.google.ie/imghp?hl=en&tab=wi<br>
http://maps.google.ie/maps?hl=en&tab=wl<br>
https://play.google.com/?hl=en&tab=w8<br>
But I really want
<a href> http://www.google.ie/imghp?hl=en&tab=wi<br></a>
<a href> http://maps.google.ie/maps?hl=en&tab=wl<br></a>
<a href> https://play.google.com/?hl=en&tab=w8<br></a>
With this code I've gotten all the links off the page, but I also want to get the <a href> and </a> tags so I can then create my own webpage. I've tried adding a string and using prependText but just can't seem to get it right.
Thanks
With link.attr(...) you only get the attribute value.
But you need the whole tag:
Document doc = Jsoup.connect(...).get();
for (Element e : doc.select("a[href]")) { // Select all 'a' tags with an 'href' attribute
    String wholeTag = e.toString(); // Get the element as a string, tags included
    /* Now you can use the HTML - in this example for a simple output */
    System.out.println(wholeTag);
}

jSoup get title from img tag

I have a scenario where I need to pull the title from an img tag like the one below.
<img alt="Bear" border="0" src="/images/teddy/5433.gif" title="Bear"/>
I was able to get the image URL. But how do I get the title from the img tag?
In the example above, title="Bear". I want to extract this value.
Use Element#attr() to extract arbitrary element attributes.
Element img = selectItSomehow();
String title = img.attr("title");
// ...
See also:
Jsoup Cookbook - Extract attributes, text, and HTML from elements
String html = "<img alt='Bear' border='0' src='/images/teddy/5433.gif' title='Bear'/>";
Document doc = Jsoup.parse(html);
Element e = doc.select("img[title]").first();
String title = e.attr("title");
System.out.println(title);

HTML Parsing using Jsoup.Jar

Document doc = Jsoup.connect("http://reviews.opentable.com/0938/9/reviews.htm").get();
Element part = doc.body();
Elements parts = part.getElementsByTag("span");
String attValue;
String html;
for (Element ent : parts) {
    if (ent.hasAttr("class")) {
        attValue = ent.attr("class");
        if (attValue=="BVRRReviewText description") {
            System.out.println("\n");
            html = ent.text();
            System.out.println(html);
        }
    }
}
I am using Jsoup.jar for the above program.
I am accessing the webpage, and my aim is to print the text found within the tag <span class="BVRRReviewText description">text</span>.
But nothing is printed as output, and nothing is ever assigned to the String html in the program. attValue, however, does receive all the class attribute values of the span tags.
Where might I have gone wrong? Please advise.
if(attValue=="BVRRReviewText description")
should be
if(attValue.equals("...")) surely?
This is Java, not Javascript.
Change
attValue=="BVRRReviewText description"
for
attValue.matches("...")
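The difference between the two comparisons can be seen without Jsoup. The sketch below is plain Java; the class name ClassAttributeCompare and the helper isReviewSpan are made up for illustration, and new String(...) stands in for a value read from a parsed document, which is a separate object from the literal in the source code. Note that String.matches interprets its argument as a regular expression, so it only behaves like equals when the text contains no regex metacharacters, as is the case here.

```java
class ClassAttributeCompare {

    // Content comparison of the whole class attribute value.
    // Putting the literal first also makes the check safe for null input.
    static boolean isReviewSpan(String classAttr) {
        return "BVRRReviewText description".equals(classAttr);
    }

    public static void main(String[] args) {
        // A distinct String object with the same characters as the literal
        String attValue = new String("BVRRReviewText description");

        // == compares object references, not characters
        System.out.println(attValue == "BVRRReviewText description"); // prints false

        // equals compares the characters
        System.out.println(isReviewSpan(attValue)); // prints true
    }
}
```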
