Number of styled elements in an HTML using JSoup - java

How can I count all styled elements in an HTML document using JSoup?
If the document object is doc, I do not mean this:
doc.select("*[style]")
That only selects elements that have a style attribute. I want the number of elements that have any style applied to them, whether through CSS rules or a style block in the header.

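You can count the elements that carry an inline style attribute with the *[style] selector and the Elements.size() method. Note that jsoup only parses the HTML and does not apply stylesheets, so this does not count elements styled through CSS rules. For example: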
final String html = "<html><body><p>test</p><p style=\"color:red\"></p><span>aa</span><span style=\"font-size:10pt\">adasd</span></body></html>";
final Document doc = Jsoup.parse(html);
final int count = doc.select("*[style]").size();
System.out.println("Count = " + count);
Output
Count = 2

Related

Android - how to parse html by jsoup and fill into the arraylist?

I want to read the data from this HTML page:
http://jadvalbaz.blog.ir/post/%D8%B1%D8%A7%D9%87%D9%86%D9%85%D8%A7%DB%8C-%D8%AD%D9%84-%D8%AC%D8%AF%D9%88%D9%84-%D8%AD%D8%B1%D9%81-%D8%B0
If you look at the page source, it contains:
ذات اریه (پنومونی- سینه پهلو)ذر (مورچه ریز)ذرع (مقیاس طول)ذره ای بنیادی از رده هیبرونها که بار الکتریکی ندارد (لاندا)ذره منفی اتم (الکترون)ذریه (نسل)ذل (خواری)ذم (نکوهش)ذهاب (رفتن)ذی (صاحب)
My words are separated by <br>. I want to read each word into an ArrayList, that is, drop the <br> tags and keep the words.
Here is my code:
Document document = Jsoup.connect(url).get();
for (Element span : document.select("?").select("?")) {
title = span.toString();
name.add(title);
}
How do I read them, and what should I put in place of the question marks?
Any suggestions?
Edit the CSS of your template and define a class for your words, then use the Element.select(String selector) and Elements.select(String selector) methods.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element masthead = doc.select("p.words").first(); // p with class=words
See the following link for more information about extracting data with these methods:
Use selector-syntax to find elements
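If the template cannot be changed, another option is to split the markup on the <br> tags directly. The following is only a minimal sketch using plain Java string handling rather than a jsoup call: it assumes the HTML of the enclosing element has already been obtained as a String (for example from jsoup's Element.html()), and the class name BrSplitter and the method splitOnBr are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BrSplitter {

    // Splits an HTML fragment on <br>, <br/> or <br /> (case-insensitive)
    // and returns the trimmed, non-empty pieces in document order.
    public static List<String> splitOnBr(String fragmentHtml) {
        List<String> words = new ArrayList<>();
        for (String piece : fragmentHtml.split("(?i)<br\\s*/?>")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                words.add(trimmed);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical sample fragment; real input would come from the page.
        String fragment = "alpha (one)<br>beta (two)<br/>gamma (three)<br />";
        for (String word : splitOnBr(fragment)) {
            System.out.println(word);
        }
    }
}
```

This only separates the pieces; any other tags or HTML entities inside a piece are left untouched, so it fits best when the words are plain text between the <br> tags.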

Elements returns empty string

I am trying to scrape prices from a website with jsoup, but I only get an empty string.
I've tested my code with jsoup online, and I expect <meta itemprop="price" content="6,99"> to be printed when I use the following code:
Document doc = Jsoup.connect(URL).get();
Elements meta = doc.select("meta[itemprop=price]");
System.out.println("meta: " + meta.text());
price = meta.attr("content");
However, I just get an empty string and no error. What am I doing wrong here?
For those interested, I am trying to scrape the price of this page.
Try this:
Document doc = Jsoup.connect(URL).get();
Element meta = doc.select("meta[itemprop=price]").first();
System.out.println("meta: " + meta.text());
String price = meta.attr("content");
The web server you are trying to access needs a different user agent string before it responds with the info you want. Try this:
Document doc = Jsoup.connect(URL).userAgent("Mozilla/5.0").get();

get src attribute inside div tag jsoup

I am trying to parse this HTML, but I get a NullPointerException. I want to extract the image URI from the HTML below.
String html = "<div class=\"thumb-box thumb-160\"><a class=\"mimg\" data-id=\"1394085169856_6744\" href=\"#\"><img class=\"thumb\" src=\"http://i.ytimg.com/vi/u7deClndzQw/hqdefault.jpg\" style=\"top: -15px;\"><span class=\"btn\"></span></a></div>";
Document document = Jsoup.parse(html);
Element element = document.select("div.thumb-box thumb-160").first();
System.out.println(element.select("img").attr("src"));
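The problem is in this line:
Element element = document.select("div.thumb-box thumb-160").first();
The space turns this into a descendant selector that looks for a <thumb-160> element inside div.thumb-box, so nothing matches and first() returns null. You have to put a . (dot) before every class name:
Element element = document.select("div.thumb-box.thumb-160").first();
Alternatively, you can select the anchor in one step:
Element element = document.select("div.thumb-box.thumb-160:eq(0) a").first();
This returns the anchor element without a further lookup.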

Remove disabled attribute of INPUT tag with Jsoup?

I have an HTML string, where I have a text box, and what I want is to remove the disabled attribute by its ID.
String baseHtml = "<div id='stylized' class='myform'>"
+ "<input id='txt_question' disabled='disabled' name='preg' type='text' style='width:150px;'>"
+ "</div>";
Document doc = Jsoup.parse(baseHtml);
Elements elements = doc.getElementById("txt_question").select("input");
elements.remove();
elements = doc.select("input");
System.out.println(doc.outerHtml());
The problem is that this removes the whole INPUT tag, whereas I only want to remove the disabled attribute.
Can you help me, please?
Elements#select supports CSS selectors so you can do it as follows:
Elements elementTxtQuestion = doc.select("#txt_question"); // selects element with Id 'txt_question'
elementTxtQuestion.removeAttr("disabled"); // removes attribute 'disabled'
You can find more information here: Use selector-syntax to find elements.

jsoup - extract text from wikipedia article

I'm writing some Java code to perform NLP tasks on texts from Wikipedia. How can I use JSoup to extract all the text of a Wikipedia article (for example all the text in http://en.wikipedia.org/wiki/Boston)?
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").get();
Element contentDiv = doc.select("div[id=content]").first();
contentDiv.toString(); // The result
This retrieves the formatted content, markup included. If you want the "raw" text, you can filter the result with Jsoup.clean or call contentDiv.text().
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
// Print the text of every paragraph in the article body
for (Element p : paragraphs) {
    System.out.println(p.text());
}
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get();
Element iamcontaningIDofintendedTAG = doc.select("#iamID").first(); // first element with id "iamID"
System.out.println(iamcontaningIDofintendedTAG.toString());
OR
Elements iamcontaningCLASSofintendedTAG = doc.select(".iamCLASS"); // all elements with class "iamCLASS"
System.out.println(iamcontaningCLASSofintendedTAG.toString());
