Number of styled elements in an HTML using JSoup

How can I count the number of all styled elements in an HTML using JSoup?
If the document object is doc, I do not mean this:
doc.select("*[style]")
That only selects elements that have a style attribute. I want the number of elements that have a style applied to them in any way, for example through CSS rules or a style block in the header.

You can use the *[style] selector and call the Elements.size() method. Note that this counts only elements that carry an inline style attribute, e.g.
final String html = "<html><body><p>test</p><p style=\"color:red\"></p><span>aa</span><span style=\"font-size:10pt\">adasd</span></body></html>";
final Document doc = Jsoup.parse(html);
final int count = doc.select("*[style]").size();
System.out.println("Count = " + count);
Output
Count = 2
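
The snippet above counts only inline style attributes. jsoup parses HTML but does not evaluate CSS, so it has no built-in notion of which elements end up styled. What follows is a rough sketch, assuming an element counts as "styled" when it either has a style attribute or matches the selector of a rule inside a style block of the same document. The names StyledElementCounter and countStyled and the regex-based rule splitting are illustrative assumptions. External stylesheets, @import, inheritance, and selectors that jsoup cannot parse are ignored.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StyledElementCounter {
    // Naive assumption: a rule looks like "selector { declarations }" with no nested braces.
    private static final Pattern RULE = Pattern.compile("([^{}]+)\\{[^{}]*\\}");

    static int countStyled(String html) {
        Document doc = Jsoup.parse(html);
        // Identity-based set, so an element matched by several rules is counted once.
        Set<Element> styled = Collections.newSetFromMap(new IdentityHashMap<Element, Boolean>());
        styled.addAll(doc.select("[style]"));
        for (Element styleTag : doc.select("style")) {
            Matcher rule = RULE.matcher(styleTag.data());
            while (rule.find()) {
                try {
                    styled.addAll(doc.select(rule.group(1).trim()));
                } catch (IllegalStateException | IllegalArgumentException e) {
                    // Selector that jsoup cannot parse (for example :hover); skipped in this sketch.
                }
            }
        }
        return styled.size();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>span { color: blue; }</style></head><body>"
                + "<p>test</p><p style=\"color:red\"></p><span>aa</span>"
                + "<span style=\"font-size:10pt\">adasd</span></body></html>";
        System.out.println("Count = " + countStyled(html));
    }
}
```

With the sample in main, the two inline-styled elements plus the one extra span matched by the span rule are counted. For styles that come from linked stylesheets or scripts, a browser engine would be needed rather than an HTML parser.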

Related

Android - how to parse html by jsoup and fill into the arraylist?

I want to read the data from this HTML page:
http://jadvalbaz.blog.ir/post/%D8%B1%D8%A7%D9%87%D9%86%D9%85%D8%A7%DB%8C-%D8%AD%D9%84-%D8%AC%D8%AF%D9%88%D9%84-%D8%AD%D8%B1%D9%81-%D8%B0
If you look at the page source, you will see:
ذات اریه (پنومونی- سینه پهلو)ذر (مورچه ریز)ذرع (مقیاس طول)ذره ای بنیادی از رده هیبرونها که بار الکتریکی ندارد (لاندا)ذره منفی اتم (الکترون)ذریه (نسل)ذل (خواری)ذم (نکوهش)ذهاب (رفتن)ذی (صاحب)
My words are separated by <br> tags. I want to read each word into an ArrayList, that is, I want to skip the <br> tags and read only the words.
Here is my code:
Document document = Jsoup.connect(url).get();
for (Element span : document.select("?").select("?")) {
title = span.toString();
name.add(title);
}
How can I read them, and what should I put in place of the question marks?
Any suggestions?

Edit the CSS of your template and define a class for your words, then use the Element.select(String selector) and Elements.select(String selector) methods.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element masthead = doc.select("p.words").first(); // p with class=words
For more information about extracting data with these methods, see:
Use selector-syntax to find elements
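
An alternative that does not require changing the template: text separated only by <br> tags appears as separate text nodes of the parent element, and Element.textNodes() returns them. Here is a minimal sketch, assuming the words sit directly inside one container element. The selector p.words, the sample HTML, and the names BrSeparatedWords and readWords are placeholders for illustration and are not taken from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

class BrSeparatedWords {
    // Collects the direct text nodes of the first element matching containerSelector.
    static List<String> readWords(Document doc, String containerSelector) {
        List<String> words = new ArrayList<>();
        Element container = doc.select(containerSelector).first();
        if (container == null) {
            return words; // selector matched nothing
        }
        for (TextNode node : container.textNodes()) {
            String word = node.text().trim();
            if (!word.isEmpty()) {
                words.add(word); // the <br> elements themselves are not text nodes, so they drop out
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical markup; on the live page use Jsoup.connect(url).get() instead.
        Document doc = Jsoup.parse("<p class=\"words\">alpha (one)<br>beta (two)<br>gamma</p>");
        System.out.println(readWords(doc, "p.words"));
    }
}
```

The selector has to be adapted to whatever element actually wraps the words in the page source.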

Elements returns empty string

I am trying to scrape prices from a website with jsoup, but I only get an empty string.
I've tested my code with jSoup Online and I expect <meta itemprop="price" content="6,99"> to be printed when I use the following code:
Document doc = Jsoup.connect(URL).get();
Elements meta = doc.select("meta[itemprop=price]");
System.out.println("meta: " + meta.text());
price = meta.attr("content");
However, I just get an empty string and no error. What am I doing wrong here?
For those interested, I am trying to scrape the price from this page.

Try this:
Document doc = Jsoup.connect(URL).get();
Element meta = doc.select("meta[itemprop=price]").first();
System.out.println("meta: " + meta.attr("content")); // a <meta> tag has no text, so text() would be empty
String price = meta.attr("content");

The webserver you are trying to access needs another user agent string to respond with the info you want. Try this:
Document doc = Jsoup.connect(URL).userAgent("Mozilla/5.0").get();
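
Independently of the user agent, a meta element has no text content, so text() is expected to return an empty string even when the selector matches. The value lives in the content attribute. Here is a small offline sketch of that difference, using a hard-coded HTML string in place of the real page. The names MetaPriceExample and readPrice are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class MetaPriceExample {
    // Returns the content attribute of the first price meta tag, or null if there is none.
    static String readPrice(String html) {
        Document doc = Jsoup.parse(html);
        Element meta = doc.select("meta[itemprop=price]").first();
        return meta == null ? null : meta.attr("content");
    }

    public static void main(String[] args) {
        String html = "<html><head><meta itemprop=\"price\" content=\"6,99\"></head><body></body></html>";
        Document doc = Jsoup.parse(html);
        // text() is empty because <meta> has no child text nodes.
        System.out.println("text:  '" + doc.select("meta[itemprop=price]").text() + "'");
        System.out.println("price: " + readPrice(html));
    }
}
```

If readPrice returns null for the live page, the server probably sent different markup than the browser sees, which is where the user agent suggestion comes in.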

get src attribute inside div tag jsoup

I am trying to parse the HTML but I am getting a NullPointerException. I want to extract the image URI from the HTML below.
String html = "<div class=\"thumb-box thumb-160\"><a class=\"mimg\" data-id=\"1394085169856_6744\" href=\"#\"><img class=\"thumb\" src=\"http://i.ytimg.com/vi/u7deClndzQw/hqdefault.jpg\" style=\"top: -15px;\"><span class=\"btn\"></span></a></div>";
Document document = Jsoup.parse(html);
Element element = document.select("div.thumb-box thumb-160").first();
System.out.println(element.select("img").attr("src"));

The problem is in this line:
Element element = document.select("div.thumb-box thumb-160").first();
You have to put a . (dot) before every class name:
Element element = document.select("div.thumb-box.thumb-160").first();
Besides, it is rather straightforward to select like this:
Element element = document.select("div.thumb-box.thumb-160:eq(0) a").first();
This would get you the anchor element directly.
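
In the original selector, the space turns thumb-160 into a descendant tag name instead of a second class. Nothing matches, first() returns null, and the following call throws the NullPointerException. Here is a short sketch of the corrected lookup with a null check, using the HTML from the question. The names ThumbnailSrc and firstImageSrc are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ThumbnailSrc {
    // Returns the src of the first <img> inside div.thumb-box.thumb-160, or null if absent.
    static String firstImageSrc(String html) {
        Document document = Jsoup.parse(html);
        Element img = document.select("div.thumb-box.thumb-160 img").first();
        return img == null ? null : img.attr("src");
    }

    public static void main(String[] args) {
        String html = "<div class=\"thumb-box thumb-160\"><a class=\"mimg\" data-id=\"1394085169856_6744\" href=\"#\">"
                + "<img class=\"thumb\" src=\"http://i.ytimg.com/vi/u7deClndzQw/hqdefault.jpg\" style=\"top: -15px;\">"
                + "<span class=\"btn\"></span></a></div>";
        System.out.println(firstImageSrc(html));
    }
}
```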

Remove disabled attribute of INPUT tag with Jsoup?

I have an HTML string, where I have a text box, and what I want is to remove the disabled attribute by its ID.
String baseHtml = "<div id='stylized' class='myform'>"
+ "<input id='txt_question' disabled='disabled' name='preg' type='text' style='width:150px;'>"
+ "</div>";
Document doc = Jsoup.parse(baseHtml);
Elements elements = doc.getElementById("txt_question").select("input");
elements.remove();
elements = doc.select("input");
System.out.println(doc.outerHtml());
The problem is that it removes the whole INPUT tag, whereas I only want to remove the disabled attribute.
Can you help me, please?

Elements#select supports CSS selectors so you can do it as follows:
Elements elementTxtQuestion = doc.select("#txt_question"); // selects element with Id 'txt_question'
elementTxtQuestion.removeAttr("disabled"); // removes attribute 'disabled'
You can find more information here: Use selector-syntax to find elements.
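
Put together with the HTML from the question, this could look like the sketch below. It uses getElementById, which returns a single Element (or null), and checks that the input itself survives. The names RemoveDisabledAttribute and enableInput are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class RemoveDisabledAttribute {
    // Removes only the disabled attribute from the element with the given id and returns the body HTML.
    static String enableInput(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element input = doc.getElementById(id);
        if (input != null) {
            input.removeAttr("disabled"); // the element stays in the document
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String baseHtml = "<div id='stylized' class='myform'>"
                + "<input id='txt_question' disabled='disabled' name='preg' type='text' style='width:150px;'>"
                + "</div>";
        System.out.println(enableInput(baseHtml, "txt_question"));
    }
}
```

Calling remove() on the selected elements, as in the question, detaches the elements themselves from the document, which is why the whole input disappeared.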

jsoup - extract text from wikipedia article

I'm writing some Java code to perform NLP tasks on texts from Wikipedia. How can I use JSoup to extract all the text of a Wikipedia article (for example all the text in http://en.wikipedia.org/wiki/Boston)?

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").get();
Element contentDiv = doc.select("div[id=content]").first();
contentDiv.toString(); // The result
You retrieve formatted content this way, of course. If you want "raw" content you can filter the result with Jsoup.clean or use the call contentDiv.text().
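
To make the difference between the formatted and the "raw" result concrete, here is a small offline sketch. It runs on a hard-coded fragment that only imitates an article page, so the markup, the class name ArticleText, and the method plainText are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ArticleText {
    // Returns the text of the first element matching selector, with all tags stripped.
    static String plainText(String html, String selector) {
        Document doc = Jsoup.parse(html);
        Element content = doc.select(selector).first();
        return content == null ? "" : content.text();
    }

    public static void main(String[] args) {
        // Hypothetical fragment; a real article would be fetched with Jsoup.connect(...).get().
        String html = "<div id=\"content\"><h1>Boston</h1>"
                + "<p>Boston is the <b>capital</b> of Massachusetts.</p></div>";
        Document doc = Jsoup.parse(html);
        System.out.println(doc.select("div[id=content]").first()); // markup included
        System.out.println(plainText(html, "div[id=content]"));    // text only
    }
}
```

For NLP input the text-only form is usually the more convenient starting point, since no tag stripping is needed afterwards.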

Document doc = Jsoup.connect(url).get();
// Select every paragraph inside the article body and print its text.
Elements paragraphs = doc.select(".mw-content-ltr p");
for (Element p : paragraphs) {
    System.out.println(p.text());
}

// "#iamID" and ".iamCLASS" are placeholders for the id or class of the tag you want.
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get();
Element iamcontaningIDofintendedTAG = doc.select("#iamID").first(); // first match, or null if none
System.out.println(iamcontaningIDofintendedTAG);
OR
Elements iamcontaningCLASSofintendedTAG = doc.select(".iamCLASS"); // all matches
System.out.println(iamcontaningCLASSofintendedTAG);
