jsoup spaces error and td classname fail - java

I am trying to select, using Jsoup, the table cell in the following HTML snippet:
<td class="team team-a ">
MyTeam
</td>
The problem is that, for some reason, Jsoup doesn't seem to pick up the td with class="team team-a ".
I think the spaces are the problem.
I tried these selectors:
Elements team = document.select("td[class=team team-b ]");
Elements vendegCsapat_e = document.select("td.team team-b ");
Neither of them returns anything.
What could be the problem in the above code? Thanks.

Your CSS selector is not correct. To select an element that has several classes, chain the class names without spaces:
Elements team = document.select("td.team.team-b");
(The snippet in your question shows team-a rather than team-b, so for that cell the selector would be td.team.team-a.)
In case you'd like to know what your original meant, td.team team-b reads in English as "select a team-b tag that descends from a td tag with the class team". team-b is not a valid HTML tag, so Jsoup did not select anything.
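As a quick check of the chained-class selector, here is a minimal sketch, assuming jsoup is on the classpath. The class name TeamSelectorDemo, the method teamNames, and the second cell are made up for illustration. The cells are wrapped in a table because jsoup's HTML parser discards a bare td tag that appears outside one.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TeamSelectorDemo {

    // Returns the text of every <td> that carries both the "team" and "team-a" classes.
    static List<String> teamNames(String html) {
        Document document = Jsoup.parse(html);
        // Chained class selectors (no spaces) require all listed classes on the same element.
        return document.select("td.team.team-a").eachText();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup; the trailing space in the class attribute is harmless.
        String html = "<table><tr><td class=\"team team-a \">MyTeam</td>"
                + "<td class=\"team team-b \">OtherTeam</td></tr></table>";
        System.out.println(teamNames(html));
    }
}
```

The trailing space inside the class attribute does not affect matching, since class names are compared token by token.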

Related

Jsoup Trouble extracting formatting from html tables

<tr>
<th align="LEFT" bgcolor="GREY"> <span class="smallfont">Higher-order Theorems</span>
</th><th bgcolor="PINK"> <em><a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.2">Satallax</a><br><span class="xxsmallfont">3.2</span></em>
</th><th bgcolor="SKYBLUE"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.3">Satallax</a><br><span class="xxsmallfont">3.3</span>
</th><th bgcolor="LIME"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Leo-III---1.3">Leo‑III</a><br><span class="xxsmallfont">1.3</span>
</th><th bgcolor="YELLOW"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#LEO-II---1.7.0">LEO‑II</a><br><span class="xxsmallfont">1.7.0</span>
</th></tr>
Let's say I want to extract bgcolor, align, and the text contained in the span. For the first cell, for example, that would be GREY, LEFT, Higher-order Theorems.
At the very least I want to extract bgcolor, but ideally all three. How would I do so?
To extract just the bgcolor, I've tried doc.select("tr:contains([bgcolor]"), doc.select(th, [bgcolor), doc.select([bgcolor]), doc.select(tr:containsdata(bgcolor) , as well as doc.select([style]), and all of them either returned no output or a parse error. I can extract the text in the span just fine; the problem is also extracting bgcolor and align.
You just need to parse the HTML you want to scrape with jsoup, select the th elements, and read the attributes you want with the attr method of each Element; that gives you the value of the attribute for every th tag in the HTML. To also retrieve the text between the span tags, select the nested span inside the th and call .text() on it.
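// html holds the markup above; it must include the enclosing <table> tags,
// otherwise jsoup's HTML parser drops the bare <tr>/<th> tags
Document document = Jsoup.parse(html);
System.out.println(document);
Elements elements = document.select("tr > th");
for (Element element : elements) {
    // attr returns an empty string when the attribute is absent
    String align = element.attr("align");
    String color = element.attr("bgcolor");
    // combined text of every span nested in this th
    String spanText = element.select("span").text();
    System.out.println("Align is " + align +
            "\nBackground Color is " + color +
            "\nSpan Text is " + spanText);
}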
For any further information feel free to ask me! Hope this helped you!
Updated answer, in reply to the comment:
To do that, you'll need to use this line inside the for-each loop:
String fullText = element.text();
That way you get all the text contained between the selected Element's tags, but you should look at this blog and adapt your query to it. I guess you will also need to check whether the String is empty, and run a separate query for each possible case, using if statements.
That implies having one query for this structure: tr > th > span, another for this one: tr > th > em, and another for: tr > th.
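For the bgcolor part of the question specifically, an attribute selector may be enough. The following is a minimal sketch, assuming jsoup is on the classpath; the class name HeaderColors, the method backgroundColors, and the shortened sample markup are made up for illustration. Note that the selector is passed as a quoted Java string, which several of the attempts in the question left out.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HeaderColors {

    // Collects the bgcolor value of every <th> that has that attribute.
    static List<String> backgroundColors(String html) {
        Document doc = Jsoup.parse(html);
        // "th[bgcolor]" matches <th> elements that carry a bgcolor attribute.
        return doc.select("th[bgcolor]").eachAttr("bgcolor");
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the table row from the question.
        String html = "<table><tr>"
                + "<th align=\"LEFT\" bgcolor=\"GREY\"><span class=\"smallfont\">Higher-order Theorems</span></th>"
                + "<th bgcolor=\"PINK\"><em>Satallax<br><span class=\"xxsmallfont\">3.2</span></em></th>"
                + "</tr></table>";
        System.out.println(backgroundColors(html));
    }
}
```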

In JSoup I am trying to get text from a span that has multiple classes with strange names the compiler is not liking

Here is my code:
text = text.toUpperCase();
Document doc = Jsoup.connect("https://finance.yahoo.com/quote/" + text + "?p=" + text).userAgent("Safari").get();
Element temp = doc.selectFirst("span.Trsdu(0.3s).Fw(b).Fz(36px).Mb(-4px).D(ib)");
System.out.println(temp);
Here is the span I am trying to get:
<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="35">1,119.50</span>
I am new to JSoup, so if I am missing something obvious please let me know what I should do.
This may not be the answer but I can't comment yet since I don't have 50 rep points but I'd still like to help so I'll post it here.
The parentheses and dots inside these class names clash with CSS selector syntax, so Jsoup cannot parse a selector like span.Trsdu(0.3s); I have run into this kind of problem as well.
For this particular example, I think you can use the data attribute data-reactid to locate the element instead. First you would select all spans, then filter by the attribute, something like this: doc.select("span").select("[data-reactid=35]")
Hope that helps.
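A minimal sketch of the attribute-based lookup, assuming jsoup is on the classpath and run against the span from the question rather than the live page (whether the live page still carries this data-reactid value is not something the sketch can guarantee). The class name QuotePrice and the method priceText are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class QuotePrice {

    // Returns the text of the span whose data-reactid is 35, or null if there is none.
    static String priceText(String html) {
        Document doc = Jsoup.parse(html);
        // The attribute selector sidesteps the parentheses in the class names.
        Element span = doc.selectFirst("span[data-reactid=35]");
        return span == null ? null : span.text();
    }

    public static void main(String[] args) {
        String html = "<span class=\"Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)\" "
                + "data-reactid=\"35\">1,119.50</span>";
        System.out.println(priceText(html));
    }
}
```

Checking for null matters here because selectFirst returns null when nothing matches.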

Get title attribute with jsoup

I have a problem with parsing a website.
The website contains a snippet like this:
<td class="school">
<abbr title data-original-title="Highschool">...</abbr>
</td>
How can I get the title (Highschool)?
I'm programming with jsoup and java.
Thanks for your help.
Try reading the jsoup cookbook.
First you should get abbr element, and then its data-original-title attribute:
Element abbrElement = doc.select("abbr").first();
String originalTitle = abbrElement.attr("data-original-title");
Of course you should make sure that you select the right abbr element. The code above selects the first one appearing in the document.
This can be done relatively easily using jsoup's DOM methods or selectors on a parsed document. Check out these links for reference:
DOM navigation
Extracting attributes
// assuming the elements with class "school" contain the abbr tag that holds the title
// (Elements has no getElementsByTag method, so select is used on the collection)
Elements titles = doc.getElementsByClass("school").select("abbr");
for (Element t : titles) {
    String title = t.attr("data-original-title");
    // do something with the title
}
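Both answers can also be combined into a single selector. This is a minimal sketch, assuming jsoup is on the classpath; the class name SchoolTitle, the method originalTitles, and the wrapping table markup are made up for illustration.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SchoolTitle {

    // Returns the data-original-title of every <abbr> inside a <td class="school">.
    static List<String> originalTitles(String html) {
        Document doc = Jsoup.parse(html);
        // Restricting to abbr[data-original-title] skips abbr tags without the attribute.
        return doc.select("td.school abbr[data-original-title]")
                .eachAttr("data-original-title");
    }

    public static void main(String[] args) {
        // The cell is wrapped in a table so the parser keeps the <td>.
        String html = "<table><tr><td class=\"school\">"
                + "<abbr title data-original-title=\"Highschool\">...</abbr>"
                + "</td></tr></table>";
        System.out.println(originalTitles(html));
    }
}
```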

Get several elements with the same class name with JSOUP

Is there a way to get the HTML from several elements that share the same class name with the Java library JSoup?
For example:
<div class="div_idalgo_content_result_date_match_local">
blablabla
</div>
<div class="div_idalgo_content_result_date_match_local">
123456789
</div>
I'd like to get blablabla in one String and 123456789 in another.
I hope my question is understandable.
This can be done in several different ways.
If you want to select the divs with the class name above, you can simply use the following:
Elements div = doc.select("div.div_idalgo_content_result_date_match_local");
This gives you a collection of Element objects that you can iterate over.
If you then want only the first one, you can use the first() method on the collection or the :eq(0) pseudo-selector:
Element firstDiv = div.first();
OR
Elements div = doc.select("div.div_idalgo_content_result_date_match_local:eq(0)");
Note that with the second method you select from the document, while with the first you select from the collection of elements. Also note that :eq(n) tests an element's index among its siblings, so it returns the first div only when that div is also the first child of its parent. You can of course change the value in :eq(0) to match a different element. There are many other useful selectors; I have included a link to them at the end of the answer.
The following code will split your div's into two:
Elements div = doc.select("div.div_idalgo_content_result_date_match_local");
Element firstDiv = div.first();
Element secondDiv = div.get(1);
System.out.println("This is the first div: " + firstDiv.text());
System.out.println("This is the second div: " + secondDiv.text());
JSoup Cookbook - Selector syntax
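If the number of matching divs is not known in advance, collecting every text into a list avoids indexing with get(1). A minimal sketch, assuming jsoup is on the classpath; the class name MatchDates and the method divTexts are made up for illustration.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MatchDates {

    // Returns the text of each div with the given class, one list entry per div.
    static List<String> divTexts(String html) {
        Document doc = Jsoup.parse(html);
        // eachText() skips elements that contain no text.
        return doc.select("div.div_idalgo_content_result_date_match_local").eachText();
    }

    public static void main(String[] args) {
        String html = "<div class=\"div_idalgo_content_result_date_match_local\">blablabla</div>"
                + "<div class=\"div_idalgo_content_result_date_match_local\">123456789</div>";
        for (String text : divTexts(html)) {
            System.out.println(text);
        }
    }
}
```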

Selenium Firefox/Web-Driver can't find element by XPath

I'm having trouble reading a link inside a div.
Ok, here's what the div looks like:
<div id="AjaxStream" style="clear: both">
<a target="_blank" href="http://www.something.com/">
<img height="370" width="752" border="4" usemap="#Link" src="somefile.png">
</a>
</div>
The following code, to find the div works perfectly fine.
(I tried element.getAttribute("id") - which returned "AjaxStream")
WebElement element = driver.findElement(By.xpath("//html/body/div/div[2]/div/div[11]"));
And here is what's not working:
WebElement element = driver.findElement(By.xpath("//html/body/div/div[2]/div/div[11]/a"));
This should actually find the link element, but it doesn't. Any ideas?
Thanks in advance.
Edit:
Nevermind - I fixed it. The problem was that the element wasn't loaded. I added a Thread.sleep(1000) before trying to find the element - and now it works perfectly fine.
Try:
WebElement element = driver.findElement(By.xpath("//div[@id='AjaxStream']/a"));
String link = element.getAttribute("href");
Have a look at your XPath: to me it is unreadable. If someone comes back to it in a few months, will they be able to translate that XPath into the tag you're looking for? A better solution would be to add an id attribute to the tag you are interested in, and find it by that id.
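Since the cause turned out to be that the element had not loaded yet, an explicit wait is a common alternative to a fixed Thread.sleep(1000). The following is a minimal sketch, assuming Selenium 4 (where WebDriverWait takes a Duration); the class name AjaxLink, the method waitForHref, and the 10-second timeout are made up for illustration.

```java
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class AjaxLink {

    // Locator for the link directly inside the div with id="AjaxStream".
    static final By AJAX_LINK = By.xpath("//div[@id='AjaxStream']/a");

    // Polls until the link is present in the DOM (up to 10 seconds), then returns its href.
    static String waitForHref(WebDriver driver) {
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
        WebElement link = wait.until(ExpectedConditions.presenceOfElementLocated(AJAX_LINK));
        return link.getAttribute("href");
    }
}
```

Unlike a fixed sleep, the wait returns as soon as the element appears and throws a TimeoutException if it never does.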
