Jsoup Trouble extracting formatting from html tables - java

<tr>
<th align="LEFT" bgcolor="GREY"> <span class="smallfont">Higher-order
Theorems</span>
</th><th bgcolor="PINK"> <em><a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.2">Satallax</a><br><span class="xxsmallfont">3.2</span></em>
</th><th bgcolor="SKYBLUE"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.3">Satallax</a><br><span class="xxsmallfont">3.3</span>
</th><th bgcolor="LIME"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Leo-III---1.3">Leo‑III</a><br><span class="xxsmallfont">1.3</span>
</th><th bgcolor="YELLOW"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#LEO-II---1.7.0">LEO‑II</a><br><span class="xxsmallfont">1.7.0</span>
</th></tr>
Let's say I want to extract bgcolor, align, and the text inside the span. For the first cell, for example, that would be GREY, LEFT, and Higher-order Theorems.
At the very least I need bgcolor, but ideally all three. How would I do that?
So far I have only attempted to extract bgcolor.
I've tried doc.select("tr:contains([bgcolor]"), doc.select(th, [bgcolor), doc.select([bgcolor]), doc.select(tr:containsdata(bgcolor) , as well as doc.select([style]), and all of them either returned no output or threw a parse error. I can extract the text in the span just fine; the problem is extracting bgcolor and align as well.

You just need to parse the HTML you want to scrape with jsoup, select the th elements, and read each attribute with the attr() method of Element, which returns the value of that attribute for each th. To also retrieve the text between the span tags, select the span nested in the th and call .text() on it.
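import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// html is a String holding the whole page. A bare <tr> parsed outside a
// <table> is discarded by the HTML parser, so keep the surrounding table tags.
Document document = Jsoup.parse(html);
System.out.println(document);
// Every header cell that is a direct child of a row
Elements elements = document.select("tr > th");
for (Element element : elements) {
    String align = element.attr("align"); // empty string if the attribute is absent
    String color = element.attr("bgcolor");
    String spanText = element.select("span").text();
    System.out.println("Align is " + align +
            "\nBackground Color is " + color +
            "\nSpan Text is " + spanText);
}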
For any further information feel free to ask me! Hope this helped you!
Updated Answer to comment:
To do that, use this line inside the for-each loop:
String fullText = element.text();
That way you get all the text contained in the selected element, but you should look up this blog and adapt your query to it. I guess you will also need to check whether the String is empty, and run a separate query for each possible case using if conditionals.
That means one query for the structure tr > th > span, another for tr > th > em, and another for tr > th.

Related

Split text within span if there is an img tag between them - selenium - java

I'm having a scenario where within a span tag I have two strings, separated by an img tag.
<span>
text
<img/>
text
</span>
When I look up this span with Selenium and XPath, I find it, but the getText() method of the span element returns "texttext". I want to get "text text".
driver.findElement(By.xpath("MY_XPATH_TO_FIND_THAT_SPAN")).getText();
My XPath is fine (I'm getting the right web element), but how can I get the string as described above? I want to insert a space wherever there is an img tag.
Will be glad for your help,
Thanks!
There is no direct way to do it using .getText(). You can use .getAttribute("innerHTML") and then you will need to replace whatever is between the two "text" strings (IMG, etc.) with a space.
Here's a simple example based on your HTML that will probably work.
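String s = driver.findElement(By.xpath("MY_XPATH_TO_FIND_THAT_SPAN")).getAttribute("innerHTML"); // e.g. text<img>text (inner HTML excludes the span tags themselves)
s = s.replaceAll("<img[^>]*>", " "); // matches both <img> and <img/>
System.out.println(s);
For compact markup such as <span>text<img>text</span> this prints
text text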
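The replacement step can be tried without a browser. The following is a minimal sketch, assuming the string passed in is whatever getAttribute("innerHTML") returned for the span. The names ImgSpacer and textWithSpaces are made up for illustration, and the sketch additionally collapses the line breaks that the markup in the question contains.

```java
import java.util.regex.Pattern;

class ImgSpacer {
    // Matches <img>, <img/>, <img src="..."> and similar, case-insensitively.
    private static final Pattern IMG_TAG = Pattern.compile("(?i)<img\\b[^>]*>");

    // Replaces every img tag with a space, then normalizes whitespace.
    static String textWithSpaces(String innerHtml) {
        String spaced = IMG_TAG.matcher(innerHtml).replaceAll(" ");
        // Collapse runs of whitespace (including newlines) and trim the ends.
        return spaced.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for the inner HTML of the span from the question.
        System.out.println(textWithSpaces("\ntext\n<img>\ntext\n")); // prints "text text"
    }
}
```

In a real test the argument would come from the WebElement; any other tags inside the span would be left untouched by this pattern.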
Alternatively, read the span's content with getAttribute("innerHTML"), split() it into lines, and print the two text parts with a space between them. Which indices hold the text depends on where the line breaks fall in the actual markup:
String my_string = driver.findElement(By.xpath("MY_XPATH_TO_FIND_THAT_SPAN")).getAttribute("innerHTML");
String[] stringParts = my_string.split("\n");
String partA = stringParts[0];
String partB = stringParts[2];
System.out.println(partA + " " + partB);

Get title attribute with jsoup

I have a problem with parsing a website.
The website contains a phrase like this:
<td class="school">
<abbr title data-original-title="Highschool">...</abbr>
</td>
How can I get the title (Highschool)?
I'm programming with jsoup and java.
Thanks for your help.
Try reading the jsoup cookbook.
First get the abbr element, and then read its data-original-title attribute:
Element abbrElement = doc.select("abbr").first();
String originalTitle = abbrElement.attr("data-original-title");
Of course, you should make sure that you select the right abbr element. The code above selects the first one appearing in the document.
This can be done relatively easily using jsoup's DOM methods or a selector on a parsed document. Check out these links for reference:
DOM navigation
Extracting attributes
// assuming the element with class "school" contains the abbr tag
Elements titles = doc.getElementsByClass("school").select("abbr");
for (Element t : titles) {
    String title = t.attr("data-original-title");
    // do something with the title
}

Get severals class same name with JSOUP

Is there a way to get the HTML of several elements that share the same class name, using the jsoup library for Java?
For example:
<div class="div_idalgo_content_result_date_match_local">
blablabla
</div>
<div class="div_idalgo_content_result_date_match_local">
123456789
</div>
I'd like to get blablabla in one String and 123456789 in another.
I hope my question is understandable.
This can be done in several different ways.
If you want to select the divs with the class name above, you can simply use the following:
Elements div = doc.select("div.div_idalgo_content_result_date_match_local");
This gives you a collection of Element objects that you can iterate over.
If you then want only the first one, you can use the :eq(0) pseudo-selector or the first() method.
Element firstDiv = div.first();
OR
Elements div = doc.select("div.div_idalgo_content_result_date_match_local:eq(0)");
Note that the second approach selects from the document, while the first selects from the collection of elements you already have. You can also change the index in :eq(0) to match a different element. There are many other useful selectors; I have included a link to them at the end of the answer.
The following code splits your divs into two:
Elements div = doc.select("div.div_idalgo_content_result_date_match_local");
Element firstDiv = div.first();
Element secondDiv = div.get(1);
System.out.println("This is the first div: " + firstDiv.text());
System.out.println("This is the second div: " + secondDiv.text());
JSoup Cookbook - Selector syntax

jsoup spaces error and td classname fail

I am trying to pick, using Jsoup, the paragraph inside the following HTML snippet:
<td class="team team-a ">
MyTeam
</td>
The problem is that, for some reason, jsoup doesn't seem to pick up the td with class="team team-a ".
I think the spaces are the problem.
I tried the following ...
Elements team = document.select("td[class=team team-b ]");
Elements vendegCsapat_e = document.select("td.team team-b ");
.. but neither of them works! :(
What could be the problem in the above code? Thanks.
Your CSS selector is not correct. To select an element that has several classes, chain the class names:
Elements team = document.select("td.team.team-b");
In case you'd like to know what your original meant, td.team team-b would be read in English as "select a tag team-b which descends from a tag td with a class of team". team-b is not a valid HTML tag, so jsoup did not select anything.
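The trailing space in class="team team-a " is harmless because the class attribute is a whitespace-separated list of names, and a chained selector such as td.team.team-a asks for an element carrying every listed name. Here is a rough sketch of that matching rule in plain Java. It illustrates the idea only and is not jsoup's actual implementation; ClassTokens and hasAllClasses are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ClassTokens {
    // True if the class attribute value contains every wanted class name.
    static boolean hasAllClasses(String classAttr, String... wanted) {
        // Split on any run of whitespace, so extra or trailing spaces do not matter.
        Set<String> present = new HashSet<>(Arrays.asList(classAttr.trim().split("\\s+")));
        return present.containsAll(Arrays.asList(wanted));
    }

    public static void main(String[] args) {
        // The cell from the question carries "team" and "team-a", not "team-b".
        System.out.println(hasAllClasses("team team-a ", "team", "team-a")); // true
        System.out.println(hasAllClasses("team team-a ", "team", "team-b")); // false
    }
}
```

By the same rule, the snippet in the question would match td.team.team-a, while td.team.team-b finds nothing in it.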

Selenium: Extract Text of a div with cssSelector in Java

I am writing a JUnit test for a webpage, using Selenium, and I am trying to verify that the expected text exists within a page. The code of the webpage I am testing looks like this:
<div id="recipient_div_3" class="label_spacer">
<label class="nodisplay" for="Recipient_nickname"> recipient field: reqd info </label>
<span id="Recipient_nickname_div_2" class="required-field"> *</span>
Recipient:
</div>
I want to compare what is expected with what is on the page, so I want to use
Assert.assertTrue(). I know that to get everything from the div, I can do
String element = driver.findElement(By.cssSelector("div[id='recipient_div_3']")).getText().replaceAll("\n", " ");
but this will return "reqd info * Recipient:"
Is there any way to just get the text from the div ("Recipient") using cssSelector, without the other tags?
You can't do this with a CSS selector, because CSS selectors don't have a fine-grained enough approach to express "the text node contained in the DIV but not its other contents". You can do that with an XPath locator, though:
driver.findElement(By.xpath("//div[@id='recipient_div_3']/text()")).getText()
That XPath expression will identify just the single text node that is a direct child of the DIV, rather than all the text contained within it and its child nodes.
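The expression itself can be checked outside the browser. WebDriver's findElement is built to return elements, so depending on the driver an expression ending in text() may be rejected rather than handed back as a WebElement. The sketch below therefore evaluates the same XPath with the JDK's javax.xml.xpath against a well-formed copy of the markup. DirectTextNodes and directText are hypothetical names, and the input is assumed to be valid XML, which real pages often are not.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DirectTextNodes {
    // Returns the text held directly by the div with the given id, ignoring child elements.
    static String directText(String xml, String divId) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//div[@id='" + divId + "']/text()";
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            // There can be several direct text nodes (whitespace between the children counts too).
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getNodeValue());
            }
            return sb.toString().replaceAll("\\s+", " ").trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expr, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div id=\"recipient_div_3\" class=\"label_spacer\">\n"
                + "<label class=\"nodisplay\" for=\"Recipient_nickname\"> recipient field: reqd info </label>\n"
                + "<span id=\"Recipient_nickname_div_2\" class=\"required-field\"> *</span>\n"
                + "Recipient:\n"
                + "</div>";
        System.out.println(directText(xml, "recipient_div_3")); // prints "Recipient:"
    }
}
```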
I am not sure whether it is possible with a single CSS locator, but you can get the text of the div, then get the text of its descendant nodes and subtract it. Something like this (the code wasn't tested):
String temp = "";
List<WebElement> tempElements = driver.findElements(By.cssSelector("div[id='recipient_div_3'] *"));
for (WebElement tempElement : tempElements) {
    temp += " " + tempElement.getText(); // append each descendant's text
}
String element = driver.findElement(By.cssSelector("div[id='recipient_div_3']")).getText().replaceAll("\n", " ").replace(temp, "");
This is for the case where you want to avoid XPath. With XPath you can do it like this:
//div[@id='recipient_div_3']/text()
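The subtraction idea is sensitive to exact spacing when all descendant texts are joined into one string first. A slightly more forgiving variant removes each descendant's text on its own. This is a sketch on plain strings, assuming the inputs are the getText() values of the div and of its descendants; OwnText and ownText are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class OwnText {
    // Removes the first occurrence of each child's text from the parent's text.
    static String ownText(String parentText, List<String> childTexts) {
        String result = parentText;
        for (String child : childTexts) {
            String trimmed = child.trim();
            if (!trimmed.isEmpty()) {
                // Pattern.quote keeps characters such as '*' from being read as regex syntax.
                result = result.replaceFirst(Pattern.quote(trimmed), " ");
            }
        }
        // Normalize the whitespace left behind by the removals.
        return result.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Parent text as reported in the question, with the label and span texts as children.
        System.out.println(ownText("reqd info * Recipient:", Arrays.asList("reqd info", " *")));
        // prints "Recipient:"
    }
}
```

This still misfires if a child's text also occurs earlier in the parent's own text, so it remains a workaround rather than a precise selection.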
You could also get the text content of an element and remove the tags with a regexp. Also note that you should use a reluctant quantifier:
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
String getTextContentWithoutTags(WebElement element) {
return element.getText().replaceAll("<[^>]*?/>", "").trim();
}
