How to get related classes and values in JSoup? - java

I have an HTML file, a part of which looks like this:
<a name="user_createtime"></a>
<p class="column">
<span class="coltitle">CreateTime</span> <span class="titleDesc"><span class='defPopupLink' onClick='popupDefinition(event, "datetime")'>datetime</span></span> <span class = "spaceandsize">(non-null)<sup><span class='glossaryLink' onclick="popupDefinition(event, '<b>non-null</b><br>The column cannot contain null values.')">?</span></sup></span>
<br>
<span class="desc">Timestamp when the object was created</span>
<a name="user_createuser"></a>
<p class="column">
<span class="coltitle">CreateUser</span> <span class="titleDesc">foreign key to User</span>
<span class = "spaceandsize">(database column: CreateUserID)</span>
<br>
<span class="desc">User who created the object</span>
There are many such coltitle, titleDesc and desc elements.
Now, if I get an input string like "CreateTime", I want the output to be:
CreateTime, datetime, Timestamp when the object was created
and if I get an input string "CreateUser", I want the output to be:
CreateUser, foreign key to User, User who created the object
I'm using Jsoup for this, and I have gotten this far:
Elements colElements = Jsoup.parse(html).getElementsByClass("coltitle").select("*");
System.out.println("your Col:");
for (Element element : colElements)
{
if(element.ownText().equalsIgnoreCase("CreateTime"))
System.out.println(element.text());
}
which just prints the selected coltitle. How do I parse the related classes and get their values? Or, are they not even related and am I just treading down the wrong path?
Can someone please help me get my desired output?

You are only selecting the <span> tags with the class coltitle, so you only print the values they hold.
You can use the siblingElements() method to get the siblings of the element that you first select.
Your HTML does not seem to be formatted correctly, but the following should work:
System.out.println("your Col:");
for (Element element : colElements) {
    if (element.ownText().equalsIgnoreCase("CreateTime")) {
        System.out.print(element.text());
        for (Element sibling : element.siblingElements()) {
            System.out.print(", " + sibling.text());
        }
    }
    if (element.ownText().equalsIgnoreCase("CreateUser")) {
        System.out.print("\n" + element.text());
        for (Element sibling : element.siblingElements()) {
            System.out.print(", " + sibling.text());
        }
    }
}
This will select the elements of the class 'coltitle'.
Each if statement checks whether the element is one of the two titles and, if so, prints the element's text. It then moves on to its siblings and prints their texts.

According to the API docs, you can call children() on an Element, for example on each element in colElements.
http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#children()

Related

Check that two span classes are equal to text and click on the correct span element after

I have two span elements, and I need to check that my text, Frat Brothers (2013), is equal to the text inside these span classes and then click on that element.
<a href="/frat-brothers" class="">
<span class="name-content-row__row__title">Frat Brothers</span>
<span class="name-content-row__row--year">(2013)</span>
</a>
My code:
String title = "Frat Brothers (2013)";
List<WebElement> content = driver.findElements(By.cssSelector("span[class*='name-content-row__'"));
for (WebElement e : content) {
System.out.println("elememts is : " + e.getText());
if (e.getText().equals(title)) {
click(e);
}
output:
elememts is : Frat Brothers
elememts is : (2013)
The if statement isn't executed.
The if statement did not execute because you have
String title = "Frat Brothers (2013)";
Change that to
String title = "Frat Brothers";
and you should be good to go.
Also, do not use click(e); it should be e.click();
The driver.findElements method accepts a By parameter, while you are passing it a String.
If you want to select elements with this CSS selector, you can do this:
List<WebElement> content = driver.findElements(By.cssSelector("span[class*='name-content-row__']"));
Also, you will get 2 span elements, the first with the text Frat Brothers and the second with the text (2013).
Neither of these elements' texts will be equal to Frat Brothers (2013).
You can check whether title contains these texts.
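A minimal sketch of such a check, assuming the span texts have already been read with getText(). Plain strings stand in for the WebElement texts so the snippet runs without Selenium, and the class and method names are hypothetical:

```java
import java.util.List;

class TitleMatchSketch {
    // True if every span text occurs somewhere in the expected title.
    static boolean titleContainsAll(String title, List<String> spanTexts) {
        for (String text : spanTexts) {
            if (!title.contains(text)) {
                return false;
            }
        }
        return true;
    }

    // Stricter variant: the span texts, joined by single spaces, equal the title.
    static boolean titleEqualsJoined(String title, List<String> spanTexts) {
        return String.join(" ", spanTexts).equals(title);
    }

    public static void main(String[] args) {
        String title = "Frat Brothers (2013)";
        // Stand-ins for the two getText() results shown in the question's output.
        List<String> spanTexts = List.of("Frat Brothers", "(2013)");
        System.out.println(titleContainsAll(title, spanTexts));
        System.out.println(titleEqualsJoined(title, spanTexts));
    }
}
```

The stricter variant avoids false positives when a span text is merely a substring of some other title.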
You can try the following XPath: //a[normalize-space()='Frat Brothers (2013)']
That way there is no need for extra code. For example:
String title = "Frat Brothers (2013)";
WebElement content = driver.findElement(By.xpath("//a[normalize-space()='" + title + "']"));
content.click();
P.S. - Here is the xPath test: http://xpather.com/V9cjThsr
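To see why this expression matches the link as a whole, here is a small sketch that evaluates it with the JDK's built-in XPath engine instead of Selenium. The <root> wrapper and the class and method names are assumptions added only so the snippet is self-contained; the anchor markup is the one from the question:

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NormalizeSpaceDemo {
    // The anchor from the question, wrapped in a hypothetical <root> element.
    static final String MARKUP =
        "<root><a href=\"/frat-brothers\" class=\"\">\n"
      + "  <span class=\"name-content-row__row__title\">Frat Brothers</span>\n"
      + "  <span class=\"name-content-row__row--year\">(2013)</span>\n"
      + "</a></root>";

    // Counts the <a> elements whose whitespace-normalized text equals the title.
    // Note: a title containing a single quote would break this simple concatenation.
    static int countMatches(String title) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expression = "//a[normalize-space()='" + title + "']";
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(MARKUP)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The text of <a> is both span texts plus surrounding whitespace;
        // normalize-space() trims it and collapses each whitespace run to one space.
        System.out.println(countMatches("Frat Brothers (2013)")); // prints 1
        System.out.println(countMatches("Frat Brothers"));        // prints 0
    }
}
```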
For text verification, add the TestNG dependency to your pom.xml:
String title = "Frat Brothers (2013)";
// storing the text from the elements
String first = driver.findElements(By.cssSelector("span[class*='name-content-row__']")).get(0).getText();
String second = driver.findElements(By.cssSelector("span[class*='name-content-row__']")).get(1).getText();
// validating the text
Assert.assertTrue(title.contains(first), "checking name is available in the element");
Assert.assertTrue(title.contains(second), "checking year is available in the element");

How to get content from span with Jsoup

I am using the Jsoup HTML parser to extract content from an HTML page.
<span class="mainPrice reduced_">
<span class="oPrice" data-test="preisArtikel">
<span itemprop="price" content="68.00"><span class="oPriceLeft">68</span><span class="oPriceSeparator">,</span><span class="oPriceRight">00</span></span><span class="oPriceSymbol oPriceSymbolRight">€</span>
I want to extract the content (68.00) and I tried the following:
Elements price = doc.select("span.oPrice");
String priceString = price.text();
That doesn't work because the class "oPrice" occurs 44 times in the page and the string "priceString" contains 44 different prices.
Thank you for your help.
Try this:
// For one element
Element element = doc.select("span[content]").first();
System.out.println(element.attr("content"));
If there are multiple such spans:
// For multiple elements
Elements elements = doc.select("span[content]");
for (Element element : elements) {
    System.out.println(element.attr("content"));
}
Output:
68.00
In addition, check the Jsoup Selector documentation for reference.

get div by class containing TWO whitespaces in a row (JSoup)

I'm trying to get a specific div by its class. The class attribute actually contains multiple classes separated by spaces, but the last class is separated by two spaces!
Ex: class=test[SPACE]test[SPACE]test[SPACE][SPACE]test
Full:
listing[SPACE]category_templates[SPACE]clearfix[SPACE]shelfListing[SPACE][SPACE]multiSaveListing
Now how do I go about doing that?
Did not work (No Error thrown):
Elements divItemContainer = doc.select("div[class=listing category_templates clearfix shelfListing multiSaveListing]");
for (Element div : divItemContainer) {
Toast.makeText(ApplicationContextProvider.getContext(), "Got Div: ", Toast.LENGTH_SHORT).show();
}
Did not work (Thrown Error: String cannot contain whitespaces):
Elements divItemContainer = doc.select("div.listing.category_templates.clearfix.shelfListing..multiSaveListing");
for (Element div : divItemContainer) {
Toast.makeText(ApplicationContextProvider.getContext(), "Got Div: ", Toast.LENGTH_SHORT).show();
}
Did not work (No Error):
Elements divItemContainer = doc.select("div.listing.category_templates.clearfix.shelfListing.multiSaveListing");
for (Element div : divItemContainer) {
Toast.makeText(ApplicationContextProvider.getContext(), "Got Div: ", Toast.LENGTH_SHORT).show();
}
PS: The Toast is meant to purposely crash the app! It does nothing but kill it, and that's supposed to happen (at least for the moment).
Source:
<div class="listing category_templates clearfix shelfListing multiSaveListing"><div id="yousaveImage"></div><div class="multisave" id="multiSaveId"><a class="linksave" href="/promotion/2-for-250/ls85559"><span class="view-all">View all</span><span class="offer-2for3">2 for</span><span><span class="poundSign"></span><span class="ping-offer-finalValue">£2.50</span><span class="ping-offer-finalValue-1" style="display:none"></span><span class="pencep" style="display:none">p</span></span></a></div><div class="container"><div class="slider category_templates"><input id="itemId" value="1000000476716" type="hidden"><input id="maxQtyId" value="24.0" type="hidden"><div class="product active"><div class="slider"><div class="information active"><div class="imgContainer"><img class="" src="http://ui2.assets-asda.com:80/g/v5/501/375/5051413501375_130_IDShot_4.jpeg" data-original="http://ui2.assets-asda.com:80/g/v5/501/375/5051413501375_130_IDShot_4.jpeg" alt="ASDA Chosen By You Orange & Pineapple Double Strength Squash 2 FOR £2.50" title="" onerror="loadNoImage(this)"><span class="accessible"> Add to shopping list</span></div><p class="bundle-contains" style="display:none;"> Contains <span>0</span> <span>items</span></p><p class="subTitle">1.5LT</p></div></div><div class="product-content"><span class="bundle-banner" style="display:none;"> Bundle </span><span class="promoBanner"></span><span class="primaryBanner" style="display:none;">2 FOR £2.50</span><span class="title" id="productTitle"><a role="presentation" aria-hidden="true" tabindex="-1" href="/product/no-added-sugar/asda-chosen-by-you-orange-pineapple-double-strength-squash/1000000476716" title="ASDA Chosen By You Orange & Pineapple Double Strength Squash"><span>ASDA Chosen By You Orange & Pineapple Double Strength Squash</span></a></span><div class="product-type-icons" style="visibility:visible"><i data-contentid="" data-similarproducts="true" data-title="Suitable for Vegetarians" data-name="Suitable for Vegetarians" 
title="Vegetarian" class="type-icon icon-suitable-for-vegetarians" data-infoiconid="1215398078196" data-id="2854136">Vegetarian</i></div><div class="rating-static rating-50"><span class="star star1"></span><span class="star star2"></span><span class="star star3"></span><span class="star star4"></span><span class="star star5"></span></div><div class="prod-limit-Mask"></div><div class="quantity-info-Mask"><span class="qLimit-toolTip"></span> Close <div class="qLimit-popUp"><p id="quantityLimitText"><span class="qLimit-Sorry">Sorry...</span>You can't add more than <span class="max-qty-val">24</span> per order</p></div></div><div id="cartBground" class="addedbg"><div class="price-cart-block"><div class="price-wrap category_templates"><span class="price"><span>£1.40</span></span><span class="priceInformation"> (9.3p/100ml) </span></div>AddView bundle<div class="quantityOptions clearfix"><span>–</span><input aria-label="Quantity in your trolley" value="1" name="quantityInTrolley" class="prd-txt" maxlength="5" type="number"><span>+</span>Add<div id="qtySelect" class="qty-wrapper" style="display: none;"><div class="qty-select"><span class="qty-value" tabindex="0" title="Quantity">Q<span class="accessible">uanti</span>ty</span><span class="qty-select-icon"></span></div><ul class="qty-list" style="display:none"><li class="qtyAccessible"><span title="Quantity" data-salesunit="Qty">Q<span class="accessible">uanti</span>ty</span></li><li class="kgAccessible"><span title="Kilogram" data-salesunit="kg">k<span class="accessible">ilo</span>g<span class="accessible">ram</span></span></li></ul></div><p id="inTrolleyId">in your trolley</p></div></div><div id="itemAjaxLoader" class="ajaxLoader 1000000476716" style="display:none;"><img src="//ui3.assets-asda.com/theme/img/common/loader.svg" style="width: 32px;" onerror="this.src=//ui3.assets-asda.com/theme/img/common/ajax-loader.gif; this.onerror=null;"></div><div class="unavail-item-message"> Item unavailable<span 
class="qLimit-toolTip"></span></div><div class="unavail-item"><span class="unavailable-image"></span><span></span></div></div></div><div class="sectionMenu"></div></div></div></div></div>
This works, but it is unsafe and there is no reason to use it. Moreover, for this to work, the order of the classes and the whitespace must be identical to the attribute value. You say it doesn't work, but I've tested it and it does.
Elements divItemContainer = doc.select("div[class=listing category_templates clearfix shelfListing multiSaveListing]");
for (Element div : divItemContainer) {
Toast.makeText(ApplicationContextProvider.getContext(), "Got Div: ", Toast.LENGTH_SHORT).show();
}
This is the way to do it. The order of the classes doesn't matter, nor the whitespaces. You say it doesn't work, but I've tested it and it does.
Elements divItemContainer = doc.select("div.listing.category_templates.clearfix.shelfListing.multiSaveListing");
for (Element div : divItemContainer) {
Toast.makeText(ApplicationContextProvider.getContext(), "Got Div: ", Toast.LENGTH_SHORT).show();
}
For this one the error is correct.
Elements divItemContainer = doc.select("div.listing.category_templates.clearfix.shelfListing..multiSaveListing");
for (Element div : divItemContainer) {
Toast.makeText(ApplicationContextProvider.getContext(), "Got Div: ", Toast.LENGTH_SHORT).show();
}
Your query goes through validation before it is executed. The validation takes every class you input as a parameter. The CSS selector you type gets split at every ., and by typing consecutive dots you are creating empty class names.
public static void notEmpty(String string) {
if ((string == null) || (string.length() == 0))
throw new IllegalArgumentException("String must not be empty");
}
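The effect described above can be reproduced with a simplified sketch. This is not Jsoup's actual parser: the classNames helper and the class name are hypothetical, and only the notEmpty check is taken from the snippet above. It splits a selector at every dot and validates each class name:

```java
import java.util.ArrayList;
import java.util.List;

class ClassSelectorSketch {
    // Same check as the validation method quoted above.
    static void notEmpty(String string) {
        if (string == null || string.length() == 0) {
            throw new IllegalArgumentException("String must not be empty");
        }
    }

    // Hypothetical helper: splits "tag.class1.class2" at every '.' and
    // validates each class name. The limit -1 keeps empty parts in the result.
    static List<String> classNames(String selector) {
        String[] parts = selector.split("\\.", -1);
        List<String> names = new ArrayList<>();
        for (int i = 1; i < parts.length; i++) { // parts[0] is the tag name
            notEmpty(parts[i]);
            names.add(parts[i]);
        }
        return names;
    }

    public static void main(String[] args) {
        // Single dots: every class name is non-empty.
        System.out.println(classNames("div.shelfListing.multiSaveListing"));
        try {
            // Two consecutive dots produce an empty class name between them.
            classNames("div.shelfListing..multiSaveListing");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```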
The reason it doesn't work is not your selector. Try printing the response you get from the server. When you do Document doc = Jsoup.parse()... try printing the doc. Does it contain the element you are searching for? I suspect it doesn't.
If I'm right that the element you are searching for is not present in the response you are getting, then there are two possibilities.
The server perceives your program as a bot and doesn't allow that, or it serves you a page for mobile devices, so you get something different from what you see when navigating in the browser. If this is the case, the solution is to set a userAgent.
The element is not present because it is generated by javascript. Jsoup is just a parser, not a browser. It cannot execute javascript, thus it cannot generate the dynamic content. In order to check if the content you need is dynamic, just navigate to the page and press Ctrl + U and check if the element you need is in there. That's the content before any javascript is executed.

Using JSoup to select a group of tags

I am attempting to use JSoup to scrape some information off a page, which can be identified by a group of tags in a particular order. The order of them is as follows:
<span class="sold" >Sold</span></td>
<td class='prc'>
<div class="g-b bidsold" itemprop="price">
AU $1.00</div>
I am looking to grab each value that is in place of the AU $1.00 field on the page, but they can only be identified by the span class="sold" selector that occurs a few tags beforehand.
I have tried something like select("span.sold:lt(4) + [itemprop=price]") but feel like I'm flailing around in the dark!
The code below should do the trick:
// Placeholder: replace the string with the URL of your HTML document
Document doc = Jsoup.connect("URL of your HTML document").get();
Element part = doc.body();
Elements parts = part.getElementsByTag("div");
String attValue;
String requiredContent;
for (Element ent : parts) {
    if (ent.hasAttr("class")) {
        attValue = ent.attr("class");
        // Match the exact class attribute value of the price div
        if (attValue.equals("g-b bidsold")) {
            System.out.println("\n");
            requiredContent = ent.text();
            System.out.println(requiredContent);
        }
    }
}
Just make sure to iterate and collect the output in an array.
You could also do this:
Elements soldPrices = doc.select("td:has(.sold) + td [itemprop=price]");
That returns the elements (the DIVs) with itemprop=price that sit inside a TD whose immediately preceding TD contains an element (the SPAN) with class sold.
See the Selector syntax for more details.

How to parse out all links from source with Jericho in Java while filtering out or ignoring elements with a specific id?

I am using the Jericho java client library to parse out all href links. What I want to do is filter out or skip all links from the source that contain a specific id. I have tried several things, and my solution is not pretty but basically I can accomplish this by checking on something like this:
for(Element element : elements) {
if (element.getAllStartTags().toString().contains("skip_me")) {
// do something
}
}
But I prefer a cleaner solution. Let's assume this is the source:
<td>
<a href="http://www.yahoo.com" id="skip_me" />
</td>
<td>
<a href="http://www.google.com" />
</td>
Just a small snippet, but what I want this to return me in the end is just "www.google.com". I would appreciate any help with this. Thanks.
Here is another solution:
for (Element element : elements) {
    // Select only 'a' tags; compare the names with equals() rather than ==
    if (HTMLElementName.A.equals(element.getStartTag().getName())) {
        final String id = element.getAttributeValue("id"); // get the 'id' attribute
        // Process the element if it has a.) no id (null) or b.) an id other than 'skip_me'
        if (id == null || !id.equals("skip_me")) {
            System.out.println(element); // process the element
        }
    }
}
Output:
(using your html)
<a href="http://www.google.com" />
Another solution:
List<Element> elements = source.getAllElements("a");
for(Element element : elements )
{
final String id = element.getAttributeValue("id");
if(id == null || !id.equals("skip_me"))
{
System.out.println(element.toString());
}
}
Output:
<a href="http://www.google.com" />
