Select a class with a space character in Jsoup - java

I am using Jsoup and am attempting to select an html class with a space in its name:
<p class="story-body-text story-content"></p>
The usual method for class selection (.class) is not working in this instance. My code is:
Elements text = doc.select(".story-body-text story-content");
Which is returning an empty list of Elements. I have seen that I can perhaps try
Elements text = doc.select(".~story-body-text");
However, that gives me troublesome "source not found" errors in Eclipse, even though I have added the Jsoup jar to my project. Ideally there would be another solution, as I can't seem to resolve the "source not found" issue.

# is the prefix for an id; . is the prefix for a class name. When there is a space in the class attribute, the values are treated as separate class names.
I'd expect these to work:
Elements text = doc.select(".story-body-text");
Elements text = doc.select(".story-content");
Elements text = doc.select(".story-body-text.story-content");
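If it helps, here is a minimal, self-contained sketch showing that chaining the two class selectors matches an element carrying both classes. The markup is taken from the question; the class name ClassSelectorDemo and the paragraph text are just illustrative:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ClassSelectorDemo {
    public static void main(String[] args) {
        // Markup copied from the question, with placeholder text
        String html = "<p class=\"story-body-text story-content\">some story text</p>";
        Document doc = Jsoup.parse(html);

        // .a.b (no space) matches elements that carry BOTH classes
        Elements text = doc.select(".story-body-text.story-content");
        System.out.println(text.text()); // prints "some story text"
    }
}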

Related

Parsing XML with multiple children using Jsoup

I have an xml file that looks as follows - link.
I would like to get the title from it.
In order to do so, I did the following:
Document bookDoc = Jsoup.connect( url ).parser( Parser.xmlParser() ).get();
Node node = bookDoc.childNode( 2 ).childNode( 3 ).childNode( 3 );
This returns a node, but now I have 2 questions:
1. Isn't there any simpler way to get this title instead of using all of these childNode calls? My worry is that in some results the title won't be exactly at childNode(3) and my code won't work.
2. How do I eventually get this title? I'm stuck at this point and can't get the string of the title.
Thank you
You can use selectors to access elements. Here you want to select by tag name. Two ways to get the element you want:
String title1 = bookDoc.select("record>display>title").text();
String title2 = bookDoc.selectFirst("record").selectFirst("display").selectFirst("title").text();
If you want to select more complicated things read:
https://jsoup.org/cookbook/extracting-data/dom-navigation
https://jsoup.org/cookbook/extracting-data/selector-syntax
But you probably won't need them for parsing this XML.
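For reference, a minimal sketch of the whole flow, assuming url points at the XML from the question and that it contains record > display > title elements (the URL below is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class XmlTitleDemo {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com/book.xml"; // placeholder; use the real URL from the question
        Document bookDoc = Jsoup.connect(url).parser(Parser.xmlParser()).get();

        // Select by tag names instead of relying on childNode() positions
        String title = bookDoc.select("record > display > title").text();
        System.out.println(title);
    }
}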

Find element by text inside another element using UISelector query

I have the following code snippet and the screenshot attached.
String query = "new UiScrollable(new UiSelector().className(\"androidx.recyclerview.widget.RecyclerView\"))" +
".scrollIntoView(new UiSelector().text(\"Test Group\"))";
driver.findElementByAndroidUIAutomator(query).click();
What I want is to find an element with the text "Test Group" using UiSelector, but inside the RecyclerView only (not searching the whole app source). What I get is the element inside the search field instead (not in the RecyclerView).
Please advise. I know that I can get all matching elements using findElements(By.id("name")), but I want to use a UI selector in this case.
With UiSelector you can use chaining:
String query = "new UiScrollable(resourceIdMatches(\".*recycler_view\")).scrollIntoView(resourceIdMatches(\".*recycler_view\").childSelector(text(\"Test Group\")))";
In addition, the new UiSelector() part can be omitted; Appium supports this shorthand syntax.
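For comparison, the same chain written out with explicit new UiSelector() calls might look something like this. This is a sketch; it assumes the RecyclerView's resource id ends in recycler_view, as in the snippet above, and that driver is the AndroidDriver from the question:

String query =
        "new UiScrollable(new UiSelector().resourceIdMatches(\".*recycler_view\"))"
      + ".scrollIntoView(new UiSelector().resourceIdMatches(\".*recycler_view\")"
      + ".childSelector(new UiSelector().text(\"Test Group\")))";
driver.findElementByAndroidUIAutomator(query).click();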

How can I efficiently extract text from a bunch of web pages without extra information

I have a list of around 1 million web pages, and I want to efficiently extract just the text from those pages. Currently I am using the BeautifulSoup library in Python to get text from the HTML and the requests library to fetch each page. This approach extracts some extra information in addition to the text, for example any JavaScript listed in the body.
Could you please suggest a suitable and efficient way to do this task? I looked at Scrapy, but it looks like it crawls a specific website. Can we pass it a list of specific web pages to get information from?
Thank you in advance.
Yes, you can use Scrapy to crawl a set of URLs in a generic fashion.
You simply need to set them on the start_urls list attribute of your spider, or reimplement the start_requests spider method to yield requests from any data source, and then implement your parse callback to perform the generic content extraction you want.
You can use html-text to extract text from them, and regular Scrapy selectors to extract additional data like the one you mention.
In Scrapy you can also plug in your own parser, e.g. BeautifulSoup, and call it from your parse method.
To extract text from generic pages I traverse the body only and exclude comments and some tags like script, style, etc.:
import re
import bs4

# soup is assumed to be a bs4.BeautifulSoup instance for the fetched page
EXCLUDED_STRING_TYPES = (bs4.Comment, bs4.CData, bs4.ProcessingInstruction, bs4.Declaration)
EXCLUDED_TAGS = ['script', 'noscript', 'style', 'pre', 'code']
UNICODE_WHITESPACES = re.compile(u'[\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004'
                                 u'\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+')

snippets = []
for snippet in soup.find('body').descendants:
    # Keep plain text nodes only, skipping comments, CDATA and excluded tags
    if isinstance(snippet, bs4.element.NavigableString) \
            and not isinstance(snippet, EXCLUDED_STRING_TYPES) \
            and snippet.parent.name not in EXCLUDED_TAGS:
        snippet = re.sub(UNICODE_WHITESPACES, ' ', snippet)
        snippet = snippet.strip()
        if snippet != '':
            snippets.append(snippet)

findElement on HTML that is generated at runtime

I'm trying to find an element that is generated by the PCA Predict API, found at this link: http://www.pcapredict.com/en-gb/address-capture-software/
The code I have at the moment is as follows, but it throws a timeout exception because it does not find any elements. Yet the xpath is correct, as I have checked it in developer tools.
By PCA = By.id("inputPCAnywhere");
driver.findElement(PCA).clear();
driver.findElement(PCA).sendKeys(ValidPostcode);
wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//div[@class='pcaitem pcafirstitem']")));
driver.findElement(By.xpath("//div[@class='pcaitem pcafirstitem']")).click();
The element is visible on the page, and developer tools only returns one result for that xpath; there are no IDs to find it by.
It looks like the first item is getting "selected" by default, leading to its class value being equal to the following:
<div class="pcaitem pcafirstitem pcaselected"...>...</div>
All the other results have only the pcaitem class, so no element has a class attribute of exactly pcaitem pcafirstitem.
In other words, your problem is the strict class match. I would improve the locator to do a partial match on the class attribute. For instance, with a CSS selector:
wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector(".pcaitem.pcafirstitem")));
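Put together with the code from the question, the flow might look like this (a sketch; ValidPostcode and the wait object are assumed to be set up as in the question):

By PCA = By.id("inputPCAnywhere");
By firstSuggestion = By.cssSelector(".pcaitem.pcafirstitem");

driver.findElement(PCA).clear();
driver.findElement(PCA).sendKeys(ValidPostcode);

// Wait for the suggestion list to render, then click the first entry
wait.until(ExpectedConditions.visibilityOfElementLocated(firstSuggestion));
driver.findElement(firstSuggestion).click();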

Java Selenium - Find string text from multiple divs with same class name using xpath

I'm hoping you can help me.
I've been going through all sorts of forums and questions on here on how to cycle through multiple divs with the same class name using xpath queries. I'm fairly new to WebDriver and Java, so I'm probably not asking the question properly.
I have a table where I'm trying to identify the values within and ensure they're correct. Each field has the same class identifier, and I'm able to successfully pull back the first result and confirm it via report logging using the following:
String className1 = driver.findElement(By.xpath("(//div[@class='table_class'])")).getText();
Reporter.log("=====Class Identified as "+className1+"=====", true);
However, when I then try to cycle through (I've seen multiple answers saying to add a [2] suffix to the xpath query), I'm getting a compile error:
String className2 = driver.findElement(By.xpath("(//div[@class='table_class'])")[2]).getText();
Reporter.log("=====Class Identified as "+className2+"=====", true);
The above gives an error saying "The type of the expression must be an array type but it resolved to By"
I'm not 100% sure how to structure this in order to set up an array and then cycle through.
Whilst this is just verifying field labels for now, ultimately I need to use this approach to verify the subsequent data that is pulled through, and I'll have the same problem then.
You are getting the error for
String className2 = driver.findElement(By.xpath("(//div[@class='table_class'])")[2]).getText();
because you are using the index in the wrong way. Modify it to
String className2 = driver.findElement(By.xpath("(//div[@class='table_class'])[2]")).getText();
or
String className2 = driver.findElement(By.xpath("(//div[@class='table_class'][2])")).getText();
And the better way to do this is:
int index = 1;
List<WebElement> allElement = driver.findElements(By.xpath("(//div[@class='table_class'])"));
for (WebElement element : allElement) {
    String className = element.getText();
    Reporter.log("=====Class Identified as " + className + " " + index + "=====", true);
    index++;
}
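If you only need one specific row rather than cycling through all of them, you can also index into the returned list directly (a small sketch; note that List indexes are zero-based, so get(1) is the second matching div):

List<WebElement> allElement = driver.findElements(By.xpath("//div[@class='table_class']"));
String secondClassName = allElement.get(1).getText(); // get(1) is the second matching div
Reporter.log("=====Class Identified as " + secondClassName + "=====", true);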
