How to compare children of an element in a DOM with jsoup - java

I am working on a project where I need to detect whether an element has repeated children. For example, in a given DOM, I want to know that the tbody element has similar children.
My goal is to extract data from pages whose structure I do not know in advance and store it in a database.

Use jQuery to get your td elements and iterate over them with each.

You can use Jsoup for this; it is easy to use.
For example, to get all td tags within your document:
String html=... //your html string
Document doc = Jsoup.parse(html);
Elements elements = doc.select("tbody").select("td");
System.out.println(elements.size()); // prints the number of td elements inside tbody, regardless of how deeply they are nested
Edit 1:
To iterate over all elements, you can do:
for (Element e : doc.getAllElements()) {
    System.out.println(e.tagName()); // prints the tag name
}
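Neither snippet above actually compares the children with each other, so here is a minimal sketch of one way to do that with Jsoup. It assumes that "similar" means "same tag name and same sequence of child tag names"; that definition, the class name RepeatedChildren and the helper names are my own assumptions, not something from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RepeatedChildren {

    // Hypothetical "signature" of a child: its tag name plus the tag names of its own direct children.
    static String signature(Element child) {
        String inner = child.children().stream()
                .map(Element::tagName)
                .collect(Collectors.joining(","));
        return child.tagName() + "[" + inner + "]";
    }

    // Counts how many direct children of parent share each signature.
    static Map<String, Integer> signatureCounts(Element parent) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Element child : parent.children()) {
            counts.merge(signature(child), 1, Integer::sum);
        }
        return counts;
    }

    // True if at least two direct children have the same signature.
    static boolean hasRepeatedChildren(Element parent) {
        return signatureCounts(parent).values().stream().anyMatch(n -> n > 1);
    }

    public static void main(String[] args) {
        // Sample markup made up for illustration.
        String html = "<table><tbody>"
                + "<tr><td>a</td><td>b</td></tr>"
                + "<tr><td>c</td><td>d</td></tr>"
                + "</tbody></table>";
        Document doc = Jsoup.parse(html);
        Element tbody = doc.selectFirst("tbody");
        System.out.println(signatureCounts(tbody));
        System.out.println(hasRepeatedChildren(tbody));
    }
}
```

For pages with unknown structure you could run hasRepeatedChildren over every element from doc.getAllElements() and keep the parents that return true. A stricter or looser signature (for example, including class names or going deeper than one level) may suit real pages better.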

Related

how to add the second css selector xpath where the same element appears more than once

How do I write a second CSS selector or XPath when the same element appears more than once?
WebElement element = driver.findElement(By.cssSelector("span.mat-content"));
I need to use the same CSS selector again, but for the second element. How do I do that?
The line above is the locator I already have for the first element.
HTML of the first element:
<span class="mat-content ng-tns-c143-2587">
HTML of the second element, which I need a selector for:
<span class="mat-content ng-tns-c143-2589">
WebElement element2 = driver.findElement(By.cssSelector("span.mat-content")); // what should the selector be here for the second element?
If span.mat-content matches multiple web elements in the HTML DOM, you can use findElements to grab them all.
List<WebElement> elements = driver.findElements(By.cssSelector("span.mat-content"));
Now elements is a List in the Selenium Java bindings.
You can call elements.get(1), which gives you the second web element (the list is zero-indexed).
Or you can iterate over the entire list like this:
for (WebElement element : elements){
    element.getText(); // note that each `element` is a web element
}
If you would rather not do it that way, you can try XPath indexing.
(//span[contains(@class,'mat-content')])[1]
should represent the first element,
and
(//span[contains(@class,'mat-content')])[2]
should represent the second element, and so on:
[3], [4], .... [n]
To use these, replace By.cssSelector with By.xpath. Note that XPath indexing is not the preferred choice.
List<WebElement> elements = driver.findElements(By.cssSelector("span.mat-content"));
WebElement element2 = elements.get(1);
This should work.

JSoup Scraping based on custom attributes

So I am using JSoup to scrape a website that creates a bunch of divs with dynamic class names (they change every reload), but the same attribute names. E.g:
<div class="[random text here that changes] js_resultTile" data-listing-number="[some number]">
<div class="a12_regularTile js_rollover_container " itemscope itemtype="http://schema.org/Product" data-listing-number="[same number here]">
<a href...
I've tried multiple approaches to selecting those divs and saving them in elements, but I can't seem to get it right. I've tried by attribute:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[data-listing-number]");
I've tried by class:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.getElementsByClass("a12_regularTile");
And:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[class*=js_resultTile]");
I've tried another attribute method:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = new Elements();
for (Element element : doc.getAllElements()) {
    for (Attribute attribute : element.attributes()) {
        if (attribute.getKey().equalsIgnoreCase("data-listing-number")) {
            myEls.add(element);
        }
    }
}
None of these work. I can select the doc that gets me all the HTML, but my myEls object is always empty. What can I use to select these elements?
Are you sure these elements are present in the HTML returned by the server? They may be added later by JavaScript. Jsoup does not execute JavaScript, so if the tiles are built client-side you won't be able to find them with Jsoup alone. More details in my answer to a similar question here: JSoup: Difficulty extracting a single element
And one more tip. Instead of using your for-for-if construction you can use this:
for (Element element : doc.getAllElements()) {
    if (element.dataset().containsKey("listing-number")) {
        myEls.add(element);
    }
}
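To check whether the selector or the page is the problem, one option is to run the same attribute selector against a static HTML string. The sketch below does that; the class name ListingSelectorCheck, the helper countListings and both sample strings are made up for illustration and only loosely follow the markup in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ListingSelectorCheck {

    // Returns the number of elements that carry a data-listing-number attribute.
    static int countListings(String html) {
        Document doc = Jsoup.parse(html);
        Elements tiles = doc.select("[data-listing-number]");
        return tiles.size();
    }

    public static void main(String[] args) {
        // Markup shaped like the question's example; class names and numbers are invented.
        String rendered = "<div class=\"x9f js_resultTile\" data-listing-number=\"101\">"
                + "<div class=\"a12_regularTile\" data-listing-number=\"101\"><a href=\"#\">item</a></div>"
                + "</div>";
        // What a server might send when the tiles are only built later by JavaScript.
        String shell = "<div id=\"results\"></div><script>/* tiles rendered client-side */</script>";

        System.out.println(countListings(rendered));
        System.out.println(countListings(shell));
    }
}
```

If the selector matches on static markup like this but returns nothing for the live page, printing doc.html() for the fetched document and searching it for data-listing-number should show whether the server sent those divs at all.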

Jsoup Grab embedded tags

I am using Jsoup and was wondering how do you get embedded tags? I can get the section tag but I am not sure how to get the div tag inside as I have a list of elements. My question is how do I fetch a div tag inside a section tag?
This should work:
Elements elements = doc.select("section.page-content-full div.content");
Just use the query selector syntax:
Elements elems = doc.select("section.main-page-content-full>div.content");
If you want just the first element, call first(), which returns a single Element rather than Elements:
Element elem = doc.select("section.main-page-content-full>div.content").first();

JSoup extracting data from within paragraph

I want to extract all the text inside every paragraph on an unknown site (meaning I do not know the structure of the site).
So far I've got:
Elements paragraphEmail = doc.select("p");
Where doc = Jsoup.connect(url).get();
for (Element e : paragraphEmail) {
}
How to achieve this?
doc.select("p") will give you all the paragraph elements as an Elements collection.
Use a for each loop to get the text:
for(Element e : paragraphEmail){
System.out.println(e.text());
}
I suggest you take a look at the Jsoup cookbook and the API reference to get more familiar with the methods in Jsoup.
Cookbook
API Reference

How can I select the divs elements that not having another divs inside it?

I'm using Java and Jsoup to parse HTML pages, and I want to get all the divs that do not contain other divs, so I can print the text they contain.
For example, if a div contains a table and the table contains a div, I don't want the outer one. I want only the divs at the lowest level, with no other div inside them (other tags are OK).
How do I do this?
Primarily, I want to know if there is some syntax that I can use with the select() method.
Document doc; //comes as parameter
Elements divs = doc.getElementsByTag("div");
for(Element div: divs){
if(div.getElementsByTag("div").size() == 1){
//is a div with no divs inside it
}
}
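On the select() part of the question, one possibility is to combine Jsoup's :not() and :has() pseudo-selectors. This is a minimal sketch, not a tested answer for every page; the class name LeafDivs, the helper leafDivTexts and the sample HTML are made up for illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class LeafDivs {

    // Returns the text of every div that has no div among its descendants.
    static List<String> leafDivTexts(String html) {
        Document doc = Jsoup.parse(html);
        // :has(div) matches elements with a div descendant; :not(...) inverts that.
        Elements leaves = doc.select("div:not(:has(div))");
        return leaves.eachText();
    }

    public static void main(String[] args) {
        String html = "<div id=\"a\"><table><tr><td><div id=\"b\">inner</div></td></tr></table></div>"
                + "<div id=\"c\"><p>plain</p></div>";
        // The outer div "a" is skipped because the table inside it contains a div.
        System.out.println(leafDivTexts(html));
    }
}
```

Note that eachText() skips elements without any text, so empty leaf divs will not appear in the returned list even though the selector matches them.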
