Jsoup - Convert html texts into a list of Strings

Using Jsoup, I want to add the text contained in each HTML tag to a List<String>, in document order.
This is fairly easy using BeautifulSoup4 in Python, but I'm having a hard time in Java.
BeautifulSoup code:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    text_list = []
    for t in visible_texts:
        text_list.append(t.strip())
    return list(filter(None, text_list))

html = urllib.request.urlopen('https://someURL.com/something').read()
print(text_from_html(html))
This code will print ["text1", "text2", "text3",...]
My initial attempt was to follow the Jsoup documentation for text conversion.
Jsoup code, attempt 1:
Document doc = Jsoup.connect("https://someURL.com/something")
        .userAgent("Bot")
        .get();
Elements divElements = doc.select("*");
List<String> texts = divElements.eachText();
System.out.println(texts);
What ends up happening is a duplication of texts ["text1 text2 text3","text2 text3", "text3",...]
My assumption is that Jsoup goes through each Element and prints out every text within that Element including the text existing in each child node. Then it goes to the child node and prints out the remaining text, so on and so forth.
I have seen many people specify tags or attributes via cssQuery to get around this problem, but my project needs this to work for any scrapable website.
Any suggestion is appreciated.

Your assumption is right, but BeautifulSoup would probably do the same without the filter: it is only the text=True in findAll(text=True) that limits the result to pure text nodes. To get the equivalent in Jsoup, use the following selector:
Elements divElements = doc.select(":matchText");

Related

How to remove special characters from a xpath using Selenium?

As you can see, I have used one dynamic XPath, //td[text()='Discharge Air']/following-sibling::td/span, to cover zone1 through zone3. But when I call getText() to fetch only the value 100, the special characters °F come along with it. How can I remove °F so that I get only the value 100 from this XPath? As you can see in the image, only one span is available, so I can't separate the value by span either.
String s = driver.findElement(By.xpath("//td[text()='Discharge Air']/following-sibling::td/span")).getText();
s.replace("°F","");//replace the °F with empty string
Instead of a String, can I use a List? All these XPaths are of the same type, so I could collect the elements directly and then call getText() in a for loop.
List s=driver.findElements(By.xpath("//td[text()='Discharge Air']/following-sibling::td/span"));
s.replace("°F","");
Thanks in advance,

List<WebElement> disch_Air = driver.findElements(By.xpath("//td[text()='Discharge Air']/following-sibling::td/span"));
for (int i = 0; i < disch_Air.size(); i++) {
    System.out.println(disch_Air.get(i).getText().replace("°F", ""));
}
This is what I wanted and it works fine. Thank you all so much for your help.

Use this:
// first find the element and read its text (with the XPath you posted)
String s = driver.findElement(By.xpath("//td[text()='Discharge Air']/following-sibling::td/span")).getText();
s = s.replace("°F", ""); // replace() returns a new string, so assign the result
If there is still whitespace around the value, you can remove it the same way:
s = s.trim();
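
The point about assigning the result can be checked without a browser. The following is a minimal sketch using only the JDK; the class and method names (UnitStripper, stripUnit, stripAll) are hypothetical, and the sample strings merely stand in for whatever getText() returns on the real page. The degree sign is written as the escape \u00B0 so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnitStripper {

    // Degree sign followed by F, written with a Unicode escape.
    private static final String UNIT = "\u00B0F";

    // Hypothetical helper: drops the unit and any surrounding whitespace.
    static String stripUnit(String reading) {
        // Strings are immutable: replace() and trim() return new strings,
        // so the result must be returned or assigned, not discarded.
        return reading.replace(UNIT, "").trim();
    }

    // Applies stripUnit to every reading, preserving the original order.
    static List<String> stripAll(List<String> readings) {
        List<String> values = new ArrayList<>();
        for (String reading : readings) {
            values.add(stripUnit(reading));
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed sample readings, one per zone.
        List<String> readings = Arrays.asList("100 \u00B0F", "98\u00B0F", " 101 \u00B0F ");
        System.out.println(stripAll(readings)); // prints [100, 98, 101]
    }
}
```

With Selenium, the same helper could be applied to each element's getText() result inside the loop.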

Parsing xml with multi childs using jsoup

I have an xml file that looks as follows - link.
I would like to get the title from it.
In order to do so, I did the following:
Document bookDoc = Jsoup.connect( url ).parser( Parser.xmlParser() ).get();
Node node = bookDoc.childNode( 2 ).childNode( 3 ).childNode( 3 );
This returns me this:
Now I have 2 questions:
Isn't there a simpler way to get this title than chaining all of these childNode calls? My worry is that in some results the title won't be exactly at childNode(3), and then my code won't work.
How do I eventually get this title? I'm stuck at this point and can't get the title as a string.
Thank you

You can use selectors to access elements. Here you want to select by tag name. Two ways to get the element you want:
String title1 = bookDoc.select("record>display>title").text();
String title2 = bookDoc.selectFirst("record").selectFirst("display").selectFirst("title").text();
If you want to select more complicated things read:
https://jsoup.org/cookbook/extracting-data/dom-navigation
https://jsoup.org/cookbook/extracting-data/selector-syntax
But you probably won't need them for parsing this XML.

How do i Jsoup query for the value of a html key/value pair

Sorry if my terms are off, I haven't done this before.
I'm using Jsoup to scrape a single value off a web page.
I am trying to find the "serialno", which is stored within this function (JavaScript?):
function set(obj, val)
{
document.getElementById(obj).innerHTML= val;
}
called by
{set("modelname", "NPort 5650-16");set("mac", "00:90:E8:22:76:F4");set("serialno", "2583");set("ver", "3.3 Build 08042219");setlabel("NPORT");uptime("264 days, 03h:31m:34s");}<
I am unsure how I can use Jsoup to extract and print the serialno value, which in this case happens to be 2583. I've tried basic commands using getElementById, but I've never used Jsoup before. I am familiar with maps, but I'm not sure how to work with them in Jsoup, and most of the tutorials online need the actual path to the exact cell within the table (where this info is displayed).

You can't use Jsoup to do this directly. Jsoup can parse HTML, but JavaScript is out of its reach: script content is treated as plain text. It can't be executed, and selecting things inside JavaScript is not possible.
But if you already have the HTML parsed into a Document and you're looking for an alternative, you can use a regular expression to grab this value.
Document doc = Jsoup.parse...
String html = doc.toString();
Pattern p = Pattern.compile("set\\(\"serialno\", \"(\\d+)\"\\)");
Matcher m = p.matcher(html);
if (m.find()) {
    String serialno = m.group(1);
    System.out.println(serialno);
}
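
The regular-expression approach can be tried on its own, without Jsoup, by feeding it the script text from the question. This is a minimal sketch under a few assumptions: the class and method names (SerialNoExtractor, extractSerialNo) are hypothetical, and the pattern is a slight variation that tolerates any amount of whitespace after the comma.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SerialNoExtractor {

    // Matches set("serialno", "<digits>") and captures the digits.
    private static final Pattern SERIAL_NO =
            Pattern.compile("set\\(\"serialno\",\\s*\"(\\d+)\"\\)");

    // Returns the captured serial number, or an empty Optional if the
    // page source contains no matching set(...) call.
    static Optional<String> extractSerialNo(String pageSource) {
        Matcher m = SERIAL_NO.matcher(pageSource);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Script text copied from the question (shortened).
        String script = "{set(\"modelname\", \"NPort 5650-16\");"
                + "set(\"mac\", \"00:90:E8:22:76:F4\");"
                + "set(\"serialno\", \"2583\");"
                + "set(\"ver\", \"3.3 Build 08042219\");}";
        System.out.println(extractSerialNo(script).orElse("not found")); // prints 2583
    }
}
```

Returning an Optional makes the "value not on the page" case explicit instead of silently printing nothing.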

JxBrowser How to get value from a html node within java

Hello guys, I am trying out the JxBrowser component and I am unable to get the value of a selected HTML element...
List<DOMElement> paragraphs = divRoot.findElements(By.cssSelector("p"));
for (DOMElement paragraph : paragraphs) {
    System.out.println("paragraph.getNodeValue() = " +
            paragraph.getNodeValue());
}
I am able to find the paragraphs, but I can't get their node values, that is, the text inside <p>I can't get this value</p>. The code should be okay because it is a straight copy of their own sample code: here
So my question is: what have I done wrong? The library seems to be imported properly. I am using library version 6.19.1 on a MacBook (and I even tried it on Windows 10 with the same result).
Alternatively, is there another Java browser solution with similar functions? What I need is to load a page, get some values out of some divs, and then simulate a click.

DOMElement.getNodeValue() returns the value of this node, depending on its DOMNodeType. The text you are trying to get is a child node of the paragraph element, so you need to reach it with paragraph.getChildren().get(0).
So the final code will look like the following:
for (DOMElement paragraph : paragraphs) {
    System.out.println("paragraph.getNodeValue() = " +
            paragraph.getChildren().get(0).getNodeValue());
}

Using multiple criteria to find a WebElement in Selenium

I am using Selenium to test a website. Does it work if I find an element by more than one criterion? For example:
driverChrome.findElements(By.tagName("input").id("id_Start"));
or
driverChrome.findElements(By.tagName("input").id("id_Start").className("blabla"));

No, it does not. You cannot chain locators like that; the syntax is not valid. However, you can write a single selector that covers all the criteria and pass it to findElements():
By byXpath = By.xpath("//input[(@id='id_Start') and (@class = 'blabla')]");
List<WebElement> elements = driver.findElements(byXpath);
This should return a list of input elements that have the id id_Start and the class name blabla.

To combine By statements, use ByChained:
driverChrome.findElements(
    new ByChained(
        By.tagName("input"),
        By.id("id_Start"),
        By.className("blabla")
    )
);
However, if the criteria refer to the same element, see @Saifur's answer.

CSS selectors would be perfect in this scenario.
Your example would be:
By.cssSelector("input#id_Start.blabla")
There is a lot of information available if you search for CSS selectors. Also, when dealing with classes, CSS is easier than XPath because XPath treats class as a literal string, whereas CSS treats it as a space-delimited collection.
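
The difference in how XPath handles the class attribute can be seen without Selenium. The sketch below swaps in the JDK's built-in XPath engine (javax.xml.xpath) and a small, made-up XML fragment; the class name ClassMatchDemo, the helper countMatches, and the sample markup are all assumptions for illustration. An exact comparison on @class misses an element that carries a second class, while the usual contains/concat idiom matches on the class token.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassMatchDemo {

    // Counts the nodes in the given XML that match the XPath expression.
    static int countMatches(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or XML: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample markup: the second input has two classes.
        String xml = "<form>"
                + "<input id=\"a\" class=\"blabla\"/>"
                + "<input id=\"b\" class=\"blabla wide\"/>"
                + "</form>";

        // Literal string comparison: only class="blabla" matches.
        System.out.println(countMatches(xml, "//input[@class='blabla']")); // prints 1

        // Token-style test: both inputs carry the blabla class.
        System.out.println(countMatches(xml,
                "//input[contains(concat(' ', normalize-space(@class), ' '), ' blabla ')]")); // prints 2
    }
}
```

A CSS selector such as input.blabla behaves like the second expression, which is why it is usually the shorter choice when matching on classes.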

Based on @George's reply, the same code for C#:
// reference
using OpenQA.Selenium.Support.PageObjects;
...
int allElements = _driver.FindElements(new ByChained(
    By.CssSelector(".sc-pAyMl.cnszJw"),
    By.Id("base-field")
)).Count();
