Jsoup clean title tag failure - java

I am using Jsoup 1.9.2 to process and clean some XML input of specific tags. During this, I noticed that Jsoup behaves strangely when it is asked to clean title tags. Specifically, other XML tags within the title tag do not get removed, and in fact get replaced by their escaped forms.
I created a short unit test for this as below. The test fails, as output comes out with the value of CuCl<sub>2</sub>.
#Test
public void stripXmlSubInTitle() {
final String input = "<title>CuCl<sub>2</sub></title>";
final String output = Jsoup.clean(input, Whitelist.none());
assertEquals("CuCl2", output);
}
If the title tag is replaced with other tags (e.g., p or div), then everything works as expected. Any explanation and workaround will be appreciated.

The title tag should be used within the head (or in HTML5 within the html) tag. Since it is used to display the title of the HTML document, mostly in a browser window/tab, it is not supposed to have child tags.
JSoup treats it differently than actual content tags like p or div, the same applies for textarea.
Edit:
You could do something like this:
public static void main(String[] args) {
try {
final String input = "<content><title>CuCl<sub>2</sub></title><othertag>blabla</othertag><title>title with no subtags</title></content>";
Document document = Jsoup.parse(input);
Elements titles = document.getElementsByTag("title");
for (Element element : titles) {
element.text(Jsoup.clean(element.ownText(), Whitelist.none()));
}
System.out.println(document.body().toString());
} catch (Exception e) {
e.printStackTrace();
}
}
That would return:
<body>
<content>
<title>CuCl2</title>
<othertag>
blabla
</othertag>
<title>title with no subtags</title>
</content>
</body>
Depending on your needs, some adjustments need to be made, e.g.
System.out.println(Jsoup.clean(document.body().toString(), Whitelist.none()));
That would return:
CuCl2 blabla title with no subtags

Related

How to retrieve data from data table from Sports Reference using JSoup?

I'm attempting to use JSoup to retrieve the amount of wins for a team from a Sports Reference table.
Specifically, I am trying to receive the following data point highlighted below, with the html code provided
Below is what I have tried already, but I get a null pointer exception when trying to access the text of this element, telling me that my code is likely not parsing the HTML code correctly.
Element wins = document.selectFirst("td[data-stat=\"wins\"]");
What I want is for the text of this element to be 34 (or some number depending on the number of wins for the team).
Check what your Document was able to read from page and print it. If it contains HTML content which can be dynamically added by JavaScript by browser, you need to use as tool Selenium not Jsoup.
For reading HTML source, you can write similar to:
import java.io.IOException;
import org.jsoup.Jsoup;
public class JSoupHTMLSourceEx {
public static void main(String[] args) throws IOException {
String webPage = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
String html = Jsoup.connect(webPage).get().html();
System.out.println(html);
}
}
Since Jsoup supports cssSelector, you can try to get an element like:
public static void main(String[] args) {
String webPage = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
String html = Jsoup.connect(webPage).get().html();
Document document = Jsoup.parse(html);
Elements tds = document.select("#team_misc > tbody > tr:nth-child(1) > td:nth-child(2)");
for (Element e : tds) {
System.out.println(e.text());
}
}
But better solution is to use Selenium - a portable framework for testing web applications (more details about Selenium tool):
public static void main(String[] args) {
String baseUrl = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
WebDriver driver = new FirefoxDriver();
driver.get(baseUrl);
String innerText = driver.findElement(
By.xpath("//*[#id="team_misc"]/tbody/tr[1]/td[1]")).getText();
System.out.println(innerText);
driver.quit();
}
}
Also you can try instead of:
driver.findElement(By.xpath("//*[#id="team_misc"]/tbody/tr[1]/td[1]")).getText();
in this form:
driver.findElement(By.xpath("//[#id="team_misc"]/tbody/tr[1]/td[1]")).getAttribute("innerHTML");
P.S. In the future it would be useful to add source links from where you want to get information or at least snippet of the DOM structure instead of image.

Jsoup Scraping HTML dynamic content

I'm new to Jsoup and I have been trying to create a small code that gets the name of the items in a steam inventory using Jsoup.
public Element getItem(String user) throws IOException{
Document doc;
doc = Jsoup.connect("http://steamcommunity.com/id/"+user+"/inventory").get();
Element element = doc.getElementsByClass("hover_item_name").first();
return element;
}
this methods returns:
<h1 class="hover_item_name" id="iteminfo0_item_name"></h1>
and I want the information beetwen the "h1" labels which is generated when you click on a specific window.
Thank you in advance.
You can use the .select(String cssQuery) method:
doc.select("h1") gives you all h1 Elements.
If you need the actual Text in these tags use the .text() for each Element.
If you need a attribute like class or id use .attr(String attributeKey) on a Element eg:
doc.getElementsByClass("hover_item_name").first().attr("id")
gives you "iteminfo0_item_name"
But if you need to perform clicks on a website you can't do that with JSoup, hence JSoup is a HTML parser and not a browser alternative. Jsoup can't handle dynamic content.
But what you could do is, firstly scrape the relevant data in your h1 tags and then send a new .post() request, respectively an ajax call
If you rather want a real webdriver, have a look at Selenium.
Use .text() and return a String, i.e.:
public String getItem(String user) throws IOException{
Document doc;
doc = Jsoup.connect("http://steamcommunity.com/id/"+user+"/inventory").get();
Element element = doc.getElementsByClass("hover_item_name").first();
String text = element.text();
return text;
}

Replace string with jsoup only in text portions

I have found several topics with similar questions and valuable answers, but I am still struggling with this:
I want to parse some html with Jsoup so I can replace, for example,
"changeme"
with
<changed>changeme</changed>
, but only if it appears on a text portion of the html, no if it is part of a tag. So, starting with this html:
<body>
<p>test changeme app</p>
</BODY>
</HTML>
I would want to get to this:
<body>
<p>test <changed>changeme</changed> app</p>
</BODY>
</HTML>
I have tried several approaches, this one is which brings me closer to the desired result:
Document doc = null;
try {
doc = Jsoup.parse(new File("tmp1450348256397.txt"), "UTF-8");
} catch (Exception ex) {
}
Elements els = doc.body().getAllElements();
for (Element e : els) {
if (e.text().contains("changeme")) {
e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
}
}
html = doc.toString();
System.out.println(html);
But with this approach I find two problems:
<body>
<p><a href="http://<changed>changeme</changed> .html">test
<changed>
changeme
</changed>
app</a></p>
</BODY>
</HTML>
Line breaks are inserted before and after the new element I am introducing. This is not a real problem as I coul get rid of them if I use #changed# to do the replacing and after the doc.toString() I replace them again to the desired value (with < >).
The real problem: The URL in the href has been modified, and I don't want it to happen.
Ideas? Thx.
Here is my solution:
String html=""
+"<p><a href=\"http://changeme.html\">"
+ "test changeme "
+ "<div class=\"changeme\">"
+ "inner text changeme"
+ "</div>"
+ " app</a>"
+"</p>";
Document doc = Jsoup.parse(html);
Elements els = doc.body().getAllElements();
for (Element e : els) {
List<TextNode> tnList = e.textNodes();
for (TextNode tn : tnList){
String orig = tn.text();
tn.text(orig.replaceAll("changeme","<changed>changeme</changed>"));
}
}
html = doc.toString();
System.out.println(html);
TextNodes are always leaf nodes, i.e. they do not contain more HTML elements. In your original approach you replace the HTML of an element with new HTML with replaced changme strings. You only check for the changeme to be part of the TextNodes contents, but you replace every occurrence in the HTML string of the element, including all occurrences outside TextNodes.
My solution basically works like yours, but I use the JSoup method textNodes(). This way I don't need to typecast.
P.S.
Of course, my solution as well as yours will contain <changed>changeme</changed> instead of <changed>changeme</changed> in the end. This may or may not be what you want. If you do not want this, then your result is not any more valid HTML, since changed is no valid HTML tag. Jsoup will not help you in this case. However, you can of course replace in the resulting string all <changed>changeme</changed> again - outside JSoup.
I think your issue is that you're replacing the elements html rather than just its text, change:
e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
to
e.text(e.text().replaceAll("changeme","<changed>changeme</changed>"));
the line breaks issue can probably be solved by doing doc.outputSettings().prettyPrint(false); before doing html = doc.toString();
Finally I tried this solution (at the end of the question), using TextNodes:
How I can replace "text" in the each tag using Jsoup
This is the resulting code:
Elements els = doc.body().getAllElements();
for (Element e : els) {
for (Node child : e.childNodes()){
if (child instanceof TextNode && !((TextNode) child).isBlank()) {
((TextNode)child).text(((TextNode)child).text().replaceAll("changeme","<changed>changeme</changed>"));
}
}
}
Now the output is the expected, and it even does not introduce extra break lines. In this case prettyPrint must be set to True.
The only problem is that I don't really understand the difference of using TextNode vs Element.text(). If someone wants to provide some info it will be much appreciated.
Thanks.

Does something like assertHTMLEquals exist for unit testing

In my test I check:
assertEquals("<div class=\"action-button\" title=\"actionButton\">"</div>", result);
If someone changes the html (result), putting there SPACE, the HTML still valid, but my test would fail.
Is there some way to compare two html pieces if those are equal as HTML. Like assertHTMLEquals
XML UNIT says that this two lines are equal:
string1:
<ldapConfigurations>
<ldap tenantid="" active="false">
</ldap>
</ldapConfigurations>
string2:
<ldapConfigurations>
<ldapdd tenantid="" active="false">
</ldap>
</ldapConfigurations>
but they are not, as you can see. (see: ldapdd )
This won't necessarily work for your case, but if your HTML happens to be valid XML it will. You can use this tool called xmlunit. With it, you can write an assert method that looks like this:
public static void assertXMLEqual(Reader reader, String xml) {
try {
XMLAssert.assertXMLEqual(reader, new StringReader(xml));
} catch (Exception ex) {
ex.printStackTrace();
XMLAssert.fail();
}
}
If that doesn't work for you, maybe there's some other tool out there meant for HTML comparisons. And if that doesn't work, you may want to end up using a library like jtagsoup (or whatever) and comparing if all the fields it parses are equal.
You can achieve malformed HTML asserting throught the TolerantSaxDocumentBuilder utility of XMLUnit.
TolerantSaxDocumentBuilder tolerantSaxDocumentBuilder =
new TolerantSaxDocumentBuilder(XMLUnit.newTestParser());
HTMLDocumentBuilder htmlDocumentBuilder =
new HTMLDocumentBuilder(tolerantSaxDocumentBuilder);
XMLAssert.assertXMLEqual(htmlDocumentBuilder.parse(expectedHTML),
htmlDocumentBuilder.parse(actualHTML));
To support badly formed HTML (such as elements without closing tags - unthinkable in XML), you must make use of an additional document builder, the TolerantSaxDocumentBuilder, along with the HTMLDocumentBuilder (this one will allow asserting on web pages). After that, assert the documents as usual.
Working code example:
public class TestHTML {
public static void main(String[] args) throws Exception {
String result = "<div class=\"action-button\" title=\"actionButton\"> </div>";
assertHTMLEquals("<div class=\"action-button\" title=\"actionButton\"></div>", result); // ok!
// notice it is badly formed
String expectedHtml = "<html><title>Page Title</title>"
+ "<body><h1>Heads<ul>"
+ "<li id='1'>Some Item<li id='2'>Another item";
String actualMalformedHTML = expectedHtml.replace(" ", " "); // just added some spaces, wont matter
assertHTMLEquals(expectedHtml, actualMalformedHTML); // ok!
actualMalformedHTML = actualMalformedHTML.replace("Heads", "Tails");
assertHTMLEquals(expectedHtml, actualMalformedHTML); // assertion fails
}
public static void assertHTMLEquals(String expectedHTML, String actualHTML) throws Exception {
TolerantSaxDocumentBuilder tolerantSaxDocumentBuilder = new TolerantSaxDocumentBuilder(XMLUnit.newTestParser());
HTMLDocumentBuilder htmlDocumentBuilder = new HTMLDocumentBuilder(tolerantSaxDocumentBuilder);
XMLAssert.assertXMLEqual(htmlDocumentBuilder.parse(expectedHTML), htmlDocumentBuilder.parse(actualHTML));
}
}
Notice that XML functions, such as XPath, will be available to your HTML document as well.
If using Maven, add this to your pom.xml:
<dependency>
<groupId>xmlunit</groupId>
<artifactId>xmlunit</artifactId>
<version>1.4</version>
</dependency>

HTML Parsing using Jsoup.Jar

Document doc = Jsoup.connect("http://reviews.opentable.com/0938/9/reviews.htm").get();
Element part = doc.body();
Elements parts = part.getElementsByTag("span");
String attValue;
String html;
for(Element ent : parts)
{
if(ent.hasAttr("class"))
{
attValue = ent.attr("class");
if(attValue=="BVRRReviewText description")
{
System.out.println("\n");
html=ent.text();
System.out.println(html);
}
}
}
Am using Jsoup.jar for the above program.
I am accessing the webpage and my aim is to the print the text that is found within the tag <span class="BVRRReviewText description">text</span>.
But nothing is getting printed as output. There is no contents added to the String html in the program. But attValue is getting all the attribute values of the span tag.
Where must I have went wrong? Please advise.
if(attValue=="BVRRReviewText description")
should be
if(attValue.equals("...")) surely?
This is Java, not Javascript.
Change
attValue=="BVRRReviewText description"
for
attValue.matches("...")

Categories

Resources