JSOUP check if element is img

I have an Elements object xxx.
I want to iterate over it and check whether each element is an img tag. How can I do that?

You can use the tagName:
Elements yourElements = ...
for( Element element : yourElements )
{
if ("img".equals(element.tagName()))
{
// It's an 'img'
}
else
{
// It's not an 'img'
}
}

Related

Can't click on element by text, if elements have the same text inside

I'm trying to click an element by its text from a list of elements, but sometimes several elements have the same text and the if block is not executed.
public void clickByText() {
String myText = "Text1";
List<WebElement> elements = driver.findElements(myElements);
for (WebElement e : elements) {
if (e.getText().equals(myText)) {
e.click();
break;
} else {
System.out.println("not exists");
break;
}
}
}

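Maybe don't compare the text with equals; use contains instead.
Also remove the break statements from your code: with a break in both branches, the loop always exits after the first element, so only the if block or only the else block runs, and only once.
Code example: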
public void clickByText() {
String myText = "Text1";
List<WebElement> elements = driver.findElements(myElements);
for (WebElement e : elements) {
if (e.getText().contains(myText)) {
e.click();
} else {
System.out.println("not exists: " + e.getText());
}
}
}
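If the goal is to click only the first element with a given text, another option is to stop at the first match and report "not exists" only after every element has been checked. The sketch below shows just that control flow on plain strings so that it runs without Selenium; indexOfFirstMatch and the sample texts are hypothetical stand-ins for calling e.getText() on each WebElement.

```java
import java.util.List;

class FirstMatchSketch {
    // Returns the index of the first text equal to target, or -1 if none matches.
    static int indexOfFirstMatch(List<String> texts, String target) {
        for (int i = 0; i < texts.size(); i++) {
            if (texts.get(i).equals(target)) {
                // Stop at the first match; with Selenium this is where e.click() would go.
                return i;
            }
        }
        // Reached only after every element has been checked.
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical element texts, including a duplicate.
        List<String> texts = List.of("Text2", "Text1", "Text1");
        int index = indexOfFirstMatch(texts, "Text1");
        if (index >= 0) {
            System.out.println("would click element at index " + index);
        } else {
            System.out.println("not exists");
        }
    }
}
```

Because the "not exists" message is produced after the loop rather than in an else branch, a non-matching first element no longer ends the search.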

I think it's an issue with duplicates, since a List in Java can contain duplicates whereas a Set cannot. Try the code below:
List<WebElement> elements = driver.findElements(By.cssSelector(""));
Set<WebElement> setElements = new HashSet<WebElement>(elements);
for (WebElement e : setElements) {
// rest of your code
}

Merging same elements in JSoup

I have an HTML string like
<b>test</b><b>er</b>
<span class="ab">continue</span><span> without</span>
I want to merge adjacent tags that are identical. In the above sample I want to get
<b>tester</b>
since both tags have the same name and no further attributes or styles. The span tags should stay as they are, because one of them has a class attribute. I am aware that I can iterate over the tree with Jsoup.
Document doc = Jsoup.parse(input);
for (Element element : doc.select("b")) {
}
But it's not clear to me how to look ahead (I guess with something like nextSibling), and then how to merge the elements.
Or is there a simple regex-based merge?
I can specify the attributes myself. A one-size-fits-all solution for every tag is not required.

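My approach would be as follows; see the comments in the code.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;
public class StackOverflow60704600 {
public static void main(final String[] args) {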
Document doc = Jsoup.parse("<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>");
mergeSiblings(doc, "b");
System.out.println(doc);
}
private static void mergeSiblings(Document doc, String selector) {
Elements elements = doc.select(selector);
for (Element element : elements) {
// get the next sibling
Element nextSibling = element.nextElementSibling();
// merge only if the next sibling has the same tag name and the same set of attributes
if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
&& nextSibling.attributes().equals(element.attributes())) {
// your element has only one child, but let's rewrite all of them if there's more
while (nextSibling.childNodes().size() > 0) {
Node siblingChildNode = nextSibling.childNodes().get(0);
element.appendChild(siblingChildNode);
}
// remove because now it doesn't have any children
nextSibling.remove();
}
}
}
}
output:
<html>
<head></head>
<body>
<b>tester</b>
<span class="ab">continue</span>
<span> without</span>
</body>
</html>
One more note on why I used the loop while (nextSibling.childNodes().size() > 0): a for loop or an iterator can't be used here, because appendChild adds the child to the target but also removes it from the source element, so the remaining children shift. This may not be visible in the example above, but the problem appears when you try to merge: <b>test</b><b>er<a>123</a></b>
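The shifting effect can be reproduced without jsoup. The sketch below is only an analogy: it uses the JDK's org.w3c.dom API, where appendChild also moves a node out of its current parent, and the class, method, and element names are made up for the illustration. It compares an index loop over the live child list with a drain loop that always takes the first remaining child.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class MoveChildrenSketch {
    // Builds <root><src><a/><b/><c/></src><dst/></root> and returns the root element.
    static Element buildTree() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            Element src = doc.createElement("src");
            Element dst = doc.createElement("dst");
            doc.appendChild(root);
            root.appendChild(src);
            root.appendChild(dst);
            for (String name : new String[] {"a", "b", "c"}) {
                src.appendChild(doc.createElement(name));
            }
            return root;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        }
    }

    // Index loop over the live child list: every append removes a child from src,
    // so the remaining children shift down and some of them are never visited.
    static int moveWithIndexLoop(Element root) {
        Element src = (Element) root.getFirstChild();
        Element dst = (Element) root.getLastChild();
        NodeList children = src.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            dst.appendChild(children.item(i));
        }
        return dst.getChildNodes().getLength();
    }

    // Drain loop: keep taking the first remaining child until none are left.
    static int moveWithDrainLoop(Element root) {
        Element src = (Element) root.getFirstChild();
        Element dst = (Element) root.getLastChild();
        while (src.hasChildNodes()) {
            dst.appendChild(src.getFirstChild());
        }
        return dst.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        System.out.println("index loop moved: " + moveWithIndexLoop(buildTree()));
        System.out.println("drain loop moved: " + moveWithDrainLoop(buildTree()));
    }
}
```

The drain loop has the same shape as the while loop in the answer: it never relies on an index that goes stale once a child has been moved.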

I tried to update the code from @Krystian G, but my edit was rejected, so I am posting it as a separate answer. The code is an excellent starting point, but it fails if a TextNode appears between the tags. For example,
<span> no class but further</span> (in)valid <span>spanning</span>
would result in
<span> no class but furtherspanning</span> (in)valid
The corrected code therefore looks like this:
public class StackOverflow60704600 {
public static void main(final String[] args) throws IOException {
String test1="<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>";
String test2="<b>test</b><b>er<a>123</a></b>";
String test3="<span> no class but further</span> <span>spanning</span>";
String test4="<span> no class but further</span> (in)valid <span>spanning</span>";
Document doc = Jsoup.parse(test1);
mergeSiblings(doc, "b");
System.out.println(doc);
}
private static void mergeSiblings(Document doc, String selector) {
Elements elements = doc.select(selector);
for (Element element : elements) {
Node nextElement = element.nextSibling();
// if the next Element is a TextNode but has only space ==> we need to preserve the
// spacing
boolean addSpace = false;
if (nextElement instanceof TextNode) {
// read the raw text; toString() would return the HTML-escaped form
String content = ((TextNode) nextElement).getWholeText();
if (!content.isBlank()) {
// the next element has some content
continue;
} else {
addSpace = true;
}
}
// get the next sibling
Element nextSibling = element.nextElementSibling();
// merge only if the next sibling has the same tag name and the same set of
// attributes
if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
&& nextSibling.attributes().equals(element.attributes())) {
// your element has only one child, but let's rewrite all of them if there's more
while (nextSibling.childNodes().size() > 0) {
Node siblingChildNode = nextSibling.childNodes().get(0);
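if (addSpace) {
// whitespace separated the two elements ==> preserve it once, before the first moved child
if (siblingChildNode instanceof TextNode) {
TextNode textChild = (TextNode) siblingChildNode;
textChild.text(" " + textChild.getWholeText());
} else {
element.appendChild(new TextNode(" "));
}
// reset the flag so later children do not get an extra space
addSpace = false;
}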
element.appendChild(siblingChildNode);
}
// remove because now it doesn't have any children
nextSibling.remove();
}
}
}
}

Read a specified line of text from a webpage with Jsoup

So I am trying to get the data from this webpage using Jsoup...
I've tried many different ways of doing it and have gotten close, but I don't know how to find the tags for specific stats (Attack, Strength, Defence, etc.).
So let's say, for example's sake, that I wanted to print out
'Attack', '15', '99', '200,000,000'
How should I go about doing this?

You can use CSS selectors in Jsoup to easily extract the column data.
// retrieve page source code
Document doc = Jsoup
.connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
.get();
// find all of the table rows
Elements rows = doc.select("div#contentHiscores table tr");
ListIterator<Element> itr = rows.listIterator();
// loop over each row
while (itr.hasNext()) {
Element row = itr.next();
// does the second col contain the word attack?
if (row.select("td:nth-child(2) a:contains(attack)").first() != null) {
// if so, assign each sibling col to variable
String rank = row.select("td:nth-child(3)").text();
String level = row.select("td:nth-child(4)").text();
String xp = row.select("td:nth-child(5)").text();
System.out.printf("rank=%s level=%s xp=%s", rank, level, xp);
// stop looping rows, found attack
break;
}
}

A very rough implementation is shown below. It is only a snippet; optimizations and other conditionals still need to be added.
public static void main(String[] args) throws Exception {
Document doc = Jsoup
.connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
.get();
Element contentHiscoresDiv = doc.getElementById("contentHiscores");
Element table = contentHiscoresDiv.child(0);
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
for (Element column : tds) {
if (column.children() != null && column.children().size() > 0) {
Element anchorTag = column.getElementsByTag("a").first();
if (anchorTag != null && anchorTag.text().contains("Attack")) {
System.out.println(anchorTag.text());
Elements attributeSiblings = column.siblingElements();
for (Element attributeSibling : attributeSiblings) {
System.out.println(attributeSibling.text());
}
}
}
}
}
}
output:
Attack
15
99
200,000,000

Empty / Null Nodes returned from getChildNodes

I'm trying to parse the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<docusign-cfg>
<tagConfig>
<tags>
<approve>approve</approve>
<checkbox>checkbox</checkbox>
<company>company</company>
<date>date</date>
<decline>decline</decline>
<email>email</email>
<emailAddress>emailAddress</emailAddress>
<envelopeID>envelopeID</envelopeID>
<firstName>firstName</firstName>
<lastName>lastName</lastName>
<number>number</number>
<ssn>ssn</ssn>
<zip>zip</zip>
<signHere>signHere</signHere>
<checkbox>checkbox</checkbox>
<initialHere>initialHere</initialHere>
<dateSigned>dateSigned</dateSigned>
<fullName>fullName</fullName>
</tags>
</tagConfig>
</docusign-cfg>
I want to read either the name or content of each tag in the <tags> tag. I can do so with the following code:
public String[] getAvailableTags() throws Exception
{
String path = "/docusign-cfg/tagConfig/tags";
XPathFactory f = XPathFactory.newInstance();
XPath x = f.newXPath();
Object result = null;
try
{
XPathExpression expr = x.compile(path);
result = expr.evaluate(doc, XPathConstants.NODE);
}
catch (XPathExpressionException e)
{
throw new Exception("An error occurred while trying to retrieve the tags");
}
Node node = (Node) result;
NodeList childNodes = node.getChildNodes();
String[] tags = new String[childNodes.getLength()];
System.out.println(tags.length);
for(int i = 0; i < tags.length; i++)
{
String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags[i] = content;
}
}
return tags;
}
After some searching I found that parsing the document this way causes the whitespace between tags to be read as child nodes. In this case the whitespace is treated as children of <tags>.
My output:
37
null
approve
null
checkbox
null
company
null
date
null
decline
null
email
null
emailAddress
null
envelopeID
null
firstName
null
lastName
null
number
null
ssn
null
zip
null
signHere
null
checkbox
null
initialHere
null
dateSigned
null
fullName
null
37 is the number of nodes it found in <tags>
Everything below 37 is the content of the tag array.
How are these null elements being added to the tag array despite my checking for null?

I think this is caused by how the tags array is indexed. The loop index i also advances for the nodes that the if check skips, so those positions are never written and stay null. Use a separate index for the tags array:
int j = 0;
for(int i = 0; i < tags.length; i++)
{
String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags[j++] = content;
}
}
Since you skip some of the child nodes, an array sized to the full child count wastes memory and leaves trailing null entries. You can use a List instead; if you specifically need a String array, you can convert the list to one afterwards.
public String[] getAvailableTags() throws Exception
{
String path = "/docusign-cfg/tagConfig/tags";
XPathFactory f = XPathFactory.newInstance();
XPath x = f.newXPath();
Object result = null;
try
{
XPathExpression expr = x.compile(path);
result = expr.evaluate(doc, XPathConstants.NODE);
}
catch (XPathExpressionException e)
{
throw new Exception("An error occurred while trying to retrieve the tags");
}
Node node = (Node) result;
NodeList childNodes = node.getChildNodes();
List<String> tags = new ArrayList<String>();
// a List has no length field; iterate over the child nodes instead
for(int i = 0; i < childNodes.getLength(); i++)
{
String content = childNodes.item(i).getNodeName().trim().replaceAll("\\s", "");
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags.add(content);
}
}
String[] tagsArray = tags.toArray(new String[tags.size()]);
return tagsArray;
}

The elements of the tags array default to null.
So the question is not how an element becomes null, but why it is left as null.
To prove this to yourself, add an else block like this:
if(childNodes.item(i).getNodeType() == Node.ELEMENT_NODE &&
childNodes.item(i).getNodeName() != null)
{
tags[i] = content;
} else {
tags[i] = "Foo Bar";
}
You should now see 'Foo Bar' instead of null.
A better solution would be to use an ArrayList and append the tags to it instead of writing into an array. You then do not need to track indexes, so there is less chance of this type of bug.
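A different technique from the ones in the answers above is to let XPath do the filtering: the expression /docusign-cfg/tagConfig/tags/* selects only element children, so the whitespace text nodes never reach the loop. The following is a minimal sketch using only the JDK; the class name, the method name tagNames, and the shortened sample XML are assumptions made for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ElementChildrenSketch {
    // Returns the names of the element children of <tags>.
    // The trailing "/*" step matches elements only, so whitespace text nodes are not selected.
    static List<String> tagNames(String xml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "/docusign-cfg/tagConfig/tags/*",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate the XPath expression", e);
        }
    }

    public static void main(String[] args) {
        // Shortened version of the XML from the question, whitespace included.
        String xml = "<docusign-cfg>\n  <tagConfig>\n    <tags>\n"
                + "      <approve>approve</approve>\n"
                + "      <checkbox>checkbox</checkbox>\n"
                + "    </tags>\n  </tagConfig>\n</docusign-cfg>";
        System.out.println(tagNames(xml));
    }
}
```

Collecting into a List also removes the need to size an array up front, which is the point the answers above make.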

Convert Iterator to a for loop with index in order to skip objects

I am using Jericho HTML Parser to parse some malformed html. In particular I am trying to get all text nodes, process the text and then replace it.
I want to skip specific elements from processing. For example, I want to skip all <a> elements and any element that has the attribute class="noProcess". So if a div has class="noProcess", I want to skip that div and all of its children. However, the skipped elements should still appear in the output after processing.
Jericho provides an Iterator for all nodes but I am not sure how to skip complete elements from the Iterator. Here is my code:
private String doProcessHtml(String html) {
Source source = new Source(html);
OutputDocument outputDocument = new OutputDocument(source);
for (Segment segment : source) {
if (segment instanceof Tag) {
Tag tag = (Tag) segment;
System.out.println("FOUND TAG: " + tag.getName());
// DO SOMETHING HERE TO SKIP ENTIRE ELEMENT IF IS <A> OR CLASS="noProcess"
} else if (segment instanceof CharacterReference) {
CharacterReference characterReference = (CharacterReference) segment;
System.out.println("FOUND CHARACTERREFERENCE: " + characterReference.getCharacterReferenceString());
} else {
System.out.println("FOUND PLAIN TEXT: " + segment.toString());
outputDocument.replace(segment, doProcessText(segment.toString()));
}
}
return outputDocument.toString();
}
It doesn't look like using the ignoreWhenParsing() method works for me as the parser just treats the "ignored" element as text.
I was thinking that if I could convert the Iterator loop to a for (int i = 0;...) loop, I might be able to skip the element and all its children by setting i to point to the EndTag and then continuing the loop, but I'm not sure.

I think you might want to consider a redesign of the way your segments are built. Is there a way to parse the html in such a way that each segment is a parent element that contains a nested list of child elements? That way you could do something like:
for (Segment segment : source) {
if (segment instanceof Tag) {
Tag tag = (Tag) segment;
System.out.println("FOUND TAG: " + tag.getName());
// DO SOMETHING HERE TO SKIP ENTIRE ELEMENT IF IS <A> OR CLASS="noProcess"
continue;
} else if (segment instanceof CharacterReference) {
CharacterReference characterReference = (CharacterReference) segment;
System.out.println("FOUND CHARACTERREFERENCE: " + characterReference.getCharacterReferenceString());
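// NOTE: childNodes() is hypothetical; it assumes the redesigned, nested segment structure described above
for(Segment child : segment.childNodes()) {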
//Use recursion to process child elements
//You will want to put your for loop in a separate method so it can be called recursively.
}
} else {
System.out.println("FOUND PLAIN TEXT: " + segment.toString());
outputDocument.replace(segment, doProcessText(segment.toString()));
}
}
Without more code to inspect, it's hard to determine whether restructuring the segment element is even possible or worth the effort.

I managed to get a working solution by using the getEnd() method of the Element that belongs to the Tag. The idea is to skip every segment that begins before a position you set: find the end position of the element you want to exclude and do not process anything that starts before that position:
final ArrayList<String> excludeTags = new ArrayList<String>(Arrays.asList(new String[] {"head", "script", "a"}));
final ArrayList<String> excludeClasses = new ArrayList<String>(Arrays.asList(new String[] {"noProcess"}));
Source.LegacyIteratorCompatabilityMode = true;
Source source = new Source(htmlToProcess);
OutputDocument outputDocument = new OutputDocument(source);
int skipToPos = 0;
for (Segment segment : source) {
if (segment.getBegin() >= skipToPos) {
if (segment instanceof Tag) {
Tag tag = (Tag) segment;
Element element = tag.getElement();
// check excludeTags
if (excludeTags.contains(tag.getName().toLowerCase())) {
skipToPos = element.getEnd();
}
// check excludeClasses
String classes = element.getAttributeValue("class");
if (classes != null) {
for (String theClass : classes.split(" ")) {
if (excludeClasses.contains(theClass.toLowerCase())) {
skipToPos = element.getEnd();
}
}
}
} else if (segment instanceof CharacterReference) { // for future use. Source.LegacyIteratorCompatabilityMode = true;
CharacterReference characterReference = (CharacterReference) segment;
} else {
outputDocument.replace(segment, doProcessText(segment.toString()));
}
}
}
return outputDocument.toString();
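The position-based skipping can be looked at in isolation from the parser. The sketch below reproduces only the skipToPos logic on a hand-written list of ranges; the Span class, its excluded flag, and the sample positions are hypothetical and merely stand in for Jericho segments and the excludeTags / excludeClasses checks.

```java
import java.util.ArrayList;
import java.util.List;

class SkipByPositionSketch {
    // Hypothetical stand-in for a parser segment: a source range, its text,
    // and a flag saying whether it starts an element that must be excluded.
    static final class Span {
        final int begin;
        final int end;
        final String text;
        final boolean excluded;

        Span(int begin, int end, String text, boolean excluded) {
            this.begin = begin;
            this.end = end;
            this.text = text;
            this.excluded = excluded;
        }
    }

    // Returns the texts of all spans that do not lie inside an excluded span.
    // The spans must be given in document order.
    static List<String> process(List<Span> spans) {
        List<String> kept = new ArrayList<>();
        int skipToPos = 0;
        for (Span span : spans) {
            if (span.begin < skipToPos) {
                continue; // still inside an excluded element
            }
            if (span.excluded) {
                skipToPos = span.end; // ignore everything up to the end of this element
                continue;
            }
            kept.add(span.text);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span(0, 5, "intro", false),
                new Span(5, 20, "<a>", true),    // excluded element covering 5..20
                new Span(8, 12, "link", false),  // nested inside the excluded range
                new Span(20, 25, "outro", false));
        System.out.println(process(spans));
    }
}
```

Since only begin positions are compared, nested children are skipped without tracking which end tag closes which element.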

This should work.
String skipTag = null;
for (Segment segment : source) {
if (skipTag != null) { // is skipping ON?
if (segment instanceof EndTag && // if EndTag found for the
skipTag.equals(((EndTag) segment).getName())) { // tag we're skipping
skipTag = null; // set skipping OFF
}
continue; // continue skipping (or skip the EndTag)
} else if (segment instanceof Tag) { // is tag?
Tag tag = (Tag) segment;
System.out.println("FOUND TAG: " + tag.getName());
if (HTMLElementName.A.equals(tag.getName())) { // if <a> ?
skipTag = tag.getName(); // set
continue; // skipping ON
} else if (tag instanceof StartTag) {
if ("noProcess".equals( // if <tag class="noProcess" ..> ?
((StartTag) tag).getAttributeValue("class"))) {
skipTag = tag.getName(); // set
continue; // skipping ON
}
}
} // ...
}
