Assuming the following document:
<html>
<body>
<div>
<a>Home</a>
</div>
<div>
<a>Link to a page</a>
<b>Bold text</b>
<a>Link to another page</a>
</div>
</body>
</html>
If I run this XPath, I get the following result:
/html/body/div/a/text() -> HomeLink to a pageLink to another page
I am looking for a way to reverse-engineer the results and extract the individual XPath selectors and their associated results as simply as possible. Something like:
/html/body/div[1]/a[1]/text() <-> Home
/html/body/div[2]/a[1]/text() <-> Link to a page
/html/body/div[2]/a[2]/text() <-> Link to another page
I can imagine a more complicated program that traverses the DOM tree or uses SAX parsing, but that looks too complex.
Can someone figure out a simpler way to achieve this result in XPath (maybe helped by a bit of Java as well)? Basically, the problem is to know the index of each tag and the associated result for each successful combination.
Thanks
Unfortunately, I don't know Java.
Here is some sample Ruby code using the nokogiri gem:
require 'nokogiri'
doc = Nokogiri::HTML open('/tmp/input.html')
doc.xpath('//a//text()').each {|a| puts "#{a.path} -> #{a.text}" }
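Since the question asked about Java, here is a rough sketch of the same idea using only the JDK's built-in DOM and XPath classes. It assumes the input is well-formed XML (real-world HTML would need an HTML parser first), and the class and method names (IndexedPaths, pathOf, describe) are made up for illustration. Unlike the output wished for in the question, it puts an index on every step, including html and body.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IndexedPaths {
    // Walk up from a text node and build a path with a 1-based index on every element step.
    static String pathOf(Node textNode) {
        StringBuilder path = new StringBuilder("/text()");
        for (Node n = textNode.getParentNode();
             n != null && n.getNodeType() == Node.ELEMENT_NODE;
             n = n.getParentNode()) {
            int index = 1;
            // Count preceding siblings with the same tag name to get the position.
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE
                        && s.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    // Evaluate the XPath and pair each matched text node with its indexed path.
    static List<String> describe(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Node hit = hits.item(i);
                lines.add(pathOf(hit) + " <-> " + hit.getNodeValue().trim());
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed well-formed version of the question's document.
        String xml = "<html><body><div><a>Home</a></div>"
                + "<div><a>Link to a page</a><b>Bold text</b><a>Link to another page</a></div>"
                + "</body></html>";
        describe(xml, "/html/body/div/a/text()").forEach(System.out::println);
    }
}
```

The index is computed by counting same-named preceding siblings, which is what a positional predicate like a[2] means in XPath, so each generated path should select exactly the element it was derived from.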
I've added an image of the HTML, since I'm not able to copy multiple lines from Inspect Element for some reason.
I'm trying to fill the input of this form
<div class="field" xpath="1"><label class="label">E-Mail *</label> <div class="control is-clearfix"><input type="text" autocomplete="on" class="input"> <!----> <!----> <!----></div> <!----></div>
My current xpath
"/html/body/div[5]/div[@class='animation-content modal-content']/div/section//section[@class='tab-content']/div[2]/div[1]/div/input[@type='text']"
The problem is that the XPath I'm using changes on every submission, so I can only submit once. If someone can provide an XPath that doesn't change on every submission, I would really appreciate it!
According to the information you provided, you can try this XPath:
//div[contains(@class, "field")]//input[@class="input"]
To find the name field:
//label[contains(text(), "Name")]/following-sibling::div//input[@class="input"]
The email field:
//label[contains(text(), "E-Mail")]/following-sibling::div//input[@class="input"]
The phone field:
//label[contains(text(), "Phone")]/following-sibling::div//input[@class="input"]
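To see why anchoring on the label text is more stable than an absolute path, here is a small sketch that evaluates the same kind of expression with the JDK's XPath engine rather than Selenium. The form markup is a simplified, well-formed stand-in for the real page, and the name attributes and the LabelAnchoredXPath/inputNameFor names are hypothetical, added only so the matched input can be identified.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class LabelAnchoredXPath {
    // Simplified, well-formed stand-in for the form; the name attributes are hypothetical.
    static final String FORM =
        "<form>"
        + "<div class=\"field\"><label class=\"label\">Name *</label>"
        + "<div class=\"control is-clearfix\"><input type=\"text\" class=\"input\" name=\"name\"/></div></div>"
        + "<div class=\"field\"><label class=\"label\">E-Mail *</label>"
        + "<div class=\"control is-clearfix\"><input type=\"text\" class=\"input\" name=\"email\"/></div></div>"
        + "</form>";

    // Return the name attribute of the input that follows the given label, or null if none matches.
    static String inputNameFor(String labelText) {
        // Note: labelText is pasted into the expression, so it must not contain double quotes.
        String expr = "//label[contains(text(), \"" + labelText + "\")]"
                + "/following-sibling::div//input[@class=\"input\"]";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(FORM)));
            Element input = (Element) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODE);
            return input == null ? null : input.getAttribute("name");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(inputNameFor("E-Mail"));
    }
}
```

Because the expression depends only on the label text and the local structure around it, it should keep matching even if the wrapper divs higher up in the page are renumbered between submissions.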
The XPath added by Roman is satisfactory for your needs. You just need to understand how you can improve your XPath; there are multiple ways to do it.
I will add some more XPath expressions that may be helpful in the future.
I personally prefer writing the XPath this way:
//label[contains(text(), "Name")]/following-sibling::div//input[@class="input"]
But there are other ways too. Here is an XPath from my project that shows how parent and following-sibling can be used:
//label[contains(text(),'PlantCode*')]//parent::div[@class='rb_Work_FieldContainer']//following-sibling::div[contains(@class,'rb_Work_FieldValueArea rb_Work_FieldValueArea_create ')]//textarea[@class='textarea']
These are some of the ways to build an XPath. You can also use a Chrome extension such as ChroPath to help you build it.
I am trying to select, using Jsoup, a <div> that has multiple classes:
<div class="content-text right-align bold-font">...</div>
The syntax for doing so, to the best of my understanding, should be:
document.select("div.content-text.right-align.bold-font");
However, for some reason, this doesn't work for me.
When I try the same exact syntax on JSFIDDLE, it works without a hitch.
Does multi-class selection work in Jsoup?
(I'd rather find out that this is a bug in my code than find out that this is a Jsoup limitation :)
UPDATE (thanks to the answer below): Jsoup works perfectly with the aforementioned syntax.
Works for me with latest Jsoup (1.5.2).
String html = "<div class=\"content-text right-align bold-font\">foo</div>";
Document document = Jsoup.parse(html);
Elements elements = document.select("div.content-text.right-align.bold-font");
System.out.println(elements.text()); // foo
So either you're possibly using an outdated version of Jsoup which exposes a bug related to this, or the actual HTML doesn't contain a <div> like that.
These may be helpful to you in the future. Have fun.
Jsoup selectors,
jQuery selectors
I'm writing a scraper. When I use inspect element in chrome I see the following:
but when I run my code Elements data = doc.select("div.item-header"); and I print the object data I see that the object has the following chunk of html in it:
<div class="item-header">
<h1 class="text size-20">Snake print bell sleeves top</h1>
<div class="text size-12 muted brandname ma_top5">
<!-- data here is irrelevant -->
</div>
</div>
So, what I can't figure out is, why does my code get a different html than that visible in chrome's inspect element? What am I missing here?
I'm using java, the library is Jsoup. Any help is greatly appreciated.
Websites consist of HTML and JavaScript code. Often that JavaScript is executed when the page is loaded, and it's possible that the source of the page is modified or that additional content is loaded by asynchronous AJAX calls. Jsoup doesn't execute JavaScript, so it can only parse the original HTML document.
Don't use Chrome's Inspect option as it presents HTML after possible transformations. Use View source (CTRL+U). This way you'll see original HTML source unmodified by JavaScript (you can also try reloading the page with JavaScript disabled). And that original source is what gets downloaded and parsed by Jsoup.
If that's the case and you really want to parse the data that's loaded by JavaScript try to observe XHR requests in Chrome's Network tab. You can check this answer to see what I mean: How to Load Entire Contents of HTML - Jsoup
I am using Selenium and Java to write a test. When I use the code below:
List<WebElement> elements = wait.until(ExpectedConditions.presenceOfAllElementsLocatedBy
(By.xpath("//div[.//span[text()='Map']]//*")));
for (WebElement e : elements) {
System.out.println("=>" + e.getTagName() + "<=");
}
it shows all the web elements in that <div> tag.
Result:
=>span<=
=>div<=
=>div<=
=>path<=
=>path<=
=>span<=
As you can see, some of the elements have the tag name path, but when I use the code below, it says that it could not find the element.
List<WebElement> elements = wait.until(ExpectedConditions.presenceOfAllElementsLocatedBy
(By.xpath("//div[.//span[text()='Map']]//path")));
It's not easy to find the real issue without knowing your HTML structure.
Still, I think there is an issue in your XPath.
Try the XPath below:
//div//span[text()='Map']//path
Hope it helps :)
By.xpath("//div[.//span[text()='Map']]//*") will return all the descendants of the <div> that contains span[text()='Map'] in the HTML hierarchy.
For example, this html structure will produce the same results you have
<div>
<span>Map</span>
<div></div>
<div>
<path></path>
<path></path>
</div>
<span></span>
</div>
As you can see, <path> is not a direct child of <span>, so By.xpath("//div[.//span[text()='Map']]//path") is not a valid XPath.
The issue was related to web elements that Selenium could not navigate to with that XPath. The element I was trying to catch was inside an SVG element, which Selenium did not detect. Have a look here; this is exactly what was happening to me.
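A commonly cited explanation for this behaviour is that elements inside <svg> live in the SVG namespace, so a plain name test such as //path does not match them, while //* and local-name() do. The sketch below illustrates that namespace effect with the JDK's own XPath engine instead of Selenium, so treat it as an analogy rather than a reproduction. The markup and the SvgXPath/count names are assumptions made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SvgXPath {
    // Hypothetical markup: the two <path> elements sit inside an <svg> with the SVG namespace.
    static final String PAGE =
        "<div><span>Map</span><div>"
        + "<svg xmlns=\"http://www.w3.org/2000/svg\"><path d=\"M0 0\"/><path d=\"M1 1\"/></svg>"
        + "</div></div>";

    // Count the nodes an XPath expression selects in PAGE, using a namespace-aware parser.
    static int count(String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Wildcard: matches every descendant element, namespaced or not.
        System.out.println(count("//div[.//span[text()='Map']]//*"));
        // Plain name test: only matches elements in no namespace.
        System.out.println(count("//div[.//span[text()='Map']]//path"));
        // local-name() ignores the namespace and finds the SVG paths.
        System.out.println(count("//div[.//span[text()='Map']]//*[local-name()='path']"));
    }
}
```

If the same thing is happening in the browser, a locator along the lines of //div[.//span[text()='Map']]//*[local-name()='path'] is worth trying in By.xpath.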
I need to parse HTML and find the corresponding CSS styles. I can parse HTML and CSS separately, but I can't combine them. For example, I have an XHTML page like this:
<html>
<head>
<title></title>
</head>
<body>
<div class="abc">Hello World</div>
</body>
</html>
I have to search for "hello world" and find its class name, and after that I need to find its style from an external CSS file. Answers using Java, JavaScript, and PHP are all okay.
Use the jsoup library in Java, which is an HTML parser. You can see an example here.
For example you can do something like this:
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
String html = "<<your html content>>";
Document doc = Jsoup.parse(html);
Element ele = doc.getElementsContainingOwnText("Hello World").first().clone(); // get the tag containing "Hello World"
Set<String> classNames = ele.classNames(); // the class names of that element
You can explore the library further to fit your needs.
Similar question: Can jQuery get all CSS styles associated with an element?. Maybe CSS optimizers can do what you want. Take a look at unused-css.com; it's an online tool, but it also lists other tools.
As I understand it, you are able to parse the style sheet from an external file, which makes your task easy to solve. First, parse the HTML file with jsoup, which supports a jQuery-like selector syntax that makes complicated HTML files easier to parse. Then check this previous solution for parsing the CSS file. I'm not going to give a full solution; with these libraries all the heavy lifting is done internally, and the only thing you have to do is write the glue code that combines the two.
Using Java's java.util.regex:
String s = "<body>...<div class=\"abc\">Hello World</div></body>";
Pattern p = Pattern.compile("<div.+?class\\s*?=\\s*['\"]?([^ '\"]+).*?>Hello World</div>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(1));
}
prints abc
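The second half of the question (finding the style for that class) can be sketched the same way. This is only a naive illustration: it assumes the CSS has already been read into a string and that the rule is written as a simple ".abc { ... }" block. The sample CSS and the ClassStyleLookup, classOf and styleOf names are made up. A real CSS parser would be needed for grouped selectors, media queries, comments and so on.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClassStyleLookup {
    // Find the first class name of the <div> that directly wraps the given text, or null.
    static String classOf(String html, String text) {
        Pattern p = Pattern.compile(
                "<div[^>]+?class\\s*=\\s*['\"]?([^ '\"]+)[^>]*>" + Pattern.quote(text) + "</div>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Return the declarations of the first ".className { ... }" rule, or null if there is none.
    static String styleOf(String css, String className) {
        Pattern p = Pattern.compile("\\." + Pattern.quote(className) + "\\s*\\{([^}]*)\\}");
        Matcher m = p.matcher(css);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<body><div class=\"abc\">Hello World</div></body>";
        // Hypothetical stylesheet content, standing in for the external CSS file.
        String css = ".abc { color: red; font-weight: bold; }";
        String className = classOf(html, "hello world"); // match is case-insensitive
        System.out.println(className + " -> " + styleOf(css, className));
    }
}
```

Regular expressions are fragile for both HTML and CSS, so this is best treated as glue code for small, predictable inputs, not as a general solution.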