Get Some Attributes with JSoup - java

I was practicing programming and got stuck (partly because of my lack of web programming knowledge) on this part: I need to get some information from this page: http://db.fowtcg.us/index.php?p=card&code=VS01-003+R , but only the card properties. I'm struggling a little with JSoup. I was able to fetch the data with:
Document doc = Jsoup.connect("http://db.fowtcg.us/?p=card&code=TTW-080+SR").get();
Elements newsHeadlines = doc.select("div.card-props");
System.out.println(newsHeadlines);
But I couldn't get the data back from the Element object (though I could see it was there while debugging).
How can I proceed to fetch this information?

Here, use this instead:
Elements property = doc.select("div.col-xs-12.col-sm-7.box.card-props");
You need to make sure the selector you use matches the original HTML document exactly.

You can also use the contains or ends-with attribute selectors:
//contains
Elements property = doc.select("div[class*=card-props]");
//ends-with
Elements property = doc.select("div[class$=card-props]");
See the link below to learn more about CSS selectors:
http://jsoup.org/cookbook/extracting-data/selector-syntax
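To show the selectors in action, here is a minimal, self-contained sketch. The HTML string is a made-up stand-in for the card page (the real page's markup may differ), so the class names are assumptions:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class CardPropsDemo {
    public static void main(String[] args) {
        // Stand-in HTML mimicking the card page's markup; the real page
        // and its class list may differ.
        String html = "<div class='col-xs-12 col-sm-7 box card-props'>"
                    + "<p>Cost: 2</p><p>ATK: 500</p></div>";
        Document doc = Jsoup.parse(html);

        // Matches any div whose class attribute contains "card-props"
        Elements props = doc.select("div[class*=card-props]");
        System.out.println(props.text()); // prints "Cost: 2 ATK: 500"
    }
}
```

The same `select` call works unchanged on a `Document` fetched with `Jsoup.connect(url).get()`.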

Related

Fetching data from another website with JSOUP

Basically, I need a table with all the possible books that exist, and I don't want to build it by hand, because I'm a very lazy person xD. So, my question is: can I use a site that I have in mind, cut off the parts of the site that I don't need, and leave only the search part (maybe with some changes to the layout)? Then make the search, find the book, and store in my database only the data that makes sense for me. Is that possible? I heard that JSOUP could help.
So, I just want some tips. (Thanks for reading.)
the site: http://www.isbn.bn.br/website/consulta/cadastro
Yes, you can do that with Jsoup. The main problem is that the URL you shared relies on JavaScript, so you'll need to use Selenium to force the JS to execute, or you can get each book's URL and parse that instead.
The way to parse a page with Jsoup is:
Document document = Jsoup.connect("YOUR-URL-GOES-HERE")
.userAgent("Mozilla/5.0")
.get();
Then you have the whole HTML in a Document, so you can get any Element it contains using CSS selectors. For example, if you want to retrieve the title of the page, you can use:
Elements elements = document.select("title");
And the same goes for every HTML tag you want to retrieve information from. You can check the Jsoup documentation for more worked examples: Jsoup
I hope it helps you!
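Tying the two steps above together, here is a runnable sketch of the parse-then-select flow. It parses a string instead of calling `Jsoup.connect(...)` so it works without network access; with a live URL the only change is how the `Document` is obtained:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleDemo {
    public static void main(String[] args) {
        // A string stands in for the fetched page so the example
        // runs offline; Jsoup.connect(url).get() yields the same Document type.
        String html = "<html><head><title>My Page</title></head><body></body></html>";
        Document doc = Jsoup.parse(html);

        // select() with a CSS selector, then first() for the single match
        String title = doc.select("title").first().text();
        System.out.println(title); // prints "My Page"
    }
}
```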

Is there a way to convert an element link to XPath

I have written a Jsoup class file to scrape a page and grab the hrefs for every element on the page. What I would like to do from there is to extract the Xpath for each of the elements from their hrefs.
Is there a way to do this in JSoup? If not, what is the best way to do this in Java (and are there any resources on this)?
Update
I want to clarify my question.
I want to scan a page for all the href identifiers and grab the links (that part is done). For my script, I need to get the xpath of all the elements I have identified and scraped from the (scanned) page.
The problem is that I assumed I could easily translate the href links to Xpath.
The comment from @Rishal dev Singh ended up being the right answer.
Check his link here:
http://stackoverflow.com/questions/7085539/does-jsoup-support-xpath
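Jsoup itself has no XPath support, but a positional XPath can be derived from a Jsoup `Element` by walking up its parent chain and counting same-named siblings. The helper below is a hand-rolled sketch, not a Jsoup API:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class XPathBuilder {
    // Builds a simple positional XPath (e.g. /html[1]/body[1]/div[1]/a[2])
    // by walking from the element up to the document root.
    static String xpathOf(Element el) {
        StringBuilder path = new StringBuilder();
        for (Element e = el; e != null && !e.tagName().equals("#root"); e = e.parent()) {
            // 1-based position among siblings with the same tag name
            int index = 1;
            for (Element sib = e.previousElementSibling(); sib != null; sib = sib.previousElementSibling()) {
                if (sib.tagName().equals(e.tagName())) index++;
            }
            path.insert(0, "/" + e.tagName() + "[" + index + "]");
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><a href='a'>x</a><a href='b'>y</a></div>");
        Element second = doc.select("a").get(1);
        System.out.println(xpathOf(second)); // prints "/html[1]/body[1]/div[1]/a[2]"
    }
}
```

Note this produces one valid XPath per element, not necessarily the shortest or most robust one.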

What Java API data structure is good for HTML trees?

For fun, I'm writing a basic parser that finds data within an HTML document. I want to find the best structure to represent the branches of the parsed file.
The criteria for "best structure" is this: I want to easily search for a tag's relative location and access its contents, like "the image in the second image tag after the third h3 tag in the body" or "the title tag in the header".
I expect to search the first level of tags for the tag I'm looking for, then move into the branch associated with that tag. That's the structure this question is looking for, but if there is a better way to find relative locations in an HTML document, please explain.
So that's the question. More generally, what kind of Java structures are available through the API that can represent tree data structures?
Don't reinvent the wheel; use an HTML parser like Jsoup. You will be able to get your tags with a CSS selector using the method Element#select(cssQuery).
Document doc = Jsoup.parse(file, encoding);
Elements elements = doc.select(cssQuery);
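The relative-location queries from the question map directly onto Jsoup's selector syntax plus a little traversal. A small sketch (the HTML is invented for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class RelativeQueryDemo {
    public static void main(String[] args) {
        String html = "<html><head><title>T</title></head><body>"
                    + "<h3>a</h3><h3>b</h3><h3>c</h3>"
                    + "<img src='1.png'><img src='2.png'></body></html>";
        Document doc = Jsoup.parse(html);

        // "the title tag in the header"
        System.out.println(doc.select("head > title").text()); // prints "T"

        // "the second image after the third h3":
        // :eq(2) matches the element with sibling index 2 (0-based),
        // and ~ selects following siblings of it
        Elements imgsAfter = doc.select("h3:eq(2) ~ img");
        Element secondImg = imgsAfter.get(1);
        System.out.println(secondImg.attr("src")); // prints "2.png"
    }
}
```

Note that `:eq(n)` in Jsoup counts the element's index among all its siblings, not just those with the same tag, so it works here because the h3 tags are the first children of body.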

Java Selenium fails to find all elements in the DOM using By.xpath("//*")

I have a strange scenario in which I am unable to find all elements in the DOM.
When viewing the DOM through Firefox / 'Inspect Elements', I clearly see some 'div' elements which are not present in the element-list generated with Java/Selenium:
List<WebElement> elements = webDriver.findElements(By.xpath("//*"));
I suspect that the line above does not provide any element that is a child of a non-visible element.
If my suspicion is not correct, then can anyone please explain the reason for what I'm seeing?
Otherwise, if this is indeed the case, then the only way around it would be to go over all non-visible elements and make them visible.
Is there any better way for handling this problem?
If yes - what is it?
If no - how do I make all elements visible (perhaps using JavascriptExecutor)?
Thanks
The other option is that the elements are in a frame. In that case you have to call webDriver.switchTo().frame(String name). Don't forget to switch back afterwards, ideally with webDriver.switchTo().defaultContent().
I think invisible elements are also accessible through Selenium. I've been accessing some elements that I'd made invisible myself. You cannot interact with them, though.
Naturally, as Dmitry suggests, getting all elements this way is really not feasible.
I'd suggest using a separate HTML parser library to get the required information for all HTML document nodes. For instance:
Get the full page source, using driver.getPageSource();
Use http://jsoup.org or any other parser to parse the document and extract the required data
Here is an example:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link"
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
String linkOuterH = link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"
I would like to add that querying all elements via WebDriver is an extremely slow operation. I faced this issue in my pet project, Page Recorder.
The solution was to use HtmlAgilityPack, a .NET HTML parser, to perform operations on all document nodes.
Problem found:
When I was viewing the webpage through Firefox / Inspect Element, the window was maximized.
When I was scraping the webpage with Java / Selenium, the window was not maximized.
On the specific webpage I was working on, some elements (mostly advertisements) are added (become "unhidden") as soon as the window reaches a certain size (there is probably some JavaScript code running on the client side that is responsible for this).
The current problem at hand is not within Selenium.
In order to solve it, one merely needs to add the following line:
webDriver.manage().window().maximize();

Jsoup is not finding my element

Perhaps I'm doing something wrong, but I'm trying to parse this page using jsoup, and for some reason it doesn't find the div I'm looking for:
doc = Jsoup.connect(params[0]).get();
content = doc.select("div.itemcontent").first().text();
Where am I going wrong here?
Thanks
The problem: you get a different website using jsoup than using a browser. I set another user agent in Jsoup, but no luck. Possibly the content is changed through JavaScript?
However, you can change the selector according to the website you actually get.
It's always a good idea to take a look into document as it's parsed - a simple System.out.println(doc) is enough.
Here are some steps you can try:
Print your Document doc (eg. using System.out)
Search for the required value(s) in there
Select those tags instead
I just played around a bit, but maybe you can use this snippet:
content = doc.select("description").first().text();
It seems to me, <description>...</description> is what you're looking for.
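One more note: `select(...).first()` returns null when nothing matches, which is why the original `.first().text()` chain throws a NullPointerException when the div is missing. A null-safe sketch (the `<description>` content here is invented):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SafeSelectDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<description>Hello</description>");

        // first() returns null on an empty match, so guard before calling text()
        Element desc = doc.select("description").first();
        String content = (desc != null) ? desc.text() : "(not found)";
        System.out.println(content); // prints "Hello"
    }
}
```

This also makes the failure mode explicit when the fetched document differs from what the browser shows.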
