Jsoup Scraping HTML dynamic content - java

I'm new to Jsoup and I have been trying to write a small piece of code that gets the names of the items in a Steam inventory.
public Element getItem(String user) throws IOException {
    Document doc = Jsoup.connect("http://steamcommunity.com/id/" + user + "/inventory").get();
    Element element = doc.getElementsByClass("hover_item_name").first();
    return element;
}
This method returns:
<h1 class="hover_item_name" id="iteminfo0_item_name"></h1>
and I want the text between the h1 tags, which is only generated when you click on a specific item in the inventory.
Thank you in advance.

You can use the .select(String cssQuery) method:
doc.select("h1") gives you all h1 Elements.
If you need the actual text inside these tags, call .text() on each Element.
If you need an attribute like class or id, use .attr(String attributeKey) on an Element, e.g.:
doc.getElementsByClass("hover_item_name").first().attr("id")
gives you "iteminfo0_item_name"
But if you need to perform clicks on a website, you can't do that with Jsoup, since Jsoup is an HTML parser and not a browser replacement. Jsoup can't handle dynamic content.
What you could do instead is first scrape the relevant data from the static page and then reproduce the site's AJAX call yourself, e.g. with a .post() request.
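A minimal sketch of such a manual request; the endpoint and parameters here are assumptions and would have to be copied from the real AJAX call visible in the browser's network tab:
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class AjaxPostExample {
    public static void main(String[] args) throws Exception {
        // hypothetical endpoint and form data; inspect the actual request in the dev tools
        Connection.Response response = Jsoup.connect("http://example.com/some/ajax/endpoint")
                .data("itemId", "0")
                .method(Connection.Method.POST)
                .ignoreContentType(true) // the response may be JSON rather than HTML
                .execute();
        System.out.println(response.body());
    }
}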
If you would rather use a real webdriver, have a look at Selenium.

Use .text() and return a String, i.e.:
public String getItem(String user) throws IOException {
    Document doc = Jsoup.connect("http://steamcommunity.com/id/" + user + "/inventory").get();
    Element element = doc.getElementsByClass("hover_item_name").first();
    String text = element.text();
    return text;
}
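A hedged usage sketch ("someUser" is a placeholder); note that for this particular element the returned text will most likely be empty, because, as explained above, its content is filled in by JavaScript at runtime:
String itemName = getItem("someUser");
System.out.println(itemName); // likely an empty string for the hover_item_name element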

Related

How to retrieve data from data table from Sports Reference using JSoup?

I'm attempting to use JSoup to retrieve the amount of wins for a team from a Sports Reference table.
Specifically, I am trying to retrieve the data point highlighted below, for which the HTML code is provided.
Below is what I have tried already, but I get a NullPointerException when trying to access the text of this element, which tells me that my code is likely not parsing or locating the HTML correctly.
Element wins = document.selectFirst("td[data-stat=\"wins\"]");
What I want is for the text of this element to be 34 (or some number depending on the number of wins for the team).
Check what your Document was able to read from the page by printing it. If the content you need is added to the page dynamically by JavaScript in the browser, you need Selenium as your tool, not Jsoup.
To read the HTML source, you can write something like:
import java.io.IOException;

import org.jsoup.Jsoup;

public class JSoupHTMLSourceEx {

    public static void main(String[] args) throws IOException {
        String webPage = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
        String html = Jsoup.connect(webPage).get().html();
        System.out.println(html);
    }
}
Since Jsoup supports CSS selectors, you can try to get the element like this:
public static void main(String[] args) throws IOException {
    String webPage = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
    String html = Jsoup.connect(webPage).get().html();
    Document document = Jsoup.parse(html);
    Elements tds = document.select("#team_misc > tbody > tr:nth-child(1) > td:nth-child(2)");
    for (Element e : tds) {
        System.out.println(e.text());
    }
}
But a better solution is to use Selenium, a portable framework for testing web applications:
public static void main(String[] args) {
    String baseUrl = "https://www.basketball-reference.com/teams/CHI/2020.html#all_team_misc";
    WebDriver driver = new FirefoxDriver();
    driver.get(baseUrl);
    String innerText = driver.findElement(
            By.xpath("//*[@id='team_misc']/tbody/tr[1]/td[1]")).getText();
    System.out.println(innerText);
    driver.quit();
}
You can also try, instead of:
driver.findElement(By.xpath("//*[@id='team_misc']/tbody/tr[1]/td[1]")).getText();
this form:
driver.findElement(By.xpath("//*[@id='team_misc']/tbody/tr[1]/td[1]")).getAttribute("innerHTML");
P.S. In the future it would be useful to add source links to the pages you want to get information from, or at least a snippet of the DOM structure, rather than an image.

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text (it doesn't matter what the text is); the content just needs to be non-empty text rather than, say, only an image.
Example of links I want:
<a href="...">Link to Some Page</a>
since it contains the text "Link to Some Page".
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the text() method gets you clean text, so any HTML fragments inside the element are not returned.
Document document = // get the doc
Elements linksOnPage = document.select("a");
for (Element pageElem : linksOnPage) {
    if (pageElem.text().trim().equals(""))
        continue;
    String link = pageElem.attr("abs:href");
    // do smth with it
}
I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
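Another option, as a sketch: Jsoup's Element.hasText() reports whether an element (or any of its children) contains non-blank text, so you can keep the a[href] selector from the question and simply filter with it:
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    if (!page.hasText())
        continue; // skips links that are empty or wrap only images
    String link = page.attr("abs:href");
    // I do stuff with the link
}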

Android - how to parse HTML with Jsoup and fill an ArrayList?

I want to read the data from this HTML link:
http://jadvalbaz.blog.ir/post/%D8%B1%D8%A7%D9%87%D9%86%D9%85%D8%A7%DB%8C-%D8%AD%D9%84-%D8%AC%D8%AF%D9%88%D9%84-%D8%AD%D8%B1%D9%81-%D8%B0
If you look at the view-source, you see:
ذات اریه (پنومونی- سینه پهلو)<br>ذر (مورچه ریز)<br>ذرع (مقیاس طول)<br>ذره ای بنیادی از رده هیبرونها که بار الکتریکی ندارد (لاندا)<br>ذره منفی اتم (الکترون)<br>ذریه (نسل)<br>ذل (خواری)<br>ذم (نکوهش)<br>ذهاب (رفتن)<br>ذی (صاحب)
My words are separated by <br> tags, and I want to read each word into an ArrayList; that is, how do I skip the <br> tags and read the individual words?
Here is my code:
Document document = Jsoup.connect(url).get();
for (Element span : document.select("?").select("?")) {
    title = span.toString();
    name.add(title);
}
What should I put instead of the question marks to read them?
Any suggestions?
Edit the CSS of your template and define a class for your words, then use the Element.select(String selector) and Elements.select(String selector) methods.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/"); // input is a File
Element masthead = doc.select("p.words").first(); // p with class=words
Follow the link below for more information about extracting data with these methods:
Use selector-syntax to find elements
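Since the words on that page are separated only by <br> tags, here is a minimal sketch of one way to collect them into an ArrayList. The ".post" selector for the blog post body is an assumption and has to be adapted to the actual page structure:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class WordListExample {
    public static void main(String[] args) throws IOException {
        String url = "http://jadvalbaz.blog.ir/post/%D8%B1%D8%A7%D9%87%D9%86%D9%85%D8%A7%DB%8C-%D8%AD%D9%84-%D8%AC%D8%AF%D9%88%D9%84-%D8%AD%D8%B1%D9%81-%D8%B0";
        Document document = Jsoup.connect(url).get();
        List<String> name = new ArrayList<>();
        // ".post" is an assumed selector for the element that wraps the word list;
        // inspect the real page and adjust it. Each word is a text node between <br> tags.
        Element body = document.selectFirst(".post");
        if (body != null) {
            for (TextNode node : body.textNodes()) {
                String word = node.text().trim();
                if (!word.isEmpty()) {
                    name.add(word);
                }
            }
        }
        System.out.println(name);
    }
}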

Jsoup clean title tag failure

I am using Jsoup 1.9.2 to process and clean some XML input of specific tags. During this, I noticed that Jsoup behaves strangely when it is asked to clean title tags. Specifically, other XML tags within the title tag do not get removed, and in fact get replaced by their escaped forms.
I created a short unit test for this as below. The test fails, as the output comes out with the value CuCl&lt;sub&gt;2&lt;/sub&gt;.
@Test
public void stripXmlSubInTitle() {
    final String input = "<title>CuCl<sub>2</sub></title>";
    final String output = Jsoup.clean(input, Whitelist.none());
    assertEquals("CuCl2", output);
}
If the title tag is replaced with other tags (e.g., p or div), then everything works as expected. Any explanation and workaround will be appreciated.
The title tag should be used within the head (or in HTML5 within the html) tag. Since it is used to display the title of the HTML document, mostly in a browser window/tab, it is not supposed to have child tags.
Jsoup therefore treats it differently from actual content tags like p or div; the same applies to textarea.
Edit:
You could do something like this:
public static void main(String[] args) {
    try {
        final String input = "<content><title>CuCl<sub>2</sub></title><othertag>blabla</othertag><title>title with no subtags</title></content>";
        Document document = Jsoup.parse(input);
        Elements titles = document.getElementsByTag("title");
        for (Element element : titles) {
            element.text(Jsoup.clean(element.ownText(), Whitelist.none()));
        }
        System.out.println(document.body().toString());
    } catch (Exception e) {
        e.printStackTrace();
    }
}
That would return:
<body>
<content>
<title>CuCl2</title>
<othertag>
blabla
</othertag>
<title>title with no subtags</title>
</content>
</body>
Depending on your needs, some adjustments need to be made, e.g.
System.out.println(Jsoup.clean(document.body().toString(), Whitelist.none()));
That would return:
CuCl2 blabla title with no subtags

How to get crawled content in Crawljax

I have crawled a dynamic webpage using Crawljax. I am able to get the current crawl state's id, status and DOM, but I can't get the website content. Can anyone help me?
CrawljaxConfigurationBuilder builder =
        CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
builder.addPlugin(new OnNewStatePlugin() {

    @Override
    public String toString() {
        return "Our example plugin";
    }

    @Override
    public void onNewState(CrawlerContext cc, StateVertex sv) {
        LOG.info("Found a new dom! Here it is:\n{}", cc.getBrowser().getStrippedDom());
        String name = cc.getCurrentState().getName();
        String url = cc.getBrowser().getCurrentUrl();
        System.out.println(cc.getCurrentState().getDom());
        System.out.println("New State: " + name + "; url: " + url);
    }
});
CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
crawljax.call();
How do I get the content of a dynamic/JavaScript webpage?
I am able to get the website source code with
cc.getBrowser().getStrippedDom() or cc.getCurrentState().getDocument();
but these only return the source code (CSS/JavaScript files).
Not possible, because it is a testing tool. This tool only checks whether text is available and assigns temporary data to fields.
To get the website content, use the following function:
cc.getCurrentState().getDom()
Despite its name, this function does not return a DOM node; it actually returns the page's HTML text. It is the right function to use if you want the page content, even though the name getDom is a misnomer. To get a DOM node instead, use:
cc.getCurrentState().getDocument()
which returns the Document DOM node.
You can retrieve the page content with:
cc.getCurrentState().getDocument().getTextContent()
(EDIT: This won't work -- getTextContent always returns null when called on Documents.)
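Since getDom() returns the page's HTML as a string, one workaround, as a sketch that combines Crawljax with Jsoup rather than using any dedicated Crawljax API, is to parse that string and take its text:
// inside onNewState(CrawlerContext cc, StateVertex sv):
// parse the HTML string returned by getDom() and extract the visible text
String html = cc.getCurrentState().getDom();
String pageText = org.jsoup.Jsoup.parse(html).text();
System.out.println(pageText);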
