I am trying to parse an HTML page on the internet to retrieve data from a table in it with Jsoup. But the page I want to parse contains more than one table. How can I do that? Is that possible?
Edit:
Here is the page I want to parse:
http://metudex.com/mobilepac/browse.php?SEARCH=calculus&kriter=X&Submit=Search
I want to retrieve data from the tables with book info.
Document doc = Jsoup.connect("http://metudex.com/mobilepac/browse.php?SEARCH=calculus&kriter=X&Submit=Search").get();
Elements els = doc.select("td:has(span.briefcitDetail)"); // gets every td that contains a descendant span with class briefcitDetail
for (Element el : els) {
    System.out.println("--" + el.text());
}
Related
So I am using JSoup to scrape a website that creates a bunch of divs with dynamic class names (they change every reload), but the same attribute names. E.g:
<div class="[random text here that changes] js_resultTile" data-listing-number="[some number]">
<div class="a12_regularTile js_rollover_container " itemscope itemtype="http://schema.org/Product" data-listing-number="[same number here]">
<a href...
I've tried multiple approaches to selecting those divs and saving them in elements, but I can't seem to get it right. I've tried by attribute:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[data-listing-number]");
I've tried by class:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.getElementsByClass("a12_regularTile");
And:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[class*=js_resultTile]");
I've tried another attribute method:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = new Elements();
for (Element element : doc.getAllElements()) {
    for (Attribute attribute : element.attributes()) {
        if (attribute.getKey().equalsIgnoreCase("data-listing-number")) {
            myEls.add(element);
        }
    }
}
None of these work. I can fetch the doc, which gets me all the HTML, but my myEls object is always empty. What can I use to select these elements?
Are you sure these elements are present in the HTML returned by the server? They may be added later by JavaScript. Jsoup does not execute JavaScript, so it only sees the markup the server sends; elements that scripts create in the browser never appear in the parsed document. More details are in my answer to a similar question: JSoup: Difficulty extracting a single element
And one more tip. Instead of using your for-for-if construction you can use this:
for (Element element : doc.getAllElements()) {
if (element.dataset().containsKey("listing-number")) {
myEls.add(element);
}
}
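One way to test the first point of this answer is to download the response body without Jsoup and search it for the attribute name. The following is only a minimal sketch, assuming Java 11 or later (for java.net.http); the class name RawHtmlCheck and both method names are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of Jsoup.
final class RawHtmlCheck {

    // Downloads the body exactly as the server sends it; no JavaScript is executed.
    static String fetchRawHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // True if the marker text (for example an attribute name) occurs in the raw markup.
    static boolean containsMarker(String rawHtml, String marker) {
        return rawHtml != null && rawHtml.contains(marker);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java RawHtmlCheck.java <url>");
            return;
        }
        String html = fetchRawHtml(args[0]);
        System.out.println(containsMarker(html, "data-listing-number"));
    }
}
```

If this prints false for the page in question, the divs are most likely generated in the browser, and no selector will find them in the document Jsoup parsed.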
I am working on a project where I need to detect whether an element has repeated children. For example, in that DOM, I want to know that the element tbody has similar children.
My goal is to extract data, and store it in a database, from pages whose structure I do not know in advance.
Use jQuery to get your td elements and iterate over them with each.
You can use Jsoup for this; it is very easy to use as well.
For example, to get all td tags within your document:
String html=... //your html string
Document doc = Jsoup.parse(html);
Elements elements = doc.select("tbody").select("td");
System.out.println(elements.size()); // prints the number of td elements inside tbody, regardless of where in the DOM tree they live
Edit1:
To get all elements, you can do:
for (Element e : doc.getAllElements()) {
    System.out.println(e.tagName()); // prints the tag name
}
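For the original question of spotting repeated children, one possible approach is to count how often each tag name occurs among an element's direct children. The sketch below shows only that counting step on a plain list of tag names. It assumes the names were collected beforehand, for example by calling tagName() on each entry of Jsoup's element.children(). The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: works on tag names only, so it has no Jsoup dependency.
final class RepeatedChildren {

    // Returns each tag name that occurs at least twice, mapped to its count,
    // in order of first appearance.
    static Map<String, Integer> repeatedTags(List<String> childTagNames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tag : childTagNames) {
            counts.merge(tag, 1, Integer::sum);
        }
        counts.values().removeIf(n -> n < 2); // drop tags that appear only once
        return counts;
    }

    public static void main(String[] args) {
        // Assumed input: the direct children of a tbody are three tr elements.
        System.out.println(repeatedTags(List.of("tr", "tr", "tr")));
    }
}
```

Tag names alone are a coarse signal. Comparing class attributes or the child structure of each candidate would be a natural refinement, but that goes beyond what this sketch covers.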
I have a table in an HTML page. I need to iterate through it and open each link, which leads to a second page where all the information is. On that page I extract the data I need and then return to the first page.
How do I move between pages with Jsoup in Java? Is it actually possible?
If you look at the JSoup Cookbook, they have an example of getting all the links inside of an HTML element. Iterate the Elements from this example and do a Document doc = Jsoup.connect(<url from Elements>).get();. You can then do String htmlFromLink = doc.toString(); and get the HTML from the link.
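Links inside a table are often relative, so they need to be turned into absolute URLs before being passed to Jsoup.connect. Jsoup can do this itself through the abs:href attribute prefix. The sketch below shows the same resolution step using only java.net.URI, as an illustration of what happens. The class name, method name, and example.com URLs are placeholders.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating relative-link resolution.
final class LinkResolver {

    // Resolves each href against the URL of the page it was found on.
    // URI.create throws IllegalArgumentException for hrefs with illegal
    // characters (such as unescaped spaces), so real input may need cleaning.
    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            absolute.add(base.resolve(href).toString());
        }
        return absolute;
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of("item.php?id=7", "/about", "http://other.example/x");
        for (String url : resolveAll("http://example.com/catalog/list.html", hrefs)) {
            System.out.println(url); // each of these could be passed to Jsoup.connect(url).get()
        }
    }
}
```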
I’m working with JSoup to parse the html website.
I want to get the article from (for example) Wikipedia.
I would like to get the text from the main page (http://en.wikipedia.org/wiki/Main_Page) from the table “From today’s featured article”.
Here’s the code:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page");
Elements el = doc.select("div.mp-tfa");
System.out.println(el);
The problem is that it doesn’t work properly - it prints out just a blank line.
The “From today’s featured article” table is inserted in div class=“mp-tfa”.
How to get this text in my java program?
Thanks in advance.
Change:
doc.select("div.mp-tfa");
To:
doc.select("div#mp-tfa");
A better way would be to iterate over the Elements retrieved for the tag, class, or id of your choice. Simply put:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page").get();
Elements el = doc.select("div#mp-tfa");
for (Element e : el) {
System.out.println(e.text());
}
Would give:
The Boulonnais is a heavy draft horse breed from Fr....
I think it's supposed to be:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page").get();
Elements el = doc.select("div#mp-tfa");
System.out.println(el);
I am using Jsoup.
I fetch the HTML page with document = connect.get();
I then write it out as text (a string).
Users populate these pages.
I know each user's name, and the pages contain the user names.
I can call string.contains("username") to check whether a user is present.
Now my issue is:
The users' names appear in:
Tables
Ordered lists
Unordered lists
The body
In all these cases they have the same format, for example:
<li>2012 academic record</li>
Some are in tables and so on.
In the example I know the student name is john.
How can I get all the URLs?
==
You can use a regex for this:
Elements elements = document.select("[href~=(?is)http://university\\.xxx\\.students\\.com/grade9/(.+?)/[0-9]+?]");
More generally: document.select("a[href~=regex]")
If you already know the name, you can replace (.+?) with it, e.g.:
Elements elements = document.select("[href~=(?is)http://university\\.xxx\\.students\\.com/grade9/" + name + "/[0-9]+?]");
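When the name is concatenated into the pattern, any regex metacharacter in it (a dot, for instance) changes what matches. The sketch below exercises the same expression with plain java.util.regex and quotes the name with Pattern.quote. It tests only the regex, not how Jsoup parses a selector string containing the quoted name. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying the href pattern outside of Jsoup.
final class StudentLinkPattern {

    // Builds the pattern from the answer above, with the student name quoted
    // so that it is matched literally.
    static Pattern forStudent(String name) {
        return Pattern.compile(
                "(?is)http://university\\.xxx\\.students\\.com/grade9/"
                        + Pattern.quote(name) + "/[0-9]+");
    }

    // Uses find(), i.e. the pattern may match anywhere inside the href.
    static boolean matches(String href, String name) {
        return forStudent(name).matcher(href).find();
    }

    public static void main(String[] args) {
        String href = "http://university.xxx.students.com/grade9/john/2012";
        System.out.println(matches(href, "john"));
        System.out.println(matches(href, "mary"));
    }
}
```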
How about this:
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
if(link.attr("abs:href").contains(studentName) || link.text().contains(studentName)){
studentLinkList.add(link.attr("abs:href"));
}
}