JSoup get specific data from webpage

I've been trying to get data from: http://www.betvictor.com/sports/en/to-lead-anytime, where I would like to get the list of matches using JSoup.
For example:
Caen v AS Saint Etienne
Celtic v Rangers
and so on...
My current code is:
String couponPage = "http://www.betvictor.com/sports/en/to-lead-anytime";
Document doc1 = Jsoup.connect(couponPage).get();
String match = doc1.select("#coupon_143751140 > table:nth-child(3) > tbody > tr:nth-child(2) > td.event_description").text();
System.out.println("match:" + match);
Once I can figure out how to get one item of data, I will put it in a for loop to loop through the whole table, but first I need to get one item of data.
Currently, the output is "match:", so it looks like the "match" variable is empty.
Any help is much appreciated.

I worked out the answer to my own question after a few hours of experimenting. It turns out the page did not load quickly enough, so I had to raise the connection timeout with the "timeout" method.
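import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc;
try {
    // The URL needs the http protocol; allow up to 10 seconds for the page to respond
    doc = Jsoup.connect("http://www.betvictor.com/sports/en/football/coupons/100/0/0/43438/0/100/0/0/0/0/1").timeout(10000).get();
    // Select the link inside each event description cell
    Elements matches = doc.select("td.event_description a");
    for (Element match : matches) {
        // The link text has the form "Team1 v Team2"
        String matchEvent = match.text();
        String[] parts = matchEvent.split(" v ");
        if (parts.length < 2) {
            continue; // skip entries that are not "Team1 v Team2"
        }
        String team1 = parts[0];
        String team2 = parts[1];
        System.out.println("text : " + team1 + " v " + team2);
    }
} catch (IOException e) {
    e.printStackTrace();
}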
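
The loop above relies on every link text containing " v ". As a rough sketch of a more defensive split that uses only the standard library, something like the following could be used; the names MatchNames and splitTeams are hypothetical and not part of jsoup:

```java
import java.util.Optional;

class MatchNames {
    // Splits "Team1 v Team2" into its two team names.
    // Returns an empty Optional when the " v " separator is missing.
    static Optional<String[]> splitTeams(String matchEvent) {
        String[] parts = matchEvent.trim().split(" v ", 2);
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { parts[0].trim(), parts[1].trim() });
    }

    public static void main(String[] args) {
        String[] samples = { "Caen v AS Saint Etienne", "Celtic v Rangers", "Outright winner" };
        for (String text : samples) {
            Optional<String[]> teams = splitTeams(text);
            if (teams.isPresent()) {
                System.out.println(teams.get()[0] + " | " + teams.get()[1]);
            } else {
                System.out.println("skipped: " + text);
            }
        }
    }
}
```

In the scraping loop, match.text() would be passed to splitTeams and entries without two teams would simply be skipped.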

Related

Jsoup iterate over Elements causes duplicated output

I have a link to a page that I want to extract some data from (I want to get some attributes of the tds in its tables).
I used a for loop to iterate over the elements and extract the attributes,
but I get duplicated output.
The output should look like the image at the end of my post.
Document doc = Jsoup.connect("http://www.saudisale.com/SS_a_mpg.aspx").get();
Elements elements = doc.select("table").select("tbody").select("tr").select("td") ;
for(Element e:elements) {
System.out.println(e.select("span[id~=Label4]").text() +
"\t" + e.select("input[id$=ImageButton1]").attr("src") +
"\t" + "" + e.select("span[id~=Label13]").text());
}
This is the output that I get; the rows are duplicated:
The output should look like this:

Could you please try the code below?
Elements description = doc.select("tbody");
doc=Jsoup.parse(description.html());
description = doc.select("td");
for(int j = 0; j < description.size(); ++ j)
{
String bodytext = description.eq(j).text(); // bodytext is the text of each TD
}

I used a for loop with an incrementing counter instead, and that solved the problem.
There are 31 items on that page.
The following code gives the desired output.
for(int i=1;i<description.size();i++)
{
System.out.println(elements.select("td").select("span[id~=Label4]").get(i).text()+""+elements.select("td").select("input[id$=ImageButton1]").get(i).attr("src"));
}

using a regex in jsoup

I'm working on my first serious project with jsoup and I'm stuck on this problem.
I'm trying to get zip codes from a site that shows a list of them.
Here is one of the lines that contains a zip code:
<td align="center">33011</td>
My idea is to go through the page and collect every string that consists of exactly 5 digits (0-9). The regex is ^[0-9]{5,5}$
My code was:
doc.select("td:matchesOwn(^[0-9]{5,5}$)");
But nothing came out, and I can't find a way to get these zip codes out of that site.
Does anyone know how to do it?
The real question here is how to get numbers that are not inside any tags but just written out in the open (I guess there is a term for that, but I'm not that familiar with XML terminology).

I solved it using Element#getElementsMatchingOwnText:
public static void main(String[] args) {
final String html = "<td align=\"center\">33011</td> ";
final Elements elements = Jsoup.parse(html).getElementsMatchingOwnText("^[0-9]{5,5}$");
for (final Element element : elements) {
System.out.println("element = [" + element + "]");
System.out.println("zip = [" + element.text() + "]");
}
}
Output:
element = [33011]
zip = [33011]
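
The quantifier has to match the length of the values on the page: 33011 has five digits, so {5} matches it and {6} does not. A small sketch with java.util.regex, independent of jsoup, shows the difference; the class name ZipPattern and the method isFiveDigitZip are hypothetical:

```java
import java.util.regex.Pattern;

class ZipPattern {
    // Exactly five digits and nothing else in the string.
    static final Pattern FIVE_DIGITS = Pattern.compile("^[0-9]{5}$");
    // Exactly six digits; shown only for comparison.
    static final Pattern SIX_DIGITS = Pattern.compile("^[0-9]{6}$");

    // Trims surrounding whitespace before testing the cell text.
    static boolean isFiveDigitZip(String text) {
        return FIVE_DIGITS.matcher(text.trim()).find();
    }

    public static void main(String[] args) {
        String cellText = "33011";
        System.out.println(isFiveDigitZip(cellText));            // true
        System.out.println(SIX_DIGITS.matcher(cellText).find()); // false
        System.out.println(isFiveDigitZip("Zip: 33011"));        // false
    }
}
```

Because of the ^ and $ anchors, a cell that contains anything besides the five digits is rejected, which is usually what is wanted when filtering table cells.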

Jsoup -- iterate over multiple elements simultaneously?

I am attempting to convert an HTML page with entries that have multiple types of details (e.g. name, phone number, and address) into a spreadsheet. I am able to isolate each of these details as Elements, but I cannot find a way to iterate over multiple Elements at once, so that names and phone numbers are printed next to one another rather than all the names being printed first and then all the phone numbers.
// The timeout only takes effect on the connection that actually fetches the page
Document doc = Jsoup.connect(page).timeout(999999).get();
String title = doc.title();
System.out.println(title);
Elements names = doc.select("li a");
Elements ratings = doc.select("li img");
for (Element name : names) {
    if (name.attr("href").startsWith("/biz/")) {
        System.out.println(name.text());
    }
}
for (Element rating : ratings) {
    System.out.println(rating.attr("alt"));
}

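Assuming the elements at the same index in both lists belong to the same entry, this would work:
for (int i = 0; i < names.size() && i < ratings.size(); i++) {
    // Print the text and attribute value rather than the whole element
    System.out.println("Name: " + names.get(i).text() + " Rating: " + ratings.get(i).attr("alt"));
}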
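
Pairing by index only lines up when both lists hold exactly one value per entry, in the same order. In the question the names are filtered by an href starting with /biz/, so that filter would need to be applied before pairing. A minimal sketch of the pairing step on plain string lists, with hypothetical names (PairByIndex, pair) and placeholder data standing in for name.text() and rating.attr("alt"):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PairByIndex {
    // Joins the i-th name with the i-th rating, stopping at the shorter list.
    static List<String> pair(List<String> names, List<String> ratings) {
        int rows = Math.min(names.size(), ratings.size());
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            // Tab-separated so the line can be pasted into a spreadsheet
            lines.add(names.get(i) + "\t" + ratings.get(i));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Placeholder values, not taken from any real page
        List<String> names = Arrays.asList("Cafe A", "Diner B", "Bistro C");
        List<String> ratings = Arrays.asList("4.5 star rating", "3.0 star rating");
        for (String line : pair(names, ratings)) {
            System.out.println(line);
        }
    }
}
```

If the two lists can differ in length, selecting each entry's container element first and reading both details inside it is likely to be more robust than index pairing.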

Can I include white space between all html text() elements in Jsoup

I want to use Jsoup to extract all text from an HTML page and return a single string of all the text without any HTML. The code I am using half works, but has the effect of joining elements which affects my keyword searches against the string.
This is the Java code I am using:
String resultText = scrapePage(htmldoc);
private String scrapePage(Document doc) {
Element allHTML = doc.select("html").first();
return allHTML.text();
}
Run against the following HTML:
<html>
<body>
<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>
</body>
</html>
Outputting resultText gives "Titlehere is para1here is para2" meaning I can't search for the word "para1" as the only word is "para1here".
I don't want to split the document into more elements than necessary (for example, getting the text of every h1 and p element), as there is such a wide range of tags I could be matching.
For example, data1data2 would come from:
<td>data1</td><td>data2</td>
Is there a way I can get all the text from the page but also include a space between the tags? I don't need to preserve whitespace otherwise (no need to keep line breaks etc.), as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.

I don't have this issue using jsoup 1.7.3.
Here's the full code I used for testing:
final String html = "<html>\n"
+ " <body>\n"
+ " <h1>Title</h1>\n"
+ " <p>here is para1</p>\n"
+ " <p>here is para2</p>\n"
+ " </body>\n"
+ "</html>";
Document doc = Jsoup.parse(html);
Element element = doc.select("html").first();
System.out.println(element.text());
And the output:
Title here is para1 here is para2
Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.

The previous answer is not right: it works only because of the "\n" line breaks added to each line of the test HTML, and real HTML may not have a line break at the end of each line.

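import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

void example2text() throws IOException {
    String url = "http://www.example.com/";
    // Read the whole response body as a single string
    String out;
    try (Scanner scanner = new Scanner(new URL(url).openStream(), "UTF-8")) {
        out = scanner.useDelimiter("\\A").next();
    }
    Document doc = Jsoup.parse(out);
    StringBuilder text = new StringBuilder();
    // Visit every element and append its own text nodes, each followed by a space
    Elements tags = doc.select("*");
    for (Element tag : tags) {
        for (TextNode tn : tag.textNodes()) {
            String tagText = tn.text().trim();
            if (tagText.length() > 0) {
                text.append(tagText).append(' ');
            }
        }
    }
    System.out.println(text);
}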
Based on this answer: https://stackoverflow.com/a/35798214/4642669
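
Once the text nodes are joined with a separator, the remaining clean-up the question mentions (collapsing whitespace to single spaces before keyword matching) needs only the standard library. A minimal sketch, where KeywordText, normalize and containsWord are hypothetical names rather than jsoup API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class KeywordText {
    // Collapses every run of whitespace (spaces, tabs, line breaks) into one space.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // True when keyword occurs as a whole space-delimited word, ignoring case.
    static boolean containsWord(String text, String keyword) {
        String lower = normalize(text).toLowerCase(Locale.ROOT);
        List<String> words = Arrays.asList(lower.split(" "));
        return words.contains(keyword.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String separated = "  Title\n here is   para1\n here is para2  ";
        System.out.println(normalize(separated));
        System.out.println(containsWord(separated, "para1"));
        // With the elements joined without spaces, "para1" is no longer a word
        System.out.println(containsWord("Titlehere is para1here is para2", "para1"));
    }
}
```

Matching whole words rather than substrings makes the difference between "para1" and "para1here" explicit, which is the failure described in the question.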

Why Jsoup cannot select td element?

I ran a small test (with Jsoup 1.6.1):
String s = "" +Jsoup.parse("<td></td>").select("td").size();
System.out.println("Selected elements count : " + s);
It outputs:
Selected elements count : 0
But it should return 1, because I parsed HTML containing a td element. What is wrong with my code, or is there a bug in Jsoup?

Because Jsoup is an HTML5-compliant parser and you fed it invalid HTML. A <td> has to go inside at least a <table>.
int size = Jsoup.parse("<table><td></td></table>").select("td").size();
System.out.println("Selected elements count : " + size);

String url = "http://foobar.com";
Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td");

Jsoup 1.6.2 lets you choose a different parser and provides a simple XML parser. The following code solved my problem. You can later parse the fragment with the HTML parser to get valid HTML.
// Jsoup 1.6.2
String s = "" + Jsoup.parse("<td></td>", "", Parser.xmlParser()).select("td").size();
System.out.println("Selected elements count : " + s);
