So I am trying to get the data from this webpage using Jsoup...
I've tried looking up many different ways of doing it and I've gotten close, but I don't know how to find the tags for certain stats (Attack, Strength, Defence, etc.)
So let's say, for example's sake, I wanted to print out
'Attack', '15', '99', '200,000,000'
How should I go about doing this?
You can use CSS selectors in Jsoup to easily extract the column data.
// retrieve page source code
Document doc = Jsoup
.connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
.get();
// find all of the table rows
Elements rows = doc.select("div#contentHiscores table tr");
ListIterator<Element> itr = rows.listIterator();
// loop over each row
while (itr.hasNext()) {
Element row = itr.next();
// does the second col contain the word attack?
if (row.select("td:nth-child(2) a:contains(attack)").first() != null) {
// if so, assign each sibling col to variable
String rank = row.select("td:nth-child(3)").text();
String level = row.select("td:nth-child(4)").text();
String xp = row.select("td:nth-child(5)").text();
System.out.printf("rank=%s level=%s xp=%s", rank, level, xp);
// stop looping rows, found attack
break;
}
}
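With the sample row from the question, that should print something like rank=15 level=99 xp=200,000,000.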
A very rough implementation would be as below. I have just shown a snippet; optimizations or other conditionals need to be added.
public static void main(String[] args) throws Exception {
Document doc = Jsoup
.connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
.get();
Element contentHiscoresDiv = doc.getElementById("contentHiscores");
Element table = contentHiscoresDiv.child(0);
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
for (Element column : tds) {
// children() never returns null, so checking the size is enough
if (column.children().size() > 0) {
// the skill name sits inside an anchor tag
Element anchorTag = column.getElementsByTag("a").first();
if (anchorTag != null && anchorTag.text().contains("Attack")) {
System.out.println(anchorTag.text());
Elements attributeSiblings = column.siblingElements();
for (Element attributeSibling : attributeSiblings) {
System.out.println(attributeSibling.text());
}
}
}
}
}
}
Output:
Attack
15
99
200,000,000
My objective is to scrape data using Java Selenium. I am able to load the Selenium driver, connect to the website, fetch the first column, then go to the next pagination button until it becomes disabled, and write the results to the console. Here is what I did so far:
public static WebDriver driver;
public static void main(String[] args) throws Exception {
System.setProperty("webdriver.chrome.driver", "E:\\eclipse-workspace\\package-name\\src\\working\\selenium\\driver\\chromedriver.exe");
System.setProperty("webdriver.chrome.silentOutput", "true");
driver = new ChromeDriver();
driver.get("https://datatables.net/examples/basic_init/zero_configuration.html");
driver.manage().window().maximize();
compareDisplayedRowCountToActualRowCount();
}
public static void compareDisplayedRowCountToActualRowCount() throws Exception {
try {
Thread.sleep(5000);
List<WebElement> namesElements = driver.findElements(By.cssSelector("#example>tbody>tr>td:nth-child(1)"));
System.out.println("size of names elements : " + namesElements.size());
List<String> names = new ArrayList<String>();
//Adding column1 elements to the list
for (WebElement nameEle : namesElements) {
names.add(nameEle.getText());
}
//Displaying the list elements on console
for (WebElement s : namesElements) {
System.out.println(s.getText());
}
//locating next button
String nextButtonClass = driver.findElement(By.id("example_next")).getAttribute("class");
//traversing through the table until the last button and adding names to the list defined above
while (!nextButtonClass.contains("disabled")) {
driver.findElement(By.id("example_next")).click();
Thread.sleep(1000);
namesElements = driver.findElements(By.cssSelector("#example>tbody>tr>td:nth-child(1)"));
for (WebElement nameEle : namesElements) {
names.add(nameEle.getText());
}
nextButtonClass = driver.findElement(By.id("example_next")).getAttribute("class");
}
//printing the whole list elements
for (String name : names) {
System.out.println(name);
}
//counting the size of the list
int actualCount = names.size();
System.out.println("Total number of names :" + actualCount);
//locating displayed count
String displayedCountString = driver.findElement(By.id("example_info")).getText().split(" ")[5];
int displayedCount = Integer.parseInt(displayedCountString);
System.out.println("Total Number of Displayed Names count:" + displayedCount);
Thread.sleep(1000);
// Actual count calculated vs Displayed count
if (actualCount == displayedCount) {
System.out.println("Actual row count = Displayed row Count");
} else {
System.out.println("Actual row count != Displayed row Count");
throw new Exception("Actual row count != Displayed row Count");
}
} catch (Exception e) {
e.printStackTrace();
}
}
I want to:
scrape more than one column, or maybe only selected columns, for example on this LINK the name, office and age columns
then write the data from these columns to a CSV file
Update
I tried it like this, but it doesn't run:
for(WebElement trElement : tr_collection){
int col_num=1;
List<WebElement> td_collection = trElement.findElements(
By.xpath("//*[#id=\"example\"]/tbody/tr[rown_num]/td[col_num]")
);
for(WebElement tdElement : td_collection){
rows += tdElement.getText()+"\t";
col_num++;
}
rows = rows + "\n";
row_num++;
}
Scraping:
Usually when I want to gather list elements I will select by XPath instead of a CSS selector. The structure of how you access elements through the XPath is usually clearer, and depends on one or two integer values specifying the element.
So for your example, where you want to find the names, you would find an element by its XPath, look at the next element in the list's XPath, and find the value that differs:
The first name, 'Airi Satou', is found at the following XPath:
//*[@id="example"]/tbody/tr[1]/td[1]
Airi's position has the following XPath:
//*[@id="example"]/tbody/tr[1]/td[2]
You can see that within a row, the XPath for each piece of information differs in the 'td' index.
The next name in the list, 'Angela Ramos', is found at:
//*[@id="example"]/tbody/tr[2]/td[1]
And Angela's position is found at:
//*[@id="example"]/tbody/tr[2]/td[2]
You can see that the difference between rows is controlled by the 'tr' index.
By iterating over the values of 'tr' and 'td' you can get the whole table.
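For instance, a rough sketch along those lines might look as follows (untested; it assumes the driver from the question is already on the page, and that columns 1, 3 and 4 hold Name, Office and Age on that demo table):

List<String[]> records = new ArrayList<>();
// count the rows once, then address each cell by its tr/td indexes
int totalRows = driver.findElements(By.xpath("//*[@id=\"example\"]/tbody/tr")).size();
for (int row = 1; row <= totalRows; row++) {
    String base = "//*[@id=\"example\"]/tbody/tr[" + row + "]";
    String name = driver.findElement(By.xpath(base + "/td[1]")).getText();
    String office = driver.findElement(By.xpath(base + "/td[3]")).getText();
    String age = driver.findElement(By.xpath(base + "/td[4]")).getText();
    records.add(new String[] { name, office, age });
}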
As for writing to a CSV, there are some solid Java libraries for writing CSVs. I think a straightforward example to follow is here:
Java - Writing strings to a CSV file
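If you would rather not pull in a library, plain java.io is enough for simple cases. A minimal sketch, assuming a List<String[]> records like the one gathered in the sketch above (the file name is just an example, and proper CSV escaping of commas and quotes is left out):

try (PrintWriter out = new PrintWriter(new FileWriter("table.csv"))) {
    // header line, then one comma-joined line per record
    out.println("Name,Office,Age");
    for (String[] record : records) {
        out.println(String.join(",", record));
    }
}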
UPDATE:
@User169 It looks like you're gathering a list of elements for each row in the table. You want to gather the XPaths one by one, iterating over the list of web elements that you found originally. Try this, then add to it so it will get the text and save it to an array.
for (int num_row = 1; num_row < total_rows; num_row++) {
    for (int num_col = 1; num_col < total_col; num_col++) {
        WebElement info = driver.findElement(
                By.xpath("//*[@id=\"example\"]/tbody/tr[" + num_row + "]/td[" + num_col + "]"));
    }
}
I haven't tested it so it may need a few small changes.
I have the HTML string like
<b>test</b><b>er</b>
<span class="ab">continue</span><span> without</span>
I want to collapse tags which are similar and belong together. In the above sample I want to end up with
<b>tester</b>
since both tags are the same tag without any further attributes or styles. But the span tags should remain as they are, because the first one has a class attribute. I am aware that I can iterate over the tree via Jsoup.
Document doc = Jsoup.parse(input);
for (Element element : doc.select("b")) {
}
But I'm not clear how to look ahead (I guess something like nextSibling), and then how to collapse the elements?
Or is there a simple regexp merge?
The attributes I can specify on my own; it's not required to have a one-size-fits-all tag solution.
My approach would be like this; comments are in the code.
public class StackOverflow60704600 {
public static void main(final String[] args) throws IOException {
Document doc = Jsoup.parse("<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>");
mergeSiblings(doc, "b");
System.out.println(doc);
}
private static void mergeSiblings(Document doc, String selector) {
Elements elements = doc.select(selector);
for (Element element : elements) {
// get the next sibling
Element nextSibling = element.nextElementSibling();
// merge only if the next sibling has the same tag name and the same set of attributes
if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
&& nextSibling.attributes().equals(element.attributes())) {
// your element has only one child, but let's rewrite all of them if there's more
while (nextSibling.childNodes().size() > 0) {
Node siblingChildNode = nextSibling.childNodes().get(0);
element.appendChild(siblingChildNode);
}
// remove because now it doesn't have any children
nextSibling.remove();
}
}
}
}
Output:
<html>
<head></head>
<body>
<b>tester</b>
<span class="ab">continue</span>
<span> without</span>
</body>
</html>
One more note on why I used the while (nextSibling.childNodes().size() > 0) loop: it turned out a for loop or iterator couldn't be used here, because appendChild adds the child to its new parent but removes it from the source element, so the remaining children are shifted. It may not be visible here, but the problem will appear when you try to merge: <b>test</b><b>er<a>123</a></b>
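A quick way to check that edge case is to run it through the same mergeSiblings method; the expected result is my assumption of how Jsoup serializes the merged node:

Document doc = Jsoup.parse("<b>test</b><b>er<a>123</a></b>");
mergeSiblings(doc, "b");
// should print something like: <b>tester<a>123</a></b>
System.out.println(doc.body().html());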
I tried to update the code from @Krystian G but my edit was rejected :-/ Therefore I post it as its own answer. The code is an excellent starting point, but it fails if a TextNode appears between the tags, e.g.
<span> no class but further</span> (in)valid <span>spanning</span> would result in
<span> no class but furtherspanning</span> (in)valid
Therefore the corrected code looks like this:
public class StackOverflow60704600 {
public static void main(final String[] args) throws IOException {
String test1="<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>";
String test2="<b>test</b><b>er<a>123</a></b>";
String test3="<span> no class but further</span> <span>spanning</span>";
String test4="<span> no class but further</span> (in)valid <span>spanning</span>";
Document doc = Jsoup.parse(test1);
mergeSiblings(doc, "b");
System.out.println(doc);
}
private static void mergeSiblings(Document doc, String selector) {
Elements elements = doc.select(selector);
for (Element element : elements) {
Node nextElement = element.nextSibling();
// if the next Element is a TextNode but has only space ==> we need to preserve the
// spacing
boolean addSpace = false;
if (nextElement != null && nextElement instanceof TextNode) {
String content = nextElement.toString();
if (!content.isBlank()) {
// the next element has some content
continue;
} else {
addSpace = true;
}
}
// get the next sibling
Element nextSibling = element.nextElementSibling();
// merge only if the next sibling has the same tag name and the same set of
// attributes
if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
&& nextSibling.attributes().equals(element.attributes())) {
// your element has only one child, but let's rewrite all of them if there's more
while (nextSibling.childNodes().size() > 0) {
Node siblingChildNode = nextSibling.childNodes().get(0);
if (addSpace) {
// since we have had some space previously ==> preserve it and add it
if (siblingChildNode instanceof TextNode) {
((TextNode) siblingChildNode).text(" " + siblingChildNode.toString());
} else {
element.appendChild(new TextNode(" "));
}
}
element.appendChild(siblingChildNode);
}
// remove because now it doesn't have any children
nextSibling.remove();
}
}
}
}
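To exercise the whitespace handling, swap test3 or test4 into the Jsoup.parse(...) call: with test3 the two spans should merge with the space preserved, while test4 should be left untouched because of the non-blank text node sitting between the spans.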
I want to load the table from http://www.espn.com/nba/hollinger/teamstats into a JTable. After parsing the table with Jsoup, I managed to load the table header, but I have a problem loading the data rows. First I tried only the odd rows, but Jsoup loaded only the last odd row, and I don't know how to load all of them.
I tried to load from the first row using .first(), but then it loaded only the first row and nothing else.
Here is my code:
Document doc = null;
try {
doc = Jsoup.connect("http://www.espn.com/nba/hollinger/teamstats").get();
} catch (IOException e) {
e.printStackTrace();
}
String [][] data = new String[30][12];
String [] header = new String[12];
for (Element table : doc.select("table.tablehead .colhead")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
for (int i=0;i<12;i++) {
header[i]=tds.get(i).text();
}
}
}
for (Element table : doc.select("table.tablehead .oddRow")) {
for (int j=0;j<15;j++) {
for (Element row : table.select("tr")) {
for (int i=0;i<12;i++) {
Elements tds = row.select("td");
data[j][i]=tds.get(i).text();
}
}
}
}
The HTML table has 30 data rows, I want to load all of them into my JTable.
How to modify my code? Thanks for the help!
It looks like you are overcomplicating things. To get text from the headers:
select the table row <tr> holding the header data
iterate over its table data <td>
get their text()
store it in an array
For the data:
select all rows from the table except the first two, since they are used for info and the header (the :gt(1) selector can be helpful here; :gt(n) finds elements whose sibling index is greater than n, and we want the tr siblings starting at indexes 2, 3, 4, ..., in other words greater than 1)
repeat what you did for the headers, but store the resulting arrays as rows of a 2D String array
Code:
Document doc = Jsoup.connect("http://www.espn.com/nba/hollinger/teamstats").get();
//headers: pick specific row, get its td, convert them to text() store as array
String[] headers = doc.select("table.tablehead tr.colhead td")
.stream()
.map(Element::text)
.toArray(String[]::new);
System.out.println(Arrays.toString(headers));
//data: select rows with data, convert row to array, hold each row array in 2D array
String[][] data = doc.select("table.tablehead tr:gt(1)")
.stream()
.map(row -> row.select("td")
.stream()
.map(Element::text)
.toArray(String[]::new)
).toArray(String[][]::new);
System.out.println("----");
for (String[] row : data){
System.out.println(Arrays.toString(row));
}
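Since the goal is a JTable, the two arrays can then be passed straight to the JTable(Object[][] rowData, Object[] columnNames) constructor. A minimal sketch (using javax.swing):

JTable jTable = new JTable(data, headers);
JFrame frame = new JFrame("Hollinger team stats");
// wrap the table in a scroll pane so the header row is shown
frame.add(new JScrollPane(jTable));
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.pack();
frame.setVisible(true);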
I was trying to scrape the data of a website and to some extent I succeeded. But there is a problem: the web page I am trying to scrape has multiple HTML tables in it. When I execute my program it only retrieves the data of the first table into the CSV file and does not retrieve the other tables. My Java class code is as follows.
public static void parsingHTML() throws Exception {
//tbodyElements = doc.getElementsByTag("tbody");
for (int i = 1; i <= 1; i++) {
Elements table = doc.getElementsByTag("table");
if (table.isEmpty()) {
throw new Exception("Table is not found");
}
elements = table.get(0).getElementsByTag("tr");
for (Element trElement : elements) {
trElement2 = trElement.getElementsByTag("tr");
tdElements = trElement.getElementsByTag("td");
File fold = new File("C:\\convertedCSV9.csv");
fold.delete();
File fnew = new File("C:\\convertedCSV9.csv");
FileWriter sb = new FileWriter(fnew, true);
//StringBuilder sb = new StringBuilder(" ");
//String y = "<tr>";
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
//Element tdElement1 = it.next();
//final String content2 = tdElement1.text();
if (it.hasNext()) {
sb.append("\r\n");
}
for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
Element tdElement2 = it.next();
final String content = tdElement2.text();
//stringjoiner.add(content);
//sb.append(formatData(content));
if (it2.hasNext()) {
sb.append(formatData(content));
sb.append(" , ");
}
if (!it.hasNext()) {
String content1 = content.replaceAll(",$", " ");
sb.append(formatData(content1));
//it2.next();
}
}
System.out.println(sb.toString());
sb.flush();
sb.close();
}
System.out.println(sampleList.add(tdElements));
}
}
}
What I analyze is that there is a loop which is only checking tr/td elements. After the first table there is a style sheet on the HTML page; maybe the loop is breaking because of that style sheet. I think that's the reason it is not proceeding to the next table.
P.S.: here's the link which I am trying to scrape:
http://www.mufap.com.pk/nav_returns_performance.php?tab=01
What you do just at the beginning of your code will not work:
// loop just once, why
for (int i = 1; i <= 1; i++) {
Elements table = doc.getElementsByTag("table");
if (table.isEmpty()) {
throw new Exception("Table is not found");
}
elements = table.get(0).getElementsByTag("tr");
Here you loop just once, read all table elements and then process all tr elements for the first table you find. So even if you looped more than once, you would always process the first table.
You will have to iterate over all table elements, e.g.
for(Element table : doc.getElementsByTag("table")) {
for (Element trElement : table.getElementsByTag("tr")) {
// process "td"s and so on
}
}
Edit: Since you're having trouble with the code above, here's a more thorough example. Note that I'm using Jsoup to read and parse the HTML (you didn't specify what you are using).
Document doc = Jsoup
.connect("http://www.mufap.com.pk/nav_returns_performance.php?tab=01")
.get();
for (Element table : doc.getElementsByTag("table")) {
for (Element trElement : table.getElementsByTag("tr")) {
// skip header "tr"s and process only data "tr"s
if (trElement.hasClass("tab-data1")) {
StringJoiner tdj = new StringJoiner(",");
for (Element tdElement : trElement.getElementsByTag("td")) {
tdj.add(tdElement.text());
}
System.out.println(tdj);
}
}
}
This will concat and print all data cells (those having the class tab-data1). You will still have to modify it to write to your CSV file though.
Note: in my tests this processes 21 tables, 243 trs and 2634 tds.
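For the CSV part, one way to adapt the snippet above is to send each joined row to a writer instead of System.out. A minimal sketch (the file path is taken from your own code; add error handling as needed):

try (PrintWriter csv = new PrintWriter(new FileWriter("C:\\convertedCSV9.csv"))) {
    for (Element table : doc.getElementsByTag("table")) {
        for (Element trElement : table.getElementsByTag("tr")) {
            // only data rows, same filter as above
            if (trElement.hasClass("tab-data1")) {
                StringJoiner tdj = new StringJoiner(",");
                for (Element tdElement : trElement.getElementsByTag("td")) {
                    tdj.add(tdElement.text());
                }
                csv.println(tdj.toString());
            }
        }
    }
}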
I am trying to get the link to every team in the table on http://www.statto.com/football/stats/england/premier-league. Currently my code only gets the team names, but seems to output every team as one string... I would like each element to be output as the link, so "Chelsea" would be "http://www.statto.com/football/teams/chelsea".
My current code:
Document doc = Jsoup.connect(
"http://www.statto.com/football/stats/england/premier-league").get();
Element tableHeader = doc.select("table[class=tabBG]").first();
for (Element element : tableHeader.children()) {
// Here you can do something with each element
String team = element.select("td:eq(1) a").text();
System.out.println(team);
}
}
Does anybody know how I can get the link to each item in the table to output as individual strings?
Thanks,
Rob
I have now worked out a solution to my problem; below is the code that works:
Document doc = Jsoup.connect(
"http://www.statto.com/football/stats/germany/bundesliga").get();
Element tableHeader = doc.select("tbody").first();
for (Element element : tableHeader.children()) {
// Here you can do something with each element
if (element.select("td:eq(1)").html().contains("acronym") || element.select("td:eq(1)").html().contains("nbsp")) {
//do nothing
} else {
String teamname = element.select("td:eq(1) a").html();
String team = element.select("td:eq(1)").toString()
.replace("", "").replace(teamname, "").replace("<td class=\"steam\">", "").replace("\"</td>", "");
System.out.println(team);
}
}
Which gives me the below output:
http://www.statto.com/football/teams/bayern-munich
http://www.statto.com/football/teams/vfl-wolfsburg
http://www.statto.com/football/teams/borussia-monchengladbach
http://www.statto.com/football/teams/hannover-96
http://www.statto.com/football/teams/tsg-hoffenheim
http://www.statto.com/football/teams/bayer-leverkusen
http://www.statto.com/football/teams/augsburg
http://www.statto.com/football/teams/mainz
http://www.statto.com/football/teams/paderborn-07
http://www.statto.com/football/teams/fc-cologne
http://www.statto.com/football/teams/schalke-04
http://www.statto.com/football/teams/eintracht-frankfurt
http://www.statto.com/football/teams/sc-freiburg
http://www.statto.com/football/teams/hertha-bsc-berlin
http://www.statto.com/football/teams/werder-bremen
http://www.statto.com/football/teams/sv-hamburg
http://www.statto.com/football/teams/vfb-stuttgart
http://www.statto.com/football/teams/borussia-dortmund
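As a side note, it might be simpler to read the link straight from the anchor's href attribute instead of stripping the HTML with string replaces. A rough sketch, assuming the same table structure:

Document doc = Jsoup.connect(
        "http://www.statto.com/football/stats/germany/bundesliga").get();
for (Element element : doc.select("tbody").first().children()) {
    Element link = element.select("td:eq(1) a").first();
    if (link != null) {
        // "abs:href" resolves the href against the page URL, giving the full team link
        System.out.println(link.attr("abs:href"));
    }
}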