Java HTML parser multi-page table - java

I am using Jsoup as an HTML parser to get all the details from the table on this website. With the code below I am only able to get the data on the first page. Any advice?
public static void main(String[] args) {
    String html = "http://www.fifa.com/worldranking/rankingtable/index.html#";
    try {
        Document doc = Jsoup.connect(html).get();
        Elements tableElements = doc.select("table");

        Elements tableHeaderEles = tableElements.select("thead tr th");
        System.out.println("headers");
        System.out.print("row");
        for (int i = 0; i < tableHeaderEles.size(); i++) {
            System.out.print(tableHeaderEles.get(i).text() + " | ");
        }
        System.out.println();

        Elements tableRowElements = tableElements.select(":not(thead) tr");
        for (int i = 0; i < tableRowElements.size(); i++) {
            Element row = tableRowElements.get(i);
            System.out.print("row");
            Elements rowItems = row.select("td");
            for (int j = 0; j < rowItems.size(); j++) {
                System.out.print(rowItems.get(j).text() + " | ");
            }
            System.out.println();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Jsoup is an HTML parser, but the website is using JavaScript to load the table, so the later pages are never in the HTML that Jsoup downloads; you would need to click through the pagination.
You could use HtmlUnit or Selenium to navigate and Jsoup to parse the HTML.
I hope it helps.
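For example, a rough sketch of that combination (assuming the usual Selenium and Jsoup imports; the pagination clicks are left out):
// let the browser execute the JavaScript, then hand the rendered HTML to Jsoup
WebDriver driver = new ChromeDriver();
driver.get("http://www.fifa.com/worldranking/rankingtable/index.html#");
Document doc = Jsoup.parse(driver.getPageSource());
Elements tableElements = doc.select("table");
driver.quit();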
Edit:
Looking more closely at the page's code, I think this could be useful:
http://www.fifa.com/worldranking/rankingtable/gender=m/rank=100/confederation=0/page=0/_ranking_table.html
I changed the values in the URL. Note that you can increase rank (it selects the date of the ranking), and the important parameter is page. You could load the whole ranking by increasing the page parameter; then just parsing each response with Jsoup would be enough.
For example, the latest ranking would be:
http://www.fifa.com/worldranking/rankingtable/gender=m/rank=237/confederation=0/page=1/_ranking_table.html
Then you could increase the parameter to page=2, then 3, and so on up to 7.
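A minimal sketch of that loop (untested, and it belongs inside your existing try/catch; the rank value and page count are taken from the URLs above):
// fetch each page of the ranking table directly and print its rows
for (int page = 0; page <= 7; page++) {
    String pageUrl = "http://www.fifa.com/worldranking/rankingtable/"
            + "gender=m/rank=237/confederation=0/page=" + page + "/_ranking_table.html";
    Document pageDoc = Jsoup.connect(pageUrl).get();
    // each response is a plain HTML fragment containing one page of the ranking table
    for (Element row : pageDoc.select("table tr")) {
        System.out.println(row.select("td").text());
    }
}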
Cheers.

Related

Getting Stale Element Reference Exception while accessing the element from a list

Below is the code which I am trying to get to work.
//Method to fetch all links from the sitemap container
public void GetAllLinks() {
    WebElement pointer = LinksContainer;
    String url = "";
    List<WebElement> allURLs = pointer.findElements(By.tagName("a"));
    System.out.println("Total links on the page: " + allURLs.size());
    for (int i = 0; i < allURLs.size(); i++) {
        WebElement link = allURLs.get(i);
        url = link.getAttribute("href");
        OpenAllLinks(url);
    }
}

//Method to hit all the fetched URLs
public void OpenAllLinks(String linkURL) {
    driver.get(linkURL);
}
I am fetching all the anchor elements from a sitemap page and putting those elements into a list. Then I am getting the URLs from all those elements using getAttribute("href"). The code works fine up to this point.
However, after that I take these URLs and pass them as arguments to the method OpenAllLinks(), which opens them one by one using driver.get(). The code works for the first link, but as soon as the first page is loaded it gives me the stale element exception.
The moment you leave the page where all these links appear, all the web elements in the allURLs list become stale.
What you can do is first extract and save all the links (not the web elements) in a list, and after that iterate over the list, opening each link.
Like this:
public void GetAllLinks() {
    WebElement pointer = LinksContainer;
    String url = "";
    List<WebElement> allURLs = pointer.findElements(By.tagName("a"));
    System.out.println("Total links on the page: " + allURLs.size());
    List<String> links = new ArrayList<>();
    for (int i = 0; i < allURLs.size(); i++) {
        WebElement link = allURLs.get(i);
        url = link.getAttribute("href");
        links.add(url);
    }
    for (int i = 0; i < links.size(); i++) {
        OpenLink(links.get(i));
    }
}

//Method to open the fetched URLs
public void OpenLink(String linkURL) {
    driver.get(linkURL);
}

How to scrape selected table columns and write them to CSV in Java Selenium

My objective is to scrape data using Java Selenium. I am able to load the Selenium driver, connect to the website, fetch the first column, click the next pagination button until it becomes disabled, and write the data to the console. Here is what I have done so far:
public static WebDriver driver;

public static void main(String[] args) throws Exception {
    System.setProperty("webdriver.chrome.driver", "E:\\eclipse-workspace\\package-name\\src\\working\\selenium\\driver\\chromedriver.exe");
    System.setProperty("webdriver.chrome.silentOutput", "true");
    driver = new ChromeDriver();
    driver.get("https://datatables.net/examples/basic_init/zero_configuration.html");
    driver.manage().window().maximize();
    compareDisplayedRowCountToActualRowCount();
}

public static void compareDisplayedRowCountToActualRowCount() throws Exception {
    try {
        Thread.sleep(5000);
        List<WebElement> namesElements = driver.findElements(By.cssSelector("#example>tbody>tr>td:nth-child(1)"));
        System.out.println("size of names elements : " + namesElements.size());
        List<String> names = new ArrayList<String>();
        //Adding column1 elements to the list
        for (WebElement nameEle : namesElements) {
            names.add(nameEle.getText());
        }
        //Displaying the list elements on console
        for (WebElement s : namesElements) {
            System.out.println(s.getText());
        }
        //locating next button
        String nextButtonClass = driver.findElement(By.id("example_next")).getAttribute("class");
        //traversing through the table until the last button and adding names to the list defined above
        while (!nextButtonClass.contains("disabled")) {
            driver.findElement(By.id("example_next")).click();
            Thread.sleep(1000);
            namesElements = driver.findElements(By.cssSelector("#example>tbody>tr>td:nth-child(1)"));
            for (WebElement nameEle : namesElements) {
                names.add(nameEle.getText());
            }
            nextButtonClass = driver.findElement(By.id("example_next")).getAttribute("class");
        }
        //printing the whole list elements
        for (String name : names) {
            System.out.println(name);
        }
        //counting the size of the list
        int actualCount = names.size();
        System.out.println("Total number of names :" + actualCount);
        //locating displayed count
        String displayedCountString = driver.findElement(By.id("example_info")).getText().split(" ")[5];
        int displayedCount = Integer.parseInt(displayedCountString);
        System.out.println("Total Number of Displayed Names count:" + displayedCount);
        Thread.sleep(1000);
        //actual count calculated vs displayed count
        if (actualCount == displayedCount) {
            System.out.println("Actual row count = Displayed row Count");
        } else {
            System.out.println("Actual row count != Displayed row Count");
            throw new Exception("Actual row count != Displayed row Count");
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I want to:
scrape more than one column, or maybe selected columns, for example the Name, Office, and Age columns on this LINK
then write these columns' data to a CSV file
Update:
I tried it like this, but it is not running:
for (WebElement trElement : tr_collection) {
    int col_num = 1;
    List<WebElement> td_collection = trElement.findElements(
            By.xpath("//*[@id=\"example\"]/tbody/tr[rown_num]/td[col_num]")
    );
    for (WebElement tdElement : td_collection) {
        rows += tdElement.getText() + "\t";
        col_num++;
    }
    rows = rows + "\n";
    row_num++;
}
Scraping:
Usually when I want to gather list elements I select by XPath instead of a CSS selector. The structure of how to access elements through an XPath is usually clearer, and it depends on one or two integer values specifying the element.
So for your example where you want to find the names, you would find an element by its XPath, then the next element in the list by its XPath, and spot the differing value:
The first name, 'Airi Satou', is found at the following XPath:
//*[@id="example"]/tbody/tr[1]/td[1]
Airi's position has the following XPath:
//*[@id="example"]/tbody/tr[1]/td[2]
You can see that within a row, the XPath for each piece of information differs in the 'td' index.
The next name in the list, 'Angela Ramos', is found at:
//*[@id="example"]/tbody/tr[2]/td[1]
And Angela's position is found at:
//*[@id="example"]/tbody/tr[2]/td[2]
You can see that the row is controlled by the 'tr' index.
By iterating over values of 'tr' and 'td' you can get the whole table.
As for writing to a CSV, there are some solid Java libraries for writing to CSV files. I think a straightforward example to follow is here:
Java - Writing strings to a CSV file
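For reference, a minimal no-library sketch (the method and file name are made up for illustration; a library such as OpenCSV handles quoting for you):
// writes each row as one comma-separated line, quoting cells that contain commas or quotes
static void writeCsv(String fileName, List<List<String>> rows) throws IOException {
    try (FileWriter writer = new FileWriter(fileName)) {
        for (List<String> row : rows) {
            StringJoiner line = new StringJoiner(",");
            for (String cell : row) {
                if (cell.contains(",") || cell.contains("\"")) {
                    // double any embedded quotes, then wrap the cell in quotes
                    cell = "\"" + cell.replace("\"", "\"\"") + "\"";
                }
                line.add(cell);
            }
            writer.write(line.toString() + "\r\n");
        }
    }
}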
UPDATE:
@User169 It looks like you're gathering a list of elements for each row in the table. Instead, you want to build the XPaths one by one, iterating over the row and column indices rather than over the list of web elements you found originally. Try this, then add to it so it will get the text and save it to an array.
for (int num_row = 1; num_row <= total_rows; num_row++) {
    for (int num_col = 1; num_col <= total_col; num_col++) {
        WebElement info = driver.findElement(
                By.xpath("//*[@id=\"example\"]/tbody/tr[" + num_row + "]/td[" + num_col + "]"));
    }
}
I haven't tested it so it may need a few small changes.
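To complete that idea, an untested sketch along the same lines that collects every cell's text into a row-major list (the example table id comes from the page being scraped):
List<WebElement> trs = driver.findElements(By.xpath("//*[@id=\"example\"]/tbody/tr"));
List<List<String>> tableData = new ArrayList<>();
for (int numRow = 1; numRow <= trs.size(); numRow++) {
    // count the cells of this row, then address each one by its tr/td indices
    List<WebElement> tds = driver.findElements(By.xpath("//*[@id=\"example\"]/tbody/tr[" + numRow + "]/td"));
    List<String> rowData = new ArrayList<>();
    for (int numCol = 1; numCol <= tds.size(); numCol++) {
        rowData.add(driver.findElement(By.xpath("//*[@id=\"example\"]/tbody/tr[" + numRow + "]/td[" + numCol + "]")).getText());
    }
    tableData.add(rowData);
}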

How to select an Angular grid-controller row in Selenium with WebDriver?

I am working on an automation test framework in Selenium with Java.
In my test application we have used an Angular grid controller.
How can I access a grid row to perform a few operations?
Finally I got the answer myself.
For every Angular grid, Angular generates a column index (a hex number) which is appended to the class attribute of the tag.
So we can access the cell values with that class attribute and iterate through all the rows for the host-name column as shown in the image. Please find the Selenium code snippet for the same:
List<WebElement> rows = driver.findElements(By.xpath("//*[contains(@class,'ui-grid-cell ng-scope ui-grid-disable-selection ui-grid-coluiGrid-0006')]//div"));
int iSize = rows.size();
for (int i = 0; i < iSize; i++) {
    String sValue = "192.168.30.70";
    if (sValue.equalsIgnoreCase(inputtext)) {
        rows.get(i).click();
        break;
    }
}
So in this way we can search the particular grid column for the matching value.
I tried your code and somehow it didn't work for me; it didn't click the row or do anything. Here's a screenshot of my application:
screenshot
And here is my code:
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
List<WebElement> rows = driver.findElements(By.xpath("//*[contains(@class,'ui-grid-cell hoverable-cell ng-scope ui-grid-coluiGrid-0084')]//div"));
int iSize = rows.size();
for (int i = 0; i < iSize; i++) {
    String sValue = "MATT";
    if (sValue.equalsIgnoreCase("matt")) {
        rows.get(i).click();
        break;
    }
}
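Note that in both snippets the condition never inspects the cell being iterated: sValue is compared against a fixed value, so the loop either clicks the very first cell or nothing at all. A sketch of what the comparison presumably intends, matching each cell's visible text against the sought value (the class name is copied from the snippet above and will differ per grid):
List<WebElement> cells = driver.findElements(By.xpath("//*[contains(@class,'ui-grid-coluiGrid-0084')]//div"));
String sought = "MATT";
for (WebElement cell : cells) {
    // click the first cell whose visible text matches the sought value
    if (sought.equalsIgnoreCase(cell.getText())) {
        cell.click();
        break;
    }
}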

Read a specified line of text from a webpage with Jsoup

So I am trying to get the data from this webpage using Jsoup...
I've tried looking up many different ways of doing it and I've gotten close, but I don't know how to find the tags for certain stats (Attack, Strength, Defence, etc.).
So let's say, for example's sake, I wanted to print out:
'Attack', '15', '99', '200,000,000'
How should I go about doing this?
You can use CSS selectors in Jsoup to easily extract the column data.
// retrieve page source code
Document doc = Jsoup
        .connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
        .get();
// find all of the table rows
Elements rows = doc.select("div#contentHiscores table tr");
ListIterator<Element> itr = rows.listIterator();
// loop over each row
while (itr.hasNext()) {
    Element row = itr.next();
    // does the second col contain the word attack?
    if (row.select("td:nth-child(2) a:contains(attack)").first() != null) {
        // if so, assign each sibling col to a variable
        String rank = row.select("td:nth-child(3)").text();
        String level = row.select("td:nth-child(4)").text();
        String xp = row.select("td:nth-child(5)").text();
        System.out.printf("rank=%s level=%s xp=%s", rank, level, xp);
        // stop looping rows, found attack
        break;
    }
}
A very rough implementation would be as below. I have just shown a snippet; optimizations or other conditionals need to be added.
public static void main(String[] args) throws Exception {
    Document doc = Jsoup
            .connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
            .get();
    Element contentHiscoresDiv = doc.getElementById("contentHiscores");
    Element table = contentHiscoresDiv.child(0);
    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        for (Element column : tds) {
            if (column.children() != null && column.children().size() > 0) {
                Element anchorTag = column.getElementsByTag("a").first();
                if (anchorTag != null && anchorTag.text().contains("Attack")) {
                    System.out.println(anchorTag.text());
                    Elements attributeSiblings = column.siblingElements();
                    for (Element attributeSibling : attributeSiblings) {
                        System.out.println(attributeSibling.text());
                    }
                }
            }
        }
    }
}
Output:
Attack
15
99
200,000,000

How to fetch data of multiple HTML tables through Web Scraping in Java

I was trying to scrape the data of a website and to some extents I succeed in my goal. But, there is a problem that the web page I am trying to scrape have got multiple HTML tables in it. Now, when I execute my program it only retrieves the data of the first table in the CSV file and not retrieving the other tables. My java class code is as follows.
public static void parsingHTML() throws Exception {
    //tbodyElements = doc.getElementsByTag("tbody");
    for (int i = 1; i <= 1; i++) {
        Elements table = doc.getElementsByTag("table");
        if (table.isEmpty()) {
            throw new Exception("Table is not found");
        }
        elements = table.get(0).getElementsByTag("tr");
        for (Element trElement : elements) {
            trElement2 = trElement.getElementsByTag("tr");
            tdElements = trElement.getElementsByTag("td");
            File fold = new File("C:\\convertedCSV9.csv");
            fold.delete();
            File fnew = new File("C:\\convertedCSV9.csv");
            FileWriter sb = new FileWriter(fnew, true);
            //StringBuilder sb = new StringBuilder(" ");
            //String y = "<tr>";
            for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
                //Element tdElement1 = it.next();
                //final String content2 = tdElement1.text();
                if (it.hasNext()) {
                    sb.append("\r\n");
                }
                for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
                    Element tdElement2 = it.next();
                    final String content = tdElement2.text();
                    //stringjoiner.add(content);
                    //sb.append(formatData(content));
                    if (it2.hasNext()) {
                        sb.append(formatData(content));
                        sb.append(" , ");
                    }
                    if (!it.hasNext()) {
                        String content1 = content.replaceAll(",$", " ");
                        sb.append(formatData(content1));
                        //it2.next();
                    }
                }
                System.out.println(sb.toString());
                sb.flush();
                sb.close();
            }
            System.out.println(sampleList.add(tdElements));
        }
    }
}
My analysis is that there is a loop which is only checking tr and td tags. After the first table there is a style sheet on the HTML page, and maybe the loop is breaking because of that style sheet. I think that's the reason it is not proceeding to the next table.
P.S.: here's the link which I am trying to scrape:
http://www.mufap.com.pk/nav_returns_performance.php?tab=01
What you do just at the beginning of your code will not work:
// loop just once, why?
for (int i = 1; i <= 1; i++) {
    Elements table = doc.getElementsByTag("table");
    if (table.isEmpty()) {
        throw new Exception("Table is not found");
    }
    elements = table.get(0).getElementsByTag("tr");
Here you loop just once, read all table elements, and then process all tr elements of the first table you find. So even if you looped more than once, you would always process the first table.
You will have to iterate over all table elements instead, e.g.
for (Element table : doc.getElementsByTag("table")) {
    for (Element trElement : table.getElementsByTag("tr")) {
        // process "td"s and so on
    }
}
Edit: Since you're having trouble with the code above, here's a more thorough example. Note that I'm using Jsoup to read and parse the HTML (you didn't specify what you are using).
Document doc = Jsoup
        .connect("http://www.mufap.com.pk/nav_returns_performance.php?tab=01")
        .get();
for (Element table : doc.getElementsByTag("table")) {
    for (Element trElement : table.getElementsByTag("tr")) {
        // skip header "tr"s and process only data "tr"s
        if (trElement.hasClass("tab-data1")) {
            StringJoiner tdj = new StringJoiner(",");
            for (Element tdElement : trElement.getElementsByTag("td")) {
                tdj.add(tdElement.text());
            }
            System.out.println(tdj);
        }
    }
}
This will concatenate and print all data cells (those having the class tab-data1). You will still have to modify it to write to your CSV file though.
Note: in my tests this processes 21 tables, 243 trs and 2634 tds.
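If it helps, one minimal way to send those joined rows to a file instead of the console is to wrap the loops in a writer (a sketch reusing the C:\convertedCSV9.csv path from your question; cells containing commas would still need quoting):
try (PrintWriter out = new PrintWriter(new FileWriter("C:\\convertedCSV9.csv"))) {
    for (Element table : doc.getElementsByTag("table")) {
        for (Element trElement : table.getElementsByTag("tr")) {
            if (trElement.hasClass("tab-data1")) {
                StringJoiner tdj = new StringJoiner(",");
                for (Element tdElement : trElement.getElementsByTag("td")) {
                    tdj.add(tdElement.text());
                }
                // one CSV line per data row
                out.println(tdj);
            }
        }
    }
}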
