How to scrape table data from a specific site with jsoup (Java)

I'm trying to scrape some data from the table on this site: https://www.worldometers.info/coronavirus/
Here is the source code of the scraper I've tried:
public static void main(String[] args) throws Exception {
    String url = "https://www.worldometers.info/coronavirus/";
    try {
        Document doc = Jsoup.connect(url).get();
        Element table = doc.getElementById("main_table_countries_today");
        Elements rows = table.getElementsByTag("tr");
        for (Element row : rows) {
            Elements tds = row.getElementsByTag("td");
            for (int i = 0; i < tds.size(); i++) {
                System.out.println(tds.get(i).text());
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
And here is the output
China
80,928
+34
3,245
+8
70,420
7,263
2,274
56
Italy
35,713 ....
I would like to scrape the data for only one specific country, e.g. France,
but I have no idea how to do it.

Check every "td" in a row for the text "France"; as soon as one matches, print the whole row.
public static void main(String[] args) throws Exception {
    String url = "https://www.worldometers.info/coronavirus/";
    try {
        Document doc = Jsoup.connect(url).get();
        Element table = doc.getElementById("main_table_countries_today");
        Elements rows = table.getElementsByTag("tr");
        for (Element row : rows) {
            Elements tds = row.getElementsByTag("td");
            for (int i = 0; i < tds.size(); i++) {
                if (tds.get(i).text().equals("France")) {
                    // print all cells of the matching row on one line
                    System.out.println(row.text());
                    break; // one match per row is enough
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Output:
France 14,459 562 1,587 12,310 1,525 222
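The matching step can be tried without a network call. Below is a minimal sketch, assuming each table row has already been turned into a list of cell texts and that the country name is the first cell, as in the output above. `CountryRowFilter` and `findRow` are hypothetical names, not part of jsoup, and the sample rows are copied from the outputs shown in this thread.

```java
import java.util.List;
import java.util.Optional;

class CountryRowFilter {

    // Returns the first row whose first cell equals the given country name.
    static Optional<List<String>> findRow(List<List<String>> rows, String country) {
        for (List<String> cells : rows) {
            // Header rows contain no <td> cells, so they show up as empty lists.
            if (!cells.isEmpty() && cells.get(0).trim().equalsIgnoreCase(country)) {
                return Optional.of(cells);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Sample data standing in for the cell texts read from the page.
        List<List<String>> rows = List.of(
                List.<String>of(),
                List.of("China", "80,928", "+34", "3,245", "+8", "70,420", "7,263", "2,274", "56"),
                List.of("France", "14,459", "562", "1,587", "12,310", "1,525", "222"));

        findRow(rows, "France")
                .ifPresent(cells -> System.out.println(String.join(" ", cells)));
    }
}
```

Comparing only the name column avoids a false hit when the search text happens to appear in another cell, and returning at the first match means the remaining rows are not scanned.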

Related

How to get hyperlink boundaries of inline words with Aspose Words for Android?

My Android app reads paragraphs and some of their properties from an MS Word document using the Aspose.Words for Android library. It gets the paragraph text, the style name, and the style-separator flag. Some words in a paragraph line are hyperlinks. How can I get the start and end boundaries of the hyperlinked words? For example:
This is an inline hyperlink paragraph example, where the word "hyperlink" is the link: the start bound is 18 and the end bound is 27.
public static ArrayList<String[]> GetBookLinesByTag(String file) {
    ArrayList<String[]> bookLines = new ArrayList<>();
    try {
        Document doc = new Document(file);
        ParagraphCollection paras = doc.getFirstSection().getBody().getParagraphs();
        for (int i = 0; i < paras.getCount(); i++) {
            String styleName = paras.get(i).getParagraphFormat().getStyleName().trim();
            String isStyleSeparator = Integer.toString(paras.get(i).getBreakIsStyleSeparator() ? 1 : 0);
            String content = paras.get(i).toString(SaveFormat.TEXT).trim();
            bookLines.add(new String[]{content, styleName, isStyleSeparator});
        }
    } catch (Exception e) {}
    return bookLines;
}
Edit:
Thanks to Alexey Noskov, this is solved with his help.
public static ArrayList<String[]> GetBookLinesByTag(String file) {
    ArrayList<String[]> bookLines = new ArrayList<>();
    try {
        Document doc = new Document(file);
        ParagraphCollection paras = doc.getFirstSection().getBody().getParagraphs();
        for (int i = 0; i < paras.getCount(); i++) {
            String styleName = paras.get(i).getParagraphFormat().getStyleName().trim();
            String isStyleSeparator = Integer.toString(paras.get(i).getBreakIsStyleSeparator() ? 1 : 0);
            String content = paras.get(i).toString(SaveFormat.TEXT).trim();
            for (Field field : paras.get(i).getRange().getFields()) {
                if (field.getType() == FieldType.FIELD_HYPERLINK) {
                    FieldHyperlink hyperlink = (FieldHyperlink) field;
                    String urlId = hyperlink.getSubAddress();
                    String urlText = hyperlink.getResult();
                    // Reformat linked text: urlText:urlId
                    content = urlText + ":" + urlId;
                }
            }
            bookLines.add(new String[]{content, styleName, isStyleSeparator});
        }
    } catch (Exception e) {}
    return bookLines;
}
Hyperlinks in MS Word documents are represented as fields. If you press Alt+F9 in MS Word, you will see something like this:
{ HYPERLINK "https://aspose.com" }
Follow this link to learn more about fields in the Aspose.Words document model and in MS Word:
https://docs.aspose.com/display/wordsjava/Introduction+to+Fields
In your case you need to locate the position of FieldStart; this is the start position. Then measure the length of the content between FieldSeparator and FieldEnd; the start position plus that length is the end position.
Disclosure: I work on the Aspose.Words team.
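The arithmetic described in the answer can be sketched without the Aspose.Words API. In the sketch below, `Piece` and `bounds` are hypothetical names: each `Piece` stands for a stretch of visible paragraph text, and `linked` marks the text that would sit between FieldSeparator and FieldEnd of a HYPERLINK field. How the pieces are collected from a real document is left out and would depend on walking the paragraph's nodes.

```java
import java.util.ArrayList;
import java.util.List;

class HyperlinkBounds {

    // One stretch of visible text; linked == true marks a hyperlink's display text.
    static final class Piece {
        final String text;
        final boolean linked;

        Piece(String text, boolean linked) {
            this.text = text;
            this.linked = linked;
        }
    }

    // Returns {start, end} pairs (end exclusive), counted in visible characters.
    static List<int[]> bounds(List<Piece> pieces) {
        List<int[]> result = new ArrayList<>();
        int position = 0;
        for (Piece piece : pieces) {
            int length = piece.text.length();
            if (piece.linked) {
                // end = start position + length of the linked text
                result.add(new int[] {position, position + length});
            }
            position += length;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Piece> pieces = List.of(
                new Piece("This is an inline ", false),
                new Piece("hyperlink", true),
                new Piece(" paragraph example", false));
        for (int[] b : bounds(pieces)) {
            System.out.println("start=" + b[0] + " end=" + b[1]); // prints start=18 end=27
        }
    }
}
```

For the example sentence from the question this yields 18 and 27, which matches the bounds the asker expects.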

How to loop inside containers to select a value in Selenium?

I have a product page which has the sizes inside containers. I tried to list the elements and get the size by text, but the list always returns zero elements. I tried the XPath of both the parent and the child and got the same error. How can I list the sizes and select a specific size?
public void chooseSize(String size) {
    String selectedSize;
    List<WebElement> sizesList = actions.driver.findElements(By.xpath("SelectSizeLoactor"));
    try {
        for (int i = 0; i <= sizesList.size(); i++) {
            if (sizesList.get(i).getText().toLowerCase().contains(size.toLowerCase()));
            {
                selectedSize = sizesList.get(i).getText();
                sizesList.get(i).click();
                assertTrue(selectedSize.equals(size));
            }
        }
    } catch (Exception e) {
        Assert.fail("Couldn't select size cause of " + e.getMessage());
    }
}
It looks to me like the proper selector would be:
actions.driver.findElements(By.cssSelector(".SizeSelection-option"))
Try one of the options below:
List<WebElement> sizesList = actions.driver.findElements(By.xpath("//*[@class='SelectSizeLoactor']"));
List<WebElement> sizesList = actions.driver.findElements(By.cssSelector(".SelectSizeLoactor"));
I found a quick solution: I kept the first part of the XPath, up to text(), in a string, then appended the requested size and the closing part of the XPath at call time, and it worked!
String SelectSizeLoactor = "//button[text()='";
public void chooseSize(String size) {
    String selectedSize;
    try {
        // builds //button[text()='<SIZE>'] from the prefix and the requested size
        WebElement sizeLocator = actions.driver.findElement(By.xpath(SelectSizeLoactor + size.toUpperCase() + "']"));
        if (sizeLocator.getText().toUpperCase().contains(size.toUpperCase())) {
            selectedSize = sizeLocator.getText();
            sizeLocator.click();
            // the locator upper-cases the size, so compare case-insensitively
            assertTrue(selectedSize.equalsIgnoreCase(size));
        }
    } catch (Exception e) {
        Assert.fail("Couldn't select size cause of " + e.getMessage());
    }
}
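Separately from the locator problem, the loop in the question has two pitfalls: `i <= sizesList.size()` runs one index past the end of the list, and the `;` directly after `if (...)` ends the statement, so the block below it runs for every element. Here is a minimal sketch of the intended matching, using plain strings in place of the `getText()` results. `SizePicker` and `indexOfSize` are hypothetical names, and the size labels are made-up sample data.

```java
import java.util.List;

class SizePicker {

    // Returns the index of the first label equal to the wanted size
    // (ignoring case and surrounding whitespace), or -1 if none matches.
    static int indexOfSize(List<String> labels, String wanted) {
        for (int i = 0; i < labels.size(); i++) { // '<' keeps i inside the list
            if (labels.get(i).trim().equalsIgnoreCase(wanted.trim())) { // no ';' after the condition
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> labels = List.of("S", "M", "XL", "L");
        System.out.println(indexOfSize(labels, "l")); // prints 3
    }
}
```

An exact comparison is used here instead of `contains`, because with `contains` a request for "L" would already match "XL".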

Use jsoup to draw an HTML table

I'm reading an HTML file using jsoup. I want to display the HTML table; how can I do that?
I'm a beginner with jsoup, and not that experienced a Java developer. :)
public class test {
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        File input = new File("D://index.html"); // read from an HTML file
        Document doc = Jsoup.parse(input, "UTF-8");
        //test
        Elements trs = doc.select("table").select("tr");
        for (Element e : trs) {
            System.out.println("-------------------");
            System.out.println(e.text());
        }
    }
}
Without knowing jsoup, I guess you should descend into the html structure step by step, like this:
...
//test
Elements tables = doc.select("table");
for (Element table : tables) {
    for (Element row : table.select("tr")) {
        for (Element e : row.select("td")) {
            // output your td-contents here
            System.out.println("-------------------");
            System.out.println(e.text());
        }
    }
}
...
The advantage of this approach is that you have more control over drawing separators between the HTML Elements.
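Once the cell texts are collected row by row, drawing the table is a plain string-formatting task. The sketch below assumes the cells are already available as lists of strings (for example, filled inside the innermost loop above); `TextTable` and `render` are hypothetical names, and the sample rows are made up.

```java
import java.util.List;

class TextTable {

    // Renders rows of cell texts as an aligned, pipe-separated text table.
    static String render(List<List<String>> rows) {
        // First pass: determine the column count and the widest cell per column.
        int columns = 0;
        for (List<String> row : rows) {
            columns = Math.max(columns, row.size());
        }
        int[] widths = new int[columns];
        for (List<String> row : rows) {
            for (int c = 0; c < row.size(); c++) {
                widths[c] = Math.max(widths[c], row.get(c).length());
            }
        }
        // Second pass: pad every cell to its column width.
        StringBuilder out = new StringBuilder();
        for (List<String> row : rows) {
            out.append('|');
            for (int c = 0; c < columns; c++) {
                String cell = c < row.size() ? row.get(c) : ""; // short rows get empty cells
                out.append(' ').append(cell);
                out.append(" ".repeat(widths[c] - cell.length())).append(" |");
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(List.of(
                List.of("Name", "Qty"),
                List.of("Apple", "3"),
                List.of("Kiwi", "12"))));
    }
}
```

Two passes are needed because a column's width is only known after every row has been seen.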

Read a specified line of text from a webpage with Jsoup

So I am trying to get the data from this webpage using Jsoup...
I've tried looking up many different ways of doing it and I've gotten close, but I don't know how to find the tags for certain stats (Attack, Strength, Defence, etc.).
So let's say, for example's sake, I wanted to print out
'Attack', '15', '99', '200,000,000'
How should I go about doing this?
You can use CSS selectors in Jsoup to easily extract the column data.
// retrieve page source code
Document doc = Jsoup
        .connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
        .get();
// find all of the table rows
Elements rows = doc.select("div#contentHiscores table tr");
ListIterator<Element> itr = rows.listIterator();
// loop over each row
while (itr.hasNext()) {
    Element row = itr.next();
    // does the second col contain the word attack?
    if (row.select("td:nth-child(2) a:contains(attack)").first() != null) {
        // if so, assign each sibling col to variable
        String rank = row.select("td:nth-child(3)").text();
        String level = row.select("td:nth-child(4)").text();
        String xp = row.select("td:nth-child(5)").text();
        System.out.printf("rank=%s level=%s xp=%s", rank, level, xp);
        // stop looping rows, found attack
        break;
    }
}
A very rough implementation would be as below. I have only shown a snippet; optimizations and other conditionals still need to be added.
public static void main(String[] args) throws Exception {
    Document doc = Jsoup
            .connect("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=Lynx%A0Titan")
            .get();
    Element contentHiscoresDiv = doc.getElementById("contentHiscores");
    Element table = contentHiscoresDiv.child(0);
    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        for (Element column : tds) {
            if (column.children() != null && column.children().size() > 0) {
                Element anchorTag = column.getElementsByTag("a").first();
                if (anchorTag != null && anchorTag.text().contains("Attack")) {
                    System.out.println(anchorTag.text());
                    Elements attributeSiblings = column.siblingElements();
                    for (Element attributeSibling : attributeSiblings) {
                        System.out.println(attributeSibling.text());
                    }
                }
            }
        }
    }
}
Attack
15
99
200,000,000

How to fetch data of multiple HTML tables through Web Scraping in Java

I was trying to scrape data from a website, and to some extent I succeeded. The problem is that the web page I am trying to scrape contains multiple HTML tables. When I execute my program, it only writes the data of the first table to the CSV file and does not retrieve the other tables. My Java class code is as follows.
public static void parsingHTML() throws Exception {
    //tbodyElements = doc.getElementsByTag("tbody");
    for (int i = 1; i <= 1; i++) {
        Elements table = doc.getElementsByTag("table");
        if (table.isEmpty()) {
            throw new Exception("Table is not found");
        }
        elements = table.get(0).getElementsByTag("tr");
        for (Element trElement : elements) {
            trElement2 = trElement.getElementsByTag("tr");
            tdElements = trElement.getElementsByTag("td");
            File fold = new File("C:\\convertedCSV9.csv");
            fold.delete();
            File fnew = new File("C:\\convertedCSV9.csv");
            FileWriter sb = new FileWriter(fnew, true);
            //StringBuilder sb = new StringBuilder(" ");
            //String y = "<tr>";
            for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
                //Element tdElement1 = it.next();
                //final String content2 = tdElement1.text();
                if (it.hasNext()) {
                    sb.append("\r\n");
                }
                for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
                    Element tdElement2 = it.next();
                    final String content = tdElement2.text();
                    //stringjoiner.add(content);
                    //sb.append(formatData(content));
                    if (it2.hasNext()) {
                        sb.append(formatData(content));
                        sb.append(" , ");
                    }
                    if (!it.hasNext()) {
                        String content1 = content.replaceAll(",$", " ");
                        sb.append(formatData(content1));
                        //it2.next();
                    }
                }
                System.out.println(sb.toString());
                sb.flush();
                sb.close();
            }
            System.out.println(sampleList.add(tdElements));
        }
    }
}
My analysis is that there is a loop which only checks the tr and td elements. After the first table there is a style sheet on the HTML page. Maybe the loop breaks because of the style sheet; I think that's the reason it is not proceeding to the next table.
P.S.: here's the link to the page I am trying to scrape:
http://www.mufap.com.pk/nav_returns_performance.php?tab=01
What you do just at the beginning of your code will not work:
// loop just once, why?
for (int i = 1; i <= 1; i++) {
    Elements table = doc.getElementsByTag("table");
    if (table.isEmpty()) {
        throw new Exception("Table is not found");
    }
    elements = table.get(0).getElementsByTag("tr");
Here you loop just once, read all table elements, and then process the tr elements of only the first table (table.get(0)). So even if you looped more than once, you would always process the first table.
You will have to iterate all table elements, e.g.
for (Element table : doc.getElementsByTag("table")) {
    for (Element trElement : table.getElementsByTag("tr")) {
        // process "td"s and so on
    }
}
Edit: Since you're having trouble with the code above, here's a more thorough example. Note that I'm using jsoup to read and parse the HTML (you didn't specify what you are using).
Document doc = Jsoup
        .connect("http://www.mufap.com.pk/nav_returns_performance.php?tab=01")
        .get();
for (Element table : doc.getElementsByTag("table")) {
    for (Element trElement : table.getElementsByTag("tr")) {
        // skip header "tr"s and process only data "tr"s
        if (trElement.hasClass("tab-data1")) {
            StringJoiner tdj = new StringJoiner(",");
            for (Element tdElement : trElement.getElementsByTag("td")) {
                tdj.add(tdElement.text());
            }
            System.out.println(tdj);
        }
    }
}
This will concatenate and print all data cells of each data row (the rows having the class tab-data1). You will still have to modify it to write to your CSV file, though.
Note: in my tests this processes 21 tables, 243 trs and 2634 tds.
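For the remaining CSV step, one detail worth handling is that cell texts can themselves contain commas, which a plain `StringJoiner` would not protect. The following is a minimal sketch of one way to build each line, assuming RFC 4180 style quoting is acceptable for the target file; `CsvLines`, `escape`, and `toCsvLine` are hypothetical names, and the sample values are made up.

```java
import java.util.List;
import java.util.StringJoiner;

class CsvLines {

    // Quotes a cell if it contains a comma, a double quote or a line break.
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n") || cell.contains("\r")) {
            // embedded quotes are doubled, then the whole cell is wrapped in quotes
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Joins the cells of one table row into a single CSV line.
    static String toCsvLine(List<String> cells) {
        StringJoiner line = new StringJoiner(",");
        for (String cell : cells) {
            line.add(escape(cell));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(List.of("Fund A", "1,234.56", "say \"hi\"")));
    }
}
```

The resulting lines can be collected in a list and written in one go, for example with `java.nio.file.Files.write(path, lines)`. Opening the file once, outside the row loop, also avoids the pattern in the question where the file is deleted and reopened for every row.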
