Check for table data - Jsoup - java

Is there a way to check if a table has a certain row using jsoup?
I am getting a java.lang.IndexOutOfBoundsException: Invalid location 1, size is 1. My code for getting the data out of the table is:
docTide = Jsoup.connect("http://www.mhpa.co.uk/search-tide-times/").timeout(600000).get();
Elements tideTimeOdd = docTide.select("div.tide_row.odd div:eq(0)");
Elements tideTimeEven = docTide.select("div.tide_row.even div:eq(0)");
Elements tideHightOdd = docTide.select("div.tide_row.odd div:eq(2)");
Elements tideHightEven = docTide.select("div.tide_row.even div:eq(2)");
Element firstTideTime = tideTimeOdd.first();
Element secondTideTime = tideTimeEven.first();
Element thirdTideTime = tideTimeOdd.get(1);
Element fourthTideTime = tideTimeEven.get(1);
The exception occurs because sometimes the table has only 3 rows instead of 4. The rows come in this order:
odd
even
odd
even
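It is the last 'even' row that is causing the problem: when it is missing, tideTimeEven has only one element. This is the markup when all four rows are present: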
<div class="tide_row odd">
<div class="time">00:57</div>
<div class="height_m">4.9</div>
<div class="height_f">16,1</div>
<div class="range_m">1.9</div>
<div class="range_f">6,3</div>
</div>
<div class="tide_row even">
<div class="time">07:23</div>
<div class="height_m">2.9</div>
<div class="height_f">9,6</div>
<div class="range_m">2</div>
<div class="range_f">6,7</div>
</div>
<div class="tide_row odd">
<div class="time">13:46</div>
<div class="height_m">5.1</div>
<div class="height_f">16,9</div>
<div class="range_m">2.2</div>
<div class="range_f">7,3</div>
</div>
<div class="tide_row even">
<div class="time">20:23</div>
<div class="height_m">2.8</div>
<div class="height_f">9,2</div>
<div class="range_m">2.3</div>
<div class="range_f">7,7</div>
</div>

To check whether an element exists at a given index, compare that index with the size() of the Elements object before calling get().
To check for a particular Element, use the contains() method.
You might also consider using a loop to iterate over all the Element objects in your Elements collection, which avoids hard-coded indexes altogether.
if (tideTimeEven.size() > 1) {
    // Safe to call tideTimeEven.get(1) here
}

You could do:
Element fourthTideTime = null; // stays null when the fourth row is missing
if (tideTimeEven.size() > 1) {
    fourthTideTime = tideTimeEven.get(1);
}
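The size check can also be wrapped in a small helper so that each lookup does not need its own if. This is only a minimal sketch: SafeGet and getOrNull are hypothetical names, not part of jsoup. The helper works on any java.util.List, and since jsoup's Elements is a List<Element>, tideTimeEven could be passed to it directly. Plain strings stand in for the scraped rows so the snippet runs without a network call.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of jsoup.
final class SafeGet {
    private SafeGet() {}

    // Returns the item at the given index, or null when the list is too short.
    static <T> T getOrNull(List<T> items, int index) {
        return (index >= 0 && index < items.size()) ? items.get(index) : null;
    }

    public static void main(String[] args) {
        // Stand-in for the "even" rows on a day with only three tides.
        List<String> evenTimes = Arrays.asList("07:23");

        String secondTide = getOrNull(evenTimes, 0);
        String fourthTide = getOrNull(evenTimes, 1);

        System.out.println(secondTide);
        System.out.println(fourthTide == null ? "no fourth tide" : fourthTide);
    }
}
```

The caller still has to handle the null, but the bounds check now lives in one place.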

Related

How do I loop through divs using jsoup

Hi guys, I'm using jsoup in a Java web application in IntelliJ. I'm trying to scrape port call event data from a ship-tracking website and store it in a MySQL database. The data for each event sits in a div with the class table-group, and the values are in a nested div with the class table-row. My problem is that the rows for all vessels share the same class name, and I want to loop through each row and push its data to the database. So far I have managed to write a Java class that scrapes the first row. How can I loop through every row and store those values in my database? Should I create an ArrayList to store the values?
This is my scraper class:
public class Scarper {
private static Document doc;
public static void main(String[] args) {
final String url =
"https://www.myshiptracking.com/ports-arrivals-departures/?mmsi=&pid=277&type=0&time=&pp=20";
try {
doc = Jsoup.connect(url).get();
} catch (IOException e) {
e.printStackTrace();
}
Events();
}
public static void Events() {
Elements elm = doc.select("div.table-group:nth-of-type(2) > .table-row");
List<String> arrayList = new ArrayList();
for (Element ele : elm) {
String event = ele.select("div.col:nth-of-type(2)").text();
String time = ele.select("div.col:nth-of-type(3)").text();
String port = ele.select("div.col:nth-of-type(4)").text();
String vessel = ele.select(".td_vesseltype.col").text();
Event ev = new Event();
System.out.println(event);
System.out.println(time);
System.out.println(port);
System.out.println(vessel);
}
}
}
A sample of the divs I want to scrape:
<div style="box-sizing: border-box;padding: 0px 10px 10px 10px;">
<div class="cs-table">
<div class="heading">
<div class="col" style="width: 10px"></div>
<div class="col" style="width: 110px">Event</div>
<div class="col" style="width: 120px">Time (<span class="tooltip" title="My Time: In your current TimeZone">MT</span>)</div>
<div class="col" style="width: 150px">Port</div>
<div class="col">Vessel</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>BELFAST</div>
<div class="col td_vesseltype"><img src="/icons/icon7_511.png"><span class="padding_18">WILSON BLYTH [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-flag-checkered green"></i></div>
<div class="col">Arrival</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>HUNTERS QUAY</div>
<div class="col td_vesseltype"><img src="/icons/icon6_511.png"><span class="padding_18">SOUND OF SOAY [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>LARGS</div>
<div class="col td_vesseltype"><img src="/icons/icon6_511.png"><span class="padding_18">LOCH SHIRA [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>RYDE</div>
<div class="col td_vesseltype"><img src="/icons/icon4_511.png"><span class="padding_18">ISLAND FLYER [GB]</span></div>
</div>
</div>
Start by looping over the table's rows. The selector for the table is .cs-table, so you can get the table with Element table = doc.select(".cs-table").first();. Next, get the table's rows with the selector div.table-row: Elements rows = table.select("div.table-row");. Now you can loop over all the rows and extract the data from each one. The code should look like:
Element table = doc.select(".cs-table").first();
Elements rows = table.select("div.table-row"); // search only inside the table
for (Element row : rows) {
String event = row.select("div.col:nth-of-type(2)").text();
String time = row.select("div.col:nth-of-type(3)").text();
String port = row.select("div.col:nth-of-type(4)").text();
String vessel = row.select(".td_vesseltype.col").text();
System.out.println(event + "-" + time + " " + port + " " + vessel);
System.out.println("---------------------------");
// Do stuff with data here
}
Now it's up to you to decide whether to keep the data in an array or list inside the loop and use it later, or to insert it directly into your database.
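On the question of whether to use a list: one option is to collect each row into a small value object and hand the whole list to the database layer afterwards. The following is a minimal sketch under assumptions: PortEvent and PortEventCollector are hypothetical names, and the String[] rows stand in for the four values read with jsoup, so the snippet runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value object holding one scraped row.
final class PortEvent {
    final String event;
    final String time;
    final String port;
    final String vessel;

    PortEvent(String event, String time, String port, String vessel) {
        this.event = event;
        this.time = time;
        this.port = port;
        this.vessel = vessel;
    }

    @Override
    public String toString() {
        return event + " " + time + " " + port + " " + vessel;
    }
}

final class PortEventCollector {
    private PortEventCollector() {}

    // Converts raw rows (event, time, port, vessel) into PortEvent objects,
    // skipping rows that do not have exactly four values.
    static List<PortEvent> collect(List<String[]> rawRows) {
        List<PortEvent> events = new ArrayList<>();
        for (String[] row : rawRows) {
            if (row.length != 4) {
                continue;
            }
            events.add(new PortEvent(row[0], row[1], row[2], row[3]));
        }
        return events;
    }

    public static void main(String[] args) {
        List<String[]> rawRows = new ArrayList<>();
        rawRows.add(new String[] {"Departure", "2022-02-14 16:51", "BELFAST", "WILSON BLYTH [GB]"});
        rawRows.add(new String[] {"Arrival", "2022-02-14 16:51", "HUNTERS QUAY", "SOUND OF SOAY [GB]"});

        for (PortEvent e : collect(rawRows)) {
            System.out.println(e);
        }
    }
}
```

Inside the real loop you would call events.add(new PortEvent(event, time, port, vessel)) with the values read from each row. The finished list can then be written in one go, for example with a JDBC batch insert.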

How to iterate over many arrays simultaneously inside the same loop

I am new to Thymeleaf, and I want to display 3 values from 3 different arrays, taken at the same index, inside the same div.row. I tried several approaches, but I could only iterate over one array at a time without errors. Below is my controller side:
public String index(Model model) {
    String[] table0 = {"0", "1", "2", "3"};
    String[] table1 = {"14", "21", "25", "75"};
    String[] table2 = {"7", "63", "57", "87"};
    model.addAttribute("table0", table0);
    model.addAttribute("table1", table1);
    model.addAttribute("table2", table2);
    return "index";
}
Inside the HTML file, table0 is the first array and iterates without errors. I don't know how to change the following code to display all three arrays (table0, table1 and table2) at the same time:
<div class="row" th:each="v0 : ${table0}">
    <div class="cell" th:text="${v0}">
        <!-- Here I can display a value from table0 -->
    </div>
    <div class="cell">
        <!-- Here I need to display the value of table1 having the same index as v0 -->
    </div>
    <div class="cell">
        <!-- Here I need to display the value of table2 having the same index as v0 -->
    </div>
</div>
What you're looking for is described under "Keeping iteration status" in the Thymeleaf documentation.
Simply add a second variable after the item variable in th:each, and use its index property to get the current index value.
For example:
<div class="row" th:each="v0,iter : ${table0}">
    <div class="cell">
        <!-- Value from table0 -->
        <span th:text="${v0}"></span>
    </div>
    <div class="cell">
        <span th:text="${table1[iter.index]}"></span>
    </div>
    <div class="cell">
        <span th:text="${table2[iter.index]}"></span>
    </div>
</div>
You can use Thymeleaf's iteration status variable (called iterStat below) to do this.
Assuming the following input data:
String[] table0 = {"0", "1", "2", "3"};
String[] table1 = {"14", "21", "25", "75"};
String[] table2 = {"7", "63", "57", "87"};
You can use the following Thymeleaf markup:
<div class="row" th:each="val,iterStat : ${table0}" >
<div class="cell" th:text="${val}">
</div>
<div class="cell" th:text="${table1[iterStat.index]}">
</div>
<div class="cell" th:text="${table2[iterStat.index]}">
</div>
</div>
This produces a column of numbers as follows (I don't have any CSS so it's just the raw output):
0
14
7
1
21
63
2
25
57
3
75
87
The related html looks like this:
<div class="row">
<div class="cell">0</div>
<div class="cell">14</div>
<div class="cell">7</div>
</div>
<div class="row">
<div class="cell">1</div>
<div class="cell">21</div>
<div class="cell">63</div>
</div>
<div class="row">
<div class="cell">2</div>
<div class="cell">25</div>
<div class="cell">57</div>
</div>
<div class="row">
<div class="cell">3</div>
<div class="cell">75</div>
<div class="cell">87</div>
</div>
The iteration status variable (iterStat above) is described in the Thymeleaf documentation; it keeps track of your iterations, including the current zero-based index. Since you want the same index for each table, it's a good fit for your needs.
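If you would rather keep index lookups out of the template, another option is to combine the arrays in the controller and expose a single list of rows. This is a rough sketch, assuming a hypothetical helper called TableZipper; it stops at the shortest array, so an array that is too short cannot cause an out-of-bounds lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: combines several arrays index by index.
final class TableZipper {
    private TableZipper() {}

    // Row i of the result holds element i of every input array.
    // The number of rows equals the length of the shortest array.
    static List<String[]> zip(String[]... tables) {
        List<String[]> rows = new ArrayList<>();
        if (tables.length == 0) {
            return rows;
        }
        int length = tables[0].length;
        for (String[] table : tables) {
            length = Math.min(length, table.length);
        }
        for (int i = 0; i < length; i++) {
            String[] row = new String[tables.length];
            for (int column = 0; column < tables.length; column++) {
                row[column] = tables[column][i];
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] table0 = {"0", "1", "2", "3"};
        String[] table1 = {"14", "21", "25", "75"};
        String[] table2 = {"7", "63", "57", "87"};

        for (String[] row : zip(table0, table1, table2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In the controller this could be added with model.addAttribute("rows", TableZipper.zip(table0, table1, table2)), where "rows" is an attribute name chosen for this example. The template would then iterate over rows with th:each and read each cell as ${row[0]}, ${row[1]} and ${row[2]}.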

How to access nested divs using Jsoup

This is the HTML page:
<div class="doc_details">
<fieldset style="border: 0pt">
<div class="row">
<div class="col-sm-6 col-md-6">
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Speciality</b>
</div>
<div class="col-sm-6 col-md-6">ABCD</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>City</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Residence Address</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Business Address</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
</div>
</div>
</fieldset>
</div>
I would like to read only the values of the Speciality, City and address rows into variables, as follows:
Elements rows = doc.select("div.doc_details div.row div.row ");
Element row_div = rows.select("div.row").get(0);
doctor.speciality = row_div.select("div:eq(0)").text();
But even if I change get(0) to get(1), I can't get just the value into the variable.
You can probably do this with a CSS selector:
Elements divs = doc.select("div.row > div.col-sm-6:nth-child(2)");
which returns this:
0 = {Element#754} "<div class="col-sm-6 col-md-6">\n ABCD \n</div>"
1 = {Element#756} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
2 = {Element#758} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
3 = {Element#760} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
It's then really up to you. You can, for example, map the list to the text of each div:
List<String> texts = divs.stream().map(new Function<Element, String>() {
    @Override
    public String apply(Element element) {
        return element.text();
    }
}).collect(Collectors.toList());
or, more simply:
String speciality = divs.get(0).text();
String city = divs.get(1).text();
String address = divs.get(2).text();
Try this:
Elements rows = doc.select("div.doc_details div.row div.row ");
Element row_div = rows.select("div.col-sm-6").get(1);
doctor.speciality = row_div.text();
Tell me if it works!
Here is how I would do it:
Document doc = Jsoup.parse(html);
Elements rows = doc.select("div.doc_details div.row div.row ");
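for (Element row : rows) {
    // select() can also match the row itself, so index 0 is the row,
    // index 1 the label div and index 2 the value div.
    Elements innerDivs = row.select("div");
    String header = innerDivs.get(1).text();
    String content = innerDivs.get(2).text();
    System.out.println("header = " + header + " -> " + content);
}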
I think the mistake was in your CSS selector!
Edit: the indexes are correct now (thanks to the OP).

Thymeleaf: Iteration - increment by N and access list.get(N)

So I've already read the Iteration part of the documentation, but it still didn't give me any idea of how to do the following:
Iterate per 2 records (since I am rendering something like below) and access list by index.
<div class="row">
<div class="col-md-6">
...
</div>
<div class="col-md-6">
...
</div>
</div>
Basically, if this were in code, it would look something like:
for (int i = 0; i < size;) {
// do stuff
// manual increment
if (i + 2 > size) {
i++;
} else {
i += 2;
}
}
Any other approach that would satisfy my problem is always welcome too!
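I was able to solve it with something like the markup below.
Basically, I still loop over every record but skip every other one with th:if="${stat.even}", and get the next record with stat.index + 1.
Be really cautious about IndexOutOfBoundsException, though: the right-hand column is guarded by a th:if that checks stat.index + 1 against the list size, and Thymeleaf evaluates th:if before th:with on the same tag.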
<div class="row" th:each="hivRisk, stat : ${hivRiskList}" th:if="${stat.even}">
<div class="col-md-6" th:with="leftRisk=${hivRiskList.get(stat.index)}">
<div class="checkbox checkbox-styled">
<label>
<input type="checkbox" value="-1" th:value="${leftRisk.id}"/>
<span th:text="${leftRisk.name}">HIV Risk</span>
</label>
</div>
</div>
<div class="col-md-6" th:if="${stat.index + 1 < hivRiskList.size()}" th:with="rightRisk=${hivRiskList.get(stat.index + 1)}">
<div class="checkbox checkbox-styled">
<label>
<input type="checkbox" value="-1" th:value="${rightRisk.id}"/>
<span th:text="${rightRisk.name}">HIV Risk</span>
</label>
</div>
</div>
</div>
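Another approach is to group the list into rows of two before it reaches the template, so that the markup needs no index arithmetic at all. This is a minimal sketch, assuming a hypothetical helper called Pairs; the last row holds a single item when the list has an odd size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a list into consecutive groups of two.
final class Pairs {
    private Pairs() {}

    // Returns rows of at most two items each, in the original order.
    static <T> List<List<T>> inPairs(List<T> items) {
        List<List<T>> rows = new ArrayList<>();
        for (int i = 0; i < items.size(); i += 2) {
            int end = Math.min(i + 2, items.size());
            // Copy the sublist so each row is independent of the source list.
            rows.add(new ArrayList<>(items.subList(i, end)));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> risks = Arrays.asList("A", "B", "C", "D", "E");
        for (List<String> row : inPairs(risks)) {
            System.out.println(row);
        }
    }
}
```

With the grouped list in the model (for example under a name such as hivRiskRows, chosen here for illustration), the template could use an outer th:each on div.row and an inner th:each on div.col-md-6, and a missing right-hand item simply produces no column.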

How to find elements whose sibling index is less than x and greater than y

I have an Element eNews. After finding two indexes with a CSS query, I need to select the sibling elements whose index is greater than x and less than y:
Elements lines = eNews.select("div.clear");
int x = lines.get(0).elementSiblingIndex();
int y = lines.get(1).elementSiblingIndex();
Elements tNews = eNews.getElementsByIndexGreaterThan(x)
?AND?
eNews.getElementsByIndexLessThan(y)
This is some sample markup. I want to extract the text from the HTML tags between the first and second <div class="clear"></div>:
<div class="aktualnosci">
<div class="zd">
<a href="/Data/Thumbs/ODAweDYwMA,dsc_0458.jpg" title="" rel="lightbox">
<img src="/Data/Thumbs/dsc_0458.jpg"/>
</a>
<p class="show"></p>
</div>
<h3>Awanse</h3>
<div class="data">
<img alt="" src="/Themes/kalendarz-ico.gif">
2013-11-18 12:26
</div>
<!--Start tag-->
<div class="clear"></div>
<!--Tags to extract-->
<p class="gr">W związku z Narodowym Świętem Niepodległości ....</p>
<p style="text-align: justify">W zeszły p....</p>
<p style="text-align: justify">OISW Kraków</p>
<!--End tag-->
<div class="clear"></div>
<div class="slider">
<span class="slide-left"></span>
<span class="slide-right"></span>
</div>
</div>
You can use a selector like div.clear ~ :gt(1):lt(4)
E.g.:
Elements tNews = eNews.select("div.clear ~ :gt(1):lt(4)");
See the jsoup selector docs for :gt(n) and :lt(n). (It's a bit hard to validate that this does what you're trying to achieve without knowing your input HTML and the data you're trying to extract.)
Update based on your edit: there are a couple of ways to do this if you can't know the indexes in advance. Below I get the first div.clear, then accumulate sibling elements until we hit the next div.clear. (I'll have a think if I can generify this pattern and add it to jsoup.)
Document doc = Jsoup.parse(h);
Element firstDiv = doc.select("div.clear").first();
Elements news = new Elements();
Element item = firstDiv.nextElementSibling();
while (item != null && !(item.tagName().equals("div") && item.className().equals("clear"))) {
news.add(item);
item = item.nextElementSibling();
}
System.out.println(String.format("Found %s items", news.size()));
for (Element element : news) {
System.out.println(element.text());
}
Outputs:
Found 3 items
W związku z Narodowym Świętem Niepodległości ....
W zeszły p....
OISW Kraków
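The same accumulate-until-the-next-marker idea can be written once for any list of siblings. This is only a sketch under assumptions: Between and betweenMarkers are hypothetical names, not jsoup API, and plain strings stand in for the elements so it runs without jsoup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of jsoup.
final class Between {
    private Between() {}

    // Returns the items that come after the first marker and before the next one.
    // If there is no second marker, everything after the first marker is returned.
    static <T> List<T> betweenMarkers(List<T> siblings, Predicate<? super T> isMarker) {
        List<T> result = new ArrayList<>();
        boolean inside = false;
        for (T item : siblings) {
            if (isMarker.test(item)) {
                if (inside) {
                    break; // second marker reached
                }
                inside = true; // first marker reached
            } else if (inside) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "clear" plays the role of <div class="clear"></div>.
        List<String> siblings = Arrays.asList("zd", "h3", "data", "clear", "p1", "p2", "p3", "clear", "slider");
        System.out.println(betweenMarkers(siblings, s -> s.equals("clear")));
    }
}
```

With jsoup, the call might look like Between.betweenMarkers(eNews.children(), e -> e.hasClass("clear")), assuming eNews is the div.aktualnosci element from the question. Note that hasClass checks only the class name, not the tag.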
