Is there a way to check if a table has a certain row using jsoup?
I am getting an java.lang.IndexOutOfBoundsException: Invalid location 1, size is 1 exception, my code for getting the info out of the table is:
docTide = Jsoup.connect("http://www.mhpa.co.uk/search-tide-times/").timeout(600000).get();
Elements tideTimeOdd = docTide.select("div.tide_row.odd div:eq(0)");
Elements tideTimeEven = docTide.select("div.tide_row.even div:eq(0)");
Elements tideHightOdd = docTide.select("div.tide_row.odd div:eq(2)");
Elements tideHightEven = docTide.select("div.tide_row.even div:eq(2)");
Element firstTideTime = tideTimeOdd.first();
Element secondTideTime = tideTimeEven.first();
Element thirdTideTime = tideTimeOdd.get(1);
Element fourthTideTime = tideTimeEven.get(1);
The exception is occurring because sometime the table only has 3 rows instead of 4, in this order;
odd
even
odd
even
it is the last 'even' row that is causing the problem.
<div class="tide_row odd">
<div class="time">00:57</div>
<div class="height_m">4.9</div>
<div class="height_f">16,1</div>
<div class="range_m">1.9</div>
<div class="range_f">6,3</div>
</div>
<div class="tide_row even">
<div class="time">07:23</div>
<div class="height_m">2.9</div>
<div class="height_f">9,6</div>
<div class="range_m">2</div>
<div class="range_f">6,7</div>
</div>
<div class="tide_row odd">
<div class="time">13:46</div>
<div class="height_m">5.1</div>
<div class="height_f">16,9</div>
<div class="range_m">2.2</div>
<div class="range_f">7,3</div>
</div>
<div class="tide_row even">
<div class="time">20:23</div>
<div class="height_m">2.8</div>
<div class="height_f">9,2</div>
<div class="range_m">2.3</div>
<div class="range_f">7,7</div>
</div>
To simply check the size of the Elements object, use the size() method to determine if it exists or not.
To check for a certain Element use the contains() method.
You might also consider using a loop to iterate over all the Element objects in your Elements collection.
if(tideTimeEven.size() > 1)
//Do something
You could do
if (tideTimeEven.size() > 1) {
Element fourthTideTime = tideTimeEven.get(1);
}
Related
Hi guys I'm using jsoup in a java webapplication on IntelliJ. I'm trying to scrape data of port call events from a shiptracking website and store the data in a mySQL database. The data for the events is organised in divs with the class name table-group and the values are in another div with the class name table-row. My problem is the divs rows for all the vessel are all the same class name and im trying to loop through each row and push the data to a database. So far i have managed to create a java class to scrape the first row. How can i loop through each row and store those values to my database. Should i create an array list to store the values?
this is my scraper class
public class Scarper {
private static Document doc;
public static void main(String[] args) {
final String url =
"https://www.myshiptracking.com/ports-arrivals-departures/?mmsi=&pid=277&type=0&time=&pp=20";
try {
doc = Jsoup.connect(url).get();
} catch (IOException e) {
e.printStackTrace();
}
Events();
}
public static void Events() {
Elements elm = doc.select("div.table-group:nth-of-type(2) > .table-row");
List<String> arrayList = new ArrayList();
for (Element ele : elm) {
String event = ele.select("div.col:nth-of-type(2)").text();
String time = ele.select("div.col:nth-of-type(3)").text();
String port = ele.select("div.col:nth-of-type(4)").text();
String vessel = ele.select(".td_vesseltype.col").text();
Event ev = new Event();
System.out.println(event);
System.out.println(time);
System.out.println(port);
System.out.println(vessel);
}
}
}
sample of the div classes i want to scrape
<div style="box-sizing: border-box;padding: 0px 10px 10px 10px;">
<div class="cs-table">
<div class="heading">
<div class="col" style="width: 10px"></div>
<div class="col" style="width: 110px">Event</div>
<div class="col" style="width: 120px">Time (<span class="tooltip" title="My Time: In your current TimeZone">MT</span>)</div>
<div class="col" style="width: 150px">Port</div>
<div class="col">Vessel</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>BELFAST</div>
<div class="col td_vesseltype"><img src="/icons/icon7_511.png"><span class="padding_18">WILSON BLYTH [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-flag-checkered green"></i></div>
<div class="col">Arrival</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>HUNTERS QUAY</div>
<div class="col td_vesseltype"><img src="/icons/icon6_511.png"><span class="padding_18">SOUND OF SOAY [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>LARGS</div>
<div class="col td_vesseltype"><img src="/icons/icon6_511.png"><span class="padding_18">LOCH SHIRA [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>RYDE</div>
<div class="col td_vesseltype"><img src="/icons/icon4_511.png"><span class="padding_18">ISLAND FLYER [GB]</span></div>
</div>
</div>
You can start with looping over the table's rows: the selector for the table is .cs-table so you can get the table with Element table = doc.select(".cs-table").first();. Next you can get the table's rows with the selector div.table-row - Elements rows = doc.select("div.table-row"); now you can loop over all the rows and extract the data from each row. The code should look like:
Element table = doc.select(".cs-table").first();
Elements rows = doc.select("div.table-row");
for (Element row : rows) {
String event = row.select("div.col:nth-of-type(2)").text();
String time = row.select("div.col:nth-of-type(3)").text();
String port = row.select("div.col:nth-of-type(4)").text();
String vessel = row.select(".td_vesseltype.col").text();
System.out.println(event + "-" + time + " " + port + " " + vessel);
System.out.println("---------------------------");
// Do stuff with data here
}
Now it's up to you to decide if you want to keep the data in some array/list inside the loop and use it later, or to insert it directly to your database.
I am new with thymeleaf, and I want to display 3 values from 3 different arrays with the same index, inside the same div.row, I tried several ways but I only could iterate one array at a time without errors, below is my Controller side:
public String index(Model model) {
String[] table0 = {"0","1","2","3"}
String[] table1 = {"14","21","25","75"}
String[] table2 = {"7","63","57","87"}
model.addAttribute("table0", table0;
model.addAttribute("table1", table1);
model.addAttribute("table2", table2);
return "index";
}
Inside the html file, table0 is the first array iterated without errors, I don't know how to edit/improve the following code to display all the three arrays tables0, tables1 and tables3 at the same time:
<div class="row" th:each="v0 : ${tables0}" >
<div class="cell" th:text="value">
<!-- Here I could display a value from tables0 -->
</div>
<div class="cell" >
<!-- Here I need to display the value of tables1 having the same index as v0 -->
</div>
<div class="cell" >
<!-- Here I need to display the value of tables2 having the same index as v0 -->
</div>
</div>
here you could find what you're searching about , keeping iteration status
by simply adding a var after the object , and use index to get the current index value
by example :
<div class="row" th:each="v0,iter : ${tables0}" >
<div class="cell" th:text="value">
<!-- Here I could display a value from tables0 -->
<span th:text="${v0}"></span>
</div>
<div class="cell" >
<span th:text="${table1[iter.index]}"></span>
</div>
<div class="cell" >
<span th:text="${table2[iter.index]}"></span>
</div>
</div>
You can use Thymeleaf's iterStat to do this.
Assuming the following input data:
String[] table0 = {"0", "1", "2", "3"};
String[] table1 = {"14", "21", "25", "75"};
String[] table2 = {"7", "63", "57", "87"};
You can use the following Thymeleaf markup:
<div class="row" th:each="val,iterStat : ${table0}" >
<div class="cell" th:text="${val}">
</div>
<div class="cell" th:text="${table1[iterStat.index]}">
</div>
<div class="cell" th:text="${table2[iterStat.index]}">
</div>
</div>
This produces a column of numbers as follows (I don't have any CSS so it's just the raw output):
0
14
7
1
21
63
2
25
57
3
75
87
The related html looks like this:
<div class="row">
<div class="cell">0</div>
<div class="cell">14</div>
<div class="cell">7</div>
</div>
<div class="row">
<div class="cell">1</div>
<div class="cell">21</div>
<div class="cell">63</div>
</div>
<div class="row">
<div class="cell">2</div>
<div class="cell">25</div>
<div class="cell">57</div>
</div>
<div class="row">
<div class="cell">3</div>
<div class="cell">75</div>
<div class="cell">87</div>
</div>
The iterStat function is described here - it basically keeps track of your iterations. Since you want the same index for each table, it's a good fit for your needs.
This is the html page:
<div class="doc_details">
<fieldset style="border: 0pt">
<div class="row">
<div class="col-sm-6 col-md-6">
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Speciality</b>
</div>
<div class="col-sm-6 col-md-6">ABCD</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>City</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Residence Address</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Business Address</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
</div>
</div>
</fieldset>
</div>
I would like to access only the values of the Speciality, city and address columns into a variable as follows:
Elements rows = doc.select("div.doc_details div.row div.row ");
Element row_div = rows.select("div.row").get(0);
doctor.speciality = row_div.select("div:eq(0)").text();
But even if I change the get(0) to get(1), I'm not able to get only the values in the variable.
You can probably do this with a css-selector :
doc.select("div.row > div.col-sm-6:nth-child(2)")
which returns this :
0 = {Element#754} "<div class="col-sm-6 col-md-6">\n ABCD \n</div>"
1 = {Element#756} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
2 = {Element#758} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
3 = {Element#760} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
It's then really up to you, you can for example map the list to the text of each div :
divs.stream().map(new Function<Element, String>() {
#Override
public String apply(Element element) {
return element.text();
}
}).collect(Collectors.toList()));
or more simple :
String speciality = divs.get(0).text();
String city = divs.get(1).text();
String adress = divs.get(2).text();
Try this:
Elements rows = doc.select("div.doc_details div.row div.row ");
Element row_div = rows.select("div.col-sm-6").get(1);
doctor.speciality = row_div.text();
tell me if it works!
Here is how I would do it:
Document doc = Jsoup.parse(html);
Elements rows = doc.select("div.doc_details div.row div.row ");
for (Element row : rows){
Elements innerDivs = row.select("div");
String header = innerDivs.get(1).text();
String content = innerDivs.get(2).text();
System.out.println("header = "+header+ " -> "+content);
}
I think you were mistaken in the css selector!
Edit: (thanks to the OP the indexes are correct now)
So I've read the Iterations part of the documentation already but still didn't give any idea on how to do the following:
Iterate per 2 records (since I am rendering something like below) and access list by index.
<div class="row">
<div class="col-md-6">
...
</div>
<div class="col-md-6">
...
</div>
</div>
Basically if this were in code, it looks something like
for (int i = 0; i < size;) {
// do stuff
// manual increment
if (i + 2 > size) {
i++;
} else {
i += 2;
}
}
Any other approach that would satisfy my problem is always welcome too!
I was able to solve it using something like below:
Basically, I still loop individually but just skip every other record using th:if="${stat.even}" and just get the next record by stat.index + 1.
Be really cautious about the IndexOutOfBoundsException though.
<div class="row" th:each="hivRisk, stat : ${hivRiskList}" th:if="${stat.even}">
<div class="col-md-6" th:with="leftRisk=${hivRiskList.get(stat.index)}">
<div class="checkbox checkbox-styled">
<label>
<input type="checkbox" value="-1" th:value="${leftRisk.id}"/>
<span th:text="${leftRisk.name}">HIV Risk</span>
</label>
</div>
</div>
<div class="col-md-6" th:if="${stat.index + 1 < hivRiskList.size()}" th:with="rightRisk=${hivRiskList.get(stat.index + 1)}">
<div class="checkbox checkbox-styled">
<label>
<input type="checkbox" value="-1" th:value="${rightRisk.id}"/>
<span th:text="${rightRisk.name}">HIV Risk</span>
</label>
</div>
</div>
</div>
I have some Element eNews. After finding indexes by CssQuery I have to select sibling elements with index less than y and greater than x;
Elements lines = eNews.select("div.clear");
int x = lines.get(0).elementSiblingIndex();
int y = lines.get(1).elementSiblingIndex();
Elements tNews = eNews.getElementsByIndexGreaterThan(x)
?AND?
eNews.getElementsByIndexLessThan(y)
This is some sample code. I want to extract text from html tags between first and second <div class="clear></div>
<div class="aktualnosci">
<div class="zd">
<a href="/Data/Thumbs/ODAweDYwMA,dsc_0458.jpg" title="" rel="lightbox">
<img src="/Data/Thumbs/dsc_0458.jpg"/>
</a>
<p class="show"></p>
</div>
<h3>Awanse</h3>
<div class="data">
<img alt="" src="/Themes/kalendarz-ico.gif">
2013-11-18 12:26
</div>
<!--Start tag-->
<div class="clear"></div>
<!--Tags to extract-->
<p class="gr">W związku z Narodowym Świętem Niepodległości ....</p>
<p style="text-align: justify">W zeszły p....</p>
<p style="text-align: justify">OISW Kraków</p>
<!--End tag-->
<div class="clear"></div>
<div class="slider">
<span class="slide-left"></span>
<span class="slide-right"></span>
</div>
</div>
You can use a selector like div.clear ~ :gt(1):lt(4)
E.g.:
Elements tNews = eNews.select("div.clear ~ :gt(1):lt(4)");
See this example and the selector docs. (It's a bit hard to validate this does what you're trying to achieve without knowing your input HTML and the data you're trying to extract.)
Update based on your edit: there are a couple ways to do this if you can't know the indexes in advance. Below I get the first div, then accumulate sibling elements until we hit the next div.clear. (I'll have a think if I can generify this pattern and add it to jsoup.)
Document doc = Jsoup.parse(h);
Element firstDiv = doc.select("div.clear").first();
Elements news = new Elements();
Element item = firstDiv.nextElementSibling();
while (item != null && !(item.tagName().equals("div") && item.className().equals("clear"))) {
news.add(item);
item = item.nextElementSibling();
}
System.out.println(String.format("Found %s items", news.size()));
for (Element element : news) {
System.out.println(element.text());
}
Outputs:
Found 3 items
W związku z Narodowym Świętem Niepodległości ....
W zeszły p....
OISW Kraków