This is the html page:
<div class="doc_details">
<fieldset style="border: 0pt">
<div class="row">
<div class="col-sm-6 col-md-6">
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Speciality</b>
</div>
<div class="col-sm-6 col-md-6">ABCD</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>City</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Residence Address</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Business Address</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
</div>
</div>
</fieldset>
</div>
I would like to read only the values of the Speciality, City and address rows into variables, as follows:
Elements rows = doc.select("div.doc_details div.row div.row ");
Element row_div = rows.select("div.row").get(0);
doctor.speciality = row_div.select("div:eq(0)").text();
But even if I change get(0) to get(1), I cannot get only the values into the variable.
You can probably do this with a CSS selector:
Elements divs = doc.select("div.row > div.col-sm-6:nth-child(2)");
which returns this:
0 = {Element#754} "<div class="col-sm-6 col-md-6">\n ABCD \n</div>"
1 = {Element#756} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
2 = {Element#758} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
3 = {Element#760} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
It's then really up to you. You can, for example, map the list to the text of each div:
List<String> values = divs.stream().map(new Function<Element, String>() {
    @Override
    public String apply(Element element) {
        return element.text();
    }
}).collect(Collectors.toList());
or, more simply:
String speciality = divs.get(0).text();
String city = divs.get(1).text();
String address = divs.get(2).text();
Try this:
Elements rows = doc.select("div.doc_details div.row div.row ");
Element row_div = rows.select("div.col-sm-6").get(1);
doctor.speciality = row_div.text();
Tell me if it works!
Here is how I would do it:
Document doc = Jsoup.parse(html);
Elements rows = doc.select("div.doc_details div.row div.row ");
for (Element row : rows){
Elements innerDivs = row.select("div");
String header = innerDivs.get(1).text();
String content = innerDivs.get(2).text();
System.out.println("header = "+header+ " -> "+content);
}
I think you were mistaken in the css selector!
Edit: (thanks to the OP the indexes are correct now)
Hi guys, I'm using jsoup in a Java web application in IntelliJ. I'm trying to scrape port call events from a ship-tracking website and store the data in a MySQL database. The events are organised in divs with the class name table-group, and the values are in nested divs with the class name table-row. My problem is that the row divs for all the vessels share the same class name, and I want to loop through each row and push its data to the database. So far I have managed to write a Java class that scrapes the first row. How can I loop through every row and store those values in my database? Should I create an ArrayList to hold the values?
This is my scraper class:
public class Scarper {
private static Document doc;
public static void main(String[] args) {
final String url =
"https://www.myshiptracking.com/ports-arrivals-departures/?mmsi=&pid=277&type=0&time=&pp=20";
try {
doc = Jsoup.connect(url).get();
} catch (IOException e) {
e.printStackTrace();
}
Events();
}
public static void Events() {
Elements elm = doc.select("div.table-group:nth-of-type(2) > .table-row");
List<String> arrayList = new ArrayList();
for (Element ele : elm) {
String event = ele.select("div.col:nth-of-type(2)").text();
String time = ele.select("div.col:nth-of-type(3)").text();
String port = ele.select("div.col:nth-of-type(4)").text();
String vessel = ele.select(".td_vesseltype.col").text();
Event ev = new Event();
System.out.println(event);
System.out.println(time);
System.out.println(port);
System.out.println(vessel);
}
}
}
A sample of the divs I want to scrape:
<div style="box-sizing: border-box;padding: 0px 10px 10px 10px;">
<div class="cs-table">
<div class="heading">
<div class="col" style="width: 10px"></div>
<div class="col" style="width: 110px">Event</div>
<div class="col" style="width: 120px">Time (<span class="tooltip" title="My Time: In your current TimeZone">MT</span>)</div>
<div class="col" style="width: 150px">Port</div>
<div class="col">Vessel</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>BELFAST</div>
<div class="col td_vesseltype"><img src="/icons/icon7_511.png"><span class="padding_18">WILSON BLYTH [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-flag-checkered green"></i></div>
<div class="col">Arrival</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>HUNTERS QUAY</div>
<div class="col td_vesseltype"><img src="/icons/icon6_511.png"><span class="padding_18">SOUND OF SOAY [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>LARGS</div>
<div class="col td_vesseltype"><img src="/icons/icon6_511.png"><span class="padding_18">LOCH SHIRA [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>RYDE</div>
<div class="col td_vesseltype"><img src="/icons/icon4_511.png"><span class="padding_18">ISLAND FLYER [GB]</span></div>
</div>
</div>
You can start by looping over the table's rows. The selector for the table is .cs-table, so you can get the table with Element table = doc.select(".cs-table").first();. Next, get the table's rows with the selector div.table-row: Elements rows = table.select("div.table-row");. Now you can loop over all the rows and extract the data from each one. The code should look like this:
// The table that holds all the port call events
Element table = doc.select(".cs-table").first();
// Every row of the table, regardless of which table-group it sits in
Elements rows = table.select("div.table-row");
for (Element row : rows) {
    String event = row.select("div.col:nth-of-type(2)").text();
    String time = row.select("div.col:nth-of-type(3)").text();
    String port = row.select("div.col:nth-of-type(4)").text();
    String vessel = row.select(".td_vesseltype.col").text();
    System.out.println(event + "-" + time + " " + port + " " + vessel);
    System.out.println("---------------------------");
    // Do stuff with data here
}
Now it's up to you to decide if you want to keep the data in some array/list inside the loop and use it later, or to insert it directly to your database.
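If you go the list route, a minimal sketch could look like the one below. It assumes the four cell texts of each row have already been extracted (for instance with the jsoup loop above) and handed over as a String[4]. The PortEvent class and the collect method are hypothetical names used only for illustration; they are not part of jsoup or of the original scraper.

```java
import java.util.ArrayList;
import java.util.List;

public class PortEventCollector {

    // Hypothetical value class for one scraped row (illustration only).
    static final class PortEvent {
        final String event;
        final String time;
        final String port;
        final String vessel;

        PortEvent(String event, String time, String port, String vessel) {
            this.event = event;
            this.time = time;
            this.port = port;
            this.vessel = vessel;
        }

        @Override
        public String toString() {
            return event + " | " + time + " | " + port + " | " + vessel;
        }
    }

    // Turns already-extracted cell texts (one String[4] per row) into PortEvent objects.
    static List<PortEvent> collect(List<String[]> rows) {
        List<PortEvent> events = new ArrayList<>();
        for (String[] cells : rows) {
            if (cells.length < 4) {
                continue; // skip malformed rows instead of failing
            }
            events.add(new PortEvent(cells[0].trim(), cells[1].trim(),
                    cells[2].trim(), cells[3].trim()));
        }
        return events;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Departure", "2022-02-14 16:51", "BELFAST", "WILSON BLYTH [GB]"});
        rows.add(new String[] {"Arrival", "2022-02-14 16:51", "HUNTERS QUAY", "SOUND OF SOAY [GB]"});
        for (PortEvent e : collect(rows)) {
            System.out.println(e);
        }
    }
}
```

Collecting first and inserting afterwards keeps the scraping and the database code separate, so a failed insert does not interrupt the parsing loop.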
I am new to Thymeleaf, and I want to display 3 values from 3 different arrays, all at the same index, inside the same div.row. I tried several ways, but I could only iterate over one array at a time without errors. Below is my controller side:
public String index(Model model) {
    String[] table0 = {"0","1","2","3"};
    String[] table1 = {"14","21","25","75"};
    String[] table2 = {"7","63","57","87"};
    model.addAttribute("table0", table0);
    model.addAttribute("table1", table1);
    model.addAttribute("table2", table2);
    return "index";
}
Inside the HTML file, table0 is the first array and is iterated without errors. I don't know how to change the following code to display all three arrays (table0, table1 and table2) at the same time:
<div class="row" th:each="v0 : ${table0}" >
<div class="cell" th:text="${v0}">
<!-- Here I could display a value from table0 -->
</div>
<div class="cell" >
<!-- Here I need to display the value of table1 having the same index as v0 -->
</div>
<div class="cell" >
<!-- Here I need to display the value of table2 having the same index as v0 -->
</div>
</div>
What you are looking for is Thymeleaf's iteration status. Add a second variable after the iteration variable in th:each, then use its index property to get the current index value.
For example:
<div class="row" th:each="v0,iter : ${table0}" >
<div class="cell" >
<!-- Value from table0 at the current index -->
<span th:text="${v0}"></span>
</div>
<div class="cell" >
<span th:text="${table1[iter.index]}"></span>
</div>
<div class="cell" >
<span th:text="${table2[iter.index]}"></span>
</div>
</div>
You can use Thymeleaf's iterStat to do this.
Assuming the following input data:
String[] table0 = {"0", "1", "2", "3"};
String[] table1 = {"14", "21", "25", "75"};
String[] table2 = {"7", "63", "57", "87"};
You can use the following Thymeleaf markup:
<div class="row" th:each="val,iterStat : ${table0}" >
<div class="cell" th:text="${val}">
</div>
<div class="cell" th:text="${table1[iterStat.index]}">
</div>
<div class="cell" th:text="${table2[iterStat.index]}">
</div>
</div>
This produces a column of numbers as follows (I don't have any CSS so it's just the raw output):
0
14
7
1
21
63
2
25
57
3
75
87
The resulting HTML looks like this:
<div class="row">
<div class="cell">0</div>
<div class="cell">14</div>
<div class="cell">7</div>
</div>
<div class="row">
<div class="cell">1</div>
<div class="cell">21</div>
<div class="cell">63</div>
</div>
<div class="row">
<div class="cell">2</div>
<div class="cell">25</div>
<div class="cell">57</div>
</div>
<div class="row">
<div class="cell">3</div>
<div class="cell">75</div>
<div class="cell">87</div>
</div>
The iterStat status variable is described in the Thymeleaf documentation. It keeps track of your iterations, and its index property is the zero-based position of the current item. Since you want the same index for each table, it's a good fit for your needs.
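For comparison, here is roughly what the template does, written as plain Java. This is only a sketch of the index-aligned lookup, not how Thymeleaf is implemented. The class and method names are made up for illustration, and it assumes table1 and table2 are at least as long as table0 (otherwise the lookup fails with an ArrayIndexOutOfBoundsException).

```java
import java.util.ArrayList;
import java.util.List;

public class IndexAlignedRows {

    // Builds one row per index of table0, taking the value at the same
    // index from the other two arrays (the role iterStat.index plays above).
    static List<String[]> zipByIndex(String[] table0, String[] table1, String[] table2) {
        List<String[]> rows = new ArrayList<>();
        for (int index = 0; index < table0.length; index++) {
            rows.add(new String[] {table0[index], table1[index], table2[index]});
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] table0 = {"0", "1", "2", "3"};
        String[] table1 = {"14", "21", "25", "75"};
        String[] table2 = {"7", "63", "57", "87"};
        for (String[] row : zipByIndex(table0, table1, table2)) {
            System.out.println(String.join(" ", row));
        }
    }
}
```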
The items in my itemList are incomplete! For some reason, from the 10th iteration of my loop to the last,
el.select(".item").select(".img").select(".pic").select(".picRind").select(".picCore").attr("src")
returns an empty string, and I can't understand why.
Iterations 0 to 9 are perfectly fine, though. I went through the HTML, and my code should work for every li I'm iterating over.
private Document getHtmlDocument() throws IOException {
document = Jsoup.connect(url).get();
return document;
}
public List<AliExpressItem> getAliExpressItemList() throws IOException {
Document document;
Element ul;
Elements ulLi;
document = getHtmlDocument();
ul = document.getElementById("hs-below-list-items");
ulLi = ul.getElementsByClass("list-item");
List<AliExpressItem> itemList = new ArrayList<>();
for(Element el : ulLi) {
AliExpressItem item = new AliExpressItem();
item.setImage(el.select(".item")
.select(".img")
.select(".pic")
.select(".picRind")
.select(".picCore")
.attr("src"));
item.setDescription(el.select(".item")
.select(".info")
.select("h3")
.select("a")
.text());
item.setPrice(el.select(".item")
.select(".info")
.select(".price")
.select(".value")
.text());
itemList.add(item);
}
return itemList;
}
There's a ul with 48 li elements inside. The above code should work for all 48 of them.
<li qrdata="|32805326364|cn1511315262" pub-catid="200247142" sessionid="201711160635492248862329348280002056372" class="list-item list-item-first ">
<div class="item">
<div class="img img-border">
<div class="pic">
<a class="picRind history-item j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.1.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0" target="_blank" data-spm-anchor-id="2114.search0204.3.1"><img class="picCore pic-Core-v" src="//ae01.alicdn.com/kf/HTB1RUjgQFXXXXayXXXXq6xXFXXX4/Hot-Sale-Novelty-Toys-Hand-font-b-Spinner-b-font-Anti-stress-toys-fidget-font-b.jpg_220x220.jpg" alt="Hot Sale Novelty Toys Hand Spinner Anti stress toys fidget spinners For Autism and ADHD reliever stress spinner(China)"></a>
</div>
</div>
<div class="info">
<h3>
<a class="history-item product j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.2.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0" title="Hot Sale Novelty Toys Hand Spinner Anti stress toys fidget spinners For Autism and ADHD reliever stress spinner" target="_blank" data-spm-anchor-id="2114.search0204.3.2">Hot Sale Novelty Toys Hand <font><b>Spinner</b></font> Anti stress toys fidget <font><b>spinners</b></font> For Autism and ADHD reliever stress <font><b>spinner</b></font></a>
</h3>
<span class="price price-m">
<span class="value" itemprop="price">US $1.99</span>
<span class="separator">/</span>
<span class="unit">unidad</span>
</span>
<strong class="free-s">Envío gratis</strong>
<div class="rate-history">
<span rel="nofollow" class="order-num">
<a class="order-num-a j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.3.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0#thf" rel="nofollow" target="_blank" data-spm-anchor-id="2114.search0204.3.3"><em title="Pedido totales"> Ventas (0)</em></a>
</span>
</div>
</div>
<div class="info-more">
<div class="aplus-sp-main">
<div class="sp-box">
</div>
</div>
<div class="store-name-chat">
<div class="store-name util-clearfix">
Alisa's cabin
</div>
</div>
<a class="score-dot" href="//www.aliexpress.com/store/feedback-score/1308215.html?spm=2114.search0204.3.5.Lwk2KD" rel="nofollow" data-spm-anchor-id="2114.search0204.3.5"><span class="score-icon-new score-level-22" id="score1" feedbackscore="1,276" sellerpositivefeedbackpercentage="93.7"></span></a>
<div class="add-to-wishlist">
<a class="atwl-button j-p4plog" href="javascript:;" data-product-id="32805326364" data-batman-id="ja2kvte8" data-spm-anchor-id="2114.search0204.3.6">Añadir a Lista Deseos</a>
</div>
<input class="atc-product-id" type="hidden" value="32805326364">
<input class="atc-product-standard" type="hidden" value="">
</div>
</div>
I want to extract some data from many Xbox store links. The problem I am running into is that the section where the price is shown has a different structure when the game is discounted, for example.
The code I have written to scrape the price:
String urlPage = "https://www.microsoft.com/en-us/store/p/call-of-duty-advanced-warfare-gold-edition/c20hl06x0v8w";
System.out.println("Checking entries for: " + urlPage);
if (getStatusConnectionCode(urlPage) == 200) {
    Document document = getHtmlDocument(urlPage);
    Elements entradas = document.select("div.m-product-detail-hero-product-placement div.price-info");
    for (Element elem : entradas) {
        String titulo = elem.getElementsByClass("srv_saleprice").text();
    }
} else {
    System.out.println("The status code is not OK, it is: " + getStatusConnectionCode(urlPage));
}
The HTML for a game that has no discount:
URL for first case
<div class="price-info">
<div class="c-price">
<div class="price-text srv_price">
<div class="ea-vault-message hidden x-hidden">
<div>
Available in The Vault
</div>
<div>
or
</div>
</div>
<span>$59.99</span>
<sup>+</sup>
</div>
<div class="srv_microdata" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="price" content="59.99">
<meta itemprop="priceCurrency" content="USD">
</div>
</div>
</div>
And for a game with discount:
URL for the second case
<div class="price-info">
<div class="c-price">
<div class="price-text srv_price">
<div class="ea-vault-message hidden x-hidden">
<div>
Available in The Vault
</div>
<div>
or
</div>
</div>
<s class="srv_saleprice" aria-label="Full price was $159.99">$159.99</s>
<span> </span>
<div class="price-disclaimer">
<span>$135.99</span>
<sup>+</sup>
</div>
<span> </span>
<span></span>
</div>
<div class="caption text-muted srv_countdown">
<span class="sub">save $24.00</span>
</div>
<div class="srv_microdata" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="price" content="135.99">
<meta itemprop="priceCurrency" content="USD">
</div>
</div>
</div>
In this second example, the value inside the span elements is $135.99, but that is not the game's base price ($159.99 in this case).
How could I extract only the base price for every game, with or without a discount?
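One possible approach, going only by the two HTML samples above: when a game is discounted, the base price sits in the s.srv_saleprice element; otherwise it is the span directly under div.price-text. Assuming both texts have already been read with jsoup (for instance elem.select("s.srv_saleprice").text() and elem.select("div.price-text > span").text(), selectors inferred from the samples and not verified against the live page), the fallback itself can be sketched in plain Java. The BasePrice class and basePrice method are hypothetical names.

```java
import java.math.BigDecimal;

public class BasePrice {

    // strikeText:  text of the struck-through price, empty when there is no discount
    // regularText: text of the regular price span, used only as a fallback
    static BigDecimal basePrice(String strikeText, String regularText) {
        boolean discounted = strikeText != null && !strikeText.trim().isEmpty();
        String chosen = discounted ? strikeText : regularText;
        // Keep digits and the decimal point only, e.g. "$159.99" becomes "159.99".
        String numeric = chosen.replaceAll("[^0-9.]", "");
        return new BigDecimal(numeric);
    }

    public static void main(String[] args) {
        System.out.println(basePrice("$159.99", "$135.99")); // discounted game
        System.out.println(basePrice("", "$59.99"));         // no discount
    }
}
```

Returning a BigDecimal rather than a double avoids rounding surprises if the prices are later compared or summed.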
I'm trying to parse an HTML string using the htmlparser library.
The HTML looks like this:
<body>
<div class="Level1">
<div class="row">
<div class="txt">
Date of analysis:
</div><div class="content">
02/03/11
</div>
</div>
</div><div class="Level1">
<div class="row">
<div class="txt">
Site:
</div><div class="content">
13.0E
</div>
</div>
</div><div class="Level1">
<div class="row">
<div class="txt">
Network type:
</div><div class="content">
DVB-S
</div>
</div>
</div>
</body>
I need to extract the "content" value for a given "txt". I have written a filter that returns the divs with class "Level1", but I don't know how to filter on the content of a div. For example, when the value of txt is "Site:", I want to read the matching content, 13.0E.
NodeList nl = parser.extractAllNodesThatMatch(new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("class", "Level1")));
Can someone help me with this? How do I read a div inside a div?
Thanks!
NodeList nl = parser.extractAllNodesThatMatch(new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("class", "Level1")));
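It is better to do it like this:
NodeList nl = parser.parse(null); // you can also filter here
// 'true' searches recursively, since the txt divs are nested inside other divs
NodeList divs = nl.extractAllNodesThatMatch(
    new AndFilter(new TagNameFilter("DIV"),
        new HasAttributeFilter("class", "txt")), true);
if (divs.size() > 0) {
    Node div = divs.elementAt(0);
    // toPlainTextString() returns the text inside the div;
    // getText() would return the tag's own markup instead
    String text = div.toPlainTextString().trim();
}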