I want to extract some data from many Xbox store links. The problem is that the section showing the price has a different structure when the game is discounted, for example.
The code I have written to scrape the price:
String urlPage = "https://www.microsoft.com/en-us/store/p/call-of-duty-advanced-warfare-gold-edition/c20hl06x0v8w" ;
System.out.println("Checking entries for: "+urlPage);
if (getStatusConnectionCode(urlPage) == 200) {
Document document = getHtmlDocument(urlPage);
Elements entradas = document.select("div.m-product-detail-hero-product-placement div.price-info");
for (Element elem : entradas) {
String titulo = elem.getElementsByClass("srv_saleprice").text();
}
}else{
System.out.println("The status code is not OK, it is: "+getStatusConnectionCode(urlPage));
}
The HTML for a game that has no discount:
URL for first case
<div class="price-info">
<div class="c-price">
<div class="price-text srv_price">
<div class="ea-vault-message hidden x-hidden">
<div>
Available in The Vault
</div>
<div>
or
</div>
</div>
<span>$59.99</span>
<sup>+</sup>
</div>
<div class="srv_microdata" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="price" content="59.99">
<meta itemprop="priceCurrency" content="USD">
</div>
</div>
</div>
And for a game with discount:
URL for the second case
<div class="price-info">
<div class="c-price">
<div class="price-text srv_price">
<div class="ea-vault-message hidden x-hidden">
<div>
Available in The Vault
</div>
<div>
or
</div>
</div>
<s class="srv_saleprice" aria-label="Full price was $159.99">$159.99</s>
<span> </span>
<div class="price-disclaimer">
<span>$135.99</span>
<sup>+</sup>
</div>
<span> </span>
<span></span>
</div>
<div class="caption text-muted srv_countdown">
<span class="sub">save $24.00</span>
</div>
<div class="srv_microdata" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="price" content="135.99">
<meta itemprop="priceCurrency" content="USD">
</div>
</div>
</div>
In this second example the value inside the span element is $135.99, which is the discounted price and not the game's base price ($159.99 in this case).
How can I extract only the base price for every game, with or without a discount?
Hi guys, I'm using jsoup in a Java web application in IntelliJ. I'm trying to scrape port call events from a ship-tracking website and store the data in a MySQL database. The events are organised in divs with the class name table-group, and the values are in nested divs with the class name table-row. My problem is that the row divs for all vessels share the same class name, and I'm trying to loop through each row and push the data to the database. So far I have managed to write a Java class that scrapes the first row. How can I loop through each row and store those values in my database? Should I create an ArrayList to store the values?
This is my scraper class:
public class Scarper {
private static Document doc;
public static void main(String[] args) {
final String url =
"https://www.myshiptracking.com/ports-arrivals-departures/?mmsi=&pid=277&type=0&time=&pp=20";
try {
doc = Jsoup.connect(url).get();
} catch (IOException e) {
e.printStackTrace();
}
Events();
}
public static void Events() {
Elements elm = doc.select("div.table-group:nth-of-type(2) > .table-row");
List<String> arrayList = new ArrayList<>();
for (Element ele : elm) {
String event = ele.select("div.col:nth-of-type(2)").text();
String time = ele.select("div.col:nth-of-type(3)").text();
String port = ele.select("div.col:nth-of-type(4)").text();
String vessel = ele.select(".td_vesseltype.col").text();
Event ev = new Event();
System.out.println(event);
System.out.println(time);
System.out.println(port);
System.out.println(vessel);
}
}
}
A sample of the divs I want to scrape:
<div style="box-sizing: border-box;padding: 0px 10px 10px 10px;">
<div class="cs-table">
<div class="heading">
<div class="col" style="width: 10px"></div>
<div class="col" style="width: 110px">Event</div>
<div class="col" style="width: 120px">Time (<span class="tooltip" title="My Time: In your current TimeZone">MT</span>)</div>
<div class="col" style="width: 150px">Port</div>
<div class="col">Vessel</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>BELFAST</div>
<div class="col td_vesseltype"><img src="/icons/icon7_511.png"><span class="padding_18">WILSON BLYTH [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-flag-checkered green"></i></div>
<div class="col">Arrival</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>HUNTERS QUAY</div>
<div class="col td_vesseltype"><img src="/icons/icon6_511.png"><span class="padding_18">SOUND OF SOAY [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>LARGS</div>
<div class="col td_vesseltype"><img src="/icons/icon6_511.png"><span class="padding_18">LOCH SHIRA [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>RYDE</div>
<div class="col td_vesseltype"><img src="/icons/icon4_511.png"><span class="padding_18">ISLAND FLYER [GB]</span></div>
</div>
</div>
You can start by looping over the table's rows. The selector for the table is .cs-table, so you can get the table with Element table = doc.select(".cs-table").first();. Next, get the table's rows with the selector div.table-row: Elements rows = table.select("div.table-row");. Now you can loop over all the rows and extract the data from each one. The code should look like this:
Element table = doc.select(".cs-table").first();
Elements rows = table.select("div.table-row");
for (Element row : rows) {
String event = row.select("div.col:nth-of-type(2)").text();
String time = row.select("div.col:nth-of-type(3)").text();
String port = row.select("div.col:nth-of-type(4)").text();
String vessel = row.select(".td_vesseltype.col").text();
System.out.println(event + "-" + time + " " + port + " " + vessel);
System.out.println("---------------------------");
// Do stuff with data here
}
Now it's up to you to decide whether you want to keep the data in an array or list inside the loop and use it later, or insert it directly into your database.
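If you go the list route, one possible shape is sketched below: keep each row in a small value class, add one instance per row inside the loop, and write the whole list to the database in a single JDBC batch afterwards. The class name PortEventStore, the PortEvent fields, and the table and column names in the SQL are all assumptions made for illustration; they are not taken from the original post, so adjust them to your own schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class PortEventStore {

    // One scraped row. Class and field names are illustrative assumptions.
    static final class PortEvent {
        final String event;
        final String time;
        final String port;
        final String vessel;

        PortEvent(String event, String time, String port, String vessel) {
            this.event = event;
            this.time = time;
            this.port = port;
            this.vessel = vessel;
        }
    }

    // Hypothetical table and column names; adjust them to your schema.
    static final String INSERT_SQL =
        "INSERT INTO port_events (event_type, event_time, port, vessel) VALUES (?, ?, ?, ?)";

    // Queues one INSERT per collected row and sends them as a single batch.
    static void saveAll(Connection conn, List<PortEvent> events) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (PortEvent e : events) {
                ps.setString(1, e.event);
                ps.setString(2, e.time);
                ps.setString(3, e.port);
                ps.setString(4, e.vessel);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```

Inside the row loop you would then call something like events.add(new PortEventStore.PortEvent(event, time, port, vessel)) on a List created before the loop, and call saveAll once after the loop with a Connection you obtained elsewhere.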
I am new to Thymeleaf, and I want to display 3 values from 3 different arrays at the same index inside the same div.row. I tried several approaches, but I could only iterate over one array at a time without errors. Below is my controller:
public String index(Model model) {
String[] table0 = {"0","1","2","3"};
String[] table1 = {"14","21","25","75"};
String[] table2 = {"7","63","57","87"};
model.addAttribute("table0", table0);
model.addAttribute("table1", table1);
model.addAttribute("table2", table2);
return "index";
}
In the HTML file, table0 is the first array and is iterated without errors. I don't know how to change the following code to display all three arrays (table0, table1 and table2) at the same time:
<div class="row" th:each="v0 : ${table0}" >
<div class="cell" th:text="${v0}">
<!-- Here I could display a value from table0 -->
</div>
<div class="cell" >
<!-- Here I need to display the value of table1 having the same index as v0 -->
</div>
<div class="cell" >
<!-- Here I need to display the value of table2 having the same index as v0 -->
</div>
</div>
What you are looking for is Thymeleaf's iteration status.
Add a second variable after the iteration variable in th:each, and use its index property to get the current index.
For example:
<div class="row" th:each="v0,iter : ${table0}" >
<div class="cell">
<!-- Here I could display a value from table0 -->
<span th:text="${v0}"></span>
</div>
<div class="cell" >
<span th:text="${table1[iter.index]}"></span>
</div>
<div class="cell" >
<span th:text="${table2[iter.index]}"></span>
</div>
</div>
You can use Thymeleaf's iterStat to do this.
Assuming the following input data:
String[] table0 = {"0", "1", "2", "3"};
String[] table1 = {"14", "21", "25", "75"};
String[] table2 = {"7", "63", "57", "87"};
You can use the following Thymeleaf markup:
<div class="row" th:each="val,iterStat : ${table0}" >
<div class="cell" th:text="${val}">
</div>
<div class="cell" th:text="${table1[iterStat.index]}">
</div>
<div class="cell" th:text="${table2[iterStat.index]}">
</div>
</div>
This produces a column of numbers as follows (I don't have any CSS so it's just the raw output):
0
14
7
1
21
63
2
25
57
3
75
87
The related html looks like this:
<div class="row">
<div class="cell">0</div>
<div class="cell">14</div>
<div class="cell">7</div>
</div>
<div class="row">
<div class="cell">1</div>
<div class="cell">21</div>
<div class="cell">63</div>
</div>
<div class="row">
<div class="cell">2</div>
<div class="cell">25</div>
<div class="cell">57</div>
</div>
<div class="row">
<div class="cell">3</div>
<div class="cell">75</div>
<div class="cell">87</div>
</div>
The iteration status variable (iterStat in this example) is described in the Thymeleaf documentation. It keeps track of the state of your iteration, including the current index. Since you want the same index for each table, it's a good fit for your needs.
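As an alternative to indexing in the template, the three arrays could be combined into one list of row objects in the controller, so the template only needs a single th:each. The following is a minimal sketch under that assumption; the names TableRows, Row, zip and the getters are hypothetical and not part of the original question or answers.

```java
import java.util.ArrayList;
import java.util.List;

class TableRows {

    // One display row: the values that share an index in the three arrays.
    static final class Row {
        private final String first;
        private final String second;
        private final String third;

        Row(String first, String second, String third) {
            this.first = first;
            this.second = second;
            this.third = third;
        }

        public String getFirst() { return first; }
        public String getSecond() { return second; }
        public String getThird() { return third; }
    }

    // Combines the arrays index by index; stops at the shortest array.
    static List<Row> zip(String[] a, String[] b, String[] c) {
        int n = Math.min(a.length, Math.min(b.length, c.length));
        List<Row> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            rows.add(new Row(a[i], b[i], c[i]));
        }
        return rows;
    }
}
```

The controller would then add the result of TableRows.zip(table0, table1, table2) to the model under a name such as rows, and the template would iterate over that one attribute and read the three properties of each row. Whether this is nicer than iterStat is a design choice: it keeps the template simpler at the cost of a little extra Java.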
I am working on a biological database composed of genes, proteins and assays. Using Spring Boot and Thymeleaf, I want to build a web visualisation. Each gene is shown on a page with its name, description and sequence. The sequence is composed of the letters A, T, G and C. I want to color those letters, which works. However, each letter is written on a new line instead of the text continuing until the line is full and then wrapping to the next line. In gene.html, I used the small tag when defining the colors (I tried p before and thought that might be the reason for my problem), but using small did not help.
I hope the code snippets I provide are enough (if not, tell me what you need).
gene.html
<!DOCTYPE html>
<html lang="en" xmlns:th="http://thymeleaf.org">
<head>
<title>Gene</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<!--import header-->
<header th:include="header"></header>
<div id="main">
<!--GeneID as page heading -->
<h2 th:text="'Gene: '+${identifier}"></h2>
<!--Gene description -->
<p th:text="${description}"></p>
<br/>
<!-- Sequence -->
<h3 th:text="'Sequence:'"></h3>
<!-- For each char in sequence-->
<th:block th:each="char:${sequence}">
<!--Print the char. Possibility to color encode the bases utilizing switch/case
<small th:text="${char}"></small> -->
<div th:switch="${char}">
<div th:case="'A'">
<small style="color: blue" th:text="${char}"></small>
</div>
<div th:case="'T'">
<small style="color: yellow" th:text="${char}"></small>
</div>
<div th:case="'C'">
<small style="color: forestgreen" th:text="${char}"></small>
</div>
<div th:case="'G'">
<small style="color: red" th:text="${char}"></small>
</div>
</div>
</th:block>
<br/>
<br/>
<!--Protein encoded by gene -->
<h3>Protein:</h3>
<a th:href="${'protein?id='+protein}" th:text="${protein}"></a>
</div>
</body>
</html>
GeneController.java
package gui.spring.controller;
import db.sample.Gene;
import db.sample.Protein;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import java.util.Optional;
import static main.Main.query;
/**
* @author Miriam Mueller
* @since 05-12-2018
* @version 1.0
* Class to handle view of one Gene. Gene name, description and sequence are shown. The encoded protein is linked.
*/
@Controller
public class GeneController {
//All calls of localhost:8080/gene get to this controller
@RequestMapping(value = "/gene", method = RequestMethod.GET)
public String einGenAnzeigen(Model model, @RequestParam(value="id") String id) {
model.addAttribute("geneSize",query.getGenes().size());
model.addAttribute("proteinSize",query.getProteins().size());
model.addAttribute("assaySize",query.getAssays().size());
Optional<Gene> gene = query.getGeneByName(id);
if(gene.isPresent()) {
// if gene exists
String description = gene.get().getDesc();
String[] arraySeq = gene.get().getSequence().split("(?!^)");
Protein protein = query.getGeneByName(id).get().getProtein();
model.addAttribute("identifier", gene.get().getIdentifier()); //GenID
model.addAttribute("sequence",arraySeq); //gene sequence
model.addAttribute("description",description); //description
model.addAttribute("protein",protein.getIdentifier()); //encoded protein
}else{
// error messages, if no gene with called id exists
model.addAttribute("gene", "There is no Gene with this ID.");
model.addAttribute("protein","There is no Gene with this ID.Therefore, no reference protein was found.");
model.addAttribute("sequence","");
model.addAttribute("description","");
}
// name of html-template
return "gene";
}
}
Thanks for your time and effort :)
What's happening is that div is a block-level element, so the divs stack vertically instead of horizontally. For example, as you traverse the sequence:
<th:block th:each="char:${sequence}">
<!--Print the char. Possibility to color encode the bases utilizing switch/case
<small th:text="${char}"></small> -->
<div th:switch="${char}">
<div th:case="'A'">
<small style="color: blue" th:text="${char}"></small>
</div>
<div th:case="'T'">
<small style="color: yellow" th:text="${char}"></small>
</div>
<div th:case="'C'">
<small style="color: forestgreen" th:text="${char}"></small>
</div>
<div th:case="'G'">
<small style="color: red" th:text="${char}"></small>
</div>
</div>
</th:block>
every single div will be displayed on a new line. You can display them inline by changing the display property of those divs to inline or inline-block:
<div th:case="'A'" style="display:inline-block;">
<small style="color: blue" th:text="${char}"></small>
</div>
<div th:case="'T'" style="display:inline-block;">
<small style="color: yellow" th:text="${char}"></small>
</div>
<div th:case="'C'" style="display:inline-block;">
<small style="color: forestgreen" th:text="${char}"></small>
</div>
<div th:case="'G'" style="display:inline-block;">
<small style="color: red" th:text="${char}"></small>
</div>
or by using another element whose default display is not block, e.g. span. Removing the divs will work as well, since small is also an inline element.
You could use a style of float: left to get your div blocks to line up how you want.
I'd suggest you switch to an external style sheet to hold your styles. You can then set different classes on your genes and style them that way. It's a lot easier than trying to manage them all in the HTML, and you could then get rid of that th:switch statement.
Create a main.css file in resources/static/css/ with the following content (as an example):
div.gene {
border: 1px solid #999;
float: left;
height: 14px;
width: 14px;
text-align: center;
font-size: 12px;
}
div.A {
color: blue;
}
div.T {
color: yellow;
}
div.C {
color: forestgreen;
}
div.G {
color: red;
}
Add the following inside the <head> tag of your gene.html so it can load the CSS file:
<link rel="stylesheet" href="/css/main.css" />
Change your gene.html to add the classes.
Replace
<div th:switch="${char}">
<div th:case="'A'">
<small style="color: blue" th:text="${char}"></small>
</div>
<div th:case="'T'">
<small style="color: yellow" th:text="${char}"></small>
</div>
<div th:case="'C'">
<small style="color: forestgreen" th:text="${char}"></small>
</div>
<div th:case="'G'">
<small style="color: red" th:text="${char}"></small>
</div>
</div>
with
<div th:class="${'gene ' + char}" th:text="${char}"></div>
This adds the class 'gene' and a class named after the letter (e.g. 'A') to the div. The CSS then has styles for gene that are common to all letters, and styles for each letter that are specific (i.e. the color).
The items in my itemList are incomplete! For some reason, from the 10th iteration of my loop to the last,
el.select(".item").select(".img").select(".pic").select(".picRind").select(".picCore").attr("src")
returns an empty string, and I can't understand why.
Iterations 0 through 9 work perfectly fine, though. I went through the HTML, and my code should work for every li I'm iterating over.
private Document getHtmlDocument() throws IOException {
document = Jsoup.connect(url).get();
return document;
}
public List<AliExpressItem> getAliExpressItemList() throws IOException {
Document document;
Element ul;
Elements ulLi;
document = getHtmlDocument();
ul = document.getElementById("hs-below-list-items");
ulLi = ul.getElementsByClass("list-item");
List<AliExpressItem> itemList = new ArrayList<>();
for(Element el : ulLi) {
AliExpressItem item = new AliExpressItem();
item.setImage(el.select(".item")
.select(".img")
.select(".pic")
.select(".picRind")
.select(".picCore")
.attr("src"));
item.setDescription(el.select(".item")
.select(".info")
.select("h3")
.select("a")
.text());
item.setPrice(el.select(".item")
.select(".info")
.select(".price")
.select(".value")
.text());
itemList.add(item);
}
return itemList;
}
There's a ul with 48 li elements inside. The above code should work for all 48 of them.
<li qrdata="|32805326364|cn1511315262" pub-catid="200247142" sessionid="201711160635492248862329348280002056372" class="list-item list-item-first ">
<div class="item">
<div class="img img-border">
<div class="pic">
<a class="picRind history-item j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.1.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0" target="_blank" data-spm-anchor-id="2114.search0204.3.1"><img class="picCore pic-Core-v" src="//ae01.alicdn.com/kf/HTB1RUjgQFXXXXayXXXXq6xXFXXX4/Hot-Sale-Novelty-Toys-Hand-font-b-Spinner-b-font-Anti-stress-toys-fidget-font-b.jpg_220x220.jpg" alt="Hot Sale Novelty Toys Hand Spinner Anti stress toys fidget spinners For Autism and ADHD reliever stress spinner(China)"></a>
</div>
</div>
<div class="info">
<h3>
<a class="history-item product j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.2.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0" title="Hot Sale Novelty Toys Hand Spinner Anti stress toys fidget spinners For Autism and ADHD reliever stress spinner" target="_blank" data-spm-anchor-id="2114.search0204.3.2">Hot Sale Novelty Toys Hand <font><b>Spinner</b></font> Anti stress toys fidget <font><b>spinners</b></font> For Autism and ADHD reliever stress <font><b>spinner</b></font></a>
</h3>
<span class="price price-m">
<span class="value" itemprop="price">US $1.99</span>
<span class="separator">/</span>
<span class="unit">unidad</span>
</span>
<strong class="free-s">Envío gratis</strong>
<div class="rate-history">
<span rel="nofollow" class="order-num">
<a class="order-num-a j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.3.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0#thf" rel="nofollow" target="_blank" data-spm-anchor-id="2114.search0204.3.3"><em title="Pedido totales"> Ventas (0)</em></a>
</span>
</div>
</div>
<div class="info-more">
<div class="aplus-sp-main">
<div class="sp-box">
</div>
</div>
<div class="store-name-chat">
<div class="store-name util-clearfix">
Alisa's cabin
</div>
</div>
<a class="score-dot" href="//www.aliexpress.com/store/feedback-score/1308215.html?spm=2114.search0204.3.5.Lwk2KD" rel="nofollow" data-spm-anchor-id="2114.search0204.3.5"><span class="score-icon-new score-level-22" id="score1" feedbackscore="1,276" sellerpositivefeedbackpercentage="93.7"></span></a>
<div class="add-to-wishlist">
<a class="atwl-button j-p4plog" href="javascript:;" data-product-id="32805326364" data-batman-id="ja2kvte8" data-spm-anchor-id="2114.search0204.3.6">Añadir a Lista Deseos</a>
</div>
<input class="atc-product-id" type="hidden" value="32805326364">
<input class="atc-product-standard" type="hidden" value="">
</div>
</div>
This is the html page:
<div class="doc_details">
<fieldset style="border: 0pt">
<div class="row">
<div class="col-sm-6 col-md-6">
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Speciality</b>
</div>
<div class="col-sm-6 col-md-6">ABCD</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>City</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Residence Address</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Business Address</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
</div>
</div>
</fieldset>
</div>
I would like to read only the values of the Speciality, City and address rows into variables, as follows:
Elements rows = doc.select("div.doc_details div.row div.row ");
Element row_div = rows.select("div.row").get(0);
doctor.speciality = row_div.select("div:eq(0)").text();
But even if I change get(0) to get(1), I'm not able to get only the values into the variable.
You can probably do this with a CSS selector:
Elements divs = doc.select("div.row > div.col-sm-6:nth-child(2)");
which returns this:
0 = {Element@754} "<div class="col-sm-6 col-md-6">\n ABCD \n</div>"
1 = {Element@756} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
2 = {Element@758} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
3 = {Element@760} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
It's then really up to you. For example, you can map the list to the text of each div:
List<String> values = divs.stream().map(new Function<Element, String>() {
@Override
public String apply(Element element) {
return element.text();
}
}).collect(Collectors.toList());
or, more simply:
String speciality = divs.get(0).text();
String city = divs.get(1).text();
String address = divs.get(2).text();
Try this:
Elements rows = doc.select("div.doc_details div.row div.row ");
Element row_div = rows.select("div.col-sm-6").get(1);
doctor.speciality = row_div.text();
Tell me if it works!
Here is how I would do it:
Document doc = Jsoup.parse(html);
Elements rows = doc.select("div.doc_details div.row div.row ");
for (Element row : rows){
Elements innerDivs = row.select("div");
String header = innerDivs.get(1).text();
String content = innerDivs.get(2).text();
System.out.println("header = "+header+ " -> "+content);
}
I think you were mistaken in the css selector!
Edit: (thanks to the OP the indexes are correct now)
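If you need the values by name rather than just printed, the header and content pairs from the loop above could be collected into a map. This is only a sketch: DetailPairs and toMap are hypothetical names, and it assumes you have already gathered the header texts and the content texts into two lists in the same order.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DetailPairs {

    // Pairs each header with the content at the same position.
    // LinkedHashMap keeps the order in which the rows appeared on the page.
    static Map<String, String> toMap(List<String> headers, List<String> contents) {
        int n = Math.min(headers.size(), contents.size());
        Map<String, String> details = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            details.put(headers.get(i), contents.get(i));
        }
        return details;
    }
}
```

With the sample page, details.get("Speciality") would then give the speciality text and details.get("City") the city, without relying on fixed positions in the calling code. Equivalently, you could call details.put(header, content) directly inside the loop.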