How do I loop through divs using jsoup - Java

Hi guys, I'm using jsoup in a Java web application in IntelliJ. I'm trying to scrape port call events from a ship-tracking website and store the data in a MySQL database. Each event sits in a div with the class name table-group, and its values are in a nested div with the class name table-row. My problem is that the rows for all the vessels share the same class name, and I want to loop through each row and push its data to the database. So far I have managed to create a Java class that scrapes the first row. How can I loop through every row and store those values in my database? Should I create an ArrayList to store the values?
This is my scraper class:
public class Scarper {
private static Document doc;
public static void main(String[] args) {
final String url =
"https://www.myshiptracking.com/ports-arrivals-departures/?mmsi=&pid=277&type=0&time=&pp=20";
try {
doc = Jsoup.connect(url).get();
} catch (IOException e) {
e.printStackTrace();
}
Events();
}
public static void Events() {
Elements elm = doc.select("div.table-group:nth-of-type(2) > .table-row");
List<String> arrayList = new ArrayList<>();
for (Element ele : elm) {
String event = ele.select("div.col:nth-of-type(2)").text();
String time = ele.select("div.col:nth-of-type(3)").text();
String port = ele.select("div.col:nth-of-type(4)").text();
String vessel = ele.select(".td_vesseltype.col").text();
Event ev = new Event();
System.out.println(event);
System.out.println(time);
System.out.println(port);
System.out.println(vessel);
}
}
}
A sample of the divs I want to scrape:
<div style="box-sizing: border-box;padding: 0px 10px 10px 10px;">
<div class="cs-table">
<div class="heading">
<div class="col" style="width: 10px"></div>
<div class="col" style="width: 110px">Event</div>
<div class="col" style="width: 120px">Time (<span class="tooltip" title="My Time: In your current TimeZone">MT</span>)</div>
<div class="col" style="width: 150px">Port</div>
<div class="col">Vessel</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>BELFAST</div>
<div class="col td_vesseltype"><img src="/icons/icon7_511.png"><span class="padding_18">WILSON BLYTH [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-flag-checkered green"></i></div>
<div class="col">Arrival</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>HUNTERS QUAY</div>
<div class="col td_vesseltype"><img src="/icons/icon6_511.png"><span class="padding_18">SOUND OF SOAY [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>LARGS</div>
<div class="col td_vesseltype"><img src="/icons/icon6_511.png"><span class="padding_18">LOCH SHIRA [GB]</span></div>
</div>
</div>
<div class="table-group">
<div class="table-row">
<div class="col"><i class="fa fa-sign-out red"></i></div>
<div class="col">Departure</div>
<div class="col" style="text-align: center;">2022-02-14 <b>16:51</b></div>
<div class="col"><img class="flag_line tooltip" src="/icons/flags2/16/GB.png" title=" United Kingdom"/>RYDE</div>
<div class="col td_vesseltype"><img src="/icons/icon4_511.png"><span class="padding_18">ISLAND FLYER [GB]</span></div>
</div>
</div>

You can start by looping over the table's rows. The selector for the table is .cs-table, so you can get the table with Element table = doc.select(".cs-table").first();. The rows match the selector div.table-row, so Elements rows = table.select("div.table-row"); gives you every row in that table. Your selector div.table-group:nth-of-type(2) matches only the group that is the second div inside its parent, which is why you get a single row. Now you can loop over all the rows and extract the data from each row. The code should look like:
Element table = doc.select(".cs-table").first();
Elements rows = table.select("div.table-row");
for (Element row : rows) {
String event = row.select("div.col:nth-of-type(2)").text();
String time = row.select("div.col:nth-of-type(3)").text();
String port = row.select("div.col:nth-of-type(4)").text();
String vessel = row.select(".td_vesseltype.col").text();
System.out.println(event + "-" + time + " " + port + " " + vessel);
System.out.println("---------------------------");
// Do stuff with data here
}
Now it's up to you to decide whether to keep the data in an array or list inside the loop and use it later, or to insert it directly into your database.
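Either option can be sketched roughly as follows. This is only a sketch under assumptions: the PortCall class, the port_calls table and its column names are made up for illustration, and SAMPLE_HTML is a trimmed copy of the markup from the question. It collects every row into a list first and then writes the whole list with one JDBC batch.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class PortCallScraper {

    /** One scraped row (hypothetical class, not part of jsoup). */
    public static final class PortCall {
        public final String event, time, port, vessel;

        PortCall(String event, String time, String port, String vessel) {
            this.event = event;
            this.time = time;
            this.port = port;
            this.vessel = vessel;
        }
    }

    // Trimmed copy of the sample markup from the question (two rows only).
    public static final String SAMPLE_HTML =
        "<div class=\"cs-table\">"
        + "<div class=\"heading\"><div class=\"col\">Event</div></div>"
        + "<div class=\"table-group\"><div class=\"table-row\">"
        + "<div class=\"col\"><i></i></div><div class=\"col\">Departure</div>"
        + "<div class=\"col\">2022-02-14 <b>16:51</b></div><div class=\"col\">BELFAST</div>"
        + "<div class=\"col td_vesseltype\"><span>WILSON BLYTH [GB]</span></div>"
        + "</div></div>"
        + "<div class=\"table-group\"><div class=\"table-row\">"
        + "<div class=\"col\"><i></i></div><div class=\"col\">Arrival</div>"
        + "<div class=\"col\">2022-02-14 <b>16:51</b></div><div class=\"col\">HUNTERS QUAY</div>"
        + "<div class=\"col td_vesseltype\"><span>SOUND OF SOAY [GB]</span></div>"
        + "</div></div>"
        + "</div>";

    /** Collects every div.table-row inside .cs-table into a list. */
    public static List<PortCall> parse(Document doc) {
        List<PortCall> calls = new ArrayList<>();
        for (Element row : doc.select("div.cs-table div.table-row")) {
            calls.add(new PortCall(
                row.select("div.col:nth-of-type(2)").text(),
                row.select("div.col:nth-of-type(3)").text(),
                row.select("div.col:nth-of-type(4)").text(),
                row.select(".td_vesseltype.col").text()));
        }
        return calls;
    }

    /** Writes the list in one batch; table and column names are assumptions. */
    public static void save(Connection conn, List<PortCall> calls) throws SQLException {
        String sql = "INSERT INTO port_calls (event, event_time, port, vessel) VALUES (?, ?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (PortCall call : calls) {
                ps.setString(1, call.event);
                ps.setString(2, call.time);
                ps.setString(3, call.port);
                ps.setString(4, call.vessel);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    public static void main(String[] args) {
        // For the live page, pass Jsoup.connect(url).get() instead of the sample.
        for (PortCall call : parse(Jsoup.parse(SAMPLE_HTML))) {
            System.out.println(call.event + " | " + call.time + " | " + call.port + " | " + call.vessel);
        }
    }
}
```

Keeping the parsing separate from the database write means you can check the scraped values on their own before touching MySQL.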

Related

Can't select Multiple Dropdown with Selenium webdriver (not class Select)

I'm having a problem with my Selenium WebDriver tests. I'm using Java. I can't select an option from a multi-select drop-down that cannot be driven with the Select class.
This is the markup of the drop-down:
<div class="form-group ">
<label for="CurrentCategoriesNomIds-selectized">Categories</label>
<select placeholder="" multiple="multiple" id="CurrentCategoriesNomIds" name="CurrentCategoriesNomIds" tabindex="-1" class="selectized" style="display: none;">
<option value="325" selected="selected">Education</option>
</select>
<div class="selectize-control multi plugin-remove_button">
<div class="selectize-input items not-full has-options has-items">
<div class="item" data-value="325">
Education
×
</div>
<input type="text" autocomplete="off" tabindex="" id="CurrentCategoriesNomIds-selectized" style="width: 4px; opacity: 1; position: relative; left: 0px;"></div>
<div class="selectize-dropdown multi plugin-remove_button" style="display: none; visibility: visible; width: 800px; top: 36px; left: 0px;">
<div class="selectize-dropdown-content">
<div class="option" data-selectable="" data-value="324">Agriculture</div>
<div class="option" data-selectable="" data-value="298">Culture</div>
<div class="option" data-selectable="" data-value="326">Employment</div>
<div class="option" data-selectable="" data-value="323">Environment</div>
<div class="option" data-selectable="" data-value="327">Other</div>
<div class="option" data-selectable="" data-value="297">Political</div>
<div class="option" data-selectable="" data-value="322">Transport</div>
</div>
</div>
</div>
</div>
This is how it looks when 2 options are selected. I was wondering whether I could use Keys, but the page doesn't work like that. I haven't seen this kind of field before and I'm not sure how to proceed.
You can click an option in the dropdown using this code:
public static void selectOption(WebDriver driver, String optionName) {
List<WebElement> options = driver.findElements(By.xpath("//div[@class='selectize-dropdown-content']//div[@class='option']"));
options.forEach(option -> {
if (option.getAttribute("innerText").equals(optionName)) {
Actions actions = new Actions(driver);
actions.moveToElement(option).click().build().perform();
}
});
}
and then use like this:
String option = "Education";
selectOption(driver,option);
Hope that helps you:)
Adding screenshot for what I have tried on website : https://semantic-ui.com/modules/dropdown.html
I don't use Java much, so treat this as an outline of how to achieve it (it may not run as written against your page).
public static void selectOptionFromSelectizeDropdown(WebDriver driver, String optionText, String dropdown) {
    // Matches an element whose class contains "option" and whose text contains optionText
    By option = By.xpath("//*[contains(@class, 'option') and contains(text(), '" + optionText + "')]");
    int numberOfOptions = driver.findElements(By.cssSelector(dropdown + " .option")).size();
    for (int i = 0; i < numberOfOptions; i++) {
        List<WebElement> matches = driver.findElements(option);
        // Check if it's displayed, if it is, HUZZAH! Click the option
        if (!matches.isEmpty() && matches.get(0).isDisplayed()) {
            matches.get(0).click();
            return;
        }
        // In case there are many options, and you have to scroll through them.
        for (int x = 0; x <= 6; x++) {
            driver.findElement(By.cssSelector(dropdown)).sendKeys(Keys.DOWN);
        }
    }
    // org.openqa.selenium.NoSuchElementException
    throw new NoSuchElementException("Option not found: " + optionText);
}
selectOptionFromSelectizeDropdown(driver, "Education", ".selectize-dropdown-content");
If this doesn't work, I'd recommend changing the click() to a sendKeys(Keys.ENTER) to see if that would work.
Explanation
The method loops, checking whether the option is displayed on the page. If it is not, it scrolls down a few times and checks again, until the option is found.
If it has made as many attempts as there are options inside the box, it throws an error.

How to choose an item from a list in a multi-select in Selenium Java

What I have tried:
Select listbox = new Select(
driver.findElement(By.xpath("//*[@id='multiselect_categories']"))
);
listbox.selectByValue("ATM");
HTML when some options are chosen:
<input name="multiselect_categories" id="multiselect_categories"
type="text" autocomplete="off" placeholder="Select option"
tabindex="0" class="multiselect__input" style="display: none;">
<div class="multiselect__tags">
<div class="multiselect__tags-wrap" style="">
<span class="multiselect__tag">
<span>Actions and Practices</span>
<i aria-hidden="true" tabindex="1" class="multiselect__tag-icon"></i>
</span>
<span class="multiselect__tag">
<span>Air Carrier Services and Safety Oversight</span>
<i aria-hidden="true" tabindex="1" class="multiselect__tag-icon"></i>
</span>
</div>
<div class="multiselect__spinner" style="display: none;"></div>
<input name="multiselect_categories" id="multiselect_categories"
type="text" autocomplete="off" placeholder="Select option"
tabindex="0" class="multiselect__input"
style="width: 0px; position: absolute; padding: 0px; display: none;">
</div>
<div class="multiselect__content-wrapper" style="max-height: 291.375px; display: none;">
<ul class="multiselect__content" style="display: inline-block;">
<li class="multiselect__element">
<span data-select="Press enter to select" data-selected="Selected"
data-deselect="Press enter to remove" class="multiselect__option">
<span>ATM</span>
</span>
</li>
<li class="multiselect__element">
<span data-select="Press enter to select" data-selected="Selected"
data-deselect="Press enter to remove" class="multiselect__option
multiselect__option--selected">
<span>Actions and Practices</span>
</span>
</li>
<li class="multiselect__element">
<span data-select="Press enter to select" data-selected="Selected"
data-deselect="Press enter to remove" class="multiselect__option
multiselect__option--selected">
<span>Air Carrier Services and Safety Oversight</span>
</span>
</li>
</ul>
</div>
Code that failed when I added it to my Selenium tests under TestNG:
@Test(description = "Test5")
public void chooseCatagory(String... catagories) {
for(String catagory: catagories) {
// input catagory in text box which display placeholder `Select option`
driver.findElement(By.cssSelector("div.multiselect__tags #multiselect_categories"))
.sendKeys(catagory);
// find the item from auto-suggest list
driver.findElement(By.cssSelector("div.multiselect__tags + div > ul"))
.findElement(By.xpath("./li//span[text()='"+catagory+"']"))
.click();
}
}
chooseCatagory("ATM", "Airports");
Error from the above code:
org.testng.TestNGException:
Cannot inject @Test annotated Method [chooseCatagory] with [class [Ljava.lang.String;].
For more information on native dependency injection please refer to http://testng.org/doc/documentation-main.html#native-dependency-injection
org.testng.TestNGException:
HTML when nothing is chosen:
<input name="multiselect_categories" id="multiselect_categories"
type="text" autocomplete="off" placeholder="Select option" tabindex="0" class="multiselect__input" style="display: none;">
<span><span class="multiselect__single">
Select option
</span></span>
What the list contains:
ATM, Action (refer to the screenshot)
@Test(description = "Test5")
public void test_chooseCatagory() {
chooseCatagory("ATM", "Airports");
}
private void chooseCatagory(String... catagories) {
for(String catagory: catagories) {
// click the down arrow at right to make the filter text box and
// all option list display
driver.findElement(By.cssSelector("div.multiselect__select"))
.click();
// input catagory into text box to filter matched options
driver.findElement(By.cssSelector(".multiselect__tags #multiselect_categories"))
.sendKeys(catagory);
// click the option from filtered option list
driver.findElement(By.cssSelector(".multiselect__content-wrapper > ul"))
.findElement(By.xpath("./li//span[text()='"+catagory+"']"))
.click();
// sleep 2 seconds before next choosing
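try {
Thread.sleep(2000);
}
catch(InterruptedException e) {
// restore the interrupt flag instead of silently swallowing it
Thread.currentThread().interrupt();
}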
}
}

Why does my loop only work on some of its iterations? (using Jsoup to extract data)

The items in my itemList are incomplete! For some reason, from the 10th iteration of my loop to the last,
el.select(".item").select(".img").select(".pic").select(".picRind").select(".picCore").attr("src")
returns an empty string, and I can't understand why.
Iterations 0 to 9 are perfectly fine, though. I went through the HTML, and my code should work for every li I'm iterating through.
private Document getHtmlDocument() throws IOException {
document = Jsoup.connect(url).get();
return document;
}
public List<AliExpressItem> getAliExpressItemList() throws IOException {
Document document;
Element ul;
Elements ulLi;
document = getHtmlDocument();
ul = document.getElementById("hs-below-list-items");
ulLi = ul.getElementsByClass("list-item");
List<AliExpressItem> itemList = new ArrayList<>();
for(Element el : ulLi) {
AliExpressItem item = new AliExpressItem();
item.setImage(el.select(".item")
.select(".img")
.select(".pic")
.select(".picRind")
.select(".picCore")
.attr("src"));
item.setDescription(el.select(".item")
.select(".info")
.select("h3")
.select("a")
.text());
item.setPrice(el.select(".item")
.select(".info")
.select(".price")
.select(".value")
.text());
itemList.add(item);
}
return itemList;
}
There's a ul with 48 li elements inside. The above code should work for all 48 of them.
<li qrdata="|32805326364|cn1511315262" pub-catid="200247142" sessionid="201711160635492248862329348280002056372" class="list-item list-item-first ">
<div class="item">
<div class="img img-border">
<div class="pic">
<a class="picRind history-item j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.1.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0" target="_blank" data-spm-anchor-id="2114.search0204.3.1"><img class="picCore pic-Core-v" src="//ae01.alicdn.com/kf/HTB1RUjgQFXXXXayXXXXq6xXFXXX4/Hot-Sale-Novelty-Toys-Hand-font-b-Spinner-b-font-Anti-stress-toys-fidget-font-b.jpg_220x220.jpg" alt="Hot Sale Novelty Toys Hand Spinner Anti stress toys fidget spinners For Autism and ADHD reliever stress spinner(China)"></a>
</div>
</div>
<div class="info">
<h3>
<a class="history-item product j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.2.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0" title="Hot Sale Novelty Toys Hand Spinner Anti stress toys fidget spinners For Autism and ADHD reliever stress spinner" target="_blank" data-spm-anchor-id="2114.search0204.3.2">Hot Sale Novelty Toys Hand <font><b>Spinner</b></font> Anti stress toys fidget <font><b>spinners</b></font> For Autism and ADHD reliever stress <font><b>spinner</b></font></a>
</h3>
<span class="price price-m">
<span class="value" itemprop="price">US $1.99</span>
<span class="separator">/</span>
<span class="unit">unidad</span>
</span>
<strong class="free-s">Envío gratis</strong>
<div class="rate-history">
<span rel="nofollow" class="order-num">
<a class="order-num-a j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.3.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0#thf" rel="nofollow" target="_blank" data-spm-anchor-id="2114.search0204.3.3"><em title="Pedido totales"> Ventas (0)</em></a>
</span>
</div>
</div>
<div class="info-more">
<div class="aplus-sp-main">
<div class="sp-box">
</div>
</div>
<div class="store-name-chat">
<div class="store-name util-clearfix">
Alisa's cabin
</div>
</div>
<a class="score-dot" href="//www.aliexpress.com/store/feedback-score/1308215.html?spm=2114.search0204.3.5.Lwk2KD" rel="nofollow" data-spm-anchor-id="2114.search0204.3.5"><span class="score-icon-new score-level-22" id="score1" feedbackscore="1,276" sellerpositivefeedbackpercentage="93.7"></span></a>
<div class="add-to-wishlist">
<a class="atwl-button j-p4plog" href="javascript:;" data-product-id="32805326364" data-batman-id="ja2kvte8" data-spm-anchor-id="2114.search0204.3.6">Añadir a Lista Deseos</a>
</div>
<input class="atc-product-id" type="hidden" value="32805326364">
<input class="atc-product-standard" type="hidden" value="">
</div>
</div>

How can I extract information from HTML depending on the structure

I want to extract some data from many Xbox store links. The problem I am experiencing is that the section showing the price has a different structure when, for example, the game is discounted.
The code I have written to scrape the price:
String urlPage = "https://www.microsoft.com/en-us/store/p/call-of-duty-advanced-warfare-gold-edition/c20hl06x0v8w" ;
System.out.println("Checking entries of: "+urlPage);
if (getStatusConnectionCode(urlPage) == 200) {
Document document = getHtmlDocument(urlPage);
Elements entradas = document.select("div.m-product-detail-hero-product-placement div.price-info");
for (Element elem : entradas) {
String titulo = elem.getElementsByClass("srv_saleprice").text();
}
}else{
System.out.println("The status code is not OK, it is: "+getStatusConnectionCode(urlPage));
}
The HTML for a game that has no discount:
URL for first case
<div class="price-info">
<div class="c-price">
<div class="price-text srv_price">
<div class="ea-vault-message hidden x-hidden">
<div>
Available in The Vault
</div>
<div>
or
</div>
</div>
<span>$59.99</span>
<sup>+</sup>
</div>
<div class="srv_microdata" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="price" content="59.99">
<meta itemprop="priceCurrency" content="USD">
</div>
</div>
</div>
And for a game with discount:
URL for the second case
<div class="price-info">
<div class="c-price">
<div class="price-text srv_price">
<div class="ea-vault-message hidden x-hidden">
<div>
Available in The Vault
</div>
<div>
or
</div>
</div>
<s class="srv_saleprice" aria-label="Full price was $159.99">$159.99</s>
<span> </span>
<div class="price-disclaimer">
<span>$135.99</span>
<sup>+</sup>
</div>
<span> </span>
<span></span>
</div>
<div class="caption text-muted srv_countdown">
<span class="sub">save $24.00</span>
</div>
<div class="srv_microdata" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="price" content="135.99">
<meta itemprop="priceCurrency" content="USD">
</div>
</div>
</div>
In this second example the price shown in the span is $135.99, which is not the game's base price ($159.99 in this case).
How can I extract only the base price for every game, with or without a discount?
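One possible approach, sketched against the two snippets above only: when a struck-through s.srv_saleprice element is present, it holds the full price; otherwise the first span that is a direct child of div.price-text does. The class name BasePrice and the trimmed NO_DISCOUNT and DISCOUNT strings are made up for illustration, and selectFirst assumes a reasonably recent jsoup version. Other price layouts on the live site are not covered.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class BasePrice {

    // Trimmed copies of the two price-info snippets from the question.
    public static final String NO_DISCOUNT =
        "<div class=\"price-info\"><div class=\"c-price\">"
        + "<div class=\"price-text srv_price\">"
        + "<div class=\"ea-vault-message hidden x-hidden\"><div>Available in The Vault</div></div>"
        + "<span>$59.99</span><sup>+</sup>"
        + "</div></div></div>";

    public static final String DISCOUNT =
        "<div class=\"price-info\"><div class=\"c-price\">"
        + "<div class=\"price-text srv_price\">"
        + "<s class=\"srv_saleprice\" aria-label=\"Full price was $159.99\">$159.99</s>"
        + "<span> </span>"
        + "<div class=\"price-disclaimer\"><span>$135.99</span><sup>+</sup></div>"
        + "</div></div></div>";

    /** Returns the undiscounted price text, or an empty string if none is found. */
    public static String basePrice(Element priceInfo) {
        // Discounted games: the original price is the struck-through element.
        Element struck = priceInfo.selectFirst("s.srv_saleprice");
        if (struck != null) {
            return struck.text();
        }
        // No discount: the price is the first span directly under div.price-text.
        Element plain = priceInfo.selectFirst("div.price-text > span");
        return plain != null ? plain.text() : "";
    }

    public static void main(String[] args) {
        System.out.println(basePrice(Jsoup.parse(NO_DISCOUNT)));
        System.out.println(basePrice(Jsoup.parse(DISCOUNT)));
    }
}
```

In the loop from the question, the same method could be called as basePrice(elem) for each div.price-info element.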

How to access nested divs using Jsoup

This is the html page:
<div class="doc_details">
<fieldset style="border: 0pt">
<div class="row">
<div class="col-sm-6 col-md-6">
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Speciality</b>
</div>
<div class="col-sm-6 col-md-6">ABCD</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>City</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Residence Address</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
<div class="row">
<div class="col-sm-6 col-md-6">
<b>Business Address</b>
</div>
<div class="col-sm-6 col-md-6">Ranchi</div>
</div>
</div>
</div>
</fieldset>
</div>
I would like to read only the values of the Speciality, City and address rows into variables, as follows:
Elements rows = doc.select("div.doc_details div.row div.row ");
Element row_div = rows.select("div.row").get(0);
doctor.speciality = row_div.select("div:eq(0)").text();
But even if I change get(0) to get(1), I can't get only the values into the variable.
You can probably do this with a CSS selector:
Elements divs = doc.select("div.row > div.col-sm-6:nth-child(2)");
which returns this:
0 = {Element#754} "<div class="col-sm-6 col-md-6">\n ABCD \n</div>"
1 = {Element#756} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
2 = {Element#758} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
3 = {Element#760} "<div class="col-sm-6 col-md-6">\n Ranchi \n</div>"
It's then really up to you. You can, for example, map the list to the text of each div:
List<String> texts = divs.stream().map(new Function<Element, String>() {
@Override
public String apply(Element element) {
return element.text();
}
}).collect(Collectors.toList());
or, more simply:
String speciality = divs.get(0).text();
String city = divs.get(1).text();
String address = divs.get(2).text();
Try this:
Elements rows = doc.select("div.doc_details div.row div.row ");
Element row_div = rows.select("div.col-sm-6").get(1);
doctor.speciality = row_div.text();
tell me if it works!
Here is how I would do it:
Document doc = Jsoup.parse(html);
Elements rows = doc.select("div.doc_details div.row div.row ");
for (Element row : rows){
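// select("div") also returns the row itself at index 0,
// so the header is at index 1 and the content at index 2
Elements innerDivs = row.select("div");
String header = innerDivs.get(1).text();
String content = innerDivs.get(2).text();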
System.out.println("header = "+header+ " -> "+content);
}
I think you were mistaken in the css selector!
Edit: (thanks to the OP the indexes are correct now)
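Building on the header/content idea above, the pairs could also be collected into a map so that each value can be looked up by its label. This is a minimal sketch, not a definitive implementation: the class name DoctorDetails and the trimmed SAMPLE_HTML are assumptions for illustration. It uses row.children() rather than row.select("div"), so the row itself is not part of the result and the indexes start at 0.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.LinkedHashMap;
import java.util.Map;

public class DoctorDetails {

    // Trimmed copy of the markup from the question (three inner rows).
    public static final String SAMPLE_HTML =
        "<div class=\"doc_details\"><fieldset><div class=\"row\">"
        + "<div class=\"col-sm-6 col-md-6\">"
        + "<div class=\"row\"><div class=\"col-sm-6 col-md-6\"><b>Speciality</b></div>"
        + "<div class=\"col-sm-6 col-md-6\">ABCD</div></div>"
        + "<div class=\"row\"><div class=\"col-sm-6 col-md-6\"><b>City</b></div>"
        + "<div class=\"col-sm-6 col-md-6\">Ranchi</div></div>"
        + "<div class=\"row\"><div class=\"col-sm-6 col-md-6\"><b>Residence Address</b></div>"
        + "<div class=\"col-sm-6 col-md-6\">Ranchi</div></div>"
        + "</div></div></fieldset></div>";

    /** Maps each label (e.g. "City") to its value, in document order. */
    public static Map<String, String> parse(Document doc) {
        Map<String, String> details = new LinkedHashMap<>();
        // "div.row div.row" matches only rows nested inside another row.
        for (Element row : doc.select("div.doc_details div.row div.row")) {
            Elements cols = row.children(); // direct children only
            if (cols.size() >= 2) {
                details.put(cols.get(0).text(), cols.get(1).text());
            }
        }
        return details;
    }

    public static void main(String[] args) {
        Map<String, String> details = parse(Jsoup.parse(SAMPLE_HTML));
        for (Map.Entry<String, String> entry : details.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

With the map in hand, doctor.speciality could be set from details.get("Speciality") without relying on row order.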
