Parsing <ul class="news-list"> with Java

How can I parse ul elements with a specific class in an HTML document using Java?
I want to parse this section of the HTML:
<ul class="news-list">
<li>
<a onclick="AjaxStatManager('Content','1258')" href="http://www.gyte.edu.tr/icerik/120/1258/kim-101-final-mazeret-sinavi.aspx" target="_self">
<div class="text">
<h2>KİM 101 Final Mazeret Sınavı</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1248')" href="http://www.gyte.edu.tr/icerik/120/1248/butunleme-sinav-tarihleri.aspx" target="_self">
<div class="text">
<h2>Bütünleme Sınav Tarihleri</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1242')" href="http://www.gyte.edu.tr/icerik/120/1242/bil-374-internet-teknolojileri-final-sinavi.aspx" target="_self">
<div class="text">
<h2>Bil 374 İnternet Teknolojileri Final Sınavı</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1241')" href="http://www.gyte.edu.tr/icerik/120/1241/kim101-final-sinavi.aspx" target="_self">
<div class="text">
<h2>Kim101 Final Sınavı </h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1222')" href="/Files/UserFiles/85/duyurular/yeterlilik.pdf" target="_self">
<div class="text">
<h2>Doktora Yeterlilik Sınav Tarihleri</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1221')" href="/Files/UserFiles/85/duyurular/duyuru-dokt-seminer.pdf" target="_self">
<div class="text">
<h2>Doktora Programı Adaylarına Önemli Duyuru</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1127')" href="http://www.gyte.edu.tr/icerik/120/1127/20122013-egitimogretim-yili-guz-yari-yili--final-programi.aspx" target="_self">
<div class="text">
<h2>2012-2013 Eğitim-Öğretim Yılı Güz Yarı Yılı Final Programı</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1109')" href="/Files/UserFiles/85/duyurular/Yüksek Lisans Doktora Seminer I ve II Sunum Takvimi.pdf" target="_self">
<div class="text">
<h2>Yüksek Lisans / Doktora Seminer I ve II Sunum Takvimi</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','998')" href="http://www.gyte.edu.tr/icerik/120/998/bilgisayar-muhendisligi-bolumu-20122013-guz-yari-yili-ders-programlari.aspx" target="_self">
<div class="text">
<h2>Bilgisayar Mühendisliği Bölümü 2012-2013 Güz Yarı Yılı Ders Programları</h2>
<p>Bilgisayar Mühendisliği Bölümü 2012-2013 Güz Yarı Yılı Ders Programları</p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1101')" href="http://www.gyte.edu.tr/icerik/120/1101/kim-101-kimya-dersi---ii-vizesi.aspx" target="_self">
<div class="text">
<h2>KİM 101 Kimya Dersi II .vizesi</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1073')" href="/Files/duyuru/bilgisayar_muh/Yuksek_lisans_-_Doktora_Seminer_I_-_II.pdf" target="_self">
<div class="text">
<h2>Yüksek Lisans/Doktora Seminer I ve II Ders Planı</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1058')" href="/Files/duyuru/bilgisayar_muh/bil495-496syl.pdf" target="_self">
<div class="text">
<h2>BIL 495/496 Bitirme Projesi Ders Planı</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','1006')" href="/Files/duyuru/bilgisayar_muh/duy-ders2013guz_1.doc" target="_self">
<div class="text">
<h2>G.Y.T.E. Lisans Üstü Öğrencilerinin Dikkatine</h2>
<p></p>
</div>
</a>
</li>
<li>
<a onclick="AjaxStatManager('Content','984')" href="http://www.gyte.edu.tr/icerik/120/984/bil-341-programlama-dilleri-butunleme-sinavi.aspx" target="_self">
<div class="text">
<h2>BİL 341 Programlama Dilleri bütünleme sınavı</h2>
<p></p>
</div>
</a>
</li>
</ul>
I have the following code to parse it, but it does not work:
try {
URL url = new URL("http://www.gyte.edu.tr/kategori/120/0/duyurular.aspx");
HTMLEditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
Reader HTMLReader = new InputStreamReader(url.openConnection().getInputStream());
kit.read(HTMLReader, doc, 0);
ElementIterator it = new ElementIterator(doc);
Element elem;
while ((elem = it.next()) != null) {
AttributeSet as = elem.getAttributes();
if (as.containsAttribute("class", "news-list")) {
int c = elem.getElementCount();
System.out.println("Element count = " + c);
}
}
} catch (IOException | BadLocationException e) {
e.printStackTrace();
return e.getMessage();
}
return "Success!";

You could load it into a Document object. This will read in the HTML for you and you can iterate/query using available methods.
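The answer above does not say which Document class it means. As a minimal sketch, assuming the JDK's W3C DOM (org.w3c.dom.Document) and assuming the markup is well-formed XML (real-world HTML often is not, and then this parser will throw), loading and iterating could look like this. The class name NewsListDom and the method name links are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects the href of every <a> inside <ul class="news-list">.
class NewsListDom {
    static List<String> links(String xml) throws Exception {
        // Assumption: the input is well-formed XML; malformed HTML makes parse() throw.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<String> hrefs = new ArrayList<>();
        NodeList lists = doc.getElementsByTagName("ul");
        for (int i = 0; i < lists.getLength(); i++) {
            Element ul = (Element) lists.item(i);
            // Exact match on the class attribute; skip every other list.
            if (!"news-list".equals(ul.getAttribute("class"))) {
                continue;
            }
            NodeList anchors = ul.getElementsByTagName("a");
            for (int j = 0; j < anchors.getLength(); j++) {
                hrefs.add(((Element) anchors.item(j)).getAttribute("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample in the shape of the markup from the question.
        String xml = "<ul class=\"news-list\">"
                + "<li><a href=\"/Files/duyuru/bilgisayar_muh/bil495-496syl.pdf\">"
                + "<div class=\"text\"><h2>BIL 495/496</h2><p></p></div></a></li>"
                + "</ul>";
        for (String href : links(xml)) {
            System.out.println(href);
        }
    }
}
```

For pages that are not well-formed, an HTML-tolerant parser is the safer choice.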

I think this is a job for an XPath query (this assumes the page can be parsed as well-formed XML):
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "//ul[@class = 'news-list']";
InputSource inputSource = new InputSource("your.html");
NodeList nodes = (NodeList) xpath.evaluate(expression, inputSource, XPathConstants.NODESET);
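The XPath idea can be sketched end to end with only the JDK's javax.xml.xpath classes. This is a hedged illustration, not a drop-in solution: it assumes the input is well-formed XML, and the class name NewsListXPath, the method name headlines, and the shortened sample markup are hypothetical. Note that XPath attribute tests are written with @, as in @class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the text of each <h2> under <ul class="news-list">.
class NewsListXPath {
    static List<String> headlines(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // XPathConstants.NODESET results are returned as org.w3c.dom.NodeList.
        NodeList nodes = (NodeList) xpath.evaluate(
                "//ul[@class='news-list']/li/a//h2",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample in the shape of the markup from the question.
        String xml = "<ul class=\"news-list\">"
                + "<li><a href=\"/Files/duyuru/bilgisayar_muh/bil495-496syl.pdf\">"
                + "<div class=\"text\"><h2>BIL 495/496 Bitirme Projesi Ders Plan\u0131</h2>"
                + "<p></p></div></a></li>"
                + "</ul>";
        for (String headline : headlines(xml)) {
            System.out.println(headline);
        }
    }
}
```

The expression matches only when the class attribute is exactly news-list; a list with several classes would need a contains()-style test instead.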

Here is the JSoup solution:
try {
Document doc = Jsoup.parse(new URL("http://www.gyte.edu.tr/kategori/120/0/duyurular.aspx"), 1000000);
Elements elements = doc.getElementsByAttributeValue("class", "news-list");
System.out.println(elements.size());
for (Element e : elements) {
System.out.println(e.toString());
}
} catch (Exception e) {
e.printStackTrace();
}
and the output:
<ul class="news-list">
<li> <a onclick="AjaxStatManager('Content','1258')" href="http://www.gyte.edu.tr/icerik/120/1258/kim-101-final-mazeret-sinavi.aspx" target="_self">
<div class="text">
<h2>KİM 101 Final Mazeret Sınavı</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1248')" href="http://www.gyte.edu.tr/icerik/120/1248/butunleme-sinav-tarihleri.aspx" target="_self">
<div class="text">
<h2>Bütünleme Sınav Tarihleri</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1242')" href="http://www.gyte.edu.tr/icerik/120/1242/bil-374-internet-teknolojileri-final-sinavi.aspx" target="_self">
<div class="text">
<h2>Bil 374 İnternet Teknolojileri Final Sınavı</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1241')" href="http://www.gyte.edu.tr/icerik/120/1241/kim101-final-sinavi.aspx" target="_self">
<div class="text">
<h2>Kim101 Final Sınavı </h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1222')" href="/Files/UserFiles/85/duyurular/yeterlilik.pdf" target="_self">
<div class="text">
<h2>Doktora Yeterlilik Sınav Tarihleri</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1221')" href="/Files/UserFiles/85/duyurular/duyuru-dokt-seminer.pdf" target="_self">
<div class="text">
<h2>Doktora Programı Adaylarına Önemli Duyuru</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1127')" href="http://www.gyte.edu.tr/icerik/120/1127/20122013-egitimogretim-yili-guz-yari-yili--final-programi.aspx" target="_self">
<div class="text">
<h2>2012-2013 Eğitim-Öğretim Yılı Güz Yarı Yılı Final Programı</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1109')" href="/Files/UserFiles/85/duyurular/Yüksek Lisans Doktora Seminer I ve II Sunum Takvimi.pdf" target="_self">
<div class="text">
<h2>Yüksek Lisans / Doktora Seminer I ve II Sunum Takvimi</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','998')" href="http://www.gyte.edu.tr/icerik/120/998/bilgisayar-muhendisligi-bolumu-20122013-guz-yari-yili-ders-programlari.aspx" target="_self">
<div class="text">
<h2>Bilgisayar Mühendisliği Bölümü 2012-2013 Güz Yarı Yılı Ders Programları</h2>
<p>Bilgisayar Mühendisliği Bölümü 2012-2013 Güz Yarı Yılı Ders Programları</p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1101')" href="http://www.gyte.edu.tr/icerik/120/1101/kim-101-kimya-dersi---ii-vizesi.aspx" target="_self">
<div class="text">
<h2>KİM 101 Kimya Dersi II .vizesi</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1073')" href="/Files/duyuru/bilgisayar_muh/Yuksek_lisans_-_Doktora_Seminer_I_-_II.pdf" target="_self">
<div class="text">
<h2>Yüksek Lisans/Doktora Seminer I ve II Ders Planı</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1058')" href="/Files/duyuru/bilgisayar_muh/bil495-496syl.pdf" target="_self">
<div class="text">
<h2>BIL 495/496 Bitirme Projesi Ders Planı</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','1006')" href="/Files/duyuru/bilgisayar_muh/duy-ders2013guz_1.doc" target="_self">
<div class="text">
<h2>G.Y.T.E. Lisans Üstü Öğrencilerinin Dikkatine</h2>
<p></p>
</div> </a> </li>
<li> <a onclick="AjaxStatManager('Content','984')" href="http://www.gyte.edu.tr/icerik/120/984/bil-341-programlama-dilleri-butunleme-sinavi.aspx" target="_self">
<div class="text">
<h2>BİL 341 Programlama Dilleri bütünleme sınavı</h2>
<p></p>
</div> </a> </li>
</ul>

Related

Null pointer while fetching data with Jsoup [duplicate]

I have a problem that I do not know how to solve. While fetching data I get a NullPointerException (NPE). It is weird, because it works normally for other book categories.
String romancesCategoryEmpikURL = "https://www.empik.com/ksiazki/poradniki";
Document document = Jsoup.connect(romancesCategoryEmpikURL).get();
List<Element> siteElements = document.select("div.productBox__info");
List<Book> romanceCategoryBooks = new ArrayList<>();
for (int i = 0; i < 15; i++) {
String author = siteElements.get(i).select("span > a").first().ownText();
romanceCategoryBooks.add(new Book.BookBuilder()
.withAuthor(author)
.withPrice(price)
.withTitle(title)
.withProductID(productID)
.withBookURL(BookURL)
.build());
}
The NPE occurs when fetching the author from the site: https://www.empik.com/ksiazki/poradniki
HTML code:
<div class="productBox__info">
<a href="/jak-uratowac-swiat-czyli-co-dobrego-mozesz-zrobic-dla-planety-szpura-areta,p1223701396,ksiazka-p" class="productBox seoTitle" title="Jak uratować świat? Czyli co dobrego możesz zrobić dla planety - Szpura Areta" data-product-id="p1223701396">
<span class="productBox__title">
<span class="productBox__number">1</span>
Jak uratować świat? Czyli co dobrego możesz zrobić dla planety
</span>
</a>
<span class="productBox__subtitle">
<a href="/szukaj/produkt?author=szpura+areta" class="smartAuthor" title="Szpura Areta - wszystkie produkty">
Szpura Areta </a>
</span>
<div class="rating">
<ul class="ratingStars"><li class="rate"><i class="fa fa-fw fa-star active"></i></li><li class="rate"><i class="fa fa-fw fa-star active"></i></li><li class="rate"><i class="fa fa-fw fa-star active"></i></li><li class="rate"><i class="fa fa-fw fa-star active"></i></li><li class="rate"><i class="fa fa-fw fa-star active"></i></li></ul>
<div class="score">
4.7/5
</div>
</div>
<div class="productBox__price">
<div class="productBox__priceItem productBox__priceItem--promotion ta-productlist-price ">
37,49 zł </div>
<div class="productBox__priceItem productBox__priceItem--old ta-productlist-oldprice">
49,99 zł </div>
</div>
</div>
I want to fetch the author, which is Szpura Areta.

Selecting Elements from a 'special' listbox with Selenium in Java

I have the following HTML code for a listbox:
<div role="listbox" aria-expanded="false" class="quantumWizMenuPaperselectEl docssharedWizSelectPaperselectRoot freebirdFormviewerViewItemsSelectSelect freebirdThemedSelectDarkerDisabled" jscontroller="YwHGTd" jsaction="click:cOuCgd(LgbsSe); keydown:I481le; keypress:Kr2w4b; mousedown:UX7yZ(LgbsSe),npT2md(preventDefault=true); mouseup:lbsD7e(LgbsSe); mouseleave:JywGue; touchstart:p6p2H(LgbsSe); touchmove:FwuNnf; touchend:yfqBxc(LgbsSe|preventMouseEvents=true|preventDefault=true); touchcancel:JMtRjd(LgbsSe); focus:AHmuwe; blur:O22p3e;b5SvAb:TvD9Pc;" jsshadow="" jsname="W85ice" aria-describedby="i.desc.709120473 i.err.709120473" aria-labelledby="i73">
<div jsname="LgbsSe" role="presentation">
<div class="quantumWizMenuPaperselectOptionList" jsname="d9BH4c" role="presentation">
<div class="quantumWizMenuPaperselectOption freebirdThemedSelectOptionDarkerDisabled exportOption isSelected isPlaceholder" jsname="wQNmvb" jsaction="" data-value="" aria-selected="true" role="option" tabindex="0">
<div class="quantumWizMenuPaperselectRipple exportInk" jsname="ksKsZd"></div>
<content class="quantumWizMenuPaperselectContent exportContent">Auswählen</content>
</div>
<div class="quantumWizMenuPaperselectOptionSeparator" role="presentation"></div>
<div class="quantumWizMenuPaperselectOption freebirdThemedSelectOptionDarkerDisabled exportOption" jsname="wQNmvb" jsaction="" data-value="140 cm" aria-selected="false" role="option" tabindex="-1">
<div class="quantumWizMenuPaperselectRipple exportInk" jsname="ksKsZd"></div>
<content class="quantumWizMenuPaperselectContent exportContent">140 cm</content>
</div>
<div class="quantumWizMenuPaperselectOption freebirdThemedSelectOptionDarkerDisabled exportOption" jsname="wQNmvb" jsaction="" data-value="141 cm" aria-selected="false" role="option" tabindex="-1">
<div class="quantumWizMenuPaperselectRipple exportInk" jsname="ksKsZd"></div>
<content class="quantumWizMenuPaperselectContent exportContent">141 cm</content>
</div>
<div class="quantumWizMenuPaperselectOption freebirdThemedSelectOptionDarkerDisabled exportOption" jsname="wQNmvb" jsaction="" data-value="142 cm" aria-selected="false" role="option" tabindex="-1">
<div class="quantumWizMenuPaperselectRipple exportInk" jsname="ksKsZd"></div>
<content class="quantumWizMenuPaperselectContent exportContent">142 cm</content>
</div>
<div class="quantumWizMenuPaperselectOption freebirdThemedSelectOptionDarkerDisabled exportOption" jsname="wQNmvb" jsaction="" data-value="143 cm" aria-selected="false" role="option" tabindex="-1">
<div class="quantumWizMenuPaperselectRipple exportInk" jsname="ksKsZd"></div>
<content class="quantumWizMenuPaperselectContent exportContent">143 cm</content>
</div>
</div>
<div class="quantumWizMenuPaperselectDropDown exportDropDown" role="presentation"></div>
</div>
<div class="exportSelectPopup quantumWizMenuPaperselectPopup" jsaction="click:dPTK6c(wQNmvb); mousedown:uYU8jb(wQNmvb); mouseup:LVEdXd(wQNmvb); mouseover:nfXz1e(wQNmvb); touchstart:Rh2fre(wQNmvb); touchmove:hvFWtf(wQNmvb); touchend:MkF9r(wQNmvb|preventMouseEvents=true)" role="presentation" jsname="V68bde" style="display:none;"></div>
</div>
I am writing a Java program that has to select an option from this listbox automatically (such as "140 cm" or "141 cm", as shown in the code). I tried to access the listbox itself with the following code:
WebElement checkBox = driver.findElement(By.cssSelector("div[aria-labelledby*=i73]"));
checkBox.click();
That worked, but now I have to select an option from this listbox somehow. I tried the Select class, which did not work:
Select listbox = new Select(checkBox);
listbox.selectByVisibleText("140 cm");
I also tried clicking the specific div with the '140 cm' text after waiting for it to become clickable, but I get a timeout exception because the element never becomes clickable within the wait.
WebElement boxElement = driver.findElement(By.cssSelector("div[data-value*='140']"));
WebDriverWait wait = new WebDriverWait(driver, 10);
boxElement = wait.until(ExpectedConditions.elementToBeClickable(By.cssSelector("div[data-value*='140']")));
boxElement.click();
I am desperate and do not know what to do. Can anyone help me? I am thankful for every answer!
Greetings

Why does my loop only work on some of its iterations? (using Jsoup to extract data)

The items in my itemList are incomplete! For some reason, from the 10th iteration of my loop to the last,
el.select(".item").select(".img").select(".pic").select(".picRind").select(".picCore").attr("src")
returns an empty string, and I can't understand why.
Iterations 0 through 9 are perfectly fine, though. I went through the HTML, and my code should work for every li I'm iterating through.
private Document getHtmlDocument() throws IOException {
document = Jsoup.connect(url).get();
return document;
}
public List<AliExpressItem> getAliExpressItemList() throws IOException {
Document document;
Element ul;
Elements ulLi;
document = getHtmlDocument();
ul = document.getElementById("hs-below-list-items");
ulLi = ul.getElementsByClass("list-item");
List<AliExpressItem> itemList = new ArrayList<>();
for(Element el : ulLi) {
AliExpressItem item = new AliExpressItem();
item.setImage(el.select(".item")
.select(".img")
.select(".pic")
.select(".picRind")
.select(".picCore")
.attr("src"));
item.setDescription(el.select(".item")
.select(".info")
.select("h3")
.select("a")
.text());
item.setPrice(el.select(".item")
.select(".info")
.select(".price")
.select(".value")
.text());
itemList.add(item);
}
return itemList;
}
There's a ul with 48 li's inside. The above code should work for all 48 li's.
<li qrdata="|32805326364|cn1511315262" pub-catid="200247142" sessionid="201711160635492248862329348280002056372" class="list-item list-item-first ">
<div class="item">
<div class="img img-border">
<div class="pic">
<a class="picRind history-item j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.1.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0" target="_blank" data-spm-anchor-id="2114.search0204.3.1"><img class="picCore pic-Core-v" src="//ae01.alicdn.com/kf/HTB1RUjgQFXXXXayXXXXq6xXFXXX4/Hot-Sale-Novelty-Toys-Hand-font-b-Spinner-b-font-Anti-stress-toys-fidget-font-b.jpg_220x220.jpg" alt="Hot Sale Novelty Toys Hand Spinner Anti stress toys fidget spinners For Autism and ADHD reliever stress spinner(China)"></a>
</div>
</div>
<div class="info">
<h3>
<a class="history-item product j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.2.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0" title="Hot Sale Novelty Toys Hand Spinner Anti stress toys fidget spinners For Autism and ADHD reliever stress spinner" target="_blank" data-spm-anchor-id="2114.search0204.3.2">Hot Sale Novelty Toys Hand <font><b>Spinner</b></font> Anti stress toys fidget <font><b>spinners</b></font> For Autism and ADHD reliever stress <font><b>spinner</b></font></a>
</h3>
<span class="price price-m">
<span class="value" itemprop="price">US $1.99</span>
<span class="separator">/</span>
<span class="unit">unidad</span>
</span>
<strong class="free-s">Envío gratis</strong>
<div class="rate-history">
<span rel="nofollow" class="order-num">
<a class="order-num-a j-p4plog" href="//www.aliexpress.com/item/Hot-Sale-Novelty-Toys-Hand-Spinner-Anti-stress-toys/32805326364.html?spm=2114.search0204.3.3.Lwk2KD&s=p&ws_ab_test=searchweb0_0,searchweb201602_5_10152_10065_10151_10344_10068_10130_10345_10324_10342_10547_10325_10343_10546_10340_10341_10548_10545_10541_10562_10084_10083_10307_5680011_10178_10060_10155_10154_10056_10055_10539_10312_10059_10313_10314_10534_10533_100031_10103_10073_10102_10594_10557_10558_10596_10142_10107,searchweb201603_14,ppcSwitch_5_ppcChannel&btsid=6350c066-2194-4756-b1f7-ed7e1b0028e1&rmStoreLevelAB=0#thf" rel="nofollow" target="_blank" data-spm-anchor-id="2114.search0204.3.3"><em title="Pedido totales"> Ventas (0)</em></a>
</span>
</div>
</div>
<div class="info-more">
<div class="aplus-sp-main">
<div class="sp-box">
</div>
</div>
<div class="store-name-chat">
<div class="store-name util-clearfix">
Alisa's cabin
</div>
</div>
<a class="score-dot" href="//www.aliexpress.com/store/feedback-score/1308215.html?spm=2114.search0204.3.5.Lwk2KD" rel="nofollow" data-spm-anchor-id="2114.search0204.3.5"><span class="score-icon-new score-level-22" id="score1" feedbackscore="1,276" sellerpositivefeedbackpercentage="93.7"></span></a>
<div class="add-to-wishlist">
<a class="atwl-button j-p4plog" href="javascript:;" data-product-id="32805326364" data-batman-id="ja2kvte8" data-spm-anchor-id="2114.search0204.3.6">Añadir a Lista Deseos</a>
</div>
<input class="atc-product-id" type="hidden" value="32805326364">
<input class="atc-product-standard" type="hidden" value="">
</div>
</div>

How can I extract information from HTML depending on the structure

I want to extract some data from many Xbox store links. The problem I am experiencing is that the structure of the section where the price is shown differs depending on, for example, whether the game is discounted.
The code I have written to scrape the price:
String urlPage = "https://www.microsoft.com/en-us/store/p/call-of-duty-advanced-warfare-gold-edition/c20hl06x0v8w" ;
System.out.println("Comprobando entradas de: "+urlPage);
if (getStatusConnectionCode(urlPage) == 200) {
Document document = getHtmlDocument(urlPage);
Elements entradas = document.select("div.m-product-detail-hero-product-placement div.price-info");
for (Element elem : entradas) {
String titulo = elem.getElementsByClass("srv_saleprice").text();
}
}else{
System.out.println("El Status Code no es OK es: "+getStatusConnectionCode(urlPage));
}
The HTML for a game that has no discount:
URL for first case
<div class="price-info">
<div class="c-price">
<div class="price-text srv_price">
<div class="ea-vault-message hidden x-hidden">
<div>
Available in The Vault
</div>
<div>
or
</div>
</div>
<span>$59.99</span>
<sup>+</sup>
</div>
<div class="srv_microdata" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="price" content="59.99">
<meta itemprop="priceCurrency" content="USD">
</div>
</div>
</div>
And for a game with discount:
URL for the second case
<div class="price-info">
<div class="c-price">
<div class="price-text srv_price">
<div class="ea-vault-message hidden x-hidden">
<div>
Available in The Vault
</div>
<div>
or
</div>
</div>
<s class="srv_saleprice" aria-label="Full price was $159.99">$159.99</s>
<span> </span>
<div class="price-disclaimer">
<span>$135.99</span>
<sup>+</sup>
</div>
<span> </span>
<span></span>
</div>
<div class="caption text-muted srv_countdown">
<span class="sub">save $24.00</span>
</div>
<div class="srv_microdata" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="price" content="135.99">
<meta itemprop="priceCurrency" content="USD">
</div>
</div>
</div>
In this second example the value inside elements is $135.99, which is not the game's base price ($159.99 in this case).
How could I extract only the base price for every game, with or without a discount?

Jsoup select li from ul

I want to select all children of the following ul with Jsoup.
<ul class="breadcrumbs xlarge-12 columns hide-for-small-only">
<li>
<a href="http://www.thalia.de/shop/home/show/">
Home
</a>
</li>
<li>
<a href="http://www.thalia.de/shop/buecher/show/">
Bücher
</a>
</li>
<li>
<a href="http://www.thalia.de/shop/fachbuecher-115/show/">
Fachbücher
</a>
</li>
<li>
<a href="http://www.thalia.de/shop/chemie-143/show/">
Chemie
</a>
</li>
</ul>
Here is my code, but this gives me only the first two elements. What am I doing wrong?
Elements category = doc.select("div.ncMain.productMainView");
Elements category2 = category.select("ul.breadcrumbs.xlarge-12.columns.hide-for-small-only");
Elements category3 = category2.select("ul li a");
String categoryString = category3.text();
