How to parse 'div' without name? - java

Using Jsoup:
Element movie_div = doc.select("div.movie").first();
I have HTML like this:
<div class="movie">
<div>
<div>
<strong>Year:</strong> 2014
</div>
<div>
<strong>Country:</strong> USA
</div>
</div>
</div>
How can I use jsoup to extract the country and the year?
For the example html I want the extracted values to be "2014" and "USA".
Thanks.

Use
Element e = doc.select("div.movie").first().child(0);
List<TextNode> textNodes = e.child(0).textNodes();
String year = textNodes.get(textNodes.size()-1).text().trim();
textNodes = e.child(1).textNodes();
String country = textNodes.get(textNodes.size()-1).text().trim();

Did you try something like:
Element movie_div = doc.select("div.movie strong").first();
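And to get the text value, call text() on it (note that for the <strong> element this returns the label, e.g. "Year:", not the value that follows it):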
movie_div.text();
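
If the layout always puts the value right after its <strong> label, another option is to read the text node that follows each label. This is only a sketch under that assumption; the class name MovieFields and the helper extractFields are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.LinkedHashMap;
import java.util.Map;

class MovieFields {
    // Hypothetical helper: maps each <strong> label (without the colon)
    // to the text node that immediately follows it.
    static Map<String, String> extractFields(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();
        for (Element label : doc.select("div.movie strong")) {
            Node next = label.nextSibling();
            // Skip labels that are not directly followed by plain text.
            if (next instanceof TextNode) {
                String key = label.text().replace(":", "").trim();
                fields.put(key, ((TextNode) next).text().trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<div class=\"movie\"><div>"
                + "<div><strong>Year:</strong> 2014</div>"
                + "<div><strong>Country:</strong> USA</div>"
                + "</div></div>";
        System.out.println(extractFields(html));
    }
}
```

This avoids depending on the position of the inner divs, so it should keep working if more labelled rows are added.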

Related

Jsoup CSS selector "not" does not return anything

I'm trying to ignore one item and not parse it with Jsoup, but the CSS "not" selector is not working.
I don't understand what is wrong.
My code:
MangaList list = new MangaList();
Document document = getPage("https://3asq.org/");
MangaInfo manga;
for (Element o : document.select("div.page-item-detail:not(.item-thumb#manga-item-5520)")) {
    manga = new MangaInfo();
    manga.name = o.select("h3").first().select("a").last().text();
    manga.path = o.select("a").first().attr("href");
    try {
        manga.preview = o.select("img").first().attr("src");
    } catch (Exception e) {
        manga.preview = "";
    }
    list.add(manga);
}
return list;
HTML code:
<div class="col-12 col-md-6 badge-pos-1">
<div class="page-item-detail manga">
<div id="manga-item-5520" class="item-thumb hover-details c-image-hover" data-post-id="5520">
<a href="https://3asq.org/manga/gosu/" title="Gosu">
<img width="110" height="150" src="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg" srcset="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg 110w, https://3asq.org/wp-content/uploads/2020/03/IMG_4497-175x238.jpg 175w" sizes="(max-width: 110px) 100vw, 110px" class="img-responsive" style="" alt="IMG_4497"/> </a>
</div>
<div class="item-summary">
<div class="post-title font-title">
<h3 class="h5">
<span class="manga-title-badges custom noal-manga">Noal-Manga</span> Gosu
</h3>
If I debug your code and extract the HTML for:
System.out.println(document.select("div.page-item-detail").get(0));
(Hint: use the expression evaluator in IntelliJ IDEA, Alt+F8, for in-session, real-time debugging.)
I get:
<div class="page-item-detail manga">
<div id="manga-item-2003" class="item-thumb hover-details c-image-hover" data-post-id="2003">
<a href="http...
...
</div>
</div>
</div>
It looks like you want to extract the next div tag down with class containing item-thumb ... but only if the id isn't manga-item-5520.
So here's what I did to remove that one item:
document.select("div.page-item-detail div[class*=item-thumb][id!=manga-item-5520]")
Result size: 19
With the element included:
document.select("div.page-item-detail div[class*=item-thumb]")
Result size: 20
You can also try the following if you want to remain based at the outer div tag rather than the inner div tag.
document.select("div.page-item-detail:has(div[class*=item-thumb][id!=manga-item-5520])")
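
A note on why the original selector excluded nothing: in div.page-item-detail:not(.item-thumb#manga-item-5520), the :not(...) part is tested against the div.page-item-detail element itself, and that element is never the thumbnail div, so every item passes. One more way to stay on the outer div is to combine :not with :has. The following is a minimal sketch on a cut-down copy of the markup; the class SkipOneItem, the helper linksExcluding and the example.com URLs are placeholders for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SkipOneItem {
    // Hypothetical helper: returns the first link of every item, skipping
    // the item that contains an element with the given id.
    static List<String> linksExcluding(String html, String thumbId) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        // :has() looks inside each item, :not() drops the items that match.
        String query = "div.page-item-detail:not(:has(#" + thumbId + "))";
        for (Element item : doc.select(query)) {
            links.add(item.select("a").attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"page-item-detail manga\">"
          + "<div id=\"manga-item-5520\" class=\"item-thumb\">"
          + "<a href=\"https://example.com/skip\">Skip</a></div></div>"
          + "<div class=\"page-item-detail manga\">"
          + "<div id=\"manga-item-2003\" class=\"item-thumb\">"
          + "<a href=\"https://example.com/keep\">Keep</a></div></div>";
        System.out.println(linksExcluding(html, "manga-item-5520"));
    }
}
```

Unlike the :has(...[id!=...]) form, this one also keeps items that have no thumbnail div at all, which may or may not be what you want.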

How to modify the html file content in jsoup?

I have an HTML file like the one below:
<div id="test"> <u>s</u> </div>
I want to modify it like this using Java:
<div id="test"> <b>Test</b> </div>
Is this possible in jsoup?
Yes, it is possible:
Element el = doc.select("div#test").first();
for (Element elC : el.children()) {
    elC.remove();
}
Element nel = el.appendElement("b");
nel.text("Test");
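
Note that children() only returns child elements, so the loop above leaves any text nodes (here, the whitespace around <u>) in place. If you want to drop everything inside the div first, empty() may be simpler. A minimal sketch, where the class ReplaceContent and the helper replaceWithBold are names invented for this example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ReplaceContent {
    // Hypothetical helper: replaces everything inside the element with the
    // given id by a single <b> element holding the given text.
    static Element replaceWithBold(Document doc, String id, String text) {
        Element el = doc.getElementById(id);
        el.empty();                       // removes child elements and text nodes
        el.appendElement("b").text(text); // text() escapes the value for you
        return el;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div id=\"test\"> <u>s</u> </div>");
        Element el = replaceWithBold(doc, "test", "Test");
        System.out.println(el.select("b").text());
    }
}
```

The sketch assumes an element with that id exists; getElementById returns null otherwise, so real code should check for that.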

How to retrieve an element from Elements in Java from HTML

If I have code like this:
Elements e = d.select("div[id=result_52]");
System.out.println("elemeeeeeee" + e);
the output of e is as follows:
elemeeeeeee<imagebox id="result_52" class="rsltGrid prod celwidget" name="B00BF9MZ44">
<div class="linePlaceholder"></div>
<div class="image imageContainer">
<a href="http://www.abcdefg.com/VIZIO-E241i-A1-24-Inch-1080p-tilted/dp/B00BF9MZ44/ref=lp_6459736011_1_53/190-4904523-2326018?s=tv&ie=UTF8&qid=1405599829&sr=1-53">
<div class="imageBox">
<img src="http://ecx.images-abcdefg.com/images/I/51PhLnnk7NL._AA160_.jpg" class="productImage cfMarker" alt="Product Details" />
</div>
I want both URLs that appear inside it.
You can use # to select by a specific id.
Element e = d.select("div#result_52").get(0);
String firstURL = e.select("a").attr("href"); //select the `a` tag and the `href` attribute.
String secondURL = e.select("img").attr("src");

How to find elements whose sibling index is less than x and greater than y

I have an Element eNews. After finding the indexes with a CSS query, I need to select the sibling elements whose index is greater than x and less than y:
Elements lines = eNews.select("div.clear");
int x = lines.get(0).elementSiblingIndex();
int y = lines.get(1).elementSiblingIndex();
Elements tNews = eNews.getElementsByIndexGreaterThan(x)
?AND?
eNews.getElementsByIndexLessThan(y)
This is some sample code. I want to extract the text from the HTML tags between the first and second <div class="clear"></div>:
<div class="aktualnosci">
<div class="zd">
<a href="/Data/Thumbs/ODAweDYwMA,dsc_0458.jpg" title="" rel="lightbox">
<img src="/Data/Thumbs/dsc_0458.jpg"/>
</a>
<p class="show"></p>
</div>
<h3>Awanse</h3>
<div class="data">
<img alt="" src="/Themes/kalendarz-ico.gif">
2013-11-18 12:26
</div>
<!--Start tag-->
<div class="clear"></div>
<!--Tags to extract-->
<p class="gr">W związku z Narodowym Świętem Niepodległości ....</p>
<p style="text-align: justify">W zeszły p....</p>
<p style="text-align: justify">OISW Kraków</p>
<!--End tag-->
<div class="clear"></div>
<div class="slider">
<span class="slide-left"></span>
<span class="slide-right"></span>
</div>
</div>
You can use a selector like div.clear ~ :gt(1):lt(4)
E.g.:
Elements tNews = eNews.select("div.clear ~ :gt(1):lt(4)");
See this example and the selector docs. (It's a bit hard to validate this does what you're trying to achieve without knowing your input HTML and the data you're trying to extract.)
Update based on your edit: there are a couple ways to do this if you can't know the indexes in advance. Below I get the first div, then accumulate sibling elements until we hit the next div.clear. (I'll have a think if I can generify this pattern and add it to jsoup.)
Document doc = Jsoup.parse(h);
Element firstDiv = doc.select("div.clear").first();
Elements news = new Elements();
Element item = firstDiv.nextElementSibling();
while (item != null && !(item.tagName().equals("div") && item.className().equals("clear"))) {
    news.add(item);
    item = item.nextElementSibling();
}
System.out.println(String.format("Found %s items", news.size()));
for (Element element : news) {
    System.out.println(element.text());
}
Outputs:
Found 3 items
W związku z Narodowym Świętem Niepodległości ....
W zeszły p....
OISW Kraków

How can I make a filter on the content of a div using HTML Parser in Java

I'm trying to parse HTML string using htmlparser library.
The html is like this:
<body>
<div class="Level1">
<div class="row">
<div class="txt">
Date of analysis:
</div><div class="content">
02/03/11
</div>
</div>
</div><div class="Level1">
<div class="row">
<div class="txt">
Site:
</div><div class="content">
13.0E
</div>
</div>
</div><div class="Level1">
<div class="row">
<div class="txt">
Network type:
</div><div class="content">
DVB-S
</div>
</div>
</div>
</body>
I need to extract the "content" value for a given "txt". I have made a filter that returns the divs with class="Level1", but I don't know how to filter on the content of a div. For example, if the value of txt is "Site:", I want to read the matching content, 13.0E.
NodeList nl = parser.extractAllNodesThatMatch(new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("class", "Level1")));
Can someone help me with this issue? How do I read a div inside a div?
Thanks!!
NodeList nl = parser.extractAllNodesThatMatch(new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("class", "Level1")));
It is better to do it like this:
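NodeList nl = parser.parse(null); // you can also filter here
// Pass true to search nested nodes too; the "txt" divs are not top-level.
NodeList divs = nl.extractAllNodesThatMatch(
    new AndFilter(new TagNameFilter("DIV"),
        new HasAttributeFilter("class", "txt")), true);
if (divs.size() > 0) {
    Node div = divs.elementAt(0); // elementAt returns an org.htmlparser.Node
    // toPlainTextString() gives the text inside the div;
    // getText() on a tag returns the tag's own markup instead.
    String text = div.toPlainTextString().trim();
}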
