How to modify the html file content in jsoup? - java

I have an HTML file like the one below:
<div id ="test"> <u>s</u> </div>
I want to modify it like this using Java:
<div id ="test"> <b>Test</b> </div>
Is it possible in jsoup?

It is possible:
Element el = doc.select("div#test").first();
// children() returns a copy of the child list, so removing inside the loop is safe
for (Element elC : el.children()) {
    elC.remove();
}
Element nel = el.appendElement("b");
nel.text("Test");
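A self-contained variant of the same idea, offered as a sketch rather than the only way to do it: it uses empty(), which also drops the leftover whitespace text nodes, instead of the loop. The class and method names (ReplaceDivContent, replaceContent) are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ReplaceDivContent {
    // Replaces everything inside div#test with <b>Test</b> and returns the div's HTML.
    // Returns null when the input has no div with id "test".
    static String replaceContent(String html) {
        Document doc = Jsoup.parse(html);
        Element el = doc.selectFirst("div#test");
        if (el == null) {
            return null;
        }
        el.empty();                          // removes child elements and text nodes
        el.appendElement("b").text("Test");  // adds <b>Test</b> as the only child
        return el.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(replaceContent("<div id=\"test\"> <u>s</u> </div>"));
    }
}
```

To write the result back to a file, doc.outerHtml() gives the whole modified document as a string.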

Related

Jsoup CSS selector ":not" does not return anything

I'm trying to skip one item so that Jsoup does not parse it, but the CSS ":not" selector is not working.
I don't understand what is wrong.
My code:
MangaList list = new MangaList();
Document document = getPage("https://3asq.org/");
MangaInfo manga;
for (Element o : document.select("div.page-item-detail:not(.item-thumb#manga-item-5520)")) {
    manga = new MangaInfo();
    manga.name = o.select("h3").first().select("a").last().text();
    manga.path = o.select("a").first().attr("href");
    try {
        manga.preview = o.select("img").first().attr("src");
    } catch (Exception e) {
        manga.preview = "";
    }
    list.add(manga);
}
return list;
HTML code:
<div class="col-12 col-md-6 badge-pos-1">
<div class="page-item-detail manga">
<div id="manga-item-5520" class="item-thumb hover-details c-image-hover" data-post-id="5520">
<a href="https://3asq.org/manga/gosu/" title="Gosu">
<img width="110" height="150" src="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg" srcset="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg 110w, https://3asq.org/wp-content/uploads/2020/03/IMG_4497-175x238.jpg 175w" sizes="(max-width: 110px) 100vw, 110px" class="img-responsive" style="" alt="IMG_4497"/> </a>
</div>
<div class="item-summary">
<div class="post-title font-title">
<h3 class="h5">
<span class="manga-title-badges custom noal-manga">Noal-Manga</span> Gosu
</h3>
If I debug your code and print the first match with:
System.out.println(document.select("div.page-item-detail").get(0))
(hint: use the expression evaluator in IntelliJ IDEA, Alt+F8, for in-session, real-time debugging)
I get:
<div class="page-item-detail manga">
<div id="manga-item-2003" class="item-thumb hover-details c-image-hover" data-post-id="2003">
<a href="http...
...
</div>
</div>
</div>
It looks like you want to extract the next div tag down with class containing item-thumb ... but only if the id isn't manga-item-5520.
So here's what I did to exclude that one item:
document.select("div.page-item-detail div[class*=item-thumb][id!=manga-item-5520]")
Result size: 19
With the element included:
document.select("div.page-item-detail div[class*=item-thumb]")
Result size: 20
You can also try the following if you want to remain based at the outer div tag rather than the inner div tag.
document.select("div.page-item-detail:has(div[class*=item-thumb][id!=manga-item-5520])")
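One likely reason the original selector does not exclude the item: :not(...) is tested against div.page-item-detail itself, and the id manga-item-5520 sits on the inner div. The sketch below illustrates this on a hypothetical, trimmed-down copy of the markup (the HTML string and the class name NotVersusHas are assumptions, not the real page).

```java
import org.jsoup.Jsoup;

public class NotVersusHas {
    // Hypothetical, trimmed-down markup: two list items, one of which should be skipped.
    static final String HTML =
        "<div class=\"page-item-detail manga\">"
      + "<div id=\"manga-item-5520\" class=\"item-thumb hover-details\"></div></div>"
      + "<div class=\"page-item-detail manga\">"
      + "<div id=\"manga-item-2003\" class=\"item-thumb hover-details\"></div></div>";

    // Number of elements in the sample markup that match the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        // :not is checked on the outer div, which never has that class and id,
        // so both outer divs still match.
        System.out.println(count("div.page-item-detail:not(.item-thumb#manga-item-5520)"));
        // :has looks at descendants, so the outer div holding item 5520 is dropped.
        System.out.println(count("div.page-item-detail:has(div.item-thumb:not(#manga-item-5520))"));
    }
}
```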

How to access inner divs with the same class name but different data-id values in HTML using Jsoup on Android

I'm trying to parse data from HTML. I need to get all the names from the inner divs with class=vacancy-item, each of which has a different data-id.
Please see the HTML code below:
<section class="home-vacancies" id="vacancy_wrapper">
<div class="home-block-title">job openings</div>
<div class="vacancy-filter">
...................
</div>
<div class="vacancy-wrapper">
<div class="vacancy-item" data-id="9120">
..............
</div>
<div class="vacancy-item" data-id="9119">
..................
</div>
<div class="vacancy-item" data-id="9118">
................................
</div>
<div class="vacancy-item" data-id="9117">
.............................
</div>
Here is my code. Please help.
doc = Jsoup.connect("URL").get();
//title = doc.select(".page-content div:eq(3)");
title = doc.getElementsByClass("div[class=vacancy-wrapper]");
titleList.clear();
for (Element titles : title) {
    String text = titles.getElementsByClass("vacancy-item").text();
    titleList.add(text);
}
Thanks!
getElementsByClass takes only a class name, not a CSS selector, so getElementsByClass("vacancy-wrapper") would work.
You will also need a second loop to get each vacancy-item's text as a separate element:
Elements title = doc.getElementsByClass("vacancy-wrapper");
for (Element titles : title) {
    Elements items = titles.getElementsByClass("vacancy-item");
    for (Element item : items) {
        String text = item.text();
        // process text
    }
}
Another option would be to use Jsoup's select method:
Elements es = doc.select("div.vacancy-wrapper div.vacancy-item");
for (Element vi : es) {
    String text = vi.text();
    // process text
}
This selects all div elements with the class vacancy-item that are descendants of a div with the class vacancy-wrapper.
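For reference, a compact runnable sketch of the select approach that collects the texts into a list with Elements.eachText(). The sample markup and the names VacancyTexts, SAMPLE and texts are assumptions for illustration; the real page content was elided in the question.

```java
import java.util.List;
import org.jsoup.Jsoup;

public class VacancyTexts {
    // Hypothetical sample; the real vacancy-item contents are not shown in the question.
    static final String SAMPLE =
        "<div class=\"vacancy-wrapper\">"
      + "<div class=\"vacancy-item\" data-id=\"9120\">Java developer</div>"
      + "<div class=\"vacancy-item\" data-id=\"9119\">QA engineer</div>"
      + "</div>";

    // Returns the text of every vacancy-item found under a vacancy-wrapper.
    static List<String> texts(String html) {
        return Jsoup.parse(html)
                .select("div.vacancy-wrapper div.vacancy-item")
                .eachText();
    }

    public static void main(String[] args) {
        for (String text : texts(SAMPLE)) {
            System.out.println(text);
        }
    }
}
```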

How to parse 'div' without name?

Using Jsoup:
Element movie_div = doc.select("div.movie").first();
I got the following HTML:
<div class="movie">
<div>
<div>
<strong>Year:</strong> 2014
</div>
<div>
<strong>Country:</strong> USA
</div>
</div>
</div>
How can I use jsoup to extract the country and the year?
For the example html I want the extracted values to be "2014" and "USA".
Thanks.
Use the text nodes of each inner div:
Element e = doc.select("div.movie").first().child(0);
List<TextNode> textNodes = e.child(0).textNodes();
String year = textNodes.get(textNodes.size()-1).text().trim();
textNodes = e.child(1).textNodes();
String country = textNodes.get(textNodes.size()-1).text().trim();
Did you try something like:
Element movie_div = doc.select("div.movie strong").first();
And to get the text value, try:
movie_div.text();
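Note that text() on a strong element returns the label itself ("Year:"), while the value is the text that follows it inside the same div. A minimal sketch that looks a value up by its label, assuming the markup from the question; MovieFields and valueFor are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MovieFields {
    // Markup copied from the question.
    static final String SAMPLE =
        "<div class=\"movie\"><div>"
      + "<div><strong>Year:</strong> 2014</div>"
      + "<div><strong>Country:</strong> USA</div>"
      + "</div></div>";

    // Returns the text that sits next to the given <strong> label, or null if not found.
    static String valueFor(String html, String label) {
        Document doc = Jsoup.parse(html);
        for (Element strong : doc.select("div.movie strong")) {
            if (strong.text().trim().equals(label)) {
                // ownText() is the parent's direct text only, without the label's text
                return strong.parent().ownText().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(valueFor(SAMPLE, "Year:"));
        System.out.println(valueFor(SAMPLE, "Country:"));
    }
}
```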

How to retrieve an element from Elements in Java from HTML

If I have code like this:
Elements e = d.select("div[id=result_52]");
System.out.println("elemeeeeeee" + e);
the output of e is as below:
elemeeeeeee<imagebox id="result_52" class="rsltGrid prod celwidget" name="B00BF9MZ44">
<div class="linePlaceholder"></div>
<div class="image imageContainer">
<a href="http://www.abcdefg.com/VIZIO-E241i-A1-24-Inch-1080p-tilted/dp/B00BF9MZ44/ref=lp_6459736011_1_53/190-4904523-2326018?s=tv&ie=UTF8&qid=1405599829&sr=1-53">
<div class="imageBox">
<img src="http://ecx.images-abcdefg.com/images/I/51PhLnnk7NL._AA160_.jpg" class="productImage cfMarker" alt="Product Details" />
</div>
I want both URLs that appear inside it (the link's href and the image's src).
You can use # to select by a specific id:
Element e = d.select("div#result_52").get(0);
String firstURL = e.select("a").attr("href"); //select the `a` tag and the `href` attribute.
String secondURL = e.select("img").attr("src");
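Keep in mind that get(0) throws if nothing matches. A null-safe sketch of the same lookup, run on a hypothetical shortened sample (the example.com URLs and the names ResultUrls and urlsFor are placeholders, not from the original page):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ResultUrls {
    // Hypothetical, shortened sample with placeholder URLs.
    static final String SAMPLE =
        "<div id=\"result_52\" class=\"rsltGrid prod\">"
      + "<div class=\"image imageContainer\">"
      + "<a href=\"http://example.com/item\">"
      + "<div class=\"imageBox\"><img src=\"http://example.com/item.jpg\" alt=\"Product Details\"></div>"
      + "</a></div></div>";

    // Returns {href, src} for the element with the given id, or null if there is no such element.
    static String[] urlsFor(String html, String id) {
        Element box = Jsoup.parse(html).getElementById(id);
        if (box == null) {
            return null;
        }
        Element link = box.selectFirst("a[href]");
        Element img = box.selectFirst("img[src]");
        return new String[] {
            link == null ? "" : link.attr("href"),
            img == null ? "" : img.attr("src")
        };
    }

    public static void main(String[] args) {
        String[] urls = urlsFor(SAMPLE, "result_52");
        System.out.println(urls[0]);
        System.out.println(urls[1]);
    }
}
```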

How can I filter by the content of a div using an HTML parser in Java

I'm trying to parse HTML string using htmlparser library.
The html is like this:
<body>
<div class="Level1">
<div class="row">
<div class="txt">
Date of analysis:
</div><div class="content">
02/03/11
</div>
</div>
</div><div class="Level1">
<div class="row">
<div class="txt">
Site:
</div><div class="content">
13.0E
</div>
</div>
</div><div class="Level1">
<div class="row">
<div class="txt">
Network type:
</div><div class="content">
DVB-S
</div>
</div>
</div>
</body>
I need to extract the "content" value for a given "txt". I have made a filter that returns the divs with class="Level1", but I don't know how to filter on the content of a div. For example, if the value of txt is "Site:", I want to read the content "13.0E".
NodeList nl = parser.extractAllNodesThatMatch(new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("class", "Level1")));
Can someone help me with this issue? How do I read a div inside a div?
Thanks!!
Instead of:
NodeList nl = parser.extractAllNodesThatMatch(new AndFilter(new TagNameFilter("div"), new HasAttributeFilter("class", "Level1")));
it is better to do it like this:
NodeList nl = parser.parse(null); // you can also filter here
// pass true to search nested nodes too; the one-argument overload only checks the top level
NodeList divs = nl.extractAllNodesThatMatch(
    new AndFilter(new TagNameFilter("DIV"),
                  new HasAttributeFilter("class", "txt")),
    true);
if (divs.size() > 0) {
    Node div = divs.elementAt(0); // elementAt returns a Node
    String text = div.toPlainTextString().trim(); // this is the text inside the div
}
