Extract alt text of all images from a web page in Java using jSoup - java

I want to extract the alt text of all images on a web page.
I have tried every approach I know. You can check the
code below.
Document doc = Jsoup.connect("https://www.amazon.com/gp/offer-listing/B003FYLW9Q/ref=olp_f_new?ie=UTF8&f_new=true")
.userAgent("Mozilla")
.timeout(50000)
.cookie("cookiename", "val234")
.cookie("anothercookie", "ilovejsoup")
.referrer("http://google.com")
.header("headersecurity", "xyz123")
.get();
// Method 1
Elements images = doc.select("img[src~=(?i)\\.(gif)]");
System.out.println(images.attr("alt"));
// Method 2
String imageAlt = doc.getElementsByClass("a-spacing-none olpSellerName").select("img").attr("alt");
System.out.println(imageAlt);
Right now this code does not work for the link passed to the connect method. It fails for some links, and it does not fetch the alt text of every image on the page.
But it does work for the links below:
https://www.amazon.com/gp/offer-listing/B06XWZWYVP/ref=olp_f_new?ie=UTF8&f_new=true
https://www.amazon.com/gp/offer-listing/B079JD7F7G/ref=olp_f_new?ie=UTF8&f_new=true
The class is the same for all the links, yet it fails for some of them. Can anyone please tell me how to solve this problem?
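One thing worth noting while debugging: `Elements.attr("alt")` returns the attribute of only the first matched element, so neither method above can print every image's alt text. A minimal sketch that iterates over all `img` elements instead (it parses a hard-coded HTML snippet rather than the Amazon URL, so it runs without a network connection):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AltTextDemo {
    // Collect the alt text of every <img> in an HTML string.
    static List<String> altTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> alts = new ArrayList<>();
        // Elements.attr("alt") would return only the FIRST match's value;
        // iterate to gather every image's alt text.
        for (Element img : doc.select("img")) {
            alts.add(img.attr("alt")); // empty string if the attribute is absent
        }
        return alts;
    }

    public static void main(String[] args) {
        // Stand-in for the page you would fetch via Jsoup.connect(...).get()
        String html = "<img src='a.gif' alt='first'>"
                    + "<img src='b.png' alt='second'>"
                    + "<img src='c.jpg'>"; // no alt attribute
        System.out.println(altTexts(html)); // → [first, second, ]
    }
}
```

This does not explain why some Amazon URLs return nothing (that is more likely bot detection or JS-rendered content), but it rules out the first-match pitfall on the parsing side.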

Related

How to web scrape a dynamic page in Android with JSOUP

I am trying to scrape the website savevideo.tube using JSOUP.
When you put a link in the search bar and click the search button, the website dynamically loads and shows some download links, which are what I want to scrape. My problem is how to load the page in JSOUP with the search link, without clicking the search button, and still get the results to scrape.
Is there any way to search for a link and load it without clicking any button, and get the results?
I tried this code but I'm not getting the required result.
val result:Document = Jsoup.connect(Constants.BASE_URL)
.data("url", Constants.YOUTUBE_LINK)
.data("sid", "9823478982349872384789273489238904790234")
.userAgent("Mozilla").post()
JSOUP is a static HTML parser; it cannot parse content that is loaded dynamically by JavaScript. For that, you have to use a web driver.
The best web drivers you can use are
HtmlUnit
JBrowserDriver
You can also use Selenium, but it may not be ideal for Android.
As mentioned by ʀᴀʜɪʟ, JSOUP is a static HTML parser only. If you want to scrape a website that uses JS-generated content, you should probably take a look at the skrape{it} library:
fun getDocumentByUrl(urlToScrape: String) = skrape(BrowserFetcher) { // <--- pass BrowserFetcher to include rendered JS
    request { url = urlToScrape }
    response { htmlDocument { this } }
}
fun main() {
    val document = getDocumentByUrl("https://example.com") // any URL you want to scrape
    // do stuff with the document
}

How to set a hidden element/text/id behind visible text in a PDF document

I am adding internal links in a PDF document using the PDFBox library. For that, I compare strings, get the coordinates of the matching string in the PDF, and add a link over it. This is working fine, and here is the link.
Now I have two identical strings, and I want them to redirect to two different pages, like the following:
Link 1 -> redirects to Page 3
Link 1 -> redirects to Page 5
Is there any way, while creating the PDF, to add some hidden element/text/id behind "Link 1" from which I can identify which page I should redirect to?
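One possible approach (a sketch, not from the question's code, using the PDFBox 2.x API) is to write an invisible marker string next to each visible "Link 1" with text rendering mode NEITHER: the glyphs go into the content stream, so coordinate/text search finds them, but they are never painted. The marker name `#goto-page-3` here is purely hypothetical.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;
import org.apache.pdfbox.text.PDFTextStripper;

public class HiddenMarkerDemo {
    // Build a one-page PDF whose visible text is "Link 1" but which also
    // carries an invisible marker string, then return the extracted text.
    static String buildAndExtract(String marker) throws Exception {
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage();
            doc.addPage(page);
            try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                cs.beginText();
                cs.setFont(PDType1Font.HELVETICA, 10);
                cs.newLineAtOffset(50, 700);
                cs.showText("Link 1"); // the visible link text
                // Rendering mode NEITHER: glyphs are in the content stream
                // (so text extraction finds them) but are not painted.
                cs.setRenderingMode(RenderingMode.NEITHER);
                cs.showText(marker); // hypothetical hidden marker
                cs.endText();
            }
            return new PDFTextStripper().getText(doc);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildAndExtract("#goto-page-3").contains("#goto-page-3"));
    }
}
```

When you later scan the PDF for link targets, each identical "Link 1" is followed by a distinct invisible marker, so the two occurrences can be told apart.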

Get site scripts with Jsoup

I'm trying to get a site's scripts using Jsoup.connect(url).get().html(),
but the script I want doesn't appear. Does anyone know how I can get this script?
Script I want to get
It doesn't appear in the source because that video is inside an iframe. The iframe has its own src attribute (visible in your screenshot). Try fetching that page instead.
EDIT:
Get the first page and parse it. Then select the iframe's src, and once you have the second URL, do the same again: fetch that page and parse it:
String iframeUrl = Jsoup.connect(url).get().selectFirst("#option-1 iframe").attr("src");
System.out.println(iframeUrl);
Document document = Jsoup.connect(iframeUrl).get();
System.out.println(document.html());

web scraping jsoup java unable to scrape full information

There is information I need to scrape from a website. I can scrape it, but not all of the information comes through; a lot of data is lost. The following images help to explain further:
I used Jsoup, connected it to URL and then extracted this particular data using the following code :
Document doc = Jsoup.connect("https://www.awattar.com/tariffs/hourly#").userAgent("Mozilla/17.0").get();
Elements durationCycle = doc.select("g.x.axis g.tick text");
But in the result I couldn't find any of the related information at all, so I printed the whole document fetched from the URL, and it shows the following:
I can see the information when I download the page and read it as an input file, but not when I connect directly to the URL. However, I want to connect to the URL. Is there any suggestion?
I hope my question is understandable; let me know if it is not.
There is a response body size limit in Jsoup. You should use the maxBodySize parameter:
Document doc = Jsoup.connect("https://www.awattar.com/tariffs/hourly#").userAgent("Mozilla/17.0").maxBodySize(0).get();
"0" is no limit.

Parsing shopping websites using jsoup

I have the following code:
doc = Jsoup.connect("http://www.amazon.com/gp/goldbox").userAgent("Mozilla").timeout(5000).get();
Elements hrefs = doc.select("div.a-row.layer");
System.out.println("Results:"+ hrefs); //I am trying to print out contents but not able to see the output.
Problem: I want to display all image src attributes inside the div with class "a-row layer", but I am unable to see any output.
What is the mistake in my query?
I have taken a look at the website and tested it myself. The issue is that the piece of HTML you want to extract (div.a-row.layer) is generated by JavaScript.
Jsoup does not execute JavaScript and cannot parse content generated by it. You would need a headless web browser, such as HtmlUnit, to deal with this.
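A sketch of that approach (HtmlUnit 2.x package names; the Amazon URL and selector come from the question, and the 5-second wait for background JavaScript is an arbitrary choice): fetch the page with JavaScript enabled, then hand the rendered markup back to Jsoup so the familiar selectors keep working.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RenderedFetch {
    // Fetch a page with JavaScript enabled and return the rendered DOM,
    // re-parsed by Jsoup.
    static Document renderedDom(String url) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = client.getPage(url);
            client.waitForBackgroundJavaScript(5_000); // let async scripts finish
            return Jsoup.parse(page.asXml());
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = renderedDom("http://www.amazon.com/gp/goldbox");
        // Count images inside the JS-generated block the question was after.
        System.out.println(doc.select("div.a-row.layer img").size());
    }
}
```

Note that heavily defended sites like Amazon may still block headless browsers; the technique itself is what the snippet demonstrates.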
