How can I programmatically get the URLs of all pages in a website?

I would like to do this in Java, or with Selenium WebDriver if that is possible. I don't want just the links present on a single page. I want a result like the one https://www.xml-sitemaps.com/ gives: a list of all page URLs in a domain. I don't need a tree or XML; a plain list of URLs will do.

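You can find the anchor elements (the a tag) on a page, read the href attribute of each, and store the links in a list. Note that href is an attribute, not a tag, so By.tagName("href") matches nothing. This gives you only the links on the current page; to cover the whole site, repeat the step for every same-domain link you discover.
List<WebElement> anchors = driver.findElements(By.cssSelector("a[href]"));
// anchor.getAttribute("href") then returns the URL of each link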
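The "repeat for every link" idea amounts to a breadth-first crawl restricted to one host. Below is a minimal sketch using only the standard library. The class name SiteCrawler, the maxPages limit, and the linkExtractor callback are hypothetical names introduced for illustration. The callback stands in for whatever fetches a page and returns its absolute link URLs (for example the Selenium call above, or jsoup). The in-memory map in main is made-up demo data, not a real site.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class SiteCrawler {

    /**
     * Breadth-first crawl that stays on the host of startUrl.
     * linkExtractor (hypothetical) returns the absolute link URLs found on a page.
     * maxPages is a safety limit so the crawl always terminates.
     */
    static Set<String> crawl(String startUrl,
                             Function<String, List<String>> linkExtractor,
                             int maxPages) {
        String start = normalize(startUrl);
        if (start == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + startUrl);
        }
        String host = URI.create(start).getHost();
        Set<String> visited = new LinkedHashSet<>(); // keeps discovery order
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);

        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // this page was already processed
            }
            for (String link : linkExtractor.apply(url)) {
                String clean = normalize(link);
                if (clean != null
                        && host.equalsIgnoreCase(URI.create(clean).getHost())
                        && !visited.contains(clean)) {
                    queue.add(clean);
                }
            }
        }
        return visited;
    }

    /** Drops the #fragment; returns null for malformed or non-absolute links. */
    static String normalize(String link) {
        int hash = link.indexOf('#');
        String noFragment = (hash >= 0 ? link.substring(0, hash) : link).trim();
        try {
            URI uri = new URI(noFragment);
            return uri.getHost() == null ? null : noFragment;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Made-up link graph standing in for real page fetches.
        Map<String, List<String>> fakeSite = Map.of(
            "https://example.com/", List.of("https://example.com/about"),
            "https://example.com/about", List.of("https://example.com/"));
        Set<String> pages = crawl("https://example.com/",
            url -> fakeSite.getOrDefault(url, List.of()), 50);
        pages.forEach(System.out::println);
    }
}
```

Passing the extractor as a function keeps the traversal independent of how pages are fetched, so the same loop can sit on top of Selenium, jsoup, or a test stub. A real crawl would likely also need politeness delays and robots.txt handling, which this sketch leaves out.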

Related

Extract the list of URLs obtained during a HTML page render in Java

I want to be able to get the list of all URLs that a browser will do a GET request for when we try to open a page. For example, if we try to open cnn.com, there are multiple URLs within the first HTTP response which the browser recursively requests for.
I'm not trying to render a page, but I'm trying to obtain a list of all the URLs that are requested when a page is rendered. Doing a simple scan of the HTTP response content wouldn't be sufficient, as there could potentially be images in the CSS which are downloaded. Is there any way I can do this in Java?
My question is similar to this question, but I want to write this in Java.

You can use the jsoup library to extract all the links from a web page, e.g.:
Document document = Jsoup.connect("http://google.com").get();
Elements links = document.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("href"));
}
Here's the documentation.
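The jsoup snippet lists anchor links only, while the question also mentions images referenced from CSS. As a rough sketch of that part, the code below pulls url(...) references out of stylesheet text with a regular expression and resolves them against the stylesheet's own URL. The class name CssUrlExtractor and the sample CSS and URLs are made up for illustration. A regex is an approximation rather than a full CSS parser, and this does not capture requests triggered by JavaScript.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssUrlExtractor {

    // Matches url(...) with optional single or double quotes around the value.
    private static final Pattern CSS_URL =
        Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the absolute URLs referenced by url(...) in the given CSS text.
     * stylesheetUrl is the address the CSS was loaded from; relative
     * references are resolved against it.
     */
    static List<String> extract(String css, String stylesheetUrl) {
        URI base = URI.create(stylesheetUrl);
        List<String> urls = new ArrayList<>();
        Matcher matcher = CSS_URL.matcher(css);
        while (matcher.find()) {
            String raw = matcher.group(2).trim();
            if (raw.isEmpty() || raw.startsWith("data:")) {
                continue; // inline data URIs trigger no request
            }
            try {
                urls.add(base.resolve(raw).toString());
            } catch (IllegalArgumentException e) {
                // malformed reference: skip it rather than abort the scan
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up stylesheet text and location.
        String css = "body { background: url(\"../img/bg.png\"); }"
                   + ".logo { background-image: url(logo.jpg); }";
        for (String url : extract(css, "https://example.com/css/site.css")) {
            System.out.println(url);
        }
    }
}
```

Combined with the img, script, and link elements selected from the HTML, this gets closer to the set of URLs a browser would request, though only a real or headless browser can report the exact list.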

Jsoup with a plugin

I'm using Jsoup to scrape some online data from different stores, but I'm having trouble figuring out how to programmatically replicate what I do as a user. To get the data manually (after logging in), a user must select a store from a tree that pops up.
As best I can tell, the tree is not hard-coded into the site but is built interactively when your computer interacts with the server. When you look for the table in "view page source," there are no entries. When I inspect the tree, I do see the HTML and it seems to come from the "FancyTree" plugin.
As best as I can tell from tracking my activity on Developer Tools -- Network, the next step is a "GET" request which doesn't change the URL, so I'm not sure how my store selection is being transferred.
Any advice on how to get Jsoup or Java generally to programmatically interact with this table would be extremely helpful, thank you!

Jsoup can only parse the HTML source returned by the server, not the DOM as modified by JavaScript. To parse the DOM, you'll need to render the page with something like HtmlUnit. Then you can parse the generated HTML with Jsoup.
// load page using HtmlUnit and fire scripts
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myURL);
// convert page to generated HTML and parse it into a Jsoup document
Document doc = Jsoup.parse(myPage.asXml());
// do something with html content
System.out.println(doc.html());
// clean up resources
webClient.close();
See Parsing Javascript Generated Page with Jsoup.

Parsing shopping websites using jsoup

I have the following code:
doc = Jsoup.connect("http://www.amazon.com/gp/goldbox").userAgent("Mozilla").timeout(5000).get();
Elements hrefs = doc.select("div.a-row.layer");
System.out.println("Results:"+ hrefs); //I am trying to print out contents but not able to see the output.
Problem: I want to display the src of every image within the div with class name "a-row layer", but I am unable to see any output.
What is wrong with my query?

I have taken a look at the website and tested it myself. The issue seems to be that the piece of HTML you want to extract (div.a-row.layer) is generated by JavaScript.
Jsoup does not execute JavaScript and cannot parse content generated by it. You would need a headless web browser, such as HtmlUnit, to deal with this.

I am using jsoup to pull images from a website URL, but I want the page to load first. Is there any way to do this?

The problem is that some of the URLs I am trying to pull from have JavaScript slideshow containers that haven't loaded yet. I want to get the images in the slideshow, but since it hasn't loaded yet, jsoup doesn't grab the elements. Is there any way to do this? This is what I have so far:
Document doc = Jsoup.connect("http://news.nationalgeographic.com/news/2013/03/pictures/130316-gastric-brooding-frog-animals-weird-science-extinction-tedx/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+ng%2FNews%2FNews_Main+%28National+Geographic+News+-+Main%29").get();
Elements jpg = doc.select("img[src$=.jpg]");

jsoup can't handle javascript, but you can use an additional library for this:
Parse JavaScript with jsoup
Trying to parse html hidden by javascript

How to update some components of my JSP page using JavaScript

I have a JSP page and I want to automatically update one of the forms on it using JavaScript. Can somebody suggest something?
Thanks.

You can dynamically update the elements of the form using plain JavaScript, if that's what you are looking for. Here are some quick examples.
If the id of your form is myForm, you can use:
var myForm = document.forms["myForm"]; //to get the form by its id
myForm.action = "somepage"; //to change the action
var elem1 = document.getElementById("elementID"); //to get an element by id
var elem2 = myForm.elements["elementName"]; //another way to get an element, by its name
var childElement = document.createElement("option"); //to create a new element
myForm.appendChild(childElement); //to append a child element to the form
The values, attributes, and styles of the elements can be changed with simple JavaScript too. Any JavaScript tutorial on the internet should be helpful.

Have a look at the jQuery library (http://jquery.com/), which lets you manipulate HTML elements on the page, including the form element and its children, and update them automatically however you please :)

To update the form with data from the server side, i.e. data from a database or some other file on the server, use Ajax.
Whether you do this with plain JavaScript or with jQuery is up to you.
Searching for this will turn up good solutions.
