I wrote a simple Java Web Crawler that lets the user type in any web page and it will search through the page and pull out the links as Strings. I am not using a package like Jsoup. My question is, how do I only print the absolute URLs rather than both relative and absolute URLs?
Inspect the src or href attribute to see if it's absolute, relative, or protocol-relative (//stackoverflow.com/file). Parse the page's URL. If the attribute value is protocol-relative, use the protocol from the parsed page URL, then append the content of the attribute. If it's relative, strip the query string and fragment identifier from the original URL, and "append" the relative portion. Be aware that a relative URL can look like /foo, foo, foo/bar, or ./../../bar/../foo, so you might want to resolve path traversals before printing.
Edit:
Take a look at URL and the Commons URL Builder. They'll both be helpful.
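The steps above can be sketched with the standard library alone. This is a minimal illustration, assuming java.net.URI is acceptable for the crawler; the class name LinkResolver and the example.com addresses are made up for the example, and hrefs containing illegal characters (such as spaces) would need escaping first.

```java
import java.net.URI;

// Hypothetical helper, not part of the original crawler.
final class LinkResolver {

    // Resolves an href/src value against the URL of the page it was found on.
    // Absolute values come back unchanged, protocol-relative values inherit
    // the page's scheme, and relative values are merged with the page's path
    // (the page's query string and fragment are dropped).
    // Throws IllegalArgumentException if either string is not a valid URI.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/a/b/page.html?x=1#frag";
        String[] hrefs = {
            "/foo", "foo", "foo/bar", "./../bar/../foo",
            "//stackoverflow.com/file", "https://other.example/x"
        };
        for (String href : hrefs) {
            System.out.println(href + " -> " + resolve(page, href));
        }
    }
}
```

Printing only the resolved value gives the crawler a list of absolute URLs regardless of how each link was written in the page.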
Related
I am struggling with my thymeleaf template, as follows.
So, I have an Arraylist of urls of the same name, which I want to display on a page.
<a th:each="u:${urls}" th:href="${u}" th:value="${u}">[[${u}]]<br></a>
The problem is that when I click on one of the rendered links, it simply appends my URL to the current URL, e.g.:
http://localhost:8080/www.google.com
What's going on here, and how should I achieve what I'm attempting to do? I have tried "base href", to no avail.
The URLs need to have http:// or https:// in front of them. (If they don't, they are considered relative URLs and the browser correctly resolves them against http://localhost:8080/.) You can add the scheme like this if you want:
<a th:each="u: ${urls}" th:href="|https://${u}|" th:text="${u}" />
Using <base href="<s:url value="/"/>" target="_blank"> resolves all images & stylesheets properly, when there are many namespaces like /, /admin etc.
However, the action URLs are also affected by the base tag.
Suppose the current browser url is http://context/admin/dashboard
<s:url value="clients" namespace="admin"/> returns clients which in the browser gets resolved to http://context/clients instead of http://context/admin/clients
Is there a way to tell s:url to render absolute URLs instead of relative ones?
http://struts.apache.org/development/2.x/docs/url.html
You have the wrong value in the namespace attribute of the tag. The namespace value should match the namespace attribute of the package and use the path calculated from the web content root. So, if you have declared namespace="/admin" on the package, the same value should be used in the corresponding url tag attribute.
<s:url action="clients" namespace="/admin"/>
The result is written to the HTML output, so you can inspect the page source to see what value is rendered.
I have extracted the outlinks of a web page, which I am going to parse again using Jsoup. The problem is that the links are of the form ../../../pincode/india/andaman-and-nicobar-islands/. In this form I cannot parse them, so I converted them to absolute URLs using link.attr("abs:href"), with the help of another Stack Overflow post.
The URL of the first web page that I parsed is http://www.mapsofindia.com/pincode/india/, and the absolute URLs that I get after parsing are of the form http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/. But I cannot parse them further using Jsoup. When I execute the following statement:
Jsoup.parse("http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/");
It gives an HTTP 400 error, i.e. bad request, so I think there is some problem with the URLs. Can anyone please help me get the URLs in a proper form so that I can parse them further? Thank you.
Please try these two things:
try using link.absUrl("href") instead of link.attr("abs:href")
Check the base uri (calling baseUri() on your element or document)
By the way, you should use the connect() method to fetch the page, since parse(String) treats its argument as HTML rather than as a URL:
Document doc = Jsoup.connect("http://<your url here>").get();
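If the absolute URL still contains a leading ../ after those checks, one possible workaround is to collapse the dot segments yourself before fetching. This is only a sketch using java.net.URI, not something the answer above prescribes; the class DotSegmentCleaner is hypothetical, and it also collapses empty segments (//) and decodes percent-escapes in the path, which may not suit every site.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration only.
final class DotSegmentCleaner {

    // Removes "." and ".." segments from an absolute path. A ".." that
    // would climb above the root is simply dropped.
    static String removeDotSegments(String path) {
        Deque<String> kept = new ArrayDeque<>();
        String[] segments = path.split("/", -1);
        for (String segment : segments) {
            if (segment.equals("..")) {
                kept.pollLast();               // no-op when already at the root
            } else if (!segment.equals(".") && !segment.isEmpty()) {
                kept.addLast(segment);
            }
        }
        String last = segments[segments.length - 1];
        boolean trailingSlash = last.isEmpty() || last.equals(".") || last.equals("..");
        String joined = "/" + String.join("/", kept);
        return (trailingSlash && !kept.isEmpty()) ? joined + "/" : joined;
    }

    // Rebuilds the URL with a cleaned path, keeping scheme, host, query and fragment.
    static String clean(String url) throws URISyntaxException {
        URI u = new URI(url);
        return new URI(u.getScheme(), u.getAuthority(),
                removeDotSegments(u.getPath()), u.getQuery(), u.getFragment()).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(clean(
            "http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/"));
    }
}
```

The cleaned string can then be passed to Jsoup.connect(...) in place of the original one.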
I want to print a list of absolute URLs in a Spring JSP, for an internal user to pick from. However, the links are rendered with the current URL prepended.
For example, I want a link to www.anothersite.com, but the link comes out as http://localhost:8080/myapp/www.anothersite.com on the page.
What am I doing wrong? Both lines below have the same result.
<c:forEach items="${listAppURLForm}" var="nextURL">
    <li>
        <a href=<c:out value="${nextURL.link}"></c:out>>${nextURL.link}</a>
        <a href=${nextURL.link}>${nextURL.link}</a>
    </li>
</c:forEach>
There's a misconception. A URL like www.example.com is definitely not an absolute URL. The URI scheme is completely missing. You need to prepend the desired scheme to make it a truly absolute URL, like so: http://www.example.com.
If you can't edit the URLs directly in the list, then you'd need to prefix the scheme in the HTML instead.
<a href="http://${nextURL.link}">${nextURL.link}</a>
You might want to perform a ${fn:startsWith()} check beforehand to prevent duplicate schemes.
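The same check can also be done on the server side before the list reaches the JSP. The following is a minimal sketch, assuming the links are plain strings and that http:// is an acceptable default; the class SchemePrefixer and its method are hypothetical names, not part of the question's code.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
final class SchemePrefixer {

    // Returns the link unchanged if it already starts with http:// or https://
    // (case-insensitive); otherwise prepends "http://" so that browsers treat
    // it as an absolute URL instead of a relative path.
    static String withScheme(String link) {
        String trimmed = link.trim();
        String lower = trimmed.toLowerCase(Locale.ROOT);
        if (lower.startsWith("http://") || lower.startsWith("https://")) {
            return trimmed;
        }
        return "http://" + trimmed;
    }

    public static void main(String[] args) {
        System.out.println(withScheme("www.anothersite.com"));
        System.out.println(withScheme("https://www.example.com"));
    }
}
```

Normalizing once in Java keeps the template free of conditional logic, at the cost of assuming a single default scheme for every bare host name.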
Jsoup has 2 html parse() methods:
parse(String html) - "As no base URI is specified, absolute URL
detection relies on the HTML including a <base href> tag."
parse(String html, String baseUri) - "The URL where the HTML
was retrieved from. Used to resolve relative URLs to absolute URLs,
that occur before the HTML declares a <base href> tag."
I am having difficulty understanding the difference between the two:
In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur
before the HTML declares a <base href> tag" mean? What if a
<base href> tag never occurs in the page?
What is the purpose of absolute URL detection? Why does Jsoup need
to find the absolute URL?
Lastly, but most importantly: Is baseUri the full URL of HTML page
(as phrased in original documentation) or is it the base URL of
the HTML page?
It's used by, among others, Element#absUrl(), so that you can retrieve the (intended) absolute URL of an <a href>, <img src>, <link href>, <script src>, etc. E.g.
for (Element link : document.select("a")) {
System.out.println(link.absUrl("href"));
}
This is very useful if you want to download and/or parse the linked resources as well.
In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never occurs in the page?
Some (poor) websites may have declared a <link> or <script> with a relative URL before the <base> tag. And if there is no <base> tag at all, then just the given baseUri will be used for resolving the relative URLs of the entire document.
What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?
In order to return the right URL from Element#absUrl(). This is purely for the end user's convenience. Jsoup doesn't need it in order to successfully parse the HTML on its own.
Lastly, but most importantly: Is baseUri the full URL of HTML page (as phrased in original documentation) or is it the base URL of the HTML page?
The former. If it were the latter, then the documentation would be lying. The baseUri must not be confused with <base href>.
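To illustrate why the full page URL works as a base, here is a small sketch using java.net.URI rather than Jsoup itself; it only demonstrates the general relative-resolution rule, and the class name and example.com addresses are invented for the illustration. The last path segment of the base (the page's file name) is discarded during resolution, so passing the full URL is safe.

```java
import java.net.URI;

// Hypothetical demo class; shows standard URI resolution, not Jsoup internals.
final class BaseUriDemo {

    // Resolves a relative reference against a base URL.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Full URL of the page, as it would be passed in as baseUri.
        String pageUrl = "http://example.com/articles/2020/post.html";
        // A value a page might declare in a <base href> tag instead.
        String baseHref = "http://cdn.example.com/assets/";

        System.out.println(resolve(pageUrl, "images/a.png"));
        System.out.println(resolve(baseHref, "images/a.png"));
    }
}
```

The two calls give different results for the same relative link, which is why it matters whether a reference is resolved against the page URL or against a declared <base href>.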