regex cut css links from html - java

I want to extract all CSS and JS links from an HTML page using a regex. Currently I use this pattern:
([^ ()]*\.(?:css|js)\b)
but it doesn't work perfectly: I want to exclude symbols like '{}()}' that appear before the .css or .js path of the link.
I tried the Jsoup parser, but it can't extract <link ...> tags from JavaScript embedded in the HTML, with code like:
if( userAgent.match( /ipad|iphone|htc|android|windows\s+phone/i ) ) {
document.write('<link rel="stylesheet" type="text/css" href="http://static.gazeta.ru/nm2012/css/new_common_css_pda54.css" />');
} else {
document.write('<link rel="stylesheet" type="text/css" href="http://static.gazeta.ru/nm2012/css/new_common_css275.css" />');
}

You can use the javax DOM parser if the page is well-formed XHTML (HTML in general is not valid XML), or a more HTML-specific parser such as validator.nu, which is used by Mozilla.
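If a regex is still preferred, for instance because the tags sit inside document.write() strings that an HTML parser treats as plain script text, one option is to anchor the match on a quoted href or src attribute instead of on "anything before .css". The following is only a sketch under that assumption: the class name AssetLinkExtractor and the method extract are made up for illustration, and the pattern will miss unquoted attributes or URLs assembled by string concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AssetLinkExtractor {
    // group(1): the opening quote character; group(2): the URL itself.
    // The URL may not contain quotes, whitespace, <>, (), or {} and must
    // end in .css or .js, optionally followed by a query string.
    private static final Pattern ASSET_URL = Pattern.compile(
            "(?:href|src)\\s*=\\s*([\"'])"
            + "([^\"'\\s<>(){}]+?\\.(?:css|js)(?:\\?[^\"'\\s<>]*)?)"
            + "\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns every quoted href/src value that points at a .css or .js file.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = ASSET_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "if (x) { document.write('<link rel=\"stylesheet\" "
                + "href=\"http://static.gazeta.ru/nm2012/css/new_common_css275.css\" />'); }";
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

Because the match starts at href= or src= and stops at the matching closing quote, stray characters such as '{', '(' or ')' from the surrounding script cannot end up in the captured URL.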

Related

Some URL patterns lose css

I'm making a simple servlet app, which is supposed to produce the same output for the following URL patterns:
@WebServlet(urlPatterns={"/Start", "/Start/*", "/Startup", "/Startup/*"})
The output for the following addresses is correct:
http://localhost:4413/TestA/Startup
http://localhost:4413/TestA/Start
http://localhost:4413/TestA
However, once I try something like this:
http://localhost:4413/TestA/Startup/
or
http://localhost:4413/TestA/Startup/blablabla
The CSS file is no longer found.
What could be wrong here?
The CSS links are of the form:
<link rel="StyleSheet" href="res/mc.css" type="text/css" title="cse4413" media="screen, print"/>
This depends on how you have included the CSS file. If you included it like this:
<link href="css/style.css" />
then it won't work once the page URL contains extra path segments, because the browser resolves the href relative to the current page URL. So change it to something like this:
<link href="/css/style.css" />
You need to provide a path relative to the domain root, not to the current page, so that the browser always requests the same URL.
I solved the problem by setting href to
href="${pageContext.request.contextPath}/res/mc.css"
Could anyone explain how this is different from a link of the form
project/WebContent/res/mc.css?
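To see why the extra path segment matters, here is a minimal sketch using java.net.URI from the standard library, whose resolve() applies the usual relative-reference rules and should approximate what a browser does with these hrefs. The class name RelativeHrefDemo is made up for illustration, and the context path /TestA is inferred from the URLs in the question. By contrast, project/WebContent/res/mc.css looks like a location in the project's source tree rather than a URL the server exposes.

```java
import java.net.URI;

final class RelativeHrefDemo {
    // Resolves an href against the URL of the page that contains it.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String href = "res/mc.css";

        // The last path segment of the page URL is replaced by the href:
        // http://localhost:4413/TestA/res/mc.css
        System.out.println(resolve("http://localhost:4413/TestA/Startup", href));

        // With a trailing slash, "Startup/" becomes part of the directory:
        // http://localhost:4413/TestA/Startup/res/mc.css
        System.out.println(resolve("http://localhost:4413/TestA/Startup/", href));

        // An href starting with the context path ignores the page's path:
        // http://localhost:4413/TestA/res/mc.css
        System.out.println(resolve("http://localhost:4413/TestA/Startup/blablabla",
                "/TestA/res/mc.css"));
    }
}
```

Note that a bare leading slash, as in /css/style.css, resolves against the host root and therefore skips the /TestA context path, which is presumably why prefixing the context path worked.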

extract language from a web page with Jsoup

For example I have
<html lang="en"> ...... web page </html>
I want to extract the string "en" with Jsoup.
I tried with selector and attribute without success.
Document htmlDoc = Jsoup.parse(html);
Element taglang = htmlDoc.select("html").first();
System.out.println(taglang.text());
It looks like you want the value of the lang attribute. In that case you can use attr("nameOfAttribute"), like this:
System.out.println(taglang.attr("lang"));

Remove newlines and whitespace from jsp

I need to store the html retrieved from a <jsp:include> in a javascript variable. So I will have something like this
<script>
var html = '<jsp:include page="...">';
</script>
The problem is the jsp file has lots of whitespace and newlines which makes the javascript invalid! I tried using the trimDirectiveWhitespaces directive as suggested here, but that does not remove newlines.
How can I remove newlines as well from html so it can be a valid javascript string?
Or, another solution is welcome as well.
EDIT:
The snippet should eventually look like this (but with many more options):
<script>
var html = '<label class="someClass">Label</label><select><option value="val1">Value</option></select>';
</script>
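One possible approach is to post-process the markup in Java before it is written into the script block. The sketch below covers only that string-handling step; the class name JsStringLiteral and the method toJsLiteral are hypothetical, and capturing the include output into a String first (for example via JSTL's c:import with a var attribute) is an assumption about the surrounding setup.

```java
final class JsStringLiteral {
    // Collapses all whitespace runs (including newlines) to a single space
    // and escapes the result so it can sit inside a single-quoted JS string.
    static String toJsLiteral(String html) {
        String oneLine = html.trim().replaceAll("\\s+", " ");
        String escaped = oneLine
                .replace("\\", "\\\\")  // escape backslashes first
                .replace("'", "\\'")    // escape the quote used as delimiter
                .replace("</", "<\\/"); // keep "</script>" from ending the block
        return "'" + escaped + "'";
    }

    public static void main(String[] args) {
        String html = "<label class=\"someClass\">Label</label>\n"
                + "   <select>\n"
                + "\t<option value=\"val1\">Value</option>\n"
                + "</select>\n";
        System.out.println("var html = " + toJsLiteral(html) + ";");
    }
}
```

Collapsing whitespace this way also changes text inside elements such as <pre> or <textarea>, so it only suits markup where whitespace is not significant.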

How to remove a specific tag from the entire html page using jsoup

I'm using jsoup 1.7.3 to edit some HTML files.
What I need is to remove the following tags from the HTML file:
<meta name="GENERATOR" content="XXXXXXXXXXXXXX">
<meta name="CREATED" content="0;0">
<meta name="CHANGED" content="0;0">
As you can see, it's the <meta> tag. How can I do that? Here is what I've tried so far:
// I'm pretty sure that the <meta> tag is nested in the <head>,
// but removing the whole head is bad practice.
Document docsoup = Jsoup.parse(htmlin);
docsoup.head().remove();
What do you suggest?
I recommend you use Jsoup selectors, for example:
Document document = Jsoup.parse(html);
// Select only the <meta> tags whose name attribute is GENERATOR
Elements selector = document.select("meta[name=GENERATOR]");
for (Element element : selector) {
    element.remove();
}
document.html(); // returns the HTML as a String with the selected elements removed

Jsoup.parse() vs. Jsoup.parse() - or How does URL detection work in Jsoup?

Jsoup has 2 HTML parse() methods:
parse(String html) - "As no base URI is specified, absolute URL
detection relies on the HTML including a <base href> tag."
parse(String html, String baseUri) - "The URL where the HTML
was retrieved from. Used to resolve relative URLs to absolute URLs,
that occur before the HTML declares a <base href> tag."
I am having difficulty understanding the difference between the two:
In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur
before the HTML declares a <base href> tag" mean? What if a
<base href> tag never occurs in the page?
What is the purpose of absolute URL detection? Why does Jsoup need
to find the absolute URL?
Lastly, but most importantly: Is baseUri the full URL of HTML page
(as phrased in original documentation) or is it the base URL of
the HTML page?
It's used by, among others, Element#absUrl(), so that you can retrieve the (intended) absolute URL of an <a href>, <img src>, <link href>, <script src>, etc. E.g.
for (Element link : document.select("a")) {
System.out.println(link.absUrl("href"));
}
This is very useful if you want to download and/or parse the linked resources as well.
In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag" mean? What if a <base href> tag never occurs in the page?
Some (poor) websites may have declared a <link> or <script> with a relative URL before the <base> tag. And if there is no <base> tag at all, then just the given baseUri will be used for resolving relative URLs of the entire document.
What is the purpose of absolute URL detection? Why does Jsoup need to find the absolute URL?
In order to return the right URL from Element#absUrl(). This is purely for the end user's convenience. Jsoup doesn't need it in order to successfully parse the HTML on its own.
Lastly, but most importantly: Is baseUri the full URL of HTML page (as phrased in original documentation) or is it the base URL of the HTML page?
The former. If it were the latter, the documentation would be lying. The baseUri must not be confused with <base href>.
