Extract the language from a web page with Jsoup (Java)

For example, I have:
<html lang="en"> ...... web page </html>
I want to extract the string "en" with Jsoup.
I tried using a selector and an attribute, without success.
Document htmlDoc = Jsoup.parse(html);
Element taglang = htmlDoc.select("html").first();
System.out.println(taglang.text());

It looks like you want to get the value of the lang attribute. The text() method returns the text inside the element, not its attributes. For an attribute, use attr("nameOfAttribute"), like:
System.out.println(taglang.attr("lang"));

Related

Regex to cut CSS links from HTML

I want to extract all CSS and JS links from an HTML page using a regex. Currently I use this pattern:
([^ ()]*\.(?:css|js)\b)
but it doesn't work perfectly. I want to exclude symbols like '{}()}' that appear before the .css or .js path of the link.
I tried the Jsoup parser, but it can't extract <link..> tags from a JS script inside the HTML, with code like:
if( userAgent.match( /ipad|iphone|htc|android|windows\s+phone/i ) ) {
document.write('<link rel="stylesheet" type="text/css" href="http://static.gazeta.ru/nm2012/css/new_common_css_pda54.css" />');
} else {
document.write('<link rel="stylesheet" type="text/css" href="http://static.gazeta.ru/nm2012/css/new_common_css275.css" />');
}
You can use the javax DOM parser if the page is well-formed XHTML (ordinary HTML is not guaranteed to be valid XML), or a more HTML-specific parser such as validator.nu, which is used by Mozilla.
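Markup written through document.write lives inside a JavaScript string, so an HTML parser sees it as script text rather than as <link> elements. For that case, a narrower regex may be enough. The following is a minimal sketch, assuming the links always appear as quoted href or src values; the class name AssetLinkExtractor and the method extract are made up for illustration. It will not find URLs that are unquoted or assembled by string concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AssetLinkExtractor {
    // Matches a quoted href/src value ending in .css or .js, with an optional query string.
    // Braces, parentheses, angle brackets and whitespace are excluded from the path.
    private static final Pattern ASSET_LINK = Pattern.compile(
            "(?:href|src)\\s*=\\s*[\"']([^\"'\\s<>(){}]+\\.(?:css|js)(?:\\?[^\"'\\s<>]*)?)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns every matching link in the order it appears in the input.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = ASSET_LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the URL without the surrounding quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<script>document.write('<link rel=\"stylesheet\" "
                + "href=\"http://static.gazeta.ru/nm2012/css/new_common_css275.css\" />');</script>"
                + "<script src='/js/app.js'></script>";
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

Because the pattern is anchored on href= or src= and a quote, stray characters such as '{' or '(' in front of the attribute never become part of the captured path.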

Change HTML element's CSS style using java

I am using JSP to create my web page. I need to use java classes to access the data that I need to pull from another website's JSON (this CANNOT change).
Say I have the code:
<div class="fruit apple"></div>
<div class="fruit banana"></div>
//"fruit peach", "fruit orange", and so on...
style.fruit {display: none;}
I need to change the HTML element using Java, not JavaScript. In my JSP file, it will be in a <% %> tag.
<% var divClassINeedToChange = "banana";
//some sort of JAVA code that is equivalent to:
//document.getElementsByClass(divClassINeedToChange).style.display = "block"; %>
I cannot find the line of Java code that is equivalent to the above line.
I hope this helps.
You can parse your page using a DOM or SAX parser.
For example:
DocumentBuilderFactory factory=DocumentBuilderFactory.newInstance();
DocumentBuilder builder=factory.newDocumentBuilder();
Document doc=builder.parse(new File(filename));
Element e = doc.getElementById(divClassINeedToChange); // looks up by id attribute, not by class
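Note that getElementById matches the id attribute, and in the javax DOM it only finds attributes that are declared as type ID, so it does not cover the class-based lookup the question asks for. A rough sketch of a class-based version follows, assuming the markup is well-formed XHTML; the names ClassStyleSetter, setStyleByClass and parse are hypothetical helpers, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ClassStyleSetter {
    // Sets an inline style on every element whose class attribute contains className.
    // Returns the number of elements that were changed.
    static int setStyleByClass(Document doc, String className, String style) {
        NodeList all = doc.getElementsByTagName("*"); // "*" matches every element
        int changed = 0;
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            // The class attribute is a whitespace-separated list of names.
            String[] classes = el.getAttribute("class").trim().split("\\s+");
            if (Arrays.asList(classes).contains(className)) {
                el.setAttribute("style", style);
                changed++;
            }
        }
        return changed;
    }

    // Parses a well-formed XHTML string; ordinary tag-soup HTML will be rejected.
    static Document parse(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<body><div class=\"fruit apple\"></div>"
                + "<div class=\"fruit banana\"></div></body>");
        System.out.println(setStyleByClass(doc, "banana", "display: block;")); // prints 1
    }
}
```

The inline style attribute overrides the stylesheet rule for that one element, which mirrors what the JavaScript line in the question would do.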

How to remove a specific tag from the entire html page using jsoup

I'm using Jsoup 1.7.3 to edit some HTML files.
What I need is to remove the following tags from the HTML file:
<meta name="GENERATOR" content="XXXXXXXXXXXXXX">
<meta name="CREATED" content="0;0">
<meta name="CHANGED" content="0;0">
As you can see, these are all <meta> tags. How can I do that? Here is what I've tried so far:
// I'm pretty sure that the <meta> tags are nested in the <head>, but removing the whole head is bad practice.
Document docsoup = Jsoup.parse(htmlin);
docsoup.head().remove();
What do you suggest?
I recommend you use Jsoup selectors, for example
Document document = Jsoup.parse(html);
Elements selector = document.select("meta[name=GENERATOR]");
for (Element element : selector) {
element.remove();
}
document.html(); // returns the HTML as a String, with the selected elements removed

How to find a specific meta tag

I am trying to retrieve a meta tag (Tag name=Generator) using Jsoup parser in java.
The code I have is given below:
Elements metalinks=doc.select("meta"); // meta
boolean metafound=false;
for (Element singlemeta : metalinks)
{
metatagname = singlemeta.attr("abs:name");
metatagcontent = singlemeta.attr("abs:content");
if((metatagname=="Generator")||(metatagname=="generator")||(metatagname=="GENERATOR")){
// this is the tag we want to get value of...
metarequired=metatagcontent;
metafound=true;
}
}
if(metafound==false)
metarequired="NOT_FOUND";
However I am unable to extract the meta GENERATOR tag correctly.
One example of this tag is now given below:
<meta name="generator" content="Test page" />
For the very first line in code given above, I also tried the following code but that also does not work:
//Elements metalinks= doc.getElementsByTag("meta");
How do I extract the meta tag correctly?
It almost looks as if you're making it too complicated. What if you started out with something simple like this:
Elements metalinks = doc.select("meta[name=generator]");
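Part of the reason the original loop never matches is plain Java rather than Jsoup: == on strings compares object references, not characters, so a value read at runtime is normally never == to a literal. In addition, the abs: prefix asks Jsoup to resolve the attribute as an absolute URL, which is not meant for name or content. The comparison issue can be shown with a small sketch; MetaNameMatcher and isGenerator are made-up names for illustration.

```java
class MetaNameMatcher {
    // True when the attribute value is "generator" in any letter case.
    // Calling equalsIgnoreCase on the literal also makes a null argument safe.
    static boolean isGenerator(String metaName) {
        return "generator".equalsIgnoreCase(metaName);
    }

    public static void main(String[] args) {
        // A distinct String object, standing in for a value read from a parsed page.
        String fromParser = new String("Generator");
        System.out.println(fromParser == "Generator"); // false: == compares references
        System.out.println(isGenerator(fromParser));   // true: compares characters, ignoring case
    }
}
```

One case-insensitive comparison replaces the three == checks for "Generator", "generator" and "GENERATOR".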

How to extract the content attribute of the meta name=generator tag?

I am using the below code to extract meta 'generator' tag content from a web page using Jsoup:
Elements metalinks = doc.select("meta[name=generator]");
boolean metafound=false;
if(metalinks.isEmpty()==false)
{
metatagcontent = metalinks.first().select("content").toString();
metarequired=metatagcontent;
metafound=true;
}
else
{
metarequired="NOT_FOUND";
metafound=false;
}
The problem is that for a page that does contain the meta generator tag, no value is shown (when I output the value of the variable 'metarequired'). For a page that does not have a meta generator tag, the value 'NOT_FOUND' is shown correctly. What am I doing wrong here?
From your code,
metalinks.first().select("content").toString();
This is not correct. This is merely selecting
<meta ...>
<content ... /> <!-- This one, which of course doesn't exist. -->
</meta>
while you actually want to get the attribute
<meta ... content="..." />
You need to use attr("content") instead of select("content").
metatagcontent = metalinks.first().attr("content");
See also:
Jsoup cookbook - Selector syntax
Jsoup Selector API documentation
W3 CSS3 selector specification
Unrelated to the concrete problem, you don't need to test against a boolean inside an if block. The isEmpty() already returns a boolean:
if (!metalinks.isEmpty())
