I'm using jsoup to parse the HTML of this website: news
I can fetch the titles and descriptions by selecting the elements I need, but I can't find the element that holds the video URL. How can I get the video link with jsoup or another library? Thanks!
Maybe I misunderstood your question, but can't you search for <video> elements using JSoup?
A <video> element normally carries its URL in a src attribute, either on the element itself or on a nested <source> element.
Maybe try something like this?
// HTML from your webpage
final var html = "this should hold your HTML";
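// Parse the HTML into a document of element objects
final var document = Jsoup.parse(html);
// Use CSS to select the first <video> (or nested <source>) element that has a src attribute
final var videoElement = document.select("video[src], video source[src]").first();
// first() returns null when nothing matched, so check before reading the "src" attribute
final var src = videoElement != null ? videoElement.attr("src") : null;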
I did not check the website you linked thoroughly, but some websites insert videos using JavaScript. If this website adds the <video> tag after the page loads, you are out of luck with Jsoup alone: it does not run JavaScript and only sees the initial HTML returned by the server.
Jsoup is an HTML parser, so it parses the HTML it is given, not markup that scripts generate later in a browser.
I have a table in an HTML page that I have to iterate through, opening each link to a second page where all the information is. On that page I extract the data I need and then return to the first page.
How do I change pages with JSoup in Java? Is it actually possible?
The JSoup Cookbook has an example of getting all the links inside an HTML element. Iterate over the Elements from that example and, for each one, call Document doc = Jsoup.connect(<url from Elements>).get();. You can then call String htmlFromLink = doc.toString(); to get the linked page's HTML, or run selectors on doc directly.
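A rough sketch of that loop might look like the following. It is only an illustration under a few assumptions: the class and method names (TableLinkFollower, extractTableLinks, fetchPage) and the example.com URLs are made up, and the links are assumed to sit inside a <table>. The main method works on an inline HTML string so it runs without network access; fetchPage shows the call you would make for each link on a real site.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableLinkFollower {

    // Collects the absolute URL of every link that sits inside a table.
    static List<String> extractTableLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("table a[href]")) {
            // "abs:href" resolves relative links against baseUri
            String url = link.attr("abs:href");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    // Fetches one linked page; the caller then selects whatever data it needs.
    static Document fetchPage(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    public static void main(String[] args) {
        // Hypothetical overview page, inlined so the sketch runs offline
        String html = "<table><tr><td><a href=\"/item/1\">One</a></td>"
                + "<td><a href=\"/item/2\">Two</a></td></tr></table>";
        for (String url : extractTableLinks(html, "https://example.com/list.html")) {
            System.out.println(url);
            // On a real site: Document detail = fetchPage(url);
            // then e.g. detail.select("h1").text(), and continue with the next link.
        }
    }
}
```

There is no real "going back" step: the overview Document stays in memory, so after handling one detail page you simply continue with the next URL in the list.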
I need to parse HTML and find the corresponding CSS styles. I can parse HTML and CSS separately, but I can't combine them. For example, I have an XHTML page like this:
<html>
<head>
<title></title>
</head>
<body>
<div class="abc">Hello World</div>
</body>
</html>
I have to search for "hello world" and find its class name, and after that I need to find its style from an external CSS file. Answers using Java, JavaScript, and PHP are all okay.
Use the jsoup library in Java, which is an HTML parser. You can see an example here.
For example you can do something like this:
String html = "<<your html content>>";
Document doc = Jsoup.parse(html);
// Get the first element whose own text contains "Hello World" (the search is case-insensitive)
Element ele = doc.getElementsContainingOwnText("Hello World").first();
// Class names of that element (a java.util.Set), e.g. [abc]
Set<String> classNames = ele.classNames();
You can explore the library further to fit your needs.
Similar question: Can jQuery get all CSS styles associated with an element? CSS optimizers may also do what you want; take a look at unused-css.com, an online tool that also lists other tools.
As I understand it, you can parse the style sheet from an external file, which makes the task easier. First parse the HTML file with jsoup, whose jQuery-like selector syntax makes complicated HTML easier to handle. Then check this previous solution for parsing the CSS file. I'm not going to give a full solution: these libraries do the work internally, and all you have to write is the glue code that combines the two.
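To give an idea of what that glue code could look like, here is a minimal sketch of the CSS side using only java.util.regex. It is not a real CSS parser: it assumes a flat style sheet with no nested at-rules such as @media and no braces inside comments or strings. The names CssClassLookup and declarationsFor are invented for this example; the class name passed in would come from the HTML step (for instance from jsoup's classNames()).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssClassLookup {

    // Matches one flat rule of the form "selector list { declarations }".
    private static final Pattern RULE = Pattern.compile("([^{}]+)\\{([^{}]*)\\}");

    // Returns the declarations of every rule whose selector list mentions ".className".
    static List<String> declarationsFor(String css, String className) {
        // The lookahead stops ".abc" from also matching a longer name such as ".abcd"
        Pattern classToken = Pattern.compile("\\." + Pattern.quote(className) + "(?![\\w-])");
        List<String> result = new ArrayList<>();
        Matcher rule = RULE.matcher(css);
        while (rule.find()) {
            if (classToken.matcher(rule.group(1)).find()) {
                result.add(rule.group(2).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of the external CSS file
        String css = "body { margin: 0; }\n"
                + ".abc { color: red; font-weight: bold; }\n"
                + ".abcd { color: blue; }";
        System.out.println(declarationsFor(css, "abc"));
    }
}
```

For anything beyond simple style sheets, a dedicated CSS parser library is the safer choice; the regex only keeps the sketch self-contained.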
Using Java's java.util.regex:
String s = "<body>...<div class=\"abc\">Hello World</div></body>";
Pattern p = Pattern.compile("<div.+?class\\s*?=\\s*['\"]?([^ '\"]+).*?>Hello World</div>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
if (m.find()) {
    System.out.println(m.group(1));
}
This prints abc.
I have an HTML page that includes the following master reset CSS. I receive the HTML as a string in Java, and I have to remove, replace, or comment out this CSS block using Java, without touching the other inline CSS styles. I tried using the StringUtils class, but it's not working. How can I do this in Java?
<style type="text/css">
#charset "utf-8";
/* CSS Document */
/* Ver 1.0 Author*/
/* master reset */
a,abbr,acronym,address,applet,b,big,blockquote,body,button,caption,center,cite,code,dd,del,dfn,
dir,div,dl,dt,em,embed,fieldset,font,form,frame,h1,h2,h3,h4,h5,h6,hr,html,i,iframe,img,input,
ins,kbd,label,legend,li,menu,object,ol,option,p,pre,q,s,samp,select,small,span,strike,strong,
sub,sup,table,tbody,td,textarea,tfoot,th,thead,tr,tt,u,ul,var
{background:transparent;border:0;font-family:inherit;font-size:100%;font-style:inherit;
font-weight:inherit;margin:0;outline:0;padding:0;vertical-align:baseline;}
html {font-size:1em;overflow-y:scroll;}
body {background:white;color:black;line-height:1;}
a,ins {text-decoration:none;}
blockquote,q{quotes:none;quotes:"" "";}
blockquote:before,blockquote:after,q:before,q:after {content:"";content:none;}
caption,center,td,th {text-align:left;}
del {text-decoration:line-through;}
dir,menu,ol,ul {list-style:none;}
table {border-collapse:collapse;border-spacing:0;}
textarea {overflow-y:auto;}
</style>
I'd recommend using an HTML parsing library such as JSoup to do this.
With JSoup, you can select certain elements (based on their tagname, id etc) using a selector. For example, to remove all the style elements:
Document doc = Jsoup.parse(html);
Elements els = doc.select("style");
for (Element e : els) {
    e.remove();
}
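Note that the loop above removes every <style> element. If only the master reset block should go and other styles must stay, one possible variation is to check the CSS text of each <style> element first. The sketch below assumes the block can be recognised by its "master reset" comment; the class and method names (MasterResetRemover, removeStyleContaining) are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class MasterResetRemover {

    // Removes only the <style> elements whose CSS text contains the marker.
    // Other <style> elements and inline style="..." attributes are left alone.
    static String removeStyleContaining(String html, String marker) {
        Document doc = Jsoup.parse(html);
        for (Element style : doc.select("style")) {
            // data() returns the raw CSS text inside the <style> element
            if (style.data().contains(marker)) {
                style.remove();
            }
        }
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page from the question
        String html = "<html><head>"
                + "<style type=\"text/css\">/* master reset */ a,ins {text-decoration:none;}</style>"
                + "<style>.keep {color:red;}</style>"
                + "</head><body><p style=\"color:blue\">Hello</p></body></html>";
        System.out.println(removeStyleContaining(html, "master reset"));
    }
}
```

Keep in mind that jsoup re-serialises the whole document, so whitespace and formatting in the returned HTML may differ from the input string.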
I'm looking for an HTML object model in Java, capable of parsing HTML (not required) and containing all HTML elements (and CSS as well) in an elegant object model.
I'm looking for a pure Java version of Groovy's HTML builder.
(I have no luck on google with this request.)
I want to be able to perform stuff like:
HTML html = new HTML();
Body body = html.body();
Table table = body.addTable(myCssStyle);
Row row = table.addRow("a", "b", "c").withCss(cssRowStyle);
and so on...
Check out Jsoup:
Example:
(Building some html)
Document doc = Document.createShell("");
Element headline = doc.body().appendElement("h1").text("thats a headline");
Element pTag = doc.body().appendElement("p").text("some text ...");
Element span = pTag.prependElement("span").text("That's");
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<h1>thats a headline</h1>
<p><span>That's</span>some text ...</p>
</body>
</html>
Documentation:
Cookbook
API Documentation (JavaDoc)
Jakarta ECS might be able to do what you want.
Just an idea: you could take a look at the source code of xhtmlrenderer project.
http://code.google.com/p/flying-saucer//
It's not plain HTML (it's XHTML), but may be a good starting point, don't you think?