I made two simple HTML pages.
page1:
<html>
<head>
</head>
<body>
enter page 2
<p>
some data
</p>
</body>
</html>
page2:
<html>
<head>
</head>
<body>
enter page 1
enter page 3
<p>
some other data
</p>
</body>
</html>
I want to get the links using the jsoup library:
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String absHref = link.attr("href"); // "page2.html/"
What I want to do now is follow the link from page 1 to page 2 (both are stored locally on my computer) and parse page 2.
I tried this:
Document doc2 = Jsoup.connect(absHref).get();
But it doesn't work; I get a 404 error.
EDIT:
Following a short reply by #JonasCz, I tried the code below. It works, but I suspect there is a better and smarter way.
File file = new File(args[0]);
String path = file.getParent() + "\\";
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String href = link.attr("href"); // "page2.html/"
File file2 = new File(path + href);
Document doc2 = Jsoup.parse(file2, "UTF-8", "http://example.com/");
Thank you
You are on the right track, but you are not creating an absolute URL.
Instead of:
String absHref = link.attr("href"); // "page2.html/"
Use:
String absHref = link.absUrl("href"); // this will give you http://example.com/page2.html
The rest is just as you are doing.
http://jsoup.org/apidocs/org/jsoup/nodes/Node.html
Unfortunately, jsoup is not a web crawler; it is only a parser with the ability to connect directly and fetch pages. The crawling logic, e.g. what to fetch or visit next, is your responsibility to implement. You could search for web crawlers for Java; one of them may be more suitable.
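If the pages live on disk rather than on a server, resolving the href against the location of the first file may be more useful than a placeholder http address, since Jsoup.connect only handles http and https URLs. Below is a minimal sketch of that idea using only the standard library. The class name LocalLinkResolver, the method resolve, and the site/page1.html location are made up for illustration, and the href is assumed to be a valid URI reference (no unescaped spaces).

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

class LocalLinkResolver {

    // Resolves a relative href against the file that contains the link.
    static Path resolve(Path pageFile, String href) {
        URI base = pageFile.toAbsolutePath().toUri(); // file: URI of the page holding the link
        URI target = base.resolve(href);              // standard relative-URL resolution
        return Paths.get(target);                     // back to a local path
    }

    public static void main(String[] args) {
        Path page1 = Paths.get("site", "page1.html"); // hypothetical location of page1
        Path page2 = resolve(page1, "page2.html");
        System.out.println(page2);
    }
}
```

The resulting path could then be parsed with Jsoup.parse(page2.toFile(), "UTF-8", ...) in the same way as the first page.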
Related
I want to pass a complete HTML document as a string and then check whether the given string is valid HTML.
public boolean isValidHTML(String htmlData)
Description - Checks whether the given HTML data is valid HTML.
htmlData - An HTML document in the form of a string, containing tags and data.
returns - true if the given htmlData contains only valid tags with their allowed attributes and possible values, otherwise false.
A valid HTML document:
<html>
<head>
<title>Page Title</title>
</head>
<body>
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
<b>This text is bold</b>
</body>
</html>
The Java code should look like this:
import java.util.Scanner;

class htmlValidator {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String html = "pass the html here";
        isValidHtml(html);
    }

    public static boolean isValidHtml(String html) {
        /* write code here */
        /* method returns true if the given html is valid */
        /* please help */
        return false; // placeholder so that the stub compiles
    }
}
Rather than writing regex to parse and check (which is generally A Bad Idea), you're better off using something like jsoup to parse it and check for errors.
From https://jsoup.org/cookbook/input/parse-document-from-string:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
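Parsing alone does not report problems, because jsoup silently repairs malformed input. One possible way to surface them is the parser's error tracking, sketched below. The class name HtmlCheck, the method parsesWithoutErrors, and the limit of 20 tracked errors are illustrative choices. Note that this only reports parse errors; it does not check tags against their allowed attributes and values, so it is not a full validator.

```java
import org.jsoup.Jsoup;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

class HtmlCheck {

    // Returns true when jsoup records no parse errors for the input.
    static boolean parsesWithoutErrors(String html) {
        Parser parser = Parser.htmlParser().setTrackErrors(20); // keep up to 20 errors
        Jsoup.parse(html, "", parser);
        for (ParseError error : parser.getErrors()) {
            System.out.println(error.getPosition() + ": " + error.getErrorMessage());
        }
        return parser.getErrors().isEmpty();
    }

    public static void main(String[] args) {
        // An end tag with attributes and an unterminated tag, both malformed.
        System.out.println(parsesWithoutErrors("<p>One</p href='no'><foo"));
    }
}
```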
I have a requirement in my application: I need to open files (pdf, docx, pptx, ...) within HTML (embedded in a div or an iframe) by passing document paths (URLs) on the click of a button.
I have tried opening the documents outside the application and that works, but I am stuck on how to approach the problem above.
Thanks.
You can place an iframe inside a div and open the document in that iframe element as follows:
<div>
<iframe src="" id="iframeRight" style="width:600px; height:500px;" frameborder="0">
</iframe>
</div>
In the controller, on click of the button, you can assign the base64 string of the document to the iframe src as follows:
document.getElementById("iframeRight").src = base64String;
Just ensure that the base64 string has the metadata prefix attached; for a PDF, for example, it must start with:
data:application/pdf;base64,
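If the base64 string is produced on the server side, the prefix can be added there. The following is a minimal Java sketch, assuming the document bytes and MIME type are already known; the class name DataUrls and the method toDataUrl are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class DataUrls {

    // Builds a data URL of the form "data:<mime type>;base64,<payload>".
    static String toDataUrl(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        byte[] sample = "hi".getBytes(StandardCharsets.UTF_8); // stand-in for real file bytes
        System.out.println(toDataUrl("application/pdf", sample));
    }
}
```

The returned string is what would be assigned to the iframe src.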
Below is a demo program that lets the user select a document and displays it in the iframe. Copy it into a text editor, save it as an HTML file, and open it in a browser. Select a document smaller than 5 MB.
<!DOCTYPE html>
<html>
<body>
<input type="file" id="file-uploadUC" />
<iframe src="demo_iframe.htm" id="ifrm" height="200" width="300"></iframe>
<script>
document.getElementById("file-uploadUC").addEventListener("change", rdfile);
function rdfile() {
extn = '';
FileContentBase64 = '';
FileName = '';
if (this.files && this.files[0]) {
if (this.files[0].size <= 5347738) {
var FR = new FileReader();
FileName = this.files[0].name;
extn = FileName.split(".").pop();
FR.onload = function (e) {
FileContentBase64 = e.target.result;
document.getElementById("ifrm").src = FileContentBase64;
};
FR.readAsDataURL(this.files[0]);
}
else {
}
}
else {
}
}
</script>
</body>
</html>
I am using JSoup to dynamically set the href attribute of a <base/> element in an HTML document. This works as expected apart from the fact the closing </base> tag is omitted from the modified HTML.
Is there any way to have JSOUP return valid XHTML?
Input:
<html><head><base href="xyz"/></head><body></body></html>
Output:
<html>
<head>
<base href="https://myhost:8080/myapp/"> <-- missing closing tag
</head>
<body></body>
</html>
Code:
protected String modifyHtml(HttpServletRequest request, String html)
{
Document document = Jsoup.parse(html);
document.outputSettings().escapeMode(EscapeMode.xhtml);
Elements baseElements = document.select("base");
if (!baseElements.isEmpty())
{
Element base = baseElements.get(0);
base.attr("href", getBaseUrl(request));
}
return document.html();
}
In addition to (or instead of) the escape mode, you want to set the syntax:
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
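Put together with the code from the question, this might look like the sketch below. The class name XhtmlOutput and the method toXhtml are illustrative, and the base URL is passed in directly instead of being derived from the request. With the XML syntax setting, void elements such as base should be written in self-closing form.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class XhtmlOutput {

    // Sets the href of any <base> element and serializes with XML syntax.
    static String toXhtml(String html, String baseHref) {
        Document document = Jsoup.parse(html);
        document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        document.select("base").attr("href", baseHref); // does nothing if there is no <base>
        return document.html();
    }

    public static void main(String[] args) {
        String input = "<html><head><base href=\"xyz\"/></head><body></body></html>";
        System.out.println(toXhtml(input, "https://myhost:8080/myapp/"));
    }
}
```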
I have
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
and I want to select the value of its content attribute. I tried the following without success:
Document doc = Jsoup.connect("http://www.somesite.com/index.html").get();
Element link= doc.select("meta").first();
String content = link.attr("content");
But in my html I have:
<div style="overflow: visible;" itemscope="" itemtype="http://schema.org/Article">
<meta itemprop="url" content="http://www.somesite.com/index.html">
<meta itemprop="headline" content="some text">
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
<meta itemprop="dateModified" content="2015-01-26 14:03:16">
As you can see, I am looking for the third meta tag, and I can't select it.
Element link= doc.select("meta").first();
This will select only the first meta-element found; since you have more than one in your second html, you'll get the wrong result.
But here's an example:
final String html = "<div style=\"overflow: visible;\" itemscope=\"\" itemtype=\"http://schema.org/Article\">\n"
+ "<meta itemprop=\"url\" content=\"http://www.somesite.com/index.html\">\n"
+ "<meta itemprop=\"headline\" content=\"some text\">\n"
+ "<meta itemprop=\"datePublished\" content=\"2015-01-26 12:37:00\">\n"
+ "<meta itemprop=\"dateModified\" content=\"2015-01-26 14:03:16\">";
Document doc = Jsoup.parse(html);
Element meta = doc.select("meta[itemprop=datePublished]").first();
String content = meta.attr("content");
System.out.println(content);
Output: 2015-01-26 12:37:00
This will select all meta-elements with attribute itemprop and attribute value datePublished. From all found, just the first is taken. Finally from the single element you can get the value of the content-attribute.
I'm trying to parse a page with Jsoup, but the html doesn't seem to be parsing correctly.
The general structure is:
<html>
<head> ... </head>
<frameset ...>
<frame ...>
#document
<html> ... </html>
</frame>
</frameset>
</html>
When I parse the html and print it with Document doc = Jsoup.parse(html); System.out.println(doc.html()); it prints out the outer html (including #document, but not the frames or inner html).
Does anyone know how to get the inner html with Jsoup, or should I consider using a different library?
Thanks.
Edit: Here's the site I'm parsing. I have a subscription to it; don't know if it'll let any of you in.
http://database.asahi.com/library2/login/login.php
After authentication, it will take you to: http://database.asahi.com/library2/main/start.php
Edit 2:
<html>
<head></head>
<frameset rows="58,*" border="0">
<frame name="Header"> </frame>
<frame name="Introduce">
#document
<html>
<head>hello</head>
<body>hello again</body>
</html>
</frame>
</frameset>
</html>
Then I run:
Document doc = Jsoup.parse(html);
Elements elems = doc.select("frameset > frame:last-child");
// print(elems);
switch(elems.size()) {
case 0: break;
case 1: doc = Jsoup.connect(elems.first().attr("src")).get(); break;
default: break;
}
System.out.println(doc.html());
The parsed html (doc.html()):
<html>
<head></head>
<body>
 #document hello hello again
</body>
</html>
So it's not even finding <frameset>
Any ideas?
Here is how to parse the nested html:
// Fetch the page with frameset
Document doc = Jsoup
.connect("http://database.asahi.com/library2/login/login.php")
.get(); // Add login, password etc
// Determine the frame url you want to parse...
// Note: I assume you want to parse the content of the first frame
Elements elts = doc.select("frameset > frame:first-child");
switch (elts.size()) {
case 0:
// No frame found ...
break;
case 1:
Element frameElt = elts.first();
Document frameDoc = Jsoup
.connect(frameElt.absUrl("src")) // absUrl resolves a relative src against the page URL
.get();
// Add the frameDoc nodes to doc (via frameElt#insertChildren)
frameElt.insertChildren(0, frameDoc.childNodes());
break;
default:
// Strange result...
}
System.out.println(doc.html());