How do I enter a url link using jsoup

I made two simple HTML pages.
page1:
<html>
<head>
</head>
<body>
enter page 2
<p>
some data
</p>
</body>
</html>
page2:
<html>
<head>
</head>
<body>
enter page 1
enter page 3
<p>
some other data
</p>
</body>
</html>
I want to get the links using the jsoup library:
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String absHref = link.attr("href"); // "page2.html/"
Now I want to follow the link from page 1 to page 2 (both files are stored locally on my computer) and parse page 2.
I tried this:
Document doc2 = Jsoup.connect(absHref).get();
But it doesn't work; I get a 404 error.
EDIT:
Following a short reply by @JonasCz I tried the code below, and it works. I just think there must be a better and smarter way.
File file = new File(args[0]);
String path = file.getParent() + "\\";
Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/"); //file = page1.html
Element link = doc.select("a").first();
String href = link.attr("href"); // "page2.html/"
File file2 = new File(path+href);
Document doc2 = Jsoup.parse(file2, "UTF-8", "http://example.com/");
Thank you

You are on the right track, but you are not creating an absolute URL.
Instead of:
String absHref = link.attr("href"); // "page2.html/"
Use:
String absHref = link.absUrl("href"); // this will give you http://example.com/page2.html
The rest is just as you are doing.
http://jsoup.org/apidocs/org/jsoup/nodes/Node.html
Unfortunately, jsoup is not a web crawler; it is only a parser with the ability to connect to and fetch pages directly. The crawling logic, e.g. what to fetch or visit next, is your responsibility to implement. You could search for web crawlers for Java; one of them may be more suitable.
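To make the difference between a relative href and its absolute form concrete, here is a minimal sketch that uses only the JDK, not jsoup. The class and method names (LinkResolver, resolveAgainstBase, resolveLocal) are made up for illustration. The first method resolves an href against a base URI such as the one passed to Jsoup.parse. The second resolves it against the folder of the current page, which may be closer to what is needed when the pages live on the local disk as in the question.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

class LinkResolver {
    // Resolve a (possibly relative) href against a base URI.
    public static String resolveAgainstBase(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    // Resolve an href against the directory that contains the current local page.
    public static Path resolveLocal(Path currentPage, String href) {
        return currentPage.resolveSibling(href).normalize();
    }

    public static void main(String[] args) {
        // prints http://example.com/page2.html
        System.out.println(resolveAgainstBase("http://example.com/", "page2.html"));
        // prints the path of page2.html next to page1.html (separator depends on the OS)
        System.out.println(resolveLocal(Paths.get("site", "page1.html"), "page2.html"));
    }
}
```

A path obtained this way can be turned into a File with toFile() and handed to Jsoup.parse(file, "UTF-8", baseUri), which avoids concatenating path strings by hand.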

Related

Validity of html

I want to pass a complete HTML document as a string and then check whether the given string is valid HTML.
public boolean isValidHTML(String htmlData)
Description: checks whether the given HTML data is valid HTML.
htmlData: an HTML document in the form of a string, containing tags and data.
Returns: true if the given htmlData contains only valid tags with their allowed attributes and possible values, otherwise false.
A valid HTML:
<html>
<head>
<title>Page Title</title>
</head>
<body>
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
<b>This text is bold</b>
</body>
</html>
The Java code should look like this:
import java.util.Scanner;

class HtmlValidator {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String html = "pass the html here";
        System.out.println(isValidHtml(html));
    }

    public static boolean isValidHtml(String html) {
        // Write code here.
        // The method should return true if the given html is valid.
        // Please help.
        return false; // placeholder so that the skeleton compiles
    }
}

Rather than writing regex to parse and check (which is generally A Bad Idea), you're better off using something like jsoup to parse it and check for errors.
From https://jsoup.org/cookbook/input/parse-document-from-string:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
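Keep in mind that jsoup is lenient and repairs most broken markup while parsing, so a successful Jsoup.parse call alone does not prove that the input was valid. As a rough, stricter check, the sketch below uses only the JDK's XML parser to test whether the markup is well-formed, i.e. every tag is closed and properly nested. This is a different technique from the jsoup approach above, and the names WellFormedCheck and isWellFormed are made up for illustration. It does not check tag or attribute names, and it rejects legal HTML such as an unclosed <br> or an entity like &nbsp;, so treat it as a starting point only.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the markup parses as well-formed XML.
    public static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Assumption: the input carries no DOCTYPE; refuse one so that
            // no external DTD is ever fetched.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            // Malformed input (SAXParseException) or a parser setup / IO problem.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><b>bold</b></body></html>")); // true
        System.out.println(isWellFormed("<html><body><b>bold</body></html>"));     // false
    }
}
```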

How to open documents in html(inside iframe or div) by passing document path

I have a requirement in my application: I need to open files (pdf, docx, pptx, ...) within HTML (embedded in a div or iframe) by passing the document paths (URLs) on the click of a button.
I have tried opening the documents outside the application, and that works, but I am stuck on how to approach the problem above.
Thanks.

You can put an iframe inside a div and open the document in that iframe element as follows:
<div>
<iframe src="" id="iframeRight" style="width:600px; height:500px;" frameborder="0">
</iframe>
</div>
In the controller, on click of the button, you can assign the base64 string of the document to the iframe src as follows:
document.getElementById("iframeRight").src = base64String;
Just ensure that the base64 string has the metadata attached to it; for a PDF, for example, it must start with:
data:application/pdf;base64,
Please find below a demo program that lets the user select a document and displays it in the iframe. Copy and paste it into Notepad and run it as an HTML file. Select a document smaller than 5 MB.
<!DOCTYPE html>
<html>
<body>
<input type="file" id="file-uploadUC" />
<iframe src="demo_iframe.htm" id="ifrm" height="200" width="300"></iframe>
<script>
document.getElementById("file-uploadUC").addEventListener("change", rdfile);
function rdfile() {
extn = '';
FileContentBase64 = '';
FileName = '';
if (this.files && this.files[0]) {
if (this.files[0].size <= 5347738) {
var FR = new FileReader();
FileName = this.files[0].name;
extn = FileName.split(".").pop();
FR.onload = function (e) {
FileContentBase64 = e.target.result;
document.getElementById("ifrm").src = FileContentBase64;
};
FR.readAsDataURL(this.files[0]);
}
else {
}
}
else {
}
}
</script>
</body>
</html>
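If the document comes from a server rather than from a file input, the data URI with the metadata prefix described above could be built on the Java side and sent to the page. The following is a minimal sketch under that assumption; DataUriBuilder and toDataUri are hypothetical names, and the MIME type is passed in by the caller rather than detected.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class DataUriBuilder {
    // Build "data:<mimeType>;base64,<payload>" from the raw bytes of a document.
    public static String toDataUri(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        // Stand-in bytes; a real caller would read the document from disk or a database.
        byte[] sample = "hi".getBytes(StandardCharsets.UTF_8);
        System.out.println(toDataUri("application/pdf", sample)); // data:application/pdf;base64,aGk=
    }
}
```

In real use the bytes would typically come from something like Files.readAllBytes, and the resulting string would be assigned to the iframe src exactly as in the JavaScript above.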

JSoup Not Producing Valid XHTML

I am using JSoup to dynamically set the href attribute of a <base/> element in an HTML document. This works as expected apart from the fact the closing </base> tag is omitted from the modified HTML.
Is there any way to have JSOUP return valid XHTML?
Input:
<html><head><base href="xyz"/></head><body></body></html>
Output:
<html>
<head>
<base href="https://myhost:8080/myapp/"> <-- missing closing tag
</head>
<body></body>
</html>
Code:
protected String modifyHtml(HttpServletRequest request, String html)
{
    Document document = Jsoup.parse(html);
    document.outputSettings().escapeMode(EscapeMode.xhtml);
    Elements baseElements = document.select("base");
    if (!baseElements.isEmpty())
    {
        Element base = baseElements.get(0);
        base.attr("href", getBaseUrl(request));
    }
    return document.html();
}

In addition to (or instead of) the escape mode, you want to set the syntax:
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);

How to get attribute content Jsoup?

I have
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
and I want to select the content. I tried this without success:
Document doc = Jsoup.connect("http://www.somesite.com/index.html").get();
Element link = doc.select("meta").first();
String content = link.attr("content");
But in my html I have:
<div style="overflow: visible;" itemscope="" itemtype="http://schema.org/Article">
<meta itemprop="url" content="http://www.somesite.com/index.html">
<meta itemprop="headline" content="some text">
<meta itemprop="datePublished" content="2015-01-26 12:37:00">
<meta itemprop="dateModified" content="2015-01-26 14:03:16">
As you can see, I am looking for the third meta tag, and I can't select it.

Element link= doc.select("meta").first();
This will select only the first meta-element found; since you have more than one in your second html, you'll get the wrong result.
But here's an example:
final String html = "<div style=\"overflow: visible;\" itemscope=\"\" itemtype=\"http://schema.org/Article\">\n"
+ "<meta itemprop=\"url\" content=\"http://www.somesite.com/index.html\">\n"
+ "<meta itemprop=\"headline\" content=\"some text\">\n"
+ "<meta itemprop=\"datePublished\" content=\"2015-01-26 12:37:00\">\n"
+ "<meta itemprop=\"dateModified\" content=\"2015-01-26 14:03:16\">";
Document doc = Jsoup.parse(html);
Element meta = doc.select("meta[itemprop=datePublished]").first();
String content = meta.attr("content");
System.out.println(content);
Output: 2015-01-26 12:37:00
This will select all meta-elements with attribute itemprop and attribute value datePublished. From all found, just the first is taken. Finally from the single element you can get the value of the content-attribute.

Nested html not being parsed by Jsoup

I'm trying to parse a page with Jsoup, but the html doesn't seem to be parsing correctly.
The general structure is:
<html>
<head> ... </head>
<frameset ...>
<frame ...>
#document
<html> ... </html>
</frame>
</frameset>
</html>
When I parse the html and print it with Document doc = Jsoup.parse(html); System.out.println(doc.html()); it prints out the outer html (including #document, but not the frames or inner html).
Does anyone know how to get the inner html with Jsoup, or should I consider using a different library?
Thanks.
Edit: Here's the site I'm parsing. I have a subscription to it; don't know if it'll let any of you in.
http://database.asahi.com/library2/login/login.php
After authentication, it will take you to: http://database.asahi.com/library2/main/start.php
Edit 2:
<html>
<head></head>
<frameset rows="58,*" border="0">
<frame name="Header"> </frame>
<frame name="Introduce">
#document
<html>
<head>hello</head>
<body>hello again</body>
</html>
</frame>
</frameset>
</html>
Then I run:
Document doc = Jsoup.parse(html);
Elements elems = doc.select("frameset > frame:last-child");
// print(elems);
switch (elems.size()) {
    case 0: break;
    case 1: doc = Jsoup.connect(elems.first().attr("src")).get(); break;
    default: break;
}
System.out.println(doc.html());
The parsed html (doc.html()):
<html>
<head></head>
<body>
ï»¿ #document hello hello again
</body>
</html>
So it's not even finding <frameset>
Any ideas?

Here is how to parse the nested html:
// Fetch the page with the frameset
Document doc = Jsoup
        .connect("http://database.asahi.com/library2/login/login.php")
        .get(); // Add login, password etc
// Determine the frame url you want to parse...
// Note: I assume you want to parse the content of the first frame
Elements elts = doc.select("frameset > frame:first-child");
switch (elts.size()) {
    case 0:
        // No frame found ...
        break;
    case 1:
        Element frameElt = elts.first();
        // absUrl resolves a relative src against the base URI of the fetched page
        Document frameDoc = Jsoup
                .connect(frameElt.absUrl("src"))
                .get();
        // Add the frameDoc nodes to doc (via frameElt#insertChildren)
        frameElt.insertChildren(0, frameDoc.childNodes());
        break;
    default:
        // Strange result...
}
System.out.println(doc.html());
