Jsoup HTML parsing from URL - Java

I'm using jsoup in Java to parse HTML pages like these two:
This and this.
In the first case I get the output, but with the second I have a problem with the connection:
doc = Jsoup.connect(url).get();
Some URLs can easily be parsed and I get the output, but there are also URLs that produce empty output like this:
Title: [].
I can't understand what the problem is when both URLs look the same.
This is my code:
Document doc;
try {
    doc = Jsoup.connect("http://ekonomika.sme.sk/c/8047766/s-velkymi-chybami-stavali-aj-budovu-centralnej-banky.html").get();
    String title = doc.title();
    System.out.println("title : " + title);
} catch (IOException e) {
    e.printStackTrace();
}

Take a look at what's in the head of the second URL:
Element h = doc.head();
System.out.println("head : " + h);
You'll see there are some meta refresh tags and an empty title:
<head>
<noscript>
<meta http-equiv="refresh" content="1;URL='/c/8047766/s-velkymi-chybami-stavali-aj-budovu-centralnej-banky.html?piano_d=1'">
</noscript>
<meta http-equiv="refresh" content="10;URL='/c/8047766/s-velkymi-chybami-stavali-aj-budovu-centralnej-banky.html?piano_t=1'">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
Which explains the empty title. You have to follow the redirect.
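Note that jsoup's followRedirects only covers HTTP-level redirects, not <meta http-equiv="refresh"> tags, so this refresh has to be followed by hand. A minimal sketch, assuming the snippet above (refreshTarget is a helper name made up here; the URL is the one from the question):

```java
import java.net.URI;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaRefresh {

    // Pulls the target URL out of a meta refresh tag, e.g.
    // content="10;URL='/c/8047766/...html?piano_t=1'"
    static String refreshTarget(Document doc) {
        Element meta = doc.selectFirst("meta[http-equiv=refresh]");
        if (meta == null) return null;
        String content = meta.attr("content");
        int idx = content.toLowerCase().indexOf("url=");
        if (idx < 0) return null;
        // strip the "N;URL=" prefix and the surrounding quotes
        return content.substring(idx + 4).replace("'", "").trim();
    }

    public static void main(String[] args) throws Exception {
        String base = "http://ekonomika.sme.sk/c/8047766/s-velkymi-chybami-stavali-aj-budovu-centralnej-banky.html";
        Document doc = Jsoup.connect(base).get();
        String target = refreshTarget(doc);
        if (target != null) {
            // resolve the (possibly relative) refresh target against the page URL
            doc = Jsoup.connect(URI.create(base).resolve(target).toString()).get();
        }
        System.out.println("title : " + doc.title());
    }
}
```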

Here is my code for parsing; with this URL I get no output.
package commentparser;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CommentParser {

    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("http://ekonomika.sme.sk/c/8047766/s-velkymi-chybami-stavali-aj-budovu-centralnej-banky.html")
                    .followRedirects(true)
                    .get();
            String title = doc.title();
            System.out.println("title : " + title);

            // Links to discussions
            if (!doc.select("a[href^=/diskusie/reaction_show]").isEmpty()) {
                Elements description = doc.select("a[href^=/diskusie/reaction_show]");
                for (Element link : description) {
                    // get the value of the href attribute
                    System.out.println("Diskusie: " + link.attr("href"));
                }
            }

            // Author of the article
            if (!doc.select("span[class^=autor]").isEmpty()) {
                Elements description = doc.select("span[class^=autor]");
                for (Element link : description) {
                    System.out.println(link.text());
                }
            }

            // All links
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Related

How to get the first `href` string using jsoup?

My code returns all the links on a webpage, but I would like to get only the first link when I search Google for something, for example "android". How do I do that?
Document doc = Jsoup.connect(sharedURL).get();
String title = doc.title();
Elements links = doc.select("a[href]");
stringBuilder.append(title).append("\n");
for (Element link : links) {
    stringBuilder.append("\n").append(" ").append(link.text()).append(" ").append(link.attr("href")).append("\n");
}
Here is my code.
Use Elements#first and Node#absUrl:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;
public class Main {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Wikipedia").get();
        Elements links = doc.select("a[href]");
        Node node = links.first();
        System.out.println(node.absUrl("href"));
    }
}
Output:
https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi

Get src of a class nested in a class with Jsoup

I am a beginner at jsoup, and I would like to get the src of the image in this code:
<div class="detail-info-cover">
<img class="detail-info-cover-img" src="http://fmcdn.mfcdn.net/store/manga/33647/cover.jpg? token=eab4a510fcd567ead4d0d902a967be55576be642&ttl=1592125200&v=1591085412" alt="Ghost Writer (MIKAGE Natsu)"> </div>
If you open that src URL you will see the image I want to get.
Do it as follows:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Main {
    public static void main(String[] args) {
        String html = "<div class=\"detail-info-cover\"> \n"
                + "<img class=\"detail-info-cover-img\" src=\"http://fmcdn.mfcdn.net/store/manga/33647/cover.jpg? token=eab4a510fcd567ead4d0d902a967be55576be642&ttl=1592125200&v=1591085412\" alt=\"Ghost Writer (MIKAGE Natsu)\"> </div>";
        Document doc = Jsoup.parse(html);
        Element image = doc.select("img").first();
        String imageUrl = image.absUrl("src");
        System.out.println(imageUrl);
    }
}
Output:
http://fmcdn.mfcdn.net/store/manga/33647/cover.jpg? token=eab4a510fcd567ead4d0d902a967be55576be642&ttl=1592125200&v=1591085412
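The same thing can be done with a class-scoped selector instead of grabbing the first img, which is safer on a full page containing many images. A sketch against a simplified snippet (the example.com URL is a stand-in):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NestedSelect {
    public static void main(String[] args) {
        String html = "<div class=\"detail-info-cover\">"
                + "<img class=\"detail-info-cover-img\" src=\"http://example.com/cover.jpg\" alt=\"cover\"></div>";
        Document doc = Jsoup.parse(html);
        // descendant combinator: the img with class detail-info-cover-img
        // inside the div with class detail-info-cover
        Element image = doc.selectFirst("div.detail-info-cover img.detail-info-cover-img");
        // absUrl resolves relative URLs; here src is already absolute
        System.out.println(image.absUrl("src"));
        // prints http://example.com/cover.jpg
    }
}
```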

Why are the HTML in Chrome devtools and the HTML parsed by jsoup different?

I'm trying to extract the created date of issues from the Hadoop Jira issue site (https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues).
As you can see in this screenshot, the created date is the text inside the time tag whose class is livestamp (e.g. <time class="livestamp" ...>this text</time>).
So I tried to parse it with the code below.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Select time elements with the livestamp class
        Elements elements = doc.select("time.livestamp");
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e.text());
        }
    }
}
I expected the created date to be extracted, but the actual output is:
# of elements : 0
I figured something was wrong, so I tried to dump the whole HTML of that page with the code below.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Select every element in the HTML document
        Elements elements = doc.select("*");
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e);
        }
    }
}
I compared the HTML in Chrome devtools with the HTML jsoup parsed, element by element, and found that they are different.
Can you explain why this happens and give me some advice on how to extract the created date?
I advise you to get elements with the time tag, and then use select to get the time tags which have the livestamp class. Here is an example:
Elements timeTags = doc.select("time");
Element timeLivestamp = null;
for (Element tag : timeTags) {
    Element livestamp = tag.selectFirst(".livestamp");
    if (livestamp != null) {
        timeLivestamp = livestamp;
        break;
    }
}
I don't know why, but when I use jsoup's .select() method with a compound selector (such as time.livestamp), I get odd results like this.
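A likely deeper cause, though this is an assumption since the Jira frontend is JavaScript-heavy: the created date is rendered client-side, so it never appears in the raw HTML that jsoup downloads, while devtools shows the DOM after JavaScript has run. You can check the raw body directly (the URL is the one from the question):

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RawBodyCheck {
    public static void main(String[] args) throws Exception {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        // execute() returns the response without parsing, so we can inspect
        // the HTML exactly as the server sent it, before any JavaScript runs
        Connection.Response res = Jsoup.connect(url).execute();
        boolean present = res.body().contains("livestamp");
        // If this prints false, the element is added by client-side rendering,
        // which is why devtools and jsoup disagree
        System.out.println("livestamp in raw HTML: " + present);
    }
}
```

The selector itself is fine; on static HTML that actually contains the tag, time.livestamp matches as expected.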

Problems calling Jsoup in a JSP scriptlet

I want to show parsed Elements in my JSP page.
I already have jsoup in my Maven dependencies.
I have a class for parsing with jsoup which returns a string.
package com.user.jsoup;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupClass {
    public String testMethod() throws IOException {
        Document doc = Jsoup.connect("https://www.google.de").get();
        String test = doc.title();
        return test;
    }
}
My JSP is:
<%@page import="com.user.jsoup.JsoupClass"%>
<%
JsoupClass jsclass = new JsoupClass();
out.print(jsclass.testMethod());
%>
Unfortunately it won't display anything.
What am I doing wrong?
I could solve my problem by adding
System.setProperty("https.proxyHost", "host");
System.setProperty("https.proxyPort", "port");
to my JsoupClass.
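As an alternative to JVM-wide system properties, jsoup (since around 1.9) lets you set the proxy on the connection itself. A sketch, with proxy.example.com:8080 as a placeholder for your actual proxy:

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupClass {
    public String testMethod() throws IOException {
        // proxy.example.com / 8080 are placeholders for your proxy host and port
        Connection conn = Jsoup.connect("https://www.google.de")
                .proxy("proxy.example.com", 8080);
        Document doc = conn.get();
        return doc.title();
    }
}
```

This keeps the proxy setting local to the one request instead of affecting every HTTPS connection in the JVM.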

How to get an HTML DOM path by text content?

An HTML file:
<html>
<body>
<div class="main">
<p id="tID">content</p>
</div>
</body>
</html>
I have a String == "content",
and I want to use "content" to get the HTML DOM path:
html body div.main p#tID
Chrome developer tools has this feature (Elements tab, bottom bar). I want to know how to do it in Java.
Thanks for your help :)
Have fun :)
JAVA CODE
import java.io.File;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
public class Teste {
    public static void main(String[] args) {
        try {
            // read and clean the document
            TagNode tagNode = new HtmlCleaner().clean(new File("test.xml"));
            Document document = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
            // use XPath to find the target node
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate("//*[text()='content']", document, XPathConstants.NODE);
            // assemble the jquery/css selector
            String result = "";
            while (node != null && node.getParentNode() != null) {
                result = readPath(node) + " " + result;
                node = node.getParentNode();
            }
            System.out.println(result);
            // prints: html body div#myDiv.foo.bar p#tID
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Gets the id and class attributes of this node
    private static String readPath(Node node) {
        NamedNodeMap attributes = node.getAttributes();
        String id = readAttribute(attributes.getNamedItem("id"), "#");
        String clazz = readAttribute(attributes.getNamedItem("class"), ".");
        return node.getNodeName() + id + clazz;
    }

    // Reads an attribute, prefixing each value with the given token
    private static String readAttribute(Node node, String token) {
        String result = "";
        if (node != null) {
            result = token + node.getTextContent().replace(" ", token);
        }
        return result;
    }
}
XML EXAMPLE
<html>
<body>
<br>
<div id="myDiv" class="foo bar">
<p id="tID">content</p>
</div>
</body>
</html>
EXPLANATIONS
The document object points to the evaluated XML.
The XPath //*[text()='content'] finds every element whose text is 'content' and returns the node.
The while loop walks up to the root node, collecting the id and classes of each element along the way.
MORE EXPLANATIONS
In this new solution I'm using HtmlCleaner, so you can have <br>, for example, and the cleaner will replace it with <br/>.
To use HtmlCleaner, just download the newest jar here.
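Since the rest of this page uses jsoup, it is worth noting (assuming jsoup is an option here) that jsoup can build such a path itself via Element.cssSelector(), with :containsOwn finding the element by its own text. Note that jsoup shortens the path to #tID when the element has a unique id; elements without an id get the full ancestor chain:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CssPath {
    public static void main(String[] args) {
        String html = "<html><body><div class=\"main\"><p id=\"tID\">content</p></div></body></html>";
        Document doc = Jsoup.parse(html);
        // :containsOwn(content) matches elements whose own text contains "content"
        Element p = doc.selectFirst("p:containsOwn(content)");
        // cssSelector() builds a CSS path that uniquely identifies the element;
        // it stops at an id, since an id is already unique in the document
        System.out.println(p.cssSelector());
        // the parent div has no id, so its path includes the ancestors
        System.out.println(p.parent().cssSelector());
    }
}
```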
