How to get the first `href` string using jsoup? - java

My code returns all the links on a webpage, but I would like to get only the first link when I Google search something, for example "android". How do I do that?
Here is my code:
Document doc = Jsoup.connect(sharedURL).get();
String title = doc.title();
Elements links = doc.select("a[href]");

stringBuilder.append(title).append("\n");
for (Element link : links) {
    stringBuilder.append("\n").append(" ").append(link.text()).append(" ").append(link.attr("href")).append("\n");
}

Elements#first and Node#absUrl
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Wikipedia").get();
        Elements links = doc.select("a[href]");
        Node node = links.first();
        System.out.println(node.absUrl("href"));
    }
}
Output:
https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi
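
Alternatively, Element#selectFirst returns the first match directly (or null when nothing matches), so you don't need to fetch the whole Elements list just to take its head. A minimal sketch against the same page:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstLink {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Wikipedia").get();
        // selectFirst returns the first matching element, or null if there is none
        Element first = doc.selectFirst("a[href]");
        if (first != null) {
            System.out.println(first.absUrl("href"));
        }
    }
}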

Related

Liferay 7 Extending EditableFragmentEntryProcessor

I want to extend the functionality of EditableFragmentEntryProcessor in Liferay 7.4 (<lfr-editable> tags in fragments) by searching the text for syntax like {user.name} and replacing it with a value from a response from my external API.
E.g. I type something like
This is super fragment and you are {user.name}.
and the result should be
This is super fragment and you are Steven.
I achieved this by creating my own FragmentEntryProcessor, but I did it by putting a fragment configuration variable in my custom tag:
<my-data-api> ${configuration.testVariable} </my-data-api>
I tried something like this before:
<my-data-api>
    <lfr-editable id="some-id" type="text">
        some text to edit
    </lfr-editable>
</my-data-api>
and it doesn't work (and I know why).
So I want to achieve something like this. I'd appreciate any help or hints.
EDIT:
Here is my custom FragmentEntryProcessor:
package com.example.fragmentEntryProcessorTest.portlet;

import com.example.test.api.api.TestPortletApi;
import com.liferay.fragment.exception.FragmentEntryContentException;
import com.liferay.fragment.model.FragmentEntryLink;
import com.liferay.fragment.processor.FragmentEntryProcessor;
import com.liferay.fragment.processor.FragmentEntryProcessorContext;
import com.liferay.portal.kernel.exception.PortalException;
import com.liferay.portal.kernel.util.Validator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

import java.io.IOException;

/**
 * @author kabatk
 */
@Component(
    immediate = true, property = "fragment.entry.processor.priority:Integer=100",
    service = FragmentEntryProcessor.class
)
public class FragmentEntryProcessorApiDataCopy implements FragmentEntryProcessor {

    private static final String _TAG = "my-data-api";

    @Reference
    private TestPortletApi _api;

    @Override
    public String processFragmentEntryLinkHTML(
            FragmentEntryLink fragmentEntryLink, String html,
            FragmentEntryProcessorContext fragmentEntryProcessorContext)
        throws PortalException {

        Document document = _getDocument(html);
        Elements elements = document.getElementsByTag(_TAG);
        elements.forEach(
            element -> {
                String text = element.text();
                String attrValue = element.attr("dataType");
                String classValues = element.attr("classes");
                Element myElement = null;
                String result;
                try {
                    result = _api.changeContent(text);
                } catch (IOException e) {
                    e.printStackTrace();
                    result = "";
                }
                if (attrValue.equals("img")) {
                    myElement = document.createElement("img");
                    myElement.attr("class", classValues);
                    myElement.attr("src", result);
                } else if (attrValue.equals("text")) {
                    myElement = document.createElement("div");
                    myElement.attr("class", classValues);
                    myElement.html(result);
                }
                if (myElement != null)
                    element.replaceWith(myElement);
                else
                    element.replaceWith(
                        document.createElement("div").text("Error")
                    );
            });
        Element bodyElement = document.body();
        return bodyElement.html();
    }

    @Override
    public void validateFragmentEntryHTML(String html, String configuration)
        throws PortalException {

        Document document = _getDocument(html);
        Elements elements = document.getElementsByTag(_TAG);
        for (Element element : elements) {
            if (Validator.isNull(element.attr("dataType"))) {
                throw new FragmentEntryContentException("Missing 'dataType' attribute!");
            }
        }
    }

    private Document _getDocument(String html) {
        Document document = Jsoup.parseBodyFragment(html);
        Document.OutputSettings outputSettings = new Document.OutputSettings();
        outputSettings.prettyPrint(false);
        document.outputSettings(outputSettings);
        return document;
    }
}

Why is the HTML in Chrome DevTools different from the HTML parsed by jsoup?

I'm trying to extract the created date of issues from the HADOOP Jira issue site (https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues).
As you can see in this screenshot, the created date is the text inside the time tag whose class is livestamp (e.g. <time class="livestamp" ...>this text</time>).
So I tried to parse it with the code below.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        Elements elements = doc.select("time.livestamp"); // selects time tags with the livestamp class
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e.text());
        }
    }
}
I expected the created date to be extracted, but the actual output is
# of elements : 0
Something is clearly wrong, so I tried to parse the whole HTML of that page with the code below.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        Elements elements = doc.select("*"); // selects every element in the HTML document
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e);
        }
    }
}
I compared the HTML in Chrome DevTools with the HTML jsoup parsed, element by element, and found that they are different.
Can you explain why this happens, and give me some advice on how to extract the created date?
I advise you to get the elements with the "time" tag first, and then select the ones that have the "livestamp" class. Here is an example:
Elements timeTags = doc.select("time");
Element timeLivestamp = null;
for (Element tag : timeTags) {
    Element livestamp = tag.selectFirst(".livestamp");
    if (livestamp != null) {
        timeLivestamp = livestamp;
        break;
    }
}
I don't know why, but when I use jsoup's .select() method with a compound selector (like the time.livestamp you used), I get odd results like this.
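
Note that the likely root cause here is that the Jira page builds much of its DOM with JavaScript: Chrome DevTools shows the DOM after scripts have run, while jsoup only parses the raw HTML the server returns and never executes JavaScript, so the livestamp element may simply not be present in what jsoup sees. One workaround, sketched below under the assumption that the site exposes the standard Jira REST API, is to skip HTML scraping and request the created field directly:
import java.io.IOException;

import org.jsoup.Jsoup;

public class CreatedDateViaRest {
    public static void main(String[] args) throws IOException {
        // Jira's REST API returns JSON; ignoreContentType lets jsoup fetch it anyway
        String json = Jsoup.connect(
                "https://issues.apache.org/jira/rest/api/2/issue/HADOOP-16381?fields=created")
            .ignoreContentType(true)
            .execute()
            .body();
        // A real JSON parser would be cleaner; printing the raw response keeps the sketch short
        System.out.println(json);
    }
}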

How to access data from Google Finance table with JSoup?

I am trying to scrape the data from the table where it states 'range', '52 week', 'open', etc. on this site: https://finance.google.com/finance?q=aapl&ei=czANWqmhNoPYswHV9YnwBg
The code below scrapes the right data, but not in the format I want. It outputs the contents of the whole table as one string, whereas I would like each part of the table to be output separately on its own line.
Any help would be greatly appreciated.
Thank you!
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JSoup {
    public static void main(String[] args) throws Exception {
        String ticker = "AAPL";
        String url = "https://finance.google.com/finance?q=" + ticker + "&ei=czANWqmhNoPYswHV9YnwBg";
        Document document = Jsoup.connect(url).get();
        for (Element row : document.select("table.snap-data")) {
            final String key = row.select(".range").text();
            final String val = row.select(".val").text();
            System.out.println(key);
            System.out.println(val);
        }
    }
}
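
One way to get one line per entry, sketched under the assumption that each row of the snap-data table is a tr whose label cell has class key and whose value cell has class val (class names inferred from the page markup, so verify them in DevTools), is to iterate over the rows instead of the table:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SnapDataRows {
    public static void main(String[] args) throws IOException {
        String ticker = "AAPL";
        String url = "https://finance.google.com/finance?q=" + ticker + "&ei=czANWqmhNoPYswHV9YnwBg";
        Document document = Jsoup.connect(url).get();
        // Select the individual rows, not the table, so each entry prints on its own line
        for (Element row : document.select("table.snap-data tr")) {
            String key = row.select("td.key").text();   // assumed label-cell class
            String val = row.select("td.val").text();   // assumed value-cell class
            System.out.println(key + ": " + val);
        }
    }
}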

Getting sub links of a URL using jsoup

Consider a URL like www.example.com. It may have plenty of links; some may be internal and others external. I want to get a list of all the sub-links only, not the sub-sub-links and beyond.
E.g., if there are four links as follows:
1) www.example.com/images/main
2) www.example.com/data
3) www.example.com/users
4) www.example.com/admin/data
then out of the four, only 2 and 3 are of use, as they are sub-links rather than sub-sub (and so on) links. Is there a way to achieve this through jsoup? If it cannot be done with jsoup, can someone point me to some other Java API?
Also note that each link should belong to the parent URL that is initially sent (i.e. www.example.com).
If I understand correctly, a sub-link contains exactly one slash, so you can try counting the number of slashes, for example:
List<String> list = new ArrayList<>();
list.add("www.example.com/images/main");
list.add("www.example.com/data");
list.add("www.example.com/users");
list.add("www.example.com/admin/data");

for (String link : list) {
    if ((link.length() - link.replaceAll("[/]", "").length()) == 1) {
        System.out.println(link);
    }
}
link.length() counts the number of characters, and link.replaceAll("[/]", "").length() is the length with the slashes removed, so the difference between the two is the number of slashes.
If the difference equals one, it is a sub-link; otherwise it is not.
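
A plain slash count breaks down once URLs carry trailing slashes or query strings. A more robust variant, sketched with java.net.URI (prefixing a scheme first, since the sample links have none), counts path segments instead:
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class SubLinkFilter {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        list.add("www.example.com/images/main");
        list.add("www.example.com/data");
        list.add("www.example.com/users");
        list.add("www.example.com/admin/data");

        for (String link : list) {
            // The sample links have no scheme, so add one before parsing
            URI uri = URI.create("http://" + link);
            // getPath() ignores any query string or fragment
            String[] segments = uri.getPath().split("/");
            // A path like "/data" splits into ["", "data"]: exactly two entries
            if (segments.length == 2) {
                System.out.println(link);
            }
        }
    }
}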
EDIT
How will I scan the whole website for sub-links?
One answer is the site's robots.txt file (the Robots Exclusion Standard): it lists path patterns for the site, from which you can extract sub-links; see for example https://stackoverflow.com/robots.txt. The idea is to read this file and pull the sub-links out of it. Here is a piece of code that can help you:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsTxtLinks {
    public static void main(String[] args) throws Exception {
        // Your web site
        String website = "http://stackoverflow.com";
        // We will read the URL https://stackoverflow.com/robots.txt
        URL url = new URL(website + "/robots.txt");
        // List of your sub-links
        List<String> list;
        // Read the file with BufferedReader
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String subLink;
            list = new ArrayList<>();
            // Loop through the file
            while ((subLink = in.readLine()) != null) {
                // If the line matches this regex, add the sub-link to the list
                if (subLink.matches("Disallow: \\/\\w+\\/")) {
                    list.add(website + "/" + subLink.replace("Disallow: /", ""));
                } else {
                    System.out.println("not match");
                }
            }
        }
        // Print the result
        System.out.println(list);
    }
}
This will show you :
[https://stackoverflow.com/posts/, https://stackoverflow.com/posts?,
https://stackoverflow.com/search/, https://stackoverflow.com/search?,
https://stackoverflow.com/feeds/, https://stackoverflow.com/feeds?,
https://stackoverflow.com/unanswered/,
https://stackoverflow.com/unanswered?, https://stackoverflow.com/u/,
https://stackoverflow.com/messages/, https://stackoverflow.com/ajax/,
https://stackoverflow.com/plugins/]
Here is a demo of the regex that I use.
Hope this can help you.
To scan the links on a web page, you can use the jsoup library.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class read_data {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("**your_url**").get();
            Elements links = doc.select("a");
            List<String> list = new ArrayList<>();
            for (Element link : links) {
                // abs:href resolves each link against the page's base URL
                list.add(link.attr("abs:href"));
            }
        } catch (IOException ex) {
        }
    }
}
list can be used as suggested in the previous answer.
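
Tying the two answers together, here is a sketch (using example.com as a stand-in URL) that keeps only the direct sub-links of the page that was fetched, applying the one-slash rule from the earlier answer to the absolute URLs jsoup extracts:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DirectSubLinks {
    public static void main(String[] args) throws IOException {
        String site = "http://www.example.com";
        Document doc = Jsoup.connect(site).get();

        List<String> subLinks = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            String url = link.attr("abs:href");
            // Keep only links under the parent URL...
            if (!url.startsWith(site + "/")) {
                continue;
            }
            // ...whose remainder contains no further slash, i.e. a direct sub-link
            String rest = url.substring(site.length() + 1);
            if (!rest.isEmpty() && !rest.contains("/")) {
                subLinks.add(url);
            }
        }
        System.out.println(subLinks);
    }
}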
The code for reading all the links on a website is given below. I have used http://stackoverflow.com/ for illustration. I would recommend you go through a company's terms of use before scraping its website.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class readAllLinks {

    public static Set<String> uniqueURL = new HashSet<String>();
    public static String my_site;

    public static void main(String[] args) {
        readAllLinks obj = new readAllLinks();
        my_site = "stackoverflow.com";
        obj.get_links("http://stackoverflow.com/");
    }

    private void get_links(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a");
            links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((this_url) -> {
                boolean add = uniqueURL.add(this_url);
                if (add && this_url.contains(my_site)) {
                    System.out.println(this_url);
                    get_links(this_url);
                }
            });
        } catch (IOException ex) {
        }
    }
}
You will get list of all the links in uniqueURL field.
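One caveat: the recursion above has no depth limit, so on a large site it can run for a very long time or overflow the stack. A hedged tweak, with a hypothetical depth parameter that is not part of the original answer, caps how deep the crawl goes:
private void get_links(String url, int depth) {
    // Stop recursing once the depth budget is used up
    if (depth <= 0) {
        return;
    }
    try {
        Document doc = Jsoup.connect(url).get();
        for (org.jsoup.nodes.Element link : doc.select("a")) {
            String thisUrl = link.attr("abs:href");
            if (uniqueURL.add(thisUrl) && thisUrl.contains(my_site)) {
                System.out.println(thisUrl);
                get_links(thisUrl, depth - 1);
            }
        }
    } catch (IOException ex) {
        // Ignore pages that fail to load, as the original code does
    }
}
Call it as obj.get_links("http://stackoverflow.com/", 2); to crawl at most two levels deep.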

How to get element by tags using JSoup? - java

How to get element by tags using JSoup (http://jsoup.org/)?
I have the following input and require the following output, but I am not getting the text inside the <source>...</source> tags:
[in:]
<html>
<something>
<source>foo bar bar</source>
<something>
<source>foo foo bar</source>
</html>
[desired out:]
foo bar bar
foo foo bar
I have tried this:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HelloJsoup {
    public static void main(String[] args) throws IOException {
        String br = "<html><source>foo bar bar</source></html>";
        Document doc = Jsoup.parse(br);
        //System.out.println(doc);
        for (Element sentence : doc.getElementsByTag("source"))
            System.out.print(sentence);
    }
}
but it outputs:
<source></source>
In HTML, <source> is a void element that cannot contain text, so jsoup's default HTML parser discards the inner text. You need to use the xmlParser() instead, which you can pass in to the parse() method (remember to import org.jsoup.parser.Parser):
String br = "<html><source>foo bar bar</source></html>";
Document doc = Jsoup.parse(br, "", Parser.xmlParser());
for (Element sentence : doc.getElementsByTag("source"))
    System.out.println(sentence.text());
More on this in the docs: http://jsoup.org/apidocs/org/jsoup/parser/Parser.html#xmlParser()
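
Applying the same approach to the multi-tag input from the question confirms that both text nodes come back (a quick self-contained check, with the input inlined as a string):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class SourceTags {
    public static void main(String[] args) {
        String html = "<html>"
            + "<something><source>foo bar bar</source></something>"
            + "<something><source>foo foo bar</source></something>"
            + "</html>";
        // The XML parser preserves text inside <source>, unlike the HTML parser
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        for (Element sentence : doc.getElementsByTag("source")) {
            System.out.println(sentence.text());
        }
        // Prints:
        // foo bar bar
        // foo foo bar
    }
}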
