I'm trying to exclude a list of links which I do not wish to crawl.
I could not find anything in the documentation about skipping user-specified URLs.
I was, however, able to do it like this:
if (!(link.attr("href").startsWith("https://blog.olark.com") ||
      link.attr("href").startsWith("http://www.olark.com") ||
      link.attr("href").startsWith("https://www.olark.com") ||
      link.attr("href").startsWith("https://olark.com") ||
      link.attr("href").startsWith("http://olark.com"))) {
    this.links.add(link.absUrl("href")); // get the absolute url and add it to the links list
}
Of course this isn't a proper way to do it, so I put the excluded URLs in a List and tried to loop through it; however, it did not exclude a single link (code below):
List<String> exclude = Arrays.asList("https://blog.olark.com", "http://www.olark.com",
        "https://www.olark.com", "https://olark.com", "http://olark.com");

for (String string : exclude) {
    if (!link.attr("href").startsWith(string)) {
        this.links.add(link.absUrl("href")); // get the absolute url and add it to the links list
    }
}
So my question is: How do I make it avoid a list of urls? I'm thinking something similar to the second code block I've written, but I'm open for ideas or fixes.
You can start by selecting and removing all the unwanted links. Then you can process the document without any checks.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupQuestion51072084 {

    public static void main(final String[] args) throws IOException {
        // sample document with wanted and unwanted links
        Document doc = Jsoup.parse("<html><body>" +
                "<a href=\"https://blog.olark.com/post\">unwanted</a>" +
                "<a href=\"http://www.olark.com/\">unwanted</a>" +
                "<a href=\"https://example.com/\">wanted</a>" +
                "</body></html>");
        System.out.println("Document before modifications:\n" + doc);
        // select links having "olark.com" in href
        Elements links = doc.select("a[href*=olark.com]");
        System.out.println();
        System.out.println("Links to remove: " + links);
        System.out.println();
        // remove them from the document
        for (Element link : links) {
            link.remove();
        }
        System.out.println("Document without unwanted links:\n" + doc);
    }
}

and the output is:

Document before modifications:
<html>
 <head></head>
 <body>
  <a href="https://blog.olark.com/post">unwanted</a><a href="http://www.olark.com/">unwanted</a><a href="https://example.com/">wanted</a>
 </body>
</html>

Links to remove: <a href="https://blog.olark.com/post">unwanted</a>
<a href="http://www.olark.com/">unwanted</a>

Document without unwanted links:
<html>
 <head></head>
 <body>
  <a href="https://example.com/">wanted</a>
 </body>
</html>
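If you would rather keep the exclusion list from the question, note why that loop fails: it adds the link once for every prefix it does *not* match, so every link gets added. Checking the whole list before adding fixes it. A minimal sketch (the href values here are just examples standing in for `link.attr("href")`):

```java
import java.util.Arrays;
import java.util.List;

public class ExcludeList {
    public static void main(String[] args) {
        List<String> exclude = Arrays.asList(
                "https://blog.olark.com", "http://www.olark.com",
                "https://www.olark.com", "https://olark.com", "http://olark.com");
        // Example hrefs; in the crawler these would come from link.attr("href").
        for (String href : new String[]{"https://blog.olark.com/post", "https://example.com/"}) {
            // Add the link only if it matches none of the excluded prefixes.
            boolean excluded = exclude.stream().anyMatch(href::startsWith);
            if (!excluded) {
                System.out.println("adding " + href); // e.g. this.links.add(link.absUrl("href"))
            }
        }
    }
}
```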
Related
I want to select all HTML tags with Jsoup
<html>
<head></head>
<body>
.....
</body>
</html>
I tried that:
Document dc = Jsoup.parse(fichier, "utf-8");
String tags = dc.outerHtml();
Your question isn't clear, but it seems that you simply want to get all the tag node names. To do so, you can parse the HTML, call getAllElements(), and then iterate over the resulting list, getting the nodeName() of each element. Using Java 8's forEach, your code could look something like:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JSoup {

    public static void main(String[] args) {
        String fichier = "<html>" +
                "<head></head>" +
                "<body></body>" +
                "</html>";
        // note: when the input is a String, the second argument is the base URI, not a charset
        Document dc = Jsoup.parse(fichier, "utf-8");
        Elements elements = dc.getAllElements();
        elements.forEach(element -> System.out.println(element.nodeName()));
    }
}
This code prints all the tag node names:
#document
html
head
body
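If you only need each distinct tag name once, and want to skip the synthetic #document entry, a small stream-based variation works (same idea as above; the inline HTML here is just a sample):

```java
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DistinctTags {
    public static void main(String[] args) {
        Document dc = Jsoup.parse("<html><head></head><body><p>a</p><p>b</p></body></html>");
        // Collect each tag name once, skipping the synthetic #document root.
        List<String> names = dc.getAllElements().stream()
                .map(Element::nodeName)
                .filter(n -> !n.startsWith("#"))
                .distinct()
                .collect(Collectors.toList());
        System.out.println(names); // [html, head, body, p]
    }
}
```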
I want to get only:
http://tamilblog.ishafoundation.org/nalvazhvu/vazhkai/
and not all these:
I just want to apply this to my loop (section):
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NewClassssssss {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org/page/3//").get();
        Elements section = doc.select("section#content");
        Elements article = section.select("article");
        Elements links = doc.select("a[href]");
        for (Element a : section) {
            // System.out.println("Title : \n" + a.select("a").text());
            System.out.println(a.select("a[href]"));
        }
        System.out.println(links);
    }
}
There are some problems in the code:
1. Invalid search scope
Elements links = doc.select("a[href]");
The above line gets all links from the whole document instead of the articles only.
2. Invalid node used in loop
for (Element a : section) {
// ...
}
The above for loop works on the sections instead of the links.
3. Repetitive calls to select method
Elements section = doc.select("section#content");
Elements article = section.select("article");
Elements links = doc.select("a[href]");
It's not necessary to perform a selection for each node in the hierarchy. Jsoup can navigate through it for you. Those three lines can be replaced with one line:
Elements links = doc.select("section#content article a");
SAMPLE CODE
Here is a sample resolving the three preceding points:
Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org/nalvazhvu/vazhkai/").get();

for (Element a : doc.select("section#content article a")) {
    System.out.println("Title : \n" + a.text());
    System.out.println(a.absUrl("href")); // absUrl is used here for *always* having absolute urls.
}
OUTPUT
Title :
http://tamilblog.ishafoundation.org/kalyana-parisaga-isha-kaattupoo/
Title :
இதயம் பேசுகிறது
http://tamilblog.ishafoundation.org/isha-pakkam/idhyam-pesugiradhu/
Title :
வாழ்க்கை
http://tamilblog.ishafoundation.org/nalvazhvu/vazhkai/
Title :
கல்யாணப் பரிசாக ஈஷா காட்டுப்பூ…
http://tamilblog.ishafoundation.org/kalyana-parisaga-isha-kaattupoo/
... (truncated for brevity)
Alternatively, to print the absolute URL of every link:

Elements links = document.select("a[href]");

for (Element link : links) {
    System.out.println(link.attr("abs:href"));
}
I need help with my Java project using Jsoup (if you think there is a more efficient way to achieve the purpose, please let me know). The purpose of my program is to parse certain useful information from different URLs and put it in a text file. I am not an expert in HTML or JavaScript, therefore, it has been difficult for me to code in Java exactly what I want to parse.
In the website that you see in the code below as one of the examples, the information that interests me to parse with Jsoup is everything you can see in the table under “Routing”(Route, Location, Vessel/Voyage, Container Arrival Date, Container Departure Date; = Origin, Seattle SSA Terminal T18, 26 Jun 15 A, 26 Jun 15 A… and so on).
So far, with Jsoup we are only able to parse the title of the website, yet we have been unsuccessful in getting any of the body.
Here is the code that I used, which I got from an online source:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Jsouptest71115 {

    public static void main(String[] args) throws Exception {
        String url = "http://google.com/gentrack/trackingMain.do"
                + "?trackInput01=999061985";
        Document document = Jsoup.connect(url).get();
        String title = document.title();
        System.out.println("title : " + title);
        String body = document.select("body").text();
        System.out.println("Body: " + body);
    }
}
Working code:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.ArrayList;
public class Sample {
public static void main(String[] args) {
String url = "http://homeport8.apl.com/gentrack/blRoutingPopup.do";
try {
Connection.Response response = Jsoup.connect(url)
.data("blNbr", "999061985") // tracking number
.method(Connection.Method.POST)
.execute();
Element tableElement = response.parse().getElementsByTag("table")
.get(2).getElementsByTag("table")
.get(2);
Elements trElements = tableElement.getElementsByTag("tr");
ArrayList<ArrayList<String>> tableArrayList = new ArrayList<>();
for (Element trElement : trElements) {
ArrayList<String> columnList = new ArrayList<>();
for (int i = 0; i < 5; i++) {
columnList.add(i, trElement.children().get(i).text());
}
tableArrayList.add(columnList);
}
System.out.println("Origin/Location: "
+tableArrayList.get(1).get(1));// row and column number
System.out.println("Discharge Port/Container Arrival Date: "
+tableArrayList.get(5).get(3));
} catch (IOException e) {
e.printStackTrace();
}
}
}
Output:
Origin/Location: SEATTLE SSA TERMINAL (T18), WA
Discharge Port/Container Arrival Date: 23 Jul 15 E
You need to utilize document.select("body"), where the input to the select method is a CSS selector. To learn more about CSS selectors, google them or read this. Using CSS selectors you can easily identify parts of a web page's body.

In your particular case you will have a different problem, though: the table you are after is inside an IFrame. If you look at the HTML of the page you are visiting, the iframe's URL is "http://homeport8.apl.com/gentrack/blRoutingFrame.do", and if you visit that URL directly to access its content, you will get an exception, which is presumably a restriction imposed by the server. To get the content properly you need to visit two URLs via Jsoup: 1. http://homeport8.apl.com/gentrack/trackingMain.do?trackInput01=999061985 and 2. http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985

For the first URL you'll get nothing useful, but for the second you'll get the tables of interest. Then try using document.select("table"), which will give you a list of tables; iterate over this list and find the table of your interest. Once you have the table, use Element.select("tr") to get its rows, and then for each "tr" use Element.select("td") to get the cell data.

The webpage you are visiting doesn't use CSS class and id selectors, which would have made reading it with Jsoup a lot easier, so I am afraid iterating over document.select("table") is your best and easiest option.

Good luck.
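The table/tr/td walk described above can be sketched like this (the blRoutingFrame.do URL is taken from this answer; the separator format is just for illustration):

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableWalk {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect(
                "http://homeport8.apl.com/gentrack/blRoutingFrame.do?trackInput01=999061985").get();
        // Walk every table, row, and cell; print each row's cells separated by " | ".
        for (Element table : doc.select("table")) {
            for (Element row : table.select("tr")) {
                StringBuilder line = new StringBuilder();
                for (Element cell : row.select("td")) {
                    if (line.length() > 0) line.append(" | ");
                    line.append(cell.text());
                }
                System.out.println(line);
            }
        }
    }
}
```

Printing everything first, then narrowing down to the table and cells you need, is usually the quickest way to find the right indices on a page without class or id attributes.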
<div class="serieSelector serieSelected" data-serie="36" data-title="Steps">
<div class="value fontGreyBold">2620</div>
<div id="stepsPulse" class="fontGreyLight">Steps</div>
</div>
I am currently working on an Android project which needs to parse some data from a website and display it in TextViews. As seen above, I need to display the highlighted value, which is "2620". I'm using Jsoup, and that is the element data obtained from the website. I don't know what selector to use exactly.
try {
    Document document = Jsoup.connect(url).get();
    Elements stepstaken = document
            .select("div[class=measureValue fontGreyBold]span[class]");
    stta = stepstaken.attr("class");
} catch (IOException e) {
    e.printStackTrace();
}
The above code doesn't work so any possible replies are appreciated. Thanks!
I always just use PHP Simple DOM Parser whenever I need to parse anything. Then you'd just create a simple REST API that returns the parsed results. Works like a charm. :)
Try this selector
document.select("div.value.fontGreyBold");
Example
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupParser {

    public static void main(String[] args) {
        String html = "<div class=\"serieSelector serieSelected\" data-serie=\"36\" data-title=\"Steps\">"
                + "<div class=\"value fontGreyBold\">2620</div>"
                + "<div id=\"stepsPulse\" class=\"fontGreyLight\">Steps</div>"
                + "</div>";
        Document document = Jsoup.parse(html);
        Elements stepstaken = document.select("div.value.fontGreyBold");
        System.out.println(stepstaken.text()); // prints 2620
    }
}
I am trying to display a quick summary of a long HTML message sent by a user. I would like to do this in Java rather than JavaScript. How can I achieve this? I have looked at jsoup and HtmlUnit but cannot find a method that does it!
With jsoup you can parse the document, select the inner element where the text content is too long and replace its text content with an excerpt.
1. Parse a document
2. Find an element
3. Extract the text content
4. Compute a replacement string
5. Set the new text content
It is all in their doc.
All in one it results in:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class A {

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>";
        Document doc = Jsoup.parse(html);
        Element pTag = doc.select("body > p").first(); // the p tag
        String pContent = pTag.text();
        pContent = pContent.substring(0, 7) + "... (too long)";
        pTag.text(pContent);
        System.out.println(doc);
    }
}
Prints:
<html>
 <head>
  <title>First parse</title>
 </head>
 <body>
  <p>Parsed ... (too long)</p>
 </body>
</html>
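To summarize an arbitrary message rather than a single known <p>, one option is to work on the plain text of the whole document and cut it at a word boundary. A minimal sketch along those lines (the helper name and the 20-character limit are made up for the example):

```java
import org.jsoup.Jsoup;

public class HtmlExcerpt {
    // Strip tags with Jsoup, then truncate at the last space before maxLen.
    static String excerpt(String html, int maxLen) {
        String text = Jsoup.parse(html).text();
        if (text.length() <= maxLen) {
            return text;
        }
        int cut = text.lastIndexOf(' ', maxLen);
        if (cut <= 0) cut = maxLen; // no space found: hard cut
        return text.substring(0, cut) + "...";
    }

    public static void main(String[] args) {
        String html = "<p>This is a rather long message that should be shortened.</p>";
        System.out.println(excerpt(html, 20)); // prints: This is a rather...
    }
}
```

Unlike truncating a single element, this discards the markup entirely, which is usually what you want for a preview line.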