jsoup - extract text from wikipedia article - java

I'm writing Java code to run NLP tasks on text from Wikipedia. How can I use JSoup to extract all the text of a Wikipedia article (for example, all the text at http://en.wikipedia.org/wiki/Boston)?

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").get();
Element contentDiv = doc.select("div[id=content]").first();
contentDiv.toString(); // The result
This retrieves the formatted (HTML) content, of course. If you want "raw" text, you can filter the result with Jsoup.clean or call contentDiv.text().

Document doc = Jsoup.connect(url).get();
// Every paragraph inside the article body
Elements paragraphs = doc.select(".mw-content-ltr p");
// Print the paragraphs in document order; an empty result prints nothing
for (Element p : paragraphs) {
    System.out.println(p.text());
}

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get();
// Select by id: "#iamID" stands in for the id of the tag you want
Element elementWithId = doc.select("#iamID").first();
System.out.println(elementWithId);
OR
// Select by class: ".iamCLASS" stands in for the class of the tags you want
Elements elementsWithClass = doc.select(".iamCLASS");
System.out.println(elementsWithClass);

Related

Java JSoup: article extraction with image links and paragraph

I am currently building an article content extraction application using Jsoup and Java. My problem is that when I scrape an article, Jsoup tends to return a list of Element objects rather than preserving the order of the article. For example, a normal article with more than one image could have an order like this: (Title, sapo, image, paragraph, image, paragraph, paragraph, image, paragraph). So how can I scrape the main content of the website (text and image links) without losing its order?
Below is my idea for doing that but it doesn't work.
int cur = 0;
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("div");
for (Element element : elements) {
if (element.select("div[type=\"Photo\"] img").hasAttr("src")) {
Elements temp = element.select("div[type=\"Photo\"] img");
System.out.println(temp.get(cur).attr("src"));
cur++;
}
System.out.println(element.select("p span").text());
System.out.println("");
}
If you wanted to extract the article data from the sites that you linked to in the comment, you could do something like this:
Document doc = Jsoup.connect(url).get();
// Full article
Elements elements = doc.select("div.sidebar-1");
System.out.println("## Article title:");
System.out.println(elements.select("h1.title-detail").text());
System.out.println("## Article summary:");
System.out.println(elements.select("p.description").text());
// Images and paragraphs
for (Element e : elements.select("article.fck_detail p,figure")) {
if (e.is("p")) {
System.out.println("## Paragraph");
System.out.println(e.text());
} else {
System.out.println("## Image (image URL)");
System.out.println(e.select("img[src]").attr("src"));
}
}
The idea is this:
1. Find the outermost container that contains the full article.
2. Extract the title and the summary.
3. Loop through the image (figure) and paragraph (p) elements of the article; the order will be preserved automatically.
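A minimal, self-contained sketch of the same idea, run against an inline HTML string instead of a live site. The sample markup, the class name OrderedArticleSketch and the method extractInOrder are made up for illustration; only the article.fck_detail, p and figure selectors come from the answer above. It assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class OrderedArticleSketch {

    // Hypothetical markup, loosely modelled on the selectors used in the answer.
    static final String SAMPLE_HTML =
            "<article class=\"fck_detail\">"
            + "<p>First paragraph</p>"
            + "<figure><img src=\"a.jpg\"></figure>"
            + "<p>Second paragraph</p>"
            + "<figure><img src=\"b.jpg\"></figure>"
            + "</article>";

    // Collects paragraphs and image URLs in the order they appear in the document.
    static List<String> extractInOrder(String html) {
        Document doc = Jsoup.parse(html);
        List<String> items = new ArrayList<>();
        // A grouped selector is matched in a single pass over the DOM,
        // so the matches come back in document order.
        for (Element e : doc.select("article.fck_detail p, article.fck_detail figure")) {
            if (e.is("p")) {
                items.add("P: " + e.text());
            } else {
                items.add("IMG: " + e.select("img[src]").attr("src"));
            }
        }
        return items;
    }

    public static void main(String[] args) {
        extractInOrder(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Selecting both element types in one query is what keeps the order; two separate select calls (one for p, one for figure) would return two independent lists.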

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text. It doesn't matter what the text is; it just needs to be non-empty (not an image-only link, etc.).
Example of links I want:
Link to Some Page
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the function text() returns clean text, so any HTML fragments inside the link are not returned.
Document doc = // get the doc
Elements linksOnPage = doc.select("a");
for (Element pageElem : linksOnPage) {
    // Skip links without visible text (for example image-only links)
    if (pageElem.text().trim().isEmpty()) {
        continue;
    }
    String link = pageElem.attr("abs:href");
    // do something with it
}
I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
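The selector above keeps anchors whose text contains at least one non-whitespace character. The plain-Java sketch below applies the same pattern to a few sample link texts, assuming that jsoup's :matches(regex) tests the element text with find-style (partial) matching. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class NonBlankLinkTextSketch {

    // Same pattern as in the selector: one or more non-whitespace characters.
    private static final Pattern NON_WHITESPACE = Pattern.compile("[^\\s]+");

    // True when the text contains at least one non-whitespace character.
    static boolean hasVisibleText(String linkText) {
        return NON_WHITESPACE.matcher(linkText).find();
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleText("Link to Some Page")); // true
        System.out.println(hasVisibleText(""));                  // false (image-only link)
        System.out.println(hasVisibleText("   "));               // false (whitespace only)
    }
}
```

An anchor that wraps only an img element has empty text, so it fails the test and is left out of the selection.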

Number of styled elements in an HTML using JSoup

How can I count the number of all styled elements in an HTML using JSoup?
If the document object is doc, I do not mean this:
doc.select("*[style]")
because this selects only the elements that have style as an attribute. I want the number of elements to which a style has been applied in any way, for example by CSS or from a style block in the header.
You can do it by using the *[style] selector and calling the Elements.size() method, e.g.
final String html = "<html><body><p>test</p><p style=\"color:red\"></p><span>aa</span><span style=\"font-size:10pt\">adasd</span></body></html>";
final Document doc = Jsoup.parse(html);
final int count = doc.select("*[style]").size();
System.out.println("Count = " + count);
Output
Count = 2

Android - how to parse html by jsoup and fill into the arraylist?

I want to read the data from this HTML link:
http://jadvalbaz.blog.ir/post/%D8%B1%D8%A7%D9%87%D9%86%D9%85%D8%A7%DB%8C-%D8%AD%D9%84-%D8%AC%D8%AF%D9%88%D9%84-%D8%AD%D8%B1%D9%81-%D8%B0
If you look at the page source:
ذات اریه (پنومونی- سینه پهلو)ذر (مورچه ریز)ذرع (مقیاس طول)ذره ای بنیادی از رده هیبرونها که بار الکتریکی ندارد (لاندا)ذره منفی اتم (الکترون)ذریه (نسل)ذل (خواری)ذم (نکوهش)ذهاب (رفتن)ذی (صاحب)
My words are separated by <br>. I want to read each word into an ArrayList, that is, how do I skip the <br> tags and read the words?
Here is my code:
Document document = Jsoup.connect(url).get();
for (Element span : document.select("?").select("?")) {
title = span.toString();
name.add(title);
}
How do I read them? What should I put in place of the question marks?
Any suggestions?
Edit the CSS of your template and define a class for your words, then use the Element.select(String selector) and Elements.select(String selector) methods.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element masthead = doc.select("p.words").first(); // p with class=words
Follow the link below for more information about extracting data with these methods:
Use selector-syntax to find elements
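Once a container has been selected, words separated only by <br> tags are separate text nodes of that container, so they can be read one by one. A minimal sketch, assuming a hypothetical <p class="words"> container like the one in the answer; the real page's markup is not shown here, and the class and method names are made up. It assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

final class BrSeparatedWordsSketch {

    // Reads the direct text nodes of the first element matching cssQuery.
    // <br> elements are not text nodes, so they are skipped automatically.
    static List<String> readWords(String html, String cssQuery) {
        Element container = Jsoup.parse(html).select(cssQuery).first();
        List<String> words = new ArrayList<>();
        if (container == null) {
            return words; // nothing matched the selector
        }
        for (TextNode node : container.textNodes()) {
            String word = node.text().trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not taken from the linked page.
        String html = "<p class=\"words\">alpha (one)<br>beta (two)<br>gamma</p>";
        System.out.println(readWords(html, "p.words"));
    }
}
```

On Android the same call works after fetching the page with Jsoup.connect(url).get(), as long as the network request runs off the main thread.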

JSoup - Add onclick function to the anchor href

Existing HTML document:
<a href="http://google.com">Link</a>
I would like to convert it to:
<a href="#" onclick="openFunction('http://google.com')">Link</a>
The JSoup Java library can do a lot of fancy parsing, but I can't find a way to add an attribute as required above. Please help.
To set an attribute, have a look at the docs.
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p><a href=\"http://google.com\">Link</a></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.getElementsByTag("a");
for (Element element : links) {
element.attr("onclick", "openFunction('"+element.attr("href")+"')");
element.attr("href", "#");
}
System.out.println(doc.html());
This will change:
<a href="http://google.com">
into
<a href="#" onclick="openFunction('http://google.com')">
Use Element#attr. I just used a loop but you can do it however you want.
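Document doc = Jsoup.parse("<a href=\"http://google.com\">Link</a>");
for (Element e : doc.getElementsByTag("a")) {
    // Only touch the anchor whose text is "Link"
    if (e.text().equals("Link")) {
        e.attr("onclick", "openFunction('http://google.com')");
        System.out.println(e);
    }
}
Output
<a href="http://google.com" onclick="openFunction('http://google.com')">Link</a>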
