Java JSoup: article extraction with image links and paragraph

Java JSoup: article extraction with image links and paragraph - java

I am currently making an article content extraction application using Jsoup and Java. My problem is when I scrape the article, Jsoup tends to return a list of Element rather than preserves the order of the article. For example, in an normal article with more than 1 image, it could has an order like this: (Title, sapo, image, paragraph, image, paragraph, paragraph, image, paragraph). So how can I scrape the main content of the website (text and image links) without losing its order?
Below is my idea for doing that but it doesn't work.
int cur = 0;
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("div");
for (Element element : elements) {
if (element.select("div[type=\"Photo\"] img").hasAttr("src")) {
Elements temp = element.select("div[type=\"Photo\"] img");
System.out.println(temp.get(cur).attr("src"));
cur++;
}
System.out.println(element.select("p span").text());
System.out.println("");
}

If you wanted to extract the article data from the sites that you linked to in the comment, you could do something like this:
Document doc = Jsoup.connect(url).get();
// Full article
Elements elements = doc.select("div.sidebar-1");
System.out.println("## Article title:");
System.out.println(elements.select("h1.title-detail").text());
System.out.println("## Article summary:");
System.out.println(elements.select("p.description").text());
// Images and paragraphs
for (Element e : elements.select("article.fck_detail p,figure")) {
if (e.is("p")) {
System.out.println("## Paragraph");
System.out.println(e.text());
} else {
System.out.println("## Image (image URL)");
System.out.println(e.select("img[src]").attr("src"));
}
}
The idea is this one:
find the outermost container that contains the full article
extract title and the summary
loop through the image (figure) and paragraph (p) elements of the article - the order will be preserved automatically

Related

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that have some contained text (it doesn't matter what the text is) just needs to be non-empty/image etc.
Example of links I want:
Link to Some Page
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]")
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}

You could do something like this.
It does it's job though it's probably not the fanciest solution out there.
Note: the function text() gets you a clean text so if there are any HTML code fragements inside it, it won't return them.
Document doc = // get the doc
Elements linksOnPage = document.select("a");
for (Element pageElem : linksOnPage){
String link = "";
if(pageElem.text().trim().equals(""))
continue;
// do smth with it
}

I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}

Android - how to parse html by jsoup and fill into the arraylist?

I want to read the date from this HTML link:
http://jadvalbaz.blog.ir/post/%D8%B1%D8%A7%D9%87%D9%86%D9%85%D8%A7%DB%8C-%D8%AD%D9%84-%D8%AC%D8%AF%D9%88%D9%84-%D8%AD%D8%B1%D9%81-%D8%B0
if you look at the view-source
ذات اریه (پنومونی- سینه پهلو)ذر (مورچه ریز)ذرع (مقیاس طول)ذره ای بنیادی از رده هیبرونها که بار الکتریکی ندارد (لاندا)ذره منفی اتم (الکترون)ذریه (نسل)ذل (خواری)ذم (نکوهش)ذهاب (رفتن)ذی (صاحب)
my words are separated by <.br>, I want to read each word to ArrayList, I means how to omit the <.br> and read the words.
here is my code:
Document document = Jsoup.connect(url).get();
for (Element span : document.select("?").select("?")) {
title = span.toString();
name.add(title);
}
How to read them, what to put instead of question mark.
any suggestion?

edit the css of your template and define a class for your words then Use the Element.select(String selector) and Elements.select(String selector) method.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element masthead = doc.select("p.words").first(); // p with class=words
follow below link for more information about extracting data with this methods:
Use selector-syntax to find elements

Real time web crawling using Jsoup

I have this web page https://rrtp.comed.com/pricing-table-today/ and from that I need to get the information about Time (Hour Ending) and Day-Ahead Hourly Price column alone. I tried with the following code,
Document doc = Jsoup.connect("https://rrtp.comed.com/pricing-table-today/").get();
for (Element table : doc.select("table.prices three-col")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
if (tds.size() > 2) {
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}
but unfortunately I am unable to get the data I need.
Is there something wrong in the code..? or This page can't be crawled...?
Need some help

As I said in comment:
You should hit https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717 because it's source from which data is loaded on the page you have pointed to.
Data under this link is not a valid html document (and this is why it's not working for you), but you can easily make it "quite" right.
All you have to do is first get the response and add <table>..</table> tags around it, then it's enough to parse it as html document.
Connection.Response response = Jsoup.connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717").execute();
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");
for (Element element : doc.select("tr")) {
System.out.println(element.html());
}

Extracting heading and paragraphs from doc and docx files using apache-poi

I am trying to read Microsoft word documents via apache-poi and found that there are couple of convenient methods provided to scan through document like getText(), getParagraphList() etc.. But my use case is slightly different and the way we want to scan through any document is, it should give us events/information like heading, paragraph, table in the same sequence as they appear in document. It will help me in preparing a document structure like,
<content>
<section>
<heading> ABC </heading>
<paragraph>xyz </paragraph>
<paragraph>scanning through APIs</paragraph>
<section>
.
.
.
</content>
The main intent is to maintain the relationship between heading and paragraphs as in original document. Not sure but can something like this work for me,
Iterator<IBodyElement> itr = doc.getBodyElementsIterator();
while(itr.hasNext()) {
IBodyElement ele = itr.next();
System.out.println(ele.getElementType());
}
I was able to get the paragraph list but not heading information using this code. Just to mention, I would be interested in all headings, they might be explicitly marked as heading by using style or by using large font size.

Headers aren't stored inline in the main document, they live elsewhere, which is why you're not getting them as body elements. Body elements are things like sections, paragraphs and tables, not headers, so you have to fetch them yourself.
If you look at this code in Apache Tika, you'll see an example of how to do so. Assuming you're iterating over the body elements, and want headers / footers of paragraphs, you'll want code something like this (based on the Tika code):
for(IBodyElement element : bodyElement.getBodyElements()) {
if(element instanceof XWPFParagraph) {
XWPFParagraph paragraph = (XWPFParagraph)element;
XWPFHeaderFooterPolicy headerFooterPolicy = null;
if (paragraph.getCTP().getPPr() != null) {
CTSectPr ctSectPr = paragraph.getCTP().getPPr().getSectPr();
if(ctSectPr != null) {
headerFooterPolicy = new XWPFHeaderFooterPolicy(document, ctSectPr);
// Handle Header
}
}
// Handle paragraph
if (headerFooterPolicy != null) {
// Handle footer
}
}
if(element instanceof XWPFTable) {
XWPFTable table = (XWPFTable)element;
// Handle table
}
if (element instanceof XWPFSDT){
XWPFSDT sdt = (XWPFSDT) element;
// Handle SDT
}
}

jsoup - extract text from wikipedia article

I'm writing some Java code in order to realize NLP tasks upon texts using Wikipedia. How can I use JSoup to extract all the text of a Wikipedia article (for example all the text in http://en.wikipedia.org/wiki/Boston)?

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").get();
Element contentDiv = doc.select("div[id=content]").first();
contentDiv.toString(); // The result
You retrieve formatted content this way, of course. If you want "raw" content you can filter the result with Jsoup.clean or use the call contentDiv.text().

Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
Element firstParagraph = paragraphs.first();
Element lastParagraph = paragraphs.last();
Element p;
int i=1;
p=firstParagraph;
System.out.println(p.text());
while (p!=lastParagraph){
p=paragraphs.get(i);
System.out.println(p.text());
i++;
}

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000);
Element iamcontaningIDofintendedTAG= doc.select("#iamID") ;
System.out.println(iamcontaningIDofintendedTAG.toString());
OR
Elements iamcontaningCLASSofintendedTAG= doc.select(".iamCLASS") ;
System.out.println(iamcontaningCLASSofintendedTAG.toString());

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java JSoup: article extraction with image links and paragraph - java

Related

Use JSoup to get all textual links

Android - how to parse html by jsoup and fill into the arraylist?

Real time web crawling using Jsoup

Extracting heading and paragraphs from doc and docx files using apache-poi

jsoup - extract text from wikipedia article

Categories

Resources