Java Jsoup: Retrieve only the article - java

I am trying to retrieve the text of an article. I want to select all of the text within
<p>... </p>
and I was able to do that.
However, I only want the text from the article body, not from the entire page:
Document article = Jsoup.connect("html doc").get();
Elements paragraphs = article.select("p");
The code above gets the text of every paragraph on the page. I just want the text between
<article itemprop= "articleBody">...</article>
I'm sorry if this is hard to understand; I tried to formulate the
question as best I could.

Elements#text() returns the combined text-only content of all the matched paragraphs (see https://jsoup.org/apidocs/org/jsoup/select/Elements.html for more details).

Try selecting on the itemprop attribute:
// "article" is the Document fetched in the question
for (Element articleBody : article.select("article[itemprop=articleBody]")) {
    System.out.println(articleBody.text());
}
See CSS Selectors for more tips

Related

How to target a specific text field that follows a specific url using Jsoup?

Currently I am trying to scrape a static html page using the Jsoup library in Java. I found a way to get exactly what I want but I'm not sure what to choose for my selector. Before, I was using CSS but the location of the text I want is not the same for every html page.
So I was thinking of using this logic instead: take the text that appears after a specific URL, because the page is laid out like this:
-Topic as a link-
Text field containing information related to Topic.
The HTML looks like this
<A NAME="Topic"></A> <H2> TITLE OF TOPIC </H2>
<PRE><B leftmargin=150 marginwidth=100\>Content that I want to scrape</B></PRE>
I want to scrape everything in "Content that I want to scrape".
Based on your example, it looks like you are trying to get the text from a <PRE> that is placed directly after an <A>. In that case you can use siblingA + siblingB, which finds sibling B only when it is immediately preceded by sibling A (you can find more information about selectors, with examples, in the official tutorial and the Selector documentation).
So in your case doc.select("a+pre").text() should be enough.
You can add more detail, such as a specific URL in the href attribute, as in a[href=#TOPIC LiNK], or the condition that the <A href=...> must itself be preceded by an <A name=...>, as in
doc.select("a[name] + a + pre")
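As a minimal sketch of the adjacent-sibling selector above: the markup below is hypothetical, modelled on the layout the answer describes (a named anchor, then a link, then the <PRE>), and the class and method names are made up for illustration. It parses an inline string, so no network access is needed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SiblingSelectorExample {
    // Returns the text of every <pre> that directly follows
    // an <a name=...> and then an <a> element.
    static String contentAfterTopicLink(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("a[name] + a + pre").text();
    }

    public static void main(String[] args) {
        // Hypothetical markup (an assumption, not the asker's real page).
        String html = "<a name=\"Topic\"></a><a href=\"#Topic\">TITLE OF TOPIC</a>"
                + "<pre><b>Content that I want to scrape</b></pre>";
        System.out.println(contentAfterTopicLink(html));
    }
}
```

If another element such as an <H2> sits between the anchor and the <PRE>, it has to appear in the selector chain as well, because + only matches immediately adjacent element siblings.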

Open Link in HTML with JSOUP

I have a table in an HTML page, and I have to iterate through it and open each link, which leads to a second page where all the information is. On that page I extract the data I need and then return to the first page.
How do I change pages with jsoup in Java? Is it actually possible?
If you look at the jsoup Cookbook, there is an example of getting all the links inside an HTML element. Iterate over the Elements from that example and, for each one, call Document doc = Jsoup.connect(<url from Elements>).get();. You can then call String htmlFromLink = doc.toString(); to get the HTML of the linked page.
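A rough sketch of that approach, under some assumptions: the table markup, the base URI http://example.com/ and the class name are all hypothetical. The link collection runs on an inline string; the actual fetch of each linked page is left as a comment because it needs a live site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TableLinkCollector {
    // Collects the absolute URLs of all links found inside a table.
    static List<String> linksInTable(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("table a[href]")) {
            urls.add(link.absUrl("href")); // resolves relative hrefs against baseUri
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical table; a real page would be loaded with Jsoup.connect(url).get().
        String html = "<table><tr><td><a href=\"/detail/1\">One</a></td></tr>"
                + "<tr><td><a href=\"/detail/2\">Two</a></td></tr></table>";
        for (String url : linksInTable(html, "http://example.com/")) {
            System.out.println(url);
            // On a live site, each detail page could then be fetched with:
            // Document detail = Jsoup.connect(url).get();
        }
    }
}
```

There is no need to "return" to the first page: its Document stays in memory while the loop fetches each linked page into a separate Document.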

Jsoup getting background image path from css

I am looking for all of the images on a given website.
For this purpose I need to find the ones that are referenced in the CSS, for example:
.gk-crop {
background-image: url("../images/style1/g_rss-2.png");
}
Now my question is: how can I get all of these URLs with jsoup?
So far I've tried the following:
Document doc = Jsoup.connect(url).get();
Elements imagePath = doc.select("[src]");
imagePath.select("*[style*='background-image']");
but so far no luck.
Does anyone know how I can achieve it?
Jsoup doesn't parse css files.
Have a look at this to know what Jsoup is responsible for.
You need a separate css parser to extract url from css files. Have a look at this
Just like Niranjan mentioned, jsoup is not meant for parsing CSS but HTML and XML. If you really need to extract images from CSS, you will need a third-party library for that purpose, or you can write a simple regex that grabs the URLs from the CSS file, since it is still plain text. This is not a flexible solution to your problem, but it would be the fastest one :)
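The regex route could look something like the sketch below. It uses only java.util.regex, the class and method names are made up for illustration, and the pattern is a simplification: it handles plain, single-quoted and double-quoted url(...) values but not escaped characters or comments in the CSS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssUrlExtractor {
    // Matches url(...), with an optional matching quote around the value.
    private static final Pattern CSS_URL = Pattern.compile(
            "url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Returns every url(...) value found in the given CSS text, in order.
    static List<String> extractUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            urls.add(m.group(2)); // group 2 is the value without quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = ".gk-crop {\n"
                + "background-image: url(\"../images/style1/g_rss-2.png\");\n"
                + "}";
        System.out.println(extractUrls(css));
    }
}
```

The extracted values are still relative to the location of the stylesheet, so they need to be resolved against the CSS file's own URL before the images can be downloaded.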
If you want to select the URLs of all the images on a website, you can select all the image tags and then get their absolute URLs.
Example:
String url = "http://www.bbc.co.uk";
Document doc = Jsoup.connect(url).get();
Elements titles = doc.select("img");
for (Element e : titles) {
    System.out.println(e.absUrl("src"));
}
which grabs all the <img> elements and prints their absolute URLs, such as
http://sa.bbc.co.uk/bbc/bbc/s?name=SET-COUNTER&pal_route=index&ml_name=barlesque&app_type=web&language=en-GB&ml_version=0.16.1&pal_webapp=wwhp&blq_s=3.5&blq_r=3.5&blq_v=default-worldwide
http://static.bbci.co.uk/frameworks/barlesque/2.50.2/desktop/3.5/img/blq-blocks_grey_alpha.png
http://static.bbci.co.uk/frameworks/barlesque/2.50.2/desktop/3.5/img/blq-search_grey_alpha.png
http://news.bbcimg.co.uk/media/images/69139000/jpg/_69139104_69139103.jpg
http://news.bbcimg.co.uk/media/images/69134000/jpg/_69134575_waynerooney1.jpg
If you only want the .jpg files, tell the selector so by using
Elements titles = doc.select("img[src$=.jpg]");
which matches only the images whose src ends in .jpg.

Parsing XML with Jsoup

I get the following XML which represents a news article:
<content>
Some text blalalala
<h2>Small subtitle</h2>
Some more text blbla
<ul class="list">
<li>List item 1</li>
<li>List item 2</li>
</ul>
<br />
Even more freakin text
</content>
I know the format isn't ideal but for now I have to take it.
The Article should look like:
Some text blalalala
Small subtitle
List with items
Even more freakin text
I parse this XML with Jsoup. I can get the text within the <content> tag with doc.ownText() but then I have no idea where the other stuff (subtitle) is placed, I get only one big String.
Would it be better to use an event based parser for this (I hate them :() or is there a possibility to do something like doc.getTextUntilTagAppears("tagName")?
Edit: For clarification, I know how to get the elements under <content>; my problem is with getting the text within <content>, broken up every time it is interrupted by an element.
I learned that I can get all the text within content with .textNodes(), works great, but then again I don't know where which text node belongs in my article (one at the top before h2, the other one at the bottom).
Jsoup has a fantastic selector based syntax. See here
To get the subtitle, first parse the XML:
Document doc = Jsoup.parse(xml, "", Parser.xmlParser()); // xml is the XML content as a String, not a path
You know that the subtitle is in the h2 element:
Element subtitle = doc.select("h2").first(); // first h2 element that appears
And if you would like the list:
Elements listItems = doc.select("ul.list > li");
for (Element item : listItems) {
    System.out.println(item.text()); // print the list's items one after another
}
The mistake I made was going through the XML by Elements, which do not include TextNodes. When I go through it Node by Node, I can check whether each Node is an Element or a TextNode and treat it accordingly.
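The node-by-node walk described above might look roughly like this. It is a sketch, not the asker's actual code: the class name, the method name and the "tag: text" output format are assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

public class ContentWalker {
    // Walks the direct children of <content> in document order and
    // returns loose text and element text as separate entries.
    static List<String> orderedParts(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.selectFirst("content");
        List<String> parts = new ArrayList<>();
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                // Loose text between elements; skip whitespace-only nodes.
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    parts.add(text);
                }
            } else if (node instanceof Element) {
                Element el = (Element) node;
                String text = el.text();
                if (!text.isEmpty()) { // skips empty elements such as <br />
                    parts.add(el.tagName() + ": " + text);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>Some text<h2>Small subtitle</h2>Some more text</content>";
        for (String part : orderedParts(xml)) {
            System.out.println(part);
        }
    }
}
```

Because childNodes() preserves document order, each text node keeps its position relative to the h2 and ul elements, which is the information that ownText() and textNodes() alone do not give.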

How to read/parse article content from link to string

I need some help.
How do I get the content of articles on websites with Java or on Android?
You can try http://jsoup.org/
Use it to fetch the page from the link and parse the content.
Here is a sample:
String url = "http://inet.detik.com/read/2012/12/12/105558/2116258/796/produktif-kerja-mobile-dengan-samsung-ativ-smart-pc-yang-revolusioner";
Document doc = Jsoup.connect(url).timeout(20000).get();
Elements elements = doc.select("div[class=text_detail]");
if (elements.size() > 0) {
System.out.println(elements.text());
}
The above code just prints out the entire text. If you want a pretty-printed version, you need to handle some HTML tags (such as br) yourself. You can easily visit the HTML tags with jsoup, so spend some time with the documentation and write that code on your own.
