I need some help.
How do I get the content of article websites in Java or on Android?
You can try http://jsoup.org/
Use it to fetch the page from link and parse the content.
Well, here is a sample,
String url = "http://inet.detik.com/read/2012/12/12/105558/2116258/796/produktif-kerja-mobile-dengan-samsung-ativ-smart-pc-yang-revolusioner";
Document doc = Jsoup.connect(url).timeout(20000).get();
Elements elements = doc.select("div[class=text_detail]");
if (elements.size() > 0) {
System.out.println(elements.text());
}
The above code just prints out the entire text. If you want a pretty-printed version, you need to handle some HTML tags (such as br) yourself. You can easily walk the HTML tags with jsoup, so spend some time on the documentation and write that code on your own.
I'm using jsoup to parse all the HTML from this website: news
I can fetch the title and description by selecting the elements I need, but I can't find the video URL element to select. How can I get the video link with jsoup or another library? Thanks!
Maybe I misunderstood your question, but can't you search for <video> elements using JSoup?
A <video> element usually carries its URL in a src attribute, either on the element itself or on a nested <source> element.
Maybe try something like this?
// HTML from your webpage
final var html = "this should hold your HTML";
// Deconstruct into element objects
final var document = Jsoup.parse(html);
// Use CSS to select the first <video> element
final var videoElement = document.select("video").first();
// Grab the video's URL from the "src" attribute; first() returns null when nothing matched
final var src = videoElement != null ? videoElement.attr("src") : "";
Now I did not thoroughly check the website you linked. But some websites insert videos using JavaScript. If this website inserts a video tag after loading, you might be out of luck as Jsoup does not run JavaScript. It only runs on the initial HTML fetched from the page.
Jsoup is an HTML parser, which is why it only parses the HTML the server sends and not, say, HTML generated afterwards by scripts in the browser.
Trying to retrieve the text of the article. I want to select all of the text within
<p>... </p>
I was able to do that.
But I only want to retrieve the text from the article body, not the entire page
Document article = Jsoup.connect("html doc").get();
Elements paragraphs = article.select("p");
The code above gets the entire text from the page. I just want the text between
<article itemprop= "articleBody">...</article>
I'm sorry if this was hard to understand, I tried to formulate the
questions as best I could.
Elements#text() returns the combined text-only content of all the matched paragraphs (see https://jsoup.org/apidocs/org/jsoup/select/Elements.html for more details).
Try selecting on the itemprop attribute, so that only paragraphs inside the article body match:
for (Element paragraph : article.select("article[itemprop=articleBody] p"))
    System.out.println(paragraph.text());
See CSS Selectors for more tips.
I'm trying to work with JSoup to parse an HTML file I have generated through Servlet. From what I have read, I need to declare a Document. When I run the code
Document doc= Jsoup.parse(URL, 10000);
It always times out. If I increase the timeout, it runs until it reaches that time. When I put in Integer.MAX_VALUE, it simply runs forever. I am working in Google Chrome on a MacBook Pro.
My questions are:
Is this just my computer, or am I doing something wrong?
Is there a way to fix this, or an entirely different way to parse the HTML page?
Alternative Solutions
As explained in the jsoup documentation, if you have an accessible URL then you can get its content this way:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
If you have HTML in a string this is how you should parse it:
Document doc = Jsoup.parse(htmlString);
If you have HTML in a local file then:
Document doc = Jsoup.parse(new File("FilePath"), "UTF-8", "http://example.com/");
Your Solution
The way you are using the Jsoup parser is correct, so the problem is probably with the link. If you can provide details about it, we can figure out what is going wrong.
Make sure the HTML generated by your Servlet is actually accessible. If it is, the link you pass should be the URL of that Servlet.
I have a table in an HTML page that I have to iterate through, opening each link to a next page where all the information is. On that page I extract the data I need and then return to my base page.
How do I change pages with the JSoup framework in Java? Is it actually possible?
If you look at the JSoup Cookbook, it has an example of getting all the links inside an HTML element. Iterate over the Elements from that example and, for each one, do a Document doc = Jsoup.connect(<url from Elements>).get();. You can then do String htmlFromLink = doc.toString(); to get the HTML of the linked page.
I am looking for all of the images on a given website.
For this purpose I need to find the ones that are referenced within the CSS, for example:
.gk-crop {
background-image: url("../images/style1/g_rss-2.png");
}
Now my question is: how can I get all of these URLs with JSoup?
So far I've tried the following:
Document doc = Jsoup.connect(url).get();
Elements imagePath = doc.select("[src]");
imagePath.select("*[style*='background-image']");
But so far no luck.
Does anyone know how I can achieve it?
Jsoup doesn't parse css files.
Have a look at this to know what Jsoup is responsible for.
You need a separate css parser to extract url from css files. Have a look at this
Just like Niranjan mentioned, Jsoup is not for parsing CSS but HTML. If you really need to extract some images from CSS, you will need to use a 3rd party library for that purpose OR write a simple regex for grabbing URLs from the CSS file. It's still plain text, isn't it? This is not a flexible solution to your problem, but it would be the fastest one :)
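To make the regex idea concrete, here is a minimal sketch using only the Java standard library. It assumes you already have the stylesheet text in a string and know the URL the stylesheet was loaded from. The class name CssUrlExtractor, the method extractUrls and the example.com address are made up for illustration. The pattern only covers plain url(...) values, so treat it as a quick approximation and not as a CSS parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
class CssUrlExtractor {
    // Matches url(...) with optional single or double quotes around the value.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // cssLocation is the (assumed) absolute URL of the stylesheet itself;
    // relative paths such as "../images/x.png" are resolved against it.
    static List<String> extractUrls(String css, String cssLocation) {
        URI base = URI.create(cssLocation);
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            String raw = m.group(2).trim();
            // Skip empty values and inline data URIs, which are not fetchable links.
            if (raw.isEmpty() || raw.startsWith("data:")) {
                continue;
            }
            // Note: resolve() throws IllegalArgumentException for values that
            // are not valid URIs (for example, ones containing spaces).
            urls.add(base.resolve(raw).toString());
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = ".gk-crop {\n"
                + "  background-image: url(\"../images/style1/g_rss-2.png\");\n"
                + "}";
        for (String url : extractUrls(css, "http://example.com/css/style.css")) {
            System.out.println(url);
        }
    }
}
```

The stylesheet text itself still has to come from somewhere. With jsoup, one option should be to select the link[rel=stylesheet] elements, take their absUrl("href"), and download each one with Jsoup.connect(cssUrl).ignoreContentType(true).execute().body() before passing the result to a helper like this.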
If you want to select the URLs of all the images on a website, you can select all the image tags and then get their absolute URLs.
Example:
String url = "http://www.bbc.co.uk";
Document doc = Jsoup.connect(url).get();
Elements images = doc.select("img");
for (Element e : images) {
    System.out.println(e.absUrl("src"));
}
which will grab all the <img> elements and print their absolute URLs, such as
http://sa.bbc.co.uk/bbc/bbc/s?name=SET-COUNTER&pal_route=index&ml_name=barlesque&app_type=web&language=en-GB&ml_version=0.16.1&pal_webapp=wwhp&blq_s=3.5&blq_r=3.5&blq_v=default-worldwide
http://static.bbci.co.uk/frameworks/barlesque/2.50.2/desktop/3.5/img/blq-blocks_grey_alpha.png
http://static.bbci.co.uk/frameworks/barlesque/2.50.2/desktop/3.5/img/blq-search_grey_alpha.png
http://news.bbcimg.co.uk/media/images/69139000/jpg/_69139104_69139103.jpg
http://news.bbcimg.co.uk/media/images/69134000/jpg/_69134575_waynerooney1.jpg
If you only want the .JPG files, tell the selector that by using
Elements images = doc.select("img[src$=.jpg]");
which results in selecting only the <img> elements whose src ends in .jpg.